I Built My First Neural Network
i built a neural network from scratch. no pytorch. no tensorflow. just numpy and determination.
it can classify handwritten digits with 94% accuracy.
this is my moon landing.
the project
the task: take the mnist dataset (handwritten digits 0-9) and build a classifier from scratch.
the constraint: no ML libraries. just numpy for matrix operations.
the goal: actually understand what's happening instead of just calling .fit().
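one thing the code below assumes: the images are already flattened into 784-long vectors scaled to [0, 1], and the labels are one-hot encoded. a rough sketch of that prep (the prepare name and arguments are mine, not part of the project):

import numpy as np

def prepare(images, labels):
    # flatten each 28x28 image into a 784-long float vector, scaled to [0, 1]
    X = images.reshape(len(images), 784).astype(np.float32) / 255.0
    # one-hot encode the labels: digit 3 becomes [0,0,0,1,0,0,0,0,0,0]
    y = np.zeros((len(labels), 10))
    y[np.arange(len(labels)), labels] = 1.0
    return X, y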
the architecture
super simple:
- input layer: 784 neurons (28x28 pixel images flattened)
- hidden layer: 128 neurons with relu activation
- output layer: 10 neurons with softmax (one per digit)
class NeuralNetwork:
    def __init__(self):
        self.W1 = np.random.randn(784, 128) * 0.01   # input -> hidden weights
        self.b1 = np.zeros((1, 128))
        self.W2 = np.random.randn(128, 10) * 0.01    # hidden -> output weights
        self.b2 = np.zeros((1, 10))
that's it. that's the whole model. two weight matrices, two bias vectors.
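side note: scaling by 0.01 worked fine for these layer sizes, but a more standard recipe for a relu layer is he initialization, which uses a standard deviation of sqrt(2 / fan_in). a variant of the line above, just as a sketch:

        self.W1 = np.random.randn(784, 128) * np.sqrt(2 / 784)   # std of sqrt(2 / fan_in)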
the forward pass
def forward(self, X):
    self.z1 = X @ self.W1 + self.b1         # linear
    self.a1 = np.maximum(0, self.z1)        # relu
    self.z2 = self.a1 @ self.W2 + self.b2   # linear
    self.a2 = self.softmax(self.z2)         # softmax
    return self.a2
four lines. matrix multiply, activation, repeat. the magic is just math.
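forward calls self.softmax, which i didn't paste above. a minimal, numerically stable version looks something like this (a sketch that lives on the same class):

def softmax(self, z):
    # subtract the row-wise max so exp() doesn't overflow
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)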
the backward pass
this is where my brain melted. multiple times.
def backward(self, X, y, learning_rate):
    m = X.shape[0]

    # output layer gradients (y is one-hot, shape (m, 10))
    dz2 = self.a2 - y                              # softmax + cross-entropy gradient
    dW2 = (self.a1.T @ dz2) / m
    db2 = np.sum(dz2, axis=0, keepdims=True) / m

    # hidden layer gradients
    da1 = dz2 @ self.W2.T
    dz1 = da1 * (self.z1 > 0)                      # relu gradient
    dW1 = (X.T @ dz1) / m
    db1 = np.sum(dz1, axis=0, keepdims=True) / m

    # update weights
    self.W1 -= learning_rate * dW1
    self.b1 -= learning_rate * db1
    self.W2 -= learning_rate * dW2
    self.b2 -= learning_rate * db2
chain rule. all the way down. took me three days to get this right.
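if you get stuck on your own backward pass, one sanity check that helps is comparing the analytic gradient against a finite-difference estimate. a rough sketch (the grad_check name and the choice of entry (0, 0) are mine), checking one entry of W2:

def grad_check(model, X, y, eps=1e-5):
    # analytic gradient for one entry of W2, using the same formula as backward()
    model.forward(X)
    m = X.shape[0]
    dz2 = model.a2 - y
    dW2 = (model.a1.T @ dz2) / m

    # numerical gradient for the same entry via central differences
    def loss():
        p = model.forward(X)
        return -np.sum(y * np.log(p + 1e-9)) / m

    i, j = 0, 0
    orig = model.W2[i, j]
    model.W2[i, j] = orig + eps
    loss_plus = loss()
    model.W2[i, j] = orig - eps
    loss_minus = loss()
    model.W2[i, j] = orig
    numerical = (loss_plus - loss_minus) / (2 * eps)

    print(f"analytic: {dW2[i, j]:.8f}  numerical: {numerical:.8f}")

if the two numbers agree to several decimal places, the backprop math is almost certainly right.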
the training loop
for epoch in range(100):
    predictions = model.forward(X_train)
    loss = cross_entropy_loss(predictions, y_train)
    model.backward(X_train, y_train, learning_rate=0.1)
    print(f"epoch {epoch}, loss: {loss:.4f}")
watch loss go down. feel happy.
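the loop calls cross_entropy_loss, which i haven't shown. a minimal version, assuming y is one-hot:

def cross_entropy_loss(predictions, y):
    # average negative log-likelihood of the correct class
    # the small epsilon keeps log() away from zero
    return -np.sum(y * np.log(predictions + 1e-9)) / y.shape[0]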
what i learned
- backprop is just calculus - chain rule, repeatedly applied. that's it.
- initialization matters - random weights too big or too small = training fails.
- learning rate is tricky - too high: diverges. too low: takes forever.
- batch training helps - mini-batches are faster and more stable than full-batch (see the sketch after this list).
- debugging is pain - when it doesn't work, you don't get good error messages. just bad predictions.
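for the mini-batch point, here's roughly what that version of the training loop looks like - a sketch, not my exact code, and the batch size of 64 is arbitrary:

for epoch in range(100):
    # reshuffle once per epoch so the batches differ between epochs
    perm = np.random.permutation(len(X_train))
    X_shuf, y_shuf = X_train[perm], y_train[perm]

    for start in range(0, len(X_shuf), 64):
        X_batch = X_shuf[start:start + 64]
        y_batch = y_shuf[start:start + 64]
        model.forward(X_batch)
        model.backward(X_batch, y_batch, learning_rate=0.1)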
the result
after 100 epochs: 94% accuracy on test data. not state-of-the-art, but good enough for a from-scratch implementation.
watching it correctly classify digits that i hand-drew myself was magical.
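for reference, accuracy here is just the argmax of the softmax outputs compared against the true labels - something like this (the X_test / y_test names are assumed, with y_test one-hot like the training labels):

probs = model.forward(X_test)
predicted_digits = np.argmax(probs, axis=1)   # index of the largest softmax output
true_digits = np.argmax(y_test, axis=1)       # undo the one-hot encoding
accuracy = np.mean(predicted_digits == true_digits)
print(f"test accuracy: {accuracy:.2%}")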
next steps
- try different architectures
- implement convolutional layers (much harder)
- eventually: understand and implement transformers
submitted some hand-drawn 7s. it got them all right. i am emotionally attached to this model now.