I Built My First Neural Network
i built a neural network from scratch. no pytorch. no tensorflow. just numpy and determination.
it can classify handwritten digits with 94% accuracy.
this is my moon landing.
the project
the task: take the mnist dataset (handwritten digits 0-9) and build a classifier from scratch.
the constraint: no ML libraries. just numpy for matrix operations.
the goal: actually understand what's happening instead of just calling .fit().
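one thing the code below assumes: the images are already flattened into 784-long vectors scaled to [0, 1], and the labels are one-hot encoded. a rough sketch of that prep (the prepare name and arguments are mine, not part of the project):

import numpy as np

def prepare(images, labels):
    # flatten each 28x28 image into a 784-long float vector, scaled to [0, 1]
    X = images.reshape(len(images), 784).astype(np.float32) / 255.0
    # one-hot encode the labels: digit 3 becomes [0,0,0,1,0,0,0,0,0,0]
    y = np.zeros((len(labels), 10))
    y[np.arange(len(labels)), labels] = 1.0
    return X, y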
the architecture
super simple:
- input layer: 784 neurons (28x28 pixel images flattened)
- hidden layer: 128 neurons with relu activation
- output layer: 10 neurons with softmax (one per digit)
class NeuralNetwork:
    def __init__(self):
        self.W1 = np.random.randn(784, 128) * 0.01   # input -> hidden weights
        self.b1 = np.zeros((1, 128))
        self.W2 = np.random.randn(128, 10) * 0.01    # hidden -> output weights
        self.b2 = np.zeros((1, 10))
that's it. that's the whole model. two weight matrices, two bias vectors.
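side note: scaling by 0.01 worked fine for these layer sizes, but a more standard recipe for a relu layer is he initialization, which uses a standard deviation of sqrt(2 / fan_in). a variant of the line above, just as a sketch:

        self.W1 = np.random.randn(784, 128) * np.sqrt(2 / 784)   # std of sqrt(2 / fan_in)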
the forward pass
def forward(self, X):
    self.z1 = X @ self.W1 + self.b1         # linear
    self.a1 = np.maximum(0, self.z1)        # relu
    self.z2 = self.a1 @ self.W2 + self.b2   # linear
    self.a2 = self.softmax(self.z2)         # softmax
    return self.a2
four lines. matrix multiply, activation, repeat. the magic is just math.
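forward calls self.softmax, which i didn't paste above. a minimal, numerically stable version looks something like this (a sketch that lives on the same class):

def softmax(self, z):
    # subtract the row-wise max so exp() doesn't overflow
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)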
the backward pass
this is where my brain melted. multiple times.
def backward(self, X, y, learning_rate):
    m = X.shape[0]

    # output layer gradients (y is one-hot, shape (m, 10))
    dz2 = self.a2 - y                              # softmax + cross-entropy gradient
    dW2 = (self.a1.T @ dz2) / m
    db2 = np.sum(dz2, axis=0, keepdims=True) / m

    # hidden layer gradients
    da1 = dz2 @ self.W2.T
    dz1 = da1 * (self.z1 > 0)                      # relu gradient
    dW1 = (X.T @ dz1) / m
    db1 = np.sum(dz1, axis=0, keepdims=True) / m

    # update weights
    self.W1 -= learning_rate * dW1
    self.b1 -= learning_rate * db1
    self.W2 -= learning_rate * dW2
    self.b2 -= learning_rate * db2
chain rule. all the way down. took me three days to get this right.
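if you get stuck on your own backward pass, one sanity check that helps is comparing the analytic gradient against a finite-difference estimate. a rough sketch (the grad_check name and the choice of entry (0, 0) are mine), checking one entry of W2:

def grad_check(model, X, y, eps=1e-5):
    # analytic gradient for one entry of W2, using the same formula as backward()
    model.forward(X)
    m = X.shape[0]
    dz2 = model.a2 - y
    dW2 = (model.a1.T @ dz2) / m

    # numerical gradient for the same entry via central differences
    def loss():
        p = model.forward(X)
        return -np.sum(y * np.log(p + 1e-9)) / m

    i, j = 0, 0
    orig = model.W2[i, j]
    model.W2[i, j] = orig + eps
    loss_plus = loss()
    model.W2[i, j] = orig - eps
    loss_minus = loss()
    model.W2[i, j] = orig
    numerical = (loss_plus - loss_minus) / (2 * eps)

    print(f"analytic: {dW2[i, j]:.8f}  numerical: {numerical:.8f}")

if the two numbers agree to several decimal places, the backprop math is almost certainly right.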
the training loop
for epoch in range(100):
    predictions = model.forward(X_train)
    loss = cross_entropy_loss(predictions, y_train)
    model.backward(X_train, y_train, learning_rate=0.1)
    print(f"epoch {epoch}, loss: {loss:.4f}")
watch loss go down. feel happy.
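the loop calls cross_entropy_loss, which i haven't shown. a minimal version, assuming y is one-hot:

def cross_entropy_loss(predictions, y):
    # average negative log-likelihood of the correct class
    # the small epsilon keeps log() away from zero
    return -np.sum(y * np.log(predictions + 1e-9)) / y.shape[0]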
what i learned
- backprop is just calculus - chain rule, repeatedly applied. that's it.
- initialization matters - random weights too big or too small = training fails.
- learning rate is tricky - too high: diverges. too low: takes forever.
- batch training helps - mini-batches are faster and more stable than full-batch (see the sketch after this list).
- debugging is pain - when it doesn't work, you don't get good error messages. just bad predictions.
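for the mini-batch point, here's roughly what that version of the training loop looks like - a sketch, not my exact code, and the batch size of 64 is arbitrary:

for epoch in range(100):
    # reshuffle once per epoch so the batches differ between epochs
    perm = np.random.permutation(len(X_train))
    X_shuf, y_shuf = X_train[perm], y_train[perm]

    for start in range(0, len(X_shuf), 64):
        X_batch = X_shuf[start:start + 64]
        y_batch = y_shuf[start:start + 64]
        model.forward(X_batch)
        model.backward(X_batch, y_batch, learning_rate=0.1)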
the result
after 100 epochs: 94% accuracy on test data. not state-of-the-art, but good enough for a from-scratch implementation.
watching it correctly classify digits that i hand-drew myself was magical.
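for reference, accuracy here is just the argmax of the softmax outputs compared against the true labels - something like this (the X_test / y_test names are assumed, with y_test one-hot like the training labels):

probs = model.forward(X_test)
predicted_digits = np.argmax(probs, axis=1)   # index of the largest softmax output
true_digits = np.argmax(y_test, axis=1)       # undo the one-hot encoding
accuracy = np.mean(predicted_digits == true_digits)
print(f"test accuracy: {accuracy:.2%}")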
next steps
- try different architectures
- implement convolutional layers (much harder)
- eventually: understand and implement transformers
submitted some hand-drawn 7s. it got them all right. i am emotionally attached to this model now.