mlprep
mlprep / ML Coding · easy · 15 min

Implement gradient descent for linear regression from scratch in NumPy. No sklearn, no autograd. Compute the gradient manually, implement the update rule, and show it converging. Then extend to mini-batch SGD.
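One possible sketch, assuming MSE loss and synthetic data — the `w_true` values, learning rate, step counts, and batch size below are illustrative choices, not part of the prompt:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X @ w_true + noise (all values illustrative)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

def mse(X, y, w):
    r = X @ w - y
    return (r @ r) / len(y)

def grad(Xb, yb, w):
    # Gradient of (1/B)||Xb w - yb||^2 is (2/B) Xb^T (Xb w - yb)
    return (2 / len(yb)) * Xb.T @ (Xb @ w - yb)

# Full-batch gradient descent
w = np.zeros(d)
lr = 0.1
for step in range(200):
    w -= lr * grad(X, y, w)
print("batch GD loss:", mse(X, y, w))  # should approach the noise floor (~0.01)

# Mini-batch SGD: noisy gradient estimates from random subsets
w = np.zeros(d)
batch = 32
for step in range(500):
    idx = rng.choice(n, size=batch, replace=False)
    w -= lr * grad(X[idx], y[idx], w)
print("mini-batch SGD loss:", mse(X, y, w))
```

With a constant learning rate, mini-batch SGD doesn't converge exactly — it settles into a noise ball around the optimum whose size scales with the learning rate and the gradient noise.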

formulate your answer, then —

You used a constant learning rate. Implement learning rate decay and momentum — and explain why momentum helps SGD converge faster.
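A sketch of both extensions on the same synthetic setup; the 1/t decay schedule and β = 0.9 below are common illustrative defaults, not the only options:

```python
import numpy as np

rng = np.random.default_rng(1)

# Same synthetic regression setup (illustrative values)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(Xb, yb, w):
    # (2/B) Xb^T (Xb w - yb) for a mini-batch of size B
    return (2 / len(yb)) * Xb.T @ (Xb @ w - yb)

w = np.zeros(d)
v = np.zeros(d)  # velocity: exponential accumulation of past gradients
lr0, beta, decay, batch = 0.1, 0.9, 0.01, 32

for step in range(500):
    lr = lr0 / (1 + decay * step)  # 1/t-style decay: big steps early, fine-tuning later
    idx = rng.choice(n, size=batch, replace=False)
    g = grad(X[idx], y[idx], w)
    # Momentum: gradient components that agree step after step accumulate
    # in v, while components that flip sign (mini-batch noise, oscillation
    # across a narrow valley) largely cancel.
    v = beta * v + g
    w -= lr * v
```

The decay schedule shrinks the noise ball over time, and the velocity term lets consistent descent directions build up an effective step size of roughly 1/(1 − β) times the base rate.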

formulate your answer, then —

tldr

Gradient descent for linear regression: gradient is (2/n)Xᵀ(Xw - y). Mini-batch SGD samples random subsets for noisy but faster updates. Batch size trades off gradient noise (helps escape bad minima) vs. stability (smooth convergence). Momentum accumulates velocity across steps, acting as a low-pass filter that amplifies consistent gradient signal and dampens oscillating noise. Learning rate decay lets you take large steps early and fine-tune later.

follow-up

  • Implement Adam optimizer from scratch — what does it do differently from SGD with momentum, and why is it the default for deep learning?
  • Your gradient descent converges to a loss of 0.5 but the closed-form solution achieves 0.3. What went wrong?
  • Explain why feature scaling (normalization) is critical for gradient descent but irrelevant for the closed-form solution.
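For the first follow-up, a minimal sketch of the Adam update (using the defaults from Kingma & Ba's paper); the scalar demo objective is made up purely for illustration:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. Unlike SGD with momentum, Adam also tracks a second
    moment v (an EMA of squared gradients) and divides by its square root,
    so each parameter gets its own adaptive step size of roughly lr,
    regardless of the raw gradient scale."""
    m = b1 * m + (1 - b1) * g        # first moment: momentum
    v = b2 * v + (1 - b2) * g * g    # second moment: uncentered variance
    m_hat = m / (1 - b1 ** t)        # bias correction (m, v start at 0)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy demo: minimize (w - 3)^2 starting from w = 0
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):             # t starts at 1 for the bias correction
    g = 2 * (w - 3.0)                # gradient of the toy objective
    w, m, v = adam_step(w, g, m, v, t, lr=0.1)
```

The per-parameter scaling is much of why Adam is the common default in deep learning: layers with very different gradient magnitudes all take sensibly sized steps without per-layer learning-rate tuning.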