mlprep
ML Coding · medium · 20 min

Implement k-nearest neighbors classification from scratch in NumPy. Focus on making the distance computation vectorized and efficient — no for loops over data points. Then discuss: what's the computational bottleneck and how would you address it at scale?

formulate your answer, then —
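One shape the first answer might take, a minimal NumPy sketch (knn_predict is an illustrative name, not a reference solution; note the vote is still a Python loop):

import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs
    # at once via a single matrix multiply -- no loop over data points.
    train_sq = np.sum(X_train ** 2, axis=1)        # (n_train,)
    test_sq = np.sum(X_test ** 2, axis=1)          # (n_test,)
    cross = X_test @ X_train.T                     # (n_test, n_train)
    dists = test_sq[:, None] + train_sq[None, :] - 2 * cross

    # argpartition is O(n) per row; we only need the k smallest, unordered.
    nearest = np.argpartition(dists, k, axis=1)[:, :k]   # (n_test, k)

    # Majority vote -- still a Python loop over test points.
    preds = np.empty(len(X_test), dtype=y_train.dtype)
    for i, idx in enumerate(nearest):
        labels, counts = np.unique(y_train[idx], return_counts=True)
        preds[i] = labels[np.argmax(counts)]
    return preds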

Good. Now: your argpartition finds the k nearest neighbors, but then you do a Python loop for majority voting. Can you vectorize the majority vote too?

formulate your answer, then —
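One way to vectorize the vote, assuming integer class labels in [0, n_classes): one-hot encode the neighbor labels, then sum over the k axis so that votes[i, c] counts the neighbors of test point i with class c.

import numpy as np

def majority_vote(neighbor_labels, n_classes):
    # neighbor_labels: (n_test, k) integer class ids.
    one_hot = np.eye(n_classes, dtype=np.int64)[neighbor_labels]  # (n_test, k, n_classes)
    votes = one_hot.sum(axis=1)                                   # (n_test, n_classes)
    return votes.argmax(axis=1)

Plugged into the earlier sketch, the loop becomes preds = majority_vote(y_train[nearest], n_classes); ties break toward the lowest class id.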

tldr

Vectorized KNN uses the identity ||a-b||² = ||a||² + ||b||² - 2a·b to compute all pairwise distances via a single matrix multiplication. Use argpartition (O(n) per query) instead of a full sort (O(n log n)) to find the k nearest. At scale, brute-force KNN breaks down: every query costs O(nd) distance computations against the full training set, so switch to approximate nearest neighbor indexes (FAISS, ScaNN) for sub-linear query time. Vectorize the majority vote with one-hot encoding and a sum over the k axis.
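For the scale discussion, a sketch of the approximate route using FAISS, assuming the faiss-cpu package is installed (dimensions, cell count, and nprobe here are illustrative, not tuned):

import numpy as np
import faiss  # pip install faiss-cpu

d = 128
xb = np.random.rand(1_000_000, d).astype('float32')  # training vectors
xq = np.random.rand(10, d).astype('float32')         # query vectors

# IVF index: cluster the data, then search only the nprobe nearest
# clusters, trading a little recall for far fewer distance computations.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024)  # 1024 Voronoi cells
index.train(xb)
index.add(xb)
index.nprobe = 16
D, I = index.search(xq, 5)  # distances and ids of the 5 nearest per query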

follow-up

  • How would you adapt this implementation to handle weighted features — where some dimensions matter more than others?
  • Explain the curse of dimensionality for KNN — why does KNN break down in high dimensions, and at what dimension count does this become a problem in practice?
  • How would you implement an online KNN that can add new training points without recomputing all distances?