mlprep
ML Coding · hard · 25 min

Implement Word2Vec skip-gram with negative sampling from scratch in NumPy. Explain what each part is doing, why negative sampling is necessary, and what the resulting embeddings capture.

formulate your answer, then —

Your implementation processes one word pair at a time. In practice, Word2Vec trains on billions of tokens. How would you make this significantly faster without changing the algorithm?

formulate your answer, then —

tldr

Word2Vec skip-gram learns embeddings by predicting context words from center words. Negative sampling replaces the expensive full-vocabulary softmax with a binary classifier (real context vs. random noise), so each update touches only the sampled words. Separate center and context embedding matrices prevent the degenerate solution where every word is its own best context. Raising unigram frequencies to the 0.75 power when drawing negatives boosts rare words and improves the learned vectors. At scale: pre-sample a negative-sampling table, subsample frequent words, vectorize the updates, and use lock-free parallel training (Hogwild!).
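
To make the tldr concrete, here is a minimal NumPy sketch under a few assumptions: the corpus is already tokenized into integer word ids, the context window is shrunk to adjacent pairs for brevity, and helper names like `build_unigram_table`, `subsample_mask`, and `sgns_step` are illustrative rather than from any library. The batched `np.add.at` updates stand in for the per-thread lock-free updates (Hogwild!) the reference implementation uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_unigram_table(counts, table_size=1_000_000, power=0.75):
    """Pre-sample a large table of word ids so negatives can be drawn in O(1).
    Raising counts to the 0.75 power flattens the distribution, so rare words
    appear as negatives more often than their raw frequency would allow."""
    probs = counts ** power
    probs /= probs.sum()
    return rng.choice(len(counts), size=table_size, p=probs)

def subsample_mask(word_ids, counts, t=1e-5):
    """Randomly drop very frequent tokens ("the", "of", ...) before training,
    using the Word2Vec keep probability sqrt(t/f) + t/f."""
    freqs = counts / counts.sum()
    f = freqs[word_ids]
    keep_prob = np.minimum(1.0, np.sqrt(t / f) + t / f)
    return rng.random(len(word_ids)) < keep_prob

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, centers, contexts, table, k=5, lr=0.025):
    """One vectorized update on a whole batch of (center, context) pairs.
    W_in  (V, d): center embeddings -- the vectors you usually keep.
    W_out (V, d): context embeddings -- the second matrix avoids the trivial
                  solution where every word just predicts itself."""
    B, d = len(centers), W_in.shape[1]
    negatives = table[rng.integers(0, len(table), size=(B, k))]  # (B, k) noise ids

    v_c   = W_in[centers]     # (B, d)
    u_pos = W_out[contexts]   # (B, d)
    u_neg = W_out[negatives]  # (B, k, d)

    # Binary logistic objective: score real pairs toward 1, noise pairs toward 0.
    pos_score = sigmoid(np.sum(v_c * u_pos, axis=1))           # (B,)
    neg_score = sigmoid(np.einsum('bd,bkd->bk', v_c, u_neg))   # (B, k)

    # Gradients of -log sigma(v.u_pos) - sum_k log sigma(-v.u_neg_k).
    g_pos = (pos_score - 1.0)[:, None]                          # (B, 1)
    grad_vc   = g_pos * u_pos + np.einsum('bk,bkd->bd', neg_score, u_neg)
    grad_upos = g_pos * v_c
    grad_uneg = neg_score[:, :, None] * v_c[:, None, :]         # (B, k, d)

    # Scatter-add so repeated word ids within a batch accumulate correctly.
    np.add.at(W_in,  centers,           -lr * grad_vc)
    np.add.at(W_out, contexts,          -lr * grad_upos)
    np.add.at(W_out, negatives.ravel(), -lr * grad_uneg.reshape(-1, d))

    return float(-np.log(pos_score + 1e-10).mean()
                 - np.log(1.0 - neg_score + 1e-10).sum(axis=1).mean())

# Toy usage: a random "corpus" of integer word ids over a 50-word vocabulary.
V, d = 50, 32
corpus = rng.integers(0, V, size=10_000)
counts = np.bincount(corpus, minlength=V).astype(np.float64)
table  = build_unigram_table(counts, table_size=100_000)
corpus = corpus[subsample_mask(corpus, counts, t=1e-3)]
W_in, W_out = (rng.random((V, d)) - 0.5) / d, np.zeros((V, d))
centers, contexts = corpus[:-1], corpus[1:]   # window of 1 for brevity
print(sgns_step(W_in, W_out, centers, contexts, table))
```

The batching here is also the answer to the speed question: drawing negatives from a pre-built table, subsampling frequent words, and replacing the per-pair Python loop with array operations over many pairs at once leaves the algorithm unchanged while removing most of the interpreter overhead.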

follow-up

  • What's the relationship between Word2Vec and matrix factorization of the PMI (pointwise mutual information) matrix? What does this equivalence tell you? (a sketch of the identity follows this list)
  • How would you evaluate whether your learned embeddings are good — what benchmarks and intrinsic/extrinsic evaluations would you use?
  • Modern language models (BERT, GPT) produce contextual embeddings. What do they capture that Word2Vec cannot, and when would you still prefer Word2Vec?
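
On the first follow-up: Levy and Goldberg (2014) showed that skip-gram with k negative samples implicitly factorizes a shifted PMI matrix. A sketch of the identity, with notation assumed here (w a center vector, c a context vector):

$$\vec{w} \cdot \vec{c} \;=\; \mathrm{PMI}(w, c) - \log k \;=\; \log \frac{P(w, c)}{P(w)\,P(c)} - \log k$$

The takeaway is that Word2Vec amounts to an implicit, weighted factorization of corpus co-occurrence statistics, which is why count-based methods with the right weighting can match its embeddings.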