What are embeddings and how are they learned? Why do similar things end up close together in embedding space?
You mentioned the skip-gram objective — what is the model actually optimizing, and what's the practical challenge with the vocabulary-sized softmax?
You've deployed an embedding model to production powering a recommendation system. What breaks over time, and how do you manage the full lifecycle of embeddings in production?
tldr
Embeddings map discrete objects to dense vectors by training on a distributional objective: similar objects appear in similar contexts, so they end up with similar representations. Word2vec's skip-gram predicts the surrounding context words from a center word; the embedding matrix is a side effect of solving that prediction task. A full-vocabulary softmax is too slow at scale, so negative sampling replaces it with binary classification of true (center, context) pairs against a few random noise words (sketch below).

In production: embeddings drift as the data distribution shifts, so monitor pairwise similarity on a fixed probe set and ANN retrieval quality. Retraining invalidates all existing embeddings (old and new vectors are not comparable), so the catalog must be re-embedded and the index swapped atomically, with a rollback window. HNSW indexes carry memory and rebuild costs that dominate at 100M+ items. Collapse (all embeddings nearly identical) is detected by tracking probe-pair cosine similarity and fixed with hard negative mining and temperature tuning.
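A minimal sketch of the skip-gram negative-sampling objective in PyTorch (the class name, dimensions, and batch shapes here are illustrative assumptions, not from the source). Each step scores one observed (center, context) pair against K words drawn from a noise distribution as a binary classification, so the per-step cost is O(K·d) rather than the O(|V|·d) of a full softmax:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    """Skip-gram with negative sampling: score real (center, context) pairs
    against random noise words instead of normalizing over the whole vocabulary."""

    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)   # center-word vectors (the embeddings you keep)
        self.out_embed = nn.Embedding(vocab_size, dim)  # context-word vectors

    def forward(self, center, context, negatives):
        # center: (B,) ids, context: (B,) ids, negatives: (B, K) ids from the noise distribution
        v = self.in_embed(center)                                   # (B, d)
        u_pos = self.out_embed(context)                             # (B, d)
        u_neg = self.out_embed(negatives)                           # (B, K, d)

        pos_score = (v * u_pos).sum(dim=-1)                         # (B,)
        neg_score = torch.bmm(u_neg, v.unsqueeze(-1)).squeeze(-1)   # (B, K)

        # maximize log sigma(v·u_pos) + sum_k log sigma(-v·u_neg_k)
        loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(dim=-1))
        return loss.mean()
```

In word2vec the noise words are drawn from the unigram distribution raised to the 3/4 power, which upweights rare words relative to raw frequency.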
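A drift/collapse probe, assuming embeddings arrive as a NumPy matrix of shape (n, d); the pair count and alert threshold are illustrative choices, not values from the source. A healthy space keeps the mean cosine over random probe pairs well below 1.0; a steady climb toward 1.0 is the collapse signature:

```python
import numpy as np

def mean_probe_cosine(embeddings: np.ndarray, n_pairs: int = 10_000, seed: int = 0) -> float:
    """Mean cosine similarity over random probe pairs; track this per training
    run and per refresh to catch collapse or drift early."""
    rng = np.random.default_rng(seed)
    normed = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-12)
    i = rng.integers(0, len(normed), n_pairs)
    j = rng.integers(0, len(normed), n_pairs)
    return float((normed[i] * normed[j]).sum(axis=1).mean())

# Illustrative alert rule: flag the run if the probe statistic jumps well above
# its historical baseline (the 0.9 threshold is an assumption, tune on your data).
if mean_probe_cosine(np.random.randn(5_000, 128)) > 0.9:
    raise RuntimeError("embedding space may be collapsing")
```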
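A sketch of the atomic index swap with a rollback window, using hnswlib as one concrete HNSW implementation; the wrapper class, overlap check, and all parameters (M, ef_construction, min_overlap) are assumptions for illustration. Because retraining changes the whole space, the old and new indexes are each queried with probe items embedded by their own model, and only the returned IDs are compared:

```python
import hnswlib
import numpy as np

class ServingIndex:
    """Blue/green ANN serving: queries always hit `live`; a rebuilt index replaces
    it only after a retrieval-overlap check, and the old index is kept for rollback."""

    def __init__(self, dim: int):
        self.dim = dim
        self.live = None       # currently serving HNSW index
        self.previous = None   # rollback target from the last swap

    def _build(self, vectors: np.ndarray, ids: np.ndarray) -> hnswlib.Index:
        index = hnswlib.Index(space="cosine", dim=self.dim)
        index.init_index(max_elements=len(vectors), ef_construction=200, M=16)
        index.add_items(vectors, ids)
        index.set_ef(64)
        return index

    def swap(self, vectors, ids, probe_old, probe_new, k=10, min_overlap=0.6):
        """probe_old / probe_new: the same probe items embedded by the old and new
        models respectively; vectors from different trainings are not comparable."""
        candidate = self._build(vectors, ids)
        if self.live is not None:
            old_ids, _ = self.live.knn_query(probe_old, k=k)
            new_ids, _ = candidate.knn_query(probe_new, k=k)
            overlap = np.mean([len(set(a) & set(b)) / k
                               for a, b in zip(old_ids, new_ids)])
            # Some churn after retraining is expected; near-zero overlap usually
            # means a training or id-mapping bug, so refuse to promote the index.
            if overlap < min_overlap:
                raise RuntimeError(f"retrieval overlap {overlap:.2f} < {min_overlap}")
        self.previous, self.live = self.live, candidate   # single-reference swap
        return candidate

    def rollback(self):
        # Restore the last serving index within the rollback window.
        if self.previous is None:
            raise RuntimeError("no previous index to roll back to")
        self.live, self.previous = self.previous, self.live
```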
follow-up
- How does the transformer's embedding layer differ from word2vec, and why do contextual embeddings (BERT) outperform static ones?
- How would you design an embedding system for a cold-start problem, where new items have no interaction history?
- What's the difference between collaborative filtering embeddings and content-based embeddings, and when would you combine them?
- How do you detect and measure embedding collapse during training, and what does the uniformity-alignment framework tell you about embedding quality?