mlprep
ML Breadth · medium · 18 min

What are embeddings and how are they learned? Why do similar things end up close together in embedding space?

formulate your answer, then —

You mentioned the skip-gram objective — what is the model actually optimizing, and what's the practical challenge with the vocabulary-sized softmax?

formulate your answer, then —

You've deployed an embedding model to production powering a recommendation system. What breaks over time, and how do you manage the full lifecycle of embeddings in production?

formulate your answer, then —

tldr

  • Embeddings map discrete objects to dense vectors by training on a distributional objective: similar objects appear in similar contexts, so they learn similar representations.
  • Word2vec's skip-gram predicts surrounding words from a center word; the embedding matrix is a side effect of solving that prediction task.
  • A full-vocabulary softmax is too slow at scale; negative sampling replaces it with a binary classification of the true context word against k random noise samples.
  • In production, embeddings drift as the data distribution shifts; monitor pairwise similarity and ANN retrieval quality.
  • Retraining produces a new, incompatible vector space, invalidating all existing embeddings; atomic index swaps with rollback windows are required.
  • HNSW indexes have memory and rebuild costs that dominate at 100M+ items.
  • Collapse (all embeddings nearly identical) is detected by tracking probe-pair cosine similarity and mitigated with hard negative mining and temperature tuning.
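The negative-sampling point in the tldr can be made concrete: instead of normalizing over the whole vocabulary, each (center, context) pair becomes a binary classification against k randomly drawn noise words, dropping the per-step cost from O(V·d) to O(k·d). A minimal NumPy sketch; the dimensions, random counts, and parameter values here are illustrative assumptions, not word2vec's actual training setup:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 64, 5                # vocab size, embedding dim, negatives per pair (toy values)

W_in = rng.normal(0, 0.1, (V, d))    # center-word ("input") embeddings
W_out = rng.normal(0, 0.1, (V, d))   # context-word ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, context, noise_dist):
    """Skip-gram with negative sampling: one positive pair vs. k noise words."""
    v_c = W_in[center]                            # center embedding
    u_pos = W_out[context]                        # true context embedding
    negs = rng.choice(V, size=k, p=noise_dist)    # k negative samples from the noise distribution
    u_neg = W_out[negs]
    # maximize log sigma(u_pos . v_c) + sum_k log sigma(-u_neg . v_c);
    # return the negative so it can be minimized
    pos_term = np.log(sigmoid(u_pos @ v_c))
    neg_term = np.log(sigmoid(-(u_neg @ v_c))).sum()
    return -(pos_term + neg_term)

# unigram^(3/4) noise distribution, as used by word2vec (counts here are random stand-ins)
counts = rng.integers(1, 100, size=V).astype(float)
noise_dist = counts ** 0.75
noise_dist /= noise_dist.sum()

loss = neg_sampling_loss(center=42, context=7, noise_dist=noise_dist)
```

Only k+1 output vectors are touched per training pair, which is what makes vocabulary size effectively irrelevant to step cost.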
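The collapse-monitoring idea in the tldr amounts to a cheap probe: track the mean pairwise cosine similarity over a fixed random sample of embeddings, and alert when it drifts toward 1. A hedged sketch; the probe size and the healthy/collapsed thresholds are illustrative choices, not established cutoffs:

```python
import numpy as np

def mean_pairwise_cosine(emb: np.ndarray, n_probe: int = 256, seed: int = 0) -> float:
    """Mean off-diagonal cosine similarity over a fixed probe sample.
    Near 0 for well-spread embeddings, near 1 when they collapse."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(emb), size=min(n_probe, len(emb)), replace=False)
    X = emb[idx]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    S = X @ X.T                                        # cosine similarity matrix
    n = len(X)
    return (S.sum() - n) / (n * (n - 1))               # average, excluding the diagonal of 1s

# synthetic demo: spread-out embeddings vs. embeddings clustered around one direction
rng = np.random.default_rng(1)
healthy = rng.normal(size=(1000, 64))
collapsed = rng.normal(size=(1, 64)) + 0.01 * rng.normal(size=(1000, 64))
```

Using a fixed seed keeps the probe sample stable across training checkpoints, so the metric is comparable over time.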

follow-up

  • How does the transformer's embedding layer differ from word2vec, and why do contextual embeddings (BERT) outperform static ones?
  • How would you design an embedding system for a cold-start problem, where new items have no interaction history?
  • What's the difference between collaborative filtering embeddings and content-based embeddings, and when would you combine them?
  • How do you detect and measure embedding collapse during training, and what does the uniformity-alignment framework tell you about embedding quality?