Implement Word2Vec skip-gram with negative sampling from scratch in NumPy. Explain what each part is doing, why negative sampling is necessary, and what the resulting embeddings capture.
Your implementation processes one word pair at a time. In practice, Word2Vec trains on billions of tokens. How would you make this significantly faster without changing the algorithm?
tldr
Word2Vec skip-gram learns embeddings by predicting context words from center words. Negative sampling avoids the expensive full-vocabulary softmax by training a binary classifier (real context vs. random noise). Separate center and context embedding matrices avoid degenerate solutions such as a word being pushed to score highly against itself. Raising unigram frequencies to the 0.75 power when sampling negatives boosts rare words relative to frequent ones, which gives more informative negatives. At scale: pre-sample negative tables, subsample frequent words, vectorize operations, and use lock-free parallel training (Hogwild!).
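A minimal NumPy sketch of the two core pieces — the unigram^0.75 negative-sampling table and a single skip-gram-with-negative-sampling update. It assumes a pre-built vocabulary of integer word ids; the names (`build_neg_table`, `sgns_step`) and hyperparameters (table size, `k=5` negatives, `lr=0.025`) are illustrative choices, not the reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_neg_table(word_counts, table_size=1_000_000, power=0.75):
    """Pre-sampled table of word ids with P(w) ∝ count(w)^0.75.

    The 0.75 power flattens the unigram distribution: frequent words still
    fill more slots, but rare words appear more often than raw frequency
    would give, so negatives are more informative.
    """
    probs = word_counts ** power
    probs /= probs.sum()
    return rng.choice(len(word_counts), size=table_size, p=probs)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, neg_table, k=5, lr=0.025):
    """One SGD update for a single (center, context) pair.

    W_in  : (V, d) center-word ("input") embeddings
    W_out : (V, d) context-word ("output") embeddings
    Loss  : -log σ(v_c·u_o) - Σ_{i=1..k} log σ(-v_c·u_neg_i)
    """
    negs = neg_table[rng.integers(0, len(neg_table), size=k)]
    targets = np.concatenate(([context], negs))   # 1 positive + k negatives
    labels = np.zeros(k + 1)
    labels[0] = 1.0                                # binary targets: real vs. noise

    v_c = W_in[center]                             # (d,)
    U = W_out[targets]                             # (k + 1, d)
    scores = sigmoid(U @ v_c)                      # σ(u·v_c) for each target

    # Gradient of the binary logistic loss wrt each score is (σ - label).
    g = scores - labels                            # (k + 1,)
    grad_v = g @ U                                 # sum of gradients flowing into v_c
    # Note: duplicate negative ids would not accumulate with fancy indexing;
    # ignored here for simplicity, as in most from-scratch implementations.
    W_out[targets] -= lr * np.outer(g, v_c)        # update context vectors
    W_in[center]   -= lr * grad_v                  # update center vector

    # Loss for monitoring: σ(-x) = 1 - σ(x), so the negative term uses 1 - scores.
    return -np.log(scores[0] + 1e-10) - np.sum(np.log(1.0 - scores[1:] + 1e-10))

# Toy usage (hypothetical vocabulary of 10 words, 8-dim embeddings):
V, d = 10, 8
W_in  = (rng.random((V, d)) - 0.5) / d            # small random init
W_out = np.zeros((V, d))                          # output vectors start at zero
counts = np.array([100, 50, 30, 20, 10, 5, 5, 3, 2, 1], dtype=float)
table = build_neg_table(counts, table_size=100_000)
loss = sgns_step(W_in, W_out, center=0, context=3, neg_table=table)
```

On the scaling question: the same update vectorizes over a batch of (center, context) pairs with one batched matrix multiply instead of a Python-level loop; frequent words can be dropped before pairs are generated with probability roughly 1 - sqrt(t / f(w)) (t ≈ 1e-5, per the original paper); and multiple threads can apply these updates lock-free on shared weight matrices (Hogwild!), since collisions on individual rows are rare.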
follow-up
- What's the relationship between Word2Vec and matrix factorization of the PMI (pointwise mutual information) matrix? What does this equivalence tell you?
- How would you evaluate whether your learned embeddings are good — what benchmarks and intrinsic/extrinsic evaluations would you use?
- Modern language models (BERT, GPT) produce contextual embeddings. What do they capture that Word2Vec cannot, and when would you still prefer Word2Vec?