Walk me through a two-tower model for candidate retrieval. What are its limitations versus a single cross-attention model, and why do we use it anyway at scale?
formulate your answer first, then read on.
tldr
Two-tower: query tower and item tower encode independently; score = dot product in a shared embedding space. Key win: item embeddings are precomputed and indexed offline, so retrieval reduces to ANN / maximum inner-product search (milliseconds over 1B items). Key loss: no query-item interaction during encoding, so cross features (e.g., exact term matches, query-conditional item attributes) can't be captured. A cross-attention model must run a forward pass per (query, item) pair, i.e., O(corpus) compute per request, which is why two-tower is effectively the only architecture that scales retrieval to O(1B) candidates. Standard pattern: two-tower retrieval of the top-k, then a heavy cross-attention ranker over just those k.
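A minimal sketch of the two paths (offline item indexing, online query encoding + ANN lookup), assuming PyTorch and FAISS; the tower shapes, feature dims, and names like `Tower` / `query_tower` are illustrative, not from any particular system:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import faiss  # illustrative ANN library choice

class Tower(nn.Module):
    """MLP encoder mapping raw features into the shared d-dim space."""
    def __init__(self, in_dim: int, d: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, d))

    def forward(self, x):
        # L2-normalize so dot product == cosine; keeps the index well-behaved.
        return F.normalize(self.net(x), dim=-1)

query_tower = Tower(in_dim=64)  # encodes query/user features at request time
item_tower = Tower(in_dim=96)   # encodes item features, run offline

# Offline: embed the catalog once, build the index.
with torch.no_grad():
    item_emb = item_tower(torch.randn(10_000, 96)).numpy()  # stand-in catalog
index = faiss.IndexFlatIP(128)  # exact MIPS; prod would use HNSW / IVF-PQ
index.add(item_emb.astype(np.float32))

# Online: one query-tower forward pass + one index lookup, no per-item
# model calls. This is the whole scaling argument in two lines.
with torch.no_grad():
    q = query_tower(torch.randn(1, 64)).numpy().astype(np.float32)
scores, ids = index.search(q, 500)  # top-500 go to the cross-attention ranker
```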
follow-up
- How do you mine hard negatives for training a two-tower model, and why are in-batch negatives alone insufficient? (see the hard-negative sketch after this list)
- What is ColBERT's late interaction approach and how does it trade off expressiveness vs. retrieval cost compared to two-tower? (see the MaxSim sketch after this list)
- How do you handle the cold start problem for new items in a two-tower retrieval system? (see the cold-start sketch after this list)
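For the hard-negative follow-up, a hedged sketch of the usual loss shape: in-batch softmax treats the other rows' positives as negatives, which skew easy and popular, so ANN-mined hard negatives are concatenated as extra logit columns. The function name, temperature, and `[B, H, d]` layout are assumptions:

```python
import torch
import torch.nn.functional as F

def retrieval_loss(q, pos, hard=None, t=0.05):
    """q, pos: [B, d] normalized embeddings of aligned (query, positive) pairs.
    hard: optional [B, H, d] ANN-mined hard negatives per query."""
    logits = q @ pos.T / t  # [B, B]; row i's positive is column i,
                            # the other B-1 columns are in-batch negatives
    if hard is not None:
        # Extra columns of near-miss items the batch would never surface.
        hard_logits = torch.einsum("bd,bhd->bh", q, hard) / t  # [B, H]
        logits = torch.cat([logits, hard_logits], dim=1)
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Mining loop (sketch): every N steps, re-embed the corpus, ANN-search each
# training query, and keep top-scoring non-positives as `hard`. In-batch
# negatives alone are random/popular items the model separates quickly, so
# the decision boundary near true positives stays undertrained.
```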
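For the ColBERT follow-up, a sketch contrasting single-vector scoring with late-interaction MaxSim; token counts and normalization are assumed:

```python
import torch

def two_tower_score(q_vec, d_vec):
    """One vector per side: a single dot product, cheapest to index."""
    return q_vec @ d_vec

def maxsim_score(q_toks, d_toks):
    """q_toks: [Lq, d], d_toks: [Ld, d] normalized per-token embeddings.
    Each query token picks its best document token; sum over query tokens.
    Still no cross-attention, so document token embeddings remain
    precomputable and indexable."""
    sim = q_toks @ d_toks.T            # [Lq, Ld] token-token similarities
    return sim.max(dim=1).values.sum()

# Tradeoff: the index stores Ld vectors per document instead of 1, and
# scoring costs Lq*Ld dot products instead of 1, so late interaction sits
# between two-tower (cheapest) and full cross-attention (most expressive).
```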
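For the cold-start follow-up, one common mitigation sketched under assumptions (the feature choices and dims are made up): build the item tower from content features instead of a learned item-ID embedding, so new items can be embedded and indexed immediately:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentItemTower(nn.Module):
    """Item tower over content features only: no per-item ID table, so a
    zero-interaction item still gets a usable embedding."""
    def __init__(self, text_dim=384, n_categories=1000, d=128):
        super().__init__()
        self.cat_emb = nn.Embedding(n_categories, 32)
        self.proj = nn.Linear(text_dim + 32, d)

    def forward(self, text_vec, category_id):
        x = torch.cat([text_vec, self.cat_emb(category_id)], dim=-1)
        return F.normalize(self.proj(x), dim=-1)

# New item arrives -> embed from metadata, add to the ANN index immediately;
# a common refinement is to blend in a learned ID embedding once the item
# accumulates interactions.
```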