mlprep
ML Breadth · hard · 12 min

Walk me through a two-tower model for candidate retrieval. What are its limitations versus a single cross-attention model, and why do we use it anyway at scale?

formulate your answer first, then check the tldr.

tldr

Two-tower: query tower and item tower encode independently, score = dot product in shared space. Key win: item embeddings precomputed, retrieval = ANN search (milliseconds for 1B items). Key loss: no query-item interaction during encoding — can't capture cross features. Used because it's the only architecture that scales to O(1B) candidates. Pair with a heavy cross-attention ranker on the top-k retrieved candidates.
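A minimal sketch of the independent-encode, dot-product-score setup. Everything here is illustrative: a single random linear projection stands in for each tower (in practice each is a deep network over query/item features), and brute-force top-k stands in for the ANN index used at scale.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32  # shared embedding dimension (illustrative)

def encode(x, W):
    # Stand-in tower: linear projection + L2 normalization,
    # so the dot product is cosine similarity.
    h = x @ W
    return h / np.linalg.norm(h, axis=-1, keepdims=True)

# Hypothetical tower weights; real towers are learned jointly.
W_query = rng.normal(size=(64, DIM))
W_item = rng.normal(size=(64, DIM))

# Item embeddings are computed OFFLINE, once, for the whole corpus.
items = rng.normal(size=(10_000, 64))
item_emb = encode(items, W_item)      # (10_000, DIM)

# At serving time only the query tower runs; because the score is a
# dot product, retrieval reduces to (approximate) nearest neighbors.
query = rng.normal(size=(1, 64))
q_emb = encode(query, W_query)        # (1, DIM)
scores = q_emb @ item_emb.T           # (1, 10_000)
top_k = np.argsort(-scores[0])[:10]   # brute force here; ANN (e.g. HNSW) at 1B scale
```

Note what a cross-attention ranker could do that this cannot: nothing inside `encode` ever sees the query and item together, so any cross feature has to be squeezed into the two fixed vectors before the dot product.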

follow-up

  • How do you mine hard negatives for training a two-tower model, and why are in-batch negatives alone insufficient?
  • What is ColBERT's late interaction approach and how does it trade off expressiveness vs. retrieval cost compared to two-tower?
  • How do you handle the cold start problem for new items in a two-tower retrieval system?
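For the first follow-up, a sketch of the in-batch-negatives loss the question refers to (illustrative NumPy, not a production implementation): each query's positive item doubles as a negative for every other query in the batch. The limitation is visible in the construction itself: the negatives are drawn from the positives of other queries, so they skew toward popular items and are mostly easy, which is why mined hard negatives are added on top.

```python
import numpy as np

def in_batch_softmax_loss(q_emb, pos_item_emb, temperature=0.05):
    """Softmax over in-batch negatives.

    q_emb, pos_item_emb: (B, D) L2-normalized embeddings, where row i
    of pos_item_emb is the positive item for query i. Off-diagonal
    entries of the score matrix act as negatives "for free".
    """
    logits = (q_emb @ pos_item_emb.T) / temperature          # (B, B)
    # Log-softmax per row; the diagonal holds the positive pairs.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

With identical query and item embeddings the loss is near zero; with shuffled positives it blows up, which is the gradient signal the towers train on.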