Question

Most production ranking systems use a multi-stage funnel rather than a single model that scores all candidates. Walk me through why, what each stage does, and what the tradeoffs are at each layer. Then tell me: what breaks if you skip a stage?

Accepted Answer

Why multi-stage

A single model scoring every candidate is infeasible at scale. A catalog at YouTube scale holds on the order of hundreds of millions of videos; scoring each one with a deep ranking model on every request cannot fit in a 100ms budget. The funnel narrows the candidate set from millions to tens, using progressively more expensive, more accurate models.

The fundamental tension: accuracy increases with model complexity, but so do latency and cost. A multi-stage design spreads this tradeoff across stages, spending cheap models on the many candidates and expensive models on the few.
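A rough back-of-envelope calculation makes the infeasibility concrete. The sketch below assumes a hypothetical 100M-item catalog and borrows the per-item ranker cost implied by the latency budget at the end of this answer (40ms for 100 candidates). Both numbers are illustrative, and real systems batch and parallelize, so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-envelope: cost of running the full ranker over an entire catalog.
CATALOG_SIZE = 100_000_000   # assumed catalog size (illustrative, not a measured figure)
RANK_MS = 40                 # ranker latency for one batch, from the budget example
RANK_BATCH = 100             # candidates scored in that batch
SLO_MS = 100                 # end-to-end latency target

# Assumes cost scales linearly with candidates and scoring is sequential.
full_scan_ms = CATALOG_SIZE * RANK_MS // RANK_BATCH
full_scan_hours = full_scan_ms / 1000 / 3600
slo_overrun = full_scan_ms // SLO_MS

print(f"{full_scan_hours:.1f} hours per request, {slo_overrun:,}x over the SLO")
```

Under these assumptions a single request would take about 11 hours, which is why the expensive model only ever sees a few hundred candidates.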

Stage 1: Retrieval (millions → hundreds)

Goal: recall. Don't miss relevant items. Precision is secondary — you'd rather over-retrieve than miss the best result.

ANN over embeddings: a bi-encoder (two-tower) model produces user and item embeddings. Approximate nearest neighbor search (FAISS, ScaNN) retrieves the top-K items by dot product or cosine similarity. Lookup is typically sub-10ms over millions of items.
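To make the retrieval objective concrete, here is a minimal sketch of the exact top-K search that an ANN index approximates. It is a brute-force scan in plain Python, not FAISS or ScaNN, and the embeddings and the names `top_k_exact`, `items`, and `user` are made up for illustration.

```python
import heapq

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k_exact(user_vec, item_vecs, k):
    """Exact top-k item ids by dot product; ANN indexes trade a little recall
    for avoiding this full scan."""
    scored = ((dot(user_vec, vec), item_id) for item_id, vec in item_vecs.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# Toy 2-d embeddings (hypothetical values).
items = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
user = [0.8, 0.6]
top2 = top_k_exact(user, items, k=2)
print(top2)
```

An exact scan like this is also a useful offline baseline: comparing its output with the ANN index's output gives the recall lost to approximation.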

Multiple retrieval sources: production systems run several retrievers in parallel and merge their results:
- Collaborative filtering tower (user history → similar items)
- Content-based tower (item features → semantically similar items)
- Trending/popularity bucket
- Diversity bucket (avoid echo chamber)
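One simple way to merge the sources listed above is a round-robin interleave with de-duplication, so that no single retriever crowds out the others. This is only a sketch under that assumption; real systems often use per-source quotas or learned blending, and the source names and item ids below are hypothetical.

```python
from itertools import zip_longest

def merge_sources(sources, limit):
    """Round-robin interleave ranked candidate lists, dropping duplicates."""
    merged, seen = [], set()
    # zip_longest walks rank 0 of every source, then rank 1, and so on.
    for tier in zip_longest(*sources.values()):
        for item in tier:
            if item is None or item in seen:
                continue
            seen.add(item)
            merged.append(item)
            if len(merged) == limit:
                return merged
    return merged

sources = {
    "collaborative": ["v1", "v2", "v3"],
    "content": ["v2", "v4"],
    "trending": ["v5", "v1"],
    "diversity": ["v6"],
}
candidates = merge_sources(sources, limit=5)
print(candidates)
```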

False negatives here are unrecoverable — an item not retrieved never reaches the ranker. Retrieval recall is the first thing to measure.
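Retrieval recall can be measured as recall@K against a set of items known to be relevant for the request (for example, items the user later engaged with). A minimal sketch, with hypothetical ids:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

retrieved = ["v1", "v7", "v3", "v9", "v2"]   # retriever output, best first
relevant = {"v2", "v3", "v8"}                # ground-truth relevant items

r3 = recall_at_k(retrieved, relevant, 3)
r5 = recall_at_k(retrieved, relevant, 5)
print(r3, r5)
```

Here "v8" is never retrieved, so recall cannot reach 1.0 no matter how good the later stages are.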

Stage 2: Pre-rank (hundreds → ~100)

Goal: cheap filtering before expensive ranking. Remove obviously bad candidates.

Typically a lightweight model: logistic regression or a small 2-layer MLP on sparse features. Inference in under 5ms for 500 candidates.
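As an illustration, a logistic-regression pre-ranker over sparse features can be sketched in a few lines. The feature names, weights, and bias below are invented for the example and are not taken from any real system.

```python
import math

def lr_score(weights, bias, features):
    """Logistic regression over a sparse feature dict (missing features count as 0)."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def pre_rank(candidates, weights, bias, keep):
    """Score every candidate cheaply and keep the top `keep` for the ranker."""
    ordered = sorted(
        candidates,
        key=lambda c: lr_score(weights, bias, c["features"]),
        reverse=True,
    )
    return ordered[:keep]

# Hypothetical weights and candidates.
weights = {"hist_ctr": 3.0, "is_fresh": 0.5, "lang_mismatch": -2.0}
candidates = [
    {"id": "v1", "features": {"hist_ctr": 0.10, "is_fresh": 1.0}},
    {"id": "v2", "features": {"hist_ctr": 0.30, "lang_mismatch": 1.0}},
    {"id": "v3", "features": {"hist_ctr": 0.25}},
]
kept = [c["id"] for c in pre_rank(candidates, weights, bias=-1.0, keep=2)]
print(kept)
```

Note that "v2" has the highest historical CTR but is dropped because of the language mismatch; the ranker never gets a chance to overrule that decision.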

Pre-rank is often the most underinvested stage. A bad pre-ranker drops good candidates before the expensive ranker ever sees them — silent quality loss that's hard to attribute.

Stage 3: Rank (100 → 10-20)

Goal: precision. This is the main model — GBDT, DCN, or a full neural ranker with dense features, user context, and cross features.

Typical features at rank stage:
- Item features: content embeddings, historical CTR, engagement signals
- User features: session context, long-term preferences, demographics
- Context features: time, device, query (for search)
- Cross features: user × item affinity scores, position-aware features

Inference: 5-50ms for 100 candidates depending on model complexity.

Stage 4: Re-rank (10-20 → final list)

Goal: business objectives beyond relevance — diversity, freshness, policy constraints, monetization.

Diversity re-ranking: maximal marginal relevance or determinantal point processes to avoid showing 10 items from the same category.
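A minimal sketch of greedy maximal marginal relevance (MMR) follows. It assumes a relevance score per item from the ranker and a pairwise similarity function; here similarity is a crude same-category indicator, and the items, scores, and `lam` value are hypothetical.

```python
def mmr(candidates, relevance, similarity, k, lam=0.7):
    """Greedy MMR: at each step pick the item maximizing
    lam * relevance - (1 - lam) * (max similarity to items already selected)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

category = {"v1": "sports", "v2": "sports", "v3": "music", "v4": "sports"}
relevance = {"v1": 0.9, "v2": 0.85, "v3": 0.6, "v4": 0.8}

def same_category(a, b):
    return 1.0 if category[a] == category[b] else 0.0

slate = mmr(list(relevance), relevance, same_category, k=3, lam=0.7)
print(slate)
```

With `lam=1.0` this reduces to sorting by relevance; lowering `lam` trades relevance for diversity, which is why the music item moves up to second place here.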

Business rules: legal constraints, content policy enforcement, promotions injection.

Cross-item interactions: some re-rankers model the full slate jointly (sequence models, listwise objectives). Expensive but captures list-level effects.

Position bias correction: higher slots attract more clicks regardless of relevance, so scores are adjusted for the position at which each item was, or will be, displayed.
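One common way to correct for position when learning from click logs is inverse propensity weighting: each click is weighted by the inverse of the probability that its slot was examined at all. The sketch below assumes those examination probabilities are already known (in practice they have to be estimated, for example from randomized slot swaps), and all numbers are hypothetical.

```python
def naive_ctr(impressions):
    """Plain click-through rate, ignoring where the item was shown."""
    return sum(click for _, click in impressions) / len(impressions)

def ipw_ctr(impressions, examine_prob):
    """Propensity-weighted CTR: each click counts 1 / P(slot examined)."""
    weighted = sum(click / examine_prob[pos] for pos, click in impressions)
    return weighted / len(impressions)

# Assumed examination probabilities per display position.
examine_prob = {1: 1.0, 3: 0.25}

# (position, clicked) pairs from hypothetical logs.
item_a = [(1, 1), (1, 0), (1, 1), (1, 0)]   # always shown in slot 1
item_b = [(3, 0), (3, 1), (3, 0), (3, 0)]   # always shown in slot 3

print(naive_ctr(item_a), naive_ctr(item_b))
print(ipw_ctr(item_a, examine_prob), ipw_ctr(item_b, examine_prob))
```

Item B looks worse on raw CTR but better once its poor placement is accounted for. Weighted estimates can be noisy when propensities are small, so clipping the weights is a common safeguard.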

Cascade training and the distribution shift problem

Each stage is trained on data filtered by earlier stages — distribution shifts compound. The ranker trains on items that passed pre-rank, so it never sees the items pre-rank filtered out. If pre-rank degrades (e.g., due to feature drift), ranker performance falls for reasons the ranker team can't observe.

Mitigation: log a small uniform exploration slice (1-5% of traffic) that bypasses pre-rank filtering, ensuring the ranker occasionally sees items pre-rank would have cut.
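A sketch of how such an exploration slice might be assigned: hash the request id into [0, 1) and route requests below the exploration rate around pre-rank, passing the ranker a uniform random sample of retrieved candidates instead. The function names, salt, and 2% default rate are assumptions for illustration.

```python
import hashlib
import random

def _bucket(key):
    """Map a string to a stable pseudo-uniform value in [0, 1)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def in_exploration_slice(request_id, rate=0.02, salt="prerank-bypass"):
    """Deterministic assignment: the same request id always lands in the same slice."""
    return _bucket(f"{salt}:{request_id}") < rate

def select_for_ranker(request_id, retrieved, pre_rank_top, keep, rate=0.02):
    """Return (candidates, mode). The mode should be logged with the training data."""
    if in_exploration_slice(request_id, rate):
        rng = random.Random(request_id)  # seeded so the sample is reproducible
        return rng.sample(retrieved, min(keep, len(retrieved))), "explore"
    return pre_rank_top(retrieved, keep), "prerank"
```

Logging the mode alongside each impression lets the ranker team train or evaluate on the unfiltered slice separately and compare it with pre-ranked traffic.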

Calibration across stages

Raw scores from each stage model are not comparable. A 0.8 score from a logistic regression pre-ranker and a 0.8 from a DCN ranker mean different things. Calibration must be done per-stage, especially if scores are used for business logic (bid pricing, capping).
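Histogram binning is one simple per-stage calibration method: bucket held-out scores and replace each raw score with the observed positive rate in its bucket. The sketch below assumes raw scores lie in [0, 1] and uses a tiny made-up sample; Platt scaling or isotonic regression are common alternatives.

```python
def fit_histogram_calibrator(scores, labels, n_bins=10):
    """Return per-bin observed positive rates (None for empty bins)."""
    totals = [0] * n_bins
    positives = [0] * n_bins
    for score, label in zip(scores, labels):
        b = min(int(score * n_bins), n_bins - 1)
        totals[b] += 1
        positives[b] += label
    return [positives[b] / totals[b] if totals[b] else None for b in range(n_bins)]

def calibrate(score, table):
    """Map a raw score to its bin's observed rate; fall back to the raw score."""
    b = min(int(score * len(table)), len(table) - 1)
    return table[b] if table[b] is not None else score

# Hypothetical held-out data for one stage: high raw scores, few actual positives.
scores = [0.8, 0.8, 0.9, 0.7, 0.2, 0.1]
labels = [1, 0, 0, 0, 0, 0]
table = fit_histogram_calibrator(scores, labels, n_bins=2)
print(calibrate(0.8, table))
```

In this toy sample a raw 0.8 corresponds to an observed rate of 0.25, which is the number downstream business logic should be using.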

What breaks if you skip a stage

Skip retrieval, run ranker on everything: infeasible at >1M items. Latency blows up.

Skip pre-rank: the ranker scores 500 items instead of 100, roughly a 5× increase in ranker latency if cost scales linearly with candidate count, or you run a smaller ranker that hurts quality.

Skip re-rank: diversity collapses (model shows 10 similar items), policy violations slip through, monetization integration becomes ad-hoc.

Latency budget example (100ms total SLO)

Retrieval:     10ms (ANN over 10M items)
Pre-rank:       5ms (LR on 500 candidates)
Rank:          40ms (DCN on 100 candidates)
Re-rank:        5ms (rules + diversity)
Network+misc:  40ms
───────────────────
Total:        100ms
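The budget above can be checked mechanically, and the same arithmetic shows what skipping pre-rank costs. The sketch assumes ranker latency scales linearly with the number of candidates, which is a simplification (batched inference is often sublinear), and the helper name `without_pre_rank` is made up for the example.

```python
budget_ms = {
    "retrieval": 10,
    "pre_rank": 5,
    "rank": 40,
    "re_rank": 5,
    "network_misc": 40,
}
total_ms = sum(budget_ms.values())

def without_pre_rank(budget, ranked_with=100, ranked_without=500):
    """Drop the pre-rank stage and scale ranker latency linearly with candidates."""
    scaled = dict(budget)
    scaled.pop("pre_rank")
    scaled["rank"] = budget["rank"] * ranked_without / ranked_with
    return scaled

no_pre_rank_total_ms = sum(without_pre_rank(budget_ms).values())
print(total_ms, no_pre_rank_total_ms)
```

Under the linear assumption, removing a 5ms stage pushes the request from 100ms to 255ms, more than double the SLO.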