Question

Explain the Wide & Deep architecture and why Google built it for app recommendations. Then walk me through how Deep & Cross Network (DCN) improves on it. What problem are both trying to solve, and when would you use each in a ranking system today?

Accepted Answer

The problem: memorization vs generalization

Ranking models need two things simultaneously:

Memorization: exploit known patterns exactly. "Users who installed Candy Crush also install Farm Heroes" — a specific (user_history, app) co-occurrence that should fire exactly as seen in training data. Logistic regression with manually engineered feature crosses does this well.

Generalization: transfer to unseen (user, item) pairs using similarity. If a user likes strategy games, recommend other strategy games even without an exact co-occurrence signal. Deep neural networks do this well by learning dense representations.

Wide & Deep (Cheng et al., Google Play, 2016) trains both jointly.

Wide & Deep architecture

Input features
    │
    ├──► Wide component: linear model on raw + crossed features
    │         y_wide = w^T [x, x_crossed]
    │
    └──► Deep component: DNN on dense embeddings
              x_embed → FC → FC → FC → y_deep

Final logit = y_wide + y_deep

Wide component: sparse features + manually engineered crosses (e.g., user_installed_app × impression_app). Standard logistic regression. Captures specific co-occurrence memorization.
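To make the cross-product idea concrete, here is a minimal sketch of how a wide component might build and score crossed features. The function names, the string key format, and the weight value are assumptions for illustration, not taken from the paper's implementation.

```python
def cross_features(installed_apps, impression_app):
    """Build sparse feature keys: raw features plus installed x impression crosses."""
    raw = [f"installed={a}" for a in installed_apps]
    raw.append(f"impression={impression_app}")
    # Each cross fires only when both conditions hold (a logical AND).
    crossed = [f"installed={a}&impression={impression_app}" for a in installed_apps]
    return raw + crossed


def wide_logit(feature_keys, weights, bias=0.0):
    # Linear model over binary sparse features: sum the weights of active keys.
    return bias + sum(weights.get(k, 0.0) for k in feature_keys)


# Hypothetical learned weight for one memorized co-occurrence.
weights = {"installed=candy_crush&impression=farm_heroes": 1.5}

keys = cross_features(["candy_crush", "maps"], "farm_heroes")
score = wide_logit(keys, weights)
print(score)  # 1.5
```

Only the exact (candy_crush, farm_heroes) pair contributes here. A pair that never appeared in training has no weight and contributes nothing, which is why the wide part memorizes but does not generalize.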

Deep component: categorical features → learned embeddings (32-128d) → concatenate → 3-4 FC layers with ReLU. Learns generalizable dense representations.

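Both components are trained jointly end-to-end: the two logits are summed and a single loss is backpropagated into both, rather than training two models and ensembling them afterwards. In the paper, the wide part is optimized with FTRL plus L1 regularization and the deep part with AdaGrad. The wide component's sparse crosses require domain knowledge to engineer, and this is its main limitation.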
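The forward pass described above can be sketched in plain Python as follows. Layer sizes, vocabulary sizes, the random initialization, and all helper names are illustrative assumptions. A production model would use a framework with autodiff.

```python
import math
import random


def dense(x, W, b):
    # Fully connected layer; W is a list of rows, one row per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def init_layer(rng, n_in, n_out):
    W = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def wide_and_deep_logit(wide_x, wide_w, embeddings, cat_ids, layers, out_w):
    # Wide part: linear model on the sparse raw + crossed feature vector.
    y_wide = sum(w * x for w, x in zip(wide_w, wide_x))
    # Deep part: look up one embedding per categorical feature, then concatenate.
    h = [v for table, idx in zip(embeddings, cat_ids) for v in table[idx]]
    for W, b in layers:
        h = relu(dense(h, W, b))
    y_deep = sum(w * a for w, a in zip(out_w, h))
    # Final logit is the sum of both parts.
    return y_wide + y_deep


rng = random.Random(0)
emb_dim, vocab = 4, 10
# Two categorical features, each with its own embedding table.
embeddings = [[[rng.gauss(0, 0.1) for _ in range(emb_dim)] for _ in range(vocab)]
              for _ in range(2)]
layers = [init_layer(rng, 2 * emb_dim, 8), init_layer(rng, 8, 4)]
out_w = [rng.gauss(0, 0.1) for _ in range(4)]

wide_x = [1.0, 0.0, 1.0]   # binary raw/crossed features
wide_w = [0.2, -0.1, 0.7]  # hypothetical wide weights

logit = wide_and_deep_logit(wide_x, wide_w, embeddings, [3, 7], layers, out_w)
prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid gives the predicted probability
```

Because the two logits are added before the sigmoid, one loss gradient flows into both parts during training.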

Deep & Cross Network (DCN, Wang et al., 2017)

DCN replaces the hand-engineered wide component with a learned cross network that automatically generates explicit feature interactions up to a bounded order, set by the number of cross layers.

Cross layer:
x_{l+1} = x_0 · x_l^T · w_l + b_l + x_l

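- x_0: original feature vector (the embedded, concatenated input), reused in every cross layer
- x_l^T · w_l: a scalar; the current layer's output is projected onto w_l, and the result scales x_0
- b_l: bias vector
- Adding x_l: residual connection, so each layer only has to fit the change from x_l to x_{l+1}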
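A minimal pure-Python sketch of the cross layer formula above. The input values and weights are made up for illustration.

```python
def cross_layer_v1(x0, xl, w, b):
    # Scalar projection of the current layer: x_l^T w_l.
    s = sum(xi * wi for xi, wi in zip(xl, w))
    # Scale x_0 by that scalar, add the bias, then add the residual x_l.
    return [x0i * s + bi + xli for x0i, bi, xli in zip(x0, b, xl)]


x0 = [1.0, 2.0]
w = [0.5, -1.0]
b = [0.0, 0.1]

x1 = cross_layer_v1(x0, x0, w, b)  # first layer: x_l is x_0 itself
x2 = cross_layer_v1(x0, x1, w, b)  # later layers still multiply by x_0

params_per_layer = len(w) + len(b)  # 2d parameters: one weight vector, one bias
```

Every layer multiplies by the original x_0, not by the previous layer's output, which is why each layer raises the interaction degree by exactly one.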

Each cross layer adds one order of feature interaction. After L layers the output contains polynomial terms of the input features up to degree L+1, with only O(d) parameters per layer (one weight vector and one bias vector), versus O(d²) for an explicit pairwise interaction matrix.
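A small worked case shows where the degree comes from. Assume d = 2, a single cross layer, and zero bias. The symbols u, v, w_1, w_2 are introduced here purely for illustration.

```latex
\begin{aligned}
x_0 &= \begin{pmatrix} u \\ v \end{pmatrix}, \quad
w_0 = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}, \quad
b_0 = 0 \\[4pt]
x_1 &= x_0 \,(x_0^{\top} w_0) + x_0
     = \begin{pmatrix} w_1 u^2 + w_2 uv + u \\ w_1 uv + w_2 v^2 + v \end{pmatrix}
\end{aligned}
```

One layer (L = 1) yields degree-2 terms such as uv, matching the L+1 rule. Both output coordinates share the same scalar (w_1 u + w_2 v), so the interaction coefficients are tied together and not free per pair. That is the price of using only O(d) parameters.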

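DCN v2 (Wang et al., 2021): replaces the scalar projection with a full matrix: x_{l+1} = x_0 ⊙ (W_l x_l + b_l) + x_l, where ⊙ is the element-wise product. More expressive, because each output dimension gets its own projection, at O(d²) parameters per layer. The paper also proposes a low-rank factorization of W_l to reduce that cost.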
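The sketch below puts the v2 layer next to the v1 layer. The values are illustrative, and the choice of W (every row equal to w, with zero bias) is an assumption made only to show the special case in which the two layers coincide.

```python
def cross_layer_v1(x0, xl, w, b):
    # v1: one scalar projection x_l^T w scales all of x_0.
    s = sum(xi * wi for xi, wi in zip(xl, w))
    return [x0i * s + bi + xli for x0i, bi, xli in zip(x0, b, xl)]


def cross_layer_v2(x0, xl, W, b):
    # v2: full matrix projection W x_l + b gives one value per dimension.
    proj = [sum(wij * xj for wij, xj in zip(row, xl)) + bi
            for row, bi in zip(W, b)]
    # Element-wise product with x_0, plus the residual x_l.
    return [x0i * p + xli for x0i, p, xli in zip(x0, proj, xl)]


x0 = [1.0, 2.0]
w = [0.5, -1.0]
zero_b = [0.0, 0.0]

# Special case: if every row of W equals w and the bias is zero,
# v2 computes the same output as v1.
W_tied = [w, w]
out_v1 = cross_layer_v1(x0, x0, w, zero_b)
out_v2 = cross_layer_v2(x0, x0, W_tied, zero_b)

d = len(x0)
v1_params = 2 * d      # weight vector + bias
v2_params = d * d + d  # weight matrix + bias
```

With untied rows, each output dimension can weight the interactions differently, which is the extra expressiveness v2 buys with its d² weights.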

Comparison

| | Wide & Deep | DCN |
|---|---|---|
| Feature crosses | Hand-engineered (domain expertise required) | Learned automatically |
| Cross expressiveness | Only crosses you enumerate | Polynomial terms up to degree L+1 for L cross layers |
| Parameters | Wide: sparse, large; Deep: dense | Cross: O(d·L); Deep: standard |
| Interpretability | Wide component is interpretable | Cross interactions less interpretable |
| Engineering cost | High (feature cross curation) | Lower |

When to use each today

Both are relatively mature architectures (2016-2021). Modern production systems often use:

- DCN v2 + multi-task heads: DCN cross layers for feature interaction, shared tower with separate output heads per objective (CTR, CVR, engagement time)
- DLRM (Meta): similar philosophy — embedding tables for sparse features, MLP for dense, dot-product interactions
- Stacked or parallel cross: DCN cross layers stacked with deep layers (stacked) or running in parallel with outputs concatenated
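The stacked and parallel layouts, plus per-task heads, can be sketched as follows. Dimensions, random weights, the task names "ctr" and "cvr", and all function names are assumptions for illustration only.

```python
import random


def dense(x, W, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def cross_net(x0, cross_layers):
    # DCN v2 style cross layers: x_{l+1} = x_0 * (W x_l + b) + x_l.
    xl = x0
    for W, b in cross_layers:
        proj = dense(xl, W, b)
        xl = [x0i * p + xli for x0i, p, xli in zip(x0, proj, xl)]
    return xl


def mlp(x, deep_layers):
    for W, b in deep_layers:
        x = [max(0.0, a) for a in dense(x, W, b)]
    return x


def stacked(x0, cross_layers, deep_layers):
    # Stacked: the cross network output feeds the deep network.
    return mlp(cross_net(x0, cross_layers), deep_layers)


def parallel(x0, cross_layers, deep_layers):
    # Parallel: both see x_0; their outputs are concatenated.
    return cross_net(x0, cross_layers) + mlp(x0, deep_layers)


def task_logits(features, heads):
    # One linear head per objective on top of the shared features.
    return {task: sum(w * f for w, f in zip(wv, features))
            for task, wv in heads.items()}


def init(rng, n_in, n_out):
    W = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


rng = random.Random(0)
d = 3
x0 = [0.5, -1.0, 2.0]
cross_layers = [init(rng, d, d) for _ in range(2)]
deep_layers = [init(rng, d, 4), init(rng, 4, 2)]

stacked_out = stacked(x0, cross_layers, deep_layers)    # length 2
parallel_out = parallel(x0, cross_layers, deep_layers)  # length d + 2
heads = {name: [rng.gauss(0, 0.1) for _ in parallel_out] for name in ("ctr", "cvr")}
logits = task_logits(parallel_out, heads)
```

In the stacked layout the deep network consumes the explicit crosses. In the parallel layout the heads see both the crosses and the implicit deep features, at the cost of a wider final layer.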

The choice today is less "Wide & Deep vs DCN" and more "how many cross layers, what interaction order, and how to balance cross vs DNN capacity."

Common traps

- Wide & Deep's wide component needs careful feature engineering — bad crosses hurt more than no crosses
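- Cross layers are not attention: their weights are fixed after training and not computed per input, so they don't select which interactions matter for a given example; they compute a learned weighting of interactions up to degree L+1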
- Adding more cross layers doesn't monotonically improve quality — overfitting and training instability increase