Question

What's the difference between an embedding model like text-embedding-3-large and a generative model like GPT-4? Walk me through how each is trained, what the output represents, and when you'd use one vs the other for a real task.

Accepted Answer

Architecture and output

Generative models (decoder-only): autoregressive transformers trained to predict the next token. Each forward pass produces a probability distribution over the vocabulary at the last position. Output: one token at a time. The last hidden state is a representation of the sequence, but it's optimized for next-token prediction, not semantic similarity.
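To make the "one token at a time" loop concrete, here is a minimal sketch of greedy autoregressive decoding. `toy_logits` is a hypothetical stand-in for a transformer forward pass (it simply favours the next vocabulary entry), and `VOCAB` is invented for illustration; a real model would return learned logits.

```python
import math

VOCAB = ["<eos>", "the", "cat", "sat"]  # toy vocabulary, assumed for illustration


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(token_ids):
    """Hypothetical stand-in for a forward pass: favours the id after the last token."""
    nxt = (token_ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == nxt else 0.0 for i in range(len(VOCAB))]


def generate(prompt_ids, max_new_tokens=10):
    """Greedy decoding: one forward pass per generated token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(ids))  # distribution at the last position
        next_id = max(range(len(probs)), key=probs.__getitem__)
        ids.append(next_id)
        if next_id == 0:  # stop at <eos>
            break
    return ids


print([VOCAB[i] for i in generate([1])])
```

Each loop iteration corresponds to one forward pass, which is where the "N forward passes for N tokens" cost comes from.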

Embedding models (encoder or bi-encoder): trained to map a sequence to a fixed-size dense vector that captures semantic meaning. Output: one vector per input sequence (typically 768–3072 dimensions). The vector is the product; generation is not the goal.
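One common way to get from per-token hidden states to a single fixed-size vector is masked mean pooling followed by L2 normalization. The sketch below assumes the per-token vectors already exist (here they are made-up numbers); the function names are illustrative, not any library's API.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


def l2_normalize(vec):
    """Scale to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Three token vectors; the last one is padding and is masked out.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
embedding = l2_normalize(mean_pool(hidden, [1, 1, 0]))
print(embedding)
```

Whatever the input length, the result has the same dimensionality, which is what makes the vectors directly comparable.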

How embedding models are trained

Masked Language Modeling (MLM): BERT-style. Randomly mask tokens, predict them. Trains bidirectional attention — each token attends to the full sequence in both directions. Good for understanding; not generative.
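The masking step can be sketched as follows. This is a simplification: it always substitutes `[MASK]`, whereas BERT's recipe replaces 80% of selected tokens with `[MASK]`, 10% with a random token, and leaves 10% unchanged. The function name and the `None` label convention are assumptions for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (inputs, labels): labels hold the original token only at masked positions."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # the model sees the mask...
            labels.append(tok)    # ...and is trained to predict the original token
        else:
            inputs.append(tok)
            labels.append(None)   # no loss at unmasked positions
    return inputs, labels


inputs, labels = mask_tokens("the cat sat on the mat".split(), rng=random.Random(0))
print(inputs, labels)
```

Because the loss is computed only at masked positions and the model sees context on both sides of each mask, the objective rewards bidirectional understanding rather than left-to-right generation.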

Contrastive learning: more powerful for retrieval. Given (query, positive document) pairs, train so that similar pairs have high cosine similarity and dissimilar pairs have low similarity.
L = -log(exp(sim(q, d+) / τ) / Σ_j exp(sim(q, d_j) / τ))
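This is the InfoNCE (NT-Xent) loss, where τ is a temperature and the sum runs over the positive and all negatives. In-batch negatives: every other document in the batch serves as a negative for each query. Larger batches generally help, since more negatives make the contrastive task harder, which tends to yield better representations, though the gains taper off and false negatives in the batch can hurt.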
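The loss with in-batch negatives can be sketched in plain Python. It assumes `queries[i]` is paired with `docs[i]` and treats every other document in the batch as a negative; the temperature default of 0.05 is an arbitrary choice for illustration, not a recommended setting.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def info_nce(queries, docs, tau=0.05):
    """Mean InfoNCE loss; docs[i] is the positive for queries[i], the rest are negatives."""
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / tau for d in docs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])  # equals -log softmax at the positive
    return sum(losses) / len(losses)


q = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(q, q))        # positives aligned: loss near zero
print(info_nce(q, q[::-1]))  # positives swapped: loss is large
```

A real training loop would compute this on batched tensors and backpropagate through the encoder; the arithmetic is the same.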

Generative LLMs used as embedders: a recent trend is to take a decoder-only model (Mistral, LLaMA), pool its hidden states (mean pooling or the last-token state), then fine-tune contrastively. Large generative models encode more world knowledge, which helps retrieval.

Key architectural differences

| | Embedding model | Generative model |
|---|---|---|
| Attention | Usually bidirectional (encoder) | Causal / unidirectional (decoder) |
| Training objective | Contrastive / MLM | Next-token prediction |
| Output | Fixed-size vector | Token distribution |
| Context window | Usually shorter (512-8k tokens) | Longer (8k-1M tokens) |
| Inference cost | One forward pass per input | One forward pass per generated token (N for N tokens) |

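Why raw GPT hidden states make poor embeddings

With causal masking, GPT's hidden state at position i depends only on tokens 0..i. Early positions never see the rest of the sequence, so only the final position summarizes the whole input, and mean pooling mixes in positions that saw only a prefix. The representation is also optimized for predicting what comes next, not for placing two semantically equivalent sentences close together. This is why decoder-based embedders need contrastive fine-tuning before their vectors are useful for retrieval.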
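The effect of causal masking can be seen in a toy attention layer. The sketch below assumes uniform attention weights (a deliberate simplification; real attention weights are learned and content-dependent) and compares what each position can see with and without the mask.

```python
def attend(vectors, causal):
    """Toy attention with uniform weights: each position averages the positions it can see."""
    dim = len(vectors[0])
    out = []
    for i in range(len(vectors)):
        visible = vectors[: i + 1] if causal else vectors  # causal: positions 0..i only
        out.append([sum(v[j] for v in visible) / len(visible) for j in range(dim)])
    return out


a = [[1.0], [2.0], [3.0]]
b = [[1.0], [9.0], [9.0]]  # same first token, different continuation

# Causal: position 0 is identical for both inputs because it never sees later tokens.
print(attend(a, causal=True)[0], attend(b, causal=True)[0])
# Bidirectional: position 0 reflects the whole sequence, so the two inputs differ.
print(attend(a, causal=False)[0], attend(b, causal=False)[0])
```

Under the causal mask only the last position has seen every token, which is one reason decoder-based embedders often pool from the final token.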

Embedding models are trained explicitly so that cos_sim(embed(A), embed(B)) correlates with semantic similarity, measured by human labels or downstream retrieval metrics.

When to use which

Embedding model:
- Semantic search: find documents similar to a query (ANN over embedding vectors)
- RAG retrieval: embed query + documents, retrieve by cosine similarity
- Duplicate detection, clustering
- Classification with frozen embeddings
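As a rough sketch of the semantic search case, the snippet below ranks documents by cosine similarity with an exhaustive scan. The two-dimensional vectors and document ids are made up; in practice the vectors come from the embedding model and an ANN index replaces the linear scan once the corpus is large.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Exact nearest-neighbour search: score every document, return the k best (id, score)."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical precomputed document embeddings.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.1], docs, k=2))
```

Document vectors are computed once and stored; only the query is embedded at search time.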

Generative model:
- Open-ended text generation: summaries, emails, code
- Multi-step reasoning: chain-of-thought, tool use
- Tasks requiring output format control (JSON, structured responses)
- Question answering where the answer isn't in a retrieved passage

Cross-encoders vs bi-encoders

Bi-encoder (embedding model): encode query and document independently, compare with dot product. Scales to millions of documents — embed once, search with ANN.

Cross-encoder: concatenate query + document and run one forward pass to produce a relevance score. Usually more accurate, because attention spans both texts at once, but scoring n documents takes n forward passes per query and nothing can be precomputed, so it doesn't scale to retrieval over large corpora.

Production RAG: bi-encoder for retrieval (top-100), then a cross-encoder to rerank those candidates (top-100 → top-3 to top-5).
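The two-stage pattern can be sketched with pluggable scoring functions. `toy_bi_score` and `toy_cross_score` are hypothetical stand-ins based on word overlap, used only so the example runs; a real system would use embedding similarity via an ANN index for stage one and a trained cross-encoder for stage two.

```python
def retrieve_then_rerank(query, corpus, bi_score, cross_score, n_retrieve=100, n_final=3):
    """Stage 1: cheap scorer over the whole corpus. Stage 2: expensive scorer on survivors."""
    candidates = sorted(corpus, key=lambda d: bi_score(query, d), reverse=True)[:n_retrieve]
    reranked = sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)
    return reranked[:n_final]


def toy_bi_score(query, doc):
    """Hypothetical cheap scorer: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def toy_cross_score(query, doc):
    """Hypothetical expensive scorer: shared words plus a bonus for an exact phrase match."""
    bonus = 2 if query.lower() in doc.lower() else 0
    return toy_bi_score(query, doc) + bonus


corpus = [
    "reset your password here",
    "password reset instructions",
    "how to reset password quickly",
    "billing overview",
]
print(retrieve_then_rerank("reset password", corpus, toy_bi_score, toy_cross_score,
                           n_retrieve=3, n_final=1))
```

The expensive scorer runs only `n_retrieve` times per query regardless of corpus size, which is what keeps the cross-encoder affordable.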