You have a large model with great accuracy, but it's too slow and memory-intensive for production. How do you make it smaller and faster without losing much accuracy?
You mentioned structured pruning removes attention heads — how do you decide which heads to prune, and is there a principled way to measure head importance?
tldr
Distillation trains a smaller student model on the teacher's soft output distribution — typically the best accuracy-preserving compression. Quantization reduces numerical precision (float16 is nearly free; int8 needs calibration data). Structured pruning removes whole components (attention heads, layers) so the savings map to real speedup on standard hardware. Combine all three for maximum compression. Measure head importance via gradient magnitude or a first-order Taylor expansion of the loss on your specific task — importance is task-dependent, so prune iteratively with fine-tuning between steps.
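
A minimal sketch of the distillation objective, assuming a PyTorch classification setup — the `temperature` and `alpha` defaults below are illustrative, not tuned values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.9):
    """Blend a soft-target KL loss against the teacher with hard-label CE.

    `temperature` and `alpha` are illustrative defaults, not tuned values.
    """
    # Soften both distributions; the KL term is scaled by T^2 so gradient
    # magnitudes stay comparable as the temperature changes.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

Higher temperatures expose more of the teacher's "dark knowledge" (relative probabilities among wrong classes), which is most of what the student gains over training on hard labels alone.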
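
And a sketch of the Taylor-expansion head-importance estimate (cf. Michel et al., 2019). The `head_gates` keyword argument is a hypothetical hook you would wire into your own attention implementation so each head's output is multiplied by its gate:

```python
import torch

def head_importance(model, data_loader, loss_fn,
                    num_layers, num_heads, device="cpu"):
    """First-order Taylor estimate of per-head importance.

    Assumes `model(inputs, head_gates=gates)` multiplies each head's
    output by the matching gate -- a hypothetical hook, not a standard API.
    """
    gates = torch.ones(num_layers, num_heads,
                       device=device, requires_grad=True)
    importance = torch.zeros(num_layers, num_heads, device=device)
    for inputs, labels in data_loader:
        loss = loss_fn(model(inputs.to(device), head_gates=gates),
                       labels.to(device))
        (grad,) = torch.autograd.grad(loss, gates)
        # At gates == 1, |dL/d gate| approximates the loss change from
        # zeroing that head (first-order Taylor expansion in the gate).
        importance += grad.abs()
    return importance / len(data_loader)
```

Prune the lowest-importance heads a few at a time, fine-tune, then re-estimate — importance shifts as heads are removed, which is why one-shot pruning tends to underperform.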
follow-up
- What is the lottery ticket hypothesis and what does it imply about how neural networks store information?
- How would you compress a model for mobile deployment where you have no GPU?
- When does distillation fail — what kinds of tasks or architectures make the teacher-student transfer difficult?