mlprep / ML Breadth · medium · 10 min

Explain transfer learning. When do you freeze pretrained layers vs fine-tune them? What are common failure modes?

formulate your answer, then compare with the tldr below.

tldr

Feature extraction (freeze all pretrained layers, train only a new head) when the target dataset is small or its domain closely matches pretraining. Fine-tune when data is sufficient and the domains differ. Use a 10-100× smaller learning rate for fine-tuning than for pretraining, and add warmup. Catastrophic forgetting is a real risk; a small LR and parameter-efficient methods like LoRA mitigate it. LoRA (low-rank adaptation) is the practical standard for LLM fine-tuning: it updates under 1% of parameters at near full fine-tune quality. Discriminative LRs (lower for early layers, higher for later ones) improve fine-tuning in practice.
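The freeze-vs-fine-tune choice and discriminative LRs can be sketched as a per-layer learning-rate schedule. This is framework-agnostic pseudocode made runnable; the layer names, decay factor, and exponential schedule are illustrative assumptions, not a prescribed recipe:

```python
BASE_LR = 3e-5      # typical fine-tuning LR, ~10-100x below pretraining LRs
DECAY = 0.9         # earlier layers get geometrically smaller LRs

# Hypothetical layer names, ordered from input to output.
layers = ["embed", "block_1", "block_2", "block_3", "head"]

def layer_lrs(layers, base_lr, decay, freeze_below=0):
    """Return {layer_name: lr}. Layers with depth < freeze_below are
    frozen (lr = 0); the rest decay exponentially toward the input."""
    n = len(layers)
    lrs = {}
    for depth, name in enumerate(layers):
        if depth < freeze_below:
            lrs[name] = 0.0                           # feature extraction: frozen
        else:
            lrs[name] = base_lr * decay ** (n - 1 - depth)
    return lrs

# Full fine-tuning with discriminative LRs: head gets BASE_LR, embed the least.
print(layer_lrs(layers, BASE_LR, DECAY))

# Feature extraction: freeze everything except the head.
print(layer_lrs(layers, BASE_LR, DECAY, freeze_below=4))
```

In a real framework you would map this dict onto optimizer parameter groups (e.g. one group per layer, each with its own `lr`); freezing is equivalently done by excluding those parameters from the optimizer entirely.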
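A minimal numpy sketch of the LoRA update, assuming a single square weight matrix with dimensions chosen purely for illustration: the pretrained weight W stays frozen, and only a low-rank correction (alpha/r)·BA is trained. With B zero-initialized, the adapter starts as a no-op, and the trainable fraction is well under 1%:

```python
import numpy as np

d, r, alpha = 2048, 8, 16            # hidden size, adapter rank (r << d), scaling
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init
                                         # so training starts from the base model

def lora_forward(x, W, A, B, alpha, r):
    """y = x W^T + (alpha/r) * x A^T B^T; only A and B receive gradients."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

trainable = A.size + B.size              # 2*r*d parameters
print(f"trainable fraction: {trainable / W.size:.2%}")

# After training, the adapter can be merged for zero-overhead inference.
W_merged = W + (alpha / r) * B @ A
```

Larger r buys expressiveness at the cost of 2·r·d extra trainable parameters per adapted matrix, which is the trade-off the first follow-up question probes.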

follow-up

  • How does LoRA's rank r affect the trade-off between expressiveness and compute/memory?
  • What is the difference between task-specific fine-tuning and instruction tuning for LLMs?
  • You fine-tuned a BERT model and it performs well in offline evaluation but degrades after deployment. What could cause this?