Walk me through the main gradient descent variants — batch GD, SGD, mini-batch SGD, and Adam. What are the tradeoffs, when would you choose each, and how do learning rate schedules fit in?
Formulate your own answer first, then compare it with the summary below.
tldr
Full-batch GD computes exact gradients but is too slow and memory-hungry at scale; pure SGD (batch size 1) is cheap per step but noisy; mini-batch SGD balances gradient variance against hardware parallelism and is the practical foundation. Adam is the de facto default for deep learning because it adapts per-parameter learning rates automatically using running estimates of the gradient's first and second moments. Pair any optimizer with a learning rate schedule (especially warmup for transformers). When in doubt, start with AdamW + cosine decay with warmup.
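To make the "per-parameter adaptive learning rate" and "cosine decay with warmup" ideas concrete, here is a minimal dependency-free sketch of a single Adam update (for one scalar parameter) and a warmup + cosine schedule. This is illustrative pseudocode-made-runnable, not a production optimizer; the function names and default hyperparameters (`b1=0.9`, `b2=0.999`, `eps=1e-8`) follow common conventions but are assumptions, not a specific library's API.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta at step t (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad        # running mean of gradients (1st moment)
    v = b2 * v + (1 - b2) * grad ** 2   # running mean of squared gradients (2nd moment)
    m_hat = m / (1 - b1 ** t)           # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive step size
    return theta, m, v

def cosine_warmup_lr(step, total_steps, base_lr=1e-3, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Note how the effective step size `lr * m_hat / sqrt(v_hat)` shrinks for parameters with consistently large gradients and grows for rarely-updated ones; that is the "adapts per-parameter learning rates" property. AdamW differs only in applying weight decay directly to `theta` rather than folding it into `grad`.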
follow-up
- Why does the choice of batch size affect generalization, not just training speed?
- What is the "exploding/vanishing gradient" problem and how do modern architectures address it?
- How would you debug training instability — loss spikes, NaN gradients — in a production training job?