Given the high up-front cost of training a language model, any non-trivial improvement to the optimization process could substantially reduce the time and money needed to complete training. Adam and its variants have long been the state of the art, while second-order (Hessian-based) optimizers have rarely been used because of their greater per-step overhead.
Researchers have proposed Sophia (Second-order Clipped Stochastic Optimization), a new second-order optimizer that uses a lightweight estimate of the diagonal Hessian as its pre-conditioner and can train LLMs twice as fast as Adam. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping bounds the worst-case update size and lessens the impact of non-convexity and rapid Hessian changes along the trajectory. With a few extra lines of code, a $2M training budget could be reduced to the $1M range (provided scaling laws hold).
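The update described above can be written out as a short set of equations. This is a sketch under assumed notation, not necessarily the paper's exact parameterization: the symbols $m_t$ (gradient moving average), $h_t$ (moving average of the diagonal Hessian estimate $\hat{h}_t$), the decay rates $\beta_1, \beta_2$, the learning rate $\eta$, the clipping threshold $\rho$, and the small constant $\epsilon$ are all names chosen here for illustration.

```latex
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t \\
h_t &= \beta_2\, h_{t-1} + (1 - \beta_2)\, \hat{h}_t
      \quad \text{(refreshed only on steps where a new estimate } \hat{h}_t \text{ is computed)} \\
\theta_{t+1} &= \theta_t - \eta \cdot
      \operatorname{clip}\!\left( \frac{m_t}{\max(h_t,\, \epsilon)},\; \rho \right),
\qquad \operatorname{clip}(z, \rho) = \max\bigl(\min(z, \rho),\, -\rho\bigr)
\end{aligned}
```

All operations are element-wise. Because of the clip, no coordinate moves by more than $\eta\rho$ in a single step, which is what bounds the worst-case update when the Hessian estimate is very small or stale.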
The average per-step time and memory overhead is low because Sophia estimates the diagonal Hessian only every few iterations. On language modeling with GPT-2 models ranging from 125 million to 770 million parameters, Sophia achieves a 2x speed-up over Adam in number of steps, total compute, and wall-clock time. The researchers also show theoretically that Sophia adapts to the differing curvature across parameter dimensions that arises in language modeling tasks, and that its run-time bound is independent of the loss's condition number.
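To illustrate why estimating the Hessian only every few iterations keeps the overhead low, here is a minimal pure-Python sketch. It assumes a Hutchinson-style estimator (a random sign vector `u` multiplied element-wise with the Hessian-vector product `H u`), which is one possible lightweight diagonal estimator and is used here as an assumption. The function names, the interval `k`, and the decay rate `beta2` are hypothetical choices for this example, not taken from the released code.

```python
import random


def make_hvp(matrix):
    """Return a Hessian-vector product function for a fixed (toy) Hessian."""
    def hvp(v):
        return [sum(a * x for a, x in zip(row, v)) for row in matrix]
    return hvp


def hutchinson_diag(hvp, dim, rng):
    """One-sample estimate of the Hessian diagonal: u * (H u), u in {-1, +1}^dim."""
    u = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    hu = hvp(u)
    return [ui * hui for ui, hui in zip(u, hu)]


def track_hessian(hvp, dim, steps, k=10, beta2=0.99, seed=0):
    """Keep a moving average of the diagonal estimate, refreshed every k steps."""
    rng = random.Random(seed)
    h = [0.0] * dim
    num_estimates = 0
    for t in range(steps):
        if t % k == 0:
            # The expensive part: runs on only 1 out of every k steps.
            h_hat = hutchinson_diag(hvp, dim, rng)
            h = [beta2 * hi + (1.0 - beta2) * hh for hi, hh in zip(h, h_hat)]
            num_estimates += 1
        # A parameter update using the current (possibly stale) h would go here.
    return h, num_estimates


if __name__ == "__main__":
    toy_hessian = [[4.0, 1.0], [1.0, 0.5]]  # hypothetical 2x2 Hessian
    h, n = track_hessian(make_hvp(toy_hessian), dim=2, steps=100, k=10)
    print("Hessian estimates computed:", n)
    print("moving-average diagonal:", h)
```

With `k=10`, the extra Hessian-vector product is paid on one step in ten, so its cost is amortized across the other nine. The moving average starts at zero in this sketch, so early values of `h` are biased toward zero; the element-wise clip in the update is what keeps steps bounded in that regime.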
- Sophia is straightforward to implement in PyTorch: it needs only a lightweight estimate of the diagonal Hessian as a pre-conditioner on the gradient (see pseudo-code in the first picture), followed by element-wise clipping.
- Sophia also helps with pre-training stability. Gradient clipping is triggered much less often than with Adam and Lion, and the re-parameterization trick, in which the attention temperature varies with the layer index, is unnecessary.
- Sophia ensures a uniform loss decrease across all parameter dimensions by penalizing updates more heavily in sharp dimensions (with large Hessian) than in flat dimensions (with small Hessian). In a two-dimensional example, Adam converges more slowly.
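The effect of penalizing sharp dimensions more heavily can be seen on a small two-dimensional quadratic. The sketch below is a simplification, not the authors' experiment: it assumes the exact Hessian diagonal is known, omits momentum, and compares against plain gradient descent rather than Adam. The curvatures `A` and `B`, the learning rates, and the function names are hypothetical values chosen for illustration.

```python
A, B = 100.0, 1.0  # curvature of the sharp (x) and flat (y) dimensions


def grad(x, y):
    """Gradient of the toy loss L(x, y) = 0.5 * (A * x**2 + B * y**2)."""
    return A * x, B * y


def clip(z, rho):
    """Clamp z to the interval [-rho, rho]."""
    return max(-rho, min(rho, z))


def preconditioned_clipped(steps, lr=0.5, rho=1.0, eps=1e-12):
    """Divide each gradient coordinate by its curvature, clip, then step."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= lr * clip(gx / max(A, eps), rho)
        y -= lr * clip(gy / max(B, eps), rho)
    return x, y


def gradient_descent(steps, lr=1.0 / A):
    """Plain gradient descent; the step size is limited by the sharp dimension."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= lr * gx
        y -= lr * gy
    return x, y


if __name__ == "__main__":
    print("pre-conditioned + clipped:", preconditioned_clipped(10))
    print("plain gradient descent:   ", gradient_descent(10))
```

In this toy setting the pre-conditioned update shrinks both coordinates by the same factor each step, so progress is uniform across the sharp and flat dimensions. Plain gradient descent must keep its step size small enough for the sharp dimension and therefore makes only slow progress along the flat one.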
Important aspects of this undertaking
- This shows that even with limited resources, academics can study LLM pre-training and develop novel, effective algorithms.
- In addition to drawing on material from earlier optimization courses, the researchers relied heavily on theoretical reasoning throughout the research process.
In the code scheduled for release tomorrow, the researchers use a slightly different definition of the learning rate (LR) from the one in the paper. The paper's definition is tidier in writing but may be less convenient in code.