β°οΈ The Landscape (Loss Function)
Think of training an AI as a hiker trying to find the lowest point in a mountain range in the dark. "Height" represents Error (Loss): the lower the hiker descends, the better the AI performs.
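To make the picture concrete, here is a minimal sketch of plain gradient descent on a toy one-dimensional loss; the function (w - 3)**2 and every value in it are illustrative, not from any real model:

```python
# Plain gradient descent on a toy 1D loss "landscape".
# loss(w) = (w - 3)**2 is a single valley with its floor at w = 3.

def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)   # the local slope under the hiker

w = 0.0                  # starting position on the mountainside
learning_rate = 0.1      # step size (next section)

for _ in range(50):
    w -= learning_rate * grad(w)   # step downhill, against the slope

print(f"w = {w:.4f}, loss = {loss(w):.6f}")   # w ends up very close to 3
```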
π’ Learning Rate (Step Size)
How big a step the hiker takes downhill.
• Too Small: progress is safe but painfully slow; it takes forever to reach the bottom.
• Too Big: each step overshoots the valley floor and lands on the opposite slope, higher than before, so the error grows instead of shrinking (divergence). Both failure modes show up in the sketch below.
ποΈ Momentum (Velocity)
Standard Gradient Descent stops the moment the slope goes flat. Momentum swaps the careful hiker for a heavy ball: if the ball is rolling fast, it can power through small bumps and shallow dips (local minima) and settle into deeper, better solutions.
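A minimal sketch of the standard heavy-ball momentum update on the same toy loss; the names lr and beta and their values are illustrative:

```python
# Heavy-ball momentum: velocity carries over between steps, so the
# ball keeps moving where the slope is small and, on a bumpier
# landscape than this toy one, can coast through shallow dips.

def grad(w):
    return 2 * (w - 3)

w, velocity = 0.0, 0.0
lr, beta = 0.1, 0.9      # beta: fraction of old velocity kept each step

for _ in range(200):
    velocity = beta * velocity - lr * grad(w)   # accumulate speed downhill
    w += velocity                               # move by the velocity

print(f"w = {w:.4f}")    # overshoots 3, oscillates, then damps onto it
```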
π² SGD Noise (Stochastic Gradient Descent)
In real training, we don't compute the slope from the entire dataset at once (too slow). Instead we use small random batches, and each batch gives a slightly different slope estimate. This "noise" or jitter shakes the ball, helping it vibrate out of bad spots where it would otherwise get stuck.
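A small sketch of mini-batch SGD, assuming a synthetic dataset where the best w is simply the data's mean; the batch size and learning rate are arbitrary:

```python
import random

# Mini-batch SGD on synthetic data: the w minimizing the average of
# (w - x)**2 is the dataset's mean, so we know the right answer.
# Every step sees only a small random batch, so the slope is noisy.

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(1000)]   # mean ≈ 5

w, lr, batch_size = 0.0, 0.05, 16

for _ in range(500):
    batch = random.sample(data, batch_size)            # random mini-batch
    grad = sum(2 * (w - x) for x in batch) / batch_size
    w -= lr * grad                                     # noisy downhill step

print(f"w = {w:.3f}  (true mean = {sum(data) / len(data):.3f})")
# w jitters around the mean instead of settling exactly; that same
# jitter is what shakes the ball out of bad spots.
```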