GradientDescent.md (2090B)
1 # Gradient Descent 2 3 ML L2 4 5 **Definition:** Gradient Descent is an algorithm used to find a 'near' optimal approach to the given problem. This is used with [LinearRegression](LinearRegression.md) to optimize the function by selecting a set of parameters $\theta$ and then repeatedly finding the direction that results in the fastest movement towards a cost function's value nearest to 0. This will find a local optimum. With linear regression however there will not be local optimum but only global. 6 7 General idea is to start with some $\theta$ (parameters) and keep changing it to reduce J($\theta$). (Find J in [LinearRegression](LinearRegression.md)) 8 9 More specifically, you pick a starting point, see what direction you should go to get closer to 0 the fastest. You then repeat this algorithm. It's not perfect, but it's fast. 10 11 This is a common algorithm used for [LinearRegression](LinearRegression.md) when there are lots of features or lots of samples (too big for memory) which would cause the formula for linear regression to be too slow. 12 13 For a simple implementation of gradient descent using a [LearningRate](LearningRate.md) for third degree polynomials see [[GradientDescentCode.md]]. 14 15 When using gradient descent for linear regression one must calculate the partial derivative for each variable and then determine if it is positive or negative and move in the correct direction. 16 17 Another thing, batch gradient descent is calculating the descents based on all of the samples given. An alternative to this is stochastic gradient descent which is much lighter and faster because it only tries to get closer at each step to a random point in the dataset. This allows for out-of-core learning where the entire dataset does not to be loaded in memory at any given time. 18 19 Another type of gradient descent is mini-batch gradient descent which stands by batch and stochastic gradient descent. This form of GD uses smalll batches of random sets and then performs descent upon them. This is basically stochastic, but with a few more samples each time instead of a single random sample.