Added all these notes about optimizers - notes - Unnamed repository; edit this file 'description' to name the repository.

commit 2d750a4eb7534b652291f5cc68ec47287381161a
parent babcced96f5ee2a0e3ad748dbaf9e05267660f6e
Author: Andrew <andrewlaack1@gmail.com>
Date:   Fri, 21 Jun 2024 09:49:43 -0500

Added all these notes about optimizers

Diffstat:
A AdaGrad.md  | 10 ++++++++++
A Adam.md  | 12 ++++++++++++
A Autoencoder.md  | 14 ++++++++++++++
M GradientClipping.md  | 4 ++++
M MachineLearning.md  | 8 ++++++++
A Momentum.md  | 14 ++++++++++++++
A NAG.md  | 8 ++++++++
A Optimizer.md  | 15 +++++++++++++++
A PretrainedModels.md  | 14 ++++++++++++++
M UnsupervisedLearning.md  | 2 ++
A UnsupervisedPretraining.md  | 12 ++++++++++++

11 files changed, 113 insertions(+), 0 deletions(-)
diff --git a/AdaGrad.md b/AdaGrad.md
@@ -0,0 +1,10 @@
+:ml:
+# AdaGrad
+
+ML P584
+
+## Notes
+
+**Definition:** Adaptively adjusts learning rate based on historical gradients.
+
+I don't understand this very well.
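A minimal sketch of the AdaGrad idea, assuming the usual formulation in which each parameter keeps a running sum of its squared gradients and its step is divided by the square root of that sum. The helper name `adagrad_step`, the toy loss, and the hyperparameter values are illustrative assumptions, not taken from the notes.

```python
import math

def adagrad_step(theta, grad, accum, lr=0.5, eps=1e-8):
    """One AdaGrad update on a list of parameters (hypothetical helper)."""
    new_theta, new_accum = [], []
    for t, g, s in zip(theta, grad, accum):
        s = s + g * g  # accumulate this parameter's squared gradients
        new_accum.append(s)
        # Parameters with a large gradient history take smaller steps.
        new_theta.append(t - lr * g / (math.sqrt(s) + eps))
    return new_theta, new_accum

def grad_f(theta):
    # Gradient of the toy loss f(x, y) = x**2 + 10 * y**2 (assumed example)
    x, y = theta
    return [2 * x, 20 * y]

theta, accum = [5.0, 5.0], [0.0, 0.0]
for _ in range(200):
    theta, accum = adagrad_step(theta, grad_f(theta), accum)
```

Although the `y` direction is ten times steeper than `x`, the per-parameter scaling makes both coordinates follow almost the same trajectory here, which is the "adaptive learning rate" part of the definition.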
diff --git a/Adam.md b/Adam.md
@@ -0,0 +1,12 @@
+:ml:
+# Adam (Adaptive moment estimation)
+
+ML P587
+
+## Notes
+
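+**Definition:** Adam combines the ideas of momentum and RMSProp: it tracks an exponentially decaying average of past gradients (as in momentum) and of past squared gradients (as in RMSProp), and uses both to compute each update.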
+
+This is often a good default choice.
+
+There are also variants of Adam, such as AdaMax (generally worse), Nadam (applies the [[NAG.md]] idea of evaluating the gradient ahead in the direction of momentum, and generally outperforms Adam), and AdamW (Adam regularized with weight decay).
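A rough sketch of a single-parameter Adam update, assuming the standard formulation with bias correction. The helper name `adam_step`, the toy loss `(theta - 3)**2`, and the hyperparameter values are assumptions chosen for illustration.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (hypothetical helper)."""
    m = beta1 * m + (1 - beta1) * grad         # momentum-style average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # RMSProp-style average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction: m and v start at zero
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * (theta - 3.0)  # gradient of the assumed toy loss (theta - 3)**2
    theta, m, v = adam_step(theta, grad, m, v, t)
```

The `m` line is the momentum half and the `v` line is the RMSProp half; dividing one by the square root of the other is what combines them.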
diff --git a/Autoencoder.md b/Autoencoder.md
@@ -0,0 +1,14 @@
+:ml:
+# Autoencoder 
+
+ML General
+
+## Notes
+
+**Definition:** An autoencoder is an unsupervised neural network that takes inputs, compresses them into a smaller representation while trying to maintain as much information as possible, and then reconstructs the compressed representation into a new full representation.
+
+The idea of an autoencoder is for the model to learn the best way to extract features from a large input (many features), so that the compressed output can be passed to another model that needs fewer features and is therefore faster to train and use.
+
+Autoencoders are made of two parts: an encoder and a decoder. The encoder takes an input with all of the features and outputs a compressed representation with fewer features. The decoder then takes the compressed representation as its input and tries to recreate the encoder's original input. The reconstruction error (the difference between the decoder's output and the actual input) is what we are trying to minimize.
+
+Autoencoders are often used for unsupervised pretraining: we train the autoencoder and then reuse its lower layers as the lower layers of a neural network. This uses the encoder's compression as the input for the rest of the network.
diff --git a/GradientClipping.md b/GradientClipping.md
@@ -8,3 +8,7 @@ ML P569
 **Definition:** Gradient clipping is the process of clipping gradients during backpropagation so they never exceed some threshold.
 
 This is another technique used to resolve issues relating to [[ExplodingGradients.md]] particularly for RNNs where batch normalization does not work.
+
+There are two ways to do gradient clipping: with a per-component threshold cutoff, or with vector scaling. With vector scaling we retain the direction of the vector: if its norm exceeds the threshold (say 1), the whole vector is scaled down proportionally so that its norm equals the threshold. More commonly, we simply truncate each component, so [100, .1] with a threshold range of (-1, 1) is clipped to [1, .1], which changes the vector's direction.
+
+Scaling the entire vector is called clipping by norm (`clipnorm` in Keras); truncating each component is called clipping by value (`clipvalue`).
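The two clipping styles described in GradientClipping.md can be sketched in plain Python. The function names `clip_by_value` and `clip_by_norm` are hypothetical helpers written for this illustration, and the scaling variant assumes the L2 norm is what gets compared against the threshold.

```python
import math

def clip_by_value(grad, threshold=1.0):
    """Truncate each component to [-threshold, threshold]; direction may change."""
    return [max(-threshold, min(threshold, g)) for g in grad]

def clip_by_norm(grad, threshold=1.0):
    """Scale the whole vector so its L2 norm is at most threshold; direction is kept."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]

grad = [100.0, 0.1]
by_value = clip_by_value(grad)  # components are clipped independently
by_norm = clip_by_norm(grad)    # components keep their original ratio
```

With the `[100, .1]` example, clipping by value leaves two components of similar magnitude, whereas clipping by norm keeps the first component about 1000 times larger than the second.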
diff --git a/MachineLearning.md b/MachineLearning.md
@@ -142,3 +142,11 @@ Concepts:
 [[LeakyReLU.md]]
 [[GradientClipping.md]]
 [[BatchNormalization.md]]
+[[PretrainedModels.md]]
+[[UnsupervisedPretraining.md]]
+[[Autoencoder.md]]
+[[Optimizer.md]]
+[[Momentum.md]]
+[[NAG.md]]
+[[AdaGrad.md]]
+[[Adam.md]]
diff --git a/Momentum.md b/Momentum.md
@@ -0,0 +1,14 @@
+:ml:
+# Momentum
+
+ML P580
+
+## Notes
+
+**Definition:** Momentum optimization is an optimization algorithm that uses the idea of momentum to reach an optimum faster.
+
+While the gradient keeps pointing in the same direction, the optimizer moves faster and faster; once the gradient changes sign, the steps begin to slow down and eventually change direction.
+
+The gradient is used as an acceleration factor and not as the speed.
+
+This is usually faster than plain gradient descent, and it includes a friction factor (the momentum hyperparameter) to limit the speed and reduce overshooting the target.
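A small sketch of the "gradient as acceleration" idea, assuming the common update `velocity = beta * velocity - lr * grad` followed by `theta = theta + velocity`. The helper name `momentum_step`, the constant gradient, and the values of `lr` and `beta` are assumptions for illustration only.

```python
def momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    """One momentum update for a scalar parameter (hypothetical helper)."""
    # The gradient changes the velocity; beta acts as friction on the old velocity.
    velocity = beta * velocity - lr * grad
    theta = theta + velocity
    return theta, velocity

theta, velocity = 0.0, 0.0
speeds = []
for _ in range(200):
    # Assumed constant gradient, as on a long uniform slope.
    theta, velocity = momentum_step(theta, velocity, grad=1.0)
    speeds.append(abs(velocity))
```

With a constant gradient the step size grows every iteration and levels off at `lr * grad / (1 - beta)`, ten times the plain gradient descent step for these values; without friction (`beta = 1`) it would keep growing.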
diff --git a/NAG.md b/NAG.md
@@ -0,0 +1,8 @@
+:ml:
+# NAG (Nesterov Accelerated Gradient (optimization)) 
+
+ML P582
+
+## Notes
+
+**Definition:** NAG is an improvement on the momentum optimization algorithm: instead of computing the gradient at the current position and adding it to the velocity, we compute the gradient slightly ahead (in the direction of momentum) and add that to the velocity.
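A hedged sketch of the look-ahead step, assuming the formulation in which the gradient is evaluated at `theta + beta * velocity`. The names `nag_step` and `grad_fn`, the toy loss `x**2`, and the hyperparameter values are illustrative assumptions.

```python
def nag_step(theta, velocity, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov accelerated gradient update for a scalar (hypothetical helper)."""
    lookahead = theta + beta * velocity  # where momentum alone is about to take us
    velocity = beta * velocity - lr * grad_fn(lookahead)
    return theta + velocity, velocity

def grad_fn(x):
    # Gradient of the assumed toy loss f(x) = x**2
    return 2 * x

theta, velocity = 5.0, 0.0
for _ in range(100):
    theta, velocity = nag_step(theta, velocity, grad_fn)
```

The only difference from plain momentum is the argument passed to `grad_fn`: the look-ahead point rather than `theta` itself.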
diff --git a/Optimizer.md b/Optimizer.md
@@ -0,0 +1,15 @@
+:ml:
+# Optimizer
+
+ML P580
+
+## Notes
+
+**Definition:** An optimizer is an algorithm to adjust the weights and biases of neural networks.
+
+Here is a list of common optimizers:
+
+[[Momentum.md]] - Gradient is acceleration
+[[NAG.md]] - Calculates the gradient slightly ahead of the current position
+[[AdaGrad.md]] - Good for simple quadratic problems
+[[Adam.md]] - Generally the best
diff --git a/PretrainedModels.md b/PretrainedModels.md
@@ -0,0 +1,14 @@
+:ml:
+# Pretrained Models
+
+ML P570
+
+## Notes
+
+**Definition:** Pretrained models are ML models that have already been trained on one task and can be reused for other, related tasks.
+
+Pretrained models are often used with [[TransferLearning.md]], because the goal is to make an existing, already trained model work well on a new set of data. This often involves replacing the model's top layers (training new ones for the specific task) while keeping the lower layers intact, as they often perform simple, reusable tasks like edge detection.
+
+When doing this, the weights of the layers that don't change are called the fixed (frozen) weights, while the ones that are updated are called the trainable weights.
+
+An advantage of pretrained models is that they generally require less training data to reach a given level of prediction accuracy.
diff --git a/UnsupervisedLearning.md b/UnsupervisedLearning.md
@@ -10,3 +10,5 @@ ML L1
 [[ClusteringAlgorithms.md]] are often created using unsupervised learning.
 
 Another example of unsupervised learning is the cocktail party problem where you have multiple microphones in a room that is noisy, how do you separate out individual voices?
+
+See [[UnsupervisedPretraining.md]] for information about unsupervised training followed by supervised training.
diff --git a/UnsupervisedPretraining.md b/UnsupervisedPretraining.md
@@ -0,0 +1,12 @@
+:ml:
+# Unsupervised Pretraining
+
+ML P576
+
+## Notes
+
+**Definition:** Unsupervised pretraining is the process of pretraining a model on unlabeled data and then adding layers on top of the model and training them on labeled data to get predictions.
+
+This is often used because unlabeled data is usually abundant, while labeled data is expensive.
+
+We can do this with GANs as well as [[Autoencoder.md]]. With autoencoders we train the autoencoder to compress the data and then reuse the lower layers of this autoencoder as the lower layers for a neural network. This is useful because autoencoders are good at finding representations of the data without the need for labeled data.
