notes


commit babcced96f5ee2a0e3ad748dbaf9e05267660f6e
parent 5afcd29c8707a7b9549827c1ae109e38d627b67f
Author: Andrew <andrewlaack1@gmail.com>
Date:   Tue, 18 Jun 2024 17:57:36 -0500

Added today's notes

Diffstat:
A BatchNormalization.md | 10 ++++++++++
A ExplodingGradients.md | 18 ++++++++++++++++++
A GradientClipping.md | 10 ++++++++++
A LeakyReLU.md | 23 +++++++++++++++++++++++
M MLP.md | 3 ++-
M MachineLearning.md | 6 ++++++
M NeuralNetworks.md | 18 ++++++++++++++++++
A UnstableGradients.md | 18 ++++++++++++++++++
A VanishingGradients.md | 16 ++++++++++++++++
9 files changed, 121 insertions(+), 1 deletion(-)

diff --git a/BatchNormalization.md b/BatchNormalization.md
@@ -0,0 +1,10 @@
+:ml:
+# Batch Normalization
+
+ML P569
+
+## Notes
+
+**Definition:** Batch normalization is the process of adding layers to a neural network that perform normalization upon inputs and output the normalized values.
+
+This helps with unstable gradient issues and removes the need to normalize inputs for the network. On the flip side, these computations are bad for TPUs and are generally slow. They also don't work with RNNs.
diff --git a/ExplodingGradients.md b/ExplodingGradients.md
@@ -0,0 +1,18 @@
+:ml:
+# Exploding Gradients
+
+ML 550
+
+## Notes
+
+**Definition:** Exploding gradients is a problem with training neural networks where lower levels have very high gradients and thus the gradient steps diverge from a proper solution.
+
+This is the opposite of [[VanishingGradients.md]]
+
+This often occurs for recurrent neural networks.
+
+### Solutions
+
+Use ReLU and better weight initialization (not gaussian distribution with std deviation of 1).
+
+See [[UnstableGradients.md]] for more.
diff --git a/GradientClipping.md b/GradientClipping.md
@@ -0,0 +1,10 @@
+:ml:
+# Gradient Clipping
+
+ML P569
+
+## Notes
+
+**Definition:** Gradient clipping is the process of clipping gradients during backpropagation so they never exceed some threshold.
+
+This is another technique used to resolve issues relating to [[ExplodingGradients.md]], particularly for RNNs where batch normalization does not work.
diff --git a/LeakyReLU.md b/LeakyReLU.md
@@ -0,0 +1,23 @@
+:ml:
+# Leaky ReLU
+
+ML P554
+
+## Notes
+
+**Definition:** Leaky ReLU is a variant of ReLU designed to solve the problem of neurons dying due to the use of ReLU.
+
+Leaky ReLU adds a small (or larger) slope to the function representing values less than 0 for the activation function. This ensures neurons don't die, but they can enter long coma phases.
+
+ReLU sometimes kills neurons because all inputs for all training samples result in a negative input to the activation function, thus causing it to always output 0.
+
+This can be specified in Keras as follows:
+
+```python3
+
+leaky_relu = tf.keras.layers.LeakyReLU(alpha=0.2) # defaults to alpha=0.3
+dense = tf.keras.layers.Dense(50, activation=leaky_relu, kernel_initializer="he_normal")
+
+```
+
+Basically, initialize leaky relu with the hyperparameter of slope and then set it as the layer's activation function. Interestingly, when not specified 'Dense' uses a linear activation function which outputs the inputs * weights + bias.
diff --git a/MLP.md b/MLP.md
@@ -4,10 +4,11 @@ ML D6
 
 ## Notes
 
-**Definition:** Multilayer perceptrons are a form of deep neural network that are a feedforward process where each output goes forward to the next layer of perceptrons until reaching the output layer.
+**Definition:** Multilayer perceptrons are a form of deep neural network that are a feedforward process where each output goes forward to the next layer of perceptrons until reaching the output layer. This is a subset of neural networks, as not all NNs (e.g. RNNs/CNNs) are fully connected.
 
 MLPs can do regression and classification tasks. For regression we need one output for each output feature we would like to predict. With these outputs we can also apply an activation function (default is none), to bound the output range.
 
 For classification tasks you need to dedicate one output neuron for each class. These classes then use a sigmoid activation function that determines the probability of class membership. To get an output with a sum of 1 (wanted in the case of multiclass classification where only one output is expected) we can use a softmax function for each output.
 
 For classification tasks with neural networks we generally want to minimize cross entropy rather than MSE which is the normal metric for regression.
Cross entropy is the difference between the predicted distribution and the true distribution. This is also used for logistic regression.
+
diff --git a/MachineLearning.md b/MachineLearning.md
@@ -136,3 +136,9 @@ Concepts:
 [[MLP.md]]
 [[WideAndDeepNN.md]]
 [[CategoricalCrossEntropy.md]]
+[[VanishingGradients.md]]
+[[ExplodingGradients.md]]
+[[UnstableGradients.md]]
+[[LeakyReLU.md]]
+[[GradientClipping.md]]
+[[BatchNormalization.md]]
diff --git a/NeuralNetworks.md b/NeuralNetworks.md
@@ -8,3 +8,21 @@ ML D5
 **Definition:** Artificial neural networks are machine learning models that mimic biological neurons to complete some task.
 
 ReLU activations can be used on output layers to force the output to be positive. Additionally, we can use softplus which is relu but smooth to set output values because by default there is not an activation function for the output layer.
+
+### Hidden Layer Count Selection
+
+Deeper neural networks have better parameter efficiency. This means you need fewer neurons to model complex functions when compared with shallower NNs.
+
+### Neuron Count Per Layer
+
+It is common for all layers to be the same size in most cases. There are however times when we make them a pyramid shape, descending, because each layer picks out different information that coalesces into higher level information. Another common approach is to make the first hidden layer large and then all subsequent ones the same size (smaller).
+
+In most cases, having all layers the same size is as accurate as a pyramid structure and reduces the number of hyperparameters to tune, which is a good thing.
+
+Basically, normally they should all be the same size. Sometimes first hidden is bigger and the rest are same size smaller. Sometimes make a pyramid, but this increases the number of hyperparams.
+
+### Count Info (Combined # of layers and neurons per layer)
+
+Sometimes we use a stretch pants method to prevent overfitting.
We do this by selecting a bigger model than needed and then using early stopping to prevent overfitting.
+
+Generally, increasing the number of layers is better than increasing the number of neurons per layer.
diff --git a/UnstableGradients.md b/UnstableGradients.md
@@ -0,0 +1,18 @@
+:ml:
+# Unstable Gradients
+
+ML 550
+
+## Notes
+
+**Definition:** Unstable gradients are the idea that different layers of a neural network can learn at widely different rates.
+
+This often manifests as [[ExplodingGradients.md]] or [[VanishingGradients.md]]
+
+This was a reason that deep neural networks were mostly abandoned in the early 2000s until there were revisions to model architecture. It was found that the initialization scheme of a normal weight distribution about 0 with a std deviation of 1 and the use of sigmoid activation functions caused this issue. Mainly the sigmoid function, as it backpropagates gradients that are generally very small.
+
+To resolve this issue we need to ensure the variance of each layer's inputs and outputs is roughly equal. This can be done through a different initialization strategy called He initialization, which is used with ReLU.
+
+There is also another solution using LeCun initialization with a SELU activation function.
+
+The final common approach, used with softmax activation, is to use the Glorot initialization method.
diff --git a/VanishingGradients.md b/VanishingGradients.md
@@ -0,0 +1,16 @@
+:ml:
+# Vanishing Gradients
+
+ML 550
+
+## Notes
+
+**Definition:** Vanishing gradients is a neural network problem where lower levels (earlier hidden layers) have such small gradients that gradient steps make tiny changes and the model never converges upon a good solution.
+
+This is a very common problem as most of the time gradients get smaller and smaller. As such, this problem is much more common than [[ExplodingGradients.md]] which primarily happens with RNNs.
+
+### Solutions
+
+Use ReLU and better weight initialization (not gaussian distribution with std deviation of 1).
+
+See [[UnstableGradients.md]] for more.
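
The notes in this commit say that sigmoid activations backpropagate very small gradients and that ReLU helps. A rough way to see why is to multiply the activation derivatives met along one backward path through the network. The sketch below is an illustration only and not part of the committed notes: the helper names, the 10-layer depth, and the pre-activation value of 0.5 are all hypothetical, and weights are left out so that only the effect of the activation function is visible.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid; its maximum is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_factor(grad_fn, pre_activations):
    """Multiply the activation derivatives met along one backward path.

    Weights are deliberately left out (an assumption of this sketch), so the
    result isolates the contribution of the activation function alone.
    """
    factor = 1.0
    for z in pre_activations:
        factor *= grad_fn(z)
    return factor


# Hypothetical pre-activation values, one per layer of a 10-layer network.
pre_activations = [0.5] * 10

sigmoid_factor = chained_factor(sigmoid_grad, pre_activations)
relu_factor = chained_factor(relu_grad, pre_activations)

print(f"sigmoid: {sigmoid_factor:.2e}")
print(f"relu:    {relu_factor:.2e}")
```

Because the sigmoid derivative never exceeds 0.25, the sigmoid factor shrinks by at least a factor of four per layer, while ReLU passes the gradient through unchanged for positive inputs. In a real network the weights also scale the gradient at every layer, which is where the initialization schemes mentioned in UnstableGradients.md come in.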