commit 265af28c5afa0ca616237b387c395026ec128139
parent 2f25fe6abc9393529bbc95da2d36685cbb0afa94
Author: Andrew <andrewlaack1@gmail.com>
Date: Sun, 26 May 2024 09:59:25 -0500
Added notes about feature scaling
Diffstat:
6 files changed, 71 insertions(+), 11 deletions(-)
diff --git a/; b/;
@@ -0,0 +1,28 @@
+:ml:
+# Standardization
+
+ML CH2
+
+## Notes
+
+**Definition:** Standardization is the process of rescaling a feature by subtracting its mean from each value and dividing the result by its standard deviation, so that the scaled feature has a mean of 0 and a standard deviation of 1.
+
+This is preferable in some cases because [[MinMaxScaling.md]] is sensitive to outliers. If one outlier is much larger than all other values, the max becomes very large, which squeezes most values into a narrow band near the low end of the range and can affect the accuracy of models.
+
+See [[FeatureScaling.md]] for more.
+
+Sample implementation:
+
+```python
+
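+# Keep only the numeric columns (assumes `df` is an existing pandas DataFrame)
+df = df.select_dtypes(include=['number'])
+
+for col in df:
+    mean = df[col].mean()
+    std = df[col].std()  # sample standard deviation (pandas default, ddof=1)
+    df[col] = (df[col] - mean) / std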
+
+print(df)
+
+```
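The outlier effect described in the notes can be seen in a small sketch. The sample values and the helper names `min_max_scale` and `standardize` are hypothetical, and the code uses only the standard library rather than pandas. It shows min-max scaling squeezing the ordinary values toward 0, and checks that the standardized values have a mean of about 0 and a standard deviation of about 1.

```python
from statistics import mean, stdev

# Hypothetical sample: five ordinary values plus one large outlier.
values = [10.0, 12.0, 11.0, 13.0, 9.0, 1000.0]


def min_max_scale(xs):
    """Rescale xs linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs):
    """Subtract the mean and divide by the sample standard deviation."""
    mu, sigma = mean(xs), stdev(xs)
    return [(x - mu) / sigma for x in xs]


min_max_scaled = min_max_scale(values)
standardized = standardize(values)

# The five ordinary values end up packed close to 0 after min-max scaling.
print([round(v, 4) for v in min_max_scaled])
# The standardized values are not confined to a fixed range.
print([round(v, 4) for v in standardized])
```

Both transforms are linear, so the outlier still dominates either result. The practical difference is that standardization does not force every value into a fixed interval set by the single largest and smallest observations.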
diff --git a/CS202.md b/CS202.md
@@ -20,11 +20,3 @@ This is the index for my cs 202 notes.
[[CanaryValue.md]]
[[TwosComplement.md]]
[[OnesComplement.md]]
-[[Imputation.md]]
-[[OneHotEncoding.md]]
-[[LabelEncoding.md]]
-[[TargetEncoding.md]]
-[[Hyperparameter.md]]
-[[FeatureScaling.md]]
-[[Standardization.md]]
-[[MinMaxScaling.md]]
diff --git a/MachineLearning.md b/MachineLearning.md
@@ -69,10 +69,19 @@ Concepts:
[[MAE.md]]
[[StratifiedSampling.md]]
[[CorrelationCoefficient.md]]
+[[LogisticRegression.md]]
+[[Imputation.md]]
+[[OneHotEncoding.md]]
+[[LabelEncoding.md]]
+[[TargetEncoding.md]]
+[[Hyperparameter.md]]
+[[FeatureScaling.md]]
+[[Standardization.md]]
+[[MinMaxScaling.md]]
+[[OrdinaryLeastSquares.md]]
To do:
-[[LogisticRegression.md]]
[[DeepLearning.md]]
[[Kernels.md]]
[[Backpropagation.md]]
diff --git a/MinMaxScaling.md b/MinMaxScaling.md
@@ -5,7 +5,9 @@ ML CH2
## Notes
-**Definition:** Min-max scaling also referred to as normalization is a shift from the current values to range from 0 to 1.
+**Definition:** Min-max scaling, also referred to as normalization, rescales the current values so that they fall between two chosen bounds.
+
+These two bounds are normally either 0 and 1 or -1 and 1. Neural networks generally train better on zero-mean inputs, so a range from -1 to 1, which is centered on zero, is often a good choice.
This is often done by subtracting the min value and then dividing by the difference between the min and the max.
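A minimal sketch of scaling to arbitrary bounds, assuming a plain list of numbers rather than a DataFrame. The function name `scale_to_range` and its `lower` and `upper` parameters are hypothetical. It first maps values to 0..1 by subtracting the min and dividing by the difference between the min and the max, then stretches and shifts that result to the requested bounds.

```python
def scale_to_range(values, lower=0.0, upper=1.0):
    """Linearly rescale values so the min maps to lower and the max to upper."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread, so the division below is undefined.
        raise ValueError("cannot scale a constant column: max equals min")
    span = hi - lo
    return [lower + (v - lo) / span * (upper - lower) for v in values]


sample = [2.0, 4.0, 6.0, 10.0]

# Default bounds of 0 and 1.
unit_range = scale_to_range(sample)
# Bounds of -1 and 1.
symmetric_range = scale_to_range(sample, lower=-1.0, upper=1.0)

print(unit_range)
print(symmetric_range)
```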
@@ -15,6 +17,7 @@ Here is an example implementation:
# For each column (assuming they are numbers) iterate through them and set all
# features to be equal to the (current - min) / diff.
+# This has a lower bound of 0 and an upper bound of 1; use 2 * scaled - 1 for -1 to 1.
for i in df:
min = df[i].min()
diff --git a/OrdinaryLeastSquares.md b/OrdinaryLeastSquares.md
@@ -0,0 +1,10 @@
+:ml:
+# Ordinary Least Squares (OLS)
+
+ML CH2
+
+## Notes
+
+**Definition:** Ordinary least squares is a method for finding the line of best fit for a dataset by choosing the coefficients that minimize the sum of squared errors.
+
+When doing [[LinearRegression.md]] there are two common methods to find the line. One is OLS, which is typically solved directly in closed form, and the other is [[GradientDescent.md]], which approaches the same minimum iteratively.
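As a rough illustration of the two methods, the sketch below fits a single-feature line both ways on made-up data. The function names, sample points, learning rate, and step count are all assumptions chosen for the example. The closed-form version uses the standard simple-regression formulas, and the gradient descent version minimizes mean squared error, so the two should land on nearly the same slope and intercept.

```python
def ols_fit(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def gradient_descent_fit(xs, ys, learning_rate=0.01, steps=20000):
    """Minimize mean squared error iteratively: returns (slope, intercept)."""
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        # Prediction error for every point under the current line.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Hypothetical data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

ols_slope, ols_intercept = ols_fit(xs, ys)
gd_slope, gd_intercept = gradient_descent_fit(xs, ys)

print(ols_slope, ols_intercept)
print(gd_slope, gd_intercept)
```

The closed form gives the answer in one pass, while gradient descent needs a suitable learning rate and enough steps to converge.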
diff --git a/Standardization.md b/Standardization.md
@@ -5,6 +5,24 @@ ML CH2
## Notes
-**Definition:**
+**Definition:** Standardization is the process of rescaling a feature by subtracting its mean from each value and dividing the result by its standard deviation, so that the scaled feature has a mean of 0 and a standard deviation of 1.
+
+This is preferable in some cases because [[MinMaxScaling.md]] is sensitive to outliers. If one outlier is much larger than all other values, the max becomes very large, which squeezes most values into a narrow band near the low end of the range and can affect the accuracy of models.
See [[FeatureScaling.md]] for more.
+
+Sample implementation:
+
+```python
+
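+# Keep only the numeric columns (assumes `df` is an existing pandas DataFrame)
+df = df.select_dtypes(include=['number'])
+
+for col in df:
+    mean = df[col].mean()
+    std = df[col].std()  # sample standard deviation (pandas default, ddof=1)
+    df[col] = (df[col] - mean) / std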
+
+print(df)
+
+```