Added notes about data sanitization for ml using sklearn - notes

commit 2f25fe6abc9393529bbc95da2d36685cbb0afa94
parent 97caba2fa94b7460acade1e01b6a73f3007ac760
Author: Andrew <andrewlaack1@gmail.com>
Date:   Thu, 23 May 2024 18:34:54 -0500

Added notes about data sanitization for ml using sklearn

Diffstat:
A BIOL115.md  | 8 ++++++++
M CS202.md  | 8 ++++++++
A FeatureScaling.md  | 12 ++++++++++++
A Hyperparameter.md  | 10 ++++++++++
A Imputation.md  | 25 +++++++++++++++++++++++++
A LabelEncoding.md  | 14 ++++++++++++++
A MinMaxScaling.md  | 27 +++++++++++++++++++++++++++
A OneHotEncoding.md  | 12 ++++++++++++
A Standardization.md  | 10 ++++++++++
A TargetEncoding.md  | 22 ++++++++++++++++++++++
M index.md  | 1 +

11 files changed, 149 insertions(+), 0 deletions(-)
diff --git a/BIOL115.md b/BIOL115.md
@@ -0,0 +1,8 @@
+:index: :biol115:
+# Biology 115 - Human Biology
+
+Summer 24
+
+## Main Links
+
+
diff --git a/CS202.md b/CS202.md
@@ -20,3 +20,11 @@ This is the index for my cs 202 notes.
 [[CanaryValue.md]]
 [[TwosComplement.md]]
 [[OnesComplement.md]]
+[[Imputation.md]]
+[[OneHotEncoding.md]]
+[[LabelEncoding.md]]
+[[TargetEncoding.md]]
+[[Hyperparameter.md]]
+[[FeatureScaling.md]]
+[[Standardization.md]]
+[[MinMaxScaling.md]]
diff --git a/FeatureScaling.md b/FeatureScaling.md
@@ -0,0 +1,12 @@
+:ml:
+# Feature Scaling
+
+ML CH2
+
+## Notes
+
+**Definition:** Feature scaling is the process of transforming input features so that they share a similar range of values.
+
+Feature scaling is important because many machine learning algorithms perform poorly when the input features use vastly different scales of values.
+
+There are two common types of feature scaling: [[MinMaxScaling.md]] and [[Standardization.md]].
diff --git a/Hyperparameter.md b/Hyperparameter.md
@@ -0,0 +1,10 @@
+:ml:
+# Hyperparameter
+
+ML CH2
+
+## Notes
+
+**Definition:** A hyperparameter in ML is a parameter that is set prior to training and is not learned from the samples.
+
+Examples of hyperparameters are [[LearningRate.md]] and the weight m used when calculating weighted means. More about the latter can be seen in [[TargetEncoding.md]].
diff --git a/Imputation.md b/Imputation.md
@@ -0,0 +1,25 @@
+:ml: 
+# Imputation
+
+CH2
+
+## Notes
+
+**Definition:** Imputation is the process of filling in null values with some appropriate value.
+
+In ML, null values are often replaced with 0, the mean, the median, or some other appropriate value for the column.
+
+Using pandas, this can be done with df.fillna().
+
+Another way to do this is with sklearn.impute's SimpleImputer. It can be used as follows:
+
+```python
+from sklearn.impute import SimpleImputer
+
+imputer = SimpleImputer(strategy="median")
+imputer.fit(df) # Ensure the df only has np.number dtypes. 
+
+X = imputer.transform(df) # Set null values to medians (as specified) for the df. 
+```
+
+The imputer above also accepts the strategies most_frequent (mode), mean, and constant. With constant, the replacement value is given by fill_value.
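+
+As a minimal sketch of the two approaches above, the example below fills nulls with column medians using both df.fillna() and SimpleImputer. The DataFrame and its column names (age, income) are made up for illustration.
+
+```python
+import numpy as np
+import pandas as pd
+from sklearn.impute import SimpleImputer
+
+# Hypothetical numeric data with one null per column.
+df = pd.DataFrame({"age": [20.0, np.nan, 40.0], "income": [1.0, 2.0, np.nan]})
+
+# pandas: fill each column's nulls with that column's median.
+filled = df.fillna(df.median())
+
+# sklearn: same result, but returned as a NumPy array instead of a DataFrame.
+imputer = SimpleImputer(strategy="median")
+X = imputer.fit_transform(df)
+```
+
+One reason to prefer the imputer is that the medians learned by fit can be reused later to transform new data.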
diff --git a/LabelEncoding.md b/LabelEncoding.md
@@ -0,0 +1,14 @@
+:ml:
+# Label Encoding
+
+ML CH2
+
+## Notes
+
+**Definition:** Label encoding is the process of encoding each distinct label as an arbitrary number.
+
+This is often done when a string feature is fed to a neural network or linear regression model and the feature has too many options for [[OneHotEncoding.md]] to be practical.
+
+One issue is that the assigned numbers are arbitrary, so a model that treats them as ordered (higher being better or worse) will learn relationships that do not exist.
+
+See also [[TargetEncoding.md]] for another way to encode strings as numbers.
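+
+A minimal sketch of label encoding with pandas, assuming a made-up city column. Converting to the category dtype is only one possible approach; here it assigns codes in sorted label order.
+
+```python
+import pandas as pd
+
+# Hypothetical string feature.
+df = pd.DataFrame({"city": ["paris", "tokyo", "paris", "lima"]})
+
+# Categories are sorted alphabetically, then each label is mapped to its index.
+df["city_code"] = df["city"].astype("category").cat.codes
+```
+
+The codes say nothing about the cities themselves, which is the ordering issue described above.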
diff --git a/MinMaxScaling.md b/MinMaxScaling.md
@@ -0,0 +1,27 @@
+:ml: 
+# Min-max scaling 
+
+ML CH2
+
+## Notes
+
+**Definition:** Min-max scaling, also referred to as normalization, rescales values so that they range from 0 to 1.
+
+This is done by subtracting the min value and then dividing by the difference between the max and the min.
+
+Here is an example implementation:
+
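+```python
+
+# For each column (assuming they are all numeric), rescale every value to
+# (current - min) / (max - min) so that the column ranges from 0 to 1.
+# Note: a constant column has max == min, so the division yields NaN.
+
+for col in df:
+    col_min = df[col].min()  # Named col_min to avoid shadowing the built-in min().
+    col_range = df[col].max() - col_min
+    df[col] = (df[col] - col_min) / col_range
+
+df.describe()
+```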
+
+See [[FeatureScaling.md]] for more.
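+
+The same scaling can also be sketched with sklearn's MinMaxScaler. The DataFrame and its column names (a, b) are made up for illustration.
+
+```python
+import pandas as pd
+from sklearn.preprocessing import MinMaxScaler
+
+# Hypothetical numeric data on two different scales.
+df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})
+
+# fit_transform learns each column's min and max, then rescales to [0, 1].
+scaler = MinMaxScaler()
+scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
+```
+
+Because the scaler stores the min and max it learned, the same transformation can be applied to new data with scaler.transform.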
diff --git a/OneHotEncoding.md b/OneHotEncoding.md
@@ -0,0 +1,12 @@
+:ml:
+# One-hot Encoding
+
+ML CH2
+
+## Notes
+
+**Definition:** One-hot encoding is the process of taking all unique values of a given feature and expanding them into individual boolean attributes of a sample.
+
+For example, suppose a column states the distance from the ocean, with the options island, 1 hour, and near ocean. These could be encoded as integers, but the integers would not represent what the values mean, so fitting a linear regression to them would cause issues because higher or lower does not necessarily mean better. Instead, you add 1 hour, near ocean, and island as columns and set each to true or false based on the distance string.
+
+See [[LabelEncoding.md]] for a simple way of encoding strings as numbers. It is useful when there are many options and the model knows the data is arbitrarily numbered.
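+
+A minimal sketch of the ocean example using pandas' get_dummies. The column name ocean_proximity is an assumption made for illustration.
+
+```python
+import pandas as pd
+
+# Hypothetical column holding the distance-from-ocean strings.
+df = pd.DataFrame({"ocean_proximity": ["island", "1 hour", "near ocean"]})
+
+# Each unique value becomes its own indicator column.
+encoded = pd.get_dummies(df, columns=["ocean_proximity"])
+```
+
+Every row of encoded has exactly one indicator set, which is where the name one-hot comes from.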
diff --git a/Standardization.md b/Standardization.md
@@ -0,0 +1,10 @@
+:ml: 
+# Standardization 
+
+ML CH2
+
+## Notes
+
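+**Definition:** Standardization rescales a feature to have a mean of 0 and a standard deviation of 1 by subtracting the mean and then dividing by the standard deviation.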
+
+See [[FeatureScaling.md]] for more.
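+
+As a minimal sketch, standardization subtracts each column's mean and divides by its standard deviation. The example below does this by hand and with sklearn's StandardScaler; the array values are made up.
+
+```python
+import numpy as np
+from sklearn.preprocessing import StandardScaler
+
+# Hypothetical data: two features on different scales.
+X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
+
+# By hand: (value - column mean) / column standard deviation.
+manual = (X - X.mean(axis=0)) / X.std(axis=0)
+
+# StandardScaler applies the same formula (population standard deviation).
+scaled = StandardScaler().fit_transform(X)
+```
+
+Unlike min-max scaling, the result is not bounded to a fixed range.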
diff --git a/TargetEncoding.md b/TargetEncoding.md
@@ -0,0 +1,22 @@
+# Target Encoding
+
+ML CH2
+
+## Notes
+
+**Definition:** Target encoding is the process of mapping each option of a feature to a representative value that is calculated from the target.
+
+This differs from [[LabelEncoding.md]], which uses an arbitrary mapping instead of a representative one.
+
+A simple way to do this is to find the mean target value for each feature label (group by) and then map the feature to this mean. This is simple but imperfect, especially when there is little data for a specific label.
+
+Another way is to use a weighted mean that also takes the overall mean into account. Take the current option's mean, multiply it by the number of occurrences n of that option, and add the overall mean multiplied by some [[Hyperparameter.md]] m. Then divide this sum by n + m.
+
+Equation:
+
+$\frac{n \cdot \text{option mean} + m \cdot \text{overall mean}}{n + m}$
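+
+The weighted mean above can be sketched with pandas as follows. The function name smoothed_target_encode and the columns city and price are hypothetical, chosen only for illustration.
+
+```python
+import pandas as pd
+
+def smoothed_target_encode(df, feature, target, m):
+    """Map each option of `feature` to (n * option mean + m * overall mean) / (n + m)."""
+    overall_mean = df[target].mean()
+    # n (count) and option mean for every option of the feature.
+    stats = df.groupby(feature)[target].agg(["count", "mean"])
+    smooth = (stats["count"] * stats["mean"] + m * overall_mean) / (stats["count"] + m)
+    return df[feature].map(smooth)
+
+# Hypothetical data: option "b" appears only once, so it is pulled toward the overall mean.
+df = pd.DataFrame({"city": ["a", "a", "a", "b"], "price": [10.0, 20.0, 30.0, 100.0]})
+encoded = smoothed_target_encode(df, "city", "price", m=2)
+```
+
+A larger m pulls rare options closer to the overall mean, while m = 0 reduces to the plain per-option mean.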
+
+
+## Issues
+
+The main issue with this approach is overfitting. Because the encoded feature is calculated from the target, there is a higher likelihood that the model will overfit the training data.
diff --git a/index.md b/index.md
@@ -7,6 +7,7 @@ This is the index for my main note classifications. I will maintain this as a ho
 
 [[CS202.md]]
 [[CS331.md]]
+[[BIOL115.md]]
 [[Math310.md]]
 [[TexRef.md]]
