Took notes about ml (logit, svms, sgd, gd, etc) - notes

commit ae0f56192f99dbee70e9a5caec151fbf5e7ac093
parent 0625a452e62ad787bc0c5374e35fd767334346c9
Author: Andrew <andrewlaack1@gmail.com>
Date:   Wed,  5 Jun 2024 20:14:22 -0500

Took notes about ml (logit, svms, sgd, gd, etc)

Diffstat:
A Bandwidth.md  | 10 ++++++++++
M DensityEstimation.md  | 2 ++
A EarlyStopping.md  | 12 ++++++++++++
A ElasticNetRegression.md  | 10 ++++++++++
A LassoRegression.md  | 10 ++++++++++
M LinearRegression.md  | 4 ++++
M LogisticRegression.md  | 16 ++++++++++++----
M MachineLearning.md  | 12 ++++++++----
A Oversmooothing.md  | 10 ++++++++++
A RidgeRegression.md  | 10 ++++++++++
A SVM.md  | 14 ++++++++++++++
A SoftmaxRegression.md  | 10 ++++++++++
M Statistics.md  | 4 +++-
A Undersmoothing.md  | 8 ++++++++

14 files changed, 123 insertions(+), 9 deletions(-)
diff --git a/Bandwidth.md b/Bandwidth.md
@@ -0,0 +1,10 @@
+:stats:
+# Bandwidth
+
+Stats D3
+
+## Notes
+
+**Definition:** Bandwidth is a hyperparameter used in smoothing techniques that describes the width of kernels.
+
+For KDEs, a larger bandwidth makes the curve smoother, while a smaller bandwidth makes it bumpier and more sensitive to individual data points.
diff --git a/DensityEstimation.md b/DensityEstimation.md
@@ -8,3 +8,5 @@ Stats D3
 **Definition:** Density estimation is the process of modeling the probability of given values for a dataset.
 
 This can be thought of as similar to a histogram without the bins. A common form of this is a KDE (kernel density estimate). It can be better than a histogram because it has no binning, which can make the data appear inaccurate depending on the cut points and bin widths.
+
+In a general sense, KDEs work by centering a Gaussian distribution on each data point, summing the values of these distributions at each x, and then graphing the normalized sum. This averages out the data to give a smooth overall picture of its distribution. The width of these Gaussian distributions is dictated by the bandwidth hyperparameter.
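The summing of Gaussians described above can be sketched in plain Python. This is a minimal illustration, not how any particular library implements it; the function name `gaussian_kde` and the sample data are made up for the example.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x from 1-D data."""
    n = len(data)

    def density(x):
        total = 0.0
        for xi in data:
            # Distance from the data point, measured in bandwidths.
            u = (x - xi) / bandwidth
            # Standard Gaussian kernel evaluated at that scaled distance.
            total += math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
        # Dividing by n * bandwidth makes the estimate integrate to 1.
        return total / (n * bandwidth)

    return density


data = [1.0, 1.2, 1.9, 4.0, 4.3]  # hypothetical sample
narrow = gaussian_kde(data, bandwidth=0.1)  # bumpy, hugs the data points
wide = gaussian_kde(data, bandwidth=2.0)    # smooth, blurs the two clusters
```

With the narrow bandwidth the estimate is close to 0 in the gap around x = 3, while the wide bandwidth fills the gap in.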
diff --git a/EarlyStopping.md b/EarlyStopping.md
@@ -0,0 +1,12 @@
+:ml:
+# Early Stopping
+
+ML D3
+
+## Notes
+
+**Definition:** Early stopping is the practice of halting the training of a model before it converges (assuming it uses GD or something akin to that) as a form of regularization.
+
+Early stopping decreases overfitting by stopping once the validation error stops improving or a chosen prediction error threshold is met. This also reduces training time.
+
+Using sklearn, we can call partial_fit once per epoch (pass), keep an epoch counter, and compute the loss after each call to determine whether we are close enough to some goal to stop.
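The loop described above can be sketched without sklearn. The example below uses plain Python gradient descent on a one-parameter model (y = theta * x) in place of partial_fit, and it stops when the validation loss has not improved for `patience` epochs. The function names, the patience rule, and the data are all assumptions made for illustration.

```python
def mse(theta, xs, ys):
    """Mean squared error of the model y = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train_with_early_stopping(x_tr, y_tr, x_val, y_val,
                              lr=0.01, max_epochs=1000, patience=5):
    theta = 0.0
    best_loss, best_epoch, best_theta = float("inf"), 0, theta
    m = len(x_tr)
    for epoch in range(1, max_epochs + 1):
        # One epoch: a single gradient descent step on the training set.
        grad = 2 / m * sum((theta * x - y) * x for x, y in zip(x_tr, y_tr))
        theta -= lr * grad
        loss = mse(theta, x_val, y_val)
        if loss < best_loss:
            # Remember the best parameters seen so far.
            best_loss, best_epoch, best_theta = loss, epoch, theta
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop early
    return best_theta, best_epoch, epoch


# Hypothetical data: the training set prefers a slope of 2.2,
# the validation set a slope of 2.0.
best_theta, best_epoch, stopped_epoch = train_with_early_stopping(
    [1.0, 2.0, 3.0], [2.2, 4.4, 6.6], [1.5, 2.5], [3.0, 5.0]
)
```

Training stops long before `max_epochs`, and the returned `best_theta` is the value from the epoch with the lowest validation loss rather than the final one.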
diff --git a/ElasticNetRegression.md b/ElasticNetRegression.md
@@ -0,0 +1,10 @@
+:ml:
+# Elastic Net Regression
+
+ML D3
+
+## Notes
+
+**Definition:** Elastic net regression is another form of linear regression that adds a regularization term to the loss function. The term is a weighted mix of the ridge and lasso penalties, making it a middle ground between the two.
+
+For linear regression, it is generally good to add some regularization. When we suspect some coefficients should be 0, elastic net is a good choice. Otherwise, when we don't think there are useless features, ridge regression is a good option.
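As a rough sketch of the "middle ground" idea, the penalty can be written as a mix of the lasso (absolute value) and ridge (squared) penalties. The parameter names `lam` and `l1_ratio` are chosen for this example, and the exact scaling of each part differs between textbooks and libraries (some put a factor of 1/2 on the squared part).

```python
def elastic_net_penalty(coefs, lam, l1_ratio):
    """Regularization term added to the loss (intercept not included)."""
    l1 = sum(abs(c) for c in coefs)  # lasso part
    l2 = sum(c * c for c in coefs)   # ridge part
    # l1_ratio = 1 gives a pure lasso penalty, l1_ratio = 0 a pure ridge one.
    return lam * (l1_ratio * l1 + (1 - l1_ratio) * l2)


coefs = [3.0, -4.0]  # hypothetical coefficients
lasso_like = elastic_net_penalty(coefs, lam=1.0, l1_ratio=1.0)
ridge_like = elastic_net_penalty(coefs, lam=1.0, l1_ratio=0.0)
mixed = elastic_net_penalty(coefs, lam=1.0, l1_ratio=0.5)
```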
diff --git a/LassoRegression.md b/LassoRegression.md
@@ -0,0 +1,10 @@
+:ml:
+# Lasso Regression (Least absolute shrinkage and selection operator regression)
+
+ML D3
+
+## Notes
+
+**Definition:** Lasso regression is another form of linear regression that adds a regularization term to the loss function, but it penalizes the absolute values of the coefficients rather than their squares as ridge regression does.
+
+The main difference between this and ridge is that ridge shrinks all coefficients toward 0 without usually making any of them exactly 0, whereas lasso can drive some coefficients all the way to 0. As such, it often outputs a sparse model in which the coefficients of the least useful features are 0.
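The way lasso sets a coefficient to exactly 0 can be seen in the simplest possible case: one feature, no intercept, and the loss (1/2) * sum((y - theta * x)^2) + lam * |theta|. Under those assumptions the minimizer has a closed form (soft thresholding). The function name and data below are made up for the example.

```python
def lasso_1d(xs, ys, lam):
    """Minimizer of 0.5 * sum((y - theta*x)**2) + lam * abs(theta)."""
    rho = sum(x * y for x, y in zip(xs, ys))  # how strongly x relates to y
    sum_sq = sum(x * x for x in xs)
    # Soft thresholding: shrink rho toward 0 by lam, stopping at exactly 0.
    shrunk = max(abs(rho) - lam, 0.0)
    sign = 1.0 if rho >= 0 else -1.0
    return sign * shrunk / sum_sq


xs, ys = [1.0, 2.0], [0.1, 0.2]          # weak feature: true slope is 0.1
no_penalty = lasso_1d(xs, ys, lam=0.0)   # ordinary least squares slope
some_penalty = lasso_1d(xs, ys, lam=0.25)
big_penalty = lasso_1d(xs, ys, lam=1.0)  # penalty outweighs the feature
```

Once `lam` is at least as large as `abs(rho)`, the coefficient is exactly 0, which is where the sparsity comes from.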
diff --git a/LinearRegression.md b/LinearRegression.md
@@ -27,3 +27,7 @@ Theta = (X transpose * X) ^ -1 * X transpose * y
 Where y is an m x 1 vector of target values and X is in some way related to inputs as a matrix with a column of ones for the intercept term... 
 
 This way of linear regression, the closed form way, is better when there are not a massive number of features, but if there are lots of features or the training instances are too vast to fit into memory, then the [[GradientDescent.md]] way is better.
+
+See [[RidgeRegression.md]], [[LassoRegression.md]], and [[ElasticNetRegression.md]] for some ways to constrain linear models (decrease degrees of freedom to avoid overfitting).
+
+For linear regression, it is generally good to add some regularization. When we suspect only a few features matter, elastic net is a good choice. Otherwise, in most cases, ridge regression is a good option when we don't think there are useless features.
diff --git a/LogisticRegression.md b/LogisticRegression.md
@@ -1,10 +1,18 @@
 :ml:
-# Logistic Regression
+# Logistic Regression (Logit Regression)
 
-ML CH1
+ML D3
 
 ##  Notes
 
-**Definition:** Logistic regression is the process of assigning a probablility of an item being part of a given class. 
+**Definition:** Logistic regression is a regression method used to determine the probability of some sample being part of some class. 
 
-This is a [[RegressionProblem.md]] and a [[ClassificationProblem.md]] as it generates a continuous value to determine the classification of a piece of data. 
+These are often used as binary classifiers: instead of outputting the probability itself, they output 1 if the estimated probability is at least 50% and 0 otherwise.
+
+Logistic regression works under the hood by computing a weighted sum of the input features (plus a bias term) and then using that as the input to a sigmoid function. The output of the sigmoid function, a value between 0 and 1, is the estimated probability of the sample being in the class. The sigmoid is also called the logistic function, so the model outputs the logistic of the weighted sum.
+
+An interesting thing about logistic regression is that there is no known closed form equation for the parameters that minimize the log loss function, so gradient descent (or another optimization algorithm) must be used to train the model.
+
+With the sigmoid function we define the decision boundary as the x-value at which the estimated probability is 50%. Values on one side of it are classified as true and values on the other side as false.
+
+See [[SoftmaxRegression.md]] for a generalization of logistic regression to multi-class classification without combining binary classifiers.
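The prediction step described above (weighted sum, then sigmoid, then a 50% cutoff) can be sketched as follows. The weights and bias are made-up values rather than trained ones, and the function names are chosen for this example.

```python
import math


def sigmoid(t):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(weights, bias, features):
    # Weighted sum of the input features plus the bias term.
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


def predict(weights, bias, features, threshold=0.5):
    # Turn the probability into a hard 1/0 class prediction.
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


weights, bias = [2.0], -4.0  # hypothetical parameters
# The decision boundary is where the score is 0: 2 * x - 4 = 0, so x = 2.
below = predict(weights, bias, [1.0])
above = predict(weights, bias, [3.0])
at_boundary = predict_proba(weights, bias, [2.0])
```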
diff --git a/MachineLearning.md b/MachineLearning.md
@@ -21,14 +21,12 @@ y = output also known as target variable
 
 (x,y) = Training example
 
-m = Number of training examples
+m = Number of samples 
 
-n = # of features (size of inputs/features/x)
+n = # of features
 
 h(x) = the function that takes an input x; its output should be about the correct y.
 
-
-
 ## Main Links
 
 ML Categories:
@@ -98,6 +96,12 @@ Concepts:
 [[MultilabelClassification.md]]
 [[MultioutputClassification.md]]
 [[PartialDerivative.md]]
+[[RidgeRegression.md]]
+[[LassoRegression.md]]
+[[ElasticNetRegression.md]]
+[[EarlyStopping.md]]
+[[SoftmaxRegression.md]]
+[[SVM.md]]
 
 To do:
 
diff --git a/Oversmooothing.md b/Oversmooothing.md
@@ -0,0 +1,10 @@
+:stats:
+# Oversmoothing
+
+Stats D3
+
+## Notes
+
+**Definition:** Oversmoothing is the process of making the bandwidth of a kernel too large such that resulting visualizations smooth over important information.
+
+This can be thought of as underfitting the dataset.
diff --git a/RidgeRegression.md b/RidgeRegression.md
@@ -0,0 +1,10 @@
+:ml:
+# Ridge Regression
+
+ML D3
+
+## Notes
+
+**Definition:** Ridge regression uses a different cost function than standard linear regression to limit the size of coefficients.
+
+The cost function has a regularization term that increases the loss when coefficients are large, thus incentivizing smaller coefficient values. A hyperparameter, lambda, controls how much weight this term gets: a value of 0 gives standard linear regression, while larger and larger values move the coefficients closer and closer to 0.
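The effect of lambda is easy to see in the simplest case: one feature, no intercept, and the cost sum((y - theta * x)^2) + lam * theta^2. Under those assumptions the best theta has a closed form. The function name and data are made up for this sketch.

```python
def ridge_1d(xs, ys, lam):
    """Minimizer of sum((y - theta*x)**2) + lam * theta**2."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_sq = sum(x * x for x in xs)
    # lam in the denominator shrinks theta toward 0 as lam grows.
    return sum_xy / (sum_sq + lam)


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # data with a true slope of 2
plain = ridge_1d(xs, ys, lam=0.0)       # standard linear regression
shrunk = ridge_1d(xs, ys, lam=14.0)     # coefficient is halved
tiny = ridge_1d(xs, ys, lam=1e6)        # close to 0, but never exactly 0
```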
diff --git a/SVM.md b/SVM.md
@@ -0,0 +1,14 @@
+:ml:
+# Support Vector Machines (SVMs)
+
+ML D3
+
+## Notes
+
+**Definition:** Support vector machines are models that separate classes with a decision boundary (a line in two dimensions) that leaves as much space as possible between the different classes. The boundary runs up the middle of a "street" whose edges are only affected by the instances located on the edge of the street and not by instances far off. These edge instances are the support vectors.
+
+Think of trying to make a street as wide as possible where there are buildings on the sides that can't be moved. If the buildings move in, the edges of the street need to as well. We would also see that the center line of the street moves accordingly, as there is width lost on one side. Regardless of how many buildings are built far away, they do not affect the optimal width of the road. This describes how hard margin classification works, and the issue that arises with it is that if samples of different classes intermingle in any way, the algorithm won't work.
+
+As such, there is also soft margin classification for SVMs, which tries to limit margin violations while balancing this with making the street as wide as possible. With scikit-learn, if you reduce the C value (a hyperparameter), the model will allow more margin violations. This decreases the likelihood of overfitting, but reducing it too much will cause underfitting.
+
+Support vector machines are good for small datasets, but they do not scale well to large ones. They are also sensitive to feature scales, so features should be scaled before training.
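The trade-off that C controls can be illustrated with the standard soft margin objective for a linear SVM: half the squared norm of the weights (smaller means a wider street) plus C times the total margin violation, measured by the hinge loss. This is a sketch of that textbook objective, not of scikit-learn's internals; the function names, the weights, and the data are made up.

```python
import math


def street_width(w):
    """Distance between the two edges of the street: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses; labels must be -1 or +1."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    violations = 0.0
    for xi, yi in zip(X, y):
        score = sum(wi * xij for wi, xij in zip(w, xi)) + b
        # 0 when the instance is on the correct side, off the street.
        violations += max(0.0, 1.0 - yi * score)
    return margin_term + C * violations


X = [[0.0, 0.0], [2.0, 0.0], [1.5, 0.0]]  # hypothetical instances
y = [-1, 1, -1]                           # the last one violates the margin
w, b = [1.0, 0.0], -1.0                   # hypothetical boundary: x1 = 1

high_c = soft_margin_objective(w, b, X, y, C=1.0)
low_c = soft_margin_objective(w, b, X, y, C=0.1)
```

With a lower C the same violation costs less, so an optimizer is more willing to accept it in exchange for a wider street.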
diff --git a/SoftmaxRegression.md b/SoftmaxRegression.md
@@ -0,0 +1,10 @@
+:ml:
+# Softmax Regression
+
+ML D3
+
+## Notes
+
+**Definition:** Softmax regression is the process of computing a linear score (a weighted sum of the features) for each of the k classes for a sample and then using the softmax function to determine the probability of it being a member of each class.
+
+The softmax function is simply a function where you take the exponential of each element (e^z), sum these values, and then divide the exponential of each element by the sum of all exponentials.
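The softmax steps above translate almost directly into code. The one addition in this sketch is subtracting the largest score before exponentiating, a common trick that leaves the result unchanged but avoids overflow for large scores. The example scores are made up.

```python
import math


def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    shift = max(scores)  # does not change the result, avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for 3 classes
predicted_class = probs.index(max(probs))  # class with the highest score
```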
diff --git a/Statistics.md b/Statistics.md
@@ -26,4 +26,6 @@ Links to Stats Notes
 [[Quantile.md]]
 [[ExploratoryDataAnalysis.md]]
 [[DensityEstimation.md]]
-
+[[Bandwidth.md]] 
+[[Oversmooothing.md]] 
+[[Undersmoothing.md]] 
diff --git a/Undersmoothing.md b/Undersmoothing.md
@@ -0,0 +1,8 @@
+:stats:
+# Undersmoothing
+
+Stats D3
+
+## Notes
+
+**Definition:** Undersmoothing is when the bandwidth selected for the kernel of a KDE is too small, so the resulting estimate overfits the dataset.
