notes

commit fa73349d141d55e6a0bfdd66d5a9a36ce43628de
parent ae0f56192f99dbee70e9a5caec151fbf5e7ac093
Author: Andrew <andrewlaack1@gmail.com>
Date:   Fri,  7 Jun 2024 01:03:14 -0500

Done with this ml shit

Diffstat:
A CART.md              | 14 ++++++++++++++
A DecisionTrees.md     | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M Ensembles.md         |  4 +---
M MachineLearning.md   |  5 +++++
A RandomForest.md      | 12 ++++++++++++
M SVM.md               | 12 ++++++++++++
A SimilarityFeature.md | 10 ++++++++++
A VotingClassifiers.md | 12 ++++++++++++
M index.md             |  8 +++++---
9 files changed, 130 insertions(+), 6 deletions(-)

diff --git a/CART.md b/CART.md
@@ -0,0 +1,14 @@
+:ml:
+# CART - Classification and Regression Tree Algorithm
+
+ML D4
+
+## Notes
+
+**Definition:** The CART algorithm trains decision trees by splitting the training set into two subsets using a single feature k and a threshold on that feature, chosen as the pair that produces the purest subsets, weighted by their size. It then repeats this greedily on each subset until it reaches the max depth or cannot find a split that reduces impurity.
+
+Note that because the algorithm is greedy, accepting a worse split at one step could sometimes lead to a better tree overall, but searching for that would increase the computing cost drastically.
+
+Two impurity measures are commonly used with CART: Gini impurity and entropy. Gini impurity is the default in scikit-learn (the algorithm tries to minimize it). Entropy can be used instead (the reduction in entropy is known as information gain), but it is slower to compute as it uses logarithms.
+
+CART can also do regression by using MSE instead of Gini or entropy. We basically just want to minimize MSE at each step.
diff --git a/DecisionTrees.md b/DecisionTrees.md
@@ -0,0 +1,59 @@
+:ml:
+# Decision Trees
+
+ML D4
+
+## Notes
+
+**Definition:** Decision trees are machine learning models that make a true/false comparison at each node to go left or right until reaching a leaf node. The leaf node then gives the output.
+
+### Visualizing
+
+You can use graphviz to visualize the tree. First, you train the model using sklearn.tree, then you import export_graphviz from the same module. You pass export_graphviz the model, output file, feature names, class names, and some other options, and it creates a dot file.
+
+Then, you can import graphviz and use Source.from_file() to load in the dot file and view it.
+
+Ex:
+
+```python3
+from sklearn.tree import export_graphviz
+from graphviz import Source
+
+graphData = export_graphviz(
+    tree_clf,
+    out_file='../graphs/iris_tree.dot',
+    feature_names=["petal length (cm)", "petal width (cm)"],
+    class_names=iris.target_names,
+    rounded=True,
+    filled=True
+)
+Source.from_file('../graphs/iris_tree.dot')
+```
+
+### Other Info
+
+There is a root node and what are called 'split nodes', where the tree splits into two more nodes based on a True/False comparison.
+
+An interesting thing about decision trees is that no feature scaling is required, as each split compares a single feature to a threshold rather than to other features, unless you engineer another feature as some combination of them.
+
+In the context of decision trees, the 'samples' value of a split node is the number of training samples that reached that node. The same applies to leaf nodes, where it is the number of samples that reached the leaf.
+
+The 'gini' attribute measures the impurity of a node. A value of 0 means all samples that reached the node belong to the same class. It is computed as 1 minus the sum of the squared class fractions, so a node split 50/50 between two classes has a gini of 0.5.
+
+Scikit-learn creates binary trees by using the CART algorithm, but there are other decision tree algorithms where splits are not strictly yes/no, such as ID3, where nodes can have more than two children.
+
+Decision trees can output class probabilities from the per-class sample counts that are also used to compute the gini value. For a leaf with counts such as [50, 2, 5], the first class is the most probable, with an estimated probability of 50/57.
+
+The max_depth hyperparameter is a common way to regularize decision trees and reduce the risk of overfitting. There are also max_features (features evaluated per split), max_leaf_nodes, min_samples_split, and min_samples_leaf, which restrict the tree in similar ways.
+
+
+### Uhh Ohh
+
+These things really like orthogonal decision boundaries, but not so much angled ones. If you have a dataset that is easily separable at an angle but not vertically or horizontally, you will have a bad time with decision trees.
+
+One remedy for this is to use PCA, which rotates the data to reduce correlation between features.
+
+
+### Hmmm....
+
+Scikit-learn's decision tree training is stochastic, meaning results aren't consistent from training run to training run (unless random_state is set). This is why random forests can be cool.
diff --git a/Ensembles.md b/Ensembles.md
@@ -5,6 +5,4 @@ CH2
 
 ## Notes
 
-**Definition:** Ensembles are models composed of multiple other models.
-
-An example is a random forest regressor.
+**Definition:** Ensembles are models composed of multiple models. These models can be of the same type, as with random forests, or different types put together.
diff --git a/MachineLearning.md b/MachineLearning.md
@@ -102,6 +102,11 @@ Concepts:
 [[EarlyStopping.md]]
 [[SoftmaxRegression.md]]
 [[SVM.md]]
+[[DecisionTrees.md]]
+[[SimilarityFeature.md]]
+[[CART.md]]
+[[RandomForest.md]]
+[[VotingClassifiers.md]]
 
 To do:
 
diff --git a/RandomForest.md b/RandomForest.md
@@ -0,0 +1,12 @@
+:ml:
+# Random Forest
+
+ML D4
+
+## Notes
+
+**Definition:** A random forest is an [[Ensembles.md]] of [[DecisionTrees.md]] that makes predictions by majority voting or some other aggregation function (e.g. averaging for regression).
+
+This uses a wisdom of the crowd philosophy, where the aggregate of many answers is most likely better than one expert answer.
+
+
diff --git a/SVM.md b/SVM.md
@@ -7,8 +7,20 @@ ML D3
 
 **Definition:** Support vector machines are models that create lines to separate different outputs by drawing lines between them leaving as much space possible between the different classes. They also have edges to the "street" where there is a line up the middle and these edges are only affected by instances located on the edge of the street and not by instances far off. These are the support vectors.
 
+### Classification
+
 Think of trying to make a street as wide as possible where there are buildings on the side that can't be moved. If the buildings move in the edges of the street need to as well. We would also see that the center line for the street moves accordingly as there is width lost on one side. Regardless of how many buildings are made far away, they do not affect the optimal width of the road.
 
 This describes how hard margin classification works, and the issue that arises with it is that if two samples are of different classes but in any way intermingle, the algorithm won't work. As such, there is also soft margin classification for svms which tries to limit margin violations while also balancing this with making the street as large as possible. With scikit learn, if you reduce the C value (hyperparameter) then it will have more margin violations. This decreases the likelihood of overfitting but reducing it too much will cause underfitting.
 
 Support Vector Machines are good for small datasets, but they do not scale well. They are also subject to feature scaling.
+
+When dealing with datasets that are not linearly separable, we can use the same polynomial feature strategy used with linear regression to fit a boundary based on a polynomial of any degree.
+
+A trick related to SVMs is the kernel trick, e.g. with a polynomial kernel. This gives the same result as adding polynomial features without the combinatorial explosion of features, because the higher-dimensional mapping is never explicitly computed (unclear about this).
+
+### Regression
+
+When trying to use SVMs for regression, we try to fit as many samples as possible on the street while still limiting margin violations (samples off the street). The width of the street is controlled by the hyperparameter epsilon.
+
+
diff --git a/SimilarityFeature.md b/SimilarityFeature.md
@@ -0,0 +1,10 @@
+:ml:
+# Similarity Feature
+
+ML 4
+
+## Notes
+
+**Definition:** A similarity feature is an added feature that describes how similar a sample is to a particular landmark. This value generally ranges from 1 (identical to the landmark) down to nearly or exactly 0 (depending on the RBF used) for samples that are entirely different.
+
+With housing data, as an example, we may use an RBF to add another feature based on latitude and longitude that captures how close each point is to some landmark city.
diff --git a/VotingClassifiers.md b/VotingClassifiers.md
@@ -0,0 +1,12 @@
+:ml:
+# Voting Classifiers
+
+ML D4
+
+## Notes
+
+**Definition:** Voting classifiers are ensembles of classification models that use each of their outputs to predict the final output.
+
+Assume you are using an SVM classifier, a random forest, and logistic regression: the output of each is computed, and whichever class gets the most votes becomes the final output.
+
+This process aggregates the outputs of the individual models into one output. Majority voting, where the most popular output is chosen, is called hard voting.
diff --git a/index.md b/index.md
@@ -22,7 +22,9 @@ This is the index for my main note classifications. I will maintain this as a ho
 
 ## Technology Books to Read
 
-- [ ] "The Structure of Scientific Revolutions" - Thomas Kuhn
-- [ ] "Introduction to Computing Systems" - Patt and Patel
+- [ ] "Introduction to Statistical Thought" - Michael Lavine
 - [ ] "Hands-On Machine Learning with Scikit-Learn and TensorFlow" - Aurelien Geron
-- [ ] ""
+- [ ] "Introduction to Linear Algebra" - Gilbert Strang
+- [ ] "Calculus Early Transcendentals" - James Stewart
+
+Maybe a book about NN, transformers, CV, PyTorch; I don't know what I will be missing by the end of the current ML book I am reading.
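
The hard voting described in the VotingClassifiers.md notes can be sketched in plain Python. This is only a minimal illustration, not scikit-learn's implementation: `hard_vote` and the per-model outputs below are hypothetical names made up for the example, and it assumes ties go to whichever label was seen first.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by the most models (hard voting)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    # most_common(1) gives a one-element list of (label, vote_count).
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Hypothetical outputs of three already-trained models for a single sample.
votes = {
    "svm": "versicolor",
    "random_forest": "virginica",
    "logistic_regression": "versicolor",
}
final = hard_vote(list(votes.values()))
print(final)
```

Each model gets exactly one vote here regardless of how confident it is, which is what separates hard voting from schemes that average predicted probabilities.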