Added notes on some stuff related to training ensembles with bagging/pasting - notes

commit 008d8ae437adc2962adae34e117491c060f9490e
parent 3ec2665d6a9d70d9660cb9628d31f0ef74280285
Author: Andrew <andrewlaack1@gmail.com>
Date:   Sun,  9 Jun 2024 20:02:19 -0500

Added notes on some stuff related to training ensembles with bagging/pasting

Diffstat:
A Bagging.md  | 10 ++++++++++
A Bias.md  | 12 ++++++++++++
M MachineLearning.md  | 5 +++++
A OutOfBag.md  | 23 +++++++++++++++++++++++
A Pasting.md  | 8 ++++++++
M Variance.md  | 10 ++++++++--
M VotingClassifiers.md  | 2 ++

7 files changed, 68 insertions(+), 2 deletions(-)
diff --git a/Bagging.md b/Bagging.md
@@ -0,0 +1,10 @@
+:ml:
+# Bagging
+
+ML D5
+
+## Notes
+
+**Definition:** Bagging (bootstrap aggregating) trains the same type of model multiple times, each on a different random subset of the training data drawn with replacement. Unlike pasting, a sample that has been drawn stays in the pool and can be drawn again, so a single model (predictor) can be trained on several copies of the same sample.
+
+One advantage of both bagging and pasting is that they parallelize well: the predictors are independent of one another, so they can be trained concurrently and can also make predictions concurrently.
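The sampling step described in Bagging.md can be sketched with the standard library alone. This is a minimal illustration, not scikit-learn's implementation; `bootstrap_indices`, the dataset size of 10, and the seed are hypothetical choices made for the example.

```python
import random

def bootstrap_indices(n_samples, subset_size, rng):
    """Draw indices with replacement, so the same index may appear more than once (bagging)."""
    return rng.choices(range(n_samples), k=subset_size)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 10          # hypothetical training set size

# One index list per predictor; each predictor would be trained on its own list.
subsets = [bootstrap_indices(n_samples, n_samples, rng) for _ in range(3)]

for i, idx in enumerate(subsets):
    repeated = len(idx) - len(set(idx))  # draws that hit an already-selected sample
    print(f"predictor {i}: indices={sorted(idx)} repeated draws={repeated}")
```

Each index list is independent of the others, which is why the predictors can be trained in parallel.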
diff --git a/Bias.md b/Bias.md
@@ -0,0 +1,12 @@
+:ml:
+# Bias
+
+ML D5
+
+## Notes
+
+**Definition:** Bias is the part of a model's generalization error caused by incorrect assumptions, such as assuming the data is linear when it is not.
+
+High-bias models are likely to underfit the training data.
+
+See also [[Variance.md]]
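The "assuming data is linear when it is not" case from Bias.md can be illustrated with a small sketch. The data here is hypothetical (y = x^2 on a symmetric grid) and the least-squares line is fitted by hand rather than with a library.

```python
# Hypothetical data: y = x^2 on a symmetric grid, so the true relationship is not linear.
xs = [k / 10 for k in range(-20, 21)]
ys = [x * x for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least-squares fit of a straight line y = slope * x + intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

def mse(predict):
    """Mean squared error of a prediction function on the training data."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / n

line_error = mse(lambda x: slope * x + intercept)  # wrong (linear) assumption
quadratic_error = mse(lambda x: x * x)             # correct model family

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"training MSE, line: {line_error:.3f}")
print(f"training MSE, quadratic: {quadratic_error:.3f}")
```

The line cannot follow the curve even on the data it was fitted to, which is what underfitting looks like.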
diff --git a/MachineLearning.md b/MachineLearning.md
@@ -107,6 +107,11 @@ Concepts:
 [[CART.md]]
 [[RandomForest.md]]
 [[VotingClassifiers.md]]
+[[Bagging.md]]
+[[Pasting.md]]
+[[Bias.md]]
+[[Variance.md]]
+[[OutOfBag.md]]
 
 To do:
 
diff --git a/OutOfBag.md b/OutOfBag.md
@@ -0,0 +1,23 @@
+:ml:
+# Out of Bag
+
+ML D5
+
+## Notes
+
+**Definition:** Out-of-bag (OOB) samples are the training samples that were not drawn into the training subset of a given predictor when using bagging/pasting.
+
+With bagging, when m samples are drawn with replacement from a training set of size m, a given sample has roughly a 37% chance of being out of bag for a given predictor. Because that predictor never saw its out-of-bag samples during training, they can be used to validate it without a separate validation set.
+
+Here is an example implementation of OOB scoring used on a decision tree classifier with scikit-learn:
+
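+```python3
+from sklearn.ensemble import BaggingClassifier
+from sklearn.tree import DecisionTreeClassifier
+
+# Train 500 trees on bootstrap samples, then validate each sample using only the trees that never saw it.
+# X_train and y_train are assumed to be defined elsewhere; oob_score=True relies on bootstrap=True (the default).
+bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, oob_score=True, n_jobs=-1, random_state=10)
+bag_clf.fit(X_train, y_train)
+print(bag_clf.oob_score_)  # accuracy estimated from the out-of-bag samples
+```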
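The 37% figure in OutOfBag.md comes from the chance that a sample is missed by all m draws, (1 - 1/m)^m, which approaches 1/e (about 0.368) as m grows. The sketch below checks this numerically, assuming m draws with replacement from a set of size m; the function names and the seed are hypothetical.

```python
import math
import random

def oob_probability(m):
    """Chance that one given sample is never picked in m draws with replacement from m samples."""
    return (1 - 1 / m) ** m

def simulated_oob_fraction(m, rng):
    """Fraction of samples left out of a single simulated bootstrap sample of size m."""
    drawn = set(rng.choices(range(m), k=m))
    return 1 - len(drawn) / m

for m in (10, 100, 10_000):
    print(f"m={m}: P(out of bag) = {oob_probability(m):.4f}")

print(f"limit 1/e = {math.exp(-1):.4f}")
print(f"simulated fraction for m=10000: {simulated_oob_fraction(10_000, random.Random(0)):.4f}")
```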
diff --git a/Pasting.md b/Pasting.md
@@ -0,0 +1,8 @@
+:ml:
+# Pasting
+
+ML D5
+
+## Notes
+
+**Definition:** Pasting is the process of training multiple models of the same type on random subsets of a dataset. Unlike bagging, pasting samples without replacement: once a sample has been selected for the current subset, it is removed from the current predictor's options. This means a single predictor (model) can't be trained on the same sample twice, but different predictors may share some of the same samples.
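A minimal sketch of the sampling step described in Pasting.md, using only the standard library. `pasting_indices`, the dataset size of 10, and the subset size of 6 are hypothetical values chosen for illustration.

```python
import random

def pasting_indices(n_samples, subset_size, rng):
    """Draw indices without replacement, so no index repeats within one subset (pasting)."""
    return rng.sample(range(n_samples), subset_size)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Two predictors, each given 6 of the 10 available samples.
subset_a = pasting_indices(10, 6, rng)
subset_b = pasting_indices(10, 6, rng)

# No repeats inside a subset, but the two subsets can still overlap.
shared = set(subset_a) & set(subset_b)
print("predictor A:", sorted(subset_a))
print("predictor B:", sorted(subset_b))
print("shared samples:", sorted(shared))
```

With 6 of 10 samples per predictor, the two subsets must share at least 2 samples, which shows that the no-repeat rule applies per predictor and not across predictors.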
diff --git a/Variance.md b/Variance.md
@@ -1,9 +1,9 @@
-:stats:
+:stats: :ml:
 # Variance
 
 Stats D2
 
-## Notes
+## Notes (Stats)
 
 **Definition:** The variance of samples is the average squared difference between each value and the mean. 
 
@@ -14,3 +14,9 @@ Var(X) = |X|^-1 * sum((x - mean)^2)
 Shown above, find the difference between each value and the mean, square it to get a positive, and then sum the values. We then average it by multiplying by 1 over the cardinality of X.
 
 If we take the square root of the variance we then have the [[StandardDeviation.md]]
+
+## Notes (ML)
+
+**Definition:** Variance is the part of a model's generalization error caused by excessive sensitivity to small variations in the training data (such as noise or outliers).
+
+High-variance models are likely to overfit the training data.
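For the statistical definition in Variance.md, the formula Var(X) = |X|^-1 * sum((x - mean)^2) can be written out directly. This is a small sketch with hypothetical values; the hand-written function is compared against `statistics.pvariance`, which computes the same population variance.

```python
import math
import statistics

def variance(values):
    """Population variance: the average squared difference from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample with mean 5

var = variance(values)
std = math.sqrt(var)  # the standard deviation is the square root of the variance

print("variance:", var)
print("standard deviation:", std)
print("statistics.pvariance:", statistics.pvariance(values))
```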
diff --git a/VotingClassifiers.md b/VotingClassifiers.md
@@ -10,3 +10,5 @@ ML D4
 Assume you are using an SVM classifier, a random forest, and logistic regression. The output of each is computed, and whichever classification gets the most votes becomes the final output. 
 
 This process aggregates the outputs of the individual models into one output. Majority voting is called hard voting where the most popular output is chosen. 
+
+The alternative to hard voting is soft voting, which averages the class probabilities output by each model and picks the class with the highest average probability.
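Hard and soft voting as described in VotingClassifiers.md can disagree when one model is much more confident than the others. The sketch below uses hypothetical probability outputs from three models on a single two-class input; it is an illustration of the idea, not scikit-learn's `VotingClassifier`.

```python
from collections import Counter

# Hypothetical outputs of three models for one input: [P(class 0), P(class 1)].
probas = [
    [0.55, 0.45],  # model 1 leans slightly toward class 0
    [0.55, 0.45],  # model 2 leans slightly toward class 0
    [0.10, 0.90],  # model 3 is confident in class 1
]

def hard_vote(probas):
    """Each model votes for its most likely class; the most common vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the probabilities per class and pick the class with the highest average."""
    n_classes = len(probas[0])
    averages = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=averages.__getitem__)

print("hard voting picks class", hard_vote(probas))
print("soft voting picks class", soft_vote(probas))
```

Two of the three models vote for class 0, so hard voting picks class 0, while the averaged probabilities (0.4 vs 0.6) make soft voting pick class 1.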
