Completed notes on clustering algorithms and dimensionality reduction algorithms - notes

commit e1ab383509ea71e9b938862a8f85bb86112a1574
parent c157c455eb08ab8a4a086f9201f75cbf802aaf97
Author: Andrew <andrewlaack1@gmail.com>
Date:   Mon, 10 Jun 2024 16:28:21 -0500

Completed notes on clustering algorithms and dimensionality reduction algorithms

Diffstat:
A Affinity.md  | 10 ++++++++++
A DBSCAN.md  | 14 ++++++++++++++
A Inertia.md  | 10 ++++++++++
M KMeans.md  | 6 +++++-
A LLE.md  | 12 ++++++++++++
M MachineLearning.md  | 10 +++++-----
M ManifoldLearning.md  | 2 +-
M PCA.md  | 4 ++++
A RandomProjection.md  | 32 ++++++++++++++++++++++++++++++++
A Segmentation.md  | 14 ++++++++++++++

10 files changed, 107 insertions(+), 7 deletions(-)
diff --git a/Affinity.md b/Affinity.md
@@ -0,0 +1,10 @@
+:ml:
+# Affinity
+
+ML D5
+
+## Notes
+
+**Definition:** Affinity is any measure of how well an instance fits into a given cluster. 
+
+This is closely related to unsupervised clustering algorithms.
diff --git a/DBSCAN.md b/DBSCAN.md
@@ -0,0 +1,14 @@
+:ml:
+# DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
+
+ML D5
+
+## Notes
+
+**Definition:** DBSCAN is a clustering algorithm that defines clusters as continuous regions of high density.
+
+Steps to perform:
+1. For each instance, count how many instances are in its neighborhood (within a small distance ε of it).
+2. If it has at least min_samples instances in its neighborhood, it is a core instance (located in a dense area).
+3. All instances in the neighborhood of a core instance belong to the same cluster. The neighborhood may contain other core instances, so a long chain of neighboring core instances forms a single cluster.
+4. Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.
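The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name `dbscan_sketch`, the parameter name `eps` for the neighborhood radius, and the toy points are all made up for the example; only `min_samples` comes from the notes.

```python
from math import dist

def dbscan_sketch(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for an anomaly."""
    n = len(points)
    # Step 1: the eps-neighborhood of each instance (the instance itself is included).
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    # Step 2: core instances have at least min_samples instances in their neighborhood.
    is_core = [len(nb) >= min_samples for nb in neighbors]

    labels = [-1] * n  # Step 4: anything never reached from a core instance stays -1.
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Step 3: grow a cluster outward from this core instance.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:
                        stack.append(q)  # only core instances extend the cluster
        cluster += 1
    return labels

# Hypothetical toy data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0),
          (5.0, 5.0), (5.0, 5.5), (5.5, 5.0),
          (20.0, 20.0)]
labels = dbscan_sketch(points, eps=1.0, min_samples=3)
print(labels)
```

The two tight groups each become a cluster, and the isolated point keeps the label -1 because it is neither a core instance nor in the neighborhood of one.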
diff --git a/Inertia.md b/Inertia.md
@@ -0,0 +1,10 @@
+:ml:
+# Inertia
+
+ML D5
+
+## Notes
+
+**Definition:** Inertia in machine learning is the sum of the squared distances from instances to their closest centroid. 
+
+This is often used as a gauge for the quality of a [[KMeans.md]] model: for a fixed number of clusters, lower inertia means instances sit closer to their centroids.
diff --git a/KMeans.md b/KMeans.md
@@ -1,5 +1,5 @@
 :ml:
-# K-means Clustering
+# K-means (Clustering)
 
 ML CH2
 
@@ -13,3 +13,7 @@ Basic idea:
 2. Go through elements finding nearest centroid mean
 3. Add item to centroid and update the mean position
 4. Repeat Step 2
+
+K-means can converge to a local optimum instead of the global optimum, depending on where the centroids start. If the approximate centroid locations are known, one fix is to pass in a list of starting positions for the centroids.
+
+Another solution is to run the algorithm multiple times with different random starting positions and keep the solution with the lowest [[Inertia.md]].
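The multiple-restart idea can be sketched in plain Python. This is a hedged illustration, not how any particular library does it: it uses the standard batch (Lloyd) form of k-means rather than the item-by-item update listed in the basic idea, and the names `inertia`, `kmeans_once`, `kmeans_best_of`, `n_init`, and the toy points are assumptions made for the example.

```python
import random
from math import dist

def inertia(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(dist(p, c) ** 2 for c in centroids) for p in points)

def kmeans_once(points, k, rng, n_iter=20):
    """One k-means run, starting from k randomly chosen points as centroids."""
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def kmeans_best_of(points, k, n_init=10, seed=0):
    """Run kmeans_once n_init times and keep the centroids with the lowest inertia."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda c: inertia(points, c))

# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
best = kmeans_best_of(points, k=2)
print(sorted(best), inertia(points, best))
```

Selecting by inertia only makes sense when every run uses the same number of clusters, since inertia always drops as more centroids are added.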
diff --git a/LLE.md b/LLE.md
@@ -0,0 +1,12 @@
+:ml:
+# LLE (Locally Linear Embedding)
+
+ML D5
+
+## Notes
+
+**Definition:** LLE is a dimensionality reduction technique that uses manifold learning instead of projection.
+
+LLE works by first measuring how each training instance relates linearly to its nearest neighbors, and then looking for a low-dimensional representation of the training set where these local relationships are best preserved.
+
+This approach is good at unrolling twisted manifolds when there is not too much noise.
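A short sketch of LLE unrolling a twisted manifold, assuming scikit-learn is installed. The swiss roll dataset and the choices `n_samples=500`, `noise=0.1`, and `n_neighbors=10` are illustrative assumptions, not values from the notes.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Toy 3D dataset: a 2D sheet rolled up into a "swiss roll", with a little noise.
X_swiss, t = make_swiss_roll(n_samples=500, noise=0.1, random_state=42)

# n_neighbors sets how many nearest neighbors define each local relationship.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X_swiss)

print(X_swiss.shape, X_unrolled.shape)
```

Raising the `noise` value is a quick way to see the caveat above: with enough noise the neighbors stop reflecting the underlying sheet and the embedding degrades.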
diff --git a/MachineLearning.md b/MachineLearning.md
@@ -124,9 +124,9 @@ Concepts:
 [[Subspace.md]]
 [[ManifoldLearning.md]]
 [[PCA.md]]
+[[RandomProjection.md]]
+[[LLE.md]]
+[[Affinity.md]]
+[[Segmentation.md]]
+[[DBSCAN.md]]
 
-To do:
-
-[[DeepLearning.md]]
-[[Kernels.md]]
-[[Backpropagation.md]]
diff --git a/ManifoldLearning.md b/ManifoldLearning.md
@@ -4,7 +4,7 @@ ML D5
 
 ## Notes
 
-**Definition:** Manifold learning is the process of mapping a 3D object to a 2D manifold.
+**Definition:** Manifold learning is the process of mapping a higher dimensional object to a lower dimensional manifold.
 
 Manifolds are representations of objects in higher dimensional space using lower dimensional space such that they still maintain attributes. This can be thought of like uv wrapping.
 
diff --git a/PCA.md b/PCA.md
@@ -10,3 +10,7 @@ ML D5
 The goal of this algorithm is to preserve maximum variance so values in the dataset are optimally spread out.
 
 The way to describe this as a cost function would be to minimize the mean squared distance between the original dataset and the projected position.
+
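+PCA compresses the data, and it is possible to get close to the original values by decompressing it. In sklearn this is done with the `inverse_transform` method; the result is not exact because the variance in the dropped dimensions is lost.
+
+There is also IPCA (incremental PCA), which allows out-of-core processing by fitting on one mini-batch at a time. It is useful in conjunction with `np.memmap`, which lets a large array stored on disk be used as if it were in memory, loading only the parts that are needed.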
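A minimal sketch of compressing and decompressing with PCA, assuming scikit-learn and NumPy are installed. The synthetic dataset (points lying close to a 2D plane in 5D space) and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: 200 points that lie close to a 2D plane inside 5D space.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                # compress: 5 -> 2 dimensions
X_recovered = pca.inverse_transform(X_reduced)  # decompress: 2 -> 5 dimensions

# Mean squared reconstruction error; small here because the data is nearly 2D.
reconstruction_error = np.mean((X - X_recovered) ** 2)
print(X_reduced.shape, X_recovered.shape, reconstruction_error)
```

The reconstruction error is the same quantity as the mean squared distance between the original dataset and its projection described in the cost function above.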
diff --git a/RandomProjection.md b/RandomProjection.md
@@ -0,0 +1,32 @@
+:ml:
+# Random Projection
+
+## Notes
+
+**Definition:** Random projection is an algorithm that projects the data onto a lower-dimensional space using a random linear projection.
+
+Random projection is used because PCA can be slow on very high-dimensional data, and it has been shown that a random projection is very likely to preserve the distances between instances fairly well, so it does not lose too much information.
+
+This is for when you have something like 20,000 dimensions.
+
+sklearn's `johnson_lindenstrauss_min_dim` function (in `sklearn.random_projection`) takes the number of samples and a value `eps` representing the acceptable distortion, and returns the minimum number of dimensions needed to keep distances within that tolerance.
+
+Example:
+```python3
+from sklearn.random_projection import johnson_lindenstrauss_min_dim
+m, ε = 5_000, 0.1
+d = johnson_lindenstrauss_min_dim(m, eps=ε)
+d
+```
+
+The output of this is 7300, so higher-dimensional data with 5,000 instances can be randomly projected down to 7,300 dimensions without the squared distance between any two instances changing by more than approximately 10%.
+
+Below is an example implementation of this random projection where we simply pass in the acceptable loss amount:
+
+```python3
+
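+from sklearn.random_projection import GaussianRandomProjection
+# Assumes ε from the previous example and X, an array with m rows and more than d columns.
+gaussian_rnd_proj = GaussianRandomProjection(eps=ε, random_state=42)
+X_reduced = gaussian_rnd_proj.fit_transform(X) # target dimension is derived from eps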
+
+```
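The value returned by `johnson_lindenstrauss_min_dim` in the first example can be reproduced by hand from the Johnson-Lindenstrauss bound, d ≥ 4 ln(m) / (ε²/2 − ε³/3). The helper name `jl_min_dim` below is hypothetical and only mirrors that formula; it is a sketch, not the library's code.

```python
from math import log

def jl_min_dim(m, eps):
    """Johnson-Lindenstrauss bound: 4 ln(m) / (eps^2 / 2 - eps^3 / 3), rounded down."""
    return int(4 * log(m) / (eps ** 2 / 2 - eps ** 3 / 3))

d_manual = jl_min_dim(5_000, 0.1)
print(d_manual)  # 7300
```

The bound depends only on the number of instances and ε, not on the original number of dimensions, which is why random projection only pays off when the data starts out with far more dimensions than the bound.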
diff --git a/Segmentation.md b/Segmentation.md
@@ -0,0 +1,14 @@
+:ml:
+# Segmentation
+
+ML D5
+
+## Notes
+
+**Definition:** Segmentation in machine learning is the process of breaking up a large group into smaller ones.
+
+Image segmentation is partitioning an image into multiple segments. There are a few different types:
+
+1. Color Segmentation - Pixels with similar colors are assigned to the same segment.
+2. Semantic Segmentation - All pixels that are part of the same object type are assigned to the same segment (one segment for all people).
+3. Instance Segmentation - All pixels that are part of the same individual object are assigned to the same segment (one segment per person).
