commit 04003be530824c91d5a8ef8c7732d38fab24ef42
parent 651def6a288be3c91a398420ec08966e0825b3ec
Author: Andrew Laack <andrew@laack.co>
Date: Sun, 18 Jan 2026 23:33:25 -0600
Refactoring, OS notes, DL notes
Diffstat:
25 files changed, 105 insertions(+), 59 deletions(-)
diff --git a/docs/Adder.md b/docs/Adder.md
diff --git a/docs/Attention.md b/docs/Attention.md
@@ -0,0 +1,32 @@
+# Attention
+
+**Source:** [https://arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762)
+
+**Definition:** Attention is a method to determine the importance of each token in a sequence of tokens relative to other tokens in said sequence.
+
+## Standard Scaled Dot-Product Self-Attention
+
+Consider the following matrices:
+
+- $Q$ - Query matrix
+ - This represents what we **want** to get from the other vectors
+ - Basically, we are interested in gathering some type of information from the other vectors because that information is useful in the current context.
+- $K$ - Key matrix
+ - This matrix represents the information each vector offers.
+ - The key can be thought of as indexing / labeling the information afforded by the vector.
+- $V$ - Value matrix
+ - This matrix represents the content each vector contributes.
+
+We then have $\text{Attention}(Q,K,V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$
+
+### Break Down
+
+1. $QK^T$
+ - This gives us a matrix of the dot products between queries and keys, giving their similarity.
+ - This is useful because it describes how relevant each key is to the queried information.
+2. $\frac{QK^T}{\sqrt{d_k}}$
+ - $d_k$ is the dimension of each key vector. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with $d_k$, which would otherwise push the softmax into regions with very small gradients.
+3. $\text{softmax}(\frac{QK^T}{\sqrt{d_k}})$
+ - This converts our similarity scores (dot products) into attention weights describing how much attention each query gives to each key. Each row (one per query) is normalized to sum to 1.
+4. $\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$
+ - This computes the weighted sum (weighted by the attention weights) of the value vectors.
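The four steps of the breakdown can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: matrices are lists of row vectors, and the names `softmax_row`, `attention_weights`, and `attention` are hypothetical helpers chosen for this example.

```python
import math


def softmax_row(row):
    """Normalize one row of scores into weights that sum to 1."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(Q, K):
    """Steps 1-3: softmax(Q K^T / sqrt(d_k)), one row of weights per query."""
    d_k = len(K[0])  # dimension of each key vector
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_k) for k in K]
        for q in Q
    ]
    return [softmax_row(row) for row in scores]


def attention(Q, K, V):
    """Step 4: weighted sum of the value vectors for each query."""
    weights = attention_weights(Q, K)
    d_v = len(V[0])
    return [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
        for w_row in weights
    ]
```

A query that is equally similar to every key (for example, the zero vector) gets uniform weights, so its output is the plain average of the value vectors.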
diff --git a/docs/BeamSearch.md b/docs/BeamSearch.md
@@ -0,0 +1,5 @@
+# Beam Search
+
+**Source:** https://en.wikipedia.org/wiki/Beam_search
+
+**Definition:** Beam search is a modification of best-first search where we only explore the best $\beta$ nodes at each step, where $\beta$ is defined as the beam width.
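A minimal sketch of the definition, assuming a level-by-level search in which a higher score is better. The function name `beam_search` and its parameters (`successors`, `score`, `steps`) are hypothetical and chosen only for this illustration; `beam_width` plays the role of $\beta$.

```python
import heapq


def beam_search(start, successors, score, beam_width, steps):
    """Expand the beam level by level, keeping only the best nodes."""
    beam = [start]
    for _ in range(steps):
        # Generate every child of every node currently in the beam.
        candidates = [child for node in beam for child in successors(node)]
        if not candidates:
            break
        # Keep only the beam_width highest-scoring candidates.
        beam = heapq.nlargest(beam_width, candidates, key=score)
    return max(beam, key=score)


# Toy problem: build a tuple of digits from {0, 1, 2} with the largest sum.
def digit_successors(node):
    return [node + (d,) for d in (0, 1, 2)]


best = beam_search((), digit_successors, sum, beam_width=2, steps=3)
```

With `beam_width=1` this reduces to greedy search. A larger width explores more candidates at each step but costs more time and memory.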
diff --git a/docs/Cache.md b/docs/Cache.md
@@ -1,5 +0,0 @@
-# Cache
-
-
-
-
diff --git a/docs/CircuitTechnology.md b/docs/CircuitTechnology.md
@@ -1,3 +0,0 @@
-# Circuit Technology
-
-Discussion of materials, gates, and things of that sort.
diff --git a/docs/ComputerArchitecture.md b/docs/ComputerArchitecture.md
@@ -31,14 +31,3 @@ Links to information learned from computer architecture course
- [BCD](BCD.md)
- [Programmer Visible State](ProgrammerVisibleState.md)
- [MUX](MUX.md)
-
-To do:
-
-- [Hamming](Hamming.md)
-- [Pipeline Control](PipelineControl.md)
-- [Circuit Technology](CircuitTechnology.md)
-- [VLIW](VLIW.md)
-- [SRAM](SRAM.md)
-- [Adder](Adder.md)
-- [Cache](Cache.md)
-- [Critical Path](CriticalPath.md)
diff --git a/docs/ComputerScience.md b/docs/ComputerScience.md
@@ -27,6 +27,7 @@ This is the index for my Computer Science related notes.
- [Haskell](Haskell.md)
- [Free Software](FreeSoftware.md)
- [Code Verification](CodeVerification.md)
+- [Beam Search](BeamSearch.md)
## Forced to Take Notes on
diff --git a/docs/CriticalPath.md b/docs/CriticalPath.md
diff --git a/docs/DeepLearning.md b/docs/DeepLearning.md
@@ -40,7 +40,7 @@ Chapter 2
### Personal Research
-- PEFT
+- PEFT Approaches
- [LoRA](LoRA.md)
- https://arxiv.org/abs/2106.09685
- PLoP
@@ -54,3 +54,6 @@ Chapter 2
- https://arxiv.org/abs/2305.03495
- OPRO
- https://arxiv.org/abs/2309.03409
+
+- Architectures
+ - [Transformers](Transformers.md)
diff --git a/docs/Embedding.md b/docs/Embedding.md
@@ -2,8 +2,6 @@
ML P722
-
-
**Definition:** Embeddings are a high dimensional dense representation of data.
When using one-hot encoding, we get a sparse output with a single 1 and 0s everywhere else. With embeddings, however, every representation is high dimensional and dense.
diff --git a/docs/FreeSoftware.md b/docs/FreeSoftware.md
@@ -10,18 +10,3 @@
## Good Software
- PNG
-
----
-
-What are these patents:
-
-- https://patents.google.com/patent/US20240111498A1/en?q=(LLM)&oq=LLM
-- https://patents.google.com/patent/US11900068B1/en?q=(LLM)&oq=LLM
-- https://patents.google.com/patent/US12148421B2/en?q=(LLM)&oq=LLM
-- https://patents.google.com/patent/US12430503B2/en?q=(LLM)&country=US&status=GRANT&language=ENGLISH&type=PATENT
-- https://patents.google.com/patent/US12242826B2/en?q=(LLM)&country=US&status=GRANT&language=ENGLISH&type=PATENT&page=1
-- https://patents.google.com/patent/US11971914B1/en?q=(LLM)&country=US&status=GRANT&language=ENGLISH&type=PATENT&page=4
-- https://patents.google.com/patent/US12282411B2/en?q=(LLM)&country=US&status=GRANT&language=ENGLISH&type=PATENT&page=5
-- https://patents.google.com/patent/US11775414B2/en?q=(code+evaluation)&country=US&status=GRANT&language=ENGLISH&type=PATENT&oq=(code+evaluation)+country:US+status:GRANT+language:ENGLISH+type:PATENT&page=1
-- https://patents.google.com/patent/US11113185B2/en?q=(code+evaluation)&country=US&status=GRANT&language=ENGLISH&type=PATENT&oq=(code+evaluation)+country:US+status:GRANT+language:ENGLISH+type:PATENT&page=1
-- https://patents.google.com/patent/US10606739B2/en?q=(code+quality+evaluation)&country=US&status=GRANT&language=ENGLISH&type=PATENT&oq=(code+quality+evaluation)+country:US+status:GRANT+language:ENGLISH+type:PATENT&page=1
diff --git a/docs/Hamming.md b/docs/Hamming.md
@@ -1,11 +0,0 @@
-# Hamming
-
-He was a person who was influential to computing
-
-
-
-**Hamming Distance:** The difference between two strings. This is defined as the number of positions that are different.
-
-Hamming distance led to the inception of error correction (hamming codes)
-
-**Hamming Codes:** :todo:
diff --git a/docs/InformationRetrieval.md b/docs/InformationRetrieval.md
@@ -7,3 +7,4 @@
- [Stemming](Stemming.md)
- [Lemmatization](Lemmatization.md)
- [BM25](BM25.md)
+- [Word2vec](Word2vec.md)
diff --git a/docs/MarkovChains.md b/docs/MarkovChains.md
@@ -2,8 +2,6 @@
L13
-
-
**Definition:** A Markov chain is a sequence of events where the probability of any given event depends **entirely** on the previous event.
Because each state must carry all relevant information, we need to choose our states carefully to ensure accuracy.
diff --git a/docs/OperatingSystems.md b/docs/OperatingSystems.md
@@ -40,3 +40,6 @@
- [Modular OS](ModularOS.md)
- [Microkernel](Microkernel.md)
- [Linux](Linux.md)
+- [Process](Process.md)
+- [Virtual Address Space](VirtualAddressSpace.md)
+- [Page Table](PageTable.md)
diff --git a/docs/PageTable.md b/docs/PageTable.md
@@ -0,0 +1,7 @@
+# Page Table
+
+**Source:** CS 6200
+
+**Chapter:** P2L1
+
+**Definition:** A page table is a data structure that stores mappings from virtual memory addresses to physical memory addresses.
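As an illustration of the mapping, here is a minimal sketch that models a page table as a dictionary from virtual page numbers to physical frame numbers. The 4096-byte page size, the dictionary representation, and the name `translate` are assumptions made for this example and are not taken from the course material.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def translate(page_table, virtual_address):
    """Map a virtual address to a physical address through the page table."""
    # Split the address into a virtual page number and an offset in the page.
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        # A real system would raise a page fault here.
        raise LookupError(f"no mapping for virtual page {page_number}")
    frame_number = page_table[page_number]
    # The offset within the page is unchanged by translation.
    return frame_number * PAGE_SIZE + offset


# Hypothetical table: virtual page 0 -> frame 5, virtual page 1 -> frame 2.
example_table = {0: 5, 1: 2}
physical = translate(example_table, 4100)  # page 1, offset 4
```

Only the page number is looked up. The offset is carried over as-is, which is why the table needs one entry per page rather than one per address.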
diff --git a/docs/PipelineControl.md b/docs/PipelineControl.md
@@ -1,7 +0,0 @@
-# Pipline Control
-
-CA L3
-
-
-
-**Definition:** Pipline control describes the management and coordinatei
diff --git a/docs/Process.md b/docs/Process.md
@@ -0,0 +1,11 @@
+# Process
+
+**Source:** CS 6200
+
+**Chapter:** P2L1
+
+**Definition:** A process is an instance of an executing program.
+
+## Specifics
+
+Each process has its own virtual address space.
diff --git a/docs/SRAM.md b/docs/SRAM.md
diff --git a/docs/SoftmaxRegression.md b/docs/SoftmaxRegression.md
@@ -2,8 +2,6 @@
ML D3
-
-
**Definition:** Softmax regression computes a linear score for each of $k$ classes for a sample and then uses the softmax function to turn those scores into the probability of the sample being a member of each class.
The softmax function exponentiates each element $z_i$ to get $e^{z_i}$, sums these values, and then divides the exponential of each element by the sum of all exponentials.
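A minimal sketch of the two steps in plain Python, assuming one weight vector and one bias per class. The names `softmax` and `predict_proba`, and the `weights` / `biases` layout, are hypothetical choices for this illustration.

```python
import math


def softmax(scores):
    """Exponentiate each score and divide by the sum of all exponentials."""
    shift = max(scores)  # subtracting the max avoids overflow; result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(x, weights, biases):
    """Compute one linear score per class, then apply softmax."""
    scores = [
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return softmax(scores)


# Two classes, two features: class 0 responds to the first feature only.
probs = predict_proba([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

The predicted class is the index of the largest probability, which here is class 0.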
diff --git a/docs/TF-IDF.md b/docs/TF-IDF.md
@@ -113,3 +113,7 @@ if __name__ == "__main__":
for i in range(top_k):
print(sorted_items[i])
```
+
+## TF-IDF For Feature Extraction
+
+There is some literature about using TF-IDF for feature extraction. In particular, existing approaches have used TF-IDF with [SVMs](SVM.md) and (separately) [Naive Bayes](NaiveBayes.md) for text classification tasks, first computing features with TF-IDF, and then passing the processed samples into the subsequent classifier. That said, the literature appears sparse.
diff --git a/docs/Transformers.md b/docs/Transformers.md
@@ -0,0 +1,9 @@
+# Transformers
+
+**Source:** [https://arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762)
+
+**Definition:** Transformers (as originally introduced) are a neural network architecture consisting of an encoder and a decoder that use [attention](Attention.md).
+
+## Attention is All You Need
+
+Existing approaches for sequence transduction (input sequence -> output sequence) used RNNs and CNNs with encoders and decoders. The best models connected these encoders and decoders with attention. Transformers are an architecture that uses attention without the recurrence or convolutions of those earlier approaches.
diff --git a/docs/VLIW.md b/docs/VLIW.md
diff --git a/docs/VirtualAddressSpace.md b/docs/VirtualAddressSpace.md
@@ -0,0 +1,23 @@
+# Virtual Address Space
+
+**Source:** CS 6200
+
+**Chapter:** P2L1
+
+**Definition:** A virtual address space is a contiguous block of virtual memory addresses made accessible to a process by an operating system.
+
+## Specifics
+
+A virtual address space is addressed from $v_0$ to $v_{max}$.
+
+The virtual address space for a process is made of the following parts:
+
+- Static state (available when the process first loads)
+ - This is memory that is allocated when the process first loads
+ - text (code)
+ - data
+- Heap
+- Stack
+ - Grows and shrinks during execution using LIFO order
+
+See also: [Page Table](PageTable.md)
diff --git a/docs/Word2vec.md b/docs/Word2vec.md
@@ -0,0 +1,5 @@
+# Word2vec
+
+**Source:** [https://arxiv.org/abs/1301.3781](https://arxiv.org/abs/1301.3781)
+
+**Definition:** TODO