commit 0a9290b1deddbcebfbf5c00dc9795caccfb27fa7
parent a2936e1937d4b8f59bec9f5df6a3f1b9e3561c4f
Author: Andrew <andrewlaack1@gmail.com>
Date: Sat, 2 Nov 2024 15:47:47 -0500
Took some more notes
Diffstat:
18 files changed, 173 insertions(+), 20 deletions(-)
diff --git a/Bandits.md b/Bandits.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Bandits
+
+L1
+
+## Notes
+
+**Definition:** Bandits are a class of problems in RL where an agent repeatedly chooses from a set of actions, each of which yields a reward drawn from an unknown probability distribution.
diff --git a/CreditAssignmentProblem.md b/CreditAssignmentProblem.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Credit Assignment Problem
+
+L1
+
+## Notes
+
+**Definition:** The credit assignment problem is an RL problem where we need to determine how to rate choices in the near term given their long-term consequences.
diff --git a/DiscountFactor.md b/DiscountFactor.md
@@ -0,0 +1,10 @@
+:rl: :ml:
+# Discount Factor
+
+L2
+
+## Notes
+
+**Definition:** The discount factor in RL is the value gamma we use to describe how much or little we care about long term rewards with respect to the value function.
+
+The discount factor is raised to the power of the number of steps between now and the reward. So if gamma = 0.5, we care only 0.5x as much about the next step's reward as about the current one, 0.25x as much about the one after that, and so on.
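+
+As a worked sketch of this weighting (the symbols $R_{t+k}$ for the reward $k$ steps ahead and $G_t$ for the discounted sum are notation assumed here, not taken from the lecture):
+
+$$
+G_t = \gamma^0 R_t + \gamma^1 R_{t+1} + \gamma^2 R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}
+$$
+
+With gamma = 0.5 the weights are 1, 0.5, 0.25, ..., so a constant reward of 1 at every step would sum to 1 / (1 - 0.5) = 2.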
diff --git a/Evaluation.md b/Evaluation.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Evaluation
+
+L1
+
+## Notes
+
+**Definition:** Evaluation in RL is the process of estimating how good a given policy is, i.e. how much reward it can be expected to collect.
diff --git a/EvolutionaryMethods.md b/EvolutionaryMethods.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Evolutionary Methods
+
+RL Ch 1
+
+## Notes
+
+**Definition:** Evolutionary methods are a class of RL strategies where learning does not happen while interacting with the environment. Instead, whole policies are evaluated and then updated using a strategy akin to evolution, where the best-performing ones carry on.
diff --git a/Exploit.md b/Exploit.md
@@ -0,0 +1,10 @@
+:rl: :ml:
+# Exploit
+
+RL Ch 1
+
+## Notes
+
+**Definition:** To exploit in RL means to take the known best move in the current state.
+
+This is the opposite of exploring, which means taking a move that is not the current best known option (for example, a random one) to see how it plays out, in case it turns out to be better.
diff --git a/Explore.md b/Explore.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Explore
+
+RL Ch 1
+
+## Notes
+
+**Definition:** To explore in RL means to select an option that is either unknown or believed suboptimal and then continue to evaluate that path, in the hope that it leads to a better outcome than the known best option.
diff --git a/ImitationLearning.md b/ImitationLearning.md
@@ -0,0 +1,10 @@
+:ml:
+# Imitation Learning
+
+L1
+
+## Notes
+
+**Definition:** Imitation learning is not RL. It is the process of training a model on expert data, which makes it a form of supervised learning.
+
+Tangentially related is inverse reinforcement learning, where a model learns the reward function that the expert is trying to follow.
diff --git a/MarkovAssumption.md b/MarkovAssumption.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Markov Assumption
+
+L1
+
+## Notes
+
+**Definition:** The Markov assumption is the assumption that, given the current state, prior events don't matter: all the information needed to determine the future is in the current state.
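+
+Written as an equation, this is the usual statement of the Markov property (the notation $S_t$ for the state at step $t$ is assumed here):
+
+$$
+P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1, S_2, \dots, S_t)
+$$
+
+Conditioning on the full history gives the same prediction as conditioning on the current state alone.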
diff --git a/MarkovDecisionProcesses.md b/MarkovDecisionProcesses.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Markov Decision Process (MDP)
+
+RL Ch 1
+
+## Notes
+
+**Definition:** Markov decision processes are used to model decision making processes that are partly stochastic and partly controlled via decisions.
diff --git a/MarkovRewardProcess.md b/MarkovRewardProcess.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Markov Reward Process (MRP)
+
+L2
+
+## Notes
+
+**Definition:** A Markov reward process is a Markov chain with rewards associated with states or transitions.
diff --git a/Model.md b/Model.md
@@ -0,0 +1,13 @@
+:rl: :ml:
+# Model
+
+RL Ch 1
+
+## Notes
+
+**Definition:** A model in RL is an agent's representation of its environment that allows it to predict expected outcomes.
+
+There are two parts to the model:
+
+1. Transition Model (probabilities of switching between states)
+2. Reward Model (expected rewards after taking certain actions)
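+
+A common way to write these two parts, sketched here with assumed notation ($s$ for the current state, $a$ for the action, $s'$ for the next state, $R$ for the reward that follows the action):
+
+$$
+p(s' \mid s, a) = P(S_{t+1} = s' \mid S_t = s, A_t = a), \qquad r(s, a) = \mathbb{E}[R \mid S_t = s, A_t = a]
+$$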
diff --git a/ModelFree.md b/ModelFree.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Model Free
+
+L1
+
+## Notes
+
+**Definition:** A model-free approach in RL means the agent neither knows nor estimates the probabilities of state transitions and, as such, learns directly from experience.
diff --git a/PartiallyObservableMarkovDecisionProcess.md b/PartiallyObservableMarkovDecisionProcess.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Partially Observable Markov Decision Process (POMDP)
+
+L1
+
+## Notes
+
+**Definition:** A partially observable Markov decision process is a type of Markov decision process where the agent doesn't have access to the entire current state and only sees partial observations of it.
diff --git a/Policy.md b/Policy.md
@@ -0,0 +1,10 @@
+:rl: :ml:
+# Policy
+
+RL Ch 1
+
+## Notes
+
+**Definition:** A policy in RL is a function from the current state to the action an agent will take.
+
+Basically, this dictates what the agent will do in a given scenario.
diff --git a/ReinforcementLearning.md b/ReinforcementLearning.md
@@ -6,28 +6,26 @@ Reinforcement Learning Index
Reinforcement Learning: An Introduction (Sutton & Barto)
Chapter 1 (Introduction)
-* MarkovDecisionProcesses
-* Exploit
-* Explore
-* Policy
-* RewardSignal
-* ValueFunction
-* Model
-* EvolutionaryMethods (learn not interacting)
+* [MarkovDecisionProcesses](MarkovDecisionProcesses.md)
+* [Exploit](Exploit.md)
+* [Explore](Explore.md)
+* [Policy](Policy.md)
+* [RewardSignal](RewardSignal.md)
+* [ValueFunction](ValueFunction.md)
+* [Model](Model.md)
+* [EvolutionaryMethods](EvolutionaryMethods.md)
-Stanford Lectures
+DeepMind UCL Lectures
L1
-* CreditAssignmentProblem
-* ImitationLearning (separate)
-* MarkovAssumption
-* MDP
-* POMDP
-* ModelFree
-* Bandits
-* Evaluation
-* Control
+* [CreditAssignmentProblem](CreditAssignmentProblem.md)
+* [ImitationLearning](ImitationLearning.md) (separate)
+* [MarkovAssumption](MarkovAssumption.md)
+* [PartiallyObservableMarkovDecisionProcess](PartiallyObservableMarkovDecisionProcess.md)
+* [ModelFree](ModelFree.md)
+* [Bandits](Bandits.md)
+* [Evaluation](Evaluation.md)
L2
-* DiscountFactor (MRP gamma)
-* MarkovRewardProcess
+* [DiscountFactor](DiscountFactor.md)
+* [MarkovRewardProcess](MarkovRewardProcess.md)
diff --git a/RewardSignal.md b/RewardSignal.md
@@ -0,0 +1,10 @@
+:rl: :ml:
+# Reward Signal
+
+RL Ch 1
+
+## Notes
+
+**Definition:** The reward signal is an immediate signal sent to an agent telling it how good something is right now.
+
+In this context, "right now" may mean that the current state is good or that the next state will be good given the action currently chosen.
diff --git a/ValueFunction.md b/ValueFunction.md
@@ -0,0 +1,12 @@
+:rl: :ml:
+# Value Function
+
+RL Ch 1
+
+## Notes
+
+**Definition:** The value function describes the total reward an agent can expect to accumulate from a given state onward.
+
+This includes a gamma term (discount factor), which is between 0 and 1, with 0 meaning future rewards count for nothing and 1 meaning future rewards are just as important as short-term rewards.
+
+When evaluating this function, each reward is multiplied by gamma raised to the power of its term number (how many steps in the future it occurs), so the weights form a geometric sequence.
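+
+One way to write this down, as a sketch with assumed notation ($v(s)$ for the value of state $s$, $R_{t+k}$ for the reward $k$ steps in the future, and the expectation taken over the agent's behavior and the environment's randomness):
+
+$$
+v(s) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k} \,\middle|\, S_t = s \right]
+$$
+
+Because the weights are geometric, the sum stays finite whenever gamma < 1 and rewards are bounded: if $|R_{t+k}| \le R_{\max}$ for every $k$, then $|v(s)| \le R_{\max} / (1 - \gamma)$.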