notes


commit 0a9290b1deddbcebfbf5c00dc9795caccfb27fa7
parent a2936e1937d4b8f59bec9f5df6a3f1b9e3561c4f
Author: Andrew <andrewlaack1@gmail.com>
Date:   Sat,  2 Nov 2024 15:47:47 -0500

Took some more notes

Diffstat:
A Bandits.md | 8 ++++++++
A CreditAssignmentProblem.md | 8 ++++++++
A DiscountFactor.md | 10 ++++++++++
A Evaluation.md | 8 ++++++++
A EvolutionaryMethods.md | 8 ++++++++
A Exploit.md | 10 ++++++++++
A Explore.md | 8 ++++++++
A ImitationLearning.md | 10 ++++++++++
A MarkovAssumption.md | 8 ++++++++
A MarkovDecisionProcesses.md | 8 ++++++++
A MarkovRewardProcess.md | 8 ++++++++
A Model.md | 13 +++++++++++++
A ModelFree.md | 8 ++++++++
A PartiallyObservableMarkovDecisionProcess.md | 8 ++++++++
A Policy.md | 10 ++++++++++
M ReinforcementLearning.md | 38 ++++++++++++++++++--------------------
A RewardSignal.md | 10 ++++++++++
A ValueFunction.md | 12 ++++++++++++
18 files changed, 173 insertions(+), 20 deletions(-)

diff --git a/Bandits.md b/Bandits.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Bandits
+
+L1
+
+## Notes
+
+**Definition:** Bandits are a class of problems in RL where an agent repeatedly chooses from a set of actions, each of which gives a reward drawn from an unknown probability distribution.
diff --git a/CreditAssignmentProblem.md b/CreditAssignmentProblem.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Credit Assignment Problem
+
+L1
+
+## Notes
+
+**Definition:** The credit assignment problem is an RL problem where we need to determine how to rate choices in the near term given their long-term consequences.
diff --git a/DiscountFactor.md b/DiscountFactor.md
@@ -0,0 +1,10 @@
+:rl: :ml:
+# Discount Factor
+
+L2
+
+## Notes
+
+**Definition:** The discount factor in RL is the value gamma that describes how much or how little we care about long-term rewards in the value function.
+
+The discount factor is raised to the power of the number of steps between now and the reward, so if gamma = 0.5 we care 0.5x as much about the next step's reward as the current one, 0.25x as much about the one after that, and so on.
diff --git a/Evaluation.md b/Evaluation.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Evaluation
+
+L1
+
+## Notes
+
+**Definition:** Evaluation in RL is the process of seeing how good a policy is.
diff --git a/EvolutionaryMethods.md b/EvolutionaryMethods.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Evolutionary Methods
+
+RL Ch 1
+
+## Notes
+
+**Definition:** Evolutionary methods are a class of RL strategies where learning is not done by interacting with the environment, but rather by updating policies using a strategy akin to evolution, where the best models continue on.
diff --git a/Exploit.md b/Exploit.md
@@ -0,0 +1,10 @@
+:rl: :ml:
+# Exploit
+
+RL Ch 1
+
+## Notes
+
+**Definition:** To exploit in RL means to take the known best move in the current state.
+
+This is the opposite of explore, which is to take a random move and see how it plays out in the future in case it turns out to be better than the current best known option.
diff --git a/Explore.md b/Explore.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Explore
+
+RL Ch 1
+
+## Notes
+
+**Definition:** To explore in RL means to select an option that is either unknown or suboptimal and then continue to evaluate that path in the hope that it leads to a better outcome than the known best option.
diff --git a/ImitationLearning.md b/ImitationLearning.md
@@ -0,0 +1,10 @@
+:ml:
+# Imitation Learning
+
+L1
+
+## Notes
+
+**Definition:** Imitation learning is not RL. It is the process of training a model on expert data, making it a form of supervised learning.
+
+Tangentially related is inverse reinforcement learning, where a model learns the reward function that the expert is trying to follow.
diff --git a/MarkovAssumption.md b/MarkovAssumption.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Markov Assumption
+
+L1
+
+## Notes
+
+**Definition:** The Markov assumption is the assumption that prior events don't matter and all necessary information that dictates the future is in the current state.
diff --git a/MarkovDecisionProcesses.md b/MarkovDecisionProcesses.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Markov Decision Process (MDP)
+
+RL Ch 1
+
+## Notes
+
+**Definition:** Markov decision processes are used to model decision-making processes that are partly stochastic and partly controlled via decisions.
diff --git a/MarkovRewardProcess.md b/MarkovRewardProcess.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Markov Reward Process (MRP)
+
+L2
+
+## Notes
+
+**Definition:** A Markov reward process is a Markov chain with values associated with states or transitions.
diff --git a/Model.md b/Model.md
@@ -0,0 +1,13 @@
+:rl: :ml:
+# Model
+
+RL Ch 1
+
+## Notes
+
+**Definition:** A model in RL is an agent's representation of its environment that allows it to predict expected outcomes.
+
+There are two parts to the model:
+
+1. Transition Model (probabilities of switching between states)
+2. Reward Model (expected rewards after taking certain actions)
diff --git a/ModelFree.md b/ModelFree.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Model Free
+
+L1
+
+## Notes
+
+**Definition:** A model-free approach in RL means the agent does not know or estimate the probabilities of state transitions and as such learns directly from experience.
diff --git a/PartiallyObservableMarkovDecisionProcess.md b/PartiallyObservableMarkovDecisionProcess.md
@@ -0,0 +1,8 @@
+:rl: :ml:
+# Partially Observable Markov Decision Process (POMDP)
+
+L1
+
+## Notes
+
+**Definition:** A partially observable Markov decision process is a type of Markov decision process where the agent doesn't have access to the entire current state.
diff --git a/Policy.md b/Policy.md
@@ -0,0 +1,10 @@
+:rl: :ml:
+# Policy
+
+RL Ch 1
+
+## Notes
+
+**Definition:** A policy in machine learning is a function from the current state to the action an agent will take.
+
+Basically, this dictates what the agent will do in a given scenario.
diff --git a/ReinforcementLearning.md b/ReinforcementLearning.md
@@ -6,28 +6,26 @@ Reinforcement Learning Index
 Reinforcement Learning: An Introduction (Sutton & Barto)
 
 Chapter 1 (Introduction)
-* MarkovDecisionProcesses
-* Exploit
-* Explore
-* Policy
-* RewardSignal
-* ValueFunction
-* Model
-* EvolutionaryMethods (learn not interacting)
+* [MarkovDecisionProcesses](MarkovDecisionProcesses.md)
+* [Exploit](Exploit.md)
+* [Explore](Explore.md)
+* [Policy](Policy.md)
+* [RewardSignal](RewardSignal.md)
+* [ValueFunction](ValueFunction.md)
+* [Model](Model.md)
+* [EvolutionaryMethods](EvolutionaryMethods.md)
 
-Stanford Lectures
+DeepMind UCL Lectures
 
 L1
-* CreditAssignmentProblem
-* ImitationLearning (separate)
-* MarkovAssumption
-* MDP
-* POMDP
-* ModelFree
-* Bandits
-* Evaluation
-* Control
+* [CreditAssignmentProblem](CreditAssignmentProblem.md)
+* [ImitationLearning](ImitationLearning.md) (separate)
+* [MarkovAssumption](MarkovAssumption.md)
+* [PartiallyObservableMarkovDecisionProcess](PartiallyObservableMarkovDecisionProcess.md)
+* [ModelFree](ModelFree.md)
+* [Bandits](Bandits.md)
+* [Evaluation](Evaluation.md)
 
 L2
-* DiscountFactor (MRP gamma)
-* MarkovRewardProcess
+* [DiscountFactor](DiscountFactor.md)
+* [MarkovRewardProcess](MarkovRewardProcess.md)
diff --git a/RewardSignal.md b/RewardSignal.md
@@ -0,0 +1,10 @@
+:rl: :ml:
+# Reward Signal
+
+RL Ch 1
+
+## Notes
+
+**Definition:** The reward signal is a one-time signal sent to an agent telling it that something right now is good.
+
+In this context, "right now" may mean that the current state is good or that the next state will be good based on the action currently chosen.
diff --git a/ValueFunction.md b/ValueFunction.md
@@ -0,0 +1,12 @@
+:rl: :ml:
+# Value Function
+
+RL Ch 1
+
+## Notes
+
+**Definition:** The value function describes the overall expected reward for an agent.
+
+This includes a gamma term (discount factor) which is between 0 and 1, with 0 meaning future rewards don't mean anything and 1 meaning future rewards are as important as short-term rewards.
+
+When evaluating this function we take gamma to the power of the term number (how many steps in the future it is), which makes the weights a geometric sequence.
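
The discounting described in DiscountFactor.md and ValueFunction.md can be sketched in a few lines of Python. This is only an illustration and is not part of the commit: the function names `discount_weights` and `discounted_return` and the example reward list are hypothetical, and the sketch assumes a finite list of rewards where index 0 is the current step.

```python
def discount_weights(gamma: float, steps: int) -> list[float]:
    """Weight given to the reward k steps ahead: gamma ** k (a geometric sequence)."""
    return [gamma ** k for k in range(steps)]


def discounted_return(rewards: list[float], gamma: float) -> float:
    """Sum of rewards, each scaled by gamma raised to its step index."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


if __name__ == "__main__":
    # With gamma = 0.5 the next step counts 0.5x, the one after 0.25x, and so on.
    print(discount_weights(0.5, 4))              # [1.0, 0.5, 0.25, 0.125]
    print(discounted_return([1, 1, 1, 1], 0.5))  # 1.875
```

With gamma = 0 only the immediate reward survives, and with gamma = 1 the result is the plain sum of all rewards, which matches the two extremes described in the ValueFunction.md note.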