22_reinforcement_learning_from_q_value_till_the_end #24
Open
msrazavi wants to merge 40 commits into sut-ai:master from
Conversation

SepehrAsh reviewed Jan 18, 2022
Q-learning is an **off-policy** learner, which means it learns the value of the optimal policy independently of the agent's actions. In other words, it eventually converges to the optimal policy even if you are acting sub-optimally.

Q-learning is a **sample-based** Q-value iteration method in which you learn $Q(s,a)$ values as you go:

- Receive a sample $(s_t, a_t, r_t, s_{t+1})$.
- Incorporate it into a running average:
$$Q(s,a) \leftarrow (1 - \alpha)\,Q(s,a) + \alpha \cdot \mathit{sample}, \qquad \mathit{sample} = r_t + \gamma \max_{a} Q(s_{t+1}, a)$$

Written out in full, this is the temporal-difference update:
$$Q^{new}(s_t,a_t) \leftarrow \underbrace{Q(s_t,a_t)}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot \overbrace{\Big( \underbrace{\underbrace{r_t}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a} Q(s_{t+1},a)}_{\text{estimate of optimal future value}}}_{\text{new value (temporal difference target)}} - \underbrace{Q(s_t,a_t)}_{\text{old value}} \Big)}^{\text{temporal difference}}$$
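This update is straightforward to implement. Below is a minimal sketch of the tabular update in Python; the `ALPHA` and `GAMMA` values, the `Q` table layout, and the `actions` argument are illustrative assumptions, not part of the notes.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)

Q = defaultdict(float)  # Q[(state, action)]; missing entries default to 0.0

def q_update(s, a, r, s_next, actions):
    """Apply one temporal-difference update from a sample (s, a, r, s')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)  # estimate of optimal future value
    td_target = r + GAMMA * best_next                   # temporal difference target
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])        # move old value toward target
```
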
<div id='epsilon-greedy-strategy'><h2> Epsilon greedy strategy </h2></div>
The tradeoff between exploration and exploitation is fundamental. The simplest way to force exploration is the **epsilon greedy strategy**: with a small probability $\epsilon$ the agent takes a random action (exploration), and with probability $(1 - \epsilon)$ it takes the current policy's action (exploitation).
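As a sketch, epsilon-greedy selection needs only one random draw per step. This continues the tabular `Q` and `actions` assumptions from the previous sketch, and the value of `EPSILON` is illustrative.

```python
import random

EPSILON = 0.1  # exploration probability (assumed value)

def epsilon_greedy(s, actions):
    """Random action with probability EPSILON, else the current greedy action."""
    if random.random() < EPSILON:
        return random.choice(actions)             # explore
    return max(actions, key=lambda a: Q[(s, a)])  # exploit current policy
```
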
<div id='exploration-functions'><h2> Exploration functions </h2></div>
Another solution is to use **exploration functions**. Such a function takes a value estimate $u$ and a visit count $n$ and returns an optimistic utility, e.g. $f(u,n) = u + \frac{k}{n}$. The idea is to count how many times each action has been tried: while the count is small, the bonus $\frac{k}{n}$ keeps the action attractive, so we try it more often; as the count grows, the bonus fades, so actions that never return a good value eventually stop being explored. A sketch follows.
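The sketch below selects actions through such an exploration function, continuing the tabular `Q` from the earlier sketches; the constant `K` and the treatment of unvisited pairs are illustrative assumptions.

```python
from collections import defaultdict

K = 1.0               # exploration constant k (assumed value)
N = defaultdict(int)  # visit count per (state, action) pair

def f(u, n):
    """Optimistic utility f(u, n) = u + k / n; unvisited pairs look best."""
    return u + K / n if n > 0 else float("inf")

def pick_action(s, actions):
    """Choose the action with the highest optimistic utility and count the visit."""
    a = max(actions, key=lambda a2: f(Q[(s, a2)], N[(s, a2)]))
    N[(s, a)] += 1
    return a
```
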
$$V(s) = \omega_1 f_1(s) + \omega_2 f_2(s) + \dots + \omega_n f_n(s)$$

$$Q(s,a) = \omega_1 f_1(s,a) + \omega_2 f_2(s,a) + \dots + \omega_n f_n(s,a)$$
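These linear, feature-based values are easy to compute as a dot product of weights and features. The sketch below also applies the standard approximate Q-learning weight update $\omega_i \leftarrow \omega_i + \alpha \cdot \mathit{difference} \cdot f_i(s,a)$; the `features` helper and the number of features are hypothetical placeholders.

```python
NUM_FEATURES = 3                # illustrative feature count
weights = [0.0] * NUM_FEATURES  # one weight per feature

def features(s, a):
    """Hypothetical feature extractor returning [f_1(s,a), ..., f_n(s,a)]."""
    return [0.0] * NUM_FEATURES  # replace with real domain features

def q_value(s, a):
    """Linear Q-value: dot product of weights and features."""
    return sum(w * f_i for w, f_i in zip(weights, features(s, a)))

def update_weights(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Shift each weight in proportion to its feature and the TD difference."""
    difference = (r + gamma * max(q_value(s_next, a2) for a2 in actions)
                  - q_value(s, a))
    for i, f_i in enumerate(features(s, a)):
        weights[i] += alpha * difference * f_i
```
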
Q-Learning is a basic form of Reinforcement Learning which uses Q-values (action values) to iteratively improve the behavior of the learning agent.
Q-values are defined for states and actions: $Q(s, a)$ is an estimate of how good it is to take action $a$ in state $s$. This estimate of $Q(s, a)$ is iteratively refined using the temporal difference update.
@@ -0,0 +1,130 @@

<div id='reinforcement-learning-from-q-value-till-the-end'><h1> Reinforcement Learning (from Q Value till the end) </h1></div>
General Note: please fix the LaTeX syntax of the formulas.