22_reinforcement_learning_from_q_value_till_the_end #24
Open
msrazavi wants to merge 40 commits into sut-ai:master from
Conversation

SepehrAsh reviewed Jan 18, 2022
Q-learning is an **off-policy** learner, which means it learns the value of the optimal policy independently of the agent's actions. In other words, it eventually converges to the optimal policy even if you are acting sub-optimally.

Q-learning is a **sample-based** Q-value iteration method in which you learn $Q(s,a)$ values as you go:

- Receive a sample $(s_t, a_t, r_t, s_{t+1})$.
- Incorporate it into a running average:
$$Q(s,a) \leftarrow (1 - \alpha)\,Q(s,a) + \alpha \cdot \mathit{sample}, \qquad \mathit{sample} = r_t + \gamma \max_{a} Q(s_{t+1}, a)$$

Written out in full, this is the temporal-difference update:
$$Q^{new}(s_t,a_t) \leftarrow \underbrace{Q(s_t,a_t)}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot \overbrace{\Big( \underbrace{\underbrace{r_t}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a} Q(s_{t+1},a)}_{\text{estimate of optimal future value}}}_{\text{new value (temporal difference target)}} - \underbrace{Q(s_t,a_t)}_{\text{old value}} \Big)}^{\text{temporal difference}}$$
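This update is straightforward to implement. Below is a minimal sketch of the tabular update in Python; the `ALPHA` and `GAMMA` values, the `Q` table layout, and the `actions` argument are illustrative assumptions, not part of the notes.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)

Q = defaultdict(float)  # Q[(state, action)]; missing entries default to 0.0

def q_update(s, a, r, s_next, actions):
    """Apply one temporal-difference update from a sample (s, a, r, s')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)  # estimate of optimal future value
    td_target = r + GAMMA * best_next                   # temporal difference target
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])        # move old value toward target
```
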
<div id='epsilon-greedy-strategy'><h2> Epsilon greedy strategy </h2></div>
The tradeoff between exploration and exploitation is fundamental. The simplest way to force exploration is the **epsilon greedy strategy**: with a small probability $\epsilon$ the agent takes a random action (exploration), and with probability $(1 - \epsilon)$ it takes the current policy's action (exploitation).
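As a sketch, epsilon-greedy selection needs only one random draw per step. This continues the tabular `Q` and `actions` assumptions from the previous sketch, and the value of `EPSILON` is illustrative.

```python
import random

EPSILON = 0.1  # exploration probability (assumed value)

def epsilon_greedy(s, actions):
    """Random action with probability EPSILON, else the current greedy action."""
    if random.random() < EPSILON:
        return random.choice(actions)             # explore
    return max(actions, key=lambda a: Q[(s, a)])  # exploit current policy
```
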
<div id='exploration-functions'><h2> Exploration functions </h2></div>
Another solution is to use **exploration functions**. Such a function takes a value estimate $u$ and a visit count $n$ and returns an optimistic utility, e.g. $f(u,n) = u + \frac{k}{n}$. The idea is to count how many times each action has been tried: while the count is small, the bonus $\frac{k}{n}$ keeps the action attractive, so we try it more often; as the count grows, the bonus fades, so actions that never return a good value eventually stop being explored. A sketch follows.
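The sketch below selects actions through such an exploration function, continuing the tabular `Q` from the earlier sketches; the constant `K` and the treatment of unvisited pairs are illustrative assumptions.

```python
from collections import defaultdict

K = 1.0               # exploration constant k (assumed value)
N = defaultdict(int)  # visit count per (state, action) pair

def f(u, n):
    """Optimistic utility f(u, n) = u + k / n; unvisited pairs look best."""
    return u + K / n if n > 0 else float("inf")

def pick_action(s, actions):
    """Choose the action with the highest optimistic utility and count the visit."""
    a = max(actions, key=lambda a2: f(Q[(s, a2)], N[(s, a2)]))
    N[(s, a)] += 1
    return a
```
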
$$V(s) = \omega_1 f_1(s) + \omega_2 f_2(s) + \dots + \omega_n f_n(s)$$

$$Q(s,a) = \omega_1 f_1(s,a) + \omega_2 f_2(s,a) + \dots + \omega_n f_n(s,a)$$
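These linear, feature-based values are easy to compute as a dot product of weights and features. The sketch below also applies the standard approximate Q-learning weight update $\omega_i \leftarrow \omega_i + \alpha \cdot \mathit{difference} \cdot f_i(s,a)$; the `features` helper and the number of features are hypothetical placeholders.

```python
NUM_FEATURES = 3                # illustrative feature count
weights = [0.0] * NUM_FEATURES  # one weight per feature

def features(s, a):
    """Hypothetical feature extractor returning [f_1(s,a), ..., f_n(s,a)]."""
    return [0.0] * NUM_FEATURES  # replace with real domain features

def q_value(s, a):
    """Linear Q-value: dot product of weights and features."""
    return sum(w * f_i for w, f_i in zip(weights, features(s, a)))

def update_weights(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Shift each weight in proportion to its feature and the TD difference."""
    difference = (r + gamma * max(q_value(s_next, a2) for a2 in actions)
                  - q_value(s, a))
    for i, f_i in enumerate(features(s, a)):
        weights[i] += alpha * difference * f_i
```
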
Q-Learning is a basic form of Reinforcement Learning which uses Q-values (action values) to iteratively improve the behavior of the learning agent.
Q-values are defined for states and actions: $Q(s, a)$ is an estimate of how good it is to take action $a$ in state $s$. This estimate of $Q(s, a)$ is iteratively refined using the temporal difference update.
@@ -0,0 +1,130 @@

<div id='reinforcement-learning-from-q-value-till-the-end'><h1> Reinforcement Learning (from Q Value till the end) </h1></div>
General Note: please fix the LaTeX syntax of the formulas.