---
layout: post
mathjax: true
title: Notations
author: Andrei Radulescu-Banu
---

Top | Notations | Bibliography

Notations

This is a list of notations and definitions used throughout the series.

| Symbol | Meaning |
|---|---|
| $$s \in \mathcal{S}$$ | States |
| $$a \in \mathcal{A}$$ | Actions |
| $$d_0(s)$$ | Initial distribution of states |
| $$p(s', r \vert s, a)$$ | State-reward transition probability of reaching the next state $$s'$$ with reward $$r \in \mathbb{R}$$ from the current state $$s$$ under action $$a$$ |
| $$p(s' \vert s, a)$$ | State transition probability $$Pr(s_{t+1} = s' \vert s_t = s, a_t = a)$$ |
| $$r(s, a, s')$$ | State-action-state reward $$\mathbb{E}[r_{t+1} \vert s_t = s, a_t = a, s_{t+1} = s']$$ |
| $$r(s, a)$$ | State-action reward $$\mathbb{E}[r_{t+1} \vert s_t = s, a_t = a]$$ |
| $$\pi(a \vert s)$$ | Policy |
| $$x \sim P$$ | $$x$$ sampled with probability $$P$$ |
| $$\tau$$ | State-action trajectory $$s_0, a_0, s_1, \ldots, a_{T-1}, s_T$$ for $$T$$ possibly infinite |
| $$\overline{\tau}$$ | State-action-reward trajectory $$s_0, a_0, r_1, s_1, \ldots, a_{T-1}, r_T, s_T$$ |
| $$\gamma$$ | Discount factor $$0 \le \gamma \le 1$$, with $$\gamma \lt 1$$ whenever trajectories are infinite |
| $$r(\overline{\tau})$$ | Return of the state-action-reward trajectory, $$\sum_{t=0}^{T-1} \gamma^t r_{t+1}$$ |
| $$J_\pi$$ | Agent objective $$\mathbb{E}_{\overline{\tau} \sim \pi}[r(\overline{\tau})]$$ when we follow policy $$\pi$$ |
| $$V_\pi(s)$$ | State value function $$\mathbb{E}_{s_0 = s,\, \overline{\tau} \sim \pi}[r(\overline{\tau})]$$ when we follow policy $$\pi$$ |
| $$Q_\pi(s, a)$$ | Action value function $$\mathbb{E}_{s_0 = s,\, a_0 = a,\, \overline{\tau} \sim \pi}[r(\overline{\tau})]$$ when we follow policy $$\pi$$ |
| $$A_\pi(s, a)$$ | Advantage function $$Q_\pi(s, a) - V_\pi(s)$$ |
| $$V_*(s), Q_*(s, a)$$ | Optimal state and action value functions |
| $$\mathbb{N}, \mathbb{Z}, \mathbb{R}$$ | The sets of nonnegative integers, integers, and real numbers |
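
To make the trajectory and value-function notation concrete, here is a minimal Python sketch (not code from this series) that computes the return $$r(\overline{\tau})$$ of a sampled state-action-reward trajectory and estimates $$V_\pi(s)$$ by Monte Carlo averaging of returns. The functions `sample_action` and `step` are hypothetical stand-ins for sampling from the policy $$\pi(a \vert s)$$ and from the model $$p(s', r \vert s, a)$$.

```python
def trajectory_return(rewards, gamma):
    """Return of a state-action-reward trajectory: sum_t gamma^t * r_{t+1}."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def estimate_v(s0, sample_action, step, gamma=0.99, horizon=100, n_rollouts=1000):
    """Monte Carlo estimate of V_pi(s0): average return over sampled trajectories.

    `sample_action(s)` is assumed to sample a ~ pi(. | s), and `step(s, a)` to
    sample (s', r, done) from p(s', r | s, a); both are placeholders supplied
    by the caller, not part of the series' code.
    """
    returns = []
    for _ in range(n_rollouts):
        s, rewards = s0, []
        for _ in range(horizon):
            a = sample_action(s)
            s, r, done = step(s, a)
            rewards.append(r)
            if done:
                break
        returns.append(trajectory_return(rewards, gamma))
    return sum(returns) / len(returns)
```

The same recipe estimates $$Q_\pi(s, a)$$ by fixing the first action to $$a$$, and the advantage $$A_\pi(s, a)$$ is then the difference of the two estimates.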
| Term | Meaning |
|---|---|
| the model | $$p(s' \vert s, a)$$ in an MDP; sometimes known in advance (e.g. in a simulated environment), other times learned through sampling |
| the policy | $$\pi(a \vert s)$$ in an MDP |
| bootstrapping | an algorithm bootstraps if it uses its own predicted values as targets for learning |
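
As an illustration of bootstrapping, here is a minimal sketch (assumed names, not code from this series) of a tabular TD(0) update: the target $$r + \gamma V(s')$$ uses the algorithm's own current prediction $$V(s')$$, unlike the Monte Carlo estimate above, which only uses sampled returns.

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, done, gamma=0.99, alpha=0.1):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    # The target reuses the current prediction V[s_next] -- this is the bootstrap.
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])

# V maps states to value estimates, initialized to 0.
V = defaultdict(float)
```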