---
layout: post
mathjax: true
title: Notations
author: Andrei Radulescu-Banu
---

Top | Notations | Bibliography

Notations

This is a list of notations and definitions used throughout the series.

| Symbol | Meaning |
|---|---|
| $$s \in \mathcal{S}$$ | States |
| $$a \in \mathcal{A}$$ | Actions |
| $$d_0(s)$$ | Initial distribution of states |
| $$p(s', r \vert s, a)$$ | State-reward transition probability of reaching the next state $$s'$$ with reward $$r \in \mathbb{R}$$ from the current state $$s$$ under action $$a$$ |
| $$p(s' \vert s, a)$$ | State transition probability $$Pr(s_{t+1} = s' \vert s_t = s, a_t = a)$$ |
| $$r(s, a, s')$$ | State-action-state reward $$\mathbb{E}[r_{t+1} \vert s_t = s, a_t = a, s_{t+1} = s']$$ |
| $$r(s, a)$$ | State-action reward $$\mathbb{E}[r_{t+1} \vert s_t = s, a_t = a]$$ |
| $$\pi(a \vert s)$$ | Policy |
| $$x \sim P$$ | $$x$$ sampled with probability $$P$$ |
| $$\tau$$ | State-action trajectory $$s_0, a_0, s_1, \ldots, a_{T-1}, s_T$$ for $$T$$ possibly infinite |
| $$\overline{\tau}$$ | State-action-reward trajectory $$s_0, a_0, r_1, s_1, \ldots, a_{T-1}, r_T, s_T$$ |
| $$\gamma$$ | Discount factor $$0 \le \gamma \le 1$$, with $$\gamma \lt 1$$ whenever trajectories are infinite |
| $$r(\overline{\tau})$$ | Return of the state-action-reward trajectory, $$\sum_{t=0}^{T-1} \gamma^t r_{t+1}$$ |
| $$J_\pi$$ | Agent objective $$\mathbb{E}_{\overline{\tau} \sim \pi}[r(\overline{\tau})]$$ when we follow policy $$\pi$$ |
| $$V_\pi(s)$$ | State value function $$\mathbb{E}_{s_0 = s,\, \overline{\tau} \sim \pi}[r(\overline{\tau})]$$ when we follow policy $$\pi$$ |
| $$Q_\pi(s, a)$$ | Action value function $$\mathbb{E}_{s_0 = s,\, a_0 = a,\, \overline{\tau} \sim \pi}[r(\overline{\tau})]$$ when we follow policy $$\pi$$ |
| $$A_\pi(s, a)$$ | Advantage function $$Q_\pi(s, a) - V_\pi(s)$$ |
| $$V_*(s), Q_*(s, a)$$ | Optimal state and action value functions |
| $$\mathbb{N}, \mathbb{Z}, \mathbb{R}$$ | The sets of nonnegative integers, integers, and real numbers |
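
To make the trajectory and value-function notation concrete, here is a minimal Python sketch (not code from this series) that computes the return $$r(\overline{\tau})$$ of a sampled state-action-reward trajectory and estimates $$V_\pi(s)$$ by Monte Carlo averaging of returns. The functions `sample_action` and `step` are hypothetical stand-ins for sampling from the policy $$\pi(a \vert s)$$ and from the model $$p(s', r \vert s, a)$$.

```python
def trajectory_return(rewards, gamma):
    """Return of a state-action-reward trajectory: sum_t gamma^t * r_{t+1}."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def estimate_v(s0, sample_action, step, gamma=0.99, horizon=100, n_rollouts=1000):
    """Monte Carlo estimate of V_pi(s0): average return over sampled trajectories.

    `sample_action(s)` is assumed to sample a ~ pi(. | s), and `step(s, a)` to
    sample (s', r, done) from p(s', r | s, a); both are placeholders supplied
    by the caller, not part of the series' code.
    """
    returns = []
    for _ in range(n_rollouts):
        s, rewards = s0, []
        for _ in range(horizon):
            a = sample_action(s)
            s, r, done = step(s, a)
            rewards.append(r)
            if done:
                break
        returns.append(trajectory_return(rewards, gamma))
    return sum(returns) / len(returns)
```

The same recipe estimates $$Q_\pi(s, a)$$ by fixing the first action to $$a$$, and the advantage $$A_\pi(s, a)$$ is then the difference of the two estimates.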
| Term | Meaning |
|---|---|
| the model | $$p(s' \vert s, a)$$ in an MDP; sometimes known in advance (e.g. in a simulated environment), other times learned through sampling |
| the policy | $$\pi(a \vert s)$$ in an MDP |
| bootstrapping | an algorithm bootstraps if it uses its own predicted values as targets for learning |
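
As an illustration of bootstrapping, here is a minimal sketch (assumed names, not code from this series) of a tabular TD(0) update: the target $$r + \gamma V(s')$$ uses the algorithm's own current prediction $$V(s')$$, unlike the Monte Carlo estimate above, which only uses sampled returns.

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, done, gamma=0.99, alpha=0.1):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    # The target reuses the current prediction V[s_next] -- this is the bootstrap.
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])

# V maps states to value estimates, initialized to 0.
V = defaultdict(float)
```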