## Welcome to STATS 701 WI 2021

This is a special topics course on the Theory of Reinforcement Learning (RL). We will focus on the design and analysis of RL algorithms using tools from regret analysis of online algorithms, concentration inequalities, and stochastic approximation. The “core” of this course will be based on online RL in a finite state-action (often called the “tabular” setting) Markov Decision Process (MDP) and will be taught in traditional lecture style (fully remote due to COVID-19). The “advanced” part of this course will choose topics based on audience interest and will be more discussion based. Students will volunteer to read a paper (or a small group of related papers) and lead its discussion in the class. Topics for the advanced part could include:

- RL with function approximation (from simple cases like linear functions to more difficult cases like deep RL)
- continuous state and action spaces (e.g., LQR systems [linear systems with quadratic costs])
- offline and off-policy evaluation
- multi-task and transfer learning in RL
- topics related to the above (e.g., lifelong RL, meta RL)
- hierarchical RL
- multi-agent RL

We will use the following resources:

- for the core part: course notes by Prof. Vidyasagar and the boot camp from the Simons Institute Fall 2020 program on the Theory of Reinforcement Learning
- for the advanced part: the three workshops from the Simons Theory of RL program (deep RL, online RL, RL using batch/simulation data)
- for more resources, look here

The course is aimed at advanced graduate students in statistics, computer science, control theory, operations research, and other related disciplines that study sequential decision making under uncertainty. Prior exposure to RL (or at least MDPs) is strongly recommended. This course has a theoretical/mathematical flavor and therefore mathematical maturity is expected.

## Instructor Information

**Name**: Ambuj Tewari

**Zoom Office Hours**: By appointment

**Email**: tewaria@umich.edu

## Grading

There will be no graded homeworks or exams. Grades will be determined on the basis of class presentations and a final project.

## Schedule

*Caveat Lector*: The material below is provided with the hope that it might be useful in learning the theory of RL. Please note that the content is rough and unpolished in many places. It has not been subjected to the scrutiny reserved for scholarly publications. Please watch out for errors and inconsistencies!

- V = Prof. Vidyasagar’s course notes (available to UM students via the Canvas website for this course)
- SB = Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction (2nd edition)
- P = Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming
- H = Elad Hazan, Introduction to Online Convex Optimization
- LS = Tor Lattimore and Csaba Szepesvari, Bandit Algorithms

Lecture No. | Date | Topic | Readings | Media |
---|---|---|---|---|

01 | Jan 19 | Introduction | V, Chapter 1 (Introduction) SB, Chapter 1 (Introduction) SB, Section 16.2 (Samuel’s Checkers Player) SB, Section 16.6 (Mastering the Game of Go) |
slides zoom recording |

Tabular Methods |
||||

02 | Jan 21 | Markov Processes | V, Section 8.2 (Markov processes) | slides zoom recording |

03 | Jan 26 | Markov Reward Processes | V, Section 8.4 (Contraction Mapping Theorem) V, Section 2.1 (Markov Reward Processes) |
slides zoom recording |

04 | Jan 28 | MRPs Continued Markov Decision Processes |
V, Section 2.2 (Markov Decision Processes) SB, Chapter 3 (Finite Markov Decision Processes) |
slides zoom recording |

05 | Feb 02 | MDPs Continued | SB, Chapter 4 (Dynamic Programming) | slides zoom recording |

06 | Feb 04 | LP Approach to Solving MDPs MLE of Markov Processes Hoeffding’s Inequality |
P, Section 6.9 (Linear Programming) V, Section 8.3 (MLE of Markov Processes) |
slides zoom recording |

07 | Feb 09 | Monte Carlo Methods | V, Section 4.1 (Monte-Carlo Methods) SB, Chapter 5 (Monte Carlo Methods) |
slides zoom recording |

08 | Feb 11 | Temporal Difference Methods | V, Section 4.2 (Temporal Difference Methods) SB, Chapter 6 (Temporal-Difference Learning) |
slides zoom recording |

09 | Feb 16 | Project Discussion | ||

Online Learning and Bandits |
||||

10 | Feb 18 | Online Convex Optimization (OCO) | H, Chapter 1 (Introduction) H, Chapter 2 (First order algorithms for OCO) |
slides zoom recording |

11 | Feb 23 | Experts Problem Follow the Regularized Leader Mirror Descent |
H, Chapter 5 (Regularization) | slides zoom recording |

12 | Feb 25 | Adversarial Multi-Armed Bandits | LS, Chapter 11 (The Exp3 Algorithm) | slides zoom recording |

– | Mar 05 | Proposals due | Note: Mar 2nd, Mar 4th lectures were cancelled | |

13 | Mar 09 | Stochastic Multi-Armed Bandits | LS, Chapter 4 (Stochastic Bandits) LS, Chapter 6 (The Explore-Then-Commit Algorithm) LS, Chapter 7 (The Upper Confidence Bound Algorithm) LS, Chapter 36 (Thompson Sampling) |
slides zoom recording |

Online Learning in MDPs |
||||

14 | Mar 11 | Optimism Based Algorithms | UCRL2 paper | slides zoom recording |

15 | Mar 16 | Posterior/Thompson Sampling | PSRL paper | slides zoom recording |

16 | Mar 18 | Adversarial MDPs | MDP Experts paper (Journal version) O-REPS paper |
slides zoom recording |

Mar 23 | Well-being break | |||

Mar 25 | Well-being break |

### Student Presentations

Date | Group | Topic | Media |
---|---|---|---|

Mar 30 | Dhar-Weitze | Reinforcement Learning in Enormous Action Spaces | slides zoom recording |

Mar 30 | Gupta-Li-Li | Game Theoretical Foundations for Multi-agent Reinforcement Learning | slides zoom recording |

Apr 01 | Liu-Qiao-Sun | A deep reinforcement learning based long-term recommender system | slides zoom recording |

Apr 01 | DiGiovanni-Zell | Self-Play Compatibility in Multi-Agent Learning | slides zoom recording |

Apr 06 | Li-Xu | Overview of Thompson sampling in RL and BCI applications | slides zoom recording |

Apr 06 | Jang-Moug | Comparison of Safe Reinforcement Learning Algorithms in Safety Gym | slides zoom recording |

Apr 08 | Park-Yu | Real-time Update in Robot Imitation Learning | slides zoom recording |

Apr 08 | Otero-Leon | Patient panel prioritization with finite resources | slides zoom recording |

Apr 13 | Wang-Zhong | Inverse Reinforcement Learning and its application on calibration of human driver models | slides zoom recording |

Apr 13 | Chandrasekaran-De | Statistical Properties of Monte Carlo Estimates in Reinforcment Learning | slides zoom recording |

Apr 15 | Zhalechian | Reinforcement Learning for Parameterized Markov Decision Processes using Posterior Sampling | slides zoom recording |

Apr 15 | Bull-Li | Reinforcement Learning of Optimal Delivery Destination Estimation | slides zoom recording |

Apr 20 | Chakraborty-Roy | High Dimensional Linear Contextual Bandit and Thompson Sampling using Langevin dynamics | slides zoom recording |

Apr 20 | Li-Shang | Intelligent transportation system | slides zoom recording |

Apr 27 | Reports due |