View on GitHub

stats701-winter2021

Theory of Reinforcement Learning

Welcome to STATS 701 WI 2021

This is a special topics course on the Theory of Reinforcement Learning (RL). We will focus on the design and analysis of RL algorithms using tools from regret analysis of online algorithms, concentration inequalities, and stochastic approximation. The “core” of this course will be based on online RL in a finite state-action (often called the “tabular” setting) Markov Decision Process (MDP) and will be taught in traditional lecture style (fully remote due to COVID-19). The “advanced” part of this course will choose topics based on audience interest and will be more discussion based. Students will volunteer to read a paper (or a small group of related papers) and lead its discussion in the class. Topics for the advanced part could include:

RL with function approximation (from simple cases like linear functions to more difficult cases like deep RL)
continuous state and action spaces (e.g., LQR systems [linear systems with quadratic costs])
offline and off-policy evaluation
multi-task and transfer learning in RL
topics related to the above (e.g., lifelong RL, meta RL)
hierarchical RL
multi-agent RL

We will use the following resources:

for the core part: course notes by Prof. Vidyasagar and the boot camp from the Simons Institute Fall 2020 program on the Theory of Reinforcement Learning
for the advanced part: the three workshops from the Simons Theory of RL program (deep RL, online RL, RL using batch/simulation data)
for more resources, look here

The course is aimed at advanced graduate students in statistics, computer science, control theory, operations research, and other related disciplines that study sequential decision making under uncertainty. Prior exposure to RL (or at least MDPs) is strongly recommended. This course has a theoretical/mathematical flavor and therefore mathematical maturity is expected.

Instructor Information

Name: Ambuj Tewari
Zoom Office Hours: By appointment
Email: tewaria@umich.edu

Grading

There will be no graded homeworks or exams. Grades will be determined on the basis of class presentations and a final project.

Schedule

Caveat Lector: The material below is provided with the hope that it might be useful in learning the theory of RL. Please note that the content is rough and unpolished in many places. It has not been subjected to the scrutiny reserved for scholarly publications. Please watch out for errors and inconsistencies!

V = Prof. Vidyasagar’s course notes (available to UM students via the Canvas website for this course)
SB = Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction (2nd edition)
P = Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming
H = Elad Hazan, Introduction to Online Convex Optimization
LS = Tor Lattimore and Csaba Szepesvari, Bandit Algorithms

Lecture No.	Date	Topic	Readings	Media
01	Jan 19	Introduction	V, Chapter 1 (Introduction) SB, Chapter 1 (Introduction) SB, Section 16.2 (Samuel’s Checkers Player) SB, Section 16.6 (Mastering the Game of Go)	slides zoom recording
		Tabular Methods
02	Jan 21	Markov Processes	V, Section 8.2 (Markov processes)	slides zoom recording
03	Jan 26	Markov Reward Processes	V, Section 8.4 (Contraction Mapping Theorem) V, Section 2.1 (Markov Reward Processes)	slides zoom recording
04	Jan 28	MRPs Continued Markov Decision Processes	V, Section 2.2 (Markov Decision Processes) SB, Chapter 3 (Finite Markov Decision Processes)	slides zoom recording
05	Feb 02	MDPs Continued	SB, Chapter 4 (Dynamic Programming)	slides zoom recording
06	Feb 04	LP Approach to Solving MDPs MLE of Markov Processes Hoeffding’s Inequality	P, Section 6.9 (Linear Programming) V, Section 8.3 (MLE of Markov Processes)	slides zoom recording
07	Feb 09	Monte Carlo Methods	V, Section 4.1 (Monte-Carlo Methods) SB, Chapter 5 (Monte Carlo Methods)	slides zoom recording
08	Feb 11	Temporal Difference Methods	V, Section 4.2 (Temporal Difference Methods) SB, Chapter 6 (Temporal-Difference Learning)	slides zoom recording
09	Feb 16	Project Discussion
		Online Learning and Bandits
10	Feb 18	Online Convex Optimization (OCO)	H, Chapter 1 (Introduction) H, Chapter 2 (First order algorithms for OCO)	slides zoom recording
11	Feb 23	Experts Problem Follow the Regularized Leader Mirror Descent	H, Chapter 5 (Regularization)	slides zoom recording
12	Feb 25	Adversarial Multi-Armed Bandits	LS, Chapter 11 (The Exp3 Algorithm)	slides zoom recording
–	Mar 05	Proposals due	Note: Mar 2nd, Mar 4th lectures were cancelled
13	Mar 09	Stochastic Multi-Armed Bandits	LS, Chapter 4 (Stochastic Bandits) LS, Chapter 6 (The Explore-Then-Commit Algorithm) LS, Chapter 7 (The Upper Confidence Bound Algorithm) LS, Chapter 36 (Thompson Sampling)	slides zoom recording
		Online Learning in MDPs
14	Mar 11	Optimism Based Algorithms	UCRL2 paper	slides zoom recording
15	Mar 16	Posterior/Thompson Sampling	PSRL paper	slides zoom recording
16	Mar 18	Adversarial MDPs	MDP Experts paper (Journal version) O-REPS paper	slides zoom recording
	Mar 23	Well-being break
	Mar 25	Well-being break

Student Presentations

Date	Group	Topic	Media
Mar 30	Dhar-Weitze	Reinforcement Learning in Enormous Action Spaces	slides zoom recording
Mar 30	Gupta-Li-Li	Game Theoretical Foundations for Multi-agent Reinforcement Learning	slides zoom recording
Apr 01	Liu-Qiao-Sun	A deep reinforcement learning based long-term recommender system	slides zoom recording
Apr 01	DiGiovanni-Zell	Self-Play Compatibility in Multi-Agent Learning	slides zoom recording
Apr 06	Li-Xu	Overview of Thompson sampling in RL and BCI applications	slides zoom recording
Apr 06	Jang-Moug	Comparison of Safe Reinforcement Learning Algorithms in Safety Gym	slides zoom recording
Apr 08	Park-Yu	Real-time Update in Robot Imitation Learning	slides zoom recording
Apr 08	Otero-Leon	Patient panel prioritization with finite resources	slides zoom recording
Apr 13	Wang-Zhong	Inverse Reinforcement Learning and its application on calibration of human driver models	slides zoom recording
Apr 13	Chandrasekaran-De	Statistical Properties of Monte Carlo Estimates in Reinforcment Learning	slides zoom recording
Apr 15	Zhalechian	Reinforcement Learning for Parameterized Markov Decision Processes using Posterior Sampling	slides zoom recording
Apr 15	Bull-Li	Reinforcement Learning of Optimal Delivery Destination Estimation	slides zoom recording
Apr 20	Chakraborty-Roy	High Dimensional Linear Contextual Bandit and Thompson Sampling using Langevin dynamics	slides zoom recording
Apr 20	Li-Shang	Intelligent transportation system	slides zoom recording
Apr 27	Reports due