STATS 700, Fall 2024
The Attention mechanism and the Transformer architecture have completely changed the landscape of AI, deep learning, and NLP research in the past few years. This course will be a selective review of the fast-growing literature on Transformers and Large Language Models (LLMs), with a preference for theoretical and mathematical analyses. We will study the capabilities and limitations of the Transformer architecture. We will discuss empirical phenomena such as neural scaling laws and the emergence of skills as models are scaled up in size. LLMs also raise issues around copyright, trust, safety, fairness, and watermarking. We will look at alignment with human values and techniques such as RLHF (reinforcement learning from human feedback), as well as adaptation of LLMs to downstream tasks via few-shot fine-tuning and in-context learning. Towards the end, we might look at the impact that LLMs are having in disciplines such as Cognitive Science, Linguistics, and Neuroscience. We might also discuss ongoing efforts to build LLMs and foundation models for science and mathematics. This course is inspired by the Special Year (Part 1, Part 2, and an earlier workshop) on LLMs and Transformers being hosted by the Simons Institute at UC Berkeley, and may be tweaked to better align with it as the Special Year progresses.
Note: This course is primarily meant for Statistics PhD students. Others will need the instructor’s permission to enroll. Graduate coursework in statistics, theoretical computer science, mathematics, or a related discipline is required. Students will be expected to possess that hard-to-define quality usually referred to as “mathematical maturity”.
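As a concrete reference point for the attention mechanism at the center of the course, here is a minimal NumPy sketch of single-head scaled dot-product attention. The variable names, dimensions, and random inputs are illustrative only, not taken from any specific reading:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n) pairwise similarities
    weights = softmax(scores, axis=-1)   # each row is a probability vector
    return weights @ V                   # convex combinations of the values

rng = np.random.default_rng(0)
n, d = 4, 8                              # sequence length, embedding dimension
X = rng.standard_normal((n, d))          # toy token embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (4, 8)
```

Each output row is a data-dependent convex combination of the value vectors; many of the theoretical papers below analyze exactly this map (its expressivity, its limitations, and its training dynamics).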
Related courses and resources:
Stanford
Princeton
Berkeley
Michigan EECS
Borealis AI blog series:
Analogies Explained: Towards Understanding Word Embeddings
What can a Single Attention Layer Learn? A Study Through the Random Features Lens
Inductive Biases and Variable Creation in Self-Attention Mechanisms
Implicit Regularization of Gradient Flow on One-Layer Softmax Attention
Formal Algorithms for Transformers
Infinite attention: NNGP and NTK for deep attention networks
Tensor Programs II: Neural Tangent Kernel for Any Architecture
A Kernel-Based View of Language Model Fine-Tuning
On the Turing Completeness of Modern Neural Network Architectures
Are Transformers universal approximators of sequence-to-sequence functions?
From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers
On the Ability and Limitations of Transformers to Recognize Formal Languages
Theoretical Limitations of Self-Attention in Neural Sequence Models
Self-Attention Networks Can Process Bounded Hierarchical Languages
On Limitations of the Transformer Architecture
Transformers Learn Shortcuts to Automata
On the Learnability of Discrete Distributions
Grammatical Inference: Learning Automata and Grammars
Mathematical Linguistics, especially Chapter 7 (Complexity) and Chapter 8 (Linguistic pattern recognition)
Are Emergent Abilities of Large Language Models a Mirage?
A Theory for Emergence of Complex Skills in Language Models
What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
The Learnability of In-Context Learning
Supervised Pretraining Can Learn In-Context Reinforcement Learning
Large Language Models can Implement Policy Iteration
Trainable Transformer in Transformer
The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains
Distinguishing the Knowable from the Unknowable with Language Models
Conformal Language Modeling
Language Models with Conformal Factuality Guarantees
Training language models to follow instructions with human feedback
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Efficient Exploration for LLMs
Mamba: Linear-time sequence modeling with selective state spaces
Repeat After Me: Transformers are Better than State Space Models at Copying
A mathematical perspective on Transformers
Transformers in Reinforcement Learning: A Survey
The debate over understanding in AI’s large language models
Language models and linguistic theories beyond words
Noam Chomsky: The False Promise of ChatGPT
Modern language models refute Chomsky’s approach to language
Dissociating language and thought in large language models
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
Shared computational principles for language processing in humans and deep language models
The neural architecture of language: Integrative modeling converges on predictive processing
Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data
On the Opportunities and Risks of Foundation Models
MIDAS Symposium 2024
MICDE Symposium 2024