STATS 700, Fall 2024
The Attention mechanism and the Transformer architecture have completely changed the landscape of AI, deep learning, and NLP research in the past few years. This course will be a selective review of the fast-growing literature on Transformers and Large Language Models (LLMs), with a preference for theoretical and mathematical analyses. We will study the capabilities and limitations of the Transformer architecture. We will discuss empirical phenomena such as neural scaling laws and the emergence of skills as models are scaled up in size. LLMs also raise issues around copyright, trust, safety, fairness, and watermarking. We will look at alignment with human values and techniques such as RLHF (reinforcement learning from human feedback), as well as adaptation of LLMs to downstream tasks via few-shot fine-tuning and in-context learning. Towards the end, we might look at the impact that LLMs are having in disciplines such as Cognitive Science, Linguistics, and Neuroscience. We might also discuss ongoing efforts to build LLMs and foundation models for science and mathematics. This course is inspired by the Special Year (Part 1, Part 2, and an earlier workshop) on LLMs and Transformers being hosted by the Simons Institute at UC Berkeley, and may be tweaked to better align with it as the Special Year progresses.
Note: This course is primarily meant for Statistics PhD students. Others will need the instructor’s permission to enroll. Graduate coursework in statistics, theoretical computer science, mathematics, or a related discipline is required. Students will be expected to possess that hard-to-define quality usually referred to as “mathematical maturity”.
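As a concrete reference point for the attention mechanism at the center of the course, here is a minimal NumPy sketch of single-head scaled dot-product attention. The variable names, dimensions, and random inputs are illustrative only, not taken from any specific reading:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n) pairwise similarities
    weights = softmax(scores, axis=-1)   # each row is a probability vector
    return weights @ V                   # convex combinations of the values

rng = np.random.default_rng(0)
n, d = 4, 8                              # sequence length, embedding dimension
X = rng.standard_normal((n, d))          # toy token embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (4, 8)
```

Each output row is a data-dependent convex combination of the value vectors; many of the theoretical papers below analyze exactly this map (its expressivity, its limitations, and its training dynamics).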
Related courses and resources:
Stanford
Princeton
Berkeley
Michigan EECS
Borealis AI blog series:
Analogies Explained: Towards Understanding Word Embeddings
What can a Single Attention Layer Learn? A Study Through the Random Features Lens
Inductive Biases and Variable Creation in Self-Attention Mechanisms
Implicit Regularization of Gradient Flow on One-Layer Softmax Attention
Formal Algorithms for Transformers
Infinite attention: NNGP and NTK for deep attention networks
Tensor Programs II: Neural Tangent Kernel for Any Architecture
A Kernel-Based View of Language Model Fine-Tuning
On the Turing Completeness of Modern Neural Network Architectures
Are Transformers universal approximators of sequence-to-sequence functions?
From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers
On the Ability and Limitations of Transformers to Recognize Formal Languages
Theoretical Limitations of Self-Attention in Neural Sequence Models
Self-Attention Networks Can Process Bounded Hierarchical Languages
On Limitations of the Transformer Architecture
Transformers Learn Shortcuts to Automata
On the Learnability of Discrete Distributions
Grammatical Inference: Learning Automata and Grammars
Mathematical Linguistics, especially Chapter 7 (Complexity) and Chapter 8 (Linguistic pattern recognition)
Are Emergent Abilities of Large Language Models a Mirage?
A Theory for Emergence of Complex Skills in Language Models
What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
The Learnability of In-Context Learning
Supervised Pretraining Can Learn In-Context Reinforcement Learning
Large Language Models can Implement Policy Iteration
Trainable Transformer in Transformer
The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains
Distinguishing the Knowable from the Unknowable with Language Models
Conformal Language Modeling
Language Models with Conformal Factuality Guarantees
Training language models to follow instructions with human feedback
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Efficient Exploration for LLMs
Mamba: Linear-time sequence modeling with selective state spaces
Repeat After Me: Transformers are Better than State Space Models at Copying
A mathematical perspective on Transformers
Transformers in Reinforcement Learning: A Survey
The debate over understanding in AI’s large language models
Language models and linguistic theories beyond words
Noam Chomsky: The False Promise of ChatGPT
Modern language models refute Chomsky’s approach to language
Dissociating language and thought in large language models
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
Shared computational principles for language processing in humans and deep language models
The neural architecture of language: Integrative modeling converges on predictive processing
Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data
On the Opportunities and Risks of Foundation Models
MIDAS Symposium 2024
MICDE Symposium 2024