LLMs and Transformers

STATS 700, Fall 2024

The Attention mechanism and the Transformer architecture have completely changed the landscape of AI, deep learning, and NLP research in the past few years. This advanced graduate-level course will consist of two parts. In the first part, we will review foundational material in information theory, statistical NLP, and deep learning theory. In the second part, student project teams will explore topics from the fast-growing literature on Transformers and Large Language Models (LLMs), especially papers that provide theoretical and mathematical analyses. Topics include, but are not limited to, those listed in the Topics section below.

If there is time, we might look at the impact that LLMs are having in disciplines such as Cognitive Science, Linguistics, and Neuroscience. We might also discuss ongoing efforts to build LLMs and foundation models for science and mathematics.

Note: This course is primarily meant for Statistics PhD students. Others will need the instructor’s permission to enroll. Graduate coursework in statistics, theoretical computer science, mathematics, or related disciplines is required. Students will be expected to possess that hard-to-define quality usually referred to as “mathematical maturity”.

Logistics & Schedule

Days and Times: Tuesdays and Thursdays, 11:30 am-1:00 pm
Location: USB2260

J&M = Speech and Language Processing (3rd ed. draft), Jurafsky and Martin
C&T = Elements of Information Theory (2nd ed.), Cover and Thomas

Part 1

Supplementary Material

Part 2

Dec 3, Poster Session I: Understanding and Improving LLMs & Transformers

T1 Soham, Sunrit: Evolution of Iteratively Trained Generative AI Models
T2 Xuanyu, Yiling: Context-Aware Ranking of Large Language Models via Pairwise Comparison
T3 Elvin, Jason, Kellen, Mihir: Is Linear Probing better than Fine-tuning for Weak-to-Strong Generalization?
T5 Jake, Jaylin, Noah: Dimension Decisions: The Impact of Embedding Size on LLMs
T7 Eduardo, Felipe, Harry, Xinhe: shIRT: similarity heuristics for data selection using IRT
T10 Mojtaba, Tara: Beat LLMs in Their Own Game: Statistics-based Methods for AI Detection with Theoretical Guarantees
T11 Unique, Vinod: The Limitations of Self-Attention for Information Retrieval
T12 Paolo, Sahana: Budget-Constrained Learning to Defer for Autoregressive Models
T17 Yuezhou: Classifying Uncertainty with In-Context Learning Methods

Dec 5, Poster Session II: Applying LLMs & Transformers

T4 Gabe, Ki, Marc: Online RL Considerations for LLM-Assisted JITAIs
T6 Andrej: Modern autoregressive architectures for the collective variable problem
T8 Abhiti, Julian, Yash: Adaptive Spectral Neural Operators
T9 Joe: Applying Transformers to Spectral Data
T13 Qiyuan, Zhilin: A Unified Framework for Multimodal Learning: Integrating Image, Text, and Crowd-Sourced Annotations
T14 Victor: Forecasting Solar Flares Using Time Series Foundation Models and LLM-Based Models
T15 Jiwoo, Yue: AI Agent Evaluation
T16 Daniel: Incorporating Domain Knowledge in Transformer-based models for Symbolic Regression

Dec 13, Project Reports Due

Courses / Blogs

Stanford
Princeton
Berkeley
Michigan EECS
Borealis AI blog series:

Basics

Special Year on LLMs and Transformers at the Simons Institute, UC Berkeley:

Classics

Markov (1913) An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains
Shannon (1951) Prediction and Entropy of Printed English
Zipf (1935) The Psycho-Biology of Language. Houghton Mifflin. (Reprinted by MIT Press in 1965)
Good(-Turing) (1953) The Population Frequencies of Species and the Estimation of Population Parameters
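
As a concrete companion to the Good(-Turing) entry above: in its simplest, unsmoothed form the estimator reserves probability mass N1/N for species never seen in the sample, where N1 is the number of species observed exactly once and N is the sample size. A minimal Python sketch of just this piece (the toy data are made up for illustration):

from collections import Counter

def good_turing_unseen_mass(tokens):
    """Unsmoothed Good-Turing estimate of the total probability mass of
    species never observed in the sample: N1 / N, where N1 is the number
    of species seen exactly once and N is the sample size."""
    counts = Counter(tokens)                 # species -> frequency r
    freq_of_freq = Counter(counts.values())  # r -> number of species with frequency r
    n1 = freq_of_freq.get(1, 0)
    n = sum(counts.values())
    return n1 / n

# Toy sample: 11 observations; d, e, f are each seen exactly once,
# so roughly 3/11 of the probability mass is reserved for unseen species.
sample = list("aaabbbccdef")
print(good_turing_unseen_mass(sample))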

Topics

Word Embeddings

Neural Word Embedding as Implicit Matrix Factorization
A Latent Variable Model Approach to PMI-based Word Embeddings
Skip-Gram – Zipf + Uniform = Vector Additivity
Analogies Explained: Towards Understanding Word Embeddings
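
A brief orientation for the first two entries above: the implicit-matrix-factorization view says that skip-gram with negative sampling approximately factorizes a (shifted) pointwise mutual information matrix, so low-rank factorization of a PMI matrix is a reasonable mental model for word embeddings. Below is a minimal sketch, assuming a toy co-occurrence count matrix of my own invention, of forming positive PMI and taking word vectors from a truncated SVD:

import numpy as np

def pmi_embeddings(cooc, dim, eps=1e-12):
    """Given a (words x contexts) co-occurrence count matrix, form the
    positive PMI matrix and return rank-`dim` word vectors from its SVD."""
    total = cooc.sum()
    p_wc = cooc / total                         # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)       # word marginals
    p_c = p_wc.sum(axis=0, keepdims=True)       # context marginals
    pmi = np.log((p_wc + eps) / (p_w @ p_c + eps))
    ppmi = np.maximum(pmi, 0.0)                 # positive PMI, common in practice
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])        # word embeddings of size dim

# Toy 4-word vocabulary with made-up co-occurrence counts.
cooc = np.array([[10., 2., 0., 1.],
                 [ 2., 8., 1., 0.],
                 [ 0., 1., 6., 3.],
                 [ 1., 0., 3., 7.]])
vectors = pmi_embeddings(cooc, dim=2)
print(vectors.shape)  # (4, 2)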

Attention

What can a Single Attention Layer Learn? A Study Through the Random Features Lens
Inductive Biases and Variable Creation in Self-Attention Mechanisms
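
For readers who have not yet met the object these papers analyze, here is a minimal NumPy sketch (my own illustration, not code from either paper) of a single-head softmax self-attention layer, y = softmax(Q K^T / sqrt(d_k)) V:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head softmax self-attention.
    X: (n, d) sequence of n token representations of width d.
    Wq, Wk, Wv: (d, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])    # (n, n) attention logits
    A = softmax(scores, axis=-1)              # each row is a distribution over positions
    return A @ V                              # (n, d_k) mixture of value vectors

rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4
X = rng.normal(size=(n, d))
out = self_attention(X, *(rng.normal(size=(d, d_k)) for _ in range(3)))
print(out.shape)  # (5, 4)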

Implicit Regularization

Implicit Regularization of Gradient Flow on One-Layer Softmax Attention

NTK Theory for Transformers

Infinite attention: NNGP and NTK for deep attention networks
Tensor Programs II: Neural Tangent Kernel for Any Architecture
A Kernel-Based View of Language Model Fine-Tuning

Reverse Engineering Transformers

Transformer Circuits Thread Project
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Progress Measures for Grokking Via Mechanistic Interpretability

Capabilities and Limitations of LLMs and Transformers

On the Turing Completeness of Modern Neural Network Architectures
Are Transformers universal approximators of sequence-to-sequence functions?
Theoretical Limitations of Self-Attention in Neural Sequence Models
On the Ability and Limitations of Transformers to Recognize Formal Languages
Self-Attention Networks Can Process Bounded Hierarchical Languages
Transformers Learn Shortcuts to Automata
Representational Strengths and Limitations of Transformers
On Limitations of the Transformer Architecture
Transformers, parallel computation, and logarithmic depth
From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers
Slaves to the Law of Large Numbers: An Asymptotic Equipartition Property for Perplexity in Generative Language Models
One-layer transformers fail to solve the induction heads task
Thinking Like Transformers

Beyond PAC Learning: Learning Distributions and Grammars, Learning to Generate

On the Learnability of Discrete Distributions
Near-optimal Sample Complexity Bounds for Robust Learning of Gaussian Mixtures via Compression Schemes
Distribution Learnability and Robustness
Inherent limitations of dimensions for characterizing learnability of distribution classes
Grammatical Inference: Learning Automata and Grammars
Mathematical Linguistics, especially Chapter 7 (Complexity) and Chapter 8 (Linguistic pattern recognition)
Language Generation in the Limit

Emergence

Are Emergent Abilities of Large Language Models a Mirage?
A Theory for Emergence of Complex Skills in Language Models

In-Context Learning

What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
The Learnability of In-Context Learning
Supervised Pretraining Can Learn In-Context Reinforcement Learning
Large Language Models can Implement Policy Iteration
Trainable Transformer in Transformer
Trained Transformers Learn Linear Models In-Context
The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains
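
Several of these papers (e.g. What Can Transformers Learn In-Context? and Trained Transformers Learn Linear Models In-Context) study a common setup: each prompt is a fresh linear regression task presented as labelled examples plus a query point, and the transformer's in-context prediction is compared against least squares fit on the prompt. A minimal sketch of that data-generating protocol and baseline, with noiseless labels for simplicity:

import numpy as np

def sample_icl_prompt(rng, d=5, k=10):
    """Sample one in-context learning prompt for linear regression:
    k labelled examples from a freshly drawn linear function, plus one query."""
    w = rng.normal(size=d)                    # task drawn fresh per prompt
    X = rng.normal(size=(k, d))
    y = X @ w                                 # noiseless labels for simplicity
    x_query = rng.normal(size=d)
    return X, y, x_query, x_query @ w         # prompt and the target answer

def least_squares_baseline(X, y, x_query):
    """The natural in-context baseline: fit w by least squares on the prompt."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return x_query @ w_hat

rng = np.random.default_rng(0)
X, y, x_query, target = sample_icl_prompt(rng)
print(target, least_squares_baseline(X, y, x_query))  # agree when k >= d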

Hallucinations

Calibrated Language Models Must Hallucinate

Assessing Model Uncertainty

Distinguishing the Knowable from the Unknowable with Language Models
Conformal Language Modeling
Language Models with Conformal Factuality Guarantees
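
The conformal entries above build on the standard split-conformal recipe: score each calibration example with a nonconformity measure, then threshold at a finite-sample-corrected quantile to get marginal coverage 1 - alpha. The sketch below shows only that generic thresholding step, not the full procedure of either paper; the exponential calibration scores are placeholders:

import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal: return the threshold tau such that accepting
    candidates with nonconformity score <= tau gives (1 - alpha)
    marginal coverage on exchangeable data."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # finite-sample corrected rank
    return np.sort(cal_scores)[min(k, n) - 1]

# Placeholder calibration scores (e.g. negative log-likelihood of the reference answer).
rng = np.random.default_rng(0)
cal_scores = rng.exponential(size=200)
tau = conformal_threshold(cal_scores, alpha=0.1)
print(tau)   # keep any generation whose score is <= tau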

RLHF

Training language models to follow instructions with human feedback
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Efficient Exploration for LLMs

State Space Models

Mamba: Linear-time sequence modeling with selective state spaces
Repeat After Me: Transformers are Better than State Space Models at Copying
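
For orientation, the recurrence underlying these models is a discrete-time linear state space model, h_t = A h_{t-1} + B u_t, y_t = C h_t; Mamba's contribution is to make the parameters input-dependent ("selective") while keeping linear-time sequence processing. A minimal sketch of the plain, non-selective recurrence with arbitrary fixed matrices of my own choosing:

import numpy as np

def linear_ssm(u, A, B, C):
    """Discrete-time linear state space model:
    h_t = A h_{t-1} + B u_t,   y_t = C h_t.
    u: (T, d_in) input sequence; returns the (T, d_out) output sequence."""
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(u.shape[0]):
        h = A @ h + B @ u[t]
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
d_state, d_in, d_out, T = 4, 2, 3, 6
A = 0.9 * np.eye(d_state)                    # a stable state transition
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
print(linear_ssm(rng.normal(size=(T, d_in)), A, B, C).shape)  # (6, 3)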

Transformers as Interacting Particle Systems

A mathematical perspective on Transformers

Language Modeling, Prediction, and Compression

On prediction by data compression
Language Modeling Is Compression
Prediction by Compression
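
The thread running through these papers is that a language model and a lossless compressor are two views of the same object: the ideal code length of a sequence under a model is its negative log probability, -sum_t log2 p(x_t | x_<t), which an arithmetic coder achieves to within about two bits. A minimal sketch with a toy unigram character model (the choice of text and model is just for illustration):

import math
from collections import Counter

def code_length_bits(text, model_probs):
    """Ideal code length (in bits) of `text` under a model assigning
    probability model_probs[c] to each character c: -sum log2 p(c)."""
    return -sum(math.log2(model_probs[c]) for c in text)

# Toy "model": unigram character frequencies estimated from the text itself.
text = "abracadabra"
counts = Counter(text)
probs = {c: n / len(text) for c, n in counts.items()}
print(code_length_bits(text, probs))   # about 22 bits, vs. 8 * 11 = 88 bits of raw ASCII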

LLMs, Online Learning, and Regret

Do LLM Agents Have Regret? A Case Study in Online Learning and Games

Transformers in RL

Transformers in Reinforcement Learning: A Survey

LLMs and Causality

Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

LLMs and Cognitive Science, Linguistics, Neuroscience

Formal grammar and information theory: together again?
The debate over understanding in AI’s large language models
Language models and linguistic theories beyond words
Noam Chomsky: The False Promise of ChatGPT
Modern language models refute Chomsky’s approach to language
Dissociating language and thought in large language models
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
Shared computational principles for language processing in humans and deep language models
The neural architecture of language: Integrative modeling converges on predictive processing
Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data

LLMs and Foundation Models for Science and Mathematics

On the Opportunities and Risks of Foundation Models
MIDAS Symposium 2024
MICDE Symposium 2024