LLMs and Transformers

STATS 700, Fall 2024

The Attention mechanism and the Transformer architecture have completely changed the landscape of AI, deep learning, and NLP research in the past few years. This advanced graduate-level course will consist of two parts. In the first part, we will review foundational material in information theory, statistical NLP, and deep learning theory. In the second part, student project teams will explore topics from the fast-growing literature on Transformers and Large Language Models (LLMs), especially papers that provide theoretical and mathematical analyses. Topics include, but are not limited to, those listed in the Topics section below.

If there is time, we might look at the impact that LLMs are having in disciplines such as Cognitive Science, Linguistics, and Neuroscience. We might also discuss ongoing efforts to build LLMs and foundation models for science and mathematics.

Note: This course is primarily meant for Statistics PhD students. Others will need the instructor’s permission to enroll. Graduate coursework in statistics, theoretical computer science, mathematics, or related disciplines is required. Students will be expected to possess that hard-to-define quality usually referred to as “mathematical maturity”.

Logistics & Schedule

Days and Times: Tuesdays and Thursdays, 11:30 am-1:00 pm
Location: USB2260

J&M = Speech and Language Processing (3rd ed. draft), Jurafsky and Martin
C&T = Elements of Information Theory (2nd ed.), Cover and Thomas

Part 1

Supplementary Material

Part 2

Dec 3, Poster Session I: Understanding and Improving LLMs & Transformers

T1 Soham, Sunrit: Evolution of Iteratively Trained Generative AI Models
T2 Xuanyu, Yiling: Context-Aware Ranking of Large Language Models via Pairwise Comparison
T3 Elvin, Jason, Kellen, Mihir: Is Linear Probing better than Fine-tuning for Weak-to-Strong Generalization?
T5 Jake, Jaylin, Noah: Dimension Decisions: The Impact of Embedding Size on LLMs
T7 Eduardo, Felipe, Harry, Xinhe: shIRT: similarity heuristics for data selection using IRT
T10 Mojtaba, Tara: Beat LLMs in Their Own Game: Statistics-based Methods for AI Detection with Theoretical Guarantees
T11 Unique, Vinod: The Limitations of Self-Attention for Information Retrieval
T12 Paolo, Sahana: Budget-Constrained Learning to Defer for Autoregressive Models
T17 Yuezhou: Classifying Uncertainty with In-Context Learning Methods

Dec 5, Poster Session II: Applying LLMs & Transformers

T4 Gabe, Ki, Marc: Online RL Considerations for LLM-Assisted JITAIs
T6 Andrej: Modern autoregressive architectures for the collective variable problem
T8 Abhiti, Julian, Yash: Adaptive Spectral Neural Operators
T9 Joe: Applying Transformers to Spectral Data
T13 Qiyuan, Zhilin: A Unified Framework for Multimodal Learning: Integrating Image, Text, and Crowd-Sourced Annotations
T14 Victor: Forecasting Solar Flares Using Time Series Foundation Models and LLM-Based Models
T15 Jiwoo, Yue: AI Agent Evaluation
T16 Daniel: Incorporating Domain Knowledge in Transformer-based models for Symbolic Regression

Dec 13, Project Reports Due

Courses / Blogs

Stanford
Princeton
Berkeley
Michigan EECS
Borealis AI blog series:

Basics

Special Year on LLMs and Transformers at the Simons Institute, UC Berkeley:

Classics

Markov (1913) An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains
Shannon (1951) Prediction and Entropy of Printed English
Zipf (1935) The Psycho-Biology of Language. Houghton Mifflin. (Reprinted by MIT Press in 1965)
Good(-Turing) (1953) The Population Frequencies of Species and the Estimation of Population Parameters
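
As a concrete companion to the Good(-Turing) entry above: in its simplest, unsmoothed form the estimator reserves probability mass N1/N for species never seen in the sample, where N1 is the number of species observed exactly once and N is the sample size. A minimal Python sketch of just this piece (the toy data are made up for illustration):

from collections import Counter

def good_turing_unseen_mass(tokens):
    """Unsmoothed Good-Turing estimate of the total probability mass of
    species never observed in the sample: N1 / N, where N1 is the number
    of species seen exactly once and N is the sample size."""
    counts = Counter(tokens)                 # species -> frequency r
    freq_of_freq = Counter(counts.values())  # r -> number of species with frequency r
    n1 = freq_of_freq.get(1, 0)
    n = sum(counts.values())
    return n1 / n

# Toy sample: 11 observations; d, e, f are each seen exactly once,
# so roughly 3/11 of the probability mass is reserved for unseen species.
sample = list("aaabbbccdef")
print(good_turing_unseen_mass(sample))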

Topics

Word Embeddings

Neural Word Embedding as Implicit Matrix Factorization
A Latent Variable Model Approach to PMI-based Word Embeddings
Skip-Gram – Zipf + Uniform = Vector Additivity
Analogies Explained: Towards Understanding Word Embeddings
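
A brief orientation for the first two entries above: the implicit-matrix-factorization view says that skip-gram with negative sampling approximately factorizes a (shifted) pointwise mutual information matrix, so low-rank factorization of a PMI matrix is a reasonable mental model for word embeddings. Below is a minimal sketch, assuming a toy co-occurrence count matrix of my own invention, of forming positive PMI and taking word vectors from a truncated SVD:

import numpy as np

def pmi_embeddings(cooc, dim, eps=1e-12):
    """Given a (words x contexts) co-occurrence count matrix, form the
    positive PMI matrix and return rank-`dim` word vectors from its SVD."""
    total = cooc.sum()
    p_wc = cooc / total                         # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)       # word marginals
    p_c = p_wc.sum(axis=0, keepdims=True)       # context marginals
    pmi = np.log((p_wc + eps) / (p_w @ p_c + eps))
    ppmi = np.maximum(pmi, 0.0)                 # positive PMI, common in practice
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])        # word embeddings of size dim

# Toy 4-word vocabulary with made-up co-occurrence counts.
cooc = np.array([[10., 2., 0., 1.],
                 [ 2., 8., 1., 0.],
                 [ 0., 1., 6., 3.],
                 [ 1., 0., 3., 7.]])
vectors = pmi_embeddings(cooc, dim=2)
print(vectors.shape)  # (4, 2)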

Attention

What can a Single Attention Layer Learn? A Study Through the Random Features Lens
Inductive Biases and Variable Creation in Self-Attention Mechanisms
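
For readers who have not yet met the object these papers analyze, here is a minimal NumPy sketch (my own illustration, not code from either paper) of a single-head softmax self-attention layer, y = softmax(Q K^T / sqrt(d_k)) V:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head softmax self-attention.
    X: (n, d) sequence of n token representations of width d.
    Wq, Wk, Wv: (d, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])    # (n, n) attention logits
    A = softmax(scores, axis=-1)              # each row is a distribution over positions
    return A @ V                              # (n, d_k) mixture of value vectors

rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4
X = rng.normal(size=(n, d))
out = self_attention(X, *(rng.normal(size=(d, d_k)) for _ in range(3)))
print(out.shape)  # (5, 4)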

Implicit Regularization

Implicit Regularization of Gradient Flow on One-Layer Softmax Attention

NTK Theory for Transformers

Infinite attention: NNGP and NTK for deep attention networks
Tensor Programs II: Neural Tangent Kernel for Any Architecture
A Kernel-Based View of Language Model Fine-Tuning

Reverse Engineering Transformers

Transformer Circuits Thread Project
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Progress Measures for Grokking Via Mechanistic Interpretability

Capabilities and Limitations of LLMs and Transformers

On the Turing Completeness of Modern Neural Network Architectures
Are Transformers universal approximators of sequence-to-sequence functions?
Theoretical Limitations of Self-Attention in Neural Sequence Models
On the Ability and Limitations of Transformers to Recognize Formal Languages
Self-Attention Networks Can Process Bounded Hierarchical Languages
Transformers Learn Shortcuts to Automata
Representational Strengths and Limitations of Transformers
On Limitations of the Transformer Architecture
Transformers, parallel computation, and logarithmic depth
From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers
Slaves to the Law of Large Numbers: An Asymptotic Equipartition Property for Perplexity in Generative Language Models
One-layer transformers fail to solve the induction heads task
Thinking Like Transformers

Beyond PAC Learning: Learning Distributions and Grammars, Learning to Generate

On the Learnability of Discrete Distributions
Near-optimal Sample Complexity Bounds for Robust Learning of Gaussian Mixtures via Compression Schemes
Distribution Learnability and Robustness
Inherent limitations of dimensions for characterizing learnability of distribution classes
Grammatical Inference: Learning Automata and Grammars
Mathematical Linguistics, especially Chapter 7 (Complexity) and Chapter 8 (Linguistic pattern recognition)
Language Generation in the Limit

Emergence

Are Emergent Abilities of Large Language Models a Mirage?
A Theory for Emergence of Complex Skills in Language Models

In-Context Learning

What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
The Learnability of In-Context Learning
Supervised Pretraining Can Learn In-Context Reinforcement Learning
Large Language Models can Implement Policy Iteration
Trainable Transformer in Transformer
Trained Transformers Learn Linear Models In-Context
The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains
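
Several of these papers (e.g. What Can Transformers Learn In-Context? and Trained Transformers Learn Linear Models In-Context) study a common setup: each prompt is a fresh linear regression task presented as labelled examples plus a query point, and the transformer's in-context prediction is compared against least squares fit on the prompt. A minimal sketch of that data-generating protocol and baseline, with noiseless labels for simplicity:

import numpy as np

def sample_icl_prompt(rng, d=5, k=10):
    """Sample one in-context learning prompt for linear regression:
    k labelled examples from a freshly drawn linear function, plus one query."""
    w = rng.normal(size=d)                    # task drawn fresh per prompt
    X = rng.normal(size=(k, d))
    y = X @ w                                 # noiseless labels for simplicity
    x_query = rng.normal(size=d)
    return X, y, x_query, x_query @ w         # prompt and the target answer

def least_squares_baseline(X, y, x_query):
    """The natural in-context baseline: fit w by least squares on the prompt."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return x_query @ w_hat

rng = np.random.default_rng(0)
X, y, x_query, target = sample_icl_prompt(rng)
print(target, least_squares_baseline(X, y, x_query))  # agree when k >= d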

Hallucinations

Calibrated Language Models Must Hallucinate

Assessing Model Uncertainty

Distinguishing the Knowable from the Unknowable with Language Models
Conformal Language Modeling
Language Models with Conformal Factuality Guarantees
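
The conformal entries above build on the standard split-conformal recipe: score each calibration example with a nonconformity measure, then threshold at a finite-sample-corrected quantile to get marginal coverage 1 - alpha. The sketch below shows only that generic thresholding step, not the full procedure of either paper; the exponential calibration scores are placeholders:

import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal: return the threshold tau such that accepting
    candidates with nonconformity score <= tau gives (1 - alpha)
    marginal coverage on exchangeable data."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # finite-sample corrected rank
    return np.sort(cal_scores)[min(k, n) - 1]

# Placeholder calibration scores (e.g. negative log-likelihood of the reference answer).
rng = np.random.default_rng(0)
cal_scores = rng.exponential(size=200)
tau = conformal_threshold(cal_scores, alpha=0.1)
print(tau)   # keep any generation whose score is <= tau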

RLHF

Training language models to follow instructions with human feedback
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Efficient Exploration for LLMs

State Space Models

Mamba: Linear-time sequence modeling with selective state spaces
Repeat After Me: Transformers are Better than State Space Models at Copying
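
For orientation, the recurrence underlying these models is a discrete-time linear state space model, h_t = A h_{t-1} + B u_t, y_t = C h_t; Mamba's contribution is to make the parameters input-dependent ("selective") while keeping linear-time sequence processing. A minimal sketch of the plain, non-selective recurrence with arbitrary fixed matrices of my own choosing:

import numpy as np

def linear_ssm(u, A, B, C):
    """Discrete-time linear state space model:
    h_t = A h_{t-1} + B u_t,   y_t = C h_t.
    u: (T, d_in) input sequence; returns the (T, d_out) output sequence."""
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(u.shape[0]):
        h = A @ h + B @ u[t]
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
d_state, d_in, d_out, T = 4, 2, 3, 6
A = 0.9 * np.eye(d_state)                    # a stable state transition
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
print(linear_ssm(rng.normal(size=(T, d_in)), A, B, C).shape)  # (6, 3)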

Transformers as Interacting Particle Systems

A mathematical perspective on Transformers

Language Modeling, Prediction, and Compression

On prediction by data compression
Language Modeling Is Compression
Prediction by Compression
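
The thread running through these papers is that a language model and a lossless compressor are two views of the same object: the ideal code length of a sequence under a model is its negative log probability, -sum_t log2 p(x_t | x_<t), which an arithmetic coder achieves to within about two bits. A minimal sketch with a toy unigram character model (the choice of text and model is just for illustration):

import math
from collections import Counter

def code_length_bits(text, model_probs):
    """Ideal code length (in bits) of `text` under a model assigning
    probability model_probs[c] to each character c: -sum log2 p(c)."""
    return -sum(math.log2(model_probs[c]) for c in text)

# Toy "model": unigram character frequencies estimated from the text itself.
text = "abracadabra"
counts = Counter(text)
probs = {c: n / len(text) for c, n in counts.items()}
print(code_length_bits(text, probs))   # about 22 bits, vs. 8 * 11 = 88 bits of raw ASCII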

LLMs, Online Learning, and Regret

Do LLM Agents Have Regret? A Case Study in Online Learning and Games

Transformers in RL

Transformers in Reinforcement Learning: A Survey

LLMs and Causality

Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

LLMs and Cognitive Science, Linguistics, Neuroscience

Formal grammar and information theory: together again?
The debate over understanding in AI’s large language models
Language models and linguistic theories beyond words
Noam Chomsky: The False Promise of ChatGPT
Modern language models refute Chomsky’s approach to language
Dissociating language and thought in large language models
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
Shared computational principles for language processing in humans and deep language models
The neural architecture of language: Integrative modeling converges on predictive processing
Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data

LLMs and Foundation Models for Science and Mathematics

On the Opportunities and Risks of Foundation Models
MIDAS Symposium 2024
MICDE Symposium 2024