STATS 700, Fall 2025
Since the release of OpenAI’s o1 and DeepSeek’s R1 models, interest in the reasoning capabilities of large language models (LLMs) has increased sharply. This half-semester (7-week) course will cover some of the main ingredients that go into enhancing an LLM’s reasoning capability. We will also discuss some recent theory papers that try to understand this fascinating emerging area from a mathematical perspective.
A strong interest in reasoning and LLMs, together with a high level of mathematical maturity, will be needed to fully benefit from this course. The topic list below is tentative and subject to change.
Logistics
Time & Days: TuTh 2:30PM - 4:00PM
Location: 2060 SKB
Half-semester course dates: Aug 25, 2025 to Oct 10, 2025
Topics
Background (~ 2 weeks)
J&M = Speech and Language Processing (3rd ed. draft) by Dan Jurafsky and James H. Martin
LLMs
- Transformers: J&M Chapter 9 (annotated chapter)
- Large Language Models: J&M Chapter 10 (annotated chapter)
- Model Alignment, Prompting, and In-Context Learning: J&M Chapter 12 (annotated chapter)
- Since Section 12.7 (Model Alignment with Human Preferences) is missing from Chapter 12 above, we will refer to these notes. A ChatGPT-generated LaTeX PDF is here (warning: it might contain errors!)
- For more on RLHF, you can also refer to the RLHF book being written by Nathan Lambert, currently a post-training lead at the Allen Institute for AI.
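As a concrete illustration of the preference-learning objective covered in those references, here is a minimal sketch (not taken from either text; function name and reward values are illustrative) of the Bradley-Terry pairwise loss commonly used to train RLHF reward models:

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    The reward model is trained so that the human-preferred response scores
    higher than the rejected one; the loss shrinks as the margin grows.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Ranking the chosen response above the rejected one yields a small loss;
# ranking it below yields a large one.
good_ranking = reward_model_loss(2.0, 0.0)
bad_ranking = reward_model_loss(0.0, 2.0)
```

In practice the scalar rewards come from a learned model head and the loss is averaged over a dataset of human preference pairs; the sketch above just isolates the per-pair objective.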
Reasoning LLMs
Theory Papers (~ 5 weeks)
- A Theory of Emergent In-Context Learning as Implicit Structure Induction
- Two main results: (1) in-context learning (ICL) abilities can arise if next-token pretraining is done on distributions with compositional structure; (2) prompting an LLM to produce intermediate tokens can improve performance.
- Chain of Thought Empowers Transformers to Solve Inherently Serial Problems, ICLR 2024
- With T steps of CoT, constant-depth transformers with constant-bit precision and logarithmic embedding size can solve any problem solvable by Boolean circuits of size T.
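A canonical example of an inherently serial problem is parity, which constant-depth circuits cannot compute but which becomes trivially serial once intermediate tokens are allowed. A toy Python sketch (an illustration, not from the paper) of how a chain of thought linearizes the computation:

```python
def parity_with_cot(bits):
    """Compute parity by emitting one intermediate "CoT token" per input bit.

    Each step needs only the previously emitted token and one input bit,
    i.e., constant work per step: the chain of T tokens carries the serial
    state that a constant-depth circuit cannot maintain in a single pass.
    """
    state = 0
    trace = []  # the chain of thought: running parity after each bit
    for b in bits:
        state ^= b
        trace.append(state)
    return state, trace

answer, chain = parity_with_cot([1, 0, 1, 1, 0])
```

The point of the theorem is that this pattern is generic: any size-T circuit can be evaluated gate by gate across T CoT steps, with each step doing only constant work.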
- Scaling Test-Time Compute Without Verification or RL is Suboptimal, ICML 2025
- Proves that verifier-based methods using RL/search dominate verifier-free methods based on distillation or cloning search traces, given fixed compute/data budgets.
- Optimizing Test-Time Compute via Meta Reinforcement Finetuning, ICML 2025
- Formalizes optimizing test-time compute as a meta-RL problem, offering guidance on how to optimally allocate inference-time computation.
- On the Power of Context-Enhanced Learning in LLMs, ICML 2025
- Proposes CEL, a variant of supervised fine-tuning where extra context is provided but gradients are not taken through it. In a simplified setting, shows CEL can be exponentially more sample-efficient than vanilla SFT for multi-step reasoning tasks.
- Understanding Chain-of-Thought in LLMs through Information Theory, ICML 2025
- Provides an information-theoretic framework that quantifies the “information gain” at each reasoning step.
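As a toy illustration of this quantity (the distributions below are made up), the information gain of step t can be read off as the drop in entropy of the model's posterior over candidate answers after conditioning on that step:

```python
import math

def entropy(ps):
    """Shannon entropy (in bits) of a distribution over candidate answers."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Hypothetical posterior over 4 candidate answers after each reasoning step.
posteriors = [
    [0.25, 0.25, 0.25, 0.25],  # before reasoning: maximally uncertain
    [0.50, 0.30, 0.10, 0.10],  # step 1 narrows things down
    [0.80, 0.10, 0.05, 0.05],  # step 2 narrows further
    [0.97, 0.01, 0.01, 0.01],  # step 3 nearly determines the answer
]

# Information gain of step t = entropy before the step minus entropy after.
gains = [entropy(posteriors[t]) - entropy(posteriors[t + 1])
         for t in range(len(posteriors) - 1)]
```

A step with near-zero gain contributes nothing toward the answer, which is the kind of diagnostic the framework makes precise.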
- A Theory of Learning with Autoregressive Chain of Thought, COLT 2025
- Proposes a learning-theoretic framework where prompt-to-answer mapping is modeled as repeated application of a time-invariant “single-step” function. Considers both observed and latent CoT settings, showing sample complexity can be independent of CoT length, with attention arising naturally in the framework.
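A minimal sketch of that viewpoint (function names are illustrative): generation is just repeated application of one time-invariant step function to the growing sequence, halting at a stop token.

```python
def run_cot(step_fn, prompt, max_steps=100, stop="<eos>"):
    """Autoregressive CoT as iteration of a time-invariant step function.

    The same step_fn is applied at every position; it maps the sequence
    produced so far to the next token, and generation halts at `stop`.
    """
    seq = list(prompt)
    for _ in range(max_steps):
        tok = step_fn(seq)
        seq.append(tok)
        if tok == stop:
            break
    return seq

# Toy step function: count down from the last number, then stop.
def countdown_step(seq):
    last = seq[-1]
    return last - 1 if isinstance(last, int) and last > 0 else "<eos>"

trace = run_cot(countdown_step, [3])
```

In the paper's framework it is this single-step function, not the full prompt-to-answer map, whose complexity controls the sample complexity of learning.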
- When More is Less: Understanding Chain-of-Thought Length in LLMs
- Studies optimal CoT lengths, showing that longer is not always better: performance peaks at a sweet spot, then declines due to error accumulation.
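A toy model (parameters made up) reproduces this sweet-spot effect: splitting a task of fixed difficulty into L steps makes each step easier, but every extra step also carries a fixed chance of a slip, so overall success first rises and then falls with L.

```python
def chain_success(L, base_err=0.02, difficulty=1.0):
    """P(all L steps correct) when per-step error = fixed floor + difficulty/L."""
    per_step_err = base_err + difficulty / L
    if per_step_err >= 1.0:
        return 0.0
    return (1.0 - per_step_err) ** L

probs = {L: chain_success(L) for L in range(1, 61)}
best_L = max(probs, key=probs.get)  # interior optimum: neither 1 nor 60
```

For large L the success probability behaves like exp(-base_err * L), so the accumulated slip probability eventually dominates the benefit of easier steps.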
- Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning
- Shows that Pass@N is misaligned with cross-entropy training. Proposes confidence-limiting objectives that improve performance on math and reasoning tasks.
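A small worked example (numbers made up) shows the misalignment: Pass@N rewards keeping some probability mass on hard problems, so an overconfident policy that maximizes greedy (Pass@1) accuracy can lose badly at larger N.

```python
def pass_at_n(p, n):
    """P(at least one of n independent samples is correct), per-sample prob p."""
    return 1.0 - (1.0 - p) ** n

def avg_pass_at_n(per_problem_p, n):
    """Average Pass@N over a benchmark, one success probability per problem."""
    return sum(pass_at_n(p, n) for p in per_problem_p) / len(per_problem_p)

# Two hypothetical policies on a 2-problem benchmark.
confident = [0.9, 0.0]  # sharp: nails problem 1, has given up on problem 2
hedged = [0.3, 0.3]     # spreads mass: worse greedy accuracy on both
```

The confident policy wins at N = 1, but the hedged one wins at N = 8: keeping even modest probability on the hard problem compounds over repeated samples, which is exactly what cross-entropy training (pushing mass onto single high-confidence answers) fails to encourage.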
- On Learning Verifiers for Chain-of-Thought Reasoning
- Analyzes the PAC-learnability of verifiers for CoT reasoning. Derives sample-complexity upper bounds and impossibility results.
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Argues that CoT gains are distribution-dependent and may vanish out-of-distribution, suggesting that CoT reasoning is brittle and not robustly general.
Interesting Observations Waiting for Theoretical Analysis