Tag

chain-of-thought

The chain-of-thought tag covers how models connect intermediate reasoning steps, not just their final answers. Topics include long-horizon benchmarks, agent loops, structured outputs, and stability under long context, all of which matter when evaluating and deploying LLMs.

7 articles

Research/Jun 8

LLMs stumble on counterintuitive probability

A benchmark finds LLMs are strong on standard probability problems but falter on counterintuitive ones.

Research/Jun 5

Why Prompt Engineering Is Wrong About 2026

Prompt engineering is giving way to context engineering, and structured frameworks win because they reduce errors and improve repeatability.

Research/Jun 3

IPT helps VLMs reason about hidden space

Imaginative Perception Tokens improve multimodal models’ ability to reason about unseen spatial structure.

Research/May 21

What large language models are, and how they work

Large language models turn huge text corpora into systems that generate, summarize, and reason with language.

Tools & Apps/May 21

Prompt engineering turns vague asks into usable outputs

I break down prompt engineering into practical patterns, with a copy-ready template for better LLM outputs.

Research/Apr 16

LongCoT Benchmark: 2,500-Problem Long-Horizon Reasoning

LongCoT is a 2,500-problem benchmark for measuring whether frontier models can sustain long, interdependent reasoning chains.

AI Agent/Apr 3

Prompt Engineering for Agents and Structured Outputs

Prompt engineering gets harder in production: reasoning, long contexts, JSON contracts, and agent loops all need different prompt tactics.