Tag
chain-of-thought
Chain-of-thought covers how models connect intermediate reasoning steps, not just their final answers. Articles under this tag span long-horizon benchmarks, agent loops, structured outputs, and stability under long context, all of which matter when evaluating and deploying LLMs.
7 articles

LLMs stumble on counterintuitive probability
A benchmark finds LLMs are strong on standard probability problems but falter on counterintuitive ones.

Why Prompt Engineering Is Wrong About 2026
Prompt engineering is giving way to context engineering, and structured frameworks win because they reduce errors and improve repeatability.

IPT helps VLMs reason about hidden space
Imaginative Perception Tokens improve multimodal models’ ability to reason about unseen spatial structure.

What large language models are, and how they work
Large language models turn huge text corpora into systems that generate, summarize, and reason with language.

Prompt engineering turns vague asks into usable outputs
I break down prompt engineering into practical patterns, with a copy-ready template for better LLM outputs.

LongCoT Benchmark: 2,500-Problem Long-Horizon Reasoning
LongCoT is a 2,500-problem benchmark for measuring whether frontier models can sustain long, interdependent reasoning chains.

Prompt Engineering for Agents and Structured Outputs
Prompt engineering gets harder in production: reasoning, long contexts, JSON contracts, and agent loops all need different prompt tactics.