Measuring when LLM behavior actually transfers

OraCore Editors

[RSCH] June 29, 2026 · 7 min read · OraCore Editors

A new framework tests whether an LLM’s behavior transfers across payoff-equivalent decision environments.

LLM evaluation

This paper shows that LLM behavior often fails to transfer across payoff-equivalent environments.

Research org: University of Chicago Knowledge Lab
Core data: Seven canonical economic decision problems
Breakthrough: Train on source environments, test on payoff-equivalent targets

Large language models are increasingly used as decision makers, which means developers need to know more than whether a model looks good on one prompt or one benchmark. This paper argues that the real question is portability: if you change the surface framing but keep the underlying incentives the same, does the model keep behaving the same way?

The answer here is mostly no. The authors build a framework for measuring whether a behavioral mapping learned in one environment still works in another payoff-equivalent environment, and they find substantial portability losses across several economic decision tasks.

What problem this paper is trying to fix

Most LLM evaluation is suite-based. You test a model on a curated set of prompts or tasks, then hope the result generalizes. That works only if the model’s behavior is stable under harmless changes in framing. In real deployments, though, the same decision problem can show up with different wording, structure, or presentation.

This matters for domains like hiring, task allocation, or any setting where an LLM is acting as a delegate. If the model’s decisions shift when the same underlying incentives are described differently, a finite evaluation suite can give a false sense of reliability.

The paper frames this as a behavioral portability problem. The key question is not just whether a model performs well in one environment, but whether a behavioral pattern recovered from that environment carries over to another environment that preserves the same payoff structure.

How the method works in plain English

The setup separates each decision environment into two parts: payoff-relevant features, called x, and the rest of the presentation, called z. If a model is truly responding only to the payoff-relevant part, then changing z should not change its action distribution.
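
One way to write that invariance condition down is sketched here. The notation is an assumption made for illustration and may differ from the paper's: $\pi(a \mid x, z)$ denotes the model's probability of choosing action $a$ in an environment with payoff-relevant features $x$ and presentation $z$.

```latex
\pi(a \mid x, z) \;=\; \pi(a \mid x, z')
\qquad \text{for all actions } a \text{ and all presentations } z,\, z'.
```

Read this way, a portability failure is any pair of presentations $z \neq z'$ for which the two action distributions differ even though $x$ is the same.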

To test that, the authors construct many payoff-equivalent environments for each task. They train an interpretable behavioral model on a set of source environments, then evaluate it in a held-out target environment. They compare that source-trained model against a benchmark trained directly on the target environment.
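
The source-versus-target comparison can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the article does not say which interpretable model or loss the paper uses, so the smoothed frequency table, the log loss, the helper names, and the toy Dictator-style data are all assumptions.

```python
import math
from collections import Counter, defaultdict

def fit_frequency_model(records, actions, alpha=1.0):
    """Estimate P(action | x) from (x, action) pairs with Laplace smoothing.

    Assumption: a frequency table stands in for the paper's unspecified
    interpretable behavioral model.
    """
    counts = defaultdict(Counter)
    for x, action in records:
        counts[x][action] += 1

    def predict(x):
        total = sum(counts[x].values()) + alpha * len(actions)
        return {a: (counts[x][a] + alpha) / total for a in actions}

    return predict

def mean_log_loss(predict, records):
    """Average negative log-likelihood of the observed actions."""
    return -sum(math.log(predict(x)[a]) for x, a in records) / len(records)

def transfer_gap(source_records, target_train, target_test, actions):
    """Loss of the source-trained model minus loss of the target-trained
    benchmark, both scored on held-out target data. Larger means worse transfer."""
    source_model = fit_frequency_model(source_records, actions)
    target_model = fit_frequency_model(target_train, actions)
    return (mean_log_loss(source_model, target_test)
            - mean_log_loss(target_model, target_test))

# Toy data (made up): x is the endowment, the action is what the model chose.
ACTIONS = ["keep", "split"]
source = [(10, "split")] * 8 + [(10, "keep")] * 2        # source framing
target_train = [(10, "keep")] * 8 + [(10, "split")] * 2  # target framing
target_test = [(10, "keep")] * 4 + [(10, "split")] * 1

gap = transfer_gap(source, target_train, target_test, ACTIONS)
same_frame_gap = transfer_gap(target_train, target_train, target_test, ACTIONS)
print(f"transfer gap: {gap:.3f}, same-frame gap: {same_frame_gap:.3f}")
```

A gap near zero would mean the mapping learned on the source frames predicts target behavior about as well as a model fit on the target itself. A clearly positive gap is the kind of portability loss the paper reports.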

The paper uses two related views of portability. One is predictive transfer: how much worse does the source-trained representation do on the target than the target-trained one? The other is a loss-agnostic measure based on total variation distance between the joint distributions over predicted and realized actions. That gives a worst-case bound on how much the two representations can differ on any bounded criterion.
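
The loss-agnostic view can be illustrated with a small Python sketch. It relies on a standard fact: for any criterion taking values in [0, 1], the difference between its expectations under two distributions is at most the total variation distance between them. The function names, the toy predictions, and the choice of accuracy as the example criterion are assumptions for illustration, not details from the paper.

```python
from collections import Counter

def joint_distribution(pairs):
    """Empirical joint distribution over (predicted, realized) action pairs."""
    counts = Counter(pairs)
    n = len(pairs)
    return {pair: c / n for pair, c in counts.items()}

def total_variation(p, q):
    """Total variation distance: half the L1 distance between two distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)

def expected(criterion, dist):
    """Expected value of a criterion(predicted, realized) under a joint distribution."""
    return sum(prob * criterion(pred, real) for (pred, real), prob in dist.items())

# Toy target-environment data (made up for illustration).
realized = ["keep", "keep", "keep", "split"]
source_predictions = ["split", "split", "keep", "split"]  # source-trained model
target_predictions = ["keep", "keep", "keep", "split"]    # target-trained benchmark

p_source = joint_distribution(list(zip(source_predictions, realized)))
p_target = joint_distribution(list(zip(target_predictions, realized)))
tv = total_variation(p_source, p_target)

# One bounded criterion among many: accuracy, which takes values in [0, 1].
def accuracy(pred, real):
    return 1.0 if pred == real else 0.0

accuracy_gap = abs(expected(accuracy, p_source) - expected(accuracy, p_target))
print(f"TV distance: {tv:.2f}, accuracy gap: {accuracy_gap:.2f}")
```

Because the bound holds for every criterion with values in [0, 1], a small total variation distance rules out large disagreements under accuracy or any other such score, which is what makes the measure independent of a particular scoring rule.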

That design is useful because it avoids tying the main portability notion to one scoring rule. In other words, the paper is not just asking whether one model has lower loss than another under one metric; it is asking whether the behavioral mapping itself changes across frames.

What the paper actually shows

The experiments cover seven one-shot economic decision problems: Dictator, Ultimatum, Trust, Public Goods, Beauty Contest, Lottery Choice, and a Normal-Form game. For each task, the authors build a large set of decision environments that preserve the payoff mapping while varying framing and style.

They evaluate several models: GPT-4.1-nano, Gemma-3-12B, Llama-3.1-8B, Llama-3.1-70B, and DeepSeek-R1. They test both answer-only prompting and chain-of-thought prompting.

The main result is blunt: the tested LLMs do not demonstrate portability. Behavioral mappings learned in one environment often predict worse in another, even when the environments are payoff-equivalent by construction.

The paper also finds that chain-of-thought prompting changes portability, but not in one consistent direction. On average it improves portability, though some settings see no gain. The reasoning model DeepSeek-R1 shows better portability across the tested tasks.

There are no benchmark numbers in the abstract itself, so the safe takeaway is qualitative rather than numeric. The paper’s contribution is the measurement framework and the consistent finding of portability loss, not a single headline score.

Why developers should care

If you are building systems that route decisions through an LLM, this paper is a warning about overconfidence in prompt tests. A model can appear stable within one suite and still shift behavior when the same underlying decision is reframed.

That makes portability a deployment concern, not just an academic one. If your application depends on the model respecting an incentive structure, you need to know whether that structure survives paraphrase, reformulation, or changes in presentation.

For engineers, the practical lesson is that evaluation should not stop at aggregate task scores. You need tests that deliberately vary the surface form while holding the payoff-relevant structure fixed. The paper’s framework offers one way to do that.
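
A frame-variation check of that kind might look like the sketch below. Everything in it is hypothetical: the two prompt templates, the `stub_decision_maker` function standing in for a real model call, and the use of the maximum pairwise total variation distance as the summary statistic. It is not the paper's framework, only a small test in the same spirit.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical framings of one Dictator-style decision with the same payoffs.
FRAMES = {
    "neutral": "You have {endowment} tokens. How many do you give to the other player?",
    "workplace": "Your team received a {endowment}-token bonus. "
                 "How many tokens do you pass to your colleague?",
}

def stub_decision_maker(prompt):
    """Hypothetical stand-in for an LLM call; deliberately frame-sensitive."""
    return 5 if "colleague" in prompt else 2

def action_distribution(decide, template, endowment, n_samples=20):
    """Sample the decision maker repeatedly on one framing of the problem."""
    prompt = template.format(endowment=endowment)
    counts = Counter(decide(prompt) for _ in range(n_samples))
    return {action: c / n_samples for action, c in counts.items()}

def total_variation(p, q):
    """Half the L1 distance between two action distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)

def max_frame_shift(decide, frames, endowment):
    """Largest pairwise distance between action distributions across framings."""
    dists = {name: action_distribution(decide, template, endowment)
             for name, template in frames.items()}
    return max(total_variation(dists[a], dists[b])
               for a, b in combinations(dists, 2))

shift = max_frame_shift(stub_decision_maker, FRAMES, endowment=10)
invariant_shift = max_frame_shift(lambda prompt: 3, FRAMES, endowment=10)
print(f"frame-sensitive stub: {shift}, frame-invariant stub: {invariant_shift}")
```

In practice the stub would be replaced by a call to the model under test, sampled several times per framing, and the acceptable shift would be a threshold chosen for the application rather than a fixed rule.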

What this does not prove

The study is controlled and useful, but it is still bounded by the tasks it chose. The experiments are in experimental economics, where payoff structure is transparent. That makes the analysis cleaner, but it also means the result is not a universal proof about every possible deployment setting.

The paper also does not claim that all LLM behavior is unstable in every context. It shows that portability losses are substantial and systematic in the tested environments, and that even reasoning-oriented prompting does not eliminate the problem.

Another limitation is that the abstract does not report numeric portability values. The paper clearly reports a loss, but its exact magnitude has to come from the full results section.

The bigger idea

This paper treats prompt sensitivity as a measurement problem. Instead of asking whether a model is “good” in the abstract, it asks whether the behavior learned from one environment survives transport to another environment with the same incentives.

That is a useful lens for anyone evaluating LLMs as agents. If the model’s action policy depends on payoff-irrelevant framing, then the behavior you observe in a benchmark suite may not be the behavior you get in the wild.

In short: the paper shows that behavioral portability is measurable, and that current LLMs can fail that test even when the underlying task is unchanged.

Portability should be tested across payoff-equivalent frames, not just across prompts.
Chain-of-thought can change portability, but it does not reliably improve it everywhere.
Reasoning-oriented models may transfer better, but portability gaps still remain.
