Can LLMs Write Correct TLA+ Specs?
A benchmark of 30 LLMs shows they rarely generate semantically correct TLA+ specs from natural language.

- Research org: Loyola University Chicago
- Core data: semantic correctness peaks at 8.6% (syntactic correctness peaks at 26.6%)
- Breakthrough: First systematic evaluation of natural-language-to-TLA+ synthesis
This paper looks at a very practical question for formal methods: if you describe a system in plain English, can an LLM turn that description into a correct TLA+ specification? The short answer is no, not reliably. The authors show that current models can sometimes produce something that parses, but they still struggle to capture the intended behavior in a way that passes formal checking.
That matters because TLA+ is already used in industry for high-stakes distributed and concurrent systems. If an LLM can help draft specs, it could reduce the time and expertise needed to get started. But if it quietly gets the semantics wrong, it creates a new kind of risk: a specification that looks plausible and still misses the real system behavior.
What problem this paper is trying to fix
Writing TLA+ is hard. The language combines temporal logic, first-order logic, and set theory, so a spec has to be both syntactically valid and semantically faithful to the system being described. The paper points out that this is a real adoption barrier: engineers have to translate informal requirements, hidden assumptions, failure modes, concurrency behavior, and consistency rules into precise mathematical statements.

That translation step is exactly where LLMs seem attractive. Code models already do well on many programming tasks, so it is reasonable to ask whether they can also draft formal specifications. The catch is that formal specs are stricter than ordinary code. A model can produce something that compiles or parses and still miss a fairness condition, weaken an invariant, or omit a variable that changes the meaning of the whole system.
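To make the "parses but means something else" problem concrete, here is a minimal sketch in Python rather than TLA+. It is a toy explicit-state search, not TLC, and the two-process system, the function names, and both invariants are hypothetical illustrations that do not come from the paper. The system deliberately has no lock, so the intended mutual-exclusion invariant fails, while a slightly weakened invariant passes without complaint.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional, Tuple

# A state is (a, b): each entry is 1 if that process is in its critical section.
State = Tuple[int, int]

def init_states() -> List[State]:
    return [(0, 0)]

def next_states(state: State) -> List[State]:
    a, b = state
    # Deliberately unsafe toy system: either process may enter or leave at will.
    return [(1 - a, b), (a, 1 - b)]

def find_violation(
    init: Iterable[State],
    nxt: Callable[[State], Iterable[State]],
    invariant: Callable[[State], bool],
) -> Optional[State]:
    """Breadth-first search over reachable states; return the first state that breaks the invariant."""
    start = list(init)
    seen = set(start)
    queue = deque(start)
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for successor in nxt(state):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return None

def mutual_exclusion(state: State) -> bool:
    # Intended property: at most one process in the critical section.
    return state[0] + state[1] <= 1

def weakened_invariant(state: State) -> bool:
    # Weakened property: well-formed, but it allows both processes in at once.
    return state[0] + state[1] <= 2

print(find_violation(init_states(), next_states, mutual_exclusion))
print(find_violation(init_states(), next_states, weakened_invariant))
```

Both invariants are equally well-formed, yet only the first one exposes the bug. A generated spec that states the weaker property would check cleanly and still say nothing useful about the system.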
The authors also note a data problem. TLA+ has a much smaller public corpus than mainstream programming languages, so models have far less exposure to it during training. That makes it a tougher target than ordinary code generation, and it helps explain why natural-language-to-TLA+ generation has been under-studied.
How the method works in plain English
The paper builds a benchmark of 205 TLA+ specifications from the TLA+ Foundation, each paired with natural-language comments and TLC configurations. Those examples are split into train, validation, and test sets, and the authors use them to evaluate whether models can synthesize TLA+ from natural language.
They test 30 LLMs across eight families, including DeepSeek, LLaMA, Qwen, QwQ, GPT-OSS, code-specialized models such as CodeLLaMA and Granite, instruction-tuned models such as Mistral, Phi, Gemma, and Starling-LM, plus proprietary APIs such as OpenAI GPT and Anthropic Claude. The core sweep covers 25 open-weight models across four prompting strategies for 2,600 runs. They also evaluate five proprietary models with few-shot prompting only, for 130 runs.
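The run counts can be sanity-checked with a little arithmetic. The sketch below assumes every model and prompting-strategy pair was run once per evaluation spec. That assumption, the helper name, and the resulting per-configuration figure are inferences from the reported totals, not numbers stated in this summary.

```python
# Figures as reported in the summary.
OPEN_WEIGHT_MODELS = 25
PROMPTING_STRATEGIES = 4
OPEN_WEIGHT_RUNS = 2600
PROPRIETARY_MODELS = 5
PROPRIETARY_RUNS = 130  # few-shot prompting only, so a single strategy

def implied_specs_per_config(total_runs: int, models: int, strategies: int) -> int:
    """Return runs per (model, strategy) pair, failing if the division is not exact."""
    configs = models * strategies
    if total_runs % configs != 0:
        raise ValueError("run count is not a whole multiple of the model/strategy pairs")
    return total_runs // configs

open_weight = implied_specs_per_config(OPEN_WEIGHT_RUNS, OPEN_WEIGHT_MODELS, PROMPTING_STRATEGIES)
proprietary = implied_specs_per_config(PROPRIETARY_RUNS, PROPRIETARY_MODELS, 1)
print(open_weight, proprietary)  # → 26 26
```

Both totals divide evenly to the same value, which is consistent with a single shared evaluation set being used for the open-weight and proprietary sweeps.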
Every generated output is checked twice: first by the SANY parser for syntax, then by the TLC model checker for semantics. That distinction is important. Syntax tells you whether the file is structurally valid TLA+. Semantics tells you whether the spec actually behaves like the intended system.
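A two-stage check of this kind might be wired up roughly as follows. This is a sketch, not the authors' harness: the function and class names are hypothetical, the jar path is a placeholder, and treating a zero exit code as success is an assumption (depending on the tools version, you may need to inspect the tool output instead). The `tla2sany.SANY` and `tlc2.TLC` entry points are the standard ones shipped in `tla2tools.jar`.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Sequence

# A runner takes a command line and returns its exit code.
Runner = Callable[[Sequence[str]], int]

def run_command(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd), capture_output=True, text=True).returncode

@dataclass
class CheckResult:
    syntax_ok: bool
    semantics_ok: bool

def check_spec(
    spec_path: str,
    cfg_path: str,
    jar: str = "tla2tools.jar",  # placeholder path to the TLA+ tools jar
    runner: Runner = run_command,
) -> CheckResult:
    """Stage 1: parse with SANY. Stage 2: model-check with TLC, only if parsing succeeded."""
    sany_cmd = ["java", "-cp", jar, "tla2sany.SANY", spec_path]
    if runner(sany_cmd) != 0:
        # A spec that does not parse cannot be model-checked at all.
        return CheckResult(syntax_ok=False, semantics_ok=False)
    tlc_cmd = ["java", "-cp", jar, "tlc2.TLC", "-config", cfg_path, spec_path]
    return CheckResult(syntax_ok=True, semantics_ok=runner(tlc_cmd) == 0)
```

The injectable `runner` lets the staging logic be exercised without Java installed. In real use you would call `check_spec("Spec.tla", "Spec.cfg")` with the jar available locally. Because the second stage only runs after the first succeeds, semantic correctness can never exceed syntactic correctness for the same set of outputs.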
The paper also compares prompting styles and looks for patterns in failure. One of the key ideas is progressive prompting, which is the only strategy that produced semantic successes in this study. The authors also examine whether model size matters, and whether code-specialized training helps or hurts on a formal language like TLA+.
What the paper actually shows
The clearest result is that current LLMs are not dependable TLA+ spec writers. Across the evaluated models, the best syntactic correctness reached 26.6%, but semantic correctness topped out at only 8.6%. In other words, even when a model could produce something that looked like TLA+, it usually did not survive semantic validation.

Another important result is that bigger is not automatically better. The paper says model size does not predict quality, and gives a concrete example: DeepSeek r1:8b outperforms its 70B variant across all prompting strategies. The authors interpret that as evidence that reasoning alignment matters more than raw parameter count for formal languages.
The study also finds that code-specialized models consistently underperform general-purpose ones. The paper attributes that to negative transfer from mainstream programming-language training. That is a useful warning for teams that might assume “code model” automatically means “better at formal specs.” For TLA+, the opposite can happen.
On the failure-analysis side, the authors identify five recurring hallucination categories: Unicode operator substitution, cross-language syntax injection, reasoning and formatting leakage, generation-length miscalibration, and structural errors. They trace these back to biases in current training data, especially around code, formal math, and reasoning samples. The paper’s broader claim is not that LLMs are useless here, but that they are not reliable without expert oversight.
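Several of these categories are regular enough to flag with cheap text heuristics before a parser ever runs. The sketch below is an illustration of that idea only: the patterns, labels, and function name are hypothetical and are not the authors' taxonomy code. It covers four of the five categories and leaves out generation-length miscalibration, which needs a reference length to compare against.

```python
import re
from typing import List

# Math symbols a model may emit in place of ASCII TLA+ operators such as /\ or \in.
UNICODE_OPERATORS = "∧∨¬∈∉⇒∀∃≠≤≥"

# Line openers typical of mainstream languages, not of TLA+ (heuristic only).
FOREIGN_SYNTAX = re.compile(
    r"^\s*(def |class |function |import |#include|public |return\b)", re.MULTILINE
)

# Reasoning tags or Markdown fences that leaked into the spec body.
LEAKAGE = re.compile(r"</?think>|^```", re.MULTILINE)

# A module needs a header such as '---- MODULE Name ----' and a closing '====' line.
MODULE_HEADER = re.compile(r"^-{4,}\s*MODULE\s+\w+\s*-{4,}\s*$", re.MULTILINE)
MODULE_FOOTER = re.compile(r"^={4,}\s*$", re.MULTILINE)

def flag_issues(text: str) -> List[str]:
    """Return heuristic labels for suspicious patterns in generated TLA+ text."""
    issues = []
    if any(ch in UNICODE_OPERATORS for ch in text):
        issues.append("unicode_operator_substitution")
    if FOREIGN_SYNTAX.search(text):
        issues.append("cross_language_syntax_injection")
    if LEAKAGE.search(text):
        issues.append("reasoning_or_formatting_leakage")
    if not (MODULE_HEADER.search(text) and MODULE_FOOTER.search(text)):
        issues.append("structural_error")
    return issues

clean = "---- MODULE Counter ----\nVARIABLE x\nInit == x = 0\nNext == x' = x + 1\n===="
noisy = "<think>plan</think>\n---- MODULE Counter ----\nInit == x ∈ {0}\ndef next(x):\n"
print(flag_issues(clean))
print(flag_issues(noisy))
```

A filter like this cannot establish correctness. It only catches recurring surface patterns early, which is cheaper than invoking the parser on output that is obviously malformed.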
What this means for developers
If you work on distributed systems, concurrency, or verification-heavy infrastructure, this paper is a reality check. LLMs may help draft or bootstrap a spec, but they are nowhere near a “trust the output” tool for TLA+. The gap between syntax and semantics is the whole story here: a model can look competent while still missing the property that actually matters.
That suggests a workflow where LLMs are assistants, not authors. They might help explore wording, generate a first pass, or support iterative editing, but the final spec still needs an expert and a checker. The paper’s results also imply that prompt engineering alone is not enough. Even the best prompting strategy in the study did not close the semantic gap.
For tool builders, the paper points toward two concrete directions mentioned by the authors: higher-quality datasets for specifications and grammar-constrained generation. Those are sensible targets because the errors they found are not random noise; they are recurring, structured failure modes. If you want an LLM to write formal specs, you probably need stronger constraints than plain text prompting.
There are also limitations to keep in mind. This summary draws on the paper's abstract and notes, which do not give a full breakdown of results by model family, and the authors do not claim that the curated dataset covers every kind of TLA+ specification. The evaluation is still valuable because it is systematic and reproducible, but it is not the final word on every formal-methods use case.
Why this paper stands out
What makes this work useful is that it measures the right thing. A lot of AI-for-code work stops at surface validity. This paper goes further and asks whether the generated spec actually matches the intended behavior under formal checking. For TLA+, that is the only question that really matters.
It also fills a gap in the literature. Prior GenAI work around TLA+ mostly focused on generating specs from code, constraining syntax, or using specs to guide code generation. This paper instead tests the harder and more direct task: plain-language-to-TLA+ synthesis. That makes it a useful baseline for future research and a caution flag for anyone hoping that general LLM progress has already solved formal specification.
The authors say they release the evaluation framework, code, dataset, models, and results to support reproducibility and future work. For engineers, that means this is not just a one-off critique; it is a starting point for building better systems and better benchmarks around formal-language generation.
Bottom line
Current LLMs can sometimes produce TLA+ that looks plausible, but they still fail too often at the semantic level to be trusted without expert review. If your team wants to use AI for formal methods, this paper says the safest path is augmentation, not automation.
- LLMs can sometimes produce TLA+ that parses, but semantic correctness remains very low (8.6% at best).
- Progressive prompting was the only strategy that produced semantic successes, but it does not solve the core problem.
- Formal-spec generation needs constraints, better data, and human oversight.