Tag
1 article
A benchmark of 30 LLMs shows they rarely generate semantically correct TLA+ specs from natural language.