[RSCH] 7 min read · OraCore Editors

ReproRepo scales reproducibility audits with GitHub issues

ReproRepo uses GitHub issues to scale reproducibility audits for machine learning papers.

  • Research org: Unspecified in arXiv abstract
  • Core data: 1,149 recent machine learning papers
  • Breakthrough: Uses human-raised GitHub issues as supervision for reproducibility blockers

Reproducibility is one of those problems that sounds abstract until you have to debug a paper implementation and realize the missing detail is buried in a footnote, a repo issue, or not written down at all. This paper is trying to make that process more scalable by turning GitHub issues into a reusable signal for reproducibility audits.

For engineers, the interesting part is not just that the authors built another benchmark. It is that they try to ground evaluation in real-world failure modes already reported by humans, instead of relying entirely on hand-curated tasks that are expensive to maintain and hard to expand.

What problem this paper is trying to fix

The paper starts from a practical bottleneck: reproducibility evaluation is important, but existing benchmarks for LLM agents are costly to build because they depend on substantial manual curation and manual evaluation. That makes them hard to scale across many papers, repositories, and failure types.

In other words, the field has a measurement problem. If you want to know whether an LLM agent can help audit reproducibility, you need a way to test it against realistic blockers. But if every new benchmark requires a lot of human labor, the benchmark itself becomes the bottleneck.

ReproRepo is the authors’ answer to that scaling problem. Instead of inventing artificial tasks from scratch, they use human-raised GitHub issues as naturally occurring supervision for what actually blocked reproduction in the wild.

How ReproRepo works in plain English

The core idea is straightforward: pair papers with their released code repositories, then use the issues people filed in those repos as labels for reproducibility problems. Those issues become a source of supervision that reflects real debugging pain rather than synthetic test cases.
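
To make the pairing concrete, here is a minimal Python sketch under stated assumptions. The `Issue` and `AuditExample` types, their fields, and `build_examples` are hypothetical names for illustration; the abstract does not describe the actual data schema, and skipping repositories with no issues is an assumption of this sketch, not something the paper states.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Issue:
    # Hypothetical fields; the real schema is not given in the abstract.
    title: str
    body: str


@dataclass
class AuditExample:
    paper_id: str   # identifier for the paper (assumed)
    repo_url: str   # released code repository paired with the paper
    issues: List[Issue] = field(default_factory=list)  # human-raised issues used as labels


def build_examples(
    papers_to_repos: Dict[str, str],
    repo_issues: Dict[str, List[Issue]],
) -> List[AuditExample]:
    """Pair each paper with its repository and attach that repository's issues."""
    examples = []
    for paper_id, repo_url in papers_to_repos.items():
        issues = repo_issues.get(repo_url, [])
        if not issues:
            # Assumption: a repo with no filed issues offers no supervision, so skip it.
            continue
        examples.append(AuditExample(paper_id, repo_url, list(issues)))
    return examples
```

Each resulting record holds what an agent would be given (the paper and the repository) next to what it would be scored against (the issues people actually filed).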

That matters because GitHub issues often capture the kind of friction that breaks reproduction attempts: missing instructions, ambiguous implementation details, dependency mismatches, or code paths that do not line up cleanly with the paper. The abstract does not enumerate specific issue categories, so these examples are illustrative rather than the paper's own taxonomy.

The framework is designed to be reusable. The authors say it can support future evaluations of LLM agents on real-world reproducibility auditing, which suggests they are aiming for a benchmark style that can be refreshed and extended without rebuilding everything by hand.

They instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. The abstract names only one of the four: Codex with GPT-5.5, the best-performing configuration.

What the paper actually shows

The headline result is encouraging but also specific about what the agents can and cannot do. The best agent in the study, Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for about 90% of the papers in the dataset.
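
A paper-level number like this can be read as a hit rate: a paper counts as covered if any agent finding matches any human-reported issue. The Python sketch below illustrates that bookkeeping. How the authors judge semantic relatedness is not given in the abstract, so `token_overlap` (a simple Jaccard word overlap) and the `threshold` value are placeholder assumptions, not the paper's method.

```python
from typing import Dict, List


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a crude stand-in for semantic matching."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def paper_level_hit_rate(
    findings_by_paper: Dict[str, List[str]],
    issues_by_paper: Dict[str, List[str]],
    threshold: float = 0.3,  # assumed cutoff, chosen only for illustration
) -> float:
    """Fraction of papers where at least one finding matches at least one human issue."""
    if not issues_by_paper:
        return 0.0
    hits = 0
    for paper_id, issues in issues_by_paper.items():
        findings = findings_by_paper.get(paper_id, [])
        # One related pair is enough for the paper to count as a hit.
        if any(token_overlap(f, i) >= threshold for f in findings for i in issues):
            hits += 1
    return hits / len(issues_by_paper)
```

Note that this kind of metric is lenient by design: a paper with ten reported blockers counts as a hit even if the agent surfaces only one of them.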

That is a useful signal for anyone building or using coding agents. It suggests these systems can often identify the right kind of reproduction problem even without executing code. In other words, the model can frequently find the place where things go wrong, or at least the neighborhood of the issue, from the paper-repository pair alone.

The paper also adds a more nuanced finding: agents are particularly good at surfacing visible failures and identifying the right semantic region, but they are still not reliable at exact localization. So the system may tell you “the problem is around this part of the repo or this part of the method,” while still missing the precise line, file, or root cause.
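
One way to picture the gap between finding the right region and exact localization is to compare file paths. The Python sketch below is purely illustrative: treating "same parent directory" as the semantic region is an assumption made here, and `localization_level` is a hypothetical helper, not a metric from the paper.

```python
from pathlib import PurePosixPath


def localization_level(predicted: str, actual: str) -> str:
    """Classify a predicted file path as an exact hit, a same-region hit, or a miss."""
    pred, act = PurePosixPath(predicted), PurePosixPath(actual)
    if pred == act:
        return "exact"
    # Assumption: sharing a non-root parent directory counts as the same region.
    if pred.parent == act.parent and pred.parent != PurePosixPath("."):
        return "region"
    return "miss"
```

Under this toy definition, pointing at `src/models/attention.py` when the blocker is in `src/models/encoder.py` is a region-level hit but not an exact one, which mirrors the pattern the paper describes.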

Importantly, the abstract does not report traditional benchmark numbers such as accuracy, F1, or pass rates; the only figure it gives is the roughly 90% rate for surfacing a semantically related blocker. It also does not claim that the agent executes code or fully reproduces results. That limits how far you should generalize the result.

Why this matters for developers

If you work on ML tooling, agentic debugging, or research infrastructure, ReproRepo points to a practical way to evaluate systems against messy real-world failures. A benchmark built from GitHub issues is closer to how teams actually debug research code than a clean synthetic task.

That makes the framework interesting for at least three groups: people building code agents, maintainers of research repos, and teams trying to measure reproducibility risk at scale. It gives them a way to ask whether a model can spot blockers before someone spends hours trying to rerun a broken experiment.

It also hints at a broader lesson for benchmark design: supervision does not always have to be hand-labeled from scratch if the ecosystem already produces useful signals. In this case, the signal is human issue reporting, which is both practical and grounded in real failure modes.

Limitations and open questions

The biggest limitation is also the main methodological tradeoff. GitHub issues are useful, but they are not a perfect proxy for reproducibility. Some blockers never get filed, some issues are vague, and some reported problems may not map cleanly to the exact reproduction barrier a benchmark wants to measure.

Another limitation is localization. The abstract says the agents fall short on exact localization, which means they can reveal that a problem exists without fully solving the debugging task. For developers, that is still valuable, but it is not the same as a system that can patch the repo or reproduce the result end to end.

Finally, the scope is narrow: the evaluation covers recent machine learning papers from major conferences, so it is not yet evidence about every research domain or every kind of software project. The abstract also does not say how the four frontier model-agent configurations compare beyond the best-agent headline.

Even with those limits, ReproRepo is a strong signal that reproducibility auditing can be measured more realistically. If the framework holds up, it could become a useful testbed for future agents that need to read papers, inspect repos, and reason about why a result is hard to reproduce.

Bottom line

ReproRepo is less about flashy automation and more about making a hard evaluation problem tractable. It shows that human GitHub issues can be turned into scalable supervision for reproducibility audits, and that frontier agents can already catch many real blockers without running code.

For practitioners, the takeaway is simple: if you are building AI systems for research debugging, the next useful benchmark may not be synthetic at all. It may be sitting in the issue tracker.