OPSD lets you turn user clicks into training signal
I break down OPSD into a copyable loop for turning implicit user feedback into targeted correction and continual training.

OPSD turns implicit user feedback into a continual correction loop for LLM training.
I've been staring at post-training loops for a while now, and the part that kept bothering me was how fake the feedback often is. You build a nice pipeline, collect thumbs-up and thumbs-down labels, maybe even write a reward model, and then act surprised when the model still misses what users actually wanted. In real products, the signal is usually messier. A user retries. They edit the answer. They accept one suggestion and ignore the next. They keep typing because the model almost got there, but not quite. That’s the part I care about, because that’s where the useful correction lives.
What pulled me into this specific idea was a Chinese post on Zhihu about On-Policy Self-Distillation, or OPSD, framed as an evolution from OPD and tied to industrial training practice. The post also points to DeepSeek and mentions Cursor’s training direction, which is the kind of real-world pressure that makes me pay attention. I’m not treating the post like a paper I can verify line by line here. I’m treating it like a practical prompt: if a model can learn from its own on-policy outputs plus implicit user feedback, then maybe I can stop waiting for perfect labels and start using the product itself as the teacher.
Stop treating feedback like a lab artifact
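Recently, DeepSeek V4’s multi-expert integration scheme adopted OPD (On-Policy Distillation), proving on an industrial-scale project that OPD has earned a place in post-training. Its more advanced version, OPSD (On-Policy Self-Distillation), is also in Cursor’s model training, at large scale…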
What this actually means is pretty simple: the model doesn’t just learn from a static dataset. It learns from what it just did, in the context where users actually reacted to it. OPD is already a step away from old-school offline distillation, because the student trains on its own samples while a teacher supplies the signal. OPSD, as I read it, pushes further by letting the model serve as its own teacher once it can see how the user reacted. That matters because the model is no longer being judged in a vacuum. It’s being corrected in the same distribution where it will keep operating.

I’ve run into this exact failure mode in product work. A model looks great in eval because the benchmark prompt is clean, but users don’t talk like benchmarks. They interrupt. They refine. They ask for a rewrite after the first answer. If I only train on curated gold answers, I miss the messy trail that shows where the model actually drifted. OPSD is basically saying: keep the trail, and use it.
How to apply it: log the model’s live outputs, the surrounding prompt state, and the user’s next action. Don’t just store final acceptance. Store edits, retries, abandonment, and follow-up prompts. Those are the breadcrumbs. If you’re building an internal assistant, this can be as plain as capturing conversation turns and the point where the user stopped correcting the model. If you’re building a coding agent, keep the diff between the first suggestion and the accepted one. That diff is often more valuable than a hand-written label.
- Store the on-policy response, not just the prompt.
- Store the user’s reaction in context, not as a detached rating.
- Prefer correction traces over binary approval when you can.
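To make the logging advice concrete, here is a minimal sketch in Python. The `InteractionRecord` class, its field names, and the one-JSON-object-per-line format are my own assumptions for illustration; the Zhihu post does not prescribe a schema.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class InteractionRecord:
    """One on-policy interaction: what the model said and what the user did next.

    All field names are illustrative assumptions, not a standard schema.
    """
    model_version: str
    task: str
    prompt_state: str                 # full context the model saw
    model_output: str                 # the on-policy response itself
    user_action: str                  # e.g. "accepted", "edited", "retried", "abandoned"
    user_edit: Optional[str] = None   # the user's corrected text, if any
    follow_up: Optional[str] = None   # the next prompt in the session, if any
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_jsonl_line(record: InteractionRecord) -> str:
    """Serialize a record as one JSON line, ready to append to a log file."""
    return json.dumps(asdict(record), ensure_ascii=False)


record = InteractionRecord(
    model_version="vX",
    task="code_assist",
    prompt_state="def add(a, b):",
    model_output="    return a - b",
    user_action="edited",
    user_edit="    return a + b",
)
line = to_jsonl_line(record)
```

The point of the structure is that the reaction travels with the response: the edit sits next to the output it corrected, instead of living in a separate ratings table.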
Implicit feedback is not silence, it is behavior
The post’s real bet is that implicit feedback can be more useful than explicit scoring, if you know how to read it. A user who rewrites your answer is telling you something. A user who accepts one code block but discards the explanation is telling you something else. A user who keeps asking the same question in a different way is basically handing you a failure cluster. That’s feedback. It’s just not wrapped in a neat label.
I like this framing because explicit labels are expensive and often biased. People click “good” when they mean “good enough.” They click “bad” when the answer was correct but annoying. I’ve seen teams overfit to those labels and end up optimizing for politeness instead of usefulness. Implicit feedback is uglier, but it’s closer to actual product value.
The catch is that you can’t treat every behavior as equal. A retry might mean the answer failed, or it might mean the user wanted a different tone. A copied code snippet might mean success, or it might mean the user copied only to inspect it. So the job is to define a small set of interpretable actions that map to training signals. That’s the part people skip, and then they wonder why the model learns junk.
How to apply it: define a feedback taxonomy before you train. I’d start with four buckets: accepted, edited, retried, abandoned. If you have richer product telemetry, add “copied,” “ran,” “executed successfully,” or “user reverted.” Then weight those events differently. Accepted unchanged is not the same as accepted after an edit. Edited is not the same as rejected. The model should see the difference.
- Accepted unchanged: strong positive signal.
- Accepted after edit: partial positive, with correction target.
- Retry or reformulation: negative signal on the previous output.
- Abandonment: weak negative, but useful at scale.
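The four buckets above can be sketched as a small enum plus a weight table. The numeric weights here are placeholders I made up to show the shape of the idea (sign for direction, magnitude for strength); they are not values from the post and would need tuning against your own product.

```python
from enum import Enum


class Feedback(Enum):
    ACCEPTED_UNCHANGED = "accepted_unchanged"
    ACCEPTED_AFTER_EDIT = "accepted_after_edit"
    RETRIED = "retried"
    ABANDONED = "abandoned"


# Placeholder weights: sign encodes direction, magnitude encodes strength.
# These numbers are assumptions to be tuned, not values from the post.
SIGNAL_WEIGHTS = {
    Feedback.ACCEPTED_UNCHANGED: 1.0,
    Feedback.ACCEPTED_AFTER_EDIT: 0.5,
    Feedback.RETRIED: -0.7,
    Feedback.ABANDONED: -0.2,
}


def classify(accepted: bool, edited: bool, retried: bool) -> Feedback:
    """Map raw product events to one bucket of the taxonomy."""
    if accepted:
        return Feedback.ACCEPTED_AFTER_EDIT if edited else Feedback.ACCEPTED_UNCHANGED
    if retried:
        return Feedback.RETRIED
    # Neither accepted nor retried: the user walked away.
    return Feedback.ABANDONED


def signal_weight(accepted: bool, edited: bool, retried: bool) -> float:
    """Look up the training weight for a set of raw events."""
    return SIGNAL_WEIGHTS[classify(accepted, edited, retried)]
```

Keeping the mapping in one table makes it easy to review: anyone on the team can read what each user action is assumed to mean before it touches training.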
On-policy matters because off-policy lies to you
On-policy sounds like jargon until you watch a model fail in production. Off-policy training uses data produced by something other than the current model: an older checkpoint, a different model, or human writers. That’s the problem. Once the model changes, the old data stops matching the new behavior. You get stale supervision and drift between what the model is learning and what users are actually seeing.

On-policy training tries to close that gap by training on the model’s current behavior. That sounds minor, but it changes the whole loop. The model generates an answer, users react, and that exact interaction becomes the next teaching example. I’ve always found this more honest than pretending yesterday’s dataset still describes today’s model. It usually doesn’t.
In my own work, the mismatch showed up as “mysterious regressions.” A model improved on one class of prompts and got worse on another because the training set never saw the new style of interaction. On-policy data catches that faster. It’s not magic. It just keeps the training distribution closer to the live distribution, which is where the pain actually is.
How to apply it: run your post-training loop on fresh model outputs from the current checkpoint. Don’t keep recycling old responses forever. Build a rolling window of live interactions, then sample from the most recent behavior. If you’re doing preference training, pair each on-policy response with the user’s correction or with a better alternative generated later in the same session.
One practical rule I use: if the model version changes, the training mix should change too. Otherwise you’re teaching yesterday’s habits to today’s model and calling it improvement.
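Here is one way the rolling window and the version rule could look in code. `RollingWindow` is a hypothetical helper of my own, and it assumes each logged example is tagged with the checkpoint that produced it.

```python
import random
from collections import deque


class RollingWindow:
    """Keep only the most recent interactions and sample from the live checkpoint."""

    def __init__(self, max_size: int):
        # A deque with maxlen silently drops the oldest item when full.
        self._items = deque(maxlen=max_size)

    def add(self, model_version: str, example: dict) -> None:
        self._items.append((model_version, example))

    def sample(self, current_version: str, k: int, seed=None) -> list:
        """Sample up to k examples produced by the current checkpoint only."""
        fresh = [ex for version, ex in self._items if version == current_version]
        rng = random.Random(seed)
        return rng.sample(fresh, min(k, len(fresh)))


window = RollingWindow(max_size=3)
window.add("v1", {"id": 1})
window.add("v1", {"id": 2})
window.add("v2", {"id": 3})
window.add("v2", {"id": 4})  # the window is full, so {"id": 1} is dropped
batch = window.sample(current_version="v2", k=10, seed=0)
```

Filtering by version at sample time is the strict reading of the rule. A softer variant would down-weight older checkpoints instead of excluding them, which may be preferable when fresh data is scarce.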
Self-distillation is just the model learning from its own better draft
Self-distillation is the part that makes this feel less like a standard feedback loop and more like a writing process. The model produces a draft, then a later pass uses that draft, plus the user’s behavior, to produce a better version. The “self” part is important because the model is not being reset by an external oracle every time. It is improving from its own trajectory.
I’ve seen this pattern work surprisingly well in code generation. The first answer is often structurally right but locally sloppy. A second pass, informed by the exact places the user hesitated, can clean up the weak spots without rewriting the whole solution. That is the appeal of self-distillation: keep the good bones, fix the bad joints.
What this actually means is you can train the model to imitate a better version of itself under the same operating conditions. The user feedback tells you where the draft failed. The distillation step turns that into a target. That target can be another model’s output, a revised completion, or a corrected segment. The important thing is that the correction stays tied to the original on-policy context.
How to apply it: generate a candidate answer, then generate a revised answer after observing the user’s reaction. Use the revised answer as the teacher target for the original prompt state. If you have a coding assistant, the target might be the accepted diff rather than a full rewritten file. If you have a chat assistant, the target might be the revised paragraph that fixed the user’s complaint.
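As a rough sketch of that step, the function below pairs the original context with a better target, using a diff for code tasks and the full revised text otherwise. The function name, the `"code_assist"` task label, and the choice of unified diff as the target format are my assumptions, not something the post specifies.

```python
import difflib


def build_distillation_example(task: str, prompt_state: str, draft: str, revised: str) -> dict:
    """Pair the original on-policy context with a better target.

    For code tasks the target is a unified diff from draft to revised;
    for everything else it is the full revised text.
    """
    if task == "code_assist":
        diff_lines = difflib.unified_diff(
            draft.splitlines(),
            revised.splitlines(),
            fromfile="draft",
            tofile="revised",
            lineterm="",
        )
        target = "\n".join(diff_lines)
    else:
        target = revised
    return {
        # The input keeps the draft so the correction stays tied to its context.
        "input": f"{prompt_state}\n{draft}",
        "target": target,
    }


example = build_distillation_example(
    task="code_assist",
    prompt_state="def add(a, b):",
    draft="    return a - b",
    revised="    return a + b",
)
```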
This is where I’d be careful. Self-distillation can quietly amplify your own mistakes if the revision step is weak. So I’d keep a human audit lane for a sample of corrections. Not for every example. Just enough to catch the model teaching itself nonsense.
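One cheap way to run that audit lane is to route a stable fraction of corrections to human review. The sketch below uses hash-based sampling; the function name and the 5% default rate are my own placeholders.

```python
import hashlib


def needs_human_audit(example_id: str, audit_rate: float = 0.05) -> bool:
    """Route a stable fraction of corrections to human review.

    Hashing the ID (instead of calling random) means the same example
    always gets the same decision, even across reruns.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Map the first 4 bytes to a float in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < audit_rate


ids = [f"correction-{i}" for i in range(10_000)]
audit_queue = [i for i in ids if needs_human_audit(i)]
```

Deterministic sampling also makes audits reproducible: if a reviewer flags an example, rerunning the pipeline will send the same example to the queue again.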
The loop is more important than the label
People get obsessed with the training objective, and I get it. Loss functions are tidy. Pipelines are tidy. But the real value here is the loop: observe, react, correct, retrain, repeat. OPSD is not a one-off trick. It is a way to make the product itself feed the model in a structured way.
The Zhihu post connects this to industrial practice, and that’s exactly where the idea matters. In a real system, you don’t have the luxury of perfect annotation cycles. You have live users, changing behavior, and a model that needs to keep up. A loop built around implicit feedback lets you keep learning without freezing the product for a labeling sprint every time the model drifts.
I’ve personally found that the best loops are boring. They don’t need heroic infrastructure. They need disciplined logging, a clear feedback schema, and retraining triggers that don’t fire on noise. If you can’t explain what a user action means, don’t train on it yet. If you can explain it, start small and measure whether the correction actually changes the next interaction.
How to apply it: set a retraining cadence based on signal volume, not the calendar. For example, retrain when you collect enough accepted edits or repeated retries in a specific task category. Keep a holdout set of recent live interactions. If the new checkpoint needs fewer corrections on that holdout but its completion quality drops, you’ve overfit the feedback. That’s the kind of failure you want to catch early.
- Use a rolling window of recent interactions.
- Retrain on the cases where the user behavior is interpretable.
- Measure post-correction success, not just offline loss.
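The trigger and the overfitting check can both be small functions. This is a sketch under assumptions: the action names, the `min_signal` threshold, the metric names, and the tolerance are all placeholders I chose, not numbers from the post.

```python
from collections import Counter

# Which actions count as interpretable training signal (an assumption).
USABLE_ACTIONS = {"accepted_after_edit", "retried"}


def slices_ready_for_retraining(events, min_signal: int = 500) -> list:
    """Return task categories with enough fresh, interpretable signal.

    `events` is an iterable of (task_category, user_action) tuples.
    """
    counts = Counter(task for task, action in events if action in USABLE_ACTIONS)
    return sorted(task for task, n in counts.items() if n >= min_signal)


def should_deploy(old: dict, new: dict, tolerance: float = 0.01) -> bool:
    """Ship only if users retry less and task completion does not regress.

    `old` and `new` hold metrics measured on the same holdout of recent
    live interactions: "retry_rate" (lower is better) and
    "completion_rate" (higher is better).
    """
    fewer_retries = new["retry_rate"] < old["retry_rate"]
    quality_held = new["completion_rate"] >= old["completion_rate"] - tolerance
    return fewer_retries and quality_held


events = (
    [("code_assist", "retried")] * 3
    + [("chat", "abandoned")] * 5
    + [("chat", "accepted_after_edit")]
)
ready = slices_ready_for_retraining(events, min_signal=3)
```

In this toy run only `code_assist` crosses the threshold. The five abandoned chat sessions do not count, because abandonment is not in the set of actions treated as interpretable here.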
Industrial use means ugly constraints, not clean theory
The reason I take this seriously is not the acronym. It’s the industrial constraint behind it. Systems like DeepSeek and Cursor live under real latency, cost, and quality pressure. That means the training method has to survive messy deployment reality. OPSD is appealing because it accepts that reality instead of pretending the world is a benchmark suite.
In practice, that means your pipeline has to answer uncomfortable questions. What counts as a user correction? How do you avoid training on accidental behavior? How do you keep private data out of the loop? How do you stop a model from overcorrecting toward the loudest users? Those questions are annoying, but they’re the actual work.
How to apply it: add guardrails before you scale. Filter sensitive content, deduplicate repetitive sessions, and segment feedback by task type. Don’t mix code completion signals with customer-support chat signals unless you really know why. Different tasks produce different kinds of implicit feedback, and one universal trainer will usually smear them together in a bad way.
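A minimal version of those guardrails might look like the sketch below. The email regex is a crude stand-in for a real PII detector, and the record fields and task names are illustrative assumptions.

```python
import re
from collections import defaultdict

# Crude stand-in for a real PII detector: flags anything that looks like an email.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def apply_guardrails(records: list) -> dict:
    """Filter, deduplicate, and segment records before they reach training.

    Each record is a dict with "session_id", "task", "prompt", and "draft".
    Returns {task: [records]} so each task type can be trained separately.
    """
    seen = set()
    by_task = defaultdict(list)
    for rec in records:
        text = f'{rec["prompt"]} {rec["draft"]}'
        if EMAIL_PATTERN.search(text):
            continue  # drop potentially sensitive content
        key = (rec["task"], rec["prompt"], rec["draft"])
        if key in seen:
            continue  # drop exact repeats of the same request and response
        seen.add(key)
        by_task[rec["task"]].append(rec)
    return dict(by_task)


records = [
    {"session_id": "s1", "task": "code_assist", "prompt": "sort a list", "draft": "sorted(xs)"},
    {"session_id": "s2", "task": "code_assist", "prompt": "sort a list", "draft": "sorted(xs)"},
    {"session_id": "s3", "task": "support_chat", "prompt": "reach me at jo@example.com", "draft": "Sure."},
    {"session_id": "s4", "task": "support_chat", "prompt": "reset my password", "draft": "Open Settings."},
]
clean = apply_guardrails(records)
```

Returning one bucket per task keeps code completion and support chat signals apart by default, so mixing them has to be a deliberate choice.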
If I were implementing this from scratch, I’d start with one narrow workflow, one feedback taxonomy, and one retraining trigger. Once that loop improves a real metric, then I’d widen it. Not before.
The template you can copy
# OPSD feedback loop template
## 1. Capture on-policy interactions
For every model response, log:
- prompt state
- model output
- model version
- timestamp
- user action after output
- follow-up prompt or edit
- task category
## 2. Normalize implicit feedback
Map product actions to training signals:
- accepted unchanged -> positive
- accepted after edit -> positive with correction target
- retry / reformulation -> negative on prior output
- abandonment -> weak negative
- copied / executed / saved -> task-specific positive
## 3. Build correction pairs
For each session, create:
- input: prompt state + model draft
- target: user-edited answer, accepted diff, or revised completion
- weight: based on feedback strength
## 4. Train on current behavior
Use recent model outputs from the active checkpoint.
Prefer rolling windows over stale archives.
## 5. Distill the better draft
Generate a revised answer from the same context.
Use the revised version as the teacher target for the original draft.
## 6. Filter and audit
Before training, remove:
- private data
- ambiguous feedback
- duplicate sessions
- low-confidence corrections
Audit a sample of examples by hand.
## 7. Measure what matters
Track:
- correction rate
- retry rate
- acceptance after edit
- task completion success
- regression on recent live prompts
## 8. Retrain and redeploy
Retrain when you have enough fresh signal in one task slice.
Deploy only if the new checkpoint improves live correction metrics without hurting baseline quality.
## Minimal data schema
{
  "model_version": "vX",
  "task": "code_assist",
  "prompt": "...",
  "draft": "...",
  "user_action": "edited",
  "user_edit": "...",
  "follow_up": "...",
  "signal_weight": 0.8,
  "created_at": "2026-06-24T00:00:00Z"
}
## Minimal training pair
input: prompt + draft + context
target: accepted edit or revised completion
weight: signal_weight
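To show how the schema and the training pair connect, here is a small Python sketch that turns one logged record into a weighted pair. The function name is hypothetical, and two choices are my own assumptions: records without a `user_edit` are skipped, and a missing `signal_weight` defaults to 1.0.

```python
import json
from typing import Optional


def record_to_training_pair(raw: str) -> Optional[dict]:
    """Turn one logged record (a JSON string) into a weighted training pair.

    Returns None when the record has no usable correction target.
    """
    rec = json.loads(raw)
    target = rec.get("user_edit")
    if not target:
        return None  # nothing to imitate, so skip instead of guessing a target
    return {
        "input": f'{rec["prompt"]}\n{rec["draft"]}',
        "target": target,
        # Defaulting to 1.0 is an assumption; pick what suits your loss.
        "weight": float(rec.get("signal_weight", 1.0)),
    }


raw_record = json.dumps({
    "model_version": "vX",
    "task": "code_assist",
    "prompt": "def add(a, b):",
    "draft": "    return a - b",
    "user_action": "edited",
    "user_edit": "    return a + b",
    "signal_weight": 0.8,
})
pair = record_to_training_pair(raw_record)
```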
What I like about this template is that it’s boring enough to ship. It doesn’t require mystical annotation. It just forces you to respect the actual user trail. That’s the whole point of OPSD as I read it from the Zhihu post: use the model’s own live behavior, plus the user’s implicit correction, to keep the system learning in the direction real people are pushing it.
Source attribution: the core idea here comes from the original Zhihu post. My breakdown, examples, and template are my own synthesis for builders who want to implement the loop rather than just name it.