Prompt injection is now an AI security problem

OraCore Editors

[RSCH] June 29, 20267 min readOraCore Editors

Prompt injection is now an AI security problem

Prompt injection lets hidden text steer LLMs, and recent tests show models like DeepSeek-R1 can be tricked at worrying rates.

ChatGPT AI safety prompt injection

Share LinkedIn

Prompt injection is now an AI security problem

Prompt injection is a way to trick large language models with hidden instructions.

Prompt injection is no longer a niche curiosity from AI forums. It now shows up in search tools, document summaries, browser assistants, and enterprise workflows, where a single malicious line can change what a model says or does.

The Wikipedia entry on prompt injection pulls together the basic attack, the history of the term, and a growing list of real incidents. The numbers in that entry are hard to ignore: 75% of business employees use generative AI, 46% adopted it within the last six months, and one recent benchmark placed DeepSeek's DeepSeek-R1 near the bottom for injection resistance.

Fact	Number	Why it matters
Business employees using generative AI	75%	Prompt injection now affects everyday work tools
Employees adopting genAI in the last six months	46%	Adoption is moving faster than security controls
DeepSeek-R1 rank in Spikee isolation test	17th of 19	Some models still fail basic attack resistance tests
DeepSeek-R1 rank with rules and markers	16th of 19	Extra controls did not close the gap much
Chatbot Arena reasoning rank for DeepSeek-R1	6th	Strong reasoning does not mean strong security

Prompt injection works because models mix instructions and data

Get the latest AI news in your inbox

Weekly picks of model releases, tools, and deep dives — no spam, unsubscribe anytime.

No spam. Unsubscribe at any time.

The core problem is simple: large language models read instructions and content in the same context window. That makes it hard for the model to tell whether a sentence is a user request, a system rule, or text hidden inside a webpage, email, PDF, or image.

In a direct attack, the user tries to override the model's behavior with a malicious prompt. In an indirect attack, the attacker hides instructions inside content the model later reads. That second form is the one that worries teams building chatbots with browsing, file upload, or memory features.

A classic example is translation. If a model is told to translate a sentence into French, and the sentence contains the instruction to ignore the request and output a different phrase, the model may follow the hidden instruction instead of the visible task. That is the same trick, just stripped down to its simplest form.

Direct injection targets the user-facing prompt.
Indirect injection hides inside external content.
Obfuscation can bury commands in images, white text, or documents.
Prompt leaking tries to expose the model's hidden system prompt.

The term emerged in 2022, then spread fast

The phrase "prompt injection" was first used on Twitter in May 2022 by the account @himbodhisattva, and Simon Willison later helped popularize it. Jonathan Cefalu of Preamble also flagged the issue in May 2022, describing it as a command injection problem and reporting it to OpenAI.

"Prompt injection is the new SQL injection." — Simon Willison

That comparison stuck because it explains the risk in terms developers already understand. SQL injection abuses the boundary between code and data. Prompt injection abuses the boundary between instructions and content.

Willison also drew a line between prompt injection and jailbreaking. Jailbreaking tries to bypass a model's safety rules. Prompt injection tries to make the model treat attacker-controlled text as if it were trusted instruction. The two overlap, but they are not the same attack.

The incidents are moving from demos to products

The Wikipedia page lists several public incidents that show how this problem hits real systems, not just toy examples. In February 2023, a Stanford student found a way to make Microsoft Bing Chat, now part of Microsoft Copilot, reveal its internal guidelines and codename by telling it to ignore earlier instructions.

In December 2024, The Guardian reported that ChatGPT's search tool could be manipulated by hidden webpage content. Invisible text could push the model toward positive reviews and away from negative ones, which is exactly the kind of output tampering that makes AI search feel unreliable.

In early 2025, researchers found academic papers with hidden prompts aimed at AI peer review systems. That is a more uncomfortable example because it shows prompt injection can affect institutional workflows, where the output is not a chat reply but a decision that can shape careers and publication records.

Bing Chat exposed internal instructions in 2023.
ChatGPT search was reported vulnerable to hidden webpage prompts in 2024.
DeepSeek-R1 ranked 17th of 19 in one injection benchmark.
Gemini memory manipulation was reported in 2025.

Benchmark numbers show the gap between skill and safety

The most interesting part of the Wikipedia entry is the contrast between reasoning performance and attack resistance. DeepSeek-R1 ranked sixth on the Chatbot Arena reasoning benchmark, which tells you it can produce strong answers on hard tasks. But WithSecure's Spikee benchmark found that it was much easier to attack than several other models.

That split matters because teams often buy models for answer quality and assume security will improve with scale. It does not work that way. A model can be good at math, coding, and reasoning while still being weak at separating trusted instructions from hostile text.

The same pattern shows up in other systems. Google rated the Gemini memory issue as low risk because it required user interaction and visible notifications, but researchers still warned that delayed tool invocation could let hidden instructions sit in memory and trigger later. That is a small detail with big consequences.

Here is the practical comparison developers should keep in mind:

Reasoning benchmarks measure task quality.
Injection benchmarks measure attack resistance.
Memory and browsing features expand the attack surface.
Extra rules help, but they rarely solve the whole problem.

Mitigation is a process, not a single filter

The mitigation section on Wikipedia points to data hygiene, guardrails, user training, system prompt design, and dual-LLM setups. Those ideas are useful, but they only work when teams treat prompt injection as an application security problem instead of a prompt-tuning problem.

For developers, the most useful habit is to assume that any external text can be hostile. That includes emails, web pages, uploaded documents, OCR output, and even structured notes that look harmless. If a model can read it, an attacker may be able to hide instructions in it.

That is why the best defense is usually layered. Filter what enters the model, isolate system instructions, minimize tool privileges, and verify any action that can change data or send messages. If a model can browse the web and also act on what it reads, the risk is much higher than in a plain chat box.

Prompt injection will keep showing up wherever LLMs meet real-world content. The next question for product teams is simple: which parts of your system trust text too much, and what happens when that text lies?

// Related Articles

Prompt injection is now an AI security problem

Prompt injection works because models mix instructions and data

Get the latest AI news in your inbox

The term emerged in 2022, then spread fast

The incidents are moving from demos to products

Benchmark numbers show the gap between skill and safety

Mitigation is a process, not a single filter

Google DeepMind turns science into tools

Measuring when LLM behavior actually переносится

Solver choice changes which Nash equilibrium wins

Proper positive-only learning gets a full characterization

DexCompose Reuses Dexterous Policies Across Tasks

HaWoR turns hand motion into MANO params