An updated look at where AI red teaming stands nine months after the original piece, and why the gap between what we test and what actually breaks has only gotten wider.

Last July, I wrote about a Stanford and Georgetown paper that argued AI red teaming was too narrowly focused on individual model vulnerabilities. The authors made the case for two levels of red teaming: micro-level (testing the model itself) and macro-level (testing the entire system lifecycle, including how humans and institutions interact with it). At the time, the argument felt slightly ahead of the conversation. Most organizations were still figuring out prompt injection, and "sociotechnical risk" sounded academic.

Nine months later, the landscape has shifted enough that the paper reads less like a forward-looking proposal and more like a description of problems we are actively failing to solve.

What Changed

Three things happened since July 2025 that make the original argument sharper.

Agentic AI went mainstream. The paper talked about risks that emerge from complex interactions between models, users, and environments. That was somewhat theoretical when most deployments were chatbots and content generators. It is no longer theoretical. Gartner projects that 40% of enterprise applications will embed task-specific AI agents by 2026, up from under 5% in 2025. These agents call APIs, access data stores, execute workflows, and make decisions across multi-step chains. The attack surface is not the model; it is the entire execution environment.

OWASP published the Top 10 for Agentic Applications. Released in late 2025 with input from over 100 practitioners, this framework classifies the risks the original paper was gesturing at. Agent Goal Hijacking (ASI01) and Tool Misuse (ASI02) sit at the top. These are not prompt-level vulnerabilities. They are system-level failures where an agent's mission gets redirected across many execution steps, or where legitimate tool access gets exploited in ways the developers never anticipated. The OWASP list makes it concrete: red teaming agents means testing systems that act, not components that respond.

The US policy environment shifted. Biden's Executive Order 14110 had required red teaming for high-risk AI models and tasked NIST with developing guidelines. Trump's Executive Order 14148 rescinded it in January 2025, replacing the safety-focused framework with one centered on deregulation and innovation leadership. The practical effect is that mandatory red teaming requirements for frontier models evaporated at the federal level. NIST continues its AI Risk Management Framework work and launched an AI Agent Standards Initiative in early 2026, but the regulatory pressure that was pushing companies toward structured red teaming has weakened considerably. Organizations that depended on regulatory mandates to justify their red teaming programs now need to justify them on their own terms.

Where the Original Argument Holds Up

The core thesis still stands: most red teaming focuses on model-level bugs while the serious risks are systemic. If anything, agentic AI has proven the point more aggressively than the authors probably expected.

The paper's recommendation to build multifunctional red teams (ML engineers, social scientists, domain experts, security practitioners) looks even more necessary now. You cannot red-team an agentic system by throwing adversarial prompts at it. You need people who understand the business process the agent is embedded in, the data flows it can access, the authorization boundaries it should respect, and the failure modes that emerge when it chains together a sequence of individually reasonable actions that produce an unreasonable outcome.

The call for continuous red teaming over one-time assessments has also aged well. The best practice emerging in 2026 is integrating adversarial testing into CI/CD pipelines so that model updates, prompt changes, or agent reconfigurations automatically trigger attack suites. Red teaming as a pre-launch checkbox is not viable when your system's behavior changes every time you update a prompt template.

Where the Original Piece Was Too Conservative

Looking back at what I wrote in July, I underestimated a few things.

First, the speed at which agents would move from experimental to production. The paper framed macro-level red teaming as something organizations should prepare for. In practice, many organizations deployed agentic systems before they had any red teaming program at all, let alone a macro-level one. The gap is not "we test models but not systems." For a lot of organizations, the gap is "we do not test."

Second, I did not spend enough time on the supply chain dimension. Agents consume tools, plugins, and external data sources. Each of those is a trust boundary. Indirect prompt injection, where malicious instructions arrive through untrusted external content rather than direct user input, showed up in 73% of production AI deployments in 2025. Multi-agent denial-of-service attacks succeeded in over 80% of tests in one ACL study. These are not edge cases. They are the default attack surface for any system that lets an LLM interact with external data.

Third, I glossed over the organizational incentive problem. The paper recommended transparency and external testing, and I nodded along without pushing hard enough on why that is so difficult. Internal red teams operate under institutional constraints. They may not test scenarios that would embarrass leadership or challenge product decisions. External red teaming and independent disclosure mechanisms are not nice-to-haves; they are structural necessities for surfacing the risks that internal teams are incentivized to overlook.

What Matters Now

If you are building or deploying AI systems in 2026, here is what I would emphasize differently than I did last July:

Test the system, not the model. This was the paper's thesis and it is now the operational reality. If your agent can read emails, query a database, and send Slack messages, your red team needs to test what happens when a poisoned email redirects the agent to exfiltrate the database contents via Slack. Model-level jailbreak testing does not find this.

Adopt the OWASP ASI framework. It did not exist when I wrote the original piece. It does now, and it gives red teams a structured taxonomy for agentic risks. Use it. It covers goal hijacking, tool misuse, delegated trust failures, memory poisoning, and the other failure modes that are specific to autonomous systems.

Do not wait for regulation. The federal regulatory environment for AI safety is weaker than it was a year ago. NIST is still doing good work, but mandatory requirements are not coming soon. If your red teaming program only exists because a regulation says it should, it will not survive the current policy climate. Build the program because the risk is real, not because someone told you to.

Red team continuously, not periodically. Wire adversarial testing into your release process. When a prompt changes, when a tool gets added, when an agent's scope expands, test it. The systems that break in production are the ones that changed since the last time anyone looked.

Bring in outsiders. Your internal team has blind spots shaped by the same incentive structures that built the system. Independent red teaming is not a luxury. It is how you find the things you are not looking for.

The Bigger Picture

The Stanford and Georgetown paper was right about the direction. The field needed to move from model-level adversarial testing to system-level resilience evaluation. Nine months later, the need is more urgent and the tools to address it are starting to materialize. But the gap between where most organizations are and where they need to be has widened, not narrowed.

AI red teaming in 2026 is not about finding clever jailbreaks. It is about understanding how autonomous systems fail when they interact with messy, adversarial, real-world environments, and building the organizational discipline to test for that continuously. The paper gave us the framework. The question now is whether the industry will actually use it.

Matt James is a Product Security practitioner focused on threat modeling, AI security, and building security programs that work in practice. He writes at odnd.com.

Reference:
Sharkey, L., Pasquinelli, M., Cheng, B., Dobbe, R., et al. Operationalizing Red Teaming for AI Systems. arXiv preprint arXiv:2507.05538. July 2025.

Nine months later, the landscape has shifted enough that the paper reads less like a forward-looking proposal and more like a description of problems we are actively failing to solve.

What Changed

Three things happened since July 2025 that make the original argument sharper.

Agentic AI went mainstream. The paper talked about risks that emerge from complex interactions between models, users, and environments. That was somewhat theoretical when most deployments were chatbots and content generators. It is no longer theoretical. Gartner projects that 40% of enterprise applications will embed task-specific AI agents by 2026, up from under 5% in 2025. These agents call APIs, access data stores, execute workflows, and make decisions across multi-step chains. The attack surface is not the model; it is the entire execution environment.

OWASP published the Top 10 for Agentic Applications. Released in late 2025 with input from over 100 practitioners, this framework classifies the risks the original paper was gesturing at. Agent Goal Hijacking (ASI01) and Tool Misuse (ASI02) sit at the top. These are not prompt-level vulnerabilities. They are system-level failures where an agent's mission gets redirected across many execution steps, or where legitimate tool access gets exploited in ways the developers never anticipated. The OWASP list makes it concrete: red teaming agents means testing systems that act, not components that respond.

The US policy environment shifted. Biden's Executive Order 14110 had required red teaming for high-risk AI models and tasked NIST with developing guidelines. Trump's Executive Order 14148 rescinded it in January 2025, replacing the safety-focused framework with one centered on deregulation and innovation leadership. The practical effect is that mandatory red teaming requirements for frontier models evaporated at the federal level. NIST continues its AI Risk Management Framework work and launched an AI Agent Standards Initiative in early 2026, but the regulatory pressure that was pushing companies toward structured red teaming has weakened considerably. Organizations that depended on regulatory mandates to justify their red teaming programs now need to justify them on their own terms.

Where the Original Argument Holds Up

The core thesis still stands: most red teaming focuses on model-level bugs while the serious risks are systemic. If anything, agentic AI has proven the point more aggressively than the authors probably expected.

The paper's recommendation to build multifunctional red teams (ML engineers, social scientists, domain experts, security practitioners) looks even more necessary now. You cannot red-team an agentic system by throwing adversarial prompts at it. You need people who understand the business process the agent is embedded in, the data flows it can access, the authorization boundaries it should respect, and the failure modes that emerge when it chains together a sequence of individually reasonable actions that produce an unreasonable outcome.

The call for continuous red teaming over one-time assessments has also aged well. The best practice emerging in 2026 is integrating adversarial testing into CI/CD pipelines so that model updates, prompt changes, or agent reconfigurations automatically trigger attack suites. Red teaming as a pre-launch checkbox is not viable when your system's behavior changes every time you update a prompt template.

Where the Original Piece Was Too Conservative

Looking back at what I wrote in July, I underestimated a few things.

First, the speed at which agents would move from experimental to production. The paper framed macro-level red teaming as something organizations should prepare for. In practice, many organizations deployed agentic systems before they had any red teaming program at all, let alone a macro-level one. The gap is not "we test models but not systems." For a lot of organizations, the gap is "we do not test."

Second, I did not spend enough time on the supply chain dimension. Agents consume tools, plugins, and external data sources. Each of those is a trust boundary. Indirect prompt injection, where malicious instructions arrive through untrusted external content rather than direct user input, showed up in 73% of production AI deployments in 2025. Multi-agent denial-of-service attacks succeeded in over 80% of tests in one ACL study. These are not edge cases. They are the default attack surface for any system that lets an LLM interact with external data.

Third, I glossed over the organizational incentive problem. The paper recommended transparency and external testing, and I nodded along without pushing hard enough on why that is so difficult. Internal red teams operate under institutional constraints. They may not test scenarios that would embarrass leadership or challenge product decisions. External red teaming and independent disclosure mechanisms are not nice-to-haves; they are structural necessities for surfacing the risks that internal teams are incentivized to overlook.

What Matters Now

If you are building or deploying AI systems in 2026, here is what I would emphasize differently than I did last July:

Test the system, not the model. This was the paper's thesis and it is now the operational reality. If your agent can read emails, query a database, and send Slack messages, your red team needs to test what happens when a poisoned email redirects the agent to exfiltrate the database contents via Slack. Model-level jailbreak testing does not find this.

Adopt the OWASP ASI framework. It did not exist when I wrote the original piece. It does now, and it gives red teams a structured taxonomy for agentic risks. Use it. It covers goal hijacking, tool misuse, delegated trust failures, memory poisoning, and the other failure modes that are specific to autonomous systems.

Do not wait for regulation. The federal regulatory environment for AI safety is weaker than it was a year ago. NIST is still doing good work, but mandatory requirements are not coming soon. If your red teaming program only exists because a regulation says it should, it will not survive the current policy climate. Build the program because the risk is real, not because someone told you to.

Red team continuously, not periodically. Wire adversarial testing into your release process. When a prompt changes, when a tool gets added, when an agent's scope expands, test it. The systems that break in production are the ones that changed since the last time anyone looked.

Bring in outsiders. Your internal team has blind spots shaped by the same incentive structures that built the system. Independent red teaming is not a luxury. It is how you find the things you are not looking for.

The Bigger Picture

The Stanford and Georgetown paper was right about the direction. The field needed to move from model-level adversarial testing to system-level resilience evaluation. Nine months later, the need is more urgent and the tools to address it are starting to materialize. But the gap between where most organizations are and where they need to be has widened, not narrowed.

AI red teaming in 2026 is not about finding clever jailbreaks. It is about understanding how autonomous systems fail when they interact with messy, adversarial, real-world environments, and building the organizational discipline to test for that continuously. The paper gave us the framework. The question now is whether the industry will actually use it.

Matt James is a Product Security practitioner focused on threat modeling, AI security, and building security programs that work in practice. He writes at odnd.com.

Reference:
Sharkey, L., Pasquinelli, M., Cheng, B., Dobbe, R., et al. Operationalizing Red Teaming for AI Systems. arXiv preprint arXiv:2507.05538. July 2025.

Keep Reading