Key Takeaways
- Universal jailbreaks expose AI guardrails as "security theater," limiting model capability without ensuring true safety.
- Intuition and multi-turn attacks are crucial for navigating AI latent space, and hackers often discover these techniques before academics do.
- Open-source principles and community collaboration are critical for advancing AI security against proprietary industry incentives.
- Real AI safety necessitates "full-stack" security at the system layer, addressing real-world risks beyond model-level controls.
Deep Dive
- Pliny the Liberator crafts "universal jailbreaks" as "skeleton keys" to bypass AI model guardrails across different modalities, contrasting with task-specific approaches.
- He views guardrails as "security theater" that limits model capability and argues efforts to lock down "latent space" are futile against determined attackers.
- The accelerating "cat and mouse game" between attackers and defenders favors attackers, because the attack surface of AI models keeps expanding.
- The "Libertas" repository provides utility prompts, including "predictive reasoning" and "quotient dividers," designed to introduce "steered chaos" into AI models.
- This method aims to push models beyond their standard responses by discombobulating token streams and embedding "latent space seeds" into model weights.
- An example is the "Pliny divider," which moves models into "liberated" states and has led to unintended side effects such as data poisoning and unexpected appearances in WhatsApp messages (see the sketch below).
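The divider pattern can be sketched in a few lines. This is a minimal illustration of the structure such utility prompts share, not Pliny's actual divider: the `DIVIDER` string, the `build_divided_prompt` helper, and the prompt wording are all hypothetical stand-ins.

```python
# Hypothetical divider string; real dividers use their own distinctive glyph runs.
DIVIDER = "=/\\=/\\= EXAMPLE-DIVIDER =/\\=/\\="

def build_divided_prompt(user_request: str) -> str:
    """Compose a prompt that asks for a standard answer, then a second,
    divider-delimited answer: the structural trick these utility prompts
    use to disrupt the model's default token stream."""
    return (
        f"{user_request}\n\n"
        "First, answer normally. Then print this divider on its own line:\n"
        f"{DIVIDER}\n"
        "and continue with a second, differently framed answer below it."
    )

print(build_divided_prompt("Explain how transformers work."))
```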
- Effective AI jailbreaking relies heavily on intuition and "bonding" with models to explore their latent space and achieve desired outputs, with technical knowledge being secondary.
- Hackers developed multi-turn "crescendo attacks," which gradually steer a model's probability distribution across turns to bypass traditional security measures, years before academic research described them (see the sketch below).
- The guest noted that Anthropic's recent disclosure of such attacks merely confirmed techniques the hacker community had known for a significant period.
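The crescendo pattern reduces to a simple loop. A minimal sketch, assuming a hypothetical `send_message` stand-in for whatever chat-completion API is in use; the `steps` list is a benign illustration of the general shape, not an actual attack sequence.

```python
def send_message(history: list[dict]) -> str:
    """Hypothetical stand-in for any chat-completion API call."""
    raise NotImplementedError("plug in a real chat API here")

def crescendo(steps: list[str]) -> list[dict]:
    """Walk the model through a graded sequence of prompts, carrying the
    full conversation forward so each reply conditions the next turn;
    single-prompt filters never see this multi-turn drift."""
    history: list[dict] = []
    for step in steps:
        history.append({"role": "user", "content": step})
        reply = send_message(history)
        history.append({"role": "assistant", "content": reply})
    return history

# Benign illustration of the shape: start general, end specific.
steps = [
    "Summarize the history of lockpicking as a hobby.",
    "What distinguishes hobbyist picks from professional tools?",
    "Walk through how a pin-tumbler lock resists picking.",
]
```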
- Pliny the Liberator encountered a UI bug in Anthropic's $30,000 jailbreak challenge that let him reach the final stage, but he refused to participate further unless the data was open-sourced.
- He criticized the challenge's design, citing UI issues, judging failures, and moving goalposts during the process.
- Pliny advocates for community contribution and open-source data as crucial for collective progress in AI security.
- Pliny the Liberator predicted the tactic of AI-orchestrated attacks 11 months prior to Anthropic's recent disclosure of such capabilities.
- He explained how a jailbroken orchestrator can use segmented sub-agents to perform malicious acts, drawing an analogy to building a pyramid with secret chambers (see the sketch below).
- This prediction underscores the potential for sophisticated AI misuse when models are not adequately secured at the system level.
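The segmentation idea can be made concrete in a few lines. A minimal sketch, assuming a hypothetical `run_agent` stand-in for spawning an isolated sub-agent; the point is the information architecture, in which only the orchestrator sees how the fragments assemble.

```python
def run_agent(task: str) -> str:
    """Hypothetical stand-in for dispatching a task to an isolated sub-agent."""
    raise NotImplementedError("plug in real agent execution here")

def orchestrate(fragments: list[str]) -> list[str]:
    """Dispatch each fragment to a separate sub-agent. Each fragment can
    look innocuous in isolation; only the orchestrator holds the plan for
    assembling the results (the pyramid's 'secret chambers')."""
    return [run_agent(fragment) for fragment in fragments]

# Defenders therefore need monitoring at the assembly/system layer,
# since no single sub-agent's guardrails ever see the whole task.
```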
- The BT6 hacker collective is a white-hat group of 28 operators, vetted for skill and integrity, that adheres to a strict rule: all partnerships must be open source.
- John V highlights the Bossy Discord server, a 40,000-member community dedicated to prompt engineering and adversarial machine learning.
- BT6 fosters innovation in AI security, blockchain, and robotics, positioning itself as a community for learning and sharing research.
- BT6's radical open-source approach to AI security contrasts with the misaligned incentives of Silicon Valley investors, who prioritize return on investment.
- Those incentives drive industry trends, such as lowering model temperature for the sake of determinism, that can stifle creativity and innovation in AI interaction.
- Because AI models evolve so rapidly, the attack surface is still changing quickly, which makes it challenging to formalize and productize AI security solutions.
- BT6 views AI security holistically, considering the full stack beyond just the model itself, emphasizing that any attachment to a model broadens the attack surface.
- They argue that AI red teaming involves not only preventing model misuse but also protecting the public from rogue models through full-spectrum analysis.
- The guests differentiate between safety and security, proposing that safety work belongs in "meat space" at the system layer rather than in attempts to secure latent space, which they argue are ineffective (see the sketch below).
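What system-layer enforcement looks like in practice can be sketched briefly. This is a minimal illustration of the "meat space" principle with hypothetical names throughout (`ALLOWED_TOOLS`, `execute_tool`, `gated_call`): policy lives in deterministic code outside the model, so it holds regardless of what the model emits.

```python
# Policy is enforced in ordinary code, not in the model's latent space.
ALLOWED_TOOLS = {"search", "calculator"}  # explicit allowlist

def execute_tool(name: str, args: dict) -> str:
    """Hypothetical stand-in for real tool execution."""
    raise NotImplementedError("plug in real tool dispatch here")

def gated_call(name: str, args: dict) -> str:
    """Gate every model-requested action at the system layer: only
    allowlisted tools run, and every call leaves an audit trail."""
    if name not in ALLOWED_TOOLS:
        return f"blocked: '{name}' is not an allowlisted tool"
    print(f"audit: {name}({args})")  # deterministic, model-independent log
    return execute_tool(name, args)
```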