Overview

* The Prompt Report represents a landmark systematic review of prompting techniques, created by a 30-person research team who analyzed thousands of papers and developed a formal taxonomy organizing techniques by problem-solving strategy rather than application.

* Effective few-shot prompting hinges on six critical design elements, with exemplar ordering and formatting being particularly crucial - randomizing example order is recommended while clustering similar examples should be avoided.

* Chain of Thought (CoT) prompting and its variations remain among the most effective techniques, while role-based prompting (pretending to be experts) and emotional appeals show limited effectiveness for accuracy-based tasks.

* The field is moving beyond simple "prompt engineering" toward more comprehensive "AI engineering" that combines prompting skills with coding, using tools like DSPy for optimization while managing computational costs.

* Security challenges in AI systems include distinct vulnerabilities like prompt injection and jailbreaking, with competitions like AcroPrompt and the upcoming HackerPrompt 2.0 ($500,000 prize) helping identify novel attack vectors and defense strategies.

Content

Introduction and Background

Podcast is Latent Space, hosted by Alessio and Swix, with guest Sander Schulhoff, author of the Prompt Report
Sander's AI journey began in high school after seeing a YouTube video, which led him to study deep reinforcement learning in college
He worked on several research projects including:

- A Diplomacy project with Professor Jordan Boydgraver - The MineRL (Minecraft Reinforcement Learning) competition

His first exposure to prompting came while working on a translation task using GPT-3

Major Projects and Achievements

Created learnprompting.org initially as an English class project in October 2022 (before ChatGPT)
Organized Hack-a-Prompt competition in May 2023, collecting 600,000 malicious prompts
Co-authored the Prompt Report, released approximately two months before this discussion:

- Led a 30-person research team from major tech companies and universities - Reviewed thousands of papers on prompting - Produced an 80-page comprehensive summary document - Received millions of views across social media platforms - Used by some companies for job interview assessments

Paper accepted at EMNLP (top NLP conference), selected as one of three best papers
Presented research to approximately 2,000 researchers

Research Methodology and Systematic Review

The team used the PRISMA methodology, a standard approach for comprehensive literature reviews
They employed AI to help screen and evaluate paper relevance, carefully testing AI's accuracy against human evaluation
Sander noted that many papers claim to be "systematic" without following proper systematic review techniques
The researchers discovered and reported AI-generated papers on arXiv, noting that the archive does not allow fully AI-generated papers without disclosure

Prompting Techniques Taxonomy

A key contribution of their paper was creating a formal taxonomy of prompting techniques
The taxonomy was organized by problem-solving strategy rather than by application or field
Major categories included:

- Generating thought/reasoning steps (e.g., chain of thought) - Ensemble approaches - Self-criticism approaches - Decomposition - Zero-shot and few-shot prompting

Techniques can be applied across different problems and some belong to multiple categories

Critical Analysis of Prompting Techniques

Sander expressed skepticism about certain prompting techniques, particularly for accuracy-based tasks:

- Role prompting (e.g., telling AI to act like a math professor) does not significantly improve performance on accuracy tasks - In a mini-study testing role prompts on MMLU, the "genius" math professor prompt performed poorly - Surprisingly, the "idiot" prompt sometimes outperformed the expert prompt

Emotion-based prompting techniques (like "I'll tip you $10" or dramatic threats) are likely overhyped
Sander's preferred prompting approaches include:

- Few-shot prompting - Chain of thought - Providing detailed problem information

Few-Shot Prompting Best Practices

Six key design considerations were identified for creating effective few-shot prompts:
Exemplar ordering is critically important:

- Can dramatically impact accuracy (potentially from 0% to 90%) - Randomizing example order is recommended - Clustering similar examples should be avoided

Formatting matters:

- Common input/output formats (e.g., Q: A:, input: output) work best - Formats most likely seen in training data are preferred - "Prompt mining" can identify common dataset formatting

Quantity and quality of examples:

- Multiple examples help prevent direct repetition - Pairing good/bad examples can be effective - There's a risk of examples "leaking" into output

Structure of exemplars matters more than their exact content
Incorrect labels in exemplars can slightly reduce performance, but models focus more on output structure
Zero-shot approach using generic templates can sometimes work better than few-shot

Chain of Thought and Reasoning Techniques

Chain of thought (CoT) prompting involves step-by-step reasoning
Multiple variations of CoT exist, including:

- Thread of Thought: More complex problem-solving approach - Uncertainty-routed CoT: Selects reasoning paths based on complexity - Autocot: Automated chain of thought generation

Sander created a custom prompting technique (autodicot) for a specialized dataset:

- Used GPT-4 to generate chains of thought - Validated generated reasoning by checking correctness - If initial reasoning was incorrect, requested a rewrite with opposite approach

Some models (like Sonnet 3.5) generate chain of thought reasoning naturally
Prompt engineering helps "shock" language models into specific reasoning frames
Tree of Thought was mentioned as a state-of-the-art decomposition approach

Decomposition and Problem-Solving Strategies

Simple decomposition strategies include:

- Breaking problems into sub-problems - Solving sub-problems individually - Potentially integrating API calls for complex problem-solving

The discussion distinguished between thought generation and decomposition:

- Thought generation: Writing intermediate reasoning steps - Decomposition: Breaking problems into sub-problems and solving individually

Ensembling and Self-Criticism Techniques

Ensembling involves generating multiple responses to the same prompt
There was debate about whether it's truly an "ensemble" method
Performance of this technique has dropped as models have improved
Variations include:

- Taking majority response - Having model analyze multiple reasoning paths for a more nuanced final answer

Self-criticism involves having the model critique its own initial response

Cost Considerations and Practical Implementation

Significant computational costs remain a concern in AI research
Example of accidentally incurring a $150 bill from GPT-4 overnight
GPT-4.0 costs $5 per million input tokens vs. GPT-4.0 Mini at $15
Cost-saving strategies include using cheaper models for drafting and iterative tasks
DSPy was highlighted as a useful Python library for prompt optimization

Prompt Engineering as a Profession

Sander argued that prompt engineering is a skill everyone should have, not necessarily a specialized job
He suggested that true AI work requires coding beyond just prompting
Introduced the concept of an "AI engineer" as more valuable than a pure "prompt engineer"
Recommended prompting platforms/tools: PromptLayer, Brain Trust, Prompt Foo, Human Loop
Noted OpenAI Playground as a consistently used tool

Security Challenges: Prompt Injection and Jailbreaking

Sander discussed the nuanced differences between terms:

- Prompt injection: Occurs when both developer and user inputs are present in a prompt - Jailbreaking: Involves only user and model interaction, with no developer instructions - He prefers "prompt hacking" as a catch-all term due to definitional inconsistencies

The AcroPrompt competition:

- Goal was to get ChatGPT to say exactly "I have been pwned" - Discovered a new attack technique called "context overflow" - Demonstrated the value of competitions for discovering novel techniques

Technical challenges included preventing additional punctuation or text around target phrases

Multimodal Prompting Challenges

Discussion of challenges in prompting across different modalities (text, video, audio)
Speakers discussed experiences with AI-generated music (Suno, Udio)
Video model prompting was noted as particularly challenging:

- Video models have more "axes of freedom" and are harder to control precisely - Creating specific animations with current video AI models is difficult

Structured output prompting is complex:

- Getting models to produce structured outputs like Likert scale scores is challenging - Models tend to generate most likely tokens rather than strictly following structured output instructions

Future Work: HackerPrompt 2.0

Fundraising for a $500,000 prize competition
Goals include:

- Creating a dataset exploring model vulnerabilities - Investigating potential AI-generated harms - Examining security risks with AI agents

Focus areas include misinformation generation, harassment potential, and agent security vulnerabilities
Planning to engage with major LLM companies
Expecting around 10,000 hackers to participate

Closing Thoughts

The hosts thanked Sander for participating in the podcast
They expressed appreciation for the diverse perspectives and experiences shared
The discussion concluded with anticipation for future developments with HackerPrompt 2.0

The Ultimate Guide to Prompting