a free lightning course this Sunday for those new to ">

Latent Space: The AI Engineer Podcast

The Ultimate Guide to Prompting

Overview

* The Prompt Report represents a landmark systematic review of prompting techniques, created by a 30-person research team who analyzed thousands of papers and developed a formal taxonomy organizing techniques by problem-solving strategy rather than application.

* Effective few-shot prompting hinges on six critical design elements, with exemplar ordering and formatting being particularly crucial - randomizing example order is recommended while clustering similar examples should be avoided.

* Chain of Thought (CoT) prompting and its variations remain among the most effective techniques, while role-based prompting (pretending to be experts) and emotional appeals show limited effectiveness for accuracy-based tasks.

* The field is moving beyond simple "prompt engineering" toward more comprehensive "AI engineering" that combines prompting skills with coding, using tools like DSPy for optimization while managing computational costs.

* Security challenges in AI systems include distinct vulnerabilities like prompt injection and jailbreaking, with competitions like AcroPrompt and the upcoming HackerPrompt 2.0 ($500,000 prize) helping identify novel attack vectors and defense strategies.

Content

Introduction and Background

  • Podcast is Latent Space, hosted by Alessio and Swix, with guest Sander Schulhoff, author of the Prompt Report
  • Sander's AI journey began in high school after seeing a YouTube video, which led him to study deep reinforcement learning in college
  • He worked on several research projects including:
- A Diplomacy project with Professor Jordan Boydgraver - The MineRL (Minecraft Reinforcement Learning) competition
  • His first exposure to prompting came while working on a translation task using GPT-3

Major Projects and Achievements

  • Created learnprompting.org initially as an English class project in October 2022 (before ChatGPT)
  • Organized Hack-a-Prompt competition in May 2023, collecting 600,000 malicious prompts
  • Co-authored the Prompt Report, released approximately two months before this discussion:
- Led a 30-person research team from major tech companies and universities - Reviewed thousands of papers on prompting - Produced an 80-page comprehensive summary document - Received millions of views across social media platforms - Used by some companies for job interview assessments
  • Paper accepted at EMNLP (top NLP conference), selected as one of three best papers
  • Presented research to approximately 2,000 researchers

Research Methodology and Systematic Review

  • The team used the PRISMA methodology, a standard approach for comprehensive literature reviews
  • They employed AI to help screen and evaluate paper relevance, carefully testing AI's accuracy against human evaluation
  • Sander noted that many papers claim to be "systematic" without following proper systematic review techniques
  • The researchers discovered and reported AI-generated papers on arXiv, noting that the archive does not allow fully AI-generated papers without disclosure

Prompting Techniques Taxonomy

  • A key contribution of their paper was creating a formal taxonomy of prompting techniques
  • The taxonomy was organized by problem-solving strategy rather than by application or field
  • Major categories included:
- Generating thought/reasoning steps (e.g., chain of thought) - Ensemble approaches - Self-criticism approaches - Decomposition - Zero-shot and few-shot prompting
  • Techniques can be applied across different problems and some belong to multiple categories

Critical Analysis of Prompting Techniques

  • Sander expressed skepticism about certain prompting techniques, particularly for accuracy-based tasks:
- Role prompting (e.g., telling AI to act like a math professor) does not significantly improve performance on accuracy tasks - In a mini-study testing role prompts on MMLU, the "genius" math professor prompt performed poorly - Surprisingly, the "idiot" prompt sometimes outperformed the expert prompt
  • Emotion-based prompting techniques (like "I'll tip you $10" or dramatic threats) are likely overhyped
  • Sander's preferred prompting approaches include:
- Few-shot prompting - Chain of thought - Providing detailed problem information

Few-Shot Prompting Best Practices

  • Six key design considerations were identified for creating effective few-shot prompts:
  • Exemplar ordering is critically important:
- Can dramatically impact accuracy (potentially from 0% to 90%) - Randomizing example order is recommended - Clustering similar examples should be avoided
  • Formatting matters:
- Common input/output formats (e.g., Q: A:, input: output) work best - Formats most likely seen in training data are preferred - "Prompt mining" can identify common dataset formatting
  • Quantity and quality of examples:
- Multiple examples help prevent direct repetition - Pairing good/bad examples can be effective - There's a risk of examples "leaking" into output
  • Structure of exemplars matters more than their exact content
  • Incorrect labels in exemplars can slightly reduce performance, but models focus more on output structure
  • Zero-shot approach using generic templates can sometimes work better than few-shot

Chain of Thought and Reasoning Techniques

  • Chain of thought (CoT) prompting involves step-by-step reasoning
  • Multiple variations of CoT exist, including:
- Thread of Thought: More complex problem-solving approach - Uncertainty-routed CoT: Selects reasoning paths based on complexity - Autocot: Automated chain of thought generation
  • Sander created a custom prompting technique (autodicot) for a specialized dataset:
- Used GPT-4 to generate chains of thought - Validated generated reasoning by checking correctness - If initial reasoning was incorrect, requested a rewrite with opposite approach
  • Some models (like Sonnet 3.5) generate chain of thought reasoning naturally
  • Prompt engineering helps "shock" language models into specific reasoning frames
  • Tree of Thought was mentioned as a state-of-the-art decomposition approach

Decomposition and Problem-Solving Strategies

  • Simple decomposition strategies include:
- Breaking problems into sub-problems - Solving sub-problems individually - Potentially integrating API calls for complex problem-solving
  • The discussion distinguished between thought generation and decomposition:
- Thought generation: Writing intermediate reasoning steps - Decomposition: Breaking problems into sub-problems and solving individually

Ensembling and Self-Criticism Techniques

  • Ensembling involves generating multiple responses to the same prompt
  • There was debate about whether it's truly an "ensemble" method
  • Performance of this technique has dropped as models have improved
  • Variations include:
- Taking majority response - Having model analyze multiple reasoning paths for a more nuanced final answer
  • Self-criticism involves having the model critique its own initial response

Cost Considerations and Practical Implementation

  • Significant computational costs remain a concern in AI research
  • Example of accidentally incurring a $150 bill from GPT-4 overnight
  • GPT-4.0 costs $5 per million input tokens vs. GPT-4.0 Mini at $15
  • Cost-saving strategies include using cheaper models for drafting and iterative tasks
  • DSPy was highlighted as a useful Python library for prompt optimization

Prompt Engineering as a Profession

  • Sander argued that prompt engineering is a skill everyone should have, not necessarily a specialized job
  • He suggested that true AI work requires coding beyond just prompting
  • Introduced the concept of an "AI engineer" as more valuable than a pure "prompt engineer"
  • Recommended prompting platforms/tools: PromptLayer, Brain Trust, Prompt Foo, Human Loop
  • Noted OpenAI Playground as a consistently used tool

Security Challenges: Prompt Injection and Jailbreaking

  • Sander discussed the nuanced differences between terms:
- Prompt injection: Occurs when both developer and user inputs are present in a prompt - Jailbreaking: Involves only user and model interaction, with no developer instructions - He prefers "prompt hacking" as a catch-all term due to definitional inconsistencies
  • The AcroPrompt competition:
- Goal was to get ChatGPT to say exactly "I have been pwned" - Discovered a new attack technique called "context overflow" - Demonstrated the value of competitions for discovering novel techniques
  • Technical challenges included preventing additional punctuation or text around target phrases

Multimodal Prompting Challenges

  • Discussion of challenges in prompting across different modalities (text, video, audio)
  • Speakers discussed experiences with AI-generated music (Suno, Udio)
  • Video model prompting was noted as particularly challenging:
- Video models have more "axes of freedom" and are harder to control precisely - Creating specific animations with current video AI models is difficult
  • Structured output prompting is complex:
- Getting models to produce structured outputs like Likert scale scores is challenging - Models tend to generate most likely tokens rather than strictly following structured output instructions

Future Work: HackerPrompt 2.0

  • Fundraising for a $500,000 prize competition
  • Goals include:
- Creating a dataset exploring model vulnerabilities - Investigating potential AI-generated harms - Examining security risks with AI agents
  • Focus areas include misinformation generation, harassment potential, and agent security vulnerabilities
  • Planning to engage with major LLM companies
  • Expecting around 10,000 hackers to participate

Closing Thoughts

  • The hosts thanked Sander for participating in the podcast
  • They expressed appreciation for the diverse perspectives and experiences shared
  • The discussion concluded with anticipation for future developments with HackerPrompt 2.0

More from Latent Space: The AI Engineer Podcast

Explore all episode briefs from this podcast

View All Episodes →

Listen smarter with PodBrief

Get AI-powered briefs for all your favorite podcasts, plus a daily feed that keeps you informed.

Download on the App Store