Key Takeaways
- OpenAI's post-training research prioritizes significant behavior change over marginal compute efficiency gains.
- The evolution of reinforcement learning in AI emphasizes data quality and signal trust, moving beyond PPO vs. DPO debates.
- Token efficiency is becoming a critical metric for advanced agent workflows, as demonstrated by GPT-5 to 5.1 improvements.
- A major bottleneck in AI advancement is the shortage of professionals skilled in both distributed systems and machine learning research.
Deep Dive
- OpenAI researcher Josh McGrath transitioned from pre-training data curation to post-training research.
- His focus shifted to models such as GPT-4o and GPT-5.
- The move reflected a preference for work that changes model behavior substantially (on the order of 40%) over work that yields 3% compute efficiency gains.
- User demand for specific model personalities is driving the development of personality toggles.
- Two archetypes discussed are 'Anton' (tool-like, no warmth) and 'Clippy' (friendly, helpful).
- The guest uses custom instructions to configure his model as a 'tool'.
- The post-training landscape has shifted focus to RLVR and agent-specific RL, emphasizing data quality and signal trust.
- RLHF and RLVR both typically optimize the policy with policy gradient methods; the real innovation lies in the quality of the reward signal, such as verifiable correctness rather than noisy human preference.
- GRPO, introduced in DeepSeekMath, is cited as an underappreciated method because its reward signals, derived from verifiable math answers, can be trusted.
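The two bullets above can be sketched in a few lines: a verifiable reward (exact-match on a final math answer) feeding GRPO-style group-relative advantages. This is a minimal illustration, not DeepSeek's actual implementation; the `####` answer delimiter and all function names are assumptions for the example.

```python
def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Reward 1.0 iff the completion's extracted final answer matches exactly.

    Assumes the answer follows a '####' delimiter, as in GSM8K-style data.
    """
    predicted = completion.split("####")[-1].strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes each sampled completion's reward against the mean and
    std of its own sampling group, removing the need for a learned critic."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: four sampled solutions to the same problem, one correct.
completions = ["... #### 42", "... #### 41", "... #### 42x", "... #### 7"]
rewards = [verifiable_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
```

Because correctness is checked mechanically rather than rated by a human, the reward is exact: the policy gradient never gets pushed toward a confidently wrong answer that merely looked preferable.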
- Token efficiency, rather than wall-clock time, is becoming a key metric for long-horizon tasks.
- The GPT-5 to GPT-5.1 upgrade improved evaluation scores while substantially reducing token usage.
- Better token efficiency enables agents to perform more tool calls and actions, improving user experience by reducing task completion time.
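The metric described above can be made concrete as tokens spent per solved task, rather than raw pass rate or wall-clock time. The helper below is a hypothetical sketch, and the run data is made up for illustration.

```python
def tokens_per_solved_task(runs: list[tuple[int, bool]]) -> float:
    """runs: (tokens_used, solved) pairs for one model on a task suite.

    Lower is better: the same token budget then buys more tool calls
    and more completed tasks.
    """
    total_tokens = sum(tokens for tokens, _ in runs)
    solved = sum(1 for _, ok in runs if ok)
    return total_tokens / solved if solved else float("inf")

# Illustrative numbers only: model B is more token-efficient even
# though both models spend tokens on every attempt.
model_a = [(12_000, True), (15_000, True), (18_000, False)]
model_b = [(6_000, True), (7_500, True), (9_000, True)]
```

Under this metric, a model that solves fewer tasks but wastes tokens on failures is penalized twice, which matches the user-facing experience of long-horizon agent work.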
- Long context research shows a 10x increase in effective context for GPT-4.1.
- Strategies like GraphWalks are being developed to improve context utilization for future model capabilities.
- Debate continues on whether context windows will keep growing indefinitely or whether agents with 'grep'-like retrieval will offer the better alternative.
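The 'grep'-like alternative mentioned above can be sketched as a simple tool: instead of stuffing an entire corpus into the prompt, the agent searches it and loads only the matching snippets. This is purely illustrative; the tool name, signature, and corpus are assumptions, not any particular agent framework's API.

```python
import re

def grep_tool(corpus: dict[str, str], pattern: str,
              context_chars: int = 80) -> list[tuple[str, str]]:
    """Return (filename, snippet) pairs around each regex match,
    so the agent pulls a few hundred characters instead of whole files."""
    hits = []
    for name, text in corpus.items():
        for m in re.finditer(pattern, text):
            start = max(0, m.start() - context_chars)
            end = min(len(text), m.end() + context_chars)
            hits.append((name, text[start:end]))
    return hits

corpus = {
    "auth.py": "def login(user): ...\ndef logout(user): ...",
    "db.py": "def connect(): ...",
}
# The agent asks for just the definitions it needs, keeping the prompt small.
matches = grep_tool(corpus, r"def log\w+")
```

The trade-off is the one the debate turns on: retrieval keeps prompts cheap regardless of corpus size, while a truly long context window lets the model reason over everything at once without deciding what to search for.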
- A significant hiring challenge exists in finding individuals proficient in both distributed systems and machine learning research.
- This hybrid skillset is deemed crucial for advancing the AI frontier.
- The education system is currently not producing enough people with this specific combination of expertise.