Detailed Summary
Introduction and Initial Reports (0:00 - 0:49)
The video begins by acknowledging widespread user reports that GPT-5 feels "dumber" lately, particularly GPT-5 Codex. This sentiment is not new: similar performance degradations were observed with other models like Claude Opus and Sonnet, where Anthropic initially downplayed the issues before admitting to problems. In contrast, OpenAI has taken these reports seriously, conducting an in-depth investigation and sharing its findings in a detailed document titled "Ghosts in the Codex machine."
Sponsor Segment: Depot (0:49 - 2:47)
The video includes a sponsored message for Depot, a service designed to accelerate Docker build times. Depot helped PostHog cut their build times by 55x, from 2.5 hours to under 3 minutes. Another company, Jane, reduced CI failure rates that had been running at 60%, and achieved 2.5x faster builds, a 25% throughput increase, and a 55% cost reduction by using Depot. The sponsor emphasizes that Depot provides reliable infrastructure, faster builds, developer happiness, and improved observability.
Identifying the Problem: GPT-5 Codex (2:47 - 4:55)
Initial reports of GPT-5 being "dumber" specifically targeted GPT-5 Codex, the model released on September 15th. The presenter notes experiencing this degradation, particularly in the Codex CLI. While some users found performance varied across workspaces, suggesting context and instructions could influence results, OpenAI's Codex team concluded there wasn't a single large issue but rather a combination of behavioral shifts and smaller, concrete problems. OpenAI is praised for its historical transparency and for provisioning older model versions for testing, contrasting with Anthropic's handling of similar regressions.
OpenAI's Investigation Plan (4:55 - 7:49)
OpenAI initiated a full-time investigation due to increasing public reports, despite initial internal metrics not showing immediate evidence of degradation. Their plan included:
- Upgrading the CLI feedback command with structured options (bug, good/bad result, other) and free-form text to link feedback to specific hardware and clusters.
- Increasing awareness of the /feedback command to boost feedback volume and identify anomalies.
- Reducing the surface area for issues by having all employees use the exact same setup as external traffic, a practice known as "dogfooding."
- Auditing infrastructure optimizations and feature flags to ensure consistency across employee and user experiences.
- Running more extensive evaluations and qualitative checks.
Initial actions included launching the new feedback system, moving internal usage to external setups, and reducing internal complexity by auditing and removing over 60 feature flags, with 80 more in process. The improved feedback mechanism proved valuable, allowing the team to triage over 100 issues daily. A dedicated "squad" was assembled to continuously generate and investigate hypotheses, operating without distractions.
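As a rough illustration of the structured feedback pipeline described above, the sketch below shows what a /feedback submission plus daily triage might look like. All field and function names here are hypothetical, not OpenAI's actual schema:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeedbackReport:
    # Structured category from the CLI: "bug", "good_result",
    # "bad_result", or "other" (names are illustrative).
    category: str
    comment: Optional[str] = None  # free-form text from the user
    # Metadata that lets a report be joined back to the serving path:
    session_id: str = ""
    cli_version: str = ""
    serving_cluster: str = ""
    hardware_type: str = ""

def triage(reports):
    """Bucket a day's reports by (category, cluster) for review."""
    buckets = defaultdict(list)
    for r in reports:
        buckets[(r.category, r.serving_cluster)].append(r)
    return buckets
```

Joining each report to its serving cluster and hardware is what makes the later per-hardware analysis possible at all.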
Findings and Fixes: Hardware and Load Balancing (7:49 - 9:40)
OpenAI's investigation revealed several specific issues:
- Hardware Differences: A predictive model analyzed relationships between feedback, user retention, and request features (model, CLI build, OS, time, serving cluster, hardware, user plan). Evals confirmed slight performance issues with older hardware, which was then removed from the fleet.
- Load Balancing: An opportunity was discovered in the load balancing strategy to reduce latency under load, with improvements rolling out.
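A toy version of the hardware analysis above: rather than OpenAI's actual predictive model, this is a minimal sketch that computes the bad-feedback rate per request feature and flags values that over-index (all names and the 1.5x threshold are assumptions):

```python
from collections import defaultdict

def bad_feedback_rate_by(requests, feature):
    """requests: dicts carrying the feature key and a boolean 'bad' flag.

    Returns the fraction of requests with bad feedback per feature value
    (e.g. per hardware type or serving cluster).
    """
    totals, bads = defaultdict(int), defaultdict(int)
    for r in requests:
        totals[r[feature]] += 1
        bads[r[feature]] += r["bad"]
    return {k: bads[k] / totals[k] for k in totals}

def flag_outliers(rates, threshold=1.5):
    """Flag feature values whose bad-feedback rate exceeds 1.5x the mean."""
    mean = sum(rates.values()) / len(rates)
    return [k for k, v in rates.items() if v > threshold * mean]
```

On real traffic this would feed into evals of the flagged hardware, which in OpenAI's case confirmed the regression before the hardware was pulled from the fleet.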
Findings and Fixes: Compaction Frequency (9:40 - 11:00)
Compaction, where the model summarizes conversations to avoid context limits, was identified as a source of degradation. The percentage of sessions using compaction increased, and the implementation could be improved. Evals confirmed performance degrades with more /compact or auto-compactions within a session. OpenAI landed improvements to prevent recursive summaries and added a warning to nudge users towards shorter, more targeted conversations. The presenter notes his own practice of starting new threads for most requests, highlighting a user behavior difference.
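The compaction fix can be illustrated with a minimal sketch, assuming a hypothetical summarize() helper. The key detail mirrored from the write-up is that earlier summaries are carried forward verbatim rather than re-summarized, and a warning nudges the user after repeated compactions (the threshold here is an assumption):

```python
MAX_COMPACTIONS_WARN = 2  # assumed threshold; the real limit is not stated

def compact(messages, summarize, compaction_count):
    """Summarize older turns to free context, keeping recent turns intact.

    Prior summaries are carried forward as-is instead of being
    re-summarized, avoiding recursive loss of detail.
    """
    head, recent = messages[:-4], messages[-4:]  # keep last 4 turns verbatim
    prior = [m for m in head if m["role"] == "summary"]
    older = [m for m in head if m["role"] != "summary"]
    out = prior[:]
    if older:
        out.append({"role": "summary", "content": summarize(older)})
    out += recent
    if compaction_count + 1 >= MAX_COMPACTIONS_WARN:
        print("note: consider starting a new, more targeted conversation")
    return out, compaction_count + 1
```

The recursion guard matters because a summary of a summary compounds information loss on every pass, which matches the degradation evals observed as compaction counts rose.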
Findings and Fixes: Apply Patch Tool (11:00 - 12:59)
GPT-5 Codex uses an apply patch tool, which takes a unified diff as input. Users reported issues where the model failed to apply edits, sometimes resorting to deleting and recreating files. This behavior, while technically correct in the limit, can cause problems if the agent is interrupted. OpenAI plans to improve future models to prevent this behavior and implement immediate mitigations for high-risk edit sequences. This highlights an interesting trend where model behaviors are increasingly defined by the tools they are intended to use, with tool performance influencing future model training.
Findings and Fixes: Timeouts and Constrained Sampling (12:59 - 15:29)
- Timeouts: Users reported longer task completion times. While OpenAI's internal metrics showed improving latency, specific feedback indicated the model was retrying tasks with escalating timeouts, making it inefficient. OpenAI is investing in training models to better handle long-running or interactive processes.
- Constrained Sampling: A bug was found in the implementation of constrained sampling (used for structured outputs like JSON), causing token sequences to become "out of distribution." This bug was linked to reports of the model switching languages mid-sentence, affecting less than 0.25% of sessions but still a significant number of users.
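Constrained sampling can be pictured as masking the model's next-token choices down to a grammar-allowed set before picking one; a bug in that mask forces the model onto tokens it rates as unlikely, which is how sequences drift out of distribution. This toy version uses a plain dict of logits (all names are illustrative):

```python
def constrained_sample(logits, allowed):
    """Greedy-pick the best-scoring token among those the grammar allows.

    If the allowed set is computed incorrectly, the model is forced onto
    low-probability tokens, one plausible route to artifacts like
    switching languages mid-sentence.
    """
    masked = {tok: score for tok, score in logits.items() if tok in allowed}
    if not masked:
        raise ValueError("constraint bug: grammar permits no token here")
    return max(masked, key=masked.get)
```

Real implementations apply the mask over the full vocabulary before softmax rather than filtering a dict, but the failure mode is the same.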
Findings and Fixes: Responses API and Further Investigations (15:29 - 20:33)
- Responses API: Codex uses the Responses API, which acts as a proxy between REST requests and the model's token stream. An investigation into this API involved reviewing over 100 PRs and comparing raw token values across versions. Two extra newline characters were found around the tool description section, but they were concluded not to affect model performance.
- CLI Versions and Web Search: Evals across CLI versions (0.40 to latest) showed the expected improvements from apply patch in 0.45, with otherwise equal performance and a 10% token usage reduction. Web search was confirmed not to contribute to regression, and it's suggested it should be on by default.
- Prompt Changes and Work Directory: Prompt changes over the last two months and errors setting the work directory were found not to contribute to regressions.
- Query Latencies: Analysis of end-to-end query latencies revealed lower than expected authentication cache rates, adding 50 milliseconds per request. This is being resolved, though the presenter questions the focus on such small latency improvements for tasks that can take minutes.
- Evolving Setup Sophistication: An increase in the complexity of user setups, with more MCP (Model Context Protocol) tools, was observed. OpenAI recommends minimalist setups and targeted conversations for best performance, noting that excessive context bloat can hinder the model. The need for better orchestration of tools is emphasized.
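The authentication-cache fix mentioned above can be sketched as a short-TTL cache in front of a slow credential check, so repeated requests skip the extra ~50 ms validation round trip (the class, the 300-second TTL, and the validate callback are all assumptions for illustration):

```python
import time

class AuthCache:
    """Cache the result of a slow credential validation for a short TTL."""

    def __init__(self, validate, ttl=300.0):
        self._validate = validate  # slow call, e.g. to an auth service
        self._ttl = ttl
        self._cache = {}           # token -> (is_valid, checked_at)

    def check(self, token):
        now = time.monotonic()
        hit = self._cache.get(token)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]          # cache hit: no validation latency
        ok = self._validate(token)
        self._cache[token] = (ok, now)
        return ok
```

Even a modest hit rate removes the per-request penalty, though as the presenter notes, 50 ms is marginal next to agent tasks that run for minutes.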
OpenAI expressed gratitude for user feedback, no matter how harsh, and confirmed they are establishing a permanently staffed team to obsess over Codex's real-world performance. They are actively recruiting for this team. Additionally, OpenAI reset Codex rate limits for all users and refunded all Codex credit usage up to a specific time due to a bug that overcharged for cloud tasks by 2 to 5x. This proactive approach to refunds is highlighted as a positive contrast to Anthropic's past actions. Cloud tasks consume limits faster and tend to involve more complex, one-shot code changes, leading to higher costs per message.
The video concludes by referencing a post suggesting that Codex is so good that users started attempting harder tasks, leading to a perception of degradation when it didn't perform as well. The presenter praises OpenAI's transparency and depth of reporting, calling it refreshing to see such focus on user-reported issues. He encourages viewers to share their thoughts on whether this is an overblown reaction or appropriate action from OpenAI.