Detailed Summary
Introduction and Initial Reports (0:00 - 0:49)
The video begins by acknowledging widespread user reports that GPT-5 feels "dumber" lately, particularly GPT-5 Codex. This sentiment is not new: similar performance degradations were observed with other models like Claude Opus and Sonnet, where Anthropic initially downplayed the issues before admitting to problems. In contrast, OpenAI has taken these reports seriously, conducting an in-depth investigation and sharing its findings in a detailed document titled "Ghosts in the Codex machine."
Sponsor Segment: Depot (0:49 - 2:47)
The video includes a sponsored message for Depot, a service designed to accelerate Docker build times. Depot helped PostHog cut their build times by 55x, from 2.5 hours to under 3 minutes. Another company, Jane, reduced CI failure rates that had been running at 60%, and achieved 2.5x faster builds, a 25% throughput increase, and a 55% cost reduction by using Depot. The sponsor emphasizes that Depot provides reliable infrastructure, faster builds, developer happiness, and improved observability.
Identifying the Problem: GPT-5 Codex (2:47 - 4:55)
Initial reports of GPT-5 being "dumber" specifically targeted GPT-5 Codex, the model released on September 15th. The presenter notes experiencing this degradation, particularly in the Codex CLI. While some users found performance varied across workspaces, suggesting context and instructions could influence results, OpenAI's Codex team concluded there wasn't a single large issue but rather a combination of behavioral shifts and smaller, concrete problems. OpenAI is praised for its historical transparency and for provisioning older model versions for testing, contrasting with Anthropic's handling of similar regressions.
OpenAI's Investigation Plan (4:55 - 7:49)
OpenAI initiated a full-time investigation due to increasing public reports, despite initial internal metrics not showing immediate evidence of degradation. Their plan included:
- Upgrading the CLI feedback command with structured options (bug, good/bad result, other) and free-form text to link feedback to specific hardware and clusters.
- Increasing awareness of the /feedback command to boost feedback volume and identify anomalies.
- Reducing the surface area for issues by having all employees use the exact same setup as external traffic, a practice known as "dogfooding."
- Auditing infrastructure optimizations and feature flags to ensure consistency across employee and user experiences.
- Running more extensive evaluations and qualitative checks.
Initial actions included launching the new feedback system, moving internal usage to external setups, and reducing internal complexity by auditing and removing over 60 feature flags, with 80 more in process. The improved feedback mechanism proved valuable, allowing the team to triage over 100 issues daily. A dedicated "squad" was assembled to continuously generate and investigate hypotheses, operating without distractions.
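As a rough illustration of the structured feedback pipeline described above, the sketch below shows what a /feedback submission plus daily triage might look like. All field and function names here are hypothetical, not OpenAI's actual schema:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeedbackReport:
    # Structured category from the CLI: "bug", "good_result",
    # "bad_result", or "other" (names are illustrative).
    category: str
    comment: Optional[str] = None  # free-form text from the user
    # Metadata that lets a report be joined back to the serving path:
    session_id: str = ""
    cli_version: str = ""
    serving_cluster: str = ""
    hardware_type: str = ""

def triage(reports):
    """Bucket a day's reports by (category, cluster) for review."""
    buckets = defaultdict(list)
    for r in reports:
        buckets[(r.category, r.serving_cluster)].append(r)
    return buckets
```

Joining each report to its serving cluster and hardware is what makes the later per-hardware analysis possible at all.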
Findings and Fixes: Hardware and Load Balancing (7:49 - 9:40)
OpenAI's investigation revealed several specific issues:
- Hardware Differences: A predictive model analyzed relationships between feedback, user retention, and request features (model, CLI build, OS, time, serving cluster, hardware, user plan). Evals confirmed slight performance issues with older hardware, which was then removed from the fleet.
- Load Balancing: An opportunity was discovered in the load balancing strategy to reduce latency under load, with improvements rolling out.
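A toy version of the hardware analysis above: rather than OpenAI's actual predictive model, this is a minimal sketch that computes the bad-feedback rate per request feature and flags values that over-index (all names and the 1.5x threshold are assumptions):

```python
from collections import defaultdict

def bad_feedback_rate_by(requests, feature):
    """requests: dicts carrying the feature key and a boolean 'bad' flag.

    Returns the fraction of requests with bad feedback per feature value
    (e.g. per hardware type or serving cluster).
    """
    totals, bads = defaultdict(int), defaultdict(int)
    for r in requests:
        totals[r[feature]] += 1
        bads[r[feature]] += r["bad"]
    return {k: bads[k] / totals[k] for k in totals}

def flag_outliers(rates, threshold=1.5):
    """Flag feature values whose bad-feedback rate exceeds 1.5x the mean."""
    mean = sum(rates.values()) / len(rates)
    return [k for k, v in rates.items() if v > threshold * mean]
```

On real traffic this would feed into evals of the flagged hardware, which in OpenAI's case confirmed the regression before the hardware was pulled from the fleet.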
Findings and Fixes: Compaction Frequency (9:40 - 11:00)
Compaction, where the model summarizes conversations to avoid context limits, was identified as a source of degradation. The percentage of sessions using compaction increased, and the implementation could be improved. Evals confirmed performance degrades with more /compact or auto-compactions within a session. OpenAI landed improvements to prevent recursive summaries and added a warning to nudge users towards shorter, more targeted conversations. The presenter notes his own practice of starting new threads for most requests, highlighting a user behavior difference.
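The compaction fix can be illustrated with a minimal sketch, assuming a hypothetical summarize() helper. The key detail mirrored from the write-up is that earlier summaries are carried forward verbatim rather than re-summarized, and a warning nudges the user after repeated compactions (the threshold here is an assumption):

```python
MAX_COMPACTIONS_WARN = 2  # assumed threshold; the real limit is not stated

def compact(messages, summarize, compaction_count):
    """Summarize older turns to free context, keeping recent turns intact.

    Prior summaries are carried forward as-is instead of being
    re-summarized, avoiding recursive loss of detail.
    """
    head, recent = messages[:-4], messages[-4:]  # keep last 4 turns verbatim
    prior = [m for m in head if m["role"] == "summary"]
    older = [m for m in head if m["role"] != "summary"]
    out = prior[:]
    if older:
        out.append({"role": "summary", "content": summarize(older)})
    out += recent
    if compaction_count + 1 >= MAX_COMPACTIONS_WARN:
        print("note: consider starting a new, more targeted conversation")
    return out, compaction_count + 1
```

The recursion guard matters because a summary of a summary compounds information loss on every pass, which matches the degradation evals observed as compaction counts rose.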
Findings and Fixes: Apply Patch Tool (11:00 - 12:59)
GPT-5 Codex uses an apply patch tool, which takes a unified diff as input. Users reported issues where the model failed to apply edits, sometimes resorting to deleting and recreating files. This behavior, while technically correct in the limit, can cause problems if the agent is interrupted. OpenAI plans to improve future models to prevent this behavior and implement immediate mitigations for high-risk edit sequences. This highlights an interesting trend where model behaviors are increasingly defined by the tools they are intended to use, with tool performance influencing future model training.
Findings and Fixes: Timeouts and Constrained Sampling (12:59 - 15:29)
- Timeouts: Users reported longer task completion times. While OpenAI's internal metrics showed improving latency, specific feedback indicated the model was retrying tasks with escalating timeouts, making it inefficient. OpenAI is investing in training models to better handle long-running or interactive processes.
- Constrained Sampling: A bug was found in the implementation of constrained sampling (used for structured outputs like JSON), causing token sequences to become "out of distribution." This bug was linked to reports of the model switching languages mid-sentence, affecting less than 0.25% of sessions but still a significant number of users.
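Constrained sampling can be pictured as masking the model's next-token choices down to a grammar-allowed set before picking one; a bug in that mask forces the model onto tokens it rates as unlikely, which is how sequences drift out of distribution. This toy version uses a plain dict of logits (all names are illustrative):

```python
def constrained_sample(logits, allowed):
    """Greedy-pick the best-scoring token among those the grammar allows.

    If the allowed set is computed incorrectly, the model is forced onto
    low-probability tokens, one plausible route to artifacts like
    switching languages mid-sentence.
    """
    masked = {tok: score for tok, score in logits.items() if tok in allowed}
    if not masked:
        raise ValueError("constraint bug: grammar permits no token here")
    return max(masked, key=masked.get)
```

Real implementations apply the mask over the full vocabulary before softmax rather than filtering a dict, but the failure mode is the same.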
Findings and Fixes: Responses API and Further Investigations (15:29 - 20:33)
- Responses API: Codex uses the Responses API, which acts as a proxy between REST requests and the model's token stream. An investigation into this API involved reviewing over 100 PRs and comparing raw token values across versions. Two extra newline characters were found around the tool description section, but they were concluded not to affect model performance.
- CLI Versions and Web Search: Evals across CLI versions (0.40 to latest) showed the expected improvements from apply patch in 0.45, with otherwise equal performance and a 10% token usage reduction. Web search was confirmed not to contribute to regression, and it's suggested it should be on by default.
- Prompt Changes and Work Directory: Prompt changes over the last two months and errors setting the work directory were found not to contribute to regressions.
- Query Latencies: Analysis of end-to-end query latencies revealed lower than expected authentication cache rates, adding 50 milliseconds per request. This is being resolved, though the presenter questions the focus on such small latency improvements for tasks that can take minutes.
- Evolving Setup Sophistication: An increase in the complexity of user setups, with more MCP (Model Context Protocol) tools, was observed. OpenAI recommends minimalist setups and targeted conversations for best performance, noting that excessive context bloat can hinder the model. The need for better orchestration of tools is emphasized.
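The authentication-cache fix mentioned above can be sketched as a short-TTL cache in front of a slow credential check, so repeated requests skip the extra ~50 ms validation round trip (the class, the 300-second TTL, and the validate callback are all assumptions for illustration):

```python
import time

class AuthCache:
    """Cache the result of a slow credential validation for a short TTL."""

    def __init__(self, validate, ttl=300.0):
        self._validate = validate  # slow call, e.g. to an auth service
        self._ttl = ttl
        self._cache = {}           # token -> (is_valid, checked_at)

    def check(self, token):
        now = time.monotonic()
        hit = self._cache.get(token)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]          # cache hit: no validation latency
        ok = self._validate(token)
        self._cache[token] = (ok, now)
        return ok
```

Even a modest hit rate removes the per-request penalty, though as the presenter notes, 50 ms is marginal next to agent tasks that run for minutes.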
OpenAI expressed gratitude for user feedback, no matter how harsh, and confirmed they are establishing a permanently staffed team to obsess over Codex's real-world performance. They are actively recruiting for this team. Additionally, OpenAI reset Codex rate limits for all users and refunded all Codex credit usage up to a specific time due to a bug that overcharged for cloud tasks by 2 to 5x. This proactive approach to refunds is highlighted as a positive contrast to Anthropic's past actions. Cloud tasks consume limits faster and tend to involve more complex, one-shot code changes, leading to higher costs per message.
The video concludes by referencing a post suggesting that Codex is so good that users started attempting harder tasks, leading to a perception of degradation when it didn't perform as well. The presenter praises OpenAI's transparency and depth of reporting, calling it refreshing to see such focus on user-reported issues. He encourages viewers to share their thoughts on whether this is an overblown reaction or appropriate action from OpenAI.