Detailed Summary
The video introduces a new lineup of agentic models, including GPT-5, GPT-5 Mini, GPT-5 Nano, Claude Opus 4.1, Sonnet, and Haiku, as well as on-device GPT-OSS models (20 billion and 120 billion parameters) running on an M4 Max MacBook Pro. The goal is a concrete comparison of these models across performance, speed, and cost, using Claude Code with Opus 4.1 as an LLM-as-a-judge to keep the evaluation fair. The presenter emphasizes moving beyond simple benchmarks to understand how these models perform on fundamental agentic coding tasks.
- New agentic models (GPT-5, Opus 4.1, GPT-OSS) are introduced for evaluation.
- On-device GPT-OSS models (20B, 120B) are highlighted as a significant development.
- Evaluation focuses on performance, speed, and cost, with Claude Code/Opus 4.1 acting as the judge.
- The objective is to assess agentic coding capabilities on real-world tasks.
- GPT-OSS models are noted for having zero cost due to local execution.
Fundamental Agentic Coding (1:55 - 6:12)
The presenter criticizes the typical regurgitation of benchmarks and stresses deep understanding and practical application of AI models. The core trend identified is agent architecture, where models chain together multiple tools to achieve engineering results. An initial, simple task ("What's the capital?") produces unexpected results, with Claude 3 Haiku outperforming larger models, demonstrating that raw model power doesn't always translate to optimal performance in simple, constrained scenarios. The evaluation system uses a higher-order prompt (HOP) that drives lower-order prompts (LOPs) within a multi-model evaluation system, where sub-agents interact with a Nano Agent MCP server.
- Critique of superficial model benchmarks and emphasis on practical application.
- Agent architecture is identified as the most important trend in AI.
- Agentic performance is defined by chaining tools for real engineering results, not single prompts.
- Initial tests show unexpected results, with smaller models sometimes outperforming larger ones on simple tasks.
- The evaluation uses a multi-model system with higher-order and lower-order prompts, leveraging a Nano Agent MCP server.
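The HOP/LOP fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the names and the `run_model` stub are hypothetical stand-ins for the real dispatch into the Nano Agent MCP server.

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    model: str
    output: str

def run_model(model: str, lop: str) -> AgentResult:
    # Stand-in: a real implementation would hand the lower-order prompt
    # to a sub-agent running against the Nano Agent MCP server.
    return AgentResult(model=model, output=f"[{model}] completed: {lop}")

def higher_order_eval(lop: str, models: list[str]) -> list[AgentResult]:
    """The higher-order prompt fans one lower-order prompt out to every
    model under test, so the judge can compare like-for-like results."""
    return [run_model(m, lop) for m in models]

results = higher_order_eval(
    "What's the capital of France?",
    ["gpt-5-nano", "gpt-5-mini", "gpt-oss:20b"],
)
print([r.model for r in results])
```

The point of the pattern is that the same LOP reaches every model unchanged, so any difference in the collected outputs reflects the models, not the prompts.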
Nano Agent for GPT-5 and GPT-OSS (6:12 - 19:13)
The Nano Agent codebase is explored, detailing its structure with directories for plans, application-specific agents, documentation, commands, and agents. The system uses a grading scheme (S through F) and a classic prompt format that instructs agents to use the Nano Agent MCP server and report results in a specific JSON format. Claude Code Opus 4.1 serves as the LLM-as-a-judge. The video demonstrates a "basic read test" where models are asked to extract the first and last 10 lines of a README file, testing instruction following and tool use. GPT-5 Nano and Mini perform well in this combined evaluation of performance, speed, and cost, while Opus 4.1, despite high performance, is penalized for its cost and speed. An "operations test" is then introduced, requiring models to read, extract unique hook names, create a JSON file, write another file, and list the directory, further pushing agentic capabilities. The local GPT-OSS models impress by performing these tasks on-device.
- The Nano Agent codebase structure is detailed, including HOPs and LOPs for prompt orchestration.
- A simple grading system (S-F) and JSON response format are used for evaluation.
- Claude Code Opus 4.1 functions as the LLM-as-a-judge.
- The "basic read test" evaluates instruction following and tool use, with GPT-5 Nano/Mini showing strong overall performance.
- The "file operations test" increases complexity, requiring reading, writing, and directory listing, with local GPT-OSS models demonstrating impressive on-device agentic coding.
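As a concrete reference for the basic read test, the ground truth for "first and last 10 lines of a README" can be computed directly. This is a hedged sketch of what a checker might look like, not the repo's actual grader:

```python
import tempfile
from pathlib import Path

def first_and_last_lines(path: str, n: int = 10) -> tuple[list[str], list[str]]:
    """Ground truth for the basic read test: the first and last n lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return lines[:n], lines[-n:]

# Demo against a synthetic README; a judge would compare a model's
# reported lines to this ground truth before assigning a grade.
with tempfile.TemporaryDirectory() as d:
    readme = Path(d) / "README.md"
    readme.write_text("\n".join(f"line {i}" for i in range(25)), encoding="utf-8")
    head, tail = first_and_last_lines(str(readme))
print(head[0], tail[-1])
```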
The presenter highlights the ability to atomize units of compute, allowing for individual sub-agents to be tested and scaled. This section demonstrates how to manually run a specific agent (e.g., Opus 4.1) with a precise prompt to debug or re-evaluate its performance, emphasizing the granularity and control offered by the Nano Agent architecture. The concept of composing powerful units of compute is central, leveraging Claude Code's agent architecture with tool-calling capabilities and sub-agents. The on-device GPT-OSS models are again praised for their ability to perform agentic coding locally, even for simple to moderate tasks, proving the viability of local LLMs for practical work.
- The architecture allows for atomized, testable, and scalable sub-agents.
- Demonstration of running individual agents with specific prompts for granular control and debugging.
- Emphasis on composable compute, combining Claude Code's architecture with powerful tool calls.
- Local GPT-OSS models are shown to handle simple to moderate agentic tasks on-device.
- The presenter notes the mind-blowing nature of on-device agentic coding.
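Re-running a single sub-agent in isolation, as demonstrated, amounts to one call with one model and one precise prompt. The function below is a hypothetical stand-in for that call (the name, parameters, and JSON shape are assumptions, not the actual MCP interface):

```python
import json

def prompt_nano_agent(agentic_prompt: str, model: str, provider: str) -> str:
    # Stand-in for the MCP server's single tool: a real call would run the
    # task autonomously and return the agent's final report as JSON.
    return json.dumps({"model": model, "provider": provider,
                       "task": agentic_prompt, "status": "ok"})

# One atomized unit of compute: re-runnable on its own for debugging
# or re-grading, independent of the rest of the evaluation.
report = json.loads(prompt_nano_agent(
    "Read README.md and report its first 10 lines",
    model="claude-opus-4-1",
    provider="anthropic",
))
print(report["model"])
```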
The Nano Agent MCP server's internal workings are explained. It's a simple server with a single tool: "execute an autonomous agent with natural language task." This tool interprets an agentic prompt and runs a specific sub-agent. The Nano Agent itself has a few core tools: read, write, list, get file, and edit file. The system is built using the OpenAI agent SDK, which provides scaffolding and a fair way to compare models by allowing configuration of different providers (Anthropic, Ollama) and endpoints. The presenter then demonstrates a "code engineering tasks" evaluation, where models must read a constants file, analyze its structure, create a new Python file with docstring analysis and a function, and create another file with enhanced constants. Sonnet 4 and GPT-OSS 120B perform exceptionally well in following these intricate, multi-step instructions, showcasing their agentic behavior. The presenter stresses that engineers need to experience these tools firsthand rather than relying solely on benchmarks.
- The Nano Agent MCP server has a single tool: "execute an autonomous agent with natural language task."
- Nano Agent's core tools include read, write, list, get file, and edit file.
- The system is built on the OpenAI agent SDK, enabling fair comparison across various model providers.
- A complex "code engineering tasks" evaluation tests multi-step agentic behavior.
- Sonnet 4 and GPT-OSS 120B demonstrate excellent instruction following in this task.
- The presenter advises against relying solely on benchmarks, urging hands-on experience with models.
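The five core tools attributed to the Nano Agent map naturally onto small filesystem operations. The sketch below shows plausible implementations; the signatures and names are assumptions for illustration, not the repo's actual API:

```python
import tempfile
from pathlib import Path

def read_file(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")

def write_file(path: str, content: str) -> None:
    Path(path).write_text(content, encoding="utf-8")

def list_dir(path: str = ".") -> list[str]:
    return sorted(p.name for p in Path(path).iterdir())

def get_file_info(path: str) -> dict:
    st = Path(path).stat()
    return {"size": st.st_size, "is_dir": Path(path).is_dir()}

def edit_file(path: str, old: str, new: str) -> None:
    # Simple search-and-replace edit; real agents usually patch more carefully.
    write_file(path, read_file(path).replace(old, new))

# Demo in the spirit of the "code engineering tasks" eval: read, edit, verify.
with tempfile.TemporaryDirectory() as d:
    f = str(Path(d) / "constants.py")
    write_file(f, "MAX = 1")
    edit_file(f, "1", "2")
    edited = read_file(f)
print(edited)
```

A tool surface this small is part of what makes the comparison fair: every model, from Opus 4.1 down to the on-device GPT-OSS models, gets the same handful of primitives to chain together.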
The video concludes by emphasizing that the current landscape offers more compute options than ever before. Understanding the fundamental principles of AI coding (context, model, prompt) is crucial for building effective evaluations, benchmarks, and agents. The presenter promotes their "Principled AI Coding" course and an upcoming "Agentic Coding" course, highlighting that agentic coding is a superset of AI coding. The key takeaway is the need to understand how to trade off performance, cost, and speed, as not every task requires the most expensive model like Opus 4. Cheaper alternatives like GPT-5, GPT-5 Mini, or even on-device models can be suitable depending on the specific engineering needs. The Nano Agent codebase will be made available to help engineers understand and build their own agentic systems, preparing them for the evolving AI landscape.
- The current AI landscape offers unprecedented compute options.
- Fundamental AI coding principles (context, model, prompt) are essential for building agents and evaluations.
- Promotion of "Principled AI Coding" and an upcoming "Agentic Coding" course.
- Agentic coding is presented as a superset of AI coding.
- The critical skill is understanding how to make trade-offs between performance, cost, and speed for different tasks.
- The Nano Agent codebase will be shared to facilitate learning and building agentic systems.
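The performance/cost/speed trade-off can be made mechanical with a simple router. The tiers below are hypothetical examples in the spirit of the takeaway, not recommendations from the video:

```python
# Hypothetical cost/performance router: not every task needs the most
# expensive model, so route by task complexity.
TIERS = {
    "simple":   "gpt-oss:20b",      # free, runs on-device
    "moderate": "gpt-5-mini",       # cheap API call
    "complex":  "claude-opus-4-1",  # pay for peak performance
}

def pick_model(complexity: str) -> str:
    # Default to the moderate tier when the complexity is unknown.
    return TIERS.get(complexity, TIERS["moderate"])

print(pick_model("simple"))
```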