Detailed Summary
The speaker introduces the topic of the LLM landscape in 2025, aiming to discuss major LLMs, emerging alternatives, and personal thoughts on these developments. The presentation is based on slides prepared for an in-person event, initially intended to be 20 minutes long.
Main Theme: Bigger Models and Cheaper Inference (1:13 - 2:13)
This section highlights the two main themes in LLMs for 2025: the development of larger models (e.g., DeepSeek version 3 at roughly 600 billion parameters, Kimi K2 at 1 trillion parameters) and the crucial need to lower inference requirements. The goal is to make these large models feasible to run at scale in data centers without excessive cost, even if not on personal devices.
Grouped-Query Attention (GQA) (2:13 - 5:44)
Grouped-Query Attention (GQA) is presented as a popular technique to reduce inference requirements, specifically by shrinking the KV cache. It replaces the multi-head attention module within transformer blocks. Unlike traditional multi-head attention, where each query head has its own key and value heads, GQA groups the query heads so that several of them share the same key and value heads. This sharing significantly reduces the size of the KV cache, making inference more efficient; in the speaker's example, it cuts the KV cache size in half. While extreme sharing (multi-query attention, where all query heads share a single key/value head) can degrade accuracy, a well-tuned ratio maintains performance, and GQA has become a widely adopted trick for reducing KV cache size.
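The sharing scheme can be sketched in a few lines of NumPy (a toy, illustrative setup with made-up dimensions, not code from the talk): with 4 query heads sharing 2 key/value heads, only the smaller k and v tensors need caching, halving the cache exactly as in the speaker's example.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, n_q_heads, n_kv_heads = 4, 8, 4, 2  # toy sizes
d_head = d_model // n_q_heads

x = rng.standard_normal((seq, d_model))
Wq = rng.standard_normal((d_model, n_q_heads * d_head))
Wk = rng.standard_normal((d_model, n_kv_heads * d_head))  # fewer KV heads
Wv = rng.standard_normal((d_model, n_kv_heads * d_head))

q = (x @ Wq).reshape(seq, n_q_heads, d_head)
k = (x @ Wk).reshape(seq, n_kv_heads, d_head)  # cached at inference
v = (x @ Wv).reshape(seq, n_kv_heads, d_head)  # cached at inference

group = n_q_heads // n_kv_heads  # query heads per shared KV head
causal = np.triu(np.ones((seq, seq), dtype=bool), k=1)
out = np.zeros((seq, n_q_heads, d_head))
for h in range(n_q_heads):
    kv = h // group  # query head h reads the shared KV head kv
    scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
    scores = np.where(causal, -1e9, scores)  # causal masking
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out[:, h] = w @ v[:, kv]
out = out.reshape(seq, d_model)

# the cache shrinks by n_q_heads / n_kv_heads, i.e. 2x in this toy setup
cache_ratio = n_q_heads / n_kv_heads
```

With n_kv_heads = 1 this degenerates into multi-query attention; with n_kv_heads = n_q_heads it is ordinary multi-head attention, which is why GQA is usually described as interpolating between the two.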
Multi-Head Latent Attention (MLA) (5:44 - 9:51)
Multi-Head Latent Attention (MLA) is introduced as an alternative to GQA, used in models like DeepSeek versions 2 and 3 and Kimi K2. MLA pursues the same goal of reducing KV cache size through a different mechanism. Instead of generating keys and values directly from the input, MLA first compresses the input into a latent representation; keys and values are then projected from this compressed latent state, so only the smaller latent needs to be stored in the KV cache. The approach is somewhat lossy but saves storage. Ablation studies from the DeepSeek version 2 paper suggest MLA can offer better modeling performance than GQA, though it is more complex to implement, and MLA can in principle be combined with GQA.
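A minimal sketch of the compression idea (illustrative NumPy with hypothetical weight names and toy dimensions; real MLA splits keys and values across heads and handles positional encodings separately):

```python
import numpy as np

rng = np.random.default_rng(1)
seq, d_model, d_latent = 4, 8, 2  # toy sizes; d_latent << d_model

x = rng.standard_normal((seq, d_model))
W_down = rng.standard_normal((d_model, d_latent))  # compress input to latent
W_uk = rng.standard_normal((d_latent, d_model))    # latent -> keys
W_uv = rng.standard_normal((d_latent, d_model))    # latent -> values

c = x @ W_down  # only this small latent goes into the KV cache
k = c @ W_uk    # keys reconstructed (lossily) from the latent at attention time
v = c @ W_uv    # values likewise

# plain multi-head attention would cache k and v directly:
plain_cache = 2 * seq * d_model   # numbers stored
mla_cache = seq * d_latent        # MLA stores only the latent c
```

Here the cache shrinks from 64 to 8 numbers; the price is the extra up-projection work at attention time and some information loss in the bottleneck.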
Sliding Window Attention (SWA) (9:51 - 13:57)
Sliding Window Attention (SWA) is discussed as another technique for reducing KV cache size, particularly relevant for long contexts. SWA limits how far back a token can attend, meaning it only considers a fixed-size window of previous tokens rather than the entire past sequence. This reduces the memory required for the KV cache. The speaker provides an example with a window size of three. Models like Gemma 3 utilize SWA, often in a hybrid approach where SWA layers are interspersed with regular multi-head attention layers (e.g., a 5:1 ratio) to occasionally access the full context and prevent modeling performance degradation. SWA can also be combined with GQA, as seen in Gemma 3.
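The fixed-size window amounts to an attention mask; a minimal NumPy sketch using the talk's window size of three (function name and setup are illustrative):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask where True means 'may attend': each token sees
    itself plus at most window-1 preceding tokens (causal)."""
    i = np.arange(seq_len)[:, None]  # query position
    j = np.arange(seq_len)[None, :]  # key position
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)
# token 5 attends only to positions 3, 4, 5 instead of all six,
# so the KV cache can drop entries older than the window
```

In a hybrid setup like the one described for Gemma 3, most layers would use a mask like this while every few layers keep the plain causal mask (window equal to the sequence length) to retain access to the full context.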
Mixture of Experts (MoE) (13:57 - 17:01)
Mixture of Experts (MoE) is highlighted as a technique almost universally adopted in larger LLMs. MoE replaces the single feed-forward module in each transformer block with multiple 'expert' feed-forward modules. The key advantage is that this dramatically increases the model's total parameter count (e.g., DeepSeek version 3 has 671 billion parameters), allowing extensive knowledge acquisition during training. During inference, however, only a sparse subset of the experts (e.g., 9 out of 256 for DeepSeek version 3) is activated for any given token, keeping the active parameter count and inference cost significantly lower (e.g., 37 billion active parameters for DeepSeek version 3). This lets models be very knowledgeable without being prohibitively expensive to run.
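A toy sketch of the sparse routing (illustrative names and sizes; production routers also add load-balancing losses, shared experts, and capacity limits):

```python
import numpy as np

def moe_forward(x, W_router, experts, k):
    """Toy MoE layer: route each token to its top-k experts and mix
    their outputs using softmaxed router scores."""
    logits = x @ W_router                      # (seq, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -k:]  # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                # process one token at a time
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                           # softmax over the chosen experts
        for weight, e in zip(w, top[t]):
            # each 'expert' here is just a d x d matrix with a tanh,
            # standing in for a full feed-forward module
            out[t] += weight * np.tanh(x[t] @ experts[e])
    return out

rng = np.random.default_rng(0)
seq, d, n_experts, top_k = 4, 8, 6, 2  # toy sizes
x = rng.standard_normal((seq, d))
W_router = rng.standard_normal((d, n_experts))
experts = rng.standard_normal((n_experts, d, d))
out = moe_forward(x, W_router, experts, top_k)
```

All six experts' parameters exist, but each token only pays for two of them, which is the same total-versus-active distinction as DeepSeek version 3's 671 billion total against 37 billion active parameters.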
LLM and Transformer Alternatives (17:01 - 26:02)
The speaker shifts to discussing alternatives to mainstream transformer-based LLMs. While current transformers are well understood, mature, and state-of-the-art (e.g., GLM 4.6), their size and inference cost drive the search for alternatives. Some models, like Qwen3-Next and DeepSeek version 3.2, introduce tweaks such as gated delta nets or sparse attention for efficiency, but these add complexity and can trade off accuracy.
Further off the main track are:
- Hierarchical Reasoning Models (and the follow-up Tiny Recursive Model): Excellent for specific tasks like Sudoku or maze pathfinding, but not general-purpose text models. They might serve as specialized modules within future LLMs.
- Code World Models: These models train on code traces, simulating execution and understanding internal variables, offering a deeper environmental understanding. This is seen as a promising direction for coding models.
- Text Diffusion Models: While popular in vision, text diffusion models are newer due to text's discrete nature. They generate text in parallel and refine it iteratively. The speaker notes potential downsides for real-time reading of long outputs and questions their benefit for chain-of-thought reasoning.
- Liquid Foundation Models: Based on differential equations, these are very different from decoder-style transformers and appear parameter-efficient. The speaker notes a recent development of MoE versions for these models.
- Transformer-RNN Hybrids (e.g., RWKV): These models combine transformers with recurrent neural networks. While potentially cheaper for long contexts due to constant memory (no KV cache), their accuracy on complex reasoning tasks is currently lower than pure transformers.
- State Space Models (e.g., Mamba): Gaining popularity, especially in hybrid forms with transformers. Models like Hunyuan T1 are showing competitive performance on LLM leaderboards, indicating their growing maturity.
- xLSTMs (extended Long Short-Term Memory): Still being refined, particularly for highly efficient on-device models, leveraging the existing theoretical understanding of LSTMs. They are not seen as replacements for large general-purpose models like GPT-5 but have niche applications.
The speaker concludes by reiterating that these were initial thoughts and mentions an upcoming blog article that will delve deeper into these topics. He also promotes his new book, "Build a Reasoning Model From Scratch," which continues from his previous work on building LLMs and focuses on reasoning and inference techniques. The speaker acknowledges going slightly over the planned 20 minutes but decides to keep the recording as is for sharing.