Unlocking the Secrets of LLM Efficiency: MoE, GQA, FlashAttention, RoPE, YaRN, and BPE
Large Language Models (LLMs) are pushing boundaries every year. But behind the scenes, their efficiency and scalability rely on clever mathematical tricks and algorithmic innovations. In this blog, we’ll dive deep into six powerful techniques: Mixture of Experts (MoE), Grouped Query Attention (GQA), Flash Attention, Rotary Positional Embeddings (RoPE), YaRN, and Byte Pair Encoding (BPE). Along the way, we’ll reveal their math, strengths, weaknesses, and real-world applications—with Mermaid diagrams to help you visualize how they work.
1. Mixture of Experts (MoE)
Mathematics:
Given an input vector $x$, MoE uses a set of experts $E_1, \dots, E_n$. A gating function $G(x)$ selects the top-$k$ experts, and the output is their gate-weighted sum:

$$y = \sum_{i \in \text{top-}k(G(x))} G(x)_i \, E_i(x)$$
Mermaid Diagram:
graph TD
A[Input x] --> B{Gating Function}
B -->|Top-k| C1[Expert 1]
B -->|Top-k| C2[Expert 2]
B -->|Top-k| Cn[Expert n]
C1 --> D[Weighted Sum]
C2 --> D
Cn --> D
D --> E[Output y]
Pros:
- Activates only a few experts → lower compute cost.
- Enables scaling model size without proportional inference cost.
Cons:
- Hard-to-train gating (especially with hard routing).
- Risk of load imbalance across experts.
Examples: Google’s Switch Transformer, Mistral’s Mixtral models (e.g., Mixtral 8x7B).
Use case: LLMs requiring efficiency at scale.
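To make the routing concrete, here is a minimal top-k MoE layer sketch in PyTorch. The layer sizes, the number of experts, and the `TopKMoE` name are illustrative choices, not taken from any particular model.

```python
# Minimal top-k MoE layer sketch (PyTorch); sizes and expert architecture are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)            # gating function G(x)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                     # x: (batch, d_model)
        scores = self.gate(x)                                 # (batch, n_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)   # pick top-k experts per token
        weights = F.softmax(top_vals, dim=-1)                 # renormalize over the selected k
        y = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    y[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return y

x = torch.randn(8, 64)
print(TopKMoE()(x).shape)   # torch.Size([8, 64])
```

Real systems add a load-balancing auxiliary loss and per-expert capacity limits to fight exactly the imbalance problem listed above.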
2. Grouped Query Attention (GQA)
Mathematics:
Standard multi-head attention gives every head $h$ its own keys and values:

$$\text{head}_h = \text{softmax}\!\left(\frac{Q_h K_h^\top}{\sqrt{d_k}}\right) V_h$$

Grouped Query Attention splits the $H$ query heads into $G$ groups, and all query heads in a group share a single key/value head:

$$\text{head}_h = \text{softmax}\!\left(\frac{Q_h K_{g(h)}^\top}{\sqrt{d_k}}\right) V_{g(h)}, \qquad g(h) = \left\lfloor \frac{h}{H/G} \right\rfloor$$

Setting $G = H$ recovers standard multi-head attention; $G = 1$ gives multi-query attention (MQA).
Mermaid Diagram:
graph TD
A[Queries Q] --> B[Group Queries]
B --> C[Shared Keys K]
B --> D[Shared Values V]
C --> E[Attention]
D --> E
E --> F[Output]
Pros:
- Reduces memory footprint.
- Faster inference on GPUs.
Cons:
- Slightly less expressive (query heads in a group share the same keys and values).
Examples: LLaMA 2 (70B), Mistral 7B.
Use case: Efficient GPU deployment of LLMs.
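Here is a small sketch of the grouping trick, assuming 8 query heads that share 2 key/value heads; the head counts, shapes, and the `grouped_query_attention` helper are illustrative.

```python
# GQA sketch: n_q_heads query heads share n_kv_heads key/value heads.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # q: (batch, n_q_heads, seq, d_head); k, v: (batch, n_kv_heads, seq, d_head)
    b, n_q, s, d = q.shape
    n_kv = k.shape[1]
    group = n_q // n_kv
    # Repeat each KV head so it is shared by `group` consecutive query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v

q = torch.randn(1, 8, 16, 32)     # 8 query heads
k = torch.randn(1, 2, 16, 32)     # only 2 KV heads -> smaller KV cache
v = torch.randn(1, 2, 16, 32)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 32])
```

The key/value cache shrinks by the ratio of query heads to KV heads (4× here), which is where the inference-time memory savings come from.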
3. Flash Attention
Mathematics:
Instead of materializing the full $n \times n$ attention matrix $\text{softmax}(QK^\top/\sqrt{d_k})$ in GPU high-bandwidth memory, Flash Attention tiles $Q$, $K$, and $V$ into blocks that fit in on-chip SRAM and combines the blocks with a numerically stable online softmax, producing exactly the same output as standard attention.
Mermaid Diagram:
graph TD
A[Queries Q] --> B[Block Partition]
A2[Keys K] --> B
A3[Values V] --> B
B --> C[Block-wise Softmax + Multiply]
C --> D[Output]
Pros:
- Reduces attention memory from $O(n^2)$ to $O(n)$ in sequence length.
- Huge speedups for long sequences.
Cons:
- Requires specialized GPU kernels.
- More complex implementation.
Examples: FlashAttention-1/2 kernels (CUDA & Triton); exposed in PyTorch via scaled_dot_product_attention.
Use case: Long-sequence training & inference in modern LLM stacks (e.g., LLaMA-family models).
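For intuition, here is an educational block-wise attention with an online softmax, which is the core recurrence FlashAttention fuses into a single GPU kernel. The block size and the `blockwise_attention` helper are illustrative, and this pure-Python loop has none of the real kernel's speed.

```python
# Educational sketch of the FlashAttention idea: process K/V in blocks with a
# running (online) softmax so the full n x n score matrix is never materialized.
import torch

def blockwise_attention(q, k, v, block=64):
    # q, k, v: (seq, d_head)
    scale = q.shape[-1] ** -0.5
    n = k.shape[0]
    m = torch.full((q.shape[0], 1), float("-inf"))   # running max of scores
    l = torch.zeros(q.shape[0], 1)                   # running softmax denominator
    acc = torch.zeros_like(q)                        # running weighted sum of V
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                       # scores for this block only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                     # block softmax numerator
        correction = torch.exp(m - m_new)            # rescale previous partial sums
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ vb
        m = m_new
    return acc / l

q, k, v = (torch.randn(128, 32) for _ in range(3))
out = blockwise_attention(q, k, v)
ref = torch.softmax(q @ k.T / 32 ** 0.5, dim=-1) @ v
print(torch.allclose(out, ref, atol=1e-5))   # True
```

Note that the result matches standard attention exactly: FlashAttention changes the memory-access pattern, not the mathematics.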
4. Rotary Positional Embeddings (RoPE)
Mathematics:
RoPE rotates each pair of embedding dimensions $(x_{2i}, x_{2i+1})$ at position $m$ by the angle $m\theta_i$, where $\theta_i = 10000^{-2i/d}$:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$

Because queries and keys are both rotated before the dot product, their inner product depends only on the relative offset between positions.
Mermaid Diagram:
graph TD
A[Token Embedding] --> B["Apply Rotation R(theta)"]
B --> C[Rotated Embedding]
Pros:
- Preserves relative position information.
- Scales well to longer sequences.
Cons:
- Less intuitive than additive sinusoidal embeddings.
- Extrapolates poorly beyond the trained context length without extensions such as YaRN.
Examples: LLaMA, GPT-NeoX, PaLM.
Use case: Transformer decoders requiring relative positional encoding.
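A minimal sketch of applying the rotation to a sequence of embeddings, assuming the interleaved-pair formulation from the original RoFormer paper; `apply_rope` and the tensor shapes are illustrative.

```python
# RoPE sketch: rotate consecutive dimension pairs by theta_{m,i} = m * base^(-2i/d).
import torch

def apply_rope(x, base=10000.0):
    # x: (seq, d) with d even
    seq, d = x.shape
    pos = torch.arange(seq, dtype=torch.float32)[:, None]               # position m
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # theta_i
    angles = pos * freqs                                                 # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                      # dimension pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                                   # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(16, 64)
print(apply_rope(q).shape)   # torch.Size([16, 64])
```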
5. YaRN (Yet another RoPE extensioN)
Mathematics:
YaRN rescales the RoPE frequencies $\theta_i$ so that a context window $s$ times longer than the training window maps back onto the positional range the model was trained on: low-frequency (long-wavelength) dimensions are interpolated toward $\theta_i / s$, high-frequency dimensions are left largely untouched ("NTK-by-parts"), and a small temperature factor is applied to the attention logits.
Mermaid Diagram:
graph TD
A[RoPE Frequencies θ] --> B[Stretch/Compress]
B --> C[Enhanced Long-Context θ']
C --> D[Improved Long-Context Embeddings]
Pros:
- Enables extrapolation to unseen long contexts.
Cons:
- Requires tuning hyperparameters.
Examples: Mistral with YaRN (128K context).
Use case: Long-context reasoning (legal docs, long chats).
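Below is a deliberately simplified sketch of the frequency-rescaling idea. The ramp thresholds (`alpha`, `beta`, here expressed as wavelengths) and the `yarn_like_freqs` helper are illustrative assumptions; real YaRN defines its ramp over rotation counts and also rescales the attention temperature, so treat this as intuition only.

```python
# Simplified sketch of YaRN-style frequency rescaling (illustrative thresholds).
import math
import torch

def yarn_like_freqs(d=64, scale=4.0, base=10000.0, alpha=16.0, beta=1024.0):
    i = torch.arange(0, d, 2, dtype=torch.float32)
    freqs = base ** (-i / d)                 # original RoPE frequencies theta_i
    wavelen = 2 * math.pi / freqs            # wavelength of each dimension pair
    # Ramp from 0 (short wavelength: keep original frequency) to
    # 1 (long wavelength: fully interpolate toward theta_i / scale).
    ramp = ((wavelen - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    return freqs * (1 - ramp) + (freqs / scale) * ramp

orig = 10000.0 ** (-torch.arange(0, 64, 2, dtype=torch.float32) / 64)
ratio = yarn_like_freqs() / orig
print(ratio[:4], ratio[-4:])   # ~1.0 for high-frequency dims, ~0.25 (= 1/scale) for low-frequency dims
```

The scale factor and ramp thresholds are exactly the hyperparameters the Cons item above refers to.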
6. Byte Pair Encoding (BPE)
Mathematics:
- Start: split words into characters.
- Iteratively merge most frequent pairs (A, B → AB).
Example:
- "lower" → ["l","o","w","e","r"] → ["lo", "w", "er"]
Mermaid Diagram:
graph TD
A[Characters] --> B[Find Frequent Pairs]
B --> C[Merge into Tokens]
C --> D[Final Vocabulary]
Pros:
- Reduces out-of-vocab issues.
- Flexible vocabulary size.
Cons:
- Can produce awkward tokenization.
- Less token-efficient for some non-Latin scripts (more tokens per word).
Examples: GPT, LLaMA, T5.
Use case: Tokenization in nearly all LLMs.
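To see the merge loop in action, here is a tiny BPE trainer sketch on a toy corpus; the corpus, merge count, and `learn_bpe` helper are illustrative.

```python
# Tiny BPE trainer sketch: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def learn_bpe(corpus, num_merges=10):
    words = Counter(tuple(w) for w in corpus)        # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():             # count adjacent pairs, weighted by word frequency
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)             # most frequent pair (A, B)
        merges.append(best)
        merged = {}
        for word, freq in words.items():             # replace every (A, B) with the new symbol AB
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = Counter(merged)
    return merges

print(learn_bpe(["lower", "lowest", "newer", "wider"], num_merges=5))
```

Production tokenizers such as GPT-2's byte-level BPE run the same loop over bytes rather than Unicode characters, which removes out-of-vocabulary issues entirely.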
Final Comparison Table
| Technique | Pros | Cons | Example Use Cases |
|---|---|---|---|
| Mixture of Experts | Efficient scaling, selective activation | Gating hard to train, load imbalance | Switch Transformer, Mixtral 8x7B |
| Grouped Query Attention | Memory savings, faster inference | Shared K/V per group, slightly less expressive | LLaMA 2, Mistral 7B |
| Flash Attention | O(n) memory, large speedups | Needs specialized GPU kernels | Standard in modern LLM training stacks |
| Rotary Positional Embedding | Preserves relative positions, scalable | Needs extensions for long contexts | LLaMA, GPT-NeoX, PaLM |
| YaRN | Long-context extension | Requires hyperparameter tuning | Yarn-Mistral 128K |
| BPE | Reduces OOV, compact vocab | Awkward splits, uneven efficiency across scripts | GPT, LLaMA, T5 |
Closing Thoughts
The success of modern LLMs isn’t just about scaling parameters—it’s about smart mathematics and clever engineering. Techniques like MoE, GQA, and Flash Attention make training feasible, while RoPE, YaRN, and BPE help models understand and generalize language better. Together, these innovations are paving the way for more efficient, powerful, and context-aware AI systems.
All rights reserved