
Mathematical explanations and pros/cons for each technique: Mixture of Experts, Grouped Query Attention, Flash Attention

Unlocking the Secrets of LLM Efficiency: MoE, GQA, FlashAttention, RoPE, YaRN, and BPE

Large Language Models (LLMs) are pushing boundaries every year. But behind the scenes, their efficiency and scalability rely on clever mathematical tricks and algorithmic innovations. In this blog, we’ll dive deep into six powerful techniques: Mixture of Experts (MoE), Grouped Query Attention (GQA), Flash Attention, Rotary Positional Embeddings (RoPE), YaRN, and Byte Pair Encoding (BPE). Along the way, we’ll reveal their math, strengths, weaknesses, and real-world applications—with Mermaid diagrams to help you visualize how they work.


1. Mixture of Experts (MoE)

Mathematics:

Given an input vector $x$, MoE uses a set of experts $E_1, E_2, \dots, E_n$. A gating function $G(x)$ selects the top-$k$ experts, and the output is their gate-weighted sum:

$$y = \sum_{i=1}^{k} G_i(x) \cdot E_i(x)$$

Mermaid Diagram:

graph TD
    A[Input x] --> B{Gating Function}
    B -->|Top-k| C1[Expert 1]
    B -->|Top-k| C2[Expert 2]
    B -->|Top-k| Cn[Expert n]
    C1 --> D[Weighted Sum]
    C2 --> D
    Cn --> D
    D --> E[Output y]

Pros:

  • Activates only a few experts → lower compute cost.
  • Enables scaling model size without proportional inference cost.

Cons:

  • Hard-to-train gating (especially with hard routing).
  • Risk of load imbalance across experts.

Examples: Google’s Switch Transformer, Mistral’s MoE models.

Use case: LLMs requiring efficiency at scale.


2. Grouped Query Attention (GQA)

Mathematics:

Standard attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V$$

Grouped Query Attention:

GQA partitions the $H$ query heads into $G$ groups, and all query heads within a group share a single key/value head:

$$\text{head}_i = \text{softmax}\!\left(\frac{Q_i K_{g(i)}^T}{\sqrt{d}}\right)V_{g(i)}$$

where $g(i)$ maps query head $i$ to its group. With $G = H$ this is standard multi-head attention; with $G = 1$ it reduces to Multi-Query Attention.

Mermaid Diagram:

graph TD
    A[Queries Q] --> B[Group Queries]
    B --> C[Shared Keys K]
    B --> D[Shared Values V]
    C --> E[Attention]
    D --> E
    E --> F[Output]
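
Below is a minimal PyTorch sketch of the idea: the shared key/value heads are simply broadcast to the query heads of their group before standard attention. The head counts and dimensions are arbitrary example values.

Code sketch (Python):

import math
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, d_head); k, v: (batch, n_kv_heads, seq, d_head).
    Query heads are split evenly into n_kv_heads groups, so each K/V head serves
    n_q_heads // n_kv_heads query heads."""
    group_size = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group_size, dim=1)      # broadcast each shared K to its group
    v = v.repeat_interleave(group_size, dim=1)      # broadcast each shared V to its group
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return F.softmax(scores, dim=-1) @ v

# 8 query heads sharing 2 K/V heads (group size 4): the K/V cache is 4x smaller.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)              # (1, 8, 16, 64)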

Pros:

  • Reduces memory footprint.
  • Faster inference on GPUs.

Cons:

  • Slightly less expressive (loses per-query uniqueness).

Examples: Used in LLaMA v2.

Use case: Efficient GPU deployment of LLMs.


3. Flash Attention

Mathematics:

Flash Attention computes exactly the same result as standard attention,

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V$$

but never materializes the full $QK^T$ matrix in GPU memory. Instead, it processes $Q$, $K$, and $V$ in tiles that fit in fast on-chip SRAM, using an online (streaming) softmax to accumulate the output block by block.

Mermaid Diagram:

graph TD
    A[Queries Q] --> B[Block Partition]
    A2[Keys K] --> B
    A3[Values V] --> B
    B --> C[Block-wise Softmax + Multiply]
    C --> D[Output]
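
The speed comes from fused GPU kernels, but the tiling idea can be shown in plain PyTorch. The sketch below streams over key/value blocks with an online softmax; the block size and tensor shapes are arbitrary, and a real kernel would also handle batching, heads, and masking.

Code sketch (Python):

import math
import torch

def blockwise_attention(q, k, v, block=128):
    """Tiled attention with an online softmax: numerically equal to
    softmax(QK^T / sqrt(d)) V, but the full (n x n) score matrix is never built.
    Real FlashAttention performs this inside a fused GPU kernel."""
    n, d = q.shape
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))     # running max per query row
    row_sum = torch.zeros(n, 1)                     # running softmax denominator
    for start in range(0, n, block):
        kb = k[start:start + block]                 # one tile of keys
        vb = v[start:start + block]                 # matching tile of values
        s = q @ kb.T / math.sqrt(d)                 # scores for this tile only
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        scale = torch.exp(row_max - new_max)        # rescale earlier accumulators
        p = torch.exp(s - new_max)
        row_sum = row_sum * scale + p.sum(dim=-1, keepdim=True)
        out = out * scale + p @ vb
        row_max = new_max
    return out / row_sum

# Sanity check against the naive implementation.
q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.softmax(q @ k.T / math.sqrt(64), dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-4)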

Pros:

  • Reduces memory from $O(n^2)$ to $O(n)$.
  • Huge speedups for long sequences.

Cons:

  • Requires specialized GPU kernels.
  • More complex implementation.

Examples: FlashAttention v1/v2 in PyTorch & Triton.

Use case: Training & inference in models like GPT-4, Claude, and LLaMA.


4. Rotary Positional Embeddings (RoPE)

Mathematics:

RoPE rotates each consecutive pair of embedding dimensions by an angle $m\theta$, where $m$ is the token position and $\theta$ is a fixed per-pair frequency:

$$\begin{bmatrix} x_1' \\ x_2' \end{bmatrix} = \begin{bmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

Because the angle grows linearly with position, the dot product between two rotated vectors depends only on their relative distance, which is what gives RoPE its relative-position behaviour.

Mermaid Diagram:

graph TD
    A[Token Embedding] --> B["Apply Rotation R(theta)"]
    B --> C[Rotated Embedding]
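
A compact sketch of applying RoPE to a (seq_len, d) tensor. The base of 10000 and the even/odd pairing of dimensions follow the common convention, but they are assumptions of this sketch rather than a specific model's code.

Code sketch (Python):

import torch

def apply_rope(x, base=10000.0):
    """Apply rotary embeddings to x of shape (seq_len, d), with d even.
    Each pair of dimensions (2i, 2i+1) is rotated by angle m * theta_i,
    where m is the position and theta_i = base^(-2i/d)."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq, 1)
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    angles = pos * theta                                                # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                     # pair up dimensions
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                                  # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q_rot = apply_rope(torch.randn(16, 64))   # same shape; position is now encoded in the phases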

Pros:

  • Preserves relative position information.
  • Scales well to longer sequences.

Cons:

  • Less intuitive than sinusoidal embeddings.

Examples: LLaMA, GPTNeoX.

Use case: Transformer decoders requiring relative positional encoding.


5. YaRN (Yet another RoPE extensioN)

Mathematics:

Adjusts rotation frequencies to extend context:

$$\theta_i' = \frac{1}{\lambda_i \cdot \alpha}, \quad \alpha \in (0,1)$$

Mermaid Diagram:

graph TD
    A[RoPE Frequencies θ] --> B[Stretch/Compress]
    B --> C[Enhanced Long-Context θ']
    C --> D[Improved Long-Context Embeddings]
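
A toy sketch of the frequency-rescaling idea in the formula above: dividing the RoPE frequencies by a scale factor slows the rotation and stretches the usable context. This is only the simplest ingredient; the actual YaRN method is more involved, as noted in the comments.

Code sketch (Python):

import torch

def scaled_rope_frequencies(d, base=10000.0, scale=4.0):
    """Divide each RoPE frequency theta_i by `scale`, which stretches the usable
    context by roughly that factor. Real YaRN treats low- and high-frequency
    bands differently and adds an attention temperature correction."""
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # original theta_i
    return theta / scale                                                # slower rotation -> longer context

theta_4x = scaled_rope_frequencies(d=64, scale=4.0)   # frequencies aimed at ~4x longer context

These rescaled frequencies could replace the default theta in the RoPE sketch from the previous section.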

Pros:

  • Enables extrapolation to unseen long contexts.

Cons:

  • Requires tuning hyperparameters.

Examples: Mistral with YaRN (128K context).

Use case: Long-context reasoning (legal docs, long chats).


6. Byte Pair Encoding (BPE)

Mathematics:

  • Start: split words into characters.
  • Iteratively merge most frequent pairs (A, B → AB).

Example:

  • "lower" → ["l","o","w","e","r"] → ["lo", "w", "er"]

Mermaid Diagram:

graph TD
    A[Characters] --> B[Find Frequent Pairs]
    B --> C[Merge into Tokens]
    C --> D[Final Vocabulary]

Pros:

  • Reduces out-of-vocab issues.
  • Flexible vocabulary size.

Cons:

  • Can produce awkward tokenization.
  • Can be inefficient for non-Latin scripts, splitting words into many tokens.

Examples: GPT, LLaMA, T5.

Use case: Tokenization in nearly all LLMs.


Final Comparison Table

| Technique | Pros | Cons | Example Use Cases |
| --- | --- | --- | --- |
| Mixture of Experts | Efficient scaling, selective activation | Gating hard to train, imbalance | GPT-MoE, Switch Transformer |
| Grouped Query Attention | Memory savings, faster inference | Less detail per query | LLaMA v2 |
| Flash Attention | O(n) memory, massive speedup | Needs GPU kernel support | GPT-4, Claude, LLaMA |
| Rotary Positional Embedding | Preserves relative order, scalable | Less intuitive | LLaMA, GPTNeoX |
| YaRN | Long-context reasoning | Requires tuning | Mistral 128K |
| BPE | Reduces OOV, compact vocab | Semantic mismatch | GPT, LLaMA, T5 |

Closing Thoughts

The success of modern LLMs isn’t just about scaling parameters—it’s about smart mathematics and clever engineering. Techniques like MoE, GQA, and Flash Attention make training feasible, while RoPE, YaRN, and BPE help models understand and generalize language better. Together, these innovations are paving the way for more efficient, powerful, and context-aware AI systems.

