
Mathematical explanations and pros/cons for each technique: Mixture of Experts, Grouped Query Attention, Flash Attention

Unlocking the Secrets of LLM Efficiency: MoE, GQA, FlashAttention, RoPE, YaRN, and BPE

Large Language Models (LLMs) are pushing boundaries every year. But behind the scenes, their efficiency and scalability rely on clever mathematical tricks and algorithmic innovations. In this blog, we’ll dive deep into six powerful techniques: Mixture of Experts (MoE), Grouped Query Attention (GQA), Flash Attention, Rotary Positional Embeddings (RoPE), YaRN, and Byte Pair Encoding (BPE). Along the way, we’ll reveal their math, strengths, weaknesses, and real-world applications—with Mermaid diagrams to help you visualize how they work.


1. Mixture of Experts (MoE)

Mathematics:

Given an input vector $x$, MoE uses a set of experts $E_1, E_2, \dots, E_n$. A gating function $G(x)$ selects the top-$k$ experts, and the output is their gate-weighted sum:

$$y = \sum_{i=1}^{k} G_i(x) \cdot E_i(x)$$

Mermaid Diagram:

graph TD
    A[Input x] --> B{Gating Function}
    B -->|Top-k| C1[Expert 1]
    B -->|Top-k| C2[Expert 2]
    B -->|Top-k| Cn[Expert n]
    C1 --> D[Weighted Sum]
    C2 --> D
    Cn --> D
    D --> E[Output y]

Pros:

  • Activates only a few experts → lower compute cost.
  • Enables scaling model size without proportional inference cost.

Cons:

  • Hard-to-train gating (especially with hard routing).
  • Risk of load imbalance across experts.

Examples: Google’s Switch Transformer, Mistral’s MoE models.

Use case: LLMs requiring efficiency at scale.


2. Grouped Query Attention (GQA)

Mathematics:

Standard attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V$$

Grouped Query Attention:

GQA partitions the $H$ query heads into $G$ groups, and all query heads within a group share a single key/value head:

$$\text{head}_i = \text{softmax}\!\left(\frac{Q_i K_{g(i)}^T}{\sqrt{d}}\right)V_{g(i)}$$

where $g(i)$ maps query head $i$ to its group. With $G = H$ this is standard multi-head attention; with $G = 1$ it reduces to Multi-Query Attention.

Mermaid Diagram:

graph TD
    A[Queries Q] --> B[Group Queries]
    B --> C[Shared Keys K]
    B --> D[Shared Values V]
    C --> E[Attention]
    D --> E
    E --> F[Output]
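
Below is a minimal PyTorch sketch of the idea: the shared key/value heads are simply broadcast to the query heads of their group before standard attention. The head counts and dimensions are arbitrary example values.

Code sketch (Python):

import math
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, d_head); k, v: (batch, n_kv_heads, seq, d_head).
    Query heads are split evenly into n_kv_heads groups, so each K/V head serves
    n_q_heads // n_kv_heads query heads."""
    group_size = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group_size, dim=1)      # broadcast each shared K to its group
    v = v.repeat_interleave(group_size, dim=1)      # broadcast each shared V to its group
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return F.softmax(scores, dim=-1) @ v

# 8 query heads sharing 2 K/V heads (group size 4): the K/V cache is 4x smaller.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)              # (1, 8, 16, 64)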

Pros:

  • Reduces memory footprint.
  • Faster inference on GPUs.

Cons:

  • Slightly less expressive (loses per-query uniqueness).

Examples: Used in LLaMA v2.

Use case: Efficient GPU deployment of LLMs.


3. Flash Attention

Mathematics:

Flash Attention computes exactly the same result as standard attention,

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V$$

but never materializes the full $QK^T$ matrix in GPU memory. Instead, it processes $Q$, $K$, and $V$ in tiles that fit in fast on-chip SRAM, using an online (streaming) softmax to accumulate the output block by block.

Mermaid Diagram:

graph TD
    A[Queries Q] --> B[Block Partition]
    A2[Keys K] --> B
    A3[Values V] --> B
    B --> C[Block-wise Softmax + Multiply]
    C --> D[Output]
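
The speed comes from fused GPU kernels, but the tiling idea can be shown in plain PyTorch. The sketch below streams over key/value blocks with an online softmax; the block size and tensor shapes are arbitrary, and a real kernel would also handle batching, heads, and masking.

Code sketch (Python):

import math
import torch

def blockwise_attention(q, k, v, block=128):
    """Tiled attention with an online softmax: numerically equal to
    softmax(QK^T / sqrt(d)) V, but the full (n x n) score matrix is never built.
    Real FlashAttention performs this inside a fused GPU kernel."""
    n, d = q.shape
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))     # running max per query row
    row_sum = torch.zeros(n, 1)                     # running softmax denominator
    for start in range(0, n, block):
        kb = k[start:start + block]                 # one tile of keys
        vb = v[start:start + block]                 # matching tile of values
        s = q @ kb.T / math.sqrt(d)                 # scores for this tile only
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        scale = torch.exp(row_max - new_max)        # rescale earlier accumulators
        p = torch.exp(s - new_max)
        row_sum = row_sum * scale + p.sum(dim=-1, keepdim=True)
        out = out * scale + p @ vb
        row_max = new_max
    return out / row_sum

# Sanity check against the naive implementation.
q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.softmax(q @ k.T / math.sqrt(64), dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-4)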

Pros:

  • Reduces memory from $O(n^2)$ to $O(n)$.
  • Huge speedups for long sequences.

Cons:

  • Requires specialized GPU kernels.
  • More complex implementation.

Examples: FlashAttention v1/v2 in PyTorch & Triton.

Use case: Training & inference in models like GPT-4, Claude, and LLaMA.


4. Rotary Positional Embeddings (RoPE)

Mathematics:

RoPE rotates each consecutive pair of embedding dimensions by an angle $m\theta$, where $m$ is the token position and $\theta$ is a fixed per-pair frequency:

$$\begin{bmatrix} x_1' \\ x_2' \end{bmatrix} = \begin{bmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

Because the angle grows linearly with position, the dot product between two rotated vectors depends only on their relative distance, which is what gives RoPE its relative-position behaviour.

Mermaid Diagram:

graph TD
    A[Token Embedding] --> B["Apply Rotation R(theta)"]
    B --> C[Rotated Embedding]
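
A compact sketch of applying RoPE to a (seq_len, d) tensor. The base of 10000 and the even/odd pairing of dimensions follow the common convention, but they are assumptions of this sketch rather than a specific model's code.

Code sketch (Python):

import torch

def apply_rope(x, base=10000.0):
    """Apply rotary embeddings to x of shape (seq_len, d), with d even.
    Each pair of dimensions (2i, 2i+1) is rotated by angle m * theta_i,
    where m is the position and theta_i = base^(-2i/d)."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq, 1)
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    angles = pos * theta                                                # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                     # pair up dimensions
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                                  # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q_rot = apply_rope(torch.randn(16, 64))   # same shape; position is now encoded in the phases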

Pros:

  • Preserves relative position information.
  • Scales well to longer sequences.

Cons:

  • Less intuitive than sinusoidal embeddings.

Examples: LLaMA, GPTNeoX.

Use case: Transformer decoders requiring relative positional encoding.


5. YaRN (Yet another RoPE extensioN)

Mathematics:

Adjusts rotation frequencies to extend context:

$$\theta_i' = \frac{1}{\lambda_i \cdot \alpha}, \quad \alpha \in (0,1)$$

Mermaid Diagram:

graph TD
    A[RoPE Frequencies θ] --> B[Stretch/Compress]
    B --> C[Enhanced Long-Context θ']
    C --> D[Improved Long-Context Embeddings]
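
A toy sketch of the frequency-rescaling idea in the formula above: dividing the RoPE frequencies by a scale factor slows the rotation and stretches the usable context. This is only the simplest ingredient; the actual YaRN method is more involved, as noted in the comments.

Code sketch (Python):

import torch

def scaled_rope_frequencies(d, base=10000.0, scale=4.0):
    """Divide each RoPE frequency theta_i by `scale`, which stretches the usable
    context by roughly that factor. Real YaRN treats low- and high-frequency
    bands differently and adds an attention temperature correction."""
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # original theta_i
    return theta / scale                                                # slower rotation -> longer context

theta_4x = scaled_rope_frequencies(d=64, scale=4.0)   # frequencies aimed at ~4x longer context

These rescaled frequencies could replace the default theta in the RoPE sketch from the previous section.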

Pros:

  • Enables extrapolation to unseen long contexts.

Cons:

  • Requires tuning hyperparameters.

Examples: Mistral with YaRN (128K context).

Use case: Long-context reasoning (legal docs, long chats).


6. Byte Pair Encoding (BPE)

Mathematics:

  • Start: split words into characters.
  • Iteratively merge most frequent pairs (A, B → AB).

Example:

  • "lower" → ["l","o","w","e","r"] → ["lo", "w", "er"]

Mermaid Diagram:

graph TD
    A[Characters] --> B[Find Frequent Pairs]
    B --> C[Merge into Tokens]
    C --> D[Final Vocabulary]

Pros:

  • Reduces out-of-vocab issues.
  • Flexible vocabulary size.

Cons:

  • Can produce awkward tokenization.
  • Can be inefficient for non-Latin scripts, splitting words into many tokens.

Examples: GPT, LLaMA, T5.

Use case: Tokenization in nearly all LLMs.


Final Comparison Table

| Technique | Pros | Cons | Example Use Cases |
| --- | --- | --- | --- |
| Mixture of Experts | Efficient scaling, selective activation | Gating hard to train, imbalance | GPT-MoE, Switch Transformer |
| Grouped Query Attention | Memory savings, faster inference | Less detail per query | LLaMA v2 |
| Flash Attention | O(n) memory, massive speedup | Needs GPU kernel support | GPT-4, Claude, LLaMA |
| Rotary Positional Embedding | Preserves relative order, scalable | Less intuitive | LLaMA, GPTNeoX |
| YaRN | Long-context reasoning | Requires tuning | Mistral 128K |
| BPE | Reduces OOV, compact vocab | Semantic mismatch | GPT, LLaMA, T5 |

Closing Thoughts

The success of modern LLMs isn’t just about scaling parameters—it’s about smart mathematics and clever engineering. Techniques like MoE, GQA, and Flash Attention make training feasible, while RoPE, YaRN, and BPE help models understand and generalize language better. Together, these innovations are paving the way for more efficient, powerful, and context-aware AI systems.

