Đã đăng vào thg 10 8, 2025 8:36 SA 1 phút đọc

139

Execution Plan và Statistics trong MySQL Database

📋 Mục Lục

Execution Plan - Chiến Lược Thực Thi
Statistics và Indexes - Nguồn Gốc Dữ Liệu
Stale Statistics - Vấn Đề Nghiêm Trọng
Những Trường Hợp Hệ Thống "Quên" Cập Nhật
Detection và Monitoring
Best Practices và Prevention

1. Execution Plan - Chiến Lược Thực Thi

Định Nghĩa:

Execution Plan là một roadmap chi tiết mà database engine tạo ra để thực hiện một câu SQL query một cách hiệu quả nhất.

Ví Dụ Cụ Thể:

Query Mẫu:

SELECT e.name, d.dept_name 
FROM employees e 
JOIN departments d ON e.dept_id = d.id 
WHERE e.salary > 50000;

Database Có Thể Chọn Nhiều Chiến Lược:

Chiến Lược 1: Nested Loop Join

1. Quét toàn bộ bảng employees
2. Với mỗi employee có salary > 50000:
   - Tìm department tương ứng trong bảng departments
3. Trả về kết quả

Cost: Phù hợp khi employees nhỏ, departments có index tốt

Chiến Lược 2: Hash Join

1. Tạo hash table từ bảng departments (nhỏ hơn)
2. Quét bảng employees, filter salary > 50000
3. Với mỗi employee, lookup trong hash table
4. Trả về kết quả

Cost: Phù hợp khi có đủ memory, departments vừa phải

Chiến Lược 3: Sort-Merge Join

1. Sort bảng employees theo dept_id (có filter salary > 50000)
2. Sort bảng departments theo id  
3. Merge 2 bảng đã sort
4. Trả về kết quả

Cost: Phù hợp khi ít memory, data lớn

Execution Plan Components:

📋 Execution Plan Details:
├── 🔍 Access Methods (Table Scan, Index Scan, Index Seek)
├── 🔗 Join Algorithms (Nested Loop, Hash Join, Merge Join)
├── 📊 Sort Operations (ORDER BY, GROUP BY)
├── 🎯 Filter Operations (WHERE conditions)  
├── 📈 Cost Estimates (CPU, I/O, Memory)
└── ⏱️ Execution Order (Step-by-step sequence)

Cách Xem Execution Plan:

-- Basic explain
EXPLAIN SELECT e.name, d.dept_name 
FROM employees e 
JOIN departments d ON e.dept_id = d.id 
WHERE e.salary > 50000;

-- Detailed JSON format
EXPLAIN FORMAT=JSON SELECT ...;

Kết quả mẫu:

+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | e     | ALL  | dept_id       | NULL | NULL    | NULL | 1000 | Using where |
|  1 | SIMPLE      | d     | ref  | PRIMARY       | PRIMARY | 4    | e.dept_id | 1 | NULL     |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+

2. Statistics và Indexes - Nguồn Gốc Dữ Liệu

A) Statistics - Thống Kê Từ Đâu Mà Có?

Auto Statistics Collection:

-- MySQL tự động thu thập statistics khi:
-- 1. Table được tạo mới
-- 2. Có > 10% data thay đổi (INSERT/UPDATE/DELETE)
-- 3. ANALYZE TABLE được chạy

-- Xem statistics hiện tại:
SELECT 
    table_name,
    table_rows,           -- Ước tính số rows
    data_length,          -- Kích thước data (bytes)
    index_length,         -- Kích thước indexes (bytes) 
    avg_row_length        -- Độ dài trung bình mỗi row
FROM information_schema.tables 
WHERE table_schema = 'your_database'
  AND table_name = 'employees';

Sampling Process:

-- MySQL không scan toàn bộ table (quá chậm)
-- Thay vào đó, nó sample random pages:

-- Ví dụ: Table 1 million rows
-- Sample: ~20-50 random pages (khoảng 20,000-50,000 rows)
-- Extrapolate: Ước tính cho toàn bộ table

-- Xem sampling info:
SELECT 
    table_name,
    table_rows,           -- Estimated (từ sampling)
    (SELECT COUNT(*) FROM employees) as actual_rows  -- Actual count
FROM information_schema.tables 
WHERE table_name = 'employees';

Manual Statistics Update:

-- Force update statistics
ANALYZE TABLE employees;

-- Với sampling cho table lớn:
ANALYZE TABLE employees UPDATE HISTOGRAM ON salary, dept_id WITH 100 BUCKETS;

-- Xem chi tiết histogram:
SELECT 
    schema_name,
    table_name,  
    column_name,
    JSON_EXTRACT(histogram, '$.buckets') as histogram_buckets
FROM information_schema.column_statistics
WHERE table_name = 'employees';

B) Indexes - Làm Sao Mà Có?

Automatic Indexes:

-- 1. PRIMARY KEY tự động tạo clustered index
CREATE TABLE employees (
    id INT PRIMARY KEY,           -- ← Tự động tạo PRIMARY index
    name VARCHAR(100),
    dept_id INT,
    salary DECIMAL(10,2)
);

-- 2. UNIQUE constraint tự động tạo unique index  
ALTER TABLE employees ADD CONSTRAINT UNIQUE(email);

-- 3. FOREIGN KEY tự động tạo index (trong một số DB)
ALTER TABLE employees ADD FOREIGN KEY(dept_id) REFERENCES departments(id);

Manual Index Creation:

-- Tạo index để tối ưu queries
CREATE INDEX idx_salary ON employees(salary);           
CREATE INDEX idx_dept_salary ON employees(dept_id, salary);  
CREATE INDEX idx_name ON employees(name(10));

Index Metadata:

-- Xem tất cả indexes của table:
SHOW INDEX FROM employees;

-- Chi tiết từ information_schema:
SELECT 
    index_name,
    column_name,
    seq_in_index,         -- Thứ tự column trong composite index
    cardinality,          -- Số giá trị unique ước tính
    sub_part,             -- Prefix length (nếu có)
    nullable,
    index_type            -- BTREE, HASH, FULLTEXT, etc.
FROM information_schema.statistics 
WHERE table_name = 'employees'
ORDER BY index_name, seq_in_index;

3. Stale Statistics - Vấn Đề Nghiêm Trọng

Tình Huống Điển Hình:

Kịch Bản:

-- Ban đầu: Bảng products có 1,000 rows
CREATE TABLE products (
    id INT PRIMARY KEY,
    category_id INT,
    price DECIMAL(10,2),
    created_at TIMESTAMP,
    INDEX idx_category (category_id)
);

-- Statistics ban đầu:
-- table_rows: 1,000
-- category_id cardinality: 10 (100 products/category)

Sau 6 tháng - Data Explosion:

-- Bây giờ có 10 MILLION rows, nhưng statistics CHƯA CẬP NHẬT
SELECT COUNT(*) FROM products;                    -- 10,000,000 rows
SELECT COUNT(DISTINCT category_id) FROM products; -- 500 categories (20,000 products/category)

-- Nhưng statistics cũ vẫn báo:
SELECT table_rows FROM information_schema.tables 
WHERE table_name = 'products';  -- 1,000 rows (SAI!)

Hậu Quả: Chiến Lược Sai Lầm

Query Đơn Giản:

SELECT * FROM products WHERE category_id = 15;

Optimizer Nghĩ (Statistics Cũ):

"Ồ, table chỉ có 1,000 rows, category_id = 15 sẽ trả về ~100 rows
→ TABLE SCAN sẽ nhanh hơn INDEX SEEK!"

Execution Plan (SAI):
1. Full Table Scan on products  
2. Filter: category_id = 15
3. Expected cost: Low

Thực Tế Khắc Nghiệt:

"Table thực sự có 10 MILLION rows, category_id = 15 có 20,000 rows
→ TABLE SCAN quét 10M rows để tìm 20K rows = DISASTER!"

Actual Result:
- Execution time: 45 seconds (thay vì 0.1s với index)
- CPU usage: 100%  
- I/O operations: 500,000 reads
- Buffer pool pollution

Ví Dụ Thực Tế - Complex Query:

E-commerce Disaster:

-- Black Friday: Orders table tăng từ 100K → 50M rows
SELECT customer_id, SUM(total_amount) 
FROM orders 
WHERE order_date >= '2024-11-01'
GROUP BY customer_id;

-- Stale stats → Table scan 50M rows
-- Should be: Index scan on order_date + group aggregation
-- Result: Website timeout, customer complaints

JOIN Operations Gone Wrong:

SELECT p.name, c.category_name, COUNT(r.rating) as review_count
FROM products p
JOIN categories c ON p.category_id = c.id  
JOIN reviews r ON p.id = r.product_id
WHERE p.price BETWEEN 100 AND 500
GROUP BY p.id, p.name, c.category_name;

Stale Statistics Báo:

products: 1,000 rows
categories: 10 rows  
reviews: 5,000 rows

Optimizer Chọn (SAI):

1. Nested Loop Join (products → categories)    -- Tưởng products nhỏ
2. Hash Join (result → reviews)               -- Tưởng reviews không lớn  

Expected time: 2-3 seconds

Thực Tế:

products: 10,000,000 rows
categories: 500 rows
reviews: 50,000,000 rows  

Result: 45 MINUTES (thay vì 3 giây)
1. Nested Loop với 10M rows = 5 BILLION operations!
2. Hash Join với 50M reviews = Out of Memory

Performance Impact Comparison:

Với Stale Statistics:

Query: SELECT * FROM products WHERE category_id = 15;

Plan: Full Table Scan
- Rows examined: 10,000,000
- Execution time: 45.2 seconds  
- CPU usage: 100%
- I/O reads: 500,000
- Buffer pool hits: 5%

Với Fresh Statistics:

Query: SELECT * FROM products WHERE category_id = 15;

Plan: Index Range Scan → Key Lookup
- Rows examined: 20,000
- Execution time: 0.08 seconds
- CPU usage: 5%  
- I/O reads: 150
- Buffer pool hits: 95%

🎯 Tổng Kết

Những Điểm Quan Trọng:

Execution Plan phụ thuộc hoàn toàn vào Statistics - Sai statistics = sai plan = performance disaster
Statistics không tự động perfect - Cần monitoring và maintenance proactive
Bulk operations thường bypass auto-analyze - ETL, LOAD DATA, TRUNCATE+INSERT cần attention đặc biệt
Stale statistics có patterns nhận biết được - Never analyzed, data newer than stats, zero cardinality
Detection and prevention equally important - Monitor để catch early, prevent để avoid problems
Size matters cho priority - Large tables với stale stats = highest impact