Posted on May 30, 4:21 PM · 11 min read


LiteLLM for Large Systems (Part 2): Advanced Deployment

MayFest2026

Before moving on to the advanced LiteLLM deployments, you may want to review the basics covered in Part 1; the link is below.

Previously: LiteLLM (Part 1): What Happens When OpenAI Is No Longer the Only Option?

1. Reference Production Architecture

Before diving into each component, let's look at the full picture of a LiteLLM deployment:

                    ┌─────────────────────────────────────────┐
                    │           CLIENT APPLICATIONS            │
                    │   Web App · Mobile · Internal Tools ·   │
                    │   AI Agents · LangChain / LlamaIndex    │
                    └────────────────────┬────────────────────┘
                                         │ HTTPS
                    ┌────────────────────▼────────────────────┐
                    │      API GATEWAY (Nginx / Cloudflare)    │
                    │     SSL · WAF · DDoS · Rate Limiting     │
                    └────────────────────┬────────────────────┘
                                         │
              ┌──────────────────────────┼──────────────────────────┐
              │                          │                          │
   ┌──────────▼───────┐      ┌───────────▼──────┐       ┌──────────▼───────┐
   │ LiteLLM Pod 1    │      │ LiteLLM Pod 2    │       │ LiteLLM Pod N    │
   │ (1 worker)       │      │ (1 worker)       │       │ (1 worker)       │
   └──────────┬───────┘      └───────────┬──────┘       └──────────┬───────┘
              └──────────────────────────┼──────────────────────────┘
                                         │
        ┌────────────────────────────────┼────────────────────────────────┐
        │                                │                                │
┌───────▼───────────┐         ┌──────────▼──────────┐         ┌──────────▼─────────┐
│  PostgreSQL       │         │  Redis Cluster      │         │  Observability     │
│  ─────────────    │         │  ───────────────    │         │  ──────────────    │
│  • Virtual keys   │         │  • Response cache   │         │  • Langfuse        │
│  • Teams & Orgs   │         │  • Rate limit state │         │  • Prometheus      │
│  • Spend logs     │         │  • Cooldown state   │         │  • OpenTelemetry   │
│  • Budget         │         │  • Semantic cache   │         │  • Datadog / Slack │
└───────────────────┘         └─────────────────────┘         └────────────────────┘
                                         │
       ┌─────────────────────────────────┼─────────────────────────────────┐
       │                                 │                                 │
┌──────▼──────────┐     ┌────────────────▼─────────────┐     ┌─────────────▼──────┐
│  Cloud LLMs     │     │  Self-hosted LLMs            │     │  Vector DB         │
│  ────────────   │     │  ──────────────────────      │     │  ──────────        │
│  OpenAI         │     │  Ollama (Llama, Qwen)        │     │  Pinecone /        │
│  Anthropic      │     │  vLLM (cluster GPU)          │     │  Weaviate /        │
│  Gemini         │     │  Text-Generation-Inference   │     │  Qdrant            │
│  Bedrock        │     │                              │     │                    │
└─────────────────┘     └──────────────────────────────┘     └────────────────────┘

The key design principles applied throughout this part:

Stateless pods: No LiteLLM pod keeps local state; everything is shared through Redis (cache, cooldown) and PostgreSQL (keys, budget). This is the prerequisite for scaling horizontally without restrictions.
One Uvicorn worker per pod: Following the official recommendation in LiteLLM's Production Best Practices documentation, run 1 worker per pod and scale by adding pods rather than packing several workers into one pod; this gives the most stable latency under load.
PostgreSQL is the single source of truth: Budgets, keys, and spend all live here. Redis only holds cache and transient shared state.
Every callback is async: Logging and spend tracking never block the main request.

2. Advanced LiteLLM Configuration

A complete production config.yaml, annotated section by section:

# litellm_config_production.yaml

# ─── MODEL LIST ────────────────────────────────────────────────────────────

model_list:

  # --- OpenAI: top tier ---
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500              # Requests per minute; match your OpenAI tier
      tpm: 2000000          # Tokens per minute
      timeout: 60.0
      stream_timeout: 90.0

  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      rpm: 2000
      tpm: 10000000

  # --- Anthropic ---
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 1000

  # --- Google ---
  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY
      rpm: 2000

  # --- Self-hosted ---
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.3:70b
      api_base: http://ollama-service:11434

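  # --- Load-balanced group with weights ---
  # Intended split: 60% OpenAI, 30% Anthropic, 10% Gemini.
  # Note: weights are honored by simple-shuffle; under the latency-based-routing
  # strategy set below, measured latency decides and this split is not guaranteed.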
  - model_name: smart-chat
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      weight: 6
  - model_name: smart-chat
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      weight: 3
  - model_name: smart-chat
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY
      weight: 1

# ─── ROUTER SETTINGS ────────────────────────────────────────────────────────

router_settings:
  routing_strategy: latency-based-routing
  redis_host: os.environ/REDIS_HOST
  redis_port: 6379
  redis_password: os.environ/REDIS_PASSWORD

  # Circuit breaker
  allowed_fails: 3            # Cool a deployment down after 3 failures per minute
  cooldown_time: 30           # Pause it for 30s

  # Pre-call check
  enable_pre_call_checks: true   # Check the context window before calling
  num_retries: 2
  timeout: 60
  retry_after: 5

# ─── LITELLM SETTINGS ────────────────────────────────────────────────────────

litellm_settings:

  # Multi-level fallbacks
  fallbacks:
    - gpt-4o: ["claude-sonnet", "gemini-flash", "llama-local"]
    - claude-sonnet: ["gpt-4o", "gemini-flash"]
    - smart-chat: ["llama-local"]

  # Fallbacks specific to the error type
  context_window_fallbacks:
    - gpt-4o-mini: ["gpt-4o"]   # context too long → escalate to gpt-4o
  content_policy_fallbacks:
    - gpt-4o: ["claude-sonnet"]   # blocked by policy → try another provider

  # Cache
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600
    supported_call_types:
      - completion
      - acompletion
      - embedding
      - aembedding

  # Callbacks
  success_callback: ["langfuse", "prometheus"]
  failure_callback: ["langfuse", "slack"]

  # Drop unsupported params (e.g. when a provider does not accept frequency_penalty)
  drop_params: true

  # Redact sensitive information from logs
  turn_off_message_logging: false   # Set to true if compliance requires it
  redact_user_api_key_info: true

# ─── GENERAL SETTINGS ────────────────────────────────────────────────────────

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  database_connection_pool_limit: 100   # Important for large clusters; applied per proxy instance, so keep pods x limit below Postgres max_connections
  store_model_in_db: true
  proxy_budget_rescheduler_min_time: 597
  proxy_budget_rescheduler_max_time: 605
  alerting: ["slack"]
  alert_to_webhook_url:
    slack: os.environ/SLACK_WEBHOOK_URL
  alert_types:
    - "spend_reports"
    - "budget_alerts"
    - "db_exceptions"
    - "outage_alerts"

# ─── ENVIRONMENT VARIABLES ───────────────────────────────────────────────────

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
  LANGFUSE_HOST: "https://cloud.langfuse.com"
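
The `allowed_fails` / `cooldown_time` pair in `router_settings` acts like a small circuit breaker. Below is a minimal Python sketch of that idea, not LiteLLM's internal implementation. The `CooldownTracker` class, the one-minute sliding window, and the choice to trip only once failures exceed `allowed_fails` are all assumptions made for illustration.

```python
import time
from collections import defaultdict, deque


class CooldownTracker:
    """Toy circuit breaker: cool a deployment down after too many recent failures."""

    def __init__(self, allowed_fails=3, cooldown_time=30.0, window=60.0, clock=time.monotonic):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.window = window                  # assumed sliding window, in seconds
        self.clock = clock                    # injectable so the logic can be tested
        self.failures = defaultdict(deque)    # deployment -> recent failure timestamps
        self.cooldown_until = {}              # deployment -> time it becomes usable again

    def record_failure(self, deployment):
        now = self.clock()
        fails = self.failures[deployment]
        fails.append(now)
        # Drop failures that have fallen out of the sliding window.
        while fails and now - fails[0] > self.window:
            fails.popleft()
        # Assumption: trip only when failures exceed allowed_fails.
        if len(fails) > self.allowed_fails:
            self.cooldown_until[deployment] = now + self.cooldown_time
            fails.clear()

    def is_available(self, deployment):
        return self.clock() >= self.cooldown_until.get(deployment, 0.0)
```

In a multi-pod setup this state would have to live in Redis rather than in process memory, which is why the reference architecture keeps cooldown state there.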

3. Advanced Routing: 6 Strategies and How to Choose

This is the part most often overlooked, yet it determines the real-world experience. LiteLLM Router (official documentation at docs.litellm.ai/docs/routing) supports 6 routing strategies, each with its own trade-offs. Let's go through them one by one.

3.1 simple-shuffle (default)

Picks at random among the deployments in the same group (uniformly when no weights are set).

router_settings:
  routing_strategy: simple-shuffle

Use when: all deployments have comparable quota and latency and you want an even spread
Not suitable when: some deployments are slow or overloaded

3.2 least-busy

Picks the deployment with the fewest in-flight requests.

router_settings:
  routing_strategy: least-busy

Use when: load is uneven and some deployments regularly serve long requests
Note: needs shared Redis state to work correctly across multiple pods

3.3 latency-based-routing

Picks the deployment with the lowest average response time over a recent window.

router_settings:
  routing_strategy: latency-based-routing
  routing_strategy_args:
    ttl: 3600                # 1-hour window
    lowest_latency_buffer: 0.5   # 50% buffer: any deployment within [min, min*1.5] can be picked

Use when: a latency SLA is the top priority (real-time chatbots, voice AI)
Note: it can pile load onto one deployment that merely happened to be fast; the lowest_latency_buffer parameter mitigates this
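
To make the effect of `lowest_latency_buffer` concrete, here is a minimal sketch of the selection rule described above: every deployment whose average latency falls within `[min, min * (1 + buffer)]` is a candidate, and one candidate is picked at random. The function name and the sample latencies are hypothetical, and the real router tracks more state than this.

```python
import random


def pick_lowest_latency(avg_latency, buffer=0.5, rng=random):
    """Pick a deployment whose average latency is within `buffer` of the fastest one."""
    fastest = min(avg_latency.values())
    cutoff = fastest * (1 + buffer)
    candidates = sorted(name for name, lat in avg_latency.items() if lat <= cutoff)
    return rng.choice(candidates), candidates


# Hypothetical average latencies in seconds.
latencies = {"openai": 0.80, "anthropic": 1.10, "gemini": 1.50}
chosen, pool = pick_lowest_latency(latencies, buffer=0.5)
```

With a buffer of 0 only the single fastest deployment is eligible, which is exactly the situation where one deployment can end up absorbing all the traffic.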

3.4 usage-based-routing-v2

Picks the deployment with the lowest current TPM (tokens per minute) usage.

router_settings:
  routing_strategy: usage-based-routing-v2

Use when: you run several Azure/OpenAI deployments of the same model, each with its own quota
Note: tpm and rpm must be declared accurately for every deployment
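
The sketch below shows why accurate `tpm` and `rpm` values matter for this strategy: a deployment is only eligible if the request fits under both limits, and among the eligible ones the least-used wins. It is an illustration under stated assumptions; the `used_tpm` / `used_rpm` fields and the function name are hypothetical, not LiteLLM data structures.

```python
def pick_lowest_usage(deployments, needed_tokens):
    """Return the name of the least-used deployment that still has room, or None."""
    eligible = [
        d for d in deployments
        if d["used_tpm"] + needed_tokens <= d["tpm"] and d["used_rpm"] + 1 <= d["rpm"]
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda d: d["used_tpm"])["name"]


# Hypothetical usage counters for the current minute.
deployments = [
    {"name": "azure-eastus", "tpm": 1000, "rpm": 10, "used_tpm": 900, "used_rpm": 2},
    {"name": "azure-westus", "tpm": 1000, "rpm": 10, "used_tpm": 400, "used_rpm": 5},
    {"name": "openai", "tpm": 1000, "rpm": 10, "used_tpm": 100, "used_rpm": 10},
]
choice = pick_lowest_usage(deployments, needed_tokens=200)
```

Here `openai` has the lowest token usage but no request quota left, so a wrong `rpm` declaration would send traffic to a deployment that is about to return rate-limit errors.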

3.5 cost-based-routing

Picks the deployment with the lowest cost for the current request.

router_settings:
  routing_strategy: cost-based-routing

Use when: cost optimization is the main goal (high volume, relaxed latency requirements)
Not suitable when: some cheap deployments are of low quality, because you will always be routed to them

3.6 Tier-based routing (common pattern)

A pattern recommended in the Medium article "Implementing LLM Model Routing" (Michael Hannecke), a case study reporting an 88% cost reduction: split models into tiers, then use fallbacks between the tiers.

model_list:
  # Tier 1: fast and cheap, handles most of the traffic
  - model_name: tier-1-fast
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY

  # Tier 2: balanced
  - model_name: tier-2-balanced
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  # Tier 3: most capable, used only when the lower tiers fail
  - model_name: tier-3-premium
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

litellm_settings:
  fallbacks:
    - tier-1-fast: ["tier-2-balanced", "tier-3-premium"]
    - tier-2-balanced: ["tier-3-premium"]

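By default the application calls tier-1-fast, so most requests are handled by the cheap model. Only on an error (timeout, content filter, context window) does a request escalate to a higher tier. The cited case study reports cost reductions in the 80–90% range; actual savings and uptime depend on your traffic mix and on how often requests escalate.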
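
The escalation logic can be sketched in a few lines of Python. On the proxy this happens server-side through the `fallbacks` setting; the client-side version below only illustrates the control flow. `call_with_fallbacks` and `fake_call` are hypothetical helpers, and `fake_call` stands in for a real provider call.

```python
FALLBACKS = {
    "tier-1-fast": ["tier-2-balanced", "tier-3-premium"],
    "tier-2-balanced": ["tier-3-premium"],
}


def call_with_fallbacks(model, prompt, call_model, fallbacks=FALLBACKS):
    """Try `model` first, then each fallback in order; return (model_used, response)."""
    last_error = None
    for candidate in [model, *fallbacks.get(model, [])]:
        try:
            return candidate, call_model(candidate, prompt)
        except Exception as exc:  # in practice: timeout, content filter, context window
            last_error = exc
    raise RuntimeError(f"all tiers failed for {model}") from last_error


def fake_call(model, prompt):
    """Hypothetical stand-in for a provider call: tier 1 always times out."""
    if model == "tier-1-fast":
        raise TimeoutError("simulated timeout")
    return f"{model} answered: {prompt}"


used, answer = call_with_fallbacks("tier-1-fast", "hello", fake_call)
```

The premium tier is only reached when both cheaper tiers fail, which is what keeps the average cost per request close to the tier-1 price.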

3.7 The order parameter

A lesser-known feature: the order parameter sets the priority order within a single model group:

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-eastus
      api_key: os.environ/AZURE_KEY_1
      api_base: https://eastus.openai.azure.com
      order: 1                        # Highest priority

  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-westus
      api_key: os.environ/AZURE_KEY_2
      api_base: https://westus.openai.azure.com
      order: 2                        # Backup when order=1 fails

  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o            # Public OpenAI, last resort
      api_key: os.environ/OPENAI_KEY
      order: 3

The router always tries order=1 first. Only when every order=1 deployment is cooling down or failing does it move on to order=2, and then order=3. Each level gets its own retries before escalating.
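
A minimal sketch of that selection rule, assuming each deployment carries an `order` value and the router knows which deployments are currently cooling down. The function and the deployment names are hypothetical.

```python
def pick_by_order(deployments, cooled_down):
    """Return the healthy deployments that share the lowest available `order`."""
    healthy = [d for d in deployments if d["name"] not in cooled_down]
    if not healthy:
        return []
    best = min(d["order"] for d in healthy)
    return [d["name"] for d in healthy if d["order"] == best]


deployments = [
    {"name": "azure-eastus", "order": 1},
    {"name": "azure-westus", "order": 2},
    {"name": "openai-public", "order": 3},
]
normal = pick_by_order(deployments, cooled_down=set())
degraded = pick_by_order(deployments, cooled_down={"azure-eastus"})
```

When several deployments share the same order, the configured routing strategy would then choose among the returned names.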

4. Multi-Layer Caching: From Exact to Semantic

Caching is the biggest cost lever in the whole LiteLLM stack. According to a benchmark from the Redis Developer team (public notebook in the redis-ai-resources GitHub repository), a cache hit takes latency from ~0.6s down to ~0.02s, a 30x reduction, and cost falls in proportion to the cache hit rate. LiteLLM supports three caching layers that can be combined:
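
As a quick back-of-the-envelope check, the average latency for a given hit rate is a weighted mix of the two figures quoted above (0.02s on a hit, 0.6s on a miss). The helper below is only arithmetic, and the 40% hit rate is an assumed example value.

```python
def expected_latency(hit_rate, hit_latency=0.02, miss_latency=0.6):
    """Average latency in seconds for a given cache hit rate (0.0 to 1.0)."""
    return hit_rate * hit_latency + (1 - hit_rate) * miss_latency


def llm_cost_fraction(hit_rate):
    """Share of requests that still reach (and are billed by) the LLM provider."""
    return 1 - hit_rate


avg = expected_latency(0.4)          # 0.4 * 0.02 + 0.6 * 0.6 = 0.368 seconds
billed = llm_cost_fraction(0.4)      # 60% of requests still hit the provider
```

So the 30x figure applies to individual hits; the overall improvement is bounded by the hit rate your workload actually achieves.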

4.1 Exact-match cache (basic)

Caches by a hash of the request. If an identical request arrives again, the cached response is returned immediately.

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis-service
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 3600
    supported_call_types: ["completion", "acompletion", "embedding"]

The cache key is built from model + messages + temperature + max_tokens + other relevant params. According to the DeepWiki documentation on the LiteLLM Caching System, only params in supported_call_params affect the key, which keeps the cache from being needlessly invalidated when an unimportant new field is added.
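
A minimal sketch of how such a key can be derived: keep only the parameters that influence the answer, serialize them deterministically, and hash the result. The `KEY_PARAMS` tuple is an assumed subset chosen for illustration, and the exact list and hashing scheme used by LiteLLM may differ.

```python
import hashlib
import json

# Assumed subset of parameters that influence the response.
KEY_PARAMS = ("model", "messages", "temperature", "max_tokens")


def cache_key(request):
    """Build a deterministic cache key from the relevant request parameters."""
    relevant = {k: request[k] for k in KEY_PARAMS if k in request}
    payload = json.dumps(relevant, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = {"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}], "temperature": 0}
same_key = cache_key({**base, "user": "alice"}) == cache_key(base)        # extra field ignored
different_key = cache_key({**base, "temperature": 1}) != cache_key(base)  # relevant field changes key
```

Sorting the keys before hashing matters: without it, two logically identical requests could serialize differently and miss the cache.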

4.2 DualCache: L1 in-memory + L2 Redis

A subtle feature: LiteLLM automatically combines an in-memory L1 cache (local) with a Redis L2 cache (shared). On a Redis hit, the response is "promoted" into local memory so that later requests on the same pod get it almost instantly. According to the public source code (litellm/caching/dual_cache.py), this pattern avoids repeatedly hitting Redis for very popular queries, which benefits both latency and Redis cost.
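
The promotion behaviour can be illustrated with a small two-level cache. This is a sketch of the pattern only, not the code in dual_cache.py: `DualCacheSketch` is a hypothetical class, and a plain dict stands in for Redis.

```python
class DualCacheSketch:
    """Two-level cache: per-process dict (L1) in front of a shared store (L2)."""

    def __init__(self, shared):
        self.local = {}        # L1: memory of this pod only
        self.shared = shared   # L2: stands in for Redis (any dict-like object)

    def get(self, key):
        if key in self.local:
            return self.local[key], "L1"
        if key in self.shared:
            self.local[key] = self.shared[key]   # promote the L2 hit into L1
            return self.local[key], "L2"
        return None, "miss"

    def set(self, key, value):
        self.local[key] = value
        self.shared[key] = value


redis_stand_in = {"q1": "cached answer"}   # written earlier by another pod
pod_cache = DualCacheSketch(redis_stand_in)
first = pod_cache.get("q1")    # served from L2, then promoted
second = pod_cache.get("q1")   # served from L1
```

A real implementation also needs a TTL on the local copy, otherwise an entry that expires in Redis would keep being served from pod memory.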

4.3 Semantic cache

This is the biggest differentiator. Instead of matching byte by byte, the semantic cache embeds the request into a vector and searches the store (Redis or Qdrant) for similar vectors. If the similarity exceeds the threshold, the cached response is returned.

litellm_settings:
  cache: true
  cache_params:
    type: redis-semantic
    redis_host: redis-service
    redis_port: 6379
    redis_password: os.environ/REDIS_PASSWORD
    similarity_threshold: 0.92       # only >= 0.92 counts as "similar"
    redis_semantic_cache_embedding_model: azure-embedding-model

A practical example:

Request 1: "What are microservices?" → cache MISS, call the LLM
Request 2: "Can you explain microservices?" → cache HIT (similarity ~0.94)
Request 3: "What does a microservice architecture help with?" → cache MISS (similarity ~0.78, below the threshold)

According to the Redis article "Scale your LLM gateway with LiteLLM" (redis.io/blog), semantic caching is the most effective way to cut cost for workloads such as FAQ chatbots, support centers, or search assistants, where users ask the same question in many different phrasings.
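
The threshold check itself is simple once embeddings exist. The sketch below uses cosine similarity over toy 2-dimensional vectors; a real setup would get vectors from an embedding model and query a vector index instead of scanning a list. All names here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_lookup(query_vec, store, threshold=0.92):
    """Return (response, similarity) for the best match at or above threshold, else (None, best)."""
    best_resp, best_sim = None, -1.0
    for vec, resp in store:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_resp, best_sim = resp, sim
    if best_sim >= threshold:
        return best_resp, best_sim
    return None, best_sim


# Toy embeddings: one cached question and two new queries.
store = [([1.0, 0.0], "cached explanation of microservices")]
hit, hit_sim = semantic_lookup([0.98, 0.2], store)     # close paraphrase
miss, miss_sim = semantic_lookup([0.6, 0.8], store)    # related but different question
```

Raising the threshold reduces wrong answers served from cache at the price of a lower hit rate, so it is worth tuning against real query logs.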

4.4 Per-request cache control

Sometimes you want to bypass the cache for a specific request (for example, a query for real-time data):

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the VNM stock price today?"}],
    extra_body={
        "cache": {"no-cache": True}   # Skip the cache entirely
    }
)

# Or only accept cached entries newer than 60 seconds
response = client.chat.completions.create(
    ...,
    extra_body={
        "cache": {"s-maxage": 60}
    }
)

4.5 Important caveats

Set TTLs carefully: a 24h cache suits static FAQs but is a disaster for real-time data. Set the TTL per query type.
Disable caching for streaming if needed: streaming responses are cached like non-streaming ones, but if your logic depends on token-by-token timing, the result may differ.
Semantic caching adds embedding latency: every lookup has to embed the query first, typically adding 50–100ms, and a cache miss pays that on top of the LLM call. Work out whether the trade-off is worth it.

5. Multi-Tenancy: Organization → Team → Key

When many departments, projects, and applications share LiteLLM, you need a clear hierarchy. LiteLLM supports three levels:

Organization (company or large unit)
    └── Team (department, project)
            └── Virtual Key (specific application or environment)

According to the official "Multi-Tenant Architecture with LiteLLM" documentation, the Organization level requires the Enterprise tier; the Team and Key levels are available in open source.

5.1 Create a Team with its own budget

import os

import requests

BASE_URL = "http://litellm-proxy:4000"
MASTER_KEY = os.environ["LITELLM_MASTER_KEY"]   # read the master key from the environment
HEADERS = {"Authorization": f"Bearer {MASTER_KEY}"}

# Create the Data Science team
team_resp = requests.post(f"{BASE_URL}/team/new", headers=HEADERS, json={
    "team_alias": "data-science",
    "max_budget": 2000,          # USD, budget for the whole team
    "budget_duration": "30d",
    "tpm_limit": 5000000,
    "rpm_limit": 2000,
    "models": ["gpt-4o", "claude-sonnet", "gemini-flash"],
    "metadata": {"department": "DS", "cost_center": "CC-2026-DS"}
})
team_id = team_resp.json()["team_id"]

5.2 Create Keys within a Team

According to LiteLLM's "Multi-Tenant Architecture" article, there are two common key patterns:

Pattern 1: Service Account Key, not tied to a user, for long-running production applications

# Key for the production chatbot application
service_key = requests.post(f"{BASE_URL}/key/generate", headers=HEADERS, json={
    "team_id": team_id,
    "key_alias": "chatbot-prod-service-account",
    "max_budget": 800,           # This key cannot exceed $800/month
    "budget_duration": "30d",
    "models": ["gpt-4o", "gemini-flash"],
    "metadata": {"env": "production", "service": "chatbot"}
}).json()

Pattern 2: User Key, tied to a specific employee, for dev/research

# Personal key for a data scientist
user_key = requests.post(f"{BASE_URL}/key/generate", headers=HEADERS, json={
    "team_id": team_id,
    "user_id": "nguyen.van.a@company.com",
    "key_alias": "nguyen-van-a-research",
    "max_budget": 100,           # $100/month per employee
    "budget_duration": "30d",
    "duration": "90d",           # Key expires automatically after 90 days
    "models": ["gpt-4o-mini", "claude-sonnet"],
}).json()

Important: a key's budget cannot exceed its team's budget. LiteLLM enforces this automatically: if the team has only $50 left, a key is blocked even if its own cap is $500.
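
The rule amounts to taking the tighter of the two remaining budgets. The helper below is a simplified illustration of that arithmetic, not LiteLLM's enforcement code; the function name and the pre-request cost estimate are assumptions.

```python
def can_spend(cost, key_spend, key_budget, team_spend, team_budget):
    """Allow a request only if both the key and its team stay within budget."""
    key_left = key_budget - key_spend
    team_left = team_budget - team_spend
    return cost <= min(key_left, team_left)


# Team has $50 left of $2,000; the key has spent nothing of its $500 cap.
blocked = can_spend(cost=60, key_spend=0, key_budget=500, team_spend=1950, team_budget=2000)
allowed = can_spend(cost=40, key_spend=0, key_budget=500, team_spend=1950, team_budget=2000)
```

In the example the key still has $500 on paper, but the $60 request is refused because the team only has $50 left.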

6. Budget Management and Automatic Alerts

6.1 Configure automatic alerts via Slack

general_settings:
  alerting: ["slack"]
  alert_to_webhook_url:
    slack: os.environ/SLACK_WEBHOOK_URL

  # Alert types to send
  alert_types:
    - "spend_reports"          # Daily/weekly spend reports
    - "budget_alerts"          # When a budget is close to its limit
    - "db_exceptions"          # When PostgreSQL has problems
    - "outage_alerts"          # When several deployments fail together
    - "daily_reports"

  # Spend report frequency
  spend_report_frequency: "1d"   # Daily

When a team has used 80% of its budget, LiteLLM automatically sends a message along these lines:

⚠️ Budget Alert: Team `data-science`
Used $1,612 / $2,000 (80.6%) over the last 30 days
Resets on: 2026-06-30
Top 3 keys by spend:
  1. chatbot-prod-service: $987 (49.4%)
  2. nguyen-van-a-research: $315 (15.8%)
  3. data-pipeline-etl: $221 (11.1%)

6.2 Custom monitor for more complex requirements

If the built-in alerts are not enough (for example, you want to page PagerDuty instead of Slack, or push data to BigQuery), write your own monitor:

# budget_monitor.py
import asyncio
import os
from datetime import datetime, timedelta

import httpx

class BudgetMonitor:
    """Runs every 15 minutes and alerts on several thresholds."""

    THRESHOLDS = [0.70, 0.85, 0.95]    # 70%, 85%, 95%

    def __init__(self, litellm_url: str, master_key: str, slack_url: str):
        self.url = litellm_url
        self.headers = {"Authorization": f"Bearer {master_key}"}
        self.slack_url = slack_url
        self.sent: dict = {}            # Avoid alert spam

    async def check_all(self):
        async with httpx.AsyncClient() as c:
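            r = await c.get(f"{self.url}/team/list", headers=self.headers)
            data = r.json()
            # Accept either a bare list or a {"teams": [...]} wrapper
            teams = data if isinstance(data, list) else data.get("teams", [])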

        for team in teams:
            await self._check_team(team)

    async def _check_team(self, team: dict):
        spend, budget = team.get("spend", 0), team.get("max_budget")
        if not budget:
            return
        ratio = spend / budget
        tid = team["team_id"]

        for th in self.THRESHOLDS:
            if ratio >= th:
                key = f"{tid}:{th}"
                # Alert at most once every 24h per threshold
                last = self.sent.get(key)
                if last and (datetime.utcnow() - last) < timedelta(hours=24):
                    continue
                await self._send_alert(team, ratio, th)
                self.sent[key] = datetime.utcnow()

    async def _send_alert(self, team, ratio, threshold):
        emoji = "🔴" if threshold >= 0.85 else "🟡"
        msg = {
            "text": (
                f"{emoji} *Budget Alert*\n"
                f"Team: `{team.get('team_alias', team['team_id'])}`\n"
                f"Used: *{ratio*100:.1f}%* (${team['spend']:.2f}/${team['max_budget']:.2f})\n"
                f"Threshold: {threshold*100:.0f}%"
            )
        }
        async with httpx.AsyncClient() as c:
            await c.post(self.slack_url, json=msg)

    async def run(self, interval: int = 900):
        while True:
            try:
                await self.check_all()
            except Exception as e:
                print(f"Monitor error: {e}")
            await asyncio.sleep(interval)

# Run
monitor = BudgetMonitor(
    "http://litellm-proxy:4000",
    os.environ["LITELLM_MASTER_KEY"],
    os.environ["SLACK_WEBHOOK_URL"],
)
asyncio.run(monitor.run())

7. Docker Compose Deployment

Suitable for staging and small production setups (up to a few dozen requests per second):

# docker-compose.production.yml
version: '3.8'

services:
  litellm:
    image: ghcr.io/berriai/litellm-database:main-latest
    deploy:
      replicas: 2
    expose:
      - "4000"             # internal only: a fixed host port would clash across replicas, nginx is the entry point
    volumes:
      - ./litellm_config_production.yaml:/app/config.yaml:ro
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - GEMINI_API_KEY=${GEMINI_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - LITELLM_SALT_KEY=${LITELLM_SALT_KEY}
      - DATABASE_URL=postgresql://litellm:${DB_PASSWORD}@postgres:5432/litellm
      - REDIS_HOST=redis
      - REDIS_PORT=6379
      - REDIS_PASSWORD=${REDIS_PASSWORD}
      - LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY}
      - LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY}
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
    command:
      - "--config"
      - "/app/config.yaml"
      - "--port"
      - "4000"
      - "--num_workers"
      - "1"                # 1 worker per container; scale with replicas
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health/readiness"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

  postgres:
    image: postgres:16-alpine
    environment:
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=${DB_PASSWORD}
      - POSTGRES_DB=litellm
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    command:
      - redis-server
      - --requirepass
      - ${REDIS_PASSWORD}
      - --maxmemory
      - 4gb
      - --maxmemory-policy
      - allkeys-lru
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  # Load balancer in front
  nginx:
    image: nginx:alpine
    ports:
      - "443:443"
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro   # the file has no events/http wrapper, so load it from conf.d
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - litellm
    restart: unless-stopped

volumes:
  postgres_data:
  redis_data:

# nginx.conf: load balance across the litellm replicas
upstream litellm_backend {
    least_conn;
    server litellm:4000;
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name llm.company.com;

    ssl_certificate /etc/nginx/certs/fullchain.pem;
    ssl_certificate_key /etc/nginx/certs/privkey.pem;

    location / {
        proxy_pass http://litellm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Important for streaming
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 120s;
    }
}

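Start it up:

docker compose -f docker-compose.production.yml --env-file .env up -d

# Check status
docker compose -f docker-compose.production.yml ps
docker compose -f docker-compose.production.yml logs -f litellm

# Scale to 4 replicas when needed
docker compose -f docker-compose.production.yml up -d --scale litellm=4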

8. Kubernetes Deployment with Best Practices

Suitable for large production (hundreds of rps and up) with HPA auto-scaling.

8.1 Uvicorn workers per pod

This is the official recommendation from LiteLLM's Production Best Practices documentation (docs.litellm.ai/docs/proxy/prod):

"We recommend running 1 Uvicorn worker per pod and scaling out horizontally with more pods rather than more workers per pod. This gives the most stable latency under load and works best with the HPA thresholds."

The reason: several workers in one pod are separate processes that compete for the same pod CPU and memory limits, so latency fluctuates more under load. When you scale by pods, each pod is independent, latency stays stable, and HPA metrics reflect the load accurately.

8.2 Deployment manifest

# k8s/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ai-platform

---
# k8s/litellm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
  namespace: ai-platform
spec:
  replicas: 3
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "4000"
        prometheus.io/path: "/metrics"
    spec:
      # Security best practice: run as a non-root user
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
      - name: litellm
        image: ghcr.io/berriai/litellm-database:main-latest
        ports:
        - containerPort: 4000
        args:
          - "--config"
          - "/app/config.yaml"
          - "--port"
          - "4000"
          - "--num_workers"
          - "1"                          # 1 worker per pod
          # Optional: restart the worker after N requests (guards against memory leaks)
          - "--max_requests_before_restart"
          - "10000"
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: litellm-secrets
              key: database-url
        - name: LITELLM_MASTER_KEY
          valueFrom:
            secretKeyRef:
              name: litellm-secrets
              key: master-key
        - name: LITELLM_SALT_KEY
          valueFrom:
            secretKeyRef:
              name: litellm-secrets
              key: salt-key
        - name: REDIS_HOST
          value: "redis-service"
        - name: REDIS_PORT
          value: "6379"
        - name: REDIS_PASSWORD
          valueFrom:
            secretKeyRef:
              name: litellm-secrets
              key: redis-password
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: provider-secrets
              key: openai-key
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: provider-secrets
              key: anthropic-key
        volumeMounts:
        - name: config
          mountPath: /app/config.yaml
          subPath: config.yaml
          readOnly: true
        # Writable volumes for migrations and the UI
        - name: migrations
          mountPath: /app/migrations
        - name: ui-cache
          mountPath: /app/ui
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /health/liveness
            port: 4000
          initialDelaySeconds: 30
          periodSeconds: 15
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/readiness
            port: 4000
          initialDelaySeconds: 10
          periodSeconds: 5
        startupProbe:
          httpGet:
            path: /health/readiness
            port: 4000
          failureThreshold: 30
          periodSeconds: 5
      volumes:
      - name: config
        configMap:
          name: litellm-config
      - name: migrations
        emptyDir: {}
      - name: ui-cache
        emptyDir: {}

---
apiVersion: v1
kind: Service
metadata:
  name: litellm-service
  namespace: ai-platform
spec:
  selector:
    app: litellm-proxy
  ports:
  - name: http
    port: 4000
    targetPort: 4000
  type: ClusterIP

---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm-hpa
  namespace: ai-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm-proxy
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 minutes before scaling down
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
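
To reason about how this HPA reacts, it helps to apply the standard Kubernetes scaling formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max above. The sketch below is a simplification that ignores the HPA tolerance band and the `behavior` policies (which cap each step at 2 pods up or 1 pod down per minute). The function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas=3, max_replicas=20):
    """Replica count the HPA formula asks for, clamped to the configured bounds."""
    raw = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


# CPU target is 70%: 3 pods running at 105% average CPU call for 5 pods.
busy = desired_replicas(3, current_util=105, target_util=70)
idle = desired_replicas(3, current_util=35, target_util=70)       # floor at minReplicas
overload = desired_replicas(10, current_util=200, target_util=70)  # capped at maxReplicas
```

When both CPU and memory metrics are configured, the HPA computes a value for each and uses the larger one.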

9. Best Practices and Mistakes to Avoid

Compiled from the official LiteLLM documentation, major community articles (Stack Harbor, Markaicode, Medium MITB For All, Success Knocks), and real GitHub issues:

Do

One worker per pod, scale with pods

litellm --num_workers 1 --config config.yaml
# Scale to 10 pods, not 1 pod x 10 workers

Use PostgreSQL, not SQLite/MySQL

"SQLite is fine for a laptop demo and bad for anything else because budget decrements race under concurrent requests" — Stack Harbor

Keep the master key out of applications

Configure max_requests_before_restart to contain memory leaks

litellm --num_workers 1 --max_requests_before_restart 10000

Enable enable_pre_call_checks to avoid wasting tokens

router_settings:
  enable_pre_call_checks: true   # Check the context window before calling

Set budgets 5–10% below the real limit (spend updates are subject to a race condition, see GitHub issue #27735)

Use different TTLs for different kinds of content

# Static FAQ: cache for 24h
# Real-time data: disable caching with extra_body={"cache": {"no-cache": True}}

Use the official Helm chart for Kubernetes; it already handles many edge cases (read-only fs, migrations, UI assets)

Don't

Do NOT put real API keys in config.yaml

Do NOT enable detailed_debug in production

"WARNING: FOR PROD DO NOT USE --detailed_debug it slows down response times" (official LiteLLM documentation)

Do NOT cache responses for real-time queries

Do NOT skip the salt key

# LITELLM_SALT_KEY is required to encrypt sensitive values in the DB
# If it is lost or changed, credentials already stored in the DB can no longer be decrypted
LITELLM_SALT_KEY=sk-salt-very-secret-stable-string

Do NOT use routing_strategy: latency-based-routing for workloads with traffic spikes

The latency window can mislead: a deployment that merely happened to be fast in the last window attracts most of the new requests and becomes overloaded. least-busy or usage-based-routing-v2 is safer.

Do NOT skip health checks

livenessProbe:
  httpGet: { path: /health/liveness, port: 4000 }
readinessProbe:
  httpGet: { path: /health/readiness, port: 4000 }

References for Part 2

Official documentation:

Major community articles:

Deploying LiteLLM proxy with per-team budgets (Stack Harbor): real-world scenario and detailed setup
LiteLLM Fallback Configuration: Reduce API Errors by 90% (Markaicode)
LiteLLM Routing Config: Balance 3 Providers in 10 Lines (Markaicode)
Implementing LLM Model Routing (Medium, Michael Hannecke): case study reporting an 88% cost reduction
Scale your LLM gateway with LiteLLM & Redis (Redis Blog)
LiteLLM + LangFuse Observability (Cash Williams)
How to deploy LiteLLM (Northflank Guide)

In-depth technical analysis:
