Đã đăng vào thg 7 3, 2:03 CH 6 phút đọc

Tích hợp Novita AI vào RAG pipeline với LlamaIndex

Tác giả: Cộng tác viên vnai.vn Ngày: 2025-06-08 Thời gian đọc: ~8 phút Trình độ: Trung cấp – Nâng cao

Giới thiệu

RAG (Retrieval-Augmented Generation) đã trở thành kiến trúc tiêu chuẩn khi xây dựng ứng dụng LLM cần truy cập dữ liệu ngoài vùng knowledge cutoff. Tuy nhiên, chi phí inference thường là rào cản lớn — đặc biệt khi bạn cần prototype nhanh hoặc vận hành ở quy mô vừa.

Trong bài viết này, chúng ta sẽ xây dựng một RAG pipeline hoàn chỉnh sử dụng LlamaIndex + Novita AI — nền tảng cung cấp hơn 200 model AI với chi phí thấp hơn OpenAI đến 90%. Toàn bộ code đều có thể chạy ngay.

Tổng quan kiến trúc

┌─────────────┐ ┌──────────────┐ ┌─────────────────┐ │ Documents │────▶│ Embedding │────▶│ Vector Store │ │ (PDF, TXT) │ │ (Novita AI) │ │ (FAISS/Chroma) │ └─────────────┘ └──────────────┘ └────────┬────────┘ │ ┌──────────────┐ │ │ Query │◀─────────────┘ │ (User) │ └──────┬───────┘ │ ┌──────▼───────┐ │ LlamaIndex │ │ QueryEngine │ └──────┬───────┘ │ ┌──────▼───────┐ │ Novita AI │ │ (LLM call) │ └──────┬───────┘ │ ┌──────▼───────┐ │ Response │ └──────────────┘

Pipeline gồm 2 bước chính: Indexing: Chunk documents → Embedding (via Novita) → Lưu vào vector store Querying: Retrieve top-k relevant chunks → Generate answer (via Novita LLM)

Cài đặt môi trường

pip install llama-index llama-index-llms-openai llama-index-embeddings-openai faiss-cpu

Novita AI cung cấp API tương thích hoàn toàn với OpenAI SDK, nghĩa là bạn có thể dùng thẳng llama-index-llms-openai mà không cần custom adapter. Đây là lợi thế lớn — không phải maintain wrapper riêng.

Cấu hình Novita AI

from llama_index.llms.openai import OpenAI from llama_index.embeddings.openai import OpenAIEmbedding

Thay bằng API key của bạn

NOVITA_API_KEY = "YOUR_NOVITA_API_KEY" NOVITA_BASE_URL = "https://api.novita.ai/v3/openai"

Cấu hình LLM — dùng Llama 3.3 70B cho generation

llm = OpenAI( model="meta-llama/llama-3.3-70b-instruct", api_key=NOVITA_API_KEY, api_base=NOVITA_BASE_URL, temperature=0.1, max_tokens=2048, )

Cấu hình Embedding — dùng BAAI/bge-large-en-v1.5

Lưu ý: Nếu Novita chưa hỗ trợ embedding endpoint,

bạn có thể dùng HuggingFace embedding local

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding( model_name="BAAI/bge-large-en-v1.5" )

Mẹo: Bạn cũng có thể dùng embedding model qua Novita nếu nền tảng hỗ trợ. Chỉ cần thay OpenAIEmbedding với api_base trỏ đến Novita endpoint.

Thiết lập Settings toàn cục

from llama_index.core import Settings

Settings.llm = llm Settings.embed_model = embed_model Settings.chunk_size = 512 Settings.chunk_overlap = 50

Build Index từ tài liệu

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader from llama_index.core.storage.storage_context import StorageContext from llama_index.vector_stores.faiss import FaissVectorStore import faiss

Tạo FAISS index (dimension = 1024 cho bge-large)

faiss_index = faiss.IndexFlatL2(1024) vector_store = FaissVectorStore(faiss_index=faiss_index) storage_context = StorageContext.from_defaults(vector_store=vector_store)

Load tài liệu từ thư mục

documents = SimpleDirectoryReader("./data").load_data() print(f"Đã load {len(documents)} documents")

Build index

index = VectorStoreIndex.from_documents( documents, storage_context=storage_context, show_progress=True, )

Lưu index để dùng lại

index.storage_context.persist(persist_dir="./storage") print("Index đã được lưu vào ./storage")

Query với RAG

from llama_index.core import load_index_from_storage, StorageContext

Load index đã lưu

storage_context = StorageContext.from_defaults(persist_dir="./storage") index = load_index_from_storage(storage_context)

Tạo query engine

query_engine = index.as_query_engine( similarity_top_k=5, response_mode="compact", )

Truy vấn

response = query_engine.query( "Những điểm chính của tài liệu là gì?" )

print(f"Câu trả lời:\n{response}") print(f"\n---\nSố nguồn tham chiếu: {len(response.source_nodes)}") for i, node in enumerate(response.source_nodes): print(f" [{i+1}] Score: {node.score:.4f} | {node.text[:100]}...")

Nâng cao: Streaming Response

streaming_engine = index.as_query_engine( similarity_top_k=5, streaming=True, )

streaming_response = streaming_engine.query( "Phân tích chi tiết nội dung tài liệu" )

streaming_response.print_response_stream()

So sánh chi phí: Novita AI vs OpenAI

Giả sử bạn có bộ dữ liệu 100 tài liệu (~500K tokens), mỗi ngày xử lý 200 queries (trung bình 1K tokens output/query):

Chi phí hàng tháng (30 ngày)

💰 Tiết kiệm ~93% khi dùng Novita AI so với GPT-4o.

Với GPT-4o-mini, chi phí OpenAI giảm xuống ~$10.65/tháng — nhưng chất lượng output của Llama 3.3 70B qua Novita hoàn toàn có thể so sánh được cho hầu hết use case RAG.

Trường hợp quy mô lớn hơn

Ở quy mô production, chênh lệch chi phí trở nên cực kỳ đáng kể.

Một số lưu ý khi dùng Novita AI cho RAG

10.1 Chọn model phù hợp

Cho tác vụ tóm tắt, phân tích — cần chất lượng cao

llm_high = OpenAI( model="meta-llama/llama-3.3-70b-instruct", api_key=NOVITA_API_KEY, api_base=NOVITA_BASE_URL, )

Cho tác vụ đơn giản (phân loại, trích xuất entity) — cần nhanh + rẻ

llm_fast = OpenAI( model="deepseek/deepseek-v3-0324", api_key=NOVITA_API_KEY, api_base=NOVITA_BASE_URL, )

10.2 Xử lý tiếng Việt

Llama 3.3 và Qwen 2.5 hỗ trợ tiếng Việt khá tốt. Tuy nhiên, nếu bạn cần chất lượng cao hơn cho tiếng Việt:

Dùng chunk overlap lớn hơn (80-100 tokens) để tránh mất ngữ cảnh khi cắt Thêm instruction rõ ràng: "Trả lời bằng tiếng Việt dựa trên ngữ cảnh sau..." Consider dùng embedding model đa ngữ như intfloat/multilingual-e5-large

10.3 Rate limiting & retry

import time from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(multiplier=1, min=2, max=30), stop=stop_after_attempt(5)) def query_with_retry(query_engine, question): return query_engine.query(question)

Kết luận

Với việc cung cấp API tương thích OpenAI và chi phí cực thấp, Novita AI là lựa chọn tuyệt vời cho RAG pipeline — từ prototype đến production. Kết hợp LlamaIndex + Novita AI, bạn có thể:

🏗️ Xây dựng RAG system hoàn chỉnh trong < 50 dòng code 💰 Giảm 90%+ chi phí inference so với OpenAI 🔄 Linh hoạt chuyển đổi giữa 200+ models tùy task 🚀 Scale từ local development đến production dễ dàng

Đăng ký dùng thử

Bạn đọc có thể đăng ký tài khoản Novita AI miễn phí tại đây — có sẵn free credits để trải nghiệm:

👉 https://novita.ai/?ref=zgq5nwr&utmsource=affiliate

Tài liệu tham khảo

LlamaIndex Documentation (https://docs.llamaindex.ai/) Novita AI API Docs (https://novita.ai/docs) OpenAI-compatible API (https://novita.ai/docs/api-reference)

Bài viết được đóng góp bởi cộng đồng vnai.vn. Nếu bạn có câu hỏi hoặc muốn thảo luận, hãy để lại comment bên dưới.

Access Token