Posted Feb 25, 4:01 AM · 12 min read

AGENTS.md: Do Context Files Really Make Coding Agents More Effective?

Introduction: When "more context" isn't always better

Hey hey, hi everyone! These days every dev seems to be vibing with Claude Code, Cursor, or GitHub Copilot, right? I use Claude Code for my daily work too, and there's one thing most of us believe: "the more context you give the AI, the better it codes."

So a lot of people have started creating an AGENTS.md or CLAUDE.md file in their repo, packed with all kinds of instructions for the AI:

# AGENTS.md
- Always run `npm test` before committing
- Code style: use TypeScript strict mode
- Test files must be in `/tests` folder
- Never modify files in `/legacy`

Sounds reasonable, right? Tell the agent the rules clearly, and it will get things right more often.

But a new study on arXiv (2602.11988v1) just slapped that belief in the face.

The paper, "Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?", runs a very careful benchmark on 138 real-world tasks, and the results are... a bit of a shock:

LLM-generated context files REDUCE success rate by ~3% and increase cost by 20%!

Even human-written context improves things only slightly, while costing a lot more tokens and steps.

So what is going on? Are context files "poison" for AI agents? Or are we just using them wrong?

In this post I'll dig into the study so everyone understands:

How context files actually work
Why LLM-generated files hurt performance
Practical best practices for writing AGENTS.md/CLAUDE.md

What are context files? Why does everyone think they help?

Context files (such as AGENTS.md, CLAUDE.md, .cursorrules) are files, mostly Markdown, placed at the root of a repo that contain instructions for AI coding agents.

Picture it like this:

You've just joined a dev team. On day one, the lead dev hands you a "survival handbook" that spells out which tests to run before committing, how to name things, which folders are off-limits...

A context file is that handbook, but for an AI agent.

The idea sounds very reasonable:

The agent knows the repo structure
The agent knows the testing workflow
The agent avoids common mistakes
The agent follows the team's conventions

Many tools, such as Claude Code and Cursor, even load these files into the prompt context automatically when they start working with a repo.

The catch: "more context = better performance" sounds right... but reality is more complicated.

AGENTBENCH: The most "real" benchmark for coding agents

Before this study, most benchmarks for evaluating AI agents had problems:

Tasks too simple to represent real-world work
No direct comparison of the effect of context files
Little diversity in the kinds of repositories

AGENTBENCH was created to address exactly these problems.

Key features of AGENTBENCH:

138 real-world instances from 12 real repositories:

Tasks drawn from recent GitHub issues
Partly from SWE-bench Lite (a well-known benchmark)
Multiple languages: Python, JavaScript, TypeScript, Go, Java...
Multiple task types: bug fix, feature addition, refactoring

3 scenarios compared:

No Context: The agent works without an AGENTS.md
LLM-Generated Context: AGENTS.md generated with GPT-4o/Claude 3.5 Sonnet, following the agent developers' guidelines
Human-Written Context: Context files written by actual developers and committed to the repo

Agents tested:

Claude Code
Cursor
And a few other agents

Metrics measured:

Success Rate: The percentage of tasks solved correctly
Steps: The number of actions the agent takes
Cost: Total tokens consumed
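
To make the three metrics concrete, here is a minimal sketch of how per-task run records could be rolled up into them. The `RunRecord` shape, the `summarize` helper, and the sample numbers are all my own assumptions for illustration; they are not the paper's actual harness or data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """One agent attempt at one task (hypothetical record shape)."""
    task_id: str
    solved: bool   # did the final patch pass the task's tests?
    steps: int     # number of actions the agent took
    tokens: int    # total tokens consumed, used here as the cost proxy


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate per-task records into the three headline metrics."""
    if not runs:
        raise ValueError("need at least one run")
    return {
        "success_rate": sum(r.solved for r in runs) / len(runs),
        "mean_steps": mean(r.steps for r in runs),
        "mean_tokens": mean(r.tokens for r in runs),
    }


# Made-up records, for illustration only.
runs = [
    RunRecord("task-1", True, 10, 40_000),
    RunRecord("task-2", False, 20, 80_000),
]
summary = summarize(runs)
print(summary)
```

Running the same aggregation once per scenario gives you numbers that can be compared side by side.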

This is arguably the most robust, real-world benchmark available today for evaluating context files.

Results: The "harsh" truth

And here's the part that surprises everyone...

1. LLM-generated context: Not just unhelpful, it actually HURTS!

With context files generated by an LLM:

Metric	Change vs. No Context
Success Rate	-2% to -3%
Steps	+20% to +25%
Cost (tokens)	over +20%

In other words:

The agent solves FEWER tasks
The agent takes MORE steps (going in circles)
The agent costs MORE money (because token usage goes up)

2. Human-written context: An improvement... but smaller than expected

With context files written by developers (already committed to the repo):

Metric	Change
Success Rate	+4% (slight improvement)
Steps	+19% (still up)
Cost	+19% (more expensive)

Human-written context is BETTER than LLM-generated context, but there is still a trade-off:

Success rate improves a little
But the agent works more "thoroughly" (more steps)
Cost goes up noticeably
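
One way to read the two tables together is "cost per solved task". The sketch below is my own back-of-the-envelope arithmetic, not a figure from the paper. It assumes a hypothetical baseline success rate of 50% (the tables don't give one) and treats the success-rate changes as percentage points.

```python
def cost_per_solved_ratio(baseline_success: float,
                          success_delta: float,
                          cost_increase: float) -> float:
    """Return (cost per solved task with context) / (same, without context).

    baseline_success: success rate without a context file, as a fraction.
    success_delta:    absolute change in success rate (-0.03 means -3 points).
    cost_increase:    relative change in cost per task (0.20 means +20%).
    """
    new_success = baseline_success + success_delta
    if not 0 < baseline_success <= 1 or not 0 < new_success <= 1:
        raise ValueError("success rates must be in (0, 1]")
    return (1 + cost_increase) / (new_success / baseline_success)


# Hypothetical 50% baseline; the deltas are the ones quoted in the tables above.
llm_generated = cost_per_solved_ratio(0.50, -0.03, 0.20)
human_written = cost_per_solved_ratio(0.50, +0.04, 0.19)
print(f"LLM-generated: {llm_generated:.2f}x, human-written: {human_written:.2f}x")
```

Under these assumptions, a solved task costs roughly 1.28x the baseline with LLM-generated context and about 1.10x with human-written context. A different baseline success rate would shift both numbers.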

3. Why does this happen?

The study points to 3 main causes:

Cause 1: Over-exploration

When a context file contains instructions of the "always run the full test suite" kind, the agent runs unnecessary tests at every small step, even when the issue only requires changing one file.

# WRONG - Overly detailed context
Always run the full test suite before making changes
Run tests after every single edit
Check all integration tests even for unit-level changes

# RIGHT - Just enough context
Run relevant tests for the modified modules
Use `pytest tests/unit` for unit changes

Cause 2: Redundant information

LLM-generated context often contains information the agent already knows or can infer from the code on its own.

Example: "Read all files carefully" is a meaningless instruction, because the agent has to read the code to understand it anyway.

Cause 3: Agents follow instructions too strictly

Modern agents are trained to "follow instructions carefully", so when a context file says "explore thoroughly", they do exactly that, even when it isn't needed.

It's like telling a junior dev to "test thoroughly" and watching them run the entire test suite for every changed line. Technically correct, but not efficient.

Detailed comparison: No Context vs LLM-Generated vs Human-Written

To understand this better, let's look at the agent's behavior in each scenario:

Scenario 1: No Context (Baseline)

Behavior:

The agent reads the code, understands the issue, and fixes it quickly
Focused, goes straight to the problem
Less exploration

Pros:

Fast and efficient
No wasted tokens
Stable success rate

Cons:

May miss some edge cases
Lacks context about the project's conventions

Scenario 2: LLM-Generated Context

Behavior:

The agent reads the context → runs MORE tests
Traverses MORE files (broader exploration)
Spends tokens on unnecessary reasoning

Pros:

More thorough... but not enough to compensate

Cons:

Turns simple tasks into complicated ones
Over-testing, over-exploration
Cost goes up while success rate goes DOWN

A concrete example from the paper:

A simple task: "Fix typo in error message"

No Context: The agent fixes the typo → runs the relevant unit test → done (3 steps)
LLM-Generated Context: The agent reads the context "always verify all error paths" → fixes the typo → runs the full test suite → checks integration tests → verifies error handling across modules → done (8 steps, much of it unnecessary work)

Scenario 3: Human-Written Context

Behavior:

More concise and project-specific
The agent is still thorough, but more focused than with LLM-generated context
Slight improvement in success rate

Pros:

Improves success rate by ~4%
Helps the agent avoid common pitfalls
Useful for unconventional projects

Cons:

Still costs 19% more
Only worth it when the project has unusual conventions

Conclusion from the paper:

"Human-written context files show modest improvements but come with significant cost increases. They are most valuable for projects with non-standard testing procedures or unique architectural patterns."

Best Practices: When and how to write context files properly

Based on this study and hands-on experience, here are the best practices I recommend:

1. Keep it minimal - small but mighty

WRONG:

# AGENTS.md
- Read all files carefully before making changes
- Understand the full codebase architecture
- Always write clean, maintainable code
- Follow best practices
- Test thoroughly
- Document your changes
- Use meaningful variable names

RIGHT:

# AGENTS.md

## Testing
- Run: `pytest tests/unit/` for unit tests only
- Full suite: `make test` (use sparingly)
- Avoid CI scripts locally - they auto-run on PR

## Critical Rules
- NEVER modify files in `/legacy` (deprecated code)
- API changes require updating OpenAPI spec in `/docs/api.yaml`

2. Prioritize actionable instructions

Context should consist of specific actions, not generic guidelines.

WRONG:

- Be careful with authentication code
- Consider security implications

RIGHT:

- Auth changes: Update both `/src/auth` and `/tests/auth_integration`
- Security: Run `npm run security-check` before committing auth code
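
If you want a quick automated check for this, here's a rough sketch of a linter that flags vague bullets. The heuristic is purely my own assumption, not something from the paper: a bullet counts as actionable only if it names at least one backticked command or path. `vague_instructions` is a hypothetical helper.

```python
import re

BULLET = re.compile(r"^\s*[-*]\s+(.*)$")   # Markdown bullet lines
CODE_SPAN = re.compile(r"`[^`]+`")         # inline code such as `make test`


def vague_instructions(markdown: str) -> list[str]:
    """Return bullet items that name no concrete command or path.

    Heuristic (an assumption of this sketch): an item is actionable
    only if it contains at least one backticked code span.
    """
    flagged = []
    for line in markdown.splitlines():
        match = BULLET.match(line)
        if match and not CODE_SPAN.search(match.group(1)):
            flagged.append(match.group(1).strip())
    return flagged


sample = """\
- Be careful with authentication code
- Consider security implications
- Security: Run `npm run security-check` before committing auth code
"""
flagged = vague_instructions(sample)
print(flagged)
```

On the sample, the two "WRONG" bullets get flagged and the one with a concrete command passes. A real rule like "NEVER modify files in /legacy" written without backticks would be flagged too, so treat the output as a prompt for review, not a verdict.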

3. Only write one when the project has "unconventional patterns"

WHEN YOU NEED a context file:

A non-standard testing framework (not plain pytest/jest/go test)
An unusual repo structure (a monorepo with custom tooling)
Critical folders that must not be touched
An unconventional build process

WHEN YOU DON'T:

A standard repo structure (npm, cargo, go mod, pip)
Testing commands that are obvious from package.json/Makefile
Standard conventions (eslint, prettier, gofmt already configured)

4. Test with agents before committing

Recommended workflow:

Step 1: Create a baseline

# Run the agent WITHOUT AGENTS.md on a few small tasks
# Measure success rate and cost

Step 2: Add AGENTS.md and measure again

# Compare against the baseline
# If nothing improves, or cost rises too much → drop AGENTS.md

Step 3: Iterate

# If you decide to keep AGENTS.md, keep it concise
# Monitor agent behavior over time
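
The keep-or-drop decision in Step 2 can be written down as a small function. This is only a sketch: the `Summary` shape and both thresholds (`min_success_gain`, `max_cost_increase`) are hypothetical defaults I picked, so tune them to your own budget.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    success_rate: float   # fraction of tasks solved, 0..1
    mean_tokens: float    # average tokens per task


def keep_context_file(baseline: Summary, with_context: Summary,
                      min_success_gain: float = 0.0,
                      max_cost_increase: float = 0.20) -> bool:
    """Keep AGENTS.md only if success improves and the extra cost stays bounded.

    Both thresholds are hypothetical defaults, not values from the paper.
    """
    success_gain = with_context.success_rate - baseline.success_rate
    cost_increase = with_context.mean_tokens / baseline.mean_tokens - 1
    return success_gain > min_success_gain and cost_increase <= max_cost_increase


# Made-up measurements from a handful of small tasks.
baseline = Summary(success_rate=0.60, mean_tokens=50_000)
no_gain = Summary(success_rate=0.60, mean_tokens=60_000)   # same success, +20% cost
real_gain = Summary(success_rate=0.70, mean_tokens=55_000) # +10 points, +10% cost

print(keep_context_file(baseline, no_gain))
print(keep_context_file(baseline, real_gain))
```

With only a few tasks the numbers are noisy, so rerun the comparison whenever you change the file (Step 3).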

5. Monitor context length

Context files should stay under 500 tokens (roughly 300-400 English words).

Any longer, and the agent spends too many tokens reasoning about the instructions.
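
For a rough check against that budget without pulling in a tokenizer, here's a sketch that estimates tokens from the word count. The 0.75 words-per-token ratio is a common rule of thumb for English, consistent with the 300-400 word range above, and is an assumption. Real tokenizers will give different counts, especially for code and paths. `estimate_tokens` and `within_budget` are hypothetical helpers.

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a whitespace word count (heuristic only)."""
    return round(len(text.split()) / words_per_token)


def within_budget(text: str, budget: int = 500) -> bool:
    """True if the estimated token count fits the budget (500 as suggested above)."""
    return estimate_tokens(text) <= budget


short_file = "word " * 300   # stand-in for a ~300-word AGENTS.md
long_file = "word " * 600    # stand-in for a ~600-word AGENTS.md

print(estimate_tokens(short_file), within_budget(short_file))
print(estimate_tokens(long_file), within_budget(long_file))
```

In practice you would pass in the contents of your own AGENTS.md, for example read with `pathlib.Path("AGENTS.md").read_text()`.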

6. Use comment-based context instead of a separate file

Another approach (sometimes better than AGENTS.md):

# src/payment.py

# IMPORTANT: Idempotency required - use transaction_id to prevent duplicate charges
# See tests/payment_integration_test.py for examples
def process_payment(transaction_id: str, amount: float):
    ...  # body omitted

The agent picks up this context naturally while exploring the code, with no separate file needed.

Lessons learned: Context is a double-edged sword

PERSONAL EXPERIENCE:

I also used to think "the more context, the better", and I wrote an extremely detailed CLAUDE.md for a microservices project I maintain:

# CLAUDE.md (old version - 1200 tokens!)
- Always check service dependencies in /docs/architecture.md
- Run integration tests for each service
- Verify API contracts with Postman collection
- Check logging format consistency
- Update README for each change
- ... (and 20 more lines)

The result?

Claude Code often got "overwhelmed": it would start running integration tests for a typo fix, take 5-10 minutes on simple tasks, and burn nearly twice the tokens compared with having no CLAUDE.md.

After reading this paper and refactoring:

# CLAUDE.md (new version - 200 tokens)

## Testing Quick Commands
- Unit: `make test-unit` (fast, use this first)
- Integration: `make test-integration` (slow, only if cross-service changes)

## Never Touch
- `/legacy/v1/*` - deprecated, will be removed Q2 2025
- `/scripts/production/*` - prod deployment only

## Critical
- API schema changes → update `/docs/openapi.yaml`

Improvements after the refactor:

Tasks complete ~30% faster
Token usage down ~25%
Success rate slightly up (subjective, not measured precisely)

Lesson: context files are a double-edged sword. Used in the right place they help; used wrong they backfire.

Insights for the future of AI orchestration

This study isn't only about context files; it also reveals a bigger insight for AI-driven development:

"More information ≠ Better performance"

This runs counter to human intuition.

For humans:

More context means better understanding
Detailed documentation makes onboarding easier

For AI agents:

Too many instructions → over-constrained behavior
The agent spends tokens reasoning about instructions instead of solving the problem
"Paralysis by analysis"

The future: Context compression & selective loading

A better direction (some tools are experimenting with it):

Dynamic Context Loading:

The agent identifies when it needs context
Load only relevant sections
Context prioritized by task type
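
Here's a minimal sketch of what "load only relevant sections" could look like, assuming the context file is organized under `## ` headings. The `RELEVANT` routing table and the task-type names are hypothetical, and this is not how any particular tool implements it.

```python
def split_sections(markdown: str) -> dict[str, str]:
    """Map each '## ' heading to the text under it."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


# Hypothetical routing table: which headings matter for which task type.
RELEVANT = {
    "bug_fix": ["Testing", "Critical Rules"],
    "api_change": ["Critical Rules"],
}


def context_for(task_type: str, markdown: str) -> str:
    """Return only the sections mapped to this task type ('' if none)."""
    sections = split_sections(markdown)
    wanted = RELEVANT.get(task_type, [])
    parts = [f"## {title}\n{sections[title]}" for title in wanted if title in sections]
    return "\n\n".join(parts)


agents_md = """\
# AGENTS.md

## Testing
- Run: `pytest tests/unit/` for unit tests only

## Critical Rules
- API changes require updating OpenAPI spec in `/docs/api.yaml`
"""

print(context_for("api_change", agents_md))
```

An unknown task type yields an empty string, so the agent falls back to the no-context baseline by default.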

Context Compression:

Summarize context files down to core actionable items
Remove redundant info

Hybrid Approach:

Baseline: no context
Fallback: if the agent gets stuck → load context
Progressive context injection
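
The hybrid idea can be sketched as a simple retry loop: try with no context first, and inject the context file only when the first attempt fails. The `Agent` callable signature below is a hypothetical interface I made up for illustration, and `stub_agent` is a toy stand-in, not a real agent.

```python
from typing import Callable, Optional

# Hypothetical agent interface: takes a task and optional context text,
# returns True when the task's tests pass.
Agent = Callable[[str, Optional[str]], bool]


def solve_with_fallback(agent: Agent, task: str, context: str) -> tuple[bool, bool]:
    """Try without context first; retry with context only on failure.

    Returns (solved, context_was_used).
    """
    if agent(task, None):
        return True, False
    return agent(task, context), True


def stub_agent(task: str, context: Optional[str]) -> bool:
    # Toy stand-in: "hard" tasks succeed only when context is supplied.
    return task != "hard" or context is not None


print(solve_with_fallback(stub_agent, "easy", "## Testing ..."))
print(solve_with_fallback(stub_agent, "hard", "## Testing ..."))
```

The trade-off is that a failed first attempt is paid for twice, so this only makes sense when most tasks succeed without context.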

Summary: What should you do with your AGENTS.md/CLAUDE.md?

Having gone through the whole study, here are concrete recommendations for fellow devs:

Do:

If your project is STANDARD (standard structure, standard testing):
- → Don't create AGENTS.md
- Agents tend to do better without a context file
If the project has "unconventional patterns":
- → Write a short AGENTS.md (<300 words)
- List only what is CRITICAL and NON-STANDARD
Test before committing:
- Try the agent with and without context
- Compare the results
- Keep whichever version does better

Don't:

Don't let an LLM generate AGENTS.md for you
- Results are usually worse than with no context
- If you must, generate it and then human-edit heavily
Don't write generic guidelines
- "Write clean code", "Test thoroughly" → meaningless
- Agents already know best practices
Don't assume "more = better"
- A long context file can be harmful
- Measure the real impact

Final rule of thumb:

"No context is better than bad context."

If you're not sure a context file helps, don't create one. The baseline (no context) is already pretty good.

I hope this post helps everyone better understand context files and how to use AI coding agents more effectively. The paper has plenty of other insights (such as per-agent analysis and task-type breakdowns) that you can read at the links below!

Sources:

Original paper: https://arxiv.org/abs/2602.11988
Paper (HTML version): https://arxiv.org/html/2602.11988v1
LinkedIn discussion: https://www.linkedin.com/posts/maxbarinov_evaluating-agentsmd-are-repository-level-activity-7429460776274284544-gPaD
AI Tools Analysis: https://ai-tools-aggregator-seven.vercel.app/blog/2026-02-17-agents-md-evaluation-study/
Detailed breakdown: https://umesh-malik.com/blog/agents-md-ai-coding-agents-study

FB: https://www.facebook.com/nguyendinhlong1998
