Posted on May 2, 9:25 AM · 22 min read

🚀 The SaaS Template Playbook 📖 - Part 2

MayFest2026

A comprehensive, opinionated, actionable guide for building a professional, reusable SaaS template that you can fork and reskin for any vertical (CRM, project management, analytics, internal tooling, vertical SaaS, etc.).

If you read only one section first, read §3 The 12 Pillars and §5 Multi-Tenancy — those two ideas dictate every other decision in this document.

Companion to 🤖 The AI SaaS Playbook (Practical Edition)📘 (how to add AI),🏗️ Building High-Quality AI Agents 🤖 — A Comprehensive, Actionable Field Guide 📚 (agentic systems), 🏛️ The System Design Playbook 📖 (the design vocabulary), 🛠️ The Senior Software Engineer Playbook 📖: From Good Coder to High-Impact Engineer 🚀 (software engineering), and 🦸 The Solo-Founder Playbook: Zero Hero 🚀 (operating alone).

📋 Table of Contents

🧐 What "SaaS Template" Actually Means
⚡ The 30-Second Mental Model
🏛️ The 12 Pillars of a Production SaaS
🏗️ Reference Architecture
🏢 Multi-Tenancy — the Keystone Decision
🔐 Authentication & Authorization
👥 Accounts, Organizations, Workspaces, Teams
🚪 Onboarding & Activation
💳 Billing, Subscriptions & Metering
🗄️ Database Design Patterns
🌐 API Design
⚙️ Background Jobs, Queues & Schedulers
📡 Real-time & Eventing
📨 Email, Notifications & Inbox
📦 File Storage, Uploads & CDN
🔎 Search (Full-Text + Semantic)
🚩 Feature Flags & Experiments
📊 Audit Logs, Activity Feeds & Telemetry
🛡️ Security, Compliance & Privacy
⚡ Performance, Caching & Scaling
📈 Observability — Logs, Metrics, Traces, Errors
🎨 Frontend Architecture
🌍 Internationalization & Accessibility
🔧 Admin & Internal Tooling
📝 Marketing Site, Docs & SEO
🚢 CI/CD, Environments & Release Strategy
🧰 Developer Experience (DX)
🧪 Testing Strategy
💰 Pricing, Plans & Packaging Strategy
🎯 Product Analytics & Growth
🤝 Customer Support & Success
📦 Reusability — How to Make This a Template
🗺️ The 14-Phase Build Plan
⚠️ Common Pitfalls & Hard-Won Guardrails
📋 Cheat Sheet

Sections 1 to 18 are in Part 1: https://viblo.asia/p/the-saas-template-playbook-part-1-ZjJYWZrOVOE

19. 🛡️ Security, Compliance & Privacy

19.1 The OWASP non-negotiables

Parameterized queries (no string-concatenated SQL ever).
Input validation at every boundary (use Zod / pydantic / typed structs).
Output encoding (React handles this; be careful in raw HTML / PDF generation).
CSRF tokens on cookie-auth state-changing endpoints.
CSP headers (Content-Security-Policy: default-src 'self').
HSTS (Strict-Transport-Security: max-age=63072000; includeSubDomains; preload).
Cookie attributes: Secure; HttpOnly; SameSite=Lax.
File upload type + size + MIME validation.
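As a rough illustration of the header and cookie items above, here is a minimal, framework-agnostic Python sketch. The helper names (`apply_security_headers`, `session_cookie`) are hypothetical; the header values are the ones from the list, and a real CSP will usually need per-app tuning.

```python
from http.cookies import SimpleCookie

# Header values copied from the checklist above; adjust the CSP per app.
SECURITY_HEADERS = {
    "Content-Security-Policy": "default-src 'self'",
    "Strict-Transport-Security": "max-age=63072000; includeSubDomains; preload",
}


def apply_security_headers(headers: dict) -> dict:
    """Return a copy of the response headers with the security headers set."""
    return {**headers, **SECURITY_HEADERS}


def session_cookie(name: str, value: str) -> str:
    """Render a Set-Cookie value with Secure, HttpOnly and SameSite=Lax."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["secure"] = True
    cookie[name]["httponly"] = True
    cookie[name]["samesite"] = "Lax"
    cookie[name]["path"] = "/"
    return cookie[name].OutputString()
```

In a real service this would live in one middleware so that no handler can forget it.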

19.2 Secrets management

Never commit secrets. Pre-commit hook with gitleaks / detect-secrets.
Local: .env (gitignored).
Prod: AWS Secrets Manager / Doppler / Vault / Infisical.
Rotate on personnel changes and on any leak suspicion.

19.3 Data classification

Tag every data field by sensitivity:

Public — workspace name.
Private — email, IP, billing address.
Sensitive — password hash, OAuth tokens, API keys.
Restricted — payment data (PCI), health data (HIPAA), kid data (COPPA) — generally avoid storing if you can.

Sensitive data: encrypt at rest with KMS-managed key. Restricted data: outsource to a compliant provider (Stripe for cards, etc.).
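One way to make the tagging enforceable is a field registry that logging and export code must consult. The sketch below is an assumption about how that could look: the `FIELD_SENSITIVITY` mapping and the `redact` helper are hypothetical names, and unknown fields are treated as Restricted so that a forgotten tag fails closed.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 0
    PRIVATE = 1
    SENSITIVE = 2
    RESTRICTED = 3


# Hypothetical registry; in practice this would sit next to the schema.
FIELD_SENSITIVITY = {
    "workspace_name": Sensitivity.PUBLIC,
    "email": Sensitivity.PRIVATE,
    "ip": Sensitivity.PRIVATE,
    "password_hash": Sensitivity.SENSITIVE,
    "api_key": Sensitivity.SENSITIVE,
}


def redact(record: dict, max_level: Sensitivity) -> dict:
    """Mask every field above max_level; untagged fields count as RESTRICTED."""
    out = {}
    for key, value in record.items():
        level = FIELD_SENSITIVITY.get(key, Sensitivity.RESTRICTED)
        out[key] = value if level <= max_level else "[REDACTED]"
    return out
```

For example, a log sink could call `redact(row, Sensitivity.PRIVATE)` so that password hashes and untagged fields never reach log storage.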

19.4 Compliance by tier

Compliance	Effort	When you need it
GDPR (EU privacy)	Mandatory if you have any EU users	Day one
CCPA (California privacy)	Mostly overlaps with GDPR	Day one for US
SOC 2 Type I → Type II	3–6 months prep + audit	When enterprise prospects ask
HIPAA	Significant; needs BAA with all subprocessors	Healthcare verticals only
ISO 27001	6–12 months	International enterprise
PCI-DSS	High; outsource to Stripe and you're SAQ-A	If you touch card data

For a template: bake in GDPR-ready primitives (data export endpoint, account deletion, consent log, data residency tag). Defer SOC 2 until you have $$$ on the line.

19.5 Key GDPR primitives

Export my data endpoint: zip of every user-owned row in JSON.
Delete my account endpoint: anonymize PII, retain audit logs with user_id = NULL.
Consent log: consent (user_id, type, version, granted_at, ip).
DPA (Data Processing Agreement): signed with every paid customer, downloadable PDF.
Subprocessor list: public page listing every third party that touches customer data.
Data residency: support EU-only deployments by tagging tenants and routing.
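The "delete my account" primitive can be sketched as a single transaction that nulls out PII and detaches audit rows. This example uses an in-memory SQLite database purely for illustration; the table and column names are assumptions, and a Postgres version would follow the same shape.

```python
import sqlite3


def delete_account(conn: sqlite3.Connection, user_id: int) -> None:
    """Anonymize PII and detach audit rows from the user in one transaction."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "UPDATE audit_log SET user_id = NULL WHERE user_id = ?", (user_id,)
        )
        conn.execute(
            "UPDATE users SET email = NULL, name = 'Deleted user', "
            "deleted_at = CURRENT_TIMESTAMP WHERE id = ?",
            (user_id,),
        )


# Hypothetical schema and demo data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT, deleted_at TEXT);
CREATE TABLE audit_log (id INTEGER PRIMARY KEY, user_id INTEGER, action TEXT);
INSERT INTO users (id, email, name) VALUES (1, 'ada@example.com', 'Ada');
INSERT INTO audit_log (user_id, action) VALUES (1, 'issue.created');
""")
delete_account(conn, 1)
```

The audit row survives with its action intact, which is the point: the history stays, the identity goes.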

19.6 Penetration testing & bug bounty

DIY scanning: OWASP ZAP / Burp / Nuclei / Trivy on every release.
Third-party pentest: annually for SOC 2.
Public bug bounty: HackerOne / Intigriti once you have something worth attacking.

20. ⚡ Performance, Caching & Scaling

20.1 Latency budget

A user-facing API request should complete in < 500 ms p95. Set this as a hard budget. Anything over needs optimization or async-ification.

20.2 Cache layers

[CDN]            — public assets, public docs, marketing pages
   ↓
[App-level]      — Redis (hot reads, computed views, rate-limit counters)
   ↓
[DB query cache] — Postgres shared buffers; no client-side query cache
   ↓
[DB read replica]— route read-heavy endpoints (e.g., search) to a replica

20.3 Rules

Cache invalidation > cache duration. Always know how a cached value gets invalidated. Never set a long TTL "just in case."
Tag-based invalidation: key the cache with (workspace_id, kind, version). Bump version on writes.
Don't cache user-specific data with long TTLs. Personalization defeats CDN caching anyway.
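The version-bump rule can be sketched with an in-memory stand-in for Redis. The class and method names below are hypothetical; the idea is only that the version is part of the key, so a write makes old entries unreachable instead of having to find and delete them.

```python
class VersionedCache:
    """In-memory stand-in for Redis showing version-bump invalidation."""

    def __init__(self) -> None:
        self._versions = {}  # (workspace_id, kind) -> current version
        self._values = {}    # (workspace_id, kind, version) -> cached value

    def _key(self, workspace_id: str, kind: str) -> tuple:
        version = self._versions.get((workspace_id, kind), 0)
        return (workspace_id, kind, version)

    def get(self, workspace_id: str, kind: str):
        return self._values.get(self._key(workspace_id, kind))

    def set(self, workspace_id: str, kind: str, value) -> None:
        self._values[self._key(workspace_id, kind)] = value

    def bump(self, workspace_id: str, kind: str) -> None:
        """Call on every write: entries under the old version become unreachable."""
        k = (workspace_id, kind)
        self._versions[k] = self._versions.get(k, 0) + 1
```

With a real Redis, the stale entries would still need a modest TTL so they eventually expire; this sketch never evicts them.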

20.4 N+1 prevention

Use EXPLAIN ANALYZE on hot endpoints.
Use dataloaders in GraphQL.
Prefer joins to per-row lookups.
Add a CI check: run a benchmark, read pg_stat_statements, and fail the build if 5 or more statements exceed your slow-query threshold.
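To make the N+1 shape concrete, the sketch below counts executed statements for a per-row lookup versus a join. It uses SQLite's trace callback as a stand-in for a query counter; the schema is hypothetical, and in Postgres you would read the same signal from pg_stat_statements or your ORM's query log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE issues (id INTEGER PRIMARY KEY, project_id INTEGER, title TEXT);
INSERT INTO projects VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
INSERT INTO issues VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c');
""")

statements = []  # every SQL statement executed from here on is appended
conn.set_trace_callback(statements.append)


def issue_counts_n_plus_one() -> dict:
    """One query for the projects, then one more query per project."""
    out = {}
    for pid, name in conn.execute("SELECT id, name FROM projects").fetchall():
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM issues WHERE project_id = ?", (pid,)
        ).fetchone()
        out[name] = count
    return out


def issue_counts_joined() -> dict:
    """Same result from a single joined, grouped query."""
    rows = conn.execute("""
        SELECT p.name, COUNT(i.id)
        FROM projects p LEFT JOIN issues i ON i.project_id = p.id
        GROUP BY p.id
    """).fetchall()
    return dict(rows)
```

A test can clear `statements`, call a handler, and assert on `len(statements)`, which turns "no N+1 on this endpoint" into a regression check.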

20.5 Scaling Postgres

Order of operations:

Indexes: fix the missing ones first. 90% of Postgres "slow" is "no index."
Connection pooling: PgBouncer in transaction mode. Every Postgres connection is a separate backend process, so thousands of direct connections hurt; PgBouncer multiplexes them onto a small pool.
Read replicas: route read-heavy reports.
Partitioning: by workspace_id or created_at for huge tables (audit log, events).
Vertical scaling: bigger box. You can go surprisingly far.
Sharding: only when you have a reason. Last resort.

20.6 Background work moves the latency

If something can be async, it should be. Email, webhooks, audit log fanout, search indexing, analytics events — all queue-driven. Keep the request path lean.

21. 📈 Observability — Logs, Metrics, Traces, Errors

21.1 The four signals (correlated)

Signal	Tool	Question it answers
Logs	Loki / Datadog / CloudWatch	What happened?
Metrics	Prometheus / Grafana	How much, how fast, how often?
Traces	Jaeger / Tempo / Honeycomb / Datadog APM	Where is time spent?
Errors	Sentry	What broke, and how do I reproduce?

All four should share request_id and tenant_id so you can pivot from one to another.

21.2 Structured logging

Go: slog (stdlib) or zerolog. zerolog is the production default for Go SaaS — zero allocations on the hot path, fluent API, JSON-native, contextual loggers attach to context.Context.

// zerolog — fluent, zero-alloc, context-aware
logger := log.With().
    Str("request_id", reqID).
    Str("workspace_id", wsID.String()).
    Str("user_id", userID.String()).
    Logger()

logger.Info().
    Str("issue_id", issue.ID.String()).
    Int64("duration_ms", elapsed.Milliseconds()).
    Msg("issue.created")

Equivalent with slog:

slog.InfoContext(ctx, "issue.created",
    "request_id", reqID,
    "workspace_id", wsID,
    "user_id", userID,
    "issue_id", issue.ID,
    "duration_ms", elapsed.Milliseconds())

JSON in production, pretty-printed (zerolog's ConsoleWriter, or lmittmann/tint for slog) in dev. Never fmt.Println.

Python: structlog. The right answer for any FastAPI/async service — contextvars-aware, fast (with orjson), composable processors. logging-only is a dead end the moment you need request-scoped context.

import orjson
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,   # request_id, workspace_id flow automatically
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(serializer=orjson.dumps),
    ],
    # orjson.dumps returns bytes, so use a logger that writes bytes
    logger_factory=structlog.BytesLoggerFactory(),
)

log = structlog.get_logger()

# In a middleware:
structlog.contextvars.bind_contextvars(
    request_id=req_id, workspace_id=ws_id, user_id=user_id,
)

# Anywhere downstream — context is automatic:
log.info("embedding.generated", document_id=doc.id, dim=1536, duration_ms=elapsed)

Both languages, same rules: one event per log line, snake_case keys, every log inside a request carries request_id, workspace_id, user_id. No interpolated strings (f"user {id} did X") — that defeats structured search.

21.3 OpenTelemetry-first

Instrument with OTel SDK in every language. Export to whichever vendor — switching is then a config change, not a rewrite.

21.4 The four golden signals (per service)

Latency — p50, p95, p99.
Traffic — requests/sec.
Errors — error rate (5xx + key 4xx).
Saturation — CPU, memory, DB pool, queue depth.

Alert on anomalies, not absolute thresholds: a sudden change in rate usually tells you more than a fixed p99 latency limit.

21.5 SLO + error budget

Define one or two SLOs and stick to them.

SLO: 99.9% of API requests < 500ms over 30-day window
     → error budget = 0.1% of requests (equivalent to ~43 minutes of full outage per 30 days)

If you burn the budget, freeze feature work and fix reliability. This is the engineering culture lever.
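The arithmetic behind an error budget is short enough to write down. This sketch shows both readings of a 99.9% target: as allowed downtime over a window, and as an allowed count of bad requests. The function names are illustrative, not from any SLO library.

```python
def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage a time-based SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def request_budget(slo: float, total_requests: int) -> int:
    """Number of requests allowed to miss the target over the window."""
    return round((1 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the budget left; a negative value means it is burned."""
    allowed = (1 - slo) * total_requests
    return 1 - bad_requests / allowed
```

For a 99.9% SLO, the 30-day downtime budget is 0.001 × 43,200 minutes = 43.2 minutes, and out of 1,000,000 requests about 1,000 may be slow or failed before the freeze rule applies.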

21.6 On-call & runbooks

Every alert has a runbook URL in the alert text.
Runbooks live in the repo (docs/runbooks/<alert>.md), not Confluence.
Post-mortems for every Sev-1 / 2: blameless, in-repo, indexed.

22. 🎨 Frontend Architecture

22.1 Strict state separation

State type	Tool	Rule
Server state	TanStack Query	Everything from the API. Never duplicate into a client store.
Client UI state	Zustand (or React state)	Selection, modals, drafts, presence.
URL state	TanStack Router / Next.js	Filters, tabs, pagination — anything shareable.
Form state	React Hook Form + Zod	Validation co-located with schema.

22.2 Package boundaries

For monorepo:

packages/
  core/       headless logic — stores, hooks, api client, types
              ZERO react-dom, ZERO localStorage (use adapter), ZERO process.env
  ui/         atomic primitives (shadcn-style)
              ZERO @core imports, ZERO business logic
  views/      business components & pages
              ZERO next/*, ZERO routing-library imports (use adapter)
apps/
  web/        Next.js wiring + adapters
  desktop/    Electron wiring + adapters
  mobile/     React Native wiring + adapters

Internal packages export raw .ts / .tsx, no build step. Consumer's bundler compiles. Fast HMR, real go-to-definition.

22.3 Design system

Tailwind for atomic styling. No CSS-in-JS in 2026 — Tailwind v4 is faster and cleaner.
shadcn/ui as base primitives — copy-paste, then own them.
Radix UI under the hood for accessibility.
One token file (design-tokens.ts) for colors, spacing, radii.
One typography scale.
Storybook (or Ladle if you want a faster, lighter alternative) for component dev. One story per component covering default + edge states (loading, error, empty, long-text). Doubles as living documentation for designers and as the surface for visual regression tools (Chromatic, Percy, Playwright snapshots) and axe-core a11y checks in CI.

22.4 Routing

Next.js app router (RSC + streaming) if you want SEO-able marketing + app in one stack.
Vite + TanStack Router if you want an SPA with type-safe routing.
Avoid mixing two routers in one app.

22.5 Forms

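import { z } from "zod"
import { useForm } from "react-hook-form"
import { zodResolver } from "@hookform/resolvers/zod"

const schema = z.object({ title: z.string().min(1).max(120) })
type FormValues = z.infer<typeof schema>

const form = useForm<FormValues>({ resolver: zodResolver(schema) })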

Same Zod schema is reused for API validation server-side. Single source of truth.

22.6 Loading states + suspense

Skeleton screens for any fetch > 200ms.
Optimistic updates for user-triggered actions (TanStack Query mutations).
Error boundaries at route level — never let an error nuke the whole app.

22.7 Critical UX details

Keyboard shortcuts (Cmd-K, Cmd-Enter, /).
Toast system (one provider, toast.success(...)).
Global confirm modal helper.
Date formatting via one utility (formatDate(d, "short")) — never raw toLocaleString.
<Link> everywhere — never raw <a> for internal nav.

23. 🌍 Internationalization & Accessibility

23.1 i18n from day one — even if you ship English-only

Defer language additions; don't defer the plumbing.

Wrap every user-facing string in t("key.name").
Use i18next / next-intl / FormatJS (react-intl).
Keep translations in locales/<lang>.json.
Use ICU MessageFormat for plurals/genders.
Avoid string concatenation — translators need full sentences.

23.2 Locale-aware formatting

Dates: Intl.DateTimeFormat.
Numbers / currency: Intl.NumberFormat.
Pluralization: ICU plural (or Intl.PluralRules); ICU select is for gender and other enumerated variants.
Time zones: store UTC, render local.

23.3 Accessibility (WCAG 2.2 AA)

Every interactive element keyboard-reachable.
Visible focus states (don't set outline: none without a replacement).
ARIA labels on icon-only buttons.
Semantic HTML — <button> not <div onClick>.
Color contrast ≥ 4.5:1 for body text.
Test with axe-core in CI.

24. 🔧 Admin & Internal Tooling

24.1 Build it day one. Do not skip.

You'll be on support-debug duty all year. An admin panel pays for itself in week two.

24.2 What goes in it

Capability	Why
Search any user / workspace	Triage support tickets.
Impersonate user (read-only by default)	"It works on my machine" reproduction.
Suspend / unsuspend workspace	Abuse handling.
Force-verify email	Lost-access support flow.
Refund / credit	Billing support.
Adjust plan / quota	Sales overrides.
Re-send webhook	Customer integration debug.
Replay failed jobs	Ops.
Inspect Stripe customer	Without leaving your tool.
Feature flag override per tenant	Beta access requests.

24.3 Implementation

Same codebase, gated behind is_internal_admin claim.
Separate hostname (admin.yourtool.com) and route group.
Every action audit-logged with actor_user_id (the staff member, not the impersonated user).
IP-allowlist optional; MFA mandatory.
Time-boxed sessions (re-auth every 30 min).
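The two rules that are easiest to get wrong here are "the actor is always the staff member" and "impersonation is read-only by default". A minimal sketch of both, with hypothetical names (`AdminSession`, `AuditLog`, `perform`) and an in-memory list standing in for the audit table:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AdminSession:
    staff_user_id: str
    impersonating: Optional[str] = None  # customer user id, if any
    read_only: bool = True               # impersonation default


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, session: AdminSession, action: str, target: str) -> None:
        self.entries.append({
            "actor_user_id": session.staff_user_id,  # always the staff member
            "on_behalf_of": session.impersonating,
            "action": action,
            "target": target,
            "at": datetime.now(timezone.utc).isoformat(),
        })


def perform(session: AdminSession, log: AuditLog, action: str,
            target: str, mutates: bool) -> None:
    """Block writes during read-only impersonation, then audit the action."""
    if mutates and session.impersonating and session.read_only:
        raise PermissionError("impersonation session is read-only")
    log.record(session, action, target)
```

Keeping `on_behalf_of` as a separate field means an investigation can always answer both "who did this" and "whose view were they in".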

24.4 Don't overthink

You don't need React-Admin or Retool. A plain set of pages with tables and confirm modals is fine. Internal users will accept worse UX than customers.

24.5 BI for the business team

Sales/CS/finance/leadership will ask the same kind of questions every week — "MRR by plan?", "trial-to-paid by signup source?", "top 50 workspaces by API usage?". Without a self-serve tool, every one of those becomes a Slack message to engineering. Stand up a BI dashboard against a read replica (or a warehouse mirror — see §4.2) on day one of having paying customers.

Tool	License	Sweet spot	Watch out for
Apache Superset	Apache 2.0	Default recommendation. Clean license, powerful SQL Lab, rich chart library (incl. geospatial via deck.gl), scales to large orgs. The right pick when your data team is comfortable in SQL.	Steeper UX for non-technical users; more ops overhead than Metabase.
Metabase (Community)	AGPLv3	Easier UX than Superset for non-technical users — point-and-click query builder genuinely works for sales/CS. Setup in 10 minutes.	License gotcha: AGPL is usually fine for internal-only BI but a hard block for embedded analytics in your customer-facing product (need Metabase Enterprise for embedding rights). Many corporate legal policies blanket-ban AGPL — verify with counsel.
Lightdash	MIT	dbt-native — your dbt models are the metrics layer. Best fit if you're already on dbt for transformations.	Smaller community; assumes a dbt workflow.
Evidence.dev	MIT	Code-as-config (Markdown + SQL → static dashboards in git). Versioned reports as a developer-friendly alternative to clicky dashboard tools.	Not interactive ad-hoc exploration — built for publishing recurring reports, not slicing-and-dicing.
Redash (Databricks-owned)	BSD-2-Clause	Lightweight SQL-first dashboarding. Mature, simple, low-touch.	Lower velocity since the Databricks acquisition; community pace has slowed.
Hex / Mode / Hashboard	Managed (commercial)	Polished hosted experiences with notebook-style data exploration; pay-per-seat.	Per-seat pricing scales with the team that uses it most.

Template recommendation:

Default: Apache Superset against a Postgres read replica — Apache 2.0 license keeps your options open, and the SQL Lab covers 90% of business questions.
If your team is mostly non-technical and AGPL is acceptable: Metabase is the better UX. Just confirm with legal first, especially if you might want to embed dashboards in your product later.
If you already run dbt: Lightdash, since "the metric layer is your dbt models" is genuinely a better workflow than maintaining metrics in two places.

Run BI only against a read replica or warehouse mirror, never your primary OLTP database. A finance team running an "everything joined to everything" query can stall your prod app. Same auth gate as the admin panel (§24.3): SSO + MFA, IP-allowlist optional, time-boxed sessions.

25. 📝 Marketing Site, Docs & SEO

25.1 Five separate surfaces, often conflated

Surface	Stack	URL
Marketing site	Next.js (or Astro)	`yourtool.com`
Product docs	Mintlify / Docusaurus / Nextra	`yourtool.com/docs`
API reference	Stoplight / Redoc / Mintlify	`yourtool.com/docs/api`
Status page	StatusPage.io / Instatus	`status.yourtool.com`
Changelog	Markdown in repo + RSS	`yourtool.com/changelog`

Don't try to put marketing + app + docs in one Next.js app on day one. Build separately, deploy separately, link liberally.

25.2 SEO basics

Server-render marketing + docs (RSC, static generation).
Per-page <title> and <meta description>.
Open Graph + Twitter card tags + share image generator.
sitemap.xml + robots.txt.
JSON-LD schema for product/company.
Page speed: Lighthouse score ≥ 95 on every marketing page.

25.3 Conversion essentials

Clear pricing page with comparison table + FAQ.
Public roadmap (or at least a changelog).
Customer logos / case studies (after you have any).
Contact + sales form that goes to a real human in < 24h.

26. 🚢 CI/CD, Environments & Release Strategy

26.1 Environment ladder

dev (laptop)  →  ephemeral preview (per-PR)  →  staging  →  production

Preview environments per PR: each PR gets its own deployed URL with a seeded DB. Vercel / Render / Fly do this natively.
Staging mirrors prod config + tools but with a separate DB. For E2E tests + final smoke.
Production is the only environment paying customers see.

26.2 CI pipeline (keep < 10 min)

1. Install deps (cache aggressively)
2. Lint  (parallel)
3. Typecheck  (parallel)
4. Unit tests  (parallel)
5. Build artifacts
6. Integration tests (real Postgres + Redis as services)
7. E2E tests (Playwright against built artifacts) — only on main + tags
8. Deploy preview (PR) / staging (main) / prod (tag)

Fail fast: lint + typecheck before tests. Cache node_modules and ~/go/pkg/mod.

26.3 Database migrations on deploy

Migrations run automatically on deploy, before app code.
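Always backwards-compatible: because migrations run first, the old app version N must keep working against the DB at version N+1 (briefly, during rollout).
For destructive migrations (drop column), use a 2-deploy dance: ship code that no longer reads or writes the column → deploy → drop the column in the next release.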

26.4 Release strategy

Blue-green or rolling deploys. Never stop-the-world.
Canary for risky changes: 1% → 10% → 50% → 100% with metrics gates.
Feature flags decouple deploy from release. Deploy whenever; release when ready.
Tag-driven releases for the CLI / desktop apps via GoReleaser / electron-builder.
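A metrics gate for the canary ladder can be as small as one function. In this sketch the stage percentages come from the list above; the thresholds (1% error rate, 500 ms p95) and the convention that 0 means "roll back" are assumptions for illustration.

```python
STAGES = [1, 10, 50, 100]  # percent of traffic at each canary step


def next_stage(current: int, error_rate: float, p95_ms: float,
               max_error_rate: float = 0.01, max_p95_ms: float = 500.0) -> int:
    """Return the next traffic percentage, or 0 to signal a rollback."""
    if error_rate > max_error_rate or p95_ms > max_p95_ms:
        return 0
    idx = STAGES.index(current)
    # Stay at 100 once fully rolled out.
    return STAGES[min(idx + 1, len(STAGES) - 1)]
```

A deploy script would call this after each soak period with metrics scoped to the canary instances only.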

26.5 Rollback

Every release is a single immutable artifact (container image with sha256 tag).
make rollback reverts to the previous artifact in < 60 seconds.
DB migrations are forward-only; rollback means not running the new migration yet, not undoing it.

26.6 Where to host (and when to switch)

Stage	Host	Why
Local dev	Docker Compose	Single command, identical to prod shape.
First production deploy	Fly.io / Railway / Render	Push-to-deploy, managed Postgres, zero ops. Cost: $20–$100/mo until you have traction.
Profitability stage	Hetzner (Cloud or dedicated) + Caddy front door	Best price-to-performance in the industry. A €20/mo CCX dedicated-vCPU box runs the API + workers comfortably for thousands of paying customers. Pair with managed Postgres elsewhere or run it yourself with daily off-site backups.
Polished IaaS	DigitalOcean (Droplets + Managed PG/Redis + Spaces + App Platform)	Better dashboard than Hetzner, managed databases included, predictable billing. ~2× the cost of Hetzner for similar specs but you get the managed pieces.
Enterprise / compliance	AWS / GCP / Azure	Region breadth, BAAs, customer procurement requirements.

Reverse proxy on VM-style hosts (Hetzner, DO Droplets, bare metal):

Caddy — single binary, automatic HTTPS via Let's Encrypt/ZeroSSL, config in a Caddyfile. The right default for "I have one or two boxes."
```
app.yourtool.com {
    reverse_proxy api-1:8080 api-2:8080 {
        health_uri /healthz
    }
    encode gzip zstd
    log
}
```

Traefik — pulls config from Docker labels, K8s ingress objects, or a key-value store. The right default when you have a containerized fleet that scales horizontally and you want zero manual proxy config.

# docker-compose.yml
services:
  api:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.api.rule=Host(`app.yourtool.com`)"
      - "traefik.http.routers.api.tls.certresolver=letsencrypt"

Don't run nginx unless you have a specific reason — Caddy and Traefik handle TLS, HTTP/3, and modern defaults without the config gymnastics.

26.7 The bootstrapped reference deployment

A surprising number of profitable SaaS run on:

[Cloudflare] (CDN, WAF, DNS, Turnstile, R2 for files)
     │
     ▼
[Hetzner CCX dedicated-vCPU box, €20–€60/mo]
     │
     ├── Caddy (TLS, reverse proxy)
     ├── Go API (Gin + GORM + zerolog)
     ├── Worker (Asynq or NATS JetStream consumer)
     ├── NATS JetStream (single node, file-backed)
     ├── Postgres 16 (with WAL-G off-site backups to R2)
     └── Casdoor (auth, separate container)

Total infra cost: €30–€80/month all-in. Capable of serving thousands of paying customers before you need a second box. Move to DigitalOcean managed Postgres the day you stop wanting to be the on-call DBA.

27. 🧰 Developer Experience (DX)

27.1 The "one command to dev" rule

make dev

Should:

Boot Postgres + Redis (Docker Compose).
Run migrations.
Seed data.
Start API + workers + frontend with hot reload.
Print URLs for app, docs, mailcatcher, DB UI.

If a new engineer can't git clone && make dev and reach the running app in 10 minutes, fix the gap.

27.2 Seed data

Realistic, idempotent, reproducible:

5 workspaces with different plans.
20 users, with at least one in each role.
100 representative resources (issues / projects / etc.).
1 demo workspace anyone can browse.
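"Idempotent and reproducible" usually comes down to two habits: fixed primary keys with insert-or-ignore, and a seeded random generator. The sketch below shows both against SQLite for brevity; the schema, plan names, and role assignment are assumptions, and it covers only workspaces and users.

```python
import random
import sqlite3

PLANS = ["free", "starter", "team", "business", "enterprise"]  # hypothetical
ROLES = ["owner", "admin", "member"]


def seed(conn: sqlite3.Connection) -> None:
    """Insert 5 workspaces and 20 users; safe to run any number of times."""
    rng = random.Random(42)  # fixed seed: same data on every machine
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS workspaces "
                     "(id INTEGER PRIMARY KEY, name TEXT, plan TEXT)")
        conn.execute("CREATE TABLE IF NOT EXISTS users "
                     "(id INTEGER PRIMARY KEY, email TEXT, workspace_id INTEGER, role TEXT)")
        for i, plan in enumerate(PLANS, start=1):
            conn.execute("INSERT OR IGNORE INTO workspaces VALUES (?, ?, ?)",
                         (i, f"Workspace {i}", plan))
        for i in range(1, 21):
            # The first three users guarantee at least one user per role.
            role = ROLES[i - 1] if i <= len(ROLES) else rng.choice(ROLES)
            conn.execute("INSERT OR IGNORE INTO users VALUES (?, ?, ?, ?)",
                         (i, f"user{i}@example.com", rng.randint(1, len(PLANS)), role))
```

In Postgres the equivalent of `INSERT OR IGNORE` is `INSERT ... ON CONFLICT DO NOTHING`.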

27.3 Mail in dev

Run MailHog / Mailpit in Compose. All transactional emails route there. Open the UI to read them.

27.4 DB UI in dev

Embed pgweb / Adminer in Compose at localhost:8081. Saves "where's the user table" Slack messages.

27.5 Repo conventions

Makefile is the entry point for every workflow (make dev, make test, make migrate-up, make seed).
.env.example checked in; .env gitignored.
CONTRIBUTING.md with the 5 commands a new dev needs.
docs/decisions/ for ADRs (Architecture Decision Records).

27.6 Codegen, not boilerplate

API clients generated from OpenAPI.
DB types generated by sqlc / Prisma.
Translation keys type-checked.
Routes type-safe (TanStack Router / Next).
If you find yourself writing the same thing in three places, generate it.

27.7 Pick one Go stack and standardize on it

Two viable shapes. Don't mix them within one service.

Shape	Stack	When to pick
Lean / SQL-first	`chi` (router) + `sqlc` (codegen) + `pgx` (driver) + `slog` or `zerolog`	You want explicit SQL, zero ORM magic, maximum performance. Code reads like a database textbook.
Batteries-included	`Gin` (router + middleware ecosystem) + `GORM` (ORM, migrations, hooks) + `zerolog`	You want to ship features faster and trade some control for ergonomics. Most Go SaaS teams pick this.

For the template, default to Gin + GORM + zerolog unless your team has a strong preference. It's the path with the most tutorials, middleware, and Stack Overflow answers — which matters when onboarding new engineers.

// Gin + GORM + zerolog skeleton
r := gin.New()
r.Use(
    requestid.New(),
    ginzerolog.Logger("api"),     // structured access logs
    gin.Recovery(),
    middleware.Auth(authProvider), // verifies session/JWT, sets actor in ctx
    middleware.Tenant(),           // resolves workspace_id, sets app.workspace_id GUC
)

r.POST("/api/v1/projects", handlers.CreateProject(db))

// db is *gorm.DB with logger plugged into zerolog

GORM gotchas to know up front: callbacks fire on every save (use them for audit-log fan-out, not business logic), Preload is N+1's disguise (prefer explicit joins for hot paths), and AutoMigrate is fine for dev but never run it in prod — use goose, golang-migrate, or Atlas for versioned production migrations.

28. 🧪 Testing Strategy

28.1 The pyramid

       /\      E2E (Playwright)         5–10%   slow, valuable
      /  \
     /----\    Integration (real DB)    20–30%  most leverage
    /------\
   /--------\  Unit                     60–70%  fast feedback

28.2 Rules

Unit tests are co-located with source: foo.go + foo_test.go, Button.tsx + Button.test.tsx.
Integration tests spin up a real Postgres + Redis (testcontainers, or services in CI).
E2E tests run against the full Compose stack on tagged releases + main.
Fast tests in pre-commit / on file save. Full suite in CI.

28.3 Critical user-facing flows to E2E

Sign up → verify email → create workspace → first activation event.
Invite teammate → teammate accepts → both see the same data.
Upgrade plan → feature unlocks immediately.
Cancel plan → downgrade scheduled at period end.
Forgotten password → reset → log back in.

If any of these break, the whole product is broken. E2E them.

28.4 Snapshot tests

Useful for emails (rendered HTML) and API responses (response schema).
Avoid for UI — too much false-positive noise. Visual regression tools (Chromatic / Percy) are better.

28.5 Property-based tests

For pure logic (validation, pricing math, date calculations) — fast-check (TS) / hypothesis (Python) / gopter (Go) catch the cases you didn't think of.
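To show the idea without any dependency, here is a stdlib-only approximation of a property-based test for a piece of pricing math. It is not hypothesis: a real library adds smarter input generation and shrinking of failing cases. The `split_cents` function is a hypothetical example of the kind of logic worth testing this way.

```python
import random


def split_cents(total_cents: int, parts: int) -> list:
    """Split an amount into `parts` shares that differ by at most one cent."""
    base, remainder = divmod(total_cents, parts)
    return [base + 1 if i < remainder else base for i in range(parts)]


def check_split_properties(trials: int = 1000, seed: int = 0) -> None:
    """Assert invariants over many random inputs instead of a few examples."""
    rng = random.Random(seed)
    for _ in range(trials):
        total = rng.randint(0, 10_000_000)
        parts = rng.randint(1, 500)
        shares = split_cents(total, parts)
        assert len(shares) == parts, (total, parts)
        assert sum(shares) == total, (total, parts)          # no cent lost
        assert max(shares) - min(shares) <= 1, (total, parts)  # fair split


check_split_properties()
```

The properties ("nothing is lost", "shares are fair") are what you would hand to hypothesis or fast-check unchanged; only the input generation differs.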

28.6 Don't skip coverage; don't worship it

Aim for ~70% line coverage on logic-heavy packages. Below that = gaps. Above 90% = you're testing trivial getters.

29. 💰 Pricing, Plans & Packaging Strategy

29.1 The three SaaS pricing axes

Per-seat — works for collaboration (Slack, Linear, Figma). Predictable, scales with customer.
Usage-based — works for backend infra & AI (Stripe, OpenAI, Vercel). Aligns with value, but harder to budget.
Per-feature tier — works for breadth (HubSpot, Zendesk). Lets enterprise sales upsell.

Most SaaS combine all three: per-seat × tier + usage-based add-ons.
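As a worked example of the combined model, the sketch below computes a monthly invoice from a per-seat tier price plus a usage add-on. All prices, included units, and the overage rate are made-up numbers; amounts are integer cents to avoid floating-point rounding.

```python
# Hypothetical price book, in integer cents.
SEAT_PRICE_CENTS = {"starter": 1900, "team": 3900}
INCLUDED_UNITS = {"starter": 1_000, "team": 10_000}
OVERAGE_CENTS_PER_UNIT = 2


def monthly_invoice_cents(tier: str, seats: int, usage_units: int) -> int:
    """Per-seat tier price plus usage beyond the tier's included units."""
    seat_total = SEAT_PRICE_CENTS[tier] * seats
    overage_units = max(0, usage_units - INCLUDED_UNITS[tier])
    return seat_total + overage_units * OVERAGE_CENTS_PER_UNIT
```

With these numbers, 5 starter seats and 1,500 units come to 5 × 1,900 + 500 × 2 = 10,500 cents, that is $105.00.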

29.2 Recommended starting tiers

Free / Hobby     — 1 user, X resources, limited features    → top of funnel
Starter / Pro    — N users, full features, $/seat/month     → SMB / individual paid
Team / Business  — unlimited users, advanced features       → mid-market
Enterprise       — SSO, audit export, custom DPA, support   → contact sales

Don't ship 6 tiers on day one. Ship 3.

29.3 What goes behind the paywall

Free: the core value prop, scoped (e.g., "10 issues, 1 user").
Pro/Team: depth (advanced fields, automations, API).
Enterprise: trust (SSO, SCIM, audit log export, custom contract, SLA, support).

29.4 Annual discount

Standard: ~20% off vs monthly. Locks in cash flow + reduces churn.

29.5 Free trial vs freemium — pick one

Trial (14 days, full features) — high commercial pressure, faster decision.
Freemium (free forever, limited) — top-of-funnel volume, harder conversion.

For a vertical/B2B SaaS template: default to trial. For PLG products targeting individuals: freemium.

29.6 Discounting & overrides

Coupons in Stripe with promotion codes for marketing.
Sales-set discounts via admin panel (audit-logged).
Annual prepay discounts are modeled as a separate yearly price in Stripe, so billing applies them without manual work.

30. 🎯 Product Analytics & Growth

30.1 Two analytics stacks

Stack	Tool	Purpose
Product	PostHog / Mixpanel / Amplitude	"Did the user activate? Convert? Churn?"
Engineering	OpenTelemetry → Grafana	"Is the system healthy?"

PostHog is the recommended default — it bundles analytics, session replay, feature flags, and A/B tests in one tool.

30.2 The events you must track

From day one:

signed_up (workspace_id, user_id, source)
activated (workspace_id) — your activation event
<core_action>_created — whatever your "noun" is
invited_member, member_accepted
upgraded_plan, downgraded_plan, cancelled_subscription
viewed_paywall, clicked_upgrade

Every event has workspace_id and user_id. Don't track per-user without per-tenant.
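The "every event carries both IDs" rule is easiest to keep if the tracking helper refuses anything else. A minimal sketch, assuming a hypothetical `build_event` wrapper that sits in front of whichever analytics SDK you use:

```python
from datetime import datetime, timezone

REQUIRED_PROPS = ("workspace_id", "user_id")


def build_event(name: str, **props) -> dict:
    """Build an analytics payload, rejecting events without tenant and user."""
    missing = [key for key in REQUIRED_PROPS if not props.get(key)]
    if missing:
        raise ValueError(f"{name}: missing {', '.join(missing)}")
    return {
        "event": name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "properties": props,
    }
```

Usage would look like `build_event("signed_up", workspace_id="ws_1", user_id="u_1", source="ads")`; the returned dict is then passed to the vendor client.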

30.3 The funnels you must measure

Sign-up → email-verified → workspace-created → activated.
Activation → invite teammate → second user activated.
Free → paywall view → upgrade.
Subscribed → renewal (LTV / churn).

30.4 Cohort retention

Plot retention by signup-week cohort. Healthy SaaS shows a "smile" — short-term decline, long-term flat or up. If your retention curves go to zero, no amount of marketing fixes the product.
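A cohort table is simple to compute from two inputs: signup dates and activity events. The sketch below groups users by the Monday of their signup week and reports, per week offset, the fraction of the cohort that was active. The input shapes and function names are assumptions; a product analytics tool does the same thing at scale.

```python
from collections import defaultdict
from datetime import date


def week_start(d: date) -> date:
    """Monday of the week containing d."""
    return date.fromordinal(d.toordinal() - d.weekday())


def retention_by_cohort(signups: dict, activity: list) -> dict:
    """Return {cohort_week: {weeks_since_signup: fraction_of_cohort_active}}."""
    cohort_users = defaultdict(set)
    for user, signed_up in signups.items():
        cohort_users[week_start(signed_up)].add(user)

    active = defaultdict(set)  # (cohort_week, week_offset) -> active users
    for user, day in activity:
        cohort = week_start(signups[user])
        offset = (week_start(day) - cohort).days // 7
        if offset >= 0:
            active[(cohort, offset)].add(user)

    result = defaultdict(dict)
    for (cohort, offset), users in active.items():
        result[cohort][offset] = len(users) / len(cohort_users[cohort])
    return dict(result)
```

Plotting one line per cohort from this structure gives the retention curves described above.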

30.5 NPS / CSAT

In-app survey (Delighted / built-in PostHog) at 30 days post-signup and quarterly. NPS > 30 is good, > 50 great.
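For reference, NPS is the percentage of promoters (scores 9 to 10) minus the percentage of detractors (scores 0 to 6) on a 0 to 10 scale. A small sketch of that formula:

```python
def nps(scores: list) -> float:
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("no survey responses")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)
```

Ten responses with six promoters and two detractors give 100 × (6 − 2) / 10 = 40.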

31. 🤝 Customer Support & Success

31.1 Day-one support stack

Email: support@yourtool.com → ticketing system (Pylon, Plain, HelpScout, or just Front).
In-app chat: Intercom / Crisp / Pylon. Gate by plan if costly.
Docs: searchable, with embedded video.
Status page: automatic incident updates from your monitors.
Community: Slack / Discord / Discourse — only if you have bandwidth to keep it active.

31.2 Build support hooks into the product

"Get help" button opens chat with current page URL pre-filled.
"Copy debug info" button: workspace_id, user_id, browser, version, request_id of last error.
Per-error pages include request_id + a "contact support" link.

31.3 Customer success vs support

Support reacts: ticket comes in, response goes out.
Customer success is proactive: usage drops, success manager reaches out.

You don't need CS until you have customers worth saving. But instrument the data day one.

32. 📦 Reusability — How to Make This a Template

If the goal is a template you fork per product, the architecture must keep domain-specific code clean.

32.1 The "kernel + product" split

kernel/          — every SaaS has this
  auth, tenancy, billing, notifications, audit, admin, files, search,
  flags, analytics, infra, observability

product/         — your domain
  models, services, handlers, UI, jobs

32.2 Hard rules

kernel/ never imports product/. One-way dependency.
product/ extends kernel through hooks/interfaces, never by editing kernel.
New tenant-scoped tables follow the same conventions: id, workspace_id, created_at, RLS policy.
Domain events publish on the same in-process bus.
Domain UI uses the same design system + permission helpers.

32.3 Configuration over code

Most "per-product" customizations should be config:

# product.config.yaml
brand:
  name: "MyApp"
  primary_color: "#5B5BD6"
features:
  audit_log_export: true
  custom_domains: false
plans:
  - name: starter
    price_cents: 1900
    limits: { members: 5 }

Logo, name, palette, plan structure — all configurable without touching kernel code.

32.4 Domain plug-points

Predefine extension points in the kernel:

Hook	Example use
`OnSignup(user, workspace)`	Auto-create demo project.
`OnActivated(workspace)`	Send welcome email + Slack notification.
`BeforeRequest(ctx)`	Inject tenant-specific data.
`MeterEvent(name, qty)`	Custom usage metering for your domain.
`RenderEmail(template, data)`	Domain-specific transactional emails.

Each is a Go interface or TS function imported from kernel, implemented in product.
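The same kernel-to-product contract can be sketched in Python with a `typing.Protocol`, which plays the role of the Go interface. This is an illustration of the dependency direction only: the class names are hypothetical, and only two of the hooks from the table are shown.

```python
from typing import Optional, Protocol


class ProductHooks(Protocol):
    """Declared in kernel; implemented in product."""
    def on_signup(self, user_id: str, workspace_id: str) -> None: ...
    def on_activated(self, workspace_id: str) -> None: ...


class NoopHooks:
    """Kernel default so a bare template runs without any product code."""
    def on_signup(self, user_id: str, workspace_id: str) -> None:
        pass

    def on_activated(self, workspace_id: str) -> None:
        pass


class Kernel:
    def __init__(self, hooks: Optional[ProductHooks] = None) -> None:
        self.hooks = hooks if hooks is not None else NoopHooks()

    def signup(self, user_id: str, workspace_id: str) -> None:
        # Kernel-owned work (create user, workspace, membership) goes here.
        self.hooks.on_signup(user_id, workspace_id)


# product/ side: extends kernel behavior without editing kernel code.
class IssueTrackerHooks(NoopHooks):
    def __init__(self) -> None:
        self.demo_projects = []

    def on_signup(self, user_id: str, workspace_id: str) -> None:
        self.demo_projects.append((workspace_id, "Demo project"))
```

The kernel only ever sees the protocol, so the one-way dependency rule from §32.2 holds by construction.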

32.5 Reskin checklist (minutes, not days)

[ ] Update product.config.yaml.
[ ] Replace logo, favicon, OG images.
[ ] Update tailwind.config.ts colors.
[ ] Update marketing copy in apps/marketing/content/.
[ ] Configure Stripe products + prices, paste IDs into config.
[ ] Add domain models to product/.
[ ] Wire domain routes / pages.
[ ] Update seed.go with domain-relevant demo data.

32.6 Versioning the template

Treat the template as its own versioned project. When the kernel improves, projects forked from it can pull updates in one of two ways:

Add the template repo as a template-upstream remote, then cherry-pick kernel commits.
Run a custom bin/upgrade-kernel script that copies non-product paths.

33. 🗺️ The 14-Phase Build Plan

Each phase is shippable. Don't skip ahead. Most failures here come from doing phase 7 before phase 3 is solid.

🌱 Phase 1 — Skeleton (2 days)

Monorepo: apps/web, apps/api, packages/{core,ui,views}, infra/.
Docker Compose: Postgres + Redis + Mailpit + pgweb.
make dev brings up the stack with hot reload.
Health endpoints, structured logging, request ID middleware.
One CI job: lint + typecheck + unit tests.

Done when: after git clone && make dev, an empty app loads with no auth.

🔐 Phase 2 — Auth (2 days)

Email + password + magic link.
Email verification.
Google OAuth.
Password reset.
Session via cookie (browser) and JWT (API).
Rate limit on /login.

Done when: new user can sign up, verify, log out, log in, reset password.

🏢 Phase 3 — Tenancy (2 days)

workspace, membership, invite tables.
Workspace creation flow.
Workspace switcher UI.
Subdomain or path-based routing.
RLS policies on every tenant-scoped table.
Permission helper Can(user, action, resource).
Roles: owner, admin, member.

Done when: invited teammates only see the workspaces they belong to. Cross-tenant DB access is blocked at the RLS layer.
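A permission helper along the lines of Can(user, action, resource) can stay very small. The sketch below assumes a role-to-action mapping that this playbook does not specify, so treat the action names and which role gets what as placeholders.

```typescript
type Role = "owner" | "admin" | "member";
type WorkspaceAction =
  | "read"
  | "write"
  | "invite_member"
  | "manage_billing"
  | "delete_workspace";

interface Actor {
  userId: string;
  memberships: Map<string, Role>; // workspaceId -> role in that workspace
}

interface TenantResource {
  workspaceId: string;
}

// Assumed mapping: each role inherits everything the role below it can do.
const MEMBER_ACTIONS: WorkspaceAction[] = ["read", "write"];
const ADMIN_ACTIONS: WorkspaceAction[] = [...MEMBER_ACTIONS, "invite_member"];
const OWNER_ACTIONS: WorkspaceAction[] = [...ADMIN_ACTIONS, "manage_billing", "delete_workspace"];

const ROLE_ACTIONS: Record<Role, ReadonlySet<WorkspaceAction>> = {
  member: new Set(MEMBER_ACTIONS),
  admin: new Set(ADMIN_ACTIONS),
  owner: new Set(OWNER_ACTIONS),
};

// Single entry point for authorization checks.
function can(actor: Actor, action: WorkspaceAction, resource: TenantResource): boolean {
  const role = actor.memberships.get(resource.workspaceId);
  if (role === undefined) return false; // not a member: deny regardless of action
  return ROLE_ACTIONS[role].has(action);
}
```

This check complements RLS rather than replacing it: the helper decides what a member may do, while RLS guarantees a non-member's query returns nothing even if a handler forgets to call it.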

📨 Phase 4 — Notifications & Email (1 day)

Resend / Postmark integration.
React Email templates: verify, reset, invite, billing failure.
In-app inbox table + WS push.
Notification preferences.

Done when: invite emails arrive in Mailpit (dev) and real inbox (prod), and the in-app bell shows new mentions.

💳 Phase 5 — Billing (3 days)

Stripe integration: Checkout + Customer Portal.
Plans table + subscription table + webhook handler.
Trial logic.
Feature gating helper.
Dunning emails on failed payments.
Admin override for plan/quota.

Done when: users can pick a plan, pay, see their plan, upgrade, downgrade, and a failed payment triggers correct UX.
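The feature gating helper is worth keeping as one small, pure module so every handler asks the same question. A sketch, assuming plans are already loaded into memory; `PlanDefinition`, `checkFeature`, and `checkLimit` are hypothetical names, and treating an unconfigured limit as "unlimited" is an assumption you may want to invert.

```typescript
interface PlanDefinition {
  name: string;
  features: ReadonlySet<string>;
  limits: ReadonlyMap<string, number>;
}

type GateResult = { allowed: true } | { allowed: false; reason: string };

// Is this boolean feature part of the plan?
function checkFeature(plan: PlanDefinition, feature: string): GateResult {
  return plan.features.has(feature)
    ? { allowed: true }
    : { allowed: false, reason: `"${feature}" is not included in the ${plan.name} plan` };
}

// `currentUsage` is the count before the action being attempted.
function checkLimit(plan: PlanDefinition, limit: string, currentUsage: number): GateResult {
  const max = plan.limits.get(limit);
  if (max === undefined) return { allowed: true }; // assumption: no entry means unlimited
  return currentUsage < max
    ? { allowed: true }
    : { allowed: false, reason: `${plan.name} plan allows at most ${max} ${limit}` };
}

const starterPlan: PlanDefinition = {
  name: "starter",
  features: new Set<string>(["audit_log_export"]),
  limits: new Map<string, number>([["members", 5]]),
};
```

Returning a reason string alongside the boolean lets the UI show an upgrade prompt that names the exact limit that was hit.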

⚙️ Phase 6 — Background Jobs & Cron (1 day)

Job queue (Asynq / River / BullMQ).
Worker process running in Compose.
Job examples: send email, sync to Stripe, expire trial.
Cron scheduler with leader election or a Postgres-backed lock, so each schedule fires on exactly one replica.
Outbox pattern for transactional events.

Done when: a 10-second job runs in the worker, the API stays fast, and a daily cron fires once across N replicas.

📦 Phase 7 — Files (1 day)

S3 / R2 bucket per environment.
Signed-URL upload endpoint.
Confirm endpoint storing metadata.
Avatar upload as the canonical example.
CDN with signed cookies for private files.

Done when: a user can upload an avatar and see it served via CDN, without the bytes touching the API.

🔎 Phase 8 — Search & Search-Adjacent (1 day)

Postgres FTS index on the main domain entity.
Generic searchable interface.
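Hybrid ranking (ts_rank + pg_trgm trigram similarity). Built-in Postgres FTS does not implement BM25; true BM25 needs an extension or an external engine.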
(Optional) pgvector + embedding worker.

Done when: typing in the search bar returns relevant results in < 200ms.

📡 Phase 9 — Real-time (1 day)

WebSocket endpoint with auth + origin check.
In-process hub + (optional) Redis pub/sub for multi-node.
Client subscribes; the server pushes a WS event and the client invalidates the matching Query cache entries.
Presence (online/offline indicators).

Done when: two browser windows show the same data update simultaneously.
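The "WS invalidates, never writes" rule can be captured in one small handler. This sketch assumes a hypothetical wire format (`ChangeMessage`) and uses a minimal structural stand-in for the query client. TanStack Query's QueryClient does expose an invalidateQueries method that takes a filter with a queryKey, but only that one call is assumed here.

```typescript
// Minimal stand-in for a query client; only the invalidation call is modeled.
interface QueryInvalidator {
  invalidateQueries(filters: { queryKey: readonly unknown[] }): unknown;
}

// Hypothetical server-to-client message: "something about this entity changed".
interface ChangeMessage {
  type: "changed";
  entity: string; // e.g. "project"
  id?: string;    // omitted when a whole list changed
}

// Parses and validates a raw socket payload; returns null for anything unexpected.
function parseChangeMessage(raw: string): ChangeMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { type: kind, entity, id } = data as { type?: unknown; entity?: unknown; id?: unknown };
  if (kind !== "changed" || typeof entity !== "string") return null;
  if (id === undefined) return { type: "changed", entity };
  if (typeof id !== "string") return null;
  return { type: "changed", entity, id };
}

// Returns true when the message was understood and an invalidation was issued.
function handleSocketMessage(raw: string, client: QueryInvalidator): boolean {
  const msg = parseChangeMessage(raw);
  if (msg === null) return false;
  // Invalidate only: the data itself is refetched through the normal query path.
  const queryKey = msg.id === undefined ? [msg.entity] : [msg.entity, msg.id];
  client.invalidateQueries({ queryKey });
  return true;
}
```

Since the message carries no payload, a malformed or malicious event can at worst cause an extra refetch, never a corrupted client store.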

📊 Phase 10 — Audit, Activity, Telemetry (1 day)

audit_log table with privileged-action logging.
activity table for user-facing feeds.
PostHog (or equivalent) wired with the canonical events.
Workspace activation event + retention dashboard.

Done when: every privileged action is in the audit log and every signup is tracked in PostHog.

🚩 Phase 11 — Feature Flags & Admin Panel (2 days)

Self-hosted PostHog or DIY flag table.
Per-env / per-workspace / per-user flag resolution.
Admin panel: user search, workspace search, impersonate (read-only), suspend, override flags.
Admin actions audit-logged with staff actor.

Done when: support can resolve an "I can't see X" ticket in < 5 minutes via admin tools.
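For the DIY flag table route, resolution can be a pure function over one flag's rules. The precedence used here (user override, then workspace, then environment, then default) is an assumption and not something this playbook prescribes; `FlagRule` and `resolveFlag` are hypothetical names.

```typescript
interface FlagRule {
  defaultValue: boolean;
  byEnv?: Record<string, boolean>;
  byWorkspace?: Record<string, boolean>;
  byUser?: Record<string, boolean>;
}

interface FlagContext {
  env: string;
  workspaceId?: string;
  userId?: string;
}

// Returns the override for `key`, or undefined when no override exists.
function lookupOverride(
  table: Record<string, boolean> | undefined,
  key: string | undefined,
): boolean | undefined {
  if (table === undefined || key === undefined) return undefined;
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// Most specific scope wins: user > workspace > env > default (assumed order).
function resolveFlag(rule: FlagRule, ctx: FlagContext): boolean {
  return (
    lookupOverride(rule.byUser, ctx.userId) ??
    lookupOverride(rule.byWorkspace, ctx.workspaceId) ??
    lookupOverride(rule.byEnv, ctx.env) ??
    rule.defaultValue
  );
}

const newEditorFlag: FlagRule = {
  defaultValue: false,
  byEnv: { staging: true },
  byWorkspace: { w1: false },
  byUser: { u1: true },
};
```

Using `??` rather than `||` matters: an explicit `false` override at a more specific scope must win over a `true` at a broader one.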

🛡️ Phase 12 — Security & Compliance Foundation (1 day)

CSP, HSTS, secure cookies, CSRF.
gitleaks pre-commit + CI.
GDPR primitives: data export endpoint, account deletion endpoint, consent log.
DPA template + subprocessor list page.
Automated DAST baseline scan via OWASP ZAP in CI (a complement to a manual pen test, not a replacement).

Done when: a security review can pass the OWASP Top 10 checklist without changes.

📈 Phase 13 — Observability (1 day)

OpenTelemetry SDK in API + workers.
Logs, metrics, traces all tagged with request_id + tenant_id.
Sentry for errors.
Basic Grafana dashboard with golden signals.
Status page (Instatus or self-hosted).
One SLO defined + alerted.

Done when: clicking an error in Sentry takes you to the trace, which links to the logs, which contain the request.

📦 Phase 14 — Package, Document, Reskin (2 days)

kernel/ ↔ product/ separation.
product.config.yaml and reskin guide.
Marketing landing page template.
Docs site template (Mintlify / Nextra).
README + CONTRIBUTING + ADRs.
One full reskin pass to verify the template works.

Done when: a new engineer can fork, run bin/reskin --name AcmeApp --color "#FF5C5C", and have a custom-branded skeleton in 30 minutes.

Total: ~21 working days for a single experienced engineer to build an MVP-quality SaaS template. ~6–8 weeks calendar with reviews, polish, and docs.

34. ⚠️ Common Pitfalls & Hard-Won Guardrails

Pitfall	Guardrail
Forgetting `WHERE workspace_id = ?` somewhere	RLS policies on every tenant table; CI grep for missing filters.
Stripe webhook handler is non-idempotent	Use `event.id` as a dedup key in Redis with 7-day TTL.
Long-running job blocks request path	Move to a queue; keep slow third-party calls out of the request path.
Admin actions not audit-logged	Wrap every admin handler in middleware that writes to audit log.
Email enumeration on signup/login	Same response and timing for "exists" vs "not exists".
Migration breaks rolling deploy	Two-phase migrations; never drop+rename in one shot.
WS message updates client store directly	Rule: WS invalidates Query cache only, never writes to stores.
Cookie auth without CSRF	`SameSite=Lax` + CSRF token on state-changing endpoints.
Secrets committed to git	`gitleaks` pre-commit + CI fail.
Free tier abuse (signup farming)	Rate limit signups per IP + email-domain block list + Cloudflare Turnstile.
Plan change inconsistencies (paid down to free with paid resources still active)	Plan change handler: enforce limits, archive overflow, email user.
Trial expires while user has 50 issues	Read-only mode + upgrade banner; do not delete data.
Hot N+1 query in detail page	Query-count assertions in integration tests; `EXPLAIN ANALYZE` on the slowest queries for top endpoints.
Cache that never invalidates	Tag-based invalidation; never set TTL > 1 hour without invalidation hook.
Tenant data exposed via search index	Search index keys include `workspace_id` and the search query filters by it.
Misconfigured CORS opens API to malicious origins	Allowlist origins explicitly; reject `*` with credentials.
User can delete their own audit log entries	Audit log is append-only; no user-facing endpoint to mutate.
One slow query takes down the API	Statement-level timeouts (`SET LOCAL statement_timeout = '5s'`).
Background worker silently fails forever	Dead-letter queue + alert on DLQ depth.
Subdomain takeover via stale CNAME	Audit DNS regularly; deactivate orphan subdomains.
Test data leaks into prod	Distinct connection strings; loud banner in non-prod environments.
"Forgot password" reveals if email exists	Generic response: "If an account exists, we've sent a reset link."
No consent log → GDPR audit fails	`consent` table with version + timestamp + IP from day one.
Customer asks for a feature already on roadmap	Public roadmap so they can upvote instead of opening a ticket.
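The webhook idempotency guardrail is the one most often gotten subtly wrong, so here is a sketch of the claim-then-apply flow. An in-memory map stands in for the shared store; in production that role would be played by Redis (for example SET with the NX and EX options) or a unique-keyed table. `SeenEventStore` and `handleWebhook` are hypothetical names.

```typescript
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// In-memory stand-in for a shared dedup store with per-key expiry.
class SeenEventStore {
  private expiresAt = new Map<string, number>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns true only for the first caller that claims this id within the TTL.
  claim(eventId: string): boolean {
    const current = this.now();
    const expiry = this.expiresAt.get(eventId);
    if (expiry !== undefined && expiry > current) return false;
    this.expiresAt.set(eventId, current + this.ttlMs);
    return true;
  }

  // Gives the id back so a provider retry can be processed after a failure.
  release(eventId: string): void {
    this.expiresAt.delete(eventId);
  }
}

interface WebhookEvent { id: string; type: string }

function handleWebhook(
  evt: WebhookEvent,
  store: SeenEventStore,
  apply: (e: WebhookEvent) => void,
): "processed" | "duplicate" {
  if (!store.claim(evt.id)) return "duplicate";
  try {
    apply(evt);
  } catch (err) {
    store.release(evt.id); // do not swallow the retry if our side effect failed
    throw err;
  }
  return "processed";
}
```

Releasing the claim on failure is the design choice to notice: without it, a handler that crashes once would silently ignore every retry of that event for the next seven days.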

35. 📋 Cheat Sheet

📖 First files / decisions to lock down

Multi-tenancy model — pool, all queries filter by workspace_id, RLS as defense.
Auth model — cookie session for browser, JWT for mobile/API, API keys for integrations.
Permissions — single Can(actor, action, resource) helper, RBAC roles.
Billing — Stripe Checkout + Customer Portal; metered prices for usage.
Event bus — in-process publisher → outbox → workers.
API shape — REST + JSON, cursor pagination, single error envelope, idempotency keys.
Frontend state — TanStack Query for server state, Zustand for UI, never mix.

⚙️ Config defaults

Setting	Default
Session TTL (cookie)	14 days, sliding
JWT access token TTL	15 min
Refresh token TTL	30 days
API rate limit	100 req/min/IP, 1000 req/min/workspace
File upload max	100 MB
Idempotency cache TTL	24 h
Trial length	14 days
Soft-delete grace period	30 days
Audit log retention	7 years
Activity feed retention	6 months
GDPR data export TTL	7 days from generation
Workspace slug regex	`^[a-z0-9-]{3,40}$`
Password min length	12 chars (or zxcvbn score ≥ 3)
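To make the two-level API rate limit concrete, here is a sketch using a fixed-window counter. The algorithm choice is an assumption (the table only states the limits), and the in-memory map would need to be replaced by a shared store once you run more than one API replica. `FixedWindowLimiter` and `allowRequest` are hypothetical names.

```typescript
// Counts requests per key inside aligned, fixed-size time windows.
class FixedWindowLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Records one request for `key` at `nowMs`; returns false when over the limit.
  allow(key: string, nowMs: number): boolean {
    const windowStart = Math.floor(nowMs / this.windowMs) * this.windowMs;
    const entry = this.counts.get(key);
    if (entry === undefined || entry.windowStart !== windowStart) {
      this.counts.set(key, { windowStart, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

const ONE_MINUTE_MS = 60 * 1000;
const perIpLimiter = new FixedWindowLimiter(100, ONE_MINUTE_MS);         // 100 req/min/IP
const perWorkspaceLimiter = new FixedWindowLimiter(1000, ONE_MINUTE_MS); // 1000 req/min/workspace

// The IP check runs first, so a rejected client does not consume workspace budget.
function allowRequest(ip: string, workspaceId: string, nowMs: number): boolean {
  return (
    perIpLimiter.allow(`ip:${ip}`, nowMs) &&
    perWorkspaceLimiter.allow(`ws:${workspaceId}`, nowMs)
  );
}
```

Fixed windows allow short bursts at window boundaries; a sliding window or token bucket smooths that out at the cost of a little more state.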

🚫 Hard rules (non-negotiable)

Every tenant-scoped query filters by workspace_id.
Every privileged action writes to audit_log.
Every non-transactional email obeys per-user notification preferences (security emails such as password resets always send).
Every webhook handler is idempotent.
Every form input is validated server-side (Zod / pydantic / typed structs).
Every production secret comes from a secrets manager (injected at runtime), never from a committed .env file.
Every public endpoint has a rate limit.
Every payment side effect goes through Stripe webhooks, not the request path.
Every long-running task is in a job queue.
WS events invalidate Query cache; they never write directly to stores.
Migrations are append-only.
Admin actions are audit-logged with the staff member as actor.
Feature flags wrap any risky new behavior.
File uploads bypass the API server (signed S3 URLs).
No SQL is built via string concatenation; every query is parameterized.
New tables follow the convention: id, workspace_id, created_at, updated_at.

📐 The canonical resource shape (REST)

{
  "id": "01HMZQ...",
  "workspace_id": "01HMW1...",
  "name": "Project Alpha",
  "status": "active",
  "created_at": "2026-04-30T10:00:00Z",
  "updated_at": "2026-04-30T10:00:00Z",
  "created_by": { "type": "user", "id": "01HM..." }
}

🎭 The polymorphic-actor pattern

created_by_type TEXT CHECK (created_by_type IN ('user','api_key','system')),
created_by_id   UUID

Use this on every "actor" field. It lets you treat agents, integrations, and humans uniformly without parallel schemas.
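On the application side, the same pattern maps naturally onto a discriminated union. This sketch assumes that 'system' actors carry no id (a NULL created_by_id), which the column definition permits but this playbook does not state; `ActorRef`, `toColumns`, and `fromColumns` are illustrative names.

```typescript
// One type for "who did this", covering humans, integrations, and the system itself.
type ActorRef =
  | { type: "user"; id: string }
  | { type: "api_key"; id: string }
  | { type: "system" };

// Mirrors the two columns from the SQL fragment.
interface ActorColumns {
  created_by_type: "user" | "api_key" | "system";
  created_by_id: string | null;
}

function toColumns(actor: ActorRef): ActorColumns {
  return actor.type === "system"
    ? { created_by_type: "system", created_by_id: null }
    : { created_by_type: actor.type, created_by_id: actor.id };
}

function fromColumns(row: ActorColumns): ActorRef {
  if (row.created_by_type === "system") return { type: "system" };
  if (row.created_by_id === null) {
    throw new Error(`${row.created_by_type} actor requires an id`);
  }
  return { type: row.created_by_type, id: row.created_by_id };
}

// The compiler forces every new actor type to be handled here.
function describeActor(actor: ActorRef): string {
  switch (actor.type) {
    case "user":
      return `user ${actor.id}`;
    case "api_key":
      return `API key ${actor.id}`;
    case "system":
      return "system";
  }
}
```

Adding a new actor kind later (for example an AI agent) then means one new union member and one new CHECK value, with the type checker pointing at every switch that needs updating.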

🔑 Environment variables baseline

APP_ENV=production            # dev | staging | production
APP_URL=https://app.yourtool.com
PUBLIC_URL=https://yourtool.com

DATABASE_URL=postgres://...
REDIS_URL=redis://...

JWT_SECRET=<32-byte-random>
SESSION_SECRET=<32-byte-random>
COOKIE_DOMAIN=.yourtool.com

STRIPE_SECRET_KEY=sk_live_...
STRIPE_WEBHOOK_SECRET=whsec_...
PAYPAL_CLIENT_ID=...                   # optional, secondary payment method
PAYPAL_CLIENT_SECRET=...
PAYPAL_WEBHOOK_ID=...

# Object storage (S3 / Cloudflare R2 / Supabase Storage — pick one)
S3_BUCKET=...
S3_REGION=...
S3_ENDPOINT=...                        # set for R2 / Supabase / MinIO
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...

# Auth (pick the block matching your provider)
# --- Casdoor (self-hosted IAM)
CASDOOR_ENDPOINT=https://auth.yourtool.com
CASDOOR_CLIENT_ID=...
CASDOOR_CLIENT_SECRET=...
CASDOOR_ORG=yourtool
CASDOOR_APP=app
# --- Ory Kratos (self-hosted)
KRATOS_PUBLIC_URL=https://auth.yourtool.com
KRATOS_ADMIN_URL=http://kratos:4434
# --- Supabase Auth
SUPABASE_URL=https://xyz.supabase.co
SUPABASE_ANON_KEY=...
SUPABASE_SERVICE_ROLE_KEY=...
# --- WorkOS / Clerk
WORKOS_API_KEY=...
CLERK_SECRET_KEY=...

# Eventing
NATS_URL=nats://nats:4222              # if using NATS JetStream
NATS_STREAM=app-events

RESEND_API_KEY=...
EMAIL_FROM="YourTool <hi@yourtool.com>"

SENTRY_DSN=...
POSTHOG_KEY=...
POSTHOG_HOST=https://app.posthog.com

OPENAI_API_KEY=...           # optional, if you have AI features
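A baseline like this pays off most when the app refuses to boot with a broken environment. Below is a sketch of a startup check covering only the core variables; the `AppConfig` shape, the `loadConfig` name, and the 32-character minimum for secrets are assumptions for illustration. The function takes the environment as an argument so it can be tested without touching the real process environment.

```typescript
type EnvSource = Record<string, string | undefined>;

interface AppConfig {
  appEnv: "dev" | "staging" | "production";
  appUrl: string;
  databaseUrl: string;
  redisUrl: string;
  jwtSecret: string;
  sessionSecret: string;
}

// Collects every problem before failing, so one boot attempt reports them all.
function loadConfig(env: EnvSource): AppConfig {
  const problems: string[] = [];

  const required = (key: string): string => {
    const value = env[key];
    if (value === undefined || value.trim() === "") {
      problems.push(`${key} is required`);
      return "";
    }
    return value;
  };

  const rawEnv = required("APP_ENV");
  let appEnv: AppConfig["appEnv"] = "dev";
  if (rawEnv === "dev" || rawEnv === "staging" || rawEnv === "production") {
    appEnv = rawEnv;
  } else if (rawEnv !== "") {
    problems.push(`APP_ENV must be dev, staging, or production (got "${rawEnv}")`);
  }

  const config: AppConfig = {
    appEnv,
    appUrl: required("APP_URL"),
    databaseUrl: required("DATABASE_URL"),
    redisUrl: required("REDIS_URL"),
    jwtSecret: required("JWT_SECRET"),
    sessionSecret: required("SESSION_SECRET"),
  };

  // Assumed lower bound: 32 random bytes encode to at least 32 characters.
  if (config.jwtSecret !== "" && config.jwtSecret.length < 32) {
    problems.push("JWT_SECRET looks too short");
  }
  if (config.sessionSecret !== "" && config.sessionSecret.length < 32) {
    problems.push("SESSION_SECRET looks too short");
  }

  if (problems.length > 0) {
    throw new Error("Invalid configuration:\n" + problems.join("\n"));
  }
  return config;
}
```

In the server entry point this would typically be called once with the process environment, before any connection is opened.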

🎯 KPIs to track from day one

Sign-ups / week
Activation rate (signed up → activated)
Free → paid conversion rate
MRR / ARR
Net revenue retention (NRR)
Logo churn
DAU / WAU / MAU
p95 API latency
Error rate
NPS
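Several of these KPIs are simple ratios, and it helps to pin the definitions down in code so dashboards and investor updates agree. The formulas below are the commonly used ones (for NRR: starting MRR plus expansion, minus contraction and churn, divided by starting MRR, all for the same customer cohort). The function names are illustrative.

```typescript
// Avoids NaN / Infinity on empty periods by reporting 0 instead.
function safeRatio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

// Share of signed-up workspaces that reached the activation event.
function activationRate(signedUp: number, activated: number): number {
  return safeRatio(activated, signedUp);
}

// Share of customers at period start that cancelled during the period.
function logoChurn(customersAtStart: number, customersLost: number): number {
  return safeRatio(customersLost, customersAtStart);
}

interface MrrMovement {
  startingMrr: number; // MRR of the cohort at period start
  expansion: number;   // upgrades and added seats from that cohort
  contraction: number; // downgrades from that cohort
  churned: number;     // MRR lost to cancellations from that cohort
}

// Net revenue retention for one cohort over one period; 1.0 means flat.
function netRevenueRetention(m: MrrMovement): number {
  return safeRatio(m.startingMrr + m.expansion - m.contraction - m.churned, m.startingMrr);
}
```

For example, a cohort starting at 10,000 in MRR with 2,000 expansion, 500 contraction, and 500 churned ends at 11,000, an NRR of 110%.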

💭 Closing Thought

A great SaaS template is opinionated about everything that doesn't matter to the customer, and flexible about everything that does.

Auth, billing, tenancy, observability, admin → opinionated, baked-in.
Domain models, UI flows, branding, pricing → flexible, configurable.

The discipline: every time you find yourself solving the same infrastructure problem in a new product, that solution belongs in the template. Every time you find yourself solving a different domain problem, that work belongs in product/.

If you internalize §5 (Multi-Tenancy), §9 (Billing), §19 (Security), and the §32 kernel/product split, the rest of this playbook becomes a detailed checklist you can execute over 6–8 weeks to ship a real, professional, reusable SaaS foundation.

Now go build.

If you found this helpful, let me know by leaving a 👍 or a comment. If you think this post could help someone, feel free to share it. Thank you very much! 😃
