🚀 The SaaS Template Playbook - Part 2 📖
A comprehensive, opinionated, actionable guide for building a professional, reusable SaaS template that you can fork and reskin for any vertical (CRM, project management, analytics, internal tooling, vertical SaaS, etc.).
If you read only one section first, read §3 The 12 Pillars and §5 Multi-Tenancy — those two ideas dictate every other decision in this document.
📋 Table of Contents
- 🧐 What "SaaS Template" Actually Means
- ⚡ The 30-Second Mental Model
- 🏛️ The 12 Pillars of a Production SaaS
- 🏗️ Reference Architecture
- 🏢 Multi-Tenancy — the Keystone Decision
- 🔐 Authentication & Authorization
- 👥 Accounts, Organizations, Workspaces, Teams
- 🚪 Onboarding & Activation
- 💳 Billing, Subscriptions & Metering
- 🗄️ Database Design Patterns
- 🌐 API Design
- ⚙️ Background Jobs, Queues & Schedulers
- 📡 Real-time & Eventing
- 📨 Email, Notifications & Inbox
- 📦 File Storage, Uploads & CDN
- 🔎 Search (Full-Text + Semantic)
- 🚩 Feature Flags & Experiments
- 📊 Audit Logs, Activity Feeds & Telemetry
- 🛡️ Security, Compliance & Privacy
- ⚡ Performance, Caching & Scaling
- 📈 Observability — Logs, Metrics, Traces, Errors
- 🎨 Frontend Architecture
- 🌍 Internationalization & Accessibility
- 🔧 Admin & Internal Tooling
- 📝 Marketing Site, Docs & SEO
- 🚢 CI/CD, Environments & Release Strategy
- 🧰 Developer Experience (DX)
- 🧪 Testing Strategy
- 💰 Pricing, Plans & Packaging Strategy
- 🎯 Product Analytics & Growth
- 🤝 Customer Support & Success
- 📦 Reusability — How to Make This a Template
- 🗺️ The 14-Phase Build Plan
- ⚠️ Common Pitfalls & Hard-Won Guardrails
- 📋 Cheat Sheet
Sections 1–18 are covered in Part 1: https://viblo.asia/p/the-saas-template-playbook-part-1-ZjJYWZrOVOE
19. 🛡️ Security, Compliance & Privacy
19.1 The OWASP non-negotiables
- Parameterized queries (no string-concatenated SQL ever).
- Input validation at every boundary (use Zod / pydantic / typed structs).
- Output encoding (React handles this; be careful in raw HTML / PDF generation).
- CSRF tokens on cookie-auth state-changing endpoints.
- CSP headers (`Content-Security-Policy: default-src 'self'`).
- HSTS (`Strict-Transport-Security: max-age=63072000; includeSubDomains; preload`).
- Cookie attributes: `Secure; HttpOnly; SameSite=Lax`.
- File upload type + size + MIME validation.
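As a rough illustration of the header and cookie items above, here is a minimal sketch that builds them as plain values. The function names and the `Path=/` attribute are assumptions for the example; the header values are the ones quoted in the list.

```python
def security_headers(csp: str = "default-src 'self'") -> dict[str, str]:
    """Response headers from the checklist; values copied from the text."""
    return {
        "Content-Security-Policy": csp,
        "Strict-Transport-Security": "max-age=63072000; includeSubDomains; preload",
    }


def session_cookie(name: str, value: str) -> str:
    """Set-Cookie value with the recommended attributes (Path=/ is an assumption)."""
    return f"{name}={value}; Path=/; Secure; HttpOnly; SameSite=Lax"
```

In a real service these would be attached once, in middleware, so no handler can forget them.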
19.2 Secrets management
- Never commit secrets. Pre-commit hook with `gitleaks` / `detect-secrets`.
- Local: `.env` (gitignored).
- Prod: AWS Secrets Manager / Doppler / Vault / Infisical.
- Rotate on personnel changes and on any leak suspicion.
19.3 Data classification
Tag every data field by sensitivity:
- Public — workspace name.
- Private — email, IP, billing address.
- Sensitive — password hash, OAuth tokens, API keys.
- Restricted — payment data (PCI), health data (HIPAA), kid data (COPPA) — generally avoid storing if you can.
Sensitive data: encrypt at rest with KMS-managed key. Restricted data: outsource to a compliant provider (Stripe for cards, etc.).
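One way to make the tagging concrete is a small registry that maps fields to a sensitivity level and masks anything above a threshold before it reaches logs. This is a sketch only: the field names and the `redact_for_logs` helper are hypothetical.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 0
    PRIVATE = 1
    SENSITIVE = 2
    RESTRICTED = 3


# Hypothetical field registry; the field names are illustrative.
FIELD_SENSITIVITY = {
    "workspace_name": Sensitivity.PUBLIC,
    "email": Sensitivity.PRIVATE,
    "ip": Sensitivity.PRIVATE,
    "password_hash": Sensitivity.SENSITIVE,
    "api_key": Sensitivity.SENSITIVE,
}


def redact_for_logs(record: dict, max_level: Sensitivity = Sensitivity.PUBLIC) -> dict:
    """Mask every field above max_level; untagged fields count as RESTRICTED."""
    out = {}
    for key, value in record.items():
        level = FIELD_SENSITIVITY.get(key, Sensitivity.RESTRICTED)
        out[key] = value if level <= max_level else "[redacted]"
    return out
```

Treating untagged fields as Restricted means a forgotten tag fails closed instead of leaking.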
19.4 Compliance by tier
| Compliance | Effort / scope | When you need it |
|---|---|---|
| GDPR (EU privacy) | Mandatory if you have any EU users | Day one |
| CCPA (California privacy) | Mostly overlaps with GDPR | Day one for US |
| SOC 2 Type I → Type II | 3–6 months prep + audit | When enterprise prospects ask |
| HIPAA | Significant; needs BAA with all subprocessors | Healthcare verticals only |
| ISO 27001 | 6–12 months | International enterprise |
| PCI-DSS | High; outsource to Stripe and you're SAQ-A | If you touch card data |
For a template: bake in GDPR-ready primitives (data export endpoint, account deletion, consent log, data residency tag). Defer SOC 2 until you have $$$ on the line.
19.5 Key GDPR primitives
- Export my data endpoint: zip of every user-owned row in JSON.
- Delete my account endpoint: anonymize PII, retain audit logs with `user_id = NULL`.
- Consent log: `consent (user_id, type, version, granted_at, ip)`.
- DPA (Data Processing Agreement): signed with every paid customer, downloadable PDF.
- Subprocessor list: public page listing every third party that touches customer data.
- Data residency: support EU-only deployments by tagging tenants and routing.
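The export primitive can be sketched with the standard library alone. The example assumes the caller has already collected the user-owned rows per table; `build_export` and the one-file-per-table layout are illustrative choices, not a prescribed format.

```python
import io
import json
import zipfile


def build_export(rows_by_table: dict[str, list[dict]]) -> bytes:
    """Return a zip archive holding one <table>.json file per table."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for table, rows in sorted(rows_by_table.items()):
            # default=str keeps dates and UUIDs serializable
            zf.writestr(f"{table}.json", json.dumps(rows, indent=2, default=str))
    return buf.getvalue()
```

For large accounts you would likely build the archive in a background job and email a download link rather than generate it inside the request.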
19.6 Penetration testing & bug bounty
- DIY scanning: OWASP ZAP / Burp / Nuclei / Trivy on every release.
- Third-party pentest: annually for SOC 2.
- Public bug bounty: HackerOne / Intigriti once you have something worth attacking.
20. ⚡ Performance, Caching & Scaling
20.1 Latency budget
A user-facing API request should complete in < 500 ms p95. Set this as a hard budget. Anything over needs optimization or async-ification.
20.2 Cache layers
[CDN] — public assets, public docs, marketing pages
↓
[App-level] — Redis (hot reads, computed views, rate-limit counters)
↓
[DB query cache] — Postgres shared buffers; no client-side query cache
↓
[DB read replica]— route read-heavy endpoints (e.g., search) to a replica
20.3 Rules
- Cache invalidation > cache duration. Always know how a cached value gets invalidated. Never set a long TTL "just in case."
- Tag-based invalidation: key the cache with `(workspace_id, kind, version)`. Bump version on writes.
- Don't cache user-specific data with long TTLs. Personalization defeats CDN caching anyway.
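The version-bump rule is easier to see in code. This sketch uses an in-memory dict as a stand-in for Redis; the `VersionedCache` class is hypothetical.

```python
class VersionedCache:
    """In-memory stand-in for Redis; keys are (workspace_id, kind, version)."""

    def __init__(self) -> None:
        self._data: dict[tuple[str, str, int], object] = {}
        self._versions: dict[tuple[str, str], int] = {}

    def _key(self, workspace_id: str, kind: str) -> tuple[str, str, int]:
        return (workspace_id, kind, self._versions.get((workspace_id, kind), 0))

    def get(self, workspace_id: str, kind: str):
        return self._data.get(self._key(workspace_id, kind))

    def set(self, workspace_id: str, kind: str, value: object) -> None:
        self._data[self._key(workspace_id, kind)] = value

    def invalidate(self, workspace_id: str, kind: str) -> None:
        """Call on every write: bumping the version orphans the old entry."""
        tag = (workspace_id, kind)
        self._versions[tag] = self._versions.get(tag, 0) + 1
```

Orphaned entries are never read again, so in a real cache a short TTL is enough to reclaim them.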
20.4 N+1 prevention
- Use `EXPLAIN ANALYZE` on hot endpoints.
- Use dataloaders in GraphQL.
- Prefer joins to per-row lookups.
- Add a CI check: log slow queries with `pg_stat_statements` and assert fewer than 5 over a benchmark run.
20.5 Scaling Postgres
Order of operations:
1. Indexes: fix the missing ones first. 90% of Postgres "slow" is "no index."
2. Connection pooling: PgBouncer in transaction mode. Every Postgres connection is its own backend process, so 1000 direct connections hurt; PgBouncer multiplexes them onto a small server-side pool.
3. Read replicas: route read-heavy reports.
4. Partitioning: by `workspace_id` or `created_at` for huge tables (audit log, events).
5. Vertical scaling: a bigger box. You can go surprisingly far.
6. Sharding: only when you have a reason. Last resort.
20.6 Background work moves the latency
If something can be async, it should be. Email, webhooks, audit log fanout, search indexing, analytics events — all queue-driven. Keep the request path lean.
21. 📈 Observability — Logs, Metrics, Traces, Errors
21.1 The four signals (correlated)
| Signal | Tool | Question it answers |
|---|---|---|
| Logs | Loki / Datadog / CloudWatch | What happened? |
| Metrics | Prometheus / Grafana | How much, how fast, how often? |
| Traces | Jaeger / Tempo / Honeycomb / Datadog APM | Where is time spent? |
| Errors | Sentry | What broke, and how do I reproduce? |
All four should share request_id and tenant_id so you can pivot from one to another.
21.2 Structured logging
Go: slog (stdlib) or zerolog. zerolog is the production default for Go SaaS — zero allocations on the hot path, fluent API, JSON-native, contextual loggers attach to context.Context.
// zerolog — fluent, zero-alloc, context-aware
logger := log.With().
Str("request_id", reqID).
Str("workspace_id", wsID.String()).
Str("user_id", userID.String()).
Logger()
logger.Info().
Str("issue_id", issue.ID.String()).
Int64("duration_ms", elapsed.Milliseconds()).
Msg("issue.created")
Equivalent with slog:
slog.InfoContext(ctx, "issue.created",
"request_id", reqID,
"workspace_id", wsID,
"user_id", userID,
"issue_id", issue.ID,
"duration_ms", elapsed.Milliseconds())
JSON in production, pretty-printed (zerolog's ConsoleWriter, or lmittmann/tint for slog) in dev. Never fmt.Println.
Python: structlog. The right answer for any FastAPI/async service — contextvars-aware, fast (with orjson), composable processors. logging-only is a dead end the moment you need request-scoped context.
import orjson
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,  # request_id, workspace_id flow automatically
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(serializer=orjson.dumps),
    ],
    # orjson.dumps returns bytes, so the logger has to write bytes
    logger_factory=structlog.BytesLoggerFactory(),
)
log = structlog.get_logger()

# In a middleware:
structlog.contextvars.bind_contextvars(
    request_id=req_id, workspace_id=ws_id, user_id=user_id,
)

# Anywhere downstream, context is automatic:
log.info("embedding.generated", document_id=doc.id, dim=1536, duration_ms=elapsed)
Both languages, same rules: one event per log line, snake_case keys, every log inside a request carries request_id, workspace_id, user_id. No interpolated strings (f"user {id} did X") — that defeats structured search.
21.3 OpenTelemetry-first
Instrument with OTel SDK in every language. Export to whichever vendor — switching is then a config change, not a rewrite.
21.4 The four golden signals (per service)
- Latency — p50, p95, p99.
- Traffic — requests/sec.
- Errors — error rate (5xx + key 4xx).
- Saturation — CPU, memory, DB pool, queue depth.
Alert on anomalies, not absolute thresholds: a sudden change in rate is usually a better paging signal than a fixed p99 latency cutoff.
21.5 SLO + error budget
Define one or two SLOs and stick to them.
SLO: 99.9% of API requests < 500ms over 30-day window
→ error budget = 0.1% of requests (as wall-clock time, 0.1% of 30 days ≈ 43 minutes)
If you burn the budget, freeze feature work and fix reliability. This is the engineering culture lever.
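The arithmetic behind the budget is short enough to keep next to the SLO definition. A minimal sketch with illustrative function names, showing both readings of a 99.9% target: as a share of time and as a share of requests.

```python
def time_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes if the SLO is read as a fraction of time."""
    return (1 - slo) * window_days * 24 * 60


def request_budget(slo: float, total_requests: int) -> int:
    """Allowed slow or failed requests if the SLO is read per request."""
    return int(round((1 - slo) * total_requests))


def budget_remaining(slo: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the request budget still unspent (negative means overspent)."""
    budget = (1 - slo) * total_requests
    return 1 - bad_requests / budget
```

With 1,000,000 requests in the window, a 99.9% SLO allows 1,000 bad ones; once `budget_remaining` drops to zero, the freeze rule above applies.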
21.6 On-call & runbooks
- Every alert has a runbook URL in the alert text.
- Runbooks live in the repo (`docs/runbooks/<alert>.md`), not Confluence.
- Post-mortems for every Sev-1 / 2: blameless, in-repo, indexed.
22. 🎨 Frontend Architecture
22.1 Strict state separation
| State type | Tool | Rule |
|---|---|---|
| Server state | TanStack Query | Everything from the API. Never duplicate into a client store. |
| Client UI state | Zustand (or React state) | Selection, modals, drafts, presence. |
| URL state | TanStack Router / Next.js | Filters, tabs, pagination — anything shareable. |
| Form state | React Hook Form + Zod | Validation co-located with schema. |
22.2 Package boundaries
For monorepo:
packages/
  core/     headless logic: stores, hooks, api client, types
            ZERO react-dom, ZERO localStorage (use adapter), ZERO process.env
  ui/       atomic primitives (shadcn-style)
            ZERO @core imports, ZERO business logic
  views/    business components & pages
            ZERO next/*, ZERO routing-library imports (use adapter)
apps/
  web/      Next.js wiring + adapters
  desktop/  Electron wiring + adapters
  mobile/   React Native wiring + adapters
Internal packages export raw .ts / .tsx, no build step. Consumer's bundler compiles. Fast HMR, real go-to-definition.
22.3 Design system
- Tailwind for atomic styling. No CSS-in-JS in 2026 — Tailwind v4 is faster and cleaner.
- shadcn/ui as base primitives — copy-paste, then own them.
- Radix UI under the hood for accessibility.
- One token file (`design-tokens.ts`) for colors, spacing, radii.
- One typography scale.
- Storybook (or Ladle if you want a faster, lighter alternative) for component dev. One story per component covering default + edge states (loading, error, empty, long-text). Doubles as living documentation for designers and as the surface for visual regression tools (Chromatic, Percy, Playwright snapshots) and `axe-core` a11y checks in CI.
22.4 Routing
- Next.js app router (RSC + streaming) if you want SEO-able marketing + app in one stack.
- Vite + TanStack Router if you want an SPA with type-safe routing.
- Avoid mixing two routers in one app.
22.5 Forms
const schema = z.object({ title: z.string().min(1).max(120) })
type FormValues = z.infer<typeof schema>
const form = useForm<FormValues>({ resolver: zodResolver(schema) })
Same Zod schema is reused for API validation server-side. Single source of truth.
22.6 Loading states + suspense
- Skeleton screens for any fetch > 200ms.
- Optimistic updates for user-triggered actions (TanStack Query mutations).
- Error boundaries at route level — never let an error nuke the whole app.
22.7 Critical UX details
- Keyboard shortcuts (Cmd-K, Cmd-Enter, /).
- Toast system (one provider, `toast.success(...)`).
- Global confirm modal helper.
- Date formatting via one utility (`formatDate(d, "short")`), never raw `toLocaleString`.
- `<Link>` everywhere, never raw `<a>` for internal nav.
23. 🌍 Internationalization & Accessibility
23.1 i18n from day one — even if you ship English-only
Defer language additions; don't defer the plumbing.
- Wrap every user-facing string in `t("key.name")`.
- Use i18next / next-intl / FormatJS.
- Keep translations in `locales/<lang>.json`.
- Use ICU MessageFormat for plurals/genders.
- Avoid string concatenation: translators need full sentences.
23.2 Locale-aware formatting
- Dates: `Intl.DateTimeFormat`.
- Numbers / currency: `Intl.NumberFormat`.
- Pluralization: ICU `plural` (`select` covers gender and similar cases).
- Time zones: store UTC, render local.
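"Store UTC, render local" on the server side might look like the sketch below. It uses fixed UTC offsets so it runs without a time zone database, which is a simplification: real code would use IANA zone names, since fixed offsets ignore daylight saving. Both helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_storage(dt: datetime) -> datetime:
    """Normalize an aware datetime to UTC before persisting it."""
    if dt.tzinfo is None:
        raise ValueError("refusing naive datetime: attach a time zone first")
    return dt.astimezone(timezone.utc)


def render_local(stored_utc: datetime, utc_offset_hours: float) -> str:
    """Render a stored UTC value in the viewer's offset (ISO 8601)."""
    local = stored_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.isoformat(timespec="minutes")
```

Rejecting naive datetimes at the storage boundary is the design choice that matters here: it keeps "which zone was this?" from ever becoming a question.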
23.3 Accessibility (WCAG 2.2 AA)
- Every interactive element keyboard-reachable.
- Visible focus states (don't `outline: none` without a replacement).
- ARIA labels on icon-only buttons.
- Semantic HTML: `<button>` not `<div onClick>`.
- Color contrast ≥ 4.5:1 for body text.
- Test with `axe-core` in CI.
24. 🔧 Admin & Internal Tooling
24.1 Build it day one. Do not skip.
You'll be on support-debug duty all year. An admin panel pays for itself in week two.
24.2 What goes in it
| Capability | Why |
|---|---|
| Search any user / workspace | Triage support tickets. |
| Impersonate user (read-only by default) | "It works on my machine" reproduction. |
| Suspend / unsuspend workspace | Abuse handling. |
| Force-verify email | Lost-access support flow. |
| Refund / credit | Billing support. |
| Adjust plan / quota | Sales overrides. |
| Re-send webhook | Customer integration debug. |
| Replay failed jobs | Ops. |
| Inspect Stripe customer | Without leaving your tool. |
| Feature flag override per tenant | Beta access requests. |
24.3 Implementation
- Same codebase, gated behind `is_internal_admin` claim.
- Separate hostname (`admin.yourtool.com`) and route group.
- Every action audit-logged with `actor_user_id` (the staff member, not the impersonated user).
- IP-allowlist optional; MFA mandatory.
- Time-boxed sessions (re-auth every 30 min).
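Two of the rules above (log the staff member as the actor, and force re-auth after 30 minutes) fit in a few lines. The dataclass fields and `record_admin_action` are hypothetical names for the sketch; only `actor_user_id` and the 30-minute window come from the list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

REAUTH_AFTER = timedelta(minutes=30)  # from the list above


@dataclass(frozen=True)
class AdminAuditEntry:
    actor_user_id: str                   # always the staff member
    action: str
    target_workspace_id: str
    impersonated_user_id: Optional[str]  # set only while impersonating
    at: datetime


def session_is_fresh(last_auth_at: datetime, now: datetime) -> bool:
    """True while the admin session is inside the re-auth window."""
    return now - last_auth_at < REAUTH_AFTER


def record_admin_action(log: list, staff_id: str, action: str, workspace_id: str,
                        last_auth_at: datetime, now: datetime,
                        impersonating: Optional[str] = None) -> AdminAuditEntry:
    """Append an audit entry, refusing to act on a stale session."""
    if not session_is_fresh(last_auth_at, now):
        raise PermissionError("re-authentication required")
    entry = AdminAuditEntry(staff_id, action, workspace_id, impersonating, now)
    log.append(entry)
    return entry
```

Keeping the impersonated user in a separate field means the audit trail still answers "who really did this?" without losing the context.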
24.4 Don't overthink
You don't need React-Admin or Retool. A plain set of pages with tables and confirm modals is fine. Internal users will accept worse UX than customers.
24.5 BI for the business team
Sales/CS/finance/leadership will ask the same kind of questions every week — "MRR by plan?", "trial-to-paid by signup source?", "top 50 workspaces by API usage?". Without a self-serve tool, every one of those becomes a Slack message to engineering. Stand up a BI dashboard against a read replica (or a warehouse mirror — see §4.2) on day one of having paying customers.
| Tool | License | Sweet spot | Watch out for |
|---|---|---|---|
| Apache Superset | Apache 2.0 | Default recommendation. Clean license, powerful SQL Lab, rich chart library (incl. geospatial via deck.gl), scales to large orgs. The right pick when your data team is comfortable in SQL. | Steeper UX for non-technical users; more ops overhead than Metabase. |
| Metabase (Community) | AGPLv3 | Easier UX than Superset for non-technical users — point-and-click query builder genuinely works for sales/CS. Setup in 10 minutes. | License gotcha: AGPL is usually fine for internal-only BI but a hard block for embedded analytics in your customer-facing product (need Metabase Enterprise for embedding rights). Many corporate legal policies blanket-ban AGPL — verify with counsel. |
| Lightdash | MIT | dbt-native — your dbt models are the metrics layer. Best fit if you're already on dbt for transformations. | Smaller community; assumes a dbt workflow. |
| Evidence.dev | MIT | Code-as-config (Markdown + SQL → static dashboards in git). Versioned reports as a developer-friendly alternative to clicky dashboard tools. | Not interactive ad-hoc exploration — built for publishing recurring reports, not slicing-and-dicing. |
| Redash (Databricks-owned) | BSD-2-Clause | Lightweight SQL-first dashboarding. Mature, simple, low-touch. | Lower velocity since the Databricks acquisition; community pace has slowed. |
| Hex / Mode / Hashboard | Managed (commercial) | Polished hosted experiences with notebook-style data exploration; pay-per-seat. | Per-seat pricing scales with the team that uses it most. |
Template recommendation:
- Default: Apache Superset against a Postgres read replica — Apache 2.0 license keeps your options open, and the SQL Lab covers 90% of business questions.
- If your team is mostly non-technical and AGPL is acceptable: Metabase is the better UX. Just confirm with legal first, especially if you might want to embed dashboards in your product later.
- If you already run dbt: Lightdash, since "the metric layer is your dbt models" is genuinely a better workflow than maintaining metrics in two places.
Run BI only against a read replica or warehouse mirror, never your primary OLTP database. A finance team running an "everything joined to everything" query can starve your prod app of CPU, memory, and I/O. Same auth gate as the admin panel (§24.3): SSO + MFA, IP-allowlist optional, time-boxed sessions.
25. 📝 Marketing Site, Docs & SEO
25.1 Five separate surfaces, often conflated
| Surface | Stack | URL |
|---|---|---|
| Marketing site | Next.js (or Astro) | yourtool.com |
| Product docs | Mintlify / Docusaurus / Nextra | yourtool.com/docs |
| API reference | Stoplight / Redoc / Mintlify | yourtool.com/docs/api |
| Status page | StatusPage.io / Instatus | status.yourtool.com |
| Changelog | Markdown in repo + RSS | yourtool.com/changelog |
Don't try to put marketing + app + docs in one Next.js app on day one. Build separately, deploy separately, link liberally.
25.2 SEO basics
- Server-render marketing + docs (RSC, static generation).
- Per-page `<title>` and `<meta description>`.
- Open Graph + Twitter card tags + share image generator.
- `sitemap.xml` + `robots.txt`.
- JSON-LD schema for product/company.
- Page speed: Lighthouse ≥ 95 on every marketing page.
25.3 Conversion essentials
- Clear pricing page with comparison table + FAQ.
- Public roadmap (or at least a changelog).
- Customer logos / case studies (after you have any).
- Contact + sales form that goes to a real human in < 24h.
26. 🚢 CI/CD, Environments & Release Strategy
26.1 Environment ladder
dev (laptop) → ephemeral preview (per-PR) → staging → production
- Preview environments per PR: each PR gets its own deployed URL with a seeded DB. Vercel / Render / Fly do this natively.
- Staging mirrors prod config + tools but with a separate DB. For E2E tests + final smoke.
- Production is the only environment paying customers see.
26.2 CI pipeline (keep < 10 min)
1. Install deps (cache aggressively)
2. Lint (parallel)
3. Typecheck (parallel)
4. Unit tests (parallel)
5. Build artifacts
6. Integration tests (real Postgres + Redis as services)
7. E2E tests (Playwright against built artifacts) — only on main + tags
8. Deploy preview (PR) / staging (main) / prod (tag)
Fail fast: lint + typecheck before tests. Cache node_modules and ~/go/pkg/mod.
26.3 Database migrations on deploy
- Migrations run automatically on deploy, before app code.
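- Always backwards-compatible: because migrations run first, app version N must keep working against the DB at version N+1 (briefly, during rollout, and again if you roll back).
- For destructive migrations (drop column), use a 2-deploy dance: deploy 1 ships code that no longer reads or writes the column; deploy 2 drops it.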
26.4 Release strategy
- Blue-green or rolling deploys. Never stop-the-world.
- Canary for risky changes: 1% → 10% → 50% → 100% with metrics gates.
- Feature flags decouple deploy from release. Deploy whenever; release when ready.
- Tag-driven releases for the CLI / desktop apps via GoReleaser / electron-builder.
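The canary ladder with a metrics gate can be sketched as a single decision function. The gate itself (the canary's error rate may exceed the baseline by at most `tolerance`) is an assumption for illustration; real gates usually look at latency and saturation too.

```python
CANARY_STEPS = [1, 10, 50, 100]  # percent of traffic, from the list above


def next_canary_step(current_pct: int, canary_error_rate: float,
                     baseline_error_rate: float, tolerance: float = 0.005) -> int:
    """Return the next traffic percentage, or 0 to signal a rollback.

    Hypothetical gate: the canary may be at most `tolerance` worse than
    the baseline error rate.
    """
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0
    later = [p for p in CANARY_STEPS if p > current_pct]
    return later[0] if later else 100
```

A deploy controller would call this after each soak period and either shift traffic or trigger the rollback path.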
26.5 Rollback
- Every release is a single immutable artifact (container image with sha256 tag).
- `make rollback` reverts to the previous artifact in < 60 seconds.
- DB migrations are forward-only: a rollback redeploys the previous app artifact and leaves the schema where it is, rather than undoing the migration.
26.6 Where to host (and when to switch)
| Stage | Host | Why |
|---|---|---|
| Local dev | Docker Compose | Single command, identical to prod shape. |
| First production deploy | Fly.io / Railway / Render | Push-to-deploy, managed Postgres, zero ops. Cost: $20–$100/mo until you have traction. |
| Profitability stage | Hetzner (Cloud or dedicated) + Caddy front door | Best price-to-performance in the industry. A €20/mo CCX dedicated-vCPU box runs the API + workers comfortably for thousands of paying customers. Pair with managed Postgres elsewhere or run it yourself with daily off-site backups. |
| Polished IaaS | DigitalOcean (Droplets + Managed PG/Redis + Spaces + App Platform) | Better dashboard than Hetzner, managed databases included, predictable billing. ~2× the cost of Hetzner for similar specs but you get the managed pieces. |
| Enterprise / compliance | AWS / GCP / Azure | Region breadth, BAAs, customer procurement requirements. |
Reverse proxy on VM-style hosts (Hetzner, DO Droplets, bare metal):
- Caddy: single binary, automatic HTTPS via Let's Encrypt/ZeroSSL, config in a Caddyfile. The right default for "I have one or two boxes."

      app.yourtool.com {
          reverse_proxy api-1:8080 api-2:8080 {
              health_uri /healthz
          }
          encode gzip zstd
          log
      }

- Traefik: pulls config from Docker labels, K8s ingress objects, or a key-value store. The right default when you have a containerized fleet that scales horizontally and you want zero manual proxy config.

      # docker-compose.yml
      api:
        labels:
          - "traefik.enable=true"
          - "traefik.http.routers.api.rule=Host(`app.yourtool.com`)"
          - "traefik.http.routers.api.tls.certresolver=letsencrypt"

Don't run nginx unless you have a specific reason — Caddy and Traefik handle TLS, HTTP/3, and modern defaults without the config gymnastics.
26.7 The bootstrapped reference deployment
A surprising number of profitable SaaS run on:
[Cloudflare] (CDN, WAF, DNS, Turnstile, R2 for files)
│
▼
[Hetzner CCX dedicated-vCPU box, €20–€60/mo]
│
├── Caddy (TLS, reverse proxy)
├── Go API (Gin + GORM + zerolog)
├── Worker (Asynq or NATS JetStream consumer)
├── NATS JetStream (single node, file-backed)
├── Postgres 16 (with WAL-G off-site backups to R2)
└── Casdoor (auth, separate container)
Total infra cost: €30–€80/month all-in. Capable of serving thousands of paying customers before you need a second box. Move to DigitalOcean managed Postgres the day you stop wanting to be the on-call DBA.
27. 🧰 Developer Experience (DX)
27.1 The "one command to dev" rule
make dev
Should:
- Boot Postgres + Redis (Docker Compose).
- Run migrations.
- Seed data.
- Start API + workers + frontend with hot reload.
- Print URLs for app, docs, mailcatcher, DB UI.
If a new engineer can't git clone && make dev and reach the running app in 10 minutes, fix the gap.
27.2 Seed data
Realistic, idempotent, reproducible:
- 5 workspaces with different plans.
- 20 users, with at least one in each role.
- 100 representative resources (issues / projects / etc.).
- 1 demo workspace anyone can browse.
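"Idempotent and reproducible" mostly comes down to two habits: deterministic IDs and a seeded random generator. A minimal sketch, using a dict as a stand-in for the database; the namespace UUID, plan names, and email domain are arbitrary choices for the example.

```python
import random
import uuid

SEED_NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000001")  # arbitrary, fixed
PLANS = ["free", "starter", "team", "enterprise"]
ROLES = ["owner", "admin", "member"]


def seed(db: dict, rng_seed: int = 42) -> dict:
    """Upsert demo rows keyed by deterministic IDs, so re-running is a no-op."""
    rng = random.Random(rng_seed)
    for i in range(5):
        ws_id = str(uuid.uuid5(SEED_NAMESPACE, f"workspace-{i}"))
        db.setdefault("workspaces", {})[ws_id] = {
            "name": f"Demo {i}", "plan": PLANS[i % len(PLANS)],
        }
    ws_ids = sorted(db["workspaces"])
    for i in range(20):
        user_id = str(uuid.uuid5(SEED_NAMESPACE, f"user-{i}"))
        # the first three users cover every role; the rest are random but seeded
        role = ROLES[i] if i < len(ROLES) else rng.choice(ROLES)
        db.setdefault("users", {})[user_id] = {
            "email": f"user{i}@example.test",
            "role": role,
            "workspace_id": rng.choice(ws_ids),
        }
    return db
```

Against Postgres the same idea becomes `INSERT ... ON CONFLICT (id) DO UPDATE` with the deterministic IDs as keys.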
27.3 Mail in dev
Run MailHog / Mailpit in Compose. All transactional emails route there. Open the UI to read them.
27.4 DB UI in dev
Embed pgweb / Adminer in Compose at localhost:8081. Saves "where's the user table" Slack messages.
27.5 Repo conventions
- `Makefile` is the entry point for every workflow (`make dev`, `make test`, `make migrate-up`, `make seed`).
- `.env.example` checked in; `.env` gitignored.
- `CONTRIBUTING.md` with the 5 commands a new dev needs.
- `docs/decisions/` for ADRs (Architecture Decision Records).
27.6 Codegen, not boilerplate
- API clients generated from OpenAPI.
- DB types generated by sqlc / Prisma.
- Translation keys type-checked.
- Routes type-safe (TanStack Router / Next).
- If you find yourself writing the same thing in three places, generate it.
27.7 Pick one Go stack and standardize on it
Two viable shapes. Don't mix them within one service.
| Shape | Stack | When to pick |
|---|---|---|
| Lean / SQL-first | chi (router) + sqlc (codegen) + pgx (driver) + slog or zerolog | You want explicit SQL, zero ORM magic, maximum performance. Code reads like a database textbook. |
| Batteries-included | Gin (router + middleware ecosystem) + GORM (ORM, migrations, hooks) + zerolog | You want to ship features faster and trade some control for ergonomics. Most Go SaaS teams pick this. |
For the template, default to Gin + GORM + zerolog unless your team has a strong preference. It's the path with the most tutorials, middleware, and Stack Overflow answers — which matters when onboarding new engineers.
// Gin + GORM + zerolog skeleton
r := gin.New()
r.Use(
requestid.New(),
ginzerolog.Logger("api"), // structured access logs
gin.Recovery(),
middleware.Auth(authProvider), // verifies session/JWT, sets actor in ctx
middleware.Tenant(), // resolves workspace_id, sets app.workspace_id GUC
)
r.POST("/api/v1/projects", handlers.CreateProject(db))
// db is *gorm.DB with logger plugged into zerolog
GORM gotchas to know up front: hooks and callbacks fire on every save (use them for audit-log fan-out, not business logic); Preload issues an extra query per association, which adds up quickly with nested preloads (prefer explicit joins for hot paths); and AutoMigrate is fine for dev but should never run in prod. Use goose, golang-migrate, or Atlas for versioned production migrations.
28. 🧪 Testing Strategy
28.1 The pyramid
/\ E2E (Playwright) 5–10% slow, valuable
/ \
/----\ Integration (real DB) 20–30% most leverage
/------\
/--------\ Unit 60–70% fast feedback
28.2 Rules
- Unit tests are co-located with source: `foo.go` + `foo_test.go`, `Button.tsx` + `Button.test.tsx`.
- Integration tests spin up a real Postgres + Redis (testcontainers, or services in CI).
- E2E tests run against the full Compose stack on tagged releases + main.
- Fast tests in pre-commit / on file save. Full suite in CI.
28.3 Critical user-facing flows to E2E
- Sign up → verify email → create workspace → first activation event.
- Invite teammate → teammate accepts → both see the same data.
- Upgrade plan → feature unlocks immediately.
- Cancel plan → downgrade scheduled at period end.
- Forgotten password → reset → log back in.
If any of these break, the whole product is broken. E2E them.
28.4 Snapshot tests
- Useful for emails (rendered HTML) and API responses (response schema).
- Avoid for UI — too much false-positive noise. Visual regression tools (Chromatic / Percy) are better.
28.5 Property-based tests
For pure logic (validation, pricing math, date calculations) — fast-check (TS) / hypothesis (Python) / gopter (Go) catch the cases you didn't think of.
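To show the shape of a property test without adding a dependency, the sketch below uses a seeded `random` loop as a stand-in for what hypothesis would generate (and shrink) for you. `prorate_cents` is a hypothetical piece of pricing math.

```python
import random


def prorate_cents(price_cents: int, days_used: int, days_in_period: int) -> int:
    """Charge for the used part of a period, rounded down to whole cents."""
    if days_in_period <= 0 or not 0 <= days_used <= days_in_period:
        raise ValueError("days_used must fall within the period")
    return price_cents * days_used // days_in_period


def check_proration_properties(trials: int = 1000, seed: int = 0) -> None:
    """Assert invariants over many random inputs instead of a few examples."""
    rng = random.Random(seed)
    for _ in range(trials):
        price = rng.randint(0, 1_000_000)
        period = rng.randint(1, 366)
        used = rng.randint(0, period)
        charge = prorate_cents(price, used, period)
        assert 0 <= charge <= price                           # never overcharge
        assert prorate_cents(price, period, period) == price  # full period, full price
        if used < period:                                     # one more day never costs less
            assert prorate_cents(price, used + 1, period) >= charge
```

The properties, not the generator, are the valuable part: they carry over unchanged if you later switch to a real property-testing library.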
28.6 Don't skip coverage; don't worship it
Aim for ~70% line coverage on logic-heavy packages. Below that = gaps. Above 90% = you're testing trivial getters.
29. 💰 Pricing, Plans & Packaging Strategy
29.1 The three SaaS pricing axes
- Per-seat — works for collaboration (Slack, Linear, Figma). Predictable, scales with customer.
- Usage-based — works for backend infra & AI (Stripe, OpenAI, Vercel). Aligns with value, but harder to budget.
- Per-feature tier — works for breadth (HubSpot, Zendesk). Lets enterprise sales upsell.
Most SaaS combine all three: per-seat × tier + usage-based add-ons.
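The combined model (per-seat × tier + usage add-ons) reduces to one line of integer arithmetic. The `Plan` fields and the numbers in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    seat_price_cents: int        # per seat per month, set by the tier
    included_units: int          # usage included in the tier
    overage_cents_per_unit: int  # usage-based add-on


def monthly_invoice_cents(plan: Plan, seats: int, units_used: int) -> int:
    """Per-seat price for the tier plus usage-based overage, in integer cents."""
    overage_units = max(0, units_used - plan.included_units)
    return seats * plan.seat_price_cents + overage_units * plan.overage_cents_per_unit
```

For example, 5 seats on a plan at 1900 cents per seat with 10,000 included units and 2 cents per extra unit, after 12,500 units of usage, comes to 14,500 cents. Keeping money in integer cents avoids float rounding surprises.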
29.2 Recommended starting tiers
Free / Hobby — 1 user, X resources, limited features → top of funnel
Starter / Pro — N users, full features, $/seat/month → SMB / individual paid
Team / Business — unlimited users, advanced features → mid-market
Enterprise — SSO, audit export, custom DPA, support → contact sales
Don't ship 6 tiers on day one. Ship three self-serve tiers and keep Enterprise as contact-sales.
29.3 What goes behind the paywall
- Free: the core value prop, scoped (e.g., "10 issues, 1 user").
- Pro/Team: depth (advanced fields, automations, API).
- Enterprise: trust (SSO, SCIM, audit log export, custom contract, SLA, support).
29.4 Annual discount
Standard: ~20% off vs monthly. Locks in cash flow + reduces churn.
29.5 Free trial vs freemium — pick one
- Trial (14 days, full features) — high commercial pressure, faster decision.
- Freemium (free forever, limited) — top-of-funnel volume, harder conversion.
For a vertical/B2B SaaS template: default to trial. For PLG products targeting individuals: freemium.
29.6 Discounting & overrides
- Coupons in Stripe with promotion codes for marketing.
- Sales-set discounts via admin panel (audit-logged).
- Annual prepay discounts: model them as a separate annual price in Stripe, so no coupon is involved.
30. 🎯 Product Analytics & Growth
30.1 Two analytics stacks
| Stack | Tool | Purpose |
|---|---|---|
| Product | PostHog / Mixpanel / Amplitude | "Did the user activate? Convert? Churn?" |
| Engineering | OpenTelemetry → Grafana | "Is the system healthy?" |
PostHog is the recommended default — it bundles analytics, session replay, feature flags, and A/B tests in one tool.
30.2 The events you must track
From day one:
- `signed_up` (workspace_id, user_id, source)
- `activated` (workspace_id): your activation event
- `<core_action>_created`: whatever your "noun" is
- `invited_member`, `member_accepted`
- `upgraded_plan`, `downgraded_plan`, `cancelled_subscription`
- `viewed_paywall`, `clicked_upgrade`
Every event has workspace_id and user_id. Don't track per-user without per-tenant.
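One way to enforce the "every event carries both IDs" rule is to route all tracking through a single builder that refuses anything else. `build_event` and the payload shape are assumptions for the sketch, not a particular vendor's format.

```python
def build_event(name: str, workspace_id: str, user_id: str, **props) -> dict:
    """Return an analytics payload; refuses events without tenant or user."""
    if not name or name != name.lower() or " " in name:
        raise ValueError("event names are snake_case, e.g. 'signed_up'")
    if not workspace_id or not user_id:
        raise ValueError("every event needs workspace_id and user_id")
    return {
        "event": name,
        "workspace_id": workspace_id,
        "user_id": user_id,
        "properties": props,
    }
```

The builder's output can then be handed to whichever analytics client you use.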
30.3 The funnels you must measure
- Sign-up → email-verified → workspace-created → activated.
- Activation → invite teammate → second user activated.
- Free → paywall view → upgrade.
- Subscribed → renewal (LTV / churn).
30.4 Cohort retention
Plot retention by signup-week cohort. Healthy SaaS shows a "smile" — short-term decline, long-term flat or up. If your retention curves go to zero, no amount of marketing fixes the product.
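The cohort table behind that plot can be computed from two inputs: when each workspace signed up and on which days it was active. A minimal sketch with hypothetical names; cohorts are keyed by the Monday of the signup week, and activity from workspaces missing in `signups` would raise a `KeyError`.

```python
from collections import defaultdict
from datetime import date


def week_start(d: date) -> date:
    """Monday of the week containing d."""
    return date.fromordinal(d.toordinal() - d.weekday())


def cohort_retention(signups: dict[str, date], activity: list[tuple[str, date]]) -> dict:
    """Map cohort week -> {weeks since signup: fraction of the cohort active}."""
    cohort_of = {ws: week_start(d) for ws, d in signups.items()}
    sizes = defaultdict(int)
    for cohort in cohort_of.values():
        sizes[cohort] += 1
    active = defaultdict(set)  # (cohort, week offset) -> workspaces seen
    for ws, day in activity:
        cohort = cohort_of[ws]
        offset = (week_start(day) - cohort).days // 7
        if offset >= 0:
            active[(cohort, offset)].add(ws)
    table = defaultdict(dict)
    for (cohort, offset), members in sorted(active.items()):
        table[cohort][offset] = len(members) / sizes[cohort]
    return dict(table)
```

Each row of the result is one retention curve; plotting the rows on shared axes gives the cohort chart described above.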
30.5 NPS / CSAT
In-app survey (Delighted / built-in PostHog) at 30 days post-signup and quarterly. NPS > 30 is good, > 50 great.
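For reference, NPS is the percentage of promoters (scores 9 to 10) minus the percentage of detractors (scores 0 to 6) on a 0 to 10 scale. A small sketch:

```python
def nps(scores: list[int]) -> float:
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("no survey responses")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)
```

Passives (7 to 8) count toward the denominator only, which is why a survey full of 8s scores exactly 0.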
31. 🤝 Customer Support & Success
31.1 Day-one support stack
- Email: `support@yourtool.com` → ticketing system (Pylon, Plain, Help Scout, or just Front).
- In-app chat: Intercom / Crisp / Pylon. Gate by plan if costly.
- Docs: searchable, with embedded video.
- Status page: automatic incident updates from your monitors.
- Community: Slack / Discord / Discourse — only if you have bandwidth to keep it active.
31.2 Build support hooks into the product
- "Get help" button opens chat with current page URL pre-filled.
- "Copy debug info" button: workspace_id, user_id, browser, version, request_id of last error.
- Per-error pages include `request_id` + a "contact support" link.
31.3 Customer success vs support
- Support reacts: ticket comes in, response goes out.
- Customer success is proactive: usage drops, success manager reaches out.
You don't need CS until you have customers worth saving. But instrument the data day one.
32. 📦 Reusability — How to Make This a Template
If the goal is a template you fork per product, the architecture must keep domain-specific code clean.
32.1 The "kernel + product" split
kernel/ — every SaaS has this
auth, tenancy, billing, notifications, audit, admin, files, search,
flags, analytics, infra, observability
product/ — your domain
models, services, handlers, UI, jobs
32.2 Hard rules
- `kernel/` never imports `product/`. One-way dependency.
- `product/` extends kernel through hooks/interfaces, never by editing kernel.
- New tenant-scoped tables follow the same conventions: `id`, `workspace_id`, `created_at`, RLS policy.
- Domain events publish on the same in-process bus.
- Domain UI uses the same design system + permission helpers.
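The one-way dependency rule is only real if CI enforces it. Below is a rough sketch of such a check for Go sources, written as a script that takes file contents as input. The regular expression is a heuristic for import lines, not a Go parser, and the paths in the usage note are hypothetical.

```python
import re

# Heuristic: matches Go import specs such as `"app/product/models"` or
# `import models "app/product/models"`. Not a full Go parser.
IMPORT_SPEC = re.compile(r'^\s*(?:import\s+)?(?:[\w.]+\s+)?"([^"]+)"\s*$')


def kernel_violations(files: dict[str, str]) -> list[tuple[str, str]]:
    """Return (file, import path) pairs where kernel/ code imports product/."""
    found = []
    for path, source in sorted(files.items()):
        if not path.startswith("kernel/"):
            continue
        for line in source.splitlines():
            match = IMPORT_SPEC.match(line)
            if match and "/product/" in f"/{match.group(1)}/":
                found.append((path, match.group(1)))
    return found
```

Feed it a mapping of repo paths to file contents and fail the build when the list is non-empty; a file such as `kernel/billing/plans.go` importing `app/product/models` would be reported.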
32.3 Configuration over code
Most "per-product" customizations should be config:
# product.config.yaml
brand:
  name: "MyApp"
  primary_color: "#5B5BD6"
features:
  audit_log_export: true
  custom_domains: false
plans:
  - name: starter
    price_cents: 1900
    limits: { members: 5 }
Logo, name, palette, plan structure — all configurable without touching kernel code.
32.4 Domain plug-points
Predefine extension points in the kernel:
| Hook | Example use |
|---|---|
| `OnSignup(user, workspace)` | Auto-create demo project. |
| `OnActivated(workspace)` | Send welcome email + Slack notification. |
| `BeforeRequest(ctx)` | Inject tenant-specific data. |
| `MeterEvent(name, qty)` | Custom usage metering for your domain. |
| `RenderEmail(template, data)` | Domain-specific transactional emails. |
Each is a Go interface or TS function imported from kernel, implemented in product.
32.5 Reskin checklist (minutes, not days)
- [ ] Update `product.config.yaml`.
- [ ] Replace logo, favicon, OG images.
- [ ] Update `tailwind.config.ts` colors.
- [ ] Update marketing copy in `apps/marketing/content/`.
- [ ] Configure Stripe products + prices, paste IDs into config.
- [ ] Add domain models to `product/`.
- [ ] Wire domain routes / pages.
- [ ] Update `seed.go` with domain-relevant demo data.
32.6 Versioning the template
Treat the template as its own project with a version. When kernel improves, projects forked from it can pull updates by:
- Adding the template repo as a template-upstream remote.
- Cherry-picking kernel commits.
- Or running a custom bin/upgrade-kernel that copies non-product paths.
33. 🗺️ The 14-Phase Build Plan
Each phase is shippable. Don't skip ahead. Most failures here come from doing phase 7 before phase 3 is solid.
🌱 Phase 1 — Skeleton (2 days)
- Monorepo: apps/web, apps/api, packages/{core,ui,views}, infra/.
- Docker Compose: Postgres + Redis + Mailpit + pgweb.
- make dev brings up the stack with hot reload.
- Health endpoints, structured logging, request ID middleware.
- One CI job: lint + typecheck + unit tests.
Done when: git clone && make dev brings up an empty app that loads with no auth.
🔐 Phase 2 — Auth (2 days)
- Email + password + magic link.
- Email verification.
- Google OAuth.
- Password reset.
- Session via cookie (browser) and JWT (API).
- Rate limit on /login.
Done when: a new user can sign up, verify, log out, log in, reset password.
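The /login rate limit can start as a fixed-window counter. This is a minimal in-memory sketch: the class name, the 5-per-minute policy, and the IP + email key are assumptions, and a multi-node deployment would keep the counters in a shared store such as Redis rather than in process memory.

```typescript
// Fixed-window counter keyed by an arbitrary string. The clock is passed in
// so the behavior is deterministic and easy to test.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(key: string, nowMs: number): boolean {
    const current = this.windows.get(key);
    // No window yet, or the old one has elapsed: start a fresh window.
    if (current === undefined || nowMs - current.start >= this.windowMs) {
      this.windows.set(key, { start: nowMs, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

// Assumed policy for this sketch: 5 attempts per minute per IP + email pair.
const loginLimiter = new FixedWindowLimiter(5, 60 * 1000);
```

Keying on IP plus email slows down credential stuffing against one account without locking out everyone behind a shared office IP.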
🏢 Phase 3 — Tenancy (2 days)
- workspace, membership, invite tables.
- Workspace creation flow.
- Workspace switcher UI.
- Subdomain or path-based routing.
- RLS policies on every tenant-scoped table.
- Permission helper Can(user, action, resource).
- Roles: owner, admin, member.
Done when: invited teammates only see the workspaces they belong to. Cross-tenant DB access is blocked at the RLS layer.
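A first version of the permission helper can be a lookup from role to allowed actions, scoped by workspace membership. In this sketch the action names and the role-to-action mapping are illustrative assumptions, and memberships are passed in as a fourth argument instead of being loaded from the database.

```typescript
type Role = "owner" | "admin" | "member";
type Action = "read" | "write" | "invite" | "manage_billing" | "delete_workspace";

interface Membership { userId: string; workspaceId: string; role: Role }
interface TenantResource { workspaceId: string }

// Which actions each role may perform. This mapping is illustrative.
const ROLE_ACTIONS: Record<Role, Action[]> = {
  owner: ["read", "write", "invite", "manage_billing", "delete_workspace"],
  admin: ["read", "write", "invite"],
  member: ["read", "write"],
};

// A user with no membership in the resource's workspace is denied outright,
// which mirrors what RLS enforces at the database layer.
function can(
  userId: string,
  action: Action,
  resource: TenantResource,
  memberships: Membership[]
): boolean {
  const membership = memberships.find(
    (m) => m.userId === userId && m.workspaceId === resource.workspaceId
  );
  if (membership === undefined) return false;
  return ROLE_ACTIONS[membership.role].indexOf(action) !== -1;
}
```

Keeping every check behind one function makes it possible to move to finer-grained rules later without touching call sites.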
📨 Phase 4 — Notifications & Email (1 day)
- Resend / Postmark integration.
- React Email templates: verify, reset, invite, billing failure.
- In-app inbox table + WS push.
- Notification preferences.
Done when: invite emails arrive in Mailpit (dev) and real inbox (prod), and the in-app bell shows new mentions.
💳 Phase 5 — Billing (3 days)
- Stripe integration: Checkout + Customer Portal.
- Plans table + subscription table + webhook handler.
- Trial logic.
- Feature gating helper.
- Dunning emails on failed payments.
- Admin override for plan/quota.
Done when: users can pick a plan, pay, see their plan, upgrade, downgrade, and a failed payment triggers correct UX.
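The feature gating helper and the admin override can share one lookup path. In the sketch below, the starter limit comes from the config example earlier in this playbook, while the pro plan, the GateOverride shape, and the function names are assumptions for illustration.

```typescript
interface PlanLimits { members: number }
interface BillingPlan { features: string[]; limits: PlanLimits }

// Admin-set override for one workspace: extra features and/or raised limits.
interface GateOverride { features?: string[]; limits?: Partial<PlanLimits> }

// `starter` matches the config example; `pro` is invented for this sketch.
const BILLING_PLANS: Record<string, BillingPlan> = {
  starter: { features: [], limits: { members: 5 } },
  pro: { features: ["audit_log_export"], limits: { members: 50 } },
};

function lookupPlan(planName: string): BillingPlan | undefined {
  return Object.prototype.hasOwnProperty.call(BILLING_PLANS, planName)
    ? BILLING_PLANS[planName]
    : undefined;
}

// A feature is on if the override grants it or the plan includes it.
function hasFeature(planName: string, feature: string, override?: GateOverride): boolean {
  const extra = override?.features ?? [];
  if (extra.indexOf(feature) !== -1) return true;
  const plan = lookupPlan(planName);
  return plan !== undefined && plan.features.indexOf(feature) !== -1;
}

// The override limit, when present, replaces the plan limit.
function canAddMember(planName: string, currentMembers: number, override?: GateOverride): boolean {
  const plan = lookupPlan(planName);
  if (plan === undefined) return false; // unknown plan: deny by default
  const limit = override?.limits?.members ?? plan.limits.members;
  return currentMembers < limit;
}
```

Routing both plan rules and overrides through the same helpers means support can unblock a customer without a code change.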
⚙️ Phase 6 — Background Jobs & Cron (1 day)
- Job queue (Asynq / River / BullMQ).
- Worker process running in Compose.
- Job examples: send email, sync to Stripe, expire trial.
- Cron scheduler, either leader-elected or Postgres-backed.
- Outbox pattern for transactional events.
Done when: a 10-second job runs in the worker, the API stays fast, and a daily cron fires once across N replicas.
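The "fires once across N replicas" requirement comes down to an atomic claim on a key such as job name plus date. The sketch below uses an in-memory set as a stand-in for a Postgres table with a unique constraint; RunLedger, InMemoryLedger, and runDailyOnce are hypothetical names.

```typescript
// Stand-in for a table with a unique (job_name, run_date) key. In Postgres the
// claim could be an INSERT ... ON CONFLICT DO NOTHING that reports whether a
// row was actually inserted.
interface RunLedger {
  tryClaim(key: string): boolean;
}

class InMemoryLedger implements RunLedger {
  private readonly claimed = new Set<string>();

  tryClaim(key: string): boolean {
    if (this.claimed.has(key)) return false;
    this.claimed.add(key);
    return true;
  }
}

// Every replica calls this on its own timer; only the replica that wins the
// claim for the current UTC day runs the job.
function runDailyOnce(ledger: RunLedger, jobName: string, now: Date, job: () => void): boolean {
  const runDate = now.toISOString().slice(0, 10); // e.g. "2026-04-30"
  if (!ledger.tryClaim(`${jobName}:${runDate}`)) return false;
  job();
  return true;
}
```

The same claim-then-run shape works with leader election; the ledger approach just avoids running a separate coordination service.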
📦 Phase 7 — Files (1 day)
- S3 / R2 bucket per environment.
- Signed-URL upload endpoint.
- Confirm endpoint storing metadata.
- Avatar upload as the canonical example.
- CDN with signed cookies for private files.
Done when: a user can upload an avatar and see it served via the CDN, without the bytes touching the API.
🔎 Phase 8 — Search & Search-Adjacent (1 day)
- Postgres FTS index on the main domain entity.
- Generic searchable interface.
- Hybrid (BM25 + trigram) ranking.
- (Optional) pgvector + embedding worker.
Done when: typing in the search bar returns relevant results in < 200ms.
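To make the trigram half of hybrid ranking concrete, here is a small sketch of trigram similarity and a weighted blend with a full-text rank. It approximates how pg_trgm pads and compares words but is not guaranteed to match it in every case; the 0.7 weight is an arbitrary assumption, and in production both scores would come from Postgres itself. Note that the built-in ts_rank is not BM25, so true BM25 scoring would need an extension.

```typescript
// Build the set of 3-character windows for each word, padding with two
// leading spaces and one trailing space (the scheme pg_trgm documents).
function trigrams(text: string): Set<string> {
  const grams = new Set<string>();
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
  for (const word of words) {
    const padded = "  " + word + " ";
    for (let i = 0; i + 3 <= padded.length; i++) {
      grams.add(padded.slice(i, i + 3));
    }
  }
  return grams;
}

// Similarity = shared trigrams / distinct trigrams across both strings.
function trigramSimilarity(a: string, b: string): number {
  const ga = trigrams(a);
  const gb = trigrams(b);
  if (ga.size === 0 && gb.size === 0) return 0;
  let shared = 0;
  ga.forEach((gram) => {
    if (gb.has(gram)) shared += 1;
  });
  return shared / (ga.size + gb.size - shared);
}

// Weighted blend of a normalized full-text rank and trigram similarity.
function hybridScore(ftsRank: number, trigram: number, ftsWeight: number = 0.7): number {
  return ftsWeight * ftsRank + (1 - ftsWeight) * trigram;
}
```

The trigram term is what rescues typos such as "projct", where a stemmed full-text match finds nothing.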
📡 Phase 9 — Real-time (1 day)
- WebSocket endpoint with auth + origin check.
- In-process hub + (optional) Redis pub/sub for multi-node.
- Client subscribes, server invalidates Query cache via WS event.
- Presence (online/offline indicators).
Done when: two browser windows show the same data update simultaneously.
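The "WS invalidates, never writes" rule can be sketched with a tiny stand-in for the query cache. QueryCacheSketch, the message shape, and the key layout (workspace/resource/...) are assumptions for illustration; a real client would call its query library's own invalidation API.

```typescript
// Minimal stand-in for a server-state cache. Keys look like
// "<workspaceId>/<resource>/<rest>", which is an assumed layout.
class QueryCacheSketch {
  private readonly entries = new Map<string, unknown>();
  private readonly stale = new Set<string>();

  set(key: string, data: unknown): void {
    this.entries.set(key, data);
    this.stale.delete(key);
  }

  get(key: string): unknown {
    return this.entries.get(key);
  }

  isStale(key: string): boolean {
    return this.stale.has(key);
  }

  // Mark every cached key under the prefix stale so the next read refetches.
  invalidate(prefix: string): void {
    this.entries.forEach((_value, key) => {
      if (key.indexOf(prefix) === 0) this.stale.add(key);
    });
  }
}

interface WsInvalidation {
  type: "invalidate";
  workspaceId: string;
  resource: string;
}

// The handler only marks data stale. It never copies the WS payload into the
// cache, so the HTTP API stays the single source of truth.
function handleWsMessage(cache: QueryCacheSketch, raw: string): void {
  const msg = JSON.parse(raw) as WsInvalidation;
  if (msg.type !== "invalidate") return;
  cache.invalidate(`${msg.workspaceId}/${msg.resource}/`);
}
```

Since the refetch goes through the normal API, permission checks and response shaping apply to real-time updates for free.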
📊 Phase 10 — Audit, Activity, Telemetry (1 day)
- audit_log table with privileged-action logging.
- activity table for user-facing feeds.
- PostHog (or equivalent) wired with the canonical events.
- Workspace activation event + retention dashboard.
Done when: every privileged action is in the audit log and every signup is tracked in PostHog.
🚩 Phase 11 — Feature Flags & Admin Panel (2 days)
- Self-hosted PostHog or DIY flag table.
- Per-env / per-workspace / per-user flag resolution.
- Admin panel: user search, workspace search, impersonate (read-only), suspend, override flags.
- Admin actions audit-logged with staff actor.
Done when: support can resolve an "I can't see X" ticket in < 5 minutes via admin tools.
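Per-env / per-workspace / per-user resolution needs a precedence order. The sketch below assumes the most specific scope wins (user, then workspace, then env, then default); that order and the FlagRule shape are assumptions for a DIY flag table, not the behavior of any particular flag service.

```typescript
interface FlagRule {
  defaultValue: boolean;
  byEnv?: Record<string, boolean>;
  byWorkspace?: Record<string, boolean>;
  byUser?: Record<string, boolean>;
}

interface FlagContext {
  env: string;
  workspaceId: string;
  userId: string;
}

// Most specific scope wins: user > workspace > env > default.
function resolveFlag(rule: FlagRule, ctx: FlagContext): boolean {
  const candidates: Array<boolean | undefined> = [
    rule.byUser?.[ctx.userId],
    rule.byWorkspace?.[ctx.workspaceId],
    rule.byEnv?.[ctx.env],
  ];
  for (const value of candidates) {
    if (value !== undefined) return value;
  }
  return rule.defaultValue;
}
```

With this order, an admin override for a single user can switch a feature off even inside a workspace that has it enabled.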
🛡️ Phase 12 — Security & Compliance Foundation (1 day)
- CSP, HSTS, secure cookies, CSRF.
- gitleaks pre-commit + CI.
- GDPR primitives: data export endpoint, account deletion endpoint, consent log.
- DPA template + subprocessor list page.
- Pen-test scan via OWASP ZAP in CI.
Done when: a security review can pass the OWASP Top 10 checklist without changes.
📈 Phase 13 — Observability (1 day)
- OpenTelemetry SDK in API + workers.
- Logs, metrics, traces all tagged with request_id + tenant_id.
- Sentry for errors.
- Basic Grafana dashboard with golden signals.
- Status page (Instatus or self-hosted).
- One SLO defined + alerted.
Done when: clicking an error in Sentry takes you to the trace, which links to the logs, which contain the request.
📦 Phase 14 — Package, Document, Reskin (2 days)
- kernel/ ↔ product/ separation.
- product.config.yaml and reskin guide.
- Marketing landing page template.
- Docs site template (Mintlify / Nextra).
- README + CONTRIBUTING + ADRs.
- One full reskin pass to verify the template works.
Done when: a new engineer can fork, run bin/reskin --name AcmeApp --color "#FF5C5C", and have a custom-branded skeleton in 30 minutes.
Total: ~21 working days for a single experienced engineer to build an MVP-quality SaaS template. ~6–8 weeks calendar with reviews, polish, and docs.
34. ⚠️ Common Pitfalls & Hard-Won Guardrails
| Pitfall | Guardrail |
|---|---|
| Forgetting WHERE workspace_id = ? somewhere | RLS policies on every tenant table; CI grep for missing filters. |
| Stripe webhook handler is non-idempotent | Use event.id as a dedup key in Redis with 7-day TTL. |
| Long-running job blocks request path | Move to a queue; never call third parties synchronously. |
| Admin actions not audit-logged | Wrap every admin handler in middleware that writes to audit log. |
| Email enumeration on signup/login | Same response and timing for "exists" vs "not exists". |
| Migration breaks rolling deploy | Two-phase migrations; never drop+rename in one shot. |
| WS message updates client store directly | Rule: WS invalidates Query cache only, never writes to stores. |
| Cookie auth without CSRF | SameSite=Lax + CSRF token on state-changing endpoints. |
| Secrets committed to git | gitleaks pre-commit + CI fail. |
| Free tier abuse (signup farming) | Rate limit signups per IP + email-domain block list + Cloudflare Turnstile. |
| Plan change inconsistencies (paid down to free with paid resources still active) | Plan change handler: enforce limits, archive overflow, email user. |
| Trial expires while user has 50 issues | Read-only mode + upgrade banner; do not delete data. |
| Hot N+1 query in detail page | EXPLAIN ANALYZE in CI for top endpoints. |
| Cache that never invalidates | Tag-based invalidation; never set TTL > 1 hour without invalidation hook. |
| Tenant data exposed via search index | Search index keys include workspace_id and the search query filters by it. |
| Misconfigured CORS opens API to malicious origins | Allowlist origins explicitly; reject * with credentials. |
| User can delete their own audit log entries | Audit log is append-only; no user-facing endpoint to mutate. |
| One slow query takes down the API | Statement-level timeouts (SET LOCAL statement_timeout = '5s'). |
| Background worker silently fails forever | Dead-letter queue + alert on DLQ depth. |
| Subdomain takeover via stale CNAME | Audit DNS regularly; deactivate orphan subdomains. |
| Test data leaks into prod | Distinct connection strings; loud banner in non-prod environments. |
| "Forgot password" reveals if email exists | Generic response: "If an account exists, we've sent a reset link." |
| No consent log → GDPR audit fails | consent table with version + timestamp + IP from day one. |
| Customer asks for a feature already on roadmap | Public roadmap so they can upvote instead of opening a ticket. |
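The webhook idempotency guardrail from the table can be sketched as a claim on event.id before any side effect runs. DedupStore below is an in-memory stand-in for a Redis SET with NX and an expiry; the class, the key prefix, and the release-on-failure step are assumptions made for this example.

```typescript
// In-memory stand-in for Redis "SET key value NX EX ttl": the first claim
// inside the TTL window wins. The clock is injected to keep it testable.
class DedupStore {
  private readonly expiresAt = new Map<string, number>();

  claim(key: string, ttlMs: number, nowMs: number): boolean {
    const expiry = this.expiresAt.get(key);
    if (expiry !== undefined && expiry > nowMs) return false;
    this.expiresAt.set(key, nowMs + ttlMs);
    return true;
  }

  release(key: string): void {
    this.expiresAt.delete(key);
  }
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

interface WebhookEvent { id: string; type: string }

function handleWebhook(
  store: DedupStore,
  event: WebhookEvent,
  nowMs: number,
  apply: (e: WebhookEvent) => void
): "processed" | "duplicate" {
  const key = `stripe:event:${event.id}`;
  if (!store.claim(key, SEVEN_DAYS_MS, nowMs)) return "duplicate";
  try {
    apply(event);
  } catch (err) {
    // Give the key back so the provider's retry can be processed.
    store.release(key);
    throw err;
  }
  return "processed";
}
```

Releasing the key on failure matters: without it, a crash mid-handler would turn the provider's retry into a silently dropped event.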
35. 📋 Cheat Sheet
📖 First files / decisions to lock down
- Multi-tenancy model — pool, all queries filter by workspace_id, RLS as defense.
- Auth model — cookie session for browser, JWT for mobile/API, API keys for integrations.
- Permissions — single Can(actor, action, resource) helper, RBAC roles.
- Billing — Stripe Checkout + Customer Portal; metered prices for usage.
- Event bus — in-process publisher → outbox → workers.
- API shape — REST + JSON, cursor pagination, single error envelope, idempotency keys.
- Frontend state — TanStack Query for server state, Zustand for UI, never mix.
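Of these decisions, cursor pagination is the one most often implemented inconsistently, so a small sketch may help. It assumes rows are already sorted by id ascending and that the cursor is simply the last id returned; the CursorPage shape and the next_cursor field name are illustrative. In SQL the same idea is a WHERE id > cursor with ORDER BY id and a LIMIT.

```typescript
interface CursorRow { id: string }

interface CursorPage<T> {
  data: T[];
  next_cursor: string | null; // null means there are no more pages
}

// Rows must already be sorted by id ascending. Time-sortable ids (such as
// ULIDs) make that the same as creation order.
function paginate<T extends CursorRow>(
  rows: T[],
  limit: number,
  cursor: string | null
): CursorPage<T> {
  const size = Math.max(1, limit);
  // Start after the cursor; a null cursor means the first page.
  const start = cursor === null ? 0 : rows.findIndex((r) => r.id > cursor);
  if (start === -1) return { data: [], next_cursor: null };
  const data = rows.slice(start, start + size);
  const hasMore = start + size < rows.length;
  return { data, next_cursor: hasMore ? data[data.length - 1].id : null };
}
```

Unlike offset pagination, a row inserted or deleted between requests cannot cause the client to skip or repeat items.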
⚙️ Config defaults
| Setting | Default |
|---|---|
| Session TTL (cookie) | 14 days, sliding |
| JWT access token TTL | 15 min |
| Refresh token TTL | 30 days |
| API rate limit | 100 req/min/IP, 1000 req/min/workspace |
| File upload max | 100 MB |
| Idempotency cache TTL | 24 h |
| Trial length | 14 days |
| Soft-delete grace period | 30 days |
| Audit log retention | 7 years |
| Activity feed retention | 6 months |
| GDPR data export TTL | 7 days from generation |
| Workspace slug regex | [a-z0-9-]{3,40} |
| Password min length | 12 chars (or zxcvbn score ≥ 3) |
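Two of these defaults are easy to get subtly wrong in code. The slug pattern only validates the whole string when it is anchored with ^ and $, and a sliding session means the expiry moves on every authenticated request. The constant and function names in this sketch are hypothetical.

```typescript
// Anchored version of the slug pattern from the table. Without ^ and $,
// a string like "Acme Corp!" would pass because "cme" matches inside it.
const WORKSPACE_SLUG_RE = /^[a-z0-9-]{3,40}$/;

function isValidWorkspaceSlug(slug: string): boolean {
  return WORKSPACE_SLUG_RE.test(slug);
}

const SESSION_TTL_MS = 14 * 24 * 60 * 60 * 1000; // 14 days

// Sliding expiry: each authenticated request pushes the deadline forward
// from the time of that request, not from the original login.
function slideSessionExpiry(lastSeenMs: number): number {
  return lastSeenMs + SESSION_TTL_MS;
}
```

If slugs double as subdomains, it may be worth additionally rejecting leading and trailing hyphens, which this pattern still allows.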
🚫 Hard rules (non-negotiable)
- Every tenant-scoped query filters by workspace_id.
- Every privileged action writes to audit_log.
- Every email obeys per-user notification preferences.
- Every webhook handler is idempotent.
- Every form input is validated server-side (Zod / pydantic / typed structs).
- Every secret is in a secrets manager, not in env in prod.
- Every public endpoint has a rate limit.
- Every payment side effect goes through Stripe webhooks, not the request path.
- Every long-running task is in a job queue.
- WS events invalidate Query cache; they never write directly to stores.
- Migrations are append-only.
- Admin actions are audit-logged with the staff member as actor.
- Feature flags wrap any risky new behavior.
- File uploads bypass the API server (signed S3 URLs).
- No WHERE clause in SQL is built via string concatenation.
- New tables follow the convention: id, workspace_id, created_at, updated_at.
📐 The canonical resource shape (REST)
{
"id": "01HMZQ...",
"workspace_id": "01HMW1...",
"name": "Project Alpha",
"status": "active",
"created_at": "2026-04-30T10:00:00Z",
"updated_at": "2026-04-30T10:00:00Z",
"created_by": { "type": "user", "id": "01HM..." }
}
🎭 The polymorphic-actor pattern
created_by_type TEXT CHECK (created_by_type IN ('user','api_key','system')),
created_by_id UUID
Use this on every "actor" field. It lets you treat agents, integrations, and humans uniformly without parallel schemas.
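On the application side, the two columns map naturally onto a discriminated union, which is also the shape of created_by in the canonical resource above. The type and function names below are assumptions, as is the choice to allow a null id for the system actor.

```typescript
type ActorType = "user" | "api_key" | "system";

// Shape used in API responses, e.g. "created_by": { "type": "user", "id": "..." }.
interface ActorRef { type: ActorType; id: string | null }

// Shape of the two database columns.
interface ActorColumns { created_by_type: string; created_by_id: string | null }

function isActorType(value: string): value is ActorType {
  return value === "user" || value === "api_key" || value === "system";
}

// Convert a database row into the API shape, rejecting unknown actor types
// in the same way the CHECK constraint would.
function toActorRef(row: ActorColumns): ActorRef {
  if (!isActorType(row.created_by_type)) {
    throw new Error(`unknown actor type: ${row.created_by_type}`);
  }
  return { type: row.created_by_type, id: row.created_by_id };
}

// One code path renders every kind of actor, for example in an activity feed.
function describeActor(actor: ActorRef): string {
  switch (actor.type) {
    case "user":
      return `user ${actor.id}`;
    case "api_key":
      return `API key ${actor.id}`;
    case "system":
      return "system";
  }
}
```

Adding a new actor kind then means extending the CHECK constraint and the union together, and the compiler flags every switch that needs a new case.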
🔑 Environment variables baseline
APP_ENV=production # dev | staging | production
APP_URL=https://app.yourtool.com
PUBLIC_URL=https://yourtool.com
DATABASE_URL=postgres://...
REDIS_URL=redis://...
JWT_SECRET=<32-byte-random>
SESSION_SECRET=<32-byte-random>
COOKIE_DOMAIN=.yourtool.com
STRIPE_SECRET_KEY=sk_live_...
STRIPE_WEBHOOK_SECRET=whsec_...
PAYPAL_CLIENT_ID=... # optional, secondary payment method
PAYPAL_CLIENT_SECRET=...
PAYPAL_WEBHOOK_ID=...
# Object storage (S3 / Cloudflare R2 / Supabase Storage — pick one)
S3_BUCKET=...
S3_REGION=...
S3_ENDPOINT=... # set for R2 / Supabase / MinIO
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
# Auth (pick the block matching your provider)
# --- Casdoor (self-hosted IAM)
CASDOOR_ENDPOINT=https://auth.yourtool.com
CASDOOR_CLIENT_ID=...
CASDOOR_CLIENT_SECRET=...
CASDOOR_ORG=yourtool
CASDOOR_APP=app
# --- Ory Kratos (self-hosted)
KRATOS_PUBLIC_URL=https://auth.yourtool.com
KRATOS_ADMIN_URL=http://kratos:4434
# --- Supabase Auth
SUPABASE_URL=https://xyz.supabase.co
SUPABASE_ANON_KEY=...
SUPABASE_SERVICE_ROLE_KEY=...
# --- WorkOS / Clerk
WORKOS_API_KEY=...
CLERK_SECRET_KEY=...
# Eventing
NATS_URL=nats://nats:4222 # if using NATS JetStream
NATS_STREAM=app-events
RESEND_API_KEY=...
EMAIL_FROM="YourTool <hi@yourtool.com>"
SENTRY_DSN=...
POSTHOG_KEY=...
POSTHOG_HOST=https://app.posthog.com
OPENAI_API_KEY=... # optional, if you have AI features
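A baseline like this is only useful if the app refuses to boot when it is incomplete. The sketch below assumes a particular subset of the variables is mandatory and adds one cross-check (a live Stripe key outside production); both choices, and the validateEnv name, are illustrative assumptions.

```typescript
// Assumed to be mandatory in every environment; adjust per product.
const REQUIRED_ENV = [
  "APP_ENV",
  "APP_URL",
  "DATABASE_URL",
  "REDIS_URL",
  "JWT_SECRET",
  "SESSION_SECRET",
  "STRIPE_SECRET_KEY",
  "STRIPE_WEBHOOK_SECRET",
];

const APP_ENVS = ["dev", "staging", "production"];

// Returns a list of problems; callers would pass the process environment at
// boot and exit if the list is non-empty.
function validateEnv(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  for (const key of REQUIRED_ENV) {
    const value = env[key];
    if (value === undefined || value.trim() === "") problems.push(`${key} is missing`);
  }
  const appEnv = env["APP_ENV"];
  if (appEnv !== undefined && appEnv !== "" && APP_ENVS.indexOf(appEnv) === -1) {
    problems.push(`APP_ENV must be one of ${APP_ENVS.join(", ")}`);
  }
  // Guard against pointing a non-production deploy at live billing.
  const stripeKey = env["STRIPE_SECRET_KEY"];
  if (appEnv !== "production" && stripeKey !== undefined && stripeKey.indexOf("sk_live_") === 0) {
    problems.push("STRIPE_SECRET_KEY is a live key outside production");
  }
  return problems;
}
```

Running this as the first step at startup turns a missing variable into one clear boot error rather than a runtime failure in some rarely hit code path.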
🎯 KPIs to track from day one
- Sign-ups / week
- Activation rate (signed up → activated)
- Free → paid conversion rate
- MRR / ARR
- Net revenue retention (NRR)
- Logo churn
- DAU / WAU / MAU
- p95 API latency
- Error rate
- NPS
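Several of these KPIs are simple ratios, and it helps to pin the definitions down in code early. The sketch below uses common definitions: NRR as starting MRR plus expansion minus contraction and churn, divided by starting MRR, measured over the same customer cohort. The function and field names are assumptions.

```typescript
// Share of signed-up workspaces that reached the activation event.
function activationRate(signups: number, activated: number): number {
  return signups === 0 ? 0 : activated / signups;
}

interface MrrMovement {
  startingMrr: number; // MRR at period start from the existing cohort
  expansion: number;   // upgrades and added seats from that cohort
  contraction: number; // downgrades from that cohort
  churned: number;     // MRR lost to cancellations in that cohort
}

// Net revenue retention; values above 1 mean the cohort grew on its own.
function netRevenueRetention(m: MrrMovement): number {
  if (m.startingMrr === 0) return 0;
  return (m.startingMrr + m.expansion - m.contraction - m.churned) / m.startingMrr;
}

// Logo churn: share of customers (not revenue) lost during the period.
function logoChurn(customersAtStart: number, customersLost: number): number {
  return customersAtStart === 0 ? 0 : customersLost / customersAtStart;
}
```

For example, a cohort that starts at 10,000 in MRR, expands by 1,500, contracts by 500, and churns 1,000 ends the period with an NRR of exactly 1.0.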
💭 Closing Thought
A great SaaS template is opinionated about everything that doesn't matter to the customer, and flexible about everything that does.
- Auth, billing, tenancy, observability, admin → opinionated, baked-in.
- Domain models, UI flows, branding, pricing → flexible, configurable.
The discipline: every time you find yourself solving the same infrastructure problem in a new product, that solution belongs in the template. Every time you find yourself solving a different domain problem, that work belongs in product/.
If you internalize §5 (Multi-Tenancy), §9 (Billing), §19 (Security), and the §32 kernel/product split, the rest of this playbook becomes a detailed checklist you can execute over 6–8 weeks to ship a real, professional, reusable SaaS foundation.
Now go build.
If you found this helpful, let me know by leaving a 👍 or a comment. If you think this post could help someone, feel free to share it. Thank you very much! 😃
All rights reserved