🏛️ The Solution Architect Playbook 📚: From Best Designer to Best Bridge - Part 2 🌉

A deep, opinionated, practical guide for the engineer-architect who designs end-to-end solutions across systems, teams, and business units. The mental models, decision frameworks, discovery tactics, design methods, communication patterns, and anti-patterns that separate the SA whose solutions actually ship and run for years from the one whose 80-page Visio decks gather dust on Confluence. Grounded in current reality — multi-cloud by default, AI woven into every solution, smaller delivery teams per dollar of revenue, regulated by frameworks that didn't exist five years ago, and customers who can read a SOC 2 report.

If you read only four sections first, read §2 Mindset, §6 Discovery, §9 NFRs, and §13 Build vs Buy. Everything else is the implementation of those four.

Companion to 🧑‍💻 The Tech Lead Playbook: From Best IC to Multiplier 🚀 (the team-level role), 👨‍💻 The CTO Playbook 📘: From Best Builder to Best Bet ♟️ (the org-level role), 🏛️ The System Design Playbook 📖 (the design vocabulary), 🛠️ The Senior Software Engineer Playbook 📖: From Good Coder to High-Impact Engineer 🚀 (deep IC craft), [🤖 The AI SaaS Playbook (Practical Edition)📘](https://dev.to/truongpx396/the-ai-saas-playbook-practical-edition-33lb) (AI overlay), and 🚀 The SaaS Template Playbook 📖 (delivery foundations). This one is for the technical professional who is accountable for a solution end-to-end across systems, teams, and stakeholders, whether at a consulting firm, cloud vendor, ISV, or in-house enterprise team.


📋 Table of Contents

  1. ⚡ Read This First
  2. 🧠 The Solution Architect Mindset
  3. 🎭 The SA Landscape: Five Archetypes
  4. 🪜 SA vs TL vs Software Architect vs EA vs CTO
  5. 🚪 The First 90 Days
  6. 🔍 Discovery: The Real Job Begins Here
  7. 📐 Solution Design Methodology
  8. 🗂️ Documenting a Solution: C4, ADRs, arc42
  9. 🎯 Non-Functional Requirements: The Real Job
  10. ☁️ Cloud Architecture (AWS, Azure, GCP, Multi)
  11. 🔌 Integration Architecture
  12. 🗄️ Data & AI Architecture
  13. ⚖️ Build vs Buy vs Customize
  14. 🛒 Vendor Evaluation & Selection
  15. 💰 Cost & TCO Modeling
  16. 🛡️ Security, Compliance & Risk
  17. 🚚 Migration Architecture: 6Rs and Beyond
  18. 💬 Communication: Diagrams, Documents, Presentations
  19. 🤝 Stakeholder Management
  20. 🤵 Pre-Sales SA: The Consultative Sale
  21. 🛠️ Post-Sales SA: Delivery Architecture
  22. 🚀 Working with Delivery Teams
  23. ⏱️ The Operating Cadence
  24. 🤖 AI in the SA Role
  25. 🧰 Tools of the Trade
  26. ⚠️ The SA Anti-Pattern Catalog
  27. 🗺️ The Phased Roadmap (Day 1 → Year 5)
  28. 📋 Cheat Sheet & Resources

Sections 1–9: read Part 1 here https://viblo.asia/p/the-solution-architect-playbook-from-best-designer-to-best-bridge-part-1-13VM9D2QVY7

10. ☁️ Cloud Architecture (AWS, Azure, GCP, Multi)

The default substrate for solution architecture today is the cloud. You will design for at least one and increasingly for more than one. Six things to get right.

10.1 The cloud-provider choice (one-way door)

The single most consequential ADR you'll write on most solutions. Drivers, in roughly this order:

  1. What the customer already runs. Skill, contracts, operating model. A 5-year AWS shop is rarely best served switching.
  2. Regulatory residency. Some regions are only on some clouds. Some governments only certify some clouds.
  3. Native services that matter. BigQuery is in GCP. Active Directory and Microsoft 365 integration favor Azure. SageMaker, EKS-with-Fargate, deep AI/ML breadth favor AWS.
  4. Pricing posture. Reserved instance / commitment discounts you've already negotiated.
  5. Specific service maturity. Vector DB, identity-aware proxy, managed Kubernetes, edge compute, etc.

Multi-cloud as default = mistake. Cost doubles, ops complexity quadruples, the team gets shallow on both. Multi-cloud for specific reasons (DR for a single critical workload, regulatory mandate, cost arbitrage on egress, vendor avoidance) — fine. Decide deliberately.

10.2 The Well-Architected lens

Each major cloud publishes a Well-Architected Framework (AWS Well-Architected, Azure Well-Architected, Google Cloud Architecture Framework). They're surprisingly good. AWS names six pillars; the other clouds have close equivalents:

  1. Operational Excellence — runbooks, IaC, observability, change management.
  2. Security — IAM, encryption, network segmentation, secrets, audit.
  3. Reliability — failure modes, recovery, multi-AZ/region, capacity headroom.
  4. Performance Efficiency — sizing, latency, scaling, hot-spots.
  5. Cost Optimization — sizing, reservations, lifecycle, FinOps.
  6. Sustainability — efficiency, region selection, lifecycle.

Run a Well-Architected review at design milestone, mid-delivery, and pre-go-live. Most cloud vendors will run one for free if you're a meaningful spender — take them up on it.

10.3 Landing zone and shared platform

A landing zone is the foundation: account/subscription structure, network, identity, logging, billing, baseline security. Don't reinvent it; use the vendor's reference (AWS Control Tower, Azure Landing Zones, GCP Cloud Foundation). For solution architects, two things matter:

  • Don't be the one designing the landing zone for a single solution. It's a multi-solution foundation. Coordinate with the platform team / EA. If there is no landing zone, raise it as a project-level risk.
  • Inherit, don't fight. If the landing zone forces a tagging schema, IAM boundary, network topology, work within it. Solutions that fight the landing zone get vetoed.

10.4 Compute model

The default decision tree, in order of preference:

  1. Managed serverless (Lambda/Functions/Cloud Run) — cheap, simple, scales to zero. Default for low-medium load, event-driven, async workloads. Limits: cold starts, runtime, vendor lock surface.
  2. Managed containers (ECS Fargate, Azure Container Apps, GKE Autopilot, Cloud Run): solid middle ground. Reasonable lock-in if you stick to standard container images.
  3. Kubernetes clusters you operate yourself (EKS, AKS, GKE Standard): the control plane is managed, but nodes, upgrades, and add-ons are yours. Only if you have the team. Yes, "we'll learn it" is a lie when the team is 6 people.
  4. VMs — only when there's a specific reason (license, kernel module, vendor support).

Anti-pattern: defaulting to Kubernetes. Kubernetes is a power tool. It's correct when you have ≥10 services, a platform team, and stable deployment patterns. It's wrong on day 1 of a 4-service product with no platform team — Cloud Run / Fargate / Container Apps win there.
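
The preference order and the Kubernetes thresholds above can be sketched as a small function. This is one possible reading of the text, not a definitive rule: the `Workload` fields and the returned labels are hypothetical names introduced for illustration, and a real decision would weigh more inputs (cold-start tolerance, runtime limits, existing skills).

```python
from dataclasses import dataclass


@dataclass
class Workload:
    service_count: int
    has_platform_team: bool
    stable_deploy_patterns: bool
    needs_vm: bool = False               # license, kernel module, vendor support
    event_driven_or_async: bool = False
    low_to_medium_load: bool = True


def pick_compute(w: Workload) -> str:
    """Return the first compute model that fits, following the text's order."""
    if w.needs_vm:
        # A hard constraint overrides the preference order.
        return "vms"
    if w.event_driven_or_async and w.low_to_medium_load:
        return "managed-serverless"
    if w.service_count >= 10 and w.has_platform_team and w.stable_deploy_patterns:
        # Only here does operating Kubernetes yourself become defensible.
        return "kubernetes"
    return "managed-containers"
```

For example, `pick_compute(Workload(service_count=4, has_platform_team=False, stable_deploy_patterns=False))` lands on managed containers, which matches the "day 1 of a 4-service product" case in the anti-pattern.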

10.5 Network and identity

Two areas SAs underestimate, and that auditors and incidents both punish.

  • Network: VPC layout, subnetting, peering, transit gateway / hub-spoke, private endpoints, egress control. Egress is the blind spot — most data exfiltration paths are egress-shaped, and egress is also a major cost line.
  • Identity: workload identity (instance profiles, managed identities, workload identity federation) > static keys, every time. Human identity through SSO/IdP only — no shared admin accounts. Service-to-service: short-lived tokens, mTLS, or workload identity. Never use long-lived credentials in production.

A solution that gets identity right almost always clears the security review on the first pass. A solution that gets identity wrong almost always gets blocked in week 2.

10.6 Multi-cloud, hybrid, and edge

  • Multi-cloud for a single workload: rarely correct, almost never worth the operational cost. Exception: regulated workloads or strategic vendor avoidance.
  • Multi-cloud at the portfolio level: common in enterprises (CRM in one, data lake in another). Solution architect for one solution still picks one cloud; the EA owns the portfolio.
  • Hybrid (cloud + on-prem): legitimate for legacy + regulated systems. Design the boundary carefully — direct connect, identity federation, data sync.
  • Edge / point-of-sale / IoT: a different design — intermittent connectivity, local data, conflict resolution, OTA updates. Bring an edge specialist; this is its own discipline.

11. 🔌 Integration Architecture

Where systems meet is where projects fail. Integration is the most underestimated part of a solution, often by a factor of 2–3×. Spend disproportionate time here.

11.1 Integration styles, picked deliberately

| Style | Best for | Avoid when |
|---|---|---|
| Synchronous REST / gRPC | Request/response, low latency, strong contract | High-fanout, long-running, brittle dependencies |
| Asynchronous events (pub/sub, Kafka, EventBridge, Service Bus) | Decoupling, fan-out, audit trail, replay | Strict ordering across topics, instant consistency required |
| Message queues (SQS, RabbitMQ) | Worker pools, retries, backpressure | Pub/sub patterns (use topic) |
| Batch / file drop | Legacy, bulk, regulatory data exchange | Real-time needs |
| Database integration (shared DB) | Almost never | Almost always: coupling at the data layer is the worst kind |
| API gateway aggregation | BFF for mobile/web | Backend-to-backend (just call directly) |
| Webhooks | Outbound notifications to partners | Internal use: too brittle for retries/auth |
| CDC (change data capture) | Replicating data without writing client code | Real-time business logic (events are better) |

Default rule: synchronous within a service boundary, asynchronous across service boundaries. Async-everywhere is over-engineering; sync-everywhere is brittle.

11.2 Contracts: the integration's NFRs

Every integration is a contract. Document it explicitly:

  • Schema: OpenAPI / AsyncAPI / Protobuf. Versioned. Stored in a shared registry.
  • Compatibility policy: backward-compatible always; breaking changes go through a deprecation window.
  • SLA: latency, availability, error rate. Both sides sign.
  • Auth: OAuth/OIDC scope, mTLS cert, service account. Documented.
  • Idempotency: are repeated calls safe? With what key?
  • Retry policy: exponential backoff, max attempts, jitter, dead-letter destination.
  • Rate limits: documented; both sides aware.
  • Failure semantics: what do consumers see when this is down? Cached? Errored? Skipped?

A common failure: each team having their own opinion of the contract. The SA's job is to make the contract canonical, schema-checked, and version-controlled. Everything else flows from that.
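
The retry-policy line of the contract (exponential backoff, max attempts, jitter, dead-letter destination) can be made concrete with a short sketch. The defaults below (`base`, `cap`, `max_attempts`) are illustrative assumptions, not recommended values, and the "full jitter" variant is just one common choice. `RetriesExhausted` is a hypothetical signal for the caller to route the message to its dead-letter destination. As the contract says, this only makes sense for idempotent calls.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Every attempt failed; the caller should dead-letter the message."""


def backoff_delays(max_attempts: int, base: float = 0.2, cap: float = 10.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """One delay per failed attempt except the last, with full jitter."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** i)) for i in range(max_attempts - 1)]


def call_with_retry(fn: Callable[[], T], max_attempts: int = 4,
                    sleep: Callable[[float], None] = time.sleep,
                    rng: Optional[random.Random] = None) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, rng=rng)
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise RetriesExhausted(str(exc)) from exc
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Injecting `sleep` and `rng` keeps the policy testable without real waiting; in the contract document, the same numbers (base, cap, attempts) are what both sides sign.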

11.3 Patterns for unreliable upstreams

You will integrate with a system that breaks more often than yours can tolerate. Apply patterns:

  • Circuit breaker: stop calling a degraded service after a threshold; back off.
  • Bulkhead: isolate threadpools/connections per upstream so one slow upstream doesn't drag the rest.
  • Retry with backoff + jitter: idempotent calls only.
  • Timeout, always: no unbounded calls, ever. Set p99-budget-aware timeouts.
  • Cache with TTL (or stale-while-revalidate): tolerate brief upstream outages with served-stale.
  • Dead-letter queue + alarm: failed messages go somewhere you can replay them.
  • Compensating transaction (Saga): for distributed flows that can't be a single transaction.

Each pattern has a cost (latency, complexity, eventual consistency). Apply them where the upstream merits, not by default.
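
As an illustration of the first pattern in the list, here is a minimal circuit breaker sketch. The threshold and cool-down values are placeholder assumptions, the class is not thread-safe, and production systems usually take this from a resilience library rather than hand-rolling it.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling the upstream while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("upstream marked degraded; backing off")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the breaker.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result
```

The cost named in the text shows up directly: while the breaker is open, callers get an immediate error and need a fallback (cache, default, or a clear failure message).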

11.4 The data contract

Increasingly the most under-defined part of integrations. Data contract = schema + semantics + freshness + ownership + retention + classification.

Examples:

  • "The customers.id field is a UUID v4 owned by the CRM team. Never mutated. Mapped to legacy cust_no only at the boundary."
  • "The orders topic is at-least-once with idempotency key order_id. Schema in registry. Compatibility: backward-compatible. Retention: 7 days for replay."
  • "The pii fields in the events stream are tokenized at source; raw values only available via the Identity Service with audit-logged lookup."

Without explicit data contracts, integrations rot. Every addition has to ask "is this safe?" and the answer is folklore. With them, the answer is in the registry.
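
The second example contract (at-least-once delivery with idempotency key `order_id`) implies that every consumer must tolerate duplicates. A minimal sketch of such a consumer follows; the class name and the in-memory `seen` set are illustrative assumptions, and a real consumer would keep the seen keys in a durable store whose retention covers the replay window.

```python
from dataclasses import dataclass, field


@dataclass
class OrderConsumer:
    """Applies each order at most once, keyed on order_id."""
    seen: set = field(default_factory=set)      # stand-in for a durable key store
    applied: list = field(default_factory=list)

    def handle(self, message: dict) -> bool:
        order_id = message["order_id"]
        if order_id in self.seen:
            return False                         # duplicate delivery, safely ignored
        self.applied.append(message)
        self.seen.add(order_id)
        return True
```

Delivering the same message twice changes state once: the first `handle` call returns `True`, the replayed one returns `False`.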

11.5 Integration platforms (iPaaS) and ESBs

Be honest:

  • iPaaS (Workato, Mulesoft, Boomi, Azure Logic Apps, AWS AppFlow, Tray) shines for citizen-developer style integrations, SaaS-to-SaaS, low-volume, low-business-criticality. Bad for high-volume, transactional, latency-sensitive, programmable workflows.
  • ESB is largely a legacy term. If your customer has one, you'll work with it; if they don't, don't introduce one.

Default to direct event/REST integration with a registry. Reach for iPaaS for SaaS-stitching, not for the core path.


12. 🗄️ Data & AI Architecture

Data is half of every solution; AI is increasingly half of every data solution. Three sub-architectures matter: operational data, analytical data, and AI/ML.

12.1 The operational data plane

The OLTP store(s) for the solution. Decisions:

  • Polyglot persistence vs single store. Default to a single primary store unless the access pattern demands otherwise. PostgreSQL handles 80% of cases (relational, JSONB, full-text, geo, vector with pgvector). DynamoDB handles single-digit-ms key-value at scale. Specialized stores (Redis for cache, Elastic/OpenSearch for search, time-series DB for metrics) bolted on as needed.
  • Schema ownership. One team owns the schema. No two teams write to the same table. Cross-team reads via API or replicated views.
  • Migrations. Online, backward-compatible, and staged rather than big-bang (add → backfill → switch read → switch write → remove). Documented in ADRs.

12.2 The analytical data plane

Where reporting, dashboards, ML training, and ad-hoc analysis live. The current default stack:

  • Lakehouse (S3/ADLS/GCS + Delta Lake / Iceberg / Hudi) as the storage substrate.
  • Warehouse (Snowflake / BigQuery / Redshift / Databricks SQL) on top, or as the primary for many use cases.
  • Streaming (Kafka / Kinesis / Pub-Sub) for real-time pipelines.
  • dbt as the SQL transformation backbone.
  • Reverse-ETL (Hightouch / Census) to push warehouse data back to operational SaaS tools.

The SA's job is not to design the entire data platform — that's a Data Architect's job. Your job is to:

  1. Decide what data the operational solution emits (events, CDC, snapshots) and at what cadence.
  2. Decide what data the operational solution consumes from the warehouse and how (reverse-ETL, scheduled fetch).
  3. Negotiate data contracts at the boundary (see §11.4).
  4. Ensure PII / regulated data is handled per policy on both sides of the boundary.

12.3 AI / ML in the solution

Today, almost every solution has an AI component. Four patterns dominate:

| Pattern | When to use | Build cost | Operational cost |
|---|---|---|---|
| LLM API call (OpenAI, Anthropic, Google) | Most NL / generation tasks | Low | Per-token, predictable |
| RAG (Retrieval-Augmented Generation) | Q&A over private content, customer support | Medium | Per-token + vector DB |
| Fine-tuned / hosted small model | Domain-specific NLP at scale, latency-sensitive, data-sovereign | High | Compute reservation |
| Custom ML pipeline | Predictive (churn, fraud, recommendation) | Highest | Training + inference + monitoring |

Most "AI in the solution" requirements should default to LLM API + RAG, unless data sovereignty, latency, or volume forces otherwise. See 🤖 The AI SaaS Playbook (Practical Edition)📘 for the depth.

Key design points the SA owns:

  • Data flow to/from the model: what leaves your boundary? Logged where? Retained how long?
  • Prompt strategy: stored where, versioned how, evaluated how?
  • Evaluation harness: how do we know it's still working? Golden sets, online evals, human review.
  • Cost guardrails: per-tenant token budget, prompt size caps, model fallback to cheaper tier.
  • Failure mode: when the model is slow/down/wrong, what does the user see? (Increasingly: the most critical question.)
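
The cost-guardrail bullet (per-tenant token budget, prompt size caps, fallback to a cheaper tier) can be sketched as a routing function. All constants and model names below are hypothetical placeholders, and the 80% fallback point is an assumption chosen for illustration.

```python
from dataclasses import dataclass

MAX_PROMPT_TOKENS = 8_000        # hypothetical prompt size cap
PRIMARY_MODEL = "large-model"    # placeholder tier names, not real model IDs
FALLBACK_MODEL = "small-model"
FALLBACK_AT = 0.8                # assumed: switch to the cheaper tier at 80% of budget


@dataclass
class TenantBudget:
    monthly_token_limit: int
    used_tokens: int = 0


def route_request(budget: TenantBudget, prompt_tokens: int) -> str:
    """Return the model tier to use, or a rejection reason."""
    if prompt_tokens > MAX_PROMPT_TOKENS:
        return "reject:prompt-too-large"
    if budget.used_tokens + prompt_tokens > budget.monthly_token_limit:
        return "reject:budget-exhausted"
    budget.used_tokens += prompt_tokens
    if budget.used_tokens >= FALLBACK_AT * budget.monthly_token_limit:
        return FALLBACK_MODEL
    return PRIMARY_MODEL
```

The rejection branches are where the failure-mode question gets answered: the design has to say what the user sees when a request is refused or downgraded.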

12.4 Vector stores and embeddings

For RAG and semantic search, you'll pick a vector store. Three tiers:

  • Embedded (pgvector on Postgres, sqlite-vec): default for ≤10M vectors and where you already have the DB.
  • Managed (Pinecone, Weaviate Cloud, Qdrant Cloud, Vertex Vector Search, MongoDB Atlas Vector Search): default for >10M vectors or when latency targets demand it.
  • Self-hosted at scale (Milvus, Vespa): only when you have a platform team and a reason.

Don't reach for a dedicated vector store on day 1. pgvector serves until you have data showing you've outgrown it.

12.5 Data residency and sovereignty

Increasingly mandatory and increasingly hard. Three rules:

  1. Map data classes early. What's PII? Health data? Financial? Regulated by which jurisdiction?
  2. Default to single-region for regulated data. Multi-region adds replication paths the regulator will scrutinize.
  3. Include AI calls in the residency map. Many AI providers run inference in specific regions. "Calls to LLM cross the EU boundary" is a finding waiting to happen. Use region-pinned endpoints; many providers offer them now.

13. ⚖️ Build vs Buy vs Customize

The single biggest cost lever in any solution. Wrong here = wasted years. Right here = hire fewer engineers, ship faster, focus on the differentiator.

13.1 The framework

Apply this in order, for every meaningful capability in the solution:

  1. Is it a strategic differentiator? If yes (the thing customers buy us for), build. If no, default to buy/reuse.
  2. Is there a mature off-the-shelf option? If yes, score it (see §14). If no, build.
  3. Is there a viable open-source option we can self-host? Score: TCO of self-hosting vs SaaS pricing.
  4. Is the cost of switching low (two-way door)? If yes, buy. If no, slow down — vendor lock-in is expensive.
  5. Does our team have the skill to operate the build option? If no, default to buy unless we're prepared to hire.
  6. What's the time-to-value difference? If "buy = 8 weeks, build = 9 months," that's usually decisive.

Note the order: the question "is this a differentiator?" comes first. Most teams build the wrong thing first — they build the auth system, the CMS, the ticketing system — none of which differentiate them — and starve the differentiator of time.

13.2 The classic "always buy" list

Capabilities that are almost always wrong to build today:

  • Authentication / SSO / IdP (Auth0, Cognito, Entra, Okta, WorkOS)
  • Email / transactional messaging (Postmark, SendGrid, Resend, SES)
  • Payments (Stripe, Adyen, Braintree)
  • Logging / observability platform (Datadog, New Relic, Grafana Cloud, Honeycomb)
  • Error tracking (Sentry, Rollbar)
  • Analytics (Amplitude, Mixpanel, PostHog)
  • Search infrastructure (Algolia, OpenSearch managed)
  • File storage (S3 / equivalent)
  • Customer support (Zendesk, Intercom, HelpScout)
  • Status pages (Statuspage.io)
  • DAM, CDN, WAF, DDoS — all categories where infrastructure providers excel

Building any of these requires a written justification. The default is buy. The bias is strongly toward buy.

13.3 The classic "consider build" list

Capabilities where build is more often correct:

  • The core product surface (your differentiator)
  • Domain-specific data models that no SaaS product expresses
  • Workflow / orchestration of your business processes
  • Customer-facing UX (you're the brand)
  • Pricing engine, recommendation engine, ranking model — where your data is the moat
  • Multi-tenant isolation, residency, audit — when SaaS options can't meet your specific compliance posture

13.4 The "customize" trap

A vendor offers a platform you can heavily customize (Salesforce, ServiceNow, Pega, Microsoft Dynamics, low-code platforms). The trap: you start with "10% customization" and end with a 100-FTE practice maintaining a snowflake. Customization budget compounds.

Rules:

  • Be ruthless about what you customize. Workflows: yes. UI: maybe. Data model: only if forced. Core engine: never.
  • Time-box customization investment. Set an explicit budget (FTE-years and dollars) and revisit annually.
  • Plan an exit strategy. Even if you never use it, know how you'd leave. The vendor's roadmap is not yours.

13.5 The TCO comparison

Always quantify, always over 3 years. Don't compare list price; compare full TCO.

| Cost component | Build | Buy SaaS | Self-host OSS |
|---|---|---|---|
| Build / setup | 8–12 FTE-months | 1–2 FTE-months | 2–4 FTE-months |
| Annual licenses | 0 | $X/seat × N | 0 |
| Annual ops | 1–2 FTE | 0.1 FTE | 0.5–1 FTE |
| Cloud infra | $A/yr | usually included | $B/yr |
| Y3 cost | rapid growth | scales with usage | sub-linear |
| Risk | schedule, attrition, scope | vendor, lock-in, price | community, security, ops |

A common trap: comparing "build cost" (engineers building) vs "SaaS cost" (license fee), forgetting the build option carries lifetime ops + maintenance + team-context cost too. Three-year TCO almost always favors buy for non-differentiator capabilities.
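
To make the comparison concrete, here is a small three-year TCO sketch. Every number in it is a hypothetical stand-in: the FTE figures are midpoints of the ranges in the table, while the loaded FTE cost, license fee, and infra costs are invented for illustration only. It also assumes flat annual costs, so it ignores the Y3 growth shapes the table mentions.

```python
from dataclasses import dataclass

FTE_COST_PER_YEAR = 180_000  # hypothetical loaded cost; substitute your own


@dataclass
class Option:
    name: str
    setup_fte_months: float
    annual_ops_fte: float
    annual_licenses: float = 0.0
    annual_infra: float = 0.0

    def three_year_tco(self, fte_cost: float = FTE_COST_PER_YEAR) -> int:
        setup = self.setup_fte_months * fte_cost / 12
        annual = self.annual_ops_fte * fte_cost + self.annual_licenses + self.annual_infra
        return round(setup + 3 * annual)


# Illustrative inputs only; licenses and infra are made-up dollar amounts.
OPTIONS = [
    Option("build", setup_fte_months=10, annual_ops_fte=1.5, annual_infra=30_000),
    Option("buy-saas", setup_fte_months=1.5, annual_ops_fte=0.1, annual_licenses=60_000),
    Option("self-host-oss", setup_fte_months=3, annual_ops_fte=0.75, annual_infra=20_000),
]

for option in OPTIONS:
    print(f"{option.name:>14}: ${option.three_year_tco():,}")
```

With these made-up inputs the run cost (FTEs operating) dominates the build option, which is exactly the line item the "build vs license fee" comparison tends to leave out.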


14. 🛒 Vendor Evaluation & Selection

You will pick vendors. Often. Do it as a process, not a vibes-based fight in a meeting.

14.1 The funnel

  1. Long list (≥5 vendors): gather from analyst reports (Gartner, Forrester, G2 grids), peer recommendations, your network. The point of a long list is to avoid the availability bias of "the two we already heard about."
  2. Short list (3 vendors): cut on table-stakes — region availability, compliance certifications, integration availability, price band, scale.
  3. RFP / questionnaire: standardized, scored, with same questions to all 3. (See §14.2.)
  4. Proof of concept (PoC): same scenario for all 3, same evaluation rubric, time-boxed.
  5. Reference calls: ≥2 references each, asking the uncomfortable questions (see §14.4).
  6. Commercial negotiation: only after technical decision is made.
  7. Decision: written ADR with the scoring artifact attached.

14.2 The questionnaire (RFP)

A single questionnaire, applied to all 3 vendors. Categories and weights that work in practice:

| Category | Weight | Sample questions |
|---|---|---|
| Functional fit | 25% | Does it cover capabilities X, Y, Z? Demo the workflow A. |
| Non-functional | 20% | SLA, availability, RPO, scale, observability surface |
| Integration | 15% | API quality, OpenAPI, events, SDK languages, rate limits, idempotency |
| Security / compliance | 15% | SOC 2 Type II, ISO 27001, GDPR posture, sub-processors, data residency, MFA, SSO, audit log retention |
| Operability | 10% | Status page, incident transparency, support tier responses, observability into our tenant |
| Roadmap & viability | 5% | Funding stage, customer count, growth, top customers, leadership stability |
| Commercial | 10% | Pricing model, predictability at scale, exit terms, data export, MSA flexibility |

Vendors will resist standardized questionnaires. Insist. "We are evaluating three vendors with the same questionnaire to give you a fair comparison." They comply.
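
Scoring the questionnaire is simple weighted arithmetic. The sketch below uses the category weights from the table; the 0–5 scoring scale and the dictionary keys are assumptions made for illustration.

```python
# Weights taken from the questionnaire table; they sum to 100%.
WEIGHTS = {
    "functional_fit": 0.25,
    "non_functional": 0.20,
    "integration": 0.15,
    "security_compliance": 0.15,
    "operability": 0.10,
    "roadmap_viability": 0.05,
    "commercial": 0.10,
}


def weighted_score(scores: dict) -> float:
    """Combine per-category scores (assumed 0-5 scale) into one number."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"unscored categories: {sorted(missing)}")
    return round(sum(WEIGHTS[c] * scores[c] for c in WEIGHTS), 2)
```

Refusing to score a vendor with a missing category is deliberate: a blank answer should be a visible finding, not a silent zero.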

14.3 The PoC

A 2–4 week structured trial, with the same scenario across all 3 vendors, scored on a published rubric. Hard rules:

  • The customer's engineers run the PoC, with vendor support. Not vendor-led.
  • Time-boxed; the same time box for each vendor.
  • Acceptance criteria written before the PoC starts. Otherwise you'll move the goalposts.
  • Document failures, not just successes — "vendor 2 needed a workaround for our SSO" is a finding.

14.4 The reference call: ask the uncomfortable

Vendors' references are pre-selected; assume they're friendly. Get value anyway by asking:

  • "What's the worst incident you've had with this vendor in the last 18 months? How was it handled?"
  • "What did you wish you'd known before signing?"
  • "What's the next vendor capability that's blocking you?"
  • "How predictable is your bill quarter to quarter?"
  • "If you were starting today, would you choose them again?"
  • "Who else did you evaluate, and why did they lose?"

Ask for one reference not on the vendor's list — usually possible through your network.

14.5 The vendor scorecard (running)

After selection, don't stop scoring. Maintain a running scorecard for any meaningful vendor:

  • SLA met (each month).
  • Incident count and severity.
  • Roadmap items shipped vs promised.
  • Cost trajectory vs forecast.
  • Support responsiveness.

When the scorecard goes red over two quarters, it's time to revisit. Most vendor problems are gradual decline, not sudden death — the scorecard catches them early.

14.6 Lock-in: the four flavors

Not all lock-in is equal. Distinguish:

  • Data lock-in: getting your data out is hard or expensive. The most dangerous. Always negotiate data export terms upfront.
  • Operational lock-in: your team has skilled up and integrated workflows. Costly but survivable.
  • API lock-in: your code calls vendor APIs. Use abstraction at the boundary if the cost of switching matters.
  • Commercial lock-in: pricing escalators, multi-year commits, penalty clauses. Read the contract.

Data lock-in is the deal-breaker. Always have a written, tested, sub-week data export path.
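
For API lock-in, "abstraction at the boundary" can be as small as one interface that your code depends on, with a thin adapter per vendor. The sketch below is illustrative: `EmailSender`, `InMemorySender`, and `send_welcome` are hypothetical names, and the in-memory class stands in for a real vendor adapter.

```python
from typing import List, Protocol, Tuple


class EmailSender(Protocol):
    """The only surface application code is allowed to depend on."""
    def send(self, to: str, subject: str, body: str) -> str: ...


class InMemorySender:
    """Stand-in for a vendor adapter; a real one would call the vendor SDK."""
    def __init__(self) -> None:
        self.outbox: List[Tuple[str, str, str]] = []

    def send(self, to: str, subject: str, body: str) -> str:
        self.outbox.append((to, subject, body))
        return f"msg-{len(self.outbox)}"


def send_welcome(sender: EmailSender, address: str) -> str:
    # Application code sees the interface, never the vendor.
    return sender.send(address, "Welcome", "Thanks for signing up.")
```

Switching vendors then means writing one new adapter instead of touching every call site. The abstraction has its own cost, so apply it where switching cost actually matters.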


15. 💰 Cost & TCO Modeling

If you can't defend the cost, you can't defend the design. SAs who don't model cost don't get to architect — they get overruled. Cost is a first-class design constraint, not a finance afterthought.

15.1 The three-year TCO

Always model three years. Year 1 hides the ramp; Year 3 reveals the steady-state. Categories:

| Category | Y1 | Y2 | Y3 | Notes |
|---|---|---|---|---|
| Cloud infra (compute, storage, network, data transfer) | | | | Usage-based; model 3 scenarios |
| Managed services (DB, queue, cache, CDN) | | | | Mix base + usage |
| SaaS / vendor licenses | | | | Per-seat, per-event, per-tenant |
| AI / LLM API spend | | | | Per-token; sensitivity to volume |
| Build cost (FTEs × loaded cost × duration) | | | | Y1-heavy |
| Run cost (FTEs operating) | | | | Compounding |
| Compliance / audit | | | | Often overlooked |
| Support / training | | | | Often overlooked |
| Hidden: data transfer, snapshot retention, log volume, dev/staging environments | | | | The biggest blind spots |

Sum it. Show base case + optimistic + pessimistic (10× growth). Compare alternatives.

15.2 The cost-per-business-event metric

The most useful unit metric for a solution is cost per business event: per order, per request, per active user, per ML inference, per ticket. Calculate it; it's how you'll defend cost to the business.

Examples:

  • "$0.04 per order, of which $0.02 is database, $0.01 is compute, $0.005 is network, $0.005 is log volume."
  • "$0.18 per support conversation, of which $0.12 is LLM tokens (decreasing with caching), $0.04 is vector DB lookups."
  • "$2.10 per active user per month, dominated by storage and CDN."

When the number changes by 30%, you investigate. When the business asks "what does this cost?" — you have the answer.
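
The first example above can be reproduced with a few lines. The monthly dollar amounts and the order volume below are hypothetical inputs chosen so the result matches the "$0.04 per order" breakdown; the 30% threshold comes from the text.

```python
def cost_per_event(monthly_costs: dict, events: int) -> dict:
    """Split monthly cost lines into a per-event breakdown plus a total."""
    if events <= 0:
        raise ValueError("events must be positive")
    breakdown = {name: cost / events for name, cost in monthly_costs.items()}
    breakdown["total"] = sum(monthly_costs.values()) / events
    return breakdown


def needs_investigation(previous: float, current: float, threshold: float = 0.30) -> bool:
    """True when the unit cost moved by at least the threshold, in either direction."""
    return abs(current - previous) / previous >= threshold


# Hypothetical month: 100,000 orders and four cost lines.
per_order = cost_per_event(
    {"database": 2_000, "compute": 1_000, "network": 500, "log_volume": 500},
    events=100_000,
)
```

Here `per_order["total"]` comes out at $0.04, of which $0.02 is database. Tracking the same breakdown month over month is what makes the 30% check meaningful.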

15.3 Cloud cost levers

  • Right-sizing: most workloads are 30–60% over-provisioned by default. Saves 20–40% almost always.
  • Reserved instances / savings plans: 30–60% off list, for predictable workloads. Budget for the commitment.
  • Spot / preemptible: 60–90% off, for fault-tolerant batch and stateless. Only with the right workload shape.
  • Storage class / lifecycle: hot → infrequent → cold → glacier. Saves 50–95% on cold data.
  • Data transfer: the sneakiest cost. Cross-region, cross-AZ, NAT gateways. Architect to avoid.
  • Log volume: ingestion + storage + retention. Sample, drop, route by class. Often the biggest reduction lever after right-sizing.
  • Idle environments: dev/staging running 24/7 → switch off nights/weekends. Saves 50–70% on those environments.
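
The idle-environment figure is easy to sanity-check. A minimal sketch, assuming the environment only runs on weekdays and that the resources are billed by the hour (storage and other always-on charges keep accruing and are not covered here):

```python
HOURS_PER_WEEK = 24 * 7  # 168


def idle_savings(hours_on_per_weekday: float, weekdays: int = 5) -> float:
    """Fraction of hourly-billed cost saved versus running 24/7."""
    hours_on = hours_on_per_weekday * weekdays
    return 1 - hours_on / HOURS_PER_WEEK
```

Running 12 hours on each of 5 weekdays means 60 of 168 hours, so roughly 64% of the hourly-billed cost disappears; a 16-hour weekday schedule still saves about 52%. Both fall inside the 50–70% range quoted above.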

15.4 FinOps integration

Make the solution FinOps-aware from day 1, not retrofit later:

  • Tagging schema: every resource tagged with application, environment, cost-center, owner, data-class. Without tags, you have a cost line, not a cost story.
  • Budget alerts: at 50%, 80%, 100% of monthly budget, by tag. Alert the owner.
  • Showback / chargeback: monthly cost report by team / tenant / feature. Visibility changes behavior.
  • Anomaly detection: enable cloud-native (AWS Cost Anomaly Detection, equivalents). Catch the runaway batch job in 24h, not 28d.
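
Two of these practices are easy to enforce mechanically. The sketch below checks the tagging schema and the budget thresholds from the list; the function names are hypothetical, and in practice the same checks would usually live in IaC policy and the cloud's budgeting service rather than in application code.

```python
REQUIRED_TAGS = {"application", "environment", "cost-center", "owner", "data-class"}
ALERT_THRESHOLDS = (0.5, 0.8, 1.0)   # 50%, 80%, 100% of monthly budget


def missing_tags(tags: dict) -> set:
    """Required tags that are absent or empty on a resource."""
    return {t for t in REQUIRED_TAGS if not tags.get(t)}


def crossed_thresholds(spend: float, budget: float) -> list:
    """Budget thresholds already reached by month-to-date spend."""
    return [t for t in ALERT_THRESHOLDS if spend >= t * budget]
```

A resource tagged only with `application`, `environment`, and `owner` would be reported as missing `cost-center` and `data-class`; a spend of 850 against a budget of 1,000 has crossed the 50% and 80% alerts.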

15.5 Cost as a design driver

Surface cost in the architecture review. For each major component, attach: (load) × (unit cost) = (monthly cost). When a component is a 40% line item, defend it explicitly. Sometimes the design changes: a $40k/mo component you discovered late might be cheaper in a different topology.
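
The (load) × (unit cost) = (monthly cost) rule and the 40% check can be sketched directly. The component names, loads, and unit prices below are invented for illustration and do not reflect any provider's price list.

```python
def monthly_costs(components: dict) -> dict:
    """components maps name -> (monthly load, unit cost in dollars)."""
    return {name: load * unit_cost for name, (load, unit_cost) in components.items()}


def dominant_items(costs: dict, share: float = 0.40) -> list:
    """Components whose share of the total is at or above the threshold."""
    total = sum(costs.values())
    if total == 0:
        return []
    return [name for name, cost in costs.items() if cost / total >= share]


# Hypothetical design: instance-hours, requests, and GB of egress.
design = {
    "database": (720, 2.0),
    "compute": (1_000_000, 0.0005),
    "cdn": (5_000, 0.08),
}
costs = monthly_costs(design)
to_defend = dominant_items(costs)
```

In this made-up design the database is the only component above the 40% line, so it is the one that needs an explicit defense (or a different topology) in the review.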

A common SA upgrade: bring the FinOps person into the architecture review. They're often hungry to be invited; they'll find waste you missed; the design improves.


16. 🛡️ Security, Compliance & Risk

Security is not a section to bolt on at the end. It's a constraint that touches every box on the diagram. Compliance is the codification of security that somebody (regulator, auditor, customer) checks. Risk is the brutal honest list of what could kill the project.

16.1 Threat modeling — early, with the security team

Run a threat model at the design stage, not at go-live. STRIDE is the workhorse:

  • Spoofing: identity assumption — covered by auth/IAM
  • Tampering: data alteration — covered by integrity, signing
  • Repudiation: deny actions — covered by audit logs
  • Information disclosure: leak — covered by encryption, access control
  • Denial of service: outage — covered by rate limiting, autoscale, isolation
  • Elevation of privilege: getting more rights — covered by least privilege, segmentation

For each component on the C4 L2 diagram, walk STRIDE. Document the controls. The output is a threat model artifact (typically 3–10 pages) the security team signs.
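
The "walk STRIDE per component" step can be bootstrapped as a worksheet. In this sketch the threat-to-control mapping is taken from the list above; the component names and the worksheet fields are hypothetical, and the output is only a starting grid for the session with the security team, not a threat model in itself.

```python
STRIDE_CONTROLS = {
    "Spoofing": "auth / IAM",
    "Tampering": "integrity, signing",
    "Repudiation": "audit logs",
    "Information disclosure": "encryption, access control",
    "Denial of service": "rate limiting, autoscale, isolation",
    "Elevation of privilege": "least privilege, segmentation",
}


def stride_worksheet(components: list) -> list:
    """One row per (component, threat) pair, with the design answer left blank."""
    return [
        {"component": c, "threat": threat, "expected_control": control, "design_answer": ""}
        for c in components
        for threat, control in STRIDE_CONTROLS.items()
    ]


def open_items(worksheet: list) -> list:
    """Rows that still have no documented control."""
    return [row for row in worksheet if not row["design_answer"].strip()]
```

Two containers from the L2 diagram produce twelve rows; the review is done when `open_items` returns an empty list or every remaining row is an accepted, documented risk.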

16.2 The control catalogue (mapped to compliance)

Compliance frameworks (SOC 2, ISO 27001, HIPAA, PCI DSS, FedRAMP, GDPR, NIS2) all reduce to roughly the same set of controls. Map your design against this canonical list:

| Control | What it means in design |
|---|---|
| Identity & access | SSO, MFA, RBAC, least privilege, JIT access for admin |
| Encryption at rest | CMK in KMS, rotated, with audited key access |
| Encryption in transit | TLS 1.2+ everywhere, mTLS for service-to-service |
| Audit logging | Every privileged action logged, immutable, retained per policy |
| Vulnerability management | Image scanning, dependency scanning, periodic pen-test |
| Change management | All changes via PR, reviewed, tested, rolled back-able |
| Backup & recovery | RPO/RTO tested, DR drilled |
| Incident response | Runbooks, on-call, post-mortem culture |
| Data classification | Each data element tagged; PII handled distinctly |
| Vendor / sub-processor management | Inventory, DPAs, security questionnaires |
| Physical / environmental | Cloud provider's responsibility (in shared model) |
| Personnel | Background checks, training, separation procedures (HR / IT) |

The SA's job: ensure the design enables each control. Not necessarily implement them all directly — but never design a solution that prevents a control.

16.3 The shared responsibility model

In cloud, security is shared. The cloud provider secures the substrate; you secure what you build on it. SAs frequently get the line wrong, either claiming AWS does too much or doing AWS's job for them.

A clear breakdown by service tier (illustrative):

  • IaaS (EC2, VMs): provider handles hypervisor, network fabric, physical. You handle OS patching, runtime, app, identity.
  • Managed services (RDS, ECS Fargate): provider handles OS, DB engine. You handle config, IAM, data, app.
  • Serverless (Lambda, Cloud Run): provider handles runtime. You handle code, IAM, secrets, data.
  • SaaS: provider handles almost everything. You handle identity (SSO), data classification, config.

State this explicitly in the security architecture document. Auditors love it. Engineers stop arguing about whose job patching is.

16.4 The risk register — the brutal list

A risk register is the honest list of what could derail this solution. Format:

| ID | Risk | Likelihood | Impact | Owner | Mitigation | Status |
|---|---|---|---|---|---|---|
| R-01 | Vendor X bankrupt within 12 months | M | H | SA | Data export tested, secondary vendor researched | Open |
| R-02 | Key engineer departs before go-live | M | H | EM | Pair-programming, design docs, knowledge transfer plan | Open |
| R-03 | Data residency requirement changes mid-project | L | H | Compliance | Design abstracts region; abstraction tested | Mitigated |
| R-04 | LLM cost grows 5× at 10× usage | M | M | SA | Caching, prompt budget, model fallback | In progress |

Review the register at every steering committee. A risk register that doesn't change is a risk register that's not being maintained. Risks should appear, mitigate, close.
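
For the steering-committee review, it helps to sort the register so the worst unresolved risks come first. The sketch below uses the four example rows; the numeric scale (L=1, M=2, H=3) and the likelihood × impact score are assumptions for illustration, not part of the format above.

```python
from dataclasses import dataclass

LEVEL = {"L": 1, "M": 2, "H": 3}   # assumed numeric scale


@dataclass
class Risk:
    id: str
    description: str
    likelihood: str
    impact: str
    owner: str
    status: str

    @property
    def score(self) -> int:
        return LEVEL[self.likelihood] * LEVEL[self.impact]


REGISTER = [
    Risk("R-01", "Vendor X bankrupt within 12 months", "M", "H", "SA", "Open"),
    Risk("R-02", "Key engineer departs before go-live", "M", "H", "EM", "Open"),
    Risk("R-03", "Data residency requirement changes mid-project", "L", "H", "Compliance", "Mitigated"),
    Risk("R-04", "LLM cost grows 5x at 10x usage", "M", "M", "SA", "In progress"),
]


def review_order(register: list) -> list:
    """Unresolved risks only, highest score first (ties keep register order)."""
    active = [r for r in register if r.status not in ("Mitigated", "Closed")]
    return sorted(active, key=lambda r: r.score, reverse=True)
```

With the example rows, the review agenda is R-01, R-02, then R-04; R-03 drops off because it is already mitigated.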

16.5 Privacy by design (GDPR and beyond)

If the solution touches personal data, design for privacy from day 1:

  • Data minimization: collect the least; design schemas around it.
  • Purpose limitation: each data element has a documented purpose; new use requires re-consent or DPIA.
  • Storage limitation: retention by data class, automated deletion.
  • Right to erasure: design for deletion. (This is harder than it sounds — backups, logs, analytics.)
  • Data subject access requests (DSAR): design an API for "give me a user's data."
  • Cross-border transfers: SCCs, adequacy, residency design.

Privacy is non-trivial to retrofit. Asking these questions in week 4 is cheap; asking them in week 40 is expensive.
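Storage limitation is the easiest of these to express in code. The sketch below shows retention by data class feeding an automated deletion job. The data classes, the retention periods, and the record shape are all hypothetical; real values come from legal and compliance, and a real job also has to reach backups, logs, and analytics copies, which this example does not attempt.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per data class; real values come from legal/compliance.
RETENTION = {
    "session_log": timedelta(days=30),
    "support_ticket": timedelta(days=365),
    "invoice": timedelta(days=7 * 365),
}


def is_expired(data_class: str, created_at: datetime, now: datetime) -> bool:
    """True when a record has outlived the retention period of its class."""
    return now - created_at > RETENTION[data_class]


def select_for_deletion(records, now):
    """IDs a scheduled deletion job should remove from the primary store."""
    return [r["id"] for r in records
            if is_expired(r["data_class"], r["created_at"], now)]


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
records = [
    {"id": "a", "data_class": "session_log", "created_at": datetime(2025, 4, 1, tzinfo=timezone.utc)},
    {"id": "b", "data_class": "session_log", "created_at": datetime(2025, 5, 20, tzinfo=timezone.utc)},
    {"id": "c", "data_class": "invoice", "created_at": datetime(2020, 1, 1, tzinfo=timezone.utc)},
]
to_delete = select_for_deletion(records, now)
```

Requiring every record to carry a `data_class` is the design point: a record without a class has no retention rule, and the lookup fails instead of silently keeping the data forever.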

16.6 Compliance posture as a design output

By go-live, the solution should ship with:

  • A compliance posture document (1–3 pages) — which frameworks apply, which are out-of-scope, which controls are evidenced where.
  • A control mapping — every control mapped to where it's implemented and how it's evidenced.
  • A DPIA (if EU/personal data) — Data Protection Impact Assessment.
  • A records of processing (GDPR Article 30) — for data flows.

These artifacts are increasingly commercial assets — customers ask for them in security questionnaires, sales asks for them in deals, regulators ask for them in audits. Designing the solution to produce them naturally beats retrofitting them under audit pressure.


17. 🚚 Migration Architecture: 6Rs and Beyond

Many SA engagements are migrations more than greenfield. The "6Rs" framework (originally Gartner's 5Rs, extended) is the canonical taxonomy.

17.1 The 6Rs

For each system in scope, pick exactly one R:

| R | Action | When | Cost | Risk |
|---|---|---|---|---|
| Retain | Leave it where it is | Stable, not strategic, low-risk-of-staying | Lowest | Lowest |
| Retire | Decommission | No longer needed, redundant, replaced | Low (one-time) | Low if scoped right |
| Rehost ("lift-and-shift") | Move as-is to cloud | Speed > optimization, simple stateless workloads | Medium | Medium (works, but expensive at run) |
| Replatform | Move with minimal changes (e.g., to managed DB) | Easy wins via managed services | Medium-high | Medium |
| Refactor | Re-architect | Cloud-native is required, scale demands it | High | High |
| Repurchase | Replace with SaaS | Off-the-shelf option exists | Medium-low (license + integration) | Vendor risk |

For each system: write the R, the rationale, the cost, the schedule, and the success criteria. A migration plan that can't articulate the R per system is not a plan.
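The "one R per system, with rationale, cost, schedule, and success criteria" rule can be checked mechanically. Here is a rough sketch under stated assumptions: the `SystemPlan` record, the `gaps` helper, and the example systems are invented for illustration, and the free-text fields stand in for whatever your planning tool actually stores.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import Optional


class R(Enum):
    RETAIN = "retain"
    RETIRE = "retire"
    REHOST = "rehost"
    REPLATFORM = "replatform"
    REFACTOR = "refactor"
    REPURCHASE = "repurchase"


@dataclass
class SystemPlan:
    system: str
    r: Optional[R] = None  # exactly one R per system, or None if undecided
    rationale: str = ""
    cost: str = ""
    schedule: str = ""
    success_criteria: str = ""


def gaps(plan):
    """Map each system to the fields it still lacks; an empty dict means complete."""
    result = {}
    for entry in plan:
        missing = [f.name for f in fields(entry) if not getattr(entry, f.name)]
        if missing:
            result[entry.system] = missing
    return result


# Hypothetical systems, for illustration only.
plan = [
    SystemPlan("billing-db", R.REPLATFORM, "managed DB removes patching toil",
               "medium", "wave 2", "p99 read latency unchanged"),
    SystemPlan("legacy-reporting", R.RETIRE, "replaced by BI tool"),
    SystemPlan("order-service"),
]
open_items = gaps(plan)
```

An empty `gaps(plan)` does not make the plan good, but a non-empty one is a concrete list of what to resolve before calling it a plan.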

17.2 The strangler fig pattern

Use this pattern to migrate large systems incrementally rather than in a big bang. Conceptually: stand up the new system alongside the old, route a slice of traffic to the new system, validate, expand the slice, and eventually retire the old.

Implementation patterns:

  • Reverse proxy / API gateway: route by path or feature flag.
  • Dual-write: write to old + new for a window; reconcile.
  • Read from new, fall back to old: for read paths.
  • CDC: replicate old → new while migrating.
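The routing decision at the proxy or gateway can be sketched as follows. This is a minimal illustration, assuming traffic is sliced by a stable hash of a user ID; the function names, the path prefixes, and the choice of SHA-256 are assumptions for this example, and a real gateway would express the same rule in its own configuration.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Stable bucket in [0, 100): the same user always lands in the same bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(user_id: str, path: str, migrated_prefixes, rollout_percent: int) -> str:
    """Return "new" or "old" for one request."""
    # Paths that have not been migrated yet always go to the old system.
    if not any(path.startswith(prefix) for prefix in migrated_prefixes):
        return "old"
    # For migrated paths, send only the configured slice of users to the new system.
    return "new" if bucket(user_id) < rollout_percent else "old"
```

Hashing the user ID rather than picking randomly per request keeps each user on one side of the boundary, and raising `rollout_percent` only ever moves users from old to new, never back and forth.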

Hard parts:

  • Data convergence: how do you ensure old + new agree during transition? Reconciliation jobs, comparison metrics.
  • Schema divergence: new schema may differ; transformation at the boundary.
  • Long tail: the last 10% of features takes 50% of the time. Plan for it.
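Data convergence and schema divergence meet in the reconciliation job. A rough sketch, assuming both stores can be read into dictionaries keyed by record ID and that a `transform` function maps an old-schema row to its new-schema shape. The name-splitting transform and the sample rows are invented for illustration.

```python
def reconcile(old_rows, new_rows, transform):
    """Compare old and new stores by ID after applying the boundary transform."""
    expected = {key: transform(row) for key, row in old_rows.items()}
    missing = sorted(k for k in expected if k not in new_rows)
    extra = sorted(k for k in new_rows if k not in expected)
    mismatched = sorted(k for k in expected
                        if k in new_rows and expected[k] != new_rows[k])
    if expected:
        match_rate = 1 - (len(missing) + len(mismatched)) / len(expected)
    else:
        match_rate = 1.0
    return {"missing": missing, "extra": extra,
            "mismatched": mismatched, "match_rate": match_rate}


def split_name(row):
    """Hypothetical transform: the new schema stores first and last name separately."""
    first, last = row["name"].split(" ", 1)
    return {"first": first, "last": last}


old = {1: {"name": "Ada Lovelace"}, 2: {"name": "Alan Turing"}, 3: {"name": "Grace Hopper"}}
new = {1: {"first": "Ada", "last": "Lovelace"},
       2: {"first": "Alan", "last": "Turing-X"},
       4: {"first": "Edsger", "last": "Dijkstra"}}
report = reconcile(old, new, split_name)
```

`match_rate` is the comparison metric to put on a dashboard; the three ID lists are what an engineer needs to actually fix the drift.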

17.3 The migration runway

Every migration has a runway. Plan it:

  • Phase 0: Foundations — landing zone, identity, network, observability, IaC. Done before any workload moves.
  • Phase 1: Pilot — one low-risk workload, end-to-end. Prove the pipeline.
  • Phase 2: Wave — group similar workloads, migrate in 4–8 week sprints.
  • Phase 3: Tail — the hard cases. Strangler, replatform, or accept retain.
  • Phase 4: Retire — decommission old infra. The most-skipped phase. Until you turn it off, you pay double.

A common failure: declaring victory at Phase 2. The legacy infra stays "for safety" for 18 months and you pay 1.7× run cost the whole time.

17.4 Migration cost shapes

Migrations have a characteristic hump-shaped cost curve: high during transition, theoretically lower after. Two traps:

  1. Underestimating transition cost. Dual-running, training, parallel teams. Often 1.5–2× steady-state for 6–18 months.
  2. Overestimating post-migration savings. Lift-and-shift to cloud is often more expensive than on-prem for the first 1–2 years, until right-sizing and managed services pay off.

Be honest in the TCO model. The CFO will remember.
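The arithmetic behind both traps is simple enough to put in the TCO model directly. The sketch below compares cumulative spend for migrating against staying put. Every number in the example call is made up (legacy run cost 100 per month, a 1.75× transition multiplier from within the 1.5–2× range above, 12 transition months, 80 per month afterwards), and the model ignores one-off project costs and discounting.

```python
def cumulative_cost(months, legacy_monthly, transition_multiplier,
                    transition_months, post_monthly):
    """Total spend after `months` if you migrate."""
    in_transition = min(months, transition_months)
    after = max(0, months - transition_months)
    return (legacy_monthly * transition_multiplier * in_transition
            + post_monthly * after)


def break_even_month(legacy_monthly, transition_multiplier,
                     transition_months, post_monthly, horizon=120):
    """First month where migrating has cost no more than staying, else None."""
    for month in range(1, horizon + 1):
        migrate = cumulative_cost(month, legacy_monthly, transition_multiplier,
                                  transition_months, post_monthly)
        stay = legacy_monthly * month
        if migrate <= stay:
            return month
    return None


# Illustrative inputs only.
payback = break_even_month(legacy_monthly=100, transition_multiplier=1.75,
                           transition_months=12, post_monthly=80)
```

Under these assumed inputs, a 20% run-cost saving takes until month 57 to repay a 12-month dual-run, and if the post-migration cost is not actually lower the function returns `None`. That is the conversation to have with the CFO before the project starts, not after.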


18. 💬 Communication: Diagrams, Documents, Presentations

Most of your impact lands through communication. Bad communication kills good designs. Two principles dominate: audience-first and progressive disclosure.

18.1 The three-audience problem

Every artifact has at least three audiences:

| Audience | Wants | Hates |
|---|---|---|
| Executive | The headline, the cost, the risk, the recommendation | Detail, jargon, indecision |
| Architect peer | The decisions, the alternatives, the rationale | Hand-waving, missing tradeoffs |
| Engineer | The implementation truth, the contracts, the failure modes | Vague abstractions, no examples |

A single document cannot serve all three. Either produce three layered documents (recommended), or one document with clear sections labeled by audience.

The rough hierarchy:

  • Executive brief (1–2 pages): problem, recommendation, cost, risk, decision needed. No diagrams more complex than C4 L1.
  • Architecture brief / RFC (8–20 pages): full design, decisions, alternatives, NFRs, risks. Architects' bread and butter.
  • Technical spec / detailed design (per component): the engineer-facing detail.

18.2 Diagrams that earn their pixels

Rules:

  1. Title every diagram. "Figure 3: Order Flow — happy path, sync, p99 budget 400ms." Untitled diagrams are riddles.
  2. Legend, always. Every shape and arrow color means something.
  3. One concept per diagram. A C4 L2 + sequence diagram + deployment view in one box is unreadable.
  4. Annotate the load and latency. Each box: estimated RPS, p99, cost contribution. Diagrams without numbers are decoration.
  5. Pretty is a feature. A clean diagram earns trust; a tangled one earns suspicion. Spend the extra hour.
  6. Mermaid > Visio for living architecture. Diagrams in code stay current; diagrams in Visio rot.

A well-known anti-pattern: the Buzzword Soup Diagram — 60 boxes, 200 arrows, every cloud icon, no information. It says "I am working." It does not say what the system does. Replace with a 12-box C4 L2.
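To show what "diagrams in code" with numbers on every box can look like, here is a small sketch that emits Mermaid flowchart text from a list of components. The helper name, the component data, and the figures are hypothetical, and the front-matter `title` block assumes a reasonably recent Mermaid version; check your renderer before relying on it.

```python
import json


def mermaid_flowchart(title, nodes, edges):
    """Build Mermaid flowchart source with load and latency on every box.

    nodes: {node_id: (label, rps, p99_ms)}
    edges: [(source_id, target_id, edge_label)]
    """
    # json.dumps gives a double-quoted string, so a colon in the title stays valid YAML.
    lines = ["---", f"title: {json.dumps(title)}", "---", "flowchart LR"]
    for node_id, (label, rps, p99_ms) in nodes.items():
        lines.append(f'    {node_id}["{label}<br/>{rps} RPS, p99 {p99_ms} ms"]')
    for source, target, edge_label in edges:
        lines.append(f"    {source} -->|{edge_label}| {target}")
    return "\n".join(lines)


# Hypothetical components and numbers, for illustration only.
diagram = mermaid_flowchart(
    "Figure 3: Order Flow",
    {"api": ("Order API", 120, 80), "db": ("Orders DB", 300, 15)},
    [("api", "db", "sync")],
)
```

Because the diagram is generated from data, the same component list can feed the NFR table and the cost model, which is one way to keep the numbers on the picture honest.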

18.3 The architecture brief: a template

A reusable arc42-flavored skeleton:

  1. Summary (½ page) — problem, recommended solution, cost, risk, decisions needed now.
  2. Context (1–2 pages) — current state, business outcome, scope, out-of-scope.
  3. Constraints & NFRs (1 page) — table.
  4. Strategic options (1 page) — A/B/C with recommendation.
  5. Solution (3–6 pages) — C4 L1, L2, key flows, deployment.
  6. Decisions (link to ADRs).
  7. Cost & TCO (1 page) — Y1/Y3, sensitivity.
  8. Risks (½–1 page) — top 10 with mitigation.
  9. Migration / rollout (½–1 page) — phases.
  10. Open questions & decisions needed (½ page) — explicit, named, dated.

Length cap: 20 pages. If you can't fit it, layer it: this brief + linked ADRs + linked detailed designs.

18.4 The executive presentation

Different beast. 5–10 slides, 15-minute briefing, 30-minute decision meeting. Slide structure that works:

  1. The problem (1 slide, 1 sentence).
  2. What we recommend (1 slide, 3 bullets).
  3. Why this and not the alternatives (1 slide, 3 columns).
  4. What it costs and when it pays back (1 slide, 1 chart).
  5. What could go wrong, and our mitigation (1 slide, top 3 risks).
  6. What we need from you, and by when (1 slide, decisions list).
  7. Backup: full architecture, full TCO, full risk register. Don't open unless asked.

Anti-pattern: the 60-slide architecture deck where slide 23 has the recommendation. The exec has tuned out within 60 seconds, long before you reach slide 4. Lead with the answer.

18.5 The status update

Weekly or bi-weekly. Keep it boring. A template that works:

Project: <name>
Week of: <date>
RAG status: G/A/R (with reason if not G)

Highlights (3 max):
- ...

Decisions made this week:
- ...

Risks updated:
- ...

Decisions needed (with owner & date):
- ...

Next week:
- ...

Boring is the strategy. Stakeholders need to know they don't have to read closely. The week you flip from green to amber, they read; that's the value.


19. 🤝 Stakeholder Management

Eighty percent of the SA job is alignment with people you don't manage. The patterns:

19.1 The stakeholder map (RACI variant)

For each major decision, label four kinds of stakeholders:

  • Responsible (does the work)
  • Accountable (single owner of the decision)
  • Consulted (input; two-way)
  • Informed (one-way)

Rules:

  • Exactly one A. If you have two, you have zero.
  • The A is rarely the SA. The SA is often the R or C, sometimes the I.
  • Publish the map. Re-check at every gate. Decisions stall when A is unclear.
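The "exactly one A" rule is easy to check automatically whenever the map is republished. A minimal sketch, assuming the map is stored as `{decision: {person: role}}`; the decision names and people are invented for illustration.

```python
def raci_problems(raci):
    """Return a message for every decision that does not have exactly one A."""
    problems = []
    for decision, roles in raci.items():
        accountable = [person for person, role in roles.items() if role == "A"]
        if len(accountable) != 1:
            problems.append(
                f"{decision}: expected exactly one A, found {len(accountable)}")
    return problems


# Hypothetical stakeholder map.
raci = {
    "Choose message broker": {"sa": "R", "platform_lead": "A", "security": "C", "sponsor": "I"},
    "Data residency design": {"sa": "R", "compliance": "A", "cto": "A"},
    "Vendor selection": {"sa": "C", "procurement": "R"},
}
issues = raci_problems(raci)
```

Here the second decision has two As and the third has none; both are reported the same way, because in practice they stall the same way.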

19.2 The decision log

Every decision gets an entry. Date, decision, alternatives, decider, rationale, reversibility. Stored alongside ADRs. Reviewed at gates.

A specific failure mode: "we kind of decided" decisions — discussed in a meeting, never written. Six weeks later, the team rediscovers the question and re-decides differently. Cost: weeks. Solution: the SA writes it down within 24 hours, sends to the room, gets confirmation.

19.3 The "single throat to choke" pattern

For a complex solution, one person should be accountable for the solution end-to-end. Often that's you, the SA, or it's the Engagement Manager / Program Lead. Make it explicit. The customer should know whose phone to dial when something is going wrong. Distributed accountability = no accountability.

19.4 Difficult stakeholders

Patterns and counter-patterns:

| Stakeholder type | Pattern | Counter |
|---|---|---|
| The dictator ("we're using X technology, end of story") | Gives orders without rationale | Ask "what problem are you solving with X?", then re-route to the actual decision |
| The bikeshedder (debates trivial things) | Spends meetings on color of buttons | Time-box the meeting; explicitly defer trivial choices to the team |
| The veto (security, legal, EA) | Blocks late, never engages early | Bring them in week 1; share artifacts early; get conditional approvals |
| The ghost (decision-maker who never shows) | Books, cancels, no replies | Escalate via their boss with written rationale; make absence costly |
| The polite blocker (says yes, does nothing) | Agrees in meetings, no follow-through | Ask for written commitment, dates; track in decision log |
| The technologist (a peer with strong tech opinions) | Argues every choice as an aesthetic | Push to write-up; force them to commit alternatives in ADR form |

For each, the counter is the same: make work visible and dated. Ambiguity is the enemy.

19.5 The quarterly steering committee

Every meaningful solution has a steering committee — sponsor + key business + key tech leads + you. The cadence is monthly or quarterly. Run it as:

  1. RAG status (1 slide).
  2. Decisions needed today (3 slides max, one per decision).
  3. Risks updated (1 slide, focus on what changed).
  4. Roadmap (1 slide, gantt).
  5. AOB (10 min).

Goal: leave with written, signed decisions on every "decision needed today" item. If you don't, the next 2–4 weeks stall. The SA's job is to make the steering committee productive, not informational.

19.6 Bringing bad news

You will deliver bad news — over budget, over schedule, the design is wrong, the vendor failed, the engineer left. Rules:

  1. Surface early. Bad news ages worse than fish. Tell the sponsor in 24h, not at the next steering.
  2. Bring options, not just problems. "We're 30 days behind. Three paths: cut scope X, add 2 contractors, accept slip. Recommendation: cut X."
  3. No blame. Talk about the system, not the people. People who fear blame hide problems.
  4. Take responsibility. As the SA, you're the connective tissue. If a thing didn't get caught, it's partly your job.
  5. Follow up in writing. Verbal news is half-news.

Sponsors who learn early that you bring honest, structured bad news with options trust you forever. Sponsors who learn late that you sat on it stop trusting you forever. Choose.


20. 🤵 Pre-Sales SA: The Consultative Sale

A pre-sales SA inside a vendor or SI has a different operating model. The work is consulting rather than selling, but you still carry a quota. The shape of the work:

20.1 The funnel and your role

Pre-sales SAs sit on the technical side of the sales funnel:

  1. Discovery — sales-led, you co-attend. You listen for real problems; sales listens for budget and timing.
  2. Demo — you lead. Tailored to the customer's actual problem, not the canned demo.
  3. PoC — you scope, deliver or oversee, defend. Time-boxed, success-criteria-led.
  4. RFP / RFI response — you write the technical sections. Often the deal is decided here.
  5. Statement of work / Pricing — collaboration with sales / engagement managers.
  6. Close — sales-led, you support objection handling.

20.2 The consultative sale

The pattern that wins, regardless of vendor:

  1. Understand the customer's business problem first. Not the technical requirement. Not the RFP question. The actual business outcome.
  2. Reflect it back. "You're trying to reduce time-to-resolution on tier-1 tickets from 8h to 1h, because customer churn correlates with first-touch latency. Did I get that right?" — earns trust on the first call.
  3. Educate, don't pitch. Walk the customer through how similar customers solved similar problems — yours and otherwise. They learn; trust compounds.
  4. Be the trusted advisor on the category, not the salesperson for the product. Mention competitors honestly. "If you have a heavy Salesforce footprint, our integration to product X may be less mature than competitor Y's; here's how customers handle it."
  5. Disqualify when needed. "Honestly, we're not the best fit for this use case. Vendor Z is stronger." — this loses some deals and wins more, bigger, longer-term.

The sales reps who hit quota for years partner with SAs who do this. The ones who don't? They burn customers and the funnel goes dry.

20.3 The technical demo

A 30–60 minute live walk-through. Rules:

  • Personalized: customer logo, customer data flavor, customer problem on screen. Generic demos lose.
  • Outcome-led: "By the end you'll see how this solves your tier-1 ticket time."
  • Failure-prepared: you've rehearsed, you've cached responses, you've got backup screenshots. The demo gods are cruel; the prepared SA is not surprised.
  • Q&A handled in real-time: if you don't know, say so, write it down, follow up within 48h. Honesty earns the deal.
  • No 60-slide intro. Start in the product. Slides for context, not for content.

20.4 The PoC: the scary one

PoCs are where deals are won or lost — and where pre-sales SAs go off the rails. Rules:

  • Scoped explicitly: 2–3 use cases, 2–4 weeks, written success criteria. The customer signs the criteria.
  • Customer-led where possible: their engineers do the work, you support. They build muscle; they buy.
  • Failure modes documented: where the product doesn't fit, write it down. Surprises in production kill renewals.
  • Done = done. When the success criteria are met, celebrate and close. Don't drift into "while we're here, can you also..." That's free consulting, and it stalls the close.

20.5 The RFP response

RFPs are a war of attrition. Practical patterns:

  • Reuse aggressively: maintain a question bank with last year's answers, scored by win/loss.
  • Answer the question asked, not the one you wish was asked. RFP scorers are unforgiving.
  • Use diagrams and tables in technical sections — text walls don't score well.
  • Highlight unique strengths in 1–2 places — once at the top of the technical section, once in the executive summary.
  • Refuse low-quality RFPs: if the RFP looks copy-pasted from a competitor's marketing, you're column fodder. Decide whether to bid.

20.6 The handoff to delivery

The single most important moment in pre-sales SA work. Anti-pattern: pre-sales SA promises feature X to win the deal; delivery team didn't know; six months later the customer churns. Counter-patterns:

  • Internal SOW review: delivery sees the SOW before it's signed. They sign off in writing.
  • Documented promises: every commitment beyond the standard product is in a "delivery commitments" appendix. No verbal-only promises.
  • Joint kickoff: pre-sales SA + delivery SA + customer in the same room for handoff.
  • Pre-sales SA stays for first 30 days: as advisor, not driver. Continuity beats clean handoff.

(...to be continued...) Read Part 3 here https://viblo.asia/p/the-solution-architect-playbook-from-best-designer-to-best-bridge-part-3-y0VGwO9DVPA


If you found this helpful, let me know by leaving a 👍 or a comment! If you think this post could help someone, feel free to share it. Thank you very much! 😃


All rights reserved
