
🏛️ The Solution Architect Playbook 📚: From Best Designer to Best Bridge - Part 1 🌉

A deep, opinionated, practical guide for the engineer-architect who designs end-to-end solutions across systems, teams, and business units. The mental models, decision frameworks, discovery tactics, design methods, communication patterns, and anti-patterns that separate the SA whose solutions actually ship and run for years from the one whose 80-page Visio decks gather dust on Confluence. Grounded in current reality — multi-cloud by default, AI woven into every solution, smaller delivery teams per dollar of revenue, regulated by frameworks that didn't exist five years ago, and customers who can read a SOC 2 report.

If you read only four sections first, read §2 Mindset, §6 Discovery, §9 NFRs, and §13 Build vs Buy. Everything else is the implementation of those four.

Companion to 🧑‍💻 The Tech Lead Playbook: From Best IC to Multiplier 🚀 (the team-level role), 👨‍💻 The CTO Playbook 📘: From Best Builder to Best Bet ♟️ (the org-level role), 🏛️ The System Design Playbook 📖 (the design vocabulary), 🛠️ The Senior Software Engineer Playbook 📖: From Good Coder to High-Impact Engineer 🚀 (deep IC craft), [🤖 The AI SaaS Playbook (Practical Edition)📘](https://dev.to/truongpx396/the-ai-saas-playbook-practical-edition-33lb) (AI overlay), and 🚀 The SaaS Template Playbook 📖 (delivery foundations). This one is for the technical professional who is accountable for a solution end-to-end across systems, teams, and stakeholders, whether at a consulting firm, cloud vendor, ISV, or in-house enterprise team.


📋 Table of Contents

  1. ⚡ Read This First
  2. 🧠 The Solution Architect Mindset
  3. 🎭 The SA Landscape: Five Archetypes
  4. 🪜 SA vs TL vs Software Architect vs EA vs CTO
  5. 🚪 The First 90 Days
  6. 🔍 Discovery: The Real Job Begins Here
  7. 📐 Solution Design Methodology
  8. 🗂️ Documenting a Solution: C4, ADRs, arc42
  9. 🎯 Non-Functional Requirements: The Real Job
  10. ☁️ Cloud Architecture (AWS, Azure, GCP, Multi)
  11. 🔌 Integration Architecture
  12. 🗄️ Data & AI Architecture
  13. ⚖️ Build vs Buy vs Customize
  14. 🛒 Vendor Evaluation & Selection
  15. 💰 Cost & TCO Modeling
  16. 🛡️ Security, Compliance & Risk
  17. 🚚 Migration Architecture: 6Rs and Beyond
  18. 💬 Communication: Diagrams, Documents, Presentations
  19. 🤝 Stakeholder Management
  20. 🤵 Pre-Sales SA: The Consultative Sale
  21. 🛠️ Post-Sales SA: Delivery Architecture
  22. 🚀 Working with Delivery Teams
  23. ⏱️ The Operating Cadence
  24. 🤖 AI in the SA Role
  25. 🧰 Tools of the Trade
  26. ⚠️ The SA Anti-Pattern Catalog
  27. 🗺️ The Phased Roadmap (Day 1 → Year 5)
  28. 📋 Cheat Sheet & Resources

1. ⚡ Read This First

Seven truths that will save you the first 18 months of mistakes every new solution architect makes:

  1. You are paid for the solution, not the technology. Technology is the cheapest input to a solution. The expensive inputs are: the problem you chose to solve, the constraints you accepted, the integrations you didn't anticipate, the stakeholders you forgot to align, and the operational cost the customer didn't budget. A great SA renders a business problem into a runnable, affordable, supportable system. A mediocre SA renders a Visio diagram. Recognize which one you are this quarter.
  2. Your authority is borrowed. You usually don't manage the people who will build the thing. You don't sign the cheque. You don't run the production system. Your influence comes from technical credibility (people trust your judgment), clarity (people know what to do and why), and being the only person who has read the whole problem (you are the connective tissue). If you try to lead with "because the architect said so," you have already lost.
  3. NFRs are the job; functional requirements are table stakes. Every junior can list "the system should let users log in." A senior SA writes: "login p99 ≤ 400ms at 5,000 RPS, 99.95% available, MFA required for admin actions, SOC 2 evidence captured per session, and per-tenant audit retention of 7 years." The first sentence is the menu. The second is the contract. The contract is where projects succeed or fail. Most SA failures aren't bad designs — they're missing or sloppy non-functional requirements.
  4. The boring decisions compound. Naming conventions, ADR templates, environment promotion rules, IAM patterns, secrets handling, observability standards, vendor onboarding workflow. A solution where these are boring and consistent ships in 4 months. A solution where every team improvises ships in 14 months and never gets to "production-grade." Predictable, written, unsexy patterns beat clever bespoke designs every time.
  5. You will spend more time in conversations than in diagrams. Discovery interviews. Vendor calls. Risk reviews. Stakeholder alignment. Steering committee briefings. PMO standups. DevOps handoffs. Most new SAs over-index on diagram quality and under-index on conversation quality. The single highest-leverage skill is this: walk into a 60-minute meeting with five people who disagree and walk out with a written, signed decision. Practice it explicitly.
  6. Reversibility is your most valuable axis. Bezos's two-way / one-way door framing matters more for an SA than for almost any other role. Your job is to isolate the irreversible decisions (cloud provider, primary identity store, core data model, the integration contract two business units depend on) and surface them with appropriate care, while deliberately defaulting all reversible decisions to fast and cheap. SAs who treat every decision as one-way burn quarters; SAs who treat every decision as two-way leak risk.
  7. Writing is the operating system of your job. Architecture briefs, ADRs, RFP responses, runbooks, risk registers, decision memos, vendor scorecards, post-mortems. If your writing is mediocre, every other lever is dampened. The SAs who scale fastest are the ones whose writing is so clear that the team can act without needing a meeting. Ship that skill before you ship anything else.

The rest is implementation of these seven.

Who this is for

  • You were just made (or are about to be made) Solution Architect, Principal Architect, or Senior Cloud Architect at a consulting firm, ISV, cloud vendor, SI, or in-house team.
  • You're a senior/staff engineer being pulled into pre-sales, vendor selection, or end-to-end design and want to learn the discipline rather than wing it.
  • You're a tech lead whose scope just expanded across teams or business units and you no longer have a single team's people leverage.
  • You're an enterprise architect or program lead who wants the next layer down — how solutions actually get designed and delivered.

Who this is not for

A note on context

The default voice assumes a mid-to-senior solution architect on a multi-team, multi-system engagement, ~3 to 12 months of design+delivery duration, current reality (multi-cloud by default, AI woven through every solution, GenAI in copilots, FinOps mandatory, a regulatory surface that grew teeth). Pre-sales SAs in vendor/SI roles should read everything but lean hardest into §6, §14, §18, §20. In-house enterprise SAs should focus on §9, §16, §22, §23. Boutique and freelance SAs need every section, doubly so §1, §13, §15.


2. 🧠 The Solution Architect Mindset

The mindset shift from senior engineer or tech lead to SA is harder than the skill shift. Most failed SAs were technically capable; they failed at the positional layer — they kept thinking like a builder when their job was to think like a connector.

2.1 Identity reframe: from "best designer" to "best bridge"

You used to be measured by the system you designed. Now you are measured by whether the right system gets designed, gets bought (literally or organizationally), and gets shipped, given the constraints and stakeholders in play. Your output is a solution that closes a business problem, and that includes everything from "the integration is feasible" to "the CFO signed off on the cost" to "the security team accepted the risk register" to "the delivery team can actually build it." This breaks five engineering instincts you must consciously rewire:

| Old engineering instinct | New SA instinct |
| --- | --- |
| "I'll design the cleanest system" | "Which 3 constraints determine 80% of this design? Optimize there, accept the rest." |
| "Let me research the best technology" | "What does the customer already have, what can they operate, and what can they afford?" |
| "I'll just code a prototype" | "What's the smallest demo, document, or whiteboard that decides this?" |
| "We need consensus on the design" | "Who owns this decision? When and how do they decide? Who do they need to hear from?" |
| "Production is the next team's problem" | "Operability is part of my design. If it can't be run, I haven't designed it." |

Practical: write a one-line role description and pin it to your monitor. "I am the Solution Architect for [Project / Account / Domain]. My job is to deliver a runnable, affordable, supportable solution that closes the business problem within the agreed constraints, working through teams I do not manage and stakeholders I do not control." If you can't articulate this, your stakeholders can't either, and they will silently form their own (often conflicting) definitions of your job.

2.2 The five hats — and how they fight

You wear five hats simultaneously, and they actively interfere:

| Hat | Mode | Time horizon | Output |
| --- | --- | --- | --- |
| Discoverer | Curious, slow, listening | Days–weeks | Interview notes, context map, problem statement |
| Designer | Deep, abstract, system-level | Weeks | Architecture brief, C4 diagrams, ADRs |
| Negotiator | Diplomatic, fast, decisive | Hours–days | Decisions logged, alignment achieved, scope clarified |
| Salesperson | Confident, narrative, value-led | Hours | Pitch decks, RFP responses, executive briefings |
| Operator | Pragmatic, hands-dirty | Days–weeks | Runbooks, governance gates, delivery escalations |

Each demands a different brain state. A 2-hour design session with engineers and a 2-hour vendor pitch to a CIO cannot share the same morning. Batch by hat, not by topic. The most common failure mode: defaulting to Designer mode whenever uncomfortable. Discovery is messy, negotiation is stressful, sales feels icky, operations is tedious. Designer mode produces gorgeous diagrams that no one will pay for, no one will sign off on, and no one will run. Calendar discipline beats willpower. See §23 for the cadence.

2.3 The four voices

Every SA has four internal voices. They lie in different ways. Notice them.

  1. The Architect Astronaut Voice: "This deserves a layered abstraction with a domain-driven hexagonal core." Lies upward: turns simple problems into 18-month platform plays. Common in SAs who came from heavy frameworks or who haven't shipped recently.
  2. The Vendor-Whisperer Voice: "AWS launched X last week, this is a perfect use case." Lies sideways: fits the customer to the technology rather than the technology to the customer. Especially common in vendor-employed SAs and the newly certified.
  3. The Imposter Voice: "They hired me by mistake, the real architects know more about [obscure pattern]." Lies downward: talks you out of necessary calls and produces a consensus-only SA who never makes a decision and is invisible at the steering committee.
  4. The Steward Voice: "What does this customer need to be capable of in 18 months given their team, budget, and regulatory reality? What's the smallest system that gets there?" Lies the least. Cultivate this one.

When the Astronaut, Vendor-Whisperer, or Imposter voice is driving a decision, write the decision down and revisit in 24 hours. Most regretted SA decisions happen in the 24 hours after a glossy vendor briefing, a hostile steering committee, or a public dressing-down. Sleep first.

2.4 The leverage hierarchy

Rank your time by leverage. Always work top-down:

  1. Problem framing. What is actually being solved, for whom, with what constraints. 1 hour here = 100 hours saved later.
  2. NFR negotiation. Latency, availability, cost ceiling, RPO/RTO, data residency, compliance class. The contract.
  3. Stakeholder alignment. Who owns each decision, who signs which doc, who attends which gate. The political wiring of the project.
  4. Build vs buy vs reuse. The biggest cost lever. Wrong here = wasted years.
  5. Reference architecture & ADRs. The shape of the solution, the irreversible choices, the rationale.
  6. Cost / TCO model. Without this you cannot defend the design.
  7. Integration design. Where systems meet is where projects fail. Spend disproportionate time here.
  8. Risk register & mitigation plan. The brutal honest list of what could kill this.
  9. Delivery handoff. The team needs to own this solution, not implement it under your dictation.
  10. Reviewing. Other people's diagrams, PRs, vendor decks. Useful in moderation. Stop being on the critical path.
  11. Building. Your own code. Lowest-leverage of all. Do only what literally only you can do — usually a thin spike to prove a tradeoff, never production code.

When you feel busy but useless, you've inverted the stack. Reset by asking: "In the last 5 working hours, how much did I spend on items 1–4?" If the answer is "<2," that's the problem.

2.5 Reversible vs irreversible decisions

The single most clarifying frame in your toolkit. Examples calibrated to the SA seat:

  • Two-way doors (reversible): which CI provider, which monitoring vendor, the exact format of an ADR, sprint cadence, the choice between two equivalent serializers, naming a microservice. Decide fast, reverse if wrong, do not run a six-week working group on these.
  • One-way doors (hard or expensive to reverse): primary cloud provider for production data, identity provider, core data model, public API shape, primary database for OLTP, the customer-facing event schema, a long-term integration contract with a partner, the multi-tenant boundary, the country of data residency. Slow down. Write it up. Get input. Get expert review. Sleep on it. Document why.

A good SA visibly labels each decision in the running ADR log: Reversibility: Two-way / One-way / One-and-a-half-way (reversible only with notable cost). This single column changes how stakeholders engage. It also gives you political air cover: "This is one-way. We need a written decision from the data owner. Until then, we're building the two-way pieces around it."
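As a rough illustration of that reversibility column, here is a minimal Python sketch of an ADR log entry. The `Adr` fields, the `decision_owner` and `signed_off` attributes, and the sample entries are assumptions made for this example, not a prescribed ADR schema.

```python
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    TWO_WAY = "Two-way"
    ONE_AND_A_HALF_WAY = "One-and-a-half-way"  # reversible only with notable cost
    ONE_WAY = "One-way"


@dataclass
class Adr:
    number: int
    title: str
    reversibility: Reversibility
    decision_owner: str = ""   # hypothetical field: who must sign the decision
    signed_off: bool = False   # hypothetical field: written decision received


def blocked_decisions(log: list[Adr]) -> list[Adr]:
    """Return the one-way ADRs that still lack a written sign-off."""
    return [
        adr for adr in log
        if adr.reversibility is Reversibility.ONE_WAY and not adr.signed_off
    ]


# Illustrative entries only.
adr_log = [
    Adr(1, "CI provider", Reversibility.TWO_WAY, "delivery TL", signed_off=True),
    Adr(2, "Country of data residency", Reversibility.ONE_WAY, "data owner"),
    Adr(3, "Monitoring vendor", Reversibility.TWO_WAY),
]

for adr in blocked_decisions(adr_log):
    print(f"ADR-{adr.number:03d} '{adr.title}' is one-way: "
          f"needs a written decision from the {adr.decision_owner}")
```

Two-way entries never show up as blocked, which mirrors the advice to keep building the reversible pieces while the one-way decision waits for its owner.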

2.6 The "Design for the second-best engineer" rule

You will not be the one operating this thing in production. The team that operates it will not be the most senior team in the company. Design for the engineer who is the second-best on the team that will inherit it, on a Tuesday afternoon, three months after you've moved on. That engineer is intelligent but tired, has not read your 40-page design, has half a Slack thread of context, and just got paged.

If your design requires the brilliant engineer to keep it running, your design is wrong. Examples of the rule applied:

  • Prefer obvious over clever. If you must choose between a standard managed service and a custom event-driven mesh, the managed service wins unless the data forces otherwise.
  • Keep the operating model boring: standard SLOs, standard runbooks, standard observability stack, standard secrets store.
  • Eliminate "context-only-the-architect-knows" from the critical path. Every load-bearing decision must be a written ADR.

2.7 Three habits that separate principal from staff

  1. Quantify before you draw. Every box on the diagram has an estimated load (RPS, GB/day, concurrent users), a latency budget, a failure mode, and a cost. If you cannot fill those four columns, you have not designed it; you have drawn it.
  2. Name the failure modes. For every component: "What happens when this is slow / down / wrong / saturated / breached?" Then "Who finds out, how fast, and what do they do?" If you cannot answer, the design is incomplete.
  3. Defer the exotic. Reach for the boring tool until measurements force the exotic one. The career graveyard is full of solution architects who chose Cassandra-on-Day-One because the marketing said "scales," and now the customer has a six-node ops nightmare for 3,000 RPS.
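The "quantify before you draw" habit can be checked mechanically. The sketch below is one possible way to do it in Python: the `Box` class, its field names, and the sample numbers are all hypothetical, chosen only to show the four columns every diagram box should carry.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Box:
    """One box on the diagram plus the four columns it must carry."""
    name: str
    load: Optional[str] = None               # e.g. "3,000 RPS" or "40 GB/day"
    latency_budget_ms: Optional[int] = None
    failure_mode: Optional[str] = None       # what happens when it is slow or down
    monthly_cost_usd: Optional[float] = None


def missing_columns(box: Box) -> list[str]:
    """List the columns that are still empty for this box."""
    return [
        f.name for f in fields(box)
        if f.name != "name" and getattr(box, f.name) is None
    ]


# Illustrative values only.
boxes = [
    Box("api-gateway", "3,000 RPS", 50, "returns 503, clients retry with backoff", 1200.0),
    Box("search", load="200 RPS"),
]

for box in boxes:
    gaps = missing_columns(box)
    status = "designed" if not gaps else "only drawn, missing: " + ", ".join(gaps)
    print(f"{box.name}: {status}")
```

A box with any empty column is, in the terms above, drawn rather than designed.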

3. 🎭 The SA Landscape: Five Archetypes

"Solution Architect" is not one job; it is at least five. Be honest about which one you are this quarter — the playbook chapters land differently depending on the answer.

| Archetype | Sits in | Time horizon | Primary deliverable | Compensation model | Key risk |
| --- | --- | --- | --- | --- | --- |
| Pre-sales SA | Vendor, SI, cloud provider | Days–weeks | Demo, RFP response, statement of work | Tied to bookings/quota | Selling solutions you can't deliver |
| Delivery / Engagement SA | SI, consulting, internal program | Months | Reference architecture, ADRs, governance, handoff | Project / utilization | Diagrams that don't survive contact with reality |
| In-house Enterprise SA | Big-co IT, regulated industry | Quarters–years | Domain reference architecture, integration contracts, vendor list | Salary, sometimes bonus | Becoming a process bottleneck |
| Cloud / Platform SA | Cloud or platform vendor | Continuous | Reference architectures, customer reviews, partner enablement | Salary + variable | "Vendor goggles": every problem solved with your stack |
| Independent / Fractional SA | Boutique or freelance | Days–months | Strategy memo, vendor selection, Phase-0 design | Day rate | Scope creep, no installed credibility, payment risk |

A few non-obvious points:

  • The same person can occupy all five archetypes over a career, but the operating model differs sharply between them. A pre-sales SA who promises a feature wins the deal; a delivery SA who promises that same feature loses the project. Watch your incentives.
  • Cloud-vendor SAs are sometimes called "Solutions Architect" formally but spend ~70% of their time on enablement and reference architectures, not on a single customer's solution end-to-end. Title alike, job different.
  • Enterprise SAs in regulated industries (banking, insurance, health, telco) are often part of a governance function with veto power on certain designs. The skill is wielding that veto sparingly.

Cross-archetype constants (every SA does these): write ADRs, run NFR negotiations, design for operability, manage stakeholders, model cost. Everything else varies.


4. 🪜 SA vs TL vs Software Architect vs EA vs CTO

The single most common confusion in the role. Five real adjacent positions:

| Role | Owns | Time horizon | People management | Code authorship | Where they fail |
| --- | --- | --- | --- | --- | --- |
| Tech Lead | One team's delivery and quality | Sprints–quarters | Often dotted-line | High (15–40% of time) | Stays IC, never grows the team |
| Software / Application Architect | One product or system's internal design | Months–year | None | Medium (5–20%) | Becomes "the only one who knows it" |
| Solution Architect | One solution across systems & teams | 3–18 months | None (lateral influence) | Low (<5%, mostly spikes) | Diagrams that don't ship |
| Enterprise Architect (EA) | Enterprise IT landscape, governance, capabilities | 1–5 years | Sometimes | Almost zero | Frameworks > outcomes; "the strategy team that ships nothing" |
| CTO / VP Eng | The whole engineering organization | 6–24 months and beyond | Yes, 5–500 reports | Zero in steady state | Goes too IC or too political |

A useful mental geometry:

  • TL is vertical-narrow (one team, deep on its delivery).
  • Software Architect is vertical-deep (one product, deep on its internal structure).
  • Solution Architect is horizontal — across systems, vendors, teams — for a finite engagement.
  • EA is horizontal-and-permanent — across all of IT, with multi-year governance horizons.
  • CTO is the line manager of the system that produces all of the above.

A few specific clarifications you'll need to make to a stakeholder, probably weekly:

  • "I am a Solution Architect, not a Software Architect — I will not pick the unit-test framework. I will pick the integration contract between system A and B, the data residency boundary, and the build-vs-buy on the search component." — sets scope cleanly.
  • "I am a Solution Architect, not an Enterprise Architect — I am accountable for this solution. I will align with the EA's principles where they exist; I will not author them." — keeps scope from ballooning.
  • "I am not the Tech Lead — I do not own velocity. I own the design and the decision log. The TL owns the burn-down." — keeps you out of standups you shouldn't be in.

The role names vary by company. Validate by responsibilities, not by title. A "Senior Cloud Architect" at one shop is a Pre-sales SA; at another, an in-house Enterprise SA; at a third, a Software Architect with a vendor focus.


5. 🚪 The First 90 Days

You are new to the engagement, the team, the customer, or all three. The first 90 days are almost entirely about earning the right to design. Skip this and you will make a beautiful design that nobody implements.

5.1 The 30-day plan: listen, map, baseline

Goals: Understand the business, the people, the existing landscape, the constraints, and the political wiring. Resist every urge to draw a diagram in week one.

Do:

  • Run 15–25 discovery interviews (see §6). Across business, product, engineering, ops, security, finance, vendors, customers if possible.
  • Build a stakeholder map: who decides, who advises, who is informed, who blocks. Include their concerns and what they consider success.
  • Build a system context map: every system touching this solution, every owner, every integration. This is not a target architecture — it's archaeology.
  • Read the last 6 months of relevant documents: design docs, post-mortems, board updates, audit reports, RFP responses, vendor contracts, incident reports. Most of your design constraints are in those documents already.
  • Identify the 3 burning constraints: cost ceiling, regulatory deadline, key-person dependency, integration that's already on fire, etc. These will dominate the design.
  • Listen for the 3 zombie projects: prior attempts to solve this problem that died. Why? You inherit those carcasses.

Do not:

  • Propose a target architecture. You don't have permission yet.
  • Promise scope. You don't know what's deliverable.
  • Bash an existing system, even if it's bad. The person who built it is in the room.
  • Default to "your" stack. The customer has a stack, a team that runs it, and a budget for it.

Output by day 30: a written Discovery Findings memo (4–8 pages): business problem, current state context map, top 5 NFRs (draft), top 5 risks, top 3 zombie projects, list of unanswered questions, proposed next-30-day plan.

5.2 The 60-day plan: frame the problem, propose the shape

Goals: Get alignment on the problem, the NFRs, and the shape of the solution. Still no detailed design. The question to answer is not "what should we build?" but "what are we trying to make true by the end of this?"

Do:

  • Run an NFR workshop with the right stakeholders (see §9). Output: a signed-off NFR register with quantified targets and acceptance criteria.
  • Produce a Solution Vision doc (3–5 pages): the future state in plain English, the 3–5 architectural principles you propose to follow, the major shape (monolith vs distributed, sync vs async, on-prem vs cloud), and the top 3 strategic options at a high level (e.g., Option A: Build in-house on AWS, Option B: Buy SaaS X, Option C: Hybrid).
  • Run a risk workshop to surface the top 10 risks and their owners. Compliance, legal, vendor, key-person, technical, schedule.
  • Validate the cost ceiling with finance/CFO/Procurement: not "how much will it cost," but "what's the budget you've actually approved."

Output by day 60: a Solution Vision doc and a signed NFR register. Stakeholders should be able to repeat the problem and the principles in their own words. If they can't, you haven't done the work yet.
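To make "quantified targets and acceptance criteria" concrete, here is a small Python sketch of an NFR register. The `Nfr` class, the `allowed_downtime_minutes` helper, and the measured values are assumptions for illustration; the targets reuse the login example from §1.

```python
from dataclasses import dataclass


@dataclass
class Nfr:
    name: str
    target: float
    unit: str
    higher_is_better: bool = False  # latency: lower is better; availability: higher

    def is_met(self, measured: float) -> bool:
        """Acceptance criterion: compare a measurement against the target."""
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Translate an availability target into a downtime budget for the period."""
    return (1 - availability_pct / 100) * days * 24 * 60


register = [
    Nfr("login p99 latency", 400, "ms"),
    Nfr("availability", 99.95, "%", higher_is_better=True),
]

# Hypothetical measurements, e.g. from a load test and a month of monitoring.
measured = {"login p99 latency": 380, "availability": 99.91}

failures = [n.name for n in register if not n.is_met(measured[n.name])]
print("NFRs not met:", failures)
print(f"99.95% over 30 days allows about "
      f"{allowed_downtime_minutes(99.95):.1f} minutes of downtime")
```

Converting the availability percentage into minutes is often what makes the target negotiable: stakeholders can react to a downtime budget more easily than to a string of nines.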

5.3 The 90-day plan: design, gate, and start delivery

Goals: Produce the reference architecture, the major ADRs, the cost model, the migration plan (if applicable), and hand off to delivery. Run the first design-review gate.

Do:

  • Produce the Reference Architecture: C4 Levels 1–3 (see §8), the major data flows, the integration contracts, the deployment topology. With NFR mapping (which component delivers which NFR target).
  • Produce the first 5–10 ADRs: cloud provider, identity, primary data store, integration backbone, compute model, observability stack, secrets, multi-tenancy boundary. (Trim to what your solution actually needs.)
  • Produce the TCO model (see §15): year 1, year 3, sensitivities. Cross-check against the budget.
  • Run the architecture review with the steering committee, security, compliance, and the EA. Capture decisions and dissent.
  • Hand off to the delivery TLs and PMs with a written delivery plan and the first sprint scope.

Output by day 90: the Solution Design Pack — Vision, NFRs, Ref Arch, ADR set, Risk Register, TCO. This is what you'll be measured against for the next 6–18 months.
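The year 1 / year 3 / sensitivities shape of the TCO model can be sketched in a few lines of Python. Everything below is an assumption for illustration: the `tco` function, the cost figures, the 20% growth scenario, and the budget ceiling are invented numbers, not a recommended model (see §15 for the real treatment).

```python
def tco(one_time: float, annual_run: float, years: int, annual_growth: float = 0.0) -> float:
    """One-time cost plus run cost, with run cost growing by annual_growth each year."""
    run = sum(annual_run * (1 + annual_growth) ** year for year in range(years))
    return one_time + run


# Hypothetical inputs.
ONE_TIME = 300_000        # build, migration, licences paid up front
ANNUAL_RUN = 120_000      # cloud, support, vendor subscriptions
BUDGET_3Y = 700_000       # the ceiling finance actually approved

year_1 = tco(ONE_TIME, ANNUAL_RUN, years=1)
year_3_flat = tco(ONE_TIME, ANNUAL_RUN, years=3)
year_3_growth = tco(ONE_TIME, ANNUAL_RUN, years=3, annual_growth=0.20)

print(f"Year 1:               {year_1:>10,.0f}")
print(f"Year 3 (flat usage):  {year_3_flat:>10,.0f}")
print(f"Year 3 (+20%/year):   {year_3_growth:>10,.0f}")
print("Within 3-year budget at +20% growth:", year_3_growth <= BUDGET_3Y)
```

In this made-up case the flat scenario fits the budget and the growth scenario does not, which is exactly the kind of sensitivity the cross-check against the budget is meant to surface.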

A common mistake: trying to "complete" the design at day 90. You won't. The design will keep evolving as delivery exposes assumptions. The day-90 design is the design that's good enough to start. Plan for at least three major design review gates ahead.

5.4 The 90-day mistakes to avoid

  • Premature toolchain commitment. "We'll use Kafka." Until you know the data velocity, the team's Kafka skill, the cost, the integration mode, and whether managed Kafka exists in this region, that's a guess. Defer.
  • Saying yes to every interview. You'll burn 90 days in meetings. Prioritize the 25 highest-signal interviews; the rest go in a survey.
  • Skipping the EA. If there's an Enterprise Architect, brief them in week 1, before you produce anything. Their goodwill saves quarters.
  • Skipping security. Same. Bring them in early; they'll be your first reviewer or your last blocker. Choose.
  • Skipping finance. The cheapest way to discover the budget is to ask. The most expensive way is to design first.

6. 🔍 Discovery: The Real Job Begins Here

Discovery is not a phase you finish; it's the foundation that quietly determines whether the design is right. Most failed solutions are failures of discovery, not of design. You designed a great solution to the wrong problem.

6.1 The five layers of discovery

You have to surface all five. Skipping any will haunt you.

| Layer | What you're trying to learn | Asked of |
| --- | --- | --- |
| Business | Why this solution, what outcomes, what dollar value, what deadline | Sponsor, business owner, CFO |
| User / Customer | Who uses this, how, when, what's painful, what does success feel like | Product, end users, support |
| Functional | The capabilities the solution must provide | Product, BAs, domain experts |
| Non-functional | The quality attributes (perf, availability, cost ceiling, security, compliance) | Ops, security, compliance, finance |
| Constraint | What the customer already has, can run, will allow, can pay | All of the above + procurement, legal, vendor management |

A solution that ships is one where the constraint layer was discovered first. Most SAs discover it last — usually the day before architecture review, when procurement says "we don't have a contract with that vendor and won't get one in your timeline."

6.2 The Five Whys, applied to solution design

When a stakeholder hands you a "requirement," it is almost always a solution they already chose, not the actual requirement. Apply the Five Whys.

Stakeholder: "We need a real-time dashboard."

SA: "Why?"

Stakeholder: "So executives can see the funnel."

SA: "Why does that need real-time?"

Stakeholder: "Well, end-of-day is fine, but the current system is two days behind."

SA: "If we made it next-day reliable, would that solve the problem?"

Stakeholder: "Yes, that's actually fine."

You just saved $200k of streaming infra and 4 months. Do this on every requirement. Real-time, high-availability, multi-region, full-mesh, blockchain — these are almost always pre-baked solutions. Find the underlying need.

6.3 The discovery interview: a script

Each interview is 45–60 minutes. Always one note-taker (you, or a co-architect) so eye contact is preserved.

  1. Their context (5 min): role, team, what they own, how long they've been in the seat.
  2. Their world today (15 min): "Walk me through a typical week. What's working, what's broken, what wakes you up?" Listen for the language they use — that's the language to use back.
  3. Their wishlist (10 min): "If I could give you three things tomorrow, what would they be?" Distinguish wish from need.
  4. Their constraints (15 min): "What can't change? What's off-limits? What would your boss kill?" — these are the irreversible boundaries.
  5. Their concerns (10 min): "What's the most likely way this project goes wrong?" — the most undervalued question. Their answer is your risk register, free.
  6. Wrap (5 min): summarize back, ask "did I get that right?", ask "who else should I talk to?", thank, schedule follow-up if needed.

Anti-patterns:

  • Leading with technology. "Are you on AWS or Azure?" turns the interview into a stack survey before you understand the problem. Save it for the constraints block.
  • Selling. You're not pitching yet. Asking and listening is the entire job for now.
  • Note-light. Memory degrades by 50% in 24 hours. Type or transcribe; review same-day.

6.4 The context map — your most reused artifact

A context map is a one-page diagram of every system, every team, every integration, every data flow that touches this solution today, with arrows labeled. Not a target architecture; not beautiful; exhaustive.

This single artifact will be the most-photographed page of every meeting you run for the next 6 months. Conventions:

  • Every box has an owner (team or person).
  • Every arrow has a protocol (REST, gRPC, file drop, JDBC, message queue) and a frequency.
  • Every system has a "stability" tag: green (stable), yellow (planned change), red (deprecating, on fire, or unowned).
  • Every external system has a vendor name and contract status.

If you can produce a high-quality context map and the stakeholders argue with it, you've already done your job — you've surfaced their misalignment about what they have today. Half of "design problems" are actually "we don't agree on the current state."
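The four conventions above lend themselves to a simple completeness check. The Python sketch below is one way to hold a context map as data and lint it; the `System` and `Flow` classes, the field names, and the sample systems are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

STABILITY_TAGS = {"green", "yellow", "red"}


@dataclass
class System:
    name: str
    owner: str                 # team or person
    stability: str             # green / yellow / red
    external: bool = False
    vendor: str = ""
    contract_status: str = ""


@dataclass
class Flow:
    source: str
    target: str
    protocol: str              # REST, gRPC, file drop, JDBC, message queue
    frequency: str             # e.g. "nightly", "per request"


def lint_context_map(systems: list[System], flows: list[Flow]) -> list[str]:
    """Return every place where the map breaks one of the conventions."""
    problems = []
    names = {s.name for s in systems}
    for s in systems:
        if not s.owner:
            problems.append(f"{s.name}: no owner")
        if s.stability not in STABILITY_TAGS:
            problems.append(f"{s.name}: missing or unknown stability tag")
        if s.external and not (s.vendor and s.contract_status):
            problems.append(f"{s.name}: external system without vendor or contract status")
    for f in flows:
        label = f"{f.source} -> {f.target}"
        if not f.protocol or not f.frequency:
            problems.append(f"{label}: arrow needs protocol and frequency")
        for end in (f.source, f.target):
            if end not in names:
                problems.append(f"{label}: '{end}' is not on the map")
    return problems


# Illustrative map with deliberate gaps.
systems = [
    System("billing", owner="finance-platform", stability="green"),
    System("legacy-crm", owner="", stability="red"),
    System("payments-gateway", owner="payments", stability="yellow", external=True),
]
flows = [
    Flow("billing", "payments-gateway", protocol="REST", frequency="per request"),
    Flow("legacy-crm", "billing", protocol="file drop", frequency=""),
    Flow("billing", "data-warehouse", protocol="JDBC", frequency="nightly"),
]

problems = lint_context_map(systems, flows)
for problem in problems:
    print(problem)
```

Each reported gap is a question for a specific stakeholder, which is a cheap way to turn the map into an interview agenda.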

6.5 The unspoken constraints

The constraints stakeholders don't say are usually the ones that kill the project.

  • Vendor relationships. "We can't use AWS — the CIO had a fight with their AE in 2024." (True story.)
  • Data residency. "Our German customers' data cannot leave the EU." Often only spoken when the contract review starts.
  • Internal politics. "The data team will block any solution that has its own database." Unstated until day 60.
  • Off-the-record commitments. "We promised the regulator we'd be on-prem until 2027." Lives in someone's email, not the wiki.
  • Headcount realities. "We will lose half the platform team in Q3 to the new product." Spoken only at the leaving drinks.

You discover these by asking specifically: "What are the things the org has decided that aren't written down?" "What does the CFO/CIO/CISO refuse to do?" "Who is leaving in the next year?" Ask once per interview, in the constraints block. Some you'll only learn by being around for 60+ days.

6.6 The discovery output

A 4–8 page memo with these sections, every time:

  1. Problem statement (1 paragraph). The business outcome, not the technology.
  2. Stakeholders (table). Who decides, advises, blocks, is informed.
  3. Current state (1 page + context map). What's running today.
  4. Top 5 NFR drafts (table with quantified targets). Subject to §9.
  5. Top 5 risks (table). With owners. The day-60 risk workshop (§5.2) expands this to the top 10.
  6. Open questions (list). With dates by which they must be answered.
  7. Recommended next steps (numbered list).

Send it. Get reactions. Iterate. Do not design the solution before this memo is signed off. If you do, you'll design the wrong solution.


7. 📐 Solution Design Methodology

You have the discovery in hand. Now you design. The disciplined SA does not start in Visio; they start in a structured methodology that compresses what we know into what we're choosing.

7.1 RAPID-S, adapted for solutions

The system-design interview framework adapts well to real solutions. Six phases, in order:

  1. R — Requirements: functional + non-functional + constraints. Already done in discovery; reformulate as a one-pager.
  2. A — API / Interface contracts: what does this solution expose, to whom, with what guarantees. Public APIs, integration contracts, event schemas.
  3. P — Persistence model: data ownership, schema sketch, retention, residency. Not the table schema — the boundaries of data.
  4. I — Infrastructure: compute model, deployment topology, network, identity, observability stack.
  5. D — Decisions: ADRs for the irreversible 5–10 choices. The lasting artifact.
  6. S — Scaling, security, sustainability: the NFR enforcement plan. How the solution holds at 10× load, an attempted breach, and 3 years from now.

Walk it in this order. RA-first, not I-first. The most common mistake is jumping to I (the cloud diagram) before R is signed off — you end up architecting the wrong NFR class.

7.2 The two designs — current vs target — and the gap

Every design is really three documents in one:

  • Current state architecture (CSA): what's running today.
  • Target state architecture (TSA): where we want to be.
  • Transition architecture(s): the intermediate states that are themselves runnable.

A common mistake: drawing only the TSA. The TSA is hypothetical until the transition is designed. Most projects fail in the transition, not in the target. The transition has to be runnable: every milestone is a live, supported, monitored state.

For migration-heavy work, draw at least 3 transition architectures, not 1. (See §17.)

7.3 The principles set: the design constitution

Before drawing a single box, write 5–7 principles the solution will follow. These are explicit value choices the team can cite during inevitable arguments. Examples:

  • "Buy before build, unless build is a clear strategic differentiator."
  • "Every service is owned by exactly one team."
  • "All data classified as PII is encrypted at rest with a customer-managed key."
  • "Synchronous calls only between services in the same trust boundary; cross-boundary is async."
  • "Single primary cloud (AWS); secondary cloud only for DR or specific regulated workloads."
  • "Every public API is versioned and documented in OpenAPI before code is written."
  • "Observability stack is shared; teams do not roll their own."

Principles are most useful when they cost something. "Be secure" is not a principle, it's a wish. "Customer-managed keys for all PII" is a principle — it costs latency, complexity, and budget. That's why it's load-bearing.
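Principles that cost something can often be checked mechanically. The following is a minimal sketch, assuming a hypothetical service catalog (the `Service` fields and the sample entries are invented for illustration), that tests two of the example principles above: single ownership and customer-managed keys for PII.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical catalog entry; field names are illustrative assumptions."""
    name: str
    owners: tuple[str, ...]            # teams that claim ownership
    stores_pii: bool = False
    uses_customer_managed_key: bool = False


def principle_violations(services: list[Service]) -> list[str]:
    """Return one message per violated principle, per service."""
    violations = []
    for svc in services:
        # Principle: "Every service is owned by exactly one team."
        if len(svc.owners) != 1:
            violations.append(
                f"{svc.name}: expected exactly 1 owner, found {len(svc.owners)}"
            )
        # Principle: PII is encrypted at rest with a customer-managed key.
        if svc.stores_pii and not svc.uses_customer_managed_key:
            violations.append(f"{svc.name}: stores PII without a customer-managed key")
    return violations


catalog = [
    Service("orders", owners=("checkout",), stores_pii=True, uses_customer_managed_key=True),
    Service("profiles", owners=("identity", "growth"), stores_pii=True),
    Service("search", owners=()),
]

for message in principle_violations(catalog):
    print(message)
```

A principle that can be expressed as a check like this is easier to cite in an argument, because the violation is a fact rather than an opinion.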

7.4 The strategic options analysis (SOA)

Before committing to an architecture, write 2–4 strategic options and analyze each. Don't compare 8 — analysis paralysis. Don't compare 1 — that's a recommendation, not analysis. Three is usually right.

| Option | Description | Pros | Cons | Cost (Y1 / Y3) | Risk | Recommendation |
| --- | --- | --- | --- | --- | --- | --- |
| A | Build in-house on AWS | Full control, integrates with rest of stack | 9-month build, hire 4 engineers | $1.2M / $2.4M | Hiring market | Default |
| B | Buy SaaS (Vendor X) | 6 weeks to live, vendor handles ops | Lock-in, integration cost, $400k/yr forever | $0.5M / $1.5M | Vendor risk | Recommended |
| C | Hybrid: buy core, build edges | Best of both | Two teams to manage, integration complexity | $0.9M / $2.1M | Coordination | Acceptable backup |

This is a steering-committee artifact. It compresses 200 pages of analysis into one defensible recommendation. Commit to one option in the SOA, with rationale. Wishy-washy "any could work" outputs get re-debated for months.
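The two rules of the SOA (compare 2–4 options, commit to exactly one) are simple enough to enforce in code. This is a rough sketch, not a prescribed tool: the `Option` type and `validate_soa` function are hypothetical names, and the figures are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    key: str
    description: str
    cost_y1_musd: float   # Y1 cost in $M, from the table above
    cost_y3_musd: float   # Y3 cost in $M, from the table above
    recommendation: str


def validate_soa(options: list[Option]) -> Option:
    """Enforce the two SOA rules and return the single recommended option."""
    if not 2 <= len(options) <= 4:
        raise ValueError(f"an SOA compares 2-4 options, got {len(options)}")
    recommended = [o for o in options if o.recommendation == "Recommended"]
    if len(recommended) != 1:
        raise ValueError(f"commit to exactly one option, got {len(recommended)}")
    return recommended[0]


soa = [
    Option("A", "Build in-house on AWS", 1.2, 2.4, "Default"),
    Option("B", "Buy SaaS (Vendor X)", 0.5, 1.5, "Recommended"),
    Option("C", "Hybrid: buy core, build edges", 0.9, 2.1, "Acceptable backup"),
]

choice = validate_soa(soa)
print(f"Recommend option {choice.key}: {choice.description}")
```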

7.5 The "shape before the boxes" principle

A design has a shape before it has components. Decide the shape first:

  • Topology: monolith, modular monolith, microservices, mesh, micro-frontends, event-driven, batch.
  • Data flow: request/response, fan-out, pipeline, lake.
  • State: stateless services + data tier, stateful services with replication, ephemeral compute.
  • Multi-tenancy: shared everything, shared infra-isolated data, per-tenant deployment.
  • Failure model: graceful degradation, circuit breaker, retry, fallback to cache, fail fast.

Decide these before the cloud diagram. The cloud diagram is the implementation of the shape; many cloud diagrams can render the same shape; many shapes can be incompatible with the same NFRs. Get the shape right — the rest is wiring.


8. 🗂️ Documenting a Solution: C4, ADRs, arc42

Three documentation tools cover 90% of SA work. Use them. Stop using "shapes in PowerPoint."

8.1 The C4 Model (Simon Brown)

A hierarchy of architecture diagrams that scales from "show this to a CFO" to "show this to a developer." Four levels:

| Level | Audience | What it shows | Example |
| --- | --- | --- | --- |
| L1: System Context | Non-technical stakeholders, exec, customer | The system as one box, with users and external systems around it | "Order System receives orders from Web/Mobile, queries Inventory and CRM, sends to Fulfillment" |
| L2: Container | Architects, leads, sec, ops | Internal containers (apps, databases, queues) inside the system box | "API service, worker, Postgres, Redis, S3" |
| L3: Component | Engineers, designers | Components inside one container | "OrderController → OrderService → OrderRepository" |
| L4: Code | Engineers (rarely) | Class diagrams (mostly auto-generated) | Skip in 99% of cases |

For a typical solution: produce L1 always, L2 always, L3 for the 2–3 most novel containers, L4 never. Tooling: Structurizr, draw.io, Excalidraw, Mermaid (in-line in Markdown — composes with ADRs beautifully).

A common SA failure: starting at L2 with a 40-box diagram and never producing L1. Without L1 the CFO has no idea what they're funding. Always L1 first.
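Producing L1 first is cheap when the diagram is generated from a few lines of data. The sketch below is one possible approach, assuming plain Mermaid flowchart syntax and a hypothetical `context_diagram` helper; it renders the L1 example from the table above (Order System with its callers and external systems) as text that can sit in a Markdown file.

```python
def context_diagram(system: str, inbound: list[str], outbound: dict[str, str]) -> str:
    """Render an L1-style system context view as Mermaid flowchart text.

    inbound:  actors or channels that call the system
    outbound: external system name -> verb describing the relationship
    """
    lines = ["flowchart LR", f'    sys["{system}"]']
    for i, actor in enumerate(inbound):
        lines.append(f'    src{i}["{actor}"] --> sys')
    for i, (target, verb) in enumerate(outbound.items()):
        lines.append(f'    sys -->|{verb}| ext{i}["{target}"]')
    return "\n".join(lines)


diagram = context_diagram(
    "Order System",
    inbound=["Web", "Mobile"],
    outbound={
        "Inventory": "queries",
        "CRM": "queries",
        "Fulfillment": "sends orders to",
    },
)
print(diagram)
```

The system stays a single box on purpose: anything that would add a second internal box belongs in L2.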

8.2 Architecture Decision Records (ADRs)

The single most important document genre in solution architecture. An ADR captures one decision, the alternatives, the rationale, and the consequences. Format (Michael Nygard variant, lightly extended for SA use):

# ADR-0007: Use AWS Aurora PostgreSQL for the OLTP store

Date: 2026-05-06
Status: Accepted
Reversibility: One-way (data migration is expensive)
Context owners: SA, Data Lead, Platform Lead

## Context
We need a primary OLTP store for order, inventory, and customer data, sized for 5,000 RPS peak, sub-50ms p99 reads, RPO ≤ 5min, RTO ≤ 1hr, single region with read replicas, encryption at rest with CMK, regional residency in eu-west-1.

## Decision
Use Amazon Aurora PostgreSQL 16, multi-AZ, with two read replicas, snapshot every 6 hours.

## Alternatives considered
- Self-managed PostgreSQL on EC2: rejected — operational cost, no team capacity for tuning.
- Amazon RDS PostgreSQL: viable, but Aurora's storage model gives better failover characteristics for our RTO target.
- DynamoDB: rejected — relational schema, ad-hoc joins required for the order workflow, would force redesign.
- CockroachDB: rejected — multi-region not yet a requirement, adds operational burden.

## Consequences
+ Managed, in-region, meets RPO/RTO.
+ Familiar SQL surface for the team.
+ Encryption with CMK supported natively.
- Vendor lock-in to AWS (mitigated by standard PostgreSQL surface).
- Cost: ~$8k/month at the targeted size (see TCO doc §3).

## Compliance and security notes
- CMK in KMS, rotated annually.
- IAM authentication enabled; no static passwords.
- Audit logging to CloudWatch Logs → S3 → SIEM, retained 7 years per policy P-23.

## Open follow-ups
- Validate read-replica lag under failover (load test before go-live).
- Decide PITR window with Compliance team.

Rules of ADR hygiene that compound over years:

  • Numbered, never deleted. ADR-0007-aurora.md. If a decision is reversed, write ADR-0023: Reverse ADR-0007 — switch to RDS for cost reasons. Append history. Never rewrite.
  • One decision per ADR. Two decisions = two ADRs. Otherwise the rationale becomes mush.
  • Reversibility tag. Forces honesty.
  • Alternatives section is mandatory. A decision without alternatives is a preference. Always list ≥2.
  • Consequences are signed. A consequence labeled "we accept higher latency for cross-region reads" is a contract — surface it during review.
  • Stored with the code. docs/adr/0001-cloud-provider.md in the repo, not buried in Confluence. Engineers read code; they only sometimes read Confluence.

A solution with 25–60 well-maintained ADRs is unkillable — its decisions can be defended, audited, and evolved. A solution with 200 PowerPoint slides and zero ADRs is unmaintainable — when anyone leaves, the rationale is lost and the design starts decaying.
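Several of the hygiene rules above can be checked automatically in CI. The following is a minimal sketch under stated assumptions: the `lint_adr` function is hypothetical, the filename pattern (four digits, optional `ADR-` prefix) is inferred from the examples above, and the section names follow the sample ADR.

```python
import re

REQUIRED_SECTIONS = ("Context", "Decision", "Alternatives considered", "Consequences")


def lint_adr(filename: str, text: str) -> list[str]:
    """Return hygiene problems for one ADR; an empty list means it passes."""
    problems = []
    # Numbered: assumes names such as 0007-aurora.md or ADR-0007-aurora.md.
    if not re.match(r"^(ADR-)?\d{4}-[a-z0-9-]+\.md$", filename):
        problems.append("filename is not of the form NNNN-title.md")
    # Reversibility tag: a header line with a non-empty value.
    if not re.search(r"^Reversibility:\s*\S", text, flags=re.MULTILINE):
        problems.append("missing Reversibility tag")
    for section in REQUIRED_SECTIONS:
        if not re.search(rf"^## {re.escape(section)}\s*$", text, flags=re.MULTILINE):
            problems.append(f"missing section: {section}")
    # Alternatives: count bullet lines between that heading and the next one.
    match = re.search(
        r"^## Alternatives considered\s*\n(.*?)(?=^## |\Z)",
        text,
        flags=re.MULTILINE | re.DOTALL,
    )
    if match:
        bullets = [ln for ln in match.group(1).splitlines() if ln.startswith("- ")]
        if len(bullets) < 2:
            problems.append("fewer than 2 alternatives listed")
    return problems


sample = """# ADR-0007: Use AWS Aurora PostgreSQL for the OLTP store

Status: Accepted

## Context
Primary OLTP store for orders.

## Decision
Use Amazon Aurora PostgreSQL.

## Alternatives considered
- Amazon RDS PostgreSQL: viable.

## Consequences
- Vendor lock-in to AWS.
"""

for problem in lint_adr("0007-aurora.md", sample):
    print(problem)
```

A check like this cannot judge whether the rationale is good, only whether the structure that forces a rationale is present.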

8.3 arc42

A 12-section architecture documentation template. Use it as the table of contents for your Solution Design Pack (§5.3). Sections (lightly summarized):

  1. Introduction & Goals
  2. Constraints
  3. Context & Scope (= C4 L1)
  4. Solution Strategy (= the principles, the SOA recommendation)
  5. Building Block View (= C4 L2/L3)
  6. Runtime View (sequence diagrams for key flows)
  7. Deployment View (the actual cloud topology)
  8. Cross-cutting Concepts (security, observability, resilience patterns)
  9. Architecture Decisions (link to ADRs)
  10. Quality Requirements (= NFRs, see §9)
  11. Risks and Technical Debt (= risk register)
  12. Glossary

You don't need every section every time, but having a consistent ToC across solutions removes a class of "where do I look?" overhead for everyone downstream. Pair arc42 with C4 for diagrams and ADRs for decisions, and you have a complete kit.

8.4 Documentation that ages

The hardest discipline in SA documentation is keeping it alive. Three rules that make the difference:

  1. Source-of-truth in the repo. Markdown, diagrams in Mermaid/Structurizr, ADRs as files. PR reviews catch drift; Confluence hides it.
  2. Reviewed at gates. At every steering committee, every release, and every quarter, open the relevant doc and ask the team "is this still true?" If not, fix it now.
  3. Owned by name. Each doc lists an owner. When the owner leaves the project, ownership transfers in writing. Otherwise the doc dies the day they leave.

9. 🎯 Non-Functional Requirements: The Real Job

If you take one section away, take this one. Most SA failures aren't bad designs — they're sloppy or missing NFRs. The contract between business and technology lives in this section.

9.1 The eight NFR classes

Every solution has targets in eight classes. Make them explicit, quantified, and acceptance-tested.

| Class | What to specify | Example |
| --- | --- | --- |
| Performance | Latency p50/p95/p99, throughput, cold-start | "p99 ≤ 400ms at 5,000 RPS, p99 cold-start ≤ 2s" |
| Availability | Uptime SLO, error budget, planned downtime | "99.95% per calendar month, ≤4hr planned/yr" |
| Reliability / Resilience | RPO, RTO, max tolerated dependency outage | "RPO ≤ 5min, RTO ≤ 1hr, survive single AZ loss" |
| Scalability | Peak load, growth runway, scale type | "10× burst, 3-year runway, horizontal-only" |
| Security | Threat model, controls, IAM model, encryption | "STRIDE-reviewed, CMK at rest, MFA admin" |
| Compliance | Frameworks, audit obligations, data classes | "SOC 2 Type II, GDPR, HIPAA-eligible, PCI-out-of-scope" |
| Cost | Y1/Y3 ceiling, $/transaction, cost-per-tenant | "≤$80k/mo Y1, $0.04/order, scale linearly to $200k/mo at 10×" |
| Operability | Monitoring, on-call expectations, runbook coverage | "Every critical path observed; on-call rotation; p99 time-to-detect ≤ 30min" |

Add as needed: usability, accessibility (WCAG 2.2 AA), localization, internationalization, sustainability (kgCO2e/req), data quality.

9.2 The NFR negotiation

Every NFR target costs something. Each target has a direct line to the bottom-line cost. The negotiation is not "what do we need," it's "what are we willing to pay for."

Examples of the cost curve:

  • 99.9% → 99.95% availability: roughly 2× infra cost (multi-AZ active-active, replicated state, faster failover). Plus on-call maturity.
  • p99 ≤ 200ms → p99 ≤ 50ms: usually a fundamental architecture change (cache layer, edge compute, denormalization). Sometimes 5× the cost.
  • RPO 5min → RPO 0: synchronous replication, multi-region writes, conflict resolution, latency hit. Often the hardest NFR.
  • Multi-region active-active: 2–3× infra cost, 5–10× design complexity. Don't accept it without an explicit business case.
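It helps to put the availability step into minutes before the workshop. A small worked calculation, assuming a 30-day window (a calendar month of 28 to 31 days shifts the figures slightly); the function name is illustrative.

```python
def downtime_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of allowed unavailability in a window of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


for slo in (99.9, 99.95, 99.99):
    print(f"{slo}% -> {downtime_budget_minutes(slo):.1f} min per 30 days")
```

Moving from 99.9% to 99.95% halves the budget from about 43 minutes to about 22 minutes per 30 days, which leaves little room for a manual failover and is one way to see why the infrastructure and on-call cost rises.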

Run an NFR workshop during the 30–60 day window. Whiteboard. Each line: target / cost / acceptance test. Force the business owner to commit to the target with the cost on the table. Sign the page. Photograph it. That's the contract.

9.3 NFR acceptance tests

An NFR target without an acceptance test is a wish. For every quantified target, write how you will verify it.

| NFR | Target | Acceptance test |
| --- | --- | --- |
| Latency | p99 ≤ 400ms at 5,000 RPS | k6 load test, soak 1hr, p99 from server-side metrics |
| Availability | 99.95%/month | SLO measured by SLI = (success/total) over 30d trailing |
| RPO | ≤ 5min | DR drill quarterly; restore from backup and measure data loss against the RPO |
| Cost | ≤ $80k/mo | FinOps weekly tag-based report; alert at 80% threshold |
| Security | STRIDE-passed | Threat model reviewed by security pre go-live; pen-test pre-prod |
| Compliance | SOC 2 Type II | External auditor, annual; controls evidenced in GRC tool |
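The availability row is the easiest one to make concrete. Below is a minimal sketch of the SLI = success/total calculation and the share of the error budget consumed; the request counts are invented sample data, and the function names are illustrative rather than taken from any monitoring product.

```python
def availability_sli(success: int, total: int) -> float:
    """SLI as the fraction of successful requests in the trailing window."""
    if total <= 0:
        raise ValueError("total must be positive")
    return success / total


def error_budget_consumed(success: int, total: int, slo: float = 0.9995) -> float:
    """Fraction of the error budget used; above 1.0 means the SLO is breached."""
    allowed_failures = total * (1 - slo)
    return (total - success) / allowed_failures


# Invented sample: 10M requests over the trailing 30 days, 4,000 failed.
total_requests = 10_000_000
successful = 9_996_000

sli = availability_sli(successful, total_requests)
used = error_budget_consumed(successful, total_requests)
print(f"SLI = {sli:.4%}, error budget used = {used:.0%}")
```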

If you can't write an acceptance test, you don't have a real NFR. Refuse vague NFRs ("highly available", "fast", "secure") until they're quantified.
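This refusal rule can be applied as a gate on the NFR list itself. The sketch below is an illustration only: the `NFR` type is hypothetical, and the vague-term set contains just the three examples named above, so a real list would need to be longer.

```python
from dataclasses import dataclass

# Only the three vague targets named in the text; extend as needed.
VAGUE_TARGETS = {"highly available", "fast", "secure"}


@dataclass(frozen=True)
class NFR:
    name: str
    target: str
    acceptance_test: str = ""


def refusal_reasons(nfr: NFR) -> list[str]:
    """Return why an NFR should be refused; an empty list means it is accepted."""
    reasons = []
    target = nfr.target.strip().lower()
    if not target or target in VAGUE_TARGETS:
        reasons.append("target is not quantified")
    if not nfr.acceptance_test.strip():
        reasons.append("no acceptance test")
    return reasons


nfrs = [
    NFR("Latency", "p99 ≤ 400ms at 5,000 RPS", "k6 load test, soak 1hr"),
    NFR("Availability", "highly available"),
    NFR("Security", "secure", "pen-test pre-prod"),
]

for nfr in nfrs:
    reasons = refusal_reasons(nfr)
    status = "accepted" if not reasons else "refused: " + ", ".join(reasons)
    print(f"{nfr.name}: {status}")
```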

9.4 NFR mapping to components

For each NFR, identify which components in the architecture deliver it. This map should be in the Reference Architecture doc.

Availability 99.95% — delivered by:
  - Multi-AZ Aurora (primary + replicas)
  - ALB across 2 AZs
  - ECS Fargate with min 2 tasks per AZ
  - DNS failover (Route 53 health checks)
  - Runbook RB-007 (db failover) drilled quarterly

When a stakeholder questions "are we sure we hit 99.95%?", you point to the map. When the on-call engineer asks "why is everything in multi-AZ?", you point to the map. When the CFO asks "why are we spending 2× on infra?", you point to the map.
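Kept as data, the map can answer those questions in both directions and flag NFRs that no component delivers. A minimal sketch, using the availability entry above; the empty RPO entry is a deliberate, hypothetical gap to show the check firing, and the function names are illustrative.

```python
NFR_MAP: dict[str, list[str]] = {
    "Availability 99.95%": [
        "Multi-AZ Aurora (primary + replicas)",
        "ALB across 2 AZs",
        "ECS Fargate with min 2 tasks per AZ",
        "DNS failover (Route 53 health checks)",
        "Runbook RB-007 (db failover) drilled quarterly",
    ],
    "RPO ≤ 5min": [],  # hypothetical gap: nothing mapped yet
}


def unmapped_nfrs(nfr_map: dict[str, list[str]]) -> list[str]:
    """NFRs with no delivering component: each one is an unfunded promise."""
    return [nfr for nfr, components in nfr_map.items() if not components]


def nfrs_served_by(keyword: str, nfr_map: dict[str, list[str]]) -> list[str]:
    """Reverse lookup: which NFRs justify components matching `keyword`."""
    needle = keyword.lower()
    return [
        nfr
        for nfr, components in nfr_map.items()
        if any(needle in component.lower() for component in components)
    ]


print("Unmapped:", unmapped_nfrs(NFR_MAP))
print("Why multi-AZ:", nfrs_served_by("multi-az", NFR_MAP))
```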

9.5 The NFR-to-architecture pressure test

Before the architecture review, take each NFR and stress-test:

  • "What if we 10×'d the latency target?" — is that just a knob, or a redesign?
  • "What if compliance moved from SOC 2 to FedRAMP Moderate?" — fundamental redesign or incremental?
  • "What if cost dropped 50%?" — what would we cut?
  • "What if availability moved from 99.95% to 99.5%?" — what could we simplify?

If a small NFR change forces a fundamental redesign, you've got an architecture that's brittle to its NFRs. Flag this as a risk and consider a more flexible shape.

(To be continued.) Read Part 2 here: https://viblo.asia/p/the-solution-architect-playbook-from-best-designer-to-best-bridge-part-2-PoL7e0Xa4vk


If you found this helpful, let me know by leaving a 👍 or a comment! If you think this post could help someone, feel free to share it. Thank you very much! 😃

