System & Infrastructure Architecture

24 · System & Infrastructure Architecture

The layer beneath the application: how systems scale and stay available, and the infrastructure that runs them (load balancing, caching tiers, databases, message queues, CDNs/edge, containers and orchestration, CI/CD, infrastructure-as-code, observability, and deployment strategies). Written for a frontend engineer who must design integrations, reason about system design interviews, and ship to production reliably.


Positioning

A senior frontend engineer isn’t a DevOps engineer or SRE, but operates inside a system and must understand it: where your app is served from, how it scales, why the API is sometimes slow, what a CDN/edge does to your caching (18), how your deploy reaches users, and how to read a system-design interview. This file gives the system-design vocabulary (scaling, availability, consistency, caching, queues) and the infra literacy (containers, CI/CD, IaC, observability, deploy strategies) that senior frontend roles assume. It complements software architecture (10, 20–22) and the decision-making file (25).


Foundations: the qualities you’re designing for

System architecture trades off a handful of qualities:

  • Scalability: handle growth in load/data without redesign.
  • Availability: stay up (measured in “nines”: 99.9% ≈ 8.76 h/yr down; 99.99% ≈ 52.6 min/yr).
  • Reliability / Fault tolerance: keep working despite component failures.
  • Performance / Latency: fast responses (15, 18).
  • Consistency: all readers see the same data (vs eventual consistency, 13).
  • Maintainability, Security (17), Cost.
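The “nines” figures above are simple arithmetic, and a small helper makes it easy to check other targets. This is a minimal sketch; the function name `downtimeHoursPerYear` is made up for illustration, and it assumes a 365-day year.

```typescript
// Hours in a non-leap year (assumption: leap years are ignored).
const HOURS_PER_YEAR = 365 * 24; // 8760

// Allowed downtime per year, in hours, for an availability target in percent.
function downtimeHoursPerYear(availabilityPercent: number): number {
  if (availabilityPercent < 0 || availabilityPercent > 100) {
    throw new RangeError("availability must be between 0 and 100");
  }
  return HOURS_PER_YEAR * (1 - availabilityPercent / 100);
}

const threeNinesHours = downtimeHoursPerYear(99.9);         // roughly 8.76 hours
const fourNinesMinutes = downtimeHoursPerYear(99.99) * 60;  // roughly 52.6 minutes
```

Each extra nine cuts the downtime budget by a factor of ten, which is why every additional nine tends to cost far more than the previous one.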

Two master trade-offs frame everything:

  • CAP theorem: under a network Partition you must choose Consistency or Availability. PACELC extends it: else (no partition), trade Latency vs Consistency. Distributed systems are usually eventually consistent by choice, which is why your UI must tolerate stale reads (13).
  • Vertical vs horizontal scaling: scale up (bigger machine: simple, has a ceiling, single point of failure) vs scale out (more machines: near-unlimited, needs statelessness + load balancing + coordination). Modern systems scale out; the enabling requirement is statelessness (no per-user state on a given server; push it to a shared store/session service).

Deep dive: system building blocks

1. Load balancing

Distributes traffic across many server instances (round-robin, least-connections, IP-hash, latency-based). Enables horizontal scaling and availability (route around dead instances via health checks). Lives at L4 (TCP) or L7 (HTTP, can route by path/host โ€” relevant to MFE/zone routing, 09/08). Adds the need for statelessness or sticky sessions.
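Two of the strategies named above, round-robin and least-connections, can be sketched in a few lines. This is an illustrative toy, not how any particular load balancer is implemented; the `Backend` shape and the `healthy` flag (which a real balancer would update from periodic health checks) are assumptions.

```typescript
// Hypothetical backend record; `healthy` would be maintained by health checks.
interface Backend {
  id: string;
  healthy: boolean;
  activeConnections: number;
}

// Round-robin: cycle through backends in order, skipping unhealthy ones.
class RoundRobinBalancer {
  private cursor = 0;
  private backends: Backend[];

  constructor(backends: Backend[]) {
    this.backends = backends;
  }

  next(): Backend {
    for (let i = 0; i < this.backends.length; i++) {
      const index = (this.cursor + i) % this.backends.length;
      const candidate = this.backends[index];
      if (candidate.healthy) {
        this.cursor = (index + 1) % this.backends.length;
        return candidate;
      }
    }
    throw new Error("no healthy backends");
  }
}

// Least-connections: pick the healthy backend with the fewest open connections.
function leastConnections(backends: Backend[]): Backend {
  const healthy = backends.filter((b) => b.healthy);
  if (healthy.length === 0) throw new Error("no healthy backends");
  return healthy.reduce((best, b) =>
    b.activeConnections < best.activeConnections ? b : best,
  );
}
```

Note that neither strategy cares which user is asking, which is exactly why the app tier has to be stateless (or the balancer has to pin users with sticky sessions).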

2. Caching (the highest-leverage performance tool, at every tier)

Store computed/fetched results closer to the consumer. Tiers, outer→inner:

  • Browser cache + HTTP caching (18): Cache-Control, ETag, stale-while-revalidate.
  • CDN / edge cache: static assets and increasingly dynamic/edge-rendered content at PoPs near users (18).
  • Application / in-memory cache: Redis/Memcached for sessions, computed results, rate-limit counters, hot data.
  • Database cache: query/result caches, materialized views (CQRS read models, 13).

Core concerns: invalidation (“one of the two hard problems”), TTL, eviction (LRU/LFU), cache stampede (many misses at once → use request coalescing/locks), and write strategies (write-through, write-back, cache-aside). A BFF (12) is a common caching choke point.
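Cache-aside with a TTL and request coalescing can be sketched as follows. This is an in-process illustration under stated assumptions: the class name `CoalescingCache` and the injectable `now` clock are invented here, and a real deployment would more likely put the values in Redis or Memcached and coordinate across processes with a lock.

```typescript
type Loader<V> = (key: string) => Promise<V>;

class CoalescingCache<V> {
  private values = new Map<string, { value: V; expiresAt: number }>();
  private inFlight = new Map<string, Promise<V>>();
  private load: Loader<V>;
  private ttlMs: number;
  private now: () => number;

  constructor(load: Loader<V>, ttlMs: number, now: () => number = Date.now) {
    this.load = load;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): Promise<V> {
    // 1. Fresh hit: serve from the cache.
    const hit = this.values.get(key);
    if (hit && hit.expiresAt > this.now()) return Promise.resolve(hit.value);

    // 2. A load for this key is already running: share it instead of
    //    starting another (this is the stampede protection).
    const pending = this.inFlight.get(key);
    if (pending) return pending;

    // 3. Miss: load from the source, store with a TTL, clear the in-flight entry.
    const promise = this.load(key).then(
      (value) => {
        this.inFlight.delete(key);
        this.values.set(key, { value, expiresAt: this.now() + this.ttlMs });
        return value;
      },
      (error: unknown) => {
        this.inFlight.delete(key); // failures are not cached, so a retry can succeed
        throw error;
      },
    );
    this.inFlight.set(key, promise);
    return promise;
  }
}
```

With this shape, a burst of concurrent misses for the same key results in one call to the loader rather than one per caller.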

3. Databases

  • Relational (SQL): Postgres/MySQL. Strong consistency, ACID transactions, joins, schemas. Default for most apps; most teams over-reach for NoSQL too early.
  • NoSQL families: document (MongoDB), key-value (Redis, DynamoDB), wide-column (Cassandra), graph (Neo4j). Chosen for scale-out, flexible schema, or specific access patterns; usually eventually consistent and join-light.
  • Concepts to know: ACID vs BASE, indexing (and how a missing index makes a query O(n)), N+1 query problem (the backend twin of the frontend N+1, 12), replication (read replicas for read scaling), sharding/partitioning (horizontal data split for write scaling), and transactions vs distributed sagas (13).
  • Frontend touchpoint: this is why some data is strongly consistent and some isn’t; why “search” might hit a different store (Elasticsearch) than “checkout.”

4. Message queues & event streaming

Kafka, RabbitMQ, SQS, NATS decouple producers from consumers for asynchronous, resilient processing (13). Enable: load leveling (absorb spikes), background jobs (emails, image processing), and event-driven architectures. Guarantees to know: at-least-once vs exactly-once delivery, ordering, idempotent consumers, dead-letter queues. Frontend touchpoint: real-time updates pushed to the browser via WebSocket/SSE (04, 18) often originate from these streams; “your order is processing” reflects async queue work.
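Under at-least-once delivery the same message can arrive more than once, so the consumer has to make reprocessing harmless. A minimal sketch of an idempotent consumer follows; the `OrderMessage` shape and class name are hypothetical, and the in-memory `Set` stands in for what would need to be a durable, shared store in practice.

```typescript
// Hypothetical message shape; real brokers attach their own metadata.
interface OrderMessage {
  id: string;          // unique per logical event, reused on redelivery
  orderId: string;
  amountCents: number;
}

type HandleResult = "processed" | "duplicate";

class IdempotentConsumer {
  // Assumption: in production this would be a durable store shared by all consumers.
  private seen = new Set<string>();
  private apply: (message: OrderMessage) => void;

  constructor(apply: (message: OrderMessage) => void) {
    this.apply = apply;
  }

  handle(message: OrderMessage): HandleResult {
    if (this.seen.has(message.id)) return "duplicate";
    // If apply throws, the id is not recorded, so a redelivery can retry the work.
    this.apply(message);
    this.seen.add(message.id);
    return "processed";
  }
}
```

The same idea shows up on the API side as idempotency keys: the client sends a stable key with a retried request so the server can recognize the repeat.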

5. API layer

  • REST, GraphQL (12), gRPC (service-to-service, binary/HTTP2), tRPC (TS end-to-end). An API gateway is the single entry point (routing, auth, rate-limiting, 13); a BFF is the per-experience variant (12).
  • Rate limiting (token bucket/leaky bucket), API versioning, idempotency keys for safe retries.
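A token bucket, one of the rate-limiting algorithms mentioned above, can be sketched like this. The class and parameter names are illustrative, and time is passed in explicitly so the behavior is easy to follow; a production limiter would usually keep its counters in a shared store such as Redis so that all instances agree.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private capacity: number;
  private refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;      // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request may proceed, false if it should be rejected
  // (an HTTP API would typically answer 429 in that case).
  tryRemove(nowMs: number = Date.now()): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

The capacity sets how large a burst is tolerated, while the refill rate sets the sustained request rate.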

6. CDN & edge compute

CDNs cache near users; edge runtimes (Cloudflare Workers, Vercel Edge, 08) run code at PoPs for SSR/personalization/auth with minimal latency. This is the infra that makes streaming SSR/RSC fast globally (07).


Deep dive: infrastructure & delivery

7. Containers & orchestration

  • Docker packages an app + its dependencies into a portable image that runs identically anywhere. It solves “works on my machine,” and is the unit of modern deployment.
  • Kubernetes (K8s) orchestrates containers at scale: scheduling, self-healing (restart failed pods), horizontal autoscaling, rolling updates, service discovery, secrets/config. Heavyweight; many frontend teams instead use PaaS (Vercel/Netlify/Render/Fly) that hide K8s.
  • Service mesh (13) handles service-to-service mTLS/retries/observability via sidecars.

8. CI/CD (your daily infra)

  • CI: on every push: install, lint/typecheck, test (16), build (14), and produce artifacts. Fast feedback; gate merges.
  • CD: automatically deploy passing builds to staging/production. Continuous delivery (one click to prod) vs continuous deployment (fully automatic).
  • Pipeline shape that works (16): static checks → unit → integration → build → deploy preview → E2E on preview → promote. Tools: GitHub Actions, GitLab CI (Rian’s context), CircleCI. Frontend specifics: preview deployments per PR, caching dependencies/build, bundle-size budgets (14/15) as a gate, and watch CI memory on coverage providers.

9. Deployment strategies (how new code reaches users safely)

  • Rolling: replace instances gradually; default in K8s.
  • Blue-green: two identical environments; switch traffic from blue (old) to green (new) instantly; instant rollback by switching back.
  • Canary: release to a small % of users, watch metrics, ramp up or roll back. Pairs with feature flags (LaunchDarkly/Unleash) for decoupling deploy from release and gradual rollout/kill-switch.
  • Frontend note: immutable, content-hashed assets (18) make frontend deploys atomic; keep old chunks available so in-flight sessions don’t 404 mid-deploy.
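One common way to implement “a small % of users” for a canary or a percentage-based flag is to hash a stable user ID into a bucket from 0 to 99. The sketch below is an assumption about how such bucketing could work, not how LaunchDarkly or Unleash do it internally; the hash is an FNV-1a-style function applied to UTF-16 code units, and both function names are made up.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units (illustrative, not cryptographic).
function hash32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// True if this user falls inside the current rollout percentage (0 to 100).
function inCanary(userId: string, rolloutPercent: number): boolean {
  const bucket = hash32(userId) % 100; // stable bucket in [0, 99]
  return bucket < rolloutPercent;
}
```

Because the bucket depends only on the user ID, a user who is in the canary at 10% stays in it as the rollout ramps to 50% and 100%, and setting the percentage to 0 acts as a kill switch.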

10. Infrastructure as Code (IaC)

Define infra in version-controlled code, not clicks: Terraform (declarative, multi-cloud), Pulumi (real languages), AWS CDK, CloudFormation. Benefits: reproducible, reviewable, auditable environments; no “snowflake” servers. GitOps extends this: the repo is the source of truth for infra state.

11. Observability (you can’t fix what you can’t see)

Three pillars: logs (events), metrics (numeric time series such as latency, error rate, throughput; the “RED”/“USE” methods), traces (a request’s path across services; distributed tracing via OpenTelemetry, essential for microservices/BFF debugging, 12/13). Add alerting on SLOs and error tracking (Sentry) + RUM (15) for the frontend. OpenTelemetry is the vendor-neutral standard.
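RED stands for rate, errors, and duration. As a rough illustration of what those metrics summarize, the sketch below computes them from a window of request samples. The `RequestSample` shape and function name are assumptions, the percentile uses the nearest-rank method, and in practice a metrics library or backend would compute these from histograms rather than raw samples.

```typescript
// Hypothetical per-request record for one time window.
interface RequestSample {
  durationMs: number;
  status: number; // HTTP status code
}

interface RedMetrics {
  ratePerSecond: number;  // Rate: requests per second
  errorRatio: number;     // Errors: share of requests with a 5xx status
  p95DurationMs: number;  // Duration: 95th percentile latency (nearest rank)
}

function redMetrics(samples: RequestSample[], windowSeconds: number): RedMetrics {
  if (samples.length === 0) {
    return { ratePerSecond: 0, errorRatio: 0, p95DurationMs: 0 };
  }
  const sorted = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const errors = samples.filter((s) => s.status >= 500).length;
  const p95Index = Math.max(0, Math.ceil(0.95 * sorted.length) - 1);
  return {
    ratePerSecond: samples.length / windowSeconds,
    errorRatio: errors / samples.length,
    p95DurationMs: sorted[p95Index],
  };
}
```

Percentiles matter more than averages here: a healthy mean can hide a slow tail that a meaningful share of users actually experiences.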

12. Frontend deployment infra specifically

  • Static/SSG → object storage (S3) + CDN (CloudFront) or a Jamstack host (Netlify).
  • SSR/RSC → Node/edge runtime (Vercel, Cloudflare, a container on K8s) (07, 08).
  • MFEs → independently deployed remotes behind a CDN, discovered via a manifest (09).
  • Concerns: cache-busting via hashed filenames, atomic deploys, environment config injection, and not breaking long-lived sessions on deploy.

Worked example: a scalable web system (system-design sketch)

                         ┌─────────── CDN / Edge (static + cache + edge SSR) ───────────┐
   Users ───DNS(anycast)─▶                                                              │
                         └──────────────────────────┬───────────────────────────────────┘
                                                    ▼
                                    Load Balancer (L7, health checks)
                                                    │  (stateless app tier → scale out)
                        ┌───────────────────────────┼───────────────────────────┐
                        ▼                           ▼                           ▼
                  App/SSR node                App/SSR node              BFF / API gateway
                        │                           │                           │
                        └─────────────── Redis (sessions, cache) ───────────────┤
                                                    │                           ▼
                          Primary DB (writes) ──replication──▶ Read replicas   Services
                                   │                                            │
                                   └────────── events ──▶ Kafka ──▶ async workers (email, search index)
   Cross-cutting: CI/CD pipeline · IaC (Terraform) · Observability (OTel: logs/metrics/traces) · feature flags

Reading it: scale out behind a load balancer (stateless apps, sessions in Redis), cache at CDN/edge/Redis tiers, separate read replicas from the write primary, push slow work to queues, and keep the whole thing reproducible (IaC) and observable (OTel). This is the shape behind most system-design answers.


Pitfalls & gotchas

  • Stateful app servers blocking horizontal scaling: externalize session/state.
  • Reaching for microservices/NoSQL/K8s prematurely: huge operational cost; start simple (25).
  • Cache invalidation bugs: stale data, or stampedes on expiry; plan TTL + coalescing.
  • No idempotency on retried operations: duplicates (13).
  • Ignoring the N+1 query on the backend feeding your UI: slow APIs no frontend trick fixes.
  • Treating eventual consistency as immediate: UIs that break on stale reads (13).
  • No observability: flying blind; add tracing/metrics/error-tracking before you need them.
  • Deploys that 404 old chunks: keep prior hashed assets during/after deploy.
  • Snowflake infra (hand-clicked): unreproducible; use IaC.

Interview questions

  1. Vertical vs horizontal scaling: trade-offs and the statelessness requirement.
  2. State the CAP theorem (and PACELC). What does choosing AP vs CP mean for a UI?
  3. Where can you cache in a web stack, and what are the invalidation/stampede concerns?
  4. SQL vs NoSQL: when each? What are replication and sharding?
  5. What problem do message queues solve? At-least-once vs exactly-once?
  6. What do Docker and Kubernetes each do?
  7. Blue-green vs canary vs rolling deploys, and where feature flags fit.
  8. What is Infrastructure as Code and why use it?
  9. Name the three pillars of observability and what distributed tracing buys you.
  10. Sketch a scalable system for a high-traffic web app.

Recommendations

  • Design app tiers to be stateless and scale out behind a load balancer; keep state in shared stores.
  • Cache at every tier with deliberate TTL/invalidation; protect against stampedes.
  • Default to relational storage; adopt NoSQL/sharding only for proven scale/access-pattern needs.
  • Use queues for async/spiky work; make consumers idempotent.
  • Containerize; reach for managed PaaS over raw K8s unless you need K8s.
  • Treat CI/CD + IaC + observability as part of the product: PR previews, bundle budgets (15), tracing (OTel), error tracking (Sentry).
  • Ship frontend with atomic, hash-busted deploys and feature flags to separate deploy from release.
  • Match complexity to need: start simple (25); add infrastructure when load/teams justify it.

Books & references

  • “Designing Data-Intensive Applications” by Martin Kleppmann (DDIA). The single best systems book: consistency, replication, partitioning, queues, streams. Essential. (Shared with 13.)
  • “System Design Interview” Vol 1 & 2 by Alex Xu. The standard interview-prep books; build the vocabulary above into reusable templates. (ByteByteGo is the companion site/newsletter.)
  • “Building Microservices” by Sam Newman; “Release It!” by Michael Nygard (stability/ops patterns) (12, 13).
  • “The DevOps Handbook” / “Accelerate” by Kim/Forsgren et al. CI/CD, delivery performance, and the metrics that matter.
  • “Site Reliability Engineering” by Google (free at sre.google). SLOs, observability, operating at scale.
  • Docker docs, Kubernetes docs, Terraform docs, OpenTelemetry docs: primary infra references.
  • AWS/GCP Well-Architected Framework: vendor-neutral-ish principles for reliability, performance, cost, security.
