This lesson explains how to structure a Site Reliability Engineering (SRE) function, the trade-offs for common team models, role definitions, sizing guidelines, and the skills that make SREs effective. Use this as a practical guide when deciding how to staff reliability work and evolve SRE practices as your organization grows.

Quick overview

  • What to choose: the right SRE model depends on organization size, product complexity, culture, and growth plans.
  • Practical trade-offs: embedded SREs give product context; centralized teams deliver consistency; consulting scales knowledge; hybrid blends the benefits.
  • People and skills: SRE teams combine technical depth (programming, systems, observability) with strong communication and incident leadership.

Common SRE team models (with trade-offs)

| Model | How it works | Key benefits | Main trade-offs |
| --- | --- | --- | --- |
| Embedded | SREs sit inside product teams | Deep service knowledge, faster reliability improvements, strong ownership | Inconsistent practices across teams; harder to scale tooling and standards |
| Centralized | One SRE team supports multiple products | Consistent tooling, shared standards, efficient resource use | May lack product context; can become a throughput bottleneck |
| Consulting | SREs act as advisors/coaches | Scales knowledge, lightweight ownership, accelerates capability adoption | Effectiveness depends on product teams adopting guidance |
| Hybrid | Mix of central tooling and embedded SREs | Flexible; balances consistency and context | Requires clear role boundaries and strong coordination |
Embedded model
  • SREs live in product teams, influence design choices, and deliver rapid reliability improvements.
  • Best when product teams must own both feature and reliability work; promotes collaboration and faster feedback loops.
[Slide: "SRE Team Models" — Embedded Model. SREs embedded within product teams gain service knowledge, influence technical decisions, and accelerate reliability efforts. Note: "Promotes collaboration but lacks consistency and scalability."]
Centralized model
  • A single SRE organization maintains cross-product platforms, shared monitoring, and common practices.
  • Works well for enforcing standards and building platform-level automation.
Consulting model
  • SREs operate as coaches or consultants, advising product teams, building blueprints, and delivering training.
  • Scales SRE ideas without owning each service directly; requires adoption by product teams to succeed.
[Slide: "SRE Team Models" — Consulting Model. SREs act as advisors rather than owners: they advise teams without managing systems directly; this scales SRE culture but relies on team buy-in.]
Hybrid model
  • Blends embedded SREs for high-impact services with a centralized team that builds shared tooling, libraries, and standards.
  • Allows organizations to combine scale and context while evolving engagement models as teams mature.
[Slide: "SRE Team Models" — Hybrid Model. A centralized SRE team plus embedded engineers feed into high-impact services. Note: "Works well in growth but demands clarity and coordination."]
There is no single “best” model. Choose based on your organization’s size, maturity, product complexity, and culture — and be prepared to evolve the model as you scale.

Real-world examples and how organizations apply the models

  • Embedded: Netflix and Amazon follow “you build it, you run it” philosophies where product teams own reliability. Spotify also embeds reliability into product squads.
  • Centralized: Google historically used dedicated SRE teams; LinkedIn and Microsoft have centralized functions to enforce consistency.
  • Consulting: Dropbox uses SREs mainly as coaches; Google also places SREs in advisory roles where appropriate.
  • Hybrid: Meta, Google (in many domains), IBM, and Uber combine central tooling with embedded engineers aligned to product needs.
These examples illustrate how companies adapt SRE models to their culture, scale, and operational goals.

Core SRE roles (typical team composition)

| Role | Primary focus | When to introduce |
| --- | --- | --- |
| SRE Generalist | Toil reduction, automation, on-call, partnering with developers | From day one for small teams |
| Reliability Architect | System design, capacity planning, fault domains | Mid-stage onward for complex systems |
| Observability Specialist | Metrics, logging, tracing, SLIs/SLOs | When you need consistent instrumentation and platform observability |
| Incident Commander | Incident coordination, communications, postmortems | As on-call rotations scale beyond a few people |
| SRE Manager | Strategy, team development, engagement models | When multiple SREs or subteams need alignment |
[Slide: "Core SRE Roles" — five roles (SRE Generalist, Reliability Architect, Observability Specialist, Incident Commander, SRE Manager), each with a one-line summary of its responsibilities.]

SRE at every stage of growth

SRE practices are adaptable — the principles remain the same while the approach changes with scale.
  • Small / early-stage startups
    • Team size: 0–5 SREs or developer-led reliability.
    • Focus: generalists who automate, monitor, and own on-call duties. Keep tooling pragmatic.
  • Mid-size organizations
    • Team size: ~5–15 SREs; specializations appear (observability, incident response, platform).
    • Focus: formalize SLOs, incident playbooks, error budgets, and internal standards.
  • Large enterprises
    • Team size: 15+ SREs organized into domain-aligned subteams (e.g., storage, networking, data pipelines).
    • Focus: invest in platform services, training programs, and defined engagement/onboarding processes for product teams.
[Slide: "SRE at Any Stage" — small organizations: developer-driven SRE, integrated practices, lean tooling, agile processes; large organizations: dedicated SRE teams, specialized roles, comprehensive tooling.]

Team-sizing and engagement guidance

  • Early-stage: hire generalists who can iterate quickly and establish basic on-call, monitoring, and automation.
  • Mid-stage: hire role owners for observability and incident response; codify SLOs and error budget policies.
  • Large scale: create clear service-level engagement models so product teams know how to request SRE help and what to expect.
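The SLO and error-budget policies mentioned above can be made concrete with a small calculation. The sketch below is illustrative only — the 99.9% target and 30-day window are example assumptions, not prescriptions:

```python
# Illustrative error-budget math for an availability SLO.
# The 99.9% target and 30-day window below are example assumptions.

def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total allowed downtime (in minutes) for the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO over 30 days allows ~43.2 minutes of downtime.
print(round(error_budget_minutes(0.999, 30), 1))   # 43.2
# After a 20-minute outage, roughly half the budget remains.
print(round(budget_remaining(0.999, 30, 20), 2))   # 0.54
```

Codifying the budget as a number like this is what makes an error-budget policy enforceable: teams can agree in advance what happens (e.g., a feature freeze) when the remaining fraction approaches zero.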

Case study: Meta’s Production Engineering evolution

Meta’s Production Engineering (PE) demonstrates an SRE evolution:
  1. Centralized beginnings — PE provided incident response and scaling help when product teams were small.
  2. Shift to embedded — as Meta grew, PE embedded engineers in product teams for design-time reliability and better feedback loops.
  3. Hybrid outcome — central PE functions (tooling and standards) remain while most reliability work is handled within product teams.
[Slide: "A Closer Look at Real-World Evolution" — a five-step timeline from a centralized reliability team to embedded engineers within product teams, with some central functions remaining.]

Skills and hiring signals

Technical skills
  • Proficiency in at least one primary language (Go, Python, Java).
  • Systems expertise: Linux internals, networking, containers.
  • CI/CD and infrastructure-as-code experience (Terraform, GitOps).
  • Observability tooling and instrumentation (metrics, tracing, logging).
  • Troubleshooting distributed systems at scale.
Non-technical skills
  • Calm and clear communication during incidents.
  • Curiosity and continuous learning.
  • Empathy for users and product teams.
  • Root-cause thinking and a focus on long-term fixes.
[Slide: "Skills for a Successful SRE" — Technical Skills (programming proficiency, systems knowledge, CI/CD and infrastructure-as-code, monitoring, large-scale debugging) and Non-Technical Skills (curiosity, calm under pressure, empathy, a passion for solving root causes).]
In-demand skills (market signals)
  • Cloud-native observability (OpenTelemetry, Prometheus, Grafana).
  • Kubernetes troubleshooting and automation.
  • SLO design and error budget management.
  • Incident leadership and cross-team communication.
  • Collaboration for shared ownership and platform engagement.
[Slide: "Skills for a Successful SRE" — four in-demand skills: cloud-native observability (OpenTelemetry, Prometheus, Grafana), Kubernetes troubleshooting and automation, SLO implementation and error budget policies, and incident response leadership and communication. Source: LinkedIn Workforce Insights.]
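To make the SLO and error-budget skills above concrete, here is a minimal burn-rate sketch. All numbers (SLO target, request counts, paging threshold) are illustrative assumptions; production alerting commonly layers multiple windows and thresholds on top of this idea:

```python
# Minimal burn-rate sketch: how fast is the error budget being spent?
# SLO target, request counts, and the paging threshold are example values.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded in the measurement window."""
    return good_requests / total_requests

def burn_rate(sli: float, slo_target: float) -> float:
    """Observed error rate divided by the budgeted error rate.
    1.0 means the budget is spent exactly over the SLO window;
    values above 1.0 mean it will be exhausted early."""
    return (1.0 - sli) / (1.0 - slo_target)

sli = availability_sli(good_requests=99_400, total_requests=100_000)  # 0.994
rate = burn_rate(sli, slo_target=0.999)  # 0.006 / 0.001 = 6.0
if rate > 2.0:  # example paging threshold
    print(f"page on-call: burning budget at {rate:.1f}x the sustainable rate")
```

The burn rate is what turns an SLO from a report-card metric into an alerting signal: paging on budget-spend velocity rather than raw error counts keeps alerts tied to user-visible impact.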

Pro tip

  • Go deep before you go broad: build deep expertise in one area (observability, automation, or incident response) first. Depth builds intuition and technical muscle you can apply across the SRE discipline.
[Slide: "SRE Beginners – A Pro Tip" — featuring the podcast "Episode 1 - IBM SRE Profession: Making of the SRE Omelette"; tip: "Go deep before going wide."]
Going deep in one area—whether diagnosing outages, building automation, or instrumenting systems—gives you a reliable foundation for learning the rest of the SRE skill set.
