Observability Engineering - Achieving Production Excellence

My personal notes on the book Observability Engineering: Achieving Production Excellence.

Chapter 1: What is Observability?

Observability = understanding a system’s internal state by examining its external outputs. If you can understand any bizarre or novel state without needing to ship new code, you have observability.

Key differences from monitoring:

  • Monitoring answers known questions (predefined metrics/dashboards)
  • Observability answers unknown questions (exploratory debugging of novel problems)

The Three Pillars (Legacy Model)

  • Metrics: aggregated numeric data
  • Logs: discrete events
  • Traces: request paths through distributed systems

Problem with pillars: Each addresses a specific question type, requiring context-switching between tools. Not sufficient for modern complexity.

Modern Observability Requirements

  • High cardinality: ability to query across many dimensions
  • High dimensionality: rich context in every event
  • Explorability: iterative investigation without predefined queries

Core principle: Instrument once, ask any question later.

Chapter 2: Debugging Pre- vs Post-Production

Pre-Production Debugging

  • Controlled environment
  • Reproducible issues
  • Step-through debuggers, breakpoints
  • Small data sets

Post-Production Debugging

  • Live user traffic - can’t pause or reproduce
  • Scale complexity - millions of requests, distributed systems
  • Unknown unknowns - novel failures you didn’t anticipate
  • Time pressure - every minute affects users/revenue

Key insight: Traditional debugging tools (debuggers, profilers) don’t work in production. Need observability to understand system state without disrupting it.

The Production Debugging Loop

  1. Notice problem (alert, user report)
  2. Form hypothesis
  3. Query telemetry data
  4. Refine hypothesis
  5. Repeat until root cause found

Speed matters: The faster you can iterate through this loop, the faster you resolve incidents.

Chapter 3: Lessons from Scaling Without Observability

Common Anti-Patterns

Metrics overload: Creating metrics for everything results in:

  • Overwhelming dashboards (hundreds of graphs)
  • High cardinality explosion (storage costs)
  • Still can’t answer unexpected questions

Alert fatigue: Too many alerts lead to:

  • Ignoring/silencing important alerts
  • Desensitization to pages
  • Missing critical issues

Dashboard proliferation:

  • One dashboard per service/team
  • No one knows which to check
  • Stale/unmaintained dashboards

What Doesn’t Scale

  • Pre-aggregated metrics - lose detail needed for debugging
  • Grep-ing logs - inefficient at scale, requires log shipping/indexing
  • Static dashboards - can’t answer new questions
  • Tribal knowledge - team members as the “observability layer”

Key Lesson

Without observability, debugging time increases exponentially with system complexity.

Chapter 4: Observability, DevOps, SRE, and Cloud Native

DevOps Connection

Observability enables key DevOps practices:

  • Fast feedback loops: quickly see impact of changes
  • Shared responsibility: everyone can investigate issues
  • Blameless postmortems: data-driven incident analysis

SRE Principles

Observability supports SRE tenets:

  • SLIs/SLOs: measure what matters to users
  • Error budgets: quantify acceptable failure
  • Toil reduction: less manual investigation

Cloud Native Requirements

Modern architectures demand observability:

  • Microservices: distributed tracing essential
  • Containers: ephemeral infrastructure
  • Auto-scaling: dynamic resource allocation
  • Polyglot systems: diverse tech stacks

Reality: You can’t SSH into production anymore. Need to understand systems from the outside.

Chapter 5: Observability in the Software Life Cycle

Where Observability Fits

Development:

  • Validate features work as intended
  • Understand performance characteristics
  • Debug integration issues

Testing:

  • Performance testing insights
  • Identify bottlenecks early
  • Validate under load

Deployment:

  • Progressive rollouts: compare canary vs production
  • Real-time health checks
  • Immediate rollback triggers

Production:

  • Incident response
  • Performance optimization
  • Capacity planning
  • Business intelligence

Key insight: Observability isn’t just for production - use it throughout the entire lifecycle for faster feedback.

Chapter 6: Observability-Driven Development

The Practice

Write instrumentation before or alongside application code, not as an afterthought.

Benefits

  • Better understanding: forces you to think about what matters
  • Faster debugging: instrumentation ready when you need it
  • Validates assumptions: see if code behaves as expected
  • Documentation: telemetry shows how system actually works

What to Instrument

  • Business logic: user actions, feature usage
  • Performance: latency, resource usage
  • Errors: failure modes, edge cases
  • Dependencies: external service calls
  • State changes: critical transitions

Anti-Pattern

Don’t just instrument technical metrics (CPU, memory). Instrument business value and user experience.
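
A single wide event can carry both kinds of data side by side. A minimal Go sketch using the standard log/slog package (the event name, fields, and values are my own illustration, not from the book):

package main

import (
    "log/slog"
    "os"
    "time"
)

func main() {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    start := time.Now()
    // ... handle a hypothetical checkout request here ...

    logger.Info("checkout.completed",
        // technical telemetry
        slog.Int("status_code", 200),
        slog.Int64("duration_ms", time.Since(start).Milliseconds()),
        // business value and user experience
        slog.String("user_id", "user_12345"),
        slog.String("payment_method", "credit_card"),
        slog.Int("item_count", 3),
        slog.Float64("total_amount", 89.99),
        slog.Bool("feature.new_checkout_flow", true),
    )
}

The point is that user_id, payment_method, and feature flags sit next to status_code and duration_ms, so questions about user experience can be asked later without shipping new code.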

Chapter 7: Understanding Cardinality

Definition

Cardinality = number of unique values in a dimension.

Examples:

  • HTTP status code: low cardinality (~5-10 values)
  • User ID: high cardinality (millions of values)
  • Request ID: very high cardinality (unbounded)

Why It Matters

High-cardinality data is essential for observability because:

  • Enables precise filtering (find specific user’s request)
  • Supports arbitrary grouping
  • Answers specific questions, not just aggregates

The Problem

Traditional metrics systems can’t handle high cardinality:

  • Metrics per unique combination explode exponentially
  • Storage/compute costs become prohibitive
  • Systems slow down or crash

The Solution

Use structured events instead of metrics:

  • Store raw, detailed events
  • Query/aggregate at read time
  • Pay for what you actually query

Key equation (total possible combinations across dimensions):

Total combinations = dimension1_values × ... × dimensionN_values

With 10 dimensions of 100 values each = 100^10 possible combinations!

Chapter 8: Structured Events as Building Blocks

  • An event is a rich, wide data structure representing a single unit of work (request, transaction, operation).
  • A structured event is a comprehensive record of a request’s lifecycle in a service, created by:
    1. Starting with an empty map when the request begins
    2. Collecting key data throughout the request’s duration (including IDs, variables, headers, parameters, timing, and external service calls)
    3. Formatting this information as searchable key-value pairs
    4. Capturing the complete map when the request ends or errors

Example structure:

{
  "timestamp": "2024-01-15T10:30:45Z",
  "duration_ms": 234,
  "user_id": "user_12345",
  "endpoint": "/api/checkout",
  "status_code": 200,
  "item_count": 3,
  "total_amount": 89.99,
  "payment_method": "credit_card",
  "region": "us-west",
  "version": "v2.3.1"
}
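
A minimal Go sketch of the “empty map at request start, emit at request end” pattern described above (the middleware, field names, and environment variable are my own illustration, not the book’s code):

package main

import (
    "encoding/json"
    "net/http"
    "os"
    "time"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

// wideEvent emits exactly one structured event per request.
func wideEvent(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
        start := time.Now()
        event := map[string]any{ // 1. start with an empty map when the request begins
            "timestamp": start.UTC().Format(time.RFC3339),
            "endpoint":  req.URL.Path,
            "version":   os.Getenv("APP_VERSION"), // hypothetical deployment metadata
        }
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(rec, req) // 2./3. handlers would add their own key-value pairs here
        event["status_code"] = rec.status
        event["duration_ms"] = time.Since(start).Milliseconds()
        json.NewEncoder(os.Stdout).Encode(event) // 4. capture the complete map when the request ends
    })
}

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/api/checkout", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })
    http.ListenAndServe(":8080", wideEvent(mux))
}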

Advantages Over Metrics/Logs

vs Metrics:

  • No pre-aggregation loss
  • Query any dimension combination
  • Retain full detail

vs Logs:

  • Structured (easily queryable)
  • Consistent format
  • Efficient storage/querying

Best Practices

  • One event per unit of work (not multiple logs)
  • Include context (user, session, trace IDs)
  • Add metadata (version, region, host)
  • Record outcomes (success, error codes, duration)

Traces

  • In an observable system, a distributed trace is simply a series of interconnected events.
  • Tracing is a fundamental software debugging technique wherein various bits of information are logged throughout a program’s execution for the purpose of diagnosing problems.
  • Distributed tracing is a method of tracking the progression of a single request - called a trace - as it is handled by various services that make up an application.
  • What we want from a trace: to clearly see relationships among various services.
    • To quickly understand where bottlenecks may be occurring, it’s useful to have waterfall-style visualizations of a trace. Each stage of a request is displayed as an individual chunk in relation to the start time and duration of the request being debugged.
    • Each chunk of this waterfall is called a trace span, or span for short. Within any given trace, spans are either the root span - the top-level span in that trace - or are nested within the root span. Spans nested within the root span may also have nested spans of their own. That relationship is sometimes referred to as parent-child.
    • To construct the view we want for any path taken, no matter how complex, we need five pieces of data for each component:
      • Trace ID: a unique identifier for the trace so that we can map it back to a particular request. This ID is created by the root span and propagated throughout each step taken to fulfill the request.
      • Span ID: a unique identifier for each individual span. Spans contain information captured while a unit of work occurred during a single trace.
      • Parent Span ID: used to properly define nesting relationships throughout the life of the trace.
      • Timestamp: each span must indicate when its work began.
      • Duration: each span must record how long that work took to finish.
    • Other fields may be helpful when identifying these spans - any additional data added to them is essentially a series of tags. Service Name and Span Name are good examples of common tags.
  • When handling a request in a service, we start the trace with a root span and forward its IDs via HTTP headers, such as X-B3-TraceId and X-B3-ParentSpanId, to other services. The called services extract these headers to create their own spans within the same trace. On the backend, those spans are stitched together to create the waterfall-type visualization we want to see (see the sketch after this list).
  • A common scenario for a nontraditional use of tracing is to do a chunk of work that is not distributed in nature, but that you want to split into its own span for a variety of reasons, such as tracking performance and resource usage.
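
A hand-rolled Go sketch of the five span fields and the header propagation described above. Real systems use a tracing library (see the OpenTelemetry chapter); the ID format and downstream URL here are made up:

package main

import (
    "fmt"
    "math/rand"
    "net/http"
    "time"
)

// Span holds the five pieces of data needed to rebuild a trace waterfall.
type Span struct {
    TraceID      string
    SpanID       string
    ParentSpanID string        // empty for the root span
    Timestamp    time.Time     // when this unit of work began
    Duration     time.Duration // how long the work took
    Name         string        // an example of an extra tag
}

func newID() string { return fmt.Sprintf("%016x", rand.Uint64()) }

// callDownstream propagates the trace context via B3-style headers so the
// called service can create its own child spans within the same trace.
func callDownstream(url string, parent Span) error {
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return err
    }
    req.Header.Set("X-B3-TraceId", parent.TraceID)
    req.Header.Set("X-B3-ParentSpanId", parent.SpanID)
    _, err = http.DefaultClient.Do(req)
    return err
}

func main() {
    // Root span: created when the request enters the first service.
    root := Span{TraceID: newID(), SpanID: newID(), Timestamp: time.Now(), Name: "GET /api/checkout"}

    // Child span for a unit of work within the same request.
    child := Span{TraceID: root.TraceID, SpanID: newID(), ParentSpanID: root.SpanID,
        Timestamp: time.Now(), Name: "charge-payment"}
    _ = callDownstream("http://payments.internal/charge", child) // hypothetical downstream service
    child.Duration = time.Since(child.Timestamp)

    root.Duration = time.Since(root.Timestamp)
    fmt.Printf("%+v\n%+v\n", root, child)
}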

Chapter 9: How Instrumentation Works

Instrumentation Approaches

1. Manual Instrumentation

  • Explicitly add telemetry in code
  • Full control over data collected
  • More effort, but most flexible

2. Auto-Instrumentation

  • Framework/library generates telemetry automatically
  • Fast to deploy
  • Less control over data

3. Hybrid

  • Auto-instrument framework operations
  • Manually instrument business logic
  • Recommended approach

What to Capture

Technical telemetry:

  • Request/response details
  • Database queries
  • External API calls
  • Errors and exceptions

Business telemetry:

  • User actions
  • Feature usage
  • Conversion events
  • Business outcomes

Context propagation:

  • Trace IDs
  • User IDs
  • Session IDs
  • Feature flags

Sampling Strategies

Head-based sampling: Decide at request start

  • Simple to implement
  • May miss interesting traces

Tail-based sampling: Decide after request completes

  • Can keep all errors
  • More complex to implement
  • Preferred for observability

Dynamic sampling: Adjust based on load/value

  • Sample less interesting requests more aggressively
  • Always keep errors, slow requests
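
A rough Go sketch of a tail-style/dynamic sampling decision under these rules (the thresholds and baseline rate are arbitrary examples):

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// Event summarizes a completed request, so the decision happens at the tail.
type Event struct {
    StatusCode int
    Duration   time.Duration
}

// shouldKeep always keeps errors and slow requests and samples the rest.
// The returned sampleRate is the weight to store with kept events so that
// aggregates can be re-scaled at query time.
func shouldKeep(e Event, baselineRate float64) (keep bool, sampleRate float64) {
    switch {
    case e.StatusCode >= 500:
        return true, 1 // keep every error
    case e.Duration > 2*time.Second:
        return true, 1 // keep every slow request
    default:
        return rand.Float64() < baselineRate, 1 / baselineRate
    }
}

func main() {
    events := []Event{
        {StatusCode: 200, Duration: 120 * time.Millisecond},
        {StatusCode: 503, Duration: 80 * time.Millisecond},
        {StatusCode: 200, Duration: 3 * time.Second},
    }
    for _, e := range events {
        keep, rate := shouldKeep(e, 0.01) // keep ~1% of ordinary traffic
        fmt.Printf("%+v keep=%v weight=%.0f\n", e, keep, rate)
    }
}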

Chapter 10: Instrumentation with OpenTelemetry

  • OpenTelemetry (OTel) is a vendor-neutral standard for collecting telemetry data.
  • OTel captures traces, metrics, logs, and other telemetry data and lets you send it to the backend of your choice. Core concepts:
    • API: the specification portion of OTel libraries.
    • SDK: the concrete implementation of OTel.
    • Tracer: a component within the SDK that is responsible for tracking which span is currently active in your process.
    • Meter: a component within the SDK that is responsible for tracking which metrics are available to report in your process.
    • Context Propagation: a part of the SDK that deserializes context about the current inbound request from headers such as W3C TraceContext or B3, and passes it to downstream services.
    • Exporter: a plugin that transforms OTel in-memory objects into the appropriate format for delivery to a specific destination.
    • Collector: a standalone process that can be run as a proxy or sidecar and that receives, processes and tees telemetry data to one or more destinations.
  • Start with automatic instrumentation to decrease friction, i.e. use the built-in instrumentation libraries (e.g. the Go packages) or framework bundles like the Symfony FriendsOfOpenTelemetry/opentelemetry-bundle.
  • Once you have automatic instrumentation, you can start attaching fields and rich values to the auto-instrumented spans inside your code.
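
For example, with the OTel Go API you can attach fields to whatever span is currently active on the context (for instance one created by HTTP auto-instrumentation). The attribute names below are my own convention, not prescribed by OTel:

package checkout

import (
    "context"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// ProcessOrder enriches the currently active span with business fields.
// It assumes the OTel SDK and auto-instrumentation are configured elsewhere.
func ProcessOrder(ctx context.Context, userID string, itemCount int, total float64) {
    span := trace.SpanFromContext(ctx)
    span.SetAttributes(
        attribute.String("app.user_id", userID),
        attribute.Int("app.cart.item_count", itemCount),
        attribute.Float64("app.cart.total_amount", total),
    )
    // ... business logic ...
}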

Chapter 11: Analyzing Events for Observability

The Debugging Workflow

1. Start broad:

  • What’s the overall pattern? (error rate, latency distribution)
  • Which dimension stands out? (region, version, user tier)

2. Iteratively narrow:

  • Filter to interesting subset
  • Group by suspected dimension
  • Compare to baseline/expected

3. Find outliers:

  • Identify anomalous values
  • Drill into specific examples
  • Examine full event detail

4. Form and test hypotheses:

  • What could cause this pattern?
  • Query to confirm/refute
  • Repeat until root cause found

Essential Query Patterns

Filtering:

  • Isolate specific requests: WHERE status_code >= 500
  • Time windows: WHERE timestamp > now() - 1h

Grouping:

  • Aggregate by dimension: GROUP BY endpoint
  • Find distribution: COUNT(*) GROUP BY status_code

Statistical analysis:

  • Percentiles: P50(duration_ms), P99(duration_ms)
  • Heatmaps: duration distribution over time
  • Counts: error rates, throughput

Comparison:

  • Before/after deployment
  • Canary vs production
  • Success vs failure
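
A rough Go sketch of what these query patterns look like when computed at read time over raw events, grouping by endpoint and deriving an error rate and p99 (the data and field names are invented):

package main

import (
    "fmt"
    "sort"
    "time"
)

// Event is one raw structured event pulled from the event store.
type Event struct {
    Endpoint   string
    StatusCode int
    Duration   time.Duration
}

// p99 returns the 99th-percentile duration of a sample.
func p99(durations []time.Duration) time.Duration {
    if len(durations) == 0 {
        return 0
    }
    sort.Slice(durations, func(i, j int) bool { return durations[i] < durations[j] })
    return durations[int(float64(len(durations)-1)*0.99)]
}

func main() {
    events := []Event{
        {"/api/checkout", 200, 120 * time.Millisecond},
        {"/api/checkout", 500, 900 * time.Millisecond},
        {"/home", 200, 30 * time.Millisecond},
        {"/home", 200, 45 * time.Millisecond},
    }

    // GROUP BY endpoint, computed at read time.
    byEndpoint := map[string][]Event{}
    for _, e := range events {
        byEndpoint[e.Endpoint] = append(byEndpoint[e.Endpoint], e)
    }

    for endpoint, group := range byEndpoint {
        var errs int
        var durations []time.Duration
        for _, e := range group {
            if e.StatusCode >= 500 { // WHERE status_code >= 500
                errs++
            }
            durations = append(durations, e.Duration)
        }
        fmt.Printf("%-14s count=%d error_rate=%.2f p99=%v\n",
            endpoint, len(group), float64(errs)/float64(len(group)), p99(durations))
    }
}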

Advanced Techniques

BubbleUp: Automatically finds dimensions that differ between two groups

  • Compare error vs success requests
  • System highlights differentiating fields
  • Quickly identifies root cause

High-Cardinality Exploration:

  • Find specific user’s bad experience
  • Identify problem with single customer
  • Debug edge cases

Trace Analysis:

  • Follow request through system
  • Identify slow components
  • Understand dependencies

Key Principles

Think in dimensions, not dashboards:

  • Don’t rely on pre-built views
  • Explore data interactively
  • Follow where the data leads

Keep context:

  • Always connect telemetry to user impact
  • Understand business implications
  • Prioritize based on value

Iterate quickly:

  • Fast query responses enable exploration
  • Don’t wait for batch processing
  • Real-time feedback loop

Chapter 12: Using Service-Level Objectives for Reliability

  • In monitoring-based approaches, alerts often measure the things that are easiest to measure. These usually don’t produce meaningful alerts for you to act upon.
  • Becoming accustomed to alerts that are prone to false positives is a known problem and a dangerous practice - it is known as normalization of deviance. In the software industry, the poor signal-to-noise ratio of monitoring-based alerting often leads to alert fatigue.
  • Threshold alerting is for known-unknowns only: in a distributed system with hundreds or thousands of components serving production traffic, failure is inevitable. Failures that get automatically remediated should not trigger alarms.
  • The Google SRE book indicates that a good alert must reflect urgent user impact, must be actionable, must be novel, and must require investigation rather than rote action.
  • Service-level objectives (SLOs) are internal goals for measurement of service health.
  • SLOs quantify an agreed-upon target for service availability, based on critical end-user journeys rather than system metrics. That target is measured using service-level indicators (SLIs), which categorize the system state as good or bad.
    • Time-based measures: “99th percentile latency less than 300 ms over each 5-minute window”
    • Event-based measures: “proportion of events that took less than 300 ms during a given rolling time window”
  • Example of an event-based SLI: a user should be able to successfully load your home page and see a result quickly (a sketch of the qualification logic follows this list). The SLI should do the following:
    • Look for any event with a request path of /home.
    • Screen qualifying events for conditions in which the event duration < 100 ms.
    • If the event duration < 100 ms and was served successfully, consider it OK.
    • If the event duration > 100 ms, consider that event an error even if it returned a success code.
  • SLOs narrow the scope of your alerts to consider only symptoms that impact what the users of your service experience. However, they say nothing about why or how the service might be degraded; we simply know that something is wrong.
  • Decoupling “what” from “why” is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
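
A minimal Go sketch of the home-page SLI qualification described in the example above (the thresholds and events are illustrative):

package main

import (
    "fmt"
    "time"
)

// Event is a single request as recorded by our instrumentation.
type Event struct {
    Path       string
    StatusCode int
    Duration   time.Duration
}

// qualifies reports whether the event counts toward the home-page SLI and,
// if so, whether it is good or bad for the purposes of the SLO.
func qualifies(e Event) (eligible bool, good bool) {
    if e.Path != "/home" {
        return false, false // not part of this SLI
    }
    // Good only if it was served successfully AND fast enough; a slow
    // success still counts as an error for the SLI.
    return true, e.StatusCode < 400 && e.Duration < 100*time.Millisecond
}

func main() {
    events := []Event{
        {Path: "/home", StatusCode: 200, Duration: 42 * time.Millisecond},
        {Path: "/home", StatusCode: 200, Duration: 350 * time.Millisecond}, // success code, but too slow
        {Path: "/home", StatusCode: 503, Duration: 12 * time.Millisecond},
        {Path: "/api/checkout", StatusCode: 200, Duration: 80 * time.Millisecond}, // not in scope
    }
    var good, total int
    for _, e := range events {
        if eligible, ok := qualifies(e); eligible {
            total++
            if ok {
                good++
            }
        }
    }
    fmt.Printf("SLI: %d of %d qualifying events were good\n", good, total)
}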

Chapter 13: Acting on and Debugging SLO-Based Alerts

  • An error budget represents the maximum amount of system unavailability that your business is willing to tolerate. If your SLO is to ensure that 99.9% of requests are successful, a time-based calculation would state that your system could be unavailable for no more than 8 hours, 45 minutes, 57 seconds in one standard year.
    • An event-based calculation considers each individual event against qualification criteria and keeps a running tally of “good” events versus “bad”.
    • Subtract the number of failed (burned) requests from your total calculated error budget; what is left is colloquially known as the remaining error budget (a sketch of this calculation follows this list).
    • Error budget burn alerts are designed to provide early warning about future SLO violations that would occur if the current burn rate continues.
  • The first choice when analyzing an SLO is whether to use a fixed window (e.g. from the 1st to the 30th day of the month) or a sliding window (e.g. the last 30 days).
  • With a timeframe selected, you can now set up a trigger to alert you about error budget conditions you care about. The easiest alert to set is a zero-level alert - one that triggers when your entire error budget is exhausted. Once your budget is spent, you need to stop working on new features and work toward service stability.
  • An SLO burn alert is triggered when an error budget is being depleted faster than is sustainable for the given SLO. In general, when a burn alert fires, teams should initiate an investigative response.
  • Rather than making aggregate decisions in the good minute / bad minute scenario, using event data to calculate SLOs gives you request-level granularity to evaluate system health.
    • Example: instead of checking if the system CPU or RAM is above a threshold, we can check the request duration for every individual request, and discover a pattern that affects our SLO (e.g. network issue when communicating with the database, which would not be covered by the previous “known-unknown” alerts).
  • Observability data that traces actual user experience with your services is a more accurate representation of system state than coarsely aggregated time-series data.
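
A rough Go sketch of the event-based budget math plus a naive burn projection (all numbers are made up):

package main

import "fmt"

// budgetStatus computes an event-based error budget: every qualifying
// request is either good or bad, and the budget is the share of bad
// events the SLO target still allows.
func budgetStatus(totalEvents, badEvents int, sloTarget float64) (allowed, burned, remaining int) {
    allowed = int(float64(totalEvents) * (1 - sloTarget)) // e.g. 0.1% of requests for a 99.9% SLO
    burned = badEvents
    remaining = allowed - burned
    return
}

func main() {
    // Hypothetical tallies observed so far in a 30-day sliding window, 10 days in.
    totalSoFar, badSoFar := 10_000_000, 12_500
    daysElapsed, windowDays := 10.0, 30.0

    // Project traffic and failures to the end of the window, assuming current
    // rates continue - the essence of a burn alert's early warning.
    projectedTotal := int(float64(totalSoFar) / daysElapsed * windowDays)
    projectedBad := int(float64(badSoFar) / daysElapsed * windowDays)

    allowed, burned, remaining := budgetStatus(projectedTotal, badSoFar, 0.999)
    fmt.Printf("allowed=%d burned=%d remaining=%d\n", allowed, burned, remaining)

    if projectedBad > allowed {
        fmt.Println("burn alert: at the current burn rate the SLO will be violated before the window closes")
    }
}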

Key Takeaways

  1. Observability ≠ Monitoring: Observability answers unknown questions; monitoring answers known ones
  2. High cardinality is essential: Must support arbitrary dimensions for modern debugging
  3. Structured events > Metrics + Logs: Rich, wide events provide necessary context
  4. Instrument early: Add observability during development, not as afterthought
  5. OpenTelemetry: Use standards for portability and future-proofing
  6. Explore iteratively: Debug by following the data, not predefined dashboards
  7. Speed matters: Faster debugging loops = faster incident resolution = better user experience