Week 28 of the year 2025
- 5 minutes read - 1050 words
This Week’s Challenges:
Achieving Operational Excellence through Observability
Why Do I Read About Observability Engineering as an Engineering Manager?
As an engineering manager, I consider it paramount to stay ahead of the curve in managing complex software systems. While many in the industry look to foundational texts like “Observability Engineering: Achieving Production Excellence” by Charity Majors, Liz Fong-Jones, and George Miranda, my own recent deep dive into the principles of observability has profoundly reshaped my approach to production excellence. These insights, my key takeaways from that reading, underscore why observability is no longer just a technical luxury, but a strategic necessity.
The Shifting Landscape: From Monitoring to Proactive Understanding
Traditionally, monitoring has been our go-to for understanding system health, offering an approximation of the overall state. It’s typically reactive, relying on dashboards, alerts, and static thresholds. However, as systems evolve from monoliths to microservices, and from bare metal to virtual machines, cloud, and serverless functions (like AWS Lambda), complexity increases and more control moves out of our hands. As a result, far fewer failures in cloud-based distributed systems are “somewhat predictable” compared to monoliths.
This is where observability steps in. Observability is proactive, allowing us to understand the system state in granular detail. It’s a measure of how well the internal state of a system can be inferred from its external outputs. Unlike monitoring’s intuition-based approach, observability is a data-driven investigative process that handles a large volume of very granular data. For an engineering manager, this shift means moving from simply knowing that something is broken to understanding why it’s broken and how to prevent similar issues in the future. The fundamental problem with typical monitoring is the need to decide in advance which metrics and dimensions to track. Observability, on the other hand, enables explorability: the ability to iteratively investigate and eventually understand the system’s state without shipping new code.
The Core Pillars: Embracing High-Dimensional, High-Cardinality Data
Observability is built upon three core pillars: logs, metrics, and traces.
- Metrics are scalar values representing system state. They are aggregates by nature, which is a fundamental limitation; they are relatively “cheap” and typically used with low-cardinality dimensions. Time-series databases (TSDBs) are built upon this concept.
- Logs, whether unstructured (human-readable) or structured (machine-parsable), provide valuable context.
- Traces (or distributed tracing) have become an indispensable troubleshooting tool. A trace tracks the progression of a single request as it’s handled by the various services in an application. Key components of a trace include a trace ID, span ID, parent ID, timestamp, and duration. This lets us attach event context to outbound requests, often via HTTP headers such as B3 propagation (“X-B3-…” headers). Projects like OpenTelemetry (OTel), under the CNCF, are standardizing this process; a minimal sketch follows this list.
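To make the span hierarchy concrete, here is a minimal sketch using the OpenTelemetry Python SDK. The service and operation names, the attributes, and the console exporter are my own illustrative choices, not something prescribed by the book.

```python
# Minimal OpenTelemetry sketch: two nested spans for one request.
# Requires the opentelemetry-api and opentelemetry-sdk packages.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# Each span automatically carries a trace ID, span ID, parent ID,
# start timestamp, and duration; attributes add event context.
with tracer.start_as_current_span("handle_checkout") as parent:
    parent.set_attribute("user.id", "user-48213")  # high-cardinality dimension
    with tracer.start_as_current_span("charge_payment") as child:
        child.set_attribute("payment.provider", "stripe")
```

In a real service, the SDK’s context propagation would carry the trace and parent IDs across the outbound HTTP calls instead of the spans living in one process.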
What truly differentiates observability is its focus on high-dimensionality and high-cardinality data. Cardinality refers to the uniqueness of values in a dataset: a field with only a handful of distinct values (such as an HTTP status code) is low cardinality, while a field like a user ID is high cardinality. Dimensionality refers to the number of keys in the data. Observability leverages arbitrarily wide structured events that capture “all the context of an operation in the system”, containing hundreds of key-value pairs (dimensions). This richness of data is what allows for the detailed investigation and correlation needed to understand complex distributed systems.
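As an illustration of what one such wide event might look like, here is a hypothetical example. Every field name below is invented for the sketch, and real events typically carry far more dimensions than this.

```python
# A hypothetical wide structured event, emitted once per request when the
# operation finishes. Field names are illustrative; real events often hold
# hundreds of key-value pairs mixing low- and high-cardinality dimensions.
import json
import time
import uuid

event = {
    "timestamp": time.time(),
    "service": "checkout-service",     # low cardinality
    "endpoint": "/api/v1/checkout",    # low cardinality
    "status_code": 200,
    "duration_ms": 183.4,
    "trace_id": uuid.uuid4().hex,      # high cardinality
    "user_id": "user-48213",           # high cardinality
    "cart_items": 7,
    "payment_provider": "stripe",
    "build_sha": "f3a91c2",
    "region": "eu-central-1",
    # ...plus any other dimension that might matter during an investigation
}
print(json.dumps(event))
```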
As an EM, understanding these data types is crucial because it informs how my teams instrument their code and what kind of data we rely on for troubleshooting and understanding user experience. Metrics are great for monitoring infrastructure, but events are key for monitoring our own software.
Driving Action: SLOs, Error Budgets, and Actionable Alerts
A significant takeaway for any engineering manager is the shift towards Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets. SLIs categorize system state as good or bad.
- Event-based SLIs (e.g., “the proportion of events that took less than 300ms during a given rolling time window”) are preferred over time-based measures, as they produce far fewer false positives and false negatives. This is powerful because it prioritizes impact on the end-user experience.
- The error budget, allowed by the SLO, becomes a critical management tool. It enables the decoupling of deployment and release cycles, where released functionality is controlled by feature flags.
- Crucially, if the budget goes negative, feature work stops until stability improves. This creates a direct incentive for engineers to focus on reliability, because every failure, however small, is charged against the error budget and will eventually be reviewed. It also lets us catch situations where service quality slowly degrades over time, even if traditional metrics look fine. A 30-day sliding window is often most useful for SLO budget-burn calculations, and forecasting tools can even predict budget burn; a minimal sketch of the calculation follows this list.
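Here is a minimal sketch of how an event-based SLI and the remaining error budget could be computed over a window of events. The 99.9% target, the 300 ms threshold, and the synthetic events are assumptions for illustration, not values from the book.

```python
# Sketch: an event-based SLI ("share of requests under 300 ms and error-free")
# and the remaining error budget over a rolling window of events.
from dataclasses import dataclass

SLO_TARGET = 0.999          # assumed target: 99.9% of events must be "good"
LATENCY_THRESHOLD_MS = 300  # assumed threshold for a "good" event


@dataclass
class Event:
    duration_ms: float
    error: bool = False


def sli(events: list[Event]) -> float:
    """Fraction of events that are good (fast and error-free)."""
    good = sum(1 for e in events if not e.error and e.duration_ms < LATENCY_THRESHOLD_MS)
    return good / len(events)


def error_budget_remaining(events: list[Event]) -> float:
    """Share of the error budget left over this window (negative = overspent)."""
    allowed_bad = (1 - SLO_TARGET) * len(events)
    actual_bad = sum(1 for e in events if e.error or e.duration_ms >= LATENCY_THRESHOLD_MS)
    return 1 - actual_bad / allowed_bad


# Synthetic 30-day window: 100,000 events, 120 of them bad.
window = [Event(120)] * 99_880 + [Event(450)] * 80 + [Event(95, error=True)] * 40
print(f"SLI: {sli(window):.4%}, budget remaining: {error_budget_remaining(window):+.1%}")
```

With these numbers the SLI is 99.88% against a 99.9% target, so the budget remaining prints as -20%: the window is overspent and, under the policy above, feature work would pause.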
This framework directly addresses alert fatigue by focusing on creating only helpful alerts. Alerts must be reliable indicators that the user experience is degraded and, most importantly, actionable. As an EM, this shifts my focus from “are we green?” to “how are we performing against our user commitments?”
The Future: Automation and AI-Powered Operations
The sheer volume and granularity of observability data open doors to advanced automation. The “core analysis loop”, in which we isolate which dimensions correlate with a problem, can be run over arbitrarily wide structured events. Because debugging from first principles doesn’t require prior system familiarity, this loop can be automated methodically, without needing to pre-seed “intelligence” about the application. This ties directly into the concept of AIOps, leveraging AI to automate operations. It is an exciting prospect for engineering managers looking to scale operational efficiency without linearly scaling headcount.
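As a toy illustration of that loop, the sketch below compares how often each dimension value appears in a failing population of events versus a baseline population and surfaces the biggest outliers. It is a naive approximation of the idea under my own assumptions (the hypothetical event fields from earlier), not the algorithm from the book or any particular vendor.

```python
# Naive "core analysis loop" sketch: rank (dimension, value) pairs by how much
# more often they appear in bad events than in baseline events.
from collections import Counter


def suspicious_dimensions(baseline: list[dict], bad: list[dict], top_n: int = 3):
    """Return the (key, value) pairs most over-represented in the bad events."""
    scores = {}
    for key in bad[0]:
        base_freq = Counter(e.get(key) for e in baseline)
        bad_freq = Counter(e.get(key) for e in bad)
        for value, count in bad_freq.items():
            share_bad = count / len(bad)
            share_base = base_freq.get(value, 0) / len(baseline)
            scores[(key, value)] = share_bad - share_base
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Synthetic populations: the failing requests cluster on a new build in one region.
baseline = (
    [{"region": "eu-central-1", "build_sha": "f3a91c2"}] * 95
    + [{"region": "us-east-1", "build_sha": "f3a91c2"}] * 5
)
bad = (
    [{"region": "us-east-1", "build_sha": "9d04b7e"}] * 9
    + [{"region": "eu-central-1", "build_sha": "f3a91c2"}] * 1
)
print(suspicious_dimensions(baseline, bad))  # the new build and us-east-1 stand out
```

Running the same comparison automatically across hundreds of dimensions is exactly the kind of methodical, first-principles work that lends itself to automation.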
Why This Matters for an Engineering Manager
For me, reading about observability engineering isn’t just about learning new tools; it’s about fundamentally changing how my teams build, operate, and troubleshoot software. It’s about:
- Moving from reactive firefighting to proactive problem-solving.
- Empowering teams with the deep context needed to understand complex distributed systems.
- Driving data-driven decisions about quality and feature delivery through SLOs and error budgets.
- Fostering a culture of accountability where user experience is the ultimate metric.
Ultimately, this journey into observability engineering is about equipping myself and my teams with the knowledge and tools to truly achieve production excellence, ensuring our systems are not just running, but thriving and delivering value to our users.