
Observability vs Monitoring: Advanced Patterns for Cloud-Native Systems

Posted on August 10, 2025  •  4 minutes  •  809 words

Introduction

“Monitoring tells you when something is broken. Observability tells you why.”

This distinction has become critical in cloud-native systems where traditional monitoring breaks down due to complexity, dynamism, and distribution.

Based on experience instrumenting IBM’s cloud infrastructure, this article explores observability patterns that scale.

Monitoring vs Observability

Key Differences

Aspect             | Monitoring                 | Observability
-------------------|----------------------------|------------------------
Approach           | Rules-based alerting       | Exploratory querying
Known unknowns     | Detects predefined issues  | Discovers novel issues
Scalability        | Breaks with complexity     | Scales with complexity
Time to resolution | Tens of minutes            | Minutes
Cost               | Linear with metrics        | Proportional to value

Example: Latency Issue

MONITORING (Traditional):
- Alert: "P99 latency > 500ms"
- Question: Is it DNS? Database? Network?
- Investigation: Check 10 different systems
- Time to resolution: 45 minutes

OBSERVABILITY (Modern):
- Anomaly detected: P99 latency spike
- Distributed trace shows: Database query slow
- Root cause: Missing index on users table
- Time to resolution: 5 minutes

The Three Pillars of Observability

1. Structured Logs

# Unstructured logging (problematic)
logger.info(f"User {user_id} logged in from {ip} at {timestamp}")

# Structured logging (observability-ready)
logger.info("user_login", extra={
    "user_id": user_id,
    "user_email": user_email,
    "ip_address": ip,
    "location": geolocation,
    "timestamp": timestamp,
    "session_id": session_id,
    "auth_method": "oauth"
})

# Example: JSON output
{
  "timestamp": "2025-08-10T14:23:45Z",
  "level": "INFO",
  "message": "user_login",
  "user_id": "usr_12345",
  "user_email": "user@example.com",
  "ip_address": "203.0.113.42",
  "location": "San Francisco, CA",
  "session_id": "sess_abcdef",
  "auth_method": "oauth",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}
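
How do you get from the extra= fields above to JSON output like this? The logger needs a JSON formatter. Here is a minimal sketch using the python-json-logger package (one assumed choice; any formatter that merges extra fields into JSON works the same way):

# Minimal sketch: configure a logger to emit structured JSON
# (python-json-logger is an assumed library choice for illustration)
import logging
from pythonjsonlogger import jsonlogger

logger = logging.getLogger("app")
handler = logging.StreamHandler()
# Standard fields go in the format string; everything passed via extra= is merged in too.
handler.setFormatter(jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user_login", extra={"user_id": "usr_12345", "auth_method": "oauth"})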

2. Metrics (Telemetry)

# OpenTelemetry metrics instrumentation
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

# Set up metrics
prometheus_reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[prometheus_reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter(__name__)

# Create instruments
request_counter = meter.create_counter(
    name="http_requests_total",
    description="Total HTTP requests",
    unit="1"
)

request_duration = meter.create_histogram(
    name="http_request_duration_seconds",
    description="HTTP request duration",
    unit="s"
)

active_connections = meter.create_up_down_counter(
    name="active_connections",
    description="Active WebSocket connections"
)

# Record metrics
def handle_request(request):
    start_time = time.time()
    
    try:
        response = process_request(request)
        duration = time.time() - start_time
        
        request_counter.add(1, {
            "method": request.method,
            "endpoint": request.path,
            "status": response.status_code
        })
        
        request_duration.record(duration, {
            "method": request.method,
            "endpoint": request.path,
            "status": response.status_code
        })
        
        return response
    except Exception as e:
        request_counter.add(1, {
            "method": request.method,
            "endpoint": request.path,
            "status": "error"
        })
        raise
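
One gap in the snippet above: PrometheusMetricReader only collects the metrics. For Prometheus to actually scrape them, you also need to expose an HTTP endpoint, for example with the prometheus_client helper below (port 8000 is an arbitrary choice for this sketch):

# Expose the collected metrics for Prometheus to scrape.
# Sketch only: the port is an assumption; pick whatever your scrape config targets.
from prometheus_client import start_http_server

start_http_server(8000)  # metrics now served at http://localhost:8000/metrics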

3. Distributed Traces

# Distributed tracing with OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger-agent",
    agent_port=6831,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

# Instrument business logic
def process_user_request(user_id: str):
    with tracer.start_as_current_span("process_user_request") as span:
        span.set_attribute("user_id", user_id)
        
        # Fetch user from database
        with tracer.start_as_current_span("fetch_user") as db_span:
            db_span.set_attribute("user_id", user_id)
            user = database.get_user(user_id)
            db_span.set_attribute("status", "success")
        
        # Validate user
        with tracer.start_as_current_span("validate_user") as val_span:
            is_valid = validate_user(user)
            val_span.set_attribute("valid", is_valid)
        
        # Fetch user profile
        with tracer.start_as_current_span("fetch_profile") as profile_span:
            profile_span.set_attribute("user_id", user_id)
            profile = api.get_user_profile(user_id)
        
        # Send notification
        with tracer.start_as_current_span("send_notification") as notify_span:
            notify_span.set_attribute("notification_type", "welcome")
            send_notification(user, "welcome")
        
        return {
            "user": user,
            "profile": profile,
            "status": "processed"
        }

# Result: Single trace shows entire request flow with timing
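
Traces only stay stitched together across services if the trace context travels with outgoing calls. Auto-instrumentation normally handles this, but for a hand-rolled HTTP call you can propagate it explicitly. A minimal sketch (the profile-service URL is a placeholder):

# Propagate the current trace context on an outgoing HTTP call
# so the downstream service's spans join the same trace.
import requests
from opentelemetry.propagate import inject

def call_profile_service(user_id: str):
    headers = {}
    inject(headers)  # adds W3C traceparent/tracestate headers for the current span
    # "profile-service" is a placeholder hostname for illustration.
    return requests.get(
        f"http://profile-service/api/profiles/{user_id}",
        headers=headers,
        timeout=2,
    )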

Real-World Trace Analysis

POST /api/users
│
├─ parse_request (2ms)
│
├─ authenticate_user (45ms)
│  ├─ fetch_jwt (1ms)
│  ├─ validate_signature (8ms)
│  └─ check_permissions (36ms) ⚠️ SLOW
│
├─ fetch_user_data (120ms) 🔴 VERY SLOW
│  ├─ database_query (95ms)
│  │  └─ SELECT * FROM users WHERE id = ? (90ms) ⚠️ SLOW QUERY
│  └─ serialize_response (25ms)
│
├─ send_audit_log (50ms)
│  └─ kafka_publish (48ms)
│
└─ total_request (220ms)

Root cause identified: missing database indexes on the permissions and users tables

Implementing OpenTelemetry

Complete Example: E-commerce API

from flask import Flask, request, jsonify
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Setup
jaeger_exporter = JaegerExporter(agent_host_name="localhost")
prometheus_reader = PrometheusMetricReader()

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
metrics.set_meter_provider(
    MeterProvider(metric_readers=[prometheus_reader])
)

app = Flask(__name__)

# Auto-instrumentation
FlaskInstrumentor().instrument_app(app)
SQLAlchemyInstrumentor().instrument()
RequestsInstrumentor().instrument()

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Business metrics
order_counter = meter.create_counter("orders_created")
order_value = meter.create_histogram("order_value_usd")

@app.route('/api/orders', methods=['POST'])
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        order_data = request.json
        span.set_attribute("order_id", order_data['id'])
        span.set_attribute("user_id", order_data['user_id'])
        span.set_attribute("total", order_data['total'])
        
        # Process order
        order = process_order(order_data)
        
        # Record metrics
        order_counter.add(1)
        order_value.record(order_data['total'])
        
        return jsonify(order)

Alerting on Observability Data

# Prometheus alerting rules
groups:
- name: observability-alerts
  rules:
  - alert: HighLatencyDetected
    expr: histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m])) > 0.5
    for: 2m
    annotations:
      summary: "High latency detected (p99 > 500ms)"
      dashboard: "http://grafana/dashboard/requests"
  
  - alert: ErrorRateHigh
    expr: rate(errors_total[5m]) > 0.05
    for: 1m
    annotations:
      summary: "Error rate > 5%"
      runbook: "https://wiki/runbooks/high-error-rate"
  
  - alert: SlowDatabaseQueries
    expr: histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m])) > 1.0
    for: 5m
    annotations:
      summary: "Database queries slow (p95 > 1s)"
      action: "Check for missing indexes"

Best Practices

  1. Instrument Everything: Application, infrastructure, network, dependencies
  2. Correlation IDs: Trace requests across services
  3. Sampling: Sample high-volume traces to control costs (see the sketch after this list)
  4. Cardinality Awareness: Avoid unbounded label values
  5. Document Dashboards: Include runbooks for common alerts
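
For the sampling practice (#3), head-based sampling can be configured directly in the SDK. A minimal sketch that keeps roughly 1% of traces (the ratio is an illustrative value) while child spans follow their parent's decision:

# Head-based sampling sketch: keep ~1% of root traces and honor the
# parent's sampling decision for child spans across services.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))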

Cost Optimization

Strategy: Tail-based sampling

Before: 1M traces/day = $500/month
After: Sample 1% normally + 100% errors + latency outliers = $120/month
Result: 75% cost reduction with better visibility into failures
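
One way to implement this split is the OpenTelemetry Collector's tail_sampling processor. The sketch below mirrors the numbers above (1% baseline, keep all errors and latency outliers); treat the exact thresholds as assumptions, not a tuned production config:

# OpenTelemetry Collector (contrib) tail_sampling sketch:
# keep all error traces, all slow traces, and ~1% of everything else.
processors:
  tail_sampling:
    decision_wait: 10s            # wait for the full trace before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-latency-outliers
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 1}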

Conclusion

Observability is the foundation of reliable systems at scale. By collecting structured logs, metrics, and traces, you transform operational uncertainty into actionable insights.


What’s your observability stack? Share your setup in the comments!
