Observability

Gaining insight into your system health

Must read or watch

Adrian Cole
Peter Bourgon
Brendan D. Gregg
Yuri Shkuro
Ben Sigelman
Cindy Sridharan
The Dapper paper from 2010

Comment Joindre l’équipe Observabilité et Outils de surveillance TI de la Banque Nationale

Gestion de la demande
Offre de service – Confluence
Annonces – Yammer (IT Monitoring)

Ref: CNCF Trailmap

Observability

The behavior of a system can be determined by only looking at its inputs and outputs.
In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
Ref: Wikipedia
Observability is about producing/transmitting/recording data.

Monitoring and Observability

Observability is an attribute meaning the system is emitting a signal.

Monitoring is an action taken from a human or a machine based on an event.

Telemetry and Observability

It's about recording & transmitting the readings of an instrument.
  • Logs:
    Easy to grep, to read and produces a high volume of data
  • Metrics:
    Identify trends over time, are visualize through graphs and produce low volumes.
  • Distributed traces:
    Identify tree of calls across services
Ref: Adrian Cole

Ref: Peter Bourgon

Logs

  • Logs are the stream of aggregated, time-ordered events collected from the output streams of all running processes.
  • Logs in their raw form are typically a text format with one event per line (exceptions may span multiple lines).
  • Logs have no fixed beginning or end, but flow continuously as long as the app is operating.
https://12factor.net/logs

Logs

  • Come in three different flavors: Plaintext, Structured (JSON format) or binary
  • Trivial to generate
  • Provide rich context information
  • Expensive to process, move, store and query
  • Noisy
  • Relates to a single system

An app should not attempt to write to or manage logfiles.

Instead, each running process writes its event stream, unbuffered, to stdout.

https://12factor.net/logs

Ref: fluentd

Instrumentation

					
							import org.slf4j.Logger;
							import org.slf4j.LoggerFactory;
							import org.slf4j.LoggerFactory;
							public class Wombat {
							  final Logger logger = 
							  LoggerFactory.getLogger(Wombat.class);
							  Integer t;
							  Integer oldT;
							
							  public void setTemperature(Integer temperature) {								
							    oldT = t;        
								t = temperature;							
								logger.debug("Temperature set to {}", t, oldT);							
								if(temperature.intValue() > 50) {
								  logger.info("Temperature has risen above 50 degrees.");
								 }
							  }
							}
						
					

Metrics

Metrics are a numeric representation of data measured over intervals of time. Metrics can harness the power of mathematical modeling and prediction to derive knowledge of the behavior of a system over intervals of time in the present and future.
Ref: Cindy Sridharan

Metrics

  • Have a linear cost. Spike in traffic doesn't generate more metrics.
  • Near realtime availability
  • Useful to identify patterns and generate alerts
  • Great to generate reports and diagram
  • Cost increase with the number of label values
  • Almost no context
  • Relates to a single system (like logs)

Time Series and TSDB

A time series is simply a series of data points ordered in time.

In a time series, time is often the independent variable and the goal is usually to make a forecast for the future

Marco Peixeiro

Strava

Type of metrics

  • Counters: Number of HTTP 500
  • Gauge: Memory usage
  • Histogram: Statistical distribution: 95th percentile of request execution
  • Rate: Error/minutes
Ref: Cindy Sridharan

Distributed Traces

Distributed tracing is a method used to profile and monitor applications,
especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.
https://opentracing.io

Distributed Traces

  • Help visualize and understand complex architectures
  • Enables developers to see how an individual request is handled
  • Attach a unique trace ID to requests
  • Provides insight into the full lifecycles of requests, allowing you to pinpoint failures and performance issues.

OpenTracing Overview

  • Trace:
    Collection of Spans reprensented as a DAG
  • Span:
    Unit of work defined by: operation name, timestamps and more
  • SpanContext:
    All the info identifying a Span, and that must be propagated, such as: TraceId, SpanId
  • Tracer:
    The actual implementation that creates, inject and extract SpanContext

https://en.wikipedia.org/wiki/Directed_acyclic_graph


Ref: Jaeger

Instrumentation Mechanism

  • Framework:
    Embedded in the application code. Almost every plateform provide a library.
  • Sidecar:
    Require modifications to the application code. Minimally the application should forward SpanContext headers
  • Agent:
    modification required. Example of agents à la sauce bytebuddy:

Flame Graph

Demo

Ref: https://github.com/cgos/observability-app-demo

What happen after you reached
100% Observability

Analysis

Analysis

  • USE (Brendan Gregg):
    For every resource, check Utilization, Saturation, and Errors.
  • 4 Golden Signals (Google):
    Latency, Traffic, Errors, and Saturation
  • RED (Tom Wilkie):
    Rate, Errors, and Duration

Brendan Gregg

Fundamental Principles Beyond Observability

  1. Carefully choose what you measure
  2. Explain what you see
  3. Turn data into action