A system is observable when its behavior can be determined by looking only at its inputs and outputs.
In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
Observability is about producing, transmitting, and recording data.

Monitoring and Observability

Observability is an attribute meaning the system is emitting a signal.

Monitoring is an action taken by a human or a machine based on an event.

Telemetry and Observability

Telemetry is about recording and transmitting the readings of an instrument.
  • Logs:
    Easy to grep and read; produce a high volume of data
  • Metrics:
    Identify trends over time, are visualized through graphs, and produce low volumes of data
  • Distributed traces:
    Identify the tree of calls across services
  • Logs are the stream of aggregated, time-ordered events collected from the output streams of all running processes.
  • Logs in their raw form are typically a text format with one event per line (exceptions may span multiple lines).
  • Logs have no fixed beginning or end, but flow continuously as long as the app is operating.


  • Come in three different flavors: Plaintext, Structured (JSON format) or binary
  • Trivial to generate
  • Provide rich context information
  • Expensive to process, move, store and query
  • Noisy
  • Relate to a single system
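
To make the plaintext and structured flavors concrete, the sketch below renders the same event both ways using only the JDK. The class name `LogFormats`, the field names, the sample timestamp, and the hand-rolled escaping are illustrative assumptions; a real application would normally leave this to its logging library's encoder.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: renders one event as plaintext or as a JSON line.
class LogFormats {

  // Plaintext flavor: easy to read and grep, harder to parse reliably.
  static String plaintext(String timestamp, String level, String message) {
    return timestamp + " " + level + " " + message;
  }

  // Structured flavor: one JSON object per line, string values only.
  static String json(Map<String, String> fields) {
    StringBuilder sb = new StringBuilder("{");
    boolean first = true;
    for (Map.Entry<String, String> e : fields.entrySet()) {
      if (!first) {
        sb.append(',');
      }
      sb.append('"').append(escape(e.getKey())).append("\":\"")
        .append(escape(e.getValue())).append('"');
      first = false;
    }
    return sb.append('}').toString();
  }

  // Minimal escaping (backslash, quote, newline); not a full JSON encoder.
  static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
  }

  public static void main(String[] args) {
    Map<String, String> event = new LinkedHashMap<>();
    event.put("ts", "2024-01-01T00:00:00Z"); // sample value
    event.put("level", "INFO");
    event.put("msg", "Temperature has risen above 50 degrees.");
    System.out.println(plaintext(event.get("ts"), event.get("level"), event.get("msg")));
    System.out.println(json(event));
  }
}
```

The structured line costs a few more bytes per event, but a downstream collector can filter on `level` or `msg` without a regular expression.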

An app should not attempt to write to or manage logfiles.

Instead, each running process writes its event stream, unbuffered, to stdout.


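    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class Wombat {
      // One logger per class, named after the class.
      final Logger logger = LoggerFactory.getLogger(Wombat.class);
      Integer t;
      Integer oldT;

      public void setTemperature(Integer temperature) {
        oldT = t;
        t = temperature;
        // One {} placeholder per argument; formatting is skipped when DEBUG is off.
        logger.debug("Temperature set to {}. Old temperature was {}.", t, oldT);
        if (temperature.intValue() > 50) {
          logger.info("Temperature has risen above 50 degrees.");
        }
      }
    }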


Metrics are a numeric representation of data measured over intervals of time. Metrics can harness the power of mathematical modeling and prediction to derive knowledge of the behavior of a system over intervals of time in the present and future.
  • Have a near-constant cost: a spike in traffic doesn't generate more metrics.
  • Near real-time availability
  • Useful to identify patterns and generate alerts
  • Great for generating reports and diagrams
  • Cost increases with the number of label values
  • Almost no context
  • Relates to a single system (like logs)
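
The point about label values can be made concrete. In label-based metric systems, each distinct combination of label values is typically stored as its own time series, so the worst-case series count for one metric is the product of the per-label value counts. The sketch below is only that arithmetic; the class name, label names, and counts are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: upper bound on the time series one metric can create.
class Cardinality {

  // Worst case: every combination of label values is actually observed.
  static long maxSeries(Map<String, Integer> valuesPerLabel) {
    long series = 1;
    for (int count : valuesPerLabel.values()) {
      series = Math.multiplyExact(series, (long) count);
    }
    return series;
  }

  public static void main(String[] args) {
    Map<String, Integer> labels = new LinkedHashMap<>();
    labels.put("method", 4);    // assumed: GET, POST, PUT, DELETE
    labels.put("status", 5);    // assumed: five status classes
    labels.put("instance", 10); // assumed: ten replicas
    System.out.println(maxSeries(labels)); // 4 * 5 * 10 = 200

    labels.put("user_id", 100_000);        // an unbounded label explodes the count
    System.out.println(maxSeries(labels)); // 200 * 100000 = 20000000
  }
}
```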

Time Series and TSDB

A time series is simply a series of data points ordered in time.

In a time series, time is often the independent variable and the goal is usually to make a forecast for the future.
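
As a rough illustration of both ideas, the sketch below keeps data points ordered by timestamp and produces a deliberately naive forecast: the mean of the most recent points. The class `TimeSeries` and its methods are hypothetical, and a real TSDB or forecasting model would be far more sophisticated.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory time series: epoch seconds -> value, kept in time order.
class TimeSeries {
  private final TreeMap<Long, Double> points = new TreeMap<>();

  // Points may arrive out of order; the TreeMap keeps them sorted by time.
  void add(long epochSeconds, double value) {
    points.put(epochSeconds, value);
  }

  // Naive forecast: mean of the last `window` points (fewer if not available).
  double forecastNext(int window) {
    if (points.isEmpty() || window <= 0) {
      throw new IllegalArgumentException("need data and a positive window");
    }
    double sum = 0;
    int n = 0;
    for (Map.Entry<Long, Double> e : points.descendingMap().entrySet()) {
      if (n == window) {
        break;
      }
      sum += e.getValue();
      n++;
    }
    return sum / n;
  }

  public static void main(String[] args) {
    TimeSeries cpu = new TimeSeries();
    cpu.add(10, 10.0);
    cpu.add(20, 20.0);
    cpu.add(30, 30.0);
    System.out.println(cpu.forecastNext(2)); // mean of the two latest points
  }
}
```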

Types of metrics

  • Counter: number of HTTP 500 responses
  • Gauge: memory usage
  • Histogram: statistical distribution, e.g. 95th percentile of request execution time
  • Rate: errors per minute
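
The four types listed above can be sketched without any metrics library. Everything here (class name, method names, the nearest-rank percentile) is an illustrative assumption; a real metrics client typically uses pre-defined histogram buckets instead of sorting raw samples.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, dependency-free illustrations of the four metric types.
class MetricTypes {

  // Counter: only ever goes up (e.g. number of HTTP 500 responses).
  static final AtomicLong http500Total = new AtomicLong();

  // Gauge: a value that can go up or down (e.g. memory currently in use).
  static long usedMemoryBytes() {
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  // Histogram-style summary: nearest-rank percentile over raw samples.
  static double percentile(double[] samples, double p) {
    if (samples.length == 0) {
      throw new IllegalArgumentException("no samples");
    }
    double[] sorted = samples.clone();
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(p * sorted.length / 100.0);
    rank = Math.min(Math.max(rank, 1), sorted.length); // clamp to a valid rank
    return sorted[rank - 1];
  }

  // Rate: increase of a counter divided by the elapsed time.
  static double ratePerMinute(long before, long after, double elapsedSeconds) {
    return (after - before) * 60.0 / elapsedSeconds;
  }

  public static void main(String[] args) {
    http500Total.incrementAndGet();
    System.out.println("http_500_total = " + http500Total.get());
    System.out.println("used_memory_bytes = " + usedMemoryBytes());
    System.out.println("p95 = " + percentile(new double[] {12, 15, 11, 240, 14}, 95));
    System.out.println("errors/min = " + ratePerMinute(10, 16, 120.0));
  }
}
```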

Distributed Traces

Distributed tracing is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.

  • Help visualize and understand complex architectures
  • Enable developers to see how an individual request is handled
  • Attach a unique trace ID to each request
  • Provide insight into the full lifecycle of requests, allowing you to pinpoint failures and performance issues

OpenTracing Overview

  • Trace:
    Collection of Spans represented as a DAG
  • Span:
    Unit of work defined by: operation name, timestamps and more
  • SpanContext:
    All the info identifying a Span, and that must be propagated, such as: TraceId, SpanId
  • Tracer:
    The actual implementation that creates Spans and injects and extracts SpanContext
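
The relationship between these four concepts can be sketched in plain Java. This is a simplified model for illustration, not the OpenTracing API: the class `MiniTracer`, its methods, and the carrier keys `trace-id` and `span-id` are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Simplified model of Trace / Span / SpanContext / Tracer; NOT the OpenTracing API.
class MiniTracer {

  // SpanContext: the identifiers that must travel with the request.
  static final class SpanContext {
    final String traceId;
    final String spanId;

    SpanContext(String traceId, String spanId) {
      this.traceId = traceId;
      this.spanId = spanId;
    }
  }

  // Span: a unit of work with an operation name, a start time and a parent link.
  static final class Span {
    final String operationName;
    final SpanContext context;
    final String parentSpanId; // null for the root span of a trace
    final long startNanos = System.nanoTime();

    Span(String operationName, SpanContext context, String parentSpanId) {
      this.operationName = operationName;
      this.context = context;
      this.parentSpanId = parentSpanId;
    }
  }

  // Starts a root span (parent == null) or a child span within the same trace.
  static Span startSpan(String operationName, SpanContext parent) {
    String traceId = parent == null ? newId() : parent.traceId;
    String parentSpanId = parent == null ? null : parent.spanId;
    return new Span(operationName, new SpanContext(traceId, newId()), parentSpanId);
  }

  // Inject: copy the context into a carrier such as outgoing HTTP headers.
  static void inject(SpanContext ctx, Map<String, String> carrier) {
    carrier.put("trace-id", ctx.traceId); // hypothetical key names
    carrier.put("span-id", ctx.spanId);
  }

  // Extract: rebuild the context on the receiving side, or null if absent.
  static SpanContext extract(Map<String, String> carrier) {
    String traceId = carrier.get("trace-id");
    String spanId = carrier.get("span-id");
    return (traceId == null || spanId == null) ? null : new SpanContext(traceId, spanId);
  }

  static String newId() {
    return UUID.randomUUID().toString().replace("-", "");
  }

  public static void main(String[] args) {
    Span root = startSpan("checkout", null);
    Map<String, String> headers = new HashMap<>();
    inject(root.context, headers);                        // service A, outgoing call
    Span child = startSpan("charge-card", extract(headers)); // service B, incoming call
    System.out.println(child.context.traceId.equals(root.context.traceId));
  }
}
```

Because every child span records its parent's span ID and shares the trace ID, the spans collected from all services can be reassembled into the DAG that makes up a trace.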


Instrumentation Mechanism

  • Framework:
    Embedded in the application code. Almost every platform provides a library.
  • Sidecar:
    Requires minimal modifications to the application code: at the least, the application should forward SpanContext headers
  • Agent:
    No modification of the application code is required. Example: agents built with Byte Buddy-style bytecode instrumentation
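
For the sidecar mechanism above, forwarding SpanContext headers usually means copying a fixed set of tracing headers from the inbound request onto every outbound call. The sketch below assumes B3-style header names plus `x-request-id`; the exact list depends on the sidecar and tracer in use, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: selects the tracing headers an application should forward.
class TraceHeaderForwarder {

  // Assumed header set (B3 propagation); check your sidecar's documentation.
  static final List<String> TRACE_HEADERS = List.of(
      "x-request-id", "x-b3-traceid", "x-b3-spanid",
      "x-b3-parentspanid", "x-b3-sampled", "x-b3-flags");

  // Returns the headers to copy onto an outbound request (names lower-cased).
  static Map<String, String> headersToForward(Map<String, String> inbound) {
    Map<String, String> outbound = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : inbound.entrySet()) {
      String name = e.getKey().toLowerCase(Locale.ROOT);
      if (TRACE_HEADERS.contains(name)) {
        outbound.put(name, e.getValue());
      }
    }
    return outbound;
  }

  public static void main(String[] args) {
    Map<String, String> inbound = new LinkedHashMap<>();
    inbound.put("X-B3-TraceId", "abc"); // sample values
    inbound.put("Cookie", "secret");    // unrelated header, not forwarded
    System.out.println(headersToForward(inbound));
  }
}
```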

Flame Graph


What happens after you have reached 100% observability?



  • USE (Brendan Gregg):
    For every resource, check Utilization, Saturation, and Errors.
  • 4 Golden Signals (Google):
    Latency, Traffic, Errors, and Saturation
  • RED (Tom Wilkie):
    Rate, Errors, and Duration
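
As one way to turn a method into numbers, the sketch below derives the three RED signals from a batch of request records. The class, the `Request` shape, and the choice to count HTTP status 500 and above as an error are assumptions made for illustration.

```java
import java.util.List;

// Hypothetical RED calculation over requests observed in one time window.
class RedMetrics {

  // Assumed request record: HTTP status and duration in milliseconds.
  static final class Request {
    final int status;
    final double durationMs;

    Request(int status, double durationMs) {
      this.status = status;
      this.durationMs = durationMs;
    }
  }

  // Rate: requests per second over the observation window.
  static double rate(List<Request> requests, double windowSeconds) {
    return requests.size() / windowSeconds;
  }

  // Errors: fraction of requests that failed (assumed here: status >= 500).
  static double errorRatio(List<Request> requests) {
    if (requests.isEmpty()) {
      return 0.0;
    }
    long errors = requests.stream().filter(r -> r.status >= 500).count();
    return (double) errors / requests.size();
  }

  // Duration: mean latency; percentiles are often more informative in practice.
  static double meanDurationMs(List<Request> requests) {
    return requests.stream().mapToDouble(r -> r.durationMs).average().orElse(0.0);
  }

  public static void main(String[] args) {
    List<Request> window = List.of(
        new Request(200, 10.0), new Request(200, 20.0),
        new Request(500, 30.0), new Request(503, 40.0));
    System.out.println("rate/s = " + rate(window, 2.0));
    System.out.println("error ratio = " + errorRatio(window));
    System.out.println("mean duration ms = " + meanDurationMs(window));
  }
}
```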

Fundamental Principles Beyond Observability

  1. Carefully choose what you measure
  2. Explain what you see
  3. Turn data into action