Observability and Monitoring

  • Status: Accepted
  • Deciders: Emmitt Johnson
  • Date: 2025-04-10

Context and Problem Statement

Observability and monitoring are critical to the reliability and performance of complex distributed systems such as microservice architectures. Traditional monitoring relies on predefined metrics and thresholds, so it often misses unexpected issues. Key challenges include siloed data, scalability limits, slow incident resolution, and high operational overhead. A robust observability strategy is needed that provides unified insights, handles large-scale telemetry data, and enables proactive issue detection and resolution.
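To make the limitation of predefined thresholds concrete, the sketch below compares a fixed threshold with a simple rolling-baseline check on a made-up latency series. The function names, the sample data, the window size, and the z-score limit are all illustrative assumptions. This is not how Dynatrace or any other considered option detects anomalies; it only shows why a static rule can miss a deviation that a baseline-relative rule catches.

```python
from statistics import mean, stdev


def static_alerts(samples, threshold):
    """Return indices whose value exceeds a fixed, predefined threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def baseline_alerts(samples, window=5, z_limit=3.0):
    """Return indices whose value deviates from the rolling baseline.

    The baseline is the mean of the previous `window` samples; a sample is
    flagged when it lies more than `z_limit` standard deviations away.
    """
    alerts = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # A perfectly flat history gives sigma == 0; skip to avoid dividing by zero.
        if sigma > 0 and abs(samples[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts


# Hypothetical response times in milliseconds with one spike at index 6.
latency_ms = [100, 102, 98, 101, 99, 100, 180, 101, 100]

# A threshold chosen for outright failures (500 ms) never fires on the spike.
print(static_alerts(latency_ms, threshold=500))

# The baseline check flags the sample that breaks from recent behaviour.
print(baseline_alerts(latency_ms))
```

In this example the 180 ms spike stays well below the static 500 ms limit, yet it is far outside the recent baseline of roughly 100 ms. Lowering the static threshold would catch it, but only for services whose normal latency happens to sit near that value.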

Considered Options

  • Dynatrace
  • Splunk
  • Grafana/Prometheus

Decision Outcome

Chosen option: "Dynatrace", because:

  • Dynatrace provides superior tracing and metrics capabilities compared to Splunk, enabling better visibility into distributed systems and faster root cause analysis.
  • It offers advanced AI-driven anomaly detection and automated problem resolution, reducing operational overhead.
  • Dynatrace scales efficiently to handle large volumes of telemetry data, making it suitable for complex, modern architectures.
  • Its unified platform integrates seamlessly with existing tools, providing a comprehensive observability solution.