
Observability Stack

Full visibility into your systems — metrics, logs, traces, profiling, and synthetic monitoring. Alert the right person at the right time. Reduce MTTR from hours to minutes.

Prometheus · Grafana · Loki · Tempo · Jaeger · Datadog · New Relic · Splunk · OpenTelemetry · Pyroscope · PagerDuty · SLO/SLI

Six pillars of observability

We cover the full spectrum — not just metrics and logs.

Metrics

Time-series metrics from every layer: infra, app, and business. RED method per service. SLO/SLI dashboards. Long-term storage with Thanos, Mimir, or VictoriaMetrics.
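As an illustration, RED-method signals (rate, errors, duration) per service are often precomputed as Prometheus recording rules. A minimal sketch, assuming a conventional `http_requests_total` counter and a hypothetical `checkout` service:

```yaml
groups:
  - name: red-checkout              # service name is illustrative
    rules:
      - record: service:request_rate:5m
        expr: sum(rate(http_requests_total{service="checkout"}[5m]))
      - record: service:error_rate:5m
        expr: sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m]))
      - record: service:latency_p99:5m
        expr: >
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{service="checkout"}[5m])))
```

Recording rules keep dashboards fast and give SLO alerts a cheap, consistent expression to query.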

Prometheus · Grafana · Datadog · New Relic · VictoriaMetrics · Thanos · Mimir · InfluxDB · CloudWatch

Logs

Centralized log aggregation from all services, pods, and infra. Structured JSON logging. Correlated with metrics — click a spike, see the logs for that exact window.
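For reference, structured JSON logging needs no special library. A minimal Python sketch using only the standard library (the field names and the `request_id` convention are illustrative, not required by any aggregator):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Attach request-scoped fields if the caller passed them via `extra=`.
        if hasattr(record, "request_id"):
            payload["request_id"] = record.request_id
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment captured", extra={"request_id": "req-42"})
```

One JSON object per line is exactly what Loki, ELK, and Datadog parse without extra pipeline config, and it is what makes the metrics-to-logs correlation click-through possible.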

Loki · Elastic (ELK) · Splunk · Datadog Logs · Fluentd · Fluent Bit · Promtail · Vector · Graylog · Papertrail

Traces

Distributed tracing across every microservice call. Identify exact latency bottlenecks, slow DB queries, and N+1 API calls. Instrument once, trace everywhere.
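Under the hood, trace propagation is just a header that each service forwards. A minimal sketch of the W3C `traceparent` format that OpenTelemetry and modern Jaeger clients implement (not production code; use an OTel SDK in practice):

```python
import os


def new_traceparent() -> str:
    """Start a new trace: W3C traceparent = version-traceid-spanid-flags."""
    trace_id = os.urandom(16).hex()  # 32 hex chars, shared by every span in the trace
    span_id = os.urandom(8).hex()    # 16 hex chars, unique per span
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent: str) -> str:
    """Continue the trace in a downstream service: same trace_id, new span_id."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{os.urandom(8).hex()}-{flags}"


# The edge service mints the header; every hop forwards a child of it.
root = new_traceparent()
downstream = child_traceparent(root)
```

Because every span carries the same `trace_id`, the backend (Tempo, Jaeger, Datadog APM) can reassemble the full request path and show exactly where the latency went.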

Tempo · Jaeger · Zipkin · OpenTelemetry · Datadog APM · New Relic APM · AWS X-Ray · Lightstep · Honeycomb

Alerting & On-call

SLO-based alerting — alert on burn rate, not raw metrics. No alert storms. Right person, right time, with runbook attached and auto-escalation.
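A sketch of what a burn-rate rule can look like in Prometheus, assuming a 99.9% availability SLO and a generic `http_requests_total` counter (metric names, thresholds, and the runbook URL are illustrative):

```yaml
groups:
  - name: slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        # 14.4x burn = 2% of a 30-day budget gone in 1 hour (multiwindow pattern)
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          runbook_url: https://runbooks.example.com/slo-burn
```

The alert fires on how fast the budget is being spent, not on a raw error count, so brief blips stay silent while sustained burns page immediately.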

Alertmanager · PagerDuty · OpsGenie · Grafana OnCall · VictoriaMetrics Alerts · Datadog Monitors · xMatters

Profiling & APM

Continuous profiling for CPU, memory, goroutines, and heap. Identify hotspots in production without reproducing in dev. Application performance baselines.

Pyroscope · Grafana Pyroscope · Datadog Continuous Profiler · Parca · py-spy · async-profiler

Synthetic & RUM

Synthetic monitoring: probe endpoints every minute globally. Real User Monitoring: track Core Web Vitals, JS errors, and user sessions from real traffic.
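For the self-hosted route, synthetic probes are typically a Blackbox Exporter scrape job. The standard relabeling pattern looks like this (target URLs and the exporter address are placeholders):

```yaml
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]          # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.com/healthz   # endpoints to probe
    relabel_configs:
      - source_labels: [__address__]      # move the URL into the ?target= param
        target_label: __param_target
      - source_labels: [__param_target]   # keep the URL as the instance label
        target_label: instance
      - target_label: __address__         # actually scrape the exporter itself
        replacement: blackbox-exporter:9115
```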

Grafana k6 · Blackbox Exporter · Datadog Synthetics · New Relic Browser · Sentry · OpenReplay · LogRocket

Full tool coverage

We work with every major observability tool — open-source and commercial.

Metrics & Visualization

Prometheus · Grafana · Datadog · New Relic · Dynatrace · VictoriaMetrics · Thanos · Mimir · InfluxDB · Graphite · CloudWatch · Azure Monitor · Google Cloud Monitoring

Logs

Loki · Elasticsearch · Logstash · Kibana (ELK) · Splunk · Fluentd · Fluent Bit · Vector · Promtail · Graylog · Papertrail · Datadog Logs · Coralogix · Mezmo

Distributed Tracing

Jaeger · Zipkin · Tempo · OpenTelemetry · AWS X-Ray · Datadog APM · New Relic APM · Honeycomb · Lightstep · Dynatrace · SigNoz

Alerting & On-call

Alertmanager · PagerDuty · OpsGenie · Grafana OnCall · Squadcast · Spike.sh · xMatters · VictoriaMetrics vmalert · Slack · MS Teams

Profiling

Pyroscope · Grafana Pyroscope · Parca · Datadog Profiler · py-spy · async-profiler · perf · eBPF / Pixie

Synthetic & RUM

Grafana k6 · Blackbox Exporter · Checkly · Datadog Synthetics · Sentry · OpenReplay · LogRocket · Hotjar

Exporters & Agents

node-exporter · kube-state-metrics · cAdvisor · DCGM (GPU) · MySQL exporter · Postgres exporter · Redis exporter · Nginx exporter · Blackbox exporter

Instrumentation

OpenTelemetry SDK · Micrometer · StatsD · hot-shots · prometheus-client · opencensus · dd-trace

Long-term Storage

Thanos · Mimir · VictoriaMetrics · Cortex · InfluxDB · TimescaleDB · ClickHouse · S3 (Parquet)

What you get

Observability strategy doc: which tools, which signals, which layer
Prometheus + kube-state-metrics + node-exporter for full K8s coverage
Grafana dashboards: infra, application (RED), SLO/SLI, business KPIs
Loki / ELK log aggregation with structured LogQL / KQL queries
Distributed tracing: Tempo or Jaeger + OpenTelemetry auto-instrumentation
Metrics-logs-traces correlation in Grafana (click spike → see logs + traces)
Long-term storage: Thanos / Mimir / VictoriaMetrics (S3-backed)
SLO/SLI definition + error budget dashboards per service
Alertmanager with SLO burn rate alerts — no alert storms
On-call: PagerDuty / OpsGenie schedule, escalation policy, runbooks
Datadog or New Relic setup (if preferred over self-hosted)
Synthetic monitoring: Blackbox Exporter or k6 probing key endpoints
Custom exporters for your DB, queue, or internal services
Pyroscope continuous profiling (optional — pinpoint CPU/memory hotspots)
Cost optimization: log retention tiers, metric cardinality reduction
Team training: PromQL, LogQL, dashboard authoring, alert tuning
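The SLO and error-budget items above boil down to simple arithmetic. A minimal sketch (the numbers are illustrative):

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent in the current window."""
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 1M requests allows 1,000 failures;
# 250 failures spent leaves 75% of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

Per-service dashboards plot exactly this number over the rolling window, so product and engineering share one view of how much risk is left to spend.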
| Aspect | Self-hosted (Grafana Stack) | SaaS (Datadog / New Relic) |
| --- | --- | --- |
| Cost | ~$200–500/mo (infra only) | $2,000–20,000+/mo at scale |
| Setup time | 1–2 weeks | Hours to days |
| Data ownership | Full — stays in your infra | Data leaves your infra |
| Customization | Unlimited | Limited to platform features |
| Ops overhead | Requires DevOps to maintain | Zero — fully managed |
| Vendor lock-in | None (open standards) | High |
| Best for | Teams with DevOps capacity | Teams wanting zero ops overhead |

Questions

Self-hosted (Grafana/Prometheus) vs SaaS (Datadog/New Relic)?

Self-hosted is 80–90% cheaper at scale and gives full data control — ideal for teams with DevOps capacity. Datadog/New Relic are faster to start, better for teams that want zero ops overhead. We implement both and help you choose based on your scale, budget, and team size.

What's the difference between metrics, logs, and traces?

Metrics tell you something is wrong (latency up 3x). Logs tell you what happened (which request, which error). Traces tell you where the time was spent across services. You need all three for fast incident resolution — each answers a different question.

How do you avoid alert fatigue?

We use SLO-based alerting: alert on error budget burn rate, not individual thresholds. A 5xx spike that burns 1% of your monthly budget in 5 minutes pages. A transient spike that resolves in 30 seconds doesn't. This cuts alert volume by 70–90% while catching real incidents faster.
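The arithmetic behind that example, as a sketch: over a 30-day window, a burn rate of 1 spends the budget exactly at the SLO boundary, while a rate of 86.4 spends 1% of it in five minutes (values below are illustrative):

```python
WINDOW_MIN = 30 * 24 * 60  # 30-day SLO window, in minutes


def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    return error_rate / (1.0 - slo_target)


def budget_spent(rate: float, minutes: float) -> float:
    """Fraction of the whole window's budget consumed at this burn rate."""
    return rate * minutes / WINDOW_MIN


# An 8.64% 5xx rate against a 99.9% SLO is a burn rate of 86.4...
rate = burn_rate(0.0864, 0.999)
# ...which consumes 1% of the monthly budget in just 5 minutes: page now.
spent = budget_spent(rate, 5)
```

Pairing a fast window (page) with a slow window (ticket) is what cuts the alert volume while still catching real incidents quickly.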

What is OpenTelemetry and should we use it?

OpenTelemetry (OTel) is the open standard for instrumentation — one SDK that emits metrics, logs, and traces. We recommend it for all new instrumentation: it's vendor-neutral, meaning you can switch from Jaeger to Tempo to Datadog without changing app code.
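For example, a minimal OpenTelemetry Collector pipeline that receives OTLP from applications and ships traces to a backend; the Tempo endpoint is a placeholder, and swapping the exporter is all it takes to change vendors:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:                      # batch spans before export to cut overhead
exporters:
  otlp:
    endpoint: tempo:4317      # placeholder; point at Jaeger, Datadog, etc.
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```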

Datadog is expensive — can you reduce our bill?

Yes. Common fixes: drop high-cardinality metrics before ingestion, use log exclusion filters on noisy low-value logs, move old traces to cheap storage, and switch infrastructure metrics to Prometheus + Grafana Cloud (much cheaper). We've cut Datadog bills by 40–60%.

Can you set up Sentry for error tracking?

Yes. Sentry for frontend/backend error tracking integrates well with the rest of the stack. We connect Sentry alerts to PagerDuty and link Sentry issues to Grafana traces for full context during debugging.

What about GPU / ML workload monitoring?

Yes. We set up DCGM exporter (NVIDIA Data Center GPU Manager) for GPU utilization, memory, temperature, and power metrics — displayed in Grafana alongside your application metrics.
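A typical Prometheus scrape job for dcgm-exporter (the target address is a placeholder; 9400 is the exporter's default port):

```yaml
scrape_configs:
  - job_name: dcgm
    static_configs:
      - targets: ['dcgm-exporter.gpu-monitoring:9400']
```

The resulting `DCGM_FI_DEV_*` series sit next to your application metrics, so one Grafana dashboard can correlate GPU saturation with request latency.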

Can you set up on-call rotations?

Yes. PagerDuty or OpsGenie schedule setup, escalation policies, alert routing, and runbook creation. We also tune alert thresholds so the on-call engineer gets actionable pages, not noise.
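A sketch of the routing half of that setup in Alertmanager, assuming paging alerts carry `severity="page"` (receiver names, channels, and the integration key are placeholders):

```yaml
route:
  receiver: pagerduty-primary
  group_by: [alertname, service]
  routes:
    - matchers:
        - severity = "page"
      receiver: pagerduty-primary   # wakes the on-call engineer
    - matchers:
        - severity = "warn"
      receiver: slack-alerts        # non-paging alerts go to chat
receivers:
  - name: pagerduty-primary
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: slack-alerts
    slack_configs:
      - channel: '#alerts'
```

Grouping by `alertname` and `service` collapses a flapping incident into one page instead of fifty.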

Stop flying blind

Book a call. We'll assess your current visibility gaps and recommend the right stack for your team size, budget, and infra.

Book Discovery Call