Grafana Dashboard

Building a 6-panel production dashboard for FinPay using PromQL.

Dashboard Overview

Panel             | Query Type                | Why It Matters
CPU Usage         | Counter → rate()          | Detect CPU spikes from traffic or attacks
Memory Usage      | Gauge                     | Detect memory leaks over time
HTTP Request Rate | Counter → rate()          | Baseline traffic, detect spikes
HTTP Error Rate   | Counter → rate() + filter | Detect attacks, broken deployments
Response Time P95 | Histogram quantile        | Real user experience — not averages
Event Loop Lag    | Gauge                     | Node.js health — blocking operations

Panel 1 — CPU Usage

rate(process_cpu_seconds_total{job="prometheus.scrape.finpay_api"}[1m])
  • Unit: percent (0.0-1.0)
  • Why rate(): CPU seconds is a counter — it only increases. rate() converts it to CPU usage per second
  • Why [1m]: Look at the last 1 minute to calculate the rate
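Under the hood, rate() is simply the counter's increase divided by the elapsed time between samples. A minimal sketch of that arithmetic, using hypothetical sample values rather than real FinPay data:

```javascript
// Sketch of what PromQL's rate() computes from two counter samples.
// Sample values below are hypothetical.
function counterRate(earlier, later) {
  // rate() = increase in the counter / elapsed seconds
  return (later.value - earlier.value) / (later.timestamp - earlier.timestamp);
}

// Two samples of process_cpu_seconds_total, taken 60 s apart:
const t0 = { timestamp: 1000, value: 120.0 };
const t1 = { timestamp: 1060, value: 120.6 };

console.log(counterRate(t0, t1)); // ≈ 0.01, i.e. ~1% of one CPU core
```

This is why a counter that "only goes up" still produces a flat, readable line: the panel plots the slope, not the raw value.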

Panel 2 — Memory Usage

process_resident_memory_bytes{job="prometheus.scrape.finpay_api"}
  • Unit: bytes (IEC) — auto-formats as MiB/GiB
  • Why no rate(): Memory is a gauge — it already represents the current value
  • Why resident memory: RSS is actual RAM in use. Virtual memory is misleadingly large

Panel 3 — HTTP Request Rate

rate(http_requests_total{job="prometheus.scrape.finpay_api"}[1m])
  • Unit: requests/sec
  • What you'll see: Multiple lines — one per route/status code combination
  • Why this matters: Establishes your normal traffic baseline

Panel 4 — HTTP Error Rate

rate(http_requests_total{job="prometheus.scrape.finpay_api", status_code=~"4..|5.."}[1m])
  • Unit: requests/sec
  • =~ means regex match — 4.. matches any 4xx, 5.. matches any 5xx
  • "No data" is good news — zero errors

What errors mean for FinPay

Status     | Meaning
401 spikes | Brute force attack on login
429 spikes | Rate limiter blocking repeat offenders
500 spikes | Bug in business logic
400 spikes | Bad client requests / API misuse

Panel 5 — Response Time P95

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="prometheus.scrape.finpay_api"}[1m]))
  • Unit: seconds (s)
  • P95 means: 95% of requests complete under this time
  • Why not average: Average hides pain. P95 exposes the worst 5% of users
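histogram_quantile() estimates the percentile by finding the bucket where the target rank falls and interpolating linearly inside it. A simplified sketch of that math with hypothetical bucket counts (the real query feeds per-second rates of the buckets, and Prometheus also handles the +Inf edge case, but the interpolation is the same):

```javascript
// Simplified sketch of histogram_quantile(): buckets are cumulative
// counts keyed by upper bound (le), sorted ascending. Counts are hypothetical.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count; // +Inf bucket holds everything
  const target = q * total; // rank of the requested percentile
  let prevBound = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= target) {
      // Linear interpolation inside the bucket that crosses the target rank
      return prevBound + (b.le - prevBound) * ((target - prevCount) / (b.count - prevCount));
    }
    prevBound = b.le;
    prevCount = b.count;
  }
}

const buckets = [
  { le: 0.005, count: 400 },
  { le: 0.01, count: 900 },
  { le: 0.025, count: 990 },
  { le: Infinity, count: 1000 },
];
console.log(histogramQuantile(0.95, buckets)); // ≈ 0.0183 s
```

This also explains why P95 is an estimate: its precision depends on how finely the histogram buckets are spaced around the true value.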

Panel 6 — Event Loop Lag

nodejs_eventloop_lag_seconds{job="prometheus.scrape.finpay_api"}
  • Unit: seconds (s)
  • Normal: 0–10ms
  • Warning: 10–100ms
  • Incident: 100ms+

Node.js is single-threaded. High event loop lag means all users are waiting behind a blocked operation.
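The idea behind the lag gauge can be demonstrated locally: schedule a callback and measure how late it actually fires. A minimal probe in that spirit (the prom-client gauge uses Node's perf_hooks machinery rather than this exact approach):

```javascript
// Minimal event-loop-lag probe: schedule a callback and measure how
// late the event loop actually runs it.
function measureLag(callback) {
  const scheduled = process.hrtime.bigint();
  setImmediate(() => {
    const lagNs = process.hrtime.bigint() - scheduled;
    callback(Number(lagNs) / 1e9); // lag in seconds
  });
}

measureLag((lag) => console.log(`event loop lag: ${(lag * 1000).toFixed(1)} ms`));

// Deliberately block the loop with ~50 ms of synchronous work —
// the callback above can only fire after this finishes.
const start = Date.now();
while (Date.now() - start < 50) {}
```

Running this prints a lag of roughly 50 ms: the callback could not run until the synchronous loop released the thread, which is exactly what every user's request experiences during a blocking operation.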

Dashboard as Code

Export your dashboard as JSON for version control:

  1. Open dashboard → Share → Export
  2. Toggle "Export for sharing externally" ON
  3. Save as monitoring/dashboards/finpay-api-dashboard.json
  4. Commit to git

This means anyone can import the exact dashboard into their own Grafana instance.
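The exported file is a plain Grafana dashboard model. A heavily trimmed, hypothetical fragment looks roughly like this (field names from Grafana's dashboard JSON schema; values illustrative):

```json
{
  "title": "FinPay API",
  "panels": [
    {
      "title": "HTTP Error Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "rate(http_requests_total{job=\"prometheus.scrape.finpay_api\", status_code=~\"4..|5..\"}[1m])"
        }
      ]
    }
  ]
}
```

Because the queries live in this file, a reviewer can diff dashboard changes in a pull request just like application code.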

Production Results

After pointing Alloy at the Railway production URL:

  • CPU: 0.3% at idle — stable
  • Memory: 96–104 MiB — normal warm-up
  • P95 Response Time: 8ms — excellent
  • Error Rate: 0 req/s — clean

✦ Test Your Knowledge

1. Why do we monitor P95 response time instead of average response time?

A. P95 is easier to calculate
B. Average hides outliers — P95 shows that 95% of users get responses under that time, exposing the worst 5%
C. Grafana only supports P95
D. Average response time is always zero

2. What does an HTTP Error Rate panel showing 'No data' indicate in a healthy system?

A. The monitoring is broken
B. The query is wrong
C. Zero errors are occurring — this is good news
D. The panel needs to be refreshed

3. What is Event Loop Lag and why does it matter for a payment API?

A. Time between browser requests — irrelevant for APIs
B. How long tasks wait in the Node.js queue — high lag means all users are delayed regardless of their request
C. Network latency between Railway and Upstash
D. Time for a database transaction to complete

4. What PromQL function finds the 99th percentile of a histogram metric?

A. rate(metric[1m])
B. avg(metric)
C. histogram_quantile(0.99, rate(metric_bucket[1m]))
D. max(metric)