Grafana Dashboard

Building a 6-panel production dashboard for FinPay using PromQL.

Dashboard Overview

Panel             | Query Type                | Why It Matters
CPU Usage         | Counter → rate()          | Detect CPU spikes from traffic or attacks
Memory Usage      | Gauge                     | Detect memory leaks over time
HTTP Request Rate | Counter → rate()          | Baseline traffic, detect spikes
HTTP Error Rate   | Counter → rate() + filter | Detect attacks, broken deployments
Response Time P95 | Histogram quantile        | Real user experience — not averages
Event Loop Lag    | Gauge                     | Node.js health — blocking operations

Panel 1 — CPU Usage

rate(process_cpu_seconds_total{job="prometheus.scrape.finpay_api"}[1m])
  • Unit: percent (0.0-1.0)
  • Why rate(): CPU seconds is a counter — it only increases. rate() converts it to CPU usage per second
  • Why [1m]: Look at the last 1 minute to calculate the rate
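Under the hood, rate() is simply the counter's increase divided by the elapsed time between samples. A minimal sketch of that arithmetic, using hypothetical sample values rather than real FinPay data:

```javascript
// Sketch of what PromQL's rate() computes from two counter samples.
// Sample values below are hypothetical.
function counterRate(earlier, later) {
  // rate() = increase in the counter / elapsed seconds
  return (later.value - earlier.value) / (later.timestamp - earlier.timestamp);
}

// Two samples of process_cpu_seconds_total, taken 60 s apart:
const t0 = { timestamp: 1000, value: 120.0 };
const t1 = { timestamp: 1060, value: 120.6 };

console.log(counterRate(t0, t1)); // ≈ 0.01, i.e. ~1% of one CPU core
```

This is why a counter that "only goes up" still produces a flat, readable line: the panel plots the slope, not the raw value.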

Panel 2 — Memory Usage

process_resident_memory_bytes{job="prometheus.scrape.finpay_api"}
  • Unit: bytes (IEC) — auto-formats as MiB/GiB
  • Why no rate(): Memory is a gauge — it already represents the current value
  • Why resident memory: RSS is actual RAM in use. Virtual memory is misleadingly large

Panel 3 — HTTP Request Rate

rate(http_requests_total{job="prometheus.scrape.finpay_api"}[1m])
  • Unit: requests/sec
  • What you'll see: Multiple lines — one per route/status code combination
  • Why this matters: Establishes your normal traffic baseline

Panel 4 — HTTP Error Rate

rate(http_requests_total{job="prometheus.scrape.finpay_api", status_code=~"4..|5.."}[1m])
  • Unit: requests/sec
  • =~ means regex match — 4.. matches any 4xx, 5.. matches any 5xx
  • "No data" is good news — zero errors

What errors mean for FinPay

Status     | Meaning
401 spikes | Brute force attack on login
429 spikes | Rate limiter blocking repeat offenders
500 spikes | Bug in business logic
400 spikes | Bad client requests / API misuse

Panel 5 — Response Time P95

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="prometheus.scrape.finpay_api"}[1m]))
  • Unit: seconds (s)
  • P95 means: 95% of requests complete under this time
  • Why not average: Average hides pain. P95 exposes the worst 5% of users
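histogram_quantile() estimates the percentile by finding the bucket where the target rank falls and interpolating linearly inside it. A simplified sketch of that math with hypothetical bucket counts (the real query feeds per-second rates of the buckets, and Prometheus also handles the +Inf edge case, but the interpolation is the same):

```javascript
// Simplified sketch of histogram_quantile(): buckets are cumulative
// counts keyed by upper bound (le), sorted ascending. Counts are hypothetical.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count; // +Inf bucket holds everything
  const target = q * total; // rank of the requested percentile
  let prevBound = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= target) {
      // Linear interpolation inside the bucket that crosses the target rank
      return prevBound + (b.le - prevBound) * ((target - prevCount) / (b.count - prevCount));
    }
    prevBound = b.le;
    prevCount = b.count;
  }
}

const buckets = [
  { le: 0.005, count: 400 },
  { le: 0.01, count: 900 },
  { le: 0.025, count: 990 },
  { le: Infinity, count: 1000 },
];
console.log(histogramQuantile(0.95, buckets)); // ≈ 0.0183 s
```

This also explains why P95 is an estimate: its precision depends on how finely the histogram buckets are spaced around the true value.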

Panel 6 — Event Loop Lag

nodejs_eventloop_lag_seconds{job="prometheus.scrape.finpay_api"}
  • Unit: seconds (s)
  • Normal: 0–10ms
  • Warning: 10–100ms
  • Incident: 100ms+

Node.js is single-threaded. High event loop lag means all users are waiting behind a blocked operation.
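The idea behind the lag gauge can be demonstrated locally: schedule a callback and measure how late it actually fires. A minimal probe in that spirit (the prom-client gauge uses Node's perf_hooks machinery rather than this exact approach):

```javascript
// Minimal event-loop-lag probe: schedule a callback and measure how
// late the event loop actually runs it.
function measureLag(callback) {
  const scheduled = process.hrtime.bigint();
  setImmediate(() => {
    const lagNs = process.hrtime.bigint() - scheduled;
    callback(Number(lagNs) / 1e9); // lag in seconds
  });
}

measureLag((lag) => console.log(`event loop lag: ${(lag * 1000).toFixed(1)} ms`));

// Deliberately block the loop with ~50 ms of synchronous work —
// the callback above can only fire after this finishes.
const start = Date.now();
while (Date.now() - start < 50) {}
```

Running this prints a lag of roughly 50 ms: the callback could not run until the synchronous loop released the thread, which is exactly what every user's request experiences during a blocking operation.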

Dashboard as Code

Export your dashboard as JSON for version control:

  1. Open dashboard → Share → Export
  2. Toggle "Export for sharing externally" ON
  3. Save as monitoring/dashboards/finpay-api-dashboard.json
  4. Commit to git

This means anyone can import the exact dashboard into their own Grafana instance.
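The exported file is a plain Grafana dashboard model. A heavily trimmed, hypothetical fragment looks roughly like this (field names from Grafana's dashboard JSON schema; values illustrative):

```json
{
  "title": "FinPay API",
  "panels": [
    {
      "title": "HTTP Error Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "rate(http_requests_total{job=\"prometheus.scrape.finpay_api\", status_code=~\"4..|5..\"}[1m])"
        }
      ]
    }
  ]
}
```

Because the queries live in this file, a reviewer can diff dashboard changes in a pull request just like application code.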

Production Results

After pointing Alloy at the Railway production URL:

  • CPU: 0.3% at idle — stable
  • Memory: 96–104 MiB — normal warm-up
  • P95 Response Time: 8ms — excellent
  • Error Rate: 0 req/s — clean

✦ Test Your Knowledge

1. Why do we monitor P95 response time instead of average response time?

A. P95 is easier to calculate
B. Average hides outliers — P95 shows that 95% of users get responses under that time, exposing the worst 5%
C. Grafana only supports P95
D. Average response time is always zero

2. What does an HTTP Error Rate panel showing 'No data' indicate in a healthy system?

A. The monitoring is broken
B. The query is wrong
C. Zero errors are occurring — this is good news
D. The panel needs to be refreshed

3. What is Event Loop Lag and why does it matter for a payment API?

A. Time between browser requests — irrelevant for APIs
B. How long tasks wait in the Node.js queue — high lag means all users are delayed regardless of their request
C. Network latency between Railway and Upstash
D. Time for a database transaction to complete

4. What PromQL function finds the 99th percentile of a histogram metric?

A. rate(metric[1m])
B. avg(metric)
C. histogram_quantile(0.99, rate(metric_bucket[1m]))
D. max(metric)