Chapter 7.1: Monitoring (Prometheus & Grafana)
The "Why": What is Observability?
You have now learned how to build, package, and deploy your applications to the cloud. Your app is running on Kubernetes. **But is it working?**
How do you know? How do you know if users are getting errors? How do you know if your server is about to crash from high CPU? How do you know if a deploy made the app 50% slower?
This is the problem of **Observability** (or "o11y") - the ability to understand the internal state of your system just by looking at its outputs. You can't just `ssh` into 500 containers and run `htop`. You need a centralized system.
Observability is built on three main pillars:
- **Metrics** (This Chapter): The "What." A number measured over time.
  - Example: `cpu_usage_percent = 85%`
- **Logs** (Chapter 7.2): The "Why." A detailed, timestamped text record of an event.
  - Example: `ERROR: User '123' failed to log in: password incorrect.`
- **Traces** (Advanced): The "Where." A trace shows the complete journey of a single request as it moves through all your different microservices.
  - Example: Request A -> 10ms in Login API -> 150ms in Database -> 20ms in Login API -> Success.
This chapter focuses on **Monitoring**, which is the art of collecting, storing, and visualizing **Metrics**.
What is Prometheus?
**Prometheus** is the open-source, industry-standard tool for monitoring and alerting. It was created at SoundCloud and is now a core part of the Cloud Native Computing Foundation (CNCF), just like Kubernetes.
Its entire job is to be a **Time-Series Database (TSDB)**. It saves numbers (metrics) with timestamps.
What is Grafana?
**Grafana** is the open-source, industry-standard tool for **visualization**. Prometheus *collects* and *stores* the data. Grafana *queries* Prometheus and displays the data as beautiful, interactive dashboards (graphs, charts, gauges).
You almost *always* use them together. **Prometheus is the database. Grafana is the dashboard.**
Part 1: The Prometheus Architecture (Pull vs. Push)
Prometheus has a very specific "pull-based" architecture. This is a critical concept.
The "Pull" Model
You don't *push* your metrics to Prometheus. Instead, Prometheus *scrapes* (pulls) metrics from your applications.
- **Your App / Server (The "Target"):** You don't install an "agent." You just run a lightweight web server (called an **Exporter**) that exposes a `/metrics` endpoint. This page is just a wall of text with all the current metrics (a sample is shown below).
- **The Prometheus Server:** You configure Prometheus (in `prometheus.yml`) with a list of all your targets.
- **The Scrape:** Every 15 seconds (by default), Prometheus visits the `/metrics` endpoint of every target, downloads all the metrics, and saves them to its database.
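Here is a minimal sketch of what a `/metrics` page looks like. It uses the Prometheus text exposition format; the metric names come from the examples in this chapter, and the values are made up:

```text
# HELP http_requests_total Total HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/login",status="200"} 1027
http_requests_total{method="POST",path="/api/login",status="500"} 3

# HELP memory_usage_bytes Current memory usage in bytes
# TYPE memory_usage_bytes gauge
memory_usage_bytes 4.2e+09
```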
Core Components
- **Prometheus Server:** The main component that does the *scraping*, *storing* (in its TSDB), and *querying* (with its PromQL language).
- **Exporters:** The "agents" that you install *on* your servers or *with* your apps. Their only job is to expose the `/metrics` endpoint.
  - **node_exporter:** The most common. Monitors a Linux server's health (CPU, RAM, Disk).
  - **cAdvisor:** Monitors Docker containers.
  - **Client Libraries:** For your own app (Node.js, Python, Kotlin) to expose custom metrics (like `http_requests_total`).
- **Alertmanager:** A separate component that Prometheus *sends alerts to*. Alertmanager's job is to de-duplicate, group, and route those alerts to the right place (e.g., Slack, PagerDuty, Email).
Part 2: The 4 Types of Prometheus Metrics
This is the most important theory. Prometheus only has four types of metrics. You *must* use the right one for the job.
1. Counter (The "Odometer")
A **Counter** is a metric that can **only go up** (or be reset to 0). It's like the odometer in your car (its total mileage). It never goes down.
- **Use Case:** Counting *total* occurrences of an event.
- **Examples:**
  - `http_requests_total` (Total HTTP requests handled)
  - `errors_total` (Total errors encountered)
  - `user_signups_total` (Total new users)

You *never* graph a Counter directly. A graph of "total requests" just goes up and up, which is useless. Instead, you use the `rate()` function (see PromQL) to ask: "How *fast* is this counter increasing per second?"
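As a concrete sketch, here is how an app could expose a Counter using the official Python client library, `prometheus_client` (installed with `pip install prometheus_client`); the label names and port are illustrative:

```python
from prometheus_client import Counter, start_http_server

# A Counter can only go up. The Python client automatically
# appends "_total", so this is exposed as http_requests_total.
http_requests = Counter(
    "http_requests",
    "Total HTTP requests handled",
    ["method", "path", "status"],
)

# Serve the /metrics page on port 8000 for Prometheus to scrape
# (in a real app, your web framework keeps the process alive).
start_http_server(8000)

# Somewhere inside your request handler:
http_requests.labels(method="GET", path="/api/login", status="200").inc()
```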
2. Gauge (The "Speedometer")
A **Gauge** is a metric that can **go up or down**. It's like the speedometer in your car. It represents a single, "point-in-time" value.
- **Use Case:** Measuring a value that can fluctuate.
- **Examples:**
  - `cpu_usage_percent` (Currently 85%)
  - `memory_usage_bytes` (Currently 4.2 GB)
  - `users_online_current` (Currently 520 users)
You can graph a Gauge directly. It's useful for "What is my CPU usage *right now*?"
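Continuing the same Python client sketch, a Gauge has `inc()`, `dec()`, and `set()`; the connect/disconnect hooks below are hypothetical placeholders for wherever your app tracks sessions:

```python
from prometheus_client import Gauge

# A Gauge can go up or down.
users_online = Gauge("users_online_current", "Users currently online")

def on_user_connect():      # hypothetical hook in your app
    users_online.inc()

def on_user_disconnect():   # hypothetical hook in your app
    users_online.dec()

# Or overwrite it with a value you measured elsewhere:
users_online.set(520)
```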
3. Histogram (The "Bucket List" for Latency)
This is the most powerful and complex type. It's used to measure **distributions** of data, most commonly **request latency** (how long something took).
A Histogram doesn't just store one number. It stores *multiple* counters in "buckets."
When you define a Histogram for `http_request_duration_seconds`, you define your buckets (e.g., 0.1s, 0.5s, 1s, 2s).
When a request finishes in 0.7 seconds, the Histogram updates *three* counters:
- The counter for `le="1.0"` (less than or equal to 1.0s) increments.
- The counter for `le="2.0"` increments.
- The counter for `le="+Inf"` (infinity) increments.
This allows you to calculate **percentiles** (e.g., "95% of my users had a response time under 1s").
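A minimal sketch of this Histogram with the Python client, using the buckets from the example above; `handle_request()` is a hypothetical stand-in for your real handler:

```python
import time
from prometheus_client import Histogram

# Buckets match the example above: 0.1s, 0.5s, 1s, 2s.
# The client adds the le="+Inf" bucket automatically.
request_duration = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    buckets=[0.1, 0.5, 1.0, 2.0],
)

# Record one observation: a request that took 0.7 seconds.
# This increments the le="1.0", le="2.0", and le="+Inf" bucket counters.
request_duration.observe(0.7)

def handle_request():
    time.sleep(0.2)  # hypothetical stand-in for real work

# Or time a block of code automatically:
with request_duration.time():
    handle_request()
```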
4. Summary (Also for Latency)
A **Summary** is similar to a Histogram, but it calculates the percentiles (quantiles) on the *client-side* (in your app) before sending them to Prometheus. This is less common and harder to aggregate. **Rule of thumb: Always use a Histogram for latency.**
Part 3: Installing & Configuring Prometheus
The easiest way to run Prometheus (and all its components) is with Docker.
Step 1: Create a `prometheus.yml` Configuration
Prometheus is configured using a YAML file. Create a folder (e.g., `my-prometheus-project`) and inside it, create a file named `prometheus.yml`.
```yaml
# 'global' block: settings that apply to all jobs
global:
  # How often to scrape targets (default is 1 minute)
  scrape_interval: 15s
  # How often to evaluate alerting rules
  evaluation_interval: 15s

# 'alerting' block: Tell Prometheus where Alertmanager is
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093' # We'll use this hostname later

# 'rule_files' block: Where to find our custom alerts
rule_files:
  - "alert.rules.yml" # We will create this file

# 'scrape_configs' block: *What* to monitor
scrape_configs:
  # Job 1: Monitor Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"] # Prometheus scrapes itself

  # Job 2: Monitor our Linux server (Node Exporter)
  # We will add this in the next section
  # - job_name: "node"
  #   static_configs:
  #     - targets: ["node-exporter:9100"]
```
Step 2: Run Prometheus in Docker
Now, run this command from your terminal in the *same folder* as your `prometheus.yml` file.
```bash
docker run -d \
  -p 9090:9090 \
  --name prometheus \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
```
- `-d`: Run in detached (background) mode.
- `-p 9090:9090`: Map your laptop's port 9090 to the container's port 9090.
- `--name prometheus`: Name the container.
- `-v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml`: This is the **volume mount**. It mounts your local config file *into* the container, overwriting the default one.
- `prom/prometheus`: The official Docker image.
You can now open `http://localhost:9090` in your browser! You'll see the Prometheus dashboard. Try typing `up` in the query bar and clicking "Execute." It should show `up{job="prometheus"} 1`, meaning it is successfully scraping itself.
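If you'd rather see the raw scrape data than the UI, you can also fetch Prometheus's own `/metrics` page directly (the same endpoint its "prometheus" job scrapes). This is just a sanity check, not a required step:

```bash
curl http://localhost:9090/metrics | head -n 20
```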
Part 4: Deep Dive - PromQL (The Query Language)
Prometheus is not a SQL database. You cannot `SELECT * FROM ...`. It has its own powerful query language called **PromQL (Prometheus Query Language)**. It's designed to slice and dice time-series data.
1. Selectors & Time Series
A "time series" is a metric name + a set of labels (key-value pairs).
```promql
# This is a time series
http_requests_total{method="GET", path="/api/login", status="200"}
```
You select data by its name and labels:
```promql
# Select all time series with this metric name
http_requests_total

# Select only the ones for the "api" job
http_requests_total{job="api"}

# Select only "POST" requests to the "/api/login" path
http_requests_total{method="POST", path="/api/login"}
```
2. Instant vs. Range Vectors
- **Instant Vector:** The *single* latest value for each time series. (e.g., `cpu_usage`).
- **Range Vector:** A *range* of historical values. You specify the time window in `[]`.
  - `http_requests_total[5m]` - "Give me all the values for this metric from the last 5 minutes."
You *cannot* graph a Range Vector directly. You must use a function to process it.
3. The Most Important Function: `rate()`
As we said, Counters (like `http_requests_total`) just go up, which is useless to graph directly. We want the *per-second rate of change*. The `rate()` function does this.
# "Calculate the per-second rate of http_requests_total,
# averaged over the last 5 minutes"
rate(http_requests_total[5m])
This will return an Instant Vector showing the "requests per second" for each time series. **This is what you graph.**
4. Aggregation Operators (`sum`, `avg`, `topk`)
`rate()` gives you the rate for *every* combination of labels. You usually want to aggregate them.
```promql
# Get the *total* requests per second for *all* jobs
sum(rate(http_requests_total[5m]))

# Get the *average* requests per second
avg(rate(http_requests_total[5m]))

# Show the top 5 time series by current memory usage
topk(5, memory_usage_bytes)
```
5. Grouping with `by` and `without`
The `by` clause lets you aggregate *by* a specific label.
```promql
# Don't give me one total sum.
# Give me the total sum *for each job*.
sum(rate(http_requests_total[5m])) by (job)

# Give me the total sum, but preserve the 'job' and 'method' labels
sum(rate(http_requests_total[5m])) by (job, method)
```
6. Using Histograms: `histogram_quantile()`
This is the most complex, but most important, query for measuring performance (latency). You *cannot* just `avg()` a histogram. You must use this function.
```promql
# Calculate the 95th percentile (0.95) latency
# for the 'http_request_duration_seconds' metric.
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
```
This query will give you a single number (in seconds) that answers: "95% of all my users had a response time *faster* than this." This is how you measure your app's true performance.
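The `_bucket` suffix in that query isn't a typo. Under the hood, the Histogram we defined in Part 2 is exposed as several separate time series, which is why the quantile query sums the `..._bucket` series by their `le` label. The values below are illustrative:

```text
http_request_duration_seconds_bucket{le="0.1"}   490
http_request_duration_seconds_bucket{le="0.5"}   920
http_request_duration_seconds_bucket{le="1.0"}   985
http_request_duration_seconds_bucket{le="2.0"}   998
http_request_duration_seconds_bucket{le="+Inf"}  1000
http_request_duration_seconds_sum                312.5
http_request_duration_seconds_count              1000
```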
Part 5: Practical Monitoring - `node_exporter`
Let's monitor a real server. **Node Exporter** is an official exporter from Prometheus that exposes 1000s of metrics about a Linux server (CPU, RAM, Disk, Network).
Step 1: Run `node_exporter` in Docker
Run this command on the server you want to monitor.
```bash
docker run -d \
  --name=node-exporter \
  -p 9100:9100 \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host
```
This is a complex command: it gives the container read-only access to the host's filesystem (mounted at `/host`) and runs it in the host's network and PID namespaces so it can read real system stats. (Because of `--net="host"`, the `-p 9100:9100` mapping is effectively redundant; the container already shares the host's ports.) It will expose a metrics endpoint at `http://[SERVER_IP]:9100/metrics`.
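If you open that URL in a browser (or `curl` it), you'll see thousands of lines. A few representative ones look like this; the numbers are made up, but these are the exact metric names we'll query in Step 3:

```text
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 183829.04
node_cpu_seconds_total{cpu="0",mode="user"} 2412.77

# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 6.442450944e+09

# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 3.2e+10
```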
Step 2: Add to `prometheus.yml`
Now, go back to your `prometheus.yml` file and add a new job. (Assume your server's private IP is `192.168.1.50`.)
```yaml
scrape_configs:
  # ... (prometheus job) ...

  # NEW JOB: Monitor our Linux server
  - job_name: "node-exporter-job"
    static_configs:
      - targets: ["192.168.1.50:9100"]
```
Restart your Prometheus container (`docker restart prometheus`). It will now scrape your server's health every 15 seconds.
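As a quick check, re-run the `up` query from earlier in the Prometheus UI, filtered to the job name we just configured. A value of `1` means the last scrape of node_exporter succeeded; `0` means Prometheus couldn't reach it:

```promql
up{job="node-exporter-job"}
```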
Step 3: Example `node_exporter` Queries
```promql
# Get CPU usage percentage per instance (averaged across all cores).
# (This is complex because we want the INVERSE of the 'idle' time.)
100 - (
  avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by (instance) * 100
)

# Get available memory in bytes
node_memory_MemAvailable_bytes

# Get free disk space percentage
(node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
```
As you can see, these queries are complex. This is why we use Grafana.
Part 6: Visualization with Grafana
Grafana is our dashboard. It connects to Prometheus (as a "Data Source") and runs those complex PromQL queries for us, displaying the results in beautiful graphs.
Step 1: Run Grafana in Docker
```bash
docker run -d \
  -p 3000:3000 \
  --name=grafana \
  grafana/grafana-oss:latest
```
Step 2: Setup Grafana
- Open `http://localhost:3000` in your browser.
- Default login is: **User:** `admin`, **Password:** `admin`. (It will ask you to change this.)
Step 3: Add Prometheus as a Data Source
- Click the "Settings" cog (gear) icon on the left.
- Click "Data Sources."
- Click "Add data source."
- Select **"Prometheus"**.
- For the **URL**, enter your Prometheus server's URL. **IMPORTANT:** Since both are in Docker, `localhost` won't work. You must use your computer's *private* IP (e.g., `http://192.168.1.10:9090`).
- Click "Save & Test". It should say "Data source is working."
Step 4: Import a Pre-built Dashboard
You *can* build your own dashboard, but the community has already built amazing ones. Let's import the official dashboard for `node_exporter`.
- Go to the public Grafana Dashboards site (grafana.com/grafana/dashboards).
- Search for "Node Exporter Full" (a popular one is ID `1860`).
- Copy the **Dashboard ID** (e.g., `1860`).
- In Grafana, click the "+" (Create) icon on the left.
- Click **"Import"**.
- Paste the ID `1860` into the box and click "Load".
- On the next screen, select *your* Prometheus data source from the dropdown.
- Click "Import".
**Done!** You will instantly see a complete, professional dashboard with all your server's CPU, RAM, Disk, and Network stats, all updating in real-time. This is the power of the Prometheus + Grafana stack.
- Read the Official Prometheus Docs →
- Read the Official Grafana Docs →