Introduction
Monitoring is essential for reliable infrastructure. This tutorial walks you through setting up the industry-standard monitoring stack: Prometheus for metrics collection and storage, and Grafana for visualization and alerting. You'll monitor servers, containers, databases, and applications in under an hour.
Understanding Prometheus Architecture
Prometheus uses a pull-based model: it scrapes metrics from HTTP endpoints (targets) at configured intervals. Core components: the Prometheus server (collects, stores, and queries metrics), Alertmanager (handles alert routing and notification), exporters (expose metrics for third-party systems), and the Pushgateway (for short-lived batch jobs). Metrics are stored as time series: a metric name plus key-value labels identifies each series, and each sample is a timestamp-value pair. PromQL, the query language, supports powerful analysis and aggregation.
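Each scrape returns plain text in the Prometheus exposition format; the metric name, labels, and values below are illustrative:

```
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api"} 1027
http_requests_total{method="POST",endpoint="/api"} 3
```

Prometheus timestamps each scraped value, building one time series per unique name-and-label combination.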
Installing Prometheus
Download Prometheus from prometheus.io (or the GitHub releases page): wget https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz, then tar xvf prometheus-*.tar.gz and cd prometheus-*. Edit prometheus.yml: under global, set scrape_interval: 15s and evaluation_interval: 15s; under scrape_configs, add job_name: "prometheus" with static_configs targeting ["localhost:9090"] so Prometheus scrapes its own metrics. Start it with ./prometheus --config.file=prometheus.yml and open the web UI at http://localhost:9090. For production, create a systemd unit at /etc/systemd/system/prometheus.service with an ExecStart line pointing at the binary and a dedicated User=prometheus and Group=prometheus.
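Assembled from the settings above, a minimal self-scraping prometheus.yml looks like this:

```yaml
global:
  scrape_interval: 15s       # how often to scrape targets
  evaluation_interval: 15s   # how often to evaluate alerting/recording rules

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]   # Prometheus scrapes itself
```

After editing, validate with promtool check config prometheus.yml before restarting.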
Node Exporter: Server Metrics
Node exporter collects hardware and OS metrics (CPU, memory, disk, network). Install: wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz, tar xvf node_exporter-*.tar.gz, then run ./node_exporter. It listens on port 9100; verify with curl http://localhost:9100/metrics. Add a scrape job to Prometheus: job_name: "node_exporter" with static_configs targeting ["server1:9100", "server2:9100", "server3:9100"]. Attach labels per target group, e.g. instance: "web-server-1" and environment: "production", so queries and alerts are easy to filter. For large or changing fleets, use file_sd_configs with JSON target files for dynamic discovery; Prometheus reloads them without a restart.
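The scrape job described above, with per-target labels and a file-based discovery variant, might be sketched like this (hostnames and the targets path are placeholders):

```yaml
scrape_configs:
  - job_name: "node_exporter"
    static_configs:
      - targets: ["server1:9100"]
        labels:
          instance: "web-server-1"    # overrides the auto-generated instance label
          environment: "production"
      - targets: ["server2:9100", "server3:9100"]

  # Dynamic discovery: targets come from files, re-read on change
  - job_name: "node_dynamic"
    file_sd_configs:
      - files: ["/etc/prometheus/targets/*.json"]
```

Each JSON file holds entries of the form {"targets": [...], "labels": {...}}.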
cAdvisor: Container Metrics
cAdvisor (Container Advisor), from Google, monitors Docker containers. Run it as a container: docker run -d --name=cadvisor --restart=always --volume=/:/rootfs:ro --volume=/var/run:/var/run:ro --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --publish=8080:8080 gcr.io/cadvisor/cadvisor:latest. It exposes metrics at http://localhost:8080/metrics. Add a scrape job (job_name: "cadvisor") targeting port 8080 on each Docker host, and Prometheus will aggregate per-container CPU, memory, network, and filesystem metrics across the fleet.
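A scrape job for cAdvisor, assuming one instance published on port 8080 per Docker host (hostnames are placeholders):

```yaml
scrape_configs:
  - job_name: "cadvisor"
    static_configs:
      - targets: ["node1:8080", "node2:8080"]   # one cAdvisor per Docker host
```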
Database Exporters
MySQL: run the exporter alongside MySQL (or on a separate host): docker run -d --name mysql_exporter -p 9104:9104 -e DATA_SOURCE_NAME="user:password@(mysql-host:3306)/" prom/mysqld-exporter. PostgreSQL: download postgres_exporter from github.com/prometheus-community/postgres_exporter and set DATA_SOURCE_NAME="postgresql://user:password@localhost:5432/db?sslmode=disable". MongoDB and Redis exporters are similarly available. Each exposes database-specific metrics: query counts, connection pools, replication lag, buffer pool usage, and cache hit ratios.
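Scrape jobs for the two exporters above; 9104 is the mysqld-exporter port used above, and 9187 is the default postgres_exporter port:

```yaml
scrape_configs:
  - job_name: "mysql"
    static_configs:
      - targets: ["localhost:9104"]   # mysqld-exporter
  - job_name: "postgres"
    static_configs:
      - targets: ["localhost:9187"]   # postgres_exporter
```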
Blackbox Exporter: Endpoint Monitoring
Blackbox exporter probes endpoints over HTTP, HTTPS, DNS, TCP, and ICMP. Install: wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz. Define probe modules in blackbox.yml, e.g. an http_2xx module with prober: http, timeout: 5s, and http settings valid_status_codes: [200] and method: GET. In Prometheus, add a job with metrics_path: /probe and params: module: [http_2xx]; list the URLs to probe as static targets, then use relabel_configs to move each target URL into the __param_target query parameter and point __address__ at the exporter itself (blackbox-exporter:9115). It measures response time, SSL certificate expiry, HTTP status, and DNS resolution time.
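A sketch of the blackbox scrape job with the standard relabeling chain; the middle rule, which copies the probed URL into the instance label, is a common convention not spelled out above:

```yaml
scrape_configs:
  - job_name: "blackbox"
    metrics_path: /probe
    params:
      module: [http_2xx]            # which blackbox.yml module to run
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
    relabel_configs:
      # Pass the target URL to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label on results
      - source_labels: [__param_target]
        target_label: instance
      # Actually scrape the blackbox exporter, not the URL
      - target_label: __address__
        replacement: blackbox-exporter:9115
```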
PromQL: Query Language Basics
PromQL fundamentals: instant queries return a current value; range queries return values over a time window for trend analysis. Basic selectors: node_cpu_seconds_total, node_memory_MemAvailable_bytes, rate(node_network_receive_bytes_total[5m]). Filter with labels: node_cpu_seconds_total{mode="user"}, node_memory_MemAvailable_bytes{instance="server1"}. Operators: arithmetic (+, -, *, /), comparison (>, <, ==), logical (and, or, unless). Aggregation operators: sum(), avg(), max(), min(), count(), quantile(). Example: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) gives average CPU utilization. Rate functions are critical for counters: rate() for the per-second average over a window, irate() for the instantaneous rate from the last two samples, increase() for the total increase over a window.
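A few more expressions in the same vein, combining the operators above (metric names are standard node_exporter series; PromQL supports # line comments):

```promql
# Total inbound network throughput across the fleet (bytes/sec)
sum(rate(node_network_receive_bytes_total[5m]))

# The 5 instances with the highest CPU utilization (%)
topk(5, 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Bytes received over the last hour, per instance
sum by (instance) (increase(node_network_receive_bytes_total[1h]))
```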
Alerting with Alertmanager
Alertmanager handles deduplication, grouping, routing, silencing, and inhibition. Install: wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz. In alertmanager.yml, configure routing: route: group_by: ['alertname', 'cluster'], group_wait: 10s, group_interval: 10s, repeat_interval: 1h; then define receivers, e.g. an 'email' receiver with email_configs (to: 'team@example.com') and a 'slack' receiver with slack_configs (channel: '#alerts'). Alerting rules live on the Prometheus side: a HighCPUUsage alert fires when (100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80 holds for 5m, carries a severity: critical label, and uses templated annotations such as summary: "High CPU on {{ $labels.instance }}" and description: "CPU usage is {{ $value }}%".
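The HighCPUUsage rule above as a Prometheus rule file, loaded via rule_files in prometheus.yml; the by (instance) clause, not in the original expression, makes the alert fire per host rather than on the fleet average:

```yaml
groups:
  - name: instance
    rules:
      - alert: HighCPUUsage
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
        for: 5m                       # must hold this long before firing
        labels:
          severity: critical
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"
```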
Installing and Configuring Grafana
Install Grafana: sudo apt-get install -y adduser libfontconfig1 musl, wget https://dl.grafana.com/enterprise/release/grafana-enterprise_10.4.0_amd64.deb, sudo dpkg -i grafana-enterprise_10.4.0_amd64.deb, sudo systemctl enable grafana-server, sudo systemctl start grafana-server. Access http://localhost:3000 (default credentials admin/admin; change the password on first login). Add a Prometheus data source: URL http://prometheus:9090, access mode Server (the default proxy mode), with basic auth enabled if needed. Import dashboards from the Grafana Labs dashboard hub (ID 1860 for Node Exporter Full, ID 179 for Docker monitoring, ID 7362 for MySQL). Create custom dashboards with template variables (instance, environment) for dynamic filtering.
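Instead of clicking through the UI, the data source can be provisioned from a file; a minimal sketch of Grafana's provisioning format (the path is the Debian-package default):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana server proxies the queries
    url: http://prometheus:9090
    isDefault: true
```

Grafana reads provisioning files at startup, so restart grafana-server after adding one.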
Creating Effective Dashboards
Dashboard design best practices: give each dashboard a single service or use case; default to 12- or 24-hour time ranges; group rows by theme (System Overview, Web Performance, Database, Business Metrics). Panel types: time series (graphs), stat (single numbers), gauge (utilization), table (tabular data), heatmap (distributions over time). Example panel queries: "CPU Usage" - avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance); "Memory Usage" - (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100; "Disk Usage" - 100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100). Use annotations to mark deployments, alerts, or code pushes on graphs. Grafana can also evaluate its own alert rules on panel thresholds, with notification channels for email, Slack, PagerDuty, and webhooks.
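Template variables are defined in the dashboard settings as Query variables against the Prometheus data source; label_values is a Grafana templating helper. The environment query assumes an environment label was attached at scrape time, as in the node exporter section:

```
instance:    label_values(node_cpu_seconds_total, instance)
environment: label_values(node_cpu_seconds_total, environment)
```

Panel queries then filter with selectors such as {instance=~"$instance"}, so one dashboard serves every host.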
Advanced: Service Discovery and Relabeling
Service discovery finds targets dynamically, without static configs. Kubernetes: kubernetes_sd_configs with role: pod, then relabel_configs to filter and decorate targets: keep only pods whose app label matches "myapp" (source_labels: [__meta_kubernetes_pod_label_app], action: keep, regex: myapp), keep only the expected container port (regex: "8080"), and copy the namespace into a label (source_labels: [__meta_kubernetes_namespace], target_label: namespace; relabeling copies values via source_labels, not shell-style ${} substitution). EC2: ec2_sd_configs with a region and credentials, plus filters on instance tags and a fixed scrape port such as 9100. Consul, DNS, GCE, Azure, and DigitalOcean are similarly supported. Relabeling modifies labels before ingestion: derive environment (production, staging) from the instance name, drop internal metrics, rename labels for consistency, or set the job label from metadata.
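The Kubernetes pod-discovery example above, assembled into one scrape job (the app label value and port are illustrative):

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labeled app=myapp
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: myapp
      # Keep only the container port expected to expose metrics
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        action: keep
        regex: "8080"
      # Copy the pod's namespace into a queryable label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```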
Long-term Storage and High Availability
Prometheus local storage is intentionally limited (15-day retention by default). For long-term storage: Thanos (global query view, object-store-backed retention, downsampling), Cortex (multi-tenant, horizontally scalable), VictoriaMetrics (performance-optimized, lower resource usage), Mimir (Grafana Labs' horizontally scalable successor to Cortex). Run Prometheus as an HA pair: two instances scrape the same targets, and Alertmanager deduplicates the resulting alerts. Keep labels consistent across both instances (e.g. a shared cluster label, plus a replica label that downstream systems can drop when deduplicating). Federation supports hierarchical setups: a central Prometheus scrapes selected series from multiple leaf Prometheus servers.
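A sketch of a federation job on the central Prometheus; the match[] selector and leaf hostnames are illustrative:

```yaml
scrape_configs:
  - job_name: "federate"
    honor_labels: true              # keep the leaf's job/instance labels
    metrics_path: /federate
    params:
      "match[]":
        - '{job="node_exporter"}'   # which series to pull from each leaf
    static_configs:
      - targets: ["leaf-prom-1:9090", "leaf-prom-2:9090"]
```

Federate only aggregated or selected series; pulling every raw series from each leaf defeats the purpose of the hierarchy.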
Monitoring Applications with Client Libraries
Instrument applications with the Prometheus client libraries. Python: from prometheus_client import Counter, Histogram, Gauge, start_http_server; define REQUEST_COUNT = Counter("http_requests_total", "Total HTTP requests", ["method", "endpoint"]) and REQUEST_DURATION = Histogram("http_request_duration_seconds", "Request duration", ["method"]); increment with REQUEST_COUNT.labels(method="GET", endpoint="/api").inc(); wrap functions with decorators (or context managers) for automatic timing; expose the /metrics endpoint with start_http_server(8000). Node.js: the prom-client package. Java: Micrometer or simpleclient. Rails: prometheus-client-mmap. Beyond system metrics, export business metrics: active users, orders per minute, queue depth, cache hit ratio, error rates by type.
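The Python instrumentation above, assembled into a minimal runnable sketch with the official prometheus_client library; the metric names and label values are illustrative, and an isolated registry is used instead of the library's global default:

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

registry = CollectorRegistry()  # isolated registry for this example

REQUEST_COUNT = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "endpoint"], registry=registry,
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds", "Request duration",
    ["method"], registry=registry,
)

def handle_request() -> None:
    # .time() returns a context manager that observes elapsed seconds
    with REQUEST_DURATION.labels(method="GET").time():
        REQUEST_COUNT.labels(method="GET", endpoint="/api").inc()

handle_request()

# generate_latest renders the registry in the text exposition format;
# a real app would serve this from /metrics, e.g. via start_http_server(8000)
output = generate_latest(registry).decode()
print(output)
```

The same pattern applies to Gauge for values that go up and down (queue depth, active users).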
Conclusion
Prometheus and Grafana together provide a complete monitoring solution. Start with node_exporter and basic dashboards, add exporters for critical services (databases, load balancers), implement alerting for production, and gradually adopt application-level metrics. Regular dashboard reviews and alert-threshold tuning keep paging effective over time.