Building a Complete Monitoring Stack with Grafana Alloy: From Zero to Production

Khimananda Oli

Introduction

Managing observability across a growing infrastructure is challenging. The traditional approach involves deploying multiple agents per server:

  • Prometheus for metrics collection
  • Node Exporter for host metrics
  • MySQL Exporter for database metrics
  • Apache Exporter for web server metrics
  • Promtail for log shipping
  • OpenTelemetry Collector for traces

That's potentially 5-6 different agents per server, each with its own configuration, update cycle, and failure modes.

Grafana Alloy changes this paradigm by providing a single, unified agent that handles metrics, logs, and traces collection. In this guide, I'll walk you through building a production-ready monitoring stack from scratch.

Why Grafana Alloy?

Grafana Alloy (formerly Grafana Agent) is a vendor-agnostic OpenTelemetry Collector distribution with programmable pipelines. Here's what it replaces:

| Traditional Component | Alloy Equivalent |
| --- | --- |
| Prometheus (scraping only) | Built-in scraping |
| Node Exporter | prometheus.exporter.unix |
| MySQL Exporter | prometheus.exporter.mysql |
| Apache Exporter | prometheus.exporter.apache |
| Redis Exporter | prometheus.exporter.redis |
| Promtail | loki.source.file, loki.source.docker |
| OpenTelemetry Collector | Native OTLP support |

Key Benefits

  1. Single Binary: One agent to deploy, configure, and maintain
  2. Alloy Configuration Syntax (formerly River): Intuitive, declarative configuration language
  3. Built-in Service Discovery: Automatic target discovery for Kubernetes, Docker, EC2, etc.
  4. Lower Resource Footprint: Optimized for edge deployment
  5. Native Remote Write: Push metrics directly to Prometheus, Mimir, or Grafana Cloud
  6. Programmable Pipelines: Transform, filter, and route telemetry data

When NOT to Use Alloy

  • You need local PromQL querying (use Prometheus directly)
  • You're already running a well-optimized stack with no issues
  • You need exporters that Alloy doesn't have built-in (though you can still scrape external exporters)

Data Flow

  1. Alloy runs on each target server
  2. Metrics are scraped locally and remote-written to Prometheus
  3. Logs are tailed and pushed to Loki
  4. Grafana queries both Prometheus and Loki for visualization

Prerequisites

Monitoring Server

  • Ubuntu 22.04/24.04 LTS
  • Docker and Docker Compose installed
  • Minimum 2 CPU, 4GB RAM (for small deployments)
  • 50GB+ disk for metrics and logs storage

Target Servers

  • Ubuntu 20.04/22.04/24.04 LTS
  • Network access to monitoring server (ports 9090, 3100)
  • Root or sudo access

Network Requirements

| Source | Destination | Port | Protocol | Purpose |
| --- | --- | --- | --- | --- |
| Alloy agents | Prometheus | 9090 | TCP | Metrics remote write |
| Alloy agents | Loki | 3100 | TCP | Log push |
| Admin | Grafana | 3000 | TCP | Web UI |
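
Before installing any agents, it can help to confirm from a target server that the Loki port is reachable and accepts writes. The following is a minimal sketch that builds a one-line request for Loki's push endpoint; `loki_test_payload` and `push_test_log` are hypothetical helper names, the `connectivity-test` job label is made up, and the message is not JSON-escaped, so keep it free of quotes and backslashes.

```shell
# loki_test_payload JOB MESSAGE
# Builds a push request body with a single log line. Loki expects the
# timestamp as a string of nanoseconds since the epoch.
loki_test_payload() {
  ts="$(date +%s)000000000"
  printf '{"streams":[{"stream":{"job":"%s"},"values":[["%s","%s"]]}]}\n' "$1" "$ts" "$2"
}

# push_test_log LOKI_BASE_URL
# Sends the test line; curl -f makes any HTTP error a non-zero exit status.
push_test_log() {
  loki_test_payload connectivity-test "hello from $(hostname)" |
    curl -fsS -H 'Content-Type: application/json' -X POST --data-binary @- \
      "$1/loki/api/v1/push"
}

# Example (not run here): push_test_log http://192.168.0.23:3100
```

A successful push returns no output; the line should then be searchable in Grafana under the `connectivity-test` job. For the metrics port, `curl -s http://192.168.0.23:9090/-/ready` is enough to confirm reachability.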

Part 1: Setting Up the Central Monitoring Server

1.1 Create Directory Structure

```
mkdir -p /opt/monitoring/{prometheus,loki,grafana/provisioning/datasources,alloy}
cd /opt/monitoring
```

1.2 Docker Compose Configuration

Create docker-compose.yml:
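
One possible layout for this file, offered as a minimal sketch rather than the author's exact listing. It is pieced together from the ports, flags, paths, and credentials used elsewhere in this guide. The image names with `:latest` tags, the named volumes, and the choice to mount the whole `prometheus` directory (so that additional files such as a rules file are visible to the container) are assumptions; pin image versions you have tested before relying on it. Run it from /opt/monitoring.

```shell
# Sketch only: images, tags, and volume names below are assumptions.
cat > docker-compose.yml <<'EOF'
services:
  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-remote-write-receiver'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=40GB'

  loki:
    image: grafana/loki:latest
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml:ro
      - loki_data:/loki
    command: -config.file=/etc/loki/loki-config.yml

  grafana:
    image: grafana/grafana:latest
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
      - loki

volumes:
  prometheus_data:
  loki_data:
  grafana_data:
EOF
```

The service names `prometheus` and `loki` matter: the Grafana datasource URLs in section 1.5 resolve them over the Compose network.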

1.3 Prometheus Configuration

Create prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: []

rule_files: []

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Key settings (Prometheus command-line flags, set in docker-compose.yml):

  • --web.enable-remote-write-receiver: Allows Alloy to push metrics
  • --storage.tsdb.retention.time=30d: Keep 30 days of data
  • --storage.tsdb.retention.size=40GB: Cap storage at 40GB

1.4 Loki Configuration

Create loki/loki-config.yml:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

limits_config:
  retention_period: 30d
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_streams_per_user: 10000
  max_line_size: 256kb

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  delete_request_store: filesystem

analytics:
  reporting_enabled: false

Important: The delete_request_store: filesystem line is required when retention_enabled: true. Without it, Loki will fail to start.
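
Since this mistake only surfaces when the container starts, a quick pre-flight check can catch it earlier. This is a minimal sketch that greps the file as plain text instead of parsing YAML, so it assumes the one-key-per-line layout shown above; `check_loki_retention` is a hypothetical helper name.

```shell
# check_loki_retention FILE
# Succeeds when retention is disabled, or when it is enabled and a
# delete_request_store value is also set. Plain-text check, not a YAML parser.
check_loki_retention() {
  config_file=$1
  if grep -Eq '^[[:space:]]*retention_enabled:[[:space:]]*true' "$config_file"; then
    if ! grep -Eq '^[[:space:]]*delete_request_store:[[:space:]]*[^[:space:]#]+' "$config_file"; then
      echo "retention_enabled is true but delete_request_store is missing" >&2
      return 1
    fi
  fi
  return 0
}

# Example (not run here): check_loki_retention loki/loki-config.yml
```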

1.5 Grafana Datasource Provisioning

Create grafana/provisioning/datasources/datasources.yml:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: false
    jsonData:
      maxLines: 1000

1.6 Start the Stack

cd /opt/monitoring
docker compose up -d

1.7 Verify Installation

# Check all containers are running
docker compose ps

# Test Prometheus
curl -s http://localhost:9090/-/ready
# Expected: Prometheus Server is Ready.

# Test Loki (may take 15-30 seconds on first start)
curl -s http://localhost:3100/ready
# Expected: ready

# Test Grafana
curl -s http://localhost:3000/api/health
# Expected: {"commit":"...","database":"ok","version":"..."}
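
Because Loki can take a while to report ready on first start, a single curl may fail even when nothing is wrong. A small retry loop avoids the false alarm; this is a sketch, with `wait_ready` as a hypothetical helper name and the default attempt count and delay chosen arbitrarily.

```shell
# wait_ready URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until it answers with a success status. curl -f treats HTTP
# errors (such as 503 while a service is starting) as failures.
wait_ready() {
  url=$1
  attempts=${2:-12}
  delay=${3:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS -o /dev/null "$url"; then
      echo "ready: $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts: $url" >&2
  return 1
}

# Example (not run here):
#   wait_ready http://localhost:9090/-/ready
#   wait_ready http://localhost:3100/ready
#   wait_ready http://localhost:3000/api/health
```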

Access Grafana at http://YOUR_SERVER_IP:3000 with admin/changeme.

Part 2: Monitoring MySQL Database Servers

2.1 Install Alloy

Run on each MySQL server:

# Add Grafana repository
curl -fsSL https://apt.grafana.com/gpg.key | gpg --dearmor -o /usr/share/keyrings/grafana.gpg
echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://apt.grafana.com stable main" | tee /etc/apt/sources.list.d/grafana.list

# Install Alloy
apt update && apt install alloy -y

2.2 Create MySQL Monitoring User

Connect to MySQL and create a dedicated monitoring user:

CREATE USER 'alloy'@'localhost' IDENTIFIED BY 'your_secure_password_here';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'alloy'@'localhost';
FLUSH PRIVILEGES;

Permissions explained:

  • PROCESS: View running queries and connections
  • REPLICATION CLIENT: View replication status
  • SELECT: Read table statistics

2.3 Configure Alloy

Create /etc/alloy/config.alloy:

// =============================================================================
// ALLOY CONFIGURATION FOR MYSQL SERVER
// Server: mysql-01 (change for each server)
// =============================================================================

// -----------------------------------------------------------------------------
// NODE/HOST METRICS
// Replaces: node_exporter
// -----------------------------------------------------------------------------
prometheus.exporter.unix "node" { }

prometheus.scrape "node" {
  targets    = prometheus.exporter.unix.node.targets
  forward_to = [prometheus.relabel.add_labels.receiver]
  
  scrape_interval = "15s"
}

// -----------------------------------------------------------------------------
// MYSQL METRICS
// Replaces: mysqld_exporter
// -----------------------------------------------------------------------------
prometheus.exporter.mysql "database" {
  data_source_name = "alloy:your_secure_password_here@(localhost:3306)/"
}

prometheus.scrape "mysql" {
  targets    = prometheus.exporter.mysql.database.targets
  forward_to = [prometheus.relabel.add_labels.receiver]
  
  scrape_interval = "15s"
}

// -----------------------------------------------------------------------------
// LABELS
// Add consistent labels to all metrics
// -----------------------------------------------------------------------------
prometheus.relabel "add_labels" {
  rule {
    action       = "replace"
    target_label = "server"
    replacement  = "mysql-01"  // CHANGE THIS FOR EACH SERVER
  }
  
  rule {
    action       = "replace"
    target_label = "environment"
    replacement  = "production"
  }
  
  forward_to = [prometheus.remote_write.default.receiver]
}

// -----------------------------------------------------------------------------
// REMOTE WRITE TO PROMETHEUS
// -----------------------------------------------------------------------------
prometheus.remote_write "default" {
  endpoint {
    url = "http://192.168.0.23:9090/api/v1/write"
    
    queue_config {
      max_samples_per_send = 1000
      batch_send_deadline  = "5s"
      min_backoff          = "30ms"
      max_backoff          = "5s"
    }
  }
}

// -----------------------------------------------------------------------------
// LOG COLLECTION
// Replaces: promtail
// -----------------------------------------------------------------------------
loki.source.file "mysql_logs" {
  targets = [
    {
      __path__  = "/var/log/mysql/error.log",
      job       = "mysql-error",
      server    = "mysql-01",
      component = "mysql",
    },
    {
      __path__  = "/var/log/mysql/mysql-slow.log",
      job       = "mysql-slow",
      server    = "mysql-01",
      component = "mysql",
    },
  ]
  forward_to = [loki.write.default.receiver]
}

loki.source.file "system_logs" {
  targets = [
    {
      __path__ = "/var/log/syslog",
      job      = "syslog",
      server   = "mysql-01",
    },
    {
      __path__ = "/var/log/auth.log",
      job      = "authlog",
      server   = "mysql-01",
    },
  ]
  forward_to = [loki.write.default.receiver]
}

// -----------------------------------------------------------------------------
// LOKI WRITE
// -----------------------------------------------------------------------------
loki.write "default" {
  endpoint {
    url = "http://192.168.0.23:3100/loki/api/v1/push"
    
    batch_wait = "1s"
    batch_size = "1MiB"
  }
}
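
Only the server name differs between MySQL hosts, so the file above can double as a template. A minimal sketch, assuming the template keeps `mysql-01` as its placeholder everywhere the name appears; `render_alloy_config` is a hypothetical helper, not part of Alloy.

```shell
# render_alloy_config TEMPLATE SERVER_NAME
# Prints TEMPLATE with every "mysql-01" replaced by SERVER_NAME.
# The name is restricted to hostname characters so it is safe to pass to sed.
render_alloy_config() {
  template=$1
  server_name=$2
  case $server_name in
    ''|*[!A-Za-z0-9.-]*)
      echo "invalid server name: $server_name" >&2
      return 1
      ;;
  esac
  sed "s/mysql-01/$server_name/g" "$template"
}

# Example (not run here):
#   render_alloy_config config.alloy.tmpl mysql-02 > /etc/alloy/config.alloy
```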

2.4 Configure Permissions

# Allow Alloy to read log files
usermod -aG adm alloy
usermod -aG mysql alloy

# Verify permissions
su - alloy -s /bin/bash -c "head -n 1 /var/log/mysql/error.log"

2.5 Start Alloy

systemctl enable alloy
systemctl start alloy

# Check status
systemctl status alloy

# View logs
journalctl -u alloy -f --no-pager

2.6 Verify Data Collection

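# Check the MySQL exporter is producing metrics (built-in exporters are served
# under their component path, not Alloy's own /metrics endpoint)
curl -s http://localhost:12345/api/v0/component/prometheus.exporter.mysql.database/metrics | grep mysql_up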

# Check data is reaching Prometheus (run from monitoring server)
curl -s 'http://192.168.0.23:9090/api/v1/query?query=mysql_up' | jq '.data.result[].metric.server'

Part 3: Monitoring Apache + PHP-FPM Web Servers

3.1 Install Alloy

curl -fsSL https://apt.grafana.com/gpg.key | gpg --dearmor -o /usr/share/keyrings/grafana.gpg
echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://apt.grafana.com stable main" | tee /etc/apt/sources.list.d/grafana.list
apt update && apt install alloy -y

3.2 Enable Apache Server Status

Enable the status module:

a2enmod status

Create a dedicated vhost for status endpoints. This is important if you're running Laravel or any framework that catches all routes.

Create /etc/apache2/sites-available/000-localhost-status.conf:


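<VirtualHost *:80>
    ServerName 127.0.0.1

    # Apache server status
    <Location /server-status>
        SetHandler server-status
        Require local
    </Location>

    # PHP-FPM status (optional)
    <Location /status>
        SetHandler "proxy:unix:/run/php/php8.1-fpm.sock|fcgi://localhost/status"
        Require local
    </Location>
</VirtualHost>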

Enable the vhost:

a2enmod proxy proxy_fcgi
a2ensite 000-localhost-status
systemctl restart apache2

The 000- prefix ensures this vhost loads first, before your application vhosts.

Verify:

curl 'http://127.0.0.1/server-status?auto'
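
The `?auto` output is a plain list of `Key: value` lines, which is what the Apache exporter parses. To pull out a single field by hand, a short awk filter is enough; `status_field` is a hypothetical helper name and the sample values in the comment are invented.

```shell
# status_field NAME
# Reads mod_status "?auto" output on stdin and prints the value for NAME.
# Exits non-zero when the field is not present.
status_field() {
  awk -F': ' -v key="$1" '$1 == key { print $2; found = 1 } END { exit !found }'
}

# Example (not run here):
#   curl -s 'http://127.0.0.1/server-status?auto' | status_field BusyWorkers
# Given input lines such as "BusyWorkers: 3" and "IdleWorkers: 47",
# the call above prints 3.
```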

3.3 Enable PHP-FPM Status (Optional)

Edit /etc/php/8.1/fpm/pool.d/www.conf:

pm.status_path = /status

Restart PHP-FPM:

systemctl restart php8.1-fpm

Verify:

curl http://127.0.0.1/status

3.4 Configure Alloy

Create /etc/alloy/config.alloy:

// =============================================================================
// ALLOY CONFIGURATION FOR WEB SERVER
// Server: web-server
// =============================================================================

// -----------------------------------------------------------------------------
// NODE/HOST METRICS
// -----------------------------------------------------------------------------
prometheus.exporter.unix "node" { }

prometheus.scrape "node" {
  targets    = prometheus.exporter.unix.node.targets
  forward_to = [prometheus.relabel.add_labels.receiver]
}

// -----------------------------------------------------------------------------
// APACHE METRICS
// Replaces: apache_exporter
// -----------------------------------------------------------------------------
prometheus.exporter.apache "web" {
  scrape_uri = "http://127.0.0.1/server-status?auto"
}

prometheus.scrape "apache" {
  targets    = prometheus.exporter.apache.web.targets
  forward_to = [prometheus.relabel.add_labels.receiver]
}

// -----------------------------------------------------------------------------
// LABELS
// -----------------------------------------------------------------------------
prometheus.relabel "add_labels" {
  rule {
    action       = "replace"
    target_label = "server"
    replacement  = "web-server"
  }
  
  rule {
    action       = "replace"
    target_label = "environment"
    replacement  = "production"
  }
  
  forward_to = [prometheus.remote_write.default.receiver]
}

// -----------------------------------------------------------------------------
// REMOTE WRITE
// -----------------------------------------------------------------------------
prometheus.remote_write "default" {
  endpoint {
    url = "http://192.168.0.23:9090/api/v1/write"
  }
}

// -----------------------------------------------------------------------------
// LOG COLLECTION
// -----------------------------------------------------------------------------
loki.source.file "system_logs" {
  targets = [
    {__path__ = "/var/log/syslog", job = "syslog", server = "web-server"},
    {__path__ = "/var/log/auth.log", job = "authlog", server = "web-server"},
  ]
  forward_to = [loki.write.default.receiver]
}

local.file_match "apache_logs" {
  path_targets = [
    {__path__ = "/var/log/apache2/access.log", job = "apache-access", server = "web-server"},
    {__path__ = "/var/log/apache2/error.log", job = "apache-error", server = "web-server"},
    {__path__ = "/var/log/apache2/*-access.log", job = "apache-access", server = "web-server"},
    {__path__ = "/var/log/apache2/*-error.log", job = "apache-error", server = "web-server"},
  ]
}

// loki.source.file does not discover files itself; local.file_match expands the globs.
loki.source.file "apache_logs" {
  targets    = local.file_match.apache_logs.targets
  forward_to = [loki.write.default.receiver]
}

local.file_match "php_logs" {
  path_targets = [
    {__path__ = "/var/log/php*.log", job = "php-fpm", server = "web-server"},
  ]
}

loki.source.file "php_logs" {
  targets    = local.file_match.php_logs.targets
  forward_to = [loki.write.default.receiver]
}

// -----------------------------------------------------------------------------
// LOKI WRITE
// -----------------------------------------------------------------------------
loki.write "default" {
  endpoint {
    url = "http://192.168.0.23:3100/loki/api/v1/push"
  }
}

3.5 Start Alloy

usermod -aG adm alloy
usermod -aG www-data alloy
systemctl enable alloy
systemctl start alloy

Part 4: Monitoring Docker Container Hosts

4.1 Install Alloy

curl -fsSL https://apt.grafana.com/gpg.key | gpg --dearmor -o /usr/share/keyrings/grafana.gpg
echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://apt.grafana.com stable main" | tee /etc/apt/sources.list.d/grafana.list
apt update && apt install alloy -y

4.2 Deploy cAdvisor

Alloy's built-in cAdvisor exporter can have permission issues with Docker's overlay filesystem. The reliable solution is running cAdvisor as a privileged container:

docker run -d \
  --name=cadvisor \
  --restart=unless-stopped \
  --privileged \
  -p 8081:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor:latest

Note: The command publishes cAdvisor on host port 8081 because 8080 is often already in use. Adjust the mapping if 8081 is taken as well.

Verify:

curl -s http://localhost:8081/metrics | grep container_cpu

4.3 Configure Alloy

Create /etc/alloy/config.alloy:

// =============================================================================
// ALLOY CONFIGURATION FOR DOCKER HOST
// Server: docker-host
// =============================================================================

// -----------------------------------------------------------------------------
// NODE/HOST METRICS
// -----------------------------------------------------------------------------
prometheus.exporter.unix "node" { }

prometheus.scrape "node" {
  targets    = prometheus.exporter.unix.node.targets
  forward_to = [prometheus.relabel.add_labels.receiver]
}

// -----------------------------------------------------------------------------
// CADVISOR (CONTAINER METRICS)
// -----------------------------------------------------------------------------
prometheus.scrape "cadvisor" {
  targets = [
    {"__address__" = "localhost:8081", "job" = "cadvisor"},
  ]
  forward_to = [prometheus.relabel.add_labels.receiver]
  
  scrape_interval = "15s"
  scrape_timeout  = "10s"
}

// -----------------------------------------------------------------------------
// LABELS
// -----------------------------------------------------------------------------
prometheus.relabel "add_labels" {
  rule {
    action       = "replace"
    target_label = "server"
    replacement  = "docker-host"
  }
  
  forward_to = [prometheus.remote_write.default.receiver]
}

// -----------------------------------------------------------------------------
// REMOTE WRITE
// -----------------------------------------------------------------------------
prometheus.remote_write "default" {
  endpoint {
    url = "http://192.168.0.23:9090/api/v1/write"
  }
}

// -----------------------------------------------------------------------------
// SYSTEM LOGS
// -----------------------------------------------------------------------------
loki.source.file "system_logs" {
  targets = [
    {__path__ = "/var/log/syslog", job = "syslog", server = "docker-host"},
    {__path__ = "/var/log/auth.log", job = "authlog", server = "docker-host"},
  ]
  forward_to = [loki.write.default.receiver]
}

// -----------------------------------------------------------------------------
// DOCKER CONTAINER LOGS
// Automatically discovers and collects logs from all containers
// -----------------------------------------------------------------------------
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

loki.source.docker "containers" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.containers.targets
  labels     = {server = "docker-host", job = "docker"}
  forward_to = [loki.write.default.receiver]
  
  refresh_interval = "5s"
}

// -----------------------------------------------------------------------------
// LOKI WRITE
// -----------------------------------------------------------------------------
loki.write "default" {
  endpoint {
    url = "http://192.168.0.23:3100/loki/api/v1/push"
  }
}

4.4 Start Alloy

usermod -aG docker alloy
usermod -aG adm alloy
systemctl enable alloy
systemctl start alloy

Part 5: Creating Grafana Dashboards

5.1 Dashboard Variables

Before creating panels, set up a server variable for filtering:

  1. Open your dashboard
  2. Go to Settings → Variables → New variable
  3. Configure:
    • Name: server
    • Type: Query
    • Data source: Prometheus
    • Query: label_values(up, server)
    • Multi-value: Enable
    • Include All option: Enable
  4. Click Apply

Now use {server=~"$server"} in all your queries.

5.2 Key Metrics Reference

Host Metrics (Node Exporter)

# CPU Usage (%)
100 - (avg by(server) (rate(node_cpu_seconds_total{mode="idle", server=~"$server"}[5m])) * 100)

# Memory Usage (%)
(node_memory_MemTotal_bytes{server=~"$server"} - node_memory_MemAvailable_bytes{server=~"$server"}) / node_memory_MemTotal_bytes{server=~"$server"} * 100

# Disk Usage (%)
100 - (node_filesystem_avail_bytes{server=~"$server", mountpoint="/"} / node_filesystem_size_bytes{server=~"$server", mountpoint="/"} * 100)

# Load Average
node_load1{server=~"$server"}
node_load5{server=~"$server"}
node_load15{server=~"$server"}

# Network Traffic
rate(node_network_receive_bytes_total{server=~"$server", device!="lo"}[5m])
rate(node_network_transmit_bytes_total{server=~"$server", device!="lo"}[5m])

# Disk I/O
rate(node_disk_read_bytes_total{server=~"$server"}[5m])
rate(node_disk_written_bytes_total{server=~"$server"}[5m])
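
To see what the CPU usage query computes: node_cpu_seconds_total{mode="idle"} counts seconds spent idle, so its rate over a window is the idle fraction of that window, and subtracting it from 100% leaves the busy share. The arithmetic for a single core can be sketched in shell; the sample counter values are made up for illustration.

```shell
# cpu_busy_percent IDLE_START IDLE_END WINDOW_SECONDS
# Mirrors 100 - rate(idle_counter[window]) * 100 for one core.
cpu_busy_percent() {
  awk -v first="$1" -v last="$2" -v secs="$3" \
    'BEGIN { printf "%.1f\n", 100 - ((last - first) / secs) * 100 }'
}

# Hypothetical samples: the idle counter grew by 240 s over a 300 s window,
# so the core was idle 80% of the time and busy 20%.
cpu_busy_percent 1000 1240 300   # prints 20.0
```

The dashboard query does the same per core and then averages the idle rates by server before subtracting from 100.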

MySQL Metrics

# MySQL Up/Down
mysql_up{server=~"$server"}

# Connections
mysql_global_status_threads_connected{server=~"$server"}
mysql_global_variables_max_connections{server=~"$server"}

# Queries per Second
rate(mysql_global_status_queries{server=~"$server"}[5m])

# Slow Queries per Second
rate(mysql_global_status_slow_queries{server=~"$server"}[5m])

# Buffer Pool Hit Rate
1 - (rate(mysql_global_status_innodb_buffer_pool_reads{server=~"$server"}[5m]) / rate(mysql_global_status_innodb_buffer_pool_read_requests{server=~"$server"}[5m]))

# InnoDB Buffer Pool Usage
mysql_global_status_innodb_buffer_pool_bytes_data{server=~"$server"}
mysql_global_variables_innodb_buffer_pool_size{server=~"$server"}

Apache Metrics

# Apache Up/Down
apache_up{server=~"$server"}

# Requests per Second
rate(apache_accesses_total{server=~"$server"}[5m])

# Traffic (Bytes/sec)
rate(apache_sent_kilobytes_total{server=~"$server"}[5m]) * 1024

# Workers
apache_workers{server=~"$server", state="busy"}
apache_workers{server=~"$server", state="idle"}

# Uptime
apache_uptime_seconds_total{server=~"$server"}

Docker/Container Metrics

# Container CPU Usage
rate(container_cpu_usage_seconds_total{server=~"$server", name!=""}[5m]) * 100

# Container Memory Usage
container_memory_usage_bytes{server=~"$server", name!=""}

# Container Network I/O
rate(container_network_receive_bytes_total{server=~"$server", name!=""}[5m])
rate(container_network_transmit_bytes_total{server=~"$server", name!=""}[5m])

# Running Containers Count
count(container_memory_usage_bytes{server=~"$server", name!=""})

5.3 Sample Dashboard JSON

Here's a complete MySQL dashboard you can import:

{
  "title": "MySQL Overview",
  "uid": "mysql-overview",
  "templating": {
    "list": [{
      "name": "server",
      "type": "query",
      "datasource": "Prometheus",
      "query": "label_values(mysql_up, server)",
      "refresh": 2,
      "multi": true,
      "includeAll": true
    }]
  },
  "panels": [
    {
      "title": "MySQL Status",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
      "targets": [{"expr": "mysql_up{server=~\"$server\"}", "legendFormat": "{{server}}"}],
      "fieldConfig": {
        "defaults": {
          "mappings": [
            {"options": {"0": {"text": "DOWN", "color": "red"}}, "type": "value"},
            {"options": {"1": {"text": "UP", "color": "green"}}, "type": "value"}
          ]
        }
      }
    },
    {
      "title": "Queries per Second",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 4, "y": 0},
      "targets": [{"expr": "rate(mysql_global_status_queries{server=~\"$server\"}[5m])", "legendFormat": "{{server}}"}]
    },
    {
      "title": "Connections",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 8, "x": 16, "y": 0},
      "targets": [
        {"expr": "mysql_global_status_threads_connected{server=~\"$server\"}", "legendFormat": "{{server}} - Connected"},
        {"expr": "mysql_global_variables_max_connections{server=~\"$server\"}", "legendFormat": "{{server}} - Max"}
      ]
    }
  ],
  "schemaVersion": 38,
  "time": {"from": "now-1h", "to": "now"},
  "refresh": "30s"
}
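
Besides the import dialog in the UI, the dashboard can be pushed through Grafana's HTTP API, which expects the dashboard JSON wrapped in a small envelope. A minimal sketch follows; `dashboard_payload` and `import_dashboard` are hypothetical helper names, and the default URL and credentials simply mirror the ones used earlier in this guide.

```shell
# dashboard_payload FILE
# Wraps a dashboard JSON file in the envelope expected by
# POST /api/dashboards/db. The file must contain a single JSON object.
dashboard_payload() {
  printf '{"dashboard": %s, "overwrite": true}\n' "$(cat "$1")"
}

# import_dashboard FILE
# Posts the wrapped dashboard to Grafana using basic auth.
import_dashboard() {
  dashboard_payload "$1" |
    curl -fsS -u "${GRAFANA_USER:-admin}:${GRAFANA_PASSWORD:-changeme}" \
      -H 'Content-Type: application/json' \
      -X POST --data-binary @- \
      "${GRAFANA_URL:-http://localhost:3000}/api/dashboards/db"
}

# Example (not run here): import_dashboard mysql-overview.json
```

Setting `overwrite` to true replaces an existing dashboard with the same uid, which makes the call safe to repeat.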

Part 6: Alerting Configuration

6.1 Prometheus Alerting Rules

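Create prometheus/alerts.yml. For Prometheus to load it, mount the file into the container and list its in-container path under rule_files in prometheus/prometheus.yml, which section 1.3 left empty. The file contents: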

groups:
  - name: host
    rules:
      - alert: HostDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.server }} is down"
          description: "{{ $labels.server }} has been unreachable for more than 1 minute."
      
      - alert: HighCPU
        expr: 100 - (avg by(server) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.server }}"
          description: "CPU usage is above 80% (current: {{ $value | printf \"%.1f\" }}%)"
      
      - alert: HighMemory
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.server }}"
      
      - alert: DiskSpaceLow
        expr: 100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.server }}"

  - name: mysql
    rules:
      - alert: MySQLDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL is down on {{ $labels.server }}"
      
      - alert: MySQLTooManyConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL connections above 80% on {{ $labels.server }}"
      
      - alert: MySQLSlowQueries
        expr: rate(mysql_global_status_slow_queries[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL slow queries detected on {{ $labels.server }}"

  - name: apache
    rules:
      - alert: ApacheDown
        expr: apache_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Apache is down on {{ $labels.server }}"

Troubleshooting Guide

Alloy Won't Start

Check config syntax:

alloy fmt /etc/alloy/config.alloy

Run manually to see errors:

alloy run /etc/alloy/config.alloy

Common errors:

  • component "xxx" does not exist: The exporter isn't available in Alloy
  • permission denied: Add the alloy user to the group that owns the file (for example, usermod -aG adm alloy) and restart Alloy

No Data in Grafana

Check that Alloy is up and serving its own metrics:

curl -s http://localhost:12345/metrics | grep -c "^[a-z]"

Check if Prometheus is receiving:

curl -s 'http://PROMETHEUS_IP:9090/api/v1/query?query=up' | jq '.data.result[].metric.server'

Check Alloy logs:

journalctl -u alloy -f --no-pager

Loki "Ingester Not Ready"

This is normal on first start. Wait 15-30 seconds:

watch -n 5 'curl -s http://localhost:3100/ready'

Permission Denied on Logs

# Check current groups
groups alloy

# Add to required groups
usermod -aG adm alloy      # System logs
usermod -aG mysql alloy    # MySQL logs
usermod -aG www-data alloy # Apache logs
usermod -aG docker alloy   # Docker socket

# Restart
systemctl restart alloy
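
To find out which of these memberships is actually missing on a given host, the group list can be compared against what that server role needs. A small sketch; `missing_groups` is a hypothetical helper name.

```shell
# missing_groups USER GROUP...
# Prints each listed group that USER does not belong to.
# Returns 0 when none are missing, 1 when at least one is, 2 if id fails.
missing_groups() {
  user=$1
  shift
  current=$(id -nG "$user") || return 2
  missing=0
  for group in "$@"; do
    case " $current " in
      *" $group "*) ;;
      *) echo "$group"; missing=1 ;;
    esac
  done
  return "$missing"
}

# Example (not run here), on a MySQL server:
#   missing_groups alloy adm mysql
```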

High Memory Usage

Alloy can consume significant memory with many targets. Tune the config:

prometheus.remote_write "default" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
    
    queue_config {
      max_samples_per_send = 500    // Reduce from default 2000
      capacity             = 2500   // Reduce from default 10000
      max_shards           = 10     // Limit parallelism
    }
  }
}

Performance Tuning

Prometheus Storage

For longer retention or higher cardinality:

command:
  - '--storage.tsdb.retention.time=90d'
  - '--storage.tsdb.retention.size=100GB'
  - '--storage.tsdb.wal-compression'
  - '--storage.tsdb.max-block-duration=2h'

Loki Optimization

For high-volume log ingestion:

limits_config:
  ingestion_rate_mb: 20
  ingestion_burst_size_mb: 40
  per_stream_rate_limit: 5MB
  per_stream_rate_limit_burst: 15MB

Alloy Resource Limits

Create /etc/systemd/system/alloy.service.d/limits.conf:

[Service]
MemoryMax=512M
CPUQuota=50%

Reload systemd and restart Alloy to apply the limits:

systemctl daemon-reload
systemctl restart alloy

Conclusion

We've built a complete monitoring stack that:

  • Collects metrics from hosts, MySQL, Apache, and Docker containers
  • Aggregates logs from all sources into Loki
  • Visualizes everything in Grafana dashboards
  • Uses a single agent (Alloy) per server instead of multiple exporters

Key Takeaways

  • Alloy simplifies operations by consolidating multiple agents
  • Remote write eliminates the need for Prometheus to scrape targets
  • Consistent labeling (server, environment) enables powerful filtering
  • Docker Compose makes the central stack portable and reproducible

Next Steps

  • Add Alertmanager for alert routing and notifications
  • Implement Tempo for distributed tracing
  • Scale to Mimir for long-term metrics storage
  • Set up recording rules for complex/expensive queries

Resources