VM Utilization Agent

A cross-platform Telegraf-based utilization agent that collects CPU, memory, and disk metrics from virtual machines and uploads them to AWS S3. This agent provides automated deployment scripts for multiple Linux distributions and Windows environments.

🚀 Quick Start

Option 1: Environment Variables (Recommended)

Linux (Ubuntu, Debian, CentOS, RHEL, Fedora, openSUSE, SLES, Alpine):

# Interactive setup
./setup-env.sh

# Install agent
sudo ./install.sh

Windows:

# Create environment file
cp env-template-windows.ps1 .env.ps1
# Edit .env.ps1 with your values

# Load environment and install
. .\.env.ps1
.\install.ps1

Option 2: Command Line Arguments

Linux:

sudo ./install.sh \
  --telegraf-url "https://dl.influxdata.com/telegraf/releases/telegraf-1.34.4_linux_amd64.tar.gz" \
  --bucket "your-s3-bucket" \
  --access-key "YOUR_AWS_ACCESS_KEY" \
  --secret-key "YOUR_AWS_SECRET_KEY"

Windows:

.\install.ps1 -TelegrafUrl "https://dl.influxdata.com/telegraf/releases/telegraf-1.34.4_windows_amd64.zip" `
              -Bucket "your-s3-bucket" `
              -AccessKey "YOUR_AWS_ACCESS_KEY" `
              -SecretKey "YOUR_AWS_SECRET_KEY"

🐧 Linux Distribution Support

The Linux installer automatically detects and supports the following distributions:

  • Ubuntu (18.04+) - apt package manager
  • Debian (9+) - apt package manager
  • CentOS (7+) - yum/dnf package manager
  • RHEL (7+) - yum/dnf package manager
  • Fedora (30+) - dnf package manager
  • openSUSE - zypper package manager
  • SLES - zypper package manager
  • Alpine Linux - apk package manager

The installer automatically:

  • Detects the distribution type
  • Uses the appropriate package manager
  • Handles different user creation methods
  • Adapts file ownership commands
  • Supports both x86_64 and aarch64 architectures
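
The detection step can be pictured with a short sketch. This is not the installer's actual code: the function names `pkg_manager_for` and `detect_distro_id` are hypothetical, and the mapping simply mirrors the distribution list above, assuming the distribution is identified by the `ID` field of `/etc/os-release`.

```shell
#!/bin/sh
# Hypothetical helper: map an os-release ID to the package manager listed above.
pkg_manager_for() {
    case "$1" in
        ubuntu|debian)  echo "apt" ;;
        fedora)         echo "dnf" ;;
        centos|rhel)
            # Newer releases ship dnf; older ones only provide yum.
            if command -v dnf >/dev/null 2>&1; then echo "dnf"; else echo "yum"; fi ;;
        opensuse*|sles) echo "zypper" ;;
        alpine)         echo "apk" ;;
        *)              echo "unknown"; return 1 ;;
    esac
}

# Hypothetical helper: read the ID field from an os-release style file.
detect_distro_id() {
    file="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "$file" && printf '%s\n' "$ID" )
}

# Example: pkg_manager_for "$(detect_distro_id)"
```

Keeping the mapping in a single `case` statement makes it easy to see which distributions are covered and what happens for an unrecognized one.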

📋 What It Does

  • Collects Metrics: CPU, memory, and disk utilization every 30 seconds
  • Stores Locally: Metrics saved as newline-delimited JSON files with daily rotation
  • Uploads to S3: Automated sync to AWS S3 every 5 minutes
  • Self-Managing: Systemd services (Linux) or Windows Services for reliability
  • Minimal Footprint: ~2-5 MB storage per day per VM
  • Cross-Platform: Works on all major Linux distributions and Windows
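
As a rough sanity check on the intervals and the storage figure above, the arithmetic can be sketched in shell. The series count (5) and the bytes per JSON line (300) are illustrative assumptions, not measured values, and the function names are hypothetical.

```shell
#!/bin/sh
# Samples written per day for one metric series at a given interval in seconds.
samples_per_day() {
    echo $(( 86400 / $1 ))
}

# Rough daily volume in KB: samples per day x series count x assumed bytes per line.
estimate_daily_kb() {
    interval="$1"; series="$2"; bytes_per_line="$3"
    echo $(( $(samples_per_day "$interval") * series * bytes_per_line / 1024 ))
}

samples_per_day 30          # collection every 30 s  -> 2880 samples per series
samples_per_day 300         # sync every 5 min       -> 288 uploads per day
estimate_daily_kb 30 5 300  # assumed 5 series, ~300 bytes each -> 4218 KB
```

Under these assumptions the estimate lands at about 4 MB per day, which is consistent with the ~2-5 MB range quoted above.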

📊 Analytics and Reporting

Once metrics are collected and stored in S3, you can perform powerful analytics using AWS Athena:

Serverless SQL Analytics

  • No infrastructure - Query directly from S3 using standard SQL
  • Cost-effective - Pay only for queries you run
  • Scalable - Handles petabytes of data automatically
  • Fast - Optimized for JSON data with proper partitioning

Key Analytics Capabilities

  • Near-real-time Monitoring - Fleet status as of the latest 5-minute S3 sync
  • Historical Analysis - Trends and patterns over time
  • Performance Optimization - Identify bottlenecks and inefficiencies
  • Cost Optimization - Find underutilized resources for downsizing
  • Capacity Planning - Predict future resource needs
  • Custom Alerting - SQL-based thresholds and notifications

Sample Queries

-- Average CPU usage per VM over the last 5 minutes
SELECT tags.host, AVG(fields.usage_active) AS avg_cpu
FROM vm_metrics_db.vm_utilization
WHERE name = 'cpu' AND "timestamp" > to_unixtime(now()) - 300
GROUP BY tags.host;

-- Find underutilized VMs for cost savings (average CPU below 15% over the last 7 days)
SELECT tags.host, AVG(fields.usage_active) AS avg_cpu
FROM vm_metrics_db.vm_utilization
WHERE name = 'cpu' AND "timestamp" > to_unixtime(now()) - 604800
GROUP BY tags.host
HAVING AVG(fields.usage_active) < 15;

For complete SQL examples and setup instructions, see Athena Analytics Guide.

📁 Repository Structure

vm-utilization/
├── README.md                     # This file - getting started guide
├── LICENSE                       # MIT License
├── install.sh                    # Multi-distribution Linux installer
├── install.ps1                   # Windows installation script
├── setup-env.sh                  # Interactive environment setup (Linux)
├── env-template.txt              # Linux environment template
├── env-template-windows.ps1      # Windows environment template
├── ENVIRONMENT-SETUP.md          # Detailed environment setup guide
├── ATHENA-ANALYTICS.md           # SQL analytics with AWS Athena
├── LIVE-TESTING-REPORT.md        # Live Azure testing results
├── SECURITY.md                   # Security considerations
├── CHANGELOG.md                  # Version history and changes
└── docs/                         # Additional documentation

📚 Documentation

🛠 Prerequisites

| Requirement | Linux | Windows | Description |
| --- | --- | --- | --- |
| Admin Rights | sudo access | Administrator | Required to install services |
| Internet Access | HTTPS (443) | HTTPS (443) | For downloads and S3 sync |
| AWS Credentials | S3 write permissions | S3 write permissions | For metrics upload |
| S3 Bucket | Pre-existing | Pre-existing | Target for metrics storage |
| System Requirements | systemd-based distro | Windows Server 2016+ | Service management |

🔧 Configuration

The installation scripts support multiple configuration methods with the following priority:

  1. Environment Variables (highest priority)
  2. Command Line Arguments (fallback)
  3. Default Values (lowest priority)
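
The priority order above amounts to "first non-empty value wins". A minimal sketch of that rule, assuming a hypothetical `resolve_setting` helper and a hypothetical `CLI_REGION` variable holding a parsed command-line value (neither is taken from the install scripts):

```shell
#!/bin/sh
# Return the first non-empty value: environment, then CLI argument, then default.
resolve_setting() {
    env_value="$1"; cli_value="$2"; default_value="$3"
    printf '%s\n' "${env_value:-${cli_value:-$default_value}}"
}

# Example: the region falls back to the documented default (us-east-1)
# when neither VM_AWS_REGION nor the hypothetical CLI_REGION is set.
region="$(resolve_setting "${VM_AWS_REGION:-}" "${CLI_REGION:-}" "us-east-1")"
echo "$region"
```

Nested `${var:-fallback}` expansions keep the precedence visible in one line and treat empty strings the same as unset values.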

Required Environment Variables

| Variable | Description | Example |
| --- | --- | --- |
| VM_TELEGRAF_URL | Telegraf download URL | https://dl.influxdata.com/telegraf/releases/telegraf-1.34.4_linux_amd64.tar.gz |
| VM_S3_BUCKET | S3 bucket for metrics storage | my-vm-metrics-bucket |
| VM_AWS_ACCESS_KEY | AWS access key ID | AKIA... |
| VM_AWS_SECRET_KEY | AWS secret access key | wJalrXUtn... |
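
Before running the installer, it can help to confirm that all four required variables are present. The following is a sketch of such a pre-flight check; `missing_required_vars` is a hypothetical helper, not part of the repository.

```shell
#!/bin/sh
# Print the name of each required variable that is unset or empty.
missing_required_vars() {
    for name in VM_TELEGRAF_URL VM_S3_BUCKET VM_AWS_ACCESS_KEY VM_AWS_SECRET_KEY; do
        # Indirect lookup in POSIX sh: copy the variable named by $name into $value.
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

missing="$(missing_required_vars)"
if [ -n "$missing" ]; then
    # Word splitting on $missing is intentional: one name per word.
    echo "Missing required variables:" $missing >&2
fi
```

The check only reports names and never echoes values, so secrets do not end up in terminal scrollback or logs.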

Optional Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| VM_AWS_REGION | AWS region | us-east-1 |
| VM_CUSTOMER_ID | Customer identifier | default-customer |

For detailed configuration options, see ENVIRONMENT-SETUP.md.

📊 Metrics Collected

The agent collects the following metrics:

CPU Metrics

  • cpu.usage_active - Active CPU percentage
  • cpu.usage_idle - Idle CPU percentage
  • Per-CPU core metrics (when available)

Memory Metrics

  • mem.used_percent - Memory usage percentage
  • mem.available - Available memory in bytes
  • mem.total - Total memory in bytes

Disk Metrics

  • disk.used_percent - Disk usage percentage per mount/drive
  • disk.free - Free disk space in bytes
  • disk.total - Total disk space in bytes

🏗 Architecture

VM Utilization Agent Architecture:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Telegraf      │    │  Local Storage  │    │   AWS S3        │
│   Collector     │───▶│  (JSON files)   │───▶│   Bucket        │
│   (30s interval)│    │  Daily rotation │    │  (5min sync)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Linux Implementation:

  • Supports all major distributions automatically
  • Telegraf runs as systemd service
  • S3 sync via systemd timer (5-minute intervals)
  • Secure credential storage with proper permissions
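
To illustrate what a 5-minute systemd timer for the sync service could look like, here is a sketch that renders an example unit to a scratch file. The unit names come from this README; the unit contents and the `write_sync_timer` helper are assumptions, and the unit the installer actually creates may differ.

```shell
#!/bin/sh
# Hypothetical helper: write an example 5-minute timer unit to the given path.
write_sync_timer() {
    cat > "$1" <<'EOF'
[Unit]
Description=Run vm-metrics-sync.service every 5 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=5min
Unit=vm-metrics-sync.service

[Install]
WantedBy=timers.target
EOF
}

# Render to a scratch file for review rather than into /etc/systemd/system.
write_sync_timer "${TMPDIR:-/tmp}/vm-metrics-sync.timer.example"
```

`OnUnitActiveSec` schedules each run relative to the previous activation, while `OnBootSec` covers the first run after boot.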

Windows Implementation:

  • Telegraf runs as Windows Service
  • S3 sync via Scheduled Task (5-minute intervals)
  • Credentials stored as task environment variables

🧪 Testing

Local Testing

  1. Clone the repository:

    git clone https://github.com/ops-guru/vm-utilization.git
    cd vm-utilization

  2. Test on your distribution:

    # Linux (any supported distribution)
    sudo ./install.sh --telegraf-url "..." --bucket "test-bucket" --region "us-east-1" --access-key "..." --secret-key "..."

    # Windows
    .\install.ps1 -TelegrafUrl "..." -Bucket "test-bucket" -Region "us-east-1" -AccessKey "..." -SecretKey "..."

  3. Verify metrics collection:

    # Linux
    sudo ls -la /var/lib/vm-metrics/
    sudo systemctl status telegraf

    # Windows
    Get-ChildItem "C:\ProgramData\vm-metrics\"
    Get-Service -Name "Telegraf"
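
On Linux, listing the directory shows that files exist but not whether they are still being written. A small freshness check can close that gap. This is a sketch: `has_recent_metrics` is a hypothetical helper, the 2-minute threshold is an assumption based on the 30-second collection interval, and `find -mmin` is assumed to be available (GNU, BSD, and BusyBox `find` all support it).

```shell
#!/bin/sh
# Succeeds if at least one file under the directory was modified
# within the last N minutes (default 2).
has_recent_metrics() {
    dir="$1"; minutes="${2:-2}"
    [ -n "$(find "$dir" -type f -mmin "-$minutes" 2>/dev/null | head -n 1)" ]
}

# The metrics directory may only be readable by root, so run with sudo if needed.
if has_recent_metrics /var/lib/vm-metrics 2; then
    echo "metrics are fresh"
else
    echo "no recent metrics files"
fi
```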

Live Testing Results

See LIVE-TESTING-REPORT.md for comprehensive testing results including:

  • ✅ Azure cloud deployment testing
  • ✅ Multi-VM environment validation
  • ✅ Performance benchmarks (30 MB memory, minimal CPU)
  • ✅ Security compliance verification
  • ✅ End-to-end metric flow validation

📈 Monitoring

Service Status Commands

Linux (All Distributions):

# Check Telegraf service
sudo systemctl status telegraf

# Check S3 sync timer
sudo systemctl status vm-metrics-sync.timer

# View sync logs
sudo journalctl -u vm-metrics-sync -f

# Check distribution detection
./install.sh --help  # Shows supported distributions

Windows:

# Check Telegraf service
Get-Service -Name "Telegraf"

# Check scheduled task
Get-ScheduledTask -TaskName "VM-Metrics-Sync"

# View task history
Get-WinEvent -FilterHashtable @{LogName='Microsoft-Windows-TaskScheduler/Operational'; ID=200,201} | Where-Object {$_.Message -like "*VM-Metrics-Sync*"}

🗑 Uninstallation

Linux

# Stop and disable services
sudo systemctl stop telegraf vm-metrics-sync.timer
sudo systemctl disable telegraf vm-metrics-sync.timer vm-metrics-sync.service

# Remove service files
sudo rm -f /etc/systemd/system/telegraf.service
sudo rm -f /etc/systemd/system/vm-metrics-sync.service
sudo rm -f /etc/systemd/system/vm-metrics-sync.timer

# Remove application files
sudo rm -rf /etc/telegraf
sudo rm -rf /var/lib/vm-metrics
sudo rm -rf /etc/vm-metrics
sudo rm -f /usr/local/bin/telegraf

# Reload systemd
sudo systemctl daemon-reload

Windows

# Stop and remove Telegraf service
Stop-Service -Name "Telegraf" -Force
sc.exe delete "Telegraf"

# Remove scheduled task
Unregister-ScheduledTask -TaskName "VM-Metrics-Sync" -Confirm:$false

# Remove application directories
Remove-Item -Path "C:\Program Files\Telegraf" -Recurse -Force
Remove-Item -Path "C:\ProgramData\Telegraf" -Recurse -Force
Remove-Item -Path "C:\ProgramData\vm-metrics" -Recurse -Force

🔒 Security

  • Credentials: Stored with restricted permissions (root/SYSTEM only)
  • Transport: HTTPS/TLS for all S3 communications
  • File Permissions: Metrics files readable only by system accounts
  • No Network Exposure: Agent only makes outbound connections
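
On Linux, the restricted-permissions claim can be spot-checked per file. The sketch below uses a hypothetical `is_owner_only` helper and assumes `stat -c` is available (GNU coreutils or BusyBox); the credential file path is not documented here, so it is passed as an argument.

```shell
#!/bin/sh
# Succeeds if the file grants no permissions to group or others
# (for example mode 600 or 400).
is_owner_only() {
    mode="$(stat -c '%a' "$1")" || return 1
    case "$mode" in
        0|*00) return 0 ;;
        *)     return 1 ;;
    esac
}

# Example: is_owner_only /path/to/credentials-file && echo "permissions OK"
```

Checking the octal mode directly avoids parsing `ls -l` output, which is harder to do reliably in scripts.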

🐛 Troubleshooting

Common Issues

| Issue | Cause | Solution |
| --- | --- | --- |
| Service won't start | Configuration error | Run telegraf --config <config> --test to validate the configuration |
| S3 upload fails | Credentials/permissions | Verify IAM permissions and bucket access |
| High disk usage | Sync failure | Check network connectivity and S3 permissions |

Log Locations

Linux:

  • Telegraf: sudo journalctl -u telegraf
  • S3 Sync: sudo journalctl -u vm-metrics-sync

Windows:

  • Telegraf: Event Viewer > Windows Logs > System
  • S3 Sync: Event Viewer > Task Scheduler logs

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📞 Support

For issues and support:

🏷 Tags

telegraf monitoring metrics vm-utilization aws-s3 linux windows automation devops
