Kepler Tutorial: Monitoring Kubernetes Energy with eBPF

The Attribution Problem is one of the hardest challenges in FinOps and GreenOps. Your CFO looks at the AWS bill and sees "EC2 - $50,000." They turn to you and ask: "How much of that was the Marketing AI Bot, and how much was the Customer Support Bot?"

In a shared Kubernetes cluster, this is notoriously difficult to answer. Kubernetes knows about CPU shares and memory limits, but it knows nothing about Volts, Amps, or Watts. The physical hardware consumes power, but the logical containers just consume "time."

Enter Kepler (Kubernetes-based Efficient Power Level Exporter). This CNCF project solves the attribution mystery by using eBPF (Extended Berkeley Packet Filter) to peek deep into the Linux kernel, correlating process activity with power consumption metrics to give you per-pod energy observability.

How Kepler Works: The Mechanics

Kepler doesn't just guess; it measures (where possible) and models (where necessary). It runs as a DaemonSet, meaning one Kepler agent sits on every node in your cluster. Here is the architectural flow:

1. Data Collection (The Sensors)

Kepler reads hardware counters and sensors to get as close as possible to the ground truth of power draw:

RAPL (Running Average Power Limit): On Intel and AMD CPUs, RAPL exposes cumulative energy counters for the CPU package and, where the platform supports it, DRAM.
NVML (NVIDIA Management Library): For AI workloads, this is critical. Kepler polls the GPU for its reported power draw (in milliwatts), whether you run H100s or A10s.
ACPI / Hwmon: On bare-metal servers, Kepler can read platform power information directly from the motherboard sensors.
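
To make the RAPL bullet concrete, here is a minimal Python sketch of how a cumulative energy counter turns into watts. It assumes a Linux host that exposes the powercap sysfs interface at /sys/class/powercap/intel-rapl:0 (the path, its availability, and read permissions vary by kernel and CPU). The helper names are illustrative and this is not Kepler's own code.

```python
import time
from pathlib import Path

# Assumed powercap directory for CPU package 0; it may differ or be absent on your host.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def average_watts(start_uj: int, end_uj: int, seconds: float, max_range_uj: int) -> float:
    """Convert two cumulative microjoule readings into average watts."""
    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        # The counter wrapped around between the two readings.
        delta_uj += max_range_uj
    return delta_uj / 1_000_000 / seconds


def read_uj(name: str) -> int:
    """Read one integer value (in microjoules) from the powercap directory."""
    return int((RAPL_DIR / name).read_text().strip())


def sample_package_watts(interval_s: float = 1.0) -> float:
    """Sample the package counter twice and return the average power in watts.

    Reading energy_uj usually requires root on recent kernels.
    """
    max_range = read_uj("max_energy_range_uj")
    first, t0 = read_uj("energy_uj"), time.monotonic()
    time.sleep(interval_s)
    second, t1 = read_uj("energy_uj"), time.monotonic()
    return average_watts(first, second, t1 - t0, max_range)
```

On a host with RAPL access, calling sample_package_watts() returns the package's average draw over the interval. The wraparound branch matters because the counter is finite and resets to zero once it reaches its maximum range.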

2. Attribution (The Magic of eBPF)

This is where Kepler shines. Knowing the node consumes 300 Watts is easy. Knowing that Pod A is responsible for 50 of those Watts is hard. Kepler attaches eBPF programs to kernel tracepoints (like sched_switch) to track how much CPU time and how many CPU cycles each process ID (PID), and therefore each cgroup (container), consumes. It then calculates a ratio: if Container A used 10% of the CPU cycles, it is attributed roughly 10% of the CPU's dynamic power.
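
The ratio step can be sketched in a few lines of Python. This is a simplified illustration of proportional attribution, not Kepler's implementation: the function name, the container names, and the 200 W dynamic power figure are all made up for the example.

```python
def attribute_dynamic_power(node_dynamic_watts: float,
                            cpu_usage: dict[str, float]) -> dict[str, float]:
    """Split a node's dynamic power across containers in proportion to CPU usage.

    cpu_usage maps a container name to its usage (cycles or CPU time) over the
    same sampling window; only the ratios matter, not the unit.
    """
    total = sum(cpu_usage.values())
    if total == 0:
        # Nothing ran in this window, so there is no dynamic power to assign.
        return {name: 0.0 for name in cpu_usage}
    return {name: node_dynamic_watts * used / total
            for name, used in cpu_usage.items()}


# Hypothetical window: container-a used 10% of the observed CPU activity.
usage = {"container-a": 100.0, "container-b": 600.0, "container-c": 300.0}
shares = attribute_dynamic_power(200.0, usage)
print(shares)
```

With these inputs, container-a receives 10% of the 200 W, and the three shares add back up to the node's dynamic power.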

3. Modeling (The Backup Plan)

What if you are running in the cloud (AWS/GCP), where virtual machines typically don't give you access to physical RAPL counters? Kepler falls back to pre-trained power models from the Kepler Model Server. In effect, it sees "This is a Skylake CPU running at 50% load" and infers the likely power consumption based on laboratory benchmarks.
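
To give a feel for what "inferring power from load" means, here is a deliberately simple Python sketch that interpolates linearly between an idle and a full-load power figure. Kepler's trained models are more elaborate than this, and the idle and maximum wattages below are invented for illustration, so treat it as a toy model only.

```python
def estimate_node_watts(cpu_utilization: float,
                        idle_watts: float,
                        max_watts: float) -> float:
    """Toy linear power model: idle power plus a load-proportional part.

    cpu_utilization is a fraction between 0.0 and 1.0. idle_watts and
    max_watts would come from benchmarks of the CPU model in question.
    """
    if not 0.0 <= cpu_utilization <= 1.0:
        raise ValueError("cpu_utilization must be between 0.0 and 1.0")
    return idle_watts + (max_watts - idle_watts) * cpu_utilization


# Hypothetical benchmark figures for one CPU model: 100 W idle, 300 W at full load.
estimate = estimate_node_watts(0.5, idle_watts=100.0, max_watts=300.0)
print(f"Estimated draw at 50% load: {estimate:.0f} W")
```

The point of the sketch is the shape of the calculation: a model trained offline supplies the coefficients, and the live utilization signal supplies the input.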

Implementation Guide: From Zero to Visibility

Step 1: Install via Helm

The easiest way to deploy Kepler is with its official Helm chart. We recommend creating a dedicated namespace.

Bash

# Add the Kepler Helm repository
helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
helm repo update

# Install Kepler in the 'kepler-system' namespace
helm install kepler kepler/kepler \
  --namespace kepler-system \
  --create-namespace \
  --set serviceMonitor.enabled=true

Note: We enabled serviceMonitor above. This assumes you already have the Prometheus Operator running. Kepler then becomes discoverable by your monitoring stack, provided the ServiceMonitor's labels match your Prometheus instance's serviceMonitorSelector.

Step 2: Verify the Installation

Check that the pods are running on all your nodes. You should see one pod per node.

Bash

# -o wide adds a NODE column, so you can confirm there is one pod per node
kubectl get pods -n kepler-system -o wide

Step 3: Scrape with Prometheus

Kepler exposes metrics on port 9102 by default. If you didn't use the ServiceMonitor, add this job to the scrape_configs section of your prometheus.yml:

YAML

# Static scrape job for Kepler; list one target per node running a Kepler pod
- job_name: kepler
  static_configs:
  - targets: ['<node-ip>:9102']
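
Before building dashboards, it can help to confirm that the endpoint really serves Kepler metrics. The following is a minimal Python sketch, assuming the exporter answers at http://<node-ip>:9102/metrics as in the scrape job above. The helper names and the sample text (including its label values and number) are illustrative, not real output.

```python
import urllib.request


def metric_names(exposition_text: str) -> set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like: name{label="value"} 123.4
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw metrics page from an exporter."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Made-up sample in the text exposition format, used instead of a live endpoint.
SAMPLE = """# TYPE kepler_container_joules_total counter
kepler_container_joules_total{pod_name="web-1",container_namespace="shop"} 1234.5
"""

found = metric_names(SAMPLE)
print("kepler_container_joules_total" in found)
```

Against a live node you would pass the result of fetch_metrics("http://<node-ip>:9102/metrics") to metric_names() instead of the sample string.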

Step 4: Querying Watts per Pod

Now for the payoff. Open Grafana and try these PromQL queries.

Total Energy by Pod (in Joules, cumulative since the counters started):

sum(kepler_container_joules_total) by (pod_name, container_namespace)

Real-time Power Consumption (in Watts): Joules are cumulative energy, and one Watt is one Joule per second, so the per-second rate of change of the counter is power in Watts. We use the rate() function over a 1-minute window.

sum(rate(kepler_container_joules_total[1m])) by (pod_name)
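
If you want these numbers outside Grafana, the same query can be sent to the Prometheus HTTP API. The sketch below uses only the Python standard library and assumes a Prometheus server reachable at a base URL you supply (for example http://prometheus:9090, which is a placeholder). The function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

WATTS_BY_POD = "sum(rate(kepler_container_joules_total[1m])) by (pod_name)"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def watts_by_pod(response_body: str) -> dict[str, float]:
    """Turn an instant-query JSON response into {pod_name: watts}."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return {
        item["metric"].get("pod_name", "<unknown>"): float(item["value"][1])
        for item in payload["data"]["result"]
    }


def query_watts(base_url: str, timeout: float = 10.0) -> dict[str, float]:
    """Run the watts-per-pod query against a live Prometheus server."""
    url = build_query_url(base_url, WATTS_BY_POD)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return watts_by_pod(response.read().decode("utf-8"))
```

Calling query_watts("http://prometheus:9090") would return a dictionary keyed by pod name. Keeping the parsing in its own function makes it easy to test against a saved response without a running cluster.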

The "Energy Unit" Chargeback

Once you have this data, you can implement Green Chargeback. Traditionally, platform teams send invoices based on CPU requests or memory limits. This is inaccurate: requests and limits describe what a team reserved, not the energy its workloads actually consumed.

With Kepler, you can bill teams based on actual Energy Units. If the Data Science team's training job pins the CPU and GPU at 100% for 4 hours, Kepler captures that spike. If the Web Frontend team reserves 4 CPUs but only uses 5% of them, Kepler sees lower power draw (though their idle power attribution might still be high, which is another lesson in efficiency!).
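
As a rough illustration of the arithmetic behind an energy-based invoice, the Python sketch below converts joules to kilowatt-hours (1 kWh = 3,600,000 J) and applies a price. The team names, energy totals, and the 0.20 per kWh price are hypothetical. In practice the joule totals would come from a query over your billing period.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 1000 W * 3600 s


def joules_to_kwh(joules: float) -> float:
    """Convert an energy total in joules to kilowatt-hours."""
    return joules / JOULES_PER_KWH


def chargeback(joules_by_team: dict[str, float],
               price_per_kwh: float) -> dict[str, float]:
    """Compute each team's energy charge from its measured joules."""
    return {team: joules_to_kwh(joules) * price_per_kwh
            for team, joules in joules_by_team.items()}


# Hypothetical monthly totals per team, in joules.
measured = {"data-science": 36_000_000, "web-frontend": 3_600_000}
invoice = chargeback(measured, price_per_kwh=0.20)
print(invoice)
```

Here the data science team used 10 kWh and the web frontend team 1 kWh, so the charges differ by the same factor of ten regardless of what either team reserved.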

Next Steps

Dashboards: Import the official Kepler Grafana Dashboard (ID: 17701) to visualize your cluster's carbon footprint instantly.
CI/CD Integration: Use these metrics to fail a build if power consumption exceeds a baseline.
Carbon-Aware Scaling: Feed these metrics into KEDA (see Blog 34) to make scaling decisions based on real-time energy data.
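
For the CI/CD idea, one possible shape for the gate is a small script that compares a measured power figure against a stored baseline and returns a non-zero exit code on regression. This is a sketch under assumptions: the 10% tolerance, the argument order, and the function names are arbitrary choices, and obtaining the measured value from Prometheus is left to your pipeline.

```python
import sys


def exceeds_baseline(measured_watts: float,
                     baseline_watts: float,
                     tolerance: float = 0.10) -> bool:
    """Return True if measured power is above the baseline plus a tolerance."""
    return measured_watts > baseline_watts * (1.0 + tolerance)


def main(argv: list[str]) -> int:
    """Usage: gate.py <measured_watts> <baseline_watts>. Returns the exit code."""
    measured, baseline = float(argv[1]), float(argv[2])
    if exceeds_baseline(measured, baseline):
        print(f"FAIL: {measured:.1f} W exceeds baseline {baseline:.1f} W")
        return 1
    print(f"OK: {measured:.1f} W is within the baseline {baseline:.1f} W")
    return 0


if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv))
```

A pipeline step could run the script with the two numbers as arguments and let the exit code decide whether the build proceeds. The tolerance absorbs normal measurement noise so that small fluctuations do not fail builds.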
