Workload Scaler
Automatic scale-to-zero for idle Kubernetes workloads, with instant scale-up on demand — eliminating the cost of compute that runs but does nothing.
What It Does
The Workload Scaler monitors every Deployment, StatefulSet and DaemonSet in your cluster and tracks real activity. When a workload has been idle for a configurable period — no incoming traffic, no active processing — it scales the workload down to zero replicas. When traffic returns, the workload is scaled back up automatically before the first request is served.
For teams running development, staging, QA, or preview environments, this is where the largest wins come from. These environments are typically provisioned identically to production, used actively for 6–8 hours per business day, and left running at full cost for the remaining 16–18 hours — including nights, weekends, and holidays. The Workload Scaler reclaims that wasted spend without any change to how your team works.
40–70% cost reduction on non-production workloads within the first week of enabling this feature.
Workloads not serving traffic consume no compute. Pods are removed from nodes entirely, which also reduces node count when the cluster autoscaler consolidates freed capacity. Savings compound.
When a request arrives for a scaled-down workload, the scaler detects the signal and scales back up before the request times out. Most services are ready in under 30 seconds.
For latency-sensitive services that cannot tolerate cold-start delay, Scale-to-One keeps a single replica running at all times — saving 50–80% while eliminating cold-start entirely.
Monitoring agents, production workloads, and service meshes must remain running. Exclusion lists at both namespace and service level ensure nothing critical is touched.
30 minutes for interactive dev environments, 2–4 hours for batch pipelines. The timer resets on any detected activity, so workloads with rare but real usage are never prematurely scaled.
Every scale event is recorded: workload, direction, timestamp, and idle duration. Available in the UI and via API for cost reporting workflows.
How It Stays Safe
- Excluded namespaces are enforced in code, not configuration. System namespaces — `kube-system`, `monitoring`, `istio-system`, `cert-manager` — are permanently excluded and cannot be removed by any user action.
- Scale-up happens before traffic is routed. The scaler does not allow traffic to reach a workload until the target pod is in `Ready` state. Users never see a 503 from a workload that hasn't finished starting.
- Stateful workloads require explicit opt-in. StatefulSets are only eligible for scaling when explicitly opted in or running in a namespace configured for full scaling.
- Exclusions survive any toggle state. Disabling the global scaler does not remove entries from the exclusion list. Protections persist regardless of the master switch.
Settings Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `workload_scaling_status` | string | `"disabled"` | Master control. Set to `"enabled"` to activate workload scaling. Enable this before configuring idle timeout and exclusions. |
| `workload_scale_down_one_status` | string | `"disabled"` | Scale-to-One mode. When enabled, idle workloads scale to 1 replica instead of 0. Mutually exclusive with scale-to-zero for the same workload. |
| `duration` | integer (min) | 3 | Idle timeout before scale-down. Timer resets on any activity. Recommended: 30–60 min for dev/staging, 120–240 min for batch environments. |
| `excluded_namespaces` | list[string] | `[]` | Namespaces where no workload will be scaled. Additive on top of permanent system exclusions. Add all production and shared infrastructure namespaces here. |
| `excluded_services` | list[string] | `[]` | Individual service names excluded cluster-wide regardless of namespace. Use for specific workloads within an otherwise-eligible namespace. |
Recommended timeout by environment type
| Environment Type | Recommended Timeout |
|---|---|
| Interactive dev / staging | 30–60 minutes |
| CI/CD preview environments | 15–30 minutes |
| Scheduled batch environments | 120–240 minutes |
| QA (business-hours use) | 60 minutes |
Setup Sequence
Enable in non-production first
Set workload_scaling_status to "enabled". Add production and shared infrastructure namespaces to excluded_namespaces. Verify the exclusion list covers everything that must always be running.
Observe the first scale cycle
After the first idle timeout period, review the audit history. Confirm only the expected workloads were scaled down and nothing critical was touched.
Tune the timeout
Review how long workloads typically sit idle before use. Adjust the duration setting to match your team's actual usage patterns.
Enable Scale-to-One for latency-sensitive services (optional)
For services where cold-start delay would disrupt workflow, enable workload_scale_down_one_status for those specific services rather than globally.
Typical savings by environment profile
| Environment Profile | Typical Working Hours | Typical Saving |
|---|---|---|
| Dev/staging, 8-hour business day | 8 / 24 | 60–70% |
| Preview / review environments | 4 / 24 | 75–85% |
| QA, shared across teams | 10 / 24 | 55–65% |
| Mixed with production workloads | Variable | 30–50% |
Node Scheduler
Time-based node group/pool scaling that provisions the right infrastructure before you need it and removes it the moment you don't — without waiting for reactive signals.
What It Does
The Node Scheduler lets you define exactly when your cluster should be large and when it should be small. Instead of reacting to load after pods are already pending, you schedule capacity changes in advance: scale up 10 minutes before business hours begin, scale down to a minimal footprint overnight and through the weekend.
Most Kubernetes autoscaling is reactive — it waits for a signal and then acts. This means you are always slightly behind. The Node Scheduler eliminates this gap entirely for predictable workloads by making the scaling decision before the demand change occurs.
Scale up before demand arrives. Your workloads never experience scheduling latency from waiting for a new node to join the cluster during a planned traffic ramp.
At the end of the working day, scale to minimum footprint immediately — no cooldown hesitation. A 10-node to 2-node reduction overnight saves 80% of nightly compute cost.
Each schedule specifies the instance type and capacity type. Run on-demand during business hours for reliability; switch to spot overnight to stack two savings mechanisms simultaneously.
A weekday business-hours schedule, a Saturday maintenance schedule, a Sunday zero-capacity schedule, and holiday overrides — all composing together without conflict.
Target specific days of the week, a calendar date range, or a 24/7 baseline — covering sprint cycles, reporting periods, planned maintenance, and seasonal patterns without manual intervention.
Every scheduled scaling action is logged with name, timing, target configuration, and outcome. Queryable from the UI and API.
How It Stays Safe
- Minimum node floor always respected. Every schedule respects `min_on_demand_nodes`. The cluster is never reduced below your defined minimum, regardless of what a schedule specifies.
- Nodes are drained before removal. Pods are gracefully evicted and rescheduled on remaining nodes before a node is removed from the group. No pod is killed mid-request.
- PDB constraints are respected on drain. If the Auto-PDB Operator is running, drain operations wait for safe windows rather than forcing evictions that would violate availability guarantees.
- Conflicts resolve to the more conservative setting. If two schedules overlap, the scheduler applies the higher node count. Availability is never sacrificed due to a scheduling conflict.
Settings Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `node_scheduler_status` | string | `"disabled"` | Master control. All configured schedules are paused when disabled. Previously applied changes remain in place; the scheduler does not reverse them on disable. |
| `min_on_demand_nodes` | integer | 1 | Absolute minimum on-demand nodes that must remain running at all times. Applied to every schedule. Set to 2+ for clusters running databases or stateful services. |
Schedule object fields
| Field | Type | Description |
|---|---|---|
| `name` | string | Human-readable identifier shown in the UI and audit log |
| `scheduleType` | string | `"24/7"`, `"specificDays"`, or `"dateRange"` |
| `specificDays` | list | Days of week: `"Mon"`, `"Tue"`, etc. |
| `startAt` / `endAt` | string | Time in HH:MM format (local cluster timezone) |
| `instanceType` | string | Cloud instance type during this schedule (e.g. `m5.large`, `Standard_D4s_v3`) |
| `capacityType` | string | `"On-Demand"` or `"Spot"` |
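Using those fields, a business-hours/overnight schedule pair could look like the following sketch. The field names are from the schedule object reference above; the surrounding `schedules:` list structure, schedule names, and instance choices are illustrative assumptions.

```yaml
# Illustrative sketch — field names from the schedule object reference above;
# the "schedules:" wrapper and all concrete values are assumptions.
schedules:
  - name: business-hours
    scheduleType: specificDays
    specificDays: ["Mon", "Tue", "Wed", "Thu", "Fri"]
    startAt: "07:45"          # ~15 min before first expected traffic
    endAt: "19:00"
    instanceType: m5.large
    capacityType: On-Demand   # reliability during working hours
  - name: overnight-minimum
    scheduleType: specificDays
    specificDays: ["Mon", "Tue", "Wed", "Thu", "Fri"]
    startAt: "19:00"
    endAt: "07:45"
    instanceType: m5.large
    capacityType: Spot        # stack spot savings overnight
```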
Typical savings by usage pattern
| Usage Pattern | Node Hours Saved / Week | Typical Saving |
|---|---|---|
| 8h/day weekdays only (40h/168h) | 128 hours | 55–75% |
| 12h/day weekdays only (60h/168h) | 108 hours | 45–65% |
| Batch: 4h/night, 5 nights (20h/168h) | 148 hours | 75–85% |
| 24/7 with weekend scale-down | 48 hours | 30–40% |
Setup Sequence
Enable the scheduler
Set node_scheduler_status to "enabled". No schedules are active until you create them, so enabling the engine has no immediate effect on your cluster.
Create a single off-hours schedule first
Start with an overnight or weekend scale-down before adding a scale-up schedule. This validates drain behaviour for your workloads before you depend on the scheduler for morning provisioning.
Add the scale-up schedule
Once confident in scale-down behaviour, add the business-hours scale-up schedule. Set startAt 10–15 minutes before your first expected traffic to give nodes time to join and become ready.
Review the first full cycle
After the first complete weekday cycle, review the schedule history table and node cost data. Confirm timing is aligned with your team's usage patterns and adjust if needed.
Node Autoscaler
Reactive, policy-driven node scaling that adds capacity exactly when pods cannot be scheduled and removes underutilised nodes the moment they become unnecessary.
What It Does
The Node Autoscaler continuously monitors your cluster's scheduling state and node utilisation. When pods are waiting to be scheduled because no existing node has enough free capacity, the autoscaler adds a new node. When nodes run at low utilisation for longer than a configurable window, the autoscaler removes them and reschedules their workloads onto the remaining capacity.
Unlike the Node Scheduler, which acts on a pre-defined timetable, the Node Autoscaler responds to what is actually happening in real time. The two features are designed to work together: the scheduler handles predictable demand patterns, and the autoscaler handles the variance — unexpected traffic spikes, ad-hoc deployments, and gradual load growth between schedule cycles.
Without automated scaling, clusters are sized for peak load permanently. The autoscaler matches node count to actual workload at every point — the cloud bill curves with usage rather than sitting flat at peak provisioning.
Four parameters let you tune the tradeoff between aggressive cost savings and protection against premature removal: unneeded time, post-add delay, scan interval, and utilisation threshold.
Distinguishes between pods waiting for genuine capacity and pods temporarily unscheduled due to normal scheduling lag. Only genuine shortages trigger scale-up — preventing unnecessary node additions.
Every node addition and removal is recorded with reason, count before/after, and timestamp. Available as a chart and event table in the UI, and via API for incident analysis.
Scale-down events produce quantified savings; scale-up events record their cost. The financial effect of autoscaler behaviour is transparent and measurable, not just an assumed benefit.
The scheduler handles predictable patterns; the autoscaler handles variance within those windows. Together they are more efficient than either feature alone.
┌─────────────────────────────────────┐
│ Node Autoscaler │
│ │
Kubernetes API ──│─▶ Pending Pod Watcher │
│ │ │
│ ▼ │
│ Genuine shortage? ──No──▶ Skip │
│ │ Yes │
│ ▼ │
│ Scale-Up Decision ─────────────────│──▶ Cloud Provider API
│ │ (add node to group)
Kubernetes API ──│─▶ Utilisation Scanner │
│ │ │
│ Below threshold? │
│ For long enough? ──No──▶ Wait │
│ After cooldown? │
│ │ Yes │
│ ▼ │
│ Scale-Down Decision ───────────────│──▶ kubectl drain
│ (check PDB, min nodes) │ Cloud Provider API
└─────────────────────────────────────┘
│
▼
MongoDB / UI
(event log, cost data)
How It Stays Safe
- Cooldown after scale-up prevents thrashing. After adding a node, the autoscaler waits for `scale_down_delay_after_add` before evaluating any node for removal.
- Utilisation threshold prevents premature removal. Nodes hosting even light but sustained workloads are never removed — only nodes below threshold for the full `scale_down_unneeded_time` window.
- PDB constraints are fully respected on drain. If a drain would violate a PDB, the node is not removed. It remains eligible and drain is retried when the disruption window allows.
- Minimum node count is always preserved. The autoscaler will never reduce the cluster below `min_on_demand_nodes`, regardless of utilisation measurements.
Settings Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `node_autoscaler_status` | string | `"disabled"` | Master control. When `"enabled"`, continuously monitors and applies scaling decisions. |
| `scale_down_unneeded_time` | integer (min) | 5 | How long a node must remain below the utilisation threshold before it is eligible for removal. Increase to 20–30 min for bursty traffic; decrease to 5–10 min for batch/dev clusters. |
| `scale_down_delay_after_add` | integer (min) | 5 | Cooldown after a scale-up event before any node is considered for removal. Prevents thrashing during short-lived load spikes. Increase to 15–20 min for highly variable traffic. |
| `scale_down_utilization_threshold` | float (0–1) | 0.5 | CPU utilisation fraction below which a node is considered underutilised (0.5 = 50%). Use 0.6–0.7 for dev clusters; 0.3–0.4 for production with predictable load. |
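A conservative starting configuration combining these parameters might look like the sketch below. The YAML layout is an assumption; the parameter names and the value ranges come from the table above.

```yaml
# Illustrative sketch — layout assumed; parameter names and value guidance
# from the settings reference above. Conservative start for bursty traffic.
node_autoscaler:
  node_autoscaler_status: "enabled"
  scale_down_unneeded_time: 15           # minutes below threshold before a node is removable
  scale_down_delay_after_add: 15         # cooldown after scale-up; prevents thrashing
  scale_down_utilization_threshold: 0.5  # node underutilised below 50% CPU
```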
Setup Sequence
Enable with conservative defaults
Set node_autoscaler_status to "enabled". Set scale_down_unneeded_time to a conservative 10–15 min and leave scale_down_utilization_threshold at 0.5. This gives active autoscaling without aggressive scale-down.
Observe one week of history
Review the autoscaling event table. Look for: unnecessary scale-up events triggered by transient lag, thrashing (scale-down immediately followed by scale-up), and nodes that stayed underutilised for long periods.
Tune threshold and timing
Based on observations, adjust scale_down_utilization_threshold and scale_down_unneeded_time. Most clusters need one or two iterations before reaching a stable configuration.
Workload HPA Right-Sizing
AI-powered analysis of real workload behaviour that automatically corrects over-provisioned resource requests, right-sizes HPA boundaries, and eliminates the CPU and memory waste that accumulates in every long-running Kubernetes cluster.
What It Does
Every Kubernetes workload has two sets of numbers attached to it: resource requests (what the pod tells the scheduler it needs) and HPA configuration (how many replicas the autoscaler is allowed to run). Both are almost always wrong.
Resource requests are set at deployment time based on estimates, then never revisited. HPA boundaries have the same problem — minReplicas: 2 set in 2022 that was never justified by load testing, maxReplicas: 20 that was just a comfortable-feeling upper bound. HPA Right-Sizing replaces manual guesswork with continuous, ML-driven analysis of what your workloads actually do.
A workload requesting 2 CPU cores with P99 usage of 0.4 cores wastes 1.6 cores of reserved capacity. Multiplied across hundreds of workloads, these corrections produce a substantial reduction in the resource footprint the scheduler must provision for — which directly reduces node count.
Monitors actual CPU and memory consumption and computes P50/P95/P99 distributions. Recommendations account for both normal operation and peak periods without over-provisioning for absolute outliers.
Analyses historical replica counts alongside request rates to find the minimum replica count that keeps latency within bounds, and a maximum that represents a realistic ceiling — not a feared upper limit.
For every workload, a forecast of future resource needs is surfaced in the UI before any change is applied. You review the model's projection and validate it against upcoming traffic before approving.
Before any recommendation is applied, run a full dry-run that shows exactly what would change: every workload's current config, the recommended replacement, and the projected impact.
Changes are applied progressively. If the workload shows resource pressure — OOMKill events, throttling, error rate increase — the change is automatically rolled back. No manual monitoring required per workload.
Workloads with intentionally conservative resource requests — for compliance, SLA, or contractual reasons — can be protected while the rest of the cluster benefits from continuous optimisation.
How It Stays Safe
- Recommendations always include headroom. Requests are set to P95/P99 values plus a safety buffer. Workloads always have room to absorb normal variance without hitting their limits.
- Limits are never set below current spikes. If a workload has spiked to 800 Mi of memory at any point in the observation window, the recommended memory limit will be at or above 800 Mi.
- HPA changes respect PDB constraints. A configuration update that would create a PDB conflict is flagged and not applied until the conflict is resolved.
- Single-replica changes require explicit approval. Reducing `minReplicas` to 1 is flagged as higher-risk and requires explicit approval regardless of what the data suggests.
- Rollback is always available. For every applied change, the previous configuration is stored and can be restored with a single action from the UI or API.
Settings Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `auto_updater_status` | string | `"disabled"` | Controls whether resource request and limit recommendations are automatically applied. Observe recommendations for 1–2 weeks before enabling. |
| `hpa_updater_status` | string | `"disabled"` | Controls whether HPA configuration recommendations (min/max replicas, target utilisation) are automatically applied. Independent of `auto_updater_status`. Enable after resource right-sizing is stable. |
| `excluded_namespaces` | list[string] | `[]` | Namespaces excluded from HPA right-sizing analysis. Workloads in these namespaces receive no recommendations and are never modified. |
| `excluded_workloads` | list[object] | `[]` | Individual workloads excluded by `{namespace, workload_name}`. Use for workloads with deliberately unusual resource profiles — GPU services, large JVM heaps, or event-driven spike workloads. |
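A configuration sketch for a typical staged rollout is shown below. The YAML layout is an assumption; the parameter names and the `{namespace, workload_name}` exclusion shape come from the reference above, and all namespace and workload names are hypothetical.

```yaml
# Illustrative sketch — layout assumed; parameter names and exclusion object
# shape from the settings reference above. All names are hypothetical.
hpa_right_sizing:
  auto_updater_status: "enabled"   # apply resource request/limit corrections
  hpa_updater_status: "disabled"   # enable only after right-sizing is stable
  excluded_namespaces:
    - compliance                   # contractually fixed resource requests
  excluded_workloads:
    - namespace: ml-serving
      workload_name: gpu-inference # unusual profile: GPU service
    - namespace: data
      workload_name: spark-driver  # unusual profile: large JVM heap
```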
Typical impact by cluster profile
| Cluster Profile | CPU Reduction | Memory Reduction | HPA Impact |
|---|---|---|---|
| Legacy cluster, requests set at launch and never updated | 50–75% over-request | 40–60% over-request | Min replicas 30–50% too high |
| Active cluster, occasional updates | 25–45% over-request | 20–35% over-request | Max replicas often too low |
| Recently right-sized manually | 10–20% ongoing drift | 10–20% ongoing drift | Continuous correction |
Setup Sequence
Analysis only — Week 1–2
Keep both auto_updater_status and hpa_updater_status disabled. Let the engine build its observation window. Review recommendations in the UI and validate against your knowledge of each workload.
Enable resource right-sizing — Week 3
Enable auto_updater_status. The engine applies CPU and memory corrections during low-traffic periods. Monitor for one sprint. Check for OOMKill events or throttling that might indicate an over-aggressive recommendation.
Enable HPA right-sizing — Week 5+
Once resource request changes have been running stably for two weeks, enable hpa_updater_status. HPA changes will be applied with the canary rollout mechanism active, with automatic rollback if issues are detected.
FinOps Savings Dashboard
Unified financial visibility across every optimisation running in your cluster — real numbers, real time, with full audit history and export for reporting.
What It Does
The FinOps Savings dashboard aggregates the cost impact of every active optimisation feature into a single view that answers the question your leadership team will ask: how much money is this actually saving us?
Most cost-monitoring tools show you what you spent. The Savings dashboard shows you what you spent compared to what you would have spent without optimisation — and it shows the difference, period by period, feature by feature, in concrete currency values anchored to your own pre-optimisation baseline, not a vendor-supplied benchmark.
Two lines: actual spend and what spend would have been without optimisation. The widening gap is your saving. As the divergence grows over time, you see the compounding effect of continuous optimisation.
1 day, 7 days, 15 days, 30 days, since start of month, or since start of year. Matches the reporting cadence of engineering standups, monthly cost reviews, and quarterly business reviews.
Day-by-day breakdown: baseline cost, actual cost, saving. Identify specific days where cost spiked unexpectedly, validate scheduled scale-downs, produce raw data for chargeback conversations.
Every optimisation action logged with: which feature triggered it, which resource was affected, what action was taken, when, cost impact, and duration. Complete record for compliance, change management, or internal audit.
Full savings dataset and audit log exported to CSV at any time. Imports directly into Excel, Google Sheets, or your BI tool. Use for board reporting, finance reconciliation, or external cost platform integration.
What the Workload Scaler contributed, what Node Scheduler windows saved, what HPA right-sizing freed up. Each component's saving is attributed separately so you know where the value is coming from.
How Savings Are Calculated
Baseline
Derived from your cluster's measured compute spend before the optimisation platform was deployed. Collected during an initial observation period before enabling optimisation features. Anchored to actual historical data from your cluster — not a theoretical scenario or industry benchmark.
Cost per resource unit
Calculated using actual on-demand pricing for instance types in your cluster, retrieved from your cloud provider's pricing API. Spot instances are valued at their actual charged price. Reported figures match your cloud bill.
Saving attribution by feature
| Feature | Saving Attributed When |
|---|---|
| Workload Scaler | Workload is at zero replicas; associated node capacity freed |
| Node Scheduler | Cluster running at reduced node count during a schedule window |
| Node Autoscaler | Node removed due to measured low utilisation |
| HPA Right-Sizing | Reduced resource requests enable higher pod density per node |
| Node Optimizer | Node group reconfigured to lower-cost instance types or mix |
A saving is only recorded when the relevant optimisation action is confirmed to have completed successfully. Projected savings from pending actions are shown separately and are not included in the realised total.
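As an illustrative formalisation (this equation is not from the platform itself, just a restatement of the attribution rules above): the realised saving over a reporting window sums, per feature, the baseline-minus-actual cost of every confirmed action.

```latex
% Illustrative only: A_f(T) denotes the optimisation actions of feature f
% confirmed completed within window T; c_base and c_actual are the baseline
% and actual cost attributable to each action.
S(T) \;=\; \sum_{f \in \mathrm{features}} \; \sum_{a \in A_f(T)} \bigl( c_{\mathrm{base}}(a) - c_{\mathrm{actual}}(a) \bigr)
```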
Recommended Use by Audience
For engineering teams — weekly
Review using the 7-day window during sprint retrospectives. Look for days where savings were lower than expected — this may indicate a new workload added to an excluded namespace, or an idle timeout set too conservatively. The audit log surfaces every action.
For engineering managers — monthly
Use the 30-day view. Export the daily cost table to CSV and include the realised savings total in your monthly infrastructure report. Direct, auditable line between platform tooling investment and monthly cloud spend reduction.
For finance and procurement — quarterly
Use the "1st of Year" window for annualised data. The daily cost table and audit log CSV export together provide the detail required for cloud spend reconciliation and vendor negotiation conversations.
Example annual review output
Annual cluster compute spend: $201,600
Without optimisation (baseline extrapolated): $341,000
Total annual saving: $139,400 (40.9%)
Saving vs. platform subscription cost: 9.7× ROI
Time window reference
| Window Option | Description | Typical Use |
|---|---|---|
| 1 Day | Rolling 24-hour window | Incident investigation, same-day validation |
| 7 Days | Rolling 7-day window | Weekly engineering standups (default) |
| 15 Days | Rolling 15-day window | Sprint-level reporting |
| 30 Days | Rolling 30-day window | Monthly finance reporting |
| 1st of Month | Calendar month to date | Month-over-month cost tracking |
| 1st of Year | Year to date | Annual reporting, ROI calculation |
Auto-PDB Operator
Automatic Pod Disruption Budget management for every workload in your cluster — with intelligent calculation, single-replica drain protection, and zero-downtime node maintenance built in.
What It Does
The Auto-PDB Operator continuously reconciles PodDisruptionBudgets across all workloads in a cluster. It scans Deployments, StatefulSets, and DaemonSets, calculates optimal PDB values based on replica count, workload type, HPA configuration, and criticality, then creates or updates PDB resources accordingly. The operator runs on a configurable schedule (default: every 1 minute) and persists reconciliation history for full audit visibility.
Without PDBs, cluster operations — node drains during scale-down, rolling updates, cluster upgrades — can evict all replicas of a workload simultaneously. This turns routine infrastructure maintenance into a production incident. The PDB Operator is the safety layer that makes every other optimisation feature in this platform safe to run aggressively.
Every node drain triggered by the Node Scheduler, Node Autoscaler, or Node Optimizer runs through PDB checks. Without PDBs in place, a scale-down that removes three nodes simultaneously could take down all replicas of a critical service. The PDB Operator ensures optimisation actions never cross the line into an outage.
PDBs are created, updated, and deleted automatically as workloads are deployed, scaled, and removed. No manual PDB management. No PDBs left behind when workloads are deleted.
When a workload has an HPA, the operator uses effective replica count — the worst-case minimum the HPA could scale to — rather than the current count. A deployment at 5 replicas with an HPA minimum of 1 is treated as a single-replica workload for PDB purposes.
Automatically detects database StatefulSets (PostgreSQL, MySQL, MongoDB, Redis, Kafka, etcd, and more) by inspecting labels and workload names. Database workloads receive quorum-based PDBs: minAvailable = (n/2) + 1.
Single-replica workloads with preScale policy are automatically scaled 1→2 before a drain, held until the new pod is Ready on a different node, then scaled back 2→1 after the drain completes. Zero downtime, fully automatic.
Workloads in critical namespaces or flagged as critical receive stricter PDB values automatically — higher minAvailable floors and lower maxUnavailable ceilings — without any manual annotation.
PDBs whose corresponding workload no longer exists are automatically removed each reconciliation cycle, keeping the cluster clean and preventing stale PDBs from blocking future drain operations.
┌──────────────────────────────────────────────────────────────┐
│                         PDB Operator                         │
│                                                              │
│  ┌─────────────┐   ┌───────────────┐   ┌──────────────────┐  │
│  │  Scheduler  │──▶│  Calculator   │──▶│  K8s API Server  │  │
│  │             │   │               │   │  (create/update  │  │
│  └─────────────┘   │  • Gather     │   │  PDB resources)  │  │
│                    │  • Calculate  │   └──────────────────┘  │
│  ┌─────────────┐   │  • Build PDB  │                         │
│  │  PreScale   │   └───────────────┘   ┌─────────────────┐   │
│  │  Manager    │──────────────────────▶│  MongoDB        │   │
│  │ (node watch)│                       │  • Run history  │   │
│  └─────────────┘                       │  • PreScale logs│   │
│                                        └─────────────────┘   │
└──────────────────────────────────────────────────────────────┘
PDB Calculation Logic
The operator calculates an effective replica count representing the worst-case minimum a workload can have — accounting for HPA lower bounds. PDB values are then derived from this effective count.
Deployments
| Effective Replicas | Strategy | Value | Reasoning |
|---|---|---|---|
| 1 | Policy-based | See Single Replica Policies | Defers to SingleReplicaPolicy setting |
| 2–3 | minAvailable | n−1 (absolute) | Small deployment: keep at least 1 pod available at all times |
| 4+ | minAvailable | 50% | Percentage-based for larger deployments — scales with replica count |
StatefulSets
| Effective Replicas | Database? | Strategy | Value |
|---|---|---|---|
| 1 | Any | Policy-based | Defers to SingleReplicaPolicy |
| 2+ | Yes (auto-detected) | minAvailable | (n/2)+1 — quorum requirement for data consistency |
| 2+ | No | minAvailable | 67% — two-thirds majority for availability |
DaemonSets
| Strategy | Value | Reasoning |
|---|---|---|
| maxUnavailable | 10% (minimum 1) | Allows gradual rolling drain across nodes without taking down the entire DaemonSet |
Critical workload escalation
| Classification | minAvailable floor | maxUnavailable cap |
|---|---|---|
| Critical namespace | At least 67% | Maximum 25% |
| Critical workload classification | At least 75% | Maximum 20% |
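Applying these rules, the operator's output could look like the manifests below. The PodDisruptionBudget schema is standard Kubernetes `policy/v1`; the workload names, namespaces, and labels are hypothetical examples chosen to exercise two of the rules above.

```yaml
# 5-replica PostgreSQL StatefulSet (auto-detected database):
# quorum rule minAvailable = (5/2) + 1 = 3
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb        # hypothetical name
  namespace: data           # hypothetical namespace
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: postgres
---
# 3-replica Deployment: small-deployment rule minAvailable = n − 1 = 2
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-gateway-pdb
  namespace: services
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-gateway
```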
PreScale Manager — Zero-Downtime Single-Replica Drains
The preScale policy is the recommended default for production clusters. It provides the safety of a blocked drain with fully automatic handling — no manual intervention required during node maintenance.
Single Replica Policy comparison
| Policy | PDB Created | Behaviour on Drain | Downtime | Automation |
|---|---|---|---|---|
| `exempt` | No | Pod evicted immediately | Yes | None |
| `allow` | Yes (`maxUnavailable: 1`) | Pod evicted immediately | Yes | None |
| `block` | Yes (`maxUnavailable: 0`) | Drain blocked until manual scale-up → drain → scale-down | No | Manual |
| `preScale` | Yes (`maxUnavailable: 0`) | Auto scale 1→2, wait Ready, drain, scale 2→1 | No | Fully automatic |
Node Cordoned (kubectl cordon / drain)
│
▼
┌──────────────────────────────┐
│ Node Watcher detects │
│ Unschedulable = true │
└────────┬─────────────────────┘
▼
┌──────────────────────────────┐
│ Find single-replica workloads │
│ on this node using preScale │
└────────┬─────────────────────┘
▼
┌──────────────────────────────┐
│ Scale 1 → 2 │ State: scaling_up
│ (patch replicas) │
└────────┬─────────────────────┘
▼
┌──────────────────────────────┐
│ Wait for new pod Ready │ State: wait_ready
│ on a DIFFERENT node │ (polls every 5s, timeout 5m)
└────────┬─────────────────────┘
▼
┌──────────────────────────────┐
│ Drain evicts old pod │ State: drainable
│ (PDB allows with 2 replicas) │
└────────┬─────────────────────┘
▼
┌──────────────────────────────┐
│ Scale 2 → 1 │ State: completed
│ (restore original count) │
└──────────────────────────────┘
- Rollback on ready timeout. If the new pod does not become Ready within 5 minutes, the operator rolls back to the original replica count and marks the operation as failed.
- Drainable TTL. Records in `drainable` state expire after `max(2 × readyTimeout, 10 minutes)`. If the drain was cancelled or the node was uncordoned, the operator forces a scale-down regardless.
- Per-workload policy override. Individual workloads can override the global policy via the annotation `pdb.terakube.io/single-replica-policy: "allow"`.
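The per-workload override is applied as a workload annotation. The annotation key is from this documentation; the Deployment itself is a hypothetical example, shown as a fragment.

```yaml
# Hypothetical workload opting out of the global preScale policy:
# this pod will be evicted immediately on drain (policy "allow").
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-generator    # hypothetical name
  namespace: tools          # hypothetical namespace
  annotations:
    pdb.terakube.io/single-replica-policy: "allow"
spec:
  replicas: 1
  # selector and pod template omitted for brevity
```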
Settings Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `SingleReplicaPolicy` | string | `"preScale"` | Global policy for single-replica workloads. Options: `exempt`, `allow`, `block`, `preScale`. Use `preScale` for production: zero downtime, fully automatic. |
| `DryRun` | boolean | `false` | When true, performs all calculations and logs what it would do, but does not create, update, or delete any PDB resources. Reconciliation runs are still saved with a `[DRY RUN]` prefix. Recommended for initial deployment. |
| `ReconcileInterval` | duration | `1m` | How frequently the operator scans all workloads and reconciles PDB state. Shorter intervals keep PDBs more current for rapidly changing clusters. |
| `ReadyTimeout` | duration | `5m` | Maximum time the PreScale Manager waits for the new pod to become Ready before rolling back the scale-up. |
| `CleanupOrphanedPDBs` | boolean | `true` | When enabled, PDB resources whose corresponding workload no longer exists are automatically deleted each reconciliation cycle. |
| `CriticalNamespaces` | list[string] | `[]` | Namespaces whose workloads receive stricter PDB values: `minAvailable` raised to at least 67%, `maxUnavailable` capped at 25%. |
| `ExcludedNamespaces` | list[string] | `[]` | Namespaces where no PDB will be created or managed. Use for test or dev namespaces where availability is not a concern. |
| `ExcludedWorkloads` | list[string] | `[]` | Individual workload names excluded from PDB management. Reserve for batch jobs, dev tooling, or other workloads that genuinely should not have disruption protection. |
Best Practices
- Start with dry run. Enable `DryRun: true` when first deploying the operator. Review the planned PDB calculations before applying real changes to the cluster.
- Use preScale for production. The `preScale` policy eliminates downtime for single-replica workloads during drains without requiring manual intervention. It is the recommended default for any cluster where availability matters.
- Populate CriticalNamespaces. Ensure all production namespaces are listed so they receive stricter PDB values automatically without per-workload annotation.
- Test node drains early. After enabling the operator, validate behaviour with `kubectl drain <node> --ignore-daemonsets` on a non-production node and confirm preScale operations complete successfully.
- Monitor reconciliation runs. Review runs with `status: "partial"` or `status: "failed"` to catch configuration issues early. Full history is available in the platform audit log.
- Use per-workload annotation overrides sparingly. Reserve annotation overrides (`pdb.terakube.io/single-replica-policy`) for workloads that genuinely need different behaviour, such as non-critical dev services where brief downtime during drains is acceptable.
Node Optimizer
AI-powered node pool composition analysis that identifies where your cluster is over-spending on the wrong instance types and recommends — or automatically applies — a lower-cost configuration without touching a single workload.
What It Does
Most organisations running Kubernetes on AWS EKS or Azure AKS are over-provisioned by 60–90%. This happens naturally: teams provision for peak load, deployments accumulate reserved capacity that is rarely used, and no single person has full visibility across the cluster. Node Optimizer fixes this systematically.
The engine analyses real workload behaviour — not just what resources pods claim to need, but what they actually use — and identifies the optimal combination of instance types and on-demand/spot mix that delivers the same workload capacity at materially lower cost. A cluster running at 0.76% CPU utilisation across four nodes is a common real-world pattern. Node Optimizer identifies this and proposes a configuration that costs 90% less while maintaining the same workload capacity.
Over-provisioned clusters with no spot usage typically see 70–92% cost reduction. Clusters with reasonable utilisation and no spot see 40–65%. Already-optimised clusters with mixed on-demand/spot see 20–40% ongoing improvement as the engine tracks drift over time.
Evaluates the full AWS and Azure instance catalogs to find the combination of sizes and families that best matches your actual workload profile. Larger, more expensive types are replaced where smaller ones are sufficient.
Calculates the right mix — enough on-demand nodes to keep critical workloads running, enough spot to maximise savings — and maintains that balance automatically, including recovery after spot interruptions.
When a spot node is reclaimed by the cloud provider, the engine responds without human intervention: it temporarily scales up on-demand capacity to absorb displaced workloads, finds new spot capacity, then restores the original mixed setup and releases the temporary nodes.
Distinguishes between pods waiting due to genuine capacity shortage and those experiencing normal scheduling lag. Only true shortages trigger a scale-up, preventing unnecessary node additions from transient states.
Every recommendation surfaces as a side-by-side view: current cluster composition and cost vs. recommended composition and projected cost. Nothing is applied until you review and approve — or until auto-apply triggers after passing all safety gates.
Recommendations below 70% model confidence are never auto-applied. Recommendations projecting less than your configured savings threshold are surfaced for review but not executed automatically.
How It Stays Safe
- It never increases your bill. A recommendation is only applied if it produces a strictly positive saving. If projected costs would equal or exceed current costs, no action is taken.
- It never regresses a cluster you already optimised. If your cluster is already running at optimal cost, a new cycle will not apply a marginal recommendation that risks destabilising the current configuration.
- It never touches system infrastructure. Namespaces hosting cluster-critical components — networking, certificate management, service mesh, Kubernetes internals — are permanently excluded and cannot be overridden.
- It requires a confidence threshold. Recommendations below 70% model confidence are not auto-applied, regardless of projected savings.
- It requires a minimum savings threshold. By default, auto-apply only activates when projected savings exceed 20% of current spend. This filters out marginal changes that may not justify an infrastructure update.
- Spot interruptions are handled without permanent changes. Recovery from spot reclamation uses temporary on-demand nodes. No permanent configuration changes are made under pressure.
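The auto-apply gating described above can be summarised as a short Python sketch. This is a restatement of the documented rules, not the engine's actual implementation; the function and field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    current_cost: float      # current hourly spend, USD
    projected_cost: float    # projected hourly spend, USD
    confidence: float        # model confidence, 0.0 to 1.0

def should_auto_apply(rec: Recommendation, savings_threshold: float = 0.20) -> bool:
    """Apply only if every safety gate from the documentation passes."""
    # Gate 1: never increase the bill; the saving must be strictly positive
    if rec.projected_cost >= rec.current_cost:
        return False
    # Gate 2: model confidence must be at least 70%
    if rec.confidence < 0.70:
        return False
    # Gate 3: projected savings must exceed the threshold (default 20% of spend)
    savings_ratio = (rec.current_cost - rec.projected_cost) / rec.current_cost
    return savings_ratio > savings_threshold

# A 40% projected saving at 0.9 confidence passes all gates
print(should_auto_apply(Recommendation(100.0, 60.0, 0.9)))   # True
# A 15% saving is below the default 20% threshold
print(should_auto_apply(Recommendation(100.0, 85.0, 0.9)))   # False
```

Note that the gates are conjunctive: a very confident recommendation with a marginal saving is still held for manual review.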
Settings Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `auto_optimization` | string | `"disabled"` | Controls whether the analysis engine runs. When `"enabled"`, the engine collects cluster metrics, runs the optimisation model, and generates recommendations. Enable this first and review recommendations before enabling `auto_apply`. |
| `auto_apply` | boolean | `false` | Controls whether approved recommendations are automatically applied. When false, recommendations are stored for manual review. When true, recommendations that pass all safety gates are applied without manual approval. Also starts the pending pod watch. |
| `prometheus_url` | string (URL) | auto-discovered | URL of your Prometheus HTTP API for GPU utilisation metrics. If empty, the engine scans Kubernetes services and auto-discovers the Prometheus endpoint. |
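Following the observe-then-automate sequence described later, a week-1 configuration might look like this sketch (parameter names are from the table above; the file format depends on your deployment):

```yaml
auto_optimization: "enabled"   # week 1: generate recommendations only
auto_apply: false              # enable from week 3, once recommendations prove sensible
prometheus_url: ""             # empty: auto-discover the Prometheus endpoint
```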
Typical savings by cluster profile
| Cluster Profile | Typical Saving |
|---|---|
| Over-provisioned, 0% spot, multiple nodes | 70–92% |
| Reasonable utilisation, 0% spot | 40–65% |
| Mixed on-demand / spot, well-sized | 20–40% |
| Already optimised with spot | 10–25% ongoing |
Supported platforms
| Platform | Node Group Management | Spot / Preemptible |
|---|---|---|
| AWS EKS | EKS Managed Node Groups | EC2 Spot via Mixed Instances Policy |
| Azure AKS | AKS Agent Pools | Azure Spot Node Pools |
Recommended Setup Sequence
Week 1 — Observe
Enable auto_optimization only. Leave auto_apply disabled. Review recommendations generated over the first week: projected savings, confidence scores, and the specific node groups targeted. This builds familiarity with the model's behaviour on your specific cluster before any changes are applied.
Week 2 — Validate
Review whether recommendations are consistent and sensible.
Week 3 onwards — Automate
Enable auto_apply. The engine will now apply recommendations that pass all safety gates without requiring manual approval. Monitor the cost dashboard weekly. All applied recommendations are logged with full detail including savings projections, confidence scores, and the specific actions taken.
Node Optimizer works best when combined with Workload HPA Right-Sizing and the Node Autoscaler. The Optimizer selects the right instance types and spot mix; right-sizing corrects over-provisioned resource requests; the Autoscaler handles real-time variance. Together they deliver the full potential saving across all three dimensions: instance cost, capacity timing, and utilisation efficiency.
Smart Provisioner
ML-driven workload placement intelligence that predicts exactly how much capacity each workload needs before it scales — eliminating the reactive guesswork that causes both over-provisioning and scheduling failures.
What It Does
Traditional Kubernetes scheduling is reactive by design: a pod asks for resources, the scheduler finds a node with enough free capacity, and the pod lands. Nobody checks whether the declared resource request reflects what the workload will actually consume. The result is nodes that look 90% allocated on paper but are 15% utilised in reality — and nodes that look 30% free but can't accept a new pod because the remaining memory is fragmented across dozens of tiny gaps.
The Smart Provisioner changes this. It models each workload's real resource consumption — not its declared request — over a rolling 30-day window and uses that model to predict what the workload will need at the moment of scheduling. When the cluster needs new capacity, the provisioner selects the instance type, size, and placement zone that will actually fit the incoming load, not just the load that the YAML says is coming.
Most clusters fail to schedule new pods not because they lack total capacity, but because the right capacity isn't available in the right node at the right time. The Smart Provisioner eliminates scheduling failures caused by resource fragmentation — the hidden inefficiency that forces teams to over-provision "just in case".
Builds a P50/P95/P99 consumption model per workload from 30 days of real telemetry. Provisioning decisions are based on actual behaviour, not YAML declarations written months ago.
Selects instance types that minimise wasted capacity across the node pool — fitting more workloads onto fewer nodes without increasing scheduling failure rate or application latency.
Places workloads across availability zones and failure domains to satisfy topology spread constraints without requiring manual affinity rules on every deployment.
Maintains a configurable headroom of warm capacity that absorbs burst traffic without waiting for a node to spin up — keeping P99 latency stable during sudden load increases.
When the cluster becomes fragmented over time — a natural result of rolling deployments and partial scale-downs — the provisioner gradually rebalances workloads without evictions or downtime.
Tracks pending-pod root causes and resolves them before they become incidents. Fragmentation, incorrect resource requests, and topology mismatches are all surfaced and corrected automatically.
How It Works
Workload deploys / HPA fires / traffic spike detected
│
▼
┌──────────────────────────────────────────────────────┐
│ Smart Provisioner Engine │
│ │
│ 1. Fetch workload consumption model (30d history) │
│ 2. Predict peak demand at P95 + safety buffer │
│ 3. Evaluate current node pool for fit: │
│ • Bin-pack against real (not declared) usage │
│ • Check topology spread constraints │
│ • Check zone balance and AZ capacity │
│ 4. If fit found: schedule immediately │
│ 5. If not: pre-provision optimal instance type │
│ before the pod becomes Pending │
└──────────────────────────────────────────────────────┘
│
▼
Pod scheduled. No Pending state.
No CloudWatch alarm. No 3am page.
Rebalancing cycle
The provisioner evaluates the cluster's packing efficiency. If fragmentation has caused the cluster to use more nodes than the optimal bin-packing solution requires, it migrates workloads and removes the surplus nodes. All migrations respect PodDisruptionBudgets and are done one node at a time with a configurable inter-drain delay.
Safety & Settings
- Never evicts a pod that would violate a PDB. All rebalancing operations check PodDisruptionBudgets before moving any workload.
- Rebalancing pauses during high-load windows. If cluster CPU utilisation exceeds a configurable threshold, rebalancing is deferred until load drops.
- Predictive model degrades gracefully. For workloads with fewer than 7 days of history, the provisioner falls back to declared resource requests plus a conservative buffer rather than making a prediction from insufficient data.
- Topology constraints are never violated. Zone spread requirements, node affinity, and taints/tolerations are treated as hard constraints — the provisioner never overrides them to achieve a better packing ratio.
Hotspot & Pressure Detection
Real-time identification of nodes, namespaces, and workloads under abnormal resource pressure — surfacing problems before they become outages and pinpointing the exact source of cluster instability with no manual investigation required.
What It Detects
A hotspot is any cluster resource — a node, a namespace, or a specific workload — where resource consumption is abnormally high relative to its baseline or its neighbours. Left undetected, hotspots cause cascading failures: a single noisy-neighbour workload degrades every other pod on its node, a namespace with a runaway deployment starves adjacent namespaces of CPU, or a node with a memory leak slowly squeezes everything else off until the kubelet starts OOMKilling pods at random.
Pressure detection goes further. It identifies not just that a workload is using a lot of CPU, but that the workload is throttled — consuming all its allowed CPU but being held back from more by its limits, which is a silent performance degradation that standard utilisation metrics never surface.
Continuously compares each node's CPU, memory, and network I/O against its historical baseline and against peer nodes in the same node group. Nodes with abnormal consumption patterns are flagged immediately.
Tracks cpu_throttled_seconds_total per container. Workloads that are consistently throttled receive a recommendation to increase CPU limits — resolving silent performance degradation that never appears in standard dashboards.
Detects nodes approaching the OOM threshold before the kubelet starts evicting pods. Alerts fire with enough lead time to either migrate workloads or add capacity before any pod is killed.
Identifies workloads whose resource consumption negatively affects co-located pods. Provides actionable recommendations: move the workload to a dedicated node pool, adjust limits, or apply CPU pinning.
Ranks namespaces by their resource pressure score — a composite of CPU/memory utilisation relative to limits, throttle rate, and eviction history. High-pressure namespaces are surfaced for review before they cause incidents.
Distinguishes between a sudden anomalous spike (likely a runaway process or a traffic incident) and a gradual upward trend (likely organic growth that needs capacity planning). Each gets a different response playbook.
Detection Engine
The detection engine runs on a 15-second scan interval and evaluates the following signals per resource:
| Signal | Source | Hotspot Condition |
|---|---|---|
| CPU utilisation ratio | Metrics Server / Prometheus | > 2× workload's 7-day P95 baseline |
| CPU throttle rate | container_cpu_cfs_throttled_seconds_total | > 25% of CPU time throttled over a 5-min window |
| Memory utilisation ratio | Metrics Server / Prometheus | > 85% of memory limit, rising |
| OOM event rate | Kubernetes Events API | Any OOMKill in previous 10 minutes |
| Pod eviction rate | Kubernetes Events API | > 2 evictions per namespace per hour |
| Pending pod duration | Kubernetes pod status | Pod Pending > 90 seconds with no scheduling progress |
| Node condition pressure | Node status conditions | MemoryPressure or DiskPressure = True |
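Three of the table's conditions, expressed as a minimal sketch. The thresholds come from the table above; the function and parameter names are illustrative:

```python
def cpu_hotspot(current_usage: float, p95_baseline_7d: float) -> bool:
    """CPU utilisation ratio: flag when usage exceeds 2x the 7-day P95 baseline."""
    return current_usage > 2.0 * p95_baseline_7d

def throttle_hotspot(throttled_seconds: float, window_seconds: float = 300.0) -> bool:
    """CPU throttle rate: flag when >25% of CPU time in the 5-min window was throttled."""
    return throttled_seconds / window_seconds > 0.25

def memory_hotspot(usage_bytes: float, limit_bytes: float, rising: bool) -> bool:
    """Memory utilisation ratio: flag when above 85% of the limit and still rising."""
    return usage_bytes / limit_bytes > 0.85 and rising

print(cpu_hotspot(current_usage=3.2, p95_baseline_7d=1.5))   # True: 3.2 > 3.0 cores
print(throttle_hotspot(throttled_seconds=90.0))              # True: 30% throttled
print(memory_hotspot(3.6e9, 4.0e9, rising=False))            # False: 90% but stable
```

The memory case shows why the "rising" qualifier matters: a workload sitting steadily at 90% of its limit is healthy, while one climbing through 85% is heading for an OOMKill.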
Alerts & Automated Actions
Most Kubernetes incidents are preceded by 15–45 minutes of detectable pressure signals. Hotspot Detection catches these signals and either resolves them automatically or delivers a precise, actionable alert — before a pod is OOMKilled, before a node becomes NotReady, before your on-call rotation gets woken up.
| Hotspot Type | Automated Response | Alert Sent |
|---|---|---|
| CPU throttling > 25% | Flag for HPA Right-Sizing review; suggest limit increase | Yes — with throttle rate and affected workload |
| Node memory > 85% and rising | Cordon node, trigger workload migration via PDB-safe drain | Yes — with time-to-OOM estimate |
| OOMKill detected | Immediately flag workload for memory limit right-sizing | Yes — with container name, namespace, and restart count |
| Noisy-neighbour identified | Recommend node pool isolation; optionally auto-migrate | Yes — with affected co-residents listed |
| Namespace pressure > threshold | Throttle non-critical workloads in namespace (if auto-response enabled) | Yes — with pressure score and top contributors |
| Pod Pending > 90s | Trigger Smart Provisioner capacity check | Yes — with scheduling failure reason |
- All automated actions respect PDB constraints. No workload is moved if doing so would violate its PodDisruptionBudget.
- Alert-only mode available. Set `hotspot_auto_response: false` to receive alerts without any automated cluster changes, useful for teams that want visibility first before enabling automation.
- Per-namespace suppression. Individual namespaces can be excluded from automated response while still receiving alerts.
- Deduplication window. Alerts for the same hotspot are deduplicated over a 10-minute window to prevent notification floods during extended pressure events.
Live Cost Lens
Per-workload, per-namespace, and per-team cost attribution in real time — so every engineer can see the financial impact of their deployment decisions the moment they make them, not at the end of the month when it's too late to act.
What It Does
The Workload Cost Usage engine queries the Kubernetes Metrics Server for live CPU and memory consumption for every pod in the cluster, then maps that consumption to actual cloud pricing for the instance type the pod is running on. The result is a real cost figure — not a theoretical allocation, not a per-pod share of a monthly invoice — but the exact dollar value that each pod is consuming right now, expressed per hour.
This data is available via the /podsCostUsage API endpoint and returns a structured breakdown per pod: namespace, CPU usage in cores, memory usage in GB, GPU count if applicable, and the corresponding cost for each resource type plus a total. Both on-demand and spot capacity types are detected from node labels automatically.
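The per-pod figure reduces to simple arithmetic over measured consumption and the node's unit rates. The sketch below illustrates the calculation; the rate values are illustrative, not Kubeflux's actual pricing feed:

```python
def pod_cost_per_hour(cpu_cores: float, mem_gb: float, gpus: int,
                      cpu_rate: float, mem_rate: float, gpu_rate: float) -> dict:
    """Map measured consumption to hourly cost using the node's instance pricing."""
    cpu_cost = cpu_cores * cpu_rate     # $/core-hour for this instance type
    mem_cost = mem_gb * mem_rate        # $/GB-hour
    gpu_cost = gpus * gpu_rate          # $/GPU-hour (0 for non-GPU pods)
    return {
        "cpu_cost": round(cpu_cost, 4),
        "memory_cost": round(mem_cost, 4),
        "gpu_cost": round(gpu_cost, 4),
        "total_cost": round(cpu_cost + mem_cost + gpu_cost, 4),
    }

# A pod measured at 0.5 cores and 2 GB, on a node priced at $0.04/core-h and $0.01/GB-h
print(pod_cost_per_hour(0.5, 2.0, 0, cpu_rate=0.04, mem_rate=0.01, gpu_rate=0.0))
# {'cpu_cost': 0.02, 'memory_cost': 0.02, 'gpu_cost': 0.0, 'total_cost': 0.04}
```

Because the inputs are measured usage rather than declared requests, the same pod costs more during a traffic spike and less overnight, which is exactly the signal engineers need.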
Real-Time Cost Visibility
Cloud cost management tools show you what you spent last month. They cannot tell you which deployment on Tuesday caused Wednesday's cost spike, which team's new service is responsible for 40% of your cluster's compute bill, or how much money you are burning right now — per workload, per namespace, per hour — as you are reading this.
Live Cost Lens does all three. It calculates the actual cost of every running workload at 60-second granularity, attributes it to the team or namespace that owns the workload, and surfaces it in a live dashboard that every engineer in your organisation can access. When a developer deploys a service with 10× the resources it needs, they see the cost the moment the pods come up — not in the next monthly bill review.
Costs are calculated using live cloud pricing data retrieved from the Pricing API — not estimates, not averages. The engine detects your cluster's region, instance type, OS, tenancy, and capacity type (On-Demand or Spot) from node labels automatically, so pricing is always accurate for the specific hardware your pods are actually running on.
Organisations that give engineers real-time cost visibility see an average 23% reduction in over-provisioning within the first quarter — not from automation, but from engineers making better decisions when they can see the consequences of those decisions immediately.
Every pod gets a cost figure: CPU cost, memory cost, GPU cost, and total cost per hour. Costs are based on the pod's actual measured consumption from the Metrics Server — not its declared resource requests.
All pods are attributed by namespace, making it straightforward to sum costs per team, per environment, or per application by grouping on the namespace field in the response.
GPU usage is detected from the pod specification and priced separately using the GPU cost component of the instance's hourly rate — essential for clusters running ML inference or training workloads.
Pricing is pulled per the exact instance type of each pod's node — m5.xlarge, c5.4xlarge, p3.2xlarge — so cost figures reflect the true unit economics of your node mix, not a cluster average.
Pods running on Spot nodes are priced at spot rates; pods running on On-Demand nodes at on-demand rates.
The per-pod cost data feeds the Savings Dashboard's FinOps metrics and informs the Node Optimizer's recommendations — closing the loop between what workloads cost and what the engine recommends changing.
API Reference
| Endpoint | Method | Auth | Description |
|---|---|---|---|
| `/podsCostUsage` | GET | Required | Returns per-pod cost breakdown for all pods across all namespaces. Queries the Kubernetes Metrics Server in real time and maps consumption to live AWS pricing. |
The /podsCostUsage endpoint requires the Kubernetes Metrics Server to be running in your cluster. Without it, the endpoint cannot collect real-time CPU and memory consumption. If your cluster does not have the Metrics Server installed, deploy it as part of the standard Kubeflux installation process — see the Installation Guide.
Deploy Kubeflux on Your Cluster
Step-by-step instructions for deploying the Kubeflux application with persistent volume on AWS EKS. Follow the steps in order — the whole process takes approximately 20–40 minutes on a freshly provisioned cluster.
Obtain your Kubeflux License Key
Before deploying, you need a license key issued by the Kubeflux team. To request one, retrieve your cluster's unique ID and send it to the support team.
```shell
kubectl get namespace kube-system -o jsonpath='{.metadata.uid}'
```

Send the output to support@kubeflux.com and the team will issue your license key.
Install the Metrics Server

Kubeflux requires the Kubernetes Metrics Server for real-time consumption data. If it is not already installed in your cluster, deploy it:

```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```
Add IRSA Permissions
Kubeflux requires IAM Roles for Service Accounts (IRSA) to interact with AWS resources. A setup script is included in the deployment package.
```shell
# Make the script executable
chmod +x kubecut-irsa-step.sh

# Run the IRSA setup
./kubecut-irsa-step.sh
```
Install Terraform and Deploy the Application
Kubeflux uses Terraform to provision EFS persistent storage on AWS. If Terraform is not already installed, run the following commands on Ubuntu 24.04:
```shell
# Update system packages
sudo apt update && sudo apt upgrade -y
sudo apt install -y wget gnupg software-properties-common

# Add HashiCorp GPG key and repository
wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor > hashicorp-archive-keyring.gpg
sudo mv hashicorp-archive-keyring.gpg /usr/share/keyrings/
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \
https://apt.releases.hashicorp.com $(lsb_release -cs) main" \
| sudo tee /etc/apt/sources.list.d/hashicorp.list

# Install Terraform
sudo apt update && sudo apt install terraform -y
```
Once Terraform is installed, extract the deployment package and configure your cluster details:
```shell
# Decompress the Terraform package
tar -xzf kubecut-terraform-efs-k8s.tar.gz
```

Then edit the variables file, `kubecut-terraform-efs-k8s/terraform.tfvars`, with your cluster details:

```
region           = "your-aws-region"        # e.g. us-east-1
eks_cluster_name = "your-eks-cluster-name"
```
Verify your kubectl context points to the correct cluster before applying:
kubectl config current-context
Run the Terraform deployment from inside the extracted directory:
```shell
# Navigate to the Terraform directory
cd kubecut-aws/kubecut-terraform-efs-k8s/

# Initialise, validate, and apply
terraform init
terraform validate
terraform apply
```
Type `yes` when prompted to confirm the `terraform apply`.
Once Terraform completes, deploy the Kubeflux application manifest:
```shell
kubectl apply -f kubecut-core-aws.yaml
```

Run `kubectl get pods -n kubeflux` to confirm all pods reach Running status within 2–3 minutes.
Configure the Workload Scaler
After deployment, open the Kubeflux dashboard and navigate to Workload Scaler to configure which namespaces and services are managed.
Excluding namespaces
Excluding individual services
Uninstall / Reinstall
To remove the application, open the Kubeflux dashboard, navigate to Settings → My Profile, click Remove Kubeflux Resources, and confirm. This cleanly removes all Kubeflux components from the cluster while preserving your cluster workloads.
Get in Touch
Have a question, a feature request, or need help with your deployment? Send us a message and the team will respond within one business day.