COLA: Collective Autoscaling for Cloud Microservices

Analysis of COLA, a novel autoscaler for microservice applications that optimizes VM allocation globally to minimize cost while meeting end-to-end latency targets.

1. Introduction

The shift from monolithic architectures to loosely coupled microservices in cloud applications introduces significant complexity in resource management. Developers must decide how many compute resources (e.g., container replicas, VMs) to allocate to each microservice. This decision critically impacts both the operational cost for the developer and the end-to-end latency experienced by the application user. Traditional autoscaling methods, such as Horizontal Pod Autoscaling (HPA), scale each microservice independently based on local metrics like CPU utilization. However, this approach is suboptimal because it ignores the interdependent nature of microservices within an application workflow. COLA (Collective Autoscaler) is proposed as a solution that collectively allocates resources across all microservices with a global objective: minimizing dollar cost while ensuring the application's end-to-end latency remains below a specified target.

2. The Problem with Independent Autoscaling

Current industry-standard autoscaling operates in a distributed, per-microservice manner. Each service triggers scaling actions (adding/removing VMs or pods) when its own resource utilization (CPU, memory) crosses a threshold. The fundamental flaw is that this local view fails to account for the application's global performance. Improving the latency of one microservice may have negligible impact on the overall user-perceived latency if another service in the chain remains a bottleneck. This leads to inefficient resource allocation—over-provisioning some services while under-provisioning critical bottlenecks—resulting in higher costs without achieving the desired latency Service Level Objective (SLO).

3. COLA: Collective Autoscaling Approach

COLA reframes the autoscaling problem as a constrained optimization problem. It replaces multiple independent autoscalers with a single, centralized controller that has global visibility into the application's microservice topology and performance.

3.1. Core Optimization Framework

The goal is formalized as:

  • Objective: Minimize total compute cost.
  • Constraint: Application end-to-end mean or tail latency ≤ Target Latency.
  • Decision Variables: Number of VMs (or replicas) allocated to each microservice $i$, denoted as $n_i$.

This is a complex, non-linear optimization problem because the relationship between $n_i$ and end-to-end latency is not straightforward and depends on workload patterns and inter-service communication.
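To see why a joint decision matters, consider a toy two-service instance with illustrative numbers (not taken from the paper), where each VM costs one unit, the services are called in sequence, and per-service latency under the test workload is roughly inversely proportional to its VM count:

$$L_1(n_1) = \frac{100}{n_1}\text{ ms}, \quad L_2(n_2) = \frac{20}{n_2}\text{ ms}, \quad L_{e2e} = L_1 + L_2, \quad L_{target} = 55\text{ ms}$$

Uniformly doubling both services to $(n_1, n_2) = (2, 2)$ costs 4 units yet still yields $L_{e2e} = 60$ ms and misses the target, while spending the same budget on the bottleneck alone, $(n_1, n_2) = (3, 1)$, yields $L_{e2e} \approx 53$ ms and meets it. Per-service utilization thresholds cannot discover this allocation, which is why the problem must be solved jointly.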

3.2. Offline Search Process

Solving this optimization online is impractical due to the time required for provisioning and performance stabilization. Therefore, COLA employs an offline search process:

  1. Workload Application: Apply a representative workload to the application.
  2. Bottleneck Identification: Identify the most congested microservice (greatest increase in CPU utilization under load).
  3. Resource Allocation via Bandit Problem: For the bottleneck service, determine the optimal number of VMs using a multi-armed bandit formulation. The "reward" function balances latency improvement against cost increase.
  4. Iteration: Repeat steps 2-3 for the next most congested microservice until the global latency target is met.
  5. Policy Generation: The result is a scaling policy (a mapping from workload characteristics to resource allocations) that can be deployed online.

COLA can interpolate between known workloads and fall back to default autoscalers if faced with an unseen workload pattern.
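The iteration above can be illustrated with a small, self-contained Python sketch. It is only a sketch under simplifying assumptions: the service names and base latencies are hypothetical, per-service latency is modeled as inversely proportional to VM count, and the bandit step is reduced to adding one VM at a time to the bottleneck (a UCB-style sketch of the bandit itself appears in Section 4).

```python
# Illustrative sketch of the offline search loop (Section 3.2, steps 1-4).
# Everything here is a simplification, not COLA's implementation: COLA sizes the
# bottleneck with a bandit search and identifies bottlenecks via CPU utilization.

def service_latency(base_ms: float, vms: int) -> float:
    """Toy model: per-service latency falls as VMs are added."""
    return base_ms / vms

def end_to_end_latency(bases: dict, allocation: dict) -> float:
    """Assume a serial call chain, so per-service latencies add up."""
    return sum(service_latency(bases[s], allocation[s]) for s in bases)

def total_cost(allocation: dict, vm_price: float = 1.0) -> float:
    """Objective: total cost of the allocation."""
    return vm_price * sum(allocation.values())

def offline_search(bases: dict, latency_target: float, max_vms: int = 16) -> dict:
    """Repeatedly grow the current bottleneck until the end-to-end SLO is met."""
    allocation = {s: 1 for s in bases}                 # start with one VM per service
    while end_to_end_latency(bases, allocation) > latency_target:
        # Bottleneck: the service contributing the most latency under this workload
        # (COLA instead picks the service with the largest CPU-utilization increase).
        bottleneck = max(bases, key=lambda s: service_latency(bases[s], allocation[s]))
        if allocation[bottleneck] >= max_vms:
            break                                      # SLO unreachable within budget
        allocation[bottleneck] += 1
    return allocation                                  # one entry of the scaling policy

if __name__ == "__main__":
    # Hypothetical per-service base latencies (ms) under a representative workload.
    bases = {"frontend": 40.0, "cart": 120.0, "checkout": 80.0}
    policy = offline_search(bases, latency_target=100.0)
    print(policy, round(end_to_end_latency(bases, policy), 1), total_cost(policy))
```

Running this sketch allocates extra VMs only where they reduce end-to-end latency, which is the policy-generation behavior the offline search is meant to capture.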

4. Technical Details & Mathematical Formulation

The core optimization problem can be abstractly represented as:

$$\min_{\{n_i\}} \sum_{i=1}^{M} C_i(n_i)$$
$$\text{subject to: } L_{e2e}(\{n_i\}, \lambda) \leq L_{target}$$
$$n_i \in \mathbb{Z}^+$$

Where:

  • $M$: Number of microservices.
  • $n_i$: Number of resource units (e.g., VMs) for microservice $i$.
  • $C_i(n_i)$: Cost function for microservice $i$ with $n_i$ units.
  • $L_{e2e}$: End-to-end latency function, dependent on all $n_i$ and the workload intensity $\lambda$.
  • $L_{target}$: The desired latency SLO.
The "bandit problem" in step 3 of COLA's search involves treating each possible VM allocation for the bottleneck service as an "arm." Pulling an arm corresponds to provisioning that configuration and measuring the resultant cost-latency trade-off. Algorithms like Upper Confidence Bound (UCB) can be used to efficiently explore and exploit the configuration space.

5. Experimental Results & Evaluation

COLA was rigorously evaluated against several baseline autoscalers (utilization-based and ML-based) on Google Kubernetes Engine (GKE).

5.1. Experimental Setup

  • Applications: 5 open-source microservice applications (e.g., Simple WebServer, BookInfo, Online Boutique).
  • Platforms: GKE Standard (user-managed nodes) and GKE Autopilot (provider-managed infrastructure).
  • Baselines: Standard HPA (CPU-based), advanced ML-based autoscalers.
  • Workloads: 63 distinct workload patterns.
  • Target: Meet a specified median or tail (e.g., p95) latency SLO.

5.2. Key Performance Metrics

  • SLO Attainment: 53/63 workloads where COLA met the latency target.
  • Average Cost Reduction: 19.3% compared to the next cheapest autoscaler.
  • Most Cost-Effective Policy: COLA was the cheapest autoscaler for 48 of the 53 successful workloads.
  • Optimality on Small Apps: for smaller applications where exhaustive search was possible, COLA found the optimal configuration in ~90% of cases.

5.3. Results Summary

The results demonstrate COLA's significant advantage. It consistently achieved the desired latency SLO where others failed, and did so at a substantially lower cost. The cost savings were so pronounced that the "training cost" of running COLA's offline search was recouped within a few days of operation. On GKE Autopilot, COLA's benefits were even more apparent, as it effectively navigated the provider-managed abstraction to minimize costs.

Chart Description (Imagined): A bar chart would likely show "Cost per Successful Request" or "Total Cluster Cost" on the Y-axis, with different autoscalers (COLA, HPA, ML-A) on the X-axis. COLA's bar would be significantly lower. A second chart might show "Latency SLO Violation Rate," where COLA's bar approaches zero while others show higher violation rates.

6. Analysis Framework & Example Case

Analyst's Perspective: A Four-Step Deconstruction

Core Insight: The paper's fundamental breakthrough isn't a fancy new algorithm, but a critical shift in perspective: treating the entire microservice application as a single system to be optimized, not a collection of independent parts. This is analogous to the shift in computer vision brought by models like CycleGAN (Zhu et al., 2017), which moved beyond paired image translation by considering the cycle-consistency of the entire transformation domain. COLA applies a similar "global consistency" principle to resource management.

Logical Flow: The argument is compellingly simple: 1) Local optima (per-service scaling) sum to a global inefficiency. 2) Therefore, use a global objective (cost) with a global constraint (end-to-end latency). 3) Since solving this online is too slow, solve it offline via search and deploy the policy. The elegance lies in using the bandit problem to make the search for the bottleneck's optimal allocation efficient, a technique supported by extensive research in reinforcement learning for systems optimization (e.g., work from UC Berkeley's RISELab).

Strengths & Flaws: Strengths: The empirical results are stellar—19.3% cost reduction is a boardroom-level figure. The offline approach is pragmatic, avoiding runtime instability. The framework is platform-agnostic. Flaws: The Achilles' heel is the dependency on representative offline workloads. In rapidly evolving applications or under "black swan" traffic events, the pre-computed policy may be obsolete or catastrophic. The paper's fallback to default autoscalers is a band-aid, not a cure, for this robustness issue. Furthermore, the search complexity likely scales poorly with the number of microservices, potentially limiting its use in extremely large, complex applications.

Actionable Insights: For cloud architects, the message is clear: stop setting CPU thresholds in isolation. Invest in building or adopting global performance observability and a centralized decision engine. Start with a hybrid approach: use COLA's philosophy to define critical service chains and apply collective scaling there, while leaving less critical, independent services on traditional HPA. The ROI, as shown, can be rapid. Cloud providers should take note; tools like GKE Autopilot need such intelligent orchestration layers to truly deliver on the promise of "managed" infrastructure.

7. Application Outlook & Future Directions

The principles behind COLA have broad applicability beyond basic VM scaling:

  • Multi-Resource & Heterogeneous Scaling: Future versions could collectively decide on VM size (memory- vs. compute-optimized), GPU allocation, and even placement across availability zones or cloud providers for cost and resilience.
  • Integration with Service Meshes: Coupling COLA with a service mesh (like Istio) would provide richer telemetry (request-level tracing, dependency graphs) and even enable direct control over traffic routing and circuit breaking as part of the optimization.
  • Online Adaptation & Meta-Learning: The major research frontier is overcoming the offline limitation. Techniques from meta-learning could allow COLA to quickly adapt its policy online based on real-time feedback, or to safely explore new configurations during low-traffic periods.
  • Green Computing Objectives: The optimization objective could be extended to minimize carbon footprint or energy consumption, aligning with sustainable computing initiatives, by incorporating data from sources like the Cloud Carbon Footprint project.
  • Marketplace for Policies: For common application patterns (e.g., e-commerce, media streaming), pre-optimized COLA policies could be shared or sold, reducing the need for individual training runs.

8. References

  1. Sachidananda, V., & Sivaraman, A. (2022). COLA: Collective Autoscaling for Cloud Microservices. arXiv preprint arXiv:2112.14845v3.
  2. Zhu, J., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  3. Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). Borg, Omega, and Kubernetes. Queue, 14(1), 70–93.
  4. Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. Proceedings of the 35th International Conference on Machine Learning (ICML). (Example of advanced RL relevant for online adaptation).
  5. Cloud Carbon Footprint. (n.d.). An open source tool to measure and visualize the carbon footprint of cloud usage. Retrieved from https://www.cloudcarbonfootprint.org/.
  6. Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., & Wilkes, J. (2015). Large-scale cluster management at Google with Borg. Proceedings of the European Conference on Computer Systems (EuroSys).