Optimizing System Resiliency and Service Continuity through the Integration of Chaos Engineering, Dynamic Load Balancing, and Zero-Downtime Deployment Architectures in Public Cloud Ecosystems

Dr. Alistair Sterling

Authors

Dr. Alistair Sterling, Department of Computer Science and Software Engineering, University of Melbourne, Australia

Keywords:

Chaos Engineering, Cloud Computing, Zero-Downtime Deployment, System Resiliency

Abstract

As software architectures move toward decentralized microservices on public cloud infrastructure, maintaining high availability and systemic resilience has become a primary engineering concern. This research examines the intersection of proactive fault injection, specifically Chaos Engineering, and automated deployment strategies designed to eliminate service interruptions. By combining dynamic load balancing methods, particularly hybrid optimization algorithms, with zero-downtime deployment techniques such as Canary Analysis and Blue-Green environments, the study proposes a unified framework for dependable cloud computing. It examines software fault injection at the system-call and Java Virtual Machine (JVM) levels as a means of identifying latent vulnerabilities in exception handling. The paper also evaluates Chaos Engineering not merely as a technical tool but as a human-centered learning framework that fosters high-reliability engineering teams. Through a theoretical analysis of error injection realism and adaptive fault tolerance strategies, the research demonstrates how systemic resilience is achieved when automated recovery mechanisms are validated by intentional, controlled turbulence. The findings suggest that combining sophisticated scheduling, automated canary analysis, and rigorous fault simulation provides the most robust defense against the inherent volatility of distributed cloud environments.
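The idea of validating an automated recovery mechanism with controlled turbulence can be illustrated with a small sketch. It is not the framework proposed in the paper: the names `with_fault_injection`, `call_with_retry`, and `run_experiment`, the retry-based recovery policy, and the default failure rate are all assumptions introduced here for illustration.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a failing dependency."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail before reaching it."""
    def wrapper(*args, **kwargs):
        # Controlled turbulence: fail with probability `failure_rate`.
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """Hypothetical recovery mechanism: retry up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as err:
            last_error = err
    raise last_error


def run_experiment(failure_rate=0.3, attempts=5, requests=1000, seed=42):
    """Return the fraction of requests that succeed despite injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_fault_injection(lambda: "ok", failure_rate, rng)
    successes = 0
    for _ in range(requests):
        try:
            if call_with_retry(flaky, attempts) == "ok":
                successes += 1
        except InjectedFault:
            pass  # recovery was exhausted; count the request as failed
    return successes / requests
```

Calling `run_experiment()` and comparing the result against a steady-state threshold (for example, a required success ratio) is one way to phrase a chaos experiment as a falsifiable hypothesis. Setting `attempts=1` removes the recovery mechanism and shows how much of the observed availability depends on it.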


References

1. Addula, S. R., et al. Dynamic load balancing in cloud computing using hybrid Kookaburra-Pelican optimization algorithms.

2. Basiri, A., Behnam, N., Rooij, R., Hochstein, L., Kosewski, L., Reynolds, J., & Rosenthal, C. (2016). Chaos engineering. IEEE Software, 33(3), 35–41.

3. Chowdhury, A., & Tripathi, P. (2014). Enhancing cloud computing reliability using efficient scheduling by providing reliability as a service. In 2014 international conference on parallel, distributed and grid computing, pp 99–104. IEEE.

4. Davidovič, Š., et al. (2018). Canary analysis service. Communications of the ACM.

5. Feinbube, L., Pirl, L., & Polze, A. (2017). Software fault injection: a practical perspective. In: Márquez FPG, Papaelias M (eds) Dependability engineering. IntechOpen, Rijeka.

6. Kesarpu, S. (2025). Chaos Engineering as a Learning Framework: A Human-Centered Model for Developing High-Reliability Engineering Teams. The American Journal of Engineering and Technology, 7(12), 57–64. https://doi.org/10.37547/tajet/Volume07Issue12-05

7. Laprie, J.-C. (2008). From dependability to resilience. In 38th IEEE/IFIP international conference on dependable systems and networks, pp 8–9.

8. Natella, R., Cotroneo, D., & Madeira, H. S. (2016). Assessing dependability with software fault injection: a survey. ACM Computing Surveys.

9. Piscitelli, R., Bhasin, S., & Regazzoni, F. (2017). Fault attacks, injection techniques and tools for simulation. In: Sklavos N, Chaves R, Di Natale G, Regazzoni F (eds) Hardware security and trust, pp 27–47. Springer, Cham.

10. Rosenthal, C., et al. (2020). Chaos Engineering: System Resiliency in Practice. O'Reilly Media.

11. Rudrabhatla, C. K. Comparison of zero downtime based deployment techniques in public cloud infrastructure.

12. Shortridge, K. (2023). Security chaos engineering: sustaining resilience in software and systems. O'Reilly Media, Inc.

13. Sun, D., Chang, G., Miao, C., & Wang, X. (2013). Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments. Journal of Supercomputing, 66, 193–228.

14. Talaver, V., & Vakaliuk, T. A. (2023). Reliable distributed systems: review of modern approaches. Journal of Edge Computing, 2(1), 84–101.

15. Zhang, L., et al. (2021). A chaos engineering system for live analysis and falsification of exception-handling in the JVM. IEEE Transactions on Software Engineering.

16. Zhang, L., et al. (2022). Maximizing error injection realism for chaos engineering with system calls. IEEE Transactions on Dependable and Secure Computing.
