CNCF Takes LitmusChaos Platform to the Incubation Level

The Technical Oversight Committee (TOC) of the Cloud Native Computing Foundation (CNCF) announced today that it is elevating LitmusChaos, an open source platform for testing applications, to the incubation level.

LitmusChaos is a chaos engineering platform that was donated to CNCF by ChaosNative in 2020. Since then, it has been adopted within production environments by more than 25 organizations, including Intuit, Lenskart, Orange, Red Hat, and VMware.

Uma Mukkara, CEO of ChaosNative, says LitmusChaos was developed to give DevOps teams a chaos engineering platform built on containers, which makes it easier to scale up and down than existing monolithic alternatives. The same platform, however, can be used to test both monolithic and microservices-based applications.

The platform consists of a Chaos Engine based on a software development kit (SDK); ChaosHub for hosting and sharing experiments; Litmus Workflow, which declaratively links experiments (either sequentially or in parallel) to build a chaos scenario; a central control plane, dubbed ChaosCenter, for designing, scheduling, and tracking Litmus workflows; Litmus probes for creating chaos scenarios that automate steady-state validation and processing procedures; and a chaos monitoring tool for exporting Prometheus metrics.
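The workflow idea described above, running experiments in sequence or in parallel and judging each one with a steady-state check, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Experiment`, `run_experiment`, `run_workflow`, and `make_pod_kill` are hypothetical names invented here, the "target" is an in-memory replica counter, and none of it is the LitmusChaos API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical names throughout: this is NOT the LitmusChaos API.


@dataclass
class Experiment:
    name: str
    inject: Callable[[], None]   # introduces the fault
    probe: Callable[[], bool]    # steady-state check; True means healthy
    restore: Callable[[], None]  # puts the target back as it was


def run_experiment(exp: Experiment) -> bool:
    """Inject the fault, evaluate the probe, and always restore the target."""
    try:
        exp.inject()
        return exp.probe()
    finally:
        exp.restore()


def run_workflow(experiments: List[Experiment], parallel: bool = False) -> Dict[str, bool]:
    """Run experiments sequentially or in parallel; map each name to its verdict."""
    if parallel:
        with ThreadPoolExecutor() as pool:
            verdicts = list(pool.map(run_experiment, experiments))
    else:
        verdicts = [run_experiment(exp) for exp in experiments]
    return {exp.name: ok for exp, ok in zip(experiments, verdicts)}


def make_pod_kill(name: str, replicas: int, minimum: int) -> Experiment:
    """Toy experiment: remove one replica from an in-memory counter (assumed target)."""
    state = {"replicas": replicas}
    return Experiment(
        name=name,
        inject=lambda: state.update(replicas=state["replicas"] - 1),
        probe=lambda: state["replicas"] >= minimum,
        restore=lambda: state.update(replicas=replicas),
    )


if __name__ == "__main__":
    scenario = [make_pod_kill("web", 3, 2), make_pod_kill("db", 1, 1)]
    print(run_workflow(scenario))
    print(run_workflow(scenario, parallel=True))
```

In this toy scenario the `web` target keeps enough replicas after the fault while the single-replica `db` target does not, which is the kind of single point of failure a chaos experiment is meant to surface.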

In total, there are now more than 400 contributors to the LitmusChaos project, and more than 4,000 pull requests have been submitted. Since the beginning of this year, Litmus operator installations have grown to more than 2,000 per day, Mukkara said. Going forward, the goal for 2022 is to increase the number of LitmusChaos integrations with a larger set of DevOps platforms, Mukkara added. Additional sets of experiments are also planned for Kubernetes and non-Kubernetes targets, along with improved observability and integration with other platforms via the open source OpenTelemetry agent that is also being developed under the CNCF.

Overall, it’s still early days for chaos engineering within DevOps workflows. However, Mukkara said that as more organizations adopt observability to better understand IT processes, the number of organizations that also use chaos engineering to test IT resilience should increase as well. Indeed, as IT environments become more complex, thanks mainly to the rise of microservices-based applications, chaos engineering approaches to testing may one day be in demand, Mukkara noted.

In the meantime, many IT professionals remain uncomfortable with chaos engineering methods that deliberately break core services to test an application’s resilience. However, the basic idea is that no application should have a single point of failure that could make it unavailable. In theory, an application based on microservices will not fail because, when one service becomes unavailable, calls are automatically rerouted to another service. The application’s performance will degrade, but it will not stop working. The challenge is that building an application that achieves this level of resilience with microservices requires a significant amount of DevOps expertise.
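The rerouting behavior described above can be illustrated with a minimal Python sketch. It assumes a client that tries a list of interchangeable replicas in order; `call_with_failover`, `ServiceUnavailable`, `healthy`, and `broken` are hypothetical names used only for illustration, and real systems typically delegate this to a load balancer or service mesh.

```python
from typing import Callable, List, Optional

# Hypothetical names: a toy model of failover, not any specific product's API.


class ServiceUnavailable(Exception):
    """Raised when a replica cannot serve the request."""


Replica = Callable[[str], str]


def call_with_failover(replicas: List[Replica], request: str) -> str:
    """Try each replica in order; fail only if every replica is unavailable."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica(request)
        except ServiceUnavailable as err:
            last_error = err  # remember the failure and move on to the next replica
    raise ServiceUnavailable("all replicas failed") from last_error


def healthy(name: str) -> Replica:
    def handle(request: str) -> str:
        return f"{name} handled {request}"
    return handle


def broken(name: str) -> Replica:
    def handle(request: str) -> str:
        raise ServiceUnavailable(name)
    return handle


if __name__ == "__main__":
    # Simulate a chaos experiment: the first replica has been deliberately broken.
    print(call_with_failover([broken("a"), healthy("b")], "GET /"))
```

The extra attempt against the broken replica is where the performance cost comes from: the request still succeeds, only more slowly. If every replica is broken, the call raises, which corresponds to the single point of failure that chaos testing tries to expose.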

No matter how the app is built, one thing that is clear is how much organizations rely on apps to drive revenue. As such, the tolerance for application failure – or performance degradation – has never been lower.
