
Cloud Foundry Summit Sessions, “Diego: Re-envisioning the Elastic Runtime”

Vitaly Sedelnik

The amazing session, “Diego: Re-envisioning the Elastic Runtime,” was one of the highlights at this year’s CF Summit. Onsi Fakhouri, Engineering Manager at Pivotal, shared some technical details on Project Diego, including why it is important for Cloud Foundry developers and how it will evolve in the future.

Diego, a large-scale project on which Pivotal is working right now, will introduce a number of significant changes to the Cloud Foundry architecture. Read on to learn about the reasons for this kind of revision, why we should care about Diego, and what impact it will have on Cloud Foundry and PaaS.
 

Why was it necessary to re-write something in Cloud Foundry?

Diego is an almost complete rewrite of the Cloud Foundry Elastic Runtime. It covers a large part of the system: the entire DEA Pool, together with Warden, the Health Manager, NATS, and more.

The need to invest in this kind of effort arose from multiple drawbacks of the existing architecture. These include:

  • issues with orchestration
  • poor separation of concerns
  • triangular dependencies
  • tightly coupled components
  • limitations of Ruby
  • domain specificity
  • platform specificity

As a result, it is hard for developers to add new features and maintain existing ones, hard to test, and hard to understand how the system works. Onsi Fakhouri illustrated these challenges with several real-life examples:

1. For instance, since the Cloud Controller is responsible for too many things, it may be inefficient when deciding how to distribute apps among DEAs. This causes orchestration issues.

2. The Cloud Controller, Health Manager, and DEA Pool were designed to work together and are tightly coupled. Because of this, adding new features to the Cloud Controller may negatively affect the Health Manager and/or the DEA Pool.

3. Domain and platform specificity make it difficult for developers to extend the system.

4. Finally, DEA and Warden—two long-lived, long-running processes—have lots of concurrency and low-level OS interactions. As a result, the Ruby code currently used by the system has been pushed to the limit.

 

How is Diego different?

Diego aims to provide the right level of abstraction that will make it possible to overcome the above-mentioned challenges and make Cloud Foundry more robust.

Unlike the current Elastic Runtime, Diego:

  • Is written in Go
  • Has strong concurrency support
  • Has strong low-level OS support
  • Is strongly typed
  • Provides explicit error handling
  • Promotes developer discipline (Go requires more developer discipline than Ruby)

 

What does this mean for developers?

Diego embraces the complexity that is necessary in a platform like Cloud Foundry and tries to make it explicit, transparent, and understandable. Thanks to this, it is easier for developers to work with the system, add and test new features, maintain existing ones, etc. Onsi Fakhouri used some examples to illustrate this:

1. As mentioned above, Diego is written in Go, a programming language developed in the cloud era. It eliminates many issues caused by Ruby.

2. Diego still has a Cloud Controller, but the DEA Pool has been replaced with the Executor Pool, an eventually consistent, self-managing, monitoring, and healing system. As a result, Cloud Foundry becomes more robust and the Health Manager is no longer necessary.

3. Diego solves the domain specificity issue by replacing complex notions, such as “running apps”, with generic ones: one-off tasks (one-time jobs) and long-running processes (LRPs). As a result, adding new features becomes easy.

4. Diego has been built from the ground up to be platform independent. Most of the components, including the Cloud Controller and the layers in the Executor Pool, except for the back-end layer, do not care about the platform. This means that, when adding a new platform, developers only need to care about two things: the back-end and the binary for that specific platform.

5. Removing some of the responsibilities from the Cloud Controller has solved the orchestration issues. The Cloud Controller no longer has to “think” about how and where the apps will run. Instead, the Executor distributes long-running processes using the auction method.

In total, there are 14 different components, each responsible for one separate job. This creates a lot of complexity, which is inherent to large-scale systems such as Cloud Foundry. This is why Pivotal uses a simulation-driven approach when working on Diego. During the session, Onsi Fakhouri demonstrated how Diego evenly distributes 1,000 apps among dozens of executors, using a piece of the same code that runs in production.

Unit testing is done on all components to ensure that each of them operates as expected. In addition, a special library provides shared narrative and integration tests to make sure everything works correctly.


 

Current situation and future roadmap for Diego

By now, the team has completed work on the staging components. The part responsible for running apps is half-done. There is already support for Linux and buildpacks.

The list of features that will be included in the project is rather long. According to Onsi Fakhouri, Diego will soon provide placement pools, process types, persistent disk, shell access, auto-rebalancing, zero-downtime deploys, support for Dockerfiles, .NET support, and custom health checks.

So, to sum it up, we may describe the Diego project as Elastic Runtime 2.0. The things that make it different from the one employed in CF today are as follows:

  • It uses the notions of tasks and LRPs to provide flexible abstraction.
  • It is platform-independent and therefore extensible.
  • It is self-managing, monitoring, and healing, making Cloud Foundry more robust.
  • It embraces complexity.

The final slide featured Pivotal’s development team currently working on Diego. Unfortunately, no slides have been published for this talk. However, several days ago, Onsi Fakhouri tweeted that video recordings of all the sessions from the CF Summit will be available online soon.

Still, I feel lucky to have visited this great event. Looking forward to the next one.

Recap of the CF Summit 2014: Day 1 | Day 2 | Day 3

4 Comments
  • Aliaksey Kandratsenka

    I’m very curious about that switch away from Ruby. Has anybody mentioned any specific issues they had with Ruby?

    • Andrei Yurkevich

      The official and short version of the reason is performance. There is a general strategy of rewriting a number of Cloud Foundry components in Go. The most widely known example of that migration is (Go) Router. Some of the drivers behind the redesign (not directly related to Ruby issues though) are listed on GitHub: https://github.com/cloudfoundry/gorouter

    • SergueiF

      I work with them there at Pivotal and it mostly came down to concurrency, package management, libraries for interfacing with OS and no exceptions.

      Ruby’s concurrency is mostly just a regular old set of tools – locks, threads, etc. Some objects are thread-safe, some are not. Go’s CSP thing has a lot of tools that encourage higher-level tools like channels and hint at mutability-issues by putting values in channels, instead of sharing references.

      Package management is very important. Ruby has some weak conventions for packages, like modules (not even intended for it). Go won’t compile without an acyclic dependency graph, and the designs that forces are just consistently better.

      The domain of Diego is full of interacting with the OS, so lots of file IO, careful network I/O, and lots of possibilities for error. Go community and authors invested in just that. And lack of exceptions makes it all explicit and your designs fail early.

    • Mark Carlson

      Onsi covers those reasons in the video at https://www.youtube.com/watch?v=1OkmVTFhfLY

      -mdc
