At Rubrik, our mission is to secure the world’s data. Data is complex and comes in many forms (structured, unstructured, sensitive, transient, etc.), and it is critical for every enterprise to protect it. The systems that back up and store huge amounts of data are also subjected to extreme situations: enormous scale and stress, aging, and faults. Handling those extremes to deliver reliability for our customers is the challenge that keeps the Systems Engineers at Rubrik busy. Today, let’s explore some of the key capabilities we’ve built and examine the challenges we’re surmounting to deliver peace of mind to Rubrik’s customers.

Software Reliability

Let’s look at Reliability and Systems Engineering in some detail before we dig into other aspects. What is Software Reliability? Here’s the definition from the ISO 25010 standard:

Software Reliability: Degree to which a system, product or component performs specified functions under specified conditions for a specified period of time. This characteristic is composed of the following sub-characteristics:

  • Maturity - Degree to which a system, product or component meets needs for reliability under normal operation.

  • Availability - Degree to which a system, product or component is operational and accessible when required for use.

  • Fault tolerance - Degree to which a system, product or component operates as intended despite the presence of hardware or software faults.

  • Recoverability - Degree to which, in the event of an interruption or a failure, a product or system can recover the data directly affected and re-establish the desired state of the system.

In successful software products, the only constant is change: new features and capabilities, new and modern architectures, subsystem upgrades, and fixes for new security vulnerabilities, to name a few. Maintaining system reliability in a constantly evolving product is a challenge, and meeting it is a skill we have honed over the years.

By combining our expertise in various enterprise verticals (Cloud, Virtualisation, Databases, NoSQL, Kubernetes, etc.) with strong infrastructure and automation skills, Systems Engineers have identified and developed the following key capabilities:

  • High-quality Observability
  • Stable Infrastructure
  • Shift left
  • Provide solutions

High-quality Observability

Observability is an important capability: without it, we cannot get adequate feedback on how the system behaves. At Rubrik, we continuously measure multiple internal metrics:

  • Uptime – Based on Rubrik’s distributed architecture and scale-out capabilities, we compute the uptime at the cluster and node level
  • Availability of services – Rubrik’s cloud services have an availability of 99.9% or above 
  • MTTF and MTTR - custom measurements tuned to Rubrik’s products
  • Compliance – we developed this measure to ensure the system continues to operate at a certain level of load 
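
To make the availability, MTTF, and MTTR metrics above concrete, here is a minimal sketch using the textbook definitions. It is not Rubrik’s implementation (the post describes Rubrik’s MTTF and MTTR as custom measurements); the `Incident` type and both function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One outage: when the service went down and came back (seconds)."""
    start: float
    end: float


def reliability_summary(incidents, window_seconds):
    """Return (mttf, mttr, availability) for one observation window.

    Textbook definitions, assumed here for illustration:
      MTTR = total downtime / number of incidents
      MTTF = total uptime   / number of incidents
      availability = MTTF / (MTTF + MTTR)
    """
    if not incidents:
        # No failures observed: the whole window counts as uptime.
        return window_seconds, 0.0, 1.0
    downtime = sum(i.end - i.start for i in incidents)
    uptime = window_seconds - downtime
    mttr = downtime / len(incidents)
    mttf = uptime / len(incidents)
    return mttf, mttr, mttf / (mttf + mttr)


def downtime_budget_seconds(target, window_seconds):
    """Downtime allowed in the window while still meeting the target."""
    return (1.0 - target) * window_seconds


if __name__ == "__main__":
    thirty_days = 30 * 24 * 3600
    outages = [Incident(0, 600), Incident(10_000, 11_200)]
    mttf, mttr, availability = reliability_summary(outages, thirty_days)
    print(f"MTTF={mttf:.0f}s MTTR={mttr:.0f}s availability={availability:.5f}")
    print(f"99.9% budget: {downtime_budget_seconds(0.999, thirty_days):.0f}s")
```

Under these definitions, a 99.9% target over a 30-day window leaves a downtime budget of roughly 43 minutes.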

In addition, various internal resource consumption metrics that are not customer visible are computed at the system level across all services, such as CPU, memory, disk utilization, disk I/O, and network I/O. For instance, Rubrik has a custom file system layer, and the rate of I/O to this file system across all nodes is a good indicator of the activity happening in the system.

Stable Infrastructure

Observability is of little use unless it’s backed by a stable infrastructure. Systems Engineers build and maintain different configurations of customer workloads in order to certify reliability.

For example, here is a stress profile for a standard two-BRIK configuration with aggressive SLAs:

 

Workload           Count   Size     Change rate   Backup frequency
vSphere VMs        1000+   100 GB   5% hourly     Every 1 hour
MSSQL Databases    600+    20 GB    5% hourly     Every 30 min
Windows Filesets   16+     2 TB     5% hourly     Every 1 hour

The profile above is one example, aggregated across multiple customers. We have identified a standard set of profiles, which are validated every week. To achieve this scale, we have developed the following:

  • On-demand deployment of a variety of enterprise products and databases at various scales and sizes
    • We’ve built an in-house multi-threaded application that is capable of deploying any workload in a few hours
  • Change injection at a dynamic rate for all kinds of applications
    • A custom-built library of standard, secure templates that can be auto-generated for any workload
  • Continuous monitoring of all the apps to check that they are healthy and available to the system
    • Embedded monitoring systems that detect failures and recover from them where possible
  • Conformance of protected objects and sustained load
    • External monitoring that runs at a set cadence to repair the workloads
  • Tooling to reach the desired stress in an accelerated mode
    • We’ve built an advanced synchronization mechanism that ensures the stress level never drops below a certain threshold

Lastly, by treating all of the above infrastructure as code, we are able to maintain the desired state in an efficient and automated manner. This makes our systems less prone to unauthorized modifications and system drift. Just as we measure product availability and compliance, we use similar measures for all the internal infrastructure.
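
The desired-state idea can be sketched as a small reconciliation step: compare what the code says should exist with what actually exists, then repair the difference. This is a generic illustration and not Rubrik’s tooling; the function names, workload names, and the dictionary layout are all hypothetical.

```python
def diff_state(desired: dict, actual: dict) -> dict:
    """Compare desired and actual state and list the repairs needed."""
    create = sorted(set(desired) - set(actual))   # declared but missing
    remove = sorted(set(actual) - set(desired))   # present but undeclared
    update = sorted(
        name for name in desired.keys() & actual.keys()
        if desired[name] != actual[name]          # drifted configuration
    )
    return {"create": create, "remove": remove, "update": update}


def reconcile(desired: dict, actual: dict):
    """Return (repaired_state, plan) without mutating the inputs."""
    plan = diff_state(desired, actual)
    repaired = dict(actual)
    for name in plan["remove"]:
        del repaired[name]
    for name in plan["create"] + plan["update"]:
        repaired[name] = desired[name]
    return repaired, plan


if __name__ == "__main__":
    desired = {"vm-001": {"sla": "hourly"}, "db-001": {"sla": "30min"}}
    actual = {"vm-001": {"sla": "daily"}, "tmp-host": {"sla": "none"}}
    repaired, plan = reconcile(desired, actual)
    print(plan)
    print(repaired == desired)
```

Running such a step at a set cadence is one way to catch both unauthorized modifications (the `update` and `remove` entries) and workloads that have disappeared (the `create` entries).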

Shift left

One of the key enablers of the speed and scale at which we operate is our culture of Excellence (E in RIVET), which welcomes peer reviews. When a major feature is being planned, multiple checks and balances ensure that its design conforms to our scalable-architecture and fault-tolerance principles.

In addition, system reliability is not an afterthought or a step that occurs late in the development cycle. It is a continuous process that provides weekly feedback on the code commits developers contribute, which makes Rubrik a leader in responding to customer demands with high velocity (V in RIVET).

All of these processes make the Engineering function truly agile. Can you guess which quadrant best represents the work Systems Engineers do? (Answer at the bottom.)
 

[Image: system engineers]

Provide solutions 

Every customer has unique requirements, and the way enterprise software is used is unique to each organization. Systems Engineers enable the creation of reference architectures and solutions that help a large number of customers secure their data. To sharpen our customer focus, we have developed a way to incorporate ever-changing customer workloads, based on customer input, using advanced clustering techniques.

Thought Leadership on Reliability

Based on years of experience setting a high bar for Reliability at Rubrik, we have categorized this space into the following aspects:

Scale

  • How does the system perform when it is at scale? Rubrik’s largest customers have 100+ nodes per cluster!
  • How does the system handle extremely large (in both size and number) enterprise workloads (a.k.a. Snappables in Rubrik lingo)?

Stress

  • How does the system perform under stress? For example, what happens when the CPU is constantly at 95% utilization?
  • How does the system handle millions of events being fired? Our on-prem solution, which deploys at dark sites, can handle millions of events without disruption.

Resiliency

  • How does the system react when a fault occurs? For example, a node goes bad or a disk becomes faulty.
  • Does the data stay intact and free of corruption? Are there sufficient alerting and monitoring mechanisms to notify the customer?
  • Can the system auto-recover without impacting availability of our cloud services?

Longevity

  • How does the system perform when run for a long period of time? Are there memory leaks or resource breaches?
  • Does the history of changes affect the data being secured? 
  • Are there incompatibilities, at either the data or the metadata level, due to the age of the data?

Case Study and a serious challenge

To see these capabilities in action, let's look at an instance where we partnered with a prominent enterprise customer to optimize their experience and harness the full potential of Rubrik's CDM solution.

The customer had extremely large Oracle databases, ranging from 200 to 300 TB in size, and wanted the backups replicated to a target within 24 hours, a very challenging timeline. The customer's databases were hosted on proprietary machines, with the largest datafile being 64 TB! The only way to do this was to use an MV (managed volume), as a datafile that large breached the maximum OS file size limit of 16 TB under our in-house Oracle offering.

The customer encountered multiple difficulties using MV directly:

  • Backup performance was slow, overrunning the weekend window for full backups
  • MV resets led to backup failures
  • Replication to the DR site was extremely slow, taking nearly 3 weeks to complete (expected: 1 day)
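
To put the replication target in perspective, the sketch below computes the sustained throughput needed to move a given amount of data within a time window. It is a back-of-the-envelope estimate under stated assumptions: decimal units (1 TB = 10^12 bytes), the full database size transferred, and no allowance for compression, deduplication, or incremental transfer, all of which would lower the real requirement. The function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_TB = 10**12  # decimal terabytes: an assumption, not from the post
BYTES_PER_GB = 10**9


def required_throughput_gb_per_s(size_tb: float, window_days: float) -> float:
    """Sustained rate (GB/s) needed to move size_tb within window_days."""
    total_bytes = size_tb * BYTES_PER_TB
    return total_bytes / (window_days * SECONDS_PER_DAY) / BYTES_PER_GB


if __name__ == "__main__":
    for size_tb in (200, 300):
        rate = required_throughput_gb_per_s(size_tb, 1)
        print(f"{size_tb} TB in 1 day needs about {rate:.2f} GB/s")
    # The observed case: replication taking nearly 3 weeks.
    slow = required_throughput_gb_per_s(300, 21)
    print(f"300 TB in 21 days corresponds to about {slow:.2f} GB/s")
```

Under these assumptions, replicating 300 TB in 24 hours needs roughly 3.5 GB/s sustained, while finishing in three weeks corresponds to about 0.17 GB/s, a gap of 21x.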

Replicating the setup internally was not easy, as it required significant hardware. Based on telemetry data and other analysis, we found that the behavior could be reproduced at one-eighth the size while maintaining the data ingestion rate per node per category:

  • Two 8-node CDM clusters were built: one for backup and another as a replication target
  • A custom database load tool was used to launch a large Oracle database (50 TB)
  • The team’s mass deployer tool (a homegrown tool to deploy and register objects at scale) facilitated quick deployment and registration of the desired workloads to the cluster: Oracle, vSphere VMs, Windows and Unix Filesets, MSSQL, etc.
  • Multiple RMAN experiments were conducted (RMAN, or Recovery Manager, is an Oracle Database client that performs backup and recovery tasks), leading to recommendations for improving the Oracle backup scripts.
  • Lastly, network latencies were replicated in our testbed to make the tests realistic.

Using this setup, multiple Snappable and Platform teams within Rubrik came together to deliver several fixes and performance improvements, and established clear tuning guidelines for customers with large database requirements.

The customer achieved successful full and incremental backups of their massive databases, meeting their performance expectations and replicating snapshots within 24 hours. The fixes not only benefited this customer but also paved the way for other enterprise customers to come on board.

Going through this situation was an enlightening experience: every capability and tool we had was challenged, which led to the development of advanced new tools to help our customers succeed.

We are busy!

Today, we’re busy crafting the roadmap for the next generation of Systems Engineering to help customers realize lasting cyber resilience (Cyber Posture + Cyber Recovery), while enhancing our tooling with advanced capabilities. There are multiple interesting challenges ahead:

  • Resilient Cloud services – we’re developing a way to auto-discover fault-intolerant code blocks the same way a security scan uncovers vulnerabilities 
  • Filtering out noise using advanced AI and machine learning – because we work at the system level, the scale of our infrastructure produces a lot of noise (100k+ signals per day), and separating it out accurately is a challenge
  • Shifting Reliability further left – we’re reimagining our processes as we add new products and features, in order to keep growing like a rocketship
  • Dynamic resource allocation for CDM scale / stress deployments – due to the scale at which Rubrik operates, managing resources dynamically helps reduce overhead and save significant costs
  • Evangelising and deepening the culture on reliability with continuous thought leadership

While we’re busy, we’re also excited by the opportunities these challenges provide. If you feel the same way and want to make an impact, look no further: we are hiring!

* Answer is Quadrant 4 (Performance, Quality attributes, etc.)