
3 tips for choosing the correct Cloud Operations model

Cloud Operations are very different from Data Center Operations, and your leadership needs to understand it

So, you’ve decided to move your workloads to Cloud?

Deploying/Migrating production workloads to a Public Cloud is the “new black” of the industry. Everyone is doing it. Why? To improve reliability, reduce Infrastructure CapEx, reduce infrastructure operations, and the list goes on.

While we all tend to focus on the first part of the journey (Cloud architecture and the cost estimate for our service once we “upload it all” to the Cloud), most big organizations tend to forget about Day 2 Ops. What’s the issue? Your Data Center Operations model won’t fit anymore. If you’re lucky, you’ve realized this before your production workloads are in the cloud.


Let’s review some of the differences between what’s required to successfully operate a Data Center, versus Public Cloud.

| Data Center | Public Cloud |
|---|---|
| Bunch of independent Operation Teams, each one taking care only of their “parcel” (Data Bases, Network, Security, Storage, Virtualization etc.) | One Team, having the entire Service in mind when creating Processes and Procedures. |
| Each team uses their own set of tools for Monitoring, Visibility, etc. | A Single Tool Set, allowing different sub-teams to share dashboards, access logs and metrics. |
| Each operations group only cares about the metrics of their system. Healthy = my system says everything is OK. | Product Owner defines what “application is healthy” means, and the Operations Teams need to unify the efforts to meet the objectives. |

I’ll be honest, it’s a big change, but it’s doable. After seeing a huge number of customers around Europe try different models, fail, and try again, I’ve managed to summarize the conclusions into 3 simple tips that I wish I had been given.

TIP 1: Have your C-suite in the room, even though you’re discussing IT Ops

Cultural changes such as this need to come from the top down. It’s likely that your COO or CTO won’t get this at first, but you can’t fight this war alone. If the change is cultural, it needs to come from the leadership, because Operations folk are unlikely to accept the change otherwise.

In my experience, the best way to do this is not to organize it all yourself. Instead, look for external allies, such as your Cloud Provider’s account team. They will most likely have a set of Workshops and Assessments to help you, and your Leadership will take their advice more seriously (cruel world… I know).

TIP 2: SRE is likely to fit your organization

You can push this cultural change as DevOps, but DevOps as a term has been abused and misused so much by the industry, that it’s become difficult to find 2 people who agree on what it is. One thing we’re clear on - DevOps is a culture, and cultural changes are the most difficult ones. And since the change is so big, you’re better off pushing something that has an explicit definition “on google”.

SRE is an implementation of DevOps, invented by Google. The entire concept, including how to customize it to your organization and how to help product teams define the SLIs and SLOs for their desired Reliability, is very well documented in a number of free books published by Google.

My favorite misconception (that goes together with this tip): “but we’re not Google, this doesn’t apply to us”. This is exactly why SRE is so brilliant: a customization of it can work in literally any kind of organization. Do you know why Google invented SRE? Because of the huge variety of applications and services they are running, they needed “one DevOps model to rule them all”. Trust me, if SRE works on ALL Google Services, it will 100% work in your organization. And yes, you will definitely need to “tune it” a little bit; it’s not “one size fits all” out of the book.
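
To make the SLI and SLO vocabulary from this tip a bit more concrete, here is a minimal Python sketch of how a product team might turn request counts into an availability SLI and check it against an SLO. Everything in it is an assumption made for illustration: the `Slo` class, the `checkout-availability` name, the 99.9% target and the request counts are all hypothetical, and a real setup would read these numbers from your monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical availability objective agreed with the product owner."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed

    def __post_init__(self) -> None:
        # A target of exactly 1.0 leaves no error budget, so it is rejected here.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(slo: Slo, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is missed)."""
    if total_requests == 0:
        return 1.0
    allowed_failures = (1.0 - slo.target) * total_requests
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


# Illustrative numbers only: one million requests, 500 of them failed.
checkout = Slo(name="checkout-availability", target=0.999)
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
budget = error_budget_remaining(checkout, 999_500, 1_000_000)
print(f"SLI={sli:.4%}, SLO met={sli >= checkout.target}, budget left={budget:.0%}")
```

The point of the sketch is the division of labor: the product owner picks the target, and the operations side measures against it instead of against per-system health checks.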

TIP 3: Consider Processes, then Products, then People

In the beginning of the journey, it seems like your only problem is the skills. Sure, you do need your Ops engineers to learn YAML and Python, and your Developers to expand their skills into the Infrastructure, but the order you need to take care of things is the following:

  • Processes: There’s a misconception that skills are your only problem. They’re not. You first need to analyze your processes (how your users can open a support case, how support cases are handled and escalated, how a war room is created and managed, who handles external communication when the service is down etc.). Downtime is not a good time to discuss who does what; you’ll want all your brains looking at the problem.
  • Product: Product Teams need to understand how to make Operable Products. SREs will provide that definition, and help your product teams improve the product quality.
  • People: Once your processes are defined, and your Product Teams know how to create products that match your description of “operational excellence”, you can get into analyzing how to regroup and reorganize your operations team.
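
One way to settle “who does what” before an outage, as the Processes bullet above asks, is to write the escalation path down as data that anyone can read and review. The Python sketch below is only an illustration under stated assumptions: the `EscalationStep` class, the role names and the timings are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes since the incident was opened
    role: str           # who gets involved at this point


# Hypothetical policy for the most severe incidents; roles and timings are placeholders.
SEV1_POLICY = [
    EscalationStep(0, "on-call engineer"),
    EscalationStep(15, "incident commander"),
    EscalationStep(30, "external communications lead"),
]


def roles_involved(policy: list[EscalationStep], minutes_open: int) -> list[str]:
    """Return every role that should already be involved after `minutes_open` minutes."""
    return [step.role for step in policy if step.after_minutes <= minutes_open]


# Twenty minutes into an incident, list who should already be in the war room.
print(roles_involved(SEV1_POLICY, 20))
```

Keeping the policy as plain data means it can be agreed on in calm times and changed through review rather than renegotiated during downtime.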

If you do all this well, you’ll be able to operate Cloud with a smaller, better organized team, you’ll be able to onboard more products, and your operations staff will start feeling like part of the team. Your operations will no longer be a dumpster where you throw half-done products with huge technical debt. This will lead to a more efficient and happier team.