Introduction

In the dizzying race to reinvent infrastructure over the last 5-10 years, much has been made of scaling out vs. scaling up. You hear about it every day.

As an analyst focused on high availability, I’ve noticed this: the scale-up systems being replaced by net-new infrastructure, the ones that supported mission-critical applications, had availability and security high on the list of design priorities. For those scale-up computers in the traditional data center, each individual system had to be able to withstand failures on its own and avoid downtime.

Today, in our scale-out server world, IT organizations are finding new ways to achieve that goal of availability. But it’s clear that they’re doing so very differently than in the era when Unix systems were the primary guardians of mission-critical applications and databases.

Availability is Highly Relevant

The specific technologies that support application availability and data availability are changing rapidly. However, the IT best practices behind them will never go out of style: it’s the implementations that will shift over to the new style of IT. Amazingly, there are some who doubt the importance of detailed planning for high availability in a web-enabled, cloud-enabled world. But, rest assured, the longtime goals for availability, security and quality-of-service (QoS) for end-users are highly relevant today.

Without availability, your data center ends up in the news. Think about July 8, the recent day when Wall Street trading, the Wall Street Journal and United Airlines all went offline. News organizations always call these events “computer glitches.” I call them IT best practices gone awry: in many cases, IT practitioners know how to avoid these lengthy outages, but a combination of very human events pushes the systems offline anyway.

Sometimes, the sheer complexity of these systems triggers an unintended outage due to extremely high data volumes. And sometimes software updates cause the outages – updates that have to be rolled back when things go wrong, and it takes time to return to the version of the software that was being replaced. Another cause: rapidly changing market conditions trigger a programmed trading run-up that moves faster than many of the systems supporting it.

Replication, Redundancy and Workload Balancing

Rather than focusing on single-system availability, the industry has moved to protecting the data itself – no matter where it resides. And rather than saying “follow the money,” we should be saying “follow the data.” And, I would add: protect it.

That’s why replication inside the data center, or across the cloud, is so important.

Judging from the cloud platforms out there, the magic number is 3: three copies of data, none co-located, make data safe in our scale-out world. Multiplying the data stores over and over, placing them in multiple locations, does make life safer.
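To make the three-copy idea concrete, here is a minimal Python sketch of placing replicas so that no two copies share a location. The zone names, the replica count constant and the hash-based selection are assumptions for illustration, not any particular platform’s placement logic.

```python
import hashlib

# Hypothetical availability zones; a real platform would discover these dynamically.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c", "us-west-2a"]
REPLICA_COUNT = 3  # the "magic number" of copies

def place_replicas(object_key: str, zones=ZONES, copies=REPLICA_COUNT):
    """Pick `copies` distinct zones for an object so no two copies are co-located."""
    if copies > len(zones):
        raise ValueError("not enough zones to keep every copy in a separate location")
    # Deterministic starting point derived from the object's key.
    start = int(hashlib.sha256(object_key.encode()).hexdigest(), 16) % len(zones)
    # Walk the zone list from that point, wrapping around, to collect distinct zones.
    return [zones[(start + i) % len(zones)] for i in range(copies)]

if __name__ == "__main__":
    for key in ("orders/2015-07-08.csv", "trades/session-42.log"):
        print(key, "->", place_replicas(key))
```

Running it prints the three zones chosen for each key; the point is simply that no object ever lands twice in the same place.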

However, the copying process, when taken too far, can lead to inefficient operations. We’ve seen that, too. That’s why storage de-dupe features and workload consolidation in the data center are so important. The takeaway: store often, and in multiple places – but not too much. Specifically, don’t concentrate too much in any one place: that would make the data just as vulnerable as if there were only one copy.
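To show why de-dupe matters once copies start to multiply, here is a hedged sketch of content-based deduplication: chunks are stored once, keyed by their hash, no matter how many files reference them. The fixed chunk size and the in-memory store are simplifications for the example, not how any particular storage product implements the feature.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking; real systems often use variable-size chunks

class DedupStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self):
        self.chunks = {}     # chunk hash -> chunk bytes (stored once)
        self.manifests = {}  # file name -> ordered list of chunk hashes

    def put(self, name: str, data: bytes):
        hashes = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # skip chunks already held
            hashes.append(digest)
        self.manifests[name] = hashes

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[h] for h in self.manifests[name])

if __name__ == "__main__":
    store = DedupStore()
    payload = b"ticker,price\n" * 10_000
    store.put("copy-a.csv", payload)
    store.put("copy-b.csv", payload)  # the second copy adds no new chunks
    assert store.get("copy-b.csv") == payload
    print("logical bytes:", 2 * len(payload), "| unique chunks stored:", len(store.chunks))
```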

Moving the Data Across the Clouds

Software-defined infrastructure (SDI) makes data movement its priority – keeping data moving around the network. This is especially true in hybrid clouds, which have taken hold in enterprise data centers and are increasingly being linked to off-site clouds. End-to-end apps spanning data centers are becoming the norm.

Industry conferences this spring and summer showed the remarkable degree of agreement vendors have reached about how to deploy hybrid cloud technology. Dell, EMC, HP, IBM, Microsoft, Oracle, Red Hat and VMware, to name a few of the largest vendors, all made this clear.

Today, everything’s in motion. Workload balancing, workload orchestration across the infrastructure and increasing levels of automation for moving virtual machines (VMs) and containers – that’s the way to go.
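As a sketch of the kind of automation this implies – not any vendor’s scheduler – the snippet below rebalances by migrating one workload from the busiest host to the least-loaded one whenever the load gap grows too wide. The host names, load figures and threshold are made up for the example.

```python
# Hypothetical per-host load (e.g., CPU utilization %) and the workloads running there.
hosts = {
    "host-a": {"load": 88, "vms": ["web-1", "web-2", "db-1"]},
    "host-b": {"load": 35, "vms": ["web-3"]},
    "host-c": {"load": 52, "vms": ["cache-1", "web-4"]},
}

REBALANCE_THRESHOLD = 30  # act only when the gap between hosts is large

def rebalance(hosts: dict):
    """Migrate one VM from the busiest host to the least-loaded one if the gap is too wide."""
    busiest = max(hosts, key=lambda h: hosts[h]["load"])
    idlest = min(hosts, key=lambda h: hosts[h]["load"])
    gap = hosts[busiest]["load"] - hosts[idlest]["load"]
    if gap < REBALANCE_THRESHOLD or not hosts[busiest]["vms"]:
        return None  # cluster is balanced enough, or nothing to move
    vm = hosts[busiest]["vms"].pop()  # pick a candidate VM to migrate
    hosts[idlest]["vms"].append(vm)
    return f"migrated {vm}: {busiest} -> {idlest} (load gap was {gap})"

if __name__ == "__main__":
    print(rebalance(hosts) or "cluster already balanced")
```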

Without Security, there is no Availability

Finally, achieving security throughout the hybrid cloud is an absolute prerequisite for availability. Hacking, spoofing and deliberate damage to systems cannot be tolerated when there is mission-critical data to protect. Here, the workload types are key to understanding what kinds of protection to apply to data-in-motion.

With stateless applications, queries can be re-submitted – and no harm is done. I’ve called this the “fallen soldier” effect, referring to toy soldiers (not the real ones in harm’s way). Those fallen toy soldiers can be replaced, or placed upright for the next use. But stateful applications, as in banking transactions, depend upon a sequence of events – and must return to earlier processing states in the case of disruption. That’s what takes so long when outages occur – and why operations may not return to normal for several hours, or more.
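The difference between the two workload types can be made concrete with a small sketch: a stateless query is simply resubmitted – the fallen toy soldier is stood back up – while a stateful transfer has to roll back to its last consistent state before it can be replayed. The function names, failure rates and retry limits here are illustrative assumptions, not drawn from any specific system.

```python
import random

def flaky_lookup(key: str) -> str:
    """Stand-in for a stateless query that sometimes fails transiently."""
    if random.random() < 0.3:
        raise ConnectionError("transient failure")
    return f"value-for-{key}"

def retry_stateless(key: str, attempts: int = 5) -> str:
    """Stateless work: just resubmit -- the fallen toy soldier is stood back up."""
    for attempt in range(1, attempts + 1):
        try:
            return flaky_lookup(key)
        except ConnectionError:
            print(f"attempt {attempt} failed, retrying")
    raise RuntimeError("gave up after repeated failures")

class TransferTxn:
    """Stateful work: a banking-style transfer that must never be half-applied."""

    def __init__(self, balances):
        self.balances = balances

    def transfer(self, src: str, dst: str, amount: int):
        snapshot = dict(self.balances)  # remember the last consistent state
        try:
            self.balances[src] -= amount
            if random.random() < 0.3:
                raise ConnectionError("disruption mid-transaction")
            self.balances[dst] += amount
        except ConnectionError:
            self.balances = snapshot    # roll back before any replay
            raise

if __name__ == "__main__":
    print(retry_stateless("account-42"))
    txn = TransferTxn({"alice": 100, "bob": 0})
    try:
        txn.transfer("alice", "bob", 25)
    except ConnectionError:
        print("transfer disrupted and rolled back")
    print("balances:", txn.balances)
```

The stateless path just tries again; the stateful path restores the snapshot first, which is the part that stretches recovery time in real systems.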

Living in the Software-Defined World

What can we do about this great data center transformation? How can we bring those older workloads into the new world? Key items on the checklist:

  • Look deeply into your data center infrastructure – and take a thorough inventory of what you’ve really got.
  • Ensure high availability, security and quality-of-service (QoS). Without them, end-users and customers are sure to be unhappy as they work with unreliable systems.
  • Avoid the ripple effect. Make sure that each time you add new infrastructure elements to your data center, the addition doesn’t compromise the availability, security and QoS levels you’re expected to maintain.

Living in the software-defined world for compute, storage and networking will take planning, and time. There’s no way around it.

Many organizations will adopt SDI in waves, using multiple projects to reach their goal. In this new world, we can retrieve data from multiple sources according to planned scenarios – and retrace a few steps, if need be.
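One way to picture retrieving data from multiple sources according to planned scenarios is a simple fallback chain: try the preferred copy first, then fall back to the others in a planned order. The source names and reader functions below are hypothetical placeholders, not any product’s API.

```python
def read_from_primary(key: str) -> str:
    """Hypothetical reader for the on-premises copy; raises if the site is down."""
    raise ConnectionError("primary data center unreachable")

def read_from_cloud_replica(key: str) -> str:
    """Hypothetical reader for an off-site cloud replica."""
    return f"{key}: contents from cloud replica"

def read_from_archive(key: str) -> str:
    """Hypothetical reader for a slower archival copy."""
    return f"{key}: contents from archive"

# The planned scenario: the order in which sources are tried when one is unavailable.
PLANNED_SOURCES = [read_from_primary, read_from_cloud_replica, read_from_archive]

def fetch(key: str, sources=PLANNED_SOURCES) -> str:
    """Return data from the first reachable source, retracing a step when one fails."""
    errors = []
    for source in sources:
        try:
            return source(key)
        except ConnectionError as exc:
            errors.append(f"{source.__name__}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))

if __name__ == "__main__":
    print(fetch("customer-orders/2015-07"))
```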

When we’ve done our homework correctly, computing will proceed, even if small outages pop up from time to time. But that will only happen if we plan for disruptions and trust the power of the software automation we’re putting into place now. We have to plan for the inevitable – when systems go offline – and know what to do about it.