Cloud Computing:
Power & Frailty
For all its operational, scale, and cost advantages, cloud computing is an (un)surprisingly fragile technology.



Part 3: High-availability Cloud Architecture Just Isn't Enough
​
To mitigate the pain of outages, businesses often set up high-availability (HA) cloud architecture. Typically, this means running cloud services across multiple instances in different sub-regions and even different regions. Some go as far as spanning their architecture across multiple cloud service providers, though this is very uncommon. This way, when cloud services fail in one area, the business can shift over to a backup geography.
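As a rough illustration of the failover idea, the sketch below picks the first healthy geography from a priority-ordered list. The `Region` type, the `healthy` flag, and the region names are hypothetical stand-ins; a real setup would rely on health checks and traffic routing from the cloud provider rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    name: str       # hypothetical label for a region or sub-region
    healthy: bool   # assumed to be set by some external health check


def pick_active_region(regions: list[Region]) -> Optional[Region]:
    """Return the first healthy region in priority order, or None if all are down."""
    for region in regions:
        if region.healthy:
            return region
    return None


# The primary geography is down, so traffic shifts to the backup.
regions = [Region("primary", healthy=False), Region("backup", healthy=True)]
active = pick_active_region(regions)
print(active.name if active else "no healthy region")
```

Even this toy version hints at where the cost comes from: every backup entry in the list is infrastructure that has to be provisioned, monitored, and paid for whether or not it is ever used.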
​
HA architecture is complex and difficult to achieve affordably. Higher availability requires more sophisticated architectural components that must work in sync, communicate with one another, monitor performance, detect problems, and fail over gracefully to other instances and geographies when cloud outages occur. This isn’t free. The more resilient a cloud architecture, the more expensive it is to implement. In other words, resiliency and cost rise together. Note, however, that they do not necessarily rise in proportion. At some point, additional spending on HA stops paying off, and that drop-off becomes dramatically steeper over time and scale.
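[Figure: High-availability Cost-Benefit Model. HA Expenditures and HA Benefit in dollars ($) against Scale, with the Value Surplus and Inefficiencies regions marked and the optimal point shown as a question mark.]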
There is a point at which spending on HA just isn't worth the value anymore. In practice, it is exceedingly difficult to pinpoint where the optimal scale is. As a result, most businesses either overspend on HA, thus generating waste, or underutilize HA, thus exposing themselves to downtime and financial risk.
This creates tension in business decisions. Does your business overspend on HA, incurring higher costs without experiencing enough outages each year to realize the benefits of that spending? Or does it underspend on HA, leaving operations exposed to the financial risks of unanticipated downtime? Most professionals agree that this optimization question is difficult to answer. The reason is fairly obvious: it is hard to foresee exactly what types of outages will occur and how long they will last, which makes the optimal point an unpredictable, ever-moving target on the spectrum of cost management.
Spending on HA is usually a fixed cost that scales with the overall size of your cloud infrastructure. But the benefits don’t scale that way. In fact, benefits are realized only when outages happen. So you can easily overspend on HA and disaster recovery (DR) even when outages never disrupt your business. On the other hand, you can badly underestimate the number of outage incidents and the length of downtime after you’ve already committed your HA and DR budget. When outages occur, your business cannot simply buy and implement HA and DR on the spot, and the redundancies and failovers already in place may not be enough to deliver a painless financial and operational recovery.
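The trade-off can be sketched numerically. The toy model below treats overall cost as HA/DR spend plus the expected cost of outages, and searches a grid of candidate budgets for the cheapest one. Everything in it is an assumption made for illustration: the exponential shape of `expected_outage_cost`, the `base_outage_cost` and `decay` values, and the function names. Real outage losses are far less predictable, which is exactly why the optimum is so hard to find in practice.

```python
import math


def expected_outage_cost(ha_spend: float,
                         base_outage_cost: float = 100_000.0,
                         decay: float = 1 / 20_000) -> float:
    """Expected yearly outage loss, ASSUMED to shrink exponentially as HA spend grows."""
    return base_outage_cost * math.exp(-decay * ha_spend)


def overall_cost(ha_spend: float, **params: float) -> float:
    """Fixed HA/DR spend plus the expected outage loss that remains."""
    return ha_spend + expected_outage_cost(ha_spend, **params)


def optimal_spend(candidates, **params: float) -> float:
    """Return the candidate HA budget with the lowest overall cost."""
    return min(candidates, key=lambda spend: overall_cost(spend, **params))


# Candidate yearly HA budgets from $0 to $100,000 in $1,000 steps.
candidates = range(0, 100_001, 1_000)
best = optimal_spend(candidates)
print(f"lowest overall cost at HA spend of ${best:,}")
print(f"overall cost with no HA:   ${overall_cost(0):,.0f}")
print(f"overall cost at optimum:   ${overall_cost(best):,.0f}")
print(f"overall cost at max spend: ${overall_cost(100_000):,.0f}")
```

Under these made-up parameters the minimum sits in the interior of the range: spending nothing leaves the full outage exposure, while spending the maximum buys protection that costs more than the losses it prevents. Changing `decay` or `base_outage_cost` moves the optimum, which mirrors how uncertainty about outage frequency and duration keeps the real target shifting.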
​
1) Over-pay for HA without reaping the commensurate benefits of a safer cloud operation, thus incurring unnecessary expenditures in the long run.
[Figure: Overall Costs, Costs of Outage, and HA/DR Expenses against Risk of Outage; the Optimal point and the Inefficiencies region are marked.]
2) Under-spend on HA/DR due to resource constraints, thus leaving business operations overexposed to the financial risks of cloud outages.
[Figure: Overall Costs, Costs of Outage, and HA/DR Expenses against Risk of Outage; the Optimal point and the Over Exposure region are marked.]
Consequently, two behaviors manifest, depending on the type or size of the business making the decision. Enterprises typically have deeper pockets and would rather knowingly over-pay for HA to mitigate whatever cloud-related downtime may occur. Small and medium businesses (SMBs), on the other hand, have more constrained financial resources, and existing research tells us that most SMBs actually under-pay for HA, leaving their operations exposed to the financial risks of cloud service outages.
[Figure: Costs of Disaster Recovery. Cost ($) against Time around an Outage Incident, with RPO and RTO marked.]

