In this article, I will share my real-world outage experience with Storage Spaces Direct (S2D), the storage technology behind the Azure Stack HCI OS.
If you have been in IT for a while, you are probably aware of how important power is for your datacenter.
I have lost count of how many times I have read or been told that power is the number one cost in a modern datacenter; it is a frequent refrain. Virtualization has helped us over the years to consolidate servers and reduce power costs. Even so, power remains the fastest-growing operational cost of a high-scale service.
Redundancy is critical for every organization. For example, two power supplies in a server, or a RAID system for storage, will generally provide enough time for a failed component to be replaced. This is known as the “N+1” approach. For systems where failure is just not acceptable, an “N+M” approach (having more than one extra component in place) may be used.
Within the datacenter itself, modular uninterruptible power supplies (UPSs), power filters, generators, and air-conditioning with built-in N+1 redundant power supplies, batteries, and so on can also be used to increase redundancy and protect your servers.
What about room failure? Protecting against it would require building a second datacenter, in the same building or in a different city, with the facility services mirrored across both: N+1 power distribution networks, UPSs, cooling systems, and so on. By its very nature, this is far too expensive for many organizations.
And the list goes on and on…
We recently deployed a four-node Storage Spaces Direct cluster using the hyper-converged model.
This technology is really awesome in terms of simplicity, performance, fault tolerance, efficiency, manageability, and much more.
We are very happy with the results, and the performance we get out of four nodes is fantastic.
With four servers we can tolerate up to two faults. Here are six scenarios in which the system stays online:
- One drive lost (includes cache drives).
- One server is lost.
- One server and one drive are lost.
- Two servers are lost.
- Two drives are lost on different servers.
- More than two drives are lost, provided that no more than two servers are affected. In other words, if two drives are lost on “Server 1” and two other drives are lost on “Server 3”, the system stays online.
In each of the six scenarios above, all volumes stay online, provided that your cluster maintains quorum!
So as you can see, with four servers we have fairly good fault tolerance.
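The six scenarios above boil down to one rule: no more than two servers may be affected, and the cluster must keep quorum. The Python sketch below is only a simplified model of that rule, not how Storage Spaces Direct decides anything internally. The function names, the server names, and the static majority vote with one witness are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set

# Assumption taken from the list above: a four-server cluster keeps its
# volumes online as long as at most two servers are affected by faults.
MAX_AFFECTED_SERVERS = 2


def affected_servers(down_servers: Iterable[str],
                     failed_drives: Dict[str, int]) -> Set[str]:
    """Return the servers that are down or have at least one failed drive."""
    with_failed_drives = {name for name, count in failed_drives.items() if count > 0}
    return set(down_servers) | with_failed_drives


def has_quorum(votes_online: int, votes_total: int) -> bool:
    """Simplified static rule: a strict majority of votes must be online."""
    return votes_online * 2 > votes_total


def volumes_stay_online(down_servers: Iterable[str],
                        failed_drives: Dict[str, int],
                        total_servers: int = 4,
                        witness_online: bool = True) -> bool:
    """Apply the 'at most two servers affected, quorum kept' rule."""
    down = set(down_servers)
    # Hypothetical vote layout: one vote per server plus one witness vote.
    votes_total = total_servers + 1
    votes_online = (total_servers - len(down)) + (1 if witness_online else 0)
    within_limit = len(affected_servers(down, failed_drives)) <= MAX_AFFECTED_SERVERS
    return within_limit and has_quorum(votes_online, votes_total)


# The six scenarios from the list, using hypothetical server names.
print(volumes_stay_online([], {"Server1": 1}))                  # one drive lost
print(volumes_stay_online(["Server1"], {}))                     # one server lost
print(volumes_stay_online(["Server1"], {"Server2": 1}))         # one server and one drive
print(volumes_stay_online(["Server1", "Server2"], {}))          # two servers lost
print(volumes_stay_online([], {"Server1": 1, "Server2": 1}))    # two drives, different servers
print(volumes_stay_online([], {"Server1": 2, "Server3": 2}))    # four drives on two servers
```

In this model, a complete power loss on all four servers returns False. That only means the volumes go offline; it says nothing about whether the data survives once the servers come back.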
Expect The Unexpected
The million-dollar question is: what if all servers go down?
Is there really a thing like a 3 am wake-up call to fix a system?
Well, I received that call from one of my colleagues, telling me that the Storage Spaces Direct cluster would not power on.
Long story short, a large electrical spark occurred at one of our datacenters and everything tripped. Half of the servers' power supplies were connected to mains power and half to the UPS. The spark burned out the PDUs in the rack and the power supplies of all servers.
Yes, I know it’s a bad situation to be in… and redundant power supplies won’t even help in this scenario!
We waited until the next day to receive the new power supplies and replaced them.
After replacing the power supplies, we brought all the nodes up at the same time, and guess what?
Storage Spaces Direct survived this failure and we were able to recover. The system came back to a normal state, and the resync took around 25 minutes to complete.
Zero Data Loss
I would like to add one more scenario to the list above.
7. If all servers go down at once, as if someone had pulled the power cables, Storage Spaces Direct will recover from the complete power loss. (This is my own experience; it is not a scenario supported by Microsoft.)
Kudos to the Storage Spaces Direct Team!!!
Please note that Resilient File System (ReFS), Microsoft’s newest file system, is recommended for use with Storage Spaces Direct. ReFS is designed to maximize data availability, scale efficiently to large data sets across diverse workloads, and provide data integrity by means of resiliency to corruption.
Is your Disaster Recovery Plan updated and maintained? What about your backup? Are you using Storage Spaces Direct? I strongly recommend you start evaluating this awesome technology if you’re not doing so already.
Businesses today demand greater availability from their infrastructure. To achieve high uptime, even highly unlikely occurrences such as power failures, rack outages, or natural disasters must be protected against.
For example, to be rack fault-tolerant, your servers and your data must be distributed across multiple racks. Look at fault domain awareness in Storage Spaces Direct, which uses fault domains to maximize data safety.
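As a rough illustration of why rack placement matters, the sketch below checks how many servers a single rack failure would take down and compares that with the "at most two servers lost" rule for a four-server cluster described earlier. It is a simplified model and does not use or represent the fault domain awareness feature itself. The function names and the server and rack names are hypothetical.

```python
from collections import Counter
from typing import Dict

# Assumption: reuse the four-server rule that at most two servers may be lost.
MAX_SERVERS_LOST = 2


def worst_rack_loss(rack_of: Dict[str, str]) -> int:
    """Largest number of servers that a single rack failure would take down."""
    return max(Counter(rack_of.values()).values(), default=0)


def survives_single_rack_loss(rack_of: Dict[str, str]) -> bool:
    """True if losing any one rack stays within the two-server limit."""
    return worst_rack_loss(rack_of) <= MAX_SERVERS_LOST


# All four servers in one rack, as in the outage described in this article.
single_rack = {"Server1": "Rack-A", "Server2": "Rack-A",
               "Server3": "Rack-A", "Server4": "Rack-A"}

# The same servers spread over two racks (hypothetical layout).
two_racks = {"Server1": "Rack-A", "Server2": "Rack-A",
             "Server3": "Rack-B", "Server4": "Rack-B"}

print(survives_single_rack_loss(single_rack))
print(survives_single_rack_loss(two_racks))
```

The model ignores quorum and how data copies are actually placed, so treat it as a planning aid for thinking about rack layout rather than a guarantee.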
Storage Spaces Direct (S2D) and Storage Replica (SR) are better together. Look at how you can achieve maximum protection by combining these two technologies. More information about Storage Replica in Windows Server is here.
A big thank you to all my fellow MVPs and the Microsoft product group who offered their support during this outage.
I hope my real-world experience will help someone out there.
Thanks for reading!