Running Azure Stack HCI on DataON Integrated Systems with All-NVMe Flash

In Windows Server 2016, Microsoft introduced a new type of storage called Storage Spaces Direct, which later became the foundation of the Azure Stack HCI program. Storage Spaces Direct enables building highly available storage systems with locally attached disks, without the need for any external SAS fabric such as shared JBODs or enclosures. This is the first true Software-Defined Storage (SDS) offering from Microsoft; Software-Defined Storage is an approach that abstracts storage services from dedicated hardware.

In Windows Server 2019, Microsoft added many improvements to Azure Stack HCI (formerly known as Windows Server Software-Defined, a.k.a. WSSD).

Fast forward to 2020: Microsoft introduced a new operating system dedicated to the hyper-converged deployment model, where innovation continues at a faster cadence than in Windows Server. This new operating system, Azure Stack HCI, is a hyper-converged infrastructure (HCI) operating system delivered as an Azure service, providing the latest security, performance, and feature updates.

Introduction

With Azure Stack HCI, you can deploy and run Windows and Linux virtual machines (VMs) in your datacenter or at the edge using your existing tools, processes, and skillsets. Additionally, you can extend the datacenter to the cloud with Azure Backup, Site Recovery, Azure File Sync, Azure Monitor, Azure Arc, and Azure Security Center.

On September 1st, 2021, Microsoft announced the GA release of Windows Server 2022 with no major improvements for Storage Spaces Direct. As noted earlier, all future innovations will go into Azure Stack HCI for running hyper-converged infrastructure; however, Windows Server will continue to benefit from improvements to existing features. Windows Server 2022 still lacks advanced features such as stretched clusters, but it did gain a new repair option for Storage Spaces Direct ("Adjustable Storage Repair Speed"), which lets system admins control how many resources to allocate either to repairing data copies or to active workloads.

I recently completed a 3-node Azure Stack HCI hyper-converged deployment on top of DataON AZS-216 Integrated Systems (all-NVMe flash) and hit over 2.5 million IOPS.

DataON AZS-216 2.5 Million IOPS

In this article, I would like to share with you my experience and performance results.

3-Node DataON Integrated Systems

For this deployment, I used the following hardware configuration:

  • DataON™ AZS-216 Integrated Systems For Azure Stack HCI OS
  • Supports Dual Intel Xeon® Scalable™ Gen 2 Processor Series & (24) DDR4 DIMM
  • Drive Bay: (16) NVMe U.2 2.5″ Hot-swappable
  • PCIe Slot: (7) PCIe 3.0 x8
  • Onboard NIC: (2) Built-In 10GbE RJ45
  • 1300W (1+1) 110V hot-swappable redundant PSU with NEMA 5-15 Power Cords
  • Intel® Remote Management Module 4
  • Intel® Xeon® Scalable Gen.2 Gold 5218R 2.1 GHz, 20-Core, 27.5MB Cache
  • 384GB (12x32GB) Samsung® DDR4 2933MHz ECC-Register RDIMM
  • 2 X Intel® S4510™ 480GB SATA M.2 Boot Drive For OS
  • 10 X Intel® DC P5510™ NVMe 3.8TB 2.5″ 144L 3D TLC SSD
  • 2 X NVIDIA|Mellanox® ConnectX-4 Lx EN Dual Port SFP+ 10/25GbE RDMA Card
  • 2 X NVIDIA|Mellanox® LinkX™ Passive Copper Cable, ETH, up to 25Gb/s, SFP28, 30 AWG
  • 2 X Mellanox® Spectrum™ 18-port 10/25GbE X 4-port 100GbE Switch (RDMA/RoCEv2)

The DataON AZS-216 Integrated Systems for Azure Stack HCI are pre-configured nodes with certified components, tested and validated by DataON and Microsoft, to help you build Azure Stack HCI clusters with ease.

In this configuration, all NVMe disks are used as capacity (all-flash) as shown in the inventory below.

DataON Azure Stack HCI | Drives Inventory

Resiliency

The Cluster Shared Volumes are configured with a three-way mirror for maximum resiliency within a single site. With a three-way mirror, the cluster can sustain two simultaneous failures while your workloads remain online.
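To put the resiliency/capacity trade-off in numbers, here is a quick back-of-the-envelope sketch in plain Python. The drive counts come from the hardware list above, and I assume the P5510's marketed 3.84 TB capacity per drive; the figures are approximations, since vendors quote decimal terabytes and pools report slightly different numbers after rounding.

```python
# Rough capacity math for three-way mirror resiliency.
# Drive counts and sizes are taken from the hardware list in this article.
nodes = 3
drives_per_node = 10
drive_tb = 3.84            # assumed marketed capacity of the Intel DC P5510

raw_tb = nodes * drives_per_node * drive_tb   # total raw pool capacity
mirror_copies = 3                             # three-way mirror keeps 3 copies
usable_tb = raw_tb / mirror_copies            # ~33% storage efficiency

print(f"Raw pool capacity:   {raw_tb:.1f} TB")     # 115.2 TB
print(f"Max usable (3-copy): {usable_tb:.1f} TB")  # 38.4 TB
```

This lines up with the ~117 TB pool reported below; the small difference comes from how drive capacities are rounded.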

To validate resiliency, you can test the following four failure scenarios:

1) Physical drive pull.

2) Reboot a node (observe failover).

3) Physical power pull of a node.

4) Shut down one node and pull a single drive from one of the remaining nodes that are still up.

Software Configuration

  • Host: Azure Stack HCI OS, version 20H2 (OS build 17784.1884)
  • Single storage pool (117 TB)
  • 3 × 10.3 TB volumes (three-way mirror)
  • ReFS/CSVFS file system
  • 60 virtual machines (20 VMs per node)
  • 2 virtual processors and 8 GB RAM per VM
  • VM guest OS: Windows Server 2019 Datacenter Core Edition with the August 2021 update
  • Jumbo frames enabled
  • CSV cache disabled for benchmarking purposes only; for real-world workloads, CSV cache is enabled with 16 GB
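As a quick sanity check on the configuration above, the footprint the three mirrored volumes consume from the pool can be computed directly. A minimal Python sketch; the figures come from the configuration list, and treating the remainder as repair headroom is my own interpretation:

```python
# Footprint of the configured volumes on the 117 TB pool.
pool_tb = 117.0        # single storage pool, from the configuration above
volumes = 3
volume_tb = 10.3       # usable size of each three-way mirrored volume
copies = 3             # three-way mirror stores every slab 3 times

footprint_tb = volumes * volume_tb * copies   # pool capacity consumed
unallocated_tb = pool_tb - footprint_tb       # headroom for repairs/growth

print(f"Volume footprint: {footprint_tb:.1f} TB")   # 92.7 TB
print(f"Unallocated:      {unallocated_tb:.1f} TB") # 24.3 TB
```

Leaving unallocated space in the pool is generally recommended so Storage Spaces Direct can repair in place immediately after a drive failure.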

Workload Configuration

DISKSPD version 2.0.21a was used as the workload generator, with VM Fleet as the workload orchestrator.
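The per-test parameters below map directly onto DISKSPD's command-line flags. The helper below is a hypothetical illustration of that mapping (the `test.dat` target, the 60-second duration, and the exact flag set are my assumptions, not the actual invocation; in practice, VM Fleet generates the real command lines inside each VM):

```python
# Hypothetical mapping from this article's test parameters to DISKSPD flags:
# -b block size, -t threads per target, -o outstanding I/O, -r random access,
# -w write percentage, -d duration (s), -Sh disable caching, -L collect latency.
def diskspd_args(block, threads, outstanding, write_pct, random=True, duration=60):
    flags = [f"-b{block}", f"-t{threads}", f"-o{outstanding}",
             f"-w{write_pct}", f"-d{duration}", "-Sh", "-L"]
    if random:
        flags.insert(0, "-r")
    return "diskspd.exe " + " ".join(flags) + " test.dat"

# Test 1: random 4K, 8 threads, 8 outstanding I/O, 100% read
print(diskspd_args("4K", 8, 8, 0))
# -> diskspd.exe -r -b4K -t8 -o8 -w0 -d60 -Sh -L test.dat

# Test 6: sequential 512K, 1 thread, 1 outstanding I/O, 100% write
print(diskspd_args("512K", 1, 1, 100, random=False))
# -> diskspd.exe -b512K -t1 -o1 -w100 -d60 -Sh -L test.dat
```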

Test 1 – Random 4K, 8 Threads, 8 Outstanding I/O, 100% Read

Total 2.5 Million IOPS – Read/Write Latency @ 0.1/0.6(ms)

Each VM is configured with:

  • 4K IO size
  • 10GB working set
  • 100% read and 0% write
  • No Storage QoS
  • RDMA Enabled RoCEv2

Block size 4Kb, 8 Threads, 8 Outstanding I/O (100% Read)

Please note that the 100% read output is a bit skewed since the reads are all served locally. However, running the same number of threads on any workload that involves writes drastically increases latency and reduces the number of IOPS, as shown in the subsequent tests.

Test 2 – Random 4K, 4 Threads, 8 Outstanding I/O, 100% Write

Total 460K IOPS – Read/Write Latency @ 0.02/2.5(ms)

Each VM is configured with:

  • 4K IO size
  • 10GB working set
  • 0% read and 100% write
  • No Storage QoS
  • RDMA Enabled RoCEv2

Block size 4Kb, 4 Threads, 8 Outstanding I/O (100% Write)

Test 3 – Random 4K, 4 Threads, 8 Outstanding I/O, 70% Read / 30% Write

Total 1 Million IOPS – Read/Write Latency @ 0.01/0.4(ms)

Each VM is configured with:

  • 4K IO size
  • 10GB working set
  • 70% read and 30% write
  • No Storage QoS
  • RDMA Enabled RoCEv2

Block size 4Kb, 4 Threads, 8 Outstanding I/O, (30% Write / 70% Read)

Test 4 – Random 4K, 4 Threads, 8 Outstanding I/O, 50% Read / 50% Write

Total 785K IOPS – Read/Write Latency @ 0.1/0.7(ms)

Each VM is configured with:

  • 4K IO size
  • 10GB working set
  • 50% read and 50% write
  • No Storage QoS
  • RDMA Enabled RoCEv2

Block size 4Kb, 4 Threads, 8 Outstanding I/O, (50% Write / 50% Read)

Test 5 – Sequential 512K, 1 Thread, 1 Outstanding I/O, 100% Read

Total 72K IOPS – Read/Write Latency @ 0.7/0.3(ms)

Each VM is configured with:

  • 512K IO size
  • 10GB working set
  • 100% read and 0% write
  • No Storage QoS
  • RDMA Enabled RoCEv2

Block size 512Kb, 1 Thread, 1 Outstanding I/O, (100% Read)

Test 6 – Sequential 512K, 1 Thread, 1 Outstanding I/O, 100% Write

Total 17K IOPS – Read/Write Latency @ 0.00/3.3(ms)

Each VM is configured with:

  • 512K IO size
  • 10GB working set
  • 0% read and 100% write
  • No Storage QoS
  • RDMA Enabled RoCEv2

Block size 512Kb, 1 Thread, 1 Outstanding I/O, (100% Write)
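As a sanity check across all six tests, the aggregate IOPS figures above can be converted into approximate throughput (IOPS × block size) and per-VM averages across the 60-VM fleet. A minimal Python sketch; the GB/s figures use binary (KiB-based) conversion and are approximations, not measured bandwidth:

```python
# Approximate throughput and per-VM IOPS derived from the results above.
tests = {
    "Test 1 (4K random, 100% read)":  (2_500_000, 4),
    "Test 2 (4K random, 100% write)": (460_000, 4),
    "Test 3 (4K random, 70/30)":      (1_000_000, 4),
    "Test 4 (4K random, 50/50)":      (785_000, 4),
    "Test 5 (512K seq, 100% read)":   (72_000, 512),
    "Test 6 (512K seq, 100% write)":  (17_000, 512),
}
vms = 60  # size of the VM Fleet

for name, (iops, block_kb) in tests.items():
    gb_per_s = iops * block_kb / 1024 / 1024  # KB -> GB (binary)
    print(f"{name}: ~{gb_per_s:.1f} GB/s, ~{iops // vms:,} IOPS per VM")
```

For example, Test 1 works out to roughly 9.5 GB/s aggregate and about 42K IOPS per VM, while the sequential 512K read test moves far more data per operation despite the much lower IOPS number.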

DataON and Windows Admin Center integration

DataON MUST is a hybrid-cloud infrastructure monitoring and management tool. It's designed to integrate seamlessly with Windows Admin Center, providing a single pane of glass that consolidates all aspects of local and remote server, cluster, and Azure Stack HCI monitoring and management.

DataON MUST integration with Windows Admin Center

The second integration is DataON MUST Pro, which plugs into Windows Admin Center's cluster creation and Cluster-Aware Updating (CAU) functionality to simplify deployment of and updates to Microsoft Azure Stack HCI, with minimal disruption to your infrastructure.

MUST Pro automatically compares your DataON Integrated Systems for Azure Stack HCI against DataON’s latest quarterly validated server component image baseline. It also ensures that servers have the same OS version, drivers, firmware, BIOS, and BMC, and checks the drivers and firmware for network cards, host bus adapters, and SSD and HDD drives.

Summary

In this article, I shared my experience and the performance results achieved with three-way mirror resiliency on a 3-node DataON AZS-216 Integrated System cluster. For more information about Azure Stack HCI, please check the official Microsoft documentation.

Always remember that storage is cheap, but downtime is expensive!

Let me know what you think in the comment section below.

Thank you for reading my blog.

If you have any questions or feedback, please leave a comment.

-Charbel Nemnom-
