Running Azure Stack HCI on DataON Integrated Systems with All-NVMe Flash

In Windows Server 2016, Microsoft introduced a new type of storage called Storage Spaces Direct, which later became the foundation of the Azure Stack HCI program. Storage Spaces Direct enables building highly available storage systems with locally attached disks, without the need for any external SAS fabric such as shared JBODs or enclosures. This is the first true Software-Defined Storage (SDS) offering from Microsoft; Software-Defined Storage is an approach that abstracts storage services from dedicated hardware.

In Windows Server 2019, Microsoft added many improvements to Azure Stack HCI (formerly known as Windows Server Software-Defined, a.k.a. WSSD).

Fast forward to 2020: Microsoft introduced a new operating system dedicated to the hyper-converged deployment model, where innovation continues at a faster cadence than in Windows Server. This new operating system, Azure Stack HCI, is a hyper-converged infrastructure (HCI) operating system delivered as an Azure service, providing the latest security, performance, and feature updates.

Introduction

With Azure Stack HCI, you can deploy and run Windows and Linux virtual machines (VMs) in your datacenter or at the edge using your existing tools, processes, and skillsets. Additionally, you can extend the datacenter to the cloud with Azure Backup, Site Recovery, Azure File Sync, Azure Monitor, Azure Arc, and Azure Security Center.

On September 1st, 2021, Microsoft announced the GA release of Windows Server 2022 with no major improvements for Storage Spaces Direct. As noted earlier, all future innovations will go into Azure Stack HCI for running hyper-converged infrastructure; however, Windows Server will continue to benefit from improvements to existing features. Windows Server 2022 still lacks advanced features such as stretched clusters, but it did gain a new repair option for Storage Spaces Direct ("Adjustable Storage Repair Speed"), which lets system admins control how many resources to allocate either to repairing data copies or to active workloads.

I recently completed a 3-node Azure Stack HCI hyper-converged deployment on top of DataON AZS-216 Integrated Systems (all-NVMe flash) and hit over 2.5 million IOPS.

DataON AZS-216 2.5 Million IOPS

In this article, I would like to share with you my experience and performance results.

3-Node DataON Integrated Systems

For this deployment, I used the following hardware configuration:

  • DataON™ AZS-216 Integrated Systems For Azure Stack HCI OS
  • Supports Dual Intel Xeon® Scalable™ Gen 2 Processor Series & (24) DDR4 DIMM
  • Drive Bay: (16) NVMe U.2 2.5″ Hot-swappable
  • PCIe Slot: (7) PCIe 3.0 x8
  • Onboard NIC: (2) Built-In 10GbE RJ45
  • 1300W (1+1) 110V hot-swappable redundant PSU with NEMA 5-15 Power Cords
  • Intel® Remote Management Module 4
  • Intel® Xeon® Scalable Gen.2 Gold 5218R 2.1 GHz, 20-Core, 27.5MB Cache
  • 384GB (12x32GB) Samsung® DDR4 2933MHz ECC-Register RDIMM
  • 2 X Intel® S4510™ 480GB SATA M.2 Boot Drive For OS
  • 10 X Intel® DC P5510™ NVMe 3.8TB 2.5″ 144L 3D TLC SSD
  • 2 X NVIDIA|Mellanox® ConnectX-4 Lx EN Dual Port SFP+ 10/25GbE RDMA Card
  • 2 X NVIDIA|Mellanox® LinkX™ Passive Copper Cable, ETH, up to 25Gb/s, SFP28, 30 AWG
  • 2 X Mellanox® Spectrum™ 18-port 10/25GbE X 4-port 100GbE Switch (RDMA/RoCEv2)

The DataON AZS-216 Integrated Systems for Azure Stack HCI are pre-configured nodes with certified components, tested and validated by DataON and Microsoft, to help you build Azure Stack HCI clusters with ease.

In this configuration, all NVMe disks are used as capacity (all-flash) as shown in the inventory below.

DataON Azure Stack HCI | Drives Inventory

Resiliency

The Cluster Shared Volumes are configured with a three-way mirror for maximum resiliency within a single site. With a three-way mirror, the cluster can sustain two simultaneous failures while your workloads remain online.
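To put the resiliency/capacity trade-off in numbers, here is a quick back-of-the-envelope sketch in plain Python. The drive counts come from the hardware list above, and I assume the P5510's marketed 3.84 TB capacity per drive; the figures are approximations, since vendors quote decimal terabytes and pools report slightly different numbers after rounding.

```python
# Rough capacity math for three-way mirror resiliency.
# Drive counts and sizes are taken from the hardware list in this article.
nodes = 3
drives_per_node = 10
drive_tb = 3.84            # assumed marketed capacity of the Intel DC P5510

raw_tb = nodes * drives_per_node * drive_tb   # total raw pool capacity
mirror_copies = 3                             # three-way mirror keeps 3 copies
usable_tb = raw_tb / mirror_copies            # ~33% storage efficiency

print(f"Raw pool capacity:   {raw_tb:.1f} TB")     # 115.2 TB
print(f"Max usable (3-copy): {usable_tb:.1f} TB")  # 38.4 TB
```

This lines up with the ~117 TB pool reported below; the small difference comes from how drive capacities are rounded.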

To validate resiliency, you can test the following four failure scenarios:

1) Physical drive pull.

2) Reboot a node (observe failover).

3) Physical power pull of a node.

4) Shut down one node and pull a single drive from one of the remaining nodes that are still up.

Software Configuration

  • Host: Azure Stack HCI OS, version 20H2 (OS build 17784.1884)
  • Single storage pool (117 TB)
  • 3 × 10.3 TB volumes (three-way mirror)
  • ReFS/CSVFS file system
  • 60 virtual machines (20 VMs per node)
  • 2 virtual processors and 8 GB RAM per VM
  • VM guest OS: Windows Server 2019 Datacenter Core Edition with the August 2021 update
  • Jumbo frames enabled
  • CSV cache disabled for benchmarking purposes only; for real-world workloads, CSV cache is enabled with 16 GB
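As a quick sanity check on the configuration above, the footprint the three mirrored volumes consume from the pool can be computed directly. A minimal Python sketch; the figures come from the configuration list, and treating the remainder as repair headroom is my own interpretation:

```python
# Footprint of the configured volumes on the 117 TB pool.
pool_tb = 117.0        # single storage pool, from the configuration above
volumes = 3
volume_tb = 10.3       # usable size of each three-way mirrored volume
copies = 3             # three-way mirror stores every slab 3 times

footprint_tb = volumes * volume_tb * copies   # pool capacity consumed
unallocated_tb = pool_tb - footprint_tb       # headroom for repairs/growth

print(f"Volume footprint: {footprint_tb:.1f} TB")   # 92.7 TB
print(f"Unallocated:      {unallocated_tb:.1f} TB") # 24.3 TB
```

Leaving unallocated space in the pool is generally recommended so Storage Spaces Direct can repair in place immediately after a drive failure.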

Workload Configuration

DISKSPD version 2.0.21a was used as the workload generator, with VM Fleet as the workload orchestrator.
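The per-test parameters below map directly onto DISKSPD's command-line flags. The helper below is a hypothetical illustration of that mapping (the `test.dat` target, the 60-second duration, and the exact flag set are my assumptions, not the actual invocation; in practice, VM Fleet generates the real command lines inside each VM):

```python
# Hypothetical mapping from this article's test parameters to DISKSPD flags:
# -b block size, -t threads per target, -o outstanding I/O, -r random access,
# -w write percentage, -d duration (s), -Sh disable caching, -L collect latency.
def diskspd_args(block, threads, outstanding, write_pct, random=True, duration=60):
    flags = [f"-b{block}", f"-t{threads}", f"-o{outstanding}",
             f"-w{write_pct}", f"-d{duration}", "-Sh", "-L"]
    if random:
        flags.insert(0, "-r")
    return "diskspd.exe " + " ".join(flags) + " test.dat"

# Test 1: random 4K, 8 threads, 8 outstanding I/O, 100% read
print(diskspd_args("4K", 8, 8, 0))
# -> diskspd.exe -r -b4K -t8 -o8 -w0 -d60 -Sh -L test.dat

# Test 6: sequential 512K, 1 thread, 1 outstanding I/O, 100% write
print(diskspd_args("512K", 1, 1, 100, random=False))
# -> diskspd.exe -b512K -t1 -o1 -w100 -d60 -Sh -L test.dat
```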

Test 1 – Random 4K, 8 Threads, 8 Outstanding I/O, 100% Read

Total 2.5 Million IOPS – Read/Write Latency @ 0.1/0.6(ms)

Each VM is configured with:

  • 4K IO size
  • 10GB working set
  • 100% read and 0% write
  • No Storage QoS
  • RDMA Enabled RoCEv2

Block size 4Kb, 8 Threads, 8 Outstanding I/O (100% Read)

Please note that the 100% read output is a bit skewed since the reads are all served locally. However, running the same number of threads on any workload that involves writes drastically increases latency and reduces the number of IOPS, as shown in the subsequent tests.

Test 2 – Random 4K, 4 Threads, 8 Outstanding I/O, 100% Write

Total 460K IOPS – Read/Write Latency @ 0.02/2.5(ms)

Each VM is configured with:

  • 4K IO size
  • 10GB working set
  • 0% read and 100% write
  • No Storage QoS
  • RDMA Enabled RoCEv2

Block size 4Kb, 4 Threads, 8 Outstanding I/O (100% Write)

Test 3 – Random 4K, 4 Threads, 8 Outstanding I/O, 70% Read / 30% Write

Total 1 Million IOPS – Read/Write Latency @ 0.01/0.4(ms)

Each VM is configured with:

  • 4K IO size
  • 10GB working set
  • 70% read and 30% write
  • No Storage QoS
  • RDMA Enabled RoCEv2

Block size 4Kb, 4 Threads, 8 Outstanding I/O, (30% Write / 70% Read)

Test 4 – Random 4K, 4 Threads, 8 Outstanding I/O, 50% Read / 50% Write

Total 785K IOPS – Read/Write Latency @ 0.1/0.7(ms)

Each VM is configured with:

  • 4K IO size
  • 10GB working set
  • 50% read and 50% write
  • No Storage QoS
  • RDMA Enabled RoCEv2

Block size 4Kb, 4 Threads, 8 Outstanding I/O, (50% Write / 50% Read)

Test 5 – Sequential 512K, 1 Thread, 1 Outstanding I/O, 100% Read

Total 72K IOPS – Read/Write Latency @ 0.7/0.3(ms)

Each VM is configured with:

  • 512K IO size
  • 10GB working set
  • 100% read and 0% write
  • No Storage QoS
  • RDMA Enabled RoCEv2

Block size 512Kb, 1 Thread, 1 Outstanding I/O, (100% Read)

Test 6 – Sequential 512K, 1 Thread, 1 Outstanding I/O, 100% Write

Total 17K IOPS – Read/Write Latency @ 0.00/3.3(ms)

Each VM is configured with:

  • 512K IO size
  • 10GB working set
  • 0% read and 100% write
  • No Storage QoS
  • RDMA Enabled RoCEv2

Block size 512Kb, 1 Thread, 1 Outstanding I/O, (100% Write)
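As a sanity check across all six tests, the aggregate IOPS figures above can be converted into approximate throughput (IOPS × block size) and per-VM averages across the 60-VM fleet. A minimal Python sketch; the GB/s figures use binary (KiB-based) conversion and are approximations, not measured bandwidth:

```python
# Approximate throughput and per-VM IOPS derived from the results above.
tests = {
    "Test 1 (4K random, 100% read)":  (2_500_000, 4),
    "Test 2 (4K random, 100% write)": (460_000, 4),
    "Test 3 (4K random, 70/30)":      (1_000_000, 4),
    "Test 4 (4K random, 50/50)":      (785_000, 4),
    "Test 5 (512K seq, 100% read)":   (72_000, 512),
    "Test 6 (512K seq, 100% write)":  (17_000, 512),
}
vms = 60  # size of the VM Fleet

for name, (iops, block_kb) in tests.items():
    gb_per_s = iops * block_kb / 1024 / 1024  # KB -> GB (binary)
    print(f"{name}: ~{gb_per_s:.1f} GB/s, ~{iops // vms:,} IOPS per VM")
```

For example, Test 1 works out to roughly 9.5 GB/s aggregate and about 42K IOPS per VM, while the sequential 512K read test moves far more data per operation despite the much lower IOPS number.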

DataON and Windows Admin Center integration

DataON MUST is a hybrid-cloud infrastructure monitoring and management tool. It's designed to integrate seamlessly with Windows Admin Center, providing a single pane of glass that consolidates all aspects of local and remote server, cluster, and Azure Stack HCI monitoring and management.

DataON MUST integration with Windows Admin Center

The second integration is DataON MUST Pro, which plugs into Windows Admin Center's cluster creation and Cluster-Aware Updating (CAU) functionality to simplify deployment of and updates to Microsoft Azure Stack HCI, with minimal disruption to your infrastructure.

MUST Pro automatically compares your DataON Integrated Systems for Azure Stack HCI against DataON’s latest quarterly validated server component image baseline. It also ensures that servers have the same OS version, drivers, firmware, BIOS, and BMC, and checks the drivers and firmware for network cards, host bus adapters, and SSD and HDD drives.

Summary

In this article, I shared my experience and the performance results achieved with three-way mirror resiliency on a 3-node DataON AZS-216 Integrated System cluster. For more information about Azure Stack HCI, please check the official Microsoft documentation.

Always remember that storage is cheap, but downtime is expensive!

Let me know what you think in the comment section below.

Thank you for reading my blog.

If you have any questions or feedback, please leave a comment.

-Charbel Nemnom-
