Adding Additional Nodes to Azure Stack

3 Dec

Last week, I had the opportunity to add some extra capacity to a four-node appliance that I look after. Luckily, I got to double the capacity, making it an eight-node scale unit. This post documents my experience and fills in the gaps that the official documentation leaves out 😊

From a high-level perspective, these are the activities that take place:

  • Additional nodes/servers are racked and stacked in the same rack as the existing scale unit.
  • The node BMC interface is configured with an IP address and gateway within the management network, along with the correct username/password (the same as the existing nodes in the cluster).
  • The Azure Stack OEM reconfigures the BMC and Top of Rack switches to enable the switch ports for the additional nodes.
  • The Azure Stack Operator adds each additional node to the scale unit (one at a time):
    • Compute resource becomes available first.
    • S2D rebalances the cluster; once that has completed, the additional storage is made available.

Easy huh?

Here’s a bit more detail on each of the steps:

Each of the nodes being added to the scale unit must be identical to the existing servers. This includes CPU, memory, storage capacity and hardware versions. The hardware must be installed as prescribed by the OEM and connected to the correct ports on the BMC and TOR switches. It is unclear whether this is the responsibility of the OEM or the operator, so check beforehand when you purchase your additional nodes.

For Azure Stack to be able to add the additional nodes to the scale unit, each node’s BMC interface must be configured with the correct IP address, subnet and gateway, and with a username and password that match the existing nodes in the scale unit. This is critical, as any misconfiguration will stop the node being added. As an example, assume that the management network is 10.0.0.0/27 and we have four existing nodes in our scale unit. 10.0.0.1 would be our gateway address and 10.0.0.3 – 10.0.0.6 would be the IP addresses of nodes 1 – 4, so for our first additional node we would use 10.0.0.7, incrementing from there up to a maximum of 16 nodes (10.0.0.18).
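The arithmetic in that example can be sketched in PowerShell. This is only an illustration: the function name Get-BmcAddress and its parameters are hypothetical, and the layout (node 1 at .3, node 16 at .18) is taken from the example above rather than from any OEM documentation, so check it against your own deployment worksheet.

```powershell
# Hypothetical helper: maps a node number to its BMC IP address, assuming the
# layout in the example above (10.0.0.0/27, node 1 at .3, node 16 at .18).
function Get-BmcAddress {
    param(
        [Parameter(Mandatory)] [ValidateRange(1, 16)] [int] $NodeNumber,
        [string] $NetworkPrefix = '10.0.0',   # first three octets of the management network
        [int] $FirstNodeHost = 3              # host octet used by node 1
    )
    $hostOctet = $FirstNodeHost + $NodeNumber - 1
    return "$NetworkPrefix.$hostOctet"
}

# BMC addresses for nodes 5 to 8, i.e. the four nodes added to a four-node scale unit
5..8 | ForEach-Object { Get-BmcAddress -NodeNumber $_ }
```

With the defaults above, node 5 maps to 10.0.0.7 and node 8 to 10.0.0.10.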

The network switches must be configured by the OEM. There is currently no provision for the additional configuration to be carried out via automation, and opening the network switches to an operator would break the principle of Azure Stack being a black-box appliance. The switches need reconfiguring to enable the additional ports on the BMC switch and the two Top of Rack switches. Unused ports are purposely not enabled, to keep the configuration as secure as possible.

Before attempting to add the additional nodes, check in the Administrator Portal whether any existing FRU operations are taking place (e.g. rebuild of an existing node due to a hardware issue).

OK, once all of the above has been carried out, the Azure Stack operator can start to add the additional nodes. From the Administrator portal:

Select Dashboard -> Region Management


Select Scale Units to open the blade:


We currently have a nice and healthy four-node cluster 😊. Select Add Node to open the configuration blade.

With the current release, as there is only one region and one scale unit, there is only one option that we can select for each of the first two drop-downs.

Enter the BMC IP address of the additional node and select OK.

[Screenshot: the Add node blade, showing the Region, Scale unit and BMC IP address fields]

My first attempt didn’t work as the BMC was incorrectly configured (the gateway address for the BMC adapter was not set).

This is the error you will see if there is a problem:

Failed to add the node (2:18 PM)
Failed to add node 172.16.… Error message: The BMC IP address '172.16.…' did not respond to a general query in the expected time. Verify that the BMC with this IP address has connected to the network and configured with the…

Correcting the gateway address solved the problem. Here is what you’ll see when the scale unit is expanding:

[Screenshot: the Scale Units blade, with the Add node button, while the scale unit is expanding]

After a few minutes, you will see the additional node being listed as a member of the s-cluster scale unit, albeit listed as Stopped.

Clicking on the new node in the blade shows its status as ‘Adding’.

If you prefer, you can check the status via PowerShell. From a system that has the Azure Stack PowerShell modules installed, connect to the Admin Endpoint environment and run:


#Retrieve Status for the Scale Unit
Get-AzsScaleUnit | Select-Object Name, State

#Retrieve Status for each Scale Unit Node
Get-AzsScaleUnitNode | Select-Object Name, ScaleUnitNodeStatus, PowerState

[Screenshot: PowerShell output of the two commands above, listing nodes 01 to 04 as Running and the newly added node 05 with a status of Adding and a power state of Stopped]
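If you would rather not keep re-running the command by hand, the check can be wrapped in a polling loop. The following is a minimal sketch, not an official procedure: the function name Wait-ScaleUnitNodeRunning, its parameters and the timeout values are all hypothetical, and it assumes that a fully added node reports Running for both ScaleUnitNodeStatus and PowerState, as the existing nodes do in the output above. The node lookup is passed in as a script block so that the logic can be tried against sample data first.

```powershell
# Hypothetical helper: polls until the named node reports Running for both its
# status and its power state, or until the timeout expires. Returns $true or $false.
function Wait-ScaleUnitNodeRunning {
    param(
        [Parameter(Mandatory)] [string] $NodeName,
        # How to fetch the nodes; defaults to the cmdlet used above
        [scriptblock] $GetNodes = { Get-AzsScaleUnitNode },
        [int] $PollSeconds = 300,
        [int] $TimeoutMinutes = 240
    )
    $deadline = (Get-Date).AddMinutes($TimeoutMinutes)
    do {
        # Assumption: the node name may carry a prefix, so match on the suffix
        $node = & $GetNodes | Where-Object { $_.Name -like "*$NodeName" } | Select-Object -First 1
        if ($node -and $node.ScaleUnitNodeStatus -eq 'Running' -and $node.PowerState -eq 'Running') {
            return $true
        }
        Write-Host "$(Get-Date -Format 'HH:mm') $NodeName not running yet, waiting..."
        Start-Sleep -Seconds $PollSeconds
    } while ((Get-Date) -lt $deadline)
    return $false
}

# Dry run against made-up data instead of a live scale unit
$sampleNodes = { [pscustomobject]@{ Name = 'Node05'; ScaleUnitNodeStatus = 'Running'; PowerState = 'Running' } }
Wait-ScaleUnitNodeRunning -NodeName 'Node05' -GetNodes $sampleNodes -PollSeconds 0 -TimeoutMinutes 1
```

Against a live system you would drop the -GetNodes argument so that the default lookup is used.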

Whilst the expansion was taking place, a critical alert fired. It’s safe to ignore this.


A successfully completed node addition will show the power status as Running, plus the additional cores and memory available to the cluster:

It takes a little shy of 3 hours to complete the addition of a single cluster node:


Note: you can only add one extra node at a time. If you try to add another node while one is still being added, an error will be thrown as below:

Failed to add the node (8:52 PM)
Failed to add node … Error message: Additional scale unit nodes cannot be added while there are scale unit nodes in the scale unit that are not running and operational…
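One way to avoid that error is to check that every node is running before starting the next addition. The sketch below shows the idea with a hypothetical function, Get-BlockingScaleUnitNode, run against made-up sample data. It assumes that "running and operational" corresponds to both ScaleUnitNodeStatus and PowerState being Running, which matches what the portal showed for me but is not something I have seen documented.

```powershell
# Hypothetical helper: returns the nodes that would block another add, i.e.
# any node whose status or power state is not Running.
function Get-BlockingScaleUnitNode {
    param([Parameter(Mandatory)] [object[]] $Nodes)
    $Nodes | Where-Object {
        $_.ScaleUnitNodeStatus -ne 'Running' -or $_.PowerState -ne 'Running'
    }
}

# Sample data standing in for the output of Get-AzsScaleUnitNode (names are made up)
$sample = @(
    [pscustomobject]@{ Name = 'Node04'; ScaleUnitNodeStatus = 'Running'; PowerState = 'Running' }
    [pscustomobject]@{ Name = 'Node05'; ScaleUnitNodeStatus = 'Adding';  PowerState = 'Stopped' }
)

$blocking = @(Get-BlockingScaleUnitNode -Nodes $sample)
if ($blocking.Count -gt 0) {
    "Wait before adding another node; not yet running: $($blocking.Name -join ', ')"
}
```

On a live system, the $sample array would be replaced with the output of Get-AzsScaleUnitNode.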

 

I found that the first node added without a hitch, but subsequent nodes had some issues. I got error messages stating ‘Device not found’ on a couple of occasions. In hindsight, I guess that the cluster was performing some S2D operations in the background and these clashed with the newly added node. To fix this, I had to perform a ‘Repair’ on the new node, which fixed the problem on the first attempt each time. If there were more information about what is actually happening under the hood, I could give a more qualified answer.

Eventually, all nodes were added 😊

Adding those additional nodes does not add additional storage to the Scale Unit until the S2D cluster has rebalanced. The only way you know that a rebalance is taking place is that the status of the scale unit shows as ‘expanding’, and will do for a long time after adding the additional node(s)!

Here’s how the Infrastructure File Shares blade looks whilst expansion is taking place:

[Screenshot: the Storage - File shares blade, listing four infrastructure file shares, all Healthy, each with a total capacity of 17.15 TB]

Once expansion has completed, the additional infrastructure file shares are created:

Unfortunately, there is no way to check the progress of the rebalance operation either in the portal or via PowerShell. The Privileged Endpoint does include the Get-StorageJob cmdlet, but this is useless unless the support session is unlocked. If it is unlocked, the following script could be used to check:


$ClusterName = "s-cluster"

# Poll the clustered storage subsystem every 10 seconds until no storage jobs remain
do {
    $jobs = Get-StorageSubSystem -CimSession $ClusterName -FriendlyName Clus* | Get-StorageJob -CimSession $ClusterName
    if ($jobs) {
        $count = ($jobs | Measure-Object).Count
        $BytesTotal = ($jobs | Measure-Object BytesTotal -Sum).Sum
        $BytesProcessed = ($jobs | Measure-Object BytesProcessed -Sum).Sum
        # One percentage per job, joined so that several jobs print on one line
        $percent = $jobs.PercentComplete -join ', '
        Write-Output "$count Storage Job(s) Running. GBytes Processed: $($BytesProcessed/1GB) GBytes Total: $($BytesTotal/1GB) Percent: $percent%"
        Start-Sleep -Seconds 10
    }
} until (-not $jobs)
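The script above reports a separate percentage for each job. If a single overall figure is more useful, it can be derived from the byte counts. The following is a small sketch of that calculation using made-up numbers; the function name Get-OverallJobPercent is hypothetical, and it assumes the job objects expose the same BytesProcessed and BytesTotal properties used above.

```powershell
# Hypothetical helper: overall completion across several storage jobs,
# weighted by bytes rather than averaging each job's PercentComplete.
function Get-OverallJobPercent {
    param([object[]] $Jobs)
    if (-not $Jobs) { return 0 }    # nothing running
    $total     = ($Jobs | Measure-Object -Property BytesTotal -Sum).Sum
    $processed = ($Jobs | Measure-Object -Property BytesProcessed -Sum).Sum
    if (-not $total) { return 0 }   # totals not reported yet
    [math]::Round(100 * $processed / $total, 1)
}

# Made-up figures for two jobs; real input would be the $jobs variable above
$sampleJobs = @(
    [pscustomobject]@{ BytesProcessed = 50GB;  BytesTotal = 100GB }
    [pscustomobject]@{ BytesProcessed = 100GB; BytesTotal = 300GB }
)
Get-OverallJobPercent -Jobs $sampleJobs   # 150 GB of 400 GB processed, so 37.5
```

Weighting by bytes stops a small, nearly finished job from making the overall rebalance look further along than it is.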

In total, it took about 3 days to complete the rebalance operation and add the additional infrastructure shares, so bear that in mind if you’re planning to add four nodes to your scale unit in one exercise.

Hopefully my experience gives you some confidence that adding extra nodes is straightforward, but it can take a while to complete, so set your expectations accordingly!

 

Danny McDermott

Danny is a Cloud Architect within the Azure Cloud Enablement Team, based in the UK.
