Azure to Azure Stack site-to-site IPSec VPN tunnel failure… after 8 hours

19 Aug

We needed to create a site-to-site VPN tunnel from Azure Stack to Azure for a POC. It seemed pretty straightforward. Spoiler alert: obviously I’m writing this because it wasn’t.

The tunnel was created okay, but each morning it would no longer allow traffic to travel across it. The tunnel would show as connected in both Azure and Azure Stack, but traffic just wouldn’t flow; ping, SSH, RDP, DNS and AD all failed. After some tinkering we found that traffic would flow again if we changed the connection’s shared key to something random, saved it, then changed it back to the correct key (or recreated the connection from scratch). This only worked from the Azure Stack side of the connection. The tunnel would then work for another 8 hours before failing to pass traffic again.

My suspicion was the re-keying, as this would explain why it worked at first and then failed the next day (every day, for the last 5 days). I tried using VPN diagnostics on the Azure side, as VPN diagnostics is not currently supported on Azure Stack (we are on update 1805). The IKE log showed some errors, but it was hard to find anything that told me what was going wrong or, more to the point, what I could do to fix it. Below is the IKE log file I collected through the VPN diagnostics from Azure.

I logged a case with Microsoft support. The first support person did their best. They had a tool called aznetassist that visualized what was going on and helped collect some logs. The yellow and blue boxes are separate Azure subscriptions, with the ‘infrastructure’ Azure VPN router circled in red dashes. The gray box labelled ‘Not Azure’ is Azure Stack. While Microsoft can identify the endpoints they are connecting to from Azure, they do not have permission to dig any deeper and look into the contents of our subscriptions hosted on Azure Stack. I was asked to change the address prefixes on the local network gateways from specific subnets to the entire vnet address space. While it worked initially, it failed again after 8 hours.

The support engineer collected some network traffic and other logs and forwarded the case to an Azure Stack support engineer. Once the case was assigned, they asked me to connect to the privileged endpoint (PEP) and we proceeded with breaking the glass on Azure Stack to troubleshoot. The engineer gave me a few PowerShell commands to run to investigate what was going on.

# First find out which of the VPN gateways is active.
icm Azs-gwy01,Azs-gwy02 { Get-VpnS2SInterface }

# Check the Quick Mode key exchange (the IPSec SAs that carry traffic).
icm Azs-gwy01 { Get-NetIPsecQuickModeSA }

# Check the Main Mode key exchange (the IKE SA between the gateways).
icm Azs-gwy01 { Get-NetIPsecMainModeSA }

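The Microsoft engineer had a hunch about exactly what he was looking for, and he was on point. The commands showed that the Quick Mode key exchange had failed to complete its refresh, yet the Main Mode exchange had succeeded. This explained why the tunnel showed as connected but no traffic could flow across it: Main Mode (IKE phase 1) only maintains the control channel between the gateways, while Quick Mode (phase 2) negotiates the IPSec SAs that actually carry the data.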
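The check the engineer did by eye can be written down as a small rule. The sketch below is only an illustration: `Get-TunnelDiagnosis` is a hypothetical helper, not something that ships with Azure Stack, and it assumes that a failed Quick Mode refresh shows up as a missing Quick Mode SA while the Main Mode SA is still listed.

```powershell
# Hypothetical helper (not part of Azure Stack): classifies a tunnel from SA counts.
function Get-TunnelDiagnosis {
    param(
        [int]$MainModeCount,   # number of Main Mode (IKE) SAs for the peer
        [int]$QuickModeCount   # number of Quick Mode (IPSec) SAs for the peer
    )

    if ($MainModeCount -eq 0) {
        return 'Down: no Main Mode SA, IKE is not established'
    }
    if ($QuickModeCount -eq 0) {
        # Assumption: this is how the "connected but no traffic" state presents.
        return 'Stalled: Main Mode SA present but no Quick Mode SA, traffic cannot flow'
    }
    return 'Healthy: Main Mode and Quick Mode SAs are both present'
}

# The counts could be gathered with the commands above, for example:
#   $mm = @(icm Azs-gwy01 { Get-NetIPsecMainModeSA }).Count
#   $qm = @(icm Azs-gwy01 { Get-NetIPsecQuickModeSA }).Count
Get-TunnelDiagnosis -MainModeCount 1 -QuickModeCount 0
```

The function takes plain numbers, so the rule can be tried anywhere without access to the PEP.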


We rebooted the active VPN gateway so the tunnels would fail over to the second gateway. Logging was on by default, so we just had to wait for the next timeout to occur. When it did, I was given the task of collecting and uploading the logs from the PEP.

These logs are a series of ETL files that need to be processed by Microsoft to make sense of them. Fortunately, the analysis turned up the following log entries.

As commented above, the root cause was that the PFS and CipherType settings were incorrect on the Azure VPN gateway. I was given a few PowerShell commands to run against the Azure subscription to reconfigure the IPSec policy for the connection on the Azure side so that it matches the policy of the VPN gateway and connection on Azure Stack.

$RG1 = 'RESOURCE GROUP NAME'
$CONN = 'CONNECTION NAME'
# Fetch the existing connection object on the Azure side.
$GWYCONN = Get-AzureRmVirtualNetworkGatewayConnection -Name $CONN -ResourceGroupName $RG1
# Build an IPSec/IKE policy that matches the Azure Stack gateway.
$newpolicy = New-AzureRmIpsecPolicy -IkeEncryption AES256 -IkeIntegrity SHA256 -DhGroup DHGroup2 -IpsecEncryption GCMAES256 -IpsecIntegrity GCMAES256 -PfsGroup PFS2048 -SALifeTimeSeconds 27000 -SADataSizeKilobytes 33553408
# Apply the policy to the connection.
Set-AzureRmVirtualNetworkGatewayConnection -VirtualNetworkGatewayConnection $GWYCONN -IpsecPolicies $newpolicy
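The two lifetime values in that policy are easier to relate to the symptoms once converted to friendlier units. The sketch below uses only the two numbers from the New-AzureRmIpsecPolicy call; the variable names are made up for illustration. The 27,000 second lifetime works out to 7.5 hours, which lines up with the tunnel failing roughly every 8 hours, though that link is an inference and not something support confirmed.

```powershell
# Values copied from the New-AzureRmIpsecPolicy call above.
$saLifeTimeSeconds   = 27000
$saDataSizeKilobytes = 33553408

# Time-based rekey interval, in hours.
$lifetimeHours = [TimeSpan]::FromSeconds($saLifeTimeSeconds).TotalHours

# Volume-based rekey threshold, in gigabytes (KB -> MB -> GB).
$dataSizeGB = [math]::Round($saDataSizeKilobytes / 1024 / 1024, 2)

"Quick Mode SA lifetime: $lifetimeHours hours"
"Quick Mode SA data limit: about $dataSizeGB GB"
```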

Almost there. When I tried to run the command it failed, because the Basic SKU doesn’t allow custom IPSec policies. Once I changed the SKU from Basic to Standard the command worked, and the tunnel has been up and stable since.
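The SKU limitation can be checked before attempting the policy change. This is a minimal sketch: `Test-CustomIpsecPolicySupported` is a hypothetical helper, and treating every SKU other than Basic as supported is an assumption that goes beyond what happened here, where only Basic (rejected) and Standard (accepted) were tried.

```powershell
# Hypothetical helper: $true when the gateway SKU should accept a custom IPSec policy.
function Test-CustomIpsecPolicySupported {
    param(
        [Parameter(Mandatory)]
        [string]$SkuName
    )
    # Only Basic is treated as unsupported; every other SKU is assumed to be fine.
    return ($SkuName -ne 'Basic')
}

# Against a live subscription the SKU name could be read like this
# ('GATEWAY NAME' is a placeholder):
#   $gw = Get-AzureRmVirtualNetworkGateway -Name 'GATEWAY NAME' -ResourceGroupName $RG1
#   Test-CustomIpsecPolicySupported -SkuName $gw.Sku.Name
Test-CustomIpsecPolicySupported -SkuName 'Basic'
Test-CustomIpsecPolicySupported -SkuName 'Standard'
```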

While this is an easy fix that anyone can run against their Azure subscription without opening a support ticket, it does incur a cost difference. Hopefully in the future these policies will match out of the box between Azure and Azure Stack, so every consumer can use the Basic VPN SKU to connect Azure Stack to Azure over a secure tunnel.

Matthew Quickenden

Working with private cloud solutions for several years. Heavy focus on virtualization and automation. Recently working to help business move into and consume true cloud solutions.
