Server Hang Data Collection Checklist

 

Plan for Hung VM 

 

Prerequisites Prior to an Incident

 

Verify Serial Console Access 

Verify that Serial Console access has been enabled through the Azure Portal for each cluster.
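
Serial Console relies on boot diagnostics being enabled on each VM. As a minimal sketch (assuming a recent Azure CLI and placeholder resource group/VM names), this prerequisite can be checked or enabled from a PowerShell prompt:

    # Placeholder names - replace with the actual resource group and VM name.
    $rg = "MyResourceGroup"
    $vm = "SQLVM01"

    # Enable boot diagnostics (required for Serial Console); without --storage,
    # recent CLI versions use a managed storage account.
    az vm boot-diagnostics enable --resource-group $rg --name $vm

    # Confirm the setting.
    az vm show --resource-group $rg --name $vm --query "diagnosticsProfile.bootDiagnostics"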

 

NetLogon Logging Preparation  

NetLogon logging is essential to better understand the NetLogon errors that appear in the event log.  The logging is lightweight by design, as it usually runs on Domain Controllers.  We expect NetLogon activity to be low on a SQL Server, so we would not expect this to impact performance.  The logging is simple to start and stop, so we can easily test it on a few systems before deploying it broadly.

If leaving the logging enabled at all times is problematic, then we can attempt to enable the trace once an incident occurs.  However, it is possible we will be unable to enable it while the server is in the broken state.  More importantly, if the logging is always enabled then the log will capture the transition from healthy to broken, which may contain a key to understanding the root cause, especially if the root cause is truly NetLogon related. 

No server downtime is required to start or stop the verbose NetLogon logging.  This would need to be done on each server we wish to trace. 

 

To enable NetLogon logging: nltest /dbflag:2080ffff 

To disable NetLogon logging: nltest /dbflag:0 
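
As a small sketch of the above from an elevated prompt (the 100 MB cap is an illustrative value; MaximumLogFileSize is the standard Netlogon registry parameter that makes the log roll over to netlogon.bak):

    # Enable verbose NetLogon logging; output goes to C:\Windows\debug\netlogon.log.
    nltest /dbflag:2080ffff

    # Optional: cap the log file size (value in bytes; 100 MB shown as an example)
    # so the log rolls over to netlogon.bak instead of growing without bound.
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters" /v MaximumLogFileSize /t REG_DWORD /d 104857600 /f

    # To disable logging later:
    nltest /dbflag:0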

 

Memory Dump Configuration 

Pre-configuration: Download and Install the Windows SDK Preview:  

- Go to: Flight Hub - Windows Insider Program | Microsoft Docs 
- Look for a recent build with an entry in the SDK Column. 
- Click the date in the SDK column (for example, I used build 20279 with a date of 12/16/2020). 
- It will download an ISO file containing the installer. 
- When running the installer, clear all checkboxes EXCEPT “Debugging Tools for Windows”. 
- Retrieve the kdbgctrl.exe file for the correct architecture (x64 for these VMs).  By default it installs to:
C:\Program Files (x86)\Windows Kits\10\Debuggers\x64 
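
Optionally, kdbgctrl.exe can be staged onto each target server ahead of an incident. This sketch assumes the default install path above and a hypothetical C:\Tools staging folder:

    # Source: default install location of the Debugging Tools for Windows (x64).
    $source = "C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\kdbgctrl.exe"

    # Example staging folder on the SQL Server VM - adjust to local standards.
    New-Item -ItemType Directory -Path "C:\Tools" -Force | Out-Null
    Copy-Item -Path $source -Destination "C:\Tools\kdbgctrl.exe"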

1. Set the following registry values under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl: 

·       [REG_DWORD] AutoReboot to 0  

·       [REG_DWORD] CrashDumpEnabled to 7  

·       [REG_SZ] DedicatedDumpFile to D:\dd.sys (D: is the local temp drive for the Azure VM)  
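
A sketch of step 1 from an elevated PowerShell prompt, using exactly the key and values listed above:

    # Crash dump settings for the NMI-triggered dump.
    $key = "HKLM:\SYSTEM\CurrentControlSet\Control\CrashControl"
    Set-ItemProperty -Path $key -Name AutoReboot        -Value 0           -Type DWord   # do not reboot automatically after the crash
    Set-ItemProperty -Path $key -Name CrashDumpEnabled  -Value 7           -Type DWord   # dump type value specified above
    Set-ItemProperty -Path $key -Name DedicatedDumpFile -Value "D:\dd.sys" -Type String  # dedicated dump file on the local temp drive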

2. Use the kdbgctrl.exe tool from the installed SDK to set the dump type.

 

For a complete dump (size of RAM): 

Run "kdbgctrl.exe -sd full" from an elevated command prompt. 

 

For a kernel dump (typically 5-15% size of RAM): 
 
Run "kdbgctrl.exe -sd kernel" from an elevated command prompt. 

2a) We expect success to be reported and a file named dd.sys to be generated under D:\. Its size will equal the amount of RAM for a full (complete) dump and will be smaller for a kernel dump. Delete the DedicatedDumpFile value afterwards to ensure the system does not automatically try to recreate the file on reboot.

2b) If the file dd.sys is not generated (or does not have the expected size), inform Microsoft support as soon as possible so we can investigate.
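
A quick check for steps 2a/2b might look like the following, assuming a complete dump was selected (for a kernel dump, dd.sys will be smaller than RAM):

    # Compare the dedicated dump file size to installed RAM.
    $ramBytes  = (Get-CimInstance Win32_ComputerSystem).TotalPhysicalMemory
    $dumpBytes = (Get-Item "D:\dd.sys").Length
    "RAM: {0:N0} bytes   dd.sys: {1:N0} bytes" -f $ramBytes, $dumpBytes

    # Once the file looks right, remove DedicatedDumpFile so the system does not
    # recreate the file on reboot (per step 2a).
    Remove-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\CrashControl" -Name DedicatedDumpFile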

3. Monitor the server. 

3a) If the server reboots during this period (scheduled maintenance, a Blue Screen error, etc.), delete the file dd.sys and then redo steps 1-2. (In the case of a Blue Screen error, keep a copy of dd.sys first if you want to investigate the cause, since the file is required for that.)

3b) If the server-unreachable issue recurs, complete the sanity checks to profile the exact situation, and then trigger an NMI from the Azure portal. We expect the dump progress indicator to reach 100%. Then work with Microsoft to complete diagnostics from the backend.

4. Restart the server from the portal (do NOT use Stop and then Start). Retrieve the file dd.sys, compress it, and upload it to Microsoft for debugging.

 

 

User Memory Dump Considerations 

 

Benefits 

·       Dump configuration doesn't require a server reboot.

·       Dump configuration has no impact on server performance and no risk to server stability.

·       Can validate beforehand that the dump configuration is effective - greater confidence of success.

·       Can see progress directly as a percentage complete.

·       Full control of the procedure - can reboot the server at any time to recover HA.

·       Full access to the memory dump.

·       Less downtime expected vs. MSFT generating a dump from the backend.*

·       Less time to abort vs. MSFT aborting a backend dump generation.*

·       Flexibility in configuring the dump type (complete or kernel). If downtime needs to be 1 hour or less, a kernel dump via the Molina Controlled Memory Dump procedure is the only viable option (the backend option takes longer).

o   A kernel dump will introduce lower risk and lower time cost.

o   A kernel dump will provide less data, but can still be productive in certain scenarios.

 

Risks 

·       More complexity for Molina to stage. 

·       Must be configured on all clusters - time and labor. 

·       Temp drive - available space must be validated (a complete dump can be as large as RAM).

·       If a dedicated Data Disk is required for the dump, it will add cost.

·       Can take 2-4 hours either way - whether Molina controlled or MS backend generation.*

 

 

*MS Backend Dump Generation 

 

Benefits 

·       Minimal action - no staging on server side. 

 

Risks 

·       The MS process is still being developed and has some risk of failure.

·       Communication complexity - more time to initiate dump generation, measured from case escalation.

·       Longer time to abort - additional risk to HA.

·       By default policy, dump access cannot be provided to a 3rd party.

 

 

Incident Response Plan 

 

1.      Raise case to Sev A. 

2.      Do not reboot the VM. 

3.     Log into another working VM in the same VNet as the hung VM and run the following connectivity tests (see the sketch below):

a.     ping /t <hung VM>

b.     psping -t <hung VM>:3389  NOTE: This tests network-level access to the RDP service.
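
Spelled out with a placeholder hostname (psping is a Sysinternals tool and must already be present on the working VM), the checks in step 3 are:

    # Placeholder - substitute the hung VM's hostname or IP address.
    $hungVm = "SQLVM01"

    # Continuous ICMP test; leave this running in its own window.
    ping /t $hungVm

    # Continuous TCP connect test against the RDP port (3389).
    psping -t "${hungVm}:3389"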

4.      Open the Azure Portal and go to the Serial Console, under Support + Troubleshooting in the VM blade.

5.      From the Serial Console, run the following and record the results.  Note: If the psping test above is successful, you can skip items 5.a through 5.d.

a.     ipconfig - make sure the IP is assigned 

b.     tasklist /svc | findstr /i term – make sure TermService is running and confirm the PID 

c.     netstat -aon | findstr /i <PID_TermService> - make sure TermService is listening on the port 

d.     Start PowerShell, then run “netsh advfirewall firewall show rule dir=in name=all | select-string -pattern "(LocalPort.*3389)" -context 9,4 | more”. Make sure the rules for port 3389 are enabled and the action is Allow. Then run the following commands and confirm whether the psping on the other VM turns positive (if so, stop here, as connectivity has been restored):

        i.    netsh advfirewall firewall add rule name=foo dir=in protocol=tcp localport=3389 action=allow

        ii.   netsh advfirewall set allprofiles state off

        iii.  Restart-Service -Name mpssvc

e.       Start Netlogon logging if not already enabled. 

        i.    nltest /dbflag:2080ffff

f.       Start a network trace.  Run: netsh trace start capture=yes  (an example with an explicit trace file is sketched after this list)

g.     Run ping /t to another VM in the same VNet.

h.     Attempt to RDP to the server.

i.       SQL: If possible, attempt a new connection to the SQL Server via SSMS.

j.       SQL: If the SA login is available, test it if possible.

k.      Stop Network trace. 

        i.    netsh trace stop

l.         Stop Netlogon logging if you started it after the server entered the broken state. 

        i.    nltest /dbflag:0
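
For item 5.f, an example with an explicit trace file on the temp drive (the file name and 512 MB cap are illustrative values) keeps the capture easy to find and bounded in size:

    # Start the capture (item 5.f) with an explicit output file and size cap.
    netsh trace start capture=yes tracefile=D:\nettrace.etl maxsize=512 overwrite=yes

    # ...repeat the connectivity and logon tests in items 5.g through 5.j while the trace runs...

    # Stop the capture (item 5.k); netsh also writes a .cab of supporting data next to the .etl.
    netsh trace stop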

6.      Wait for the MSFT team to join a call to fully triage the VM behavior and capture available data.

a.      What was the first alert or signal of an issue (e.g., a user-reported error or a system-generated error)?

        i.    What precise date/time/timezone?

        ii.   What precise error did the user see?

        iii.  What precise error was thrown in the alert?

1.      Is there an associated log?

b.     What troubleshooting steps were taken by the Molina IT team?

c.      What server checks were made?

        i.    Is RDP successful? If not, what is the precise error?

d.     What SQL checks were made? 

e.      What Cluster checks were made? 

7.      Return to Serial Console 

a.      Ensure VM has been pre-configured for a memory dump. 

b.     Send a Non-Maskable Interrupt (NMI) to trigger the memory dump. We expect the dump progress indicator to reach 100%. Then work with Microsoft to complete diagnostics from the backend. Finally, restart the server from the portal (do NOT use Stop and then Start).

8.      After the VM reboots, ensure the production environment is stable and back online.

9.      Work with the MSFT team to pull additional logs.

a.      Memory dump file. 

b.     Windows logs 

        i.    System log

        ii.   Application log

        iii.  NetLogon logs: c:\windows\debug\netlogon.log and netlogon.bak

        iv.   Network trace

c.      Cluster logs  

        i.    Open PowerShell as admin.

        ii.   Run: Import-Module FailoverClusters

        iii.  Run: Get-ClusterLog

        iv.   This creates a cluster log on each node, located under "C:\Windows\Cluster\Reports".  Please upload this file to the File Transfer link provided by MSFT.  (A sketch with optional parameters follows below.)
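
Get-ClusterLog also accepts optional parameters that can make collection easier; the destination folder and 24-hour window below are illustrative:

    # Generate cluster logs for every node, limited to the last 24 hours (1440 minutes),
    # and copy the per-node .log files into a single folder for upload.
    Import-Module FailoverClusters
    Get-ClusterLog -Destination "C:\Temp\ClusterLogs" -TimeSpan 1440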

d.     SQL reports 

        i.    TSS report

1.      Download the tool from the link below and collect the logs.

2.      http://aka.ms/getTSS

        ii.   PSSDiag

1.      Open PowerShell as Admin.

2.      Switch to the folder location: SQL Base Diagnostics

3.      Run command: .\Get-psSDP.ps1 SQLBase

e.      Upload all logs to the link that will be provided by the MSFT team. 

 

Building on all that our teams have learned, it is essential to follow this action plan precisely as outlined so that all possible data is captured to help identify the cause of the VM hang.

 

Please note, this action plan is a starting point, and will continue to be reviewed and amended. 

 

Best,

 

 
