Server Hang Data Collection Checklist
Plan for Hung VM
Prerequisites Prior to Incident
Verify Serial Console Access
Verify that Serial Console access has been enabled through the Azure Portal for each cluster.
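Serial Console requires boot diagnostics to be enabled on each VM. A minimal sketch of enabling it with the Azure CLI (the resource group and VM names are placeholders):
# Enable boot diagnostics so the Serial Console blade is available for the VM
az vm boot-diagnostics enable --resource-group <resource-group> --name <vm-name>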
NetLogon Logging Preparation
NetLogon logging is essential for better understanding the NetLogon errors that appear in the event log. The logging is lightweight by design, as it normally runs on Domain Controllers. We expect NetLogon activity to be low on a SQL Server, so we do not expect this to impact performance. The logging is simple to start and stop, so we can easily test it on a few systems before deploying it broadly.
If leaving the
logging enabled at all times is problematic, then we can attempt to
enable the trace once an incident occurs. However, it is possible we
will be unable to enable it while the server is in the broken state. More
importantly, if the logging is always enabled then the log will
capture the transition from healthy to broken, which may contain a key to
understanding the root cause, especially if the root cause is
truly NetLogon related.
No server downtime
is required to start or stop the verbose NetLogon logging. This
would need to be done on each server we wish to trace.
To enable NetLogon logging: nltest /dbflag:2080ffff
To disable NetLogon logging: nltest /dbflag:0
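To confirm that logging is active, check that the NetLogon debug log exists and is growing; a minimal sketch from PowerShell (the log path is the default noted later in this plan):
# Show the current size and last write time of the NetLogon debug log
Get-Item C:\Windows\debug\netlogon.log | Select-Object Length, LastWriteTime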
Memory Dump Configuration
Pre-configuration:
Download and Install the Windows SDK Preview:
- Go to: Flight Hub - Windows Insider Program | Microsoft Docs.
- Look for a recent build with an entry in the SDK column.
- Click the date in the SDK column (for example, build 20279 with a date of 12/16/2020 was used).
- This downloads an ISO file containing the installer.
- When running the installer, clear all checkboxes EXCEPT “Debugging Tools for Windows”.
- Retrieve the kdbgctrl.exe file for the correct architecture, which should be x64. By default it installs to: C:\Program Files (x86)\Windows Kits\10\Debuggers\x64
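To confirm the tool is present after installation, a quick check from PowerShell (using the default path above):
# Returns True if kdbgctrl.exe is in the default Debuggers folder
Test-Path 'C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\kdbgctrl.exe'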
1. Set the following registry values under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl (a sketch of equivalent reg.exe commands is shown below):
· [REG_DWORD] AutoReboot to 0
· [REG_DWORD] CrashDumpEnabled to 7
· [REG_SZ] DedicatedDumpFile to D:\dd.sys (D: is the local temp drive for the Azure VM)
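For reference, a minimal sketch of setting these values from an elevated command prompt (assuming D:\dd.sys as above):
:: Configure CrashControl for a dedicated dump file and no automatic reboot
reg add "HKLM\SYSTEM\CurrentControlSet\Control\CrashControl" /v AutoReboot /t REG_DWORD /d 0 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\CrashControl" /v CrashDumpEnabled /t REG_DWORD /d 7 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\CrashControl" /v DedicatedDumpFile /t REG_SZ /d D:\dd.sys /f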
2. Get the kdbgctrl.exe tool from the installed SDK.
For a complete dump (size of RAM): run "kdbgctrl.exe -sd full" from an elevated command prompt.
For a kernel dump (typically 5-15% of the size of RAM): run "kdbgctrl.exe -sd kernel" from an elevated command prompt.
2a) We expect success to be reported and a file named dd.sys to be generated under D:\ with a size equal to the RAM size for a full (complete) dump. It will be smaller if a kernel dump is selected. Delete the DedicatedDumpFile registry value afterwards (see the example below); this ensures the system does not automatically try to replace the file on reboot.
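A minimal sketch of removing the value from an elevated command prompt:
:: Remove the DedicatedDumpFile value so the file is not recreated on reboot
reg delete "HKLM\SYSTEM\CurrentControlSet\Control\CrashControl" /v DedicatedDumpFile /f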
2b) If you do not see the file dd.sys generated (with the expected size), inform Microsoft support as soon as possible so we can investigate.
3. Monitor the server.
3a) During this period, if the server reboots (scheduled maintenance, a Blue Screen error, etc.), delete the file dd.sys and then redo steps 1-2. (In the case of a Blue Screen error, keep dd.sys first if you want to investigate the cause, as the file is required for that.)
3b) If the server-unreachable issue recurs, complete the sanity checks to profile the exact situation, and then trigger an NMI from the Azure portal. We expect the dump progress indicator to reach 100% in the end. Then work with Microsoft to complete diagnostics from the backend.
4. Restart the server from the portal (must NOT use Stop and then Start). Get the file dd.sys and upload it to Microsoft for debugging, after compressing it (see the example below).
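A compression sketch from PowerShell (the paths are assumptions; note that Compress-Archive in Windows PowerShell 5.1 cannot handle source files larger than roughly 2 GB, so a full RAM-sized dump will likely require a different archiver):
# Compress the dump before upload (suitable for smaller kernel dumps)
Compress-Archive -Path D:\dd.sys -DestinationPath D:\dd.zip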
User Memory Dump Considerations
Benefits
· Dump configuration doesn't require a server reboot.
· Dump configuration has no impact on server performance, and no risk to server stability.
· Can validate beforehand whether the dump configuration is effective - greater confidence of success.
· Can see progress directly as a percentage complete.
· Full control of the procedure - can reboot the server at any time to recover HA.
· Get full access to the memory dump.
· Less downtime expected vs. MSFT generating a dump from the backend.*
· Less time to abort vs. MSFT aborting a backend dump generation.*
· Flexibility in configuring the dump type (complete or kernel). If downtime needs to be 1 hour or less, a kernel dump via the Molina Controlled Memory Dump procedure is the only viable option (the backend option takes longer).
o A kernel dump will introduce lower risk and lower time cost.
o A kernel dump will provide less data, but can still be productive in certain scenarios.
Risks
· More complexity for Molina to stage.
· Must be configured on all clusters - time and labor.
· Temp drive - will have to validate space.
· If a dedicated Data Disk is required for the dump, it will add cost.
· Can take 2-4 hours either way - as Molina-controlled or MS backend generation.*
*MS Backend Dump Generation
Benefits
· Minimal action - no staging on the server side.
Risks
· The MS process is still being developed and has some risk of failure.
· Complexity of communications - more time to initiate dump generation, beginning from case escalation.
· Longer time to abort - additional risk to HA.
· Dump access cannot be provided to a 3rd party by default policy.
Incident Response Plan
1. Raise case to Sev A.
2. Do not reboot the VM.
3. Log into another working VM in the same VNet as the hung VM.
a. ping /t <hung VM>
b. psping -t <hung VM>:3389 NOTE: This is to test network-level access to the RDP service.
4. Open the Azure Portal and go to the Serial Console, under Support + Troubleshooting in the VM blade.
5. From the Serial Console, run the following and record the results. Note: If the psping test above is successful, you can skip items 5.a through 5.d.
a. ipconfig - make sure the IP is assigned.
b. tasklist /svc | findstr /i term - make sure TermService is running and confirm the PID.
c. netstat -aon | findstr /i <PID_TermService> - make sure TermService is listening on the port.
d. Start PowerShell, and then run “netsh advfirewall firewall show rule dir=in name=all | select-string -pattern "(LocalPort.*3389)" -context 9,4 | more”. Make sure the rules for port 3389 are enabled and the action is Allow. Then proceed with the following commands and confirm whether the psping from the other VM turns positive (if so, we stop here, as connectivity has been restored):
i. netsh advfirewall firewall add rule name=foo dir=in protocol=tcp localport=3389 action=allow
ii. netsh advfirewall set allprofiles state off
iii. Restart-Service -Name mpssvc
e. Start NetLogon logging if not already enabled.
i. nltest /dbflag:2080ffff
f. Start a network trace. Run: netsh trace start capture=yes (an example with an explicit trace file is shown after this list).
g. Do "ping /t" to another VM in the same VNet.
h. Attempt to RDP to the server.
i. SQL: If possible, attempt a new connection to the SQL Server via SSMS.
j. SQL: If the SA login is available, please test it if possible.
k. Stop the network trace.
i. netsh trace stop
l. Stop NetLogon logging if you started it after the server entered the broken state.
i. nltest /dbflag:0
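As referenced in item 5.f, a minimal sketch of the network trace with an explicit trace file and size cap (the file path and 512 MB cap are assumptions; adjust to the local temp drive and available space):
:: Start a network capture capped at 512 MB, then stop it once the tests are complete
netsh trace start capture=yes tracefile=D:\nettrace.etl maxsize=512
netsh trace stop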
6. Wait for the MSFT team to join a call to fully triage the VM behavior and capture available data.
a. What was the first alert or signal of an issue? i.e., a user-reported error or a system-generated error?
i. What precise date/time/timezone?
ii. What precise error was seen by the user?
iii. What precise error was thrown in the alert?
1. Is there an associated log?
b. What troubleshooting steps were taken by the Molina IT team?
c. What server checks were made?
i. Is RDP successful? If not, what is the precise error?
d. What SQL checks were made?
e. What cluster checks were made?
7. Return to the Serial Console.
a. Ensure the VM has been pre-configured for a memory dump.
b. Send a Non-Maskable Interrupt (NMI) to trigger the memory dump. We expect the dump progress indicator to reach 100% in the end. Then work with Microsoft to complete diagnostics from the backend. Finally, restart the server from the portal (must NOT use Stop and then Start).
8. After the VM reboots, ensure the production environment is stable and back online.
9. Work with the MSFT team to pull additional logs.
a. Memory dump file.
b. Windows logs
i. System log
ii. Application log
iii. NetLogon logs: c:\windows\debug\netlogon.log and netlogon.bak
iv. Network trace
c. Cluster logs (a sketch with optional parameters is shown below)
i. Open PowerShell as admin.
ii. Run command: import-module failoverclusters
iii. Run: get-clusterlog
iv. This will create a cluster log per node; the file can be found on each node under "C:\Windows\Cluster\Reports". Please upload this file to the File Transfer link provided by MSFT.
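A minimal sketch of the cluster log collection with optional parameters to copy the per-node logs into one folder and use local timestamps (the destination folder is an assumption):
# Generate cluster logs for all nodes and copy them to C:\Temp using local time
Import-Module FailoverClusters
Get-ClusterLog -Destination C:\Temp -UseLocalTime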
d. SQL reports
i. TSS report
1. Download the file below and collect the logs.
ii. PSSDiag
1. Open PowerShell as Admin.
2. Switch to the folder location: SQL Base Diagnostics
3. Run command: .\Get-psSDP.ps1 SQLBase
e. Upload all logs to the link that will be provided by the MSFT team.
Building on all that our teams have learned, it is essential to follow this action plan precisely as outlined, to ensure that all possible data is captured to help identify the cause of the VM hang.
Please note, this action plan is a starting point and will continue to be reviewed and amended.
Best,