Showing posts with label Server. Show all posts

Saturday, July 12, 2025

Troubleshooting ESXi PSOD: A Quick Guide for SREs

When an ESXi host hits a Purple Screen of Death (PSOD), it’s more than just a crash - it’s a signal that something critical needs attention. Here’s how to handle it effectively.


What happens during a PSOD?

  • The ESXi server displays a purple diagnostic screen.
  • You’ll see alerts or incidents for lost host connectivity, FC/Ethernet link-down events, and related alarms.
  • Opening the host console confirms the purple screen.

Immediate actions

  • Capture screenshots of the PSOD from the console screen.
  • Check server hardware health through the out-of-band management interface (iDRAC, RMC, BMC, etc.).
  • Observe if the host is stuck or rebooting repeatedly.
    • If ESXi reloads successfully, immediately place the node in Maintenance Mode via vCenter.
    • If it keeps crashing, try to capture all PSOD instances.

Collecting logs

  • Generate a support bundle from vCenter once the host is online.
  • Collect server hardware logs.
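
If you prefer scripting the support bundle step, the following is a minimal sketch using VMware PowerCLI. It assumes the PowerCLI module is installed and that you have an account with sufficient rights on vCenter. The function name Save-EsxiSupportBundle and its parameters are hypothetical, chosen only for illustration.

```powershell
# Sketch only: assumes the VMware PowerCLI module is installed.
# Save-EsxiSupportBundle is a hypothetical helper name, not part of PowerCLI.
function Save-EsxiSupportBundle {
    param(
        [Parameter(Mandatory)][string]$VCenter,          # vCenter FQDN or IP
        [Parameter(Mandatory)][string]$EsxiHost,         # host name as shown in vCenter
        [Parameter(Mandatory)][string]$DestinationPath   # existing local folder for the bundle
    )

    # Prompts for credentials if none are cached for this vCenter
    Connect-VIServer -Server $VCenter | Out-Null
    try {
        $vmHost = Get-VMHost -Name $EsxiHost
        # Generate the diagnostic bundle for this host and download it locally
        Get-Log -VMHost $vmHost -Bundle -DestinationPath $DestinationPath
    }
    finally {
        Disconnect-VIServer -Server $VCenter -Confirm:$false
    }
}
```

Example call (server names and folder are placeholders): Save-EsxiSupportBundle -VCenter "vcenter.example.com" -EsxiHost "esxi1.example.com" -DestinationPath "C:\Temp"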

Engage support

  • Broadcom/ VMware: Share PSOD screenshots and ESXi support bundle for RCA.
  • Hardware vendor: Attach server hardware logs, screenshots, and context for analysis.

Analyze crash dumps

  • Look for these keywords in the core dump logs: BlueScreen, Backtrace, Exception
  • In the ESXi support bundle, the crash dump logs are under the /var/core directory.
  • Analyzing the core dump files should help you find the root cause of the PSOD event. Typical causes include a hardware fault, a bug in the ESXi hypervisor, or a fault in device firmware or drivers.
  • You may notice many vmkernel-zdump files. To quickly pull all the BlueScreen events out of them, you can use the following PowerShell code snippet.
------------------------------------------------------------------
param(
    [string]$directoryPath,
    [string[]]$keywords
)

# Function to search for keywords in files
function Search-Files {
    param (
        [string]$path,
        [string[]]$keywords
    )

    # Get all files in the directory and subdirectories
    $files = Get-ChildItem -Path $path -Recurse -File

    # Loop through each file
    foreach ($file in $files) {
        # Read the content of the file
        $content = Get-Content -Path $file.FullName

        # Loop through each line in the file content
        foreach ($line in $content) {
            # Check if the line contains any of the keywords (literal text; -match is case-insensitive by default)
            foreach ($keyword in $keywords) {
                if ($line -match [regex]::Escape($keyword)) {
                    # Print the file name and the matching line
                    Write-Output "File: $($file.FullName)"
                    Write-Output "Line: $line"
                    Add-Content -path out.txt -value $line
                    break
                }
            }
        }
    }
}

# Call the function with the provided parameters
Search-Files -path $directoryPath -keywords $keywords
------------------------------------------------------------------
  • Save the snippet above as a .ps1 file (for example, find.ps1) and run it as follows:
> .\find.ps1 -directoryPath "C:\esx-esxi1.xre.com-2025-04-18--13.05-2106842\var\core" -keywords "bluescreen"
> .\find.ps1 -directoryPath "C:\esx-esxi1.xre.com-2025-04-18--13.05-2106842\var\core" -keywords "bluescreen", "#PF Exception"
  • Every log line that contains any of the given keywords is appended to out.txt in the current working directory.
  • A sample output of the above-mentioned code snippet against an ESXi core dump is given below.

  • Once you identify the root cause of the PSOD event, you can start working towards the resolution, which may involve replacing a faulty hardware component or updating firmware, drivers, or ESXi itself.
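
As an alternative to the line-by-line loop in find.ps1, here is a shorter sketch built on Select-String. It treats the keywords as literal text, matches case-insensitively, and returns objects (file, line number, line) instead of writing to out.txt. The function name Find-KeywordLines is hypothetical.

```powershell
# Sketch: list lines that contain any of the given keywords (literal, case-insensitive).
# Find-KeywordLines is a hypothetical name used for illustration.
function Find-KeywordLines {
    param(
        [Parameter(Mandatory)][string]$DirectoryPath,
        [Parameter(Mandatory)][string[]]$Keywords
    )

    # Select-String emits one match object per matching line, even if several keywords hit
    Get-ChildItem -Path $DirectoryPath -Recurse -File |
        Select-String -Pattern $Keywords -SimpleMatch |
        ForEach-Object {
            [pscustomobject]@{
                File       = $_.Path
                LineNumber = $_.LineNumber
                Line       = $_.Line
            }
        }
}
```

Example call: Find-KeywordLines -DirectoryPath "C:\esx-esxi1.xre.com-2025-04-18--13.05-2106842\var\core" -Keywords "bluescreen", "#PF Exception" | Export-Csv -Path matches.csv -NoTypeInformation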

Hope it was useful. Cheers!

Thursday, December 19, 2019

Working with iDRAC9 Redfish API using PowerShell - Part 4


In this article, I will explain how to use the iDRAC Redfish API to power on and gracefully shut down a server using PowerShell. This applies to all Dell EMC servers with iDRAC, whether it is a general-purpose PowerEdge rack server, a Ready Node, an appliance, etc. I've tested it on iDRAC9.

Note: In a production environment, please make sure to follow the proper shutdown or reboot procedure (if any) before performing any system reset actions on the server.

[CmdletBinding()]
param(
    [Parameter(Mandatory)]
    [String]$idrac_ip,

    [Parameter(Mandatory)]
    [ValidateSet('On', 'GracefulShutdown')]
    [String]$ResetType
)

#Work around certificate errors when connecting to the iDRAC REST API by accepting ALL certificates (use only on a trusted management network)
add-type @"
    using System.Net;
    using System.Security.Cryptography.X509Certificates;
    public class TrustAllCertsPolicy : ICertificatePolicy {
    public bool CheckValidationResult(
        ServicePoint srvPoint, X509Certificate certificate,
        WebRequest request, int certificateProblem) {
        return true;
        }
    }
"@

[System.Net.ServicePointManager]::CertificatePolicy = New-Object TrustAllCertsPolicy
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12 -bor [System.Net.SecurityProtocolType]::Tls11

#Get iDRAC creds
$Credentials = Get-Credential -Message "Enter iDRAC Creds"

$JsonBody = @{"ResetType" = $ResetType} | ConvertTo-Json
$u1 = "https://$($idrac_ip)/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset"

Invoke-RestMethod -Uri $u1 -Credential $Credentials -Method Post -UseBasicParsing -ContentType 'application/json' -Body $JsonBody -Headers @{"Accept"="application/json"} -Verbose
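
To confirm the result of the reset action, you can read the PowerState property of the same system resource with a GET request. The following is a minimal sketch. The helper names Get-IdracSystemUri and Get-IdracPowerState are hypothetical, and it assumes it runs in the same session as the script above (so the certificate policy is already in place) or against an iDRAC with a trusted certificate.

```powershell
# Hypothetical helper: builds the Redfish URI of the system resource used in the script above.
function Get-IdracSystemUri {
    param([Parameter(Mandatory)][string]$IdracIp)
    "https://$IdracIp/redfish/v1/Systems/System.Embedded.1"
}

# Hypothetical helper: returns the PowerState value reported by the system resource.
function Get-IdracPowerState {
    param(
        [Parameter(Mandatory)][string]$IdracIp,
        [Parameter(Mandatory)][pscredential]$Credential
    )

    $system = Invoke-RestMethod -Uri (Get-IdracSystemUri -IdracIp $IdracIp) `
        -Credential $Credential -Method Get -UseBasicParsing `
        -Headers @{ "Accept" = "application/json" }

    # PowerState is a standard property of the Redfish ComputerSystem resource
    $system.PowerState
}
```

Example call, reusing the variables from the script above: Get-IdracPowerState -IdracIp $idrac_ip -Credential $Credentials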


Hope it was useful. Cheers!

References


iDRAC9 Redfish API guide