Showing posts with label vSAN. Show all posts
Thursday, September 19, 2019
Wednesday, September 18, 2019
vRealize Operations Manager 7.5 - Part7 - vSAN monitoring and troubleshooting
In this article, I will walk you through how to use vROps for vSAN monitoring and performance troubleshooting. It is always recommended to follow a systematic, established approach when troubleshooting problems. Before we start, here is a link to one of my articles that explains the scientific method of troubleshooting.
Given below is some very useful content from VMware on vSAN performance troubleshooting.
Performance Troubleshooting – Understanding the Different Levels of vSAN Performance Metrics
Performance Troubleshooting – Which vSAN Performance Metrics Should be Looked at First?
Troubleshooting vSAN performance
Performance is relative, and sometimes a reported performance issue is only a matter of perception, so it is always good to validate it with actual numbers: compare against a benchmark value, or verify all relevant metrics from before and after the issue was reported. Now assume there is a storage issue in the environment. Given below is a systematic order in which to approach the problem, identify it correctly, isolate it, and finally take the necessary steps to resolve it.
- Infrastructure: Perform vSAN cluster health check
- Virtual machine level: Is there a storage issue observed at the application level?
- Virtual machine level: Is there a storage issue per vmdk level?
  - Latency (vmdk)
  - IOPS (vmdk)
- Cluster level: Look at operations overview at the cluster level
  - Latency
  - IOPS
- Host level: Identify the IO type that has a performance issue
  - Read IO
  - Write IO
- Host level: Collect/ analyze metrics of the storage objects
  - Storage adapter (vmhba)
  - Disk groups
  - Cache disk
  - Capacity disk
- Host level: Collect/ analyze metrics of the network objects
  - Physical adapter (vmnic)
  - vSAN network (vmk)
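Before walking through the checklist above, the "validate it with actual numbers" step can be sketched in a few lines of Python. This is only an illustration: the function name `percent_change` and the latency samples are hypothetical, and in practice the values would come from the metrics you exported for the periods before and after the issue was reported.

```python
from statistics import mean


def percent_change(baseline, current):
    """Return the change of the current mean relative to the baseline mean, in percent."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; percent change is undefined")
    return (mean(current) - base) / base * 100.0


# Hypothetical read-latency samples in milliseconds (not taken from a real cluster).
baseline_ms = [1.0, 1.2, 0.8, 1.0]   # before the issue was reported
reported_ms = [1.5, 1.7, 1.3, 1.5]   # after the issue was reported

change = percent_change(baseline_ms, reported_ms)
print(f"Read latency changed by {change:+.1f}% relative to the baseline")
```

A number like this makes it easier to decide whether the complaint reflects a real regression or just perception before you spend time on the deeper steps.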
At this point, you have a clearly defined workflow for identifying and resolving the issue. So let's have a look at the various vROps dashboards that provide end-to-end visibility of your stack and help you easily identify and isolate the issue. If there is a problem, abnormality or unusual performance behavior in your vSAN environment, vROps will notify you with alerts based on the metric values it monitors, using its inbuilt intelligence and analytics capabilities. Alert generation is based on symptom and alert definitions, and an alert will in turn affect the health, risk or efficiency badge of the respective object. The status of the badges, symptoms, alerts, recommendations, historical performance data and time stamps are all very useful when troubleshooting and help you quickly find the actual problem.
Infrastructure: Perform vSAN cluster health check
As a starting point, you can make use of integrated health checks from vCenter to verify your vSAN infrastructure.
For an in-depth understanding of vSAN health checks, refer to: https://vxplanet.com/2019/01/30/vsan-health-checks-explained-part-1/
Now, to get a high-level overview, let's have a look at the health, risk and efficiency badges of the vSAN cluster in vROps. Please refer to this blog article from VMware to get a detailed understanding of badges.
Health badge
Alerts
Virtual machine level: Is there a storage issue observed at the application level?
You can make use of the application-aware operations feature in vROps 7.5 to get full-stack visibility. Given below is the list of applications that can currently be monitored using vROps 7.5.
Reference to application aware monitoring: https://blogs.vmware.com/management/2019/05/application-aware-operations-with-vrealize-operations-7-5.html
If your application is not supported, or if application-aware monitoring is not configured, you can use the application's native performance counters or methods to identify whether the application itself is affected by storage latency, low IOPS, etc.
Virtual machine level: Is there a storage issue per vmdk level?
As a first step, you can use the "Troubleshoot a VM" dashboard to understand and track resource usage of a virtual machine.
Troubleshoot a VM - a
Troubleshoot a VM - b
Cluster level: Look at operations overview at the cluster level
vSAN operations overview dashboard
Troubleshooting vSAN dashboard
Troubleshooting vSAN - a
Troubleshooting vSAN - b
Troubleshooting vSAN - c
Host level: Identify the IO type that has a performance issue
Host level storage metrics
Host level: Collect/ analyze metrics of the storage objects
Metrics related to a disk group
Read cache and write buffer metrics of a disk group
Performance metrics of a capacity disk
Host level: Collect/ analyze metrics of the network objects
Metrics related to vmnic (physical NIC) and vSAN vmk
Metrics related to network objects will help you determine whether the performance issue is due to resource contention, a network misconfiguration, a hardware issue, etc.
Hope it was useful. Cheers!
Labels: alerts, cache, capacity tier, cluster, datastore, Infrastructure, IOPS, latency, monitoring, storage, symptoms, troubleshooting, VCSA, virtualization, vmk, vmnic, VMware, vRealize Operations Manager, vrops, vSAN
Saturday, July 20, 2019
vRealize Operations Manager 7.5 - Part6 - Adding new symptoms and alert definitions
In my previous post, I briefly explained the alerting aspects of vROps and the overall workflow of the alerting process. In this post, I will explain how to create custom symptom definitions and alert definitions based on a scenario.
Scenario
The user is running some latency-sensitive, business-critical applications on the vSAN cluster. Below are the symptoms that they would like to define. Alerts should be produced for these symptoms, and the alerts should affect the "Efficiency" badge of the vSAN cluster object.
- Warning - when vSAN Cluster Read Latency is greater than 1 ms
- Critical - when vSAN Cluster Read Latency is greater than 2 ms
- Warning - when vSAN Cluster Write Latency is greater than 2 ms
- Critical - when vSAN Cluster Write Latency is greater than 3 ms
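To make the four thresholds concrete, here is a minimal Python sketch of how a latency value maps to a symptom criticality. It only mirrors the numbers from the scenario above; the names `THRESHOLDS_MS` and `classify_latency` are my own for illustration, and this is not how vROps evaluates symptoms internally.

```python
# Thresholds in milliseconds from the scenario: (warning, critical).
THRESHOLDS_MS = {
    "read": (1.0, 2.0),
    "write": (2.0, 3.0),
}


def classify_latency(io_type, latency_ms):
    """Return 'critical', 'warning' or 'ok' for a cluster latency value."""
    warning, critical = THRESHOLDS_MS[io_type]
    # Check the critical threshold first, because a critical value
    # also exceeds the warning threshold.
    if latency_ms > critical:
        return "critical"
    if latency_ms > warning:
        return "warning"
    return "ok"


# Hypothetical sample values, chosen only to exercise each branch.
for io_type, value in [("read", 0.8), ("read", 1.4), ("write", 3.5)]:
    print(io_type, value, classify_latency(io_type, value))
```

Note that the conditions are strictly "greater than", so a read latency of exactly 2 ms is still a warning and not critical.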
Sample screenshot of vSAN environment efficiency badge
Step1: Add symptom definitions
Go to Alerts - Symptom Definitions - Click Add (+)
Select base object type: vSAN Cluster
Select the metric "Read Latency (ms)" and double-click on it twice, so that you can define both the warning and the critical symptom.
Provide symptom definition name, criticality and numeric value as required and click Save.
Now you can see the two symptoms which you have just created.
Similarly, create symptom definitions for vSAN Cluster Write Latency.
All 4 symptom definitions are created now.
Step2: Add alert definitions
Next step is to add alert definitions.
Go to Alerts - Alert Definitions - Click Add (+)
- Click on Alert Impact and select Impact: Efficiency (this means this alert definition will affect the efficiency badge)
- Click Add Symptom Definitions (here you have to search for the symptom definitions that were created earlier and attach to this alert definition)
- Drag both symptom definitions to the right-hand side as shown in the screenshot (make sure to choose "Any" as highlighted below)
- Click Add Recommendations (here I added some sample recommendations) and click save
Similarly, create an alert definition for vSAN Cluster Write Latency alerts.
Now both alert definitions are created.
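The effect of choosing "Any" for the symptom set can be sketched as follows. This is a simplified illustration under my own assumptions: `alert_is_triggered` and the symptom names are hypothetical, and the real evaluation happens inside vROps.

```python
def alert_is_triggered(symptom_states, match="any"):
    """Decide whether an alert fires for a set of symptoms.

    symptom_states maps a symptom name to True (currently active) or False.
    match="any" fires when at least one symptom is active;
    match="all" fires only when every symptom is active.
    """
    states = list(symptom_states.values())
    if match == "any":
        return any(states)
    if match == "all":
        return all(states)
    raise ValueError(f"unknown match mode: {match!r}")


# Hypothetical state: only the warning symptom for write latency is active.
write_symptoms = {
    "write_latency_warning": True,
    "write_latency_critical": False,
}

print(alert_is_triggered(write_symptoms, match="any"))
print(alert_is_triggered(write_symptoms, match="all"))
```

With "Any", the alert fires as soon as the warning symptom alone is active, which is what we want here, since the warning and critical symptoms are two levels of the same metric.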
Let's verify current vSAN cluster Read/ Write latency in the dashboard.
As you can see above, the Cluster I/O Write Latency is 2.67 ms, which is greater than the 2 ms warning threshold we defined. This means a warning alert should be produced and it should also affect the efficiency badge of the vSAN Cluster object. An alert has already been produced for this and can be seen in the second widget, which also shows that the efficiency badge color is now yellow. If you click on the alert, it will provide more details.
If you browse the environment tab, you will also notice that the efficiency badge of the vSAN Cluster has turned yellow.
Please feel free to share if this was useful. Cheers!
Friday, February 15, 2019
Deleting inaccessible objects from vSAN cluster
Scenario:
Resolution:
- Log in to the vSAN VCSA as root
- Run rvc
- Provide a username that has administrator privileges @localhost as shown below
- Change directory to localhost
- Now browse through and set the directory to the respective cluster (in my case the cluster name is "Cluster-Rack7")
- In this case it's not a vswp object, so we have to find the owner node of the object and delete it forcefully from there.
- Run vsan.cmmds_find -u <UUID of object> <path to cluster> to find the owner of the object
- SSH into the owner node and forcefully delete the inaccessible object using /usr/lib/vmware/osfs/bin/objtool delete -u <UUID of inaccessible object> -f -v 10
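Since the delete is forced and a mistyped UUID could target the wrong object, it can help to generate the two commands above from a validated UUID instead of typing them by hand. The Python sketch below only builds the command strings and does not run anything. The helper name `build_commands`, the 8-4-4-4-12 hexadecimal UUID format check, and the sample UUID and cluster path are my own assumptions for illustration.

```python
import re

# Assumption: object UUIDs are written as 8-4-4-4-12 hexadecimal groups.
UUID_PATTERN = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def build_commands(object_uuid, cluster_path):
    """Return (rvc_find_command, objtool_delete_command) for one object UUID."""
    if not UUID_PATTERN.match(object_uuid):
        raise ValueError(f"not a valid object UUID: {object_uuid!r}")
    # Run inside rvc to find the owner node of the object.
    find_owner = f"vsan.cmmds_find -u {object_uuid} {cluster_path}"
    # Run over SSH on the owner node to forcefully delete the object.
    delete_object = (
        f"/usr/lib/vmware/osfs/bin/objtool delete -u {object_uuid} -f -v 10"
    )
    return find_owner, delete_object


# Hypothetical UUID and rvc path; replace both with your own values.
find_cmd, delete_cmd = build_commands(
    "0a1b2c3d-4e5f-6a7b-8c9d-0e1f2a3b4c5d",
    "/localhost/Datacenter/computers/Cluster-Rack7",
)
print(find_cmd)
print(delete_cmd)
```

Review the printed commands against the inaccessible objects reported for your cluster before running them, because a forced delete cannot be undone.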
Hope it was useful. Cheers!