
Saturday, April 18, 2020

vSAN performance benchmarking

In this article, I will briefly explain performance benchmarking considerations, the factors affecting performance, and some best practices. We do performance benchmarking to understand the capabilities and bottlenecks of a system. The system could be a storage system, CPU, GPU, network switch, etc. Now let's consider a VMware vSAN cluster infrastructure. It includes multiple components, and each of them contributes to performance. In this case, the vSAN cluster is the solution under test, and we have to conduct performance benchmarking to understand the storage performance behavior of the cluster. Storage behavior includes the IOPS, latency, and throughput that the cluster can deliver under varying loads.

The goals of benchmarking
  • Identify bottlenecks
    • Hardware bottleneck
    • Software bottleneck
    • Application bottleneck
  • Compare tradeoffs
  • Manage expectations
  • Make decisions

In a real-world scenario, benchmarking is usually done once the cluster is deployed and ready, and before production workloads are hosted on it. As these benchmark values define the performance maximums of the cluster, they help you decide when to scale or upgrade it before it hits a bottleneck.

Fundamental factors of vSAN performance

Server hardware
  • Compatibility as per vSAN HCL
Host
  • Number of hosts in the cluster
  • Power settings
  • CPU - number of cores and frequency 
Storage
  • Hybrid or All-flash
  • NVMe, SAS, or SATA
  • Number of disk groups per host
  • Storage controller configuration
  • Compatibility of hardware devices as per vSAN HCL
Network
  • 10/ 25/ 40 GbE
  • MTU 
  • LAG
SPBM policy
  • FTT (Failures To Tolerate)
  • FTM (Mirroring/ Erasure coding)
  • Thin or Thick provision
Security
  • Encryption
  • Checksum
Other
  • Stripe width
  • Flash read cache reservation
  • IOPS limit for object
All of the above factors affect performance, so you should understand the benefits and tradeoffs of each.
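Many of these settings can be sanity-checked from an ESXi shell before you start benchmarking. A minimal sketch (the vmk interface name and target IP are placeholders for your environment):

    # List the VMkernel interfaces tagged for vSAN traffic
    esxcli vsan network list

    # List the cache and capacity devices claimed by vSAN on this host
    esxcli vsan storage list

    # If MTU 9000 is configured, verify jumbo frames end to end
    # (8972 = 9000 minus IP/ICMP header overhead)
    vmkping -I vmk2 -s 8972 -d 192.168.10.12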

Benchmarking methodology

[Figure: vSAN benchmarking methodology. Image credit: VMware]

Storage benchmarking tools

IO load generation tools (e.g., HCIBench, FIO, Iometer, Vdbench)
Application-specific tools
  • HammerDB (MSSQL, Oracle)
  • Jetstress (MS Exchange)
  • SLOB (Oracle)
  • DBGen (MSSQL, Oracle)

Best practices

  • Understand the production performance metrics.
  • Test what you plan to deploy.
  • Workload modeling.
  • Plan for use case testing.
  • Choose an appropriate size for benchmarking.
  • Choose the right tool.
  • Pre-allocate blocks while testing.
  • Test for a longer time duration.
  • Deploy multiple VMs with multiple VMDKs (a sample load-generation job follows this list).
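To illustrate a few of these practices (steady-state testing over a long duration, avoiding first-write allocation effects), here is a minimal fio job you could run inside a test VM against a dedicated test disk. This is a sketch, not a tuned profile; the device path, block size, IO mix, and runtime are all placeholders:

    # 70/30 random read/write at 4 KiB blocks, 8 outstanding IOs
    # --direct=1 bypasses the guest page cache so the IO reaches vSAN
    # --time_based with a long --runtime lets caches reach steady state
    fio --name=vsan-oltp --filename=/dev/sdb --rw=randrw --rwmixread=70 \
        --bs=4k --iodepth=8 --ioengine=libaio --direct=1 \
        --runtime=1800 --time_based --group_reporting

Run the same job concurrently from multiple VMs with multiple VMDKs to load the whole cluster rather than a single host or disk group.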

Wednesday, September 18, 2019

vRealize Operations Manager 7.5 - Part7 - vSAN monitoring and troubleshooting

In this article, I will walk you through how to use vROps for vSAN monitoring and performance troubleshooting. It is always recommended to follow a systematic and established approach to troubleshooting problems. Before we start, here is a link to one of my articles which explains the scientific method of troubleshooting.

Given below is some very useful content from VMware on vSAN performance troubleshooting.

Performance Troubleshooting – Understanding the Different Levels of vSAN Performance Metrics
Performance Troubleshooting – Which vSAN Performance Metrics Should be Looked at First?
Troubleshooting vSAN performance

Performance is relative, and sometimes performance issues are a matter of perception. So it is always good to validate with actual numbers: compare with a benchmark value, or verify all relevant metrics from before and after the issue was reported. Now assume there is a storage issue in the environment. Given below is a systematic order to approach the problem, identify it correctly, isolate it, and finally take the necessary steps to resolve it.

vSAN performance troubleshooting approach
  1. Infrastructure: Perform vSAN cluster health check
  2. Virtual machine level: Is there a storage issue observed at the application level?
  3. Virtual machine level: Is there a storage issue at the per-vmdk level?
    1. Latency (vmdk)
    2. IOPS (vmdk)
  4. Cluster level: Look at operations overview at the cluster level
    1. Latency
    2. IOPS
  5. Host level: Identify the IO type that has a performance issue
    1. Read IO
    2. Write IO
  6. Host level: Collect/ analyze metrics of the storage objects
    1. Storage adapter (vmhba)
    2. Disk groups
    3. Cache disk
    4. Capacity disk 
  7. Host level: Collect/ analyze metrics of the network objects
    1. Physical adapter (vmnic)
    2. vSAN network (vmk)
At this point, you have a clearly defined workflow for identifying and resolving the issue. So let's have a look at the various vROps dashboards that provide end-to-end visibility of your stack and help you easily identify and isolate the issue. If there is a problem, abnormality, or unusual performance behavior in your vSAN environment, vROps will notify you with alerts based on the various metric values it monitors, using its built-in intelligence and analytics capabilities. Alert generation is based on symptom and alert definitions, and this will finally affect the health, risk, or efficiency badge of the respective object. The status of the badges, symptoms, alerts, recommendations, historical performance data, and timestamps will be very useful in troubleshooting and quickly finding the actual problem.

Infrastructure: Perform vSAN cluster health check

As a starting point, you can make use of integrated health checks from vCenter to verify your vSAN infrastructure.
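The same health checks can also be queried from an ESXi shell, which is handy when vCenter is unavailable. A quick sketch (check names can vary slightly between vSAN versions):

    # Summary of all vSAN health checks as seen by this host
    esxcli vsan health cluster list

    # Drill into a single check by its name from the list above
    esxcli vsan health cluster get -t "<check name>"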


To understand in-depth about vSAN health checks refer: https://vxplanet.com/2019/01/30/vsan-health-checks-explained-part-1/

Now, to get a high-level overview, let's have a look at the health, risk, and efficiency badges of the vSAN cluster in vROps. Please refer to this blog article from VMware for a detailed understanding of badges.

Health badge


Risk badge


Alerts


Virtual machine level: Is there a storage issue observed at the application level?

You can make use of the application-aware operations feature in vROps 7.5 to get full-stack visibility. Given below is the list of applications that can currently be monitored using vROps 7.5.


Reference to application aware monitoring: https://blogs.vmware.com/management/2019/05/application-aware-operations-with-vrealize-operations-7-5.html


If your application is not supported, or if application-aware monitoring is not configured, then you can use the application's native performance counters/ methods to identify whether the application itself is observing or affected by storage latency, low IOPS, etc.
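For example, on a Linux guest you could watch per-disk IOPS and latency with iostat from the sysstat package (a generic sketch; the interval is arbitrary):

    # Extended device stats every 5 seconds; watch the reads/writes
    # per second and the await (latency) columns per device
    iostat -x 5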

Virtual machine level: Is there a storage issue at the per-vmdk level?

As a first step, you can use the "Troubleshoot a VM" dashboard to understand and track resource usage of a virtual machine.

Troubleshoot a VM - a

Troubleshoot a VM - b

Select the VM object to get more details. The screenshot below shows metrics related to a virtual disk.


Cluster level: Look at operations overview at the cluster level

vSAN operations overview dashboard


Troubleshooting vSAN dashboard

Troubleshooting vSAN - a

Troubleshooting vSAN - b

Troubleshooting vSAN - c

Host level: Identify the IO type that has a performance issue

Host level storage metrics


Host level: Collect/ analyze metrics of the storage objects

Metrics related to a disk group


Read cache and write buffer metrics of a disk group


Performance metrics of a capacity disk


Host level: Collect/ analyze metrics of the network objects

Metrics related to vmnic (physical NIC) and vSAN vmk


Metrics related to network objects will help determine whether the performance issue is due to resource contention, network misconfiguration, a hardware issue, etc.
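On the ESXi side, you can cross-check the same network and storage objects live with esxtop from a host shell:

    esxtop    # press 'n' for network, 'd' for disk adapters, 'u' for disk devices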


Friday, April 12, 2019

VMware VVols: Integrating Dell EMC Unity with vSphere environment

This article provides the step-by-step procedure to configure a Virtual Volume (VVol) based vSphere environment using Dell EMC Unity SAN storage. You can go through my previous post to get an understanding of the differences between VMFS and VVol.

Step 1: Register a new storage provider in vCenter

Note: Registering a storage provider exposes all the array capabilities to vCenter through VASA API.

Select: vCenter server -> Configure -> Storage providers -> Register a new storage provider (+)

Register new storage provider in vCenter

After successful registration of storage provider


Step 2: Create a VVol datastore on the storage array (Unity)

Create VVol datastore (storage container)

Provide a name

Select capability profiles for the VVol datastore

Note: Each pool has its own characteristics and is associated with a specific capability profile. Adding capability profiles to a VVol datastore basically adds the corresponding pools to that VVol construct. In the above figure, we have added 3 capability profiles, which means Pool_01/02/03 are now part of VVOL_Datastore_01.

Configure access to ESXi hosts

Summary

VVol datastore (storage container) creation completed

The storage container (VVOL datastore) has been created on the array and the next step is to add it to ESXi hosts.

Step 3: Add a new datastore in the vSphere environment through vCenter

Add new VVol datastore

Select the VVol datastore and provide a name

VVol datastore creation complete

VVol datastore

Step 4: Create VM storage policies

Select VM Storage Policies

Create VM storage policy

Select vCenter and provide a name



Storage type and rule

Select service level

Select the compatible storage


Now the storage policy (Platinum) has been created. Similarly, I have created Silver and Bronze policies which are shown below.

Sample storage policies

Step 5: Migrate virtual machines to the VVol datastore

Migrate VM01 to VVol datastore

While migrating the VM, you can choose the disk format and the VM storage policy, and it will display the compatible VVol datastores.


Select compatible storage


If you would like to apply storage policies at the disk level, you can edit the VM storage policy settings of the VM and apply a policy per VMDK as required, as shown below.


Select VM storage policy per VMDK
Hope this was helpful. Cheers!

Friday, March 15, 2019

VMFS vs VVOL

Let's start with a quick comparison.

VMFS
  • LUN-centric approach
  • Pre-provisioning of LUNs
  • Use of multiple datastores for different performance capabilities
  • Management difficulties, as a single VM may span multiple datastores

VVOL
  • VM-centric
  • vSphere is now aware of array capabilities through the VASA provider
  • No pre-provisioning
  • One VVol datastore can represent the whole array
  • Storage policies can be applied at the per-vmdk level
  • Some vSphere operations, such as full clones and snapshots, are offloaded to the array
  • VVol snapshots are faster (different from traditional snapshots with redo logs)

Note: The explanations below regarding the SAN are based on Dell EMC Unity (a hybrid array).

VMFS environment

A traditional VMFS datastore setup is given below. Say you have two VMs: one with 3 virtual disks and the second with 2 virtual disks. Your array has 4 storage pools with specific capabilities, and based on those capabilities the pools are classified into 4 service levels (Platinum, Gold, Silver, and Bronze). In this case, you need to provision 4 datastores/ LUNs from the respective pools to meet the requirements of the two virtual machines.

VMFS

You can clearly see that the first VM spans 3 datastores and the second VM spans 2 datastores. If your environment has hundreds or thousands of VMs, management becomes too complicated. In this case, as the datastores are properly named, the vSphere admin can easily identify the service level/ capability of a specific datastore. But if a naming convention is not followed, the vSphere admin has to contact the storage admin to learn the capabilities of that specific datastore/ LUN. Again, lots of communication is needed, making it a complex process!

VVOL environment

Now let's have a look at the VVol environment, which is shown in the figure below. You have the same array, but instead of provisioning datastores/ LUNs from the respective pools, here we create one VVol datastore. The array has 4 storage pools, each having specific capabilities like drive type, RAID level, tiering policy, FAST Cache ON/ OFF, etc., and they are classified into 4 service levels: Platinum, Gold, Silver, and Bronze. So each pool has its own capability profile. You can provide any name, but here I just gave each capability profile the same name as the service level of its pool.

Pool_01 -> Service level: Platinum -> Capability profile: Platinum
Pool_02 -> Service level: Silver   -> Capability profile: Silver
Pool_03 -> Service level: Bronze   -> Capability profile: Bronze
Pool_04 -> Service level: Gold     -> Capability profile: Gold


Note: A VVol datastore cannot be created without a capability profile.

Next steps are:

  • Create the VVol datastore in the SAN
  • Provide a name for the VVol datastore
  • Add the capability profiles that need to be part of the VVol datastore
  • In this case, all 4 capability profiles are part of the VVol datastore
  • You can also limit the amount of space the VVol datastore can consume from each capability profile
  • Once done, the VVol datastore is created in the SAN
VVOL

At this point, you can go ahead and configure host access to this VVol datastore. Now you have to let your vSphere environment know about the capabilities of the storage array. That is done by registering a new storage provider on your vCenter Server, making use of the VASA (vSphere APIs for Storage Awareness) provider. Once registered, the vSphere environment communicates with the VASA provider through the array management interface (OOB network), which forms the control path. The next step is creating the VVol datastore on the ESXi hosts that were given access earlier. For each VVol datastore created on the storage array, two protocol endpoints (PEs) are automatically generated to communicate with an ESXi host, forming the data path. If you create another VVol datastore on the array and provide access to the same ESXi host, two more PEs will be created. PEs act as targets, and on the ESXi side, if you look at the storage devices, you can see 2 proxy LUNs which connect to the respective PEs.
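If you want to verify the protocol endpoints from the ESXi side, here is a sketch (I am assuming the --pe-only filter is available in your ESXi build; check esxcli storage core device list --help if it is not):

    # List only the protocol endpoints (the proxy LUNs) seen by this host
    esxcli storage core device list --pe-only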

Now you have to create VM storage policies based on the service levels. Here you have 4 service levels, so you have to create 4 storage policies. Assign a storage policy per VMDK as per requirements.

E.g.: Storage_Policy_Gold -> VMDK 02 (second VM) -> it will be placed in Pool_04 automatically

So in this case, instead of having 4 VMFS datastores, we needed only 1 VVol datastore which has all the required capabilities. This means you don't need to provision more LUNs from the SAN. Storage management becomes easy with just one VVol datastore, and there is more granularity with SPBM at the individual VMDK level. With VVols, the SAN is aware of all the VMs and their corresponding files hosted on it. This makes space reclamation very easy and straightforward: the moment a VM or a VMDK is deleted, that space is immediately freed, as the SAN has complete insight into the virtual machines stored on it. Data mobility between different storage pools based on service level becomes effortless, as it is handled directly by the SAN internally based on SPBM. Altogether, VVols simplify storage/ LUN management.

Hope it was useful. Cheers!

Friday, February 15, 2019

Deleting inaccessible objects from vSAN cluster

Scenario:

The vSAN health check reports inaccessible objects in the cluster, and they need to be removed.

Resolution:

  • Log in to the vSAN VCSA as root.
  • Run rvc.
  • Provide a username that has administrator privileges @localhost as shown below.
  • Change directory to localhost.
  • Now browse through and set the directory to the respective cluster (in my case, the cluster name is "Cluster-Rack7").
  • vsan.check_state -r <path to your cluster>
  • You can purge inaccessible vswp objects using: vsan.purge_inaccessible_vswp_objects <path to cluster>
  • In this case, it is not a vswp object, so we have to find the owner node of the object and delete it forcefully from there.
  • Use vsan.cmmds_find -u <UUID of object> <path to cluster> to find the owner of the object.
  • SSH into the owner node and forcefully delete the inaccessible object using: /usr/lib/vmware/osfs/bin/objtool delete -u <UUID of inaccessible object> -f -v 10 (see the example session below).
  • Perform a vSAN health check to verify the status.
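
Putting it together, an example session might look like this (the datacenter name, cluster name, and object UUID below are placeholders for your environment):

    # On the VCSA shell; log in to the local vCenter with an administrator account
    rvc administrator@vsphere.local@localhost

    # Inside RVC: navigate to the cluster and run the checks
    cd localhost/DC01/computers/Cluster-Rack7
    vsan.check_state -r .
    vsan.purge_inaccessible_vswp_objects .
    vsan.cmmds_find -u 1d9eb25a-xxxx-xxxx-xxxx-xxxxxxxxxxxx .

    # On the owner ESXi host over SSH:
    /usr/lib/vmware/osfs/bin/objtool delete -u 1d9eb25a-xxxx-xxxx-xxxx-xxxxxxxxxxxx -f -v 10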

Hope it was useful. Cheers!
