Showing posts with label virtualization. Show all posts

Friday, September 6, 2024

Revisiting Storage Performance Benchmarking

A few years ago, I had the opportunity to explore the intricacies of storage performance benchmarking using tools like FIO, DISKSPD, and Iometer. Those studies provided valuable insights into the performance characteristics of various storage solutions and shaped my approach to storage performance analysis. As I prepare for an upcoming project in this domain, I find it essential to revisit my previous work, reflect on the lessons learned, and share my experiences. This blog post gives an overview of my benchmarking journey and the evolving landscape of storage performance studies.


Recent advancements 

Storage technology has advanced significantly in recent years. The rise of NVMe and storage-class memory has redefined high-end storage performance, offering much lower latency and higher throughput than earlier generations of storage. These advancements highlight the dynamic nature of storage performance benchmarking and underscore the importance of staying updated with the latest tools and methodologies.

Challenges

Benchmarking storage performance is not without its challenges. One of the primary difficulties is ensuring a consistent and controlled testing environment, as variations in hardware, software, and network conditions can significantly impact results. Another challenge is the selection of appropriate benchmarks that accurately reflect real-world workloads, which requires a deep understanding of the specific use cases and performance metrics. Additionally, interpreting the results can be complex, as it involves analyzing multiple metrics such as IOPS, throughput, and latency, and understanding their interplay. These challenges necessitate meticulous planning and a thorough understanding of both the benchmarking tools and the storage systems being tested.
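To make the interplay between these metrics concrete, here is a minimal Python sketch. It relies on two general relations rather than anything specific to the tools mentioned in this post: throughput equals IOPS multiplied by block size, and Little's Law, which ties the average number of outstanding I/Os to IOPS and average latency. The function names and the sample numbers are illustrative assumptions, not measurements.

```python
def throughput_mib_s(iops: float, block_size_kib: float) -> float:
    """Throughput in MiB/s for a given IOPS rate and block size in KiB."""
    return iops * block_size_kib / 1024


def expected_iops(outstanding_ios: float, avg_latency_ms: float) -> float:
    """Little's Law: IOPS = outstanding I/Os / average latency in seconds."""
    if avg_latency_ms <= 0:
        raise ValueError("avg_latency_ms must be positive")
    return outstanding_ios * 1000 / avg_latency_ms


if __name__ == "__main__":
    # Illustrative numbers only: 32 outstanding I/Os at 2 ms average latency.
    iops = expected_iops(outstanding_ios=32, avg_latency_ms=2.0)
    print(f"IOPS: {iops:.0f}")
    print(f"Throughput at 4 KiB blocks: {throughput_mib_s(iops, 4):.1f} MiB/s")
```

This is why a single metric rarely tells the whole story: at a fixed queue depth, higher latency means lower IOPS, and the same IOPS figure translates to very different throughput depending on block size.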

Prior works

Following are some of the articles on storage benchmarking that I’ve published in the past:

Custom storage benchmarking framework

While there are numerous storage benchmarking tools available, such as VMFleet and HCIBench, I wanted to highlight a custom framework I developed a few years ago. Here are some reasons why we created this custom tool:

  • Great learning experience: It provided valuable insights into how things work.
  • Customization: Being a custom framework, it allows you to add or remove features as needed.
  • Flexibility: You can modify multiple parameters to suit your requirements.
  • Custom test profiles: You can create tailored storage test profiles.
  • No IP assignment needed: There’s no need for IP assignment or DHCP for the stress test VMs.
  • Centralized log collection: It offers centralized log collection for detailed analysis.


You can access the scripts and readme on my GitHub repository:

https://github.com/vineethac/vsan_cluster_storage_benchmarking_with_diskspd


Here is an overview.

  • Profile Manifest: All storage test profiles are listed in profile_manifest.psd1. You can define as many profiles as you want.
  • VM Template: A Windows VM template should be present in the vCenter server.
  • Benchmarking Manifest: Details of vCenter, cluster name, VM template, number of stress test VMs per host, etc., are provided in benchmarking_manifest.psd1.
  • Deploy Test VMs: deploy_test_vms.ps1 will deploy all the test VMs with pre-configured parameters.
  • Start Stress Test: start_stress_test.ps1 will initiate the storage stress test process for all the profiles mentioned in profile_manifest.psd1 one by one.
  • Log Collection: All log files will be automatically copied to a central location on the host from where these scripts are running.
  • Cleanup: Use delete_test_vms.ps1 to clean up the stress test VMs from the cluster.
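As an illustration of what a storage test profile boils down to, the sketch below turns a profile into a DISKSPD command line. It is written in Python purely for illustration; the actual framework is PowerShell. The dictionary keys, the sample values, and the target path are hypothetical and do not reflect the real contents of profile_manifest.psd1. The flags used (-b, -d, -t, -o, -w, -r, -Sh, -L, -c) are standard DISKSPD parameters.

```python
def diskspd_command(profile: dict, target: str) -> str:
    """Build a DISKSPD command line from a (hypothetical) test profile."""
    args = [
        "diskspd.exe",
        f"-b{profile['block_size']}",      # block size, e.g. 4K
        f"-d{profile['duration_s']}",      # test duration in seconds
        f"-t{profile['threads']}",         # threads per target
        f"-o{profile['outstanding_io']}",  # outstanding I/Os per thread
        f"-w{profile['write_pct']}",       # percentage of write requests
    ]
    if profile.get("random", False):
        args.append("-r")                  # random I/O instead of sequential
    args += ["-Sh", "-L"]                  # disable caching, collect latency stats
    args.append(f"-c{profile['file_size']}")  # create the test file at this size
    args.append(target)
    return " ".join(args)


if __name__ == "__main__":
    # Hypothetical profile: 4K random I/O, 70% read / 30% write.
    profile = {
        "block_size": "4K",
        "duration_s": 60,
        "threads": 4,
        "outstanding_io": 32,
        "write_pct": 30,
        "random": True,
        "file_size": "10G",
    }
    print(diskspd_command(profile, r"D:\testfile.dat"))
```

Keeping profiles as data and generating the command from them is what makes it easy to add or change test profiles without touching the script logic.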


Note:
 These scripts were created about five years ago, and I haven’t had the opportunity to refactor them according to current best practices and new PowerShell scripting standards. I plan to enhance them in the coming months!

This overview should provide you with a clear understanding of the overall process and workflow involved in the storage benchmarking process. I hope it was useful. Cheers!

Thursday, August 1, 2024

A decade of tech - My professional journey so far

Laying the Groundwork

My professional career commenced in February 2014, as a Trainee IT Services Engineer at Alamy Images. During my initial days, I was tasked with daily maintenance activities such as running tape backups, setting up Active Directory user accounts, mailboxes, and desktops for new employees. I also handled general IT support, troubleshooting various user issues within the organization.

After a few months, I had the opportunity to set up a lab infrastructure project using old decommissioned servers as part of a continuous learning initiative. This hands-on experience involved racking, stacking, and cabling physical servers, installing and configuring ESXi and Hyper-V hypervisors, FreeNAS storage servers, and deploying highly available clusters. Additionally, I gained exposure to configuring L2 network switches. This project significantly contributed to building my IT infrastructure foundation.

A year later, I was promoted to Junior IT Services Engineer, where I focused on virtualization projects. I spearheaded the migration of over 20 Dev/Test/UAT virtual machines from VMware to a Hyper-V cluster, enhancing system flexibility and cost-efficiency. I deployed a high-availability Hyper-V failover cluster in production and contributed to the planning and execution of an iSCSI storage server migration project.

Beyond virtualization, I worked on network infrastructure, completing an L2 switch replacement and upgrade project with minimal operational disruption. I also assisted in capacity planning initiatives to optimize resource utilization in both physical and virtual environments. These experiences refined my technical skills and problem-solving abilities. During this time, I developed a passion for infrastructure management and optimization, shaping my future career path.

From Junior IT Services Engineer to Storage Solutions Engineer

In January 2017, I transitioned to a Systems Development Engineer role at Dell EMC, specializing in Solutions Engineering. This marked a significant career shift as I immersed myself in the world of storage and virtualization solutions integration/development.

My daily responsibilities encompassed the installation and testing of various components, progressing from integration to validating system reliability and performance at scale. I designed and deployed multiple PowerFlex software-defined storage clusters for customer demos and proof-of-concepts, showcasing the product's performance and auto rebuild capabilities. A notable achievement was automating the storage performance benchmarking using PowerShell, FIO, and ELK stack, reducing process time from weeks to days.

I led the engineering efforts for developing a vROps management pack for PowerFlex, ensuring seamless integration and visibility. Additionally, I mastered vSphere Virtual Volumes (vVols), successfully executing integration projects between Dell storage solutions and VMware environments.

To streamline operations, I created a PowerShell module for managing PowerFlex using REST APIs and developed an Ansible playbook for automated deployment of a Kubernetes cluster with the PowerFlex CSI driver. My expertise extended beyond systems engineering and automation as I authored and published whitepapers on disaster recovery using VMware SRM and hardware lifecycle management with Dell OME.

This period solidified my reputation as a virtualization and storage solutions expert, providing me with a deep understanding of storage architecture, performance optimization, and automation. I developed a passion for building scalable and reliable hyperconverged solutions.

From Storage Solutions Engineer to Site Reliability Engineer

In July 2021, I transitioned to a Site Reliability Engineer (SRE) role at VMware, focusing on ensuring the reliability and scalability of a Kubernetes-as-a-Service project based on the vSphere with Tanzu platform.


Managing a vast infrastructure of Kubernetes clusters, I honed my skills in incident response, GitOps pipelines, automation, and monitoring. I played a crucial role in maintaining platform availability, collaborating closely with multiple internal teams and stakeholders to resolve issues and enhance service delivery. My proficiency in Python and PowerShell was instrumental in automating tasks and building custom monitoring solutions. During this time, I prepared diligently, practiced extensively, and successfully qualified for the CKA exam.

Beyond core SRE responsibilities, I explored emerging technologies. I successfully deployed and evaluated open-source language models on Kubernetes using Python, Ollama, and LangChain. In addition, I contributed to developing custom metrics for the Kubernetes-as-a-Service platform using Python, Prometheus, Grafana, and Helm.

This role deepened my expertise and ability to bridge the gap between development and operations, fostering a culture of reliability and efficiency. It has been an exciting journey of learning and growth, positioning me as a versatile IT professional with a strong foundation in both infrastructure and cloud-native technologies.

Gratitude

"This journey has been immensely fulfilling, made possible by the support and encouragement of exceptional organisations, inspiring managers, talented colleagues, friends, and family. I am truly grateful for the opportunities to learn, grow, and contribute meaningfully to driving success and making a positive impact."

The journey continues...

Wednesday, September 18, 2019

vRealize Operations Manager 7.5 - Part7 - vSAN monitoring and troubleshooting

In this article, I will walk you through how to use vROps for vSAN monitoring and performance troubleshooting. It is always recommended to follow a systematic and established approach to troubleshoot problems. Before we start, here is a link to one of my articles that explains the scientific method of troubleshooting.

Given below is some very useful content from VMware on vSAN performance troubleshooting.

Performance Troubleshooting – Understanding the Different Levels of vSAN Performance Metrics
Performance Troubleshooting – Which vSAN Performance Metrics Should be Looked at First?
Troubleshooting vSAN performance

Performance is relative, and a reported performance issue is sometimes a matter of perception, so it is always good to validate it with actual numbers: compare against a benchmark value, or verify all relevant metrics before and after the issue was reported. Now assume there is a storage issue in the environment. Given below is a systematic order in which to approach the problem: identify it correctly, isolate it, and finally take the necessary steps to resolve it.
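Validating against numbers can be as simple as computing the relative change of each metric against a known-good baseline. The Python sketch below shows the idea; the metric names and sample values are made up for illustration and are not taken from any real environment or from vROps itself.

```python
def percent_change(baseline: float, current: float) -> float:
    """Relative change of current versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100


def compare_to_baseline(baseline: dict, current: dict) -> dict:
    """Percent change for every metric present in both samples."""
    return {
        name: round(percent_change(baseline[name], current[name]), 1)
        for name in baseline
        if name in current
    }


if __name__ == "__main__":
    # Illustrative values only.
    baseline = {"read_latency_ms": 1.0, "write_latency_ms": 2.0, "iops": 20000}
    current = {"read_latency_ms": 1.5, "write_latency_ms": 2.0, "iops": 15000}
    for metric, change in compare_to_baseline(baseline, current).items():
        print(f"{metric}: {change:+.1f}%")
```

A comparison like this makes it clear whether the reported slowdown is real and which metric moved, before any deeper troubleshooting starts.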

vSAN performance troubleshooting approach
  1. Infrastructure: Perform vSAN cluster health check
  2. Virtual machine level: Is there a storage issue observed at the application level?
  3. Virtual machine level: Is there a storage issue at the vmdk level?
    1. Latency (vmdk)
    2. IOPS (vmdk)
  4. Cluster level: Look at operations overview at the cluster level
    1. Latency
    2. IOPS
  5. Host level: Identify the IO type that has a performance issue
    1. Read IO
    2. Write IO
  6. Host level: Collect/ analyze metrics of the storage objects
    1. Storage adapter (vmhba)
    2. Disk groups
    3. Cache disk
    4. Capacity disk 
  7. Host level: Collect/ analyze metrics of the network objects
    1. Physical adapter (vmnic)
    2. vSAN network (vmk)

At this point, you have a clearly defined workflow for identifying and resolving the issue. So let's have a look at the various vROps dashboards that provide end-to-end visibility of your stack and help you easily identify and isolate the issue. If there is a problem, abnormality, or unusual performance behavior in your vSAN environment, vROps will notify you with alerts based on the metric values it monitors, using its inbuilt intelligence and analytics capabilities. Alert generation is based on symptom and alert definitions, and an alert ultimately affects the health, risk, or efficiency badge of the respective object. The status of the badges, symptoms, alerts, recommendations, historical performance data, and timestamps is very useful for troubleshooting and quickly finding the actual problem.

Infrastructure: Perform vSAN cluster health check

As a starting point, you can make use of integrated health checks from vCenter to verify your vSAN infrastructure.


To understand in-depth about vSAN health checks refer: https://vxplanet.com/2019/01/30/vsan-health-checks-explained-part-1/

Now to get a high-level overview, let's have a look into the health, risk and efficiency badges of vSAN cluster in vROps. Please refer to this blog article from VMware to get a detailed understanding of badges.

Health badge


Risk badge


Alerts


Virtual machine level: Is there a storage issue observed at the application level?

You can make use of the application-aware operations feature in vROps 7.5 to get full stack visibility. Given below is the list of applications that can currently be monitored using vROps 7.5.


Reference to application aware monitoring: https://blogs.vmware.com/management/2019/05/application-aware-operations-with-vrealize-operations-7-5.html


If your application is not supported, or if application-aware monitoring is not configured, you can use the application's native performance counters or methods to identify whether the application itself is affected by storage latency, low IOPS, etc.

Virtual machine level: Is there a storage issue per vmdk level?

As a first step, you can use the "Troubleshoot a VM" dashboard to understand and track resource usage of a virtual machine.

Troubleshoot a VM - a

Troubleshoot a VM - b

Select the VM object to get more details. The screenshot below shows metrics related to a virtual disk.


Cluster level: Look at operations overview at the cluster level

vSAN operations overview dashboard


Troubleshooting vSAN dashboard

Troubleshooting vSAN - a

Troubleshooting vSAN - b

Troubleshooting vSAN - c

Host level: Identify the IO type that has a performance issue

Host level storage metrics


Host level: Collect/ analyze metrics of the storage objects

Metrics related to a disk group


Read cache and write buffer metrics of a disk group


Performance metrics of a capacity disk


Host level: Collect/ analyze metrics of the network objects

Metrics related to vmnic (physical NIC) and vSAN vmk


Metrics related to network objects will help to determine whether the performance issue is due to resource contention, network misconfiguration, hardware issue, etc.  


Wednesday, August 7, 2019

VMware PowerCLI 101 - Part4 - Snapshots

In this post, I will briefly explain how to make use of PowerCLI when working with virtual machine snapshots.

Take a snapshot of VM:
New-Snapshot -VM "New Virtual Machine" -Name snap1 -Description try1

Revert to a snapshot:
$snap = Get-Snapshot -VM "New Virtual Machine" -Name "snap1"
Set-VM -VM "New Virtual Machine" -Snapshot $snap -WhatIf
Set-VM -VM "New Virtual Machine" -Snapshot $snap 


Delete specific snapshot of a VM:
$snap = Get-Snapshot -VM "New Virtual Machine" -Name "snap1"
Remove-Snapshot -Snapshot $snap -WhatIf
Remove-Snapshot -Snapshot $snap 

Delete all snapshots of a VM:
Get-VM "New Virtual Machine" | Get-Snapshot | Remove-Snapshot -WhatIf
Get-VM "New Virtual Machine" | Get-Snapshot | Remove-Snapshot 

List all VMs with snapshots:
Get-VM | Get-Snapshot | Select-Object VM, Name, Description, SizeGB

List VMs with snapshots older than a week:
Get-VM | Get-Snapshot | Where {$PSItem.Created -lt (Get-Date).AddDays(-7)} | Select-Object VM, Name, Description, Created, SizeGB | Format-Table

Find the parent-child relationship of VM snapshots:
$vm = Get-VM "New Virtual Machine"
$vm | Get-Snapshot | Select-Object VM, Name, Description, Parent, Children, SizeGB, IsCurrent, Created, Id | Sort-Object Created | Format-Table



Hope it was useful. Cheers!


Saturday, July 20, 2019

vRealize Operations Manager 7.5 - Part6 - Adding new symptoms and alert definitions

In my previous post, I briefly explained the alerting aspects of vROps and the overall workflow of the alerting process. In this post, I will explain how to create custom symptom definitions and alert definitions based on a scenario.

Scenario


The user is running latency-sensitive, business-critical applications on the vSAN cluster. Below are the symptoms the user would like to define. Alerts should be generated for them, and they should affect the "Efficiency" badge of the vSAN cluster object.

  1. Warning - when vSAN Cluster Read Latency is greater than 1 ms
  2. Critical - when vSAN Cluster Read Latency is greater than 2 ms
  3. Warning - when vSAN Cluster Write Latency is greater than 2 ms
  4. Critical - when vSAN Cluster Write Latency is greater than 3 ms 
Sample screenshot of vSAN environment efficiency badge 

Step1: Add symptom definitions


Go to Alerts - Symptom Definitions - Click Add (+)


Select base object type: vSAN Cluster
Select the metric "Read Latency (ms)" and double-click it twice so that you can define both warning and critical symptoms.


Provide symptom definition name, criticality and numeric value as required and click Save.


Now you can see the two symptoms which you have just created.


Similarly, create symptom definitions for vSAN Cluster Write Latency.


All 4 symptom definitions are created now.


Step2: Add alert definitions


Next step is to add alert definitions.

Go to Alerts - Alert Definitions - Click Add (+)

  • Provide a name and description.
  • Click on Base Object Type and select "vSAN Cluster"

  • Click on Alert Impact and select Impact: Efficiency (this means this alert definition will affect the efficiency badge)

  • Click Add Symptom Definitions (here you have to search for the symptom definitions that were created earlier and attach to this alert definition)

  • Drag both symptom definitions to the right-hand side as shown in the screenshot (make sure to choose "Any" as highlighted below)

  • Click Add Recommendations (here I added some sample recommendations) and click save

Similarly, create an alert definition for vSAN Cluster Write Latency alerts.


Now both alert definitions are created.


Let's verify current vSAN cluster Read/ Write latency in the dashboard.


As you can see above, Cluster I/O Write Latency is 2.67 ms, which is greater than the warning threshold we defined. This means a warning alert should be produced and should also affect the efficiency badge of the vSAN Cluster object. An alert has already been produced for this and can be seen in the second widget. It also shows that the efficiency badge color is now yellow. If you click on the alert, it will provide more details.
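The threshold logic behind these symptoms can be sketched in a few lines of Python. This only mirrors the four thresholds from the scenario in this post; it is not how vROps evaluates symptoms internally, and the names used here are illustrative.

```python
# Thresholds in ms from the scenario in this post: (warning, critical).
THRESHOLDS_MS = {
    "read": (1.0, 2.0),
    "write": (2.0, 3.0),
}


def classify_latency(io_type: str, latency_ms: float) -> str:
    """Return 'critical', 'warning', or 'ok' for a cluster latency value."""
    warning, critical = THRESHOLDS_MS[io_type]
    if latency_ms > critical:
        return "critical"
    if latency_ms > warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # The write latency observed in the dashboard in this post.
    print(classify_latency("write", 2.67))
```

With a write latency of 2.67 ms, the value is above the 2 ms warning threshold but below the 3 ms critical threshold, which matches the warning alert and the yellow efficiency badge.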


If you browse the environment tab, you can also see that the efficiency badge of the vSAN Cluster has turned yellow.


Please feel free to share if this was useful. Cheers!


Tuesday, July 9, 2019

vRealize Operations Manager 7.5 - Part5 - Alerting

Alerting is a very important aspect of infrastructure monitoring. vROps has very powerful alerting capabilities, although they might look a bit complicated at first. A good understanding of symptoms, alert definitions, badges, notification rules, etc. is required to make full use of the capabilities of vROps. In this post, I will try to explain all these alert settings. Before getting into the configurations, let's first have a look at objects and object types in vROps.

Objects and object types


As you can see in the screenshot above there are many objects of type "Datastore". All these objects/ object types have "Metrics" and "Properties". Click on the "Show Detail" icon as shown below to view more details of the selected object.


Metrics and Properties

"Metrics" and "Properties" of the object "vol03" are shown below.


Symptom definitions and alert definitions

Based on the value of a metric or property, you can define symptoms with a criticality of Info, Warning, Critical, etc. Alert definitions are created from symptom definitions; when the symptoms are triggered, the corresponding alerts are generated and directly affect the badges associated with the object. Now let's have a look at some of the symptom definitions and alert definitions that are pre-defined in vROps for the "vCenter Adapter". Here I am taking the object type "Datastore" as an example.


Examples: Symptom definition

Select a symptom definition and click edit.


As you can see, this symptom definition produces a warning alert when any datastore "capacity used %" is greater than 90.


Another symptom definition is given below where an info alert will be generated when space remaining on the datastore is 0.


Example: Alert definition

As shown in the screenshot below, there are a few pre-defined alert definitions for the object type "Datastore".


Let's select one alert definition.


Now, if you would like to forward these alerts to an email address or an SNMP trap destination, you will need to configure two things.
  1. Outbound instance
  2. Alert notification rules

Add outbound instance

I will be configuring an SNMP trap destination. Go to Administration - Management - Outbound settings - click on the + icon. Provide necessary details and test to ensure the connection is successful.


Add alert notification rule

By default, no notification rules are available in vROps. Users have to create new rules as required. Go to Alerts - Alert settings - Notification settings - click on the + icon and provide the necessary configuration details. As an example, I will configure an alert notification rule to forward all datastore-related alerts to an SNMP trap destination.


The above rule will forward all alerts that impact the health, risk, or efficiency badges of datastore objects to the configured SNMP trap destination.

Summary of the alerting process in vROps