Showing posts with label virtual machine. Show all posts
Showing posts with label virtual machine. Show all posts

Friday, September 6, 2024

Revisiting Storage Performance Benchmarking

Few years ago, I had the opportunity to explore the intricacies of storage performance benchmarking using tools like FIO, DISKSPD, and Iometer. Those studies provided valuable insights into the performance characteristics of various storage solutions, shaping my understanding and approach to storage performance analysis. As I prepare for an upcoming project in this domain, I find it essential to revisit my previous work, reflect on the lessons learned, and share my experiences. This blog post aims to provide a comprehensive overview of my benchmarking journey and the evolving landscape of storage performance studies.


Recent advancements 

The field of storage technology has seen significant advancements in recent years. The rise of NVMe and storage-class memory technologies has also redefined high-end storage performance, offering unprecedented speed and efficiency. These advancements highlight the dynamic nature of storage performance benchmarking and underscore the importance of staying updated with the latest tools and methodologies.

Challenges

Benchmarking storage performance is not without its challenges. One of the primary difficulties is ensuring a consistent and controlled testing environment, as variations in hardware, software, and network conditions can significantly impact results. Another challenge is the selection of appropriate benchmarks that accurately reflect real-world workloads, which requires a deep understanding of the specific use cases and performance metrics. Additionally, interpreting the results can be complex, as it involves analyzing multiple metrics such as IOPS, throughput, and latency, and understanding their interplay. These challenges necessitate meticulous planning and a thorough understanding of both the benchmarking tools and the storage systems being tested.

Prior works

Following are some of the articles on storage benchmarking that I’ve published in the past:

Custom storage benchmarking framework

While there are numerous storage benchmarking tools available, such as VMFleet and HCIBench, I wanted to highlight a custom framework I developed a few years ago. Here are some reasons why we created this custom tool:

  • Great learning experience: It provided valuable insights into how things work.
  • Customization: Being a custom framework, it allows you to add or remove features as needed.
  • Flexibility: You can modify multiple parameters to suit your requirements.
  • Custom test profiles: You can create tailored storage test profiles.
  • No IP assignment needed: There’s no need for IP assignment or DHCP for the stress test VMs.
  • Centralized log collection: It offers centralized log collection for detailed analysis.


You can access the scripts and readme on my GitHub repository:

https://github.com/vineethac/vsan_cluster_storage_benchmarking_with_diskspd


Here is an overview.

  • Profile Manifest: All storage test profiles are listed in profile_manifest.psd1. You can define as many profiles as you want.
  • VM Template: A Windows VM template should be present in the vCenter server.
  • Benchmarking Manifest: Details of vCenter, cluster name, VM template, number of stress test VMs per host, etc., are provided in benchmarking_manifest.psd1.
  • Deploy Test VMs: deploy_test_vms.ps1 will deploy all the test VMs with pre-configured parameters.
  • Start Stress Test: start_stress_test.ps1 will initiate the storage stress test process for all the profiles mentioned in profile_manifest.psd1 one by one.
  • Log Collection: All log files will be automatically copied to a central location on the host from where these scripts are running.
  • Cleanup: Use delete_test_vms.ps1 to clean up the stress test VMs from the cluster.


Note:
 These scripts were created about five years ago, and I haven’t had the opportunity to refactor them according to current best practices and new PowerShell scripting standards. I plan to enhance them in the coming months!

This overview should provide you with a clear understanding of the overall process and workflow involved in the storage benchmarking process. I hope it was useful. Cheers!

Sunday, September 11, 2022

vSphere with Tanzu using NSX-T - Part19 - Troubleshooting TKC stuck at creating phase

This article provides basic troubleshooting steps for TKCs (Tanzu Kubernetes Cluster) stuck at creating phase.

Verify status of the TKC

  • Use the following commands to verify the TKC status.
kubectl get tkc -n <supervisor_namespace>
kubectl get tkc -n <supervisor_namespace> -o json
kubectl describe tkc <tkc_name> -n <supervisor_namespace>
kubectl get cluster-api -n <supervisor_namespace>
kubectl get vm,machine,wcpmachine -n <supervisor_namespace> 

Cluster health

  • Verify health of the supervisor cluster.
❯ kubectl get node
NAME STATUS ROLES AGE VERSION
4201a7b2667b0f3b021efcf7c9d1726b Ready control-plane,master 86d v1.22.6+vmware.wcp.2
4201bead67e21a8813415642267cd54a Ready control-plane,master 86d v1.22.6+vmware.wcp.2
4201e0e8e29b0ddb4b59d3165dd40941 Ready control-plane,master 86d v1.22.6+vmware.wcp.2
wxx-08-r02esx13.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx14.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx15.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx16.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx17.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx18.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx19.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx20.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx21.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx22.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx23.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46
wxx-08-r02esx24.xxxxxyyyy.com Ready agent 85d v1.22.6-sph-db56d46

❯ kubectl get --raw '/healthz?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check passed 

Terminating namespaces

  • Check for namespaces stuck at terminating phase. If there are any, properly clean them up by removing all child objects. 
  • You can use this kubectl get-all plugin to see all resources under a namespace.  Then clean them up properly. Mostly you need to set finalizers of remaining child resources to null. Following is a sample case where 2 PVCs where stuck at terminating and they were cleaned up by setting its finalizers to null.
❯ kg ns | grep Terminating
rgettam-gettam Terminating 226d

❯ k get-all -n rgettam-gettam
NAME NAMESPACE AGE
persistentvolumeclaim/58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-c8c0c111-e480-4df4-baf8-d140d0237e1d rgettam-gettam 86d
persistentvolumeclaim/58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-e5c99b7e-1397-4a9d-b38c-53a25cab6c3f rgettam-gettam 86d

❯ kg pvc -n rgettam-gettam
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-c8c0c111-e480-4df4-baf8-d140d0237e1d Terminating pvc-bd4252fb-bfed-4ef3-ab5a-43718f9cbed5 8Gi RWO sxx-01-vcxx-wcp-mgmt 86d
58ef0d27-ba66-4f4e-b4d7-43bd1c4fb833-e5c99b7e-1397-4a9d-b38c-53a25cab6c3f Terminating pvc-8bc9daa1-21cf-4af2-973e-af28d66a7f5e 30Gi RWO sxx-01-vcxx-wcp-mgmt 86d

❯ kg pvc -n rgettam-gettam --no-headers | awk '{print $1}' | xargs -I{} kubectl patch -n rgettam-gettam pvc {} -p '{"metadata":{"finalizers": null}}'
  • You can also do kubectl get namespace <namespace> -oyaml and the status section will show if there are resources/ content to be deleted or any finalizers remaining.
  • Verify vmop-controller pod logs, and restart them if required.

IP_BLOCK_EXHAUSTED

  • Check CIDR usage of the supervisor cluster.
❯ kg clusternetworkinfos
NAME                                                AGE
domain-c1006-06046c54-c9e5-41aa-bc2c-52d72c05bce4   160d

❯ kg clusternetworkinfos domain-c1006-06046c54-c9e5-41aa-bc2c-52d72c05bce4 -o json | jq .usage
{
  "egressCIDRUsage": {
    "allocated": 33,
    "total": 1024
  },
  "ingressCIDRUsage": {
    "allocated": 42,
    "total": 1024
  },
  "subnetCIDRUsage": {
    "allocated": 832,
    "total": 1024
  }
} 
  • When the IP blocks of supervisor cluster are exhausted, you will find the following warning when you describe the TKC.
 Conditions:
    Last Transition Time:  2022-10-05T18:34:35Z
    Message:               Cannot realize subnet
    Reason:                ClusterNetworkProvisionFailed
    Severity:              Warning
    Status:                False
    Type:                  Ready 
  • Also when you check the namespace, you can see the following ncp error IP_BLOCK_EXHAUSTED.
 ❯ kg ns tsql-integration-test -oyaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    calaxxxx.xxxyy.com/xxxrole-created: "1"
    ncp/error: IP_BLOCK_EXHAUSTED
    ncp/router_id: t1_d0a2af0f-8430-4250-9fcf-807a4afe51aa_rtr
    vmware-system-resource-pool: resgroup-307480
    vmware-system-vm-folder: group-v307481
  creationTimestamp: "2022-10-05T17:35:18Z"

Notes:

  • If the subnetCIDRUsage IP block is exhausted, you may need to remove some old/ unused namespaces, and that will release some IPs. If that is not possible, you may need to consider adding new subnet.
  • After removing the old/ unused namespaces, and even if IPs are available, sometimes the TKCs will be stuck at creating phase! In that case, check the ncp, vmop, and capw controller pods and you may need to restart them. What I observed is usually after restart of ncp pod, vmop-controller pods, and all pods under vmware-system-capw namespaces the VMs will start getting deployed and the TKC creation will progress and complete successfully.

Resource availability

  • Check whether there are enough resources available in the cluster.
LAST SEEN  TYPE   REASON       OBJECT                    MESSAGE
3m23s    Warning  UpdateFailure   virtualmachine/magna3-control-plane-9rhl4   The host does not have sufficient CPU resources to satisfy the reservation.
80s     Warning  ReconcileFailure  wcpmachine/magna3-control-plane-s5s9t-p2cxj  vm is not yet powered on: vmware-system-capw-controller-manager/WCPMachine//chakravartha-magna3/magna3/magna3-control-plane-s5s9t-p2cxj 

  • Check for resource limits applied to the namespace.

Check whether storage policy is assigned to the namespace

27m         Warning   ReconcileFailure               wcpmachine/gc-pool-0-cv8vz-5snbc          admission webhook "default.validating.virtualmachine.vmoperator.xxxyy.com" denied the request: StorageClass wdc-10-vc21c01-wcp-pod is not assigned to any ResourceQuotas in namespace mpereiramaia-demo2

  • In this case, the storage policy wasnt assigned to the ns. I assigned the storage policy wdc-10-vc21c01-wcp-pod to the respective namespace, and the TKC deployment was successful.

Check Content library can sync properly

  • Sometimes issues related to CL can cause TKCs to get stuck at creating phase! Check this blog post for more details.

KCP can't remediate

Message:               KCP can't remediate if current replicas are less or equal then 1
Reason:                WaitingForRemediation @ Machine/gc-control-plane-zpssc
Severity:              Warning
  • In this case, you can just edit the TKC spec, change the control plane vmclass to a different class and save. Once the deployment is complete and TKC is running, edit the TKC spec again and revert the vmclass that you modified earlier to its original class. This process will re-provision the control plane.

TKC VMs waiting for IP

  • In this case, take a look at NSXT and check whether all Edge nodes are healthy. If there are mismatch errors, resolve them.
  • You may also check ncp pod logs and restart ncp pod if required.

VirtualMachineClassBindingNotFound

Conditions:
    Last Transition Time:  2021-05-05T18:19:10Z
    Message:               1 of 2 completed
    Reason:                VirtualMachineClassBindingNotFound @ Machine/tkc-dev-control-plane-wxd57
    Severity:              Error
    Status:                False
    Message:               0/1 Control Plane Node(s) healthy. 0/2 Worker Node(s) healthy
Events:
  Normal  PhaseChanged  7m22s  vmware-system-tkg/vmware-system-tkg-controller-manager/tanzukubernetescluster-status-controller  cluster changes from creating phase to failed phase

  • This happens when the virtualmachineclassbindings are missing and can be resolved by adding all/ required VM Class to the Namespace using the vSphere Client. Following are the steps to add VM Classes to a namespace:
  • Log into vCenter web UI
  • From Hosts and Clusters > Select the namespace > Summary tab > VM Service tile > Click Manage VM Classes
  • Select all required VM Classes and click OK

Verify NSX-T objects

  • Issues at the NSX-T side can also cause the TKC to be stuck at creating phase. Following is a sample case and you can see these logs when you describe the TKC:
Message: 2 errors occurred:
* failed to configure DNS for /, Kind= namespace-test-01/gc: unable to reconcile kubeadm ConfigMap's CoreDNS info: unable to retrieve kubeadm Configmap from the guest cluster: configmaps "kubeadm-config" not found * failed to configure kube-proxy for /, Kind= namespace-test-01/gc: unable to retrieve kube-proxy daemonset from the guest cluster: daemonsets.apps "kube-proxy" not found
  • In this case, these were some issues with the virtual servers in loadbalancer. Some stale entries of virtual servers were still present and their IP didn't get removed properly and it was causing some intermittent connectivity issues to some of the other services of type loadbalancer. And, new TKC deployment within that affected namespace also gets stuck due to this. In our case we deleted the affected namespace, and recreated it, that cleaned up all those virtual server state entries and the load balancer, and new TKC deployments were successful. So it will be worth to check on the health and staus of NSX-T objects in case you have TKC deployment issues.

Check for broken TKCs in the cluster

  • Sometimes the TKC deployments are very slow and takes more than 30 minutes. In this case, you may notice that the first control plane VM will get deployed in like 30-45 minutes after the TKC creation has started. Look for vmop controller logs. Following is sample log:
❯ kail -n vmware-system-vmop
vmware-system-vmop/vmware-system-vmop-controller-manager-55459cb46b-2psrk[manager]: E1027 11:49:44.725620       1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.29.9.212:6443: connect: connection refused" "vmName"="ciroscosta-cartographer/kontinue-control-plane-svlk4" "result"=-1

vmware-system-vmop/vmware-system-vmop-controller-manager-55459cb46b-2psrk[manager]: E1027 11:49:49.888653       1 readiness_worker.go:111] readiness-probe "msg"="readiness probe fails" "error"="dial tcp 172.29.2.66:6443: connect: connection refused" "vmName"="whaozhe-platform/gc-control-plane-mf4p5" "result"=-1

  • In the above case, two of the TKCs were broken/ stuck at updating phase and we were unable to connect to its control plane.
ciroscosta-cartographer    kontinue    updating       2021-10-29T18:47:46Z   v1.20.9+vmware.1-tkg.1.a4cee5b    1     2
whaozhe-platform           gc          updating       2022-01-27T03:59:31Z   v1.20.12+vmware.1-tkg.1.b9a42f3   1     10
  • After removing the namespaces with broken TKCs, new deployments were completing succesfully. 

Restart system pods

  • Sometimes restart of some of the system controller pods resoves the issue. I usually delete all the pods of the following namespaces and they will get restarted in a few seconds.
k delete pod --all --namespace=vmware-system-vmop
k delete pod --all --namespace=vmware-system-capw
k delete pod --all --namespace=vmware-system-tkg
k delete pod --all --namespace=vmware-system-csi
k delete pod --all --namespace=vmware-system-nsx

Hope this was useful. Cheers!

Friday, June 7, 2019

VMware PowerCLI 101 - Part3 - Basic VM operations

Previous posts of this blog series talked about how to install PowerCLI, connecting to ESXi host and basics of working with the vCenter server. In this article, we will go through basic VM operations.

Get basic VM details:
Get-VM -Name <name of the virtual machine>

To view all properties of a VM object:
Get-VM -Name <name of the virtual machine> | Get-Member

To start a VM:
Get-VM -Name <name of the virtual machine> | Start-VM

To restart a VM:
Get-VM -Name <name of the virtual machine> | Restart-VMGuest

To shut down a VM:
Get-VM -Name <name of the virtual machine> | Shutdown-VMGuest

To delete a VM permanently:
Get-VM -Name <name of the virtual machine> | Remove-VM -DeletePermanently

Let's have a look into the "ExtensionData" property of a virtual machine.

Status related details of a VM:
(Get-VM -Name <name of the virtual machine>).ExtensionData | select GuestHeartbeatStatus,OverallStatus,ConfigStatus


CPU, Memory config details of a VM:
(Get-VM -Name <name of the virtual machine>).ExtensionData.Config.Hardware



CPU, Memory hot add/ remove features:


Layout details of a VM:
(Get-VM -Name <name of the virtual machine>).ExtensionData.LayoutEx.File | sort Key | ft


VM tools status and other guest details:
(Get-VM -Name <name of the virtual machine>).ExtensionData.Guest




Network related info of a VM:
(Get-VM -Name <name of the virtual machine>).ExtensionData.Guest.Net


Hope this was useful. Cheers!

Sunday, March 11, 2018

PowerShell quick start guide

This post is for all those who would like to kickstart and learn PowerShell from the very basic level. When I started learning PowerShell I wrote few articles so that it will be helpful for folks who are looking forward to 'how to start and learn it' and I can also refer to it whenever I need. Here in this post, I am just putting all of them together in proper order so that anyone can easily make use of it.

PowerShell 101 blog series