Showing posts with label networking. Show all posts
Showing posts with label networking. Show all posts

Thursday, August 1, 2024

A decade of tech - My professional journey so far

Laying the Groundwork

My professional career commenced in February 2014, as a Trainee IT Services Engineer at Alamy Images. During my initial days, I was tasked with daily maintenance activities such as running tape backups, setting up Active Directory user accounts, mailboxes, and desktops for new employees. I also handled general IT support, troubleshooting various user issues within the organization.

After a few months, I had the opportunity to set up a lab infrastructure project using old decommissioned servers as part of a continuous learning initiative. This hands-on experience involved racking, stacking, and cabling physical servers, installing and configuring ESXi and Hyper-V hypervisors, FreeNAS storage servers, and deploying highly available clusters. Additionally, I gained exposure to configuring L2 network switches. This project significantly contributed to building my IT infrastructure foundation.

A year later, I was promoted to Junior IT Services Engineer, where I focused on virtualization projects. I spearheaded the migration of over 20 Dev/ Test/ UAT virtual machines from VMware to a Hyper-V cluster, enhancing system flexibility and cost-efficiency. I deployed a high-availability Hyper-V failover cluster in production and contributed to the planning and execution of a iSCSI storage server migration project.

Beyond virtualization, I worked on network infrastructure by a seamless L2 switch replacement and upgrade project with minimal operational disruption. Furthermore, I assisted in capacity planning initiatives for optimized resource utilization for both physical and virtual environments. These experiences refined my technical skills and problem-solving abilities. During this time, I developed a passion for infrastructure management and optimization, shaping my future career path.

From Junior IT Services Engineer to Storage Solutions Engineer

In January 2017, I transitioned to a Systems Development Engineer role at Dell EMC, specializing in Solutions Engineering. This marked a significant career shift as I immersed myself in the world of storage and virtualization solutions integration/development.

My daily responsibilities encompassed the installation and testing of various components, progressing from integration to validating system reliability and performance at scale. I designed and deployed multiple PowerFlex software-defined storage clusters for customer demos and proof-of-concepts, showcasing the product's performance and auto rebuild capabilities. A notable achievement was automating the storage performance benchmarking using PowerShell, FIO, and ELK stack, reducing process time from weeks to days.

I led the engineering efforts for developing a vROps management pack for PowerFlex, ensuring seamless integration and visibility. Additionally, I mastered vSphere Virtual Volumes (vVols), successfully executing integration projects between Dell storage solutions and VMware environments.

To streamline operations, I created a PowerShell module for managing PowerFlex using REST APIs and developed Ansible playbook for automated deployment of Kubernetes cluster with PowerFlex CSI driver. My expertise extended beyond systems engineering and automation as I authored and published whitepapers on disaster recovery using VMware SRM and hardware lifecycle management with Dell OME.

This period solidified my reputation as a virtualization and storage solutions expert, providing me with a deep understanding of storage architecture, performance optimization, and automation. I developed a passion for building scalable and reliable hyperconverged solutions.

From Storage Solutions Engineer to Site Reliability Engineer

In July 2021, I transitioned to a Site Reliability Engineer (SRE) role at VMware, focusing on ensuring the reliability and scalability of Kubernetes-as-a-Service project based on the vSphere with Tanzu platform.


Managing a vast infrastructure of Kubernetes clusters, I honed my skills in incident response, GitOps pipelines, automation, and monitoring. I played a crucial role in maintaining platform availability, collaborating closely with multiple internal teams and stakeholders to resolve issues and enhance service delivery. My proficiency in Python and PowerShell was instrumental in automating tasks and building custom monitoring solutions. During this time, I prepared diligently, practiced extensively, and successfully qualified for the CKA exam.

Beyond core SRE responsibilities, I explored emerging technologies. I successfully deployed and evaluated open-source language models on Kubernetes using Python, Ollama, and LangChain. In addition, I contributed to developing custom metrics for the Kubernetes-as-a-Service platform using Python, Prometheus, Grafana, and Helm.

This role deepened my expertise and ability to bridge the gap between development and operations, fostering a culture of reliability and efficiency. It has been an exciting journey of learning and growth, positioning me as a versatile IT professional with a strong foundation in both infrastructure and cloud-native technologies.

Gratitude

"This journey has been immensely fulfilling, made possible by the support and encouragement of exceptional organisations, inspiring managers, talented colleagues, friends, and family. I am truly grateful for the opportunities to learn, grow, and contribute meaningfully to driving success and making a positive impact."

The journey continues...

Wednesday, June 26, 2024

vSphere with Tanzu using NSX-T - Part33 - Troubleshooting intermittent connection timeouts to apiserver and workloads

In the realm of managing Tanzu Kubernetes clusters (TKCs), we have encountered several challenges that hindered the smooth functioning of our applications. In this blog post, we will discuss three such cases and the workarounds we employed to resolve them.


Case 1: TKC Control Plane Node Connectivity Issues


Symptoms:
  • TKC apiserver connection timeouts when attempting to connect using the kubeconfig.
  • Traffic was not flowing to two of the control plane nodes.
  • NSX-T web UI LB VS stats indicated this issue.


Case 2: TKC Worker Node Connectivity Issues


Symptoms:
  • Workload (example: PostgreSQL cluster) connection timeouts.
  • Traffic was not flowing to two of the worker nodes in the TKC.
  • NSX-T web UI LB VS stats indicated this issue.


Case 3: Load Balancer Connectivity Issues


Symptoms:
  • Connection timeouts when attempting to connect to a PostgreSQL workload through the load balancer VS IP.
  • This issue was observed only when creating new services of type LoadBalancer in the TKC.
  • We noticed datapath mempool usage for the edge nodes was above the threshold value.


Resolution/ work around

  • Find the T1 router that is attached to the TKC which has connectivity issues. 
  • In an Active - Standby HA configuration, you will see that there will be one Edge node that will be Active and another one in Standby status. 
  • First place the Standby Edge node in NSX MM, reboot it, and then exit it from NSX MM. 
  • Now, place the Active Edge node in NSX MM, there will be a slight network disruption during this failover, once it is in NSX MM, reboot it, and then exit NSX MM. 
  • This should resolved the issue.


In conclusion, these cases illustrate the importance of verifying NSX-T components when managing Tanzu Kubernetes clusters. By identifying the root cause of the issues and employing effective workarounds, we were able to restore functionality and maintain the health of our applications. Stay tuned for more insights and best practices in managing Kubernetes clusters.

Hope it was useful. Cheers!

Sunday, February 7, 2021

vSphere with Tanzu using NSX-T - Part4 - Tier-0 Gateway and BGP peering

In the previous posts we discussed the following: 

Part1: Prerequisites
Part2: Configure NSX-T
Part3: Edge Cluster


The next step is to create a Tier-0 Gateway, configure its interfaces, and BGP peer with the L3 TOR switches. Following is a high-level logical representation of this configuration:


Configure Tier-0 Gateway


Before creating the T0-Gateway, we need to create two segments.

  • Add Segments.
    • Create a segment "ls-uplink-v54"
      • VLAN: 54
      • Transport Zone: "edge-vlan-tz"
    • Create a segment "ls-uplink-v55"
      • VLAN: 55
      • Transport Zone: "edge-vlan-tz"

  • Add Tier-0 Gateway.
    • Provide the necessary details as shown below.

    • Add 4 interfaces and configure them as per the logical diagram given above.
      • edge-01-uplink1 - 192.168.54.254/24 - connected via segment ls-uplink-v54
      • edge-01-uplink2 - 192.168.55.254/24 - connected via segment ls-uplink-v55

      • edge-02-uplink1 - 192.168.54.253/24 - connected via segment ls-uplink-v54
      • edge-02-uplink2 - 192.168.55.253/24 - connected via segment ls-uplink-v55
    • Verify the status is showing success for all the 4 interfaces that you added.

  • Routing and multicast settings of T0 are as follows:

    • You can see a static route is configured. The next hop for the default route 0.0.0.0/0 is set to 192.168.54.1. 

    • The next hop configuration is given below.

  • BGP settings of T0 are shown below.

    • BGP Neighbor config:

    • Verify the status is showing success for the two BGP Neighbors that you added.

  • Route re-distribution settings of T0:

    • Add route re-distribution.
    • Set route re-distribution.

Now, the T0 configuration is complete. The next step is to configure BGP on the Dell S4048-ON TOR switches.

Configure TOR Switches


---On TOR A---
conf
router bgp 65500
neighbor 192.168.54.254 remote-as 65400
#peering to T0 edge-01 interface
neighbor 192.168.54.254 no shutdown
neighbor 192.168.54.253 remote-as 65400
#peering to T0 edge-02 interface
neighbor 192.168.54.253 no shutdown
neighbor 192.168.54.3 remote-as 65500
#peering to TOR B in VLAN 54
neighbor 192.168.54.3 no shutdown
maximum-paths ebgp 4
maximum-paths ibgp 4


---On TOR B---
conf
router bgp 65500
neighbor 192.168.55.254 remote-as 65400
#peering to T0 edge-01 interface
neighbor 192.168.55.254 no shutdown
neighbor 192.168.55.253 remote-as 65400
#peering to T0 edge-02 interface
neighbor 192.168.55.253 no shutdown
neighbor 192.168.54.2 remote-as 65500
#peering to TOR A in VLAN 54
neighbor 192.168.54.2 no shutdown
maximum-paths ebgp 4
maximum-paths ibgp 4


---Advertising ESXi mgmt and VM traffic networks in BGP on both TORs---

conf
router bgp 65500
network 192.168.41.0/24
network 192.168.43.0/24


Thanks to my friend and vExpert Harikrishnan @hari5611 for helping me with the T0 configs and BGP peering on TORs. Do check out his blog https://vxplanet.com/


Verify BGP Configurations


The next step is to verify the BGP configs on TORs using the following commands:

show running-config bgp

show ip bgp summary

show ip bgp neighbors


Follow the VMware documentation to verify the BGP connections from a Tier-0 Service Router. In the below screenshot you can see that both Edge nodes have the BGP neighbors 192.168.54.2 and 192.168.55.3 with state Estab.


In the next article, I will talk about adding a T1 Gateway, adding new segments for apps, connecting VMs to the segments, and verify connectivity to different internal and external networks. I hope this was useful. Cheers!

Monday, December 28, 2020

vSphere with Tanzu using NSX-T - Part3 - Edge Cluster

In the previous post, we went through basic NSX-T configurations. The next step is to deploy Edge VMs and create an Edge cluster. Before creating Edge VMs, we need to create two trunk port groups on the VDS using the VCSA web UI. Uplinks of the Edge VMs will be connected to these two trunk port groups.

  • Create 2 trunk port groups on the VDS.
    • "Trunk01-NSXT-Edge"
    • "Trunk02-NSXT-Edge"


  • Add Edge Transport Nodes.
    • Create two Edge VMs "edge-01, edge-02".
    • Click ADD EDGE VM, follow the wizard, and provide all the details.






  • Click Finish. Follow the same process and deploy one more Edge VM.

  • The next step is to create an Edge Cluster "edge-cluster-01and add the two Edge VMs to it.


The NSX-T Edge Cluster is now ready. Next, we have to add a Tier-0 Gateway and configure BGP peering with the router or L3 TOR switches. This will be covered in the next part. Hope it was useful. Cheers!

Related posts


Part1: Prerequisites
Part2: Configure NSX-T

Tuesday, December 8, 2020

vSphere with Tanzu using NSX-T - Part2 - Configure NSX

In this post, we will take a look at the different configuration steps that are required before enabling workload management to establish connectivity to the supervisor cluster, supervisor namespaces, and all objects that run inside the namespaces, such as vSphere pods, and Tanzu Kubernetes Clusters. 

At this point, I assume that the vSAN cluster is up and NSX-T 3.0 is installed. NSX-T appliance is connected to the same management network where the VCSA and ESXi nodes are connected. In my case, it will be through VLAN 41. Note that all the ESXi nodes of the vSAN cluster are connected to one vSphere Distributed Switch and has two uplinks from each node that connects to TOR A and TOR B.


NSX-T configurations

  • Add Compute Managers. I've added the vCenter server here.


  • Add Uplink Profiles.
    • Create a host uplink profile "nsx-uplink-profile-lbsrc" (this is for host TEP using VLAN 52).
    • Create an edge uplink profile "nsx-edge-uplink-profile-lbsrc" (this is for edge TEP using VLAN 53).



  • Add Transport Zones.
    • Create an overlay transport zone "tz-overlay".
    • Create a VLAN transport zone "edge-vlan-tz".




  • Add IP Address Pools.
    • Create an IP address pool for host TEPs "TEP-Pool-01" (this is for host TEP using VLAN 52).
    • Create an IP address pool for edge TEPs "Edge-TEP-Pool-01" (this is for Edge TEP using VLAN 53).



  • Add Transport Node Profiles.


  • Configure Host Transport Nodes. Select the required cluster and click configure NSX to convert all the ESXi nodes as transport nodes.

The next step is to deploy Edge VMs (Edge Transport Nodes) and create a Edge Cluster. We will cover it in the next part. Hope it was useful. Cheers!

Related posts


Part1 - Prerequisites


References


Monday, December 7, 2020

vSphere with Tanzu using NSX-T - Part1 - Prerequisites

In this blog series, I would like to share my experience deploying vSphere with Tanzu using NSX-T 3.0 networking. Following is a very high-level workflow of setting up the environment from scratch:


Software versions used for this study:
  • vSphere 7 U1
  • NSX-T 3.0.1.1

If you are using VCF, some of the deployment steps mentioned above are already automated. But, VCF is not in the scope of this blog series, and this post is aimed at explaining the workflow and background configurations that are required to prepare the environment before enabling workload management. This will be useful to understand the backend configuration tasks and logical workflow and will be helpful in case of troubleshooting too. In my lab, I have 4 Dell EMC PowerEdge rack servers connected to 2 TOR switches with 2 uplinks from each server. This is a consolidated architecture where both management components and application workload run on the same 4 node vSAN cluster. I am using ESXi 7 U1 and VCSA 7 U1. Here, the TORs are acting as L3 switches and we are doing BGP peering between TOR switches and NSX-T T0 Gateway which will be explained in a later part of this blog series.

Network connectivity



IP address schema and VLANs




I will not be covering how to create VLANs, configuring switch ports, deploying a vSAN cluster, etc. I assume that the network switches are correctly configured and the vSAN cluster is up and running. The next step is to deploy an NSX-T 3.0 appliance. You can either deploy it as a single node or in HA using 3 NSX-T appliances. In my lab, I have only one NSX-T 3.0 instance. Following are some of the reference material to deploy NSX-T 3.0:

Wednesday, October 26, 2016

3 Node Hyper-V 2012 R2 Cluster Design

Below diagram shows a traditional 3 tier highly available cluster with 3 Hyper-V nodes and 2 shared storage nodes (storage in Active-Passive/ Active-Active mode) all connected via network switches.

Compute

*Hyper-V nodes are running on DELL PowerEdge R630 rack servers

Networking

*Shared storage is accessed via MPIO over two separate VLANs (61 and 62)
*VM traffic is over NIC teaming (Switch independent and dynamic)
*Live migration/ cluster network is also teamed together

Storage

*We are using DELL PowerEdge R320 with Open-E DSS V7 as storage nodes
*Each node has 8 x 10K SAS drives with RAID 5
*Storage can be either in Active-Passive/ Active-Active cluster mode
*In Active-Passive mode, only one of the storage nodes will be active. That means resources of only one storage node will be utilized at a time
*In Active-Active mode, both servers will be active and serving storage traffic