
Thursday, August 1, 2024

A decade of tech - My professional journey so far

Laying the Groundwork

My professional career commenced in February 2014 as a Trainee IT Services Engineer at Alamy Images. During my initial days, I was tasked with daily maintenance activities such as running tape backups and setting up Active Directory user accounts, mailboxes, and desktops for new employees. I also handled general IT support, troubleshooting various user issues within the organization.

After a few months, I had the opportunity to set up a lab infrastructure project using old decommissioned servers as part of a continuous learning initiative. This hands-on experience involved racking, stacking, and cabling physical servers, installing and configuring ESXi and Hyper-V hypervisors, FreeNAS storage servers, and deploying highly available clusters. Additionally, I gained exposure to configuring L2 network switches. This project significantly contributed to building my IT infrastructure foundation.

A year later, I was promoted to Junior IT Services Engineer, where I focused on virtualization projects. I spearheaded the migration of over 20 Dev/Test/UAT virtual machines from VMware to a Hyper-V cluster, enhancing system flexibility and cost-efficiency. I deployed a high-availability Hyper-V failover cluster in production and contributed to the planning and execution of an iSCSI storage server migration project.

Beyond virtualization, I worked on network infrastructure, delivering a seamless L2 switch replacement and upgrade project with minimal operational disruption. I also assisted in capacity planning initiatives to optimize resource utilization across both physical and virtual environments. These experiences refined my technical skills and problem-solving abilities, and during this time I developed a passion for infrastructure management and optimization that shaped my future career path.

From Junior IT Services Engineer to Storage Solutions Engineer

In January 2017, I transitioned to a Systems Development Engineer role at Dell EMC, specializing in Solutions Engineering. This marked a significant career shift as I immersed myself in the world of storage and virtualization solutions integration/development.

My daily responsibilities encompassed the installation and testing of various components, progressing from integration to validating system reliability and performance at scale. I designed and deployed multiple PowerFlex software-defined storage clusters for customer demos and proofs of concept, showcasing the product's performance and auto-rebuild capabilities. A notable achievement was automating storage performance benchmarking using PowerShell, FIO, and the ELK stack, reducing process time from weeks to days.

I led the engineering efforts for developing a vROps management pack for PowerFlex, ensuring seamless integration and visibility. Additionally, I mastered vSphere Virtual Volumes (vVols), successfully executing integration projects between Dell storage solutions and VMware environments.

To streamline operations, I created a PowerShell module for managing PowerFlex using REST APIs and developed an Ansible playbook for automated deployment of a Kubernetes cluster with the PowerFlex CSI driver. My expertise extended beyond systems engineering and automation as I authored and published whitepapers on disaster recovery using VMware SRM and hardware lifecycle management with Dell OME.

This period solidified my reputation as a virtualization and storage solutions expert, providing me with a deep understanding of storage architecture, performance optimization, and automation. I developed a passion for building scalable and reliable hyperconverged solutions.

From Storage Solutions Engineer to Site Reliability Engineer

In July 2021, I transitioned to a Site Reliability Engineer (SRE) role at VMware, focusing on ensuring the reliability and scalability of a Kubernetes-as-a-Service platform built on vSphere with Tanzu.


Managing a vast infrastructure of Kubernetes clusters, I honed my skills in incident response, GitOps pipelines, automation, and monitoring. I played a crucial role in maintaining platform availability, collaborating closely with multiple internal teams and stakeholders to resolve issues and enhance service delivery. My proficiency in Python and PowerShell was instrumental in automating tasks and building custom monitoring solutions. During this time, I also prepared diligently, practiced extensively, and passed the CKA exam.

Beyond core SRE responsibilities, I explored emerging technologies. I successfully deployed and evaluated open-source language models on Kubernetes using Python, Ollama, and LangChain. In addition, I contributed to developing custom metrics for the Kubernetes-as-a-Service platform using Python, Prometheus, Grafana, and Helm.

This role deepened my expertise and ability to bridge the gap between development and operations, fostering a culture of reliability and efficiency. It has been an exciting journey of learning and growth, positioning me as a versatile IT professional with a strong foundation in both infrastructure and cloud-native technologies.

Gratitude

"This journey has been immensely fulfilling, made possible by the support and encouragement of exceptional organisations, inspiring managers, talented colleagues, friends, and family. I am truly grateful for the opportunities to learn, grow, and contribute meaningfully to driving success and making a positive impact."

The journey continues...

Friday, March 15, 2019

VMFS vs VVOL

Let's start with a quick comparison.

VMFS
  • LUN-centric approach
  • Pre-provisioning of LUNs
  • Use of multiple datastores for different performance capabilities
  • Management difficulties, as a single VM may span multiple datastores

VVOL
  • VM-centric
  • vSphere is aware of array capabilities through the VASA provider
  • No pre-provisioning
  • One VVOL datastore can represent the whole array
  • Storage policies can be applied at the per-VMDK level
  • Some vSphere operations (full cloning, snapshots) are offloaded to the array using VAAI
  • VVOL snapshots are faster (different from traditional snapshots with redo logs)

Note: The SAN-related explanations below are based on Dell EMC Unity (a hybrid array).

VMFS environment

A traditional VMFS datastore setup is shown below. Say you have two VMs: one with 3 virtual disks and a second with 2 virtual disks. Your array has 4 storage pools, each with specific capabilities, and based on those capabilities the pools are classified into 4 service levels (Platinum, Gold, Silver, and Bronze). In this case you need to provision 4 datastores/LUNs from the respective pools to meet the requirements of the two virtual machines.

[Figure: VMFS environment]

You can clearly see that the first VM spans 3 datastores and the second VM spans 2 datastores. If your environment has hundreds or thousands of VMs, management becomes very complicated. Here, because the datastores are properly named, the vSphere admin can easily identify the service level/capability of a specific datastore. But if a naming convention is not followed, the vSphere admin has to contact the storage admin to learn the capabilities of that specific datastore/LUN. That means a lot of back-and-forth communication, making it a complex process!
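
As a quick way to spot this sprawl in your own environment, here is a minimal pyVmomi sketch that lists the datastores each VM spans. The vCenter address and credentials are placeholders, and it assumes the pyvmomi package is installed:

# Minimal pyVmomi sketch: list the datastores each VM spans.
# vCenter address and credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only; validate certificates in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
content = si.RetrieveContent()

vm_view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
for vm in vm_view.view:
    ds_names = [ds.name for ds in vm.datastore]
    if len(ds_names) > 1:
        print(f"{vm.name} spans {len(ds_names)} datastores: {', '.join(ds_names)}")

Disconnect(si)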

VVOL environment

Now let’s have a look at the vVol environment shown in the figure below. You have the same array, but instead of provisioning datastores/LUNs from the respective pools, here we create one vVol datastore. The array has 4 storage pools, each with specific capabilities like drive type, RAID level, tiering policy, FAST Cache on/off, etc., and they are classified into 4 service levels: Platinum, Gold, Silver, and Bronze. So each pool has its own capability profile. You can provide any name, but here I simply gave each capability profile the same name as the service level of its pool.

Pool_01 -> Service level: Platinum -> Capability profile: Platinum
Pool_02 -> Service level: Silver -> Capability profile: Silver
Pool_03 -> Service level: Bronze -> Capability profile: Bronze
Pool_04 -> Service level: Gold -> Capability profile: Gold


Note: A vVol datastore cannot be created without a capability profile.

Next steps are:

  • Create a vVol datastore in the SAN
  • Provide a name for the vVol datastore
  • Add the capability profiles that need to be part of the vVol datastore
  • In this case, all 4 capability profiles are part of the vVol datastore
  • You can also limit the amount of space the vVol datastore can use from each capability profile
  • Once that's done, the vVol datastore is created in the SAN
[Figure: VVOL environment]

At this point, you can go ahead and configure host access to this vVol datastore. Next, you have to let your vSphere environment know about the capabilities of the storage array. That is done by registering the new storage provider on your vCenter server, making use of the VASA (vSphere APIs for Storage Awareness) provider. Once registered, the vSphere environment communicates with the VASA provider through the array management interface (OOB network), which forms the control path.

The next step is creating a vVol datastore on the ESXi host to which you provided access earlier. For each vVol datastore created on the storage array, two protocol endpoints (PEs) are automatically generated to communicate with an ESXi host, forming the data path. If you create another vVol datastore on the array and provide access to the same ESXi host, two more PEs will be created. PEs act like targets, and on the ESXi side, if you look at the storage devices, you can see 2 proxy LUNs which connect to the respective PEs.

Now you have to create VM storage policies based on the service levels. Here you have 4 service levels, so you have to create 4 storage policies. Assign a storage policy to each VMDK as per requirements.

Eg: Storage_Policy_Gold -> VMDK 02 (second VM) -> it will be placed in Pool_04 automatically

So in this case, instead of having 4 VMFS datastores we needed only 1 vVol datastore which has all the required capabilities. This means you don't need to provision more LUNs from the SAN, and storage management becomes easy with just one vVol datastore. There is more granularity, with SPBM applied at each VMDK level. With vVols the SAN is aware of all the VMs and their corresponding files hosted on it. This makes space reclamation very easy and straightforward: the moment a VM or a VMDK is deleted, that space is immediately freed, as the SAN has complete insight into the virtual machines stored on it. Data mobility between different storage pools based on service level also becomes effortless, as it is handled internally by the SAN based on SPBM. Altogether, vVols simplify storage/LUN management.
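
If you want to check what you ended up with, the following minimal pyVmomi sketch lists every datastore with its type and capacity; vVol datastores should report the type 'VVOL' while traditional ones report 'VMFS' or 'NFS'. Again, the vCenter address and credentials are placeholders:

# Minimal pyVmomi sketch: list each datastore with its type and capacity.
# vCenter address and credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
content = si.RetrieveContent()

ds_view = content.viewManager.CreateContainerView(content.rootFolder, [vim.Datastore], True)
for ds in ds_view.view:
    s = ds.summary
    # vVol datastores show type 'VVOL'; traditional ones show 'VMFS' or 'NFS'
    print(f"{s.name:30} type={s.type:6} "
          f"capacity={s.capacity / 2**40:.1f} TiB free={s.freeSpace / 2**40:.1f} TiB")

Disconnect(si)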

Hope it was useful. Cheers!


Thursday, November 30, 2017

Software Defined Storage using ScaleIO

In this article I will briefly explain ScaleIO and the various options available to deploy a ScaleIO software-defined storage (SDS) solution.

ScaleIO can be considered a very good option for customers who are moving towards software-defined storage solutions and hyperconverged infrastructure. As the ScaleIO software supports multiple hypervisors and operating systems like VMware ESXi, Hyper-V, RHEL, Windows, etc., customers with a heterogeneous IT infrastructure get the most benefit out of it. Apart from that, it offers multiple deployment modes: hyperconverged, two-layer, and mixed mode. I am sure most of you are familiar with the term hyperconverged, where compute and storage run together on the same box; you can scale both compute and storage resources together by adding more nodes to your cluster. A two-layer mode is a storage-only configuration where you can scale the storage resources separately. It is essentially a virtual SAN infrastructure implemented using ScaleIO SDS. A mixed mode scenario usually occurs when transitioning from a storage-only configuration to hyperconverged.

Now I will give an overview of how to deploy ScaleIO on VMware and RHEL platforms. ScaleIO has tight integration with VMware, and they provide a PowerShell script and a vCenter plugin to simplify the deployment. On the RHEL platform, you can use the Installation Manager (IM), which is part of the ScaleIO Gateway, for quick and easy deployment of a ScaleIO cluster. Customers have multiple options to consume ScaleIO. They can buy the ScaleIO software alone and use commodity x86 hardware to build the cluster (not a great idea for production deployments, as they have to figure out and use validated/qualified hardware and software components to ensure seamless operation and proper support), or they can buy ScaleIO Ready Nodes, which are prevalidated, preconfigured and optimized PowerEdge servers for deploying a ScaleIO cluster. Apart from that, there is another offering, VxRack System Flex, which is a rack-scale hyperconverged solution built on Dell EMC PowerEdge servers with integrated Cisco networking and ScaleIO software.

Let's have a look at the major components of ScaleIO. The figure below shows a 5-node hyperconverged ScaleIO cluster running on a highly available VMware platform. The three main components of ScaleIO are:

  • SDC - ScaleIO Data Client
  • SDS - ScaleIO Data Server
  • MDM - Meta Data Manager


In this scenario, all 5 nodes have ESXi installed and are clustered. All nodes have local hard disks present in them, and it's the responsibility of the ScaleIO software to pool the hard disks from all 5 nodes, forming a distributed virtual SAN.

SDC is a lightweight driver responsible for presenting LUNs provisioned from the ScaleIO system. SDS is responsible for managing the local disks present in each node. MDM contains all the metadata required for system operation and configuration changes: it manages the metadata, SDCs, SDSs, system capacity, device mappings, volumes, data protection, errors/failures, and rebuild and rebalance operations. ScaleIO supports a 3-node or 5-node MDM cluster. The figure above shows a 5-node MDM cluster, with 3 manager MDMs (one master and two slaves) and two Tie-Breakers (TBs), which help decide the master MDM by maintaining a majority in the cluster. In a production environment with 5 or more nodes, it is recommended to use a 5-node MDM cluster as it can tolerate 2 MDM failures.

ScaleIO uses a distributed two-way mesh mirroring scheme to protect data against disk or node failures. To ensure QoS, you can limit bandwidth as well as IOPS for each volume provisioned from a ScaleIO cluster. Regarding scalability, a single ScaleIO cluster supports up to 1024 nodes. In very large ScaleIO deployments it is highly recommended to configure separate protection domains and fault sets to minimize the impact of multiple simultaneous failures.
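
The ScaleIO Gateway also exposes a REST API, which is handy for monitoring and automation. Below is a minimal Python sketch of how such a query could look; the gateway address and credentials are placeholders, and the endpoints and field names should be verified against the REST API guide for your ScaleIO version:

# Minimal sketch: query a ScaleIO cluster through the Gateway REST API.
# Gateway address and credentials are placeholders; verify endpoints and field
# names against the REST API guide for your ScaleIO version.
import requests

GATEWAY = "https://scaleio-gw.example.com"
USER, PASSWORD = "admin", "changeme"

# /api/login returns a token, which is then used as the password for later calls
# (verify=False only for lab/self-signed certificates)
token = requests.get(f"{GATEWAY}/api/login", auth=(USER, PASSWORD), verify=False).json()
auth = (USER, token)

# Query overall system details
systems = requests.get(f"{GATEWAY}/api/types/System/instances", auth=auth, verify=False).json()
for system in systems:
    print("System id:", system.get("id"))   # the full object also carries capacity/health details

# Query storage pools
pools = requests.get(f"{GATEWAY}/api/types/StoragePool/instances", auth=auth, verify=False).json()
for pool in pools:
    print("Storage pool:", pool.get("name"), pool.get("id"))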

You can download ScaleIO software for free to test and play around in your lab environment.

References:
Dell EMC ScaleIO Basic Architecture
Dell EMC ScaleIO Design Considerations And Best Practices
Dell EMC ScaleIO Ready Node

Tuesday, January 26, 2016

Zoning and LUN masking

Zoning and LUN masking are used to isolate SAN traffic and to restrict access to storage devices. For example, you might manage separate zones for test and production environments so that they do not interfere with each other. If you want to restrict certain hosts from accessing storage devices, you have to set up zoning. This is generally done at the FC switch level. Zoning is of two types: soft zoning and hard zoning.

Soft zoning is based on the WWN (World Wide Name) of the device, while hard zoning is configured at the FC switch port level. Soft zoning offers greater flexibility: even if you move a device from one port to another on the FC switch, it keeps the same access rights, as the restriction is based on the device's WWN. The downside is that WWN spoofing can be used to gain access to zones a device isn't supposed to see. Hard zoning at the switch port level gives tighter access control but less flexibility: if you move a device from one port to another as before, it will no longer be able to see its partner. On the other hand, you can't spoof a physical port unless you are physically at the switch.
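
To make soft zoning a bit more concrete, here is a rough Python sketch that pushes Brocade FOS-style zoning commands to a switch over SSH using paramiko. The switch address, credentials, aliases, and WWPNs are made-up examples, and the exact command syntax and confirmation prompts vary by switch vendor and firmware, so treat this purely as an outline:

# Rough sketch: create a WWN-based (soft) zone on a Brocade-style FC switch over SSH.
# Switch address, credentials, aliases, and WWPNs are made-up examples; command
# syntax varies by vendor and firmware version.
import paramiko

commands = [
    'alicreate "esxi01_hba0", "10:00:00:90:fa:11:22:33"',    # host HBA WWPN
    'alicreate "unity_spa_fc0", "50:06:01:60:47:e0:11:22"',  # array front-end port WWPN
    'zonecreate "z_esxi01_unity", "esxi01_hba0; unity_spa_fc0"',
    'cfgadd "prod_cfg", "z_esxi01_unity"',                   # add the zone to an existing config
    'cfgsave',                                               # prompts for confirmation
    'cfgenable "prod_cfg"',                                  # prompts for confirmation
]

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())    # lab convenience only
ssh.connect("fc-switch01.example.com", username="admin", password="changeme")

for cmd in commands:
    stdin, stdout, stderr = ssh.exec_command(cmd)
    if cmd.startswith(("cfgsave", "cfgenable")):
        stdin.write("yes\n")                                 # answer the y/n confirmation prompt
        stdin.flush()
    print(cmd, "->", stdout.read().decode().strip() or stderr.read().decode().strip())

ssh.close()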

Once zoning is done, you can further restrict access to SAN LUNs by using LUN masking. This prevents certain hosts from seeing specific LUNs hosted on the storage device. LUN masking is done at the storage controller or OS level of the storage array. It is recommended to use zoning and LUN masking together to secure storage traffic.

Wednesday, February 25, 2015

Implementing storage cluster - Open-E DSS V7 Active-Passive iSCSI SAN Failover Cluster

The setup that I used for this implementation is described below:

-Two DELL PowerEdge R710 servers (ESXI01 and ESXI02)
-Implemented Open-E DSS V7 Active-Passive iSCSI Failover Cluster – using two VSAs running on different ESXi 5.5 hosts
  • Created VM and installed Open-E Node A on ESXI01
  • Created VM and installed Open-E Node B on ESXI02
  • Configured separate network interfaces for heartbeat, volume replication and WEB GUI management
  • Configured a vSwitch on both ESXi hosts (ESXI01 and ESXI02)
  • Added a direct point-to-point connection between the above two ESXi hosts for reliable volume replication
  • Configured iSCSI volumes and targets on Node A and Node B
  • Configured replication task
  • Configured failover cluster with multiple auxiliary paths
  • Configured virtual target IP address
  • Added target to cluster
  • Configured iSCSI initiator (see the sketch after this list)
  • Tested failover and failback functions successfully
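
For reference, the ESXi-side piece of the iSCSI initiator step can also be scripted. Here is a minimal pyVmomi sketch that enables the software iSCSI adapter on a host and adds the cluster's virtual target IP as a send target; the host name, credentials, and target IP are placeholders:

# Minimal pyVmomi sketch: enable the ESXi software iSCSI adapter and add the
# cluster's virtual target IP as a send target. Host, credentials and IP are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="esxi01.example.com", user="root", pwd="changeme", sslContext=ctx)
content = si.RetrieveContent()

host = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True).view[0]
storage = host.configManager.storageSystem

storage.UpdateSoftwareInternetScsiEnabled(True)              # enable software iSCSI

# Locate the software iSCSI HBA (typically a device like vmhba64)
sw_iscsi = next(hba.device for hba in storage.storageDeviceInfo.hostBusAdapter
                if isinstance(hba, vim.host.InternetScsiHba) and hba.isSoftwareBased)

target = vim.host.InternetScsiHba.SendTarget(address="192.168.10.100", port=3260)  # virtual target IP
storage.AddInternetScsiSendTargets(iScsiHbaDevice=sw_iscsi, targets=[target])
storage.RescanAllHba()                                       # discover the iSCSI LUNs

Disconnect(si)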

Note: A detailed configuration guide is available on the Open-E website.

Implementing HA storage cluster - Open-E DSS V7 Active-Active Load Balancing iSCSI HA SAN cluster

The setup that I used for this implementation is described below:

-Two DELL PowerEdge R710 servers (ESXI01 and ESXI02)
-Implemented Open-E DSS V7 Active-Active iSCSI HA SAN Cluster – using two VSAs running on different ESXi 5.5 hosts
-In Active-Active mode, both nodes of the cluster simultaneously serve volumes, providing high availability of data
-Overall cluster performance is improved compared to Active-Passive mode, since read, write, and replication traffic can be balanced across both nodes

-Open-E cluster nodes :
  • Node A1 on ESXI01 
  • Node B1 on ESXI02 
  • Configured separate network interfaces for heartbeat, volume replication and WEB GUI management
  • Added a direct point-to-point connection between the above two ESXi hosts for reliable volume replication
  • Configured iSCSI volumes and targets on Node A1 and Node B1
  • Configured replication tasks and failover cluster with multiple auxiliary paths
  • Configured virtual target IP address and added targets to cluster
  • Started cluster

Note: A detailed configuration guide is available on the Open-E website.

Configuring iSCSI volume using Open-E DSS V7