My Recent SAN Migration
Tuesday, September 7, 2010 2:55:48 PM
SAN Migration Summary
Environment 1:
- 1 XIOTech Magnitude 3D 3000 (Old)
- Cisco 9100 SAN Fabric (Old)
- 2 XIOTech Emprise 5000 (New)
- Cisco 9124 SAN Fabric (New)
- 12 TB of Data
- 34 SAN Volumes
- 2 SAN Connected Physical Windows Servers (MPIO)
- File Servers running VSS and DFS/DFS-R
- 1 Windows Server Added to New SAN (MPIO)
- 3 vSphere 4.0.0-267974 Enterprise Nodes
- 24 Virtual Servers
- 2 SQL Servers
- 1 SQL Server has 3 RDM Volumes
- 2 Web Servers
- 1 Domain Controller
- 1 Print Server
- 18 App/Test Servers
- 5 SAN Connected VMFS Volumes (MPIO)
- 3 SAN Connected RDM Volumes (MPIO)
Environment 2:
- 1 XIOTech Magnitude 3D 3000 (Old)
- 1 XIOTech Emprise 5000 (New)
- Cisco 9124 SAN Fabric (New)
- 6.6 TB of Data
- 23 SAN Volumes
- 3 SAN Connected Physical Windows Servers (MPIO)
- 2 File Servers running VSS and DFS/DFS-R
- 1 SQL 2003 Server
- 3 Independent ESXi 4.0.0-171294 Nodes
- 14 Virtual Servers
- 1 SQL Server with 1 RDM Volume
- 2 Web Servers
- 1 Web Server has 2 RDM Volumes
- 1 SCCM Server with 1 RDM Volume
- 1 File Server with 1 RDM Volume
- 9 App/Test Servers
- 3 SAN Connected VMFS Volumes (MPIO)
- 1 Assigned to 1 Node
- 2 Shared by 2 Nodes
- 4 SAN Connected RDM Volumes (MPIO)
Environment 1 Process
The first environment had a layer of complexity that the second did not, because it involved two independent fabrics. Luckily all servers were configured with MPIO support, so I was able to remove one fibre link from each pair, at the cost of performance, and place the servers into zones on both fabrics.
Preparation
To prepare for the move I created all volumes ahead of time with a slight buffer (1 GB) so that each new volume was at least as large as the volume it would mirror. Both SANs and SAN fabrics remained online during the entire process.
ESX Nodes
- Moved 1 fibre channel port to the new fabric (Switch 1) from each node.
- Zoned the fabric to allow the hosts to communicate with both SANs.
- With the Emprise 5000s, Port 1 only needed to see MRC1 on both units, since those are plugged in to Switch 1, resulting in two zones per server: Zone 1 contained Server1-HBA1/ISE1-MRC1 and Zone 2 contained Server1-HBA1/ISE2-MRC1 (a zoning sketch follows this list).
- Created host record on each Emprise 5000 for each ESX node (ESX01/ESX02/ESX03).
- Assigned all new volumes to all ESX nodes so that every node saw the same set of volumes.
- Rescanned the HBA in the vSphere Infrastructure Client and verified that all volumes were seen.
- All datastore volumes were configured as VMFS volumes; RDM volumes were untouched at this point.
- Using Storage vMotion I was able to migrate all virtual servers live from the old datastore to the new datastore. This process completed rather quickly, moving all VMs within about an hour.
- I assigned the RDMs to their appropriate servers and then followed the 'Existing Windows Servers' process below, starting at step 7.
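For anyone not used to zoning the Cisco MDS switches, the commands below are a rough sketch of what one of these single-initiator zones looked like. The VSAN number, zone names, and WWPNs are all placeholders rather than my actual values.

    conf t
    ! one zone per HBA/MRC pairing (the pwwn values are placeholders)
    zone name ESX01-HBA1_ISE1-MRC1 vsan 10
      member pwwn 21:00:00:e0:8b:aa:bb:01
      member pwwn 20:00:00:cf:11:22:33:01
    zone name ESX01-HBA1_ISE2-MRC1 vsan 10
      member pwwn 21:00:00:e0:8b:aa:bb:01
      member pwwn 20:00:00:cf:11:22:33:02
    ! add the zones to the zoneset and activate it
    zoneset name Fabric1 vsan 10
      member ESX01-HBA1_ISE1-MRC1
      member ESX01-HBA1_ISE2-MRC1
    zoneset activate name Fabric1 vsan 10
    end
    copy running-config startup-config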
vSphere’s Storage vMotion proved to be a wonderful tool and worked without issue. There was zero downtime for any virtual server during the entire migration, aside from the servers with RDMs, which needed the volume dismounts described in the Windows process below.
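I ran the Storage vMotion jobs from the vSphere Client, but the same moves can be scripted with the svmotion command in the vSphere CLI. A minimal sketch, where the vCenter URL, datacenter, VM, and datastore names are made up for illustration:

    svmotion --url=https://vcenter.example.local/sdk --username=administrator \
      --datacenter=Datacenter1 \
      --vm="[OldDatastore1] ServerA/ServerA.vmx:NewDatastore1"

It can also be run with --interactive, which prompts for each of those values.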
Existing Windows Servers
- Verified that both fibre channel links were active.
- Moved 1 fibre channel port to the new fabric (Switch 2) from each server.
- Verified that my redundant paths were no longer fault tolerant.
- Zoned the fabric to allow hosts to communicate with both SANs.
- This time I zoned for MRC2 on each Emprise 5000, since these servers connect to Switch 2, which is where MRC2 on each Emprise is plugged in.
- Created host records on each Emprise 5000 for each server.
- Assigned all new volumes to their appropriate servers and, using the Disk Management MMC, rescanned and verified that all volumes were visible.
- I then initialized and converted all new volumes to dynamic disks.
- I didn’t want DFS-R to have issues with dismounting volumes so I stopped the ‘DFS Replication’ service at this point.
- If the server was a SQL server or other App server I would also stop the appropriate services to prevent I/O during the conversion process.
- Since VSS storage for all volumes is located on a single, dedicated volume (V:\) I decided to convert this one to a dynamic disk first. To do this you must dismount every volume that uses it as its VSS storage location, using 'mountvol <drive>:\ /P' (a command sketch follows this list). This is a critical process: per the documentation I have found, if VSS loses communication with its storage for more than 20 minutes you will lose all of your snapshots. Bird-dogging this process and following these steps worked very well; deviating from them resulted in one server losing all of its VSS snapshots. VSS snapshots should never be used as a primary backup.
- Once all volumes were dismounted I converted the disk to dynamic.
- After the VSS volume successfully completed I immediately remounted all volumes back to their original drive letters.
- I then selected all remaining, existing volumes and converted them to Dynamic Disks.
- Now that all of the new and old disks were dynamic, I went from largest to smallest and added each new volume as a mirror of its old counterpart. I did it this way because the mirror dialog only shows disks large enough to hold the mirror, so by working from largest to smallest I only ever had one disk to choose from as I moved down the list.
- I was now able to enable the ‘DFS Replication’ service without worry.
- After mirroring was completed I right clicked each original volume in the 'Disk Management' MMC and removed it from the mirror. This leaves a simple volume using only the new storage.
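For reference, the dismount, convert, and mirror steps above can also be driven from a command prompt. A rough sketch using mountvol and diskpart, where the drive letter, disk numbers, and volume number are placeholders for whatever Disk Management shows on the actual server:

    REM Dismount the VSS storage volume (the drive letter is an example)
    mountvol V:\ /P

    REM Inside diskpart: convert the old and new disks to dynamic,
    REM then add the new disk as a mirror of the existing volume.
    diskpart
      list disk
      select disk 4
      convert dynamic
      select disk 5
      convert dynamic
      select volume 3
      add disk=5
      exit

    REM Once the resync finishes, remove the old half of the mirror in the
    REM Disk Management MMC as described above, then remount the VSS volume
    REM (the volume GUID comes from running mountvol with no arguments).
    mountvol V:\ \\?\Volume{xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}\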
Exchange Server Move
The Exchange server was a much smoother and more straightforward process, since it was not previously SAN connected.
- Zoned the fabric to allow the host to communicate with the new SANs.
- Since I have 2 HBAs going to 2 Emprise 5000s I needed to create 4 zones for this host:
- Host-HBA1/ISE1-MRC1
- Host-HBA1/ISE2-MRC1
- Host-HBA2/ISE1-MRC2
- Host-HBA2/ISE2-MRC2
- Created host records on each Emprise 5000 for the server, including both HBAs.
- Assigned all new volumes to the server and, using the Disk Management MMC, rescanned and verified that all volumes were visible. For each storage group I created a RAID 5 volume for the database files and a RAID 10 volume for the transaction logs.
- Initialized and created partitions on all of the new volumes.
- Using Exchange System Manager I then went to each Storage Group and changed the paths of the log files and system directories to point to the new log volume. Exchange will then dismount the stores, move those items to the new location, and remount the stores when completed.
- In each Mailbox Store for each storage group I then went through and changed the path for the database and streaming database and pointed them at their new location. Again, Exchange will dismount the store, move the files for you and remount the store when completed.
Exchange’s tools made the process extremely easy, though slow, and it was a downtime event for all users of a mailbox store while it was being moved.
Final Steps
After all servers were done mirroring their volumes and the mirrors were broken, I used ICON Manager, a XIOTech tool, to verify that there was no more I/O to any of the old volumes; if there was, I could go to the assigned server and see what was still using that volume.
After I/O was verified I removed the zoning to the original SAN for all servers and moved their secondary HBAs to the new fabric. The fabric was then zoned appropriately, and using the MPIO tools I verified that I now had two paths to each volume.
In the vSphere Infrastructure Client I then set the path selection policy on all volumes to Round Robin so that the load was spread across all HBAs.
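Setting this one datastore at a time in the client gets tedious; on ESX 4.0 the same policy can also be set from the console with esxcli. The naa identifier below is a placeholder for an actual device ID:

    # list SAN devices and their current path selection policy
    esxcli nmp device list

    # set a single device to Round Robin (the naa ID is a placeholder)
    esxcli nmp device setpolicy --device naa.6001f930004f0000000000000000a1b2 --psp VMW_PSP_RR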
The only downtime during this process came from the conversion to dynamic disks, which requires the volumes to be dismounted, and from the Exchange migration. We did have 1 issue during the mirroring where several customers received delayed write failures on the DFS path of their H: drive, where they had rather large PST files, though no data loss appears to have occurred. Aside from that the migration was successful.
The mirroring was started around 2am on a Wednesday and completed by 9:30am Thursday morning. The total migration process took around 39 hours from start to finish.
1 site down, 1 to go. Only 10 user issues out of 700, and those were due to extremely large (2+ GB) PST files choking on the limited disk I/O and degraded server performance during the migration.
Environment 2 Process
Environment 2 was very similar to Environment 1 with the following differences, which are the only things I will cover here:
- Servers were already set up for MPIO and the only change was zoning them into the new fabric, with 2 zones per server as follows:
- ServerX-HBA1/ISE1-MRC1
- ServerX-HBA2/ISE1-MRC2
- The ESXi nodes did not support Storage vMotion, so manual migration of the VMs was required and was a downtime event.
ESXi Migration Process
For the ESXi migration I followed the same zoning procedures as outlined previously. The ESXi servers had SSH enabled for remote console access from previous work. You can also use the vCenter Converter, but I found it took longer.
- All volumes were mapped from the new SAN to the servers.
- HBAs were rescanned and I verified that all volumes were visible.
- Configured MPIO on all volumes and set the mode to Round Robin.
- Formatted all datastores as VMFS.
- Shut down all virtual machines on a given server.
- In the Edit Settings dialog I made note of all RDMs assigned to the server and removed them, choosing to delete the files from disk (this removes only the RDM mapping file, not the underlying data). This is a critical step; failure to do it will result in the RDMs being copied in their entirety to the new datastore rather than just the mapping files.
- From the remote management console, as root, I went through each virtual server's folder on each datastore and verified that no RDM mapping files remained. I then copied the virtual server's folder, in full, to the new datastore (a command sketch follows this list).
- After the copy completed I went into the vSphere Infrastructure Client and browsed the new datastores. Entering each virtual server's folder, I right clicked the .vmx file, chose 'Add to Inventory', and gave it a name such as 'ServerName-E5K' to denote that it was now located on the Emprise 5000.
- If the server did not have any RDMs the process is done: power on the new VM, answer that it was moved when prompted, and remove the original from the inventory.
- If the server did have RDMs you will now need to add the original RDMs back in as well as the new RDMs. You can then start the virtual server and perform data migration from old to new as mentioned above in the section ‘Existing Windows Servers’ starting at step 7.
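The copy itself was done from the ESXi console over SSH. A rough sketch is below; the datastore and server names are made up, and vmkfstools is shown as an alternative for the .vmdk files since it can preserve or change the disk format where a plain cp cannot:

    # run as root from the ESXi console; paths and names are examples only
    cd /vmfs/volumes

    # copy a virtual machine's folder from the old datastore to the new one
    cp -R "OldDatastore1/ServerA" "NewDatastore1/"

    # alternative for the virtual disks: clone them with vmkfstools
    vmkfstools -i "OldDatastore1/ServerA/ServerA.vmdk" "NewDatastore1/ServerA/ServerA.vmdk" -d thin

    # (optional) register the copied VM from the console instead of the client
    vim-cmd solo/registervm /vmfs/volumes/NewDatastore1/ServerA/ServerA.vmx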
Final Steps
After I/O was again verified, all zoning to the original SAN was removed. No fibre channel cables needed to be re-patched, since everything was already connected to the same fabric.
Summary
Because of the ESXi migrations Environment 2 ended up taking 2 full nights of hands-on work. All data was fully migrated within a span of 4 days; it only took that long because I waited a day in between the night shifts.
We had 1 issue with our BlackBerry BES server losing communication with Exchange, which a reboot of BES fixed. Aside from that, the only issue reported was from an off-site user who had not been notified and could not reach an app server for a few hours while it was being migrated.