Tuesday, January 22, 2013

Storage Server gone Green!

It has been a while, but this topic keeps coming up in casual conversations, so I thought it was about time to refresh everyone on what I am doing for my storage solution.

I have been a big consumer of digital storage since my first computer in 1985. I digitize almost everything and really do not throw much away. I had hundreds of floppies, bought my first 1 GB SCSI drive in 1995 for $1,000, and have always been pushing the boundaries with my personal document collection, photos, music, and movies. As an example, I ripped all the CDs I own back in 1997 as 256k MP3s. As I continued adding more CDs to my wall, I kept increasing the bit rates. I went back this summer and re-ripped the whole cabinet (about 500 CDs) into Apple Lossless format. Ditto for my DVD and BluRay collection. When you combine that with the tens of thousands of photos (original digital and scanned film), and every document I have written from high school in the '80s through college to today, I have amassed quite a lot of digital assets.

To securely store all of these bits, I wanted a filesystem with explicit reliability guarantees. The best one I know of today is ZFS. Not only does it support different levels of redundancy, it also stores checksums of every block and can guarantee that I get the data back exactly as it was written. The "brain-dead" filesystems commonly used on Windows and Linux machines cannot verify that every data block is not only readable, but also matches what was originally written. I have had issues in the past where an fsck and/or chkdsk returned a good status, but the data content itself was corrupted.

The primary source of transient data errors is the Unrecoverable Read Error (URE), typically rated at around 1 in 10^14 bits for normal platter drives. The guys at CERN measured this in practice and found an effective URE rate of 3*10^7. So basically, if you have 1TB of data, expect that about 3 of those files have silent corruption if you are not using a checksum-based filesystem. This "bit-rot" problem grows as you store more and more data. When we had 1GB drives, it was extremely uncommon, but at 24TB it is a certainty. I have not had those same issues since moving to ZFS.
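As a quick back-of-the-envelope check (my own arithmetic, not a vendor figure): a single full read pass over 24TB at the rated 1-error-per-10^14-bits URE already expects about two unrecoverable errors, and every scrub is another full pass.

# Expected unrecoverable read errors for one full pass over 24TB of data,
# assuming the usual consumer-drive rating of 1 URE per 10^14 bits read:
$ echo "24 * 10^12 * 8 / 10^14" | bc -l
1.92000000000000000000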

But that is all a side note to the real point of this article. How do I store all of those assets, and furthermore, how have I gone "Green" in the process? From my prior articles, you know I used a Norco 4U rackmount case stuffed full of 20 Seagate 1.5TB drives. I then used two SuperMicro MV8 8-port JBOD controllers to connect them to Solaris 10, where I created a RAIDZ2 across 16 drives (8 drives on each controller), with 2 spare drives for the pool and 2 drives on the onboard SATA ports for the OS mirror. RAIDZ2 allows 2 drives to fail while still being able to read and write data in a DEGRADED state. While that system was effective, stable, and extremely fast, it also consumed quite a bit of power, and the case with all of its fans made a lot of noise just to keep those drives cool.
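For reference, building a pool like that old one boils down to a single zpool command. This is a sketch from memory rather than my exact shell history; the drive names match the zpool status output near the end of this post, and the second spare's name is a guess.

# Sketch only - drive names match the tank1 status output later in this post;
# the second spare (c8t1d0) is a guess, since only one still shows up as AVAIL.
$ zpool create tank1 raidz2 \
    c9t0d0 c9t1d0 c9t2d0 c9t3d0 c9t4d0 c9t5d0 c9t6d0 c9t7d0 \
    c10t0d0 c10t1d0 c10t2d0 c10t3d0 c10t4d0 c10t5d0 c10t6d0 c10t7d0 \
    spare c8t0d0 c8t1d0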

When I moved into my new house, I spent quite a bit on making it energy efficient. I replaced all the lights with LEDs (that deserves its own blog post about the importance of CRI in your household lights and ELV dimmers), ripped out and replaced much of the insulation, replaced the air conditioners with high SEER models, etc. So when it was time to rebuild the "storage server", I had to apply the same thinking.

But I did not want to focus exclusively on a "storage server". I had lots of other computers in the house, burning power, generating heat (in Texas where we use the A/C nearly year round), and taking up space. These were:
  • HomeSeer for home automation of all those new Z-Wave switches, the pool, irrigation, etc.
  • A ClearOS network server for DHCP & DNS (the reliability of the AT&T U-verse "Home Gateway" was problematic)
  • A CCTV recorder for my security cameras
  • A media server for the Xbox, PS3, and Android devices
  • An HTPC, at first running Plex, then switching to XBMC
  • A Mac Mini running iTunes for the iPhones in the house and the AppleTV
  • A Bitcoin manager for playing around with virtual currency
I had been using VMware's ESXi for a while and really liked the way I could virtualize several systems into one box. However, all of the systems I had were more than compute nodes; they had specialized IO cards and adapters. It was not until Intel's Nehalem generation and VMware's DirectPath came along that this problem could be solved. With DirectPath, I could still virtualize the computers but give them exclusive access to dedicated PCI cards for their own use. That was the last hurdle, and it was gone! So I decided to build a large(ish) server around a new i7 chip, get some new PCI Express cards, put all my eggs into one basket, and start doing my part to save the planet.
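DirectPath setup is mostly point-and-click in the vSphere Client (Configuration > Advanced Settings, then a host reboot), but the ESXi shell is handy for finding the right PCI device first. The command below is a generic example, not a capture from my host:

# Dump the PCI inventory and find the storage controller's address; that is
# the device to mark for passthrough on the DirectPath I/O configuration page.
$ esxcli hardware pci list | more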

The parts list below represents what is currently live and running in my main system. As you can see, I have been purchasing parts over the last few years as ideas (and budget!) came and went.
The physical space is a little tight in the case. There are 8 x 3.5” internal drive slots and 2 x 5¼” front bays. I used the IcyDock to mount the two SSDs, plus a normal adapter for the 1.5TB VMFS volume, behind a fan speed controller in the top bays. The 1000 Watt power supply provides plenty of juice for the drives, processor, and video card. There is 16GB of RAM attached to the Intel i7 processor running at 2.8GHz. The dedicated LSI PCI Express controller card is attached to the 8 x 3TB SATA drives using two MiniSAS-to-SATA breakout cables and is configured as JBOD, so each drive has its own path to the controller without any switching.

I am running ESXi 5.1.0 after my latest rebuild, where I installed two SSDs for better performance. I did it because my VMs were starting to run slowly and encountering high latency while they all shared a single 1.5TB 7200 RPM Seagate drive. ESXi now boots off of an SSD along with the smaller VMs. The original 1.5TB boot drive is now datastore_hdd and is used for the VMs which do not need high IO rates. I have used thin provisioning on the SSD datastore and thick provisioning on the platter datastore.
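Moving a VM's disk onto the SSD datastore as thin-provisioned is a one-liner from the ESXi shell with the guest powered off. The SSD datastore and VM names below are placeholders, not my exact ones:

# Clone a VMDK onto the SSD datastore in thin format (run with the VM off),
# then re-point the VM at the new disk and delete the old one.
$ vmkfstools -i /vmfs/volumes/datastore_hdd/homeseer/homeseer.vmdk \
    /vmfs/volumes/datastore_ssd/homeseer/homeseer.vmdk -d thin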

For the storage service, I took the LSI PCI Express card and assigned it directly to the Solaris 11 VM so it could bypass all the virtualization layers for IO. Then I made a raidz2 array across the 8 drives (the Solaris OS itself lives on a virtual disk on the SSD) and exported it via CIFS, NFS, and a little bit of iSCSI.
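The pool and share setup on the Solaris 11 side is only a handful of commands. This is a reconstruction rather than my exact shell history, and the dataset name and share choices are examples:

# Build the RAIDZ2 pool across the 8 passed-through drives, then carve out a
# dataset and publish it over SMB/CIFS and NFS (dataset name is an example).
$ zpool create tank2 raidz2 c5t8d1 c5t9d1 c5t10d1 c5t11d1 \
    c5t12d1 c5t13d1 c5t14d1 c5t15d1
$ zfs create tank2/media
$ zfs set sharesmb=on tank2/media
$ zfs set sharenfs=on tank2/media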

$ zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  31.8G  5.36G  26.4G  16%  1.00x  ONLINE  -
tank2  21.8T  14.8T  6.99T  67%  1.00x  ONLINE  -
$ zpool status tank2
  pool: tank2
 state: ONLINE
  scan: scrub repaired 0 in 50h37m with 0 errors on Mon Oct 29 19:28:13 2012
config:

        NAME         STATE     READ WRITE CKSUM
        tank2        ONLINE       0     0     0
          raidz2-0   ONLINE       0     0     0
            c5t12d1  ONLINE       0     0     0
            c5t15d1  ONLINE       0     0     0
            c5t10d1  ONLINE       0     0     0
            c5t9d1   ONLINE       0     0     0
            c5t8d1   ONLINE       0     0     0
            c5t11d1  ONLINE       0     0     0
            c5t14d1  ONLINE       0     0     0
            c5t13d1  ONLINE       0     0     0

errors: No known data errors

With this configuration I have a 21.8TB pool! (Usable space is somewhat less once the two drives' worth of RAIDZ2 parity is subtracted.) I run a 'zpool scrub' on the main array once a month to check for any data errors and correct them while there is still enough redundancy to do so.
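The monthly scrub is just a root crontab entry on the Solaris VM; the exact day and hour below are only an example schedule.

# Scrub tank2 at 2am on the 1st of every month (added with 'crontab -e' as root).
0 2 1 * * /usr/sbin/zpool scrub tank2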


The diagram to the right represents the memory allocation I have settled on. With ESXi v5, I can also page memory over-allocations to SSD. It also shows the DirectPath configuration and the USB devices attached to the VMs.

The diagram below represents how I have split the VMs across the datastores.


Now, what do I do for backups? My original storage server is still running. It wakes up once a week in the summer and does an rsync with the new storage server. In the winter (or, more precisely, when the outdoor temperature is under 60 degrees F), it runs SETI@Home full time for "intelligent heating". For backing up the VMs, I still have to shut them down manually every once in a while and do an export. I want to be able to back them up directly (at the VM layer) while they are online, but I have not dug into how to do that from the ESXi kernel directly; scp on the ESXi host will not copy .vmdk files of running VMs. I also haven't been able to get ESXi to attach to an iSCSI target which is hosted on a VM itself. If I could do that, then I'd only have to back up one VM regularly (the storage server VM hosting the iSCSI devices).
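The weekly sync itself is nothing fancy; once the old server wakes up, it essentially runs an rsync pull like the one below (the hostname and paths are placeholders):

# Pull everything from the new server's pool onto the old one, deleting
# anything that was removed on the source since the last run.
$ rsync -aH --delete newserver:/tank2/ /tank1/

The zpool output just below is from that original (backup) server.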



$ zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
rpool   696G  6.44G   690G     0%  1.00x  ONLINE  -
tank1  21.8T  10.8T  10.9T    49%  1.00x  ONLINE  -
$ zpool status tank1
  pool: tank1
 state: ONLINE
  scan: resilvered 35.6G in 1h46m with 0 errors on Thu Aug  9 19:12:41 2012
config:

        NAME         STATE     READ WRITE CKSUM
        tank1        ONLINE       0     0     0
          raidz2-0   ONLINE       0     0     0
            c9t0d0   ONLINE       0     0     0
            c9t1d0   ONLINE       0     0     0
            c9t2d0   ONLINE       0     0     0
            c9t3d0   ONLINE       0     0     0
            c9t4d0   ONLINE       0     0     0
            c9t5d0   ONLINE       0     0     0
            c9t6d0   ONLINE       0     0     0
            c9t7d0   ONLINE       0     0     0
            c10t0d0  ONLINE       0     0     0
            c10t1d0  ONLINE       0     0     0
            c10t2d0  ONLINE       0     0     0
            c10t3d0  ONLINE       0     0     0
            c10t4d0  ONLINE       0     0     0
            c10t5d0  ONLINE       0     0     0
            c10t6d0  ONLINE       0     0     0
            c10t7d0  ONLINE       0     0     0
        spares
          c8t0d0     AVAIL   

errors: No known data errors

Well, I am all out of time! Let me know if there are any questions or mistakes.

Notes:
  1. My ESXi server had an issue with PSODs (Purple Screen of Death) after I installed the SSDs. It would be a random world each time, but always a #PF Exception 14 with vmk_Memcpy@vmkernel at fault. After many hours of diagnosis, this turned out to be related to having a single VMFS5 datastore spanning both of the SSD drives. Also of note: I was using the host memory cache feature as well. When I rebuilt the VMFS as two separate datastores and manually balanced my VMs across them, all the PSODs ceased.
  2. One of the CIFS clients was doing lots of small IO reads and writes, and the disk nearly always showed as 100% overloaded. This was related to the Windows SMB client requesting synchronous writes to the share. I turned off forced sync writes to disk on just that one share (zfs set sync=disabled tank1/scratch) and performance was fantastic.