I have been a big consumer of digital storage since my first computer in 1985. I digitize almost everything and really do not throw much away. I had hundreds of floppies, bought my first 1 GB SCSI drive in 1995 for $1,000, and have always been pushing the boundaries with my personal document collection, photos, music, and movies. As an example, I ripped all the CDs I owned back in 1997 as 256 kbps MP3s. As I continued adding more CDs to my wall, I kept increasing the bit rates. I went back this summer and re-ripped the whole cabinet (about 500 CDs) into Apple Lossless format. Ditto for my DVD and Blu-ray collection. When you combine that with the tens of thousands of photos (original and scanned film) and every document I have written since high school in the '80s, through college, and up to today, I have amassed quite a lot of digital assets.
To securely store all of these bits, I wanted an explicitly reliable filesystem. The best one I know of today is ZFS. Not only does it support different levels of redundancy, it also stores a checksum of every block and can guarantee that I retrieve the data exactly as it was written. The "brain-dead" filesystems commonly used on Windows and Linux machines cannot verify that every data block is not only readable but also matches what was written. I have had issues in the past where fsck and/or chkdsk returned a good status, but the data itself was corrupted.
The primary source of transient data errors is the Unrecoverable Read Error (URE), which is typically specified at around one error per 10^14 bits read for your normal platter drives. The guys at CERN have measured this and found an effective error rate of 3*10^-7. So basically, if you have 1TB of data, expect about 3 of your files to have silent corruption if you are not using a checksum-based filesystem. This “bit-rot” problem grows as you store more and more data. When we had 1GB drives, it was extremely uncommon, but at 24TB it is practically a certainty. I have not had those same issues since moving to ZFS.
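To put rough numbers on how the problem scales, here is a back-of-the-envelope sketch in Python. It assumes the spec-sheet figure of one unrecoverable read error per 10^14 bits, decimal gigabytes and terabytes, and independent errors (a Poisson model). Those modeling choices and the function names are my own illustration, not measurements.

```python
import math

# Assumption: spec-sheet URE rate of 1 error per 10^14 bits read.
URE_BITS_PER_ERROR = 1e14

GB = 10**9   # decimal gigabyte, as drive vendors count
TB = 10**12  # decimal terabyte


def expected_read_errors(capacity_bytes: float) -> float:
    """Expected number of UREs when reading capacity_bytes once, end to end."""
    return capacity_bytes * 8 / URE_BITS_PER_ERROR


def prob_at_least_one_error(capacity_bytes: float) -> float:
    """Chance of one or more UREs, assuming independent (Poisson) errors."""
    return 1.0 - math.exp(-expected_read_errors(capacity_bytes))


for label, size in [("1 GB", 1 * GB), ("1 TB", 1 * TB), ("24 TB", 24 * TB)]:
    print(f"{label:>6}: expected errors = {expected_read_errors(size):.5f}, "
          f"P(at least one) = {prob_at_least_one_error(size):.1%}")
```

Under these assumptions a single full read of 24 TB expects roughly two errors, while a full read of 1 GB expects fewer than one in ten thousand.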
But that is all a side note to the real point of this article. How do I store all of those assets, and furthermore, how have I gone "Green" in the process? From my prior articles, you know I used a Norco 4U rack stuffed full of 20 Seagate 1.5TB drives. I used two SuperMicro MV8 8-port JBOD controllers to connect them to Solaris 10, where I had created a RAIDZ2 pool across 16 drives (8 on each controller), with 2 spare drives for the pool and 2 drives for the OS mirror on the onboard SATA ports. RAIDZ2 allows 2 drives to fail while the pool can still read and write data in a DEGRADED state. While that system was effective, stable, and extremely fast, it also consumed quite a bit of power, and the case with all of its fans made a lot of noise just to keep all those drives cool.
When I moved into my new house, I spent quite a bit on making it energy efficient. I replaced all the lights with LEDs (that deserves its own blog post about the importance of CRI in your household lights and ELV dimmers), ripped out and replaced much of the insulation, replaced the air conditioners with high SEER models, etc. So when it was time to rebuild the "storage server", I had to apply the same thinking.
But I did not want to focus exclusively on a "storage server". I had lots of other computers in the house, burning power, generating heat (in Texas where we use the A/C nearly year round), and taking up space. These were:
- HomeSeer for home automation of all those new Z-Wave switches, pool, irrigation, etc.
- A ClearOS network server for DHCP & DNS (the reliability of the AT&T U-verse "Home Gateway" was problematic)
- A CCTV recorder for my security cameras
- A media server for the Xbox, PS3, and Android devices
- An HTPC, at first running Plex, then switching to XBMC
- A Mac Mini running iTunes for the iPhones in the house and Apple TV
- A Bitcoin manager for playing around with virtual currency
The list below shows what is currently live and running in my main system. As you can see, I have been purchasing parts over the last few years as ideas (and budget!) came and went.
- 2012 Fall
  - 2 x OCZ VTX3-25SAT3-120G Vertex 3 Solid State Drive
- 2012 Spring
  - XIGMATEK Extreme Silent 140mm Case Fans
  - APC SmartUPS 2200RM from eBay
  - APC 9630 UPS Management Card 2
- 2011 Fall
  - ASUS Radeon HD 6870 1GB 256-bit GDDR5 PCI Express Dual DVI/HDMI Video Card
  - PCI Express 4 channel CCTV card SK-D104
- 2011 Spring
  - Fractal Design Define R3 case
  - Rosewill Lightning 1000W 80 Plus Gold
  - ICY DOCK SSD bracket
  - ASUS P7P55D Pro motherboard
  - Zalman CNPS 9700 NT Ultra Quiet CPU Cooler
  - G.Skill Ripjaw (4 x 4GB) DDR3-1333
  - Intel Core i7 860 2.8 GHz Processor
  - 8 x 3TB WD Green HDD WD30EZRSDTL (these run at 5900 RPM and have a 4k sector size)
- 2010 Fall
  - LSI MegaRAID Low-Power 9240-8i RAID controller
  - areca MiniSAS to 4 SATA breakout cable
  - 1 x Seagate 1.5TB drive
For the storage service, I took the LSI PCI Express card and assigned it directly to the Solaris 11 VM so that it could bypass all the virtualization for I/O. Then I made a raidz2 array across the 8 drives (the Solaris OS lives on a virtual disk on the SSD) and exported it via CIFS, NFS, and a little bit of iSCSI.
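As a rough sanity check on sizing for a raidz2 array like this one, the sketch below estimates raw and usable capacity. It assumes decimal terabytes on the drive label and binary units in the ZFS tools, and it ignores metadata, padding, and reserved space, so real usable space will be somewhat lower. The function name is mine, for illustration only.

```python
TB = 10**12   # drive vendors advertise decimal terabytes
TIB = 2**40   # ZFS tools print binary units ("T" means TiB)


def raidz_capacity_tib(drives: int, drive_tb: float, parity: int = 2) -> tuple[float, float]:
    """Return (raw, approximate usable) capacity in TiB for one raidz vdev.

    Rough estimate only: ignores metadata, padding, and reserved space.
    """
    if drives <= parity:
        raise ValueError("need more drives than parity devices")
    raw = drives * drive_tb * TB / TIB
    usable = (drives - parity) * drive_tb * TB / TIB
    return raw, usable


raw, usable = raidz_capacity_tib(drives=8, drive_tb=3.0)
print(f"8 x 3TB raidz2: raw {raw:.1f} TiB, usable about {usable:.1f} TiB")
```

Note that `zpool list` reports the raw size of a raidz pool, parity included, so the raw figure is the one to compare against its SIZE column.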
    $ zpool list
    NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
    rpool  31.8G  5.36G  26.4G  16%  1.00x  ONLINE  -
    tank2  21.8T  14.8T  6.99T  67%  1.00x  ONLINE  -
    $ zpool status tank2
      pool: tank2
     state: ONLINE
      scan: scrub repaired 0 in 50h37m with 0 errors on Mon Oct 29 19:28:13 2012
    config:

            NAME         STATE     READ WRITE CKSUM
            tank2        ONLINE       0     0     0
              raidz2-0   ONLINE       0     0     0
                c5t12d1  ONLINE       0     0     0
                c5t15d1  ONLINE       0     0     0
                c5t10d1  ONLINE       0     0     0
                c5t9d1   ONLINE       0     0     0
                c5t8d1   ONLINE       0     0     0
                c5t11d1  ONLINE       0     0     0
                c5t14d1  ONLINE       0     0     0
                c5t13d1  ONLINE       0     0     0

    errors: No known data errors

With this configuration, the pool reports 21.8T of raw capacity! Because raidz2 sets aside two drives' worth of space for parity, the usable space is roughly that of six 3TB drives, or about 16.4 TiB. I run a ‘zpool scrub’ on the main array once a month to check for data errors and correct them while there is still enough redundant data.
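A monthly scrub is only useful if something looks at the result afterwards. Here is a minimal sketch of a health check that reads `zpool status` text and flags anything other than an ONLINE pool with no known data errors. The sample text is abbreviated from the output above, and the function names are hypothetical. On the real server the text would come from running `zpool status tank2`, for example through `subprocess.run`.

```python
import re

# Abbreviated sample of `zpool status tank2` output (config section omitted).
SAMPLE = """\
  pool: tank2
 state: ONLINE
  scan: scrub repaired 0 in 50h37m with 0 errors on Mon Oct 29 19:28:13 2012
errors: No known data errors
"""


def summarize_zpool_status(text: str) -> dict:
    """Extract the 'pool', 'state', 'scan' and 'errors' fields from status text."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*(pool|state|scan|errors):\s*(.*)", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def pool_is_healthy(text: str) -> bool:
    """True when the pool is ONLINE and reports no known data errors."""
    fields = summarize_zpool_status(text)
    return (fields.get("state") == "ONLINE"
            and fields.get("errors") == "No known data errors")


print(summarize_zpool_status(SAMPLE)["scan"])
print(pool_is_healthy(SAMPLE))
```

A check like this could run from cron a few days after the scrub starts and send mail when `pool_is_healthy` returns False.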
The diagram to the right represents the memory allocation I have settled on. With ESXi v5, I can also page memory over-allocations to SSD. It also shows the DirectPath configuration and the USB devices attached to the VMs.
This diagram below represents how I have split the VMs across the datastores.
Now, what do I do for backups? My original storage server is still running. It wakes up once a week in the summer and does an rsync with the new storage server. In the winter (or more precisely, when the outdoor temperature is under 60 degrees F), it runs SETI@Home full time for “intelligent heating”. For backing up the VMs, I still have to shut them down manually every once in a while and do an export. I want to be able to back them up directly (at the VM layer) while they are online, but I have not dug into how I can do that from the ESXi kernel directly. scp on the ESXi host will not copy .vmdk files of running VMs. I also haven’t been able to get ESXi to attach to an iSCSI target that is hosted on a VM itself. If I could do that, then I’d only have to back up one VM regularly (the storage server VM hosting the iSCSI devices).
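The seasonal schedule for the old server could be expressed as a small decision function. This is only a sketch: the 60 degrees F threshold comes from the text, but how the temperature is obtained, the function names, and the host and paths in the rsync example are all assumptions, and `-a` (archive mode) is just a common choice rather than the flags actually in use.

```python
# From the text: below this outdoor temperature the old server doubles as a heater.
HEATING_THRESHOLD_F = 60.0


def old_server_mode(outdoor_temp_f: float) -> str:
    """Return 'heat' (run SETI@Home full time) or 'backup' (weekly rsync)."""
    return "heat" if outdoor_temp_f < HEATING_THRESHOLD_F else "backup"


def rsync_argv(source: str, dest: str) -> list[str]:
    """Build an archive-mode rsync command line without running it."""
    return ["rsync", "-a", source, dest]


for temp in (45.0, 85.0):
    print(temp, old_server_mode(temp))

# Hypothetical host name and paths, for illustration only.
print(" ".join(rsync_argv("newserver:/tank2/", "/tank1/")))
```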
    $ zpool list
    NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
    rpool   696G  6.44G   690G   0%  1.00x  ONLINE  -
    tank1  21.8T  10.8T  10.9T  49%  1.00x  ONLINE  -
    $ zpool status tank1
      pool: tank1
     state: ONLINE
      scan: resilvered 35.6G in 1h46m with 0 errors on Thu Aug 9 19:12:41 2012
    config:

            NAME         STATE     READ WRITE CKSUM
            tank1        ONLINE       0     0     0
              raidz2-0   ONLINE       0     0     0
                c9t0d0   ONLINE       0     0     0
                c9t1d0   ONLINE       0     0     0
                c9t2d0   ONLINE       0     0     0
                c9t3d0   ONLINE       0     0     0
                c9t4d0   ONLINE       0     0     0
                c9t5d0   ONLINE       0     0     0
                c9t6d0   ONLINE       0     0     0
                c9t7d0   ONLINE       0     0     0
                c10t0d0  ONLINE       0     0     0
                c10t1d0  ONLINE       0     0     0
                c10t2d0  ONLINE       0     0     0
                c10t3d0  ONLINE       0     0     0
                c10t4d0  ONLINE       0     0     0
                c10t5d0  ONLINE       0     0     0
                c10t6d0  ONLINE       0     0     0
                c10t7d0  ONLINE       0     0     0
            spares
              c8t0d0     AVAIL

    errors: No known data errors

Well, I am all out of time! Let me know if there are any questions or mistakes.
Notes:
- My ESXi server had an issue with PSODs (Purple Screen of Death) after I installed the SSDs. It would be a random world every time, but always a #PF Exception 14 with vmk_Memcpy@vmkernel at fault. After many hours of diagnosis, I traced this to having a single VMFS5 datastore spanning both SSD drives. Also of note: I was using the host memory cache feature as well. When I rebuilt the VMFS as two separate datastores and manually balanced my VMs, all the PSODs ceased.
- One of the CIFS clients was doing lots of small reads and writes and nearly always showed the disk as 100% overloaded. This was related to the Windows SMB client doing synchronous writes to the share. I turned off forced sync writes to disk on just that one share (zfs set sync=disabled tank1/scratch) and performance was fantastic.
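For the sync change in the last note, here is a small sketch that builds the `zfs` command lines for setting and reading back the `sync` property on a single dataset. It only constructs the argument lists and does not run them; the function names are hypothetical. The three accepted values are the ones the ZFS `sync` property takes.

```python
# Values accepted by the ZFS 'sync' dataset property; 'standard' is the default.
VALID_SYNC_MODES = ("standard", "always", "disabled")


def zfs_set_sync(dataset: str, mode: str) -> list[str]:
    """Build the argv for 'zfs set sync=<mode> <dataset>' without running it."""
    if mode not in VALID_SYNC_MODES:
        raise ValueError(f"sync must be one of {VALID_SYNC_MODES}, got {mode!r}")
    return ["zfs", "set", f"sync={mode}", dataset]


def zfs_get_sync(dataset: str) -> list[str]:
    """Build the argv for reading the current sync setting back."""
    return ["zfs", "get", "sync", dataset]


print(" ".join(zfs_set_sync("tank1/scratch", "disabled")))
print(" ".join(zfs_get_sync("tank1/scratch")))
```

With sync=disabled, writes are acknowledged before they reach stable storage, so the last few seconds of data can be lost on a crash or power failure. That trade-off is why it makes sense to limit the setting to a scratch share; setting the property back to "standard" restores the default behavior.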