Wiki Blog/2005-02-27

Homebuilt 2 Terabytes RAID5 File Server Running Linux

About two years ago, I built a computer to store all my files (movies, music, digital photos etc.). As always, capacity quickly became an issue: initially, it had 40 gigabytes (an IBM drive I got back from RMA), then I purchased a 180GXP with 180 gigabytes. Some time later, I added a 10 gigabyte SCSI drive (together with an Adaptec AHA-2940UW, both pulled from an old IBM workstation). Nota bene, all these drives had separate file systems. Life was better, but as the system filled up, big files would no longer fit on any single file system, even when the total free space across all of them was larger than the file.

Another thing bothered me: my 100 Mbit intranet was saturated when transferring large files from or to my file server. Movies and other big files were dreadfully slow to move around.

Specifications of a new system

I decided to upgrade and made a small checklist of what would be most important:
So I started my research on the web. These were the preliminary results:

Choice of motherboard: Asus A8N-SLI deluxe

The NVidia NForce4 chipset is designed for the AMD64 architecture (939 package form). It features PCI-Express channels, a gigabit ethernet port and 4 SATA ports. I chose the ASUS A8N-SLI deluxe, as it has an additional SiI 3114 with 4 more SATA ports. This chip (although capable of 66 MHz) is connected to the chipset over the 33 MHz 32-bit PCI bus and thus theoretically maxes out at 133 megabytes/s. This leaves only about 33 megabytes/s per drive, which is a bottleneck compared to the performance of the nForce4 SATA ports (see below). Additionally, there is a secondary gigabit ethernet interface on the motherboard. I suspect it is connected to the chipset over a 33 MHz 32-bit PCI bus as well.

Choice of hard drive: Maxtor DiamondMax10 300 gigabytes

I was now bound to 8 drives because of the motherboard. I got the most bang for the buck (excuse my wording) with the 63B300S0 300 gigabyte Maxtor drives.

Choice of SATA backplanes: Sunnytek SNT-3141SATA

After some very deep digging, I found the Sunnytek SNT-3141SATA backplanes. Each takes 4 SATA drives and occupies three 5.25" slots in a chassis. This meant that I would need a chassis with six 5.25" slots.

Choice of chassis: Silverstone TJ01S

This chassis has the required six 5.25" slots and a very nice finish, along with a door to cover the 5.25" slots. I removed the door for better ventilation, though.

Choice of power supply: Thermaltake PurePower 560W

I first bought the Enermax 370W (type EG375AX-VE(G)), but it just would not work with the main board: there were sudden power downs, and most of the time I couldn't even get the system to start (and this was without any drives, just CPU, RAM and a PCI graphics card). So I got an almost no-name 420W PSU from my dealer as a temporary replacement, and that worked. It has been running 24/7 for about three weeks now without problems.
I installed a Thermaltake PurePower 560W in the meantime, which should provide more reserves, given that temperatures will be significantly higher in summer than they are now.

Choice of CPU: AMD64 3200+

This processor was the one my dealer had in stock. Software RAID consumes quite a bit of CPU, and if you have a fast storage sub-system (i.e. hard drives and interfaces), things might get tight in combination with gigabit ethernet, which in turn eats away some CPU resources as well when transferring data at full speed...

Setting it all up

As I had hoped, using SATA made the cabling very easy. The system is quite loud because of the fans of the SATA cages and thus belongs in a basement. You don't want to carry it around a lot either, because it is heavy, believe me.

Operating System and RAID setup

I chose Debian as operating system and used the new installer (the current version is the release candidate RC2). For simplicity, I installed the 32-bit version, even though the processor would allow a 64-bit operating system and software. Surprisingly, the installer (using the 2.6.8 Linux kernel) detected all hardware correctly and without any manual intervention. Kudos to the Debian people!

I had to set up the RAID system manually after the installation, using mdadm. Sensitized by the problems with my old machine, I chose to go with the Logical Volume Manager. This software allows you to add and remove hard drives almost instantly and on the fly; you can also resize partitions etc. Multiple physical drives form a pool (called Logical Volume Group) of usable space. This pool can be divided into Logical Volumes. They are almost like partitions, and each can contain one file system.
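The pooling idea can be illustrated with a little arithmetic. The following is only a minimal Python sketch, assuming all eight 300 gigabyte drives go into one RAID 5 array (where one drive's worth of space is used for parity) and that the array then forms the pool. The helper names and the two example volume sizes are hypothetical and not taken from the real configuration.

```python
# Illustrative sketch only: helper names and the example volume sizes are
# hypothetical, not taken from the actual server configuration.

GB = 10**9    # drive vendors count in decimal gigabytes
TIB = 2**40   # binary terabyte, as many tools report it


def raid5_usable_bytes(drive_count, drive_bytes):
    """Usable space of a RAID 5 array: one drive's worth goes to parity."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drive_count - 1) * drive_bytes


def carve_volumes(pool_bytes, requests):
    """Allocate named logical volumes from a pool.

    Returns a tuple (volumes, free_bytes). Raises ValueError if a request
    does not fit into the remaining free space.
    """
    volumes = {}
    free = pool_bytes
    for name, size in requests.items():
        if size > free:
            raise ValueError(f"not enough free space in the pool for {name!r}")
        volumes[name] = size
        free -= size
    return volumes, free


# Eight 300 GB drives in one RAID 5 array form the pool.
pool = raid5_usable_bytes(8, 300 * GB)
print(f"usable pool: {pool / GB:.0f} GB = {pool / TIB:.2f} TiB")

# Two hypothetical logical volumes carved out of the pool.
volumes, free = carve_volumes(pool, {"media": 1500 * GB, "backup": 400 * GB})
print(f"free after carving: {free / GB:.0f} GB")
```

Under these assumptions the pool comes to 2100 decimal gigabytes, which is where the "2 terabytes" in the title comes from. Unlike with separate file systems per drive, a large file only has to fit into the free space of one logical volume, not onto one particular disk.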
I configured my system as follows:

+----------------------------------------+
|            Ext3 File System            |
+----------------------------------------+
|           LVM Logical Volume           |
+----------------------------------------+
|        LVM Logical Volume Group        |
+----------------------------------------+
|              RAID 5 array              |
+-----------------------+----+----+------+
| physical hard disk #1 | #2 | #3 | etc. |
+-----------------------+----+----+------+

Performance

Write performance:

me@beast:$ dd if=/dev/zero of=/storagearray/test.tmp bs=4096
1238865+0 records in
1238864+0 records out
5074386944 bytes transferred in 63.475536 seconds (79942404 bytes/sec)

me@beast:$ dd if=/dev/zero of=/storagearray/test.tmp bs=4096
1291513+0 records in
1291512+0 records out
5290033152 bytes transferred in 58.591716 seconds (90286367 bytes/sec)

me@beast:$ dd if=/dev/zero of=/storagearray/test2.tmp bs=4096
1209993+0 records in
1209992+0 records out
4956127232 bytes transferred in 61.981403 seconds (79961521 bytes/sec)

This is between 76MB/s and 86MB/s (counting one megabyte as 2^20 bytes).

Read performance:

me@beast:$ dd if=/storagearray/test.tmp of=/dev/null bs=4096
1291513+0 records in
1291513+0 records out
5290037248 bytes transferred in 51.320580 seconds (103078283 bytes/sec)

me@beast:$ dd if=/storagearray/test2.tmp of=/dev/null bs=4096
1209993+0 records in
1209993+0 records out
4956131328 bytes transferred in 48.381239 seconds (102439116 bytes/sec)

This is about 98MB/s.

I also tested the ethernet performance: with a gigabit ethernet capable laptop connected directly to the file server, I got read speeds of about 60 megabytes/s and write speeds of about 35 megabytes/s. It is quite possible that there is some room for tuning here... :)

CPU load during large file operations

When reading, the dd process puts a load of about 40% on the CPU. While writing, the load is about 24%.

Problems and possible improvements

I agree that read and write performance are not stellar.
The reason for this might be that it's a RAID5 array, which would also explain why write performance is lower than expected. Theoretically, read performance should scale quite well with the number of drives.

Update 2005-02-28: LVM2 apparently affects performance (see this discussion). Through LVM, my system makes about 86MB/s:

me@beast:$ time dd of=/dev/null if=/dev/mapper/VolumeGroup1-lvol0 \
    bs=1024K count=8192
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 94.677439 seconds (90728421 bytes/sec)
real 1m34.686s
user 0m0.012s
sys 0m13.708s

Accessing the RAID5 array directly suddenly gives 174MB/s! This is 100% faster!

me@beast:$ time dd of=/dev/null if=/dev/md0 bs=1024K count=8192
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 47.014788 seconds (182707079 bytes/sec)
real 0m47.034s
user 0m0.010s
sys 0m19.663s

Update 2005-03-01: 174MB/s is impressive for those who have only had single drives, agreed. But for an 8 drive RAID 5 array, this figure should be roughly 40% higher, i.e. around 240MB/s. See this article from the maintainer of the Linux software RAID code.

Second Update 2005-03-01: In an attempt to improve performance, I took LVM2 out of the loop and rebuilt the file system (ext3) directly on top of the RAID array. With this configuration, read performance has improved significantly, even through the file system layer.

Write speed is still at 78MB/s:

me@beast:$ time dd of=/storage/test3 if=/dev/zero bs=1024K count=8192
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 104.683379 seconds (82056337 bytes/sec)
real 1m44.812s
user 0m0.021s
sys 0m26.561s

But read speed has rocketed from 98MB/s to 166MB/s!

me@beast:$ time dd if=/storage/test2 of=/dev/null bs=1024K count=8192
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 49.290285 seconds (174272366 bytes/sec)
real 0m49.372s
user 0m0.013s
sys 0m20.919s

Update 2005-03-02: To further narrow down the reasons for the sub-optimal performance, I launched concurrent reads on all eight drives (not through the RAID layer). The results are somewhat surprising:
Thus total throughput is 340MB/s. For comparison, non-concurrent reads give 60MiB/s on both controllers. Concurrent reads on all four drives connected to the nForce4 SATA ports give a total of 245MiB/s (61MiB/s per drive).

Another point is that when booting the system (which happens rarely, but is still needed sometimes), one of the SATA ports (of course belonging to the SiI 3114 :) does not recognize its drive correctly. This leads the RAID software to the false assumption that the drive has died, and the array runs in degraded mode without any redundancy. After rebooting and manually re-adding the lost drive, the array is rebuilt and recalculates all parity data, which puts heavy load on all disks. That is not a very good thing. I will have to write a script that prevents the RAID array from coming up after a reboot if a drive is missing, and waits for manual intervention instead.

Something I'll have to fix as soon as possible is that smartctl does not work with SATA without a patch to the Linux kernel (see this document).

Conclusion

With a budget of about € 2700 (USD 3450) and a price of € 1.35 (USD 1.78) per gigabyte, I built a 2 terabyte file server for home use.

If you feel there is some information missing (prices of each item will follow soon) or if you have questions, please write me an email.

Comments on WikiBlog/2005-02-27:
28.02.2005 11:14
130.94.160.76
New comment.
Excellent! Thank you for the review. Apparently 2.6.9 and up may have better RAID5 scores according to the maintainer of the software RAID code (source: http://cgi.cse.unsw.edu.au/neilb/01103064158)
1.03.2005 4:57
204.212.233.251
sata controller card?
Which sata controller card did u use?
1.03.2005 13:25
82.235.52.165
he uses the s-ata ports integrated on the motherboard
most excellent review, thank you. I have ordered a P5GD2 Premium motherboard myself, and a Lian Li v2000 is on its way now. I will start out slowly with 6x200 gigs linux-raid5 tho... but there will be enough space to add 2 more arrays later :D
2.03.2005 0:46
65.205.244.100
CPU load
Excellent article! Do you know what the CPU load is on the system during a very large file write operation? I'm curious as to where the bottlenecks are - CPU, I/O bandwidth, etc...
2.03.2005 1:26
65.205.244.110
Spin-up delay
Most hardware RAID controllers provide for delaying drive spin-up on power start, to make sure the startup current doesn't cause system instability when the power supply can't sustain the rush of current on the +12V rail. Can you do that with software RAID? Maybe that's the cause of that drive not being detected consistently?
2.03.2005 19:09
65.205.244.100
129 MBps max on Si3114 drives
I think this is pretty close to the 133 MBps 33 MHz PCI spec, so I bet the SI controller is bottlenecking on the PCI bandwidth. So it's probably not a 66 MHz bus it's hooked to.
3.03.2005 10:10
138.189.119.133
RE: Spin-up delay
there are some drives where you can set a jumper to let the drive spin up a bit later than it would normally - but it's actually the controller's job to spin up the drives one after the other. however, I don't think that either the nForce4 chipset or the SiI 3114 controller does this... It could be that the power supply does not deliver enough juice for the last drive when all 9 are spinning up. I replaced the no-name 420W PSU with a Thermaltake 560W and I hope the problem won't resurface.
regards
nicola
3.03.2005 10:13
138.189.119.133
RE: 129 MBps max on Si3114 drives
as the nForce4 chipset only provides a 33 MHz 32-bit PCI bus, the SiI 3114 seems in fact to be running at only 33 MHz instead of the maximum possible 66 MHz.
regards
nicola
9.03.2005 12:59
194.106.52.201
No PCI-E controller cards?
Is the reason you've not used the PCI-E bus to replace the SIL chip that there are no suitable cards yet? The only other thing I can think of is using an extra SIL card and splitting the current 4 drives over both controllers. This might help through (a) enabling the SIL controllers to overlap slightly more than at present or (b) if there are two PCI busses provided by the NF4 chip: one for onboard chips, one for the expansion slots.
13.03.2005 16:02
69.17.45.112
Readahead settings?
It's probably too late now for you to test, but did you try playing with the blockdev --setra settings for the LVM setup? On my personal server, with four 200GB drives, I use software RAID 5 and LVM2 on the 2.6.10 kernel w/preempt enabled. 3 drives are in the RAID 5, 1 is a hotspare. I generally aim for low latency over raw throughput as my server performs many varied tests, none of which include large file serving. I've found that with default readahead settings, LVM2 on top of software RAID 5 is indeed significantly slower, but with some tweaking, the LVM2 performance can be increased significantly.

dd of LVM2 (on RAID5 md) w/256 readahead default: 51.54MB/s
dd of RAID5 md directly w/2048 readahead default: 87.37MB/s
dd of LVM2 (on RAID5 md) w/4096 readahead: 104.54MB/s

As you can see, a significant boost. Other values may be better, particularly in balancing latency vs throughput.
13.03.2005 17:53
62.65.146.50
RE: No PCI-E controller cards?
hi
actually, I don't know of any cheap PCIe SATA controller cards. if anyone knows who produces such cards, please let me know... :)
regards
nicola
14.03.2005 6:01
67.161.71.252
RE: no PCI-E controller cards
No cheap PCI-E SATA cards exist right now, but this will change soon. Silicon Image is starting to ship the SiI 2132 chipset, which has native PCI-E support. It handles 2 SATA drives, with the ability to double the total drives to 4 with some additional support circuitry. Cost should be quite low at $7/chip in volume, but I don't know when we'll see these enter the channel. Soon, I expect, from the usual suspects. The good news is that because you use software RAID, you should end up being able to move to the new controller without having to rebuild the array. The bad news is who knows how long it will take for linux support for this chipset to show up. :-)
14.03.2005 6:04
67.161.71.252
I meant 3132
Sorry, I meant the Silicon Image 3132, not the 2132...
14.03.2005 10:35
138.189.119.133
RE: no PCI-E controller cards
hi
motherboard manufacturers like Asus will integrate the SiI 3132 into their new products: http://www.siliconimage.com/news/press/detailpressrelease.aspx?id=284
regards
nicola
17.03.2005 1:02
202.139.111.196
SATAII port multiplication
Good work.. it's great to see more info on software RAID setups. One possibility you could try is to use only the SATAII ports, but with port-multipliers. SATAII supports this natively and lets you share each port between several drives. 3Gbps / 300MB/sec is a fair chunk of bandwidth, so say you put 2 drives per port using something like this: http://www.siimage.com/products/product.aspx?id=26 to get the extra connectors.
24.03.2005 21:54
65.205.244.100
New comment.
I don't think the Linux SATA drivers support port multiplication. The host interface needs to have special code for this, and I don't think the ATA based driver does this.
Thanks,
Mike
8.04.2005 3:17
129.78.228.114
On port multipliers.
There's a very good reason why the Linux software drivers don't support port multipliers. Actual retail product is still 3-6 months away from market.
8.05.2005 3:11
213.47.123.221
SiI 3132
Nice design, but don't expect too much: it's only 2-port SATA-II on a single PCIe-1x lane. Quite an impressive layout: 10mm x 10mm instead of 20x20 (SiI 3114, 4 SATA ports PCI32/66) according to the data sheet (http://www.siliconimage.com/docs/SI_3132PB_FINAL.pdf). Motherboard manufacturers will like it :-) If performance is required with Silicon Image (Soft-RAID), one has to go for PCI-X (SiI 3124). It is very disappointing that PCIe is treated like the "faster PCI" here instead of "the smaller/cheaper PCI-X", but at least there are chips coming for PCIe that don't have VGA connectors. Considering an Areca/Tekram-based hardware-RAID solution after all... thanks for your review/testing!
12.05.2005 20:10
63.249.64.186
What about stability?
Nice writeup, I'm planning to build an essentially identical system as a storage server for our simulations. I didn't realize the SiI chip was such a bottleneck, that's really annoying... Anyhow, you didn't say anything about stability in your writeup. Is the system working well? Have you had problems with disks dropping off the array? About the spinup problem, I've been wondering about this too. I'm planning to use Western Digital 250GB WD2500SD drives, and they have a jumper to enable host-controlled spinup. But I have no idea how you would go about manually spinning the drives up in linux. It seems you just need to be able to send the proper command to the drive, but I'm completely ignorant of this sort of low-level stuff...
Cheers,
/Patrik
13.05.2005 14:39
193.5.216.100
Stability
That largely depends on the kernel you use (and the user-space software). I had no disks being dropped and no kernel crashes; however, the NFS server seems to block when the block sizes are not tuned to very specific values on the clients... very strange. A UPS is strongly advised, as a clean shutdown on power loss is very desirable, even more so if rebooting results in one disk being dropped from the array. Other than that, I am somewhat surprised how big 2TB is after all. I now have all my music (50GB) and all my movies (300GB) on the server, along with some backups - and there are still only 465GB of 2TB used (which is 24%).
6.06.2005 7:27
65.25.245.16
PCI-E to SATA ll
You might want to check these products out: http://www.areca.us/ Have you tried installing Windows XP64, or XP for fun and tested out the sort of performance you get with it over linux?
9.06.2005 0:11
67.113.12.99
New comment.
The Areca PCI-E controllers are nice, but incredibly expensive! And it doesn't really buy you much over software raid under Linux, particularly if you are running EVMS as a volume manager, which lets you reconfigure RAID5 arrays on the fly. Of course, if you can't manage a Linux system and must run Windows, then you really need hardware raid like the Areca. But then, you're probably used to paying a lot anyway... :-)
26.06.2005 13:58
195.137.106.99
graphics card
I am thinking of making a similar system to yours. What graphics card did you use in your system? Obviously spec is not important in a server so cheapest is best but did you go with PCI-E or PCI?

© Copyright 2004 - 2006 Nicola Fankhauser. All Rights Reserved.