Creating a 4-Disk RAID Array

The Hardware

The hardware I'm using is pretty standard stuff. It's not a gamer PC, but it's relatively new technology, and it's very I/O-friendly:

  • Gigabyte GA EP45T-UD3LR motherboard
  • Pentium D dual-core 3.4 GHz CPU
  • 4GB RAM
  • One 80GB EIDE disk
  • Four 1TB SATA disks
  • RAID drive cage - 4 hot-swap bays, pass-thru data & power connectors, built-in cooling fan
  • Thermaltake 750W power supply

For the sake of brevity, I'll get right to the I/O setup. The drive configuration is shown in the hard disk connection diagram to the right. An 80GB EIDE drive is connected to the PATA connector on the motherboard. The BIOS detects it as the first disk, which is perfect for this setup. The four SATA II drives are snapped into a drive cage, and their data connections are plugged into the four SATA ports on the motherboard. My board doesn't support hot-swap, so I'll have to power off the system to replace a drive if one fails.

Hardware-based vs. Software-based

The big question when doing RAID is: hardware or software? The hardware approach requires a fairly expensive controller card, while the software approach requires a more complex setup and a fast processor for in-memory checksum calculations. I did a lot of research and will try to summarize it in a table:

Option                                  Hardware RAID   Software RAID   Fake RAID
Has CPU Overhead                        No              Yes             Yes
Requires controller hardware            Yes             No              Yes
Requires OS drivers                     No              Yes             Yes
Platform-independent                    Yes             No              No
Supports all RAID levels                Yes             Yes             Maybe
Data usable with different h/w or s/w   No              Yes             Maybe

The “fake RAID” column exists for cost-saving reasons. It was introduced as a “best of both worlds” solution…combining the low cost of Software RAID with the promised performance of a specialized disk controller. But hardware advances in the late 1990s made it unnecessary: Microsoft introduced Software RAID in Windows 2000 Server, and Linux added RAID support in early 2001. The market for Fake RAID has grown smaller ever since.

As for my system, I'm sticking with a pure software RAID solution, even though I have access to a Fake RAID controller card.

The Software Setup

The software setup involves partitioning all the drives, building a RAID array, formatting it, and adding it to the system configuration.

Partitioning Drives

Drive partitioning is the process of splitting a disk into different logical sections (kind of like different songs on a CD). It has to be done before the drives are usable by an operating system. Normally this is done by the operating system at install time, but I wanted to wait until after the install and configure the drives myself.

I completed my partitioning with cfdisk. It's a little more robust than fdisk, and the man page for fdisk recommended cfdisk for what I was doing (creating partitions for use on Linux). I wanted to delete any existing partitions, allocate 100% of each disk to a primary partition, and set the partition type to FD, the Linux RAID autodetect partition type.

This is what my console looked like before configuring the first data drive (/dev/sdb).

root@werewolf:~# cfdisk /dev/sdb
                        cfdisk (util-linux-ng 2.14.2)                        
                                                                             
                            Disk Drive: /dev/sdb                             
                    Size: 1000204886016 bytes, 1000.2 GB                     
           Heads: 255   Sectors per Track: 63   Cylinders: 121601            
                                                                             
   Name        Flags     Part Type  FS Type         [Label]        Size (MB) 
 --------------------------------------------------------------------------- 
                          Pri/Log   Free Space                    1000202.28 
                                                                             
                                                                             
    [   Help   ]  [   New    ]  [  Print   ]  [   Quit   ]  [  Units   ]     
    [  Write   ]                                                             
                                                                           

I used the New option and followed the prompts. When finished, my screen looked like the one below. The last step is to hit the Write command to save the settings to disk.

                            Disk Drive: /dev/sdb                             
                    Size: 1000204886016 bytes, 1000.2 GB         
           Heads: 255   Sectors per Track: 63   Cylinders: 121601       
                                                                        
   Name        Flags     Part Type  FS Type         [Label]        Size (MB) 
 --------------------------------------------------------------------------- 
   sdb1                   Primary   Linux raid autodetect         1000202.28 
                                                                        

    [ Bootable ]  [  Delete  ]  [   Help   ]  [ Maximize ]  [  Print   ]
    [   Quit   ]  [   Type   ]  [  Units   ]  [  Write   ]

The partition tool seemed concerned that I wasn't marking the partition as bootable. It warned me when I hit Write, and left a message on my screen after exiting cfdisk. In my case, this is OK. I have a different disk (/dev/sda) that is bootable.

root@werewolf:~# cfdisk /dev/sdb
Disk has been changed.
WARNING: If you have created or modified any
DOS 6.x partitions, please see the cfdisk manual
page for additional information.

After repeating the above process for all 4 of my data drives (/dev/sdb, /dev/sdc, /dev/sdd, and /dev/sde), I ran the fdisk -l command to see what my system's partitioning looked like:

root@werewolf:~# fdisk -l

Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00078cba

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        8733    70147791   83  Linux
/dev/sda2            8734        9729     8000370    5  Extended
/dev/sda5            8734        9729     8000338+  82  Linux swap / Solaris

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00038d83

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1      121601   976760001   fd  Linux raid autodetect

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1      121601   976760001   fd  Linux raid autodetect

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1      121601   976760001   fd  Linux raid autodetect

Disk /dev/sde: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1               1      121601   976760001   fd  Linux raid autodetect
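Running cfdisk interactively four times gets tedious. As a non-interactive alternative (a sketch, not what I actually ran), sfdisk accepts a one-line description where ',,fd' means "one primary partition, whole disk, type fd". The helper below only prints the commands, since they destroy partition tables; pipe its output to sh as root if you really want to run them.

```shell
#!/bin/bash
# Sketch: script the partitioning with sfdisk instead of running
# cfdisk by hand on each drive. ',,fd' = default start, rest of the
# disk, partition type fd (Linux raid autodetect).
partition_cmds() {
    local disk
    for disk in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
        # print rather than execute -- these commands are destructive
        printf "echo ',,fd' | sfdisk %s\n" "$disk"
    done
}

partition_cmds            # preview; pipe to "sh" (as root) to run for real
```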

Building the RAID Array

On my system, mdadm wasn't installed by default, so I installed it:

root@werewolf:~# apt-get install mdadm

It has a dependency on postfix (because mdadm can send email on drive failures), so I had to take a 2-minute detour and configure that on Ubuntu as part of the mdadm install.

The command I need to run is:

mdadm --create /dev/md0 --verbose --chunk=64 --level=raid5 --raid-devices=4 /dev/sd[bcde]1 

but I'm not sure about the chunk size…what's that? According to the Software RAID HowTo, it's the amount of data written to a single disk before the array moves on to the next disk in the stripe.

Obviously a larger chunk size will minimize the number of disk writes, but it will increase the compute time needed to generate each parity block. There's probably a happy medium in there somewhere, but it's going to be affected by the type of data being written to the array. In my case, I know the array will be used mostly for A/V files (photos, music, and movies), so I'm willing to try out a large chunk size. The HowTo recommends 128K for large files, so I'll go with that instead of the default of 64.

root@werewolf:~# mdadm --create /dev/md0 --verbose --chunk=128 --level=raid5 --raid-devices=4 /dev/sd[bcde]1
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdb1 appears to contain an ext2fs file system
    size=970735624K  mtime=Sat Aug 22 16:31:34 2009
mdadm: /dev/sde1 appears to contain an ext2fs file system
    size=970735624K  mtime=Sat Aug 22 16:31:34 2009
mdadm: size set to 976759936K
Continue creating array? y
mdadm: array /dev/md0 started.

Nice! The command gave my terminal back, rather than locking it up for untold hours. I was afraid I'd have to run it with nohup or put it in a background process. So…aside from the blinking lights on my hard drive bays, how can I tell when my array is created? Lucky for me, there's a file in the /proc directory that I can cat:

root@werewolf:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde1[4] sdd1[2] sdc1[1] sdb1[0]
      2930279808 blocks level 5, 128k chunk, algorithm 2 [4/3] [UUU_]
      [>....................]  recovery =  0.0% (700032/976759936) finish=255.5min speed=63639K/sec

unused devices: <none>

Excellent! Notice the finish parameter…it tells me the array will be built in about 255 minutes. After waiting that long, I ran this command and was greeted with a nice statistics page:

root@werewolf:/var/log$ mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Mon Aug 24 02:24:20 2009
     Raid Level : raid5
     Array Size : 2930279808 (2794.53 GiB 3000.61 GB)
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Aug 24 11:30:38 2009
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           UUID : 44062c56:b84201d7:bba6c3a9:4fef2f9c (local to host werewolf)
         Events : 0.4

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8       65        3      active sync   /dev/sde1
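The reported sizes are worth a quick sanity check: RAID5 spends one device's worth of space on parity, so usable capacity is (number of disks minus 1) times the per-device size. In shell arithmetic:

```shell
# RAID5 usable capacity = (disks - 1) x per-device size
dev_blocks=976759936            # "Used Dev Size" from mdadm --detail (1K blocks)
disks=4
array_blocks=$(( dev_blocks * (disks - 1) ))
echo "$array_blocks"            # matches "Array Size : 2930279808"
```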

Formatting the Array

The command I need to run is:

mkfs.ext3 -b 4096 -E stride=32,stripe-width=96 /dev/md0

This command uses the ext3 filesystem. I got the parameters from a calculator here. It sets the block size to 4096 bytes (only 1024, 2048, and 4096 are available on my system). stride is the number of filesystem blocks in one RAID chunk (128KiB chunk / 4KiB block = 32), and stripe-width is stride times the number of data disks (32 x 3 = 96, since one of the four disks' worth of space holds parity in RAID5).
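As a quick check, here's the arithmetic behind those two -E values (128KiB chunk, 4KiB blocks, 4-disk RAID5):

```shell
# stride       = filesystem blocks per RAID chunk
# stripe-width = stride x number of data-bearing disks
chunk_kib=128
block_kib=4
raid_disks=4

stride=$(( chunk_kib / block_kib ))            # 128 / 4
stripe_width=$(( stride * (raid_disks - 1) ))  # one disk's worth is parity
echo "stride=$stride stripe-width=$stripe_width"
```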

root@werewolf:~# mkfs.ext3 -b 4096 -E stride=32,stripe-width=96 /dev/md0
mke2fs 1.41.4 (27-Jan-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
183148544 inodes, 732569952 blocks
36628497 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
22357 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
    102400000, 214990848, 512000000, 550731776, 644972544

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 23 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

Adding the Array to System Startup

All the above details are for setting up an array the first time. But so far we haven't told the OS how to reassemble the array at boot time. We do that with the mdadm configuration file (/etc/mdadm/mdadm.conf on Ubuntu; plain /etc/mdadm.conf on some other systems). The file format is fully explained in the man page for mdadm.conf. In my case, I need to tell it about the 4 partitions that contain data, and about the array itself (RAID level, number of devices, etc.).

Below is my configuration file. The last two lines contain my email address, and the name of the program that will watch for certain md-related events and email me if something goes wrong.

DEVICE /dev/sdb1
DEVICE /dev/sdc1
DEVICE /dev/sdd1
DEVICE /dev/sde1

ARRAY /dev/md0
    devices=/dev/sd[bcde]1
    num-devices=4
    level=5

MAILADDR chris@thefreyers.net

PROGRAM /usr/sbin/handle-mdadm-events
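Incidentally, mdadm can generate the ARRAY line itself: running mdadm --detail --scan while the array is assembled prints an entry you can append to the config file. On my array it would look something like the fragment below (the exact format varies a little between mdadm versions). Identifying the array by UUID is more robust than listing device names, since drive letters can change between boots.

```
ARRAY /dev/md0 level=raid5 num-devices=4 UUID=44062c56:b84201d7:bba6c3a9:4fef2f9c
```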

Deleting an Array

If you need to completely delete an array (perhaps because you built it with the wrong parameters and want to start over), here's how:

root@werewolf:~# mdadm /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
root@werewolf:~# mdadm /dev/md0 --fail /dev/sdc1
mdadm: set /dev/sdc1 faulty in /dev/md0
root@werewolf:~# mdadm /dev/md0 --fail /dev/sdd1
mdadm: set /dev/sdd1 faulty in /dev/md0
root@werewolf:~# mdadm /dev/md0 --fail /dev/sde1
mdadm: set /dev/sde1 faulty in /dev/md0
root@werewolf:~# mdadm /dev/md0 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1
root@werewolf:~# mdadm /dev/md0 --remove /dev/sdc1
mdadm: hot removed /dev/sdc1
root@werewolf:~# mdadm /dev/md0 --remove /dev/sdd1
mdadm: hot removed /dev/sdd1
root@werewolf:~# mdadm /dev/md0 --remove /dev/sde1
mdadm: hot removed /dev/sde1
root@werewolf:~# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
root@werewolf:~# mdadm --zero-superblock /dev/sdb1
root@werewolf:~# mdadm --zero-superblock /dev/sdc1
root@werewolf:~# mdadm --zero-superblock /dev/sdd1
root@werewolf:~# mdadm --zero-superblock /dev/sde1
cfdisk /dev/sdb  (delete partition, write partition table, then recreate)
cfdisk /dev/sdc  (delete partition, write partition table, then recreate)
cfdisk /dev/sdd  (delete partition, write partition table, then recreate)
cfdisk /dev/sde  (delete partition, write partition table, then recreate)
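The same teardown can be written as a loop. The sketch below defaults to only printing the commands (run=echo); set run= to an empty string to actually execute them, which permanently destroys the array.

```shell
#!/bin/bash
# Tear down /dev/md0: fail + remove each member, stop the array,
# then wipe the RAID superblocks. DESTRUCTIVE when run for real.
run=${run:-echo}      # default: print the commands instead of running them

teardown_array() {
    local part
    for part in /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1; do
        $run mdadm /dev/md0 --fail "$part"
        $run mdadm /dev/md0 --remove "$part"
    done
    $run mdadm --stop /dev/md0
    for part in /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1; do
        $run mdadm --zero-superblock "$part"
    done
}

teardown_array
```

Afterward, repartition each drive with cfdisk as shown earlier if you want to rebuild.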

Re-adding a partition

Suppose you forget to update your /etc/mdadm/mdadm.conf file after changing an array with mdadm, and then you reboot (like I did). You'll end up with an array that looks like this…

root@werewolf:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Mon Aug 24 02:24:20 2009
     Raid Level : raid5
     Array Size : 2930279808 (2794.53 GiB 3000.61 GB)
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Aug 25 20:45:20 2009
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           UUID : 44062c56:b84201d7:bba6c3a9:4fef2f9c (local to host werewolf)
         Events : 0.24

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       0        0        3      removed

This info tells me there are 4 RAID partitions on my system, but only 3 are associated with my /dev/md0 array. Those 3 are active and working, but the array is in “clean, degraded” state: clean means the data is consistent, and degraded means the array is running with one device missing. The fourth partition isn't even part of the array. How do I add the partition back? It's pretty simple.

root@werewolf:~# mdadm --re-add /dev/md0 /dev/sde1
mdadm: re-added /dev/sde1
root@werewolf:~#

That's it…the partition is added back in because it was once a part of the array and mdadm can recover it (i.e. bring it up to date on any changes that have been applied since it was disconnected from the array).

So looking at the statistics, I can see the partition is back, and the array status is changed to “clean, degraded, recovering”.

root@werewolf:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Mon Aug 24 02:24:20 2009
     Raid Level : raid5
     Array Size : 2930279808 (2794.53 GiB 3000.61 GB)
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Aug 25 20:49:29 2009
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 128K

 Rebuild Status : 0% complete

           UUID : 44062c56:b84201d7:bba6c3a9:4fef2f9c (local to host werewolf)
         Events : 0.154

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       4       8       65        3      spare rebuilding   /dev/sde1

My last point of interest is to see what's going on *right now* while the array is recovering…

root@werewolf:~# cat /proc/mdstat

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sde1[4] sdb1[0] sdc1[1] sdd1[2]
      2930279808 blocks level 5, 128k chunk, algorithm 2 [4/3] [UUU_]
      [========>............]  recovery = 40.1% (392130328/976759936) finish=146.4min speed=66540K/sec

unused devices: <none>
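The finish= estimate is nothing magic: it's just the remaining blocks divided by the current rebuild speed. Recomputing it from the numbers above:

```shell
# finish estimate = (total - done) / speed, from the mdstat line above
done_kb=392130328
total_kb=976759936
speed_kb=66540                              # K/sec

remaining_min=$(( (total_kb - done_kb) / speed_kb / 60 ))
echo "about $remaining_min minutes left"    # mdstat says finish=146.4min
```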

Replacing a bad device

Well, it eventually happens to everyone. One of my RAID drives went bad. I don't understand why–it wasn't doing anything demanding or different. But luckily I got this email, thanks to a properly configured /etc/mdadm/mdadm.conf file:

This is an automatically generated mail message from mdadm
running on werewolf

A FailSpare event had been detected on md device /dev/md0.

It could be related to component device /dev/sdc1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdc1[4](F) sdb1[0] sdd1[1] sde1[3]
     2930279808 blocks level 5, 128k chunk, algorithm 2 [4/3] [UU_U]
     [=>...................]  recovery =  7.6% (75150080/976759936) finish=1514.9min speed=9919K/sec

unused devices: <none>

The important thing is the (F) next to the sdc1 partition. It means the device has failed. I power cycled the machine and the array came up in “degraded, recovering” status, but it failed again after several hours of rebuilding. After two or three attempts, I decided the drive was bad (or at least bad enough to warrant replacing). Here are the steps:

  1. Run mdadm --remove /dev/md0 /dev/sdc1 to remove the bad partition from the array
  2. Replace the faulty drive with a new one
  3. Use cfdisk as described above to set up the drive like the others
  4. Run mdadm --add /dev/md0 /dev/sdc1 to add the new drive to the array

After that, cat /proc/mdstat reported the array was recovering. It took nearly 6 hours to rebuild the data, but everything went back to normal. No lost data.

Benchmarking for Performance

One thing I've learned by experience is that you should benchmark a filesystem before you start using it. This isn't such a big deal on regular desktop systems where the I/O load is fairly light. But on I/O-bound servers like a database or a media server, it really matters.

The Test

Wikipedia's Comparison of File Systems led me to 3 candidates for my media server: EXT3, JFS, and XFS. EXT3 is the default filesystem on Linux (as of Summer 2009), and JFS and XFS get really good reviews on various forums. But which one has the best performance on a media server? I decided to perform a common set of tests on all 3 filesystems to find out. I wrote a script that:

  1. formats the RAID array and captures statistics about the process
  2. mounts the array
  3. runs a comprehensive iozone benchmark suite (while collecting statistics)
  4. times the creation of a 5GB file
  5. times the deletion of the 5GB file

This should give me enough data to make an informed decision about the filesystem.

#! /bin/bash
# dotest.sh
# Chris Freyer (chris@thefreyers.net)
# Sept 8, 2009

outputdir=/root/raidtest

export TIME="time:%E, IOfaults:%F, #fs inputs:%I, #fs outputs:%O, CPU:%P, CPU sec:%S, #signals:%k"

# ----------------------------
# EXT3
# ----------------------------
umount /data
/usr/bin/time -o $outputdir/mkfs_ext3.txt mkfs.ext3 -q -b 4096 -E stride=32,stripe-width=96 /dev/md0
mount -t ext3 /dev/md0 /data
/usr/bin/time -o $outputdir/iozone_ext3.txt iozone -a -f /data/testfile.tmp -R -b $outputdir/iozone_ext3.xls
/usr/bin/time -o $outputdir/dd_ext3.txt dd if=/dev/zero of=/data/file.out bs=1MB count=5000
/usr/bin/time -o $outputdir/rm_ext3.txt rm /data/file.out

# ----------------------------
# JFS
# ----------------------------
umount /data
/usr/bin/time -o $outputdir/mkfs_jfs.txt mkfs.jfs -f -q  /dev/md0
mount -t jfs /dev/md0 /data
/usr/bin/time -o $outputdir/iozone_jfs.txt iozone -a -f /data/testfile.tmp -R -b $outputdir/iozone_jfs.xls
/usr/bin/time -o $outputdir/dd_jfs.txt dd if=/dev/zero of=/data/file.out bs=1MB count=5000
/usr/bin/time -o $outputdir/rm_jfs.txt rm /data/file.out

# ----------------------------
# XFS
# ----------------------------
umount /data
/usr/bin/time -o $outputdir/mkfs_xfs.txt mkfs.xfs -f -q  /dev/md0
mount -t xfs /dev/md0 /data
/usr/bin/time -o $outputdir/iozone_xfs.txt iozone -a -f /data/testfile.tmp -R -b $outputdir/iozone_xfs.xls
/usr/bin/time -o $outputdir/dd_xfs.txt dd if=/dev/zero of=/data/file.out bs=1MB count=5000
/usr/bin/time -o $outputdir/rm_xfs.txt rm /data/file.out

Test Results

The script produced some really interesting statistics, which I'll summarize here.

FILESYSTEM CREATION
Measure EXT3 JFS XFS
Elapsed Time 16:01.71 0:03.93 0:08.42
Faults needing I/O 7 1 0
# filesystem inputs 944 128 280
# filesystem outputs 92,232,352 781,872 264,760
% CPU use (avg) 11% 29% 1%
# CPU seconds used 109.01 1.10 0.08

I'm not overly concerned with the time required to create a filesystem. It's an administrative task that I only do when setting up a new drive. But I couldn't help noticing the huge difference between EXT3 and the other formats: EXT3 took 16 minutes to create the filesystem while the others took just a few seconds. The number of filesystem outputs was similarly imbalanced. Not a good start for EXT3.

IOZONE EXECUTION
Measure EXT3 JFS XFS
Elapsed Time 13:27.74 10:23.15 10:57.16
Faults needing I/O 3 3 0
# filesystem inputs 576 656 992
# filesystem outputs 95,812,576 95,845,872 95,812,568
% CPU use (avg) 29% 27% 29%
# CPU seconds used 230.32 165.07 187.96

IOZONE provides some useful performance statistics for the disks. The above stats were gathered while it was running (the same tests for each filesystem). EXT3 took longer to run the tests (3 minutes and 2.5 minutes longer than JFS and XFS, respectively), and used more CPU time (65 and 42 seconds more, roughly 40% and 23% extra). JFS has a slight advantage over XFS, but EXT3 is in a distant 3rd place.

5GB FILE CREATION
Measure EXT3 JFS XFS
Elapsed Time 1:01.54 1:07.51 00:56.08
Faults needing I/O 0 0 5
# filesystem inputs 312 1200 560
# filesystem outputs 9,765,640 9,785,920 9,765,672
% CPU use (avg) 38% 20% 24%
# CPU seconds used 23.88 13.72 14.00

Creation of multi-gigabyte files will be a routine event on this machine (since it will be recording TV shows daily). Each filesystem took just over 1 minute to create the file. As I expected, EXT3 had significantly higher CPU utilization than JFS and XFS (90% and 58% higher, respectively). The number of CPU seconds used was higher too (74% and 70%, respectively). These small numbers don't look significant until you think about how much disk I/O a busy media server does.

5GB FILE DELETION
Measure EXT3 JFS XFS
Elapsed Time 00:00.96 00:00.05 00:00.06
Faults needing I/O 0 0 2
# filesystem inputs 0 0 320
# filesystem outputs 0 0 0
% CPU use (avg) 98% 8% 0%
# CPU seconds used 0.95 0.00 0.00

File deletion is a big deal when running a media server. People want to delete a large file (i.e. a recorded program) and immediately continue using their system. I've experienced long delays with EXT3 before, sometimes 10-15 seconds to delete a file. The statistics here don't reflect delays that long, but they do indicate a problem: the elapsed time is 19x and 16x longer with EXT3 than with JFS and XFS, and the CPU figures show the same pattern.

Obviously, EXT3 is out of the running here, so I'll stop talking about it. The real decision is between JFS and XFS. Both have similar statistics, so I searched the internet for relevant info, and several sources swayed my opinion.

And the winner is...

The winner is: XFS. I've been using it for several years on my MythTV box with no issues. My recorded programs are stored on an LVM volume formatted with XFS. The volume itself spans 4 drives from different manufacturers 1) with different capacities 2) and interfaces 3). Recording and playback performance is great, especially when you consider that my back-end machine serves 4 front-ends (one of which runs on the back-end machine itself). And the filesystem delete performance is perfect: about 1 second to delete a recording (normally a 2-6GB file).

JFS has maturity on its side–it has been used in IBM's AIX for more than 10 years. It offers good performance, has good recovery tools, and has the stamp of approval from MythTV users. But I'm going to run it on a RAID system, and there's very little internet knowledge that I could find on that combination.

In contrast, XFS has format-time options specifically for RAID setups. There have been reports of about 10% CPU savings when you tell XFS about your RAID stripe size at format time. This means more free CPU time for transcoding and other CPU-intensive tasks.

RAID Parameter Calculator

Calculating the parameters for a RAID array is a tedious process. Fortunately someone on the MythTV website had already written a shell script to help calculate the proper values for an array. I converted that to JavaScript, and I offer it here for your convenience. If you find any errors or improvements, please let me know.

Note: blocksize refers to the size (in bytes) of a single chunk of disk space. In Linux, that can't be larger than the size of a memory page (called pagesize). So how do you find out your pagesize? In Ubuntu, you run getconf PAGESIZE at the command line. In my case, the value is 4096. It might be slightly different on other systems.

The calculator's inputs:

  • Blocksize (bytes)
  • Chunksize (KiB)
  • # of Spindles
  • RAID Type (0, 1, 10, 5, 6)
  • RAID Device Name
  • File System Label
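For readers without the JavaScript version handy, here's the core arithmetic as a shell function (a sketch of what the calculator computes; the name raid_params is mine). It takes the blocksize in bytes, chunksize in KiB, spindle count, and RAID level, and prints the stride/stripe-width pair for mkfs:

```shell
#!/bin/bash
# raid_params BLOCKSIZE_BYTES CHUNK_KIB SPINDLES LEVEL
# Prints the stride/stripe-width values to feed to mkfs.ext3 -E.
raid_params() {
    local blocksize=$1 chunk_kib=$2 spindles=$3 level=$4
    local stride=$(( chunk_kib * 1024 / blocksize ))
    local data_disks
    case $level in
        0)  data_disks=$spindles ;;              # striping only
        1)  data_disks=1 ;;                      # pure mirror
        10) data_disks=$(( spindles / 2 )) ;;    # mirrored stripes
        5)  data_disks=$(( spindles - 1 )) ;;    # one disk's worth of parity
        6)  data_disks=$(( spindles - 2 )) ;;    # two disks' worth of parity
        *)  echo "unknown RAID level: $level" >&2; return 1 ;;
    esac
    echo "stride=$stride stripe-width=$(( stride * data_disks ))"
}

raid_params 4096 128 4 5     # the array built in this article
```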
1) Seagate, Maxtor, and Western Digital
2) 200GB, 300GB, 500GB, and 250GB
3) EIDE and SATA
/home/cfreyer/public_html/data/pages/technology/linux/creating_a_4-disk_raid_array.txt · Last modified: 2011/04/06 08:50 by Chris Freyer