

{A} Introduction


Here's a short description of what the Linux RAID drivers support. RAID is not a guarantee of data integrity; it just allows you to keep your data available if a disk dies.

The current RAID drivers in Linux support the following levels:

Linear mode, RAID-0, RAID-1

RAID-4

  • If one drive fails, the parity information can be used to reconstruct all data. If two drives fail, all data is lost.
  • The reason this level is not used more frequently is that the parity information is kept on one drive, and it must be updated every time one of the other disks is written to. The parity disk therefore becomes a bottleneck unless it is a lot faster than the other disks. However, if you happen to have a lot of slow disks and one very fast one, this RAID level can be very useful.

RAID-5

  • This is perhaps the most useful RAID mode when one wishes to combine a larger number of physical disks and still maintain some redundancy. RAID-5 can be (usefully) used on three or more disks, with zero or more spare disks. The resulting RAID-5 device size will be (N-1)*S, just like RAID-4. The big difference between RAID-5 and RAID-4 is that the parity information is distributed evenly among the participating drives, avoiding the bottleneck problem of RAID-4 and also getting more performance out of the array when reading, as all drives are then used.
  • If one of the disks fails, all data is still intact, thanks to the parity information. If spare disks are available, reconstruction begins immediately after the device failure. If two disks fail simultaneously, or the second fails before the array is reconstructed, all data is lost. RAID-5 can survive one disk failure, but not two or more.
  • Both read and write performance usually increase, but it is hard to predict by how much. Reads are close to RAID-0 reads; writes can be either rather expensive (requiring a read-in prior to the write in order to calculate the correct parity information, as in database operations) or similar to RAID-1 writes (when larger sequential writes are performed and the parity can be calculated directly from the other blocks to be written). Write efficiency depends heavily on the amount of memory in the machine and the usage pattern of the array; heavily scattered writes are bound to be more expensive.

RAID-6

  • This is an extension of RAID-5 to provide more resilience. RAID-6 can be (usefully) used on four or more disks, with zero or more spare disks. The resulting RAID-6 device size will be (N-2)*S (a worked size example follows this list). The big difference between RAID-5 and RAID-6 is that there are two different blocks of parity information, and these are distributed evenly among the participating drives.
  • Since there are two parity blocks, all data remains intact if one or two of the disks fail. If spare disks are available, reconstruction begins immediately after the device failure(s).
  • Read performance is similar to RAID-5, but write performance is worse.
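
As a quick sanity check of the size formulas above, here is the arithmetic for a hypothetical set of four 2 TB disks (N = 4, S = 2 TB):

RAID-5: (N-1)*S = (4-1)*2 TB = 6 TB usable, survives one disk failure
RAID-6: (N-2)*S = (4-2)*2 TB = 4 TB usable, survives two disk failures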

RAID-10

  • RAID-10 is an "in-kernel" combination of RAID-1 and RAID-0 that is more efficient than simply layering RAID levels.
  • RAID-10 has a layout ("far") which can provide sequential read throughput that scales with the number of drives rather than the number of RAID-1 pairs. You can get about 95% of the performance of a RAID-0 with the same number of drives.
  • RAID-10 allows the spare disk(s) to be shared amongst all the RAID-1 pairs.

FAULTY

  • This is a special debugging RAID level. It only allows one device and simulates low level read/write failures.
  • Using a FAULTY device in another RAID level allows administrators to practice dealing with things like sector failures, as opposed to whole-drive failures.

{B} Swapping on RAID


Swapping on a mirrored RAID can help you survive a failing disk. If a disk fails, the data of swapped-out processes would be inaccessible in a non-mirrored environment; in a mirrored environment the system can keep running even if a disk fails in service.

There's not much reason to use RAID-0 for swap performance. The kernel itself can stripe swapping over several devices if you just give them the same priority in the /etc/fstab file.

A nice /etc/fstab could look like:

/dev/sda2  none  swap  defaults,pri=4  0 0
/dev/sdb2  none  swap  defaults,pri=4  0 0
/dev/sdc2  none  swap  defaults,pri=4  0 0
/dev/sdd2  none  swap  defaults,pri=4  0 0
/dev/sde2  none  swap  defaults,pri=4  0 0
/dev/sdf2  none  swap  defaults,pri=4  0 0
/dev/sdg2  none  swap  defaults,pri=4  0 0

This setup lets the machine swap in parallel on seven SAS devices. There is no need for RAID-0, since this has been a kernel feature for a long time.
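
To verify that all swap devices are active with the same priority (so the kernel actually stripes across them), something like the following can be used; the exact columns available depend on the util-linux version:

swapon --show=NAME,PRIO,SIZE,USED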

A different reason to use RAID for swap is high availability. If you set up a system to boot from e.g. a RAID-1 device, the system should be able to survive a disk crash. But if a system without mirrored swap has been swapping on the now-faulty device, it will most likely go down. Swapping on a mirrored RAID partition, such as a RAID-1, raid10,n2 or raid10,f2 device, solves this problem.
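
A minimal sketch of setting up mirrored swap, assuming /dev/sdX2 and /dev/sdY2 are unused partitions and /dev/md1 is a free md device name:

mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdX2 /dev/sdY2   #mirror the two partitions
mkswap /dev/md1                                                          #initialise swap on the mirror
swapon -p 4 /dev/md1                                                     #activate it with a chosen priority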


{C} Spare disks


Spare disks (often called hot spares) are disks that do not take part in the RAID set until one of the active disks fails. When a device failure is detected, that device is marked as "faulty" and reconstruction is immediately started on the first available spare disk.

Once reconstruction to a hot spare begins, the RAID layer starts reading from all the other disks to re-create the redundant information. If multiple disks have built up bad blocks over time, the reconstruction itself can actually trigger a failure on one of the "good" disks. This can lead to a complete RAID failure and is the major reason for using RAID-6 in preference to RAID-5 plus a hot spare.
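
One common way to reduce this risk is to scrub the array regularly, so latent bad blocks are found (and rewritten from redundancy) before a rebuild is ever needed. A sketch, assuming the array is /dev/md0:

echo check > /sys/block/md0/md/sync_action   #start a scrub; progress appears in /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt           #number of mismatched sectors found by the last check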


{D} Faulty disks


The RAID layer handles device failures just fine: crashed disks are marked as faulty, and reconstruction is immediately started on the first available spare disk. If no spare is available, the array runs in 'degraded' mode.

Faulty disks still appear and behave as members of the array. The RAID layer just avoids reading/writing them.

If a device needs to be removed from an array for any reason (e.g. pro-active replacement due to SMART reports), it must be marked as faulty before it can be removed.
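
To check which members of an array are active, spare or faulty before acting on them (assuming the array is /dev/md0):

mdadm --detail /dev/md0   #shows the state of every member device
cat /proc/mdstat          #compact overview of all arrays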


{E} RAID setup


Prepare

Install the "mdadm" package and load the required modules, e.g. "modprobe raid456", "modprobe raid10". Then you will see:

[root@ ~]# cat /proc/mdstat
Personalities : [raid10] [raid6] [raid5] [raid4]
unused devices: <none>

Mdadm modes of operation

mdadm has 7 major modes of operation. Normal operation just uses the 'Create', 'Assemble' and 'Monitor' commands; the rest are typically used for fixing or changing your array.

  • Create: Create a new array with per-device superblocks (normal creation).
  • Assemble: Assemble the parts of a previously created array into an active array.
  • Follow or Monitor: Monitor one or more md devices and act on any state changes (see the example after this list).
  • Build: Build an array that doesn't have per-device superblocks. [Rarely used!]
  • Grow: Grow, shrink or otherwise reshape an array in some way. [Rarely used!]
  • Manage: This is for doing things to specific components of an array, such as adding new spares and removing faulty devices.
  • Misc: This is an 'everything else' mode that supports operations on active arrays, operations on component devices such as erasing old superblocks, and information-gathering operations.
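
As an illustration of Follow/Monitor mode, a daemonised monitor that mails root on state changes might look roughly like this (the address and delay are only placeholders); many distributions already ship an mdmonitor service that does the same thing:

mdadm --monitor --scan --daemonise --delay=1800 --mail=root@localhost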

Create the Partition Table (GPT)

It is highly recommended to pre-partition the disks to be used in the array.

Note: It is also possible to create a RAID directly on the raw disks (without partitions), but this is not recommended because it can cause problems when swapping in a replacement for a failed disk.

parted -a optimal /dev/vdX mklabel gpt
parted -a optimal /dev/vdX mkpart primary 1MiB xMiB   #x = total MiB - 100, leaving some slack at the end of the disk
parted -a optimal /dev/vdX set 1 raid on
...
parted -a optimal /dev/vdZ mklabel gpt
parted -a optimal /dev/vdZ mkpart primary 1MiB xMiB   #x is the previous x, do not recalculate!
parted -a optimal /dev/vdZ set 1 raid on

Create RAID device

Raid0
mdadm --create --auto=mdp /dev/mdX --level=0 --raid-devices=26 /dev/vd{a..z}1
#If --auto is not given on the command line or in the config file, the default is --auto=yes
#"part" or "mdp" causes a partitionable array (kernel 2.6 and later) to be used

Raid1
mdadm --create /dev/mdX --level=1 --raid-devices=2 /dev/vd{a,b}1 --spare-devices=2 /dev/vd{c,d}1

Raid6
mdadm --create /dev/mdX --level=6 --raid-devices=4 /dev/vd{a..d}1 --spare-devices=1 /dev/vde1

Raid10
#RAID-10 with --layout=f2 gives the best read performance
mdadm --create --verbose /dev/mdX --metadata=1.2 --chunk=256 --level=10 --raid-devices=6 --layout=f2 /dev/vd{a..f}1 --spare-devices=2 /dev/vd{g,h}1
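
Right after creation the array resynchronises in the background; it is usable immediately, but it can be handy to follow the progress (device name /dev/mdX as above):

watch -n1 cat /proc/mdstat   #live view of the resync progress
mdadm --detail /dev/mdX      #state, layout and member list of the new array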

Remember to do this so the array can be assembled again in the future:

# echo 'DEVICE partitions' > /etc/mdadm.conf
# mdadm --detail --scan >> /etc/mdadm.conf

This results in something like the following:

root # cat /etc/mdadm.conf
DEVICE partitions
ARRAY /dev/md/ metadata=1.2 name=pine: UUID=27664f0d:111e493d:4d810213:9f291abe
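
If the root file system lives on the array, the updated mdadm.conf usually also has to be included in the initramfs so the array can be assembled at boot. The exact command is distribution-specific; two common examples:

update-initramfs -u   #Debian/Ubuntu
mkinitcpio -P         #Arch Linux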

Create partitions on the array (or use LVM on top of it, which will be discussed in chapter {H})

Same as normal disk partitions: use parted OR gdisk.
Then format them:
mke2fs -t ext4 -b 4096 /dev/md0pX
...

Removing devices from an array

  • Mark it as faulty
  • mdadm --fail /dev/md0 /dev/sdxx
  • Remove it from the array
  • mdadm -r /dev/md0 /dev/sdxx
  • Remove the device permanently (after the two commands described above)
  • mdadm --zero-superblock /dev/sdxx
    OR
    dd if=/dev/zero of=/dev/sdxx bs=1M count=1   #only wipes a metadata 1.1/1.2 superblock at the start of the device
Warning: Reusing the removed disk without zeroing the superblock WILL CAUSE LOSS OF ALL DATA on the next boot, because mdadm will otherwise try to use it as part of the RAID array again.

Stop using an array

  • Unmount the target array
  • Stop the array with: mdadm --stop /dev/md0
  • Do "mdadm --zero-superblock /dev/vdxx" on each device
  • Remove the corresponding line from /etc/mdadm.conf

Adding a new device to an array for repair or spare purposes (this does not grow the number of devices in the array!)

Adding new devices with mdadm can be done on a running system with the devices mounted. Partition the new device using the same layout as the others in the array.

  • Assemble the RAID array if it is not already assembled
  • mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1
    OR
    mdadm --assemble --scan --uuid=27664f0d:111e493d:4d810213:9f291abe   #needs "mdadm.conf", which must be prepared in advance
  • Add the new device to the array
  • mdadm --add /dev/md0 /dev/sdc1

Change sync speed limits

Syncing can take a while. If the machine is not needed for other tasks, the speed limit can be increased.

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda3[2] sdb3[1]
      ... blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (.../...) finish=...min speed=9712K/sec
unused devices: <none>

Check the current speed limits:

# cat /proc/sys/dev/raid/speed_limit_min
1000
# cat /proc/sys/dev/raid/speed_limit_max
200000

Increase the limits (the values below are only examples).

# echo 100000 > /proc/sys/dev/raid/speed_limit_min
# echo 500000 > /proc/sys/dev/raid/speed_limit_max
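
The same limits are exposed as sysctl keys, which is convenient for making a change persistent across reboots (the values are again only examples):

sysctl -w dev.raid.speed_limit_min=100000
sysctl -w dev.raid.speed_limit_max=500000
echo 'dev.raid.speed_limit_min = 100000' >> /etc/sysctl.d/90-raid.conf   #persist across reboots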

Then check the syncing speed and the estimated finish time:

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda3[2] sdb3[1]
      ... blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  1.3% (.../...) finish=...min speed=16102K/sec
unused devices: <none>

{F} Further reading


Calculating the Stride and Stripe-width

The array will have an entry in

# /sys/devices/virtual/block/mdX/queue/optimal_io_size

(where mdX is the name of your array). It gives the stripe-width in bytes. Divide by the file system block size to get the stripe-width in blocks, then divide by the number of data disks to get the stride. The following calculations should match this.

Stride = (chunk size/block size)

What is a reasonable chunk size?

  • It depends on your average I/O request size. Here's the rule of thumb: big I/Os = small chunks; small I/Os = big chunks.
Tip: See also Chunks: the hidden key to RAID performance.

Next, calculate:

Stripe-width = (# of physical data disks * stride)
Example: RAID10,far2 [formatting to ext4 with the correct stripe-width and stride]

# cat /sys/devices/virtual/block/md0/queue/optimal_io_size
1048576

The hypothetical RAID10 array is composed of 2 physical disks. Because of the properties of RAID10 in far2 layout, both count as data disks.
Chunk size is 512k.
Block size is 4k.
So the stripe-width should match 1048576 / 4096 = 256 blocks, and the stride should match 256 / 2 = 128 blocks.
Stride = (chunk size/block size). In this example, the math is (512/4), so the stride = 128.
Stripe-width = (# of physical data disks * stride). In this example, the math is (2*128), so the stripe-width = 256.

# mkfs.ext4 -v -L myarray -m 0.01 -b 4096 -E stride=128,stripe-width=256 /dev/md0
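
The same numbers can be read straight from sysfs. A rough shell sketch, assuming the array is md0 and an ext4 block size of 4096 bytes:

bsize=4096                                                        #ext4 block size in bytes
opt=$(cat /sys/devices/virtual/block/md0/queue/optimal_io_size)   #stripe width in bytes
chunk=$(cat /sys/block/md0/md/chunk_size)                         #chunk size in bytes
stripe_width=$((opt / bsize))
stride=$((chunk / bsize))
echo "mkfs.ext4 -b $bsize -E stride=$stride,stripe-width=$stripe_width /dev/md0"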

{G} How to replace a broken disk?


Remove all usage of the failed disk

  • mdadm --manage /dev/mdX --remove /dev/sdX
  • umount /dev/sdX*

(FIRST) Remove the data cable of the failed disk

(SECOND) Remove the power cable of the failed disk

  • Force system to re-scan
  • echo "- - -" > /sys/class/scsi_host/hostX/scan # For all "X"
  • tail -f /var/log/syslog OR journalctl -kf # is a good idea

Replace the failed disk

(FIRST) Connect the power cable of the new disk (and wait some seconds)

(SECOND) Connect the data cable of the new disk

  • Force system to re-scan
  • echo "- - -" > /sys/class/scsi_host/hostX/scan # For all "X"
  • tail -f /var/log/syslog OR journalctl -kf # is a good idea
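
Once the new disk has been detected, it needs the same partition layout as the surviving member before it can be added back into the array. A rough sketch, assuming the surviving disk is /dev/sda, the new disk is /dev/sdb and the array is /dev/md0:

sfdisk -d /dev/sda | sfdisk /dev/sdb   #copy the partition table (GPT works with recent util-linux; sgdisk -G /dev/sdb can randomise the copied GUIDs)
mdadm --add /dev/md0 /dev/sdb1         #add the new partition; the rebuild starts automatically
cat /proc/mdstat                       #watch the recovery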

{H} Linux LVM


  • pvcreate vgcreate lvcreate
  • pvmove
  • pvremove vgremove lvremove
  • pvscan vgscan lvscan
  • pvdisplay vgdisplay lvdisplay
  • vgreduce lvreduce
  • vgextend lvextend

If a physical volume needs to be removed from a volume group, the data first needs to be moved away from the physical volume. With the pvmove command, all data on a physical volume is moved to other physical volumes within the same volume group.

root #pvmove -v /dev/sda1

Such an operation can take a while depending on the amount of data that needs to be moved. Once finished, there should be no data left on the device. Verify with pvdisplay that the physical volume is no longer used by any logical volume.
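
After pvmove has finished, the now-empty physical volume can be taken out of the volume group and wiped. A short sketch, assuming the volume group is vg0:

pvmove -v /dev/sda1      #move all extents off the physical volume (as above)
vgreduce vg0 /dev/sda1   #remove it from the volume group
pvremove /dev/sda1       #erase the LVM label from the device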


If a logical volume needs to be reduced in size, first shrink the file system itself. Not all file systems support online shrinking. For instance, ext4 does not support online shrinking, so the file system needs to be unmounted first. It is also recommended to do a file system check to make sure there are no inconsistencies:

root #umount /mnt/data
root #e2fsck -f /dev/vg0/lvol1
root #resize2fs /dev/vg0/lvol1 150M
root #lvreduce --size 150M /dev/vg0/lvol1
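
Getting the order or the sizes of resize2fs and lvreduce wrong can destroy the file system, so lvreduce can also be told to shrink the file system itself in one step (similar to the lvextend example further below):

lvreduce --resizefs --size 150M /dev/vg0/lvol1   #shrinks the file system first, then the logical volume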

An extended logical volume does not immediately provide the additional storage to the end users. For that, the file system on top of the logical volume needs to be increased in size as well. Not all file systems allow online resizing!

For instance, to resize an ext4 file system to become 500MB in size:

lvextend --size 500M /dev/vg0/lvol1
resize2fs /dev/vg0/lvol1 500M
OR combine the two steps in one:
lvextend --resizefs --size 500M /dev/vg0/lvol1

Create snapshot:

lvcreate  --size 1G --snapshot --name lv0-snapshot --permission r[w]  /dev/vg0/lv0 
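
A snapshot can then be mounted read-only for backups, discarded when no longer needed, or merged back to roll the origin volume back to the state it had when the snapshot was taken:

mount -o ro /dev/vg0/lv0-snapshot /mnt/snap   #inspect or back up the frozen state
lvremove /dev/vg0/lv0-snapshot                #throw the snapshot away
lvconvert --merge /dev/vg0/lv0-snapshot       #or roll lv0 back (if lv0 is in use, the merge completes on its next activation)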

REFERENCE

  • https://wiki.archlinux.org/index.php/RAID
  • https://raid.wiki.kernel.org/index.php/Linux_Raid
  • https://wiki.gentoo.org/wiki/LVM
  • https://wiki.archlinux.org/index.php/LVM
