
How many ASM Failure Groups should I create?

Choosing Failure Groups depends on the kinds of failures that need to be tolerated without loss of data availability. For small numbers of disks (fewer than 20), it is usually best to use the default Failure Group creation, which puts every disk in its own Failure Group. This also makes sense for large numbers of disks when the main concern is the failure of individual spindles. If there is a need to protect against the simultaneous loss of multiple disk drives due to a single component failure, then Failure Groups should be specified explicitly. For example, a Diskgroup may be constructed from several small modular disk arrays. If the system needs to continue operating when an entire modular array fails, then each Failure Group should consist of all the disks in one module. If one module fails, all the data on that module will be re-created on the other modules from the surviving copies to restore redundancy. Disks should be placed in the same Failure Group if they depend on a common piece of hardware whose failure needs to be tolerated with no loss of availability.
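The grouping rule can be sketched in a few lines of Python. This is a minimal illustration, not an Oracle API: `assign_failure_groups`, `guaranteed_available`, and the module names are hypothetical. The availability check only encodes the general rule that normal redundancy (two copies of each extent, in different Failure Groups) tolerates the loss of one whole Failure Group, and high redundancy (three copies) tolerates the loss of two.

```python
from collections import defaultdict


def assign_failure_groups(disks, shared_component=None):
    """Return {failure group name: [disk, ...]} (hypothetical helper).

    shared_component is an assumed mapping disk -> the hardware piece
    (module, controller, ...) whose failure must be tolerated. Without it,
    every disk becomes its own Failure Group, like the default.
    """
    groups = defaultdict(list)
    for disk in disks:
        key = disk if shared_component is None else shared_component[disk]
        groups[key].append(disk)
    return dict(groups)


def guaranteed_available(groups, failed_disks, copies=2):
    """True if the failed disks touch at most copies - 1 Failure Groups.

    copies=2 models normal redundancy, copies=3 high redundancy.
    Touching more groups may still be survivable, but is not guaranteed.
    """
    affected = {fg for fg, members in groups.items()
                if failed_disks.intersection(members)}
    return len(affected) <= copies - 1


disks = ["a1", "a2", "b1", "b2", "c1", "c2"]
module = {d: "module_" + d[0] for d in disks}  # three modular arrays

by_module = assign_failure_groups(disks, module)
per_disk = assign_failure_groups(disks)

module_a_down = {"a1", "a2"}
print(guaranteed_available(by_module, module_a_down))  # True
print(guaranteed_available(per_disk, module_a_down))   # False
```

With one Failure Group per module, losing module A affects a single Failure Group, so every extent still has a surviving copy. With the default one-disk-per-group layout, the two failed disks are two separate Failure Groups, and an extent may have had both of its copies inside module A.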

Having a small number of large Failure Groups may actually reduce availability in some cases. For example, half the disks in a Diskgroup could be on one power supply while the other half are on a different power supply. If this is used to divide the Diskgroup into two Failure Groups, then tripping the breaker on one power supply could drop half the disks in the Diskgroup. Reconstructing the dropped disks would require copying all the data from the surviving disks after power is restored. This can be done online, but it consumes a lot of I/O and leaves the Diskgroup unprotected against a spindle failure during the copy. However, if each disk were its own Failure Group, some extents would have both of their copies on disks behind the tripped power supply, so the Diskgroup would be dismounted instead of having disks dropped. Resetting the breaker would allow the Diskgroup to be remounted, and no data copying would be needed.
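The contrast between the two layouts can be sketched as a small Python model. It assumes normal redundancy (two copies per extent, in different Failure Groups) and a hypothetical placement in which one mirrored extent exists for every pair of disks in different Failure Groups. Real ASM placement is more constrained than this, and `power_loss_outcome` and the disk names are illustrative only.

```python
from itertools import combinations


def power_loss_outcome(disk_fg: dict[str, str], failed: set[str]) -> str:
    """Classify the effect of losing the `failed` disks (hypothetical model).

    Assumes normal redundancy: each extent has two copies, on disks in
    different Failure Groups, and one extent exists for every such pair.
    """
    extents = [
        (a, b)
        for a, b in combinations(sorted(disk_fg), 2)
        if disk_fg[a] != disk_fg[b]
    ]
    # If both copies of any extent are gone, that data is unavailable.
    if any(a in failed and b in failed for a, b in extents):
        return "dismounted; remount after power returns, no copying"
    if failed:
        return f"mounted; {len(failed)} disks dropped, full rebuild needed"
    return "mounted"


# d1, d2 share power supply A; d3, d4 share power supply B.
per_supply = {"d1": "psu_a", "d2": "psu_a", "d3": "psu_b", "d4": "psu_b"}
per_disk = {d: d for d in per_supply}  # each disk its own Failure Group

print(power_loss_outcome(per_supply, {"d1", "d2"}))
# -> mounted; 2 disks dropped, full rebuild needed
print(power_loss_outcome(per_disk, {"d1", "d2"}))
# -> dismounted; remount after power returns, no copying
```

When the Failure Groups follow the power supplies, no extent has both copies on supply A, so the Diskgroup stays mounted but the two disks must be rebuilt. When every disk is its own Failure Group, the pair (d1, d2) is a legal mirror, so losing supply A takes both copies of that extent and forces a dismount.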

Having Failure Groups of different sizes can waste disk space: there may be room to allocate primary extents but no space available for secondary extents. For example, suppose a Diskgroup has six disks and three Failure Groups. If two disks are each their own Failure Group and the other four share one common Failure Group, allocation will be very uneven. All the secondary extents for data whose primary copy is in the large Failure Group can only be placed on two of the six disks. The disks in the single-disk Failure Groups will fill up with secondary extents and block additional allocation, even though there is plenty of space left in the large Failure Group. This also puts an uneven write load on those two full disks, because they hold a disproportionate share of secondary extents, which are accessed only for writes.
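The space effect can be checked with a simple Python model. It assumes normal redundancy, treats each Failure Group's free space as a single pool shared by its disks, and places each new extent's two copies in the two Failure Groups with the most free space. It does not track which copy is primary. The disk names and the capacity of 100 extents per disk are hypothetical, and ASM's real per-disk allocator is not reproduced here.

```python
from collections import Counter


def fill_mirrored(disk_fg: dict[str, str], disk_capacity: int):
    """Allocate two-copy extents until no legal placement remains.

    Returns (extents allocated, free space left per Failure Group).
    """
    free = Counter()
    for fg in disk_fg.values():
        free[fg] += disk_capacity
    extents = 0
    while True:
        # Failure Groups that still have room, most free space first.
        open_fgs = sorted((fg for fg in free if free[fg] > 0),
                          key=lambda fg: free[fg], reverse=True)
        if len(open_fgs) < 2:
            break  # the second copy needs a different Failure Group
        first, second = open_fgs[0], open_fgs[1]
        free[first] -= 1
        free[second] -= 1
        extents += 1
    return extents, dict(free)


uneven = {"disk1": "fg1", "disk2": "fg2",
          "disk3": "fg3", "disk4": "fg3", "disk5": "fg3", "disk6": "fg3"}
even = {"disk1": "fg1", "disk2": "fg1", "disk3": "fg2",
        "disk4": "fg2", "disk5": "fg3", "disk6": "fg3"}

print(fill_mirrored(uneven, 100))  # (200, {'fg1': 0, 'fg2': 0, 'fg3': 200})
print(fill_mirrored(even, 100))    # (300, {'fg1': 0, 'fg2': 0, 'fg3': 0})
```

In the uneven layout, every extent needs at least one copy on disk1 or disk2, so only 200 mirrored extents fit while 200 units of space in fg3 stay unusable. The same six disks arranged as three two-disk Failure Groups hold 300. Pairing the two emptiest groups first reaches the best possible count in this simplified model, so the gap comes from the layout and not from the placement order.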

The unit of failure is still the individual disk, even when there are multiple disks in a Failure Group. Failure of one disk in a Failure Group does not affect the other disks in that Failure Group. For example, a Failure Group could consist of six disks connected to the same disk controller. If one of the six disks has a motor failure, the other five can continue to operate. The bad disk will be dropped from the Diskgroup, and the other five will stay in the Diskgroup.

Once a disk has been assigned to a Failure Group, it cannot be reassigned to another Failure Group. If it needs to be in another Failure Group, it must be dropped from the Diskgroup and then added back in the new Failure Group. Since the choice of a Failure Group depends on the hardware configuration, a disk would not need to be reassigned unless it is physically moved.

A Failure Group is always a subset of the disks within a single Diskgroup, so a Failure Group never includes disks from two different Diskgroups. However, disks in different Diskgroups may share the same hardware. It would be reasonable to use the same Failure Group name for these disks even though they are in different Diskgroups. This would give the impression that they are in the same Failure Group, even though that is not strictly the case.