
VXVM Failures


We very often have to fight with volume issues in day-to-day UNIX administration. With Veritas Volume Manager this happens far less than with other volume managers, but in rare cases, usually because of the environment (a SAN issue, for example), a volume may go into an I/O error state or the LUN paths to the server may be disconnected. You should still know how to fix these rare issues in Veritas Volume Manager.

I have not included command outputs here; this is a high-level plan for fixing volume issues. Please go through the steps, and if you have any doubt, let me know by adding a comment on this article.

1. Check and pinpoint the affected or failed volumes. Volume failure usually occurs after disk I/O problems, disk failures, or a system crash.
Use the following commands to check the status of the volumes and disks:
# iostat -En         (check for possible I/O errors)
# vxdisk list
# echo | format      (check for disk errors)
# vxprint -hft       (check for possible volume, plex or subdisk errors)
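For illustration, a problem disk usually shows up in the "vxdisk list" output with an error or failed status, roughly like this (the device, disk and disk group names are made-up examples, and the exact columns depend on the VxVM version):
DEVICE       TYPE            DISK         GROUP        STATUS
c1t1d0s2     auto:cdsdisk    -            -            error
-            -               testdg01     testdg       failed was:c1t1d0s2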
 
2.  If it is a volume failure because a disk is failing intermittently, do the following:
# vxdctl enable                  (rescan devices and refresh the VxVM device list)
# vxreattach                     (reattach disks to their disk groups)
# vxreattach -c cxtxdx           (check that a reattach is possible)
# vxreattach -br cxtxdx          (reattach, recovering stale plexes in the background)
# vxrecover
# vxvol -g rootdg -f start <volume-name>
# fsck -y <raw-device-of-the-veritas-volume>
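If the failure looks path-related (a SAN blip, for example), it also helps to confirm that the LUN paths have actually returned. Assuming DMP is in use, one way to check is shown below (the device name is only an example):
# vxdisk path                                     (lists the paths known to VxVM for each disk)
# vxdmpadm getsubpaths dmpnodename=c1t1d0s2       (shows the state of each path behind one DMP node)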
 
But if the volume is disabled, recover it with the steps below:
# vxmend -g <diskgroup> -o force off <plexname>   (Example plex: lvtest1-01)
# vxmend -g <diskgroup> on <plexname>             (Example plex: lvtest1-01)
# vxmend -g <diskgroup> fix clean <plexname>      (Example plex: lvtest1-01)
# vxvol -g <diskgroup> start <volumename>         (Example volume: lvtest1)
 
Finally, do the following to verify that the mount point is working fine:
· Run "vxdisk list" again and see whether the disks came online.
· Run "vxprint -htg <diskgroupname>" and make sure everything is ENABLED/ACTIVE.
· Run umount on the filesystems that were listed with I/O errors.
· Run "fuser -ck" on the mount point to clear any processes that prevent the umount.
· Run "fsck -F vxfs /dev/vx/rdsk/……"   (always run fsck on the raw device)
· Run "mount -F vxfs /dev/vx/dsk/….. /mountpoint"
· e.g. "mount -F vxfs /dev/vx/dsk/testdg1/lvtest1 /lvtest"
· or use the mountall command
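Using the example names above (disk group testdg1, volume lvtest1, mount point /lvtest; all purely illustrative), the remount sequence would look roughly like this:
# fuser -ck /lvtest                               (only if the filesystem is busy; kills the processes using it)
# umount /lvtest
# fsck -F vxfs /dev/vx/rdsk/testdg1/lvtest1
# mount -F vxfs /dev/vx/dsk/testdg1/lvtest1 /lvtest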
 
3.  If it is a volume failure because a disk has failed permanently, do the following (a command-line sketch of the replacement follows this list):
1.  Use vxdiskadm option #4, "Remove a disk for replacement".
2.  Physically remove and replace the failed disk; if it is a SAN disk, alert the storage team and have the LUN replaced.
3.  The newly replaced disk might have to be labeled before it can be used: # format <cXtXdX>; > label
4.  Run "vxdisksetup -i <new-disk>" to initialize the new disk.
5.  Use vxdiskadm option #5, "Replace a failed or removed disk".
6.  If this is a redundant volume, use vxdiskadm option #6, "Mirror volumes on a disk".
7.  In case of a non-redundant volume, run "vxvol -g <diskgroupname> -f start <volumename>"; the data must then be restored from the most recent dump or backup.
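If you prefer the plain command line to the vxdiskadm menus, the disk-replacement part of this procedure is roughly equivalent to the following sketch (appdg, appdg01 and c1t2d0 are made-up names; verify the syntax against your VxVM release):
# /etc/vx/bin/vxdisksetup -i c1t2d0               (initialize the replacement device for VxVM)
# vxdg -g appdg -k adddisk appdg01=c1t2d0         (re-associate the old disk media name with the new device)
# vxrecover -g appdg -s                           (start the volumes and resynchronize their plexes)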

 

If it is a volume failure because the volume is Unstartable, follow the steps below:
1. An unstartable volume can be incorrectly configured or have other errors or conditions that prevent it from being started.
2. To display unstartable volumes, use the vxinfo command. This displays information about the accessibility and usability of volumes:
 
# vxinfo -g diskgroupname
 – The following example output shows one volume, samplevol, as being unstartable:
samplevol   fsgen   Unstartable
rootvol     root    Started
swapvol     swap    Started

 – If a disk failure caused a volume to be disabled, you must restore the volume from a backup after replacing the failed disk.
– Any volumes that are listed as Unstartable must be restarted using the vxvol command before restoring their contents from a backup.
# vxvol -o bg -f start volumename
 (The -f option forcibly restarts the volume, and the -o bg option resynchronizes the plexes as a background task.)
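If several volumes in a disk group show up as Unstartable, vxrecover can start and recover them all at once; for example (the disk group name is a placeholder):
# vxrecover -b -g diskgroupname -s       (start all volumes in the disk group and resynchronize them in the background)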
  
4. If the volume is in the DISABLED ACTIVE state and the plex is in DISABLED RECOVER:
· The plex state can be STALE, EMPTY, NODEVICE, etc. A particular state by itself does not tell you whether the data is good or bad.
· Use "vxprint -htg <diskgroupname>" to check whether the KSTATE and STATE fields show DISABLED ACTIVE for the volume and DISABLED RECOVER for the plex.
· Follow the recovery steps below to bring the volume back to the ENABLED ACTIVE state:
# vxmend -g <diskgroup> -o force off <plexname>
# vxmend -g <diskgroup> on <plexname>
# vxmend -g <diskgroup> fix clean <plexname>      (set the known-good plex to CLEAN)
# vxmend -g <diskgroup> fix stale <plex_name>     (set any other out-of-date plexes to STALE)
· If the plex will not change its state, detach and re-attach the plex as below:
# vxplex -g rootdg det <plexname>
# vxplex -g rootdg att <volumename> <plexname>
 
· Start the volume once the plex STATE is fixed:
# vxvol -g <diskgroup> start <volume>

· Repair and mount the filesystem:
# vxprint -htg <diskgroupname>
# fsck -F vxfs /dev/vx/rdsk/<diskgroupname>/<volumename>
# mount -F vxfs /dev/vx/dsk/<diskgroupname>/<volumename> /mount/point
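After the recovery, "vxprint -htg <diskgroupname>" should show the volume and its plex back in the ENABLED ACTIVE state; a trimmed, illustrative snippet (reusing the lvtest1 example names, with some columns omitted) would look roughly like:
v  lvtest1      -         ENABLED  ACTIVE   2097152  SELECT   fsgen
pl lvtest1-01   lvtest1   ENABLED  ACTIVE   2097152  CONCAT   RW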
 
5.  If it is a volume failure because a disk is in the "online failing" state:
If a disk shows the online failing status, check the system log and "iostat -En" output for any obvious errors or faults:
# vxdisk list
DEVICE       TYPE     DISK       GROUP    STATUS
c0t0d0s2     sliced   rootdisk   rootdg   online failing
c0t1d0s2     sliced   int-0.1    aldg00   online

If no issue is found, a Sun support case should be opened to verify the status of server-attached disks, or the SAN admin should be alerted for external LUNs. If there is no real problem, the failing flag can usually be cleared with the vxedit command below:
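# vxedit -g <diskgroupname> set failing=off <diskname>     (clears the failing flag; <diskname> is the disk media name, e.g. rootdisk in the listing above)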

Thank you for reading this article. Please leave a comment if you have any doubt, and I will get back to you as soon as possible.