Solaris Zone’s shutting down state

Lingesh

11 years ago

Solaris zones are not completely isolated from the global zone. Every local zone runs as an instance on top of the global zone and depends on the global zone’s kernel. For example, you can see all local zone processes from the global zone using the “ps -efZ” command, which shows that zones are not fully isolated from the global zone. Sometimes this design makes a Unix admin’s job harder. Recently I have seen some local zones that do not halt and instead stay in the temporary “shutting_down” state permanently.
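As a small illustration of the “ps -efZ” view, here is a sketch of a filter that keeps only the header line and the processes of one zone. The function name zone_procs is made up for this example, and it assumes the zone name is the first column of the “ps -efZ” output.

```shell
# Hypothetical helper: keep the header line plus every process whose
# ZONE column (assumed to be field 1 of `ps -efZ` output) matches $1.
# Reads the ps output on stdin.
zone_procs() {
    awk 'NR == 1 || $1 == zone' zone="$1" -
}

# Example use from the global zone:
#   ps -efZ | zone_procs sol1
```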

Here is my local zone’s status after issuing a reboot command to the local zone. The “shutting_down” state indicates that the zone is being halted.

bash-3.00# zoneadm list -cv
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
   3 sol1             shutting_down /export/zone/sollz1         native   shared

Sometimes it may go to the “down” state as well.

bash-3.00# zoneadm list -cv
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
   2 sol1             down       /export/zone/sollz1            native   shared
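If you want to spot such zones quickly, a filter along these lines can help. It is only a sketch: the function name stuck_zones is made up, and it assumes the column layout of the “zoneadm list -cv” output shown above (zone name in field 2, status in field 3).

```shell
# Hypothetical helper: print "name state" for every zone whose STATUS
# column is shutting_down or down.
# Reads `zoneadm list -cv` output on stdin; the header line is ignored
# because its third field is the literal word STATUS.
stuck_zones() {
    awk '$3 == "shutting_down" || $3 == "down" { print $2, $3 }'
}

# Example use from the global zone:
#   zoneadm list -cv | stuck_zones
```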

In the end we had to reboot the global zone to fix the issue. I tried the steps below to bring down the local zone, but they did not work for me.

Find the local zone’s zoneadmd processes and try to kill them.

bash-3.00# ps -ef |grep zoneadmd |grep sol1
    root  4763  4762   0 14:41:15 ?           0:00 zoneadmd -z sol1
    root  4783 29263   0 14:42:20 pts/4       0:00 grep zoneadmd
    root  4762  4761   0 14:41:15 pts/4       0:00 zoneadmd -z sol1
bash-3.00# kill -9 4763
bash-3.00# kill -9 4762
bash-3.00# zoneadm -z sol1 halt

We raised an Oracle support case for this, and they said that it is a known issue.

To find the root cause, they asked us to collect the information below.

1. To find out which process did not stop or is still waiting for something on the local zone, use pgrep against the affected zone.

bash-3.00# zoneadm list -cv
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
   3 sol1             shutting_down /export/zone/sollz1         native   shared
bash-3.00# pgrep -fl sol1
  320 zpool-sol1pool
 4761 zoneadm -z sol1 boot
13675 zlogin sol1 halt
bash-3.00# pstack 13675
13675:  zlogin sol1 halt
 fee2b075 read     (0, 80442c0, 400)
 08052de8 ???????? (4, 5)
 08053017 ???????? (4, 5, 6, 8, 1)
 08053f84 ???????? (80476db, 8046d50, 1, 8166d60, 8176f50)
 080549cf main     (3, 80475a8, 80475b8) + 740
 080520da ???????? (3, 80476d4, 80476db, 80476e0, 0, 80476e5)
bash-3.00# ptree 13675
5021  /usr/bin/gnome-terminal
  5024  sh
    5025  bash
      13675 zlogin sol1 halt
        13676 
bash-3.00# pfiles 13675
13675:  zlogin sol1 halt
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0620 dev:296,0 ino:12582922 uid:0 gid:7 rdev:24,3
      O_RDWR
      /devices/pseudo/pts@0:3
   1: S_IFCHR mode:0620 dev:296,0 ino:12582922 uid:0 gid:7 rdev:24,3
      O_RDWR
      /devices/pseudo/pts@0:3
   2: S_IFCHR mode:0620 dev:296,0 ino:12582922 uid:0 gid:7 rdev:24,3
      O_RDWR
      /devices/pseudo/pts@0:3
   4: S_IFIFO mode:0000 dev:294,0 ino:12236 uid:0 gid:0 size:0
      O_RDWR
   5: S_IFIFO mode:0000 dev:294,0 ino:12236 uid:0 gid:0 size:0
      O_RDWR
   6: S_IFIFO mode:0000 dev:294,0 ino:12237 uid:0 gid:0 size:0
      O_RDWR
   8: S_IFIFO mode:0000 dev:294,0 ino:12238 uid:0 gid:0 size:0
      O_RDWR
bash-3.00# gcore 13675
gcore: core.13675 dumped

Provide the output of all the commands above, together with the core file, to Oracle support so they can find the root cause of this issue.
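The collection steps above can be wrapped in a small function so that everything lands in one directory. This is a sketch, not something Oracle support supplied: the name collect_zone_diag and the output file names are made up, and it simply skips any tool that is not installed.

```shell
# Hypothetical helper: save pstack, ptree and pfiles output plus a gcore
# image for each PID into one directory.
# Usage: collect_zone_diag <outdir> <pid>...
collect_zone_diag() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for pid in "$@"; do
        for tool in pstack ptree pfiles; do
            # Skip tools that are missing instead of aborting the run.
            command -v "$tool" >/dev/null 2>&1 || continue
            "$tool" "$pid" > "$outdir/$tool.$pid.txt" 2>&1
        done
        if command -v gcore >/dev/null 2>&1; then
            # -o sets the file name prefix, so this writes core.<pid>
            # inside $outdir.
            gcore -o "$outdir/core" "$pid" >/dev/null 2>&1
        fi
    done
    return 0
}

# Example use from the global zone (the directory is an arbitrary choice):
#   collect_zone_diag /var/tmp/sol1-diag $(pgrep -f sol1)
```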

2. In some cases, none of these commands will work. In this situation, it is better to reboot the global zone using “reboot -d” to generate a crash dump. After rebooting the global zone, you can upload the crash dump to Oracle to find the root cause.
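Before forcing the dump, it may be worth checking that savecore is enabled, otherwise the dump might not be saved after the reboot. The sketch below assumes that running “dumpadm” without arguments prints a line of the form “Savecore enabled: yes”; the function name savecore_enabled is made up for this example.

```shell
# Hypothetical helper: succeed only when dumpadm output on stdin reports
# that savecore is enabled. The "Savecore enabled: yes" line format is an
# assumption about what dumpadm prints on your release.
savecore_enabled() {
    grep -i 'savecore enabled: *yes' >/dev/null
}

# Example use from the global zone (this reboots the machine):
#   dumpadm | savecore_enabled && reboot -d
```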

Root cause:
A zone can become stuck in one of these states if it is unable to tear down the application environment state (such as mounted file systems) or if some portion of the virtual platform cannot be destroyed. Such cases require operator intervention. Most of the time you end up rebooting the global zone to fix this kind of issue.

In my case, analysis of the crash dump showed that the local zone shutdown hung because some kernel threads were stuck in NFS.
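Since leftover mounts such as NFS are a typical cause, it can help to list what is still mounted under the zone path from the global zone. The following is a sketch under two assumptions: the zone’s mounts appear in the global zone’s /etc/mnttab below the zonepath, and the mount point and file system type are fields 2 and 3 of each line. The function name zone_mounts is made up.

```shell
# Hypothetical helper: print "mountpoint fstype" for every mnttab entry at
# or below the given zonepath.
# Reads mnttab-format lines (special, mount point, fstype, options, time)
# on stdin.
zone_mounts() {
    awk 'index($2, zp "/") == 1 || $2 == zp { print $2, $3 }' zp="$1" -
}

# Example use from the global zone:
#   zone_mounts /export/zone/sollz1 < /etc/mnttab
```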

The conclusion is that there is no way to force the zone to halt if it is stuck for one of the reasons mentioned above. You need to reboot the global zone to recover.

Thank you for reading this article. Please leave a comment if you have any questions.