Monday, December 04, 2006

ZFS Cheatsheet

The three primary goals of ZFS are:

  1. Highly scalable (128-bit) data repository
  2. Ease of administration
  3. Guaranteed on disk data integrity

Sample ZFS commands and usage


What You Do and See / Why
$ man zpool
$ man zfs
Get familiar with command structure and options
$ su
Password:
# cd /
# mkfile 100m disk1 disk2 disk3 disk5
# mkfile 50m disk4
# ls -l disk*
-rw------T 1 root root 104857600 Oct 6 08:13 disk1
-rw------T 1 root root 104857600 Oct 6 08:13 disk2
-rw------T 1 root root 104857600 Oct 6 08:13 disk3
-rw------T 1 root root 52428800 Oct 6 08:13 disk4
Create some “virtual devices” or vdevs as described in the zpool documentation. These can also be real disk slices if you have them available.
# zpool create myzfs /disk1 /disk2
# zpool list

NAME SIZE USED AVAIL CAP HEALTH ALTROOT
myzfs 199M 24.0K 199M 0% FAULTED -
Create a storage pool and check the size and usage.
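With real disks, the same kind of pool could be created from device names rather than files. A minimal sketch (c1t0d0 and c1t1d0 are example device names, not from the original text):

# zpool create myzfs c1t0d0 c1t1d0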
# zpool status -v
pool: myzfs
state: FAULTED
reason: Too many devices are damaged or missing for the pool
to function.
see: http://www.sun.com/msg/ZFS-XXXX-02
config:

NAME STATUS
/disk1 ONLINE
/disk2 ONLINE
Get more detailed status of the zfs storage pool.
# zpool destroy myzfs
# zpool list
no pools available
Destroy a zfs storage pool
# zpool create myzfs mirror /disk1 /disk4
invalid vdev specification
use '-f' to override the following errors:
mirror contains devices of different sizes
Attempting to create a pool from vdevs of different sizes fails. The -f option forces the creation, but the mirror's capacity is then limited to that of the smallest device.
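For illustration, forcing the mismatched mirror would look like the following sketch (the pool is destroyed again so the examples below can continue from scratch):

# zpool create -f myzfs mirror /disk1 /disk4
# zpool destroy myzfs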
# zpool create myzfs mirror /disk1 /disk2
# zpool status -v
pool: myzfs
state: FAULTED
reason: Too many devices are damaged or missing for the pool
to function.
see: http://www.sun.com/msg/ZFS-XXXX-02
config:

NAME STATUS
mirror ONLINE
/disk1 ONLINE
/disk2 ONLINE
Create a mirrored pool. I’m not sure at this time why it is labeled as faulted.
# zpool iostat 5
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
myzfs 24.5K 99.5M 0 24 0 173
myzfs 24.5K 99.5M 0 0 0 0
myzfs 24.5K 99.5M 0 0 0 0
Get I/O statistics for the pool
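To break the statistics down per vdev rather than per pool, the -v option can be added (a sketch):

# zpool iostat -v 5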
# zfs create myzfs/jim
# df -k
Filesystem kbytes used avail capacity Mounted on
...
myzfs/jim 85478 8 85470 1% /zfs/myzfs/jim
Create a file system and check it with the standard df -k command. Note the size is 83.5 MB from a 100 MB (mirrored) zpool. File systems are automatically mounted by default under the /zfs location. See the Mountpoints section of the zfs man page for more details.
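If you prefer a different mount point, the mountpoint property can be changed; a sketch, where /export/jim is just an example path:

# zfs set mountpoint=/export/jim myzfs/jim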
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
myzfs 33.5K 83.5M - /zfs/myzfs
myzfs/jim 8K 83.5M 8K /zfs/myzfs/jim
List current zfs file systems.
# zpool add myzfs /disk3
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses 2-way mirror
and new vdev uses 1-way file
Attempt to add a single vdev to a mirrored set fails
# zpool add myzfs mirror /disk3 /disk5
pool: myzfs
state: FAULTED
reason: Too many devices are damaged or missing for the pool
to function.
see: http://www.sun.com/msg/ZFS-XXXX-02
config:

NAME STATUS
mirror ONLINE
/disk1 ONLINE
/disk2 ONLINE
mirror ONLINE
/disk3 ONLINE
/disk5 ONLINE
Add a mirrored set of vdevs
# zfs create myzfs/jim2
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
myzfs 26.5M 156M - /zfs/myzfs
myzfs/jim 26.3M 156M 26.3M /zfs/myzfs/jim
myzfs/jim2 8K 156M 8K /zfs/myzfs/jim2
Create a second file system. Note that both file systems show 156M available because no quotas are set. Each could grow to fill the entire pool.
# zfs set reservation=20m myzfs/jim
# zfs list -o reservation
RESERV
none
20.0M
Reserve a specified amount of space for a file system, ensuring that other users don't take up all of the pool.
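The reservation can later be removed by setting it back to none (a sketch):

# zfs set reservation=none myzfs/jim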
# zfs set quota=20m myzfs/jim2
# zfs list -o quota myzfs/jim myzfs/jim2
QUOTA
none
20.0M
Set and view quotas
# zfs set compression=on myzfs/jim2
# zfs list -o compression
COMPRESS
off
off
on
Turn on compression
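Once data has been written, the achieved compression ratio can be inspected through the compressratio property (a sketch):

# zfs get compressratio myzfs/jim2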
# zfs snapshot myzfs/jim@test
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
myzfs 20.0M 63.5M - /zfs/myzfs
myzfs/jim 8K 83.5M 8K /zfs/myzfs/jim
myzfs/jim@test 0 - 8K /zfs/myzfs/jim@test
Create a snapshot called test.
# zfs rollback myzfs/jim@test
Roll back to a snapshot.
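Snapshots can also be listed on their own using the -t option (a sketch):

# zfs list -t snapshot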
# zfs clone myzfs/jim@test myzfs/jim3
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
myzfs 20.0M 63.5M - /zfs/myzfs
myzfs/jim 8K 83.5M 8K /zfs/myzfs/jim
myzfs/jim@test 0 - 8K /zfs/myzfs/jim@test
myzfs/jim3 0 63.5M 8K /zfs/myzfs/jim3
A snapshot is read-only and not directly writable; to get a writable copy, a clone must be made. The target dataset can be located anywhere in the ZFS hierarchy, and will be created as the same type as the original.
# zfs destroy myzfs/jim2
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
myzfs 26.5M 157M - /zfs/myzfs
myzfs/jim 26.3M 157M 26.3M /zfs/myzfs/jim
Destroy a filesystem
# zpool destroy myzfs
cannot destroy 'myzfs': pool is not empty
use '-f' to force destruction anyway
Can’t destroy a pool with active filesystems.
# zfs unmount myzfs/jim
Unmount a ZFS file system
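The file system can be mounted again with zfs mount (a sketch):

# zfs mount myzfs/jim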
# zpool destroy -f myzfs
# zpool status -v
no pools available
Use the -f option to destroy a pool that still contains file systems.

Solaris Network configuration

Table of contents :

  1. Enable the network card
  2. Configuring the IP address and netmask and bringing the interface up
  3. Configuring a virtual interface
  4. IP forwarding
  5. Router Configuration
  6. Network Terms
  7. Next Steps
The ifconfig command is used in Solaris to configure network interfaces. The following sections describe the steps needed to configure a freshly installed network card from the root prompt.

1. Enable the network card

#ifconfig hme0 plumb

ifconfig -a should now show output of the following type, which means the device is enabled and is ready for an IP address and netmask to be configured:

hme0: flags=842 mtu 1500
inet 0.0.0.0 netmask 0
ether 3:22:11:6d:2e:1f

2. Configuring the IP address and netmask and bringing the interface up

#ifconfig hme0 192.9.2.106 netmask 255.255.255.0 up

ifconfig -a will now show the IP address, netmask, and UP status as follows:

hme0: flags=843 mtu 1500
inet 192.9.2.106 netmask ffffff00 broadcast 192.9.2.255
ether 3:22:11:6d:2e:1f


The file /etc/netmasks is used to define netmasks for IP networks.
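A typical entry maps a network number to its netmask, one pair per line; a sketch using the example network above:

192.9.2.0    255.255.255.0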

127.0.0.1 is the standard loopback address (on the 127.0.0.0 loopback network) used by the kernel. When no other interface is configured, this will be the only entry displayed by the ifconfig -a command.

3. Configuring a virtual interface
A virtual interface can be configured so that hme0 answers to more than one IP address. This is done by creating an alias on hme0, which can only be configured with the ifconfig command. The alias device names become hme0:1, hme0:2, and so on.

#ifconfig hme0:1 172.40.30.4 netmask 255.255.0.0 up

ifconfig -a will show the original hme0 and the alias interface:

hme0: flags=843 mtu 1500
inet 192.9.2.106 netmask ffffff00 broadcast 192.9.2.255
ether 3:22:11:6d:2e:1f
hme0:1: flags=842 mtu 1500
inet 172.40.30.4 netmask ffff0000 broadcast 172.40.255.255


4. IP forwarding

IP forwarding allows the system to pass packets between its network interfaces, effectively letting it act as a router for traffic that is not addressed to it.

IP forwarding is enabled automatically when the system detects more than one interface at boot time. The startup file involved is /etc/rc2.d/S69inet.

When enabled in this way, IP forwarding can be turned off with the following command:

#ndd -set /dev/ip ip_forwarding 0
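The current setting can be checked with ndd as well; 1 means forwarding is on, 0 means off (a sketch):

#ndd -get /dev/ip ip_forwarding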

5. Router Configuration

After the interfaces and IP addresses have been configured, the system needs a default router, which allows the machine to talk to the world outside the local network.

You can specify a particular route for a particular network, as in the following example:

#route add -net 10.0.0.0 -netmask 255.0.0.0 172.40.30.1 1

If the destination IP address does not match such a route, the system forwards the request to the default router.

The default route is defined permanently by editing the /etc/defaultrouter file and putting the router's IP address in it. This file is read by /etc/rc2.d/S69inet during the boot process and the entry is added to the routing table.
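The file simply contains the router's IP address; a sketch using the example router address below:

205.100.155.2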

A route can also be added on a running system with the route add command, but the change will be lost on reboot. To make the change permanent, be sure to put an entry in /etc/defaultrouter.

#route add default 205.100.155.2 1

#route change default 205.100.155.2 1

The 1 at the end is the number of hops to the next gateway.

If an interface is not responding on the network, check that it has the correct IP address and netmask and that the network cables are fine.

6. Network Terms
CIDR (Classless Inter-Domain Routing): the notation often used instead of writing the subnet mask alongside the IP address. The network prefix length is appended to the address as a slash followed by the number of network bits. This means that the IP address 192.200.20.10 with the subnet mask 255.255.255.0 can also be expressed as 192.200.20.10/24. The /24 indicates the prefix length, which is equal to the number of contiguous one-bits in the subnet mask (11111111.11111111.11111111.00000000); the remaining zero bits address the hosts on this network.
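As an illustrative calculation (not from the original text): a /26 prefix corresponds to the mask 255.255.255.192, leaving 32 - 26 = 6 host bits, i.e. 2^6 = 64 addresses per subnet, of which 62 are usable for hosts once the network and broadcast addresses are set aside.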

VLSM (Variable Length Subnet Masking): a network can be variably subnetted into smaller networks, each smaller network having a different subnet mask. This functionality is available in Solaris 2.6 and above.

Testing Veritas Cluster

Actual commands are in black.

0. Check Veritas Licenses - for FileSystem, Volume Manager AND Cluster

vxlicense -p

If any licenses are not valid or have expired, get them FIXED before continuing! All licenses should say "No expiration". If ANY license has an actual expiration date, the test failed. Permanent licenses do NOT have an expiration date. Non-essential licenses may be moved; however, a senior admin should do this.

1. Hand check SystemList & AutoStartList

On either machine:

grep SystemList /etc/VRTSvcs/conf/config/main.cf
You should get:
SystemList = { system1, system2 }

grep AutoStartList /etc/VRTSvcs/conf/config/main.cf
You should get:
AutoStartList = { system1, system2 }

Each list should contain both machines. If not, many of the next tests will fail.

If your lists do NOT contain both systems, you will probably need to modify them with commands that follow.

more /etc/VRTSvcs/conf/config/main.cf (See if it is reasonable. It is likely that the systems aren't fully set up)
haconf -makerw (this lets you write the conf file)
hagrp -modify oragrp SystemList system1 0 system2 1
hagrp -modify oragrp AutoStartList system1 system2
haconf -dump -makero (this makes conf file read only again)
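To double-check the result, the modified attributes can be displayed again; a sketch, assuming the group is named oragrp as in the commands above:

hagrp -display oragrp -attribute SystemList
hagrp -display oragrp -attribute AutoStartList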

2. Verify Cluster is Running

First verify that veritas is up & running:

hastatus -summary

If this command could NOT be found, add the following to root's path in /.profile:

vi /.profile
add /opt/VRTSvcs/bin to your PATH variable

If /.profile does not already exist, use this one:

PATH=/usr/bin:/usr/sbin:/usr/ucb:/usr/local/bin:/opt/VRTSvcs/bin:/sbin:$PATH
export PATH

. /.profile

Re-verify that the command now runs if you changed /.profile:
hastatus -summary

Here is the expected result (your SYSTEMs/GROUPs may vary):

One system should be OFFLINE and one system should be ONLINE ie:
#
hastatus -summary

-- SYSTEM STATE

-- System State Frozen

A e4500a RUNNING 0

A e4500b RUNNING 0

-- GROUP STATE

-- Group System Probed AutoDisabled State

B oragrp e4500a Y N ONLINE

B oragrp e4500b Y N OFFLINE

If your systems do not show the above status, try these debugging steps:

  • If NO systems are up, run hastart on both systems and run hastatus -summary again.
  • If only one system is shown, start the other system with hastart. Note: one system should ALWAYS be OFFLINE for the way we configure systems here. (If we ran Oracle Parallel Server, this could change -- but currently we run standard Oracle server.)
  • If both systems are up but are OFFLINE and hastart did NOT correct the problem and oracle filesystems are not running on either system, the cluster needs to be reset. (This happens under strange network situations with GE Access.) [You ran hastart and that wasn't enough to get full cluster to work.]

Verify that the systems have the following EXACT status (though your machine names will vary for other customers):

gedb002# hastatus -summary

-- SYSTEM STATE

-- System State Frozen

A gedb001 RUNNING 0

A gedb002 RUNNING 0

-- GROUP STATE

-- Group System Probed AutoDisabled State

B oragrp gedb001 Y N OFFLINE

B oragrp gedb002 Y N OFFLINE

gedb002# hares -display | grep ONLINE

nic-qfe3 State gedb001 ONLINE

nic-qfe3 State gedb002 ONLINE

gedb002# vxdg list

NAME STATE ID

rootdg enabled 957265489.1025.gedb002

gedb001# vxdg list

NAME STATE ID

rootdg enabled 957266358.1025.gedb001

Recovery Commands:

hastop -all
on one machine
hastart
wait a few minutes
on other machine
hastart
hastatus -summary (make sure one is OFFLINE && one is ONLINE)


If none of these steps resolved the situation, contact Lorraine or Luke (possibly Russ Button or Jen Redman if they made it to Veritas Cluster class) or a Veritas Consultant.

3. Verify Services Can Switch Between Systems

Once hastatus -summary works, note the GROUP name used. Usually it will be "oragrp", but the installer can use any name, so please determine its name.

First check if group can switch back and forth. On the system that is running (system1), switch veritas to other system (system2):

hagrp -switch groupname -to system2 [ie: hagrp -switch oragrp -to e4500b]

Watch failover with hastatus -summary. Once it is failed over, switch it back:

hagrp -switch groupname -to system1
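Either switch can also be watched live: running hastatus without -summary keeps a continuously updating event display (a sketch; press Ctrl-C to exit):

hastatus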

4. Verify OTHER System Can Go Up & Down Smoothly For Maintenance

On system that is OFFLINE (should be system 2 at this point), reboot the computer.

ssh system2
/usr/sbin/shutdown -i6 -g0 -y

Make sure that the system comes back up and rejoins the cluster after the reboot. That is, when the reboot is finished, the second system should show as OFFLINE in hastatus.

hastatus -summary

Once this is done, switch the group to system2 with hagrp -switch and repeat the reboot for the other system:

hagrp -switch groupname -to system2
ssh system1
/usr/sbin/shutdown -i6 -g0 -y

Verify that system1 is back in the cluster once rebooted:

hastatus -summary

5. Test Actual Failover For System 2 (and pray db is okay)

To do this, we will kill off the listener process, which should force a failover. This test SHOULD be okay for the db (that is why we choose LISTENER) but there is a very small chance things will go wrong .. hence the "pray" part :).

On the system that is online (should be system2), kill off the ORACLE LISTENER process:

ps -ef | grep LISTENER

Output should be like:

root 1415 600 0 20:43:58 pts/0 0:00 grep LISTENER

oracle 831 1 0 20:27:06 ? 0:00 /apps/oracle/product/8.1.5/bin/tnslsnr LISTENER -inherit

kill -9 process-id (the first # in list - in this case 831)
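If you prefer, the listener PID can be picked out directly; a sketch, assuming pgrep is available (Solaris 7 and later):

pgrep -f tnslsnr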

Failover will take a few minutes

You will note that system 2 is faulted -- and system 1 is now online

You need to CLEAR the fault before trying to fail back over.

hares -display | grep FAULT
for the resource that is failed (in this case, LISTENER)
Clear the fault
hares -clear resource-name -sys faulted-system [ie: hares -clear LISTENER -sys e4500b]

6. Test Actual Failover For System 1 (and pray db is okay)

Now we do the same thing for the other system. First, verify that the other system is NOT faulted:

hastatus -summary

Now do the same thing on this system... To do this, we will kill off the listener process, which should force a failover.

On the system that is online (should be system1 at this point), kill off the ORACLE LISTENER process:

ps -ef | grep LISTENER

Output should be like:

oracle 987 1 0 20:49:19 ? 0:00 /apps/oracle/product/8.1.5/bin/tnslsnr LISTENER -inherit

root 1330 631 0 20:58:29 pts/0 0:00 grep LISTENER

kill -9 process-id (the first # in list - in this case 987)

Failover will take a few minutes

You will note that system 1 is faulted -- and system 2 is now online

You need to CLEAR the fault before trying to fail back over.

hares -display | grep FAULT for the resource that is failed (in this case, LISTENER)
Clear the fault

hares -clear resource-name -sys faulted-system [ie: hares -clear LISTENER -sys e4500a]

Run:

hastatus -summary

to make sure everything is okay.
