TAF FAILOVER EXAMPLE
--------------------
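Note: this walkthrough assumes the SRV_AVAIL alias in the client's tnsnames.ora is configured for TAF. A minimal sketch of such an entry follows; the VIP hostnames (opcbsol1-vip, opcbsol2-vip) and the RETRIES/DELAY values are illustrative assumptions, not values taken from this environment:
SRV_AVAIL =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = yes)
      # assumed VIP hostnames for the two nodes in this example
      (ADDRESS = (PROTOCOL = TCP)(HOST = opcbsol1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = opcbsol2-vip)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = SRV_AVAIL)
      (FAILOVER_MODE =
        (TYPE = SELECT)
        (METHOD = BASIC)
        (RETRIES = 180)
        (DELAY = 5)
      )
    )
  )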
1. Connect to the service:
sqlplus scott/tiger@SRV_AVAIL
2. Verify the instance you are connected to:
SQL> select instance_name, host_name from v$instance;
INSTANCE_NAME    HOST_NAME
---------------- ----------------
V10SN1           opcbsol1
3. Run a long-running select (for example, select * from dba_extents)
4. Power off node 1 (opcbsol1)
5. The query pauses for a short time and then fails over to node 2, which can be
seen with the following query once the select completes (a check of the session's
TAF status is also sketched after these steps):
SQL> select instance_name, host_name from v$instance;
INSTANCE_NAME    HOST_NAME
---------------- ----------------
V10SN2           opcbsol2
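In addition to v$instance, the session's own TAF status can be checked; a minimal
sketch, assuming the session is still connected as SCOTT (after the failover,
FAILED_OVER should report YES for the session):
SQL> select username, failover_type, failover_method, failed_over
  2  from v$session
  3  where username = 'SCOTT';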
Under the covers, node 1's virtual IP and the SRV_AVAIL service failed over to
node 2. The section below shows what the logs and traces looked like when node 1
was powered off.
TAF FAILOVER TRACE DATA FROM SUCCESSFUL FAILOVER
------------------------------------------------
- Output of $ORA_CRS_HOME/bin/crs_stat (relevant entries):
[opcbsol1]/> $ORA_CRS_HOME/bin/./crs_stat
NAME=ora.V10SN.SRV_AVAIL.V10SN1.sa
TYPE=application
TARGET=ONLINE
STATE=OFFLINE
NAME=ora.V10SN.SRV_AVAIL.V10SN1.srv
TYPE=application
TARGET=ONLINE
STATE=ONLINE on opcbsol2
NAME=ora.V10SN.SRV_AVAIL.cs
TYPE=application
TARGET=ONLINE
STATE=ONLINE on opcbsol2
NAME=ora.opcbsol1.LISTENER_OPCBSOL1.lsnr
TYPE=application
TARGET=ONLINE
STATE=OFFLINE
NAME=ora.opcbsol1.vip
TYPE=application
TARGET=ONLINE
STATE=ONLINE on opcbsol2
NAME=ora.opcbsol2.LISTENER_OPCBSOL2.lsnr
TYPE=application
TARGET=ONLINE
STATE=ONLINE on opcbsol2
NAME=ora.opcbsol2.vip
TYPE=application
TARGET=ONLINE
STATE=ONLINE on opcbsol2
Comparing this to the previous sample, we see that node 1's virtual IP was
failed over to node 2 by CRS. We also see that the
ora.V10SN.SRV_AVAIL.V10SN1.srv and ora.V10SN.SRV_AVAIL.cs resources were
failed over to node 2.
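The same resource states can also be viewed in a more compact, tabular form, or
queried for a single resource by name; a sketch (exact output layout varies by
release):
[opcbsol2]/> $ORA_CRS_HOME/bin/crs_stat -t
[opcbsol2]/> $ORA_CRS_HOME/bin/crs_stat ora.V10SN.SRV_AVAIL.cs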
- Output of "srvctl status service -d -s "
[opcbsol2]/> srvctl status service -d V10SN -s SRV_AVAIL
Service SRV_AVAIL is running on instance(s) V10SN2
As we previously saw, the service is now running on instance 2.
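Once node 1 is back up, the service can be moved back to its preferred instance
with srvctl; a sketch, assuming V10SN1 is the preferred instance for SRV_AVAIL:
[opcbsol2]/> srvctl relocate service -d V10SN -s SRV_AVAIL -i V10SN2 -t V10SN1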
- CRS logs in $ORA_CRS_HOME/crs/log and $ORA_CRS_HOME/css/log on each node, used
to monitor the service failover:
CSS Log from Node 2:
2004-05-05 15:32:53.457 [17] >WARNING: clssnmPollingThread: Eviction started for node 0, flags 0x0001, state 3, wt4c 0
2004-05-05 15:32:58.565 [17] >TRACE: clssnmDoSyncUpdate: Initiating sync 5
2004-05-05 15:32:58.566 [12] >TRACE: clssnmHandleSync: Acknowledging sync: src[1] seq[1282004] sync[5]
2004-05-05 15:32:59.197 [1] >USER: NMEVENT_SUSPEND [00][00][00][02]
2004-05-05 15:33:00.626 [12] >USER: clssnmHandleUpdate: SYNC(5) from node(1) completed
2004-05-05 15:33:00.626 [12] >USER: clssnmHandleUpdate: NODE(1) IS ACTIVE MEMBER OF CLUSTER
2004-05-05 15:33:01.218 [26] >USER: NMEVENT_RECONFIG [00][00][00][02]
CLSS-3000: reconfiguration successful, incarnation 5 with 1 nodes
At 15:33:01, the clusterware finished reconfiguring down to a one-node cluster.
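The reconfiguration messages above can be located quickly by searching the CSS
log directory; a sketch (log file names vary by release):
[opcbsol2]/> grep -i "reconfiguration successful" $ORA_CRS_HOME/css/log/*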
CRS Log from Node 2:
2004-05-05 15:33:01.265: Processing MemberLeave
2004-05-05 15:33:01.265: [MEMBERLEAVE:82269] Processing member leave for opcbsol1, incarnation: 5
2004-05-05 15:33:07.480: Attempting to start `ora.V10SN.SRV_AVAIL.V10SN1.sa` on member `opcbsol2`
`ora.V10SN.SRV_AVAIL.cs` on `opcbsol2` went OFFLINE unexpectedly
2004-05-05 15:33:08.965: Attempting to stop `ora.V10SN.SRV_AVAIL.cs` on member `opcbsol2`
2004-05-05 15:33:11.635: Stop of `ora.V10SN.SRV_AVAIL.cs` on member `opcbsol2` succeeded.
`ora.V10SN.SRV_AVAIL.cs` exceeded it's failure threshold. Stopping it and its dependents!
2004-05-05 15:33:12.403: Target set to OFFLINE for `ora.V10SN.SRV_AVAIL.cs`
2004-05-05 15:33:17.294: Start of `ora.V10SN.SRV_AVAIL.V10SN1.sa` on member `opcbsol2` succeeded.
2004-05-05 15:33:18.261: Attempting to start `ora.V10SN.SRV_AVAIL.V10SN1.srv` on member `opcbsol2`
2004-05-05 15:33:25.395: Start of `ora.V10SN.SRV_AVAIL.V10SN1.srv` on member `opcbsol2` succeeded.
2004-05-05 15:33:26.798: CRS-1002: Resource ora.V10SN.SRV_AVAIL.V10SN1.sa is already running on member opcbsol2
2004-05-05 15:33:28.831: Attempting to start `ora.V10SN.SRV_AVAIL.cs` on member `opcbsol2`
2004-05-05 15:33:29.612: Attempting to start `ora.opcbsol1.vip` on member `opcbsol2`
2004-05-05 15:33:31.198: Start of `ora.V10SN.SRV_AVAIL.cs` on member `opcbsol2` succeeded.
2004-05-05 15:33:37.605: Start of `ora.opcbsol1.vip` on member `opcbsol2` succeeded.
2004-05-05 15:33:37.635: [MEMBERLEAVE:82269] Do failover for: opcbsol1
2004-05-05 15:33:37.639: [MEMBERLEAVE:82269] Post recovery done evmd event for: opcbsol1
`ora.V10SN.SRV_AVAIL.V10SN1.sa` on `opcbsol2` went OFFLINE unexpectedly
2004-05-05 15:35:22.872: Attempting to stop `ora.V10SN.SRV_AVAIL.V10SN1.sa` on member `opcbsol2`
2004-05-05 15:35:24.908: Stop of `ora.V10SN.SRV_AVAIL.V10SN1.sa` on member `opcbsol2` succeeded.
`ora.V10SN.SRV_AVAIL.V10SN1.sa` failed on `opcbsol2`, relocating.
Cannot relocate `ora.V10SN.SRV_AVAIL.V10SN1.sa`. Stopping dependents
Here we see the service and VIP resources failing over (note that the listener resource does not fail over).
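The resource transitions above can likewise be pulled from the CRS log directory;
a sketch (log file names vary by release):
[opcbsol2]/> egrep "SRV_AVAIL|opcbsol1.vip" $ORA_CRS_HOME/crs/log/*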
- Output of the following query for parameter settings:
SQL> set lines 120
SQL> set pages 200
SQL> column name format a20 tru
SQL> column value format a40 wra
SQL> select inst_id, name, value
2 from gv$parameter
3 where name in ('service_names','local_listener','remote_listener',
4 'db_name','db_domain','instance_name')
5 order by 1,2,3;
INST_ID NAME VALUE
---------- -------------------- ----------------------------------------
2 db_domain
2 db_name V10SN
2 instance_name V10SN2
2 local_listener
2 remote_listener LISTENERS_V10SN
2 service_names V10SN, SRV_PREF, SRV_AVAIL
Notice that instance 2 has SRV_AVAIL added to service_names.
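The same information can be cross-checked against the active-services view; a
sketch using gv$active_services (available in 10g and later):
SQL> select inst_id, name from gv$active_services order by 1, 2;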
- Output of "lsnrctl services " from node 2:
Service "SRV_AVAIL" has 2 instance(s).
Instance "V10SN1", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=opcbsol1)(PORT=1521))
Instance "V10SN2", status READY, has 2 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=opcbsol2)(PORT=1521))
"DEDICATED" established:1 refused:0 state:ready
LOCAL SERVER
Here we see that the listener on node 2 now has service handlers registered for both instances.
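This cross-registration is driven by the remote_listener parameter (set to
LISTENERS_V10SN in the parameter output above), which typically resolves to an
address list covering both nodes' listeners. A sketch of such a tnsnames.ora
entry; the addresses are taken from the handler addresses shown above, since the
actual entry for this environment is not shown:
LISTENERS_V10SN =
  (ADDRESS_LIST =
    (ADDRESS = (PROTOCOL = TCP)(HOST = opcbsol1)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = opcbsol2)(PORT = 1521))
  )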
- Output of "ifconfig -a" from node 2:
# ifconfig -a
lo0: flags=1000849 mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
hme0: flags=1000843 mtu 1500 index 2
inet 138.1.137.144 netmask fffffc00 broadcast 138.1.139.255
ether 8:0:20:aa:a6:3d
hme0:1: flags=1040843 mtu 1500 index 2
inet 138.1.138.51 netmask ffffff00 broadcast 138.1.255.255
hme0:2: flags=1040843 mtu 1500 index 2
inet 138.1.138.50 netmask ffffff00 broadcast 138.1.255.255
hme1: flags=1008843 mtu 1500 index 4
inet 172.16.0.129 netmask ffffff80 broadcast 172.16.0.255
ether 8:0:20:aa:a6:3d
hme2: flags=1008843 mtu 1500 index 3
inet 172.16.1.1 netmask ffffff80 broadcast 172.16.1.127
ether 8:0:20:aa:a6:3d
Notice that hme0:2 (138.1.138.50) has been added; this is node 1's VIP, now running on node 2.
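The VIP location can also be confirmed with srvctl instead of ifconfig; a sketch
(the VIP line of the output should report node 1's VIP running on opcbsol2):
[opcbsol2]/> srvctl status nodeapps -n opcbsol1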