TROUBLESHOOTING TAF ISSUES IN 10g RAC
-------------------------------------

Troubleshooting TAF issues in 10g RAC differs from earlier releases because of the additional CRS
(Cluster Ready Services) layer involved.  With 10g RAC, one or more services are created for
application and client connections.  These services can be started, stopped, or failed over by CRS.
CRS also manages the virtual IPs (VIPs) that the listeners listen on.  If a node goes down, the
service and virtual IP managed by CRS fail over to a surviving node.  Because the listeners listen
on the VIPs, clients avoid waiting for a TCP timeout when a node goes down.  Connections are routed
to surviving nodes and can resume their work once the failover is complete.  When troubleshooting
TAF issues in 10g RAC, it is important to look at the CRS layer, the network layer, and the database
layer to identify where a problem may reside.
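
A quick first pass at the CRS layer shows where each service, listener, and VIP currently runs.  A
minimal sketch (the -t flag, which prints a tabular summary, is assumed to be available in this CRS
release):

  [opcbsol1]/> $ORA_CRS_HOME/bin/crs_stat -t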

FILES TO REVIEW AND GATHER IF THERE IS A PROBLEM
------------------------------------------------

- Output of $ORA_CRS_HOME/bin/crs_stat

- Output of "srvctl status service -d  -s "

- Output of "srvctl status nodeapps -n "

- CRS Logs in $ORA_CRS_HOME/crs/log and $ORA_CRS_HOME/css/log from each node to
  monitor service failover.

- Alert logs on each node to see when the instance(s) went down.

- Output of the following query for parameter settings:

  set lines 120
  set pages 200
  column name format a20 tru
  column value format a40 wra
  select inst_id, name, value
  from gv$parameter
  where name in ('service_names','local_listener','remote_listener',
  'db_name','db_domain','instance_name')
  order by 1,2,3;

- tnsnames.ora and sqlnet.ora from the client

- listener.ora and sqlnet.ora from the servers

- Output of "lsnrctl services " from each node

- Ifconfig output and /etc/hosts from each node

- Possible additional sqlnet.ora tracing on the client and server:

  trace_level_client = 16
  trace_level_server = 16
  trace_unique_client = on
  trace_directory_client = /t02/app/oracle/product/10.1.0/network/trace
  trace_timestamp_client = on
  trace_timestamp_server = on

  If tracing, set the trace_level_client and trace_level_server parameters to 0
  after the test to avoid having large trace files written.
TAF SETUP EXAMPLE
-----------------

In a simple example, a service called "SRV_AVAIL" is created in DBCA.  This
service is defined so that node 1 is the "preferred" instance for connections
and node 2 is the "available" instance for connections.  It is also configured
with a TAF policy of "Basic".  This means that at startup CRS tries to run the
service on node 1.  TAF clients connect to the service transparently, regardless
of which instance it is running on.  If the instance on node 1 (opcbsol1) fails,
CRS fails the service over to node 2 (opcbsol2) and all client connections are
routed to node 2 (opcbsol2).  In 10g, TAF operates with load balancing: in a
scenario with more than two instances, on any failure the clients load balance
across the surviving instances.  The following is all of the configuration
information for the SRV_AVAIL service (command-line sketches for the same setup
follow the listing):

- Output of $ORA_CRS_HOME/bin/crs_stat (relevant entries):

  [opcbsol1]/> $ORA_CRS_HOME/bin/./crs_stat

  NAME=ora.V10SN.SRV_AVAIL.V10SN1.sa
  TYPE=application
  TARGET=ONLINE
  STATE=ONLINE on opcbsol2

  NAME=ora.V10SN.SRV_AVAIL.V10SN1.srv
  TYPE=application
  TARGET=ONLINE
  STATE=ONLINE on opcbsol1

  NAME=ora.V10SN.SRV_AVAIL.cs
  TYPE=application
  TARGET=ONLINE
  STATE=ONLINE on opcbsol1

  NAME=ora.opcbsol1.LISTENER_OPCBSOL1.lsnr
  TYPE=application
  TARGET=ONLINE
  STATE=ONLINE on opcbsol1

  NAME=ora.opcbsol1.vip
  TYPE=application
  TARGET=ONLINE
  STATE=ONLINE on opcbsol1

  NAME=ora.opcbsol2.LISTENER_OPCBSOL2.lsnr
  TYPE=application
  TARGET=ONLINE
  STATE=ONLINE on opcbsol2

  NAME=ora.opcbsol2.vip
  TYPE=application
  TARGET=ONLINE
  STATE=ONLINE on opcbsol2

- Output of "srvctl status service -d  -s ":

  [opcbsol1]/> srvctl status service -d V10SN -s SRV_AVAIL
  Service SRV_AVAIL is running on instance(s) V10SN1

- Output of "srvctl status nodeapps -n ":

  [opcbsol1]/> srvctl status nodeapps -n opcbsol1
  VIP is running on node: opcbsol1
  GSD is running on node: opcbsol1
  Listener is running on node: opcbsol1
  ONS daemon is running on node: opcbsol1
  [opcbsol1]/> srvctl status nodeapps -n opcbsol2
  VIP is running on node: opcbsol2
  GSD is running on node: opcbsol2
  Listener is running on node: opcbsol2
  ONS daemon is running on node: opcbsol2

- CRS log showing service creation:

  2004-05-05 13:11:02.284: Attempting to start `ora.V10SN.SRV_AVAIL.cs` on
   member `opcbsol1`
  2004-05-05 13:11:06.792: Attempting to start `ora.V10SN.SRV_AVAIL.V10SN1.srv`
   on member `opcbsol1`
  2004-05-05 13:11:12.620: Start of `ora.V10SN.SRV_AVAIL.V10SN1.srv` on member
   `opcbsol1` succeeded.
  2004-05-05 13:11:12.954: Start of `ora.V10SN.SRV_AVAIL.cs` on member
   `opcbsol1` succeeded.
  2004-05-05 13:11:16.811: Attempting to start `ora.V10SN.SRV_AVAIL.V10SN1.sa`
   on member `opcbsol2`
  2004-05-05 13:11:19.629: Start of `ora.V10SN.SRV_AVAIL.V10SN1.sa` on member
   `opcbsol2` succeeded.

- Output of the following query for parameter settings:

  SQL> set lines 120
  SQL> set pages 200
  SQL> column name format a20 tru
  SQL> column value format a40 wra
  SQL> select inst_id, name, value
    2  from gv$parameter
    3  where name in ('service_names','local_listener','remote_listener',
    4  'db_name','db_domain','instance_name')
    5  order by 1,2,3;

     INST_ID NAME                 VALUE
  ---------- -------------------- ----------------------------------------
           1 db_domain
           1 db_name              V10SN
           1 instance_name        V10SN1
           1 local_listener
           1 remote_listener      LISTENERS_V10SN
           1 service_names        V10SN, SRV_AVAIL
           2 db_domain
           2 db_name              V10SN
           2 instance_name        V10SN2
           2 local_listener
           2 remote_listener      LISTENERS_V10SN
           2 service_names        V10SN

- Tnsnames.ora entry from the client:

SRV_AVAIL =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = opcbsol1-v.us.oracle.com)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = opcbsol2-v.us.oracle.com)(PORT = 1521))
    (LOAD_BALANCE = yes)
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = SRV_AVAIL)
      (FAILOVER_MODE =
        (TYPE = SELECT)
        (METHOD = BASIC)
        (RETRIES = 180)
        (DELAY = 5)
      )
    )
  )

- Listener.ora entry from the server:

  LISTENER_OPCBSOL1 =
    (DESCRIPTION_LIST =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
        )
        (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = TCP)(HOST = opcbsol1-v.us.oracle.com)(PORT = 1521))
        )
        (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = TCP)(HOST = 138.1.137.144)(PORT = 1521))
        )
      )
    )

  SID_LIST_LISTENER_OPCBSOL1 =
    (SID_LIST =
      (SID_DESC =
        (SID_NAME = PLSExtProc)
        (ORACLE_HOME = /t02/app/oracle/product/10.1.0)
        (PROGRAM = extproc)
      )
    )

- Output of "lsnrctl services " from each node (relevant entries):

  [opcbsol1]/> lsnrctl services listener_opcbsol1

  Service "SRV_AVAIL" has 1 instance(s).
    Instance "V10SN1", status READY, has 2 handler(s) for this service...
      Handler(s):
        "DEDICATED" established:20 refused:0 state:ready
           LOCAL SERVER
        "DEDICATED" established:0 refused:0 state:ready
           REMOTE SERVER
           (ADDRESS=(PROTOCOL=TCP)(HOST=opcbsol1)(PORT=1521))

  [opcbsol2]/> lsnrctl services listener_opcbsol2

  Service "SRV_AVAIL" has 1 instance(s).
    Instance "V10SN1", status READY, has 1 handler(s) for this service...
      Handler(s):
        "DEDICATED" established:0 refused:0 state:ready
           REMOTE SERVER
           (ADDRESS=(PROTOCOL=TCP)(HOST=opcbsol1)(PORT=1521))

- Ifconfig output and /etc/hosts from each node:

  # ifconfig -a
  lo0: flags=1000849 mtu 8232 index 1
          inet 127.0.0.1 netmask ff000000
  hme0: flags=1000843 mtu 1500 index 2
          inet 138.1.137.144 netmask fffffc00 broadcast 138.1.139.255
          ether 8:0:20:aa:a6:3d
  hme0:1: flags=1000843 mtu 1500 index 2
          inet 138.1.138.50 netmask fffffc00 broadcast 138.1.139.255
  hme1: flags=1008843 mtu 1500 index 4
          inet 172.16.0.129 netmask ffffff80 broadcast 172.16.0.255
          ether 8:0:20:aa:a6:3d
  hme2: flags=1008843 mtu 1500 index 3
          inet 172.16.1.1 netmask ffffff80 broadcast 172.16.1.127
          ether 8:0:20:aa:a6:3d

  [opcbsol1]/> more /etc/hosts
  #
  # Internet host table
  #
  127.0.0.1       localhost       
  138.1.137.144   opcbsol1.us.oracle.com  opcbsol1        loghost
  138.1.137.142   opcbsol2.us.oracle.com  opcbsol2
  172.16.0.129    int-opcbsol1.us.oracle.com  int-opcbsol1        
  172.16.0.130    int-opcbsol2.us.oracle.com  int-opcbsol2
  138.1.138.50    opcbsol1-v.us.oracle.com        opcbsol1-v
  138.1.138.51    opcbsol2-v.us.oracle.com        opcbsol2-v
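
  For reference, the preferred/available layout above was created through DBCA
  in this example; a rough command-line equivalent using srvctl (a sketch only,
  assuming the 10g syntax with -r for preferred and -a for available instances;
  the TAF policy itself was selected in DBCA) would be:

  [opcbsol1]/> srvctl add service -d V10SN -s SRV_AVAIL -r V10SN1 -a V10SN2
  [opcbsol1]/> srvctl start service -d V10SN -s SRV_AVAIL

  Also note the remote_listener = LISTENERS_V10SN setting in the parameter
  output above.  The LISTENERS_V10SN alias itself is not shown here; a typical
  server-side tnsnames.ora entry that it would resolve to (an assumed example,
  listing both VIP addresses) looks like:

  LISTENERS_V10SN =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = opcbsol1-v.us.oracle.com)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = opcbsol2-v.us.oracle.com)(PORT = 1521))
    )

  This cross-registration is what allows each instance to register its services
  with the listeners on both nodes, which is why REMOTE SERVER handlers appear
  in the "lsnrctl services" output above.
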
TAF FAILOVER EXAMPLE
--------------------

1. Connect to the service:

   sqlplus scott/tiger@SRV_AVAIL

2. Verify the instance you are connected to:

   SQL> select instance_name, host_name from v$instance;
 
   INSTANCE_NAME
   ----------------
   HOST_NAME
   ----------------------------------------------------------------
   V10SN1
   opcbsol1

3. Run a long-running select (for example, select * from dba_extents)

4. Power off node 1 (opcbsol1)

5. The query pauses briefly, then fails over to node 2, as can be seen with the
   following query once the long-running select finishes:

   SQL> select instance_name, host_name from v$instance;

   INSTANCE_NAME
   ----------------
   HOST_NAME
   ----------------------------------------------------------------
   V10SN2
   opcbsol2

Underneath, node 1's virtual IP and the SRV_AVAIL service failed over to node 2.
The section below shows what the logs and traces looked like when node 1 was
powered off.
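
In addition to v$instance, the failover columns in v$session confirm that the
session actually failed over rather than reconnecting from scratch.  A minimal
sketch (run against the failed-over session; SCOTT is the test user from step 1):

   SQL> select username, failover_type, failover_method, failed_over
     2  from v$session
     3  where username = 'SCOTT';

A session that failed over through this TAF configuration shows
FAILOVER_TYPE=SELECT, FAILOVER_METHOD=BASIC, and FAILED_OVER=YES.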

TAF FAILOVER TRACE DATA FROM SUCCESSFUL FAILOVER
------------------------------------------------

- Output of $ORA_CRS_HOME/bin/crs_stat (relevant entries):

  [opcbsol1]/> $ORA_CRS_HOME/bin/./crs_stat

  NAME=ora.V10SN.SRV_AVAIL.V10SN1.sa
  TYPE=application
  TARGET=ONLINE
  STATE=OFFLINE

  NAME=ora.V10SN.SRV_AVAIL.V10SN1.srv
  TYPE=application
  TARGET=ONLINE
  STATE=ONLINE on opcbsol2

  NAME=ora.V10SN.SRV_AVAIL.cs
  TYPE=application
  TARGET=ONLINE
  STATE=ONLINE on opcbsol2

  NAME=ora.opcbsol1.LISTENER_OPCBSOL1.lsnr
  TYPE=application
  TARGET=ONLINE
  STATE=OFFLINE

  NAME=ora.opcbsol1.vip
  TYPE=application
  TARGET=ONLINE
  STATE=ONLINE on opcbsol2

  NAME=ora.opcbsol2.LISTENER_OPCBSOL2.lsnr
  TYPE=application
  TARGET=ONLINE
  STATE=ONLINE on opcbsol2

  NAME=ora.opcbsol2.vip
  TYPE=application
  TARGET=ONLINE
  STATE=ONLINE on opcbsol2

  Comparing this to the previous sample, we see that node 1's virtual IP was
  failed over to node 2 by CRS.  We also see that the
  ora.V10SN.SRV_AVAIL.V10SN1.srv and ora.V10SN.SRV_AVAIL.cs resources were
  failed over to node 2.

- Output of "srvctl status service -d  -s "

  [opcbsol2]/> srvctl status service -d V10SN -s SRV_AVAIL
  Service SRV_AVAIL is running on instance(s) V10SN2

  As we previously saw, the service is now running on instance 2.
 
- CRS Logs in $ORA_CRS_HOME/crs/log and $ORA_CRS_HOME/css/log from each node to
  monitor service failover.

  CSS Log from Node 2:

  2004-05-05 15:32:53.457 [17] >WARNING: clssnmPollingThread: Eviction started for node 0, flags 0x0001, state 3, wt4c 0
  2004-05-05 15:32:58.565 [17] >TRACE:   clssnmDoSyncUpdate: Initiating sync 5
  2004-05-05 15:32:58.566 [12] >TRACE:   clssnmHandleSync: Acknowledging sync: src[1] seq[1282004] sync[5]
  2004-05-05 15:32:59.197 [1] >USER:    NMEVENT_SUSPEND [00][00][00][02]
  2004-05-05 15:33:00.626 [12] >USER:    clssnmHandleUpdate: SYNC(5) from node(1) completed
  2004-05-05 15:33:00.626 [12] >USER:    clssnmHandleUpdate: NODE(1) IS ACTIVE MEMBER OF CLUSTER
  2004-05-05 15:33:01.218 [26] >USER:    NMEVENT_RECONFIG [00][00][00][02]
  CLSS-3000: reconfiguration successful, incarnation 5 with 1 nodes

  CSS finished reconfiguring to a one-node cluster at 15:33:01.

  CRS Log from Node 2:

  2004-05-05 15:33:01.265: Processing MemberLeave
  2004-05-05 15:33:01.265: [MEMBERLEAVE:82269] Processing member leave for opcbsol1, incarnation: 5
  2004-05-05 15:33:07.480: Attempting to start `ora.V10SN.SRV_AVAIL.V10SN1.sa` on member `opcbsol2`
  `ora.V10SN.SRV_AVAIL.cs` on `opcbsol2` went OFFLINE unexpectedly
  2004-05-05 15:33:08.965: Attempting to stop `ora.V10SN.SRV_AVAIL.cs` on member `opcbsol2`
  2004-05-05 15:33:11.635: Stop of `ora.V10SN.SRV_AVAIL.cs` on member `opcbsol2` succeeded.
  `ora.V10SN.SRV_AVAIL.cs` exceeded it's failure threshold.  Stopping it and its dependents!
  2004-05-05 15:33:12.403: Target set to OFFLINE for `ora.V10SN.SRV_AVAIL.cs`
  2004-05-05 15:33:17.294: Start of `ora.V10SN.SRV_AVAIL.V10SN1.sa` on member `opcbsol2` succeeded.
  2004-05-05 15:33:18.261: Attempting to start `ora.V10SN.SRV_AVAIL.V10SN1.srv` on member `opcbsol2`
  2004-05-05 15:33:25.395: Start of `ora.V10SN.SRV_AVAIL.V10SN1.srv` on member `opcbsol2` succeeded.
  2004-05-05 15:33:26.798: CRS-1002: Resource ora.V10SN.SRV_AVAIL.V10SN1.sa is already running on member opcbsol2
  2004-05-05 15:33:28.831: Attempting to start `ora.V10SN.SRV_AVAIL.cs` on member `opcbsol2`
  2004-05-05 15:33:29.612: Attempting to start `ora.opcbsol1.vip` on member `opcbsol2`
  2004-05-05 15:33:31.198: Start of `ora.V10SN.SRV_AVAIL.cs` on member `opcbsol2` succeeded.
  2004-05-05 15:33:37.605: Start of `ora.opcbsol1.vip` on member `opcbsol2` succeeded.
  2004-05-05 15:33:37.635: [MEMBERLEAVE:82269] Do failover for: opcbsol1
  2004-05-05 15:33:37.639: [MEMBERLEAVE:82269] Post recovery done evmd event for: opcbsol1
  `ora.V10SN.SRV_AVAIL.V10SN1.sa` on `opcbsol2` went OFFLINE unexpectedly
  2004-05-05 15:35:22.872: Attempting to stop `ora.V10SN.SRV_AVAIL.V10SN1.sa` on member `opcbsol2`
  2004-05-05 15:35:24.908: Stop of `ora.V10SN.SRV_AVAIL.V10SN1.sa` on member `opcbsol2` succeeded.
  `ora.V10SN.SRV_AVAIL.V10SN1.sa` failed on `opcbsol2`, relocating.
  Cannot relocate `ora.V10SN.SRV_AVAIL.V10SN1.sa`.  Stopping dependents

  Here we see the service and VIP resources failing over (note that the listener resource does not fail over).  

- Output of the following query for parameter settings:

  SQL> set lines 120
  SQL> set pages 200
  SQL> column name format a20 tru
  SQL> column value format a40 wra
  SQL> select inst_id, name, value
    2  from gv$parameter
    3  where name in ('service_names','local_listener','remote_listener',
    4  'db_name','db_domain','instance_name')
    5  order by 1,2,3;

     INST_ID NAME                 VALUE
  ---------- -------------------- ----------------------------------------
           2 db_domain
           2 db_name              V10SN
           2 instance_name        V10SN2
           2 local_listener
           2 remote_listener      LISTENERS_V10SN
           2 service_names        V10SN, SRV_PREF, SRV_AVAIL

  Notice that instance 2 has SRV_AVAIL added to service_names (see the note
  after this listing on how the instance picks it up).

- Output of "lsnrctl services " from node 2:

  Service "SRV_AVAIL" has 2 instance(s).
    Instance "V10SN1", status READY, has 1 handler(s) for this service...
      Handler(s):
        "DEDICATED" established:0 refused:0 state:ready
           REMOTE SERVER
           (ADDRESS=(PROTOCOL=TCP)(HOST=opcbsol1)(PORT=1521))
    Instance "V10SN2", status READY, has 2 handler(s) for this service...
      Handler(s):
        "DEDICATED" established:0 refused:0 state:ready
           REMOTE SERVER
           (ADDRESS=(PROTOCOL=TCP)(HOST=opcbsol2)(PORT=1521))
        "DEDICATED" established:1 refused:0 state:ready
           LOCAL SERVER

  Here we see that the listener on node 2 is now listening for both nodes.

- Ifconfig -a Output:

  # ifconfig -a
  lo0: flags=1000849 mtu 8232 index 1
          inet 127.0.0.1 netmask ff000000
  hme0: flags=1000843 mtu 1500 index 2
          inet 138.1.137.144 netmask fffffc00 broadcast 138.1.139.255
          ether 8:0:20:aa:a6:3d
  hme0:1: flags=1040843 mtu 1500 index 2
        inet 138.1.138.51 netmask ffffff00 broadcast 138.1.255.255
  hme0:2: flags=1040843 mtu 1500 index 2
        inet 138.1.138.50 netmask ffffff00 broadcast 138.1.255.255
  hme1: flags=1008843 mtu 1500 index 4
          inet 172.16.0.129 netmask ffffff80 broadcast 172.16.0.255
          ether 8:0:20:aa:a6:3d
  hme2: flags=1008843 mtu 1500 index 3
          inet 172.16.1.1 netmask ffffff80 broadcast 172.16.1.127
          ether 8:0:20:aa:a6:3d

  Notice that hme0:2 has been added (node 1's VIP)
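
  As shown in the parameter output above, instance 2 picked up SRV_AVAIL in its
  service_names when CRS started the service on the surviving instance.
  Conceptually this is the same dynamic service registration exposed by the
  DBMS_SERVICE package; for illustration only (assuming the 10g DBMS_SERVICE
  package, and not something you would normally run in a CRS-managed
  environment), a manual equivalent would be:

  SQL> exec dbms_service.start_service('SRV_AVAIL');
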
PUTTING THINGS BACK
-------------------

When node 1 comes back up, it will start its CRS stack and take back its VIP.
In order to fail the service back, you would need to run srvctl relocate.  The
following example relocates the service from node 2 back to node 1:

  [opcbsol1]/> srvctl relocate service -d V10SN -s SRV_AVAIL -i V10SN2 -t V10SN1
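
  After the relocate completes, the placement can be confirmed with the same
  status commands used earlier, for example:

  [opcbsol1]/> srvctl status service -d V10SN -s SRV_AVAIL
  [opcbsol1]/> $ORA_CRS_HOME/bin/crs_stat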
...