
Key: QA-48
Type: Oracle - Operating System
Status: Closed
Resolution: Answered
Priority: Major
Assignee: ubTools Support
Reporter: ubTools Support
Votes: 0
Watchers: 0

Questions & Answers

Unable to start VIP because of invalid RX packet numbers.

Created: 18/Mar/09 07:44 PM   Updated: 19/Mar/09 01:45 PM
Fix Version/s: None

Product Version: Oracle 10.2.0.4, RAC
Operating System: IBM-AIX
Operating System Version: 6.1


Description
When a VIP is started on its own node, the start fails and the VIP is started on the other node instead.

Starting the VIP:

# ./crs_start ora.akyorap2.vip
Attempting to start `ora.akyorap2.vip` on member `akyorap2`
Start of `ora.akyorap2.vip` on member `akyorap2` failed.
Attempting to start `ora.akyorap2.vip` on member `akyorap1`
Start of `ora.akyorap2.vip` on member `akyorap1` succeeded.
#

The log level was increased to get more detailed diagnostic data.

Setting Log Level:

#./crsctl debug log res "ora.akyorap2.vip:1" 
Set Resource Debug Module: ora.akyorap2.vip  Level: 1
#

Errors from the Log:
(<ORA_CRS_HOME>/log/<nodeName>/racg/ora.akyorap2.vip.log)

Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] checkIf: start for if=en1
Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] IsIfAlive: start for if=en1

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] defaultgw:  started
Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] defaultgw:  completed with 10.46.180.1

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] About to execute command: /usr/sbin/ping -S 10.46.180.52  -c 1 -w 1 10.46.180.1

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:51 GMT+02:00 2009 [ 413770 ] About to execute command: /usr/sbin/ping -S 10.46.180.52  -c 1 -w 1 10.46.180.1

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] IsIfAlive: RX packets checked if=en1 failed
Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] Interface en1 checked failed (host=akyorap2)
Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] IsIfAlive: end for if=en1

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] checkIf: end for if=en1
Invalid parameters, or failed to bring up VIP (host=akyorap2)


Comments
ubTools Support - 18/Mar/09 08:10 PM
The problem is raised by IsIfAlive() in $ORA_CRS_HOME/racgvip.

Here is the related excerpt from racgvip:

  # Check the status of the interface thro' pinging gateway
  if [ -n "$DEFAULTGW" ]
  then
    _RET=1
    # get base IP address of the interface
    tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK '{print $2}'`
    # get RX packets numbers
    _O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$5; exit}}"`
    x=$CHECK_TIMES
    while [ $x -gt 0 ]
    do
      if [ -n "$tmpIP" ]
      then
        logx "About to execute command: $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW"
        $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
      else
        logx "About to execute command: $PING $PING_TIMEOUT $DEFAULTGW"
        $PING $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
      fi
      _O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$5; exit}}"`
      if [ "$_O1" != "$_O2" ]
      then
        # RX packets numbers changed
        _RET=0
        break
      fi
      $SLEEP 1
      x=`$EXPR $x - 1`
    done
    if [ $_RET -ne 0 ]
    then
      logx "IsIfAlive: RX packets checked if=$_IF failed"
    else
      logx "IsIfAlive: RX packets checked if=$_IF OK"
    fi
....

According to the code above, the function does the following:

  • Stores the current RX packet count in _O1 as the baseline.
  • Loops up to $CHECK_TIMES times:
    • Pings the default gateway.
    • Stores the current RX packet count in _O2.
    • If the RX packet count changed (_O1 != _O2), breaks out of the loop.
    • Otherwise sleeps 1 second.
  • If the RX packet count did not change (_O1 == _O2), logs the failure; otherwise the check is OK.
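
The comparison described above can be tried without a cluster. The following is only a simulation, not part of racgvip: read_counter is a hypothetical stand-in for the netstat | awk pipeline, and incrementing a variable stands in for the ping to the gateway. It shows that the check can only pass when the captured value changes between two readings.

```shell
#!/bin/sh
# Simulation of the _O1/_O2 comparison in IsIfAlive. No netstat or ping is run.
CHECK_TIMES=2

# Hypothetical counter source.
#   $1 = "stuck"  -> always returns '-'
#   $1 = "moving" -> returns a number that grows with each simulated ping ($2)
read_counter() {
  if [ "$1" = "stuck" ]; then
    echo "-"
  else
    echo $((17297 + $2))
  fi
}

rx_check() {
  mode=$1
  ret=1
  sent=0
  o1=$(read_counter "$mode" "$sent")      # baseline value
  x=$CHECK_TIMES
  while [ "$x" -gt 0 ]
  do
    sent=$((sent + 1))                    # stands in for the ping
    o2=$(read_counter "$mode" "$sent")    # value after the ping
    if [ "$o1" != "$o2" ]
    then
      ret=0                               # value changed: interface is alive
      break
    fi
    x=$((x - 1))
  done
  if [ "$ret" -ne 0 ]; then echo "failed"; else echo "OK"; fi
}

rx_check stuck     # prints: failed
rx_check moving    # prints: OK
```

A source that always returns the same string, whatever that string is, makes every comparison equal, so the loop runs out and the check fails.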

ubTools Support - 18/Mar/09 08:28 PM
racgvip was modified as below to dump the values of _O1 and _O2:
...
    # get RX packets numbers
    _O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$5; exit}}"`
    logx "--------------> by dunal: _O1: $_O1"

    x=$CHECK_TIMES
    while [ $x -gt 0 ]
    do
      if [ -n "$tmpIP" ]
      then
        logx "About to execute command: $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW"
        $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
      else
        logx "About to execute command: $PING $PING_TIMEOUT $DEFAULTGW"
        $PING $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
      fi
      _O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$5; exit}}"`
      logx "--------------> by dunal: _O2: $_O2"
...

As shown above, the logx "--------------> by dunal: ..." lines were added to the script. Do not make such changes unless you are sure of what you are doing.

After restarting the VIP, the values of _O1 and _O2 are dumped in the logs.

Failed Node:

...
Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] --------------> by dunal: _O1: -

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] About to execute command: /usr/sbin/ping -S 10.46.180.52  -c 1 -w 1 10.46.180.1
Wed Mar 18 20:58:50 GMT+02:00 2009 [ 413770 ] --------------> by dunal: _O2: -

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:51 GMT+02:00 2009 [ 413770 ] About to execute command: /usr/sbin/ping -S 10.46.180.52  -c 1 -w 1 10.46.180.1
Wed Mar 18 20:58:51 GMT+02:00 2009 [ 413770 ] --------------> by dunal: _O2: -

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] IsIfAlive: RX packets checked if=en1 failed
Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] Interface en1 checked failed (host=akyorap2)
...

As shown above, both values are '-'. That is not a valid packet count, but because _O1 and _O2 are identical, the script concludes that the RX packet number did not change.

Successful Node:

Wed Mar 18 20:58:55 GMT+02:00 2009 [ 405728 ] --------------> by dunal: _O1: 17297

2009-03-18 20:58:55.793: [    RACG][1] [397546][1][ora.akyorap2.vip]: Wed Mar 18 20:58:55 GMT+02:00 2009 [ 405728 ] About to execute command: /usr/sbin/ping -S 10.46.180.51  -c 1 -w 1 10.46.180.1
Wed Mar 18 20:58:55 GMT+02:00 2009 [ 405728 ] --------------> by dunal: _O2: 17298

2009-03-18 20:58:55.793: [    RACG][1] [397546][1][ora.akyorap2.vip]: Wed Mar 18 20:58:55 GMT+02:00 2009 [ 405728 ] IsIfAlive: RX packets checked if=en1 OK

_O1 and _O2 are different. That means RX packet number changed and the interface is up.


ubTools Support - 18/Mar/09 08:44 PM

netstat Output on Failed Node:

/usr/bin/netstat -f inet -n -I en1 | /usr/bin/awk "{ if (/^en1/) {print $5; exit}}"
en1   1500  link#3      0.21.5e.34.55.bc       -    34601     0    16269     3     0

Column 5 is '-' instead of a packet count. This is wrong and is what caused the problem.

netstat Output on Successful Node:

en1   1500  link#3      0.21.5e.34.57.fe            29223     0    10609     3     0

Column 5 is 29223, which is the expected packet count.

Headers of netstat on Failed Node:

#/usr/bin/netstat -f inet -n -I en1 
Name  Mtu   Network     Address           ZoneID    Ipkts Ierrs    Opkts Oerrs  Coll
en1   1500  link#3      0.21.5e.34.55.bc       -    35645     0    16801     3     0
en1   1500  10.46.180   10.46.180.52           -    35645     0    16801     3     0

Headers of netstat on Successful Node:

#/usr/bin/netstat -f inet -n -I en1 
Name  Mtu   Network     Address           ZoneID    Ipkts Ierrs    Opkts Oerrs  Coll
en1   1500  link#3      0.21.5e.34.57.fe            29743     0    10762     3     0
en1   1500  10.46.180   10.46.180.51                29743     0    10762     3     0
en1   1500  10.46.180   10.46.180.53                29743     0    10762     3     0
en1   1500  10.46.180   10.46.180.54                29743     0    10762     3     0

The difference is the ZoneID column.
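
To make the column shift concrete, the two link#3 lines quoted earlier can be fed through the same awk field selection. This is just an illustration on the captured text, not a change to racgvip; the col5 helper name is made up for the example. Single quotes are used around the awk program so that the interactive shell does not expand $5 itself.

```shell
#!/bin/sh
# Sample lines copied from the netstat output of each node.
failed_line='en1   1500  link#3      0.21.5e.34.55.bc       -    34601     0    16269     3     0'
ok_line='en1   1500  link#3      0.21.5e.34.57.fe            29223     0    10609     3     0'

# Hypothetical helper: print the 5th whitespace-separated field of an en1 line.
col5() {
  printf '%s\n' "$1" | awk '{ if (/^en1/) { print $5; exit } }'
}

col5 "$failed_line"   # prints: -
col5 "$ok_line"       # prints: 29223
```

On the failed node the '-' in the ZoneID column occupies field 5 and pushes Ipkts to field 6. On the successful node the ZoneID column is blank, awk does not count it as a field, and Ipkts stays in field 5.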

This looks like a network configuration problem. The issue will stay open pending an update from the network administrators.


ubTools Support - 19/Mar/09 12:54 PM
The network administrator said it was an AIX bug:

However, the fix for that bug changes the ZoneID from a blank value to '-'. Once the fix is applied, no VIP can be started.


ubTools Support - 19/Mar/09 01:11 PM - edited
No solution was found on Metalink.

ubTools Support - 19/Mar/09 01:45 PM
This looks like an inconsistency in Oracle on AIX 6.1.

Workaround:

The netstat column that racgvip captures must be changed from 5 to 6.

Original lines for _O1:

...
    tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK '{print $2}'`
    # get RX packets numbers
    _O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$5; exit}}"`
    x=$CHECK_TIMES
    while [ $x -gt 0 ]
...

Modified line for _O1:

...
    tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK '{print $2}'`
    # get RX packets numbers
    _O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$6; exit}}"`
    x=$CHECK_TIMES
    while [ $x -gt 0 ]
...

Original lines for _O2:

...
      fi
      _O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$5; exit}}"`
      if [ "$_O1" != "$_O2" ]
      then
        # RX packets numbers changed
...

Modified line for _O2:

...
      fi
      _O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$6; exit}}"`
      if [ "$_O1" != "$_O2" ]
      then
        # RX packets numbers changed
...

After this change, each VIP could be started on its correct node:

./crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....ap1.gsd application    ONLINE    ONLINE    akyorap1
ora....ap1.ons application    ONLINE    ONLINE    akyorap1
ora....ap1.vip application    ONLINE    ONLINE    akyorap1
ora....ap2.gsd application    ONLINE    ONLINE    akyorap2
ora....ap2.ons application    ONLINE    ONLINE    akyorap2
ora....ap2.vip application    ONLINE    ONLINE    akyorap2
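
A hard-coded column 6 only works on nodes where the ZoneID column shows '-'. One possible alternative, given here only as an untested sketch and not as an Oracle-supported change, is to count from the end of the line. It assumes that the last five columns are always Ipkts, Ierrs, Opkts, Oerrs and Coll, as in the netstat outputs quoted in this issue. The rx_packets name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read netstat-style lines on stdin and print the Ipkts
# value of the first line whose first field equals the interface name in $1.
# Assumes Ipkts is the 5th field from the end (Ipkts Ierrs Opkts Oerrs Coll).
rx_packets() {
  awk -v ifname="$1" '$1 == ifname { print $(NF - 4); exit }'
}

# Lines copied from the netstat output of each node.
failed_line='en1   1500  link#3      0.21.5e.34.55.bc       -    35645     0    16801     3     0'
ok_line='en1   1500  link#3      0.21.5e.34.57.fe            29743     0    10762     3     0'

printf '%s\n' "$failed_line" | rx_packets en1   # prints: 35645
printf '%s\n' "$ok_line"     | rx_packets en1   # prints: 29743
```

Unlike the /^$_IF/ pattern in racgvip, the exact comparison on the first field does not match longer interface names that merely start with the same characters.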

Note: Don't edit Oracle scripts unless you know what you're doing.