


NTP leap second event causing Oracle Clusterware node reboot (Doc ID 759143.1)

 

 

In this Document

Symptoms

Changes

Cause

Solution

References

 

APPLIES TO:

Oracle Server - Enterprise Edition - Version 10.1.0.2 to 11.1.0.7 [Release 10.1 to 11.1]

Oracle Solaris on SPARC (64-bit)

Oracle Solaris on x86-64 (64-bit)

Sun Solaris SPARC (64-bit)

Sun Solaris x86-64 (64-bit)

Oracle Clusterware and patchsets 10.2.0.1 - 11.1.0.7

Sun Solaris 5.8 - 5.10 adjusting time through NTP daemon (xntpd)

 

SYMPTOMS

Coordinated Universal Time (UTC) is periodically adjusted by introducing a leap second, based on the
accumulated difference between atomic clock time (TAI) and UT1,
the time scale reflecting the Earth's rotational speed.

 

The last adjustment happened on Dec 31, 2008 at 23:59:59 UTC by adding one leap second,
i.e. UTC - TAI changed by -1 second, meaning that UTC was effectively adjusted backwards by one second.

 

The decision to introduce a leap second in UTC is the responsibility of the
Earth Orientation Center of the International Earth Rotation and Reference Systems Service (IERS).
The IERS announces every six months whether a new leap second will occur on the next Jun 30 or Dec 31.

 

Because of this backward time adjustment, NTP daemons had to adjust the system time in accordance with
the leap second. This affected Oracle Clusterware and resulted in node reboots.

The announcement of leap second events can be viewed at:

       http://hpiers.obspm.fr/eoppc/bul/bulc/bulletinc.dat

CHANGES

UTC time adjustment by leap second worldwide on upstream NTP servers.

In configurations using third party cluster solutions, e.g. Sun Cluster or Veritas SFRAC, the
oproc daemon is not started. Systems using third party cluster configurations should therefore not be
affected. The third party cluster solution will have to accommodate the leap second change itself, if needed.

CAUSE

A node reboot can occur in the event of a leap second when both of the following conditions hold simultaneously:

  • C1. the xntpd daemon does not have slewing enabled (the default) or does not have PLL mode disabled (the default)

  • C2. the Oracle Clusterware version does not have a fix for bug 5015469 or bug 6022204,
    or the Oracle Clusterware version does have a fix for at least one of these defects but, due
    to Solaris CR#6595936, the arrival of the alarm signal has been delayed beyond the oprocd
    0.5 sec default margin

Condition C1 can be viewed as a generic NTP configuration issue, i.e. platforms other than
Solaris can be affected as well.

Typically slewing is enabled on all platforms by starting the NTP
daemon with the -x option.

 

In Solaris 10 the -x option still exists but the "slewalways yes"

option in /etc/inet/ntp.conf should be used.

Note, both "-x" and "slewalways yes" are non-default. 

 

If the settings named in C1 are not in place, the NTP daemon will decide
to step the time offset: when slewing is not enabled, any time
adjustment over 0.5 sec (512 millisec) results in stepping the system time,
and when PLL mode is enabled, any time correction of more than 128 millisec
is stepped. Hence the need for both "slewalways yes" and "disable pll".
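The rule described above can be sketched as a small decision function. This is only an illustration of the thresholds quoted in this note, not xntpd's actual code; the function and constant names are hypothetical.

```python
# Thresholds as quoted in this note (not taken from the xntpd source).
STEP_THRESHOLD_NO_SLEW_MS = 512  # "over 0.5 sec (512 millisec)"
STEP_THRESHOLD_PLL_MS = 128      # "more than 128 millisec"


def correction_mode(offset_ms, slewalways=False, pll_enabled=True):
    """Return "step" or "slew" for a time offset, per the rule in this note.

    The defaults model an unmodified ntp.conf: slewing off, PLL mode on.
    """
    magnitude = abs(offset_ms)
    if not slewalways and magnitude > STEP_THRESHOLD_NO_SLEW_MS:
        return "step"
    if pll_enabled and magnitude > STEP_THRESHOLD_PLL_MS:
        return "step"
    return "slew"


LEAP_SECOND_MS = -1000  # the clock has to go back by one second

print(correction_mode(LEAP_SECOND_MS))  # default configuration
print(correction_mode(LEAP_SECOND_MS, slewalways=True, pll_enabled=False))
```

With the default settings the one-second leap correction is stepped; only the combination of slewing enabled and PLL mode disabled leads to a slewed correction in this model.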

Condition C2 is partially generic as well, i.e. both bugs 5015469 and 6022204 are generic.
The only strongly Solaris-specific issue in this combination is Sun CR#6595936, which affects
the oprocd daemon when a negative NTP time adjustment has occurred.

 

Regarding CR#6595936: this affects only NTP configurations as per C1, which
result in the time offset being stepped. oprocd detects the step as a time drift by comparing
gettimeofday() values.
However, the alarm signal from setitimer() that oprocd needs in order to wake up can be delayed by up to
5 sec, which is a fatal condition for oprocd and initiates a fast reboot.
A fix for Solaris CR#6595936 will be included in Solaris 10 U7, which as of January 2009
is not yet available.
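To illustrate why a delayed alarm signal is fatal, the comparison against the margin can be sketched as follows. This is a simplified model under the figures quoted in this note (0.5 sec default margin, delay of up to 5 sec), not the oprocd implementation; all names are hypothetical.

```python
DEFAULT_MARGIN_S = 0.5          # oprocd default margin quoted in this note
WORST_CASE_ALARM_DELAY_S = 5.0  # upper bound on the delay quoted for CR#6595936


def wakeup_is_fatal(alarm_delay_s, margin_s=DEFAULT_MARGIN_S):
    """Return True if a late wakeup exceeds the allowed margin.

    Simplified model: the watchdog tolerates a wakeup that is late by at
    most margin_s seconds and treats anything later as a fatal condition.
    """
    return alarm_delay_s > margin_s


# Default margin: the worst-case delay exceeds it.
print(wakeup_is_fatal(WORST_CASE_ALARM_DELAY_S))
# A 5 sec margin tolerates the same delay.
print(wakeup_is_fatal(WORST_CASE_ALARM_DELAY_S, margin_s=5.0))
```

In this model a delay of 5 sec is ten times the default margin, so a margin of at least 5 sec would be required to tolerate it.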

 

SOLUTION

To avoid a node reboot due to a leap second event, either of the following two solutions can be used:

  • S1. Configure the Solaris xntpd daemon running on the Oracle Clusterware cluster node to
    disable PLL mode and enable slewing,

    i.e. add the following two lines to the /etc/inet/ntp.conf file:

                       slewalways yes 
                       disable pll 

    To restart xntpd the commands are: 

    Solaris 10:      svcadm restart ntp 
    Solaris 8 and 9: /etc/init.d/xntpd stop ; /etc/init.d/xntpd start

As a result, a message like the following is written to the /var/adm/messages file:

 

Jan 2 18:20:34 psyche xntpd[8724]: [ID 998766 daemon.warning] phase-lock loop disabled: will *not* use clock drift file.

Please note that restarting xntpd on a server whose system
clock is out of sync with the NTP server time might result in oprocd
causing a node reboot.
This is because the scripts used to restart xntpd
(/etc/init.d/xntpd in Solaris 8 and 9 and /lib/svc/method/xntp in Solaris 10)
execute the ntpdate command, which steps the clock to remove the time difference.

Therefore, if it cannot be ensured that the NTP client and server times are in sync,
xntpd should be restarted after the ntp.conf change only before
Oracle Clusterware is restarted, or at the next system reboot.
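Before restarting xntpd it can be useful to verify that both lines from S1 are actually present in the configuration. The following is a minimal sketch of such a check, assuming the usual ntp.conf syntax with "#" comments; the function name and the sample server name are hypothetical.

```python
REQUIRED_DIRECTIVES = (("slewalways", "yes"), ("disable", "pll"))


def missing_directives(conf_text):
    """Return the S1 directives that are not set in the given ntp.conf text."""
    present = set()
    for raw_line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        tokens = raw_line.split("#", 1)[0].split()
        if len(tokens) < 2:
            continue
        keyword = tokens[0].lower()
        # A keyword may carry several arguments, e.g. "disable auth pll".
        for argument in tokens[1:]:
            present.add((keyword, argument.lower()))
    return [" ".join(d) for d in REQUIRED_DIRECTIVES if d not in present]


# Sample content; in practice read /etc/inet/ntp.conf instead.
sample = "server ntp.example.com\nslewalways yes\n# disable pll\n"
print(missing_directives(sample))  # ['disable pll']
```

An empty list means that both "slewalways yes" and "disable pll" were found as active (uncommented) lines.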

OR 

  • S2. Apply recent Oracle Clusterware patch bundles or a recent MLR (i.e. MLR # 9 or higher)
    in order to resolve bugs 5015469 and 6022204,
    and
    increase the oprocd daemon timeout margin to ignore alarm signal drifts due to Sun CR#6595936.

    The following patches are available:

    Bug 5015469: fix included in Oracle Clusterware 10.2.0.3 and higher, one off fixes exist for 10.1.0.3,
                         10.1.0.4, 10.1.0.5, 10.2.0.1, 10.2.0.2 
                         This fix mainly rearms the oprocd timer after a negative time drift.

    Bug 6022204: fix included in Oracle Clusterware 10.2.0.4 and higher (including 11g), 
                         included in Oracle Clusterware 10.2.0.3 MLR 9 and higher 
                         and Oracle Clusterware 10.2.0.3 bundle patches 2 and 3 
                         (patch # 6756433 and 7117233). 
                         This fix doesn't exist for Oracle Clusterware 10.2.0.2 or 10.2.0.1. 
                         This fix supersedes 5015469 and is needed in conjunction
                         with increasing the oprocd margin.

    For solution S2 the oprocd margin must be increased to at least 5 sec to tolerate the
    alarm signal delays imposed by CR#6595936.
    Please refer to Note 559365.1 for how to increase the oprocd margin.
    Please note that increasing the oprocd margin in this solution is needed only in order to
    avoid the OS alarm signal drift.
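The base-release information above can be turned into a quick version check. The following sketch covers only the "included in ... and higher" statements from this note; it deliberately ignores one-off fixes, MLRs and bundle patches, which have to be checked separately. The function names are hypothetical.

```python
# First base release that includes each fix, as listed in this note.
FIX_BASE_RELEASE = {
    "5015469": (10, 2, 0, 3),
    "6022204": (10, 2, 0, 4),
}


def parse_version(text):
    """Turn a dotted version such as "10.2.0.3" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def fixes_in_base_release(version):
    """Map each bug number to True if the base release already includes the fix."""
    parsed = parse_version(version)
    return {bug: parsed >= first for bug, first in FIX_BASE_RELEASE.items()}


print(fixes_in_base_release("10.2.0.3"))
print(fixes_in_base_release("11.1.0.7"))
```

A result of False for a bug does not mean that no fix exists for that release, only that the fix is not part of the base release and a one-off, MLR or bundle patch would be required where one is available.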

The Veritas/Oracle/Sun (VOS) team recommends using solution S1 and not modifying the
oprocd margin. Applying recent Oracle Clusterware patches is still recommended.

REFERENCES

BUG:5015469 - OPROCD REBOOTS NODE WHEN TIME IS SET BACK BY XNTPD

BUG:6022204 - OPROCD ENHANCED ERROR REPORTING MECHANISMS TO BE ADDED

NOTE:559365.1 - Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions
