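In Oracle RAC, the health of the cluster can be checked at several levels and through several different mechanisms: heartbeats, combined with a voting algorithm, are used to isolate faults. If a node is detected as failed, it is evicted from the cluster so that it cannot corrupt data. This article describes the heartbeat mechanisms in Oracle RAC and how to adjust the heartbeat parameters.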
1. OCSSD and CSS
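OCSSD is the Linux or Unix process that manages and provides Cluster Synchronization Services (CSS). It runs as the oracle user and provides node membership management; if the process fails, the node reboots. CSS provides two heartbeat mechanisms: a network heartbeat and a disk heartbeat. Each has a maximum tolerated delay: for the network heartbeat it is called MC (Misscount), and for the disk heartbeat it is called IOT (I/O Timeout).
Both parameters are expressed in seconds. By default, Misscount < Disktimeout.
The two heartbeat mechanisms are described below.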
2. Network heartbeat
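As the name suggests, the network heartbeat checks node status over the private network. If a hardware or software problem on the private network prevents the cluster nodes from communicating normally for a certain period of time, the result is a split brain. Because storage in a cluster is shared, the faulty node must then be isolated from the cluster to avoid data corruption. The network heartbeat works as follows: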
    Every second, a sending thread in the cssd sends a network TCP heartbeat to itself and to all other nodes. The receiving thread of ocssd.bin receives the heartbeat.
    If a network packet is dropped or arrives with errors, the error-correction mechanism of TCP retransmits the packet.
    Oracle itself does not retransmit. In ocssd.log you will see a WARNING message about a missing heartbeat if a node does not receive a heartbeat from another node for 15 seconds (50% of misscount). Another warning is reported in ocssd.log if the heartbeat from the same node is missing for 22 seconds (75% of misscount), and a further warning when it is still missing after 27 seconds (90% of misscount). When the heartbeat has been missing for 100% of misscount (30 seconds), the node is evicted.
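The warning and eviction thresholds quoted above can be sketched as a small classifier. This is only an illustration of the percentages in the text, not Oracle's implementation: the function name `heartbeat_status` and the returned strings are hypothetical, and the default of 30 seconds is the misscount value used in the quoted description.

```python
# Percentages of misscount at which, per the quoted text, a warning
# is written to ocssd.log. At 100% the node is evicted.
WARNING_PERCENTS = (50, 75, 90)


def heartbeat_status(seconds_missed: int, misscount: int = 30) -> str:
    """Classify how long a node's heartbeat has been missing (illustrative only)."""
    if seconds_missed >= misscount:
        return "evict"
    # Compare in integer arithmetic to avoid floating-point rounding.
    reached = [p for p in WARNING_PERCENTS if seconds_missed * 100 >= p * misscount]
    if reached:
        return f"warning ({max(reached)}% of misscount)"
    return "ok"


if __name__ == "__main__":
    for seconds in (10, 15, 23, 27, 30):
        print(seconds, heartbeat_status(seconds))
```

Note that 75% of 30 seconds is 22.5 seconds, which the quoted text gives as 22; in this sketch the 75% level is therefore first reached at 23 whole seconds.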
The maximum delay for this network heartbeat is called misscount. It can be queried and changed with the crsctl tool.
[grid@Linux-01 ~]$ crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
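For monitoring scripts, the misscount value can be pulled out of that output line. The following is a minimal sketch that assumes the message has exactly the CRS-4678 format shown above; the helper name `parse_misscount` is hypothetical.

```python
import re


def parse_misscount(output: str) -> int:
    """Extract the misscount value (seconds) from `crsctl get css misscount` output."""
    # Assumes the CRS-4678 message format shown in the session above.
    match = re.search(r"CRS-4678: Successful get misscount (\d+)", output)
    if match is None:
        raise ValueError("no misscount value found in crsctl output")
    return int(match.group(1))


if __name__ == "__main__":
    sample = "CRS-4678: Successful get misscount 30 for Cluster Synchronization Services."
    print(parse_misscount(sample))  # prints 30
```

On a cluster node the input string would come from running the crsctl command itself, for example through `subprocess.run`, under a user that is allowed to execute crsctl.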