Here is what I found last night and why we changed the Controller Mode setting in BIOS from IDE to AHCI. Intel's guidance is to use
AHCI mode when using an SSD (Solid State Drive), a.k.a. DOM (Disk-On-Module), with Serial ATA (Serial AT Attachment).
SATA is a computer bus interface that connects host bus adapters to mass storage devices such as hard disk drives and optical drives.
AHCI stands for Advanced Host Controller Interface. AHCI is a hardware mechanism that allows software to communicate
with Serial ATA (SATA) devices (such as host bus adapters) that are designed to offer features not offered by Parallel ATA
(PATA) controllers, such as hot-plugging and native command queuing (NCQ). See link below.
Since the setting was changed last night, the heartbeats on the bad “V” DOM unit and the Golden “#4” DOM unit are still running fine.
In light of this finding, it is very likely that the solution to the freezing heartbeat is to change the BIOS setting from IDE to AHCI as suggested.
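One way to confirm that the BIOS change actually took effect is to look at which kernel driver claimed the SATA controller. The sketch below assumes the units run Linux, that sysfs is mounted at /sys, and that the controller binds to the `ahci` driver in AHCI mode and to `ata_piix` in IDE mode (typical for Intel chipsets, but not verified on this board). The function names are illustrative only.

```python
import os
import re

# PCI addresses in sysfs look like "0000:00:1f.2" (domain:bus:device.function).
PCI_ADDR = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def bound_devices(driver, sysfs_root="/sys"):
    """Return the PCI addresses currently bound to the given kernel driver."""
    driver_dir = os.path.join(sysfs_root, "bus", "pci", "drivers", driver)
    if not os.path.isdir(driver_dir):
        return []  # driver not registered, or not a Linux system
    return sorted(e for e in os.listdir(driver_dir) if PCI_ADDR.match(e))


def controller_mode(sysfs_root="/sys"):
    """Guess the SATA controller mode from which driver claimed a device.

    Assumption: "ahci" means AHCI mode, "ata_piix" means IDE mode.
    """
    if bound_devices("ahci", sysfs_root):
        return "AHCI"
    if bound_devices("ata_piix", sysfs_root):
        return "IDE"
    return "unknown"


if __name__ == "__main__":
    print("SATA controller mode:", controller_mode())
```

If this reports "unknown", the driver names assumed above probably do not match the hardware, and the kernel boot log would be the next place to look.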
I came in this morning and found the “Old Fateful” Ruby2 in a Heartbeat freeze state at A9 off. I checked the SATA protocol monitor file and found an error.
The error is flagged as #1 Code Violation and right after that #2 Disparity Error. My research found a detailed explanation of the error codes.
The recording started last night when I left at 6:05PM and stopped after 8h 39min, which puts the stop at about 2:44AM today. For 55sec the Kernel tried to
correct the error and reestablish communication, but the heartbeat went into a freeze state and the protocol error recovery stopped.
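As a quick check on the timeline, the stop time can be worked out from the 6:05PM start and the 8h 39min duration. This is only a sketch: the calendar date is a placeholder, and it assumes the 55sec recovery attempt was the last thing captured before the recording stopped.

```python
from datetime import datetime, timedelta

# The calendar date is a placeholder; only the clock times come from the notes.
start = datetime(2000, 1, 1, 18, 5)        # recording started at 6:05PM
duration = timedelta(hours=8, minutes=39)  # length of the capture
retry_window = timedelta(seconds=55)       # time the kernel spent retrying

stop = start + duration
# Assumption: the retries ran right up to the end of the capture.
retry_began = stop - retry_window

print("capture stopped at", stop.strftime("%H:%M:%S"))
print("recovery began at ", retry_began.strftime("%H:%M:%S"))
```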
Below is the explanation of the process. I believe this issue is caused by a marginal hardware deviation that results in a SATA protocol malfunction right above the physical layer.
The link layer is the next layer, directly above the physical layer (PHY). This layer is responsible for encapsulating data payloads and managing the protocol for sending and
receiving them. A data payload that is sent is called a Frame Information Structure (FIS). The link layer also provides some other services for ensuring data integrity, handling flow
control, and reducing EMI. The host and the disk each have their own transmit pair in a SATA cable, and theoretically data could be sent in both directions simultaneously.
However, this does not occur. Instead, the receiver sends “backchannel” information to the sender that indicates the status of the transfer in progress.
For instance, if an error such as a disparity error were detected mid-transmission, the receiver could notify the sender of it.
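To make the two error types from the monitor file more concrete, here is a simplified sketch of how a receiver can spot them in an 8b/10b stream. It is not a full decoder: it only checks how balanced each 10-bit code group is, so it will miss invalid groups that happen to be balanced, and it may not match exactly how the protocol monitor classifies errors. The function names are illustrative.

```python
def disparity(code_group):
    """Ones minus zeros in a 10-bit code group given as a string like '0011111010'."""
    return code_group.count("1") - code_group.count("0")


def check_stream(code_groups, running=-1):
    """Walk 10-bit code groups and report simplified 8b/10b errors.

    running is the running disparity before the first group (-1 or +1).
    Returns a list of (index, error_name) tuples.
    """
    errors = []
    for i, group in enumerate(code_groups):
        d = disparity(group)
        if len(group) != 10 or d not in (-2, 0, 2):
            # No valid 8b/10b code group is this unbalanced.
            errors.append((i, "code violation"))
            continue
        if d == 0:
            continue  # balanced group: running disparity is unchanged
        if (d > 0) == (running > 0):
            # A group with more ones must follow RD-, and vice versa.
            errors.append((i, "disparity error"))
        running = 1 if d > 0 else -1
    return errors


K28_5_RD_NEG = "0011111010"  # K28.5 as sent when running disparity is negative
K28_5_RD_POS = "1100000101"  # K28.5 as sent when running disparity is positive
D10_2 = "0101010101"         # balanced, same pattern for either disparity

print(check_stream([K28_5_RD_NEG, D10_2, K28_5_RD_POS]))  # clean stream
print(check_stream([K28_5_RD_NEG, D10_2, K28_5_RD_NEG]))  # wrong polarity
print(check_stream(["0011111110"]))                       # too many ones
```

A single flipped bit on the wire can produce either error, which is consistent with the idea of a marginal hardware problem just above the PHY.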
The link layer uses a set of defined Link Layer Primitives to perform these functions. Each primitive is one Dword (4 bytes) long and starts with the control character K28.3
(except for ALIGN, which starts with K28.5). The following table lists most of the defined primitives and their values in hexadecimal before encoding. The usage of these will be
discussed in more detail.
Primitive Hex Representation
Most of the communication between the Host (Kernel) and Device (DOM) consists of ALIGN and SYNC primitives.
ALIGN: This primitive allows the receiver to determine the byte boundaries in the data stream. A pair of them is sent at least every
256 Dwords regardless of what state the link layer is in.
SYNC: SYNC is used to indicate that the line is idle. When frames are not being sent, both the host and the disk will send this primitive. This primitive also has a special
function called the “SYNC Escape.” If the host sends a SYNC, the line is forced to go idle, terminating all current transfers. The disk must respond with SYNC.
This way, if the host needs to issue a soft reset, it can do so. I believe the Host sent a soft reset but, after the Heartbeat freeze, could not reboot the system.
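The ALIGN and SYNC primitives described above can be written out as 32-bit values before encoding. The sketch below builds them from their Kx.y/Dx.y character names, where the byte value is y in the top 3 bits and x in the low 5 bits. The character sequences used for ALIGN and SYNC are the ones commonly documented for the SATA specification and are worth checking against the spec revision in use. The helper names, including `with_aligns`, are illustrative only.

```python
def code_byte(x, y):
    """Byte value for a Dx.y or Kx.y character: y is the top 3 bits, x the low 5."""
    return (y << 5) | x


def primitive_dword(chars):
    """Pack four (x, y) characters into a 32-bit value, first character in the low byte.

    Whether a character is K (control) or D (data) is carried out of band,
    so it is not visible in the packed value itself.
    """
    value = 0
    for position, (x, y) in enumerate(chars):
        value |= code_byte(x, y) << (8 * position)
    return value


# Assumed character sequences (verify against the SATA specification).
ALIGN = primitive_dword([(28, 5), (10, 2), (10, 2), (27, 3)])  # K28.5 D10.2 D10.2 D27.3
SYNC = primitive_dword([(28, 3), (21, 4), (21, 5), (21, 5)])   # K28.3 D21.4 D21.5 D21.5


def with_aligns(dwords, interval=256):
    """Insert a pair of ALIGNs ahead of every (interval - 2) Dwords of traffic."""
    out = []
    for i in range(0, len(dwords), interval - 2):
        out.extend([ALIGN, ALIGN])
        out.extend(dwords[i:i + interval - 2])
    return out


print(f"ALIGN = 0x{ALIGN:08X}")
print(f"SYNC  = 0x{SYNC:08X}")

# An idle line: 300 SYNC Dwords with the ALIGN pairs added.
idle = with_aligns([SYNC] * 300)
print(len(idle), "Dwords on the wire,", idle.count(ALIGN), "of them ALIGN")
```

On an idle link the analyzer trace should therefore be almost entirely SYNC with a pair of ALIGNs at regular intervals, which matches what the capture shows for most of the night.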
I am focusing my attention on the 4 system signals: SYS_RESET, FPGA_CPU_RESET, PCH_GPO7, PCH_GPO8. See pages 19 and 20 of VF2700D Rev. 0.1 Doc 03-22-2012 SCH.