0 Replies Latest reply on Jan 5, 2016 11:55 AM by nzlatanov

    Why does my Intel Atom D2550 chipset freeze after a few hours of operation? It is used in a laptop motherboard designed by BCM for the VeriFone Ruby2 touch-screen terminal, working with a DOM (SSD). A SATA analyzer shows no issue.

    nzlatanov

       

      Here is what I found last night, and why we changed the Controller Mode BIOS setting from IDE to AHCI. According to Intel, AHCI mode should be used when running an SSD (solid-state drive), a.k.a. DOM (Disk-On-a-Module), with Serial ATA (Serial AT Attachment).

       

      SATA is a computer bus interface that connects host bus adapters to mass storage devices such as hard disk drives and optical drives.

       

       

      AHCI stands for Advanced Host Controller Interface. AHCI is a hardware mechanism that allows software to communicate with Serial ATA (SATA) devices (such as host bus adapters) that are designed to offer features not offered by Parallel ATA (PATA) controllers, such as hot-plugging and native command queuing (NCQ). See link below.

       

       

      http://forum.crucial.com/t5/Crucial-SSDs/Why-do-i-need-AHCI-with-a-SSD-Drive-Guide-Here-Crucial-AHCI-vs/td-p/57078

       

       

      Since the setting was changed last night, the heartbeats of both the bad “V” DOM unit and the golden “#4” DOM unit are still running fine.

       

      In light of this finding, it is very likely that the solution to the freezing heartbeat is to change the BIOS setting from IDE to AHCI as suggested.

       

       

         

      I came in this morning and found the “Old Fateful” Ruby2 in a heartbeat freeze state at A9 off. I checked the SATA protocol monitor file and found an error.

       

      The error is flagged as #1 Code Violation, followed immediately by #2 Disparity Error. My research turned up a detailed explanation of these error codes.

       

      The recording started last night when I left at 6:05 PM and stopped after 8 h 39 min, which is about 2:44 AM today. For 55 s the kernel tried to correct the error and reestablish communication, but the heartbeat went into a freeze state and the protocol error recovery stopped.
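
      The stop time follows from simple clock arithmetic. A minimal sketch, assuming only the start time and recording length quoted above (the calendar date is a placeholder, not taken from the log):

```python
from datetime import datetime, timedelta

# Placeholder date; only the clock time of 6:05 PM comes from the post.
start = datetime(2000, 1, 1, 18, 5)

# Recording length reported by the SATA protocol monitor.
duration = timedelta(hours=8, minutes=39)

stop = start + duration
print(stop.strftime("%H:%M"))  # 24-hour clock time at which the recording stopped
```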

       

      Below is the explanation of the process. I believe this issue is caused by a marginal hardware deviation that results in a SATA protocol malfunction right above the physical layer.

       

      The link layer is the next layer and sits directly above the physical layer (PHY). This layer is responsible for encapsulating data payloads and managing the protocol for sending and receiving them. A data payload that is sent is called a Frame Information Structure (FIS). The link layer also provides some other services for ensuring data integrity, handling flow control, and reducing EMI. The host and the disk each have their own transmit pair in a SATA cable, and theoretically data could be sent in both directions simultaneously. However, this does not occur. Instead, the receiver sends “backchannel” information to the sender that indicates the status of the transfer in progress. For instance, if an error such as a disparity error were detected mid-transmission, the receiver could notify the sender of this.
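
      To make the two analyzer flags more concrete, here is a rough sketch of how a receiver might tell a code violation from a disparity error on 8b/10b-encoded code groups. It is a simplification and not how the analyzer or the PHY actually does it: a real decoder looks each 10-bit code group up in the valid-code table and tracks disparity per sub-block. The function name and the bit-counting shortcut are my own assumptions for illustration.

```python
def classify_code_group(code_group: int, running_disparity: int):
    """Classify one 10-bit code group (simplified, bit-counting only).

    running_disparity is -1 or +1 at the code-group boundary.
    Returns (status, new_running_disparity).
    """
    ones = bin(code_group & 0x3FF).count("1")
    disparity = 2 * ones - 10  # ones minus zeros

    # Valid 8b/10b code groups only ever have disparity -2, 0 or +2.
    if disparity not in (-2, 0, 2):
        return "code violation", running_disparity

    # A balanced code group leaves the running disparity unchanged.
    if disparity == 0:
        return "ok", running_disparity

    # An unbalanced code group must pull the running disparity
    # back to the opposite sign; otherwise it is a disparity error.
    if disparity == 2 and running_disparity == -1:
        return "ok", +1
    if disparity == -2 and running_disparity == +1:
        return "ok", -1
    return "disparity error", (1 if disparity > 0 else -1)


# K28.5 as transmitted when the running disparity is negative / positive.
K28_5_RD_NEG = 0b0011111010
K28_5_RD_POS = 0b1100000101
```

      Sending K28_5_RD_NEG twice in a row, for example, would be reported as a disparity error on the second code group, while a pattern such as all ones is a code violation.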

       

      The link layer uses a set of defined Link Layer Primitives to perform these functions. Primitives are each one Dword (4 bytes) long and start with the control character K28.3 (except for ALIGN, which starts with K28.5). The following table lists most of the defined primitives and their value in hexadecimal before encoding. The usage of these will be discussed in more detail.

       

      Primitive   Hex Representation
      ALIGN       0x7B4A4ABC
      SYNC        0xB5B5957C
      X_RDY       0x5757B57C
      R_RDY       0x4A4A957C
      SOF         0x3737B57C
      R_IP        0x5555B57C
      HOLD        0xD5D5AA7C
      HOLD_ACK    0x9595AA7C
      EOF         0xD5D5B57C
      WTRM        0x5858B57C
      R_OK        0x3535B57C
      R_ERR       0x5656B57C
      CONT        0x9999AA7C
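
      When reading a raw trace, the table above can be turned into a small lookup. This is only a sketch: the Dword values are copied from the table, the low byte is taken as the control character (0x7C is K28.3, 0xBC is K28.5), and the function names are my own, not part of any analyzer tool.

```python
# Dword values before 8b/10b encoding, copied from the table above.
PRIMITIVES = {
    0x7B4A4ABC: "ALIGN",
    0xB5B5957C: "SYNC",
    0x5757B57C: "X_RDY",
    0x4A4A957C: "R_RDY",
    0x3737B57C: "SOF",
    0x5555B57C: "R_IP",
    0xD5D5AA7C: "HOLD",
    0x9595AA7C: "HOLD_ACK",
    0xD5D5B57C: "EOF",
    0x5858B57C: "WTRM",
    0x3535B57C: "R_OK",
    0x5656B57C: "R_ERR",
    0x9999AA7C: "CONT",
}


def k_code_byte(x: int, y: int) -> int:
    """Byte value of control character Kx.y (y in the top 3 bits, x in the low 5)."""
    return (y << 5) | x


CONTROL_CHARS = {k_code_byte(28, 3): "K28.3", k_code_byte(28, 5): "K28.5"}


def describe_dword(dword: int):
    """Return (primitive name, control character) for one Dword of a trace."""
    name = PRIMITIVES.get(dword, "unknown")
    control = CONTROL_CHARS.get(dword & 0xFF, "none")
    return name, control


for value in (0xB5B5957C, 0x7B4A4ABC, 0x12345678):
    print(hex(value), describe_dword(value))
```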

       

      Most of the communication between the Host (Kernel) and Device (DOM) consists of ALIGN and SYNC primitives.

       

      ALIGN: This primitive allows the receiver to determine the byte boundaries in the data stream. A pair of them is sent at least every 256 Dwords regardless of what state the link layer is in.
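
      One way to check a captured trace against the ALIGN rule above is to look for runs of non-ALIGN Dwords that are too long. The sketch below assumes the trace is available as a plain list of Dword integers (a hypothetical export format) and reads "at least every 256 Dwords" as "no more than 256 non-ALIGN Dwords in a row"; it does not verify that ALIGNs arrive in pairs.

```python
ALIGN = 0x7B4A4ABC
SYNC = 0xB5B5957C
MAX_GAP = 256  # figure taken from the description above


def align_spacing_ok(dwords, max_gap=MAX_GAP):
    """Return True if no run of non-ALIGN Dwords is longer than max_gap."""
    run = 0
    for dword in dwords:
        if dword == ALIGN:
            run = 0  # an ALIGN restarts the count
        else:
            run += 1
            if run > max_gap:
                return False
    return True


# Idle link: ALIGN pair, 256 SYNCs, ALIGN pair.
good_trace = [ALIGN, ALIGN] + [SYNC] * 256 + [ALIGN, ALIGN]
# Same, but one SYNC too many before the next ALIGN pair.
bad_trace = [ALIGN, ALIGN] + [SYNC] * 257 + [ALIGN, ALIGN]
```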

       

      SYNC: SYNC is used to indicate that the line is idle. When frames are not being sent, both the host and the disk will send this primitive. This primitive also has a special function called the “SYNC Escape.” If the host sends a SYNC, the line is forced to go idle, terminating all current transfers. The disk must respond with SYNC. This way, if the host needs to issue a soft reset, it can do so.

      I believe the host did send a soft reset but, after the heartbeat freeze, could not reboot the system.

       

      I am focusing my attention on four system signals: SYS_RESET, FPGA_CPU_RESET, PCH_GPO7, and PCH_GPO8. See pages 19 and 20 of VF2700D Rev. 0.1 Doc 03-22-2012 SCH.