11 Replies Latest reply on Feb 2, 2016 1:47 PM by PabloM_Intel

    Edison boot reliability

    slanzise

      We are using the Edison as a standalone embedded device, and we've noticed a boot reliability issue with the board. The board boots successfully about 95% of the time (soft boots and hard boots have slightly different numbers). On other platforms the U-Boot boot delay is a factor, but we set this delay to 0 on the Edison (confirmed via observation on the console) and found no improvement. For an unattended system, booting with several nines of reliability is quite important.

       

      Have others seen similar boot reliability numbers? Your experience, as well as suggestions on how to get to high boot reliability, are welcome.

       

      Note: power cycling until successful boot is not possible because there's no person around to go through this process. Board boots successfully after a hard reset after a failed boot attempt.

        • 1. Re: Edison boot reliability
          PabloM_Intel

          Hi slanzise,

           

          Could you share some more details about your project? What exactly is the setup? We would like to know how you’re powering the board, whether there’s any external circuitry connected to it, which image you’re using, etc. If you’re using it as a standalone embedded device, my guess is that you’re using the Mini-Breakout board, right?

           

          Any other detail about your connection and possible software changes would be useful. Also, we would like to know how you're testing the boot reliability so we can replicate it.

           

          Regards,

          PabloM_Intel

          • 2. Re: Edison boot reliability
            slanzise

            Hi Pablo,

             

            We have an Edison on a custom adapter board that is modeled on the Intel Mini-Breakout board. We've added our own peripherals, and the largest I/O-related change is that we have the USB On-The-Go port disconnected (floating). We also use the UART and some GPIOs, but only a couple of GPIOs are used in our reliability testing. It is worth noting that in our full application we have no problems with the Edison once it is fully running. The board is powered by a switching converter on our board that outputs 4.3V.

             

            We find that about 5% of the time, the Edison doesn't completely boot. You can watch the boot process on the serial console, and it appears to proceed normally. However, when you enter your user name at the login prompt, the Edison appears to freeze after it displays the password prompt. In addition, our services launched by systemd do not seem to be running (one controls the state of an LED, and we don't see the LED states we expect), even though you can see the service launched in the boot log. To add to the strangeness, at the password prompt you can hit Ctrl-C and see the tty restart and a new login prompt displayed. I have included an example boot failure log below so you can see this behavior. There are a bunch of odd escape characters polluting the log output, but I think you'll be able to see what we're seeing.

             

            Based on prior experience, we set the boot delay parameter in U-Boot to -2. This eliminates the pause for user input that can prevent a normal boot. However, this change had no significant impact on boot reliability.

             

            To test boot reliability, we created a simple service launched by systemd. This service reboots the Edison when the uptime exceeds five minutes, and it also posts uptime information to a database so we can observe the behavior. Several Edisons run this service, and they are all connected to a wall timer that toggles wall power to the power supply every 30 minutes. This exercises both hard and soft reboots, and we can run it autonomously for hours or days at a time.

             

            Thanks for your thoughts.

             

            ******************************

            PSH KERNEL VERSION: b0182b2b

              WR: 20104000

            ******************************

             

            SCU IPC: 0x800000d0  0xfffce92c

            PSH miaHOB version: TNG .B0 .VVBD .0000000c

            microkernel  built 11:24:08  Feb  5 2015

             

            ******* PSH loader *******

            PCM page cache size = 192 KB

            Cache Constraint = 0 Pages

            Arming IPC driver ..

            Adding page store pool ..

            PagestoreAddr(IMR Start Address) = 0x04899000

            pageStoreSize(IMR Size)          = 0x00080000

             

            *** Ready to receive application ***

             

             

            U-Boot 2014.04 (Dec 30 2015 - 15:20:03)

             

                   Watchdog enabled

             

            DRAM:  980.6 MiB

            MMC:   tangier_sdhci: 0

            In:    serial

            Out:   serial

            Err:   serial

             

            Target:blank

            Partitioning already done...

            Flashing already done...

            GADGET DRIVER: usb_dnl_dfu

            reading vmlinuz

            5330528 bytes read in 132 ms (38.5 MiB/s)

            Valid Boot Flag

            Setup Size = 0x00003c00

            Magic signature found

            Using boot protocol version 2.0c

             

            Linux kernel version 3.10.17-yocto-standard (slanzise@build) #2 SMP PREEMPT Wed Aug 26 17:32:38 PDT 2015

             

            Building boot_params at 0x00090000

            Loading bzImage at address 00100000 (5315168 bytes)

            Magic signature found

             

            Kernel command line: "rootwait root=PARTUUID=012b3303-34ac-284d-99b4-34e03a2335f4 rootfstype=ext4 console=ttyMFD2 earlyprintk=ttyMFD2,keep loglevel=4 g_multi.ethernet_config=cdc systemd.unit=multi-user.target hardware_id=00 g_multi.iSerialNumber=7a1dea0a4b43e8c5cf4b42170ab3013b g_multi.dev_addr=02:00:86:b3:01:3b platform_mrfld_audio.audio_codec=dummy"

             

            Starting kernel ...

            [    0.696400] pca953x 1-0020: failed reading register

            [    0.701543] pca953x 1-0021: failed reading register

            [    0.706595] pca953x 1-0022: failed reading register

            [    0.711772] pca953x 1-0023: failed reading register

            [    1.068917] snd_soc_sst_platform: Enter:sst_soc_probe

            [    1.466893] pmic_ccsm pmic_ccsm: Error reading battery profile from battid frmwrk

            [    1.475296] pmic_ccsm pmic_ccsm: Battery Over heat exception

            [    1.475381] pmic_ccsm pmic_ccsm: Battery0 temperature inside boundary

             

            Welcome to  [1mLinux [0m!

                     Expecting device dev-ttyMFD2.device...

                     Expecting device dev-disk-by\x2dpartlabel-home.device...

            [ [32m  OK   [0m] Reached target Remote File Systems.

                     Expecting device dev-disk-by\x2dpartlabel-factory.device...

            [ [32m  OK   [0m] Reached target Paths.

            [ [32m  OK   [0m] Set up automount Arbitrary Executable File Formats F...utomount Point.

            [ [32m  OK   [0m] Reached target Swap.

            [ [32m  OK   [0m] Set up automount boot.automount.

            [ [32m  OK   [0m] Created slice Root Slice.

            [ [32m  OK   [0m] Listening on Journal Socket.

            [ [32m  OK   [0m] Listening on Delayed Shutdown Socket.

            [ [32m  OK   [0m] Listening on /dev/initctl Compatibility Named Pipe.

            [ [32m  OK   [0m] Listening on udev Control Socket.

            [ [32m  OK   [0m] Listening on udev Kernel Socket.

            [ [32m  OK   [0m] Created slice User and Session Slice.

            [ [32m  OK   [0m] Created slice System Slice.

                     Starting Load Kernel Modules...

                     Mounting Debug File System...

                     Mounting POSIX Message Queue File System...

                     Starting Apply Kernel Variables...

                     Starting udev Coldplug all Devices...

                     Starting Journal Service...

            [ [32m  OK   [0m] Started Journal Service.

                     Starting Create list of required static device nodes...rrent kernel...

            [ [32m  OK   [0m] Reached target Slices.

                     Starting Remount Root and Kernel File Systems...

            [ [32m  OK   [0m] Created slice system-serial\x2dgetty.slice.

            [ [32m  OK   [0m] Created slice system-getty.slice.

            [ [32m  OK   [0m] Created slice system-systemd\x2dfsck.slice.

                     Mounting Temporary Directory...

            [ [32m  OK   [0m] Set up automount home.automount.

            [ [32m  OK   [0m] Mounted POSIX Message Queue File System.

            [ [32m  OK   [0m] Mounted Debug File System.

            [ [32m  OK   [0m] Mounted Temporary Directory.

            [ [32m  OK   [0m] Started Apply Kernel Variables.

            [ [32m  OK   [0m] Started Create list of required static device nodes ...current kernel.

            [ [32m  OK   [0m] Started Remount Root and Kernel File Systems.

            [ [32m  OK   [0m] Started udev Coldplug all Devices.

                     Starting Load/Save Random Seed...

                     Starting Create Static Device Nodes in /dev...

            [ [32m  OK   [0m] Started Load/Save Random Seed.

            [ [32m  OK   [0m] Started Create Static Device Nodes in /dev.

            [ [32m  OK   [0m] Started Load Kernel Modules.

                     Mounting Configuration File System...

                     Mounting FUSE Control File System...

                     Starting udev Kernel Device Manager...

            [ [32m  OK   [0m] Reached target Local File Systems (Pre).

                     Mounting /var/volatile...

            [ [32m  OK   [0m] Mounted FUSE Control File System.

            [ [32m  OK   [0m] Mounted Configuration File System.

            [ [32m  OK   [0m] Mounted /var/volatile.

            [ [32m  OK   [0m] Started udev Kernel Device Manager.

            [ [32m  OK   [0m] Reached target Local File Systems.

                     Starting Trigger Flushing of Journal to Persistent Storage...

                     Starting Create Volatile Files and Directories...

            [ [32m  OK   [0m] Started Create Volatile Files and Directories.

            [ [32m  OK   [0m] Started Trigger Flushing of Journal to Persistent Storage.

                     Starting Network Time Synchronization...

                     Starting Update UTMP about System Boot/Shutdown...

            [ [32m  OK   [0m] Started Network Time Synchronization.

            [ [32m  OK   [0m] Started Update UTMP about System Boot/Shutdown.

            [ [32m  OK   [0m] Found device /dev/ttyMFD2.

            [ [32m  OK   [0m] Found device /dev/disk/by-partlabel/factory.

            [ [32m  OK   [0m] Found device /dev/disk/by-partlabel/home.

                     Starting File System Check on /dev/disk/by-partlabel/home...

                     Mounting Mount for factory...

            [ [32m  OK   [0m] Reached target System Initialization.

            [ [32m  OK   [0m] Reached target Timers.

            [ [32m  OK   [0m] Listening on D-Bus System Message Bus Socket.

                     Starting Restore Sound Card State...

            [ [32m  OK   [0m] Mounted Mount for factory.

            [ [32m  OK   [0m] Listening on sshd.socket.

            [    4.437516] systemd-fsck[163]: /dev/mmcblk0p10: clean, 18/87120 files, 14190/348155 blocks

            [ [32m  OK   [0m] Started File System Check on /dev/disk/by-partlabel/home.

            [ [32m  OK   [0m] Reached target Sound Card.

                     Mounting /home...

            [ [32m  OK   [0m] Reached target Sockets.

            [ [32m  OK   [0m] Reached target Basic System.

                     Starting Cleanjournal service...

            [ [32m  OK   [0m] Started Cleanjournal service.

                     Starting Crashlog service...

            [ [32m  OK   [0m] Started Crashlog service.

                     Starting Edison PWR button handler...

            [ [32m  OK   [0m] Started Edison PWR button handler.

                     Starting Bluetooth rf kill event daemon...

            [ [32m  OK   [0m] Started Bluetooth rf kill event daemon.

                     Starting Daemon to handle arduino sketches...

            [ [32m  OK   [0m] Started Daemon to handle arduino sketches.

                     Starting Daemon to load edison mcu app binary...

            [ [32m  OK   [0m] Started Daemon to load edison mcu app binary.

                     Starting Daemon to reset sketches...

            [ [32m  OK   [0m] Started Daemon to reset sketches.

                     Starting Start or stop WiFI AP Mode in Edison...

            Application available at (physical) address 0x04819000

              VRL map(`ِ[ [32m  OK   [0m] Stto 0xff217000

              App size = 11508 bytes

             

              App Authentication feature is disabled!

              Resetting IPC

             

            *** Ready to receive application ***

                      Starting Login Service...

                     Starting D-Bus System Message Bus...

            [ [32m  OK   [0m] Started D-Bus System Message Bus.

                     Starting Network Service...

                     Starting Permit User Sessions...

                     Starting Watchdog sample daemon...

            [ [32m  OK   [0m] Started Watchdog sample daemon.

            [ [32m  OK   [0m] Mounted /home.

            [ [32m  OK   [0m] Started Permit User Sessions.

            [ [32m  OK   [0m] Started Network Service.

            [ [32m  OK   [0m] Created slice system-systemd\x2drfkill.slice.

                     Starting Load/Save RF Kill Switch Status of rfkill0...

                     Starting Load/Save RF Kill Switch Status of rfkill2...

                     Starting Load/Save RF Kill Switch Status of rfkill1...

                     Mounting Arbitrary Executable File Formats File System...

                     Starting Network Name Resolution...

            [ [32m  OK   [0m] Reached target Network.

                     Starting Zero-configuration networking...

                     Starting Mosquitto - lightweight server implementati...SN protocols...

                     Starting Serial Getty on ttyMFD2...

            [ [32m  OK   [0m] Started Serial Getty on ttyMFD2.

                     Starting Getty on tty1...

            [ [32m  OK   [0m] Started Getty on tty1.

            [ [32m  OK   [0m] Reached target Login Prompts.

                     Starting Post wifi status ...

            [ [32m  OK   [0m] Started Post wifi status.

            [ [32m  OK   [0m] Mounted Arbitrary Executable File Formats File System.

            [ [32m  OK   [0m] Started Network Name Resolution.

            [ [32m  OK   [0m] Started Load/Save RF Kill Switch Status of rfkill0.

            [ [32m  OK   [0m] Started Load/Save RF Kill Switch Status of rfkill2.

            [ [32m  OK   [0m] Started Load/Save RF Kill Switch Status of rfkill1.

            [ [32m  OK   [0m] Started Mosquitto - lightweight server implementatio...T-SN protocols.

            [ [32m  OK   [0m] Started Login Service.

            [ [32m  OK   [0m] Started Zero-configuration networking.

                     Starting The Edison status and configuration service...

            [ [32m  OK   [0m] Started The Edison status and configuration service.

                     Starting Intel_XDK_Daemon...

            [ [32m  OK   [0m] Started Intel_XDK_Daemon.

                     Starting File System Check on /dev/disk/by-partlabel/boot...

            [    6.947740] systemd-fsck[241]: dosfsck 2.11, 12 Mar 2005, FAT32, LFN

            [    6.950364] systemd-fsck[241]: /dev/mmcblk0p7: 5 files, 2691/2923 clusters

            [ [32m  OK   [0m] Started File System Check on /dev/disk/by-partlabel/boot.

                     Mounting /boot...

            [ [32m  OK   [0m] Mounted /boot.

             

            Poky (Yocto Project Reference Distro) 1.6.1 edison ttyMFD2

             

            edison login: root

            Password:

            ^C         Stopping Serial Getty on ttyMFD2...

            [ [32m  OK   [0m] Stopped Serial Getty on ttyMFD2.

                     Starting Serial Getty on ttyMFD2...

            [ [32m  OK   [0m] Started Serial Getty on ttyMFD2.

             

            Poky (Yocto Project Reference Distro) 1.6.1 edison ttyMFD2

             

            edison login: root

            Password:

            • 3. Re: Edison boot reliability
              slanzise

              Pablo,

               

              Although we don't have many of them, we re-ran the test on the Intel Mini-Breakout board to rule out our specific hardware design as the cause, and we still see the same boot issue.

              • 4. Re: Edison boot reliability
                slanzise

                The stock image downloaded from the Edison downloads page also exhibits this issue on the default hardware. I'm guessing this is a systemd issue, but we haven't isolated it to the point where we can be sure.

                • 5. Re: Edison boot reliability
                  evanmeagher

                  Hello, Pablo. I work with slanzise and have been attempting to mitigate this boot issue by patching our meta-intel-edison layer. Based on log output at boot (pasted above) and a few similar-sounding bug reports regarding systemd [1-2], I wonder if the root cause could be a race condition somewhere in systemd during boot. Without a good way to test this theory, I tried to upgrade the version of systemd baked into our Yocto images, without success. I've attempted to cherry-pick the latest systemd recipe from OpenEmbedded (systemd version 228 vs meta-intel-edison's 213) into our layer, but have been foiled thus far by library and kernel-module dependency fallout.

                   

                  Before investing more time in this speculative upgrade, I was wondering two things:

                   

                  1) Have there been any reported issues related to the version of systemd installed by meta-intel-edison (v213)?

                  2) Is there any precedent for upgrading this version of systemd in the meta-intel-edison layer?

                   

                  Thanks for helping us look into this.

                   

                  [1] Ubuntu bug #1385630, “systemd 215 hangs during boot” (systemd package)

                  [2] https://bbs.archlinux.org/viewtopic.php?id=170756

                  • 6. Re: Edison boot reliability
                    PabloM_Intel

                    Hi guys,

                     

                    Thank you for sharing all this information. So just to try and replicate the issue, you’re using the latest Edison image, right? Or at least you downloaded the one from this site https://software.intel.com/en-us/iot/hardware/edison/downloads, I believe.

                    Also, there’s no need then for any external circuitry to conduct this test, right? We’ll be using the Mini-Breakout board, just as you did.

                    Could you please provide the service used to test reliability? We would like to have your other custom services if you’re ok with that, but they are not a priority.

                    Regarding evanmeagher's questions on systemd, we will investigate and get back to you with an answer.

                     

                    Regards,

                    PabloM_Intel

                    • 7. Re: Edison boot reliability
                      evanmeagher

                      Thanks for the reply, Pablo. Let me answer your questions in-line.

                       

                      > you’re using the latest Edison image, right?

                       

                      That's correct. We've run our boot test (source provided below) with images from the latest official release from Intel (2.1) and with the latest meta-intel-edison Yocto layer in Git [1]. Additionally, we extended each of these "stock" images to disable U-Boot's bootdelay feature by patching the u-boot recipe in meta-intel-edison (i.e. setting `bootdelay=-2` in the relevant U-Boot configuration). We've found the bootdelay feature to be problematic on other SoC platforms, where noise on a serial line manifests as input which irrevocably pauses the boot sequence.

                       

                      We observe the same ~95% boot reliability with all four of these images.

                       

                      > there’s no need then for any external circuitry to conduct this test, right?

                       

                      Correct, our testing was done with a stock Mini-Breakout board. We've also run tests with our custom adapter board, which as slanzise mentioned above, is based on Intel's Mini-Breakout board.

                       

                      > Could you please provide the service used to test reliability?

                       

                      Here is the Python source code of our test, with the server interaction removed: wifi_status.py · GitHub

                       

                      As slanzise described, this script boils down to a loop which posts wifi signal strength and system uptime to our server and reboots the machine after uptime has exceeded five minutes. This script is wired into systemd with the wifi_status.service file included in the above gist. It's worth mentioning that we've observed boot failures with the same symptoms (login prompt accepting input, but hangs after password receipt) when this Python script is not installed, so it doesn't seem to be an issue related to our test itself.
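
For readers without access to the gist, a minimal sketch of the loop described above might look like the following. This is not the authors' script: the polling interval, the `report` placeholder, reading `/proc/uptime`, and invoking the `reboot` command are all assumptions made for illustration.

```python
import subprocess
import time

UPTIME_LIMIT_S = 5 * 60   # reboot once uptime exceeds five minutes
POLL_INTERVAL_S = 10      # hypothetical polling period, not from the thread


def parse_uptime(text):
    """Return uptime in seconds from the contents of /proc/uptime."""
    # /proc/uptime holds two floats: uptime and aggregate idle time.
    return float(text.split()[0])


def read_uptime(path="/proc/uptime"):
    """Read the current system uptime in seconds (Linux only)."""
    with open(path) as f:
        return parse_uptime(f.read())


def should_reboot(uptime_s, limit_s=UPTIME_LIMIT_S):
    """True once the uptime has passed the limit."""
    return uptime_s > limit_s


def report(uptime_s):
    """Placeholder for the server post, which was removed from the shared script."""
    print("uptime: %.0f s" % uptime_s)


def main():
    """Poll the uptime, report it, and reboot once the limit is exceeded."""
    while True:
        uptime_s = read_uptime()
        report(uptime_s)
        if should_reboot(uptime_s):
            # Assumption: the process runs as root and 'reboot' is on PATH.
            subprocess.call(["reboot"])
            return
        time.sleep(POLL_INTERVAL_S)
```

Calling `main()` from a script started by a systemd unit would reproduce the soft-reboot half of the test; the hard reboots come from the wall timer, not from the code.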

                       

                      Devices running this service are attached to a wall timer which toggles power every 30 minutes. Thus, we're able to test six soft reboots and one hard reboot per device per hour.

                       

                      [1] meta-intel-edison - Layer for the Intel Edison Development Platform

                      • 8. Re: Edison boot reliability
                        PabloM_Intel

                        Hi slanzise evanmeagher,

                         

                        We are still working on this case. As soon as we have an update we will let you know.

                         

                        Regards,

                        PabloM_Intel

                        • 9. Re: Edison boot reliability
                          PabloM_Intel

                          Hi guys,

                           

                          We have some more information to share with you. We set up a simple environment to test the boot reliability of the Edison. We modified the code by removing everything except the part that obtains the system uptime and checks whether the system has been up for five minutes. If so, the shutdown command is executed. We didn’t encounter any issue at reboot time; the board kept running for hours. So we are assuming the issue is related to the other part of your script.

                           

                          Please let us know if you would like us to share the code that was used.

                           

                          Regards,

                          Pablo

                          • 10. Re: Edison boot reliability
                            slanzise

                            Thanks for looking into this. I'm not clear on your result. How many reboots did the system undergo without a boot failure? With a single Edison rebooting every 5 minutes with a 95% success probability, it can take quite a long time to see a failure. For example, after 60 boots (at least 5 hours, assuming a reboot after 5 minutes of uptime), there's still about a 5% chance you wouldn't have seen a failure.
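
The arithmetic behind that estimate can be sketched as follows, assuming each boot is an independent trial with the 95% success rate reported in this thread. The function names are illustrative, and independence between boots is an assumption that may not hold on real hardware.

```python
import math

P_SUCCESS = 0.95  # per-boot success rate reported in this thread


def p_no_failure(n_boots, p_success=P_SUCCESS):
    """Probability that n independent boots all succeed."""
    return p_success ** n_boots


def boots_needed(confidence, p_success=P_SUCCESS):
    """Smallest n for which at least one failure shows up with the given confidence."""
    # Solve p_success ** n <= 1 - confidence for n and round up.
    return int(math.ceil(math.log(1.0 - confidence) / math.log(p_success)))


for n in (10, 20, 60, 100):
    print("%3d boots: %.1f%% chance of seeing no failure" % (n, 100 * p_no_failure(n)))
print("boots for 95%% confidence: %d" % boots_needed(0.95))
print("boots for 99%% confidence: %d" % boots_needed(0.99))
```

Under these assumptions, roughly 59 boots are needed before a failure-free run becomes less than 5% likely, so a short test on a single board says little either way.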

                             

                            If you can share your code, we will run it here to verify we have the same performance.

                             

                            Thanks again for your attention.

                            • 11. Re: Edison boot reliability
                              PabloM_Intel

                              Hi slanzise,

                               

                              We have already tried running your code, with a few small changes so that it would run. In the first test, the boot was unsuccessful 2 out of 20 times. In the second test (still with your code), the boot was unsuccessful 2 out of 10 times. We even saw higher failure rates than you did.

                              After this, we modified the code and left only the necessary parts for it to reboot every 5 minutes. We didn’t find any issue using this code. You can find the script and the service attached.

                               

                               

                              Regards,

                              Pablo