12 Replies Latest reply on Aug 21, 2012 7:02 PM by mwaughex

    crbif driver troubles

    ms705

      Hello all,

       

      Since we've come back to working with the SCC (RockyLake board) after some idle time, we've had problems connecting to it. We have an Intel-supplied MCPC, and were intending to upgrade sccKit before doing any further work. Initially, the CBL LED on the PCIe interface was off, but we fixed this by ensuring the card was properly seated. CBL is now on; however, we cannot bring up the crb0 interface. We suspect something is wrong with our crbif driver installation -- see below for some diagnosis.

       

      Checking if the kernel module is loaded shows it is not:

       

      ms705@mcpc:~$ lsmod | grep crb

      ms705@mcpc:~$

       

      ... so I do

       

      ms705@mcpc:~$ modprobe crbif

       

      after which the driver is loaded:

       

      ms705@mcpc:~$ lsmod | grep crb

      crbif                  41528  0

      ms705@mcpc:~$

       

      and dmesg agrees:

       

      ms705@mcpc:~$ dmesg

      [...]

      [  580.696230] mcedev Id: $Id: mcedev_main.c 16545 2010-06-08 14:25:34Z jbrummer $

       

      ... but still no crb0 network interface:

       

      ms705@mcpc:~$ ifconfig crb0

      crb0: error fetching interface information: Device not found

       

      Also, lspci does not show the device:

       

      ms705@mcpc:~$ lspci | grep c148

      ms705@mcpc:~$

       

      ms705@mcpc:~$ lspci

      00:00.0 Host bridge: Intel Corporation Core Processor DMI (rev 11)

      00:08.0 System peripheral: Intel Corporation Core Processor System Management Registers (rev 11)

      00:08.1 System peripheral: Intel Corporation Core Processor Semaphore and Scratchpad Registers (rev 11)

      00:08.2 System peripheral: Intel Corporation Core Processor System Control and Status Registers (rev 11)

      00:08.3 System peripheral: Intel Corporation Core Processor Miscellaneous Registers (rev 11)

      00:10.0 System peripheral: Intel Corporation Core Processor QPI Link (rev 11)

      00:10.1 System peripheral: Intel Corporation Core Processor QPI Routing and Protocol Registers (rev 11)

      00:19.0 Ethernet controller: Intel Corporation 82578DM Gigabit Network Connection (rev 05)

      00:1a.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)

      00:1c.0 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 1 (rev 05)

      00:1c.4 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 5 (rev 05)

      00:1c.6 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 7 (rev 05)

      00:1c.7 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 8 (rev 05)

      00:1d.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)

      00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a5)

      00:1f.0 ISA bridge: Intel Corporation 3400 Series Chipset LPC Interface Controller (rev 05)

      00:1f.2 SATA controller: Intel Corporation 5 Series/3400 Series Chipset 6 port SATA AHCI Controller (rev 05)

      00:1f.3 SMBus: Intel Corporation 5 Series/3400 Series Chipset SMBus Controller (rev 05)

      02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

      03:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200e [Pilot] ServerEngines (SEP1) (rev 02)

      ms705@mcpc:~$

       

      Any ideas? In particular, how could we check if the driver is encountering problems, and how could we re-install it if necessary? (It is possible that the lab sysadmins have made changes to the MCPC software, e.g. kernel/package upgrades, since we last used it, which could have caused these troubles.)
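
A minimal sketch of the checks above as reusable shell helpers. It assumes the usual `lsmod` and `lspci` output layout; the helper names are hypothetical and not part of sccKit. Each helper reads command output on stdin, so it also works on saved output.

```shell
#!/bin/sh
# Hypothetical helpers; they only inspect text, they change nothing.

# Succeeds if the named module appears in the first column of `lsmod` output.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Succeeds if any line of `lspci` output contains the given string
# (the thread greps for "c148").
pci_id_present() {
    grep -qi -- "$1"
}

# Usage on the MCPC:
#   lsmod | module_loaded crbif   || echo "crbif is not loaded"
#   lspci | pci_id_present c148   || echo "no c148 device on the PCIe bus"
#   ifconfig crb0 >/dev/null 2>&1 || echo "crb0 interface is missing"
```

If the PCI check fails while the module check succeeds, the problem is more likely below the driver (cabling, power-on order) than in the driver installation itself.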

       

      Thanks in advance for your help!

       

      Best wishes,

      Malte

        • 1. Re: crbif driver troubles
          mwaughex

          Hello Malte,

          What version of sccKit are you currently using?

          You can tell by looking in /opt/sccKit.  The softlink 'current' should point to the version.

          Hopefully you are using at least sccKit 1.4.1.3.

           

          You can reinstall the driver by incrementing the number in the /opt/sccKit/current/firmware/RockyLake/update/update.txt file and then running the following as root from the /opt/sccKit/current/firmware/ directory:

          ./install.csh

           

          This link may also help:

          http://communities.intel.com/docs/DOC-19474

           

          The current sccKit is 1.4.2.2

          Can you also tell me what institution you are working from?

           

          -augie

          • 2. Re: crbif driver troubles
            ms705

            Hi augie,

             

            Thanks for the response! I am working from the University of Cambridge.

             

            We are still on 1.3.0, as this problem turned up when we tried to upgrade to 1.4.0 (and then further; as I understand, the upgrades need to be applied in sequence?).

             

            When I try to use the install.csh script, it initially works and makes progress, but then eventually fails with the following messages:

             

            [...]

            Unpacking replacement crbif-dkms ...

            Removing old module source...

            Setting up crbif-dkms (1.1.0-0ubuntu1~ppa1l) ...

            Loading new crbif-dkms-1.1.0 DKMS files...

             

             

            Error! Could not find module source directory.

            Directory: /usr/src/crbif-dkms-1.1.0 does not exist.

            dpkg: error processing crbif-dkms (--install):

            subprocess installed post-installation script returned error exit status 2

            Processing triggers for man-db ...

            Errors were encountered while processing:

            crbif-dkms


            Any ideas what the elusive error 2 might be?

             

            Thanks and best wishes,

            Malte

            • 3. Re: crbif driver troubles
              mwaughex

              Hello Malte,
              Upgrading from 1.3.0 to 1.4.x does involve some configuration changes.
              I use this document:
              http://communities.intel.com/docs/DOC-6868
              When it references the download use this tar file:
              wget http://marcbug.scc-dc.com/svn/repository/tarballs/sccKit_1.4.2.2.tar.bz2

               

              Did you build your own MCPC or is it one that was supplied to you by Intel?

               

              Regarding your error:
              In order for dkms to work, it needs matching linux-headers in /usr/src
                apt-get install linux-headers-$(uname -r)
              Should pull the linux-headers for the running kernel.

              This should fix the install script error.
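
A minimal sketch for checking this before rerunning the install script. It assumes the Ubuntu convention that headers live in /usr/src/linux-headers-<release>; the function name and the optional second argument are hypothetical.

```shell
#!/bin/sh
# Prints the header directory expected for a kernel release and
# succeeds only if that directory exists.
# $1: kernel release, e.g. the output of `uname -r`
# $2: source root (assumption, for testing); defaults to /usr/src
headers_present() {
    release="$1"
    src_root="${2:-/usr/src}"
    dir="$src_root/linux-headers-$release"
    echo "$dir"
    [ -d "$dir" ]
}

# Usage on the MCPC:
#   headers_present "$(uname -r)" || echo "headers for the running kernel are missing"
```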


              If things don't progress, also feel free to post the problem on
              http://marcbug.scc-dc.com/bugzilla3/

               

              It might take a few tries to get 1.4.2.2 up and running.  But it’s not difficult once the configuration files are set up correctly.

              Not related, but something to look at when upgrading to 1.4.x is the switch.  Make sure you are using a 1 Gbit switch for the EMAC and NIC ports.  Most customers use a 5-port 1 Gbit switch.  Most have one sitting around, or any electronics store will carry one.  Any 1 Gbit switch should work.

              • 4. Re: crbif driver troubles
                ms705

                Hi augie,

                 

                The MCPC is one that was supplied by Intel. We were trying to get 1.3.0 back into a known-good working state (i.e. with a working PCIe link) before doing the upgrade -- is this necessary, or can we just go ahead and do the upgrade?

                 

                I managed to build and install the module now, but the effect is still the same -- no "c148" device on the PCIe bus, even after loading the module.

                 

                Thanks again, and best wishes!

                Malte

                • 5. Re: crbif driver troubles
                  mwaughex


                  Hi Malte,

                  You could go ahead with the 1.4.2.2 upgrade, but you are right, it would make troubleshooting more difficult, since you are adding a lot of changes to your config files and even your network connections.

                   

                  If you have created backups of your config files on the MCPC, it would probably make more sense to go back to 1.3.0, confirm all is well, and then upgrade to 1.4.2.2.

                   

                  I am wondering at what point you started having problems?  Was someone else involved, and you 'inherited' these problems?

                   

                  On the MCPC, both LEDs on the PCIe card should be green. There is also a light on the SCC board that will light up when you have a good PCIe connection to the MCPC.  I just look at the back, and you can see the light shining on the memory DIMM.  Otherwise you would have to open the case.

                   

                  I also use the command:

                  # dmidecode -t 9

                  If the last entry says:

                  Handle 0x0020, DMI type 9, 13 bytes

                  System Slot Information

                          Designation: PCI-E SLOT6

                          Type: x16 PCI Express

                          Current Usage: In Use

                          Length: Long

                          ID: 6

                          Characteristics:

                                  3.3 V is provided

                                  PME signal is supported

                   

                  Then the SCC should have a connection to the MCPC PCIe card at the 'BIOS' level.  If it says

                            Current Usage: Available

                  Then you have no connectivity to the PCIe card.

                  Be sure you 1) turn on the SCC, 2) power on the chip (via telnet or the rocker switch on the front of the SCC), and then 3) boot the MCPC.

                  If the chip is not turned on (green system light on front panel) before you boot the MCPC, you will not gain connectivity to the MCPC via the PCIe cable.  Without connectivity the crbif driver will not load.
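
The slot check above can be sketched as a small filter. It assumes the `dmidecode -t 9` layout quoted in this post, with the slot of interest as the last entry; the function name is hypothetical.

```shell
#!/bin/sh
# Prints the "Current Usage" value of the last slot entry found in
# `dmidecode -t 9` output read from stdin.
last_slot_usage() {
    awk -F': *' '/Current Usage:/ { usage = $2 } END { print usage }'
}

# Usage (as root on the MCPC):
#   dmidecode -t 9 | last_slot_usage
# "In Use" suggests the PCIe link was detected at boot; "Available" suggests it was not.
```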

                   

                  -augie

                  • 6. Re: crbif driver troubles
                    ms705

                    Hi augie,

                     

                    Thanks, this information helped me solve the issue. The PCIe connection was indeed broken, but the correct reboot sequence fixed it. We're almost back to a good state now, except that the BMC appears to be unreachable from the MCPC. Neither sccKit nor direct telnet can access it, and pinging its IP just yields no response. (We know that the IP address is correct, as it worked previously -- or does this get reset when everything is powered down?)

                     

                    Am I correct in assuming that I can use the physical "Reset BMC" button to hopefully get it back into a working state? If not, would following the BMC upgrade instructions for v1.06 help to get it back?

                     

                    Thanks and best wishes,

                    Malte

                    • 7. Re: crbif driver troubles
                      mwaughex

                      Hello Malte,

                      Good to see you are making progress.

                      I have not used the reset button before.  I can run a test there in the MARC lab.  My thoughts are that if it does reset the IP, the default should be 192.168.2.127

                      So you would need to make sure you have a NIC on the same subnet.

                      So you could add something like this to your MCPC /etc/network/interfaces file.

                           auto eth1:2

                           iface eth1:2 inet static

                           address 192.168.2.125

                           netmask 255.255.255.0

                       

                      Also check that your BMC connection (the one closest to the power supply) has a network link light, and that your MCPC's NICs have link lights too.

                       

                      What IP is in the /opt/sccKit/systemSettings.ini file?  The CRBServer= should list the IP for the BMC.

                      [General]

                      CRBServer=10.3.16.162:5010

                      memorySize=8

                      platform=RockyLake

                      maxTransId=64

                       

                      It might be that your interfaces file does not have an IP in the BMC's subnet, or that the cable is bad.

                      I am not sure if you are connected directly from MCPC to BMC, or using a switch.  If you are going directly to the BMC with a cable, make sure the correct eth port is connecting to BMC.
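
A minimal sketch of the subnet check described above: pull the BMC address out of the CRBServer= line and compare it with an MCPC address under a netmask. The function names are hypothetical, and 64-bit shell arithmetic is assumed.

```shell
#!/bin/sh
# Prints the host part of the CRBServer= line from systemSettings.ini text on stdin.
crb_server_ip() {
    sed -n 's/^[[:space:]]*CRBServer=\([^:]*\).*$/\1/p'
}

# Converts a dotted-quad IPv4 address to a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeeds if both addresses are in the same network under the given netmask.
same_subnet() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
}

# Usage on the MCPC (192.168.2.125 is the example alias address from this post):
#   bmc=$(crb_server_ip < /opt/sccKit/systemSettings.ini)
#   same_subnet 192.168.2.125 "$bmc" 255.255.255.0 || echo "BMC is on a different subnet"
```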

                      -augie

                      • 8. Re: crbif driver troubles
                        ms705

                        Hi augie,

                         

                        Ah, I am just back from the machine room -- the BMC issue was indeed easily resolved by pushing the reset button and restarting everything. After that, and after adding a GBit switch (turns out the previous switch was slightly faulty as well, and would drop packets when talking to the BMC!), the upgrade went smoothly.

                         

                        Everything works now, apart from the fact that I do not get connectivity to the booted cores once I have started Linux on them; pings time out and it seems that the GBit port I used (A, which is active according to the "Usable GB ETH" message in the BMC) only operates at 100 MBit (the right hand side LED is on, but not the left hand side one). It does appear to send bursts of packets, but I am not receiving anything on the MCPC. BMC communication through the same switch works without problem.

                         

                        Is there anything else I need to do after completing all the upgrade steps before network access to the cores will work? I tried both hostnames (rck00) and IP (192.168.3.1 in our case), and I updated all configuration files (including systemSettings.ini in /opt/sccKit) accordingly.

                         

                        Thanks a lot for your help!

                        Malte

                        • 9. Re: crbif driver troubles
                          mwaughex

                          This is good news, as far as getting all the configuration files changed.

                          Just to confirm a few things.

                          You do not receive any errors running sccBmc -i or sccBoot -l?

                          You are running sccKit 1.4.2.2?

                           

                          If you issue
                          sccBmc -c set | grep FPGA

                          What bitstream does it list?  I would hope it will look like this:

                             Default FPGA bitstream: /mnt/flash4/rl_20110624_ab.bit

                          It is possible that an older version of the 1.4.x bitstream is still being used, since their names are the same.  Make sure you run the install script and increment the update.txt file.  I use the current date, that way you are sure to have a larger number in the update.txt file.

                          cat /opt/sccKit/current/firmware/RockyLake/update/update.txt

                          20120816001
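
A minimal sketch of the date-based increment. It assumes the stamp format is YYYYMMDD followed by a three-digit counter, which is inferred from the example value above and not confirmed by sccKit; the function name is hypothetical.

```shell
#!/bin/sh
# Prints a stamp strictly larger than the current one:
# today's date followed by 001 if that is larger, otherwise current + 1.
# $1: current stamp, e.g. 20120816001
# $2: date as YYYYMMDD (assumption, for testing); defaults to today
next_stamp() {
    current="$1"
    today="${2:-$(date +%Y%m%d)}"
    candidate="${today}001"
    if [ "$candidate" -gt "$current" ]; then
        echo "$candidate"
    else
        echo $(( current + 1 ))
    fi
}

# Usage (as root), writing via a temporary file:
#   f=/opt/sccKit/current/firmware/RockyLake/update/update.txt
#   next_stamp "$(cat "$f")" > "$f.new" && mv "$f.new" "$f"
```

Falling back to current + 1 keeps the number increasing even when the script is run twice on the same day.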

                           

                          I have seen this exact problem with an older 1.4.x bitstream.  The cores boot with no errors, but cannot be accessed.

                           

                          Also if you issue a route command what is the output?

                           

                          We may want to file an 'admin' bug on http://marcbug.scc-dc.com/bugzilla3/

                          Bugzilla is a little easier to work with regarding uploading config files, etc.

                          -augie

                          • 10. Re: crbif driver troubles
                            ms705

                            I'll make a bug to correspond to this thread; should I use the "MARC administration needed" section? (Asking as that section appears to be concerned with the SCC DC, but this is not a DC system).

                             

                            No errors on sccBmc -i or sccBoot -l, and we are running sccKit 1.4.2.2 now. Bitstream is also correct (/mnt/flash4/rl_20110624_ab.bit), and I've made sure that the upgrade script ran.

                             

                            route gives:

                            ms705@mcpc:~$ route

                            Kernel IP routing table

                            Destination     Gateway         Genmask         Flags Metric Ref    Use Iface

                            192.168.3.0     *               255.255.255.0   U     0      0        0 eth1

                            192.168.2.0     *               255.255.255.0   U     0      0        0 eth1

                            128.232.0.0     *               255.255.240.0   U     0      0        0 eth0

                            link-local      *               255.255.0.0     U     1000   0        0 eth0

                            default         route.cl.cam.ac 0.0.0.0         UG    100    0        0 eth0

                             

                            ... which looks correct to me.

                             

                            When I boot up the cores and keep the performance dashboard from sccGui open at the same time, I see how the CPU utilization goes up on all cores, and then after a while stabilizes near 0 again. This presumably suggests that the cores booted up?

                             

                            Also, issuing the arp command on the MCPC gives this:

                            ms705@mcpc:~$ arp

                            Address                  HWtype  HWaddress           Flags Mask            Iface

                            rck45.ex.rck.net                 (incomplete)                              eth1

                            rck43.ex.rck.net                 (incomplete)                              eth1

                            rck16.ex.rck.net                 (incomplete)                              eth1

                            rck30.ex.rck.net                 (incomplete)                              eth1

                            resolv1.cl.cam.ac.uk     ether   00:16:3e:e8:01:02   C                     eth0

                            ntp1c.cl.cam.ac.uk       ether   00:0a:42:cf:68:0a   C                     eth0

                            rck05.ex.rck.net                 (incomplete)                              eth1

                            rck42.ex.rck.net                 (incomplete)                              eth1

                            rck13.ex.rck.net                 (incomplete)                              eth1

                            rck03.ex.rck.net                 (incomplete)                              eth1

                            rck32.ex.rck.net                 (incomplete)                              eth1

                            rck07.ex.rck.net                 (incomplete)                              eth1

                            rck02.ex.rck.net                 (incomplete)                              eth1

                            [...]

                             

                            (with entries corresponding to all cores, but all with "incomplete" MAC addresses).
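
For repeated checks, the number of unresolved entries can be counted with a small filter. This is a sketch that assumes the `arp` table layout shown above; the function name is hypothetical.

```shell
#!/bin/sh
# Counts entries in `arp` output (stdin) that are marked "(incomplete)"
# on the given interface.
count_incomplete() {
    awk -v ifc="$1" '$2 == "(incomplete)" && $NF == ifc { n++ } END { print n + 0 }'
}

# Usage on the MCPC:
#   arp | count_incomplete eth1
```

A count that stays equal to the number of booted cores would mean no core has answered an ARP request on that interface.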

                             

                            Any ideas? :-S

                            Malte

                            • 11. Re: crbif driver troubles
                              ms705

                              A little more insight: I followed the (excellent) instructions here to set up a serial console, and booted up a core with that attached.

                               

                              The result I am getting confirms that the cores boot up fine, but also includes this:

                               

                              [...]

                              Starting network...                                                            

                              Configuring on-chip network: mb0 (192.168.253.1 R:192.168.253.1)               

                              Configuring host network: emac0 (192.168.3.1)                                  

                              route: SIOCADDRT: Network is unreachable                                       

                              mount: rckhost: Host name lookup failure                                       

                              mount: mounting rckhost:/shared on /shared failed                              

                              Starting rhid: OK                                     

                              [...]                        

                               

                              ... which suggests that the cores believe there to be no network connection.

                               

                              Edit: I managed to log in using the serial console, and found that all interfaces are up on the cores, and I can contact other cores via the on-chip network, but I can't get through to the host, despite the cable being plugged into port A.

                               

                              Nonetheless, maybe this helps shed some more light on the issue?

                               

                              Cheers,

                              Malte

                               

                              Message was edited by: ms705

                              • 12. Re: crbif driver troubles
                                mwaughex

                                Hello Malte,

                                Yes, marcbug with MARC administration needed would work.

                                 

                                I would like to see the contents of:

                                /opt/sccKit/systemSettings.ini

                                /etc/network/interfaces

                                /etc/resolv.conf

                                /etc/bind/

                                     XX.168.192.zone

                                     ex.rck.zone

                                     in.rck.zone

                                     named.conf.local

                                     named.conf.options

                                 

                                Also, a few things to take a look at:

                                named.conf.options

                                   Make sure the forwarders IP is a true Domain Name Server, and is accessible.

                                 

                                Also your /etc/hosts file should include something like below.  (Your IPs will be different, probably 192.168.3.1 to 192.168.3.48?)

                                 

                                127.0.0.1       localhost
                                127.0.1.1       marc004
                                192.168.24.1       rck00
                                192.168.24.2       rck01
                                192.168.24.3       rck02
                                192.168.24.4       rck03
                                192.168.24.5       rck04
                                ....Through
                                192.168.24.46      rck45
                                192.168.24.47      rck46
                                192.168.24.48      rck47
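
A minimal sketch for generating the rck00 to rck47 entries for a given subnet prefix. It assumes the mapping shown above (rckNN gets host number NN + 1); the function name is hypothetical, and the output should be reviewed before it is merged into /etc/hosts.

```shell
#!/bin/sh
# Prints 48 hosts-file lines mapping rck00..rck47 to <prefix>.1..<prefix>.48.
# $1: first three octets of the core subnet, e.g. 192.168.3
rck_hosts() {
    prefix="$1"
    i=0
    while [ "$i" -lt 48 ]; do
        printf '%s.%d       rck%02d\n' "$prefix" "$(( i + 1 ))" "$i"
        i=$(( i + 1 ))
    done
}

# Usage:
#   rck_hosts 192.168.3
```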

                                 

                                It sounds like a configuration problem, but until I review the above configurations, I am not sure which one is causing the problem.

                                 

                                Augie