4 Replies Latest reply on Sep 29, 2011 3:57 PM by Kostrov

    A problem related to floating point arithmetic on Intel's FPUs? Or, something else?

    Kostrov

      Hello everybody,

       

      I detected a problem that seems related to floating point arithmetic on Intel's FPUs ( Floating Point Units ). Or, it could be

      related to something else and I simply missed it.

       

      In general, I'm very surprised by my results and I wonder if Intel's engineers have any idea what could

      possibly be wrong.

       

      Best regards,

      Sergey

       

      //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

       

      Let's say two square 8x8 matrices need to be multiplied. Here are examples of these test matrices:

       

      // Matrix A - 8x8 - 'float' type:

       

        101.0   201.0   301.0   401.0   501.0   601.0   701.0   801.0
        901.0  1001.0  1101.0  1201.0  1301.0  1401.0  1501.0  1601.0
       1701.0  1801.0  1901.0  2001.0  2101.0  2201.0  2301.0  2401.0
       2501.0  2601.0  2701.0  2801.0  2901.0  3001.0  3101.0  3201.0
       3301.0  3401.0  3501.0  3601.0  3701.0  3801.0  3901.0  4001.0
       4101.0  4201.0  4301.0  4401.0  4501.0  4601.0  4701.0  4801.0
       4901.0  5001.0  5101.0  5201.0  5301.0  5401.0  5501.0  5601.0
       5701.0  5801.0  5901.0  6001.0  6101.0  6201.0  6301.0  6401.0

       

      // Matrix B - 8x8 - 'float' type:

       

        101.0   201.0   301.0   401.0   501.0   601.0   701.0   801.0
        901.0  1001.0  1101.0  1201.0  1301.0  1401.0  1501.0  1601.0
       1701.0  1801.0  1901.0  2001.0  2101.0  2201.0  2301.0  2401.0
       2501.0  2601.0  2701.0  2801.0  2901.0  3001.0  3101.0  3201.0
       3301.0  3401.0  3501.0  3601.0  3701.0  3801.0  3901.0  4001.0
       4101.0  4201.0  4301.0  4401.0  4501.0  4601.0  4701.0  4801.0
       4901.0  5001.0  5101.0  5201.0  5301.0  5401.0  5501.0  5601.0
       5701.0  5801.0  5901.0  6001.0  6101.0  6201.0  6301.0  6401.0

       

      In two Test-Cases the classic 3-loop matrix multiplication algorithm is used, with Rolled and

      Unrolled loops:

      //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
      // Test-Case 1 - Rolled Loops - 1-in-1

       

      for( i = 0; i < 8; i++ )
      {
          for( j = 0; j < 8; j++ )
          {
             fC[i][j] = 0.0f;
             for( k = 0; k < 8; k++ )
             {
                fC[i][j] += ( fA[i][k] * fB[k][j] );
             }
          }
      }

       

      // Matrix C = Matrix A * Matrix B - 8x8 - 'float' type

       

        13826808.0   14187608.0   14548408.0   14909208.0   15270008.0   15630808.0   15991608.0   16352408.0
        32393208.0   33394008.0   34394808.0   35395608.0   36396408.0   37397208.0   38398008.0   39398808.0
       >50959604.0<  52600404.0   54241204.0   55882004.0   57522804.0   59163604.0   60804404.0   62445204.0
        69526008.0   71806808.0   74087608.0   76368408.0   78649208.0   80930008.0   83210808.0   85491608.0
        88092408.0   91013208.0   93934008.0   96854808.0   99775608.0  102696408.0  105617208.0  108538008.0
       106658808.0  110219608.0  113780408.0  117341208.0  120902008.0  124462808.0  128023608.0  131584408.0
       125225208.0  129426008.0  133626808.0  137827616.0  142028400.0  146229216.0  150430000.0  154630816.0
       143791600.0  148632416.0  153473200.0  158314016.0  163154800.0  167995616.0  172836416.0  177677200.0

       

      I marked the incorrect value as '>50959604.0<'. For some reason it is NOT '50959608.0'. There are also a few more

      incorrect values ( the ones ending in '...00' or '...16' ), because the last two digits of every value in Matrix C

      must be '08'. They can't be '...00' or '...16'.

       

      //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
      // Test-Case 2 - Unrolled Loops - 4-in-1

       

      for( i = 0; i < 8; i++ )
      {
          for( j = 0; j < 8; j++ )
          {
             fC[i][j] = 0.0f;
             for( k = 0; k < 8; k += 4 )
             {
                fC[i][j] += ( fA[i][k  ] * fB[k  ][j] ) +
                            ( fA[i][k+1] * fB[k+1][j] ) +
                            ( fA[i][k+2] * fB[k+2][j] ) +
                            ( fA[i][k+3] * fB[k+3][j] );
             }
          }
      }

       

      // Matrix C = Matrix A * Matrix B - 8x8 - 'float' type

       

        13826808.0   14187608.0   14548408.0   14909208.0   15270008.0   15630808.0   15991608.0   16352408.0
        32393208.0   33394008.0   34394808.0   35395608.0   36396408.0   37397208.0   38398008.0   39398808.0
        50959608.0   52600408.0   54241208.0   55882008.0   57522808.0   59163608.0   60804408.0   62445208.0
        69526008.0   71806808.0   74087608.0   76368408.0   78649208.0   80930008.0   83210808.0   85491608.0
        88092408.0   91013208.0   93934008.0   96854808.0   99775608.0  102696408.0  105617208.0  108538008.0
       106658808.0  110219608.0  113780408.0  117341208.0  120902008.0  124462808.0  128023608.0  131584408.0
       125225208.0  129426008.0  133626808.0  137827616.0  142028416.0  146229216.0  150430016.0  154630816.0
      >143791616.0< 148632416.0  153473216.0  158314016.0  163154816.0  167995616.0  172836416.0  177677216.0

       

      I marked the incorrect value as '>143791616.0<'. For some reason it is NOT '143791608.0'. There are also a few more

      incorrect values ( the ones ending in '...00' or '...16' ), because the last two digits of every value in Matrix C

      must be '08'. They can't be '...00' or '...16'.

       

      //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
      // Test-Case 3 - Online Matrix Multiplication calculator:

      // www.bluebit.gr/matrix-calculator/multiply.aspx

       

      // Matrix C = Matrix A * Matrix B - 8x8 - 'float' type

       

        13826808.0   14187608.0   14548408.0   14909208.0   15270008.0   15630808.0   15991608.0   16352408.0
        32393208.0   33394008.0   34394808.0   35395608.0   36396408.0   37397208.0   38398008.0   39398808.0
        50959608.0   52600408.0   54241208.0   55882008.0   57522808.0   59163608.0   60804408.0   62445208.0
        69526008.0   71806808.0   74087608.0   76368408.0   78649208.0   80930008.0   83210808.0   85491608.0
        88092408.0   91013208.0   93934008.0   96854808.0   99775608.0  102696408.0  105617208.0  108538008.0
       106658808.0  110219608.0  113780408.0  117341208.0  120902008.0  124462808.0  128023608.0  131584408.0
       125225208.0  129426008.0  133626808.0  137827608.0  142028408.0  146229208.0  150430008.0  154630808.0
       143791608.0  148632408.0  153473208.0  158314008.0  163154808.0  167995608.0  172836408.0  177677208.0

       

      As you can see, Matrix C has all correct values; that is, the last two digits of every value in Matrix C are '...08'.

       

      //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

      So, if you're a C/C++ software developer, could you verify the results of matrix multiplication with the classic

      algorithm, using the same values for Matrices A and B? Personally, I'm VERY puzzled by my results.

       

      //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

      Here is additional information about a test computer where all calculations were done:

       

      >> TechInfo - Part 1 <<

         Operating system: Microsoft Windows XP Version 2002 SP3 ( Up-to-date regarding Microsoft's security updates )

       

      >> TechInfo - Part 2 <<

         Basic CPUID Information 0
                 CPUInfo[0] = 0x00000002
                 CPUInfo[1] = 0x756E6547
                 CPUInfo[2] = 0x6C65746E
                 CPUInfo[3] = 0x49656E69
         Basic CPUID Information 1
                 CPUInfo[0] = 0x00000F12
                 CPUInfo[1] = 0x00010808
                 CPUInfo[2] = 0x00000000
                 CPUInfo[3] = 0x3FEBFBFF
         Basic CPUID Information 2
                 CPUInfo[0] = 0x665B5001
                 CPUInfo[1] = 0x00000000
                 CPUInfo[2] = 0x00000000
                 CPUInfo[3] = 0x007A7040
         Extended Function CPUID Information 80000000
                 CPUInfo[0] = 0x80000004
                 CPUInfo[1] = 0x00000000
                 CPUInfo[2] = 0x00000000
                 CPUInfo[3] = 0x00000000
         Extended Function CPUID Information 80000001
                 CPUInfo[0] = 0x00000000
                 CPUInfo[1] = 0x00000000
                 CPUInfo[2] = 0x00000000
                 CPUInfo[3] = 0x00000000
         Extended Function CPUID Information 80000002
                 CPUInfo[0] = 0x20202020
                 CPUInfo[1] = 0x20202020
                 CPUInfo[2] = 0x20202020
                 CPUInfo[3] = 0x6E492020
         Extended Function CPUID Information 80000003
                 CPUInfo[0] = 0x286C6574
                 CPUInfo[1] = 0x50202952
                 CPUInfo[2] = 0x69746E65
                 CPUInfo[3] = 0x52286D75
         Extended Function CPUID Information 80000004
                 CPUInfo[0] = 0x20342029
                 CPUInfo[1] = 0x20555043
                 CPUInfo[2] = 0x30362E31
                 CPUInfo[3] = 0x007A4847
                 CPU Brand String:               Intel(R) Pentium(R) 4 CPU 1.60GHz
                 CPU Vendor      : GenuineIntel
                 Stepping ID = 2
                 Model = 1
                 Family = 15
                 Brand Index = 8
                 CLFLUSH Cache Line Size = 64
         The following features are supported:
                 FPU  - Floating Point Unit On Chip
                 VME  - Virtual 8086 Mode Enhancement
                 DE   - Debugging Extensions
                 PSE  - Page Size Extensions
                 TSC  - Time Stamp Counter
                 MSR  - Model Specific Registers RDMSR and WRMSR Instructions
                 PAE  - Physical Address Extensions
                 MCE  - Machine Check Exception
                 CX8  - CMPXCHG8B Instruction
                 APIC - APIC On Chip
                 SEP  - SYSENTER and SYSEXIT
                 MTRR - Memory Type Range Registers
                 PGE  - PTE Global Bit
                 MCA  - Machine Check Architecture
                 CMOV - Conditional Move Instructions
                 PAT  - Page Attribute Table
                 PSE36- 36-bit Page Size Extension
                 CLFSH- CLFLUSH Instruction
                 DS   - Debug Store
                 ACPI - Thermal Monitor and Software Controlled Clock Facilities
                 MMX  - Intel MMX Technology
                 FXSR - FXSAVE and FXRSTOR Instructions
                 SSE  - SSE Extensions
                 SSE2 - SSE2 Extensions
                 SS   - Self Snoop
                 HTT  - Hyper-Threading Technology
                 TM   - Thermal Monitor
         Enhanced Intel SpeedStep Technology - Unsupported

       

      >> TechInfo - Part 3 <<

         CPU Vendor: GenuineIntel
         HTT and Streaming SIMD Extensions features:
         HTT   : 1
         MMX   : 1
         SSE2  : 1
         SSE3  : 0
         SSSE3 : 0
         SSE4.1: 0
         SSE4.2: 0

       

      >> TechInfo - Part 4 <<

         I reproduced the same problem on 3 computers with the following Intel CPUs:

       

         - Pentium 4
         - Pentium II-300
         - Atom N270

       

      >> TechInfo - Part 5 <<

         I reproduced the same problem with the following C++ compilers:

       

         Microsoft Visual Studio 2005 ( Version 8.0.50727.762 SP.050727-7800 )

         Borland C++ v5.5.1
         MinGW v3.4.2 ( GNU C++ version 3.4.2 (mingw-special) )
         Borland Turbo C++ Version 3.0.0

       

      >> TechInfo - Part 6 <<

         I reproduced the same problem with an application compiled with
         Turbo C++ Version 3.00 for the MS-DOS operating system and executed
         on the same computer.

       

         Note: MS-DOS was loaded from a 3.5" floppy disk

        • 1. Re: A problem related to floating point arithmetic on Intel's FPUs? Or, something else?
          MillerRD

          Sergey,

           

          You used the 'float' data type, which is single precision and offers only six to seven significant decimal digits. You need double precision for the number of significant digits used in your example. In double precision, you will see correct results.

           

          Robert Miller

          • 2. Re: A problem related to floating point arithmetic on Intel's FPUs? Or, something else?
            Kostrov

            If I change to a 32-bit 'int', for example, I also get a correct Matrix C:

             

             13826808   14187608   14548408   14909208   15270008   15630808   15991608   16352408
             32393208   33394008   34394808   35395608   36396408   37397208   38398008   39398808
             50959608   52600408   54241208   55882008   57522808   59163608   60804408   62445208
             69526008   71806808   74087608   76368408   78649208   80930008   83210808   85491608
             88092408   91013208   93934008   96854808   99775608  102696408  105617208  108538008
            106658808  110219608  113780408  117341208  120902008  124462808  128023608  131584408
            125225208  129426008  133626808  137827608  142028408  146229208  150430008  154630808
            143791608  148632408  153473208  158314008  163154808  167995608  172836408  177677208

             

            This is not about changing the data type from 'float' to 'double'. I need to understand why precision is lost for

            different values in Matrix C in the 'Rolled' and 'Unrolled' cases.

             

            In my Test-Cases I use 8x8 matrices, that is, very small matrices. In real life a customer

            could use a matrix with a size of up to 8192x8192. If declared as 'double', an application will

            use significantly more memory. The problem then escalates, because Strassen's algorithm for matrix

            multiplication needs to be used in order to speed up calculations, and that creates a further impact on memory.

             

            What I can see is that with the 'float' type I have reached the precision limit defined in the IEEE 754 standard.

             

            I also looked at the assembly code generated by Visual C++, and I can confirm that the FPU register ST0 gets the value '50959608.0' ( 1st Test-Case ).

            After that, the variable fC[2][0] gets '50959604.0'.

             

            Thank you for your response!

            • 3. Re: A problem related to floating point arithmetic on Intel's FPUs? Or, something else?
              DoctorBinary

              Your inputs are all integers, and floats can represent them exactly because they are less than or equal to 2^24-1 = 16777215. Your results are all integers, but they are too big to be represented exactly (in general) in floats. That's the first problem.

               

              Your intermediate computations are being done in the FPU in double (or extended) precision (unless you changed the processor precision mode to 24-bit). As doubles, your results are exact integers. But intermediate computations can be "flushed" back to your float variables, at which point they will be rounded. How long your values are kept in the FPU before being flushed back depends on what instructions the compiler generates; certainly your different algorithms will generate different code -- and potentially different results. That's the second problem.

               

              (Please read my article http://www.exploringbinary.com/when-floats-dont-behave-like-floats/ for more information.)

              • 4. Re: A problem related to floating point arithmetic on Intel's FPUs? Or, something else?
                Kostrov

                Thank you to everybody who responded to my initial post.

                 

                I've just completed the implementation of two prototypes of a Precision Loss Detection ( PLD ) subsystem.

                Neither prototype uses exceptions ( C++ or SEH ).

                 

                For example, for the 'float' data type, detection works when a result is greater than '16777216.0':

 

                4096.0 * 4096.0 = 16777216.0 - No precision loss - Not detected
                4097.0 * 4097.0 = 16785409.0 - Precision loss    - Detected

                 

                The first prototype is extremely simple: it "monitors" the results of all calculations ( additions and multiplications ).

                It is absolutely portable, but it affects performance.

                 

                The second prototype uses the CRT functions 'signal(...)', 'setjmp(...)' and 'longjmp(...)'; I coded it as 'SSL'.

                But some modern C++ compilers fail to fully support it, and it can't be used on the project. Here is some information:

 

                MS Visual C++    : SSL-based PLD didn't work
                Borland C++ v5.x : SSL-based PLD didn't work
                MinGW v3.4.2     : SSL-based PLD didn't work
                Turbo C++ 3.x    : SSL-based PLD worked absolutely fine!

                 

                PS: Unfortunately, due to project constraints, C++ and SEH exceptions can't be used at all.