What you should know about floating-point

Author: Wolfgang Meincke | Date: November 29, 2018

  • Floating-point code becomes increasingly popular
  • Despite all advantages, floating-point mathematics also brings new challenges compared to fixed-point
  • This article shows some interesting facts that you might not have known

Floating-point becomes more and more common in today’s automotive industry. Since development is often done model-based using tools such as Simulink and dSPACE TargetLink, the model serves as an additional abstraction layer and therefore is an additional stage in the development and testing process. In comparison to fixed-point, floating-point comes up with several advantages. The biggest advantage, from most developers’ point of view, is to get rid of the scaling task. There is also a better detection of infinite and NaN (not a number), represented by special values. Beyond that, it seems that the precision increases and almost enables to represent the physical behavior (let’s see if that’s true). Finally, the range of representable values is much larger (which also comes with a downside).

So, looking at all the advantages, why not switch everything to floating-point?

Consider the following characteristics, before you decide to use floating-point:


Let’s say there is an unsigned fixed-point integer with a scaling (=resolution) of 0.01. Based on this assumption, the calculation 0.01 + 0.01 will naturally result in exactly 0.02. Assigning the same calculation using floating-point (single precision), 0.01 + 0.01 will result in 0.0199999995529651641845703125. It’s close to 0.02 but not the exact value. If we have a look at the value 0.01, we can see that it is neither represented in an exact way. This becomes clear if we look how the value 0.01 is really represented in floating-point. A floating-point value is made up of 3 elements; 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa totaling 32 bits. To represent a certain number, the mantissa represents a certain value with the exponent. In the case of 0.01, it looks like this:

  • Sign:
  • Exponent:
  • Mantissa:

0111 1000
0100 0111 1010 1110 0001 010

     positive number

If we calculate 1.2799999713897705 * 2-7 it results in 0.00999999977648258209228515625, the closest value in float to 0.01. This relative error due to rounding, is called the machine epsilon.

The conclusion is, that if you need an exact representation of values for things like loop counters, floating-point is not the right choice.

Another important aspect of inaccuracy becomes visible with larger values. With a float (single precision), the number 16.777.217 cannot be represented and depending on the rounding mode of the compiler, instead, this will lead to 16.777.216 or 16.777.218. For larger numbers, the gap between each representable value is even larger.

To summarize, floating-point variables do not bring more precision compared to fixed point. However, they bring a precision which dynamically changes across the value range; small numbers have high precision, large numbers have less precision. If your variables stay inside a clearly defined value range, similar (or even better) precision is possible with a constant scaling in fixed-point.


Being aware of the discoveries in the previous section, there is also an effect on comparisons. There are different situations, where comparisons are done.

In these use cases, an exact comparison is needed, otherwise, this might lead to unintended behavior. In other use cases like greater/less comparisons, the floating-point precision can be suitable. However, if an exact and transparent behavior is intended, the inaccuracy might lead to a smaller or greater value than expected as a result of a calculation and therefore the wrong decision is made. For these use cases, the usage of fixed-point is the first choice.

Influence of the compiler

Using floating-point, the compiler has a much higher impact on the behavior of the code compared to fixed-point. This becomes more important for model-based development, where usually three stages of implementation are taken into account. These are model-in-the-loop (MIL), software-in-the-loop (SIL) and processor-in-the-loop (PIL, target object code). MIL usually represents the values with double precision, while most of the target processors in embedded software are based on 32-bit, so the SIL implementation is typically done in single precision. Based on the data type used, this might lead to different behavior in certain situations on MIL and SIL/PIL. In addition, there are three main influence factors that you should be aware of:


Let’s assume the following code:

double v = 1 x 10308;
double x = (v * v) / v; 

The expected result of the second line of code is +∞. Even though this might be correct, sometimes the result can be x = v, if the processor is able to internally calculate with 80-bit precision. This can have an effect on development and testing on different implementation levels.


Algebraic laws do not hold in general in finite precision arithmetic like

  • Associativity:
  • Distributivity:
  • Zero folding:
  • Multiply be reciprocal:

((a + b) + c = a + (b + c))
(a *(b + c) = a * b + a * c)
(x * 0 = 0)
(a / b = a *(1 / b))

Although many compilers give the opportunity to control compiler optimizations, such value-changing compiler optimizations can be enabled by default to improve efficiency and/or reduce size like for the Intel® Compiler.


If the number of available digits is not sufficient to represent a real number precisely, rounding is used to represent a close value

double a = 0.1;
double x = (a * -a) * (a * a); 

The result for x in the second line of code depends on the used rounding mode:

  • to nearest, roundTiesToEven
  • roundTowardPositive
  • roundTowardNegative
  • roundTowardZero

This means that depending on the compiler settings and the capabilities of the underlying CPU, the results of calculations might be different between different compilers and CPUs. This can even lead to slightly different code coverage results between SIL and PIL.


To answer the question of why not switching everything to floating-point, there is a wide range of use cases, that will improve and get easier by using floating-point variables. However, there are certain use cases where using floating-point might lead to unintended and untransparent behavior. Furthermore, on different implementation levels like MIL, SIL and PIL the results of floating-point arithmetic might be different due to different settings for precision, optimization and rounding. This makes testing of the target code (PIL) much more important than in fixed-point arithmetic. To deal with these new challenges and to stay ISO 26262 compliant, the test process should consider a structural Back-to-Back test besides the Requirements-based Testing. This should be done between the reference implementation (usually MIL) and the target code (PIL), if available or SIL otherwise.

The Author:

Wolfgang Meincke studied Computer Science at the University Ravensburg-Weingarten where he graduated in 2006. He then worked at EWE TEL GmbH where he was responsible for requirements engineering and project management for software development projects as well as agile software development processes. In 2014 he joined BTC as senior pilot engineer to support the customers to integrate the tools into their development processes as well as giving support and training on test methodologies regarding the ISO 26260 standard. One of his main areas of interest is the formalization of safety requirements and their verification based on formal testing and model-checking technologies for unit test, integration test and HIL real-time-testing.