Floating Point Precision and Binary 32 or, Arduino Don’t Know Math

Did you know?

0.1 + 0.2 = 0.30000001

Try this simple Arduino program to prove it:

void setup() {
  float f;
  char s[12];

  f = 0.1 + 0.2;
  dtostrf(f, 1, 9, s); //convert float to string
  Serial.begin(9600);
  Serial.println(s); 
}

void loop() { }

First, don’t be alarmed, and second, don’t throw your Arduino into the trash thinking it’s defective. It’s working just fine. For comparison, performing this same math on your PC would produce a similar result (see this for a silly example). The reason for this seemingly odd behavior stems from the internal workings of a binary computer.
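To try the PC comparison yourself, here is a minimal desktop C++ sketch (not Arduino code). The helper name `sumAsText` is made up for illustration, and it assumes your compiler's `float` is a 32-bit IEEE 754 type, as it is on mainstream PCs.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the post): add 0.1 and 0.2 as 32-bit floats
// and format the sum with 9 decimal places, similar to dtostrf(f, 1, 9, s).
std::string sumAsText() {
    volatile float a = 0.1f;  // volatile: force the addition to happen at run time
    volatile float b = 0.2f;
    float f = a + b;          // the sum is rounded to single precision
    char s[16];
    int n = std::snprintf(s, sizeof s, "%.9f", f);
    assert(n > 0 && n < static_cast<int>(sizeof s));  // the text fit in the buffer
    (void)n;                  // silence unused warnings when asserts are disabled
    return std::string(s);
}
```

Printing `sumAsText()` on a desktop gives 0.300000012 rather than 0.3. The trailing digits can differ from what an Arduino prints, because the number formatting routines are not the same, but the stored value is the same kind of near miss.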

The Arduino (and your PC) is a binary device. All calculations are reduced to on and off, or 1 and 0. Because of this, numbers are stored in a base-2 numbering system, and all math is performed in binary. Binary can hold fractions, but only as sums of powers of two (1/2, 1/4, 1/8 and so on), so most decimal fractions, 0.1 among them, have no exact finite binary form. To further complicate matters, we humans use a base-10 decimal numbering system. Obviously, we’ll need a process to incorporate fractions in binary and to swap between the base 2 and 10 systems. And this is where the problem lies.
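To see why a decimal fraction can be awkward in base 2, here is a small desktop C++ sketch that writes out the binary digits of a fraction by repeated doubling. The function name `binaryFractionDigits` is my own invention for this illustration, and it works on exact integers so that no floating point rounding sneaks in.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: return the first `bits` binary digits after the point
// of the fraction num/den, where 0 <= num < den.
std::string binaryFractionDigits(unsigned num, unsigned den, int bits) {
    assert(den > 0 && num < den);
    std::string digits;
    for (int i = 0; i < bits; ++i) {
        num *= 2;              // shift the fraction left by one binary place
        if (num >= den) {      // the bit that crossed the point is a 1
            digits += '1';
            num -= den;        // keep only the remaining fraction
        } else {
            digits += '0';
        }
    }
    return digits;
}
```

`binaryFractionDigits(1, 2, 4)` returns "1000" because 1/2 terminates after a single bit, while `binaryFractionDigits(1, 10, 12)` returns "000110011001". For 1/10 the remainder keeps cycling, so the 0011 pattern repeats forever and has to be cut off somewhere.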

The Arduino utilizes a binary floating point representation for decimal numbers. The description of the data type can be found here. Officially it’s called IEEE 754 single precision floating point, and its specification can be found here. But don’t try to read that unless you’re a glutton for punishment. I’ll attempt to simplify.

A floating point number is composed of 2 primary parts, the significand which contains the digits, and the exponent which determines where to place the decimal point. It’s basically scientific notation.

Our significand field is 23 bits wide; together with an implied leading 1 bit that gives 24 bits of precision, or about 7 decimal digits. The exponent is 8 bits, stored with a bias of 127 (the stored value is the true exponent plus 127, which allows for negative exponents), permitting numbers in the range of roughly 10^-38 to 10^38. The leftmost bit is reserved for the sign of the number, and brings the total size of this value to 32 bits. The actual internal representation is called Binary32 and looks like this:

[Figure: Binary32 bit layout, left to right: sign (1 bit), exponent (8 bits), significand (23 bits)]
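The three fields can be pulled apart with shifts and masks. The following is a desktop C++ sketch, not code from the IEEE specification: the names `Binary32Fields`, `splitBinary32` and `valueOf` are hypothetical, and `valueOf` only handles ordinary (normal) numbers, skipping zero, subnormals, infinities and NaN.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical container for the three Binary32 fields.
struct Binary32Fields {
    std::uint32_t sign;      // 1 bit: 0 = positive, 1 = negative
    std::uint32_t exponent;  // 8 bits, stored with a bias of 127
    std::uint32_t fraction;  // 23 significand bits after the implied leading 1
};

Binary32Fields splitBinary32(std::uint32_t bits) {
    Binary32Fields f;
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}

// Rebuild the value of a normal number:
// (-1)^sign * 1.fraction * 2^(exponent - 127)
double valueOf(const Binary32Fields& f) {
    assert(f.exponent != 0 && f.exponent != 255);  // normal numbers only
    // Put the implied 1 back in front of the fraction: a 24-bit integer.
    double significand = static_cast<double>(0x800000u | f.fraction);
    // Scale by the unbiased exponent, minus 23 to undo the integer scaling.
    double magnitude = std::ldexp(significand, static_cast<int>(f.exponent) - 127 - 23);
    return f.sign ? -magnitude : magnitude;
}
```

For example, `splitBinary32(0x3DCCCCCD)` gives sign 0, a stored exponent of 123 (so a true exponent of -4) and a fraction of 0x4CCCCD. Feeding that to `valueOf` returns exactly the value a 32-bit float holds for 0.1, which is close to 0.1 but not equal to it.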

To determine why our Arduino doesn’t add 0.1 and 0.2 properly, we need to examine the internal representation of our floating point values. You can easily see the Binary32 representation of a floating point number by running the following simple program:

void setup(void) {
  union {
    uint32_t B32;
    float Float;
  } floatb32;

  Serial.begin(9600);
  floatb32.Float = 0.1;
  Serial.println(floatb32.B32, HEX); 
  floatb32.Float = 0.2;
  Serial.println(floatb32.B32, HEX); 
  floatb32.Float = 0.3;
  Serial.println(floatb32.B32, HEX); 
}

void loop(void) { }

Here are our floating point numbers encoded into Binary32 (hexadecimal):

0.1 = 3DCCCCCD
0.2 = 3E4CCCCD
0.3 = 3E99999A
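If you want to reproduce these encodings on a PC, one option is to copy the float's bytes into an integer with `memcpy` instead of using a union. This is only a sketch under the assumption that `float` is a 32-bit IEEE 754 type; the helper name `floatToHex` is made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: return the Binary32 encoding of f as 8 hex digits.
std::string floatToHex(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // copy the raw bytes, no numeric conversion
    char s[9];                            // 8 hex digits plus the terminating NUL
    int n = std::snprintf(s, sizeof s, "%08lX", static_cast<unsigned long>(bits));
    assert(n == 8);
    (void)n;                              // silence unused warnings when asserts are disabled
    return std::string(s);
}
```

Calling `floatToHex(0.1f)`, `floatToHex(0.2f)` and `floatToHex(0.3f)` should give the same three values listed above.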

Here are two functions which perform the conversions between floating point and the 32-bit internal representation:

uint32_t ConvertFloatToB32(float f) {
  float normalized;
  int16_t shift;
  uint32_t sign, exponent, significand; //unsigned, so shifting into bit 31 is well defined

  if (f == 0.0) 
    return 0; //handle this special case
  //check sign and begin normalization
  if (f < 0) { 
    sign = 1; 
    normalized = -f; 
  } else { 
    sign = 0; 
    normalized = f; 
  }
  //get normalized form of f and track the exponent
  shift = 0;
  while (normalized >= 2.0) { 
    normalized /= 2.0; 
    shift++; 
  }
  while (normalized < 1.0) { 
    normalized *= 2.0; 
    shift--; 
  }
  normalized = normalized - 1.0;
  //calculate binary form (non-float) of significand, rounding to nearest
  significand = normalized*0x800000 + 0.5f;
  //get biased exponent
  exponent = shift + 0x7f; //shift + bias
  //combine and return
  return (sign<<31) | (exponent<<23) | significand;
}

float ConvertB32ToFloat(uint32_t b32) {
  float result;
  int32_t shift;
  uint16_t bias;

  if (b32 == 0) 
    return 0.0;
  //pull significand
  result = (b32&0x7fffff); //mask significand
  result /= (0x800000);    //convert back to float
  result += 1.0f;          //add one back 
  //deal with the exponent
  bias = 0x7f;
  shift = ((b32>>23)&0xff) - bias;
  while (shift > 0) { 
    result *= 2.0; 
    shift--; 
  }
  while (shift < 0) { 
    result /= 2.0; 
    shift++; 
  }
  //sign
  result *= (b32>>31)&1 ? -1.0 : 1.0;
  return result;
}

void setup(void) {
  char s[16];
  
  Serial.begin(9600);
  dtostrf(ConvertB32ToFloat(0x3E999999), 1, 9, s);
  Serial.println(s);
  dtostrf(ConvertB32ToFloat(0x3E99999A), 1, 9, s);
  Serial.println(s);
  dtostrf(ConvertB32ToFloat(0x3E99999B), 1, 9, s);
  Serial.println(s);
}

void loop(void) { }

This process of converting between a number and its internal Binary32 representation (and vice versa) includes several nuances which are beyond the scope of this post. If you are interested in the exact process, I suggest studying the above conversion functions, or reading this wiki.

However, if we use the above functions, we can easily see that 0.3 cannot be converted exactly into a Binary32 floating point number. Take a close look at the following consecutive encodings:


3E999999 = 0.299999980
3E99999A = 0.300000010
3E99999B = 0.300000040

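Notice there is no exact representation of 0.3; 3E99999A is simply the closest value available. Neither 0.1 nor 0.2 is stored exactly either, and their sum rounds to that same 3E99999A, which prints as 0.30000001. And this is why our Arduino produced the odd result when asked to add 0.1 and 0.2.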
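This can be checked on a PC with a short desktop C++ sketch. It assumes a 32-bit IEEE 754 `float`, and the helper names (`bitsOf`, `sumBits`, `isNearestFloat`) are hypothetical, chosen only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Raw Binary32 bit pattern of a float.
std::uint32_t bitsOf(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Bit pattern of 0.1 + 0.2 when the sum is stored in single precision.
std::uint32_t sumBits() {
    volatile float a = 0.1f;  // volatile: force the addition to happen at run time
    volatile float b = 0.2f;
    float sum = a + b;
    return bitsOf(sum);
}

// True if neither neighbouring float is closer to target than candidate is.
bool isNearestFloat(double target, float candidate) {
    assert(std::isfinite(candidate));
    float below = std::nextafter(candidate, -INFINITY);  // next float down
    float above = std::nextafter(candidate, INFINITY);   // next float up
    double err = std::fabs(target - candidate);
    return err <= std::fabs(target - below) && err <= std::fabs(target - above);
}
```

Here `sumBits()` comes out as 0x3E99999A, and `isNearestFloat(0.3, 0.3f)` is true, so the sum lands on the closest float to 0.3 that exists. Stepping one float down or up from it with `std::nextafter` gives 3E999999 and 3E99999B, the two neighbours shown in the list.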

Further Reading

There are many websites which explore the weird and wonderful world of floating-point math. Some are accurate, while many are not. Few of the accurate ones are as interesting as the series of blog posts found here.

About Jim Eli

µC experimenter