by Federico Mena Quintero

Introduction

A few months ago I went down a rabbit-hole of learning about floating-point numbers. I was porting librsvg — a library that renders SVG vector images into pixels — from C to Rust, so that I could use a safer, more modern language. SVG files are XML, and the shapes in SVG can have their sizes specified as a number plus units.

For example, a rectangle may have an attribute like width=”10.5cm”. For that particular case, an XML parser will usually tell you, “there is a width attribute, and its value is the string “10.5cm”. The C version of librsvg used the strtod() function from C’s standard library to do two things at the same time upon the string: parse out the floating-point number for 10.5, and separate it from the “cm” part. The code would then say, “cm means centimeters, so convert it to pixels in the following way…”.

However, Rust’s standard library does not have a function that does the same thing as strtod(). I decided to write my own: I wrote a simple algorithm to walk the string and look for a plus or minus sign — to make it handle numbers like “-10.5” and “+10.5”, then some digits with an optional decimal point, and an optional exponent so that “1.2e6” will read out as 12500000. Along the way, the algorithm would do a trivial conversion to floating-point. The algorithm would stop parsing the string as soon as it found something that cannot be part of a number: when parsing “10.5cm”, it would first parse the 10.5 as a floating-point number, and stop when it found the “c”, since it can’t be part of a number. The algorithm would return two things: the number it parsed, and the “cm” string.

I knew that my simple algorithm would not produce very accurate numbers, as I was not doing anything special to minimize inaccuracies in the calculations. My first version interleaved two behaviors: looking for the part of the string that corresponds to the number, and converting it to floating-point at the same time. In the final version, I ended up just doing the first part: finding the number part in the given string, and passing that on to Rust’s “parse a float” function. I the got curious about how this process really works, so I started reading the code Rust’s standard library that deals with binary-to-decimal and decimal-to-binary conversions for floating point numbers. So, let’s look at the fundamentals of floating-point.

Binary integers

Computers represent numbers in binary, and integers are sequences of bits. 0 is all bits zero, 1 is a “1” bit, 2 is the bits “10”, 5 is the bits “101”, and so on.

To avoid confusion between number bases, we will write numbers with their base as a subscript. So, 4210 is forty-two in decimal, while its binary representation would be 1010102. How can we show that this is true? We multiply each bit by its corresponding power-of-two, and add them:

1010102

=

1 * 25 -> 32
0 * 24
1 * 23 -> 8
0 * 22
1 * 21 -> 2
0 * 20
----------
Total 4210

How do we convert 4210 to binary? We can repeatedly divide by 2 until we get 0 for the quotient, and then print the remainders backwards:

42 / 2 = 21, remainder 0
21 / 2 = 10, remainder 1
10 / 2 = 5, remainder 0
5 / 2 = 2, remainder 1
2 / 2 = 1, remainder 0
1 / 2 = 0, remainder 1

^                    ^- Read these from bottom to top
|                       to obtain 1010102
|
+- We stop when we hit zero

Numbers with fractions, and the “no leading zeros” rule

If we have 42.5310, this means 42 + 53/100. We can also write this as 4253/100, or 4253 * 10-2. Here, we call “4253” the significand or mantissa, “10” the base or radix, and “-2” the exponent.

From now on, we will specify all numbers as significand/exponent/base. We also need a sign if the number is negative, but that’s only an extra bit, and we can ignore it for this discussion.

Let’s make up a rule: to represent a number, we must specify a significand of certain number of digits, with no leading zeros.

For example, following that rule, let’s establish that we must specify a significand of 8 digits with no leading zeros, plus an exponent. So, we cannot use “00004253 * 10-2” because of the leading zeros.1 Instead, we must use

42530000 * 10-6 = 42.530000 = 42.53
                    ^ decimal point moves 6 places to the left

Representing small numbers is not a problem: 0.00000000004253 (that’s 10 zeros after the decimal point) is 42530000 * 10-18. Note that we still have our 8-digit significand with no leading zeros.

And big numbers? 425300000000000000000000 (that’s 20 zeros) would be 42530000 * 1016.

That’s exactly what floating point is. We have a number base, a sign, a fixed-size significand, and an exponent to move the decimal point in the significand to the correct position. When there are no leading zeros in the significand, the floating-point number is normalized.

Binary fractions

In base 10, 42.53 means forty-two units plus fifty-three hundredths.

Now, let’s look at a completely different number. If we convert 6.62510 to binary, we obtain 110.1012. We can read this number as follows:

1 -> 1 * 22 -> 4
1 -> 1 * 21 -> 2
0 -> 0 * 20 -> 0     subtotal is 6

.

1 -> 1 * 2-1 -> 1/2
0 -> 0 * 2-2 -> 0
1 -> 1 * 2-3 -> 1/8     1/2 + 1/8 = 0.625
-----------------
Total       6.625

Approximations

Even with very long binary fractions, we will always have an exact decimal representation. For example, to calculate the decimal equivalent to

.001012

we have 2-3 + 2-5 = 1/8 + 1/32 = 0.125 + 0.03125 = 0.15625

But if we have a decimal fraction, we cannot always represent it exactly in a finite number of binary digits. A famous example is to represent one-tenth: what is .110 in binary?

.110 = .00011001100110011001100110011...2
            ^^^^ this chunk repeats

If we were to obtain this value by brute force, by simply trying to add together ever-smaller powers of two, trying to approximate 0.1:

0 * 2-1 = 0 * 0.5 (doesn't fit in 0.1)
0 * 2-2 = 0 * 0.25 (doesn't fit)
0 * 2-3 = 0 * 0.125 (still doesn't fit)
1 * 2-4 = 1 * 0.0625 (fits!)
1 * 2-5 = 1 * 0.03125 (subtotal 0.09375 fits in 0.1)
0 * 2-6 = 0 * 0.015625 (wouldn't fit when added to subtotal)
0 * 2-7 = 0 * 0.0078125 (wouldn't fit when added to subtotal)
1 * 2-8 = 1 * 0.00390625 (subtotal 0.09765625 fits in 0.1)
1 * 2-9 = 1 * 0.001803125 (subtotal 0.099459375 fits in 0.1)

etc.

That is, this is a periodic fraction. We cannot represent one-tenth exactly in a finite number of binary digits, so we must use an approximation. Two number systems are commensurable if it is always possible to convert between them exactly; base two and base ten are not commensurable.

What would give us similar trouble in base ten? Well, if we were converting numbers from base three to base ten, then .13 = 1 * 3-1 = 1/3 = 0.333333333…10 which we can not represent exactly in base 10.

Fortunately, with computers we only have to go back and forth between base ten (for humans) and base two (for machines). We will end up with the following:

  • Any binary fraction can be represented exactly with a finite number of decimal digits.
  • Some decimal fractions cannot be represented exactly with a finite number of binary digits, and we will require approximations.

CPUs are quite good at doing arithmetic, even with floating-point numbers. If you have ever had a problem where you calculate 1.0 / 10.0 and the result comes out as “0.099999994”, the problem is usually not with the arithmetic, but with the printing algorithm.

But one-tenth cannot be stored exactly in binary!

So, how can we expect a printing algorithm to actually print “0.1” for the result of one divided by ten?

If we have a binary number stored in the computer, it would be desirable to:

  • Convert it to decimal and not get unexpected rounding. If we stored “0.110” in a floating-point value, getting back “.0999999996274” would be annoying.
  • Convert that decimal back to binary, and get the original binary number back.

This is called a roundtrip from binary to decimal and back. The literature on binary/decimal conversions defines that

A printed representation R of a floating-point number v satisfies the internal identity requirement if and only if R would convert to v when read again. (Florian Loitsch, section 2.2)

So, we start with a binary floating-point number v, and run an algorithm that converts it to a decimal representation R. Then, R satisfies the internal identity requirement iff we can convert R back into binary and get the exact same v again (“exact” as in “bit-by-bit identical”).

This requires both conversions to work properly. We need a binary-to-decimal writer that writes only as many digits as needed to represent the original number faithfully, and we also need a decimal-to-binary reader that can reconstruct the original number.

When any of the two algorithms does not do a proper job, it is possible to get drift when converting back and forth. Intuitively, you would expect that if you do this in a program:

let x = 1.0 / 10.0;

Then printing that value would give you “0.1”, or something very close; and reading that back would give you the exact same bits for x as you started with. It would not be good to get 0.0999999994 and then 0.0999999996 when reading and printing back and forth.

As we saw, we cannot store the 0.0001100110011…2 periodic fraction for “one tenth” exactly. We can only store it up to a certain number of binary digits. We can convert that to an exact decimal representation. For example, if we can only store 27 bits for the significand in a value called Foo:

(remember, no leading zeros)

Foo = .110011001100110011001100110_2 * 2-3
       \-------- 27 bits --------/

Its exact decimal representation is

.09999999962747097015380859375

If we convert that monster back into binary, we will get the same Foo bits back.

However, we would get the exact same bits if we simply converted “.1” into binary!

This means that both “.09999999962747097015380859375” and “.1” satisfy the internal identity requirement if we start out with the binary fraction Foo. And since “.1” is shorter, we will prefer it over the long one.

So, the internal identity requirement gives us a formal way to test if a binary/decimal roundtrip is correct: we should get the original bits back.

If we add the human-friendly requirement that we should prefer a shorter decimal representation than a longer one (as long as they both convert back to the same bits), this takes care of our desire to “not get unexpected rounding” from the beginning of this section.

The paper I’ll mention in the next section puts this nicely: “We would like to produce enough digits to preserve the information content of a binary number, but no more”.2

Printing and reading floating-point numbers

“How to print floating-point numbers accurately” (PDF) is a paper by Guy L. Steele Jr. and Jon L. White from 1990. That link is to a retrospective article, plus the original article itself. The retrospective gives a history of the state of floating-point numbers in the 1970s through the 1990s. In the early 1970s, systems produced incorrect decimal versions of binary numbers, and they didn’t provide a guaranteed bound of how wrong they might be. Even in the IEEE 754 floating-point standard from the 1980s has a requirement for conversion accuracy that could be tighter. Back then, people passed around more-or-less working code for reading/writing floating-point numbers but didn’t worry much about correctness: the idea that “floating-point is inherently inaccurate anyway” was at fault here.

The paper solves the problem of printing floating-point numbers accurately by building four different algorithms incrementally. They culminate in “Dragon4”, and there is an interesting footnote about why that name was chosen: The intermediate algorithms have names like Finite-Precision Fixed-Point Fraction Printout (FP)^3^, and all those “F” and “P” look like the letters that specify the shape of a Dragon space-filling curve. The most appealing feature of Dragon4 is that it lets one specify the format in which to print the floating-point number, for example, “use at most 8 significant digits, or an exponent if necessary”. It turns out that doing this efficiently and accurately takes a lot of theory; it is not enough to generate as many digits as possible and then just round the result. Also, Dragon4 uses bignums, or arbitrary-precision arithmetic to be able to handle very big or very small numbers in a general fashion.

After Guy L. Steele’s and Jon L. White’s paper, there was some research in trying to avoid using bignums to make common cases more efficient. This seems to have culminated in Florian Loitsch’s paper from 2004, “Printing floating-point numbers quickly and accurately with integers”. This paper explains that the requirement for bignums can be greatly reduced if the algorithm stores a pre-computed cache of powers-of-ten represented in binary. About 99.4% of cases can then be printed using only 64-bit integer arithmetic, and the rest can fall back to Dragon4.

In this paper, the various evolving algorithms are called Grisu2 and Grisu3. The name comes from Grisù, an Italian cartoon dragon.

The Dragon paper from 1990 is for printing floating-point numbers into a decimal representation. Their discussion of the internal identity requirement mentions that they assume that the reading problem is solved; the Grisu paper makes the same assumption.

For the reading problem, there is another paper “How to read floating point numbers accurately” with a retrospective by William D. Clinger, again from 1990. The paper explains why “obvious” algorithms for rounding would be subtly wrong:

The key idea is this. Consider the problem of computing the value of some function g, rounded to the nearest 10000, given an oracle that delivers the value of g rounded to the nearest 10. Given an input x, the obvious approach is to ask the oracle for the value of g(x) to the nearest 10, and then to round that result to the nearest 10000. Unfortunately, this does not always work. If g(x) is 11074996 and g(y) is 11075004, for example, then the answers should be 11070000 and 11080000, respectively, but the oracle will deliver 11075000 for both x and y. This approach usually works, though. It fails only when the value delivered by the oracle ends in 5000.

The paper expands this idea to explain the requirement for reading numbers using extended precision, and then rounding to a shorter precision in the end. The paper proves that the conversion cannot be done with fixed-size numbers, and it requires bignums instead.

The final algorithm in William D. Clinger’s paper is called Bellerophon, the Greek hero who slayed the Chimera… who breathed fire. This complements the Dragon algorithms nicely.

Conclusion

Floating-point numbers and their arithmetic is a big topic. Most programming languages nowadays have good functions in their standard libraries to parse floating-point numbers from strings, and to print them back out. This article mentions some of the key papers with the algorithms that perform those conversions. In the Bibliography you will find materials for more in-depth study.

Some programs demand a lot of care to maintain numerical accuracy (are you writing a spreadsheet, or a physical simulation system?). For those cases, “What Every Computer Scientist Should Know About Floating Point Arithmetic” is an excellent resource with many recipes for how to do many common operations accurately.

I hope this article gives you some pointers on where to start if you want to learn more about floating-point numbers.

Bibliography

Dragon algorithm – “How to print floating point numbers accurately”, Guy L. Steele and

Jon L. White – http://kurtstephens.com/files/p372-steele.pdf

Grisu algorithm – “Printing Floating-Point Numbers Quickly and Accurately with Integers”, Florian Loitsch

http://www.cs.tufts.edu/%7Enr/cs257/archive/florian-loitsch/printf.pdf

Bellerophon algorithm – “How to read floating point numbers accurately”, William D. Clinger

Retrospective: http://www.cesura17.net/\~will/Professional/Research/Papers/retrospective.pdf

Article: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.4152

“What Every Computer Scientist Should Know About Floating Point Arithmetic”, David Goldberg. This provides a detailed description of floating-point numbers, the “units in the last place” terminology for measuring rounding error, and very useful tips on avoiding precision loss during calculations. There is a lot to learn here, and although I don’t think every computer scientist needs to know everything here, it definitely helps to be aware of floating-point gotchas.

http://cse.msu.edu/~cse320/Documents/FloatingPoint.pdf

Roundtripping (at the end): https://randomascii.wordpress.com/2012/03/08/float-precisionfrom-zero-to-100-digits-2/

https://insidehpc.com/2017/02/john-gustafson-presents-beyond-floating-point-next-generation-computer-arithmetic/

A presentation about fixing some of floating-point’s shortcomings with a different encoding for numbers called “posits”. The concepts are quite interesting, but I found the talk rather annoying to listen. There is a lot of “floating-point is so broken; how could anyone ever want to use it; if I could convince all hardware makers to use this instead”. Maybe stay for the math and the pretty pictures; ignore the ranting?

The Rust standard library uses Grisu for printing most floating-point numbers, and falls back to Dragon for the hard cases. For reading, it uses Bellerophon.

rust/src/libcore/num/diy_float.rs – from Grisu

rust/src/libcore/num/flt2dec/strategy/dragon.rs

rust/src/libcore/num/flt2dec/strategy/grisu.rs

rust/src/libcore/num/dec2flt/algorithm.rs – Bellerophon

You can browse these sources online at https://github.com/rust-lang/rust/tree/master/src/libcore/num

Federico is one of the founders of the GNOME
Project, a graphical desktop for free software systems. He procrastinates with bicycling, woodworking, cooking, OpenStreetMap, and vegetable gardening. His blog is at https://people.gnome.org/~federico/blog/


  1. IEEE754 floating-point numbers do allow leading zeros in the significand for very small numbers. These are called subnormal floating-point numbers, but they are a special case that does not really concern us here. 
  2. From “How to print floating-point numbers accurately”, by Guy L. Steele and Jon L. White:

    Satisfaction of the internal identity requirement implies that an *external consistency requirement* be met: if an internal value is printed out, then reading that exact same external value and then printing it once more will produce the same external representation. This is important to the user of an interactive system. If one types PRINT 3.1415926535 and the system replies 3.1415926, the user might think “Well and good; the computer has truncated it, and that’s all I need type from now on.” The user would be rightly upset to then type in PRINT 3.1415926 and receive the response 3.1415925! Many language implementations indeed exhibit such undesirable “drifting” behavior.