by Federico Mena Quintero
Introduction
A few months ago I went down a rabbithole of learning about floatingpoint numbers. I was porting librsvg
— a library that renders SVG vector images into pixels — from C to Rust, so that I could use a safer, more modern language. SVG files are XML, and the shapes in SVG can have their sizes specified as a number plus units.
For example, a rectangle may have an attribute like width=”10.5cm”
. For that particular case, an XML parser will usually tell you, “there is a width attribute, and its value is the string “10.5cm”. The C version of librsvg used the strtod()
function from C’s standard library to do two things at the same time upon the string: parse out the floatingpoint number for 10.5, and separate it from the “cm” part. The code would then say, “cm means centimeters, so convert it to pixels in the following way…”.
However, Rust’s standard library does not have a function that does the same thing as strtod()
. I decided to write my own: I wrote a simple algorithm to walk the string and look for a plus or minus sign — to make it handle numbers like “10.5” and “+10.5”, then some digits with an optional decimal point, and an optional exponent so that “1.2e6” will read out as 12500000. Along the way, the algorithm would do a trivial conversion to floatingpoint. The algorithm would stop parsing the string as soon as it found something that cannot be part of a number: when parsing “10.5cm”, it would first parse the 10.5 as a floatingpoint number, and stop when it found the “c”, since it can’t be part of a number. The algorithm would return two things: the number it parsed, and the “cm” string.
I knew that my simple algorithm would not produce very accurate numbers, as I was not doing anything special to minimize inaccuracies in the calculations. My first version interleaved two behaviors: looking for the part of the string that corresponds to the number, and converting it to floatingpoint at the same time. In the final version, I ended up just doing the first part: finding the number part in the given string, and passing that on to Rust’s “parse a float” function. I the got curious about how this process really works, so I started reading the code Rust’s standard library that deals with binarytodecimal and decimaltobinary conversions for floating point numbers. So, let’s look at the fundamentals of floatingpoint.
Binary integers
Computers represent numbers in binary, and integers are sequences of bits. 0 is all bits zero, 1 is a “1” bit, 2 is the bits “10”, 5 is the bits “101”, and so on.
To avoid confusion between number bases, we will write numbers with their base as a subscript. So, 42_{10} is fortytwo in decimal, while its binary representation would be 101010_{2}. How can we show that this is true? We multiply each bit by its corresponding poweroftwo, and add them:
101010_{2} = 1 * 2^{5} > 32 0 * 2^{4} 1 * 2^{3} > 8 0 * 2^{2} 1 * 2^{1} > 2 0 * 2^{0}  Total 42_{10}
How do we convert 42_{10} to binary? We can repeatedly divide by 2 until we get 0 for the quotient, and then print the remainders backwards:
42 / 2 = 21, remainder 0 21 / 2 = 10, remainder 1 10 / 2 = 5, remainder 0 5 / 2 = 2, remainder 1 2 / 2 = 1, remainder 0 1 / 2 = 0, remainder 1 ^ ^ Read these from bottom to top  to obtain 101010_{2}  + We stop when we hit zero
Numbers with fractions, and the “no leading zeros” rule
If we have 42.53_{10}, this means 42 + 53/100. We can also write this as 4253/100, or 4253 * 10^{2}. Here, we call “4253” the significand or mantissa, “10” the base or radix, and “2” the exponent.
From now on, we will specify all numbers as significand/exponent/base. We also need a sign if the number is negative, but that’s only an extra bit, and we can ignore it for this discussion.
Let’s make up a rule: to represent a number, we must specify a significand of certain number of digits, with no leading zeros.
For example, following that rule, let’s establish that we must specify a significand of 8 digits with no leading zeros, plus an exponent. So, we cannot use “00004253 * 10^{2}” because of the leading zeros.^{1} Instead, we must use
42530000 * 10^{6} = 42.530000 = 42.53 ^ decimal point moves 6 places to the left
Representing small numbers is not a problem: 0.00000000004253 (that’s 10 zeros after the decimal point) is 42530000 * 10^{18}. Note that we still have our 8digit significand with no leading zeros.
And big numbers? 425300000000000000000000 (that’s 20 zeros) would be 42530000 * 10^{16}.
That’s exactly what floating point is. We have a number base, a sign, a fixedsize significand, and an exponent to move the decimal point in the significand to the correct position. When there are no leading zeros in the significand, the floatingpoint number is normalized.
Binary fractions
In base 10, 42.53 means fortytwo units plus fiftythree hundredths.
Now, let’s look at a completely different number. If we convert 6.625_{10} to binary, we obtain 110.101_{2}. We can read this number as follows:
1 > 1 * 2^{2} > 4 1 > 1 * 2^{1} > 2 0 > 0 * 2^{0} > 0 subtotal is 6 . 1 > 1 * 2^{1} > 1/2 0 > 0 * 2^{2} > 0 1 > 1 * 2^{3} > 1/8 1/2 + 1/8 = 0.625  Total 6.625
Approximations
Even with very long binary fractions, we will always have an exact decimal representation. For example, to calculate the decimal equivalent to
.00101_{2}
we have 2^{3} + 2^{5} = 1/8 + 1/32 = 0.125 + 0.03125 = 0.15625
But if we have a decimal fraction, we cannot always represent it exactly in a finite number of binary digits. A famous example is to represent onetenth: what is .1_{10} in binary?
.1_{10} = .00011001100110011001100110011..._{2} ^^^^ this chunk repeats
If we were to obtain this value by brute force, by simply trying to add together eversmaller powers of two, trying to approximate 0.1:
0 * 2^{1} = 0 * 0.5 (doesn't fit in 0.1) 0 * 2^{2} = 0 * 0.25 (doesn't fit) 0 * 2^{3} = 0 * 0.125 (still doesn't fit) 1 * 2^{4} = 1 * 0.0625 (fits!) 1 * 2^{5} = 1 * 0.03125 (subtotal 0.09375 fits in 0.1) 0 * 2^{6} = 0 * 0.015625 (wouldn't fit when added to subtotal) 0 * 2^{7} = 0 * 0.0078125 (wouldn't fit when added to subtotal) 1 * 2^{8} = 1 * 0.00390625 (subtotal 0.09765625 fits in 0.1) 1 * 2^{9} = 1 * 0.001803125 (subtotal 0.099459375 fits in 0.1) etc.
That is, this is a periodic fraction. We cannot represent onetenth exactly in a finite number of binary digits, so we must use an approximation. Two number systems are commensurable if it is always possible to convert between them exactly; base two and base ten are not commensurable.
What would give us similar trouble in base ten? Well, if we were converting numbers from base three to base ten, then .1_{3} = 1 * 3^{1} = 1/3 = 0.333333333…_{10} which we can not represent exactly in base 10.
Fortunately, with computers we only have to go back and forth between base ten (for humans) and base two (for machines). We will end up with the following:
 Any binary fraction can be represented exactly with a finite number of decimal digits.
 Some decimal fractions cannot be represented exactly with a finite number of binary digits, and we will require approximations.
CPUs are quite good at doing arithmetic, even with floatingpoint numbers. If you have ever had a problem where you calculate 1.0 / 10.0 and the result comes out as “0.099999994”, the problem is usually not with the arithmetic, but with the printing algorithm.
But onetenth cannot be stored exactly in binary!
So, how can we expect a printing algorithm to actually print “0.1” for the result of one divided by ten?
If we have a binary number stored in the computer, it would be desirable to:
 Convert it to decimal and not get unexpected rounding. If we stored “0.1_{10}” in a floatingpoint value, getting back “.0999999996274” would be annoying.
 Convert that decimal back to binary, and get the original binary number back.
This is called a roundtrip from binary to decimal and back. The literature on binary/decimal conversions defines that
A printed representation R of a floatingpoint number v satisfies the internal identity requirement if and only if R would convert to v when read again. (Florian Loitsch, section 2.2)
So, we start with a binary floatingpoint number v, and run an algorithm that converts it to a decimal representation R. Then, R satisfies the internal identity requirement iff we can convert R back into binary and get the exact same v again (“exact” as in “bitbybit identical”).
This requires both conversions to work properly. We need a binarytodecimal writer that writes only as many digits as needed to represent the original number faithfully, and we also need a decimaltobinary reader that can reconstruct the original number.
When any of the two algorithms does not do a proper job, it is possible to get drift when converting back and forth. Intuitively, you would expect that if you do this in a program:
let x = 1.0 / 10.0;
Then printing that value would give you “0.1”, or something very close; and reading that back would give you the exact same bits for x as you started with. It would not be good to get 0.0999999994 and then 0.0999999996 when reading and printing back and forth.
As we saw, we cannot store the 0.0001100110011…_{2} periodic fraction for “one tenth” exactly. We can only store it up to a certain number of binary digits. We can convert that to an exact decimal representation. For example, if we can only store 27 bits for the significand in a value called Foo
:
(remember, no leading zeros) Foo = .110011001100110011001100110_2 * 2^{3} \ 27 bits / Its exact decimal representation is .09999999962747097015380859375
If we convert that monster back into binary, we will get the same Foo
bits back.
However, we would get the exact same bits if we simply converted “.1” into binary!
This means that both “.09999999962747097015380859375” and “.1” satisfy the internal identity requirement if we start out with the binary fraction Foo
. And since “.1” is shorter, we will prefer it over the long one.
So, the internal identity requirement gives us a formal way to test if a binary/decimal roundtrip is correct: we should get the original bits back.
If we add the humanfriendly requirement that we should prefer a shorter decimal representation than a longer one (as long as they both convert back to the same bits), this takes care of our desire to “not get unexpected rounding” from the beginning of this section.
The paper I’ll mention in the next section puts this nicely: “We would like to produce enough digits to preserve the information content of a binary number, but no more”.^{2}
Printing and reading floatingpoint numbers
“How to print floatingpoint numbers accurately” (PDF) is a paper by Guy L. Steele Jr. and Jon L. White from 1990. That link is to a retrospective article, plus the original article itself. The retrospective gives a history of the state of floatingpoint numbers in the 1970s through the 1990s. In the early 1970s, systems produced incorrect decimal versions of binary numbers, and they didn’t provide a guaranteed bound of how wrong they might be. Even in the IEEE 754 floatingpoint standard from the 1980s has a requirement for conversion accuracy that could be tighter. Back then, people passed around moreorless working code for reading/writing floatingpoint numbers but didn’t worry much about correctness: the idea that “floatingpoint is inherently inaccurate anyway” was at fault here.
The paper solves the problem of printing floatingpoint numbers accurately by building four different algorithms incrementally. They culminate in “Dragon4”, and there is an interesting footnote about why that name was chosen: The intermediate algorithms have names like FinitePrecision FixedPoint Fraction Printout (FP)^3^, and all those “F” and “P” look like the letters that specify the shape of a Dragon spacefilling curve. The most appealing feature of Dragon4 is that it lets one specify the format in which to print the floatingpoint number, for example, “use at most 8 significant digits, or an exponent if necessary”. It turns out that doing this efficiently and accurately takes a lot of theory; it is not enough to generate as many digits as possible and then just round the result. Also, Dragon4 uses bignums, or arbitraryprecision arithmetic to be able to handle very big or very small numbers in a general fashion.
After Guy L. Steele’s and Jon L. White’s paper, there was some research in trying to avoid using bignums to make common cases more efficient. This seems to have culminated in Florian Loitsch’s paper from 2004, “Printing floatingpoint numbers quickly and accurately with integers”. This paper explains that the requirement for bignums can be greatly reduced if the algorithm stores a precomputed cache of powersoften represented in binary. About 99.4% of cases can then be printed using only 64bit integer arithmetic, and the rest can fall back to Dragon4.
In this paper, the various evolving algorithms are called Grisu2 and Grisu3. The name comes from Grisù, an Italian cartoon dragon.
The Dragon paper from 1990 is for printing floatingpoint numbers into a decimal representation. Their discussion of the internal identity requirement mentions that they assume that the reading problem is solved; the Grisu paper makes the same assumption.
For the reading problem, there is another paper “How to read floating point numbers accurately” with a retrospective by William D. Clinger, again from 1990. The paper explains why “obvious” algorithms for rounding would be subtly wrong:
The key idea is this. Consider the problem of computing the value of some function g, rounded to the nearest 10000, given an oracle that delivers the value of g rounded to the nearest 10. Given an input x, the obvious approach is to ask the oracle for the value of g(x) to the nearest 10, and then to round that result to the nearest 10000. Unfortunately, this does not always work. If g(x) is 11074996 and g(y) is 11075004, for example, then the answers should be 11070000 and 11080000, respectively, but the oracle will deliver 11075000 for both x and y. This approach usually works, though. It fails only when the value delivered by the oracle ends in 5000.
The paper expands this idea to explain the requirement for reading numbers using extended precision, and then rounding to a shorter precision in the end. The paper proves that the conversion cannot be done with fixedsize numbers, and it requires bignums instead.
The final algorithm in William D. Clinger’s paper is called Bellerophon, the Greek hero who slayed the Chimera… who breathed fire. This complements the Dragon algorithms nicely.
Conclusion
Floatingpoint numbers and their arithmetic is a big topic. Most programming languages nowadays have good functions in their standard libraries to parse floatingpoint numbers from strings, and to print them back out. This article mentions some of the key papers with the algorithms that perform those conversions. In the Bibliography you will find materials for more indepth study.
Some programs demand a lot of care to maintain numerical accuracy (are you writing a spreadsheet, or a physical simulation system?). For those cases, “What Every Computer Scientist Should Know About Floating Point Arithmetic” is an excellent resource with many recipes for how to do many common operations accurately.
I hope this article gives you some pointers on where to start if you want to learn more about floatingpoint numbers.
Bibliography
Dragon algorithm – “How to print floating point numbers accurately”, Guy L. Steele and
Jon L. White – http://kurtstephens.com/files/p372steele.pdf
Grisu algorithm – “Printing FloatingPoint Numbers Quickly and Accurately with Integers”, Florian Loitsch
http://www.cs.tufts.edu/%7Enr/cs257/archive/florianloitsch/printf.pdf
Bellerophon algorithm – “How to read floating point numbers accurately”, William D. Clinger
Retrospective: http://www.cesura17.net/\~will/Professional/Research/Papers/retrospective.pdf
Article: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.4152
“What Every Computer Scientist Should Know About Floating Point Arithmetic”, David Goldberg. This provides a detailed description of floatingpoint numbers, the “units in the last place” terminology for measuring rounding error, and very useful tips on avoiding precision loss during calculations. There is a lot to learn here, and although I don’t think every computer scientist needs to know everything here, it definitely helps to be aware of floatingpoint gotchas.
http://cse.msu.edu/~cse320/Documents/FloatingPoint.pdf
Roundtripping (at the end): https://randomascii.wordpress.com/2012/03/08/floatprecisionfromzeroto100digits2/
A presentation about fixing some of floatingpoint’s shortcomings with a different encoding for numbers called “posits”. The concepts are quite interesting, but I found the talk rather annoying to listen. There is a lot of “floatingpoint is so broken; how could anyone ever want to use it; if I could convince all hardware makers to use this instead”. Maybe stay for the math and the pretty pictures; ignore the ranting?
The Rust standard library uses Grisu for printing most floatingpoint numbers, and falls back to Dragon for the hard cases. For reading, it uses Bellerophon.
rust/src/libcore/num/diy_float.rs – from Grisu
rust/src/libcore/num/flt2dec/strategy/dragon.rs
rust/src/libcore/num/flt2dec/strategy/grisu.rs
rust/src/libcore/num/dec2flt/algorithm.rs – Bellerophon
You can browse these sources online at https://github.com/rustlang/rust/tree/master/src/libcore/num
Federico is one of the founders of the GNOME
Project, a graphical desktop for free software systems. He procrastinates with bicycling, woodworking, cooking, OpenStreetMap, and vegetable gardening. His blog is at https://people.gnome.org/~federico/blog/
 IEEE754 floatingpoint numbers do allow leading zeros in the significand for very small numbers. These are called subnormal floatingpoint numbers, but they are a special case that does not really concern us here. ↩

From “How to print floatingpoint numbers accurately”, by Guy L. Steele and Jon L. White:
Satisfaction of the internal identity requirement implies that an *external consistency requirement* be met: if an internal value is printed out, then reading that exact same external value and then printing it once more will produce the same external representation. This is important to the user of an interactive system. If one types PRINT 3.1415926535 and the system replies 3.1415926, the user might think “Well and good; the computer has truncated it, and that’s all I need type from now on.” The user would be rightly upset to then type in PRINT 3.1415926 and receive the response 3.1415925! Many language implementations indeed exhibit such undesirable “drifting” behavior. ↩