# 9. Machine Numbers, Rounding Error and Error Propagation#

**References:**

- Sections 0.3 and 0.4 of Sauer
- Section 1.2 of Burden & Faires
- Sections 1.3 and 1.4 of *Numerical Mathematics and Computing* by Cheney & Kincaid

## 9.1. Overview#

The naive Gaussian elimination algorithm seen in Solving Simultaneous Linear Equations, Part 1: Row Reduction/Gaussian Elimination has several related weaknesses that make it less robust and flexible than desired.

Most obviously, it can fail even when the equations are solvable, due to its naive insistence on always working from the top down. For example, as seen in Example 1 of that section, it fails with the system

\[ 0 \cdot x_1 + x_2 = 1, \qquad x_1 + x_2 = 2 \]

because the formula for the first multiplier \(l_{2,1} = a_{2,1}/a_{1,1}\) gives \(1/0\).

Yet the equations are easily solvable, indeed with no reduction needed: the first equation just says \(x_2 = 1\), and then the second gives \(x_1 = 2 - x_2 = 1\).

All one has to do here to avoid this problem is change the order of the equations.
Indeed we will see that such reordering is *all that one ever needs to do,* so long as the original equation has a unique solution.

However, to develop a good strategy, we will also take account of errors introduced by rounding in computer arithmetic, so that is our next topic.

## 9.2. Robustness and well-posedness#

The above claim raises the concept of **robustness** and the importance of both existence and uniqueness of solutions.

**Definition 1: Well-Posed**
A problem is *well-posed* if it is stated in a way that guarantees it has a unique solution.
(Note that this might include asking for the set of all solutions, such as asking for all roots of a polynomial.)
For example, the problem of finding the root of a continuous, monotonic function
\(f:[a, b] \to \mathbb{R}\)
with \(f(a)\) and \(f(b)\) of opposite sign is well-posed.
Note the care taken with details to ensure both existence and uniqueness of the solution.

**Definition 2: Robust**
An algorithm for solving a class of problems is *robust* if it is guaranteed to solve any well-posed problem in the class.

For example, the bisection method is robust for the above class of problems. On the other hand, Newton’s method is not, and if we dropped the specification of monotonicity (so allowing multiple solutions) then the bisection method in its current form would not be robust: it would fail whenever there is more than one solution in the interval \([a, b]\).
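To make this concrete, here is a minimal bisection sketch in Julia (the function name `bisection` and its keyword `tol` are illustrative, not from these notes): each step keeps a sign change bracketed inside the interval, so for any continuous monotonic \(f\) with \(f(a)\) and \(f(b)\) of opposite sign it cannot fail.

```julia
# A minimal bisection sketch (illustrative names, not from the text).
# Invariant: f(a) and the function value at one endpoint keep opposite signs,
# so a root always stays inside [a, b] and the method cannot fail.
function bisection(f, a, b; tol=1e-12)
    fa = f(a)
    while b - a > tol
        c = (a + b) / 2
        fc = f(c)
        if fc == 0
            return c
        elseif fa * fc < 0
            b = c            # a sign change lies in [a, c]
        else
            a, fa = c, fc    # a sign change lies in [c, b]
        end
    end
    return (a + b) / 2
end

root = bisection(x -> x^3 - 2, 1.0, 2.0)  # the real cube root of 2
println(root)
```

By contrast, Newton’s method applied to the same function can diverge or cycle from a poor starting guess, which is why it is not robust for this class.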

## 9.3. Rounding error and accuracy problems due to “loss of significance”#

There is a second, slightly less obvious, problem with the naive algorithm for Gaussian elimination, closely related to the first. As soon as the algorithm is implemented using any rounding in the arithmetic (rather than, say, working with exact arithmetic on rational numbers), division by values that are very close to zero can lead to very large intermediate values, which thus have very few correct decimals (correct bits); that is, very large absolute errors. These large errors can then propagate, leading to low accuracy in the final results, as seen in Example 2 and Example 4 of Solving Simultaneous Linear Equations, Part 1: Row Reduction/Gaussian Elimination.

This is the hazard of *loss of significance*, discussed in
Section 0.4 of Sauer
and
Section 1.4 of Cheney & Kincaid.

So it is time to take Step 2 of the strategy described in the previous notes:

### 9.3.1. 2. Refine to get a more **robust** algorithm#

Identify cases that can lead to failure due to division by zero and such, and revise to avoid them.

Avoid inaccuracy due to problems like severe rounding error. One rule of thumb is that anywhere that a zero value is a fatal flaw (in particular, division by zero), a very small value is also a hazard when rounding error is present. So avoid very small denominators. …

## 9.4. The essentials of machine numbers and rounding in machine arithmetic#

As a very quick summary, standard computer arithmetic handles real numbers using *binary machine numbers* with \(p\) significant bits, and rounding off of other numbers to such *machine numbers* introduces a relative error of at most \(2^{-p}\).
The current dominant choice for machine numbers and arithmetic is IEEE-64, using 64 bits in total and with
\(p=53\)
significant bits, so that
\(1/2^p \approx 1.11 \cdot 10^{-16}\),
giving about sixteen significant digits.
(The other bits are used for an exponent and the sign.)

(Note: in the above, I ignore the extra problems with real numbers whose magnitude is too large or too small to be represented: *underflow* and *overflow*.
Since the allowable range of magnitudes is from \(2^{-1022} \approx 2.2 \cdot 10^{-308}\) to
\(2^{1024} \approx 1.8 \cdot 10^{308}\), this is rarely a problem in practice.)

With other systems of binary machine numbers (like older 32-bit versions, or higher precision options like 128 bits) the significant differences are mostly encapsulated in that one number, the **machine unit**, \(u = 1/2^p\).
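As a quick check, Julia exposes this spacing directly through its built-in `eps` function; note that `eps(T)` gives the gap from 1 to the next machine number of type `T`, which is \(2u\) in the notation used here.

```julia
# Julia's built-in eps(T) returns the spacing of floats just above 1,
# which is 2u in the notation of these notes (u = 2^(-p), p bits of precision).
u64 = eps(Float64) / 2   # machine unit for IEEE-64: 2^(-53)
u32 = eps(Float32) / 2   # machine unit for IEEE-32: 2^(-24), since p = 24
println("IEEE-64: u = $u64")
println("IEEE-32: u = $u32")
```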

### 9.4.1. Binary floating point machine numbers#

The basic representation is a binary version of the familiar *scientific* or *decimal floating point* notation:
in the decimal form
\(\pm d_0 . d_1 d_2 \dots d_{p-1} \times 10^e\),
the *fractional part* or *mantissa* is
\(f = d_0 . d_1 d_2 \dots d_{p-1} = d_0 + \frac{d_1}{10} + \cdots + \frac{d_{p-1}}{10^{p-1}}\).

**Binary floating point machine numbers** with \(p\) significant bits can be described as

\[ \pm (b_0 . b_1 b_2 \dots b_{p-1})_2 \times 2^e, \quad \text{each bit } b_i \text{ being either } 0 \text{ or } 1. \]

Just as decimal floating point numbers are typically written with the exponent chosen to have non-zero leading digit \(d_0 \neq 0\), **normalized** binary floating point machine numbers have exponent \(e\) chosen so that \(b_0 \neq 0\).
Thus in fact \(b_0=1\) — and so it need not be stored; only \(p-1\) bits need to be stored for the mantissa.
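One way to see this hidden leading bit is to inspect the stored bits directly with Julia’s built-in `bitstring` (a small illustration; the slicing indices are chosen for the IEEE-64 layout described below):

```julia
# The 64 stored bits of 1.0: one sign bit, 11 exponent bits
# (the biased pattern 01111111111 = 1023 encodes exponent e = 0),
# and 52 mantissa bits; the leading mantissa bit b0 = 1 is implicit, not stored.
bits = bitstring(1.0)
sign, expo, mantissa = bits[1:1], bits[2:12], bits[13:64]
println("sign = $sign, exponent = $expo, mantissa = $mantissa")
```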

### 9.4.2. Worst case rounding error#

It turns out that the relative errors are determined solely by the number of significant bits in the mantissa, regardless of the exponent, so we look at that part first.

#### 9.4.2.1. Rounding error in the mantissa, \((1. b_1 b_2 \dots b_{p-1})_2\)#

The spacing of consecutive mantissa values \((1. b_1 b_2 \dots b_{p-1})_2\) is one in the last bit, or \(2^{1-p}\).
Thus rounding of any intermediate value \(x\) to the nearest number of this form introduces an absolute error of at most half of this: \(u = 2^{-p}\), which is called the *machine unit*.

How large can the *relative* error be?
It is largest for the smallest possible denominator, which is \((1.00 \dots 0)_2 = 1\), so the relative error due to rounding is also at most \(2^{-p}\).

#### 9.4.2.2. Rounding error in general, for \( \pm (1. b_1 b_2 \dots b_{p-1})_2 \cdot 2^e\).#

The sign has no effect on the absolute error, and the exponent changes the spacing of consecutive machine numbers by a factor of \(2^e\). This scales the maximum possible absolute error to \(2^{e-p}\), but in the relative error calculation, the smallest possible denominator is also scaled up to \(2^e\), so the largest possible relative error is again the machine unit, \(u = 2^{-p}\).
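A quick Julia experiment confirms this scaling, using the built-in `eps(x)`, which gives the spacing of machine numbers just above \(x\):

```julia
# The spacing of consecutive machine numbers scales with the binary exponent:
# just above x = 2^e the gap to the next float is 2^(e-52), i.e. 2^(e+1) * u.
for x in [1.0, 8.0, 1024.0]
    println("spacing just above $x is $(eps(x))")
end
```

For example, at \(x = 2^3 = 8\) the spacing is \(2^{3-52} = 2^{-49}\), so the maximum absolute rounding error there is \(2^{-50} = 2^3 u\), while the relative error bound stays at \(u\).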

One way to describe the machine unit u (sometimes called *machine epsilon*) is to note that the next number above \(1\) is \(1 + 2^{1-p} = 1 + 2u\).
Thus \(1+u\) is at the threshold between rounding down to 1 and rounding up to a higher value.

### 9.4.3. IEEE 64-bit numbers: more details and some experiments#

For completely full details, you could read about the IEEE 754 Standard for Floating-Point Arithmetic and specifically the binary64 case. (For historical reasons, this is known as “Double-precision floating-point format”, from the era when computers typically used 32-bit words, so 64-bit numbers needed two words.)

In the standard IEEE-64 number system:

- 64-bit words are used to store real numbers (a.k.a. *floating point* numbers, sometimes called *floats*).
- There are \(p=53\) bits of precision, so that 52 bits are used to store the mantissa (fractional part).
- The sign is stored with one bit \(s\): effectively a factor of \((-1)^s\), so \(s=0\) for positive, \(s=1\) for negative.
- The remaining 11 bits are used for the exponent, which allows for \(2^{11} = 2048\) possibilities; these are chosen in the range \(-1023 \leq e \leq 1024\).
- However, so far, this does not allow for the value zero! This is handled by giving a special meaning to the smallest exponent \(e=-1023\), so the smallest exponent for *normalized* numbers is \(e = -1022\).
- At the other extreme, the largest exponent \(e=1024\) is used to encode “infinite” numbers, which can arise when a calculation gives a value too large to represent. (Julia displays these as `Inf` and `-Inf`.) This exponent is also used to encode “Not a Number” (`NaN`), for situations like trying to divide zero by zero or multiply zero by `Inf`.
- Thus, the exponential factors for normalized numbers are in the range \(2^{-1022} \approx 2 \times 10^{-308}\) to \(2^{1023} \approx 9 \times 10^{307}\). Since the mantissa ranges from 1 to just under 2, the range of magnitudes of normalized real numbers is thus from \(2^{-1022} \approx 2 \times 10^{-308}\) to just under \(2^{1024} \approx 1.8 \times 10^{308}\).

Some computational experiments:

```
p = 53
u = 2.0^(-p)
println("For IEEE-64 arithmetic, there are $(p) bits of precision and the machine unit is u=$(u).")
println("The next numbers above 1 are 1+2u = $(1+2*u), 1+4u = $(1+4*u) and so on.")
for factor in [3, 2, 1.00000000001, 1]
    onePlusSmall = 1 + factor * u
    println("1 + $(factor)u rounds to $(onePlusSmall)")
    difference = onePlusSmall - 1
    println("\tThis is more than 1 by $(difference), which is $(difference/u) times u")
end
```

```
For IEEE-64 arithmetic, there are 53 bits of precision and the machine unit is u=1.1102230246251565e-16.
The next numbers above 1 are 1+2u = 1.0000000000000002, 1+4u = 1.0000000000000004 and so on.
1 + 3.0u rounds to 1.0000000000000004
This is more than 1 by 4.440892098500626e-16, which is 4.0 times u
1 + 2.0u rounds to 1.0000000000000002
This is more than 1 by 2.220446049250313e-16, which is 2.0 times u
1 + 1.00000000001u rounds to 1.0000000000000002
This is more than 1 by 2.220446049250313e-16, which is 2.0 times u
1 + 1.0u rounds to 1.0
This is more than 1 by 0.0, which is 0.0 times u
```

```
println("On the other side, the spacing is halved:")
println("the next numbers below 1 are 1-u = $(1-u), 1-2u = $(1-2*u) and so on.")
for factor in [2., 1., 1.00000000001/2, 1/2]
    oneMinusSmall = 1 - factor * u
    println("1 - $(factor)u rounds to $(oneMinusSmall)")
    difference = 1 - oneMinusSmall
    println("\tThis is less than 1 by $(difference), which is $(difference/u) times u ")
end
```

```
On the other side, the spacing is halved:
the next numbers below 1 are 1-u = 0.9999999999999999, 1-2u = 0.9999999999999998 and so on.
1 - 2.0u rounds to 0.9999999999999998
This is less than 1 by 2.220446049250313e-16, which is 2.0 times u
1 - 1.0u rounds to 0.9999999999999999
This is less than 1 by 1.1102230246251565e-16, which is 1.0 times u
1 - 0.500000000005u rounds to 0.9999999999999999
This is less than 1 by 1.1102230246251565e-16, which is 1.0 times u
1 - 0.5u rounds to 1.0
This is less than 1 by 0.0, which is 0.0 times u
```

Next, look at the extremes of very small and very large magnitudes:

```
println("The smallest normalized positive number is 2^(-1022)=$(2.0^(-1022))")
println("The largest mantissa is binary (1.1111...) with 53 ones: 2 - 2^(-52)=$(2-2.0^(-52))")
println("The largest normalized number is (2 - 2^(-52))*2^1023=$((2 - 2.0^(-52)) * (2.0^1023))")
println("If instead we round that mantissa up to 2 and try again, we get 2*2^1023=$(2.0 * 2.0^1023)")
```

```
The smallest normalized positive number is 2^(-1022)=2.2250738585072014e-308
The largest mantissa is binary (1.1111...) with 53 ones: 2 - 2^(-52)=1.9999999999999998
The largest normalized number is (2 - 2^(-52))*2^1023=1.7976931348623157e308
If instead we round that mantissa up to 2 and try again, we get 2*2^1023=Inf
```

What happens if we compute positive numbers smaller than that smallest normalized positive number \(2^{-1022}\)?

```
for S in [0, 1, 2, 52, 53]
    exponent = -1022 - S
    println("2^(-1022-$(S)) = 2^($(exponent)) = $(2.0^(exponent))")
end
```

```
2^(-1022-0) = 2^(-1022) = 2.2250738585072014e-308
2^(-1022-1) = 2^(-1023) = 1.1125369292536007e-308
2^(-1022-2) = 2^(-1024) = 5.562684646268003e-309
2^(-1022-52) = 2^(-1074) = 5.0e-324
2^(-1022-53) = 2^(-1075) = 0.0
```

These extremely small values are called *denormalized numbers*.
A denormalized number of magnitude about \(2^{-1022-S}\) has a fractional part with \(S\) leading zeros, so only \(p-S\) significant bits.
So when the shift \(S\) reaches \(p=53\), there are no significant bits left, and the value is truly zero.

## 9.5. Propagation of error in arithmetic#

The only errors in the results of Gaussian elimination come from errors in the initial data (\(a_{ij}\) and \(b_i\)) and from when the results of subsequent arithmetic operations are rounded to machine numbers. Here, we consider how errors from either source are propagated — and perhaps amplified — in subsequent arithmetic operations and rounding.

In summary:

- When *multiplying* two numbers, the relative error in the product is no worse than slightly more than the sum of the relative errors in the numbers multiplied. (To be pedantic, it is at most the sum of those relative errors plus their product, but that last piece is typically far smaller.)
- When *dividing* two numbers, the relative error in the quotient is again no worse than slightly more than the sum of the relative errors in the numbers divided.
- When *adding* two **positive** numbers, the relative error is no more than the larger of the relative errors in the numbers added, and the absolute error in the sum is no larger than the sum of the absolute errors.
- When *subtracting* two **positive** numbers, the absolute error is again no larger than the sum of the absolute errors in the numbers subtracted, **but the relative error can get far worse!**

Due to the differences between the last two cases, this discussion of error propagation will use “addition” to refer only to adding numbers of the same sign, and “subtraction” when subtracting numbers of the same sign.

More generally, we can think of rewriting the operation in terms of a pair of numbers that are both positive, and assume WLOG that all input values are positive numbers.

### 9.5.1. Notation: \(x_a = x(1 + \delta_x)\) for errors and \(fl(x)\) for rounding#

Two notations will be useful.

Firstly, for any approximation \(x_a\) of a real value \(x\), let \(\displaystyle\delta_x = \frac{x_a - x}{x}\), so that \(x_a = x(1 + \delta_x)\).

Thus, \(|\delta_x|\) is the relative error, and \(\delta_x\) helps keep track of the sign of the error.

Also, introduce the function \(fl(x)\) which does rounding to the nearest machine number. For the case of the approximation \(x_a = fl(x)\) to \(x\) given by rounding, the above results on machine numbers then give the bound \(|\delta_x| \leq u = 2^{-p}\).
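As a small numerical check of the bound \(|\delta_x| \leq u\), one can compare a rounded value to a much higher-precision one; in the sketch below, Julia’s `BigFloat` (via `big`) stands in for the exact value \(x = 1/3\):

```julia
# Checking |delta_x| <= u for rounding: compare fl(1/3) to a 256-bit
# BigFloat stand-in for the exact value 1/3.
x_exact = big(1) / 3
x_rounded = Float64(x_exact)                      # fl(1/3)
delta = abs((big(x_rounded) - x_exact) / x_exact) # the relative error |delta_x|
u = 2.0^-53
println("relative rounding error = $delta, machine unit u = $u")
```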

### 9.5.2. Propagation of error in products#

Let \(x\) and \(y\) be exact quantities, and \(x_a = x(1 + \delta_x)\), \(y_a = y(1 + \delta_y)\) be approximations. The approximate product \((xy)_a = x_a y_a = x(1 + \delta_x) y(1 + \delta_y)\) has error

\[ x_a y_a - x y = x y (\delta_x + \delta_y + \delta_x \delta_y). \]

Thus the relative error in the product is

\[ |\delta_{xy}| = \left| \frac{x_a y_a - x y}{x y} \right| = |\delta_x + \delta_y + \delta_x \delta_y| \leq |\delta_x| + |\delta_y| + |\delta_x| |\delta_y|. \]

For example, if the initial errors are due only to rounding, \(|\delta_x| \leq u = 2^{-p}\) and similarly for \(|\delta_y|\), so the relative error in \(x_a y_a\) is at most \(2u + u^2 = 2^{1-p} + 2^{-2p}\). In this and most situations, that final “product of errors” term \(\delta_x \delta_y\) is far smaller than the first two, giving to a very good approximation

\[ |\delta_{xy}| \lessapprox |\delta_x| + |\delta_y|. \]

This is the above-stated “sum of relative errors” result.

When the “input errors” in \(x_a\) and \(y_a\) come just from rounding to machine numbers, so that each has \(p\) bits of precision, \(|\delta_x|, |\delta_y| \leq 1/2^p\) and the error bound for the product is \(1/2^{p-1}\): at most one bit of precision is lost.
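This bound is easy to test numerically. In the sketch below, the exact values \(1/3\) and \(1/7\) are represented in higher precision with `BigFloat`; the computed product carries the two input rounding errors plus one more from rounding the product itself, so its relative error should be roughly at most \(3u\):

```julia
# The inputs x_a = fl(1/3) and y_a = fl(1/7) each carry relative error
# at most u; the computed product x_a * y_a adds one more rounding,
# so its relative error should be at most about 3u.
u = big(2.0)^-53
x, y = big(1)/3, big(1)/7        # high-precision stand-ins for exact values
xa, ya = Float64(x), Float64(y)  # rounded to machine numbers
product = xa * ya                # one more rounding happens here
rel_err = abs((big(product) - x*y) / (x*y))
println("relative error of the product = $rel_err, bound 3u = $(3u)")
```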

#### 9.5.2.1. Exercise 1#

Derive the corresponding result for quotients.

### 9.5.3. Propagation of error in sums (of positive numbers)#

With \(x_a\) and \(y_a\) as above (and positive), the approximate sum \(x_a + y_a\) has error

\[ (x_a + y_a) - (x + y) = (x_a - x) + (y_a - y), \]

so the absolute error is bounded by \(|x_a - x| + |y_a - y|\): the sum of the absolute errors.

For the relative errors, express this error as

\[ (x_a + y_a) - (x + y) = x \delta_x + y \delta_y. \]

Let \(\delta\) be the maximum of the relative errors, \(\delta = \max(|\delta_x|, |\delta_y|)\); then the absolute error is at most \((|x| + |y|) \delta = (x+y)\delta\) and so the relative error is at most

\[ \frac{(x+y)\delta}{x+y} = \delta = \max(|\delta_x|, |\delta_y|). \]

That is, *the relative error in the sum is at most the larger of the relative errors*, again as advertised above.

When the “input errors” in \(x_a\) and \(y_a\) come just from rounding to machine numbers, the error bound for the sum is no larger: no precision is lost! Thus, if you take any collection of non-negative numbers, round them to machine numbers so that each has relative error at most \(u\), then the sum of these rounded values also has relative error at most \(u\).

### 9.5.4. Propagation of error in differences (of positive numbers): loss of significance/loss of precision#

The above calculation for the absolute error works fine regardless of the signs of the numbers, so the absolute error of a difference is still bounded by the sum of the absolute errors:

\[ |(x_a - y_a) - (x - y)| \leq |x_a - x| + |y_a - y|. \]

But for subtraction, the denominator in the relative error formulas can be far smaller. WLOG let \(x > y > 0\). The relative error bound is

\[ \left| \frac{(x_a - y_a) - (x - y)}{x - y} \right| = \frac{|x \delta_x - y \delta_y|}{x - y} \leq \frac{x |\delta_x| + y |\delta_y|}{x - y}. \]

Clearly if \(x-y\) is far smaller than \(x\) or \(y\), this can be far larger than the “input” relative errors \(|\delta_x|\) and \(|\delta_y|\).

The extreme case is where the values \(x\) and \(y\) round to the same value, so that \(x_a - y_a = 0\), and the relative error is 1: “100% error”, a case of *catastrophic cancellation.*
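This worst case is easy to produce. In the example below (an illustrative choice, not from the text), the perturbation \(10^{-17}\) is below the machine unit, so \(x\) rounds to exactly 1 and the computed difference is exactly zero, even though the true difference is not:

```julia
# Catastrophic cancellation: 1.0 + 1.0e-17 differs from 1.0 only beyond the
# 53rd significant bit, so it rounds to the machine number 1.0 and the
# computed difference is exactly zero -- a 100% relative error.
x = 1.0 + 1.0e-17   # exact value 1 + 10^-17; rounds to 1.0
y = 1.0
println("x - y computed in Float64: $(x - y)")   # exact answer is 1.0e-17
```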

### 9.5.5. Exercise 2#

Let us move slightly away from the worst case scenario where the difference is exactly zero to one where it is close to zero;
this will illustrate the idea mentioned earlier that
*wherever a zero value is a problem in exact arithmetic, a very small value can be a problem in approximate arithmetic.*

For \(x = 8.024\) and \(y = 8.006\):

1. Round each to three significant figures, giving \(x_a\) and \(y_a\).
2. Compute the absolute errors in each of these approximations, and in their difference as an approximation of \(x - y\).
3. Compute the relative errors in each of these three approximations.
4. Then look at rounding to only two significant digits!

### 9.5.6. Upper and lower bounds on the relative error in subtraction#

The problem is worst when \(x\) and \(y\) are close in relative terms, in that \(y/x\) is close to 1. In the case of the errors in \(x_a\) and \(y_a\) coming just from rounding to machine numbers, we have:

**Theorem on Loss of Precision.**
Consider \(x > y > 0\) that are close in that they agree in at least \(q\) significant bits and at most \(r\) significant bits:

\[ 2^{-r} \leq 1 - \frac{y}{x} \leq 2^{-q}. \]

Then when rounded to machine numbers which are then subtracted, the relative error in that approximation of the difference is greater than that due to rounding by a factor of between \(2^q\) and \(2^r\).

That is, subtraction loses between \(q\) and \(r\) significant bits of precision.

### 9.5.7. Exercise 3#

(a) Illustrate why computing the roots of the quadratic equation \(ax^2 + bx + c = 0\) with the standard formula

\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

can sometimes give poor accuracy when evaluated using machine arithmetic such as IEEE-64 floating-point arithmetic. This is not always a problem, so identify specifically the situations when this could occur, in terms of a condition on the coefficients \(a\), \(b\), and \(c\). (It is sufficient to consider real values of the coefficients. Also, as an aside, there is no loss of precision problem when the roots are non-real, so you only need consider quadratics with real roots.)

(b) Then describe a careful procedure for always getting accurate answers. State the procedure first with words and mathematical formulas, and then express it in pseudo-code.

## 9.6. Example: Approximating derivatives#

To deal with differential equations, we will need to approximate the derivative of a function from just some values of the function itself. The simplest approach is suggested by the definition of the derivative

\[ Df(x) = f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \]

by using

\[ D_h f(x) = \frac{f(x+h) - f(x)}{h} \]

with a small value of \(h\) — but this inherently involves the difference of almost equal quantities, and so loss of significance.

Taylor’s theorem gives an error bound if we assume exact arithmetic — worse for larger \(h\). Then the above results give a measure of rounding error effects — worse for smaller \(h\).

This leads to the need to balance these error sources, to find an optimal choice for \(h\) and the corresponding error bound.

Denote by \(\tilde{D}_h f(x)\) the result of approximately calculating \(D_h f(x)\) with machine arithmetic.

The error in this as an approximation of the exact derivative is

\[ E = \tilde{D}_h f(x) - Df(x), \]

which we will consider as the sum of two pieces, \(E = E_A + E_D\), where

\[ E_A = \tilde{D}_h f(x) - D_h f(x) \]

is the error due to machine **A**rithmetic in evaluation of the difference quotient \(D_h f(x)\), and

\[ E_D = D_h f(x) - Df(x) \]

is the error in this difference quotient as an approximation of the exact derivative \(Df(x) = f'(x)\).
This error is sometimes called the *Discretization error* because it arises when we replace the derivative by a discrete algebraic calculation.

### 9.6.1. Bounding the Arithmetic error \(E_A\)#

The first source of error is rounding of \(f(x)\) to a machine number; as seen above, this gives \(f(x) (1 + \delta_1)\), with \(|\delta_1| \leq u\), so absolute error \(|f(x) \delta_1| \leq |f(x)| u\).

Similarly, \(f(x+h)\) is rounded to \(f(x+h) (1 + \delta_2)\), absolute error at most \(|f(x+h)| u\).

Since we are interested in fairly small values of \(h\) (to keep \(E_D\) under control), we can assume that \(|f(x+h)| \approx |f(x)|\), so this second absolute error is also very close to \(|f(x)| u\).

Then the absolute error in the difference in the numerator of \(D_h f(x)\) is at most \(2 |f(x)| u\) (or only a tiny bit greater).

Next the division. We can assume that \(h\) is an exact machine number, for example by choosing \(h\) to be a power of two, so that division by \(h\) simply shifts the power of two in the exponent part of the machine number. This has no effect on the relative error, but scales the absolute error by the factor \(1/h\) by which one is multiplying: the absolute error is now bounded by

\[ |E_A| \leq \frac{2 |f(x)| u}{h}. \]

This is a critical step: the difference has a small absolute error, which conceals a large relative error due to the difference being small; now the absolute error gets amplified greatly when \(h\) is small.

### 9.6.2. Bounding the Discretization error \(E_D\)#

As seen in Taylor’s Theorem and the Accuracy of Linearization — for the basic case of linearization — we have

\[ f(x+h) = f(x) + f'(x)\, h + \frac{f''(\xi_x)}{2} h^2, \quad \xi_x \text{ between } x \text{ and } x+h, \]

so

\[ E_D = D_h f(x) - f'(x) = \frac{f''(\xi_x)}{2} h, \]

and with \(M_2 = \max | f'' |\),

\[ |E_D| \leq \frac{M_2}{2} h. \]
### 9.6.3. Bounding the total absolute error, and minimizing it#

The above results combine to give an upper limit on how bad the total error can be:

\[ |E| \leq |E_A| + |E_D| \leq \frac{2 |f(x)| u}{h} + \frac{M_2}{2} h. \]

As anticipated, the errors go in opposite directions: decreasing \(h\) to reduce \(E_D\) makes \(E_A\) worse, and vice versa. Thus we can expect that there is a “goldilocks” value of \(h\) — neither too small nor too big — that gives the best overall bound on the total error.

To do this, let’s clean up the notation: let

\[ A = 2 |f(x)| u \quad \text{and} \quad B = \frac{M_2}{2}, \]

so that the error bound for a given value of \(h\) is

\[ E(h) = \frac{A}{h} + B h. \]

This can be minimized with a little calculus:

\[ E'(h) = -\frac{A}{h^2} + B, \]

which is zero only for the unique critical point

\[ h = h^* = \sqrt{\frac{A}{B}} = \sqrt{\frac{2 |f(x)| u}{M_2/2}} = 2 \sqrt{\frac{|f(x)|}{M_2}} \sqrt{u} = K \sqrt{u}, \]

using the short-hand \(\displaystyle K = 2 \sqrt{\frac{|f(x)|}{M_2}}\).

This is easily verified to give the global minimum of \(E(h)\); thus, the best error bound we can get is for this value of \(h\):

\[ E(h^*) = \frac{A}{K \sqrt{u}} + B K \sqrt{u} = \left( 2 \sqrt{M_2 |f(x)|} \right) \sqrt{u}. \]

### 9.6.4. Conclusions from this example#

In practical cases, we do not know the constant \(K\) or the coefficient of \(\sqrt{u}\) in parentheses — but that does not matter much!

The most important — and somewhat disappointing — observation here is that both the optimal size of \(h\) and the resulting error bound are roughly proportional to the square root of the machine unit \(u\). For example with \(p\) bits of precision, \(u = 2^{-p}\), the best error is of the order of \(2^{-p/2}\), or about \(p/2\) significant bits: at best we can hope for about half as many significant bits as our machine arithmetic gives.

In decimal terms: with IEEE-64 arithmetic, \(u = 2^{-53} \approx 10^{-16}\), giving about sixteen significant digits, and \(\sqrt{u} \approx 10^{-8}\), so \(\tilde{D}_h f(x)\) can only be expected to give about half as many: eight significant digits.
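These conclusions can be checked with a small experiment, here with the illustrative choice \(f(x) = e^x\) at \(x = 1\), so that the exact derivative is \(e\):

```julia
# Sweep h over powers of ten for f(x) = exp(x) at x = 1, where f'(1) = e.
# The error shrinks as h decreases until roughly h ~ sqrt(u) ~ 1e-8,
# then grows again as rounding error in the difference takes over.
f(x) = exp(x)
x = 1.0
exact = exp(1.0)
for h in [1e-2, 1e-5, 1e-8, 1e-11, 1e-14]
    Dh = (f(x + h) - f(x)) / h
    println("h = $h: error = $(abs(Dh - exact))")
end
```

The error is typically smallest near \(h \approx 10^{-8} \approx \sqrt{u}\), giving roughly eight significant digits, in line with the analysis above.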

This is a first indication of why machine arithmetic sometimes needs to be so precise — more precise than any physical measurement by a factor of well over a thousand.

It also shows that when we get to computing derivatives and solving differential equations, we will often need to do a better job of approximating derivatives!

This work is licensed under Creative Commons Attribution-ShareAlike 4.0 International