# 2.2. Machine Numbers, Rounding Error and Error Propagation#

Last revised on August 14, 2024

**References:**

- Sections 0.3 *Floating Point Representation of Real Numbers* and 0.4 *Loss of Significance* in [Sauer, 2022].
- Section 1.2 *Round-off Errors and Computer Arithmetic* of [Burden *et al.*, 2016].
- Sections 1.3 and 1.4 of [Chenney and Kincaid, 2012].

## 2.2.1. Introduction#

The naive Row Reduction algorithm has several related weaknesses which make it less robust and flexible than desired.

Most obviously, it can fail even when the equations are solvable, due to its naive insistence on always working from the top down. For example, as seen in Example 2.1 of that section, it fails with the system

$$
\begin{split}
0 \cdot x_1 + x_2 &= 1 \\
x_1 + x_2 &= 2
\end{split}
$$

because the formula for the first multiplier \(l_{2,1} = a_{2,1}/a_{1,1}\) gives \(1/0\).

Yet the equations are easily solvable, indeed with no reduction needed: the first equation just says \(x_2 = 1\), and then the second gives \(x_1 = 2 - x_2 = 1\).

All one has to do here to avoid this problem is change the order of the equations.
Indeed we will see that such reordering is *all that one ever needs to do,* so long as the original system has a unique solution.
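To make this concrete, here is a minimal Python sketch (a hypothetical helper, not code from the text) showing that naive elimination fails on this system as written but succeeds once the equations are swapped:

```python
# Minimal sketch (hypothetical helper): naive 2x2 elimination fails when
# a[0][0] == 0, but succeeds after swapping the rows.

def naive_solve_2x2(a, b):
    """Naive row reduction for a 2x2 system; fails if a[0][0] == 0."""
    l21 = a[1][0] / a[0][0]          # ZeroDivisionError when a[0][0] == 0
    a22 = a[1][1] - l21 * a[0][1]
    b2 = b[1] - l21 * b[0]
    x2 = b2 / a22
    x1 = (b[0] - a[0][1] * x2) / a[0][0]
    return [x1, x2]

A = [[0.0, 1.0], [1.0, 1.0]]   # the system 0*x1 + x2 = 1, x1 + x2 = 2
b = [1.0, 2.0]
try:
    naive_solve_2x2(A, b)
except ZeroDivisionError:
    print("Naive elimination fails: division by a[0][0] = 0")
# Swapping the two equations gives the same solution set, and elimination succeeds:
x = naive_solve_2x2([A[1], A[0]], [b[1], b[0]])
print(x)  # the solution x1 = 1, x2 = 1
```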

However, to develop a good strategy, we will also take account of errors introduced by rounding in computer arithmetic, so that is our next topic.

## 2.2.2. Robustness and well-posedness#

The above claim raises the concept of **robustness** and the importance of both existence and uniqueness of solutions.

**Definition (Well-Posed)**

A problem is **well-posed** if it is stated in a way that guarantees it has a unique solution.
(Note that this might include asking for the set of all solutions, such as asking for all roots of a polynomial.)

For example, the problem of finding the root of a continuous, monotonic function \(f:[a, b] \to \mathbb{R}\) with \(f(a)\) and \(f(b)\) of opposite sign is well-posed. Note the care taken with details to ensure both existence and uniqueness of the solution.

**Definition (Robust)**

An algorithm for solving a class of problems is **robust** if it is guaranteed to solve any well-posed problem in the class.

For example, the bisection method is robust for the above class of problems. On the other hand, Newton’s method is not, and if we dropped the specification of monotonicity (so allowing multiple solutions) then the bisection method in its current form would not be robust: it would fail whenever there is more than one solution in the interval \([a, b]\).
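As a concrete illustration, here is a minimal bisection sketch (an assumed implementation, not code from the text); given only continuity and a sign change, it cannot fail, which is the robustness being claimed:

```python
def bisection(f, a, b, tol=1e-12):
    """Bisection for a continuous f with f(a), f(b) of opposite sign.
    Robust: every step keeps a sign change in the bracket, so it always converges."""
    fa = f(a)
    assert fa * f(b) < 0, "need a sign change on [a, b]"
    while b - a > tol:
        c = (a + b) / 2
        fc = f(c)
        if fc == 0:
            return c
        if fa * fc < 0:      # the sign change (and so a root) is in [a, c]
            b = c
        else:                # the sign change is in [c, b]
            a, fa = c, fc
    return (a + b) / 2

root = bisection(lambda x: x**2 - 2, 0, 2)
print(root)  # close to sqrt(2) = 1.414...
```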

## 2.2.3. Rounding error and accuracy problems due to “loss of significance”#

There is a second, slightly less obvious problem with the naive algorithm for Gaussian elimination, closely related to the first. As soon as the algorithm is implemented using any rounding in the arithmetic (rather than, say, working with exact arithmetic on rational numbers), division by values that are very close to zero can lead to very large intermediate values, which thus have very few correct decimals (correct bits); that is, very large absolute errors. These large errors can then propagate, leading to low accuracy in the final results, as seen in Example 2.2 and Example 2.4 of the section on row reduction.

This is the hazard of *loss of significance*, discussed in
Section 0.4 of [Sauer, 2022]
and
Section 1.4 of [Chenney and Kincaid, 2012].

So it is time to take Step 2 of the strategy described in the previous section:

### 2. Refine to get a more **robust** algorithm#

Identify cases that can lead to failure due to division by zero and such, and revise to avoid them.

Avoid inaccuracy due to problems like severe rounding error. One rule of thumb is that anywhere that a zero value is a fatal flaw (in particular, division by zero), a very small value is also a hazard when rounding error is present. So avoid very small denominators. …
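To preview why small denominators matter, here is a hypothetical 2x2 example in IEEE-64 arithmetic (the numbers are illustrative, not from the text): the pivot is tiny rather than zero, naive elimination gives a badly wrong answer, and reordering the equations repairs it.

```python
# Hypothetical illustration of the rule of thumb: a tiny (not zero) pivot
# ruins the answer in IEEE-64 arithmetic, and swapping the equations repairs it.
eps = 1e-17   # system: eps*x1 + x2 = 1,  x1 + x2 = 2; true solution near x1 = x2 = 1

# Naive elimination, dividing by the tiny pivot eps:
l21 = 1.0 / eps                                # huge multiplier, about 1e17
x2 = (2.0 - l21 * 1.0) / (1.0 - l21 * 1.0)     # rounds to exactly 1.0
x1 = (1.0 - x2) / eps                          # catastrophic: gives 0.0, not 1.0
print("naive:  ", x1, x2)

# After swapping the equations, the multiplier is the harmless eps:
l21 = eps / 1.0
x2_good = (1.0 - l21 * 2.0) / (1.0 - l21 * 1.0)
x1_good = 2.0 - x2_good
print("swapped:", x1_good, x2_good)
```

The large multiplier \(1/\epsilon\) swamps the other entries, so the information in the second equation is rounded away; with the rows swapped, the multiplier is tiny and nothing is lost.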

## 2.2.4. The essentials of machine numbers and rounding in machine arithmetic#

As a very quick summary, standard computer arithmetic handles real numbers using *binary machine numbers* with \(p\) significant bits, and rounding off of other numbers to such *machine numbers* introduces a relative error of at most \(u := 2^{-p}\),
called the **machine unit**.

The current dominant choice for machine numbers and arithmetic is “IEEE-64”, the IEEE double-precision binary floating-point format. This uses 64 bits in total, with \(p=53\) significant bits, so that \(u \approx 1.11 \cdot 10^{-16}\), giving about sixteen significant decimal digits. (The other 11 bits are used for an exponent, the sign, and some handling of special cases.)

For historical reasons, this is known as “double-precision floating-point format”, from the era when computers typically used 32-bit words, so that the standard precision floating point numbers used only one 32-bit word. With other systems of binary machine numbers—like those 32-bit versions, or higher precision options like 128 bits—the significant differences are mostly determined by the number \(p\) of significant bits, via the machine unit \(u = 1/2^p\).

So far, we ignore the problems with real numbers whose magnitude is too large or too small to be represented: **underflow** and **overflow**. Since the allowable range of magnitudes in IEEE-64 is from \(2^{-1022} \approx 2.2 \cdot 10^{-308}\) to \(2^{1024} \approx 1.8 \cdot 10^{308}\), this is rarely a problem in practice, but if you are curious about how such details are handled, see IEEE 64-bit numbers: more details and some experiments below, or the above reference on IEEE-64 numbers and arithmetic.

### Binary floating point machine numbers#

The basic representation is a binary version of the familiar *scientific* or *decimal floating point* notation: in place of the decimal form
\(\pm d_0 . d_1 d_2 \dots d_{p-1} \times 10^e\),
where the **fractional part** or **mantissa** is
\(f = d_0 . d_1 d_2 \dots d_{p-1} = d_0 + \frac{d_1}{10} + \cdots + \frac{d_{p-1}}{10^{p-1}}\),

**binary floating point machine numbers** with \(p\) significant bits can be described as

$$
\pm (b_0 . b_1 b_2 \dots b_{p-1})_2 \times 2^e
= \pm \left( b_0 + \frac{b_1}{2} + \cdots + \frac{b_{p-1}}{2^{p-1}} \right) \times 2^e,
\qquad b_i \in \{0, 1\}.
$$

Just as decimal floating point numbers are typically written in *normalized* form, with the exponent chosen to have non-zero leading digit \(d_0 \neq 0\), binary floating point machine numbers are normalized by choosing the exponent \(e\) so that \(b_0 \neq 0\).
But then \(b_0\) is known to be \(1\), and so need not be stored: only \(p-1\) bits need to be stored for the mantissa.

The sign is then encoded with one bit \(s\) via the factor \((-1)^s\), so \(s=0\) for a positive number, \(s=1\) for negative. Putting this all together, the number format is

$$
(-1)^s \, (1 . b_1 b_2 \dots b_{p-1})_2 \times 2^e.
$$

A few details are omitted here, above all the fact that the number zero cannot be put in this form! Those are not important to what we do here, but if you are curious, again see IEEE 64-bit numbers: more details and some experiments below or the above reference on IEEE-64.
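If you are curious, the three stored pieces (sign bit, exponent bits, mantissa bits) can be inspected directly in Python with the standard `struct` module; this is just an exploratory sketch, with `float_bits` a hypothetical helper name:

```python
import struct

def float_bits(x):
    """Split an IEEE-64 float into its sign bit, unbiased exponent, and 52 mantissa bits."""
    (n,) = struct.unpack(">Q", struct.pack(">d", x))   # the 64 bits, as an integer
    sign = n >> 63
    exponent_bits = (n >> 52) & 0x7FF    # 11 bits, stored with an offset ("bias") of 1023
    mantissa_bits = n & ((1 << 52) - 1)  # the 52 stored bits b_1 ... b_52
    return sign, exponent_bits - 1023, mantissa_bits

print(float_bits(1.0))    # (0, 0, 0): +1 = (1.000...)_2 * 2^0
print(float_bits(-2.5))   # sign 1, exponent 1: -2.5 = -(1.01)_2 * 2^1
```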

### Worst case rounding error#

It turns out that the relative errors are determined solely by the number of significant bits in the mantissa, regardless of the exponent, so we look at that part first.

#### Rounding error in the mantissa, \((1. b_1 b_2 \dots b_{p-1})_2\)#

The spacing of consecutive mantissa values \((1. b_1 b_2 \dots b_{p-1})_2\) is one in the last bit, or \(2^{1-p}\).
Thus rounding of any intermediate value \(x\) to the nearest number of this form introduces an absolute error of at most half of this: \(u = 2^{-p}\), which is called the **machine unit**.

How large can the *relative* error be?
It is largest for the smallest possible denominator, which is \((1.00 \dots 0)_2 = 1\), so the relative error due to rounding is also at most \(2^{-p}\).

#### Rounding error in general, for \( \pm (1. b_1 b_2 \dots b_{p-1})_2 \cdot 2^e\).#

The sign has no effect on the absolute error, and the exponent changes the spacing of consecutive machine numbers by a factor of \(2^e\). This scales the maximum possible absolute error to \(2^{e-p}\), but in the relative error calculation, the smallest possible denominator is also scaled up to \(2^e\), so the largest possible relative error is again the machine unit, \(u = 2^{-p}\).

One way to describe the machine unit \(u\) (sometimes called *machine epsilon*) is to note that the next number above \(1\) is \(1 + 2^{1-p} = 1 + 2u\).
Thus \(1+u\) is at the threshold between rounding down to 1 and rounding up to a higher value.

### IEEE 64-bit numbers: more details and some experiments#

In the standard IEEE-64 number system:

- 64-bit words are used to store real numbers (a.k.a. *floating point* numbers, sometimes called *floats*).
- There are \(p=53\) bits of precision, needing 52 bits to store the mantissa (fractional part).
- The sign is stored with one bit \(s\): effectively a factor of \((-1)^s\), so \(s=0\) for positive, \(s=1\) for negative.
- The remaining 11 bits are used for the exponent, which allows for \(2^{11} = 2048\) possibilities; these are chosen in the range \(-1023 \leq e \leq 1024\).
- However, so far, this does not allow for the value zero! This is handled by giving a special meaning to the smallest exponent \(e=-1023\), so the smallest exponent for *normalized* numbers is \(e = -1022\).
- At the other extreme, the largest exponent \(e=1024\) is used to encode “infinite” numbers, which can arise when a calculation gives a value too large to represent. (Python displays these as `inf` and `-inf`.) This exponent is also used to encode “Not a Number”, for situations like trying to divide zero by zero or multiply zero by `inf`.
- Thus, the exponential factors for normalized numbers are in the range \(2^{-1022} \approx 2 \times 10^{-308}\) to \(2^{1023} \approx 9 \times 10^{307}\). Since the mantissa ranges from 1 to just under 2, the range of magnitudes of normalized real numbers is thus from \(2^{-1022} \approx 2 \times 10^{-308}\) to just under \(2^{1024} \approx 1.8 \times 10^{308}\).

Some computational experiments:

```
p = 53
u = 2**(-p)
print(f"For IEEE-64 arithmetic, there are {p} bits of precision and the machine unit is u={u},")
print(f"and the next numbers above 1 are 1+2u = {1+2*u}, 1+4u = {1+4*u} and so on.")
for factor in [3, 2, 1.00000000001, 1]:
    onePlusSmall = 1 + factor * u
    print(f"1 + {factor}u rounds to {onePlusSmall}")
    difference = onePlusSmall - 1
    print(f"\tThis is more than 1 by {difference:.4}, which is {difference/u} times u")
```

```
For IEEE-64 arithmetic, there are 53 bits of precision and the machine unit is u=1.1102230246251565e-16,
and the next numbers above 1 are 1+2u = 1.0000000000000002, 1+4u = 1.0000000000000004 and so on.
1 + 3u rounds to 1.0000000000000004
This is more than 1 by 4.441e-16, which is 4.0 times u
1 + 2u rounds to 1.0000000000000002
This is more than 1 by 2.22e-16, which is 2.0 times u
1 + 1.00000000001u rounds to 1.0000000000000002
This is more than 1 by 2.22e-16, which is 2.0 times u
1 + 1u rounds to 1.0
This is more than 1 by 0.0, which is 0.0 times u
```

```
print("On the other side, the spacing is halved:")
print(f"the next numbers below 1 are 1-u = {1-u}, 1-2u = {1-2*u} and so on.")
for factor in [2, 1, 1.00000000001/2, 1/2]:
    oneMinusSmall = 1 - factor * u
    print(f"1 - {factor}u rounds to {oneMinusSmall}")
    difference = 1 - oneMinusSmall
    print(f"\tThis is less than 1 by {difference:.4}, which is {difference/u} times u ")
```

```
On the other side, the spacing is halved:
the next numbers below 1 are 1-u = 0.9999999999999999, 1-2u = 0.9999999999999998 and so on.
1 - 2u rounds to 0.9999999999999998
This is less than 1 by 2.22e-16, which is 2.0 times u
1 - 1u rounds to 0.9999999999999999
This is less than 1 by 1.11e-16, which is 1.0 times u
1 - 0.500000000005u rounds to 0.9999999999999999
This is less than 1 by 1.11e-16, which is 1.0 times u
1 - 0.5u rounds to 1.0
This is less than 1 by 0.0, which is 0.0 times u
```

Next, look at the extremes of very small and very large magnitudes:

```
print(f"The smallest normalized positive number is {2**(-1022)=}")
print(f"The largest mantissa is binary (1.1111...) with 53 ones: {2 - 2**(-52)=:0.20}...")
print(f"The largest normalized number is {(2 - 2**(-52))*2.**1023=}")
print(f"If instead we round that mantissa up to 2 and try again, we get {2*2.**1023=}")
```

```
The smallest normalized positive number is 2**(-1022)=2.2250738585072014e-308
The largest mantissa is binary (1.1111...) with 53 ones: 2 - 2**(-52)=1.999999999999999778...
The largest normalized number is (2 - 2**(-52))*2.**1023=1.7976931348623157e+308
If instead we round that mantissa up to 2 and try again, we get 2*2.**1023=inf
```

What happens if we compute positive numbers smaller than that smallest normalized positive number \(2^{-1022}\)?

```
for S in [0, 1, 2, 52, 53]:
    exponent = -1022-S
    print(f" 2**(-1022-{S}) = 2**({exponent}) = {2**(exponent)}")
```

```
2**(-1022-0) = 2**(-1022) = 2.2250738585072014e-308
2**(-1022-1) = 2**(-1023) = 1.1125369292536007e-308
2**(-1022-2) = 2**(-1024) = 5.562684646268003e-309
2**(-1022-52) = 2**(-1074) = 5e-324
2**(-1022-53) = 2**(-1075) = 0.0
```

These extremely small values are called *denormalized numbers*.
Numbers with exponential factor \(2^{-1022-S}\) have a fractional part with \(S\) leading zeros, so only \(p-S\) significant bits.
So when the shift \(S\) reaches \(p=53\), there are no significant bits left, and the value is truly zero.

## 2.2.5. Propagation of error in arithmetic#

The only errors in the results of row reduction come from errors in the initial data (\(a_{ij}\) and \(b_i\)) and from when the results of subsequent arithmetic operations are rounded to machine numbers. Here, we consider how errors from either source are propagated — and perhaps amplified — in subsequent arithmetic operations and rounding.

In summary:

- When *multiplying* two numbers, the relative error in the product is no worse than slightly more than the sum of the relative errors in the numbers multiplied. (To be pedantic, it is at most the sum of those relative errors plus their product, but that last piece is typically far smaller.)
- When *dividing* two numbers, the relative error in the quotient is again no worse than slightly more than the sum of the relative errors in the numbers divided.
- When *adding* two **positive** numbers, the relative error is no more than the larger of the relative errors in the numbers added, and the absolute error in the sum is no larger than the sum of the absolute errors.
- When *subtracting* two **positive** numbers, the absolute error is again no larger than the sum of the absolute errors in the numbers subtracted, **but the relative error can get far worse!**

Due to the differences between the last two cases, this discussion of error propagation will use “addition” to refer only to adding numbers of the same sign, and “subtraction” when subtracting numbers of the same sign.

More generally, an operation on numbers of mixed signs can be rewritten in terms of a pair of positive numbers, so WLOG we assume below that all input values are positive.

### Notation: \(x_a = x(1 + \delta_x)\) for errors and \(fl(x)\) for rounding#

Two notations will be useful.

Firstly, for any approximation \(x_a\) of a real value \(x\), let \(\displaystyle\delta_x = \frac{x_a - x}{x}\), so that \(x_a = x(1 + \delta_x)\).

Thus, \(|\delta_x|\) is the relative error, and \(\delta_x\) helps keep track of the sign of the error.

Also, introduce the function \(fl(x)\) which does rounding to the nearest machine number. For the case of the approximation \(x_a = fl(x)\) to \(x\) given by rounding, the above results on machine numbers then give the bound \(|\delta_x| \leq u = 2^{-p}\).
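This bound can be checked exactly in Python: conversion to `float` performs the rounding \(fl(x)\), and the standard `fractions` module records the result exactly (a small sketch, not from the text):

```python
from fractions import Fraction

u = Fraction(1, 2**53)     # the machine unit for IEEE-64, as an exact rational

x = Fraction(1, 10)        # exact 1/10, which is not a binary machine number
fl_x = Fraction(0.1)       # Fraction(float) recovers the rounded value fl(1/10) exactly
delta = (fl_x - x) / x     # the exact relative error delta_x
print(float(delta))        # tiny, and within the bound |delta_x| <= u
assert abs(delta) <= u
```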

### Propagation of error in products#

Let \(x\) and \(y\) be exact quantities, and \(x_a = x(1 + \delta_x)\), \(y_a = y(1 + \delta_y)\) be approximations. The approximate product \((xy)_a = x_a y_a = x(1 + \delta_x) y(1 + \delta_y)\) has error

$$
x_a y_a - x y = x y (1 + \delta_x)(1 + \delta_y) - x y = x y (\delta_x + \delta_y + \delta_x \delta_y).
$$

Thus the relative error in the product is

$$
|\delta_{xy}| = |\delta_x + \delta_y + \delta_x \delta_y| \leq |\delta_x| + |\delta_y| + |\delta_x| |\delta_y|.
$$

For example, if the initial errors are due only to rounding, \(|\delta_x| \leq u = 2^{-p}\) and similarly for \(|\delta_y|\), so the relative error in \(x_a y_a\) is at most \(2u + u^2 = 2^{1-p} + 2^{-2p}\). In this and most situations, that final “product of errors” term \(\delta_x \delta_y\) is far smaller than the first two, giving to a very good approximation

$$
|\delta_{xy}| \lessapprox |\delta_x| + |\delta_y|.
$$

This is the above stated “sum of relative errors” result.

When the “input errors” in \(x_a\) and \(y_a\) come just from rounding to machine numbers, so that each has \(p\) bits of precision, \(|\delta_x|, |\delta_y| \leq 1/2^p\) and the error bound for the product is \(1/2^{p-1}\): at most one bit of precision is lost.
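A quick exact check of this product bound, again using the `fractions` module so that no further rounding sneaks in (an illustrative sketch):

```python
from fractions import Fraction

u = Fraction(1, 2**53)                  # machine unit for IEEE-64
x, y = Fraction(1, 3), Fraction(1, 7)   # exact values
xa, ya = Fraction(1/3), Fraction(1/7)   # rounded to machine numbers, recorded exactly

# Relative error of the exact product of the rounded values:
delta_product = abs(xa * ya - x * y) / (x * y)
print(float(delta_product))
assert delta_product <= 2*u + u**2      # the bound |delta_x| + |delta_y| + |delta_x||delta_y|
```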

#### Exercise 1#

Derive the corresponding result for quotients.

### Propagation of error in sums (of positive numbers)#

With \(x_a\) and \(y_a\) as above (and positive), the approximate sum \(x_a + y_a\) has error

$$
(x_a + y_a) - (x + y) = (x_a - x) + (y_a - y),
$$

so the absolute error is bounded by \(|x_a - x| + |y_a - y|\): the sum of the absolute errors.

For the relative errors, express this error as

$$
(x_a + y_a) - (x + y) = x \delta_x + y \delta_y.
$$

Let \(\delta\) be the maximum of the relative errors, \(\delta = \max(|\delta_x|, |\delta_y|)\); then the absolute error is at most \((|x| + |y|) \delta = (x+y)\delta\) and so the relative error is at most

$$
\frac{(x + y) \delta}{x + y} = \delta = \max(|\delta_x|, |\delta_y|).
$$

That is, *the relative error in the sum is at most the larger of the relative errors in the terms*, again as advertised above.

When the “input errors” in \(x_a\) and \(y_a\) come just from rounding to machine numbers, the error bound for the sum is no larger: no precision is lost! Thus, if you take any collection of non-negative numbers, round them to machine numbers so that each has relative error at most \(u\), then the sum of these rounded values also has relative error at most \(u\).
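This claim about the sum of rounded values can also be verified exactly with the `fractions` module (a small sketch; note that, as in the claim, the sum of the rounded values is computed exactly here, with no further rounding):

```python
from fractions import Fraction

u = Fraction(1, 2**53)                                       # machine unit for IEEE-64
exact = [Fraction(1, 10), Fraction(1, 3), Fraction(2, 7)]    # non-negative exact values
rounded = [Fraction(float(v)) for v in exact]                # each has relative error <= u

# Relative error of the exact sum of the rounded values:
rel_error = abs(sum(rounded) - sum(exact)) / sum(exact)
print(float(rel_error))
assert rel_error <= u    # no precision lost in the sum
```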

### Propagation of error in differences (of positive numbers): loss of significance/loss of precision#

The above calculation for the absolute error works fine regardless of the signs of the numbers, so the absolute error of a difference is still bounded by the sum of the absolute errors:

$$
|(x_a - y_a) - (x - y)| \leq |x_a - x| + |y_a - y|.
$$

But for subtraction, the denominator in the relative error formulas can be far smaller. WLOG let \(x > y > 0\). The relative error bound is

$$
\frac{|(x_a - y_a) - (x - y)|}{|x - y|} = \frac{|x \delta_x - y \delta_y|}{x - y} \leq \frac{x |\delta_x| + y |\delta_y|}{x - y}.
$$

Clearly if \(x-y\) is far smaller than \(x\) or \(y\), this can be far larger than the “input” relative errors \(|\delta_x|\) and \(|\delta_y|\).

The extreme case is where the values \(x\) and \(y\) round to the same value, so that \(x_a - y_a = 0\), and the relative error is 1: “100% error”, a case of **catastrophic cancellation.**
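Catastrophic cancellation is easy to trigger deliberately (a small illustrative experiment): in each line below, the quantity added is at most half the spacing of the machine numbers at that magnitude, so it is rounded away entirely and the computed difference is exactly zero.

```python
u = 2**-53
# 1 + u is at the rounding threshold, so it rounds down to exactly 1.0 ...
print((1.0 + u) - 1.0)           # gives 0.0: the entire value u is lost
# ... and at larger magnitudes the same thing happens with much bigger quantities:
print((1.0e16 + 1.0) - 1.0e16)   # also 0.0: machine numbers near 1e16 are 2 apart
```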

### Exercise 2#

Let us move slightly away from the worst case scenario where the difference is exactly zero to one where it is close to zero;
this will illustrate the idea mentioned earlier that
*wherever a zero value is a problem in exact arithmetic, a very small value can be a problem in approximate arithmetic.*

For \(x = 80.24\) and \(y = 80.06\):

1. Round each to three significant figures, giving \(x_a\) and \(y_a\).
2. Compute the absolute errors in each of these approximations, and in their difference as an approximation of \(x - y\).
3. Compute the relative errors in each of these three approximations.
4. Then look at rounding to only two significant digits (rounding to integers)!

### Upper and lower bounds on the relative error in subtraction#

The problem is worst when \(x\) and \(y\) are close in relative terms, in that \(y/x\) is close to 1. In the case of the errors in \(x_a\) and \(y_a\) coming just from rounding to machine numbers, we have:

**Theorem (Loss of Precision)**

Consider \(x > y > 0\) that are close in that they agree in at least \(q\) significant bits and at most \(r\) significant bits:

$$
2^{-r} \leq 1 - \frac{y}{x} \leq 2^{-q}.
$$

Then when they are rounded to machine numbers which are then subtracted, the relative error in that approximation of the difference is greater than that due to rounding by a factor of between \(2^q\) and \(2^r\).

That is, subtraction loses between \(q\) and \(r\) significant bits of precision.

### Exercise 3#

(a) Illustrate why computing the roots of the quadratic equation \(ax^2 + bx + c = 0\) with the standard formula

$$
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
$$

can sometimes give poor accuracy when evaluated using machine arithmetic such as IEEE-64 floating-point arithmetic. This is not always a problem, so identify specifically the situations when this could occur, in terms of a condition on the coefficients \(a\), \(b\), and \(c\). (It is sufficient to consider real values of the coefficients. Also, as an aside, there is no loss of precision problem when the roots are non-real, so you only need consider quadratics with real roots.)

(b) Then describe a careful procedure for always getting accurate answers. State the procedure first with words and mathematical formulas, and then express it in pseudo-code.

**Example (Errors when approximating derivatives)**

To deal with differential equations, we will need to approximate the derivative of a function from just some values of the function itself. The simplest approach is suggested by the definition of the derivative

$$
Df(x) = f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h},
$$

by using

$$
D_h f(x) := \frac{f(x+h) - f(x)}{h}
$$

with a small value of \(h\) — but this inherently involves the difference of almost equal quantities, and so loss of significance.

Taylor’s theorem gives an error bound if we assume exact arithmetic — worse for larger \(h\). Then the above results give a measure of rounding error effects — worse for smaller \(h\).

This leads to the need to balance these error sources, to find an optimal choice for \(h\) and the corresponding error bound.

Denote by \(\tilde{D}_h f(x)\) the result of approximately calculating \(D_h f(x)\) with machine arithmetic.

The error in this as an approximation of the exact derivative is

$$
E = \tilde{D}_h f(x) - D f(x),
$$

which we will consider as the sum of two pieces, \(E = E_A + E_D\), where

$$
E_A = \tilde{D}_h f(x) - D_h f(x)
$$

is the error due to machine **A**rithmetic in evaluation of the difference quotient \(D_h f(x)\), and

$$
E_D = D_h f(x) - D f(x)
$$

is the error in this difference quotient as an approximation of the exact derivative \(D f(x) = f'(x)\).
This error is sometimes called the **discretization error** because it arises when we replace the derivative by a discrete algebraic calculation.

**Bounding the Arithmetic error \(E_A\)**

The first source of error is rounding of \(f(x)\) to a machine number; as seen above, this gives \(f(x) (1 + \delta_1)\), with \(|\delta_1| \leq u\), so absolute error \(|f(x) \delta_1| \leq |f(x)| u\).

Similarly, \(f(x+h)\) is rounded to \(f(x+h) (1 + \delta_2)\), absolute error at most \(|f(x+h)| u\).

Since we are interested in fairly small values of \(h\) (to keep \(E_D\) under control), we can assume that \(|f(x+h)| \approx |f(x)|\), so this second absolute error is also very close to \(|f(x)| u\).

Then the absolute error in the difference in the numerator of \(D_h f(x)\) is at most \(2 |f(x)| u\) (or only a tiny bit greater).

Next the division. We can assume that \(h\) is an exact machine number, for example by choosing \(h\) to be a power of two, so that division by \(h\) simply shifts the power of two in the exponent part of the machine number. This has no effect on the relative error, but scales the absolute error by the factor \(1/h\) by which one is multiplying: the absolute error is now bounded by

$$
|E_A| \leq \frac{2 |f(x)| u}{h}.
$$

This is a critical step: the difference has a small absolute error, which conceals a large relative error due to the difference being small; now the absolute error gets amplified greatly when \(h\) is small.

**Bounding the Discretization error \(E_D\)**

As seen in Taylor’s Theorem and the Accuracy of Linearization — for the basic case of linearization — we have

$$
f(x+h) = f(x) + f'(x) h + \frac{f''(\xi)}{2} h^2, \quad \text{for some } \xi \text{ between } x \text{ and } x+h,
$$

so

$$
E_D = D_h f(x) - f'(x) = \frac{f''(\xi)}{2} h,
$$

and with \(M_2 = \max | f'' |\),

$$
|E_D| \leq \frac{M_2}{2} h.
$$

**Bounding the total absolute error, and minimizing it**

The above results combine to give an upper limit on how bad the total error can be:

$$
|E| \leq |E_A| + |E_D| \leq \frac{2 |f(x)| u}{h} + \frac{M_2}{2} h.
$$

As anticipated, the errors go in opposite directions: decreasing \(h\) to reduce \(E_D\) makes \(E_A\) worse, and vice versa. Thus we can expect that there is a “goldilocks” value of \(h\) — neither too small nor too big — that gives the best overall bound on the total error.

To do this, let’s clean up the notation: let

$$
A = 2 |f(x)| u, \qquad B = \frac{M_2}{2},
$$

so that the error bound for a given value of \(h\) is

$$
E(h) = \frac{A}{h} + B h.
$$

This can be minimized with a little calculus:

$$
E'(h) = -\frac{A}{h^2} + B,
$$

which is zero only for the unique critical point

$$
h = h^* = \sqrt{\frac{A}{B}} = \sqrt{\frac{2 |f(x)| u}{M_2/2}} = K \sqrt{u},
$$

using the short-hand \(\displaystyle K = 2 \sqrt{\frac{|f(x)|}{M_2}}\).

This is easily verified to give the global minimum of \(E(h)\); thus, the best error bound we can get is for this value of \(h\):

$$
E(h^*) = \frac{A}{h^*} + B h^* = 2 \sqrt{A B} = \left( 2 \sqrt{|f(x)| M_2} \right) \sqrt{u}.
$$

### Conclusions from this example#

In practical cases, we do not know the constant \(K\) or the coefficient of \(\sqrt{u}\) in parentheses — but that does not matter much!

The most important — and somewhat disappointing — observation here is that both the optimal size of \(h\) and the resulting error bound are roughly proportional to the square root of the machine unit \(u\). For example with \(p\) bits of precision, \(u = 2^{-p}\), the best error is of the order of \(2^{-p/2}\), or about \(p/2\) significant bits: at best we can hope for about half as many significant bits as our machine arithmetic gives.

In decimal terms: with IEEE-64 arithmetic, \(u = 2^{-53} \approx 10^{-16}\), giving about sixteen significant digits, and \(\sqrt{u} \approx 10^{-8}\), so \(\tilde{D}_h f(x)\) can only be expected to give about half as many: eight significant digits.
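This square root behavior is easy to observe numerically (an experiment in the spirit of those above; the test function \(f = \sin\) and the point \(x = 1\) are arbitrary illustrative choices): the error of the difference quotient is smallest for \(h\) near \(10^{-8} \approx \sqrt{u}\), and grows for both larger and smaller \(h\).

```python
from math import sin, cos

x = 1.0
exact = cos(x)                        # f = sin, so f'(x) = cos(x)
errors = {}
for k in range(1, 16):
    h = 10.0**(-k)
    D_h = (sin(x + h) - sin(x)) / h   # the forward difference quotient
    errors[k] = abs(D_h - exact)
    print(f"h = 1e-{k:02}: error = {errors[k]:.3e}")
# The error is smallest near h = 1e-8, where roughly eight digits are correct;
# for smaller h, loss of significance takes over and the error grows again.
```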

This is a first indication of why machine arithmetic sometimes needs to be so precise — more precise than any physical measurement by a factor of well over a thousand.

It also shows that when we get to computing derivatives and solving differential equations, we will often need to do a better job of approximating derivatives!

## 2.2.6. Further Exercises#

### Exercise 4

If one thousand non-negative numbers are entered into software that uses IEEE-64 floating point numbers and arithmetic and then added, what is the largest possible relative error in the result? Note that each intermediate result is computed as an IEEE-64 machine number, so there can be rounding at every step.

Express your result as the maximum number of decimal places that can be trusted in the computed value for the sum.