4.2. Machine Numbers, Rounding Error and Error Propagation#
References:
Sections 0.3 Floating Point Representation of Real Numbers and 0.4 Loss of Significance in [Sauer, 2019].
Section 1.2 Round-off Errors and Computer Arithmetic of [Burden et al., 2016].
Sections 1.3 and 1.4 of [Chenney and Kincaid, 2012].
Overview#
The naive Gaussian elimination algorithm seen in the section Row Reduction/Gaussian Elimination has several related weaknesses which make it less robust and flexible than desired.
Most obviously, it can fail even when the equations are solvable, due to its naive insistence on always working from the top down. For example, as seen in Example 4.1 of that section, it fails with the system
because the formula for the first multiplier requires division by the first coefficient of the first equation, which is zero in this example.
Yet the equations are easily solvable, indeed with no reduction needed: the first equation directly gives the value of one of the unknowns.
All one has to do here to avoid this problem is change the order of the equations. Indeed we will see that such reordering is all that one ever needs to do, so long as the original system of equations has a unique solution.
However, to develop a good strategy, we will also take account of errors introduced by rounding in computer arithmetic, so that is our next topic.
Robustness and well-posedness#
The above claim raises the concept of robustness and the importance of both existence and uniqueness of solutions.
Definition 4.3 (Well-Posed)
A problem is well-posed if it is stated in a way that it has a unique solution. (Note that this might include asking for the set of all solutions, such as asking for all roots of a polynomial.)
For example, the problem of finding the root of a continuous, monotonic function $f$ on an interval $[a, b]$ at whose endpoints $f$ has opposite signs is well-posed: such a root exists and is unique.
Definition 4.4 (Robust)
An algorithm for solving a class of problems is robust if it is guaranteed to solve any well-posed problem in the class.
For example, the bisection method is robust for the above class of problems.
On the other hand, Newton’s method is not, and if we dropped the specification of monotonicity (so allowing multiple solutions) then the bisection method in its current form would not be robust: it would fail whenever there is more than one solution in the interval $[a, b]$.
Rounding error and accuracy problems due to “loss of significance”#
There is a second, slightly less obvious problem with the naive algorithm for Gaussian elimination, closely related to the first. As soon as the algorithm is implemented using any rounding in the arithmetic (rather than, say, working with exact arithmetic on rational numbers), division by values that are very close to zero can lead to very large intermediate values, which thus have very few correct decimals (correct bits); that is, very large absolute errors. These large errors can then propagate, leading to low accuracy in the final results, as seen in Example 4.2 and Example 4.4 of Row Reduction/Gaussian Elimination.
This is the hazard of loss of significance, discussed in Section 0.4 of [Sauer, 2019] and Section 1.4 of [Chenney and Kincaid, 2012].
So it is time to take Step 2 of the strategy described in the previous notes:
2. Refine to get a more robust algorithm#
Identify cases that can lead to failure due to division by zero and such, and revise to avoid them.
Avoid inaccuracy due to problems like severe rounding error. One rule of thumb is that anywhere that a zero value is a fatal flaw (in particular, division by zero), a very small value is also a hazard when rounding error is present. So avoid very small denominators, as the sketch below illustrates. …
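To make this concrete, here is a minimal sketch (not code from these notes) of solving the $2 \times 2$ system $\epsilon x_1 + x_2 = 1$, $x_1 + x_2 = 2$, whose exact solution is very close to $x_1 = x_2 = 1$ when $\epsilon$ is tiny. Dividing by the tiny pivot $\epsilon$ destroys the accuracy of $x_1$, while simply swapping the two equations first gives full accuracy; the value $\epsilon = 10^{-20}$ is just an illustrative choice.

```python
# A minimal sketch (not from the text): naive elimination with a tiny pivot
# versus elimination after swapping the equations.
eps = 1e-20  # illustrative tiny pivot; the exact solution is x1 ≈ 1, x2 ≈ 1

# Naive order: eliminate using the tiny pivot eps.
l = 1.0 / eps                      # huge multiplier
x2 = (2 - l * 1) / (1 - l * 1)     # reduced second equation
x1 = (1 - x2) / eps                # back-substitution divides by eps again
print("naive order:  ", x1, x2)    # x1 comes out wildly wrong (typically 0.0)

# Swapped order: the pivot is now 1, and the multiplier is tiny.
l = eps / 1.0
x2 = (1 - l * 2) / (1 - l * 1)
x1 = 2 - x2
print("swapped order:", x1, x2)    # both accurate to machine precision
```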
The essentials of machine numbers and rounding in machine arithmetic#
As a very quick summary, standard computer arithmetic handles real numbers using binary machine numbers with $p = 53$ bits of precision, stored in 64-bit words.
(Note: in the above, I ignore the extra problems with real numbers whose magnitude is too large or too small to be represented: overflow and underflow. Since the allowable range of magnitudes is from about $10^{-308}$ to about $10^{308}$, this is rarely a problem in practice.)
With other systems of binary machine numbers (like older 32-bit versions, or higher precision options like 128 bits) the significant differences are mostly encapsulated in that one number, the machine unit, $u$.
Binary floating point machine numbers#
The basic representation is a binary version of the familiar scientific or decimal floating point notation: in place of the decimal form

$$\pm d_0 . d_1 d_2 \dots d_{p-1} \times 10^e, \quad d_i \in \{0, 1, \dots, 9\},$$

one uses the binary form

$$\pm b_0 . b_1 b_2 \dots b_{p-1} \times 2^e, \quad b_i \in \{0, 1\}.$$

Binary floating point machine numbers with $p$ bits of precision thus consist of a sign, a $p$-bit mantissa (or significand) $b_0 . b_1 b_2 \dots b_{p-1}$, and an integer exponent $e$ from some fixed finite range.

Just as decimal floating point numbers are typically written with the exponent chosen to have non-zero leading digit $d_0 \neq 0$, binary machine numbers are normalized so that the leading bit is $b_0 = 1$; since that bit is then always 1, it does not even need to be stored.
Worst case rounding error#
It turns out that the relative errors are determined solely by the number of significant bits in the mantissa, regardless of the exponent, so we look at that part first.
Rounding error in the mantissa#
The spacing of consecutive mantissa values $1.b_1 b_2 \dots b_{p-1}$ is $2^{-(p-1)}$, so rounding a mantissa to the nearest of these values gives an absolute error of at most half of that spacing, $2^{-p}$.

How large can the relative error be?

It is largest for the smallest possible denominator, which is the smallest normalized mantissa value, $1$; thus the relative error due to rounding of the mantissa is at most $2^{-p} = u$, the machine unit.
Rounding error in general#
The sign has no effect on the absolute error, and the exponent changes the spacing of consecutive machine numbers by a factor of $2^e$; the absolute rounding error is scaled by the same factor, but the relative error is unchanged: it is still at most $u = 2^{-p}$.

One way to describe the machine unit $u$ (sometimes called machine epsilon) is to note that the next machine number above 1 is $1 + 2u$; equivalently, $u$ is the largest possible relative error when rounding a number that lies in the interval $[1, 2)$.
IEEE 64-bit numbers: more details and some experiments#
For completely full details, you could read about the IEEE 754 Standard for Floating-Point Arithmetic and specifically the binary64 case. (For historical reasons, this is known as “Double-precision floating-point format”, from the era when computers typically used 32-bit words, so 64-bit numbers needed two words.)
In the standard IEEE-64 number system:
- 64-bit words are used to store real numbers (a.k.a. floating point numbers, sometimes called floats).
- There are $p = 53$ bits of precision, so that 52 bits are used to store the mantissa (fractional part).
- The sign is stored with one bit $s$: effectively a factor of $(-1)^s$, so $s = 0$ for positive, $s = 1$ for negative.
- The remaining 11 bits are used for the exponent, which allows for $2^{11} = 2048$ possibilities; these are chosen in the range $-1023 \le e \le 1024$.
- However, so far, this does not allow for the value zero! This is handled by giving a special meaning to the smallest exponent $e = -1023$, so the smallest exponent for normalized numbers is $e = -1022$.
- At the other extreme, the largest exponent $e = 1024$ is used to encode “infinite” numbers, which can arise when a calculation gives a value too large to represent. (Python displays these as `inf` and `-inf`.) This exponent is also used to encode “Not a Number”, for situations like trying to divide zero by zero or multiply zero by `inf`.
- Thus, the exponential factors $2^e$ for normalized numbers are in the range $2^{-1022}$ to $2^{1023}$. Since the mantissa ranges from 1 to just under 2, the range of magnitudes of normalized real numbers is thus from $2^{-1022} \approx 2.2 \times 10^{-308}$ to just under $2^{1024} \approx 1.8 \times 10^{308}$.
Some computational experiments:
p = 53
u = 2**(-p)
print(f"For IEEE-64 arithmetic, there are {p} bitd of precision and the machine unit is u={u},")
print(f"and the next numbers above 1 are 1+2u = {1+2*u}, 1+4u = {1+4*u} and so on.")
for factor in [3, 2, 1.00000000001, 1]:
onePlusSmall = 1 + factor * u
print(f"1 + {factor}u rounds to {onePlusSmall}")
difference = onePlusSmall - 1
print(f"\tThis is more than 1 by {difference:.4}, which is {difference/u} times u")
For IEEE-64 arithmetic, there are 53 bits of precision and the machine unit is u=1.1102230246251565e-16,
and the next numbers above 1 are 1+2u = 1.0000000000000002, 1+4u = 1.0000000000000004 and so on.
1 + 3u rounds to 1.0000000000000004
This is more than 1 by 4.441e-16, which is 4.0 times u
1 + 2u rounds to 1.0000000000000002
This is more than 1 by 2.22e-16, which is 2.0 times u
1 + 1.00000000001u rounds to 1.0000000000000002
This is more than 1 by 2.22e-16, which is 2.0 times u
1 + 1u rounds to 1.0
This is more than 1 by 0.0, which is 0.0 times u
print("On the other side, the spacing is halved:")
print(f"the next numbers below 1 are 1-u = {1-u}, 1-2u = {1-2*u} and so on.")
for factor in [2, 1, 1.00000000001/2, 1/2]:
oneMinusSmall = 1 - factor * u
print(f"1 - {factor}u rounds to {oneMinusSmall}")
difference = 1 - oneMinusSmall
print(f"\tThis is less than 1 by {difference:.4}, which is {difference/u} times u ")
On the other side, the spacing is halved:
the next numbers below 1 are 1-u = 0.9999999999999999, 1-2u = 0.9999999999999998 and so on.
1 - 2u rounds to 0.9999999999999998
This is less than 1 by 2.22e-16, which is 2.0 times u
1 - 1u rounds to 0.9999999999999999
This is less than 1 by 1.11e-16, which is 1.0 times u
1 - 0.500000000005u rounds to 0.9999999999999999
This is less than 1 by 1.11e-16, which is 1.0 times u
1 - 0.5u rounds to 1.0
This is less than 1 by 0.0, which is 0.0 times u
Next, look at the extremes of very small and very large magnitudes:
print(f"The smallest normalized positive number is {2**(-1022)=}")
print(f"The largest mantissa is binary (1.1111...) with 53 ones: {2 - 2**(-52)=:0.20}...")
print(f"The largest normalized number is {(2 - 2**(-52))*2.**1023=}")
print(f"If instead we round that mantissa up to 2 and try again, we get {2*2.**1023=}")
The smallest normalized positive number is 2**(-1022)=2.2250738585072014e-308
The largest mantissa is binary (1.1111...) with 53 ones: 2 - 2**(-52)=1.999999999999999778...
The largest normalized number is (2 - 2**(-52))*2.**1023=1.7976931348623157e+308
If instead we round that mantissa up to 2 and try again, we get 2*2.**1023=inf
What happens if we compute positive numbers smaller than that smallest normalized positive number, $2^{-1022}$?
for S in [0, 1, 2, 52, 53]:
exponent = -1022-S
print(f" 2**(-1022-{S}) = 2**({exponent}) = {2**(exponent)}")
2**(-1022-0) = 2**(-1022) = 2.2250738585072014e-308
2**(-1022-1) = 2**(-1023) = 1.1125369292536007e-308
2**(-1022-2) = 2**(-1024) = 5.562684646268003e-309
2**(-1022-52) = 2**(-1074) = 5e-324
2**(-1022-53) = 2**(-1075) = 0.0
These extremely small values are called denormalized numbers.
Numbers with the special smallest exponent $e = -1023$ are stored without requiring the leading bit of the mantissa to be 1, which allows positive values as small as $2^{-1074} \approx 5 \times 10^{-324}$ (as seen above), but with fewer and fewer significant bits as the magnitude decreases.
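These extreme values can also be read off directly from the standard library; the following sketch (not from the text) uses Python’s `sys.float_info`, which reports exactly the quantities discussed above.

```python
# A quick cross-check of the extremes of IEEE-64, using the standard library.
import sys

info = sys.float_info
print("smallest normalized positive number: ", info.min)      # 2**(-1022)
print("largest normalized number:           ", info.max)      # (2 - 2**(-52)) * 2**1023
print("gap from 1 up to the next float:     ", info.epsilon)  # 2**(-52), i.e. 2u
# The smallest positive denormalized number has only the last mantissa bit set:
print("smallest denormalized positive number:", info.min * info.epsilon)  # 2**(-1074)
```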
Propagation of error in arithmetic#
The only errors in the results of Gaussian elimination come from errors in the initial data (the coefficients and right-hand sides of the equations) and from the rounding error introduced by each arithmetic operation performed along the way; so we next look at how errors propagate through the four basic arithmetic operations.
In summary:
When multiplying two numbers, the relative error in the product is no worse than slightly more than the sum of the relative errors in the numbers multiplied. (To be pedantic, it is at most the sum of those relative errors plus their product, but that last piece is typically far smaller.)
When dividing two numbers, the relative error in the quotient is again no worse than slightly more than the sum of the relative errors in the numbers divided.
When adding two positive numbers, the relative error is no more than the larger of the relative errors in the numbers added, and the absolute error in the sum is no larger than the sum of the absolute errors.
When subtracting two positive numbers, the absolute error is again no larger than the sum of the absolute errors in the numbers subtracted, but the relative error can get far worse!
Due to the differences between the last two cases, this discussion of error propagation will use “addition” to refer only to adding numbers of the same sign, and “subtraction” when subtracting numbers of the same sign.
More generally, any sum or difference can be rewritten as one of these two cases involving a pair of positive numbers, so from now on we assume WLOG that all input values are positive.
Notation: for errors and for rounding#
Two notations will be useful.
Firstly, for any approximation $\tilde{x}$ of an exact value $x$, write $\tilde{x} = x(1 + \delta_x)$, so that $\delta_x$ is the relative error in $\tilde{x}$.

Thus, $\tilde{x} = x + x \delta_x$, and the absolute error is $|x \delta_x|$.

Also, introduce the function $fl(x)$ for the result of rounding a real number $x$ to the nearest machine number; the worst-case rounding error bound above says that $fl(x) = x(1 + \delta)$ with $|\delta| \le u$.
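As a small sanity check of that last claim, the rounding error of a particular value can be computed exactly with rational arithmetic; here is a sketch (not from the text), using $x = 1/3$ purely as an arbitrary test value.

```python
# Exact computation of the relative rounding error delta in fl(x) = x(1 + delta).
from fractions import Fraction

u = Fraction(1, 2**53)      # machine unit for IEEE-64, u = 2**(-53)
x = Fraction(1, 3)          # the exact value
fl_x = Fraction(1/3)        # 1/3 rounded to a machine number, converted exactly to a rational
delta = (fl_x - x) / x      # exact relative error introduced by the rounding
print(f"delta = {float(delta):.3e}, u = {float(u):.3e}, |delta| <= u is {abs(delta) <= u}")
```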
Propagation of error in products#
Let $\tilde{x} = x(1 + \delta_x)$ and $\tilde{y} = y(1 + \delta_y)$ be approximations of exact values $x$ and $y$. Then

$$\tilde{x}\,\tilde{y} = x y (1 + \delta_x)(1 + \delta_y) = x y (1 + \delta_x + \delta_y + \delta_x \delta_y).$$

Thus the relative error in the product is

$$|\delta_x + \delta_y + \delta_x \delta_y| \le |\delta_x| + |\delta_y| + |\delta_x|\,|\delta_y|.$$

For example if the initial errors are due only to rounding, so that $|\delta_x| \le u$ and $|\delta_y| \le u$, the relative error in the product is at most $2u + u^2$, only negligibly more than $2u$.

This is the above stated “sum of relative errors” result.

When the “input errors” in $\tilde{x}$ and $\tilde{y}$ are small, the product $\delta_x \delta_y$ is far smaller still, so to a very good approximation the relative errors simply add.
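A quick numerical illustration (a sketch with made-up input errors, not an example from the text): perturb two values by known relative errors and compare the product’s relative error to their sum.

```python
# Check that relative errors (approximately) add under multiplication.
x, y = 3.2, 1.7                    # arbitrary exact values
delta_x, delta_y = 3e-5, -2e-5     # imposed relative errors in the approximations
x_tilde = x * (1 + delta_x)
y_tilde = y * (1 + delta_y)
rel_err = (x_tilde * y_tilde - x * y) / (x * y)
print(f"relative error of product: {rel_err:.6e}")
print(f"delta_x + delta_y:         {delta_x + delta_y:.6e}")  # agree up to the tiny term delta_x*delta_y
```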
Exercise 1#
Derive the corresponding result for quotients.
Propagation of error in sums (of positive numbers)#
With $\tilde{x} = x(1 + \delta_x)$ and $\tilde{y} = y(1 + \delta_y)$ as above,

$$\tilde{x} + \tilde{y} = (x + y) + (x \delta_x + y \delta_y),$$

so the absolute error is bounded by $|x \delta_x| + |y \delta_y|$: the sum of the absolute errors in the two terms.

For the relative errors, express this error as

$$x \delta_x + y \delta_y = (x + y)\,\frac{x \delta_x + y \delta_y}{x + y},$$

so that the relative error in $\tilde{x} + \tilde{y}$ is the last factor.

Let $\delta := \max(|\delta_x|, |\delta_y|)$; since $x$ and $y$ are positive,

$$\left| \frac{x \delta_x + y \delta_y}{x + y} \right| \le \frac{x \delta + y \delta}{x + y} = \delta.$$

That is, the relative error in the sum is at most the larger of the relative errors in the terms, again as advertised above.

When the “input errors” in $\tilde{x}$ and $\tilde{y}$ are just rounding errors, each is of size at most $u$, and so the relative error in the sum is also at most $u$: no precision is lost.
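Again a small numerical sketch (with made-up input errors) confirms that for positive terms the sum’s relative error is a weighted average of the two, and so never exceeds the larger one.

```python
# Check the bound on the relative error of a sum of positive numbers.
x, y = 3.2, 1.7
delta_x, delta_y = 3e-5, -2e-5                     # imposed relative errors
sum_tilde = x * (1 + delta_x) + y * (1 + delta_y)
rel_err = (sum_tilde - (x + y)) / (x + y)
print(f"relative error of sum: {rel_err:.6e}")
print(f"lies between {min(delta_x, delta_y):.1e} and {max(delta_x, delta_y):.1e}")
```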
Propagation of error in differences (of positive numbers): loss of significance/loss of precision#
The above calculation for the absolute error works fine regardless of the signs of the numbers, so the absolute error of a difference is still bounded by the sum of the absolute errors:

$$|(\tilde{x} - \tilde{y}) - (x - y)| = |x \delta_x - y \delta_y| \le |x \delta_x| + |y \delta_y|.$$

But for subtraction, the denominator in the relative error formulas can be far smaller.

WLOG let $x > y > 0$; then the relative error in the difference is

$$\frac{|x \delta_x - y \delta_y|}{x - y}.$$

Clearly if $x - y$ is far smaller than $x$ and $y$ themselves, this can be far larger than the relative errors $|\delta_x|$ and $|\delta_y|$ of the inputs.

The extreme case is where the values $\tilde{x}$ and $\tilde{y}$ are equal even though $x$ and $y$ are not: the computed difference is then zero, and the relative error is $1$, a complete loss of accuracy, with no significant digits correct at all.
Exercise 2#
Let us move slightly away from the worst case scenario where the difference is exactly zero to one where it is close to zero; this will illustrate the idea mentioned earlier that wherever a zero value is a problem in exact arithmetic, a very small value can be a problem in approximate arithmetic.
For a pair of nearby positive values $x$ and $y$:

1. Round each to three significant figures, giving approximations $\tilde{x}$ and $\tilde{y}$.
2. Compute the absolute errors in each of these approximations, and in their difference as an approximation of $x - y$.
3. Compute the relative errors in each of these three approximations.
4. Then look at rounding to only two significant digits!
Upper and lower bounds on the relative error in subtraction#
The problem is worst when the two numbers agree in many of their leading digits; that is, when $y/x$ is close to 1. This can be quantified:

Theorem 4.2 (Loss of Precision)

Consider $x > y > 0$ which are close in the sense that

$$2^{-q} \le 1 - \frac{y}{x} \le 2^{-p}$$

for some positive integers $p \le q$.

Then when they are rounded to machine numbers which are then subtracted, the relative error in that approximation of the difference is greater than that due to rounding alone by a factor of between $2^p$ and $2^q$.

That is, subtraction loses between $p$ and $q$ significant bits of precision.
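To see this loss of significance in action, here is a small sketch (not from the text): for large $x$, the two terms in $\sqrt{x^2 + 1} - x$ agree in almost all of their leading bits, while the algebraically equivalent form $1/(\sqrt{x^2 + 1} + x)$ involves no subtraction of nearly equal numbers.

```python
# Loss of significance in sqrt(x**2 + 1) - x, compared with an equivalent form
# that avoids subtracting nearly equal values.
from math import sqrt

for x in [1e4, 1e6, 1e8]:
    naive = sqrt(x**2 + 1) - x            # subtracts two nearly equal numbers
    rewritten = 1 / (sqrt(x**2 + 1) + x)  # algebraically the same, no cancellation
    print(f"x = {x:.0e}:  naive = {naive:.16e},  rewritten = {rewritten:.16e}")
# The rewritten form keeps essentially full precision; the naive form loses more
# and more digits as x grows, and collapses to 0.0 once x**2 + 1 rounds to x**2.
```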
Exercise 3#
(a) Illustrate why computing the roots of the quadratic equation $ax^2 + bx + c = 0$ with the standard formula

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

can sometimes give poor accuracy when evaluated using machine arithmetic such as IEEE-64 floating-point arithmetic.

This is not always a problem, so identify specifically the situations when this could occur, in terms of a condition on the coefficients $a$, $b$ and $c$.
(b) Then describe a careful procedure for always getting accurate answers. State the procedure first with words and mathematical formulas, and then express it in pseudo-code.
Example 4.5 (Errors when approximating derivatives)
To deal with differential equations, we will need to approximate the derivative of a function from just some values of the function itself. The simplest approach is suggested by the definition of the derivative

$$Df(x) := \lim_{h \to 0} \frac{f(x+h) - f(x)}{h},$$

by using

$$D_hf(x) := \frac{f(x+h) - f(x)}{h}$$

with a small value of $h$.

Taylor’s theorem gives an error bound if we assume exact arithmetic (worse for larger $h$), but rounding error in the machine arithmetic pulls in the opposite direction, getting worse as $h$ decreases.

This leads to the need to balance these error sources, to find an optimal choice for $h$ and the resulting bound on the best possible accuracy.
Denote the value actually obtained when $D_hf(x)$ is computed in machine arithmetic by $\tilde{D}_hf(x)$.

The error in this as an approximation of the exact derivative is

$$E := \tilde{D}_hf(x) - Df(x),$$

which we will consider as the sum of two pieces, $E = E_A + E_D$, where

- $E_A := \tilde{D}_hf(x) - D_hf(x)$ is the error due to machine arithmetic in evaluation of the difference quotient $D_hf(x)$;
- $E_D := D_hf(x) - Df(x)$ is the error in this difference quotient as an approximation of the exact derivative $Df(x)$.
Bounding the Arithmetic error
The first source of error is rounding of $f(x)$ to the nearest machine number: the value actually used is $fl(f(x)) = f(x)(1 + \delta_1)$ with $|\delta_1| \le u$, so its absolute error is at most about $u|f(x)|$.

Similarly, the value used for $f(x+h)$ has absolute error at most about $u|f(x+h)|$.

Since we are interested in fairly small values of $h$, $f(x+h) \approx f(x)$, so both of these absolute errors are at most about $u|f(x)|$.

Then the absolute error in the difference in the numerator of $D_hf(x)$ is at most about $2u|f(x)|$.

Next the division.

We can assume that $h$ is an exact machine number (for example a power of two), so that dividing by it causes no further loss of accuracy beyond scaling: the absolute error of the numerator is simply divided by $h$, giving

$$|E_A| \lesssim \frac{2u|f(x)|}{h}.$$

This is a critical step: the difference has a small absolute error, which conceals a large relative error due to the difference being small; now the absolute error gets amplified greatly when it is divided by the small number $h$.
Bounding the Discretization error
As seen in Taylor’s Theorem and the Accuracy of Linearization — for the basic case of linearization — we have

$$f(x+h) = f(x) + f'(x)\,h + \frac{f''(\xi)}{2} h^2 \quad \text{for some } \xi \text{ between } x \text{ and } x+h,$$

so

$$E_D = D_hf(x) - Df(x) = \frac{f(x+h) - f(x)}{h} - f'(x) = \frac{f''(\xi)}{2} h,$$

and with $M_2$ denoting an upper bound on $|f''|$ between $x$ and $x+h$,

$$|E_D| \le \frac{M_2}{2} h.$$
Bounding the total absolute error, and minimizing it
The above results combine to give an upper limit on how bad the total error can be:

$$|E| \le |E_A| + |E_D| \lesssim \frac{2u|f(x)|}{h} + \frac{M_2}{2} h.$$

As anticipated, the errors go in opposite directions: decreasing $h$ reduces the discretization error $E_D$ but increases the arithmetic error $E_A$, so we look for the choice of $h$ that minimizes this bound.

To do this, let’s clean up the notation: let $A := 2u|f(x)|$ and $B := M_2/2$,

so that the error bound for a given value of $h$ is

$$\bar{E}(h) = \frac{A}{h} + B h.$$

This can be minimized with a little calculus:

$$\bar{E}'(h) = -\frac{A}{h^2} + B,$$

which is zero only for the unique critical point

$$h = h^* = \sqrt{\frac{A}{B}} = \sqrt{\frac{4u|f(x)|}{M_2}} = K\sqrt{u},$$

using the short-hand $K := 2\sqrt{|f(x)|/M_2}$.

This is easily verified to give the global minimum of $\bar{E}(h)$ for $h > 0$, with minimum value

$$\bar{E}(h^*) = \frac{A}{h^*} + B h^* = 2\sqrt{AB} = 2\sqrt{u\,|f(x)|\,M_2}.$$
Conclusions from this example#
In practical cases, we do not know the constants $A$, $B$ and $K$, since they involve values of $f$ and $f''$ that are typically not available; but the important message is in how things scale with $u$.

The most important — and somewhat disappointing — observation here is that both the optimal size of $h$ and the resulting error bound are proportional to $\sqrt{u}$: at best, only about half of the available precision is attained.

In decimal terms: with IEEE-64 arithmetic, $u \approx 10^{-16}$, so the best accuracy we can hope for from this approximation is of order $\sqrt{u} \approx 10^{-8}$ (about eight significant digits), attained with $h$ also of order $10^{-8}$.
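This prediction is easy to test numerically; the following sketch uses $f(x) = e^x$ at $x = 1$ purely as an illustrative choice (it is not an example from the text). The error of the forward difference should be smallest for $h$ near $10^{-8}$, and grow for both larger and smaller $h$.

```python
# Error of the forward difference D_h f(x) versus h, for a test function with known derivative.
from math import exp

x = 1.0
exact = exp(x)                          # f'(x) = e^x for f(x) = e^x
for k in range(1, 16, 2):
    h = 10.0 ** (-k)
    D_h = (exp(x + h) - exp(x)) / h     # forward difference approximation
    print(f"h = 1e-{k:02d}:  error = {abs(D_h - exact):.2e}")
```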
This is a first indication of why machine arithmetic sometimes needs to be so precise — more precise than any physical measurement by a factor of well over a thousand.
It also shows that when we get to computing derivatives and solving differential equations, we will often need to do a better job of approximating derivatives!