The purpose of this book is to develop the understanding of basic numerical methods and their implementations as software that are necessary for solving fundamental mathematical problems by numerical means. It is designed for the person who wants to do numerical computing. Through the examples and exercises, the reader studies the behavior of solutions of the mathematical problem along with an algorithm for solving the problem. Experience and understanding of the algorithm are gained through hand computation and practice solving problems with a computer implementation. It is essential that the reader understand how the codes provided work, precisely what they do, and what their limitations are. The codes provided are powerful, yet simple enough for pedagogical use. The reader is exposed to the art of numerical computing as well as the science.

The book is intended for a one-semester course, requiring only calculus and a modest acquaintance with FORTRAN, C, C++, or MATLAB. These constraints of background and time have important implications: the book focuses on the problems that are most common in practice and accessible with the background assumed. By concentrating on one effective algorithm for each basic task, it is possible to develop the fundamental theory in a brief, elementary way. There are ample exercises, and codes are provided to reduce the time otherwise required for programming and debugging. The intended audience includes engineers, scientists, and anyone else interested in scientific programming. The level is upper-division undergraduate to beginning graduate, and there is adequate material for a one-semester to two-quarter course.

Numerical analysis blends mathematics, programming, and a considerable amount of art. We provide programs with the book that illustrate this. They are more than mere implementations in a particular language of the algorithms presented, but they are not production-grade software.
To appreciate the subject fully, it will be necessary to study the codes provided and gain experience solving problems first with these programs and then with production-grade software. Many exercises are provided in varying degrees of difficulty. Some are designed to get the reader to think about the text material and to test understanding, while others are purely computational in nature. Problem sets may involve hand calculation, algebraic derivations, straightforward computer solution, or more sophisticated computing exercises.

The algorithms that we study and implement in the book are designed to avoid severe roundoff errors (arising from the finite number of digits available on computers and calculators), estimate truncation errors (arising from mathematical approximations), and give some indication of the sensitivity of the problem to errors in the data. In Chapter 1 we give some basic definitions of errors arising in computations and study roundoff errors through some simple but illuminating computations. Chapter 2 deals with one of the most frequently occurring problems in scientific computation, the solution of linear systems of equations. In Chapter 3 we deal with the problem of
interpolation, one of the most fundamental and widely used tools in numerical computation. In Chapter 4 we study methods for finding solutions to nonlinear equations. Numerical integration is taken up in Chapter 5, and the numerical solution of ordinary differential equations is examined in Chapter 6. Each chapter contains a case study that illustrates how to combine analysis with computation for the topic of that chapter. Before taking up the various mathematical problems and procedures for solving them numerically, we need to briefly discuss programming languages and the acquisition of software.
PROGRAMMING LANGUAGES

The FORTRAN language was developed specifically for numerical computation and has evolved continuously to adapt it better to the task. Accordingly, of the widely used programming languages, it is the most natural for the programs of this book. The C language was developed later for rather different purposes, but it can be used for numerical computation.

At present FORTRAN 77 is very widely available, and codes conforming to the ANSI standard for the language are highly portable, meaning that they can be moved to another hardware/software configuration with very little change. We have chosen to provide codes in FORTRAN 77 mainly because the newer Fortran 90 is not in wide use at this time. A Fortran 90 compiler will process correctly our FORTRAN 77 programs (with at most trivial changes), but if we were to write the programs so as to exploit fully the new capabilities of the language, a number of the programs would be structured in a fundamentally different way. The situation with C is similar, but in our experience programs written in C have not proven to be nearly as portable as programs written in standard FORTRAN 77. As with FORTRAN, the C language has evolved into C++, and as with Fortran 90 compared to FORTRAN 77, exploiting fully the additional capabilities of C++ (in particular, object-oriented programming) would lead to programs that are completely different from those in C. We have opted for a middle ground in our C++ implementations.

In the last decade several computing environments have been developed. Popular ones familiar to us are MATLAB [1] and Mathematica [2]. MATLAB is very much in keeping with this book, for it is devoted to the solution of mathematical problems by numerical means. It integrates the formation of a mathematical model, its numerical solution, and graphical display of results into a very convenient package. Many of the tasks we study are implemented as a single command in the MATLAB language.
As MATLAB has evolved, it has added symbolic capabilities. Mathematica is a similar environment, but it approaches mathematical problems from the other direction. Originally it was primarily devoted to solving mathematical problems by symbolic means, but as it has evolved, it has added significant numerical capabilities. In the book we refer to the numerical methods implemented in these widely used packages, as well as others, but we mention the packages here because they are programming languages in their own right. It is quite possible to implement the algorithms of the text in these languages. Indeed, this is attractive because the environments deal gracefully with a number of issues that are annoying in general computing using languages like FORTRAN or C.
At present we provide programs written in FORTRAN 77, C, C++, and MATLAB that have a high degree of portability. Quite possibly in the future the programs will be made available in other environments (e.g., Fortran 90 or Mathematica).
SOFTWARE

In this section we describe how to obtain the source code for the programs that accompany the book and how to obtain production-grade software. It is assumed that the reader has available a browser for the World Wide Web, although some of the software is available by ftp or gopher.

The programs that accompany this book are currently available by means of anonymous ftp (log in as anonymous or as ftp) at ftp.wiley.com in subdirectories of public/college/math/sapcodes for the various languages discussed in the preceding section.

The best single source of software is the Guide to Available Mathematical Software (GAMS) developed by the National Institute of Standards and Technology (NIST). It is an online cross-index of mathematical software and a virtual software repository. Much of the high-quality software is free. For example, GAMS provides a link to netlib, a large collection of public-domain mathematical software. Most of the programs in netlib are written in FORTRAN, although some are in C. A number of the packages found in netlib are state-of-the-art software that is cited in this book. The Internet address for GAMS is http://gams.nist.gov.

A useful source of microcomputer software and pointers to other sources of software is the Mathematics Archives at http://archives.math.utk.edu:80/ It is worth remarking that one item listed there is an "Index of resources for numerical computation in C or C++." There are a number of commercial packages that can be located by means of GAMS. We are experienced with the NAG and IMSL libraries, which are large collections of high-quality mathematical software found in most computing centers. The computing environments MATLAB and Mathematica mentioned in the preceding section can also be located through GAMS.
REFERENCES

1. C. Moler, J. Little, S. Bangert, and S. Kleiman, Pro-MATLAB User's Guide, MathWorks, Sherborn, Mass., 1987. email: info@mathworks.com
2. S. Wolfram, Mathematica, Addison-Wesley, Redwood City, Calif., 1991. email: info@wri.com
ACKNOWLEDGMENTS

The authors are indebted to many individuals who have contributed to the production of this book. Professors Bernard Bialecki and Michael Hosea have been especially sharp-eyed at catching errors in the latest versions. We thank the people at Wiley, Barbara Holland, Cindy Rhoads, and Nancy Prinz, for their contributions. David Richards at the University of Illinois played a critical role in getting things functioning for us, and in quickly and accurately fixing other problems. We also acknowledge the work of James Otto in checking all solutions and examples, and Hongsung Jin, who generated most of the figures. Last, but certainly not least, we are indeed grateful to the many students, too numerous to mention, who have made valuable suggestions to us over the years.
CONTENTS

CHAPTER 1  ERRORS AND FLOATING POINT ARITHMETIC  1
1.1 Basic Concepts 1
1.2 Examples of Floating Point Calculations 12
1.3 Case Study 1 25
1.4 Further Reading 28

CHAPTER 2  SYSTEMS OF LINEAR EQUATIONS  30
2.1 Gaussian Elimination with Partial Pivoting 32
2.2 Matrix Factorization 44
2.3 Accuracy 48
2.4 Routines Factor and Solve 61
2.5 Matrices with Special Structure 65
2.6 Case Study 2 72

CHAPTER 3  INTERPOLATION  82
3.1 Polynomial Interpolation 83
3.2 More Error Bounds 90
3.3 Newton Divided Difference Form 93
3.4 Assessing Accuracy 98
3.5 Spline Interpolation 101
3.6 Interpolation in the Plane 119
3.7 Case Study 3 128

CHAPTER 4  ROOTS OF NONLINEAR EQUATIONS  134
4.1 Bisection, Newton's Method, and the Secant Rule 137
4.2 An Algorithm Combining Bisection and the Secant Rule 150
4.3 Routines for Zero Finding 152
4.4 Condition, Limiting Precision, and Multiple Roots 157
4.5 Nonlinear Systems of Equations 160
4.6 Case Study 4 163

CHAPTER 5  NUMERICAL INTEGRATION
5.1 Basic Quadrature Rules 170
5.2 Adaptive Quadrature 184
5.3 Codes for Adaptive Quadrature 188
5.4 Special Devices for Integration 191
5.5 Integration of Tabular Data 200
5.6 Integration in Two Variables 202
5.7 Case Study 5 203

CHAPTER 6  ORDINARY DIFFERENTIAL EQUATIONS
6.1 Some Elements of the Theory 210
6.2 A Simple Numerical Scheme 216
6.3 One-Step Methods 221
6.4 Errors: Local and Global 228
6.5 The Algorithms 233
6.6 The Code Rke 236
6.7 Other Numerical Methods 240
6.8 Case Study 6 244

APPENDIX  NOTATION AND SOME THEOREMS FROM THE CALCULUS  251
A.1 Notation 251
A.2 Theorems 252

ANSWERS TO SELECTED EXERCISES 255
INDEX 266
CHAPTER 1
ERRORS AND FLOATING POINT ARITHMETIC
Errors in mathematical computation have several sources. One is the modeling that led to the mathematical problem, for example, assuming no wind resistance in studying projectile motion or ignoring finite limits of resources in population and economic growth models. Such errors are not the concern of this book, although it must be kept in mind that the numerical solution of a mathematical problem can be no more meaningful than the underlying model. Another source of error is the measurement of data for the problem. A third source is a kind of mathematical error called discretization or truncation error. It arises from mathematical approximations such as estimating an integral by a sum or a tangent line by a secant line. Still another source of error is the error that arises from the finite number of digits available in the computers and calculators used for the computations. It is called roundoff error. In this book we study the design and implementation of algorithms that aim to avoid severe roundoff errors, estimate truncation errors, and give some indication of the sensitivity of the problem to errors in the data. This chapter is devoted to some fundamental definitions and a study of roundoff by means of simple but illuminating computations.
1.1 BASIC CONCEPTS

How well a quantity is approximated is measured in two ways:

    absolute error = true value − approximate value

    relative error = (true value − approximate value) / true value.

Relative error is not defined if the true value is zero. In the arithmetic of computers, relative error is the more natural concept, but absolute error may be preferable when studying quantities that are close to zero.

A mathematical problem with input (data) x and output (answer) y = F(x) is said to be well-conditioned if "small" changes in x lead to "small" changes in y. If the
changes in y are "large," the problem is said to be ill-conditioned. Whether a problem is well- or ill-conditioned can depend on how the changes are measured.

A concept related to conditioning is stability. It is concerned with the sensitivity of an algorithm for solving a problem with respect to small changes in the data, as opposed to the sensitivity of the problem itself. Roundoff errors are almost inevitable, so the reliability of answers computed by an algorithm depends on whether small roundoff errors might seriously affect the results. An algorithm is stable if "small" changes in the input lead to "small" changes in the output. If the changes in the output are "large," the algorithm is unstable.

To gain some insight about condition, let us consider a differentiable function F(x) and suppose that its argument, the input, is changed from x to x + Δx. This is a relative change of Δx/x in the input data. According to Theorem 4 of the appendix, the change induces an absolute change in the output value F(x) of approximately

    F(x + Δx) − F(x) ≈ Δx F′(x).

The relative change is

    [F(x + Δx) − F(x)] / F(x) ≈ Δx F′(x) / F(x).
Example 1.1. If, for example, F(x) = e^x, the absolute change in the value of the exponential function due to a change Δx in its argument x is approximately Δx e^x, and the relative change is about Δx. When x is large, the conditioning of the evaluation of this function with respect to a small change in the argument depends strongly on whether the change is measured in an absolute or relative sense. ■

Example 1.2. If F(x) = cos x, then near x = π/2 the absolute error due to perturbing x to x + Δx is approximately Δx(−sin x) ≈ −Δx. The relative error at x = π/2 is not defined since cos(π/2) = 0. However, the accurate values

    cos(1.57079) = 0.63267949 × 10⁻⁵
    cos(1.57078) = 1.63267949 × 10⁻⁵

show how a very small change in the argument near π/2 can lead to a significant (63%) change in the value of the function. In contrast, evaluation of the cosine function is well-conditioned near x = 0 (see Exercise 1.4). ■

Example 1.3. A common application of integration by parts in calculus courses is the evaluation of families of integrals by recursion. As an example, consider

    E_n = ∫₀¹ x^n e^(x−1) dx for n = 1, 2, ....

From this definition it is easy to see that

    E₁ > E₂ > ··· > E_(n−1) > E_n > ··· > 0.
To obtain a recursion, integrate by parts to get

    E_n = [x^n e^(x−1)]₀¹ − n ∫₀¹ x^(n−1) e^(x−1) dx
        = 1 − n E_(n−1).

The first member of the family is

    E₁ = ∫₀¹ x e^(x−1) dx = e⁻¹,

and from it we can easily compute any E_n. If this is done in single precision on a PC or workstation (IEEE standard arithmetic), it is found that

    E₁  =  0.367879
    E₂  =  0.264241
    ...
    E₁₀ =  0.0506744
    E₁₁ =  0.442581                (the exact E_n decrease!)
    E₁₂ = −4.31097                 (the exact E_n are positive!)
    ...
    E₂₀ = −0.222605 × 10¹¹         (the exact E_n are between 0 and 1!)

This is an example of an unstable algorithm. A little analysis helps us understand what is happening. Suppose we had started with Ê₁ = E₁ + ε and made no arithmetic errors when evaluating the recurrence. Then

    Ê₂ = 1 − 2Ê₁ = 1 − 2E₁ − 2ε = E₂ − 2ε
    Ê₃ = 1 − 3Ê₂ = 1 − 3E₂ + 6ε = E₃ + 3!ε
    ...
    Ê_n = E_n ± n!ε.

A small change in the first value E₁ grows very rapidly in the later E_n. The effect is worse in a relative sense because the desired quantities E_n decrease as n increases.

For this example there is a way to get a stable algorithm. If we could find an approximation Ê_N to E_N for some N, we could evaluate the recursion in reverse order,

    E_(n−1) = (1 − E_n)/n,   n = N, N − 1, ..., 2,

to approximate E_(N−1), E_(N−2), ..., E₁. Studying the stability of this recursion as before, if Ê_N = E_N + ε, then Ê_(N−1) = E_(N−1) − ε/N, and continuing, the initial error ε is divided by N(N − 1)···(n + 1) by the time Ê_n is computed.
The recursion is so cheap and the error damps out so quickly that we can start with a poor approximation Ê_N for some large N and get accurate answers inexpensively for the E_n that really interest us. Notice that recurring in this direction, the E_n increase, making the relative errors damp out even faster. ■

A nonzero real number y can be represented in decimal form as

    y = ±.d₁d₂··· × 10^e,    (1.1)

where each d_i is one of the digits 0, 1, ..., 9 and d₁ > 0. The portion .d₁d₂... is called the fraction or mantissa or significand; it has the meaning

    d₁ × 10⁻¹ + d₂ × 10⁻² + ··· + d_s × 10⁻ˢ + ··· .

There is an ambiguity in this representation; for example, we must agree that 0.24000000··· is the same as 0.23999999··· . The quantity e in (1.1) is called the exponent; it is a signed integer.

Nearly all numerical computations on a digital computer are done in floating point arithmetic. This is a number system that uses a finite number of digits to approximate the real number system used for exact computation. A system with s digits and base 10 has all of its numbers of the form

    y = ±.d₁d₂···d_s × 10^e.    (1.2)
Again, for nonzero numbers each d_i is one of the digits 0, 1, ..., 9 and d₁ > 0 for a normalized number. The exponent e also has only a finite number of digits; we assume the range

    m ≤ e ≤ M.

The number zero is special; it is written as 0.0···0 × 10^m.

Example 1.4.
If s = 1, m = −1, and M = 1, then the set of floating point numbers is

    +0.1 × 10⁻¹, +0.2 × 10⁻¹, ..., +0.9 × 10⁻¹,
    +0.1 × 10⁰,  +0.2 × 10⁰,  ..., +0.9 × 10⁰,
    +0.1 × 10¹,  +0.2 × 10¹,  ..., +0.9 × 10¹,
together with the negative of each of these numbers and 0.0 × 10⁻¹ for zero. There are only 55 numbers in this floating point number system. In floating point arithmetic the numbers are not equally spaced. This is illustrated in Figure 1.1, which is discussed after we consider number bases other than decimal. ■

Because there are only finitely many floating point numbers to represent the real number system, each floating point number must represent many real numbers. When the exponent e in (1.1) is bigger than M, it is not possible to represent y at all. If in the course of some computations a result arises that would need an exponent e > M, the computation is said to have overflowed. Typical operating systems will terminate the run on overflow. The situation is less clear when e < m, because such a y might reasonably be approximated by zero. If such a number arises during a computation, the computation is said to have underflowed. In scientific computation it is usually appropriate to set the result to zero and continue. Some operating systems will terminate the run on underflow and others will set the result to zero and continue. Those that continue may report the number of underflows at the end of the run. If the response of the operating system is not to your liking, it is usually possible to change the response by means of a system routine. Overflows and underflows are not unusual in scientific computation. For example, exp(y) will overflow for y > 0 that are only moderately large, and exp(−y) will underflow. Our concern should be to prevent going out of range unnecessarily.

FORTRAN and C provide for integer arithmetic in addition to floating point arithmetic. Provided that the range of integers allowed is not exceeded, integer arithmetic is exact. It is necessary to beware of overflow because the typical operating system does not report an integer overflow; the computation continues with a number that is not related to the correct value in an obvious way.
Both FORTRAN and C provide for two precisions, that is, two arithmetics with different numbers of digits s, called single and double precision. The languages deal with mixing the various modes of arithmetic in a sensible way, but the unwary can get into trouble. This is more likely in FORTRAN than in C because by default, constants in C are double precision numbers. In FORTRAN the type of a constant is taken from the
way it is written. Thus, an expression like (3/4)*5. in FORTRAN and in C means that the integer 3 is to be divided by the integer 4 and the result converted to a floating point number for multiplication by the floating point number 5. Here the integer division 3/4 results in 0, which might not be what was intended. It is surprising how often users ruin the accuracy of a calculation by providing an inaccurate value for a basic constant like π. Some constants of this kind may be predefined to full accuracy in a compiler or a library, but it should be possible to use intrinsic functions to compute accurately constants like π = acos(−1.0).

Evaluation of an asymptotic expansion for the special function Ei(x), called the exponential integral, involves computing terms of the form n!/x^n. To contrast computations in integer and floating point arithmetic, we computed terms of this form for a range of n and x = 25 using both integer and double precision functions for the factorial. Working in C on a PC using IEEE arithmetic, it was found that the results agreed through n = 7, but for larger n the results computed with integer arithmetic were useless: the result for n = 8 was negative! The integer overflows that are responsible for these erroneous results are truly dangerous because there was no indication from the system that the answers might not be reliable.
Example 1.5. In Chapter 4 we study the use of bisection to find a number z such that f(z) = 0, that is, we compute a root of f(x). Fundamental to this procedure is the question, Do f(a) and f(b) have opposite signs? If they do, a continuous function f(x) has a root z between a and b. Many books on programming provide illustrative programs that test for f(a)f(b) < 0. However, when f(a) and f(b) are sufficiently small, the product underflows and its sign cannot be determined. This is likely to happen because we are interested in a and b that tend to z, causing f(a) and f(b) to tend to zero. It is easy enough to code the test so as to avoid the difficulty; it is just necessary to realize that the floating point number system does not behave quite like the real number system in this test. ■
As we shall see in Chapter 4, finding roots of functions is a context in which underflow is quite common. This is easy to understand because the aim is to find a z that makes f(z) as small as possible.
Example 1.6. Determinants. In Chapter 2 we discuss the solution of a system of linear equations. As a byproduct of the algorithm and code presented there, the determinant of a system of n equations can be computed as the product of a set of numbers returned:

    det = y₁ y₂ ··· y_n.

Unfortunately, this expression is prone to unnecessary under- and overflows. If, for example, M = 100 and y₁ = 10⁵⁰, y₂ = 10⁶⁰, y₃ = 10⁻³⁰, all the numbers are in range and so is the determinant 10⁸⁰. However, if we form (y₁ × y₂) × y₃, the partial product y₁ × y₂ overflows. Note that y₁ × (y₂ × y₃) can be formed. This illustrates the fact that floating point numbers do not always satisfy the associative law of multiplication that is true of real numbers.
The more fundamental issue is that because det(cA) = c^n det(A), the determinant is extremely sensitive to the scale of the matrix A when the number of equations n is large. A software remedy used in LINPACK [4] in effect extends the range of exponents available. Another possibility is to use logarithms and exponentials:

    ln |det| = Σ ln |y_i|,   |det| = exp(ln |det|).

If this leads to an overflow, it is because the answer cannot be represented in the floating point number system. ■

Example 1.7. Magnitude. When computing the magnitude of a complex number z = x + iy,

    |z| = √(x² + y²),

there is a difficulty when either x or y is large. Suppose that |x| ≥ |y|. If x is sufficiently large, x² will overflow and we are not able to compute |z| even when it is a valid floating point number. If the computation is reformulated as
    |z| = |x| √(1 + (y/x)²),

the difficulty is avoided. Notice that underflow could occur in forming (y/x)² when |y| is much smaller than |x|, but this is harmless: the computed magnitude is still accurate. ■

Computers do not generally work in base 10; common bases are β = 2 (binary) and β = 16 (hexadecimal). A nonzero real number y written in base β with digits d_i, each one of 0, 1, ..., β − 1 and d₁ > 0 (as long as y ≠ 0), means that

    y = ±(d₁ × β⁻¹ + d₂ × β⁻² + ··· + d_s × β⁻ˢ + ···) × β^e.

All the earlier discussion is easily modified for the other bases. In particular, we have in base β with s digits the unit roundoff u, the bound on the relative error made in rounding a real number to s digits:

    u = ½ β^(1−s) for rounded arithmetic,  u = β^(1−s) for chopped arithmetic.    (1.4)

Likewise, fl(y) = y(1 + δ), where |δ| ≤ u.
For most purposes, the fact that computations are not carried out in decimal is inconsequential. It should be kept in mind that small rounding errors are made as numbers input are converted from decimal to the base of the machine being used, and likewise on output.

Table 1.1 illustrates the variety of machine arithmetics used in the past. Today the IEEE standard [1] described in the last two rows is almost universal. In the table the notation 1.2(−7) means 1.2 × 10⁻⁷. As was noted earlier, both FORTRAN and C specify that there will be two precisions available. The floating point system built into the computer is its single precision arithmetic. Double precision may be provided by either software or hardware. Hardware double precision is not greatly slower than single precision, but software double precision arithmetic is considerably slower.

The IEEE standard uses a normalization different from (1.2). For y ≠ 0 the leading nonzero digit is immediately to the left of the decimal point. Since this digit must be 1, there is no need to store it. The number 0 is distinguished by having its e = m − 1.
Table 1.1 Examples of Computer Arithmetics.

    machine        β    s    m        M        approximate u
    VAX            2    24   −128     127      6.0(−08)
    VAX            2    56   −128     127      1.4(−17)
    CRAY 1         2    48   −16384   16383    3.6(−15)
    IBM 3081       16   6    −64      63       9.5(−07)
    IBM 3081       16   14   −64      63       2.2(−16)
    IEEE Single    2    24   −125     128      6.0(−08)
    IEEE Double    2    53   −1021    1024     1.1(−16)
It used to be some trouble to find out the unit roundoff, exponent range, and the like, but the situation has improved greatly. In standard C, constants related to floating point arithmetic are available in <float.h>. For example, DBL_EPSILON, the gap between 1.0 and the next larger double precision number, is twice the unit roundoff u of Table 1.1. Similarly, in Fortran 90 the constants are available from intrinsic functions. Because this is not true of FORTRAN 77, several approaches were taken to provide them: some compilers provide the constants as extensions of the language, and there are subroutines D1MACH and I1MACH for the machine constants that are widely available because they are in the public domain. Major libraries like IMSL and NAG include subroutines that are similar to D1MACH and I1MACH.

In Example 1.4 earlier in this section we mentioned that the numbers in the floating point number system are not equally spaced. As an illustration, see Figure 1.1, where all 19 floating point numbers are displayed for the system for which β = 4, s = 1, m = −1, and M = 1.

Arithmetic in the floating point number system is to approximate that in the real number system. We use ⊕, ⊖, ⊗, ⊘ to indicate the floating point approximations to the arithmetic operations +, −, ×, /. If y and z are floating point numbers of s digits, the product y × z has 2s digits. For example, 0.999 × 0.999 = 0.998001. About the best we could hope for is that the arithmetic hardware produce the result fl(y × z), so that y ⊗ z = (y × z)(1 + δ) for some real number δ with |δ| ≤ u. It is practical to do this for all the basic arithmetic operations. We assume an idealized arithmetic that for the basic arithmetic operations produces

    fl(y op z) = (y op z)(1 + δ),

provided that the results lie in the range of the floating point system. Here op = +, −, ×, or /, and δ is a real number with |δ| ≤ u. This is a reasonable assumption, although hardware considerations may lead to arithmetic for which the bound on δ is a small multiple of u.
Figure 1.1 Distribution of floating point numbers for β = 4, s = 1, m = −1, M = 1.
To carry out computations in this model arithmetic by hand, for each operation +, −, ×, /, perform the operation in exact arithmetic, normalize the result, and round (chop) it to the allotted number of digits. Put differently, for each operation, calculate the result and convert it to the machine representation before going on to the next operation.

Because of increasingly sophisticated architectures, the unit roundoff as defined in (1.4) is simplistic. For example, many computers do intermediate computations with more than s digits. They have at least one "guard digit," perhaps several, and as a consequence results can be rather more accurate than expected. (When arithmetic operations are carried out with more than s digits, apparently harmless actions like printing out intermediate results can cause the final result of a computation to change! This happens when the extra digits are shed as numbers are moved from arithmetic units to storage or output devices.) It is interesting to compute (1 + δ) ⊖ 1 for decreasing δ to see how small δ can be made and still get a nonzero result. A number of codes for mathematical computations that are in wide use avoid defining the unit roundoff by coding a test for u|x| < h as

    if ((x + h) ≠ x) then . . .

On today's computers this is not likely to work properly for two reasons, one being the presence of guard digits just discussed. The other is that modern compilers defeat the test when they "optimize" the coding by converting the test to

    if (h ≠ 0) then . . . ,

which is always passed.
EXERCISES

1.1 Solve

    0.461x₁ + 0.311x₂ = 0.150
    0.209x₁ + 0.141x₂ = 0.068

using three-digit chopped decimal arithmetic. The exact answer is x₁ = 1, x₂ = −1; how does yours compare?

1.2 The following algorithm (due to Cleve Moler) estimates the unit roundoff u by a computable quantity U:

    A := 4./3
    B := A − 1
    C := B + B + B
    U := |C − 1.|

(a) What does the above algorithm yield for U in six-digit decimal rounded arithmetic?
(b) What does it yield for U in six-digit decimal chopped arithmetic?
(c) What are the exact values from (1.4) for u in the arithmetics of (a) and (b)?
(d) Use this algorithm on the machine(s) and calculator(s) you are likely to use. What do you get?

1.3 Consider the following algorithm for generating noise in a quantity x:

    A := 10ⁿ * x
    B := A + x
    y := B − A

(a) Calculate y when x = 0.123456 and n = 3 using six-digit decimal chopped arithmetic. What is the error x − y?
(b) Repeat (a) for n = 5.

1.4 Show that the evaluation of F(x) = cos x is well-conditioned near x = 0; that is, for x near 0 show that the magnitude of the relative error |[F(x) − F(0)]/F(0)| is bounded by a quantity that is not large.

1.5 If F(x) = (x − 1)², what is the exact formula for [F(x + Δx) − F(x)]/F(x)? What does this say about the conditioning of the evaluation of F(x) near x = 1?

1.6 Let Sₙ := ∫₀^π (x/π)ⁿ sin x dx and show that two integrations by parts results in the recursion

    Sₙ = 1 − n(n − 1)Sₙ₋₂/π².

Further argue that S₀ = 2 and that Sₙ₋₁ > Sₙ > 0 for every n.
(a) Compute S₁₅ with this recursion (make sure that you use an accurate value for π).
(b) To analyze what happened in (a), consider the recursion

    Ŝₙ = 1 − n(n − 1)Ŝₙ₋₂/π²

with Ŝ₀ = 2(1 − u), that is, the same computation with the starting value perturbed by one digit in the last place. Find a recursion for Sₙ − Ŝₙ. From this recursion, derive a formula for Sₙ − Ŝₙ in terms of u. Use this formula to explain what happened in (a).
(c) Examine the "backwards" recursion

    Ŝₙ₋₂ = (1 − Ŝₙ)π²/(n(n − 1))

starting with Ŝ_N = 0. What is Ŝₙ − Sₙ? Why?

1.7 For brevity let us write s = sin(θ), c = cos(θ) for some value of θ. Once c is computed, we can compute s inexpensively from s = √(1 − c²). (Either sign of the square root may be needed in general, but let us consider here only the positive root.) Suppose the cosine routine produces c + δc instead of c. Ignoring any error made in evaluating the formula for s, show that this absolute error of δc induces an absolute error in s of approximately δs = −(c/s)δc. For the range 0 ≤ θ ≤ π/2, are there θ for which this way of computing sin(θ) has an accuracy comparable to the accuracy of cos(θ)? Are there θ for which it is much less accurate? Repeat for relative errors.
1.2 EXAMPLES OF FLOATING POINT CALCULATIONS

The floating point number system has properties that are similar to those of the real number system, but they are not identical. We have already seen some differences due to the finite range of exponents. It might be thought that because one arithmetic operation can be carried out with small relative error, the same would be true of several operations. Unfortunately this is not true. We shall see that multiplication and division are more satisfactory in this respect than addition and subtraction. For floating point numbers x, y, and z,

fl(fl(xy)z) = xy(1 + δ₁)z(1 + δ₂) = xyz(1 + δ₁)(1 + δ₂),

where |δ₁|, |δ₂| ≤ u.
The product (1 + δ₁)(1 + δ₂) = 1 + δ, where δ is "small," and can, of course, be explicitly bounded in terms of u. It is more illuminating to note that (1 + δ₁)(1 + δ₂) = 1 + δ₁ + δ₂ + δ₁δ₂ ≈ 1 + δ₁ + δ₂, so that δ ≈ δ₁ + δ₂ and an approximate bound for |δ| is 2u. Before generalizing this, we observe that it may well be the case that
fl(fl(xy)z) ≠ fl(x fl(yz))

even when the exponent range is not exceeded. However, each of these quantities has the form xyz(1 + δ₁)(1 + δ₂), so that

fl(fl(xy)z) = xyz(1 + δ),

where δ is "small." Thus, the associative law for multiplication is approximately true. In general, if we wish to multiply x₁, x₂,..., xₙ, we might do this by the algorithm

P₁ = x₁
Pᵢ = fl(Pᵢ₋₁ xᵢ), i = 2, 3,..., n.
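As a quick check (our illustration, not from the text), this product algorithm can be run in IEEE single precision and compared with a double precision reference; the observed relative error stays well within a bound of n times the unit roundoff:

```python
import numpy as np

rng = np.random.default_rng(1)
factors = rng.uniform(0.5, 2.0, size=100)   # 100 factors, safely on scale

p = np.float32(1.0)
for xi in factors:
    p = np.float32(p * np.float32(xi))      # P_i = fl(P_{i-1} * x_i) in single
reference = float(np.prod(factors))          # double precision reference

u = 2.0**-24                                 # unit roundoff, IEEE single
rel_err = abs(float(p) - reference) / abs(reference)
print(rel_err, 100 * u)                      # observed error vs. the bound n*u
```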
Treating these operations in real arithmetic we find that

Pᵢ = x₁x₂···xᵢ(1 + δ₁)(1 + δ₂)···(1 + δᵢ),

where each |δⱼ| ≤ u. The relative error of each Pᵢ can be bounded in terms of u without difficulty, but more insight is obtained if we approximate

Pᵢ ≈ x₁x₂···xᵢ(1 + δ₁ + δ₂ + ··· + δᵢ),

which comes from neglecting products of the δⱼ. Then

|δ₁ + δ₂ + ··· + δᵢ| ≤ iu.
This says that a bound on the approximate relative errors grows additively. Each multiplication could increase the relative error by no more than one unit of roundoff. Division can be analyzed in the same way, and the same conclusion is true concerning the possible growth of the relative error. Example 1.10.
The gamma function, defined for x > 0 as

Γ(x) = ∫₀^∞ t^(x−1) e^(−t) dt,
generalizes the factorial function for integers to real numbers x (and complex x as well). This follows from the fundamental recursion

Γ(x + 1) = xΓ(x)    (1.5)

and the fact that Γ(1) = 1. A standard way of approximating Γ(x) for x ≥ 2 uses the fundamental recursion to reduce the task to approximating Γ(y) for 2 ≤ y ≤ 3. This is done by letting N be an integer such that N ≤ x < N + 1, letting y = x − N + 2, and then noting that repeated applications of (1.5) yield

Γ(x) = (x − 1)(x − 2)···(x − N + 2)Γ(y) = (x − 1)(x − 2)···yΓ(y).
The function Γ(y) can be approximated well by the ratio R(y) of two polynomials for 2 ≤ y ≤ 3. Hence, we approximate

Γ(x) ≈ (x − 1)(x − 2)···y R(y).
If x is not too large, little accuracy is lost when these multiplications are performed in floating point arithmetic. However, it is not possible to evaluate Γ(x) for large x by this approach because its value grows very quickly as a function of x. This can be seen from the Stirling formula (see Case Study 5)

Γ(x) ≈ √(2π/x) (x/e)^x.
This example makes another point: the virtue of floating point arithmetic is that it automatically deals with numbers of greatly different size. Unfortunately, many of the special functions of mathematical physics grow or decay extremely fast. It is by no means unusual that the exponent range is exceeded. When this happens it is necessary to reformulate the problem to make it better scaled. For example, it is often better to work with the special function ln Γ(x) than with Γ(x) because it is better scaled. ∎

Addition and subtraction are much less satisfactory in floating point arithmetic than are multiplication and division. It is necessary to be alert for several situations that will be illustrated. When numbers of greatly different magnitudes are added (or subtracted), some information is lost. Suppose, for example, that we want to add y = 0.123456 × 10⁻⁴ to 0.100000 × 10¹ in six-digit chopped arithmetic. First, the exponents are adjusted to be the same and then the numbers are added:

  0.100000      × 10¹
+ 0.0000123456  × 10¹
  0.1000123456  × 10¹.

The result is chopped to 0.100012 × 10¹. Notice that some of the digits did not participate in the addition. Indeed, if y < xu, then fl(x + y) = x and the "small" number y plays no role at all. The loss of information does not mean the answer is inaccurate; it is accurate to one unit of roundoff. The problem is that the lost information may be needed for later calculations.
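The same loss of digits occurs in binary arithmetic. In this small sketch (ours, not the book's), IEEE single precision is simulated with NumPy: a y below the unit roundoff of x vanishes from the sum entirely, while a somewhat larger y participates only partially:

```python
import numpy as np

x = np.float32(1.0)

y1 = np.float32(1.0e-5)    # large enough for some digits to participate
y2 = np.float32(1.0e-10)   # smaller than x*u for u = 2**-24: no effect at all

s1 = x + y1                # differs from x, but not by exactly y1
s2 = x + y2                # fl(x + y2) == x: y2 plays no role

print(s1, s2)
```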
Example 1.11. Difference quotients.
Earlier we made use of the fact that for small δ,

F(x + δ) ≈ F(x) + δF′(x).

In many applications this is used to approximate F′(x) by the difference quotient [F(x + δ) − F(x)]/δ. To get an accurate approximation, δ must be "small" compared to x. It had better not be too small for the precision, or else we would have fl(x + δ) = x and compute a value of zero for F′(x). If δ is large enough to affect the sum but still "small," some of its digits will not affect the sum in the sense that fl(x + δ) = x + Δ with Δ ≠ δ. In the difference quotient we want to divide by the actual difference of the arguments, not δ itself. A better way to proceed is to define

Δ = fl(x + δ) − x

and approximate

F′(x) ≈ [F(x + Δ) − F(x)]/Δ.
The two approximations are mathematically equivalent, but computationally different. For example, suppose that F(x) = x and we approximate F′(x) for x = 1 using δ = 0.123456 × 10⁻⁴ in six-digit chopped arithmetic. We have just worked out fl(1 + δ) = 0.100012 × 10¹; similarly, Δ = 0.120000 × 10⁻⁴, showing the digits of δ that actually affect the sum. The first formula has

[F(1 + δ) − F(1)]/δ = Δ/δ = 0.972006 × 10⁰.

The second has

[F(1 + δ) − F(1)]/Δ = Δ/Δ = 0.100000 × 10¹.

Obviously the second form provides a better approximation to F′(1) = 1. Quality codes for the numerical approximation of the Jacobian matrices needed for optimization, root solving, and the solution of stiff differential equations make use of this simple device. ∎

Example 1.12. Limiting precision. In many of the codes in this book we attempt to recognize when we cannot achieve a desired accuracy with the precision available. The kind of test we make will be illustrated in terms of approximating a definite integral
This might be done by splitting the integration interval [a, b] into pieces [α, β] and adding up approximations on all the pieces. Suppose that the approximation on a piece involves evaluating f at the midpoint α + (β − α)/2. The accuracy of this formula improves as the length β − α of the piece is reduced, so that, mathematically, any accuracy can be obtained by making this width sufficiently
small. However, if β − α < 2uα, the floating point numbers α and α + (β − α)/2 are the same. The details of the test are not important for this chapter; the point is that when the interval is small enough, we cannot ignore the fact that there are only a finite number of digits in floating point arithmetic. If α and β cannot be distinguished in the precision available, the computational results will not behave like mathematical results from the real number system. In this case the user of the software must be warned that the requested accuracy is not feasible. ∎

Example 1.13. Summing a divergent series.
The sum S of a series a₁ + a₂ + ··· + aₙ + ··· is the limit of partial sums Sₙ = a₁ + a₂ + ··· + aₙ. There is an obvious algorithm for evaluating S:

S₁ = a₁
Sₙ = Sₙ₋₁ + aₙ, n = 2, 3,...,
continuing until the partial sums stop changing. A classic example of a divergent series is the harmonic series

1 + 1/2 + 1/3 + ··· + 1/n + ···.

If the above algorithm is applied to the harmonic series, the computed Sₙ increase and the terms aₙ = 1/n decrease until aₙ < Sₙ₋₁u, at which point fl(Sₙ₋₁ + aₙ) = Sₙ₋₁ and the partial sums stop changing. The surprising thing is how small Sₙ is when this happens; try it and see. In floating point arithmetic this divergent series has a finite sum. The observation that when the terms become small enough, the partial sums stop changing is true of convergent as well as divergent series. Whether the value so obtained is an accurate approximation to S depends on how fast the series converges. It really is necessary to do some mathematical analysis to get reliable results. Later in this chapter we consider how to sum the terms a little more accurately. ∎

An acute difficulty with addition and subtraction occurs when some information, lost due to adding numbers of greatly different size, is needed later because of a subtraction. Before going into this, we need to discuss a rather tricky point.

Example 1.14. Cancellation (loss of significance). Subtracting a number y from a number x that agrees with it in one or more leading digits leads to a cancellation of
these digits. For example, if x = 0.123654 × 10⁵ and y = 0.123456 × 10⁵, then

  0.123654 × 10⁵
− 0.123456 × 10⁵
  0.000198 × 10⁵ = 0.198000 × 10².
The interesting point is that when cancellation takes place, the subtraction is done exactly, so that fl(x − y) = x − y. The difficulty is what is called a loss of significance. When cancellation takes place, the result x − y is smaller in size than x and y, so errors already present in x and y are relatively larger in x − y. Suppose that x is an approximation to X and y is an approximation to Y. They might be measured values or the results of some computations. The difference x − y is an approximation to X − Y with the magnitude of its relative error satisfying

|(x − y) − (X − Y)| / |X − Y| ≤ (|x − X| + |y − Y|) / |X − Y|.
If x is so close to y that there is cancellation, the relative error can be large because the denominator X − Y is small compared to X or Y. For example, if X = 0.123654700··· × 10⁵, then x agrees with X to a unit of roundoff in six-digit arithmetic. With Y = y the value we seek is X − Y = 0.198700··· × 10². Even though the subtraction fl(x − y) = 0.198000 × 10² is done exactly, x − y and X − Y differ in the fourth digit. In this example, x and y have at least six significant digits, but their difference has only three significant digits. ∎

It is worth remarking that we made use of cancellation in Example 1.11 when we computed Δ = fl(x + δ) − x. Because δ is small compared to x, there is cancellation, and in this way we obtain in Δ the digits of δ that actually affected the sum.

Example 1.15. Roots of a quadratic. Suppose we wish to compute the roots of
x² + bx + c = 0.

The familiar quadratic formula gives the roots x₁ and x₂ as

x₁ = −b/2 − √((b/2)² − c),  x₂ = −b/2 + √((b/2)² − c),

assuming b > 0. If c is small compared to b, the square root can be rewritten and approximated using the binomial series to obtain

√((b/2)² − c) = (b/2)√(1 − 4c/b²) ≈ (b/2)(1 − 2c/b²) = b/2 − c/b.
This shows that the true roots satisfy

x₁ ≈ −b,  x₂ ≈ −c/b.

In finite precision arithmetic some of the digits of c have no effect on the sum (b/2)² − c. The extreme case is

fl((b/2)² − c) = fl((b/2)²).
It is important to appreciate that the quantity √((b/2)² − c) is computed accurately in a relative sense. However, some information is lost and we shall see that in some circumstances we need it later in the computation. A square root is computed with a small relative error and the same is true of the subtraction that follows. Consequently, the bigger root x₁ ≈ −b is computed accurately by the quadratic formula. In the computation of the smaller root, there is cancellation when the square root term is subtracted from b/2. The subtraction itself is done exactly, but the error already present in fl((b/2)² − c) becomes important in a relative sense. In the extreme case the formula results in zero as an approximation to x₂. For this particular task a reformulation of the problem avoids the difficulty. The expression

(x − x₁)(x − x₂) = x² − (x₁ + x₂)x + x₁x₂ = x² + bx + c

shows that x₁x₂ = c. As we have seen, the bigger root x₁ can be computed accurately using the quadratic formula, and then

x₂ = c/x₁

provides an accurate value for x₂. ∎
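The reformulation is easy to confirm numerically. In this sketch (ours, in double precision rather than the chopped decimal arithmetic of the text), b = 10⁸ and c = 1, so the true smaller root is very nearly −c/b = −10⁻⁸:

```python
import math

b, c = 1.0e8, 1.0

root = math.sqrt((b/2.0)**2 - c)   # accurate in a relative sense
x1 = -b/2.0 - root                 # bigger root: accurate either way
x2_naive = -b/2.0 + root           # smaller root: severe cancellation
x2_stable = c / x1                 # smaller root from x1 * x2 = c

print(x1, x2_naive, x2_stable)     # naive small root has a large relative error
```

Here the naive value is off by roughly 25 percent, while the rearranged computation is accurate to full double precision.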
Example 1.16. Alternating series. As we observed earlier, it is important to know when enough terms have been taken from a series to approximate the limit to a desired accuracy. Alternating series are attractive in this regard. Suppose a₀ > a₁ > ··· > aₙ > aₙ₊₁ > ··· > 0 and aₙ → 0. Then the alternating series

a₀ − a₁ + a₂ − a₃ + ···

converges to a limit S, and the error of the partial sum

Sₙ = a₀ − a₁ + ··· + (−1)ⁿaₙ

satisfies |S − Sₙ| ≤ aₙ₊₁. To see a specific example, consider the evaluation of sin x by its Maclaurin series

sin x = x − x³/3! + x⁵/5! − ···.
Although this series converges quickly for any given x, there are numerical difficulties when x is large. If, say, x = 10, the terms aₘ = x^(2m+1)/(2m + 1)! grow as large as 10⁹/9! ≈ 2756 before they begin to decrease. Clearly there are some fairly large terms here that must cancel out to yield a result sin 10 that has magnitude at most 1. The terms aₘ are the result of some computation that here can be obtained with small relative error. However, if aₘ is large compared to the sum S, a small relative error in aₘ will not be small compared to S, and S will not be computed accurately. We programmed the evaluation of this series in a straightforward way, being careful to compute each term from its predecessor, say,

aₘ = aₘ₋₁ x²/((2m)(2m + 1)),

so as to avoid unnecessarily large quantities. Using single precision standard IEEE arithmetic we added terms until the partial sums stopped changing. This produced the value −0.544040 while the exact value is sin 10 = −0.544021···. Although the series converges quickly for all x, some intermediate terms become large when x is large. Indeed, we got an overflow due to the small exponent range in IEEE single precision arithmetic when we tried x = 100. Clearly floating point arithmetic does not free us from all concerns about scaling. Series are often used as a way of evaluating functions. If the desired function value is small and if some terms in the series are comparatively large, then there must be cancellation and we must expect that inaccuracies in the computation of the terms will cause the function value to be inaccurate in a relative sense. ∎

We have seen examples showing that the sum of several numbers depends on the order in which they are added. Is there a "good" order? We now derive a rule of thumb that can be quite useful. We can form a₁ + a₂ + ··· + a_N by the algorithm used in Example 1.13. The first computed partial sum is
fl(S₂) = (a₁ + a₂)(1 + ε₂),

where |ε₂| ≤ u. It is a little special. The general case is represented by the next computed partial sum, which is

fl(S₃) = (fl(S₂) + a₃)(1 + ε₃) = [(a₁ + a₂)(1 + ε₂) + a₃](1 + ε₃),

where |ε₃| ≤ u. To gain insight, we approximate this expression by dropping terms involving the products of small factors so that

fl(S₃) ≈ a₁ + a₂ + a₃ + (a₁ + a₂)(ε₂ + ε₃) + a₃ε₃.
Continuing in this manner we find that

fl(S_N) ≈ a₁ + a₂ + ··· + a_N + (a₁ + a₂)(ε₂ + ··· + ε_N) + a₃(ε₃ + ··· + ε_N) + ··· + a_N ε_N.

According to this approximation, the error made when aₖ is added to Sₖ₋₁ might grow, but its effect in S_N will be no bigger than (N − k + 1)u|aₖ|. This suggests that to reduce the total error, the terms should be added in order of increasing magnitude. A careful bound on the error of repeated summation leads to the same rule of thumb. Adding in order of increasing magnitude is usually a good order, but not necessarily the best order (because of the complex ways that the individual errors can interact). Much mathematical software makes use of this device to enhance the accuracy of the computations. The approximate error can be bounded by

|fl(S_N) − S_N| ≲ u[(N − 1)(|a₁| + |a₂|) + (N − 2)|a₃| + ··· + |a_N|] ≤ Nu(|a₁| + |a₂| + ··· + |a_N|).

Here we use the symbol ≲ to mean "less than or equal to a quantity that is approximately." (The "less than" is not sharp here.) Further manipulation provides an approximate bound on the magnitude of the sum's relative error,

|fl(S_N) − S_N| / |S_N| ≲ Nu (|a₁| + ··· + |a_N|) / |a₁ + ··· + a_N|.
The dangerous situation is when |a₁ + ··· + a_N| is much smaller than |a₁| + ··· + |a_N|, which is when cancellation takes place. An important consequence is that if all the terms have the same sign, the sum will be computed accurately in a relative sense, provided only that the number of terms is not too large for the precision available. For a convergent series it is necessary that the terms tend to zero, so the later terms are generally the smaller ones. Rather than sum in the natural order m = 0, 1,..., it would often be better to work out mathematically how many terms N are needed to approximate S to the desired accuracy and then calculate S_N in the reverse order a_N, a_(N−1),....

Example 1.17. There are two ways of interpreting errors that are important in numerical analysis. So far we have been considering a forward error analysis. This corresponds to bounding the errors in the answer by bounding at each stage the errors that might arise and their effects. To be specific, recall the expression for the error of summing three numbers:

fl(S₃) = (x₁ + x₂)(1 + ε₂)(1 + ε₃) + x₃(1 + ε₃).
A forward error analysis might bound the absolute error by

|fl(S₃) − S₃| ≤ |x₁ + x₂|(2u + u²) + |x₃|u.
(This is a sharp version of the approximate bound given earlier.) A backward error analysis views the computed result as the result computed in exact arithmetic of a problem with somewhat different data. Let us reinterpret the expression for fl(S₃) in this light. It is seen that fl(S₃) = y₁ + y₂ + y₃, where

y₁ = x₁(1 + ε₂)(1 + ε₃),
y₂ = x₂(1 + ε₂)(1 + ε₃),
y₃ = x₃(1 + ε₃).

In the backward error analysis view, the computed sum is the exact sum of terms yₖ that are each close in a relative sense to the given data xₖ. An algorithm that is stable in the sense of backward error analysis provides the exact solution of a problem with data close to that of the original problem. As to whether the two solutions are close, that is a matter of the conditioning of the problem. A virtue of this way of viewing errors is that it separates the roles of the stability of the algorithm and the condition of the problem. Backward error analysis is particularly attractive when the input data are of limited accuracy, as, for example, when the data are measured or computed. It may well happen that a stable algorithm provides the exact solution to a problem with data that cannot be distinguished from the given data because of their limited accuracy. We really cannot ask more of the numerical scheme in such a situation, but again we must emphasize that how close the solution is to that corresponding to the given data depends on the conditioning of the problem. We shall return to this matter in the next chapter. A numerical example will help make the point. For x₁ = 0.12 × 10², x₂ = 0.34 × 10¹, x₃ = −0.15 × 10², the true value of the sum is S₃ = 0.40 × 10⁰. When evaluated in two-digit decimal chopped arithmetic, fl(S₃) = 0.00 × 10⁰, a very inaccurate result. Nevertheless, with y₁ = 0.116 × 10², y₂ = x₂, and y₃ = x₃, we have fl(S₃) = y₁ + y₂ + y₃. The computed result is the exact sum of numbers close to the original data. Indeed, two of the numbers are the same as the original data and the remaining one differs by less than a unit of roundoff. ∎

For most of the numerical tasks in this book it is not necessary to worry greatly about the effects of finite precision arithmetic. Two exceptions are the subject of the remaining examples. The first is the computation of a root of a continuous function f(x).
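The backward error example above can be replayed mechanically. The sketch below (ours, not the book's) simulates two-digit chopped decimal arithmetic with a small helper; the chopped sum is the exact sum of the perturbed data y₁ = 11.6, x₂, x₃:

```python
import math

def chop(x, digits=2):
    """Chop x to the given number of significant decimal digits."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1
    f = x / 10.0**e
    # tiny offset guards against binary representation error before truncating
    f = math.trunc(f * 10.0**digits + math.copysign(1e-9, f)) / 10.0**digits
    return f * 10.0**e

x1, x2, x3 = 12.0, 3.4, -15.0   # 0.12e2, 0.34e1, -0.15e2; the true sum is 0.4

s2 = chop(x1 + x2)              # fl(12 + 3.4) = 15, not 15.4
s3 = chop(s2 + x3)              # fl(15 - 15) = 0.0: very inaccurate

y1 = 11.6                       # perturbed first datum
print(s3, y1 + x2 + x3)         # the chopped sum equals y1 + x2 + x3 exactly
```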
Naturally we would like to compute a number z for which f(z) is as close to zero as possible in the precision available. Routines for this purpose ask the user to specify a desired accuracy. Even if the user does not request a very accurate root, the routine may “accidentally” produce a number z for which f(z) is very small. Because it is usually not expensive to solve this kind of problem, it is quite reasonable for a user
to ask for all the accuracy possible. One way or the other, we must ask what happens when x is very close to a root z. An underflow is possible since f(x) → 0 as x → z. If this does not happen, it is usually found that the computed value of f(x) fluctuates erratically as x → z. Because of this we must devise algorithms that will behave sensibly when the computed value of f(x) does not have even the correct sign for x near z. An example will show how the details of evaluation of f(x) are important when x is near a root.

Example 1.18. Let f(x) = x² − 2x + 1 be evaluated at x = 1.018 with three-digit chopped arithmetic and −100 ≤ e ≤ 100. The exact answer is f(1.018) = 0.324 × 10⁻³. Because the coefficients of f are small integers, no error arises when they are represented as floating point numbers. However, x is not a floating point number in this arithmetic and there is an error when x̂ = fl(x) = 0.101 × 10¹ is formed. Several algorithms are possible that arise from different ways of writing f(x), for example x² − 2x + 1, (x − 1)², and x(x − 2) + 1.

All of the results have large relative errors. This should not be too surprising since the problem is poorly conditioned (see Exercise 1.5). Figure 1.2 is a plot of 281 values of the function f(x) = (x exp(x) − 1)³ for arguments near x = 0.567. Single precision IEEE arithmetic was used for this calculation and the cubed term in the function was expanded out to generate more roundoff. In exact arithmetic f(x) vanishes at only one point ξ near 0.567, a point that satisfies ξ = exp(−ξ). However, it is clear from the figure that the floating point version is not nearly so well behaved near this ξ. ∎

In Chapter 2 we discuss the numerical solution of a system of linear equations. In contrast to the solution of nonlinear equations, codes based on the method there try to compute an answer as accurately as possible in the precision available. A difficulty with precision arises when we try to assess the accuracy of the result.

Example 1.19. Residual calculation.
The simplest system of linear equations is ax = b.
The quality of an approximate solution z can be measured by how well it satisfies the equation. The discrepancy is called its residual:

r = b − az.
Figure 1.2 Floating point evaluation of f(x) = x³e^(3x) − 3x²e^(2x) + 3xe^x − 1.
If z is a very good solution, its residual r is small and there is cancellation when forming b − az. Defining δ by

fl(az) = az(1 + δ), |δ| ≤ u,

the computed residual is

fl(b − az(1 + δ)) = (b − az(1 + δ))(1 + ε), |ε| ≤ u.

The computed residual differs from the true residual by a quantity that can be as large as |az|u ≈ |b|u. When r is small because z is a good solution and b happens to be large, the computed residual may have few, if any, correct digits (although the relative residual r/b is fine). When z is a good solution, it is generally necessary to use double precision to obtain its residual to single precision accuracy. ∎
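The effect is visible even for one equation in one unknown. In this sketch (ours), z is the single precision solution of ax = b with a = 3 and b = 1; the residual computed in single precision is exactly zero, while double precision reveals the true, nonzero residual:

```python
import numpy as np

a = np.float32(3.0)
b = np.float32(1.0)
z = b / a                                  # single precision solution of ax = b

r_single = b - a * z                       # residual in single precision
r_double = float(b) - float(a) * float(z)  # same residual in double precision

print(r_single, r_double)
```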
EXERCISES

1.8 Suppose that z = 0.180 × 10² is an approximate solution of ax = b for a = 0.111 × 10⁰, b = 0.200 × 10¹. Use three-digit decimal chopped arithmetic to compute the residual r = b − az. Compute the residual in double precision and in exact arithmetic. Discuss the results.
1.9 For α = 0.8717 and β = 0.8719 calculate the midpoint of the interval [α, β] using the formula (α + β)/2. First use four-digit decimal chopped arithmetic, then four-digit decimal rounded arithmetic. How reasonable are the answers? Find another formula for the midpoint and use four-digit decimal (rounded or chopped) arithmetic to calculate the midpoint of [0.8717, 0.8719]. Is your formula better or worse?

1.10 In the model arithmetic, a single operation is carried out with a small relative error. Unfortunately
the same is not true of complex arithmetic. To see this, let z = a + ib and w = c + id. By definition, zw = (ac − bd) + i(ad + bc). Show how the real part, ac − bd, of the product zw might be computed with a large relative error even though all individual calculations are done with a small relative error.

1.11 An approximation S to e^x can be computed by using the Taylor series for the exponential function:

S := 1
P := 1
for k = 1, 2,... begin
    P := xP/k
    S := S + P
end k.

The loop can be stopped when S = S + P to machine precision.
(a) Try this algorithm with x = −10 using single precision arithmetic. What was k when you stopped? What is the relative error in the resulting approximation? Does this appear to be a good way to compute e^(−10) to full machine precision?
(b) Repeat (a) with x = +10.
(c) Why are the results so much more reasonable for (b)?
(d) What would be a computationally safe way to compute e^(−10)?

1.12 Many problems in astrodynamics can be approximated by the motion of one body about another under the influence of gravity, for example, the motion of a satellite about the earth. This is a useful approximation because by a combination of analytical and numerical techniques, these two body problems can be solved easily. When a better approximation is desired, for example, we need to account for the effect of the moon or sun on the satellite, it is natural to compute it as a correction to the orbit of the two body problem. This is the basis of Encke's method; for details see Section 9.3 of [2]. A fundamental issue there is to calculate a certain function f(q). Explain why f(q) cannot be evaluated accurately in finite precision arithmetic when q is small. In the explanation you should assume that y^(3/2) can be evaluated with a relative error that is bounded by a small multiple of the unit roundoff. Use the binomial series to expand f(q) in powers of q. Why is this series a better way to evaluate f(q) when q is small?

1.13 Let a regular polygon of N sides be inscribed in a unit circle. If L_N denotes the length of one side, the circumference of the polygon, N × L_N, approximates the circumference of the circle, 2π; hence π ≈ N L_N/2 for large N. Using Pythagoras' theorem it is easy to relate L_2N to L_N:

L_2N = √(2 − √(4 − L_N²)).

Starting with L₄ = √2 for a square, approximate π by means of this recurrence. Explain why a straightforward implementation of the recurrence in floating point arithmetic does not yield an accurate value for π. (Keep in mind that L_N → 0.) Show that the recurrence can be rearranged as

L_2N = L_N/√(2 + √(4 − L_N²))

and demonstrate that this form works better.

1.14 A study of the viscous decay of a line vortex leads to an expression for the velocity

v(r, t) = (Γ/(2πr))[1 − exp(−r²/(4νt))]

at a distance r from the origin at time t > 0. Here Γ is the initial circulation and ν > 0 is the kinematic viscosity. For some purposes the behavior of the velocity at distances r ≈ 0 is needed. Assuming that a function for the accurate computation of sinh(x) is available, manipulate the expression into one that can be evaluated in a more accurate way for very small r.
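As a hedged illustration of Exercise 1.11 (ours, not part of the text), the algorithm can be run in simulated single precision with NumPy. For x = −10 the large, cancelling terms ruin the relative accuracy; for x = +10 all terms are positive and the sum is fine:

```python
import math
import numpy as np

def exp_series(x):
    """Sum the Taylor series for e^x in single precision until S stops changing."""
    x = np.float32(x)
    S = np.float32(1.0)
    P = np.float32(1.0)
    k = 0
    while True:
        k += 1
        P = np.float32(x * P / np.float32(k))
        S_new = np.float32(S + P)
        if S_new == S:                  # stop when S = S + P to machine precision
            return float(S), k
        S = S_new

S_neg, _ = exp_series(-10.0)
S_pos, _ = exp_series(+10.0)

rel_neg = abs(S_neg - math.exp(-10.0)) / math.exp(-10.0)
rel_pos = abs(S_pos - math.exp(10.0)) / math.exp(10.0)
print(rel_neg, rel_pos)    # x = -10 loses most of its digits; x = +10 does not
```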
1.3 CASE STUDY 1

Now let us look at a couple of examples that illustrate points made in this chapter. The first considers the evaluation of a special function. The second illustrates the fact that practical computation often requires tools from several chapters of this book. Filon's method for approximating finite Fourier integrals will be developed in Chapter 3 and applied in Chapter 5. An aspect of the method that we take up here is the accurate computation of coefficients for the method. The representation of the hyperbolic cosine function in terms of exponentials,

x = cosh(y) = (exp(y) + exp(−y))/2,

makes it easy to verify that for x ≥ 1,

cosh⁻¹(x) = ln(x − √(x² − 1)).
Let us consider the evaluation of this expression for cosh⁻¹(x) in floating point arithmetic when x >> 1. An approximation made earlier in this chapter will help us to understand better what it is that we want to compute. After approximating

√(x² − 1) = x√(1 − 1/x²) ≈ x(1 − 1/(2x²)) = x − 1/(2x),

we find that

cosh⁻¹(x) ≈ ln(x − (x − 1/(2x))) = ln(1/(2x)) = −ln(2x).
The first difficulty we encounter in the evaluation is that when x is very large, x² overflows. This overflow is unnecessary because the argument we are trying to compute is on scale. If x is large, but not so large that x² overflows, the effect of the 1 "falls off the end" in the subtraction, meaning that fl(x² − 1) = fl(x²). This subtraction is carried out with a small relative error, and the same is true of less extreme cases, but there is a loss of information when numbers are of greatly different size. The square root is obtained with a small relative error. The information lost in the subtraction is needed at the next step because there is severe cancellation. Indeed, for large x, we might end up computing fl(x − x) = 0 as the argument of the logarithm, which would be disastrous. How might we reformulate the task to avoid the difficulties just noted? A little calculation shows that

x − √(x² − 1) = 1/(x + √(x² − 1)),

a form that avoids cancellation. The preliminary analysis we did to gain insight suggests a better way of handling the rest of the argument:

x + √(x² − 1) = x(1 + √(1 − (1/x)²)).
Notice that here we form (1/x)² instead of 1/x². This rearrangement exchanges a possible overflow when forming x² for a harmless underflow, harmless, that is, if the system sets an underflow to zero and continues on. We see now that the expression

cosh⁻¹(x) = −ln(x(1 + √(1 − (1/x)²)))

avoids all the difficulties of the original expression for cosh⁻¹(x). Indeed, it is clear that for large x, evaluation of this expression in floating point arithmetic will lead to an approximation of −ln(2x), as it should.

For our second example we consider Filon's method for approximating finite Fourier integrals, which is developed in Chapter 3:
Here θ = ωh and

α = (θ² + θ sin θ cos θ − 2 sin²θ)/θ³,
β = 2[θ(1 + cos²θ) − 2 sin θ cos θ]/θ³,
γ = 4(sin θ − θ cos θ)/θ³.

The details of the terms Cₑ and C₀ do not concern us here. There is a similar formula for integrals with the sine function in place of the cosine function that involves the same coefficients α, β, γ. It is shown in Case Study 3 that the absolute error of this approximation is bounded by a constant times h³. To get an accurate integral, it might be necessary to use a small h, meaning that θ is small, but the expressions for the coefficients are unsatisfactory in this case. Each suffers from cancellation in the numerator, and the resulting error is amplified by the division by the small quantity θ³. To see the cancellation more clearly, let us approximate the sine and cosine terms in, say, α by the leading terms in their Taylor series, sin(θ) ≈ θ and cos(θ) ≈ 1, to get

α ≈ (θ² + θ·θ·1 − 2θ²)/θ³ = 0.

Obviously there is perfect cancellation of leading terms in the numerator. This analysis suggests a remedy: for small θ, expand the coefficients in Taylor series and deal with the cancellation and small divisor analytically. The resulting series are

α = 2θ³/45 − 2θ⁵/315 + 2θ⁷/4725 − ···,
β = 2/3 + 2θ²/15 − 4θ⁴/105 + ···,
γ = 4/3 − 2θ²/15 + θ⁴/210 − ···.
It might be remarked that it was easy to compute these expansions by means of the symbolic capabilities of the Student Edition of MATLAB. In the program used to compute the integral of Case Study 3, these expressions were used for θ < 0.1. Because the terms decrease rapidly, nested multiplication is not only an efficient way to evaluate the expressions but is also accurate. As a numerical illustration of the difficulty we evaluated both forms of α for a range of θ in single precision in FORTRAN. Reference values were computed using the trigonometric form and double precision. This must be done with some care. For instance, if T is a single precision variable and we want a double precision copy DT for computing the reference values, the lines of code

T = 0.1E0
DT = 0.1D0

are not equivalent to

T = 0.1E0
DT = T

This is because on a machine with binary or hexadecimal arithmetic, 0.1E0 agrees with 0.1D0 only to single precision. For the reference computation we require a double precision version of the actual machine number used in the single precision computations, hence we must use the second code. As we have remarked previously, most computers today perform intermediate computations in higher precision, despite specification of the precision of all quantities. With T, S, and C declared as single precision variables, we found remarkable differences in the result of

S = SIN(T)
C = COS(T)
ALPHA = (T**2 + T*S*C - 2E0*S**2)/T**3

and

ALPHA = (T**2 + T*SIN(T)*COS(T) - 2E0*SIN(T)**2)/T**3

differences that depended on the machine and compiler used. On a PC with a Pentium chip, the second code gave nearly full single precision accuracy. The first gave the poor results that we expect of computations carried out entirely in single precision. The coefficient α was computed for a range of θ using the trigonometric definition and single precision arithmetic and its relative error computed using a reference value computed in double precision.
Similarly the error of the value computed in single precision from the Taylor series was found. Plotted against θ in Figure 1.3 is the relative error for both methods (on a logarithmic scale). Single precision accuracy corresponds to about seven digits, so the Taylor series approach gives about all the accuracy we could hope for, although for the largest value of θ it appears that another term in the expansion would be needed to get full accuracy. Obviously the trigonometric definition leads to a great loss of accuracy for "small" θ. Indeed, θ is not very small in an absolute sense here; rather, it is small considering its implications for the cost of evaluating the integral when the parameter ω is moderately large.
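The experiment can be repeated in Python (our sketch, with NumPy's float32 standing in for FORTRAN single precision; the truncated series for α uses the leading coefficients 2/45, −2/315, 2/4725, which can be derived by hand from the trigonometric form). Note that the reference value is the double precision evaluation of the very same machine number θ, the analogue of DT = T above:

```python
import numpy as np

def alpha_trig(t):
    """Trigonometric form of the Filon coefficient alpha."""
    return (t**2 + t*np.sin(t)*np.cos(t) - 2.0*np.sin(t)**2) / t**3

def alpha_series(t):
    """Leading terms of the Taylor expansion of alpha about t = 0."""
    t2 = t*t
    return t*t2*(2.0/45.0 - t2*(2.0/315.0 - t2*(2.0/4725.0)))

theta = np.float32(0.01)
ref = float(alpha_trig(np.float64(theta)))    # double copy of the same number

err_trig = abs(float(alpha_trig(theta)) - ref) / ref
err_series = abs(float(alpha_series(theta)) - ref) / ref
print(err_trig, err_series)    # the trig form in single precision is useless here
```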
Figure 1.3 Error in series form (•) versus trig form (*) for a Filon coefficient.
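Returning to the first example of this case study, the two expressions for cosh⁻¹ can be compared directly in double precision (our sketch; math.acosh serves as an independent check). For x = 10⁸ the original form produces a zero argument for the logarithm, while the rearranged form is accurate:

```python
import math

x = 1.0e8

# Original form ln(x - sqrt(x^2 - 1)): fl(x**2 - 1) = fl(x**2) here, so the
# subtraction x - sqrt(...) cancels to exactly zero.
arg = x - math.sqrt(x*x - 1.0)
try:
    naive = math.log(arg)
except ValueError:           # math.log(0.0) raises ValueError
    naive = None

# Rearranged form -ln(x*(1 + sqrt(1 - (1/x)^2))): on scale for large x.
stable = -math.log(x*(1.0 + math.sqrt(1.0 - (1.0/x)**2)))

print(arg, stable, -math.acosh(x))   # the stable result matches -acosh(x)
```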
1.4 FURTHER READING A very interesting and readable account of the interaction of the floating point number system with the solution of quadratic equations has been given by Forsythe [6]. Henrici [8] gives another elementary treatment of floating point arithmetic that introduces the useful idea of a statistical treatment of errors. To pursue the subject in depth, consult the book Rounding Errors in Algebraic Processes by J. H. Wilkinson [10]. Wilkinson’s books are unmatched for their blend of theoretical advance, striking examples, practical insight, applications, and readability. For more information on the practical evaluation of special functions, see the books by Cody and Waite [3] or Fike [5]. Other interesting discussions on floating point arithmetic are the books of Goldberg [7] and Higham [9].
REFERENCES

1. ANSI/IEEE, IEEE Standard for Binary Floating Point Arithmetic, Std 754-1985, New York, 1985.
2. R. Bate, D. Mueller, and J. White, Fundamentals of Astrodynamics, Dover, New York, 1971.
3. W. Cody and W. Waite, Software Manual for the Elementary Functions, Prentice Hall, Englewood Cliffs, N.J., 1980.
4. J. Dongarra, J. Bunch, C. Moler, and G. Stewart, LINPACK Users' Guide, SIAM, Philadelphia, 1979.
5. C. Fike, Computer Evaluation of Mathematical Functions, Prentice Hall, Englewood Cliffs, N.J., 1968.
6. G. Forsythe, "What is a satisfactory quadratic equation solver?," in Constructive Aspects of the Fundamental Theorem of Algebra, B. Dejon and P. Henrici, eds., Wiley, London, 1969.
7. D. Goldberg, "What every computer scientist should know about floating-point arithmetic," ACM Computing Surveys, 23 (1991), pp. 5-48.
8. P. Henrici, Elements of Numerical Analysis, Wiley, New York, 1964.
9. N. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, 1996.
10. J. Wilkinson, Rounding Errors in Algebraic Processes, Dover, Mineola, N.Y., 1994.
MISCELLANEOUS EXERCISES FOR CHAPTER 1

1.15 Use three-digit decimal chopped arithmetic with m = -100 and M = 100 to construct examples for which

You are allowed to use negative numbers. Examples can be constructed so that either one of the expressions cannot be formed in the arithmetic, or both can be formed but the values are different.

1.16 For a set of measurements x1, x2, ..., xN, the sample mean is defined to be

    xbar = (x1 + x2 + ... + xN)/N.

The sample standard deviation s is defined to be

    s = [ ((x1 - xbar)^2 + (x2 - xbar)^2 + ... + (xN - xbar)^2) / (N - 1) ]^(1/2).

Another expression,

    s = [ (x1^2 + x2^2 + ... + xN^2 - N xbar^2) / (N - 1) ]^(1/2),

is often recommended for hand computation of s. Show that these two expressions for s are mathematically equivalent. Explain why one of them may provide better numerical results than the other, and construct an example to illustrate your point.

1.17 Fourier series,

    a0/2 + (a1 cos x + b1 sin x) + (a2 cos 2x + b2 sin 2x) + ...,

are of great practical value. It appears to be necessary to evaluate a large number of sines and cosines if we wish to evaluate such a series, but this can be done cheaply by recursion. For the specific x of interest, for n = 1, 2, ... let sn = sin nx and cn = cos nx. Show that for n = 2, 3, ...

    sn = s1 c(n-1) + c1 s(n-1)   and   cn = c1 c(n-1) - s1 s(n-1).

After evaluating s1 = sin x and c1 = cos x with the intrinsic functions of the programming language, this recursion can be used to evaluate simply and inexpensively all the sin nx and cos nx that are needed. To see that the recursion is stable, suppose that for some m ≥ 1, sm and cm are computed incorrectly as

    s~m = sm + εm   and   c~m = cm + δm.

If no further arithmetic errors are made, the errors εm and δm will propagate in the recurrence so that we compute

    s~n = s1 c~(n-1) + c1 s~(n-1),   c~n = c1 c~(n-1) - s1 s~(n-1)

for n = m + 1, .... Let εn and δn be the errors in s~n and c~n so that, by definition,

    s~n = sn + εn,   c~n = cn + δn.

Prove that for all n > m

    (εn^2 + δn^2)^(1/2) = (εm^2 + δm^2)^(1/2),

which implies that for all n > m

    |εn| ≤ (εm^2 + δm^2)^(1/2),   |δn| ≤ (εm^2 + δm^2)^(1/2).

In this sense, errors are not amplified and the recurrence is quite stable.
CHAPTER 2
SYSTEMS OF LINEAR EQUATIONS
One of the most frequently encountered problems in scientific computation is that of solving n simultaneous linear equations in n unknowns. If we denote the unknowns by x1, x2, ..., xn, such a system can be written in the form

    a11 x1 + a12 x2 + ... + a1n xn = b1
    a21 x1 + a22 x2 + ... + a2n xn = b2
    ..................................
    an1 x1 + an2 x2 + ... + ann xn = bn.      (2.1)

The given data here are the right-hand sides bi, i = 1, 2, ..., n, and the coefficients aij for i, j = 1, 2, ..., n. Problems of this nature arise almost everywhere in the applications of mathematics (e.g., the fitting of polynomials and other curves through data and the approximation of differential and integral equations by finite, algebraic systems). Several specific examples are found in the exercises for this chapter (see also [12] or [13]). To talk about (2.1) conveniently, we shall on occasion use some notation from matrix theory. However, we do not presume that the reader has an extensive background in this area. Using matrices, (2.1) can be written compactly as

    Ax = b,      (2.2)

where A is the n × n matrix of coefficients (aij), x is the column vector of unknowns (xi), and b is the column vector of right-hand sides (bi).

It is instructive to consider first the case n = 1, that is, the single equation a11 x1 = b1. If a11 ≠ 0, the equation has a unique solution, namely x1 = b1/a11. If a11 = 0, then some problems do not have solutions (b1 ≠ 0) while others have many solutions (if b1 = 0, any number x1 is a solution). The same is true for general n. There are two kinds of matrices, nonsingular and singular. If the matrix A is nonsingular, there is a unique solution vector x for any given right-hand side b. If A is singular, there is no solution for some right-hand sides b and many solutions for other b. In this book we concentrate on systems of linear equations with nonsingular matrices.
Example 2.1. The problem

    2 x1 + 3 x2 = 8
    5 x1 + 4 x2 = 13

has a nonsingular coefficient matrix. The linear system has the unique solution x1 = 1, x2 = 2, or x = (1, 2)^T. ■

Example 2.2.
The problem

    2 x1 + 3 x2 = 4
    4 x1 + 6 x2 = 7

has a singular coefficient matrix. If the right-hand side is b = (4, 7)^T, there is no solution, for if x1 and x2 were numbers such that 4 = 2 x1 + 3 x2, then we would have 8 = 2 · 4 = 2 · (2 x1 + 3 x2) = 4 x1 + 6 x2, which is impossible because of the second equation. If instead b = (4, 8)^T, there are many solutions, namely

    x1 = (4 - 3c)/2,   x2 = c

for all real numbers c. ■
In the nonsingular case there exists a matrix called the inverse of A, denoted by A^-1, such that the unique solution of (2.2) is given by x = A^-1 b. For n = 1, A^-1 = (1/a11). Should we compute A^-1 and then form the product A^-1 b to solve (2.2)? We shall see that the answer is generally no, even if we want to solve (2.2) with the same matrix A and many different right-hand sides b.
2.1 GAUSSIAN ELIMINATION WITH PARTIAL PIVOTING
The most popular method for solving a nonsingular system of linear equations (2.1) is called Gaussian elimination. It is both simple and effective. In principle it can be used to compute solutions of problems with singular matrices when they have solutions, but there are better ways to do this. The basic idea in elimination is to manipulate the equations of (2.1) so as to obtain an equivalent set of equations that is easy to solve. An equivalent set of equations is one that has the same solutions. There are three basic operations used in elimination: (1) multiplying an equation by a nonzero constant, (2) subtracting a multiple of one equation from another, and (3) interchanging rows.

First, if any equation of (2.1) is multiplied by a nonzero constant α, we obtain an equivalent set of equations. To see this, suppose that we multiply the kth equation by α to get

    α ak1 x1 + α ak2 x2 + ... + α akn xn = α bk.      (2.3)

If x1, x2, ..., xn satisfy (2.1), then they obviously satisfy the set of equations that is the same as (2.1) except for the kth equation, which is (2.3). Conversely, because α ≠ 0, if x1, x2, ..., xn satisfy this second set of equations, they obviously satisfy the first.
Second, suppose we replace equation i by the result of subtracting the multiple α of equation k from equation i:

    (ai1 - α ak1) x1 + (ai2 - α ak2) x2 + ... + (ain - α akn) xn = bi - α bk.      (2.4)

If x1, x2, ..., xn satisfy (2.1), then by definition

    ai1 x1 + ai2 x2 + ... + ain xn = bi

and

    ak1 x1 + ak2 x2 + ... + akn xn = bk,

so that

    (ai1 x1 + ai2 x2 + ... + ain xn) - α (ak1 x1 + ... + akn xn) = bi - α bk.

Thus x1, x2, ..., xn satisfy the set of equations that is the same as (2.1) except that equation i is replaced by (2.4). To work in reverse, suppose now that x1, x2, ..., xn satisfy this new set of equations. Then in particular they satisfy equations i and k of it,

    (ai1 - α ak1) x1 + ... + (ain - α akn) xn = bi - α bk

and

    ak1 x1 + ... + akn xn = bk,

so that adding α times the second of these equations to the first gives

    ai1 x1 + ai2 x2 + ... + ain xn = bi,

which is just equation i of (2.1). Thus x1, x2, ..., xn also satisfy (2.1). Third, writing the equations in a different order clearly does not affect the solution, so interchanging rows results in an equivalent set of equations.

Example 2.3.
Consider the problem

    3 x1 + 6 x2 + 9 x3 = 39
    2 x1 + 5 x2 - 2 x3 = 3      (2.5)
      x1 + 3 x2 -   x3 = 2.

If we subtract a multiple α of the first equation from the second, we get

    (2 - 3α) x1 + (5 - 6α) x2 + (-2 - 9α) x3 = 3 - 39α.

Choosing α = 2/3 makes the coefficient of x1 zero, so that the unknown x1 no longer appears in this equation:

    x2 - 8 x3 = -23.

We say that we have "eliminated" the unknown x1 from the equation. Similarly, we eliminate x1 from the third equation of (2.5) by subtracting 1/3 times the first equation from it:

    x2 - 4 x3 = -11.

The system of equations (2.5) has been reduced to the equivalent system

    3 x1 + 6 x2 + 9 x3 = 39
            x2 - 8 x3 = -23      (2.6)
            x2 - 4 x3 = -11.

Now we set aside the first equation and continue the elimination process with the last two equations in the system (2.6) involving only the unknowns x2 and x3. Multiply the second equation by 1 and subtract from the third to produce

    4 x3 = 12,      (2.7)

a single equation in one unknown. The equations (2.6) have now become the equivalent set of equations

    3 x1 + 6 x2 + 9 x3 = 39
            x2 - 8 x3 = -23      (2.8)
                 4 x3 = 12.

The system (2.8) is easy to solve. From the last equation, x3 = 12/4 = 3. The known value of x3 is then used in the second equation of (2.8) to obtain x2, that is,

    x2 = 8 x3 - 23 = 8 · 3 - 23 = 1.

Finally, the values for x2 and x3 are used in the first equation of (2.8) to obtain x1,

    x1 = (-6 x2 - 9 x3 + 39)/3 = (-6 · 1 - 9 · 3 + 39)/3 = 2.
Because this set of equations is equivalent to the original set of equations, the solution of (2.5) is x1 = 2, x2 = 1, x3 = 3. ■

Let us turn to the general problem (2.1), which we now write with superscripts to help explain what follows:

    a11(1) x1 + a12(1) x2 + ... + a1n(1) xn = b1(1)
    a21(1) x1 + a22(1) x2 + ... + a2n(1) xn = b2(1)
    ..........................................
    an1(1) x1 + an2(1) x2 + ... + ann(1) xn = bn(1).

If a11(1) ≠ 0, we can eliminate the unknown x1 from each of the succeeding equations. A typical step is to subtract from equation i the multiple mi1 of the first equation. The results will be denoted with a superscript 2. The step is carried out by first forming

    mi1 = ai1(1)/a11(1)

and then forming

    aij(2) = aij(1) - mi1 a1j(1),   j = 2, ..., n,

and

    bi(2) = bi(1) - mi1 b1(1).

The multiple mi1 of the first equation is chosen to make ai1(2) = 0, that is, to eliminate the unknown x1 from equation i. Of course, if mi1 = 0, the variable x1 does not appear in equation i, so it does not need to be eliminated. By recognizing this, the arithmetic of elimination can be avoided. Doing this for each i = 2, ..., n we arrive at the system

    a11(1) x1 + a12(1) x2 + ... + a1n(1) xn = b1(1)
               a22(2) x2 + ... + a2n(2) xn = b2(2)
               ...........................
               an2(2) x2 + ... + ann(2) xn = bn(2).

Notice that if we start out with A stored in a C or FORTRAN array, we can save a considerable amount of storage by overwriting the aij(1) with the aij(2) as they are created. Also, we can save the multipliers mi1 in the space formerly occupied by the first column of A and just remember that all the elements below the diagonal in the first column are really zero after elimination. Later we shall see why it is useful to save the multipliers. Similarly, the original vector b can be overwritten with the bi(2) as they are formed.
Now we set the first equation aside and eliminate x2 from equations i = 3, ..., n in the same way. If a22(2) ≠ 0, then for i = 3, 4, ..., n we first form

    mi2 = ai2(2)/a22(2)

and then

    aij(3) = aij(2) - mi2 a2j(2),   j = 3, ..., n,

and

    bi(3) = bi(2) - mi2 b2(2).

This results in

    a11(1) x1 + a12(1) x2 + a13(1) x3 + ... + a1n(1) xn = b1(1)
               a22(2) x2 + a23(2) x3 + ... + a2n(2) xn = b2(2)
                          a33(3) x3 + ... + a3n(3) xn = b3(3)
                          ...........................
                          an3(3) x3 + ... + ann(3) xn = bn(3).

As before, we set the first two equations aside and eliminate x3 from equations i = 4, ..., n. This can be done as long as a33(3) ≠ 0. The elements a11(1), a22(2), a33(3), ... are called pivot elements or simply pivots. Clearly, the process can be continued as long as no pivot vanishes. Assuming this to be the case, we finally arrive at

    a11(1) x1 + a12(1) x2 + ... + a1n(1) xn = b1(1)
               a22(2) x2 + ... + a2n(2) xn = b2(2)
                          .................
                                ann(n) xn = bn(n).      (2.9)

In the computer implementation, the elements of the original matrix A are successively overwritten by the aij(k) as they are formed, and the multipliers mij are saved in the places corresponding to the variables eliminated. The process of reducing the system of equations (2.1) to the form (2.9) is called (forward) elimination. The result is a system with a kind of coefficient matrix called upper triangular. An upper triangular matrix U = (uij) is one for which uij = 0 whenever i > j.

It is easy to solve a system of equations (2.9) with an upper triangular matrix by a process known as back substitution. If we solve the last equation for xn,

    xn = bn(n)/ann(n).
Using the known value of xn, we then solve equation n - 1 for xn-1, and so forth. A typical step is to solve equation k,

    akk(k) xk + ak,k+1(k) xk+1 + ... + akn(k) xn = bk(k),

for xk using the previously computed xn, xn-1, ..., xk+1:

    xk = ( bk(k) - ak,k+1(k) xk+1 - ... - akn(k) xn ) / akk(k).
The only way this process can break down (in principle) is if a pivot element is zero. Example 2.4.
Consider the two examples 0 · xl + 2x2 = 3 4 x1 + 5x2 = 6
and 0 · x1 + 2x2 = 3 0 · x1 + 5x2 = 6. so it cannot be used as a pivot, but there is a simple remedy for the The entry difficulty. We merely interchange the equations to get the equivalent set 4 x 1 + 5x2 = 6 0 · x1 + 2x2 = 3. For this problem the difficulty was easily avoided. This device will not work on the other problem, however, as it is singular. The first equation of this set requires x2 = 3/2 n and the second requires x2 = 6/5, so there is no solution at all. In the general case, suppose we have arrived at
and
We examine the elements
l,
in column k for j > k. If for some index
we interchange equations k and l. This does not affect the solution, so
we rename the coefficients in the same way as before. The new pivot
is the old
which was nonzero, so the elimination process can now proceed as usual. If, however,
we have a difficulty of another sort: the
2.1 GAUSSIAN ELIMINATION WITH PARTIAL PIVOTING
37
matrix is singular. We prove this by showing that if a solution exists, it cannot be unique. Assume x is a solution to the problem. Set zk+l = xk+l , . . . , zn = xn and let zk be arbitrary. The quantities zk,zk+l , . . . ,zn satisfy equations k through n because the unknown xk does not appear in any of those equations. Now values for z1, . . . , zk1 may be determined by back substitution so that equations 1 through k  1 are satisfied:
This can be done because none of these pivot elements vanishes. Since all of the equations are satisfied, we have produced a whole family of solutions, namely z1, z2, ..., zk, xk+1, ..., xn with zk arbitrary. This shows that the matrix is singular.

Example 2.5. The following problems illustrate how singular systems are revealed during elimination. In the system

      x1 + 2 x2 -   x3 = 2
    2 x1 + 4 x2 +   x3 = 7
    3 x1 + 6 x2 - 2 x3 = 7,

one step of elimination yields

      x1 + 2 x2 -   x3 = 2
          0 · x2 + 3 x3 = 3
          0 · x2 +   x3 = 1.

Since we cannot find a nonzero pivot for the second elimination step, the system is singular. It is not hard to show that the solutions are

    x1 = 3 - 2c,   x2 = c,   x3 = 1

for all real numbers c. The system

      x1 -   x2 +   x3 = 0
    2 x1 +   x2 -   x3 = 3
      x1 + 2 x2 - 2 x3 = 2

is also singular, since two steps of elimination give

      x1 -   x2 +   x3 = 0
           3 x2 - 3 x3 = 3
                0 · x3 = -1.

In this case there is no solution at all. ■
We conclude that by using interchanges, the elimination process has a zero pivot only if the original problem is singular. This statement is fine in theory, but the distinction between singular and nonsingular problems is blurred in practice by roundoff effects. Unless a pivot is exactly zero, interchange of equations is unnecessary in theory. However, it is plausible that working with a pivot that is almost zero will lead to problems of accuracy in finite precision arithmetic, and this turns out to be the case.

Example 2.6. The following example is due to Forsythe and Moler [6]:

    0.000100 x1 + 1.00 x2 = 1.00
        1.00 x1 + 1.00 x2 = 2.00.

Using three-digit decimal rounded floating point arithmetic, one step in the elimination process without interchanging equations yields for the second equation

    [1.00 - (10,000)(1.00)] x2 = [2.00 - (10,000)(1.00)]

or

    -10,000 x2 = -10,000.

Clearly, x2 = 1.00 and, by back substitution, x1 = 0.00. Notice that all information contained in the second equation was lost at this stage. This happened because the small pivot caused a large multiplier and subsequently the subtraction of numbers of very different size. With interchange we have

    1.00 x1 + 1.00 x2 = 2.00
              1.00 x2 = 1.00

and x1 = 1.00, x2 = 1.00. The true solution is about x1 = 1.00010, x2 = 0.99990. ■
As we saw in the last example, small pivot elements may lead to inaccurate results. When eliminating the variable xk in row i, a small pivot element leads to a large multiplier

    mik = aik(k)/akk(k).

When mik akj(k) is much larger than aij(k), there is a loss of information as

    aij(k+1) = aij(k) - mik akj(k)

is formed, information that may be needed later. A large multiplier is also likely to result in a large entry in the upper triangular matrix resulting from elimination. In the solution of the corresponding linear system by back substitution, we compute

    xk = ( bk(k) - ak,k+1(k) xk+1 - ... - akn(k) xn ) / akk(k).

If the pivot (the denominator) is small and the true value xk is of moderate size, it must be the case that the numerator is also small. But if there are entries of the upper triangular matrix that are large, this is possible only if cancellation occurs in the numerator. The large entries might well have been computed with a modest relative error, but because the entries are large, this leads to a large absolute error in the numerator after cancellation. The small denominator amplifies this and there is a substantial relative error in xk.

Partial pivoting is the most popular way of avoiding small pivots and controlling the size of the multipliers. When we eliminate xk, we select the largest coefficient (in magnitude) of xk in the last n - k + 1 equations as the pivot. That is, if |alk(k)| is the largest of the |ajk(k)| for j = k, k + 1, ..., n, we interchange row k and row l. By renaming the rows we can assume that the pivot has the largest magnitude possible. Partial pivoting avoids small pivots and nicely controls the size of the multipliers, since then

    |mik| = |aik(k)/akk(k)| ≤ 1.

Controlling the size of the multipliers moderates the growth of the entries in the upper triangular matrix resulting from elimination. Let a = max over i, j of |aij|. Now

    |aij(2)| ≤ |aij(1)| + |mi1| |a1j(1)| ≤ 2a,

and it is easy to go on to show that

    |aij(k)| ≤ 2^(k-1) a.

This implies that

    max over i, j, k of |aij(k)| ≤ 2^(n-1) a      (2.10)

when partial pivoting is done. The growth that is possible here is very important to bounds on the error of Gaussian elimination. Wilkinson [15] points out that there is equality in this bound for matrices with 1 on the diagonal and in the last column and -1 everywhere below the diagonal. However, usually the growth is very modest. Research into this matter is surveyed in [11]. There are other ways of selecting pivot elements that lead to better bounds on the error, and there are other ways of solving linear systems that have still better bounds. Some details will be mentioned later, but in practice the numerical properties of Gaussian elimination with partial pivoting are so good that it is the method of choice except in special circumstances, and when one speaks of "Gaussian elimination" it is assumed that partial pivoting is done unless something to the contrary is said. Gaussian elimination with partial pivoting is the basic method used in the popular computing environments MATLAB, Mathematica, and MATHCAD.
Example 2.7. Using exact arithmetic and elimination with partial pivoting, solve the following system:

Since 2 > 1, we interchange the first and second equations to get

Using 2 as a pivot, we eliminate the coefficients of x1 in equations two and three to get

Since 2 > 1, we interchange equations two and three,

and using 2 as pivot obtain

Back substitution then gives

or

The algorithm for elimination is quite compact:

Elimination, modification of b.
for k = 1, ..., n - 1
begin
    interchange rows so that akk = maxk