Michael T. Heath University of Illinois at Urbana-Champaign
ii
c Copyright 1997 by The McGraw-Hill Companies. All rights reserved.
About the Author
Michael T. Heath holds four positions at the University of Illinois at Urbana-Champaign: Professor in the Department of Computer Science, Director of the Computational Science and Engineering Program, Director of the Center for Simulation of Advanced Rockets, and Senior Research Scientist at the National Center for Supercomputing Applications (NCSA). He received a B.A. in Mathematics from the University of Kentucky, an M.S. in Mathematics from the University of Tennessee, and a Ph.D. in Computer Science from Stanford University. Before joining the University of Illinois in 1991, he spent a number of years at Oak Ridge National Laboratory, first as Eugene P. Wigner Postdoctoral Fellow and later as Computer Science Group Leader in the Mathematical Sciences Research Section. His research interests are in numerical analysis—particularly numerical linear algebra and optimization—and in parallel computing. He has has been an editor of the SIAM Journal on Scientific Computing, SIAM Review, and the International Journal of High Performance Computing Applications, as well as several conference proceedings. In 2000, he was named an ACM Fellow.
This book presents a broad overview of numerical methods and software for students and professionals in computationally oriented disciplines who need to solve mathematical problems. It is not a traditional numerical analysis text in that it contains relatively little detailed analysis of the computational algorithms presented. Instead, I try to convey a general understanding of the techniques available for solving problems in each major category, including proper problem formulation and interpretation of results, but I advocate the use of professionally written mathematical software for obtaining solutions whenever possible. The book is aimed much more at potential users of mathematical software than at potential creators of such software. I hope to make the reader aware of the relevant issues in selecting appropriate methods and software and using them wisely. At the University of Illinois, this book is used as the text for a comprehensive, onesemester course on numerical methods that serves three main purposes: • As a terminal course for senior undergraduates, mainly computer science, mathematics, and engineering majors • As a breadth course for graduate students in computer science who do not intend to specialize in numerical analysis • As a training course for graduate students in science and engineering who need to use numerical methods and software in their research. It is a core course for the interdisciplinary graduate program in Computational Science and Engineering sponsored by the College of Engineering. To accommodate this diverse student clientele, the prerequisites for the course and the book have been kept to a minimum: basic familiarity with linear algebra, multivariate calculus, and a smattering of differential equations. No prior familiarity with numerical methods is assumed. The book adopts a fairly sophisticated perspective, however, and the course moves at a rather rapid pace in order to cover all of the material, so a reasonable level of maturity on the part of the student (or reader) is advisable. Beyond the academic setting, I hope that the book will also be useful as a reference for practicing engineers and scientists who may need a quick overview of a given computational problem and the methods and xiii
xiv
PREFACE
software available for solving it. Although the book emphasizes the use of mathematical software, unlike some other software-oriented texts it does not provide any software, nor does it concentrate on any specific software packages, libraries, or environments. Instead, for each problem category pointers are provided to specific routines available from publicly accessible repositories, other textbooks, and the major commercial libraries and packages. In many academic and industrial computing environments such software is already installed, and in any case pointers are also provided to public domain software that is freely accessible via the Internet. The computer exercises in the book are not dependent on any specific choice of software or programming language. The main elements in the organization of the book are as follows: Chapters: Each chapter of the book covers a major computational problem area. The first half of the book deals primarily with algebraic problems, whereas the second half treats analytic problems involving derivatives and integrals. The first two chapters are fundamental to the remainder of the book, but the subsequent chapters can be covered in various orders according to the instructor’s preference. More specifically, the direct interdependence of chapters is as follows: Chapter 2 3 4 5
Depends on 1 1, 2 1–3 1, 2, 4
Chapter 6 7 8 9
Depends on 1–5 1, 2 1, 2, 5, 7 1, 2, 4, 5, 7, 8
Chapter 10 11 12 13
Depends on 1, 2, 4, 5, 7–9 1, 2, 4–10 1, 2, 7 1
Thus, the main opportunities for moving material around are to cover Chapters 7 and 12 earlier and Chapter 6 later than their appearance in the book. For example, Chapters 3, 7, and 12 all involve some type of data fitting, so it might be desirable to cover them as a unit. As another example, iterative methods for linear systems are covered in Chapter 11 on partial differential equations because that is where the most important motivating examples come from, but much of this material could be covered immediately following direct methods for linear systems in Chapter 2. The entire book can be covered in one semester by moving at a rapid pace or by omitting a few sections. There is also sufficient material for a more leisurely two-quarter course. A one-quarter course would likely require omitting some chapters. Chapter 13, on random numbers and stochastic simulation, is only peripherally related to the remainder of the book and is an obvious candidate for omission if time runs short (random number generators are used in a number of exercises throughout the book, however). Examples: Almost every concept and method introduced is illustrated by one or more examples. These examples are meant to supplement the relatively terse general discussion and should be read as an essential part of the text. The examples have been kept as simple as possible (sometimes at the risk of oversimplification) so that the reader can easily follow them. In my experience, a simple example that is thoroughly understood is usually more helpful than a more realistic example that is more difficult to follow. Software: The lists of available software for each problem category are meant to be reasonably comprehensive. I have not attempted to single out the “best” software available for a given problem, partly because usually no single package is superior in all respects and
xv partly to allow for the varied software availability and choice of programming language that may apply for different readers. All of the recommended software is at least competently written, and some of it is superb. Exercises: The book contains many exercises, which are divided into three classes: • Review questions, which are short-answer questions designed to test basic conceptual understanding • Exercises, which require somewhat more thought, longer answers, and possibly some hand computation • Computer problems, which require some programming and often involve the use of existing software. The review questions are meant for self-testing on the part of the reader. They include some deliberate repetition to drive home key points and to build confidence in the mastery of the material. The longer exercises are meant to be suitable for written homework assignments. Some of these require manual computations with simple examples, whereas others are designed to supply missing details of derivations and proofs omitted from the main text. The latter should be especially useful if the book is used for a more theoretical course. The computer problems provide an opportunity for hands-on experience in using the recommended software for solving typical problems in each category. Some of these problems are generic, but others are directly related to specific applications in various scientific and engineering disciplines. This book provides a fairly comprehensive introduction to scientific computing, but scientific computing is only part of what has become known as computational science. Computational science is a relatively new mode of scientific investigation that includes several phases: 1. Development of a mathematical model—often expressed as some type of equation—of a physical phenomenon or system of interest 2. Development of an algorithm to solve the equation numerically 3. Implementation of the algorithm in computer software 4. Numerical simulation of the physical phenomenon using the computer software 5. Representation of the computed results in some comprehensible form, often through graphical visualization 6. Interpretation and validation of the computed results, which may lead to correction or further refinement of the original mathematical model and repetition of the cycle, if necessary. As we construe it, scientific computing is primarily concerned with phases 2–4: the development, implementation, and use of numerical algorithms and software. Although the other phases are equally important in the overall process, their detailed study is beyond the scope of this book. A serious study of mathematical modeling would require far more domain-specific knowledge than we assume and far more space than we can accommodate. Fortunately, mathematical modeling is the subject of numerous excellent books, some of a general nature and others focusing on specific individual disciplines. Thus, although numerous concrete applications appear in the exercises, our main discussion treats each major
xvi
PREFACE
problem type in a very general form. Similarly, we measure the accuracy of computed results with respect to the true solution of a given equation, whereas in practice results should also be validated against the actual physical phenomenon being modeled whenever possible. Learning about scientific computing is an important component in the training of computational scientists and engineers, but there is more to computational science than just numerical methods and software. Accordingly, this book is intended as only a portion of a well-rounded curriculum in computational science, which should also include additional computer skills—e.g., software design principles, data structures, non-numerical algorithms, performance evaluation and tuning, graphics/visualization, and the software tools associated with all of these—as well as much deeper treatment of specific applications in science and engineering. The presentation of largely familiar material is inevitably influenced by other treatments one has seen. My initial experience in presenting some of the material in this book was as a graduate teaching assistant at Stanford University using a prepublication draft of the book by Forsythe, Malcolm, and Moler [82]. “FMM” was one of the first softwareoriented textbooks on numerical methods, and its spirit is very much reflected in the current book. I later used FMM very successfully in teaching in-house courses for practical-minded scientists and engineers at Oak Ridge National Laboratory, and more recently I have used its successor, by Kahaner, Moler and Nash [142], in teaching a similar course at the University of Illinois. Readers familiar with those two books will recognize the origin of some aspects of the treatment given here. As far as they go, those two books would be difficult to improve upon; in the present book I have incorporated a significant amount of new material while trying to preserve the spirit of the originals. In addition to these two obvious sources, I have doubtless borrowed many examples and exercises from many other sources over the years, for which I am grateful. I would like to acknowledge the influence of the mentors who first introduced me to the unexpected charms of numerical computation, Alston Householder and Gene Golub. I am grateful for the feedback I have received from students and instructors who have used the lecture notes from which this book evolved and from numerous reviewers, some anonymous, who read and commented on the manuscript before publication. Specifically, I would like to acknowledge the helpful input of Eric Grosse, Jason Hibbeler, Paul Hovland, Linda Kaufman, Thomas Kerkhoven, Cleve Moler, Padma Raghavan, David Richards, Faisal Saied, Paul Saylor, Robert Skeel, and the following reviewers: Alan George, University of Waterloo; Dianne O’Leary, University of Maryland; James M. Ortega, University of Virginia; John Strikwerda, University of Wisconsin; and Lloyd N. Trefethen, Cornell University. Finally, I deeply appreciate the patience and understanding of my wife, Mona, during the countless hours spent in writing the original lecture notes and then transforming them into this book. With great pleasure and gratitude I dedicate the book to her. Michael T. Heath
Notation
The notation used in this book is fairly standard and should require little explanation. We freely use vector and matrix notation, generally using uppercase bold type for matrices, lowercase bold type for vectors, regular (nonbold) type for scalars, and calligraphic type for sets. Iteration and component indices are denoted by subscripts, usually i through n. For example, a vector x and matrix A have entries xi and aij , respectively. On the few occasions when both an iteration index and a component index are needed, the iteration (k) is indicated by a parenthesized superscript, as in xi to indicate the ith component of the kth vector in a sequence. Otherwise, xi denotes the ith component of a vector x, whereas xi denotes the ith vector in a sequence. For simplicity, we will deal primarily with real vectors and matrices, although most of the theory and algorithms we discuss carry over with little or no change to the complex field. The set of real numbers is denoted by R, n-dimensional real Euclidean space by Rn , and the set of real m ? n matrices by Rm?n . The transpose of a vector or matrix is indicated by a superscript T , and the conjugate transpose by superscript H (for Hermitian). Unless otherwise indicated, all vectors are regarded as column vectors; a row vector is indicated by explicitly transposing a column vector. For typesetting convenience, the components of a column vector are sometimes indicated by transposing the corresponding row vector, as in x = [ x1 x2 ]T . The inner product (also known as dot product or scalar product) of two n-vectors x and y is simply a special case of matrix multiplication and thus is denoted by xT y (or xH y in the complex case). Similarly, their outer product, which is an n ? n matrix, is denoted by xy T . The identity matrix of order n is denoted by In (or just I if the dimension n is clear from context), and its ith column is denoted by ei . A zero matrix is denoted by O, a zero vector by o, and a zero scalar by 0. A diagonal matrix with diagonal entries d1 , . . . , dn is denoted by diag(d1 , . . . , dn ). Inequalities between vectors or matrices are to be understood elementwise. The ordinary derivative of a function f (t) of one variable is denoted by df /dt or by f 0 (t). Partial derivatives of a function of several variables, such as u(x, y), are denoted by ?u/?x, for example, or in some contexts by a subscript, as in ux . Notation for gradient vectors and xvii
xviii
NOTATION
Jacobian and Hessian matrices will be introduced as needed. All logarithms are natural logarithms (base e ? 2.718) unless another base is explicitly indicated. The computational cost, or complexity, of numerical algorithms is usually measured by the number of arithmetic operations required. Traditionally, numerical analysts have counted only multiplications (and possibly divisions and square roots), because multiplications were usually significantly more expensive than additions or subtractions and because in most algorithms multiplications tend to be paired with a similar number of additions (for example, in computing the inner product of two vectors). More recently, the difference in cost between additions and multiplications has largely disappeared.1 Computer vendors and users like to advertise the highest possible performance, so it is increasingly common for every arithmetic operation to be counted. Because certain operation counts are so well known using the traditional practice, however, in this book only multiplications are usually counted. To clarify the meaning, the phrase “and a similar number of additions” will be added, or else it will be explicitly stated when both are being counted. In quantifying operation counts and the accuracy of approximations, we will often use “big-oh” notation to indicate the order of magnitude, or dominant term, of a function. For an operation count, we are interested in the behavior as the size of the problem, say n, becomes large. We say that f (n) = O(g(n)) (read “f is big-oh of g” or “f is of order g”) if there is a positive constant C such that |f (n)| ? C|g(n)| for n sufficiently large. For example, 2n3 + 3n2 + n = O(n3 ) because as n becomes large, the terms of order lower than n3 become relatively insignificant. For an accuracy estimate, we are interested in the behavior as some quantity h, such as a stepsize or mesh spacing, becomes small. We say that f (h) = O(g(h)) if there is a positive constant C such that |f (h)| ? C|g(h)| for h sufficiently small. For example, 1 = 1 + h + h2 + h3 + · · · = 1 + h + O(h2 ) 1?h because as h becomes small, the omitted terms beyond h2 become relatively insignificant. Note that the two definitions are equivalent if h = 1/n. 1 Many modern microprocessors can perform a coupled multiplication and addition with a single multiply-add instruction.
Chapter 1
Scientific Computing
1.1
Introduction
The subject of this book is traditionally called numerical analysis. Numerical analysis is concerned with the design and analysis of algorithms for solving mathematical problems that arise in computational science and engineering. For this reason, numerical analysis has more recently become known as scientific computing. Numerical analysis is distinguished from most other parts of computer science in that it deals with quantities that are continuous, as opposed to discrete. It is concerned with functions and equations whose underlying variables—time, distance, velocity, temperature, density, pressure, stress, and the like—are continuous in nature. Most of the problems of continuous mathematics (for example, almost any problem involving derivatives, integrals, or nonlinearities) cannot be solved, even in principle, in a finite number of steps and thus must be solved by a (theoretically infinite) iterative process that ultimately converges to a solution. In practice, of course, one does not iterate forever, but only until the answer is approximately correct, “close enough” to the desired result for practical purposes. Thus, one of the most important aspects of scientific computing is finding rapidly convergent iterative algorithms and assessing the accuracy of the resulting approximation. If convergence is sufficiently rapid, even some of the problems that can be solved by finite algorithms, such as systems of linear algebraic equations, may in some cases be better solved by iterative methods, as we will see. Consequently, a second factor that distinguishes numerical analysis is its concern with approximations and their effects. Many solution techniques involve a whole series of approximations of various types. Even the arithmetic used is only approximate, for digital computers cannot represent all real numbers exactly. In addition to having the usual properties of good algorithms, such as efficiency, numerical algorithms should also be as reliable and accurate as possible despite the various approximations made along the way. 1
2
1.1.1
CHAPTER 1. SCIENTIFIC COMPUTING
General Strategy
In seeking a solution to a given computational problem, a basic general strategy, which occurs throughout this book, is to replace a difficult problem with an easier one that has the same solution, or at least a closely related solution. Examples of this approach include • Replacing infinite processes with finite processes, such as replacing integrals or infinite series with finite sums, or derivatives with finite difference quotients • Replacing general matrices with matrices having a simpler form • Replacing complicated functions with simple functions, such as polynomials • Replacing nonlinear problems with linear problems • Replacing differential equations with algebraic equations • Replacing high-order systems with low-order systems • Replacing infinite-dimensional spaces with finite-dimensional spaces For example, to solve a system of nonlinear differential equations, we might first replace it with a system of nonlinear algebraic equations, then replace the nonlinear algebraic system with a linear algebraic system, then replace the matrix of the linear system with one of a special form for which the solution is easy to compute. At each step of this process, we would need to verify that the solution is unchanged, or is at least within some required tolerance of the true solution. To make this general strategy work for solving a given problem, we must have • An alternative problem, or class of problems, that is easier to solve • A transformation of the given problem into a problem of this alternative type that preserves the solution in some sense Thus, much of our effort will go into identifying suitable problem classes with simple solutions and solution-preserving transformations into those classes. Ideally, the solution to the transformed problem is identical to that of the original problem, but this is not always possible. In the latter case the solution may only approximate that of the original problem, but the accuracy can usually be made arbitrarily good at the expense of additional work and storage. Thus, primary concerns are estimating the accuracy of such an approximate solution and establishing convergence to the true solution in the limit.
1.2 1.2.1
Approximations in Scientific Computation Sources of Approximation
There are many sources of approximation or inexactness in computational science. Some of these occur even before computation begins: • Modeling: Some physical features of the problem or system under study may be simplified or omitted (e.g., friction, viscosity). • Empirical measurements: Laboratory instruments have finite precision. Their accuracy may be further limited by small sample size, or readings obtained may be subject to
1.2. APPROXIMATIONS IN SCIENTIFIC COMPUTATION
3
random noise or systematic bias. For example, even the most careful measurements of important physical constants, such as Newton’s gravitational constant or Planck’s constant, typically yield values with at most eight or nine significant decimal digits. • Previous computations: Input data may have been produced by a previous step whose results were only approximate. The approximations just listed are usually beyond our control, but they still play an important role in determining the accuracy that should be expected from a computation. We will focus most of our attention on approximations over which we do have some influence. These systematic approximations that occur during computation include • Truncation or discretization: Some features of a mathematical model may be omitted or simplified (e.g., replacing a derivative by a difference quotient or using only a finite number of terms in an infinite series). • Rounding The computer representation of real numbers and arithmetic operations upon them is generally inexact. The accuracy of the final results of a computation may reflect a combination of any or all of these approximations, and the resulting perturbations may be amplified or magnified by the nature of the problem being solved or the algorithm being used, or both. The study of the effects of such approximations on the accuracy and stability of numerical algorithms is traditionally called error analysis. Example 1.1 Approximations. The surface area of the Earth might be computed using the formula A = 4?r2 for the surface area of a sphere of radius r. The use of this formula for the computation involves a number of approximations: • The Earth is modeled as a sphere, which is an idealization of its true shape. • The value for the radius, r ? 6370 km, is based on a combination of empirical measurements and previous computations. • The value for ? is given by an infinite limiting process, which must be truncated at some point. • The numerical values for the input data, as well as the results of the arithmetic operations performed on them, are rounded in a computer. The accuracy of the computed result depends on all of these approximations.
1.2.2
Data Error and Computational Error
As we have just seen, some errors can be attributed to the input data, whereas others are due to subsequent computational processes. Although this distinction is not always clearcut (rounding, for example, may affect both the input data and subsequent computational
4
CHAPTER 1. SCIENTIFIC COMPUTING
results), it is nevertheless helpful in understanding the overall effects of approximations in numerical computations. A typical problem can be viewed as the computation of the value of a function, say f : R ? R (most realistic problems are multidimensional, but for now we consider only one dimension for illustration). Denote the true value of the input data by x, so that the desired true result is f (x). Suppose that we must work with inexact input, say x ?, and we ? can compute only an approximation to the function, say f . Then Total error = f?(? x) ? f (x) ? = (f (? x) ? f (? x)) + (f (? x) ? f (x)) = computational error + propagated data error. The first term in the sum is the difference between the exact and approximate functions for the same input and hence can be considered pure computational error . The second term is the difference between exact function values due to error in the input and thus can be viewed as pure propagated data error . Note that the choice of algorithm has no effect on the propagated data error.
1.2.3
Truncation Error and Rounding Error
Similarly, computational error (that is, error made during the computation) can be subdivided into truncation (or discretization) error and rounding error: • Truncation error is the difference between the true result (for the actual input) and the result that would be produced by a given algorithm using exact arithmetic. It is due to approximations such as truncating an infinite series, replacing a derivative by a finite difference quotient, replacing an arbitrary function by a polynomial, or terminating an iterative sequence before convergence. • Rounding error is the difference between the result produced by a given algorithm using exact arithmetic and the result produced by the same algorithm using finite-precision, rounded arithmetic. It is due to inexactness in the representation of real numbers and arithmetic operations upon them, which we will consider in detail in Section 1.3. By definition, then, computational error is simply the sum of truncation error and rounding error. Although truncation error and rounding error can both play an important role in a given computation, one or the other is usually the dominant factor in the overall computational error. Roughly speaking, rounding error tends to dominate in purely algebraic problems with finite solution algorithms, whereas truncation error tends to dominate in problems involving integrals, derivatives, or nonlinearities, which often require a theoretically infinite solution process. The distinctions we have made among the different types of errors are important for understanding the behavior of numerical algorithms and the factors affecting their accuracy, but it is usually not necessary, or even possible, to quantify precisely the individual types of errors. Indeed, as we will soon see, it is often advantageous to lump all of the errors together and attribute them to error in the input data.
1.2. APPROXIMATIONS IN SCIENTIFIC COMPUTATION
1.2.4
5
Absolute Error and Relative Error
The significance of an error is obviously related to the magnitude of the quantity being measured or computed. For example, an error of 1 is much less significant in counting the population of the Earth than in counting the occupants of a phone booth. This motivates the concepts of absolute error and relative error , which are defined as follows: Absolute error = approximate value ? true value, absolute error Relative error = . true value Some authors define absolute error to be the absolute value of the foregoing difference, but we will take the absolute value explicitly when only the magnitude of the error is needed. Relative error can also be expressed as a percentage, which is simply the relative error times 100. Thus, for example, an absolute error of 0.1 relative to a true value of 10 would be a relative error of 0.01, or 1 percent. A completely erroneous approximation would correspond to a relative error of at least 1, or at least 100 percent, meaning that the absolute error is as large as the true value. One interpretation of relative error is that if a quantity x ? has a relative error of about 10?t , the decimal representation of x ? has about t correct significant digits. Another useful way to express the relationship between absolute and relative error is the following: Approximate value = (true value) ? (1 + relative error). Of course, we do not usually know the true value; if we did, we would not need to bother with approximating it. Thus, we will usually merely estimate or bound the error rather than compute it exactly, because the true value is unknown. For this same reason, relative error is often taken to be relative to the approximate value rather than to the true value, as in the foregoing definition.
1.2.5
Sensitivity and Conditioning
Difficulties in solving a problem accurately are not always due to an ill-conceived formula or algorithm, but may be inherent in the problem being solved. Even with exact computation, the solution to the problem may be highly sensitive to perturbations in the input data. A problem is said to be insensitive, or well-conditioned , if a given relative change in the input data causes a reasonably commensurate relative change in the solution. A problem is said to be sensitive, or ill-conditioned , if the relative change in the solution can be much larger than that in the input data. More formally, we define the condition number of a problem f at x as Cond =
|relative change in solution| |(f (? x) ? f (x))/f (x)| = , |relative change in input data| |(? x ? x)/x|
where x ? is a point near x. A problem is sensitive, or ill-conditioned, if its condition number is much larger than 1. Anyone who has felt a shower go from freezing to scalding, or vice
6
CHAPTER 1. SCIENTIFIC COMPUTING
versa, at the slightest touch of the temperature control has had first-hand experience with a sensitive system. Example 1.2 Evaluating a Function. Consider the propagated data error when a function f is evaluated for an approximate input argument x ? = x + h instead of the “true” input value x. We know from calculus that Absolute error = f (x + h) ? f (x) ? hf 0 (x), so that Relative error = and hence
f (x + h) ? f (x) f 0 (x) ?h , f (x) f (x)
0
hf (x)/f (x) f 0 (x)
Cond ? = x f (x) . h/x
Thus, the relative error in the function value can be much larger or smaller than that in the input, depending on the properties of the function involved and the particular value of the input. For example, if f (x) = ex , then the absolute error ? hex , relative error ? h, and cond ? |x|.
Example 1.3 Sensitivity. Consider the problem of computing values of the cosine function for arguments near ?/2. Let x ? ?/2 and let h be a small perturbation to x. Then the error in computing cos(x + h) is given by Absolute error = cos(x + h) ? cos(x) ? ?h sin(x) ? ?h, and hence Relative error ? ?h tan(x) ? ?. Thus, small changes in x near ?/2 cause large relative changes in cos(x) regardless of the method for computing it. For example, cos(1.57079) = 0.63267949 ? 10?5 , whereas cos(1.57078) = 1.63267949 ? 10?5 , so that the relative change in the output, 1.58, is about a quarter of a million times larger than the relative change in the input, 6.37 ? 10?6 .
1.2.6
Backward Error Analysis
Analyzing the forward propagation of errors in a computation is often very difficult. Moreover, the worst-case assumptions made at each stage often lead to a very pessimistic bound on the overall error. An alternative approach is backward error analysis: Consider the approximate solution obtained to be the exact solution for a modified problem, then ask how
1.2. APPROXIMATIONS IN SCIENTIFIC COMPUTATION
7
large a modification to the original problem is required to give the result actually obtained. In other words, how much data error in the initial input would be required to explain all of the error in the final computed result? In terms of backward error analysis, an approximate solution to a given problem is good if it is the exact solution to a “nearby” problem. These relationships are illustrated schematically (and not to scale) in Fig. 1.1, where x and f denote the exact input and function, respectively, f? denotes the approximate function actually computed, and x ? denotes an input value for which the exact function would give this computed result. Note that the equality f (? x) = f?(x) is due to the choice of x ?; indeed, this requirement defines x ?. f x •......................................................................................................................................................................................................• f (x) ......... ......... ? ? ......... ......... | | ......... ......... | | ? ......... f ......... ......... ......... ......... backward error forward error ......... ......... ......... ......... | | ......... ......... | ......... ... | . . ? ? . . . . . f ................ x ? •......................................................................................................................................................................................• f (? x) = f?(x) Figure 1.1: Schematic diagram of backward error analysis.
Example 1.4 Backward Error Analysis. Suppose we want a simple function for approximating the exponential function f (x) = ex , and we want to examine its accuracy for the argument x = 1. We know that the exponential function is given by the infinite series f (x) = ex = 1 + x +
x2 x3 + + ···, 2! 3!
so we might consider truncating the series after, say, four terms to get the approximation x2 x3 f?(x) = 1 + x + + . 2 6 The forward error in this approximation is then given by f?(x) ? f (x). To determine the backward error, we must find the input value x ? for f that gives the output value we actually obtained for f?, that is, for which f (? x) = f?(x). For the exponential function, we know that this value is given by x ? = log(f?(x)). Thus, for the particular input value x = 1, we have, to seven decimal places, f (x) = 2.718282,
f?(x) = 2.666667,
x ? = log(2.666667) = 0.980829, Forward error = f?(x) ? f (x) = ?0.051615,
8
CHAPTER 1. SCIENTIFIC COMPUTING Backward error = x ? ? x = ?0.019171.
The point here is not to compare the numerical values of the forward and backward errors quantitatively, but merely to illustrate the concepts involved and to show that both are legitimate approaches to assessing accuracy. In this case, the forward error indicates that the accuracy is fairly good because the output is close to what we wanted to compute, whereas the backward error indicates that the accuracy is fairly good because the output we obtained is correct for an input that is only slightly perturbed.
1.2.7
Stability and Accuracy
The concept of stability of a computational algorithm is analogous to conditioning of a mathematical problem. Both concepts have to do with sensitivity to perturbations, but the term stability is usually used for algorithms and conditioning for problems (although stability is sometimes used for problems as well, especially in differential equations). An algorithm is stable if the result it produces is relatively insensitive to perturbations resulting from approximations made during the computation. From the viewpoint of backward error analysis, an algorithm is stable if the result it produces is the exact solution to a nearby problem. Accuracy, on the other hand, refers to the closeness of a computed solution to the true solution of the problem under consideration. Stability of an algorithm does not by itself guarantee that the computed solution is accurate: accuracy depends on the conditioning of the problem as well as the stability of the algorithm. Stability tells us that the solution obtained is exact for a nearby problem, but the solution to that nearby problem is not necessarily close to the solution to the original problem unless the problem is well-conditioned. Thus, inaccuracy can result from applying a stable algorithm to an ill-conditioned problem as well as from applying an unstable algorithm to a well-conditioned problem.
1.3
Computer Arithmetic
As noted earlier, one type of approximation inevitably made in scientific computing is in representing real numbers on a computer. In this section we will examine in some detail the finite-precision arithmetic systems that are used for most scientific computations on digital computers.
1.3.1
Floating-Point Numbers
In a digital computer, the real number system of mathematics is represented approximately by a floating-point number system. The basic idea resembles scientific notation, in which a number of very large or very small magnitude is expressed as a number of moderate size times an appropriate power of ten. For example, 2347 and 0.0007396 are written as 2.347?103 and 7.396?10?4 , respectively. In this format, the decimal point moves, or floats, as the power of 10 changes. Formally, a floating-point number system is characterized by four integers:
1.3. COMPUTER ARITHMETIC
9
? t [L, U ]
Base or radix Precision Exponent range
By definition, any number x in the floating-point system is represented as follows: x = ±(d0 +
d1 d2 dt?1 + 2 + · · · + t?1 )? e , ? ? ?
where 0 ? di ? ? ? 1,
i = 0, . . . , t ? 1,
and L ? e ? U. The part in parentheses, represented by the string of base-? digits d0 d1 · · · dt?1 , is called the mantissa or significand , and e is called the exponent or characteristic of the floating-point number x. The portion d1 d2 · · · dt?1 of the mantissa is called the fraction. In a computer, the sign, exponent, and mantissa are stored in separate fields of a given floating-point word, each of which has a fixed width. The number zero is represented uniquely by having both its mantissa and its exponent equal to zero. Most computers today use binary (? = 2) arithmetic, but other bases have also been used in the past, such as hexadecimal (? = 16) in IBM mainframes and ? = 3 in an ill-fated Russian computer. Octal (? = 8) and hexadecimal notations are also commonly used as a convenient shorthand for writing binary numbers in groups of three or four binary digits (bits), respectively. For obvious reasons, decimal (? = 10) arithmetic is popular in handheld calculators. To facilitate human interaction, a computer usually converts numerical values from decimal notation on input and to decimal notation for output, regardless of the base it uses internally. Parameters for some typical floating-point systems are given in Table 1.1, which illustrates the trade-off between precision and exponent range implied by their respective field widths. For example, working with the same 64-bit word length, the Cray system has a wider exponent range than does IEEE double precision, but at the expense of carrying less precision. Table 1.1: Parameters for some typical floating-point systems System ? t L U IEEE SP 2 24 ?126 127 IEEE DP 2 53 ?1, 022 1, 023 Cray 2 48 ?16, 383 16, 384 HP calculator 10 12 ?499 499 IBM mainframe 16 6 ?64 63 The IEEE standard single-precision (SP) and double-precision (DP) binary floatingpoint systems are by far the most important today. They have been almost universally adopted for personal computers and workstations, and also for many mainframes and supercomputers as well. The IEEE standard was carefully crafted to eliminate the many anomalies and ambiguities in earlier vendor-specific floating-point implementations and has
10
CHAPTER 1. SCIENTIFIC COMPUTING
greatly facilitated the development of portable and reliable numerical software. It also allows for sensible and consistent handling of exceptional situations, such as division by zero.
1.3.2
Normalization
A floating-point system is said to be normalized if the leading digit d0 is always nonzero unless the number represented is zero. Thus, in a normalized floating-point system, the mantissa m of a given nonzero floating-point number always satisfies 1 ? m < ?. (An alternative convention is that d0 is always zero, in which case a floating-point number is said to be normalized if d1 6= 0, and ? ?1 ? m < 1 instead.) Floating-point systems are usually normalized because • The representation of each number is then unique. • No digits are wasted on leading zeros, thereby maximizing precision. • In a binary (? = 2) system, the leading bit is always 1 and thus need not be stored, thereby gaining one extra bit of precision for a given field width.
1.3.3
Properties of Floating-Point Systems
A floating-point number system is finite and discrete. The number of normalized floatingpoint numbers is 2(? ? 1)? t?1 (U ? L + 1) + 1 because there are two choices of sign, ? ? 1 choices for the leading digit of the mantissa, ? choices for each of the remaining t ? 1 digits of the mantissa, and U ? L + 1 possible values for the exponent. The 1 is added because the number could be zero. There is a smallest positive normalized floating-point number, Underflow level = UFL = ? L , which has a 1 as the leading digit and 0 for the remaining digits of the mantissa, and the smallest possible value for the exponent. There is a largest floating-point number, Overflow level = OFL = ? U +1 (1 ? ? ?t ), which has ? ? 1 as the value for each digit of the mantissa and the largest possible value for the exponent. Any number larger than OFL cannot be represented in the given floatingpoint system, nor can any positive number smaller than UFL. Floating-point numbers are not uniformly distributed throughout their range, but are equally spaced only between successive powers of ?. Not all real numbers are exactly representable in a floating-point system. Real numbers that are exactly representable in a given floating-point system are sometimes called machine numbers. Example 1.5 Floating-Point System. An example floating-point system is illustrated
1.3. COMPUTER ARITHMETIC
11
in Fig. 1.2, where the tick marks indicate all of the 25 floating-point numbers in a system having ? = 2, t = 3, L = ?1, and U = 1. For this system, the largest number is OFL = (1.11)2 ? 21 = (3.5)10 , and the smallest positive normalized number is UFL = (1.00)2 ? 2?1 = (0.5)10 . This is a very tiny, toy system for illustrative purposes only, but it is in fact characteristic of floating-point systems in general: at a sufficiently high level of magnification, every normalized floating-point system looks essentially like this one—grainy and unequally spaced.
Figure 1.2: Example of a floating-point number system.
1.3.4
Rounding
If a given real number x is not exactly representable as a floating-point number, then it must be approximated by some “nearby” floating-point number. We denote the floatingpoint approximation of a given real number x by fl(x). The process of choosing a nearby floating-point number fl(x) to approximate a given real number x is called rounding, and the error introduced by such an approximation is called rounding error , or roundoff error. Two of the most commonly used rounding rules are • Chop: The base-? expansion of x is truncated after the (t ? 1)st digit. Since fl(x) is the next floating-point number towards zero from x, this rule is also sometimes called round toward zero. • Round to nearest: fl(x) is the nearest floating-point number to x; in case of a tie, we use the floating-point number whose last stored digit is even. Because of the latter property, this rule is also sometimes called round to even. Rounding to nearest is the most accurate, but it is somewhat more expensive to implement correctly. Some systems in the past have used rounding rules that are cheaper to implement, such as chopping, but rounding to nearest is the default rounding rule in IEEE standard systems. Example 1.6 Rounding Rules. Rounding the following decimal numbers to two digits using each of the rounding rules gives the following results Number 1.649 1.650 1.651 1.699
Chop 1.6 1.6 1.6 1.6
Round to nearest 1.6 1.6 1.7 1.7
Number 1.749 1.750 1.751 1.799
Chop 1.7 1.7 1.7 1.7
Round to nearest 1.7 1.8 1.8 1.8
12
CHAPTER 1. SCIENTIFIC COMPUTING
A potential source of additional error that is often overlooked is in the decimal-to-binary and binary-to-decimal conversions that usually take place upon input and output of floatingpoint numbers. Such conversions are not covered by the IEEE standard, which governs only internal arithmetic operations. Correctly rounded input and output can be obtained at reasonable cost, but not all computer systems do so. Efficient, portable routines for correctly rounded binary-to-decimal and decimal-to-binary conversions—dtoa and strtod, respectively—are available from netlib (see Section 1.4.1).
1.3.5
Machine Precision
The accuracy of a floating-point system can be characterized by a quantity variously known as the unit roundoff , machine precision, or machine epsilon. Its value, which we denote by mach , depends on the particular rounding rule used. With rounding by chopping, mach = ? 1?t , whereas with rounding to nearest, mach = 21 ? 1?t . The unit roundoff is important because it determines the maximum possible relative error in representing a nonzero real number x in a floating-point system:
fl(x) ? x
? mach .
x
An alternative characterization of the unit roundoff that you may sometimes see is that it is the smallest number such that fl(1 + ) > 1, but this is not quite equivalent to the previous definition if the round-to-even rule is used. Another definition sometimes used is that mach is the distance from 1 to the next larger floating-point number, but this may differ from either of the other definitions. Although they can differ in detail, all three definitions of mach have the same basic intent as measures of the granularity of a floating-point system. For the toy illustrative system in Example 1.5, mach = 0.25 with rounding by chopping, and mach = 0.125 with rounding to nearest. For IEEE binary floating-point systems, mach = 2?24 ? 10?7 in single precision and mach = 2?53 ? 10?16 in double precision. We thus say that the IEEE single- and double-precision floating-point systems have about 7 and 16 decimal digits of precision, respectively. Though both are “small,” the unit roundoff should not be confused with the underflow level. The unit roundoff mach is determined by the number of digits in the mantissa field of a floating-point system, whereas the underflow level UFL is determined by the number of digits in the exponent field. In all practical floating-point systems, 0 < UFL < mach < OFL.
1.3. COMPUTER ARITHMETIC
1.3.6
13
Subnormals and Gradual Underflow
In the toy floating-point system illustrated in Fig. 1.2, there is a noticeable gap around zero. This gap, which is present to some degree in any floating-point system, is due to normalization: the smallest possible mantissa is 1.00. . . , and the smallest possible exponent is L, so there are no floating-point numbers between zero and ? L . If we relax our insistence on normalization and allow leading digits to be zero (but only when the exponent is at its minimum value), then the gap around zero can be “filled in” by additional floating-point numbers. For our toy illustrative system, this relaxation gains six additional floating-point numbers, the smallest positive one of which is (0.01)2 ?2?1 = (0.125)10 , as shown in Fig. 1.3.
Figure 1.3: Example of a floating-point system with subnormals. The extra numbers added to the system in this way are referred to as subnormal or denormalized floating-point numbers. Although they usefully extend the range of magnitudes representable, subnormal numbers have inherently lower precision than normalized numbers because they have fewer significant digits in their fractional parts. In particular, extending the range in this manner does not make the unit roundoff mach any smaller. Such an augmented floating-point system is sometimes said to exhibit gradual underflow , since it extends the lower range of magnitudes representable rather than underflowing to zero as soon as the minimum exponent value would otherwise be exceeded. The IEEE standard provides for such subnormal numbers and gradual underflow. Gradual underflow is implemented through a special reserved value of the exponent field because the leading binary digit is not stored and hence cannot be used to indicate a denormalized number.
1.3.7
Exceptional Values
The IEEE floating-point standard provides two additional special values that indicate exceptional situations: • Inf, which stands for “infinity,” results from dividing a finite number by zero, such as 1/0. • NaN, which stands for “not a number,” results from undefined or indeterminate operations such as 0/0, 0 ? Inf, or Inf/Inf. Inf and NaN are implemented in IEEE arithmetic through special reserved values of the exponent field. Whether Inf and NaN are supported at the user level in a given computing environment depends on the language, compiler, and run-time system. If available, these quantities can be helpful in designing software that deals gracefully with exceptional situations rather than
14
CHAPTER 1. SCIENTIFIC COMPUTING
abruptly aborting the program. In MATLAB (see Section 1.4.2), for example, if Inf and NaN arise, they are propagated sensibly through a computation (e.g., 1 + Inf = Inf). It is still desirable, however, to avoid such exceptional situations entirely, if possible. In addition to alerting the user to arithmetic exceptions, these special values can also be useful as flags that cannot be confused with any legitimate numeric value. For example, NaN might be used to indicate a portion of an array that has not yet been defined.
1.3.8
Floating-Point Arithmetic
In adding or subtracting two floating-point numbers, their exponents must match before their mantissas can be added or subtracted. If they do not match initially, then the mantissa of one of the numbers must be shifted until the exponents do match. In performing such a shift, some of the trailing digits of the smaller (in magnitude) number will be shifted off the end of the mantissa field, and thus the correct result of the arithmetic operation cannot be represented exactly in the floating-point system. Indeed, if the difference in magnitude is too great, then the entire mantissa of the smaller number may be shifted completely beyond the field width so that the result is simply the larger of the operands. Another way of saying this is that if the true sum of two t-digit numbers contains more than t digits, then the excess digits will be lost when the result is rounded to t digits, and in the worst case the operand of smaller magnitude may be lost completely. Multiplication of two floating-point numbers does not require that their exponents match—the exponents are simply summed and the mantissas multiplied. However, the product of two t-digit mantissas will in general contain up to 2t digits, and thus once again the correct result cannot be represented exactly in the floating-point system and must be rounded. Example 1.7 Floating-Point Arithmetic. Consider a floating-point system with ? = 10 and t = 6. If x = 1.92403 ? 102 and y = 6.35782 ? 10?1 , then floating-point addition gives the result x + y = 1.93039 ? 102 , assuming rounding to nearest. Note that the last two digits of y have no effect on the result. With an even smaller exponent, y could have had no effect at all on the result. Similarly, floating-point multiplication gives the result x ? y = 1.22326 ? 102 , which discards half of the digits of the true product. Division of two floating-point numbers may also give a result that cannot be represented exactly. For example, 1 and 10 are both exactly representable as binary floating-point numbers, but their quotient, 1/10, has a nonterminating binary expansion and thus is not a binary floating-point number. In each of the cases just cited, the result of a floating-point arithmetic operation may differ from the result that would be given by the corresponding real arithmetic operation on the same operands because there is insufficient precision to represent the correct real result. The real result may also be unrepresentable because its exponent is beyond the range available in the floating-point system (overflow or underflow). Overflow is usually a more serious problem than underflow in the sense that there is no good approximation in a floating-point system to arbitrarily large numbers, whereas zero is often a reasonable approximation for arbitrarily small numbers. For this reason, on many computer systems
1.3. COMPUTER ARITHMETIC
15
the occurrence of an overflow aborts the program with a fatal error, but an underflow may be silently set to zero without disrupting execution. Example 1.8 Summing a Series. As an illustration of these issues, the infinite series ? X 1 n
n=1
has a finite sum in floating-point arithmetic even though the real series is divergent. At first blush, one might think that this result occurs because 1/n will eventually underflow, or the partial sum will eventually overflow, as indeed they must. But before either of these occurs, the partial sum ceasesP to change once 1/n becomes negligible relative to the partial sum, i.e., when 1/n < mach n?1 k=1 (1/k), and thus the sum is finite (see Computer Problem 1.8). As we have noted, a real arithmetic operation on two floating-point numbers does not necessarily result in another floating-point number. If a number that is not exactly representable as a floating-point number is entered into the computer or is produced by a subsequent arithmetic operation, then it must be rounded (using one of the rounding rules given earlier) to obtain a floating-point number. Because floating-point numbers are not equally spaced, the absolute error made in such an approximation is not uniform, but the relative error is bounded by the unit roundoff mach . Ideally, x flop y = fl(x op y) (i.e., floating-point arithmetic operations produce correctly rounded results); and many computers, such as those meeting the IEEE floating-point standard, achieve this ideal as long as x op y is within the range of the floating-point system. Nevertheless, some familiar laws of real arithmetic are not necessarily valid in a floatingpoint system. In particular, floating-point addition and multiplication are commutative but not associative. For example, if is a positive floating-point number slightly smaller than the unit roundoff mach , then (1 + ) + = 1, but 1 + ( + ) > 1. The failure of floating-point arithmetic to satisfy the normal laws of real arithmetic is one reason that forward error analysis can be difficult. One advantage of backward error analysis is that it permits the use of real arithmetic in the analysis.
1.3.9
Cancellation
Rounding is not the only necessary evil in finite-precision arithmetic. Subtraction between two t-digit numbers having the same sign and similar magnitudes yields a result with fewer than t significant digits, and hence it is always exactly representable (provided the two numbers involved do not differ in magnitude by more than a factor of two). The reason is that the leading digits of the two numbers cancel (i.e., their difference is zero). For example, again taking ? = 10 and t = 6, if x = 1.92403 ? 102 and z = 1.92275 ? 102 , then we obtain the result x ? z = 1.28000 ? 10?1 , which, with only three significant digits, is exactly representable. Despite the exactness of the result, however, such cancellation nevertheless often implies a serious loss of information. The problem is that the operands are often uncertain, owing to rounding or other previous errors, in which case the relative uncertainty in the difference
16
CHAPTER 1. SCIENTIFIC COMPUTING
may be large. In effect, if two nearly equal numbers are accurate only to within rounding error, then taking their difference leaves only rounding error as a result. As a simple example, if is a positive number slightly smaller than the unit roundoff mach , then (1 + ) ? (1 ? ) = 1 ? 1 = 0 in floating-point arithmetic, which is correct for the actual operands of the final subtraction, but the true result of the overall computation, 2, has been completely lost. The subtraction itself is not at fault: it merely signals the loss of information that had already occurred. Of course, the loss of information is not always complete, but the fact remains that the digits lost to cancellation are the most significant, leading digits, whereas the digits lost in rounding are the least significant, trailing digits. Because of this effect, computing a small quantity as a difference of large quantities is generally a bad idea, for rounding error is likely to dominate the result. For example, summing an alternating series, such as ex = 1 + x +
x2 x3 + + ··· 2! 3!
for x < 0, may give disastrous results because of catastrophic cancellation (see Computer Problem 1.9). Example 1.9 Cancellation. Cancellation is not an issue only in computer arithmetic; it may also affect any situation in which limited precision is attainable, such as empirical measurements or laboratory experiments. For example, determining the distance from Manhattan to Staten Island by using their respective distances from Los Angeles will produce a very poor result unless the latter distances are known with extraordinarily high accuracy. As another example, for many years physicists have been trying to compute the total energy of the helium atom from first principles using Monte Carlo techniques. The accuracy of these computations is determined largely by the number of random trials used. As faster computers become available and computational techniques are refined, the attainable accuracy improves. The total energy is the sum of the kinetic energy and the potential energy, which are computed separately and have opposite signs. Thus, the total energy is computed as a difference and suffers cancellation. Table 1.2 gives a sequence of values obtained over a number of years (these data were kindly provided by Dr. Robert Panoff). During this span the computed values for the kinetic and potential energies changed by only 6 percent or less, yet the resulting estimate for the total energy changed by 144 percent. The one or two significant digits in the earlier computations were completely lost in the subsequent subtraction.
Table 1.2: Computed Year 1971 1977 1980 1985 1988
values for the total energy of the helium atom
Kinetic 13.0 12.76 12.22 12.28 12.40
Potential ?14.0 ?14.02 ?14.35 ?14.65 ?14.84
Total ?1.0 ?1.26 ?2.13 ?2.37 ?2.44
1.3. COMPUTER ARITHMETIC
17
Example 1.10 Quadratic Formula. Cancellation and other numerical difficulties need not involve a long series of computations. For example, use of the standard formula for the roots of a quadratic equation is fraught with numerical pitfalls. As every schoolchild learns, the two solutions of the quadratic equation ax2 + bx + c = 0 are given by
?
b2 ? 4ac . 2a For some values of the coefficients, naive use of this formula in floating-point arithmetic can produce overflow, underflow, or catastrophic cancellation. For example, if the coefficients are very large or very small, then b2 or 4ac may overflow or underflow. The possibility of overflow can be avoided by rescaling the coefficients, such as dividing all three coefficients by the coefficient of largest magnitude. Such a rescaling does not change the roots of the quadratic equation, but now the largest coefficient is 1 and overflow cannot occur in computing b2 or 4ac. Such rescaling does not eliminate the possibility of underflow, but it does prevent needless underflow, which could otherwise occur when all three coefficients are very small. Cancellation between ?b and the square root can be avoided by computing one of the roots using the alternative formula x=
x=
?b ±
2c ? , ?b ? b2 ? 4ac
which has the opposite sign pattern from that of the standard formula. But cancellation inside the square root cannot be easily avoided without using higher precision (if the discriminant is small relative to the coefficients, then the two roots are close to each other, and the problem is inherently ill-conditioned). As an illustration, we use four-digit decimal arithmetic, with rounding to nearest, to compute the roots of the quadratic equation having coefficients a = 0.05010, b = ?98.78, and c = 5.015. For comparison, the correct roots, rounded to ten significant digits, are 1971.605916
and 0.05077069387.
Computing the discriminant in four-digit arithmetic produces b2 ? 4ac = 9757 ? 1.005 = 9756, so that p b2 ? 4ac = 98.77. The standard quadratic formula then gives the roots 98.78 ± 98.77 = 1972 0.1002
and 0.0998.
The first root is the correctly rounded four-digit result, but the other root is completely wrong, with an error of about 100 percent. The culprit is cancellation, not in the sense
18
CHAPTER 1. SCIENTIFIC COMPUTING
that the final subtraction is wrong (indeed it is exactly correct), but in the sense that cancellation of the leading digits has left nothing remaining but previous rounding errors. The alternative quadratic formula gives the roots 10.03 = 1003 98.78 ? 98.77
and 0.05077.
Once again we have obtained one fully accurate root and one completely erroneous root, but in each case it is the opposite root from the one obtained previously. Cancellation is again the explanation, but the different sign pattern causes the opposite root to be contaminated. In general, for computing each root we should choose whichever formula avoids this cancellation, depending on the sign of b.
Example 1.11 Finite Difference Approximation. Consider the finite difference approximation to the first derivative f 0 (x) ?
f (x + h) ? f (x) . h
We want h to be small so that the approximation will be accurate, but if h is too small, then fl(x + h) may not differ from fl(x). Even if fl(x + h) 6= fl(x), we might still have fl(f (x + h)) = fl(f (x)) if f is slowly varying. In any case, we can expect some cancellation in computing the difference f (x + h) ? f (x). Thus, there is a trade-off between truncation error and rounding error in choosing the size of h. If the relative error in the function values is bounded by , then the rounding error in the approximate derivative value is bounded by 2|f (x)|/h. The Taylor series expansion f (x + h) = f (x) + f 0 (x)h + f 00 (x)h2 /2 + · · · gives an estimate of M h/2 for the truncation error, where M is a bound for |f 00 (x)|. The total error is therefore bounded by 2|f (x)| M h + , h 2 which is minimized when p h = 2 |f (x)|/M . If we assume that the function values are accurate to machine precision and that f and f 00 have roughly the same magnitude, then we obtain the rule of thumb that it is usually best to perturb about half the digits of x by taking ? h ? mach · |x|. A typical example is shown in Fig. 1.4, where the error in the finite difference approximation for a particular function is plotted as a function of the stepsize h. This computation was done in IEEE single precision with x = 1, and the error indeed reaches a minimum ? at h ? mach . The error increases for smaller values of h because of rounding error, and increases for larger values of h because of truncation error.
1.3. COMPUTER ARITHMETIC
19
The rounding error can be reduced by working with higher-precision arithmetic. Truncation error can be reduced by using a more accurate formula, such as the centered difference approximation (see Section 8.7.1) f 0 (x) ?
10?5 10?710?610?510?410?310?210?1 100 stepsize h Figure 1.4: Error in finite difference approximation as a function of stepsize.
Example 1.12 Standard Deviation. The mean of a finite sequence of real values xi , i = 1, . . . , n, is defined by n 1X x ?= xi , n i=1
and the standard deviation is defined by "
n
1 X ?= (xi ? x ?)2 n?1
#1/2
.
i=1
Use of these formulas requires two passes through the data: one to compute the mean and another to compute the standard deviation. For better efficiency, it is tempting to use the mathematically equivalent formula "
1 ?= n?1
n X
x2i
2
? n? x
!#1/2
i=1
to compute the standard deviation, since both the sum and the sum of squares can be computed in a single pass through the data. Unfortunately, the single cancellation at the end of the one-pass formula is often much more damaging numerically than all of the cancellations in the two-pass formula combined. The problem is that the two quantities being subtracted in the one-pass formula are apt to
20
CHAPTER 1. SCIENTIFIC COMPUTING
be relatively large and nearly equal, and hence the relative error in the difference may be large (indeed, the result can even be negative, causing the square root to fail).
Example 1.13 Computing Residuals. Assessing the accuracy of a computation is often difficult if one uses only the same precision as that of the computation itself. Perhaps this observation should not be surprising: if we knew the actual error, we could have used it to obtain a more accurate result in the first place. As a simple example, suppose we are solving the scalar linear equation ax = b for the unknown x, and we have obtained an approximate solution x ?. As one measure of the quality of our answer, we wish to compute the residual r = b ? a? x. In floating-point arithmetic, a ?fl x ? = a? x(1 + ?1 ) for some ?1 ? mach . So b ?fl (a ?fl x ?) = [b ? a? x(1 + ?1 )](1 + ?2 ) = [r ? ?1 a? x](1 + ?2 ) = r + ?2 r ? ?1 a? x ? ?1 ?2 a? x ? r + ?2 r ? ?1 b. But ?1 b may be as large as mach b, which may be as large as r. Thus, higher precision may be required to enable a meaningful computation of the residual r.
1.4
Mathematical Software
This book covers a wide range of topics in numerical analysis and scientific computing. We will discuss the essential aspects of each topic but will not have the luxury of examining any topic in great detail. To be able to solve interesting computational problems, we will often rely on mathematical software written by professionals. Leaving the algorithmic details to such software will allow us to focus on proper problem formulation and interpretation of results. We will consider only the most fundamental algorithms for each type of problem, motivated primarily by the insight to be gained into choosing an appropriate method and using it wisely. Our primary goal is to become intelligent users, rather than creators, of mathematical software. Before citing some specific sources of good mathematical software, let us summarize the desirable characteristics that such software should possess, in no particular order of importance: • Reliability: always works correctly for easy problems • Robustness: usually works for hard problems, but fails gracefully and informatively when it does fail • Accuracy: produces results as accurate as warranted by the problem and input data, preferably with an estimate of the accuracy achieved
1.4. MATHEMATICAL SOFTWARE
21
• Efficiency: requires execution time and storage that are close to the minimum possible for the problem being solved • Maintainability: is easy to understand and modify • Portability: adapts with little or no change to new computing environments • Usability: has a convenient and well-documented user interface • Applicability: solves a broad range of problems Obviously, these properties often conflict, and it is rare software indeed that satisfies all of them. Nevertheless, this list gives mathematical software users some idea what qualities to look for and developers some worthy goals to strive for.
1.4.1
Mathematical Software Libraries
Several widely available sources of general-purpose mathematical software are listed here. The software listed is written in Fortran unless otherwise noted. At the end of each chapter of this book, specific routines are listed for given types of problems, both from these general libraries and from more specialized packages. For additional information about available mathematical software, see the URL http://gams.nist.gov on the Internet’s World-Wide Web. • FMM: A collection of software accompanying the book Computer Methods for Mathematical Computations, by Forsythe, Malcolm, and Moler [82]. Available from netlib (see below). • HSL (Harwell Subroutine Library): A collection of software developed at Harwell Laboratory in England. See URL http://www.cse.clrc.ac.uk/Activity/HSL. • IMSL (International Mathematical and Statistical Libraries): A commercial product of Visual Numerics Inc., Houston, Texas. A comprehensive library of mathematical software; the full library is available in Fortran, and a subset is available in C. See URL http://www.vni.com. • KMN: A collection of software accompanying the book Numerical Methods and Software, by Kahaner, Moler, and Nash [142]. • NAG (Numerical Algorithms Group): A commercial product of NAG Inc., Downers Grove, Illinois. A comprehensive library of mathematical software; the full library is available in Fortran, and a subset is available in C. See URL http://www.nag.com. • NAPACK: A collection of software designed to complement the book Applied Numerical Linear Algebra, by Hager [116]. In addition to linear algebra, also contains routines for nonlinear equations, unconstrained optimization, and fast Fourier transforms. Available from netlib. • netlib: A collection of free software from diverse sources available over the Internet. See URL http://www.netlib.org, or send email containing the request “send index” to netlib@ornl.gov, or ftp to one of several mirror sites, such as netlib.bell-labs.com or netlib2.cs.utk.edu. • NR (Numerical Recipes): A collection of software accompanying the book Numerical Recipes, by Press, Teukolsky, Vetterling, and Flannery [205]. Available in C and Fortran editions. • NUMAL: A collection of software developed at the Mathematisch Centrum, Amsterdam. Also available in Algol and Fortran, but most readily available in C from the book A
22
•
• •
•
CHAPTER 1. SCIENTIFIC COMPUTING Numerical Library in C for Scientists and Engineers, by Lau [162]. PORT: A collection of software developed at Bell Laboratories. Some portions are available from netlib, but other portions must be obtained commercially and licensed for use. See the port directory in netlib for further information. SLATEC: A collection of software compiled by a consortium of U.S. government laboratories. Available from netlib. SOL: A collection of software for optimization and related problems from the Systems Optimization Laboratory at Stanford University. For further information, see URL http://www.stanford.edu/~saunders/brochure/brochure.html. TOMS: A collection of software appearing in ACM Transactions on Mathematical Software (formerly Collected Algorithms of the ACM). Available from netlib. The algorithms are identified by number (in order of appearance) as well as by name.
1.4.2
Scientific Computing Environments
The software libraries just listed contain subroutines that are meant to be called by userwritten programs, usually in a conventional programming language such as Fortran or C. An increasingly popular alternative for scientific computing is interactive environments that provide powerful, conveniently accessible, built-in mathematical capabilities, often combined with sophisticated graphics and a very high-level programming language designed for rapid prototyping of new algorithms. One of the most widely used such computing environments is MATLAB, which is a proprietary commercial product of The MathWorks, Inc. (see URL http://www.mathworks.com). MATLAB, which stands for MATrix LABoratory, is an interactive system that integrates extensive mathematical capabilities, especially in linear algebra, with powerful scientific visualization, a high-level programming language, and a variety of optional “toolboxes” that provide specialized capabilities in particular applications, such as signal processing, image processing, control, system identification, optimization, and statistics. There is also a MATLAB interface for the NAG mathematical software library mentioned in Section 1.4.1. MATLAB is available for a wide variety of personal computers, workstations, and supercomputers, and comes in both professional and inexpensive student editions. If MATLAB is not available on your computer system, there are similar, though less powerful, packages that are freely available by ftp, including octave (http://www.che.wisc.edu/octave), RLaB (http://rlab.sourceforge.net), and Scilab (http://www-rocq.inria.fr/scilab). Other similar commercial products include GAUSS, HiQ, IDL, Mathcad, and PV-WAVE. Another family of interactive computing environments is based primarily on symbolic (rather than numeric) computation, often called computer algebra. These packages, which include Axiom, Derive, Macsyma, Maple, Mathematica, MuPAD, Reduce, and Scratchpad, provide many of the same mathematical and graphical capabilities, and in addition provide symbolic differentiation, integration, equation solving, polynomial manipulation, and the like, as well as arbitrary precision arithmetic. Because MATLAB is probably the most widely used of these environments for the types of problems discussed in this book, specific MATLAB functions, either from the basic environment or from the supplementary toolboxes, are mentioned in the summaries of available software for each problem category, along with software from the major conventional soft-