# Conte, de Boor - Elementary Numerical Analysis. An Algorithmic Approach


ELEMENTARY NUMERICAL ANALYSIS

An Algorithmic Approach

International Series in Pure and Applied Mathematics

G. Springer

Consulting Editor

Ahlfors: Complex Analysis

Bender and Orszag: Advanced Mathematical Methods for Scientists and Engineers

Buck: Advanced Calculus

Busacker and Saaty: Finite Graphs and Networks

Cheney: Introduction to Approximation Theory

Chester: Techniques in Partial Differential Equations

Coddington and Levinson: Theory of Ordinary Differential Equations

Conte and de Boor: Elementary Numerical Analysis: An Algorithmic Approach

Dennemeyer: Introduction to Partial Differential Equations and Boundary Value

Problems

Dettman: Mathematical Methods in Physics and Engineering

Hamming: Numerical Methods for Scientists and Engineers

Hildebrand: Introduction to Numerical Analysis

Householder: The Numerical Treatment of a Single Nonlinear Equation

Kalman, Falb, and Arbib: Topics in Mathematical Systems Theory

McCarty: Topology: An Introduction with Applications to Topological Groups

Moore: Elements of Linear Algebra and Matrix Theory

Moursund and Duris: Elementary Theory and Application of Numerical Analysis

Pipes and Harvill: Applied Mathematics for Engineers and Physicists

Ralston and Rabinowitz: A First Course in Numerical Analysis

Ritger and Rose: Differential Equations with Applications

Rudin: Principles of Mathematical Analysis

Shapiro: Introduction to Abstract Algebra

Simmons: Differential Equations with Applications and Historical Notes

Simmons: Introduction to Topology and Modern Analysis

Struble: Nonlinear Differential Equations

ELEMENTARY

NUMERICAL

ANALYSIS

An Algorithmic Approach

Third Edition

S. D. Conte

Purdue University

Carl de Boor

University of Wisconsin—Madison

McGraw-Hill Book Company

New York St. Louis San Francisco Auckland Bogotá Hamburg

Johannesburg London Madrid Mexico Montreal New Delhi

Panama Paris São Paulo Singapore Sydney Tokyo Toronto


Copyright © 1980, 1972, 1965 by McGraw-Hill, Inc. All rights reserved.

Printed in the United States of America. No part of this publication

may be reproduced, stored in a retrieval system, or transmitted, in any

form or by any means, electronic, mechanical, photocopying, recording or

otherwise, without the prior written permission of the publisher.

234567890 DODO 89876543210

This book was set in Times Roman by Science Typographers, Inc. The

editors were Carol Napier and James S. Amar; the production supervisor

was Phil Galea. The drawings were done by Fine Line Illustrations, Inc.

R. R. Donnelley & Sons Company was printer and binder.

Library of Congress Cataloging in Publication Data

Conte, Samuel Daniel, date

Elementary numerical analysis.

(International series in pure and applied

mathematics)

Includes index.

1. Numerical analysis—Data processing. I. de Boor, Carl, joint author. II. Title.

QA297.C65    1980    519.4    79-24641

ISBN 0-07-012447-7

CONTENTS

Preface    ix

Introduction    xi

Chapter 1  Number Systems and Errors    1
  1.1  The Representation of Integers    1
  1.2  The Representation of Fractions    4
  1.3  Floating-Point Arithmetic    7
  1.4  Loss of Significance and Error Propagation; Condition and Instability    12
  1.5  Computational Methods for Error Estimation    18
  1.6  Some Comments on Convergence of Sequences    19
  1.7  Some Mathematical Preliminaries    25

Chapter 2  Interpolation by Polynomial    31
  2.1  Polynomial Forms    31
  2.2  Existence and Uniqueness of the Interpolating Polynomial    38
  2.3  The Divided-Difference Table    41
  *2.4  Interpolation at an Increasing Number of Interpolation Points    46
  2.5  The Error of the Interpolating Polynomial    51
  2.6  Interpolation in a Function Table Based on Equally Spaced Points    55
  *2.7  The Divided Difference as a Function of Its Arguments and Osculatory Interpolation    62

Chapter 3  The Solution of Nonlinear Equations    72
  3.1  A Survey of Iterative Methods    74
  3.2  Fortran Programs for Some Iterative Methods    81
  3.3  Fixed-Point Iteration    88
  3.4  Convergence Acceleration for Fixed-Point Iteration    95
  *3.5  Convergence of the Newton and Secant Methods    100
  3.6  Polynomial Equations: Real Roots    110
  *3.7  Complex Roots and Müller's Method    120

Chapter 4  Matrices and Systems of Linear Equations    128
  4.1  Properties of Matrices    128
  4.2  The Solution of Linear Systems by Elimination    147
  4.3  The Pivoting Strategy    157
  4.4  The Triangular Factorization    160
  4.5  Error and Residual of an Approximate Solution; Norms    169
  4.6  Backward-Error Analysis and Iterative Improvement    177
  *4.7  Determinants    185
  *4.8  The Eigenvalue Problem    189

Chapter *5  Systems of Equations and Unconstrained Optimization    208
  *5.1  Optimization and Steepest Descent    209
  *5.2  Newton's Method    216
  *5.3  Fixed-Point Iteration and Relaxation Methods    223

Chapter 6  Approximation    235
  6.1  Uniform Approximation by Polynomials    235
  6.2  Data Fitting    245
  *6.3  Orthogonal Polynomials    251
  *6.4  Least-Squares Approximation by Polynomials    259
  *6.5  Approximation by Trigonometric Polynomials    268
  *6.6  Fast Fourier Transforms    277
  6.7  Piecewise-Polynomial Approximation    284

Chapter 7  Differentiation and Integration    294
  7.1  Numerical Differentiation    295
  7.2  Numerical Integration: Some Basic Rules    303
  7.3  Numerical Integration: Gaussian Rules    311
  7.4  Numerical Integration: Composite Rules    319
  7.5  Adaptive Quadrature    328
  *7.6  Extrapolation to the Limit    333
  *7.7  Romberg Integration    340

Chapter 8  The Solution of Differential Equations    346
  8.1  Mathematical Preliminaries    346
  8.2  Simple Difference Equations    349
  8.3  Numerical Integration by Taylor Series    354
  8.4  Error Estimates and Convergence of Euler's Method    359
  8.5  Runge-Kutta Methods    362
  8.6  Step-Size Control with Runge-Kutta Methods    366
  8.7  Multistep Formulas    373
  8.8  Predictor-Corrector Methods    379
  8.9  The Adams-Moulton Method    382
  *8.10  Stability of Numerical Methods    389
  *8.11  Round-off-Error Propagation and Control    395
  *8.12  Systems of Differential Equations    398
  *8.13  Stiff Differential Equations    401

Chapter 9  Boundary Value Problems    406
  9.1  Finite Difference Methods    406
  9.2  Shooting Methods    412
  9.3  Collocation Methods    416

Appendix: Subroutine Libraries    421

References    423

Index    425

* Sections marked with an asterisk may be omitted without loss of continuity.

PREFACE

This is the third edition of a book on elementary numerical analysis which

is designed specifically for the needs of upper-division undergraduate

students in engineering, mathematics, and science including, in particular,

computer science. On the whole, the student who has had a solid college

calculus sequence should have no difficulty following the material.

Advanced mathematical concepts, such as norms and orthogonality, when

they are used, are introduced carefully at a level suitable for undergraduate

students and do not assume any previous knowledge. Some familiarity

with matrices is assumed for the chapter on systems of equations and with

differential equations for Chapters 8 and 9. This edition does contain some

sections which require slightly more mathematical maturity than the previous edition. However, all such sections are marked with asterisks and all

can be omitted by the instructor with no loss in continuity.

This new edition contains a great deal of new material and significant

changes to some of the older material. The chapters have been rearranged

in what we believe is a more natural order. Polynomial interpolation

(Chapter 2) now precedes even the chapter on the solution of nonlinear

systems (Chapter 3) and is used subsequently for some of the material in

all chapters. The treatment of Gauss elimination (Chapter 4) has been

simplified. In addition, Chapter 4 now makes extensive use of Wilkinson’s

backward error analysis, and contains a survey of many well-known

methods for the eigenvalue-eigenvector problem. Chapter 5 is a new

chapter on systems of equations and unconstrained optimization. It contains an introduction to steepest-descent methods, Newton’s method for

nonlinear systems of equations, and relaxation methods for solving large

linear systems by iteration. The chapter on approximation (Chapter 6) has

been enlarged. It now treats best approximation and good approximation


by polynomials, also approximation by trigonometric functions, including

the Fast Fourier Transforms, as well as least-squares data fitting, orthogonal polynomials, and curve fitting by splines. Differentiation and integration are now treated in Chapter 7, which contains a new section on

adaptive quadrature. Chapter 8 on ordinary differential equations contains

considerable new material and some new sections. There is a new section

on step-size control in Runge-Kutta methods and a new section on stiff

differential equations as well as an extensively revised section on numerical

instability. Chapter 9 contains a brief introduction to collocation as a

method for solving boundary-value problems.

This edition, as did the previous one, assumes that students have

access to a computer and that they are familiar with programming in some

procedure-oriented language. A large number of algorithms are presented

in the text, and FORTRAN programs for many of these algorithms have

been provided. There are somewhat fewer complete programs in this

edition. All the programs have been rewritten in the FORTRAN 77

language which uses modern structured-programming concepts. All the

programs have been tested on one or more computers, and in most cases

machine results are presented. When numerical output is given, the text

will indicate which machine (IBM, CDC, UNIVAC) was used to obtain

the results.

The book contains more material than can usually be covered in a

typical one-semester undergraduate course for general science majors. This

gives the instructor considerable leeway in designing the course. For this, it

is important to point out that only the material on polynomial interpolation in Chapter 2, on linear systems in Chapter 4, and on differentiation

and integration in Chapter 7, is required in an essential way in subsequent

chapters. The material in the first seven chapters (exclusive of the starred

sections) would make a reasonable first course.

We take this opportunity to thank those who have communicated to us

misprints and errors in the second edition and have made suggestions for

improvement. We are especially grateful to R. E. Barnhill, D. Chambless,

A. E. Davidoff, P. G. Davis, A. G. Deacon, A. Feldstein, W. Ferguson,

A. O. Garder, J. Guest, T. R. Hopkins, D. Joyce, K. Kincaid, J. T. King,

N. Krikorian, and W. E. McBride.

S. D. Conte

Carl de Boor

INTRODUCTION

This book is concerned with the practical solution of problems on computers. In the process of problem solving, it is possible to distinguish

several more or less distinct phases. The first phase is formulation. In

formulating a mathematical model of a physical situation, scientists should

take into account beforehand the fact that they expect to solve a problem

on a computer. They will therefore provide for specific objectives, proper

input data, adequate checks, and for the type and amount of output.

Once a problem has been formulated, numerical methods, together

with a preliminary error analysis, must be devised for solving the problem.

A numerical method which can be used to solve a problem will be called

an algorithm. An algorithm is a complete and unambiguous set of procedures leading to the solution of a mathematical problem. The selection or

construction of appropriate algorithms properly falls within the scope of

numerical analysis. Having decided on a specific algorithm or set of

algorithms for solving the problem, numerical analysts should consider all

the sources of error that may affect the results. They must consider how

much accuracy is required, estimate the magnitude of the round-off and

discretization errors, determine an appropriate step size or the number of

iterations required, provide for adequate checks on the accuracy, and make

allowance for corrective action in cases of nonconvergence.

The third phase of problem solving is programming. The programmer

must transform the suggested algorithm into a set of unambiguous step-by-step instructions to the computer. The first step in this procedure is

called flow charting. A flow chart is simply a set of procedures, usually in

logical block form, which the computer will follow. It may be given in

graphical or procedural statement form. The complexity of the flow will

depend upon the complexity of the problem and the amount of detail


included. However, it should be possible for someone other than the

programmer to follow the flow of information from the chart. The flow

chart is an effective aid to the programmer, who must translate its major

functions into a program, and, at the same time, it is an effective means of

communication to others who wish to understand what the program does.

In this book we sometimes use flow charts in graphical form, but more

often in procedural statement form. When graphical flow charts are used,

standard conventions are followed, whereas all procedural statement charts

use a self-explanatory ALGOL-like statement language. Having produced

a flow chart, the programmer must transform the indicated procedures into

a set of machine instructions. This may be done directly in machine

language, in an assembly language, or in a procedure-oriented language. In

this book a dialect of FORTRAN called FORTRAN 77 is used exclusively. FORTRAN 77 is a new dialect of FORTRAN which incorporates

new control statements and which emphasizes modern structured-programming concepts. While FORTRAN IV compilers are available on almost all

computers, FORTRAN 77 may not be as readily available. However,

conversion from FORTRAN 77 to FORTRAN IV should be relatively

straightforward.

A procedure-oriented language such as FORTRAN or ALGOL is

sometimes called an algorithmic language. It allows us to express a

mathematical algorithm in a form more suitable for communication with

computers. A FORTRAN procedure that implements a mathematical

algorithm will, in general, be much more precise than the mathematical

algorithm. If, for example, the mathematical algorithm specifies an iterative procedure for finding the solution of an equation, the FORTRAN

program must specify (1) the accuracy that is required, (2) the number of

iterations to be performed, and (3) what to do in case of nonconvergence.

Most of the algorithms in this book are given in the normal mathematical

form and in the more precise form of a FORTRAN procedure.

In many installations, each of these phases of problem solving is

performed by a separate person. In others, a single person may be

responsible for all three functions. It is clear that there are many interactions among these three phases. As the program develops, more information becomes available, and this information may suggest changes in the

formulation, in the algorithms being used, and in the program itself.


CHAPTER

ONE

NUMBER SYSTEMS AND ERRORS

In this chapter we consider methods for representing numbers on computers and the errors introduced by these representations. In addition, we

examine the sources of various types of computational errors and their

subsequent propagation. We also discuss some mathematical preliminaries.

1.1 THE REPRESENTATION OF INTEGERS

In everyday life we use numbers based on the decimal system. Thus the

number 257, for example, is expressible as

257 = 2·100 + 5·10 + 7·1
    = 2·10^2 + 5·10^1 + 7·10^0

We call 10 the base of this system. Any integer is expressible as a polynomial in the base 10 with integral coefficients between 0 and 9. We use the notation

N = (a_n a_{n-1} ··· a_0)_10 = a_n·10^n + a_{n-1}·10^(n-1) + ··· + a_0·10^0        (1.1)

to denote any positive integer in the base 10. There is no intrinsic reason to

use 10 as a base. Other civilizations have used other bases such as 12, 20,

or 60. Modern computers read pulses sent by electrical components. The

state of an electrical impulse is either on or off. It is therefore convenient to

represent numbers in computers in the binary system. Here the base is 2,

and the integer coefficients may take the values 0 or 1.


A nonnegative integer N will be represented in the binary system as

N = (a_n a_{n-1} ··· a_0)_2 = a_n·2^n + a_{n-1}·2^(n-1) + ··· + a_0·2^0        (1.2)

where the coefficients a_k are either 0 or 1. Note that N is again represented

as a polynomial, but now in the base 2. Many computers used in scientific

work operate internally in the binary system. Users of computers, however,

prefer to work in the more familiar decimal system. It is therefore necessary to have some means of converting from decimal to binary when

information is submitted to the computer, and from binary to decimal for

output purposes.

Conversion of a binary number to decimal form may be accomplished directly from the definition (1.2); for example,

(1101)_2 = 1·2^3 + 1·2^2 + 0·2 + 1 = 13

The conversion of integers from a base β to the base 10 can also be accomplished by the following algorithm, which is derived in Chap. 2.

Algorithm 1.1 Given the coefficients a_n, . . . , a_0 of the polynomial

p(x) = a_n·x^n + a_{n-1}·x^(n-1) + ··· + a_1·x + a_0        (1.3)

and a number β, compute recursively the numbers b_n, b_{n-1}, . . . , b_0:

b_n = a_n
b_k = a_k + β·b_{k+1}        k = n - 1, n - 2, . . . , 0

Then b_0 = p(β).

Since, by the definition (1.2), the binary integer (a_n a_{n-1} ··· a_0)_2 represents the value of the polynomial (1.3) at x = 2, we can use Algorithm 1.1, with β = 2, to find the decimal equivalents of binary integers.

Thus the decimal equivalent of (1101)_2 computed using Algorithm 1.1 is

b_3 = 1        b_2 = 1 + 2·1 = 3        b_1 = 0 + 2·3 = 6        b_0 = 1 + 2·6 = 13

and the decimal equivalent of (10000)_2 is

b_4 = 1        b_3 = 2        b_2 = 4        b_1 = 8        b_0 = 16

Converting a decimal integer N into its binary equivalent can also be accomplished by Algorithm 1.1 if one is willing to use binary arithmetic. For if N = (a_n a_{n-1} ··· a_0)_10, then by the definition (1.1), N = p(10), where p(x) is the polynomial (1.3). Hence we can calculate the binary representation for N by translating the coefficients a_0, . . . , a_n into binary integers and then using Algorithm 1.1 to evaluate p(x) at x = 10 = (1010)_2 in binary arithmetic. If, for example, N = 187, then

1 = (1)_2        8 = (1000)_2        7 = (111)_2

and using Algorithm 1.1 and binary arithmetic,

b_2 = (1)_2
b_1 = (1000)_2 + (1010)_2·(1)_2 = (10010)_2
b_0 = (111)_2 + (1010)_2·(10010)_2 = (10111011)_2

Therefore 187 = (10111011)_2.
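Algorithm 1.1 is just nested multiplication (Horner's scheme); a minimal sketch in Python rather than the book's FORTRAN (the function name `horner` is ours):

```python
def horner(coeffs, beta):
    """Algorithm 1.1: evaluate a_n*beta**n + ... + a_1*beta + a_0
    by nested multiplication.  coeffs = [a_n, a_{n-1}, ..., a_0]."""
    b = 0
    for a in coeffs:          # b_n = a_n, then b_k = a_k + beta*b_{k+1}
        b = a + beta * b
    return b                  # b_0 = p(beta)

print(horner([1, 1, 0, 1], 2))       # (1101)_2  -> 13
print(horner([1, 0, 0, 0, 0], 2))    # (10000)_2 -> 16
```

Running it with beta = 10 on the digit list [1, 8, 7] reproduces 187, which is the direction used above when the arithmetic is instead carried out in binary or octal.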

Binary numbers and binary arithmetic, though ideally suited for

today’s computers, are somewhat tiresome for people because of the

number of digits necessary to represent even moderately sized numbers.

Thus eight binary digits are necessary to represent the three-decimal-digit

number 187. The octal number system, using the base 8, presents a kind of

compromise between the computer-preferred binary and the people-preferred decimal system. It is easy to convert from octal to binary and back

since three binary digits make one octal digit. To convert from octal to

binary, one merely replaces each octal digit by its binary equivalent; thus

(273)_8 = (010 111 011)_2

Conversely, to convert from binary to octal, one partitions the binary digits in groups of three (starting from the right) and then replaces each three-group by its octal digit; thus

(10 111 011)_2 = (273)_8

If a decimal integer has to be converted to binary by hand, it is usually

fastest to convert it first to octal using Algorithm 1.1, and then from octal

to binary. To take an earlier example, 187 has the decimal digits 1, 8, 7. Hence, using Algorithm 1.1 [with 2 replaced by 10 = (12)_8, and with octal arithmetic],

b_2 = (1)_8
b_1 = (10)_8 + (12)_8·(1)_8 = (22)_8
b_0 = (7)_8 + (12)_8·(22)_8 = (273)_8

Therefore, finally,

187 = (273)_8 = (010 111 011)_2 = (10111011)_2

EXERCISES

1.1-1 Convert the following binary numbers to decimal form:

1.1-2 Convert the following decimal numbers to binary form:

82, 109, 3433

1.1-3 Carry out the conversions in Exercises 1.1-1 and 1.1-2 by converting first to octal form.

1.1-4 Write a FORTRAN subroutine which accepts a number to the base BETIN with the

NIN digits contained in the one-dimensional array NUMIN, and returns the NOUT digits of

the equivalent in base BETOUT in the one-dimensional array NUMOUT. For simplicity,

restrict both BETIN and BETOUT to 2, 4, 8, and 10.
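As a hedged sketch of what Exercise 1.1-4 asks for, here is the same idea in Python instead of FORTRAN (the name `change_base` and the digit-list convention are ours; it routes through an ordinary integer rather than working digit-by-digit):

```python
def change_base(digits, beta_in, beta_out):
    """Convert a digit list (most significant first) in base beta_in
    to the digit list of the same number in base beta_out."""
    # Algorithm 1.1: evaluate the digit polynomial at x = beta_in.
    n = 0
    for d in digits:
        n = d + beta_in * n
    # Peel off base-beta_out digits by repeated division with remainder.
    out = []
    while n > 0:
        n, r = divmod(n, beta_out)
        out.append(r)
    return out[::-1] if out else [0]

print(change_base([1, 8, 7], 10, 2))   # 187 -> [1, 0, 1, 1, 1, 0, 1, 1]
print(change_base([1, 8, 7], 10, 8))   # 187 -> [2, 7, 3]
```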

1.2 THE REPRESENTATION OF FRACTIONS

If x is a positive real number, then its integral part xI is the largest integer

less than or equal to x, while

xF = x - xI

is its fractional part. The fractional part can always be written as a decimal

fraction:

xF = (.b_1 b_2 b_3 ···)_10 = b_1·10^(-1) + b_2·10^(-2) + b_3·10^(-3) + ···        (1.4)

where each b_k is a nonnegative integer less than 10. If b_k = 0 for all k

greater than a certain integer, then the fraction is said to terminate. Thus

is a terminating decimal fraction, while

is not.

If the integral part of x is given as a decimal integer by


while the fractional part is given by (1.4), it is customary to write the two

representations one after the other, separated by a point, the “decimal

point”:

Completely analogously, one can write the fractional part of x as a

binary fraction:

xF = (.b_1 b_2 b_3 ···)_2 = b_1·2^(-1) + b_2·2^(-2) + b_3·2^(-3) + ···

where each b_k is a nonnegative integer less than 2, i.e., either zero or one. If

the integral part of x is given by the binary integer

then we write

using a “binary point.”

The binary fraction (.b_1 b_2 b_3 ···)_2 for a given number xF between

zero and one can be calculated as follows: If

xF = (.b_1 b_2 b_3 ···)_2

then

2xF = (b_1 . b_2 b_3 ···)_2

Hence b_1 is the integral part of 2xF, while

(2xF)F = (.b_2 b_3 ···)_2

Therefore, repeating this procedure, we find that b2 is the integral part of

2(2xF)F, b3 is the integral part of 2(2(2xF)F)F, etc.

If, for example, x = 0.625 = xF, then

2·(0.625) = 1.25        hence b_1 = 1
2·(0.25) = 0.5          hence b_2 = 0
2·(0.5) = 1.0           hence b_3 = 1

and all further b_k's are zero. Hence

0.625 = (.101)_2

This example was rigged to give a terminating binary fraction. Unhappily, not every terminating decimal fraction gives rise to a terminating

binary fraction. This is due to the fact that the binary fraction for 0.1 is not terminating. We have

2·(0.1) = 0.2        b_1 = 0
2·(0.2) = 0.4        b_2 = 0
2·(0.4) = 0.8        b_3 = 0
2·(0.8) = 1.6        b_4 = 1
2·(0.6) = 1.2        b_5 = 1

and now we are back to a fractional part of 0.2, so that the digits cycle. It follows that

0.1 = (.0 0011 0011 0011 ···)_2

The procedure just outlined is formalized in the following algorithm.

Algorithm 1.2 Given x between 0 and 1 and an integer β greater than 1, generate recursively b_1, b_2, b_3, . . . by

x_0 = x
b_k = integral part of β·x_{k-1}
x_k = fractional part of β·x_{k-1}        k = 1, 2, 3, . . .

Then

x = (.b_1 b_2 b_3 ···)_β

We have stated this algorithm for a general base β rather than for the specific binary base β = 2 for two reasons. If this conversion to binary is carried out with pencil and paper, it is usually faster to convert first to octal, i.e., use β = 8, and then to convert from octal to binary. Also, the algorithm can be used to convert a binary (or octal) fraction to decimal, by choosing β = 10 and using binary (or octal) arithmetic.

To give an example, if x = (.101)_2, then, with β = 10 = (1010)_2 and binary arithmetic, we get from Algorithm 1.2

(1010)_2 · (.101)_2 = (110.01)_2        so b_1 = 6, leaving (.01)_2
(1010)_2 · (.01)_2 = (10.1)_2           so b_2 = 2, leaving (.1)_2
(1010)_2 · (.1)_2 = (101.)_2            so b_3 = 5, leaving 0

Hence subsequent b_k's are zero. This shows that

(.101)_2 = .625

confirming our earlier calculation. Note that if xF is a terminating binary

fraction with n digits, then it is also a terminating decimal fraction with n digits, since 2^(-n) = 5^n·10^(-n).

EXERCISES

1.2-1 Convert the following binary fractions to decimal fractions:

(.1100011)_2        (.11111111)_2

1.2-2 Find the first 5 digits of .1 written as an octal fraction, then compute from it the first 15

digits of .1 as a binary fraction.

1.2-3 Convert the following octal fractions to decimal:

(.614)_8        (.776)_8

Compare with your answer in Exercise 1.2-1.

1.2-4 Find a binary number which approximates

to within 10^(-3).

1.2-5 If we want to convert a decimal integer N to binary using Algorithm 1.1, we have to use

binary arithmetic. Show how to carry out this conversion using Algorithm 1.2 and decimal

arithmetic. (Hint: Divide N by the appropriate power of 2, convert the result to binary, then

shift the “binary point” appropriately.)

1.2-6 If we want to convert a terminating binary fraction x to a decimal fraction using

Algorithm 1.2, we have to use binary arithmetic. Show how to carry out this conversion using

Algorithm 1.1 and decimal arithmetic.
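Algorithm 1.2 can be sketched in Python; doing the arithmetic with exact rationals from the standard `fractions` module keeps floating-point error from muddying the digits (the function name is ours):

```python
from fractions import Fraction

def fraction_digits(x, beta, n):
    """Algorithm 1.2: the first n digits b_1, b_2, ... of x (0 < x < 1)
    written as a base-beta fraction."""
    digits = []
    for _ in range(n):
        x *= beta
        b = int(x)        # b_k = integral part of beta * x_{k-1}
        x -= b            # x_k = fractional part carries on
        digits.append(b)
    return digits

print(fraction_digits(Fraction(5, 8), 2, 3))    # 0.625 -> [1, 0, 1]
print(fraction_digits(Fraction(1, 10), 2, 8))   # 0.1   -> [0, 0, 0, 1, 1, 0, 0, 1]
```

With beta = 8 and x = 1/10 the same routine produces the octal digits asked for in Exercise 1.2-2.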

1.3 FLOATING-POINT ARITHMETIC

Scientific calculations are usually carried out in floating-point arithmetic.

An n-digit floating-point number in base β has the form

x = ±(.d_1 d_2 ··· d_n)_β · β^e        (1.5)

where (.d_1 d_2 ··· d_n)_β is a β-fraction called the mantissa, and e is an integer called the exponent. Such a floating-point number is said to be normalized in case d_1 ≠ 0, or else d_1 = d_2 = ··· = d_n = 0. For most computers, β = 2, although on some, β = 16, and in hand calculations and on most desk and pocket calculators, β = 10.

The precision or length n of floating-point numbers on any particular

computer is usually determined by the word length of the computer and

may therefore vary widely (see Fig. 1.1). Computing systems which accept

FORTRAN programs are expected to provide floating-point numbers of

two different lengths, one roughly double the other. The shorter one, called

single precision, is ordinarily used unless the other, called double precision,

is specifically asked for. Calculation in double precision usually doubles

the storage requirements and more than doubles running time as compared

with single precision.


Figure 1.1 Floating-point characteristics.

The exponent e is limited to a range

m ≤ e ≤ M        (1.6)

for certain integers m and M. Usually, m = - M, but the limits may vary

widely; see Fig. 1.1.

There are two commonly used ways of translating a given real number

x into an n-digit floating-point number fl(x): rounding and chopping. In

rounding, fl(x) is chosen as the normalized floating-point number nearest

x; some special rule, such as symmetric rounding (rounding to an even

digit), is used in case of a tie. In chopping, fl(x) is chosen as the nearest

normalized floating-point number between x and 0. If, for example, two-decimal-digit floating-point numbers are used, then

and

On some computers, this definition of fl(x) is modified in case x is larger in magnitude than the largest floating-point number (overflow) or closer to zero than the smallest nonzero normalized floating-point number (underflow), where m and M are the

bounds on the exponents; either fl(x) is not defined in this case, causing a

stop, or else fl(x) is represented by a special number which is not subject to

the usual rules of arithmetic when combined with ordinary floating-point

numbers.

The difference between x and fl(x) is called the round-off error. The

round-off error depends on the size of x and is therefore best measured

relative to x. For if we write

fl(x) = x(1 + δ)        (1.7)

where δ is some number depending on x, then it is possible to bound δ independently of x, at least as long as x causes no overflow or underflow. For such an x, it is not difficult to show that

|δ| ≤ (1/2)·β^(1-n)        in rounding        (1.8)

while

|δ| ≤ β^(1-n)        in chopping        (1.9)


See Exercise 1.3-3. The maximum possible value for |δ| is often called the

unit roundoff and is denoted by u.
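For IEEE double precision, β = 2 and n = 53, so (1.8) gives u = ½·2^(1-53) = 2^(-53). A hedged Python check, probing the machine's arithmetic directly (it assumes round-to-nearest):

```python
import sys

# Find the spacing of floating-point numbers just above 1.0 by halving:
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
# eps is now beta**(1-n) = 2**-52; the unit roundoff u for rounding is half of it.
print(eps, eps / 2)                  # 2**-52 and u = 2**-53
print(sys.float_info.epsilon)        # the same machine epsilon, from the library
```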

When an arithmetic operation is applied to two floating-point numbers, the result usually fails to be a floating-point number of the same

length. If, for example, we deal with two-decimal-digit numbers and

then

Hence, if ∘ denotes one of the arithmetic operations (addition, subtraction, multiplication, or division) and ⊛ denotes the floating-point operation of the same name provided by the computer, then, however the computer may arrive at the result x ⊛ y for two given floating-point numbers x and y, we can be sure that usually

x ⊛ y ≠ x ∘ y

Although the floating-point operation ⊛ corresponding to ∘ may vary in some details from machine to machine, ⊛ is usually constructed so that

x ⊛ y = fl(x ∘ y)        (1.10)

In words, the floating-point sum (difference, product, or quotient) of two

floating-point numbers usually equals the floating-point number which

represents the exact sum (difference, product, or quotient) of the two

numbers. Hence (unless overflow or underflow occurs) we have

x ⊛ y = (x ∘ y)(1 + δ)        |δ| ≤ u        (1.11a)

where u is the unit roundoff. In certain situations, it is more convenient to

use the equivalent formula

x ⊛ y = (x ∘ y)/(1 + δ)        |δ| ≤ u        (1.11b)

Equation (1.11) expresses the basic idea of backward error analysis (see J.

H. Wilkinson [24]†). Explicitly, Eq. (1.11) allows one to interpret a floating-point result as the result of the corresponding ordinary arithmetic, but

performed on slightly perturbed data. In this way, the analysis of the effect

of floating-point arithmetic can be carried out in terms of ordinary

arithmetic.

For example, the value of the function f(x) = x^(2^n) at a point x_0 can be calculated by n squarings, i.e., by carrying out the sequence of steps

y_0 = x_0        y_k = (y_{k-1})^2        k = 1, . . . , n

with y_n = f(x_0). In floating-point arithmetic, we compute instead, according to Eq. (1.11a), the sequence of numbers

ŷ_0 = x_0        ŷ_k = (ŷ_{k-1})^2 (1 + δ_k)        k = 1, . . . , n

†Numbers in brackets refer to items in the references at the end of the book.


with |δ_i| ≤ u for all i. The computed answer is, therefore,

ŷ_n = x_0^(2^n) (1 + δ_1)^(2^(n-1)) (1 + δ_2)^(2^(n-2)) ··· (1 + δ_n)

To simplify this expression, we observe that, if |δ_i| ≤ u for all i, then

(1 + δ_1)(1 + δ_2) ··· (1 + δ_m) = (1 + δ)^m        for some |δ| ≤ u

(see Exercise 1.3-6). Also then

(1 + δ_1)^(2^(n-1)) (1 + δ_2)^(2^(n-2)) ··· (1 + δ_n) = (1 + δ)^(2^n - 1)        for some |δ| ≤ u

Consequently,

ŷ_n = x_0^(2^n) (1 + δ)^(2^n - 1) = [x_0(1 + δ´)]^(2^n) = f(x_0(1 + δ´))        for some |δ´| ≤ u

In words, the computed value ŷ_n is the exact value of f(x) at the perturbed argument x_0(1 + δ´).

We can now gauge the effect which the use of floating-point arithmetic

has had on the accuracy of the computed value for f(x0) by studying how

the value of the (exactly computed) function f(x) changes when the

argument x is perturbed, as is done in the next section. Further, we note

that this error is, in our example, comparable to the error due to the fact

that we had to convert the initial datum x0 to a floating-point number to

begin with.
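This conclusion is easy to test numerically in double precision, where u = 2^(-53). A hedged Python sketch (the helper name is ours; exact rationals serve as the reference, so only the arithmetic error is measured, not the conversion error of x_0):

```python
from fractions import Fraction

def square_n_times(x0, n):
    """Compute x0**(2**n) by n successive floating-point squarings."""
    y = x0
    for _ in range(n):
        y = y * y
    return y

x0, n = 1.1, 5
computed = square_n_times(x0, n)
exact = Fraction(x0) ** (2 ** n)     # exact value for the stored double x0
rel_err = abs(Fraction(computed) - exact) / exact
u = 2.0 ** -53
print(float(rel_err))                # small multiple of u
```

The observed relative error stays below roughly (2^n − 1)u, consistent with the perturbation-of-the-argument interpretation above.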

As a second example, of particular interest in Chap. 4, consider

calculation of the number s from the equation

a_1 b_1 + a_2 b_2 + ··· + a_r b_r + a_(r+1) s = c        (1.12)

by the formula

s = (c - a_1 b_1 - a_2 b_2 - ··· - a_r b_r) / a_(r+1)

If we obtain s through the steps

then the corresponding numbers computed in floating-point arithmetic

satisfy

Here, we have used Eqs. (1.11a) and (1.11b), and have not bothered to distinguish the various δ's by subscripts. Consequently,

This shows that the computed value ŝ for s satisfies the perturbed equation

(1.13)

Note that we can reduce all exponents by 1 in case a_(r+1) = 1, that is, in

case the last division need not be carried out.

EXERCISES

1.3-1 The following numbers are given in a decimal computer with a four-digit normalized

mantissa:

Perform the following operations, and indicate the error in the result, assuming symmetric

rounding:

1.3-2 Let fl(x) be given by chopping. Show that

and that

(unless overflow or underflow occurs).

1.3-3 Let fl(x) be given by chopping and let δ be such that fl(x) = x(1 + δ). (If x = 0, take δ = 0.) Show that then δ is bounded as in (1.9).

1.3-4 Give examples to show that most of the laws of arithmetic fail to hold for floating-point

arithmetic. (Hint: Try laws involving three operands.)

1.3-5 Write a FORTRAN FUNCTION FL(X) which returns the value of the n-decimal-digit

floating-point number derived from X by rounding. Take n to be 4 and check your

calculations in Exercise 1.3-1. [Use ALOG10(ABS(X)) to determine e such that 10^(e-1) ≤ |X| < 10^e.]

1.3-6 Let δ_1, . . . , δ_m be numbers with |δ_i| ≤ u, all i. Show that there exists δ with |δ| ≤ u such that

(1 + δ_1)(1 + δ_2) ··· (1 + δ_m) = (1 + δ)^m

Show also that (1 + δ_1)(1 + δ_2) ··· (1 + δ_m) = 1 + mδ´ for some δ´, provided the δ_i all have the same sign.

1.3-7 Carry out a backward error analysis for the calculation of the scalar product

s = a_1 b_1 + a_2 b_2 + ··· + a_n b_n

Redo the analysis under the assumption that double-precision accumulation is used. This means that the double-precision result of each multiplication is retained and added to the sum in double precision, with the resulting sum rounded only at the end to single precision.
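A hedged Python sketch of the comparison in Exercise 1.3-7, simulating single precision by rounding through the standard `struct` module (all helper names are ours):

```python
import struct

def single(x):
    """Round a double to the nearest IEEE single-precision number."""
    return struct.unpack('f', struct.pack('f', x))[0]

def dot_single(a, b):
    """Scalar product with every product and running sum rounded to single."""
    s = 0.0
    for x, y in zip(a, b):
        s = single(s + single(x * y))
    return s

def dot_double_acc(a, b):
    """Products and running sum kept in double; round only the final sum."""
    s = 0.0
    for x, y in zip(a, b):
        s += x * y
    return single(s)

a = [single(0.1)] * 1000
b = [single(0.3)] * 1000
ref = sum(x * y for x, y in zip(a, b))        # double-precision reference
print(abs(dot_single(a, b) - ref) / ref)      # error of single accumulation
print(abs(dot_double_acc(a, b) - ref) / ref)  # error with double accumulation
```

Typically the singly accumulated sum loses noticeably more accuracy than the doubly accumulated one, whose error is essentially the final rounding alone.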


1.4 LOSS OF SIGNIFICANCE AND ERROR PROPAGATION;

CONDITION AND INSTABILITY

If the number x* is an approximation to the exact answer x, then we call

the difference x - x* the error in x*; thus

Exact = approximation + error

(1.14)

The relative error in x*, as an approximation to x, is defined to be the number (x - x*)/x. Note that this number is close to the number (x - x*)/x* if it is at all small. [Precisely, if (x - x*)/x = ε, then (x - x*)/x* = ε/(1 - ε).]

Every floating-point operation in a computational process may give

rise to an error which, once generated, may then be amplified or reduced

in subsequent operations.

One of the most common (and often avoidable) ways of increasing the

importance of an error is commonly called loss of significant digits. If x* is an approximation to x, then we say that x* approximates x to r significant digits provided the absolute error |x - x*| is at most one-half unit in the rth significant digit of x. This can be expressed in a formula as

|x - x*| ≤ (1/2)·10^(s-r+1)        (1.15)

with s the largest integer such that 10^s ≤ |x|. For instance, x* = 3 agrees with π to one significant (decimal) digit, while x* = 3.14 is correct to three significant digits (as an approximation to π). Suppose

now that we are to calculate the number z = x - y and that we have approximations x* and y* for x and y, respectively, available, each of which is good to r digits. Then z* = x* - y* is an approximation for z, which is also good to r digits unless x* and y*

agree to one or more digits. In this latter case, there will be cancellation of

digits during the subtraction, and consequently z* will be accurate to fewer

than r digits.

Consider, for example, a pair x* and y* agreeing in their first four significant digits, and assume each to be an approximation to x and y, respectively, correct

to seven significant digits. Then, in eight-digit floating-point arithmetic, z* = x* - y* is the exact difference between x* and y*. But as an approximation to

z = x - y, z* is good only to three digits, since the fourth significant digit

of z* is derived from the eighth digits of x* and y*, both possibly in error.


Hence, while the error in z* (as an approximation to z = x - y) is at most

the sum of the errors in x* and y*, the relative error in z* is possibly 10,000

times the relative error in x* or y*. Loss of significant digits is therefore

dangerous only if we wish to keep the relative error small.
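The effect is easy to reproduce on a machine. In the short Python illustration below (Python rather than the FORTRAN used elsewhere in this book), the values x and y are hypothetical, chosen only so that the approximations x* and y* agree in their leading digits:

```python
# x and y play the role of exact values; xs and ys are approximations
# whose eighth significant digit is slightly wrong (errors of about 4e-8).
x, y = 0.76545421, 0.76544200
xs, ys = x + 4e-8, y - 4e-8

z = x - y                  # exact difference, about 1.221e-5
zs = xs - ys               # difference of the approximations

rel_in = 4e-8 / x          # relative error of the inputs, about 5e-8
rel_out = abs(zs - z) / z  # relative error of the computed difference

# The absolute error at most doubled, but the relative error grew by a
# factor of roughly 10**5, because the leading digits cancelled.
```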

Such loss can often be avoided by anticipating its occurrence. Consider, for example, the evaluation of the function

f(x) = 1 - cos x

in six-decimal-digit arithmetic. Since cos x ≈ 1 for x near zero, there will

be loss of significant digits for x near zero if we calculate f(x) by first

finding cos x and then subtracting the calculated value from 1. For we

cannot calculate cos x to more than six digits, so that the error in the

calculated value may be as large as 5 · 10-7, hence as large as, or larger

than, f(x) for x near zero. If one wishes to compute the value of f(x) near

zero to about six significant digits using six-digit arithmetic, one would

have to use an alternative formula for f(x), such as

f(x) = (sin²x)/(1 + cos x)

which can be evaluated quite accurately for small x; else, one could make use of the Taylor expansion (see Sec. 1.7) for f(x),

f(x) = x²/2 - x⁴/24 + · · ·

which shows, for example, that for |x| ≤ 10⁻³, x²/2 agrees with f(x) to at least six significant digits.
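The cancellation, and its cure, can be watched directly in Python (double precision rather than six-decimal arithmetic, but the phenomenon is the same; for x = 10⁻⁸ the computed cos x rounds to exactly 1):

```python
import math

x = 1e-8
naive = 1.0 - math.cos(x)                      # cos x rounds to 1.0: all digits lost
stable = math.sin(x)**2 / (1.0 + math.cos(x))  # algebraically equal to 1 - cos x
series = x * x / 2.0                           # leading term of the Taylor expansion

# naive is 0.0, while the stable formula and the truncated series agree
# with the true value (about 5e-17) to full working precision.
```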

Another example is provided by the problem of finding the roots of

the quadratic equation

ax² + bx + c = 0    (1.16)

We know from algebra that the roots are given by the quadratic formula

x = (-b ± √(b² - 4ac))/(2a)    (1.17)

Let us assume that b² - 4ac > 0, that b > 0, and that we wish to find the root of smaller absolute value using (1.17); i.e.,

x = (-b + √(b² - 4ac))/(2a)    (1.18)

If 4ac is small compared with b², then √(b² - 4ac) will agree with b to

several places. Hence, given that √(b² - 4ac) will be calculated correctly

only to as many places as are used in the calculations, it follows that the

numerator of (1.18), and therefore the calculated root, will be accurate to

fewer places than were used during the calculation. To be specific, take the


equation

(1.19)

Using (1.18) and five-decimal-digit floating-point chopped arithmetic, we

calculate

while in fact,

is the correct root to the number of digits shown. Here too, the loss of

significant digits can be avoided by using an alternative formula for the

calculation of the absolutely smaller root, viz.,

x = (2c)/(-b - √(b² - 4ac))    (1.20)

Using this formula, and five-decimal-digit arithmetic, we calculate

which is accurate to five digits.
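The same comparison can be made in Python. The coefficients below are hypothetical, picked so that 4ac is tiny compared with b²; the cancellation here happens in double precision instead of five-decimal arithmetic, but formula (1.18) again loses most of its digits while (1.20) does not:

```python
import math

a, b, c = 1.0, 1e8, 1.0            # 4ac is tiny compared with b**2
d = math.sqrt(b*b - 4.0*a*c)       # agrees with b in about 15 digits

naive  = (-b + d) / (2.0 * a)      # formula (1.18): catastrophic cancellation
stable = (2.0 * c) / (-b - d)      # formula (1.20): no cancellation

# The root of smaller magnitude is very nearly -1e-8; "stable" matches it
# essentially to full precision, while "naive" is off by about 25 percent.
```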

Once an error is committed, it contaminates subsequent results. This

error propagation through subsequent calculations is conveniently studied

in terms of the two related concepts of condition and instability.

The word condition is used to describe the sensitivity of the function

value f(x) to changes in the argument x. The condition is usually measured

by the maximum relative change in the function value f(x) caused by a

unit relative change in the argument. In a somewhat informal formula,

condition of f at x = |x f′(x)/f(x)|    (1.21)

The larger the condition, the more ill-conditioned the function is said to

be. Here we have made use of the fact (see Sec. 1.7) that

f(x*) - f(x) ≈ f′(x)(x* - x)

i.e., the change in argument from x to x* changes the function value by approximately f′(x)(x* - x).

If, for example, f(x) = √x, then f′(x)/f(x) = 1/(2x), and hence the condition of f is, approximately,

|x f′(x)/f(x)| = 1/2

This says that taking square roots is a well-conditioned process since it

actually reduces the relative error. By contrast, if

then

so that

and this number can be quite large for |x| near 1. Thus, for x near 1 or

- 1, this function is quite ill-conditioned. It very much magnifies relative

errors in the argument there.
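Formula (1.21) can also be probed numerically by actually perturbing the argument. In the Python sketch below, the function g is a hypothetical ill-conditioned example of our own choosing, not necessarily the one discussed above:

```python
import math

def condition(f, x, eps=1e-6):
    # |relative change in f(x)| divided by |relative change in x|, per (1.21)
    return abs((f(x * (1.0 + eps)) - f(x)) / f(x)) / eps

c_sqrt = condition(math.sqrt, 2.0)       # about 1/2: well-conditioned
g = lambda x: 1.0 / (1.0 - x * x)        # hypothetical function, large condition near |x| = 1
c_bad = condition(g, 0.999)              # about 1000: ill-conditioned
```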

The related notion of instability describes the sensitivity of a numerical

process for the calculation of f(x) from x to the inevitable rounding errors

committed during its execution in finite precision arithmetic. The precise

effect of these errors on the accuracy of the computed value for f(x) is

hard to determine except by actually carrying out the computations for

particular finite precision arithmetics and comparing the computed answer

with the exact answer. But it is possible to estimate these effects roughly by

considering the rounding errors one at a time. This means we look at the

individual computational steps which make up the process. Suppose there

are n such steps. Denote by xᵢ the output from the ith such step, and take x₀ = x. Such an xᵢ then serves as input to one or more of the later steps and, in this way, influences the final answer xₙ = f(x). Denote by fᵢ the function which describes the dependence of the final answer on the intermediate result xᵢ. In particular, f₀ is just f. Then the total process is unstable to the extent that one or more of these functions fᵢ is ill-conditioned. More precisely, the process is unstable to the extent that one or more of the fᵢ’s has a much larger condition than f = f₀ has. For it is the condition of fᵢ which gauges the relative effect of the inevitable rounding error incurred at the ith step on the final answer.

To give a simple example, consider the function

f(x) = √(x+1) - √x

for “large” x, say for x = 12345. Its condition there is

|x f′(x)/f(x)| = x/(2√(x(x+1))) ≈ 1/2

which is quite good. But, if we calculate f(12345) in six-decimal arithmetic,


we find f(12345) = 111.113 - 111.108 = 0.005000, while, actually, f(12345) = 0.00450003 · · · . So our calculated answer is in error by 10 percent. We analyze the

computational process. It consists of the following four computational

steps:

x₁ := x + 1    x₂ := √x₁    x₃ := √x    x₄ := x₂ - x₃    (1.22)

Now consider, for example, the function f₃, i.e., the function which describes how the final answer x₄ depends on x₃. We have f₃(t) = x₂ - t; hence its condition is, approximately,

|t f₃′(t)/f₃(t)| = t/|x₂ - t|

This number is usually near 1; i.e., f₃ is usually well-conditioned except when t is near x₂. In this latter case, f₃ can be quite badly conditioned. For

example, in our particular case, x₂ = √12346 = 111.11255 · · · , while t = x₃ = √12345 = 111.10805 · · · , so the condition of f₃ is t/(x₂ - t) ≈ 111.108/0.0045 ≈ 24,700, or more than 40,000 times as big as the condition of

f itself.

We conclude that the process described in (1.22) is an unstable way to

evaluate f. Of course, if you have read the beginning of this section

carefully, then you already know a stable way to evaluate this function,

namely by the equivalent formula

f(x) = 1/(√(x+1) + √x)

In six-decimal arithmetic, this gives f(12345) = 1/222.221 = 0.00450002, which is in error by only 0.0003 percent. The computational process is

x₁ := x + 1    x₂ := √x₁    x₃ := √x    x₄ := 1/(x₂ + x₃)    (1.23)

Here, for example, f₃(t) = 1/(x₂ + t), and the condition of this function is, approximately,

|t f₃′(t)/f₃(t)| = t/(x₂ + t)

which is less than 1 for t > 0, which is the case here. Thus, the condition of f₃ is quite good; it is as good as that of f itself.

We will meet other examples of large instability, particularly in the

discussion of the numerical solution of differential equations.
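The whole comparison of the unstable process (1.22) with the stable one (1.23) can be reenacted by simulating six-decimal rounding; the helper fl6 below is our own construction for this purpose:

```python
import math

def fl6(v):
    # round v to six significant decimal digits
    if v == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(v)))
    return round(v, 5 - e)

x = 12345.0
x1 = fl6(x + 1.0)                   # 12346.0
x2 = fl6(math.sqrt(x1))             # 111.113
x3 = fl6(math.sqrt(x))              # 111.108
unstable = fl6(x2 - x3)             # 0.005:      process (1.22), about 10 percent off
stable = fl6(1.0 / fl6(x2 + x3))    # 0.00450002: process (1.23)
exact = math.sqrt(12346.0) - math.sqrt(12345.0)
```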

EXERCISES

1.4-1 Find the root of smallest magnitude of the equation

using formulas (1.18) and (1.20). Work in floating-point arithmetic using a four- (decimal-)

place mantissa.

1.4-2 Estimate the error in evaluating

around x = 2 if the absolute error in x is 10⁻⁶.

1.4-3 Find a way to calculate

correctly to the number of digits used when x is near zero for (a)-(c), very much larger than

for (d).

1.4-4 Assuming a computer with a four-decimal-place mantissa, add the following numbers

first in ascending order (from smallest to largest) and then in descending order. In doing so

round off the partial sums. Compare your results with the correct sum x = 0.107101023 · 105.

1.4-5 A dramatically unstable way to calculate f(x) = eˣ for negative x is provided by its Taylor series (1.36). Calculate e⁻¹² by evaluating the Taylor series (1.36) at x = -12 and


compare with the accurate value e⁻¹² = 0.00000 61442 12354 · · · . [Hint: By (1.36), the

difference between eˣ and the partial sum

is less than the next term

in absolute value, in case x is negative. So, it would be all right to sum the series until

1.4-6 Explain the result of Exercise 1.4-5 by comparing the condition of f(x) = eˣ near x = -12 with the condition of some of the functions fᵢ involved in the computational process. Then find a stable way to calculate e⁻¹² from the Taylor series (1.36). (Hint: e⁻ˣ = 1/eˣ.)

1.5 COMPUTATIONAL METHODS FOR ERROR

ESTIMATION

This chapter is intended to make the student aware of the possible sources

of error and to point out some techniques which can be used to avoid these

errors. In appraising computer results, such errors must be taken into

account. Realistic estimates of the total error are difficult to make in a

practical problem, and an adequate mathematical theory is still lacking.

An appealing idea is to make use of the computer itself to provide us with

such estimates. Various methods of this type have been proposed. We shall

discuss briefly five of them. The simplest method makes use of double

precision. Here one simply solves the same problem twice—once in single

precision and once in double precision. From the difference in the results

an estimate of the total round-off error can then be obtained (assuming

that all other errors are less significant). It can then be assumed that the

same accumulation of roundoff will occur in other problems solved with

the same subroutine. This method is extremely costly in machine time

since double-precision arithmetic increases computer time by a factor of 8

on some machines, and in addition, it is not always possible to isolate

other errors.

A second method is interval arithmetic. Here each number is represented by two machine numbers, the maximum and the minimum values

that it might have. Whenever an operation is performed, one computes its

maximum and minimum values. Essentially, then, one will obtain two

solutions at every step, the true solution necessarily being contained within

the range determined by the maximum and minimum values. This method

requires more than twice the amount of computer time and about twice the

storage of a standard run. Moreover, the usual assumption that the true

solution lies about midway within the range is not, in general, valid. Thus

the range might be so large that any estimate of the round-off error based

upon this would be grossly exaggerated.
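A toy version of the idea can be written in a few lines of Python. This sketch ignores the directed rounding of the endpoints that a genuine interval package must perform, but it shows how brackets propagate and widen:

```python
class Interval:
    """A pair (lo, hi) guaranteed to bracket the true value."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)
    def __sub__(self, other):
        # smallest minus largest, and largest minus smallest
        return Interval(self.lo - other.hi, self.hi - other.lo)
    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))
    def width(self):
        return self.hi - self.lo

# Subtracting two uncertain quantities: the widths add, so the bracket
# on the difference can be far wider than the error it actually contains.
a = Interval(1.000, 1.002)
b = Interval(0.999, 1.001)
d = a - b
```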

A third approach is significant-digit arithmetic. As pointed out earlier,

whenever two nearly equal machine numbers are subtracted, there is a

danger that some significant digits will be lost. In significant-digit

arithmetic an attempt is made to keep track of digits so lost. In one version


only the significant digits in any number are retained, all others being

discarded. At the end of a computation we will thus be assured that all

digits retained are significant. The main objection to this method is that

some information is lost whenever digits are discarded, and that the results

obtained are likely to be much too conservative. Experimentation with this

technique is still going on, although the experience to date is not too

promising.

A fourth method which gives considerable promise of providing an

adequate mathematical theory of round-off-error propagation is based on

a statistical approach. It begins with the assumption that round-off errors

are independent. This assumption is, of course, not valid, because if the

same problem is run on the same machine several times, the answers will

always be the same. We can, however, adopt a stochastic model of the

propagation of round-off errors in which the local errors are treated as if

they were random variables. Thus we can assume that the local round-off

errors are either uniformly or normally distributed between their extreme

values. Using statistical methods, we can then obtain the standard deviation, the variance of distribution, and estimates of the accumulated roundoff error. The statistical approach is considered in some detail by Hamming [1] and Henrici [2]. The method does involve substantial analysis and

additional computer time, but in the experiments conducted to date it has

obtained error estimates which are in remarkable agreement with experimentally available evidence.

A fifth method is backward error analysis, as introduced in Sec. 1.3. As

we saw, it reduces the analysis of rounding error effects to a study of

perturbations in exact arithmetic and, ultimately, to a question of condition. We will make good use of this method in Chap. 4.

1.6 SOME COMMENTS ON CONVERGENCE OF

SEQUENCES

Calculus, and more generally analysis, is based on the notion of convergence. Basic concepts such as derivative, integral, and continuity are

defined in terms of convergent sequences, and elementary functions such

as ln x or sin x are defined by convergent series. At the same time,

numerical answers to engineering and scientific problems are never needed

exactly. Rather, an approximation to the answer is required which is

accurate “to a certain number of decimal places,” or accurate to within a

given tolerance ε.

It is therefore not surprising that many numerical methods for finding the answer a of a given problem merely produce (the first few terms of) a sequence a₁, a₂, a₃, . . . which is shown to converge to the desired answer.


To recall the definition: A sequence a₁, a₂, a₃, . . . of (real or complex) numbers converges to a if and only if, for all ε > 0, there exists an integer N = N(ε) such that for all n > N, |aₙ - a| < ε.

Hence, if we have a numerical method which produces a sequence a₁, a₂, a₃, . . . converging to the desired answer a, then we can calculate a to any desired accuracy merely by calculating aₙ for “large enough” n.

From a computational point of view, this definition is unsatisfactory for the following reasons: (1) It is often not possible (without knowing the answer a) to know when n is “large enough.” In other words, it is difficult to get hold of the function N(ε) mentioned in the definition of convergence. (2) Even when some knowledge about N(ε) is available, it may turn out that the required n is too large to make the calculation of aₙ feasible.

Example The number π/4 is the value of the infinite series

1 - 1/3 + 1/5 - 1/7 + 1/9 - · · ·

Hence, with

aₙ = 1 - (1/3 - 1/5) - (1/7 - 1/9) - · · · - (1/(4n-1) - 1/(4n+1))

the sequence a₁, a₂, a₃, . . . is monotone-decreasing to its limit π/4. Moreover,

0 < aₙ - π/4 < 1/(4n+3)

To calculate π/4 correct to within 10⁻⁶ using this sequence, we would need 10⁶ < 4n + 3, or roughly, n = 250,000. On a computer using eight-decimal-digit floating-point arithmetic, round-off in the calculation of aₙ is probably much larger than 10⁻⁶. Hence π/4 could not be computed to within 10⁻⁶ using this sequence (except, perhaps, by adding the terms from smallest to largest).
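The slow convergence is easy to observe; the Python fragment below uses the plain term-by-term partial sums of the series, accumulated in double precision:

```python
import math

def partial_sum(n):
    # sum of the first n terms of 1 - 1/3 + 1/5 - 1/7 + ...
    return sum((-1.0)**k / (2*k + 1) for k in range(n))

errors = [abs(partial_sum(n) - math.pi/4) for n in (10, 100, 1000)]
ratios = [errors[i] / errors[i+1] for i in range(2)]
# Ten times as many terms buys only one more decimal digit: the error
# behaves like a constant times 1/n.
```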

To deal with these problems, some notation is useful. Specifically, we

would like to measure how fast sequences converge. As with all measuring,

this is done by comparison with certain standard sequences βₙ, such as βₙ = 1/n, βₙ = 1/n², or βₙ = rⁿ for some fixed |r| < 1. The comparison is made as follows: one says that aₙ - a is of order βₙ, and writes

aₙ = a + O(βₙ)    (1.24)

in case

|aₙ - a| ≤ K |βₙ|    (1.25)


for some constant K and all sufficiently large n. Thus, in the preceding example, aₙ = π/4 + O(1/n).

Further, if it is possible to choose the constant K in (1.25) arbitrarily small as soon as n is large enough; that is, should it happen that

lim_{n→∞} (aₙ - a)/βₙ = 0

then one says that aₙ - a is of higher order than βₙ, and writes

aₙ = a + o(βₙ)    (1.26)

Thus, for example, 1/n² = O(1/n), and even o(1/n), while sin(1/n) = O(1/n) but not o(1/n).

The order notation appears customarily only on the right-hand side of

an equation and serves the purpose of describing the essential feature of an

error term without bothering about multiplying constants or other detail.

For instance, we can state concisely the unsatisfactory state of affairs in

the earlier example by saying that aₙ = π/4 + O(1/n) but also aₙ ≠ π/4 + o(1/n); i.e., the series converges to π/4 as fast as 1/n (goes to zero) but no faster.

A convergence order or rate of l/n is much too slow to be useful in

calculations.

Example If aₙ = a + o(1), then, by definition, lim_{n→∞} (aₙ - a)/1 = 0. Hence aₙ = a + o(1) is just a fancy way of saying that the sequence a₁, a₂, . . . converges to a.

Example If |r| < 1, then the geometric series 1 + r + r² + · · · sums to 1/(1 - r). With aₙ = 1 + r + · · · + rⁿ, we have

1/(1 - r) - aₙ = rⁿ⁺¹/(1 - r)

Further, with K = |r|/(1 - |r|), we have |1/(1 - r) - aₙ| ≤ K |r|ⁿ for all n. Thus

aₙ = 1/(1 - r) + O(rⁿ)


Hence, whenever aₙ = a + O(rⁿ) for some |r| < 1, we say that the convergence is (at least) geometric, for it is then (at least) of the same order as the convergence of the geometric series.
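Geometric convergence is just as easy to observe numerically. With r = 1/2, each new partial sum of the geometric series cuts the remaining error by exactly the factor r:

```python
r = 0.5
limit = 1.0 / (1.0 - r)                 # = 2
partial, term, errors = 0.0, 1.0, []
for n in range(10):
    partial += term                     # partial = 1 + r + ... + r**n
    term *= r
    errors.append(limit - partial)      # = r**(n+1)/(1 - r)
ratios = [errors[i+1] / errors[i] for i in range(9)]
# Every ratio equals r: the error is O(r**n), i.e., geometric convergence.
```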

Although it is better to know that aₙ = a + O(βₙ) than to know nothing, knowledge about the order of convergence becomes quite useful only when we know more precisely that

aₙ ≈ a + c βₙ    for some constant c ≠ 0

This says that for “large enough” n, aₙ - a is approximately c βₙ. To put it differently,

(aₙ - a)/βₙ = c + εₙ

where ε₁, ε₂, . . . is a sequence converging to zero. Although we cannot

prove that a certain n is “large enough,” we can test the hypothesis that n is “large enough” by comparing the differences aₖ - aₖ₋₁ with the corresponding differences βₖ - βₖ₋₁. If

(aₖ - aₖ₋₁)/(βₖ - βₖ₋₁) ≈ c

for k near n, say for k = n - 2, n - 1, n, then we accept the hypothesis that n is “large enough” for aₙ - a ≈ c βₙ to be true, and therefore accept c βₙ as a good estimate of the error aₙ - a.

Example Let p > 1. Then the series converges to its limit like the geometric series. To get a more precise statement, consider the ratios of successive differences.


For the ratios, we find that they are, e.g., within 1/10 of 1 for n = 3 and p = 2. Thus, the most recent difference is then a good indication of the error; in fact, the error is 0.12005 · · · .

This notation carries over to functions of a real variable. If

lim_{h→0} g(h) = a

we say that the convergence is O(f(h)) provided

|g(h) - a| ≤ K |f(h)|

for some finite constant K and all small enough h. If this holds for all K > 0, that is, if

lim_{h→0} (g(h) - a)/f(h) = 0

then we call the convergence o(f(h)).

Example For h “near” zero, we have

Hence, for all

Example If the function f(x) has a zero of order

then

Rules for calculating with the order symbols are collected in the following

lemma.

Lemma 1.1 If aₙ = a + O(βₙ) and c is a constant, then c aₙ = c a + O(βₙ). If also bₙ = b + O(γₙ), then

aₙ + bₙ = (a + b) + O(βₙ + γₙ)    (1.27)

If, further, the aₙ and bₙ are bounded, then also

aₙ bₙ = a b + O(βₙ + γₙ)

while if b ≠ 0, then

aₙ/bₙ = a/b + O(βₙ + γₙ)

Finally, all statements remain true if O is replaced by o throughout.

The approximate calculation of a number a via a sequence a₁, a₂, . . . converging to a always involves an act of faith, regardless of whether or not the order of convergence is known. Given that the sequence is known to converge to a, practicing numerical analysts ascertain that n is “large enough” by making sure that, for small values of j, aₙ differs “little enough” from aₙ₋ⱼ. If they also know that the convergence is O(βₙ), they check whether or not the sequence behaves accordingly near n. If they also know that a satisfies certain equations or inequalities—a might be the sought-for solution of an equation—they check that aₙ satisfies these equations or inequalities “well enough.” In short, practicing numerical

analysts make sure that n satisfies all conditions they can think of which

are necessary for n to be “large enough.” If all these conditions are

satisfied, then, lacking sufficient conditions for n to be “large enough,”

they accept aₙ on faith as a good enough approximation to a. In a way,

numerical analysts use all means at their disposal to distinguish a “good

enough” approximation from a bad one. They can do no more (and should

do no less).

It follows that numerical results arrived at in this way should not be

mistaken for final answers. Rather, they should be questioned freely if

subsequent investigations throw any doubt upon their correctness.

The student should appreciate this as another example of the basic

difference between numerical analysis and analysis. Analysis became a

precise discipline when it left the restrictions of practical calculations to

deal entirely with problems posed in terms of an abstract model of the

number system, called the real numbers. This abstract model is designed to

make a precise and useful definition of limit possible, which opens the way

to the abstract or symbolic solution of an impressive array of practical

problems, once these problems are translated into the terms of the model.

This still leaves the task of translating the abstract or symbolic solutions

back into practical solutions. Numerical analysis assumes this task, and

with it the limitations of practical calculations from which analysis

managed to escape so elegantly. Numerical answers are therefore usually

tentative and, at best, known to be accurate only to within certain bounds.

Numerical analysis is therefore not merely concerned with the construction of numerical methods. Rather, a large portion of numerical

analysis consists in the derivation of useful error bounds, or error estimates,

for the numerical answers produced by a numerical algorithm. Throughout

this book, the student will meet this preoccupation with error bounds so

typical of numerical analysis.


EXERCISES

1.6-1 The number ln 2 may be calculated from the series

ln 2 = 1 - 1/2 + 1/3 - 1/4 + · · ·

It is known from analysis that this series converges and that the magnitude of the error in any

partial sum is less than the magnitude of the first neglected term. Estimate the number of

terms that would be required to calculate ln 2 to 10 decimal places.

1.6-2 For h near zero it is possible to write

and

Find the values of

and

for which these equalities hold.

1.6-3 Try to calculate, on a computer, the limit of the sequence

Theoretically, what is

and what is the order of convergence of the sequence?

1.7 SOME MATHEMATICAL PRELIMINARIES

It is assumed that the student is familiar with the topics normally covered

in the undergraduate analytic geometry and calculus sequence. These

include elementary notions of real and complex number systems; continuity; the concept of limits, sequences, and series; differentiation and integration. For Chap. 4, some knowledge of determinants is assumed. For

Chaps. 8 and 9, some familiarity with the solution of ordinary differential

equations is also assumed, although these chapters may be omitted.

In particular, we shall make frequent use of the following theorems.

Theorem 1.1: Intermediate-value theorem for continuous functions Let

f(x) be a continuous function on the interval a ≤ x ≤ b. If f(x₁) ≤ c ≤ f(x₂) for some number c and some points x₁, x₂ in [a,b], then there is a point ξ in [a,b] such that f(ξ) = c.

This theorem is often used in the following form:

Theorem 1.2 Let f(x) be a continuous function on [a,b], let x1, . . . , xn

be points in [a,b], and let g1, . . . , gn, be real numbers all of one sign.

Then

Σᵢ₌₁ⁿ f(xᵢ) gᵢ = f(ξ) Σᵢ₌₁ⁿ gᵢ    for some ξ in [a,b]


To indicate the proof, assume without loss of generality that gᵢ > 0, all i. Then Σᵢ f(xᵢ)gᵢ / Σᵢ gᵢ is a number between the two values minᵢ f(xᵢ) and maxᵢ f(xᵢ) of the continuous function f(x), and the conclusion follows from Theorem 1.1.

One proves analogously the corresponding statement for infinite sums or integrals:

Theorem 1.3: Mean-value theorem for integrals Let g(x) be a nonnegative or nonpositive integrable function on [a,b]. If f(x) is continuous

on [a,b], then

∫ₐᵇ f(x) g(x) dx = f(ξ) ∫ₐᵇ g(x) dx    for some ξ in [a,b]    (1.28)

Warning The assumption that g(x) is of one sign is essential in Theorem

1.3, as the simple example f(x) = g(x) = x, a = -1, b = 1 shows.

Theorem 1.4 Let f(x) be a continuous function on the closed and

bounded interval [a,b]. Then f(x) “assumes its maximum and minimum values on [a,b]”; i.e., there exist points ξ and η in [a,b] such that

f(ξ) ≤ f(x) ≤ f(η)    for all x in [a,b]

Theorem 1.5: Rolle’s theorem Let f(x) be continuous on the (closed

and finite) interval [a,b] and differentiable on (a,b). If f(a) = f(b) =

0, then f′(ξ) = 0 for some ξ in (a,b).

The proof makes essential use of Theorem 1.4. For by Theorem 1.4, there are points ξ and η in [a,b] such that, for all x in [a,b], f(ξ) ≤ f(x) ≤ f(η). If now neither ξ nor η is in (a,b), then f(x) is identically zero, and every point of (a,b) will do. Otherwise, either ξ or η is in (a,b), say, η. But then f′(η) = 0, since f(η) is the biggest value achieved by f(x) on [a,b].

An immediate consequence of Rolle’s theorem is the following theorem.

Theorem 1.6: Mean-value theorem for derivatives If f(x) is continuous

on the (closed and finite) interval [a,b] and differentiable on (a, b),


then

f(b) - f(a) = f′(ξ)(b - a)    for some ξ in (a,b)    (1.29)

One gets Theorem 1.6 from Theorem 1.5 by considering in Theorem

1.5 the function

F(x) = f(x) - f(a) - (x - a)(f(b) - f(a))/(b - a)

instead of f(x). Clearly, F(x) vanishes both at a and at b.

It follows directly from Theorem 1.6 that if f(x) is continuous on [a,b]

and differentiable on (a,b), and c is some point in [a,b], then for all x in [a,b],

f(x) = f(c) + f′(ξ)(x - c)    for some ξ between c and x    (1.30)

The fundamental theorem of calculus provides the more precise statement: If f(x) is continuously differentiable, then for all x in [a,b],

f(x) = f(c) + ∫_c^x f′(t) dt    (1.31)

from which (1.30) follows by the mean-value theorem for integrals (1.28),

since f '(x) is continuous. More generally, one has the following theorem.

Theorem 1.7: Taylor’s formula with (integral) remainder If f(x) has

n + 1 continuous derivatives on [a,b] and c is some point in [a,b],

then for all x in [a,b],

f(x) = f(c) + f′(c)(x - c) + f″(c)(x - c)²/2! + · · · + f⁽ⁿ⁾(c)(x - c)ⁿ/n! + Rₙ₊₁(x)    (1.32)

where

Rₙ₊₁(x) = ∫_c^x f⁽ⁿ⁺¹⁾(t)(x - t)ⁿ/n! dt    (1.33)

One gets (1.32) from (1.31) by considering the function

instead of f(x). For,

But since F(c) = f(c), this gives

hence by (1.31),


which is (1.32), after the substitution of x for c and of c for x.

Actually, f⁽ⁿ⁺¹⁾(x) need not be continuous for (1.32) to hold. However, if in (1.32), f⁽ⁿ⁺¹⁾(x) is continuous, one gets, using Theorem 1.3, the more familiar but less useful form for the remainder:

Rₙ₊₁(x) = f⁽ⁿ⁺¹⁾(ξ)(x - c)ⁿ⁺¹/(n + 1)!    for some ξ between c and x    (1.34)

By setting h = x - c, (1.32) and (1.34) take the form

f(c + h) = f(c) + f′(c) h + f″(c) h²/2! + · · · + f⁽ⁿ⁾(c) hⁿ/n! + f⁽ⁿ⁺¹⁾(ξ) hⁿ⁺¹/(n + 1)!    (1.35)

for some ξ between c and c + h.

Example The function f(x) = eˣ has the Taylor expansion

eˣ = 1 + x + x²/2! + · · · + xⁿ/n! + e^ξ xⁿ⁺¹/(n + 1)!    for some ξ between 0 and x    (1.36)

about c = 0. The expansion of f(x) = ln x = logₑ x about c = 1 is

ln x = (x - 1) - (x - 1)²/2 + · · · + (-1)ⁿ⁻¹(x - 1)ⁿ/n + (-1)ⁿ(x - 1)ⁿ⁺¹/((n + 1)ξⁿ⁺¹)

where 0 < x < 2, and ξ is between 1 and x.
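The expansion (1.36) and its remainder bound are easily checked numerically. On [0,1] one has e^ξ ≤ e, so the remainder is at most e·xⁿ⁺¹/(n+1)!; the short Python check below confirms this for x = 1, n = 10:

```python
import math

def taylor_exp(x, n):
    # partial sum 1 + x + x**2/2! + ... + x**n/n! of (1.36)
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x, n = 1.0, 10
err = abs(math.exp(x) - taylor_exp(x, n))
bound = math.e * x**(n + 1) / math.factorial(n + 1)  # since e**xi <= e on [0,1]
# err is a few times 1e-8, safely below the bound of about 6.8e-8.
```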

A similar formula holds for functions of several variables. One obtains

this formula from Theorem 1.7 with the aid of

Theorem 1.8: Chain rule If the function f(x,y, . . . , z) has continuous

first partial derivatives with respect to each of its variables, and

x = x(t), y = y(t), . . . , z = z(t) are continuously differentiable functions of t, then g(t) = f(x(t), y(t), . . . , z(t)) is also continuously differentiable, and

g′(t) = (∂f/∂x) x′(t) + (∂f/∂y) y′(t) + · · · + (∂f/∂z) z′(t)

From this theorem, one obtains an expression for f(x, y, . . . , z) in terms of the value and the partial derivatives at (a, b, . . . , c) by introducing the function

g(t) = f(a + t(x - a), b + t(y - b), . . . , c + t(z - c))

and then evaluating its Taylor series expansion around t = 0 at t = 1. For example, this gives


Theorem 1.9 If f(x,y) has continuous first and second partial derivatives in a neighborhood D of the point (a,b) in the (x,y) plane, then

f(x,y) = f(a,b) + f_x(a,b)(x - a) + f_y(a,b)(y - b) + R₂(x,y)    (1.37)

for all (x,y) in D, where

R₂(x,y) = ½[f_xx(ξ,η)(x - a)² + 2 f_xy(ξ,η)(x - a)(y - b) + f_yy(ξ,η)(y - b)²]

for some point (ξ,η), depending on (x,y), on the line segment joining (a,b) and (x,y), and the subscripts on f denote partial differentiation.

For example, the expansion of eˣ sin y about (a,b) = (0, 0) is

eˣ sin y = y + R₂(x,y)    (1.38)

Finally, in the discussion of eigenvalues of matrices and elsewhere, we

need the following theorem.

Theorem 1.10: Fundamental theorem of algebra If p(x) is a polynomial

of degree n ≥ 1, that is,

p(x) = a₀ + a₁x + · · · + aₙxⁿ

with a₀, . . . , aₙ real or complex numbers and aₙ ≠ 0, then p(x) has at least one zero; i.e., there exists a complex number ξ such that p(ξ) = 0.

This rather deep theorem should not be confused with the straightforward statement, “A polynomial of degree n has at most n zeros,

counting multiplicity,” which we prove in Chap. 2 and use, for example, in

the discussion of polynomial interpolation.

EXERCISES

1.7-1 In the mean-value theorem for integrals, Theorem 1.3, let

[0,1]. Find the point ξ specified by the theorem and verify that this point lies in the interval

(0,1).

1.7-2 In the mean-value theorem for derivatives, Theorem 1.6, let

Find the point ξ specified by the theorem and verify that this point lies in the interval (a,b).

1.7-3 In the expansion (1.36) for eˣ, find n so that the resulting power sum will yield an

approximation correct to five significant digits for all x on [0,1].


1.7-4 Use Taylor’s formula (1.32) to find a power series expansion about

Find an expression for the remainder, and from this estimate the number of terms that would

be needed to guarantee six-significant-digit accuracy for

for all x on the interval

[-1,1].

1.7-5 Find the remainder R2(x,y) in the example (1.38) and determine its maximum value in

the region D defined by

1.7-6 Prove that the remainder term in (1.35) can also be written

1.7-7 Illustrate the statement in Exercise 1.7-6 by calculating, for

for various values of h, for example, for

with

and comparing Rₙ₊₁(h)

1.7-8 Prove Theorem 1.9 from Theorems 1.7 and 1.8.

1.7-9 Prove Euler’s formula

e^(ix) = cos x + i sin x

by comparing the power series for eˣ, evaluated at ix, with the sum of the power series for cos x and i times the one for sin x.


CHAPTER

TWO

INTERPOLATION BY POLYNOMIALS

Polynomials are used as the basic means of approximation in nearly all

areas of numerical analysis. They are used in the solution of equations and

in the approximation of functions, of integrals and derivatives, of solutions

of integral and differential equations, etc. Polynomials owe this popularity

to their simple structure, which makes it easy to construct effective

approximations and then make use of them.

For this reason, the representation and evaluation of polynomials is a

basic topic in numerical analysis. We discuss this topic in the present

chapter in the context of polynomial interpolation, the simplest and

certainly the most widely used technique for obtaining polynomial approximations. More advanced methods for getting good approximations by

polynomials and other approximating functions are given in Chap. 6. But

it will be shown there that even best polynomial approximation does not

give appreciably better results than an appropriate scheme of polynomial

interpolation.

Divided differences serve as the basis of our treatment of the interpolating polynomial. This makes it possible to deal with osculatory (or

Hermite) interpolation as a special limiting case of polynomial interpolation at distinct points.

2.1 POLYNOMIAL FORMS

In this section, we point out that the customary way to describe a

polynomial may not always be the best way in calculations, and we


propose alternatives, in particular the Newton form. We also show how to

evaluate a polynomial given in Newton form. Finally, in preparation for

polynomial interpolation, we discuss how to count the zeros of a polynomial.

A polynomial p(x) of degree ≤ n is, by definition, a function of the form

p(x) = a₀ + a₁x + a₂x² + · · · + aₙxⁿ    (2.1)

with certain coefficients a₀, a₁, . . . , aₙ. This polynomial has (exact) degree n in case its leading coefficient aₙ is nonzero.

The power form (2.1) is the standard way to specify a polynomial in

mathematical discussions. It is a very convenient form for differentiating

or integrating a polynomial. But, in various specific contexts, other forms

are more convenient.

Example 2.1: The power form may lead to loss of significance If we construct the power

form of the straight line p(x) which takes on the values p(6000) = 1/3, p(6001) =

- 2/3, then, in five-decimal-digit floating-point arithmetic, we will obtain p(x) = 6000.3 - x. Evaluating this straight line, in the same arithmetic, we find p(6000) = 0.3

and p(6001) = - 0.7, which recovers only the first digit of the given function values, a

loss of four decimal digits.

A remedy of sorts for such loss of significance is the use of the shifted

power form

p(x) = a₀ + a₁(x - c) + a₂(x - c)² + · · · + aₙ(x - c)ⁿ    (2.2)

If we choose the center c to be 6000, then, in the example, we would

get p(x) = 0.33333 - (x - 6000.0), and evaluation in five-decimal-digit

floating-point arithmetic now provides p(6000) = 0.33333, p(6001) =

- 0.66667; i.e., the values are as correct as five digits can make them.
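Example 2.1 can be reenacted by rounding every intermediate result to five significant decimal digits; the helper fl5 below is our own simulation of such a machine:

```python
import math

def fl5(v):
    # round v to five significant decimal digits
    if v == 0.0:
        return 0.0
    return round(v, 4 - math.floor(math.log10(abs(v))))

def p_power(x):    # power form: p(x) = 6000.3 - x
    return fl5(fl5(6000.3) - x)

def p_shifted(x):  # shifted power form, center c = 6000
    return fl5(0.33333 - fl5(x - 6000.0))

# The power form recovers only the first digit of the given values;
# the shifted form is as correct as five digits allow.
```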

It is good practice to employ the shifted power form with the center c

chosen somewhere in the interval [a,b] when interested in a polynomial on

that interval. A more sophisticated remedy against loss of significance (or ill-conditioning) is offered by an expansion in Chebyshev polynomials or

other orthogonal polynomials; see Sec. 6.3.
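The effect of Example 2.1 is easy to reproduce. The sketch below, in Python rather than the FORTRAN used elsewhere in this chapter, models five-decimal-digit arithmetic with a rounding helper fl; the helper and both evaluators are illustrative, not from the text:

```python
import math

def fl(x, digits=5):
    """Round x to `digits` significant decimal digits -- a crude model
    of five-decimal-digit floating-point arithmetic."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - e)

# Straight line through (6000, 1/3) and (6001, -2/3); its slope is -1.
# Power form p(x) = a0 + a1*x, with every operation rounded:
a1 = fl((-2.0/3.0 - 1.0/3.0) / (6001.0 - 6000.0))   # -1.0
a0 = fl(fl(1.0/3.0) - fl(a1 * 6000.0))              # 6000.3
def p_power(x):
    return fl(a0 + fl(a1 * x))

# Shifted power form p(x) = b0 + b1*(x - c) with center c = 6000:
c, b0, b1 = 6000.0, fl(1.0/3.0), a1                 # b0 = 0.33333
def p_shifted(x):
    return fl(b0 + fl(b1 * fl(x - c)))

print(p_power(6000.0), p_power(6001.0))      # 0.3  -0.7  (one correct digit)
print(p_shifted(6000.0), p_shifted(6001.0))  # 0.33333  -0.66667
```

The shifted form keeps full five-digit accuracy because the large common part of the abscissas is removed before any cancellation can occur.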

The coefficients in the shifted power form (2.2) provide derivative values, i.e.,

ai = p^(i)(c)/i!        i = 0, . . . , n

if p(x) is given by (2.2). In effect, the shifted power form provides the

Taylor expansion for p(x) around the center c.

A further generalization of the shifted power form is the Newton form

(2.3)    p(x) = a0 + a1(x - c1) + a2(x - c1)(x - c2) + · · · + an(x - c1)(x - c2) · · · (x - cn)


This form plays a major role in the construction of an interpolating

polynomial. It reduces to the shifted power form if the centers c1, . . . , cn all equal c, and to the power form if the centers c1, . . . , cn all equal zero.

The following discussion on the evaluation of the Newton form therefore

applies directly to these simpler forms as well.

It is inefficient to evaluate each of the n + 1 terms in (2.3) separately

and then sum. This would take n + n(n + 1)/2 additions and n(n + 1)/2

multiplications. Instead, one notices that the factor (x - c1) occurs in all terms but the first; that is,

p(x) = a0 + (x - c1){a1 + a2(x - c2) + · · · + an(x - c2) · · · (x - cn)}

Again, each term between the braces but the first contains the factor (x - c2); that is,

p(x) = a0 + (x - c1){a1 + (x - c2)[a2 + a3(x - c3) + · · · + an(x - c3) · · · (x - cn)]}

Continuing in this manner, we obtain p(x) in nested form:

p(x) = a0 + (x - c1){a1 + (x - c2)[a2 + · · · + (x - cn-1)[an-1 + (x - cn)an] · · · ]}

whose evaluation for any particular value of x takes 2n additions and n

multiplications. If, for example, p(x) = 1 + 2(x - 1) + 3(x - 1)(x - 2)

+ 4(x - 1)(x - 2)(x - 3), and we wish to compute p(4), then we calculate as follows:

p(4) = 1 + (4 - 1){2 + (4 - 2)[3 + (4 - 3)4]} = 1 + 3{2 + 2[7]} = 1 + 3(16) = 49

This procedure is formalized in the following algorithm.

Algorithm 2.1: Nested multiplication for the Newton form Given the

n + 1 coefficients a0, . . . , an, for the Newton form (2.3) of the polynomial p(x), together with the centers c1 , . . . , cn . Given also the

number z.

Then, with

bn := an
bk := ak + (z - ck+1)bk+1        k = n - 1, n - 2, . . . , 0

we have p(z) = b0.

Moreover, the auxiliary quantities b1, . . . , bn generated along the way are of independent interest. For, we have

(2.4)    p(x) = b0 + b1(x - z) + b2(x - z)(x - c1) + · · · + bn(x - z)(x - c1) · · · (x - cn-1)

i.e., b0, b1, . . . , bn are also coefficients in the Newton form for p(x), but with centers z, c1, c2, . . . , cn-1.

We prove the assertion (2.4). From the algorithm,

ak = bk - (z - ck+1)bk+1        k = 0, . . . , n - 1

while an = bn. Substituting these expressions into (2.3) and observing that (x - ck+1) - (z - ck+1) = x - z, we get

p(x) = b0 + (x - z)[b1 + b2(x - c1) + · · · + bn(x - c1) · · · (x - cn-1)]

which proves (2.4).
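Algorithm 2.1 and the assertion (2.4) can be sketched as follows in Python (the book's own programs are in FORTRAN; the function name nested_mult is ours):

```python
def nested_mult(a, c, z):
    """Algorithm 2.1: evaluate the Newton form
         p(x) = a[0] + a[1](x-c[0]) + ... + a[n](x-c[0])...(x-c[n-1])
    at x = z.  Returns (p(z), b), where by (2.4) the b[k] are the
    coefficients of the Newton form for p with centers z, c[0], ..., c[n-2]."""
    n = len(a) - 1
    b = a[:]                        # b[n] = a[n]
    for k in range(n - 1, -1, -1):
        b[k] = a[k] + (z - c[k]) * b[k + 1]
    return b[0], b

# p(x) = 1 + 2(x-1) + 3(x-1)(x-2) + 4(x-1)(x-2)(x-3), as in the text:
value, b = nested_mult([1, 2, 3, 4], [1, 2, 3], 4.0)
print(value)                        # 49.0, matching the worked example
```

As a check on (2.4), evaluating the returned coefficients b as a Newton form with centers 4, 1, 2 at any point reproduces the original polynomial.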

Aside from producing the value of the polynomial (2.3) at any particular point z economically, the nested multiplication algorithm is useful in

changing from one Newton form to another. Suppose, for example, that we

wish to express the polynomial

in terms of powers of x, that is, in the Newton form with all centers equal

to zero. Then, applying Algorithm 2.1 with z = 0 (and n = 2), we get

Hence


Applying Algorithm 2.1 to this polynomial, again with z = 0, gives

Therefore

In this simple example, we can verify this result quickly by multiplying out

the terms in the original expression.
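The same repeated-application idea can be sketched in Python. Since the displayed polynomial of this passage is not reproduced here, the sketch reuses the cubic from the earlier p(4) example; newton_to_power is our name for the conversion:

```python
def newton_to_power(a, c):
    """Convert a Newton form with coefficients a[0..n] and centers
    c[0..n-1] to power form by n passes of Algorithm 2.1 with z = 0,
    as described in the text.  Works on a copy, in place."""
    a = list(a)
    n = len(a) - 1
    for j in range(n):                  # pass j fixes the coefficient of x**j
        for k in range(n - 1, j - 1, -1):
            a[k] = a[k] + (0 - c[k - j]) * a[k + 1]
    return a                            # a[j] is the coefficient of x**j

# The cubic used earlier: 1 + 2(x-1) + 3(x-1)(x-2) + 4(x-1)(x-2)(x-3)
print(newton_to_power([1, 2, 3, 4], [1, 2, 3]))   # [-19, 37, -21, 4]
```

Each pass strips off one more power of x, exactly as in the two-step hand computation above; multiplying out the factors by hand confirms that the cubic equals -19 + 37x - 21x^2 + 4x^3.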

Repeated applications of the Nested Multiplication algorithm are

useful in the evaluation of derivatives of a polynomial given in Newton

form (see Exercises 2.1-2 through 2.1-5). The algorithm is also helpful in

establishing the following basic fact.

Lemma 2.1 If z1, . . . , zk are distinct zeros of the polynomial p(x), then

p(x) = (x - z1)(x - z2) · · · (x - zk)r(x)

for some polynomial r(x).

To prove this lemma, we write p(x) in power form (2.1), i.e., in Newton

form with all centers equal to zero, and then apply Algorithm 2.1 once, to get

p(x) = p(z) + (x - z)q(x)

with q(x) = b1 + b2x + · · · + bnx^(n-1) a polynomial of degree < n [since, by (2.4), b0 = p(z)]. In effect, we have divided p(x) by the linear polynomial (x - z); q(x) is the quotient polynomial and the number p(z) is the remainder. Now pick specifically z = z1. Then, by assumption, p(z1) = 0, i.e.,

p(x) = (x - z1)q(x)

This finishes the proof in case k = 1. Further, for k > 1, it follows that

z2, . . . , zk are necessarily zeros of q(x), since p(x) vanishes at these points

while the linear polynomial x - z1 does not, by assumption. Hence,

induction on the number k of zeros may now be used to complete the

proof.


Corollary If p(x) and q(x) are two polynomials of degree < k which

agree at the k + 1 distinct points z0, . . . , zk, then p(x) = q(x) identically.

Indeed, their difference d(x) = p(x) - q(x) is then a polynomial of degree < k, and can, by Lemma 2.1, be written in the form

d(x) = (x - z0)(x - z1) · · · (x - zk)r(x)

with r(x) some polynomial. Suppose that r(x) is not identically zero, say r(x) = c0 + c1x + · · · + cmx^m for some coefficients c0, . . . , cm with cm ≠ 0. Then d(x) is a polynomial of exact degree k + 1 + m > k, which is nonsense. Hence, r(x) = 0 identically, and so p(x) = q(x).

This corollary gives the answer, “At most one,” to the question “How

many polynomials of degree < k are there which take on specified values

at k + 1 specified points?”

These considerations concerning zeros of polynomials can be refined

through the notion of multiplicity of a zero. This will be of importance to

us later on, in the discussion of osculatory interpolation. We say that the point z is a zero of (exact) multiplicity j, or of order j, of the function f(x) provided

f(z) = f'(z) = · · · = f^(j-1)(z) = 0 ≠ f^(j)(z)

Example For instance, the polynomial

p(x) = (x - z)^j

has a zero of multiplicity j at z. It is reasonable to count such a zero j times since it can be thought of as the limiting case of the polynomial

(x - z1)(x - z2) · · · (x - zj)

with j distinct, or simple, zeros as all these zeros come together, or coalesce, at z.

As another example, for a constant c slightly less than 1, the function sin x - cx has three (simple) zeros in an interval around 0, which converge to the number 0 as c approaches 1. Correspondingly, the (limiting) function sin x - x has a triple zero at 0.

With this notion of multiplicity of a zero, Lemma 2.1 can be

strengthened as follows.

Lemma 2.2 If z1, . . . , zk is a sequence of zeros of the polynomial p(x) counting multiplicity, then

p(x) = (x - z1)(x - z2) · · · (x - zk)r(x)

for some polynomial r(x).

See Exercise 2.1-6 for a proof of this lemma. Note that the number z

could occur in the sequence z1, . . . , zk as many as j times in case z is a

zero of p(x) of order j.


From Lemma 2.2, we get by the earlier argument the

Corollary If p(x) and q(x) are two polynomials of degree < k which

agree at k + 1 points z0, . . . , zk in the sense that their difference r(x) = p(x) - q(x) has the k + 1 zeros z0, . . . , zk (counting multiplicity), then p(x) = q(x) identically.

EXERCISES

2.1-1 Evaluate the cubic polynomial

Then use nested multiplication to obtain p(x) in power form, and evaluate that power form at

x = 314.15. Compare!

2.1-2 Let p(x) = a0 + a1(x - c1) + · · · + an(x - c1) · · · (x - cn) be a polynomial in Newton form. Prove: If c1 = c2 = · · · = cr+1, then p^(j)(c1) = j!aj, j = 0, . . . , r. [Hint: Under these conditions, p(x) can be written

p(x) = a0 + a1(x - c1) + · · · + ar(x - c1)^r + (x - c1)^(r+1)q(x)

with q(x) some polynomial. Now differentiate.]

2.1-3 Find the first derivative of

at x = 2. [Hint: Apply Algorithm 2.1 twice to obtain the Newton form for p(x) with centers 2,

2, 1, - 1; then use Exercise 2.1-2.]

2.1-4 Find also the second derivative of the polynomial p(x) of Exercise 2.1-3 at x = 2.

2.1-5 Find the Taylor expansion around c = 3 for the polynomial of Exercise 2.1-3. [Hint:

The Taylor expansion for a polynomial around a point c is just the Newton form for this

polynomial with centers c, c, c, c, . . . .]

2.1-6 Prove Lemma 2.2. [Hint: By Algorithm 2.1, p(x) = (x - z1)q(x). Now, to finish the proof by induction on the number k of zeros in the given sequence, prove that z2, . . . , zk is necessarily a sequence of zeros (counting multiplicity) of q(x). For this, assume that the number z occurs exactly j times in the sequence z2, . . . , zk and distinguish the cases z = z1 and z ≠ z1. Also, use the fact that p^(j)(x) = (x - z1)q^(j)(x) + jq^(j-1)(x).]

2.1-7 Prove that, in the language of the corollary to Lemma 2.2, the Taylor polynomial

pj(x) = f(a) + f'(a)(x - a) + · · · + f^(j-1)(a)(x - a)^(j-1)/(j - 1)!

agrees with the function f(x) j-fold at the point x = a (i.e., a is a j-fold zero of their difference).

2.1-8 Suppose someone gives you a FUNCTION F(X) which supposedly returns the value at

X of a specific polynomial of degree < r. Suppose further that, on inspection, you find that

the routine does indeed return the value of some polynomial of degree < r (e.g., you find only

additions/subtractions and multiplications involving X and numerical constants in that

subprogram, with X appearing as a factor less than r times). How many function values

would you have to check before you could be sure that the routine does indeed do what it is

supposed to do (assuming no rounding errors in the calculation)?

2.1-9 For each of the following power series, exploit the idea of nested multiplication to find

an efficient way for their evaluation. (You will have to assume, of course, that they are to be

summed only over n < N, for some a priori given N.)



2.2 EXISTENCE AND UNIQUENESS OF THE

INTERPOLATING POLYNOMIAL

Let x0, x1, . . . , xn be n + 1 distinct points on the real axis and let f(x) be a

real-valued function defined on some interval I = [a,b] containing these

points. We wish to construct a polynomial p(x) of degree < n which

interpolates f(x) at the points x0, . . . , xn, that is, satisfies

p(xi) = f(xi)        i = 0, . . . , n

As we will see, there are many ways to write down such a polynomial.

It is therefore important to remind the reader at the outset that, by the

corollary to Lemma 2.1, there is at most one polynomial of degree < n which

interpolates f(x) at the n + 1 distinct points x0, . . . , xn.

Next we show that there is at least one polynomial of degree < n which

interpolates f(x) at the n + 1 distinct points x0, x1, . . . , xn. For this, we

employ yet another polynomial form, the Lagrange form

(2.5)    p(x) = a0l0(x) + a1l1(x) + · · · + anln(x)

with

(2.6)    lk(x) = [(x - x0) · · · (x - xk-1)(x - xk+1) · · · (x - xn)]/[(xk - x0) · · · (xk - xk-1)(xk - xk+1) · · · (xk - xn)]

the Lagrange polynomials for the points x0, . . . , xn. The function lk(x) is

the product of n linear factors, hence a polynomial of exact degree n.

Therefore, (2.5) does indeed describe a polynomial of degree < n. Further,

lk(x) vanishes at xi for all i ≠ k and takes the value 1 at xk, i.e.,

lk(xi) = 1 if i = k,    lk(xi) = 0 if i ≠ k

This shows that p(xk) = ak, k = 0, . . . , n;

i.e., the coefficients a0, . . . , an in the Lagrange form are simply the values

of the polynomial p(x) at the points x0 , . . . , xn . Consequently, for an

arbitrary function f(x),

(2.7)    pn(x) = f(x0)l0(x) + f(x1)l1(x) + · · · + f(xn)ln(x)

is a polynomial of degree < n which interpolates f(x) at x0, . . . , xn. This

establishes the following theorem.

Theorem 2.1 Given a real-valued function f(x) and n + 1 distinct

points x0, . . . , xn, there exists exactly one polynomial of degree < n

which interpolates f(x) at x0, . . . , xn.


Equation (2.7) is called the Lagrange formula for the interpolating

polynomial.

As a simple application, we consider the case n = 1; i.e., we are given

f(x) and two distinct points x0, x1. Then

l0(x) = (x - x1)/(x0 - x1)        l1(x) = (x - x0)/(x1 - x0)

and

p1(x) = f(x0)(x - x1)/(x0 - x1) + f(x1)(x - x0)/(x1 - x0)
      = f(x0) + (x - x0)[f(x1) - f(x0)]/(x1 - x0)

This is the familiar case of linear interpolation written in some of its many

equivalent forms.
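A direct Python transcription of (2.6) and (2.7) may make the construction concrete (the function name is ours; this is a sketch, not the book's code):

```python
def lagrange_eval(xs, fs, x):
    """Evaluate the Lagrange form (2.7), the sum of fs[k] * l_k(x),
    with l_k the Lagrange polynomial (2.6) for the points xs."""
    total = 0.0
    for k, (xk, fk) in enumerate(zip(xs, fs)):
        lk = 1.0
        for i, xi in enumerate(xs):
            if i != k:
                lk *= (x - xi) / (xk - xi)
        total += fk * lk
    return total

# Linear case: interpolating at x0 = 1, x1 = 3 and evaluating midway
# gives the average of the two ordinates:
print(lagrange_eval([1.0, 3.0], [10.0, 20.0], 2.0))   # 15.0
```

Note the cost: each evaluation recomputes every lk(x) from scratch, which is the practical objection to the Lagrange form raised below.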

Example 2.2 An integral related to the complete elliptic integral is defined by

(2.8)    K(k) = integral from 0 to π/2 of (1 - sin^2 k sin^2 θ)^(-1/2) dθ

From a table of values of these integrals we find that, for various values of k measured in degrees,

k        1         4         6
K(k)     1.5709    1.5727    1.5751

Find K(3.5), using a second-degree interpolating polynomial.
We have

p2(3.5) = 1.5709 l0(3.5) + 1.5727 l1(3.5) + 1.5751 l2(3.5)

with

l0(3.5) = (3.5 - 4)(3.5 - 6)/[(1 - 4)(1 - 6)] = 1.25/15
l1(3.5) = (3.5 - 1)(3.5 - 6)/[(4 - 1)(4 - 6)] = 6.25/6
l2(3.5) = (3.5 - 1)(3.5 - 4)/[(6 - 1)(6 - 4)] = -1.25/10

Then

K(3.5) ≈ p2(3.5) = 1.5722

This approximation is in error in the last place.

The Lagrange form (2.7) for the interpolating polynomial makes it

easy to show the existence of an interpolating polynomial. But its evaluation at a point x takes at least 2(n + 1) multiplications/divisions and

(2n + 1) additions and subtractions after the denominators of the

Lagrange polynomials have been calculated once and for all and divided

into the corresponding function values. This is to be compared with n

multiplications and n additions necessary for the evaluation of a polynomial of degree n in power form by nested multiplication (see Algorithm

2.1).


A more serious objection to the Lagrange form arises as follows: In

practice, one is often uncertain as to how many interpolation points to use.

Hence, with pj(x) denoting the polynomial of degree < j which interpolates f(x) at x0, . . . , xj, one calculates p0(x), p1(x), p2(x), . . . , increasing the number of interpolation points, and hence the degree of the interpolating polynomial, until, so one hopes, a satisfactory approximation pk(x) to f(x) has been found. In such a process, use of the Lagrange form seems wasteful since, in calculating pk(x), no obvious advantage can be taken of the fact that one already has pk-1(x) available. For this purpose and others, the Newton form of the interpolating polynomial is much better suited.

Indeed, write the interpolating polynomial pn(x) in its Newton form, using the interpolation points x0, . . . , xn-1 as centers, i.e.,

(2.9)    pn(x) = A0 + A1(x - x0) + A2(x - x0)(x - x1) + · · · + An(x - x0)(x - x1) · · · (x - xn-1)

For any integer k between 0 and n, let qk(x) be the sum of the first k + 1 terms in this form,

qk(x) = A0 + A1(x - x0) + · · · + Ak(x - x0) · · · (x - xk-1)

Then every one of the remaining terms in (2.9) has the factor (x - x0) · · · (x - xk), and we can write (2.9) in the form

pn(x) = qk(x) + (x - x0) · · · (x - xk)r(x)

for some polynomial r(x) of no further interest. The point is that this last term (x - x0) · · · (x - xk)r(x) vanishes at the points x0, . . . , xk, hence qk(x) itself must already interpolate f(x) at x0, . . . , xk [since pn(x) does]. Since qk(x) is also a polynomial of degree < k, it follows that qk(x) = pk(x); i.e., qk(x) must be the unique polynomial of degree < k which interpolates f(x) at x0, . . . , xk.

This shows that the Newton form (2.9) for the interpolating polynomial pn(x) can be built up step by step as one constructs the sequence

p0(x), p1(x), p2(x), . . . , with pk(x) obtained from pk-1(x) by addition of the next term in the Newton form (2.9), i.e.,

pk(x) = pk-1(x) + Ak(x - x0) · · · (x - xk-1)

It also shows that the coefficient Ak in the Newton form (2.9) for the interpolating polynomial is the leading coefficient, i.e., the coefficient of x^k, in the polynomial pk(x) of degree < k which agrees with f(x) at x0, . . . , xk. This coefficient depends only on the values of f(x) at the points x0, . . . , xk; it is called the kth divided difference of f(x) at the points x0, . . . , xk (for reasons given in the next section) and is denoted by

f[x0, . . . , xk]

With this definition, we arrive at the Newton formula for the interpolating polynomial

pn(x) = f[x0] + f[x0, x1](x - x0) + f[x0, x1, x2](x - x0)(x - x1) + · · · + f[x0, . . . , xn](x - x0) · · · (x - xn-1)

This can be written more compactly as

(2.10)    pn(x) = Σi=0..n f[x0, . . . , xi] Πj=0..i-1 (x - xj)

if we make use of the convention that

Πj=0..-1 (x - xj) = 1

For n = 1, (2.10) reads

p1(x) = f[x0] + f[x0, x1](x - x0)

and comparison with the formula

p1(x) = f(x0) + (x - x0)[f(x1) - f(x0)]/(x1 - x0)

obtained earlier therefore shows that

(2.11)    f[x0] = f(x0)        f[x0, x1] = [f(x1) - f(x0)]/(x1 - x0)

The first divided difference, at any rate, is a ratio of differences.

EXERCISES

2.2-1 Prove that f[x0, . . . , xn] = Σi=0..n f(xi)/[(xi - x0) · · · (xi - xi-1)(xi - xi+1) · · · (xi - xn)]. [Hint: Find the leading coefficient of the polynomial (2.7).]

2.2-2 Calculate the limit of the formula for f[x0, . . . , xn] given in Exercise 2.2-1 as two of the interpolation points coalesce while all other points remain fixed.

2.2-3 Prove that the polynomial of degree < n which interpolates f(x) at n + 1 distinct

points is f(x) itself in case f(x) is a polynomial of degree < n.

2.2-4 Prove that the kth divided difference p[x0, . . . , xk] of a polynomial p(x) of degree < k

is independent of the interpolation points x0, x1, . . . , xk.

2.2-5 Prove that the kth divided difference of a polynomial of degree < k is 0.

2.3 THE DIVIDED-DIFFERENCE TABLE

Higher-order divided differences may be constructed by the formula

(2.12)    f[x0, . . . , xk] = [f[x1, . . . , xk] - f[x0, . . . , xk-1]]/(xk - x0)

whose validity may be established as follows.


Let pi(x) be the polynomial of degree < i which agrees with f(x) at

x0, . . . , xi , as before, and let qk-1(x) be the polynomial of degree < k - 1

which agrees with f(x) at the points x1, . . . , xk. Then

(2.13)    p(x) = pk-1(x) + [(x - x0)/(xk - x0)][qk-1(x) - pk-1(x)]

is a polynomial of degree < k, and one checks easily that p(xi ) = f(xi ),

i = 0, . . . , k. Consequently, by the uniqueness of the interpolating polynomial, we must have p(x) = pk(x). Therefore

f[x0, . . . , xk] = leading coefficient of pk(x)        (by definition)
                  = [leading coefficient of qk-1(x) - leading coefficient of pk-1(x)]/(xk - x0)        (by (2.13))
                  = [f[x1, . . . , xk] - f[x0, . . . , xk-1]]/(xk - x0)        (by definition)

which proves the important formula (2.12).
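(2.12) translates directly into a short recursive sketch (Python; divdiff is our name, and a real program would use the in-place table of Algorithm 2.2 instead of recomputing subproblems):

```python
def divdiff(xs, fs):
    """kth divided difference f[x_0, ..., x_k] by the recursion (2.12);
    fs[i] = f(xs[i]), with distinct points xs."""
    if len(xs) == 1:
        return fs[0]
    return (divdiff(xs[1:], fs[1:]) - divdiff(xs[:-1], fs[:-1])) / (xs[-1] - xs[0])

# The kth divided difference of x**k is its leading coefficient, 1,
# no matter which k+1 distinct points are used (Exercise 2.2-4):
xs = [0.0, 1.0, 3.0, 4.0]
print(divdiff(xs, [x**3 for x in xs]))   # 1.0
```

The check printed above is the content of Exercise 2.2-4: the kth divided difference of a polynomial of degree k is its leading coefficient, independent of the points.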

Example 2.3 Solve Example 2.2 using the Newton formula.

In this example, we have to determine the polynomial p2(x) of degree < 2 which satisfies

p2(1) = 1.5709        p2(4) = 1.5727        p2(6) = 1.5751

By (2.11) we can calculate

K[x0, x1] = (1.5727 - 1.5709)/(4 - 1) = 0.0006        K[x1, x2] = (1.5751 - 1.5727)/(6 - 4) = 0.0012

Therefore, by (2.12),

K[x0, x1, x2] = (0.0012 - 0.0006)/(6 - 1) = 0.00012

and (2.10) now gives

p2(x) = 1.5709 + 0.0006(x - 1) + 0.00012(x - 1)(x - 4)

Substituting into this the value x = 3.5, we obtain

p2(3.5) = 1.5709 + 0.0006(2.5) + 0.00012(2.5)(-0.5) = 1.5722

which agrees with the result obtained in Example 2.2.

Equation (2.12) shows the kth divided difference to be a difference

quotient of (k - 1)st divided differences, justifying their name. Equation

(2.12) also allows us to generate all the divided differences needed for the

Newton formula (2.10) in a simple manner with the aid of a so-called

divided-difference table.


Such a table is depicted in Fig. 2.1, for n = 4.

The entries in the table are calculated, for example, column by

column, according to the following algorithm.

Algorithm 2.2: Divided-difference table Given the first two columns of the table, containing x0, x1, . . . , xn and, correspondingly, f[x0], f[x1], . . . , f[xn]. For k = 1, 2, . . . , n, fill in the (k + 2)nd column by

f[xi, . . . , xi+k] = [f[xi+1, . . . , xi+k] - f[xi, . . . , xi+k-1]]/(xi+k - xi)        i = 0, . . . , n - k

If this algorithm is carried out by hand, the following directions might

be helpful. Draw the two diagonals from the entry to be calculated through

its two neighboring entries to the left. If these lines terminate at f[xi] and

f[xj], respectively, divide the difference of the two neighboring entries by

the corresponding difference x j - x i to get the desired entry. This is

illustrated in Fig. 2.1 for the entry f[x1, . . . , x4].

When the divided-difference table is filled out, the coefficients

f[x0, . . . , xi ], i = 0, . . . , n, for the Newton formula (2.10) can be found

at the head of their respective columns.

For reasons of storage requirements, and because the DO variables in

many FORTRAN dialects can only increase, one would use a somewhat

modified version of Algorithm 2.2 in a FORTRAN program. First, for the

evaluation of the Newton form according to Algorithm 2.1, it is more

convenient to use the form

pn(x) = f[xn] + f[xn-1, xn](x - xn) + f[xn-2, xn-1, xn](x - xn)(x - xn-1) + · · · + f[x0, . . . , xn](x - xn) · · · (x - x1)

Figure 2.1 Divided-difference table.


i.e., to use the Newton formula with centers xn, xn-1, . . . , x1. For then the value pn(z) can be calculated, according to Algorithm 2.1, by

p := f[x0, . . . , xn]
p := f[xk, . . . , xn] + (z - xk)p        k = 1, 2, . . . , n (in this order)

Second, since we are then only interested in the numbers f[xi , . . . , xn],

i = 0, . . . , n, it is not necessary to store the entire divided-difference table

(requiring a two-dimensional array in which roughly half the entries would

not be used anyway, because of the triangular character of the divided-difference table). For if we use the abbreviation

dik = f[xi, . . . , xi+k]

then the calculations of Algorithm 2.2 read

dik = (di+1,k-1 - di,k-1)/(xi+k - xi)        k = 1, . . . , n;  i = 0, . . . , n - k

In particular, the number di,k-1 is not used any further once dik has been calculated, so that we can safely store dik over di,k-1.

Algorithm 2.3: Calculation of the coefficients for the Newton formula Given the n + 1 distinct points x0, . . . , xn, and, correspondingly, the numbers f(x0), . . . , f(xn), with f(xi) stored in di, i = 0, . . . , n.

For k = 1, 2, . . . , n, do:
    For i = 0, 1, . . . , n - k, do:
        di := (di+1 - di)/(xi+k - xi)

Then di = f[xi, . . . , xn], i = 0, . . . , n.
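A Python sketch of Algorithm 2.3, together with the nested evaluation that uses the centers xn, . . . , x1, might read (function names ours):

```python
def newton_coeffs(x, f):
    """Algorithm 2.3: overwrite d[i] = f(x[i]) with the divided
    differences f[x_i, ..., x_n], using a single linear array."""
    n = len(x) - 1
    d = list(f)
    for k in range(1, n + 1):
        for i in range(n - k + 1):
            d[i] = (d[i + 1] - d[i]) / (x[i + k] - x[i])
    return d

def newton_eval(x, d, z):
    """Algorithm 2.1 applied to the Newton form with centers x_n, ..., x_1."""
    p = d[0]
    for k in range(1, len(d)):
        p = d[k] + (z - x[k]) * p
    return p

# The interpolant reproduces a polynomial of degree <= n exactly:
xs = [0.0, 1.0, 2.0, 4.0]
d = newton_coeffs(xs, [x**2 - 3*x for x in xs])
print(newton_eval(xs, d, 3.0))   # 0.0, since 3**2 - 3*3 = 0
```

Only one array of length n + 1 is needed, exactly as in the FORTRAN program for Example 2.4 below.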

Example 2.4

Let f(x) = (1 + x^2)^(-1). For n = 2, 4, . . . , 16, calculate the polynomial pn(x) of degree < n which interpolates f(x) at the n + 1 equally spaced points

xi = -5 + 10i/n        i = 0, . . . , n

Then estimate the maximum interpolation error

max over -5 < x < 5 of |f(x) - pn(x)|

on the interval [-5, 5] by computing

errmax = max over j of |f(yj) - pn(yj)|

where

yj = -5 + (j - 1)/10        j = 1, . . . , 101

The FORTRAN program below uses Algorithms 2.1 and 2.3 to solve this problem.

FORTRAN PROGRAM FOR EXAMPLE 2.4

C  PROGRAM FOR EXAMPLE 2.4
      INTEGER I,J,K,N,NP1
      REAL D(17),ERRMAX,H,PNOFY,X(17),Y
C  POLYNOMIAL INTERPOLATION AT EQUALLY SPACED POINTS TO THE FUNCTION
      F(Y) = 1./(1. + Y*Y)
C
      PRINT 600
  600 FORMAT('1   N',5X,'MAXIMUM ERROR')
      DO 40 N=2,16,2
         NP1 = N+1
         H = 10./FLOAT(N)
         DO 10 I=1,NP1
            X(I) = FLOAT(I-1)*H - 5.
            D(I) = F(X(I))
   10    CONTINUE
C                  CALCULATE DIVIDED DIFFERENCES BY ALGORITHM 2.3
         DO 20 K=1,N
         DO 20 I=1,NP1-K
            D(I) = (D(I+1) - D(I))/(X(I+K) - X(I))
   20    CONTINUE
C                  ESTIMATE MAXIMUM INTERPOLATION ERROR ON (-5,5)
         ERRMAX = 0.
         DO 30 J=1,101
            Y = FLOAT(J-1)/10. - 5.
C                  CALCULATE PN(Y) BY ALGORITHM 2.1
            PNOFY = D(1)
            DO 29 K=2,NP1
               PNOFY = D(K) + (Y - X(K))*PNOFY
   29       CONTINUE
            ERRMAX = MAX(ABS(F(Y) - PNOFY),ERRMAX)
   30    CONTINUE
         PRINT 630, N,ERRMAX
  630    FORMAT(I5,E18.7)
   40 CONTINUE
      STOP
      END

COMPUTER OUTPUT FOR EXAMPLE 2.4

    N     MAXIMUM ERROR
    2     6.4615385E-01
    4     4.3813387E-01
    6     6.1666759E-01
    8     1.0451739E+00
   10     1.9156431E+00
   12     3.6052745E+00
   14     7.1920080E+00
   16     1.4051542E+01

Note how the interpolation error soon increases with increasing degree even though we use

more and more information about the function f(x) in our interpolation process. This is

because we have used uniformly spaced interpolation points; see Exercise 6.1-12 and Eq.

(6.20).
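The growth of the error is easy to reproduce. The Python sketch below redoes Example 2.4 in double precision (so the numbers differ slightly from the single-precision table above, but the behavior is the same):

```python
def max_err(n):
    """Redo Example 2.4: interpolate f(x) = 1/(1 + x^2) at n+1 equally
    spaced points on [-5, 5] (Algorithm 2.3), then take the largest
    error over the grid y = -5, -4.9, ..., 5 (Algorithm 2.1)."""
    f = lambda x: 1.0 / (1.0 + x * x)
    h = 10.0 / n
    xs = [i * h - 5.0 for i in range(n + 1)]
    d = [f(x) for x in xs]
    for k in range(1, n + 1):                 # divided differences, in place
        for i in range(n - k + 1):
            d[i] = (d[i + 1] - d[i]) / (xs[i + k] - xs[i])
    err = 0.0
    for j in range(101):
        y = j / 10.0 - 5.0
        p = d[0]                              # nested multiplication
        for k in range(1, n + 1):
            p = d[k] + (y - xs[k]) * p
        err = max(err, abs(f(y) - p))
    return err

for n in (2, 4, 8, 16):
    print(n, max_err(n))    # the error grows with n, as in the table above
```

The run confirms that raising the degree with uniformly spaced points makes matters worse for this function.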


EXERCISES

2.3-1 From a table of logarithms we obtain the following values of log x at the indicated tabular points.

x        1.0    1.5        2.0        3.0        3.5        4.0
log x    0.0    0.17609    0.30103    0.47712    0.54407    0.60206

Form a divided-difference table based on these values.

2.3-2 Using the divided-difference table in Exercise 2.3-1, interpolate for the following

values: log 2.5, log 1.25, log 3.25. Use a third-degree interpolating polynomial in its Newton

form.

2.3-3 Estimate the error in the result obtained for log 2.5 in Exercise 2.3-2 by computing the

next term in the interpolating polynomial. Also estimate it by comparing the approximation

for log 2.5 with the sum of log 2 and the approximation for log 1.25.

2.3-4 Derive the formula

Then use it to interpret the Nested Multiplication Algorithm 2.1, applied to the polynomial

(2.10), as a way to calculate p[z, x0, . . . , xn-1], p[z, x0, . . . , xn-2], . . . , p[z, x0] and p[z], i.e.,

as a way to get another diagonal in the divided difference table for p(x).

2.3-5 By Exercise 2.2-3, the polynomial of degree < k which interpolates a function f(x) at

x0, . . . , xk is f(x) itself if f(x) is a polynomial of degree < k. This fact may be used to check

the accuracy of the computed interpolating polynomial. Adapt the FORTRAN program given

in Example 2.4 to carry out such a check as follows: For n = 4, 8, 12, . . . , 32, find the

polynomial pn(x) of degree < n which interpolates the function at 0, 1, 2, . . . , n. Then estimate

max over i of |f(yi) - pn(yi)|

where the yi's are a suitably large number of points in [0, n].

2.3-6 Prove that the first derivative p'2(x) of the parabola interpolating f(x) at x0 < x1 < x2 is equal to the straight line which takes on the value f[xi-1, xi] at the point (xi-1 + xi)/2, for i = 1, 2. Generalize this to describe p'n(x) as the interpolant to appropriate data in case pn(x) interpolates f(x) at x0 < x1 < · · · < xn.

*2.4 INTERPOLATION AT AN INCREASING NUMBER OF

INTERPOLATION POINTS

Consider now the problem of estimating f(x) at a point x̄ using polynomial interpolation at distinct points x0, x1, x2, . . . . With pk(x) the polynomial of degree < k which interpolates f(x) at x0, . . . , xk, we calculate successively p0(x̄), p1(x̄), p2(x̄), . . . , until, so we hope, the difference between pk(x̄) and pk+1(x̄) is sufficiently small. The Newton form for the


interpolating polynomial

pk+1(x̄) = pk(x̄) + f[x0, . . . , xk+1]ψk(x̄)

with

ψk(x̄) = (x̄ - x0)(x̄ - x1) · · · (x̄ - xk)

is expressly designed for such calculations. If we know pk(x̄), ψk-1(x̄), and f[x0, . . . , xk+1], then we can calculate

ψk(x̄) = ψk-1(x̄)(x̄ - xk)

and

pk+1(x̄) = pk(x̄) + f[x0, . . . , xk+1]ψk(x̄)

Algorithm 2.4: Interpolation using an increasing number of interpolation points Given distinct points x0, x1, x2, . . . and the values f(x0), f(x1), f(x2), . . . of a function f(x) at these points. Also, given a point x̄. Set p0(x̄) := f[x0] := f(x0).

For k = 0, 1, 2, . . . , until satisfied, do:
    Calculate the divided differences f[xk+1], f[xk, xk+1], . . . , f[x0, . . . , xk+1]
    Calculate ψk(x̄) := (x̄ - x0) · · · (x̄ - xk)
    Calculate pk+1(x̄) := pk(x̄) + f[x0, . . . , xk+1]ψk(x̄)

This algorithm generates the entries of the divided-difference table for f(x) at x0, x1, x2, . . . a diagonal at a time. During the calculation of pk+1(x̄), the upward diagonal emanating from f[xk+1] is calculated up to and including the number f[x0, . . . , xk+1], using the number f[xk+1] = f(xk+1) and the previously calculated entries f[xk], f[xk-1, xk], . . . , f[x0, . . . , xk] in the preceding diagonal. Hence, even if only the most recently calculated diagonal is saved (in a FORTRAN program, say), the algorithm provides incidentally the requisite coefficients for the Newton form for pk+1(x) with centers xk+1, . . . , x1:

(2.14)    pk+1(x) = f[xk+1] + f[xk, xk+1](x - xk+1) + · · · + f[x0, . . . , xk+1](x - xk+1)(x - xk) · · · (x - x1)
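A Python sketch of this diagonal-at-a-time process (names ours; the book's own implementation is the FORTRAN function TABLE below) might read:

```python
def interp_increasing(xs, fs, xbar, tol=5e-5):
    """Algorithm 2.4 sketch: bring in one interpolation point at a time,
    keeping only the newest upward diagonal a[0..k] of the divided-
    difference table, a[j] = f[x_j, ..., x_k].  Stop when the latest
    Newton-form correction is at most tol."""
    xk = [xs[0]]
    a = [fs[0]]
    p = fs[0]
    psi = 1.0                   # (xbar - x_0) ... (xbar - x_{k-1})
    for x, fx in zip(xs[1:], fs[1:]):
        xk.append(x)
        a.append(fx)
        for j in range(len(a) - 2, -1, -1):   # new diagonal, by (2.12)
            a[j] = (a[j + 1] - a[j]) / (xk[-1] - xk[j])
        psi *= xbar - xk[-2]
        corr = a[0] * psi       # f[x_0, ..., x_{k+1}] * psi_k(xbar)
        p += corr
        if abs(corr) <= tol:
            break
    return p

# The data of Example 2.5:
print(interp_increasing([1.0, 4.0, 6.0], [1.5709, 1.5727, 1.5751], 3.5))
# about 1.5722
```

Each step costs one new diagonal plus one multiply-add, and the points may arrive in any order, which is what lets TABLE pick the tabular entry nearest XBAR each time.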

Example 2.5 We apply Algorithm 2.4 to the problem of Examples 2.2 and 2.3, using

x0 = 1, x1 = 4, x2 = 6, and in addition, x3 = 0. For this example, x̄ = 3.5. We get p0(3.5) = K[x0] = 1.5709. Next, with K[x1] = 1.5727, we get K[x0, x1] = 0.0006, and with ψ0(3.5) = 3.5 - 1 = 2.5 we get p1(3.5) = 1.5709 + 0.0006(2.5) = 1.5724.


Adding the point x2 = 6, we have K[x2] = 1.5751; hence K[x1, x2] = 0.0012, K[x0, x1, x2] = 0.00012; therefore, as ψ1(3.5) = (2.5)(-0.5) = -1.25,

p2(3.5) = 1.5724 + 0.00012(-1.25) = 1.5722

the number calculated earlier in Example 2.3. To check the error for this approximation to K(3.5), we add the point x3 = 0. With K[x3] = 1.5708, we compute K[x2, x3] = 0.000717, K[x1, x2, x3] = 0.000121, K[x0, x1, x2, x3] = -0.000001, and get, with ψ2(3.5) = (-1.25)(-2.5) = 3.125, that

p3(3.5) = 1.5722 - 0.000001(3.125) = 1.5722

indicating that 1.5722 or 1.5723 is probably the value of K(3.5) to within the accuracy of

the given values of K(x).

These calculations, if done by hand, are conveniently arranged in a table as shown

in Fig. 2.2, which also shows how Algorithm 2.4 gradually builds up the divided-difference table.

We have listed below a FORTRAN FUNCTION, called TABLE,

which uses Algorithm 2.4 to interpolate in a given table of abscissas and

ordinates X(I), F(I), I = 1, . . . , NTABLE, with F(I) = f(X(I)), and X(1)

< X(2) < · · · , in order to find a good approximation to f(x) at x =

XBAR. The program generates p0(XBAR), p1(XBAR), . . . , until

|pk+1(XBAR) - pk(XBAR)| < TOL

where TOL is a given error requirement, or until k + 1 = min(20, NTABLE), and then returns the number pk+1(XBAR). The sequence

x0, x1, x2, . . . of points of interpolation is chosen from the tabular points

X(1), X(2), . . . , X(NTABLE) as follows: If X(I) < XBAR < X(I + 1),

then x0 = X(I + 1), x1 = X(I), x2 = X(I + 2), x3 = X(I - 1), . . . , except

near the beginning or the end of the given table, where eventually only

points to the right or to the left of XBAR are used. To protect the program

(and the user!) against an unreasonable choice for TOL, the program

should be modified so as to terminate also if and when the successive

differences |p k+1 (XBAR) - p k(XBAR)| begin to increase as k increases.

(See also Exercise 2.4-1.)

Figure 2.2


FORTRAN SUBPROGRAM FOR INTERPOLATION IN A

FUNCTION TABLE

      REAL FUNCTION TABLE ( XBAR, X, F, NTABLE, TOL, IFLAG )
C  RETURNS AN INTERPOLATED VALUE  TABLE  AT  XBAR  FOR THE FUNCTION
C  TABULATED AS  (X(I),F(I)), I=1,...,NTABLE .
      INTEGER IFLAG,NTABLE,   J,KP1,KP1MAX,NEXT,NEXTL,NEXTR
      REAL F(NTABLE),TOL,X(NTABLE),XBAR,   A(20),ERROR,PSIK,XK(20)
C****** I N P U T ******
C  XBAR  POINT AT WHICH TO INTERPOLATE .
C  X(I), F(I), I=1,...,NTABLE  CONTAINS THE FUNCTION TABLE .
C     (A S S U M P T I O N ...  X  IS ASSUMED TO BE INCREASING.)
C  NTABLE  NUMBER OF ENTRIES IN FUNCTION TABLE .
C  TOL  DESIRED ERROR BOUND .
C****** O U T P U T ******
C  TABLE  THE INTERPOLATED FUNCTION VALUE .
C  IFLAG  AN INTEGER,
C     =1 , SUCCESSFUL EXECUTION ,
C     =2 , UNABLE TO ACHIEVE DESIRED ERROR IN 20 STEPS ,
C     =3 , XBAR LIES OUTSIDE OF TABLE RANGE. CONSTANT EXTRAPOLATION IS
C          USED .
C****** M E T H O D ******
C  A SEQUENCE OF POLYNOMIAL INTERPOLANTS OF INCREASING DEGREE IS FORMED
C  USING TABLE ENTRIES ALWAYS AS CLOSE TO  XBAR  AS POSSIBLE. EACH IN-
C  TERPOLATED VALUE IS OBTAINED FROM THE PRECEDING ONE BY ADDITION OF A
C  CORRECTION TERM (AS IN THE NEWTON FORMULA). THE PROCESS TERMINATES
C  WHEN THIS CORRECTION IS LESS THAN  TOL  OR, ELSE, AFTER 20 STEPS .
C
C     LOCATE  XBAR  IN THE X-ARRAY .
      IF (XBAR .GE. X(1) .AND. XBAR .LE. X(NTABLE)) THEN
         DO 10 NEXT=2,NTABLE
            IF (XBAR .LE. X(NEXT))        GO TO 12
   10    CONTINUE
      END IF
      IF (XBAR .LT. X(1)) THEN
         TABLE = F(1)
      ELSE
         TABLE = F(NTABLE)
      END IF
      PRINT 610,XBAR
  610 FORMAT(E16.7,' NOT IN TABLE RANGE.')
      IFLAG = 3
      RETURN
   12 XK(1) = X(NEXT)
      NEXTL = NEXT-1
      NEXTR = NEXT+1
      A(1) = F(NEXT)
      TABLE = A(1)
      PSIK = 1.
C     USE ALGORITHM 2.4, WITH THE NEXT  XK  ALWAYS THE TABLE
C     ENTRY NEAREST  XBAR  OF THOSE NOT YET USED .
      KP1MAX = MIN(20,NTABLE)
      DO 20 KP1=2,KP1MAX
         IF (NEXTL .EQ. 0) THEN
            NEXT = NEXTR
            NEXTR = NEXTR+1
         ELSE IF (NEXTR .GT. NTABLE) THEN
            NEXT = NEXTL
            NEXTL = NEXTL-1
         ELSE IF (XBAR - X(NEXTL) .GT. X(NEXTR) - XBAR) THEN
            NEXT = NEXTR
            NEXTR = NEXTR+1
         ELSE
            NEXT = NEXTL
            NEXTL = NEXTL-1
         END IF
         XK(KP1) = X(NEXT)
         A(KP1) = F(NEXT)
         DO 13 J=KP1-1,1,-1
            A(J) = (A(J+1) - A(J))/(XK(KP1) - XK(J))
   13    CONTINUE
C        FOR I=1,...,KP1,  A(I)  NOW CONTAINS THE DIV.DIFF. OF
C        F(X) OF ORDER  KP1-I  AT  XK(I),...,XK(KP1) .
         PSIK = PSIK*(XBAR - XK(KP1-1))
         ERROR = A(1)*PSIK
C        TEMPORARY PRINTOUT
         PRINT 613,KP1,XK(KP1),TABLE,ERROR
  613    FORMAT(I10,3E17.7)
         TABLE = TABLE + ERROR
         IF (ABS(ERROR) .LE. TOL) THEN
            IFLAG = 1
            RETURN
         END IF
   20 CONTINUE
      PRINT 620,KP1MAX
  620 FORMAT(' NO CONVERGENCE IN ',I2,' STEPS.')
      IFLAG = 2
      RETURN
      END


EXERCISES

2.4-1 The FORTRAN function TABLE given in the text terminates as soon as |pk+1(XBAR) - pk(XBAR)| < TOL. Show that this does not guarantee that the value pk+1(XBAR) returned by TABLE is within TOL of the desired number f(XBAR) by the following examples:

(a) f(x) = x^2; for some I, X(I) = -10, X(I + 1) = 10, XBAR = 0, TOL = 10^(-5).
(b) f(x) = x^3; for some I, X(I) = -100, X(I + 1) = 0, X(I + 2) = 100, XBAR = -50, TOL = 10^(-5).

2.4-2 Iterated linear interpolation is based on the following observation attributable to Neville: Denote by pi,j(x) the polynomial of degree < j - i which interpolates f(x) at the points xi, xi+1, . . . , xj, i < j. Then

pi,j(x) = [(x - xi)pi+1,j(x) - (x - xj)pi,j-1(x)]/(xj - xi)

Verify this identity. [Hint: We used such an identity in Sec. 2.3; see Eq. (2.13).]

2.4-3 Iterated linear interpolation (continued). The identity of Neville’s established in Exercise

2.4-2 allows one to generate the entries in the following triangular table

column by column, by repeatedly carrying out what looks like linear interpolation, to reach eventually the desired number p0,n(x̄), the value at x̄ of the interpolating polynomial which agrees with f(x) at the n + 1 points x0, . . . , xn. This is Neville's Algorithm. Aitken's Algorithm

is different in that one generates instead a triangular table whose jth column consists of the


numbers

With p0, 1, . . . , j, r(x) (for r > j) the polynomial of degree < j + 1 which agrees with f(x) at the

points x0, x1, . . . , xj, and xr.

Show by an operations count that Neville’s algorithm is more expensive than Algorithm

2.4. (Also, observe that Algorithm 2.4 provides, at no extra cost, a Newton form for the

interpolating polynomial for subsequent evaluation at other points, while the information

generated in Neville’s or Aitken’s algorithm is of no help for evaluation at other points.)

2.4-4 In inverse interpolation in a table, one is given a number ȳ and wishes to find the point x̄ so that f(x̄) = ȳ, where f(x) is the tabulated function. If f(x) is (continuous and) strictly monotone-increasing or -decreasing, this problem can always be solved by considering the given table xi, f(xi), i = 0, 1, 2, . . . to be a table yi, g(yi), i = 0, 1, 2, . . . for the inverse function g(y) = f^(-1)(y) = x, by taking yi = f(xi), g(yi) = xi, i = 0, 1, 2, . . . , and interpolating for the unknown value x̄ = g(ȳ) in this table. Use the FORTRAN function TABLE to find x̄ so that f(x̄) = ȳ in this way.

2.5 THE ERROR OF THE INTERPOLATING POLYNOMIAL

Let f(x) be a real-valued function on the interval I = [a,b], and let

x0, . . . , xn be n + 1 distinct points in I. With pn(x) the polynomial of

degree < n which interpolates f(x) at x0, . . . , xn, the interpolation error is given by

(2.15)    en(x) = f(x) - pn(x)

Let now x̄ be any point different from x0, . . . , xn. If pn+1(x) is the polynomial of degree < n + 1 which interpolates f(x) at x0, . . . , xn and at x̄, then pn+1(x̄) = f(x̄), while by (2.10),

pn+1(x) = pn(x) + f[x0, . . . , xn, x̄](x - x0) · · · (x - xn)

It follows that

f(x̄) = pn(x̄) + f[x0, . . . , xn, x̄](x̄ - x0) · · · (x̄ - xn)

Therefore,

(2.16)    en(x̄) = f(x̄) - pn(x̄) = f[x0, . . . , xn, x̄](x̄ - x0) · · · (x̄ - xn)

showing the error to be “like the next term” in the Newton form.

We cannot evaluate the right side of (2.16) without knowing the number f[x0, . . . , xn, x̄]. But as we now prove, this number is closely related to the (n + 1)st derivative of f(x), and using this information, we can at times estimate it.


Theorem 2.2 Let f(x) be a real-valued function, defined on [a,b] and

k times differentiable in (a, b). If x0, . . . , xk are k + 1 distinct points

in [a, b], then there exists a point ξ in (a, b) such that

(2.17)    f[x0, . . . , xk] = f^(k)(ξ)/k!

For k = 1, this is just the mean-value theorem for derivatives (see Sec. 1.7). For the general case, observe that the error function ek(x) = f(x) - pk(x) has (at least) the k + 1 distinct zeros x0, . . . , xk in I = [a, b]. Hence, if f(x), and therefore ek(x), is k times differentiable on (a, b), then it follows from Rolle's theorem (see Sec. 1.7) that e'k(x) has at least k zeros in (a, b); hence e''k(x) has at least k - 1 zeros in (a, b), and, continuing in this manner, we finally get that ek^(k)(x) has at least one zero in (a, b). Let ξ be one such zero. Then

0 = ek^(k)(ξ) = f^(k)(ξ) - pk^(k)(ξ)

On the other hand, we know that, for any x,

pk^(k)(x) = k! f[x0, . . . , xk]

since, by definition, f[x0, . . . , xk] is the leading coefficient of pk(x), and (2.17) now follows.

By taking a = min_i x_i, b = max_i x_i, it follows that the unknown point ξ in (2.17) can be assumed to lie somewhere between the x_i’s.

If we apply Theorem 2.2 to (2.16), we get Theorem 2.3.

Theorem 2.3 Let f(x) be a real-valued function defined on [a, b] and n + 1 times differentiable on (a, b). If p_n(x) is the polynomial of degree ≤ n which interpolates f(x) at the n + 1 distinct points x0, . . . , xn in [a, b], then for all x̄ in [a, b] there exists ξ = ξ(x̄) in (a, b) such that

e_n(x̄) = f(x̄) − p_n(x̄) = [f^(n+1)(ξ)/(n + 1)!] (x̄ − x0) · · · (x̄ − xn)   (2.18)

It is important to note that ξ depends on the point x̄ at which the error estimate is required. This dependence need not even be continuous. As we have need in Chap. 7 to integrate and differentiate e_n(x) with respect to x, we usually prefer for such purposes the formula (2.16). For, as we show in Sec. 2.7, f[x0, . . . , xn, x] is a well-behaved function of x.

The error formula (2.18) is of only limited practical utility since, in general, we will seldom know f^(n+1)(x), and we will almost never know the point ξ. But when a bound on |f^(n+1)(x)| is known over the entire interval [a, b], we can use (2.18) to obtain a (usually crude) bound on the error of the interpolating polynomial in that interval.
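Such a bound can be exercised numerically. The following Python sketch (our own illustration, not from the text) interpolates f = sin at three points and checks that the actual error never exceeds the bound from (2.18) with |f‴(x)| ≤ 1:

```python
from math import sin

xs = [0.0, 0.5, 1.0]
fs = [sin(x) for x in xs]

def p2(x):
    # Lagrange form of the quadratic interpolant to sin at xs
    total = 0.0
    for i, xi in enumerate(xs):
        li = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                li *= (x - xj) / (xi - xj)
        total += fs[i] * li
    return total

M = 1.0                                   # |sin'''(x)| = |cos x| <= 1
worst_ratio = 0.0
for k in range(1, 100):
    x = k / 100.0
    bound = M / 6.0 * abs((x - xs[0]) * (x - xs[1]) * (x - xs[2]))
    err = abs(sin(x) - p2(x))
    if bound > 0:                          # skip the interpolation points themselves
        worst_ratio = max(worst_ratio, err / bound)
# the ratio err/bound equals |cos(xi)| for the unknown xi, so it stays below 1
```

The bound is crude in exactly the sense the text describes: it is attained only when |f‴(ξ)| happens to equal the bound M.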


Example 2.6 Find a bound for the error in linear interpolation.
The linear polynomial interpolating f(x) at x0 and x1 is

p_1(x) = f(x0) + f[x0, x1](x − x0)

Equation (2.18) then yields the error formula

f(x) − p_1(x) = [f″(ξ)/2](x − x0)(x − x1)

where ξ depends on x. If x is a point between x0 and x1, then

|(x − x0)(x − x1)| = (x − x0)(x1 − x)

Hence, if we know that |f″(x)| ≤ M on [x0, x1], then

|f(x) − p_1(x)| ≤ (M/2)(x − x0)(x1 − x)

The maximum value of (x − x0)(x1 − x) occurs at x = (x0 + x1)/2, hence is (x1 − x0)²/4. It follows that, for any x̄ between x0 and x1,

|f(x̄) − p_1(x̄)| ≤ (M/8)(x1 − x0)²
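A quick numerical check of the bound M(x1 − x0)²/8 (a Python sketch; f = sin on [0, 0.5] with M = 1 is our own choice of test case):

```python
from math import sin

x0, x1 = 0.0, 0.5
M = 1.0                                   # |(sin x)''| = |sin x| <= 1 everywhere
bound = M * (x1 - x0) ** 2 / 8            # the bound of Example 2.6

worst = 0.0
for k in range(101):
    x = x0 + (x1 - x0) * k / 100
    # linear interpolant through (x0, sin x0) and (x1, sin x1)
    p1 = sin(x0) + (sin(x1) - sin(x0)) / (x1 - x0) * (x - x0)
    worst = max(worst, abs(sin(x) - p1))
# the observed worst error stays below M (x1 - x0)^2 / 8
```

Here the observed error is well below the bound, since |sin x| stays well below M = 1 on [0, 0.5].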

Example 2.7 Determine the spacing h in a table of equally spaced values of the function f(x) between 1 and 2, so that interpolation with a second-degree polynomial in this table will yield a desired accuracy.

By assumption, the table will contain f(x_i), with x_i = 1 + ih, i = 0, . . . , N, where Nh = 1. If x̄ is a point in [1, 2], then we approximate f(x̄) by p_2(x̄), where p_2(x) is the quadratic polynomial which interpolates f(x) at x_{i−1}, x_i, x_{i+1}, the three tabular points nearest x̄. By (2.18), the error is then

f(x̄) − p_2(x̄) = [f‴(ξ)/6](x̄ − x_{i−1})(x̄ − x_i)(x̄ − x_{i+1})

for some ξ in (x_{i−1}, x_{i+1}). Since we do not know ξ, we can merely estimate, with M a bound on |f‴(x)| over the interval,

|f(x̄) − p_2(x̄)| ≤ (M/6) max |(x̄ − x_{i−1})(x̄ − x_i)(x̄ − x_{i+1})|

Further, one calculates

(x − x_{i−1})(x − x_i)(x − x_{i+1}) = (y + h)y(y − h) = y³ − h²y = ψ(y)

using the linear change of variables y = x − x_i. Since the function ψ(y) vanishes at y = −h and y = h, the maximum of |ψ(y)| on [−h, h] must occur at one of the extrema of ψ(y). These extrema are found by solving the equation ψ′(y) = 3y² − h² = 0, giving y = ±h/√3. Hence

max |ψ(y)| = 2h³/(3√3)

We are now assured that, for any x̄ in [1, 2],

|f(x̄) − p_2(x̄)| ≤ (M/6) · 2h³/(3√3) = Mh³/(9√3)

if p_2(x) is chosen as the quadratic polynomial which interpolates f(x) at the three


have to choose h so that Mh³/(9√3) ≤ 5 · 10^-8, which determines h once the bound M on |f‴(x)| is known.
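The maximization step in Example 2.7 can be verified numerically. In this Python sketch (our own illustration), h = 0.1 is an arbitrary sample spacing and `psi` is the shifted product y³ − h²y:

```python
h = 0.1

def psi(y):
    # psi(y) = (y + h) * y * (y - h) = y**3 - h*h*y
    return y ** 3 - h * h * y

# scan [-h, h] on a fine grid for the maximum of |psi|
worst = max(abs(psi(-h + 2 * h * k / 100000)) for k in range(100001))

# the analytic maximum 2 h^3 / (3 sqrt(3)), attained at y = +-h/sqrt(3)
analytic = 2 * h ** 3 / (3 * 3 ** 0.5)
```

The scanned maximum agrees with 2h³/(3√3) to many digits, confirming the calculus step.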

The function |(x − x0) · · · (x − xn)| which appears in (2.18) depends, of course, strongly on the placement of the interpolation points. It is possible to choose these points for given n in the given interval a ≤ x ≤ b in such a way that the maximum of |(x − x0) · · · (x − xn)| on that interval is as small as possible. This choice of points, the so-called Chebyshev points, is discussed in some detail in Sec. 6.1. For the common choice of equally spaced interpolation points, the local maxima of |(x − x0) · · · (x − xn)| increase as one moves from the middle of the interval toward its ends, and this increase becomes more pronounced with increasing n (see Fig. 2.3). In view of (2.18), it is therefore advisable (at least when interpolating to uniformly spaced data) to make use of the interpolating polynomial only near the middle data points. The interpolant becomes less reliable as one approaches the leftmost or rightmost data point. Of course, going beyond them is even worse. Such an undertaking is called extrapolation and should only be used with great caution.
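The contrast between the two choices of points can be seen directly. A Python sketch (our own illustration; n = 10 on [−1, 1], with the Chebyshev points taken as cos((2i + 1)π/(2n + 2))):

```python
from math import cos, pi

n = 10                        # degree; n + 1 interpolation points on [-1, 1]
equi = [-1 + 2 * i / n for i in range(n + 1)]
cheb = [cos((2 * i + 1) * pi / (2 * (n + 1))) for i in range(n + 1)]

def max_abs_product(pts):
    # max over a fine grid of |(x - x_0) ... (x - x_n)|
    worst = 0.0
    for k in range(2001):
        x = -1 + 2 * k / 2000
        p = 1.0
        for xi in pts:
            p *= (x - xi)
        worst = max(worst, abs(p))
    return worst

m_equi = max_abs_product(equi)
m_cheb = max_abs_product(cheb)
# m_cheb is near 2**(-n) = 1/1024, several times smaller than m_equi,
# whose largest spikes sit near the ends of the interval
```

For the Chebyshev points the monic product equals T_{n+1}(x)/2^n, whose maximum on [−1, 1] is exactly 2^{−n}; the equally spaced points do noticeably worse near ±1.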

Figure 2.3 The function |(x − x0) · · · (x − xn)| for (a) equally spaced interpolation points (solid); (b) Chebyshev points for the same interval (dotted).

EXERCISES

2.5-1 A table of values of cos x is required so that linear interpolation will yield six-decimal-place accuracy for any value of x in the tabulated interval. Assuming that the tabular values are to be equally spaced, what is the minimum number of entries needed in the table?

2.5-2 The function defined by


has been tabulated for equally spaced values of x with step h = 0.1. What is the maximum error encountered if cubic interpolation is to be used to calculate the function at any point of the tabulated interval?

2.5-3 Prove: If the values f(x0), . . . , f(xn) are our only information about the function f(x), then we can say nothing about the error e_n(x̄) at a point x̄ different from the interpolation points; that is, the error may be “very large” or may be “very small.” [Hint: Consider interpolation at x0, x1, . . . , xn to the function f(x) = K(x − x0) · · · (x − xn), where K is an unknown constant.] What does this imply about programs like the FUNCTION TABLE in Sec. 2.4 or Algorithm 2.4?

2.5-4 Use (2.18) to give a lower bound on the interpolation error

when

2.6 INTERPOLATION IN A FUNCTION TABLE BASED ON

EQUALLY SPACED POINTS

Much of engineering and scientific calculation uses functions such as sin x,

e^x, J_n(x), erf(x), etc., which are defined by an infinite series, or as the

solution of a certain differential equation, or by similar processes involving

limits, and can therefore, in general, not be evaluated in a finite number of

steps. Computer


Johannesburg London Madrid Mexico Montreal New Delhi
Panama Paris São Paulo Singapore Sydney Tokyo Toronto

ELEMENTARY NUMERICAL ANALYSIS

An Algorithmic Approach

Copyright © 1980, 1972, 1965 by McGraw-Hill, Inc. All rights reserved.

Printed in the United States of America. No part of this publication

may be reproduced, stored in a retrieval system, or transmitted, in any

form or by any means, electronic, mechanical, photocopying, recording or

otherwise, without the prior written permission of the publisher.

234567890 DODO 89876543210

This book was set in Times Roman by Science Typographers, Inc. The

editors were Carol Napier and James S. Amar; the production supervisor

was Phil Galea. The drawings were done by Fine Line Illustrations, Inc.

R. R. Donnelley & Sons Company was printer and binder.

Library of Congress Cataloging in Publication Data

Conte, Samuel Daniel, date

Elementary numerical analysis.

(International series in pure and applied

mathematics)

Includes index.

1. Numerical analysis—Data processing.
I. de Boor, Carl, joint author. II. Title.

QA297.C65   1980   519.4   79-24641

ISBN 0-07-012447-7

CONTENTS

Preface  ix
Introduction  xi

Chapter 1  Number Systems and Errors  1
  1.1  The Representation of Integers  1
  1.2  The Representation of Fractions  4
  1.3  Floating-Point Arithmetic  7
  1.4  Loss of Significance and Error Propagation; Condition and Instability  12
  1.5  Computational Methods for Error Estimation  18
  1.6  Some Comments on Convergence of Sequences  19
  1.7  Some Mathematical Preliminaries  25

Chapter 2  Interpolation by Polynomial  31
  2.1  Polynomial Forms  31
  2.2  Existence and Uniqueness of the Interpolating Polynomial  38
  2.3  The Divided-Difference Table  41
  *2.4  Interpolation at an Increasing Number of Interpolation Points  46
  2.5  The Error of the Interpolating Polynomial  51
  2.6  Interpolation in a Function Table Based on Equally Spaced Points  55
  *2.7  The Divided Difference as a Function of Its Arguments and Osculatory Interpolation  62

Chapter 3  The Solution of Nonlinear Equations  72
  3.1  A Survey of Iterative Methods  74
  3.2  Fortran Programs for Some Iterative Methods  81
  3.3  Fixed-Point Iteration  88
  3.4  Convergence Acceleration for Fixed-Point Iteration  95
  *3.5  Convergence of the Newton and Secant Methods  100
  3.6  Polynomial Equations: Real Roots  110
  *3.7  Complex Roots and Müller’s Method  120

Chapter 4  Matrices and Systems of Linear Equations  128
  4.1  Properties of Matrices  128
  4.2  The Solution of Linear Systems by Elimination  147
  4.3  The Pivoting Strategy  157
  4.4  The Triangular Factorization  160
  4.5  Error and Residual of an Approximate Solution; Norms  169
  4.6  Backward-Error Analysis and Iterative Improvement  177
  *4.7  Determinants  185
  *4.8  The Eigenvalue Problem  189

Chapter *5  Systems of Equations and Unconstrained Optimization  208
  *5.1  Optimization and Steepest Descent  209
  *5.2  Newton’s Method  216
  *5.3  Fixed-Point Iteration and Relaxation Methods  223

Chapter 6  Approximation  235
  6.1  Uniform Approximation by Polynomials  235
  6.2  Data Fitting  245
  *6.3  Orthogonal Polynomials  251
  *6.4  Least-Squares Approximation by Polynomials  259
  *6.5  Approximation by Trigonometric Polynomials  268
  *6.6  Fast Fourier Transforms  277
  6.7  Piecewise-Polynomial Approximation  284

Chapter 7  Differentiation and Integration  294
  7.1  Numerical Differentiation  295
  7.2  Numerical Integration: Some Basic Rules  303
  7.3  Numerical Integration: Gaussian Rules  311
  7.4  Numerical Integration: Composite Rules  319
  7.5  Adaptive Quadrature  328
  *7.6  Extrapolation to the Limit  333
  *7.7  Romberg Integration  340

Chapter 8  The Solution of Differential Equations  346
  8.1  Mathematical Preliminaries  346
  8.2  Simple Difference Equations  349
  8.3  Numerical Integration by Taylor Series  354
  8.4  Error Estimates and Convergence of Euler’s Method  359
  8.5  Runge-Kutta Methods  362
  8.6  Step-Size Control with Runge-Kutta Methods  366
  8.7  Multistep Formulas  373
  8.8  Predictor-Corrector Methods  379
  8.9  The Adams-Moulton Method  382
  *8.10  Stability of Numerical Methods  389
  *8.11  Round-off-Error Propagation and Control  395
  *8.12  Systems of Differential Equations  398
  *8.13  Stiff Differential Equations  401

Chapter 9  Boundary Value Problems  406
  9.1  Finite Difference Methods  406
  9.2  Shooting Methods  412
  9.3  Collocation Methods  416

Appendix: Subroutine Libraries  421
References  423
Index  425

* Sections marked with an asterisk may be omitted without loss of continuity.

PREFACE

This is the third edition of a book on elementary numerical analysis which

is designed specifically for the needs of upper-division undergraduate

students in engineering, mathematics, and science including, in particular,

computer science. On the whole, the student who has had a solid college

calculus sequence should have no difficulty following the material.

Advanced mathematical concepts, such as norms and orthogonality, when

they are used, are introduced carefully at a level suitable for undergraduate

students and do not assume any previous knowledge. Some familiarity

with matrices is assumed for the chapter on systems of equations and with

differential equations for Chapters 8 and 9. This edition does contain some

sections which require slightly more mathematical maturity than the previous edition. However, all such sections are marked with asterisks and all

can be omitted by the instructor with no loss in continuity.

This new edition contains a great deal of new material and significant

changes to some of the older material. The chapters have been rearranged

in what we believe is a more natural order. Polynomial interpolation

(Chapter 2) now precedes even the chapter on the solution of nonlinear

systems (Chapter 3) and is used subsequently for some of the material in

all chapters. The treatment of Gauss elimination (Chapter 4) has been

simplified. In addition, Chapter 4 now makes extensive use of Wilkinson’s

backward error analysis, and contains a survey of many well-known

methods for the eigenvalue-eigenvector problem. Chapter 5 is a new

chapter on systems of equations and unconstrained optimization. It contains an introduction to steepest-descent methods, Newton’s method for

nonlinear systems of equations, and relaxation methods for solving large

linear systems by iteration. The chapter on approximation (Chapter 6) has

been enlarged. It now treats best approximation and good approximation

ix

x

PREFACE

by polynomials, also approximation by trigonometric functions, including

the Fast Fourier Transforms, as well as least-squares data fitting, orthogonal polynomials, and curve fitting by splines. Differentiation and integration are now treated in Chapter 7, which contains a new section on

adaptive quadrature. Chapter 8 on ordinary differential equations contains

considerable new material and some new sections. There is a new section

on step-size control in Runge-Kutta methods and a new section on stiff

differential equations as well as an extensively revised section on numerical

instability. Chapter 9 contains a brief introduction to collocation as a

method for solving boundary-value problems.

This edition, as did the previous one, assumes that students have

access to a computer and that they are familiar with programming in some

procedure-oriented language. A large number of algorithms are presented

in the text, and FORTRAN programs for many of these algorithms have

been provided. There are somewhat fewer complete programs in this

edition. All the programs have been rewritten in the FORTRAN 77

language which uses modern structured-programming concepts. All the

programs have been tested on one or more computers, and in most cases

machine results are presented. When numerical output is given, the text

will indicate which machine (IBM, CDC, UNIVAC) was used to obtain

the results.

The book contains more material than can usually be covered in a

typical one-semester undergraduate course for general science majors. This

gives the instructor considerable leeway in designing the course. For this, it

is important to point out that only the material on polynomial interpolation in Chapter 2, on linear systems in Chapter 4, and on differentiation

and integration in Chapter 7, is required in an essential way in subsequent

chapters. The material in the first seven chapters (exclusive of the starred

sections) would make a reasonable first course.

We take this opportunity to thank those who have communicated to us

misprints and errors in the second edition and have made suggestions for

improvement. We are especially grateful to R. E. Barnhill, D. Chambless,

A. E. Davidoff, P. G. Davis, A. G. Deacon, A. Feldstein, W. Ferguson,

A. O. Garder, J. Guest, T. R. Hopkins, D. Joyce, K. Kincaid, J. T. King,

N. Krikorian, and W. E. McBride.

S. D. Conte

Carl de Boor

INTRODUCTION

This book is concerned with the practical solution of problems on computers. In the process of problem solving, it is possible to distinguish

several more or less distinct phases. The first phase is formulation. In

formulating a mathematical model of a physical situation, scientists should

take into account beforehand the fact that they expect to solve a problem

on a computer. They will therefore provide for specific objectives, proper

input data, adequate checks, and for the type and amount of output.

Once a problem has been formulated, numerical methods, together

with a preliminary error analysis, must be devised for solving the problem.

A numerical method which can be used to solve a problem will be called

an algorithm. An algorithm is a complete and unambiguous set of procedures leading to the solution of a mathematical problem. The selection or

construction of appropriate algorithms properly falls within the scope of

numerical analysis. Having decided on a specific algorithm or set of

algorithms for solving the problem, numerical analysts should consider all

the sources of error that may affect the results. They must consider how

much accuracy is required, estimate the magnitude of the round-off and

discretization errors, determine an appropriate step size or the number of

iterations required, provide for adequate checks on the accuracy, and make

allowance for corrective action in cases of nonconvergence.

The third phase of problem solving is programming. The programmer

must transform the suggested algorithm into a set of unambiguous step-by-step instructions to the computer. The first step in this procedure is

called flow charting. A flow chart is simply a set of procedures, usually in

logical block form, which the computer will follow. It may be given in

graphical or procedural statement form. The complexity of the flow will

depend upon the complexity of the problem and the amount of detail

xi

xii INTRODUCTION

included. However, it should be possible for someone other than the

programmer to follow the flow of information from the chart. The flow

chart is an effective aid to the programmer, who must translate its major

functions into a program, and, at the same time, it is an effective means of

communication to others who wish to understand what the program does.

In this book we sometimes use flow charts in graphical form, but more

often in procedural statement form. When graphical flow charts are used,

standard conventions are followed, whereas all procedural statement charts

use a self-explanatory ALGOL-like statement language. Having produced

a flow chart, the programmer must transform the indicated procedures into

a set of machine instructions. This may be done directly in machine

language, in an assembly language, or in a procedure-oriented language. In

this book a dialect of FORTRAN called FORTRAN 77 is used exclusively. FORTRAN 77 is a new dialect of FORTRAN which incorporates

new control statements and which emphasizes modern structured-programming concepts. While FORTRAN IV compilers are available on almost all

computers, FORTRAN 77 may not be as readily available. However,

conversion from FORTRAN 77 to FORTRAN IV should be relatively

straightforward.

A procedure-oriented language such as FORTRAN or ALGOL is

sometimes called an algorithmic language. It allows us to express a

mathematical algorithm in a form more suitable for communication with

computers. A FORTRAN procedure that implements a mathematical

algorithm will, in general, be much more precise than the mathematical

algorithm. If, for example, the mathematical algorithm specifies an iterative procedure for finding the solution of an equation, the FORTRAN

program must specify (1) the accuracy that is required, (2) the number of

iterations to be performed, and (3) what to do in case of nonconvergence.

Most of the algorithms in this book are given in the normal mathematical

form and in the more precise form of a FORTRAN procedure.

In many installations, each of these phases of problem solving is

performed by a separate person. In others, a single person may be

responsible for all three functions. It is clear that there are many interactions among these three phases. As the program develops, more information becomes available, and this information may suggest changes in the

formulation, in the algorithms being used, and in the program itself.


CHAPTER

ONE

NUMBER SYSTEMS AND ERRORS

In this chapter we consider methods for representing numbers on computers and the errors introduced by these representations. In addition, we

examine the sources of various types of computational errors and their

subsequent propagation. We also discuss some mathematical preliminaries.

1.1 THE REPRESENTATION OF INTEGERS

In everyday life we use numbers based on the decimal system. Thus the

number 257, for example, is expressible as

257 = 2·100 + 5·10 + 7·1
    = 2·10² + 5·10¹ + 7·10⁰

We call 10 the base of this system. Any integer is expressible as a

polynomial in the base 10 with integral coefficients between 0 and 9. We

use the notation

N = (a_n a_{n−1} · · · a_0)_10
  = a_n 10^n + a_{n−1} 10^(n−1) + · · · + a_0 10^0   (1.1)

to denote any positive integer in the base 10. There is no intrinsic reason to

use 10 as a base. Other civilizations have used other bases such as 12, 20,

or 60. Modern computers read pulses sent by electrical components. The

state of an electrical impulse is either on or off. It is therefore convenient to

represent numbers in computers in the binary system. Here the base is 2,

and the integer coefficients may take the values 0 or 1.


A nonnegative integer N will be represented in the binary system as

N = (a_n a_{n−1} · · · a_0)_2 = a_n 2^n + a_{n−1} 2^(n−1) + · · · + a_0 2^0   (1.2)

where the coefficients ak are either 0 or 1. Note that N is again represented

as a polynomial, but now in the base 2. Many computers used in scientific

work operate internally in the binary system. Users of computers, however,

prefer to work in the more familiar decimal system. It is therefore necessary to have some means of converting from decimal to binary when

information is submitted to the computer, and from binary to decimal for

output purposes.

Conversion of a binary number to decimal form may be accomplished

directly from the definition (1.2). As examples we have

The conversion of integers from a base β to the base 10 can also be accomplished by the following algorithm, which is derived in Chap. 2.

Algorithm 1.1 Given the coefficients a_n, . . . , a_0 of the polynomial

p(x) = a_n x^n + a_{n−1} x^(n−1) + · · · + a_1 x + a_0   (1.3)

and a number z. Compute recursively the numbers

b_n = a_n    b_k = a_k + z b_{k+1}    k = n − 1, n − 2, . . . , 0

Then p(z) = b_0.

Since, by the definition (1.2), the binary integer (a_n a_{n−1} · · · a_0)_2 represents the value of the polynomial (1.3) at x = 2, we can use Algorithm 1.1, with z = 2, to find the decimal equivalents of binary integers. Thus the decimal equivalent of (1101)_2 computed using Algorithm 1.1 is

b_3 = 1    b_2 = 1 + 2·1 = 3    b_1 = 0 + 2·3 = 6    b_0 = 1 + 2·6 = 13


and the decimal equivalent of (10000)_2 is

b_4 = 1    b_3 = 0 + 2·1 = 2    b_2 = 0 + 2·2 = 4    b_1 = 0 + 2·4 = 8    b_0 = 0 + 2·8 = 16
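Algorithm 1.1 is just nested multiplication (Horner's scheme). A sketch in Python rather than the book's FORTRAN (the function name `algorithm_1_1` is ours):

```python
def algorithm_1_1(coeffs, z):
    """Evaluate p(z) = a_n z^n + ... + a_0 by the recursion of Algorithm 1.1.

    coeffs lists a_n, a_{n-1}, ..., a_0 (leading coefficient first).
    """
    b = coeffs[0]                 # b_n = a_n
    for a in coeffs[1:]:
        b = a + z * b             # b_k = a_k + z * b_{k+1}
    return b

# decimal equivalents of binary integers: evaluate the polynomial at z = 2
print(algorithm_1_1([1, 1, 0, 1], 2))     # (1101)_2  -> 13
print(algorithm_1_1([1, 0, 0, 0, 0], 2))  # (10000)_2 -> 16
```

The same routine evaluates p(x) at x = 10 for the decimal-to-binary direction, provided the arithmetic is done in binary as the text explains.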

Converting a decimal integer N into its binary equivalent can also be accomplished by Algorithm 1.1 if one is willing to use binary arithmetic. For if N = (a_n a_{n−1} · · · a_0)_10, then by the definition (1.1), N = p(10), where p(x) is the polynomial (1.3). Hence we can calculate the binary representation for N by translating the coefficients a_0, . . . , a_n into binary integers and then using Algorithm 1.1 to evaluate p(x) at x = 10 = (1010)_2 in binary arithmetic. If, for example, N = 187, then the coefficients are 1, 8, 7, and using Algorithm 1.1 and binary arithmetic,

b_2 = (1)_2
b_1 = (1000)_2 + (1010)_2 (1)_2 = (10010)_2
b_0 = (111)_2 + (1010)_2 (10010)_2 = (10111011)_2

Therefore 187 = (10111011)_2.

Binary numbers and binary arithmetic, though ideally suited for

today’s computers, are somewhat tiresome for people because of the

number of digits necessary to represent even moderately sized numbers.

Thus eight binary digits are necessary to represent the three-decimal-digit

number 187. The octal number system, using the base 8, presents a kind of

compromise between the computer-preferred binary and the people-preferred decimal system. It is easy to convert from octal to binary and back

since three binary digits make one octal digit. To convert from octal to

binary, one merely replaces all octal digits by their binary equivalent; thus

Conversely, to convert from binary to octal, one partitions the binary digits

in groups of three (starting from the right) and then replaces each threegroup by its octal digit; thus

If a decimal integer has to be converted to binary by hand, it is usually fastest to convert it first to octal using Algorithm 1.1, and then from octal to binary. To take an earlier example, N = 187 has the decimal coefficients 1, 8, 7. Hence, using Algorithm 1.1 [with 2 replaced by 10 = (12)_8, and with octal arithmetic],

b_2 = (1)_8
b_1 = (10)_8 + (12)_8 (1)_8 = (22)_8
b_0 = (7)_8 + (12)_8 (22)_8 = (273)_8

Therefore, finally, 187 = (273)_8 = (010 111 011)_2 = (10111011)_2.
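The decimal-to-octal-to-binary route is easy to automate. In this Python sketch (our own illustration), the built-in `format` does the octal step instead of octal hand arithmetic:

```python
# map each octal digit to its three-bit binary equivalent: '2' -> '010', ...
OCT_TO_BIN = {str(d): format(d, "03b") for d in range(8)}

def decimal_to_binary_via_octal(n):
    octal = format(n, "o")                 # e.g. 187 -> '273'
    bits = "".join(OCT_TO_BIN[d] for d in octal)
    return bits.lstrip("0") or "0"         # drop leading zeros of the first triple

print(decimal_to_binary_via_octal(187))    # 10111011
```

Replacing each octal digit by its three-bit group is exactly the hand procedure described in the text.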

EXERCISES

1.1-1 Convert the following binary numbers to decimal form:

1.1-2 Convert the following decimal numbers to binary form:

82, 109, 3433

1.1-3 Carry out the conversions in Exercises 1.1-1 and 1.1-2 by converting first to octal form.

1.1-4 Write a FORTRAN subroutine which accepts a number to the base BETIN with the

NIN digits contained in the one-dimensional array NUMIN, and returns the NOUT digits of

the equivalent in base BETOUT in the one-dimensional array NUMOUT. For simplicity,

restrict both BETIN and BETOUT to 2, 4, 8, and 10.

1.2 THE REPRESENTATION OF FRACTIONS

If x is a positive real number, then its integral part x_I is the largest integer less than or equal to x, while

x_F = x − x_I

is its fractional part. The fractional part can always be written as a decimal fraction:

x_F = b_1 10^(−1) + b_2 10^(−2) + b_3 10^(−3) + · · ·   (1.4)

where each b_k is a nonnegative integer less than 10. If b_k = 0 for all k greater than a certain integer, then the fraction is said to terminate. Thus 1/4 = 0.25 is a terminating decimal fraction, while 1/3 = 0.333 · · · is not.

If the integral part of x is given as a decimal integer by

x_I = (a_n a_{n−1} · · · a_0)_10

while the fractional part is given by (1.4), it is customary to write the two representations one after the other, separated by a point, the “decimal point”:

x = (a_n · · · a_0 . b_1 b_2 · · · )_10

Completely analogously, one can write the fractional part of x as a binary fraction:

x_F = (.b_1 b_2 b_3 · · · )_2 = b_1 2^(−1) + b_2 2^(−2) + b_3 2^(−3) + · · ·

where each b_k is a nonnegative integer less than 2, i.e., either zero or one. If the integral part of x is given by the binary integer (a_n · · · a_0)_2, then we write

x = (a_n · · · a_0 . b_1 b_2 · · · )_2

using a “binary point.”

The binary fraction (.b_1 b_2 b_3 · · · )_2 for a given number x_F between zero and one can be calculated as follows: If

x_F = (.b_1 b_2 b_3 · · · )_2

then

2x_F = (b_1 . b_2 b_3 · · · )_2

Hence b_1 is the integral part of 2x_F, while (.b_2 b_3 · · · )_2 is the fractional part (2x_F)_F of 2x_F. Therefore, repeating this procedure, we find that b_2 is the integral part of 2(2x_F)_F, b_3 is the integral part of 2(2(2x_F)_F)_F, etc.

If, for example, x = 0.625 = x_F, then

2 × 0.625 = 1.25   hence b_1 = 1
2 × 0.25  = 0.5    hence b_2 = 0
2 × 0.5   = 1.0    hence b_3 = 1

and all further b_k’s are zero. Hence 0.625 = (.101)_2.

This example was rigged to give a terminating binary fraction. Unhappily, not every terminating decimal fraction gives rise to a terminating binary fraction. This is due to the fact that the binary fraction for 0.1 is not terminating. We have

2 × 0.1 = 0.2   hence b_1 = 0
2 × 0.2 = 0.4   hence b_2 = 0
2 × 0.4 = 0.8   hence b_3 = 0
2 × 0.8 = 1.6   hence b_4 = 1
2 × 0.6 = 1.2   hence b_5 = 1

and now we are back to a fractional part of 0.2, so that the digits cycle. It follows that

0.1 = (.0 0011 0011 0011 · · · )_2

The procedure just outlined is formalized in the following algorithm.

Algorithm 1.2 Given x between 0 and 1 and an integer β greater than 1. Generate recursively b_1, b_2, b_3, . . . by

x_0 = x    b_k = integral part of βx_{k−1}    x_k = fractional part of βx_{k−1}    k = 1, 2, 3, . . .

Then

x = (.b_1 b_2 b_3 · · · )_β

We have stated this algorithm for a general base β rather than for the specific binary base β = 2 for two reasons. If this conversion to binary is carried out with pencil and paper, it is usually faster to convert first to octal, i.e., use β = 8, and then to convert from octal to binary. Also, the algorithm can be used to convert a binary (or octal) fraction to decimal, by choosing β = 10 = (1010)_2 and using binary (or octal) arithmetic.

To give an example, if x = (.101)_2, then, with β = (1010)_2 and binary arithmetic, we get from Algorithm 1.2

b_1 = (110)_2 = 6    b_2 = (10)_2 = 2    b_3 = (101)_2 = 5    x_3 = 0

Hence subsequent b_k’s are zero. This shows that (.101)_2 = .625, confirming our earlier calculation. Note that if x_F is a terminating binary fraction with n digits, then it is also a terminating decimal fraction with n digits, since

x_F = (.b_1 · · · b_n)_2 = (b_1 2^(n−1) + · · · + b_n)/2^n = 5^n (b_1 2^(n−1) + · · · + b_n)/10^n
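Algorithm 1.2 in Python form (a sketch; the book states the algorithm in words, and the name `algorithm_1_2` and the truncation via `int` are our own choices):

```python
def algorithm_1_2(x, beta, ndigits):
    """First ndigits base-beta digits b_1, b_2, ... of the fraction x in (0, 1)."""
    digits = []
    for _ in range(ndigits):
        x *= beta                  # b_k is the integral part of beta * x_{k-1}
        b = int(x)
        digits.append(b)
        x -= b                     # x_k is the fractional part
    return digits

print(algorithm_1_2(0.625, 2, 5))   # [1, 0, 1, 0, 0] : 0.625 = (.101)_2
print(algorithm_1_2(0.625, 10, 3))  # [6, 2, 5]       : (.101)_2 = .625
```

Because the machine's own doubles are binary, deep digits of inputs like 0.1 eventually reflect the stored approximation rather than the exact decimal, echoing the non-terminating-fraction point made above.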

EXERCISES

1.2-1 Convert the following binary fractions to decimal fractions:

(.1100011)_2    (.11111111)_2

1.2-2 Find the first 5 digits of .1 written as an octal fraction, then compute from it the first 15

digits of .1 as a binary fraction.

1.2-3 Convert the following octal fractions to decimal:

(.614)8

(.776)8

Compare with your answer in Exercise 1.2-1.

1.2-4 Find a binary number which approximates

to within 10^-3.

1.2-5 If we want to convert a decimal integer N to binary using Algorithm 1.1, we have to use

binary arithmetic. Show how to carry out this conversion using Algorithm 1.2 and decimal

arithmetic. (Hint: Divide N by the appropriate power of 2, convert the result to binary, then

shift the “binary point” appropriately.)

1.2-6 If we want to convert a terminating binary fraction x to a decimal fraction using

Algorithm 1.2, we have to use binary arithmetic. Show how to carry out this conversion using

Algorithm 1.1 and decimal arithmetic.

1.3 FLOATING-POINT ARITHMETIC

Scientific calculations are usually carried out in floating-point arithmetic. An n-digit floating-point number in base β has the form

x = ±(.d_1 d_2 · · · d_n)_β β^e   (1.5)

where (.d_1 d_2 · · · d_n)_β is a β-fraction called the mantissa, and e is an integer called the exponent. Such a floating-point number is said to be normalized in case d_1 ≠ 0, or else d_1 = · · · = d_n = 0.

For most computers, β = 2, although on some, β = 16, and in hand calculations and on most desk and pocket calculators, β = 10.

The precision or length n of floating-point numbers on any particular

computer is usually determined by the word length of the computer and

may therefore vary widely (see Fig. 1.1). Computing systems which accept

FORTRAN programs are expected to provide floating-point numbers of

two different lengths, one roughly double the other. The shorter one, called

single precision, is ordinarily used unless the other, called double precision,

is specifically asked for. Calculation in double precision usually doubles

the storage requirements and more than doubles running time as compared

with single precision.


Figure 1.1 Floating-point characteristics.

The exponent e is limited to a range

m ≤ e ≤ M   (1.6)

for certain integers m and M. Usually, m = −M, but the limits may vary widely; see Fig. 1.1.

There are two commonly used ways of translating a given real number x into an n-digit floating-point number fl(x): rounding and chopping. In rounding, fl(x) is chosen as the normalized floating-point number nearest x; some special rule, such as symmetric rounding (rounding to an even digit), is used in case of a tie. In chopping, fl(x) is chosen as the nearest normalized floating-point number between x and 0. If, for example, two-decimal-digit floating-point numbers are used, then

fl(2/3) = (.67)10^0  in rounding    fl(2/3) = (.66)10^0  in chopping

On some computers, this definition of fl(x) is modified in case |x| ≥ β^M (overflow) or 0 < |x| < β^(m−1) (underflow), where m and M are the bounds on the exponents; either fl(x) is not defined in this case, causing a stop, or else fl(x) is represented by a special number which is not subject to the usual rules of arithmetic when combined with ordinary floating-point numbers.

The difference between x and fl(x) is called the round-off error. The
round-off error depends on the size of x and is therefore best measured
relative to x. For if we write

    fl(x) = x(1 + δ)        (1.7)

where δ = δ(x) is some number depending on x, then it is possible to
bound δ independently of x, at least as long as x causes no overflow or
underflow. For such an x, it is not difficult to show for n-decimal-digit
floating-point numbers that

    |δ| ≤ (1/2) · 10^(1−n)    in rounding        (1.8)

while

    |δ| ≤ 10^(1−n)    in chopping        (1.9)

See Exercise 1.3-3. The maximum possible value for |δ| is often called the
unit roundoff and is denoted by u.
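These definitions and bounds are easy to experiment with. The sketch below (all names ours) builds an n-decimal-digit fl(x) for both rounding and chopping and checks the bounds (1.8) and (1.9), that is, |δ| ≤ ½·10^(1−n) and |δ| ≤ 10^(1−n), on a few sample numbers:

```python
import math

def fl(x, n=4, mode='round'):
    """n-decimal-digit normalized floating-point value of x (round or chop)."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1   # exponent so 0.1 <= |mantissa| < 1
    scale = 10.0 ** (n - e)
    y = x * scale                            # mantissa shifted to n integer digits
    y = round(y) if mode == 'round' else math.trunc(y)   # trunc chops toward 0
    return y / scale

n = 4
u_round = 0.5 * 10.0 ** (1 - n)              # bound (1.8)
u_chop = 10.0 ** (1 - n)                     # bound (1.9)
xs = [2.0 / 3.0, math.pi, 0.012345, 98765.0, -1.0 / 7.0]
worst_round = max(abs((fl(x, n, 'round') - x) / x) for x in xs)
worst_chop = max(abs((fl(x, n, 'chop') - x) / x) for x in xs)
```

For instance, fl(2/3) comes out 0.6667 when rounding and 0.6666 when chopping, in agreement with the two-digit example above scaled to n = 4.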

When an arithmetic operation is applied to two floating-point numbers, the result usually fails to be a floating-point number of the same
length. If, for example, we deal with two-decimal-digit numbers, then the
product of the two floating-point numbers 0.15 · 10^1 and 0.23 · 10^1 is

    0.345 · 10^1

whose mantissa has more than two digits.

Hence, if ∘ denotes one of the arithmetic operations (addition, subtraction,
multiplication, or division) and ⊛ denotes the floating-point operation of
the same name provided by the computer, then, however the computer
may arrive at the result x ⊛ y for two given floating-point numbers x and
y, we can be sure that usually

    x ⊛ y ≠ x ∘ y

Although the floating-point operation ⊛ corresponding to ∘ may vary in
some details from machine to machine, ⊛ is usually constructed so that

    x ⊛ y = fl(x ∘ y)        (1.10)

In words, the floating-point sum (difference, product, or quotient) of two
floating-point numbers usually equals the floating-point number which
represents the exact sum (difference, product, or quotient) of the two
numbers. Hence (unless overflow or underflow occurs) we have

    x ⊛ y = (x ∘ y)(1 + δ)    for some |δ| ≤ u        (1.11a)

where u is the unit roundoff. In certain situations, it is more convenient to
use the equivalent formula

    x ⊛ y = (x ∘ y)/(1 + δ)    for some |δ| ≤ u        (1.11b)

Equation (1.11) expresses the basic idea of backward error analysis (see J.

H. Wilkinson [24]†). Explicitly, Eq. (1.11) allows one to interpret a floating-point result as the result of the corresponding ordinary arithmetic, but

performed on slightly perturbed data. In this way, the analysis of the effect

of floating-point arithmetic can be carried out in terms of ordinary

arithmetic.

For example, the value of the function f(x) = x^(2^n) at a point x0 can be
calculated by n squarings, i.e., by carrying out the sequence of steps

    x1 = x0^2,  x2 = x1^2,  . . . ,  xn = x(n−1)^2

with xn = f(x0). In floating-point arithmetic, we compute instead, according to Eq. (1.11a), the sequence of numbers

†Numbers in brackets refer to items in the references at the end of the book.

    x1* = x0^2(1 + δ1),  x2* = (x1*)^2(1 + δ2),  . . . ,  xn* = (x(n−1)*)^2(1 + δn)

with |δi| ≤ u, all i. The computed answer is, therefore,

    xn* = x0^(2^n) (1 + δ1)^(2^(n−1)) (1 + δ2)^(2^(n−2)) · · · (1 + δn)

To simplify this expression, we observe that, if |δi| ≤ u for all i, then

    (1 + δ1)(1 + δ2) · · · (1 + δk) = (1 + δ)^k    for some |δ| ≤ u

(see Exercise 1.3-6). Also then (1 + δ)^(2^n − 1) = (1 + η)^(2^n) for some
|η| ≤ u. Consequently,

    xn* = (x0(1 + η))^(2^n)    for some |η| ≤ u

In words, the computed value xn* is the
exact value of f(x) at the perturbed argument x0(1 + η).

We can now gauge the effect which the use of floating-point arithmetic

has had on the accuracy of the computed value for f(x0) by studying how

the value of the (exactly computed) function f(x) changes when the

argument x is perturbed, as is done in the next section. Further, we note

that this error is, in our example, comparable to the error due to the fact

that we had to convert the initial datum x0 to a floating-point number to

begin with.
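The same backward-error statement can be observed numerically. A sketch (ours) that simulates six-significant-decimal-digit arithmetic, so that u = 0.5·10^(−5), inside ordinary Python floats:

```python
def fl6(x):
    """Round x to six significant decimal digits (unit roundoff u = 0.5e-5)."""
    return float(f"{x:.5e}")

x0, n = 0.99, 5                    # compute f(x0) = x0**(2**5) by 5 squarings
xh = x0
for _ in range(n):
    xh = fl6(xh * xh)              # each squaring committed with a rounding error

exact = x0 ** (2 ** n)
fwd = abs(xh - exact) / exact      # forward relative error of the answer
# Backward error: the eta with  xh == (x0*(1 + eta))**(2**n)
eta = xh ** (1.0 / 2 ** n) / x0 - 1.0
u = 0.5e-5
```

The backward perturbation eta comes out a small fraction of u, even though the forward error is several times larger, which is exactly the point of the analysis above.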

As a second example, of particular interest in Chap. 4, consider
calculation of the number s from the equation

    a1x1 + a2x2 + · · · + arxr + ar+1 s = b        (1.12)

by the formula

    s = (b − a1x1 − a2x2 − · · · − arxr)/ar+1

If we obtain s through the steps

    s0 := b,  si := si−1 − aixi (i = 1, . . . , r),  s := sr/ar+1

then the corresponding numbers computed in floating-point arithmetic
satisfy

    s0* = b        si* = (si−1* − aixi(1 + δ))/(1 + δ)        sr* = ar+1 s*(1 + δ)

Here, we have used Eqs. (1.11a) and (1.11b), and have not bothered to
distinguish the various δ’s by subscripts. Consequently,

    s* = [b − a1x1(1 + δ) − a2x2(1 + δ)^2 − · · · − arxr(1 + δ)^r] / [ar+1(1 + δ)^(r+1)]

This shows that the computed value s* for s satisfies the perturbed equation

    a1x1(1 + δ) + a2x2(1 + δ)^2 + · · · + arxr(1 + δ)^r + ar+1 s*(1 + δ)^(r+1) = b        (1.13)

Note that we can reduce the last exponent by 1 in case ar+1 = 1, that is, in
case the last division need not be carried out.
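Assuming the calculation has this back-substitution shape, s = (b − a1x1 − · · · − arxr)/ar+1, the perturbed-equation conclusion can be spot-checked numerically: substitute the computed s into the exact equation and compare the residual with what a backward error of a few units roundoff per coefficient allows. All data below are made-up sample values:

```python
def fl6(x):
    """Round x to six significant decimal digits (unit roundoff u = 0.5e-5)."""
    return float(f"{x:.5e}")

a = [2.5, -3.125, 1.75, 4.0]       # a1, ..., a_{r+1} with r = 3 (sample data)
x = [1.2, 0.875, -2.2]             # x1, ..., xr
b = 3.7

s = b
for ai, xi in zip(a[:-1], x):
    s = fl6(s - fl6(ai * xi))      # one rounded product and subtraction per step
s = fl6(s / a[-1])                 # the final rounded division

# Residual of the *exact* equation at the computed s, and the size allowed
# by the perturbed-equation view (each term perturbed by a few units u):
residual = abs(sum(ai * xi for ai, xi in zip(a[:-1], x)) + a[-1] * s - b)
u = 0.5e-5
weight = (sum(abs(ai * xi) for ai, xi in zip(a[:-1], x))
          + abs(a[-1] * s) + abs(b))
bound = (len(x) + 2) * u * weight
```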

EXERCISES

1.3-1 The following numbers are given in a decimal computer with a four-digit normalized

mantissa:

Perform the following operations, and indicate the error in the result, assuming symmetric

rounding:

1.3-2 Let fl(x) be given by chopping. Show that … and that … (unless overflow or underflow
occurs).

1.3-3 Let fl(x) be given by chopping and let δ be such that fl(x) = x(1 + δ). Show that then δ
is bounded as in (1.9).

1.3-4 Give examples to show that most of the laws of arithmetic fail to hold for floating-point

arithmetic. (Hint: Try laws involving three operands.)

1.3-5 Write a FORTRAN FUNCTION FL(X) which returns the value of the n-decimal-digit
floating-point number derived from X by rounding. Take n to be 4 and check your
calculations in Exercise 1.3-1. [Use ALOG10(ABS(X)) to determine e such that … .]

1.3-6 Let δ1, . . . , δk be numbers with |δi| ≤ u < 1, all i. Show that there exists δ with
|δ| ≤ u such that

    (1 + δ1)(1 + δ2) · · · (1 + δk) = (1 + δ)^k

Show also that … for some … , provided δ1, . . . , δk all have the same sign.

1.3-7 Carry out a backward error analysis for the calculation of the scalar product

    a1b1 + a2b2 + · · · + anbn

Redo the analysis under the assumption that double-precision accumulation is used. This
means that the double-precision results of each multiplication are retained and added to the
sum in double precision, with the resulting sum rounded only at the end to single precision.


1.4 LOSS OF SIGNIFICANCE AND ERROR PROPAGATION;

CONDITION AND INSTABILITY

If the number x* is an approximation to the exact answer x, then we call
the difference x − x* the error in x*; thus

    Exact = approximation + error        (1.14)

The relative error in x*, as an approximation to x, is defined to be the
number (x − x*)/x. Note that this number is close to the number
(x − x*)/x* if it is at all small. [Precisely, if (x − x*)/x = δ, then
(x − x*)/x* = δ/(1 − δ).]

Every floating-point operation in a computational process may give

rise to an error which, once generated, may then be amplified or reduced

in subsequent operations.

One of the most common (and often avoidable) ways of increasing the
importance of an error is commonly called loss of significant digits. If x* is
an approximation to x, then we say that x* approximates x to r significant
digits provided the absolute error |x − x*| is at most one-half unit in the rth
significant digit of x. This can be expressed in a formula as

    |x − x*| ≤ (1/2) · 10^(s−r+1)        (1.15)

with s the largest integer such that 10^s ≤ |x|. For instance, x* = 3 agrees
with x = π = 3.14159 · · · to one significant (decimal) digit, while x* = 3.14
is correct to three significant digits (as an approximation to π). Suppose
now that we are to calculate the number

    z = x − y

and that we have approximations x* and y* for x and y, respectively,
available, each of which is good to r digits. Then

    z* = x* − y*

is an approximation for z, which is also good to r digits unless x* and y*
agree to one or more digits. In this latter case, there will be cancellation of
digits during the subtraction, and consequently z* will be accurate to fewer
than r digits.

Consider, for example,

    x* = .12345421 · 10^1        y* = .12344200 · 10^1

and assume each to be an approximation to x and y, respectively, correct
to seven significant digits. Then, in eight-digit floating-point arithmetic,

    z* = x* − y* = .12210000 · 10^(−3)

is the exact difference between x* and y*. But as an approximation to
z = x − y, z* is good only to three digits, since the fourth significant digit
of z* is derived from the eighth digits of x* and y*, both possibly in error.


Hence, while the error in z* (as an approximation to z = x - y) is at most

the sum of the errors in x* and y*, the relative error in z* is possibly 10,000

times the relative error in x* or y*. Loss of significant digits is therefore

dangerous only if we wish to keep the relative error small.

Such loss can often be avoided by anticipating its occurrence. Consider, for example, the evaluation of the function

    f(x) = 1 − cos x

in six-decimal-digit arithmetic. Since cos x ≈ 1 for x near zero, there will
be loss of significant digits for x near zero if we calculate f(x) by first
finding cos x and then subtracting the calculated value from 1. For we
cannot calculate cos x to more than six digits, so that the error in the
calculated value may be as large as 5 · 10^(−7), hence as large as, or larger
than, f(x) for x near zero. If one wishes to compute the value of f(x) near
zero to about six significant digits using six-digit arithmetic, one would
have to use an alternative formula for f(x), such as

    f(x) = sin^2 x / (1 + cos x)

which can be evaluated quite accurately for small x; else, one could make
use of the Taylor expansion (see Sec. 1.7) for f(x),

    f(x) = x^2/2 − x^4/24 + x^6/720 − · · ·

which shows, for example, that for |x| ≤ 0.1, x^2/2 − x^4/24 agrees with f(x) to at
least six significant digits.
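The same cancellation appears in full double precision, only at smaller x; a brief check (the threshold x is our choice):

```python
import math

x = 1.2e-8
naive = 1.0 - math.cos(x)                        # cos x agrees with 1: cancellation
stable = math.sin(x) ** 2 / (1.0 + math.cos(x))  # algebraically identical form
exact = x * x / 2.0                              # leading Taylor term, ample here

rel_naive = abs(naive - exact) / exact           # large: most digits are lost
rel_stable = abs(stable - exact) / exact         # tiny: essentially full accuracy
```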

Another example is provided by the problem of finding the roots of
the quadratic equation

    ax^2 + bx + c = 0        (1.16)

We know from algebra that the roots are given by the quadratic formula

    x = (−b ± √(b^2 − 4ac)) / (2a)        (1.17)

Let us assume that b^2 − 4ac > 0, that b > 0, and that we wish to find the
root of smaller absolute value using (1.17); i.e.,

    x = (−b + √(b^2 − 4ac)) / (2a)        (1.18)

If 4ac is small compared with b^2, then √(b^2 − 4ac) will agree with b to
several places. Hence, given that √(b^2 − 4ac) will be calculated correctly

only to as many places as are used in the calculations, it follows that the

numerator of (1.18), and therefore the calculated root, will be accurate to

fewer places than were used during the calculation. To be specific, take the
equation

    x^2 + 111.11x + 1.2121 = 0        (1.19)

Using (1.18) and five-decimal-digit floating-point chopped arithmetic, we
calculate

    x* = −0.015000

while, in fact,

    x = −0.010910 · · ·

is the correct root to the number of digits shown. Here too, the loss of

significant digits can be avoided by using an alternative formula for the

calculation of the absolutely smaller root, viz.,

    x = −2c / (b + √(b^2 − 4ac))        (1.20)

Using this formula and five-decimal-digit chopped arithmetic, we calculate

    x* = −0.010910

which is accurate to five digits.
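In double precision the same contrast can be reproduced with, say, a = 1, b = 10^8, c = 1 (illustrative values, not the text's example):

```python
import math

def small_root_naive(a, b, c):
    """Smaller-magnitude root by (1.18); cancellation when 4ac << b*b."""
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def small_root_stable(a, b, c):
    """The same root by the cancellation-free form (1.20)."""
    return -2.0 * c / (b + math.sqrt(b * b - 4.0 * a * c))

a, b, c = 1.0, 1.0e8, 1.0
bad = small_root_naive(a, b, c)    # noticeably wrong: -b + disc cancels
good = small_root_stable(a, b, c)  # close to -1.0e-8, essentially full accuracy
```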

Once an error is committed, it contaminates subsequent results. This

error propagation through subsequent calculations is conveniently studied

in terms of the two related concepts of condition and instability.

The word condition is used to describe the sensitivity of the function
value f(x) to changes in the argument x. The condition is usually measured
by the maximum relative change in the function value f(x) caused by a
unit relative change in the argument. In a somewhat informal formula,

    condition of f at x = |x f′(x) / f(x)|        (1.21)

The larger the condition, the more ill-conditioned the function is said to
be. Here we have made use of the fact (see Sec. 1.7) that

    f(x) − f(x*) ≈ f′(x)(x − x*)

i.e., the change in argument from x to x* changes the function value by
approximately f′(x)(x − x*). If, for example,

    f(x) = √x

then f′(x) = 1/(2√x); hence the condition of f is, approximately,

    |x f′(x) / f(x)| = |x / (2√x · √x)| = 1/2

This says that taking square roots is a well-conditioned process since it
actually reduces the relative error. By contrast, if

    f(x) = 1/(1 − x^2)

then

    f′(x) = 2x/(1 − x^2)^2

so that

    |x f′(x) / f(x)| = 2x^2 / |1 − x^2|

and this number can be quite large for |x| near 1. Thus, for x near 1 or

- 1, this function is quite ill-conditioned. It very much magnifies relative

errors in the argument there.

The related notion of instability describes the sensitivity of a numerical

process for the calculation of f(x) from x to the inevitable rounding errors

committed during its execution in finite precision arithmetic. The precise

effect of these errors on the accuracy of the computed value for f(x) is

hard to determine except by actually carrying out the computations for

particular finite precision arithmetics and comparing the computed answer

with the exact answer. But it is possible to estimate these effects roughly by

considering the rounding errors one at a time. This means we look at the

individual computational steps which make up the process. Suppose there

are n such steps. Denote by xi the output from the ith such step, and take

x0 = x. Such an xi then serves as input to one or more of the later steps

and, in this way, influences the final answer xn = f(x). Denote by fi the
function which describes the dependence of the final answer on the
intermediate result xi. In particular, f0 is just f. Then the total process is
unstable to the extent that one or more of these functions fi is ill-conditioned. More precisely, the process is unstable to the extent that one or
more of the fi’s has a much larger condition than f = f0 has. For it is the

condition of fi which gauges the relative effect of the inevitable rounding

error incurred at the ith step on the final answer.

To give a simple example, consider the function

    f(x) = √(x + 1) − √x

for “large” x, say for x = 12345. Its condition there is

    |x f′(x)/f(x)| = x / (2 √x √(x + 1)) ≈ 1/2

which is quite good. But, if we calculate f(12345) in six-decimal arithmetic,
we find

    f*(12345) = 111.113 − 111.108 = 0.00500000

while, actually,

    f(12345) = 0.00450003 · · ·

So our calculated answer is in error by 10 percent. We analyze the
computational process. It consists of the following four computational
steps:

    x1 := x0 + 1,  x2 := √x1,  x3 := √x0,  x4 := x2 − x3        (1.22)

Now consider, for example, the function f3, i.e., the function which
describes how the final answer x4 depends on x3. We have

    f3(t) = x2 − t

hence its condition is, approximately,

    |t f3′(t)/f3(t)| = |t / (x2 − t)|

This number is usually near 1, i.e., f3 is usually well-conditioned, except
when t is near x2. In this latter case, f3 can be quite badly conditioned. For
example, in our particular case, x2 = 111.113 while t = x3 = 111.108, so the
condition is

    111.108/0.005 ≈ 22,000

or more than 40,000 times as big as the condition of
f itself.

We conclude that the process described in (1.22) is an unstable way to
evaluate f. Of course, if you have read the beginning of this section
carefully, then you already know a stable way to evaluate this function,
namely by the equivalent formula

    f(x) = 1 / (√(x + 1) + √x)

In six-decimal arithmetic, this gives

    f*(12345) = 1/222.221 = 0.00450002


which is in error by only 0.0003 percent. The computational process is now

    x1 := x0 + 1,  x2 := √x0,  x3 := √x1,  x4 := x2 + x3,  x5 := 1/x4        (1.23)

Here, for example, f3(t) = 1/(x2 + t), and the condition of this function is,
approximately,

    |t f3′(t)/f3(t)| = t/(x2 + t)

which is less than 1 for positive t and x2, which is the case here. Thus, the condition of f3 is quite good; it
is as good as that of f itself.

We will meet other examples of large instability, particularly in the

discussion of the numerical solution of differential equations.
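The two processes can be replayed by simulating six-significant-digit arithmetic inside double precision (the helper fl6 is ours):

```python
import math

def fl6(x):
    """Round x to six significant decimal digits."""
    return float(f"{x:.5e}")

x = 12345.0
r_plus = fl6(math.sqrt(fl6(x + 1.0)))    # 111.113
r = fl6(math.sqrt(x))                    # 111.108
unstable = fl6(r_plus - r)               # process (1.22): subtraction cancels
stable = fl6(1.0 / fl6(r_plus + r))      # process (1.23): no cancellation

exact = math.sqrt(x + 1.0) - math.sqrt(x)
```

The unstable form is off by about 10 percent, while the stable form is correct to nearly all six digits carried.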

EXERCISES

1.4-l Find the root of smallest magnitude of the equation

using formulas (1.18) and (1.20). Work in floating-point arithmetic using a four- (decimal-)

place mantissa.

1.4-2 Estimate the error in evaluating f(x) = … around x = 2 if the absolute
error in x is 10^(−6).

1.4-3 Find a way to calculate …
correctly to the number of digits used when x is near zero for (a)-(c), very much larger than …
for (d).

1.4-4 Assuming a computer with a four-decimal-place mantissa, add the following numbers

first in ascending order (from smallest to largest) and then in descending order. In doing so

round off the partial sums. Compare your results with the correct sum x = 0.107101023 · 10^5.

1.4-5 A dramatically unstable way to calculate f(x) = e^x for negative x is provided by its
Taylor series (1.36). Calculate e^(−12) by evaluating the Taylor series (1.36) at x = −12, and
compare with the accurate value e^(−12) = 0.00000 61442 12354 · · · . [Hint: By (1.36), the
difference between e^x and the partial sum

    1 + x + x^2/2! + · · · + x^n/n!

is less than the next term x^(n+1)/(n + 1)!
in absolute value, in case x is negative. So, it would be all right to sum the series until
the terms no longer change the sum.]

1.4-6 Explain the result of Exercise 1.4-5 by comparing the condition of f(x) = e^x near
x = −12 with the condition of some of the functions fi involved in the computational
process. Then find a stable way to calculate e^(−12) from the Taylor series (1.36). (Hint:
e^(−x) = 1/e^x.)

1.5 COMPUTATIONAL METHODS FOR ERROR

ESTIMATION

This chapter is intended to make the student aware of the possible sources

of error and to point out some techniques which can be used to avoid these

errors. In appraising computer results, such errors must be taken into

account. Realistic estimates of the total error are difficult to make in a
practical problem, and an adequate mathematical theory is still lacking.

An appealing idea is to make use of the computer itself to provide us with

such estimates. Various methods of this type have been proposed. We shall

discuss briefly five of them. The simplest method makes use of double

precision. Here one simply solves the same problem twice—once in single

precision and once in double precision. From the difference in the results

an estimate of the total round-off error can then be obtained (assuming

that all other errors are less significant). It can then be assumed that the

same accumulation of roundoff will occur in other problems solved with

the same subroutine. This method is extremely costly in machine time

since double-precision arithmetic increases computer time by a factor of 8

on some machines, and in addition, it is not always possible to isolate

other errors.

A second method is interval arithmetic. Here each number is represented by two machine numbers, the maximum and the minimum values

that it might have. Whenever an operation is performed, one computes its

maximum and minimum values. Essentially, then, one will obtain two

solutions at every step, the true solution necessarily being contained within

the range determined by the maximum and minimum values. This method

requires more than twice the amount of computer time and about twice the

storage of a standard run. Moreover, the usual assumption that the true

solution lies about midway within the range is not, in general, valid. Thus

the range might be so large that any estimate of the round-off error based

upon this would be grossly exaggerated.
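A toy version of the idea, with exact interval endpoints (a production implementation would also round lower ends down and upper ends up):

```python
def iadd(a, b):
    # Sum of two intervals: endpoints add.
    return (a[0] + b[0], a[1] + b[1])

def isub(a, b):
    # Difference: lower end minus upper end, and vice versa.
    return (a[0] - b[1], a[1] - b[0])

def imul(a, b):
    # Product: extremes among the four endpoint products.
    p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(p), max(p))

x = (1.9, 2.1)        # "2, uncertain by 0.1"
y = (-0.3, -0.1)      # "-0.2, uncertain by 0.1"
z = imul(iadd(x, y), isub(x, y))               # encloses (x + y)(x - y)
true_value = (2.0 + (-0.2)) * (2.0 - (-0.2))   # 3.96 with the midpoint data
```

Note how quickly the enclosing interval grows relative to the input uncertainty, which is exactly the pessimism described above.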

A third approach is significant-digit arithmetic. As pointed out earlier,

whenever two nearly equal machine numbers are subtracted, there is a

danger that some significant digits will be lost. In significant-digit

arithmetic an attempt is made to keep track of digits so lost. In one version


only the significant digits in any number are retained, all others being

discarded. At the end of a computation we will thus be assured that all

digits retained are significant. The main objection to this method is that

some information is lost whenever digits are discarded, and that the results

obtained are likely to be much too conservative. Experimentation with this

technique is still going on, although the experience to date is not too

promising.

A fourth method which gives considerable promise of providing an

adequate mathematical theory of round-off-error propagation is based on

a statistical approach. It begins with the assumption that round-off errors

are independent. This assumption is, of course, not valid, because if the

same problem is run on the same machine several times, the answers will

always be the same. We can, however, adopt a stochastic model of the

propagation of round-off errors in which the local errors are treated as if

they were random variables. Thus we can assume that the local round-off

errors are either uniformly or normally distributed between their extreme

values. Using statistical methods, we can then obtain the standard deviation, the variance of distribution, and estimates of the accumulated roundoff error. The statistical approach is considered in some detail by Hamming [1] and Henrici [2]. The method does involve substantial analysis and

additional computer time, but in the experiments conducted to date it has

obtained error estimates which are in remarkable agreement with experimentally available evidence.

A fifth method is backward error analysis, as introduced in Sec. 1.3. As

we saw, it reduces the analysis of rounding error effects to a study of

perturbations in exact arithmetic and, ultimately, to a question of condition. We will make good use of this method in Chap. 4.

1.6 SOME COMMENTS ON CONVERGENCE OF

SEQUENCES

Calculus, and more generally analysis, is based on the notion of convergence. Basic concepts such as derivative, integral, and continuity are

defined in terms of convergent sequences, and elementary functions such

as ln x or sin x are defined by convergent series. At the same time,

numerical answers to engineering and scientific problems are never needed

exactly. Rather, an approximation to the answer is required which is

accurate “to a certain number of decimal places,” or accurate to within a
given tolerance ε.

It is therefore not surprising that many numerical methods for finding
the answer α of a given problem merely produce (the first few terms of) a
sequence α1, α2, α3, . . . which is shown to converge to the desired answer.


To recall the definition: A sequence α1, α2, α3, . . .
of (real or complex) numbers converges to α if and only if, for all ε > 0,
there exists an integer N = N(ε) such that

    |αn − α| < ε    for all n ≥ N

Hence, if we have a numerical method which produces a sequence
α1, α2, α3, . . . converging to the desired answer α, then we can calculate α to
any desired accuracy merely by calculating αn for “large enough” n.

From a computational point of view, this definition is unsatisfactory
for the following reasons: (1) It is often not possible (without knowing the
answer α) to know when n is “large enough.” In other words, it is difficult
to get hold of the function N(ε) mentioned in the definition of convergence. (2) Even when some knowledge about N(ε) is available, it may turn
out that the required n is too large to make the calculation of αn feasible.

Example The number π/4 is the value of the infinite series

    1 − 1/3 + 1/5 − 1/7 + 1/9 − · · ·

Hence, with

    αn = 1 − 1/3 + 1/5 − · · · + 1/(4n + 1)    n = 1, 2, 3, . . .

the sequence α1, α2, α3, . . . is monotone-decreasing to its limit π/4. Moreover,

    0 < αn − π/4 < 1/(4n + 3)

To calculate π/4 correct to within 10^(−6) using this sequence, we would need 10^6 < 4n +
3, or roughly, n = 250,000. On a computer using eight-decimal-digit floating-point
arithmetic, round-off in the calculation of αn is probably much larger than 10^(−6).
Hence π/4 could not be computed to within 10^(−6) using this sequence (except, perhaps,
by adding the terms from smallest to largest).
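The bound 1/(4n + 3) quoted above is easy to confirm: take the partial sums of the Leibniz series 1 − 1/3 + 1/5 − · · · = π/4 through the term 1/(4n + 1), so that the alternating-series bound applies (the series identity is a standard fact assumed here):

```python
import math

def a(n):
    """Partial sum of 1 - 1/3 + 1/5 - ... through the term k = 2n."""
    return sum((-1) ** k / (2 * k + 1) for k in range(2 * n + 1))

errs = [a(n) - math.pi / 4 for n in range(1, 50)]
decreasing = all(e1 > e2 > 0 for e1, e2 in zip(errs, errs[1:]))
bounded = all(e < 1.0 / (4 * n + 3) for n, e in zip(range(1, 50), errs))
```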

To deal with these problems, some notation is useful. Specifically, we
would like to measure how fast sequences converge. As with all measuring,
this is done by comparison, with certain standard sequences, such as

    1/n,  1/n^2,  1/√n,  1/2^n

The comparison is made as follows: one says that the sequence α1, α2, . . .
converges to α with (at least) the order of βn, or that αn − α is of order βn, and writes

    αn − α = O(βn)        (1.24)

in case

    |αn − α| ≤ K|βn|        (1.25)


for some constant K and all sufficiently large n. Thus, for instance, 1/(n(n + 1)) = O(1/n^2).

Further, if it is possible to choose the constant K in (1.25) arbitrarily small
as soon as n is large enough; that is, should it happen that

    lim_{n→∞} (αn − α)/βn = 0

then one says that αn − α is of higher order than βn, and
writes

    αn − α = o(βn)        (1.26)

Thus, for instance, 1/(n ln n) = o(1/n), while sin(1/n) = O(1/n).

The order notation appears customarily only on the right-hand side of
an equation and serves the purpose of describing the essential feature of an
error term without bothering about multiplying constants or other detail.
For instance, we can state concisely the unsatisfactory state of affairs in
the earlier example by saying that

    αn − π/4 = O(1/n)    but also    1/n = O(αn − π/4)

i.e., the series converges to π/4 as fast as 1/n (goes to zero) but no faster.
A convergence order or rate of 1/n is much too slow to be useful in
calculations.

Example If αn − α = o(1), then, by definition,

    lim_{n→∞} (αn − α)/1 = 0

Hence αn − α = o(1)
is just a fancy way of saying that the sequence α1, α2, . . .
converges to α.

Example If |r| < 1, then the geometric series 1 + r + r^2 + · · · sums to 1/(1 − r). With

    αn = 1 + r + · · · + r^n = (1 − r^(n+1))/(1 − r)

we have

    1/(1 − r) − αn = r^(n+1)/(1 − r) = O(r^n)

Hence, whenever αn − α = O(r^n) for some |r| < 1, we say that the convergence is (at
least) geometric, for it is then (at least) of the same order as the convergence of the
geometric series.

Although it is better to know that αn − α = O(βn) than to know nothing,
knowledge about the order of convergence becomes quite useful only when
we know more precisely that

    αn − α ≈ Kβn    for “large enough” n

To put it differently,

    (αn − α)/βn = K + εn

where ε1, ε2, . . . is a sequence converging to zero. Although we cannot
prove that a certain n is “large enough,” we can test the hypothesis that n is
“large enough” by comparing the computable differences αk+1 − αk with
K(βk+1 − βk). If

    αk+1 − αk ≈ K(βk+1 − βk)

for k near n, say for k = n − 2, n − 1, n, then we accept the hypothesis
that n is “large enough” for αn − α ≈ Kβn to be true, and therefore accept
Kβn as a good estimate of the error αn − α.

Example Let p > 1. Then the series 1 + 1/2^p + 1/3^p + · · · converges to its limit α much
more slowly than the geometric series. To get a more precise statement, consider

    αn = 1 + 1/2^p + · · · + 1/n^p

Then

    α − αn = 1/(n + 1)^p + 1/(n + 2)^p + · · · ≈ 1/((p − 1)(n + 1/2)^(p−1))

For the ratios (α − αn)/(1/((p − 1)(n + 1/2)^(p−1))), we find values
which are, e.g., within 1/10 of 1 for n = 3 and p = 2. Thus 1/((p − 1)(n + 1/2)^(p−1))
is then a
good indication of the error in αn. In fact, for p = 2 and n = 8,
the error α − αn
is 0.117512 · · · , while the estimate gives 1/8.5 = 0.117647 · · · .
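For p = 2 the limit is known to be π²/6 (an outside fact), so the O(1/n^(p−1)) rate is easy to observe directly:

```python
import math

limit = math.pi ** 2 / 6              # sum of 1/k**2, the case p = 2

def partial(n):
    return sum(1.0 / (k * k) for k in range(1, n + 1))

# (limit - partial(n)) * n should approach 1 as n grows, confirming that
# the error behaves like 1/n (i.e., like 1/n**(p-1) with p = 2):
ratios = [(limit - partial(n)) * n for n in (10, 100, 1000)]
```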

This notation carries over to functions of a real variable. If

    lim_{h→0} F(h) = c

we say that the convergence is O(f(h)) provided

    |F(h) − c| ≤ K|f(h)|

for some finite constant K and all small enough h. If this holds for all
K > 0, that is, if

    lim_{h→0} (F(h) − c)/f(h) = 0

then we call the convergence o(f(h)).

Example For h “near” zero, we have

    sin h − h = O(h^3)

since |sin h − h| ≤ |h|^3/6. Hence, for all α < 3, also sin h − h = o(h^α).

Example If the function f(x) has a zero of order α at the point z,
then f(z + h) = O(h^α).

Rules for calculating with the order symbols are collected in the following
lemma.

Lemma 1.1 If αn − α = O(βn) and c is a constant,
then c(αn − α) = O(βn). If also γn − γ = O(βn),
then

    (αn − α) + (γn − γ) = O(βn)        (1.27)

If, further, βn = O(δn),
then also αn − α = O(δn),
while if γn − γ = O(δn),
then

    (αn − α)(γn − γ) = O(βn δn)

Finally, all statements remain true if O
is replaced by o throughout.

The approximate calculation of a number α via a sequence α1, α2, . . .
converging to α always involves an act of faith, regardless of whether or not
the order of convergence is known. Given that the sequence is known to
converge to α, practicing numerical analysts ascertain that n is “large
enough” by making sure that, for small values of k, αn−k differs “little
enough” from αn. If they also know that the convergence is O(βn),

they check whether or not the sequence behaves accordingly near n. If they

also know that α satisfies certain equations or inequalities—α might be the
sought-for solution of an equation—they check that αn satisfies these

equations or inequalities “well enough.” In short, practicing numerical

analysts make sure that n satisfies all conditions they can think of which

are necessary for n to be “large enough.” If all these conditions are

satisfied, then, lacking sufficient conditions for n to be “large enough,”

they accept αn
on faith as a good enough approximation to α.
In a way,

numerical analysts use all means at their disposal to distinguish a “good

enough” approximation from a bad one. They can do no more (and should

do no less).

It follows that numerical results arrived at in this way should not be

mistaken for final answers. Rather, they should be questioned freely if

subsequent investigations throw any doubt upon their correctness.

The student should appreciate this as another example of the basic

difference between numerical analysis and analysis. Analysis became a

precise discipline when it left the restrictions of practical calculations to

deal entirely with problems posed in terms of an abstract model of the

number system, called the real numbers. This abstract model is designed to

make a precise and useful definition of limit possible, which opens the way

to the abstract or symbolic solution of an impressive array of practical

problems, once these problems are translated into the terms of the model.

This still leaves the task of translating the abstract or symbolic solutions

back into practical solutions. Numerical analysis assumes this task, and

with it the limitations of practical calculations from which analysis

managed to escape so elegantly. Numerical answers are therefore usually

tentative and, at best, known to be accurate only to within certain bounds.

Numerical analysis is therefore not merely concerned with the construction of numerical methods. Rather, a large portion of numerical

analysis consists in the derivation of useful error bounds, or error estimates,

for the numerical answers produced by a numerical algorithm. Throughout

this book, the student will meet this preoccupation with error bounds so

typical of numerical analysis.


EXERCISES

1.6-1 The number ln 2 may be calculated from the series

    ln 2 = 1 − 1/2 + 1/3 − 1/4 + · · ·

It is known from analysis that this series converges and that the magnitude of the error in any

partial sum is less than the magnitude of the first neglected term. Estimate the number of

terms that would be required to calculate ln 2 to 10 decimal places.

1.6-2 For h near zero it is possible to write

    … = O(h^α)    and    … = O(h^β)

Find the values of α and β for which these equalities hold.

1.6-3 Try to calculate, on a computer, the limit of the sequence … . Theoretically, what is
the limit, and what is the order of convergence of the sequence?

1.7 SOME MATHEMATICAL PRELIMINARIES

It is assumed that the student is familiar with the topics normally covered

in the undergraduate analytic geometry and calculus sequence. These

include elementary notions of real and complex number systems; continuity; the concept of limits, sequences, and series; differentiation and integration. For Chap. 4, some knowledge of determinants is assumed. For

Chaps. 8 and 9, some familiarity with the solution of ordinary differential

equations is also assumed, although these chapters may be omitted.

In particular, we shall make frequent use of the following theorems.

Theorem 1.1: Intermediate-value theorem for continuous functions Let
f(x) be a continuous function on the interval [a,b]. If f(c) ≤ γ ≤ f(d)
for some number γ and some points c and d in [a,b], then there exists
ξ in [a,b] such that f(ξ) = γ.

This theorem is often used in the following form:

Theorem 1.2 Let f(x) be a continuous function on [a,b], let x1, . . . , xn
be points in [a,b], and let g1, . . . , gn be real numbers all of one sign.
Then

    f(x1)g1 + f(x2)g2 + · · · + f(xn)gn = f(ξ)[g1 + g2 + · · · + gn]    for some ξ in [a,b]

To indicate the proof, assume without loss of generality that gi > 0,
all i; then

    [f(x1)g1 + · · · + f(xn)gn] / [g1 + · · · + gn]

is a number between the two values min_i f(xi)
and max_i f(xi)
of the continuous function f(x), and the conclusion follows
from Theorem 1.1.

One proves analogously the corresponding statement for infinite sums
or integrals. Hence:

Theorem 1.3: Mean-value theorem for integrals Let g(x) be a nonnegative or nonpositive integrable function on [a,b]. If f(x) is continuous
on [a,b], then

    ∫_a^b f(x)g(x) dx = f(ξ) ∫_a^b g(x) dx    for some ξ in [a,b]        (1.28)

Warning The assumption that g(x) is of one sign is essential in Theorem
1.3, as the simple example f(x) = g(x) = x, [a,b] = [−1,1],
shows.

Theorem 1.4 Let f(x) be a continuous function on the closed and
bounded interval [a,b]. Then f(x) “assumes its maximum and minimum values on [a,b]”; i.e., there exist points ξm and ξM in [a,b] such
that

    f(ξm) ≤ f(x) ≤ f(ξM)    for all x in [a,b]

Theorem 1.5: Rolle’s theorem Let f(x) be continuous on the (closed
and finite) interval [a,b] and differentiable on (a,b). If f(a) = f(b) =
0, then

    f′(ξ) = 0    for some ξ in (a,b)

The proof makes essential use of Theorem 1.4. For by Theorem 1.4,
there are points ξm and ξM such that, for all x in [a,b],

    f(ξm) ≤ f(x) ≤ f(ξM)

If now neither ξm nor ξM is in (a,b), then f(ξm) = f(ξM) = 0, hence f(x) = 0
for all x in [a,b], and every ξ in (a,b)
will do. Otherwise, either ξm
or ξM
is in (a,b), say, ξM.
But then

    f′(ξM) = 0

since f(ξM) is
the biggest value achieved by f(x) on [a,b].

An immediate consequence of Rolle’s theorem is the following theorem.

Theorem 1.6: Mean-value theorem for derivatives If f(x) is continuous
on the (closed and finite) interval [a,b] and differentiable on (a,b),
then

    f(b) − f(a) = f′(ξ)(b − a)    for some ξ in (a,b)        (1.29)

One gets Theorem 1.6 from Theorem 1.5 by considering in Theorem
1.5 the function

    F(x) = f(x) − f(a) − (x − a)(f(b) − f(a))/(b − a)

instead of f(x). Clearly, F(x) vanishes both at a and at b.

It follows directly from Theorem 1.6 that if f(x) is continuous on [a,b]
and differentiable on (a,b), and c is some point in [a,b], then for all x in [a,b],

    f(x) = f(c) + f′(ξ)(x − c)    for some ξ between c and x        (1.30)

The fundamental theorem of calculus provides the more precise statement:
If f(x) is continuously differentiable, then for all x in [a,b],

    f(x) = f(c) + ∫_c^x f′(s) ds        (1.31)

from which (1.30) follows by the mean-value theorem for integrals (1.28),
since f′(x) is continuous. More generally, one has the following theorem.

Theorem 1.7: Taylor’s formula with (integral) remainder If f(x) has
n + 1 continuous derivatives on [a,b] and c is some point in [a,b],
then for all x in [a,b],

    f(x) = f(c) + f′(c)(x − c) + f″(c)(x − c)^2/2! + · · · + f^(n)(c)(x − c)^n/n! + R        (1.32)

where

    R = ∫_c^x f^(n+1)(s)(x − s)^n/n! ds        (1.33)

One gets (1.32) from (1.31) by considering the function

    F(x) = f(x) + f′(x)(c − x) + f″(x)(c − x)^2/2! + · · · + f^(n)(x)(c − x)^n/n!

instead of f(x). For,

    F′(x) = f^(n+1)(x)(c − x)^n/n!

since all other terms cancel in pairs. But since F(c) = f(c), this gives, by (1.31),

    f(c) = F(x) + ∫_x^c f^(n+1)(s)(c − s)^n/n! ds

which is (1.32), after the substitution of x for c and of c for x.

Actually, f^(n+1)(x) need not be continuous for (1.32) to hold. However,
if in (1.32) f^(n+1)(x) is continuous, one gets, using Theorem 1.3, the more
familiar but less useful form for the remainder:

    R = f^(n+1)(ξ)(x − c)^(n+1)/(n + 1)!    for some ξ between c and x        (1.34)

By setting h = x − c, (1.32) and (1.34) take the form

    f(c + h) = f(c) + f′(c)h + f″(c)h^2/2! + · · · + f^(n)(c)h^n/n! + f^(n+1)(ξ)h^(n+1)/(n + 1)!        (1.35)

for some ξ between c and c + h.

Example The function f(x) = e^x has the Taylor expansion

    e^x = 1 + x + x^2/2! + · · · + x^n/n! + e^ξ x^(n+1)/(n + 1)!    for some ξ between 0 and x        (1.36)

about c = 0. The expansion of f(x) = ln x = log_e x about c = 1 is

    ln x = (x − 1) − (x − 1)^2/2 + (x − 1)^3/3 − · · · + (−1)^(n−1)(x − 1)^n/n + (−1)^n (x − 1)^(n+1)/((n + 1)ξ^(n+1))

where 0 < x < 2, and ξ
is between 1 and x.
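The remainder in (1.36) can be checked directly: on [0,1] one has e^ξ ≤ e, so e·x^(n+1)/(n + 1)! bounds the error of the degree-n partial sum (sketch ours; compare Exercise 1.7-3):

```python
import math

def taylor_exp(x, n):
    """Degree-n Taylor polynomial of e**x about 0."""
    return sum(x ** k / math.factorial(k) for k in range(n + 1))

x = 1.0
ok = []
for n in range(1, 12):
    err = abs(math.exp(x) - taylor_exp(x, n))
    # Remainder bound from (1.36) with e**xi <= e on [0, 1]:
    bound = math.e * x ** (n + 1) / math.factorial(n + 1)
    ok.append(err <= bound)
```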

A similar formula holds for functions of several variables. One obtains

this formula from Theorem 1.7 with the aid of

Theorem 1.8: Chain rule If the function f(x,y, . . . , z) has continuous

first partial derivatives with respect to each of its variables, and

x = x(t), y = y(t), . . . , z = z(t) are continuously differentiable functions of t, then g(t) = f(x(t), y(t), . . . , z(t)) is also continuously differentiable, and

From this theorem, one obtains an expression for f(x, y, . . . , z) in terms of the value and the partial derivatives at (a, b, . . . , c) by introducing the function

    g(t) = f(a + t(x - a), b + t(y - b), . . . , c + t(z - c))

and then evaluating its Taylor series expansion around t = 0 at t = 1. For example, this gives


Theorem 1.9 If f(x,y) has continuous first and second partial derivatives in a neighborhood D of the point (a,b) in the (x,y) plane, then

    f(x,y) = f(a,b) + f_x(a,b)(x - a) + f_y(a,b)(y - b) + R_2(x,y)        (1.37)

for all (x,y) in D, where

    R_2(x,y) = (1/2)[f_xx(ξ,η)(x - a)² + 2f_xy(ξ,η)(x - a)(y - b) + f_yy(ξ,η)(y - b)²]

for some point (ξ,η) on the line segment from (a,b) to (x,y), depending on (x,y), and the subscripts on f denote partial differentiation.

For example, the expansion of e^x sin y about (a,b) = (0, 0) is

    e^x sin y = y + R_2(x,y)        (1.38)

Finally, in the discussion of eigenvalues of matrices and elsewhere, we

need the following theorem.

Theorem 1.10: Fundamental theorem of algebra If p(x) is a polynomial of degree n ≥ 1, that is,

    p(x) = a_0 + a_1 x + · · · + a_n x^n

with a_0, . . . , a_n real or complex numbers and a_n ≠ 0, then p(x) has at least one zero; i.e., there exists a complex number ξ such that p(ξ) = 0.

This rather deep theorem should not be confused with the straightforward statement, “A polynomial of degree n has at most n zeros,

counting multiplicity,” which we prove in Chap. 2 and use, for example, in

the discussion of polynomial interpolation.

EXERCISES

1.7-1 In the mean-value theorem for integrals, Theorem 1.3, let the interval be [0,1]. Find the point ξ specified by the theorem and verify that this point lies in the interval (0,1).

1.7-2 In the mean-value theorem for derivatives, Theorem 1.6, let f(x) be the given function on an interval [a,b]. Find the point ξ specified by the theorem and verify that this point lies in the interval (a,b).

1.7-3 In the expansion (1.36) for e^x, find n so that the resulting power sum will yield an

approximation correct to five significant digits for all x on [0,1].


1.7-4 Use Taylor’s formula (1.32) to find a power series expansion about

Find an expression for the remainder, and from this estimate the number of terms that would

be needed to guarantee six-significant-digit accuracy for

for all x on the interval

[-1,1].

1.7-5 Find the remainder R2(x,y) in the example (1.38) and determine its maximum value in

the region D defined by

1.7-6 Prove that the remainder term in (1.35) can also be written

1.7-7 Illustrate the statement in Exercise 1.7-6 by calculating, for

for various values of h, for example, for

with

and comparing R,(h)

1.7-8 Prove Theorem 1.9 from Theorems 1.7 and 1.8.

1.7-9 Prove Euler's formula

    e^(ix) = cos x + i sin x

by comparing the power series for e^z, evaluated at z = ix, with the sum of the power series for cos x and i times the one for sin x.


CHAPTER

TWO

INTERPOLATION BY POLYNOMIALS

Polynomials are used as the basic means of approximation in nearly all

areas of numerical analysis. They are used in the solution of equations and

in the approximation of functions, of integrals and derivatives, of solutions

of integral and differential equations, etc. Polynomials owe this popularity

to their simple structure, which makes it easy to construct effective

approximations and then make use of them.

For this reason, the representation and evaluation of polynomials is a

basic topic in numerical analysis. We discuss this topic in the present

chapter in the context of polynomial interpolation, the simplest and

certainly the most widely used technique for obtaining polynomial approximations. More advanced methods for getting good approximations by

polynomials and other approximating functions are given in Chap. 6. But

it will be shown there that even best polynomial approximation does not

give appreciably better results than an appropriate scheme of polynomial

interpolation.

Divided differences serve as the basis of our treatment of the interpolating polynomial. This makes it possible to deal with osculatory (or

Hermite) interpolation as a special limiting case of polynomial interpolation at distinct points.

2.1 POLYNOMIAL FORMS

In this section, we point out that the customary way to describe a

polynomial may not always be the best way in calculations, and we


propose alternatives, in particular the Newton form. We also show how to

evaluate a polynomial given in Newton form. Finally, in preparation for

polynomial interpolation, we discuss how to count the zeros of a polynomial.

A polynomial p(x) of degree ≤ n is, by definition, a function of the form

    p(x) = a_0 + a_1 x + a_2 x² + · · · + a_n x^n        (2.1)

with certain coefficients a_0, a_1, . . . , a_n. This polynomial has (exact) degree n in case its leading coefficient a_n is nonzero.

The power form (2.1) is the standard way to specify a polynomial in

mathematical discussions. It is a very convenient form for differentiating

or integrating a polynomial. But, in various specific contexts, other forms

are more convenient.

Example 2.1: The power form may lead to loss of significance If we construct the power

form of the straight line p(x) which takes on the values p(6000) = 1/3, p(6001) =

- 2/3, then, in five-decimal-digit floating-point arithmetic, we will obtain p(x) = 6000.3 - x. Evaluating this straight line, in the same arithmetic, we find p(6000) = 0.3

and p(6001) = - 0.7, which recovers only the first digit of the given function values, a

loss of four decimal digits.
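The arithmetic of Example 2.1 can be imitated in Python by rounding every computed quantity to five significant decimal digits; the helper fl5 below is our stand-in for the five-decimal-digit machine, not anything from the book.

```python
from decimal import Context, Decimal

def fl5(x):
    """Round x to five significant decimal digits, imitating the
    five-decimal-digit floating-point arithmetic of Example 2.1."""
    return float(Context(prec=5).create_decimal(Decimal(repr(x))))

# Power form of the line with p(6000) = 1/3, p(6001) = -2/3: the slope is -1,
# and the constant term rounds to fl5(6000 + 1/3) = 6000.3 .
a0 = fl5(6000.0 + fl5(1.0 / 3.0))
p_power = lambda x: fl5(a0 - x)              # recovers only the first digit
# Shifted power form with center c = 6000 keeps all five digits:
p_shifted = lambda x: fl5(fl5(1.0 / 3.0) - fl5(x - 6000.0))
```

Running both forms at the data points reproduces the loss of significance described above: the power form returns 0.3 and -0.7, while the shifted form returns 0.33333 and -0.66667.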

A remedy of sorts for such loss of significance is the use of the shifted

power form

    p(x) = a_0 + a_1(x - c) + a_2(x - c)² + · · · + a_n(x - c)^n        (2.2)

If we choose the center c to be 6000, then, in the example, we would

get p(x) = 0.33333 - (x - 6000.0), and evaluation in five-decimal-digit

floating-point arithmetic now provides p(6000) = 0.33333, p(6001) =

- 0.66667; i.e., the values are as correct as five digits can make them.

It is good practice to employ the shifted power form with the center c

chosen somewhere in the interval [a,b] when interested in a polynomial on

that interval. A more sophisticated remedy against loss of significance (or

illconditioning) is offered by an expansion in Chebyshev polynomials or

other orthogonal polynomials; see Sec. 6.3.

The coefficients in the shifted power form (2.2) provide derivative

values, i.e.,

    a_i = p^(i)(c)/i!        i = 0, . . . , n

if p(x) is given by (2.2). In effect, the shifted power form provides the

Taylor expansion for p(x) around the center c.

A further generalization of the shifted power form is the Newton form

    p(x) = a_0 + a_1(x - c_1) + a_2(x - c_1)(x - c_2) + · · · + a_n(x - c_1)(x - c_2) · · · (x - c_n)        (2.3)


This form plays a major role in the construction of an interpolating

polynomial. It reduces to the shifted power form if the centers c1, . . . , cn,

all equal c, and to the power form if the centers c1, . . . , cn, all equal zero.

The following discussion on the evaluation of the Newton form therefore

applies directly to these simpler forms as well.

It is inefficient to evaluate each of the n + 1 terms in (2.3) separately

and then sum. This would take n + n(n + 1)/2 additions and n(n + 1)/2

multiplications. Instead, one notices that the factor (x - c_1) occurs in all terms but the first; that is,

    p(x) = a_0 + (x - c_1)[a_1 + a_2(x - c_2) + · · · + a_n(x - c_2) · · · (x - c_n)]

Again, each term between the braces but the first contains the factor (x - c_2); that is,

    p(x) = a_0 + (x - c_1){a_1 + (x - c_2)[a_2 + a_3(x - c_3) + · · · + a_n(x - c_3) · · · (x - c_n)]}

Continuing in this manner, we obtain p(x) in nested form:

    p(x) = a_0 + (x - c_1)(a_1 + (x - c_2)(a_2 + · · · + (x - c_{n-1})(a_{n-1} + (x - c_n)a_n) · · · ))

whose evaluation for any particular value of x takes 2n additions and n

multiplications. If, for example, p(x) = 1 + 2(x - 1) + 3(x - 1)(x - 2) + 4(x - 1)(x - 2)(x - 3), and we wish to compute p(4), then we calculate as follows:

    p(4) = 1 + (4 - 1)·(2 + (4 - 2)·(3 + (4 - 3)·4)) = 1 + 3·(2 + 2·7) = 1 + 3·16 = 49

This procedure is formalized in the following algorithm.

Algorithm 2.1: Nested multiplication for the Newton form Given the n + 1 coefficients a_0, . . . , a_n for the Newton form (2.3) of the polynomial p(x), together with the centers c_1, . . . , c_n. Given also the number z.

    b_n := a_n
    For k = n - 1, n - 2, . . . , 0:   b_k := a_k + (z - c_{k+1})b_{k+1}

Then p(z) = b_0. Moreover, the auxiliary quantities b_1, . . . , b_n are of


independent interest. For, we have

    p(x) = b_0 + b_1(x - z) + b_2(x - z)(x - c_1) + · · · + b_n(x - z)(x - c_1) · · · (x - c_{n-1})        (2.4)

i.e., b_0, b_1, . . . , b_n are also coefficients in the Newton form for p(x), but with centers z, c_1, c_2, . . . , c_{n-1}.

We prove the assertion (2.4). From the algorithm, a_n = b_n, while a_k = b_k - (z - c_{k+1})b_{k+1} for k = 0, . . . , n - 1. Substituting these expressions into (2.3), the two terms involving b_{k+1} combine into b_{k+1}(x - z)(x - c_1) · · · (x - c_k), since (x - c_{k+1}) - (z - c_{k+1}) = x - z; summing over k then proves (2.4).
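In Python, Algorithm 2.1 takes only a few lines. The sketch below is ours (coefficient list a, center list c, auxiliary quantities b as in (2.4)); it evaluates the cubic used in the example above.

```python
def nested_multiplication(a, c, z):
    """Algorithm 2.1: evaluate at z the Newton-form polynomial (2.3)
    p(x) = a[0] + a[1](x - c[0]) + ... + a[n](x - c[0])...(x - c[n-1]).
    Returns p(z) together with the auxiliary quantities b of (2.4)."""
    n = len(a) - 1
    b = list(a)                               # b[n] = a[n]
    for k in range(n - 1, -1, -1):
        b[k] = a[k] + (z - c[k]) * b[k + 1]   # c[k] holds the center c_{k+1}
    return b[0], b

# p(x) = 1 + 2(x - 1) + 3(x - 1)(x - 2) + 4(x - 1)(x - 2)(x - 3) at z = 4:
value, b = nested_multiplication([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0], 4.0)
```

Here value comes out as 49, and by (2.4) the list b = [49, 16, 7, 4] gives the Newton form of the same cubic with centers 4, 1, 2.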

Aside from producing the value of the polynomial (2.3) at any particular point z economically, the nested multiplication algorithm is useful in

changing from one Newton form to another. Suppose, for example, that we

wish to express the polynomial

in terms of powers of x, that is, in the Newton form with all centers equal

to zero. Then, applying Algorithm 2.1 with z = 0 (and n = 2), we get

Hence


Applying Algorithm 2.1 to this polynomial, again with z = 0, gives

Therefore

In this simple example, we can verify this result quickly by multiplying out

the terms in the original expression.

Repeated applications of the Nested Multiplication algorithm are

useful in the evaluation of derivatives of a polynomial given in Newton

form (see Exercises 2.1-2 through 2.1-5). The algorithm is also helpful in

establishing the following basic fact.

Lemma 2.1 If z_1, . . . , z_k are distinct zeros of the polynomial p(x), then

    p(x) = (x - z_1)(x - z_2) · · · (x - z_k)r(x)

for some polynomial r(x).

To prove this lemma, we write p(x) in power form (2.1), i.e., in Newton

form with all centers equal to zero, and then apply Algorithm 2.1 once, to

get

    p(x) = p(z) + (x - z)q(x)

with q(x) a polynomial of degree < n. In effect, we have divided p(x) by the linear polynomial

(x - z); q(x) is the quotient polynomial and the number p(z) is the

remainder. Now pick specifically z = z1. Then, by assumption, p(z1) = 0,

i.e.,

    p(x) = (x - z_1)q(x)

This finishes the proof in case k = 1. Further, for k > 1, it follows that

z2, . . . , zk are necessarily zeros of q(x), since p(x) vanishes at these points

while the linear polynomial x - z 1 does not, by assumption. Hence,

induction on the number k of zeros may now be used to complete the

proof.


Corollary If p(x) and q(x) are two polynomials of degree ≤ k which agree at the k + 1 distinct points z_0, . . . , z_k, then p(x) = q(x) identically.

Indeed, their difference d(x) = p(x) - q(x) is then a polynomial of degree ≤ k which vanishes at the k + 1 distinct points z_0, . . . , z_k, and can therefore, by Lemma 2.1, be written in the form

    d(x) = (x - z_0)(x - z_1) · · · (x - z_k)r(x)

with r(x) some polynomial. Suppose that r(x) is not identically zero, say r(x) = c_0 + · · · + c_m x^m for some coefficients c_0, . . . , c_m with c_m ≠ 0. Then d(x) has exact degree m + k + 1 > k, which is nonsense. Hence, r(x) = 0 identically, and so p(x) = q(x).

This corollary gives the answer, “At most one,” to the question “How many polynomials of degree ≤ k are there which take on specified values at k + 1 specified points?”

These considerations concerning zeros of polynomials can be refined

through the notion of multiplicity of a zero. This will be of importance to

us later on, in the discussion of osculatory interpolation. We say that the

point z is a zero of (exact) multiplicity j, or of order j, of the function f(x)

provided

    f(z) = f'(z) = · · · = f^(j-1)(z) = 0        but  f^(j)(z) ≠ 0

Example For instance, the polynomial p(x) = (x - z)^j has a zero of multiplicity j at z. It is reasonable to count such a zero j times since it can be thought of as the limiting case of the polynomial

    (x - z_1)(x - z_2) · · · (x - z_j)

with j distinct, or simple, zeros as all these zeros come together, or coalesce, at z.

As another example, for

the function

has three (simple)

zeros in the interval

which converge to the number 0 as

Correspondingly, the (limiting) function sin x - x has a triple zero at 0.

With this notion of multiplicity of a zero, Lemma 2.1 can be

strengthened as follows.

Lemma 2.2 If z_1, . . . , z_k is a sequence of zeros of the polynomial p(x) counting multiplicity, then

    p(x) = (x - z_1) · · · (x - z_k)r(x)

for some polynomial r(x).

See Exercise 2.1-6 for a proof of this lemma. Note that the number z

could occur in the sequence z1, . . . , zk as many as j times in case z is a

zero of p(x) of order j.


From Lemma 2.2, we get by the earlier argument the

Corollary If p(x) and q(x) are two polynomials of degree ≤ k which agree at k + 1 points z_0, . . . , z_k in the sense that their difference

r(x) = p(x) - q(x) has the k + 1 zeros z0, . . . , zk (counting multiplicity), then p(x) = q(x) identically.

EXERCISES

2.1-1 Evaluate the cubic polynomial

Then use nested multiplication to obtain p(x) in power form, and evaluate that power form at

x = 314.15. Compare!

2.1-2 Let

be a polynomial in Newton

form. Prove: If c1 = c2 = · · · = cr+1, then p(j)(c1) = j!aj,j = 0, . . . ,r. [Hint: Under these

conditions, p(x) can be written

with q(x) some polynomial. Now differentiate.]

2.1-3 Find the first derivative of

at x = 2. [Hint: Apply Algorithm 2.1 twice to obtain the Newton form for p(x) with centers 2,

2, 1, - 1; then use Exercise 2.1-2.]

2.1-4 Find also the second derivative of the polynomial p(x) of Exercise 2.1-3 at x = 2.

2.1-5 Find the Taylor expansion around c = 3 for the polynomial of Exercise 2.1-3. [Hint:

The Taylor expansion for a polynomial around a point c is just the Newton form for this

polynomial with centers c, c, c, c, . . . .]

2.1-6 Prove Lemma 2.2. [Hint: By Algorithm 2.1, p(x) = (x - z_1)q(x). Now, to finish the proof by induction on the number k of zeros in the given sequence, prove that z_2, . . . , z_k is necessarily a sequence of zeros (counting multiplicity) of q(x). For this, assume that the number z occurs exactly j times in the sequence z_2, . . . , z_k and distinguish the cases z = z_1 and z ≠ z_1. Also, use the fact that p^(j)(x) = (x - z_1)q^(j)(x) + jq^(j-1)(x).]

2.1-7 Prove that, in the language of the corollary to Lemma 2.2, the Taylor polynomial Σ_{i=0}^{j-1} f^(i)(a)(x - a)^i/i! agrees with the function f(x) j-fold at the point x = a (i.e., a is a j-fold zero of their difference).

2.1-8 Suppose someone gives you a FUNCTION F(X) which supposedly returns the value at

X of a specific polynomial of degree < r. Suppose further that, on inspection, you find that

the routine does indeed return the value of some polynomial of degree < r (e.g., you find only

additions/subtractions and multiplications involving X and numerical constants in that

subprogram, with X appearing as a factor less than r times). How many function values

would you have to check before you could be sure that the routine does indeed do what it is

supposed to do (assuming no rounding errors in the calculation)?

2.1-9 For each of the following power series, exploit the idea of nested multiplication to find

an efficient way for their evaluation. (You will have to assume, of course, that they are to be

summed only over n < N, for some a priori given N.)



2.2 EXISTENCE AND UNIQUENESS OF THE

INTERPOLATING POLYNOMIAL

Let x0, x1, . . . , xn be n + 1 distinct points on the real axis and let f(x) be a

real-valued function defined on some interval I = [a,b] containing these

points. We wish to construct a polynomial p(x) of degree ≤ n which interpolates f(x) at the points x_0, . . . , x_n, that is, satisfies

    p(x_i) = f(x_i)        i = 0, 1, . . . , n

As we will see, there are many ways to write down such a polynomial.

It is therefore important to remind the reader at the outset that, by the

corollary to Lemma 2.1, there is at most one polynomial of degree ≤ n which

interpolates f(x) at the n + 1 distinct points x0, . . . , xn.

Next we show that there is at least one polynomial of degree ≤ n which

interpolates f(x) at the n + 1 distinct points x0, x1, . . . , xn. For this, we

employ yet another polynomial form, the Lagrange form

    p(x) = a_0 l_0(x) + a_1 l_1(x) + · · · + a_n l_n(x)        (2.5)

with

    l_k(x) := Π_{i=0, i≠k}^{n} (x - x_i)/(x_k - x_i)        k = 0, . . . , n        (2.6)

the Lagrange polynomials for the points x0, . . . , xn. The function lk(x) is

the product of n linear factors, hence a polynomial of exact degree n.

Therefore, (2.5) does indeed describe a polynomial of degree ≤ n. Further,

l_k(x) vanishes at x_i for all i ≠ k and takes the value 1 at x_k, i.e.,

    l_k(x_i) = 1 if i = k,   l_k(x_i) = 0 if i ≠ k

This shows that

    p(x_i) = a_i        i = 0, . . . , n

i.e., the coefficients a0, . . . , an in the Lagrange form are simply the values

of the polynomial p(x) at the points x0 , . . . , xn . Consequently, for an

arbitrary function f(x),

    p_n(x) = f(x_0)l_0(x) + f(x_1)l_1(x) + · · · + f(x_n)l_n(x)        (2.7)

is a polynomial of degree ≤ n which interpolates f(x) at x_0, . . . , x_n. This

establishes the following theorem.

Theorem 2.1 Given a real-valued function f(x) and n + 1 distinct points x_0, . . . , x_n, there exists exactly one polynomial of degree ≤ n which interpolates f(x) at x_0, . . . , x_n.


Equation (2.7) is called the Lagrange formula for the interpolating

polynomial.

As a simple application, we consider the case n = 1; i.e., we are given

f(x) and two distinct points x_0, x_1. Then

    l_0(x) = (x - x_1)/(x_0 - x_1)        l_1(x) = (x - x_0)/(x_1 - x_0)

and

    p_1(x) = f(x_0)(x - x_1)/(x_0 - x_1) + f(x_1)(x - x_0)/(x_1 - x_0)
           = f(x_0) + (x - x_0)(f(x_1) - f(x_0))/(x_1 - x_0)

This is the familiar case of linear interpolation written in some of its many

equivalent forms.

Example 2.2 An integral related to the complete elliptic integral is defined by

    K(k) = ∫_0^(π/2) (1 - sin²k sin²θ)^(-1/2) dθ        (2.8)

From a table of values of these integrals we find that, for various values of k measured

in degrees,

    K(1) = 1.5709        K(4) = 1.5727        K(6) = 1.5751

Find K(3.5), using a second-degree interpolating polynomial.
We have, with x_0 = 1, x_1 = 4, x_2 = 6,

    l_0(3.5) = (3.5 - 4)(3.5 - 6)/[(1 - 4)(1 - 6)] = 0.08333
    l_1(3.5) = (3.5 - 1)(3.5 - 6)/[(4 - 1)(4 - 6)] = 1.04167
    l_2(3.5) = (3.5 - 1)(3.5 - 4)/[(6 - 1)(6 - 4)] = -0.12500

Then

    K(3.5) ≈ p_2(3.5) = (1.5709)(0.08333) + (1.5727)(1.04167) + (1.5751)(-0.12500) = 1.5722

This approximation is in error in the last place.
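The computation of Example 2.2 is easy to reproduce; the Python sketch below (ours, not the book's) evaluates the Lagrange form (2.7) directly, using the tabulated values K(1) = 1.5709, K(4) = 1.5727, K(6) = 1.5751 that also appear in Example 2.5 below.

```python
def lagrange_eval(xs, fs, x):
    """Evaluate at x the Lagrange form (2.7) of the polynomial
    interpolating the data (xs[k], fs[k]), k = 0, ..., n."""
    total = 0.0
    for k, (xk, fk) in enumerate(zip(xs, fs)):
        lk = 1.0                          # accumulate l_k(x) of (2.6)
        for i, xi in enumerate(xs):
            if i != k:
                lk *= (x - xi) / (xk - xi)
        total += fk * lk
    return total

K35 = lagrange_eval([1.0, 4.0, 6.0], [1.5709, 1.5727, 1.5751], 3.5)
```

The result is 1.57225, which rounds to the value found above; at a tabular point the formula returns the tabulated value exactly, as it must.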

The Lagrange form (2.7) for the interpolating polynomial makes it

easy to show the existence of an interpolating polynomial. But its evaluation at a point x takes at least 2(n + 1) multiplications/divisions and

(2n + 1) additions and subtractions after the denominators of the

Lagrange polynomials have been calculated once and for all and divided

into the corresponding function values. This is to be compared with n

multiplications and n additions necessary for the evaluation of a polynomial of degree n in power form by nested multiplication (see Algorithm

2.1).


A more serious objection to the Lagrange form arises as follows: In

practice, one is often uncertain as to how many interpolation points to use.

Hence, with p_j(x) denoting the polynomial of degree ≤ j which interpolates f(x) at x_0, . . . , x_j, one calculates p_0(x), p_1(x), p_2(x), . . . , increasing the number of interpolation points, and hence the degree of the

interpolating polynomial until, so one hopes, a satisfactory approximation

pk(x) to f(x) has been found. In such a process, use of the Lagrange form

seems wasteful since, in calculating p k(x), no obvious advantage can be

taken of the fact that one already has p k-1(x) available. For this purpose

and others, the Newton form of the interpolating polynomial is much

better suited.

Indeed, write the interpolating polynomial p_n(x) in its Newton form,

using the interpolation points x0, . . . , xn-1 as centers, i.e.,

    p_n(x) = A_0 + A_1(x - x_0) + A_2(x - x_0)(x - x_1) + · · · + A_n(x - x_0) · · · (x - x_{n-1})        (2.9)

For any integer k between 0 and n, let qk(x) be the sum of the first k + 1

terms in this form,

    q_k(x) = A_0 + A_1(x - x_0) + · · · + A_k(x - x_0) · · · (x - x_{k-1})

Then every one of the remaining terms in (2.9) has the factor (x - x0 )

· · · (x - x_k), and we can write (2.9) in the form

    p_n(x) = q_k(x) + (x - x_0) · · · (x - x_k)r(x)

for some polynomial r(x) of no further interest. The point is that this last

term (x - x0) · · · (x - xk)r(x) vanishes at the points x0, . . . , xk, hence

q_k(x) itself must already interpolate f(x) at x_0, . . . , x_k [since p_n(x) does]. Since q_k(x) is also a polynomial of degree ≤ k, it follows that q_k(x) = p_k(x); i.e., q_k(x) must be the unique polynomial of degree ≤ k which interpolates f(x) at x_0, . . . , x_k.

This shows that the Newton form (2.9) for the interpolating polynomial pn(x) can be built up step by step as one constructs the sequence

p 0 (x), p1 (x), p2 (x), . . . , with p k(x) obtained from p k-1(x) by addition of

the next term in the Newton form (2.9), i.e.,

    p_k(x) = p_{k-1}(x) + A_k(x - x_0) · · · (x - x_{k-1})

It also shows that the coefficient A_k in the Newton form (2.9) for the interpolating polynomial is the leading coefficient, i.e., the coefficient of x^k, in the polynomial p_k(x) of degree ≤ k which agrees with f(x) at

x0 , . . . , xk. This coefficient depends only on the values of f(x) at the

points x0, . . . , xk; it is called the kth divided difference of f(x) at the points

x_0, . . . , x_k (for reasons given in the next section) and is denoted by

    f[x_0, . . . , x_k]

With this definition, we arrive at the Newton formula for the interpolating


polynomial

    p_n(x) = f[x_0] + f[x_0, x_1](x - x_0) + · · · + f[x_0, . . . , x_n](x - x_0)(x - x_1) · · · (x - x_{n-1})

This can be written more compactly as

    p_n(x) = Σ_{i=0}^{n} f[x_0, . . . , x_i] Π_{j=0}^{i-1} (x - x_j)        (2.10)

if we make use of the convention that the empty product Π_{j=0}^{-1}(x - x_j) equals 1.

For n = 1, (2.10) reads

    p_1(x) = f[x_0] + f[x_0, x_1](x - x_0)

and comparison with the formula

    p_1(x) = f(x_0) + (x - x_0)(f(x_1) - f(x_0))/(x_1 - x_0)

obtained earlier therefore shows that

    f[x_0] = f(x_0)        f[x_0, x_1] = (f(x_1) - f(x_0))/(x_1 - x_0)        (2.11)

The first divided difference, at any rate, is a ratio of differences.

EXERCISES

2.2-1 Prove that f[x_0, . . . , x_n] = Σ_{k=0}^{n} f(x_k)/[(x_k - x_0) · · · (x_k - x_{k-1})(x_k - x_{k+1}) · · · (x_k - x_n)]. [Hint: Find the leading coefficient of the polynomial (2.7).]

2.2-2 Calculate the limit of the formula for f[x_0, . . . , x_n] given in Exercise 2.2-1 as one of the points approaches another, while all other points remain fixed.

2.2-3 Prove that the polynomial of degree ≤ n which interpolates f(x) at n + 1 distinct points is f(x) itself in case f(x) is a polynomial of degree ≤ n.

2.2-4 Prove that the kth divided difference p[x_0, . . . , x_k] of a polynomial p(x) of degree ≤ k is independent of the interpolation points x_0, x_1, . . . , x_k.

2.2-5 Prove that the kth divided difference of a polynomial of degree < k is 0.

2.3 THE DIVIDED-DIFFERENCE TABLE

Higher-order divided differences may be constructed by the formula

    f[x_0, . . . , x_k] = (f[x_1, . . . , x_k] - f[x_0, . . . , x_{k-1}])/(x_k - x_0)        (2.12)

whose validity may be established as follows.


Let p_i(x) be the polynomial of degree ≤ i which agrees with f(x) at x_0, . . . , x_i, as before, and let q_{k-1}(x) be the polynomial of degree ≤ k - 1 which agrees with f(x) at the points x_1, . . . , x_k. Then

    p(x) = p_{k-1}(x) + [(x - x_0)/(x_k - x_0)][q_{k-1}(x) - p_{k-1}(x)]        (2.13)

is a polynomial of degree ≤ k, and one checks easily that p(x_i) = f(x_i),

i = 0, . . . , k. Consequently, by the uniqueness of the interpolating polynomial, we must have p(x) = pk(x). Therefore

    f[x_0, . . . , x_k] = leading coefficient of p_k(x)        (by definition)
        = [leading coefficient of q_{k-1}(x) - leading coefficient of p_{k-1}(x)]/(x_k - x_0)        (by (2.13))
        = (f[x_1, . . . , x_k] - f[x_0, . . . , x_{k-1}])/(x_k - x_0)        (by definition)

which proves the important formula (2.12).

Example 2.3 Solve Example 2.2 using the Newton formula.
In this example, we have to determine the polynomial p_2(x) of degree ≤ 2 which satisfies

    p_2(1) = 1.5709        p_2(4) = 1.5727        p_2(6) = 1.5751

By (2.11) we can calculate

    K[x_0, x_1] = (1.5727 - 1.5709)/(4 - 1) = 0.0006        K[x_1, x_2] = (1.5751 - 1.5727)/(6 - 4) = 0.0012

Therefore, by (2.12)

    K[x_0, x_1, x_2] = (0.0012 - 0.0006)/(6 - 1) = 0.00012

and (2.10) now gives

    p_2(x) = 1.5709 + 0.0006(x - 1) + 0.00012(x - 1)(x - 4)

Substituting into this the value x = 3.5, we obtain

    p_2(3.5) = 1.5709 + 0.0006(2.5) + 0.00012(2.5)(-0.5) = 1.5722

which agrees with the result obtained in Example 2.2.

Equation (2.12) shows the kth divided difference to be a difference

quotient of (k - 1)st divided differences, justifying their name. Equation

(2.12) also allows us to generate all the divided differences needed for the

Newton formula (2.10) in a simple manner with the aid of a so-called

divided-difference table.


Such a table is depicted in Fig. 2.1, for n = 4.

The entries in the table are calculated, for example, column by

column, according to the following algorithm.

Algorithm 2.2: Divided-difference table Given the first two columns of

the table, containing x_0, x_1, . . . , x_n and, correspondingly, f[x_0], . . . , f[x_n], fill in the remaining columns one by one; with (2.12), the entry in the (k + 2)nd column is

    f[x_i, . . . , x_{i+k}] = (f[x_{i+1}, . . . , x_{i+k}] - f[x_i, . . . , x_{i+k-1}])/(x_{i+k} - x_i)        k = 1, . . . , n;  i = 0, . . . , n - k

If this algorithm is carried out by hand, the following directions might

be helpful. Draw the two diagonals from the entry to be calculated through

its two neighboring entries to the left. If these lines terminate at f[xi] and

f[xj], respectively, divide the difference of the two neighboring entries by

the corresponding difference x j - x i to get the desired entry. This is

illustrated in Fig. 2.1 for the entry f[x1, . . . , x4].

When the divided-difference table is filled out, the coefficients

f[x0, . . . , xi ], i = 0, . . . , n, for the Newton formula (2.10) can be found

at the head of their respective columns.

For reasons of storage requirements, and because the DO variables in

many FORTRAN dialects can only increase, one would use a somewhat

modified version of Algorithm 2.2 in a FORTRAN program. First, for the

evaluation of the Newton form according to Algorithm 2.1, it is more

convenient to use the form

    p_n(x) = f[x_n] + f[x_{n-1}, x_n](x - x_n) + · · · + f[x_0, . . . , x_n](x - x_n)(x - x_{n-1}) · · · (x - x_1)

Figure 2.1 Divided-difference table.


i.e., to use the Newton formula with centers xn, xn-1, . . . , x1. For then the

value p_n(z) can be calculated, according to Algorithm 2.1, by

    v := f[x_0, . . . , x_n];   for i = 1, 2, . . . , n:  v := f[x_i, . . . , x_n] + (z - x_i)v;   then p_n(z) = v

Second, since we are then only interested in the numbers f[xi , . . . , xn],

i = 0, . . . , n, it is not necessary to store the entire divided-difference table

(requiring a two-dimensional array in which roughly half the entries would

not be used anyway, because of the triangular character of the divided-difference table). For if we use the abbreviation

    d_{ik} = f[x_i, . . . , x_{i+k}]

then the calculations of Algorithm 2.2 read

    d_{ik} = (d_{i+1,k-1} - d_{i,k-1})/(x_{i+k} - x_i)        k = 1, . . . , n;  i = 0, . . . , n - k

In particular, the number d i,k-1 is not used any further once dik has been

calculated, so that we can safely store d ik over d i,k-1 .

Algorithm 2.3: Calculation of the coefficients for the Newton formula

Given the n + 1 distinct points x0, . . . , xn, and, correspondingly, the

numbers f(x0), . . . , f(xn), with f(xi ) stored in di , i = 0, . . . , n.

Then, for k = 1, 2, . . . , n, and, for each such k, for i = 0, 1, . . . , n - k (in increasing order), compute

    d_i := (d_{i+1} - d_i)/(x_{i+k} - x_i)

When these calculations are completed, d_i = f[x_i, . . . , x_n], i = 0, . . . , n.
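A Python transcription of Algorithm 2.3, together with the nested evaluation just described, might read as follows (function names are ours); it is exercised here on the data of Examples 2.2 and 2.3.

```python
def newton_coeffs(xs, fs):
    """Algorithm 2.3: overwrite d[i] = f(x_i) column by column until
    d[i] = f[x_i, ..., x_n], i = 0, ..., n."""
    n = len(xs) - 1
    d = list(fs)
    for k in range(1, n + 1):
        for i in range(n - k + 1):
            d[i] = (d[i + 1] - d[i]) / (xs[i + k] - xs[i])
    return d

def newton_eval(xs, d, z):
    """Evaluate p_n(z) from d[i] = f[x_i, ..., x_n] by Algorithm 2.1,
    using the Newton form with centers x_n, x_{n-1}, ..., x_1."""
    v = d[0]
    for i in range(1, len(xs)):
        v = d[i] + (z - xs[i]) * v
    return v

# The data of Examples 2.2 and 2.3:
xs = [1.0, 4.0, 6.0]
d = newton_coeffs(xs, [1.5709, 1.5727, 1.5751])
```

Here d comes out as [0.00012, 0.0012, 1.5751], and newton_eval(xs, d, 3.5) reproduces the value 1.57225 found in Example 2.3.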

Example 2.4

Let f(x) = (1 + x2)-1. For n = 2, 4, . . . , 16, calculate the polynomial

Pn(x) of degree < n which interpolates f(x) at the n + 1 equally spaced points

Then estimate the maximum interpolation error

on the interval [-5, 5] by computing


where

The FORTRAN program below uses Algorithms 2.1 and 2.3 to solve this problem.

FORTRAN PROGRAM FOR EXAMPLE 2.4

C  PROGRAM FOR EXAMPLE 2.4
      INTEGER I,J,K,N,NP1
      REAL D(17),ERRMAX,H,PNOFY,X(17),Y
C  POLYNOMIAL INTERPOLATION AT EQUALLY SPACED POINTS TO THE FUNCTION
      F(Y) = 1./(1. + Y*Y)
C
      PRINT 600
  600 FORMAT('1    N',5X,'MAXIMUM ERROR')
      DO 40 N=2,16,2
         NP1 = N+1
         H = 10./FLOAT(N)
         DO 10 I=1,NP1
            X(I) = FLOAT(I-1)*H - 5.
            D(I) = F(X(I))
   10    CONTINUE
C           CALCULATE DIVIDED DIFFERENCES BY ALGORITHM 2.3
         DO 20 K=1,N
            DO 20 I=1,NP1-K
               D(I) = (D(I+1) - D(I))/(X(I+K) - X(I))
   20    CONTINUE
C           ESTIMATE MAXIMUM INTERPOLATION ERROR ON (-5,5)
         ERRMAX = 0.
         DO 30 J=1,101
            Y = FLOAT(J-1)/10. - 5.
C              CALCULATE PN(Y) BY ALGORITHM 2.1
            PNOFY = D(1)
            DO 29 K=2,NP1
               PNOFY = D(K) + (Y - X(K))*PNOFY
   29       CONTINUE
            ERRMAX = MAX(ABS(F(Y) - PNOFY), ERRMAX)
   30    CONTINUE
         PRINT 630, N,ERRMAX
  630    FORMAT(I5,E18.7)
   40 CONTINUE
      STOP
      END

COMPUTER OUTPUT FOR EXAMPLE 2.4

    N     MAXIMUM ERROR
    2     6.4615385E-01
    4     4.3813387E-01
    6     6.1666759E-01
    8     1.0451739E+00
   10     1.9156431E+00
   12     3.6052745E+00
   14     7.1920080E+00
   16     1.4051542E+01

Note how the interpolation error soon increases with increasing degree even though we use

more and more information about the function f(x) in our interpolation process. This is

because we have used uniformly spaced interpolation points; see Exercise 6.1-12 and Eq.

(6.20).
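The experiment is easy to repeat in other languages; the following Python sketch (ours) mirrors the FORTRAN program above, using Algorithm 2.3 for the coefficients and Algorithm 2.1 for the evaluation.

```python
def max_interp_error(n):
    """Interpolate f(x) = 1/(1 + x**2) at n + 1 equally spaced points on
    [-5, 5] and estimate the maximum error at 101 equally spaced points."""
    f = lambda y: 1.0 / (1.0 + y * y)
    xs = [-5.0 + 10.0 * i / n for i in range(n + 1)]
    d = [f(x) for x in xs]
    for k in range(1, n + 1):                 # divided differences (Algorithm 2.3)
        for i in range(n - k + 1):
            d[i] = (d[i + 1] - d[i]) / (xs[i + k] - xs[i])
    errmax = 0.0
    for j in range(101):
        y = -5.0 + j / 10.0
        p = d[0]
        for i in range(1, n + 1):             # nested multiplication (Algorithm 2.1)
            p = d[i] + (y - xs[i]) * p
        errmax = max(errmax, abs(f(y) - p))
    return errmax
```

In double precision the numbers come out close to the table above: about 0.646 for n = 2, and growing by roughly an order of magnitude to about 14 by n = 16.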


EXERCISES

2.3-1 From a table of logarithms we obtain the following values of log x at the indicated tabular points.

    x        1.0    1.5        2.0        3.0        3.5        4.0
    log x    0.0    0.17609    0.30103    0.47712    0.54407    0.60206

Form a divided-difference table based on these values.

2.3-2 Using the divided-difference table in Exercise 2.3-1, interpolate for the following

values: log 2.5, log 1.25, log 3.25. Use a third-degree interpolating polynomial in its Newton

form.

2.3-3 Estimate the error in the result obtained for log 2.5 in Exercise 2.3-2 by computing the

next term in the interpolating polynomial. Also estimate it by comparing the approximation

for log 2.5 with the sum of log 2 and the approximation for log 1.25.

2.3-4 Derive the formula

Then use it to interpret the Nested Multiplication Algorithm 2.1, applied to the polynomial

(2.10), as a way to calculate p[z, x0, . . . , xn-1], p[z, x0, . . . , xn-2], . . . , p[z, x0] and P[z], i.e.,

as a way to get another diagonal in the divided difference table for p(x).

2.3-5 By Exercise 2.2-3, the polynomial of degree ≤ k which interpolates a function f(x) at x_0, . . . , x_k is f(x) itself if f(x) is a polynomial of degree ≤ k. This fact may be used to check

the accuracy of the computed interpolating polynomial. Adapt the FORTRAN program given

in Example 2.4 to carry out such a check as follows: For n = 4, 8, 12, . . . , 32, find the

polynomial p_n(x) of degree ≤ n which interpolates the function

at

0,1,2, . . . ,n. Then estimate

where

the yi's are a suitably large number of points in [0, n] .

2.3-6 Prove that the first derivative p'_2(x) of the parabola interpolating f(x) at x_0 < x_1 < x_2 is equal to the straight line which takes on the value f[x_{i-1}, x_i] at the point (x_{i-1} + x_i)/2, for i = 1, 2. Generalize this to describe p'_n(x) as the interpolant to appropriate data in case p_n(x) interpolates f(x) at x_0 < x_1 < · · · < x_n.

*2.4 INTERPOLATION AT AN INCREASING NUMBER OF

INTERPOLATION POINTS

Consider now the problem of estimating f(x) at a point x̄ using polynomial interpolation at distinct points x_0, x_1, x_2, . . . . With p_k(x) the polynomial of degree ≤ k which interpolates f(x) at x_0, . . . , x_k, we calculate successively p_0(x̄), p_1(x̄), p_2(x̄), . . . , until, so we hope, the difference between p_{k-1}(x̄) and p_k(x̄) is sufficiently small. The Newton form for the


interpolating polynomial,

    p_k(x) = Σ_{i=0}^{k} f[x_0, . . . , x_i] ψ_i(x)        with  ψ_i(x) = (x - x_0) · · · (x - x_{i-1})

is expressly designed for such calculations. If we know p_k(x̄) and ψ_k(x̄), then we can calculate

    ψ_{k+1}(x̄) = (x̄ - x_k)ψ_k(x̄)

and

    p_{k+1}(x̄) = p_k(x̄) + f[x_0, . . . , x_{k+1}]ψ_{k+1}(x̄)

Algorithm 2.4: Interpolation using an increasing number of interpolation points Given distinct points x_0, x_1, x_2, . . . and the values f(x_0), f(x_1), f(x_2), . . . of a function f(x) at these points. Also, given a point x̄.

    p_0(x̄) := f[x_0] := f(x_0);   ψ_0(x̄) := 1
    For k = 0, 1, 2, . . . , until satisfied, do:
        Calculate f[x_{k+1}], f[x_k, x_{k+1}], . . . , f[x_0, . . . , x_{k+1}]
        ψ_{k+1}(x̄) := (x̄ - x_k)ψ_k(x̄)
        p_{k+1}(x̄) := p_k(x̄) + f[x_0, . . . , x_{k+1}]ψ_{k+1}(x̄)

This algorithm generates the entries of the divided-difference table for

f(x) at x_0, x_1, x_2, . . . a diagonal at a time. During the calculation of p_{k+1}(x̄), the upward diagonal emanating from f[x_{k+1}] is calculated up to

and including the number f[x0, . . . , xk+1], using the number f[xk+1] =

f(xk+1) and the previously calculated entries f[xk], f[xk-1, xk],

. . . , f[x0, . . . , xk] in the preceding diagonal. Hence, even if only the

most recently calculated diagonal is saved (in a FORTRAN program, say),

the algorithm provides incidentally the requisite coefficients for the Newton form for pk+1(x) with centers xk+1, . . . , x1:

    p_{k+1}(x) = f[x_{k+1}] + f[x_k, x_{k+1}](x - x_{k+1}) + · · · + f[x_0, . . . , x_{k+1}](x - x_{k+1}) · · · (x - x_1)        (2.14)

Example 2.5 We apply Algorithm 2.4 to the problem of Examples 2.2 and 2.3, using

x_0 = 1, x_1 = 4, x_2 = 6, and in addition, x_3 = 0. For this example, K[x_0] = K(1) = 1.5709, so p_0(3.5) = 1.5709. We get ψ_1(3.5) = 3.5 - 1 = 2.5. Next, with K[x_1] = 1.5727, we get K[x_0, x_1] = 0.0006, and with p_1(3.5) = p_0(3.5) + K[x_0, x_1]ψ_1(3.5) = 1.5709 + (0.0006)(2.5) we get p_1(3.5) = 1.5724.


Adding the point x_2 = 6, we have K[x_2] = 1.5751; hence K[x_1, x_2] = 0.0012, K[x_0, x_1, x_2] = 0.00012; therefore, with ψ_2(3.5) = (2.5)(-0.5) = -1.25, we get p_2(3.5) = 1.5724 + 0.00012(-1.25) = 1.5722, the number calculated earlier in Example 2.3.

to K(3.5), we add the point x 3 = 0. With K[x3 ] = 1.5708, we compute K[x2 , x 3 ] =

0.000717, K[x_1, x_2, x_3] = 0.000121, K[x_0, x_1, x_2, x_3] = -0.000001, and get, with ψ_3(3.5) = (-2.5)(-1.25) = 3.125, that p_3(3.5) = 1.5722 - 0.000001(3.125) = 1.5722,

indicating that 1.5722 or 1.5723 is probably the value of K(3.5) to within the accuracy of

the given values of K(x).

These calculations, if done by hand, are conveniently arranged in a table as shown

in Fig. 2.2, which also shows how Algorithm 2.4 gradually builds up the divided-difference table.
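Algorithm 2.4 can be sketched in Python as follows (the names are ours; the list c holds the newest upward diagonal of the divided-difference table); running it on the data of Example 2.5 reproduces the successive values p_0(3.5), p_1(3.5), p_2(3.5), p_3(3.5).

```python
def interp_increasing(xs, fs, xbar):
    """Algorithm 2.4: return the list p_0(xbar), p_1(xbar), ..., where p_k
    interpolates the data (xs[i], fs[i]), i = 0, ..., k."""
    c = []                    # after step k, c[i] = f[x_i, ..., x_k]
    psi = 1.0                 # psi_k(xbar) = (xbar - x_0)...(xbar - x_{k-1})
    values = []
    for k in range(len(xs)):
        new = [0.0] * (k + 1)
        new[k] = fs[k]                              # f[x_k]
        for i in range(k - 1, -1, -1):              # the upward diagonal
            new[i] = (new[i + 1] - c[i]) / (xs[k] - xs[i])
        if k == 0:
            values.append(fs[0])                    # p_0(xbar)
        else:
            psi *= xbar - xs[k - 1]                 # psi_k(xbar)
            values.append(values[-1] + new[0] * psi)
        c = new
    return values

# The data of Example 2.5:
vals = interp_increasing([1.0, 4.0, 6.0, 0.0], [1.5709, 1.5727, 1.5751, 1.5708], 3.5)
```

The successive values 1.5709, 1.5724, 1.57225, 1.57225 agree with the hand calculation, and only one diagonal of the table is kept at any time, as in the FORTRAN function TABLE below.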

We have listed below a FORTRAN FUNCTION, called TABLE,

which uses Algorithm 2.4 to interpolate in a given table of abscissas and

ordinates X(I), F(I), I = 1, . . . , NTABLE, with F(I) = f(X(I)), and X(1)

< X(2) < · · · , in order to find a good approximation to f(x) at x =

XBAR. The program generates p_0(XBAR), p_1(XBAR), . . . , until |p_k(XBAR) - p_{k-1}(XBAR)| ≤ TOL, where TOL is a given error requirement, or until k + 1 = min(20, NTABLE), and then returns the number p_k(XBAR). The sequence

x0, x1, x2, . . . of points of interpolation is chosen from the tabular points

X(1), X(2), . . . , X(NTABLE) as follows: If X(I) < XBAR < X(I + 1),

then x0 = X(I + 1), x1 = X(I), x2 = X(I + 2), x3 = X(I - 1), . . . , except

near the beginning or the end of the given table, where eventually only

points to the right or to the left of XBAR are used. To protect the program

(and the user!) against an unreasonable choice for TOL, the program

should be modified so as to terminate also if and when the successive

differences |p k+1 (XBAR) - p k(XBAR)| begin to increase as k increases.

(See also Exercise 2.4-1.)

Figure 2.2


FORTRAN SUBPROGRAM FOR INTERPOLATION IN A

FUNCTION TABLE

      REAL FUNCTION TABLE ( XBAR, X, F, NTABLE, TOL, IFLAG )
C  RETURNS AN INTERPOLATED VALUE  TABLE  AT  XBAR  FOR THE FUNCTION
C  TABULATED AS  (X(I),F(I)), I=1,...,NTABLE .
      INTEGER IFLAG,NTABLE,   J,KP1,KP1MAX,NEXT,NEXTL,NEXTR
      REAL F(NTABLE),TOL,X(NTABLE),XBAR,   A(20),ERROR,PSIK,XK(20)
C******  I N P U T  ******
C  XBAR  POINT AT WHICH TO INTERPOLATE .
C  X(I), F(I), I=1,...,NTABLE  CONTAINS THE FUNCTION TABLE .
C     (A S S U M P T I O N ...  X  IS ASSUMED TO BE INCREASING.)
C  NTABLE  NUMBER OF ENTRIES IN FUNCTION TABLE .
C  TOL  DESIRED ERROR BOUND .
C******  O U T P U T  ******
C  TABLE  THE INTERPOLATED FUNCTION VALUE .
C  IFLAG  AN INTEGER,
C         =1 , SUCCESSFUL EXECUTION ,
C         =2 , UNABLE TO ACHIEVE DESIRED ERROR IN 20 STEPS,
C         =3 , XBAR LIES OUTSIDE OF TABLE RANGE. CONSTANT EXTRAPOLATION
C              IS USED.
C******  M E T H O D  ******
C  A SEQUENCE OF POLYNOMIAL INTERPOLANTS OF INCREASING DEGREE IS FORMED
C  USING TABLE ENTRIES ALWAYS AS CLOSE TO  XBAR  AS POSSIBLE. EACH IN-
C  TERPOLATED VALUE IS OBTAINED FROM THE PRECEDING ONE BY ADDITION OF A
C  CORRECTION TERM (AS IN THE NEWTON FORMULA). THE PROCESS TERMINATES
C  WHEN THIS CORRECTION IS LESS THAN  TOL  OR, ELSE, AFTER 20 STEPS.
C
C                  LOCATE  XBAR  IN THE X-ARRAY.
      IF (XBAR .GE. X(1) .AND. XBAR .LE. X(NTABLE)) THEN
         DO 10 NEXT=2,NTABLE
            IF (XBAR .LE. X(NEXT))           GO TO 12
   10    CONTINUE
      END IF
      IF (XBAR .LT. X(1)) THEN
         TABLE = F(1)
      ELSE
         TABLE = F(NTABLE)
      END IF
      PRINT 610,XBAR
  610 FORMAT(E16.7,' NOT IN TABLE RANGE.')
      IFLAG = 3
                                        RETURN
   12 XK(1) = X(NEXT)
      NEXTL = NEXT-1
      NEXTR = NEXT+1
      A(1) = F(NEXT)
      TABLE = A(1)
      PSIK = 1.
C                  USE ALGORITHM 2.4, WITH THE NEXT  XK  ALWAYS THE
C                  TABLE ENTRY NEAREST  XBAR  OF THOSE NOT YET USED.
      KP1MAX = MIN(20,NTABLE)
      DO 20 KP1=2,KP1MAX
         IF (NEXTL .EQ. 0) THEN
            NEXT = NEXTR
            NEXTR = NEXTR+1
         ELSE IF (NEXTR .GT. NTABLE) THEN
            NEXT = NEXTL
            NEXTL = NEXTL-1
         ELSE IF (XBAR - X(NEXTL) .GT. X(NEXTR) - XBAR) THEN
            NEXT = NEXTR
            NEXTR = NEXTR+1
         ELSE
            NEXT = NEXTL
            NEXTL = NEXTL-1
         END IF
         XK(KP1) = X(NEXT)
         A(KP1) = F(NEXT)
         DO 13 J=KP1-1,1,-1
            A(J) = (A(J+1) - A(J))/(XK(KP1) - XK(J))
   13    CONTINUE
C                  FOR I=1,...,KP1, A(I) NOW CONTAINS THE DIV.DIFF. OF
C                  F(X) OF ORDER  KP1-I  AT  XK(I),...,XK(KP1) .
         PSIK = PSIK*(XBAR - XK(KP1-1))
         ERROR = A(1)*PSIK
C                  TEMPORARY PRINTOUT
         PRINT 613,KP1,XK(KP1),TABLE,ERROR
  613    FORMAT(I10,3E17.7)
         TABLE = TABLE + ERROR
         IF (ABS(ERROR) .LE. TOL) THEN
            IFLAG = 1
                                        RETURN
         END IF
   20 CONTINUE
      PRINT 620,KP1MAX
  620 FORMAT(' NO CONVERGENCE IN ',I2,' STEPS.')
      IFLAG = 2
                                        RETURN
      END

C

C
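For readers working outside FORTRAN, the same nearest-point Newton interpolation can be sketched in Python. This is our own illustration, not part of the book's text; the names are ours, and the sketch returns the IFLAG value alongside the interpolated value instead of printing.

```python
def table(xbar, x, f, tol, kmax=20):
    """Nearest-point Newton interpolation, a sketch of the FORTRAN
    function TABLE; returns (value, iflag)."""
    n = len(x)
    # Constant extrapolation outside the table range (IFLAG = 3).
    if xbar < x[0]:
        return f[0], 3
    if xbar > x[-1]:
        return f[-1], 3
    # Locate xbar: smallest index nxt with xbar <= x[nxt].
    nxt = next(i for i in range(1, n) if xbar <= x[i])
    xk = [x[nxt]]          # interpolation points used so far
    a = [f[nxt]]           # a[j] = divided difference f[xk[j],...,xk[-1]]
    value = a[0]
    psik = 1.0
    lo, hi = nxt - 1, nxt + 1
    for _ in range(1, min(kmax, n)):
        # Take the unused table entry nearest xbar.
        if lo < 0 or (hi < n and xbar - x[lo] > x[hi] - xbar):
            nxt, hi = hi, hi + 1
        else:
            nxt, lo = lo, lo - 1
        xk.append(x[nxt])
        a.append(f[nxt])
        # Update the divided differences against the new point.
        for j in range(len(a) - 2, -1, -1):
            a[j] = (a[j + 1] - a[j]) / (xk[-1] - xk[j])
        psik *= xbar - xk[-2]
        correction = a[0] * psik      # p_{k+1}(xbar) - p_k(xbar)
        value += correction
        if abs(correction) <= tol:
            return value, 1           # IFLAG = 1: converged
    return value, 2                   # IFLAG = 2: no convergence
```

Since each step adds one Newton correction term, the interpolant of degree k is exact for data from a polynomial of degree ≤ k, and the loop then terminates with a zero correction.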

EXERCISES

2.4-1 The FORTRAN function TABLE given in the text terminates as soon as |pk+1(XBAR) - pk(XBAR)| < TOL. Show that this does not guarantee that the value pk+1(XBAR) returned by TABLE is within TOL of the desired number f(XBAR) by the following examples:
(a) f(x) = x^2; for some I, X(I) = -10, X(I + 1) = 10, XBAR = 0, TOL = 10^-5.
(b) f(x) = x^3; for some I, X(I) = -100, X(I + 1) = 0, X(I + 2) = 100, XBAR = -50, TOL = 10^-5.

2.4-2 Iterated linear interpolation is based on the following observation attributable to Neville: Denote by pi,j(x) the polynomial of degree ≤ j - i which interpolates f(x) at the points xi, xi+1, . . . , xj, i < j. Then

    pi,j(x) = [(x - xi)pi+1,j(x) - (x - xj)pi,j-1(x)]/(xj - xi)

Verify this identity. [Hint: We used such an identity in Sec. 2.3; see Eq. (2.13).]
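Neville's identity states that pi,j(x) = [(x - xi)pi+1,j(x) - (x - xj)pi,j-1(x)]/(xj - xi). A quick numerical sanity check (our own sketch, not part of the exercise) evaluates both sides at a test point via the Lagrange form:

```python
def interp_value(xs, ys, t):
    # Lagrange form: value at t of the polynomial through (xs[k], ys[k]).
    total = 0.0
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        w = 1.0
        for m, xm in enumerate(xs):
            if m != k:
                w *= (t - xm) / (xk - xm)
        total += yk * w
    return total

def check_neville(xs, ys, i, j, t):
    """Return |lhs - rhs| of Neville's identity for the data (xs, ys)."""
    lhs = interp_value(xs[i:j + 1], ys[i:j + 1], t)
    rhs = ((t - xs[i]) * interp_value(xs[i + 1:j + 1], ys[i + 1:j + 1], t)
           - (t - xs[j]) * interp_value(xs[i:j], ys[i:j], t)) / (xs[j] - xs[i])
    return abs(lhs - rhs)
```

Because both sides are polynomials of degree ≤ j - i agreeing at the j - i + 1 points xi, . . . , xj, the discrepancy should be at roundoff level for any data and any evaluation point.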

2.4-3 Iterated linear interpolation (continued). Neville's identity established in Exercise 2.4-2 allows one to generate the entries pi,i+j(x̄) of a triangular table (whose first column consists of the values pi,i(x̄) = f(xi)) column by column, by repeatedly carrying out what looks like linear interpolation, to reach eventually the desired number p0,n(x̄), the value at x̄ of the interpolating polynomial which agrees with f(x) at the n + 1 points x0, . . . , xn. This is Neville's algorithm. Aitken's algorithm is different in that one generates instead a triangular table whose jth column consists of the numbers p0,1,...,j,r(x̄), r > j, with p0,1,...,j,r(x) (for r > j) the polynomial of degree ≤ j + 1 which agrees with f(x) at the points x0, x1, . . . , xj, and xr.
Show by an operations count that Neville's algorithm is more expensive than Algorithm 2.4. (Also, observe that Algorithm 2.4 provides, at no extra cost, a Newton form for the interpolating polynomial for subsequent evaluation at other points, while the information generated in Neville's or Aitken's algorithm is of no help for evaluation at other points.)
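The column-by-column construction can be sketched in Python (our own illustration; one array is overwritten in place, so at column j the entry p[i + 1] still holds the column-(j - 1) value pi+1,i+j(t) that the identity needs):

```python
def neville(xs, ys, t):
    """Neville's algorithm: the value at t of the polynomial through
    (xs[i], ys[i]), built by repeated 'linear interpolation'."""
    p = list(ys)          # column j = 0: p_{i,i}(t) = f(x_i)
    n = len(xs)
    for j in range(1, n):
        for i in range(n - j):
            # p_{i,i+j}(t) from p_{i+1,i+j}(t) and p_{i,i+j-1}(t)
            p[i] = ((t - xs[i]) * p[i + 1]
                    - (t - xs[i + j]) * p[i]) / (xs[i + j] - xs[i])
    return p[0]           # p_{0,n-1}(t), the desired number
```

Note that each pass over a column costs a division and several multiplications per entry, in line with the operations count the exercise asks for.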

2.4-4 In inverse interpolation in a table, one is given a number ȳ and wishes to find the point x̄ so that f(x̄) = ȳ, where f(x) is the tabulated function. If f(x) is (continuous and) strictly monotone-increasing or -decreasing, this problem can always be solved by considering the given table xi, f(xi), i = 0, 1, 2, . . . , to be a table yi, g(yi), i = 0, 1, 2, . . . , for the inverse function g(y) = f^-1(y) = x, by taking yi = f(xi), g(yi) = xi, i = 0, 1, 2, . . . , and interpolating for the unknown value x̄ = g(ȳ) in this table. Use the FORTRAN function TABLE to find x̄ so that f(x̄) = ȳ.
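The table-swapping idea can be sketched in Python (our own code, using plain Newton divided differences rather than the TABLE routine; it assumes f is strictly monotone, so that the swapped abscissas yi are distinct):

```python
def inverse_interp(xs, fs, ybar):
    """Inverse interpolation: treat (f(x_i), x_i) as a table for the
    inverse function g = f^{-1} and interpolate g at ybar."""
    ys, gs = list(fs), list(xs)       # y_i = f(x_i), g(y_i) = x_i
    n = len(ys)
    # Newton divided-difference coefficients for g on the nodes ys.
    coef = list(gs)
    for k in range(1, n):
        for i in range(n - 1, k - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (ys[i] - ys[i - k])
    # Evaluate the Newton form at ybar by nested multiplication.
    value = coef[n - 1]
    for i in range(n - 2, -1, -1):
        value = value * (ybar - ys[i]) + coef[i]
    return value
```

When ȳ happens to be a tabulated value f(xi), the interpolant reproduces xi exactly (up to roundoff), since interpolation is exact at its own nodes.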

2.5 THE ERROR OF THE INTERPOLATING POLYNOMIAL

Let f(x) be a real-valued function on the interval I = [a, b], and let x0, . . . , xn be n + 1 distinct points in I. With pn(x) the polynomial of degree ≤ n which interpolates f(x) at x0, . . . , xn, the interpolation error is given by

    en(x) = f(x) - pn(x)        (2.15)

Let now x̄ be any point different from x0, . . . , xn. If pn+1(x) is the polynomial of degree ≤ n + 1 which interpolates f(x) at x0, . . . , xn and at x̄, then pn+1(x̄) = f(x̄), while by (2.10),

    pn+1(x) = pn(x) + f[x0, . . . , xn, x̄](x - x0) · · · (x - xn)

It follows that en(x̄) = f(x̄) - pn(x̄) = pn+1(x̄) - pn(x̄). Therefore,

    en(x̄) = f[x0, . . . , xn, x̄](x̄ - x0) · · · (x̄ - xn)        (2.16)

showing the error to be “like the next term” in the Newton form.

We cannot evaluate the right side of (2.16) without knowing the number f[x0, . . . , xn, x̄]. But as we now prove, this number is closely related to the (n + 1)st derivative of f(x), and using this information, we can at times estimate en(x̄).

52

INTERPOLATION BY POLYNOMIALS

Theorem 2.2 Let f(x) be a real-valued function, defined on [a, b] and k times differentiable in (a, b). If x0, . . . , xk are k + 1 distinct points in [a, b], then there exists ξ ∈ (a, b) such that

    f[x0, . . . , xk] = f^(k)(ξ)/k!        (2.17)

For k = 1, this is just the mean-value theorem for derivatives (see Sec. 1.7). For the general case, observe that the error function ek(x) = f(x) - pk(x) has (at least) the k + 1 distinct zeros x0, . . . , xk in I = [a, b]. Hence, if f(x), and therefore ek(x), is k times differentiable on (a, b), then it follows from Rolle’s theorem (see Sec. 1.7) that ek'(x) has at least k zeros in (a, b); hence ek''(x) has at least k - 1 zeros in (a, b), and continuing in this manner, we finally get that ek^(k)(x) has at least one zero in (a, b). Let ξ be one such zero. Then

    0 = ek^(k)(ξ) = f^(k)(ξ) - pk^(k)(ξ)

On the other hand, we know that, for any x,

    pk^(k)(x) = k! f[x0, . . . , xk]

since, by definition, f[x0, . . . , xk] is the leading coefficient of pk(x), and (2.17) now follows.
By taking a = mini xi, b = maxi xi, it follows that the unknown point ξ in (2.17) can be assumed to lie somewhere between the xi’s.
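Theorem 2.2 lends itself to a numerical illustration (our own sketch): for f(x) = e^x every derivative is again e^x, so k! times the divided difference must equal e^ξ for some ξ strictly between the smallest and largest node.

```python
import math

def divided_difference(xs, f):
    # f[x_0,...,x_k] via the standard divided-difference table.
    vals = [f(x) for x in xs]
    k = len(xs) - 1
    for j in range(1, k + 1):
        for i in range(k, j - 1, -1):
            vals[i] = (vals[i] - vals[i - 1]) / (xs[i] - xs[i - j])
    return vals[k]

# By (2.17), 3! * f[x0,x1,x2,x3] = exp(xi) for some xi in (0.1, 1.0).
nodes = [0.1, 0.4, 0.7, 1.0]
val = math.factorial(3) * divided_difference(nodes, math.exp)
```

Since e^x is strictly increasing, checking e^0.1 < val < e^1.0 confirms that such a ξ exists inside the node interval.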

If we apply Theorem 2.2 to (2.16), we get Theorem 2.3.

Theorem 2.3 Let f(x) be a real-valued function defined on [a, b] and n + 1 times differentiable on (a, b). If pn(x) is the polynomial of degree ≤ n which interpolates f(x) at the n + 1 distinct points x0, . . . , xn in [a, b], then for all x ∈ [a, b] there exists ξ = ξ(x) ∈ (a, b) such that

    f(x) - pn(x) = f^(n+1)(ξ)/(n + 1)! · (x - x0) · · · (x - xn)        (2.18)

It is important to note that ξ depends on the point x at which the error estimate is required. This dependence need not even be continuous. As we have need in Chap. 7 to integrate and differentiate en(x) with respect to x, we usually prefer for such purposes the formula (2.16). For, as we show in Sec. 2.7, f[x0, . . . , xn, x] is a well-behaved function of x.
The error formula (2.18) is of only limited practical utility since, in general, we will seldom know f^(n+1)(x), and we will almost never know the point ξ. But when a bound on |f^(n+1)(x)| is known over the entire interval [a, b], then we can use (2.18) to obtain a (usually crude) bound on the error of the interpolating polynomial in that interval.


Example 2.6 Find a bound for the error in linear interpolation.
The linear polynomial interpolating f(x) at x0 and x1 is

    p1(x) = f(x0) + f[x0, x1](x - x0)

Equation (2.18) then yields the error formula

    f(x) - p1(x) = f''(ξ)/2 · (x - x0)(x - x1)

where ξ depends on x. If x is a point between x0 and x1, then ξ also lies between x0 and x1. Hence, if we know that |f''(x)| ≤ M on [x0, x1], then

    |f(x) - p1(x)| ≤ (M/2)|(x - x0)(x - x1)|

The maximum value of |(x - x0)(x - x1)| over [x0, x1] occurs at x = (x0 + x1)/2; hence it is (x1 - x0)^2/4. It follows that, for any x0 ≤ x ≤ x1,

    |f(x) - p1(x)| ≤ (M/8)(x1 - x0)^2
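The bound |f(x) - p1(x)| ≤ M(x1 - x0)^2/8, with M a bound on |f''| over [x0, x1], is easy to test by brute force (our own sketch, sampling the error densely):

```python
import math

def linear_interp_worst_error(f, x0, x1, samples=1000):
    # Worst sampled deviation between f and its linear interpolant on [x0, x1].
    worst = 0.0
    for i in range(samples + 1):
        x = x0 + (x1 - x0) * i / samples
        p1 = f(x0) + (f(x1) - f(x0)) / (x1 - x0) * (x - x0)
        worst = max(worst, abs(f(x) - p1))
    return worst

# For f = sin on [0, 0.5], |f''| = |sin x| <= sin(0.5), so M = sin(0.5) works.
err = linear_interp_worst_error(math.sin, 0.0, 0.5)
bound = math.sin(0.5) * 0.5 ** 2 / 8
```

The sampled maximum error must fall below the bound, which it does with room to spare since |sin''| attains the value M only at the right endpoint.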

Example 2.7 Determine the spacing h in a table of equally spaced values of the function √x between 1 and 2, so that interpolation with a second-degree polynomial in this table will yield a desired accuracy.
By assumption, the table will contain f(xi) = √xi, with xi = 1 + ih, i = 0, . . . , N, where N = 1/h. If x̄ ∈ [1, 2], then we approximate √x̄ by p2(x̄), where p2(x) is the quadratic polynomial which interpolates f(x) at the three tabular points xi-1, xi, xi+1 nearest x̄. By (2.18), the error is then

    f(x̄) - p2(x̄) = f'''(ξ)/3! · (x̄ - xi-1)(x̄ - xi)(x̄ - xi+1)

for some ξ in (xi-1, xi+1). Since we do not know ξ, we can merely estimate

    |f(x̄) - p2(x̄)| ≤ max|f'''|/6 · |(x̄ - xi-1)(x̄ - xi)(x̄ - xi+1)|

One calculates f'''(x) = (3/8)x^(-5/2); hence |f'''(x)| ≤ 3/8 on [1, 2]. Further,

    |(x̄ - xi-1)(x̄ - xi)(x̄ - xi+1)| = |(y + h)y(y - h)| = |y^3 - h^2 y|

using the linear change of variables y = x̄ - xi. Since the function y^3 - h^2 y vanishes at y = -h, y = 0, and y = h, the maximum of |y^3 - h^2 y| over [-h, h] must occur at one of the extrema of y^3 - h^2 y. These extrema are found by solving the equation 3y^2 - h^2 = 0, giving y = ±h/√3. Hence

    max |y^3 - h^2 y| = (h/√3)(h^2 - h^2/3) = 2h^3/(3√3)

We are now assured that, for any x̄ ∈ [1, 2],

    |√x̄ - p2(x̄)| ≤ (1/6)(3/8)(2/(3√3))h^3 = h^3/(24√3)

if p2(x) is chosen as the quadratic polynomial which interpolates √x at the three tabular points nearest x̄. If we wish to obtain seven-place accuracy this way, we would have to choose h so that

    h^3/(24√3) ≤ 5 · 10^-8

giving h ≤ 0.0127.
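The derived spacing can be checked directly (our own sketch; it assumes the reconstructed reading of this example, f(x) = √x tabulated on [1, 2]): with h = 0.0127, the quadratic-interpolation error at sampled points of one panel should stay below h^3/(24√3), which in turn is at most 5 · 10^-8.

```python
import math

def quad_interp_error(f, xbar, pts):
    # |f(xbar) - p2(xbar)| with p2 the quadratic through (p, f(p)), p in pts,
    # evaluated via Lagrange weights.
    p2 = 0.0
    for k, xk in enumerate(pts):
        w = f(xk)
        for m, xm in enumerate(pts):
            if m != k:
                w *= (xbar - xm) / (xk - xm)
        p2 += w
    return abs(f(xbar) - p2)

h = 0.0127
bound = h ** 3 / (24 * math.sqrt(3))   # error bound for sqrt on [1, 2]
xi = 1 + 50 * h                        # an interior tabular point
panel = [xi - h, xi, xi + h]           # its panel [x_{i-1}, x_{i+1}]
worst = max(quad_interp_error(math.sqrt, xi + t * h, panel)
            for t in (-0.9, -0.577, -0.25, 0.3, 0.577, 0.9))
```

The sampled error sits well under the bound, since |f'''| is largest at x = 1 while this panel lies farther to the right.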

The function ψn(x) = (x - x0)(x - x1) · · · (x - xn), which appears in (2.18), depends, of course, strongly on the placement of the interpolation points. It is possible to choose these points for given n in the given interval a ≤ x ≤ b in such a way that max |ψn(x)| there is as small as possible. This choice of points, the so-called Chebyshev points, is discussed in some detail in Sec. 6.1. For the common choice of equally spaced interpolation points, the local maxima of |ψn(x)| increase as one moves from the middle of the interval toward its ends, and this increase becomes more pronounced with increasing n (see Fig. 2.3). In view of (2.18), it is therefore advisable (at least when interpolating to uniformly spaced data) to make use of the interpolating polynomial only near the middle data points. The interpolant becomes less reliable as one approaches the leftmost or rightmost data point. Of course, going beyond them is even worse. Such an undertaking is called extrapolation and should only be used with great caution.

Figure 2.3 The function |ψn(x)| for (a) equally spaced interpolation points (solid); (b) Chebyshev points for the same interval (dotted).
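The contrast shown in Fig. 2.3 can be reproduced numerically (our own sketch; the Chebyshev points on [-1, 1] are taken as cos((2i + 1)π/(2n + 2)), as treated in Sec. 6.1):

```python
import math

def max_abs_psi(nodes, a, b, samples=2000):
    # Largest |(x - x0)...(x - xn)| over [a, b], estimated by dense sampling.
    worst = 0.0
    for i in range(samples + 1):
        x = a + (b - a) * i / samples
        p = 1.0
        for xj in nodes:
            p *= x - xj
        worst = max(worst, abs(p))
    return worst

n, a, b = 8, -1.0, 1.0
equi = [a + (b - a) * i / n for i in range(n + 1)]
cheb = [math.cos((2 * i + 1) * math.pi / (2 * (n + 1))) for i in range(n + 1)]
psi_equi = max_abs_psi(equi, a, b)
psi_cheb = max_abs_psi(cheb, a, b)
```

For the Chebyshev points on [-1, 1], max |ψn| equals 2^-n exactly, which is far smaller than the equispaced value, whose large lobes sit near the interval ends.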

EXERCISES

2.5-1 A table of values of cos x is required so that linear interpolation will yield six-decimal-place accuracy for any value of x in the range of the table. Assuming that the tabular values are to be equally spaced, what is the minimum number of entries needed in the table?

2.5-2 A certain function has been tabulated for equally spaced values of x with step h = 0.1. What is the maximum error encountered if cubic interpolation is to be used to calculate the value of the function at any point of the tabulated interval?

2.5-3 Prove: If the values f(x0), . . . , f(xn) are our only information about the function f(x), then we can say nothing about the error f(x̄) - pn(x̄) at a point x̄ different from the interpolation points; that is, the error may be “very large” or may be “very small.” [Hint: Consider interpolation at x0, x1, . . . , xn to the function f(x) = K(x - x0) · · · (x - xn), where K is an unknown constant.] What does this imply about programs like the FUNCTION TABLE in Sec. 2.4 or Algorithm 2.4?

2.5-4 Use (2.18) to give a lower bound on the interpolation error

when

2.6 INTERPOLATION IN A FUNCTION TABLE BASED ON

EQUALLY SPACED POINTS

Much of engineering and scientific calculation uses functions such as sin x,

ex, Jn (x), erf(x), etc., which are defined by an infinite series, or as the

solution of a certain differential equation, or by similar processes involving

limits, and can therefore, in general, not be evaluated in a finite number of

steps. Computer
