# Morgan - Numerical Methods

Посмотреть архив целикомПросмотреть файл в отдельном окне: 3a9e317ef2267cbfaba3a5cc4c1f9766.pdf

Home Next

Numerical Methods

Real-Time and Embedded Systems Programming

.................................

Featuring in-depth

coverage of:

l

Fixed and

floating point

mathematical

techniques

without a

coprocessor

l

Numerical I/O

for embedded

systems

l

Data conversion

methods

Don Morgan

Numerical Methods

Real-Time and Embedded Systems Programming

Numerical Methods

Real-Time and Embedded Systems Programming

Featuring in-depth

coverage of:

l

Fixed and

floating point

mathematical

techniques

without a

coprocessor

l

Numerical I/O

for embedded

systems

l

Data conversion

methods

Don Morgan

M&T Books

A Division of M&T Publishing, Inc.

411 BOREL AVE.

SAN MATEO, CA 94402

© 1992 by M&T Publishing, Inc.

Printed in the United States of America

All rights reserved. No part of this book or disk may be reproduced or transmitted in any form or by any

means, electronic or mechanical, including photocopying, recording, or by any information storage and

retrieval system, without prior written permission from the Publisher. Contact the Publisher for

information on foreign rights.

Limits of Liability and Disclaimer of Warranty

The Author and Publisher of this book have used their best efforts in preparing the book and the

programs contained in it and on the diskette. These efforts include the development, research, and

testing of the theories and programs to determine their effectiveness.

The Author and Publisher make no warranty of any kind, expressed or implied, with regard to these

programs or the documentation contained in this book. The Author and Publisher shall not be liable in

any event for incidental or consequential damages in connection with, or arising out of, the furnishing,

performance, or use of these programs.

Library of Congress Cataloging-in-Publication Data

Morgan, Don 1948Numerical Methods/Real-Time and Embedded Systems Programming

by Don Morgan

p. cm.

Includes Index

ISBN l-5585l-232-2 Book and Disk set

2. Real-time data processing.

1. Electronic digital computers—Programming.

I. Title

3. Embedded computer systems—Programming.

QA76.6.M669 1992

513.2 ' 0285—dc20

91-47911

CIP

Project Editor: Sherri Morningstar

95 94 93 92

Cover Design: Lauren Smith Design

4 3 2 1

Trademarks: The 80386, 80486 are registered trademarks and the 8051, 8048, 8086, 8088,

8OC196 and 80286 are products of Intel Corporation, Santa Clara, CA. The Z80 is a registered

trademark of Zilog Inc., Campbell, CA. The TMS34010 is a product of Texas Instruments, Dallas,

TX. Microsoft C is a product of Microsoft Corp. Redmond, WA.

Acknowledgment

Thank you Anita, Donald and Rachel

for your love and forbearance.

Contents

WHY THIS BOOK IS FOR YOU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

CHAPTER 1: NUMBERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

Systems of Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

The Radix Point, Fixed and Floating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2

Types of Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 5

Fixed Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 5

Floating Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 7

Positive and Negative Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 8

Fundamental Arithmetic Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1

Microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1

Buswidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

22

Data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 4

Flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 4

Rounding and the Sticky Bit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 5

Branching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

NUMERICAL METHODS

Instructions ............................................................................................. 2 6

Addition ............................................................................................ 26

Subtraction ........................................................................................

27

27

Multiplication ..................................................................................

Division ............................................................................................ 28

Negation and Signs ............................................................................. 28

Shifts, Rotates and Normalization ....................................................... 29

Decimal and ASCII Instructions ......................................................... 30

CHAPTER 2: INTEGERS ..........................................................

Addition and Subtraction .............................................................................

Unsigned Addition and Subtraction ....................................................

Multiprecision Arithmetic ..................................................................

add64: Algorithm ...............................................................................

add64: Listing .....................................................................................

sub64: Algorithm ................................................................................

sub64: Listing .....................................................................................

Signed Addition and Subtraction ........................................................

Decimal Addition and Subtraction ......................................................

Multiplication and Division .........................................................................

Signed vs. Unsigned .......................................................................

signed-operation: Algorithm ...............................................................

signed-operation: Listing ...................................................................

Binary Multiplication ..................................................................................

cmul: Algorithm ................................................................................

cmul: Listing ....................................................................................

33

33

33

35

36

36

37

37

38

40

42

43

44

45

46

49

49

CONTENTS

A Faster Shift and Add ................................................................................. 50

51

52

53

55

55

57

cmul2: Algorithm ..............................................................................

cmul2: Listing ....................................................................................

Skipping Ones and Zeros .............................................................................

booth: Algorithm ................................................................................

booth: Listing .....................................................................................

bit-pair: Algorithm .............................................................................

bit-pair: Listing .................................................................................

Hardware Multiplication: Single and Multiprecision .....................................

mu132: Algorithm ...............................................................................

mu132: Listing ....................................................................................

Binary Division ...........................................................................................

Error Checking ..................................................................................

Software Division ........................................................................................

cdiv: Algorithm ..................................................................................

cdiv: Listing. .....................................................................................

Hardware Division ......................................................................................

div32: Algorithm ................................................................................

div32: Listing ....................................................................................

div64: Algorithm ...............................................................................

div64: Listing .....................................................................................

58

61

62

63

64

64

65

67

68

69

74

75

79

80

CHAPTER 3; REAL NUMBERS ..................................................

Fixed Point .................................................................................................

Significant Bits ............................................................................................

The Radix Point .........................................................................................

Rounding ....................................................................................................

Basic Fixed-Point Operations ......................................................................

85

86

87

89

89

92

NUMERICAL METHODS

A Routine for Drawing Circles...................................................................

circle: Algorithm ..........................................................................

circle: Listing ................................................................................

Bresenham’s Line-Drawing Algorithm ............................................

line: Algorithm .............................................................................

line: Listing .......................................................................................

Division by Inversion .............................................................................

divnewt: Algorithm..............................................................................

divnewt: Listing................................................................................

Division by Multiplication..............................................................................

divmul: Algorithm..............................................................................

divmul: Listing...................................................................................

95

98

98

100

101

102

105

108

109

114

116

117

CHAPTER 4: FLOATING-POINT ARITHMETIC ........................ 123

What To Expect ....................................................................................... 124

A Small Floating-Point Package................................................................ 127

The Elements of a Floating-Point Number.............................................. 128

Extended Precision ............................................................................ 131

The External Routines ................................................................................. 132

fp_add: Algorithm ............................................................................. 132

fp_add: Listing................................................................................. 133

The Core Routines........................................................................................ 134

Fitting These Routines to an Application...................................................... 136

Addition and Subtraction: FLADD.............................................................. 136

FLADD: The Prologue. Algorithm...................................................... 138

FLADD: The Prologue. Listing.......................................................... 138

The FLADD Routine Which Operand is Largest? Algorithm.............. 140

The FLADD Routine: Which Operand is Largest? Listing.................... 141

CONTENTS

The FLADD Routine: Aligning the Radix Points. Algorithm................. 142

The FLADD Routine: Aligning the Radix Point. Listing .................... 143

FLADD: The Epilogue. Algorithm..................................................... 144

FLADD: The Epilogue. Listing ........................................................... 145

Multiplication and Division: FLMUL..........................................................

flmul: Algorithm .................................................................................

flmul: Listing ....................................................................................

mu164a: Algorithm ...........................................................................

mu164a: Listing .................................................................................

FLDIV...............................................................................................

fldiv: Algorithm..................................................................................

fldiv: Listing ......................................................................................

Rounding .....................................................................................................

Round: Algorithm...............................................................................

Round: Listing ...................................................................................

147

147

148

151

152

154

154

155

159

159

160

CHAPTER 5: INPUT, OUTPUT, AND CONVERSION....... ........163

Decimal Arithmetic ...................................................................................... 164

Radix Conversions ...................................................................................... 165

Integer Conversion by Division................................................................... 165

bn_dnt: Algorithm .............................................................................. 166

bn_dnt: Listing ................................................................................... 167

Integer Conversion by Multiplication........................................................... 169

dnt_bn: Algorithm.............................................................................. 170

dnt_bn: Listing ................................................................................... 170

Fraction Conversion b y Multiplication .......................................................... 172

bfc_dc: Algorithm............................................................................... 173

bfc_dc: Listing .................................................................................... 173

NUMERICAL METHODS

Fraction Conversion by Division .................................................................. 175

Dfc_bn: Algorithm ............................................................................. 176

Dfc_bn: Listing .................................................................................. 177

Table-Driven Conversions............................................................................

Hex to ASCII ..............................................................................................

hexasc: Algorithm.............................................................................

hexasc: Listing ..................................................................................

Decimal to Binary ......................................................................................

tb_dcbn: Algorithm ............................................................................

tb_dcbn: Listing ..................................................................................

Binary to Decimal.......................................................................................

tb_bndc: Algorithm............................................................................

tb_bndc: Listing ..................................................................................

Floating-Point Conversions..........................................................................

ASCII to Single-Precision Float...................................................................

atf: Algorithm ....................................................................................

atf: Listing .........................................................................................

Single-Precision Float to ASCII...................................................................

fta: Algorithm ....................................................................................

Fta: Listing .........................................................................................

Fixed Point to Single-Precision Floating Point .............................................

ftf: Algorithm .....................................................................................

ftf: Listing ........................................................................................

Single-Precision Floating Point to Fixed Point .............................................

ftfx Algorithm ..................................................................................

ftfx: Listing .......................................................................................

179

179

180

180

182

182

184

187

188

189

192

192

193

195

200

200

202

206

207

208

211

212

212

CONTENTS

CHAPTER 6: THE ELEMENTARY FUNCTIONS ....................

Fixed Point Algorithms ..............................................................................

Lookup Tables and Linear Interpolation......................................................

lg 10: Algorithm ...............................................................................

lg 10: Listing ....................................................................................

Dcsin: Algorithm ............................................................................

Dcsin: Listing ..................................................................................

Computing With Tables .............................................................................

Pwrb: Algorithm ..............................................................................

Pwrb: Listing ...................................................................................

CORDIC Algorithms.................................................................................

Circular: Algorithm ............................................................................

Circular: Listing ................................................................................

Polynomial Evaluations ...........................................................................

taylorsin: Algorithm .......................................................................

taylorsin: Listing .............................................................................

Polyeval: Algorithm...........................................................................

Polyeval: Listing ...............................................................................

Calculating Fixed-Point Square Roots........................................................

fx_sqr: Algorithm .............................................................................

fx_sqr: Listing ..................................................................................

school_sqr: Algorithm ......................................................................

school_sqr: Listing ..............................................................................

Floating-Point Approximations ...................................................................

Floating-Point Utilities ...............................................................................

frxp: Algorithm ..................................................................................

frxp: Listing ......................................................................................

ldxp: Algorithm.................................................................................

ldxp: Listing......................................................................................

217

217

217

219

220

224

227

233

234

235

237

242

242

247

249

250

251

251

253

254

254

256

257

259

259

259

260

261

261

NUMERICAL METHODS

263

flr: Algorithm ...........................................................................................

263

flr: Listing .................................................................................................

265

flceil: Algorithm ......................................................................................

266

flceil: Listing ............................................................................................

268

intmd: Algorithm....................................................................................

268

intmd: Listing ...........................................................................................

269

Square Roots ......................................................................................................

270

Flsqr: Algorithm.......................................................................................

271

Flsqr: Listing ............................................................................................

273

Sines and Cosines...............................................................................................

274

flsin: Algorithm........................................................................................

275

Flsin: Listing............................................................................................

APPENDIXES:

A: A PSEUDO-RANDOM NUMBER GENERATOR .................... 2 8 1

B: TABLES AND EQUATES ............................................

295

C: FXMATH.ASM ............................................................. 297

D: FPMATH.ASM .......................................................... 337

E: IO.ASM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

373

CONTENTS

F: TRANS.ASM AND TABLE.ASM

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407

G: MATH.C............................................................................475

GLOSSARY

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485

INDEX ................................................................................ 493

Additional Disk

Just in case you need an additional disk, simply call the toll-free number listed

below. The disk contains all the routines in the book along with a simple C shell that

can be used to exercise them. This allows you to walk through the routines to see how

they work and test any changes you might make to them. Once you understand how

the routine works, you can port it to another processor. Only $10.00 postage-paid.

To order with your credit card, call Toll-Free l-800-533-4372 (in CA 1-800356-2002). Mention code 7137. Or mail your payment to M&T Books, 411 Bore1

Ave., Suite 100, San Mateo, CA 94402-3522. California residents please add

applicable sales tax.

Why This Book Is For You

The ability to write efficient, high-speed arithmetic routines ultimately depends

upon your knowledge of the elements of arithmetic as they exist on a computer. That

conclusion and this book are the result of a long and frustrating search for

information on writing arithmetic routines for real-time embedded systems.

With instruction cycle times coming down and clock rates going up, it would

seem that speed is not a problem in writing fast routines. In addition, math

coprocessors are becoming more popular and less expensive than ever before and are

readily available. These factors make arithmetic easier and faster to use and

implement. However, for many of you the systems that you are working on do not

include the latest chips or the faster processors. Some of the most widely used

microcontrollers used today are not Digital Signal Processors (DSP), but simple

eight-bit controllers such as the Intel 8051 or 8048 microprocessors.

Whether you are using one on the newer, faster machines or using a simple

eight-bit one, your familiarity with its foundation will influence the architecture of

the application and every program you write. Fast, efficient code requires an

understanding of the underlying nature of the machine you are writing for. Your

knowledge and understanding will help you in areas other than simply implementing

the operations of arithmetic and mathematics. For example, you may want the

ability to use decimal arithmetic directly to control peripherals such as displays and

thumbwheel switches. You may want to use fractional binary arithmetic for more

efficient handling of D/A converters or you may wish to create buffers and arrays that

wrap by themselves because they use the word size of your machine as a modulus.

The intention in writing this book is to present a broad approach to microprocessor arithmetic ranging from data on the positional number system to algorithms for

1

NUMERICAL METHODS

developing many elementary functions with examples in 8086 assembler and

pseudocode. The chapters cover positional number theory, the basic arithmetic

operations to numerical I/O, and advanced topics are examined in fixed and floating

point arithmetic. In each subject area, you will find many approaches to the same

problem; some are more appropriate for nonarithmetic, general purpose machines

such as the 8051 and 8048, and others for the more powerful processors like the

Tandy TMS34010 and the Intel 80386. Along the way, a package of fixed-point and

floating-point routines are developed and explained. Besides these basic numerical

algorithms, there are routines for converting into and out of any of the formats used,

as well as base conversions and table driven translations. By the end of the book,

readers will have code they can control and modify for their applications.

This book concentrates on the methods involved in the computational process,

not necessarily optimization or even speed, these come through an understanding of

numerical methods and the target processor and application. The goal is to move the

reader closer to an understanding of the microcomputer by presenting enough

explanation, pseudocode, and examples to make the concepts understandable. It is

an aid that will allow engineers, with their familiarity and understanding of the target,

to write the fastest, most efficient code they can for the application.

2

Introduction

If you work with microprocessors or microcontrollers, you work with numbers.

Whether it is a simple embedded machine-tool controller that does little more than

drive displays, or interpret thumbwheel settings, or is a DSP functioning in a realtime system, you must deal with some form of numerics. Even an application that

lacks special requirements for code size or speed might need to perform an

occasional fractional multiply or divide for a D/A converter or another peripheral

accepting binary parameters. And though the real bit twiddling may hide under the

hood of a higher-level language, the individual responsible for that code must know

how that operation differs from other forms of arithmetic to perform it correctly.

Embedded systems work involves all kinds of microprocessors and

microcontrollers, and much of the programming is done in assembler because of the

speed benefits or the resulting smaller code size. Unfortunately, few references

are written to specifically address assembly language programming. One of the

major reasons for this might be that assembly-language routines are not easily

ported from one processor to another. As a result, most of the material devoted

to assembler programming is written by the companies that make the processors. The code and algorithms in these cases are then tailored to the particular

advantages (or to overcoming the particular disadvantages) of the product. The

documentation that does exist contains very little about writing floating-point

routines or elementary functions.

This book has two purposes. The first and primary aim is to present a spectrum

of topics involving numerics and provide the information necessary to understand

the fundamentals as well as write the routines themselves. Along with this information are examples of their implementation in 8086 assembler and pseudocode

that show each algorithm in component steps, so you can port the operation to

another target. A secondary, but by no means minor, goal is to introduce you

3

NUMERICAL METHODS

to the benefits of binary arithmetic on a binary machine. The decimal numbering

system is so pervasive that it is often difficult to think of numbers in any other format,

but doing arithmetic in decimal on a binary machine can mean an enormous number

of wasted machine cycles, undue complexity, and bloated programs. As you proceed

through this book, you should become less dependent on blind libraries and more

able to write fast, efficient routines in the native base of your machine.

Each chapter of this book provides the foundation for the next chapter. At the

code level, each new routine builds on the preceeding algorithms and routines.

Algorithms are presented with an accompanying example showing one way to

implement them. There are, quite often, many ways that you could solve the

algorithm. Feel free to experiment and modify to fit your environment.

Chapter 1 covers positional number theory, bases, and signed arithmetic. The

information here provides the necessary foundation to understand both decimal and

binary arithmetic. That understanding can often mean faster more compact routines

using the elements of binary arithmetic- in other words, shifts, additions, and

subtractions rather than complex scaling and extensive routines.

Chapter 2 focuses on integer arithmetic, presenting algorithms for performing

addition, subtraction, multiplication, and division. These algorithms apply to machines that have hardware instructions and those capable of only shifts, additions,

and subtractions.

Real numbers (those with fractional extension) are often expressed in floating

point, but fixed point can also be used. Chapter 3 explores some of the qualities of

real numbers and explains how the radix point affects the four basic arithmetic

functions. Because the subject of fractions is covered, several rounding techniques

are also examined. Some interesting techniques for performing division, one using

multiplication and the other inversion, are also presented. These routines are

interesting because they involve division with very long operands as well as from a

purely conceptual viewpoint. At the end of the chapter, there is an example of an

algorithm that will draw a circle in a two dimensional space, such as a graphics

monitor, using only shifts, additions and subtractions.

Chapter 4 covers the basics of floating-point arithmetic and shows how scaling

is done. The four basic arithmetic functions are developed into floating-point

4

INTRODUCTION

routines using the fixed point methods given in earlier chapters.

Chapter 5 discusses input and output routines for numerics. These routines deal

with radix conversion, such as decimal to binary, and format conversions, such as

ASCII to floating point. The conversion methods presented use both computational

and table-driven techniques.

Finally, the elementary functions are discussed in Chapter 6. These include

table-driven techniques for fast lookup and routines that rely on the fundamental

binary nature of the machine to compute fast logarithms and powers. The CORDIC

functions which deliver very high-quality transcendentals with only a few shifts and

additions, are covered, as are the Taylor expansions and Horner’s Rule. The

chapter ends with an implementation of a floating-point sine/cosine algorithm

based upon a minimax approximation and a floating-point square root.

Following the chapters, the appendices comprise additional information and

reference materials. Appendix A presents and explains the pseudo-random number

generator developed to test many of the routines in the book and includes

SPECTRAL.C, a C program useful in testing the functions described in this book.

This program was originally created for the pseudo-random number generator and

incorporates a visual check and Chi-square statistical test on the function. Appendix

B offers a small set of constants commonly used.

The source code for all the arithmetic functions, along with many ancillary

routines and examples, is in appendices C through F.

Integer and fixed-point routines are in Appendix C. Here are the classical

routines for multiplication and division, handling signs, along with some of the more

complex fixed-point operations, such as the Newton Raphson iteration and linear

interpolation for division.

Appendix D consists of the basic floating-point routines for addition,

subtraction, multiplication, and division, Floor, ceiling, and absolute value

functions are included here, as well as many other functions important to the

more advanced math in Chapter 6.

The conversion routines are in Appendix E. These cover the format and

numerical conversions in Chapter 5

In Appendix F, there are two source files. TRANS.ASM contains the elementary

5

NUMERICAL METHODS

functions described in Chapter 6, and TABLE.ASM that holds the tables, equates

and constants used in TRANS.ASM and many of the other modules.

MATH.C in Appendix F is a C program useful in testing the functions described

in this book. It is a simple shell with the defines and prototypes necessary to perform

tests on the routines in the various modules.

Because processors and microcontrollers differ in architecture and instruction

set, algorithmic solutions to numeric problems are provided throughout the book for

machines with no hardware primitives for multiplication and division as well as for

those that have such primitives.

Assembly language by nature isn’t very portable, but the ideas involved in

numeric processing are. For that reason, each algorithm includes an explanation that

enables you to understand the ideas independently of the code. This explanation is

complemented by step-by-step pseudocode and at least one example in 8086

assembler. All the routines in this book are also available on a disk along with a

simple C shell that can be used to exercise them. This allows you to walk through the

routines to see how they work and test any changes you might make to them. Once

you understand how the routine works, you can port it to another processor. The

routines as presented in the book are formatted differently from the same routines on

the disk. This is done to accommodate the page size. Any last minute changes to the

source code are documented in the Readme file on the disk.

There is no single solution for all applications; there may not even be a single

solution for a particular application. The final decision is always left to the individual

programmer, whose skills and knowledge of the application are what make the

software work. I hope this book is of some help.

6

Previous Home

CHAPTER 1

Numbers

Numbers are pervasive; we use them in almost everything we do, from counting

the feet in a line of poetry to determining the component frequencies in the periods

of earthquakes. Religions and philosophies even use them to predict the future. The

wonderful abstraction of numbers makes them useful in any situation. Actually, what

we find so useful aren’t the numbers themselves (numbers being merely a representation), but the concept of numeration: counting, ordering, and grouping.

Our numbering system has humble beginnings, arising from the need to quantify

objects-five horses, ten people, two goats, and so on-the sort of calculations that

can be done with strokes of a sharp stone or root on another stone. These are natural

numbers-positive, whole numbers, each defined as having one and only one

immediate predecessor. These numbers make up the number ray, which stretches

from zero to infinity (see Figure 1- 1).

Figure 1-1. The number line.

7

Next

NUMERICAL METHODS

The calculations performed with natural numbers consist primarily of addition

and subtraction, though natural numbers can also be used for multiplication (iterative

addition) and, to some degree, for division. Natural numbers don’t always suffice,

however; how can you divide three by two and get a natural number as the result?

What happens when you subtract 5 from 3? Without decimal fractions, the results of

many divisions have to remain symbolic. The expression "5 from 3" meant nothing

until the Hindus created a symbol to show that money was owed. The words positive

and negative are derived from the Hindu words for credit and debit’.

The number ray-all natural numbers- b e c a m e part of a much greater schema

known as the number line, which comprises all numbers (positive, negative, and

fractional) and stretches from a negative infinity through zero to a positive infinity

with infinite resolution*. Numbers on this line can be positive or negative so that 35 can exist as a representable value, and the line can be divided into smaller and

smaller parts, no part so small that it cannot be subdivided. This number line extends

the idea of numbers considerably, creating a continuous weave of ever-smaller

pieces (you would need something like this to describe a universe) that finally give

meaning to calculations such as 3/2 in the form of real numbers (those with decimal

fractional extensions).

This is undeniably a valuable and useful concept, but it doesn’t translate so

cleanly into the mechanics of a machine made of finite pieces.

Systems of Representation

The Romans used an additional system of representation, in which the symbols

are added or subtracted from one another based on their position. Nine becomes IX

in Roman numerals (a single count is subtracted from the group of 10, equaling nine;

if the stroke were on the other side of the symbol for 10, the number would be 11).

This meant that when the representation reached a new power of 10 or just became

too large, larger numbers could be created by concatenating symbols. The problem

here is that each time the numbers got larger, new symbols had to be invented.

Another form, known as positional representation, dates back to the Babylonians,

who used a sort of floating point with a base of 60.3 With this system, each

successively larger member of a group has a different symbol. These symbols are

8

NUMBERS

then arranged serially to grow more significant as they progress to the left. The

position of the symbol within this representation determines its value. This makes for

a very compact system that can be used to approximate any value without the need

to invent new symbols. Positional numbering systems also allow another freedom:

Numbers can be regrouped into coefficients and powers, as with polynomials, for

some alternate approaches to multiplication and division, as you will see in the

following chapters.

If b is our base and a an integer within that base, any positive integer may be

represented as:

or as:

ai * bi + ai-1 * bi-1 + ... + a0 * b0

As you can see, the value of each position is an integer multiplied by the base

taken to the power of that integer relative to the origin or zero. In base 10, that

polynomial looks like this:

ai * 10i + ai-1 * 10i-1 + ... + a0 * 100

and the value 329 takes the form:

3 * 10 + 2 * 10 + * 10

Of course, since the number line goes negative, so must our polynomial:

ai * bi + ai-1 * bi-1 + ... + a0 * b0 + a-1 * b-1 + a-2 * b-2 + ... + a-i *

b-i

Bases

Children, and often adults, count by simply making a mark on a piece of paper

for each item in the set they’re quantifying. There are obvious limits to the numbers

9

NUMERICAL METHODS

that can be conveniently represented this way, but the solution is simple: When the

numbers get too big to store easily as strokes, place them in groups of equal size and

count only the groups and those that are left over. This makes counting easier because

we are no longer concerned with individual strokes but with groups of strokes and

then groups of groups of strokes. Clearly, we must make the size of each group

greater than one or we are still counting strokes. This is the concept of base. (See

Figure l-2.) If we choose to group in l0s, we are adopting 10 as our base. In base 10,

they are gathered in groups of 10; each position can have between zero and nine

things in it. In base 2, each position can have either a one or a zero. Base 8 is zero

through seven. Base 16 uses zero through nine and a through f. Throughout this

book, unless the base is stated in the text, a B appended to the number indicates base

2, an O indicates base 8, a D indicates base 10, and an H indicates base 16.

Regardless of the base in which you are working, each successive position to the

left is a positive increase in the power of the position.

In base 2, 999 looks like:

lllll00lllB

If we add a subscript to note the position, it becomes:

19 18 17 16 15 04 03 12 11 10

This has the value:

1*29 +1*28 +1*27 +1*26 +1*25 +1*24 +1*23 +1*22 +1*21 +1*20

which is the same as:

1*512 + 1*256 + 1*128 + 1*64 + 1*32 + 0*16 + 0*8 + 1*4 + 1*2 + 1*1

Multiplying it out, we get:

512 + 256 + 128 + 64 + 32 + 0 + 0 + 4 + 2 + 1 = 999

10

NUMBERS

Figure 1-2. The base of a number system defines the number of unique digits

available in each position.

Octal, as the name implies, is based on the count of eight. The number 999 is 1747

in octal representation, which is the same as writing:

1*83 + 7*82 + 4*81 + 7*80

or

1*512 + 7*64 + 4*8 + 7*1

When we work with bases larger than 10, the convention is to use the letters of

the alphabet to represent values equal to or greater than 10. In base 16 (hexadecimal),

therefore, the set of numbers is 0 1 2 3 4 5 6 7 8 9 a b c d e f, where a = 10 and

f = 15. If you wanted to represent the decimal number 999 in hexadecimal, it would

be 3e7H, which in decimal becomes:

3*162 + 14*161 + 7*160

Multiplying it out gives us:

3*256 + 14*16 + 7*1

11

NUMERICAL METHODS

Obviously, a larger base requires fewer digits to represent the same value.

Any number greater than one can be used as a base. It could be base 2, base 10,

or the number of bits in the data type you are working with. Base 60, which is used

for timekeeping and trigonometry, is attractive because numbers such as l/3 can be

expressed exactly. Bases 16, 8, and 2 are used everywhere in computing machines,

along with base 10. And one contingent believes that base 12 best meets our

mathematical needs.

The Radix Point, Fixed and Floating

Since the physical world cannot be described in simple whole numbers, we need

a way to express fractions. If all we wish to do is represent the truth, a symbol will

do. A number such as 2/3 in all its simplicity is a symbol-a perfect symbol, because

it can represent something unrepresentable in decimal notation. That number

translated to decimal fractional representation is irrational; that is, it becomes an

endless series of digits that can only approximate the original. The only way to

express an irrational number in finite terms is to truncate it, with a corresponding loss

of accuracy and precision from the actual value.

Given enough storage any number, no matter how large, can be expressed as

ones and zeros. The bigger the number, the more bits we need. Fractions present a

similar but not identical barrier. When we’re building an integer we start with unity,

the smallest possible building block we have, and add progressively greater powers

(and multiples thereof) of whatever base we’re in until that number is represented.

We represent it to the least significant bit (LSB), its smallest part.

The same isn’t true of fractions. Here, we’re starting at the other end of the

spectrum; we must express a value by adding successively smaller parts. The trouble

is, we don’t always have access to the smallest part. Depending on the amount of

storage available, we may be nowhere near the smallest part and have, instead of a

complete representation of a number, only an approximation. Many common values

can never be represented exactly in binary arithmetic. The decimal 0.1 or one 10th,

for example, becomes an infinite series of ones and zeros in binary

(1100110011001100 ... B). The difficulties in expressing fractional parts completely

can lead to unacceptable errors in the result if you’re not careful.

12

NUMBERS

The radix point (the point of origin for the base, like the decimal point) exists on

the number line at zero and separates whole numbers from fractional numbers. As

we move through the positions to the left of the radix point, according to the rules of

positional notation, we pass through successively greater positive powers of that

base; as we move to the right, we pass through successively greater negative powers

of the base.

In the decimal system, the number 999.999 in positional notation is

929190.9-19-29-3

And we know that base 10

102 = 100

101 = 10

100 = 1

It is also true that

10-1 = .1

10-2 = .01

10-3 = .001

We can rewrite the number as a polynomial

9*102 + 9*101 + 9*100 + 9*10-1 + 9*10-2 + 9*10-3

Multiplying it out, we get

900 +90 + 9 + .9 + .09 + .009

which equals exactly 999.999.

Suppose we wish to express the same value in base 2. According to the previous

example, 999 is represented in binary as 1111100111B. To represent 999.999, we

need to know the negative powers of two as well. The first few are as follows:

13

NUMERICAL METHODS

2-1 = .5D

2 = .25D

2-3 = .125D

2-4 = .0625D

-5

2 = .03125D

-2

2-6 = .015625D

2-7 = .0078125D

2-8 = .00390625D

2-9 = .001953125D

2-10 = .0009765625D

-11

2

= .00048828125D

2-12 = .000244140625D

Twelve binary digits are more than enough to approximate the decimal fraction

.999. Ten digits produce

1111100111.1111111111 =

999.9990234375

which is accurate to three decimal places.

Representing 999.999 in other bases results in similar problems. In base 5, the

decimal number 999.999 is noted

12444.4444141414 =

1*54 + 2*53 + 4*52 + 4*51 + 4*50. + 4*5-1 + 4*5-2 + 4*5-3 + 4*5-4 + 1*5-5 +

4*5-6 + 1*5-7 + 4*5-8 + 1*5-9 + 4*5-10 =

1*625 + 2*125 + 4*25 + 4*5 + 4+ 4*.2 + 4*.04 + 4*.008 + 4*.0016

+ 1*.00032 + 4*.000065 + 1*.0000125 + 4*.00000256

+ 1*.000000512 + 4*.0000001024

or

625+ +250 + 100 + 20 + 4 + .8 + .16 + .032 + .0064 + .00032 + .000256 +

.0000128 + .00001024 + .000000512 + .00004096 =

999.9990045696

14

NUMBERS

But in base 20, which is a multiple of 10 and two, the expression is rational. (Note

that digits in bases that exceed 10 are usually denoted by alphabetical characters; for

example, the digits of base 20 would be 0 l 2 3 4 5 6 7 8 9 A B C D E F G H I J .)

29J.JJC

2x202 + 9x201 + 19x200. + 19x20-1 + 19x20-2 + 12x20-3 =

2x400 + 9x20 + 19x1. + 19x.05 + 19x.0025 + 12x.000125

or

800 + 180 + 19. + .95 + .0475 + .0015 =

999.999

As you can see, it isn’t always easy to approximate a fraction. Fractions are a sum

of the value of each position in the data type. A rational fraction is one whose sum

precisely matches the value you are trying to approximate. Unfortunately, the exact

combination of parts necessary to represent a fraction exactly may not be available

within the data type you choose. In cases such as these, you must settle for the

accuracy obtainable within the precision of the data type you are using.

Types of Arithmetic

This book covers three basic types of arithmetic: fixed point (including integeronly arithmetic and modular) and floating point.

Fixed Point

Fixed-point implies that the radix point is in a fixed place within the representation. When we’re working exclusively with integers, the radix point is always to

the right of the rightmost digit or bit. When the radix point is to the left of the leftmost

digit, we’re dealing with fractional arithmetic. The radix point can rest anywhere

within the number without changing the mechanics of the operation. In fact, using

fixed-point arithmetic in place of floating point, where possible, can speed up any

arithmetic operation. Everything we have covered thus far applies to fixed-point

arithmetic and its representation.

15

NUMERICAL METHODS

Though fixed-point arithmetic can result in the shortest, fastest programs, it

shouldn’t be used in all cases. The larger or smaller a number gets, the more storage

is required to represent it. There are alternatives; modular arithmetic, for example,

can, with an increase in complexity, preserve much of an operation’s speed.

Modular arithmetic is what people use every day to tell time or to determine the

day of the week at some future point. Time is calculated either modulo 12 or 24—

that is, if it is 9:00 and six hours pass on a 12-hour clock, it is now 3:00, not 15:00:

9 + 6 = 3

This is true if all multiples of 12 are removed. In proper modular notation, this

would be written:

9 + 6

3, mod 12.

In this equation, the sign

means congruence. In this way, we can make large

numbers congruent to smaller numbers by removing multiples of another number (in

the case of time, 12 or 24). These multiples are often removed by subtraction or

division, with the smaller number actually being the remainder.

If all operands in an arithmetic operation are divided by the same value, the result

of the operation is unaffected. This means that, with some care, arithmetic operations

performed on the remainders can have the same result as those performed on the

whole number. Sines and cosines are calculated mod 360 degrees (or mod 2

radians). Actually, the input argument is usually taken mod /2 or 90 degrees,

depending on whether you are using degrees or radians. Along with some method for

determining which quadrant the angle is in, the result is computed from the

congruence (see Chapter 6).

Random number generators based on the Linear Congruential Method use

modular arithmetic to develop the output number as one of the final steps.4

Assembly-language programmers can facilitate their work by choosing a modulus

that’s as large as the word size of the machine they are working on. It is then a simple

matter to calculate the congruence, keeping those lower bits that will fit within the

16

NUMBERS

word size of the computer. For example, assume we have a hexadecimal doubleword:

and the word size of our machine is 16 bits

For more information on random number generators, see Appendix A.

One final and valuable use for modular arithmetic is in the construction of selfmaintaining buffers and arrays. If a buffer containing 256 bytes is page aligned-the

last eight bits of the starting address are zero-and an 8-bit variable is declared to

count the number of entries, a pointer can be incremented through the buffer simply

by adding one to the counting variable, then adding that to the address of the base of

the buffer. When the pointer reaches 255, it will indicate the last byte in the buffer;

when it is incremented one more time, it will wrap to zero and point once again at

the initial byte in the buffer.

Floating Point

Floating point is a way of coding fixed-point numbers in which the number of

significant digits is constant per type but whose range is enormously increased

because an exponent and sign are embedded in the number. Floating-point arithmetic

is certainly no more accurate than fixed point-and it has a number of problems,

including those present in fixed point as well as some of its own-but it is convenient

and, used judiciously, will produce valid results.

The floating-point representations used most commonly today conform, to some

degree, to the IEEE 754 and 854 specifications. The two main forms, the long real

and the short real, differ in the range and amount of storage they require. Under the

IEEE specifications, a long real is an 8-byte entity consisting of a sign bit, an 11-bit

exponent, and a 53-bit significand, which mean the significant bits of the floatingpoint number, including the fraction to the right of the radix point and the leading one

17

NUMERICAL METHODS

to the left. A short real is a 4-byte entity consisting of a sign bit, an 8-bit exponent,

and a 24-bit significand.

To form a binary floating-point number, shift the value to the left (multiply by

two) or to the right (divide by two) until the result is between 1.0 and 2.0. Concatenate

the sign, the number of shifts (exponent), and the mantissa to form the float.

Doing calculations in floating point is very convenient. A short real can express

a value in the range 1038 to 10-38 in a doubleword, while a long real can handle values

ranging from 10308 to 10-308 in a quadword. And most of the work of maintaining the

numbers is done by your floating-point package or library.

As noted earlier, some problems in the system of precision and exponentiation

result in a representation that is not truly "real"—namely, gaps in the number line and

loss of significance. Another problem is that each developer of numerical software

adheres to the standards in his or her own fashion, which means that an equation that

produced one result on one machine may not produce the same result on another

machine or the same machine running a different software package. This compatibility problem has been partially alleviated by the widespread use of coprocessors.

Positive and Negative Numbers

The most common methods of representing positive and negative numbers in a

positional number system are sign magnitude, diminished-radix complement, and

radix complement (see Table 1- 1).

With the sign-magnitude method, the most significant bit (MSB) is used to

indicate the sign of the number: zero for plus and one for minus. The number itself

is represented as usual—that is, the only difference between a positive and a negative

representation is the sign bit. For example, the positive value 4 might be expressed

as 0l00B in a 4-bit binary format using sign magnitude, while -4 would be

represented as 1100B.

This form of notation has two possible drawbacks. The first is something it has

in common with the diminished-radix complement method: It yields two forms of

zero, 0000B and 1000B (assuming three bits for the number and one for the sign).

Second, adding sign-magnitude values with opposite signs requires that the magni-

18

NUMBERS

tudes of the numbers be consulted to determine the sign of the result. An example of

sign magnitude can be found in the IEEE 754 specification for floating-point

representation.

The diminished-radix complement is also known as the one’s complement in

binary notation. The MSB contains the sign bit, as with sign magnitude, while the rest

of the number is either the absolute value of the number or its bit-by-bit complement.

The decimal number 4 would appear as 0100 and -4 as 1011. As in the foregoing

method, two forms of zero would result: 0000 and 1111.

The radix complement, or two’s complement, is the most widely used notation

in microprocessor arithmetic. It involves using the MSB to denote the sign, as in the

other two methods, with zero indicating a positive value and one meaning negative.

You derive it simply by adding one to the one’s-complement representation of the

same negative value. Using this method, 4 is still 0100, but -4 becomes 1100. Recall

that one’s complement is a bit-by-bit complement, so that all ones become zeros and

all zeros become ones. The two’s complement is obtained by adding a one to the

one’s complement.

This method eliminates the dual representation of zero-zero is only 0000

(represented as a three-bit signed binary number)-but one quirk is that the range of

values that can be represented is slightly more negative than positive (see the chart

below). That is not the case with the other two methods described. For example, the

largest positive value that can be represented as a signed 4-bit number is 0111B, or

7D, while the largest negative number is 1000B, or -8D.

19

NUMERICAL METHODS

One's complement

Two's

complement

Sign complement

0111

7

7

7

0110

6

6

6

0101

5

5

5

0100

4

4

4

0011

3

3

3

0010

2

2

2

0001

1

1

1

0000

0

0

0

1111

-0

-1

-7

1110

-1

-2

-6

1101

-2

-3

-5

1100

-3

-4

-4

1011

-4

-5

-3

1010

-5

-6

-2

1001

-6

-7

-1

1000

-7

-8

-0

Table 1-1. Signed Numbers.

Decimal integers require more storage and are far more complicated to work

with than binary; however, numeric I/O commonly occurs in decimal, a more

familiar notation than binary. For the three forms of signed representation already

discussed, positive values are represented much the same as in binary (the leftmost

20

NUMBERS

bit being zero). In sign-magnitude representation, however, the sign digit is nine

followed by the absolute value of the number. For nine’s complement, the sign digit

is nine and the value of the number is in nine’s complement. As you might expect,

10’s complement is the same as nine’s complement except that a one is added to the

low-order (rightmost) digit.

Fundamental Arithmetic Principles

So far we’ve covered the basics of positional notation and bases. While this book

is not about mathematics but about the implementation of basic arithmetic operations

on a computer, we should take a brief look at those operations.

1. Addition is defined as a + b = c and obeys the commutative rules described

below.

2. Subtraction is the inverse of addition and is defined as b = c - a.

3. Multiplication is defined as ab = c and conforms to the commutative,

associative, and distributive rules described below.

4. Division is the inverse of multiplication and is shown by b = c/a.

5. A power is ba=c.

6. A root is b =

7 . A logarithm is a = logbc.

Addition and subtraction must also satisfy the following rules:5

8. Commutative:

a+b=b+a

ab = ba

9 . Associative:

a = (b + c) = (a + b) + c

a(bc) = (ab)c

10. Distributive:

a(b + c) = ab + ac

From these rules, we can derive the following relations:6

11. (ab)c = acbc

21

NUMERICAL METHODS

12. abac = ac(b+c)

13.

14.

15.

16.

17.

(ab)c = a(bc)

a+0=a

a x 1= a

a1 = a

a/0 is undefined

These agreements form the basis for the arithmetic we will be using in upcoming

chapters.

Microprocessors

The key to an application’s success is the person who writes it. This statement

is no less true for arithmetic. But it’s also true that the functionality and power of the

underlying hardware can greatly affect the software development process.

Table l-2 is a short list of processors and microcontrollers currently in use, along

with some issues relevant to writing arithmetic code for them (such as the instruction

set, and bus width). Although any one of these devices, with some ingenuity and

effort, can be pushed through most common math functions, some are more capable

than others. These processors are only a sample of what is available. In the rest of this

text, we’ll be dealing primarily with 8086 code because of its broad familiarity.

Examples from other processors on the list will be included where appropriate.

Before we discuss the devices themselves, perhaps an explanation of the

categories would be helpful.

Buswidth

The wider bus generally results in a processor with a wider bandwidth because it can

access more data and instruction elements. Many popular microprocessors have a

wider internal bus than external, which puts a burden on the cache (storage internal

to the microprocessor where data and code are kept before execution) to keep up with

the processing. The 8088 is an example of this in operation, but improvements in the

80x86 family (including larger cache sizes and pipelining to allow some parallel

processing) have helped alleviate the problem.

22

NUMBERS

Table 1-2. Instructions and flags.

23

NUMERICAL METHODS

Data type

The larger the word size of your machine, the larger the numbers you can process

with single instructions. Adding two doubleword operands on an 8051 is a

multiprecision operation requiring several steps. It can be done with a single ADD

on a TMS34010 or 80386. In division, the word size often dictates the maximum size

of the quotient. A larger word size allows for larger quotients and dividends.

Flags

The effects of a processor’s operation on the flags can sometimes be subtle. The

following comments are generally true, but it is always wise to study the data sheets

closely for specific cases.

Zero. This flag is set to indicate that an operation has resulted in zero. This can

occur when two operands compare the same or when two equal values are

subtracted from one another. Simple move instructions generally do not affect

the state of the flag.

Carry. Whether this flag is set or reset after a certain operation varies from

processor to processor. On the 8086, the carry will be set if an addition overflows

or a subtraction underflows. On the 80C196, the carry will be set if that addition

overflows but cleared if the subtraction underflows. Be careful with this one.

Logical instructions will usually reset the flag and arithmetic instructions as well

as those that use arithmetic elements (such as compare) will set it or reset it based

on the results.

Sign. Sometimes known as the negative flag, it is set if the MSB of the data type

is set following an operation.

Overflow. If the result of an arithmetic operation exceeds the data type meant

to contain it, an overflow has occurred. This flag usually only works predictably

with addition and subtraction. The overflow flag is used to indicate that the result

of a signed arithmetic operation is too large for the destination operand. It will

be set if, after two numbers of like sign are added or subtracted, the sign of the

result changes or the carry into the MSB of an operand and the carry out don’t

match.

24

NUMBERS

Overflow Trap. If an overflow occurred at any time during an arithmetic

operation, the overflow trap will be set if not already set. This flag bit must be

cleared explicitly. It allows you to check the validity of a series of operations.

Auxiliary Carry. The decimal-adjust instructions use this flag to correct the

accumulator after a decimal addition or subtraction. This flag allows the

processor to perform a limited amount of decimal arithmetic.

Parity. The parity flag is set or reset according to the number of bits in the lower

byte of the destination register after an operation. It is set if even and reset if odd.

Sticky Bit. This useful flag can obviate the need for guard digits on certain

arithmetic operations. Among the processors in Table l-2, it is found only on

the 80C196. It is set if, during a multiple right shift, more than one bit was shifted

into the carry with a one in the carry at the end of the shift.

Rounding and the Sticky Bit

A multiple shift to the right is a divide by some power of two. If the carry is set,

the result is equal to the integer result plus l/2, but should we round up or down? This

problem is encountered frequently in integer arithmetic and floating point. Most

floating-point routines have some form of extended precision so that rounding can

be performed. This requires storage, which usually defaults to some minimal data

type (the routines in Chapter 4 use a word). The sticky bit reduces the need for such

extended precision. It indicates that during a right shift, a one was shifted into the

carry flag and then shifted out.

Along with the carry flag, the sticky bit can be used for rounding. For example,

suppose we wish to divide the hex value 99H by 16D. We can do this easily with a

four-bit right shift. Before the shift, we have:

Operand

10011001

Carry flag

0

Sticky bit

0

We shift the operand right four times with the following instruction:

shr

ax, #4

25

NUMERICAL METHODS

During the shift, the Least Significant Bit (LSB) of the operand (a one) is shifted

into the carry and then out again, setting the sticky bit followed by two zeros and a

final one. The operand now has the following form:

Operand

00001001

Sticky bit

Carry flag

1 (from the last shift) 1 (because of the first one

shifted in and out of the carry)

To round the result, check the carry flag. If it’s clear, the bits shifted out were less

than half of the LSB, and rounding can be done by truncation. If the carry is set, the

bits shifted out were at least half of the LSB. Now, with the sticky bit, we can see if

any other bits shifted out during the divide were ones; if so, the sticky bit is set and

we can round up.

Rounding doesn’t have to be done as described here, but however you do it the

sticky bit can make your work easier. Too bad it’s not available on more machines.

Branching

Your ability to do combined jumps depends on the flags. All the processors listed

in the table have the ability to branch, but some implement that ability on more

sophisticated relationships. Instead of a simple “jump on carry,” you might find

“jump if greater, ” “jump if less than or equal,” and signed and unsigned operations.

These extra instructions can cut the size and complexity of your programs.

Of the devices listed, the TMS34010, 80x86 and 80C196 have the richest set of

branching instructions. These include branches on signed and unsigned comparisons

as well as branches on the state of the flags alone.

Instructions

Addition

Add. Of course, to perform any useful arithmetic, the processor must be capable

of some form of addition. This instruction adds two operands, signaling any

overflow from the result by setting the carry.

26

NUMBERS

Add-with-Carry. The ability to add with a carry bit allows streamlined,

multiprecision additions. In multibyte or multiword additions, the add instruction is usually used first; the add-with-carry instruction is used in each succeeding addition. In this way, overflows from each one addition can ripple through

to the next.

Subtraction

Subtract. All the devices in Table l-2 can subtract except the 8048 and 8051.

The 8051 uses the subtract-with-carry instruction to fill this need. On the 8048,

to subtract one quantity (the subtrahend) from another (the minuend), you must

complement the subtrahend and increment it, then add it to the minuend-in

other words, add the two’s complement to the minuend.

Subtract-with-Carry. Again, the 8048 does not support this instruction, while all

the others do. Since the 8051 has only the subtract-with-carry instruction, it is

important to see that the carry is clear before a subtraction is performed unless

it is a multiprecision operation. The subtract-with-carry is used in multiprecision

subtraction in the same manner as the add-with-carry is used in addition.

Compare. This instruction is useful for boundary, magnitude and equality

checks. Most implementations perform a comparison by subtracting one value

from another. This process affects neither operand, but sets the appropriate flags.

Many microprocessors allow either signed or unsigned comparisons.

Multiplication

Multiply. This instruction performs a standard unsigned multiply based on the

word size of the particular microprocessor or microcontroller. Hardware can

make life easier. On the 8088 and 8086, this instruction was embarrassingly slow

and not that much of a challenge to shift and add routines. On later members of

the 80x86 family, it takes a fraction of the number of cycles to perform, making

it very useful for multiprecision and single-precision work.

Signed Multiply. The signed multiply, like the signed divide (which we’ll

27

NUMERICAL METHODS

discuss in a moment), has limited use. It produces a signed product from two

signed operands on all data types up to and including the word size of the

machine. This is fine for tame applications, but makes the instruction unsuitable

for multiprecision work. Except for special jobs, it might be wise to employ a

generic routine for handling signed arithmetic. One is described in the next

chapter.

Division

Divide. A hardware divide simplifies much of the programmer’s work unless it

is very, very slow (as it is on the 8088 and 8086). A multiply canextend the useful

range of the divide considerably. The following chapter gives examples of how

to do this.

Signed Divide. Except in specialized and controlled circumstances, the signed

divide may not be of much benefit. It is often easier and more expeditious to

handle signed arithmetic yourself, as will be shown in Chapter 2.

Modulus. This handy instruction returns the remainder from the division of two

operands in the destination register. As the name implies, this instruction is very

useful when doing modular arithmetic. This and signed modulus are available on

the TMS34010.

Signed Modulus. This is the signed version of the earlier modulus instruction,

here the remainder bears the sign of the dividend.

Negation and Signs

One’s Complement. The one’s complement is useful for logical operations and

diminished radix arithmetic (see Positive and Negative Numbers, earlier in this

chapter). This instruction performs a bit-by-bit complement of the input argument; that is, it makes each one a zero and each zero a one.

Two’s Complement. The argument is one’s complemented, then incremented by

28

NUMBERS

one. This is how negative numbers are usually handled on microcomputers.

l

Sign Extension. This instruction repeats the value of the MSB of the byte or word

through the next byte, word, or doubleword so the proper results can be obtained

from an arithmetic operation. This is useful for converting a small data type to

a larger data type of the same sign for such operations as multiplication and

division.

Shifts, Rotates and Normalization

Rotate. This simple operation can occur to the right or the left. In the case of a

ROTATE to the right, each bit of the data type is shifted to the right; the LSB is

deposited in the carry, while a zero is shifted in on the left. If the rotate is to the

left, each bit is moved to occupy the position of the next higher bit in the data type

until the last bit is shifted out into the carry flag (see figure l-3). On the Z80,

some shifts put the same bit into the carry and the LSB of the byte you are

shifting. Rotation is useful for multiplies and divides as well as normalization.

Rotate-through-Carry. This operation is similar to the ROTATE above, except

that the carry is shifted into the LSB (in the case of a left shift), or the MSB (in

the case of a right shift). Like the ROTATE, this instruction is useful for

multiplies and divides as well as normalization.

Arithmetic Shift. This shift is similar to a right shift. As each bit is shifted, the

value of the MSB remains the same, maintaining the value of the sign.

Normalization. This can be either a single instruction, as is the case on the

80C196, or a set of instructions, as on the TMS34010. NORML will cause the

80C196 to shift the contents of a doubleword register to the left until the MSB

is a one, “normalizing” the value and leaving the number of shifts required in a

register. On the TMS34010, LMO leaves the number of bits required to shift a

doubleword so that its MSB is one. A multibit shift can then be used to normalize

it. This mechanism is often used in floating point and as a setup for binary table

routines.

29

NUMERICAL METHODS

Figure 1-3. Shifts and rotates.

Decimal and ASCII Instructions

Decimal Adjust on Add. This instruction adjusts the results of the addition of two

decimal values to decimal. Decimal numbers cannot be added on a binary

computer with guaranteed results without taking care of any intrabyte carries

that occur when a digit in a position exceeds nine. On the 8086, this instruction

affects only the AL register. This and the next instruction can be very useful in

an embedded system that receives decimal data and must perform some simple

processing before displaying or returning it.

Decimal Adjust on Subtract. This instruction is similar to the preceeding one

except that it applies to subtraction.

ASCII Adjust. These instructions prepare either binary data for conversion to

ASCII or ASCII data for conversion to binary. Though Motorola processors also

implement these instructions, they are found only in the 80x86 series in our list.

Used correctly, they can also do a certain amount of arithmetic.

30

NUMBERS

Most of the earlier microprocessors-such as the 8080, 8085, Z80, and 8086—

as well as microcontrollers like the 8051 were designed with general applications in

mind. While the 8051 is billed as a Boolean processor, it’s general set of instructions

makes many functions possible and keeps it very popular today.

All these machines can do arithmetic at one level or another. The 8080, 8085, and

Z80 are bit-oriented and don’t have hardware multiplies and divides, making them

somewhat slower and more difficult to use than those that do. The 8086 and 8051

have hardware multiplies and divides but are terribly slow about it. (The timings for

the 8086 instructions were cleaned up considerably in subsequent generations of the

286, 386, and 486.) They added some speed to the floating-point routines and

decreased code size.

Until a few years ago, the kind of progress usually seen in these machines was

an increase in the size of the data types available and the addition of hardware

arithmetic. The 386 and 486 can do some 64-bit arithmetic and have nice shift

instructions, SHLD and SHRD, that will happily shift the bits of the second operand

into the first and put the number of bits shifted in a third operand. This is done in a

single stroke, with the bits of one operand shifted directly into the other, easing

normalization of long integers and making for fast binary multiplies and divides. In

recent years we’ve seen the introduction of microprocessors and microcontrollers

that are specially designed to handle floating-point as well as fixed-point arithmetic.

These processors have significantly enhanced real-time control applications and

digital signal processing in general. One such microprocessor is the TMS34010; a

microcontroller with a similar aptitude is the 80C196.

31

NUMERICAL METHODS

1

Kline, Morris. Mathematics for the Nonmathematician. New York, NY: Dover

Publications, Inc., 1967, Page 72.

2

Gellert, W., S. Gottwald, M. Helwich, H. Kastner, and H. K?stner (eds.). The

VNR Concise Encyclopedia of Mathematics. New York, NY: Van Nostrand

Reinhold, 1989, Page 20.

3

Knuth, D. E. Seminumerical Algorithms. Reading, MA: Addison-Wesley Publishing Co., 1980, Page 180.

4

Knuth, D. E. Seminumerical Algorithms. Reading, MA: Addison-Wesley Publishing Co., 1981, Pages 1-127.

5

Cavanagh, Joseph J. F. Digital Computer Arithmetic. New York, NY: McGrawHill Book Co., 1984, Page 2.

6

Pearson, Carl E. (ed.) Handbook of Applied Mathematics. New York, NY: Van

Nostrand Reinhold, 1983, Page 1.

32

Previous Home Next

CHAPTER 2

Integers

Reducing a problem to the integer level wherever possible is certainly one of the

fastest and safest ways to solve it. But integer arithmetic is only a subset of fixedpoint arithmetic. Fixed-point arithmetic means that the radix point remains in the

same place during all calculations. Integer arithmetic is fixed point arithmetic with

the radix point consistently to the right of the LSB. Conversely, fractional arithmetic

is simply fixed point with the radix point to the left of the MSB. There are no specific

requirements regarding placement of the radix point; it depends entirely on the needs

of the operation. Sines and cosines may require no integer at all, while a power-series

calculation may require an integer portion. You may wish to use two guard digits

during multiplication and division for rounding purposes-it depends on you and the

application.

To present algorithms for the four basic operations of mathematics-addition,

subtraction, multiplication, and division-this chapter will concentrate on integeronly arithmetic. The operations for fixed-point and integer-only arithmetic are

essentially the same; the former simply involves some attention to the placement of

the radix point.

This chapter begins with the basic operations and the classic algorithms for them,

followed by some more advanced algorithms. The classic algorithms aren’t necessarily the fastest, but they are so elegant and reflect the character of the binary

numbering system and the nature of arithmetic itself so well that they are worth

knowing. Besides, on any binary machine, they’ll always work.

Addition and Subtraction

Unsigned Addition and Subtraction

Simply put, addition is the joining of two sets of numbers or quantities, into one

set. We could also say that when we add we’re really incrementing one value, the

33

NUMERICAL METHODS

augend, by another value, the addend. Subtraction is the inverse of addition, with one

number being reduced, or decremented, by another.

For example, the addition operation

+2

9

or

0111

0010

1001

might be accomplished on the 8086 with this instruction sequence:

mov

add

al,7

al,2

In positional arithmetic, each position is evaluated

x

Numerical Methods

Real-Time and Embedded Systems Programming

.................................

Featuring in-depth

coverage of:

l

Fixed and

floating point

mathematical

techniques

without a

coprocessor

l

Numerical I/O

for embedded

systems

l

Data conversion

methods

Don Morgan

Numerical Methods

Real-Time and Embedded Systems Programming

Numerical Methods

Real-Time and Embedded Systems Programming

Featuring in-depth

coverage of:

l

Fixed and

floating point

mathematical

techniques

without a

coprocessor

l

Numerical I/O

for embedded

systems

l

Data conversion

methods

Don Morgan

M&T Books

A Division of M&T Publishing, Inc.

411 BOREL AVE.

SAN MATEO, CA 94402

© 1992 by M&T Publishing, Inc.

Printed in the United States of America

All rights reserved. No part of this book or disk may be reproduced or transmitted in any form or by any

means, electronic or mechanical, including photocopying, recording, or by any information storage and

retrieval system, without prior written permission from the Publisher. Contact the Publisher for

information on foreign rights.

Limits of Liability and Disclaimer of Warranty

The Author and Publisher of this book have used their best efforts in preparing the book and the

programs contained in it and on the diskette. These efforts include the development, research, and

testing of the theories and programs to determine their effectiveness.

The Author and Publisher make no warranty of any kind, expressed or implied, with regard to these

programs or the documentation contained in this book. The Author and Publisher shall not be liable in

any event for incidental or consequential damages in connection with, or arising out of, the furnishing,

performance, or use of these programs.

Library of Congress Cataloging-in-Publication Data

Morgan, Don 1948Numerical Methods/Real-Time and Embedded Systems Programming

by Don Morgan

p. cm.

Includes Index

ISBN l-5585l-232-2 Book and Disk set

2. Real-time data processing.

1. Electronic digital computers—Programming.

I. Title

3. Embedded computer systems—Programming.

QA76.6.M669 1992

513.2 ' 0285—dc20

91-47911

CIP

Project Editor: Sherri Morningstar

95 94 93 92

Cover Design: Lauren Smith Design

4 3 2 1

Trademarks: The 80386, 80486 are registered trademarks and the 8051, 8048, 8086, 8088,

8OC196 and 80286 are products of Intel Corporation, Santa Clara, CA. The Z80 is a registered

trademark of Zilog Inc., Campbell, CA. The TMS34010 is a product of Texas Instruments, Dallas,

TX. Microsoft C is a product of Microsoft Corp. Redmond, WA.

Acknowledgment

Thank you Anita, Donald and Rachel

for your love and forbearance.

Contents

WHY THIS BOOK IS FOR YOU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

CHAPTER 1: NUMBERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

Systems of Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

The Radix Point, Fixed and Floating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2

Types of Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 5

Fixed Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 5

Floating Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 7

Positive and Negative Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 8

Fundamental Arithmetic Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1

Microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1

Buswidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

22

Data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 4

Flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 4

Rounding and the Sticky Bit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 5

Branching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

NUMERICAL METHODS

Instructions ............................................................................................. 2 6

Addition ............................................................................................ 26

Subtraction ........................................................................................

27

27

Multiplication ..................................................................................

Division ............................................................................................ 28

Negation and Signs ............................................................................. 28

Shifts, Rotates and Normalization ....................................................... 29

Decimal and ASCII Instructions ......................................................... 30

CHAPTER 2: INTEGERS ..........................................................

Addition and Subtraction .............................................................................

Unsigned Addition and Subtraction ....................................................

Multiprecision Arithmetic ..................................................................

add64: Algorithm ...............................................................................

add64: Listing .....................................................................................

sub64: Algorithm ................................................................................

sub64: Listing .....................................................................................

Signed Addition and Subtraction ........................................................

Decimal Addition and Subtraction ......................................................

Multiplication and Division .........................................................................

Signed vs. Unsigned .......................................................................

signed-operation: Algorithm ...............................................................

signed-operation: Listing ...................................................................

Binary Multiplication ..................................................................................

cmul: Algorithm ................................................................................

cmul: Listing ....................................................................................

33

33

33

35

36

36

37

37

38

40

42

43

44

45

46

49

49

CONTENTS

A Faster Shift and Add ................................................................................. 50

51

52

53

55

55

57

cmul2: Algorithm ..............................................................................

cmul2: Listing ....................................................................................

Skipping Ones and Zeros .............................................................................

booth: Algorithm ................................................................................

booth: Listing .....................................................................................

bit-pair: Algorithm .............................................................................

bit-pair: Listing .................................................................................

Hardware Multiplication: Single and Multiprecision .....................................

mu132: Algorithm ...............................................................................

mu132: Listing ....................................................................................

Binary Division ...........................................................................................

Error Checking ..................................................................................

Software Division ........................................................................................

cdiv: Algorithm ..................................................................................

cdiv: Listing. .....................................................................................

Hardware Division ......................................................................................

div32: Algorithm ................................................................................

div32: Listing ....................................................................................

div64: Algorithm ...............................................................................

div64: Listing .....................................................................................

58

61

62

63

64

64

65

67

68

69

74

75

79

80

CHAPTER 3; REAL NUMBERS ..................................................

Fixed Point .................................................................................................

Significant Bits ............................................................................................

The Radix Point .........................................................................................

Rounding ....................................................................................................

Basic Fixed-Point Operations ......................................................................

85

86

87

89

89

92

NUMERICAL METHODS

A Routine for Drawing Circles...................................................................

circle: Algorithm ..........................................................................

circle: Listing ................................................................................

Bresenham’s Line-Drawing Algorithm ............................................

line: Algorithm .............................................................................

line: Listing .......................................................................................

Division by Inversion .............................................................................

divnewt: Algorithm..............................................................................

divnewt: Listing................................................................................

Division by Multiplication..............................................................................

divmul: Algorithm..............................................................................

divmul: Listing...................................................................................

95

98

98

100

101

102

105

108

109

114

116

117

CHAPTER 4: FLOATING-POINT ARITHMETIC ........................ 123

What To Expect ....................................................................................... 124

A Small Floating-Point Package................................................................ 127

The Elements of a Floating-Point Number.............................................. 128

Extended Precision ............................................................................ 131

The External Routines ................................................................................. 132

fp_add: Algorithm ............................................................................. 132

fp_add: Listing................................................................................. 133

The Core Routines........................................................................................ 134

Fitting These Routines to an Application...................................................... 136

Addition and Subtraction: FLADD.............................................................. 136

FLADD: The Prologue. Algorithm...................................................... 138

FLADD: The Prologue. Listing.......................................................... 138

The FLADD Routine Which Operand is Largest? Algorithm.............. 140

The FLADD Routine: Which Operand is Largest? Listing.................... 141

CONTENTS

The FLADD Routine: Aligning the Radix Points. Algorithm................. 142

The FLADD Routine: Aligning the Radix Point. Listing .................... 143

FLADD: The Epilogue. Algorithm..................................................... 144

FLADD: The Epilogue. Listing ........................................................... 145

Multiplication and Division: FLMUL..........................................................

flmul: Algorithm .................................................................................

flmul: Listing ....................................................................................

mu164a: Algorithm ...........................................................................

mu164a: Listing .................................................................................

FLDIV...............................................................................................

fldiv: Algorithm..................................................................................

fldiv: Listing ......................................................................................

Rounding .....................................................................................................

Round: Algorithm...............................................................................

Round: Listing ...................................................................................

147

147

148

151

152

154

154

155

159

159

160

CHAPTER 5: INPUT, OUTPUT, AND CONVERSION....... ........163

Decimal Arithmetic ...................................................................................... 164

Radix Conversions ...................................................................................... 165

Integer Conversion by Division................................................................... 165

bn_dnt: Algorithm .............................................................................. 166

bn_dnt: Listing ................................................................................... 167

Integer Conversion by Multiplication........................................................... 169

dnt_bn: Algorithm.............................................................................. 170

dnt_bn: Listing ................................................................................... 170

Fraction Conversion b y Multiplication .......................................................... 172

bfc_dc: Algorithm............................................................................... 173

bfc_dc: Listing .................................................................................... 173

NUMERICAL METHODS

Fraction Conversion by Division .................................................................. 175

Dfc_bn: Algorithm ............................................................................. 176

Dfc_bn: Listing .................................................................................. 177

Table-Driven Conversions............................................................................

Hex to ASCII ..............................................................................................

hexasc: Algorithm.............................................................................

hexasc: Listing ..................................................................................

Decimal to Binary ......................................................................................

tb_dcbn: Algorithm ............................................................................

tb_dcbn: Listing ..................................................................................

Binary to Decimal.......................................................................................

tb_bndc: Algorithm............................................................................

tb_bndc: Listing ..................................................................................

Floating-Point Conversions..........................................................................

ASCII to Single-Precision Float...................................................................

atf: Algorithm ....................................................................................

atf: Listing .........................................................................................

Single-Precision Float to ASCII...................................................................

fta: Algorithm ....................................................................................

Fta: Listing .........................................................................................

Fixed Point to Single-Precision Floating Point .............................................

ftf: Algorithm .....................................................................................

ftf: Listing ........................................................................................

Single-Precision Floating Point to Fixed Point .............................................

ftfx Algorithm ..................................................................................

ftfx: Listing .......................................................................................

179

179

180

180

182

182

184

187

188

189

192

192

193

195

200

200

202

206

207

208

211

212

212

CONTENTS

CHAPTER 6: THE ELEMENTARY FUNCTIONS ....................

Fixed Point Algorithms ..............................................................................

Lookup Tables and Linear Interpolation......................................................

lg 10: Algorithm ...............................................................................

lg 10: Listing ....................................................................................

Dcsin: Algorithm ............................................................................

Dcsin: Listing ..................................................................................

Computing With Tables .............................................................................

Pwrb: Algorithm ..............................................................................

Pwrb: Listing ...................................................................................

CORDIC Algorithms.................................................................................

Circular: Algorithm ............................................................................

Circular: Listing ................................................................................

Polynomial Evaluations ...........................................................................

taylorsin: Algorithm .......................................................................

taylorsin: Listing .............................................................................

Polyeval: Algorithm...........................................................................

Polyeval: Listing ...............................................................................

Calculating Fixed-Point Square Roots........................................................

fx_sqr: Algorithm .............................................................................

fx_sqr: Listing ..................................................................................

school_sqr: Algorithm ......................................................................

school_sqr: Listing ..............................................................................

Floating-Point Approximations ...................................................................

Floating-Point Utilities ...............................................................................

frxp: Algorithm ..................................................................................

frxp: Listing ......................................................................................

ldxp: Algorithm.................................................................................

ldxp: Listing......................................................................................

217

217

217

219

220

224

227

233

234

235

237

242

242

247

249

250

251

251

253

254

254

256

257

259

259

259

260

261

261

NUMERICAL METHODS

263

flr: Algorithm ...........................................................................................

263

flr: Listing .................................................................................................

265

flceil: Algorithm ......................................................................................

266

flceil: Listing ............................................................................................

268

intmd: Algorithm....................................................................................

268

intmd: Listing ...........................................................................................

269

Square Roots ......................................................................................................

270

Flsqr: Algorithm.......................................................................................

271

Flsqr: Listing ............................................................................................

273

Sines and Cosines...............................................................................................

274

flsin: Algorithm........................................................................................

275

Flsin: Listing............................................................................................

APPENDIXES:

A: A PSEUDO-RANDOM NUMBER GENERATOR .................... 2 8 1

B: TABLES AND EQUATES ............................................

295

C: FXMATH.ASM ............................................................. 297

D: FPMATH.ASM .......................................................... 337

E: IO.ASM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

373

CONTENTS

F: TRANS.ASM AND TABLE.ASM

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407

G: MATH.C............................................................................475

GLOSSARY

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485

INDEX ................................................................................ 493

Additional Disk

Just in case you need an additional disk, simply call the toll-free number listed

below. The disk contains all the routines in the book along with a simple C shell that

can be used to exercise them. This allows you to walk through the routines to see how

they work and test any changes you might make to them. Once you understand how

the routine works, you can port it to another processor. Only $10.00 postage-paid.

To order with your credit card, call Toll-Free l-800-533-4372 (in CA 1-800356-2002). Mention code 7137. Or mail your payment to M&T Books, 411 Bore1

Ave., Suite 100, San Mateo, CA 94402-3522. California residents please add

applicable sales tax.

Why This Book Is For You

The ability to write efficient, high-speed arithmetic routines ultimately depends

upon your knowledge of the elements of arithmetic as they exist on a computer. That

conclusion and this book are the result of a long and frustrating search for

information on writing arithmetic routines for real-time embedded systems.

With instruction cycle times coming down and clock rates going up, it would

seem that speed is not a problem in writing fast routines. In addition, math

coprocessors are becoming more popular and less expensive than ever before and are

readily available. These factors make arithmetic easier and faster to use and

implement. However, for many of you the systems that you are working on do not

include the latest chips or the faster processors. Some of the most widely used

microcontrollers used today are not Digital Signal Processors (DSP), but simple

eight-bit controllers such as the Intel 8051 or 8048 microprocessors.

Whether you are using one on the newer, faster machines or using a simple

eight-bit one, your familiarity with its foundation will influence the architecture of

the application and every program you write. Fast, efficient code requires an

understanding of the underlying nature of the machine you are writing for. Your

knowledge and understanding will help you in areas other than simply implementing

the operations of arithmetic and mathematics. For example, you may want the

ability to use decimal arithmetic directly to control peripherals such as displays and

thumbwheel switches. You may want to use fractional binary arithmetic for more

efficient handling of D/A converters or you may wish to create buffers and arrays that

wrap by themselves because they use the word size of your machine as a modulus.

The intention in writing this book is to present a broad approach to microprocessor arithmetic ranging from data on the positional number system to algorithms for

1

NUMERICAL METHODS

developing many elementary functions with examples in 8086 assembler and

pseudocode. The chapters cover positional number theory, the basic arithmetic

operations to numerical I/O, and advanced topics are examined in fixed and floating

point arithmetic. In each subject area, you will find many approaches to the same

problem; some are more appropriate for nonarithmetic, general purpose machines

such as the 8051 and 8048, and others for the more powerful processors like the

Tandy TMS34010 and the Intel 80386. Along the way, a package of fixed-point and

floating-point routines are developed and explained. Besides these basic numerical

algorithms, there are routines for converting into and out of any of the formats used,

as well as base conversions and table driven translations. By the end of the book,

readers will have code they can control and modify for their applications.

This book concentrates on the methods involved in the computational process,

not necessarily optimization or even speed, these come through an understanding of

numerical methods and the target processor and application. The goal is to move the

reader closer to an understanding of the microcomputer by presenting enough

explanation, pseudocode, and examples to make the concepts understandable. It is

an aid that will allow engineers, with their familiarity and understanding of the target,

to write the fastest, most efficient code they can for the application.

2

Introduction

If you work with microprocessors or microcontrollers, you work with numbers.

Whether it is a simple embedded machine-tool controller that does little more than

drive displays, or interpret thumbwheel settings, or is a DSP functioning in a realtime system, you must deal with some form of numerics. Even an application that

lacks special requirements for code size or speed might need to perform an

occasional fractional multiply or divide for a D/A converter or another peripheral

accepting binary parameters. And though the real bit twiddling may hide under the

hood of a higher-level language, the individual responsible for that code must know

how that operation differs from other forms of arithmetic to perform it correctly.

Embedded systems work involves all kinds of microprocessors and

microcontrollers, and much of the programming is done in assembler because of the

speed benefits or the resulting smaller code size. Unfortunately, few references

are written to specifically address assembly language programming. One of the

major reasons for this might be that assembly-language routines are not easily

ported from one processor to another. As a result, most of the material devoted

to assembler programming is written by the companies that make the processors. The code and algorithms in these cases are then tailored to the particular

advantages (or to overcoming the particular disadvantages) of the product. The

documentation that does exist contains very little about writing floating-point

routines or elementary functions.

This book has two purposes. The first and primary aim is to present a spectrum

of topics involving numerics and provide the information necessary to understand

the fundamentals as well as write the routines themselves. Along with this information are examples of their implementation in 8086 assembler and pseudocode

that show each algorithm in component steps, so you can port the operation to

another target. A secondary, but by no means minor, goal is to introduce you

3

NUMERICAL METHODS

to the benefits of binary arithmetic on a binary machine. The decimal numbering

system is so pervasive that it is often difficult to think of numbers in any other format,

but doing arithmetic in decimal on a binary machine can mean an enormous number

of wasted machine cycles, undue complexity, and bloated programs. As you proceed

through this book, you should become less dependent on blind libraries and more

able to write fast, efficient routines in the native base of your machine.

Each chapter of this book provides the foundation for the next chapter. At the

code level, each new routine builds on the preceeding algorithms and routines.

Algorithms are presented with an accompanying example showing one way to

implement them. There are, quite often, many ways that you could solve the

algorithm. Feel free to experiment and modify to fit your environment.

Chapter 1 covers positional number theory, bases, and signed arithmetic. The

information here provides the necessary foundation to understand both decimal and

binary arithmetic. That understanding can often mean faster more compact routines

using the elements of binary arithmetic- in other words, shifts, additions, and

subtractions rather than complex scaling and extensive routines.

Chapter 2 focuses on integer arithmetic, presenting algorithms for performing

addition, subtraction, multiplication, and division. These algorithms apply to machines that have hardware instructions and those capable of only shifts, additions,

and subtractions.

Real numbers (those with fractional extension) are often expressed in floating

point, but fixed point can also be used. Chapter 3 explores some of the qualities of

real numbers and explains how the radix point affects the four basic arithmetic

functions. Because the subject of fractions is covered, several rounding techniques

are also examined. Some interesting techniques for performing division, one using

multiplication and the other inversion, are also presented. These routines are

interesting because they involve division with very long operands as well as from a

purely conceptual viewpoint. At the end of the chapter, there is an example of an

algorithm that will draw a circle in a two dimensional space, such as a graphics

monitor, using only shifts, additions and subtractions.

Chapter 4 covers the basics of floating-point arithmetic and shows how scaling

is done. The four basic arithmetic functions are developed into floating-point

4

INTRODUCTION

routines using the fixed point methods given in earlier chapters.

Chapter 5 discusses input and output routines for numerics. These routines deal

with radix conversion, such as decimal to binary, and format conversions, such as

ASCII to floating point. The conversion methods presented use both computational

and table-driven techniques.

Finally, the elementary functions are discussed in Chapter 6. These include

table-driven techniques for fast lookup and routines that rely on the fundamental

binary nature of the machine to compute fast logarithms and powers. The CORDIC

functions which deliver very high-quality transcendentals with only a few shifts and

additions, are covered, as are the Taylor expansions and Horner’s Rule. The

chapter ends with an implementation of a floating-point sine/cosine algorithm

based upon a minimax approximation and a floating-point square root.

Following the chapters, the appendices comprise additional information and

reference materials. Appendix A presents and explains the pseudo-random number

generator developed to test many of the routines in the book and includes

SPECTRAL.C, a C program useful in testing the functions described in this book.

This program was originally created for the pseudo-random number generator and

incorporates a visual check and Chi-square statistical test on the function. Appendix

B offers a small set of constants commonly used.

The source code for all the arithmetic functions, along with many ancillary

routines and examples, is in appendices C through F.

Integer and fixed-point routines are in Appendix C. Here are the classical

routines for multiplication and division, handling signs, along with some of the more

complex fixed-point operations, such as the Newton Raphson iteration and linear

interpolation for division.

Appendix D consists of the basic floating-point routines for addition,

subtraction, multiplication, and division, Floor, ceiling, and absolute value

functions are included here, as well as many other functions important to the

more advanced math in Chapter 6.

The conversion routines are in Appendix E. These cover the format and

numerical conversions in Chapter 5

In Appendix F, there are two source files. TRANS.ASM contains the elementary

5

NUMERICAL METHODS

functions described in Chapter 6, and TABLE.ASM that holds the tables, equates

and constants used in TRANS.ASM and many of the other modules.

MATH.C in Appendix F is a C program useful in testing the functions described

in this book. It is a simple shell with the defines and prototypes necessary to perform

tests on the routines in the various modules.

Because processors and microcontrollers differ in architecture and instruction

set, algorithmic solutions to numeric problems are provided throughout the book for

machines with no hardware primitives for multiplication and division as well as for

those that have such primitives.

Assembly language by nature isn’t very portable, but the ideas involved in

numeric processing are. For that reason, each algorithm includes an explanation that

enables you to understand the ideas independently of the code. This explanation is

complemented by step-by-step pseudocode and at least one example in 8086

assembler. All the routines in this book are also available on a disk along with a

simple C shell that can be used to exercise them. This allows you to walk through the

routines to see how they work and test any changes you might make to them. Once

you understand how the routine works, you can port it to another processor. The

routines as presented in the book are formatted differently from the same routines on

the disk. This is done to accommodate the page size. Any last minute changes to the

source code are documented in the Readme file on the disk.

There is no single solution for all applications; there may not even be a single

solution for a particular application. The final decision is always left to the individual

programmer, whose skills and knowledge of the application are what make the

software work. I hope this book is of some help.

6

Previous Home

CHAPTER 1

Numbers

Numbers are pervasive; we use them in almost everything we do, from counting

the feet in a line of poetry to determining the component frequencies in the periods

of earthquakes. Religions and philosophies even use them to predict the future. The

wonderful abstraction of numbers makes them useful in any situation. Actually, what

we find so useful aren’t the numbers themselves (numbers being merely a representation), but the concept of numeration: counting, ordering, and grouping.

Our numbering system has humble beginnings, arising from the need to quantify

objects-five horses, ten people, two goats, and so on-the sort of calculations that

can be done with strokes of a sharp stone or root on another stone. These are natural

numbers-positive, whole numbers, each defined as having one and only one

immediate predecessor. These numbers make up the number ray, which stretches

from zero to infinity (see Figure 1- 1).

Figure 1-1. The number line.

7

Next

NUMERICAL METHODS

The calculations performed with natural numbers consist primarily of addition

and subtraction, though natural numbers can also be used for multiplication (iterative

addition) and, to some degree, for division. Natural numbers don’t always suffice,

however; how can you divide three by two and get a natural number as the result?

What happens when you subtract 5 from 3? Without decimal fractions, the results of

many divisions have to remain symbolic. The expression "5 from 3" meant nothing

until the Hindus created a symbol to show that money was owed. The words positive

and negative are derived from the Hindu words for credit and debit’.

The number ray-all natural numbers- b e c a m e part of a much greater schema

known as the number line, which comprises all numbers (positive, negative, and

fractional) and stretches from a negative infinity through zero to a positive infinity

with infinite resolution*. Numbers on this line can be positive or negative so that 35 can exist as a representable value, and the line can be divided into smaller and

smaller parts, no part so small that it cannot be subdivided. This number line extends

the idea of numbers considerably, creating a continuous weave of ever-smaller

pieces (you would need something like this to describe a universe) that finally give

meaning to calculations such as 3/2 in the form of real numbers (those with decimal

fractional extensions).

This is undeniably a valuable and useful concept, but it doesn’t translate so

cleanly into the mechanics of a machine made of finite pieces.

Systems of Representation

The Romans used an additional system of representation, in which the symbols

are added or subtracted from one another based on their position. Nine becomes IX

in Roman numerals (a single count is subtracted from the group of 10, equaling nine;

if the stroke were on the other side of the symbol for 10, the number would be 11).

This meant that when the representation reached a new power of 10 or just became

too large, larger numbers could be created by concatenating symbols. The problem

here is that each time the numbers got larger, new symbols had to be invented.

Another form, known as positional representation, dates back to the Babylonians,

who used a sort of floating point with a base of 60.3 With this system, each

successively larger member of a group has a different symbol. These symbols are

8

NUMBERS

then arranged serially to grow more significant as they progress to the left. The

position of the symbol within this representation determines its value. This makes for

a very compact system that can be used to approximate any value without the need

to invent new symbols. Positional numbering systems also allow another freedom:

Numbers can be regrouped into coefficients and powers, as with polynomials, for

some alternate approaches to multiplication and division, as you will see in the

following chapters.

If b is our base and a an integer within that base, any positive integer may be

represented as:

or as:

ai * bi + ai-1 * bi-1 + ... + a0 * b0

As you can see, the value of each position is an integer multiplied by the base

taken to the power of that integer relative to the origin or zero. In base 10, that

polynomial looks like this:

ai * 10i + ai-1 * 10i-1 + ... + a0 * 100

and the value 329 takes the form:

3 * 10 + 2 * 10 + * 10

Of course, since the number line goes negative, so must our polynomial:

ai * bi + ai-1 * bi-1 + ... + a0 * b0 + a-1 * b-1 + a-2 * b-2 + ... + a-i *

b-i

Bases

Children, and often adults, count by simply making a mark on a piece of paper

for each item in the set they’re quantifying. There are obvious limits to the numbers

9

NUMERICAL METHODS

that can be conveniently represented this way, but the solution is simple: When the

numbers get too big to store easily as strokes, place them in groups of equal size and

count only the groups and those that are left over. This makes counting easier because

we are no longer concerned with individual strokes but with groups of strokes and

then groups of groups of strokes. Clearly, we must make the size of each group

greater than one or we are still counting strokes. This is the concept of base. (See

Figure l-2.) If we choose to group in l0s, we are adopting 10 as our base. In base 10,

they are gathered in groups of 10; each position can have between zero and nine

things in it. In base 2, each position can have either a one or a zero. Base 8 is zero

through seven. Base 16 uses zero through nine and a through f. Throughout this

book, unless the base is stated in the text, a B appended to the number indicates base

2, an O indicates base 8, a D indicates base 10, and an H indicates base 16.

Regardless of the base in which you are working, each successive position to the

left is a positive increase in the power of the position.

In base 2, 999 looks like:

lllll00lllB

If we add a subscript to note the position, it becomes:

19 18 17 16 15 04 03 12 11 10

This has the value:

1*29 +1*28 +1*27 +1*26 +1*25 +1*24 +1*23 +1*22 +1*21 +1*20

which is the same as:

1*512 + 1*256 + 1*128 + 1*64 + 1*32 + 0*16 + 0*8 + 1*4 + 1*2 + 1*1

Multiplying it out, we get:

512 + 256 + 128 + 64 + 32 + 0 + 0 + 4 + 2 + 1 = 999

10

NUMBERS

Figure 1-2. The base of a number system defines the number of unique digits

available in each position.

Octal, as the name implies, is based on the count of eight. The number 999 is 1747

in octal representation, which is the same as writing:

1*83 + 7*82 + 4*81 + 7*80

or

1*512 + 7*64 + 4*8 + 7*1

When we work with bases larger than 10, the convention is to use the letters of

the alphabet to represent values equal to or greater than 10. In base 16 (hexadecimal),

therefore, the set of numbers is 0 1 2 3 4 5 6 7 8 9 a b c d e f, where a = 10 and

f = 15. If you wanted to represent the decimal number 999 in hexadecimal, it would

be 3e7H, which in decimal becomes:

3*162 + 14*161 + 7*160

Multiplying it out gives us:

3*256 + 14*16 + 7*1

11

NUMERICAL METHODS

Obviously, a larger base requires fewer digits to represent the same value.

Any number greater than one can be used as a base. It could be base 2, base 10,

or the number of bits in the data type you are working with. Base 60, which is used

for timekeeping and trigonometry, is attractive because numbers such as l/3 can be

expressed exactly. Bases 16, 8, and 2 are used everywhere in computing machines,

along with base 10. And one contingent believes that base 12 best meets our

mathematical needs.

The Radix Point, Fixed and Floating

Since the physical world cannot be described in simple whole numbers, we need

a way to express fractions. If all we wish to do is represent the truth, a symbol will

do. A number such as 2/3 in all its simplicity is a symbol-a perfect symbol, because

it can represent something unrepresentable in decimal notation. That number

translated to decimal fractional representation is irrational; that is, it becomes an

endless series of digits that can only approximate the original. The only way to

express an irrational number in finite terms is to truncate it, with a corresponding loss

of accuracy and precision from the actual value.

Given enough storage any number, no matter how large, can be expressed as

ones and zeros. The bigger the number, the more bits we need. Fractions present a

similar but not identical barrier. When we’re building an integer we start with unity,

the smallest possible building block we have, and add progressively greater powers

(and multiples thereof) of whatever base we’re in until that number is represented.

We represent it to the least significant bit (LSB), its smallest part.

The same isn’t true of fractions. Here, we’re starting at the other end of the

spectrum; we must express a value by adding successively smaller parts. The trouble

is, we don’t always have access to the smallest part. Depending on the amount of

storage available, we may be nowhere near the smallest part and have, instead of a

complete representation of a number, only an approximation. Many common values

can never be represented exactly in binary arithmetic. The decimal 0.1 or one 10th,

for example, becomes an infinite series of ones and zeros in binary

(1100110011001100 ... B). The difficulties in expressing fractional parts completely

can lead to unacceptable errors in the result if you’re not careful.

12

NUMBERS

The radix point (the point of origin for the base, like the decimal point) exists on

the number line at zero and separates whole numbers from fractional numbers. As

we move through the positions to the left of the radix point, according to the rules of

positional notation, we pass through successively greater positive powers of that

base; as we move to the right, we pass through successively greater negative powers

of the base.

In the decimal system, the number 999.999 in positional notation is

929190.9-19-29-3

And we know that base 10

102 = 100

101 = 10

100 = 1

It is also true that

10-1 = .1

10-2 = .01

10-3 = .001

We can rewrite the number as a polynomial

9*102 + 9*101 + 9*100 + 9*10-1 + 9*10-2 + 9*10-3

Multiplying it out, we get

900 +90 + 9 + .9 + .09 + .009

which equals exactly 999.999.

Suppose we wish to express the same value in base 2. According to the previous

example, 999 is represented in binary as 1111100111B. To represent 999.999, we

need to know the negative powers of two as well. The first few are as follows:

13

NUMERICAL METHODS

2-1 = .5D

2 = .25D

2-3 = .125D

2-4 = .0625D

-5

2 = .03125D

-2

2-6 = .015625D

2-7 = .0078125D

2-8 = .00390625D

2-9 = .001953125D

2-10 = .0009765625D

-11

2

= .00048828125D

2-12 = .000244140625D

Twelve binary digits are more than enough to approximate the decimal fraction

.999. Ten digits produce

1111100111.1111111111 =

999.9990234375

which is accurate to three decimal places.

Representing 999.999 in other bases results in similar problems. In base 5, the

decimal number 999.999 is noted

12444.4444141414 =

1*54 + 2*53 + 4*52 + 4*51 + 4*50. + 4*5-1 + 4*5-2 + 4*5-3 + 4*5-4 + 1*5-5 +

4*5-6 + 1*5-7 + 4*5-8 + 1*5-9 + 4*5-10 =

1*625 + 2*125 + 4*25 + 4*5 + 4+ 4*.2 + 4*.04 + 4*.008 + 4*.0016

+ 1*.00032 + 4*.000065 + 1*.0000125 + 4*.00000256

+ 1*.000000512 + 4*.0000001024

or

625+ +250 + 100 + 20 + 4 + .8 + .16 + .032 + .0064 + .00032 + .000256 +

.0000128 + .00001024 + .000000512 + .00004096 =

999.9990045696

14

NUMBERS

But in base 20, which is a multiple of 10 and two, the expression is rational. (Note

that digits in bases that exceed 10 are usually denoted by alphabetical characters; for

example, the digits of base 20 would be 0 l 2 3 4 5 6 7 8 9 A B C D E F G H I J .)

29J.JJC

2x202 + 9x201 + 19x200. + 19x20-1 + 19x20-2 + 12x20-3 =

2x400 + 9x20 + 19x1. + 19x.05 + 19x.0025 + 12x.000125

or

800 + 180 + 19. + .95 + .0475 + .0015 =

999.999

As you can see, it isn’t always easy to approximate a fraction. Fractions are a sum

of the value of each position in the data type. A rational fraction is one whose sum

precisely matches the value you are trying to approximate. Unfortunately, the exact

combination of parts necessary to represent a fraction exactly may not be available

within the data type you choose. In cases such as these, you must settle for the

accuracy obtainable within the precision of the data type you are using.

Types of Arithmetic

This book covers three basic types of arithmetic: fixed point (including integeronly arithmetic and modular) and floating point.

Fixed Point

Fixed-point implies that the radix point is in a fixed place within the representation. When we’re working exclusively with integers, the radix point is always to

the right of the rightmost digit or bit. When the radix point is to the left of the leftmost

digit, we’re dealing with fractional arithmetic. The radix point can rest anywhere

within the number without changing the mechanics of the operation. In fact, using

fixed-point arithmetic in place of floating point, where possible, can speed up any

arithmetic operation. Everything we have covered thus far applies to fixed-point

arithmetic and its representation.

15

NUMERICAL METHODS

Though fixed-point arithmetic can result in the shortest, fastest programs, it

shouldn’t be used in all cases. The larger or smaller a number gets, the more storage

is required to represent it. There are alternatives; modular arithmetic, for example,

can, with an increase in complexity, preserve much of an operation’s speed.

Modular arithmetic is what people use every day to tell time or to determine the

day of the week at some future point. Time is calculated either modulo 12 or 24—

that is, if it is 9:00 and six hours pass on a 12-hour clock, it is now 3:00, not 15:00:

9 + 6 = 3

This is true if all multiples of 12 are removed. In proper modular notation, this

would be written:

9 + 6

3, mod 12.

In this equation, the sign

means congruence. In this way, we can make large

numbers congruent to smaller numbers by removing multiples of another number (in

the case of time, 12 or 24). These multiples are often removed by subtraction or

division, with the smaller number actually being the remainder.

If all operands in an arithmetic operation are divided by the same value, the result

of the operation is unaffected. This means that, with some care, arithmetic operations

performed on the remainders can have the same result as those performed on the

whole number. Sines and cosines are calculated mod 360 degrees (or mod 2

radians). Actually, the input argument is usually taken mod /2 or 90 degrees,

depending on whether you are using degrees or radians. Along with some method for

determining which quadrant the angle is in, the result is computed from the

congruence (see Chapter 6).

Random number generators based on the Linear Congruential Method use

modular arithmetic to develop the output number as one of the final steps.4

Assembly-language programmers can facilitate their work by choosing a modulus

that’s as large as the word size of the machine they are working on. It is then a simple

matter to calculate the congruence, keeping those lower bits that will fit within the

16

NUMBERS

word size of the computer. For example, assume we have a hexadecimal doubleword:

and the word size of our machine is 16 bits

For more information on random number generators, see Appendix A.

One final and valuable use for modular arithmetic is in the construction of selfmaintaining buffers and arrays. If a buffer containing 256 bytes is page aligned-the

last eight bits of the starting address are zero-and an 8-bit variable is declared to

count the number of entries, a pointer can be incremented through the buffer simply

by adding one to the counting variable, then adding that to the address of the base of

the buffer. When the pointer reaches 255, it will indicate the last byte in the buffer;

when it is incremented one more time, it will wrap to zero and point once again at

the initial byte in the buffer.

Floating Point

Floating point is a way of coding fixed-point numbers in which the number of

significant digits is constant per type but whose range is enormously increased

because an exponent and sign are embedded in the number. Floating-point arithmetic

is certainly no more accurate than fixed point-and it has a number of problems,

including those present in fixed point as well as some of its own-but it is convenient

and, used judiciously, will produce valid results.

The floating-point representations used most commonly today conform, to some

degree, to the IEEE 754 and 854 specifications. The two main forms, the long real

and the short real, differ in the range and amount of storage they require. Under the

IEEE specifications, a long real is an 8-byte entity consisting of a sign bit, an 11-bit

exponent, and a 53-bit significand, which mean the significant bits of the floatingpoint number, including the fraction to the right of the radix point and the leading one

17

NUMERICAL METHODS

to the left. A short real is a 4-byte entity consisting of a sign bit, an 8-bit exponent,

and a 24-bit significand.

To form a binary floating-point number, shift the value to the left (multiply by

two) or to the right (divide by two) until the result is between 1.0 and 2.0. Concatenate

the sign, the number of shifts (exponent), and the mantissa to form the float.

Doing calculations in floating point is very convenient. A short real can express

a value in the range 1038 to 10-38 in a doubleword, while a long real can handle values

ranging from 10308 to 10-308 in a quadword. And most of the work of maintaining the

numbers is done by your floating-point package or library.

As noted earlier, some problems in the system of precision and exponentiation

result in a representation that is not truly "real"—namely, gaps in the number line and

loss of significance. Another problem is that each developer of numerical software

adheres to the standards in his or her own fashion, which means that an equation that

produced one result on one machine may not produce the same result on another

machine or the same machine running a different software package. This compatibility problem has been partially alleviated by the widespread use of coprocessors.

Positive and Negative Numbers

The most common methods of representing positive and negative numbers in a

positional number system are sign magnitude, diminished-radix complement, and

radix complement (see Table 1- 1).

With the sign-magnitude method, the most significant bit (MSB) is used to

indicate the sign of the number: zero for plus and one for minus. The number itself

is represented as usual—that is, the only difference between a positive and a negative

representation is the sign bit. For example, the positive value 4 might be expressed

as 0l00B in a 4-bit binary format using sign magnitude, while -4 would be

represented as 1100B.

This form of notation has two possible drawbacks. The first is something it has

in common with the diminished-radix complement method: It yields two forms of

zero, 0000B and 1000B (assuming three bits for the number and one for the sign).

Second, adding sign-magnitude values with opposite signs requires that the magni-

18

NUMBERS

tudes of the numbers be consulted to determine the sign of the result. An example of

sign magnitude can be found in the IEEE 754 specification for floating-point

representation.

The diminished-radix complement is also known as the one’s complement in

binary notation. The MSB contains the sign bit, as with sign magnitude, while the rest

of the number is either the absolute value of the number or its bit-by-bit complement.

The decimal number 4 would appear as 0100 and -4 as 1011. As in the foregoing

method, two forms of zero would result: 0000 and 1111.

The radix complement, or two’s complement, is the most widely used notation

in microprocessor arithmetic. It involves using the MSB to denote the sign, as in the

other two methods, with zero indicating a positive value and one meaning negative.

You derive it simply by adding one to the one’s-complement representation of the

same negative value. Using this method, 4 is still 0100, but -4 becomes 1100. Recall

that one’s complement is a bit-by-bit complement, so that all ones become zeros and

all zeros become ones. The two’s complement is obtained by adding a one to the

one’s complement.

This method eliminates the dual representation of zero-zero is only 0000

(represented as a three-bit signed binary number)-but one quirk is that the range of

values that can be represented is slightly more negative than positive (see the chart

below). That is not the case with the other two methods described. For example, the

largest positive value that can be represented as a signed 4-bit number is 0111B, or

7D, while the largest negative number is 1000B, or -8D.

19

NUMERICAL METHODS

One's complement

Two's

complement

Sign complement

0111

7

7

7

0110

6

6

6

0101

5

5

5

0100

4

4

4

0011

3

3

3

0010

2

2

2

0001

1

1

1

0000

0

0

0

1111

-0

-1

-7

1110

-1

-2

-6

1101

-2

-3

-5

1100

-3

-4

-4

1011

-4

-5

-3

1010

-5

-6

-2

1001

-6

-7

-1

1000

-7

-8

-0

Table 1-1. Signed Numbers.

Decimal integers require more storage and are far more complicated to work

with than binary; however, numeric I/O commonly occurs in decimal, a more

familiar notation than binary. For the three forms of signed representation already

discussed, positive values are represented much the same as in binary (the leftmost

20

NUMBERS

bit being zero). In sign-magnitude representation, however, the sign digit is nine

followed by the absolute value of the number. For nine’s complement, the sign digit

is nine and the value of the number is in nine’s complement. As you might expect,

10’s complement is the same as nine’s complement except that a one is added to the

low-order (rightmost) digit.

Fundamental Arithmetic Principles

So far we’ve covered the basics of positional notation and bases. While this book

is not about mathematics but about the implementation of basic arithmetic operations

on a computer, we should take a brief look at those operations.

1. Addition is defined as a + b = c and obeys the commutative rules described

below.

2. Subtraction is the inverse of addition and is defined as b = c - a.

3. Multiplication is defined as ab = c and conforms to the commutative,

associative, and distributive rules described below.

4. Division is the inverse of multiplication and is shown by b = c/a.

5. A power is ba=c.

6. A root is b =

7 . A logarithm is a = logbc.

Addition and subtraction must also satisfy the following rules:5

8. Commutative:

a+b=b+a

ab = ba

9 . Associative:

a = (b + c) = (a + b) + c

a(bc) = (ab)c

10. Distributive:

a(b + c) = ab + ac

From these rules, we can derive the following relations:6

11. (ab)c = acbc

21

NUMERICAL METHODS

12. abac = ac(b+c)

13.

14.

15.

16.

17.

(ab)c = a(bc)

a+0=a

a x 1= a

a1 = a

a/0 is undefined

These agreements form the basis for the arithmetic we will be using in upcoming

chapters.

Microprocessors

The key to an application’s success is the person who writes it. This statement

is no less true for arithmetic. But it’s also true that the functionality and power of the

underlying hardware can greatly affect the software development process.

Table l-2 is a short list of processors and microcontrollers currently in use, along

with some issues relevant to writing arithmetic code for them (such as the instruction

set, and bus width). Although any one of these devices, with some ingenuity and

effort, can be pushed through most common math functions, some are more capable

than others. These processors are only a sample of what is available. In the rest of this

text, we’ll be dealing primarily with 8086 code because of its broad familiarity.

Examples from other processors on the list will be included where appropriate.

Before we discuss the devices themselves, perhaps an explanation of the

categories would be helpful.

Buswidth

The wider bus generally results in a processor with a wider bandwidth because it can

access more data and instruction elements. Many popular microprocessors have a

wider internal bus than external, which puts a burden on the cache (storage internal

to the microprocessor where data and code are kept before execution) to keep up with

the processing. The 8088 is an example of this in operation, but improvements in the

80x86 family (including larger cache sizes and pipelining to allow some parallel

processing) have helped alleviate the problem.

22

NUMBERS

Table 1-2. Instructions and flags.

23

NUMERICAL METHODS

Data type

The larger the word size of your machine, the larger the numbers you can process

with single instructions. Adding two doubleword operands on an 8051 is a

multiprecision operation requiring several steps. It can be done with a single ADD

on a TMS34010 or 80386. In division, the word size often dictates the maximum size

of the quotient. A larger word size allows for larger quotients and dividends.

Flags

The effects of a processor’s operation on the flags can sometimes be subtle. The

following comments are generally true, but it is always wise to study the data sheets

closely for specific cases.

Zero. This flag is set to indicate that an operation has resulted in zero. This can

occur when two operands compare the same or when two equal values are

subtracted from one another. Simple move instructions generally do not affect

the state of the flag.

Carry. Whether this flag is set or reset after a certain operation varies from

processor to processor. On the 8086, the carry will be set if an addition overflows

or a subtraction underflows. On the 80C196, the carry will be set if that addition

overflows but cleared if the subtraction underflows. Be careful with this one.

Logical instructions will usually reset the flag and arithmetic instructions as well

as those that use arithmetic elements (such as compare) will set it or reset it based

on the results.

Sign. Sometimes known as the negative flag, it is set if the MSB of the data type

is set following an operation.

Overflow. If the result of an arithmetic operation exceeds the data type meant

to contain it, an overflow has occurred. This flag usually only works predictably

with addition and subtraction. The overflow flag is used to indicate that the result

of a signed arithmetic operation is too large for the destination operand. It will

be set if, after two numbers of like sign are added or subtracted, the sign of the

result changes or the carry into the MSB of an operand and the carry out don’t

match.

24

NUMBERS

Overflow Trap. If an overflow occurred at any time during an arithmetic

operation, the overflow trap will be set if not already set. This flag bit must be

cleared explicitly. It allows you to check the validity of a series of operations.

Auxiliary Carry. The decimal-adjust instructions use this flag to correct the

accumulator after a decimal addition or subtraction. This flag allows the

processor to perform a limited amount of decimal arithmetic.

Parity. The parity flag is set or reset according to the number of bits in the lower

byte of the destination register after an operation. It is set if even and reset if odd.

Sticky Bit. This useful flag can obviate the need for guard digits on certain

arithmetic operations. Among the processors in Table l-2, it is found only on

the 80C196. It is set if, during a multiple right shift, more than one bit was shifted

into the carry with a one in the carry at the end of the shift.

Rounding and the Sticky Bit

A multiple shift to the right is a divide by some power of two. If the carry is set,

the result is equal to the integer result plus l/2, but should we round up or down? This

problem is encountered frequently in integer arithmetic and floating point. Most

floating-point routines have some form of extended precision so that rounding can

be performed. This requires storage, which usually defaults to some minimal data

type (the routines in Chapter 4 use a word). The sticky bit reduces the need for such

extended precision. It indicates that during a right shift, a one was shifted into the

carry flag and then shifted out.

Along with the carry flag, the sticky bit can be used for rounding. For example,

suppose we wish to divide the hex value 99H by 16D. We can do this easily with a

four-bit right shift. Before the shift, we have:

Operand

10011001

Carry flag

0

Sticky bit

0

We shift the operand right four times with the following instruction:

shr

ax, #4

25

NUMERICAL METHODS

During the shift, the Least Significant Bit (LSB) of the operand (a one) is shifted

into the carry and then out again, setting the sticky bit followed by two zeros and a

final one. The operand now has the following form:

Operand

00001001

Sticky bit

Carry flag

1 (from the last shift) 1 (because of the first one

shifted in and out of the carry)

To round the result, check the carry flag. If it’s clear, the bits shifted out were less

than half of the LSB, and rounding can be done by truncation. If the carry is set, the

bits shifted out were at least half of the LSB. Now, with the sticky bit, we can see if

any other bits shifted out during the divide were ones; if so, the sticky bit is set and

we can round up.

Rounding doesn’t have to be done as described here, but however you do it the

sticky bit can make your work easier. Too bad it’s not available on more machines.

Branching

Your ability to do combined jumps depends on the flags. All the processors listed

in the table have the ability to branch, but some implement that ability on more

sophisticated relationships. Instead of a simple “jump on carry,” you might find

“jump if greater, ” “jump if less than or equal,” and signed and unsigned operations.

These extra instructions can cut the size and complexity of your programs.

Of the devices listed, the TMS34010, 80x86 and 80C196 have the richest set of

branching instructions. These include branches on signed and unsigned comparisons

as well as branches on the state of the flags alone.

Instructions

Addition

Add. Of course, to perform any useful arithmetic, the processor must be capable

of some form of addition. This instruction adds two operands, signaling any

overflow from the result by setting the carry.

26

NUMBERS

Add-with-Carry. The ability to add with a carry bit allows streamlined,

multiprecision additions. In multibyte or multiword additions, the add instruction is usually used first; the add-with-carry instruction is used in each succeeding addition. In this way, overflows from each one addition can ripple through

to the next.

Subtraction

Subtract. All the devices in Table l-2 can subtract except the 8048 and 8051.

The 8051 uses the subtract-with-carry instruction to fill this need. On the 8048,

to subtract one quantity (the subtrahend) from another (the minuend), you must

complement the subtrahend and increment it, then add it to the minuend-in

other words, add the two’s complement to the minuend.

Subtract-with-Carry. Again, the 8048 does not support this instruction, while all

the others do. Since the 8051 has only the subtract-with-carry instruction, it is

important to see that the carry is clear before a subtraction is performed unless

it is a multiprecision operation. The subtract-with-carry is used in multiprecision

subtraction in the same manner as the add-with-carry is used in addition.

Compare. This instruction is useful for boundary, magnitude and equality

checks. Most implementations perform a comparison by subtracting one value

from another. This process affects neither operand, but sets the appropriate flags.

Many microprocessors allow either signed or unsigned comparisons.

Multiplication

Multiply. This instruction performs a standard unsigned multiply based on the

word size of the particular microprocessor or microcontroller. Hardware can

make life easier. On the 8088 and 8086, this instruction was embarrassingly slow

and not that much of a challenge to shift and add routines. On later members of

the 80x86 family, it takes a fraction of the number of cycles to perform, making

it very useful for multiprecision and single-precision work.

Signed Multiply. The signed multiply, like the signed divide (which we’ll

27

NUMERICAL METHODS

discuss in a moment), has limited use. It produces a signed product from two

signed operands on all data types up to and including the word size of the

machine. This is fine for tame applications, but makes the instruction unsuitable

for multiprecision work. Except for special jobs, it might be wise to employ a

generic routine for handling signed arithmetic. One is described in the next

chapter.

Division

Divide. A hardware divide simplifies much of the programmer’s work unless it

is very, very slow (as it is on the 8088 and 8086). A multiply canextend the useful

range of the divide considerably. The following chapter gives examples of how

to do this.

Signed Divide. Except in specialized and controlled circumstances, the signed

divide may not be of much benefit. It is often easier and more expeditious to

handle signed arithmetic yourself, as will be shown in Chapter 2.

Modulus. This handy instruction returns the remainder from the division of two

operands in the destination register. As the name implies, this instruction is very

useful when doing modular arithmetic. This and signed modulus are available on

the TMS34010.

Signed Modulus. This is the signed version of the earlier modulus instruction,

here the remainder bears the sign of the dividend.

Negation and Signs

One’s Complement. The one’s complement is useful for logical operations and

diminished radix arithmetic (see Positive and Negative Numbers, earlier in this

chapter). This instruction performs a bit-by-bit complement of the input argument; that is, it makes each one a zero and each zero a one.

Two’s Complement. The argument is one’s complemented, then incremented by

28

NUMBERS

one. This is how negative numbers are usually handled on microcomputers.

l

Sign Extension. This instruction repeats the value of the MSB of the byte or word

through the next byte, word, or doubleword so the proper results can be obtained

from an arithmetic operation. This is useful for converting a small data type to

a larger data type of the same sign for such operations as multiplication and

division.

Shifts, Rotates and Normalization

Rotate. This simple operation can occur to the right or the left. In the case of a

ROTATE to the right, each bit of the data type is shifted to the right; the LSB is

deposited in the carry, while a zero is shifted in on the left. If the rotate is to the

left, each bit is moved to occupy the position of the next higher bit in the data type

until the last bit is shifted out into the carry flag (see figure l-3). On the Z80,

some shifts put the same bit into the carry and the LSB of the byte you are

shifting. Rotation is useful for multiplies and divides as well as normalization.

Rotate-through-Carry. This operation is similar to the ROTATE above, except

that the carry is shifted into the LSB (in the case of a left shift), or the MSB (in

the case of a right shift). Like the ROTATE, this instruction is useful for

multiplies and divides as well as normalization.

Arithmetic Shift. This shift is similar to a right shift. As each bit is shifted, the

value of the MSB remains the same, maintaining the value of the sign.

Normalization. This can be either a single instruction, as is the case on the

80C196, or a set of instructions, as on the TMS34010. NORML will cause the

80C196 to shift the contents of a doubleword register to the left until the MSB

is a one, “normalizing” the value and leaving the number of shifts required in a

register. On the TMS34010, LMO leaves the number of bits required to shift a

doubleword so that its MSB is one. A multibit shift can then be used to normalize

it. This mechanism is often used in floating point and as a setup for binary table

routines.

29

NUMERICAL METHODS

Figure 1-3. Shifts and rotates.

Decimal and ASCII Instructions

Decimal Adjust on Add. This instruction adjusts the results of the addition of two

decimal values to decimal. Decimal numbers cannot be added on a binary

computer with guaranteed results without taking care of any intrabyte carries

that occur when a digit in a position exceeds nine. On the 8086, this instruction

affects only the AL register. This and the next instruction can be very useful in

an embedded system that receives decimal data and must perform some simple

processing before displaying or returning it.

Decimal Adjust on Subtract. This instruction is similar to the preceeding one

except that it applies to subtraction.

ASCII Adjust. These instructions prepare either binary data for conversion to

ASCII or ASCII data for conversion to binary. Though Motorola processors also

implement these instructions, they are found only in the 80x86 series in our list.

Used correctly, they can also do a certain amount of arithmetic.

30

NUMBERS

Most of the earlier microprocessors-such as the 8080, 8085, Z80, and 8086—

as well as microcontrollers like the 8051 were designed with general applications in

mind. While the 8051 is billed as a Boolean processor, it’s general set of instructions

makes many functions possible and keeps it very popular today.

All these machines can do arithmetic at one level or another. The 8080, 8085, and

Z80 are bit-oriented and don’t have hardware multiplies and divides, making them

somewhat slower and more difficult to use than those that do. The 8086 and 8051

have hardware multiplies and divides but are terribly slow about it. (The timings for

the 8086 instructions were cleaned up considerably in subsequent generations of the

286, 386, and 486.) They added some speed to the floating-point routines and

decreased code size.

Until a few years ago, the kind of progress usually seen in these machines was

an increase in the size of the data types available and the addition of hardware

arithmetic. The 386 and 486 can do some 64-bit arithmetic and have nice shift

instructions, SHLD and SHRD, that will happily shift the bits of the second operand

into the first and put the number of bits shifted in a third operand. This is done in a

single stroke, with the bits of one operand shifted directly into the other, easing

normalization of long integers and making for fast binary multiplies and divides. In

recent years we’ve seen the introduction of microprocessors and microcontrollers

that are specially designed to handle floating-point as well as fixed-point arithmetic.

These processors have significantly enhanced real-time control applications and

digital signal processing in general. One such microprocessor is the TMS34010; a

microcontroller with a similar aptitude is the 80C196.

31

NUMERICAL METHODS

1

Kline, Morris. Mathematics for the Nonmathematician. New York, NY: Dover

Publications, Inc., 1967, Page 72.

2

Gellert, W., S. Gottwald, M. Helwich, H. Kastner, and H. K?stner (eds.). The

VNR Concise Encyclopedia of Mathematics. New York, NY: Van Nostrand

Reinhold, 1989, Page 20.

3

Knuth, D. E. Seminumerical Algorithms. Reading, MA: Addison-Wesley Publishing Co., 1980, Page 180.

4

Knuth, D. E. Seminumerical Algorithms. Reading, MA: Addison-Wesley Publishing Co., 1981, Pages 1-127.

5

Cavanagh, Joseph J. F. Digital Computer Arithmetic. New York, NY: McGrawHill Book Co., 1984, Page 2.

6

Pearson, Carl E. (ed.) Handbook of Applied Mathematics. New York, NY: Van

Nostrand Reinhold, 1983, Page 1.

32

Previous Home Next

CHAPTER 2

Integers

Reducing a problem to the integer level wherever possible is certainly one of the

fastest and safest ways to solve it. But integer arithmetic is only a subset of fixedpoint arithmetic. Fixed-point arithmetic means that the radix point remains in the

same place during all calculations. Integer arithmetic is fixed point arithmetic with

the radix point consistently to the right of the LSB. Conversely, fractional arithmetic

is simply fixed point with the radix point to the left of the MSB. There are no specific

requirements regarding placement of the radix point; it depends entirely on the needs

of the operation. Sines and cosines may require no integer at all, while a power-series

calculation may require an integer portion. You may wish to use two guard digits

during multiplication and division for rounding purposes-it depends on you and the

application.

To present algorithms for the four basic operations of mathematics-addition,

subtraction, multiplication, and division-this chapter will concentrate on integeronly arithmetic. The operations for fixed-point and integer-only arithmetic are

essentially the same; the former simply involves some attention to the placement of

the radix point.

This chapter begins with the basic operations and the classic algorithms for them,

followed by some more advanced algorithms. The classic algorithms aren’t necessarily the fastest, but they are so elegant and reflect the character of the binary

numbering system and the nature of arithmetic itself so well that they are worth

knowing. Besides, on any binary machine, they’ll always work.

Addition and Subtraction

Unsigned Addition and Subtraction

Simply put, addition is the joining of two sets of numbers or quantities, into one

set. We could also say that when we add we’re really incrementing one value, the

33

NUMERICAL METHODS

augend, by another value, the addend. Subtraction is the inverse of addition, with one

number being reduced, or decremented, by another.

For example, the addition operation

+2

9

or

0111

0010

1001

might be accomplished on the 8086 with this instruction sequence:

mov

add

al,7

al,2

In positional arithmetic, each position is evaluated

x