Floating point arithmetic and unit roundoff error

Fundamental rule: a floating-point operation must approximate the corresponding real number arithmetic operation by rounding any result that is not a floating-point number to the nearest floating-point number.

In short: a fl(op) b = fl(a op b), where op = +,*,-,/.
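The rule can be checked numerically. The following is a minimal sketch, not part of the original notes: it computes `a op b` exactly in rational arithmetic, rounds that exact value once to a float, and compares it with the hardware result. The helper names `correctly_rounded` and `check_rule` are hypothetical, and the sketch assumes Python floats are IEEE 754 doubles with round-to-nearest, which holds on most platforms.

```python
from fractions import Fraction
import operator

# The four basic operations covered by the rule.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def correctly_rounded(op, a, b):
    """Compute a op b exactly as a rational, then round once to a float."""
    exact = op(Fraction(a), Fraction(b))  # Fraction(float) converts exactly
    return float(exact)                   # single rounding to the nearest float

def check_rule(a, b):
    """Map each operator symbol to whether a fl(op) b == fl(a op b)."""
    return {name: op(a, b) == correctly_rounded(op, a, b)
            for name, op in OPS.items()}

print(check_rule(0.1, 0.3))
```

Under the stated assumption, every entry of the printed dictionary should be `True`. Note that `b` must be nonzero for the division case.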

A simple formula bounds the relative error of a single floating-point operation (assuming the result neither overflows nor underflows):

a fl(op) b = a op b + $\epsilon$ (a op b), where $|\epsilon| \leq u$, and:

$u = \frac{1}{2} \times (\text{distance between 1 and the next larger floating-point number})$

$u$ is called the unit roundoff.
$u \approx 10^{-7.2}$ in single precision, and $u \approx 10^{-15.9}$ in double precision.
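The definition of $u$ and the error bound can both be illustrated with a short sketch. It is an illustration under assumptions, not part of the original notes: Python floats are taken to be IEEE 754 doubles, the single-precision value $2^{-24}$ is written out by hand on the assumption of a 24-bit significand, and the helper name `relative_error` is hypothetical.

```python
import math
import operator
import sys
from fractions import Fraction

# sys.float_info.epsilon is the distance between 1.0 and the next larger double,
# so half of it is the unit roundoff in double precision.
u_double = sys.float_info.epsilon / 2

# Assumption: IEEE 754 single precision has a 24-bit significand.
u_single = 2.0 ** -24

def relative_error(op, a, b):
    """Return |epsilon| in a fl(op) b = (a op b)(1 + epsilon), as an exact Fraction."""
    exact = op(Fraction(a), Fraction(b))  # exact rational result, must be nonzero
    computed = Fraction(op(a, b))         # floating-point result, converted exactly
    return abs(computed - exact) / abs(exact)

print("log10(u), single:", math.log10(u_single))
print("log10(u), double:", math.log10(u_double))

eps = relative_error(operator.add, 0.1, 0.3)
print("|epsilon| <= u:", eps <= Fraction(u_double))
```

Working in `Fraction` keeps the error measurement itself free of rounding, so the comparison against $u$ is exact.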

