Tag: floating point

Sources of Numerical Errors (Round-Off, Truncation, Stability)
Numerical computations inherently involve errors due to the limitations of representing and manipulating numbers in a finite-precision system. Understanding the different types of numerical errors is crucial for developing stable and accurate numerical algorithms. In this post, I will discuss three primary sources of numerical errors: round-off errors, truncation errors, and numerical stability.

Round-Off Errors

Round-off errors arise from the finite precision used to store numbers in a computer. Since most real numbers cannot be represented exactly in floating-point format, they must be approximated, leading to small discrepancies.

Causes of Round-Off Errors
- Finite Precision Representation: Floating-point numbers are stored using a limited number of bits, which results in small approximations for many numbers. For example, the decimal 0.1 cannot be exactly represented in binary, leading to a small error.
- Arithmetic Operations: When performing arithmetic on floating-point numbers, small errors can accumulate.
- Conversion Between Number Bases: Converting between decimal and binary introduces small inaccuracies due to repeating fractions in binary representation.
Example of Round-Off Error

Consider summing 0.1 ten times in floating-point arithmetic:
```
import numpy as np
print(sum([0.1] * 10) == 1.0)  # Expected True, but may return False due to precision error
print(sum([0.1] * 10))  # Output: 0.9999999999999999
```
This error occurs because 0.1 is stored as an approximation, and the accumulation of small errors results in a slight deviation from 1.0.

Catastrophic Cancellation

Catastrophic cancellation is a specific type of round-off error that occurs when subtracting two nearly equal numbers. Since the leading significant digits cancel out, the result has significantly reduced precision, often leading to large relative errors.

For example, consider:
```
import numpy as np
x = np.float32(1.0000001)
y = np.float32(1.0000000)
print(x - y)  # Loss of significant digits due to cancellation
```
If the subtraction results in a number much smaller than the original values, relative errors become large, reducing numerical accuracy. This type of error is especially problematic in iterative methods and when computing small differences in large values.

Truncation Errors

Truncation errors occur when an infinite or continuous mathematical process is approximated by a finite or discrete numerical method. These errors arise from simplifying mathematical expressions or using approximate numerical techniques.

Causes of Truncation Errors
- Numerical Differentiation and Integration: Approximating derivatives using finite differences or integrals using numerical quadrature leads to truncation errors.
- Series Expansions: Many functions are approximated using truncated Taylor series expansions, leading to errors that depend on the number of terms used.
- Discretization of Continuous Problems: Solving differential equations numerically involves discretizing time or space, introducing truncation errors.
Example of Truncation Error

Approximating the derivative of \(f(x) = \sin(x)\) using the finite difference formula: \(f'(x) \approx \frac{f(x+h) – f(x)}{h}\)

For small h, this formula provides an approximation, but it is not exact because it ignores higher-order terms in the Taylor series expansion.

Numerical Stability

Numerical stability pertains to an algorithm’s sensitivity to small perturbations, such as round-off or truncation errors, during its execution. A numerically stable algorithm ensures that such small errors do not grow uncontrollably, thereby maintaining the accuracy of the computed results.

Causes of Numerical Instability
- Ill-Conditioned Problems: Some problems are highly sensitive to small changes in input. For example, solving a nearly singular system of linear equations can magnify errors, making the solution highly inaccurate.
- Unstable Algorithms: Some numerical methods inherently amplify errors during execution. For instance, certain iterative solvers may diverge rather than converge if they are not properly designed.
- Accumulation of Round-Off Errors in Repeated Computations: When an algorithm involves repeated arithmetic operations, small round-off errors may compound over time, leading to significant deviations from the true result.
Example of Numerical Instability: Iterative Methods

Consider an iterative method designed to approximate the square root of a number. If the method is not properly formulated, small errors in each iteration can accumulate, leading to divergence from the true value rather than convergence. This behavior exemplifies numerical instability, as the algorithm fails to control the propagation of errors through successive iterations.

Conclusion

Understanding numerical errors is crucial for designing reliable computational methods. While round-off errors stem from finite precision representation, truncation errors arise from approximations in numerical methods. Stability plays a key role in determining whether small errors will remain controlled or grow exponentially. In the next post, I will discuss techniques to mitigate these errors and improve numerical accuracy.
20 March 2025
Representation of Numbers in Computers
Computers handle numbers differently from how we do in mathematics. While we are accustomed to exact numerical values, computers must represent numbers using a finite amount of memory. This limitation leads to approximations, which can introduce errors in numerical computations. In this post, I will explain how numbers are stored in computers, focusing on integer and floating-point representations.

Integer Representation

Integers are stored exactly in computers using binary representation. Each integer is stored in a fixed number of bits, commonly 8, 16, 32, or 64 bits. The two primary representations of integers are:

The Binary System

Computers operate using binary (base-2) numbers, meaning they represent all values using only two digits: 0 and 1. Each digit in a binary number is called a bit. The value of a binary number is computed similarly to decimal (base-10) numbers but using powers of 2 instead of powers of 10.

For example, the binary number 1101 represents: \[(1 \times 2^3)+(1 \times 2^2)+(0 \times 2^1)+(1 \times 2^0)=8+4+0+1=13\]

Similarly, the decimal number 9 is represented in binary as 1001.

Unsigned Integers

Unsigned integers can only represent non-negative values. A n-bit unsigned integer can store values from 0 to 2^n - 1. For example, an 8-bit unsigned integer can represent values from 0 to 255 (2^8 - 1).

Signed Integers and Two’s Complement

Signed integers can represent both positive and negative numbers. The most common way to store signed integers is two’s complement, which simplifies arithmetic operations and ensures unique representations for zero.

In two’s complement representation:
- The most significant bit (MSB) acts as the sign bit (0 for positive, 1 for negative).
- Negative numbers are stored by taking the binary representation of their absolute value, inverting the bits, and adding 1.
For example, in an 8-bit system:
- +5 is represented as 00000101
- -5 is obtained by:
  1. Writing 5 in binary: 00000101
  2. Inverting the bits: 11111010
  3. Adding 1: 11111011
Thus, -5 is stored as 11111011.

One of the key advantages of two’s complement is that subtraction can be performed as addition. For instance, computing 5 - 5 is the same as 5 + (-5), leading to automatic cancellation without requiring separate subtraction logic in hardware.

The range of a n-bit signed integer is from -2^(n-1) to 2^(n-1) - 1. For example, an 8-bit signed integer ranges from -128 to 127.

Floating-Point Representation

Most real numbers cannot be represented exactly in a computer due to limited memory. Instead, they are stored using the IEEE 754 floating-point standard, which represents numbers in the form: \[x = (-1)^s \times M \times 2^E\]

where:
- s is the sign bit (0 for positive, 1 for negative).
- M (the mantissa) stores the significant digits.
- E (the exponent) determines the scale of the number.
How the Mantissa and Exponent Are Stored and Interpreted

The mantissa (also called the significand) and exponent are stored in a structured manner to ensure a balance between precision and range.
- Mantissa (Significand): The mantissa represents the significant digits of the number. In IEEE 754, the mantissa is stored in normalized form, meaning that the leading bit is always assumed to be 1 (implicit bit) and does not need to be stored explicitly. This effectively provides an extra bit of precision.
- Exponent: The exponent determines the scaling factor for the mantissa. It is stored using a bias system to accommodate both positive and negative exponents.
  - In single precision (32-bit): The exponent uses 8 bits with a bias of 127. This means the stored exponent value is E + 127.
  - In double precision (64-bit): The exponent uses 11 bits with a bias of 1023. The stored exponent value is E + 1023.
For example, the decimal number 5.75 is stored in IEEE 754 single precision as:
1. Convert to binary: 5.75 = 101.11_2
2. Normalize to scientific notation: 1.0111 × 2^2
3. Encode:
  - Sign bit: 0 (positive)
  - Exponent: 2 + 127 = 129 (binary: 10000001)
  - Mantissa: 01110000000000000000000 (without the leading 1)
Final representation in binary: 0 10000001 01110000000000000000000

Special Floating-Point Values: Inf and NaN

IEEE 754 also defines special representations for infinite values and undefined results:
- Infinity (Inf): This occurs when a number exceeds the largest representable value. It is represented by setting the exponent to all 1s and the mantissa to all 0s:
  - Positive infinity: 0 11111111 00000000000000000000000
  - Negative infinity: 1 11111111 00000000000000000000000
- Not-a-Number (NaN): This is used to represent undefined results such as 0/0 or sqrt(-1). It is identified by an exponent of all 1s and a nonzero mantissa:
  - NaN: x 11111111 ddddddddddddddddddddddd (where x is the sign bit and d is any nonzero value in the mantissa)
Subnormal Numbers

Subnormal numbers (also called denormalized numbers) are a special category of floating-point numbers used to represent values that are too small to be stored in the normal format. They help address the issue of underflow, where very small numbers would otherwise be rounded to zero.

Why Are Subnormal Numbers Needed?

In standard IEEE 754 floating-point representation, the smallest normal number occurs when the exponent is at its minimum allowed value. However, values smaller than this minimum would normally be rounded to zero, causing a loss of precision in numerical computations. To mitigate this, IEEE 754 defines subnormal numbers, which allow for a gradual reduction in precision rather than an abrupt transition to zero.

How Are Subnormal Numbers Represented?

A normal floating-point number follows the form: \[x = (-1)^s \times (1 + M) \times 2^E\]

where 1 + M represents the implicit leading bit (always 1 for normal numbers), and E is the exponent.

For subnormal numbers, the exponent is set to the smallest possible value (E = 1 - bias), and the leading 1 in the mantissa is no longer assumed. Instead, the number is stored as: \[x = (-1)^s \times M \times 2^{1 – \text{bias}}\]

This means subnormal numbers provide a smooth transition from the smallest normal number to zero, reducing sudden underflow errors.

Example of a Subnormal Number

In IEEE 754 single-precision (32-bit) format:
- The smallest normal number occurs when E = 1 (after subtracting bias: E - 127 = -126).
- The next smaller numbers are subnormal, where E = 0, and the mantissa gradually reduces towards zero.
For example, a subnormal number with a small mantissa could look like:
```
0 00000000 00000000000000000000001
```
This represents a very small positive number, much closer to zero than any normal number.

Limitations of Subnormal Numbers
- They have reduced precision, as the leading 1 bit is missing.
- Operations involving subnormal numbers are often slower on some hardware due to special handling.
- In extreme cases, they may still lead to precision loss in calculations.
Precision and Limitations

Floating-point representation allows for a vast range of values, but it comes with limitations:
- Finite Precision: Only a finite number of real numbers can be represented.
- Rounding Errors: Some numbers (e.g., 0.1 in binary) cannot be stored exactly, leading to small inaccuracies.
- Underflow and Overflow: Extremely small numbers may be rounded to zero (underflow), while extremely large numbers may exceed the maximum representable value (overflow).
Example: Floating-Point Approximation

Consider storing 0.1 in a 32-bit floating-point system. Its binary representation is repeating, meaning it must be truncated, leading to a slight approximation error. This small error can propagate in calculations, affecting numerical results.

Conclusion

Understanding how numbers are represented in computers is crucial in computational physics and numerical methods. In the next post, I will explore sources of numerical errors, including truncation and round-off errors, and how they impact computations.
13 March 2025

Tag: floating point

Sources of Numerical Errors (Round-Off, Truncation, Stability)

Round-Off Errors

Causes of Round-Off Errors

Example of Round-Off Error

Catastrophic Cancellation

Truncation Errors

Causes of Truncation Errors

Example of Truncation Error

Numerical Stability

Causes of Numerical Instability

Example of Numerical Instability: Iterative Methods

Conclusion

Representation of Numbers in Computers

Integer Representation

The Binary System

Unsigned Integers

Signed Integers and Two’s Complement

Floating-Point Representation

How the Mantissa and Exponent Are Stored and Interpreted

Special Floating-Point Values: Inf and NaN

Subnormal Numbers

Why Are Subnormal Numbers Needed?

How Are Subnormal Numbers Represented?

Example of a Subnormal Number

Limitations of Subnormal Numbers

Precision and Limitations

Example: Floating-Point Approximation

Conclusion

Special Floating-Point Values: `Inf` and `NaN`