Representation of Numbers in Computers

Computers handle numbers differently from how we do in mathematics. While we are accustomed to exact numerical values, computers must represent numbers using a finite amount of memory. This limitation leads to approximations, which can introduce errors in numerical computations. In this post, I will explain how numbers are stored in computers, focusing on integer and floating-point representations.

Integer Representation

Integers are stored exactly in computers using binary representation. Each integer occupies a fixed number of bits, commonly 8, 16, 32, or 64. The two primary representations are unsigned integers and signed integers in two's complement, both of which build on the binary system reviewed next.

The Binary System

Computers operate using binary (base-2) numbers, meaning they represent all values using only two digits: 0 and 1. Each digit in a binary number is called a bit. The value of a binary number is computed similarly to decimal (base-10) numbers but using powers of 2 instead of powers of 10.

For example, the binary number 1101 represents: \[(1 \times 2^3)+(1 \times 2^2)+(0 \times 2^1)+(1 \times 2^0)=8+4+0+1=13\]

Similarly, the decimal number 9 is represented in binary as 1001.
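A quick way to verify these conversions is with a few lines of Python (binary_to_decimal is a helper written here for illustration; bin is Python's built-in):

```python
def binary_to_decimal(bits: str) -> int:
    """Evaluate a binary string positionally, as in the expansion above."""
    value = 0
    for bit in bits:
        value = value * 2 + int(bit)  # shift one binary place, then add the digit
    return value

print(binary_to_decimal("1101"))  # 13
print(bin(9))                     # '0b1001', Python's built-in decimal-to-binary
```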

Unsigned Integers

Unsigned integers can only represent non-negative values. An n-bit unsigned integer can store values from 0 to 2^n - 1. For example, an 8-bit unsigned integer can represent values from 0 to 255 (2^8 - 1).
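As a quick sanity check of these bounds in Python (unsigned_max is an illustrative helper; real hardware wraps on overflow, which the modulo mimics):

```python
def unsigned_max(n: int) -> int:
    """Largest value an n-bit unsigned integer can hold: 2**n - 1."""
    return 2**n - 1

print(unsigned_max(8))    # 255
print(unsigned_max(16))   # 65535

# Unsigned overflow wraps around modulo 2**n:
print((255 + 1) % 2**8)   # 0
```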

Signed Integers and Two’s Complement

Signed integers can represent both positive and negative numbers. The most common way to store signed integers is two's complement, which simplifies arithmetic operations and ensures a unique representation of zero.

In two’s complement representation:

  • The most significant bit (MSB) acts as the sign bit (0 for positive, 1 for negative).
  • Negative numbers are stored by taking the binary representation of their absolute value, inverting the bits, and adding 1.

For example, in an 8-bit system:

  • +5 is represented as 00000101
  • -5 is obtained by:
    1. Writing 5 in binary: 00000101
    2. Inverting the bits: 11111010
    3. Adding 1: 11111011

Thus, -5 is stored as 11111011.
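The invert-and-add-1 procedure can be sketched directly in Python (twos_complement_negate is a hypothetical helper name):

```python
def twos_complement_negate(bits: str) -> str:
    """Negate a fixed-width binary string, following the steps above."""
    n = len(bits)
    inverted = ''.join('1' if b == '0' else '0' for b in bits)  # step 2: invert the bits
    return format((int(inverted, 2) + 1) % 2**n, f'0{n}b')      # step 3: add 1 (mod 2**n)

print(twos_complement_negate('00000101'))  # '11111011'  (-5)
print(twos_complement_negate('11111011'))  # '00000101'  (negating twice recovers +5)
```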

One of the key advantages of two’s complement is that subtraction can be performed as addition. For instance, computing 5 - 5 is the same as 5 + (-5), leading to automatic cancellation without requiring separate subtraction logic in hardware.

The range of an n-bit signed integer is from -2^(n-1) to 2^(n-1) - 1. For example, an 8-bit signed integer ranges from -128 to 127.
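Both the range formula and the subtraction-as-addition trick can be verified in Python; signed_range is an illustrative helper, and the modulo emulates the fixed-width wraparound that hardware performs automatically:

```python
def signed_range(n: int) -> tuple:
    """Range of an n-bit two's complement integer."""
    return -2**(n - 1), 2**(n - 1) - 1

print(signed_range(8))   # (-128, 127)

# 5 - 5 computed as 5 + (-5) in 8-bit two's complement:
a = 0b00000101           # +5
b = 0b11111011           # -5
print((a + b) % 2**8)    # 0: the carry out of the top bit is simply discarded
```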

Floating-Point Representation

Most real numbers cannot be represented exactly in a computer due to limited memory. Instead, they are stored using the IEEE 754 floating-point standard, which represents numbers in the form: \[x = (-1)^s \times M \times 2^E\]

where:

  • s is the sign bit (0 for positive, 1 for negative).
  • M (the mantissa) stores the significant digits.
  • E (the exponent) determines the scale of the number.

How the Mantissa and Exponent Are Stored and Interpreted

The mantissa (also called the significand) and exponent are stored in a structured manner to ensure a balance between precision and range.

  • Mantissa (Significand): The mantissa represents the significant digits of the number. In IEEE 754, the mantissa is stored in normalized form, meaning that the leading bit is always assumed to be 1 (implicit bit) and does not need to be stored explicitly. This effectively provides an extra bit of precision.
  • Exponent: The exponent determines the scaling factor for the mantissa. It is stored using a bias system to accommodate both positive and negative exponents.
    • In single precision (32-bit): The exponent uses 8 bits with a bias of 127. This means the stored exponent value is E + 127.
    • In double precision (64-bit): The exponent uses 11 bits with a bias of 1023. The stored exponent value is E + 1023.

For example, the decimal number 5.75 is stored in IEEE 754 single precision as:

  1. Convert to binary: 5.75 = 101.11_2
  2. Normalize to scientific notation: 1.0111 × 2^2
  3. Encode:
    • Sign bit: 0 (positive)
    • Exponent: 2 + 127 = 129 (binary: 10000001)
    • Mantissa: 01110000000000000000000 (without the leading 1)

Final representation in binary: 0 10000001 01110000000000000000000
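This encoding can be checked with Python's standard struct module, which lets us pack a value as a single-precision float and read back its raw bits (float_bits is a helper defined here for illustration):

```python
import struct

def float_bits(x: float) -> str:
    """Return the 32 IEEE 754 single-precision bits of x as 'sign exponent mantissa'."""
    (raw,) = struct.unpack('>I', struct.pack('>f', x))  # reinterpret float bits as uint32
    b = format(raw, '032b')
    return f'{b[0]} {b[1:9]} {b[9:]}'

print(float_bits(5.75))  # 0 10000001 01110000000000000000000
```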

Special Floating-Point Values: Inf and NaN

IEEE 754 also defines special representations for infinite values and undefined results:

  • Infinity (Inf): This occurs when a number exceeds the largest representable value. It is represented by setting the exponent to all 1s and the mantissa to all 0s:
    • Positive infinity: 0 11111111 00000000000000000000000
    • Negative infinity: 1 11111111 00000000000000000000000
  • Not-a-Number (NaN): This is used to represent undefined results such as 0/0 or sqrt(-1). It is identified by an exponent of all 1s and a nonzero mantissa:
    • NaN: x 11111111 ddddddddddddddddddddddd (where x is the sign bit and d is any nonzero value in the mantissa)
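These special bit patterns can be inspected the same way; this sketch defines a small float_bits helper for illustration:

```python
import math
import struct

def float_bits(x: float) -> str:
    """Return the 32 IEEE 754 single-precision bits of x as 'sign exponent mantissa'."""
    (raw,) = struct.unpack('>I', struct.pack('>f', x))
    b = format(raw, '032b')
    return f'{b[0]} {b[1:9]} {b[9:]}'

print(float_bits(math.inf))   # 0 11111111 00000000000000000000000
print(float_bits(-math.inf))  # 1 11111111 00000000000000000000000
print(float_bits(math.nan))   # exponent all 1s, nonzero mantissa (a quiet NaN)
print(math.isnan(math.nan))   # True; note that NaN compares unequal even to itself
```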

Subnormal Numbers

Subnormal numbers (also called denormalized numbers) are a special category of floating-point numbers used to represent values that are too small to be stored in the normal format. They help address the issue of underflow, where very small numbers would otherwise be rounded to zero.

Why Are Subnormal Numbers Needed?

In standard IEEE 754 floating-point representation, the smallest normal number occurs when the exponent is at its minimum allowed value. However, values smaller than this minimum would normally be rounded to zero, causing a loss of precision in numerical computations. To mitigate this, IEEE 754 defines subnormal numbers, which allow for a gradual reduction in precision rather than an abrupt transition to zero.

How Are Subnormal Numbers Represented?

A normal floating-point number follows the form: \[x = (-1)^s \times (1 + M) \times 2^E\]

where the 1 in 1 + M is the implicit leading bit (always present for normal numbers), M is the fractional part of the mantissa, and E is the exponent.

For subnormal numbers, the exponent is set to the smallest possible value (E = 1 - bias), and the leading 1 in the mantissa is no longer assumed. Instead, the number is stored as: \[x = (-1)^s \times M \times 2^{1 - \text{bias}}\]

This means subnormal numbers provide a smooth transition from the smallest normal number to zero, reducing sudden underflow errors.

Example of a Subnormal Number

In IEEE 754 single-precision (32-bit) format:

  • The smallest normal number occurs when the stored exponent field equals 1, giving an actual exponent of 1 - 127 = -126.
  • The next smaller numbers are subnormal: the stored exponent field is 0, and the mantissa gradually decreases towards zero.

For example, a subnormal number with a small mantissa could look like:

0 00000000 00000000000000000000001

This represents a very small positive number, much closer to zero than any normal number.
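Decoding that bit pattern with Python's standard struct module confirms its value, 2^-23 × 2^-126 = 2^-149:

```python
import struct

# Bit pattern: sign 0, exponent field all zeros, mantissa 000...001
raw = int('0' + '00000000' + '0' * 22 + '1', 2)
(x,) = struct.unpack('>f', struct.pack('>I', raw))  # reinterpret uint32 bits as float

print(x)             # the smallest positive single-precision subnormal, about 1.4e-45
print(x == 2**-149)  # True
```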

Limitations of Subnormal Numbers

  • They have reduced precision, as the leading 1 bit is missing.
  • Operations involving subnormal numbers are often slower on some hardware due to special handling.
  • In extreme cases, they may still lead to precision loss in calculations.

Precision and Limitations

Floating-point representation allows for a vast range of values, but it comes with limitations:

  • Finite Precision: Only a finite number of real numbers can be represented.
  • Rounding Errors: Some numbers (e.g., 0.1 in binary) cannot be stored exactly, leading to small inaccuracies.
  • Underflow and Overflow: Extremely small numbers may be rounded to zero (underflow), while extremely large numbers may exceed the maximum representable value (overflow).

Example: Floating-Point Approximation

Consider storing 0.1 in a 32-bit floating-point system. Its binary representation is repeating, meaning it must be truncated, leading to a slight approximation error. This small error can propagate in calculations, affecting numerical results.
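Python's floats are 64-bit doubles rather than 32-bit, but the same effect appears; the standard decimal module can display the exact value actually stored:

```python
from decimal import Decimal

# Decimal(0.1) shows the exact binary double nearest to 0.1, not 0.1 itself.
print(Decimal(0.1))

# The accumulated rounding errors are visible in simple arithmetic:
print(0.1 + 0.2 == 0.3)        # False
print(abs((0.1 + 0.2) - 0.3))  # tiny but nonzero
```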

Conclusion

Understanding how numbers are represented in computers is crucial in computational physics and numerical methods. In the next post, I will explore sources of numerical errors, including truncation and round-off errors, and how they impact computations.
