Decimal number to Fixed-point & Floating-point Single/Double binary precision

213.3

213 -> ‘11010101’

0.3 -> ?

0.3 -> ‘0.01 0011 0011 0011 …’ (the group 0011 repeats indefinitely, so 0.3 has no finite binary expansion)
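The two conversions above can be sketched in Python as follows. This is a minimal illustration, not part of the original post: the function names `int_to_binary` and `frac_to_binary` are made up for this example, and the fraction is passed as an exact numerator/denominator pair (3/10 for 0.3) so that no floating-point rounding creeps into the digit generation.

```python
def int_to_binary(n: int) -> str:
    """Binary digits of a non-negative integer via repeated division by 2."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # the remainder is the next least-significant bit
        n //= 2
    return "".join(reversed(digits))


def frac_to_binary(numerator: int, denominator: int, max_bits: int) -> str:
    """First max_bits binary digits of numerator/denominator, for 0 <= value < 1."""
    bits = []
    for _ in range(max_bits):
        numerator *= 2                 # multiply the fraction by 2
        if numerator >= denominator:   # the integer part is 1: emit 1 and remove it
            bits.append("1")
            numerator -= denominator
        else:                          # the integer part is 0
            bits.append("0")
    return "".join(bits)


print(int_to_binary(213))         # 11010101
print(frac_to_binary(3, 10, 14))  # 01001100110011
```

The second result is the fractional part only: read it as 0.01 0011 0011 0011, with the 0011 group continuing for as many bits as requested.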

Decimal number to fixed-point binary precision

From the above information, 213.3 can be represented as follows.

In fixed point, numbers are represented with a fixed number of digits after (and sometimes before) the binary point. For example, fixed<11,3> denotes an 11-bit fixed-point number whose 3 rightmost bits are fractional. With 8 integer bits and 3 fractional bits, 213.3 becomes the bit pattern 11010101.010. That pattern is exactly 213.25, because the remaining fractional bits of 0.3 are dropped.
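A minimal sketch of the fixed<11,3> conversion, assuming an unsigned format and truncation of the extra fractional bits (the original does not say whether it truncates or rounds). The helper names `to_fixed` and `from_fixed` are hypothetical.

```python
def to_fixed(value: float, total_bits: int, frac_bits: int) -> str:
    """Unsigned fixed-point bit pattern of value, truncating extra fractional bits."""
    raw = int(value * (1 << frac_bits))  # scale by 2^frac_bits and drop the remainder
    if value < 0 or raw >= (1 << total_bits):
        raise ValueError("value does not fit in the requested unsigned format")
    bits = format(raw, f"0{total_bits}b")
    int_bits = total_bits - frac_bits
    return bits[:int_bits] + "." + bits[int_bits:]


def from_fixed(pattern: str) -> float:
    """Decode an unsigned 'iii.fff' bit pattern back to a real value."""
    int_part, frac_part = pattern.split(".")
    return int(int_part + frac_part, 2) / (1 << len(frac_part))


pattern = to_fixed(213.3, 11, 3)
print(pattern)              # 11010101.010
print(from_fixed(pattern))  # 213.25
```

Decoding the pattern shows the quantization error directly: three fractional bits give a resolution of 0.125, so 213.3 is stored as 213.25.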

How to represent this decimal number in IEEE 754 32-bit (single) floating-point notation?

In floating-point single (32-bit) or double (64-bit) precision, the number is represented with a sign, a mantissa and an exponent. The binary point can float relative to the significant digits of the number, and the exponent records where it sits.

Decimal number to floating-point single (32-bits) binary precision

For the 32-bit floating-point notation, we need to express the number in the form:

  • 1 sign bit, 8 exponent bits, 23 fraction bits

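Move the binary point of the binary representation 7 places to the left, so that a single 1 remains in front of it. This writes the number in scientific notation with a mantissa (which determines precision) and an exponent (which determines range):

11010101.0100110011… = 1.10101010100110011… x 2^7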
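The normalization step can be checked with Python's `math.frexp`, which returns a mantissa in [0.5, 1) rather than [1, 2), so the sketch below rescales by one bit. The wrapper name `normalize` is an assumption of this example, and it only handles positive values.

```python
import math
from typing import Tuple


def normalize(value: float) -> Tuple[float, int]:
    """Return (significand, exponent) with 1 <= significand < 2, for value > 0."""
    if value <= 0:
        raise ValueError("this sketch only handles positive values")
    m, e = math.frexp(value)  # value == m * 2**e with 0.5 <= m < 1
    return m * 2, e - 1       # rescale so the significand has the form 1.xxx


significand, exponent = normalize(213.3)
print(significand, exponent)  # 1.66640625 7
```

The exponent 7 matches the number of places the binary point was moved, and 1.66640625 is the decimal value of the mantissa 1.1010101010011…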

With an 8-bit exponent field, the largest integer we can store is 2^8 - 1 = 255.

We also want negative exponents, in order to represent very small numbers. Instead of using 2’s complement, IEEE 754 biases the exponent: a fixed offset is added to the true exponent before it is stored.

Exponent bias (ExpBias) = 2^(K-1) - 1, where K is the number of exponent bits.

With 8 exponent bits, K = 8, so ExpBias = 2^7 - 1 = 127.

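The exponent found in the normalization step is E = 7, so the stored (biased) exponent is E’ = ExpBias + E = 127 + 7 = 134 (decimal), which is ‘10000110’ in binary.

Note: the bias is always added, E’ = E + ExpBias, whatever the sign of the number or of the exponent. The sign of the number is stored separately in the sign bit (0 for positive, 1 for negative). Since 213.3 is positive, its sign bit is 0.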
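The bias arithmetic can be sketched as below. The function names are hypothetical. The sketch adds the bias to the exponent of every normal number, because the sign of the value is carried only by the separate sign bit. It also rejects the all-zeros and all-ones exponent fields, which IEEE 754 reserves for zero/subnormals and infinity/NaN.

```python
def exponent_bias(exponent_bits: int) -> int:
    """Bias for an exponent field of the given width: 2^(K-1) - 1."""
    return (1 << (exponent_bits - 1)) - 1


def biased_exponent(exponent: int, exponent_bits: int) -> str:
    """Stored exponent field, as a bit string, for a normal number."""
    stored = exponent + exponent_bias(exponent_bits)
    # Fields 0 and all-ones are reserved, so normal numbers must fall strictly between.
    if not 0 < stored < (1 << exponent_bits) - 1:
        raise ValueError("exponent out of range for a normal number")
    return format(stored, f"0{exponent_bits}b")


print(exponent_bias(8), exponent_bias(11))  # 127 1023
print(biased_exponent(7, 8))                # 10000110
print(biased_exponent(-2, 8))               # 01111101
```

The last line shows a negative exponent (for example 0.3 = 1.0011… x 2^-2): the stored field is 127 - 2 = 125, still a non-negative integer.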

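The number in IEEE754 32-bit floating point notation becomes (with round-to-nearest, the last fraction bit is rounded up because the expansion of 0.3 never terminates):

  • sign: 0
  • exponent: 10000110
  • fraction: 10101010100110011001101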
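One way to inspect the single-precision bit pattern is to let Python's standard `struct` module do the packing and then slice the 32 bits into their three fields. The helper name `float32_fields` is an assumption of this sketch. `struct` rounds to the nearest representable value, so the last fraction bit may differ from a hand calculation that simply truncates.

```python
import struct
from typing import Tuple


def float32_fields(value: float) -> Tuple[str, str, str]:
    """Split the IEEE 754 single-precision encoding into (sign, exponent, fraction)."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))  # reinterpret the 4 bytes as an integer
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]  # 1 sign bit, 8 exponent bits, 23 fraction bits


sign, exponent, fraction = float32_fields(213.3)
print(sign, exponent, fraction)
# 0 10000110 10101010100110011001101
```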

How to represent this decimal number in IEEE 754 64-bit (double) floating-point notation?

Decimal number to floating-point double (64-bits) binary precision

For the 64-bit floating-point notation, we need to express the number in the form:

  • 1 sign bit, 11 exponent bits, 52 fraction bits

With 11 exponent bits, K = 11, so the exponent bias for the double-precision format is 2^10 - 1 = 1023.

The exponent is again E = 7, so the stored (biased) exponent is E’ = ExpBias + E = 1023 + 7 = 1030 (decimal), which is ‘10000000110’ in binary. The number 213.3 is positive, so the sign bit is 0.

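The number in IEEE754 64-bit floating point notation becomes (with round-to-nearest, the last fraction bits are rounded up from 1001 to 1010):

  • sign: 0
  • exponent: 10000000110
  • fraction: 1010 1010 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010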
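The whole procedure (normalize, round the mantissa, bias the exponent) can also be written out from first principles. The sketch below is a simplified encoder under stated assumptions: positive normal numbers only, round-to-nearest-even, and exact rational arithmetic via `fractions.Fraction`. The name `encode_ieee` and its parameters are hypothetical, and it is meant as an illustration of the steps rather than a complete IEEE 754 implementation.

```python
from fractions import Fraction


def encode_ieee(decimal_text: str, exp_bits: int, frac_bits: int) -> str:
    """Encode a positive decimal as 'sign exponent fraction' (normal numbers only)."""
    value = Fraction(decimal_text)
    if value <= 0:
        raise ValueError("this sketch only handles positive values")
    # Find E such that 1 <= value / 2^E < 2 (the normalization step).
    exponent = 0
    while value / Fraction(2) ** exponent >= 2:
        exponent += 1
    while value / Fraction(2) ** exponent < 1:
        exponent -= 1
    # Scale the significand so that frac_bits fractional bits become an integer.
    scaled = value / Fraction(2) ** exponent * (1 << frac_bits)
    significand = round(scaled)              # Fraction rounds halves to even
    if significand == 1 << (frac_bits + 1):  # rounding carried up to 2.0
        significand >>= 1
        exponent += 1
    stored = exponent + (1 << (exp_bits - 1)) - 1  # add the bias
    if not 0 < stored < (1 << exp_bits) - 1:
        raise ValueError("outside the normal range of this format")
    fraction = significand - (1 << frac_bits)      # drop the implicit leading 1
    return f"0 {stored:0{exp_bits}b} {fraction:0{frac_bits}b}"


encoded = encode_ieee("213.3", 11, 52)
print(encoded)
print(hex(int(encoded.split()[2], 2)))  # 0xaa9999999999a
```

Calling `encode_ieee("213.3", 8, 23)` gives the single-precision fields with the same code, since only the field widths change between the two formats.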

1153.125

1153 -> ‘10010000001’

.125 -> 0.001 (0*0.5 + 0*0.25 + 1*0.125)

Decimal number to fixed-point binary precision

Putting the two parts together, 1153.125 is represented exactly by the bit pattern 10010000001.001.

This is a 14-bit fixed-point number (fixed<14,3>) whose 3 rightmost bits are fractional.
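Whether a decimal value fits exactly in a binary fixed-point format depends on its denominator being a power of two. The sketch below, with the hypothetical helper `exact_frac_bits`, uses `fractions.Fraction` to report how many fractional bits are needed, or `None` when no finite number of bits is enough.

```python
from fractions import Fraction
from typing import Optional


def exact_frac_bits(decimal_text: str) -> Optional[int]:
    """Fractional bits needed to store the decimal exactly, or None if impossible."""
    denominator = Fraction(decimal_text).denominator  # fraction in lowest terms
    if denominator & (denominator - 1) != 0:          # not a power of two
        return None
    return denominator.bit_length() - 1               # log2 of the denominator


print(exact_frac_bits("1153.125"))  # 3
print(exact_frac_bits("213.3"))     # None
```

1153.125 is 9225/8, so three fractional bits are enough. 213.3 is 2133/10, and the factor 5 in the denominator is why its binary expansion never ends.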

How to represent this decimal number in IEEE 754 32-bit (single) floating-point notation?

Decimal number to floating-point single (32-bits) binary precision

For the 32-bit floating-point notation, we need to express the number in the form:

  • 1 sign bit, 8 exponent bits, 23 fraction bits

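Move the binary point of the fixed-point representation 10 places to the left, so that a single 1 remains in front of it. This writes the number in scientific notation with a mantissa (which determines precision) and an exponent (which determines range):

10010000001.001 = 1.0010000001001 x 2^10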

With 8 exponent bits, K = 8, so the exponent bias is 2^7 - 1 = 127.

The exponent found in the normalization step is E = 10, so the stored (biased) exponent is E’ = ExpBias + E = 127 + 10 = 137 (decimal), which is ‘10001001’ in binary. The number 1153.125 is positive, so the sign bit is 0.

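The number in IEEE754 32-bit floating point notation becomes (the 13 mantissa bits after the leading 1, padded with zeros to 23 bits):

  • sign: 0
  • exponent: 10001001
  • fraction: 00100000010010000000000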
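The hand-derived fields can be checked in the other direction: assemble the 32 bits from sign, exponent and fraction, then let `struct` decode them as a float. The function name `assemble_float32` is an assumption of this sketch. Fraction bits are padded with zeros on the right, matching how the mantissa is laid out after the implicit leading 1.

```python
import struct


def assemble_float32(sign: int, exponent_field: str, fraction_field: str) -> float:
    """Build a float32 from its three fields and decode it to a Python float."""
    if sign not in (0, 1):
        raise ValueError("sign must be 0 or 1")
    if len(exponent_field) != 8 or len(fraction_field) > 23:
        raise ValueError("expected 8 exponent bits and at most 23 fraction bits")
    bits = str(sign) + exponent_field + fraction_field.ljust(23, "0")  # pad fraction on the right
    (value,) = struct.unpack(">f", struct.pack(">I", int(bits, 2)))
    return value


print(assemble_float32(0, "10001001", "0010000001001"))  # 1153.125
```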

How to represent this decimal number in IEEE 754 64-bit (double) floating-point notation?

Decimal number to floating-point double (64-bits) binary precision

For the 64-bit floating-point notation, we need to express the number in the form:

  • 1 sign bit, 11 exponent bits, 52 fraction bits

With 11 exponent bits, K = 11, so the exponent bias for the double-precision format is 2^10 - 1 = 1023.

The exponent is again E = 10, so the stored (biased) exponent is E’ = ExpBias + E = 1023 + 10 = 1033 (decimal), which is ‘10000001001’ in binary. The number 1153.125 is positive, so the sign bit is 0.

The number in IEEE754 64-bit floating point notation becomes:

  • sign: 0
  • exponent: 10000001001
  • fraction: 0010000001001 followed by 39 zeros (52 bits in total)
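The difference between the two worked examples shows up when the stored value is compared with the decimal literal. The sketch below (the helper name `stored_error` is hypothetical) round-trips a value through `struct` in single ("f") or double ("d") precision and returns the exact error as a `Fraction`.

```python
import struct
from fractions import Fraction


def stored_error(decimal_text: str, fmt: str) -> Fraction:
    """Exact difference between the stored binary value and the decimal literal.

    fmt is a struct format code: "f" for single precision, "d" for double.
    """
    (stored,) = struct.unpack(">" + fmt, struct.pack(">" + fmt, float(decimal_text)))
    return Fraction(stored) - Fraction(decimal_text)  # both sides are exact rationals


for text in ("1153.125", "213.3"):
    print(text, float(stored_error(text, "f")), float(stored_error(text, "d")))
```

1153.125 is stored with zero error in both formats, because its 13 mantissa bits fit easily in 23 or 52 fraction bits. 213.3 carries a small rounding error in both, and the error is much smaller in double precision.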