Floating-point data is used to represent values whose magnitudes are too great or too small to be represented by fixed-point data. It may also be used to minimize precision loss in calculations where terms differ considerably in magnitude.
Unexpected results may appear when using floating-point values. The limited precision of floating-point variables, combined with the fact that they cannot represent most fractional values exactly, can produce behavior in a program that seems strange. Furthermore, the floating-point implementation on many machines carries more precision in intermediate results than is strictly required by the rules of PL/I. Also, floating-point variables and constants are usually limited to only one or two actual internal precisions. For example, most implementations of the IEEE floating-point standard provide only "short" (23-bit mantissa) and "long" (52-bit mantissa) floating-point values. Finally, the exact results may vary depending on the level of program optimization.
For example, consider the following program:
SAMPLE: PROCEDURE OPTIONS (MAIN);
   DECLARE (F,G) FLOAT BINARY (23);
   F = 3111;
   G = .14;
   IF F + G ^= 3111.14 THEN
      PUT LIST ('UNEQUAL RESULT');
END SAMPLE;
This program will actually print the UNEQUAL RESULT message when run on many machines, because the constants .14 and 3111.14 are not precisely representable as floating-point binary numbers. The assignment of .14 to G puts about 23 bits of precision of the constant into the variable. On most implementations, the sum is computed with an actual precision of 52 or more bits, preserving the 23 bits of precision that apply to the fractional part. However, that sum is being compared to the constant 3111.14. The constant will have 23 or 52 bits of precision. (In fact, a strict implementation of the rules of the language requires only that it have 20 bits of precision; for more information, see the chapter Data Type Conversions.) If 3111.14 is represented as a short floating-point number, 14 of the 23 bits of precision are taken up to represent 3111, and hence only 9 bits of the fraction appear in the constant. The comparison with the more precisely computed sum therefore fails. If the constant is a long floating-point number, 38 of the 52 precision bits will be fractional and the comparison will still fail.
One attempt to fix this might be to change the program as follows:
SAMPLE: PROCEDURE OPTIONS (MAIN);
   DECLARE (F, G, H) FLOAT BINARY (23),
           RESULT FLOAT BINARY (23) STATIC INITIAL (3111.14);
   F = 3111;
   G = .14;
   H = F + G;
   IF H ^= RESULT THEN
      PUT LIST ('UNEQUAL RESULT');
END SAMPLE;
This appears to truncate the sum to a "real" 23 bits of precision and to compare it to a constant of known precision, and it will normally work. However, in the presence of optimization, the value of F + G may be used directly, without its precision being reduced by storing it into H and reloading it.
Thus, as a general rule, comparing floating-point numbers for equality and inequality does not work reliably. Instead, check for equality using the following method:
ABS(FIRST-SECOND) < EPSILON
where the constant EPSILON must be chosen considering the goal of the application.
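For example, the comparison from the programs above might be rewritten as follows. This is a minimal sketch; the names and the value of EPSILON are illustrative only, and the tolerance must reflect the magnitudes involved (short floating-point values near 3111 are spaced roughly .0002 apart, so EPSILON must be comfortably larger than that here):

COMPARE: PROCEDURE OPTIONS (MAIN);
   DECLARE (F, G) FLOAT BINARY (23);
   DECLARE TARGET FLOAT BINARY (23) STATIC INITIAL (3111.14);
   DECLARE EPSILON FLOAT BINARY (23) STATIC INITIAL (1E-2);
   F = 3111;
   G = .14;
   /* Test for approximate equality rather than using ^= */
   IF ABS((F + G) - TARGET) < EPSILON THEN
      PUT LIST ('EQUAL WITHIN EPSILON');
END COMPARE;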
Floating-point numbers consist of a mantissa m, a base b, and an exponent e. A floating-point number is represented in the following format:
m*b**e
The mantissa m is a fraction containing at least p digits, where p is the declared precision. The value of b and the possible range of e are defined by each implementation (for details of this implementation, refer to your Open PL/I User's Guide); however, if the base is binary, m contains the equivalent of at least p binary digits, and if the base is decimal, m contains the equivalent of at least p decimal digits. For example:
DECLARE X FLOAT BINARY(23);
DECLARE Y FLOAT DECIMAL(7);
In this example, the values of X are floating-point numbers whose mantissa contains the equivalent of at least 23 binary digits. The values of Y are floating-point numbers whose mantissa contains the equivalent of at least 7 decimal digits.
The representation of floating-point values in storage also depends on the implementation. An implementation may represent the mantissa in any base that it chooses, provided that it contains an equivalent of at least p binary or p decimal digits. On some computers, decimal floating-point values are represented by using a decimal mantissa, while other implementations use a binary or hexadecimal mantissa for all floating-point numbers.
Open PL/I represents floating-point decimal using a decimal mantissa and decimal exponent, and represents floating-point binary using a binary mantissa and binary exponent. For more information, refer to your Open PL/I User's Guide.
Because floating-point values may be represented in either base, and because excess digits may be lost during calculations, floating-point values may be approximate. However, any integer value that is converted to floating-point and converted back to integer retains its original value, provided the integer does not exceed the precision of the mantissa. Likewise, the floating-point operations of addition, subtraction, and multiplication, when performed on integer values within that precision, produce exact integer values.
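The following minimal sketch illustrates this property (the variable names and values are illustrative only):

INTDEMO: PROCEDURE OPTIONS (MAIN);
   DECLARE I FIXED BINARY (31);
   DECLARE F FLOAT BINARY (52);
   I = 123456;
   F = I;               /* integer converted to floating-point     */
   F = F * 2 - I;       /* +, -, and * on integer values are exact */
   I = F;               /* converting back retains the exact value */
   PUT SKIP LIST (I);   /* prints 123456                           */
END INTDEMO;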
Floating-point constants are written as fixed-point constants followed by an exponent, as shown in the following examples:
5E+02 4.5E1 100E-04 .001E-04 0E0
Floating-point constants have a decimal precision of p, where p is the number of digits in the fixed-point constant. For example, 4.5E1 has a precision of 2.
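Applying this rule to the constants shown above (the comments are explanatory additions, not part of the constants):

5E+02      /* p = 1: one digit, 5        */
4.5E1      /* p = 2: two digits, 4 and 5 */
100E-04    /* p = 3                      */
.001E-04   /* p = 3: leading zeros count */
0E0        /* p = 1                      */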
When fixed-point constants such as 1.5 are used in operations with floating-point values, they should be written with an exponent to avoid run-time conversion of fixed-point decimal values to floating-point.
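For example, assuming a floating-point variable X, the first assignment below avoids the conversion that the second one may require:

DECLARE X FLOAT BINARY (52);
X = X * 1.5E0;   /* floating-point constant: no conversion needed */
X = X * 1.5;     /* fixed-point decimal constant: may be converted */
                 /* to floating-point at run time                  */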
If floating-point values are converted to character strings or bit strings, the length of the resulting string is determined by p, not by the actual value. For a discussion of conversion rules, see the chapter Data Type Conversions.
Caution is necessary when using floating-point constants in arithmetic expressions. Floating-point constants are considered to be of type Float Decimal, and the rules for the precision of arithmetic results specify that the precision of the result of a floating-point operation is the maximum of the precisions of the operands. Consider the following example:
SAMPLE2: PROCEDURE OPTIONS(MAIN);
   DECLARE X FLOAT BIN(52);
   X = 1E0 + .4999E0;
   PUT SKIP LIST(X);
END SAMPLE2;
The answer printed by this program is, surprisingly, 1.500000000000000E+000. This is because the two constants are Float Decimal with precisions of 1 and 4, respectively. The addition is performed and the result is converted to Float Decimal(4); the intermediate result of 1.4999 is rounded to 4 decimal digits, and the result is converted to Float Binary and stored in X. An attempt to correct this might be to change the expression to read
X = BINARY(1E0) + BINARY(.4999E0);
This does the computation in Float Binary. However, this converts the first constant to Float Bin(4) and the second to Float Bin(14), and does the computation in short floating-point on most machines. The short floating-point result is lengthened to Float Bin(52) and is then stored in X, giving an answer of 1.499900013208389E+000. The correct fix is to use the two-argument form of the BINARY built-in function:
X = BINARY(1E0,52) + BINARY(.4999E0,52);
This produces long floating-point constants and the computation is done with maximal precision, printing 1.499900000000000E+000. Another solution would be to rewrite the constants using trailing zeros to indicate a more precise constant:
X = 1.0000000000000000E0 + .4999000000000000E0;
This also gives 1.499900000000000E+000, since the computation is done in large enough precision and the result is converted directly to a high-precision floating-point value.
In general, when using floating-point constants in arithmetic expressions with floating-point variables, the precision of the result can be controlled by using the FLOAT and BINARY built-in functions on the operands. The results of operations on two floating-point constants may not be what is expected.
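For example, the following sketch forces both operands of an expression to a known precision before the operation is performed (the declarations are illustrative):

DECLARE (A, B) FLOAT DECIMAL (6);
DECLARE X FLOAT BINARY (52);
/* Convert both operands to long floating-point binary */
/* before the addition is performed                    */
X = BINARY(A, 52) + BINARY(B, 52);
/* FLOAT may be used similarly to specify the precision */
/* while keeping the operands in their original base    */
X = FLOAT(A, 15) + FLOAT(B, 15);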
Exponentiation, MOD, and all transcendental function operations (including SQRT) on floating-point decimal data are actually performed using floating-point binary operations. Thus, the ranges of the operands and the results of these operations are limited to the ranges for floating-point binary numbers.
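For example, in the following sketch the square root is computed with floating-point binary operations even though its operand is declared decimal (the names are illustrative):

ROOT: PROCEDURE OPTIONS (MAIN);
   DECLARE (D, E) FLOAT DECIMAL (15);
   E = 2;
   /* Computed in floating-point binary; the operand and */
   /* result are limited to the floating-point binary    */
   /* range                                              */
   D = SQRT(E);
   PUT SKIP LIST (D);
END ROOT;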