The UTF-8 encoding scheme is a variable-width Unicode encoding. Each valid code point is encoded using one to four 8-bit bytes. UTF-8 is a popular encoding scheme because it is backward-compatible with ASCII, it is endianness-independent, and it often provides a more compact representation of Unicode than UTF-16.
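The properties above can be checked outside COBOL. The following is a minimal sketch in Python, used here only for illustration and not part of Enterprise Developer; the sample characters and the helper names `utf8_len` and `utf16_len` are assumptions chosen for the example.

```python
# Sample characters chosen for illustration: they need 1, 2, 3 and 4 bytes in UTF-8.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def utf8_len(ch: str) -> int:
    """Number of bytes used by the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))

def utf16_len(ch: str) -> int:
    """Number of bytes used by the UTF-16 encoding (no byte-order mark)."""
    return len(ch.encode("utf-16-be"))

for ch in samples:
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len(ch)} bytes, UTF-16 {utf16_len(ch)} bytes")

# ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_compatible = "COBOL".encode("utf-8") == "COBOL".encode("ascii")
```

Note that U+20AC takes three bytes in UTF-8 but two in UTF-16, which is why UTF-8 is described as often, rather than always, more compact.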
Enterprise Developer provides native COBOL support for defining, comparing, and moving UTF-8 data.
UTF-8 data items can either be of fixed character length, or fixed byte length:
01 U1 PIC U(4).
01 U2 PIC UUUU.
Four bytes of storage are reserved for each character; therefore, each of the data items shown above requires 16 bytes of storage. Because UTF-8 is a variable-width encoding, not every character needs all four bytes, so a move operation does not necessarily use all of the reserved bytes; any unused bytes are padded with the UTF-8 space character, x'20'. If truncation is required during a move operation, it occurs on a character boundary.
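The storage, padding, and truncation rules described above can be modelled as follows. This is a sketch in Python, not Enterprise Developer code: the function `move_to_char_length_item` is hypothetical, and it assumes that a character is a single Unicode code point.

```python
def move_to_char_length_item(text: str, char_count: int) -> bytes:
    """Model a move into a fixed character-length item such as PIC U(char_count).

    Assumptions: 4 bytes of storage are reserved per character position,
    excess characters are dropped whole, and unused bytes are set to x'20'.
    """
    storage_size = 4 * char_count
    kept = text[:char_count]                 # truncate on a character boundary
    encoded = kept.encode("utf-8")
    return encoded.ljust(storage_size, b"\x20")

# Moving a longer string into a 4-character item keeps only the first 4 characters.
u1 = move_to_char_length_item("caf\u00e9 au lait", 4)
print(len(u1), u1)
```

Because the slice is taken on the decoded string rather than on the bytes, the two-byte encoding of U+00E9 is never split.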
01 U3 PIC U BYTE-LENGTH 24.
Again, because UTF-8 characters vary in length, the number of characters in the data item depends on the size of each character; however, it is always in the range [ceil(n/4), n], where n is the specified byte length.
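The bounds of that range can be worked through for the 24-byte example. The sketch below uses Python for illustration only; `char_count_bounds` is a hypothetical helper, and the two sample strings are assumptions chosen to hit each end of the range.

```python
import math

def char_count_bounds(byte_length: int) -> tuple:
    """Fewest and most characters that can exactly fill byte_length bytes."""
    return math.ceil(byte_length / 4), byte_length

low, high = char_count_bounds(24)     # the BYTE-LENGTH 24 example
print(low, high)

# Two ways of filling 24 bytes exactly:
all_wide = "\U0001F600" * 6           # six 4-byte characters
all_ascii = "A" * 24                  # twenty-four 1-byte characters
```

For a byte length that is not a multiple of four, such as 5, the lower bound is 2: one 4-byte character followed by one 1-byte character (for example, a padding space).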
There are two types of UTF-8 literal: basic and hexadecimal.
01 U4 PIC U VALUE u'lit-string'.
where lit-string is the literal value. If you specify any double-byte characters, these must be delimited with the shift-out and shift-in characters. Due to the variable-width nature of Unicode, the maximum number of characters possible within lit-string varies.
To include the text \uhhhh or \U00hhhhhh literally in a UTF-8 literal, escape the escape character (\) with a second backslash; for example, \\u00FF is not processed as a Unicode escape sequence.
01 U5 PIC U VALUE ux'hex-string'.
where hex-string contains at least 2 hexadecimal digits, each in the range 0-9, a-f, or A-F inclusive. Each group of two digits represents one byte of a UTF-8 encoded character, so a single character takes between one and four groups.
The sequence of bytes represented by hex-string is validated to ensure that it contains a valid sequence of UTF-8 bytes. If it does, this hexadecimal notation is stored as UTF-8 characters, and results in the content having the same meaning as a basic UTF-8 literal specifying the same characters.
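The kind of validation described above can be sketched as follows. This is Python for illustration only and does not reproduce the compiler's actual checks or messages; `decode_hex_literal` is a hypothetical helper, and the requirement for an even number of digits is an assumption inferred from the two-digit grouping.

```python
import string

def decode_hex_literal(hex_string: str) -> str:
    """Return the characters denoted by hex_string, or raise ValueError."""
    # Assumption: digits come in pairs, one pair per byte.
    if len(hex_string) < 2 or len(hex_string) % 2 != 0:
        raise ValueError("expected an even number of hexadecimal digits, at least 2")
    if any(ch not in string.hexdigits for ch in hex_string):
        raise ValueError("only 0-9, a-f and A-F are allowed")
    # The strict decoder rejects byte sequences that are not valid UTF-8
    # (UnicodeDecodeError is a subclass of ValueError).
    return bytes.fromhex(hex_string).decode("utf-8")

euro = decode_hex_literal("E282AC")   # the three bytes that encode U+20AC
print(euro == "\u20ac")
```

A truncated sequence such as E282, or a byte that can never appear in UTF-8 such as FF, is rejected by the decoder in this sketch.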
The following COBOL statements support the use of UTF-8 data items:
A UTF-8 comparison is a comparison between two operands of class UTF-8. If one of the operands is not of class UTF-8, it must be of class alphabetic, alphanumeric, or national, and it is converted to an item of class UTF-8 before the comparison.
During MERGE or SORT operations, comparisons are performed using a binary, byte-by-byte comparison, which produces the same order as a corresponding set of national strings representing the same Unicode code points (assuming all code points are taken from the Basic Multilingual Plane).
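The ordering claim, including its restriction to the Basic Multilingual Plane, can be checked with a short sketch. Python is used for illustration only; the key functions and sample strings are assumptions, and national strings are modelled here as big-endian UTF-16 so that byte order matches 16-bit code unit order.

```python
def utf8_key(s: str) -> bytes:
    """Sort key: the raw UTF-8 bytes, compared byte by byte."""
    return s.encode("utf-8")

def utf16_key(s: str) -> bytes:
    """Sort key: UTF-16 code units (big-endian, so bytes compare like code units)."""
    return s.encode("utf-16-be")

# Within the BMP, both keys give the same order.
bmp_strings = ["z", "\u00e9", "\u20ac", "A", "\uffee"]
same_order_in_bmp = sorted(bmp_strings, key=utf8_key) == sorted(bmp_strings, key=utf16_key)

# Outside the BMP the orders can differ: U+1F600 sorts after U+FFEE in UTF-8,
# but its UTF-16 surrogate pair (D83D DE00) sorts before FFEE.
mixed = ["\uffee", "\U0001F600"]
utf8_sorted = sorted(mixed, key=utf8_key)
utf16_sorted = sorted(mixed, key=utf16_key)
print(same_order_in_bmp, utf8_sorted == utf16_sorted)
```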
If the operands are of unequal length, the comparison is performed as if the shorter operand were padded (with trailing UTF-8 space characters) to the length of the other operand.
If the operands are of equal length (or assumed to be, due to the additional padding), the comparison compares each corresponding character position, starting at the left-most position, until either unequal UTF-8 characters are encountered or the right-most character position is reached, whichever comes first. The operands are considered equal if all corresponding UTF-8 characters are equal.
When the first pair of unequal characters is encountered, those two characters are compared to determine the relationship of the operands. The operand that contains the UTF-8 character with the higher collating sequence value is the greater operand.
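The comparison rules above can be summarised in a short sketch. It is written in Python for illustration and is not the Enterprise Developer implementation; `compare_utf8` is a hypothetical function, and it assumes that padding is applied per character and that the collating sequence value of a character is its Unicode code point.

```python
def compare_utf8(left: str, right: str) -> int:
    """Return -1, 0 or 1 for left < right, left == right, left > right."""
    width = max(len(left), len(right))
    a = left.ljust(width)      # pad the shorter operand with trailing spaces
    b = right.ljust(width)
    for ca, cb in zip(a, b):   # start at the left-most character position
        if ca != cb:
            # Assumption: the collating value is the Unicode code point.
            return -1 if ord(ca) < ord(cb) else 1
    return 0                   # every corresponding character was equal

print(compare_utf8("abc", "abc  "))   # operands differ only by trailing spaces
print(compare_utf8("abc", "abd"))
```

Under these assumptions, an operand that differs from the other only by trailing spaces compares as equal, because the padding supplies exactly those spaces.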
There are a number of intrinsic functions that support the processing of UTF-8 data in native COBOL: