UTF-8 Literals

A UTF-8 literal is a string of characters of class UTF-8. You can specify such literals using basic notation or hexadecimal notation.

General Formats for Format 1

Format 1 is considered basic notation.
If string contains any DBCS characters, they must be delimited by shift-out and shift-in control characters.
Because UTF-8 is a variable-width encoding in which each character occupies one to four bytes, the maximum number of characters that string can contain varies with the characters used.
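As an illustration of why the character count varies, the following Python sketch reports how many bytes individual characters occupy when encoded as UTF-8. Python is used only for illustration and is not part of the literal syntax described here; the helper name utf8_widths is hypothetical.

```python
def utf8_widths(text):
    """Return (character, byte count) pairs for the UTF-8 encoding of text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# U+0041, U+00E9, U+20AC, and U+1F600 need 1, 2, 3, and 4 bytes respectively.
sample = "A\u00e9\u20ac\U0001F600"
for ch, width in utf8_widths(sample):
    print(f"U+{ord(ch):04X}: {width} byte(s)")
```

A literal made up only of one-byte characters therefore holds up to four times as many characters as one of the same byte length made up only of four-byte characters.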
The following Unicode escape sequences are permitted in string:

\uhhhh

where each h represents a hexadecimal digit in the range 0-9, a-f, and A-F inclusive. This escape sequence corresponds to a Unicode code point from the Basic Multilingual Plane (BMP), within the range U+0000 to U+FFFF.

\U00hhhhhh

where each h represents a hexadecimal digit in the range 0-9, a-f, and A-F inclusive. This escape sequence corresponds to a Unicode code point from the Basic Multilingual Plane (BMP) or from any of the supplementary planes. In addition to the range U+0000 to U+FFFF, it therefore also covers U+10000 to U+10FFFF.

Note: Code points U+D800 to U+DFFF are reserved for the high and low halves of surrogate pairs used by UTF-16. Therefore, do not specify \uD800 through \uDFFF or \U0000D800 through \U0000DFFF as Unicode escape sequences in UTF-8 literals.
To include the text \uhhhh or \U00hhhhhh literally in a UTF-8 literal, escape the escape character (\) with another \ so that the sequence is not interpreted; for example, \\u00FF is not processed as a Unicode escape sequence.
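The escape rules above can be modeled with a short Python sketch. This is a hypothetical model for illustration, not the compiler's implementation: the function name expand_escapes is invented, and the sketch assumes that an escaped backslash pair collapses to a single backslash in the resulting text, which the rules above do not state explicitly. Any backslash that does not begin one of the recognized sequences is left unchanged.

```python
import re

# Matches an escaped backslash, \uhhhh, or \U00hhhhhh (hypothetical model only).
_ESCAPE = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})|\\U00([0-9A-Fa-f]{6})")


def expand_escapes(source):
    """Expand Unicode escape sequences in source according to the rules above."""

    def replace(match):
        if match.group(0) == "\\\\":
            # Assumption: an escaped backslash yields one literal backslash.
            return "\\"
        code_point = int(match.group(1) or match.group(2), 16)
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"surrogate code point U+{code_point:04X} is not allowed")
        if code_point > 0x10FFFF:
            raise ValueError(f"U+{code_point:X} is beyond U+10FFFF")
        return chr(code_point)

    return _ESCAPE.sub(replace, source)


print(expand_escapes(r"caf\u00E9"))   # escape replaced by the character U+00E9
print(expand_escapes(r"\\u00FF"))     # escaped backslash, so no Unicode escape is processed
```

Listing the escaped-backslash alternative first in the pattern ensures that \\u00FF is consumed as a backslash pair before the \u form can match.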

Format 2 is the hexadecimal notation.
hex-string consists of hexadecimal digits in the range 0-9, a-f, and A-F inclusive. Each group of two digits represents one byte of the UTF-8 encoding; because a UTF-8 character occupies one to four bytes, a single character is written as two to eight hexadecimal digits.
The sequence of bytes represented by hex-string is validated to ensure that it is a valid sequence of UTF-8 bytes. If it is, the literal is stored as UTF-8 characters and has the same meaning as a basic UTF-8 literal that specifies the same characters.
A UTF-8 literal in hexadecimal notation has a data class and category of UTF-8, and can be used interchangeably with a basic UTF-8 literal.
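The validation and equivalence described for hexadecimal notation can be sketched in Python as follows. This is an illustrative model only: the function name decode_hex_literal is hypothetical, and Python's strict UTF-8 decoder stands in for whatever checks the compiler actually performs, so the errors it reports may differ.

```python
def decode_hex_literal(hex_string):
    """Decode hexadecimal digits into text, rejecting invalid UTF-8 byte sequences."""
    # bytes.fromhex raises ValueError for an odd digit count or non-hex digits.
    raw = bytes.fromhex(hex_string)
    # Strict decoding raises UnicodeDecodeError if the bytes are not valid UTF-8.
    return raw.decode("utf-8")


# C3 A9 is the two-byte UTF-8 encoding of U+00E9, so the hexadecimal form
# and the escaped basic form describe the same character.
assert decode_hex_literal("C3A9") == "\u00e9"

try:
    decode_hex_literal("C3")  # a lead byte with no continuation byte
except UnicodeDecodeError as error:
    print("rejected:", error.reason)
```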