ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 2.1 or later, using the UTF-16 transformation format. The text is expected to have been normalised to Unicode Normalised Form C (canonical composition), as described in Unicode Technical Report #15. Conforming ECMAScript implementations are not required to perform any normalisation of text, or behave as though they were performing normalisation of text, themselves.
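As an informal illustration of the absence of normalisation, the sketch below compares a precomposed character with its canonically equivalent combining sequence. The variable names are hypothetical, and Unicode escapes are used only so that the two spellings are visible in the listing.

```javascript
// U+00E9 is the precomposed form of "e with acute accent" (Form C).
var composed = "\u00E9";

// "e" followed by U+0301 COMBINING ACUTE ACCENT is canonically
// equivalent text that has not been normalised to Form C.
var decomposed = "e\u0301";

// No normalisation is applied, so the two values are distinct strings.
var sameValue = composed === decomposed;   // false
```

Text that is meant to compare equal should therefore be normalised before it reaches the implementation.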
ECMAScript source text can contain any of the Unicode characters. All Unicode white space characters are treated as white space, and all Unicode line/paragraph separators are treated as line separators. Non-Latin Unicode characters are allowed in identifiers, string literals, regular expression literals and comments.
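The following sketch illustrates the paragraph above under a few assumptions: it uses eval so that the white space and line separator characters can be written visibly as escapes inside string values, and all variable names are hypothetical.

```javascript
// U+00A0 (no-break space) is treated as white space between tokens.
var sum = eval("1\u00A0+\u00A02");                   // 3

// U+2028 (line separator) ends a line, so automatic semicolon insertion
// splits the text into two statements: "var t = 1" and "t + 1".
var afterSeparator = eval("var t = 1\u2028t + 1");   // 2

// Non-Latin letters are allowed in identifiers.
var длина = 5;
```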
Throughout the rest of this document, the phrase "code point" and the word "character" will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of UTF-16 text. The phrase "Unicode character" will be used to refer to the abstract linguistic or typographical unit represented by a single Unicode scalar value (which may be longer than 16 bits and thus may be represented by more than one code point). This only refers to entities represented by single Unicode scalar values: the components of a combining character sequence are still individual "Unicode characters," even though a user might think of the whole sequence as a single character.
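A short sketch of the distinction, using hypothetical variable names: a Unicode character outside the 16-bit range occupies two 16-bit values in a string, and a combining character sequence consists of separate Unicode characters.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF does not fit in 16 bits, so UTF-16
// represents it as a surrogate pair of two 16-bit values.
var clef = "\uD834\uDD1E";
var clefLength = clef.length;         // 2
var firstUnit = clef.charCodeAt(0);   // 0xD834
var secondUnit = clef.charCodeAt(1);  // 0xDD1E

// "a" followed by U+030A COMBINING RING ABOVE: a user may see a single
// character, but it is two Unicode characters and two 16-bit values.
var ring = "a\u030A";
var ringLength = ring.length;         // 2
```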
In string literals, regular expression literals and identifiers, any character (code point) may also be expressed as a Unicode escape sequence consisting of six characters, namely \u plus four hexadecimal digits. Within a comment, such an escape sequence is effectively ignored as part of the comment. Within a string literal or regular expression literal, the Unicode escape sequence contributes one character to the value of the literal. Within an identifier, the escape sequence contributes one character to the identifier.
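The three contexts can be sketched as follows. The names used here are hypothetical and chosen only for illustration.

```javascript
// String literal: each escape contributes one character to the value.
var text = "\u0041\u0042";             // same value as "AB"

// Identifier: the escape contributes the character "a" to the name,
// so this declares the variable abc.
var \u0061bc = 1;
var sameBinding = abc;                 // 1

// Regular expression literal: the escape stands for the character "A".
var pattern = /\u0041/;
var matchesA = pattern.test("A");      // true
```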
NOTE 1
Although this document sometimes refers to a "transformation" between a
"character" within a "string" and the 16-bit unsigned integer that is
the UTF-16 encoding of that character, there is actually no
transformation because a "character" within a "string" is actually
represented using that 16-bit unsigned value.
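A minimal sketch of this correspondence, with hypothetical variable names; it relies only on the standard String.prototype.charCodeAt and String.fromCharCode functions.

```javascript
// Reading a character as its 16-bit unsigned value.
var unit = "A".charCodeAt(0);              // 65 (0x0041)

// Building a one-character string from a 16-bit unsigned value.
var ch = String.fromCharCode(0x0041);      // "A"

// Going from character to value and back yields the same character.
var euro = "\u20AC";
var roundTrip = String.fromCharCode(euro.charCodeAt(0)) === euro;   // true
```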
NOTE 2
ECMAScript differs from the Java programming language in the behaviour
of Unicode escape sequences. In a Java program, if the Unicode escape
sequence \u000A, for example, occurs within a single-line comment, it
is interpreted as a line terminator (Unicode character 000A is line
feed) and therefore the next character is not part of the comment.
Similarly, if the Unicode escape sequence \u000A occurs within a string
literal in a Java program, it is likewise interpreted as a line
terminator, which is not allowed within a string literal; one must
write \n instead of \u000A to cause a line feed to be part of the
string value of a string literal. In an ECMAScript program, a Unicode
escape sequence occurring within a comment is never interpreted and
therefore cannot contribute to termination of the comment. Similarly, a
Unicode escape sequence occurring within a string literal in an
ECMAScript program always contributes a character to the string value
of the literal and is never interpreted as a line terminator or as a
quote mark that might terminate the string literal.
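The ECMAScript side of this comparison can be sketched as follows. The variable names are hypothetical, and the Java behaviour is not shown.

```javascript
var marker = "unchanged";
// The escape \u000A marker = "changed"; stays inside this comment.

// Inside a string literal, \u000A contributes a line feed character
// to the value, exactly as \n does.
var lineFeed = "a\u000Ab";             // "a", line feed, "b"

// \u0022 contributes a quote mark to the value and does not
// terminate the literal.
var quoted = "say \u0022hi\u0022";     // say "hi"
```

After this code runs, marker still holds its original value, because the assignment written after the escape is part of the comment.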