Text Blocks (Second Preview)

Copyright © 2020 Oracle and/or its affiliates · All Rights Reserved · License

This document describes changes to The Java® Language Specification, Java SE 14 Edition, to support text blocks, a preview feature of Java SE 14.

See JEP 368 for an overview of the feature.

Changes are described with respect to existing sections of the JLS. New text is indicated like this and deleted text is indicated like this. Explanation and discussion, as needed, is set aside in grey boxes.

Chapter 1: Introduction

1.5 Preview Features

The following are essential API elements associated with Text Blocks:

Chapter 3: Lexical Structure

3.10 Literals

A literal is the source code representation of a value of a primitive type (4.2), the String type (4.3.3), or the null type (4.1).

Literal:
IntegerLiteral
FloatingPointLiteral
BooleanLiteral
CharacterLiteral
StringLiteral
TextBlock
NullLiteral

3.10.4 Character Literals

A character literal is expressed as a character or an escape sequence (3.10.7), enclosed in ASCII single quotes. (The single-quote, or apostrophe, character is \u0027.)

CharacterLiteral:
' SingleCharacter '
' EscapeSequence '
SingleCharacter:
InputCharacter but not ' or \

See 3.10.7 for the definition of EscapeSequence.

A character literal is always of type char (4.2.1).

Character literals can only represent UTF-16 code units (3.1), i.e., they are limited to values from \u0000 to \uffff. Supplementary characters must be represented either as a surrogate pair within a char sequence, or as an integer, depending on the API they are used with.
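As a minimal sketch of the two representations mentioned above, the following holds one supplementary character both as an int code point and as a surrogate pair in a char sequence. The class name, field names, and the choice of U+1F600 are illustrative assumptions, not part of the specification.

```java
// Illustrative sketch; SupplementaryCharExample and its members are hypothetical names.
final class SupplementaryCharExample {
    // U+1F600 lies outside the range of char, so it is held in an int.
    static final int CODE_POINT = 0x1F600;

    // As a char sequence, the same character occupies two UTF-16 code units.
    static final char[] PAIR = Character.toChars(CODE_POINT);
    static final String AS_STRING = new String(PAIR);

    public static void main(String[] args) {
        // Number of UTF-16 code units in the pair.
        System.out.println(PAIR.length);
        // Number of Unicode characters (code points) in the string.
        System.out.println(AS_STRING.codePointCount(0, AS_STRING.length()));
    }
}
```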

The content of a character literal is the SingleCharacter or the EscapeSequence which follows the opening '.

It is a compile-time error for the character following the SingleCharacter or EscapeSequence content to be other than a '.

It is a compile-time error for a line terminator (3.4) to appear after the opening ' and before the closing '.

As specified in 3.4, the characters CR and LF are never an InputCharacter; each is recognized as constituting a LineTerminator, so it may not appear in a character literal, even in the escape sequence \ LineTerminator.

The character represented by a character literal is the content of the character literal with any escape sequence interpreted, as if by execution of String::translateEscapes on the content.

The following are examples of char literals:
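The list of examples is not reproduced in this text. As a stand-in sketch, the declarations below show one character literal of each common form; the class and field names are hypothetical and chosen only for illustration.

```java
// Illustrative sketch; CharLiteralExamples and its field names are hypothetical.
final class CharLiteralExamples {
    static final char LETTER = 'a';        // a SingleCharacter
    static final char PERCENT = '%';       // another SingleCharacter
    static final char TAB = '\t';          // escape sequence for horizontal tab
    static final char BACKSLASH = '\\';    // escape sequence for backslash
    static final char APOSTROPHE = '\'';   // escape sequence for single quote
    static final char OMEGA = '\u03a9';    // U+03A9 written as a Unicode escape
    static final char OCTAL_DEL = '\177';  // octal escape for U+007F

    public static void main(String[] args) {
        // Print the numeric value of a few of the literals above.
        System.out.println((int) TAB);
        System.out.println((int) OMEGA);
        System.out.println((int) OCTAL_DEL);
    }
}
```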

Because Unicode escapes are processed very early, it is not correct to write '\u000a' for a character literal whose value is linefeed (LF); the Unicode escape \u000a is transformed into an actual linefeed in translation step 1 (3.3) and the linefeed becomes a LineTerminator in step 2 (3.4), and so the character literal is not valid in step 3. Instead, one should use the escape sequence '\n' (3.10.7). Similarly, it is not correct to write '\u000d' for a character literal whose value is carriage return (CR). Instead, use '\r'. Finally, it is not possible to write '\u0027' for a character literal containing an apostrophe (').

In C and C++, a character literal may contain representations of more than one character, but the value of such a character literal is implementation-defined. In the Java programming language, a character literal always represents exactly one character.

3.10.5 String Literals

A string literal consists of zero or more characters enclosed in double quotes. Characters such as newlines may be represented by escape sequences (3.10.7) - one escape sequence for characters in the range U+0000 to U+FFFF, two escape sequences for the UTF-16 surrogate code units of characters in the range U+010000 to U+10FFFF.

StringLiteral:
" {StringCharacter} "
StringCharacter:
InputCharacter but not " or \
EscapeSequence

A string literal is always of type String (4.3.3).

The content of a string literal is the sequence of characters that begins immediately after the opening " and ends immediately before the closing matching ".

It is a compile-time error for a line terminator to appear after the opening " and before the closing matching ".

As specified in 3.4, the characters CR and LF are never an InputCharacter; each is recognized as constituting a LineTerminator, so it may not appear in a string literal, even in the escape sequence \ LineTerminator.

The string represented by a string literal is the content of the string literal with every escape sequence interpreted, as if by execution of String::translateEscapes on the content.

The following are examples of string literals:

""                    // the empty string
"\""                  // a string containing " alone
"This is a string"    // a string containing 16 characters
"This is a " +        // actually a string-valued constant expression,
    "two-line string"    // formed from two string literals

Because Unicode escapes are processed very early, it is not correct to write "\u000a" for a string literal containing a single linefeed (LF); the Unicode escape \u000a is transformed into an actual linefeed in translation step 1 (3.3) and the linefeed becomes a LineTerminator in step 2 (3.4), and so the string literal is not valid in step 3. Instead, one should write "\n" (3.10.7). Similarly, it is not correct to write "\u000d" for a string literal containing a single carriage return (CR). Instead, use "\r". Finally, it is not possible to write "\u0022" for a string literal containing a double quotation mark (").

A long string literal can always be broken up into shorter pieces and written as a (possibly parenthesized) expression using the string concatenation operator + (15.18.1).

At run time, a string literal is a reference to an instance of class String (4.3.1, 4.3.3) that denotes the string represented by the string literal.

Moreover, a string literal always refers to the same instance of class String. This is because string literals - or, more generally, strings that are the values of constant expressions (15.28) - are "interned" so as to share unique instances, using the method String.intern (12.5).
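The effect of interning can be observed with reference comparison. The sketch below, using hypothetical class and method names, contrasts a constant expression with a concatenation computed at run time.

```java
// Illustrative sketch; StringInternExample and its members are hypothetical names.
final class StringInternExample {
    static final String HELLO = "Hello";

    // "Hel" + "lo" is a constant expression, so it denotes the interned "Hello".
    static boolean constantConcatIsSame() {
        return HELLO == ("Hel" + "lo");
    }

    // The parameter is not a constant, so the concatenation creates a new String.
    static boolean runtimeConcatIsSame(String lo) {
        return HELLO == ("Hel" + lo);
    }

    // Interning the computed string explicitly yields the shared instance again.
    static boolean internedRuntimeConcatIsSame(String lo) {
        return HELLO == ("Hel" + lo).intern();
    }

    public static void main(String[] args) {
        System.out.println(constantConcatIsSame());
        System.out.println(runtimeConcatIsSame("lo"));
        System.out.println(internedRuntimeConcatIsSame("lo"));
    }
}
```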

3.10.6 Text Blocks

A text block consists of zero or more characters enclosed by opening and closing delimiters. Characters may be represented by escape sequences (3.10.7), but the newline and double quote characters that must be represented with escape sequences in a string literal may be represented directly in a text block.

TextBlock:
" " " { TextBlockWhiteSpace } LineTerminator { TextBlockCharacter } " " "
TextBlockWhiteSpace:
WhiteSpace but not LineTerminator
TextBlockCharacter:
InputCharacter but not \
EscapeSequence
LineTerminator

The following productions from 3.3, 3.4, and 3.6 are shown here for convenience:

WhiteSpace:
the ASCII SP character, also known as "space"
the ASCII HT character, also known as "horizontal tab"
the ASCII FF character, also known as "form feed"
LineTerminator
LineTerminator:
the ASCII LF character, also known as "newline"
the ASCII CR character, also known as "return"
the ASCII CR character followed by the ASCII LF character
InputCharacter:
UnicodeInputCharacter but not CR or LF
UnicodeInputCharacter:
UnicodeEscape
RawInputCharacter
UnicodeEscape:
\ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
RawInputCharacter:
any Unicode character

A text block is always of type String (4.3.3).

The opening delimiter is a sequence that starts with three double quote characters ("""), continues with zero or more space, tab, and form feed characters, and concludes with a line terminator.

The closing delimiter is a sequence of three double quote characters.

The content of a text block is the sequence of characters that begins immediately after the line terminator of the opening delimiter, and ends immediately before the first double quote of the closing delimiter.

Unlike in a string literal (3.10.5), it is not a compile-time error for a line terminator to appear in the content of a text block.
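As a sketch of how the position of the closing delimiter bounds the content, consider the two declarations below. The class and field names are hypothetical, text blocks are assumed to be enabled in the compiler, and the string values noted in the comments assume the removal of incidental white space described later in this section.

```java
// Illustrative sketch; TextBlockContentExample and its fields are hypothetical names.
final class TextBlockContentExample {
    // The content includes the line terminator after "beta",
    // so the represented string ends with a line feed.
    static final String WITH_TRAILING_NEWLINE = """
        alpha
        beta
        """;

    // The closing delimiter follows "beta" directly,
    // so the represented string has no trailing line feed.
    static final String WITHOUT_TRAILING_NEWLINE = """
        alpha
        beta""";

    public static void main(String[] args) {
        System.out.println(WITH_TRAILING_NEWLINE.length());
        System.out.println(WITHOUT_TRAILING_NEWLINE.length());
    }
}
```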

The use of the escape sequences \" and \n is permitted in a text block, but not necessary or recommended. However, representing the sequence """ in a text block requires the escaping of at least one " character, to avoid mimicking the closing delimiter.

Example 3.10.6-2. Escape sequences in text blocks

The following snippet of text would be less readable if the " characters were escaped:
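The snippet itself is not reproduced in this text. As a stand-in, the sketch below uses an invented two-line dialogue (not the specification's own example) and hypothetical names to compare a text block with the equivalent string literal.

```java
// Illustrative sketch; UnescapedQuotesExample and its fields are hypothetical names.
final class UnescapedQuotesExample {
    // Double quotes appear directly in the content; no \" is needed.
    static final String DIALOGUE = """
        "Is it ready?" she asked.
        "Almost," he said.
        """;

    // The same string as string literals must escape every " and spell out \n.
    static final String DIALOGUE_AS_LITERAL =
        "\"Is it ready?\" she asked.\n" +
        "\"Almost,\" he said.\n";

    public static void main(String[] args) {
        System.out.println(DIALOGUE.equals(DIALOGUE_AS_LITERAL));
    }
}
```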

If a text block is to denote another text block, then it is recommended to escape the first " of the embedded opening and closing delimiters:
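A minimal sketch of this recommendation follows, with hypothetical class and field names. Only the first " of each embedded delimiter is escaped, which is enough to keep three consecutive unescaped double quotes from appearing inside the outer content.

```java
// Illustrative sketch; EmbeddedTextBlockExample and CODE are hypothetical names.
final class EmbeddedTextBlockExample {
    // Escaping the first " of each embedded delimiter prevents the sequence
    // from being read as the closing delimiter of the outer text block.
    static final String CODE = """
        String text = \"""
            A text block inside a text block
        \""";
        """;

    public static void main(String[] args) {
        System.out.print(CODE);
    }
}
```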

The string represented by a text block is not the literal sequence of characters in the content. Instead, the string represented by a text block is the result of applying the following transformations to the content, in order:

  1. Line terminators are normalized to the ASCII LF character, as follows:

    1. An ASCII CR character followed by an ASCII LF character is translated to an ASCII LF character.

    2. An ASCII CR character is translated to an ASCII LF character.

  2. Incidental white space is removed, as if by execution of String::stripIndent on the characters resulting from step 1.

  3. Escape sequences are interpreted, as if by execution of String::translateEscapes on the characters resulting from step 2.
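The three steps above can be sketched as ordinary library calls applied to the raw content of a text block. The helper name interpret and the sample strings are assumptions made for illustration; a compiler is not required to implement the steps this way, only to produce the same result.

```java
// Illustrative sketch; TextBlockStepsExample and interpret are hypothetical names.
final class TextBlockStepsExample {
    // Applies the three transformations, in order, to raw text block content.
    static String interpret(String content) {
        // Step 1: normalize CR LF and lone CR to LF.
        String normalized = content.replace("\r\n", "\n").replace("\r", "\n");
        // Step 2: remove incidental white space.
        String stripped = normalized.stripIndent();
        // Step 3: interpret escape sequences.
        return stripped.translateEscapes();
    }

    // A text block whose content corresponds to the raw string used in main.
    static final String BLOCK = """
        a\tb
          c
        """;

    public static void main(String[] args) {
        // Raw content with mixed line terminators and an uninterpreted \t escape.
        String raw = "    a\\tb\r\n      c\r    ";
        System.out.println(interpret(raw).equals(BLOCK));
    }
}
```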

Example 3.10.6-3. Order of transformations on text block content

Interpreting escape sequences last allows developers to use \n, \f, and \r for vertical formatting of a string without affecting the normalization of line terminators, and to use \b and \t for horizontal formatting of a string without affecting the removal of incidental white space. For example, consider this text block that mentions the escape sequence \r (CR):

The \r escapes are not interpreted until after the line terminators have been normalized to LF. Using Unicode escapes to visualize LF (\u000A) and CR (\u000D), and using | to visualize the left margin, the final result is:
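The listings for this example are not reproduced in this text. The sketch below assumes a small HTML fragment as the text block and uses a hypothetical helper, visualize, to render the result in the notation just described: each line is prefixed with | and every CR and LF is shown as its Unicode escape.

```java
// Illustrative sketch; TransformationOrderExample, HTML, and visualize are hypothetical names.
final class TransformationOrderExample {
    // Each line ends with the escape \r followed by a real line terminator.
    static final String HTML = """
        <html>\r
            <body>\r
                <p>Hello, world</p>\r
            </body>\r
        </html>\r
        """;

    // Marks the left margin with | and shows CR and LF as visible text.
    static String visualize(String s) {
        StringBuilder out = new StringBuilder();
        for (String line : s.split("\n")) {
            out.append('|')
               .append(line.replace("\r", "\\u000D"))
               .append("\\u000A")
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(visualize(HTML));
    }
}
```

Every line of the represented string ends in CR LF: the LF comes from the normalized line terminator, and the CR comes from the \r escape, which is interpreted only in the last step.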

When this specification says that a text block contains a particular character or sequence of characters, or that a particular character or sequence of characters is in a text block, it means that the string represented by the text block (as opposed to the content of the text block) contains the character or sequence of characters.

At run time, a text block is a reference to an instance of class String that denotes the string represented by the text block.

A text block always refers to the same instance of class String. This is because the strings represented by text blocks - or, more generally, strings that are the values of constant expressions (15.28) - are "interned" so as to share unique instances (12.5).
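One consequence, sketched below with hypothetical names, is that a text block and a string literal that represent the same string refer to the same String instance.

```java
// Illustrative sketch; TextBlockInternExample and its members are hypothetical names.
final class TextBlockInternExample {
    static final String FROM_BLOCK = """
        red
        green
        """;

    static final String FROM_LITERAL = "red\ngreen\n";

    // Both operands are constant expressions denoting the same interned string.
    static boolean sameInstance() {
        return FROM_BLOCK == FROM_LITERAL;
    }

    public static void main(String[] args) {
        System.out.println(sameInstance());
    }
}
```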

Example 3.10.6-4. Text blocks evaluate to strings

Text blocks can be used wherever an expression of type String is allowed, such as in string concatenation (15.18.1), in method invocation on class String, and in annotations with String elements:
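The listing for this example is not reproduced in this text. A minimal sketch of the three uses follows; the annotation type Note and all other names are hypothetical and exist only for illustration.

```java
// Illustrative sketch; TextBlockExpressionExample, Note, and the members are hypothetical.
final class TextBlockExpressionExample {
    // A hypothetical annotation type with a String element.
    @interface Note {
        String value();
    }

    // A text block as an operand of string concatenation.
    static final String CONCAT = "ab" + """
        cde
        """;

    // A method of class String invoked directly on a text block.
    static final String SUB = """
        abc""".substring(2);

    // A text block as the value of an annotation element.
    @Note("""
        checked by hand
        """)
    static void annotated() { }

    public static void main(String[] args) {
        System.out.print(CONCAT);
        System.out.println(SUB);
    }
}
```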

3.10.7 Escape Sequences for Character and String Literals (formerly 3.10.6)

In character literals (3.10.4), string literals (3.10.5), and text blocks (3.10.6), the character and string escape sequences allow for the representation of some nongraphic characters without using Unicode escapes (3.3), as well as the single quote, double quote, and backslash characters.

EscapeSequence:
\ b (backspace BS, Unicode \u0008)
\ s (space SP, Unicode \u0020)
\ t (horizontal tab HT, Unicode \u0009)
\ f (form feed FF, Unicode \u000c)
\ n (linefeed LF, Unicode \u000a)
\ r (carriage return CR, Unicode \u000d)
\ LineTerminator (line continuation, no Unicode representation)
\ " (double quote ", Unicode \u0022)
\ ' (single quote ', Unicode \u0027)
\ \ (backslash \, Unicode \u005c)
OctalEscape (octal value, Unicode \u0000 to \u00ff)

...

Octal escapes are provided for compatibility with C, but can express only Unicode values \u0000 through \u00FF, so Unicode escapes are usually preferred.

It is a compile-time error if the character following a backslash in an escape sequence is not a LineTerminator or an ASCII b, s, t, f, n, r, ", ', \, 0, 1, 2, 3, 4, 5, 6, or 7.

An escape sequence in the content of a character literal, string literal, or text block is interpreted by replacing its \ and trailing characters with the single character denoted by the Unicode escape in the EscapeSequence grammar. The line continuation escape sequence has no corresponding Unicode escape, so it is interpreted by replacing it with nothing.
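Since the interpretation is specified as if by String::translateEscapes, its effect can be sketched by calling that method on strings that hold uninterpreted escape sequences. The class and helper names below are hypothetical. Note that a malformed escape is a compile-time error in source code, whereas the library method signals it with an IllegalArgumentException.

```java
// Illustrative sketch; EscapeInterpretationExample and its methods are hypothetical names.
final class EscapeInterpretationExample {
    // Interprets the escape sequences in content that has not been interpreted yet.
    static String interpret(String content) {
        return content.translateEscapes();
    }

    // Reports whether every escape sequence in the content is well formed.
    static boolean isWellFormed(String content) {
        try {
            content.translateEscapes();
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(interpret("a\\tb"));    // backslash-t becomes a tab
        System.out.println(interpret("\\101"));    // octal escape
        System.out.println(interpret("x\\\ny"));   // line continuation is removed
        System.out.println(isWellFormed("\\q"));   // q is not a valid escape character
    }
}
```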

The line continuation escape sequence may appear in a text block, but cannot appear in a character literal (3.10.4) or a string literal (3.10.5) because each disallows a LineTerminator.
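As a sketch of the two escape sequences that are most useful in text blocks, the declarations below use the line continuation escape to join lines and \s to preserve trailing spaces that would otherwise be removed as incidental white space. The names and sample text are hypothetical.

```java
// Illustrative sketch; LineContinuationExample and its fields are hypothetical names.
final class LineContinuationExample {
    // A \ at the end of a line removes the line terminator that follows it,
    // so the represented string is a single line with no trailing line feed.
    static final String ONE_LINE = """
        Lorem ipsum dolor sit amet, \
        consectetur adipiscing elit.\
        """;

    // \s denotes a space and is interpreted after incidental white space is
    // removed, so it protects the spaces that precede it on the same line.
    static final String PADDED = """
        red  \s
        green\s
        """;

    public static void main(String[] args) {
        System.out.println(ONE_LINE);
        System.out.print(PADDED.replace(' ', '.'));
    }
}
```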

3.10.8 The Null Literal (formerly 3.10.7)

The null type has one value, the null reference, represented by the null literal null, which is formed from ASCII characters.

NullLiteral:
null

A null literal is always of the null type (4.1).

Other Changes

The following changes should also be made:

Some clarification of terminology around "escapes" is desirable: