Text Blocks (Second Preview)

Chapter 1: Introduction

1.5 Preview Features

The following are essential API elements associated with Text Blocks:

The method stripIndent in String
The method translateEscapes in String

Chapter 3: Lexical Structure

3.10 Literals

A literal is the source code representation of a value of a primitive type (4.2), the String type (4.3.3), or the null type (4.1).

Literal:: IntegerLiteral; FloatingPointLiteral; BooleanLiteral; CharacterLiteral; StringLiteral; TextBlock; NullLiteral

3.10.4 Character Literals

A character literal is expressed as a character or an escape sequence (3.10.7), enclosed in ASCII single quotes. (The single-quote, or apostrophe, character is \u0027.)

CharacterLiteral:: ' SingleCharacter '; ' EscapeSequence '
SingleCharacter:: InputCharacter but not ' or \

See 3.10.7 for the definition of EscapeSequence.

A character literal is always of type char (4.2.1).

Character literals can only represent UTF-16 code units (3.1), i.e., they are limited to values from \u0000 to \uffff. Supplementary characters must be represented either as a surrogate pair within a char sequence, or as an integer, depending on the API they are used with.
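The following is a minimal sketch of the surrogate-pair representation mentioned above. The class name SupplementaryChars and the choice of code point U+1F600 are illustrative assumptions; only standard java.lang methods are used.

```java
public class SupplementaryChars {
    // A code point outside the range that a single char can hold (illustrative choice).
    static final int GRINNING_FACE = 0x1F600;

    // Represent the supplementary character as a surrogate pair inside a String.
    static String asString() {
        return new String(Character.toChars(GRINNING_FACE));
    }

    public static void main(String[] args) {
        String s = asString();
        System.out.println(s.length());                       // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1 code point
        System.out.println(s.equals("\uD83D\uDE00"));         // true: the same pair as two Unicode escapes
    }
}
```

An API that works with code points takes the value as an int, while an API that works with char sequences sees the two surrogate code units.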

The content of a character literal is the SingleCharacter or the EscapeSequence which follows the opening '.

It is a compile-time error for the character following the ~~SingleCharacter or EscapeSequence~~ content to be other than a '.

It is a compile-time error for a line terminator (3.4) to appear after the opening ' and before the closing '.

As specified in 3.4, the characters CR and LF are never an InputCharacter; each is recognized as constituting a LineTerminator, so it may not appear in a character literal, even in the escape sequence \ LineTerminator.

The character represented by a character literal is the content of the character literal with any escape sequence interpreted, as if by execution of String::translateEscapes on the content.

The following are examples of char literals:

'a'

'%'

'\t'

'\\'

'\''

'\u03a9'

'\uFFFF'

'\177'

'™'

Because Unicode escapes are processed very early, it is not correct to write '\u000a' for a character literal whose value is linefeed (LF); the Unicode escape \u000a is transformed into an actual linefeed in translation step 1 (3.3) and the linefeed becomes a LineTerminator in step 2 (3.4), and so the character literal is not valid in step 3. Instead, one should use the escape sequence '\n' (3.10.7). Similarly, it is not correct to write '\u000d' for a character literal whose value is carriage return (CR). Instead, use '\r'. Finally, it is not possible to write '\u0027' for a character literal containing an apostrophe (').

In C and C++, a character literal may contain representations of more than one character, but the value of such a character literal is implementation-defined. In the Java programming language, a character literal always represents exactly one character.
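As an informal check of the character literal examples above, the sketch below widens several of them to int so that their UTF-16 code unit values can be inspected. The class name CharLiteralValues and the method name values are hypothetical.

```java
public class CharLiteralValues {
    // Each char literal is widened to int, exposing its UTF-16 code unit value.
    static int[] values() {
        return new int[] { 'a', '\t', '\\', '\'', '\u03a9', '\177', '\n' };
    }

    public static void main(String[] args) {
        // Prints 97 9 92 39 937 127 10 (one value per literal, in order).
        StringBuilder sb = new StringBuilder();
        for (int v : values()) {
            sb.append(v).append(' ');
        }
        System.out.println(sb.toString().trim());
    }
}
```

Note that linefeed is written with the escape sequence for LF rather than a Unicode escape, for the reason given above.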

3.10.5 String Literals

A string literal consists of zero or more characters enclosed in double quotes. Characters such as newlines may be represented by escape sequences (3.10.7) ~~- one escape sequence for characters in the range U+0000 to U+FFFF, two escape sequences for the UTF-16 surrogate code units of characters in the range U+010000 to U+10FFFF~~.

StringLiteral:: " {StringCharacter} "
StringCharacter:: InputCharacter but not " or \; EscapeSequence

A string literal is always of type String (4.3.3).

The content of a string literal is the sequence of characters that begins immediately after the opening " and ends immediately before the closing matching ".

It is a compile-time error for a line terminator to appear in the content of a string literal ~~after the opening " and before the closing matching "~~.

As specified in 3.4, the characters CR and LF are never an InputCharacter; each is recognized as constituting a LineTerminator, so it may not appear in a string literal, even in the escape sequence \ LineTerminator.

The string represented by a string literal is the content of the string literal with every escape sequence interpreted, as if by execution of String::translateEscapes on the content.
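The interpretation described above can be observed at run time by calling String::translateEscapes directly. This is a sketch that assumes Java 15 or later, where the method is a standard API; the class name TranslateEscapesDemo and the method name interpret are hypothetical.

```java
public class TranslateEscapesDemo {
    // Interprets escape sequences in a run-time string the way the compiler
    // interprets them in the content of a string literal.
    static String interpret(String content) {
        return content.translateEscapes();
    }

    public static void main(String[] args) {
        // Backslashes are doubled so that the run-time string holds the
        // 12 characters T a b : \ t h e r e \ n, as typed between quotes.
        String content = "Tab:\\there\\n";
        String interpreted = interpret(content);
        System.out.println(content.length());                      // 12
        System.out.println(interpreted.length());                  // 10
        System.out.println(interpreted.equals("Tab:\there\n"));    // true
    }
}
```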

The following are examples of string literals:

""                    // the empty string
"\""                  // a string containing " alone
"This is a string"    // a string containing 16 characters
"This is a " +        // actually a string-valued constant expression,
    "two-line string"    // formed from two string literals

Because Unicode escapes are processed very early, it is not correct to write "\u000a" for a string literal containing a single linefeed (LF); the Unicode escape \u000a is transformed into an actual linefeed in translation step 1 (3.3) and the linefeed becomes a LineTerminator in step 2 (3.4), and so the string literal is not valid in step 3. Instead, one should write "\n" (3.10.7). Similarly, it is not correct to write "\u000d" for a string literal containing a single carriage return (CR). Instead, use "\r". Finally, it is not possible to write "\u0022" for a string literal containing a double quotation mark (").

A long string literal can always be broken up into shorter pieces and written as a (possibly parenthesized) expression using the string concatenation operator + (15.18.1).

At run time, a string literal is a reference to an instance of class String (4.3.1, 4.3.3) that denotes the string represented by the string literal.

Moreover, a string literal always refers to the same instance of class String. This is because string literals - or, more generally, strings that are the values of constant expressions (15.28) - are "interned" so as to share unique instances, using the method String.intern (12.5).
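The effect of interning can be illustrated with reference comparisons. The sketch below uses a hypothetical class InternDemo; the local variable lo is deliberately not final, so the concatenation that uses it is not a constant expression.

```java
public class InternDemo {
    static final String HELLO = "Hello";

    // Two occurrences of the same literal refer to the same instance.
    static boolean sameLiteral() {
        return HELLO == "Hello";
    }

    // A constant expression is evaluated at compile time and interned.
    static boolean constantExpression() {
        return HELLO == ("Hel" + "lo");
    }

    // Concatenation with a non-constant operand creates a new String at run time.
    static boolean runtimeConcat() {
        String lo = "lo";
        return HELLO == ("Hel" + lo);
    }

    // Explicit interning yields the shared instance again.
    static boolean interned() {
        String lo = "lo";
        return HELLO == ("Hel" + lo).intern();
    }

    public static void main(String[] args) {
        System.out.println(sameLiteral());         // true
        System.out.println(constantExpression());  // true
        System.out.println(runtimeConcat());       // false
        System.out.println(interned());            // true
    }
}
```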

3.10.6 Text Blocks

A text block consists of zero or more characters enclosed by opening and closing delimiters. Characters may be represented by escape sequences (3.10.7), but the newline and double quote characters that must be represented with escape sequences in a string literal may be represented directly in a text block.

TextBlock:: " " " { TextBlockWhiteSpace } LineTerminator { TextBlockCharacter } " " "
TextBlockWhiteSpace:: WhiteSpace but not LineTerminator
TextBlockCharacter:: InputCharacter but not \; EscapeSequence; LineTerminator

The following productions from 3.3, 3.4, and 3.6 are shown here for convenience:

WhiteSpace:
    the ASCII SP character, also known as "space"
    the ASCII HT character, also known as "horizontal tab"
    the ASCII FF character, also known as "form feed"
    LineTerminator

LineTerminator:
    the ASCII LF character, also known as "newline"
    the ASCII CR character, also known as "return"
    the ASCII CR character followed by the ASCII LF character

InputCharacter:
    UnicodeInputCharacter but not CR or LF

UnicodeInputCharacter:
    UnicodeEscape
    RawInputCharacter

UnicodeEscape:
    \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

RawInputCharacter:
    any Unicode character

A text block is always of type String (4.3.3).

The opening delimiter is a sequence that starts with three double quote characters ("""), continues with zero or more space, tab, and form feed characters, and concludes with a line terminator.

The closing delimiter is a sequence of three double quote characters.

The content of a text block is the sequence of characters that begins immediately after the line terminator of the opening delimiter, and ends immediately before the first double quote of the closing delimiter.

Unlike in a string literal (3.10.5), it is not a compile-time error for a line terminator to appear in the content of a text block.
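To make the delimiter and content rules above concrete, the following sketch declares a few text blocks and reports the length of the string each one represents. It assumes Java 15 or later, where text blocks are a standard feature; the class name TextBlockContent and the field names are hypothetical.

```java
public class TextBlockContent {
    // The closing delimiter follows the last character directly: no trailing LF.
    static final String SEASON = """
            winter""";

    // The closing delimiter is on its own line: the string ends with LF.
    static final String PERIOD = """
            winter
            """;

    // The content consists only of the indentation before the closing delimiter.
    static final String EMPTY = """
            """;

    // A double quote may appear directly in the content.
    static final String QUOTE = """
            "
            """;

    public static void main(String[] args) {
        System.out.println(SEASON.length());  // 6
        System.out.println(PERIOD.length());  // 7
        System.out.println(EMPTY.length());   // 0
        System.out.println(QUOTE.length());   // 2
    }
}
```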

Example 3.10.6-1. Text Blocks

When multi-line strings are desired, a text block is usually more readable than a concatenation of string literals. For example, compare these alternative representations of a snippet of HTML:

String html = "<html>\n" +
              "    <body>\n" +
              "        <p>Hello, world</p>\n" +
              "    </body>\n" +
              "</html>\n";

String html = """
              <html>
                  <body>
                      <p>Hello, world</p>
                  </body>
              </html>
              """;

Here are some examples of text blocks:

String season = """
                winter""";    // the six characters w i n t e r

String period = """
                winter
                """;          // the seven characters w i n t e r LF

String greeting =
    """
    Hi, "Bob"
    """;        // the ten characters H i , SP " B o b " LF

String salutation =
    """
    Hi,
     "Bob"
    """;        // the eleven characters H i , LF SP " B o b " LF

String empty = """
               """;      // the empty string (zero length)

String quote = """
               "
               """;      // the two characters " LF

String backslash = """
                   \\
                   """;  // the two characters \ LF

The use of the escape sequences \" and \n is permitted in a text block, but not necessary or recommended. However, representing the sequence """ in a text block requires the escaping of at least one " character, to avoid mimicking the closing delimiter.
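A short sketch of the escaping rule just described, assuming Java 15 or later. The class name TripleQuote and the sample text are hypothetical; escaping only the first of the three double quotes is enough to keep the sequence from being read as the closing delimiter.

```java
public class TripleQuote {
    // Only the first double quote of the embedded sequence is escaped.
    static final String HINT = """
            Use \""" to open a text block.
            """;

    public static void main(String[] args) {
        // The represented string holds three consecutive double quotes.
        System.out.println(HINT.contains("\"\"\""));  // true
        System.out.print(HINT);
    }
}
```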

Example 3.10.6-2. Escape sequences in text blocks

The following snippet of text would be less readable if the " characters were escaped:

String story = """
    "When I use a word," Humpty Dumpty said,
    in rather a scornful tone, "it means just what I
    choose it to mean - neither more nor less."
    "The question is," said Alice, "whether you
    can make words mean so many different things."
    "The question is," said Humpty Dumpty,
    "which is to be master - that's all."
    """;

If a text block is to denote another text block, then it is recommended to escape the first " of the embedded opening and closing delimiters:

String code =
    """
    String text = \"""
        A text block inside a text block
    \""";
    """;

The string represented by a text block is not the literal sequence of characters in the content. Instead, the string represented by a text block is the result of applying the following transformations to the content, in order:

1. Line terminators are normalized to the ASCII LF character, as follows:
   a. An ASCII CR character followed by an ASCII LF character is translated to an ASCII LF character.
   b. An ASCII CR character is translated to an ASCII LF character.
2. Incidental white space is removed, as if by execution of String::stripIndent on the characters resulting from step 1.
3. Escape sequences are interpreted, as if by execution of String::translateEscapes on the characters resulting from step 2.
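The three transformations can be approximated at run time on a string that holds the raw content of a text block. This is only a sketch of the order of operations, assuming Java 15 or later; the compiler performs the real transformation at compile time, and the class name TextBlockPipeline and the method name process are hypothetical.

```java
public class TextBlockPipeline {
    static String process(String rawContent) {
        // Normalize line terminators: CR LF first, then any remaining CR.
        String normalized = rawContent.replace("\r\n", "\n").replace('\r', '\n');
        // Remove incidental white space.
        String stripped = normalized.stripIndent();
        // Interpret escape sequences last.
        return stripped.translateEscapes();
    }

    public static void main(String[] args) {
        // Raw content with CR LF line endings in the source, a backslash-r
        // escape at the end of each line, and four spaces of indentation.
        String raw = "    <html>\\r\r\n        <body>\\r\r\n    </html>\\r\r\n    ";
        String result = process(raw);
        System.out.println(result.equals("<html>\r\n    <body>\r\n</html>\r\n"));  // true
    }
}
```

Because the escapes are interpreted only in the final step, the CR characters they denote survive line terminator normalization.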

Example 3.10.6-3. Order of transformations on text block content

Interpreting escape sequences last allows developers to use \n, \f, and \r for vertical formatting of a string without affecting the normalization of line terminators, and to use \b and \t for horizontal formatting of a string without affecting the removal of incidental white space. For example, consider this text block that mentions the escape sequence \r (CR):

String html = """
              <html>\r
                  <body>\r
                      <p>Hello, world</p>\r
                  </body>\r
              </html>\r
              """;

The \r escapes are not interpreted until after the line terminators have been normalized to LF. Using Unicode escapes to visualize LF (\u000A) and CR (\u000D), and using | to visualize the left margin, the final result is:

|<html>\u000D\u000A
|    <body>\u000D\u000A
|        <p>Hello, world</p>\u000D\u000A
|    </body>\u000D\u000A
|</html>\u000D\u000A

When this specification says that a text block contains a particular character or sequence of characters, or that a particular character or sequence of characters is in a text block, it means that the string represented by the text block (as opposed to the content of the text block) contains the character or sequence of characters.

At run time, a text block is a reference to an instance of class String that denotes the string represented by the text block.

A text block always refers to the same instance of class String. This is because the strings represented by text blocks - or, more generally, strings that are the values of constant expressions (15.28) - are "interned" so as to share unique instances (12.5).
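Because a text block and a string literal that represent the same string are both interned, they refer to the same instance. A minimal sketch, assuming Java 15 or later, with the hypothetical class name TextBlockIdentity:

```java
public class TextBlockIdentity {
    static final String BLOCK = """
            Hi, "Bob"
            """;

    // The same ten characters written as a string literal.
    static final String LITERAL = "Hi, \"Bob\"\n";

    // Reference comparison, not equals: both names denote one interned instance.
    static boolean sameInstance() {
        return BLOCK == LITERAL;
    }

    public static void main(String[] args) {
        System.out.println(BLOCK.length());   // 10
        System.out.println(sameInstance());   // true
    }
}
```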

Example 3.10.6-4. Text blocks evaluate to strings

Text blocks can be used wherever an expression of type String is allowed, such as in string concatenation (15.18.1), in method invocation on class String, and in annotations with String elements:

System.out.println("abc" + """
                           cde
                           """);

String math = """
              1+1 equals
              """ + " " + String.valueOf(2);

String cde = """
             abcde""".substring(2);

@Precondition("""
    rate > 0 &&
    rate <= MAX_REFRESH_RATE
""")
public void setRefreshRate(int rate) { ... }

~~3.10.6~~ 3.10.7 Escape Sequences for Character and String Literals

In character literals (3.10.4), string literals (3.10.5), and text blocks (3.10.6), the ~~character and string~~ escape sequences allow for the representation of some nongraphic characters without using Unicode escapes (3.3), as well as the single quote, double quote, and backslash characters.

EscapeSequence:: \ b (backspace BS, Unicode \u0008); \ s (space SP, Unicode \u0020); \ t (horizontal tab HT, Unicode \u0009); \ f (form feed FF, Unicode \u000c); \ n (linefeed LF, Unicode \u000a); \ r (carriage return CR, Unicode \u000d); \ LineTerminator (line continuation, no Unicode representation); \ " (double quote ", Unicode \u0022); \ ' (single quote ', Unicode \u0027); \ \ (backslash \, Unicode \u005c); OctalEscape (octal value, Unicode \u0000 to \u00ff)

...

Octal escapes are provided for compatibility with C, but can express only Unicode values \u0000 through \u00FF, so Unicode escapes are usually preferred.

It is a compile-time error if the character following a backslash in an escape sequence is not a LineTerminator or an ASCII b, s, t, f, n, r, ", ', \, 0, 1, 2, 3, 4, 5, 6, or 7.

An escape sequence in the content of a character literal, string literal, or text block is interpreted by replacing its \ and trailing characters with the single character denoted by the Unicode escape in the EscapeSequence grammar. The line continuation escape sequence has no corresponding Unicode escape, so is interpreted by replacing it with nothing.

The line continuation escape sequence may appear in a text block, but cannot appear in a character literal (3.10.4) or a string literal (3.10.5) because each disallows a LineTerminator.
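The sketch below illustrates the line continuation escape sequence and the \s escape sequence inside text blocks, assuming Java 15 or later. The class name LineContinuation and the sample text are hypothetical.

```java
public class LineContinuation {
    // A backslash at the end of a line suppresses the line terminator,
    // so the represented string is a single line with no trailing LF.
    static final String SENTENCE = """
            Lorem ipsum dolor sit amet, \
            consectetur adipiscing elit.\
            """;

    // The space escape preserves the spaces before it, which would
    // otherwise be stripped as trailing white space.
    static final String PADDED = """
            red  \s
            green\s
            """;

    public static void main(String[] args) {
        System.out.println(SENTENCE);                       // Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        System.out.println(PADDED.equals("red   \ngreen \n"));  // true
    }
}
```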

~~3.10.7~~ 3.10.8 The Null Literal

The null type has one value, the null reference, represented by the null literal null, which is formed from ASCII characters.

NullLiteral:: null

A null literal is always of the null type (4.1).

Other Changes

The following changes should also be made:

3.1, final paragraph: add mention of text blocks.
3.7, final paragraph: add mention of text blocks.
4.3.3, third paragraph: add mention of text blocks.
12.5, second paragraph, list should start as follows:

"Loading of a class or interface that contains a string literal (§3.10.5) or a text block (§3.10.6) may create a new String object to represent the string literal or text block. (This will not occur if ~~a string~~ an instance of String denoting the same sequence of Unicode code points as the string literal or text block has previously been interned.)"
15.8.1, fifth bullet: add mention of text blocks.
15.28, first bullet: add mention of text blocks:

"Literals of primitive type (§3.10.1, §3.10.2, §3.10.3, §3.10.4~~, §3.10.5~~), string literals (§3.10.5), and text blocks (§3.10.6)."
JVMS 4.7.16.1, const_value_index: rephrase from "denotes either a primitive constant value or a String literal as the value of ..." to "denotes a constant of either a primitive type or the type String as the value of ...".

Some clarification of terminology around "escapes" is desirable:

3.3: A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, ... and passing all other characters unchanged. ~~Representing supplementary characters requires two consecutive Unicode escapes.~~ This translation step results in a sequence of Unicode input characters. ... One Unicode escape can represent characters in the range U+0000 to U+FFFF. Representing supplementary characters in the range U+010000 to U+10FFFF requires two consecutive Unicode escapes.
3.5: The input characters and line terminators that result from Unicode escape processing ...
Update cross-references to 3.10.8: The Null Literal from 3.8, 3.9, 4.1, 15.8.1, and 15.12.3.