Truffle Strings Guide

Truffle Strings is Truffle’s primitive String type, which can be shared between languages. Language implementers are encouraged to use Truffle Strings as their language’s string type for easier interoperability and better performance.

TruffleString supports a plethora of string encodings, but is especially optimized for the most commonly used ones: UTF-8, UTF-16, UTF-32, US-ASCII, ISO-8859-1, and BYTES.

TruffleString API

All operations exposed by TruffleString are provided as an inner Node, and as static or instance methods. Users should use the provided nodes where possible, as the static/instance methods are just shorthands for executing their respective node’s uncached version. All nodes are named {NameOfOperation}Node, and all convenience methods are named {nameOfOperation}Uncached.

Some operations support lazy evaluation, such as lazy concatenation or lazy evaluation of certain string properties. Most of these operations provide a parameter boolean lazy, which allows the user to enable or disable lazy evaluation on a per-callsite basis.

Operations dealing with index values, such as CodePointAtIndex, are available in two variants: codepoint-based indexing and byte-based indexing. Byte-based indexing is indicated by a ByteIndex suffix or prefix in an operation’s name; otherwise, indices are based on codepoints. For example, the index parameter of CodePointAtIndex is codepoint-based, whereas CodePointAtByteIndex uses a byte-based index.
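To make the difference between the two index kinds concrete, here is a minimal sketch in plain Java that does not use the TruffleString API. The class IndexingDemo and the helper byteIndexOfCodePoint are hypothetical names chosen for illustration. The sketch assumes well-formed UTF-8 input and maps a codepoint index to the byte index at which that codepoint starts.

```java
import java.nio.charset.StandardCharsets;

final class IndexingDemo {

    // Hypothetical helper: returns the byte offset at which the codepoint with the
    // given index starts, assuming well-formed UTF-8. Continuation bytes (10xxxxxx)
    // never start a codepoint, so they are skipped while counting.
    static int byteIndexOfCodePoint(byte[] utf8, int codePointIndex) {
        int seen = 0;
        for (int i = 0; i < utf8.length; i++) {
            boolean isContinuation = (utf8[i] & 0xC0) == 0x80;
            if (!isContinuation) {
                if (seen == codePointIndex) {
                    return i;
                }
                seen++;
            }
        }
        throw new IndexOutOfBoundsException("codepoint index " + codePointIndex);
    }

    public static void main(String[] args) {
        // U+0061 takes 1 byte, U+00E9 takes 2 bytes, U+20AC takes 3 bytes in UTF-8
        byte[] utf8 = "a\u00e9\u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);                   // 6
        System.out.println(byteIndexOfCodePoint(utf8, 0)); // 0
        System.out.println(byteIndexOfCodePoint(utf8, 1)); // 1
        System.out.println(byteIndexOfCodePoint(utf8, 2)); // 3
    }
}
```

In this string, codepoint index 2 and byte index 3 refer to the same character, which is why the two operation variants must not be mixed up.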

The currently available operations are listed below, grouped by category.

Creating a new TruffleString:

Query string properties:

Comparison:

Conversion:

Accessing codepoints and bytes:

Search:

Combining:

Instantiation

A TruffleString can be created from a codepoint, a number, a primitive array or a java.lang.String.

Strings of any encoding can be created with TruffleString.FromByteArrayNode, which expects a byte array containing the already encoded string. This operation can be non-copying, by setting the copy parameter to false.

Important: TruffleStrings will assume the array content to be immutable. Do not modify the array after passing it to the non-copying variant of this operation.

import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;

abstract static class SomeNode extends Node {

    @Specialization
    static TruffleString someSpecialization(
            @Cached TruffleString.FromByteArrayNode fromByteArrayNode) {
        byte[] array = {'a', 'b', 'c'};
        return fromByteArrayNode.execute(array, 0, array.length, TruffleString.Encoding.UTF_8, false);
    }
}
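The immutability warning above comes down to array aliasing. The following plain-Java sketch illustrates the effect without using the TruffleString API. The helpers wrapWithoutCopy and wrapWithCopy are hypothetical stand-ins for a non-copying and a copying creation path, and java.nio.ByteBuffer is used only because it can share a caller's array.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class AliasingDemo {

    // Hypothetical stand-in for a non-copying creation path: keeps a reference to the caller's array.
    static ByteBuffer wrapWithoutCopy(byte[] array) {
        return ByteBuffer.wrap(array);
    }

    // Hypothetical stand-in for a copying creation path: owns a private copy of the content.
    static ByteBuffer wrapWithCopy(byte[] array) {
        return ByteBuffer.wrap(Arrays.copyOf(array, array.length));
    }

    public static void main(String[] args) {
        byte[] array = {'a', 'b', 'c'};
        ByteBuffer shared = wrapWithoutCopy(array);
        ByteBuffer copied = wrapWithCopy(array);

        array[0] = 'X'; // mutate the caller's array after handing it over

        System.out.println((char) shared.get(0)); // X: the shared view observes the mutation
        System.out.println((char) copied.get(0)); // a: the copy is unaffected
    }
}
```

A string built on a shared array has no way to notice such a change, so any property derived from the old content would silently become wrong.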

For easier creation of UTF-16 and UTF-32 strings independent of the system’s endianness, TruffleString provides TruffleString.FromCharArrayUTF16Node and TruffleString.FromIntArrayUTF32Node.
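As background on why endianness matters when a UTF-16 string is supplied as raw bytes, the following plain-Java sketch (not TruffleString code) shows that the same character has a different byte layout depending on byte order. The helper toNativeUtf16 is a hypothetical name. Treating the platform's native order as the relevant one is an assumption drawn from the wording above.

```java
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EndiannessDemo {

    // Hypothetical helper: encodes the chars as UTF-16 bytes in the platform's native byte order.
    static byte[] toNativeUtf16(char[] chars) {
        boolean littleEndian = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;
        Charset charset = littleEndian ? StandardCharsets.UTF_16LE : StandardCharsets.UTF_16BE;
        return new String(chars).getBytes(charset);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("a".getBytes(StandardCharsets.UTF_16LE))); // [97, 0]
        System.out.println(Arrays.toString("a".getBytes(StandardCharsets.UTF_16BE))); // [0, 97]
        // The result of the next line depends on the platform the program runs on.
        System.out.println(Arrays.toString(toNativeUtf16(new char[]{'a'})));
    }
}
```

Passing a char[] or int[] instead of a byte[] sidesteps this question, because the caller never has to decide on a byte layout.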

TruffleString may also be created via TruffleStringBuilder, which is TruffleString’s equivalent to java.lang.StringBuilder.

TruffleStringBuilder provides the following operations:

See the following example:

import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;
import com.oracle.truffle.api.strings.TruffleStringBuilder;

abstract static class SomeNode extends Node {

    @Specialization
    static TruffleString someSpecialization(
            @Cached TruffleStringBuilder.AppendCharUTF16Node appendCharNode,
            @Cached TruffleStringBuilder.AppendJavaStringUTF16Node appendJavaStringNode,
            @Cached TruffleStringBuilder.AppendIntNumberNode appendIntNumberNode,
            @Cached TruffleStringBuilder.AppendStringNode appendStringNode,
            @Cached TruffleString.FromCharArrayUTF16Node fromCharArrayUTF16Node,
            @Cached TruffleStringBuilder.AppendCodePointNode appendCodePointNode,
            @Cached TruffleStringBuilder.ToStringNode toStringNode) {
        TruffleStringBuilder sb = TruffleStringBuilder.create(TruffleString.Encoding.UTF_16);
        sb = appendCharNode.execute(sb, 'a');
        sb = appendJavaStringNode.execute(sb, "abc", /* fromIndex: */ 1, /* length: */ 2);
        sb = appendIntNumberNode.execute(sb, 123);
        TruffleString string = fromCharArrayUTF16Node.execute(new char[]{'x', 'y'}, /* fromIndex: */ 0, /* length: */ 2);
        sb = appendStringNode.execute(sb, string);
        sb = appendCodePointNode.execute(sb, 'z');
        return toStringNode.execute(sb); // string content: "abc123xyz"
    }
}

Encodings

Every TruffleString is encoded in a specific internal encoding, which is set during instantiation.

TruffleString is fully optimized for the following encodings: UTF-8, UTF-16, UTF-32, US-ASCII, ISO-8859-1, and BYTES.

Many other encodings are supported, but not fully optimized. To use them, they must be enabled by setting needsAllEncodings = true in the Truffle language registration.

A TruffleString’s internal encoding is not exposed. Instead of querying a string’s encoding, languages should pass an expectedEncoding parameter to all methods where the string’s encoding matters (which is almost all operations). This allows re-using string objects when converting between encodings, if a string is byte-equivalent in both encodings. A string can be converted to a different encoding using SwitchEncodingNode, as shown in the following example:

import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;

abstract static class SomeNode extends Node {

    @Specialization
    static void someSpecialization(
            @Cached TruffleString.FromJavaStringNode fromJavaStringNode,
            @Cached TruffleString.ReadByteNode readByteNode,
            @Cached TruffleString.SwitchEncodingNode switchEncodingNode,
            @Cached TruffleString.ReadByteNode utf8ReadByteNode) {

        // instantiate a new UTF-16 string
        TruffleString utf16String = fromJavaStringNode.execute("foo", TruffleString.Encoding.UTF_16);

        // read a byte with expectedEncoding = UTF-16.
        // if the string is not byte-compatible with UTF-16, this method will throw an IllegalArgumentException
        System.out.printf("%x%n", readByteNode.execute(utf16String, /* byteIndex */ 0, TruffleString.Encoding.UTF_16));

        // convert to UTF-8.
        // note that utf8String may be reference-equal to utf16String!
        TruffleString utf8String = switchEncodingNode.execute(utf16String, TruffleString.Encoding.UTF_8);

        // read a byte with expectedEncoding = UTF-8
        // if the string is not byte-compatible with UTF-8, this method will throw an IllegalArgumentException
        System.out.printf("%x%n", utf8ReadByteNode.execute(utf8String, /* byteIndex */ 0, TruffleString.Encoding.UTF_8));
    }
}

Byte-equivalency between encodings is determined with string compaction on UTF-16 and UTF-32. For example, a compacted UTF-16 string is byte-equivalent to ISO-8859-1, and if all of its characters are in the ASCII range (see CodeRange), it is also byte-equivalent to UTF-8.
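The ASCII case can be checked with the JDK's own charsets. The sketch below is plain Java, not TruffleString code, and the helper sameBytesInLatin1AndUtf8 is a hypothetical name. It shows that ASCII-only text has identical bytes in ISO-8859-1 and UTF-8, while text containing a non-ASCII Latin-1 character does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteEquivalenceDemo {

    // Hypothetical helper: true if the text encodes to identical bytes in ISO-8859-1 and UTF-8.
    static boolean sameBytesInLatin1AndUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        // ASCII-only: every character is a single identical byte in both encodings
        System.out.println(sameBytesInLatin1AndUtf8("foo"));       // true
        // U+00E9 is one byte in ISO-8859-1 but two bytes in UTF-8
        System.out.println(sameBytesInLatin1AndUtf8("caf\u00e9")); // false
    }
}
```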

To check if your code is switching encodings properly, run your unit tests with the system property truffle.strings.debug-strict-encoding-checks=true. This disables re-using string objects when switching encodings, and makes encoding checks more strict: all operations working on a single string will enforce an exact match, whereas operations working on two strings will still allow byte-equivalent re-interpretations.

All TruffleString operations with more than one string parameter require the strings to be in an encoding compatible with the result encoding. So either the strings need to be in the same encoding, or the caller must ensure that both strings are compatible with the resulting encoding. This enables callers that already know the SwitchEncodingNodes would be no-ops to skip them for footprint reasons.

import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;

abstract static class SomeNode extends Node {

    @Specialization
    static boolean someSpecialization(
            TruffleString a,
            TruffleString b,
            @Cached TruffleString.SwitchEncodingNode switchEncodingNodeA,
            @Cached TruffleString.SwitchEncodingNode switchEncodingNodeB,
            @Cached TruffleString.EqualNode equalNode) {
        TruffleString utf8A = switchEncodingNodeA.execute(a, TruffleString.Encoding.UTF_8);
        TruffleString utf8B = switchEncodingNodeB.execute(b, TruffleString.Encoding.UTF_8);
        return equalNode.execute(utf8A, utf8B, TruffleString.Encoding.UTF_8);
    }
}

String Properties

TruffleString exposes the following properties:

The following example shows how to query all properties exposed by TruffleString:

import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;

abstract static class SomeNode extends Node {

    @Specialization
    static void someSpecialization(
            TruffleString string,
            @Cached TruffleString.CodePointLengthNode codePointLengthNode,
            @Cached TruffleString.IsValidNode isValidNode,
            @Cached TruffleString.GetCodeRangeNode getCodeRangeNode,
            @Cached TruffleString.HashCodeNode hashCodeNode) {
        System.out.println("byte length: " + string.byteLength(TruffleString.Encoding.UTF_8));
        System.out.println("codepoint length: " + codePointLengthNode.execute(string, TruffleString.Encoding.UTF_8));
        System.out.println("is valid: " + isValidNode.execute(string));
        System.out.println("code range: " + getCodeRangeNode.execute(string));
        System.out.println("hash code: " + hashCodeNode.execute(string, TruffleString.Encoding.UTF_8));
    }
}

String Equality and Comparison

TruffleString objects should be checked for equality using EqualNode. Just like HashCodeNode, the equality comparison is sensitive to the string’s encoding, so before any comparison, strings should always be converted to a common encoding. Object#equals(Object) behaves analogously to EqualNode, but since this method does not have an expectedEncoding parameter, it will determine the strings’ common encoding automatically. If the strings’ encodings are not equal, TruffleString will check whether one string is binary-compatible with the other string’s encoding, and if so, compare their content. Otherwise, the strings are deemed not equal; no automatic conversion is applied.

Note that since TruffleString’s hashCode and equals methods are sensitive to string encoding, TruffleString objects must always be converted to a common encoding before, e.g., using them as keys in a HashMap.
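The pitfall with map keys can be reproduced in plain Java. The sketch below does not use TruffleString. It uses java.nio.ByteBuffer as a hypothetical stand-in for an encoding-sensitive key, since its equals and hashCode are computed from the wrapped bytes. It is a simplification: unlike TruffleString, this stand-in performs no binary-compatibility check at all.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class EncodingKeyDemo {

    // Hypothetical stand-in for an encoding-sensitive key: equality and hash depend on the encoded bytes.
    static ByteBuffer key(String text, Charset encoding) {
        return ByteBuffer.wrap(text.getBytes(encoding));
    }

    public static void main(String[] args) {
        Map<ByteBuffer, Integer> map = new HashMap<>();
        map.put(key("caf\u00e9", StandardCharsets.UTF_8), 1);

        // Same text, different encoding: the bytes differ, so the lookup misses.
        System.out.println(map.containsKey(key("caf\u00e9", StandardCharsets.ISO_8859_1))); // false
        // Same text, common encoding: the lookup hits.
        System.out.println(map.containsKey(key("caf\u00e9", StandardCharsets.UTF_8)));      // true
    }
}
```

Converting every key to one agreed encoding before insertion and lookup avoids the miss in the first case.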

TruffleString also provides three comparison nodes, CompareBytesNode, CompareCharsUTF16Node, and CompareIntsUTF32Node, to compare strings byte-by-byte, char-by-char, and int-by-int, respectively.

Concatenation

Concatenation is done via ConcatNode. This operation requires both strings to be in expectedEncoding, which is also the encoding of the resulting string. Lazy concatenation is supported via the lazy parameter. When two strings are concatenated lazily, the allocation and initialization of the new string’s internal array is delayed until another operation requires direct access to that array. Materialization of such “lazy concatenation strings” can be triggered explicitly with a MaterializeNode. This is useful to do before accessing a string in a loop, such as in the following example:

import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;

abstract static class SomeNode extends Node {

    @Specialization
    static void someSpecialization(
            TruffleString utf8StringA,
            TruffleString utf8StringB,
            @Cached TruffleString.ConcatNode concatNode,
            @Cached TruffleString.MaterializeNode materializeNode,
            @Cached TruffleString.ReadByteNode readByteNode) {
        // lazy concatenation
        TruffleString lazyConcatenated = concatNode.execute(utf8StringA, utf8StringB, TruffleString.Encoding.UTF_8, /* lazy */ true);

        // explicit materialization
        TruffleString materialized = materializeNode.execute(lazyConcatenated, TruffleString.Encoding.UTF_8);

        int byteLength = materialized.byteLength(TruffleString.Encoding.UTF_8);
        for (int i = 0; i < byteLength; i++) {
            // string is guaranteed to be materialized here, so no slow materialization code can end up in this loop
            System.out.printf("%x%n", readByteNode.execute(materialized, i, TruffleString.Encoding.UTF_8));
        }
    }
}
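For intuition about what deferring the array allocation means, here is a minimal rope-style sketch in plain Java. It is not TruffleString's actual implementation. The class LazyBytes and its methods are hypothetical names, and the design (a node that is either a flat array or a pair of children) is an assumption made purely for illustration.

```java
import java.nio.charset.StandardCharsets;

final class LazyConcatDemo {

    // Hypothetical rope-like string: either a leaf holding bytes, or a lazy pair of children.
    static final class LazyBytes {
        private byte[] bytes;          // null while the concatenation is still lazy
        private LazyBytes left, right; // set only while the concatenation is still lazy
        private final int length;

        LazyBytes(byte[] bytes) {
            this.bytes = bytes;
            this.length = bytes.length;
        }

        private LazyBytes(LazyBytes left, LazyBytes right) {
            this.left = left;
            this.right = right;
            this.length = left.length + right.length;
        }

        // Constant-time concatenation: no array is allocated or copied here.
        static LazyBytes concat(LazyBytes a, LazyBytes b) {
            return new LazyBytes(a, b);
        }

        boolean isMaterialized() {
            return bytes != null;
        }

        // Allocates the flat array once, copies both halves, and drops the child references.
        byte[] materialize() {
            if (bytes == null) {
                byte[] l = left.materialize();
                byte[] r = right.materialize();
                byte[] flat = new byte[length];
                System.arraycopy(l, 0, flat, 0, l.length);
                System.arraycopy(r, 0, flat, l.length, r.length);
                bytes = flat;
                left = null;
                right = null;
            }
            return bytes;
        }
    }

    public static void main(String[] args) {
        LazyBytes a = new LazyBytes("foo".getBytes(StandardCharsets.UTF_8));
        LazyBytes b = new LazyBytes("bar".getBytes(StandardCharsets.UTF_8));
        LazyBytes ab = LazyBytes.concat(a, b);
        System.out.println(ab.isMaterialized());                                  // false
        System.out.println(new String(ab.materialize(), StandardCharsets.UTF_8)); // foobar
        System.out.println(ab.isMaterialized());                                  // true
    }
}
```

In such a design, the first access to the flat array pays the copying cost, which is why materializing before a hot loop keeps that cost out of the loop body.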

Substrings

Substrings can be created via SubstringNode and SubstringByteIndexNode, which use codepoint-based and byte-based indices, respectively. Substrings can also be lazy, meaning that no new array is created for the resulting string; instead, the parent string’s array is re-used and accessed with the offset and length passed to the substring node. Currently, a lazy substring’s internal array is never trimmed (i.e. replaced by a new array of the string’s exact length). Note that this behavior can effectively create a memory leak whenever a lazy substring is created. An extreme example where this could be problematic: given a string that is 100 megabytes in size, any lazy substring created from this string will keep the 100-megabyte array alive, even after the original string object itself has been freed by the garbage collector. Use lazy substrings with caution.
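The retention effect described above can be sketched in plain Java. This is not TruffleString's implementation: the class Slice and its methods lazySubstring and copyingSubstring are hypothetical, and bounds checks are omitted for brevity. The sketch contrasts a substring that shares its parent's array with one that copies exactly the requested range.

```java
import java.util.Arrays;

final class LazySubstringDemo {

    // Hypothetical slice type: a view into a backing array, described by offset and length.
    static final class Slice {
        final byte[] backing;
        final int offset;
        final int length;

        Slice(byte[] backing, int offset, int length) {
            this.backing = backing;
            this.offset = offset;
            this.length = length;
        }

        // Lazy variant: shares the parent's array, so the whole array stays reachable.
        Slice lazySubstring(int from, int len) {
            return new Slice(backing, offset + from, len);
        }

        // Copying variant: allocates an array of exactly the requested length.
        Slice copyingSubstring(int from, int len) {
            int start = offset + from;
            return new Slice(Arrays.copyOfRange(backing, start, start + len), 0, len);
        }
    }

    public static void main(String[] args) {
        byte[] large = new byte[1024 * 1024]; // 1 MiB stand-in for a large string
        Slice parent = new Slice(large, 0, large.length);
        Slice lazy = parent.lazySubstring(10, 3);
        Slice copy = parent.copyingSubstring(10, 3);
        System.out.println(lazy.backing.length); // 1048576: the 3-byte slice retains the full array
        System.out.println(copy.backing.length); // 3
    }
}
```

The trade-off is the usual one: the lazy variant is cheap to create but pins the parent array, while the copying variant costs an allocation but lets the parent array be collected.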

Interoperability with java.lang.String

TruffleString provides FromJavaStringNode for converting a java.lang.String to TruffleString. To convert from TruffleString to java.lang.String, use a ToJavaStringNode. This node will internally convert the string to UTF-16, if necessary, and create a java.lang.String from that representation.

Object#toString() is implemented using the uncached version of ToJavaStringNode and should be avoided on fast paths.

import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;

abstract static class SomeNode extends Node {

    @Specialization
    static void someSpecialization(
            @Cached TruffleString.FromJavaStringNode fromJavaStringNode,
            @Cached TruffleString.SwitchEncodingNode switchEncodingNode,
            @Cached TruffleString.ToJavaStringNode toJavaStringNode) {
        TruffleString utf16String = fromJavaStringNode.execute("foo", TruffleString.Encoding.UTF_16);
        TruffleString utf8String = switchEncodingNode.execute(utf16String, TruffleString.Encoding.UTF_8);
        System.out.println(toJavaStringNode.execute(utf8String));
    }
}

TruffleString also exposes #toStringDebug() for debugging purposes. Do not use this method for anything other than debugging, as its return value is unspecified and may change at any time.

Differences to java.lang.String

The following items should be considered when switching from java.lang.String to TruffleString:

Codepoint Iterators

TruffleString provides TruffleStringIterator as a means of iterating over a string’s codepoints. This method should be preferred over using CodePointAtIndexNode in a loop, especially on variable-width encodings such as UTF-8, since CodePointAtIndexNode may have to re-calculate the byte index equivalent of the given codepoint index on every call.

See the following example:

import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.strings.TruffleString;
import com.oracle.truffle.api.strings.TruffleStringIterator;

abstract static class SomeNode extends Node {

    @Specialization
    static void someSpecialization(
            TruffleString string,
            @Cached TruffleString.CreateCodePointIteratorNode createCodePointIteratorNode,
            @Cached TruffleStringIterator.NextNode nextNode,
            @Cached TruffleString.CodePointLengthNode codePointLengthNode,
            @Cached TruffleString.CodePointAtIndexNode codePointAtIndexNode) {

        // iterating over a string's codepoints using TruffleStringIterator
        TruffleStringIterator iterator = createCodePointIteratorNode.execute(string, TruffleString.Encoding.UTF_8);
        while (iterator.hasNext()) {
            System.out.printf("%x%n", nextNode.execute(iterator));
        }

        // suboptimal variant: using CodePointAtIndexNode in a loop
        int codePointLength = codePointLengthNode.execute(string, TruffleString.Encoding.UTF_8);
        for (int i = 0; i < codePointLength; i++) {
            // performance problem: codePointAtIndexNode may have to calculate the byte index corresponding
            // to codepoint index i for every loop iteration
            System.out.printf("%x%n", codePointAtIndexNode.execute(string, i, TruffleString.Encoding.UTF_8));
        }
    }
}

Mutable Strings

TruffleString also provides a mutable string variant called MutableTruffleString, which is also accepted in all nodes of TruffleString. MutableTruffleString is not thread-safe and allows overwriting bytes in its internal byte array or native pointer via WriteByteNode. The internal array or native pointer’s content may also be modified externally, but the corresponding MutableTruffleString must be notified of this via notifyExternalMutation(). MutableTruffleString is not a Truffle interop type, and must be converted to an immutable TruffleString via TruffleString.AsTruffleStringNode before crossing a language boundary.