Raw string literals -- where we are, how we got here
brian.goetz at oracle.com
Tue Mar 27 19:15:24 UTC 2018
Now that things have largely stabilized with raw string literals, let me
summarize where we are, and how we got here.
## The proposal
Where we are now is that a raw string literal consists of an opening
delimiter which is a sequence of N consecutive backticks, for some N >
0, a body which may contain any characters (including newlines) except
for a sequence of N consecutive backticks, and a closing delimiter of N
consecutive backticks. Any line-end sequences (CR, LF, CRLF) are
normalized to a single newline (LF), and the remainder of the body is
treated without any further transformation (including without unicode
escape processing), and placed in a String. No other processing is done
on the contents.
A raw string literal has type String, just like a traditional string
literal, and can be used anywhere an expression of type String can be
used (assignment, concatenation, etc.)
    String s = `Doesn't have a \n newline character in it`;
    String ss = `a multi-
      line-string`;
    String sss = ``a string with a single tick (`) character in it``;
    String ssss = `a string with two ticks (``) in it`;
    String sssss = `````a string literal with gratuitously many ticks
    in its delimiter`````;
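To make the lexical rule above concrete, here is a rough sketch of how a scanner might apply it. This is an illustration only, not the javac implementation: the names `RawStringScanner`, `tickRun` and `scan` are made up, and the inputs are written as traditional string literals so the sketch compiles on a current JDK.

```java
// Sketch only: hypothetical names, not javac internals.
final class RawStringScanner {

    /** Length of the run of backticks starting at index from. */
    static int tickRun(String src, int from) {
        int i = from;
        while (i < src.length() && src.charAt(i) == '`') {
            i++;
        }
        return i - from;
    }

    /** Scans a raw string literal starting at index 0 and returns its value. */
    static String scan(String src) {
        int n = tickRun(src, 0);          // opening delimiter: N > 0 ticks
        if (n == 0) {
            throw new IllegalArgumentException("not a raw string literal");
        }
        int i = n;
        while (i < src.length()) {
            if (src.charAt(i) != '`') {
                i++;
                continue;
            }
            int run = tickRun(src, i);
            if (run == n) {               // closing delimiter: exactly N ticks
                String body = src.substring(n, i);
                // The only transformation: CRLF and lone CR become LF.
                return body.replace("\r\n", "\n").replace('\r', '\n');
            }
            i += run;                     // a shorter or longer run is body text
        }
        throw new IllegalArgumentException("unterminated raw string literal");
    }

    public static void main(String[] args) {
        System.out.println(scan("``a single tick (`) inside``"));
        System.out.println(scan("`two ticks (``) inside`"));
    }
}
```

Note that backslash sequences pass through untouched; line-terminator normalization is the only processing step in the sketch.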
Note that the delimiter need not be _more_ ticks than the longest tick
sequence in the body; if the body contains sequences of two ticks and
three ticks, it can be delimited by one tick, four ticks, five ticks,
etc. This makes it possible to choose a minimal delimiter that doesn't
interfere with the body.
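Picking a minimal delimiter is mechanical, which is why an IDE could do it for you. The following is a sketch of one way to do it, under the assumption that only the set of tick-run lengths in the body matters. `DelimiterChooser`, `minimalTicks` and `quote` are hypothetical names. Rejecting an empty body, or one that starts or ends with a tick, is this sketch's reading of the rule: a tick adjacent to the delimiter would simply lengthen the delimiter run.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only: hypothetical helper, not part of any JDK API.
final class DelimiterChooser {

    /** Smallest N > 0 such that the body contains no run of exactly N ticks. */
    static int minimalTicks(String body) {
        if (body.isEmpty() || body.startsWith("`") || body.endsWith("`")) {
            // Assumption of this sketch: such bodies cannot be delimited directly.
            throw new IllegalArgumentException("body cannot be delimited by ticks alone");
        }
        Set<Integer> runs = new HashSet<>();
        int i = 0;
        while (i < body.length()) {
            if (body.charAt(i) != '`') {
                i++;
                continue;
            }
            int start = i;
            while (i < body.length() && body.charAt(i) == '`') {
                i++;
            }
            runs.add(i - start);          // record the length of this tick run
        }
        int n = 1;
        while (runs.contains(n)) {
            n++;
        }
        return n;
    }

    /** Wraps the body in the minimal non-conflicting delimiter. */
    static String quote(String body) {
        int n = minimalTicks(body);
        StringBuilder ticks = new StringBuilder();
        for (int k = 0; k < n; k++) {
            ticks.append('`');
        }
        return ticks + body + ticks;
    }

    public static void main(String[] args) {
        System.out.println(quote("a body with one tick (`) in it"));
    }
}
```

A body with runs of two and three ticks gets a one-tick delimiter here, matching the note above that the delimiter need not be longer than the longest run.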
## Design Center
The design center for this feature is _raw string literals_. Not
multi-line strings (though this is well handled), not interpolated
strings (though this can be considered in the future.) It turns off all
inline escaping, even unicode escaping (which is usually handled by the
lexer before the production even sees the characters.) We stay as true
as we can to this principle: raw means raw, not 99% raw with a little
bit of escaping. (The single exception is normalizing of carriage
control, the absence of which would just be too surprising.)
The primary use case addressed by raw string literals is embedding snippets of
code from other languages in Java source files. Here we
interpret "languages" broadly; they could be traditional programming
languages, specialized languages like regular expressions or SQL, or
human languages. We want Java lexing not to interfere at all;
given a suitable O(1) incantation (picking a non-conflicting delimiter),
you can freely cut and paste the foreign string to and from Java. Being
able to do this is not only convenient, but it reduces errors due to
hand-mangling the string, and enhances readability because the embedded
snippet is free of interference from Java.
Choosing raw-ness as a design center leads to a simpler design, which is
good, but it also is _more stable_, because it leads us away from the
temptation to tweak the rules here and there in ways that might be
subjectively attractive, but that further increase the complexity of the
feature. This design choice reflects a priority choice: the high-order
bit is _no embedding anomalies_. Users don't have to reason about
whether they need to hand-mangle a snippet to avoid it being mangled by
the compiler or runtime; given a suitable choice of delimiter, there's
nothing else to think about. (IDEs can help with the "writing code"
part of this.)
The various additional features we might be tempted to put in (special
processing for leading or trailing blank lines, leading white space,
trimming to markers, etc) can instead be handled via library
functionality. Since raw string literals are Strings, we can further
process them with library code -- both JDK code and user code (though
methods on String have the advantage that they can be chained, rather
than wrapped, which most users will prefer). Adding new string
manipulation features via libraries rather than through the language is
easier, can be done by users, and is not constrained by the demands of
consistency (you can have seven different trimming methods, each with
their own definition of whitespace, if you like), whereas a language
feature has to be one-size-fits-all. Moving this complexity to the
library where possible leads to a simpler feature and more choices for
users.
#### A road not taken
We choose to divide the world of string literals first into raw and
non-raw literals; from this, multi-line strings fall out for free as we
can treat line breaks in the source file as just more raw characters.
We could have chosen, instead, to first divide the world into single and
multi-line strings, and then into raw and non-raw; this would have left
us with four choices (raw single line, raw multi-line, cooked
single-line, cooked multi-line.) This also would have been a defensible
position, but seemed to add lexical complexity for little gain.
#### The exception that proves the rule
The one exception to raw-ness is that we normalize the line terminators
to the most common (*nix) choice of a single newline, rather than using
the platform-specific line terminator on the system that happens to have
compiled the classfile. The alternative would have just been too
surprising.
Given that this feature has such a high syntax-to-substance ratio, we
should expect more than the usual number of syntax opinions. Let's start
with some consequences of our chosen design center.
#### No fixed delimiter
From the design choice above, it is a forced move to accept variable
delimiters. Otherwise, one cannot represent a string with the delimiter
in a raw string, without inventing an escaping mechanism, and subverting
our "raw means raw" goal.
The "self-embedding test" is not a mere theoretical goal. Since the
snippets we expect to paste into Java source are not randomly chosen
strings of characters, but meaningful snippets of some language, the
likelihood of wanting to represent a string that contains the chosen
delimiter goes up. Even if you are willing to dismiss "embed Java in
Java" as a serious use case (we're not), people also want a familiar
delimiter, which means something that looks like the delimiter in other
languages, further increasing the chance of collision. (For example, if
we'd picked a fixed triple quote delimiter, then you couldn't embed
Groovy or Python code, among others -- surely a real use case). Fixed
delimiters (of any length) and "raw means raw" are not compatible goals,
and we choose "raw means raw".
The credible options for variable delimiters are using a repeating
delimiter sequence (say, any number of ticks), or some sort of
user-provided nonce ("here" docs), or both. Nonces impose a higher
cognitive load on readers, and their benefit accrues mostly to corner
cases, so the more constrained option of repeating delimiters seems
preferable.
#### Why not 'just' use triple quotes
People's syntax preferences are guided by familiarity, so we should
expect suggestions to be biased towards what "similar" languages already
do. So the suggestion of using """triple quotes""" should be expected.
We've already discussed how a fixed delimiter is not acceptable. So at a
minimum, this would have to be adjusted to "three or more." While some
people find triple quotes natural (or at least familiar), others find it
offensively heavyweight. Neither crowd is going to convince the other.
#### But ticks are too light
The opposite of the "triple quotes are too heavy" argument is "ticks are
too light"; that a single tick is a lightweight character, and could go
unnoticed, especially if your monitor hasn't been cleaned for a while.
Unfortunately the quote-like delimiters in the middle of the weight
range are taken by other activities. Again, we can't satisfy the "too
light" and "too heavy" crowd at the same time; whichever we do will make
some people unhappy.
#### Why do you have to always do something new?
The quoting scheme chosen -- any number of ticks -- is actually taken
from something we all use: Markdown
(https://daringfireball.net/projects/markdown/syntax), which permits any
number of ticks to be used for infix sequences, and any different number
of ticks to be embedded. (Where we depart from Markdown is that
Markdown strips any leading and trailing newlines from multi-line tick
blocks, an appropriate trick for a page presentation language, but not
consistent with the design goal of "raw".)
#### But I want indentation stripping
When embedding a snippet of one language in another, both of which
support indentation, we are left with two choices: indent the enclosed
block exactly, which has the effect of the code "jutting out to the
left", or indent the enclosed block relative to the enclosing block,
which has the effect of having more indentation than you might want for
the enclosed block. Sometimes this doesn't matter, but sometimes it
does. Whatever we do, one of these crowds will be unhappy. When in
doubt, we stick to the principle of "raw means raw", and provide
indentation stripping via new instance methods on `String` to allow a
range of trimming options, such as `trimIndent()`.
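As a sketch of what such a library method might do, here is one possible definition of indentation stripping: remove the largest run of leading spaces and tabs common to all non-blank lines. The helper name `stripCommonIndent` and this particular definition of "indentation" are assumptions made for illustration; they are not the specification of `trimIndent()`.

```java
// Sketch only: hypothetical helper with one possible definition of "indent".
final class IndentSketch {

    /** Number of leading spaces and tabs on a line. */
    static int leadingWhitespace(String line) {
        int i = 0;
        while (i < line.length() && (line.charAt(i) == ' ' || line.charAt(i) == '\t')) {
            i++;
        }
        return i;
    }

    /** Removes the indentation shared by all non-blank lines. */
    static String stripCommonIndent(String s) {
        String[] lines = s.split("\n", -1);
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            if (!line.trim().isEmpty()) {          // blank lines do not count
                min = Math.min(min, leadingWhitespace(line));
            }
        }
        if (min == Integer.MAX_VALUE) {
            min = 0;                               // all lines blank: nothing to strip
        }
        StringBuilder out = new StringBuilder();
        for (int k = 0; k < lines.length; k++) {
            String line = lines[k];
            int cut = Math.min(min, leadingWhitespace(line));
            out.append(line.substring(cut));
            if (k < lines.length - 1) {
                out.append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripCommonIndent("    select *\n      from t\n    where x = 1"));
    }
}
```

Because this lives in a library, someone who wants a different definition (tabs counted as eight columns, say) can write their own without touching the language.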
#### But I want leading / trailing empty lines
Some people would like for the language to strip off leading and
trailing blank lines. Like indentation stripping, this is going to be
what people want sometimes, and sometimes not. Given that, again, we
can't do both, we are again guided by "raw means raw", and provide
library means to strip the extraneous newlines.
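A minimal sketch of such a library helper follows, assuming "blank" means empty or whitespace-only and that interior blank lines should be preserved. The name `stripBlankEdges` is hypothetical.

```java
// Sketch only: hypothetical helper, not a JDK method.
final class BlankLineSketch {

    /** Drops blank lines at the start and end; interior lines are untouched. */
    static String stripBlankEdges(String s) {
        String[] lines = s.split("\n", -1);
        int first = 0;
        int last = lines.length - 1;
        while (first <= last && lines[first].trim().isEmpty()) {
            first++;
        }
        while (last >= first && lines[last].trim().isEmpty()) {
            last--;
        }
        StringBuilder out = new StringBuilder();
        for (int k = first; k <= last; k++) {
            if (k > first) {
                out.append('\n');
            }
            out.append(lines[k]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripBlankEdges("\n  hello\n  world\n"));
    }
}
```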
#### But I want a marker character to make it obvious
Some people would like a margin marker character, so they can manage
margins like this:
    foo(`This is a long string
        >the characters up to, and
        >including, the bracket are stripped
        >by the compiler
        > and this line is indented`)
(Others would argue the marker character should be "|".) Again, we
believe these sorts of transforms are the purview of libraries, not
language, and will be provided.
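For illustration, a margin-stripping helper along these lines might look like the sketch below. The name `stripMargin`, the choice to take the marker as a parameter, and the decision to leave lines without a marker untouched are all assumptions of the sketch, not a description of what the library will provide.

```java
// Sketch only: hypothetical helper; the marker character is a parameter.
final class MarginSketch {

    /** On each line, strips leading whitespace up to and including the marker. */
    static String stripMargin(String s, char marker) {
        String[] lines = s.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int k = 0; k < lines.length; k++) {
            String line = lines[k];
            int i = 0;
            while (i < line.length() && Character.isWhitespace(line.charAt(i))) {
                i++;
            }
            if (i < line.length() && line.charAt(i) == marker) {
                line = line.substring(i + 1);   // drop whitespace and the marker
            }
            if (k > 0) {
                out.append('\n');
            }
            out.append(line);                   // lines without a marker are kept as-is
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "This is a long string\n    >the characters up to the marker\n    >  are stripped";
        System.out.println(stripMargin(s, '>'));
    }
}
```

Passing `'|'` instead of `'>'` serves the other camp with the same code.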
#### But people will make ASCII art
`Yes, they might.`
#### But I want to use unicode escaping
There will be library support for explicitly processing Unicode escape
sequences, or backslash escape sequences, or both.
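As a sketch of what explicit Unicode escape processing could look like in library code, the helper below replaces each `\uXXXX` sequence (a backslash, one `u`, four hex digits) with the corresponding UTF-16 code unit. It is deliberately simplified: the name `unescapeUnicode` is hypothetical, and it ignores the finer points of the lexer's rules, such as repeated `u` characters and the treatment of preceding backslashes.

```java
// Sketch only: hypothetical helper handling the simplest form of the escape.
final class UnicodeEscapeSketch {

    /** True if the four characters starting at from are hex digits. */
    static boolean isHex4(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    /** Replaces backslash-u escapes (with four hex digits) by the char they name. */
    static String unescapeUnicode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\u", i) && i + 6 <= s.length() && isHex4(s, i + 2)) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(s.charAt(i));   // anything else is copied unchanged
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescapeUnicode("\\u0041BC"));
    }
}
```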
#### But calling library methods like `longString`.trim() is ugly
You say ugly; I say simple and transparent.
#### But doing these things in libraries has to be slower and yield more
No, it doesn't.
## Anomalies and puzzlers
While the proposed scheme is lexically very simple, it does have at
least one surprising consequence, as well as at least one restriction:
- The empty string cannot be represented by a raw string literal
(because two consecutive ticks will be interpreted as a double-tick
delimiter, not a starting and ending delimiter);
- Strings containing line delimiters other than \n cannot be
represented directly by a raw string literal.
The latter anomaly is true for any scheme that is free of embedding
anomalies (escaping) and that normalizes newlines. If we chose to not
normalize newlines, we'd arguably have a worse anomaly, which is that
the carriage control of a raw string depends on the platform you
compiled it on.
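When a different line terminator really is needed, it can be reintroduced after the fact in library code. A minimal sketch, with the hypothetical helper name `withLineTerminator`, assuming the input has already been normalized to LF:

```java
// Sketch only: hypothetical helper; input is assumed to use LF line endings.
final class LineTerminatorSketch {

    /** Replaces each LF in an already-normalized string with the given terminator. */
    static String withLineTerminator(String normalized, String terminator) {
        return normalized.replace("\n", terminator);
    }

    public static void main(String[] args) {
        String crlf = withLineTerminator("line one\nline two", "\r\n");
        System.out.println(crlf.length());
    }
}
```

The result is the same on every platform, because the choice of terminator is made explicitly in the code instead of being inherited from the machine that compiled it.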
The empty-string anomaly is scary at first, but, in my opinion, is much
less of a concern than the initial surprise makes it appear. Once you
learn it, you won't forget it -- and IDEs and compilers will provide
feedback that helps you learn it. It is also easily avoided: use
traditional string literals unless you have a specific need for
raw-ness. There already is a perfectly valid way to denote the empty
string.
#### Can't these be fixed?
These anomalies can be moved around by tweaking the rules, but the
result is going to be more complicated rules and the same number (or
more) of anomalies, just in different places -- and sometimes in worse
places. While there is room to subjectively differ on which anomalies
are worse than others, we believe that the simplicity of this scheme,
and its freedom from embedding anomalies, makes it the winner.
Because we start with such a simple rule (any number of consecutive
ticks), pretty much any tweak is going to be complexity-increasing. It
seems a poor tradeoff to make the feature more complex and less
convenient for everyone, just to cater to empty strings.