Raw string literals -- restarting the discussion
brian.goetz at oracle.com
Wed Jan 2 18:21:39 UTC 2019
As many of you saw, we pulled back the Raw String Literals feature from JDK 12. The public statement is here:
So, let's restart the design discussion. First, I want to enumerate some of the process errors I think we made.
- We never really explored the full design space. The initial proposal had a reasonable syntactic strawman, and rather than explore the entire space, we mostly followed the path of refining the initial strawman, and stopped there.
- We got caught in the "linear thinking" trap with respect to the design center. We started off thinking of this feature as "raw strings", of which multi-line strings are an important sub-case, but in reality most of the user pain is over dealing with multi-line snippets of HTML, JSON, XML, or SQL, and raw-ness is secondary. We never really made this turn.
- We were too focused on getting the last 2% rather than the first 98%. (Note that for many, perhaps most language features, the last 2% is critical; for this one, which is entirely about syntactic convenience, it is not.)
Specifically, by focusing on self-embedding as a test of fitness rather than more typical use cases, we ended up in a place that was both more complex than necessary, and at the same time, still had prominent anomalies. (Anomalies are unavoidable if we are unwilling to take on a super-ugly syntax, but we do have some control over how obvious and prominent they are.)
From my "language steward" perspective, my main problem is that the two forms of string literals in the current proposal are gratuitously unrelated. They are syntactically unrelated (different delimiters and delimiter arity rules), and semantically unrelated (one must be raw and permits multiple lines; the other cannot be raw and cannot span multiple lines.) I would prefer to have a single string literal feature, with some sub-options for controlling raw-ness and/or line spanning -- with bonus points if these are orthogonal aspects. (As a sub-concern, I would strongly prefer we not burn the backtick character as a delimiter; it should be entirely possible to avoid this by building on the existing string literal mechanism.)
So, how should we evaluate success here? This feature doesn't improve the expressiveness or abstractive ability of the language at all -- it's purely about syntactic convenience. And, given that we've limped along for 20+ years without it, its lack can't be all _that_ problematic. So let's identify the use cases we care about most, and evaluate the feature through the lens of how it helps those use cases. In my opinion, these are:
- Multi-line snippets of JSON, HTML, XML, and SQL embedded in Java code as string literals. (Other languages are used too, but these constitute the majority.) These currently require escaping for quotes and for newlines, which means every such snippet requires substantial surgery. This is painful for code writers (though IDEs can do most of the lifting here), but more importantly, the result is harder to read, and it is really easy to leave out a `\n`, get the wrong result, and not notice immediately. We would like for most such snippets to be simply pastable without modification.
- Regular expressions and Windows paths routinely require escaping, which again is easy to get wrong and hard to read. (Regular expressions are hard enough to read; we don't need to make them harder.) These are typically a single line.
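To make the pain concrete, here is a small sketch of what these cases look like with today's string literals. The JSON payload, the regular expression, and the class and field names are made up for illustration; they are not drawn from any real codebase.

```java
import java.util.regex.Pattern;

final class EscapingToday {
    // A four-line JSON snippet: every embedded quote needs \" and
    // every line break needs an explicit \n plus a concatenation.
    static final String JSON =
        "{\n" +
        "  \"name\": \"Duke\",\n" +
        "  \"age\": 23\n" +
        "}";

    // The regex \\\d+ (a literal backslash followed by digits) needs
    // six backslashes once it is written as a Java string literal.
    static final Pattern BACKSLASH_DIGITS = Pattern.compile("\\\\\\d+");

    // A Windows path: every separator is doubled.
    static final String PATH = "C:\\bin\\putty";

    public static void main(String[] args) {
        System.out.println(JSON);
        System.out.println(BACKSLASH_DIGITS.matcher("\\42").matches());
        System.out.println(PATH);
    }
}
```

Dropping any one of the `\n` sequences in `JSON` still compiles and still produces a plausible-looking string, which is exactly the kind of error that goes unnoticed.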
Given that this feature is pure convenience, we'd also like to avoid excessive spending of our complexity budgets -- either language complexity or teachability. Grabbing for that last 2% at the expense of either of these is not a good trade.
Note too that there is no ideal answer here; we can see this quite clearly by looking at the variety of choices other languages have made, and each still has anomalies (e.g., Python raw strings can't end with a backslash) or forces ugly complexity on the reader (e.g., user-selected nonces in C++ raw strings, or Rust's `#` characters). This is truly a "pick your poison" game.
Let's remind ourselves of what other languages do in this area. In all these languages, raw strings can contain newlines; some have separate features for multi-line escaped strings and multi-line raw strings.
- C simulates multi-line strings by having a continuation character (backslash) in the last column, or by implicitly concatenating adjacent string literals (`"raw" "string"`). It does not support raw strings, though there is a gcc extension that emulates C++ raw strings.
- C++ supports multi-line strings through raw strings. It denotes raw strings with an `R` prefix before the quotes, and a user-selected nonce and parentheses inside the quotes: `R"NONCE(raw string)NONCE"`. The nonce may be empty, but the parens are required.
- Rust supports multi-line strings by simply allowing newline characters in an ordinary string literal. It separately supports raw string literals with an `r` prefix, followed by a variable (can be zero) number of `#` characters, a double quote, the raw string, a double quote, and the same number of `#` characters: `r##"raw string"##`.
- Python allows string literals to span multiple lines by using a three-quote (`"""`) delimiter. It allows raw string literals by prefixing the string literal with `r`. Its escaping rules for quotes in raw strings are unusual; a quote preceded by a backslash is escaped, but the backslash is left in the string. (Accordingly, a raw string cannot end with a backslash.)
- Ruby supports multi-line strings with here-docs, and raw strings using the `%q()` construct: `%q(raw string)`.
- C#, like C++, supports multi-line strings through raw strings. A raw string precedes the string literal with an `@` character: `@"raw string"`.
- Scala and Kotlin, like C++ and C#, support multi-line strings through raw strings. A raw string is delimited with triple quotes: `"""raw string"""`.
Note too that there is also room for interpretation on the meaning of "raw"; Python permits some escaping in raw strings, and Kotlin permits interpolation in raw strings.
We can divide the approaches roughly into three categories:
- Those that use user-supplied nonces (C++, here-docs). These can render 100% of embedded strings, with the costs that come with nonces: annoying to write, and imposing cognitive load to read (as nearly any sequence can be a nonce.)
- Those that use variable-sized delimiters (Rust, and our previous proposal). These are simpler, but will invariably have some anomalies.
- Those that use fixed delimiters (C#, Scala). These are simpler still, and will have more anomalies.
So, recapping our starting point and guidance:
- The primary use case is multi-line snippets of JSON, HTML, XML, and SQL. It is rare that these require true-raw-ness, but they all commonly have embedded quote characters.
- The secondary use case is truly raw strings, of which the most common offenders are small-ish -- regular expressions and Windows paths.
- We should start by trying to extend existing string literals to support raw and/or multi-line strings.
Some questions we need to answer:
- What are reasonable delimiter choices for raw and/or multi-line strings?
- Should the default treatment of multi-line strings be raw or escaped (alternately, is this one feature or two)?
- Is raw-ness a property of a string literal, or a state that can change within the literal (i.e., with embedded start-raw/end-raw escape sequences)?
- How do we embed delimiters in raw strings (escaping, doubling up, concatenation)?
- How far do we want to go to support embedding of delimiters?
Let's start by asking how we might extend the current string literal feature to support multi-line strings. Currently, a string literal starts with a double-quote, can span only a single line of source, and ends at the first unescaped double quote. How could we extend this to a multi-line string literal? Some possibilities include:
- Simply remove the constraint of "can only span a single line"; no other change to delimiters is required (the Rust approach.)
- Choose a different fixed delimiter, such as tripled quotes (`"""..."""`), doubled single-quotes (`''...''`), or a multi-character quote token (`/"..."/`).
- Use a modifier on the opening quote, such as `R"..."` or `@"..."`
- Use an embedded escape sequence, such as `"\M..."`, to opt into multi-line treatment
- Use here-docs, with a fixed or user-providable nonce
I think it's reasonable to eliminate here-docs from consideration as these are more typically associated with scripting languages.
At first blush, the simplicity of the Rust approach is attractive; just let strings span multiple lines, with no new syntax. The obvious counter-arguments are pretty weak in the current age; if you code in an IDE, as most developers do, it is not easy to accidentally leave off a closing quote, and the syntax highlighting will make it obvious if you do so anyway. But, if we look through the lens of our use cases -- such as JSON snippets -- we see that this approach fails almost completely, because you _still_ have to escape the quotes, and almost all multi-line snippets will have quotes. So, let's cross this off too. The same applies to using a letter prefix for multi-line strings; it doesn't address the primary use case.
Note too that our primary use case admits a middle-ground option: multi-line strings are not raw, but quotes need not be escaped. This is a possibility if the delimiter is anything other than a single double-quote (`"`).
So, some reasonable starting points on this front include:
- Just follow C#/Scala/Kotlin, where there's a single mechanism for both raw and multi-line, delimited by triple-quotes. Here, a single (or double) embedded quote does not necessarily need to be escaped.
- Use triple-quotes for non-raw multi-line string literals, and some sort of additional way to select raw-ness for either single- or triple-quoted string literals. (Same comment about embedded quotes.)
- Same, but use doubled or tripled single-quotes.
Within the "multiple quote" options, we can separately choose between a fixed number of quotes (e.g., 3) or a variable number (e.g., 3 or more, odd only, etc.). The trade-off here is about where the anomalies go. With the variable-number approaches, it gets harder to start or end with the delimiter character (this is not necessarily a serious anomaly, but it is a prominent one); with the fixed approach, there is more need to do something (escaping, concatenating, etc.) to embed the delimiter (though embedding triple-quotes is not all that common in our primary use cases). Also, our IDE friends have pointed out that even numbers of quotes put the IDE in a quandary as to whether the user has just typed the opening delimiter, or both the opening and closing delimiters.
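The fixed-versus-variable trade-off can be sketched as a few checks over the content to be embedded. This is only an illustration under assumed rules: the variable scheme is taken to require at least three quotes and more quotes than the longest run inside the content, and the fixed scheme is taken to be exactly three quotes. The class and method names are hypothetical.

```java
final class QuoteDelimiters {
    /** Length of the longest run of consecutive '"' characters in the content. */
    static int longestQuoteRun(String content) {
        int longest = 0;
        int current = 0;
        for (int i = 0; i < content.length(); i++) {
            current = (content.charAt(i) == '"') ? current + 1 : 0;
            longest = Math.max(longest, current);
        }
        return longest;
    }

    /** Variable scheme (assumed rule): at least 3 quotes, and longer than any run in the content. */
    static int variableDelimiterLength(String content) {
        return Math.max(3, longestQuoteRun(content) + 1);
    }

    /** The prominent anomaly of the variable scheme: a quote adjacent to the delimiter merges with it. */
    static boolean startsOrEndsWithQuote(String content) {
        return content.startsWith("\"") || content.endsWith("\"");
    }

    /** Fixed scheme (assumed: exactly 3 quotes): a run of 3 or more needs escaping or concatenation. */
    static boolean needsWorkaroundWithFixedTriple(String content) {
        return longestQuoteRun(content) >= 3;
    }
}
```

For typical JSON, where quotes appear singly, both schemes settle on three quotes and no workaround; the schemes only diverge on content that itself contains runs of quotes.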
One option is to just say that multi-line strings are also raw. We have evidence that this is not totally unworkable, as several languages have gone this way, but it does mean that for the use cases where the user wants multi-line but not raw, they must resort either to concatenation, or explicit escape processing (e.g., `"""foo""".escape()`).
Another is to allow a prefix character to indicate raw-ness; `R"foo"` or `R"""foo"""`. The prefix character approach is more extensible to other kinds of modes to string processing.
Another option is to use a different delimiter, as the current proposal does. If we were to go this way, I'd suggest we consider double or triple single-quote (which are currently illegal), rather than continuing with backtick.
A fourth option, one that has not yet been considered, is to say that raw-ness is a _state_ of processing a string literal; string literals start out escaped, but can drop into (and out of) raw-ness as they like:
String s = "This part is escaped\n, but this part\- is raw, and this part\+ is escaped again.";
String path = "\-C:\bin\putty";
This gets us to a place where multi-line-ness and raw-ness are orthogonal properties of string literals -- without requiring any new delimiters.
So, how to proceed? First, let's try to avoid focusing on our own personal preferences, or being distracted by unfamiliarity, and remember that our job here is to get to a design that's best for _tomorrow's_ Java developers and source base. (That means that, for example, we can't allow ourselves to be distracted by the fact that, say, embedded `\-` or `R"..."` is unfamiliar today. It will be familiar tomorrow, if we decide that's what would be best.)
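To show what "raw-ness as a state" would mean for whoever interprets the literal, here is a rough sketch of the processing loop. The rules are assumptions made for the sake of the example, not a worked-out proposal: in escaped mode only `\n`, `\t`, `\\`, and `\"` are recognized, `\-` enters raw mode, and in raw mode every character is literal except `\+`, which returns to escaped mode. How to write a literal `\+` inside a raw region is left open. The class and method names are hypothetical.

```java
final class RawStateSketch {
    /** Interprets the characters between the quotes of a literal under the assumed rules. */
    static String interpret(String body) {
        StringBuilder out = new StringBuilder();
        boolean raw = false; // literals start out escaped
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            boolean hasNext = i + 1 < body.length();
            if (c == '\\' && raw) {
                if (hasNext && body.charAt(i + 1) == '+') {
                    raw = false; // \+ leaves raw mode and emits nothing
                    i += 2;
                    continue;
                }
                // any other backslash is an ordinary character in raw mode
            } else if (c == '\\') {
                if (!hasNext) {
                    throw new IllegalArgumentException("dangling backslash");
                }
                char next = body.charAt(i + 1);
                i += 2;
                switch (next) {
                    case '-': raw = true; break; // \- enters raw mode and emits nothing
                    case 'n': out.append('\n'); break;
                    case 't': out.append('\t'); break;
                    case '\\': out.append('\\'); break;
                    case '"': out.append('"'); break;
                    default:
                        throw new IllegalArgumentException("unsupported escape \\" + next);
                }
                continue;
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

Under these assumed rules, a body of `\-C:\bin\putty` yields the path with single backslashes, and a `\n` that appears between `\-` and `\+` stays as two characters rather than becoming a newline.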
Here's what would be super-useful:
- Data that supports or refutes the claim that our primary use cases are embedded JSON, HTML, XML, and SQL.
- Use cases we've left out, for which we can discuss whether we want to incorporate them into our goals.
- Data (either Java or non-Java) on the use of various flavors of strings (raw, multi-line, etc) in real codebases, which might be useful to help determine, for example, whether raw and multi-line should be lumped into the same bucket or not.
The bike shed is open (but please show up with structural members, not just paint.)