# Enhancing Java String Literals Round 2

Remi Forax forax at univ-mlv.fr
Sun Jan 6 19:39:21 UTC 2019

----- Mail original -----
> De: "Brian Goetz" <brian.goetz at oracle.com>
> Cc: "amber-spec-experts" <amber-spec-experts at openjdk.java.net>
> Envoyé: Dimanche 6 Janvier 2019 18:43:19
> Objet: Re: Enhancing Java String Literals Round 2

> As Reinier pointed out on amber-dev, regex strings may routinely contain escaped
> meta-characters — +, *, brackets, etc.   So the embedded \- and \+ story has an
> obvious conflict.  While these are not the only possible characters for such
> “shift” operators, his point that this might be overkill is a good one.  So
> let’s look at options for denoting raw-ness.
>
> - Just make triple-quote strings always raw as well as multi-line-capable;
> regexes and friends would use TQ strings even though they are single line
> (Scala, Kotlin)
> - Letter prefix, such as R”…” (C++, Rust, Ruby)
> - Symbol prefix, such as @“…” (C#), or \”…” (suggestive of “distributing” the
> escaping across the string.)
> - Embedded escape sequence that switches to raw mode, but can’t be switched
> back: “\+raw string”, “\{raw}raw string”.
>
> Data from Google suggests that, in their code base, on the order of 5% of
> candidates for multi-line strings use some escape sequences (Kevin/Liam, can
> you verify?)  This suggests to me that the “just use TQ” approach is vaguely
> workable, but likely to be error-prone (5% is infrequently enough that people
> will say \t when they mean tab and discover this at runtime, and then have to
> go back and add a .escape() call.)
>
> (Of these, my current favorite is using the backslash: “cooked”, “””cooked and
> ML-capable”, \”raw”, \”””raw and ML capable”.  The use of \ suggests “the
> backslashes have been pre-added for you”, building on existing associations
> with backslash.)
>
> Are there other credible candidates that I’ve missed?

the triple single quote like in Ruby, '''...'''

the fake method call like in Lisp or Perl, quote(...), q(...) or raw(...)

Rémi

>
>
>
>> On Jan 2, 2019, at 2:00 PM, Jim Laskey <james.laskey at oracle.com> wrote:
>>
>>
>>>
>>> First of all, I would like to apologize for leading us down the garden path re
>>> Java Raw String Literals. I jumped into this feature fully enamoured with the
>>> JavaScript equivalent and, "why can't we have this in Java?"  As the proposal
>>> evolved, it became clear that what we came up with was not a good Java
>>> solution. I underestimated the concern that the original proposal was too left
>>> field and did not fit into Java very well. It's somewhat ironic that the
>>> backtick looks like a thorn.
>>>
>>> So, let's start the new year with a structured approach to the enhance string
>>> literal design. Brian gave a summary of why the old design fails. Starting with
>>> this summary, Brian and I talked out a series of critical decision points that
>>> should be given thought, if not answers, before we propose a new design. As an
>>> exercise, I supplemented these points and created a series of small decision
>>> trees (a full on decision tree would be complex and not very helpful.) I found
>>> these trees good intuition pumps for getting the design at least 80% there.
>>>
>>>
>>>
>>>
>>> Even the label Raw String Literal put the emphasis on the wrong part of the
>>> feature. What developers really want is multi-line strings. They want to be
>>> able to paste alien source into their Java programs with as little fuss as
>>> possible.
>>>
>>> String raw-ness (not translating escapes) is a tangential aspect, that may or
>>> may not be needed to implement multi-line strings. Yes, the regex and Window's
>>> file path arguments in JEP 326 are still valid, but this aspect needs to be
>>> separated from the main part of the design. Further in the discussion, we'll
>>> see that raw-ness is really a many-headed hydra, best slain one head at a time.
>>>
>>>
>>>
>>>
>>> We have to be honest. We know Java's primary market. Sure we want to embed Java
>>> in Java for writing tests. Sure there is JavaScript and CSS in web pages.
>>> Nevertheless, most uses of multi-line will be for non-complex grammars.
>>> Specifically, grammars that don't require special handling of multi-character
>>> delimiter sequences. If you can accept this, then the solution set is much
>>> smaller.
>>>
>>>
>>>
>>>
>>> This is an easy one. Familiarity is key to feature education. Radical wandering
>>> off with new syntax is not helpful to anyone but bloggers and authors.
>>>
>>>
>>>
>>>
>>> If you buy into the familiarity argument, then double quote is really only
>>> choice for a delimiter. Double quote already indicates a string literal. Single
>>> quote indicates a character. We don’t want to gratuitously burn unused symbols
>>> like backtick. Backslash works for regex but maybe not for others. Combinations
>>> and nonces just introduce new noise when our original goal was to reduce noise
>>> and complexity.
>>>
>>>
>>>
>>>
>>> Other languages avoid delimiter escape sequences by doubling up. Example,
>>> "abc""def" -> abc"def. This concept is unfamiliar to Java developers, why
>>> change now. Escape sequences are what we know.
>>>
>>>
>>>
>>>
>>> Language designers got very nervous when I suggested infinite delimiter
>>> sequences in the original proposal; lexically sacrilegious. I felt strongly
>>> that it was easy to explain and only 1 in 1M developers would ever use more
>>> than 4-5 character delimiter sequences. In round two, I have come to agree.
>>> This was taking on more complexity than is really warranted, for a use case
>>> that doesn’t come along very often. I suggest we only need single and triple
>>> double quotes. A single double quote works today, so no argument there. Double
>>> double quotes means empty string, no problem. Triple double quotes are only
>>> necessary to avoid having to escape quotes in alien source.
>>>
>>> String json = """
>>>                {
>>>                  "name": "Jean Smith",
>>>                  "age": 32,
>>>                  "location": "San Jose"
>>>                }
>>>              """;
>>>
>>> versus
>>>
>>> String json = "
>>>                {
>>>                  \"name\": \"Jean Smith\",
>>>                  \"age\": 32,
>>>                  \"location\": \"San Jose\"
>>>                }
>>>              ";
>>>
>>> This second case is where we wandered off the tracks with raw-ness. We assumed
>>> raw-ness is necessary to avoid all the backslashes. Most cases can be handled
>>> with triple double quotes.
>>>
>>> Okay, so why not more combinations? Simply because, most of the time they are
>>> not needed. On the rare occasion we do have nested triple double quotes, we can
>>> then use escape sequences.
>>>
>>> String nestedJSON = """
>>>                        \"\"\"
>>>                        {
>>>                          "name": "Jean Smith",
>>>                          "age": 32,
>>>                          "location": "San Jose"
>>>                        }
>>>                        \"\"\";
>>>                    """;
>>>
>>> or better yet, you only have to escape every third double quote
>>>
>>> String nestedJSON = """
>>>                        \"""
>>>                        {
>>>                          "name": "Jean Smith",
>>>                          "age": 32,
>>>                          "location": "San Jose"
>>>                        }
>>>                        \""";
>>>                    """;
>>>
>>> Not so evil and it's familiar.
>>>
>>>
>>>
>>>
>>> Meaning, you can only use single quotes for simple strings and triple quotes for
>>> multi-line strings. I don't have a strong opinion other than it seems like an
>>> unneeded restriction. The only argument I've heard has been for better error
>>> recovery when missing a close delimiter during parsing. My counter for that
>>> argument is that if you are processing multi-line strings then you can easily
>>> track the first newline after the opening delimiter and recover from there. I
>>> implemented that recovery in javac and worked out well.
>>>
>>>
>>>
>>>
>>>
>>> Cooked (translated escape sequences) should be the default. Why should a
>>> multi-line string be different than a simple string?  We have a solution for
>>> embedding double quote. Single quotes don't require escaping. Tabs and newlines
>>> can exist as is. Unicode characters can be either an escape sequence or the
>>> unicode character. So the only problem case is backslash. I would argue that
>>> the rare backslash can be escaped. If not, then the developer can use the
>>> raw-ness solution.
>>>
>>>
>>>
>>>
>>> If we don't translate newlines, then source is not transferable across
>>> platforms. That is, a source from one platform may not execute the same way on
>>> another platform. Translating consistently guarantees execution consistency. As
>>> a note, programming languages that didn't translate newlines in multi-line
>>> string literals typically regretted it later (Python.)
>>>
>>>
>>>
>>>
>>> With the original Raw String Literal proposal, there was concern about leading
>>> and trailing nested delimiters. If we default to cooked strings, then we use
>>> can use \".
>>>
>>>
>>>
>>>
>>> These questions have been answered numerous times and fall into the realm of
>>> library support. Same arguments as before, same outcome.
>>>
>>>
>>> To summarize the bold paths at this point;
>>> 	- multi-line strings are an extension of traditional simple strings
>>> 	- newlines in a string are no longer an error and the string can extend across
>>> 	several lines
>>> 	- error recovery can pick up at the first newline after the opening delimiter
>>> 	- multi-line strings process escape sequences (including unicode) in the same
>>> 	way as simple strings
>>> 	- multiple double quotes are handled with escape sequences
>>> 	- triple double quote delimiter is introduced to avoid escaping simple double
>>> 	quote sequences
>>>
>>> Generally, I think this is very much in the traditional Java spirit.
>>>
>>>
>>> Now, let's move on to the lesser but more interesting issue. As I stated above,
>>> raw-ness is a multi-headed beast. Raw-ness involves the turning off the
>>> translation of
>>> 	- escape sequences
>>> 	- unicode escapes
>>> 	- delimiter sequences
>>> 	- escape sequence prefix (backslash)
>>> 	- tabs and newlines (control characters in general)
>>>
>>> Sometimes we need all of the translations, sometimes few and sometimes none. In
>>> the multi-line discussion above, we see we don't need raw as much as we might
>>> have expected. Maybe for occasional backslashes, as in regex and Windows paths
>>> strings.
>>>
>>>
>>>
>>>
>>>
>>> The original Raw String Literal proposal suggested that raw-ness was a property
>>> of the whole string literal and thus we proposed an alternate delimiter syntax
>>> just to emphasize that fact. If we accept the bold path of multi-line
>>> discussion above, then alternate delimiter is out. This leaves prefixing as the
>>> best option to bless a string literal with raw-ness.
>>>
>>> At this point, I would like to suggest an alternate, maybe progressive way to
>>> think of raw-ness. Since the original proposal, I have been thinking of
>>> raw-ness as a state of processing the literal. State is certainly obvious in
>>> the scanner implementation, why not raise that to the language level? If it is
>>> a state then we should be able to enter and leave that state in some way.
>>> Escape sequences are an obvious way of transitioning translation in the string.
>>> \- and \+ are available and not currently recognized as valid escape sequences,
>>> why not \- and \+ to toggle escape processing?
>>>
>>>    String a = "cooked \-raw\+ cooked";  // cooked raw cooked - a little odd but not
>>>    so much so
>>>    String b = "abc\-\\\\\+def";         // abc\\\\def - struggling
>>>    String c = "\-abc\\\\def";           // abc\\\\def - more readable as an inner
>>>    prefix
>>>    String d = "abc\-\-def\+\+ghi";      // abc\-def\+ghi - raw on "\-" is "\" and
>>>    "-", raw off "\+" is "\" and "+"
>>>    String e = """\-"abc"\+""";          // "abc" - \- and \+ act a no-ops of sorts
>>>
>>> Comparing property vs state:
>>>
>>>    Runtime.getRuntime().exec(R""" "C:\Program Files\foo" bar""".strip());
>>>    Runtime.getRuntime().exec("""\-"C:\Program Files\foo" bar""");
>>>
>>>    System.out.println("this".matches(R"\w\w\w\w"));
>>>    System.out.println("this".matches("\-\w\w\w\w"));
>>>
>>>    String html = R"""
>>>                       <html>
>>>                           <body>
>>>                               <p>Hello World.</p>
>>>                           </body>
>>>                       </html>
>>>                  """.align();
>>>    String html = """\-
>>>                       <html>
>>>                           <body>
>>>                               <p>Hello World.</p>
>>>                           </body>
>>>                       </html>
>>>                  """.align();
>>>
>>>
>>>    String nested = """
>>>                        String EXAMPLE_TEST = "This is my small example "
>>>                                + "string which I'm going to "
>>>                                + "use for pattern matching.";
>>>                    """ +
>>>                    R"""
>>>                        System.out.println(EXAMPLE_TEST.replaceAll("\\s+", "\t"));
>>>                    """;
>>>    String nested = """
>>>                        String EXAMPLE_TEST = "This is my small example "
>>>                                + "string which I'm going to "
>>>                                + "use for pattern matching.";
>>>                        \-
>>>                        System.out.println(EXAMPLE_TEST.replaceAll("\\s+", "\t"));
>>>                        \+
>>>                    """;
>>>
>>> Hopefully, this is a good starting point for discussion. As before, I'm
>>> pragmatic about which direction we go, so feel free to comment.
>>>
>>> Cheers,
>>>
>>> -- Jim
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>