Raw string literals and Unicode escapes

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Tue Feb 27 10:55:53 UTC 2018

On 27/02/18 08:16, forax at univ-mlv.fr wrote:
> Hi John,
> see below.
> ----- Original Message -----
>> From: "John Rose" <john.r.rose at oracle.com>
>> To: "Remi Forax" <forax at univ-mlv.fr>
>> Cc: "amber-spec-experts" <amber-spec-experts at openjdk.java.net>
>> Sent: Monday, 26 February 2018 21:17:13
>> Subject: Re: Raw string literals and Unicode escapes
>> On Feb 26, 2018, at 10:43 AM, Alex Buckley <alex.buckley at oracle.com> wrote:
>>> On 2/25/2018 4:19 AM, Remi Forax wrote:
>>>> I'm late to the game, but why not use the same system as Perl, PHP, and
>>>> Ruby to solve the LTS [1], i.e.
>>>> you have a sequence that says this is the start of a raw string (%Q,
>>>> qq, m), then a character (from a predefined list), the raw string, and at
>>>> the end of the raw string the same character as at the beginning (or its
>>>> mirror).
>>>> For example, with 'raw' as the prefix for a raw string:
>>>> raw`this is a raw string`
>>>> raw'this is another raw string'
>>>> raw[yet another raw string]
>>> See "Choice of Delimiters" in the "Alternatives" section of the JEP.
>> The JEP doesn't clearly call out the goal of *no* escapes in the bulk
>> of the raw string, but that requirement (which we have adopted)
>> affects the choice of quotes in a decisive manner.  Let me try to
>> lay out the "string physics" that underlie this decision.
>> *Any* single-character end-quote will have a significant probability
>> of showing up inside the bulk of a (randomly selected) raw string.
>> How significant?  Well, let's say conservatively that raw strings
>> can have all possible characters, but the end-quote sequence
>> only shows up one out of a hundred times, per character position,
>> in raw strings.  If you are using a series of ten-character raw
>> strings (to say nothing of bigger ones), you have about a 10%
>> chance for any given raw string to contain an inconvenient
>> end-quote.
>> That percentage is significant, especially given that in some
>> cases strings will be longer and quote characters will be more
>> common, both factors increasing the failure rate beyond 10%.
>> But even a 0.1% failure rate is noticeable to users, making a
>> feature feel unreliable.
>> This generalizes to any fixed multi-character end-quote, with a
>> reduction of probability exponential in the length of the end-quote,
>> but still with a non-zero probability, of occurring in the bulk of
>> a randomly selected string.  A two-character end-quote might
>> have a probability of 10^-4, and that means you have a more
>> modest but still significant chance of failure of 10% across a
>> suite of 100 random 10-character strings, or for one random
>> 1000-character string.
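
The figures quoted above can be checked with a few lines of arithmetic. This is only a sketch under the same simplifying assumption the quoted text makes: each character position independently starts an end-quote with a fixed probability, and edge effects for multi-character end-quotes are ignored. The function name p_contains is made up for illustration.

```python
def p_contains(per_position: float, length: int) -> float:
    """Chance that at least one of `length` independent positions
    starts an end-quote, given a fixed per-position probability."""
    return 1.0 - (1.0 - per_position) ** length


# One-character end-quote, probability 1/100 per position, 10-character string.
single_in_10 = p_contains(1e-2, 10)

# Two-character end-quote, probability 10^-4 per position, 1000-character string.
double_in_1000 = p_contains(1e-4, 1000)

# Same end-quote across a suite of 100 independent 10-character strings.
double_in_suite = 1.0 - (1.0 - p_contains(1e-4, 10)) ** 100

for label, p in [("1-char quote, 10 chars", single_in_10),
                 ("2-char quote, 1000 chars", double_in_1000),
                 ("2-char quote, 100 x 10 chars", double_in_suite)]:
    print(f"{label}: {p:.4f}")
```

All three values come out a little under 0.10, which matches the "about 10%" in the quoted text. The simpler estimate (length times per-position probability) gives exactly 0.10 and slightly overstates the exact value.
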
>> Any *finite choice* of end-quotes has the same problem, with
>> a non-zero probability that decreases (but does not vanish)
>> with the number of available end-quotes.  The only way to
>> break out of the box is to allow the user an unlimited range
>> of successively "stronger" end-quotes (i.e., less likely ones).
>> (Randomly selected raw strings are easy to model, although
>> the numbers used above are an approximation to a binomial
>> distribution.  In fact, though, strings which show up non-randomly
>> in real code are *more* likely to mention end-quotes, since their
>> contents are somehow correlated to the enclosing language.)
>> You can easily demonstrate this issue by nesting Java code
>> which uses raw quotes inside of a containing raw quote.  An
>> easy first test of a proposed quoting mechanism is, "will it
>> nest?"  If not, then the quoting mechanism does not meet
>> a key requirement for raw quotes.
>> This key requirement is unconstrained pasting *without* fixups
>> (escape sequences embedded in the bulk of the quote).
>> Anything else, with some epsilon probability of requiring escapes,
>> is not truly raw, just "mostly raw".
>> In the case you propose, Remi, the probability of having an
>> un-quotable bulk string is quite high, since all of the end-quotes
>> are single characters.
>> Only a convention with an end-quote of arbitrary length is strong
>> enough to "fence in" arbitrary raw strings.  The simplest possible
>> such convention is to allow replication of a single character to
>> serve as the end-quote.  This decision toward simplicity
>> influences other features in Java raw strings, including the
>> decision to use a new character and to disallow certain
>> edge cases, notably null strings.
>> — John
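
The "replicate a single character" convention described above can be sketched as follows. This is an illustration, not the JEP's specification: it assumes that a fence longer than every run of the quote character in the body is always sufficient, and it rejects empty bodies and bodies that begin or end with the quote character, since those would merge with the fence. The names fence_for and wrap are hypothetical.

```python
import re


def fence_for(body: str, quote: str = "`") -> str:
    """Return a run of `quote` one longer than the longest run in `body`."""
    runs = re.findall(re.escape(quote) + "+", body)
    longest = max((len(run) for run in runs), default=0)
    return quote * (longest + 1)


def wrap(body: str, quote: str = "`") -> str:
    """Fence `body` without altering it; refuse the ambiguous edge cases."""
    if not body:
        raise ValueError("empty body: the two fences would merge")
    if body.startswith(quote) or body.endswith(quote):
        raise ValueError("body touches the fence with a quote character")
    fence = fence_for(body, quote)
    return fence + body + fence


# "Will it nest?": code that already uses a fenced string goes inside another.
inner = "String s = " + wrap("a`b") + ";"   # String s = ``a`b``;
outer = wrap(inner)                          # ```String s = ``a`b``;```
print(outer)
```

The body is pasted unchanged at every level; only the fence grows. A fixed set of single-character delimiters cannot offer that.
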
> I understand your point, but I disagree with your analysis.
> My own experience is that raw strings follow what I call the 'embedded languages' hypothesis,
> i.e. for any application, there is a length such that all raw strings longer than it contain only embedded programming languages.
> So beyond that length, instead of the probability of seeing a given character being virtually 1, you get the opposite effect, because programming languages (a human construct) are very regular in the set of characters they use. So you do not need a repetition of a character to avoid a statistical effect that does not occur. Being able to choose the escape character is enough.
Without diving too deep into the repeated vs. 'single but customizable'
choice, I'm also a bit suspicious of the fact that John's analysis
conservatively assumes that a snippet of text embedded in a raw string
is a random sequence of characters, in the true sense. This, to me, just
seems the wrong assumption: by definition something truly random has
high entropy, and something with high entropy is usually associated with
low information content, which is just not compatible with the use case
of 'pasting in a code snippet' (example: it's highly likely that the
prefix 'cla' will be followed by 'ss' in a Java-like snippet). I would
expect the entropy of the embedded snippet to be quite low compared to
the assumption made here, which greatly affects the probability
calculations. For the analysis to be correct, it should take into
account the _frequency_ with which a given delimiter can appear in the
various kinds of snippets that could be pasted in (and there's one such
frequency for each snippet kind); otherwise we're at risk of
overestimating the failure rate (if we pick a delimiter symbol whose
frequency is, in reality, really low) or underestimating it (if we pick
a symbol that, conversely, occurs very frequently).
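
A rough way to make the per-kind frequency concrete is to count how often each candidate delimiter occurs per character in sample snippets of each kind. The sketch below uses three tiny hand-written samples purely as placeholders. They are not measured data, and a real estimate would need a representative corpus for each snippet kind. The names delimiter_rate and samples are made up for illustration.

```python
def delimiter_rate(snippet: str, delimiter: str) -> float:
    """Occurrences of `delimiter` per character of `snippet`."""
    if not snippet:
        return 0.0
    return snippet.count(delimiter) / len(snippet)


# Toy samples, one per snippet kind (placeholders, not a corpus).
samples = {
    "java": 'String s = "hi"; char c = \'x\';',
    "shell": "echo `date` > out.txt",
    "sql": "SELECT name FROM users WHERE id = 'a'",
}

candidates = ('"', "'", "`")

rates = {
    kind: {d: delimiter_rate(text, d) for d in candidates}
    for kind, text in samples.items()
}

for kind, per_delimiter in rates.items():
    summary = ", ".join(f"{d} {r:.3f}" for d, r in per_delimiter.items())
    print(f"{kind}: {summary}")
```

Even on these toy inputs the rates differ sharply by kind: the backtick never appears in the Java or SQL samples but does in the shell one. A single assumed per-character probability cannot capture that kind of variation.
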

>> P.S. I expect IDE vendors will quickly supply useful "stretchy quotes"
>> which will resize themselves to contain whatever users throw into
>> the raw string body.  At that point backticks will feel like magic tokens
>> that never accidentally match raw string bodies.
> regards,
> Rémi
