Raw string literals and Unicode escapes
alex.buckley at oracle.com
Wed Feb 14 19:46:23 UTC 2018
On 2/13/2018 2:11 PM, John Rose wrote:
> On Feb 13, 2018, at 9:58 AM, Alex Buckley <alex.buckley at oracle.com> wrote:
>> I suspect the trickiest part of specifying raw string literals will be
>> the lexer's modal behavior for Unicode escapes. As such, I am going to
>> put the behavior under the microscope.
> For an approach to this see:
> In short: We define a so-called "preimage" for each token,
> which is the unambiguously defined sequence of UTF-16
> code points that translate to that token via \u substitution
> and line terminator normalization.
> For raw strings (only) the preimage of a token is significant.
> The backticks of a raw string (both opening and closing)
> are required to be their own preimage (no \u0060 allowed).
> And the raw string body contents are the preimage of the
> string token, not the normal token image.
> I think preimage is the trick we need here, and it settles
> a number of questions, such as those you raised.
> All of the tricky examples you raised are uniformly illegal,
> under the preimage rule for raw-string quotes.
I agree that holding on to the preimage of each InputElement (JLS 3.5)
is necessary because ` can legitimately appear in some kinds of
InputElement as an ordinary InputCharacter (derived from either the
RawInputCharacter ` or the UnicodeEscape \u0060):
1. Comment
// This Markdown processor treats ` specially.
/* This Markdown processor treats \u0060 specially. */
2. Token (and more specifically, StringLiteral)
Only if the InputElement is a Token, and more specifically a
RawStringLiteral, do we need to take the sequence of InputCharacters and
LineTerminators that constitute its RawStringBody and replace that
sequence with its preimage.
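
To make the bookkeeping concrete, here is a minimal Java sketch of a translator that keeps the preimage next to each translated character. It is only an illustration under stated assumptions, not proposed spec text or javac's implementation: the names PreimageSketch, Translated, translate, image, preimage and isOwnPreimage are all hypothetical, the escape handling follows my reading of JLS 3.3 (a backslash is eligible only after an even number of contiguous backslashes), and line terminator normalization is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

public class PreimageSketch {
    /** One translated character together with the raw source text it came from. */
    public static final class Translated {
        public final char ch;
        public final String preimage;
        Translated(char ch, String preimage) { this.ch = ch; this.preimage = preimage; }
    }

    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    /** Translates Unicode escapes in raw input, remembering the preimage of each result. */
    public static List<Translated> translate(String raw) {
        List<Translated> out = new ArrayList<>();
        int i = 0;
        int backslashes = 0; // contiguous raw backslashes immediately before position i
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean eligible = c == '\\' && backslashes % 2 == 0
                    && i + 1 < raw.length() && raw.charAt(i + 1) == 'u';
            if (eligible) {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') j++; // one or more 'u' markers
                if (j + 4 > raw.length()) {
                    throw new IllegalArgumentException("illegal Unicode escape at index " + i);
                }
                for (int k = j; k < j + 4; k++) {
                    if (!isHexDigit(raw.charAt(k))) {
                        throw new IllegalArgumentException("illegal Unicode escape at index " + i);
                    }
                }
                char ch = (char) Integer.parseInt(raw.substring(j, j + 4), 16);
                out.add(new Translated(ch, raw.substring(i, j + 4)));
                i = j + 4;
                backslashes = 0; // the escape's result takes no part in further escapes
                continue;
            }
            backslashes = (c == '\\') ? backslashes + 1 : 0;
            out.add(new Translated(c, String.valueOf(c)));
            i++;
        }
        return out;
    }

    /** The ordinary token image: the translated characters. */
    public static String image(List<Translated> ts) {
        StringBuilder sb = new StringBuilder();
        for (Translated t : ts) sb.append(t.ch);
        return sb.toString();
    }

    /** The preimage: the raw source text, which is what a raw string body would keep. */
    public static String preimage(List<Translated> ts) {
        StringBuilder sb = new StringBuilder();
        for (Translated t : ts) sb.append(t.preimage);
        return sb.toString();
    }

    /** The delimiter rule from the quoted mail: a backtick must be its own preimage. */
    public static boolean isOwnPreimage(Translated t) {
        return t.preimage.length() == 1 && t.preimage.charAt(0) == t.ch;
    }

    public static void main(String[] args) {
        List<Translated> body = translate("a\\u0060b"); // raw input: a, escape for backtick, b
        System.out.println("image:    " + image(body));
        System.out.println("preimage: " + preimage(body));
        System.out.println("middle char usable as delimiter: " + isOwnPreimage(body.get(1)));
    }
}
```

In this sketch an ordinary Token or Comment would be built from image(...), while a RawStringBody would be built from preimage(...), and a candidate delimiter would be accepted only if isOwnPreimage(...) holds for it.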
I want to say something about the delimiters of the raw string literal
now, but I'll do that in response to Jim's mail.