Raw string literals and Unicode escapes

John Rose john.r.rose at oracle.com
Tue Feb 13 22:11:06 UTC 2018

On Feb 13, 2018, at 9:58 AM, Alex Buckley <alex.buckley at oracle.com> wrote:
> I suspect the trickiest part of specifying raw string literals will be the lexer's modal behavior for Unicode escapes. As such, I am going to put the behavior under the microscope.

For an approach to this see:
  http://cr.openjdk.java.net/~jrose/jls/raw-string-pages-v4.pdf

In short:  We define a so-called "preimage" for each token,
which is the unambiguously defined sequence of UTF-16
code units in the raw input that translates to that token
via \u substitution and line terminator normalization.
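
To make the preimage idea concrete, here is a minimal Java sketch of the two translation steps that keeps, for every translated character, the raw input that produced it. It is an illustration only, not the draft's wording or javac's implementation. The names PreimageSketch and Piece are hypothetical, folding line terminators to LF is an assumption, and a malformed escape is passed through as ordinary text where a real compiler would reject it.

```java
import java.util.ArrayList;
import java.util.List;

public class PreimageSketch {

    /** One translated character together with the raw input that produced it. */
    public static final class Piece {
        public final char ch;
        public final String preimage;
        Piece(char ch, String preimage) { this.ch = ch; this.preimage = preimage; }
    }

    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    private static boolean fourHexDigitsAt(String s, int at) {
        if (at + 4 > s.length()) return false;
        for (int k = at; k < at + 4; k++) {
            if (!isHexDigit(s.charAt(k))) return false;
        }
        return true;
    }

    /** Step 1: replace Unicode escapes, remembering each character's raw spelling. */
    public static List<Piece> translateEscapes(String raw) {
        List<Piece> out = new ArrayList<>();
        int backslashRun = 0; // contiguous raw backslashes immediately before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            // A backslash can start an escape only after an even run of backslashes.
            if (c == '\\' && backslashRun % 2 == 0) {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') j++;
                if (j > i + 1 && fourHexDigitsAt(raw, j)) {
                    char decoded = (char) Integer.parseInt(raw.substring(j, j + 4), 16);
                    out.add(new Piece(decoded, raw.substring(i, j + 4)));
                    i = j + 4;
                    backslashRun = 0;
                    continue;
                }
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            out.add(new Piece(c, String.valueOf(c)));
            i++;
        }
        return out;
    }

    /** Step 2: fold CR LF and lone CR into one LF, concatenating the preimages. */
    public static List<Piece> normalizeLineTerminators(List<Piece> in) {
        List<Piece> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            Piece p = in.get(i);
            if (p.ch == '\r') {
                String pre = p.preimage;
                if (i + 1 < in.size() && in.get(i + 1).ch == '\n') {
                    pre += in.get(i + 1).preimage;
                    i++;
                }
                out.add(new Piece('\n', pre));
            } else {
                out.add(p);
            }
        }
        return out;
    }

    /** The translated characters, i.e. the normal image. */
    public static String image(List<Piece> pieces) {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) sb.append(p.ch);
        return sb.toString();
    }

    /** The raw input the pieces came from, i.e. the preimage. */
    public static String preimage(List<Piece> pieces) {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) sb.append(p.preimage);
        return sb.toString();
    }
}
```

For the raw input a\u0062c, image() yields abc while preimage() returns the original seven characters unchanged, so both views of the same token are available to the lexer.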

For raw strings (only) the preimage of a token is significant.
The backticks of a raw string (both opening and closing)
are required to be their own preimage (no \u0060 allowed).
And the raw string body contents are the preimage of the
string token, not the normal token image.
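
As a rough illustration of the delimiter rule, the sketch below accepts a raw string only when both backticks appear literally in the raw input, and returns the raw characters between them as the body. It is a simplification under stated assumptions: the class and method names are hypothetical, only a single-backtick delimiter is handled, and line terminators in the body are left untouched.

```java
public class RawDelimiterSketch {
    private static final char BACKTICK = '`';

    /**
     * Returns the body of a single-backtick raw string taken directly from the
     * raw input, or null if the input does not open and close with a literal
     * backtick. An escape that merely decodes to a backtick never counts as a
     * delimiter, and escapes inside the body are kept as written.
     */
    public static String rawBody(String raw) {
        if (raw.length() < 2 || raw.charAt(0) != BACKTICK) {
            return null; // the opening delimiter is not its own preimage
        }
        int close = raw.indexOf(BACKTICK, 1);
        if (close != raw.length() - 1) {
            return null; // no literal closing backtick at the end of the input
        }
        return raw.substring(1, close);
    }
}
```

Under these assumptions, input whose quotes are spelled \u0060 is rejected, while a \u0060 appearing between two literal backticks is simply six characters of body text.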

I think preimage is the trick we need here, and it settles
a number of questions, such as those you raised.
All of the tricky examples you raised are uniformly illegal
under the preimage rule for raw-string quotes.

— John

More information about the amber-spec-observers mailing list