RFR: 8224225: Tokenizer improvements [v2]
mcimadamore at openjdk.java.net
Thu Oct 1 13:41:15 UTC 2020
On Thu, 1 Oct 2020 13:32:42 GMT, Jim Laskey <jlaskey at openjdk.org> wrote:
>> Please review these changes to the javac scanner.
>> I recommend looking at the "new" versions of 1. UnicodeReader, then 2. JavaTokenizer and then 3. JavadocTokenizer
>> before venturing into the diffs.
>> Rationale, under the heading of technical debt and separation of concerns: There is a lot "going on" in the
>> JavaTokenizer/JavadocTokenizer that needed to be cleaned up.
>> - The UnicodeReader shouldn't really be accumulating characters for literals.
>> - A tokenizer shouldn't need to be aware of the unicode translations.
>> - There is no need for peek ahead.
>> - There were a lot of repetitive tasks that should be done in methods instead of complex expressions.
>> - Names of existing methods were often confusing.
>> To avoid disruption, I avoided changing logic, except in the UnicodeReader. There are some relics in the
>> JavaTokenizer/JavadocTokenizer that could be cleaned up but require deeper analysis.
>> Some details:
>> - UnicodeReader was reworked to provide tokenizers a running stream of unicode characters/codepoints. Steps:
>>   - characters are read from the buffer.
>>   - if the character is a '\' then check to see whether it begins a unicode escape sequence. If so, then translate.
>>   - if the character is a high surrogate then check to see whether the next character is the low surrogate. If so, then combine.
>>   - A tokenizer can test a codepoint with the isSurrogate predicate (when/if needed.)
>> The result of putting this logic on UnicodeReader's shoulders is that a tokenizer does not need any unicode awareness.
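The steps above can be sketched roughly as follows. This is an illustrative miniature, not javac's actual UnicodeReader (class and method names are invented, and it skips the doubled-backslash rule and error handling); it shows the idea of translating escapes and combining surrogate pairs before the tokenizer ever sees a character:

```java
// Minimal sketch of a reader that hands tokenizers translated codepoints.
// Hypothetical names; javac's real UnicodeReader differs in detail.
public class MiniUnicodeReader {
    static final char EOI = 0x1A;        // sentinel returned past end of buffer
    private final char[] buf;
    private int pos;

    public MiniUnicodeReader(String source) {
        this.buf = source.toCharArray();
    }

    /** Next translated codepoint, or EOI when past the end of the buffer. */
    public int nextCodePoint() {
        char c = nextChar();
        if (c == '\\' && pos < buf.length && buf[pos] == 'u') {
            // unicode escape: backslash, one or more 'u's, four hex digits
            int p = pos + 1;
            while (p < buf.length && buf[p] == 'u') p++;
            c = (char) Integer.parseInt(new String(buf, p, 4), 16);
            pos = p + 4;
        }
        if (Character.isHighSurrogate(c) && pos < buf.length
                && Character.isLowSurrogate(buf[pos])) {
            return Character.toCodePoint(c, buf[pos++]); // combine the pair
        }
        return c;
    }

    private char nextChar() {
        // No EOI sentinel stored in the buffer: just answer EOI off the end.
        return pos < buf.length ? buf[pos++] : EOI;
    }
}
```

Note how `nextChar` also illustrates the EOI point made below: nothing is appended to the buffer; reading past the end simply returns EOI.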
>> - The old UnicodeReader modified the source buffer to insert an EOI character at the end to mark the last character.
>> - This meant the buffer had to be large enough (grown) to accommodate.
>> - There really was no need since we can simply return an EOI when trying to read past the end of buffer.
>> - The only buffer mutability left behind is when reading digits.
>> - Unicode digits are still replaced with ASCII digits.
>> - This seems unnecessary, but I didn't want to risk messing around with the existing logic. Maybe someone can
>> enlighten me.
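On the digit-replacement question: one alternative to rewriting the buffer (offered here only as a sketch, not as what javac does) is `Character.digit`, which already maps any unicode decimal digit to its numeric value, so a scanner could evaluate digits without mutating the source:

```java
// Illustrative only: mapping unicode digits to values without buffer edits.
public class DigitDemo {
    /** Value of a (possibly non-ASCII) decimal digit, or -1 if not a digit. */
    public static int digitValue(int codePoint) {
        return Character.digit(codePoint, 10);
    }
}
```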
>> - The sequence '\\' is special cased in the UnicodeReader so that the sequence "\\uXXXX" is handled properly.
>>   - Thus, tokenizers don't have to special case '\\' (this happened frequently in the JavadocTokenizer.)
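The rule this special case implements is JLS §3.3: a backslash can begin a unicode escape only if it is preceded by an even number of backslashes, so "\\uXXXX" must not be translated. A standalone check of that rule (illustrative helper, not javac's code) might look like:

```java
// Sketch of the JLS 3.3 eligibility rule for unicode escapes.
// Hypothetical helper; javac's UnicodeReader implements this differently.
public class EscapeEligibility {
    /** True if buf[pos] is a backslash that legally begins a unicode escape. */
    public static boolean startsUnicodeEscape(char[] buf, int pos) {
        if (buf[pos] != '\\' || pos + 1 >= buf.length || buf[pos + 1] != 'u')
            return false;
        int backslashes = 0;
        for (int i = pos - 1; i >= 0 && buf[i] == '\\'; i--) backslashes++;
        return backslashes % 2 == 0; // even count of preceding backslashes
    }
}
```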
>> - JavaTokenizer was modified to accumulate scanned literals in a StringBuilder.
>> - This simplified/clarified the code significantly.
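The shape of that change can be illustrated with a toy scanner (invented names; the real JavaTokenizer scans far more than this): the tokenizer itself appends characters to a StringBuilder as it consumes them, instead of asking the reader to buffer the literal:

```java
// Illustrative accumulation of a scanned lexeme in the tokenizer.
public class LiteralScanner {
    /** Scan a run of letters/digits starting at pos; return the lexeme. */
    public static String scanIdentifier(String src, int pos) {
        StringBuilder sb = new StringBuilder();
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) {
            sb.append(src.charAt(pos++)); // accumulate here, not in the reader
        }
        return sb.toString();
    }
}
```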
>> - Since a lot of the functionality needed by the JavaTokenizer comes directly from a UnicodeReader, I made JavaTokenizer
>> a subclass of UnicodeReader.
>> - Otherwise, I would have had to reference "reader." everywhere or would have to create JavaTokenizer methods to
>> repeat the same logic. This was simpler and cleaner.
>> - Since the pattern "if (ch == 'X') bpos++" occurred a lot, I switched to using "if (accept('X'))" patterns.
>> - Actually, I tightened up a lot of these patterns, as you will see in the code.
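For readers unfamiliar with the idiom, an `accept` method folds the compare-and-advance into one call (a generic sketch; the method name matches the mail, the rest is illustrative):

```java
// The accept('X') idiom: test the current character and consume it on match.
public class AcceptDemo {
    private final String src;
    private int pos;

    public AcceptDemo(String src) { this.src = src; }

    /** If the current character is c, consume it and return true. */
    public boolean accept(char c) {
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    public int position() { return pos; }
}
```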
>> - There are a lot of great mysteries in JavadocTokenizer, but I think I cracked most of them. The code is simpler and
>> more modular.
>> - The new scanner is slower to warm up due to new layers of method calls (e.g. HelloWorld is 5% slower). However, once
>> warmed up, this new scanner is faster than the existing code. The JDK java code compiles 5-10% faster.
>> Previous review: https://mail.openjdk.java.net/pipermail/compiler-dev/2020-August/014806.html
> Jim Laskey has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev
> excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since
> the last revision:
> - Merge branch 'master' into 8224225
> - 8224225: Tokenizer improvements
Approved as per:
Marked as reviewed by mcimadamore (Reviewer).