RFR 8237599 : Greedy matching against supplementary chars fails to respect the region
Roger.Riggs at oracle.com
Wed Mar 25 13:56:59 UTC 2020
Interesting edge case; it would never be seen with 8-bit charsets.
On 3/21/20 3:15 AM, Ivan Gerasimov wrote:
> Gentle ping.
> The webrev was rebased to accommodate recent changes in RegExTest.java.
> The fix handles an edge case that is presumably not too common.
> Nevertheless, I think it is important to handle it correctly.
> Thanks in advance!
> On 1/22/20 8:23 PM, Ivan Gerasimov wrote:
>> Hello everyone!
>> When the input of a j.u.regex.Matcher is restricted with the region()
>> method, a region boundary can fall in the middle of a surrogate pair.
>> It turns out that greedy matching implemented in the
>> Pattern.CharPropertyGreedy class fails to recognize this edge case in
>> two scenarios:
>> 1) When it greedily consumes the input and meets the high surrogate of a
>> pair that was cut off at the end of the region, and
>> 2) When it backs off and meets the low surrogate of a pair at
>> the very beginning of the region.
>> In both cases, the engine reads the entire code point, crossing the
>> boundary of the region.
>> Instead, it should read only the half of the surrogate pair that lies
>> inside the region and ignore the other half.
>> Would you please help review the fix?
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8237599
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8237599/00/webrev/
>> Thanks in advance!
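
The two scenarios in the quoted message can be probed with a small sketch like the one below. It is not taken from the webrev: the class name RegionSurrogateDemo, the helpers splitsPair and findInRegion, and the choice of U+10437 as sample input are all assumptions made for illustration. The patterns ".+" and ".*(.)" are assumed to exercise the greedy path described above. The matcher results are printed rather than asserted, since what they show depends on whether the running JDK includes the fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegionSurrogateDemo {
    // U+10437 is encoded in UTF-16 as the surrogate pair D801 DC37 (illustrative input).
    static final String PAIR = "\uD801\uDC37";

    /** True if a region boundary at index falls between the two halves of a surrogate pair. */
    static boolean splitsPair(CharSequence s, int index) {
        return index > 0 && index < s.length()
                && Character.isHighSurrogate(s.charAt(index - 1))
                && Character.isLowSurrogate(s.charAt(index));
    }

    /** Runs find() inside the region [start, end) and returns the match, or null if none. */
    static String findInRegion(String regex, String input, int start, int end) {
        Matcher m = Pattern.compile(regex).matcher(input).region(start, end);
        return m.find() ? m.group() : null;
    }

    /** Describes a match by its length in chars, so lone surrogates are never printed raw. */
    static String describe(String match) {
        return match == null ? "no match" : "matched " + match.length() + " char(s)";
    }

    public static void main(String[] args) {
        System.out.println("boundary 1 splits the pair: " + splitsPair(PAIR, 1));

        // Scenario 1: region [0, 1) ends right after the high surrogate.
        // A match confined to the region can be at most 1 char long.
        System.out.println("scenario 1: " + describe(findInRegion(".+", PAIR, 0, 1)));

        // Scenario 2: region [1, 2) starts at the low surrogate; the trailing
        // group forces the greedy ".*" to back off toward the region start.
        System.out.println("scenario 2: " + describe(findInRegion(".*(.)", PAIR, 1, 2)));
    }
}
```

If either scenario reports a match longer than one char, the engine has read past the region boundary, which is the behaviour the fix is meant to remove.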