6990617: Regular expression doesn't match if unicode character next to a digit.
stephen.flores at oracle.com
Tue Dec 13 04:16:04 UTC 2011
I have added the regression test for the case below and added a
"continue" statement after line 1622 to get the case to pass.
I have updated the webrev.
On 12/12/2011 02:22 PM, Xueming Shen wrote:
> Hi Steve,
> The \x3[0-9] approach is interesting, it appears to solve the problem
> without paying the higher cost I originally expected (looking back, for
> example).
> The logic of initializing/setting/unsetting of "beginQuote" to
> true/false appears to be incorrect when there are multiple \Qn...\E
> in one pattern. Ln#1622 setting will always be followed by Ln#1630,
> if I read the code correctly.
> For example
> Pattern pattern =
> Matcher matcher = pattern.matcher("\t1sometext\t2sometext");
> System.out.printf("find=%b%n", matcher.find());
> will still return false?
> On 12/09/2011 10:05 PM, Stephen Flores wrote:
>> Please review the following webrev (includes new test to demonstrate
>> the bug):
>> for bug: 6990617 Regular expression doesn't match if unicode character
>> next to a digit.
>> A DESCRIPTION OF THE PROBLEM :
>> Unicode characters are represented as \\+number.
>> For instance, one could write:
>> Pattern p = Pattern.compile("\\011some text\\012");
>> Matcher m = p.matcher("\tsome text\n");
>> System.out.println(m.find()); // yields "true"
>> However, if we want to match a string with a digit next to
>> the unicode character, it doesn't match (whether we "quote"
>> the regular expression or not). Note the "1" next to the tab
>> character (unicode 011).
>> Pattern p = Pattern.compile("\\011\\Q1some text\\E\\012");
>> Matcher m = p.matcher("\t1some text\n");
>> System.out.println(m.find()); // yields "false"
>> This happens because Pattern accepts either \\0011 or \\011 for
>> the same character. From the javadoc:
>> \0nn The character with octal value 0nn (0 <= n <= 7)
>> \0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
>> Pattern.RemoveQEQuoting() does NOT handle this situation correctly.
>> The existing implementation now simply copies all ASCII.isAlnum()
>> characters when handling a quote.
>> Description of fix:
>> In the method Pattern.RemoveQEQuoting any ASCII digit at the
>> beginning of a quote will now be prefixed by "\x3" (not just \
>> because this would be a backref). 0x30 is the ASCII code for '0'.
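
For readers following the thread, here is a minimal sketch of the idea behind the fix, written outside the JDK. The class QuotedDigitSketch and the helper escapeLiteral are hypothetical names used only for illustration; this is not the code in Pattern.RemoveQEQuoting or in the webrev. It assumes the quoted text is turned into a regex fragment by backslash-escaping ASCII non-alphanumerics, and it writes a leading ASCII digit as \x3N so that a preceding octal escape such as \011 cannot absorb it.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; not the JDK's RemoveQEQuoting implementation.
class QuotedDigitSketch {

    // Hypothetical helper: converts literal text into a regex fragment.
    static String escapeLiteral(String literal) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            if (i == 0 && c >= '0' && c <= '9') {
                // '0'..'9' are 0x30..0x39, so "\x3" followed by the digit
                // is a hex escape for that same digit.
                out.append("\\x3").append(c);
            } else if (c < 128 && !Character.isLetterOrDigit(c)) {
                // A backslash before a non-alphabetic character is always legal.
                out.append('\\').append(c);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Naive concatenation: "\011" followed by "1" is parsed as the
        // octal escape \0111, which is the letter 'I', not tab + '1'.
        Pattern naive = Pattern.compile("\\011" + "1some text");
        System.out.println(naive.matcher("\t1some text").find()); // false
        System.out.println(naive.matcher("Isome text").find());   // true

        // With the leading digit written as \x31 the octal escape ends at \011.
        String fragment = escapeLiteral("1some text");
        Pattern fixed = Pattern.compile("\\011" + fragment + "\\012");
        System.out.println(fixed.matcher("\t1some text\n").find()); // true
    }
}
```

The hex form is used here for the same reason given in the description of the fix: a plain backslash before the digit would be read as a back reference.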