Empty regexp replaceall and surrogate pairs results in corrupted utf16.

Ulf Zibis Ulf.Zibis at gmx.de
Fri Jun 8 12:24:51 UTC 2012

Oops, correction:
StringBuilder sb = new StringBuilder(s1.length() * 2 + 1);
for (char c : s1.toCharArray())
    sb.append('X').append(c);
String s2 = sb.append('X').toString();
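
For comparison, here is a minimal sketch of the same loop walking code points instead of chars, so that a surrogate pair is never split. This is the behaviour Dawid expected from replaceAll("", "X"). The class and method names are made up for illustration and are not part of any JDK API.

```java
public class CodePointInterleave {
    // Hypothetical helper: puts sep at every code point boundary of s,
    // including before the first and after the last code point.
    static String interleave(String s, char sep) {
        StringBuilder sb = new StringBuilder(s.length() * 2 + 1);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);          // reads a full surrogate pair if present
            sb.append(sep).appendCodePoint(cp);
            i += Character.charCount(cp);       // advance by 1 or 2 chars
        }
        return sb.append(sep).toString();
    }

    public static void main(String[] args) {
        String s2 = interleave("AB\uD840\uDC00C", 'X');
        // The pair \uD840\uDC00 stays adjacent in the result.
        System.out.println(s2.equals("XAXBX\uD840\uDC00XCX")); // prints true
    }
}
```

The only difference from the char-based loop is the step size: Character.charCount(cp) skips both halves of a pair at once.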

On 08.06.2012 14:16, Ulf Zibis wrote:
> I tend to agree, Dawid.
> The comparison with Python's behaviour is especially telling.
> Is there any spec on whether the Java Regex API has a general contract with 16-bit chars or
> Unicode code points?
> Consider the search pattern "[AB\uD840\uDC00C]": what does it actually search for, the
> isolated occurrence of each char, or the occurrence of the code point "\uD840\uDC00"?
> Last but not least, we should think about which would be the common use case, and which would
> be easier to work around.
> (I think the current view on isolated chars is easier to work around:
> StringBuilder sb = new StringBuilder(s1.length() + 1).append('X');
> for (char c : s1.toCharArray())
>     sb.append(c).append('X');
> String s2 = sb.toString();
> )
> Additionally, I would like to discuss: "any possible zero-width position of the target String".
> If the String length is l, it is arguable that position l is not a valid position in the String.
> From the use case point of view, I think "P e t e r" as the result of "Peter".replaceAll("", " ")
> is the most useful.
> -Ulf
> On 08.06.2012 13:14, Dawid Weiss wrote:
>> I guess a lot depends on the point of view. From a historical point of
>> view (where a char[] and a String are basically unsigned values) that
>> pattern should simply process every value (index) and work like you
>> say. But from a practical point of view I think it is a bug -- it
>> corrupts the string, transforming legal Unicode into invalid values.
>> I checked with Python (3) and the behavior there is the expected one
>> (it works at the Unicode code point level rather than the surrogate level).
>> Where is the behavior of "" that you mention defined? I admit I
>> couldn't find any reference to this in the documentation:
>>> Using an empty String "" as a regex for the replaceAll() takes the
>>> advantage of the special meaning of "", in which it is interpreted as
>>> it can match any possible zero-width position of the target String
>> I'm not saying you're wrong (and that pattern is definitely not common
>> so it's probably an academic discussion) but I'd like some concrete
>> reference as to how an empty pattern should behave. To me, consistency
>> with the rest of the Pattern specification would mean that it operates
>> at a "zero width position between Unicode characters", not between any
>> char[] values, even an incorrect one or in the middle of a surrogate pair.
>> Dawid
>> On Fri, Jun 8, 2012 at 12:46 AM, Xueming Shen<xueming.shen at oracle.com>  wrote:
>>> Personally I don't think it is a bug. A j.l.String represents a sequence of
>>> UTF-16 chars. While a pair of surrogates represents a supplementary
>>> character, a single surrogate itself is still a "legal" independent entity
>>> inside a String object, the length of a String is still defined as the total
>>> number of char units, and an index value between a high surrogate and a low
>>> surrogate is still a legal index value that can be used to access the char
>>> at that particular position. Using an empty String "" as a regex for the
>>> replaceAll() takes the advantage of the special meaning of "", in which it
>>> is interpreted as it can match any possible zero-width position of the
>>> target String. It does not imply anything regarding the "character" or
>>> "characters" around it, so I would not interpret it as a zero-width
>>> character boundary; therefore a "position" in between a pair of surrogates
>>> is still a good "found" for replacing.
>>> -Sherman
>>> On 6/7/2012 1:07 PM, Dawid Weiss wrote:
>>>> Hi, I'm a committer to the Apache Lucene project. We have randomized
>>>> tests and one seed hit the following (simplified) scenario:
>>>>     String s1 = "AB\uD840\uDC00C";
>>>>     String s2 = s1.replaceAll("", "X");
>>>> the input contains an extended unicode character (any surrogate pair
>>>> will do). The pattern is an empty string (in fact, it was randomized
>>>> as "]|" but it's the same problem so I omit the details). The problem
>>>> is that after applying this pattern, replaceAll inserts X in between
>>>> the surrogate pair characters and this results in invalid UTF-16:
>>>> XAXBX\uD840X\uDC00XCX
>>>> I believe this is a bug in the regexp implementation (sorry, don't
>>>> have a patch for it) but I'd like to confirm it's not something known.
>>>> Pointers appreciated.
>>>> Dawid
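
To make the reported corruption easy to check, here is a minimal sketch that runs the scenario from the original report and scans the result for surrogates that are no longer part of a well-formed pair. The class and helper names are hypothetical. The value printed for s2 depends on how the JDK in use handles the empty pattern, so no expected output is given for it.

```java
public class SurrogateCheck {
    // Hypothetical helper: true if s contains a high surrogate not followed
    // by a low surrogate, or a low surrogate not preceded by a high one.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";
        String s2 = s1.replaceAll("", "X");
        System.out.println(hasUnpairedSurrogate(s1)); // prints false: the input is valid UTF-16
        System.out.println(hasUnpairedSurrogate(s2)); // true if X was inserted inside the pair
    }
}
```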

More information about the core-libs-dev mailing list