Empty regexp replaceall and surrogate pairs results in corrupted utf16.
xueming.shen at oracle.com
Fri Jun 8 18:36:57 UTC 2012
On 06/08/2012 05:16 AM, Ulf Zibis wrote:
> Is there any spec whether the Java Regex API has a general contract
> with 16-bit chars or Unicode codepoints?
The regex spec says Pattern and Matcher work on a character sequence, with
reference to the CharSequence interface, but the pattern itself does support
Unicode characters via various regex constructs and flags. An empty String
pattern is really a corner case here: it does not say anything about
"character", and the current implementation interprets it as matching at
each and every stop when you iterate through the target CharSequence. That
might not be desirable for some use scenarios, but it is not unreasonable.
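
To make "each and every stop" concrete, here is a minimal sketch that lists
where the empty pattern matches in "Peter". The class name and the helper
interleaveByCodePoint are hypothetical, introduced only for illustration, and
the helper assumes Java 8 or later (String.codePoints()). It is one possible
way to walk by Unicode code point instead of by char, not something the regex
API does for you.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyPatternDemo {
    // Collect the start index of every match of the empty pattern in s.
    static List<Integer> emptyMatchPositions(String s) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("").matcher(s);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    // Hypothetical helper (assumption, not part of java.util.regex):
    // put sep before, between, and after the code points of s, so a
    // surrogate pair is always kept together.
    static String interleaveByCodePoint(String s, String sep) {
        StringBuilder sb = new StringBuilder(sep);
        s.codePoints().forEach(cp -> sb.appendCodePoint(cp).append(sep));
        return sb.toString();
    }

    public static void main(String[] args) {
        // One match per char index, plus one at the end of the input.
        System.out.println(emptyMatchPositions("Peter"));             // [0, 1, 2, 3, 4, 5]
        System.out.println("[" + "Peter".replaceAll("", " ") + "]"); // [ P e t e r ]

        // "a", one supplementary character (a surrogate pair), then "b".
        String mixed = "a\uD83D\uDE00b";
        // Where the empty pattern stops here depends on the JDK in use,
        // so no expected output is given for this line.
        System.out.println(emptyMatchPositions(mixed));
        // The code point walk never stops inside the surrogate pair.
        System.out.println("[" + interleaveByCodePoint(mixed, " ") + "]");
    }
}
```

The code point walk trades the generality of replaceAll for a guarantee that
the result is still well-formed UTF-16.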
> Additionally I'd like to discuss: "any possible zero-width position of
> the target String"
> If the String length is l, it is arguable that position l is not a
> valid position in the String.
If you consider the "boundary matcher" regex constructs, it might make
sense to treat this "invalid position" as a valid one when using regex.
I think most other regex engines do the same thing, for example Perl:
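my $mystring = "Peter";    # assumed input, inferred from the output below
$mystring =~ s// /g;       # empty pattern matches at every position
printf "[%s]\n", $mystring;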
[ P e t e r ]
But I have to say you might have a point here:-)
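
The boundary-matcher argument can be checked with a short sketch: "$" and
"\b" are zero-width constructs that match at index length(), i.e. at the
position l discussed above. The class and method names are hypothetical and
used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Return the index of the first match of regex in s, or -1 if none.
    static int firstMatchStart(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String s = "Peter";
        // "$" matches only at the end of the input, which is index s.length().
        System.out.println(firstMatchStart("$", s) == s.length()); // true
        System.out.println(s.replaceAll("$", "!"));                // Peter!
        // "\b" matches at both word boundaries: index 0 and index s.length().
        System.out.println(s.replaceAll("\\b", "|"));              // |Peter|
    }
}
```

If position l were not a valid match position, neither "$" nor a trailing
"\b" could ever match a non-empty input.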
> From the use case point of view, I think "P e t e r" as result of
> "Peter".replaceAll("", " ") is the most useful.