<i18n dev> Fwd: Some differences on Window UDC area

Charles Lee littlee at linux.vnet.ibm.com
Wed May 23 01:03:21 PDT 2012

Hi guys,

We have a simple test case:

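// Needs java.nio.ByteBuffer, java.nio.CharBuffer,
// java.nio.charset.Charset and java.nio.charset.CharsetEncoder.
for (String cname : new String[] { "GBK", "MS936", "GB18030" }) {
         Charset charset = Charset.forName(cname);
         System.out.println("charset: " + charset.name());
         CharsetEncoder ce = charset.newEncoder();
         char[] chars = new char[] { 0xE585, 0xE586, 0xE592 };
         CharBuffer cb = CharBuffer.wrap(chars);
         ByteBuffer bb = ce.encode(cb);

         // Print the input code points, then the encoded bytes.
         for (char c : chars) {
             System.out.printf("\\u%04x", (int) c);
         }
         System.out.print(" -> ");

         // Skip the zero padding in the backing array.
         for (byte b : bb.array()) {
             if (b != 0x0) {
                 System.out.printf("\\x%02x", (int) b & 0xFF);
             }
         }
         System.out.println();
}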

The output is
charset: GBK
\ue585\ue586\ue592 -> \xa2\xa0\xa2\xab\xa3\x40
charset: x-mswin-936
\ue585\ue586\ue592 -> \xa2\xa0\xa2\xab\xa3\x40
charset: GB18030
\ue585\ue586\ue592 -> \xa2\xa0\xa3\x40\xa3\x4c

According to MSDN[1], U+E000 to U+F8FF is the EUDC range, so U+E586 is 
in the EUDC range. But the code it maps to in MS936/GBK is 0xA2AB, which 
is not in the EUDC range.
With another simple test case, you can find more code points that are 
not mapped correctly:

for (int i = 0xE000; i < 0xE000 + 1894; i++) {
         String s = new String(new char[] { (char) i });
         byte[] bs = s.getBytes("MS936");
         int b0 = (int) bs[0] & 0xFF;
         int b1 = (int) bs[1] & 0xFF;
         // Skip codes that fall in one of the user-defined areas.
         if ((b0 >= 0xAA && b0 <= 0xAF) && (b1 >= 0xA1 && b1 <= 0xFE))
             continue;
         if ((b0 >= 0xF8 && b0 <= 0xFE) && (b1 >= 0xA1 && b1 <= 0xFE))
             continue;
         if ((b0 >= 0xA1 && b0 <= 0xA7) && (b1 >= 0x40 && b1 <= 0xA0))
             continue;
         // Anything left is mapped outside the user-defined areas.
         System.out.printf("\\u%04X -> \\x%02X\\x%02X%n", i, b0, b1);
}
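
The range check in the test above can be pulled into a small helper. This is only a sketch: the class name Ms936Uda and its method names are made up for illustration, the three ranges are the ones used in the test, and excluding trail byte 0x7F from the third range is an assumption (0x7F is not used as a trail byte in GBK). With that assumption the three areas cover 1894 positions, which appears to be where the loop bound comes from.

```java
public class Ms936Uda {
    // Hypothetical helper: true if the double-byte code b0 b1 lies in one
    // of the three user-defined areas checked in the test above.
    public static boolean isUserDefined(int b0, int b1) {
        boolean uda1 = b0 >= 0xAA && b0 <= 0xAF && b1 >= 0xA1 && b1 <= 0xFE;
        boolean uda2 = b0 >= 0xF8 && b0 <= 0xFE && b1 >= 0xA1 && b1 <= 0xFE;
        // Assumption: 0x7F is never a trail byte, so it is excluded here.
        boolean uda3 = b0 >= 0xA1 && b0 <= 0xA7
                && b1 >= 0x40 && b1 <= 0xA0 && b1 != 0x7F;
        return uda1 || uda2 || uda3;
    }

    // Counts the code positions covered by the three areas.
    public static int countPositions() {
        int n = 0;
        for (int b0 = 0x81; b0 <= 0xFE; b0++) {
            for (int b1 = 0x40; b1 <= 0xFE; b1++) {
                if (isUserDefined(b0, b1)) {
                    n++;
                }
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(isUserDefined(0xA2, 0xA0)); // code seen for U+E585
        System.out.println(isUserDefined(0xA2, 0xAB)); // code seen for U+E586
        System.out.println(countPositions());
    }
}
```

With these ranges, 0xA2A0 is inside the third area and 0xA2AB is not, which is the mismatch reported for U+E586.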

I have written a generator in C#[2] that outputs the mappings for 
GB2312[3] and GB18030[4] in the range U+E000 to U+F8FF, and found that 
most of the codes are the same. I therefore suggest that we follow the 
codes from GB2312. The changed map files for OpenJDK can be found at 
[5] and [6].

Could anyone help take a look at this issue?

[2] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/Program.cs
[3] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/gb2312Map.txt
[4] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/gb18030Map.txt
[5] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/GBK.map.new
[6] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/MS936.map.new

P.S: Sorry for the late notice.

On 03/29/2011 03:00 PM, Charles Lee wrote:
> On 03/28/2011 11:06 PM, Alan Bateman wrote:
>> Charles Lee wrote:
>>> :
>>> It looks similar. How can I find the patch quickly? I notice it says 
>>> "the list is attached to this CR". Is it CR-6183404? Since cr has 
>>> the pattern cr.openjdk.java.net/~username/id, how can I know who is 
>>> the committer to this CR?
>> cr.openjdk.java.net is the place where we push webrevs when a patch 
>> is out for review. I don't think this one is on anyone's list for 
>> jdk7 and the list attached to the bug is likely the list of incorrect 
>> mappings. If this is fixed then I assume the fix will update the 
>> mappings in jdk/make/tools/CharsetMapping/MS936.map.
>> -Alan
> I have output more code points[1] to see whether the others are encoded 
> correctly. Unfortunately, they are not. It looks as though, on 
> Windows, the PUA mappings of ms936 follow those of gb18030. Wikipedia 
> says gb18030 is compatible with gbk, which ms936 implements. Can we 
> conclude that ms936 should follow gb18030's behavior?
> [1] 0xE585, 0xE586, 0xE587, 0xE588, 0xE589, 0xE58a, 0xE58b, 0xE58c, 
> 0xE58d, 0xE58e, 0xE58f, 0xE590, 0xE591, 0xE592,  0xE593, 0xE594, 
> 0xE595, 0xE596, 0xE597,  0xE598, 0xE599,  0xE59a, 0xE59b, 0xE59c, 
> 0xE59d, 0xE59e, 0xe79f.
> Using MS936 charset, we expect:
> \xa2\xa0\xa3\x40\xa3\x41\xa3\x42\xa3\x43\xa3\x44\xa3\x45\xa3\x46\xa3\x47\xa3\x48\xa3\x49\xa3\x4a\xa3\x4b\xa3\x4c\xa3\x4d\xa3\x4e\xa3\x4f\xa3\x50\xa3\x51\xa3\x52\xa3\x53\xa3\x54\xa3\x55\xa3\x56\xa3\x57\xa3\x58\xa6\xfe
> but we got:
> \xa2\xa0\xa2\xab\xa2\xac\xa2\xad\xa2\xae\xa2\xaf\xa2\xb0\xa2\xe3\xa2\xe4\xa2\xef\xa2\xf0\xa2\xfd\xa2\xfe\xa3\x40\xa3\x41\xa3\x42\xa3\x43\xa3\x44\xa3\x45\xa3\x46\xa3\x47\xa3\x48\xa3\x49\xa3\x4a\xa3\x4b\xa3\x4c\xa7\xa0
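
Most of the expected bytes quoted above are consistent with a simple linear layout, in which the code points from U+E000 onward are assigned to the three user-defined areas in order. The sketch below computes that layout. The class and method names are hypothetical, and the order of the areas and the skipped 0x7F trail byte are assumptions inferred from the expected values in this thread, not taken from the JDK map files.

```java
public class LinearPua {
    // Hypothetical: expected double-byte code for a PUA code point, assuming
    // U+E000.. fills AAA1-AFFE first, then F8A1-FEFE, then A140-A7A0.
    public static int expectedBytes(int cp) {
        int idx = cp - 0xE000;
        if (idx < 0 || idx >= 1894) {
            throw new IllegalArgumentException("outside the linear range");
        }
        if (idx < 564) {                 // 6 rows of 94 positions
            return pack(0xAA + idx / 94, 0xA1 + idx % 94);
        }
        idx -= 564;
        if (idx < 658) {                 // 7 rows of 94 positions
            return pack(0xF8 + idx / 94, 0xA1 + idx % 94);
        }
        idx -= 658;                      // 7 rows of 96 positions
        int trail = 0x40 + idx % 96;
        if (trail >= 0x7F) {
            trail++;                     // assumption: 0x7F is skipped
        }
        return pack(0xA1 + idx / 96, trail);
    }

    private static int pack(int lead, int trail) {
        return (lead << 8) | trail;
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0xE585, 0xE586, 0xE592 }) {
            System.out.printf("U+%04X -> %04X%n", cp, expectedBytes(cp));
        }
    }
}
```

For U+E585, U+E586 and U+E592 this gives A2A0, A340 and A34C, the same bytes that GB18030 produces in the first test case.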

Yours Charles
