Codereview request for 6653797: Reimplement JDK charset repository charsets.jar

Xueming Shen xueming.shen at
Mon Jul 16 17:13:22 UTC 2012

On 7/16/2012 9:57 AM, Ulf Zibis wrote:
> Hi Sherman,
> as I just said for 7183053, I can't look into the details at the moment,
> as I do not have a suitable environment installed.
> Everything I can see looks reasonable.
> Regarding part 4 of bug 6653797, there is still an existing adaptor from
> my side, if desired.

The has been removed. That will be an alternative if we hear any 


> Just one comment: I think it should be possible to share the mapping
> data partly across charsets, so that charsets.jar would shrink even
> further?
> -Ulf
> On 16.07.2012 00:12, Xueming Shen wrote:
>> Hi
>> This changeset includes the migration of our JIS0201/0208/0212 based
>> single/double-byte charsets to the new mapping based implementation.
>> This is the left-over of the effort we put into JDK7 to re-implement
>> most of our charsets in charsets.jar to (1) have better performance,
>> (2) have a smaller storage footprint and (3) ease the maintenance
>> burden.
>> Notes on the implementation:
>> (1) jis0201/0208/0212 and their variants are now generated from the
>> mapping tables at build time. (See the new .map, *.nr and *.c2b tables.)
>> (2) EUC_JP/LINUX_OPEN, SJIS, PCK, ISO2022_JP and its variants now use
>> these new jis0201/0208/0212 charsets.
>> (3) Those in red (in the webrev) are the old implementation. Since no
>> charset uses them anymore, I removed them from the repository.
>> (4) There are two approaches for PCK and SJIS. PCK.java_v1 and
>> SJIS.java_v1 are the ones that follow the old implementation, which
>> decodes/encodes based on the jis0201/0208 (and the variants) mapping
>> via Ken's algorithm. This is known to be slow and buggy (the algorithm
>> does not take care of illegal sjis cp, see #6653797
>> and
>> So in this changeset, I generated the direct mapping tables for sjis
>> and pck and use the "general" DoubleByte base class for these two
>> charsets. This results in much faster decoding/encoding and correct
>> mapping for all code points. The downside of this approach is that it
>> adds about 50K of uncompressed size to the charsets.jar. But given that
>> this change actually removes about 300K from the rt.jar, we still get
>> a net saving of about 250K, so I decided to go with this approach for
>> better performance.
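To make the trade-off in (4) concrete, here is a minimal sketch that contrasts the two approaches. It is an illustration only, not code from the webrev: the class and method names are hypothetical, the arithmetic is the commonly published Shift_JIS to JIS X 0208 conversion (assumed here to be the kind of algorithm the message refers to), and the direct table holds just three entries instead of a generated mapping. The real DoubleByte class may lay out its tables differently.

```java
import java.util.Arrays;

// Hypothetical sketch: arithmetic SJIS -> JIS X 0208 conversion versus a
// direct byte-pair -> char lookup table. Not the JDK implementation.
public class SjisDecodeSketch {
    static final char UNMAPPABLE = '\uFFFD';

    // Arithmetic approach: converts a lead/trail byte pair to a JIS X 0208
    // code without checking that the pair is a legal Shift_JIS sequence.
    public static int sjisToJis0208(int s1, int s2) {
        int adjust = (s2 < 0x9F) ? 1 : 0;
        int rowOffset = (s1 < 0xA0) ? 0x70 : 0xB0;
        int cellOffset = (adjust == 1) ? ((s2 > 0x7F) ? 0x20 : 0x1F) : 0x7E;
        int j1 = ((s1 - rowOffset) << 1) - adjust;
        int j2 = s2 - cellOffset;
        return (j1 << 8) | j2;
    }

    // Direct approach: one row of 256 chars per lead byte, filled with an
    // "unmappable" marker so that illegal pairs are rejected by the lookup.
    private static final char[][] B2C = new char[256][];

    static {
        put(0x81, 0x40, '\u3000'); // ideographic space
        put(0x82, 0xA0, '\u3042'); // hiragana A
        put(0x88, 0x9F, '\u4E9C'); // first level-1 kanji
    }

    private static void put(int b1, int b2, char c) {
        if (B2C[b1] == null) {
            B2C[b1] = new char[256];
            Arrays.fill(B2C[b1], UNMAPPABLE);
        }
        B2C[b1][b2] = c;
    }

    public static char decodeDirect(int b1, int b2) {
        char[] row = B2C[b1 & 0xFF];
        return (row == null) ? UNMAPPABLE : row[b2 & 0xFF];
    }

    public static void main(String[] args) {
        // 0x7F is not a legal trail byte, yet the arithmetic yields a code.
        System.out.printf("arithmetic 81 7F -> %04X%n", sjisToJis0208(0x81, 0x7F));
        System.out.printf("direct     81 7F -> %04X%n", (int) decodeDirect(0x81, 0x7F));
    }
}
```

In this sketch the arithmetic maps the illegal pair 81 7F to the same JIS code as the legal pair 81 80, while the table lookup reports it as unmappable. The table costs memory per lead byte, which is the kind of size increase the message describes.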
>> There appear to be lots of files (80+) in the webrev, but that number
>> includes the removed old implementation and the tests I put in to
>> guarantee identical de/encoding results from the old and new
>> implementations (those OLD... test cases); the change is actually not
>> that big :-) So please help review. I can then put this multi-year
>> effort to rest.
>> -Sherman
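The old-versus-new equivalence check mentioned above could look roughly like the following. This is a hedged sketch, not the actual test code from the webrev: the class and method names are hypothetical, it relies only on the public java.nio.charset API, and it assumes both implementations are available as Charset instances (how the old one is obtained is not shown in the message).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: exhaustively compare two Charset implementations.
public class CharsetCompareSketch {

    // Returns the first 1- or 2-byte input (as hex) that the two charsets
    // decode differently, or null if all such inputs decode identically.
    public static String firstDecodeMismatch(Charset a, Charset b) {
        for (int b1 = 0; b1 < 0x100; b1++) {
            byte[] one = { (byte) b1 };
            if (!new String(one, a).equals(new String(one, b))) {
                return String.format("%02X", b1);
            }
        }
        for (int v = 0; v < 0x10000; v++) {
            byte[] two = { (byte) (v >> 8), (byte) v };
            if (!new String(two, a).equals(new String(two, b))) {
                return String.format("%04X", v);
            }
        }
        return null;
    }

    // Returns the first BMP char (as hex) that the two charsets encode
    // differently, or null if every char encodes to the same bytes.
    public static String firstEncodeMismatch(Charset a, Charset b) {
        for (int c = 0; c < 0x10000; c++) {
            String s = String.valueOf((char) c);
            if (!Arrays.equals(s.getBytes(a), s.getBytes(b))) {
                return String.format("%04X", c);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in charsets that every JDK provides; a real check would
        // pass the old and new implementations of the same charset.
        Charset oldCs = StandardCharsets.ISO_8859_1;
        Charset newCs = StandardCharsets.US_ASCII;
        System.out.println("decode mismatch: " + firstDecodeMismatch(oldCs, newCs));
        System.out.println("encode mismatch: " + firstEncodeMismatch(oldCs, newCs));
    }
}
```

Because new String(bytes, charset) and getBytes(charset) replace malformed or unmappable input instead of throwing, this only compares the replacement behaviour; a stricter check would use CharsetDecoder and CharsetEncoder with CodingErrorAction.REPORT.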
