Rewrite of IBM doublebyte charsets

Ulf Zibis Ulf.Zibis at
Sun May 17 18:19:28 UTC 2009

On 14.05.2009 22:55, Xueming Shen wrote:
> Thanks again for taking time on this. Here is the IBM db charsets webrev
> This is a bigger fish than the EUC_TW:-)

*** Decoder-Suggestions:

(1) Unused imports in
    import java.util.Arrays;
    import sun.nio.cs.StandardCharsets;
    import static sun.nio.cs.CharsetMapping.*;
    import sun.nio.cs.ext.DoubleByte;  // or instead: static 

(2) Please extract the de/encoder classes into separate java files:
    In a tabbed editor it's much more comfortable to select a tab than to
scroll 760 lines up and down.

(3) Modify dimension of b2c:
      char[][] b2c = new char[0x100][segSize];
    so decoding becomes:
      public char decodeDouble(int b1, int b2) {
          if ((b2-=b2Min) < 0 || b2 >= segSize)
              return UNMAPPABLE_DECODING;
          return b2c[b1][b2];
      }
   Benefit[1]: increase performance of decoder
   Benefit[2]: reduce memory of B2C_UNMAPPABLE from 8192 to 512 bytes
   Benefit[3]: some of b2c pages could be saved (if only containing \uFFFD)
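
A minimal sketch of how (3) could look, assuming that every lead byte
without mappings shares one all-unmappable segment. The class name
SegmentedB2cSketch and the map() helper are made up for illustration
and are not part of DoubleByte:

```java
import java.util.Arrays;

// Illustrative only: everything except decodeDouble, b2c, b2Min and segSize is an assumption.
final class SegmentedB2cSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';

    final int b2Min;
    final int segSize;
    final char[][] b2c = new char[0x100][];
    private final char[] unmappable;   // the single shared all-unmappable segment

    SegmentedB2cSketch(int b2Min, int b2Max) {
        this.b2Min = b2Min;
        this.segSize = b2Max - b2Min + 1;
        unmappable = new char[segSize];
        Arrays.fill(unmappable, UNMAPPABLE_DECODING);
        Arrays.fill(b2c, unmappable);  // every lead byte starts out sharing it
    }

    // Hypothetical loader hook: a lead byte gets its own segment on its first mapping.
    void map(int b1, int b2, char c) {
        if (b2c[b1] == unmappable)
            b2c[b1] = unmappable.clone();
        b2c[b1][b2 - b2Min] = c;
    }

    char decodeDouble(int b1, int b2) {
        if ((b2 -= b2Min) < 0 || b2 >= segSize)
            return UNMAPPABLE_DECODING;
        return b2c[b1][b2];
    }
}
```

With at most 256 chars in the shared segment, the unmappable filler
costs at most 512 bytes, however many lead bytes are unused.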

(4) Don't check b2Max (it's always close to 0xff):
   Benefit[4]: another performance increase of the decoder (the only
check left is: (b2-=b2Min) < 0)

(5) Truncate String segments (there are 65 % "\uFFFD" in IBM933):
    (fill b2c segments first with "\uFFFD", then initialize)
   Benefit[5]: save up to 180 % superfluous memory and disk-footprint
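
The "fill first, then initialize" step of (5) could be sketched like
this. TruncatedStringSketch and expand() are hypothetical names, and
the sketch assumes that only trailing unmappables were cut off:

```java
import java.util.Arrays;

final class TruncatedStringSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';

    // Expands a segment string whose trailing unmappable chars were cut off.
    static char[] expand(String truncated, int segSize) {
        char[] seg = new char[segSize];
        Arrays.fill(seg, UNMAPPABLE_DECODING);               // prefill first
        truncated.getChars(0, truncated.length(), seg, 0);   // then initialize
        return seg;
    }
}
```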

(6) Unload b2cStr from memory after startup:
    - outsource b2cStr to additional class file like EUC_TW approach
    - set b2cStr = null after startup (remove final modifier)
   Benefit[6]: avoid 100 % superfluous memory-footprint

(7) Avoid copying b2cStr to b2c:
    (String#charAt() is as fast as char[] access)
   Benefit[7]: increase startup performance for decoder
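
A sketch of (7) under the assumption that b2cStr is kept as one String
per lead byte (null for unused lead bytes). The class name is made up,
and whether charAt() really matches char[] access would have to be
measured:

```java
final class StringBackedB2cSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';

    private final String[] b2cStr;   // one string per lead byte, null if the lead byte is unused
    private final int b2Min;

    StringBackedB2cSketch(String[] b2cStr, int b2Min) {
        this.b2cStr = b2cStr;
        this.b2Min = b2Min;
    }

    char decodeDouble(int b1, int b2) {
        String seg = b2cStr[b1];
        if (seg == null || (b2 -= b2Min) < 0 || b2 >= seg.length())
            return UNMAPPABLE_DECODING;
        return seg.charAt(b2);   // read straight from the string, no char[] copy
    }
}
```

Because the upper bound is seg.length(), truncated strings as in (5)
would work here without any expansion step.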

(8) Truncate b2c segments (catch unmappable indexes by RuntimeException):
   Benefit[8]: save up to 180 % superfluous memory-footprint
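
A sketch of what (8) means: the segments only hold their used part,
there is no range check, and an index outside the array is reported as
unmappable. TruncatedB2cSketch and setSegment() are hypothetical names:

```java
import java.util.Arrays;

final class TruncatedB2cSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';
    private static final char[] EMPTY = new char[0];

    private final char[][] b2c = new char[0x100][];
    private final int b2Min;

    TruncatedB2cSketch(int b2Min) {
        this.b2Min = b2Min;
        Arrays.fill(b2c, EMPTY);   // unused lead bytes map nothing at all
    }

    // Hypothetical loader hook: usedPart covers b2Min .. b2Min + usedPart.length - 1.
    void setSegment(int b1, char[] usedPart) {
        b2c[b1] = usedPart;
    }

    char decodeDouble(int b1, int b2) {
        try {
            return b2c[b1][b2 - b2Min];          // no explicit range check
        } catch (ArrayIndexOutOfBoundsException e) {
            return UNMAPPABLE_DECODING;          // index fell outside the used part
        }
    }
}
```

This trades the range check for the cost of an exception on every
unmappable input, so it only pays off if such input is rare.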

(9) Share mappings (IBM930 and IBM939 are 99 % identical):
   Benefit[9]: save up to 99 % superfluous disk-footprint
   Benefit[10]: save up to 99 % superfluous memory-footprint (if both 
charsets are loaded)
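
One possible shape for (9), assuming segmented tables as in (3): the
variant charset copies only the 256 segment references of the base
table and replaces the few segments that differ. SharedMappingSketch
and derive() are made-up names, and which segments of IBM930 and IBM939
actually differ is not covered here:

```java
final class SharedMappingSketch {
    // Builds a table for a variant charset that shares every segment of
    // 'base' except the ones listed in leadBytes.
    static char[][] derive(char[][] base, int[] leadBytes, char[][] replacements) {
        char[][] variant = base.clone();   // shallow copy: references only, no segment data
        for (int i = 0; i < leadBytes.length; i++)
            variant[leadBytes[i]] = replacements[i];
        return variant;
    }
}
```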

(10) Provide 4-way fork from de/encodeLoop():
   Benefit[11]: increase performance, if there is only 1 direct buffer
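
A sketch of the 4-way fork in (10). The byte-to-char widening in
decode() only stands in for the real table lookup, and all names are
assumptions:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

final class FourWayForkSketch {
    enum Path { ARRAY_TO_ARRAY, ARRAY_TO_BUFFER, BUFFER_TO_ARRAY, BUFFER_TO_BUFFER }

    // Stand-in for a real table lookup: widens each byte to a char.
    static char decode(byte b) {
        return (char) (b & 0xff);
    }

    static Path decodeLoop(ByteBuffer src, CharBuffer dst) {
        int n = Math.min(src.remaining(), dst.remaining());
        boolean srcArray = src.hasArray();
        boolean dstArray = dst.hasArray();
        if (srcArray && dstArray) {
            byte[] sa = src.array();
            char[] da = dst.array();
            int sp = src.arrayOffset() + src.position();
            int dp = dst.arrayOffset() + dst.position();
            for (int i = 0; i < n; i++)
                da[dp + i] = decode(sa[sp + i]);
            src.position(src.position() + n);
            dst.position(dst.position() + n);
            return Path.ARRAY_TO_ARRAY;
        }
        if (srcArray) {                       // only the source is array-backed
            byte[] sa = src.array();
            int sp = src.arrayOffset() + src.position();
            for (int i = 0; i < n; i++)
                dst.put(decode(sa[sp + i]));
            src.position(src.position() + n);
            return Path.ARRAY_TO_BUFFER;
        }
        if (dstArray) {                       // only the destination is array-backed
            char[] da = dst.array();
            int dp = dst.arrayOffset() + dst.position();
            for (int i = 0; i < n; i++)
                da[dp + i] = decode(src.get());
            dst.position(dst.position() + n);
            return Path.BUFFER_TO_ARRAY;
        }
        for (int i = 0; i < n; i++)           // neither side exposes an array
            dst.put(decode(src.get()));
        return Path.BUFFER_TO_BUFFER;
    }
}
```

The two mixed paths are the new ones: with a plain 2-way fork a single
direct buffer forces both sides onto the slower buffer loop.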

(11) Quit the coders' xBufferLoop by exception on over-/underflow:
   Benefit[12]: increase performance
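
A sketch of (11) for the buffer loop: no remaining() checks inside the
loop, the buffers' own exceptions end it. The class name is made up and
the widening again stands in for the table lookup. Whether this is
really faster would need measuring:

```java
import java.nio.BufferOverflowException;
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;

final class ExceptionExitSketch {
    static CoderResult decodeBufferLoop(ByteBuffer src, CharBuffer dst) {
        int consumed = src.position();        // position after the last byte fully decoded
        try {
            while (true) {
                byte b = src.get();           // throws BufferUnderflowException at the limit
                dst.put((char) (b & 0xff));   // throws BufferOverflowException when dst is full
                consumed = src.position();
            }
        } catch (BufferUnderflowException e) {
            return CoderResult.UNDERFLOW;
        } catch (BufferOverflowException e) {
            src.position(consumed);           // un-read the byte that could not be stored
            return CoderResult.OVERFLOW;
        }
    }
}
```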

(12) Get rid of package dependency:
   Benefit[13]: avoid superfluous disk-footprint
   Benefit[14]: save maintenance of converters
   Disadvantage[1]: published under JRL (waiting for launch of OpenJDK-7 
project "charset-enhancement") ;-)

(13) Take data files into account _once more_:
   Following the suggestions above, the data files should be much smaller
than for EUC_TW,
   so the loading time from the jar via getResourceAsStream() could be
acceptable. If some day
   Bug ID 6818736, 6818736 were solved, we could profit once more,
without doing much.
   Benefit[15]: avoid 50 % superfluous disk-footprint
   Benefit[16]: sharing of map data for different charsets becomes more 

(14) Split map data files into chunks and load them lazily.
   TW native speakers must be consulted to define reasonable chunks!
   Benefit[17]: save startup time
   Benefit[18]: save memory
   Benefit[19]: sharing of map data becomes much more simple
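
Lazy loading as in (14) could be sketched like this, assuming one chunk
per lead byte. The loader is a hypothetical callback that would read
one chunk, e.g. from a resource; the sketch is not thread-safe:

```java
import java.util.function.IntFunction;

final class LazySegmentsSketch {
    private final char[][] segments = new char[0x100][];
    private final IntFunction<char[]> loader;   // hypothetical: reads the chunk for one lead byte
    int loads;                                  // counts loader calls, for demonstration only

    LazySegmentsSketch(IntFunction<char[]> loader) {
        this.loader = loader;
    }

    char[] segment(int b1) {
        char[] seg = segments[b1];
        if (seg == null) {                      // first access: load and remember the chunk
            seg = loader.apply(b1);
            segments[b1] = seg;
            loads++;
        }
        return seg;
    }
}
```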

(15) Diff also against and see similarity

(16) decodeArrayLoop: shortcut calculation of limits:
      int sl = sp + src.remaining();
      int dl = dp + dst.remaining();

(17) Decoder#decodeArrayLoop: shortcut for single byte only:
      int sr = src.remaining();
      int sl = sp + sr;
      int dr = dst.remaining();
      int dl = dp + dr;
      // single byte only loop
      int slSB = sp + (sr < dr ? sr : dr);
      while (sp < slSB) {
          char c = b2cSB[sa[sp] & 0xff];
          if (c == UNMAPPABLE_DECODING)
              break;  // assumption: leave the fast loop, the general loop takes over
          da[dp++] = c;
          sp++;
      }
     Same for Encoder#encodeArrayLoop

(18) Decoder_EBCDIC: boolean singlebyteState:
      if (singlebyteState)

(19) Decoder_EBCDIC: decode single byte first:
      if (singlebyteState)
          c = b2cSB[b1];
      if (c == UNMAPPABLE_DECODING) {
   Benefit[20]: should be faster

*** Encoder-Suggestions:

(21) join *.nr to *.c2b files (25->000a becomes 000a->fffd):
   Benefit[21]: reduce no. of files
   Benefit[22]: simplifies initC2B() (avoids 2 loops)

(22) Save c2b in 2-dimensional array:
     char[][] c2b = new char[0x100][]
     set unused segments to 256-size UNMAPPABLE_ENCODING[]
   Benefit[23]: save calculation of index in encodeChar() --> little faster
   Benefit[24]: initC2B() becomes faster
   - huge c2b[] is initialized twice, 1st with 0 (according JLS) + 2nd 
   - only fill 256 bytes with UNMAPPABLE_ENCODING, and get copies by 
   Benefit[25]: save c2bIndex
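
A sketch of (22), assuming the unused segments all point to one shared
256-char UNMAPPABLE_ENCODING segment and that private copies are made
with clone(). SegmentedC2bSketch and map() are made-up names:

```java
import java.util.Arrays;

final class SegmentedC2bSketch {
    static final char UNMAPPABLE_ENCODING = '\uFFFD';
    private static final char[] UNMAPPABLE_SEGMENT = new char[0x100];
    static {
        Arrays.fill(UNMAPPABLE_SEGMENT, UNMAPPABLE_ENCODING);   // filled exactly once
    }

    final char[][] c2b = new char[0x100][];

    SegmentedC2bSketch() {
        Arrays.fill(c2b, UNMAPPABLE_SEGMENT);   // 256 references, no per-segment fill
    }

    // Hypothetical loader hook: a segment gets a private copy on its first mapping.
    void map(char c, int bytes) {
        int hi = c >> 8;
        if (c2b[hi] == UNMAPPABLE_SEGMENT)
            c2b[hi] = UNMAPPABLE_SEGMENT.clone();
        c2b[hi][c & 0xff] = (char) bytes;
    }

    int encodeChar(char c) {
        return c2b[c >> 8][c & 0xff];           // two plain lookups, no index calculation
    }
}
```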

(23) Truncate c2b segments:
     c2b[x] = new char[usedLength]
     (usedLength values could be generated and saved in DoubleByte-X or 
data file)
   Benefit[26]: avoid superfluous memory and disk-footprint (I guess ~30 %)
   Benefit[27]: don't range-check in-segment index, catch unmappable 
index by IndexOutOfBoundsException

(24) Additionally truncate leading unmappables in c2b segments, and host 
   Benefit[28]: avoid another superfluous memory and disk-footprint (I 
guess ~10 %)
   Disadvantage[21]: needs hosting of offsets: 256 bytes
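
(23) and (24) together could be sketched as follows: each segment keeps
only the range between its first and last mapped char, a 256-byte table
holds the offsets, and an index outside the range ends in the catch
block. OffsetC2bSketch and setSegment() are hypothetical names:

```java
import java.util.Arrays;

final class OffsetC2bSketch {
    static final char UNMAPPABLE_ENCODING = '\uFFFD';
    private static final char[] EMPTY = new char[0];

    private final char[][] c2b = new char[0x100][];
    private final byte[] offsets = new byte[0x100];   // the 256 bytes of Disadvantage[21]

    OffsetC2bSketch() {
        Arrays.fill(c2b, EMPTY);
    }

    // Hypothetical loader hook: 'used' covers low bytes firstLow .. firstLow + used.length - 1.
    void setSegment(int hi, int firstLow, char[] used) {
        c2b[hi] = used;
        offsets[hi] = (byte) firstLow;
    }

    int encodeChar(char c) {
        int hi = c >> 8;
        try {
            return c2b[hi][(c & 0xff) - (offsets[hi] & 0xff)];
        } catch (IndexOutOfBoundsException e) {
            return UNMAPPABLE_ENCODING;   // before the first or after the last mapped char
        }
    }
}
```

Holes between the first and the last mapped char of a segment still
have to be stored as UNMAPPABLE_ENCODING.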

(25) Concerning (23),(24): Check out best segment size (maybe 256 is not 
   Benefit[29]: avoid another superfluous memory and disk-footprint (I 
guess 10-20 %)

