<i18n dev> [15] RFR: 8244459: Optimize the hash map size in LocaleProviderAdapters
Mark Davis ☕️
mark at macchiato.com
Wed May 6 04:02:41 UTC 2020
I wouldn't worry too much about over 2^29 either.
However, the number of possible valid language tags is much larger than
people think, since any combination of multiple variants can occur). So
even not counting the locale extensions, it is over 2^129 (my rough
calculation).
Mark
On Tue, May 5, 2020 at 5:09 PM <naoto.sato at oracle.com> wrote:
> Hi Stuart,
>
> On 5/5/20 4:29 PM, Stuart Marks wrote:
> > As for Naoto's changeset, I don't think it should be held up while we
> > bikeshed this. :-) I'm ok with whatever he decides.
>
> I don't think the number of language tags exceed 2^29, unless languages
> in the whole universe are counted :-) So I would go with the current
> version.
>
> Naoto
>
> >
> > s'marks
> >
> >
> >
> >
> > On 5/5/20 1:26 PM, Peter Levart wrote:
> >> There's more...
> >>
> >>
> >> Guava (and JDK in copy constructor) formula:
> >>
> >> (int) ((float) expectedSize / 0.75f + 1.0f)
> >>
> >>
> >> is not the same as Naoto's formula:
> >>
> >> (int) (expectedSize / 0.75f) + 1
> >>
> >>
> >> Notice that Naoto does addition of 1 in integer arithmetic after
> >> conversion to int, while Guava/JDK does in floating point before
> >> conversion to int. If I put Naoto's formula into my test program, no
> >> undercalculations are detected.
> >>
> >>
> >> So while Naoto's formula is sub-optimal for expectedSizes that are
> >> multiples of 3, the Guava/JDK also has a bug.
> >>
> >>
> >> Regards, Peter
> >>
> >>
> >> On 5/5/20 10:01 PM, Peter Levart wrote:
> >>>
> >>>
> >>> On 5/5/20 9:41 PM, Martin Buchholz wrote:
> >>>> Hi Peter,
> >>>>
> >>>> Are you saying guava has a tiny bug?
> >>>
> >>>
> >>> If it was just 1 too much when expected size is a multiple of 3 then
> >>> that would not be a bug, just sub-optimal calculation. And the same
> >>> calculation is performed also in JDK when a copy constructor is
> >>> called for example.
> >>>
> >>>
> >>> But I investigated further and what I found could be considered a
> >>> bug. Sometimes, the following expression:
> >>>
> >>>
> >>> (int) ((float) expectedSize / 0.75f + 1.0f)
> >>>
> >>>
> >>> ...calculates a value that is not enough (due to floating point
> >>> arithmetic and conversion to int) to store the expectedSize elements
> >>> into the HashMap without re-hashing.
> >>>
> >>>
> >>> What HashMap does with initialCapacity parameter is to round it up to
> >>> nearest power of 2:
> >>>
> >>> static int tableSizeFor(int cap) {
> >>> int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
> >>> return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ?
> >>> MAXIMUM_CAPACITY : n + 1;
> >>> }
> >>>
> >>> then it uses this as the initial backing table size. From that table
> >>> size it calculates the threshold value:
> >>>
> >>> static int threshold(int cap) {
> >>> float ft = (float) cap * 0.75f;
> >>> return (cap < MAXIMUM_CAPACITY && ft < (float)
> >>> MAXIMUM_CAPACITY ?
> >>> (int) ft : Integer.MAX_VALUE);
> >>> }
> >>>
> >>> ... and uses it as the max. number of elements that a HashMap can
> >>> hold before it is re-hashed. So I did the following test (comparing
> >>> the effectiveness of above formula with alternative
> >>> (expectedSize*4+2)/3 formula):
> >>>
> >>>
> >>> public class HMTest {
> >>> static final int MAXIMUM_CAPACITY = 1 << 30;
> >>>
> >>> static int tableSizeFor(int cap) {
> >>> int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
> >>> return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ?
> >>> MAXIMUM_CAPACITY : n + 1;
> >>> }
> >>>
> >>> static int threshold(int cap) {
> >>> float ft = (float) cap * 0.75f;
> >>> return (cap < MAXIMUM_CAPACITY && ft < (float)
> >>> MAXIMUM_CAPACITY ?
> >>> (int) ft : Integer.MAX_VALUE);
> >>> }
> >>>
> >>> public static void main(String[] args) {
> >>> for (int expectedSize = 0; expectedSize < (Integer.MAX_VALUE
> >>> - 2) / 4; expectedSize++) {
> >>> int cap1 = (int) ((float) expectedSize / 0.75f + 1.0f);
> >>> int cap2 = (expectedSize * 4 + 2) / 3;
> >>> int ts1 = tableSizeFor(cap1);
> >>> int ts2 = tableSizeFor(cap2);
> >>> int th1 = threshold(ts1);
> >>> int th2 = threshold(ts2);
> >>>
> >>> if (th1 < expectedSize || th2 < expectedSize) {
> >>> System.out.printf("%d: (%d, %d, %d)%s (%d, %d,
> %d)%s\n",
> >>> expectedSize,
> >>> cap1, ts1, th1, (th1 < expectedSize) ? "!" :
> >>> " ",
> >>> cap2, ts2, th2, (th2 < expectedSize) ? "!" : "
> "
> >>> );
> >>> }
> >>> }
> >>> }
> >>> }
> >>>
> >>>
> >>> And what this prints is the following:
> >>>
> >>>
> >>> 25165825: (33554432, 33554432, 25165824)! (33554434, 67108864,
> 50331648)
> >>> 50331649: (67108864, 67108864, 50331648)! (67108866, 134217728,
> >>> 100663296)
> >>> 50331650: (67108864, 67108864, 50331648)! (67108867, 134217728,
> >>> 100663296)
> >>> 100663297: (134217728, 134217728, 100663296)! (134217730, 268435456,
> >>> 201326592)
> >>> 100663298: (134217728, 134217728, 100663296)! (134217731, 268435456,
> >>> 201326592)
> >>> 100663299: (134217728, 134217728, 100663296)! (134217732, 268435456,
> >>> 201326592)
> >>> 100663300: (134217728, 134217728, 100663296)! (134217734, 268435456,
> >>> 201326592)
> >>> 201326593: (268435456, 268435456, 201326592)! (268435458, 536870912,
> >>> 402653184)
> >>> 201326594: (268435456, 268435456, 201326592)! (268435459, 536870912,
> >>> 402653184)
> >>> 201326595: (268435456, 268435456, 201326592)! (268435460, 536870912,
> >>> 402653184)
> >>> 201326596: (268435456, 268435456, 201326592)! (268435462, 536870912,
> >>> 402653184)
> >>> 201326597: (268435456, 268435456, 201326592)! (268435463, 536870912,
> >>> 402653184)
> >>> 201326598: (268435456, 268435456, 201326592)! (268435464, 536870912,
> >>> 402653184)
> >>> 201326599: (268435456, 268435456, 201326592)! (268435466, 536870912,
> >>> 402653184)
> >>> 201326600: (268435456, 268435456, 201326592)! (268435467, 536870912,
> >>> 402653184)
> >>> 402653185: (536870912, 536870912, 402653184)! (536870914, 1073741824,
> >>> 2147483647)
> >>> 402653186: (536870912, 536870912, 402653184)! (536870915, 1073741824,
> >>> 2147483647)
> >>> 402653187: (536870912, 536870912, 402653184)! (536870916, 1073741824,
> >>> 2147483647)
> >>> 402653188: (536870912, 536870912, 402653184)! (536870918, 1073741824,
> >>> 2147483647)
> >>> 402653189: (536870912, 536870912, 402653184)! (536870919, 1073741824,
> >>> 2147483647)
> >>> 402653190: (536870912, 536870912, 402653184)! (536870920, 1073741824,
> >>> 2147483647)
> >>> 402653191: (536870912, 536870912, 402653184)! (536870922, 1073741824,
> >>> 2147483647)
> >>> 402653192: (536870912, 536870912, 402653184)! (536870923, 1073741824,
> >>> 2147483647)
> >>> 402653193: (536870912, 536870912, 402653184)! (536870924, 1073741824,
> >>> 2147483647)
> >>> 402653194: (536870912, 536870912, 402653184)! (536870926, 1073741824,
> >>> 2147483647)
> >>> 402653195: (536870912, 536870912, 402653184)! (536870927, 1073741824,
> >>> 2147483647)
> >>> 402653196: (536870912, 536870912, 402653184)! (536870928, 1073741824,
> >>> 2147483647)
> >>> 402653197: (536870912, 536870912, 402653184)! (536870930, 1073741824,
> >>> 2147483647)
> >>> 402653198: (536870912, 536870912, 402653184)! (536870931, 1073741824,
> >>> 2147483647)
> >>> 402653199: (536870912, 536870912, 402653184)! (536870932, 1073741824,
> >>> 2147483647)
> >>> 402653200: (536870912, 536870912, 402653184)! (536870934, 1073741824,
> >>> 2147483647)
> >>>
> >>>
> >>> So as you see, for expectedSize < (Integer.MAX_VALUE - 2) / 4 (where
> >>> the alternative formula does not experience overflow and is enough
> >>> for Naoto's case) all miscalculations are due to the JDK/Guava
> >>> formula which in those cases calculates a value that is less than
> >>> alternative formula's value and too small to adequately pre-size the
> >>> HashMap table.
> >>>
> >>>
> >>> Voila, we have some bugs to fix or I am doing something wrong here.
> >>>
> >>>
> >>> Regards, Peter
> >>>
> >>>
> >>>>
> >>>> On Tue, May 5, 2020 at 12:12 PM Peter Levart <peter.levart at gmail.com
> >>>> <mailto:peter.levart at gmail.com>> wrote:
> >>>>
> >>>> Hi Martin,
> >>>>
> >>>> On 5/5/20 8:26 PM, Martin Buchholz wrote:
> >>>>> See related:
> >>>>>
> https://guava.dev/releases/23.0/api/docs/com/google/common/collect/Maps.html#newHashMapWithExpectedSize-int-
> >>>>>
> >>>>
> >>>>
> >>>> This is basically the same calculation (or at least gives same
> >>>> result) as Naoto did (without the max part):
> >>>>
> >>>> Naoto: (int)(expectedSize / 0.75f) + 1
> >>>>
> >>>> Guava: (int) ((float) expectedSize / 0.75F + 1.0F)
> >>>>
> >>>> but in case expectedSize is a multiple of 3, it gives the result
> >>>> which is 1 more than needed. If what is needed is also a power of
> >>>> 2, then twice the needed space is allocated in the HashMap
> >>>> backing table.
> >>>>
> >>>>
> >>>> Regards, Peter
> >>>>
> >>>>
> >>>>>
> >>>>> On Tue, May 5, 2020 at 11:03 AM <naoto.sato at oracle.com
> >>>>> <mailto:naoto.sato at oracle.com>> wrote:
> >>>>>
> >>>>> And here is the fix. Please review.
> >>>>>
> >>>>> http://cr.openjdk.java.net/~naoto/8244459/webrev.00/
> >>>>>
> >>>>> Naoto
> >>>>>
> >>>>> On 5/5/20 10:25 AM, naoto.sato at oracle.com
> >>>>> <mailto:naoto.sato at oracle.com> wrote:
> >>>>> > Hi Peter,
> >>>>> >
> >>>>> > You are correct. Thanks. I'll remove that initial value
> >>>>> of 16.
> >>>>> >
> >>>>> > Naoto
> >>>>> >
> >>>>> > On 5/5/20 9:37 AM, Peter Levart wrote:
> >>>>> >> Hi Naoto,
> >>>>> >>
> >>>>> >> On 4/30/20 12:18 AM, naoto.sato at oracle.com
> >>>>> <mailto:naoto.sato at oracle.com> wrote:
> >>>>> >>> Hello,
> >>>>> >>>
> >>>>> >>> Please review this small fix to the following issue:
> >>>>> >>>
> >>>>> >>> https://bugs.openjdk.java.net/browse/JDK-8244152
> >>>>> >>>
> >>>>> >>> The proposed changeset is located at:
> >>>>> >>>
> >>>>> >>> https://cr.openjdk.java.net/~naoto/8244152/webrev.00/
> >>>>> >>>
> >>>>> >>> The hash map used there didn't have initial capacity,
> >>>>> even though the
> >>>>> >>> exact numbers are known.
> >>>>> >>
> >>>>> >>
> >>>>> >> Well, it has to be calculated 1st (countTokens), but I
> >>>>> guess this pays
> >>>>> >> off when HashSet (the backing HashMap) does not have to
> >>>>> be rehashed then.
> >>>>> >>
> >>>>> >> The expression you use:
> >>>>> >>
> >>>>> >> Math.max((int)(tokens.countTokens() / 0.75f) + 1, 16)
> >>>>> >>
> >>>>> >> ...has a minimum value of 16. Why is that? 16 is just
> >>>>> HashMap's
> >>>>> >> default initialCapacity if not specified explicitly. But
> >>>>> if you only
> >>>>> >> want to store say 1 entry in the map, you can specify 2 as
> >>>>> >> initialCapacity and HashMap will happily work for such
> >>>>> case without
> >>>>> >> resizing.
> >>>>> >>
> >>>>> >>
> >>>>> >> So you could just use:
> >>>>> >>
> >>>>> >> (int)(tokens.countTokens() / 0.75f) + 1
> >>>>> >>
> >>>>> >> And even this expression is sometimes overshooting the
> >>>>> minimal
> >>>>> >> required value by 1 (when # of tokens is "exact" multiple
> >>>>> of 0.75f,
> >>>>> >> say 6). I think the following could be used to optimally
> >>>>> pre-size the
> >>>>> >> HashMap with default load factor 0.75:
> >>>>> >>
> >>>>> >> (tokens.countTokens() * 4 + 2) / 3
> >>>>> >>
> >>>>> >>
> >>>>> >> Regards, Peter
> >>>>> >>
> >>>>> >>>
> >>>>> >>> Naoto
> >>>>>
>
More information about the i18n-dev
mailing list