lopx at gazeta.pl
Wed Apr 30 16:32:13 PDT 2008
> After a few hundred bytecodes, the inlining heuristic starts to get
> scared. Hand inlining is like that: The JVM prefers to do this for you.
> You can play with -XX:FreqInlineSize=N where N is larger than your
> switch method. Also maybe bump -XX:MaxInlineSize (default 35).
> But it is better to use small methods and let the JVM pick inlinings.
> You may have found a weakness in the profiling that causes the inliner
> to fall down.
A recent #jruby conversation revealed another factor that might influence
my benchmarks. I tried MaxInlineSize (100, 500, 1000) some time ago and
it indeed gave a boost of about 40% on my machine for some (questionable)
benchmarks. The problem is that the architecture can have a huge impact
on the results (I'm on a Pentium 4 Northwood, 32-bit). Exactly the same
settings on a Core Duo actually worsened performance! The theory is that
older Pentium CPUs have weaker jump prediction accuracy, so heavy
inlining reduces this penalty.
Though, all this doesn't affect the artificial benchmark I posted, where
profiling is to blame, I'd guess. It's reproducible across different CPUs.
I've also tried FreqInlineSize, MaxInlineLevel and a few others available
in fastdebug OpenJDK builds; none of them had as big an impact as
MaxInlineSize (on my architecture only, it seems).
Another interesting thing is that turning on UseParallelGC on my
single-CPU system can give a noticeable boost (in code that doesn't use
explicit GC at all). Any theories?
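Roughly the kind of invocations I've been comparing (against the attached
Aswitch2 testcase; the flag values varied as noted):

```shell
# inlining-threshold experiments (values tried: 100, 500, 1000)
java -server -XX:MaxInlineSize=500 Aswitch2
java -server -XX:FreqInlineSize=500 Aswitch2
# and, oddly, on a single-CPU box this helps too:
java -server -XX:+UseParallelGC Aswitch2
```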
>> Then I realized that there's something wrong with inlining. After
>> that tried different approaches like splitting the switch into ten
>> switches and even a tree like one (which turned out to be a complete
>> nonsense of course). As split switches worked better for cases I
>> tested, the whole thing became a lot slower on average.
> Another thing that's better since Feb. is we now have a disassembly
> story that we can share with the public. I put in a plugin interface
> to the Gnu binutils library; it's integrated here, in version 13 of
> the JVM:
> There are sources and (imperfect) build instructions for making the
> The point of the -XX:+PrintAssembly option is to give you a very
> concrete view (often too concrete) as to what decisions the JIT is
> making. In your experiments you would have seen a range of code
> shapes, from decision trees to jump tables.
I'll definitely try that; so far I've only been looking at
PrintOptoAssembly (not very in depth, unfortunately).
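For the record, something like this, I assume (PrintAssembly needs the
hsdis disassembler plugin on the library path):

```shell
# -XX:+PrintAssembly needs the hsdis plugin and, in product builds,
# the diagnostic unlock flag:
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Aswitch2
# what I've been using so far (debug/fastdebug builds):
java -XX:+PrintOptoAssembly Aswitch2
```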
>> Then I decided to prepare a reduced testcase (attached as
>> Aswitch2.java file) for the problem and saw rather surprising
>> results. With client compiler benchmarks didn't depend on the number
>> of cases but for hotspot server compiler it turned out to be the
>> opposite (but it was not linear).
>> As you've mentioned before, with small number of cases (and/or low
>> complexity of the whole method) Hotspot is able to inline the method
>> and the switch into its caller and then prove that the switching
>> value is constant. After increasing number of cases (like 5..10) I
>> began to see:
>> <inline_fail reason='already compiled into a big method'/>
> That's a weak part of our heuristic. Hmm. It's especially bad when
> there is a great variation in size (switch on constant vs. switch on
> non-constant). We know we need heuristics that can estimate the
> effects of constant folding. Branches (switches) based on parameter
> values are a key case.
But it's quite an artificial example. I believe the new SSA produced
after inlining will always do the right job?
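To illustrate the shape I mean (a toy example, all names made up): once
dispatch() is inlined into a caller with a constant selector, constant
propagation over the SSA form can collapse the whole switch to a single
case body.

```java
// Toy example (names made up): a switch dispatch that a JIT can fold
// away once it is inlined into a caller with a constant selector.
class SwitchDemo {
    static int dispatch(int op, int x) {
        switch (op) {
            case 0:  return x + 1;
            case 1:  return x - 1;
            case 2:  return x * 2;
            default: return x;
        }
    }

    // After inlining, op is provably the constant 2, so SSA constant
    // propagation can reduce the whole switch to just "x * 2".
    static int hotCaller(int x) {
        return dispatch(2, x);
    }
}
```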
>> appeared in hotspot.log and the timings went down to 650ms
>> after adding another 30 or so cases the timings got even worse (700ms
>> and more after being warmed up).
>> So I began to draw conclusions that counting case hits (found it in
>> the MultiBranchData class and its consumers) makes Hotspot try to inline
>> the most frequent case into the caller or something. I can imagine
>> this for two way branch with a stable flow path, but when switch case
>> distribution begins to be somewhat 'random/unknown' (in this
>> microbenchmark it is not the case though, as only one case is being
>> hit) for the Hotspot profiler and the switch is big enough, the whole
>> thing loses the point, yet switches should be O(1) with dense values.
> In how many places do you expect the switch code to be inlined? Once
> for the general case and many places for constant switch selectors?
I'd like to have it inlined for all used cases, potentially all of
them..., though I know that's not possible due to the many thresholds.
>> Even more surprisingly, the benchmark gets about 10% boost with jump
>> tables turned off
> It must be getting smart or lucky with decision trees. If a small
> minority of cases determines the performance, and those get compiled
> into the heart of the decision tree (and hence the loop), then the
> other cases don't really matter.
>> whereas -XX:MaxInlineSize=500 or so doesn't affect the benchmark
>> (though it gave a boost for some other joni regexp benchmarks)
> FreqInlineSize takes precedence over MaxInlineSize for hot methods,
> IIRC. If so, it's a bug that MaxInlineSize > FreqInlineSize silently
> has no effect.
In the majority of cases it indeed affects the timings, though as I
mentioned before, I'm becoming worried that CPU & architecture
differences are skewing my experiment numbers.
> Do you expect to get a different inlining of the Big Switch for each
> Encoding? That will be tricky but may be desirable. You almost want
> to use something like anonymous classes as a templating mechanism to
> forcibly copy the Big Switch into distinct type contexts.
Very few bytecodes are encoding dependent; most of the job is done by
the compiler, which emits very specialized instructions (like EXACTMB2N2,
which expects to match two two-byte characters). All the others that have
been used in Oniguruma are specialized for single-byte encodings (so the
encoding is eventually not used during matches in the benchmarks they
have been run with).
Yeah, I've also considered anonymous classes, accepting megamorphic
callsites. The thing that scared me was too much state in the matcher
and the possible creation of lots of non-static class instances on every
match. On the other hand, those classes might just implement boolean
perform(Matcher m) or so and do their job (though it would introduce
much more indirection to access matcher fields).
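A minimal sketch of what I mean (all names hypothetical, not the actual
joni classes): stateless op instances with the mutable state kept in the
matcher, so nothing needs to be allocated per match:

```java
// All names hypothetical: one stateless op object per opcode, with the
// mutable match state held in the Matcher instance that is passed in.
interface Op {
    boolean perform(Matcher m);
}

class Matcher {
    final byte[] subject;  // string being matched
    int ip;                // instruction pointer
    int sp;                // current position in the subject

    Matcher(byte[] subject) { this.subject = subject; }
}

class Program {
    // shared singleton: no per-match allocation
    static final Op EXACT_A = new Op() {
        public boolean perform(Matcher m) {
            if (m.sp < m.subject.length && m.subject[m.sp] == 'a') {
                m.sp++;
                m.ip++;
                return true;
            }
            return false;
        }
    };
}
```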
> Encodings as virtuals: For megamorphic calls, vtables are a little
> cheaper than itables. But you win bigger when you factor things so
> that interface or virtual calls are monomorphic or at most bimorphic.
Encodings are mostly used during parsing. We will face this actual
problem when implementing Ruby 1.9 Strings. Unfortunately
Encoding.length(byte b) is polymorphic (4-6 versions) and it's going to
be used quite heavily there. It might even be possible to reduce it to
two versions (return a fixed length, and return a length-table lookup).
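The two-version idea, sketched (this is not joni's actual Encoding API,
just the shape I have in mind, so the call site would be at worst
bimorphic):

```java
// Hypothetical sketch: collapse 4-6 length() implementations down to
// two, a fixed-length variant and a table-lookup variant.
abstract class Encoding {
    abstract int length(byte b);
}

final class FixedLengthEncoding extends Encoding {
    private final int len;
    FixedLengthEncoding(int len) { this.len = len; }
    int length(byte b) { return len; }  // e.g. 1 for single-byte encodings
}

final class TableEncoding extends Encoding {
    private final int[] table = new int[256];
    TableEncoding(int[] lengths) {
        System.arraycopy(lengths, 0, table, 0, 256);
    }
    int length(byte b) { return table[b & 0xff]; }  // e.g. UTF-8 lead bytes
}
```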
>> Case fold expansion optimization actually made the opposite effect
>> here as we seem to have problem with switch method performance itself
>> (all opcodes are very cheap here). We might get rid of this
>> optimization as [exactn-ic-sb:n:...] is very cheap for some single
>> byte encodings (direct table lookup), but a multibyte one
>> ([exactn-ic:n:...]) has very high cost as it uses two buffers to
>> unfold the code points and make case insensitive comparison. Also, it
>> would be an ugly workaround.
> It sounds like you have a specific need for customizing your Big
> Switch for case-folding searches. This takes me one step closer to
> the templating precipice. One way to encourage templating-like
> behavior from the JVM is to put your Big Switch into an abstract base
> class and then specialize it in a range of subclasses. In this case,
> perhaps the range of subclasses is dynamically created. (Depends on
> the number of degrees of freedom across which you want to customize.
> Can you statically enumerate in a bunch of hand-named subclasses, or
Oh, this one I'll need to think through. The bad news is that lots of
patterns are compiled into main-switch-heavy code (like instant pushing,
>> I admit there still might be bottlenecks in java code I'm not aware
>> of. On the other hand, work on jvm bytecode compiled version has been
>> started. It will require completely new code for stack manipulation,
>> fail fast logic and anchors (Matcher.java). Though, it will bring new
>> exciting optimization opportunities like taking advantage of java
>> stack and building fail fast code from templates (not to mention even
>> more specialized opcode equivalent implementations that will not use
> I don't have much hope that the java stack will give you much leverage
> on JVM performance, unless you use it in ways that are very similar to
> javac output. The java stack is for naming temporary values
> compactly, as opposed to a plain register-based architecture.
What about its other characteristics? Is it better to use than a
ThreadLocal, weak-referenced growing array as an artificial stack? The
pros and cons of the two approaches: we might run out of Java stack with
the former, or pay explicit bounds checks with the artificial one.
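The artificial-stack alternative would be something like this (a minimal
sketch, names made up): it never overflows the Java stack, but pays an
explicit capacity check on every push.

```java
// Minimal growable int stack of the kind usable as an artificial
// backtracking stack (names made up).
final class IntStack {
    private int[] data = new int[64];
    private int top = 0;

    void push(int v) {
        if (top == data.length) {  // the bounds check the Java stack avoids
            int[] bigger = new int[data.length * 2];
            System.arraycopy(data, 0, bigger, 0, data.length);
            data = bigger;
        }
        data[top++] = v;
    }

    int pop() { return data[--top]; }

    boolean isEmpty() { return top == 0; }
}
```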
> On the other hand, compiling to bytecodes is a great way to bypass all
> sort of Big Switch issues. Use anonymous classes for that, if the JVM
> supports them. (No plans to standardize these yet on other JVMs. JSR
> 292 has its hands full with invokedynamic.)
> The PyPy experience shows that partial evaluation of your Big Switch
> can take you a long way. I'd like to make our switch optimizations
> work better, even if you end up going all the way to bytecodes.
> BTW, going to bytecodes means you start dealing with the cost of
> registering, compiling, and GC-ing the results of class loading.
> Anonymous classes help address this, but it's expensive; we've seen
> scaling problems with such things, e.g., databases that compile every
> SQL query to a tiny class. Perhaps the best thing to do would be
> (like JRuby) to go to bytecodes only after the same RE had been used N
> times or more (N = 40, maybe). Meanwhile, the Big Switch gives good
> performance for low-use REs. Best of all would be a way to fold the
> Big Switch with a constant code vector and tell the JVM "partial
> evaluate this". With the right library factoring, we can get close to
> that ideal, but bytecode generation is probably part of the mix for
> the foreseeable future.
Is the "constant code vector" meant to be code that is used to assemble
the final output, mainly by just generating calls? If so, yeah, that
would minimize the amount of code that's loaded, verified and eventually
profiled later when run.
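So for a pattern like /ab+/ the generated code might be little more than
this (everything below is a made-up sketch, not actual joni code): the
per-pattern class stays tiny because the real work lives in a shared,
already-compiled op library that the JIT can inline and fold.

```java
// Hypothetical shared op library; each op returns the new position,
// or -1 on failure (which every op also accepts as input).
final class Ops {
    static int exact(byte[] s, int p, byte c) {
        return (p >= 0 && p < s.length && s[p] == c) ? p + 1 : -1;
    }

    static int oneOrMore(byte[] s, int p, byte c) {
        if (p < 0 || p >= s.length || s[p] != c) return -1;
        while (p < s.length && s[p] == c) p++;
        return p;
    }
}

// what a generator might emit for /ab+/:
final class GeneratedAbPlus {
    static boolean match(byte[] s, int p) {
        p = Ops.exact(s, p, (byte) 'a');
        p = Ops.oneOrMore(s, p, (byte) 'b');
        return p >= 0;
    }
}
```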
The other problem is that the AST is thrown away immediately after
compilation; we can keep it, but it consumes memory (reparsing the regexp
for the JIT would solve this problem). The good news is that we can apply
different approaches for different kinds of regexps (like lazy AOTing for
the immutable literal regexps that are used the most). The other good
news is that JRuby already applies much more aggressive (but
weak-referenced) regexp caching than CRuby, and it really pays off. The
regexps that are created via Strings would go down the separate path you're
>> I've also done some review of Hotspot sources (I'm overwhelmed by it)
>> mainly the opto bytecode parser and profiler structures, a bit of
>> call site profiling and caching logic. Also did some review of
>> optimization phases. But it's still too early for me to draw any
>> conclusions, so my question is where to find additional information
>> if it exists and what best approach to choose when analyzing the
>> sources ?
> We've started a wiki for this purpose; see above. It would be great
> if you (or anyone else on the hotspot learning curve) would contribute
> to it as you discover important facts. I've added stuff, but since
> I've been working on this for 10 years, it's hard to have perspective
> on what newcomers need to know. And, this is the best year by far for
> being a newcomer!
This is great news, I'm reading this stuff as it appears online.
Big thanks for all the information.