vcall wish for hotspot

Andy Nuss andrew_nuss at
Sat May 18 21:37:42 PDT 2013

Well, I guess I don't know what I'm doing after all.

I ported my benchmark to C++, and found that the C++ overhead for the virtual call was 6 nanos, way slower than the 0.5 nanos I was getting for hotspot.

Then I guessed that maybe Clemens is right: hotspot sees that Link1 and Link2 are very similar in structure, and even though iterating through the list of base-typed links would seem to force a virtual call, hotspot just turns it into an if-else and inlines everything else, so the extra 0.5 nanos is simply due to branch misprediction.
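For reference, here is a minimal sketch of the bimorphic shape being described (class names and method bodies are illustrative, not my actual benchmark): with only two receiver types ever observed at the call site, the JIT can devirtualize the call into a type check plus two inlined bodies, so the residual per-call cost is mostly prediction, not dispatch.

```java
// Sketch of a bimorphic call site. With only two receiver types observed,
// HotSpot's C2 JIT can devirtualize: it emits a class check (effectively an
// if-else on the receiver's type) and inlines both get() bodies, so no
// vtable load is needed. The remaining per-call cost when receiver types
// alternate is largely branch misprediction.
abstract class Link {
    abstract int get();
}

final class Link1 extends Link {
    int get() { return 1; }
}

final class Link2 extends Link {
    int get() { return 2; }
}

public class Bimorphic {
    public static void main(String[] args) {
        Link[] links = new Link[1024];
        for (int i = 0; i < links.length; i++) {
            links[i] = (i % 2 == 0) ? new Link1() : new Link2();
        }
        long sum = 0;
        for (int iter = 0; iter < 100_000; iter++) { // enough passes to trigger JIT compilation
            for (Link l : links) {
                sum += l.get(); // bimorphic: only Link1 and Link2 ever seen here
            }
        }
        System.out.println(sum); // 512*1 + 512*2 = 1536 per pass, times 100,000 passes
    }
}
```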

So I created 10 variants of my Link class, Link1 thru Link10, and hotspot was now about the same as C++: 6 nanos.

But that could mean that hotspot was now using a vtable, or it could mean hotspot was paying for 10 branch mispredictions.

So I tried 20 variants of my Link class, Link1 thru Link20, and hotspot was now about 11 nanos.  So it would seem that hotspot, for simple variants of a base class, doesn't use a vtable at all, but just creates one big inlined if-else, whereas C++, even with just 2 variants of the class, always uses a vtable.

If this is true, that itself is very interesting.  Any ideas on how to write a test that really forces hotspot to use the vtable?
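One commonly suggested approach (a hedged sketch, not my actual benchmark; all names below are hypothetical): make the call site observe more than two receiver classes and randomize their order. HotSpot inlines monomorphic and bimorphic sites, but once a site has seen three or more receiver types it is treated as megamorphic, and C2 generally falls back to a true vtable dispatch.

```java
// Hedged sketch: a call site that has observed many receiver types.
// HotSpot inlines monomorphic and bimorphic sites; once three or more
// distinct receiver classes have been seen, the site goes megamorphic and
// C2 typically compiles a real vtable dispatch rather than an if-else chain.
abstract class Link {
    abstract int get();
}

public class Megamorphic {
    // Twenty variants, matching the 20-class experiment above. Each anonymous
    // class is a distinct type, so the call site below sees 20 receivers.
    static Link[] makeVariants() {
        java.util.List<Link> out = new java.util.ArrayList<>();
        for (int k = 1; k <= 20; k++) {
            final int v = k;
            out.add(new Link() { int get() { return v; } });
        }
        return out.toArray(new Link[0]);
    }

    public static void main(String[] args) {
        Link[] variants = makeVariants();
        // Shuffle receiver order so neither the hardware branch predictor nor
        // a profile-guided dispatch chain can lock onto a repeating pattern.
        Link[] links = new Link[1024];
        java.util.Random rnd = new java.util.Random(42);
        for (int i = 0; i < links.length; i++) {
            links[i] = variants[rnd.nextInt(variants.length)];
        }
        long sum = 0;
        for (int iter = 0; iter < 20_000; iter++) {
            for (Link l : links) {
                sum += l.get(); // megamorphic: 20 distinct receiver classes
            }
        }
        System.out.println(sum);
    }
}
```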

 From: Andy Nuss <andrew_nuss at>
To: hotspot <hotspot-compiler-dev at> 
Sent: Saturday, May 18, 2013 5:18 PM
Subject: Re: vcall wish for hotspot

Ok guys.  I'll try to get set up with github, and provide both cases for you with the latest C++ and latest jdk7, or retract my claim.

By the way, there are no branches, as you will see from my benchmark.

 From: Clemens Eisserer <linuxhippy at>
To: Andy Nuss <andrew_nuss at>; hotspot-dev at 
Sent: Saturday, May 18, 2013 4:22 PM
Subject: Re: vcall wish for hotspot

Hi Andy,

>> I do know that C++ compilers can emit much much faster vcalls,
>> especially when the class is not involving multiple inheritance.
> As to how much faster c++ is, I haven't coded in C++ at all in 5 years, and
> then it was on a much slower laptop, so I'm not set up to benchmark it.

You have not benchmarked it, yet claim to "know" that C++ compilers emit
"much much" faster code for this case?

> Then in two identical warmed-up loop functions of 10 billion iterations,
> one that iterates and returns the sum of Link get() and one that returns the sum of Link3 get(),
> compare the time to execute both.
> The difference is about .5 nanos per iteration on my machine.

The difference you see with your benchmark is not only the pure
overhead of the bimorphic call logic but, far more importantly, branch
misprediction caused by the alternating branch targets.
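A minimal sketch of how one might separate the two effects (illustrative names, not the original benchmark): drive the same bimorphic call site once with a perfectly predictable receiver pattern and once with a shuffled one. The dispatch mechanism is identical in both runs, so any timing gap between them is attributable to misprediction rather than to the call itself.

```java
// Hedged sketch: isolating the branch-misprediction component. The same
// bimorphic call site runs once over a predictable receiver layout
// (all Link1 first, then all Link2) and once over a random mix. Dispatch
// is identical; only predictability differs.
abstract class Link { abstract int get(); }
final class Link1 extends Link { int get() { return 1; } }
final class Link2 extends Link { int get() { return 2; } }

public class Mispredict {
    static long sumOver(Link[] links, int iters) {
        long sum = 0;
        for (int it = 0; it < iters; it++)
            for (Link l : links) sum += l.get();
        return sum;
    }

    public static void main(String[] args) {
        int n = 4096;
        Link[] predictable = new Link[n];
        Link[] shuffled = new Link[n];
        for (int i = 0; i < n; i++)
            predictable[i] = (i < n / 2) ? new Link1() : new Link2();
        java.util.Random rnd = new java.util.Random(1);
        for (int i = 0; i < n; i++)
            shuffled[i] = rnd.nextBoolean() ? new Link1() : new Link2();

        sumOver(predictable, 20_000); // warm up so the JIT compiles sumOver
        sumOver(shuffled, 20_000);

        long t0 = System.nanoTime();
        long s1 = sumOver(predictable, 20_000);
        long t1 = System.nanoTime();
        long s2 = sumOver(shuffled, 20_000);
        long t2 = System.nanoTime();

        System.out.println("predictable ns/call: " + (double) (t1 - t0) / (20_000L * n));
        System.out.println("shuffled    ns/call: " + (double) (t2 - t1) / (20_000L * n));
        System.out.println(s1 + " " + s2);
    }
}
```

The checksums are printed so the JIT cannot eliminate the loops as dead code, a standard microbenchmarking precaution.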

