Valhalla, startup, performance of interpreter, and vwithfield
ioi.lam at oracle.com
Thu May 16 15:11:11 UTC 2019
In C1, it should be pretty easy to eliminate the intermediate object for
this pattern. We can probably do the same thing in the interpreter (by
rewriting to a new fast_aaload_getfield bytecode, etc.).
On 5/15/19 2:25 PM, Karen Kinnear wrote:
> Thank you Sergey - sounds like we are in agreement that we could apply your existing measurements
> to predict the cost of creating a new inline class with multiple fields in the interpreter (and maybe C1).
> So we can go back to the other conversation thread which addresses what model we want for creation
> of inline classes, where we can discuss benefits other than performance.
> That is Brian’s “replace withfield?” email thread.
>> On May 15, 2019, at 3:59 PM, Sergey Kuksenko <sergey.kuksenko at oracle.com> wrote:
>> On 5/15/19 6:49 AM, Karen Kinnear wrote:
>>> I discussed this with Frederic, and between MVT and LW1 he had improved the interpreter overhead of the withfield bytecode.
>>> He pointed out that the measurement we are looking for is slightly different - at least than my understanding of what
>>> you measured.
>>> The question is about the cost of inline class creation vs. identity class creation, not a single operation default or withfield.
>>> So - could you take a small inline class - say one with 4 fields, each containing an int - and compare
>>> the cost of a method that constructs the inline class using 4 withfields, vs. the cost of a constructor
>>> for a comparable identity class?
>> Probably I was unclear. In the interpreter, the cost of a single vdefault or vwithfield approximately equals the cost of the entire creation of a similar identity class (new + constructor). Thus, for a class with N fields, the full cost of creating the inline class is roughly (N+1) times higher than creating the equivalent identity class (1 vdefault + N vwithfields).
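Sergey's (N+1) estimate can be sketched as simple arithmetic. The class and method names below are illustrative, not from the thread; the cycle counts come from the decomposition he gives further down (a single vdefault or vwithfield costs about as much as a full identity-class creation, roughly 440 cycles in the interpreter):

```java
// Hypothetical back-of-the-envelope model of the (N+1)x interpreter cost.
// Cycle counts are taken from the measurements quoted later in this thread.
public class InterpreterCostModel {
    static final long IDENTITY_CREATION_CYCLES = 440; // new + constructor, interpreted
    static final long PER_OP_CYCLES = 440;            // one vdefault or one vwithfield

    /** Estimated interpreter cost of building an inline class with n fields:
     *  1 vdefault + n vwithfields, each priced like a full identity creation. */
    static long inlineCreationCycles(int n) {
        return (n + 1L) * PER_OP_CYCLES;
    }

    /** Cost of inline creation relative to the equivalent identity creation. */
    static double relativeCost(int n) {
        return (double) inlineCreationCycles(n) / IDENTITY_CREATION_CYCLES;
    }

    public static void main(String[] args) {
        // A 4-field inline class costs ~5x the equivalent identity class.
        System.out.println(relativeCost(4)); // 5.0
    }
}
```

This is only a model of the measurements, not a measurement itself; it makes the (N+1) scaling explicit for Karen's 4-field example.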
>> Will it have a high impact on interpreter speed? Yes.
>> Will it have a high impact on startup time? No. (Here I can't be 100% sure, only 80%; it requires good enough compilation by C1, which is important for our tiered compilation policy.)
>> If we talk about interpreter performance as a requirement, there are many patterns that should be considered. For example, compare these two patterns:
>> 1) V[i].x + V[i].y
>> 2) V v = V[i]; v.x + v.y
>> The first one is two times slower in the interpreter (two loads from the inlined/flattened array -> two allocations).
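The two patterns above can be written out in source form. This is a plain-Java sketch with a final class standing in for an inline class (the names are illustrative, not from the thread): in the interpreter, each load from a flattened array buffers the element into a fresh heap object, so pattern 1 pays for two allocations and pattern 2 for one:

```java
// Sketch of the two access patterns Sergey compares. A plain final class is
// used here as a stand-in for a Valhalla inline class, so this compiles on
// any JDK; the allocation counts in the comments describe the flattened-array
// behavior in the Valhalla interpreter, not this stand-in.
public class AccessPatterns {
    static final class V {          // stand-in for an inline class
        final int x, y;
        V(int x, int y) { this.x = x; this.y = y; }
    }

    static int pattern1(V[] a, int i) {
        return a[i].x + a[i].y;     // two element loads -> two buffered copies
    }

    static int pattern2(V[] a, int i) {
        V v = a[i];                 // one element load -> one buffered copy
        return v.x + v.y;
    }

    public static void main(String[] args) {
        V[] a = { new V(1, 2), new V(3, 4) };
        System.out.println(pattern1(a, 1)); // 7
        System.out.println(pattern2(a, 1)); // 7
    }
}
```

Hoisting the element load into a local, as in pattern 2, halves the buffering work even though both patterns compute the same result.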
>>> The theory is that for the inline class, there would be 4 withfields, each with an allocation step
>>> (for the interpreter, and possibly for C1).
>> I don't have any proof at the moment, but I consider value type allocations in C1 a potential danger for startup time.
>>> So the cost of construction would be much higher than the
>>> equivalent identity class constructor. For those not in the nest, there would be the need to call the method
>>> that creates the inline class; whereas the identity class could be created by anyone - so my mental model
>>> is that both examples would have a call overhead in them.
>> An identity class can be created by anyone, but an identity class has to have an invocation (the constructor) after new. Both identity and inline classes have a mandatory invocation in that case. As for pure invocation overhead, the cost of an invocation (just the invocation) is ~3 times lower than the cost of an allocation (in the interpreter).
>>> Does that make sense to you?
>>> Would that be something you could measure?
>>> I think we have alternative approaches which would not require each field setting to perform an allocation step.
>> That will definitely improve the interpreter speed. The question is - are there any other benefits besides the interpreter speed?
>>>> On May 13, 2019, at 5:26 PM, Sergey Kuksenko <sergey.kuksenko at oracle.com> wrote:
>>>> On 5/13/19 7:00 AM, Brian Goetz wrote:
>>>>> This is good news. I want to ask further about the numbers you cite here. You compare value creation to classic object creation, but obviously we want value creation to be faster.
>>>> In the interpreter? I am afraid that value creation in the interpreter can't be faster than classic object creation. Interpretation of value types is still slower than interpretation of equivalent classic objects, but the difference has been reduced drastically. Also, I didn't find any scenario where interpreter performance has a significant impact on startup time. The first execution, which implies class loading, verification, etc., is ~500x slower than subsequent executions in the interpreter (for both classic objects and value types).
>>>>> When you say it is comparable to classic object creation costs, I assume that you are not including the allocation cost, and comparing only the field write costs?
>>>> No, it includes the allocation cost. Don't forget - I am talking about interpreter performance. Here is a decomposition:
>>>> 1. Classic object creation: ~230ns (500 cycles) for the whole object creation. It can be split into ~200ns (440 cycles) for the object allocation and ~30ns (60 cycles) for field initialization.
>>>> 2. Value type creation: any single vdefault or vwithfield operation costs ~200ns (440 cycles). That is on par with (even slightly better than) full object creation, which looks normal, because a single vdefault or vwithfield operation "creates" an object (or something similar to one). Of course, the more fields we have, the more expensive it is in the interpreter to assemble the full object.
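The cycle counts above can be cross-checked against the nanosecond figures with simple arithmetic (cycles divided by clock rate gives time; the 2.2GHz clock is the one Sergey mentions elsewhere in the thread):

```java
// Cross-check of the cycle/nanosecond figures quoted in the thread,
// assuming the 2.2GHz (2200MHz) clock mentioned later in the discussion.
public class CycleMath {
    static double cyclesToNs(long cycles, long clockMHz) {
        return cycles * 1000.0 / clockMHz;
    }

    public static void main(String[] args) {
        System.out.println(cyclesToNs(440, 2200)); // 200.0 ns: one vdefault/vwithfield, or one allocation
        System.out.println(cyclesToNs(500, 2200)); // ~227 ns: full identity creation (~230ns quoted)
        System.out.println(cyclesToNs(60, 2200));  // ~27 ns: field initialization (~30ns quoted)
    }
}
```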
>>>> As for compiled code - after C2 we have the following numbers:
>>>> e.g. (two-field class)
>>>> 1. Classic object creation: 14.9ns (total cost) (G1GC)
>>>> 1.1 Classic object creation - only fields write cost: 0.99ns
>>>> 2. Value type (full creation): 0.97ns (slightly better than just the field-write cost of the classic object).
>>>> Note: all examples here were measured with all data fitting perfectly into the CPU caches, even for classic objects. All value type benefits due to better cache locality were intentionally excluded.
>>>>>> I did a quick evaluation of startup and interpreter performance cost. I have to take back my words that "vwithfield is a major contributor to interpreter speed and a merged (or fused) vwithfield could improve interpreter performance". It was quite a long time ago that I last looked into the interpreter's performance. A huge amount of work has been done on the interpreter since then, and I no longer consider interpreter performance an issue. As for vwithfield, the cost of a single vwithfield (in the interpreter) is now approximately 200ns (at 2.2GHz). That is neither a big nor a small value. If we compare the cost of value creation vs. the cost of creating a similar classic Java object (simple writes), then a single vwithfield costs ~7%-10% of the whole object creation. So I am guessing that if you have a value with 10 fields (and 10 vwithfield operations), you may double the value creation cost, but it will have a minor impact on the whole execution.
>>>>>> Also I have to say that if we look at startup for the first execution of code, the interpreter takes less than 1%; all other actions (class loading, verification, etc.) take much more time. As for "time to performance", I haven't evaluated it yet; the interpreter's impact could be higher there. At the same time, working TieredCompilation will improve "time to performance" much more than any interpreter tuning.