Compiling large switch statements
john.r.rose at oracle.com
Wed Sep 10 22:11:06 UTC 2014
On Sep 10, 2014, at 2:10 PM, Paul Govereau <paul.govereau at oracle.com> wrote:
> I have been looking at the javac code generator, and I think we may be
> doing a poor job of compiling large switch statements because of an
> overly conservative estimate of the cost of the lookupswitch
> instruction. Because we over estimate the cost of a lookupswitch, we
> can generate very large tableswitch instructions: twice the size or
> larger than necessary.
> Briefly, javac estimates the total cost of a switch instruction to be:
> space + 3 * time,
> and generates the lowest cost instruction. However, javac assumes the
> time cost of a lookupswitch is linear, so we greatly favor the
> tableswitch. (For reference, the relevant code is in the visitSwitch
> method in jvm.Gen).
> As an example, consider a large switch containing 50 cases. For the
> tableswitch, the important factor is the range of the matched values,
> because this determines the size of the array needed for the table of
> addresses. Let's call this size S, then the cost of a tableswitch is
> estimated as:
> size (words): 4 + S + 1
> computation: 3 steps (2 comparisons, 1 load)
> For the same code the cost for a lookupswitch is estimated as:
> size(words): 3 + 2 * 50
> computation: 50 steps (presumably 50 comparisons?)
> Note that the lookupswitch match-offset pairs must be sorted. So, it
> stands to reason that to implementation is using (at least) a binary
> search on the sorted list of keys. Therefore, the computation cost
> must be closer to 2*log(50), no?
Yes. Log(#cases) is a far better estimate.
> With the current metrics, we allow a tableswitch instruction with 50
> items to be as large as 976 bytes before we change to the much smaller
> 412 byte lookupswitch. It seems to me the log-estimate is more
> accurate and we should change to the lookupswitch when the tableswitch
> reaches 468 bytes. However, I am only guessing as to how these
> bytecodes are implemented. Does anyone on the list know the details?
I knew a few, now I know more... funny how curiosity works.
The C2 JIT reorganizes lookupswitch and tableswitch instructions from scratch, using its own notions of what is efficient. You end up with a decision tree and/or some PC jump blocks, but you can end up with a mix of both from either instruction.
The C1 JIT reorganizes the instructions also, detecting key ranges (runs of common branch targets) and handling them with 1-2 comparisons per range. Oddly, it does not bother to put a decision tree on top of this, nor does it attempt jump tables.
The assembly interpreter uses a binary search on lookupswitch. The C++ template interpreter, which is rarely used, does a simple linear search over a lookupswitch, so that is the only component that is as bad as you feared.
I suggest using brute classfile size as your metric for choosing your switch. It will never be far off from the model you propose, and the differences won't matter once the JIT gets hold of the code. As a bonus, you'll slightly decrease the size of jars, which surely counts for something (you don't even have to run the code to collect your wins).
More information about the compiler-dev