Value array element size represented as log2
John Rose
john.r.rose at oracle.com
Tue Oct 2 09:24:31 UTC 2018
On Oct 2, 2018, at 12:55 AM, David Simms <david.simms at oracle.com> wrote:
>
> What is it you are asking here ? That you want to 20 bytes rather than pow 2 aligned ?
20 bytes would be represented as 5*4, where the 5 comes from the 1/3/5/7 selector,
and 4 is the power of two.
All odd sizes up to 7, all even sizes up to 16, all multiples of four up to 32 bytes,
and all multiples of eight up to 64 bytes are representable in this scheme.
The first several byte sizes not exactly representable are: 9(10), 11(12), 13(14), 15(16),
17,18,19(20), 21,22,23(24), 25,26,27(28), 29,30,31,(32). The parenthesized numbers
are the next higher representable sizes. As you can see the error is kept small.
The layout helper would have to be tweaked to allow the 1/3/5/7 selector, and users
of the layout helper which process value arrays would have to take it into account.
A good encoding for the selector would be:
enum sel1357 { sel1=0, sel3=1, sel5=2, sel7=3 };
sel1357 encode_sel(int sel) {
sel1357 sel_code = (sel1357)((sel - 1) >> 1);
assert(sel == 1 + (sel_code << 1));
return sel_code;
}
int decode_sel(sel1357 sel_code) {
int sel = 1 + ((int)sel_code << 1);
}
The logic for choosing the best selector and shift could be:
int decode_shift_and_sel(int size, sel1357& return_sel) {
int best_err = size, best_sel = 0, best_shift = 0;
for (int sel = 1; sel <= 7; sel += 2) {
int shift = upper_log2((size + sel - 1) / sel);
int err = (1 << shift) * sel;
if (err < best_err) { best_err = err; best_sel = sel; best_shift = shift; }
}
return_sel = encode_sel(best_sel);
return best_shift;
}
(There are surely bugs in this, but you get the idea. A dead-reckoning
algorithm could do the job faster, without the loop, but it would be harder
to read.)
>
> A long time ago we discussed that mul was too expensive compared to shift, so that encoding is intentional (addressing uses shift)
No mul is needed, ever. In most cases the JIT can see the array type and
emit suitable shift/add sequences to produce 3 (1+2), 5 (1+4), or 7 (8-1).
Even the interpreter, or polymorphic code, can produce suitable ALU
operations that are competitive with mul, if only those four multipliers
are possible.
int mul_by_sel(intptr_t x, sel1357 sel) {
switch (sel) {
case sel1: return x;
case sel3: return x + (x << 1);
case sel5: return x + (x << 2);
case sel7: return (x << 3) - x;
}
}
Similar points can be made for a one-bit 1/3 selector, with more footprint
and somewhat simpler decoding.
int mul_by_sel(intptr_t x, sel13 sel) {
return x + (-(int)sel & (x << 1));
}
In many cases, the value types will have true power of two size, so that
the selector will be "sel1" (unit multiplier). If that's common enough, then
a speculation might be in order:
int mul_by_sel(intptr_t x, sel1357 sel) {
if (sel == sel1) return x;
int result = (x << (int)sel); // needs fixup for sel7
if (sel == sel7) x = -x; // fixup
result += x;
return result;
}
int mul_by_sel(intptr_t x, sel13 sel) {
return sel == sel1 ? x : x + (x << 1);
}
More information about the valhalla-dev
mailing list