RFR :7088419 : (L) Use x86 Hardware CRC32 Instruction with java.util.zip.CRC32 and java.util.zip.Adler32
david.r.chase at oracle.com
Thu May 16 11:12:11 PDT 2013
Previously sent to core-libs-dev, apologies for double-coverage.
Comments received are appended at the bottom.
problem: Some applications spend a surprising amount of time computing CRC32s
(Not sure I'm supposed to be precise on an open mailing list). Recent Intel
architectures provide instructions that might be useful in addressing this.
I turned this into a general attack on performance in Adler32 and CRC32, partly because the bug report was not clear on the size of the problematic inputs. The general approach turned out to be useful, because it's tricky to get the small-case overheads down for the accelerated-instruction version of the code.
1) For CRC32 and Adler32, break out the "small" case (single bytes, and up to around 60-80 bytes)
to be computed on the Java side, avoiding JNI overheads.
2) For CRC32 and Adler32, figure out the "combine" operations for the checksum of a concatenated pair of byte sequences, and add fork-join parallelism for large-enough inputs. Tuning this is hard, so large "small" buffer sizes were chosen (< 512K for unaccelerated CRC32, 1MB for Adler32 and accelerated CRC32) to make this a safe optimization. This can be disabled by setting the (not-yet-documented, perhaps wrongly named) property "sun.zip.serialOnly=true". I just now noticed that this is not the case for Adler32; assuming we agree on the existence of this flag and its name, this needs to be added there, too.
3) For Adler32, defer calculation of the actual "Adler" checksum until it is requested. This is an optimization for byte-at-a-time use.
4) For CRC32, use the new-ish PCLMULQDQ instruction (64-bit by 64-bit carryless multiply, yielding a 128-bit result) in the style described in Intel's white paper on using this instruction to compute CRCs. All the constants are different because the CRC32 is bit-reversed from the checksums computed in Intel's paper, but the unrolling is the same, and the fill and drain code is also similar. This is by default enabled whenever CLMUL and AVX features are both present, and can be disabled at the command line with -XX:-UseCLMUL (or -XX:-UseAVX).
There is a companion webrev that puts information about the availability of the PCLMULQDQ in 3-operand form into a hidden property:
The CRC benchmark test was adjusted to also warm-up the small-stuff case.
A new test was written that was designed to exercise all the corner cases of the new Adler32 and CRC32 implementations; byte-at-a-time is compared to byte-at-a-time-not-looking is compared to arrays of bytes is compared to DirectBuffers, at a variety of sizes and alignments designed to exercise combinations of non-empty fill and drain for the Intel-accelerated CRC32, and corner cases for the fork-join code.
The new test takes just under 2 minutes on a T1; that seems like it is cutting it close, so I should either trim the test or give it a little more time.
Separate C unit tests (not checked in, not sure where to put them) were written to exhaustively test the accelerated code and research its portability to various compilers.
problems and justifications:
Using our currently specified compilers for building the JDK, the C access to the new Intel intrinsics is not available. This is unfortunately also true of the assemblers paired with our currently specified compilers.
( What IS the specified compiler for building on Linux? Gcc 4.2 on the Mac, if asked for a 16-byte alignment, helpfully tells me that is too big, and it will use 15 instead. This might be an issue on Linux, if 4.2 is the build compiler version there. ) Therefore, the native code for crc32 includes:
1) the C code with intrinsics for the accelerated CRC32 kernel (compilable with gcc 4.6 or clang, and almost compilable with Sun Studio 12.3; it parses and optimizes, but triggers a crash in the back end, which needs to be reported).
2) the 64-bit version of the assembly language, including commented byte sequences for the instructions that the build-specified assemblers do not recognize.
3) the 32-bit version of the assembly language, including commented byte sequences for the instructions that the build-specified assemblers do not recognize.
My hope is that the assembly language is temporary, though it is only a modest step higher in abstraction from the C code, which is almost entirely intrinsic calls. If we have to wait for the C compilers, perhaps the assemblers will all become modern enough to handle the instructions by their proper names. I have a small tool that I wrote to automatically generate the asm statement from the output lldb x/i. (Where does such a weird little tool belong?)
I tried to use the instruction just as a stand-alone intrinsic (that might be wrapped in unsafe code and called from Java) but that did not yield a performance gain; it only seems to work well if embedded in an unrolled loop with several independent pipelines of computation running, all feeding 128-bit accumulator registers. That is to say, Intel wrote that white paper for a reason, which was to help guide people like me towards happy results. To my knowledge, we don't support compilation to 128-bit wide xmm registers in hotspot, so this approach was unlikely to work.
Another possibility is to wrap the entire kernel up as a single 80-instruction intrinsic. This didn't seem like a win to me, and it would require the use of Unsafe code to interface to it. Even then it's not clear it's possible, because the kernel requires that its input be 16-byte aligned. I don't know if the alignment is guaranteed by the garbage collector, and I could not find a way in Unsafe to even get the (temporarily valid) address of a GC-allocated object. It would be helpful to know the address even if it is only probably true, because for fork-join it is helpful to arrange splits so that they land on a 16-byte boundary so that the eventual serial case will (probably) go as fast as possible with empty fill/drain steps.
Yet another possibility is to move all of "fastcrc32" (the outer fill and drain code written in C) into a single intrinsic, but that's 208 instructions total (I just looked).
Oh yeah, performance improvements. On an Ivy Bridge laptop (2 cores, 2 threads/core), the accelerated CRC without workstealing is 2.5x faster on large inputs, 2x faster at about 512 bytes. Modest amounts of fork-join parallelism (only fork for work 1M or larger if CLMUL present) provide about 1.5x on top of that, for a 3.75x speedup.
Fork-join benchmarks pretty well on the Sun T1, T2, T4 series of processors down to somewhat smaller block sizes, but for now I am conservative an use a serial-if-smaller-than setting of 512K. I've observed speedups as high as 13x on a T1 (1(32), dr-evil), 24x on a T2+ ( 4(256), mrspock), and 8x on a T4 (8(64), sc11d1901).
Parallel performance is a little harder to reason about on big x86 boxes (both Intel and AMD), so I am leaving the threshold high. Dave Dice thought this might be an artifact of cores being put into a power-saving mode and being slow to wake (the particular benchmark I wrote would have been pessimal for this, since it alternated between serial and parallel phases). The eventual speedups were often impressive (6x-12x) but it was unclear how many hardware threads (out of the 32-64 available) I was using to obtain this. Yes, I need to plug this into JMH for fine-tuning. I'm using the system fork-join pool because that initially seemed like the good-citizen thing to do (balance CRC/Adler needs against those of anyone else who might be doing work) but I am starting to wonder if it would make more sense to establish a small private pool with a bounded number of threads, so that I don't need to worry about being a good citizen so much. It occurs to me, late in the game, that using big-ish units of work is another, different way to be a bad citizen. (I would prefer to get this checked in if it represents a net improvement, and then work on the tuning afterwards.)
Thanks for your consideration, I know that this is large and somewhat late. It took a while to get the "last" bug out.
I'm sure Doug or Brian or David Holmes will have opinions on this point but I would think using the common pool is right. If parallel sort, CRC32 and other specific usages each created their own thread pool then I could imagine a lot of thread pools hanging around and competing. Plus there are cases like EE where no-parallelism might be the right answer and one wouldn't want to have to configure each usage.
In any case, this looks really good work. One thing that might be worth checking is startup/warm-up. I have a vague memory of this being a concern in the past with Adler32, Sherman might remember the details.
I haven't looked at the details for the PCLMULQDQ instruction but a caryless multiply could be of use to some of the crypto primitives as well (GHASH, GMAC and probably others). Perhaps the property could be "sun.hotspot.x64.clmulSupported" or something less specific to the usage.
What's our actual experience with needing switches like -XX:-UseCLMUL or-XX:-UseAVX for other features? Faulty implementations? Feature misreporting? Performance regressions? Virtualization interactions?
my reply to Mike:
Note that the instruction also works on 32-bit, and once the builds all use sufficiently modern compilers, the same source code works for both.
I haven't seen any virtualization problems -- I've also been testing on VMs, Ubuntu and Solaris on VMWare.
Should I also give it a go in VirtualBox? What should I be looking for?
More information about the hotspot-compiler-dev