RFR: 7088419: (L) Use x86 Hardware CRC32 Instruction with java.util.zip.CRC32 and java.util.zip.Adler32
Alan.Bateman at oracle.com
Mon May 20 00:33:10 PDT 2013
On 20/05/2013 02:24, David Chase wrote:
> I don't like this approach for several reasons.
> First, we're not done finding places that fork-join parallelism can make things go faster. If, each time we find a new one, we must split it into the parallel and serial versions, we're going to get a tiresome proliferation of interfaces. We'll need to document, test, etc, and programmers will need to spend time choosing "the right one" for each application. This will be another opportunity to split application source bases into "old" and "new" -- these chances are always there, but why add more?
> Second, this doesn't actually address the bug. This was done for a bug, they want CRC32 to go faster in various places, some of them open source. They were not looking for a different faster CRC, they were looking for the same CRC to go faster. They don't want to edit their source code or split their source base, and as we all know, Java doesn't have #ifdef.
> Third, I've done a fair amount of benchmarking, one with "unlimited" fork join running down to relatively small task sizes, the other with fork-join capped at 4 threads (or in one case, 2 threads) of parallelism. Across a range of inputs and block sizes I checked the efficiency, meaning the departure from ideal speedup (2x or 4x). For 4M or larger inputs, across a range of machines, with parallelism capped at 2 (laptop, and single-split fork-joins) or 4, efficiency never dropped below 75%. The machines ranged from a core-i5 laptop, to an old T1000, to various Intel boxes, to a good-sized T4.
> Out of 216 runs (9 machines, inputs 16M/8M/4M, task sizes 32K to 8M),
> 10 runs had efficiency 75% <= eff < 80%
> 52 runs, 80% <= eff < 90%
> 139 runs, 90% <= eff < 110%
> 15 runs had superlinear speedup of 110% or better "efficiency" (I checked for noisy data; it was not noisy).
> We can pick a minimum-parallel size that will pretty much assure no inefficient surprises (I think it is 4 megabytes, but once committed to FJ, it looks like a good minimum task size is 128k), and there's a knob for controlling fork-join parallelism if people are in an environment where they noticed these momentary surprises and care (a T-1000/Niagara does about 9 serial 16M CRC32s per second, so it's not a long-lived blip). If necessary/tasteful, we can add a knob for people who want more parallelism than that.
> If it's appropriate to put the benchmarks (PDF) in a public place, I can do that.
> Fourth, I think there's actually a bit of needing to lead by example. If we treat fork/join parallelism as something that is so risky and potentially destabilizing that parallelized algorithms deserve their own interface, then what will other people think? I've got plenty of concerns about efficient use of processors, but I also checked what happens if the forkjoin pool is throttled, and it works pretty well.
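The fork-join shape described in the quoted message — recursive splitting down to a minimum task size of roughly 128K — can be sketched as follows. This is an illustrative stand-in, not the proposed patch: combining two partial CRC32 values requires a zlib-style crc32_combine step that java.util.zip does not expose, so the sketch uses a simple additive checksum whose partial results combine by addition. The class name and threshold constant are assumptions for the example; only the 128K figure comes from the thread.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch of the fork-join split with a minimum leaf size.
// An additive checksum stands in for CRC32, because CRC partial results
// need a crc32_combine step not shown here.
public class ForkJoinChecksum extends RecursiveTask<Long> {
    static final int MIN_TASK = 128 * 1024;   // ~128K leaf size from the thread

    final byte[] data;
    final int from, to;

    ForkJoinChecksum(byte[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override protected Long compute() {
        if (to - from <= MIN_TASK) {
            // Leaf: checksum this slice serially.
            long s = 0;
            for (int i = from; i < to; i++) s += data[i] & 0xff;
            return s;
        }
        // Split in half, fork the left, compute the right in this thread.
        int mid = (from + to) >>> 1;
        ForkJoinChecksum left = new ForkJoinChecksum(data, from, mid);
        ForkJoinChecksum right = new ForkJoinChecksum(data, mid, to);
        left.fork();
        long r = right.compute();
        return r + left.join();  // additive combine; real CRC needs crc32_combine
    }

    public static void main(String[] args) {
        byte[] buf = new byte[4 * 1024 * 1024];   // 4M input, the suggested minimum
        for (int i = 0; i < buf.length; i++) buf[i] = (byte) i;

        long parallel = ForkJoinPool.commonPool()
                .invoke(new ForkJoinChecksum(buf, 0, buf.length));
        long serial = 0;
        for (byte b : buf) serial += b & 0xff;
        System.out.println(parallel == serial);
    }
}
```

Because addition is associative, the parallel and serial results agree regardless of how the splits land; a real CRC32 version would keep exactly this task structure and replace the leaf and combine steps.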
I think we need to get more experience with parallel operations before
considering changing the default behavior of long-standing methods. This
is why I am suggesting this should be opt-in, meaning you run with
something like -Djdk.enableParallelCRC32Update=true to have the existing
methods use FJ. Having it opt-in rather than opt-out would also reduce
concerns if this is proposed to be back-ported to jdk7u. I don't have an
opinion as to whether other tuning knobs are required.
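The opt-in gating suggested above could look something like the following. The property name -Djdk.enableParallelCRC32Update=true is taken from the message; the class and field names are illustrative only, not the actual patch.

```java
// Hypothetical sketch: gate the parallel CRC32 path behind an opt-in
// system property, defaulting to the existing serial behavior.
public class ParallelCrcGate {
    // Boolean.getBoolean returns false unless the property is set to "true",
    // so the serial path remains the default.
    static final boolean PARALLEL_ENABLED =
        Boolean.getBoolean("jdk.enableParallelCRC32Update");

    public static void main(String[] args) {
        System.out.println(PARALLEL_ENABLED ? "parallel" : "serial");
    }
}
```

Run plainly, this prints "serial"; run with -Djdk.enableParallelCRC32Update=true, it prints "parallel".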
At this point, we have Arrays.parallelSort and the Streams API defines
the parallel() method to get a stream that is parallel. Having the word
"parallel" in the code means it is clear and obvious when reading the
code (no surprises). Maybe going forward this will become unnecessary,
meaning the parallelism will be transparent. For now though, I think we
should at least consider adding parallelUpdate methods.
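The existing precedent for making parallelism visible at the call site can be seen in these standard-library APIs (Arrays.parallelSort and Stream.parallel() are real JDK 8 methods; a parallelUpdate method on CRC32 is only the proposal under discussion):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ExplicitParallel {
    public static void main(String[] args) {
        // Arrays.parallelSort: the word "parallel" is visible where it is used.
        int[] data = {5, 3, 1, 4, 2};
        Arrays.parallelSort(data);
        System.out.println(Arrays.toString(data));

        // Streams: parallelism is opted into explicitly via parallel().
        long sum = IntStream.rangeClosed(1, 100).parallel().sum();
        System.out.println(sum);
    }
}
```

A parallelUpdate method on CRC32 would follow the same convention: the reader sees at the call site that a fork-join pool may be engaged.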
More information about the hotspot-compiler-dev mailing list