Speed optimization of Spliterators.spliteratorUnknownSize for parallel case
Tagir F. Valeev
amaembo at gmail.com
Tue Jul 14 02:43:17 UTC 2015
Thank you for the detailed answer.
PS> Thanks for looking at this. Judging by the results i am guessing
PS> your measurements were performed on a 4-core system.
Yes, quad-core, I mentioned it before.
PS> My initial inclination for these scenarios is it is likely better
PS> for the developer to split the execution in two parts. The first
PS> part, low N and low Q, is to collect sequentially into an array or
PS> list. The second part (total size is known), low N and high Q,
PS> would then use that array/list as the stream source and operate in parallel.
Probably it would be fine to add such recommendations to the
corresponding methods javaDoc (for example, BufferedReader.lines).
PS> That, in a sense, is almost what your spliterator does for n <
PS> initial batch size -:) but because the target threshold size is
PS> derived from the unknown root size (Long.MAX_VALUE) the
PS> sub-spliterators are not themselves further split, so the
PS> parallelism is better but still limited. It's tricky to adjust the
PS> target size dynamically, we could do it for the first split (e.g.
PS> unknown size -> two known size), but tracking further splits is
PS> harder and may not reliably perform (due to concurrent
PS> sub-splitting). I generally don't want to complicate the
PS> computation mechanism for poor splitting use-cases.
I see. That actually was also the thing which I wanted to suggest.
Still even improving the first split would be great, I'd vote for it.
This way AbstractSpliterator user implementations may also benefit.
My optimization cannot be applied to AbstractSpliterator as we cannot
control the tryAdvance method.
PS> Your approach also degenerates when N is just over the current batch size.
Yes, it becomes bad, but my measurement show that it's still no worse
than current implementation:
Benchmark (parallel) (size) Mode Cnt Score Error Units
IteratorTest.base false 1028 avgt 30 659.308 ± 12.618 us/op
IteratorTest.base false 1100 avgt 30 746.619 ± 21.260 us/op
IteratorTest.base true 1028 avgt 30 194.800 ± 1.988 us/op
IteratorTest.base true 1100 avgt 30 228.926 ± 2.923 us/op
IteratorTest.jdkSpliterator false 1028 avgt 30 648.622 ± 3.857 us/op
IteratorTest.jdkSpliterator false 1100 avgt 30 750.741 ± 10.279 us/op
IteratorTest.jdkSpliterator true 1028 avgt 30 673.574 ± 6.469 us/op
IteratorTest.jdkSpliterator true 1100 avgt 30 679.499 ± 4.310 us/op
IteratorTest.optimized false 1028 avgt 30 643.209 ± 1.686 us/op
IteratorTest.optimized false 1100 avgt 30 718.077 ± 8.123 us/op
IteratorTest.optimized true 1028 avgt 30 674.447 ± 5.153 us/op
IteratorTest.optimized true 1100 avgt 30 674.622 ± 4.252 us/op
PS> It does not make as much sense to include such an estimate
PS> argument for Files.lines for two reasons:
PS> 1) I recently improved Files.lines for common character sets
PS> where it is efficient to identify encoded line feeds (such as UTF-8); and
PS> 2) for other character sets we could derive an estimate from the
PS> file size (it could actually be the file size, which is the case for 1).
I see. FileChannelLinesSpliterator is a nice improvement. Have you
tested it on device files, named pipes or cases when the file size
is changed in-between by some other process? Also probably it should
be mentioned in BufferedReader.lines documentation, that using
Files.lines for file source is preferred.
PS> So i am inclined to do the following:
PS> 1) consider adding a size estimate argument for all stream
PS> factories that are of unknown size and where no reasonable estimate can be derived.
PS> 2) modify the spliterator factories in Spliterators to accept an estimate e.g.:
PS> public static <T> Spliterator<T>
PS> spliteratorUnknownSize(Iterator<? extends T> iterator,
PS> long sizeEstimate,
PS> int characteristics)
By the way currently it's possible to create an IteratorSpliterator
with estimated size:
Spliterators.spliterator(iterator, estimatedSize, Spliterator.CONCURRENT);
Of course it's a misuse of CONCURRENT flag, but it's not used
otherwisely by stream API, thus currently it has no unwanted
With best regards,
PS> 3) Possibly improve the final split logic of implementations in
PS> Spliterators so that it benefits known size, estimated size [*], and maybe unknown size.
PS> [*] if the known or estimated size remaining is less than the
PS> next batch size, then split in half as the final splitting action.
PS> In this case there is no need for an array field since the iterator can be reused.
PS> On Jul 10, 2015, at 8:01 PM, Tagir F. Valeev <amaembo at gmail.com> wrote:
>> The problem of ineffective parallelization of spliteratorUnknownSize
>> for low-N/high-Q tasks was already discussed in the thread "Exploiting
>> concurrency with IteratorSpliterator of unknown size"
>> [ http://mail.openjdk.java.net/pipermail/lambda-dev/2014-March/011968.html ]
>> It was proposed to modify some heuristics and/or add some parameters
>> to control them.
>> I have a simpler idea which allows to run in parallel high-Q tasks for
>> N <~ 3000 about twice faster without any API changes while having no
>> visible impact for high-N or sequential cases. Here's the gist with
>> proof-of-concept implementation (UnknownSizeSpliterator.java), JMH
>> benchmark (IteratorTest.java) and benchmarking results (performed
>> on Quad-Core i5-3340, Win7 64bit):
>> The problem which I noticed is the following. When trySplit is
>> requested and JDK IteratorSpliterator fills the buffer, it may
>> suddenly discover that the source iterator has no more elements. At
>> this point the IteratorSpliterator splits in very uneven manner.
>> Namely it dumps everything into the ArraySpliterator leaving no
>> elements for itself. As it did not report the size previously, the
>> pipeline engine assumes that both parts are roughly equal which
>> result in very poor performance.
>> I merged both IteratorSpliterator and ArraySpliterator into single
>> class UnknownSizeSpliterator. When it sees that input spliterator has
>> no more elements, it converts itself into array-mode and splits right
>> here to two equal parts, thus the first call of trySplit for N < 1024
>> performs actual split.
>> Note that I did not change the heuristics at all, thus for big-N
>> tasks the results should be the same. My implementation is a little
>> bit simplified (no additional characteristics, no precondition
>> checks, no primitive implementations), but I guess current version is
>> ok for preliminary testing to decide whether it's good or bad.
>> This implementation might improve the performance of existing methods
>> like BufferedReader.lines, Files.lines, Pattern.splitAsStream when
>> using the parallel stream as well as user-defined iterator-based
>> spliterators. Also it maybe useful for JDK-8072727 feature
>> [ https://bugs.openjdk.java.net/browse/JDK-8072727 ].
>> What do you think?
>> With best regards,
>> Tagir Valeev.
More information about the core-libs-dev