[OpenJDK 2D-Dev] Pisces AA renderer performance improvement.
james.graham at oracle.com
Sat Aug 6 00:30:37 UTC 2011
I've finally started wrapping my head around this. It's a nice new take
on reorganizing the work and I'm intrigued by the fact that you said
that this is getting close to beating Ductus.
- Side note - I could repurpose the previous incarnation of the file for
FX rendering with just a few minor modifications - it looks like those
days are over now, though, and I'll have to massage it quite a bit more
to share the code... (Frowny face for me)
- I'd like to see the Renderer cache implemented in the RenderingEngine
code since it is outside the scope of the work that Renderer does.
Also, your table could fill up with partially used renderers if there
are ever any bugs that cause it to throw an exception before it calls
giveBack(). I'd like to see some code that tries to make sure they are
always given back even if errors happen (or, at least, they are removed
from the data structures if we feel that they aren't appropriate to
reuse after the error).
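A minimal sketch of the acquire/try/finally/giveBack shape I have in mind. All of the names here (RendererCache, giveBack, the corrupted flag) are placeholders for illustration, not the webrev's actual API:

```java
import java.util.ArrayDeque;

final class RendererCache {
    private final ArrayDeque<Renderer> pool = new ArrayDeque<>();

    static final class Renderer {
        boolean corrupted;
        void render() { /* rasterization work elided */ }
        void reset() { corrupted = false; }
    }

    synchronized Renderer acquire() {
        Renderer r = pool.poll();
        return (r != null) ? r : new Renderer();
    }

    synchronized void giveBack(Renderer r) {
        // Only cache renderers that are in a known-good state.
        if (!r.corrupted) {
            r.reset();
            pool.push(r);
        }
    }

    synchronized int pooled() { return pool.size(); }

    void renderShape() {
        Renderer r = acquire();
        try {
            r.render();
        } catch (RuntimeException e) {
            r.corrupted = true;  // drop it rather than cache a broken renderer
            throw e;
        } finally {
            giveBack(r);         // runs on both normal and exceptional exit
        }
    }
}
```

The point is just that giveBack() sits in a finally block, so the table can never accumulate renderers that were checked out and abandoned.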
- Renderer, line 158 - "while (--toadd >= 0)" allows the compiler to use
the condition codes from the decrement to satisfy the test.
- Why not just have 1 tracker that manages TILE_SIZE<<SUB_POS values
instead of TILE_SIZE trackers each of which manages SUB_POS values?
Unfortunately that means that EvenOdd would have to keep an array
(because 32*8 is more bits than any single primitive type we have), but
since we already need to have an array of the EvenOdd trackers, we
aren't any worse off and in fact we are ahead on storage because
currently you have 32*32 bits per set of trackers (plus the overhead
for 32 tracker objects).
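For reference, a rough sketch of the single-tracker layout, assuming TILE_SIZE=32 and 8 subpixel X positions (256 bits, which is why the EvenOdd case needs a long[] rather than a primitive); the names are illustrative, not from the webrev:

```java
final class EvenOddTracker {
    static final int TILE_SIZE = 32;
    static final int SUBPIXEL_LG_POSITIONS_X = 3;                  // 8 subpixels per pixel
    static final int BITS = TILE_SIZE << SUBPIXEL_LG_POSITIONS_X;  // 256 bits total

    // 256 bits exceeds any single primitive, so back it with long words.
    private final long[] words = new long[BITS / 64];

    // Flip the even-odd parity bit at one subpixel position.
    void toggle(int subx) {
        words[subx >> 6] ^= 1L << (subx & 63);
    }

    // Read the parity bit at one subpixel position.
    boolean parity(int subx) {
        return ((words[subx >> 6] >> (subx & 63)) & 1L) != 0;
    }
}
```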
Some suggestions for future work:
(Don't need to deal with these now, but I'll throw the ideas on the
table so they don't get lost)
- or and line don't really need full 32 bits each. They could share
storage, or they could be stored in byte arrays, or they could even
share a byte array.
- if you store last_subpixy instead of numys then you wouldn't have to
keep storing the new number back in the arrays - it would become a
read-only value.
- I think it would be worth it to amortize the removal of dead pointers.
All it would take would be to store a NULL in the spot and set a
boolean, then they could all be collapsed out in one pass at the end.
To get even more creative, if you sort them the other way around and
then scan from count=>0 then you could just copy the value "ptrs[i] =
ptrs[--count]" and let the insertion sort at the start of the next pixel
row deal with it.
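Sketched out, the swap-removal would look something like this (hypothetical names; order is deliberately not preserved, since the insertion sort at the start of the next pixel row absorbs the disorder anyway):

```java
final class PtrCompact {
    static final int DEAD = -1;  // stands in for the NULL marker

    // Collapse out DEAD entries in one pass by copying the last live
    // entry into each hole: "ptrs[i] = ptrs[--count]".
    static int compact(int[] ptrs, int count) {
        for (int i = 0; i < count; ) {
            if (ptrs[i] == DEAD) {
                ptrs[i] = ptrs[--count];  // may itself be DEAD; re-check slot i
            } else {
                i++;
            }
        }
        return count;  // new live count
    }
}
```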
- Another way to manage the merge sort would be to have a temp array of
SUB_PIX_Y values, quickly store all of the crossings into it either
forwards or backwards in a pair of tight loops (one for slope >0 and one
for slope <0) and then do a regular merge sort in a second tight loop.
If you use the values at [ci,ci+toadd) as the temp array then one test
could skip the merge sort step entirely because the values would already
be sorted (and this is likely to happen). My thoughts were that 2
tighter loops might be faster than one more complicated loop.
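As a rough illustration of the skip test plus the merge step (this assumes the new run at [ci,ci+toadd) is itself already sorted, which the two tight per-slope loops would be arranged to guarantee; all names are made up):

```java
import java.util.Arrays;

final class CrossingMerge {
    // crossings[0..ci) is sorted; crossings[ci..ci+toadd) holds the new
    // values, sorted among themselves. Returns the combined length.
    static int mergeIn(int[] crossings, int ci, int toadd) {
        int end = ci + toadd;
        // One test skips the merge entirely when the new run already
        // extends the sorted prefix - likely the common case.
        if (ci == 0 || crossings[ci - 1] <= crossings[ci]) {
            return end;
        }
        int[] tmp = Arrays.copyOfRange(crossings, ci, end);  // new run
        int i = ci - 1, j = toadd - 1, k = end - 1;
        // Backward merge so the prefix can be overwritten in place.
        while (j >= 0) {
            if (i >= 0 && crossings[i] > tmp[j]) {
                crossings[k--] = crossings[i--];
            } else {
                crossings[k--] = tmp[j--];
            }
        }
        return end;
    }
}
```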
I'm going to send this now before I get too bogged down in figuring out
the new design of the TileGenerator...
On 7/19/2011 2:01 PM, Denis Lila wrote:
> Hi Jim.
> We spoke about this a while ago and I started working
> on it. This is what I have so far:
> My main intention was to remove some stages
> from the renderer (like you suggested) because what
> we were doing was:
> 1. Transform curves to lines and put the lines in a buffer.
> 2. Iterate through scanlines, get crossings with the lines,
> compute alphas, and store them into an array.
> 3. Compress the alpha array using RLE.
> 4. When asked, use the RLE encoding to generate alpha tiles.
> and I was hoping to be able to go directly from curves to
> crossings, and then generate the tiles from the crossing
> data. This would reduce the intermediate storage and would
> move around much less data.
> At first, I implemented this with an int[][] where each inner
> array represented the crossings in one pixel row, and the
> crossings themselves packed 3 chunks of data into one int:
> the x coordinate of the crossing, the orientation, the scanline
> within the pixel row where the crossing was located (this was
> a number in [0:1<<SUBPIXEL_LG_POSITIONS_Y) ).
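For concreteness, here is a sketch of the packing described above, with the x coordinate in the high bits so that a plain numeric sort orders crossings by x first. The field widths are my guesses, not the actual layout:

```java
final class Crossing {
    static final int SUBPIXEL_LG_POSITIONS_Y = 3;  // 8 subpixel rows per pixel row

    // Layout (high to low): x | subpixel y | orientation bit.
    static int pack(int x, int suby, int orient) {
        return (x << (SUBPIXEL_LG_POSITIONS_Y + 1))
             | (suby << 1)
             | (orient & 1);
    }

    static int x(int c)      { return c >> (SUBPIXEL_LG_POSITIONS_Y + 1); }
    static int suby(int c)   { return (c >> 1) & ((1 << SUBPIXEL_LG_POSITIONS_Y) - 1); }
    static int orient(int c) { return c & 1; }
}
```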
> The crossings needed to be sorted, and I did this with Arrays.sort.
> This was fast in small tests, but for any shape with more than 4 edges
> per pixel row it started having huge scalability problems
> (for larger shapes we were spending 90+% of the time in Arrays.sort).
> The version in the webrev is the fastest yet, and it uses
> a class similar to the ScanlineIterator for the sorting
> (but now it iterates through pixel rows, not scanlines).
> Unfortunately this means that there still are two levels
> of intermediate storage (edges and crossings, analogous
> to edges and RLE elements in the old version), but the
> only way I can think of to get rid of one of them
> and still use an average-case linear-time sort is if we
> start outputting alphas pixel row by pixel row, rather than
> in 32x32 tiles.
> Anyway, it's much faster than it used to be, so we should
> probably push it (after I implement a cache eviction policy).
> Any suggestions/criticisms are always welcome.
> Thank you,