Decreased latency performance with Stack Walker API compared to sun.misc.JavaLangAccess

mandy chung mandy.chung at
Fri Oct 20 20:18:34 UTC 2017

I created to track this.


On 10/20/17 10:35 AM, Rafael Winterhalter wrote:
> Hei Mandy,
> thank you for your quick reply.
> The benchmark is already only capturing the stack frame object. I ran 
> the benchmark with the transformation as a comparison and it is indeed 
> slower due to extra work. The JIT code suggests the native method plus 
> monitor to be the major cost here but it is of course necessary to 
> have it there to fetch the stack information.
> Without this transformation in the user  thread, capturing stack frame 
> objects of 100 frames is however still about 2.5 times as expensive as 
> constructing a throwable as a snapshot.
> Just to not be misunderstood: the total cost is reduced by using the 
> stack walker, even compared to using java lang secrets under Java 8 
> but the inability to process the stack outside of a worker thread 
> makes it inapplicable for my purposes due to latency requirements.
> Thank you for considering this use case and all your great work!
> Cheers, Rafael
> Am 20.10.2017 18:11 schrieb "mandy chung" <mandy.chung at 
> <mailto:mandy.chung at>>:
>     Hi Rafael,
>     Thanks for the feedback.  We did some investigation in
>     understanding the overhead of Throwable if it used StackWalker API
>     [1].  It did come to mind whether the StackWalker API should
>     provide a way to walk the backtrace which we should do the
>     investigation with JDK-8141239.
>     The benchmark compares capturing 35 StackTraceElements for
>     apple-to-apple comparison which is fair.  I am curious on the perf
>     difference when you capture only StackFrame objects?  This would
>     save the overhead to construct StackTraceElement objects (and its
>     associated string objects).
>     Mandy
>     [1]
>     <>
>     On 10/20/17 6:32 AM, Rafael Winterhalter wrote:
>         Hello,
>         a typical patern when reading the stack of the current thread
>         in tooling
>         like performance monitoring used to imply the creation of an
>         instance of
>         Throwable and to process this instance's attached stack in
>         another thread.
>         The performance cost is shared about 10/90 for creating a new
>         throwable
>         compared to reading its frames, so this is really a worthy
>         optimization.
>         It is also common to use the JavaLangAccess API which offers
>         selective
>         access of single frames.
>         This API does no longer exist as it was superseeded by the
>         Stack Walker API
>         which is of course much safer and even a more performant
>         alternative when
>         looking at the total performance. However, using a stack
>         walker, it is no
>         longer possible to move the stack processing out of the user
>         thread but it
>         must be done at the moment the snapshot of the stack is taken.
>         It turns out
>         that this increases latency dramatically when processing
>         stacks compared to
>         the asyncronous alternative.
>         In a quick benchmark, it seems like walking 35 frames of a 100
>         frames stack
>         allows me 70k operations per second whereas creating a new
>         throwable yields
>         about 200k operations per second. Also, within a less isolated
>         test, I can
>         infer this additional overhead from the actual latency numbers
>         of a web
>         service when using the stack walker API to extract the top 35
>         frames
>         compared to the "old" solution using JavaLangAccess.
>         For this reason, it seems to be the best solution to avoid the
>         stack walker
>         when aiming for latency at the moment if the stack is not required
>         immediately and if access resources are available in other
>         threads.
>         I would therefore like to propose to extend the stack walker
>         API to allow
>         walking the stack of an existing throwable to allow for
>         similar performance
>         as with JavaLangAccess. I understand that the VM must do more work
>         altogether. When receving the full stack from a throwable,
>         this takes about
>         three times as long. In practice, for a product I am involved
>         in, this
>         casues a noticable overhead when running a Java 9 VM compared
>         to Java 8.
>         Alternatively, it would of course even be better if one could
>         take a
>         snapshot of only the top x frames to walk on this object
>         rather then a
>         throwable.
>         I have added my benchmarks (snapshot for the current user
>         thread operation,
>         complete for the entire processing) into this Gist:
>         <>
>         Thank you for any feedback on my finding!
>         Best regards, Rafael

More information about the core-libs-dev mailing list