JDK-8229517: Buffered Logging?

Ramki Ramakrishna rramakrishna at twitter.com
Thu Jan 30 21:54:47 UTC 2020

Hi Xin, thank you so much for driving this forward!! (Sorry, I have been
disconnected with this for a while; my apologies.)

How long has this implementation been in use with your services, and
perhaps you might be able to use that experience and any data from that
experience to inform the efficacy of this in addressing the root problem?

As I indicated in separate feedback to you very early on when we were
talking about this, we have a slightly more complex implementation here at
Twitter that has, we believe, served us well through JDK 8, and it's been
forward-ported into UL for JDK 11 (and can be prepared for tip). Let me try
and spend some time this week, extracting a clean delta/patch of those
changes to share with you, both for GC logging in JDK 8 and for UL in JDK
11, and perhaps we can measure performance and complexity & maintainability
of the two implementations to decide on a path forward.

Sorry again for my tardiness in following up on this as I had promised in
the JBS ticket that we filed.
-- ramki

On Thu, Jan 30, 2020 at 1:26 PM Liu, Xin <xxinliu at amazon.com> wrote:

> Hi, Thomas,
> I will work on the tip of jdk. Certainly, It will support unified log(UL).
> In our current webrev,  we moved all file IOs out of GC safepoints. It
> helps for p99/p100 GC pause time. We started with optimizing GC pause time
> in long tail, that's why we hacked GCLog first.
> Ideally, we should add async API to UL. Users of UL who are very sensitive
> to delay can choose asynchronous API.
> About parameters, I have the same concerns. To be honest, I don't have
> answer to make everybody happy. What I did is guarantee that no gclog will
> miss when logs are very verbose and bogus delay is 5m for every File write.
> I don't think it's good idea to use dynamic allocation for  log buffers.
> If kernel had very long pause, JVM would run out of memory.  I don't like
> to introduce a new problem when I try to solve one.
> Here is  the script I used to collect data.
> https://github.com/navyxliu/JavaGCworkload/blob/master/t.sh#L7
> Those parameters are tuning.  We can define a bottom line.
> https://cr.openjdk.java.net/~xliu/8229517/webrev03/webrev/src/share/vm/runtime/globals.hpp.udiff.html
> I used a mutex without safepoint check.  It should guarantee that the
> synchronization between read/writer won't wait for safepoints.  I am quite
> new for hotspot synchronizations. Could you help me review this part?
> https://cr.openjdk.java.net/~xliu/8229517/webrev03/webrev/src/share/vm/utilities/ostream.cpp.udiff.html
> Two buffers switch their roles in flush_log().  We also explored lock-free
> queue using  boost. It works but it's non-trivial for me to implement one
> in hotspot.  I would like to hold until we identify async buffer is
> bottleneck.
> thanks,
> --lx
> From: Thomas Stüfe <thomas.stuefe at gmail.com>
> Date: Wednesday, January 29, 2020 at 3:13 AM
> To: "Liu, Xin" <xxinliu at amazon.com>
> Cc: hotspot-runtime-dev <hotspot-runtime-dev at openjdk.java.net>, John
> Coomes <jcoomes at twitter.com>, "Mathiske, Bernd" <mathiske at amazon.com>,
> Ramki Ramakrishna <rramakrishna at twitter.com>
> Subject: Re: JDK-8229517: Buffered Logging?
> Hi Xin,
> I was not sure whether to respond in JBS or here, so I do it here.
> Like others I think that if we really do this it would make more sense to
> add this as a general feature of UL, not of the GC logging in particular.
> We have the problem (logging may block or take time) in other areas too. I
> have thought about a similar solution in the past (offloading UL writes to
> a dedicated thread) but usually when I thought about the details I got
> unsure. But maybe I do not see the whole picture, and you can convince me :)
> Some questions (assuming we go the more general direction of buffering UL
> output):
> - How do you determine which buffer size is large enough? Lets say UL gets
> really chatty, how do you deal with buffer overflow? Do you stop the
> writers until the buffer is drained or do you discard log output? Both
> options seems similar bad.
> - We need synchronization now between reader and writers. Slow IO can
> stall the reader, which may block the writers (if you stop writers on a
> full buffer). I am not sure, but this may be worse than before, where we
> had the writer threads effectively write in parallel to the same fd.
> Thanks & Cheers, Thomas
> On Sun, Jan 26, 2020 at 12:39 AM Liu, Xin <mailto:xxinliu at amazon.com>
> wrote:
> Dear hotspot developers,
>     I think runtime might be a better forum to continue this discussion.
>     The JBS: https://bugs.openjdk.java.net/browse/JDK-8229517
>     Now our latest patch is here:
> https://cr.openjdk.java.net/~xliu/8229517/webrev02/webrev/
>     I know it’s based on jdk8u. I really appreciate if someone reviews it
> and gives me advice.  If the community is still interested, we can keep
> developing using better approach or on the tip of jdk.
>     I’d like to present our experiments here. I try to post data on the
> JDK-8229517, but It doesn't look good in comment area.
>     Experiments
>     We can’t reproduce the problem in [1] because harddisks are not
> accessible for us. To simulate IO blocking in JVM, we introduce a bogus
> delay. It spins DelayInGCLogging milliseconds for every fwrite.
>       notproduct(intx, DelayInGCLogging, 0,
>       "Delay milliseconds to simulate IO blocking for GC log testing")
>     It spins in gcLogFileStream::write_blocking.
> https://cr.openjdk.java.net/~xliu/8229517/webrev02/webrev/src/share/vm/utilities/ostream.cpp.cdiff.html
>     We observe GC pause time using -XX:+PrintGCApplicationStoppedTime for
> JavaWorkload with 2g heap. The unit is second.  The program is on github.
> https://github.com/zhenyun/JavaGCworkload.
>     $java -Xmx2g -Xms2g -Xloggc:gc.log -XX:+UseG1GC
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails
> -XX:+UnlockExperimentalVMOptions Test
>     sync means jdk8u-dev. async means jdk8u-dev with nonblocking gclog
> change. delayXXX denotes an artifactual delay given by
> -XX:DelayInGCLogging=XXX.
>                                  median     p95         p99
>  p100        losses
>     -------------------------------------------------------------------
>     async                   0.01357   0.36187  0.48691   0.61792
>     sync                     0.01339   0.3618    0.48646   0.63539
>     -------------------------------------------------------------------
>     delay5ms-async 0.01516  0.40564  0.486        0.61053  148156
>     delay5ms-sync   0.20444  0.4428    0.56733    0.69522
>     -------------------------------------------------------------------
>     delay10ms-async0.0123   0.47451  0.48623    0.61083  195460
>     delay10ms-sync  0.38732 0.52424  0.64783    0.78087
>     -------------------------------------------------------------------
>     async and sync perform equally in happy case. JavaWorkload didn’t lose
> any GC log even though the GC activities are intensive and the outputs are
> verbose. If we introduce 5 ms delay for each file-writing, delay5ms-sync
> starts to show longer stop time for p99 and p100. eg. p99 increases from
> 0.486s to 0.567s, which is 16,7% longer. delay5ms-async remains the same
> p99 and p100 at expense of losing 148k logging events. Because the outputs
> GC logs are contiguous, they are still readable and helpful for problem
> analysis. After we increase delay to 10ms, delay10ms-async still have the
> same p99 and p100 stop time while delay10ms-sync has 22.9% longer stop
> time.
>     [1]Eliminating Large JVM GC Pauses Caused by Background IO Traffic
> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic
>     Thanks,
>     --lx
>     From: Ramki Ramakrishna <mailto:rramakrishna at twitter.com>
>     Date: Tuesday, August 13, 2019 at 4:16 PM
>     To: "mailto:hotspot-dev at openjdk.java.net" <mailto:
> hotspot-dev at openjdk.java.net>
>     Cc: "Liu, Xin" <mailto:xxinliu at amazon.com>, "Mathiske, Bernd" <mailto:
> mathiske at amazon.com>, "Hohensee, Paul" <mailto:hohensee at amazon.com>, John
> Coomes <mailto:jcoomes at twitter.com>, Tony Printezis <mailto:
> tprintezis at twitter.com>
>     Subject: Re: Buffered Logging?
>     I filed https://bugs.openjdk.java.net/browse/JDK-8229517.  Please
> correct/complete if I mangled or neglected to fill any specific required
> fields.
>     Patches forthcoming in the next few days.
>     -- ramki
>     On Tue, Aug 13, 2019 at 2:53 PM Ramki Ramakrishna <mailto:mailto:
> rramakrishna at twitter.com> wrote:
>     I'll go ahead and open a JBS ticket to keep the ball rolling on this
> one, and so we have a place to hang some patches and discussions off of.
> Scream if I shouldn't.
>     -- ramki
>     On Mon, Aug 12, 2019 at 6:19 PM Ramki Ramakrishna <mailto:mailto:
> rramakrishna at twitter.com> wrote:
>     (Resent... Apologies to anyone who might see duplicates. There was an
> issue with my membership in the list which, I am guessing, has now been
> resolved.)
>     ---------- Forwarded message ----------
>     From: Ramki Ramakrishna <mailto:mailto:rramakrishna at twitter.com>
>     To: mailto:mailto:hotspot-dev at openjdk.java.net
>     Cc: "Liu, Xin" <mailto:mailto:xxinliu at amazon.com>, "Mathiske, Bernd"
> <mailto:mailto:mathiske at amazon.com>, "Hohensee, Paul" <mailto:mailto:
> hohensee at amazon.com>, John Coomes <mailto:mailto:jcoomes at twitter.com>,
> Tony Printezis <mailto:mailto:tprintezis at twitter.com>
>     Bcc:
>     Date: Mon, 12 Aug 2019 15:48:17 -0700
>     Subject: Buffered Logging?
>     In the JDK 7 time-frame, some folks noticed much longer than expected
> pauses that seemed to coincide with GC logging in the midst of a GC
> safepoint. In that setup, the GC logs were going to a disk file (these were
> often useful for post-mortem analyses) rather than to a RAM-based tmpfs
> which had been the original design center assumption. The vicissitudes of
> the dirty page flushing policy in Linux when
>     IO load on the machine (not necessarily the JVM process doing the
> logging)
>     could affect the length and duration of these inline logging stalls.
>     A buffered logging scheme was then implemented by us (and
> independently by
>     others) which we have used successfully to date to avoid these pauses
> in high i/o
>     multi-tenant environments.
>     We have recently ported the OpenJDK 8u based scheme past the Unified
> Logging related changes and have been using it on OpenJDK 11u.
>     In various fora, some of us have heard people express an interest in
> such an implementation.
>     Would there be appetite for this in openjdk (there are some
> differences between our scheme in OpenJDK 8u and their subsequent
> post-Unified-Logging port to OpenJDK 11)?
>     If so, we'd like to open a JBS ticket, attach some patches rebased to
> current OpenJDK tip, and start preparing a webrev for discussion with a
> view to possible upstreaming if it converges past review/discussion/debate.
>     PS: In recent discussion at the JVMLS, mention was made of something
> that might be similar in support of JFR streaming, so our rebase to tip and
> familiarization with any JFR-associated buffered logging might well make
> this easier or, possibly, moot.
>     Thoughts, questions, comments, feedback?
>     -- ramki
>     --
>     JVM Team, Platform Engineering, Twitter (San Francisco)
>     --
>     JVM Team, Infrastructure Engineering, Twitter (San Francisco)
>     --
>     JVM Team, Infrastructure Engineering, Twitter (San Francisco)

JVM Team, Infrastructure Engineering, Twitter (San Francisco)

More information about the hotspot-runtime-dev mailing list