[foreign-memaccess+abi] RFR: Add a benchmark for strlen using Foreign Linker API [v2]
mcimadamore at openjdk.java.net
Wed Feb 17 17:17:18 UTC 2021
> I've been spending some time looking into this issue:
> And, to understand the problem better, I put together a hopefully comprehensive benchmark of the strlen function; it turns out that the strlen call itself is fast, and it's the conversion from Java to native string where the benchmark spends most of its time.
> While playing with the benchmark, I came up with alternative ways to do this conversion which greatly improve the benchmark results, even surpassing (at least on my machine) what's possible with JNI. Jorn and I think that, for future reference, it would be a good idea to include this benchmark in our suite.
> For the curious, there are many factors which make the default `CLinker::toCString` go slower than expected:
> * allocating a fresh segment on each iteration is expensive, because it takes two native calls (malloc, memset), plus a bunch of CAS operations to reserve memory in the Java runtime
> * freeing the segment on each iteration is equally expensive - one native call (free), plus, again, some CAS operations to unreserve memory
> * bulk copy is fast, but again requires a native call
> * all in all, we need 4 native calls per iteration (malloc, memset, copy, free) each adding cost when it comes to state transitions
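The per-iteration conversion steps listed above can be sketched in plain Java. This is only an illustration under stated assumptions: a direct `ByteBuffer` stands in for a native memory segment, and the class and method names (`CStringSketch`, `toCString`, `strlen`) are hypothetical, not the Foreign Linker API. The `strlen` here is a pure-Java scan rather than a native call, and the final "free" step is left to the garbage collector instead of being an explicit call.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CStringSketch {
    // Copy a Java string into a freshly allocated, NUL-terminated off-heap buffer.
    static ByteBuffer toCString(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        // A new direct buffer is allocated (and zero-filled) on every call:
        // this plays the role of the malloc + memset pair.
        ByteBuffer buf = ByteBuffer.allocateDirect(bytes.length + 1);
        buf.put(bytes);      // bulk copy of the string contents
        buf.put((byte) 0);   // explicit NUL terminator
        buf.flip();          // position = 0, limit = length + 1
        return buf;
    }

    // Pure-Java stand-in for strlen: count the bytes before the first NUL.
    static int strlen(ByteBuffer cString) {
        int n = 0;
        while (n < cString.limit() && cString.get(n) != 0) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        ByteBuffer c = toCString("hello");
        System.out.println(strlen(c)); // prints 5
    }
}
```

Every call to `toCString` in this sketch pays for a fresh allocation, which is the cost the list above attributes to the default conversion path.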
> In other words, JNI has two advantages here: (i) the level of safety provided by JNI is lower (e.g. the runtime doesn't need to track allocated memory, which segments do); and (ii) when we call the JNI-ified strlen function, the malloc, free, and copy happen when we're already in native code, which means fewer state transitions are required.
> Note that we can completely eliminate (i) by creating restricted segments using `CLinker::allocateMemoryRestricted` (which does a plain malloc). We can also eliminate (ii) by creating *trivial* function descriptors for the calls to malloc/free/strlen, thereby removing the cost associated with state transitions there. Both routes are tested in the benchmark (note that they both require some willingness to embrace restricted methods). I have also put together a variant which shows how NativeScope can be used to speed allocation up (which works really well for small strings, and is _not_ restricted).
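To illustrate why a scope-based allocator helps with small strings, here is a minimal sketch of bump allocation over one pre-allocated block. It is an assumption-laden stand-in, not NativeScope itself: `BumpScopeSketch` and its methods are hypothetical names, and a direct `ByteBuffer` again plays the role of native memory. The point is that the expensive allocation happens once, and each string afterwards costs only an offset increment and a copy.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class BumpScopeSketch {
    private final ByteBuffer arena; // allocated once, reused for every string
    private int offset = 0;         // next free byte in the arena

    BumpScopeSketch(int capacity) {
        this.arena = ByteBuffer.allocateDirect(capacity);
    }

    // Carve `size` bytes out of the arena by bumping the offset.
    ByteBuffer allocate(int size) {
        if (size < 0 || size > arena.capacity() - offset) {
            throw new IllegalStateException("arena exhausted");
        }
        ByteBuffer view = arena.duplicate();
        view.position(offset);
        view.limit(offset + size);
        offset += size;
        return view.slice(); // independent view of [old offset, old offset + size)
    }

    // Same conversion as a malloc-based path, minus the per-call allocation.
    ByteBuffer toCString(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = allocate(bytes.length + 1);
        buf.put(bytes);
        buf.put((byte) 0); // recycled memory is not zeroed, so write the terminator
        buf.flip();
        return buf;
    }

    // Release everything at once, e.g. at the end of a benchmark iteration.
    void reset() {
        offset = 0;
    }

    int used() {
        return offset;
    }
}
```

A caller would create one `BumpScopeSketch`, convert as many strings as fit, and call `reset()` between iterations; buffers handed out before a reset must not be used afterwards, since their memory is recycled.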
> What are the lessons learned for plain `CLinker::toCString`?
> * While the logic is generally fast, all the state transitions and unsafe calls are killing performance in such a tight scenario; it is perhaps worth considering intrinsifying Unsafe::allocateMemory/copyMemory/setMemory/freeMemory.
> * The way to go, performance-wise, is not to rely on the default malloc-based allocation. This is where Panama has a big edge over JNI, whose allocation logic is _fixed_. Proposals such as the one described in the post linked below will make passing custom allocators to `CLinker::toCString` easier, so that clients can decide which allocation strategy best fits their use case.
>  - https://inside.java/2021/01/25/memory-access-pulling-all-the-threads/
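As a rough picture of what "passing a custom allocator" to a string conversion could look like, the sketch below takes the allocation strategy as a parameter. Everything here is hypothetical: `AllocatorParamSketch`, its `toCString` overload, and the use of `IntFunction<ByteBuffer>` as the allocator type are illustrative assumptions, not the shape of the proposed API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.function.IntFunction;

final class AllocatorParamSketch {
    // The conversion logic is fixed; where the memory comes from is up to the caller.
    static ByteBuffer toCString(String s, IntFunction<ByteBuffer> allocator) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = allocator.apply(bytes.length + 1);
        buf.put(bytes).put((byte) 0); // copy contents, then write the NUL terminator
        buf.flip();
        return buf;
    }

    // Strategy 1: a fresh off-heap buffer per call (the malloc-like default).
    static IntFunction<ByteBuffer> fresh() {
        return ByteBuffer::allocateDirect;
    }

    // Strategy 2: one scratch buffer recycled across calls.
    // Only valid if the previous string is no longer needed.
    static IntFunction<ByteBuffer> reusing(int capacity) {
        ByteBuffer scratch = ByteBuffer.allocateDirect(capacity);
        return size -> {
            if (size > scratch.capacity()) {
                throw new IllegalArgumentException("string too large for scratch buffer");
            }
            scratch.clear(); // position = 0, limit = capacity
            return scratch;
        };
    }
}
```

With this shape, a client that converts many short-lived strings could pass `reusing(256)`, while a client that needs each string to outlive the call would pass `fresh()`.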
Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision:
Fix missing cast from size_t to int
- all: https://git.openjdk.java.net/panama-foreign/pull/454/files
- new: https://git.openjdk.java.net/panama-foreign/pull/454/files/6c4a4981..f110731a
- full: https://webrevs.openjdk.java.net/?repo=panama-foreign&pr=454&range=01
- incr: https://webrevs.openjdk.java.net/?repo=panama-foreign&pr=454&range=00-01
Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Fetch: git fetch https://git.openjdk.java.net/panama-foreign pull/454/head:pull/454