jextract on large Windows header files

Duncan Gittins duncan.gittins at
Mon Feb 1 15:48:22 UTC 2021

My comments below too

On 01/02/2021 11:27, Maurizio Cimadamore wrote:
> Hi Duncan,
> thanks for experimenting further - some comments inline
> On Sun, 2021-01-31 at 15:54 +0000, Duncan Gittins wrote:
>> I've downloaded jdk/panama-foreign and experimented with some changes to
>> jextract in order to reduce the footprint of jars generated from the
>> large Windows header files that I mentioned in the thread "Feedback /
>> query on jextract for Windows 10".
>> 1) Added command line args: --symbol {symbol-to-include} and
>> --symbols {file-of-symbols-to-include}, which eliminate all top level
>> symbols not in the parameter list or symbol file(s). This avoids the
>> need to generate huge numbers of unnecessary methods.
>> For example, in my application "jextract shlobj_core.h" gives a ~11MB jar
>> (~49k top level symbols); adding --filter for the small group of header
>> files required cuts this to 880k (5k symbols); and with the --symbols
>> filter that is down to a 10K jar (9 symbols), which is a much more
>> manageable size to use. The lack of lazy loading in --source output is
>> also no longer a problem, because all static field method handles are
>> generally used.
> We used to have something similar in the previous version of jextract.
> We're a bit on the fence as to whether to re-add it. The rationale
> being that any filtering mechanism is not perfect, and might actually
> introduce extraction bugs - think of a function A which depends on a
> struct S - if you extract A then you have to extract S (and everything
> S depends on). Figuring out these dependencies is not trivial - in fact
> it's not even possible in the general case; for instance in a library
> like OpenGL you might correctly import a set of functions you want to
> use - but what about the macro constants? OpenGL is quite unusable w/o
> its set of constants - which one should be imported? The header file
> gives us no clue as to which function depends on which constant.
> In addition, what we have noticed is that, with filtering, it is very
> easy to start "simple" and to end up with a crawling web of slightly
> inter-related options (include lists, exclude lists, transitive
> dependencies, etc.) which generate a lot of complexity while not being,
> per se, tremendously general.
> For these reasons, we have opted to go for an API-oriented approach,
> where, if needed, a client could interact with a jextract API, and add
> the required filtering. If you look at the `JextractTool` class, there
> are a bunch of static methods which can be used for parsing the C code,
> and for generating source/class files from the AST that has been
> produced by the parser.
> It is possible for a client to insert an extra step in the middle
> between parsing and generation, so as to allow for custom filtering
> strategies.

I would do this if jextract still only had --filter by path when it left
the incubator stage, but I'm wary of going this route for now as I'd end
up replicating nearly all of the run() method, including the standard
command line parsing.
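
To make the idea of an extra step between parsing and generation concrete,
here is a minimal sketch of a pure name-matching pass. It does not use the
real jextract API: the SymbolFilter class and its Decl record are
hypothetical stand-ins for whatever the parser produces, and
SomeUnusedFunction is a made-up name.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class SymbolFilter {
    // Hypothetical stand-in for a parsed top-level declaration;
    // this is NOT a jextract AST type.
    record Decl(String name, String kind) {}

    // Keep only the top-level declarations whose names were asked for.
    // No dependency analysis is attempted: this is name matching only.
    static List<Decl> keepNamed(List<Decl> parsed, Set<String> wanted) {
        return parsed.stream()
                .filter(d -> wanted.contains(d.name()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Decl> parsed = List.of(
                new Decl("SetUserObjectInformationA", "function"),
                new Decl("SetUserObjectInformationW", "function"),
                new Decl("SomeUnusedFunction", "function")); // made-up name
        Set<String> wanted = Set.of(
                "SetUserObjectInformationA", "SetUserObjectInformationW");
        // The filtered list is what would be handed on to code generation.
        keepNamed(parsed, wanted)
                .forEach(d -> System.out.println(d.kind() + " " + d.name()));
    }
}
```

The filter itself is trivial; the work I'd rather not duplicate is
everything around it.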

> That said, the problem of filtering is a common one, and I suspect many
> will stumble on similar questions - so perhaps it is worth shaking
> Pandora's box once more to see if we can agree on a solution that is
> general enough, w/o being overly complex.
> One thing I'm open to is to add a separate configuration (text) file,
> with some random syntax, which would allow clients to specify a list of
> symbols to import (structs, constants, functions, typedefs, ...). There
> is no magic added by jextract: jextract would simply consult the lists
> during extraction and then decide whether to include a given symbol
> in the generated AST or not. Jextract might issue warnings in cases where
> it detects that certain dependencies are missing (but as stated above,
> this analysis could not be deemed "complete" in any sort of way).
> (Another twist on that idea we've considered would be to use C itself
> for defining the symbols to import, through a custom header file which
> includes all the function symbols to be imported - but while that works
> well for functions, I don't think it scales to other things like
> structs and constants).

I think a solution is needed for symbol filtering in cases like the
Windows headers, but it does not need to be complex or involve dependency
management - just name matching. Most likely development will start off
with the huge unfiltered.jar and switch to filtered.jar (and back when
needed) in each cycle. At each stage, won't any symbol omissions in the
filtered.jar be easily identified by any good IDE, or via a clean run of
javac on either the developer's own code or the code generated by
--source? Jextract could also be made to emit an example of the filter
file showing the initial view of included symbols.
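
As a rough illustration of the filter file round trip, the sketch below
writes every known symbol to a text file body and reads an edited copy
back. The file format (one name per line, '#' for comments) and the
SymbolFile class are assumptions of mine, not anything jextract defines.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class SymbolFile {
    // Render all known symbols, one per line, as a starting-point filter file.
    // The "one name per line, # for comments" format is an assumption.
    static String emit(List<String> allSymbols) {
        StringBuilder sb = new StringBuilder("# delete the symbols you do not need\n");
        for (String s : allSymbols) {
            sb.append(s).append('\n');
        }
        return sb.toString();
    }

    // Read the file content back, skipping blank lines and comments.
    static Set<String> parse(String text) {
        Set<String> names = new LinkedHashSet<>();
        for (String line : text.split("\\R")) {
            String t = line.strip();
            if (!t.isEmpty() && !t.startsWith("#")) {
                names.add(t);
            }
        }
        return names;
    }

    // Names that were requested but do not exist in the header: worth a
    // warning, since they are most likely typos.
    static Set<String> unknown(Set<String> requested, List<String> allSymbols) {
        Set<String> missing = new LinkedHashSet<>(requested);
        missing.removeAll(allSymbols);
        return missing;
    }

    public static void main(String[] args) {
        List<String> all = List.of(
                "SetUserObjectInformationA", "SetUserObjectInformationW");
        String file = emit(all);
        System.out.print(file);
        System.out.println(parse(file).size() + " symbols requested");
    }
}
```

A developer would trim the emitted file down to what they use and add
names back whenever javac reports a missing symbol.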

>> 2) Changed "jextract --source" so that identical FunctionDescriptor
>> declarations are re-used. This makes the compiled source jar a little
>> bit smaller than the generated class jar. For example many of the
>> Windows functions such as SetUserObjectInformationA /
>> SetUserObjectInformationW have the same descriptor, so jextract can emit
>> just one FUNC_ declaration and make MH_ / $FUNC() of the second refer
>> to the earlier declaration like this:
>>       /**  DUPLICATE => SetUserObjectInformationA$FUNC_
>>       static final FunctionDescriptor SetUserObjectInformationW$FUNC_ =
>>           FunctionDescriptor.of(C_INT,C_POINTER,C_INT,C_POINTER,C_LONG);
>>       */
>>       static final jdk.incubator.foreign.FunctionDescriptor
>>           SetUserObjectInformationW$FUNC() { return SetUserObjectInformationA$FUNC_; }
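
For reference, the re-use logic amounts to remembering the first constant
emitted for each descriptor. The sketch below is a simplified stand-alone
illustration, not the actual change to jextract: DescriptorDedup is a
hypothetical class that works on descriptor text only and, unlike my
change, does not emit the commented-out duplicate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DescriptorDedup {
    // Input: function name -> descriptor expression text, in header order.
    // Output: source lines where each distinct descriptor is declared once.
    static List<String> emit(Map<String, String> descriptorByFunction) {
        Map<String, String> firstField = new HashMap<>(); // descriptor text -> field name
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : descriptorByFunction.entrySet()) {
            String field = e.getKey() + "$FUNC_";
            String owner = firstField.putIfAbsent(e.getValue(), field);
            if (owner == null) {
                // First time this descriptor is seen: declare the constant.
                out.add("static final FunctionDescriptor " + field
                        + " = " + e.getValue() + ";");
                owner = field;
            }
            // The accessor only ever refers to a field declared earlier,
            // so no forward reference is created.
            out.add("static final FunctionDescriptor " + e.getKey()
                    + "$FUNC() { return " + owner + "; }");
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> funcs = new LinkedHashMap<>();
        String desc = "FunctionDescriptor.of(C_INT,C_POINTER,C_INT,C_POINTER,C_LONG)";
        funcs.put("SetUserObjectInformationA", desc);
        funcs.put("SetUserObjectInformationW", desc);
        emit(funcs).forEach(System.out::println);
    }
}
```
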
> There's a lot of duplication in the source-generation - it's not just
> function descriptors - but also struct layouts, etc. In principle we
> should de-dup all these, but then another subtle problem arises: you
> can't forward reference a static constant. So completely de-duping the
> graph of constants requires building a tree of dependencies between the
> various constants, doing a topological sort, and adding constants
> starting from the leaves. All this magic happens "for free" when using
> classfile generation (and would also happen for free if the language
> had some concept of lazy statics).
> Before tackling this complexity, I'd like to have a better sense of how
> far we might be from getting some support for lazy statics - as the
> kind of complexity described above to get better source generation
> feels like a bit of a dead-end.
> Cheers
> Maurizio
>> If you think either of these will help, I can share the small set of
>> changes I made.
>> Duncan
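
On the ordering problem for fully de-duplicated constants: a minimal
sketch of the "leaves first" ordering is below, using a depth-first
topological sort. The ConstantOrder class and the constant names in main
are hypothetical, and the dependency map is assumed to be already known;
working that map out from the header is the hard part and is not shown.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ConstantOrder {
    // deps maps each constant to the constants its initializer refers to.
    // Returns an order in which every constant comes after its dependencies,
    // so the generated source never forward references a static.
    static List<String> order(Map<String, List<String>> deps) {
        List<String> out = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> visiting = new HashSet<>();
        for (String c : deps.keySet()) {
            visit(c, deps, done, visiting, out);
        }
        return out;
    }

    private static void visit(String c, Map<String, List<String>> deps,
                              Set<String> done, Set<String> visiting,
                              List<String> out) {
        if (done.contains(c)) {
            return;
        }
        if (!visiting.add(c)) {
            throw new IllegalStateException("cyclic constant dependency at " + c);
        }
        for (String d : deps.getOrDefault(c, List.of())) {
            visit(d, deps, done, visiting, out); // emit dependencies first
        }
        visiting.remove(c);
        done.add(c);
        out.add(c);
    }

    public static void main(String[] args) {
        // Hypothetical constants: a descriptor that uses a struct layout,
        // which in turn uses a primitive layout.
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("foo$FUNC_", List.of("Point$LAYOUT_"));
        deps.put("Point$LAYOUT_", List.of("C_INT"));
        deps.put("C_INT", List.of());
        order(deps).forEach(System.out::println);
    }
}
```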
