What's in a CONSTANT_Class?

forax at univ-mlv.fr forax at univ-mlv.fr
Thu Jun 15 22:09:00 UTC 2017

> De: "John Rose" <john.r.rose at oracle.com>
> À: "Rémi Forax" <forax at univ-mlv.fr>
> Cc: "Karen Kinnear" <karen.kinnear at oracle.com>,
> valhalla-spec-experts at openjdk.java.net
> Envoyé: Mercredi 14 Juin 2017 23:55:23
> Objet: Re: What's in a CONSTANT_Class?

> On Jun 14, 2017, at 9:22 AM, Remi Forax < forax at univ-mlv.fr > wrote:

>> With my ASM Hat,
>> both CONSTANT_Class_info “;Q<name>” and CONSTANT_ValueType_info that references
>> an UTF8 are Ok for me.

> Between those two I prefer the first since it doesn't require a new CP tag.

>> Weirdly, having a CONSTANT_Value_info that reference a CONSTANT_Class_info is
>> little harder to implement because the implementation of ASM is sensitive to
>> the number of levels of indirection (it's hardcoded to be 4, a constant method
>> handle has 4 levels).

> Interesting fact. Won't that have to change with condy?
> That allows bootstrap specifications to be recursive.

I have not implemented condy yet but i believe it's not an issue if the way to lookup a value is lazy, 
the user code will be recursive and not the ASM internals. 

>> On the longer term, I think that the spec of CONSTANT_Class should changed to
>> accept a class descriptor and not a class name (which is not BTW because array
>> are accepted in order to encode a method call to an array clone()).
>> It will allow more sharing and unlike a class name, a class descriptor is an
>> extensible format.

> [Flat strings won't take us there]

> Remi, flat strings don't go far enough. They are moderately
> extensible, and certainly accommodate new ground types like
> QFoo; and UFoo;, but there are two big problems. First, they
> suffer from combinatorial explosion (*less* sharingin flat strings)
> and second they incompletely support expression-holes which
> are required when we get to generics.

(BTW, neither you nor Karen did answer to my mail asking why we need UFoo; ) 

I agree that we may need a tree of constants but only if the interpreter need that in order to interpret the code. 
In my opinion, we should only use a tree of constants if it makes sense for the interpreter, otherwise, the constant should be flattened as a String. 
By example, an interpreter of Java 9 does not need to extract the types from a method descriptor in order to run (the verifier does but not the interpreter), 
so for me a method descriptor does not need to be a tree of constants at least for Java 9. 

> We live with the combinatorial problems of method type descriptors,
> but I think that's a place we want to retreat from. (Look at the encoding
> of (Object,Object,Object)Object: The flatness requires repetition of
> the whole qualified name four times, just in this one descriptor.)

> When we go to parameterized types, ground types will have multiple
> levels of nesting, which turns the problem from quadratic to
> exponential. That that point it's more than today's irritant.

> You can patch this with repeat operators, but the natural format
> is a tree, which represents all subparts uniformly, rather than some
> as a defining use, and others as repeated uses.

I fully agree. Specializing the code should not require to patch constants of the constant pool. 
The patchable content should be represented by an index inside a tree and the interpreter should maintain an array (in fact two arrays because you have method parameter types and class parameter types) of the corresponding type arguments. 

> [String-tagged shallow trees]

> For non-ground generic types, a type string could to be something
> like a format string. (The format "hello, %s" has a string-typed hole.)
> In that case, the string doesn't give you everything you need;
> it must be joined by a vector of operands. At that point you've
> invented trees, and then the real question is whether tree nodes
> should be tagged by format strings (an infinite number of them)
> or by a handful of simple CP-style tags.

> I handled both these issues in Pack200 by with the CONSTANT_Signature
> CP type (present only in Pack200 archives), whose content is a format
> string (with N>=0 holes) plus an implicitly counted vector of (N) CP refs
> of type CONSTANT_Class. (Primitives are inlined.) For technical
> reasons the hole syntax, if any, must be different from either string
> format notations and Pack200 with future JVMs; I think it should be
> a simple period '.'. (For discussion signature meta-characters see
> my "Symbolic Freedom" manifesto ca. 2008.)

> For values+generics we'll probably want to look at an experimental design
> like this that uses string-tagged tree nodes. They are very compact (hence
> their use in Pack200).

The problem of String tagging is that you often need to allocate a new string when you do the replacement or you have to be clever (and play with stack allocated iterators), but having a real tree may be worst if the interpreter data structure and the tree of constant in the constant pool do not match. 

About the compactness, being easier to interpret is in my opinion more important than be more compact. 

> [Byte-tagged deep trees]

> But I think for ease of tooling we will end up with the other option,
> which is *more* tree nodes tagged by a very small finite set of
> CP-style tags. This is why I support designs like the ones
> Dan has been sketching.

> In that style of tree, a format string like "hello, %s" breaks down into
> nested AST (Append[Literal["hello, "],Param[]]). Instead of parsing
> the string to find holes, the holes are directly represented, along
> with every other part, in a strongly-typed AST tree.

> An advantage of Dan-style trees is they are more strongly normalizing.
> With the format-based trees you always have small types sliding inline
> into the format strings, or out as explicit nodes (for uses like ldc).
> The programmer's educated instincts prefer one way to say one
> thing, rather than many ways to say the same thing. Stronger
> normalization leads to better compactness and fewer bugs.


> [Constant inlining?]

> Dan-style trees *could* be made much more compact, comparable
> to format strings, by extending the CP to support inlining of constant
> expressions into other expressions. This weakens the strong normalization
> of constants, but at a lower level where it can be hidden; constants
> presented via tools like ASM can be normalized easily, with a single
> clever rule ("unwind the inlining by making temporary CP nodes").
> ASM does stuff like this in reverse already, by interning ("normalizing")
> constants.

> We probably need something like this anyway, for the future
> CONSTANT_Group syntax, which doesn't pay for itself if it has to
> burn its way through the limited (u2) index space of the CP; so it
> needs some form of inlining, for constants that occur only inside
> the group and don't need global sharing.

>> From the VM point of view, it's easy to know if a CONSTANT_Class is a descriptor
>> or not, if it's a descriptor, the last character is a ';'.

> Actually, for the proposed extension, you look at the *first* character to see
> if it is a ';'. It's a different place (already existing) in the system where
> you
> check to see whether the name is of the form Foo or "LFoo;", and strip
> the decorations in the latter case. You *could* get away with Class["QFoo;"]
> but I don't recommend it, because it's a little harder to decode for both
> human readers and parsers.

i do not understand why ? 

>> I also think that the bytecode version corresponding to 10 should requires that
>> all CONSTANT_Class are encoded as class descriptor.

> If I understand what you are saying, that's not MVT at all, since it
> would force a revolution in tools. So we won't do that. It's overwhelmingly
> likely that legacy uses of CONSTANT_Class will coexist with new
> CP forms for multiple releases, even if this gives up the advantages
> of normal forms.

yes,, it's post MVT and given there will be other changes in the constant pool, tools will need to be updated so we can also mandate CONSTANT_Class to use only the descriptor format at at time. 

> [In the crystal ball]

> Beyond MVT, the CONSTANT_Class[";QFoo;"] wants to become either
> a Pack200 style thing like this:

> CONSTANT_Type[format="Q.;", args={ClassFile["Foo"]}]

> or (preferentially) a Dan-tree-like thing like this:

> CONSTANT_ClassType[mode=Q, class=ClassFile["Foo"]]

i disagree, i think we should limit ourselves to use tree of constants only when we need type substitution, 
Constant_Class("QFoo;") get you enough information, and if the format is normalized to be always descriptor, you can easily disambiguate using the first character. 

> — John


> P.S. [Side notes on CP-ology]

> When I write CP AST I like to omit "CONSTANT_" from
> nested nodes, and elide Utf8 nodes completely around strings.
> Makes it almost like a real notation.

> Format-strings for CP nodes would use a previously-unusable
> character for a hole; the period '.' fits nicely and looks like a hole.
> The arity of the node is determined Pack200-style by counting
> the holes, or else with an explicit u2 field. Pack200 simply counts
> the 'L' characters, but this breaks down with 'Q' and 'U'.

> Note that both format-strings and proper trees supply a solution
> for the quadratic length of method type descriptors. Pack200
> uses CONSTANT_Signature for both method and field types.

> CONSTANT_Signature["(L;L;L;)L;", Class("Object"), Class("Object"),
> Class("Object"), Class("Object")] # one child node mentioned 4x

More information about the valhalla-spec-observers mailing list