From John.Rose at Sun.COM Sat Feb 2 05:11:49 2008
From: John.Rose at Sun.COM (John Rose)
Date: Sat, 02 Feb 2008 05:11:49 -0800
Subject: FYI: blog on Da Vinci Machine vs. Microsoft DLR/CLR
Message-ID: <50D57E18-A7A0-43DF-A3CE-D69E5A228322@Sun.COM>

I spent the early week learning about the competition. I hope you
enjoy my notes:

http://blogs.sun.com/jrose/entry/bravo_for_the_dynamic_runtime

Best wishes,
-- John

From charles.nutter at sun.com Sat Feb 2 07:53:55 2008
From: charles.nutter at sun.com (Charles Oliver Nutter)
Date: Sat, 02 Feb 2008 09:53:55 -0600
Subject: FYI: blog on Da Vinci Machine vs. Microsoft DLR/CLR
In-Reply-To: <50D57E18-A7A0-43DF-A3CE-D69E5A228322@Sun.COM>
References: <50D57E18-A7A0-43DF-A3CE-D69E5A228322@Sun.COM>
Message-ID: <47A49213.8080003@sun.com>

John Rose wrote:
> I spent the early week learning about the competition. I hope you
> enjoy my notes:
>
> http://blogs.sun.com/jrose/entry/bravo_for_the_dynamic_runtime

Great post...I'm starting to get excited for the next year of OpenJDK.

- Charlie

From p.thamarai at gmail.com Sun Feb 3 19:14:30 2008
From: p.thamarai at gmail.com (Thamaraiselvan Poomalai)
Date: Mon, 4 Feb 2008 08:44:30 +0530
Subject: FYI: Instructions for building multi-language-vm aka The Da Vinci VM
Message-ID: <5def42a80802031914k75743e1dp9cbce1d5ad785330@mail.gmail.com>

Hello,

If you want to build and run mlvm, the below blog may assist you,

http://softonaut.blogspot.com/2008/02/building-openjdk-multi-language-vm-aka.html

Best wishes,
selvan

From John.Rose at Sun.COM Mon Feb 4 10:42:49 2008
From: John.Rose at Sun.COM (John Rose)
Date: Mon, 04 Feb 2008 10:42:49 -0800
Subject: FYI: Instructions for building multi-language-vm aka The Da Vinci VM
In-Reply-To: <5def42a80802031914k75743e1dp9cbce1d5ad785330@mail.gmail.com>
References: <5def42a80802031914k75743e1dp9cbce1d5ad785330@mail.gmail.com>
Message-ID: <1376BD3F-C580-402B-99E7-260AE2083841@sun.com>

Thanks, Selvan!

-- John

On Feb 3, 2008, at 7:14 PM, Thamaraiselvan Poomalai wrote:

> If you want to build and run mlvm, the below blog may assist you,
>
> http://softonaut.blogspot.com/2008/02/building-openjdk-multi-language-vm-aka.html

From John.Rose at Sun.COM Tue Feb 12 01:44:09 2008
From: John.Rose at Sun.COM (John Rose)
Date: Tue, 12 Feb 2008 01:44:09 -0800
Subject: Tailcalls in the jvm
In-Reply-To: <6E427B23-B974-4805-82E1-887A52CA1AA4@gmail.com>
References: <191AC8AA-6C88-49D0-B828-B1A87A655BBA@Sun.COM>
 <978F703A-22FA-4A51-A854-79D571159247@sun.com>
 <6E427B23-B974-4805-82E1-887A52CA1AA4@gmail.com>
Message-ID:

On Feb 10, 2008, at 5:17 AM, Arnold Schwaighofer wrote:

> Yes i am still working on tail calls. Sadly i have nothing to show
> so far, as i have been put off my schedule during winter holidays
> by former work obligations.

These things happen. I think we'll make the Da Vinci Machine happen also!

> While looking at how to handle the security (stack inspection)
> problem that tail calls impose indeed a question arose.
>
> For my first implementation i plan to issue a check when performing
> the tail call whether the 'tailcalling' caller has the permission
> java.security.AllPermission to ensure that the missing frame
> (protection domain) of the caller does not change security
> behavior. If the caller has any other permission no tail call
> optimization will be performed.
>
> But inspired by the idea of storing the security information in the
> continuation [1], I have come up with the following idea to
> preserve the existing security semantics and always perform tail
> call optimization (except for cases where it's obviously not
> possible - exceptions, monitors etc). The idea is based on the
> assumption that the protection domain of a class is a property that
> does not change (often) e.g. only when the class is loaded. This is
> my question: is this assumption correct?

Yes. It is a static property of the class. The only thing dynamic
about it is that a virtual or interface call can reach a computed
and variable target method, and hence a computed and variable PD
(which is derived from the target method's class).

> I have found the native method setProtectionDomain0() on
> java.lang.Class that is documented to be called by
> ClassLoader.defineClass. My current understanding is that the only
> way to change the protection domain is via reloading the class?
>
> << ==== only read the following if you have time ...

(long discussion of stack frame formats read but deleted)

> [1] A Tail-Recursive Machine with Stack Inspection, Clements, Felleisen

I have a few points of response:

0. I find that paper is obscure, since I don't often move in circles
where the term "contractum" pops up in normal conversation.
But there are circles within circles, I see, since Clements & Felleisen
fall back on named lambdas instead of the mysterium tremendum
of the Y combinator favored by Fournet & Gordon. :-)
More comments on their paper below. (I almost said, "inter alia.")

1. If protection domains are few, then we could, as you suggest,
put a single-word bitmap in every stack frame.

The idea to deoptimize and reformat on the 33rd, 65th, etc.
protection domains is clever. I suppose if you were to
reserve a word in every stack frame, you could also
make it an indirection to an expandable array.

But I'd rather not go down this road.

In HotSpot we try to keep optimized stack frames as "native" as
possible. If there is a need, say, to track GC roots, we try not to
compromise code quality or stack frame layout but instead add
metadata on the side. This is a powerful technique. It doesn't work
for PD accumulation, since that is a (slightly) dynamic property of
stack frames. I'd hate to invent stack frame customization machinery
(complex, and with unknown overheads), just for this one use case.

2. I agree protection domains are usually very few. I haven't seen
use of many. (Maybe heavy users of signed JARs.) In practice, they
are proxies for ClassLoader identity, and you can usually interchange
ClassLoader with PD (or with the ordered pair) when reasoning about
privileges on stack frames. I think the rules for PD-violating
tail-calls can be simple and conservatively approximated by similar
rules about calls between methods in different class loaders.

(Privileges are subtractive as stack frames accumulate, of course,
except that they are reset by doPrivileged. So to be as permissive
as possible, you need a per-thread set of PDs, or you need to
represent them via active stack frames, and avoid deleting frames
which would change the subtractive answer. Deleting a subtraction
causes an addition, which is the hazard here, since the addition is
of a set of privileges which would normally be denied to the PD.
I find this confusing, BTW.)

3. Since we're defining our own VM here, let's see what we can do
by refusing to handle the hard cases. One too-radical idea: Throw
a linkage error if less-privileged code tail-calls more privileged code.
This probably won't work well in the presence of method handles
(or even virtual/interface calls) since the byte compiler cannot
statically determine the identity of a callee and refrain from
requesting a bad tail call. A Scheme example would be:

  (define (revlist listfn a b) (listfn b a))

Suppose revlist is in an applet, and listfn is bound to
the actual system routine 'list'.

Then the draconian rule will cause the code to fail.
This rule is well-motivated, since the listfn might also
be bound to for-each, which could cause a closure
b to be invoked in a privilege context elevated
beyond revlist's capabilities.

4. OK, that was a bad start. Let's try again.

Make a less draconian rule, one that allows
the call but keeps more of the stack trace around.
When revlist tail-calls list, the system can keep
the stack frame around (or keep an adapter frame)
to represent the PD of revlist.

I call this a "defective tail call", because the compiler
requested a tail call, but the runtime pushed a stack frame
anyway. The application cannot detect this, unless
it performs an unbounded recursion using only tail calls
some of which are defective; eventually it will get
a stack overflow from the frames pushed by the
defective calls.

Apparently Fournet and Gordon (whom I have not
read) allow their engine to defeat tailcalls, because
C&F say they make the frame take-down operation
explicit.

Basically, I propose the JVM should defeat a tail
call when it detects a cross-PD call. It should
refrain from taking down the frame, or (better)
replace the frame with a holder for the PD.

I think this is a workable rule, though I haven't thought hard
about what bad consequences it might bring. The bad cases I can
think of seem implausible. If you allow cross-PD tail-calls to be
defective, you burden only code patterns which perform looping or
"threaded interpreter" (non-pushing) state transitions
between mutually untrusting modules.
This seems unlikely; in Scheme at least nearly all tail calls are
either within the same compilation unit, or else are "accidental"
to a call to a routine which either returns immediately (e.g., list
or car) or else performs something which inherently needs a stack
frame (e.g., for-each, dynamic-wind).

C&F claim that these infinitely looping tail calls pingponging
between system domains would be commonly written, if only Java
programmers could be assured that the stack would not blow up.
Maybe there is a PyPy JIT example, or some sort of message passing
state machine, for which the stack frame buildup would be unbounded.
But I don't see one yet, and as an engineer I don't want to work on
problems that nobody seems to care about.

Most control-passing patterns involve at least a little branchiness,
where a caller tries first one thing and then another; that caller
needs to push a stack frame. I guess supporting pure CPS (i.e.,
heap-resident user-built frames) might run into problems, but (again)
when I try to come up with Scheme examples I find system calls which
are either leaves (car) or inherently pushy (for-each).

Many Scheme implementations allow cross-module tail calls to be
defeated. They have to apologize for this, but the defect does not
seem to cause problems for programmers. So, in the absence of more
evidence, I think Schinz and Odersky (referenced in C&F) strike the
right compromise.

5. Or we could use little adapter frames to represent
PD information.

This is the engineer's equivalent to C&F's continuation
markers, which summarize annotations derived from
backtraces.

Have a cross-PD tail call push an adapter (not the original
frame) which the JVM creates specially for the purpose
of temporarily defeating (potentially) cross-PD tail calls.

But, keep a counter in them, and when too many pile up,
crush them down into a single adapter, with a composite
PD marker.
The "crush" operation can compress an unlimited number of frames,
since the PD set the JVM needs can be unordered, and PDs are few.

(Oddly enough, I wouldn't need to produce an asymptotic space
proof over a multi-kinded combinatorial algebra in order to
code this up. It's a funny world.)

Shall we do the simple "defeatist" thing first, and reserve
the adapter frames for a second effort?

Best wishes,
-- John

From arnold.schwaighofer at gmail.com Wed Feb 13 14:29:12 2008
From: arnold.schwaighofer at gmail.com (Arnold Schwaighofer)
Date: Wed, 13 Feb 2008 23:29:12 +0100
Subject: Tailcalls in the jvm
In-Reply-To: <3E7AD86D-C826-4A34-AB08-5D3223C6066C@Sun.COM>
References: <191AC8AA-6C88-49D0-B828-B1A87A655BBA@Sun.COM>
 <978F703A-22FA-4A51-A854-79D571159247@sun.com>
 <6E427B23-B974-4805-82E1-887A52CA1AA4@gmail.com>
 <3E7AD86D-C826-4A34-AB08-5D3223C6066C@Sun.COM>
Message-ID:

On Feb 12, 2008 10:44 AM, John Rose wrote:
> These things happen. I think we'll make the Da Vinci Machine happen
> also!

> 0. I find that paper is obscure, since I don't often move in circles
> where the term "contractum" pops up in normal conversation.
> But there are circles within circles, I see, since Clements & Felleisen
> fall back on named lambdas instead of the mysterium tremendum
> of the Y combinator favored by Fournet & Gordon. :-)

:)

> 1. If protection domains are few, then we could, as you suggest,
> put a single-word bitmap in every stack frame.
>
> The idea to deoptimize and reformat on the 33rd, 65th, etc.
> protection domains is clever. I suppose if you were to
> reserve a word in every stack frame, you could also
> make it an indirection to an expandable array.
>
> But I'd rather not go down this road.

Okay. I think Christian already told me that stack frame modifications
would be frowned upon. :)

> In HotSpot we try to keep optimized stack frames as "native" as
> possible.
> If there is a need, say, to track GC roots, we try not to compromise
> code quality or stack frame layout but instead add metadata on the
> side. This is a powerful technique. It doesn't work for PD
> accumulation, since that is a (slightly) dynamic property of stack
> frames. I'd hate to invent stack frame customization machinery
> (complex, and with unknown overheads), just for this one use case.

okay.

> 3. Since we're defining our own VM here, let's see what we can do
> by refusing to handle the hard cases. One too-radical idea: Throw
> a linkage error if less-privileged code tail-calls more privileged code.
>
> This probably won't work well in the presence of method handles
> (or even virtual/interface calls) since the byte compiler cannot
> statically determine the identity of a callee and refrain from
> requesting a bad tail call. A Scheme example would be:
>   (define (revlist listfn a b) (listfn b a))
>
> Suppose revlist is in an applet, and listfn is bound to
> the actual system routine 'list'.
>
> Then the draconian rule will cause the code to fail.
> This rule is well-motivated, since the listfn might also
> be bound to for-each, which could cause a closure
> b to be invoked in a privilege context elevated
> beyond revlist's capabilities.

yeah, i also wanted to do something along those lines until i realized
that the callee might be unknown (interface/virtual call) to the caller.

> 4. OK, that was a bad start. Let's try again.
>
> Make a less draconian rule, one that allows
> the call but keeps more of the stack trace around.
> When revlist tail-calls list, the system can keep
> the stack frame around (or keep an adapter frame)
> to represent the PD of revlist.
>
> I call this a "defective tail call", because the compiler
> requested a tail call, but the runtime pushed a stack frame
> anyway.
> The application cannot detect this, unless
> it performs an unbounded recursion using only tail calls
> some of which are defective; eventually it will get
> a stack overflow from the frames pushed by the
> defective calls.
>
> Apparently Fournet and Gordon (whom I have not
> read) allow their engine to defeat tailcalls, because
> C&F say they make the frame take-down operation
> explicit.
>
> Basically, I propose the JVM should defeat a tail
> call when it detects a cross-PD call. It should
> refrain from taking down the frame, or (better)
> replace the frame with a holder for the PD.

That is probably the path i'll take. But there is the problem of
recognizing cross-pd calls for dynamic call targets
(virt./interface/method handle calls) as you suggest above.

A solution could be to add another function entry point like the one
for monomorphic calls. Let's call it the unverified tail call entry
point. It checks that the current function's PD and the caller's
function PD are the same and jumps to the normal entry point if they
match (fast path). If they do not match, the appropriate action is
taken to deal with the situation (proceed as non-tail call, set up
adapter frame, throw exception).

Callee targets which are statically known to be in the same PD (a
potential optimization could be a PD analysis akin to the class
hierarchy analysis) would use the normal entry point (verified or
unverified entry point) while the others would go to the unverified
tail call entry point.

Pro: fast also for the dynamic callee case (if there is an efficient
way to compare the two PDs - maybe caller passes its PD in a register)
in the matching PD case
Contra: might be more complex than the solution below

Another solution would be to always call a jvm runtime function
(more overhead, always a slow path) to perform the tail call in cases
where the call target is dynamic. I think this is what the .net
runtime does/used to do?
(JIT_TailCall,
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=98236)

Pro: less complex
Contra: slower for dynamic call target case

What do you think?

>
> 5. Or we could use little adapter frames to represent
> PD information.
>
> This is the engineer's equivalent to C&F's continuation
> markers, which summarize annotations derived from
> backtraces.
>
> Have a cross-PD tail call push an adapter (not the original
> frame) which the JVM creates specially for the purpose
> of temporarily defeating (potentially) cross-PD tail calls.
>
> But, keep a counter in them, and when too many pile up,
> crush them down into a single adapter, with a composite
> PD marker. The "crush" operation can compress an
> unlimited number of frames, since the PD set the JVM
> needs can be unordered, and PDs are few.

Yes that is a cool idea. Note to myself: Always think of adapter
frames. they seem to be the solution to many problems.

> (Oddly enough, I wouldn't need to produce an asymptotic space
> proof over a multi-kinded combinatorial algebra in order to
> code this up. It's a funny world.)

:)

> Shall we do the simple "defeatist" thing first, and reserve
> the adapter frames for a second effort?

Yes i think that is the way to go. At least my lazy left big toe tells
me so :).

regards arnold

From John.Rose at Sun.COM Wed Feb 13 17:01:10 2008
From: John.Rose at Sun.COM (John Rose)
Date: Wed, 13 Feb 2008 17:01:10 -0800
Subject: Tailcalls in the jvm
In-Reply-To:
References: <191AC8AA-6C88-49D0-B828-B1A87A655BBA@Sun.COM>
 <978F703A-22FA-4A51-A854-79D571159247@sun.com>
 <6E427B23-B974-4805-82E1-887A52CA1AA4@gmail.com>
 <3E7AD86D-C826-4A34-AB08-5D3223C6066C@Sun.COM>
Message-ID:

On Feb 13, 2008, at 2:29 PM, Arnold Schwaighofer wrote:

> On Feb 12, 2008 10:44 AM, John Rose wrote:
>> Basically, I propose the JVM should defeat a tail
>> call when it detects a cross-PD call.
>
> That is probably the path i'll take.
> But there is the problem of
> recognizing cross-pd calls for dynamic call targets
> (virt./interface/method handle calls) as you suggest above.
> A solution could be to add another function entry point like the one
> for monomorphic calls. Let's call it unverified tail call entry point.
> It checks that the current function's PD and the caller's function PD
> are the same and jumps to the normal entry point if they match (fast
> path). If they are not the appropriate action is taken dealing with
> the situation (proceed as non-tail call, setup adapter frame,throw
> exception).

---
One key question is who does the frame-popping.

One way is for the tail-caller to eagerly pop his own frame, but pass
a PD token either to the callee's special entry point, or to an
intermediate "transition stub". Call this eager popping. It is the
standard technique.

The other way is for the tail-caller to try to keep his frame, but
pass a PD token plus a take-down cookie ("TDC", e.g., FP or frame size)
to the immediate callee. (Which is either a special method entry
point or again a transition stub.) Call this lazy popping.

The advantage of lazy popping is that you don't need to create
frames when a TC must be defeated. All the frames stay "standard",
and this interacts most simply with deoptimization, debugging,
security checks, and anything else that walks the stack.

One disadvantage of lazy popping is the extra TDC argument
and (perhaps) overheads interpreting it.

With eager popping, each caller takes down his own frame,
by compiling customized code.

Eager popping seems to be the more aggressive technique.
One thing to worry about is whether there is enough stack
space in the tail-caller's caller to hold the outgoing arguments
to the tail-callee. In general there isn't. That consideration
alone might lead us to favor lazy popping.

You might think that lazy popping would make it impossible to
collapse ("crush") stack frames into adapters which summarize
PD contexts.
That's not true, since HotSpot has adequate stack
walking facilities. When you decide stack space is getting
tight after too many consecutive defeated tail calls, you trap
into the VM code, look at the stack, and reformat it, like
deoptimization does today.

I'm not sure which one to try first, but I lean towards
lazy popping, since it makes the stack look more "normal"
to the reflective parts of the VM runtime.

---
Another key question is what is the shape of a call site.

HotSpot call sites are simple and general "inline caches":
  set #patchable_token, %inline_cache_reg
  call patchable_immediate_callee

Two machine words are patchable here. The initial
state is that the immediate callee is a linkage routine.

The usual state is called "monomorphic", where the
immediate callee is the "verified entry point" of the
actual method, and the token is the expected class,
which every VEP checks. VEPs are the same
couple of instructions everywhere; they act like
inlined transition stubs.

This usual case includes most nominally virtual call sites.

The fallback state is called "megamorphic", where
the immediate callee is a transition stub that performs
dynamic dispatch.

A typical transition stub loads the receiver's class,
loads a pointer from the vtable, and jumps through
the pointer.

The "patchable token" is garbage except for monomorphic
calls. This makes most state transitions atomic by virtue
of an atomic patch to the immediate callee. (It's not the
whole story; see the sources for more messy details.)

The key routines are CompiledIC::set_to_monomorphic,
CompiledIC::set_to_megamorphic and their callees.
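The three call-site states above can be modeled in a few lines of Java.
This is only a toy sketch: InlineCacheSketch, CallSite, and VTABLE are
names made up for illustration, a map stands in for the vtable, and
nothing here reflects how CompiledIC is actually coded.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Toy model of an inline-cache call site; not HotSpot code. */
public class InlineCacheSketch {
    enum State { UNLINKED, MONOMORPHIC, MEGAMORPHIC }

    /** Stands in for a vtable: maps a receiver class to a method body. */
    static final Map<Class<?>, Supplier<String>> VTABLE = new HashMap<>();

    static final class CallSite {
        State state = State.UNLINKED;
        Class<?> expectedClass;          // plays the "patchable token"
        Supplier<String> cachedTarget;   // plays the "patchable immediate callee"

        String invoke(Object receiver) {
            Class<?> actual = receiver.getClass();
            switch (state) {
                case UNLINKED:
                    // Linkage routine: patch the site for the first receiver seen.
                    expectedClass = actual;
                    cachedTarget = VTABLE.get(actual);
                    state = State.MONOMORPHIC;
                    return cachedTarget.get();
                case MONOMORPHIC:
                    // Entry-point check: does the token match the receiver's class?
                    if (actual == expectedClass) {
                        return cachedTarget.get();
                    }
                    // Miss: switch to dynamic dispatch; the token is garbage now.
                    state = State.MEGAMORPHIC;
                    expectedClass = null;
                    cachedTarget = null;
                    return VTABLE.get(actual).get();
                default:
                    // Megamorphic: load the class, look up the target, jump.
                    return VTABLE.get(actual).get();
            }
        }
    }

    public static void main(String[] args) {
        VTABLE.put(String.class, () -> "string method");
        VTABLE.put(Integer.class, () -> "integer method");
        CallSite site = new CallSite();
        System.out.println(site.invoke("a") + " / " + site.state);
        System.out.println(site.invoke("b") + " / " + site.state);
        System.out.println(site.invoke(42) + " / " + site.state);
    }
}
```

The model leaves out the atomic patching that the real call site
depends on; it only shows which state decides where the call goes.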
It seems to me the problem of cross-PD detection breaks
into the same pieces:
 - unlinked calls, where the VM linker routine gets to edit the site
   as before
 - monomorphic calls, where the decision whether to pop is made at
   link time
 - megamorphic calls, where the decision can be folded into the
   vtable/itable transition stubs

In the case of a monomorphic site, the VM linker routine can decide
at link time whether to defeat the tail call or not, and patch the
call site accordingly.

In the less-common case of a megamorphic call site, the transition
stub can be responsible for comparing protection domains. This
comparison can be moderately expensive (I think) because it will be
comparatively infrequent. This affects the choice of PD token. I
recommend that the PD token be the caller's class (klassOop).
Comparing two PD tokens requires comparing their protection domain
objects. The comparison takes an extra indirection, but it probably
makes the code for the call site simpler, since klassOops are easy to
introduce into code.

> Callee targets which are statically known to be in the same PD (a
> potential optimization could be a PD analysis akin to the class
> hierarchy analysis) would use the normal entry point (verified or
> unverified entry point) while the others would go to the unverified
> tail call entry point.

Yes, the VM call site linker could use CHA to pick a more
direct and efficient callee, the monomorphic method's VEP.

---
Another key question is who is the immediate callee.

There are at least two cases: A tail call which still needs to do a PD
check, and an unchecked tail call. Should there be two more entry
points?

Either or both entry points could instead be transition stubs,
like vtable or itable dispatch stubs. There will be more such
stubs, too, as a result of the invokedynamic work.
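At the Java level, the PD-token comparison described above (class as
token, one extra indirection to reach the protection domain) might look
roughly like the sketch below. It assumes java.lang.Class stands in for
the klassOop; PdCheckSketch, Action, and decide are hypothetical names,
and the real check would live in VM stubs, not in Java code.

```java
import java.security.ProtectionDomain;

/** Sketch of a cross-PD check using classes as PD tokens; illustrative only. */
public class PdCheckSketch {
    enum Action { POP_FRAME, DEFEAT_TAIL_CALL }

    /** Compares the protection domains reached through the two class tokens. */
    static boolean sameProtectionDomain(Class<?> callerToken, Class<?> calleeClass) {
        // The extra indirection: class -> protection domain object.
        ProtectionDomain callerPd = callerToken.getProtectionDomain();
        ProtectionDomain calleePd = calleeClass.getProtectionDomain();
        return callerPd == calleePd;
    }

    /** Decides whether the tail caller's frame may be taken down. */
    static Action decide(Class<?> callerToken, Class<?> calleeClass) {
        return sameProtectionDomain(callerToken, calleeClass)
                ? Action.POP_FRAME
                : Action.DEFEAT_TAIL_CALL;
    }

    public static void main(String[] args) {
        // Same class, hence same PD: the frame can be popped.
        System.out.println(decide(PdCheckSketch.class, PdCheckSketch.class));
        // Application class calling a system class: typically different PDs.
        System.out.println(decide(PdCheckSketch.class, String.class));
    }
}
```

Identity comparison is used on purpose: two classes share a PD only if
they were defined with the same ProtectionDomain object.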
> Pro: fast also for the dynamic callee case (if there is an efficient
> way to compare the two PDs - maybe caller passes its PD in a register)
> in the matching PD case
> Contra: might be more complex than the solution below

Maybe.

> Another solution would be to always call a jvm runtime function
> (more overhead, always a slow path) to perform the tail call in cases
> where the call target is dynamic. I think this is what the .net
> runtime does/used to do? (JIT_TailCall,
> http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=98236)
>
> Pro: less complex
> Contra: slower for dynamic call target case

I think it's simple enough to pass the PD (and TDC if necessary) as
extra arguments.

I do think we should build a faster tailcall than what .NET has.
Having some tailcalls always trap to the runtime is a very uneven
performance profile!

-- John

From arnold.schwaighofer at gmail.com Thu Feb 14 14:09:23 2008
From: arnold.schwaighofer at gmail.com (Arnold Schwaighofer)
Date: Thu, 14 Feb 2008 23:09:23 +0100
Subject: Tailcalls in the jvm
In-Reply-To:
References: <191AC8AA-6C88-49D0-B828-B1A87A655BBA@Sun.COM>
 <978F703A-22FA-4A51-A854-79D571159247@sun.com>
 <6E427B23-B974-4805-82E1-887A52CA1AA4@gmail.com>
 <3E7AD86D-C826-4A34-AB08-5D3223C6066C@Sun.COM>
Message-ID:

On Thu, Feb 14, 2008 at 2:01 AM, John Rose wrote:
> On Feb 13, 2008, at 2:29 PM, Arnold Schwaighofer wrote:
>
> > On Feb 12, 2008 10:44 AM, John Rose wrote:

> One key question is who does the frame-popping.

yes

> One way is for the tail-caller to eagerly pop his own frame, but pass
> a PD token either to the callee's special entry point, or to an
> intermediate "transition stub". Call this eager popping. It is the
> standard technique.
>
> The other way is for the tail-caller to try to keep his frame, but
> pass a PD token plus a take-down cookie ("TDC", e.g., FP or frame size)
> to the immediate callee. (Which is either a special method entry
> point or again a transition stub.)
> Call this lazy popping.
>
> The advantage of lazy popping is that you don't need to create
> frames when a TC must be defeated. All the frames stay "standard",
> and this interacts most simply with deoptimization, debugging,
> security checks, and anything else that walks the stack.
>
> One disadvantage of lazy popping is the extra TDC argument
> and (perhaps) overheads interpreting it.
>
> With eager popping, each caller takes down his own frame,
> by compiling customized code.

I think optimally it would be a mix of both (for maximum efficiency).

> Eager popping seems to be the more aggressive technique.
> One thing to worry about is whether there is enough stack
> space in the tail-caller's caller to hold the outgoing arguments
> to the tail-callee. In general there isn't. That consideration
> alone might lead us to favor lazy popping.

Yes you are right. but it does not fully solve the problem. we can not
just resize (enlarge) the stack at the callee site. The caller of the
'tail calling' caller does not know the size by which we are
adjusting. Changing the size of the outgoing area without the function
knowing it will lead to erroneous behavior :)

my idea (well actually Christian suggested using adapter frames - my
idea was way more complicated - resizing outgoing areas during
runtime, remember my comment about adapter frames in the last email -
they are a solution to many problems :) to deal with more arguments is
inserting a dummy frame that contains a big area for outgoing
arguments in cases where the callee has more arguments. This implies a
check (does the dummy frame exist and is it big enough) at the callee
site (another entry point combination or stub) for the general case
(more arguments).

e.g. if we have a fun(arg1, arg2) tail calling a fun2(arg1, arg2,
arg3, arg4) then before control is handed over to fun2 a dummy/adapter
frame is created where the arguments can go. that is if there isn't
already a dummy frame (because maybe fun was also tail called).
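A toy Java model of that check, just to make the bookkeeping concrete.
Everything in it is an assumption for illustration: the names
DummyFrameSketch, Frame, and tailCall, the idea of counting argument
slots per frame, and replacing a too-small dummy frame with a larger
one. It also ignores who actually pops the frame.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy stack model of the dummy (adapter) frame idea; not HotSpot code. */
public class DummyFrameSketch {
    static final class Frame {
        final String name;
        final boolean dummy;
        final int outgoingSlots; // room for the arguments of this frame's callee

        Frame(String name, boolean dummy, int outgoingSlots) {
            this.name = name;
            this.dummy = dummy;
            this.outgoingSlots = outgoingSlots;
        }
    }

    /** Replaces the top frame with the callee and reports the path taken. */
    static String tailCall(Deque<Frame> stack, String callee, int calleeArgs) {
        stack.pop();                    // take down the tail caller's frame
        Frame below = stack.peek();     // the real caller, or an earlier dummy
        String path;
        if (below.outgoingSlots >= calleeArgs) {
            path = "fast";              // the arguments fit where they are
        } else {
            if (below.dummy) {
                stack.pop();            // existing dummy frame is too small
            }
            stack.push(new Frame("dummy", true, calleeArgs));
            path = "slow";
        }
        stack.push(new Frame(callee, false, 0));
        return path;
    }

    public static void main(String[] args) {
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame("main", false, 2)); // main calls fun(arg1, arg2)
        stack.push(new Frame("fun", false, 0));
        System.out.println(tailCall(stack, "fun2", 4) + " depth=" + stack.size());
        System.out.println(tailCall(stack, "fun3", 3) + " depth=" + stack.size());
        System.out.println(tailCall(stack, "fun4", 6) + " depth=" + stack.size());
    }
}
```

In this model the depth stays at 3 however long the chain of tail calls
gets: at most one dummy frame sits between the real caller and the
current callee.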
So the original idea was (not taking the PD problem into account) to
have two code paths:
- fast case: callee has fewer or the same number of arguments, eagerly
  pop the stack and jump to callee (in case of virt/interface calls
  jump to transition stub)
- slower case: callee has more arguments, jump to a point in callee
  (special entry point/stub) that checks whether the dummy frame
  already exists and has an appropriate size (if not it puts it
  there), here we would use the lazy popping variant and create the
  adapter frame if needed.

> You might think that lazy popping would make it impossible to
> collapse ("crush") stack frames into adapters which summarize
> PD contexts. That's not true, since HotSpot has adequate stack
> walking facilities. When you decide stack space is getting
> tight after too many consecutive defeated tail calls, you trap
> into the VM code, look at the stack, and reformat it, like
> deoptimization does today.
>
> I'm not sure which one to try first, but I lean towards
> lazy popping, since it makes the stack look more "normal"
> to the reflective parts of the VM runtime.
>
> ---
> Another key question is what is the shape of a call site.
>
> HotSpot call sites are simple and general "inline caches":
>   set #patchable_token, %inline_cache_reg
>   call patchable_immediate_callee
>
> Two machine words are patchable here. The initial
> state is that the immediate callee is a linkage routine.
>
> The usual state is called "monomorphic", where the
> immediate callee is the "verified entry point" of the
> actual method, and the token is the expected class,
> which every VEP checks. VEPs are the same
> couple of instructions everywhere; they act like
> inlined transition stubs.
>
> This usual case includes most nominally virtual call sites.
>
> The fallback state is called "megamorphic", where
> the immediate callee is a transition stub that performs
> dynamic dispatch.
>
> A typical transition stub loads the receiver's class,
> loads a pointer from the vtable, and jumps through
> the pointer.
>
> The "patchable token" is garbage except for monomorphic
> calls. This makes most state transitions atomic by virtue
> of an atomic patch to the immediate callee. (It's not the
> whole story; see the sources for more messy details.)
>
> The key routines are CompiledIC::set_to_monomorphic,
> CompiledIC::set_to_megamorphic and their callees.
>
> It seems to me the problem of cross-PD detection breaks
> into the same pieces:
>  - unlinked calls, where the VM linker routine gets to edit the site
>    as before
>  - monomorphic calls, where the decision whether to pop is made at
>    link time
>  - megamorphic calls, where the decision can be folded into the
>    vtable/itable transition stubs
>
> In the case of a monomorphic site, the VM linker routine can decide
> at link time whether to defeat the tail call or not, and patch the
> call site accordingly.
>

agree

> In the less-common case of a megamorphic call site, the transition
> stub can be responsible to compare protection domains. This comparison
> can be moderately expensive (I think) because it will be comparatively
> infrequent. This affects the choice of PD token. I recommend that the
> PD token be the caller's class (klassOop). Comparing two PD tokens
> requires comparing their protection domain objects. The comparison
> takes an extra indirection, but it probably makes the code for the call
> site simpler, since klassOops are easy to introduce into code.

okay

>
> > Callee targets which are statically known to be in the same PD (a
> > potential optimization could be a PD analysis akin to the class
> > hierarchy analysis) would use the normal entry point (verified or
> > unverified entry point) while the others would go to the unverified
> > tail call entry point.
> > Yes, the VM call site linker could use CHA to pick a more > direct and efficient direct callee, the monomorphic method's > VEP. I think you miss understood me there. I was referring to different tail call entry points the linker can choose from as you suggest below. (ignoring the outgoing argument problem for a moment) as you say we would have the following cases: - PD statically known (maybe improved through a PD analysis - e.g all currently loaded method's with signature x are in the same PD) to be the same 1. statically known callee: jump to unverified entry point 2. monomorphic callee: jump to verified entry point 3. megamorphic callee: jump to transition stub - PD might be different 4. megamorphic calee: jump to vtable/itable stub which checks PD > --- > Another key question is who is the immediate callee. > > There are at least two cases: A tail call which still needs to do a PD > check, and an unchecked tail call. Should there be two more entry > points? yes. probably more. > > Either or both entry points could instead be transition stubs, > like vtable or itable dispatch stubs. There will be more such > stubs, too, as a result of the invokedynamic work. okay > I do think we should build a faster tailcall than what .NET has. > Having some tailcalls always trap to the runtime is a very uneven > performance profile! Agreed :) I was not suggesting to do what .net does (might do) merely comparing to it. > -- John > So to summarize we have three (!) dimensions. arguments (same-less/more), callee target (static, dynamic), PD known to be equal or not same/or less | callee target // more | arguments | static | monomorphic | megamorphic __________________|_______________________________________________ | 1a | 2a | 3a | | | PD equal | // | // | // | 1b | 2b | 3b __________________|_______________________________________________ | 4a | 5a | 6a | | | // PD (might) differ | | | 6b (ascii -art needs fixed sized font) 1a.) 
eagerly pop the stack, jump to callee
      move arguments to appropriate place
      pop stack
      jump verified-entry-point

1b.) jump to special entry point that pops the stack (lazy), checks for
     a dummy (adapter) frame, shuffles arguments
      set tail-call-token-or-fp %tail-call-token-reg
      jump tail-call-check-for-dummy-frame-verified-entry-point

2.a) eagerly pop the stack, jump to callee (unverified entry point)
      pop stack
      set #patchable token %inline_cache_reg
      jump unverified_entry_point

     unverified_entry_point is the monomorphic unverified entry point

2.b) lazy pop the stack, jump to special entry point/stub that takes
     down stack
      set tail-call-token-or-fp %tail-call-token-reg
      jump tail-call-check-for-dummy-frame-unverified-entry-point

3ab.) exists if we do a pd-analysis (do all currently loaded classes
      with that signature have the same pd)
      like 2a/b but jump to megamorphic stub, which in case of 3b
      checks for dummy (adapter) frame
      otherwise like 6a

4a.) defeat - since we statically know they differ

5a.) defeat - ditto

6a) always lazily pop the stack since pd checks can cause defeat
      set pd_klass_loader %pd_reg
      set tail-call-token-or-fp %tail-call-token-reg
      jump tail-call-check-pd-vtable/itable-stub

6b)
      set pd_klass_loader %pd_reg
      set tail-call-token-or-fp %tail-call-token-reg
      jump tail-call-check-pd-and-dummy-frame-vtable/itable-stub

_
[on x86 we are running out of registers so probably use stack slots]

some of these cases might be collapsed, as some are more general and
would work for other cases (while being less efficient of course).

The reason why i think it makes sense to differentiate between the
less/equal argument case and the callee-has-more-arguments case is that
i believe that a language front-end might know in advance what the
maximum number of arguments will be (for most cases), and so could fill
every function with dummy arguments, resulting in more efficient tail
calls.

int add2(int a, int b, int dummy, int dummy2, int dummy3, ...)
  return a+b;

the generated frames then would always have the correct frame size.
(no dummy frames needed)

moving of the dummy arguments might be optimized 'away' by the compiler
which could probably recognize that the arguments (dummy, dummy2) have
not changed during the function.

gee - and we haven't talked about what to do in case of interpreted to
compiled and vice versa :)

better me stops (big mouth) talking and starts coding.

regards arnold

From John.Rose at Sun.COM  Fri Feb 15 01:51:10 2008
From: John.Rose at Sun.COM (John Rose)
Date: Fri, 15 Feb 2008 01:51:10 -0800
Subject: Tailcalls in the jvm
In-Reply-To: 
References: <191AC8AA-6C88-49D0-B828-B1A87A655BBA@Sun.COM>
	<978F703A-22FA-4A51-A854-79D571159247@sun.com>
	<6E427B23-B974-4805-82E1-887A52CA1AA4@gmail.com>
	<3E7AD86D-C826-4A34-AB08-5D3223C6066C@Sun.COM>
Message-ID: <2352B194-F729-4ECB-BB06-A35FDFA85EC5@Sun.COM>

On Feb 14, 2008, at 2:09 PM, Arnold Schwaighofer wrote:

> On Thu, Feb 14, 2008 at 2:01 AM, John Rose wrote:
>> One disadvantage of lazy popping is the extra TDC argument
>> and (perhaps) overheads interpreting it.
>>
>> With eager popping, each caller takes down his own frame,
>> by compiling customized code.
>
> I think optimally it would be a mix of both (for maximum efficiency).

Interesting; probably true.  There's still the question of which to
attack first.

>> One thing to worry about is whether there is enough stack
>> space in the tail-caller's caller to hold the outgoing arguments
>> to the tail-callee.
>
> Yes you are right. but it does not fully solve the problem. we can not
> just resize (enlargen) the stack at the callee site.

Yes; my thought was that lazy popping might make this easier.
As you say, the tail-caller has full information about incoming
and outgoing argument lists, so maybe he is the right guy to be
responsible for pushing any adapters.

In my experience with Scheme, a very common case is that the
tail-caller's outgoing arguments are the same size as the incoming.
(E.g., let-loop.)  Another very common case is that the tail-caller's
outgoing arguments are fewer than the number of fixed argument
registers (e.g., 6 on SPARC).

> my idea (well actually Christian suggested using adapter frames - my
> idea was way more complicated - resizing outgoing areas during
> runtime, remember my comment about adapter frames in the last email -
> they are a solution to many problems :) to deal with more arguments is
> inserting a dummy frame that contains a big area for outgoing
> arguments in cases where the callee has more arguments. This implies a
> check (does the dummy frame exist and is it big enough) at the callee
> site (another entry point combination or stub) for the general case
> (more arguments).

Yes.  If there's only one dummy frame, you can check the return-pc.
It doesn't need to be very large at first, but it does need to be
arbitrarily expandable.  (Or, the dummy frame could be some sort of
specialized interpreter frame.  Those are resizable.)

> e.g. if we have a fun(arg1, arg2) tail calling a fun2(arg1, arg2, arg3,
> arg) then before control is handed over to fun2 a dummy/adapter frame
> is created where the arguments can go. that is if there isn't already
> a dummy frame (because maybe fun was also tail called).

Agreed.

> So the original idea was (not taking the PD problem into account) to
> have two code paths:
>
> - fast case: callee has less or the same number of arguments, eagerly
>   pop the stack and jump to callee (in case of virt/interface calls
>   jump to transition stub)

Or callee has N or fewer arguments (N = minimum args in calling seq).

> - slower case: callee has more arguments, jump to a point in callee
>   (special entry point/stub) that checks whether the dummy frame
>   already exists and has an appropriate size (if not it puts it there),
>   here we would use the lazy popping variant and create the adapter
>   frame if needed.

I like it!
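
[Editor's sketch, not part of the original thread.] The two code paths
above can be written down as a small decision function. This is Java for
illustration only: the names TailCallPlan, Strategy and choose, and the
parameter minArgSlots (standing in for the N mentioned above), are
hypothetical and are not HotSpot code. In a real VM the choice would
presumably be made by the compiler or the call-site linker.

```java
// Hypothetical illustration of the fast/slow split; not HotSpot code.
final class TailCallPlan {
    enum Strategy { EAGER_POP, LAZY_POP_WITH_ADAPTER }

    /**
     * callerArgs:  argument slots in the tail-caller's own incoming area
     * calleeArgs:  argument slots the tail-callee needs
     * minArgSlots: slots every calling sequence reserves anyway (N)
     */
    static Strategy choose(int callerArgs, int calleeArgs, int minArgSlots) {
        // Fast case: the outgoing arguments fit where the incoming ones
        // were (or inside the always-reserved area), so the caller can
        // take down its own frame and jump.
        if (calleeArgs <= callerArgs || calleeArgs <= minArgSlots) {
            return Strategy.EAGER_POP;
        }
        // Slower case: the callee needs more room, so a dummy/adapter
        // frame has to exist (or be created) before control is handed over.
        return Strategy.LAZY_POP_WITH_ADAPTER;
    }

    public static void main(String[] args) {
        System.out.println(choose(2, 2, 0)); // same size: fast case
        System.out.println(choose(2, 4, 6)); // fits in reserved slots: fast case
        System.out.println(choose(2, 4, 0)); // needs an adapter frame
    }
}
```

The sketch ignores the protection-domain and static/monomorphic/
megamorphic dimensions entirely; it only covers the argument-count test.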
> as you say we would have the following cases:
>
> - PD statically known (maybe improved through a PD analysis - e.g. all
>   currently loaded methods with signature x are in the same PD) to
>   be the same
>
>   1. statically known callee: jump to unverified entry point
>
>   2. monomorphic callee: jump to verified entry point
>
>   3. megamorphic callee: jump to transition stub

This is where CHA could be helpful.  Consider the
no_finalizable_subclasses assertion in dependencies.hpp.  A similar
no_cross_domain_overrides assert could reduce a run-time check on the
actual method being called to a compile-time check on the superclass
(e.g., abstract) method being called.

Your idea about a global invariant on signatures is nice too, but it
doesn't have infrastructure support.  The dependencies stuff is
flexible and powerful.  It is limited to statements quantified over
supertypes (mainly superclasses).

> - PD might be different
>   4. megamorphic callee: jump to vtable/itable stub which checks PD

Good.

> So to summarize we have three (!) dimensions.
>
> arguments (same-less/more), callee target (static, dynamic), PD known
> to be equal or not

Impressive (and a little scary).

These two are both easiest to do first and will go a long way:

> 1a.) eagerly pop the stack, jump to callee
> 2.a) eagerly pop the stack, jump to callee (unverified entry point)

> 1b.) jump to special entry point that pops the stack (lazy), checks for
>      a dummy (adapter) frame, shuffles arguments

This one is less important than one would think, if N above is
reasonable.  The hard case is x86_32.  More below, where you suggest
an approach.

> 6a) always lazily pop the stack since pd checks can cause defeat
> 6b)

These two can be merged for a prototype.

I was going to say that "all real programs" will have megamorphic
calls that fit case 3ab because of CHA or your signature rule.
But a script fragment running in a sandbox might tail-call back and
forth between script code (compiled in a safe PD) and trusted
system-level adapters (in the dynamic language's runtime).  In Scheme,
APPLY is a system-level "adapter" which a script could call from an
untrusted function, passing it another untrusted function.

Eventually it would be good to look at the performance of this pattern;
maybe APPLY (and similar pure combinators) could be marked as trusted,
but as effective call sites to something else (the first argument, in
APPLY's case).  Then the cross-PD check would skip over the combinator
(APPLY) and look directly at the ultimate callee (first argument).  I
don't see a compelling play to make here immediately, though.

> [on x86 we are running out of registers so probably use stack slots]

Yes.  (See above for an approach to make those stack slots available
as part of everybody's calling sequence.)

> some of these cases might be collapsed. as some are more general
> and would work for other cases (while being less efficient of
> course).

Yes.

> int add2(int a, int b, int dummy, int dummy2, int dummy3, ...)
>    return a+b;

This is in fact the varargs save area that most RISCs have.  The place
where the server compiler declares those is varargs_C_out_slots_killed
in the *.ad file.  It makes every caller stay out of the way of the
last N argument slots, whether it uses them or not.

On SPARC and x86_64, that AD file parameter is non-zero, and matched to
other values in the system, notably the interp.  I recommend developing
on x86_64 or SPARC first, since it will let you dodge cases 123b longer.

I think you'll want to put a non-zero value in
varargs_C_out_slots_killed on x86_32.  One tricky part is stretching
outgoing non-compiled argument lists to the minimum length, since the
x86_32 overloads ESP as both the C and Java stack pointers.  But we
have a framework for this sort of thing already.  See gen_i2c_adapter
in sharedRuntime_x86_32.cpp.
You'll probably be spending lots of time with code like that.

> moving of the dummy arguments might be optimized 'away' by the
> compiler which could probably recognize that the arguments (dummy,
> dummy2) have not changed during the function.

Telling the register allocator that they are "killed slots" has the
desired effect.  Every caller just allocates and avoids them.  The
interpreter (and maybe random other callers) needs to take care to
leave those bubbles in the stack when calling compiled code.

> gee - and we haven't talked about what to do in case of interpreted to
> compiled and vice versa :)

The fourth dimension.  :-(  You'll note that every method has a
compiled and an interpreted entry point.  The upside is that the
interpreter stack frames are self-identifying and flexible.

Send along the diffs whenever you've got something you want to show!

BTW, we're working out the process in real-time:
  http://openjdk.java.net/guide/

Best,
-- John

From Jason.Fordham at Sun.COM  Fri Feb 29 16:53:02 2008
From: Jason.Fordham at Sun.COM (Jason Fordham)
Date: Fri, 29 Feb 2008 16:53:02 -0800
Subject: Hello, and other things
Message-ID: <47C8A8EE.8000809@sun.com>

Hi all,

I was drawn into this list by the news today that Jonathan Schwartz
was interested in making the JVM into Just Another VM.

Coincidentally (and this is the spur to my enthusiasm), I started
thinking about targeting GCC for the JVM last week. It quickly became
clear that the JVM instruction set is designed to make the C
programming model difficult: the separation of bytecodes, stacks,
frames, and object space, and the generally unconvertible addressType
quickly led me to a model where the JVM stacks are ignored except for
primitive operations, while memory - for data, bss and heap - is
modeled in a large array.
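
[Editor's sketch, not part of the original mail.] One way to picture the
"memory in a large array" model is a single int[] holding everything a C
program can address, with the C stack pointer kept as an ordinary index.
The class FlatMemory and its methods are hypothetical names chosen for
this illustration, and the sketch makes simplifying assumptions: word
granularity only, no separate heap allocator, and no guard between stack
and data.

```java
// Hypothetical illustration of C memory modeled as one Java array.
final class FlatMemory {
    private final int[] mem;  // data, bss, heap and the C stack share this array
    private int sp;           // C stack pointer: an index into mem, grows downward

    FlatMemory(int words) {
        mem = new int[words];
        sp = words;           // an empty stack starts at the top of memory
    }

    // Plain C-style loads and stores through an integer "address".
    int load(int addr)              { return mem[addr]; }
    void store(int addr, int value) { mem[addr] = value; }

    // Software push/pop: the JVM operand stack is not addressable,
    // so the C stack has to be maintained by hand inside the array.
    void push(int value) {
        if (sp == 0) throw new IllegalStateException("C stack overflow");
        mem[--sp] = value;
    }

    int pop() {
        if (sp == mem.length) throw new IllegalStateException("C stack underflow");
        return mem[sp++];
    }

    int stackPointer() { return sp; }

    public static void main(String[] args) {
        FlatMemory m = new FlatMemory(16);
        m.push(7);                          // a C local variable
        int addrOfLocal = m.stackPointer(); // like taking &local in C
        m.store(addrOfLocal, 9);            // write through the pointer
        System.out.println(m.pop());        // prints 9
    }
}
```

Because locals live in the array rather than in JVM local variable
slots, their addresses are ordinary ints, at the cost of a field read,
index arithmetic and an array access for every push or pop.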
In order to model C's function calls by pointer, I figured on a handle
pair, class and method, hashing the strings, with a linking stage after
compilation to perform fixup - much as I imagine slide 17 in the
LangNet presentation implies.

The key obstacle I see is that the instruction set makes implementing
a C-like stack expensive: there are no neat push and pop operations for
this memory model; it feels like microcoding. I do understand the
motivation, though, which is to protect the bytecodes from malicious or
lazy use of buffer overflows, and other mechanisms for executing data.

I like the method handle mechanism, for a variety of reasons, and I
would like to see some easing up on where the stack is located so that
operations which index into the stack are more flexible, and fast.

Is this possible?

Jason