RFR(L): 8029075 - String deduplication in G1

Bengt Rutisson bengt.rutisson at oracle.com
Mon Mar 17 09:51:32 UTC 2014

Hi Per,

On 2014-03-14 20:07, Per Liden wrote:
> Hi,
> Here's an updated webrev, based on feedback from primarily Thomas and 
> Coleen (Thanks!).
> http://cr.openjdk.java.net/~pliden/8029075/webrev.2/

Looks good to me.


> I hope to have addressed all issues that were raised. The main theme 
> of course is the addition of documentation/comments and the splitting 
> of some classes into separate files. I think I've addressed most, if 
> not all, of the other minor issues Thomas mentioned. The only thing I 
> can think of that wasn't changed was the categorization of the VM 
> flags, which I motivated in a different mail.
> I didn't think it made sense to upload a diff against the previous 
> webrev as that diff turned out to be a lot bigger than real webrev 
> itself. Sorry if this makes reviewing harder.
> cheers,
> /Per
> On 2014-03-14 14:19, Per Liden wrote:
>> Hi Coleen,
>> Inline below...
>> On 2014-03-14 13:19, Coleen Phillimore wrote:
>>> Per,  3 quick comments below.  Thank you for the reorganization and 
>>> additional comments.
>>> On 3/13/14 11:02 AM, Per Liden wrote:
>>>> Hi Coleen,
>>>> Thanks for looking at the patch! Comment below.
>>>> On 2014-03-12 23:10, Coleen Phillimore wrote:
>>>>> Hi Per,
>>>>> Why not add the call for StringDedup::unlink() inside the function 
>>>>> unlink_string_and_symbol_table rather than just above in two 
>>>>> places, since they go together.
>>>>> +  // Delete dead entries in string deduplication queue/table
>>>>> +  StringDedup::unlink(&GenMarkSweep::is_alive);
>>>>> +
>>>>>    // Delete entries for dead interned string and clean up 
>>>>> unreferenced symbols in symbol table.
>>>>> G1CollectedHeap::heap()->unlink_string_and_symbol_table(&GenMarkSweep::is_alive); 
>>>> Yep, I'll move that call into unlink_string_and_symbol_table(). The 
>>>> unlink_string_and_symbol_table() is a fairly new addition, there 
>>>> used to be StringTable/SymbolTable calls there when I added the 
>>>> StrDedup call, that's why it looks like that.
>>>>> I didn't see anything else specific to change.  Some general 
>>>>> comments which are similar to Thomas's.  There seems to be a lack 
>>>>> of comments in the deduplication file.   I hate huge block 
>>>>> comments too but some comments about what it does in general would 
>>>>> be nice. Like 3 sentences per class.  The file is overwhelming 
>>>>> (ie, I completely lose interest)  with the statistics printing at 
>>>>> the front.   Maybe you could put the most interesting classes at 
>>>>> the front of the file?   Maybe statistics printing could be a 
>>>>> separate file.
>>>>> One of the conventions that we have is that accessors for classes 
>>>>> are on one line, so you could make the file smaller by doing that 
>>>>> reformatting.
>>>>> It is also a general convention to have comments above the 
>>>>> declaration of a class to say generally what it's for, when it's 
>>>>> used, etc.  It's not easy to tell by the name or we could guess 
>>>>> wrong.   What is StringDedupQueue for, eg? That would make 
>>>>> reviewing the code easier.
>>>> Agree, I'll be adding comments to each class and a general 
>>>> description of what dedup does.
>>>>> There seems to be a lot of duplication of code with the hashtable 
>>>>> in hashtable.hpp, only different.  We have lots of hashtables in 
>>>>> hotspot.  Did you try to reinstantiate this hashtable and why 
>>>>> didn't that not work?  There are interesting memory allocation 
>>>>> functions for the entries in this normal hashtable that might help 
>>>>> with fragmentation, for instance.
>>>> Very early on I actually did use an instance of Hashtable and even 
>>>> played with the idea of having a combined StringTable/DedupTable. I 
>>>> decided not to use that though, mostly because I felt I would need 
>>>> to make too many modifications to non-dedup related code and 
>>>> because Hashtable<> carries concepts, like shared entries, which 
>>>> didn't quite fit with dedup. Using the existing StringTable would 
>>>> mean it would have to have logic to manage entries which point to 
>>>> either Strings or char[], which again seemed like an too intrusive 
>>>> change. The Hashtable allocation strategy (with allocating blocks 
>>>> of entries) seems nice from a performance perspective, but freeing 
>>>> those becomes a bit of a challenge, which I guess is why it never 
>>>> frees any entries. Since the dedup table is there to in the end 
>>>> save memory, I felt I needed better control over the table's memory 
>>>> usage. At one point I actually started writing a 
>>>> WeakOopsHashTable<>, which I hoped could serve as the base for both 
>>>> StringTable and DedupTable, but again I felt I was losing focus a 
>>>> bit as it would mean modifying large parts of non-GC and non-dedeup 
>>>> related stuff. The potential connection to CDS here also made me a 
>>>> bit scared of going in that direction, as I don't have a complete 
>>>> understanding of the implications of that.
>>> Ok, I'd figured they didn't map that well.   There's a lot of 
>>> duplication already in symbol and string tables also.
>>>>> Lastly, why is the hash code in the string written to and used? 
>>>>> The StringTable uses the same hash algorithm until it has to 
>>>>> change the hash code, but the new hash code isn't stored in the 
>>>>> string because it would be incompatible.
>>>> I think StringTable and the dedup table is doing the same thing 
>>>> here. The hash code in the String object is only used/updated in 
>>>> the case we're using the standard/compatible hash function. As soon 
>>>> as the table is rehashed and switches to another hash function it 
>>>> no longer updates the hash code in the String. use_java_hash() 
>>>> tells use if we're using the standard/compatible hash and it's used 
>>>> here:
>>>>   if (use_java_hash()) {
>>>>     // Get hash code from cache
>>>>     hash = java_lang_String::hash(java_string);
>>>>   }
>>>> and later here:
>>>>   if (use_java_hash() && hash != 0) {
>>>>     // Store hash code in cache
>>>>     java_lang_String::set_hash(java_string, hash);
>>>>   }
>>>> Concerns?
>>> I have to look again but wanted to answer quickly (forgot to last 
>>> night).  This is good, but why do we have to update the hash code in 
>>> the java/lang/String instance at all?
>> We don't strictly have to, but it mimics the behavior of 
>> String.hashCode() and it is seems like the natural thing to do to 
>> given that the hash field in String is a cache for this purpose so 
>> that future users of String.hashCode() don't have to do the same 
>> calculation. Do you have a reason in mind why we wouldn't want to 
>> write it back?
>> cheers,
>> /Per
>>>>> I would like to see this be a few different files.   It seems like 
>>>>> this file does three things conceptually, interacts with G1 
>>>>> somehow to determine which strings are candidates for 
>>>>> deduplication, stores these things in a rehashable, resizable hash 
>>>>> table and gathers statistics.
>>>>> I'm sorry to have so many comments so late and I haven't really 
>>>>> tried to read the code for correctness.   I think some 
>>>>> reformatting and reorganization would make the reviewers job 
>>>>> easier.  It's a big change and worthwhile.
>>>> Great, I've no problems splitting this up. I'm more of a "one class 
>>>> per file" person myself, but I had the (apparently incorrect) 
>>>> impression that wasn't the hotspot way. I'm thinking I'll split 
>>>> this up into the following files (while I'm at it I'll also add the 
>>>> G1 prefix as Thomas suggested):
>>>> g1StringDedup - Main interface for the GC, would contain entry 
>>>> point and things like the task and closure classes
>>>> g1StringDedupTable - The table, cache and entry classes
>>>> g1StringDedupQueue - The queue
>>>> g1StringDedupStat - The statistics sutff
>>>> g1StringDedupThread - The dedup thread
>>> Yes, this would be nice.   symbolTable.cpp should be split up too if 
>>> that is what gave you the impression.   I will file an RFE to do 
>>> that, it's been at the back of my mind to do so.
>>> Thanks,
>>> Coleen
>>>> Sounds good?
>>>> Will send out and updated webrev.
>>>> Thanks,
>>>> /Per
>>>>> Thanks,
>>>>> Coleen
>>>>> On 3/12/14 9:34 AM, Per Liden wrote:
>>>>>> Hi Thomas,
>>>>>> Thanks for reviewing. Comments inline.
>>>>>> On 03/11/2014 01:39 PM, Thomas Schatzl wrote:
>>>>>>> Hi Per,
>>>>>>> On Fri, 2014-03-07 at 15:13 +0100, Per Liden wrote:
>>>>>>>> Hi,
>>>>>>>> Here's an updated webrev based on the feedback from Bengt and 
>>>>>>>> Thomas.
>>>>>>>> http://cr.openjdk.java.net/~pliden/8029075/webrev.1/
>>>>>>> See further below.
>>>>>>>> And as suggested, two changes in the initial webrev were broken 
>>>>>>>> out into
>>>>>>>> separate CRs, available here:
>>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036672
>>>>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036672/webrev.0/
>>>>>>> Looks good.
>>>>>>> Did you check the impact on something like specjbb2005 which does
>>>>>>> nothing but object copying? That code is somewhat performance 
>>>>>>> sensitive.
>>>>>> Yes, I did Aurora runs with jbb2005 and jbb2013, didn't see any 
>>>>>> difference.
>>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036673
>>>>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036673/webrev.0/
>>>>>>> Looks good.
>>>>>>> Here are my comments: there are a few minor issues, and a single 
>>>>>>> major:
>>>>>>> basically the documentation for this feature is lacking imo. I 
>>>>>>> listed a
>>>>>>> few particular issues of mine with it.
>>>>>>> First the minor issues:
>>>>>>> - could you check on copyright dates on the modified files? They 
>>>>>>> have
>>>>>>> not been updated anywhere afaics.
>>>>>>> - not sure if good, but maybe make 
>>>>>>> PrintStringDeduplicationStatistics
>>>>>>> diagnostic? It seems to really be some diagnosis functions.
>>>>>>> - maybe make the new *ALot flags notproduct. These seem to be 
>>>>>>> debugging
>>>>>>> helpers similar to other *ALot flags.
>>>>>> I had a conversion with Bengt about this some time ago and I 
>>>>>> think the options are categorized correctly. The ALot flags needs 
>>>>>> be available in product builds because they are used by tests. 
>>>>>> Most Print*Statistics are product therefore it makes sense to 
>>>>>> follow that pattern for PrintStringDeduplicationStatistics.
>>>>>>> - StringDeduplicationAgeThreshold: maybe also needs to mention 
>>>>>>> that if
>>>>>>> the object is promoted, that it is always considered as candidate.
>>>>>> Check.
>>>>>>> marksweep.inline.hpp
>>>>>>> - MarkSweep::mark_object(): did you measure impact on other 
>>>>>>> collectors
>>>>>>> using this code? It is enabled unconditionally. Also I would prefer
>>>>>>> that !UseG1GC automatically disables UseStringDeduplication (or 
>>>>>>> use a
>>>>>>> global variable for that) to avoid the two checks. (The latter 
>>>>>>> option is
>>>>>>> probably better).
>>>>>> Yes, ran jbb2005/2013.
>>>>>>> symbolTable.cpp:
>>>>>>> - again both UseG1GC and UseStringDedupliation are paired. Could 
>>>>>>> this be
>>>>>>> put in a method/global variable somewhere that is set up 
>>>>>>> correctly at
>>>>>>> startup?
>>>>>> I think using the options is much more readable. Adding another 
>>>>>> state which tells us if G1 & Dedup is enabled means we need a 
>>>>>> pre-init stage to set up that state, which I don't think is very 
>>>>>> nice.
>>>>>>> g1CollectedHeap.cpp:
>>>>>>> - StringDedup::initialize() is run unconditionally (the check for
>>>>>>> UseStringDeduplication is inside the method) while it's outside for
>>>>>>> everything else. Just a note, keep it if wanted.
>>>>>> They are hidden inside dedup everywhere, except where the 
>>>>>> performance overhead of a call might be an issue (like mark and 
>>>>>> copy).
>>>>>>> TestStringDeduplicationTools.java:
>>>>>>> - maybe put retrieval of the unsafe object into the test library.
>>>>>>> Browsing through the code shows quite a lot copy+paste here 
>>>>>>> already.
>>>>>>> - "qeueue" -> "queue"
>>>>>> Check.
>>>>>>> all tests:
>>>>>>> - missing bug ids
>>>>>> Check.
>>>>>>>  From here it is mostly about comments/documentation:
>>>>>> I agree that adding a feature overview and some additional 
>>>>>> comments is probably a good idea. However, I also know that Bengt 
>>>>>> and StefanK aren't too crazy about large comment blobs so I tried 
>>>>>> to kept the noise down. I also disagree with some of the other 
>>>>>> comments below which tend to be about personal taste rather than 
>>>>>> code quality or breaking some coding convention we have, e.g. the 
>>>>>> static helpers, readability, naming, etc.
>>>>>> I'll add some comments and make some other changes and send out 
>>>>>> an updated webrev shortly.
>>>>>> cheers,
>>>>>> /Per
>>>>>>> stringdedup.?pp:
>>>>>>> - is it possible to put an overview about how the string 
>>>>>>> deduplication
>>>>>>> works here? There is absolutely no place in the code where e.g. the
>>>>>>> string deduplication cycle is explained. There is no comprehensive
>>>>>>> information about this feature anywhere, just bits about corner 
>>>>>>> cases or
>>>>>>> particular here and there, but nothing that can be used as entry 
>>>>>>> point
>>>>>>> to understand what this is actually doing.
>>>>>>> To some extent, the comments seem to be repetitions of what the 
>>>>>>> code
>>>>>>> already shows. E.g. "some_constant = (1 << 10); // 1024 (must be 
>>>>>>> power
>>>>>>> of 2)" or "// Store hash code in cache" followed by
>>>>>>> "java_lang_String::set_hash(java_string, hash);".
>>>>>>> There is also no reference to the JEP (I am not sure if it is 
>>>>>>> customary
>>>>>>> to put references to outside documents - I would prefer some); 
>>>>>>> however
>>>>>>> the JEP does not replace documentation in the source as it is not
>>>>>>> detailed enough, and the JEP cannot be changed any more. The JEP is
>>>>>>> interesting for the alternatives considered though.
>>>>>>> Ideally, to reasonably understand the feature I would like to 
>>>>>>> not need
>>>>>>> to spend hours looking through the implementation.
>>>>>>> - how many threads does this feature add, and are supported by 
>>>>>>> this code
>>>>>>> (for which parts)? Synchronization between them?
>>>>>>> - the comments "For XY" in the StringDedup header file do not 
>>>>>>> give any
>>>>>>> hint about what they are doing (not sure if it is interesting as 
>>>>>>> it is
>>>>>>> easy to find out the only callers). It seems that this 
>>>>>>> information has
>>>>>>> been put at the place where these methods are invoked (or just 
>>>>>>> somewhere
>>>>>>> in the cpp file). Imo the methods and members (except trivial) 
>>>>>>> should be
>>>>>>> documented in the hpp file in the class body, i.e. in the 
>>>>>>> interface.
>>>>>>> - there are a few methods (also in the StringDedupTable class) in
>>>>>>> StringDedup where the results of the methods do not seem obvious 
>>>>>>> and
>>>>>>> should be commented.
>>>>>>> - I really dislike the location of the "Candidate selection" 
>>>>>>> explanation
>>>>>>> within the cpp file. I would not have found it if I were not 
>>>>>>> looking for
>>>>>>> some documentation ;) It would probably fit very well in the above
>>>>>>> mentioned overview paragraphs.
>>>>>>> - please add the static helper functions that implement the actual
>>>>>>> policies (is_candidate_from_mark, is_candidate_from_evacuation) 
>>>>>>> to the
>>>>>>> StringDedup class (as private static methods) so that they can 
>>>>>>> be found
>>>>>>> easily. Static non-class helper functions in cpp files should 
>>>>>>> only be
>>>>>>> used for stuff that is really unimportant and not relevant to 
>>>>>>> the code
>>>>>>> imo. However these seem to be _the_ central policy methods for 
>>>>>>> string
>>>>>>> dedup.
>>>>>>> So it would be nice to have them mentioned in the class/hpp file 
>>>>>>> and
>>>>>>> documented there.
>>>>>>> - facts about the code are repeated, e.g. that the table size 
>>>>>>> must be a
>>>>>>> power of two. This is nice, but has a few drawbacks: the 
>>>>>>> compiler does
>>>>>>> not care about comments at all, so it might be useful to put 
>>>>>>> asserts in
>>>>>>> these places instead, and second, it's just duplication of text 
>>>>>>> with the
>>>>>>> usual problem of getting out of date.
>>>>>>> - I am not completely sure why the class has not been called
>>>>>>> G1StringDedup. It won't work without G1 anyway and calls G1 
>>>>>>> internals
>>>>>>> all the time. There does not seem to be a way to make it more 
>>>>>>> generic
>>>>>>> without considerable refactoring. It is fine to keep the name 
>>>>>>> though,
>>>>>>> just asking. I also know that this feature does not want to be a 
>>>>>>> generic
>>>>>>> solution.
>>>>>>> - is it possible to separate out the statistics variables of
>>>>>>> StringDedupTable into either a separate class instead of using 
>>>>>>> globals
>>>>>>> or separate them a little from the other unrelated members? (And
>>>>>>> describe them) This would improve readability.
>>>>>>> I *think* this is everything from 
>>>>>>> StringDedupTable::_entries_added to
>>>>>>> _rehash_count? I have no idea without searching for every use and
>>>>>>> deducing from that.
>>>>>>> - when implementing the hash table for the StringDedupTable, did 
>>>>>>> you
>>>>>>> consider trying to reuse one of the existing hashmap 
>>>>>>> implementations?
>>>>>>> And potentially extend them.
>>>>>>> Imo we already have way too many ad-hoc implementations of often 
>>>>>>> used
>>>>>>> data structures (particularly) in G1 and GC code.
>>>>>>> - please document the properties of the StringDedupTable (type, 
>>>>>>> hashing,
>>>>>>> automatic resizing behavior, thread safety, ...).
>>>>>>> - please explain what StringDedupTable/Cache are and what they 
>>>>>>> are used
>>>>>>> for in context of the feature.
>>>>>>> - stringdedup.cpp, line 1309: candicate -> candidate
>>>>>>> Thanks,
>>>>>>> Thomas

More information about the hotspot-dev mailing list