RFR(L): 8029075 - String deduplication in G1

Per Liden per.liden at oracle.com
Fri Mar 14 13:19:09 UTC 2014

Hi Coleen,

Inline below...

On 2014-03-14 13:19, Coleen Phillimore wrote:
> Per,  3 quick comments below.  Thank you for the reorganization and 
> additional comments.
> On 3/13/14 11:02 AM, Per Liden wrote:
>> Hi Coleen,
>> Thanks for looking at the patch! Comment below.
>> On 2014-03-12 23:10, Coleen Phillimore wrote:
>>> Hi Per,
>>> Why not add the call for StringDedup::unlink() inside the function 
>>> unlink_string_and_symbol_table rather than just above in two places, 
>>> since they go together.
>>> +  // Delete dead entries in string deduplication queue/table
>>> +  StringDedup::unlink(&GenMarkSweep::is_alive);
>>> +
>>>    // Delete entries for dead interned string and clean up 
>>> unreferenced symbols in symbol table.
>>> G1CollectedHeap::heap()->unlink_string_and_symbol_table(&GenMarkSweep::is_alive); 
>> Yep, I'll move that call into unlink_string_and_symbol_table(). The 
>> unlink_string_and_symbol_table() is a fairly new addition, there used 
>> to be StringTable/SymbolTable calls there when I added the StrDedup 
>> call, that's why it looks like that.
>>> I didn't see anything else specific to change.  Some general 
>>> comments which are similar to Thomas's.  There seems to be a lack of 
>>> comments in the deduplication file.   I hate huge block comments too 
>>> but some comments about what it does in general would be nice. Like 
>>> 3 sentences per class.  The file is overwhelming (ie, I completely 
>>> lose interest)  with the statistics printing at the front.   Maybe 
>>> you could put the most interesting classes at the front of the 
>>> file?   Maybe statistics printing could be a separate file.
>>> One of the conventions that we have is that accessors for classes 
>>> are on one line, so you could make the file smaller by doing that 
>>> reformatting.
>>> It is also a general convention to have comments above the 
>>> declaration of a class to say generally what it's for, when it's 
>>> used, etc.  It's not easy to tell by the name or we could guess 
>>> wrong.   What is StringDedupQueue for, eg?  That would make 
>>> reviewing the code easier.
>> Agree, I'll be adding comments to each class and a general 
>> description of what dedup does.
>>> There seems to be a lot of duplication of code with the hashtable in 
>>> hashtable.hpp, only different.  We have lots of hashtables in 
>>> hotspot.  Did you try to reinstantiate this hashtable and why didn't 
>>> that not work?  There are interesting memory allocation functions 
>>> for the entries in this normal hashtable that might help with 
>>> fragmentation, for instance.
>> Very early on I actually did use an instance of Hashtable and even 
>> played with the idea of having a combined StringTable/DedupTable. I 
>> decided not to use that though, mostly because I felt I would need to 
>> make too many modifications to non-dedup related code and because 
>> Hashtable<> carries concepts, like shared entries, which didn't quite 
>> fit with dedup. Using the existing StringTable would mean it would 
>> have to have logic to manage entries which point to either Strings or 
>> char[], which again seemed like an too intrusive change. The 
>> Hashtable allocation strategy (with allocating blocks of entries) 
>> seems nice from a performance perspective, but freeing those becomes 
>> a bit of a challenge, which I guess is why it never frees any 
>> entries. Since the dedup table is there to in the end save memory, I 
>> felt I needed better control over the table's memory usage. At one 
>> point I actually started writing a WeakOopsHashTable<>, which I hoped 
>> could serve as the base for both StringTable and DedupTable, but 
>> again I felt I was losing focus a bit as it would mean modifying 
>> large parts of non-GC and non-dedeup related stuff. The potential 
>> connection to CDS here also made me a bit scared of going in that 
>> direction, as I don't have a complete understanding of the 
>> implications of that.
> Ok, I'd figured they didn't map that well.   There's a lot of 
> duplication already in symbol and string tables also.
>>> Lastly, why is the hash code in the string written to and used? The 
>>> StringTable uses the same hash algorithm until it has to change the 
>>> hash code, but the new hash code isn't stored in the string because 
>>> it would be incompatible.
>> I think StringTable and the dedup table is doing the same thing here. 
>> The hash code in the String object is only used/updated in the case 
>> we're using the standard/compatible hash function. As soon as the 
>> table is rehashed and switches to another hash function it no longer 
>> updates the hash code in the String. use_java_hash() tells use if 
>> we're using the standard/compatible hash and it's used here:
>>   if (use_java_hash()) {
>>     // Get hash code from cache
>>     hash = java_lang_String::hash(java_string);
>>   }
>> and later here:
>>   if (use_java_hash() && hash != 0) {
>>     // Store hash code in cache
>>     java_lang_String::set_hash(java_string, hash);
>>   }
>> Concerns?
> I have to look again but wanted to answer quickly (forgot to last 
> night).  This is good, but why do we have to update the hash code in 
> the java/lang/String instance at all?

We don't strictly have to, but it mimics the behavior of 
String.hashCode() and it is seems like the natural thing to do to given 
that the hash field in String is a cache for this purpose so that future 
users of String.hashCode() don't have to do the same calculation. Do you 
have a reason in mind why we wouldn't want to write it back?


>>> I would like to see this be a few different files.   It seems like 
>>> this file does three things conceptually, interacts with G1 somehow 
>>> to determine which strings are candidates for deduplication, stores 
>>> these things in a rehashable, resizable hash table and gathers 
>>> statistics.
>>> I'm sorry to have so many comments so late and I haven't really 
>>> tried to read the code for correctness.   I think some reformatting 
>>> and reorganization would make the reviewers job easier.  It's a big 
>>> change and worthwhile.
>> Great, I've no problems splitting this up. I'm more of a "one class 
>> per file" person myself, but I had the (apparently incorrect) 
>> impression that wasn't the hotspot way. I'm thinking I'll split this 
>> up into the following files (while I'm at it I'll also add the G1 
>> prefix as Thomas suggested):
>> g1StringDedup - Main interface for the GC, would contain entry point 
>> and things like the task and closure classes
>> g1StringDedupTable - The table, cache and entry classes
>> g1StringDedupQueue - The queue
>> g1StringDedupStat - The statistics sutff
>> g1StringDedupThread - The dedup thread
> Yes, this would be nice.   symbolTable.cpp should be split up too if 
> that is what gave you the impression.   I will file an RFE to do that, 
> it's been at the back of my mind to do so.
> Thanks,
> Coleen
>> Sounds good?
>> Will send out and updated webrev.
>> Thanks,
>> /Per
>>> Thanks,
>>> Coleen
>>> On 3/12/14 9:34 AM, Per Liden wrote:
>>>> Hi Thomas,
>>>> Thanks for reviewing. Comments inline.
>>>> On 03/11/2014 01:39 PM, Thomas Schatzl wrote:
>>>>> Hi Per,
>>>>> On Fri, 2014-03-07 at 15:13 +0100, Per Liden wrote:
>>>>>> Hi,
>>>>>> Here's an updated webrev based on the feedback from Bengt and 
>>>>>> Thomas.
>>>>>> http://cr.openjdk.java.net/~pliden/8029075/webrev.1/
>>>>> See further below.
>>>>>> And as suggested, two changes in the initial webrev were broken 
>>>>>> out into
>>>>>> separate CRs, available here:
>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036672
>>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036672/webrev.0/
>>>>> Looks good.
>>>>> Did you check the impact on something like specjbb2005 which does
>>>>> nothing but object copying? That code is somewhat performance 
>>>>> sensitive.
>>>> Yes, I did Aurora runs with jbb2005 and jbb2013, didn't see any 
>>>> difference.
>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036673
>>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036673/webrev.0/
>>>>> Looks good.
>>>>> Here are my comments: there are a few minor issues, and a single 
>>>>> major:
>>>>> basically the documentation for this feature is lacking imo. I 
>>>>> listed a
>>>>> few particular issues of mine with it.
>>>>> First the minor issues:
>>>>> - could you check on copyright dates on the modified files? They have
>>>>> not been updated anywhere afaics.
>>>>> - not sure if good, but maybe make PrintStringDeduplicationStatistics
>>>>> diagnostic? It seems to really be some diagnosis functions.
>>>>> - maybe make the new *ALot flags notproduct. These seem to be 
>>>>> debugging
>>>>> helpers similar to other *ALot flags.
>>>> I had a conversion with Bengt about this some time ago and I think 
>>>> the options are categorized correctly. The ALot flags needs be 
>>>> available in product builds because they are used by tests. Most 
>>>> Print*Statistics are product therefore it makes sense to follow 
>>>> that pattern for PrintStringDeduplicationStatistics.
>>>>> - StringDeduplicationAgeThreshold: maybe also needs to mention 
>>>>> that if
>>>>> the object is promoted, that it is always considered as candidate.
>>>> Check.
>>>>> marksweep.inline.hpp
>>>>> - MarkSweep::mark_object(): did you measure impact on other 
>>>>> collectors
>>>>> using this code? It is enabled unconditionally. Also I would prefer
>>>>> that !UseG1GC automatically disables UseStringDeduplication (or use a
>>>>> global variable for that) to avoid the two checks. (The latter 
>>>>> option is
>>>>> probably better).
>>>> Yes, ran jbb2005/2013.
>>>>> symbolTable.cpp:
>>>>> - again both UseG1GC and UseStringDedupliation are paired. Could 
>>>>> this be
>>>>> put in a method/global variable somewhere that is set up correctly at
>>>>> startup?
>>>> I think using the options is much more readable. Adding another 
>>>> state which tells us if G1 & Dedup is enabled means we need a 
>>>> pre-init stage to set up that state, which I don't think is very nice.
>>>>> g1CollectedHeap.cpp:
>>>>> - StringDedup::initialize() is run unconditionally (the check for
>>>>> UseStringDeduplication is inside the method) while it's outside for
>>>>> everything else. Just a note, keep it if wanted.
>>>> They are hidden inside dedup everywhere, except where the 
>>>> performance overhead of a call might be an issue (like mark and copy).
>>>>> TestStringDeduplicationTools.java:
>>>>> - maybe put retrieval of the unsafe object into the test library.
>>>>> Browsing through the code shows quite a lot copy+paste here already.
>>>>> - "qeueue" -> "queue"
>>>> Check.
>>>>> all tests:
>>>>> - missing bug ids
>>>> Check.
>>>>>  From here it is mostly about comments/documentation:
>>>> I agree that adding a feature overview and some additional comments 
>>>> is probably a good idea. However, I also know that Bengt and 
>>>> StefanK aren't too crazy about large comment blobs so I tried to 
>>>> kept the noise down. I also disagree with some of the other 
>>>> comments below which tend to be about personal taste rather than 
>>>> code quality or breaking some coding convention we have, e.g. the 
>>>> static helpers, readability, naming, etc.
>>>> I'll add some comments and make some other changes and send out an 
>>>> updated webrev shortly.
>>>> cheers,
>>>> /Per
>>>>> stringdedup.?pp:
>>>>> - is it possible to put an overview about how the string 
>>>>> deduplication
>>>>> works here? There is absolutely no place in the code where e.g. the
>>>>> string deduplication cycle is explained. There is no comprehensive
>>>>> information about this feature anywhere, just bits about corner 
>>>>> cases or
>>>>> particular here and there, but nothing that can be used as entry 
>>>>> point
>>>>> to understand what this is actually doing.
>>>>> To some extent, the comments seem to be repetitions of what the code
>>>>> already shows. E.g. "some_constant = (1 << 10); // 1024 (must be 
>>>>> power
>>>>> of 2)" or "// Store hash code in cache" followed by
>>>>> "java_lang_String::set_hash(java_string, hash);".
>>>>> There is also no reference to the JEP (I am not sure if it is 
>>>>> customary
>>>>> to put references to outside documents - I would prefer some); 
>>>>> however
>>>>> the JEP does not replace documentation in the source as it is not
>>>>> detailed enough, and the JEP cannot be changed any more. The JEP is
>>>>> interesting for the alternatives considered though.
>>>>> Ideally, to reasonably understand the feature I would like to not 
>>>>> need
>>>>> to spend hours looking through the implementation.
>>>>> - how many threads does this feature add, and are supported by 
>>>>> this code
>>>>> (for which parts)? Synchronization between them?
>>>>> - the comments "For XY" in the StringDedup header file do not give 
>>>>> any
>>>>> hint about what they are doing (not sure if it is interesting as 
>>>>> it is
>>>>> easy to find out the only callers). It seems that this information 
>>>>> has
>>>>> been put at the place where these methods are invoked (or just 
>>>>> somewhere
>>>>> in the cpp file). Imo the methods and members (except trivial) 
>>>>> should be
>>>>> documented in the hpp file in the class body, i.e. in the interface.
>>>>> - there are a few methods (also in the StringDedupTable class) in
>>>>> StringDedup where the results of the methods do not seem obvious and
>>>>> should be commented.
>>>>> - I really dislike the location of the "Candidate selection" 
>>>>> explanation
>>>>> within the cpp file. I would not have found it if I were not 
>>>>> looking for
>>>>> some documentation ;) It would probably fit very well in the above
>>>>> mentioned overview paragraphs.
>>>>> - please add the static helper functions that implement the actual
>>>>> policies (is_candidate_from_mark, is_candidate_from_evacuation) to 
>>>>> the
>>>>> StringDedup class (as private static methods) so that they can be 
>>>>> found
>>>>> easily. Static non-class helper functions in cpp files should only be
>>>>> used for stuff that is really unimportant and not relevant to the 
>>>>> code
>>>>> imo. However these seem to be _the_ central policy methods for string
>>>>> dedup.
>>>>> So it would be nice to have them mentioned in the class/hpp file and
>>>>> documented there.
>>>>> - facts about the code are repeated, e.g. that the table size must 
>>>>> be a
>>>>> power of two. This is nice, but has a few drawbacks: the compiler 
>>>>> does
>>>>> not care about comments at all, so it might be useful to put 
>>>>> asserts in
>>>>> these places instead, and second, it's just duplication of text 
>>>>> with the
>>>>> usual problem of getting out of date.
>>>>> - I am not completely sure why the class has not been called
>>>>> G1StringDedup. It won't work without G1 anyway and calls G1 internals
>>>>> all the time. There does not seem to be a way to make it more generic
>>>>> without considerable refactoring. It is fine to keep the name though,
>>>>> just asking. I also know that this feature does not want to be a 
>>>>> generic
>>>>> solution.
>>>>> - is it possible to separate out the statistics variables of
>>>>> StringDedupTable into either a separate class instead of using 
>>>>> globals
>>>>> or separate them a little from the other unrelated members? (And
>>>>> describe them) This would improve readability.
>>>>> I *think* this is everything from StringDedupTable::_entries_added to
>>>>> _rehash_count? I have no idea without searching for every use and
>>>>> deducing from that.
>>>>> - when implementing the hash table for the StringDedupTable, did you
>>>>> consider trying to reuse one of the existing hashmap implementations?
>>>>> And potentially extend them.
>>>>> Imo we already have way too many ad-hoc implementations of often used
>>>>> data structures (particularly) in G1 and GC code.
>>>>> - please document the properties of the StringDedupTable (type, 
>>>>> hashing,
>>>>> automatic resizing behavior, thread safety, ...).
>>>>> - please explain what StringDedupTable/Cache are and what they are 
>>>>> used
>>>>> for in context of the feature.
>>>>> - stringdedup.cpp, line 1309: candicate -> candidate
>>>>> Thanks,
>>>>> Thomas

More information about the hotspot-dev mailing list