String.indexOf optimization

Zoltan Sziladi kissziszi at
Mon Apr 7 07:56:17 UTC 2014

Thanks everyone for the input. Even though many people who are way
smarter than me already tried to beat the naive implementation, I'll
try too, just for the fun of it. I'll post my updates here if I find
something worthy of mentioning.


On Fri, Apr 4, 2014 at 7:13 PM, Martin Buchholz <martinrb at> wrote:
> Summary:
> Many people (myself included) have looked at this problem.  It's unlikely
> that String.indexOf will change.  It's hard to beat the naive implementation
> in the typical case.  The 64k size of the character set will make
> Boyer-Moore harder to implement efficiently.
> On Fri, Apr 4, 2014 at 7:49 AM, Zoltan Sziladi <kissziszi at> wrote:
>> Hi,
>> I am new to this mailing list so please forgive me if this has been
>> discussed before.
>> I was looking at the implementation of String.indexOf and I see that
>> it uses the O(n^2) naive implementation. I have been trying to find
>> out why it does not use some kind of a modern, sophisticated O(n)
>> algorithm but I have no clear answer as of now.
>> My guess is that the average case should be quite good for this
>> algorithm because in practice the partial matches are actually quite
>> rare, so it should work well... usually. Also, I saw that this code
>> was last touched about 6 years ago, so maybe it was just left like
>> this?
>> My concern is actually the worst case scenario. If we compared two
>> long strings with lots of partial matches, then this would perform
>> quite poorly. Wouldn't it be worth having an O(n) implementation here
>> then? Modern O(n) pattern matching algorithms don't use much extra
>> space either.
>> The Collections.sort method also uses an algorithm that prepares for
>> worst case. Maybe a highly optimized quicksort could outperform the
>> current mergesort implementation but I suppose one of the design
>> principles behind that was also to prepare for the worst case. (Even
>> though a nicely implemented quicksort has an expected average case
>> runtime of O(nlogn) regardless of the input).
>> If anyone knows why it is implemented this way or if it were possible
>> to change the implementation, I'd be happy to hear your opinion.
>> Thanks!
>> Regards,
>> Zoltan

More information about the core-libs-dev mailing list