RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Remi Forax forax at univ-mlv.fr
Mon Apr 30 22:23:17 UTC 2018

----- Original Message -----
> From: "Paul Sandoz" <paul.sandoz at oracle.com>
> To: "Alan Bateman" <Alan.Bateman at oracle.com>
> Cc: "nio-dev" <nio-dev at openjdk.java.net>, "core-libs-dev" <core-libs-dev at openjdk.java.net>
> Sent: Monday, April 30, 2018 20:47:06
> Subject: Re: RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

>> On Apr 27, 2018, at 4:30 AM, Alan Bateman <Alan.Bateman at oracle.com> wrote:
>> On 27/04/2018 05:51, Joe Wang wrote:
>>> Hi,
>>> Considering extending isSameFile to add isSameContent to Files. Please review.
>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8202285
>>> webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/
>>> specdiff:
>>> http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html
>> I assume we should ignore the implementation for now as the eventual
>> implementation won't use readAllBytes (at least not for large files).
> Yes, as long as we don’t forget to follow up on a replacement (using memory
> mapped files say).
>> The existing isSameFile is specified as "Tests if two paths locate the same
>> file" and it would be good if the new method could be somewhat consistent with
>> that, e.g. "Tests if the content of two files is identical".
>> Specifying that two paths that locate the same file always return true is
>> reasonable. This could be made clearer by saying that the method always returns
>> true when path and path2 are equal, even if the file does not exist.
>> The @return should say that it returns true if path and path2 locate the same
>> file or the content of both files is identical.
>> The javadoc for SecurityException has "to the file", I assume this should be "to
>> both files".
> We might also want to say the contents of the two files are assumed to be held
> constant during the operation.
> It’s tempting (well to me at least) to generalize to a mismatch method (like for
> arrays) returning the mismatching location in bytes, then you can determine if
> one file is a prefix of another given the files sizes. Bound accepting methods
> would also be useful to mismatch on partial content (including within the same
> file). If you use memory mapped files we can use direct byte buffers to
> efficiently perform the mismatch.

I'm not sure memory mapping is a good idea: Windows is notoriously bad at memory mapping small files, and if the files are big, see your own comment below.
But an implementation that reads byte buffers and compares them will be more efficient.
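As a rough sketch of that idea (not the webrev's implementation), the comparison can be done chunk by chunk and, following Paul's mismatch suggestion, return the position of the first differing byte. The class name FileMismatch, the method name mismatch and the 8192-byte buffer size are assumptions made for illustration; it uses plain byte arrays read from input streams rather than ByteBuffers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of java.nio.file.Files.
final class FileMismatch {
    private static final int BUFFER_SIZE = 8192; // arbitrary chunk size

    /**
     * Returns the position of the first byte that differs between the two
     * files, or -1 if their contents are identical. If one file is a proper
     * prefix of the other, the length of the shorter file is returned.
     */
    public static long mismatch(Path path, Path path2) throws IOException {
        byte[] buf1 = new byte[BUFFER_SIZE];
        byte[] buf2 = new byte[BUFFER_SIZE];
        try (InputStream in1 = Files.newInputStream(path);
             InputStream in2 = Files.newInputStream(path2)) {
            long position = 0;
            while (true) {
                // readNBytes fills the buffer unless end of stream is reached
                int n1 = in1.readNBytes(buf1, 0, BUFFER_SIZE);
                int n2 = in2.readNBytes(buf2, 0, BUFFER_SIZE);
                int i = Arrays.mismatch(buf1, 0, n1, buf2, 0, n2);
                if (i >= 0) {
                    return position + i;
                }
                // No mismatch means n1 == n2; a short read means both files ended
                if (n1 < BUFFER_SIZE) {
                    return -1;
                }
                position += n1;
            }
        }
    }
}
```

An isSameContent-style check is then just mismatch(path, path2) == -1, and only two small buffers are held in memory regardless of file size.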

> To Remi’s point this might dissuade/guide developers from using this method when
> there are other more efficient techniques available when operating at larger
> scales. However, it is unfortunately harder than it should be in Java to hash
> the contents of a file, a byte[] or ByteBuffer, according to some chosen
> algorithm (or a good default).

It's 6 lines of code:

  var digest = MessageDigest.getInstance("SHA1");
  try(var input = Files.newInputStream(Path.of("myfile.txt"));
      var output = new DigestOutputStream(OutputStream.nullOutputStream(), digest)) {
    input.transferTo(output);
  }
  var hash = digest.digest();

or 3 lines if you don't mind loading the whole file in memory:

  var digest = MessageDigest.getInstance("SHA1");
  digest.update(Files.readAllBytes(Path.of("myfile.txt")));
  var hash = digest.digest();
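For completeness, here is a self-contained sketch of how the streaming variant could be wrapped to compare two files by hash. The class FileHashes and its methods hash and sameHash are hypothetical names, and the algorithm is passed in as a parameter rather than fixed to SHA1. Equal hashes only indicate equal content up to the collision resistance of the chosen algorithm.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper wrapping the streaming digest approach.
final class FileHashes {
    /** Streams the file through a MessageDigest without loading it all in memory. */
    public static byte[] hash(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        var digest = MessageDigest.getInstance(algorithm);
        try (var input = Files.newInputStream(file);
             var output = new DigestOutputStream(OutputStream.nullOutputStream(), digest)) {
            input.transferTo(output); // every byte written updates the digest
        }
        return digest.digest();
    }

    /** Returns true if both files produce the same digest for the given algorithm. */
    public static boolean sameHash(Path a, Path b, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        return MessageDigest.isEqual(hash(a, algorithm), hash(b, algorithm));
    }
}
```

Hashing pays off mainly when the digest of one side can be stored and reused; for a one-off comparison of two local files, a direct byte comparison reads the same data and can stop at the first difference.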

> Paul.

