6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

naoto.sato at oracle.com
Tue Feb 11 22:20:15 UTC 2020


Can the code in Manifest.java be simplified using regex? E.g.,

var gc = Pattern.compile("\\X");
	// print till 72 bytes in UTF-8
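
To make the idea a bit more concrete, here is a minimal sketch of what such
a loop could look like. It is not the code in Manifest.java; the class name
GraphemeFold and the method fold are made up for illustration. It assumes
Java 9 or later, where \X matches an extended grapheme cluster, and it
assumes that a cluster longer than a whole line is simply written unbroken.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK or of the proposed patch.
class GraphemeFold {
    // \X matches one extended grapheme cluster.
    private static final Pattern GRAPHEME = Pattern.compile("\\X");

    /** Folds a complete "Name: value" header into lines of at most maxBytes UTF-8 bytes. */
    static String fold(String header, int maxBytes) {
        StringBuilder out = new StringBuilder();
        int lineBytes = 0;      // bytes on the current line, including a continuation space
        boolean empty = true;   // true while the current line holds no cluster yet
        Matcher m = GRAPHEME.matcher(header);
        while (m.find()) {
            String cluster = m.group();
            int n = cluster.getBytes(StandardCharsets.UTF_8).length;
            // Break before a cluster that would overflow the line. A cluster that
            // does not fit on an empty line is written anyway (assumption of this sketch).
            if (!empty && lineBytes + n > maxBytes) {
                out.append("\r\n ");
                lineBytes = 1;  // the continuation space counts toward the limit
                empty = true;
            }
            out.append(cluster);
            lineBytes += n;
            empty = false;
        }
        return out.append("\r\n").toString();
    }
}
```

For example, GraphemeFold.fold("Some-Key: " + " ".repeat(54) +
"Angstro\u0308m", 72) would break before the "o" so that the diaeresis
stays with its base letter.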

Just a thought. If BreakIterator is preferred, it should take Locale.US 
as an argument to the factory method, so that it would produce the same 
result no matter what the default locale is.
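
A minimal sketch of the fixed-locale variant, for illustration only. The
class name CharacterBoundaries and the method split are hypothetical;
BreakIterator.getCharacterInstance(Locale) is the existing factory method.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not taken from the webrev.
class CharacterBoundaries {
    /** Splits text into user-perceived characters using a fixed locale. */
    static List<String> split(String text) {
        // Locale.US pins the rules instead of relying on Locale.getDefault().
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.US);
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }
}
```

With this, "Angstro\u0308m" should come back as eight pieces, with the "o"
and the combining diaeresis kept together in one of them.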


On 2/10/20 1:22 PM, Lance Andersen wrote:
> Hi all,
> Here is a webrev for the patch that Philipp is proposing, which will make it easier to review: http://cr.openjdk.java.net/~lancea/6202130/webrev.00
>> On Dec 26, 2019, at 11:50 AM, Philipp Kunz <philipp.kunz at paratix.ch> wrote:
>> Hi,
>> The specification says that a line break in a manifest can occur before or
>> after a Unicode character encoded in UTF-8.
>>>    ...
>>>      value:         SPACE *otherchar newline *continuation
>>>      continuation:  SPACE *otherchar newline
>>>    ...
>>>      otherchar:     any UTF-8 character except NUL, CR and LF
>> The current implementation breaks manifest lines at 72 bytes regardless
>> of how the bytes around the break are part of a sequence of bytes
>> encoding a character. Code points may use up to four bytes when encoded
>> in UTF-8. Manifests with line breaks inside of sequences of bytes
>> encoding Unicode characters in UTF-8 with more than one byte not only
>> are invalid UTF-8 but also look ugly in text editors. For example, a
>> manifest could be written like this:
>> import java.util.jar.Manifest;
>> import java.util.jar.Attributes;
>> import static java.util.jar.Attributes.Name;
>>
>> public class CharacterBrokenDemo1 {
>>     public static void main(String[] args) throws Exception {
>>         Manifest mf = new Manifest();
>>         Attributes attrs = mf.getMainAttributes();
>>         attrs.put(Name.MANIFEST_VERSION, "1.0");
>>         attrs.put(new Name("Some-Key"),
>>                   "Some languages have decorated characters, " +
>>                   "for example: español"); // or "espa\u00F1ol"
>>         mf.write(System.out);
>>     }
>> }
>> The above code produces a result as follows, with some unexpected
>> question marks where the encoding is invalid:
>>>    Manifest-Version: 1.0
>>>    Some-Key: Some languages have decorated characters, for example: espa?
>>>     ?ol
>> This is of course an example written with actual question marks to get
>> a valid text for this message. The trick here is that "Some-Key" to
>> "example: espa" amounts to exactly one byte less encoded in UTF-8 than
>> would fit on one line with the 72 byte limit so that the subsequent
>> character encoded with two bytes gets broken inside of the sequence of
>> two bytes for this character across a continuation line break.
>> When decoding the resulting bytes from UTF-8 as one whole string, the
>> two question marks will not fit together again even if the line break
>> with the continuation space is removed. However, Manifest::read removes
>> the continuation line breaks ("\r\n ") before decoding the manifest
>> header value from UTF-8 and hence can reproduce the original value.
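
The difference between the two orderings can be sketched as follows. This
is only an illustration of the point above, not the code of Manifest::read;
the class ContinuationDecoding and both method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper contrasting the two orders of operation.
class ContinuationDecoding {
    /** Removes every CR LF SPACE sequence at the byte level, then decodes as UTF-8. */
    static String joinThenDecode(byte[] raw) {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length; i++) {
            if (i + 2 < raw.length && raw[i] == '\r' && raw[i + 1] == '\n' && raw[i + 2] == ' ') {
                i += 2; // skip the continuation line break
            } else {
                joined.write(raw[i]);
            }
        }
        return new String(joined.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Decodes as UTF-8 first, then removes "\r\n " from the decoded text. */
    static String decodeThenJoin(byte[] raw) {
        // Each half of a split multi-byte sequence decodes to U+FFFD on its own.
        return new String(raw, StandardCharsets.UTF_8).replace("\r\n ", "");
    }
}
```

For the bytes of "español" with a continuation break between the two bytes
of the ñ, joinThenDecode restores the original value, while decodeThenJoin
leaves two replacement characters behind.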
>> Characters encoded in UTF-8 can not only span up to four bytes for one
>> code point each, there are also combining characters or classes thereof
>> or combining diacritical marks or whatever the appropriate term could
>> be, that combine more than one code point into what is usually
>> experienced and referred to as a character.
>> The term character really gets ambiguous at this point. I wouldn't know
>> what the specification actually refers to with that term "character". So
>> rather than diving too much into specification or any sort of theory,
>> let's look at another example:
>> import java.util.jar.Manifest;
>> import java.util.jar.Attributes;
>> import static java.util.jar.Attributes.Name;
>>
>> public class DemoCharacterBroken2 {
>>     public static void main(String[] args) throws Exception {
>>         Manifest mf = new Manifest();
>>         Attributes attrs = mf.getMainAttributes();
>>         attrs.put(Name.MANIFEST_VERSION, "1.0");
>>         attrs.put(new Name("Some-Key"), " ".repeat(53) + "Angstro\u0308m");
>>         mf.write(System.out);
>>     }
>> }
>> which produces console output as follows:
>>>    Manifest-Version: 1.0
>>>    Some-Key:                                                       Angstro
>>>     ̈m
>> (In case this does not display well, the diaeresis is on the m on the
>> last line)
>> When the whole Manifest is decoded from UTF-8 as one big single string
>> and continuation line breaks are not removed until after UTF-8 decoding
>> the whole manifest, the diaeresis (umlaut, two points above, U+0308)
>> apparently kind of jumps onto the following letter m because somehow it
>> cannot be combined with the preceding space. The UTF-8 decoder (all of
>> my editors I tried, not only Eclipse and its console view, also less,
>> gedit, cat and terminal) somehow tries to fix that but the diaeresis may
>> not necessarily jump back on the "o" where it originally belonged by
>> removing the continuation line break ("\r\n ") after UTF-8 decoding has
>> taken place, at least that did not work for me.
>> Hence, ideally combining diacritical marks should not be
>> separated from whatever they combine with when breaking manifest lines
>> onto a continuation line. Such combinations, however, seem to be
>> unlimited in terms of number of code points combining into the same
>> "experienced" character. I was able to find combinations that not only
>> exceed the limit of 72 bytes per line but also exceed the line buffer
>> size of 512 bytes in Manifest::read. These may be rather uncommon but
>> still possible, to my own surprise.
>> The next consideration would then be to remove that limit of 512 bytes
>> per manifest line but exceeding it would make such manifests incompatible
>> with previous Manifest::read implementations and is not really an
>> immediately available option at the moment.
>> As a compromise, those characters including combining diacritical marks
>> which combine only so many code points that their encoded form in UTF-8
>> remains within a limit of 71 bytes can be written without an
>> interrupting continuation line break, which applies to most cases, but
>> not all. I guess this should suit the values to be expected in practice
>> well.
>> Another possibility would be to allow for characters that are
>> combinations of multiple Unicode code points to be kept together in
>> their encoded form in UTF-8 up to the 512 bytes line length limit when
>> reading minus a space and a line break, amounting to 509 bytes, but that
>> would still not make manifests be represented as valid Unicode in all
>> corner cases and I guess would probably not make a real improvement in
>> practice over a limit of 71 bytes.
>> Attached is a patch that tries to implement what was described above
>> using a BreakIterator. While it works from a functional point of view,
>> this might be less desirable performance-wise. Alternatively, one could
>> consider doing without the BreakIterator and only keeping Unicode code
>> points together by not placing line breaks before a continuation byte,
>> which however would not address combining diacritical marks as in the
>> second example above.
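
The simpler alternative mentioned above, which keeps code points together
without a BreakIterator, could be sketched roughly like this. It is an
illustration under the assumption that the value is already available as
valid UTF-8 bytes; CodePointBoundary, isContinuation and breakPosition are
hypothetical names, not taken from the patch.

```java
// Hypothetical helper, not taken from the attached patch.
class CodePointBoundary {
    /** True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /**
     * Returns the exclusive end index of a line that starts at start and holds
     * at most limit bytes, moved back so that no code point is split.
     */
    static int breakPosition(byte[] utf8, int start, int limit) {
        int pos = Math.min(start + limit, utf8.length);
        // If the byte after the break is a continuation byte, the break would
        // fall inside a multi-byte sequence, so back up to its lead byte.
        while (pos > start && pos < utf8.length && isContinuation(utf8[pos])) {
            pos--;
        }
        return pos;
    }
}
```

As noted, this only protects single code points. A base letter followed by
U+0308 would still be separated, because the combining mark starts with a
lead byte of its own.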
>> The jar file specification does not explicitly state that manifests
>> should be valid UTF-8, and they were not always, but it also does not
>> state otherwise, leaving an impression that manifests could be
>> (mis)taken for UTF-8 encoded strings, which they also are in many or
>> most cases and which has been confused many times. At the moment, the
>> only case where a valid manifest is not also a valid UTF-8 encoded
>> string is when a sequence of bytes encoding the same character happens
>> to be interrupted with a continuation line break. To the best of my
>> knowledge, all other valid manifests are also valid UTF-8 encoded
>> strings.
>> It would be nice if manifests could be viewed and manipulated with all
>> Unicode capable editors and not only be parsed correctly with
>> Manifest::read.
>> Any opinions? Would someone sponsor this patch?
>> Regards,
>> Philipp
>> https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specification
>> https://bugs.openjdk.java.net/browse/JDK-6202130
>> https://bugs.openjdk.java.net/browse/JDK-6443578
>> https://github.com/gradle/gradle/issues/5225
>> https://bugs.openjdk.java.net/browse/JDK-8202525
>> https://en.wikipedia.org/wiki/Combining_character
>> <6202130-manifestutf8linebreak.patch>
> Lance Andersen | Principal Member of Technical Staff | +1.781.442.2037
> Oracle Java Engineering
> 1 Network Drive
> Burlington, MA 01803
> Lance.Andersen at oracle.com
