Programming

The string type is broken

My previous article, “We don’t need a string type“, caused a bit of stir. Though the feedback is mixed, there is a common theme of a string being a useful feature. After doing a bit more research I can determine only one thing: most current string types are broken!

Many of us believe our strings are capable of more than what they actually do. We rely on their functionality without actually checking that its valid. This can easily lead to programs which do not work correctly, particularly with respect to internationalization. In most cases it seems we would be better off without a string type.

Evaluate

I looked at how strings behave in a few basic situations. I’ll go over each situation, giving the expected result and some of the actual results. I considered showing a matrix with the results, but since all the tested languages behave so poorly it didn’t seem useful.

noël

Using the text “noël” with a decomposed Unicode string “noe\u0308l”, I checked the following:

1. Does it print correctly? Yes, most languages are capable of doing this. Though the ideone.com interface seems to break the output (so be careful with testing).

2. What is the reverse? “lëon”, correct? Mostly this fails. The most common result is “l̈eon” (the dieresis is on the ‘l’ instead of the ‘e’). This is what happens without a string class, by just reversing an array of code points.

3. What are the first three characters? Mostly the answer here is “noe”, as opposed to the desired “noë”. This could easily lead into a big discussion about what a character is, but I assume most people would not be happy with the current result. This is again indicative of a string type which merely treats the data as an array of code points.

4. What is the length? The common answer is 5. And yet again, this indicates our string types are merely arrays of characters and not truly handling the text.

For all of these questions, try to consider what should happen if you were editing this text in your favourite word processor or text editor. I generally expect that the ‘ë’ character is handled as a single entity. I don’t expect backspace/delete to just remove part of the letter. I expect copying the first three letters to include the accent.

😸😾

It was a bit weird to find out that Unicode has cats in it (I hope you have a font which shows them — if not, the title of this section is a happy cat and a sad cat, part of the Unicode emoticon set). These characters were chosen since they are outside of the BMP (basic multilingual plane). This spells trouble for languages using UTF-16 encodings (Java, C#, JavaScript).

1. Length? Python unicode correctly reports 2. Those UTF-16 languages tend to report 4: the characters require surrogate pairs.

2. Substring after the first character? Python unicode correctly reports the sad cat “😾”. The UTF-16 languages produce invalid strings with a half-surrogate followed by the sad cat.

3. Reverse? Python unicode gets the correct reverse of “😾😸”. The UTF-16 languages produce invalid strings. With C# I think I uncovered a defect in ideone. It doesn’t even show the invalid string and instead shows no output at all for the entire program! [ideone defect]

Languages using an encoding agnostic library, like C++, Perl, and normal Python 2 strings, fail here as well. They ignore any encoding and assume the string is an array of 1-byte code points. Python 3 adopted unicode as the default string type, thus fixing some problems. It appears that Perl also has a ‘utf8’ mode which fixes problems for these cats, but not for the “noël” string.

baffle

This string contains a ligature character, the “ffl” part is a single unicode code point. They exist mainly for compatibility, but they are a good test for case conversion.

1. What is the uppercase? I did not find any language which doesn’t print “BAfflE”. Notice the ligature remains lowercase. The expected answer is of course “BAFFLE”.

Unicode has a special class of case conversion: this single ligature code point is actually converted to three code points. By not following these additional rules, a language uppercase function produces an interesting result: a string converted to uppercase still has lowercase characters in it.

noël again

A final check I did was to compare two logically equivalent strings with different composition forms. Here “noël” is using the precomposed “ë” character.

1. Is precomposed == decomposed? The answer is no in all tests. However, several languages do offer Unicode normalization libraries. In those languages the normal form of the strings does compare equal. JavaScript does not have such a library, which is really tragic because it’s primarily a UI language, exactly where’d you want proper unicode functionality.

It’s tempting to argue that normalization and lexical analysis is not part of the basic string type. But these seem like fundamental operations one would want to do with text. If they aren’t included, what exactly is the purpose of the string type?

It’s broken

I encourage you to run such tests in your favourite language. If you are doing work with international text it is vital that you understand what your ‘string’ type is actually doing. Once you’ve run this you should reconsider what your “string” type is actually doing for you. In my opinion they’re all broken.

I admit the correct answer is not always clear. Text processing is a difficult topic, and at the very minimum we’d have to cover grapheme clusters (some string classes expose functionality relating to this, and Perl even has a GCString class). This is beyond the scope of this article, but very relevant for a good string type.

The point I made in my previous article becomes more poignant. I’d rather have an array of characters than a broken string class. I don’t put any false expectations on an array of characters: the results it produces for the above tests are very logical. Indeed, an array of unicode characters performs better on these tests than many of the specialized string classes.

Categories: Programming

Tagged as: , , , ,

127 replies »

  1. Excellent post. I tried to make this point in my comments on your prior post. I really think that strings should be anything but a dumb array of values. Again, strings (in my perfect little world) should be UTF-8 and immutable with logical operations available to it (such as properly reversing the glyph series rather than stupidly reversing the byte sequence). The ‘char’ type should be a 32-bit value representing an exact code point.

    Yes, I was indeed shocked when I originally learned that the C# string had no safety against surrogate pairs. Like wow… This is partially why I hate the mass adoption of UTF-16 (Java, C#, Qt, etc.). It’s an excuse to revert to lazy array tactics while supporting a decent set of unicode characters.

    Hopefully, I get a working prototype of my dream language out before 2047.

    • I hope this article clarifies my last post a bit as well. I see the value the in having a proper string type, but I have yet to see a proper string type. If I had to choose between any current string type and an array of characters, I’d just choose the array.

    • How could a 32bit char represent composed unicode chars ? My feeling is that this isn’t good enough yet.
      It looks like we need a good string c library that all languages could reuse.
      The question would than be, is this possible wth unicode and the composed characters ?
      I disagree on the concluion that we don’t need any. To me this article demonstrates that we need one and there is some road to go until we have it right.

    • We don’t need a string type as they are implemented now. We do need a string type which properly handles Unicode. Since nothing comes close enough at the moment I am uncertain what the interface to a string would actually look like. Some people need code points, some need grapheme clusters, others actually need encoded bytes.

    • Go (golang.org) has byte arrays, code point arrays (UTF-32) and strings of UTF-8. They all work.
      Go is finishing a normalization library now:
      http://blog.golang.org/normalization
      A recent article on that blog discusses Go strings in general, answering the questions explicitly about what the string class does and does not do.

      No need to wait for 2047.

    • > Some people need code points, some need grapheme clusters, others actually need encoded bytes.

      Most people actually need both (but not all of them already know that *g*)!
      The simplest approach would be to wrap a code point string with a grapheme cluster iterator and then build upon that to provide more complex functionality like regular expression matching and other iterators.
      The iterator approach is a natural choice for splitting grapheme clusters when using the Unicode default algorithms for boundary detection because that algorithm works on a two code points wide sliding window moving from start to end.

    • Look into Plan 9’s UTF8 support. “Strings” are actually linked lists of Runes (a glyph). It’s a pretty neat approach.

  2. Objective-C’s NSString type does correctly upper-case baffle into BAFFLE.

    NSString* str = @”baffle”;
    NSString* upr = [str uppercaseString];

    • I don’t suppose lowercaseString turns BAFFLE back into baffle so that lower(upper(baffle)) == baffle. It’s a problem of natural languages: they are irregular.

    • NSString passes all the tests providing the string returned from precomposedStringWithCanonicalMapping is used.

      NSString *subject = [@”noe\u0308l” precomposedStringWithCanonicalMapping];
      NSLog(@”%@”, subject); //noël
      //NSLog(@”%@”, [subject reveresedString]); //No reverse string method
      NSLog(@”%@”, [subject substringWithRange:NSMakeRange(0,3)]); //noë
      NSLog(@”%@”, @([subject length])); //4

    • So does Python 3 (at least 3.4 beta):

      $ python3
      Python 3.3.0+ (3.3:ad0af795c345, Jan 19 2013, 02:39:16)
      [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux
      Type “help”, “copyright”, “credits” or “license” for more information.
      >>> “baffle”.upper()
      ‘BAFFLE’
      >>>

    • Same with Perl:

      perl -Mutf8 -e ‘print uc(“baffle”)’
      BAFFLE

      using the utf8 module allows it to understand that its source is in UTF8, and is something that should be turned on in most cases I think. It doesn’t help with some of the other cases, e.g. reversing noël still fails quite badly.

  3. This actually shows even more issues. Even though it’s true that normalization can help get rid of many of them, there are still some important ones that can’t be dealt in that way at all.

    1) Some characters can combine with others and can’t be canonicalized into a single one. In some cases you can even literally stockpile multiple characters into one, and a program is expected to be able to handle that despite being an arbitrary amount of codepoints. This effectively makes even UTF-32 a variable-length format.

    2) Characters alone aren’t enough, stuff like the locale also affects operations. The most well-known example is probably the Turkish I issue, where uppercase and lowercase have to change how they work because the characters are different than they would be usually. The worst part? In a real application, you may need the locale to be applied not at a string level, but at a *character* level (somebody may decide to write a portion of the text in a different language, for example).

    3) Related to the above, but characters may also be affected by the surrounding ones in the same way. Not sure to what extent though, maybe it’s just for rendering.

    4) What do you do if the string isn’t normalized? One could attempt to enforce normalization, but then that could be considered a loss of data, right? We don’t want data to get modified unless we’re actively changing it.

    I’m not even sure there’s anything programming languages can do about this, not without making it overly complicated. Sounds like in the end it may be easier for programs to just handle raw bytes on their own, because they’ll still need to be aware of many implementation details anyway.

    I imagine #1 could be solved by having a character to be a collection of codepoints and then a string to be a collection of characters (and this may warrant a dedicated string type as the naive way to do it would be a disaster handling system resources), but #2 requires more data than that, and the rest are heavily tied to the program logic.

    • For point 1, Unicode has this arbitrary limitation that no more than 32 code points may comprise a single running combination sequence. With that in mind all we need is a UTF-1024 encoding! ;)

      All your points are valid. I think this points towards not having a string type in a language. I like the idea of just having an array of code points. It makes it clearer what to expect and allows a library to decide what exactly a string is. This is espeically true since, as you say, much of the string operations are locale dependent — and not the locale of the computer, but of the user (very important distinction for web applications).

    • Where is this arbitrary limitation? As far as I recall, there is no such limitation in general. Are you perhaps talking about the Stream-Safe Text Format (http://unicode.org/reports/tr15/#Stream_Safe_Text_Format)? That’s only one optional specification an implementation can choose to adhere to. It even says “Such extremely long sequences of combining marks are not illegal, even though for all practical purposes they are not meaningful. ” after mentioning an example with 10000 umlauts.

      Still regarding normalisation… Unfortunately, the standard normalisation forms are buggy, and under the current stability policy, cannot be fixed. One example of this that I know is U+387 GREEK ANO TELEIA, which wrongly decomposes canonically (!) into U+00B7 MIDDLE DOT (the Greek name even means literally “upper dot”). This means that some processes may choose to avoid normalisation, because, even the canonical forms risk losing important information.

      All that said, nice post. I was thinking of writing something in the same vein some time ago, but now I don’t need to :P

    • I can no longer find the 32-code point limit, though I’m positive I saw it (I’ve read virtually all the unicode docs recently). It even mentions that 32 is enough for any possible linguistic situation.

    • It’s right there in the Stream-Safe Text Format: “The value of 30 is chosen to be significantly beyond what is required for any linguistic or technical usage. While it would have been feasible to chose a smaller number, this value provides a very wide margin, yet is well within the buffer size limits of practical implementations.”

    • Also note that the SSTF doesn’t really limit a grapheme cluster to 32 code points. It only limits a running sequence of non-starters to 30, but one can produce sequences of arbitrary length in SSTF by introducing CGJs.

    • One thing I was thikning on that would work with a dedicated string type is storing each codepoint as three bytes. The maximum codepoint allowed is U+10FFFD (thank surrogates for this), so only 21 bits are needed. Since three bytes have 24 bits, that means 3 bits to spare for other stuff.

      I imagine then that multiple codepoints combining into a single character could be solved by using one of the bits (e.g. to signal it combines with the previous character), while multiple characters combining into a single grapheme could be solved by using another bit in the same way (and there’s still a bit more to spare). Granted, the compiler or the type would have to handle this under the hood, but it’s doable, and somebody programming in the language shouldn’t care. This still won’t solve locale issues (that probably jumps into library territory), but it will solve the issue about what makes a character, etc.

      I guess there may be performance issues as well, but honestly, if you’re in a performance critical situation you’ll probably want to avoid touching strings whenever you can in the first place.

    • Great comment. Handling lowercase and uppercase transformations seems almost impossible to do correctly. Even with locale aware strings/substrings, it would need to be based on a dictionary in my native tongue (German, de_DE). Example: uppercase(“das maß”) = “DAS MASS”. lowercase(“DIE MASSE”) = “die masse”. I can’t think of an example, but I’m pretty sure there are words that are the same in uppercase but transform differently into lowercase based on the context.

    • The Unicode standard defines, that “ß” does not change when mapping to upper, title, or lower case. Therefore semantic meaning does not change when case mangling “Maße” or “Masse”. A good solution to that problem, i think.

    • Thanks Allan, I didn’t know that. Although I don’t think it’s a good solution, it certainly solves the problem. Also, good choice of problematic words – to think I was so close! :-)

    • What would be the alternative? Lowercase “ß” could map to an uppercase “ss” ligature (“Ss” ligature for titlecase). But that would either introduce a ligature into NFKC or degrade an NFKC string to NFC form! The (Unicode) universe would shatter if things like that would be allowed to happen!
      The right solution would be to define the “ß” as what it really is: A ligature. It would be mapped to “s” + “z” in NFKC.
      That would be the right thing to do in the analogue world too. But there may be other opinions…

  4. It’s not even matter of “international” text. Even English-centric website may need to deal with ligatures, emoji and names and addresses with non-ASCII characters.

  5. Are you sure that the “ffl ligature” thing is a bug? I don’t know what the Unicode standard says, but my first thought was, “The ligature is just for historical compatibility purposes. You should have normalised the string before you manipulated it.”

  6. You need to brush up your perl utf8 skills a little,
    3) also works correctly with perl:

    perl -e ‘use utf8; binmode(STDOUT, “:utf8”); print scalar reverse “😸😾”;’

    Output:
    😾😸

    I think you’ve proven the point that a language should have proper UTF8 support for strings (something that perl excels at).

    • I did say that in utf8 mode perl works for this scenario (my test case works). You have to be sure to enable the utf8 mode, it is not the default. It still fails with the noël scenarios though.

    • I think this might be platform dependent, 16bit or 32bit. It would perform a bit better since it can encode the cats without problem.

    • As mortoray said, it’s platform dependent. On Windows a wchar is 16 bit, on anything else it’s typically 32bit. Besides, wstring doesn’t solve the problem as it’s just a storage container, it knows nothing about the contents of the string, so doesn’t help you with unicode characters. If you’re working with UTF-32 you’re golden, for anything else it doesn’t help.

  7. “I’d rather have an array of characters than a broken string class.”

    False dichotomy. I’d rather have a working string class than either…

  8. Haskell seems to perform well in the ffl-test:
    > Data.Text.toUpper “baffle”
    “BAFFLE”

    Unfortunately not so for the “noël” test.

    • Just to be sure, is the Text string “noël” created using the “noël” form with the composed e or the “noe\u0308l” form with the two separate code points. The developers of Text have done an amazing job making it as complete as possible, but I wouldn’t blame them if things like this slipped through.

    • Is the “noël” using the precomposed or decomposed “noe\u0308l” form? It looks like the former, which is pretty easy to get right. Can you try it again with the dieresis as a separate code point?

    • Nice! How do you get ghci to show Unicode characters and not just printing them on the format “\205”? I have googled and didn’t manage to solve it. (I have the character set encoding set to UTF-8 in Terminal.app.)

  9. What about getting the first three characters of “baffle”? Is “baf” the correct answer?

    • That’s a good question. I suspect “baf” is the correct answer, and I wonder if there is any library that does it. I suspect if you normalize it first (since the ffl would disappear I think).

    • > I suspect if you normalize it first (since the ffl would disappear I think).

      The ligarture disappears in NFK[CD] but not in NF[CD]. Whether normalization to NFK[CD] is a good idea depends (as always) on the situation.
      For visual grapheme cluster counting, one would convert the entire text to NFKC.
      For getting teaser text from an article i would not a normalization step and let a ligature count as just one grapheme cluster even if it may resemble three of them logically. I assume, that articles are stored in NFC (the nondestructive normalization form with smallest memory footprint).

      The Unicode standard does not treat ligatures as containing more than one grapheme cluster for that normalization forms that permits them. So “efflab” is the correct result of reversing “baffle” and “baffle”[2] has to return “ffl” even when working on the grapheme cluster level!

      There may or may not be a need for another grapheme cluster definition that permits splitting of ligatures in NF[CD]. A straight forward way to implement a reverse function adhering to that special definition would NFKC each Unicode grapheme cluster on the fly. When that results in multiple Unicode grapheme clusters, that are used – else the original is preserved (so that “ℕ” does not become “N”).
      The real problem is to find a good name for that special interpretation of a grapheme cluster…

    • It bothers me that even if one implements proper unicode grapheme cluster rules the ligature would not reverse as intended.

      I don’t actually care about reversal, but I take it as an indicator that a reverse iterator is possible. If I were a user and had an “ffl” in my text I’d assume I could backspace individual characters. Though honestly, if I’m editting the document I’d suspect the ligatures to all be removed. They are no longer required for proper rendering.

    • Yes, backspacing of single characters of a ligature (NFKCing the ligature on the fly) would be the right way to do it.

      Ligatures surely are legitimate typographic features. But i think about their inclusion into the standard as an errer regardless. Technically, a ligature is a single grapheme cluster and therefore the standard is formally right in treating it as such. But for text processing almost always treating ligatures as multiple units is what is needed.
      Doing it “right” withoput dropping ligatures altogether (not possible due to stability rules anyway) would mean to add a set of combining versions of the latin letters and: Either extend the NF[CD]-mapping of the legacy ligatures to convert them to the new combining code points (likely not possible due to stability rules too). Or add another normal form that extends NF[CD] to add that mapping.

      But the ligature problem is also one, that even most people using the affected scripts on a daily basis will never see in the wild. One mostly stumbles over it in Unicode test suites (*eg*) and old literature where ligatures most often get converted to single letters by the OCR process.

      Finally, ignoring the split-in-ligature problem could be the best “solution”…

    • Perhaps it’s years of working in a print shop (specifically with typesetting), but I’m perfectly comfortable treating ligatures (such as ‘ffl’) as distinct, single characters. In fact, I would be displeased at the presumption of software that treated that single character as three (‘f’, ‘f’ & ‘l’). I would not expect ligatures to be case-changed unless there were very clear rules about their case.

      In the case of “Baffle”, I would expect a character count of four, I would expect to cursor and backspace the ‘ffl’ as a single unit. I would expect the first three characters to be “Baffl”, the last two to be “ffle”, and I would expect the reverse to be “lfflaB”.

  10. Baffle seems to work in java:

    user=> (count “baffle”)
    4
    user=> (.toUpperCase “baffle”)
    “BAFFLE”

  11. While JavaScript (in browsers) has no way to normalize precomposed/decomposed strings, it has standard methods to correctly compare them:
    https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/localeCompare
    https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Collator

    E.g.:

    var decomp=”noël”;
    var precomp=”noël”;
    console.log(decomp.split(“”));
    console.log(precomp.split(“”));
    console.log(decomp.localeCompare(precomp));

    Prints:
    [“n”, “o”, “e”, “̈”, “l”]
    [“n”, “o”, “ë”, “l”]
    0

    Browser support for this varies. The Intl.Collator interface is currently only supported Chrome (maybe also in Opera? Idk).

    Note: In Chrome when comparing (e.g. sorting) a lot of strings String.prototype.localeCompare is much slower than using a pre-allocated Intl.Collator instance (because internally localeCompare creates a new collator for each call). Using Intl.Collator reduced startup time of my http://greattuneplayer.jit.su/ immensely. Also node.js currently has no support for Intl.*. It probably will be a compile time option for 0.12.

    • For those browsers that do not support the Intl object yet, I’ve implemented the unicode normalization algorithm in pure javascript in a project called iLib, along with a rudimentary collator class. This is available for free on source forge under the Apache2 license. Look for the i18nlib project at http://sourceforge.net/projects/i18nlib/

      It also has iterators for characters, code points, and glyphs (grapheme clusters) which deal with UTF-16 properly.

  12. All this works in Perl. Yes, you do have to use the following pragmas

    use utf8;
    use open qw( :encoding(UTF-8) :std );

    But if you are working in UTF all the time, there are plenty of ways to the effects take place automatically and not have to bother.

    Just for kicks:

    use strict;
    use utf8;
    use open qw( :encoding(UTF-8) :std );

    use charnames qw( :full :short );

    my $word = “no\N{LATIN SMALL LETTER E WITH DIAERESIS}l”;
    my $combine = “noe\N{COMBINING DIAERESIS}l”;

    print uc $word;
    print uc $combine;
    print $word eq $combine ? ‘same’ : ‘different’;

    use Unicode::Normalize;
    print NFD($word) eq NFD($combine) ? ‘same’ : ‘different’;

    # prints:
    NOËL
    NOËL
    different
    same

  13. Perl can solve the noël scenarios flawlessly if you first normalize the string:

    use 5.010; #enables ‘say’ among other features
    use utf8; #enables utf8 in identifiers and literals
    use Unicode::Normalize ‘normalize’;
    binmode STDOUT, ‘:utf8’; # will convert unicode chars into utf8 bytes when writing
    my $str = “noe\x{308}l”; # not the same literal as “noël”, which has 4 unicode code points

    $str = normalize(‘FCC’, $str); # after this, all noël scenarios work as expected

    say length($str); #4
    say scalar reverse $str; # lëon
    say substr($str, 0, 3); #noë

  14. Perl strings are stored as code points which is why the noël problem exists however there is a Unicode::GCString module which creates Unicode::GCString objects that do the right thing and even report the correct length. If you can’t get at that module this code does the correct reverse.

    $reversed = join(“”, reverse $str =~ /\X/g);

    The underlying problem hear is at what level are you working. Perl let’s you work at the code point level. The noël problem is at the next level above, the Grapheme Cluster level. This is the level that the Unicode::GCString level works at.

  15. Go (golang.org) has byte arrays, code point arrays (UTF-32) and strings of UTF-8. They all work.
    Go is finishing a normalization library now:
    http://blog.golang.org/normalization
    A recent article on that blog discusses Go strings in general, answering the questions explicitly about what the string class does and does not do.

    No need to wait for 2047.

    • I experimented with Golang and used the libraries provided for utf8 and normalizing. If you normalize the string explicitly then reversing noël or café works correctly. Otherwise not. In utf8 I can also enter these words already normalized, and of course that works even without normalizing.

      Golang cannot, so far as I can tell, change the case of the ligature in baffle, and the reverse comes out as looking like efflab. If you know what I mean.

      Golang is very nice to the cats and does the right things.

      Only programmers reverse strings. And then only in interviews. The more important tests are case-changing, splitting, joining, searching, regular expression matching, and serializing-deserializing.

      It may be necessary to reverse ‘Characters’ in Unicode processing. When writing out Right-to-Left text, such as Hebrew and Arabic, some character pairs sometimes need to be interchanged: Parenthesis (), and {}, , []. The opening brace is the }, you see.

      I personally think that ligatures are among the pathological cases in Unicode, and it may not be practical to handle them correctly by default.

      On the other hand, the pathological cases may be the members of the Unicode consortium. These maniacs have *animals* in the so-called alphabet. Code point 1F47D is the face of an unhappy space alien.

      Again, I point out there’s a nice rant on Unicode and Perl at: http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129. It’s harder than it looks, it says.

      Any modern language needs Unicode string types, or classes, or libraries, or some combination thereof, so it can get most or all of the hard things right. Golang has byte arrays, immutable strings which are always encoded in utf8 by law, and arrays of utf32 integers, called runes. Conversion between them is trivial. And libraries take care of all the rest except ligatures. This is technically a bug. There’s also a library to convert to and from utf16. And the library to normalize.

      Golang is very popular in China, so I conclude that all the source code is in utf8 must be good for the whole planet. That is, all the worlds software accepts utf8.

      It can be hard to tell the difference between bugs in displaying text and in processing text. If I carefully insert Hebrew into my text editor, and verify in code the sequence of the characters by running the program, the editor nevertheless shows them in the wrong order. Most will.

    • I agree that reverse is not a usual thing to do with text. However, it is an indication of how text is being handled, and how things like substring and truncation will be handled.

    • Thank you for the nice tests of the JVM languages. I’m surprised the results are so different for each JVM language.

    • I was tempted to use this issue. I certainly do in other discussions I have. I then favoured text which might appeal more to an English audience. I think programmers using non-Latin derived languages already understand the problems.

    • renatoathaydes, it looks like some of your tests aren’t using the version of “baffle” where the ffl is a single ligature. Using the Unicode escaped version, “ba\ufb04e”, I get the following from Java 7u45:

      *** baffle
      length 4
      reverse efflab
      tail affle
      3 chars baffl
      upper BAFFLE

  16. I’ve just tested ‘baffle’ with Python 3.3.3 in a Japanese locale, and it works fine for me.

    >>> ‘ba\ufb04e’
    ‘baffle’
    >>> ‘ba\ufb04e’.upper()
    ‘BAFFLE’

    Maybe you need to upgrade to the latest version of Python, or check your locale settings?

  17. did you look at Ruby 1.9’s implementation of different encodings? it’s not perfect but it does treat a string as a sequence of characters and not a sequence of bytes.

    it also supports many encodings(129+) in addition to a “byte encoding” that is the dumb string everyone might be familiar with. i’m sure there may be cases surrounding capitalization of unicode characters not being applied properly but overall you can be safe in the assumption methods who operate on a string are sequential characters and not bytes.

  18. Don’t see any major problem with Perl: remember that \X represents a grapheme cluster:

    $ echo noël | perl -CS -nle ‘print reverse /\X/g’
    lëon

    Which works no matter the normalization:

    $ echo noël | nfd | perl -CS -nle ‘print reverse /\X/g’
    lëon

    $ echo noël | nfc | perl -CS -nle ‘print reverse /\X/g’
    lëon

    There does exist a Uniocde::GCString module that allows you to do substr, index, length, etc operations on grapheme clusters instead of on individual code points. Basically, it turns a string into a sequence of grapheme clusters instead of a sequence of code points. But for simple enough stuff, you can just use \X.

    $ echo ñôël | nfc | perl -CS -nle ‘print scalar( () = /\X/g)’
    4
    $ echo ñôël | nfd | perl -CS -nle ‘print scalar( () = /\X/g)’
    4
    $ echo ñôël | nfc | perl -CS -nle ‘print length’
    4
    $ echo ñôël | nfd | perl -CS -nle ‘print length’
    7

    So don’t do this:

    $ echo ñôël | nfd | perl -CS -nle ‘print scalar reverse’
    l̈êõn

    Do this:

    $ echo ñôël | nfd | perl -CS -nle ‘print reverse /\X/g’
    lëôñ

    It’s up to you whether you want to deal with code points or which grapheme clusters. Once you learn the difference, it’s pretty easy.

  19. The underlying problem is, that most string classes just deal with code points instead of grapheme clusters. At the code point level there is no way to get reversion or slicing right. Respecting grapheme cluster boundaries is key here.
    In case of ligature code points it gets worse: They would have tobe treated decomposed for most operations but returned as given when returning a sequence fully containing all decomposed code points in original order. In most use cases one would just NFKC them away of course.

    But in a lot of cases, grapheme clusters are just not enough. The common teaser text truncation problem needs to deal with words (falling back to grapheme clusters) at least and should prefer to truncate on a sentence boundary instead. What i really need is a multilevel string class.
    One of the functionalities i would really like to see implemented in such a class: Truncating a text to a maximum length of grapheme clusters, code points or even bytes (it would need to know the target encoding for that one) preferring to break on a sentence boundary falling back to word and grapheme clusters.
    Of course i implemented that function already like most of us did. But manipulating text is just so essential an ability, that almost every language should start to provide it properly in its standard library today. That stuff is easy to do wrong and i failed a lot at it. I wished for not having to reinvent that wheel again and again as i changed languages in the past.
    And think aubout the children (new coders) too!..

    • “But in a lot of cases, grapheme clusters are just not enough. The common teaser text truncation problem needs to deal with words (falling back to grapheme clusters) at least and should prefer to truncate on a sentence boundary instead.”

      The problem with that is that word boundaries and sentence boundaries are dependent on locale and most locale code thinks that the locale of the machine not the user matters. So without knowing the locale of the user you can get into big problems segmenting words, lines, and sentences.

    • In this case its the language, the article is written in and not the locale of the writer’s, visitor’s or local machine.

      Of course the principle is the same as when formatting timestamps: It depends on i18n-specific parameters. One could feed the needed parameters to the function dealing with splitting text. Someone writes that article and of course, that one knows wich language the article is written in. The code could ask for that meta information on creation of an article.

      But even if parameters are unknown – lets just fallback to the default behaviour defined in the Unicode standard (yes, it contains boundary detection for grapheme clusters, words and sentences). Would still be better to split after abbrevations sometimes instead of splitting inside words almost always (in case of senetence granularity).

      The need for meta data is not a show stopper. We most often have that meta data. Either all texts are known to be in the same known language. Or code point counting would reveal the used script and (for a lot of texts) even the used language quite easily.

  20. I’d have hoped Common Lisp would fare well here, but SBCL (1.1.11 on 64-bit Linux Mint 15) is pretty broken. My results:

    string: noël
    reversed: l̈eon
    first 3 chars: noe
    length: 5

    string: 😸😾
    reversed: 😾😸
    first 1 char: 😸
    length: 2

    string: baffle
    upcase: BAfflE

    string: noël
    equals precomposed: NIL

  21. I am baffled such misunderstanding still exists in 2013. I would like to quote Joel Spolsky from 2003 – “So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.” http://www.joelonsoftware.com/articles/Unicode.html
    It is not that hard. Learn Unicode and use ICU and all will be good.

    • My desire is that a language, by default, should be correctly doing the basic string operations. Even I generally assume my string classes are working correctly and other than on a UI/template system, wouldn’t likely go out of my way to add unicode support. No, if proper text handling is not the default then we’ll forever have programs which don’t do it correctly.

  22. C++ works, but via ICU or some other similar utility library.

    I tested Qt (which itself uses ICU) and although the raw QString itself had middling success, that’s because it uses code points, not graphemes, and UTF-16 (w/ surrogates when necessary to exceed the BMP). So initially the substring and reversing examples didn’t work for either noël or the cats, but the source strings all displayed and baffle properly decomposed when uppercased (into BAFFLE).

    Of course QString doesn’t have a .toReversed() function so whichever way you go you have to code your own (I used std::reverse).

    Qt does provide a class, QTextBoundaryFinder, to split a QString into boundaries (word, line, but also grapheme clusters). By using this class it’s possible to correctly handle all reversing, substring, and length calculation requirements (including the non-BMP cats).

    The one thing I was disappointed with was the locale comparison. I already knew Qt supported normalizing QString to the defined Unicode normal forms, but the Qt I tested (4.8.6) doesn’t account for normalization differences in its QString::localeAwareCompare. You have to manually normalize both strings to the same normal form for that to work.

    But with that in mind it was all relatively straightforward, but you do have to understand that Unicode has combining characters and plan accordingly. ;)

  23. You have to be a standards lawyer to worry about latin ligature cases like ffl. At the same time there are widely used scripts (e.g. Arabic and practically all the Devanagari-based scripts of India) where ligatures are a core issue, any people here with experience building websites, word processors, and in general general things that require significant text interaction in such languages?

    • Ligatures are completely legal typographic element. Yes, proper support for asian scripts is more important of course.

      But failing at case-mangling for ligatures is still a bug that should be fixed. It will not affect comparison because one would use NFKC for that. Casemangling failures look ugly and undermine user’s confidence in the application, site, or even the organisation providing them. For the user it just feels buggy – even if functionality is not affected.

  24. I’d recommend you add the example from the lower row here to your post:

    Reason being that noël can be normalized, while dot/s/dot cannot be normalized in a way that retains the actual content, yet still results in a string of length one; in any language; due to the way Unicode itself works.

  25. I think a root cause of a lot of this is simple taxonomy. When someone says “I want a string class,” they are bringing a lot of mental baggage about what a “string class” should look like along with them… most obviously the string-of-bytes idea.

    If they called it UnsafeText, those who are writing these classes would have a clearer understanding of what qualifies as “done”.

    • Excellent. I’m glad to see a positive effect from this. It also gives me another reference for the Leaf language.

  26. I think that people may be conflating two different modes of Unicode operation. One mode is ‘use utf8’ and that tells Perl that the program itself is to be interpreted as being encode in UTF-8. The other is ‘use feature “unicode_strings”, which is off by default for backwards compatibility. However, it is enabled if one does any of a number of things, including using a -E on the command line instead of a -e.

    But my main point is that the hardest part to get right is not something you tested. It is case insensitive matching in regular expressions. Around 10% of Unicode folding definitions are to multiple characters. And not just from ligatures. The most common is LATIN SMALL LETTER SHARP S which is supposed to match the sequence ‘ss’ case-insensitively. Perl now works for these:
    perl -E ‘say “\N{LATIN SMALL LETTER SHARP S}s” =~ /^sss$/i;’
    perl -E ‘say “s\N{LATIN SMALL LETTER SHARP S}” =~ /^sss$/i;’
    perl -E ‘say “sss” =~ /^\N{LATIN SMALL LETTER SHARP S}s$/i;’
    perl -E ‘say “sss” =~ /^s\N{LATIN SMALL LETTER SHARP S}$/i;’

    But not for things like this:
    perl -E ‘say “\N{LATIN SMALL LETTER SHARP S}s” =~ /^[st][st][st]$/i;’

    The reason is that the bracketed character classes form “silos” whose boundaries can’t be matched across.

    • I think I’m actually going to make a more complete test suite. I see all sorts of places where a language might have troubles dealing with unicode. Regexes is of course one of those big areas.

    • Whether “ß” should additionally match “s”, “ss”, both or none depends on the situation. There is just not one size that fits all when it comes to matching.
      Each behaviour is wrong in one situation or the other. Same applies to case folding. There are two modes of case folding defined in the Unicode standard. The simple one, used by most implementations, does not alter string lengths when folding and does not change “ß”.
      Not trying to be too smart by default is a sane approach. There are situations were “Maße” and “Masse” should not both casefold to “MASSE”. The path chosen by Perl – to support a specific use case _partially_in_a_likely_surprising_way_ is obviously plain wrong and the result of too much complexity added to a once harmless library over time.

      For regular expression matching, currently common options just don’t do it. Users need to be able to choose the normal form (none, NFC, NFD, NFKC, NFKD) and case mapping (none, simple or extended and lower or upper) mode to apply on the string to match against before doing the matching. And in a lot of cases, normalization and case folding are not enough. Then additional problem-specific mangling is needed. Sequence like toNFKD + removeCodePointsContainedInSomeGeneralCategories + simpleFoldToLower + NFC + replaceSomeGraphemeSequences are part of my life since i changed from ISO-8859-1 to Unicode.

      The case insensitivity modifier in regular expression matching implementations has become misleading at best through creation of the Unicode standard (it was perfectly nonambiguous before)!
      It is a legacy option to be used when matching artificial languages only.
      Regular expression libraries still work. But text that needs to be mangled for matching, needs to be mangled before calling the regexp library.

      Ambiguousity has crept into other base libraries too. Case mangling functions in a lot of languages are not documented enough to indicate the algorithm used. Most often it seems as if someone took code from the pre-Unicode era and just extended it to support the algorithm from Unicode, that looked most sane to use for the implementing developer’s current needs.
      As there are two case mangling standards in Unicode (one reversible and non-length-changing and another one), any sane implementation would let the user choose between them. And as the Unicode standard documents keep on saying over and over again, that one may want to implement additional or even more complex mappings for handling some language better, extensibility would be key in proper libraries. Base libraries should, for example as parameter to a case fold function, allow the developer to provide an additional mapping table to respect in favor of the built in ones. I also never saw an implemented option to select the Unicode standard’s version to adhere to…

  27. In response to the statement that “Base libraries should, for example as parameter to a case fold function, allow the developer to provide an additional mapping table to respect in favor of the built in ones,” I can provide two examples.

    The first is in the Go language’s strings package, along with string-mapping functions like ToUpper, ToLower, and ToTitle, they also provide versions of those functions whose names end in “Special” (such as ToUpperSpecial) and which take an extra argument specifying the special casing rules to be given precedence. Originally this was to handle the “Turkish I” problem, and I do not know if any other rulesets have appeared.

    http://golang.org/pkg/strings/

    The do provide an EqualFold() function, but that is specifically defined to work only on the SimpleFold() (that is, 1:1 rune-to-rune) versions, not the possibly multi-codepoint mappings of the fancier case-mapping rules. You’ll find no “full” case-mapping functions in their unicode package:

    http://golang.org/pkg/unicode/

    The other example is Perl’s Unicode::Casing module, which allows you to override the case-mapping functions and their string-interpolation escapes within a lexical scope. For example:

    use Unicode::Casing (
    uc => \&my_uc,
    lc => \&my_lc,
    ucfirst => \&my_ucfirst,
    lcfirst => \&my_lcfirst,
    fc => \&my_fc,
    );

    You can read about it more here:

    http://search.cpan.org/~khw/Unicode-Casing-0.12/lib/Unicode/Casing.pm

    To no great surprise, the provided example is for a turkish_lc() function.

    All that aside, I agree that trying to get all the odd border cases right on full case-insensitive matching is a vexing problem with no perfect solution. It is a great deal easier if you first do the folding yourself on the string and then supplying case-folded forms to the matcher than it is to try to make the matching Engine figure everything out. However, this has its own pitfalls, not to mention space requirements. And you cannot turn it on or off in one part of the match compared with another part. This is a real problem when trying to assemble larger match strings out of smaller ones, where different rules apply.

    For text searching, this probably also involves reducing to NFKC before folding, or doing a Unicode-Collation-Algorithm match, which is quite expensive.

    And even so, you won’t get everything you need. If you look very closely at Adobe Acrobat’s search function, it is indeed doing all that and more, allowing U+0027 APOSTROPHE to match U+2019 RIGHT SINGLE QUOTATION MARK, or even breaking up lexical ligatures. This clearly involves more logic than mere normalization and folding, but if you are searching real text, you always have to do more than just apply the simple words.

    • Runic script does not even has lower and upper case. :P
      Nice to read, that there is progress in “fresh” languages. Looks like Golang does some things better.

      I also stumbled over that strange (maybe turkish) “İ” and “ı”. For matching i almost always want to mangle that to regular “I” or “i”. I also would want a lot more mappings for other languages even to the point where “Cheng” would match “程”. But a turkish searcher would possibly not want to get results with “i” when searching for “ı” (though i do not know for sure).

      Breaking up ligatures is what NFKC is really good at (it also converts mirrored latin and other “decorated” stuff to its plain version). Punctuation i mostly just drop before matching. I know, that it may matter in a lot of cases and therefore understand why Adobe chose to mangle it less destructively.

  28. Funnily enough, my screenreader (macintalk) will not read out paragraph that contain non-bmp characters, not just skip them like Cyrillic or such. There is so much broken there.

    • It should ponounce “😸” like “emoticon of a smiling cat face” translated into the user’s language. At least is should fall back to the name given by the Unicode standard like in “Symbol: GRINNING CAT FACE WITH SMILING EYES”.

      The worst case for a screenreader is to not pronouncing it at all. Even “unknown symbol” would be better than nothing.

  29. One minor nitpick: noël with precomposed ë is
    four characters, four columns, four glyphs, five UTF-8 octets;
    noël with decomposed ë is
    five characters, four columns, four glyphs, six UTF-8 octets.

    I assume you mean “gimme the number of glyphs”; a character
    is a UCS-4 codepoint (important for languages or environments
    whose wide character type is 16 bits), a glyph can have more
    of those (e.g. combining or annotations).

  30. The Rust string type is pretty bad, actually. It passes the uppercase test, but it fails the slicing test, the length test, and it doesn’t have a reverse function to test reversing with.

    fn main() {
    
    	// 1. noël
    	let s = "noël";
    
    	// 1.1. Does it print correctly?
    	// A: It does.
    	println!("Precomposed: {}", s);
    
    	// 1.2. Does it reverse correctly?
    	// A: Rust does not provide a "reverse()" function for strings.
    	// I could convert it into a list of codepoints,
    	// which would result in a string with a reversed list of codepoints.
    
    	// 1.3. First three characters.
    	// A: I could convert it into a list of codepoints here, too,
    	// but there's a more intuitive way to do it that's even more wrong.
    	println!("First three characters: {}", &s[..3]);
    
    	// 1.4. Length
    	// A: It's bytes.
    	println!("Length: {}", s.len());
    
    	// 2. Cats
    	let s = "😸😾";
    
    	// 2.1. Length
    	// A: Bytes again.
    	println!("Length: {}", s.len());
    
    	// 2.2. Substring
            // A: It crashes, because I told it to slice on a non-character boundary.
    	/*println!("Substring: {}", &s[..1]);*/
    
    	// 2.3. Reverse
    	// A: Again, there is no built-in reverse function.
    	// I could convert it into a list of codepoints,
    	// which would work correctly in this case,
    	// but that would be a double-standard.
    
    	// 3. baffle
    	let s = "baffle";
    
    	// 3.1. What is the uppercase?
    	// A: BAFFLE
    	println!("Uppercase: {}", s.to_uppercase());
    
    	// 4. noël again
    
    	// 4.1. Are they equal?
    	// A: false
    	let s = "noël";
    	let t = "noël";
    	println!("Equal: {}", s == t);
    
    }
  31. The Rust string type is pretty bad, actually. It passes the uppercase test, but it fails the slicing test, the length test, and it doesn’t have a reverse function to test reversing with.

    fn main() {
    
    	// 1. noël
    	let s = "noël";
    
    	// 1.1. Does it print correctly?
    	// A: It does.
    	println!("Precomposed: {}", s);
    
    	// 1.2. Does it reverse correctly?
    	// A: Rust does not provide a "reverse()" function for strings.
    	// I could convert it into a list of codepoints,
    	// which would result in a string with a reversed list of codepoints.
    
    	// 1.3. First three characters.
    	// A: I could convert it into a list of codepoints here, too,
    	// but there's a more intuitive way to do it that's even more wrong.
    	println!("First three characters: {}", &s[..3]);
    
    	// 1.4. Length
    	// A: It's bytes.
    	println!("Length: {}", s.len());
    
    	// 2. Cats
    	let s = "😸😾";
    
    	// 2.1. Length
    	// A: Bytes again.
    	println!("Length: {}", s.len());
    
    	// 2.2. Substring
            // A: It crashes, because I told it to slice on a non-character boundary.
    	/*println!("Substring: {}", &s[..1]);*/
    
    	// 2.3. Reverse
    	// A: Again, there is no built-in reverse function.
    	// I could convert it into a list of codepoints,
    	// which would work correctly in this case,
    	// but that would be a double-standard.
    
    	// 3. baffle
    	let s = "baffle";
    
    	// 3.1. What is the uppercase?
    	// A: BAFFLE
    	println!("Uppercase: {}", s.to_uppercase());
    
    	// 4. noël again
    
    	// 4.1. Are they equal?
    	// A: false
    	let s = "noël";
    	let t = "noël";
    	println!("Equal: {}", s == t);
    
    }
    • Rust appears to have an unstable “graphemes” type that can be used to iterate over graphemes. I’m not sure how close this is to stabilization. I think exposing a byte interface and a grapheme interface separately is not unreasonable for a low-level language designed to interface with C.

  32. I did find a language that passes all the tests out of the box not special modules.

    Perl 6 handles everything including correctly counting, reversing and splitting noël as well as knowing noël == noël.

    my $test = “noe\x[0308]l”;

    say $test;
    say $test.flip;
    say $test.substr(0,3);
    say $test.codes;

    $test = “😸😾”;

    say $test;
    say $test.codes;
    say $test.substr(0,1);
    say $test.flip;

    $test = “baffle”;

    say $test;
    say $test.uc;

    my $noel1 = “noël”;
    my $noel2 = “noe\x[0308]l”;

    say $noel1 eq $noel2;

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s