Defective Language

What is the length of a string? A tricky question

Every day millions of programmers need the length of a string. Despite this, there is no universal definition of what string.length actually represents. It changes between languages, and quite often doesn’t return the length we actually want. From the limited-use encoded and in-memory sizes, to the highly useful but often unavailable grapheme clusters, to the visual extent of rendered text, there are significantly different ways to define “length”.

Encoded length: UTF-8, ASCII

In most languages the string.length function returns an encoded length of the string. This typically ends up being how much memory the string is occupying in whatever the system default encoding is. This is a rather unfortunate default since it is the most useless definition of string length.

Each language and platform differs, making consistent algorithm implementations difficult. In C++ on Linux you’ll get a count of UTF-8 bytes from string, or of UTF-32 code units if you’re using wstring. In Java and C# you’ll get the count of UTF-16 code units. The docs for several languages, like Python, don’t appear to mention what their length function is measuring.

The encoded length of a string is generally only useful when you are serializing data. It’s also only useful for a particular encoding: for example, we want to specifically serialize as UTF-8, not just whatever C# is storing in memory.
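To make the difference concrete, here is a minimal Python sketch that measures the same text under three encodings. The helper name encoded_lengths is made up for this illustration; the only assumption is that you pick the encoding explicitly rather than relying on whatever the runtime stores in memory.

```python
def encoded_lengths(text):
    """Return the size of `text` in code units for three common encodings."""
    return {
        # UTF-8 code units are single bytes.
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variants emit no byte order mark, so dividing the byte
        # count by the code unit size gives the number of code units.
        "utf-16": len(text.encode("utf-16-le")) // 2,
        "utf-32": len(text.encode("utf-32-le")) // 4,
    }


if __name__ == "__main__":
    for sample in ("abc", "é", "😈"):
        print(repr(sample), encoded_lengths(sample))
```

The same two-character string "é😈" comes out as 6, 3, or 2 depending on the encoding, which is why an encoded length only means something next to the name of the encoding being serialized.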

Use of any encoded length for something other than serialization is rather suspicious. It’s extremely common though, since it’s what string.length tends to do in most languages and is thus the most convenient function to call.

Character code count

One step more abstract than encoded length is the character code length. This is a count of the real “characters” in the string: each “character” is just a number defined in some character table, such as Unicode.

Historically, with code page support, the character code count and the string.length functions tended to return the same value. This explains why the length function works that way. With the advent of Unicode, and multiple encodings, that is no longer something you can rely on: as soon as the text leaves the ASCII range (for UTF-8) or the BMP (for UTF-16), the string.length of a string differs from the count of the Unicode code points it contains.

For many applications this length also tended to just work, since prior to non-BMP codes in Unicode a UTF-16 (C#/Java) encoded string actually had the same string.length as its code point count. But then along came emoji characters 😈, ensuring that all apps now need to deal with non-BMP text.

Character code count is a consistent way to define the length of a string. The number of Unicode code points is the same regardless of language and encoding. It’s also a sane way to work with the underlying string characters. These are the values that define the properties of a character: whether it’s a letter, digit, upper-case, or combining character.

This is the simplest useful definition of a “character”, though its usefulness is still quite domain specific. Languages, such as XML, tend to be defined at this level, and thus parsers tend to work with code points.
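As a rough illustration in Python 3, where a str is a sequence of code points and len() counts them, the sketch below lists each code point with the properties Unicode attaches to it. The function name describe_code_points is hypothetical; unicodedata is the standard library module.

```python
import unicodedata


def describe_code_points(text):
    """List (code point, general category, combining class) for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.combining(ch))
        for ch in text
    ]


if __name__ == "__main__":
    accented = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
    print(len(accented))  # code point count
    for entry in describe_code_points(accented):
        print(entry)
    print(len("😈"))  # a non-BMP character is still one code point
```

Here the accented letter is two code points, a lowercase letter (category Ll) and a non-spacing mark (category Mn), even though it is displayed as one character.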

Grapheme cluster count

What if you’re trying to limit the number of “characters” a user can type into a field on the display? This is where the definition of length becomes really tricky. Clearly the encoded length is wrong, since the user cannot “see” this length and the restriction will appear random to them.

Character code count will also appear wrong. Combining characters in Unicode don’t present as distinct visual characters. A typical user will be confused as to why an accented character counts more towards their maximum than an unaccented one. Even more, in some written languages combinations of non-combining characters can be presented as a single visual “glyph”.

This is where a grapheme cluster becomes useful. It’s an attempt to group several code points along their visual grouping. The “grapheme cluster” is what a user would logically count as a single “character”. Movement through a document, highlighting, deletion and insertion, are all expected to be aligned to grapheme clusters. Any user-perceivable text manipulation must be done at this level or it will be perceived as wrong.

The problem with grapheme clusters is that libraries to work with them are not readily available in all languages. Support usually isn’t built into the native string functionality. The system isn’t perfect either, as lots of ambiguities exist. However, it’s still the best baseline approach to text manipulation available.
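To show the idea without a dedicated library, here is a deliberately simplified Python approximation. It is not the real Unicode segmentation algorithm: it only treats combining marks and zero-width joiners as extending the previous cluster, and it will miscount things like Hangul syllables built from jamo or flag emoji. The name approx_grapheme_count is made up for this sketch.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER
MARK_CATEGORIES = ("Mn", "Mc", "Me")  # Unicode general categories for marks


def approx_grapheme_count(text):
    """Roughly count user-perceived characters (simplified, see caveats above)."""
    count = 0
    previous_was_joiner = False
    for index, ch in enumerate(text):
        extends = ch == ZWJ or unicodedata.category(ch) in MARK_CATEGORIES
        # A new cluster starts at the beginning of the text, or at any code
        # point that neither extends the previous one nor follows a joiner.
        if index == 0 or not (extends or previous_was_joiner):
            count += 1
        previous_was_joiner = ch == ZWJ
    return count


if __name__ == "__main__":
    print(approx_grapheme_count("e\u0301"))  # accented letter
    print(approx_grapheme_count("\U0001F468\u200d\U0001F469"))  # joined emoji
```

For anything user-facing, a library that implements the full segmentation rules is the safer choice; this only demonstrates why the count differs from the code point count.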

Visual extent

An often overlooked, but highly useful, definition of length is the visual extent. This is a measure of how much space a string occupies when rendered with a given font and size. Of all the methods to calculate size it is certainly the most costly, but perhaps the most realistic.

Often length limitations are imposed on a string that should appear somewhere in a form or on the UI of a screen. It is important that the string fits within those bounds so as not to produce a broken visual. This type of limit often appears in real-world forms: in my passport application, and some banking forms, I’ve often had a box for my signature with the only requirement that it fits within the box.

It’s very unfortunate that text measuring and rendering APIs are not readily available, and the ones that are available are not easy to use. It’s likely the reason why one rarely considers visual extent in any non-graphical program.
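Real font measurement needs a rendering library, but a much cruder stand-in exists for one special case: a monospaced terminal, where the extent is a number of columns. The Python sketch below assumes that case only. It counts wide and fullwidth East Asian characters as two columns, marks and format characters as zero, and everything else as one; the name terminal_columns is hypothetical, and control characters and font-specific behaviour are ignored.

```python
import unicodedata

ZERO_WIDTH_CATEGORIES = ("Mn", "Me", "Cf")  # marks and format characters


def terminal_columns(text):
    """Estimate how many monospaced terminal columns `text` occupies."""
    columns = 0
    for ch in text:
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            continue  # assumed to take no horizontal space
        # "W" (wide) and "F" (fullwidth) characters take two columns.
        columns += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return columns


if __name__ == "__main__":
    for sample in ("abc", "日本", "e\u0301"):
        print(repr(sample), terminal_columns(sample))
```

This says nothing about proportional fonts, kerning, or height, which is exactly the information a proper text measuring API would provide.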

Nothing’s perfect

All of these approaches have their uses and their limitations. As a general rule grapheme clusters should be the default view of strings as it matches user perception. Encoded length is the next choice for applications that involve serialization, and it’s really only useful for the serialization layer. If you really just need some kind of limit to prevent overflow abuse then code point count is fine.

Regardless of the choice, a determined user can usually find a way to mess with any “length” restriction. If you count grapheme clusters a user can still put thousands of combining characters and produce very tall text decorations. This can be defeated by measuring the visual extent, but another user might use thousands of zero-width joiners, or other zero-size markings, as well.
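One possible extra guard against the stacked combining characters case, sketched in Python under the assumption that a cap on consecutive marks is acceptable for your input: find the longest run of combining marks and reject text where it exceeds some threshold. The name longest_mark_run and any particular threshold are illustrative choices, not an established rule.

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Mc", "Me")  # Unicode general categories for marks


def longest_mark_run(text):
    """Return the length of the longest run of consecutive combining marks."""
    longest = 0
    current = 0
    for ch in text:
        if unicodedata.category(ch) in MARK_CATEGORIES:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a non-mark code point ends the run
    return longest


if __name__ == "__main__":
    stacked = "a" + "\u0301" * 5 + "b\u0301"
    print(longest_mark_run(stacked))
```

A check like this sits alongside a grapheme or code point limit rather than replacing it, and it does nothing about zero-width joiners, which would need their own rule.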

Just always remember that string.length is not a consistent definition of length, and usually not a very useful one.
