Tags

, , ,

Should string be a distinct type or merely an alias for an array of characters? I’m considering the options for Leaf and can’t convince myself that a specialized type is needed. Looking at C++, a string and a ‘vector’ are almost the same. Specialized operations, like case conversion, appear bound to the underlying ‘char’ type, not the collection.

What is a string

A string is nothing more than a sequence of characters. To be more correct, a programming character is a code point for a particular character set. A single code point doesn’t always encode a logical glyph, it may be a combining character, joiner, or other such control signal. Does this distinction play a significant role?

Consider the ‘length’ function of a string. Is it intended to return the number of glyphs, combined characters, or count of the underlying code points? Should two distinct, but canonically equivalent strings return the same length? Trying to accommodate the complexity of normalization and equivalence rules in a generic ‘length’ function seems ludicrous. Domains where the “true” glyph length is required are uncommon, and would probably require a rendering library. The only sane option for ‘length’ is to return the number of code points — which is indistinguishable from the ‘length’ of a character array.

We can support that decision by considering the behaviour of a subscript, or indexing operation. Should we index the logical character, or a specific code point? Again, looking at unicode combining characters, there is no actual representation, or even definition, of what a “logical” character is. Sequences of combining marks are unbound, thus no fixed size type could even represent a “logical” character. It would seem that operations on the string need to be done at the code point level — exactly the behaviour for the same operation on an array of characters.

The C++ difference

A ‘string’ and ‘vector’ in C++ have only one significant difference that I can see: the string is null-terminated. This one small difference allows the ‘c_str’ function to return a pointer directly to the internal memory of the string. (C++11 clarified the string to ensure this is the only valid way to store strings.)

For C++ this is an essential feature: the class would be nearly useless without the ‘c_str’ function. It isn’t however a desired feature. We’re forced to convert strings into their C equivalent for calling API functions. Quite disturbingly this includes functions in the standard C++ library itself. For example, the constructor of ‘ofstream’ used not to accept a ‘string’, only a ‘const char*’ (C++11 fixed this particular case).

Null-termination is also a sordid tale all on its own. The whole C library of string functions, like strcat, strcpy, are terribly unsafe. Working with C-strings is a perilous and error prone affair. Very few modern APIs even rely on null termination, requiring an explicit length argument instead.

This primary difference between a C++ ‘string’ and ‘vector’ is really just a historical oddity that many programs don’t even need anymore.

Surrogates and variable-length encodings

The preceding considers characters at the code point level: a single value is the direct encoding of a character. Storing strings this way is often inefficient. More compact variable-length encodings are frequently used: differing numbers of bytes are used to represent a single code point. For example, in UTF-16 a code point may comprise a 2-byte or 4-byte sequence. Values that don’t encode a full character, they are part of a sequence, are called surrogates.

The surrogates are not the same as combining characters. A surrogate has no meaning in the character set: it is merely an encoding primitive. Working with the string at the character level requires actual code points. Trying to work with an encoded string as a sequence of characters is very troublesome and leads to ambiguities. What should the ‘length’ function return, the number of encoding values, or the number of code points?

It is tempting to use encoding a justification for a special string type. With a proper abstraction you could store strings in a variety of formats then expose them as a sequence of characters. This would, for example, allow using UTF-8 strings directly in string operations. The ‘length’ function returns the number of characters. The underlying encoding is irrelevant (perhaps exposed via special functions).

A big challenge in such a class is efficiency. A basic operation like accessing a character by index is now a linear operation. It requires decoding the string, scanning from the beginning, assembling surrogates, and counting the resulting characters. Even simple forward scanning requires a potential loop and stateful extraction of the next character. This overhead will multiply through all basic routines, such as splitting, translation and regex matching.

I don’t know of a language that handles strings in this fashion. The efficiency problems leave it a very unattractive option. It’s just so much simpler to decode while loading and work in the character domain. It requires more memory, but I can’t think of any domain where this would actually be a significant issue — even massive strings are small compared to other assets.

Library support

Strings have a lot of specialized operations in comparison to generic arrays: normalization, character translation, regular expression matching, parsing, formatting, stripping, encoding, et cetera. To be pedantic though, a vector of any type will always have specialized operations: numbers have summations, averages, medians; vertices form meshes which can be transformed, simplified, rasterized.

The discussion may instead be one of free functions versus member functions. Should all these special operations be member functions, thus requiring a distinct ‘string’ class? Or should they all be free functions that would work just fine with the generic ‘array’ class? Certainly all these functions could be written as free functions. None of them require special access, the public interface of an array is enough.

Perhaps the issue is just one of syntax. If I could write `str.toUpperCase()` with free functions, the question is moot. D does exactly this with uniform function call syntax. I think we should also look at the trend reversal in C++11, many operations are now available as free functions rather than member functions. Popular opinion seems to be leaning in the direction of free functions.

If free functions are the preferred approach, then there is no need for a dedicated string class. String operations can be written as free functions on an array of characters.

A string of what

What if we need strings of something other than characters? By default Leaf will probably be the Unicode code set, but perhaps you wish to just use ascii, or latin-1 in your code. Most strings I’ve seen tend to ignore this issue. Some languages, like PHP, let you specify at a global level. A type labelled simply as ‘string’ doesn’t allow much variation.

So let’s assume characters can be marked as ascii. To have a string of these characters makes string a kind of template class ‘string’. It is still necessary to refer to individual characters, so a type ‘char ascii’ is required as well. Once I start adding these modifiers I don’t like the feel of ‘string’ anymore, it seems like it says something other than ‘array’.

Go back to our discussion of variable length encoding: what if you really wanted to have UTF-16 strings, and live with the surrogates as-is? Saying ‘string’ is now really ambiguous, is it a string of unicode characters, or actual utf16 code values. It would be clearer to create a type alias and use that. In Leaf we could do ‘type utf16 : binary 16bit’ and then have a ‘array’. There is no confusion here about what this is: an array of integral code values, not characters.

Just a typedef

I’m now rather convinced there is no need for a special string class. I will nonetheless have a ‘string’ in Leaf, but it will merely be a type alias for an ‘array’. Basic strings are used often enough that this shortcut is appreciated. For cases where encoding is vitally important, using explicit array types is likely the safer option. Clearly a rich string library is important, but there is no need for a string class to provide it.

Can you think of any other situation where a dedicated string class is needed, or at least very helpful?

Read the followup article “The string type is broken“.

About these ads