Strings and Text are not the same

Thinking of text and strings as the same type is wrong. It leads to all kinds of errors and results in confusing or incomplete APIs. I wrote before that the string type is broken, but that’s only true if the string type is supposed to handle human readable text. The more I think about that, the more I realize it’s wrong. It is often quite convenient, but perhaps that makes it even worse. I’m leaning more again towards not needing a string type at all.

What is a string? What is text?

The issue stems from a confusion of the terms “string” and “text”. Text is something that makes sense to a person. It’s an abstract concept including words, numbers, punctuation and other human language traits. A string is merely an array of codes. Now certainly a string can be used to encode text, but that doesn’t make it text itself.

All the terms in this area are muddled due to a strong history of liberal use. We have database “text” fields that really just store strings. The term “character” has a multitude of definitions, from the high-level human perceivable glyph, down to a terminal control code like escape. The term “character” should probably just be avoided entirely. There are string processing libraries that are really more focused on text. Source code seems to be both text and a string. I still think the terms “text” and “string” are okay, but we must be careful when using them.

HTML is not text

HTML is a prime example showing a distinction. It has the term “text” buried in its acronym, but an HTML document is really a string. The distinction can be seen with the operations that are performed on the data.

Take a simple word, like “hello”, as an example. A writer may wish to bold this word. They highlight the word, press bold, and the word appears bold. To the computer the operation is wrapping the string to produce “<b>hello</b>”. For whatever reason the program could instead create the string “<b>he</b><b>llo</b>”. This is the exact same text, a bold hello, yet the string is quite different.
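A rough Python sketch of that distinction, using the standard `html.parser` module. The `BoldRuns` class and `styled_text` helper are names made up for this illustration, and the sketch only understands the `<b>` element; it is not a general HTML model.

```python
from html.parser import HTMLParser


class BoldRuns(HTMLParser):
    """Collects (text, is_bold) runs, merging adjacent runs with the same style."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <b> elements are currently open
        self.runs = []   # list of (text, is_bold) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        bold = self.depth > 0
        if self.runs and self.runs[-1][1] == bold:
            # Same style as the previous run: this is the same piece of text.
            self.runs[-1] = (self.runs[-1][0] + data, bold)
        else:
            self.runs.append((data, bold))


def styled_text(document):
    """Hypothetical helper: reduce an HTML string to its styled text."""
    parser = BoldRuns()
    parser.feed(document)
    parser.close()
    return parser.runs


one_element = "<b>hello</b>"
two_elements = "<b>he</b><b>llo</b>"

print(one_element == two_elements)                            # compared as strings
print(styled_text(one_element) == styled_text(two_elements))  # compared as text
```

Compared as strings the two documents differ; compared as styled text they are the same bold "hello".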

Now the user presses the “Capitalize” function in their editor. They expect to see “Hello” now. The underlying document string should now be “<b>Hello</b>”. A function that blindly works on strings as though they were text might result in “<B>hello</b>”. A lot of text processing functions, like case manipulation, are not applicable to strings in general.
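To make the failure concrete, here is a small Python sketch. Both functions are hypothetical, and the "markup-aware" one uses a crude regular expression in place of a real HTML parser, so treat it as an illustration of the idea only.

```python
import re


def naive_capitalize(document):
    """Treats the whole string as text: uppercases the first letter it finds."""
    for index, code in enumerate(document):
        if code.isalpha():
            return document[:index] + code.upper() + document[index + 1:]
    return document


def markup_aware_capitalize(document):
    """Splits tags from text (crudely) and uppercases the first text letter."""
    parts = re.split(r"(<[^>]*>)", document)  # tags are kept as separate parts
    for index, part in enumerate(parts):
        if part and not part.startswith("<"):
            parts[index] = part[0].upper() + part[1:]
            break
    return "".join(parts)


print(naive_capitalize("<b>hello</b>"))         # the tag name gets capitalized
print(markup_aware_capitalize("<b>hello</b>"))  # the word gets capitalized
```

The naive version touches the element name because, at the string level, the `b` in `<b>` is simply the first letter code it meets.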

Problems with text handling are so common that I couldn’t even write this article without hitting some. Those “<” symbols above were not being properly escaped. I had to modify the options to my markdown processor to get it working as I expect. I also had a hard time in my editor working with the combining characters used in the next section.

Accented delimiters <̧

Document processing, like HTML, JSON or YAML, is done at the code string level. I’m using the word “code” here where other sources may choose to use the word “character”. I’m avoiding that term since its meaning is ambiguous. At the level of parsing we are clearly thinking only of the codes.

In HTML, code #60 is used to open an element. You may also recognize this as the ASCII, and Unicode, code for <. The HTML parser however doesn’t care what #60 represents to you, only that it delineates the start of an element.

Combining characters can create an accented version of that symbol, <̧. In text this is clearly a different symbol: it’s a distinct grapheme cluster. The HTML parser doesn’t care about that. It sees code #60 followed by #807 (combining cedilla). It thus sees the opening of an element. However, since it isn’t followed by a valid naming character most parsers just ignore this element (I’m not positive that is correct to do). This is not the case with an accented quote, that is, a quotation mark followed by a combining mark. Here the parsers (at least the browsers I tested) let the quote end an attribute and then have a garbage character lying around.
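The two views of the accented delimiter can be shown in a few lines of Python. The `clusters` function is a hypothetical helper that only attaches combining marks to the preceding code; real grapheme cluster segmentation has more rules than this.

```python
import unicodedata

accented = "<\u0327"  # code 60 followed by code 807

# The code-level view: two separate codes.
print([ord(code) for code in accented])
print(unicodedata.name("\u0327"))


def clusters(text):
    """Rough approximation of grapheme clusters: glue combining marks
    onto the code that precedes them."""
    result = []
    for code in text:
        if result and unicodedata.combining(code) != 0:
            result[-1] += code
        else:
            result.append(code)
    return result


sample = "a" + accented + "b"

# A code-level scanner finds a plain "<" at index 1 ...
print(sample.find("<"))
# ... while the text-level view has no bare "<" symbol at all.
print("<" in clusters(sample))
```

A parser working on codes sees the delimiter; a reader looking at the text does not.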

These aren’t binary formats, lest one be thinking that. A large range of Unicode is available for names in HTML, and the source can be in any of the Unicode encodings: UTF-8, UTF-16, UTF-32. The parser must understand the decoded form, thus it truly is working at the code level, not in binary.
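A quick Python illustration of the difference between the binary level and the code level. The sample document is invented for this example.

```python
document = "<p>h\u00e9llo</p>"

# Three different binary forms of the same document.
encoded = {name: document.encode(name) for name in ("utf-8", "utf-16", "utf-32")}

# Decoding each one gives back the same string of codes.
decoded = {name: raw.decode(name) for name, raw in encoded.items()}

for name, raw in encoded.items():
    text = decoded[name]
    # The byte counts differ, but the code count and the position of
    # code 60 ("<") are identical in every case.
    print(name, len(raw), "bytes,", len(text), "codes, '<' at", text.index("<"))
```

The parser's rules are written against the decoded codes, so it behaves the same whichever encoding the bytes arrived in.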

Mixing operations is bad

We need to work with strings at the code level. SQL, HTML, JSON, source code, HTTP, are all defined in terms of a string of codes. Manipulating and parsing documents tends to be at the code level (when not binary). Operations at this level tend to be very well defined. Operations at the text level are very ambiguous, as they are subjective and locale dependent.

Yet our strings are filled with functions that work somewhat at the text level. Perhaps the most dangerous of these is string formatting: the printf family in C, the text streams in C++, the format function in Python. From experience we know that you can’t use the generic formatting functions to produce SQL statements, otherwise you’ll be open to SQL injection attacks. Similarly we can’t create HTML using simple string formatting or concatenation, since that would lead to corrupt documents. So why does a string even have a format function if it isn’t actually applicable to strings?
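Both failures are easy to reproduce with Python's standard library. This is a minimal sketch: the table, its rows and the hostile inputs are all invented for the example, and it uses an in-memory SQLite database.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "a1"), ("bob", "b2")])

user_input = "nobody' OR '1'='1"

# Generic formatting: the input is pasted into the SQL string as if it
# were harmless text, and it rewrites the statement.
unsafe_sql = "SELECT secret FROM users WHERE name = '{}'".format(user_input)
leaked = conn.execute(unsafe_sql).fetchall()

# Parameter binding: the database receives the value separately from the
# statement, so it can only ever be a value.
safe = conn.execute("SELECT secret FROM users WHERE name = ?",
                    (user_input,)).fetchall()

print(len(leaked), "rows leaked;", len(safe), "rows from the bound query")

# The same pattern for HTML: format() knows nothing about markup.
comment = "<script>alert(1)</script>"
unsafe_html = "<p>{}</p>".format(comment)
safe_html = "<p>{}</p>".format(html.escape(comment))
print(unsafe_html)
print(safe_html)
```

In both cases the fix comes from a function that knows the target format, not from the string's own formatting.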

format is also not very good at producing text. It doesn’t know about document formats so can’t properly produce an HTML fragment. It can’t handle basic localization issues like pluralization. It doesn’t understand character widths, joiners, and non-characters, so can’t even do basic alignment correctly. It’s also not extensible thus never seems to handle the myriad of formatting issues real text tends to have. It’s far better to use a template language that is aware of these issues and specially tailored to text handling.
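Two of those limits, shown with Python's `str.format`. The sample words are arbitrary; the point is only that padding is computed from the number of codes, not from what a reader sees.

```python
# Pluralization: format() just substitutes the number.
for count in (1, 2):
    print("{} files".format(count))

# Alignment: the same visible word, stored two ways.
composed = "caf\u00e9"     # 4 codes, the accented letter is a single code
decomposed = "cafe\u0301"  # 5 codes, "e" followed by a combining acute

row_a = "{:<6}|".format(composed)
row_b = "{:<6}|".format(decomposed)

# Both rows have the same number of codes, but a different amount of
# padding, so the "|" lands in a different visible column.
print(row_a, len(row_a), row_a.count(" "))
print(row_b, len(row_b), row_b.count(" "))
```

The first loop produces "1 files", and the two rows do not line up when displayed even though format considers them the same width.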

In my previous article I tested a variety of string operations. A lot of the unusual results come from mixing of string and text handling. If we can’t decide exactly what is a string, and what is text, some of those operations are not possible to unambiguously define.

An anachronism

We need to stop mixing our string and text operations. A whole range of security issues, like SQL injection and cross-site scripting attacks, happen because of this mixture. Our text handling is also bad because of it. We don’t sort things correctly. We don’t format things correctly. We do truncation wrong. We handle combining characters poorly.
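Sorting and truncation are simple to demonstrate in Python. The `rough_key` function below is a hypothetical stand-in that strips accents and case; it is not real collation, which depends on the locale.

```python
import unicodedata

words = ["zebra", "Apple", "\u00e9clair", "banana"]

# String-level sort: ordered purely by code value.
by_code = sorted(words)


def rough_key(word):
    """Crude approximation of a text-level sort key: drop combining marks
    after decomposition, then ignore case."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()


roughly_text = sorted(words, key=rough_key)

print(by_code)       # the accented word is pushed to the end
print(roughly_text)  # the order a reader would more likely expect

# String-level truncation: slicing by code count cuts a letter in half.
word = "cafe\u0301"
print(word[:4])      # the accent is silently dropped
```

Neither result from the string-level operations is wrong as a string operation; they are only wrong if you expected text behaviour.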

Even if we know about the problem it is hard to avoid. Our current languages make it very easy to mix text and strings, and that leads to the bad behaviour. It works often enough that the issues are easy to forget. I don’t know of any language that offers proper text handling as part of its standard library. At this point it’s hard to even say what that library needs to offer. That doesn’t make it any less essential.

Certainly though something needs to change. Hopefully I can address these problems in Leaf, but I’m not sure I fully understand what needs to be done yet. What’s important is that we all stop looking at strings as text handling types, and start looking for proper alternatives.

6 replies »

  1. Interesting analysis. I agree with your attempt at differentiating the two, but I don’t think there is a cut-and-dry solution for it. Would you make a Text type that bans non-alphanumeric characters? As you pointed out, HTML is structured differently from JSON or whatever else. You have to sanitize a string before it can be treated as “text”.

    • I don’t know what the solution is. I don’t know if a full text handling type is required, or merely adjustments to the way strings are handled. Something like sub-typing might work. Here you mark a string as HTML, JSON, or otherwise, preventing accidental incorrect formatting.

      I wish there were a simple answer. I need it for Leaf. :)

    • Sounds to me more like a wrapper type of some kind: basically a string + meta data of some sort. I think it would be dangerous inside the core string type itself.

  2. I was with you up to the point you said HTML was not text. :D

    Rather than risk hijacking the comment section with a very long post (’cause I think there’s a lot to be said on the topic), I’m going to pop over to my programming blog and whip up a response. Stand by for ping back…
