Thinking of text and strings as the same type is wrong. It leads to all kinds of errors and results in confusing or incomplete APIs. I wrote before that the string type is broken, but that’s only true if the string type is supposed to handle human-readable text. The more I think about that assumption, the more I realize it’s wrong. Treating strings as text is often quite convenient, but perhaps that makes it even worse. I’m again leaning towards not needing a string type at all.
What is a string? What is text?
The issue stems from a confusion of the terms “string” and “text”. Text is something that makes sense to a person. It’s an abstract concept including words, numbers, punctuation and other human language traits. A string is merely an array of codes. Now certainly a string can be used to encode text, but that doesn’t make it text itself.
All the terms in this area are muddled due to a strong history of liberal use. We have database “text” fields that really just store strings. The term “character” has a multitude of definitions, from the high-level human perceivable glyph, down to a terminal control code like escape. The term “character” should probably just be avoided entirely. There are string processing libraries that are really more focused on text. Source code seems to be both text and a string. I still think the terms “text” and “string” are okay, but we must be careful when using them.
HTML is not text
HTML is a prime example showing a distinction. It has the term “text” buried in its acronym, but an HTML document is really a string. The distinction can be seen with the operations that are performed on the data.
Take a simple word, like “hello”, as an example. A writer may wish to bold this word. They highlight the word, press bold, and the word appears bold. To the computer the operation is wrapping the string to produce “<b>hello</b>”. For whatever reason the program could instead create the string “<b>he</b><b>llo</b>”. This is the exact same text, a bold hello, yet the string is quite different.
Now the user presses the “Capitalize” function in their editor. They expect to see “Hello” now. The underlying document string should now be “<b>Hello</b>”. A function that blindly works on strings as though they were text might result in “<B>hello</b>”. A lot of text processing functions, like case manipulation, are not applicable to strings in general.
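To make the distinction concrete, here is a minimal sketch in Python. It assumes a tiny HTML subset with only the b element, and the names BoldRuns and styled_text are my own hypothetical helpers, not part of any library. It reduces each string to a list of (code, is_bold) pairs, a crude stand-in for "the text", and then shows what a string-level case function does to the markup.

```python
from html.parser import HTMLParser


class BoldRuns(HTMLParser):
    """Collects (code, is_bold) pairs from a tiny HTML fragment."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <b> elements currently open
        self.styled = []  # one (code, is_bold) pair per code of text

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        self.styled.extend((code, self.depth > 0) for code in data)


def styled_text(markup):
    """Hypothetical helper: reduce markup to its styled text."""
    parser = BoldRuns()
    parser.feed(markup)
    parser.close()
    return parser.styled


one_run = "<b>hello</b>"
two_runs = "<b>he</b><b>llo</b>"

# Different strings, same styled text.
print(one_run == two_runs)                            # False
print(styled_text(one_run) == styled_text(two_runs))  # True

# A string-level case function rewrites the markup along with the word.
print(one_run.upper())  # <B>HELLO</B>
```

Comparing the strings and comparing the text give different answers, which is the whole point: they are different kinds of data.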
Problems with text handling are so common that I couldn’t even write this article without hitting some. Those “<” symbols above were not being properly escaped. I had to modify the options to my markdown processor to get it working as I expect. I also had a hard time in my editor working with the combining characters used in the next section.
Accented delimiters <̧
Document processing, like HTML, JSON or YAML, is done at the code string level. I’m using the word “code” here where other sources may choose to use the word “character”. I’m avoiding that term since its meaning is ambiguous. At the level of parsing we are clearly thinking only of the codes.
In HTML, code #60 is used to open an element. You may also recognize this as the ASCII, and Unicode, code for “<”. The HTML parser however doesn’t care what #60 represents to you, only that it delineates the start of an element.
Combining characters can create an accented version of that symbol, <̧. In text this is clearly a different symbol: it’s a distinct grapheme cluster. The HTML parser doesn’t care about that. It sees code #60 followed by #807 (combining cedilla). It thus sees the opening of an element. However, since it isn’t followed by a valid naming character most parsers just ignore this element (I’m not positive that is correct to do). This is not the case with an accented quote, like "̧. Here the parsers (at least in the browsers I tested) let the quote end an attribute and then have a garbage character lying around.
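A small Python sketch shows what a code-level parser has to work with. It only inspects the codes using the standard unicodedata module; it makes no claim about how any particular HTML parser reacts to them.

```python
import unicodedata

accented = "<\u0327"  # code #60 followed by code #807 (combining cedilla)

# A reader sees one symbol; the string holds two codes.
print(len(accented))  # 2

for code in accented:
    is_combining = unicodedata.combining(code) != 0
    print(ord(code), unicodedata.name(code), is_combining)
# 60 LESS-THAN SIGN False
# 807 COMBINING CEDILLA True

# A scan at the code level still finds the delimiter in first position.
print(accented.startswith("<"))  # True
```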
These aren’t binary formats, lest one be thinking that. A large range of Unicode is available for names in HTML, and the source can be in any of the Unicode encodings: UTF-8, UTF-16, UTF-32. The parser must understand the decoded form, thus it truly is working at the code level, not in binary.
Mixing operations is bad
We need to work with strings at the code level. SQL, HTML, JSON, source code, HTTP, are all defined in terms of a string of codes. Manipulating and parsing documents tends to be at the code level (when not binary). Operations at this level tend to be very well defined. Operations at the text level are very ambiguous, as they are subjective and locale dependent.
Yet our strings are filled with functions that work somewhat at the text level. Perhaps the most dangerous of these is string formatting: in C the printf family, in C++ the text streams, in Python the format function. From experience we know that you can’t use the generic formatting functions to produce SQL statements, otherwise you’ll be open to SQL injection attacks. Similarly we can’t create HTML using simple string formatting or concatenation, since that would lead to corrupt documents. So why does a string even have a format function if it isn’t actually applicable to strings?
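The SQL case is easy to reproduce. The following is a minimal sketch using Python’s sqlite3 module with an in-memory database; the table, its rows, and the hostile input are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a1"), ("bob", "b2")],
)

user_input = "nobody' OR '1'='1"  # hypothetical hostile input

# String formatting splices the input into the SQL string, so its
# quote codes become part of the statement's structure.
unsafe_sql = "SELECT secret FROM users WHERE name = '{}'".format(user_input)
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Parameter binding keeps the input as a value, never as SQL.
safe_rows = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows))  # 2: every secret leaks
print(len(safe_rows))    # 0: no user has that name
```

The formatting function did exactly what it was asked to do. It just has no idea that the result is supposed to be a SQL statement.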
The format function is also not very good at producing text. It doesn’t know about document formats so can’t properly produce an HTML fragment. It can’t handle basic localization issues like pluralization. It doesn’t understand character widths, joiners, and non-characters, so can’t even do basic alignment correctly. It’s also not extensible, and thus never seems to handle the myriad formatting issues real text tends to have. It’s far better to use a template language that is aware of these issues and specially tailored to text handling.
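The alignment problem can be seen with a single combining accent. In this Python sketch, approx_width is a hypothetical helper of mine that simply skips combining codes; it ignores wide characters, joiners and everything else a real width calculation needs, so treat it only as a rough illustration.

```python
import unicodedata

plain = "cafe"
accented = "cafe\u0301"  # 'e' followed by a combining acute accent


def approx_width(s):
    """Hypothetical helper: count the codes that are not combining marks."""
    return sum(1 for code in s if unicodedata.combining(code) == 0)


for word in (plain, accented):
    cell = "{:<6}|".format(word)  # pad to six codes, then a column marker
    print(len(cell), approx_width(cell))
# 7 7
# 7 6
```

Both cells contain seven codes, but the accented one occupies roughly six columns on screen, so its column marker lands one position too far to the left.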
In my previous article I tested a variety of string operations. A lot of the unusual results come from mixing of string and text handling. If we can’t decide exactly what is a string, and what is text, some of those operations are not possible to unambiguously define.
An anachronism
We need to stop mixing our string and text operations. A whole range of security issues, like SQL injection and cross-site scripting attacks, happen because of this mixture. Our text handling is also bad because of it. We don’t sort things correctly. We don’t format things correctly. We do truncation wrong. We handle combining characters poorly.
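Two of these failures take only a few lines of Python to show. The word list is made up, and what counts as the "correct" order depends on the locale, so this sketch only demonstrates what the code-level operations actually do.

```python
names = ["apple", "Zebra", "\u00c9mile"]

# Sorting compares codes: 'Z' (#90) < 'a' (#97) < 'É' (#201).
print(sorted(names))  # ['Zebra', 'apple', 'Émile']

word = "cafe\u0301"  # displays as "café", but holds five codes

# Truncating to four codes silently strips the accent.
truncated = word[:4]
print(truncated)  # cafe
```

Neither result is wrong for a string of codes. Both are wrong for text.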
Even if we know about the problem it is hard to avoid. Our current languages make it very easy to mix text and strings, and that leads to the bad behaviour. The mixture works much of the time, which just makes it easy to forget the issues. I don’t know of any language that offers proper text handling as part of its standard library. At this point it’s hard to even say what that library needs to offer. That doesn’t make it any less essential.
Certainly though something needs to change. Hopefully I can address these problems in Leaf, but I’m not sure I fully understand what needs to be done yet. What’s important is that we all stop looking at strings as text handling types, and start looking for proper alternatives.