
We don’t need a string type

Should string be a distinct type or merely an alias for an array of characters? I'm considering the options for Leaf and can't convince myself that a specialized type is needed. Looking at C++, a 'string' and a 'vector<char>' are almost the same. Specialized operations, like case conversion, appear bound to the underlying 'char' type, not to the collection.

What is a string

A string is nothing more than a sequence of characters. To be more precise, a character in programming is a code point in a particular character set. A single code point doesn't always encode a logical glyph; it may be a combining character, joiner, or other such control signal. Does this distinction play a significant role?

Consider the ‘length’ function of a string. Is it intended to return the number of glyphs, combined characters, or count of the underlying code points? Should two distinct, but canonically equivalent strings return the same length? Trying to accommodate the complexity of normalization and equivalence rules in a generic ‘length’ function seems ludicrous. Domains where the “true” glyph length is required are uncommon, and would probably require a rendering library. The only sane option for ‘length’ is to return the number of code points — which is indistinguishable from the ‘length’ of a character array.

We can support that decision by considering the behaviour of a subscript, or indexing, operation. Should we index the logical character, or a specific code point? Again, looking at Unicode combining characters, there is no actual representation, or even definition, of what a "logical" character is. Sequences of combining marks are unbounded, thus no fixed-size type could even represent a "logical" character. It would seem that operations on the string need to be done at the code point level, exactly the behaviour of the same operation on an array of characters.
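
A quick sketch in C++ (purely illustrative, not Leaf code) makes the consequence concrete: an accented glyph written with a combining mark is one "logical" character but two code points, and both length and indexing operate on the code points.

    #include <cassert>
    #include <string>

    int main() {
        // "é" written as 'e' plus U+0301 (combining acute accent):
        // one glyph on screen, but two code points in the data.
        std::u32string s = U"e\u0301";
        assert(s.length() == 2);    // length counts code points, not glyphs
        assert(s[1] == U'\u0301');  // indexing yields the combining mark itself
    }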

The C++ difference

A 'string' and a 'vector' in C++ have only one significant difference that I can see: the string is null-terminated. This one small difference is what allows the 'c_str' function to return a pointer directly to the internal memory of the string. (C++11 tightened the specification so that contiguous, null-terminated storage is effectively the only valid way to store the data.)

For C++ this is an essential feature: the class would be nearly useless without the 'c_str' function. It isn't, however, a desirable feature. We're forced to convert strings into their C equivalent just to call API functions. Quite disturbingly, this includes functions in the standard C++ library itself. For example, the constructor of 'ofstream' used to accept only a 'const char*', not a 'string' (C++11 fixed this particular case).
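
A familiar illustration (a minimal C++ sketch): even with a std::string in hand, calling a C-style API means handing over the null-terminated internal buffer via 'c_str'.

    #include <cstdio>
    #include <string>

    int main() {
        std::string path = "data.txt";
        // fopen is a C function: it only understands a null-terminated char*,
        // so the string must expose its internal buffer through c_str().
        if (std::FILE* f = std::fopen(path.c_str(), "r")) {
            std::fclose(f);
        }
    }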

Null-termination is also a sordid tale all on its own. The whole C library of string functions, such as 'strcat' and 'strcpy', is terribly unsafe. Working with C-strings is a perilous and error-prone affair. Very few modern APIs even rely on null termination, requiring an explicit length argument instead.

This primary difference between a C++ ‘string’ and ‘vector’ is really just a historical oddity that many programs don’t even need anymore.

Surrogates and variable-length encodings

The preceding considers characters at the code point level: a single value is the direct encoding of a character. Storing strings this way is often inefficient, so more compact variable-length encodings are frequently used: differing numbers of bytes represent a single code point. For example, in UTF-16 a code point may comprise a 2-byte or a 4-byte sequence. Values that don't encode a full character on their own, but are only part of such a sequence, are called surrogates.

The surrogates are not the same as combining characters. A surrogate has no meaning in the character set: it is merely an encoding primitive. Working with the string at the character level requires actual code points. Trying to work with an encoded string as a sequence of characters is very troublesome and leads to ambiguities. What should the ‘length’ function return, the number of encoding values, or the number of code points?
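
To make the distinction concrete, here is a small C++ sketch (illustrative only): a single code point outside the Basic Multilingual Plane becomes two UTF-16 code units, and neither unit means anything on its own.

    #include <cassert>
    #include <string>

    int main() {
        // U+1F600 needs a surrogate pair in UTF-16: two code units, one code point.
        std::u16string s = u"\U0001F600";
        assert(s.length() == 2);                   // counts code units, not code points
        assert(s[0] >= 0xD800 && s[0] <= 0xDBFF);  // high (lead) surrogate
        assert(s[1] >= 0xDC00 && s[1] <= 0xDFFF);  // low (trail) surrogate
    }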

It is tempting to use encoding as a justification for a special string type. With a proper abstraction you could store strings in a variety of formats yet expose them as a sequence of characters. This would, for example, allow using UTF-8 strings directly in string operations. The 'length' function would return the number of characters; the underlying encoding would be irrelevant (perhaps exposed via special functions).

A big challenge in such a class is efficiency. A basic operation like accessing a character by index is now a linear operation. It requires decoding the string, scanning from the beginning, assembling surrogates, and counting the resulting characters. Even simple forward scanning requires a potential loop and stateful extraction of the next character. This overhead will multiply through all basic routines, such as splitting, translation and regex matching.
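
A rough sketch of what that costs (the helper below is made up for illustration): finding where the n-th code point of a UTF-8 byte string starts requires scanning every byte before it.

    #include <cstddef>
    #include <cstdint>
    #include <string>

    // Returns the byte offset where the n-th code point begins, or npos if the
    // string holds fewer than n+1 code points. Linear in the length of the string.
    std::size_t utf8_offset_of(const std::string& bytes, std::size_t n) {
        std::size_t count = 0;
        for (std::size_t i = 0; i < bytes.size(); ++i) {
            // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
            if ((static_cast<std::uint8_t>(bytes[i]) & 0xC0) != 0x80) {
                if (count == n)
                    return i;
                ++count;
            }
        }
        return std::string::npos;
    }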

I don’t know of a language that handles strings in this fashion. The efficiency problems leave it a very unattractive option. It’s just so much simpler to decode while loading and work in the character domain. It requires more memory, but I can’t think of any domain where this would actually be a significant issue — even massive strings are small compared to other assets.

Library support

Strings have a lot of specialized operations in comparison to generic arrays: normalization, character translation, regular expression matching, parsing, formatting, stripping, encoding, et cetera. To be pedantic though, a vector of any type will always have specialized operations: numbers have summations, averages, medians; vertices form meshes which can be transformed, simplified, rasterized.

The discussion may instead be one of free functions versus member functions. Should all these special operations be member functions, thus requiring a distinct 'string' class? Or should they all be free functions that would work just fine with the generic 'array' class? Certainly all these functions could be written as free functions; none of them require special access, and the public interface of an array is enough.

Perhaps the issue is just one of syntax. If I could write `str.toUpperCase()` with free functions, the question is moot. D does exactly this with uniform function call syntax. I think we should also look at the trend reversal in C++11: many operations are now available as free functions rather than member functions. Popular opinion seems to be leaning in the direction of free functions.

If free functions are the preferred approach, then there is no need for a dedicated string class. String operations can be written as free functions on an array of characters.
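
A minimal sketch of that style (illustrative C++, not Leaf syntax): the operation is a free function over a plain array of characters, using nothing beyond the array's public interface, with the case rules bound to the character type.

    #include <algorithm>
    #include <cctype>
    #include <vector>

    using text = std::vector<char>;   // just an array of characters

    // A free function; no string class required.
    text to_upper(text s) {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
        return s;
    }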

A string of what

What if we need strings of something other than characters? By default Leaf will probably use the Unicode character set, but perhaps you wish to use just ASCII, or Latin-1, in your code. Most string types I've seen tend to ignore this issue. Some languages, like PHP, let you specify this at a global level. A type labelled simply as 'string' doesn't allow much variation.

So let's assume characters can be marked as ASCII. To have a string of these characters makes string a kind of template class, 'string<ascii>'. It is still necessary to refer to individual characters, so a type 'char ascii' is required as well. Once I start adding these modifiers I don't like the feel of 'string' anymore; it seems like it says something other than 'array'.

Going back to our discussion of variable-length encoding: what if you really wanted to have UTF-16 strings, and live with the surrogates as-is? Saying 'string' is now really ambiguous: is it a string of Unicode characters, or of actual UTF-16 code values? It would be clearer to create a type alias and use that. In Leaf we could do 'type utf16 : binary 16bit' and then have an 'array<utf16>'. There is no confusion here about what this is: an array of integral code values, not characters.
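
The same idea in C++ terms, purely to illustrate the aliasing:

    #include <cstdint>
    #include <vector>

    using utf16 = std::uint16_t;            // an integral code value, not a character
    using utf16_data = std::vector<utf16>;  // plainly an array of code values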

Just a typedef

I'm now rather convinced there is no need for a special string class. I will nonetheless have a 'string' in Leaf, but it will merely be a type alias for an 'array' of characters. Basic strings are used often enough that this shortcut is appreciated. For cases where encoding is vitally important, using explicit array types is likely the safer option. Clearly a rich string library is important, but there is no need for a string class to provide it.
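
In rough C++ terms the whole proposal amounts to something like the following (a sketch of the intent; the actual Leaf spelling will differ):

    #include <vector>

    using character = char32_t;             // a Unicode code point
    using string = std::vector<character>;  // 'string' is merely an alias for an array of characters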

Can you think of any other situation where a dedicated string class is needed, or at least very helpful?

Read the followup article "The string type is broken".

Replies

  1. The first thing that comes to mind would be serialization. Say for example I need to create a JSON representation of some model in my system. By inspecting the type of some attribute I would want to be able to determine what serialization scheme to choose for that attribute’s value.

    In the case that all I can tell is that the type of the value is “array” I might then have to inspect members of that array in order to determine the type of data I am dealing with. But even if I find that the array contains chars I am faced with another dilemma.

    Was the creator of the data model trying to create an ordered collection of discrete values that each carry some independent meaning, or trying to combine a set of characters into a single logical "string"?

    In one case I should serialize the data to a JSON array of single character strings, and in the other to a JSON string. But how can I know which?

    A string type helps to solve this problem by allowing us to differentiate between those 2 separate but valid intentions.

    • This is a situation which I also considered, and I actually removed a paragraph about it from the article. I think it is entirely theoretical, since I couldn't think of any situation where I'd truly want to make the distinction. Even if we allow that the serialized form might look different, would you actually use it differently after loading the value?

      I would also think that if you do need very tight control over serialization you'd likely have to use explicit conversions, not rely on the default ones. It's usually only for quick-and-easy serialization where the defaults work; again, a case where it's unlikely you need to distinguish between 'string' and 'array'.

  2. “This overhead will multiply through all basic routines, such as splitting, translation and regex matching.”

    With UTF-8 it doesn't, since any byte-oriented searching algorithm can be used unmodified on UTF-8 (https://en.wikipedia.org/wiki/UTF-8#Advantages_3). This property is likely what led Thompson and Pike to create UTF-8 in the first place – it is efficient for both storage *and* operation.

    I think the golang approach to strings is rather appealing. Strings are always encoded in UTF-8, stored in terms of bytes and exposed as such (except for the iterator). The only downside I can tell is that finding the number of code points or the i-th code point is an O(n) operation – but I think these two operations are not as useful as they seem to be. Usually you either want the byte (which golang already gives you) or the character. Finding the number of characters or the i-th character is O(n) in an array of code points too.

    Iterating code points requires an explicit state machine, and that's exactly what an iterator is. The string iterator in golang iterates in terms of code points. For example, `for i, r := range "世界" { fmt.Println(i, r) }` loops twice instead of 6 times.

    • But why would I want to use a byte-oriented search algorithm on a string? I want all of my string routines to work at the level of real characters, and not the bytes.

      So, in Go, you are saying the following would not be equivalent (excuse my pseudo-code, as I don’t know Go)?

      str = "世界"
      
      a = str[1]
      b = value_of( next( iterate(str) ) )
      
      assert( a == b )
      

      I don’t think I would trust a language where that assert fails. Accessing via an index and via an iteration should always yield the same results.

    • Ruby handles mortoray's test case…

      # encoding: UTF-8
      str = "世界"
      a = str[0]
      b = str.chars.first
      puts a == b

  3. (It doesn’t seem possible for me to reply to your reply.)

    Working with “real characters” is hard. If I understand correctly what you are actually proposing is working with code points. What’s the advantage of code point arrays over UTF-8?

    Enumerating two major classes of common string operations here: 1) pattern matching – including finding, splitting, substitution; 2) "real character" operations – counting the number of characters, reversing the characters, finding the i-th character, etc.

    For 1) code point arrays and UTF-8 work equally well. For 2), code point arrays “sometimes work”, which is worse than “never works”.

    The space overhead involved in code point arrays is not about the RAM or hard drive; it's about networking. Code point arrays, essentially UTF-32, are not a viable transmission encoding, making encoding/decoding compulsory when talking to the network. UTF-8 OTOH is a viable transmission encoding *and* quite easy to work with internally.

    As for the indexing vs. iterating equivalence, golang fails the assert. But again I doubt the usefulness of finding “the i-th code point in a string”.

    P.S. the only way of using iterators in golang is the for-range syntax. There is no explicit iterator object.

    • 1) But byte arrays do not work, especially for splitting operations. Strings should be split at the character level (if not a higher grapheme level), never at the byte level. By splitting at the byte level you can actually create invalid UTF-8 strings.

      For any other type I wouldn’t want sub-structure access — it’d be bad for an array index to return a pointer to the middle of a structure. Yet that is exactly what you are saying strings should do: return a pointer to the middle of a character.

      2) Can you give an example where code points don’t actually work?

      I never said one should transmit strings in UTF-32, that would be silly. A string at a logical level is very different from an encoded string. Serialization should convert to UTF-8 and back. This is the same for any structure: the serialized and in-memory forms are different.

      (PS. Just reply to the original message to followup the reply)

    • "1) But byte arrays do not work, especially for splitting operations. Strings should be split at the character level (if not a higher grapheme level), never at the byte level. By splitting at the byte level you can actually create invalid UTF-8 strings."

      Not really. If I split on the n-th character of the string, then yes, it's possible. But people usually don't split on the n-th character, they split on a found character (like splitting a path on '/'). And this is totally safe.

    • For certain token splitting operations you are correct. There are however many length-based truncations on strings that occur, especially where fixed-schema DBs/protocols are involved. I would argue that token splitting also tends to work only because we are typically looking for tokens that happen to be in ASCII, and are thus one byte in UTF-8 as well (punctuation like dots, slashes, colons). Once you need to split on multi-byte Unicode punctuation I believe you'll have the issue again.

  4. Ah… I forgot to reply to your reply until I received a notification mail for today's comment :)

    UTF-8 does work for splitting since it's a self-synchronizing code (https://en.wikipedia.org/wiki/Self-synchronizing_code), which is best illustrated by an example. Say you decide to split a string on '。' (the Chinese and Japanese period), or '\xe3\x80\x82' in Python 2 notation, or the bytes 11100011, 10000000, 10000010.

    UTF-8 has the property that for code points that take n bytes to encode (n > 1), the highest n bits of the first byte are 1 and the (n+1)-th highest bit is 0. Following bytes all begin with 10. So looking at 11100011, we know it begins a 3-byte encoding.

    That leads to the core point: wherever you see this byte sequence, it can only be the period mark. It can't be the proper suffix or the middle part of a longer sequence, since its first byte does not begin with 10. It can't be the proper prefix of a longer sequence, since its first byte already encodes its length, which is 3.

    I suppose I'm not explaining it very well… But it's actually quite clear when you work through a few examples.

    Replying to “2) Can you give an example where code points don’t actually work?”, well, actually you made it clear in the article that a “real character” may be composed of several codepoints.
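
    As a concrete illustration of the self-synchronizing property (a small C++ sketch, added here for clarity): a plain byte-level search for the encoded period can only ever land on a real '。', never inside another character's encoding.

        #include <cassert>
        #include <string>

        int main() {
            // UTF-8 bytes of '。' (U+3002): E3 80 82.
            const std::string period = "\xE3\x80\x82";
            // "你好。世界。" written out as raw UTF-8 bytes.
            const std::string text = "\xE4\xBD\xA0\xE5\xA5\xBD\xE3\x80\x82"
                                     "\xE4\xB8\x96\xE7\x95\x8C\xE3\x80\x82";
            // A byte-oriented search: any match must be the period itself,
            // since no other character's encoding can contain these bytes.
            assert(text.find(period) == 6);  // right after the 6 bytes of "你好"
        }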

    • I don't think we're disagreeing, just wording it differently. As you say, UTF-8 is self-synchronizing and you can easily split UTF-8 encoded data. This is true, so long as you have a UTF-8 aware processor. I'm not denying this. What I'm saying is that raw byte manipulation of a UTF-8 data stream will lead to problems. That is, if you only access your UTF-8 string through specialized functions, then you effectively have a code point array.

    • Well, I think the code point array vs. byte array difference is significant; it has performance implications. According to you, "we don't need a string type, let's just use code point arrays", while I prefer "we do need a string type, and it is a constrained byte array" :)

    • Okay, we agree it has performance implications, but I think we are both leaning in different directions on which ones. In any case, Leaf will use `array<char>` for now (with 32-bit chars), because I think it'll be easier to introduce a `string` type later than to have to remove a broken one.

  5. String makes sense – as an implementation of a more general "Stringlike" class, with different implementations behind it. This helps reuse: simpler methods that just want a sequence of letters can ignore that elsewhere you really wanted a specific way of storing them.

    Two common ones:
    * shared arrays, substrings over them. you mention this; this can actually be quite handy if you for example want to work with suffix lists. not that common, just grabbing an example
    * ropes, or linked lists of strings representing a concatenation. these have wide use in any system where you are pretty-printing. yes, you could have a "collector" that does the job as well, but being able to hide this inside a string makes it much easier to pass them around

    immutable strings? tricky one. I used to like using them for their consistency guarantees, but that said, these days I am always wrapping them in another type to let the type system do some work for me. thus I can enforce the immutability at the higher level (I have yet to try Leaf to see what your functional style can do for correctness guarantees)

    on that note, I love the book Purely functional data structures. by far not for everyday use, but inspiring

    • These variations on strings, or data structures, are one of the key reasons why I don't think a generic string class makes sense. It leaves lots of space open for custom implementations tailored to use scenarios. The generic `array` form then serves as a simple backing for the common operations which don't need that library, and also provides the basic interchange.

      Look at C++'s 'string' class. The first definition left it rather opaque as to how it was stored. Now, based on feedback, it's been defined to be exactly a contiguous in-memory array. The abstraction completely lost out to practical considerations.

  6. I’m glad others arrived before I did to make the points I wanted to make. UTF-8 is amazing. :D

    I too am working on an LLVM-backed language. It is greatly empowering being able to implement your own ideas into the very fabric of a toolkit. I am glad you had the courage to publicly discuss your ideas.

    Personally, I side with the others. I certainly respect your desire to simplify language concepts. (Less is definitely more when it comes to language design!) However, while strings and vectors may have a lot in common in C++, I think the answer is not to unify those concepts. Rather, send them in opposite directions as modern languages have done. A string should be an immutable tightly encoded collection of text, while a character array should represent a mutable ice tray of lone characters. (I plan on making strings UTF-8, whereas a character will be a single UTF-32 value).

    • “…as modern languages have done.”

      This is my problem. Can you actually point to a language that handles this correctly? I see that ObjectiveC has a GCString class, but the native strings of all languages I’ve seen tend not to handle Grapheme Clusters correctly. Do any do multi-character upper/lowercase? Which ones handle language specific collation?

    • What do you mean by “correctly”? I think C# and Java went about it fairly well. I disagree with the choice to use UTF-16 (which seems to be the universal compromise between memory usage and still being able to index directly), but they both are immutable and uniformly encoded. In other words, regardless of the encoding I’m reading from (a file containing ASCII, UTF-8, or UTF-32 even), the language’s native string object sticks to UTF-16 as the center junction. So, there is no (practical) way to hold a string object in other encodings, which saves a lot of headaches. I’d be very frustrated if one string was holding ASCII data while another was holding UTF-32. Stream readers/writers should be in charge of those conversions.

      As for multi-character uppercase/lowercase, no, I don’t know if any handle that. I imagine that the simple English set of functions does the job for most code bases, and extended libraries could handle international conversion tools.

    • By correctly I mean proper indexing, splitting and regex support. Java doesn't really offer the safety one would expect of a specialized string class. First off, it does not offer (or at least didn't) a way to index based on grapheme clusters (no consideration of continuation characters). Worse is that indexing and splitting completely ignore surrogates and allow you to create invalid strings by splitting in the middle of a single code value.

      I believe that if a string class doesn’t address these issues it is not any better than a plain array of characters.

    • Oh yes! I was floored when I learned that C# had unsafe indexing into surrogate pairs.

      I suppose my point is that, as others have pointed out, there are certain things people just don't do very often. Sure, indexing into a string switches from O(1) to O(n), but how often are people leaping straight to character #436211? How often are strings that big in the first place?

      Personally, I’d rather have a robust UTF-8 string class than a “fast” character array. I feel like, no matter what you pick, you lose. Java and C# went with UTF-16 and forfeited safety/correctness. C++ went with ASCII and forfeited UNICODE. Personally, I’m willing to forfeit speed for a good UTF-8 string. It’s not that I don’t care about speed. (I do game development. Speed is everything.) It’s just that I want to turn away from dealing with UNICODE support but maintain the highly convenient ASCII compatibility.

  7. Efficient String handling will require an approach similar to Java’s StringBuilder or Haskell’s ByteString: a String becomes a linked list of Arrays. Most languages learned this when it was already too late and programmers have to bear the burden of choosing between performance and adapting to third-party code.

    • But how does using an `array` prevent the creation of a string builder? In fact, by leaving it generic you would gain a generic array builder class, suitable for building arrays of any types, not just strings.

      The performance aspect uncovers another relevant point: what is the purpose of the string? In my opinion, a string is a base unit, not suitable for large text processing tasks. That is, just like trees, lists, and maps are built on simple arrays, so should text processing components be built on simple strings.

      For example, the input to a template system can be a string, with a map of strings, and the output should be a string. Inside that template library however I wouldn't expect the processing to be done on just one large string. I'd expect a builder of some sort for efficiency.

  8. You may want to have a look at the String class hierarchy in VisualWorks Smalltalk. There Strings are collections of Character. A Character is basically a code-point in Unicode. There’re different subclasses of String that can be used for optimized storage in memory, but converting between these classes is done automatically by the strings, depending on what Character is added to them.

  9. In Google’s Go language, a string type is present, but the special thing about it is that it’s an *immutable* array of bytes (like in Python 2.x; note bytes, not characters). It’s designed to store UTF-8 strings, with an explicit API to parse out the code points. Indexing a string gives a numeric value of type byte. This isn’t too far off from the idea of a simple typedef, the only difference being that a string fits in some places where data has to be immutable. When dealing with mutable strings, an array of byte or “Rune” (codepoint) is used.

  10. Just like e.g.

    point translate(point) { …

    is preferable to

    pair translate(double, double) { …

    any function operating on strings will be better with strong type information.
    I agree with most points except for the strong typing bit. I think there may be a compromise solution. E.g. treat whatever like a buffer and have global functions like begin_utf8(T) that will return a specialized iterator. Then you could store your strings however you want and move the "string" concept a step up, to the iterators.

  11. In Erlang there are no strings – just lists of integers, encoded blobs indistinguishable from other binary data or lists containing integers, lists or blobs.
    While the ability to nest stuff until final outputting really helps garbage collection, the lack of a string type makes debugging text processing a developer's hell.
    As usual, the string type is a thing missed most when it's… missing. ;)

    Also think about interoperability. What would developers around the world think about Google using arrays of integers in their web service APIs? Developers are users of languages and APIs. Most of them, unless experienced in assembler, would be really surprised by the lack of a string type in a high-level language…
    …well, it would be cool to have a new high-level language designed without a string type just to read about the reactions. :P

    • I’m not saying we should use an array of integers. I’m in favour of a distinct character or code point type which doesn’t implicitly convert to/from an integer. This allows it to behave differently than a plain list of integers, as well as providing cleaner contracts with functions which operate on it.

    • Sorry, I misread it. Detecting strings for debugging and serialization is of course easy by checking for sequences of that code point type.
      I would advise a type explicitly designated to contain only code points and derived from some other sequence type to make contract definition and string handling even more straightforward. A nice short name for that specialized sequence type would be "string". ;)

      It would also be nice to give that sequence-of-code-points type normalization functionality. It may even know about its current normalization state and use that knowledge to, for example, avoid normalization for comparison if it has already been normalized to the desired form. That would be a feature more easily attachable to a specialized type than to a generic container.
      Maybe not you, but users of your language want to implement things like that in a transparent way. They would attach the additional state to the string type – if it exists.

  12. In Haskell, String is just an alias for a list of characters: type String = [Char]
    It does not even support constant-time access, because it is a (lazy) linked list. I think this matches your requirements rather well. I don't know if the behavior is correct regarding surrogates and the length, but this is a nice step in the right direction.
    Of course, they suffer because they are known to be slow when processing giant files of ASCII text such as csv files, but some alternative data types called for instance ByteString or Text deal with that.

  13. “Perhaps the issue is just one of syntax. If I could write `str.toUpperCase()` with free functions, the question is moot.”

    It isn't if you plan to have late binding, say, if `toUpperCase` is late bound and identified by `str`'s runtime type. In other words, if your language has only ADTs/early binding, the matter is only one of syntax. Otherwise, if free functions do not have an equivalent late-binding mechanism, the choice of defining a procedure attached to an object or making it a free procedure will have an impact, as both define distinct call semantics, impacting static analysis and providing different maneuvers for programmatically adapting an existing system (through delegation/overriding/dynamic dispatch).

    Great post, though!

    • Since everything is planned to be statically typed now that shouldn’t be an issue at first. Eventually I will probably add dynamic types. Those could then have a runtime mechanism for late-binding, or rather runtime resolution of overloads.

  14. Put me down in favor of strings as a native language type. I tend to favor strong typing; I like strings with light-weight type tagging to differentiate, for example, filename strings from company name strings. Lots of functions take multiple “string” parameters, and it’s easy to confuse the order. (Same problem would exist using byte arrays. Languages with named parameters help a lot!)

    To me, a key issue here involves exposing the internals of a character string; your post and many of the comments describe the difference between a string of characters and the underlying bytes that comprise that string. I tend to favor strong data encapsulation, so I lean towards hiding the innards and only allowing access through an interface.

    I also tend to favor strong protection in design. Good design does NOT allow a coder, accidentally or maliciously, to overwrite the middle byte of a three-byte UTF-8 sequence and thus make the string gibberish. Maybe it comes from writing a lot of library code, but I treat other coders' code like coders treat user input. (Or should. If they always did there would be no Heartbleed bug.)

    So I come down liking strings. I don’t need them to be native, but being able to use double- and single-quoted literals is a nice leg-up on using strings in general. I lean towards string libraries for functionality with literals useable as initializers or for comparison. I’m a little leery of languages that overload string operations on standard operators. If I can “+” a string, then I should be able to “-” one. (And concatenation is not addition, damnit.)

    Can’t say I’m on board with byte arrays. To me arrays only make sense if you’re talking UTF-32 values (or raw Unicode code points or other unfolded format). For places where the raw data actually matters (e.g. networking, serializing) I’d lean towards conversion functions. Byte-level operations on bytes; string-level operations on strings.

    FWIW: I agree with the earlier comment about how Java and C# implement strings with UTF-16. If I were designing a language (and, heh, I am), I’d use UTF-32 to get around the multi-token protocols of UTF-16 and UTF-8.

    • In my followup https://mortoray.com/2014/03/17/strings-and-text-are-not-the-same/ I explore a bit the difference between text and strings. There is definitely a place for that string-like type that we know now. It is useful for protocol and structured data. It isn’t so good at handling text.

      Text needs a distinct type. It needn't be opaque, but the interface must be distinct from the underlying memory model.

      I think all types should have strong sub-types, not just strings. A language should offer the ability to make a distinct version of any of the types and prevent mixing by default. String is perhaps one that immediately benefits the most.

    • I think it might be useful to distinguish between native and defined types. It's often true that using defined types is indistinguishable (or nearly so) from using the language's data primitives, but the distinction matters when designing a language. A key question is what stringy capabilities are built in to the language. For example, many languages allow: "hello"[2] or "hello".upper(), which reify quoted literal strings as, respectively, arrays and objects with methods.

      To me, a discussion about how to represent textual data differs depending on the context: designing a language (where we care about what’s “under the hood”) or our ideal defined string type. (In the latter case, I basically want a full-featured Unicode object, perhaps immutable, perhaps opaque(-ish), probably internally represented as UTF-32 with a bucketful of conversion functions! :) )

      I think we’re on the same page talking about code point arrays or UTF-32 arrays. I usually use “code point” to mean the abstract non-negative integer assigned to a language object and UTF-32 as a concrete encoding form that maps those integers into a physical format. Potato, potahto, really.

      To the extent this is about how a string “class” represents Unicode text, I think we agree. To the extent this is about dumping “string” types and using arrays of code point objects or UTF-32 characters, I think I’m actually a bit agnostic now that I think about it. The languages I actually use pretty much dictate what strings are. In situations where I can choose, I probably wouldn’t use bare arrays.

      If a type hierarchy is “vertical”, to me array-ness is “horizontal” (or think of it as an AOP “aspect”). In many languages, any data type can be an array, so array-ness is “something data can be.” To me that isn’t the foundation of a string, it’s a property strings often have.

      In other languages, arrays are a distinct type: “array of objects.” These are sometimes hard to type, because any object can be in any array slot. The overhead of objects and the lack of strong typing both make these kinds of arrays undesirable as string types to me.

      Even at the code point level, a Unicode string has a complicated invariant, so I’m inclined to not expose it (as an array does naturally) and require clients access it via methods that protect the invariant.

  15. You say, “We don’t need strings – just use arrays”.
    Lua says, “We don’t need arrays – just use hash tables.”
    So Lua has immutable strings, which speeds up string comparisons (and thus table lookups).
