A trail through custom languages

How often does one need to write a custom language? Many projects shun them, but I think domain-specific languages are important. They offer unique features and make certain tasks easier. In this article I’d like to go over some of the languages I’ve developed. This gives a bit of background and is part of why I am now working on yet another new language.

Miranda: Imperative Script

I probably developed my first language in my first year of university. We studied a functional language called Miranda (which later served as inspiration for Haskell). I immediately set to work on the one thing least likely to be solved by a functional language: creating an imperative scripting language.

What came out wasn’t much. I mainly created a simple language to produce plan files and ASCII login graphics. It barely qualified as a language and was certainly not Turing complete. Nonetheless it was a seed that got me thinking about languages.

Mortar: HTML Transformations

When I started working with Web development in the mid 90s there were few tools available for HTML work. There were editors, of varying quality, that assisted in writing HTML but none really captured what I thought should be possible. I was getting tired of updating the footer on all my pages. This inspired me to develop a product called Mortar while working at a startup company called Big Picture Technologies.

Inside this product was an HTML transformation language. The syntax was HTML-based and allowed typical operations like variable assignment, loops, and function calls. It was nothing very complicated, but it worked. It also introduced me to a build process, as each source file needed to be transformed to its output format (this was done statically). Basic dependency tracking was also supported. As in C, if you modified a header, the files using it would also change.
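
As a rough illustration of that kind of dependency tracking (not the actual Mortar code, which isn’t shown here), the sketch below works out which files need to be regenerated after one file changes. The file names and the `includes` mapping are invented for the example.

```python
from collections import defaultdict, deque

def files_to_rebuild(includes, changed):
    """Return every file that is stale after `changed` is modified.

    `includes` maps each source file to the files it pulls in
    (a hypothetical structure, chosen for this example).
    """
    # Invert the mapping: for each file, who uses it?
    used_by = defaultdict(set)
    for source, deps in includes.items():
        for dep in deps:
            used_by[dep].add(source)

    # Walk outward from the changed file, following "used by" links.
    stale = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for user in used_by[current]:
            if user not in stale:
                stale.add(user)
                queue.append(user)
    return stale

includes = {
    "index.html": ["footer.inc", "nav.inc"],
    "about.html": ["footer.inc"],
    "nav.inc": ["logo.inc"],
    "contact.html": [],
}
print(sorted(files_to_rebuild(includes, "footer.inc")))
# ['about.html', 'footer.inc', 'index.html']
```

Changing `logo.inc` in this example marks `nav.inc` and then `index.html` as stale, since the walk follows includes transitively.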

It may sound a bit like XSL. However, XSL didn’t exist at the time, and my approach embedded the programming directly into the source HTML files. In fact I stuck to using this product until XSL was good enough to replace it for my web sites.

NewLang: Constraint-Based Typing

At some point I got interested in type inference for statically typed languages. I’m a believer in statically typed languages, but I also dislike having to write out types explicitly: the compiler should be doing as much of the work as possible. I set about an experiment to make something work via type inference wherever possible. In my repository this got the simple name of “newlang”.

The key to this language was constraint-based typing. No function on its own needed to be fully typed: Through the interaction of a chain of functions, the compiler could figure out what type was required to properly work. This communication worked forward and backward, so a constraint in a higher function could affect a statement in a lower one. Each statement in a function, and the function signature, would simply add constraints to each variable. The compiler would, at the entry points, then figure out which type would work. The executable was produced by generating C++ code which could then be compiled by a native compiler.
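
To make the idea concrete, here is a very small sketch of constraint collection, assuming a fixed universe of three types and a union-find structure to link variables that must share a type. None of this is the newlang implementation; the `Solver` class, its methods, and the variable names are all made up for illustration.

```python
# Hypothetical sketch of constraint-based typing; all names are invented.
ALL_TYPES = frozenset({"int", "float", "string"})

class Solver:
    def __init__(self):
        self.parent = {}   # union-find links between variables
        self.allowed = {}  # representative variable -> types still possible

    def _find(self, var):
        if var not in self.parent:
            self.parent[var] = var
            self.allowed[var] = ALL_TYPES
        while self.parent[var] != var:
            self.parent[var] = self.parent[self.parent[var]]  # path halving
            var = self.parent[var]
        return var

    def restrict(self, var, types):
        """A statement limits `var` to one of `types`."""
        root = self._find(var)
        self.allowed[root] = self.allowed[root] & frozenset(types)
        if not self.allowed[root]:
            raise TypeError(f"no type satisfies the constraints on {var!r}")

    def same_type(self, a, b):
        """A call or assignment forces `a` and `b` to share a type."""
        ra, rb = self._find(a), self._find(b)
        if ra == rb:
            return
        merged = self.allowed[ra] & self.allowed[rb]
        if not merged:
            raise TypeError(f"{a!r} and {b!r} cannot share a type")
        self.parent[rb] = ra
        self.allowed[ra] = merged
        del self.allowed[rb]

    def candidates(self, var):
        return set(self.allowed[self._find(var)])

solver = Solver()
solver.restrict("scale.x", {"int", "float"})     # scale(x) does arithmetic on x
solver.same_type("show.v", "scale.x")            # show(v) passes v to scale
solver.same_type("main.n", "show.v")             # main passes n to show
solver.restrict("main.n", {"float", "string"})   # n comes from a float-or-string source
print(solver.candidates("scale.x"))  # {'float'}
```

The constraint added in `main` narrows the parameter of `scale`, and the arithmetic in `scale` rules out `string` for `main.n`, which is the forward and backward flow described above. A real solver also has to handle function types, overloads, and generics, which is where the time and memory costs grow.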

This was a good lesson for me: basically, this approach doesn’t work. Compilation is simply too slow and the memory needs are far too high, because the source code essentially needed to be “solved” with a constraint engine. (Incidentally, this language soured me on Bison and is one of the reasons why I don’t use parser generators anymore.)

TestPlan: Web Automation

For quality control at eCircle, we needed a way to automate web page interactions. At the time most available tools were terrible (and in my opinion they still are): the record and playback approach is fundamentally broken. At first we used Java automation tools like HTTPUnit, but we were dissatisfied with the level of redundancy and the difficulty of setting up scripts. We weren’t upset with the underlying tool, just with how hard it was to write tests.

So I wrote a parser and started to simplify things. We already had a lot of tests, and we knew that trying to do everything in this higher level language would not be possible. For this reason our script “units” were also codable in Java. Thus one program could happily skip between the script language and Java. Eventually the language grew powerful enough that very little test code needed to be written directly in Java. The engine was Java, but all automation scripts were in the scripting language.

The core feature of this language was exposing all documents as XML and supporting XPath natively. Assignments could be done directly from an XPath, verification matched against locators, and loops could use a selector to loop over a set of nodes. Obviously HTML web pages were included, but we also converted email, CSV files, and a few other things into the same format. This allowed the automation to use XPath to navigate all document types.
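
The sketch below shows the general idea in Python rather than in TestPlan’s own syntax (which isn’t reproduced here): a CSV file is converted into an XML tree so that it can be queried with the same XPath-style expressions as a web page. The `<table>/<row>/<col>` layout is an assumption made for this example, and `xml.etree.ElementTree` only supports a subset of XPath.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(text):
    """Convert CSV text to <table><row><col name="...">value</col></row></table>.

    The element names are invented for this sketch.
    """
    table = ET.Element("table")
    for record in csv.DictReader(io.StringIO(text)):
        row = ET.SubElement(table, "row")
        for name, value in record.items():
            cell = ET.SubElement(row, "col", {"name": name})
            cell.text = value
    return table

# A well-formed page stands in for real HTML, which usually needs a
# forgiving parser before it can be treated as XML.
html_doc = ET.fromstring(
    "<html><body><ul id='users'>"
    "<li class='user'>alice</li><li class='user'>bob</li>"
    "</ul></body></html>"
)
csv_doc = csv_to_xml("name,plan\nalice,free\nbob,paid\n")

# Loop over a node set selected from the page.
page_users = [li.text for li in html_doc.findall(".//li[@class='user']")]

# Use the same kind of selector on the converted CSV document.
paid_users = [
    row.find("col[@name='name']").text
    for row in csv_doc.findall("row")
    if row.find("col[@name='plan']").text == "paid"
]

print(page_users)  # ['alice', 'bob']
print(paid_users)  # ['bob']
```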

Another feature of the language was a context system. Variables were stored in a tree hierarchy, as opposed to a flat global space, allowing each script to override some variables for itself and its function calls. This also allowed us to expose all configuration values directly as variables. The calling convention was also done via this context system. The interpreter did a lot of work to support what looked like a relatively simple system.
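
A minimal sketch of such a tree of contexts, assuming lookups simply walk up to the parent when a name is not set locally. The `Context` class and the variable names are hypothetical and not taken from TestPlan.

```python
class Context:
    """One node in a tree of variable scopes (hypothetical sketch)."""

    def __init__(self, parent=None, **values):
        self.parent = parent
        self.values = dict(values)

    def get(self, name):
        # Look locally first, then walk up towards the root.
        ctx = self
        while ctx is not None:
            if name in ctx.values:
                return ctx.values[name]
            ctx = ctx.parent
        raise KeyError(name)

    def set(self, name, value):
        # Assignments only affect this context and its descendants.
        self.values[name] = value

    def child(self, **overrides):
        return Context(self, **overrides)

# Configuration values live at the root.
config = Context(base_url="http://test.example", timeout=30)

# A script overrides one value for itself and for whatever it calls.
login_script = config.child(timeout=5)

# A call made by that script receives its arguments as a further child.
helper_call = login_script.child(user="alice")

print(helper_call.get("timeout"))   # 5
print(helper_call.get("base_url"))  # http://test.example
print(config.get("timeout"))        # 30
```

Passing call arguments as a child context is one way a calling convention could be built on the same mechanism; the override disappears as soon as the call returns and the child is dropped.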

This product, TestPlan, is still in use and if I had more web automation to do I would certainly continue using it.

Haxe: Preprocessor Extension

Flash 9 had just come out and I wanted to do some Flash programming. ActionScript didn’t appeal to me, so I found a nice alternative called Haxe. At the time it was still quite new, and it was missing some things I wanted. I didn’t extend the language, however. Instead I wrote a toolchain that ran my source through a preprocessor called M4.

Once I had the preprocessor, I proceeded to create all sorts of useful extensions to the language. None of these fundamentally altered Haxe, but they allowed me to do many things which were otherwise inefficient or tedious. I even ended up creating a mini-DSL for animations. I’m not sure that was necessarily the best approach, though I did use it in several games that I wrote.

A new feature I played with is something still missing in most languages: package inheritance. In C++ or Java you can inherit from individual classes, but this is extremely limiting. In some cases what you want is to inherit from an entire package. This comes up a lot when using engines, and people who have encountered the issue will know exactly what I mean. So between my M4 layer and GNU Make I created such a system for a Flash-based Web site called BigTPoker that I developed independently (the idea was to help people improve their poker skills by playing simpler games, for example to test their ability to evaluate odds). This let me create new games very quickly without any redundant code.

I’m not sure if I would still use my M4 layer today; Haxe has grown a lot and many of the things I needed it for have reasonable alternatives now.

Using a new language for financial trading

In my most recent project I needed a way to write small distributed programs (for an automated trading platform). Objects essentially had split personalities and existed on multiple machines. Each machine had to arrive at the same decisions, respond to user input, and react to the same events. None of the machines would obtain such information at the same time however (or ultimately even in the same order).

This is the first language where I used an actual virtual machine: The compiler produced a byte-code which was then loaded into the virtual machine. It was a simple stack-based machine (at first it was register based, but without a JIT, the stack model was more efficient). Later I introduced a simple optimizer for the byte-code. As part of the optimization there was also a link phase. I could bind objects directly to their host native objects, saving countless lookups.
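
For readers who haven’t seen one, a stack machine of this general kind can be sketched in a few lines. This is only an illustration: the opcodes and their encoding are invented here and have nothing to do with the actual trading language.

```python
# Opcodes for a toy stack machine; names and encoding are invented.
PUSH, ADD, MUL, HALT = range(4)

def run(code):
    """Interpret a flat list of opcodes and operands; return the top of the stack."""
    stack = []
    pc = 0  # code pointer, advanced by hand
    while True:
        op = code[pc]  # fetch and decode the next instruction
        pc += 1
        if op == PUSH:
            stack.append(code[pc])  # the operand follows the opcode
            pc += 1
        elif op == ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == MUL:
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op} at position {pc - 1}")

# Byte-code a compiler might emit for (2 + 3) * 4.
program = [PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]
print(run(program))  # 20
```

The chain of `if`/`elif` tests plays the role a `switch` statement would in C or C++: every instruction pays for a fetch, a decode, and a dispatch before any useful work is done.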

One lesson I took from this effort has to do with efficiency. A virtual machine that interprets byte-code has significant speed problems. Despite my optimizations, it was not possible to get close to the speed of a natively compiled program. In our case, we simply wrote all the slow bits in C++ and exposed very high-level functions to the language. The VM is essentially a giant switch statement and must manually increment code pointers and decode instructions. These are basics which an actual processor does extremely efficiently and in parallel with computation. You also lose processor optimizations like branch prediction (that is, the processor will optimize the VM code execution, but not the code running in that VM). Introducing a JIT for a simple byte-code language is not so hard, but it simply isn’t a step we took.

Part of the decision to create this language had to do with testing. The setup for tests was complicated enough that testing directly in C++ became too difficult, so we created a simulator. The simulator would play the role of many computers and could be controlled via a simple language. To make my work easier, the simulator accepted YAML input as its script. On top of this I built a language using M4: the M4 macros would resolve themselves into a proper YAML document. Naturally, the trading language could be used directly within the tests, which made testing different scenarios a lot easier.


All these language endeavours have led me to my new language project, codenamed Cloverleaf. My motivation is that I’m just not satisfied with the general-purpose languages which are currently available. My prior work is a confidence builder that I can build the language I want. It will be a lot of work, but definitely possible.

Cloverleaf is intended to be a statically typed and compiled language. The goal will be to replace C and C++ as systems programming languages. It should also be good enough to replace a lot of intermediate-level languages. Therefore it must be very efficient: on par with hand-optimized C or C++ code. This is perhaps the toughest goal to achieve (when combined with my other goals).

My approach will be first to get a quasi-interpreted language working with many of the core features. From there I’ll probably switch to compilation: I’ll produce byte-code for a custom virtual machine. This will help identify further issues needed to do the final compilation. This is important since while some features are easy to implement in a VM, they may be very difficult to compile. When it comes time to go the final step, and produce actual machine code, I will rely on something like LLVM.

I’ll use my blog to provide updates and continue with general discussion about language design.


  1. parttimenerd (@parttimen3rd)

    To respond to your initial question about the importance of writing a new DSL:
    It depends on the project whether you should use a DSL or not. In a tiny project of mine I wanted the users (developers using my project) to be able to input geographical coordinates. It would have been fully sufficient to let them input the coordinates in a JSON-like array structure, but with a tiny DSL, which is simply a data format that supports coordinates and some simple arithmetic natively, it’s much easier to type in the data (as the DSL understands the syntax of the coordinate representation used by Wikipedia). But I’ve also had a friend who wanted to create a DSL able to deal with matrices and other mathematical stuff natively, and who now just writes a library in Ruby instead, as Ruby has features like operator overloading and missing methods, and execution speed isn’t really important but speed of development is. (Both projects are only spare-time ones.)

    It’s a bit off topic, but I’m going to try to write a parser for a tiny DSL from scratch – without a parser generator. I hope to learn more about parsing this way, as parser generators hide a lot of complexity from the programmer. One of your recent posts (and also this one) made me curious about what’s beneath the surface. I wrote a tiny parser (19 lines of Ruby) for brainf**k, which was so much fun that I can only agree that writing your own parsers for (your own) languages is really great.

    1. mortoray

      I’m happy if I’ve made you curious about language parsing. It’s a very good ability to have. The first part of your comment touches on something important as well: DSLs range from the very simple to the very complicated. Essentially any custom data format is a type of DSL. Not all DSLs need be imperative, or functional, languages. It’s fun to try out new concepts, and it often makes using your product a lot easier in the end.

    2. parttimenerd (@parttimen3rd)

      I’ve finished my tiny parser. It’s a simple parser for mathematical expressions (https://github.com/parttimenerd/parser-experiments/tree/master/math-parser) – it was cool when my parser finally produced its first parse tree…

      I will hopefully learn about building parsers in greater detail at university in a few months, so all the DSLs I have written so far were written for educational purposes, to learn new concepts and algorithms for designing a language or a parser.
      The fun I have designing DSLs and writing parsers is in fact one of the reasons why I’m going to study computer science, as it’s really one of the most fascinating things you can do with programming skills…

    3. mortoray

      Parsing expressions is the cornerstone of parsing programming languages (well, unless it’s Lisp). You may wish to look up the “Shunting Yard Algorithm” for hand-coding such parsers. My parser in Cloverleaf uses that approach at its core.

      Parsers, compilers, and VMs are certainly very interesting topics in computer science.

    4. parttimenerd (@parttimen3rd)

      Thanks for this algorithm. I had never heard of it, but it seems to be better than my approach.

      I hope I’ll be able to write a VM at some point, as it seems to be cool, especially when you write several tiny DSLs for your VM.

      An interesting fact: some days ago I tried to write a blog article about why developing new DSLs is bad, but while I was writing it I realized that most of my arguments were nonsense when you use a VM like Java’s, or write your parser in your main language, or only use the DSL for one specific area (e.g. a data format or a config language).
