LJ Archive

Designing and Implementing a Domain-Specific Language

Ryan Paul

Issue #135, July 2005

“Like everything metaphysical the harmony between thought and reality is to be found in the grammar of the language.”—Ludwig Wittgenstein

In Star Trek V: The Final Frontier, Scotty reminds a junior engineer to use “the right tool, for the right job”. This sage advice is applicable to computer users as well as starship engineers. GNU/Linux users have a particularly impressive assortment of tools at their disposal, many of which feature unique syntaxes that facilitate concise expression of complex operations.

Good tools will reflect the specific needs of the individual problems they are designed to solve. Consider the highly effective text processing utilities awk and sed. With simple commands, users can perform efficient search and replace operations on streams or filter complex data. How much C code would it take to do the same things? Even with a concise general-purpose language like Python, the tasks still will require more typing than the equivalent command-line tools do. Utilities like awk and sed are effective because they interface well with other command-line utilities, and they leverage the power of domain-specific languages (DSLs), syntaxes specialized for a particular group of related tasks.

Despite the vast number of powerful applications developed for GNU/Linux, the right tool isn't always available for any given job. What should a resourceful user do when options are limited? In most cases, it is possible to combine existing tools, possibly creating a new tool in the process. Sometimes, a new tool needs to be made from scratch with a general-purpose programming language. Developers can add value to a new tool and increase its productive potential by implementing a domain-specific language for it.

Development time is an investment, and many programmers endeavor to maximize the return on that investment by writing reusable code libraries. Tools developed with a specialized code library generally expose only a limited subset of the library's features. Developers can provide more extensive access to library functionality by constructing a domain-specific language that can act as an interface. A well-built DSL allows users to employ an intuitive and self-documenting syntax to construct a multitude of highly specialized tools rapidly.

Pitfalls

Implementation of a DSL can be tricky business. Code that parses and validates specialized syntax is difficult to write and maintain, especially if the DSL supports sophisticated control structures. Tools written with DSLs are notoriously difficult to debug, and there will be no IDE available for your new language unless you make one.

One of the most compelling arguments against corporate use of DSLs is the so-called tower of Babel effect. When a number of developers all construct their own individual DSLs, the sheer number of disparate syntaxes can create a tremendous amount of confusion.

When developers perpetually increase the scope of a DSL's target domain, they risk under-specialization. When the target domain grows to unmanageable proportions, the DSL will transmogrify into a personal Perl implementation, and it will cease to adequately fulfill the needs of the individual tasks associated with the actual domain.

Implementation

Meta-programming is the art of writing code that generates or manipulates code. It is the basis for language implementation, and there are many ways to do it. Meta-programming is either static or dynamic, depending on the type system of the implementation language. Static meta-programming typically is done with a preprocessor, and dynamic meta-programming typically is done with macros that are evaluated at runtime.
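To make the idea concrete, here is a minimal sketch of dynamic meta-programming in plain Python rather than Logix. It builds the source of a small function as a string and then compiles and runs it. The names make_power_function and power are invented for this illustration.

```python
def make_power_function(exponent):
    """Generate the source of a function, execute it and return the result."""
    # The generated code is ordinary Python text with the exponent baked in.
    source = (
        "def power(base):\n"
        f"    return base ** {exponent}\n"
    )
    namespace = {}
    exec(source, namespace)  # run the generated code inside its own namespace
    return namespace["power"]


cube = make_power_function(3)
print(cube(2))  # 8
```

A preprocessor does much the same thing ahead of time, writing the generated source to a file instead of executing it immediately.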

A number of excellent open-source language development platforms are available for GNU/Linux. One of the most impressive static meta-programming utilities is Camlp4, an extensible preprocessor for Inria's Ocaml programming language. Camlp4 facilitates rapid development of efficient, type-safe DSLs. Of the available dynamic meta-programming platforms, the best is Logix, an extremely versatile language design system implemented for and with Python.

Looking at Logix

LiveLogix is a consulting and development firm with big plans and innovative ideas. Logix, available under the GPL, is their first major release and the vanguard of their LiveLogix Application Platform, an assortment of versatile and dynamic development tools currently in the early stages of development. Inspired by the dynamism of Python, the syntactic grace of Haskell and the mutability of Lisp, Logix is a unique fusion of features and flexibility.

Logix developers do not build complete formal grammars. Instead, they incrementally define the individual operators that make up a language. It is then possible to combine these operators to form expressions, which the Logix processor can parse and convert into Python byte-code. Logix DSLs optionally can leverage powerful Python language features like control structures, object orientation and list processing. Seamless Python integration and access to the tremendous number of useful libraries and modules available to Python further increase the power and value of Logix.

Logix developers build their programs with either the standard or base Logix dialects. The syntax of the base dialect is like normal Python syntax with a few additional features for language extension. The standard dialect includes a wide variety of excellent syntactic enhancements and unique features.

Experienced Python developers quickly adapt to standard Logix idioms. The documentation contains an excellent introduction for Python programmers that fully explores the syntactic divergences. Many of the substantial differences relate to Logix's special treatment of expressions. All statements return values, so it is possible to write code like this:

x = if 10 * 2 == 20: "yes it is!" else: "no"

A function call is written as a series of expressions:

min 2 6

In the standard dialect, parentheses distinguish individual expressions just as they do in algebra. Parentheses are not a part of the actual call nomenclature. The standard Logix expression:

min 2 6 (min 10 15)

is the same as the Python expression:

min(2, 6, min(10, 15))
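The correspondence between the two notations can be sketched with a small Python translator. This is only an illustration of the juxtaposition rule, not how Logix itself parses code. The function to_python_call is hypothetical, and it handles neither string literals that contain spaces nor the empty-parentheses form.

```python
import re


def to_python_call(expression):
    """Rewrite 'f a b (g c)' as the Python call text 'f(a, b, g(c))'."""
    # Parentheses become their own tokens; everything else splits on spaces.
    tokens = re.findall(r"[()]|[^\s()]+", expression)

    def parse_sequence(position):
        items = []
        while position < len(tokens) and tokens[position] != ")":
            if tokens[position] == "(":
                inner, position = parse_sequence(position + 1)
                position += 1  # step over the closing parenthesis
                items.append(inner)
            else:
                items.append(tokens[position])
                position += 1
        if len(items) == 1:
            return items[0], position  # a lone value is not a call
        head, arguments = items[0], items[1:]
        return "%s(%s)" % (head, ", ".join(arguments)), position

    result, _ = parse_sequence(0)
    return result


print(to_python_call("min 2 6 (min 10 15)"))  # min(2, 6, min(10, 15))
```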

Functions that do not require arguments are the exception. They are still called with trailing parentheses just as they are in Python:

function()

Language extensions are written with defop statements. It is possible to add new prefix, postfix and infix operators as well as special mixfix operators like C's conditional expressions. It also is possible to add new keywords.

Operator definitions consist of an associativity specification, a binding value, the operator syntax and the implementation. Associativity is specified with a single letter, either l for left or r for right. If associativity is not specified, Logix automatically makes the new operator left associative. The binding value specifies operator precedence. The binding value syntax is one of the few things I really don't like about Logix. Even in languages with static syntax, operator precedence tends to confuse me. It's much more difficult to keep track of in a structurally dynamic language. Fortunately, precedence and associativity won't be all that important in simple tool languages.
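For readers who share that confusion, the following Python sketch shows how a binding value and an associativity letter can steer a parser. It uses the well-known precedence-climbing technique, which is not necessarily what Logix does internally. The operator table and its binding values are made up for the example, and the evaluator accepts only non-negative integers and these four operators.

```python
import re

# Hypothetical table: symbol -> (binding value, associativity, function)
OPERATORS = {
    "+": (10, "l", lambda a, b: a + b),
    "-": (10, "l", lambda a, b: a - b),
    "*": (20, "l", lambda a, b: a * b),
    "^": (30, "r", lambda a, b: a ** b),
}


def tokenize(text):
    return re.findall(r"\d+|[-+*^]", text)


def parse(tokens, min_binding=0):
    left = int(tokens.pop(0))
    while tokens:
        binding, associativity, function = OPERATORS[tokens[0]]
        if binding < min_binding:
            break  # this operator belongs to an enclosing expression
        tokens.pop(0)
        # A left-associative operator refuses an equal binding on its right;
        # a right-associative operator accepts it.
        next_min = binding + 1 if associativity == "l" else binding
        right = parse(tokens, next_min)
        left = function(left, right)
    return left


def evaluate(text):
    return parse(tokenize(text))


print(evaluate("2 + 3 * 4"))   # 14: * binds tighter than +
print(evaluate("10 - 4 - 3"))  # 3: left associative, (10 - 4) - 3
print(evaluate("2 ^ 3 ^ 2"))   # 512: right associative, 2 ^ (3 ^ 2)
```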

The operator syntax consists of variables and constants. Constants are enclosed in quotation marks, and variables are specified by type: expr (expression), symbol, term, token, block and freetext. The operator implementation can be either a macro or a function. A function is evaluated at runtime, whereas macros perform code replacement at compile time.

Let's have a look at an example from the Logix documentation:

defop 50 expr "isa" expr func ob typ:
  isinstance ob typ

This describes an isa operator. The new isa operator consists of an expression, followed by the constant isa, followed by an expression. The func keyword indicates that the implementation is a function, and the two symbols that follow are the names of the variables. Each variable in the implementation is associated with a variable in the syntax definition. In this case, the first expr is ob, and the second expr is typ. The code within the func block is evaluated when the operator is called.

In the following line of code:

"test" isa str

the string test is the first expression passed to the isa operator, and the type str is the second expression. The operator then passes the arguments to the isinstance function and returns the result, which, in this case, is the boolean True.
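In plain Python terms, the func implementation behaves roughly like an ordinary two-argument function. The sketch below is only an analogy. Python cannot add an infix isa without a tool like Logix, so the call is written in prefix form.

```python
def isa(ob, typ):
    # Same body as the Logix func block, written as a Python call.
    return isinstance(ob, typ)


print(isa("test", str))  # True
print(isa("test", int))  # False
```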

A Language Is Born

Now that we have worked through the basics, let's try a more complex example. Imagine a company with a veritable fleet of network-enabled printers featuring telnet-accessible administrative interfaces. This hypothetical company maintains a record of the current configuration for all its printers in a text file. When someone wants to change the configuration of a particular printer, they record the change in the text document, and then they connect to the printer and make the change. The company could design a simple DSL that treats the configuration record as a program. So, when someone wants to change the configuration of a printer, they simply could change the document and run it. When run, the text document would connect to all the printers and repopulate the configuration data.

First, let's have a look at the document:

default:
  syslog_facility:local3
  idle_timeout:120
  old_idle_mode:off

accounting printers:
  - 10 hp5mo1:
      syslog_facility:local2
  - 28 lpt9
  - 29 lpt10
  - 48 lpt6

developer printers:
  - 26 lpt4
  - 27 lpt7

marketing printers:
  - 62 hpcolor5:
      old_idle_mode:on
  - 154 lpt11


for department in [accounting, developer, marketing]:
  for printer in department:
    print ("Configuring %s..."%printer.host)
    printer.transmit()

print "Finished!"

When you design your own DSL, you must consider the implications of the syntax you select. If you want to add more features, will you be able to? Inexperienced DSL developers often monopolize common meta-characters in order to make the syntax as concise as possible. In the long run, that makes the language harder to learn, harder to use and harder to extend.

The default block contains the default configuration options that will be set on all printers. Each of the printers blocks contains a description of all the printers in a single department. Each individual printer definition contains the end of the printer's IP address and the associated hostname. A printer definition optionally can be followed by a block that contains configuration options specific to that printer. Our DSL turns each printer block into a list of Printer objects and assigns that list to a variable bearing the name of the department. It then will be possible to manipulate these lists with code written in the standard Logix dialect.

Now, let's have a look at the implementation:


setlang logix.stdlang

from telnetlib import Telnet

class TelnetDebug:
  def write self txt: print "dbg:%s"%txt

class Printer:
  def __init__ self ip host data:
    self.ip = ip
    self.host = host

    self.data = Printer.default.copy()
    self.data.update data

  def transmit self:
    #tn = Telnet "192.168.0.%s"%self.ip
    tn = TelnetDebug()

    tn.write "printer_password"
    tn.write ("host %s"%self.host)

    for x,y in self.data.items():
      tn.write ("%s %s"%(x,y))

deflang printerdef:

  defop 50 expr ":" expr macro n v:
    str n, str v

  defop 0 "-" token expr [":" block]/-
    macro ip v *b:
      ["host":str v, "ip":str ip, "block":b]


deflang printlang(logix.stdlang):

  defop 0 expr "printers:" block@printerdef
    macro n *v:
      `\n = [\@.Printer p/ip p/host (dict p/block)
         for p in \v]

  defop 0 "default:" block@printerdef macro *b:
    `\@.Printer.default = dict \b

The implementation starts with a setlang directive that tells the interpreter to use the standard Logix dialect. Next, we define the Printer class. Every printer defined in a printers block eventually becomes an instance of the Printer class. The Printer class contains no code specific to the DSL and easily can be used in another project. The Printer initialization method takes three arguments: the last part of the printer IP address, the printer hostname and a dict that associates option names with option values. The init method also copies the default printer options from a class variable into an instance variable called data and updates it with the printer-specific options passed into the instance via the data argument.
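For readers more comfortable with conventional Python, the class might look roughly like the following translation. This is a sketch rather than the article's code. DebugConnection is a stand-in I have named for the TelnetDebug class, it records what it is given so the result can be inspected, and the optional connection parameter is my addition.

```python
class DebugConnection:
    """Stands in for a Telnet connection and records what would be sent."""

    def __init__(self):
        self.sent = []

    def write(self, text):
        self.sent.append(text)
        print("dbg:%s" % text)


class Printer:
    default = {}  # class-level defaults, filled in by the default block

    def __init__(self, ip, host, data):
        self.ip = ip
        self.host = host
        # Start from a copy of the defaults, then apply per-printer options.
        self.data = Printer.default.copy()
        self.data.update(data)

    def transmit(self, connection=None):
        tn = DebugConnection() if connection is None else connection
        tn.write("printer_password")
        tn.write("host %s" % self.host)
        for option, value in self.data.items():
            tn.write("%s %s" % (option, value))
        return tn


Printer.default = {"idle_timeout": "120", "old_idle_mode": "off"}
printer = Printer("62", "hpcolor5", {"old_idle_mode": "on"})
log = printer.transmit()
```

Because the defaults are copied rather than shared, overriding old_idle_mode for one printer leaves Printer.default untouched.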

Now we get to the good part, the language definition. In Logix, the deflang statement is used to start a new language block. Each language block contains a sequence of operator definitions. The first language block describes the syntax we will use in the individual printers blocks and the default block. The printerdef language's first operator is the colon, an infix operator that is used to parse individual options. The first expr is the option name, and the second expr is the option value. The colon operator implementation is a macro that converts the expressions into strings and puts them in a tuple.

The second operator in the printerdef language is the hyphen operator, a mixfix operator that is used to define individual printers. This one is a bit more complicated. The operator starts with a literal hyphen, which is followed by a variable token, an expression and an optional block. A token is a single value, in this case a number. A block, as one might guess, is a block of content that is parsed using Python's indentation rules.

In the definition, the literal colon and the block are enclosed in square brackets and followed by a /-. The brackets group syntactic elements, and the /- following the group indicates that it is optional. This makes it possible to omit the block for printers that don't need to specify their own configuration options. The implementation is a macro that takes three arguments. The token is the IP address suffix, the expr is the printer hostname and the block contains the printer options. The asterisk in front of the b indicates that the variable is a sequence. If you don't specify that the block variable is a sequence, blocks with more than one line will not be parsable. The implementation returns a dict containing the hostname, the IP suffix and the block. The block contains options, which get transformed into tuples, so in the implementation, the b variable is a sequence of tuples.
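The data that these two operators produce can be imitated with a few lines of ordinary Python string handling. The sketch below is hand-written parsing, not Logix, and the function names parse_option and parse_printer are hypothetical. It assumes the printer line and its option lines have already been grouped together.

```python
def parse_option(line):
    """Turn 'name:value' into a (name, value) tuple, like the colon operator."""
    name, value = line.strip().split(":", 1)
    return (name.strip(), value.strip())


def parse_printer(lines):
    """Turn a '- ip host[:]' line plus its option lines into a dict."""
    header = lines[0].strip()
    if not header.startswith("-"):
        raise ValueError("a printer definition must start with a hyphen")
    ip, host = header[1:].split()
    host = host.rstrip(":")  # the colon only announces the optional block
    block = [parse_option(line) for line in lines[1:]]
    return {"host": host, "ip": ip, "block": block}


print(parse_printer(["- 62 hpcolor5:", "    old_idle_mode:on"]))
print(parse_printer(["- 154 lpt11"]))
```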

The second language in the implementation contains the primary syntax for our DSL. After the language name, you can see a reference to the standard Logix dialect enclosed in parentheses. Like classes, Logix languages support inheritance. The stdlang reference within the parentheses indicates that our printlang inherits all the operators of stdlang. Developers now can use standard Logix syntax in addition to the specialized operators defined within the printlang. That is how the for loop at the end of the printer configuration program is possible.

The printers operator starts with an expression, followed by the literal printers: and then a block. In this definition, the block is immediately followed by @printerdef, which tells the interpreter that the contents of the block should be parsed by the printerdef language. The printers implementation is a macro with two arguments: the name of the group and the block, which is a sequence of dicts that contain printer definitions.

The back tick at the beginning of the implementation macro quotes the expression, converting it into code data, and replaces the escaped variables with their values. We want to be able to make a variable that uses a name provided by the user. For instance, we want to assign the value of the first printers block to the variable accounting. If the implementation weren't quoted, it would try to assign the value to the variable n, rather than creating a new variable that uses the name stored in n. Quoting is like Python's exec function:

n = 'test'

is like unquoted content, whereas:

exec("%s = 'test'"%n)

is like quoted content.
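The difference is easy to observe in Python 3 by running both forms against empty namespaces and checking which names appear. This is only an analogy for Logix quoting, and the variable names here are chosen for the illustration.

```python
n = "accounting"
value = ["lpt9", "lpt10"]

# Unquoted: the assignment target is literally the name "n".
unquoted = {}
exec("n = value", {"value": value}, unquoted)

# Quoted: the contents of n are spliced in before the code runs,
# so the assignment target is the name that n holds.
quoted = {}
exec("%s = value" % n, {"value": value}, quoted)

print(sorted(unquoted))  # ['n']
print(sorted(quoted))    # ['accounting']
```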

In Logix, the backslash represents an escaped variable. Escaped variables are replaced with their values the same way that %s is replaced with the value of n in the sample exec expression. The escaped @ represents the current module, so \@.Printer is a reference to the Printer class. The list comprehension builds a Printer instance for each printer definition. Logix provides special syntax for dictionary access:

some_dict/key

is transformed into:

some_dict["key"]

So, the interpreter acquires the IP, the host and the option block from the Printer definition dict and passes them as arguments to the Printer constructor.

The default operator takes its block and assigns it to the default Printer class variable.
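Put together, the two macros expand to something like the following Python. This is a sketch of the effect, not the code Logix actually generates. The functions expand_default and expand_printers are hypothetical, the Printer class is reduced to a minimal stand-in, and a plain dict plays the role of the module namespace.

```python
class Printer:
    """Minimal stand-in for the Printer class in the implementation."""
    default = {}

    def __init__(self, ip, host, data):
        self.ip = ip
        self.host = host
        self.data = Printer.default.copy()
        self.data.update(data)


def expand_default(block):
    # The block arrives as a sequence of (name, value) tuples.
    Printer.default = dict(block)


def expand_printers(namespace, name, definitions):
    # Bind a list of Printer objects to the department name.
    namespace[name] = [
        Printer(p["ip"], p["host"], dict(p["block"])) for p in definitions
    ]


module = {}
expand_default([("idle_timeout", "120"), ("old_idle_mode", "off")])
expand_printers(module, "marketing", [
    {"host": "hpcolor5", "ip": "62", "block": [("old_idle_mode", "on")]},
    {"host": "lpt11", "ip": "154", "block": []},
])

for printer in module["marketing"]:
    print(printer.host, printer.data)
```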

That's all there is to it. Now you can build the right tool for any job! With a good language development platform at your command, the only limitation is your imagination.

What the Future Holds

Did this tantalizing taste of Logix intrigue you? I asked Logix creator Tom Locke to shed some light on the future of Logix. We soon can expect to see a faster, more effective Logix. The next release will feature an efficient new parser, written entirely in C. Eventually, Tom plans to port Logix to a more suitable language platform like Mono. He wants a versatile runtime engine that emphasizes security and offers a wide variety of featureful libraries.

Logix is currently available under the GPL. Future releases also will offer a less-restrictive license that will enable developers to distribute original and modified works in both source and binary form.

Resources for this article: /article/8209.

Ryan Paul is a system administrator, a freelance writer and an ardent proponent of open-source technology. He welcomes your questions and comments. Ryan can be contacted at segphault@sbcglobal.net.
