zerowidth positive lookahead

Parsing TOML in Ruby with Parslet

This is part 1 of 4.

Last night, in a fit of YAML-induced pique, Tom Preston-Werner created the spec for “Tom’s Obvious Minimal Language”, or TOML. It’s a simple text-based file format, similar to INI files, and this presents us with the opportunity to explore writing a parser.

Parslet is a ruby library for building parsers with Parsing Expression Grammars. Its modularity and simple, declarative nature makes it a good choice for writing a parser, especially in a test-driven style.

There is already discussion of a formal EBNF in a pull request, but we’ll just read the spec directly and interpret it as best we can.

The goals of this exercise:

  • Parse a TOML document, exercising the full spec
  • Display readable error messages when a document doesn’t conform
  • Convert a parsed document into a usable Hash structure
  • Learn something about parsing and transformations of syntax trees

Small Pieces

First, let’s start with the values. To define a parser, subclass Parslet::Parser, and start defining rules. Parslet allows the simple combination of rules by using a couple helpers:

  • str matches strings
  • match matches character classes

These “atoms” can have modifiers, e.g. .maybe for optional matching, and .repeat for repetition. They can also be combined in sequence using the >> operator, or | for alternation.

For digits, we match the character class [0-9]. Integers are just digits (assuming no leading 0’s), possibly preceded by a negative sign.

class TOML::Parser < Parslet::Parser
  rule(:digit)   { match("[0-9]") }

  rule(:integer) do
    str("-").maybe >> match("[1-9]") >> digit.repeat
  end
end

Floats are similarly straightforward:

rule(:float) do
  str("-").maybe >> digit.repeat(1) >> str(".") >> digit.repeat(1)
end

And booleans are simple string matching:

rule(:boolean) do
  str("true") | str("false")
end

Datetimes seem more complex, but are just a longer string of requirements:

rule(:datetime) do
  # 1979-05-27T07:32:00Z
  digit.repeat(4) >> str("-") >> 
  digit.repeat(2) >> str("-") >> 
  digit.repeat(2) >> str("T") >> 
  digit.repeat(2) >> str(":") >> 
  digit.repeat(2) >> str(":") >> 
  digit.repeat(2) >> str("Z")
end

Strings are the last “simple” value. We’ll need a few helper rules for special characters. First, the characters that aren’t allowed to be inside a string at all:

rule(:string_special) { match['\0\t\n\r"\\\\'] }

And a rule for escaped characters:

rule(:escaped_special) { str("\\") >> match['0tnr"\\\\'] }

The last piece is defining “any character but a special one”. This is accomplished with .absent?, which functions as a negative lookahead, followed by any character.

rule(:string) do
  str('"') >>
  (escaped_special | string_special.absent? >> any).repeat >>
  str('"')
end

And to tie them together, we can define a higher-level rule:

rule :value do
  datetime | float | integer | boolean | string
end

Testing

One of the nicest things about Parslet is that the way the parser is defined lets us access each rule as a public method. Along with the provided RSpec helpers, it makes it easy to test the individual rules.

require "parslet/rig/rspec"

describe TOML::Parser do
  let(:parser) { TOML::Parser.new }

  context "value parsing" do
    let(:value_parser) { parser.value }

    it "parses integers" do
      expect(value_parser).to     parse("1")
      expect(value_parser).to     parse("-123")
      expect(value_parser).to     parse("120381")
      expect(value_parser).to     parse("181")
      expect(value_parser).to_not parse("0181")
    end

    it "parses floats" do
      expect(value_parser).to     parse("0.1")
      expect(value_parser).to     parse("3.14159")
      expect(value_parser).to     parse("-0.00001")
      expect(value_parser).to_not parse(".1")
    end

    it "parses booleans" do
      expect(value_parser).to     parse("true")
      expect(value_parser).to     parse("false")
      expect(value_parser).to_not parse("truefalse")
    end

    it "parses datetimes" do
      expect(value_parser).to     parse("1979-05-27T07:32:00Z")
      expect(value_parser).to     parse("2013-02-24T17:26:21Z")
      expect(value_parser).to_not parse("1979l05-27 07:32:00")
    end

    it "parses strings" do
      expect(value_parser).to     parse('""')
      expect(value_parser).to     parse('"hello world"')
      expect(value_parser).to     parse('"hello\\nworld"')
      expect(value_parser).to     parse('"hello\\t\\n\\\\\\0world\\n"')
      expect(value_parser).to_not parse("\"hello\nworld\"")
    end
  end
end

I’m eliding the tests for the remainder of this article for brevity, but each step was developed test-first.

Document Structure

There are a few remaining pieces which are more structural: key/value assignments, arrays, and key groups.

Arrays

Arrays have a restriction that they may only contain values of the same datatype. We could accomplish this as we process the parse tree or use the parser directly. Since an array containing multiple data types is something we can know at parse time–compare to the semantic error of redefining a key with a key group–we’ll create a helper method to create array parsers for us.

An naive nested array parser would look something like this (ignoring whitespace):

rule :array do
  str("[") >>
  (value | array) >>
  (str(",") >> (value | array)).repeat >>
  str("]")
end

Instead, we can replace value with a specific datatype:

rule :string_array do
  str("[") >> string >> (str(",") >> string).repeat >> str("]")
end

Because Parslet parsers are just bits of ruby code, we can define and compose these programmatically:

rule(:array_space) { match["\t\n "].repeat }

def value_list(value_type)
  value_type >>
  (array_space >> str(",") >> array_space >> value_type).repeat
end

rule :array do
  str("[") >> array_space >> value_list(integer) >> array_space >> str("]")
end

And the full array definition becomes:

def array_contents
  value_list(datetime) | value_list(float) | value_list(integer) |
    value_list(boolean) | value_list(string) | value_list(array)
end

rule :array do
  str("[") >> array_space >> array_contents >> array_space >> str("]")
end

And expand the value rule to include arrays:

rule :value do
  datetime | float | integer | boolean | string | array
end

Because an array is defined as a series of a specific type, a mixed array simply won’t parse.

Assignments

Now that we have values of all potential types, assignments are straightforward. First, define keys and whitespace. For simplicity and “only one thing should work”, keys are not allowed to include square brackets or equal signs. Also note the difference between space and whitespace. Since the definition for a key requires an assertion that no space is present, we can’t use whitespace because it can match a blank string, and a blank string is always present next during a parse.

rule(:space)      { match["\t "] }
rule(:whitespace) { space.repeat }
rule :key do
  (match["\\[\\]="].absent? >> space.absent? >> any).repeat(1)
end

Then, we tie a key and a value together:

rule :assignment do
  whitespace >> key >> whitespace >> str("=") >> whitespace >> value
end

Key Group Names

Key groups names are similarly straightforward:

rule :key_group_name do
  whitespace >> str("[") >>
  (str("]").absent? >> any).repeat(1) >>
  str("]")
end

Document Parsing

Now that the pieces are in place, we can string them together. First, a series of assignments, separated by newlines.

rule :assignments do
  assignment >>
  (newline >> (assignment | whitespace)).repeat >>
end

Next, key groups. Empty key groups are allowed. This ends up being nearly identical.

rule :key_group do
  key_group_name >>
  (newline >> (assignment | whitespace)).repeat >>
end

Comments

The only thing we’ve left out is comments. Comments are allowed on lines of their own, or at the end of a line. There are only a handful of rules that include newlines, so we need only adjust them to include the possibility of comments.

rule(:comment) do
  str("#") >> (newline.absent? >> any).repeat
end

First, statements that can be followed by a comment, e.g. assignment and key group name:

rule :assignment do
  whitespace >>
  key >> whitespace >> str("=") >> whitespace >> value >>
  whitespace >> comment.maybe
end

rule :key_group_name do
  whitespace >> str("[") >>
  (str("]").absent? >> any).repeat(1) >>
  str("]") >> whitespace >> comment.maybe
end

Next, any place we can have empty lines in a sequence, so assignments and key groups:

rule :assignments do
  assignment >>
  (newline >> (assignment | whitespace >> comment.maybe)).repeat
end

rule :key_group do
  key_group_name >>
  (newline >> (assignment | whitespace >> comment.maybe)).repeat
end

Lastly, arrays. These are a little trickier because comments can be interspersed with the array contents. We’ll replace the array_space rule with something a little more flexible.

rule(:array_space) do
  (space | (comment.maybe >> newline)).repeat
end

TOML documents

Now we can tie it all together:

rule(:empty_lines) do
  (whitespace >> comment.maybe >> newline).repeat
end

rule :document do
  empty_lines >>
  assignments >>
  empty_lines >>
  key_group.repeat >>
  newline.maybe
end

root :document

Summary

The Parser class is now fully defined. Unfortunately, it’s not directly useful yet. It’ll get us a parse tree, but not the actual values from a document.

Next: Annotating a TOML Parse Tree

The code is available on GitHub.