Parsing TOML in Ruby with Parslet
This is part 1 of 4.
- Part 2 - Annotating a TOML Parse Tree
- Part 3 - Transforming a TOML Parse Tree
- Part 4 - Displaying Errors in a TOML Document
- toml-parslet on GitHub.
Last night, in a fit of YAML-induced pique, Tom Preston-Werner created the spec for “Tom’s Obvious Minimal Language”, or TOML. It’s a simple text-based file format, similar to INI files, and this presents us with the opportunity to explore writing a parser.
Parslet is a ruby library for building parsers with Parsing Expression Grammars. Its modularity and simple, declarative nature makes it a good choice for writing a parser, especially in a test-driven style.
There is already discussion of a formal EBNF in a pull request, but we’ll just read the spec directly and interpret it as best we can.
The goals of this exercise:
- Parse a TOML document, exercising the full spec
- Display readable error messages when a document doesn’t conform
- Convert a parsed document into a usable
Hash
structure - Learn something about parsing and transformations of syntax trees
Small Pieces
First, let’s start with the values. To define a parser, subclass
Parslet::Parser
, and start defining rules. Parslet allows the simple
combination of rules by using a couple helpers:
str
matches stringsmatch
matches character classes
These “atoms” can have modifiers, e.g. .maybe
for optional matching, and
.repeat
for repetition. They can also be combined in sequence using the >>
operator, or |
for alternation.
For digits, we match the character class [0-9]
. Integers are just digits
(assuming no leading 0’s), possibly preceded by a negative sign.
class TOML::Parser < Parslet::Parser
rule(:digit) { match("[0-9]") }
rule(:integer) do
str("-").maybe >> match("[1-9]") >> digit.repeat
end
end
Floats are similarly straightforward:
rule(:float) do
str("-").maybe >> digit.repeat(1) >> str(".") >> digit.repeat(1)
end
And booleans are simple string matching:
rule(:boolean) do
str("true") | str("false")
end
Datetimes seem more complex, but are just a longer string of requirements:
rule(:datetime) do
# 1979-05-27T07:32:00Z
digit.repeat(4) >> str("-") >>
digit.repeat(2) >> str("-") >>
digit.repeat(2) >> str("T") >>
digit.repeat(2) >> str(":") >>
digit.repeat(2) >> str(":") >>
digit.repeat(2) >> str("Z")
end
Strings are the last “simple” value. We’ll need a few helper rules for special characters. First, the characters that aren’t allowed to be inside a string at all:
rule(:string_special) { match['\0\t\n\r"\\\\'] }
And a rule for escaped characters:
rule(:escaped_special) { str("\\") >> match['0tnr"\\\\'] }
The last piece is defining “any character but a special one”. This is
accomplished with .absent?
, which functions as a negative lookahead,
followed by any
character.
rule(:string) do
str('"') >>
(escaped_special | string_special.absent? >> any).repeat >>
str('"')
end
And to tie them together, we can define a higher-level rule:
rule :value do
datetime | float | integer | boolean | string
end
Testing
One of the nicest things about Parslet is that the way the parser is defined lets us access each rule as a public method. Along with the provided RSpec helpers, it makes it easy to test the individual rules.
require "parslet/rig/rspec"
describe TOML::Parser do
let(:parser) { TOML::Parser.new }
context "value parsing" do
let(:value_parser) { parser.value }
it "parses integers" do
expect(value_parser).to parse("1")
expect(value_parser).to parse("-123")
expect(value_parser).to parse("120381")
expect(value_parser).to parse("181")
expect(value_parser).to_not parse("0181")
end
it "parses floats" do
expect(value_parser).to parse("0.1")
expect(value_parser).to parse("3.14159")
expect(value_parser).to parse("-0.00001")
expect(value_parser).to_not parse(".1")
end
it "parses booleans" do
expect(value_parser).to parse("true")
expect(value_parser).to parse("false")
expect(value_parser).to_not parse("truefalse")
end
it "parses datetimes" do
expect(value_parser).to parse("1979-05-27T07:32:00Z")
expect(value_parser).to parse("2013-02-24T17:26:21Z")
expect(value_parser).to_not parse("1979l05-27 07:32:00")
end
it "parses strings" do
expect(value_parser).to parse('""')
expect(value_parser).to parse('"hello world"')
expect(value_parser).to parse('"hello\\nworld"')
expect(value_parser).to parse('"hello\\t\\n\\\\\\0world\\n"')
expect(value_parser).to_not parse("\"hello\nworld\"")
end
end
end
I’m eliding the tests for the remainder of this article for brevity, but each step was developed test-first.
Document Structure
There are a few remaining pieces which are more structural: key/value assignments, arrays, and key groups.
Arrays
Arrays have a restriction that they may only contain values of the same datatype. We could accomplish this as we process the parse tree or use the parser directly. Since an array containing multiple data types is something we can know at parse time–compare to the semantic error of redefining a key with a key group–we’ll create a helper method to create array parsers for us.
An naive nested array parser would look something like this (ignoring whitespace):
rule :array do
str("[") >>
(value | array) >>
(str(",") >> (value | array)).repeat >>
str("]")
end
Instead, we can replace value
with a specific datatype:
rule :string_array do
str("[") >> string >> (str(",") >> string).repeat >> str("]")
end
Because Parslet parsers are just bits of ruby code, we can define and compose these programmatically:
rule(:array_space) { match["\t\n "].repeat }
def value_list(value_type)
value_type >>
(array_space >> str(",") >> array_space >> value_type).repeat
end
rule :array do
str("[") >> array_space >> value_list(integer) >> array_space >> str("]")
end
And the full array definition becomes:
def array_contents
value_list(datetime) | value_list(float) | value_list(integer) |
value_list(boolean) | value_list(string) | value_list(array)
end
rule :array do
str("[") >> array_space >> array_contents >> array_space >> str("]")
end
And expand the value
rule to include arrays:
rule :value do
datetime | float | integer | boolean | string | array
end
Because an array is defined as a series of a specific type, a mixed array simply won’t parse.
Assignments
Now that we have values of all potential types, assignments are straightforward.
First, define keys and whitespace. For simplicity and “only one thing should
work”, keys are not allowed to include square brackets or equal signs. Also note
the difference between space
and whitespace
. Since the definition for a key
requires an assertion that no space
is present, we can’t use whitespace
because it can match a blank string, and a blank string is always present next
during a parse.
rule(:space) { match["\t "] }
rule(:whitespace) { space.repeat }
rule :key do
(match["\\[\\]="].absent? >> space.absent? >> any).repeat(1)
end
Then, we tie a key and a value together:
rule :assignment do
whitespace >> key >> whitespace >> str("=") >> whitespace >> value
end
Key Group Names
Key groups names are similarly straightforward:
rule :key_group_name do
whitespace >> str("[") >>
(str("]").absent? >> any).repeat(1) >>
str("]")
end
Document Parsing
Now that the pieces are in place, we can string them together. First, a series of assignments, separated by newlines.
rule :assignments do
assignment >>
(newline >> (assignment | whitespace)).repeat >>
end
Next, key groups. Empty key groups are allowed. This ends up being nearly identical.
rule :key_group do
key_group_name >>
(newline >> (assignment | whitespace)).repeat >>
end
Comments
The only thing we’ve left out is comments. Comments are allowed on lines of their own, or at the end of a line. There are only a handful of rules that include newlines, so we need only adjust them to include the possibility of comments.
rule(:comment) do
str("#") >> (newline.absent? >> any).repeat
end
First, statements that can be followed by a comment, e.g. assignment and key group name:
rule :assignment do
whitespace >>
key >> whitespace >> str("=") >> whitespace >> value >>
whitespace >> comment.maybe
end
rule :key_group_name do
whitespace >> str("[") >>
(str("]").absent? >> any).repeat(1) >>
str("]") >> whitespace >> comment.maybe
end
Next, any place we can have empty lines in a sequence, so assignments and key groups:
rule :assignments do
assignment >>
(newline >> (assignment | whitespace >> comment.maybe)).repeat
end
rule :key_group do
key_group_name >>
(newline >> (assignment | whitespace >> comment.maybe)).repeat
end
Lastly, arrays. These are a little trickier because comments can be interspersed
with the array contents. We’ll replace the array_space
rule with something a
little more flexible.
rule(:array_space) do
(space | (comment.maybe >> newline)).repeat
end
TOML documents
Now we can tie it all together:
rule(:empty_lines) do
(whitespace >> comment.maybe >> newline).repeat
end
rule :document do
empty_lines >>
assignments >>
empty_lines >>
key_group.repeat >>
newline.maybe
end
root :document
Summary
The Parser
class is now fully defined. Unfortunately, it’s not directly useful
yet. It’ll get us a parse tree, but not the actual values from a document.
Next: Annotating a TOML Parse Tree
The code is available on GitHub.