zerowidth positive lookahead

Annotating a TOML Parse Tree

This is part 2 of 4.

In the previous article, we built a TOML parser. Now that we can parse a document, we need to convert the parse tree into something we can use.

Named Nodes

As defined, the parser simply returns the string that it parsed. This isn’t helpful, so we need to annotate matches during parsing to build a structured representation of a TOML document. Once we’ve created this tree representation, we’ll be able to transform it into the final data structure that we’re looking for.

We’ll do this using Parslet’s .as method to name matched parts of rules. When Parslet matches a named part, it converts it to a hash of {:name => "matched value"}. These can be nested along with the rules, resulting in a tree of hashes and arrays.
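To get a feel for that "tree of hashes and arrays" shape before we start naming rules, here is a small plain-Ruby sketch. It doesn't use Parslet at all: the tree is written out by hand, in the shape this article ends up producing, and leaves is a hypothetical helper (not something Parslet provides) that lists every captured string along with the path of names leading to it.

```ruby
# Hypothetical helper: walk a tree of hashes and arrays and return
# [path_of_names, captured_text] pairs, one per leaf.
def leaves(tree, path = [])
  case tree
  when Hash  then tree.flat_map { |name, child| leaves(child, path + [name]) }
  when Array then tree.flat_map { |child| leaves(child, path) }
  else [[path, tree.to_s]]   # a captured string (a slice, in real Parslet output)
  end
end

# Hand-written example tree, not real parser output.
tree = {
  :assignments => [
    {:key => "a", :value => {:integer => "1"}},
    {:key => "b", :value => {:array => [{:integer => "2"}, {:integer => "3"}]}},
  ]
}

leaves(tree).each { |path, text| puts "#{path.join('/')}: #{text}" }
# assignments/key: a
# assignments/value/integer: 1
# assignments/key: b
# assignments/value/array/integer: 2
# assignments/value/array/integer: 3
```

Hashes carry the names and arrays carry repetition, so the path to a leaf is just the chain of .as names that were active when it was matched.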

Values

For values, we capture the type of the value along with its contents.

rule(:integer) do
  (str("-").maybe >> match["1-9"] >> digit.repeat).as(:integer)
end

rule(:float) do
  (str("-").maybe >> digit.repeat(1) >>
   str(".") >> digit.repeat(1)).as(:float)
end

rule(:boolean) do
  (str("true") | str("false")).as(:boolean)
end

rule(:datetime) do
  (digit.repeat(4) >> str("-") >>
   digit.repeat(2) >> str("-") >>
   digit.repeat(2) >> str("T") >>
   digit.repeat(2) >> str(":") >>
   digit.repeat(2) >> str(":") >>
   digit.repeat(2) >> str("Z")).as(:datetime)
end

rule(:string) do
  str('"') >>
  ((escaped_special | string_special.absent? >> any).repeat).as(:string) >>
  str('"')
end

And the tests:

it "parses integers into {:integer => 'digits'}" do
  expect(value_parser.parse("1234")).to eq :integer => "1234"
end

it "parses floats into {:float => 'digits'}" do
  expect(value_parser.parse("-0.123")).to eq :float => "-0.123"
end

it "parses booleans into {:boolean => 'value'}" do
  expect(value_parser.parse("true")).to eq :boolean => "true"
end

it "parses datetimes into {:datetime => 'timestamp'}" do
  expect(value_parser.parse("1979-05-27T07:32:00Z")).to eq(
    :datetime => "1979-05-27T07:32:00Z"
  )
end

it "parses strings into {:string => 'string contents'}" do
  expect(value_parser.parse('"hello world"')).to eq(
    :string => "hello world")
end

Arrays

Parslet handles repeated elements “magically”. If there is a sequence of matched values, it will automatically combine them into an array of elements. From the docs, capturing basic repeats can work either way we need it to:

str('a').repeat.as(:b) # "aaa" => {:b=>"aaa"@0}
str('a').as(:b).repeat # "aaa" => [{:b=>"a"@0}, {:b=>"a"@1}, {:b=>"a"@2}]

For arrays, we’ll want to capture the outer array as :array => ... and the contents as an array of values, [ {:integer => "1"}, {:integer => "2"}, ...].

If we weren’t parsing nested arrays, we could leave off the .as(:array) and Parslet would automatically give us bare arrays of values. However, it’s a little too smart about merging the results of parsed sub-trees and it flattens nested arrays, so we’ll be explicit.

rule :array do
  str("[") >> array_space >>
  array_contents.repeat(1).as(:array) >>
  array_space >> str("]")
end

it "captures arrays as :array => [ value, value, ... ]" do
  expect(array_parser.parse("[1,2]")).to eq(
    :array => [ {:integer => "1"}, {:integer => "2"}])
end

it "captures nested arrays" do
  expect(array_parser.parse("[ [1,2] ]")).to eq(
    :array => [
      {:array => [ {:integer => "1"}, {:integer => "2"}]}
    ])
end
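Tagging each value with its type, and wrapping arrays explicitly, is what makes the tree easy to convert later: the name tells us how to interpret the captured text. Part 3 does this with Parslet’s transformation engine; the following is only a plain-Ruby sketch of the idea. convert_value is a hypothetical helper, not part of the parser, and it makes no attempt to process string escapes.

```ruby
require "time"

# Hypothetical helper, not part of the article's parser: convert one captured
# value node such as {:integer => "1234"} into a plain Ruby value.
def convert_value(node)
  type, captured = node.first   # each value node has exactly one name
  case type
  when :integer  then Integer(captured.to_s, 10)
  when :float    then Float(captured.to_s)
  when :boolean  then captured.to_s == "true"
  when :datetime then Time.iso8601(captured.to_s)
  when :string   then captured.to_s   # escape sequences are left untouched here
  when :array
    # Assumption: contents may arrive as a list of nodes or as a single node.
    items = captured.is_a?(Array) ? captured : [captured]
    items.map { |item| convert_value(item) }
  else
    raise ArgumentError, "unknown value type: #{type.inspect}"
  end
end

convert_value({:integer => "1234"})   # => 1234
convert_value({:boolean => "false"})  # => false
convert_value({:array => [{:array => [{:integer => "1"}, {:integer => "2"}]}]})
# => [[1, 2]]
```

Because nested arrays keep their own :array wrapper, the recursion preserves the nesting instead of flattening it.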

Assignments

We’d like individual assignments to look like {:key => "key", :value => value}.

The initial version of the parser was a little loose about when and where it would match whitespace, so we’ll refactor the parser rules a bit too.

First, a couple of the helper rules, changing whitespace to space? and comment to comment?:

rule(:space?) { space.repeat }

rule(:comment?) do
  (str("#") >> (newline.absent? >> any).repeat).maybe
end

Next, we’ll look at a series of assignments. Originally the assignment rule handled whitespace within itself, but we’ll move that elsewhere. We capture the key and the value:

rule :assignment do
  key.as(:key) >>
  space? >> str("=") >> space? >>
  value.as(:value)
end

it "captures the key and the value" do
  expect(ap.parse("thing = 1")).to eq(
    :key => "thing", :value => {:integer => "1"})
end

A sequence of assignments can have several forms: nothing (no assignments), a single assignment, a series of comments and whitespace, a series of “bare” assignments, or a series of assignments with comments and whitespace interspersed. To handle this well, we’ll start with a single line:

rule :assignment_line do
  space? >> assignment.maybe >> space? >> comment?
end

Now we can easily combine these and capture the overall results as :assignments => ...:

rule :assignments do
  (assignment_line >> (newline >> assignment_line).repeat).as(:assignments)
end

And test these with a variety of inputs:

let(:ap) { parser.assignments }

it "captures a list of assignments" do
  expect(ap.parse("a=1\nb=2")).to eq(
    :assignments => [
      {:key => "a", :value => {:integer => "1"}},
      {:key => "b", :value => {:integer => "2"}},
    ]
  )
end

it "captures an empty string" do
  expect(ap.parse("")).to eq(:assignments => "")
end

it "captures just comments as a string" do
  expect(ap.parse("#comment\n")).to eq(
    :assignments => "#comment\n"
  )
end

A list of assignments containing just a comment is matched as a string. This is because we’ve defined a capture, and even if it doesn’t match any structured {:key => ..., :value => ...} pairs, it still matches and captures the string itself. This is OK; we’ll just have to handle this case when we transform the tree later on.

But now it looks like we have a problem. If we try to parse the following string using assignments:

#comment
a = 1

It’s parsed as :assignments => [{:key => "#comment\na", :value => {:integer => "1"}}]. The key has somehow managed to capture the preceding comment and newline.

If we look at the rule for an assignment again, it starts with key.as(:key). The way this is invoked from the assignment_line rule is: space? >> assignment.maybe. When presented with "#comment\nkey=value", the parser sees the '#' and interprets it as a key. Because a key is just “not whitespace”, the remainder of the comment and the newline are accepted.

To fix this, we need to restrict the definition of a key to make sure that it doesn’t begin with either a # or a newline:

rule :key do
  str("#").absent? >> newline.absent? >>
  (match["\\[\\]="].absent? >> space.absent? >> any).repeat(1)
end

And that solves it:

it "captures an assignment after a comment and newlines" do
  expect(ap.parse("#comment\na=1")).to eq(
    :assignments => [{:key => "a", :value => {:integer => "1"}}]
  )
  expect(ap.parse("#comment\n\t\n\na=1")).to eq(
    :assignments => [{:key => "a", :value => {:integer => "1"}}]
  )
end
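The absent? calls in the key rule are zero-width negative lookaheads: they check what comes next without consuming any input. Regular expressions have the same idea in (?!...), so here is a rough analogy in plain Ruby. This is not Parslet, and LOOSE_KEY is only an assumption about what the original "not whitespace" key rule amounted to (with whitespace meaning spaces and tabs). StringScanner#scan is used because, like a parser rule, it only matches at the current position.

```ruby
require "strscan"

# Assumed approximation of the original key rule: anything that is not
# a space, tab, bracket, or equals sign.
LOOSE_KEY  = /[^ \t\[\]=]+/

# The same pattern, but it must not start with "#" or a newline.
STRICT_KEY = /(?![#\n])[^ \t\[\]=]+/

# Try to match a key at the very start of the input, as the parser would.
def scan_key(input, pattern)
  StringScanner.new(input).scan(pattern)
end

scan_key("#comment\na = 1", LOOSE_KEY)   # => "#comment\na"
scan_key("#comment\na = 1", STRICT_KEY)  # => nil
scan_key("a = 1", STRICT_KEY)            # => "a"
```

With the lookahead in place, the key simply fails to match at the "#", which lets the optional assignment match nothing and leaves the comment for the comment? rule.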

Key Groups

Finally, key groups. We’ll capture key group names as :group_name => "name":

rule :group_name do
  space? >> str("[") >>
  (str("]").absent? >> any).repeat(1).as(:group_name) >>
  str("]") >> space? >> comment?
end

A key group must have a group name, but after that it can be empty or have a series of assignments. We’ve already written that rule, so:

rule :key_group do
  (group_name >>
   (newline >> assignments).maybe).as(:key_group)
end

let(:kgp) { parser.key_group }

it "captures the group name and assignments" do
  expect(kgp.parse("[kg]\na=1\nb=2")).to eq(
    :key_group =>
      {:group_name => "kg",
      :assignments => [
        {:key => "a", :value => {:integer => "1"}},
        {:key => "b", :value => {:integer => "2"}}]}
  )
end

it "captures empty assignments as a string" do
  expect(kgp.parse("[kg]\n#comment\n\t\n")).to eq(
    :key_group =>
      {:group_name => "kg",
       :assignments => "#comment\n\t\n"}
  )
end

it "captures a single assignment in a key group" do
  expect(kgp.parse("[kg]\na=1")).to eq(
    :key_group => {
      :group_name => "kg",
      :assignments => {:key => "a", :value => {:integer => "1"}}}
  )
end

Note the final test, where a single assignment is captured. It’s captured as a single hash rather than an array with one item, because only one assignment matched during parsing. Parslet only creates an array of matched items when there is more than one.
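That leaves three possible shapes for :assignments: a string when only comments and whitespace matched, a single hash for one assignment, and an array for several. Whatever consumes the tree has to cope with all three. One way to do that, sketched in plain Ruby with a hypothetical helper named assignment_list (an illustration only, not necessarily how the later parts of this series handle it):

```ruby
# Hypothetical helper: always return a list of {:key => ..., :value => ...}
# hashes, whichever of the three shapes was captured.
def assignment_list(captured)
  case captured
  when Array then captured     # several assignments
  when Hash  then [captured]   # exactly one assignment
  else []                      # a bare string (or nothing): no assignments
  end
end

assignment_list("#comment\n\t\n")
# => []
assignment_list({:key => "a", :value => {:integer => "1"}})
# => [{:key => "a", :value => {:integer => "1"}}]
```

The else branch deliberately avoids checking for String: real Parslet output holds slice objects rather than plain strings, and treating "anything that isn't a hash or array" as empty covers both.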

Document

A document, as before, is an optional series of assignments, possibly followed by one or more key groups. We’ll capture the whole thing as :document.

rule :document do
  ((key_group | assignments) >>
   key_group.repeat >>
   newline.maybe).as(:document)
end

And now, this TOML document:

title = "global title"
[group1]
a = 1
b = 2
[group2]
c = [ 3, 4 ]

is captured as:

{:document=>
  [{:assignments=>{:key=>"title", :value=>{:string=>"global title"}}},
   {:key_group=>
     {:group_name=>"group1",
      :assignments=>
       [{:key=>"a", :value=>{:integer=>"1"}},
        {:key=>"b", :value=>{:integer=>"2"}}]}},
   {:key_group=>
     {:group_name=>"group2",
      :assignments=>{:key=>"c", :value=>{:array=>[{:integer=>"3"}, {:integer=>"4"}]}}}}]}

In part 3 of this series, we’ll transform this tree of captured values into a usable hash using Parslet’s transformation engine.

Next: Transforming a TOML Parse Tree