Extended DCGs in Prolog

What are DCGs?

One of my favourite features of Prolog are DCGs, or Definite Clause Grammars.

Markus Triska has a nice tutorial on them, but the quick pitch is they’re basically built-in syntax for writing parsers. Forget regular expressions – in Prolog it’s just as easy to write a full-on grammar!

Parsing With DCGs

Let’s say we want to parse a series of simple commands with option arguments. The “regular expression” way might look something like this:

def parseCommands s
s.scan(/(foo|bar \d \d|quux \w+) ?/) do |match|
    match_words = match[0].split(/\s+/)
    command = match_words[0]
    args = match_words[1..match_words.length]
    # Then we'd actually do something with the command & args
    puts "Command: #{command} Args: #{args}"
  end
end

parseCommands "bar 1 2 foo quux bloop"

Command: bar Args: ["1", "2"]
Command: foo Args: []
Command: quux Args: ["bloop"]

This is okay…but if we want to add more commands, or have more complicated rules on what the allowed arguments are, things are going to get pretty ugly in that regex pretty fast.

Here’s an alternative approach in Prolog, using DCGs:

% using the dcg/basics library for simple "character class"-type DCGs.
:- use_module(library(dcg/basics), [whites//0, nonblanks//1, digits//1]).

command(foo) --> "foo".
command(bar(A, B)) --> "bar", whites, digits(A), whites, digits(B).
command(quux(S)) --> "quux", whites, nonblanks(S).

commands([Command|Commands]) -->
    command(Command),
    whites,
    commands(Commands).
commands([]) --> [].

?- phrase(commands(Commands), `bar 1 2 foo quux bloop`),
   maplist(writeln, Commands).

bar([49],[50])
foo
quux([98,108,111,111,112])

This is pretty nice already – we’ve got our commands more structured and it almost looks like a BNF diagram!

However, it’s a little unpleasant that we have the numbers & strings (in bar and quux, respectively) as lists of codes…can we fix that? We can very easily, because Prolog DCGs aren’t just a syntax for a context-free grammar; we can actually embed code in them too!¹

% using the dcg/basics library for simple "character class"-type DCGs.
:- use_module(library(dcg/basics), [whites//0, nonblanks//1, digits//1]).

command(foo) --> "foo".
command(bar(A, B)) -->
    "bar", whites,
    digits(ACodes), whites, digits(BCodes),
    % The stuff in curly braces takes us out of the "DCG world"
    % and back in to "normal Prolog world"
    { number_codes(A, ACodes), number_codes(B, BCodes) }.
command(quux(S)) -->
   "quux", whites, nonblanks(Codes),
   { string_codes(S, Codes) }.

commands([Command|Commands]) -->
    command(Command),
    whites,
    commands(Commands).
commands([]) --> [].

?- phrase(commands(Commands), `bar 1 2 foo quux bloop`),
   maplist(writeln, Commands).

bar(1,2)
foo
quux(bloop)

Much better. We have our nice declarative parser with the ability to add in arbitrary code, but it’s still built-in to the language, so we don’t have to overcome the inertia of switching to some external tool like Yacc.

I really like having this capability in Prolog. In other languages, it is way too easy to start with a simple regex that gradually gets more and more complicated over time, until you realize you need a proper parser, and then you have to scrap a bunch of code & figure out which arcane parser generator library or external tool you’re going to use.

We also haven’t even gotten in to how cool it is to take advantage of Prolog’s “logic” nature to be able to run our parser in either direction – above, we’ve seen that we can run it “forward” to build our list of commands from a string, but if we take a little more care with exactly how we write our code, we can run it “backwards” to turn a list of commands into a string!

:- use_module(library(dcg/basics), [nonblanks//1, digits//1]).
% Using the =delay= library to delay evaluation of predicates until one of
% the arguments has been instantiated
:- use_module(library(delay)).

% using our own version of `whites` to enforce at least one space
whites --> " ".
whites --> " ", whites.

command(foo) --> "foo".
command(bar(A, B)) -->
    { delay(number_codes(A, ACodes)), delay(number_codes(B, BCodes)) },
    "bar", whites,
    digits(ACodes), " ", digits(BCodes).
command(quux(S)) -->
   { delay(string_codes(S, Codes)) },
   "quux", whites, nonblanks(Codes).

commands([Command|Commands]) -->
    command(Command),
    whites,
    commands(Commands).
commands([]) --> [].

?- phrase(commands(ReadCommands), `quux beep foo foo bar 0 1 `),
   format("Parsed commands: ~w~n", [ReadCommands]),
   phrase(commands([foo, bar(7, 8), quux("hello")]), CommandCodes),
   format("Serialized commands: ~s", [CommandCodes]).

Parsed commands: [quux(beep),foo,foo,bar(0,1)]
Serialized commands: foo bar 7 8 quux hello

It takes a little more care to be able to make the same DCG run in both directions, but it is pretty cool to be able to write one bit of code & have it serve for both serializing & deserializing – see my HTTP/2 library for an example of that in action.

How Do DCGs actually work?

One of the most interesting parts of DCGs is that the syntax itself isn’t actually doing very much. We can see what things expand to using the listing/1 predicate:

whites --> " ".
whites --> " ", whites.

command(foo) --> "foo", whites.

?- listing(whites//0),
   listing(command//1).

whites([32|A], A).
whites([32|A], B) :-
    whites(A, B).

command(foo, [102, 111, 111|A], B) :-
    whites(A, B).

We can see here it’s doing three things: First, adding two extra arguments to the predicate – the first extra argument is the input list and the second is the output list. Second, it expands lists (or strings, because inside DCGs, strings get treated as lists of character codes) into matching on the head of the input list. Third, it adds the extra two arguments to the nested DCGs that get called in the body.

Basically, the following two things are equivalent:

foo(A) --> bar, [10, 11], baz.

foo(A, In, Out) :-
   bar(In, BarOut),
   BarOut = [10, 11|ListOut],
   baz(ListOut, Out).

Here, we can get a better idea of the real value of DCGs – threading the state through a series of predicates. Because variables in Prolog are immutable, it is pretty annoying to pass state through a sequence of transformations – we need to use intermediate variables like BarOut & ListOut in the expanded example above.

Interestingly, the only parsing-specific thing that’s actually happening here is making lists match on the head of the current “state”. This implies that we can actually use DCG notation for any sort of threading-state-through-predicates thing, not just parsing!

increment(In, Out) :- Out is In + 1.

identity(In, In).

% subtract one if In is even, double if In is odd
weird_op(In, Out) :- 0 is In mod 2, Out is In - 1.
weird_op(In, Out) :- 1 is In mod 2, Out is In * 2.

chain --> increment, weird_op, identity.

?- chain(0, Zero), format("0 -> ~w~n", [Zero]),
   chain(1, One), format("1 -> ~w~n", [One]),
   chain(2, Two), format("2 -> ~w", [Two]).

0 -> 2
1 -> 1
2 -> 6

Much nicer than trying to do something like

chain(In, Out) :-
   increment(In, Out1),
   weird_op(Out1, Out2),
   identity(Out2, Out).

This is all well and good, but what happens if, say, we want to parse and also maintain some state? Like, for example, when writing an HPACK parser. If only there were a way to have multiple states in our DCG…

Extended DCGs

I recently heard about “Extended DCGs” purely accidentally, when someone in the SWI-Prolog mailing list mentioned that someone shouldn’t call their library edcg, because there already was a library implementing extended DCGs by the same name.

I was very curious about this, mostly because the implementer is mndrix, who is the author of a bunch of my favourite Prolog libraries. After taking a look though, I realized that they were just what I was looking for to help me address an annoying outstanding issue in the HPACK parser of my HTTP/2 client.

Because the HPACK parser both needs to parse the input, maintain a dynamic table of cached headers, and keep track of the maximum table size (which can be adjusted on the fly), there is a lot of state to be passed around. The DCG notation takes care of the input and I manually threaded the input and output tables through all the predicates that needed it, but when I realized that the size could change I just didn’t want to deal with it, since it would mean an extra argument added in a bunch of places.

However, with extended DCGs, one can add extra “hidden arguments”! A simple example looks something like this:

:- use_module(library(edcg)).

% 'table' is an operator, so undefining it to avoid operator priority issues
% when using it as the name of an accumulator
:- op(0, fx, table).

% define an accumulator/state `some_number` that just replaces the old value
edcg:acc_info(some_number, NewSize, _In, NewSize, true).
% define an accumulator `table` that appends values to a list
edcg:acc_info(table, NewEntry, TableIn, TableOut,
              TableOut=[NewEntry|TableIn]).

% putting dcg last in the list of accumulators, so our predicates
% can still pretend to be normal DCGs to phrase/2.
edcg:pred_info(do_thing, 1, [some_number, table, dcg]).
edcg:pred_info(do_stuff, 1, [some_number, table, dcg]).

% EDCGs don't do the automatic string->list-of-codes thing
% that normal DCGs do, so we use the `codes` syntax to write
% a sequence of character codes.
do_thing(foo) -->> `foo`. % just match "foo" like a normal DCG
do_thing(bar) -->>
    `bar`, % match "bar" like normal...
    [bar]:table. % then add the atom `bar` to the table accumulator
do_thing(baz(N)) -->>
    % match "baz" and a number N
    `baz`, [N],
    % get the input state of some_number as Num
    /(Num, some_number),
    % Run some normal Prolog code
    { format("IN BAZ: N = ~w size = ~w~n", [N, Num]) },
    % add the atom `bloop` to the table accumulator
    [bloop]:table,
    % and set `some_number` to the number N we read in
    [N]:some_number.

do_stuff([X|Xs]) -->>
  do_thing(X), do_stuff(Xs).
do_stuff([]) -->> [].

?- append([`foo`, `bar`, `baz`, [10]], Input),
   % use the same phrase/2 predicate as for normal DCGs
   % since it's just passing in the input & output codes list
   phrase(do_stuff(Things,
                   % some_number starts as 0, ends as NewNumber
                   0, NewNumber,
                   % table starts as [], ends as NewTable
                   [], NewTable), Input),
   format("Out: things ~w number ~w table ~w",
          [Things, NewNumber, NewTable]).

IN BAZ: N = 10 size = 0
Out: things [foo,bar,baz(10)] number 10 table [bloop,bar]

I got very confused trying to write my test program initially because I wanted to have an accumulator called table which is apparently defined as an operator in Prolog, resulting in some baffling syntax errors, hence the :- op(...) line at the top.

Once I got that though, I was able to very easily update my HPACK implementation to allow the size of the dynamic table to change, with minimal changes to the surrounding code. Not only that, it helped make the code much more clear, as previously I’d been stuffing all the table state in one argument (by writing it like TableSize-TableIn-TableOut); this way I can let it be implicit & just pull out the parts I need!

I’m very glad I found out about extended DCGs. The ability to define custom accumulators gives one the ability to do something almost like monadic do-notation and really promises to cut down on a lot of the boilerplate involved in expressing imperative algorithms in Prolog. I continue to really like this language 😁

Footnotes:

There is actually a helper DCG in the dcg/basics library called integer//1, which is like digits//1, except it also converts the parsed string of digits into a number, but we’ll ignore that for now to make a point.