Structure and Interpretation of Malli Regex Schemas
Seqex schemas shipped with Malli 0.3.0. This is their story. A story with many twists and turns and technical details Malli users don't need to know.
Introduction
In December 2020 Tommi called on me to implement "regex schemas" for Malli. Now what do we even mean by "regex schemas"? Hasn't Malli had support for
(m/validate #"foo|bar" "foo") ;=> true
for quite a while now?
What I am talking about here are regular expressions of sequences and not just of strings. In an attempt to reduce confusion I will call them "seqexes" from now on, Seqexp having pioneered the concept before Clojure Spec was even a thing.
Most developers are probably familiar with regular expressions of strings and their basic forms. Here they are in (JVM) Clojure and Malli forms:
- The empty string #"" or sequence [:cat]
- Literal character #"a" or element schema :int
- Concatenation #"ab" or [:cat :string :int]
- Alternation #"a|b" or [:alt :string :int]
- Kleene Star #"a*" or [:* :int]
The formalism also contains the empty set but regex and seqex engines do not because a regex/seqex/schema that never
matches anything is not very useful in practice. Because Malli uses an S-expression or Hiccup-like syntax the empty
sequence is just the concatenation of zero subexpressions [:cat].
I am quite the theoretical computer science person, which is why I was tasked with this feature. More pragmatic programmers might regard these additional operators as equally "basic":
- Optional #"a?" or [:? :int]
- Kleene Plus #"a+" or [:+ :int]
- Repeat #"a{1,5}" or [:repeat {:min 1, :max 5} :int]
even though they can often be implemented as syntactic sugar for the five fundamental constructs. There are many more regex operators in the wild but like Seqexp and Spec, Malli has a more minimal set of seqex operators. Many regex operators are text-specific while others are considered harmful. Seqexp does support lookaheads and non-greedy quantifiers, but even lookaheads may have been a mistake.
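To get a feel for the seqex forms above, here are a few checks I would expect to hold (sketched from the operator semantics, not a pasted REPL session):

(require '[malli.core :as m])
(m/validate [:cat :string :int] ["foo" 1])           ;=> true
(m/validate [:alt :string :int] [42])                ;=> true
(m/validate [:* :int] [1 2 3])                       ;=> true
(m/validate [:? :int] [])                            ;=> true
(m/validate [:+ :int] [])                            ;=> false
(m/validate [:repeat {:min 1, :max 5} :int] [1 2 3]) ;=> true

Note that unlike string regexes with find-semantics, these match the whole sequence.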
Expressivity
Seqexes are useful for function schemas and they can handle quite complicated DSL:s too:
(m/parse
 [:schema {:registry {"hiccup" [:or
                                [:catn
                                 [:name keyword?]
                                 [:props [:? [:map-of keyword? any?]]]
                                 [:children [:* [:schema [:ref "hiccup"]]]]]
                                [:or nil? boolean? number? string?]]}}
  "hiccup"]
 [:div {:class [:foo :bar]}
  "Hello, world of data"])
;=> {:name :div, :props {:class [:foo :bar]}, :children ["Hello, world of data"]}
Macro arguments can of course be targeted as well as data DSL:s. Go for it!
Malli seqex schemas are actually limited to regular languages by design, though, so you can't do this:
(require '[clojure.spec.alpha :as s])
(s/def ::ws (s/* #(Character/isWhitespace %)))
(s/def ::sexpr
  (s/cat
    :lws ::ws
    :sexpr (s/alt
             :list (s/cat :lparen #(= \( %) :sexprs (s/* ::sexpr) :rparen #(= \) %))
             :symbol (s/cat :c #(Character/isJavaIdentifierStart %)
                            :cs (s/* #(Character/isJavaIdentifierPart %)))
             :int (s/+ #(Character/isDigit %)))
    :rws ::ws))
(s/valid? ::sexpr (seq "(foo (bar 1 2))"))
; => true
(s/valid? ::sexpr (seq "(foo (bar 1 2)"))
; => false
(Regular expressions are insufficient to handle recursively nested constructs like our lovely Lispy pairs of parentheses or HTML. That is also why we have Instaparse. And please don't do it with Spec either, even if it happens to work; relying on emergent behaviour can be risky!)
Requirements
We have to extend the Malli schema functionalities to seqexes, which means:
- Validation with validate; this is just language recognition. Language recognition is rarely sufficient but here we have it.
- Getting validation errors with explain; similar to validate, but we have to collect all errors from invalid seq elements as well as from a seq that is too short or long.
- Transformers (encode and decode); these are essentially parsing where the result just happens to be a seq of the same length with transformed elements (higher level encoder or decoder functions could of course slice and dice that seq later).
- Destructuring and constructing sequences with parse and unparse; parse is, well, parsing with seqexes. Full parsing with regular expressions is also a bit unusual; usually we just check whether the input matches and possibly extract some substrings of it. unparse is a kind of pretty printing but producing sequences instead of just strings and quite simple compared to the other functionalities. parse and unparse were actually added after the seqex schemas and are no-ops unless you use seqexes or :orn (which was only added with parse) or :multi (starting with Malli 0.4.0).
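To make those requirements concrete, here is roughly what the operations look like on a small seqex (the helper schema Args and the result shapes are my own sketch of the API, not copied from a REPL):

(require '[malli.core :as m]
         '[malli.transform :as mt])
(def Args [:cat :int [:* :string]])
(m/validate Args [1 "a" "b"])                          ;=> true
(m/explain Args [1 :oops])                             ;=> an error map pointing at the bad element
(m/decode [:* :int] ["1" "2"] (mt/string-transformer)) ;=> [1 2] (or an equivalent sequence)
(m/parse Args [1 "a" "b"])                             ;=> [1 ["a" "b"]]
(m/unparse Args [1 ["a" "b"]])                         ;=> a sequence equal to [1 "a" "b"]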
Possible Implementation Strategies
There are three well-known approaches to implement regular expression matching:
- Convert the regular expression to a deterministic finite automaton (DFA) (usually via a nondeterministic finite automaton (NFA) and the powerset construction) and run the DFA on the input. This is what foundational tools like grep and lex do.
- Convert the regular expression to an NFA and run the NFA on the input. The Seqexp library uses the "PikeVM" variant of this, where the NFA is stored as a bytecode program for a highly multithreaded virtual machine (the threads are green and even more lightweight than general purpose green threads like running go blocks).
  - One can also add an incrementally constructed DFA as a caching layer, sort of like a "DFA JIT". ANTLR 4's ALL(*) lookahead also works like this, except it has to simulate the augmented transition network (ATN) for a context-free or even context-sensitive grammar instead of just an NFA for a regular expression.
- Use a backtracking parser similar to the typical implementation of parser combinators. This is how Perl, PCRE, Java etc. regular expressions work and must work because they have gone beyond regular languages with crazy stuff like backreferences. Caching can be added similarly to the memoizing Packrat parsing algorithms.
And then there is the forgotten jewel of Brzozowski's regular expression derivatives:
- Derivatives can be used instead of NFA-conversion and the powerset construction to turn the regular expression into a DFA. This is used in the SML ml-ulex and Racket parser-tools/lex lexer generators.
- It is also possible to do parsing by computing regex derivatives on the fly, which is how Clojure Spec regex operators are implemented.
Now suppose the length of the input is $n$ and the length of the regex is $r$:
- DFA matching takes $O(n)$ time and $O(1)$ space (just the index of the current state). However constructing the DFA takes $O(2^r)$ time and the size of the DFA is also $O(2^r)$ (although the usual case is much more reasonable of course).
- NFA matching takes $O(rn)$ time and $O(r)$ space (e.g. the PikeVM needs $O(r)$ threads at most). Constructing the NFA only takes $O(r)$ time and so does storing it. The DFA cache gives speedup especially on longer inputs but does not change the bounds.
- Backtracking takes $O(2^n)$ time and $O(n)$ space. Memoization can reduce the time to $O(n)$ while only increasing the space usage to $O(rn)$.
- On the fly derivative matching takes $O(rn)$ time and $O(n^r)$ space. The optimizations from parsing with derivatives are surely applicable and might be able to reduce the space requirement to $O(r)$ but AFAIK Spec does not try to do that.
Obviously the backtracking strategy is a depth-first search. The other techniques are essentially breadth-first search. Typically depth-first search uses exponential time but linear space while breadth-first search uses linear time but exponential space. However because a string or even a general seq is quite a simple input there are tricks to make breadth-first search take very little space (converting to a DFA, deduplicating threads in the PikeVM) and the usual depth-first perf lifeline of memoization only requires linear space.
Overall the NFA approach seems like the winner; nice linear bounds without even the ahead-of-time DFA construction
exponentiality issues. Actually, since we are matching general seqs instead of strings, the DFA approach isn't really
applicable as it requires simulating running the NFA on every possible input string in advance. That is just fine on
ASCII, tricky on UTF (but there are solutions) and a non-starter on every possible Clojure datum or host object.
PikeVM Thread Deduplication Did Not Last
So I started implementing the regex operators based on the Seqexp code. We noticed that Seqexp is actually quite slow because its PikeVM engine is implemented in idiomatic Clojure with lots of intermediate sequences when ideally it would allocate just the thread table and no intermediate sequences at all. By writing Java in Clojure (or is it Rust in Clojure because I haven't written much Java at all...) I was able to make it 20x faster.
But then I realized that in Malli we need more than regular language recognition and subsequence extraction; at the time
we already had transformers (encode/decode) and recently we added parse and unparse as well. Those produce
sequences where potentially every element has been transformed somehow. The prefix of that sequence (or something from
which it can be reconstructed) needs to be a part of the thread state, which makes the thread deduplication ineffective
and we are back to $O(n^r)$ space usage. I saw no way around that, so had to pivot!
My perf
Seqexp branch is still out there if anyone is interested.
Implementing Seqex Parsing
So I went for (memoized) backtracking instead. Since the implementation is quite interesting both as a final achievement and as a journey I thought I'd explain it here. (Also, I do worry about the bus factor with this!)
Warning: Here Be Code (but no Dragon Books!)
Introductions to parsing like to emphasize that the parser input is an abstract sequence of any "symbols of an alphabet" while in practice it is almost always a string or stream of characters or tokens. But our input really is any seq(uence) of any values, and it really does not affect this implementation much (apart from ruling out the DFA approach, as discussed above).
As usual with parser combinators, our (sub)parsers are functions that take the input and return a tuple of the parsed prefix and the remaining input. Clojure vectors make bulky tuples (thus the attempts at clj-tuple and cambrian-collections) but we will later get rid of them rather incidentally. If the parser fails we just return :malli.core/invalid; explain has a separate implementation.
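A note before the code: the snippets use miu/-invalid?, Malli's internal helper that just checks for the :malli.core/invalid sentinel (miu is, if I read Malli's source right, the internal malli.impl.util namespace). If you want to evaluate the snippets in a scratch REPL without reaching into Malli's internals, a stand-in is enough; this bit is mine, not from the original:

;; Stand-in namespace so that miu/-invalid? resolves in a plain REPL.
(ns miu)
(defn -invalid? [x] (identical? x :malli.core/invalid))
(in-ns 'user)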
Parsing a single item consists of just using parse-item on the first element of the seq. parse-item should be a
function that transforms the item or returns :malli.core/invalid:
(defn item-parser [parse-item]
  (fn [coll]
    (if (seq coll)
      (let [res (parse-item (first coll))]
        (if-not (miu/-invalid? res)
          [res (rest coll)]
          res))
      :malli.core/invalid)))
In practice the parse-item functions come from malli.core/-parser.
end-parser checks that the collection is empty. The old-school way to do this was to check for an 'EOF' sentinel
or null byte, but with Clojure sequences we have empty?:
(defn end-parser [coll]
  (if (empty? coll)
    [nil coll]
    :malli.core/invalid))
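A quick sanity check of these two building blocks, with a hypothetical int-item that "parses" integers as themselves (the helper name is mine, not Malli's):

(def int-item (item-parser #(if (int? %) % :malli.core/invalid)))
(int-item [1 2 3]) ;=> [1 (2 3)]
(int-item ["x"])   ;=> :malli.core/invalid
(end-parser [])    ;=> [nil []]
(end-parser [1])   ;=> :malli.core/invalid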
As with many parsing libraries and parser generators, end-parser
is needed to ensure that the entire input is consumed
by the overall parser. Parsing just a prefix of the input makes the parsing functions composable, but the top level
parser must not be satisfied with just a correct prefix -- that would lead to silent failures, the worst kind of bugs.
Obviously all of the input must be validated when we have web-facing API:s; offering s/every
in addition to
s/coll-of
is just another sign that Spec is not about web API:s.
The parser for :cat may look formidable, but really it is mostly just boring plumbing. The parsers are just run in
sequence, the results collected into a vector with the remaining input threaded through, and any error returned
immediately, also canceling the remaining parsers:
(defn cat-parser [& parsers]
  (reduce (fn [parser parser*]
            (fn [coll]
              (let [res (parser coll)]
                (if-not (miu/-invalid? res)
                  (let [[vs coll] res]
                    (let [res* (parser* coll)]
                      (if-not (miu/-invalid? res*)
                        (let [[v coll] res*]
                          [(conj vs v) coll])
                        res*)))
                  res))))
          (pure-parser []) parsers))
(And part of the bulk in cat-parser
is just due to it being variadic (because we want :cat
to be). In Haskell >>=
and <*>
are binary infix operators and the reduce
(or, uh, foldr
? foldl'
?) part would not appear.)
cat-parser
also uses a little utility pure-parser
which just returns the fixed v
and leaves the input alone:
(defn pure-parser [v] (fn [coll] [v coll]))
(In fancy terms, pure-parser
'embeds a pure value into the domain of parsing functions'; it is the 'monadic unit
operation'.)
Complementarily, the parser for :alt
tries the parsers for the alternatives in sequence and stops when it finds one
that succeeds:
(defn alt-parser [& parsers]
  (reduce (fn [parser parser*]
            (fn [coll]
              (let [res (parser coll)]
                (if-not (miu/-invalid? res)
                  res
                  (parser* coll)))))
          parsers))
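Reusing the hypothetical int-item from above plus a similar str-item, the two combinators behave like this (my examples, assuming the definitions above):

(def str-item (item-parser #(if (string? %) % :malli.core/invalid)))
((cat-parser int-item str-item) [1 "a"]) ;=> [[1 "a"] ()]
((cat-parser int-item str-item) ["a" 1]) ;=> :malli.core/invalid
((alt-parser int-item str-item) ["a" 1]) ;=> ["a" (1)]
((alt-parser int-item str-item) [:kw])   ;=> :malli.core/invalid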
Note that the reduce
in alt-parser
and cat-parser
is done at parser generation time. Combined with the JIT
inlining the effect is similar to loop unrolling. This "compilation to closures" approach was pervasive in Malli already
before the seqex schemas and is the reason to prefer e.g. m/validator
over m/validate
. The approach is familiar
from Plumatic Schema, but is also described in SICP (section 4.1.7: Separating Syntactic Analysis from Execution), which
mentions Jonathan Rees already using it in 1982! To get similarly specialized machine code for the PikeVM approach we
would need an unlikely meta-tracing JIT.
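In practice this means you should build the closure once and reuse it; a quick illustration of the API difference (timings omitted, but this is the same effect measured in the Performance section below):

;; Builds the validator closure once and reuses it...
(def valid-args? (m/validator [:cat :int [:* :string]]))
(valid-args? [1 "a" "b"]) ;=> true
(valid-args? [1 2])       ;=> false
;; ...whereas m/validate has to rebuild (at least some of) that machinery on each call:
(m/validate [:cat :int [:* :string]] [1 "a" "b"]) ;=> true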
For contrast, the parser for :*
must loop at parsing time, because we do not know how many times its child schema will
match:
(defn *-parser [parser]
  (letfn [(parser* [vs coll]
            (let [res (parser coll)]
              (if-not (miu/-invalid? res)
                (let [[v coll] res]
                  (recur (conj vs v) coll))
                [vs coll])))]
    (fn [coll] (parser* [] coll))))
Aside from the looping, it is similar to cat-parser in collecting the results to a vector and threading the remaining
input through. But note how when the child parser fails, *-parser succeeds with the input and the previously
collected vs; :* can just stop matching at any time or match nothing at all, so it never fails. (The overall parser
can still fail thanks to the use of end-parser: consider e.g. (= false (m/validate [:* int?] ["foo"])).)
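A quick check of that "never fails" behaviour with the hypothetical int-item from above:

((*-parser int-item) [1 2 "x"]) ;=> [[1 2] ("x")]
((*-parser int-item) ["x"])     ;=> [[] ["x"]]  ; succeeds with nothing consumed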
Since we have parser combinators, the parsers for :?
and :+
can be derived in practice like in theory, by composing
the combinators (Always Be Composing!):
(defn ?-parser [parser] (alt-parser parser (pure-parser nil)))
(defn +-parser [parser]
  (map-parser (fn [[v vs]] (cons v vs))
              (cat-parser parser (*-parser parser))))
+-parser
also needs map-parser
, which just parses with parser
and, if successful, applies f
to the parse result:
(defn map-parser [f parser]
  (fn [coll]
    (let [res (parser coll)]
      (if-not (miu/-invalid? res)
        (let [[v coll] res]
          [(f v) coll])
        res))))
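And the derived combinators, again with the hypothetical int-item (expected results under the definitions above):

((?-parser int-item) ["x"])     ;=> [nil ["x"]]
((+-parser int-item) [1 2 "x"]) ;=> [(1 2) ("x")]
((+-parser int-item) ["x"])     ;=> :malli.core/invalid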
Having map-parser
technically makes our collection of parsing functions a functor (even if we are not extending any
fancy Functor
typeclass protocol). But don't worry, we won't make it a monad; that is only required for
context-dependent parsing and we only want to handle regular grammars.
(Not that making it a monad would be any scarier, just change [] to ():
(defn flat-map-parser [f parser]
  (fn [coll]
    (let [res (parser coll)]
      (if-not (miu/-invalid? res)
        (let [[v coll] res]
          ((f v) coll))
        res))))
And a monad would also require pure-parser
, but we had that already.)
In theory the :repeat
parser could also be implemented with composition, but that would require duplicating the child
parser max
times and building some other closures besides. We prefer to implement the :repeat
parser directly:
(defn repeat-parser [min max parser]
  (letfn [(compulsories [n vs coll]
            (if (< n min)
              (let [res (parser coll)]
                (if-not (miu/-invalid? res)
                  (let [[v coll] res]
                    (recur (inc n) (conj vs v) coll))
                  res))
              (optionals n vs coll)))
          (optionals [n vs coll]
            (if (< n max)
              (let [res (parser coll)]
                (if-not (miu/-invalid? res)
                  (let [[v coll] res]
                    (recur (inc n) (conj vs v) coll))
                  [vs coll]))
              [vs coll]))]
    (fn [coll] (compulsories 0 [] coll))))
Quite similar to *-parser
, but we also have a counter to deal with. Brings me back to Programming 101... after
switching to FP I rarely get such juicy chances for off-by-one errors.
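Exercising the counter with the hypothetical int-item (expected results under the definitions above):

((repeat-parser 1 2 int-item) [1 2 3]) ;=> [[1 2] (3)]         ; stops at max
((repeat-parser 2 3 int-item) [1 "x"]) ;=> :malli.core/invalid ; min not reached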
An NFA implementation like Seqexp has no choice but to add at least max nodes to the NFA. Well, the PikeVM used by
Seqexp could be extended with a stack of registers for loops, but that would mar its elegance...
A Gotcha
Consider
(m/parse [:cat [:* pos?] [:= 4]] [2 4])
which boils down to
(let [parse (cat-parser (*-parser (item-parser (m/parser pos?)))
                        (item-parser (m/parser [:= 4])))]
  (parse [2 4]))
According to the semantics we have for seqex parsing, it should return [[2] 4]. But the result is :malli.core/invalid.
The problem is that not only is :* greedy (as it should be); it also commits to every consumption of a
sequence element. Since (pos? 4) is true, [:* pos?] parses out [2 4], leaving [:= 4] with an empty seq when it
wants [4]. For the overall parse to succeed, it needs to backtrack to when [:* pos?] only had [2] and exit the
loop at that point, leaving [:= 4] with the input it needs. I think I have run into this already on my previous
parser combinator adventures (of which I have had more than I care to admit).
CPS-Conversion and Trampolining
But how can the backtracking into :*
be implemented? The (implicit) loop in *-parser
cannot actually be restarted
after it has been exited. We could give up on the 'compilation to closures' approach and make seqex parsing 'interpret'
the schema tree instead. Then backtracking could be handled by resetting the current position in the schema, similar to
Rob Pike's elegant C regex demo implementation which just rewinds a char *
. But that would lose all the performance
gained by precomputation when building the closures.
The solution is simple but potentially mindblowing: capture the continuation inside the loop. Since Clojure does not
have any sort of call-with-current-continuation
we have to convert the parsers to continuation passing style (CPS)
ourselves to get the continuation. And because Clojure does not do general tail-call optimization either we also have to
implement that by trampolining the CPS-converted code to replace the recur
in *-parser
. Buckle up, it is going to be
a wild ride!
As in standard CPS, the converted parsers take a continuation parameter k
that they can call to return or pass to
other functions in tail calls. Usually continuations would have just one parameter, the return value. But we pass the
parsed value and the remainder of the input as separate arguments, conveniently avoiding the vector allocations as
promised. This threading through of the input coll
resembles the handling of the state value in State monads, if you
are familiar with those (and their implementation).
More unusually even for CPS the converted parsers also get a stack
of fallback continuations. Calls to parsers may
be pushed to that stack by 'parking' them. This is similar to how Prolog implementations mark backtracking points on
their stacks. The Prolog connection is no accident; parsing can be conceptualized as a search on the grammar that is
also guided by the input string and Prolog also uses a depth-first search.
*-parser
parks epsilon
on each iteration to make those available as restart locations (the line marked "remember
fallback"):
(defn *-parser [parser]
  (letfn [(epsilon [_ vs coll k] (k vs coll))
          (parser* [stack vs coll k]
            (park*! stack epsilon vs coll k) ; remember fallback
            (parser stack coll
                    (fn [v coll] (park*! stack parser* (conj vs v) coll k))))] ; TCO
    (fn [stack coll k] (parser* stack [] coll k))))
This arrangement requires an external control loop to restart the call on top of the stack
when parsing fails. We can
also use that loop as a trampoline for TCO, as happens on the line marked "TCO".
The converted repeat does the same tricks; it just has more bookkeeping (as before):
(defn repeat-parser [min max parser]
  (letfn [(epsilon [_ vs coll k] (k vs coll))
          (compulsories [stack n vs coll k]
            (if (< n min)
              (parser stack coll
                      (fn [v coll] (park**! stack compulsories (inc n) (conj vs v) coll k))) ; TCO
              (optionals stack n vs coll k)))
          (optionals [stack n vs coll k]
            (if (< n max)
              (do
                (park*! stack epsilon vs coll k) ; remember fallback
                (parser stack coll
                        (fn [v coll] (park**! stack optionals (inc n) (conj vs v) coll k)))) ; TCO
              (k vs coll)))]
    (fn [stack coll k] (compulsories stack 0 [] coll k))))
We use parking for all backtracking, including alt-parser
:
(defn alt-parser [& parsers]
  (reduce (fn [parser parser*]
            (fn [stack coll k]
              (park! stack parser* coll k) ; remember fallback
              (parser stack coll k)))
          parsers))
That is not strictly necessary at this point, but will be useful when we add caching later on.
The other basic parsers are just converted to our flavor of CPS, taking k
and stack
and pushing coll
around:
(defn item-parser [parse-item]
  (fn [_ coll k]
    (when (seq coll)
      (let [res (parse-item (first coll))]
        (when-not (miu/-invalid? res)
          (k res (rest coll)))))))
(defn end-parser [_ coll k]
  (when (empty? coll)
    (k nil coll)))
(defn cat-parser [& parsers]
  (let [p (reduce (fn [rest-parser parser]
                    (fn [stack vs coll k]
                      (parser stack coll
                              (fn [v coll] (rest-parser stack (conj vs v) coll k)))))
                  (fn [_ vs coll k] (k vs coll)) ; no parsers left, succeed with the collected vs
                  (reverse parsers))]
    (fn [stack coll k] (p stack [] coll k))))
Since ?-parser
and +-parser
are just defined in terms of the other combinators, they don't need to change. But the
pure-parser
and map-parser
that they use do need to be CPS-converted:
(defn pure-parser [v] (fn [_ coll k] (k v coll)))
(defn map-parser [f parser]
  (fn [stack coll k]
    (parser stack coll
            (fn [v coll] (k (f v) coll)))))
parser
makes sure that the entire input is consumed by appending end-parser
. It also contains the backtracking and
trampolining loop, which just performs calls from the stack until one of them finishes successfully by calling the
final success continuation which sets success
and result
or until the backtracking stack is empty:
(defn parser [p]
  (let [p (cat-parser p end-parser)]
    (fn [coll]
      (if (sequential? coll)
        (let [stack (make-stack)
              success (volatile! false)
              result (volatile! nil)]
          (p stack coll (fn [v _] (vreset! success true) (vreset! result v)))
          (if @success
            (first @result)
            (loop []
              (if-some [thunk (pop-thunk! stack)]
                (do
                  (thunk)
                  (if @success
                    (first @result)
                    (recur)))
                :malli.core/invalid))))
        :malli.core/invalid))))
Before adding caching, the stack is just a java.util.ArrayDeque
or a JS array:
;; On the JVM this needs (:import (java.util ArrayDeque)) in the ns form.
(defn make-stack [] #?(:clj (ArrayDeque.), :cljs #js []))
(defn- empty-stack? [^ArrayDeque stack]
  #?(:clj (.isEmpty stack), :cljs (zero? (alength stack))))
(defn park! [^ArrayDeque stack parser coll k] (.push stack #(parser stack coll k)))
(defn pop-thunk! [^ArrayDeque stack] (when-not (empty-stack? stack) (.pop stack)))
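park*! and park**! are not shown here; presumably they are just the analogous helpers that close over the extra loop arguments, something like this sketch:

;; Sketch only -- the same idea as park!, with more arguments captured in the thunk.
(defn park*! [^ArrayDeque stack parser vs coll k]
  (.push stack #(parser stack vs coll k)))
(defn park**! [^ArrayDeque stack parser n vs coll k]
  (.push stack #(parser stack n vs coll k)))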
We can use volatile!
and mutable stacks freely because the parsers we create are single threaded. parse-item
can
(at least theoretically) be multi-threaded, but that cannot affect the parsing thread since we do not pass continuations
into parse-item
or anything like that. Besides, the parsers returned by parser
are externally pure, just like into
uses transients internally whenever possible.
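Putting the CPS pieces together end to end, with the same hypothetical integer-accepting parse-item as before (now going through the CPS item-parser):

(def parse-ints
  (parser (*-parser (item-parser #(if (int? %) % :malli.core/invalid)))))
(parse-ints [1 2 3]) ;=> [1 2 3]
(parse-ints [1 "x"]) ;=> :malli.core/invalid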
Memoization
In contrast to the NFA approach, backtracking is very space-efficient but has exponential worst-case time behaviour. As is often the case, we can use caching to gain linear time behaviour at the expense of more space usage. Happily, our cache will only use linear space as well, as previously promised.
We can avoid visiting backtracking points more than once by simply discarding the parsing function and its arguments
instead of pushing them to the stack
when we have already seen the particular combination of input and seqex
positions. The position in the input is already captured by the remainder of the input coll
but I added the input index
pos
everywhere because integers make much better hash map keys than arbitrary sequences. The position in the seqex
could be implemented as a reference to the schema subtree but instead I found it convenient to just guarantee that every
schema subtree gets its own parsing function which can be used as the key (by pointer equality and identity hash).
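In sketch form (a hypothetical helper, not the actual Malli code), parking becomes conditional on a mutable set of already-seen (parsing function, input index) pairs; the :repeat counters described next are ignored here:

;; `seen` is a volatile! holding a set; parser identity stands in for the seqex position.
(defn park-unseen! [^ArrayDeque stack seen parser pos coll k]
  (when-not (contains? @seen [parser pos])
    (vswap! seen conj [parser pos])
    (.push stack #(parser stack coll k))))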
Perhaps surprisingly the counter in :repeat
adds quite a bit of complication to this memoization. Separate iterations
of :repeat
with a different counter should be recognized as separate positions in the seqex, so the counter needs to
be added to the hash key along with the seq index and the parsing function. Furthermore there can be nested :repeats
so a stack of all in-progress :repeat
counters is actually required. Oh well, at least the counters themselves are
just integers.
Having dealt with all these algorithmic complications I was faced with a more industrial one; I wanted a mutable hash
set for the memoization cache (the parsers and arguments going to the separate stack), but JavaScript only got one in
ES6. I did try to use transient persistent sets in volatile!
s for #?(:cljs ...
but the performance was completely
unacceptable. Google Closure can polyfill ES6 sets with the right options, but those options can only be set in the
project that actually does the compilation to JS, e.g. your frontend app. And Malli is so general (and divisive) that it
is likely to end up in some applications as a transitive dependency only.
So I made my own mutable hash set in some 40 lines of Clojure. It uses quadratic probing with power of two sizes and
triangular numbers, another neat trick that could fill another blog post. Amusingly it was actually slightly faster than
java.util.HashSet
on Java 8 and roughly equivalent to it on newer JDK:s, so I switched to it on the JVM as well. But more
importantly Malli seqexes now had good performance on the ClojureScript side too.
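The triangular-number trick is worth a one-liner: with a power-of-two table size, probe offsets of 0, 1, 3, 6, 10, ... visit every slot exactly once. A small illustration (not Malli's actual code):

(defn probe-seq [h capacity] ; capacity must be a power of two
  (let [mask (dec capacity)]
    (map #(bit-and (+ h (quot (* % (inc %)) 2)) mask)
         (range capacity))))
(probe-seq 5 8) ;=> (5 6 0 3 7 4 2 1), i.e. all eight slots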
You can see the actual Malli code for the
final algorithm with all the complications from memoization and associated issues tackled and also all the variations
required to implement validator
, explainer
, decoder
and encoder
in addition to parse
. The unparse
implementation is as straightforward for seqexes as for everything else and has nothing to do with all these parsing
techniques.
Performance
As the README says and as you might have gathered by now:
Although a lot of effort has gone into making the seqex implementation fast
(require '[clojure.spec.alpha :as s])
(require '[criterium.core :as cc])
(let [valid? (partial s/valid? (s/* int?))]
  (cc/quick-bench (valid? (range 10)))) ; Execution time mean : 27µs
(let [valid? (m/validator [:* int?])]
  (cc/quick-bench (valid? (range 10)))) ; Execution time mean : 2.7µs
it is always better to use less general tools whenever possible:
(let [valid? (partial s/valid? (s/coll-of int?))]
  (cc/quick-bench (valid? (range 10)))) ; Execution time mean : 1.8µs
(let [valid? (m/validator [:sequential int?])]
  (cc/quick-bench (valid? (range 10)))) ; Execution time mean : 0.12µs
Going Beyond
- The implementation here is quite similar to GLL parsing except we don't have to care about continuations in the caching since regular languages don't include (nontail) recursion. https://epsil.github.io/gll/ is a nice explanation of GLL parser combinators and has links to papers etc. It also inspired Instaparse, which Engelberg had a presentation about at Clojure/West 2014.
- If you caught the parsing bug Parsing Techniques is the hefty tome to scratch that itch / scare you away.
Assessment
The Malli seqex implementation was an exceptional project. It is not often that parsing theory is so directly applicable to a practical problem and there is no pre-existing parsing tool that can just be used (although Seqexp did come close!). In retrospect all the issues from pivot-level to merely annoying also made the experience greater; it is exhilarating when you are able to pull aces like CPS and custom hash tables (with bit-twiddling and all) from your sleeve instead of having to admit defeat.
I feel that the most masterful code comes from "hunting high and low", reaching for mindbending theory as well as the bits and JITs at hand. Of course that is also the recipe to find amazing ways to shoot yourself in the foot with C++. As with many other things, the hard-earned skill is to know the difference (and no, it is not just C++!).