Brzozowski’s Derivatives of Regular Expressions
-----------------------------------------------
∘  concatenation (see below)
λ  singleton set containing just the empty string
I  set of all letters in the alphabet

Derivative of a set ``R`` of strings with respect to a string ``a``:
    ∂a(R∘S) → ∂a(R)∘S ∨ δ(R)∘∂a(S)
    ∂a(R ∧ S) → ∂a(R) ∧ ∂a(S)
    ∂a(R ∨ S) → ∂a(R) ∨ ∂a(S)
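Concretely, for a finite set of strings the derivative keeps the tails of just those members that begin with the given prefix; a small illustrative sketch (the function name is mine):

```python
def string_set_derivative(R, a):
    # ∂a(R): the tails of the strings in R that start with the prefix a.
    return {s[len(a):] for s in R if s.startswith(a)}

# e.g. string_set_derivative({'abc', 'acd', 'bd'}, 'a') → {'bc', 'cd'}
```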
Auxiliary predicate function ``δ`` (I call it ``nully``) returns
``λ`` if ``λ ⊆ R`` and ``ϕ`` otherwise:
    δ(R ∧ S) → δ(R) ∧ δ(S)
    δ(R ∨ S) → δ(R) ∨ δ(S)
Some rules we will use later for “compaction”:
Concatenation of sets: for two sets A and B the set A∘B is defined as:

    {a∘b for a in A for b in B}

    {‘a’, ‘b’}∘{‘c’, ‘d’} → {‘ac’, ‘ad’, ‘bc’, ‘bd’}
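A quick runnable sketch of this definition (the function name is mine):

```python
def set_concat(A, B):
    # Concatenation of two sets of strings: pairwise string concatenation.
    return frozenset(a + b for a in A for b in B)

# e.g. set_concat({'a', 'b'}, {'c', 'd'}) → {'ac', 'ad', 'bc', 'bd'}
```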
from functools import partial as curry
from itertools import product
The empty set and the set of just the empty string.

phi = frozenset()    # ϕ
y = frozenset({''})  # λ
I’m only going to use two symbols (at first) because this is enough to
illustrate the algorithm and because you can represent any other
alphabet with two symbols (if you had to).
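For instance (a hypothetical illustration, not used below), a byte-oriented alphabet can be spelled in the two-symbol alphabet by encoding each character as eight bits:

```python
def encode(text):
    # Spell each character of a byte alphabet as an 8-character bit string.
    return ''.join(format(ord(ch), '08b') for ch in text)

# encode('A') == '01000001'
```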
I chose the names ``O`` and ``l`` (uppercase “o” and lowercase “L”) to
look like ``0`` and ``1`` (zero and one) respectively.

syms = O, l = frozenset({'0'}), frozenset({'1'})
Representing Regular Expressions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To represent REs in Python I’m going to use tagged tuples. A *regular
expression* is one of:

    phi
    y
    O  or  l
    (KSTAR, R)
    (NOT, R)
    (CONS, R, S)
    (AND, R, S)
    (OR, R, S)

Where ``R`` and ``S`` stand for *regular expressions*.
AND, CONS, KSTAR, NOT, OR = 'and cons * not or'.split()  # Tags are just strings.
Because they are formed of ``frozenset``, ``tuple`` and ``str`` objects
only, these datastructures are immutable.
String Representation of RE Datastructures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Return a nice string repr for a regular expression datastructure.
    if re == I: return '.'
    if re in syms: return next(iter(re))
    if re == y: return '^'
    if re == phi: return 'X'

    assert isinstance(re, tuple), repr(re)

        body = stringy(re[1])
        if not body: return body
        if len(body) > 1: return '(' + body + ")*"

        body = stringy(re[1])
        if not body: return body
        if len(body) > 1: return '(' + body + ")'"

    r, s = stringy(re[1]), stringy(re[2])
    if tag == CONS: return r + s
    if tag == OR: return '%s | %s' % (r, s)
    if tag == AND: return '(%s) & (%s)' % (r, s)
Match anything. Often spelled “.”

I = (KSTAR, (OR, O, l))
``(.111.) & (.01 + 11*)'``
~~~~~~~~~~~~~~~~~~~~~~~~~~

The example expression from Brzozowski:

    (.111.) & (.01 + 11*)'

Note that it contains one of everything.
a = (CONS, I, (CONS, l, (CONS, l, (CONS, l, I))))
b = (CONS, I, (CONS, O, l))
c = (CONS, l, (KSTAR, l))
it = (AND, a, (NOT, (OR, b, c)))

    (.111.) & ((.01 | 11*)')
Let’s get that auxiliary predicate function ``δ`` out of the way.

    δ - Return λ if λ ⊆ R otherwise ϕ.
    if R in syms or R == phi:

        return phi if nully(R[1]) else y

    # δ(R∘S) → δ(R) ∧ δ(S)
    # δ(R ∧ S) → δ(R) ∧ δ(S)
    # δ(R ∨ S) → δ(R) ∨ δ(S)
    r, s = nully(R[1]), nully(R[2])
    return r & s if tag in {AND, CONS} else r | s
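Only fragments of ``nully()`` appear above; as a self-contained sketch (restating the base definitions, with the branch order guessed where the original elides it), the whole predicate might look like:

```python
phi = frozenset()        # ϕ, the empty set
y = frozenset({''})      # λ, the set of just the empty string
O, l = frozenset({'0'}), frozenset({'1'})
syms = O, l
AND, CONS, KSTAR, NOT, OR = 'and cons * not or'.split()

def nully(R):
    # δ - Return λ if λ ⊆ R otherwise ϕ.
    if R in syms or R == phi:
        return phi                # a bare symbol or ϕ cannot match ''
    if R == y:
        return y
    tag = R[0]
    if tag == KSTAR:
        return y                  # R* always matches ''
    if tag == NOT:
        return phi if nully(R[1]) else y
    r, s = nully(R[1]), nully(R[2])
    return r & s if tag in {AND, CONS} else r | s
```

Note that ``ϕ`` (an empty ``frozenset``) is falsy and ``λ`` is truthy, which is what lets ``nully()``’s result be used directly in conditionals later.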
This is the straightforward version with no “compaction”. It works fine,
but does waaaay too much work because the expressions grow with each
derivative taken.
    if R == y or R == phi or R in syms:

        return (CONS, derv(R[1]), R)

        return (NOT, derv(R[1]))

        # ∂a(R∘S) → ∂a(R)∘S ∨ δ(R)∘∂a(S)
        A = (CONS, derv(r), s)  # A = ∂a(R)∘S
        # A ∨ λ ∘ ∂a(S) → A ∨ ∂a(S)
        # A ∨ ϕ ∘ ∂a(S) → A ∨ ϕ → A
        return (OR, A, derv(s)) if nully(r) else A

        # ∂a(R ∧ S) → ∂a(R) ∧ ∂a(S)
        # ∂a(R ∨ S) → ∂a(R) ∨ ∂a(S)
        return (tag, derv(r), derv(s))
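As a small worked check of these rules, take the derivative with respect to ``1`` of ``11*`` (the ``c`` component of the example), simplifying with ``λ∘R → R``, ``ϕ∘R → ϕ`` and ``R ∨ ϕ → R``:

    ∂1(1∘1*) → ∂1(1)∘1* ∨ δ(1)∘∂1(1*)
             → λ∘1* ∨ ϕ∘∂1(1*)
             → 1* ∨ ϕ
             → 1*

One ``1`` has been consumed; what remains must match ``1*``.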
def _compaction_rule(relation, one, zero, a, b):

        b if a == one else                   # R*1 = 1*R = R

        zero if a == zero or b == zero else  # R*0 = 0*R = 0

_and = curry(_compaction_rule, AND, I, phi)

_or = curry(_compaction_rule, OR, phi, I)

_cons = curry(_compaction_rule, CONS, y, phi)
We can save re-processing by remembering results we have already
computed. RE datastructures are immutable and the ``derv()`` functions
are *pure* so this is fine.
    def __init__(self, f):
        self.f = f
        self.mem = {}
        self.calls = self.hits = 0

    def __call__(self, key):
        self.calls += 1
        try:
            result = self.mem[key]
            self.hits += 1
        except KeyError:
            result = self.mem[key] = self.f(key)
        return result
This version uses the rules above to perform compaction. It keeps the
expressions from growing too large.
def D_compaction(symbol):

        if R == y or R == phi or R in syms:

            return _cons(derv(R[1]), R)

            return (NOT, derv(R[1]))

            # ∂a(R∘S) → ∂a(R)∘S ∨ δ(R)∘∂a(S)
            A = _cons(derv(r), s)  # A = ∂a(R)∘S
            # A ∨ λ ∘ ∂a(S) → A ∨ ∂a(S)
            # A ∨ ϕ ∘ ∂a(S) → A ∨ ϕ → A
            return _or(A, derv(s)) if nully(r) else A

            # ∂a(R ∧ S) → ∂a(R) ∧ ∂a(S)
            # ∂a(R ∨ S) → ∂a(R) ∨ ∂a(S)
            dr, ds = derv(r), derv(s)
            return _and(dr, ds) if tag == AND else _or(dr, ds)
o, z = D_compaction('0'), D_compaction('1')

names = list(product(*(N * [(0, 1)])))
dervs = list(product(*(N * [(o, z)])))
for name, ds in zip(names, dervs):

    if R == phi or R == I:

print stringy(it) ; print
print o.hits, '/', o.calls
print z.hits, '/', z.calls
for s in sorted(map(stringy, REs), key=lambda n: (len(n), n)):
    (.111.) & ((.01 | 11*)')

    (.111.) & ((.01 | 1)')
    (.111. | 11.) & ((.01 | ^)')
    (.111. | 11. | 1.) & ((.01)')
    (.111. | 11.) & ((.01 | 1*)')
    (.111. | 11. | 1.) & ((.01 | 1*)')
    (.111.) & ((.01 | 11*)')

    (.111.) & ((.01 | 1 )')
    (.111. | 11.) & ((.01 | ^ )')
    (.111. | 11.) & ((.01 | 1*)')
    (.111. | 11. | 1.) & ((.01 )')
    (.111. | 11. | 1.) & ((.01 | 1*)')
We could parse larger alphabets by defining patterns for e.g. each byte
of the ASCII code. Or we can generalize this code. If you study the code
above you’ll see that we never use the “set-ness” of the symbols ``O``
and ``l``. The only time Python set operators (``&`` and ``|``) appear
is in the ``nully()`` function, and there they operate on (recursively
computed) outputs of that function, never ``O`` and ``l``.
    ∂a(R ∨ S) → ∂a(R) ∨ ∂a(S)
    ∂a(R ∨ S) → ∂a(R) ∨ ∂a(S)
    ∂1({'0'}) ∨ ∂1({'1'})
This suggests that we should be able to alter the functions above to
detect sets and deal with them appropriately. Exercise for the Reader
We can drive the regular expressions to flesh out the underlying state
machine transition table.
Says, “Three or more 1’s in a row and not ending in 01 nor composed of all 1’s.”

   :alt: State Machine Graph

Start at ``a`` and follow the transition arrows according to their
labels. Accepting states have a double outline. (Graphic generated with
`Dot from Graphviz <http://www.graphviz.org/>`__.) You’ll see that only
paths that lead to one of the accepting states will match the regular
expression. All other paths will terminate at one of the non-accepting
There’s a happy path to ``g`` along 111:

After you reach ``g`` you’re stuck there eating 1’s until you see a 0,
which takes you to the ``i→j→i|i→j→h→i`` “trap”. You can’t reach any
other states from those two loops.

If you see a 0 before you see 111 you will reach ``b``, which forms
another “trap” with ``d`` and ``f``. The only way out is another happy
path along 111 to ``h``:
Once you have reached ``h`` you can see as many 1’s or as many 0’s in a
row and still either be at ``h`` (for 1’s) or move to ``i`` (for 0’s).
If you find yourself at ``i`` you can see as many 0’s, or repetitions of
10, as there are, but if you see just a 1 you move to
So how do we get the state machine from the regular expression?

It turns out that each RE is effectively a state, and each arrow points
to the derivative RE with respect to the arrow’s symbol.

If we label the initial RE ``a``, we can say:

And so on, each new unique RE is a new state in the FSM table.
Here are the derived REs at each state:

    a = (.111.) & ((.01 | 11*)')
    b = (.111.) & ((.01 | 1)')
    c = (.111. | 11.) & ((.01 | 1*)')
    d = (.111. | 11.) & ((.01 | ^)')
    e = (.111. | 11. | 1.) & ((.01 | 1*)')
    f = (.111. | 11. | 1.) & ((.01)')
You can see the one-way nature of the ``g`` state and the ``hij`` “trap”
in the way that the ``.111.`` on the left-hand side of the ``&``
disappears once it has been matched.
from collections import defaultdict
from pprint import pprint
from string import ascii_lowercase

d0, d1 = D_compaction('0'), D_compaction('1')

# Don't have more than 26 states...
names = defaultdict(iter(ascii_lowercase).next)

    table, accepting = dict(), set()

        state_name = names[re]

        if (state_name, 0) in table:

        accepting.add(state_name)

        o, i = d0(re), d1(re)
        table[state_name, 0] = names[o] ; to_check.add(o)
        table[state_name, 1] = names[i] ; to_check.add(i)

    return table, accepting
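The worklist loop in ``explore()`` is only partially visible above; here is a self-contained sketch of the same idea, driving a tiny hard-coded pair of stand-in “derivative” functions (the three-state toy machine is mine, not from the text):

```python
from collections import defaultdict
from functools import partial
from string import ascii_lowercase

# Toy stand-ins for d0/d1: three opaque values play the role of REs.
d0 = {'S': 'S', 'A': 'B', 'B': 'B'}.get
d1 = {'S': 'A', 'A': 'A', 'B': 'B'}.get

def explore(start_re):
    # Hand out single-letter names as new "REs" (states) are discovered.
    names = defaultdict(partial(next, iter(ascii_lowercase)))
    table = dict()
    to_check = {start_re}
    while to_check:
        re = to_check.pop()
        state_name = names[re]
        if (state_name, 0) in table:
            continue                      # already expanded this state
        o, i = d0(re), d1(re)
        table[state_name, 0] = names[o]; to_check.add(o)
        table[state_name, 1] = names[i]; to_check.add(i)
    return table
```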
table, accepting = explore(it)
Once we have the FSM table and the set of accepting states we can
generate the diagram above.
digraph finite_state_machine {
    node [shape = doublecircle]; %s;
    node [shape = circle];

def link(fr, nm, label):
    return ' %s -> %s [ label = "%s" ];' % (fr, nm, label)

def make_graph(table, accepting):

        link(from_, to, char)
        for (from_, char), to in sorted(table.iteritems())

print make_graph(table, accepting)
digraph finite_state_machine {
    node [shape = doublecircle]; i h;
    node [shape = circle];
    a -> b [ label = "0" ];
    a -> c [ label = "1" ];
    b -> b [ label = "0" ];
    b -> d [ label = "1" ];
    c -> b [ label = "0" ];
    c -> e [ label = "1" ];
    d -> b [ label = "0" ];
    d -> f [ label = "1" ];
    e -> b [ label = "0" ];
    e -> g [ label = "1" ];
    f -> b [ label = "0" ];
    f -> h [ label = "1" ];
    g -> i [ label = "0" ];
    g -> g [ label = "1" ];
    h -> i [ label = "0" ];
    h -> h [ label = "1" ];
    i -> i [ label = "0" ];
    i -> j [ label = "1" ];
    j -> i [ label = "0" ];
    j -> h [ label = "1" ];
There are *lots* of FSM libraries already. Once you have the state
transition table they should all be straightforward to use. State
Machine code is very simple. Just for fun, here is an implementation in
Python that imitates what “compiled” FSM code might look like in an
“unrolled” form. Most FSM code uses a little driver loop and a table
datastructure; the code below instead acts like JMP instructions
(“jump”, or GOTO in higher-level-but-still-low-level languages) to
hard-code the information in the table into a little patch of branches.
Python has no GOTO statement but we can fake it with a “trampoline”

def trampoline(input_, jump_from, accepting):
    I = iter(input_)
    while True:
        try:
            bounce_to = jump_from(I)
        except StopIteration:
            return jump_from in accepting
        jump_from = bounce_to
Little helpers to process the iterator of our data (a “stream” of “1”
and “0” characters, not bits).

getch = lambda I: int(next(I))

    while not getch(I): pass
A Finite State Machine
^^^^^^^^^^^^^^^^^^^^^^

With those preliminaries out of the way, from the state table of
``.111. & (.01 + 11*)'`` we can immediately write down state machine
code. (You have to imagine that these are GOTO statements in C or
branches in assembly and that the state names are branch destination
a = lambda I: c if getch(I) else b
b = lambda I: _0(I) or d
c = lambda I: e if getch(I) else b
d = lambda I: f if getch(I) else b
e = lambda I: g if getch(I) else b
f = lambda I: h if getch(I) else b
g = lambda I: _1(I) or i
h = lambda I: _1(I) or i
i = lambda I: _0(I) or j
j = lambda I: h if getch(I) else i
Note that the implementations of ``h`` and ``g`` are identical, so
``h = g`` would let us eliminate one in the code, but ``h`` is an
accepting state and ``g`` isn’t.
def acceptable(input_):
    return trampoline(input_, a, {h, i})

for n in range(2**5):

    print '%05s' % s, acceptable(s)
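Since the trampoline, the helpers, and the state lambdas are spread over several fragments above, here they are gathered into one runnable sketch (the body of ``_1`` is not shown above, so its definition here mirrors ``_0`` and is my guess):

```python
def trampoline(input_, jump_from, accepting):
    # Bounce from state to state until the input is exhausted.
    I = iter(input_)
    while True:
        try:
            bounce_to = jump_from(I)
        except StopIteration:
            return jump_from in accepting
        jump_from = bounce_to

getch = lambda I: int(next(I))

def _0(I):
    while not getch(I): pass   # eat 0's; fall through on the first 1

def _1(I):
    while getch(I): pass       # eat 1's; fall through on the first 0

a = lambda I: c if getch(I) else b
b = lambda I: _0(I) or d
c = lambda I: e if getch(I) else b
d = lambda I: f if getch(I) else b
e = lambda I: g if getch(I) else b
f = lambda I: h if getch(I) else b
g = lambda I: _1(I) or i
h = lambda I: _1(I) or i
i = lambda I: _0(I) or j
j = lambda I: h if getch(I) else i

def acceptable(input_):
    return trampoline(input_, a, {h, i})
```

For example, ``acceptable('0111')`` is ``True`` (it contains 111, doesn’t end in 01, and isn’t all 1’s), while ``acceptable('111')`` is ``False``.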
Reversing the Derivatives to Generate Matching Strings
------------------------------------------------------

(UNFINISHED) Brzozowski also showed how to go from the state machine to
strings and expressions…
Each of these states is just a name for a Brzozowskian RE, and so, other
than the initial state ``a``, they can be described in terms of the
derivative-with-respect-to-N of some other state/RE:

    j = d1(d0(j)) = d01(j)
We have a loop or “fixed point”.

    j = d01(j) = d0101(j) = d010101(j) = ...