Home

Regular Expression Decoding

Regular expressions are great. They can be a quick and easy way to extract substrings of interest from a string. I use the awesome hyper crate in my day job to serve up a web API and I use the hyper-router crate to define all of the routes for the web API using regular expressions. Often times, those regular expressions are used something like this (stub code borrowed from hyper-router):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
extern crate hyper;
extern crate hyper_router;
extern crate regex;

use hyper::server::{Server, Request, Response};
use hyper::status::StatusCode;
use hyper::uri::RequestUri;
use hyper_router::{Route, RouterBuilder};
use regex::Regex;

const ENDPOINT: &'static str = r"^/api/(?P<mac>[a-f0-9]{12})/version/(?P<version>[0-9]{14})";

fn get_uri(uri: &RequestUri) -> String {
  // get the uri as a string
}

// INTERESTING BIT IS HERE!!!
fn endpoint_handler(req: Request, res: Response) {
  let uri = get_uri(&req.uri);
  let endpoint_regex = Regex::new(ENDPOINT).unwrap();
  let captures = endpoint_regex.captures(&uri).unwrap()
  let mac = captures.name("mac").unwrap();
  let version_str = captures.name("version").unwrap();
  let version: u64 = version_str.parse().unwrap();

  // Do something with the mac and the version number

  res.send(b"Hello World!").unwrap();
}

fn main() {
  let router = RouterBuilder::new()
    .add(Route::get(ENDPOINT).using(endpoint_handler))
    .build();

  Server::http("0.0.0.0:8080").unwrap()
    .handle(move |request: Request, response: Response| {
      match router.find_handler(&request) {
        Ok(handler) => handler(request, response),
        Err(StatusCode::NotFound) => response.send(b"not found").unwrap(),
        Err(_) => response.send(b"some error").unwrap()
     }
   }).unwrap();
}

That code isn’t too terrible, but it is ignoring a lot of error handling for extracting the mac and version strings from the regular expression, as well as error code for converting that version string to an actual number. Now, in this case, the unwraps are probably okay since:

  • We had to match against the regular expression for the endpoint_handler to be called, so we should successfully get a captures object from applying the regex to the uri.
  • Both mac and version will exist.
  • The version string will be a number that fits in a u64 because of the regex.

However, the code gets very repetitive. What if we could just do something like this?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
extern crate regex_decode;
use regex_decode::decode;

#[derive(RustcDecodable)]
struct EndpointUri {
  mac: String,
  version: u64,
}

fn endpoint_handler(req: Request, res: Response) {
  let uri = get_uri(&req.uri);
  let endpoint_regex = Regex::new(ENDPOINT).unwrap();
  let endpoint = decode::<EndpointUri>(&endpoint_regex, &uri).unwrap();

  // Do something with the mac and the version number we got from
  // the endpoint...

  res.send(b"Hello World!").unwrap();
}

That is what the regex-decode crate does for you. It provides the ability to decode the named capture groups of a regular expression into a struct. Additionally, it will do type conversions from strings into other primitive types that are defined in the struct. So, in the above example, it hides away the boilerplate of extracting the named captures and the type conversion and gives you a single place to check for errors, instead of every step along the way.

If defining a struct is too heavy-weight for you, just have it decode to a tuple:

1
2
3
4
5
6
7
8
9
10
11
12
13
extern crate regex_decode;
use regex_decode::decode;

fn endpoint_handler(req: Request, res: Response) {
  let uri = get_uri(&req.uri);
  let endpoint_regex = Regex::new(ENDPOINT).unwrap();
  let (mac, version) = decode::<(String, u64)>(&endpoint_regex, &uri).unwrap();

  // Do something with the mac and the version number we got from
  // the endpoint...

  res.send(b"Hello World!").unwrap();
}

In this case, we are not decoding by the names of the capture groups and are instead just grabbing the captures by position. Plus, that still gives you the type safety that the struct would give you.

You should be able to use any type you want as long as it is RustcDecodable. However, recursive types do not make a whole lot of sense while you are parsing a regular expression, so those are not handled. Also, I have not figured out a good way to decode tuple structs yet, so the library will emit an error if you are trying to decode to a tuple struct. But, for the use case of quickly decoding to a simple struct, the crate works just fine.

Untyped Arithmetic Expressions

I have recently started working my way through the excellent Types and Programming Languages (TaPL) by Benjamin Pierce (expect to see this name pop up a lot in the future, because I plan on working through Software Foundations by him as well). Since I am also a huge fan of the Rust programming language, I have started working through the programming examples in Rust.

Rust is not the best language to write little interpreters in. While it provides pattern matching, which is really essential for this task, the borrowing rules force the code to be a bit ugly at times. The first major code example in the book is an interpreter for untyped arithmetic expressions. Rust does make it fairly short, so I will reproduce it all here and then walk through bits of it to explain.

Untyped Arithmetic Expressions

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
#![feature(box_syntax, box_patterns)]

#[derive(Clone, Debug, PartialEq)]
pub enum Term {
    True,
    False,
    Zero,
    IsZero(Box<Term>),
    Succ(Box<Term>),
    Pred(Box<Term>),
    If(Box<Term>, Box<Term>, Box<Term>),
}

pub fn is_numeric_val(t: &Term) -> bool {
    match *t {
        Term::Zero => true,
        Term::Succ(ref t) => is_numeric_val(t),
        _ => false,
    }
}

pub fn eval1(t: &Term) -> Option<Term> {
    match *t {
        /// True-valued if statement
        Term::If(box Term::True, ref t2, _) => Some(*t2.clone()),
        /// False-valued if statement
        Term::If(box Term::False, _, ref t3) => Some(*t3.clone()),
        /// Evaluate the condition first, then return an if statement
        /// that has the condition evaluated to a true or false
        Term::If(ref cond, ref t2, ref t3) => {
            eval1(cond).map(|boolean| {
                 Term::If(box boolean, box *t2.clone(), box *t3.clone())
            })
        }
        /// Succ(t) means we have to evaluate t first to simplify it, then we
        /// rewrite the term to Succ(evaled)
        Term::Succ(ref t) => eval1(t).map(|evaled| Term::Succ(box evaled)),
        /// The predecessor of Zero is defined as Zero
        Term::Pred(box Term::Zero) => Some(Term::Zero),
        /// The predecessor of the successor of a value is the value itself
        /// (if it was a numeric value to begin with)
        Term::Pred(box Term::Succ(ref nv1)) if is_numeric_val(nv1) => Some(*nv1.clone()),
        /// Pred(t) means we have to evaluate t first to simplify it, then we
        /// rewrite the term to Pred(evaled)
        Term::Pred(ref t) => eval1(t).map(|evaled| Term::Pred(box evaled)),
        /// By definition, the Zero term IsZero.
        Term::IsZero(box Term::Zero) => Some(Term::True),
        /// If the term is numeric and the successor of a value, then it cannot be IsZero
        Term::IsZero(box Term::Succ(ref nv1)) if is_numeric_val(nv1) => Some(Term::False),
        /// If a term evals to Zero, then the term IsZero
        Term::IsZero(ref t) => eval1(t).map(|evaled| Term::IsZero(box evaled)),
        _ => None,
    }
}

pub fn eval(t: &Term) -> Option<Term> {
    Some(eval1(t)
             .and_then(|t| eval(&t))
             .unwrap_or(t.clone()))
}

So, due to Rust’s memory management rules, we have to specify when structures are stored on the heap and when they are just stored in place. In Rust, we can do that by declaring the object to be boxed with Box. So, when we define the Term enumeration (lines 4–12), terms that have other terms embedded in them take the embedded terms as boxes. That leads us to the magic at the very first line. To be able to pattern match inside of a Box, we have to enable the box_patterns feature. I chose to also enable the box_syntax feature because it makes the code a bit shorter when creating boxes. Note that compiler features are not stable, so you have to compile this code with a nightly compiler to get it to build. I built it on 1.10 initially and recently rebuilt it using 1.13.

The evaluation functions are mostly self-explanatory with the in-line comments I put into place. I want to make a few general comments, though. Evaluating these terms works mainly like algebraic simplification. At each step, you look at the pattern and perform the required simplification. The eval1 function performs what is described in TaPL as a single-step evaluation. It does a single step in the evalution process and returns a simpler expression. Then, the eval function causes that simpler expression to go back through the eval1 function again to further simplify it until it cannot be simplified any more. At that point, eval1 will return None, causing the recursion to terminate and the fully simplified form to be returned as the answer.

While I did not include them, I wrote several tests to verify that I was evaluating everything correctly. You can see the Cargo project (including the tests) in the arith directory of my tapl repo on Github.