Hi,
Today I will finally go over how I used the lazy_static crate so that each regular expression is compiled only once.
References
Avoid using regex in a loop
According to the regex documentation, compiling the same regular expression in a loop is an anti-pattern.
Not only is compilation itself expensive, but this also prevents optimizations that reuse allocations internally to the matching engines.
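To make the cost concrete, here is a minimal sketch that counts how often an expensive value is built. Matcher is a hypothetical stand-in for Regex (a plain prefix check, so the example needs only the standard library), and both function names are made up for illustration.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a compiled `Regex`: building it is the costly step.
struct Matcher {
    prefix: &'static str,
}

impl Matcher {
    /// Builds a matcher and records the build in `builds`.
    fn new(prefix: &'static str, builds: &Cell<usize>) -> Self {
        builds.set(builds.get() + 1);
        Matcher { prefix }
    }

    /// Naive check: does `input` start with the prefix?
    fn is_match(&self, input: &str) -> bool {
        input.starts_with(self.prefix)
    }
}

/// Anti-pattern: rebuilds the matcher for every record.
/// Returns (number of matches, number of builds).
fn count_rebuilding(records: &[&str]) -> (usize, usize) {
    let builds = Cell::new(0);
    let mut matches = 0;
    for &record in records {
        let matcher = Matcher::new("mx", &builds); // built on every iteration
        if matcher.is_match(record) {
            matches += 1;
        }
    }
    (matches, builds.get())
}

/// Builds the matcher once, before the loop, and reuses it.
/// Returns (number of matches, number of builds).
fn count_reusing(records: &[&str]) -> (usize, usize) {
    let builds = Cell::new(0);
    let matcher = Matcher::new("mx", &builds); // built exactly once
    let mut matches = 0;
    for &record in records {
        if matcher.is_match(record) {
            matches += 1;
        }
    }
    (matches, builds.get())
}
```

Both functions find the same matches. The only difference is the build count, which grows with the number of records in the first function and stays at one in the second.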
Initial Code
My original code was of course the anti-pattern.
pub fn parse(&mut self) {
    // initialises required variables.
    let records = self.source.split_whitespace();
    let mut vec_of_includes: Vec<SpfMechanism<String>> = Vec::new();
    let mut vec_of_ip4: Vec<SpfMechanism<IpNetwork>> = Vec::new();
    let mut vec_of_ip6: Vec<SpfMechanism<IpNetwork>> = Vec::new();
    let mut vec_of_a: Vec<SpfMechanism<String>> = Vec::new();
    let mut vec_of_mx: Vec<SpfMechanism<String>> = Vec::new();
    for record in records {
        // Make this lazy.
        let a_pattern =
            Regex::new(r"^(?P<qualifier>[+?~-])?(?P<mechanism>a(?:[:/]{0,1}.+)?)").unwrap();
        let mx_pattern =
            Regex::new(r"^(?P<qualifier>[+?~-])?(?P<mechanism>mx(?:[:/]{0,1}.+)?)").unwrap();
        if record.contains("redirect=") {
            // Match a redirect
            --snip--
        }
This actually caused a significant issue. To reuse these regular expressions, I had to redeclare them in every file that used them, test files and the like. This of course makes the code harder to manage and invites mistakes.
Remove duplication
The first step was to turn the patterns into const &str values.
const MECHANISM_A_PATTERN: &str = r"^(?P<qualifier>[+?~-])?a(?P<mechanism>[:/]{0,1}.+)?";
const MECHANISM_MX_PATTERN: &str = r"^(?P<qualifier>[+?~-])?mx(?P<mechanism>[:/]{0,1}.+)?";
const MECHANISM_PTR_PATTERN: &str = r"^(?P<qualifier>[+?~-])?ptr(?P<mechanism>[:]{0,1}.+)?";
This gave me a single place to edit the expressions if needed.
let a_pattern = Regex::new(MECHANISM_A_PATTERN).unwrap();
let mx_pattern = Regex::new(MECHANISM_MX_PATTERN).unwrap();
let ptr_pattern = Regex::new(MECHANISM_PTR_PATTERN).unwrap();
But this still meant I needed to bring the patterns in with a use statement in each file, for example in a test file.
use crate::spf::helpers;
use crate::spf::kinds;
use crate::spf::mechanism::Mechanism;
use crate::spf::MECHANISM_A_PATTERN;
use regex::Regex;

#[test]
fn test_match_on_a_only() {
    let string = "a";
    // The pattern has to be compiled in every test that needs it.
    let pattern = Regex::new(MECHANISM_A_PATTERN).unwrap();
    let option_test: Option<Mechanism<String>>;
    option_test = helpers::capture_matches(pattern, &string, kinds::MechanismKind::A);
    let test = option_test.unwrap();
    assert_eq!(test.is_pass(), true);
    assert_eq!(test.raw(), "a");
    assert_eq!(test.string(), "a");
}
At this point you might notice the helpers:: prefix. I needed the capture_matches() function a lot in my test files, so I moved it to the helpers.rs file.
helpers.rs
Let’s take a look.
use crate::spf::kinds;
use crate::spf::mechanism::Mechanism;
use crate::spf::qualifier::Qualifier;
use regex::Regex;

#[doc(hidden)]
pub(crate) fn capture_matches(
    pattern: Regex,
    string: &str,
    kind: kinds::MechanismKind,
) -> Option<Mechanism<String>> {
    --snip--
}
Here I am using pub(crate), which makes this function public within the crate and only within the crate.
This still does not solve the problem of the regular expressions being compiled inside the loop, because every caller has to build a Regex to pass in.
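A small sketch of what pub(crate) allows, with everything in one file. The module layout mirrors the crate::spf::helpers path used above, but the strip_qualifier() and mechanism_name() functions are hypothetical and exist only for this illustration.

```rust
mod spf {
    pub mod helpers {
        /// Visible everywhere inside this crate, but not exported to
        /// crates that depend on it.
        pub(crate) fn strip_qualifier(record: &str) -> &str {
            // Drop one leading qualifier character (+, -, ~ or ?) if present.
            record
                .strip_prefix(|c: char| matches!(c, '+' | '-' | '~' | '?'))
                .unwrap_or(record)
        }
    }
}

/// Code elsewhere in the same crate can call the helper through its module path.
fn mechanism_name(record: &str) -> &str {
    spf::helpers::strip_qualifier(record)
}
```

Changing pub(crate) to plain pub would make no difference inside the crate. It would only matter to external users of the library, who would then see the helper as part of the public API.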
The new helpers.rs
To make the regular expressions truly lazy, so that each one is compiled only once, I create them inside a helper function. Fortunately, I basically already had one in the form of capture_matches().
Let’s revisit helpers.rs
use crate::spf::kinds;
use crate::spf::mechanism::Mechanism;
use crate::spf::qualifier::Qualifier;
use lazy_static::lazy_static;
use regex::Regex;

pub(crate) const MECHANISM_A_PATTERN: &str = r"^(?P<qualifier>[+?~-])?a(?P<mechanism>[:/]{0,1}.+)?";
pub(crate) const MECHANISM_MX_PATTERN: &str =
    r"^(?P<qualifier>[+?~-])?mx(?P<mechanism>[:/]{0,1}.+)?";
pub(crate) const MECHANISM_PTR_PATTERN: &str =
    r"^(?P<qualifier>[+?~-])?ptr(?P<mechanism>[:]{0,1}.+)?";
pub(crate) fn capture_matches(
    string: &str,
    kind: kinds::MechanismKind,
) -> Option<Mechanism<String>> {
    lazy_static! {
        static ref A_RE: Regex = Regex::new(MECHANISM_A_PATTERN).unwrap();
        static ref MX_RE: Regex = Regex::new(MECHANISM_MX_PATTERN).unwrap();
        static ref PTR_RE: Regex = Regex::new(MECHANISM_PTR_PATTERN).unwrap();
    }
    let caps = match kind {
        kinds::MechanismKind::A => A_RE.captures(string),
        kinds::MechanismKind::MX => MX_RE.captures(string),
        kinds::MechanismKind::Ptr => PTR_RE.captures(string),
        _ => unreachable!(),
    };
    let qualifier_char: char;
    let mut qualifier_result: Qualifier = Qualifier::Pass;
    let mechanism_string: String;
    let mechanism;
    match caps {
        None => return None,
        Some(caps) => {
            // There was a match
            if caps.name("qualifier").is_some() {
                qualifier_char = caps
                    .name("qualifier")
                    .unwrap()
                    .as_str()
                    .chars()
                    .nth(0)
                    .unwrap();
                qualifier_result = char_to_qualifier(qualifier_char);
            };
            if caps.name("mechanism").is_some() {
                mechanism_string = caps.name("mechanism").unwrap().as_str().to_string();
                mechanism = Mechanism::new(kind, qualifier_result, (*mechanism_string).to_string());
            } else {
                mechanism_string = match kind {
                    kinds::MechanismKind::A => "a".to_string(),
                    kinds::MechanismKind::MX => "mx".to_string(),
                    kinds::MechanismKind::Ptr => "ptr".to_string(),
                    _ => unreachable!(),
                };
                mechanism = Mechanism::new(kind, qualifier_result, mechanism_string);
            }
            Some(mechanism)
        }
    }
}
Here the const &str patterns have also been made public to the crate.
I am then creating A_RE, MX_RE and PTR_RE within this function, enclosed in a lazy_static! macro block. This means that the regular expressions are now part of the function: each one is compiled the first time it is used, and the compiled Regex is reused on every later call. So I was able to remove pattern: Regex from the parameter list.
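The same compile-once behaviour can be sketched with only the standard library. Note that this example swaps lazy_static for std::sync::OnceLock (available since Rust 1.70), and uses a hypothetical build_prefixes() function with a plain prefix check as a stand-in for Regex::new(). The counter exists only to show that the initialiser runs a single time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the expensive initialiser has run.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for `Regex::new`: pretend this is costly.
fn build_prefixes() -> Vec<&'static str> {
    INIT_CALLS.fetch_add(1, Ordering::SeqCst);
    vec!["a", "mx", "ptr"]
}

/// Naive check: does `record` start with one of the known prefixes?
fn is_known_mechanism(record: &str) -> bool {
    // Function-local static: initialised on the first call, reused afterwards.
    static PREFIXES: OnceLock<Vec<&'static str>> = OnceLock::new();
    let prefixes = PREFIXES.get_or_init(build_prefixes);
    prefixes.iter().any(|prefix| record.starts_with(*prefix))
}

/// Number of times `build_prefixes` has been called so far.
fn init_calls() -> usize {
    INIT_CALLS.load(Ordering::SeqCst)
}
```

However many times is_known_mechanism() is called, init_calls() stays at one after the first call. A lazy_static! block inside a function gives the same guarantee for a Regex.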
My case may or may not be unusual in that I am using the same function for three different patterns.
I handle this by passing one of kinds::MechanismKind::{A, MX, Ptr} as the second parameter. A match on that value then ensures that the correct regular expression is applied to the string passed as the first parameter. The _ arm is a catch-all that matches every other variant of MechanismKind. Its body, unreachable!(), makes the program panic if that arm is ever taken, which should never happen as long as capture_matches() is only called with these three kinds.
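Here is a minimal sketch of that catch-all arm on its own. The enum is a hypothetical, cut-down version of MechanismKind (the Include variant is an assumption), and both function names are invented for the example. The second function shows one possible alternative that returns None instead of panicking.

```rust
/// Hypothetical, cut-down version of `MechanismKind`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MechanismKind {
    A,
    MX,
    Ptr,
    Include, // assumed extra variant that the helper does not handle
}

/// Mirrors the match in capture_matches(): panics for any other variant.
fn default_name(kind: MechanismKind) -> &'static str {
    match kind {
        MechanismKind::A => "a",
        MechanismKind::MX => "mx",
        MechanismKind::Ptr => "ptr",
        // Catch-all arm: reaching it means the caller passed an unsupported kind.
        _ => unreachable!("default_name only handles A, MX and Ptr"),
    }
}

/// Alternative: report an unsupported kind to the caller instead of panicking.
fn try_default_name(kind: MechanismKind) -> Option<&'static str> {
    match kind {
        MechanismKind::A => Some("a"),
        MechanismKind::MX => Some("mx"),
        MechanismKind::Ptr => Some("ptr"),
        _ => None,
    }
}
```

Which version fits better depends on who calls the function. For a crate-private helper with a handful of known call sites, unreachable!() documents the assumption. For anything callers outside your control can reach, returning an Option is the gentler choice.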
The new Code
Now we can use this code more easily and keep the code much cleaner.
Another gain is that we no longer need to import regex in other files. We now just need use crate::spf::helpers; and we have access to our capture_matches() function.
parse()
This function no longer declares any regular expressions.
pub fn parse(&mut self) {
    // # TODO: This needs to have a test for empty source. return a Result?
    // initialises required variables.
    let records = self.source.split_whitespace();
    let mut vec_of_includes: Vec<Mechanism<String>> = Vec::new();
    let mut vec_of_ip4: Vec<Mechanism<IpNetwork>> = Vec::new();
    let mut vec_of_ip6: Vec<Mechanism<IpNetwork>> = Vec::new();
    let mut vec_of_a: Vec<Mechanism<String>> = Vec::new();
    let mut vec_of_mx: Vec<Mechanism<String>> = Vec::new();
    let mut vec_of_exists: Vec<Mechanism<String>> = Vec::new();
    for record in records {
        if record.contains("v=spf1") || record.starts_with("spf2.0") {
            self.version = record.to_string();
        -- snip --
        } else if let Some(a_mechanism) =
            helpers::capture_matches(record, kinds::MechanismKind::A)
        {
            vec_of_a.push(a_mechanism);
        } else if let Some(mx_mechanism) =
            helpers::capture_matches(record, kinds::MechanismKind::MX)
        {
            vec_of_mx.push(mx_mechanism);
        } else if let Some(ptr_mechanism) =
            helpers::capture_matches(record, kinds::MechanismKind::Ptr)
        {
            self.ptr = Some(ptr_mechanism);
        }
    -- snip --
}
Test File
The test file also now looks much nicer.
use crate::spf::helpers;
use crate::spf::kinds;
use crate::spf::mechanism::Mechanism;

#[test]
fn test_match_on_a_only() {
    let string = "a";
    let option_test: Option<Mechanism<String>>;
    option_test = helpers::capture_matches(&string, kinds::MechanismKind::A);
    let test = option_test.unwrap();
    assert_eq!(test.is_pass(), true);
    assert_eq!(test.raw(), "a");
    assert_eq!(test.string(), "a");
}
Final Words.
So the takeaway here is that if you want to make your regular expressions lazy, so that each one is compiled only once, create a helper function that does the pattern matching and define your lazy_static! {} block within that function. Don’t forget to add the lazy_static crate to your Cargo.toml.
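For reference, the dependency entries might look something like this. The version requirements are assumptions, so check crates.io for the current releases before copying them.

```toml
[dependencies]
lazy_static = "1"
regex = "1"
```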