Hi. In my previous article, Looking at Regex in Rust, I covered some basics. Today I will go over some new things I have learned as I expanded what my regular expression needs to handle. In particular, it needs to handle the possible presence or absence of the qualifiers +, -, ~, and ?.
TL;DR:
If you don't want to read the full article, here are the highlights.
A regex pattern such as (x?) will have the following results:
- If there is no x, the resulting capture will be Some("").
- If there is an x, the resulting capture will be Some("x").
The solution is to make the capture itself optional, not the contents of the capture, by writing (x)? instead:
- If there is no x, the resulting capture will be None.
- If there is an x, the resulting capture will be Some("x").
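- The regex crate does not enable multi-line mode by default: ^ and $ match only at the start and end of the whole input.
- If you want them to match at the start and end of every line, enable it with (?m).
- You can switch it off again inside a pattern with (?-m).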
Adding support for Qualifiers
My initial regex started to look pretty nice, but I remembered that I needed to handle qualifiers. This led to a new set of discoveries.
My initial attempt without qualifiers turned out to be this:
Which gave the output
Compiling playground v0.0.1 (/playground)
Finished dev [unoptimized + debuginfo] target(s) in 1.85s
Running `target/debug/playground`
Standard Output
Some(Captures({0: Some("a"), "a_only": Some("a"),
"a_colon": None, "a_slash": None}))
Some(Captures({0: Some("a:example.com"), "a_only": None,
"a_colon": Some("example.com"), "a_slash": None}))
Some(Captures({0: Some("a:mailers.example.com"), "a_only": None,
"a_colon": Some("mailers.example.com"), "a_slash": None}))
Some(Captures({0: Some("a/24"), "a_only": None,
"a_colon": None, "a_slash": Some("a/24")}))
Some(Captures({0: Some("a:offsite.example.com/24]"), "a_only": None,
"a_colon": Some("offsite.example.com/24]"), "a_slash": None}))
Adding Qualifiers to the mix
I now needed to handle the qualifiers, which may or may not be present. This is the new regex I ended up with:
/^(?P<qualifier>[+?~-])?(?P<is_a>a)(?:[:])?(?P<a_mechanism>.+)?/gmi
Let's break this down a bit.
- ^ The line/string starts.
- (?P<qualifier>[+?~-])? There may be one of +, ?, ~, or - present. The capture can also find nothing, because the group, written as ()?, ends with a ?. The position of the - is also important: placing it last in the list (first also works) makes it a literal - instead of the start of a range.
- (?P<is_a>a) There will be an a character. (Yes, this capture is really redundant now.)
- (?:[:])? There may be a : character. This is enclosed in a non-capturing group so that it can be defined as optional.
- (?P<a_mechanism>.+)? Capture any other text present. Again, this complete capture is optional, since the mechanism could contain only a as the complete record.
Results
Without the ^ Character
Output
Compiling playground v0.0.1 (/playground)
Finished dev [unoptimized + debuginfo] target(s) in 3.37s
Running `target/debug/playground`
Standard Output
None
Some(Captures({0: Some("a:example.com"), "qualifier": None,
"mechamism": Some("example.com")}))
Some(Captures({0: Some("~a:example.com"), "qualifier": Some("~"),
"mechamism": Some("example.com")}))
Some(Captures({0: Some("a:mailers.example.com"), "qualifier": None,
"mechamism": Some("mailers.example.com")}))
Some(Captures({0: Some("-a:mailers.example.com"), "qualifier": Some("-"),
"mechamism": Some("mailers.example.com")}))
Some(Captures({0: Some("a/24"), "qualifier": None,
"mechamism": Some("/24")}))
Some(Captures({0: Some("-a:offsite.example.com/24]"), "qualifier": Some("-"),
"mechamism": Some("offsite.example.com/24]")}))
None
Some(Captures({0: Some("ailer.com.au"), "qualifier": None,
"mechamism": Some("iler.com.au")}))
Take note here that the first test input, a, gives None. This is not correct. Also note that +mx:mailer.com.au gives a match with a mechamism of Some("iler.com.au"). This is also not correct.
With the ^ Character
(?P<is_a>a) re-introduced
Output
Compiling playground v0.0.1 (/playground)
Finished dev [unoptimized + debuginfo] target(s) in 2.26s
Running `target/debug/playground`
Standard Output
Some(Captures({0: Some("a"), "qualifier": None, "is_a": Some("a"),
"mechamism": None}))
Some(Captures({0: Some("a:example.com"), "qualifier": None, "is_a": Some("a"),
"mechamism": Some("example.com")}))
Some(Captures({0: Some("~a:example.com"), "qualifier": Some("~"), "is_a": Some("a"),
"mechamism": Some("example.com")}))
Some(Captures({0: Some("a:mailers.example.com"), "qualifier": None, "is_a": Some("a"),
"mechamism": Some("mailers.example.com")}))
Some(Captures({0: Some("-a:mailers.example.com"), "qualifier": Some("-"), "is_a": Some("a"),
"mechamism": Some("mailers.example.com")}))
Some(Captures({0: Some("a/24"), "qualifier": None, "is_a": Some("a"),
"mechamism": Some("/24")}))
Some(Captures({0: Some("-a:offsite.example.com/24]"), "qualifier": Some("-"), "is_a": Some("a"),
"mechamism": Some("offsite.example.com/24]")}))
None
None
With the ^ character present the regex now works correctly. It successfully matches a and does not match mx or +mx:mailer.com.au. We also have the required None when there is no capture.
Conclusion
Avoid capturing Some("") by marking the capture itself as optional with (x)? and not (x?).
The Rust regex crate does not default to multi-line mode: ^ matches only at the start of the input unless you enable the m flag with (?m).
Thanks for reading.