Hi. In my previous article, Looking at Regex in Rust, I covered some basics. Today I will go over some new things I have learned as I expanded what my regex needs to handle. In particular, it needs to cope with the possible presence or absence of the qualifiers +, -, ~ and ?.
TL;DR:
If you don’t want to read the full article, here are the highlights.
- A regex pattern such as (x?) will have the following results:
  - If there is no x, the resulting capture will be Some("")
  - If there is an x, the resulting capture will be Some("x")
- The solution is to make the capture itself optional, rather than the contents of the capture, by writing (x)?.
  - If there is no x, the resulting capture will be None
  - If there is an x, the resulting capture will be Some("x")
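- The regex crate does not enable multi-line mode by default.
  - Without it, ^ and $ match only at the start and end of the whole input, which is all that is needed when each string is tested on its own.
  - If you do want ^ and $ to match at line boundaries, enable it with (?m); (?-m) turns it off again.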
Adding support for Qualifiers
My initial regex started to look pretty nice. But I remembered that I needed to handle qualifiers. This led to a new set of discoveries.
My initial attempt without qualifiers turned out to be this:
fn main() {
    let regex = regex::Regex::new(r"(?P<a_only>a$)|(?:a:)(?P<a_colon>[^/].+)|(?P<a_slash>a/\d{1,2})").unwrap();
    [
        "a",
        "a:example.com",
        "a:mailers.example.com",
        "a/24",
        "a:offsite.example.com/24]",
    ]
    .iter()
    .copied()
    .map(|string| regex.captures(string))
    .for_each(|cap| println!("{:?}", cap));
}
Which gave the output
Compiling playground v0.0.1 (/playground)
Finished dev [unoptimized + debuginfo] target(s) in 1.85s
Running `target/debug/playground`
Standard Output
Some(Captures({0: Some("a"), "a_only": Some("a"),
"a_colon": None, "a_slash": None}))
Some(Captures({0: Some("a:example.com"), "a_only": None,
"a_colon": Some("example.com"), "a_slash": None}))
Some(Captures({0: Some("a:mailers.example.com"), "a_only": None,
"a_colon": Some("mailers.example.com"), "a_slash": None}))
Some(Captures({0: Some("a/24"), "a_only": None,
"a_colon": None, "a_slash": Some("a/24")}))
Some(Captures({0: Some("a:offsite.example.com/24]"), "a_only": None,
"a_colon": Some("offsite.example.com/24]"), "a_slash": None}))
Adding Qualifiers to the mix
I now needed to handle the qualifiers. They might or might not exist. The new regex I ended up with:
/^(?P<qualifier>[+?~-])?(?P<is_a>a)(?:[:])?(?P<a_mechanism>.+)?/gmi
Let’s break this down a bit.
- ^ The line/string starts.
- (?P<qualifier>[+?~-])? There may be one of +, ?, ~ or - present, or the capture can find nothing at all, because the group itself, ()?, ends with a ?. The position of the - also matters: placed last (or first) in the character class it is a literal hyphen, whereas between two other characters it would define a range.
- (?P<is_a>a) There will be an a character. (Yes, this capture is really redundant now.)
- (?:[:])? There may be a : character. It is enclosed in a non-capturing group so that it can be marked as optional.
- (?P<a_mechanism>.+)? Capture any other text present. Again, the complete capture is optional, because the mechanism could consist of nothing but a.
Results
Without the ^ Character
fn main() {
    let regex = regex::Regex::new(r"(?P<qualifier>[+?~-])?a(?:[:]|)(?P<mechamism>.+)").unwrap();
    [
        "a",
        "a:example.com",
        "~a:example.com",
        "a:mailers.example.com",
        "-a:mailers.example.com",
        "a/24",
        "-a:offsite.example.com/24]",
        "mx",
        "+mx:mailer.com.au",
    ]
    .iter()
    .copied()
    .map(|string| regex.captures(string))
    .for_each(|cap| println!("{:?}", cap));
}
Output
Compiling playground v0.0.1 (/playground)
Finished dev [unoptimized + debuginfo] target(s) in 3.37s
Running `target/debug/playground`
Standard Output
None
Some(Captures({0: Some("a:example.com"), "qualifier": None,
"mechamism": Some("example.com")}))
Some(Captures({0: Some("~a:example.com"), "qualifier": Some("~"),
"mechamism": Some("example.com")}))
Some(Captures({0: Some("a:mailers.example.com"), "qualifier": None,
"mechamism": Some("mailers.example.com")}))
Some(Captures({0: Some("-a:mailers.example.com"), "qualifier": Some("-"),
"mechamism": Some("mailers.example.com")}))
Some(Captures({0: Some("a/24"), "qualifier": None,
"mechamism": Some("/24")}))
Some(Captures({0: Some("-a:offsite.example.com/24]"), "qualifier": Some("-"),
"mechamism": Some("offsite.example.com/24]")}))
None
Some(Captures({0: Some("ailer.com.au"), "qualifier": None,
"mechamism": Some("iler.com.au")}))
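Take note here that the first test string, a, gives None. This is not correct: it happens because the mechamism capture, .+, is not optional in this version and needs at least one character after the a. Also note that +mx:mailer.com.au gives a match with a mechamism of Some("iler.com.au"). This is also not correct: without an anchor the regex is free to start matching at the a inside mailer.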
With the ^ Character
fn main() {
    let regex = regex::Regex::new(r"^(?P<qualifier>[+?~-])?(?P<is_a>a)(?:[:]|)?(?P<mechamism>.+)?").unwrap();
    [
        "a",
        "a:example.com",
        "~a:example.com",
        "a:mailers.example.com",
        "-a:mailers.example.com",
        "a/24",
        "-a:offsite.example.com/24]",
        "mx",
        "+mx:mailer.com",
    ]
    .iter()
    .copied()
    .map(|string| regex.captures(string))
    .for_each(|cap| println!("{:?}", cap));
}
Note that the (?P<is_a>a) capture has been re-introduced in this version.
Output
Compiling playground v0.0.1 (/playground)
Finished dev [unoptimized + debuginfo] target(s) in 2.26s
Running `target/debug/playground`
Standard Output
Some(Captures({0: Some("a"), "qualifier": None, "is_a": Some("a"),
"mechamism": None}))
Some(Captures({0: Some("a:example.com"), "qualifier": None, "is_a": Some("a"),
"mechamism": Some("example.com")}))
Some(Captures({0: Some("~a:example.com"), "qualifier": Some("~"), "is_a": Some("a"),
"mechamism": Some("example.com")}))
Some(Captures({0: Some("a:mailers.example.com"), "qualifier": None, "is_a": Some("a"),
"mechamism": Some("mailers.example.com")}))
Some(Captures({0: Some("-a:mailers.example.com"), "qualifier": Some("-"), "is_a": Some("a"),
"mechamism": Some("mailers.example.com")}))
Some(Captures({0: Some("a/24"), "qualifier": None, "is_a": Some("a"),
"mechamism": Some("/24")}))
Some(Captures({0: Some("-a:offsite.example.com/24]"), "qualifier": Some("-"), "is_a": Some("a"),
"mechamism": Some("offsite.example.com/24]")}))
None
None
With the ^ character present, and the mechamism capture now optional, the regex works correctly. It successfully matches a and does not match mx or +mx:mailer.com. We also get the required None for any capture that is not present.
Conclusion
Avoid capturing Some("") by marking the capture itself as optional with (x)? and not (x?).
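The Rust regex crate does not default to multi-line mode: ^ and $ match only at the start and end of the input unless (?m) is set, which is fine when each string is tested on its own.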
Thanks for reading.