Looking at Regex in Rust (Addendum)

Hi. In my previous article, Looking at Regex in Rust, I covered some basics. Today I will go over some new things I learned as I expanded what my regex needs to handle, in particular the possible presence or absence of the qualifiers +, -, ~ and ?.

TL;DR:

If you don’t want to read the full article, here are the highlights.

  • A regex pattern such as (x?) makes the contents of the capture optional, so the group always takes part in the match:

    1. If there is no x, the resulting capture will be Some("").
    2. If there is an x, the resulting capture will be Some("x").
  • The solution is to make the capture itself optional, not its contents, by writing (x)?:

    1. If there is no x, the resulting capture will be None.
    2. If there is an x, the resulting capture will be Some("x").
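  • The regex crate does not enable multi-line mode by default.
    1. Without the flag, ^ and $ match only at the start and end of the whole input, which is enough when each record is tested as its own string.
    2. If you want ^ and $ to match at every line, enable it with (?m); (?-m) switches it back off inside a pattern.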

Adding support for Qualifiers

My initial regex started to look pretty nice, but then I remembered that I needed to handle qualifiers. This led to a new set of discoveries.

My initial attempt without qualifiers turned out to be this:

Which gave the output:

Compiling playground v0.0.1 (/playground)
 Finished dev [unoptimized + debuginfo] target(s) in 1.85s
  Running `target/debug/playground`
Standard Output
Some(Captures({0: Some("a"), "a_only": Some("a"),
              "a_colon": None, "a_slash": None}))
Some(Captures({0: Some("a:example.com"), "a_only": None, 
              "a_colon": Some("example.com"), "a_slash": None}))
Some(Captures({0: Some("a:mailers.example.com"), "a_only": None, 
              "a_colon": Some("mailers.example.com"), "a_slash": None}))
Some(Captures({0: Some("a/24"), "a_only": None, 
              "a_colon": None, "a_slash": Some("a/24")}))
Some(Captures({0: Some("a:offsite.example.com/24]"), "a_only": None, 
              "a_colon": Some("offsite.example.com/24]"), "a_slash": None}))

Rust Playground
Gist Link

Adding Qualifiers to the mix

I now needed to handle the qualifiers, which may or may not be present. The new regex I ended up with, written in regex101 notation with the g, m and i flags: /^(?P<qualifier>[+?~-])?(?P<is_a>a)(?:[:])?(?P<a_mechanism>.+)?/gmi

Regex101 Link

Let’s break this down a bit.

  1. ^ The line/string starts.
  2. (?P<qualifier>[+?~-])? There may be one of +, ?, ~ or - present, or the capture can match nothing at all, because the group itself, ()?, ends with a ?. The position of the - also matters: inside a character class it must come first or last (or be escaped), otherwise it is read as a range.
  3. (?P<is_a>a) There will be an a character. (Yes, this capture is really redundant now.)
  4. (?:[:])? There may be a : character. This is enclosed in a non-capturing group so that it can be marked as optional.
  5. (?P<a_mechanism>.+)? Capture any other text present. Again, the whole capture is optional, since the complete record could consist of just a.

Results

Without the ^ Character

Rust Playground
Gist

Output

Compiling playground v0.0.1 (/playground)
 Finished dev [unoptimized + debuginfo] target(s) in 3.37s
  Running `target/debug/playground`
Standard Output
None
Some(Captures({0: Some("a:example.com"), "qualifier": None, 
              "mechamism": Some("example.com")}))
Some(Captures({0: Some("~a:example.com"), "qualifier": Some("~"), 
              "mechamism": Some("example.com")}))
Some(Captures({0: Some("a:mailers.example.com"), "qualifier": None, 
              "mechamism": Some("mailers.example.com")}))
Some(Captures({0: Some("-a:mailers.example.com"), "qualifier": Some("-"), 
              "mechamism": Some("mailers.example.com")}))
Some(Captures({0: Some("a/24"), "qualifier": None, 
              "mechamism": Some("/24")}))
Some(Captures({0: Some("-a:offsite.example.com/24]"), "qualifier": Some("-"), 
              "mechamism": Some("offsite.example.com/24]")}))
None
Some(Captures({0: Some("ailer.com.au"), "qualifier": None, 
              "mechamism": Some("iler.com.au")}))

Note that the first test input, a, gives None, which is not correct. Also note that +mx:mailer.com.au gives a match with a mechamism of Some("iler.com.au"): without the anchor, the regex is free to start matching at the a inside mailer. This is also not correct.

With the ^ Character

Rust Playground
Gist

(?P<is_a>a) re-introduced
Output

Compiling playground v0.0.1 (/playground)
 Finished dev [unoptimized + debuginfo] target(s) in 2.26s
  Running `target/debug/playground`
Standard Output
Some(Captures({0: Some("a"), "qualifier": None, "is_a": Some("a"), 
              "mechamism": None}))
Some(Captures({0: Some("a:example.com"), "qualifier": None, "is_a": Some("a"),      
              "mechamism": Some("example.com")}))
Some(Captures({0: Some("~a:example.com"), "qualifier": Some("~"), "is_a": Some("a"), 
              "mechamism": Some("example.com")}))
Some(Captures({0: Some("a:mailers.example.com"), "qualifier": None, "is_a": Some("a"), 
              "mechamism": Some("mailers.example.com")}))
Some(Captures({0: Some("-a:mailers.example.com"), "qualifier": Some("-"), "is_a": Some("a"), 
              "mechamism": Some("mailers.example.com")}))
Some(Captures({0: Some("a/24"), "qualifier": None, "is_a": Some("a"), 
              "mechamism": Some("/24")}))
Some(Captures({0: Some("-a:offsite.example.com/24]"), "qualifier": Some("-"), "is_a": Some("a"), 
              "mechamism": Some("offsite.example.com/24]")}))
None
None

With the ^ character present, the regex now works correctly. It successfully matches a and does not match mx or +mx:mailer.com.au. We also get the required None when an optional capture matches nothing.

Conclusion

Avoid capturing Some("") by marking the capture itself as optional with (x)? rather than (x?).
The Rust regex crate does not enable multi-line mode by default; add (?m) when ^ and $ should match at every line.

Thanks for reading.

