Working with double-byte regular expressions in Python 3

As part of my project, Self Hosted Zapier Alternative, I need to run regex searches against the three Japanese scripts: Kanji, Hiragana and Katakana.

Fortunately this is a common problem, so I was able to find some references for it. One of my favourite tools for developing regular expressions, Regex101, also offers support in this area.

I found this useful GitHub Gist.
Note:
You should also check the gist directly, as there are some follow-up comments and additions. See here

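Using Regex101 I was able to come up with the following expression. It is split across lines for readability, so in Python it needs a triple-quoted string and the re.VERBOSE flag.

r"""
^「(?P<busname>[一-龯]\d{1,2})\s
(?P<destination>[一-龯]+)行き・
(?P<boardedat>[一-龯]+)」
"""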

This will successfully match a string such as:
「渋11 渋谷駅行き・駒沢大学駅前」でタッチしました。
This results in the following three groups:
busname = 渋11
destination = 渋谷駅
boardedat = 駒沢大学駅前
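
To make the match concrete, here is a minimal Python 3 sketch that compiles the expression and runs it against the sample string. The names BUS_PATTERN and sample are mine, chosen for illustration, and the comments inside the pattern are my reading of each part. Note that the boardedat group runs up to the closing 」, so it captures the trailing 前 as well.

```python
import re

# re.VERBOSE ignores whitespace and "#" comments inside the pattern,
# which is what allows it to be split across several lines.
BUS_PATTERN = re.compile(
    r"""
    ^「(?P<busname>[一-龯]\d{1,2})\s   # route name: one kanji plus 1 or 2 digits
    (?P<destination>[一-龯]+)行き・     # destination, followed by the literal 行き・
    (?P<boardedat>[一-龯]+)」          # boarding stop, up to the closing bracket
    """,
    re.VERBOSE,
)

sample = "「渋11 渋谷駅行き・駒沢大学駅前」でタッチしました。"

match = BUS_PATTERN.search(sample)
if match:
    # groupdict() returns the named groups as a plain dict
    for name, value in match.groupdict().items():
        print(f"{name} = {value}")
```

The destination group stops at 渋谷駅 because き is Hiragana and falls outside the [一-龯] range, so the engine backtracks until 行き・ can match.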

If you are working in PHP you can also use the following Unicode script properties:
\p{Han} (the Han script, shared with Chinese, which matches Kanji)
\p{Hiragana}
\p{Katakana}
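
Python's built-in re module does not support \p{...} properties (the third-party regex package does). As a rough stand-in, the sketch below uses explicit Unicode block ranges with the standard library only. The constant names are mine, and the ranges are approximations: they cover the basic CJK Unified Ideographs, Hiragana and Katakana blocks, not everything the script properties match.

```python
import re

# Approximate block ranges (assumption: the basic blocks are enough for the text at hand).
KANJI = r"\u4e00-\u9fff"      # CJK Unified Ideographs block
HIRAGANA = r"\u3040-\u309f"   # Hiragana block
KATAKANA = r"\u30a0-\u30ff"   # Katakana block

kanji_re = re.compile(f"[{KANJI}]+")
hiragana_re = re.compile(f"[{HIRAGANA}]+")
katakana_re = re.compile(f"[{KATAKANA}]+")

text = "渋谷でタッチしました"

kanji_runs = kanji_re.findall(text)        # ['渋谷']
hiragana_runs = hiragana_re.findall(text)  # ['で', 'しました']
katakana_runs = katakana_re.findall(text)  # ['タッチ']

print(kanji_runs, hiragana_runs, katakana_runs)
```

Keeping the ranges as separate strings makes it easy to combine them in one character class, for example f"[{KANJI}{HIRAGANA}]+" for mixed Kanji and Hiragana runs.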

You can also check out my regex experiments:
v1 PHP
v2 Python3
