As part of my project, Self Hosted Zapier Alternative, I have to run regex searches against the three Japanese written forms: Kanji, Hiragana and Katakana.
Fortunately this is a common problem, so I have found some references for it. One of my favourite tools for developing regular expressions, Regex101, also offers support in this area.
I found this useful GitHub Gist.
Note: you should also check the gist directly, as there are some follow-up comments and additions. See here
Regex for matching ALL Japanese common & uncommon Kanji (4e00 – 9fcf) ~ The Big Kahuna!
([一-龯])
Regex for matching Hiragana or Katakana
([ぁ-んァ-ン])
Regex for matching Non-Hiragana or Non-Katakana
([^ぁ-んァ-ン])
Regex for matching Hiragana or Katakana or basic punctuation (、。’)
([ぁ-んァ-ン、。’])
Regex for matching Hiragana or Katakana and random other characters
([ぁ-んァ-ン!:/])
Regex for matching Hiragana
([ぁ-ん])
Regex for matching full-width Katakana (zenkaku 全角)
([ァ-ン])
Regex for matching half-width Katakana (hankaku 半角)
([ｧ-ﾝﾞﾟ])
Regex for matching full-width Numbers (zenkaku 全角)
([０-９])
Regex for matching full-width Letters (zenkaku 全角)
([Ａ-ｚ])
Regex for matching Hiragana codespace characters (includes non phonetic characters)
([ぁ-ゞ])
Regex for matching full-width (zenkaku) Katakana codespace characters (includes non phonetic characters)
([ァ-ヶ])
Regex for matching half-width (hankaku) Katakana codespace characters (this is an old character set so the order is inconsistent with the hiragana)
([ｦ-ﾟ])
Regex for matching Japanese Post Codes
/^\d{3}\-\d{4}$/
/^\d{3}-\d{4}$|^\d{3}-\d{2}$|^\d{3}$/
Regex for matching Japanese mobile phone numbers (keitai bangou)
/^\d{3}-\d{4}-\d{4}$|^\d{11}$/
/^0\d0-\d{4}-\d{4}$/
Regex for matching Japanese fixed line phone numbers
/^[0-9-]{6,9}$|^[0-9-]{12}$/
/^\d{1,4}-\d{4}$|^\d{2,5}-\d{1,4}-\d{4}$/
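A minimal Python 3 sketch of how the character classes above might be used to label the script of each character, plus the first post code pattern with `\d` as the digit class. The names `SCRIPT_PATTERNS`, `classify` and `POST_CODE` are hypothetical, and the ranges are the ones listed above, so they inherit any gaps those ranges have (for example, the prolonged sound mark ー falls outside ァ-ン).

```python
import re

# Ranges copied from the list above; the dictionary keys are illustrative labels.
SCRIPT_PATTERNS = {
    "kanji": re.compile(r"[一-龯]"),
    "hiragana": re.compile(r"[ぁ-ん]"),
    "katakana": re.compile(r"[ァ-ン]"),
}


def classify(text):
    """Return a list of (character, label) pairs; unmatched characters get 'other'."""
    result = []
    for ch in text:
        label = "other"
        for name, pattern in SCRIPT_PATTERNS.items():
            if pattern.fullmatch(ch):
                label = name
                break
        result.append((ch, label))
    return result


# First post code pattern from the list above: three digits, a hyphen, four digits.
POST_CODE = re.compile(r"^\d{3}-\d{4}$")

print(classify("東京タワーです"))
print(bool(POST_CODE.match("150-0002")))
```

Note that in Python 3 `\d` also matches full-width digits in a `str` pattern unless `re.ASCII` is passed, which may or may not be what you want for form validation.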
Using Regex101 I was able to come up with the following expression (written as a Python raw string):
r"^「(?P<busname>[一-龯]\d{1,2})\s(?P<destination>[一-龯]+)行き・(?P<boardedat>[一-龯]+)」"
This will successfully match a string such as: 「渋11 渋谷駅行き・駒沢大学駅前」でタッチしました。
Resulting in the following three groups:
busname = 渋11
destination = 渋谷駅
boardedat = 駒沢大学駅前
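As a sketch, the expression can be run in Python 3 like this. The names `BUS_PATTERN` and `message` are my own, and `re.VERBOSE` is used only so the pattern can be split across lines with comments.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows trailing comments.
BUS_PATTERN = re.compile(
    r"""
    ^「(?P<busname>[一-龯]\d{1,2})\s   # route name: one kanji plus one or two digits
    (?P<destination>[一-龯]+)行き・     # destination, followed by 行き (bound for)
    (?P<boardedat>[一-龯]+)」          # stop name, up to the closing bracket
    """,
    re.VERBOSE,
)

message = "「渋11 渋谷駅行き・駒沢大学駅前」でタッチしました。"
match = BUS_PATTERN.match(message)
if match:
    # groupdict() returns the named groups as a plain dictionary
    print(match.groupdict())
```

Because `[一-龯]+` is greedy and must be followed by 」, the `boardedat` group captures every kanji up to the closing bracket, which for this message includes the final 前.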
If you are working in PHP you can also use the following Unicode script properties (they need the u modifier on the pattern):
\p{Han} (the Han script, shared with Chinese, matches Kanji)
\p{Hiragana}
\p{Katakana}
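Python's built-in `re` module does not support `\p{...}` script properties (the third-party `regex` package does). As a rough stand-in that uses only the standard library, the sketch below labels characters by their Unicode names. `script_of` is a hypothetical helper, and name prefixes only approximate the real Script property.

```python
import unicodedata


def script_of(ch):
    """Rough script label derived from the character's Unicode name (an approximation)."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    if name.startswith("HIRAGANA"):
        return "Hiragana"
    # Half-width katakana names begin with "HALFWIDTH KATAKANA".
    if name.startswith("KATAKANA") or name.startswith("HALFWIDTH KATAKANA"):
        return "Katakana"
    return "Other"


print([script_of(c) for c in "駅まえバス"])
```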
You can also check out my Regex Experiments:
v1 PHP
v2 Python3