Segmenter for gettext (.po) translation files

Inserts a zero-width space (ZWSP, U+200B) between the segments of Chinese and Japanese text.

For Chinese, uses the high-quality zh_segmentation model from Google (https://tfhub.dev/google/zh_segmentation/1).

For Japanese, uses Sudachi.
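
As a rough illustration of what "inserting ZWSP between the segments" means, here is a minimal sketch. It assumes the segmentation step has already produced a list of segments; `join_segments` is a hypothetical helper for illustration, not a function from this tool.

```python
from typing import Iterable

ZWSP = "\u200b"  # zero-width space, U+200B


def join_segments(segments: Iterable[str], separator: str = ZWSP) -> str:
    """Join already-segmented text with the given separator.

    Hypothetical helper: the real tool obtains the segments from the
    zh_segmentation model (Chinese) or Sudachi (Japanese).
    """
    return separator.join(segments)
```

Passing a visible separator such as '|' instead of the default makes the segment boundaries easy to see.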

Prerequisites

Python. The easiest way to install Python on any Linux system is with asdf (https://github.com/asdf-vm/asdf). On Windows you can use the official installer. Note that the Python version must be supported by TensorFlow (this is usually not the latest Python version).
gettext. On Windows you can use this installer.

Python packages:

pip install -r tools/segmenter/requirements.txt

To re-segment all the translation files:

tools/segmenter/segment_all.py

To re-segment the Chinese translation files:

tools/segmenter/segment_zh.py --input_path Translations/zh_CN.po
tools/segmenter/segment_zh.py --input_path Translations/zh_TW.po

To re-segment the Japanese translation files:

tools/segmenter/segment_ja.py --input_path Translations/ja.po

Additionally, you can provide a different separator for debugging, such as --separator='|'.

The tool performs a number of replacements to make sure that interpolations (such as {:d}) and similar non-text content are not affected by segmentation.
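
One way to keep interpolations intact can be sketched as follows. This is an assumption about the general technique, not the tool's actual implementation: the `PLACEHOLDER` pattern, the function name, and the `segment` callback are all hypothetical. The idea is to split the string around brace-style placeholders, segment only the text in between, and copy the placeholders through unchanged.

```python
import re
from typing import Callable, List

# Hypothetical pattern: brace-style format placeholders such as {} or {:d}.
PLACEHOLDER = re.compile(r"\{[^{}]*\}")


def segment_preserving_placeholders(
    text: str,
    segment: Callable[[str], List[str]],
    separator: str = "\u200b",
) -> str:
    """Segment `text` but never insert a separator inside a placeholder.

    `segment` stands in for the real segmenter: it takes a string and
    returns the list of its segments.
    """
    parts = []
    pos = 0
    for match in PLACEHOLDER.finditer(text):
        chunk = text[pos:match.start()]
        if chunk:
            # Only the plain text between placeholders is segmented.
            parts.append(separator.join(segment(chunk)))
        # The placeholder itself is copied through verbatim.
        parts.append(match.group(0))
        pos = match.end()
    tail = text[pos:]
    if tail:
        parts.append(separator.join(segment(tail)))
    return "".join(parts)
```

In this sketch no separator is placed directly before or after a placeholder either; whether the real tool does the same is not stated here.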

You can also see the segmenter output for a given string like this:

tools/segmenter/segment_zh.py --debug '返回到 {:d} 层'

返回｜到 {:d} 层

When inspecting the diffs, you can use sed to display the segments, e.g.:

git diff --color | sed "s/$(echo -ne '\u200B')/｜/g"
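
If the shell's echo does not support \u escapes, the same substitution can be sketched in Python. The function names below are hypothetical and not part of this tool; the marker defaults to the same ｜ character used in the sed command.

```python
from typing import TextIO

ZWSP = "\u200b"  # zero-width space, U+200B


def show_segments(text: str, marker: str = "｜") -> str:
    """Replace every ZWSP with a visible marker."""
    return text.replace(ZWSP, marker)


def filter_stream(src: TextIO, dst: TextIO, marker: str = "｜") -> None:
    """Copy `src` to `dst` line by line, making ZWSP visible."""
    for line in src:
        dst.write(show_segments(line, marker))
```

To use it as a filter for git diff output, call filter_stream(sys.stdin, sys.stdout) from a small script and pipe the diff into it.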