# Segmenter for gettext (.po) translation files

Inserts [ZWSP] between the segments of Chinese and Japanese text.

For Chinese, uses a high-quality `zh_segmentation` model from Google: .

For Japanese, uses [Sudachi](https://github.com/WorksApplications/sudachi.rs).

## Pre-requisites

1. Python. The easiest way to install Python on any Linux system is . On Windows you can use the [official installer](https://www.python.org/downloads/windows/).

   Note that the Python version must be [supported by TensorFlow](https://www.tensorflow.org/install/pip#system-requirements) (this is usually not the latest Python version).

2. `gettext`. On Windows you can use [this installer](https://github.com/mlocati/gettext-iconv-windows/releases).

3. Python packages:

   ```shell
   pip install --upgrade -r tools/segmenter/requirements.txt
   ```

## Usage

To re-segment all the translation files:

```shell
tools/segmenter/segment_all.py
```

To re-segment the Chinese translation files:

```shell
tools/segmenter/segment_zh.py --input_path Translations/zh_CN.po
tools/segmenter/segment_zh.py --input_path Translations/zh_TW.po
```

To re-segment the Japanese translation files:

```shell
tools/segmenter/segment_ja.py --input_path Translations/ja.po
```

Additionally, you can provide a different separator, such as `--separator='|'`, for debugging.

The tool also performs a number of replacements to make sure that interpolations and the like are not affected by segmentation.

You can also see the segmenter output for a given string like this:

```console
tools/segmenter/segment_zh.py --debug '返回到 {:d} 层'
```

```
返回|到 {:d} 层
```

When inspecting the diffs, you can use `sed` to display the segments, e.g.:

```bash
git diff --color | sed "s/$(echo -ne '\u200B')/|/g"
```

[ZWSP]: https://en.wikipedia.org/wiki/Zero-width_space
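For illustration only, here is a minimal Python sketch of what joining segments with a zero-width space, and making that separator visible again, amounts to. The helper names `join_segments` and `reveal_separators` are hypothetical and are not part of the scripts in `tools/segmenter`. The segment list is written by hand to mirror the debug output shown earlier rather than produced by the segmentation model.

```python
ZWSP = "\u200b"  # zero-width space, the separator inserted between segments


def join_segments(segments, separator=ZWSP):
    """Join already-segmented pieces of text with the given separator."""
    return separator.join(segments)


def reveal_separators(text, marker="|"):
    """Replace every zero-width space with a visible marker (like the sed command)."""
    return text.replace(ZWSP, marker)


# Hand-written segments (assumption: not actual model output).
# The `{:d}` interpolation stays intact inside a single segment.
segments = ["返回", "到 {:d} 层"]
segmented = join_segments(segments)

print(reveal_separators(segmented))  # prints: 返回|到 {:d} 层
```

Because the separator is only placed between segments, the format placeholder still works after segmentation: `segmented.format(3)` substitutes the number as usual.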