
Segmenter for gettext (.po) translation files

Inserts a zero-width space (ZWSP, U+200B) between the segments of Chinese and Japanese text.

For Chinese, uses the high-quality zh_segmentation model from Google (https://tfhub.dev/google/zh_segmentation/1).

For Japanese, uses Sudachi.
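As a minimal sketch of the core idea (not the tool's actual code), the following shows how a list of segments, as a segmenter might return them, can be joined with ZWSP. The function name `join_segments` is hypothetical and used only for illustration.

```python
ZWSP = "\u200b"  # zero-width space, the default separator


def join_segments(segments, separator=ZWSP):
    """Join already-segmented words with the separator, skipping empty segments.

    Hypothetical helper: the real scripts may structure this differently.
    """
    return separator.join(s for s in segments if s)


# A visible separator makes the segment boundaries easy to see.
print(join_segments(["返回", "到"], separator="|"))  # 返回|到
```

With the default separator, the output looks identical to the input on screen, because ZWSP has no width; it only gives the text renderer a place to break the line.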

Prerequisites

  1. Python. The easiest way to install Python on any Linux system is with asdf (https://github.com/asdf-vm/asdf). On Windows, you can use the official installer. Note that the Python version must be one supported by TensorFlow, which is usually not the latest Python release.

  2. gettext. On Windows you can use this installer.

  3. Python packages:

    pip install -r tools/segmenter/requirements.txt
    

Usage

To re-segment all the translation files:

tools/segmenter/segment_all.py

To re-segment the Chinese translation files:

tools/segmenter/segment_zh.py --input_path Translations/zh_CN.po
tools/segmenter/segment_zh.py --input_path Translations/zh_TW.po

To re-segment the Japanese translation files:

tools/segmenter/segment_ja.py --input_path Translations/ja.po

Additionally, you can pass a different separator, such as --separator='|', for debugging.

The tool performs a number of replacements so that interpolations such as {:d} and similar constructs are not affected by segmentation.
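One way to keep interpolations intact, shown here only as a sketch under assumptions and not as the tool's actual replacement logic, is to segment just the text between placeholders and pass the placeholders through unchanged. The `PLACEHOLDER` pattern, `segment_protected`, and the stand-in `toy_segment` are all hypothetical; a real segmenter would be the zh_segmentation model or Sudachi.

```python
import re

# Hypothetical pattern: brace-style placeholders such as {:d} or {}.
PLACEHOLDER = re.compile(r"\{[^{}]*\}")


def segment_protected(text, segment, separator="\u200b"):
    """Segment only the text between placeholders.

    `segment` is any callable that turns a string into a list of segments.
    Placeholders are copied to the output verbatim.
    """
    out = []
    pos = 0
    for match in PLACEHOLDER.finditer(text):
        out.append(separator.join(segment(text[pos:match.start()])))
        out.append(match.group(0))  # placeholder passes through untouched
        pos = match.end()
    out.append(separator.join(segment(text[pos:])))
    return "".join(out)


def toy_segment(chunk):
    """Stand-in segmenter for illustration: every character is a segment."""
    return list(chunk)


result = segment_protected("返回到 {:d} 层", toy_segment, separator="|")
print(result)
```

This sketch never places a separator directly next to a placeholder, which is a simplification; whether the real tool does so is not stated here.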

You can also see the segmenter output for a given string like this:

tools/segmenter/segment_zh.py --debug '返回到 {:d} 层'
返回|到 {:d} 层

When inspecting the diffs, you can use sed to display the segments, e.g.:

git diff --color | sed "s/$(echo -ne '\u200B')/|/g"
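
Where sed is not available (for example, on Windows), the same substitution can be sketched in Python. The helper name `show_segments` is hypothetical and not part of the tool.

```python
ZWSP = "\u200b"  # zero-width space inserted by the segmenter


def show_segments(text, marker="|"):
    """Replace every ZWSP with a visible marker so segment boundaries show up."""
    return text.replace(ZWSP, marker)


# Example with a string that contains one ZWSP between two segments.
print(show_segments("返回\u200b到"))  # 返回|到
```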