Gleb Mazovetskiy 395dbb18f0 Segmenter: Use gettext to line-wrap po files 4 years ago
README.md Segmenter: Do not split printf specifiers 4 years ago
requirements.txt Add a Japanese segmenter 4 years ago
segment_all.py Segmenter: Use gettext to line-wrap po files 4 years ago
segment_ja.py Segmenter: Use gettext to line-wrap po files 4 years ago
segment_zh.py Add a Japanese segmenter 4 years ago
segmenter_lib.py Segmenter: Use gettext to line-wrap po files 4 years ago

README.md

Segmenter for gettext (.po) translation files

Inserts a zero-width space (ZWSP, U+200B) between the segments of Chinese and Japanese text.

For Chinese, it uses a high-quality zh_segmentation model from Google: https://tfhub.dev/google/zh_segmentation/1.

For Japanese, it uses Sudachi.
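
The separator insertion described above can be illustrated with a minimal sketch. It assumes the text has already been split into segments by one of the models. The names `join_segments` and `strip_separators` are hypothetical and are not taken from `segmenter_lib.py`. Stripping old separators before re-segmenting is also an assumption about how re-segmentation might work.

```python
ZWSP = "\u200b"  # zero-width space, U+200B


def join_segments(segments, separator=ZWSP):
    """Join already-segmented pieces of text with the given separator."""
    # Skip empty pieces so that no doubled separators are produced.
    return separator.join(s for s in segments if s)


def strip_separators(text, separator=ZWSP):
    """Remove previously inserted separators (assumed step before re-segmenting)."""
    return text.replace(separator, "")


if __name__ == "__main__":
    # A visible separator makes the segment boundaries easy to see.
    print(join_segments(["返回", "到"], separator="|"))
```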

Pre-requisites

  1. Python. The easiest way to install Python on any Linux system is with asdf: https://github.com/asdf-vm/asdf.

  2. Packages:

    pip install -r tools/segmenter/requirements.txt
    

Usage

To re-segment all the translation files:

tools/segmenter/segment_all.py

To re-segment the Chinese translation files:

tools/segmenter/segment_zh.py --input_path Translations/zh_CN.po
tools/segmenter/segment_zh.py --input_path Translations/zh_TW.po

To re-segment the Japanese translation files:

tools/segmenter/segment_ja.py --input_path Translations/ja.po

Additionally, you can provide a different separator, such as --separator='|', for debugging.

The tool performs a number of replacements to make sure that interpolations, such as printf specifiers and {:d}-style placeholders, are not split or otherwise affected.
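
One way to keep interpolations intact is to split the string on placeholders and segment only the text between them. The sketch below shows that idea under stated assumptions. It is not the tool's actual implementation: `PLACEHOLDER_RE`, `segment_protected`, and the toy `per_char_segmenter` are all hypothetical, and the regular expression covers only simple printf specifiers and brace placeholders.

```python
import re

# Hypothetical pattern: simple printf specifiers (%s, %d, %02d, %.2f, ...)
# and brace placeholders ({}, {:d}, {0}, ...).
PLACEHOLDER_RE = re.compile(r"(%[-+ 0#]*\d*(?:\.\d+)?[a-zA-Z]|\{[^{}]*\})")


def per_char_segmenter(text, separator):
    """Toy stand-in for a real model: treats every character as a segment."""
    return separator.join(text)


def segment_protected(text, segment_fn, separator="\u200b"):
    """Segment text while copying placeholders through unchanged."""
    # With one capturing group, re.split returns plain text at even
    # indices and the matched placeholders at odd indices.
    parts = PLACEHOLDER_RE.split(text)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(part)  # placeholder: keep verbatim
        else:
            out.append(segment_fn(part, separator))
    return "".join(out)


if __name__ == "__main__":
    print(segment_protected("ab{:d}cd", per_char_segmenter, "|"))
```

No separator is placed next to a placeholder in this sketch, so a specifier such as `%s` or `{:d}` can never be broken apart by the segmenter.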

You can also see the segmenter output for a given string like this:

tools/segmenter/segment_zh.py --debug '返回到 {:d} 层'
返回|到 {:d} 层

When inspecting the diffs, you can use sed to display the segments, e.g.:

git diff --color | sed "s/$(echo -ne '\u200B')/|/g"
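
The same substitution can be done in Python, for example where the shell's `echo` does not understand `\u` escapes. This is a minimal sketch, and the function name `show_segments` is hypothetical.

```python
def show_segments(text, marker="|"):
    """Replace every zero-width space (U+200B) with a visible marker."""
    return text.replace("\u200b", marker)


if __name__ == "__main__":
    # The sample string contains one ZWSP between the first two segments.
    print(show_segments("返回\u200b到 {:d} 层"))
```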