# Segmenter for gettext (.po) translation files

Inserts a zero-width space ([ZWSP]) between the segments of Chinese and Japanese text in gettext `.po` files.

For Chinese, it uses the high-quality `zh_segmentation` model from Google: <https://tfhub.dev/google/zh_segmentation/1>.

For Japanese, uses [Sudachi](https://github.com/WorksApplications/sudachi.rs).
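
As a rough illustration of what "inserting ZWSP between segments" means, here is a minimal sketch in Python. The helper names `join_segments` and `strip_separators` are hypothetical and are not taken from the scripts in this directory. The segments themselves would come from the models above. Stripping old separators before re-segmenting is an assumption about how a re-run could stay idempotent, not a description of what the tool does.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def join_segments(segments, separator=ZWSP):
    """Join already-computed segments with the separator (hypothetical helper)."""
    return separator.join(segments)


def strip_separators(text, separator=ZWSP):
    """Remove previously inserted separators (assumed step before re-segmenting)."""
    return text.replace(separator, "")


# The segment list is hard-coded here; a real run would obtain it from a segmenter.
segmented = join_segments(["返回", "到"])
assert strip_separators(segmented) == "返回到"
```

Passing a visible separator, for example `join_segments(["返回", "到"], separator="|")`, makes the boundaries easy to see.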

## Pre-requisites

1. Python. The easiest way to install Python on any Linux system is <https://github.com/asdf-vm/asdf>.
   On Windows you can use the [official installer](https://www.python.org/downloads/windows/).
   Note that the Python version must be [supported by tensorflow](https://www.tensorflow.org/install/pip#system-requirements) (this is usually not the latest Python version).

2. `gettext`. On Windows you can use [this installer](https://github.com/mlocati/gettext-iconv-windows/releases).

3. Python packages:

   ```shell
   pip install --upgrade -r tools/segmenter/requirements.txt
   ```

## Usage

To re-segment all the translation files:

```shell
tools/segmenter/segment_all.py
```

To re-segment the Chinese translation files:

```shell
tools/segmenter/segment_zh.py --input_path Translations/zh_CN.po
tools/segmenter/segment_zh.py --input_path Translations/zh_TW.po
```

To re-segment the Japanese translation files:

```shell
tools/segmenter/segment_ja.py --input_path Translations/ja.po
```

Additionally, you can provide a different separator, such as `--separator='|'`, for debugging.

The tool performs a number of replacements so that interpolations (format placeholders such as `{:d}`) are not split or otherwise affected by segmentation.
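
One way to keep placeholders intact is to segment only the text between them. The sketch below shows that idea and is not necessarily how the scripts here implement it. The `PLACEHOLDER` pattern, `segment_protected`, and `toy_segmenter` are all hypothetical names, and the pattern covers only simple `{...}` fields and printf-style specifiers.

```python
import re

ZWSP = "\u200b"

# Assumed pattern: fmt-style fields such as {} or {:d}, and printf-style
# specifiers such as %s or %2d. The real tool may recognise more forms.
PLACEHOLDER = re.compile(r"\{[^{}]*\}|%[-+ 0#]*\d*(?:\.\d+)?[a-zA-Z]")


def segment_protected(text, segment, separator=ZWSP):
    """Segment the text between placeholders and copy placeholders verbatim.

    `segment` is any callable that maps a string to a list of segments.
    """
    parts = []
    pos = 0
    for match in PLACEHOLDER.finditer(text):
        parts.append(separator.join(segment(text[pos:match.start()])))
        parts.append(match.group(0))  # the placeholder is never split
        pos = match.end()
    parts.append(separator.join(segment(text[pos:])))
    return "".join(parts)


def toy_segmenter(s):
    """Stand-in for a real model: split after the first two characters."""
    return [s[:2], s[2:]] if len(s) > 2 else [s]


print(segment_protected("返回到 {:d} 层", toy_segmenter, separator="|"))
# prints: 返回|到 {:d} 层
```

Because only the text outside the matches reaches the segmenter, a placeholder cannot end up with a separator inside it, whatever the segmenter returns.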

You can also see the segmenter output for a given string like this:

```console
tools/segmenter/segment_zh.py --debug '返回到 {:d} 层'
```
```
返回｜到 {:d} 层
```

When inspecting the diffs, you can use `sed` to display the segments, e.g.:

```bash
git diff --color | sed "s/$(echo -ne '\u200B')/｜/g"
```

[ZWSP]: https://en.wikipedia.org/wiki/Zero-width_space