devilutionX/Source/utils/utf8.hpp

#pragma once

#include <cstddef>
#include <string>
#include <string_view>

namespace devilution {

constexpr char32_t Utf8DecodeError = 0xFFFD;

/**
 * Decodes the first code point from UTF8-encoded input.
 *
 * Sets `len` to the length of the code point in bytes.
 * Returns `Utf8DecodeError` on error.
 */
char32_t DecodeFirstUtf8CodePoint(std::string_view input, std::size_t *len);

/**
 * Decodes and removes the first code point from UTF8-encoded input.
 */
inline char32_t ConsumeFirstUtf8CodePoint(std::string_view *input)
{
	std::size_t len;
	const char32_t result = DecodeFirstUtf8CodePoint(*input, &len);
	input->remove_prefix(len);
	return result;
}

/**
 * Returns true if the character is part of the Basic Latin set.
 *
 * This includes the space character, ASCII punctuation, symbols, math operators, digits, and both the uppercase and
 * lowercase Latin alphabets.
 */
constexpr bool IsBasicLatin(char x)
{
	return x >= '\x20' && x <= '\x7E';
}

/**
 * Returns true if this is a trailing byte in a UTF-8 code point encoding.
 *
 * Trailing bytes all begin with 10 as the most significant bits, meaning they fall in the range 0x80 to 0xBF.
 * Note that certain 3- and 4-byte sequences use a narrower range for the second byte; this function does not
 * guarantee that the byte is valid within the sequence (or that the sequence is well-formed).
 */
inline bool IsTrailUtf8CodeUnit(char x)
{
	// The following is equivalent to a bitmask test (x & 0xC0) == 0x80
	// On x86_64 architectures it ends up being one instruction shorter
	return static_cast<signed char>(x) < static_cast<signed char>('\xC0');
}

/**
 * @brief Returns the number of code units for a code point starting at `*src`.
 *
 * `src` must point to at least one byte.
 * If `src` does not begin with a UTF-8 code point start byte, returns 1.
 */
inline std::size_t Utf8CodePointLen(const char *src)
{
	// This constant is effectively a lookup table of 2-bit values indexed by
	// the top 5 bits of the byte, where each value is code point length - 1.
	// Storing `length - 1` and adding 1 back means this function never
	// returns 0, even for invalid values (which could lead to infinite loops
	// in some code).
	// Generated with:
	// ruby -e 'p "0000000000000000000000001111223".reverse.to_i(4).to_s(16)'
	return ((0x3a55000000000000ULL >> (2 * (static_cast<unsigned char>(*src) >> 3))) & 0x3) + 1;
}

/**
 * Returns the start byte index of the last code point in a UTF-8 string.
 * Returns 0 if the input is empty.
 */
inline std::size_t FindLastUtf8Symbols(std::string_view input)
{
	if (input.empty())
		return 0;

	std::size_t pos = input.size() - 1;
	while (pos > 0 && IsTrailUtf8CodeUnit(input[pos]))
		--pos;
	return pos;
}

/**
 * @brief Copies up to a given number of bytes from a UTF-8 string and null-terminates the result
 * @param dest The destination buffer
 * @param source The source string
 * @param bytes Max number of bytes to copy
 */
void CopyUtf8(char *dest, std::string_view source, std::size_t bytes);

/** @brief Appends the UTF-8 encoding of `codepoint` to `out`. */
void AppendUtf8(char32_t codepoint, std::string &out);

/** @brief Truncates `str` to at most `len` bytes at a code point boundary. */
std::string_view TruncateUtf8(std::string_view str, std::size_t len);

} // namespace devilution
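
The inline helpers above can be exercised in isolation. The following is a minimal sketch, not part of the header: it repeats the three inline helpers so that it compiles on its own, and adds `CountCodePoints`, `TruncateAtCodePointBoundary`, and `CheckUtf8Helpers`, which are hypothetical names introduced here for illustration. In particular, `TruncateAtCodePointBoundary` is only one plausible way to cut at a boundary and is not claimed to match how `TruncateUtf8` is implemented. Both hypothetical helpers assume well-formed UTF-8 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Copies of the inline helpers from the header, repeated so this sketch is self-contained.
namespace devilution {

inline bool IsTrailUtf8CodeUnit(char x)
{
	return static_cast<signed char>(x) < static_cast<signed char>('\xC0');
}

inline std::size_t Utf8CodePointLen(const char *src)
{
	return ((0x3a55000000000000ULL >> (2 * (static_cast<unsigned char>(*src) >> 3))) & 0x3) + 1;
}

inline std::size_t FindLastUtf8Symbols(std::string_view input)
{
	if (input.empty())
		return 0;
	std::size_t pos = input.size() - 1;
	while (pos > 0 && IsTrailUtf8CodeUnit(input[pos]))
		--pos;
	return pos;
}

} // namespace devilution

// Hypothetical helper (assumption, not in the header): counts code points by
// hopping from start byte to start byte. Assumes `text` is well-formed UTF-8.
inline std::size_t CountCodePoints(std::string_view text)
{
	std::size_t count = 0;
	for (std::size_t i = 0; i < text.size(); i += devilution::Utf8CodePointLen(&text[i]))
		++count;
	return count;
}

// Hypothetical helper (assumption, not the upstream TruncateUtf8): keeps at
// most `len` bytes and backs up so that no code point is split.
inline std::string_view TruncateAtCodePointBoundary(std::string_view str, std::size_t len)
{
	if (str.size() <= len)
		return str;
	// While the first dropped byte is a trail byte, the cut is mid-code-point.
	while (len > 0 && devilution::IsTrailUtf8CodeUnit(str[len]))
		--len;
	return str.substr(0, len);
}

// Hypothetical check routine: "a", U+00E9, U+20AC, U+1F600 take 1, 2, 3 and 4 bytes.
inline void CheckUtf8Helpers()
{
	const std::string_view text = "a\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80";
	assert(text.size() == 10);
	assert(CountCodePoints(text) == 4);

	// Start bytes report the full sequence length; a trail byte reports 1.
	assert(devilution::Utf8CodePointLen(&text[1]) == 2);
	assert(devilution::Utf8CodePointLen(&text[3]) == 3);
	assert(devilution::Utf8CodePointLen(&text[6]) == 4);
	assert(devilution::Utf8CodePointLen(&text[2]) == 1);

	// The last code point (4 bytes) starts at byte index 6.
	assert(devilution::FindLastUtf8Symbols(text) == 6);

	// Cutting at 2 bytes would split the 2-byte sequence, so only "a" remains.
	assert(TruncateAtCodePointBoundary(text, 2) == "a");
	assert(TruncateAtCodePointBoundary(text, 3) == "a\xC3\xA9");
}
```

Calling `CheckUtf8Helpers()` from a test or a scratch `main` runs the assertions. Because `Utf8CodePointLen` returns 1 for a byte that is not a start byte, the loop in `CountCodePoints` always advances, even on malformed input, although the count is then no longer meaningful.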