Also renames lua/lua.hpp to lua/lua_global.hpp.
The previous name broke version auto-detection in sol2, which involves
calling `__has_include(<lua/lua.hpp>)`.
1. There is no need to store the light table width as it is always 256.
2. Use a fixed-extent `std::span` rather than a pointer for readability.
This should optimize to the exact same code.
1. Extracts duplicate expressions into functions.
2. Addds `const` throughout to make it more obvious what's being modified
in the loop.
3. Add early return on out of bounds.
4. Extract `minx +` out of the `std::min({...})`.
3+4 lead to a nice perf increase
Before:
```
---------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------
BM_BuildLightmap 77181 ns 77152 ns 9092 bytes_per_second=2.7194Gi/s items_per_second=3.24034M/s
```
After:
```
---------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------
BM_BuildLightmap 65134 ns 65114 ns 10773 bytes_per_second=3.22218Gi/s items_per_second=3.83943M/s
```
Adds separate benchmarks for building the tree vs lookup
BM_GenerateBlendedLookupTable 2247062 ns 2246554 ns 312
BM_BuildTree 6842 ns 6840 ns 102200
BM_FindNearestNeighbor 2535103632 ns 2534784320 ns 1 items_per_second=6.61879M/s
Gets rid of `orig_palette`, we now always have only 2 palettes:
1. `logical_palette`
This palette has color cycling / swapping applied but no global
effects such as brightness / fade-in.
2. `system_palette`
This palette is the actual palette used for rendering.
It is usually `logical_palette` with the global brightness setting
and fade-in/out applied.
Additionally, we now keep the k-d tree around and use it to
update single colors.
The colors that are color-cycled / swapped are never included
in the k-d tree, so the tree does not need updating on color
cycles/swaps.
PR #8027 exposed a bug in `palette_update`.
The `first` argument in SDL palette functions always refers to the first
target index (the first source index is always 0).
Our values are `uint8_t`, so we can get the median somewhat faster
than `nth_element`.
```
Comparing build-reld-master/palette_blending_benchmark to build-reld/palette_blending_benchmark
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------
BM_GenerateBlendedLookupTable_pvalue 0.0002 0.0002 U Test, Repetitions: 10 vs 10
BM_GenerateBlendedLookupTable_mean -0.0198 -0.0198 2272848 2227790 2270291 2225348
BM_GenerateBlendedLookupTable_median -0.0199 -0.0199 2272884 2227649 2270323 2225212
BM_GenerateBlendedLookupTable_stddev +0.4575 +0.6710 536 781 396 661
BM_GenerateBlendedLookupTable_cv +0.4870 +0.7047 0 0 0 0
OVERALL_GEOMEAN -0.0198 -0.0198 0 0 0 0
```
1. Achieves near-perfect balancing on non-pathological data
by using a per-node pivot and calculating.
2. Increases the depth to 5 levels, which seems to be the
sweet spot.
New benchmark vs baseline:
```
Comparing build-reld-palette-cleanup/palette_blending_benchmark to build-reld/palette_blending_benchmark
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------
BM_GenerateBlendedLookupTable_pvalue 0.0002 0.0002 U Test, Repetitions: 10 vs 10
BM_GenerateBlendedLookupTable_mean -0.8768 -0.8767 18432956 2270752 18417846 2270141
BM_GenerateBlendedLookupTable_median -0.8767 -0.8767 18421978 2270802 18417838 2270167
BM_GenerateBlendedLookupTable_stddev -0.9860 -0.8051 33641 473 1222 238
BM_GenerateBlendedLookupTable_cv -0.8860 +0.5815 0 0 0 0
OVERALL_GEOMEAN -0.8768 -0.8767 0 0 0 0
```
Uses a k-d tree to quickly find the best match
for a color when generating the palette blending
lookup table.
https://en.wikipedia.org/wiki/K-d_tree
This is 3x faster than the previous naive approach:
```
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------
BM_GenerateBlendedLookupTable_pvalue 0.0002 0.0002 U Test, Repetitions: 10 vs 10
BM_GenerateBlendedLookupTable_mean -0.7153 -0.7153 18402641 5239051 18399111 5238025
BM_GenerateBlendedLookupTable_median -0.7153 -0.7153 18403261 5239042 18398841 5237497
BM_GenerateBlendedLookupTable_stddev -0.2775 +0.3858 2257 1631 1347 1867
BM_GenerateBlendedLookupTable_cv +1.5379 +3.8677 0 0 0 0
OVERALL_GEOMEAN -0.7153 -0.7153 0 0 0 0
```
The distribution is somewhat poor with just 3 levels, so this can be improved further.
For example, here is the leaf size distribution in the cathedral:
```
r0.g0.b0: 88
r0.g0.b1: 10
r0.g1.b0: 2
r0.g1.b1: 32
r1.g0.b0: 27
r1.g0.b1: 4
r1.g1.b0: 12
r1.g1.b1: 81
```