1. Unifies the underlying CLX and dun_render blitters.
2. Optimizes them by unrolling loops and using pointer comparison rather
than length comparison (saves a length decrement).
3. In `dun_render`, extracts `RenderLineTransparent/Opaque` branches into
functions via explicit template specialization.
Example RG-99 FPS (non-PGO'd): 17.4 -> 18.4
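A minimal sketch of the pointer-comparison idea from item 2, assuming a plain byte copy as the loop body. `CopyByLength` and `CopyByPointer` are illustrative names, not the actual blitter functions:

```cpp
#include <cstddef>
#include <cstdint>

// Length-based loop: the counter is decremented on every iteration
// in addition to advancing both pointers.
inline void CopyByLength(std::uint8_t *dst, const std::uint8_t *src, std::size_t n)
{
    while (n-- > 0)
        *dst++ = *src++;
}

// Pointer-based loop, manually unrolled by 4: the loop condition compares
// `src` against a precomputed end pointer, so no separate counter is needed.
inline void CopyByPointer(std::uint8_t *dst, const std::uint8_t *src, std::size_t n)
{
    const std::uint8_t *const unrolledEnd = src + (n & ~std::size_t { 3 });
    const std::uint8_t *const end = src + n;
    while (src != unrolledEnd) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
    }
    // Remaining 0-3 bytes.
    while (src != end)
        *dst++ = *src++;
}
```

Whether this is a measurable win depends on the compiler and target; the FPS figure above is for the real change, not for this sketch.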
As we recently confirmed, Square and Left/RightTriangle primitives
never use masks other than Transparent and Solid.
Simplify the code to take advantage of that.
We notice that masks can be described by 2 parameters:
1. Whether they have 0 or 1 as their high bits.
2. Whether they shift to the left or to the right on the next line.
Describing masks this way allows us to lift them to template variables and simplify the code.
We also avoid handling the mask in the `RenderLine` loop entirely.
Also fixes a foliage rendering bug: Transparent foliage pixels were previously blended but they should have been simply skipped.
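One way to picture the two-parameter description of masks, as a hedged sketch only. It assumes a 32-bit line mask made of a run of identical high bits followed by the opposite bits; `MaskFromPrefix`, `NextPrefix`, and the `prefix` representation are hypothetical and not taken from the actual code:

```cpp
#include <cstdint>

// Builds a 32-bit mask whose top `prefix` bits equal HighBitsSet and whose
// remaining low bits are the opposite value. `prefix` must be in [0, 32].
template <bool HighBitsSet>
constexpr std::uint32_t MaskFromPrefix(unsigned prefix)
{
    const std::uint32_t highRun = prefix == 0 ? 0U
        : prefix >= 32                        ? 0xFFFFFFFFU
                                              : ~(0xFFFFFFFFU >> prefix);
    return HighBitsSet ? highRun : ~highRun;
}

// Moves the boundary by one bit for the next line. Shifting left shrinks the
// high run; shifting right grows it. The caller keeps the result in [0, 32].
template <bool ShiftLeft>
constexpr unsigned NextPrefix(unsigned prefix)
{
    return ShiftLeft ? prefix - 1 : prefix + 1;
}
```

With both properties known at compile time, a render loop only needs the prefix length per line and can process the prefix and the rest of the line separately, without testing mask bits per pixel.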
Turns the `RenderLine` branches into template parameters, allowing the
compiler to eliminate the branches and fully inline the function.
Example FPS change:
* In dungeon: 1450 -> 1530
* In town: 1655 -> 1700
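As a rough illustration of turning a runtime branch into a template parameter, here is a sketch assuming C++17 (`if constexpr`). `RenderLineSketch` and `BlendStandIn` are hypothetical; the averaging blend is a stand-in, not the game's actual transparency blending:

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in blend for illustration only.
inline std::uint8_t BlendStandIn(std::uint8_t dst, std::uint8_t src)
{
    return static_cast<std::uint8_t>((dst + src) / 2);
}

// `Transparent` is a compile-time constant, so each instantiation contains
// only one of the two loop bodies and no per-pixel branch on the mode.
template <bool Transparent>
inline void RenderLineSketch(std::uint8_t *dst, const std::uint8_t *src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        if constexpr (Transparent) {
            dst[i] = BlendStandIn(dst[i], src[i]);
        } else {
            dst[i] = src[i];
        }
    }
}
```

A caller that already knows the mode picks the instantiation once, for example `RenderLineSketch<true>(dst, src, n)`, instead of passing a flag that is tested inside the loop.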
Also splits `RenderLine` into 3 functions.
This is easier to read and also gives more useful profiling.
Apparently, most of the time is spent in `RenderLineOpaque`.
Also marks them `inline`, which apparently hints GCC to inline
the functions (in a later refactoring we can introduce the
`always_inline` attribute instead where supported).
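A sketch of what such an `always_inline` wrapper could look like. The macro name `SKETCH_ALWAYS_INLINE` and the toy `PickPixel` helper are hypothetical; `__attribute__((always_inline))` (GCC/Clang) and `__forceinline` (MSVC) are the compiler-specific spellings:

```cpp
#include <cstdint>

// Falls back to plain `inline` on compilers with no known force-inline syntax.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_ALWAYS_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define SKETCH_ALWAYS_INLINE __forceinline
#else
#define SKETCH_ALWAYS_INLINE inline
#endif

// Toy helper standing in for a small hot-path function.
SKETCH_ALWAYS_INLINE std::uint8_t PickPixel(std::uint8_t src, std::uint8_t fallback)
{
    return src != 0 ? src : fallback;
}
```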
This is part of the work to allow us to eliminate buffer padding.
As this is a hotspot, we have 4 separate functions for each non-square
primitive, resulting in quite a bit of code:
1. Unclipped ("Full")
2. Vertical-only clip
3. Vertical + Left clip
4. Vertical + Right clip
FPS at 640x480: 1420 -> 1530
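To illustrate how one of the four variants listed above might be chosen, here is a hypothetical selection function. `ClipKind` and `SelectClip` are made-up names, and the sketch assumes the primitive is narrower than the buffer, so it is never clipped on both sides at once:

```cpp
enum class ClipKind {
    Full,          // Unclipped
    Vertical,      // Vertical-only clip
    VerticalLeft,  // Vertical + Left clip
    VerticalRight, // Vertical + Right clip
};

// (x, y) is the top-left corner of the primitive in buffer coordinates.
// The left/right variants also handle vertical clipping, so horizontal
// clipping is checked first.
constexpr ClipKind SelectClip(int x, int y, int width, int height, int bufferWidth, int bufferHeight)
{
    if (x < 0)
        return ClipKind::VerticalLeft;
    if (x + width > bufferWidth)
        return ClipKind::VerticalRight;
    if (y < 0 || y + height > bufferHeight)
        return ClipKind::Vertical;
    return ClipKind::Full;
}
```

Only primitives fully inside the buffer take the `Full` path, which needs no bounds handling at all.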
Instead of passing the CEL sprite width when drawing, store the CEL
width at load time in the new `CelSprite` struct.
Implemented for most sprites, excluding towners, missiles, and monsters.
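A minimal sketch of the idea, assuming the sprite owns its pixel data. `CelSpriteSketch`, `LoadCelSketch`, and `DrawWidthSketch` are hypothetical and do not reflect the real `CelSprite` layout or loader:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Sprite data bundled with its width, so the width travels with the sprite.
struct CelSpriteSketch {
    std::unique_ptr<std::uint8_t[]> data;
    int width;
};

// Hypothetical loader: the width is supplied once, at load time.
// A real loader would read the bytes from a file instead of zero-filling.
inline CelSpriteSketch LoadCelSketch(std::size_t byteCount, int width)
{
    return CelSpriteSketch { std::make_unique<std::uint8_t[]>(byteCount), width };
}

// A draw routine reads the width from the sprite instead of taking it
// as a separate argument.
inline int DrawWidthSketch(const CelSpriteSketch &sprite)
{
    return sprite.width;
}
```

Storing the width next to the data removes the chance of a call site passing a width that does not match the sprite.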