The new algorithm is a lot less code, slightly faster, and results
in a smaller binary (-40 KiB on rg99).
The previous algorithm filled all the pixels around every solid pixel.
The new algorithm only fills pixels that will be visible.
We first collect the outline pixels into an array (which may contain a
small number of duplicates). Then, we render the entire array in a
single loop. This turns out to be slightly faster than rendering inline,
at the cost of ~4 KiB of stack (basically free).
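
A minimal sketch of the collect-then-render idea described above. The names
(`PixelPos`, `OutlineBuffer`, `RenderOutline`) and the capacity of 1024 entries
are assumptions made for illustration, not the actual code from this change:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical types for illustration; not the project's actual API.
struct PixelPos {
	std::int16_t x;
	std::int16_t y;
};

// Fixed-capacity buffer intended to live on the stack.
// Assumed capacity: 1024 entries * 4 bytes = 4 KiB.
struct OutlineBuffer {
	static constexpr std::size_t Capacity = 1024;
	PixelPos data[Capacity];
	std::size_t size = 0;

	void push(std::int16_t x, std::int16_t y)
	{
		assert(size < Capacity);
		data[size++] = PixelPos { x, y };
	}
};

// Second pass: a single tight loop over the collected positions.
// Duplicates are harmless because writing the same color twice
// leaves the destination unchanged.
inline void RenderOutline(const OutlineBuffer &buf, std::uint8_t *dst, int pitch, std::uint8_t color)
{
	for (std::size_t i = 0; i < buf.size; ++i)
		dst[buf.data[i].y * pitch + buf.data[i].x] = color;
}
```

Keeping the render loop separate from the sprite decoding keeps both loops
simple, which is one plausible reason for the speedup noted above.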
To collect the pixels, we go through the CLX sprite, keeping track
of the solid runs in the current row, and the filled pixels on the line
above and the line below.
To quickly test the pixels above and below, we introduce a new data
structure, `StaticBitVector`. It is similar to a bitset, except that its
size is determined at runtime (up to a fixed capacity), and it supports
quick updates of entire subspans.
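
The following is a rough sketch of such a structure, not the actual
`StaticBitVector` implementation. The class name `BitVectorSketch`, the 64-bit
word size, and the method names are all assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-capacity bit vector whose size is chosen at runtime.
template <std::size_t N>
class BitVectorSketch {
public:
	explicit BitVectorSketch(std::size_t size)
	    : size_(size)
	{
		assert(size <= N);
	}

	bool test(std::size_t pos) const
	{
		assert(pos < size_);
		return ((words_[pos / 64] >> (pos % 64)) & 1U) != 0;
	}

	// Sets bits [begin, begin + length) using whole-word masks
	// instead of touching one bit at a time.
	void set(std::size_t begin, std::size_t length)
	{
		assert(length != 0 && begin + length <= size_);
		const std::size_t last = begin + length - 1;
		const std::size_t firstWord = begin / 64;
		const std::size_t lastWord = last / 64;
		const std::uint64_t headMask = ~std::uint64_t { 0 } << (begin % 64);
		const std::uint64_t tailMask = ~std::uint64_t { 0 } >> (63 - last % 64);
		if (firstWord == lastWord) {
			words_[firstWord] |= headMask & tailMask;
			return;
		}
		words_[firstWord] |= headMask;
		for (std::size_t w = firstWord + 1; w < lastWord; ++w)
			words_[w] = ~std::uint64_t { 0 };
		words_[lastWord] |= tailMask;
	}

	std::size_t size() const { return size_; }

private:
	std::size_t size_;
	std::uint64_t words_[(N + 63) / 64] = {};
};
```

With this layout, marking a solid run of any length costs a handful of word
operations, and testing the pixel above or below is a single shift and mask.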
We know that the run length is never 0.
Letting the compiler know this allows it to optimize one instruction
away.
Moreover, for Fill runs, we also know that the length is at least 2.
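
One way to pass such a guarantee to the compiler is an assumption hint. The
sketch below is illustrative only: the macro `SKETCH_ASSUME` and the functions
`CopyRun` and `FillRun` are hypothetical names, and whether an instruction is
actually saved depends on the compiler and target:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical portability macro; falls back to a no-op on unknown compilers.
#if defined(__clang__)
#define SKETCH_ASSUME(x) __builtin_assume(x)
#elif defined(__GNUC__)
#define SKETCH_ASSUME(x)             \
	do {                             \
		if (!(x))                    \
			__builtin_unreachable(); \
	} while (false)
#elif defined(_MSC_VER)
#define SKETCH_ASSUME(x) __assume(x)
#else
#define SKETCH_ASSUME(x) ((void)0)
#endif

// Copies a run of pixels. The hint lets the optimizer drop the
// "is length zero?" check that normally guards the loop entry.
inline void CopyRun(std::uint8_t *dst, const std::uint8_t *src, unsigned length)
{
	SKETCH_ASSUME(length != 0);
	for (unsigned i = 0; i < length; ++i)
		dst[i] = src[i];
}

// Fills a run with a single color. Fill runs are assumed to be
// at least 2 pixels long, as stated above.
inline void FillRun(std::uint8_t *dst, unsigned length, std::uint8_t color)
{
	SKETCH_ASSUME(length >= 2);
	for (unsigned i = 0; i < length; ++i)
		dst[i] = color;
}
```

Calling either function with a length that violates the stated assumption is
undefined behavior, so the hint is only safe when the format guarantees it.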
Inlines blit command parsing.
We previously had blit commands because we supported rendering multiple
formats (CEL, CL2, CLX), but now we only ever render CLX, so they are
no longer necessary.
We were previously not setting it at all, which was incorrect but did not
cause any issues because we did not use it. We do, however, check the
scancode for the debug console key, and this caused a sanitizer warning
when running the demo in debug mode.