Frost on a roof

On a slow processor like the Z80, it is essential to think about execution time. Often a clean approach is too slow, and you need to optimize the code to make it a lot faster.

The ZX Spectrum screen bitmap is not linear. The 192 pixel rows are divided into three sections of 64 pixel rows. In each of these sections, all the 8 first pixel rows come first, followed by the second pixel rows, and so on. The advantage is that when writing characters to the bitmap, you only need to increment the H register to reach the next bitmap row. The disadvantage is that a pixel precise address calculation is hell.

This is how the coordinates of a pixel are mapped to the address:

HL
1514131211109876543210
010Y7Y6Y2Y1Y0Y5Y4Y3X7X6X5X4X3

X2, X1 and X0 represent the bit number at the address. It can be used as a counter for right shift operations.

My first attempt was a straightforward code that shifted, masked and moved the bit groups into the correct places. It took 117 cycles. This is nice, but we can do better.

We need a lot of rotation operations to shift the bits to the right position. Rotation is a rather expensive operation on a Z80, because there are no instructions that rotate by more than one bit at a time. My idea was to divide the X coordinate by 8 (by rotating it three times to the right) and simultaneously shift Y3 to Y5 into the L register. With a similar trick, I could set bit 14 while rotating, which saved me another or operation with a constant.

This is the final optimized code. It takes the X coordinate in the C register, and the Y coordinate in the B register. The screen address is returned in the HL register pair. BC and DE are unchanged, so there is no need for expensive push and pop operations.

pixelAddress:   ld      a, b
                and     %00000111
                ld      h, a    ; h contains Y2-Y0
                ld      a, b
                rra
                scf             ; set bit 14
                rra
                rra
                ld      l, a    ; l contains Y5-Y3
                and     %01011000
                or      h
                ld      h, a    ; h is complete now
                ld      a, c    ; divide X by 8
                rr      l       ; and rotate Y5-Y3 in
                rra
                rr      l
                rra
                rr      l
                rra
                ld      l, a    ; l is complete now
                ret

It only takes 108 cycles, ret inclusive. Optimizing saved me 9 cycles (or about 8%). This doesn’t sound like much, but if the code is invoked in a loop, those 9 cycles are multiplied by the number of loop iterations.

I claim this is the fastest solution without resorting to a lookup table. Try to beat me! 😁