On a slow processor like the Z80, it is essential to think about execution time. Often a clean approach is too slow, and you need to optimize the code to make it a lot faster.
The ZX Spectrum screen bitmap is not linear. The 192 pixel rows are divided into three sections of 64 pixel rows each. Within each section, the first pixel rows of all eight character rows come first, followed by their second pixel rows, and so on. The advantage is that when writing characters to the bitmap, you only need to increment the H register to reach the next pixel row of the character. The disadvantage is that calculating the address of an arbitrary pixel is hell.
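To make the advantage concrete, here is a minimal sketch of a character-drawing loop. The label names, the register assignment and the glyph pointer in DE are assumptions made for illustration; they are not taken from any particular ROM or library routine.

```asm
; Sketch: copy one 8x8 character cell to the screen.
; Assumed inputs (hypothetical calling convention):
;   HL = screen address of the top pixel row of the character cell
;   DE = address of the 8 bytes of glyph data
; Destroys A, B, DE and H.
drawChar:
        ld   b, 8            ; eight pixel rows per character cell
drawCharRow:
        ld   a, (de)         ; fetch the next glyph byte
        ld   (hl), a         ; write it to the bitmap
        inc  de              ; advance to the next glyph byte
        inc  h               ; next pixel row of the same cell (address + 256)
        djnz drawCharRow
        ret
```

Because Y2 to Y0 live in the low three bits of H, a single `inc h` steps down one pixel row as long as you stay inside the same character cell.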
This is how the coordinates of a pixel are mapped to the address:
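| Register | H | H | H | H | H | H | H | H | L | L | L | L | L | L | L | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bit | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| Value | 0 | 1 | 0 | Y7 | Y6 | Y2 | Y1 | Y0 | Y5 | Y4 | Y3 | X7 | X6 | X5 | X4 | X3 |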
X2, X1 and X0 give the bit position within the byte at that address, counted from the leftmost pixel. This value can be used as a counter for right-shift operations.
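As a sketch of that idea, the following routine turns the low three bits of X into a one-bit mask for the pixel. The label names and the choice of registers are assumptions for illustration. It relies on the leftmost pixel of a byte being bit 7, so a count of zero leaves the mask at %10000000.

```asm
; Sketch: build a pixel mask from X2-X0.
; In:  C = X coordinate
; Out: A = byte with only the addressed pixel's bit set
; Destroys B (save Y first if it is still needed).
pixelMask:
        ld   a, c
        and  %00000111       ; A = X2-X0, Z flag set if the count is zero
        ld   b, a            ; B = shift counter (ld leaves the flags alone)
        ld   a, %10000000    ; start with the leftmost pixel
        ret  z               ; count is zero: the mask is already correct
pixelMaskLoop:
        rrca                 ; move the set bit one pixel to the right
        djnz pixelMaskLoop
        ret
```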
My first attempt was straightforward code that shifted, masked and moved the bit groups into the correct places. It took 117 cycles. That is fine, but we can do better.
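For comparison, a straightforward version might look like the sketch below. This is an assumption about what such code could look like, not the original first attempt, and the label name is made up. Each bit group is rotated into place, masked and merged separately. Counted with the standard Z80 timings, this sketch also comes to 117 cycles including the `ret`, but that match does not prove it is the same code.

```asm
; Sketch of a shift-and-mask version (hypothetical, not the original code).
; In:  C = X coordinate, B = Y coordinate
; Out: HL = screen address
pixelAddressNaive:
        ld   a, b
        and  %00000111       ; Y2-Y0
        or   %01000000       ; set bit 14 with a constant
        ld   h, a
        ld   a, b
        rra
        rra
        rra                  ; Y7 and Y6 are now in bits 4 and 3
        and  %00011000
        or   h
        ld   h, a            ; h is complete
        ld   a, b
        rla
        rla                  ; Y5-Y3 are now in bits 7 to 5
        and  %11100000
        ld   l, a
        ld   a, c
        rra
        rra
        rra                  ; X7-X3 are now in bits 4 to 0
        and  %00011111
        or   l
        ld   l, a            ; l is complete
        ret
```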
We need a lot of rotation operations to shift the bits to the right position. Rotation is a rather expensive operation on a Z80, because there are no instructions that rotate by more than one bit at a time. My idea was to divide the X coordinate by 8 (by rotating it three times to the right) and simultaneously shift Y5 to Y3 into the top three bits through the carry flag, which yields the complete value for the L register. With a similar trick, setting the carry flag between two rotations, I could set bit 14 while rotating, which saved me another `or` operation with a constant.
This is the final optimized code. It takes the X coordinate in the C register and the Y coordinate in the B register. The screen address is returned in the HL register pair. BC and DE are unchanged, so there is no need for expensive `push` and `pop` operations.
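    pixelAddress:
            ld   a, b
            and  %00000111
            ld   h, a           ; h = 0 0 0 0 0 Y2 Y1 Y0
            ld   a, b
            rra                 ; carry is 0 after the and: a = 0 Y7 Y6 Y5 Y4 Y3 Y2 Y1
            scf                 ; set carry, it becomes bit 14 of the address
            rra                 ; a = 1 0 Y7 Y6 Y5 Y4 Y3 Y2
            rra                 ; a = Y1 1 0 Y7 Y6 Y5 Y4 Y3
            ld   l, a           ; low three bits of l contain Y5-Y3
            and  %01011000      ; keep bit 14, Y7 and Y6
            or   h
            ld   h, a           ; h is complete now
            ld   a, c           ; divide X by 8
            rr   l              ; and rotate Y5-Y3 in via the carry flag
            rra
            rr   l
            rra
            rr   l
            rra
            ld   l, a           ; l = Y5 Y4 Y3 X7 X6 X5 X4 X3, complete now
            ret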
It takes only 108 cycles, including the `ret`. Optimizing saved me 9 cycles (or about 8%). This doesn’t sound like much, but if the code is invoked in a loop, those 9 cycles are multiplied by the number of loop iterations.
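As an illustration of such a loop, the sketch below draws a solid bar, one byte wide, down the whole screen by calling pixelAddress once per pixel row. The label names and the bar itself are made up for this example. It relies on pixelAddress leaving BC unchanged, so the loop counter in B needs no saving.

```asm
; Sketch: fill one byte column over all 192 pixel rows.
; In:  C = any X coordinate inside the wanted byte column
; Destroys A, B and HL.
verticalBar:
        ld   b, 0            ; start at pixel row 0
verticalBarLoop:
        call pixelAddress    ; HL = address for (C, B), BC is preserved
        ld   (hl), %11111111 ; set all eight pixels of this byte
        inc  b               ; next pixel row
        ld   a, b
        cp   192             ; past the last row?
        jr   nz, verticalBarLoop
        ret
```

With 192 iterations, the 9 cycles saved per call add up to 192 × 9 = 1,728 cycles for this single bar.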
I claim this is the fastest solution without resorting to a lookup table. Try to beat me! 😁