I've done a little more work on my text kernel. It still displays only 12 lines of text, but the code is cleaner and it now shows lower-case letters and numbers (the text is from "The Hunchback of Notre Dame"):
I can get this up to 14 lines by overlapping the buffer creation code with the display code (there are lots of free cycles in the kernel). The problem is that this requires four versions of the copying and kernel code, which takes up nearly the whole 4K and leaves very little room for the actual text. This could be solved by copying the text into zero page memory and then doing a bank-switch into the kernel code. However, the copying code would require 24 iterations of the following code (24x10 cycles = 240 cycles = 3.2 scanlines), which would negate any savings with this method!
    lda (TEXTPTR),Y   ; 5 cycles (indirect indexed load)
    sta BUFFER+?      ; 3 cycles (zero-page store)
    dey               ; 2 cycles
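As a quick sanity check on the cycle estimate, here is a minimal Python sketch of the arithmetic. It assumes the standard 6502 timings for these addressing modes (no page crossing on the indexed load, and a zero-page destination), and 76 CPU cycles per scanline as on the Atari 2600. The names `CYCLES`, `CYCLES_PER_SCANLINE` and `copy_cost` are made up for illustration only.

```python
# Cycle cost of each instruction in one copy step.
# Assumptions: standard 6502 timings, no page crossing on the load,
# and BUFFER located in zero page.
CYCLES = {
    "lda (TEXTPTR),Y": 5,  # indirect indexed load
    "sta BUFFER+?": 3,     # zero-page store
    "dey": 2,              # decrement index
}

CYCLES_PER_SCANLINE = 76   # assumed: Atari 2600 CPU cycles per scanline
ITERATIONS = 24            # number of copy steps quoted in the text


def copy_cost(iterations=ITERATIONS):
    """Return (cycles per step, total cycles, scanlines used)."""
    per_step = sum(CYCLES.values())
    total = per_step * iterations
    return per_step, total, total / CYCLES_PER_SCANLINE


per_step, total, scanlines = copy_cost()
print(per_step, total, round(scanlines, 1))  # prints: 10 240 3.2
```

If the indexed load ever crosses a page boundary it costs one extra cycle, so the real figure can only be the same or slightly worse.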
This is probably as good as I can get the code without pre-computing character pairs (as in supercat's method), but this would be difficult with lower-case characters ...