flashjazzcat Posted July 18, 2014 (edited)

I currently find myself writing an awful lot of linked list code, and one application is a message queue. The queue can hold up to sixty-four messages, and is based on a number of arrays (LSB/MSB pointer to buffer for each message, etc.). A property of the queue is that although it is ostensibly FIFO, it is doubly linked and messages can be removed from anywhere in the list (in chronological order), since all processes use the same queue for messaging.

Anyway: one thing I've been trying to optimise is the search for unused array elements. Originally, each message slot had a corresponding used/free flag. When a new message is added to the queue (always at the end), the array of flags first had to be scanned until a free element was found:

	ldx #MaxQueueEntries
@	dex				; x already contains max queue entries
	lda QueueNodeAlloc,x		; quick scan for free entry
	bne @-
	inc QueueNodeAlloc,x		; say entry is used

Note we already know at least one element is free before entering the loop. This approach is fine when there are few messages in the queue, or when unused/released elements are always near the top of the array. But when the queue becomes full, this linear search is just terrible: up to c. 600 cycles just to find a free slot. So I considered using bitmap allocation, and threw a 256-element look-up table at the problem:

	ldx #8				; find unallocated queue node (from 64 entries)
@	dex				; at least one bit in one of the 8 bytes is guaranteed to be set
	ldy QueueNodeAllocTable,x	; get first byte in bitfield with bits set (free)
	beq @-
	lda BitLUT,y			; look up lowest set bit in byte
	tay				; save bit number (0-7)
	clc
	adc XLUT,x			; add to byte * 8
	sta KernelPtr2			; save node number
	lda QueueNodeAllocTable,x	; now mark in use
	and QueueNodeLUT,y		; by clearing bit in allocation table
	sta QueueNodeAllocTable,x
	...
QueueNodeLUT
	.byte 255-1,255-2,255-4,255-8,255-16,255-32,255-64,255-128
QueueNodeAllocTable			; 1 bit per node: 0=used, 1=free
	.byte $ff,$ff,$ff,$ff,$ff,$ff,$ff,$ff
XLUT
	.byte 0,8,16,24,32,40,48,56
BitLUT
	.byte 0,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 0-15
	.byte 4,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 16-31
	.byte 5,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 32-47
	.byte 4,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 48-63
	.byte 6,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 64-79
	.byte 4,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 80-95
	.byte 5,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 96-111
	.byte 4,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 112-127
	.byte 7,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 128-143
	.byte 4,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 144-159
	.byte 5,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 160-175
	.byte 4,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 176-191
	.byte 6,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 192-207
	.byte 4,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 208-223
	.byte 5,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 224-239
	.byte 4,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0 ; 240-255

Bits set in the bitmap represent free slots. "BitLUT" converts every possible byte value (where at least one bit is set) into the bit number of the lowest set bit in that byte. So, if %01101110 (110) is extracted from the bitmap, this translates to bit position 1 via the LUT. Added to the byte position * 8 (also from a LUT), this gives the actual index of the first free array element. The bit is immediately cleared in the bitmap to indicate that the slot is allocated.

My question is basically twofold: which is best, and is there anything faster? A linear search (at only around ten cycles per iteration) is unbeatable when the list is empty, but sub-optimal when the list is full. The bitmap method performs worse until there are around eight messages at the bottom of the queue: at that point, it leaves the linear search in the dust. I'm inclined to think the linear search is the best compromise of brevity and performance in all but exceptional cases (i.e.
when the message queue fills up for some reason and messages aren't being pulled out in a timely manner). The look-up table for bitmap allocation might be ideally suited to situations where the size of the array is much greater, and unallocated elements are not regularly found near the beginning of the search (such as a memory allocator or disk VTOC).

Edited July 18, 2014 by flashjazzcat
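For readers without a 6502 to hand, the bitmap scheme above can be modelled in Python; the class and method names are my own illustration, not from the assembly source. The 256-entry table maps any non-zero byte to its lowest set bit, and allocation scans at most eight bitmap bytes:

```python
# Model of the bitmap allocator: 64 slots, 1 bit per slot, set bit = free.
# BIT_LUT[b] = position of the lowest set bit in byte b (unused for 0).
BIT_LUT = [0] * 256
for b in range(1, 256):
    BIT_LUT[b] = (b & -b).bit_length() - 1  # isolate lowest set bit

class BitmapAlloc:
    def __init__(self):
        self.bitmap = [0xFF] * 8  # all 64 slots free

    def alloc(self):
        # Scan the 8 bitmap bytes (the assembly scans downwards from 7).
        for i in range(7, -1, -1):
            if self.bitmap[i]:
                bit = BIT_LUT[self.bitmap[i]]
                self.bitmap[i] &= ~(1 << bit)  # clear bit: slot now used
                return i * 8 + bit             # node number = byte * 8 + bit
        return None  # queue full

    def free(self, node):
        self.bitmap[node // 8] |= 1 << (node % 8)
```

As in the post, %01101110 maps to bit position 1 via the table; the O(1) table lookup replaces an inner eight-iteration bit-shifting loop.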
Xuel Posted July 18, 2014 (edited)

How about keeping the free and allocated slots in two separate doubly-linked lists? Then alloc and free are O(1) operations.

; Returns newly allocated message slot number in X register
allocMsg
	; Take the head of the free list and make it the new head of the allocated list
	ldx free
	mvy forward,x free
	mva #$FF backward,y
	ldy allocated
	stx allocated
	tya
	sta forward,x
	txa
	sta backward,y
	rts

; Expects X register to contain message slot number to free
freeMsg
	; Unlink node X and make it the new head of the free list
	; ...left as an exercise for the reader
	rts

; Space for 65 doubly linked list nodes
; Keep at least one node allocated at all times to eliminate need
; for termination check in allocMsg. Termination is indicated with $FF.
free		dta 0
allocated	dta 64
forward		:63 dta #+1
		:2 dta $FF
backward	dta $FF
		:63 dta #
		dta $FF

EDIT: Fixed second to last line to read ":63 dta #" instead of ":63 dta #-1".

Edited July 18, 2014 by Xuel
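The two-list scheme can be sketched in Python with array-based links, mirroring the data layout above (forward/backward arrays indexed by node number, $FF as terminator); the function bodies are my own reading of the technique, including the freeMsg left as an exercise:

```python
# Two doubly-linked lists over a shared node pool: "free" and "allocated".
# NIL plays the role of the $FF terminator; node 64 stays allocated as a
# sentinel, as in the original data tables.
NIL = 0xFF

forward  = [i + 1 for i in range(63)] + [NIL, NIL]   # 0->1->...->63->NIL
backward = [NIL] + list(range(63)) + [NIL]
free_head = 0
alloc_head = 64

def alloc_msg():
    """Move the head of the free list to the head of the allocated list. O(1)."""
    global free_head, alloc_head
    x = free_head
    free_head = forward[x]
    if free_head != NIL:
        backward[free_head] = NIL
    forward[x] = alloc_head          # link in front of old allocated head
    if alloc_head != NIL:
        backward[alloc_head] = x
    backward[x] = NIL
    alloc_head = x
    return x

def free_msg(x):
    """Unlink node x from the allocated list, push it on the free list. O(1)."""
    global free_head, alloc_head
    if backward[x] != NIL:
        forward[backward[x]] = forward[x]
    else:
        alloc_head = forward[x]
    if forward[x] != NIL:
        backward[forward[x]] = backward[x]
    forward[x] = free_head
    backward[x] = NIL
    free_head = x
```

Both operations touch a constant number of links regardless of how full the queue is, which is exactly what makes this beat both the linear search and the bitmap scan.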
Island2Live Posted July 18, 2014 (edited)

Just to sum up what you've told us:

1. You have a FIFO.
2. Any element in the FIFO - meaning from the first to the last - can be removed.
3. Removing an element from the FIFO frees the corresponding slot.

I would go with a linked list: each element in the FIFO has two additional pointers, one to the next and one to the previous element. You also need two separate pointers: one pointing at the first occupied (PTR_OCC) (active, soon to be processed) element in the FIFO, and one pointing at the first free element (PTR_FREE).

Finding a free element is a snap this way, because PTR_FREE always points to a free element. Removing/deleting an element is just a matter of letting the pointers of the next and previous elements of the to-be-deleted element point to each other, and letting the current free element point to the now-freed element. Now let PTR_FREE point to this new free element and there you go. Inserting an element into the FIFO is straightforward.

There are of course side effects for the wraparound of your 64 entries, and you need to decide whether your FIFO is already full. But in my opinion, and given the info in your post, this should be the fastest way to go.

[update] Ah, Xuel was faster!

Kind regards, Henrik (Island2Live)

Edited July 18, 2014 by Island2Live
Xuel Posted July 18, 2014

> I would go with a linked list: each element in the FIFO has two additional pointers; one to the next and one to the previous element.

I agree these can be part of the whole message structure, but if you can, make them static and arrange them in an interleaved fashion so that you can use ABS,x and ABS,y addressing with offsets instead of using actual pointers.

> Ah, Xuel was faster!

I left work early so that I could try to be first!
flashjazzcat Posted July 19, 2014 (Author)

Good ideas - thanks both! I'll give a free linked list a try tomorrow. The message queue itself already acts as the "allocated" list, so keeping track of free elements in another list should be a snap. No problems of wrap-around in the message queue, BTW: it doesn't use a tail pointer chasing the head, since messages can be pulled out from anywhere in the queue (the only proviso being that they'll appear to the receiver in chronological order).

That brings me to another optimisation: since the queue is system-wide, the GetMessage function has to walk the linked list until it finds a message whose intended recipient is either the calling process or "any process". Each process keeps a running count of the number of pending messages intended for it, so at least we don't have to walk the message queue to find out there are NO messages for a given process. Of course, having a separate linked list tracing the messages for each process ID would improve matters, but I think we might end up with linked-list overload for what is ostensibly a fairly simple structure.

Anyway: it's great getting some fresh eyes on this stuff, and I'll give a more informed response tomorrow.
Xuel Posted July 19, 2014

Why are you combining all messages into a single collection? Is it to save memory? Is it motivated by the "any process" messages?

You could eliminate the need to search through the messages while satisfying the chronological-ordering requirement by giving each process its own message queue, i.e. a simple FIFO - no need for a linked list. Maybe there could be a separate queue for the "any process" messages, or you could duplicate them across all process queues.

If the message structure is large and you want to conserve memory, then you could use a hybrid approach where 1) each process has a simple message queue which just contains offsets into the shared message slot space, and 2) the shared space is managed with linked lists as before.
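The hybrid idea can be sketched in Python; all names and the use of deque are illustrative stand-ins for the small per-process FIFOs and shared slot pool being described, not anything from the thread's code:

```python
from collections import deque

MAX_SLOTS = 64

free_slots = deque(range(MAX_SLOTS))   # shared message slot space
process_queues = {}                    # pid -> FIFO of slot numbers
messages = [None] * MAX_SLOTS          # shared message bodies, one per slot

def post_message(pid, body):
    """Store the body in a shared slot; queue only the slot number per process."""
    slot = free_slots.popleft()        # O(1) allocation from shared space
    messages[slot] = body
    process_queues.setdefault(pid, deque()).append(slot)

def get_message(pid):
    """Pop the oldest message for this process - no search through other
    processes' messages, yet per-process chronological order is preserved."""
    q = process_queues.get(pid)
    if not q:
        return None
    slot = q.popleft()
    body, messages[slot] = messages[slot], None
    free_slots.append(slot)            # release the shared slot
    return body
```

Each per-process queue holds only one byte (a slot number) per pending message, so the memory cost over the single shared queue is small.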
flashjazzcat Posted July 19, 2014 (Author) (edited)

> Why are you combining all messages into a single collection? Is it to save memory? Is it motivated by the "any process" messages?

I decided on the single message queue after discussing the matter with Prodatron (the author of SymbOS, after which my API is modelled): he said he was using a single message queue for the sake of simplicity (and reduced storage requirements), and it's clearly working well for him. I didn't yet ask him exactly what kind of data structures are used: I just went it alone and designed something which made sense to me.

> You could eliminate the need to search through the messages while satisfying the chronological ordering requirement by giving each process its own message queue, i.e. a simple FIFO -- no need for a linked list. Maybe there could be a separate queue for the "any process" messages or you could duplicate them across all process queues.

This is a really excellent idea, and similar to something I was kicking around before I fell asleep last night. Any time a message is added which pertains to a particular process, the absolute message slot number is added to a little FIFO list attached to that process. It's more storage, of course: you'd have to put some kind of reasonable hard limit on how many messages a process could queue up.

If a message is intended for all processes, just add the message slot number to every process's FIFO list, and attach a counter to the single copy of the message in the main queue, which ticks down every time a process reads the message. When the counter is zero, the message is removed from the main queue. A little bit risky if a given process dies before reading the message, but other than that it seems perfectly workable.
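The broadcast-counter idea boils down to reference counting a single shared copy; a minimal Python sketch (class and method names are mine, purely for illustration):

```python
# One shared copy of an "all processes" message, with a countdown of
# readers; the slot can be freed once the last recipient has read it.
class BroadcastMessage:
    def __init__(self, body, num_processes):
        self.body = body
        self.pending = num_processes  # readers still to consume it

    def read(self):
        """Return the body, plus True when the slot may now be freed."""
        self.pending -= 1
        return self.body, self.pending == 0
```

The risk noted above shows up here too: if a process dies without calling read(), pending never reaches zero and the slot leaks unless process teardown also decrements the counter.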
> If the message structure is large and you want to conserve memory then you could use a hybrid approach where 1) each process has a simple message queue which just contains offsets into the shared message slot space and 2) the shared space is managed with linked lists as before.

That's another good approach, but then we get arrays of arrays to handle linked lists for each process descriptor. But it might be the optimal solution.

It occurred to me that the free list need only be forward-linked, since it's sufficient to always pull the first free node we find:

; ----------------------------------------------------------------------------
; Initialize message queue's free node list
; ----------------------------------------------------------------------------
.local InitQueueFreeList
	mva #1 QueueFreeNodeHead
	ldx #1
Loop1
	inx
	cpx #65
	bcs Done
	txa
	sta QueueFreeNodeNext-2,x
	bne Loop1
Done
	mva #0 QueueFreeNodeNext-2,x
	rts
.endl

; ----------------------------------------------------------------------------
; Get a free message queue node
; and remove it from the head of the free list
; Returns node in Y
; ----------------------------------------------------------------------------
.local GetFreeMsgNode
	ldy QueueFreeNodeHead
	lda QueueFreeNodeNext-1,y
	sta QueueFreeNodeHead
	rts
.endl

; ----------------------------------------------------------------------------
; Add a message queue node to the head of the free list
; Pass node in KernelPtr2
; ----------------------------------------------------------------------------
.local AddToQueueFree
	lda QueueFreeNodeHead
	ldx KernelPtr2
	sta QueueFreeNodeNext-1,x
	stx QueueFreeNodeHead
	rts
.endl

So when a node is freed up, it just gets pushed back onto the head of the free queue. This looks wonderfully simple and efficient. Just need to sort out the addressee issue now. Note that indices are numbered +1 of their actual offset; this is just to allow use of 0 as a terminal pointer.
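The same forward-linked free list can be modelled in Python, keeping the post's convention that node numbers are 1-based so that 0 can serve as the terminator (function names are mine):

```python
# Singly-linked free list over 64 nodes; node numbers run 1..64 and
# 0 is the terminal pointer, as in the assembly above.
MAX_QUEUE_ENTRIES = 64

free_head = 1
# next_node[n - 1] is the successor of node n: 1->2->...->64->0
next_node = [i + 2 for i in range(MAX_QUEUE_ENTRIES - 1)] + [0]

def get_free_node():
    """Pop the head of the free list; returns 0 if the list is empty."""
    global free_head
    node = free_head
    if node:
        free_head = next_node[node - 1]
    return node

def release_node(node):
    """Push a node back onto the head of the free list."""
    global free_head
    next_node[node - 1] = free_head
    free_head = node
```

Both operations are O(1): allocation never walks the list because the head is always a free node.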
Edited July 19, 2014 by flashjazzcat
flashjazzcat Posted July 19, 2014 (Author)

On the bus, I realized we could also just do this:

; ----------------------------------------------------------------------------
; Get a free message queue node
; and remove it from the free list
; Returns node in A
; ----------------------------------------------------------------------------
.local GetFreeMsgNode
	dec FreeNodePtr
	ldy FreeNodePtr
	lda FreeNodes,y
	rts
.endl

; ----------------------------------------------------------------------------
; Add a message queue node to the free list
; Pass node in A
; ----------------------------------------------------------------------------
.local AddToQueueFree
	ldy FreeNodePtr
	sta FreeNodes,y
	inc FreeNodePtr
	rts
.endl

FreeNodePtr	.byte 64
FreeNodes
?Node	equ 1
	.rept MaxQueueEntries
	.byte ?Node
?Node	equ ?Node+1
	.endr

This is effectively a "bucket" of free nodes, which initially contains all 64 nodes. When we want a free one, we just pop the top one off the stack. Whichever node we're finished with goes back onto the top of the stack. Bitmap allocation and the look-up table can be saved for the memory allocator.
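The "bucket" is simply a stack of node numbers, which a few lines of Python make plain (names mine; the comments map each line back to the assembly):

```python
# Stack of free node numbers, initially holding all 64 nodes.
# len(free_nodes) plays the role of FreeNodePtr.
MAX_QUEUE_ENTRIES = 64

free_nodes = list(range(1, MAX_QUEUE_ENTRIES + 1))  # nodes 1..64

def get_free_node():
    return free_nodes.pop()    # dec FreeNodePtr; lda FreeNodes,y

def release_node(node):
    free_nodes.append(node)    # sta FreeNodes,y; inc FreeNodePtr
```

Unlike the linked free list, no next-pointer array is needed at all: the stack's ordering is irrelevant because any free node is as good as any other.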
tschak909 Posted July 22, 2014

So basically a max of 256 message end points? (sorry for stupid question)

-Thom
dmsc Posted July 23, 2014

Hi,

> On the bus, I realized we could also just do this: [...] This is effectively a "bucket" of free nodes, which initially contains all 64 nodes.

Good idea, it's simple and efficient. Also, the messages could be stored "interleaved" to use direct indexed addressing to access the contents:

.local GetFreeMsgNode
	dec FreeNodePtr
	ldy FreeNodePtr
	ldx FreeNodes,y
	rts
.endl

.local StoreNewMessage
	jsr GetFreeMsgNode
	lda #msgByte1
	sta MessageBuffer1,x
	lda #msgByte2
	sta MessageBuffer2,x
	lda #msgByte3
	sta MessageBuffer3,x
	; ...
	txa
	jsr PushMessageToProcess
.endl

- dmsc.
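The interleaved layout is a struct-of-arrays arrangement: one array per message byte, all indexed by the node number, so each field is a single abs,X access on the 6502. A tiny Python sketch (the field names are illustrative, not from the thread):

```python
# Struct-of-arrays message storage: parallel arrays indexed by node number.
MAX_QUEUE_ENTRIES = 64

msg_type   = [0] * MAX_QUEUE_ENTRIES  # stands in for MessageBuffer1,x
msg_sender = [0] * MAX_QUEUE_ENTRIES  # stands in for MessageBuffer2,x
msg_data   = [0] * MAX_QUEUE_ENTRIES  # stands in for MessageBuffer3,x

def store_message(node, mtype, sender, data):
    # One indexed store per field - no per-message pointer arithmetic.
    msg_type[node] = mtype
    msg_sender[node] = sender
    msg_data[node] = data
```

The trade-off versus a packed per-message record is that fields sit MAX_QUEUE_ENTRIES bytes apart, which is exactly what makes the indexed addressing work.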
Heaven/TQA Posted July 23, 2014

man.... high end 6502 coding....
flashjazzcat Posted July 23, 2014 (Author) (edited)

> Also, the messages could be stored "interleaved" to use direct indexed addressing to access the contents: [...]

Yes - this is a good idea too. Always much faster, and certainly ideal for shorter buffers (if we're keeping an eye on code size too). The rectangle list uses interleave like this, since there are only six bytes in each rect (Left [word], Right [word], Top [byte], and Bottom [byte]). Wrap these up in a .LOCAL range and things become very readable too:

.local RectList
LeftLo		.ds MaxWindowRects
LeftHi		.ds MaxWindowRects
RightLo		.ds MaxWindowRects
RightHi		.ds MaxWindowRects
Top		.ds MaxWindowRects
Bottom		.ds MaxWindowRects
Next		.ds MaxWindowRects
FreeHead	.ds 1
FreeNext	.ds MaxWindowRects
.endl

...

	mva Left RectList.LeftLo-1,x
	mva Left+1 RectList.LeftHi-1,x
	mva Right RectList.RightLo-1,x
	mva Right+1 RectList.RightHi-1,x
	mva Top RectList.Top-1,x
	mva Bottom RectList.Bottom-1,x

Each window in the window list has a node pointer to the head of its rectangle list in "RectList". Using the free list or free stack (hard to choose between them) has really speeded up node allocation there too. Good stuff.

> man.... high end 6502 programming coding....

Well, dunno how high-end it is, but it's fun to brainstorm these problems with nifty coders.

Edited July 23, 2014 by flashjazzcat