unpacking LZ4 on Jaguar

Ericde45 · September 24, 2022

hello,

in case you need to use compressed files on the Jag: my YM emulation code on my github contains a subroutine that unpacks LZ4 on the 68000, and an equivalent one for the DSP

https://github.com/ericde45/YM2149_JAG/blob/main/ym1.s

the PC packer program is here:

https://github.com/lz4/lz4

compression is done with: lz4.exe -9 -l --no-frame-crc [input file] [output file]

the packed data starts at offset +8

the -l option generates the legacy LZ4 frame format, which is simpler than the usual LZ4 frame format

documentation of the format is here: https://android.googlesource.com/platform/external/lz4/+/HEAD/doc/lz4_Frame_format.md
(the block size is 8 MB, which is enough to produce a single-block file for the Jaguar)
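
For checking a packed file on the PC before it goes into a Jaguar build, the "+8" layout can be sketched in Python. This is a minimal sketch, assuming the header described in the linked frame format document: a 4-byte little-endian magic number (0x184C2102 for legacy frames) followed by a 4-byte little-endian compressed block size. The function name `first_legacy_block` is made up for this illustration and is not part of the lz4 tools.

```python
import struct

# Magic number of the LZ4 legacy frame, as given in the frame format document.
LEGACY_MAGIC = 0x184C2102


def first_legacy_block(frame: bytes) -> bytes:
    """Return the compressed bytes of the first block of a legacy frame.

    Hypothetical helper: it skips the 8-byte header (magic + block size)
    and returns exactly the number of bytes announced by the size field.
    """
    if len(frame) < 8:
        raise ValueError("too short for a legacy frame header")
    magic, packed_size = struct.unpack_from("<II", frame, 0)
    if magic != LEGACY_MAGIC:
        raise ValueError("not an LZ4 legacy frame")
    block = frame[8:8 + packed_size]
    if len(block) != packed_size:
        raise ValueError("block is truncated")
    return block
```

The size field read here is presumably the value a depack routine would expect as its "packed block size" input, with the data pointer set to the start of the file plus 8.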

42bs · September 24, 2022

Nice, but the DSP code needs some tweaking. I guess it is a first shot/proof of concept.

Ericde45 · September 24, 2022

you are welcome to tweak it

i am currently using it this way.

my time is limited and currently i don't need to optimize this part of my code, either for time or size.

in the YM replay, it is done in the background, about once a minute

and in Jalaga, this will be used during wait times

42bs · September 24, 2022

I wanted to make a zx2 depacker, but since I already did an lz4 depacker for the 6502, I will try to tweak this one.

BTW out of curiosity, why don't you use creg to rename registers?

Ericde45 · September 24, 2022

if creg is tagging registers to create aliases for them, i don't have a clear answer.

i use it in complex code, for example in the gpu sprite collision routine of Jalaga, but most of the time i code with the registers directly.

i may see things better with registers, and i'm using my own custom colors in notepad++ to help

42bs · September 28, 2022

Slightly optimized/size reduced (250 > 186 bytes)

; input: R20 : packed buffer
;		 R21 : output buffer
;		 R0  : LZ4 packed block size (in bytes)

; A4 => R24
; A0 => R20
; A1 => R21
; A3 => R23
; D0 => R0
; D1 => R1
; D2 => R2
; D4 => R4

; jump address 1 => R10
; jump address 2 => R11

; R12=$FF for mask
; R13=tmp

lz4_depack_smallest_DSP:
		move	R20,R24
		add	R0,R24	; packed buffer end
		moveq	#0,R0
		moveq	#0,R2
		moveq	#$F,R4
		movei	#.lenOffset_smallest_DSP,R10
		movei	#$FF,R12

.tokenLoop_smallest_DSP:
		loadb	(R20),R0
		addq	#1,R20
		move	R0,R1
		shrq	#4,R1
		jump	eq,(R10)
		nop

.readLen_smallest1_DSP:
		cmp	R1,R4			; cmp.B !!!!
		jr	ne,.readEnd_smallest1_DSP
;;->		nop
.readLoop_smallest1_DSP:
		loadb	(R20),R2
		addq	#1,R20
		add	R2,R1			; final len could be > 64KiB
		not	R2
		and	R12,R2			; not R2.b
		jr	eq,.readLoop_smallest1_DSP
		nop
.readEnd_smallest1_DSP:

.litcopy_smallest_DSP:
		loadb	(R20),R13
		addq	#1,R20
		subq	#1,R1
		storeb	R13,(R21)
		jr	ne,.litcopy_smallest_DSP
		addq	#1,R21

		; end test is always done just after literals
		movei	#.readEnd_smallest_DSP,R11
		cmp	R20,R24
		jump	eq,(R11)
		nop

.lenOffset_smallest_DSP:
		loadb	(R20),R1	; read 16bits offset, little endian, unaligned
		addq	#1,R20
		loadb	(R20),R13
		addq	#1,R20
		shlq	#8,R13
		add	R13,R1

		move	R21,R23
		sub	R1,R23		; R1/d1 bits 31..16 are always 0 here

		moveq	#$F,R1
		and	R0,R1		; and.w	d0,d1 .W !!!

.readLen_smallest2_DSP:
		cmp	R1,R4			; cmp.B !!!!
		jr	ne,.readEnd_smallest2_DSP
		nop

.readLoop_smallest2_DSP:
		loadb	(R20),R2
		addq	#1,R20
		add	R2,R1			; final len could be > 64KiB
		not	R2
		and	R12,R2			; not R2.b
		jr	eq,.readLoop_smallest2_DSP
		nop

.readEnd_smallest2_DSP:
		addq	#4,R1

.copy_smallest_DSP:
		loadb	(R23),R13
		addq	#1,R23
		subq	#1,R1
		storeb	R13,(R21)
		jr	ne,.copy_smallest_DSP
		addqt	#1,R21

		movei	#.tokenLoop_smallest_DSP,R11
		jump	(R11)
		nop

.readLen_smallest_DSP:
		cmp	R1,R4				; cmp.B !!!!
		jr	ne,.readEnd_smallest_DSP
		nop

.readLoop_smallest_DSP:
		loadb	(R20),R2
		addq	#1,R20
		add	R2,R1			; final len could be > 64KiB
		not	R2
		and	R12,R2			; not R2.b
		jr	eq,.readLoop_smallest_DSP
		nop

.readEnd_smallest_DSP:

YM_DSP_retour_depack_LZ4_boucle_principale_DSP:
	movei	#YM_LZ4_nb_bloc_LZ4_disponibles,R0
	load	(R0),R1
	addq	#1,R1
	store	R1,(R0)

	movei	#DSP_boucle_centrale,R0
	jump	(R0)
	nop
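
As a host-side reference for what the listing above does, here is a minimal Python sketch of an LZ4 block decoder. It is an illustration under assumptions, not a port of the DSP code: the name `lz4_block_decode` is hypothetical, the input is assumed to be one raw block (without the 8-byte frame header), and, like the DSP routine, it treats the end of the packed data right after a literal run as the end of the block.

```python
def lz4_block_decode(packed: bytes) -> bytes:
    """Decode one raw LZ4 block (hypothetical reference helper)."""
    out = bytearray()
    pos = 0
    end = len(packed)
    while pos < end:
        token = packed[pos]
        pos += 1

        # High nibble: literal count, extended by 255-valued bytes if it is 15.
        lit_len = token >> 4
        if lit_len == 15:
            while True:
                extra = packed[pos]
                pos += 1
                lit_len += extra
                if extra != 255:
                    break
        out += packed[pos:pos + lit_len]
        pos += lit_len

        # The end test is done just after the literals.
        if pos >= end:
            break

        # 16-bit little-endian offset, read byte by byte (may be unaligned).
        offset = packed[pos] | (packed[pos + 1] << 8)
        pos += 2
        if offset == 0 or offset > len(out):
            raise ValueError("invalid match offset")

        # Low nibble: match length minus 4, extended the same way.
        match_len = token & 0x0F
        if match_len == 15:
            while True:
                extra = packed[pos]
                pos += 1
                match_len += extra
                if extra != 255:
                    break
        match_len += 4

        # Byte-by-byte copy, so a match may overlap the bytes it produces.
        src = len(out) - offset
        for _ in range(match_len):
            out.append(out[src])
            src += 1
    return bytes(out)
```

Decoding the same block on the PC and comparing the result with the Jaguar output buffer is one way to check a tweaked depacker.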

+CyranoJ · September 28, 2022

Could save a few more, there are a couple of places you can do this:

.tokenLoop_smallest_DSP:
		loadb	(R20),R0
		addq	#1,R20
		move	R0,R1
		shrq	#4,R1
		jump	eq,(R10)
		nop
        
into this....

.tokenLoop_smallest_DSP:
        loadb   (R20),r0
        move    r0,r1
        shrq    #4,r1
        jump    eq,(R10)
        addq    #1,r20

42bs · September 28, 2022

It saves at least a few bytes, but brings no speed benefit, since the "loadb" stalls for many cycles anyway.

I have some more ideas, but I wanted the first tweak not to be too aggressive.

42bs · September 29, 2022

Just noticed, there is dead code. The label .readLen_smallest_DSP is never used.

42bs · September 29, 2022

Down to ~~146~~ 142 bytes plus some speed optimization:

;;; -*-asm-*-

; input: R20 : packed buffer
;		 R21 : output buffer
;		 R0  : LZ4 packed block size (in bytes)

; A4 => R24
; A0 => R20
; A1 => R21
; A3 => R23
; D0 => R0
; D1 => R1
; D2 => R2
; D4 => R4

; jump address 1 => R10
; jump address 2 => R11

; R12=$FF for mask
; R13=tmp

lz4_depack_smallest_DSP:
		move	R20,R24
		add	R0,R24			; packed buffer end
		moveq	#$F,R4
		movei	#.lenOffset_smallest_DSP,R10
		movei	#.tokenLoop_smallest_DSP,R11
		movei	#$FF,R12

		loadb	(R20),R0
.tokenLoop_smallest_DSP:
		addqt	#1,R20
		move	R0,R1
		shrq	#4,R1
		jump	eq,(R10)
		and	r4,r0			; remove high nibble

.readLen_smallest1_DSP:
		cmp	R1,R4			; r1 == 15 ?
		loadb	(R20),R2
		jr	ne,.readEnd_smallest1a_DSP ; skip first addq in copy loop!
.readLoop_smallest1_DSP:
		addqt	#1,R20
		add	R2,R1			; final len could be > 64KiB
		cmp	R12,R2			; r2 = $ff ?
		jr	eq,.readLoop_smallest1_DSP
		loadb	(R20),R2
.readEnd_smallest1_DSP:

.litcopy_smallest_DSP:
		addqt	#1,R20
.readEnd_smallest1a_DSP:
		subq	#1,R1
		storeb	R2,(R21)
		addqt	#1,R21
		jr	ne,.litcopy_smallest_DSP
		loadb	(R20),r2

		; end test is always done just after literals
		cmp	R20,R24
		jr	ne,.lenOffset_smallestx_DSP
		loadb	(R20),R1	; read 16bits offset, little endian, unaligned

;;; .readEnd_smallest_DSP:

;;; ----------------------------------------
	;; Return to caller
YM_DSP_retour_depack_LZ4_boucle_principale_DSP:
		movei	#YM_LZ4_nb_bloc_LZ4_disponibles,R0
		load	(R0),R1
		addq	#1,R1
		store	R1,(R0)

		movei	#DSP_boucle_centrale,R0
		jump	(R0)
;;; ----------------------------------------
.lenOffset_smallest_DSP:
		loadb	(R20),R1	; read 16bits offset, little endian, unaligned
.lenOffset_smallestx_DSP:
		move	R21,R23
		addqt	#1,R20
		loadb	(R20),R13
		addqt	#1,R20

;;;readLen_smallest2_DSP
		cmp	R0,R4			; cmp.B !!!!
		jr	ne,.readEnd_smallest2_DSP
.readLoop_smallest2_DSP:
		loadb	(R20),R2
		add	R2,R0			; final len could be > 64KiB
		cmp	R12,R2			; r2 = $ff ?
		jr	eq,.readLoop_smallest2_DSP
		addq	#1,R20

.readEnd_smallest2_DSP:
	;; finish offset calculation
		shlq	#8,R13
		add	R13,R1
		sub	R1,R23		; R1/d1 bits 31..16 are always 0 here

		addq	#4,R0
.copy_smallest_DSP:
		loadb	(R23),R13
		addq	#1,R23
		subq	#1,R0
		storeb	R13,(R21)
		jr	ne,.copy_smallest_DSP
		addqt	#1,R21

		jump	(R11)
		loadb	(R20),R0

		align 4
YM_LZ4_nb_bloc_LZ4_disponibles:	ds.l 1

Edited September 29, 2022 by 42bs
No "not" needed
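
The edit note can be checked exhaustively on the PC. The sketch below assumes that R2 holds a zero-extended byte after "loadb" and models the two tests from the listings: "not R2" followed by "and R12,R2" (zero result means the byte was $FF) versus a plain "cmp R12,R2". The function names are made up for this illustration.

```python
def is_ff_with_not_and(byte: int) -> bool:
    # Models: not R2 ; and R12,R2 (R12 = $FF) ; branch if result is zero.
    return (~byte & 0xFF) == 0


def is_ff_with_cmp(byte: int) -> bool:
    # Models: cmp R12,R2 (R12 = $FF) ; branch if equal.
    return byte == 0xFF


# Compare both tests for every possible byte value.
agree = all(is_ff_with_not_and(b) == is_ff_with_cmp(b) for b in range(256))
```

Both tests agree for all 256 byte values. The compare also leaves R2 unchanged, whereas the "not"/"and" pair overwrites it.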

42bs · September 29, 2022

So, final version. Register usage reduced, more comments.

;;; -*-asm-*-

; input:
;;; R20 : packed buffer
;;; R21 : output buffer
;;; R0  : LZ4 packed block size (in bytes)
;;;
;;; Register usage (destroyed!)
;;; r1,r2,r4,r10,r11,r12,r13
;;;
;;; R1,R2     : temp register
;;; r4        : mask $0f
;;; r10       : jump destination
;;; r11       : jump destination
;;; r12       : mask $ff
;;; r13       : end of packed data

lz4_depack_smallest_DSP:
		move	R20,R13
		add	R0,R13			; packed buffer end
		moveq	#$F,R4
		movei	#.lenOffset_smallest_DSP,R10
		movei	#.tokenLoop_smallest_DSP,R11
		movei	#$FF,R12

		loadb	(R20),R0
.tokenLoop_smallest_DSP:
		addqt	#1,R20
		move	R0,R1
		shrq	#4,R1
		jump	eq,(R10)
		and	r4,r0			; remove high nibble

.readLen_smallest1_DSP:
		cmp	R1,R4			; r1 == 15 ?
		loadb	(R20),R2
		jr	ne,.readEnd_smallest1a_DSP ; skip first addq in copy loop!
.readLoop_smallest1_DSP:
		addqt	#1,R20
		add	R2,R1			; final len could be > 64KiB
		cmp	R12,R2			; r2 = $ff ?
		jr	eq,.readLoop_smallest1_DSP
		loadb	(R20),R2
.readEnd_smallest1_DSP:

.litcopy_smallest_DSP:
		addqt	#1,R20
.readEnd_smallest1a_DSP:
		subq	#1,R1
		storeb	R2,(R21)
		addqt	#1,R21
		jr	ne,.litcopy_smallest_DSP
		loadb	(R20),r2

		; end test is always done just after literals
		cmp	R20,R13
		jr	ne,.lenOffset_smallestx_DSP
		loadb	(R20),R1	; read 16bits offset, little endian, unaligned

;;; .readEnd_smallest_DSP:

;;; ----------------------------------------
	;; Return to caller
YM_DSP_retour_depack_LZ4_boucle_principale_DSP:
		movei	#YM_LZ4_nb_bloc_LZ4_disponibles,R0
		load	(R0),R1
		addq	#1,R1
		store	R1,(R0)

		movei	#DSP_boucle_centrale,R0
		jump	(R0)
;;; ----------------------------------------
.lenOffset_smallest_DSP:
		loadb	(R20),R1	; read 16bits offset, little endian, unaligned
.lenOffset_smallestx_DSP:
		addqt	#1,R20
		loadb	(R20),R2
		addqt	#1,R20
		shlq	#8,R2
		add	R2,R1
		neg	r1
		add	r21,r1		; source = dest - offset

;;;readLen_smallest2_DSP
		cmp	R0,R4		; r0 == 15 ?
		jr	ne,.readEnd_smallest2_DSP
.readLoop_smallest2_DSP:
		loadb	(R20),R2
		add	R2,R0		; final len could be > 64KiB
		cmp	R12,R2		; r2 = $ff ?
		jr	eq,.readLoop_smallest2_DSP
		addq	#1,R20

.readEnd_smallest2_DSP:

		addq	#4,R0
.copy_smallest_DSP:
		loadb	(R1),R2
		addq	#1,R1
		subq	#1,R0
		storeb	R2,(R21)
		jr	ne,.copy_smallest_DSP
		addqt	#1,R21

		jump	(R11)
		loadb	(R20),R0

		align 4
YM_LZ4_nb_bloc_LZ4_disponibles:	ds.l 1
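
The "source = dest - offset" step and the byte-wise copy loop matter when the offset is smaller than the match length, because the copy then reads bytes it has just written. A minimal Python sketch of that behaviour, with a hypothetical helper name `copy_match`:

```python
def copy_match(out: bytearray, offset: int, length: int) -> None:
    """Append `length` bytes starting `offset` bytes before the end of `out`.

    Hypothetical helper: the copy runs one byte at a time, so source and
    destination may overlap, as in the DSP copy loop.
    """
    if not 0 < offset <= len(out):
        raise ValueError("offset points outside the output written so far")
    src = len(out) - offset          # source = dest - offset
    for _ in range(length):
        out.append(out[src])         # may read a byte appended by this loop
        src += 1
```

With an offset of 1 this repeats the last byte `length` times, which is how LZ4 encodes runs. A block copy that reads the whole source range first would not reproduce this.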

42bs · September 29, 2022

1 minute ago, 42bs said:
;; Return to caller

Last note: The part in "return to caller" is there to keep the code compatible with @Ericde45's YM player.

Ericde45 · September 29, 2022

nice work !

it would be nice to share it on your github no ?

42bs · September 29, 2022

15 minutes ago, Ericde45 said:

nice work !

it would be nice to share it on your github no ?

I will if it is ok with you (of course pointing to your GH).

Ericde45 · September 29, 2022

it is ok for me

if i make some sources public, it is so that they can travel, be modified, be used, and keep the flame of the Atari Jaguar alive

42bs · September 30, 2022

Ok, I am preparing something. Depack routine is now down to 116 bytes.

I am only waiting for permission to use a nice picture of the Star Trek Voyager as the packed source.

42bs · September 30, 2022

Added an example in new_bjl:

https://github.com/42Bastian/new_bjl/tree/main/exp/depacker

unlz4 now down to ~~110~~ 108 bytes, freed one register.

Edited September 30, 2022 by 42bs

Ericde45 · October 2, 2022

regarding depacking on the dsp: when a module is playing while the lz4 depacking runs, the music sounds off-tone.

it seems the interrupts are not able to get all their work done

aren't the I2S and timer 1 interrupts supposed to have priority over the main DSP code?

replay frequency is ~16000 Hz
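
To put a number on the time budget, the cycles available per output sample can be estimated. This is a rough sketch under an assumption that is not stated in the thread: a DSP clock of about 26.59 MHz (the commonly quoted NTSC Jaguar system clock). Only the ~16000 Hz replay rate comes from the post above.

```python
# Assumption (not from the thread): DSP clock of roughly 26.59 MHz.
ASSUMED_DSP_CLOCK_HZ = 26_590_906
REPLAY_RATE_HZ = 16_000  # replay frequency quoted in the post


def cycles_per_sample(clock_hz: int, sample_rate_hz: int) -> float:
    """Clock cycles that elapse between two consecutive output samples."""
    return clock_hz / sample_rate_hz


budget = cycles_per_sample(ASSUMED_DSP_CLOCK_HZ, REPLAY_RATE_HZ)
```

Under that assumption the budget is on the order of 1600 to 1700 cycles per sample, shared between the interrupt handlers and the main code, including any cycles lost waiting on the bus.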

42bs · October 2, 2022

Interrupts should be able to interrupt the main code at any time. But depacking means a lot of bus accesses, which might also stall the interrupts.

Ericde45 · October 2, 2022

do you know a way to slow down the depacker, so that the interrupts get more time?

should i put some OR R0,R0 after each loadb?

42bs · October 2, 2022

I doubt this will have any effect. Maybe adding a small loop at the end (after the copy loop).

42bs · October 3, 2022

Despite the above problem: unlz4 down to 100 bytes, one more register freed, speed improved.

Ericde45 · October 5, 2022

On 10/2/2022 at 8:57 PM, 42bs said:

I doubt this will have any effect. Maybe adding a small loop at the end (after the copy loop).

after each loadb, i added a nop + or Rx,Rx + nop

this sounds a lot better now

42bs · October 5, 2022

Oh, what a pity, but the size of the unpacker/timer is likely no issue.
