catsfolly, posted July 12, 2012:

Here is a fractional multiply routine that I wrote for Air Hockey. It takes the number in register r3, which can be signed, and multiplies it by the number in r1 (which has to be less than 256). The result is divided by 256:

    (r1 * r3) / 256 -> r0

This is useful for multiplying numbers by sines or cosines, or for scaling numbers by arbitrary values. Feel free to use this routine (or optimize it) if you like...

Catsfolly

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ;;; umult / 256
    ;;;
    ;;; unsigned multiply (r1 * r3) / 256  (r3 can be signed)
    ;;; r1 * r3, result in r0
    ;;; if r1 > 255, returns r3 value in r0
    ;;; r2, r3, r4 trashed
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    umultd256       proc
                    cmpi    #255,r1
                    bnc     @@r1ok          ; branch if 255 or less
                    movr    r3,r0           ; r1 is 256 or greater or negative
                    jr      r5              ; done

    @@r1ok          clrr    r0              ; clear result
                    clrr    r4              ; result in r0, r4
                    clrr    r2              ; shifted value in r3, r2
                    sarc    r3,1            ; divide multiplicand by 2
                    rrc     r2,1            ; extend to lsb word
                    swap    r1              ; move multiplier bits to top of word
                    beq     @@done
                    tstr    r1              ; test true sign bit
                    bmi     @@loop_entry

    @@loop:         sarc    r3,1            ; divide multiplicand by 2
                    rrc     r2,1            ; extend to lsb word
                    sll     r1,1            ; get next bit of multiplier
                    tstr    r1              ; test sign and zero
                    beq     @@done
                    bpl     @@loop

    @@loop_entry    ; add value in (r3, r2) to total in (r0, r4)
                    addr    r2,r4           ; add lsbits
                    adcr    r0              ; add carry if any
                    addr    r3,r0           ; add msbits
                    b       @@loop

    @@done          jr      r5
                    endp
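The shift/add structure of umultd256 can be sanity-checked away from the Intellivision. Here is a small Python model (my own sketch, not from the thread) that mirrors the loop: the multiplicand is arithmetic-shifted right once per multiplier bit, MSB first, and added into a 32-bit-style accumulator whenever the bit is set:

```python
def umultd256_model(mult8, signed16):
    """Model of the fractional multiply: (mult8 * signed16) >> 8.

    mult8    -- unsigned multiplier, 0..255 (r1)
    signed16 -- signed 16-bit multiplicand (r3)
    """
    assert 0 <= mult8 <= 255
    acc = 0
    m = signed16 << 16          # multiplicand in the high word, like r3:r2
    for bit in range(7, -1, -1):
        m >>= 1                 # sarc r3 / rrc r2 (Python >> is arithmetic)
        if mult8 & (1 << bit):
            acc += m            # addr r2,r4 / adcr r0 / addr r3,r0
    return acc >> 16            # high word of the accumulator, like r0
```

Since the shifted copy carries 16 fractional bits and at most 8 shifts ever happen, no bits fall off the bottom, and the model agrees exactly with the floor of (r1 * r3) / 256.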
intvnut, posted July 12, 2012:

Just a quick bit of feedback... You don't need the "tstr r1" in your @@loop, as the SLL will set the flags you need. :-)

Also, if you shift the multiplicand right rather than left, you can avoid the full 32-bit add as well, needing only the ADCR. This does mean you'll have to left-shift by 8 after the loop, but I think it may be a win overall. If you don't mind a shade more code size in the name of speed, you can have two versions of the inner loop:

    umultd256       proc
                    cmpi    #255,r1
                    bnc     @@r1ok          ; branch if 255 or less
                    movr    r3,r0           ; r1 is 256 or greater or negative
                    jr      r5              ; done

    @@r1ok          clrr    r0              ; clear result
                    clrr    r2              ; result in r0, r2
                    b       @@first

    @@loop_add:     sllc    r2              ;\__ shift 32-bit result left
                    rlc     r0              ;/
                    addr    r3,r2           ; add in multiplicand
                    adcr    r0              ; ...and carry
                    sarc    r1
                    bc      @@loop_add
                    beq     @@done

    @@loop_noadd    sllc    r2              ;\__ shift 32-bit result left
                    rlc     r0              ;/
    @@first:        sarc    r1
                    bc      @@loop_add
                    bneq    @@loop_noadd

                    ; when we get here, product r0:r2 needs to be
                    ; left-shifted by 8 more
    @@done          swap    r0
                    swap    r2
                    andi    #$FF00,r0
                    andi    #$00FF,r2
                    xorr    r2,r0
                    jr      r5
                    endp

I admit, I haven't tested this yet...
catsfolly, posted July 12, 2012:

(quoting intvnut:) "You don't need the 'tstr r1' in your @@loop, as the SLL will set the flags you need."

You're right! I guess when I tried it the first time (without the "tstr r1") I had some other error in the code, because I thought it didn't work. But I just tried it again with the tstr commented out and it worked fine.

Your approach is interesting - thanks for tackling the problem. I was trying to avoid the fixup at the end. It's a tradeoff - having 1 extra instruction in the "add loop" versus 5 instructions at the end. If the multiplier has exactly 5 bits set in it (out of a possible 8), then the "add loop" is executed five times, and the routines break even (more or less)... If fewer bits are set, my routine is faster. If more bits are set, your routine is faster... (IMHO)

Thanks for the feedback.

Catsfolly
intvnut, posted July 12, 2012:

(quoting catsfolly's break-even analysis above)

I admit, I threw that version together very quickly. :-)

At any rate, is the cross-over point really that far out? If you remove the TSTR in your loop, I count 34 cycles for a 0, 59 cycles for a 1. I count mine at 34 cycles also for a 0, but only 39 cycles for a 1. The merge at the end costs 34 cycles. Counterbalancing that is 1 less CLRR and 2 less shifts, for an outside-the-loop penalty of only 16 cycles. So, I should be faster in all cases, shouldn't I? Or am I counting wrong?

Staying closer to your original, you could rotate the loop like this:

    umultd256       proc
                    cmpi    #255,r1
                    bnc     @@r1ok          ; branch if 255 or less
                    movr    r3,r0           ; r1 is 256 or greater or negative
                    jr      r5              ; done

    @@r1ok          clrr    r0              ; clear result
                    clrr    r4              ; result in r0, r4
                    clrr    r2              ; shifted value in r3, r2
                    sarc    r3,1            ; divide multiplicand by 2
                    rrc     r2,1            ; extend to lsb word
                    swap    r1              ; move multiplier bits to top of word
                    beq     @@done
                    sllc    r1,1            ; test first bit of multiplier
                    bnc     @@loop_no_add

    @@loop_add:     ; add value in (r3, r2) to total in (r0, r4)
                    addr    r2,r4           ; add lsbits
                    adcr    r0              ; add carry if any
                    addr    r3,r0           ; add msbits

    @@loop_no_add:  sarc    r3,1            ; divide multiplicand by 2
                    rrc     r2,1            ; extend to lsb word
                    sllc    r1,1            ; get next bit of multiplier
                    bc      @@loop_add
                    bneq    @@loop_no_add

    @@done          jr      r5
                    endp

This would cost 45 cycles per 1 bit, and 34 cycles per 0, which if nothing else reduces the penalty of 1s by swapping the branch order.
catsfolly, posted July 13, 2012:

(quoting intvnut's cycle counts above)

Oops. I guess I hadn't counted all the cycles. I was just thinking that, theoretically, my approach would have just one extra add in the "add loop". (Which was true in your final "rearranged" version.) But, as you pointed out, my approach requires more setup instructions.

Still, it seems more "elegant" to me to avoid the fixup at the end. But faster is faster...

Also, with your approach we could leave off the fixup and have a general unsigned multiply routine. Maybe have different entry points to do the fixup or not.

I'll give your version a try. (The speed is not so critical for this routine, but why waste cycles?)

Thanks,

Catsfolly
catsfolly, posted July 13, 2012:

I think we can save a swap!

    ; when we get here, product r0:r2 needs to be left shifted by 8 more
    swap r0
    swap r2
    andi #$FF00,r0
    andi #$00FF,r2
    xorr r2,r0

could be:

    ; when we get here, product r0:r2 needs to be left shifted by 8 more
    andi #$00FF,r0
    andi #$FF00,r2
    xorr r2,r0
    swap r0
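Both merge sequences extract the same thing: bits 8..23 of the 32-bit product held in r0:r2. A quick Python check of the two instruction sequences (my own sketch, not from the thread) confirms they agree with each other and with a direct shift:

```python
def swap16(x):
    """SWAP: exchange the two bytes of a 16-bit word."""
    return ((x << 8) | (x >> 8)) & 0xFFFF

def merge_five_ops(r0, r2):
    """swap r0 / swap r2 / andi #$FF00,r0 / andi #$00FF,r2 / xorr r2,r0"""
    r0 = swap16(r0) & 0xFF00    # low byte of r0 into the high position
    r2 = swap16(r2) & 0x00FF    # high byte of r2 into the low position
    return r0 ^ r2

def merge_four_ops(r0, r2):
    """andi #$00FF,r0 / andi #$FF00,r2 / xorr r2,r0 / swap r0 (one SWAP saved)"""
    r0 = (r0 & 0x00FF) ^ (r2 & 0xFF00)   # combine first, with bytes crossed
    return swap16(r0)                    # one SWAP straightens both at once
```

The trick in the shorter version is that the XOR combines the two bytes while they are still in the "wrong" halves, so a single SWAP at the end straightens both at once.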
intvnut, posted July 13, 2012:

(quoting catsfolly's swap-saving version above)

Indeed!

BTW, I got thinking about it, and I realized there were some great routines put together by everyone on INTVPROG back when Arnauld and I were experimenting with "IntvWolf." You can probably search on that in the INTVPROG archives for more discussion details.

Here's a multiply routine that computes a 48-bit product between 32-bit and 16-bit fixed point values. I don't know who wrote it -- it may have been Carl. It costs only 49 cycles for a 1 and 37 for a 0 and has almost no setup overhead. It's rather impressive, and nearly as cost effective as the 8x16 => 24 code above.

    ; Compute 48-bit product (A * B), where A is 32 bits, and B is 16 bits.
    ; Return upper 32 bits in R0:R2.
    ;
    ; Note, because we right-shift 'A' through the multiply process, the
    ; final result is an approximation.
    ;
    ; R3:R0 == A
    ; R1    == B
    ;
    ; R0:R2 == (A * B) >> 16
    m321616 PROC
            CLRR    R2
            CLRR    R4
            B       @@test

    @@add:  CLRC
            RRC     R3
            RRC     R0
            ADDR    R0, R2
            ADCR    R4
            ADDR    R3, R4

    @@test: SLLC    R1, 1
            BC      @@add
            BEQ     @@done
            RRC     R3
            RRC     R0
            B       @@test

    @@done: MOVR    R4, R0
            JR      R5
            ENDP

And here's a plain 16x16 => 16 multiply I wrote that goes 2 bits at a time. I was pretty sure I had written such a routine, and it came to me while I was at work where I'd find it:

    ;; MPY16   Multiply signed 16-bit by signed 16-bit into 16-bit result.
    ;; MPYSU16 Multiply signed 16-bit by unsigned 16-bit into 16-bit result.
    ;; MPYU16  Multiply unsigned 16-bit by unsigned 16-bit into 16-bit result.
    ;;
    ;; R0 = 16-bit (signed for MPY16, MPYSU16; unsigned for MPYU16)
    ;; R1 = 16-bit (signed for MPY16; unsigned for MPYSU16, MPYU16)
    ;;
    ;; R2 = result
    ;;
    ;; Author: J. Zbiciak
    mpy16:  PROC
            TSTR    R1          ; is R1 negative?
            BPL     @@core      ;
            NEGR    R1          ;\__ toggle sign on R0 and force
            NEGR    R0          ;/   R1 to be +ve.
    mpysu16:
    mpyu16:
    @@core: CLRR    R2
            SARC    R2, 2
            RRC     R1, 2
            BOV     @@add_32
            BC      @@add_1
            B       @@test

    @@add_32:
            BNC     @@add_2
    @@add_3:
            ADDR    R0, R2
    @@add_2:
            ADDR    R0, R2
    @@add_1:
            ADDR    R0, R2
    @@test: SLL     R0, 2
    @@cont: SARC    R1, 2
            BOV     @@add_32
            BC      @@add_1
            BNEQ    @@test
            JR      R5
            ENDP

And for 16x16 => 32-bit and 32x16 => 32, here's one I wrote that drops back to 1 bit at a time:

    ;; MPY   Multiply signed 16-bit by signed 16-bit into signed 32-bit result.
    ;; MPYIL Multiply 32-bit by signed 16-bit into signed 32-bit result.
    ;;
    ;; R3:R0 = 32-bit (signed 16-bit in R0 for MPY)
    ;; R1    = signed 16-bit
    ;;
    ;; R4:R2 = result
    ;;
    ;; R3 trashed
    ;;
    ;; Author: J. Zbiciak
    ;;
    ;; (Fun fact: Routine named after instruction on C6200 DSP.)
    mpy     PROC
            CLRR    R3          ;\
            TSTR    R0          ; |__ sign extend R0 into R3.
            BPL     mpyil       ; |
            DECR    R3          ;/

    mpyil:  CLRR    R4
            CLRR    R2
            TSTR    R1
            BPL     @@first
            NEGR    R1          ; \
            COMR    R3          ; |__ toggle R3:R0's sign to force R1 +ve
            NEGR    R0          ; |
            ADCR    R3          ; /
            B       @@first

    @@add:  ADDR    R0, R2
            ADCR    R4
            ADDR    R3, R4

    @@test: SLLC    R0, 1
            RLC     R3, 1

    @@first:
            SARC    R1, 1
            BC      @@add
            BNEQ    @@test
            JR      R5
            ENDP

Since there's other potentially interesting code in there, I've attached the ZIP archive. I'm not 100% sure it's the most recent version, but it's still interesting nonetheless.

IntvWolf.zip
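The 2-bits-at-a-time idea behind mpy16 is easier to see with the flag tricks stripped away. This Python sketch (an illustration of the radix-4 technique for the unsigned case, not a register-level translation of the routine) consumes two multiplier bits per iteration and adds 0, 1, 2, or 3 copies of the current multiplicand:

```python
def mpy_u16_radix4(a, b):
    """Unsigned 16x16 => 16 multiply, two multiplier bits per iteration.

    Each step adds 0..3 copies of the multiplicand (the @@add_1/2/3
    chain), then shifts the multiplicand left by 2 (SLL R0, 2).
    The result wraps mod 2**16, like the 16-bit register it lands in.
    """
    result = 0
    while b:
        result += a * (b & 3)   # 0..3 adds of the multiplicand
        a = (a << 2) & 0xFFFF   # multiplicand left by two, 16-bit wrap
        b >>= 2                 # consume two multiplier bits
    return result & 0xFFFF
```

The assembly version dispatches on the two bits with the overflow and carry flags set by SARC R1, 2, falling through the add chain to do one, two, or three ADDRs; the arithmetic is the same as above.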
Rev, posted July 13, 2012:

This thread is gettin' me all hot and bothered...
catsfolly, posted July 13, 2012:

(quoting intvnut on the INTVPROG/IntvWolf routines above)

Thanks. I didn't know about these routines. In the intv library code there are some divide routines, and macros for multiplying by fixed constants, but I didn't see any general-purpose multiply routines, so I wrote my own.

I'll check out the archive...

Catsfolly
intvnut, posted July 13, 2012:

(quoting catsfolly's note that the library has no general-purpose multiply)

You're right, there isn't. I had actually started writing a library of multiplies (on paper), and in fact, that's exactly what I was doing when I first heard the Twin Towers got hit, sitting outside a Starbucks, waiting to meet a friend before work. I guess I didn't get back to that notebook...
Carl Mueller Jr, posted July 13, 2012:

Yeah, I did notice the date on most of your code – September 2001. I was starting to wonder what had happened during that month to inspire you to write so much code! There must be quite a story.
Rev, posted July 14, 2012:

(quoting intvnut's Twin Towers recollection above)

The Twin Towers is the "where were you" of our generation. I personally was in downtown Atlanta on a construction site in the middle of tall buildings; after about an hour or two the site was empty!
intvnut, posted July 14, 2012:

(quoting Carl's comment on the September 2001 dates)

Actually, that was also around the time I started working on what became Space Patrol, shortly after the successful release of 4-Tris. We had also just announced samples of the first processor I contributed instructions to, the C6416 DSP, like the day before. (Tom Engibous rang the closing bell on Wall Street that day.) I was in the mood to write a library to go with SDK-1600, and develop games and such.

Of course, life being what it was, my priorities shifted for more reasons than 9/11. That whole "life away from the computer" thing. But, yeah, I spent the rest of that month kinda wrapping up what I had typed in, and not much on finishing and copying in what was in that notebook.
catsfolly, posted July 14, 2012:

I tried Joe's version of the multiply routine, but it didn't work (he said it was untested, but the code looked so nice and sensible that I thought it had a good chance of running...).

Turns out you can't just shift the result when you find a zero in the multiplier. You have to shift the multiplicand (at least that is what I did to get it to work...).

Example walkthrough (old way), if the multiplier is %00000100:

    result = 0
    lsb of multiplier is zero, so result = result * 2 = 0
    next bit of multiplier is zero, so result = result * 2 = original result * 4 = 0
    next bit of multiplier is 1, so result = result * 2 + multiplicand = 1 * multiplicand

Example walkthrough (new way), if the multiplier is %00000100:

    result = 0
    lsb of multiplier is zero, so multiplicand = multiplicand * 2
    next bit of multiplier is zero, so multiplicand = multiplicand * 2 = original multiplicand * 4
    next bit of multiplier is 1, so result = result + multiplicand = original multiplicand * 4

Here is the version that worked:

    umultd256       proc
                    cmpi    #255,r1
                    bnc     @@r1ok          ; branch if 255 or less
                    movr    r3,r0           ; r1 is 256 or greater or negative
                    jr      r5              ; done

    @@r1ok          clrr    r0              ; clear result
                    clrr    r4              ; result in r0, r4
                    clrr    r2              ;\
                    tstr    r3              ; |__ sign extend r3 into r2.
                    bpl     @@first         ; |
                    decr    r2              ;/
                    b       @@first

    @@loop_add:     addr    r3,r0           ; add in multiplicand
                    adcr    r4              ; ...and carry
                    addr    r2,r4           ;

    @@loop_noadd    sllc    r3              ;\__ shift 32-bit multiplicand left
                    rlc     r2              ;/

    @@first:        sarc    r1
                    bc      @@loop_add
                    bneq    @@loop_noadd

                    ; when we get here, product r4:r0 needs to be
                    ; left-shifted by 8 more
    @@done          movr    r4,r2
                    andi    #$FF00,r0
                    andi    #$00FF,r2
                    xorr    r2,r0
                    swap    r0
                    jr      r5
                    endp
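The corrected data flow -- multiplier consumed LSB-first, 32-bit multiplicand shifted left, byte-merge fixup at the end -- can be modeled in Python. This is my own re-creation mirroring the walkthrough above, not code from the thread:

```python
def umultd256_fixed_model(mult8, signed16):
    """Model of the working routine: result is the 16-bit two's-complement
    value of (mult8 * signed16) / 256, i.e. bits 8..23 of the 32-bit product.

    mult8    -- unsigned multiplier, 0..255 (r1)
    signed16 -- signed 16-bit multiplicand (r3, sign-extended into r2)
    """
    acc = 0
    m = signed16        # sign-extended 32-bit multiplicand (r2:r3)
    b = mult8
    while b:
        if b & 1:
            acc += m    # @@loop_add: addr r3,r0 / adcr r4 / addr r2,r4
        m <<= 1         # @@loop_noadd: sllc r3 / rlc r2
        b >>= 1         # sarc r1 peels off the next multiplier bit
    prod32 = acc & 0xFFFFFFFF            # 32-bit product in r4:r0
    return (prod32 >> 8) & 0xFFFF        # the movr/andi/xorr/swap fixup
```

For a negative multiplicand the returned word is the two's-complement encoding of the (floored) quotient, which is exactly what the register pair holds after the fixup.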
intvnut, posted July 14, 2012:

(quoting catsfolly's corrected version above)

I see my problem, and yeah, you're right. If you only want to iterate once per multiplier bit (in the 8-bit multiplier), then with the 8-bit multiplier shifting right, the 16-bit multiplicand has to shift left. Alternately, you could shift the product being accumulated right, but then you'd have a variable fixup at the end.

See, I told you I wrote that too quickly. :-)
intvnut, posted July 14, 2012:

I've gone mad with speed! Specifically, I decided to try writing a "radix-16" version of the fractional multiply routine. Now, this one only does unsigned 16 x unsigned 8 => unsigned 16, but it's trivial to wrap it to handle signed x unsigned.

I tested it over the full range of inputs ($0000 x $00 through $FFFF x $FF), and it passed every test. (The test program ran for 11,313,832,018 cycles... good thing I used --macho=0, since that's 3.5 hours in real life.)

Anyway, the resulting code averaged about 385.25 cycles/call across all those inputs. It seems like it can still go faster. Anyone want to give it a go? About 264 cycles of that 385 is in the code outside the shift/add, so I think there's a few opportunities in there...

mpy_u8_u16_q8.zip
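One way to read "radix-16" here: consume the 8-bit multiplier a nibble at a time, using a table of the 16 small multiples of the multiplicand, so two lookups replace up to eight shift/add iterations. This Python sketch illustrates the idea only; it is not a translation of the attached assembly, whose internals aren't shown in the thread:

```python
def mpy_u8x16_q8(mult8, u16):
    """Radix-16 sketch of the fractional multiply: (mult8 * u16) >> 8, unsigned.

    mult8 -- unsigned multiplier, 0..255
    u16   -- unsigned 16-bit multiplicand
    """
    table = [u16 * k for k in range(16)]   # the 16 multiples 0*u16 .. 15*u16
    hi, lo = mult8 >> 4, mult8 & 0xF
    prod = (table[hi] << 4) + table[lo]    # (hi*16 + lo) * u16 == mult8 * u16
    return prod >> 8
```

The flat timing the benchmark shows matches this structure: the per-call cost is dominated by building the table (the setup), while the two lookups cost the same no matter how many multiplier bits are set.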
intvnut, posted July 15, 2012:

Ok, so this is a tad depressing. I compared the updated version of catsfolly's code above (with a minor fix, changing CMPI #255 to CMPI #256) to mine and benchmarked it over the full range of multiply inputs also. It comes in at 399.7 cycles average. My fancy radix-16 multiply hardly seems worth it for only 15 cycles gain. Hmmm....
catsfolly, posted July 15, 2012:

Well, you certainly are testing the code this time! I'm surprised that the radix-16 version didn't do better. I wonder what the fastest and slowest times were?

I would guess that the "simple" version (I can't call it my version, because you basically wrote it (based on my version) and then I fixed it) would be fastest when just one bit was set (in the 8-bit multiplier), and take longer when many bits were set. On the other hand, the radix version would tend to take about the same amount of time for all cases, because a lot of the time is in the setup.

The "simple" version works for me. Thanks.

Catsfolly
intvnut, posted July 15, 2012:

(quoting catsfolly's surprise about the radix-16 timing)

As was I! Actually, when we looked at multiplies before, almost a decade ago, I think we came to a similar conclusion -- as you increase radix, you don't really get faster on average, you just flatten the cycle count between the best/worst cases. I'm usually good at remembering things, but some neurons fire better than others. In any case, I haven't forgotten Michael Abrash's admonition to measure, measure, measure. :-) (I read the Zen of Assembly Language, Zen of Code Optimization and Zen of Graphics Programming cover to cover back when they came out. Great books.)

(catsfolly:) "I wonder what the fastest and slowest times were?"

Good question. I don't actually have a good way of measuring that right now. My way of measuring the average was pretty crude: jzIntv, with history logging enabled, will dump out a profile of cycles spent in each instruction. I just summed them all up and divided by 2^24. Maybe I should consider adding some sort of proper profiler to jzIntv someday.

If you think about it, there are twice as many multiplicands of bit length N as there are of length N-1. So, in the averages, longer multiplicands will get weighted more heavily. If your inputs are evenly distributed, this is OK. But, if shorter multiplicands are more common, then the "simple" version really has a chance to pull ahead, since the radix-16 execution time is somewhat flat, but the "simple" multiply's time is proportional to the magnitude of the multiplicand. I'm going to guess that with a physics simulation, the distribution looks closer to uniform, or maybe even biased toward longer lengths if the 8-bit value comes from a sin/cos table.

Another interesting observation: every multiplicand (except 0) has at least one '1' in it: the leading 1. If you assume a uniform distribution of multiplicands, though, the bits trailing the leading 1 will be split 50/50 between 0s and 1s. So, the looping cost of a multiplicand of length N averages out to:

    (N-1) * (cost_of_one + cost_of_zero) / 2 + cost_of_one

I first noticed the "leading one bias" when analyzing divide routines with Carl. (Which reminds me: Carl, I still need to look at IDIV9.) It suggests that you might get a marginal gain if you can make 1s cheaper than 0s in the multiply loop. For example, initialize your product to 2^N, and subtract for every 0. This is only likely to pay off if you have a floating point representation, where the leading 1 is in a fixed position at the left of the mantissa. As I recall, when I tried to take advantage of this in the divide routines I wrote, Carl was always able to beat me with the more straightforward approach. So much for number theory. Benchmarks can be quite humbling. :-)

(catsfolly:) "I would guess the 'simple' version would be fastest when just one bit was set (in the 8-bit multiplier), and take longer when many bits were set."

Yep.

(catsfolly:) "On the other hand, the radix version would tend to take about the same amount of time for all cases, because a lot of the time is in the setup. The 'simple' version works for me."

Yeah, there really isn't a compelling reason to use the radix-16 version at the moment, unless folks can see a way to get a lot of cycles out. Unrolling it would eliminate a fair bit of overhead, perhaps giving you back 40-50 cycles, but at a large code-size cost. If you're hurting for cycles, that's a tradeoff worth making. Most games won't need it, though.

In Space Patrol, had I needed such a multiply, I would have unrolled, etc., for the speed. I needed every cycle I could get. I joked at the time that I must be borrowing cycles from all the other microprocessors in the room, because I was astonished I managed to make it all fit. (As it stands, it doesn't actually all fit all the time. There are some creative hacks in there that hide the very occasional times when it doesn't fit, preventing glitches that would call attention to this fact. The cycle loading is very "spiky", and there's enough cycles if you average certain things over a few frames.)
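The leading-one argument is easy to check by brute force. Using the cycle counts quoted earlier for the rotated loop (45 cycles per 1 bit, 34 per 0), this Python check (my own, not from the thread) compares the closed-form average against an exhaustive count over every multiplier of a given bit length:

```python
def avg_closed_form(n, cost_one=45, cost_zero=34):
    """(N-1) * (cost_of_one + cost_of_zero) / 2 + cost_of_one, from the post."""
    return (n - 1) * (cost_one + cost_zero) / 2 + cost_one

def avg_exhaustive(n, cost_one=45, cost_zero=34):
    """Average loop cost over all n-bit multipliers (leading bit = 1):
    each of the n loop iterations costs cost_one or cost_zero per bit."""
    ms = range(1 << (n - 1), 1 << n)
    total = 0
    for m in ms:
        ones = bin(m).count("1")                  # guaranteed >= 1: leading bit
        total += ones * cost_one + (n - ones) * cost_zero
    return total / len(ms)
```

For full 8-bit multipliers both give 321.5 cycles of loop time: one guaranteed 1 plus 3.5 expected 1s and 3.5 expected 0s among the seven trailing bits.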