Here's an example x86-64 function that uses AVX to implement an element-wise multiply-accumulate over single-precision floating-point vectors of arbitrary length:
; void vectorFMA(float *R, float *A, float *B, int N)
;
; R[i] += A[i] * B[i] for i = 0 .. N-1
;
; uses AVX to multiply two vectors, elementwise, and add
; results to a third vector
;
; rdi - R
; rsi - A
; rdx - B
; rcx - N
;
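vectorFMA:
; N arrives as a 32-bit int in ecx; sign-extend it to 64 bits
movsxd rcx, ecx
; fewer than 8 elements: skip the AVX loop
cmp rcx, 8
jl .L2
.L1:
; load 8 floats each from R, A, and B
vmovups ymm1, [rdi]
vmovups ymm2, [rsi]
vmovups ymm3, [rdx]
; R[0..7] += A[0..7] * B[0..7]
vmulps ymm4, ymm2, ymm3
vaddps ymm5, ymm1, ymm4
vmovups [rdi], ymm5
; R+=8, A+=8, B+=8 (8 floats = 32 bytes)
add rdi, 32
add rsi, 32
add rdx, 32
; rcx -= 8; loop while at least 8 elements remain
sub rcx, 8
cmp rcx, 8
jge .L1
.L2:
; handle the remaining 0..7 elements one at a time
test rcx, rcx
jle .L5
.L3:
; *R += *A * *B
vmovss xmm1, [rdi]
vmovss xmm2, [rsi]
vmulss xmm2, xmm2, [rdx]
vaddss xmm1, xmm1, xmm2
vmovss [rdi], xmm1
; R+=1, A+=1, B+=1 (1 float = 4 bytes)
add rdi, 4
add rsi, 4
add rdx, 4
; if (--rcx == 0) break;
dec rcx
jnz .L3
.L5:
; clear upper ymm halves to avoid AVX/SSE transition penalties in the caller
vzeroupper
ret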
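For checking the assembly's results, a plain C reference with the same loop shape can be handy. What follows is only a sketch under my own assumptions: the name `vector_fma_ref` is made up for illustration, and the split into blocks of eight floats (one 256-bit ymm register's worth) plus a scalar tail for leftover elements is one reasonable way to cover lengths that are not a multiple of 8, not necessarily the only one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference version of R[i] += A[i] * B[i], i = 0 .. N-1.
 * The name and the block/tail split are illustrative assumptions. */
void vector_fma_ref(float *R, const float *A, const float *B, int N)
{
    int i = 0;

    /* Pointers may only be NULL when there is nothing to process. */
    assert(N <= 0 || (R != NULL && A != NULL && B != NULL));

    /* Blocks of 8 floats: the amount one ymm register holds. */
    for (; N - i >= 8; i += 8) {
        for (int j = 0; j < 8; ++j)
            R[i + j] += A[i + j] * B[i + j];
    }

    /* Scalar tail: the remaining 0..7 elements. */
    for (; i < N; ++i)
        R[i] += A[i] * B[i];
}
```

Running both versions over the same inputs and comparing the output arrays gives a quick sanity check. Keep in mind that a compiler is free to fuse the multiply and add on targets that have FMA instructions, which can change the last bit of a result, so exact comparisons are safest with inputs whose products and sums are exactly representable.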
When our Sandy Bridge machine arrives next week, I should have a handful of interesting microbenchmarks to run on it.