typedef struct _reg_t {
  __m128 re, im;
} reg_t;

static inline reg_t MUL_SPLIT(reg_t a, reg_t b)
{
  reg_t r;
  /* real: a.re*b.re - a.im*b.im, imag: a.re*b.im + a.im*b.re */
  r.re = _mm_sub_ps(_mm_mul_ps(a.re, b.re), _mm_mul_ps(a.im, b.im));
  r.im = _mm_add_ps(_mm_mul_ps(a.re, b.im), _mm_mul_ps(a.im, b.re));
  return r;
}
SSE multiplication with split complex data
The function in [link] takes complex data in two structs of SSE registers, performs a complex multiplication on each element of the vectors, and returns the result in a struct of SSE registers. Each struct is composed of one register containing the real parts of four complex numbers and another register containing the imaginary parts, so the function in [link] effectively operates on vectors twice as long as the function in [link]. The benefit of operating in split format is obvious: the shuffle operations that were required in [link] are avoided, because the real and imaginary parts can be implicitly swapped at the instruction level rather than by awkwardly manipulating SIMD registers at the data level of abstraction. Thus [link] computes complex multiplication for vectors twice as long while using one less SSE instruction, not to mention other advantages such as reducing chains of dependent instructions. The only disadvantage of the split format is that twice as many registers are needed to compute a given operation; this may preclude the use of a larger radix or force register paging for some kernels of computation.
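In practice, input data often arrives in interleaved format and must first be deinterleaved into the split representation. A minimal sketch of such a conversion is shown below; the helper name TO_SPLIT is an assumption for illustration and not part of the original listings, and two shuffles per pair of registers are all that is required:

```c
#include <xmmintrin.h>

typedef struct _reg_t {
  __m128 re, im;
} reg_t;

/* Hypothetical helper: deinterleave two registers of interleaved
   complex data, lo = [r0,i0,r1,i1] and hi = [r2,i2,r3,i3], into
   split format. */
static inline reg_t TO_SPLIT(__m128 lo, __m128 hi)
{
  reg_t r;
  r.re = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0)); /* r0 r1 r2 r3 */
  r.im = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1)); /* i0 i1 i2 i3 */
  return r;
}
```

If the data can be stored in split format from the outset, this conversion cost disappears entirely, which is part of why split-format kernels are attractive.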
[link] is a fast method of interleaved complex multiplication that may be used in situations where one of the operands can be unpacked prior to multiplication; in such cases the instruction count is reduced from 7 instructions to 4 instructions (cf. [link]). This method of complex multiplication lends itself especially well to the conjugate-pair algorithm, where the same twiddle factor is used twice: by doubling the size of the twiddle factor LUT, the multiplication instruction count is reduced from 14 instructions to 8 instructions. Furthermore, long chains of dependent instructions are shortened, and in practice the actual performance gain can be quite impressive.
Operand a in [link] has been replaced with two operands in [link]: re and im. These operands have been unpacked, as was done in lines 3 and 5 of [link]. Furthermore, line 8 of [link] is avoided by performing the selective negation during unpacking.
static inline __m128
MUL_UNPACKED_INTERLEAVED(__m128 re, __m128 im, __m128 b)
{
  re = _mm_mul_ps(re, b);
  im = _mm_mul_ps(im, b);
  im = _mm_shuffle_ps(im, im, _MM_SHUFFLE(2, 3, 0, 1));
  return _mm_add_ps(re, im);
}
SSE multiplication with partially unpacked interleaved data
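The unpacking step described above can be sketched as follows. UNPACK_TWIDDLE is a hypothetical helper name, not part of the original listings: it duplicates the real parts of each twiddle factor and selectively negates the imaginary parts, so that the multiply itself needs no negation. MUL_UNPACKED_INTERLEAVED is reproduced from the listing above to keep the sketch self-contained:

```c
#include <xmmintrin.h>

/* As in the listing above. */
static inline __m128
MUL_UNPACKED_INTERLEAVED(__m128 re, __m128 im, __m128 b)
{
  re = _mm_mul_ps(re, b);
  im = _mm_mul_ps(im, b);
  im = _mm_shuffle_ps(im, im, _MM_SHUFFLE(2, 3, 0, 1));
  return _mm_add_ps(re, im);
}

/* Hypothetical helper: unpack one register of interleaved twiddle
   factors t = [wr0, wi0, wr1, wi1] into the (re, im) operands,
   performing the selective negation during unpacking. */
static inline void
UNPACK_TWIDDLE(__m128 t, __m128 *re, __m128 *im)
{
  *re = _mm_shuffle_ps(t, t, _MM_SHUFFLE(2, 2, 0, 0));      /* wr0 wr0 wr1 wr1 */
  __m128 i = _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 3, 1, 1)); /* wi0 wi0 wi1 wi1 */
  /* XOR with -0.0f flips the sign bit of the odd lanes. */
  *im = _mm_xor_ps(i, _mm_set_ps(-0.0f, 0.0f, -0.0f, 0.0f)); /* wi0 -wi0 wi1 -wi1 */
}
```

In a conjugate-pair kernel this unpacking would be done once when the twiddle factor LUT is built, which is where the doubled LUT size comes from: each twiddle factor is stored as two registers instead of one.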
The performance of the FFTs in the previous sections can be increased by explicitly vectorizing the loops. The MacBook Air 4,2 used to compile and run the previous examples has a CPU that implements SSE and AVX, but for the purposes of simplicity, SSE intrinsics are used in the following examples. The loop of the radix-2 implementation is used as an example in [link].
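Since the referenced listing is not reproduced in this excerpt, the following is only a hypothetical sketch of what one vectorized radix-2 pass over interleaved single-precision data might look like. The function name radix2_pass and the assumption that twiddle factors have been pre-unpacked into separate wre/wim arrays (as described for MUL_UNPACKED_INTERLEAVED above) are illustrative choices, not the author's actual code:

```c
#include <xmmintrin.h>

/* Hypothetical vectorized radix-2 pass over interleaved complex floats:
   each __m128 holds two complex numbers. Twiddles are assumed to be
   pre-unpacked: wre = [wr0,wr0,wr1,wr1,...], wim = [wi0,-wi0,wi1,-wi1,...].
   half is the number of butterflies in the pass. */
static void radix2_pass(float *out, const float *in,
                        const float *wre, const float *wim, int half)
{
  for (int k = 0; k < half; k += 2) {  /* 2 complexes = 4 floats per step */
    __m128 a  = _mm_loadu_ps(in + 2 * k);            /* x[k],  x[k+1]  */
    __m128 b  = _mm_loadu_ps(in + 2 * (k + half));   /* x[k+half], ... */
    __m128 re = _mm_loadu_ps(wre + 2 * k);
    __m128 im = _mm_loadu_ps(wim + 2 * k);
    /* t = w * b: interleaved complex multiply with an unpacked twiddle */
    __m128 t_re = _mm_mul_ps(re, b);
    __m128 t_im = _mm_mul_ps(im, b);
    t_im = _mm_shuffle_ps(t_im, t_im, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 t = _mm_add_ps(t_re, t_im);
    _mm_storeu_ps(out + 2 * k,          _mm_add_ps(a, t)); /* a + w*b */
    _mm_storeu_ps(out + 2 * (k + half), _mm_sub_ps(a, t)); /* a - w*b */
  }
}
```

Each iteration computes four butterflies' worth of arithmetic (two complex multiplications and four complex additions) in a handful of SSE instructions, which is the essence of the explicit vectorization discussed above.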