<< Chapter < Page Chapter >> Page >
Sorted size-16 leaf nodes in size-128 hard-coded leaf FFT, grouped for VL-2
Size Input array addresses
16 {0, 64, 32, 96, 16, 80, 112, 48, 8, 72, 40, 104, 120, 56, 24, 88}
16 {1, 65, 33, 97, 17, 81, 113, 49, 9, 73, 41, 105, 121, 57, 25, 89}
16 {2, 66, 34, 98, 18, 82, 114, 50, 10, 74, 42, 106, 122, 58, 26, 90}
8(x2) {3, 67, 35, 99, 19, 83, 115, 51, 123, 59, 27, 91, 11, 75, 107, 43}
8(x2) {4, 68, 36, 100, 20, 84, 116, 52, 124, 60, 28, 92, 12, 76, 108, 44}
8(x2) {5, 69, 37, 101, 21, 85, 117, 53, 125, 61, 29, 93, 13, 77, 109, 45}
16 {126, 62, 30, 94, 14, 78, 110, 46, 6, 70, 38, 102, 118, 54, 22, 86}
16 {127, 63, 31, 95, 15, 79, 111, 47, 7, 71, 39, 103, 119, 55, 23, 87}

[link] shows the sorted size-16 leaf sub-transforms for a size-128 transform with the rows divided into VL-2 groups. Because each group of two leaf sub-transforms loads data from adjacent memory locations, the group of sub-transforms can be loaded in parallel with vector memory operations, and all (or some) of the computation done in parallel. The first, third and fourth groups in [link] contain leaf nodes of the same size/type; these are the easiest vector leaf sub-transforms to compute, as described in "Homogeneous leaf sub-transform vectors" . The second group of rows contains leaf sub-transforms of differing size/type, and computing these sub-transforms is covered separately in "Heterogeneous leaf sub-transform vectors" .

Homogeneous leaf sub-transform vectors

void sfft_fcf128_shl16_8_ee(offset_t *is,const SFFT_D *in,SFFT_D *out){  SFFT_R r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,r10,r11,r12,r13,r14,r15;   L_4(in+is[0],in+is[1],in+is[2],in+is[3],&r0,&r1,&r2,&r3);   L_2(in+is[4],in+is[5],in+is[6],in+is[7],&r4,&r5,&r6,&r7);   K_0(&r0,&r2,&r4,&r6);   K_N(VLIT4(0.7071,0.7071,0.7071,0.7071),      VLIT4(0.7071,-0.7071,0.7071,-0.7071),       &r1,&r3,&r5,&r7);   L_4(in+is[8],in+is[9],in+is[10],in+is[11],&r8,&r9,&r10,&r11);   L_4(in+is[12],in+is[13],in+is[14],in+is[15],&r12,&r13,&r14,&r15);   K_0(&r0,&r4,&r8,&r12);   K_N(VLIT4(0.9239,0.9239,0.9239,0.9239),      VLIT4(0.3827,-0.3827,0.3827,-0.3827),       &r1,&r5,&r9,&r13);   TX2(&r0,&r1); TX2(&r4,&r5); TX2(&r8,&r9); TX2(&r12,&r13);   S_4(r0,r4,r8,r12,out0+0,out0+8,out0+16,out0+24);  S_4(r1,r5,r9,r13,out1+0,out1+8,out1+16,out1+24);   K_N(VLIT4(0.7071,0.7071,0.7071,0.7071),      VLIT4(0.7071,-0.7071,0.7071,-0.7071),       &r2,&r6,&r10,&r14);   K_N(VLIT4(0.3827,0.3827,0.3827,0.3827),      VLIT4(0.9239,-0.9239,0.9239,-0.9239),       &r3,&r7,&r11,&r15);   TX2(&r2,&r3); TX2(&r6,&r7); TX2(&r10,&r11); TX2(&r14,&r15);   S_4(r2,r6,r10,r14,out0+4,out0+12,out0+20,out0+28);  S_4(r3,r7,r11,r15,out1+4,out1+12,out1+20,out1+28); }
Homogeneous size-16 leaf sub-transform for VL-2 size-128 hard-coded leaf FFT

The vector leaf sub-transforms of a single size/type are handled in the same way as a VL-1 sub-transform, with one difference: the vector registers must be transposed before the data is stored to memory in the output array. In the example shown in [link] , the transposes take place at lines 16 and 25.

Prior to the store operations, each position of the vector register (each position being a whole complex word) contains an element belonging to each of the leaf sub-transforms composing the vectorized sub-transform. Because each leaf sub-transform is stored sequentially to different locations in memory with aligned vector store operations, sets of registers are transposed such that each vector register contains elements from only one leaf sub-transform.

Get Jobilize Job Search Mobile App in your pocket Now!

Get it on Google Play Download on the App Store Now




Source:  OpenStax, Computing the fast fourier transform on simd microprocessors. OpenStax CNX. Jul 15, 2012 Download for free at http://cnx.org/content/col11438/1.2
Google Play and the Google Play logo are trademarks of Google Inc.

Notification Switch

Would you like to follow the 'Computing the fast fourier transform on simd microprocessors' conversation and receive update notifications?

Ask