<< Chapter < Page Chapter >> Page >

Heterogeneous leaf sub-transform vectors

void sfft_fcf128_shl16_8_eo(offset_t *is,const SFFT_D *in,SFFT_D *out){  SFFT_R r0_1,r2_3,r4_5,r6_7,r8_9,r10_11,r12_13,r14_15, r16_17,r18_19,r20_21,r22_23,r24_25,r26_27,r28_29,r30_31;  L_4_4(in+is[0],in+is[1],in+is[2],in+is[3],       &r0_1,&r2_3,&r16_17,&r18_19);   L_2_2(in+is[4],in+is[5],in+is[6],in+is[7],      &r4_5,&r6_7,&r20_21,&r22_23);   K_N(VLIT4(0.7071,0.7071,1,1),VLIT4(0.7071,-0.7071,0,-0),      &r0_1,&r2_3,&r4_5,&r6_7);   L_4_2(in+is[8],in+is[9],in+is[10],in+is[11],        &r8_9,&r10_11,&r28_29,&r30_31);   L_4_4(in+is[12],in+is[13],in+is[14],in+is[15],        &r12_13,&r14_15,&r24_25,&r26_27);   K_N(VLIT4(0.9239,0.9239,1,1),VLIT4(0.3827,-0.3827,0,-0),      &r0_1,&r4_5,&r8_9,&r12_13);   S_4(r0_1,r4_5,r8_9,r12_13,out0+0,out0+8,out0+16,out0+24);  K_N(VLIT4(0.3827,0.3827,0.7071,0.7071),       VLIT4(0.9239,-0.9239,0.7071,-0.7071),      &r2_3,&r6_7,&r10_11,&r14_15);   S_4(r2_3,r6_7,r10_11,r14_15,out0+4,out0+12,out0+20,out0+28);  K_N(VLIT4(0.7071,0.7071,1,1),VLIT4(0.7071,-0.7071,0,-0),       &r16_17,&r18_19,&r20_21,&r22_23);   S_4(r16_17,r18_19,r20_21,r22_23,out1+0,out1+4,out1+8,out1+12);  K_N(VLIT4(0.7071,0.7071,1,1),VLIT4(0.7071,-0.7071,0,-0),       &r24_25,&r26_27,&r28_29,&r30_31);   S_4(r24_25,r26_27,r28_29,r30_31,out1+16,out1+20,out1+24,out1+28);}
Heterogeneous size-16 leaf sub-transform for VL-2 size-128 hard-coded leaf FFT

In the case of a vector comprising heterogeneous leaf sub-transforms, the data is transposed into separate sub-transforms following the primitive leaf operations. The remainder of the computation is carried out separately for each leaf sub-transform in the vector, and no further transposes are required.

When elaborating and generating code for VL-2 transforms, there are only two heterogeneous leaf sub-transforms that might be required, but for other vector lengths the combinations are more complex. During the elaboration process, each unique combination that is encountered in the sorted list of leaf sub-transforms is elaborated into a function with repeated calls to the elaborate function, as was done in "Vector length 1" in order to elaborate a sub-transform composed of two size N l e a f / 2 sub-transforms.

[link] is an example of a heterogeneous size-16 VL-2 leaf sub-transform, where one size-16 leaf sub-transform is loaded into the lower halves of the vector registers, and the data from another leaf sub-transform composed of two size-8 sub-transforms is loaded into the upper halves. The primitive leaf operations at lines 5, 7, 11 and 13 transpose each sub-transform's data into separate vector registers, and the remainder of the computation is performed on each sub-transform separately. The size-16 sub-transform is stored to sequential locations in memory at lines 17 and 21, while the sub-transform composed of two size-8 leaf sub-transforms is stored to memory at lines 24 and 27.

Streaming stores

Some machines support streaming store or non-temporal store instructions; these instructions are used to store data to locations that do not have temporal locality, and thus the cache can be bypassed. The hard-coded leaf FFT described in the previous sections splits the computation into a pass of leaf sub-transforms and several passes of body sub-transforms. For large transforms where the size of the data exceeds the outermost level of cache, the non-temporal store instructions can be used in the leaf sub-transforms to bypass the cache when storing data to the output array; this can greatly improve performance by keeping other data in cache. The Intel SSE and AVX vector extensions both support streaming stores.

Get Jobilize Job Search Mobile App in your pocket Now!

Get it on Google Play Download on the App Store Now




Source:  OpenStax, Computing the fast fourier transform on simd microprocessors. OpenStax CNX. Jul 15, 2012 Download for free at http://cnx.org/content/col11438/1.2
Google Play and the Google Play logo are trademarks of Google Inc.

Notification Switch

Would you like to follow the 'Computing the fast fourier transform on simd microprocessors' conversation and receive update notifications?

Ask