<< Chapter < Page Chapter >> Page >
  for(k=0;k<N/2;k++) {      data_t Ek = out[k];      data_t Ok = out[(k+N/2)];      data_t w = LUT[k<<log2stride];     out[k]        = Ek + w * Ok;     out[(k+N/2) ] = Ek - w * Ok;   }
Inner loop of radix-2 Cooley-Tukey FFT

Each iteration of the loop in [link] accesses two elements of complex data in the array out , and one complex element from the twiddle factor LUT. Over multiple iterations of the loop, out is accessed contiguously in two places, but the LUT is accessed with a non-unit stride in all sub-transforms except the outer transform. Some vectormachines can perform what are known as vector scatter or gather memory operations – where a vector gather could be used in this case to gatherelements from the LUT that are separated by a stride. But SSE only supports contiguous or streaming access to memory. Thus, to efficiently compute multiple iterations of the loop in parallel, the twiddle factorLUT is replaced with an array of LUTs – each corresponding to a sub-transform of a particular size. In this way, all memory accesses for theparallelized loop are contiguous and no memory bandwidth is wasted.

  for(k=0;k<N/2;k+=4) {     __m128 Ok_re = _mm_load_ps((float *)&out[k+N/2]);    __m128 Ok_im = _mm_load_ps((float *)&out[k+N/2+2]);    __m128 w_re = _mm_load_ps((float *)&LUT[log2stride][k]);     __m128 w_im = _mm_load_ps((float *)&LUT[log2stride][k+2]);     __m128 Ek_re = _mm_load_ps((float *)&out[k]);    __m128 Ek_im = _mm_load_ps((float *)&out[k+2]);    __m128 wOk_re =       _mm_sub_ps(_mm_mul_ps(Ok_re,w_re),_mm_mul_ps(Ok_im,w_im));    __m128 wOk_im =       _mm_add_ps(_mm_mul_ps(Ok_re,w_im),_mm_mul_ps(Ok_im,w_re));    _mm_store_ps((float *)(out+k), _mm_add_ps(Ek_re, wOk_re));     _mm_store_ps((float *)(out+k+2), _mm_add_ps(Ek_im, wOk_im));    _mm_store_ps((float *)(out+k+N/2), _mm_sub_ps(Ek_re, wOk_re));     _mm_store_ps((float *)(out+k+N/2+2), _mm_sub_ps(Ek_im, wOk_im));  }
Vectorized inner loop of Cooley-Tukey radix-2 FFT

[link] computes the loop of [link] using split format data and a vector length of four (i.e., it computes four iterations at once). Note that the vector load andstore operations used in [link] require that the memory accesses are 16-byte aligned – this is a fairly standard proviso forvector memory operations, and use of the correct memory alignment attributes and/or memory allocation routines ensures that memory is always correctlyaligned.

Some FFT libraries require the input to be in split format (i.e., the real parts of each element are stored in one contiguous array, while the imaginaryparts are stored contiguously in another array) for the purposes of simplifying the computation, but this conflicts with many other libraries and use cases ofthe FFT – for example, Apple's vDSP library operates in split format, but many examples require the use of un-zip/zip functions on the input/output data(see Usage Case 2: Fast Fourier Transforms in [link] ). A compromise is to convert interleaved format data to split format on the firstpass of the FFT, computing most of the FFT with split format sub-transforms, and converting the data back to interleaved format as it isprocessed on the last pass.

Get Jobilize Job Search Mobile App in your pocket Now!

Get it on Google Play Download on the App Store Now




Source:  OpenStax, Computing the fast fourier transform on simd microprocessors. OpenStax CNX. Jul 15, 2012 Download for free at http://cnx.org/content/col11438/1.2
Google Play and the Google Play logo are trademarks of Google Inc.

Notification Switch

Would you like to follow the 'Computing the fast fourier transform on simd microprocessors' conversation and receive update notifications?

Ask