<< Chapter < Page Chapter >> Page >

Because access to memory is in many cases the slowest part of the computer, especially compared to arithmetic, one wishes to load asmuch data as possible in to the faster levels of the hierarchy, and then perform as much computation as possible before going back to theslower memory devices. This is called temporal locality : if a given datum is used more than once, we arrange the computation so that these usages occur as close together as possiblein time.

Understanding ffts with an ideal cache

To understand temporal-locality strategies at a basic level, in this section we will employ an idealized model of a cache in a two-levelmemory hierarchy, as defined in [link] . This ideal cache stores Z data items from main memory (e.g. complex numbers for our purposes): when the processor loads adatum from memory, the access is quick if the datum is already in the cache (a cache hit ) and slow otherwise (a cache miss , which requires the datum to be fetched into the cache). When a datum is loaded into the cache, More generally, one can assume that a cache line of L consecutive data items are loaded into the cache at once, in order to exploit spatial locality.The ideal-cache model in this case requires that the cache be tall : Z = Ω ( L 2 ) [link] . it must replace some other datum, and the ideal-cache model assumes that the optimalreplacement strategy is used [link] : the new datum replaces the datum that will not be needed for the longest time in the future;in practice, this can be simulated to within a factor of two by replacing the least-recently used datum [link] , but ideal replacement is much simpler to analyze. Armed with this ideal-cachemodel, we can now understand some basic features of FFT implementations that remain essentially true even on real cachearchitectures. In particular, we want to know the cache complexity , the number Q ( n ; Z ) of cache misses for an FFT of size n with an ideal cache of size Z , and what algorithm choices reduce this complexity.

First, consider a textbook radix-2 algorithm, which divides n by 2 at each stage and operates breadth-first as in [link] (left), performing all butterflies of a given size at a time. If n > Z , then each pass over the array incurs Θ ( n ) cache misses to reload the data, and there are log 2 n passes, for Θ ( n log 2 n ) cache misses in total—no temporal locality at all is exploited!

One traditional solution to this problem is blocking : the computation is divided into maximal blocks that fit into the cache,and the computations for each block are completed before moving on to the next block. Here, a block of Z numbers can fit into the cache Of course, O ( n ) additional storage may be required for twiddle factors, the output data (if the FFT is not in-place),and so on, but these only affect the n that fits into cache by a constant factor and hence do not impact cache-complexity analysis. Wewon't worry about such constant factors in this section. (not including storage for twiddle factors and so on), and thus the naturalunit of computation is a sub-FFT of size Z . Since each of these blocks involves Θ ( Z log Z ) arithmetic operations, and there are Θ ( n log n ) operations overall, there must be Θ ( n Z log Z n ) such blocks. More explicitly, one could use a radix- Z Cooley-Tukey algorithm, breaking n down by factors of Z [or Θ ( Z ) ] until a size Z is reached: each stage requires n / Z blocks, and there are log Z n stages, again giving Θ ( n Z log Z n ) blocks overall. Since each block requires Z cache misses to load it into cache, the cache complexity Q b of such a blocked algorithm is

Get Jobilize Job Search Mobile App in your pocket Now!

Get it on Google Play Download on the App Store Now




Source:  OpenStax, Fast fourier transforms. OpenStax CNX. Nov 18, 2012 Download for free at http://cnx.org/content/col10550/1.22
Google Play and the Google Play logo are trademarks of Google Inc.

Notification Switch

Would you like to follow the 'Fast fourier transforms' conversation and receive update notifications?

Ask