<< Chapter < Page | Chapter >> Page > |
[link] is a C++ implementation of the
vectorize_ks
function. For each
CNodeBfly
node, the function searches forward for another
CNodeBfly
that does not have a register dependence. Once found, the registers of the latter node are added to the former node, and the latter node erased. Finally, at line 19, the registers of the vectorized
CNodeBfly
node are merged using a perfect shuffle, which is then recursively applied on each half of the list. The effect is a merge that works for any power of 2 vector length.
void CSplitRadix::vectorize_ks() {
vector<CNodeHardCoded *>::iterator i;
for(i=ns.begin(); i != ns.end();++i) { if(!(*i)->type().compare(``blockbfly'')) {
vector<CNodeHardCoded *>::iterator j = i+1, pj = i;
int count = 1; while(j != ns.end() && count < VL) {
if(!(*j)->type().compare(``blockbfly'') && !register_dependence(*i, *j)) {
(*i)->rs.insert( (*i)->rs.end(), (*j)->rs.begin(), (*j)->rs.end());
ns.erase(j); count++;
j = pj+1; }else {
pj = j; ++j; }
} (*i)->merge_rs();
} }
}
Body node vectorization
CNodeLoad *
CSplitRadix::find_parallel_load(vector<CNodeHardCoded*>::iterator i){
CNodeLoad *b = (CNodeLoad *)(*i); for(int k=0;k<((N>2)?4:2);k++) {
vector<CNodeHardCoded *>::iterator j = i+1;
while(j != ns.end()) { if(!(*j)->type().compare(``blockload'')) {
CNodeLoad *b2 = (CNodeLoad *)(*j); if(b2->iaddrs[k] > b->iaddrs[0] && b2->iaddrs[k] < b->iaddrs[0]+VL) { ns.erase(j);
return b2; }
++j; }
} }
return NULL; }
void CSplitRadix::merge_loads(CNodeLoad *b1, CNodeLoad *b2) { for(int i=0;i<b1->size;i++) {
for(int j=0;j<b2->iaddrs.size();j++) {
if(b2->iaddrs[j] > b1->iaddrs[i] && b2->iaddrs[j] < b1->iaddrs[i]+VL) { b1->iaddrs.push_back(b2->iaddrs[j]); b1->rs.push_back(b2->rs[j]); if(rmap[b2->rs[j]] == b2) rmap[b2->rs[j]] = b1; }
} }
b1->types.push_back(b2->types[0]); }
void CSplitRadix::vectorize_loads() { vector<CNodeHardCoded *>::iterator i;
for(i=ns.begin(); i != ns.end();++i) { if(!(*i)->type().compare(``blockload'')) {
while(CNodeLoad *b2 = find_parallel_load(i)) merge_loads((CNodeLoad *)(*i), b2);
} }
}
Leaf node vectorization
If
vectorize_loads
and
vectorize_ks
are invoked with
on the topological ordering of nodes in
[link] , the result is the vectorized node list shown in
[link] . As in
"Emitting code" , emitting code is a fairly simple process, and
[link] is the code emitted
from the node list in
[link] . There are only a few differences to note about the emitted code when
.
L_4_4
primitive is named to indicate a size-4 leaf operation on the lower and upper halves of the vector registers, while the
L_2_4
primitive performs two size-2 leaf operations on the lower half of the registers and a size-4 leaf operation on the upper halves;K_N
) and store primitives (
S_4
) are unchanged because they perform the same operation on each element of the vector registers. This is as a result of the register transposes that were previously performed on the outputs of the leaf primitives.Notification Switch
Would you like to follow the 'Computing the fast fourier transform on simd microprocessors' conversation and receive update notifications?