
This is very tight code and bears little resemblance to the original FORTRAN code.

SPARC architecture

These next examples were generated from FORTRAN source on a SPARC architecture system. The SPARC architecture is a classic RISC processor that uses load-store access to memory, many registers, and delayed branching. We first examine the code generated at the lowest optimization level.
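Judging from the symbol GPB.addem.i and the references to A(I), B(I), and C(I) in the listing below, the loop being compiled has roughly the following shape. This is only a sketch: the routine name ADDEM and the declarations are assumptions, and only the loop body is implied by the generated code.

      SUBROUTINE ADDEM (A, B, C, N)
C     Sketch of the source loop; declarations and bounds are assumed.
      REAL A(*), B(*), C(*)
      INTEGER N, I
      DO 10 I = 1, N
         A(I) = B(I) + C(I)
   10 CONTINUE
      END

Here is the assembly generated for that loop at the lowest optimization level: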


.L18:                                   ! Top of the loop
        ld      [%fp-4],%l2             ! Address of B
        sethi   %hi(GPB.addem.i),%l0    ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l0             ! Load I
        sll     %l0,2,%l1               ! Multiply by 4
        add     %l2,%l1,%l0             ! Figure effective address of B(I)
        ld      [%l0+0],%f3             ! Load B(I)
        ld      [%fp-8],%l2             ! Address of C
        sethi   %hi(GPB.addem.i),%l0    ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l0             ! Load I
        sll     %l0,2,%l1               ! Multiply by 4
        add     %l2,%l1,%l0             ! Figure effective address of C(I)
        ld      [%l0+0],%f2             ! Load C(I)
        fadds   %f3,%f2,%f2             ! Do the Floating Point Add
        ld      [%fp-12],%l2            ! Address of A
        sethi   %hi(GPB.addem.i),%l0    ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l0             ! Load I
        sll     %l0,2,%l1               ! Multiply by 4
        add     %l2,%l1,%l0             ! Figure effective address of A(I)
        st      %f2,[%l0+0]             ! Store A(I)
        sethi   %hi(GPB.addem.i),%l0    ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l0             ! Load I
        add     %l0,1,%l1               ! Increment I
        sethi   %hi(GPB.addem.i),%l0    ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        st      %l1,[%l0+0]             ! Store I
        sethi   %hi(GPB.addem.i),%l0    ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l1             ! Load I
        ld      [%fp-20],%l0            ! Load N
        cmp     %l1,%l0                 ! Compare
        ble     .L18
        nop                             ! Branch Delay Slot

This is some pretty poor code. We don’t need to go through it line by line, but there are a few quick observations we can make. The value for I is loaded from memory five times in the loop. The address of I is computed six times throughout the loop (each time takes two instructions). There are no tricky memory addressing modes, so multiplying I by 4 to get a byte offset is done explicitly three times (at least they use a shift). To add insult to injury, they even put a NO-OP in the branch delay slot.
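To see how much of this work is redundant, consider a hypothetical hand-tightened version of the same loop body. This is only a sketch, not the compiler's output at any optimization level; it assumes I is kept in %l1, N in %l2, and the base addresses of A, B, and C (already adjusted for FORTRAN's 1-based indexing) in %i0, %i1, and %i2:

.L18:
        sll     %l1,2,%l0        ! Byte offset = I * 4
        ld      [%i1+%l0],%f2    ! Load B(I)
        ld      [%i2+%l0],%f3    ! Load C(I)
        fadds   %f2,%f3,%f2      ! Floating point add
        st      %f2,[%i0+%l0]    ! Store A(I)
        cmp     %l1,%l2          ! Compare (old) I against N
        bl      .L18             ! Another iteration if I < N ...
        add     %l1,1,%l1        ! ... incrementing I in the delay slot

Keeping I and N in registers eliminates every sethi/or address calculation and all of the redundant loads and stores of I, and the increment usefully fills the branch delay slot.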

One might ask, “Why would a compiler ever generate code this bad?” It’s not because the compiler isn’t capable of generating efficient code, as we shall see below. One explanation is that at this optimization level it simply does a one-to-one translation of the tuples (intermediate code) into machine language. You can almost draw lines through the example above and identify exactly which instructions came from which tuples.

Source: OpenStax, High performance computing. OpenStax CNX, Aug 25, 2010. Download for free at http://cnx.org/content/col11136/1.5