A modern desktop PC is more powerful than a 15-year-old supercomputer. For instance, the Cray 2 supercomputer in 1985 had a peak processing power of about 1.9 GFLOPS and about 825 MIPS. In comparison, a modern computer with a Core 2 Duo processor can push about 20 GFLOPS, and “a more modern Intel i7 950 will push just north of 55 GFLOPS”.
The Cray 2 jfulgenc described was the world’s fastest computer in 1985. Today, the iPad 2 runs about as fast and powerfully as the Cray 2. That should demonstrate how far we’ve come technologically. A new device that we can hold in our hands is as powerful as a 1985 supercomputer that took up an entire room.
Moore’s law may be relevant to your question: the number of transistors that can be placed inexpensively on an integrated circuit has doubled roughly every two years for the last half century, and looks set to keep doing so for at least a bit longer. Advances in supercomputers (chip technology, etc.) tend to get integrated into the next generation’s consumer computers.
When comparing the performance of your average desktop to the Crays of the early 1990s or even the 80s, you must take into consideration not the “peak MFLOPS” ratings but the sustained MFLOPS ratings. Peak MFLOPS mean little if the machine never reaches such performance levels. The Army actually analyzed the cost/benefit of building a cluster of P4 2.8 GHz machines versus going with a Cray solution. They discovered that although the P4 2.8 GHz had a high peak rating of 5.6 GFLOPS, in practice it only reached 3.4% of its peak performance due to bandwidth limitations! Please see the results of the Army’s study right here: https://cug.org/5-publications/proceedings_attendee_lists/2003CD/S03_Proceedings/Pages/Authors/Muzio_slides.pdf Needless to say, their conclusion was that the Cray solution was more cost effective and easier to program and maintain.
NASA’s high performance computer lab also did a comparison between their old Cray X-MP/12 (one processor, 2 megawords of memory) and a dual Pentium II 366 running Windows NT. They had to redesign the space shuttle’s solid rocket boosters back in the late 80s after the Challenger disaster, and the Cray X-MP was used to model air flow and stresses on the new design. Some years later the code was ported to a Windows NT workstation and the simulation rerun for comparison. The result: a single-processor Cray X-MP computed the simulation in 6.1 hours, versus 17.9 hours on the dual Pentium II.
The Cray X-MP could have up to four processors with an aggregate bandwidth of over 10 GB/s to main memory. This kind of SUSTAINED bandwidth between CPU (not GPU) and main memory was not matched on the desktop until about 4 years ago. The Pentium IIs had either a 66 MHz or 100 MHz bus speed, so we are talking about a maximum bandwidth of only 800 MB/s (528 MB/s with the 66 MHz bus), with sustained transfer rates of around 330 MB/s (remember, PCs use DRAM and the Crays mostly used very expensive SRAM).
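For anyone who wants to check the arithmetic behind those figures, here is a rough sketch. It only reuses numbers quoted in this post; the 8-byte (64-bit) bus width is my assumption, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumption (not stated in the post): the Pentium II memory bus is 64 bits wide.

def sustained_gflops(peak_gflops, fraction_of_peak):
    """Sustained rate = peak rate times the fraction of peak actually reached."""
    return peak_gflops * fraction_of_peak

def bus_bandwidth_mb_s(bus_mhz, bus_width_bytes=8):
    """Theoretical bus bandwidth in MB/s: one transfer of bus_width_bytes per cycle."""
    return bus_mhz * bus_width_bytes

p4_sustained = sustained_gflops(5.6, 0.034)   # P4 2.8 GHz at 3.4% of peak
xmp_speedup = 17.9 / 6.1                      # dual Pentium II hours / Cray X-MP hours

print(f"P4 2.8 GHz sustained: about {p4_sustained * 1000:.0f} MFLOPS")
print(f"100 MHz bus: {bus_bandwidth_mb_s(100)} MB/s, 66 MHz bus: {bus_bandwidth_mb_s(66)} MB/s")
print(f"Cray X-MP finished the simulation about {xmp_speedup:.1f}x faster")
```

So a 5.6 GFLOPS peak rating shrinks to roughly 190 MFLOPS sustained, and the 800 and 528 MB/s bus figures fall straight out of clock speed times bus width.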
The importance of bandwidth to real-world number crunching performance can be seen in the STREAM benchmark. Please go to http://www.streambench.org/ to see exactly what I mean. In the early 1990s the Cray C90 was the baddest supercomputer on the planet, and at $30 million fully configured it was also by far the costliest. Here’s a photo of it: http://www.cisl.ucar.edu/zine/96/fall/images/c90.gif. The Cray C90 could have up to 16 processors, with 16 GB of memory, and could achieve a maximum performance of around 16 GFLOPS. “Well gee, my cheapo Phenom X6 can do well over 16 GFLOPS because that’s what it says on my SiSoft Sandra score, so I have a Cray C90 sitting under my desk blah blah…” You are completely wrong if you think this. The SiSoft Sandra benchmark tests everything in cache, which is easy for the CPU to access. Real-world problems, the kind that Crays are built to solve, can’t fit into a little 4 MB cache, and thus we come to sustained bandwidth problems. The C90 can fetch 5 words per clock cycle (for each processor) from main memory and has a real-world bandwidth of 105 GB/s; compare this to a relatively modern quad-core Core i7 2600 (4 cores, 8 threads) that gets a measly 12 GB/s of sustained bandwidth. “But the Core i7 2600 is clocked much higher than the C90, which only operates at 244 MHz per processor.” Ahhh, but if the data is not available for the processor to operate on, then it just sits there, wasting cycles, waiting for the memory controller to deliver data to it. Without getting into too much detail (if you want a lot of detail, read my analysis of the Cray 1A versus Pentium II below), the real-world performance of the C90, working on data sets too large for a typical PC’s small cache, works out to roughly 8.6 GFLOPS, while the Intel Core i7 2600 will achieve only about 1 GFLOPS sustained on problems out of cache.
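Here is a rough sketch of where sustained numbers like these come from. I am assuming the workload is a STREAM-style triad that moves 24 bytes and does 2 floating point operations per element (the same accounting used in the Cray 1A analysis further down); the function name is just for illustration.

```python
# Bandwidth-bound estimate for an out-of-cache triad.
# Assumption: 24 bytes of memory traffic and 2 floating point operations per element.

BYTES_PER_ELEMENT = 24
FLOPS_PER_ELEMENT = 2

def bandwidth_bound_gflops(bandwidth_gb_s):
    """Upper bound on GFLOPS when every operand has to come from main memory."""
    return bandwidth_gb_s / BYTES_PER_ELEMENT * FLOPS_PER_ELEMENT

c90_gflops = bandwidth_bound_gflops(105)   # Cray C90, 105 GB/s sustained
i7_gflops = bandwidth_bound_gflops(12)     # Core i7 2600, 12 GB/s sustained

print(f"C90: {c90_gflops:.2f} GFLOPS")
print(f"i7 2600: {i7_gflops:.2f} GFLOPS")
print(f"ratio: {c90_gflops / i7_gflops:.2f}x")
```

The simple bound gives 8.75 GFLOPS for the C90, in the same ballpark as the roughly 8.6 GFLOPS quoted above, and 1 GFLOPS for the i7 2600. Clock speed never enters the formula, which is the whole point.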
So far there are no desktops, and won’t be for quite a few years, that come EVEN close to the real world sustained bandwidth (and thus sustained performance) of a C90. Now for problems that do fit into the tiny cache and can be mostly pre-fetched, of course the desktop will be superior to the old crays. Here is a rough comparison I made between a cray 1a and a pentium II 400, read on only if you want to be bored to death:
The Cray 1A had a clock cycle time of 12.5 ns, or an operational frequency of 80 MHz. It had three vector functional units and three floating point units that were shared between vector and scalar operands, in addition to four scalar units. For floating point operations it could perform 2 adds and a multiply operation per clock cycle. It had a maximum memory configuration of 1 million words (1 megaword), or 8 megabytes, at 50 ns access time, interleaved into 16 banks. This interleaving had the effect of allowing a maximum bandwidth of 320 million words per second into the instruction buffers, or 2560 MB/s. Bandwidth to the 8 vector registers of the Cray 1A could occur at a maximum rate of 640 MB/s. The Cray 1A possessed up to eight disk controllers, each with one to four disks, and each disk having a capacity of 2.424 x 10^9 bits, for a maximum total hard disk capacity of 9.7 gigabytes. There were also 12 input/output channels for peripheral devices and the master control unit. It cost over $7 million in 1976 dollars and weighed in at 10,500 lbs, with a power requirement of 115 kilowatts. So how does this beast compare with my old clunker of a PC with 384 MB of SD100 RAM and a P2 400 MHz CPU?
Well, let’s take a simple triad operation, with V representing a vector register and S representing a scalar register:
S*V0[i] + V1[i] = V2[i]
Without getting into too much detail, this equation requires 24 bytes of data to perform once: two 8-byte operands loaded (V0[i] and V1[i]) and one 8-byte result stored (V2[i]). There are two floating point operations going on here: the multiplication of the scalar value with the vector, then the addition of the second vector. Thus, assuming a problem too large to just loop in the Cray 1A registers, and a bandwidth of 640 MB/s, the maximum performance of a Cray 1A would equal (640/24) * 2 = 53 MFLOPS on large problems containing data which could not be reused. This figure correlates well with the reported performance of the Cray 1A on real-world problems.
True bandwidth on a Cray 1A would also have to take into account bank conflicts plus access latency, so about 533 MB/s sustained is a more realistic figure. On smaller problems with reusable data the Cray 1A could achieve up to 240 MFLOPS by utilizing two addition functional units and one multiplication functional unit simultaneously through a process called chaining. So you see, the Cray 1A could be severely bandwidth limited when dealing with larger heterogeneous data sets.
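If it helps, here is a small sketch of the triad and the bandwidth arithmetic above. The function and constant names are mine, and the plain Python list version is only there to show what the operation does, not how a Cray would run it.

```python
# The triad from the post: V2[i] = S * V0[i] + V1[i]

def triad(s, v0, v1):
    """One multiply and one add per element."""
    return [s * a + b for a, b in zip(v0, v1)]

WORD_BYTES = 8                        # one 64-bit word
BYTES_PER_ELEMENT = 3 * WORD_BYTES    # load V0[i], load V1[i], store V2[i]
FLOPS_PER_ELEMENT = 2                 # one multiply, one add

def bandwidth_bound_mflops(bandwidth_mb_s):
    """Best case MFLOPS when no data can be reused from registers."""
    return bandwidth_mb_s / BYTES_PER_ELEMENT * FLOPS_PER_ELEMENT

cray_peak_bw = bandwidth_bound_mflops(640)        # 640 MB/s to the vector registers
cray_sustained_bw = bandwidth_bound_mflops(533)   # allowing for bank conflicts and latency
cray_chained_peak = 3 * 80                        # 2 adds + 1 multiply per cycle at 80 MHz

print(triad(2.0, [1.0, 2.0], [10.0, 20.0]))       # [12.0, 24.0]
print(f"{cray_peak_bw:.0f} MFLOPS at 640 MB/s, {cray_sustained_bw:.0f} MFLOPS at 533 MB/s")
print(f"{cray_chained_peak} MFLOPS peak with chaining on reusable data")
```

The gap between the chained peak and the 53 or 44 MFLOPS bandwidth bound is exactly the "severely bandwidth limited" effect described above.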
My Pentium II 400 has 512 KB of L2 cache, 384 megabytes of SD100 RAM, and a 160 GB 7200 rpm hard drive. Theoretically it can achieve a maximum of 400 MFLOPS when operating on data contained in its L1 cache, although benchmarks like BLAS place its maximum performance at 240 MFLOPS for double precision operations, which is what we are interested in here. Interestingly, this is about the same as what a Cray 1A can do on small vectorizable code. However, once we get out to problem sizes of 128 KB or 256 KB or even 512 KB, my Pentium II would beat the Cray 1A even in its greatest strength, double precision floating point operations, due to the bandwidth advantage of the L2 cache over the Cray’s memory. At 1600 MB/s of bandwidth my computer can do up to (1600/24) * 2 = 133 MFLOPS for problems under 512 KB in size but larger than the L1 cache.
Once we get beyond 512 kilobytes the situation shifts, as data would then need to be transferred from the SD100 RAM. The theoretical bandwidth of SD100 RAM is 800 MB/s, still greater than the Cray 1A, but here we run into some issues. The Cray 1A had memory comprised of much more expensive SRAM, while my memory is el crapo DRAM, which requires refresh cycles. With these taken into account, my DRAM actually has a theoretical maximum bandwidth of about 533 MB/s and a real-world maximum sustained bandwidth of a little over 300 MB/s. This means that for problems out of cache, my Pentium II gets slowed to a measly 315/12 = 26 MFLOPS. In this special situation where the problem is vectorizable, the Cray 1A is still faster than my Pentium II. Not bad for a computer that is 30 years old.
Once we get to problems greater than 8 megabytes, the advantage shifts completely back to my Pentium II, as the Cray 1A must then stream data from its hard disks (which were slower than Ultra ATA/100) while my computer can go right on fetching data from RAM. The Cray 1A could not realize its full potential as it was hampered by bandwidth and memory size issues, yet in certain situations it could outperform a desktop computer from 1998. Solid state disks, more memory ports, and larger memories were utilized in the subsequent Cray X-MP to address these problems.
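To put the whole comparison in one place, here is a rough sketch that picks the bandwidth tier from the size of the data set. It only models streaming triad data that cannot be reused, so the L1 cache case and the Cray's 240 MFLOPS chaining case are left out. Disk-bound cases return None because no disk transfer rates are given above. The function names and the test sizes are my own choices.

```python
# Which machine wins a non-reusable triad, by working-set size?
# Bandwidth figures are the sustained numbers quoted in the post.

KB = 1024
MB = 1024 * KB

def triad_mflops(bandwidth_mb_s):
    """24 bytes moved and 2 floating point operations per element."""
    return bandwidth_mb_s / 24 * 2

def pentium2_mflops(working_set_bytes):
    if working_set_bytes <= 512 * KB:    # fits in the L2 cache (1600 MB/s)
        return triad_mflops(1600)
    if working_set_bytes <= 384 * MB:    # fits in RAM (about 315 MB/s sustained)
        return triad_mflops(315)
    return None                          # would spill to disk, not modeled

def cray1a_mflops(working_set_bytes):
    if working_set_bytes <= 8 * MB:      # fits in main memory (about 533 MB/s sustained)
        return triad_mflops(533)
    return None                          # must stream from disk, not modeled

def winner(working_set_bytes):
    p2 = pentium2_mflops(working_set_bytes)
    cray = cray1a_mflops(working_set_bytes)
    if cray is None:
        return "Pentium II" if p2 is not None else "neither (both disk bound)"
    if p2 is None:
        return "Cray 1A"
    return "Pentium II" if p2 > cray else "Cray 1A"

for size in (256 * KB, 4 * MB, 64 * MB):
    print(f"{size // KB} KB working set: {winner(size)}")
```

With these numbers the Pentium II wins at 256 KB, the Cray 1A wins at 4 MB, and the Pentium II wins again at 64 MB, which matches the three regimes described above.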
A desktop like the Core 2 Duo E6700 can do over 12 GFLOPS, BUT only on problems that are small and fit into its cache. Once the data gets out of cache, today’s modern computers get their butts kicked by the old-school Crays from the 80s. Just visit http://www.streambench.org/ to see what I mean.