
Fast, faster, fastest: a history of supercomputers from the CDC 6600 to Fugaku


Computing has always been about speed. Herman Hollerith's first job was to work on the 1880 US census. It was done by hand: painful, slow, and prone to errors. So Hollerith created the Hollerith 1890 Census Tabulator, a punch-card system, to count the 1890 census results. His electromechanical devices did the job in a remarkably fast two years and saved the federal government $5 million. Using his earnings, he established his own company, the Tabulating Machine Company. You know it better as IBM.

Of course, his machine wasn't a true computer, but it set the pattern for all the computers to come. Still, it wasn't until 1964, when Seymour Cray designed the Control Data Corporation (CDC) 6600, that we started calling the fastest of the fast machines supercomputers.

CDC: Supercomputing begins

Cray believed there would always be a need for a machine "a hundred times more powerful than anything available today." It was his dream, and his drive to push the limits of technology, that led to supercomputers.

The journey was never easy. Cray, already known as a temperamental hardware genius, threatened to leave CDC. He finally agreed to stay after being allowed to form his own team to build the CDC 6600. With 400,000 transistors, over 100 miles of hand-wiring, and Freon cooling, it ran a 10MHz clock and reached a top speed of 3 million floating-point operations per second (megaFLOPS). Hollerith would have recognized its main I/O method: punch cards. It left the previous fastest computer, the IBM 7030 Stretch, eating its dust.

Today, 3 megaFLOPS is painfully slow. The first Raspberry Pi, with its 700MHz ARM1176JZF-S processor, runs at 42 megaFLOPS. But for its day, the CDC 6600 was the fastest of the fast, and it would remain so until Cray and CDC followed it up in 1969 with the CDC 7600.

The other computer manufacturers of the '60s were caught flat-footed. In a famous memo, Thomas Watson, Jr., IBM's CEO, said: "Last week, Control Data … announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers. Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer." To this day, IBM remains a serious supercomputer competitor.

The birth of Cray

In the meantime, Cray and CDC weren't getting along. His designs, while both technically powerful and commercially successful, were expensive. CDC saw him as a perfectionist. He saw CDC as clueless middle managers. You can see where this was going.

So, when the next-generation CDC 8600 ran over budget and fell behind schedule, CDC chose to back another high-performance computing machine: the STAR-100. This was one of the first supercomputers to use vector processing, a pattern that is still with us today. Be that as it may, Cray, to no one's surprise, left CDC to form his own company: Cray Research. There, freed of management constraints and fueled with ample Wall Street funds, he built the first of his eponymous supercomputers in 1976: the Cray-1.

The 80MHz Cray-1 used integrated circuits to reach performance rates as high as 136 megaFLOPS. Part of the Cray-1's remarkable speed came from its unusual "C" shape.
This look was not chosen for science-fiction appeal, but because the shape gave the most speed-dependent circuit boards shorter, and therefore faster, connections. This attention to every last detail of the design, from the CPU up, is a distinguishing mark of Cray's work. Every element of a Cray design was built to be as fast as possible.

Cray also adopted vector processing for the Cray-1. In Cray's design, the vector units operated on vectors, linear arrays of 64-bit floating-point numbers, to produce their results. Compared to scalar code, vector code could cut pipelining slowdowns by as much as 90%.

The Cray-1 was also the first computer to use transistor memory instead of high-latency magnetic core memory. With these new kinds of memory and processing, the Cray-1 and its descendants became the poster children of late-'70s and early-'80s supercomputing.

Seymour Cray wasn't done leading the way in supercomputing. The Cray-1, like all the machines that had come before it, used a single main processor. With 1982's Cray X-MP, Cray Research packed four processors into the Cray-1's signature C-shaped body. With 105MHz processors and more than a 200% improvement in memory bandwidth, a maxed-out X-MP could deliver 800 megaFLOPS.

The next step forward was 1985's Cray-2. This model came with eight processors, with a "foreground processor" managing storage, memory, and I/O for the "background processors," which did the actual work. It was also the first supercomputer cooled by liquid immersion. And, unlike its predecessors, you could work with it using a general-purpose operating system: UNICOS, a Cray-specific Unix System V with added BSD features, instead of a custom, architecture-specific operating system.

Today, supercomputers work on a wide range of huge computational problems. These jobs include quantum mechanics, weather forecasting, climate research, and biomolecular analysis for COVID-19. But in the '70s and '80s, Cold War research on nuclear explosion simulations and code-cracking was what governments and their agencies paid for. With the rise of glasnost and the disintegration of the Warsaw Pact and the Soviet Union, Cray's military-industrial customers were no longer keen on spending millions on supercomputers.

MPP arrives

While Cray still loved his vector architectures, their processors were very expensive. Other companies explored putting many processors into a single computer using massively parallel processing (MPP). For example, the Connection Machine's tens of thousands of simple single-bit processors, each with its own small memory, cost a fraction of Cray's designs.

MPP machines made supercomputing more affordable, but Cray resisted the idea. As he put it, "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?" Instead, he refocused on faster vector processors built with then-untried gallium arsenide semiconductors. That would prove a mistake.

Cray, the company, went bankrupt in 1995. Cray, the supercomputer architect, wasn't done. He founded a new company, SRC Computers, to work on a machine that combined the best features of his approach and MPP. Unfortunately, he died in a car accident before he could put his new take on supercomputing to the test.

His ideas lived on in Japan.
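Both Cray's machines and the Japanese systems that carried his ideas forward were built around vector processing, and the idea is easier to see in code than in prose. The C sketch below is purely illustrative, not Cray code: the names saxpy_scalar and saxpy_vector are my own, the scalar routine handles one element per loop iteration, and the "vector" routine expresses the same multiply-add over 64-element chunks, the vector length the Cray-1's registers held. Real vector hardware streams each chunk through dedicated registers and pipelines; the inner loop here only stands in for that.

```c
#include <stdio.h>

#define N      256   /* total elements to process              */
#define VLEN    64   /* the Cray-1's vector registers held 64 words */

/* Scalar version: one multiply-add per loop iteration. */
static void saxpy_scalar(double a, const double *x, const double *y,
                         double *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a * x[i] + y[i];
}

/* "Vector" version: the same work expressed as operations on
 * 64-element chunks, the way a vector unit would consume them. */
static void saxpy_vector(double a, const double *x, const double *y,
                         double *out, int n)
{
    for (int i = 0; i < n; i += VLEN) {
        int len = (n - i < VLEN) ? n - i : VLEN;   /* last partial chunk */
        for (int j = 0; j < len; j++)
            out[i + j] = a * x[i + j] + y[i + j];
    }
}

int main(void)
{
    double x[N], y[N], out[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2.0 * i; }

    saxpy_scalar(3.0, x, y, out, N);
    printf("scalar: out[%d] = %f\n", N - 1, out[N - 1]);

    saxpy_vector(3.0, x, y, out, N);
    printf("vector: out[%d] = %f\n", N - 1, out[N - 1]);
    return 0;
}
```

The same idea survives in today's CPUs as SIMD instructions, which is one reason compilers still reward loops written in this regular, stride-one style.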
In Japan, companies such as NEC, Fujitsu, and Hitachi built vector-based supercomputers. From 1993 to 1996, Fujitsu's Numerical Wind Tunnel was the world's fastest supercomputer, with speeds of up to 600 gigaFLOPS. A gigaFLOP is one billion FLOPS.

These machines relied on vector processing, dedicated chips working on one-dimensional arrays of data. They also used multiple buses to make the most of MPP-style I/O, an ancestor of the multiple instruction, multiple data (MIMD) approach that lets today's CPUs use multiple cores.

Intel, which had been watching from the supercomputer sidelines, thought MIMD could let it build more affordable supercomputers without specialized vector processors. In 1996, ASCI Red proved Intel right. ASCI Red used over 6,000 200MHz Pentium Pros to break the 1 teraFLOP (one trillion FLOPS) barrier. For years, it would be both the fastest and the most reliable supercomputer in the world.

Supercomputing for everybody: Beowulf

While Intel was spending millions developing ASCI Red, some underfunded contractors at NASA's Goddard Space Flight Center (GSFC) built their own "supercomputer" from commercial off-the-shelf (COTS) hardware. Using 16 486DX processors with 10Mbps Ethernet as the "bus," NASA contractors Don Becker and Thomas Sterling created Beowulf in 1994.

Little did they know that in creating the first Beowulf cluster, they were creating the ancestor of today's most popular supercomputer design: Linux-powered, Beowulf-cluster supercomputers. Today, all 500 of the fastest machines, the TOP500, run Linux. In the November 2020 TOP500 supercomputing ranking, no fewer than 492 systems use cluster designs based on Beowulf's principles.

While the first Beowulf could only hit single-digit gigaFLOPS speeds, it showed that supercomputing was within almost anyone's reach. You can even build a Beowulf "supercomputer" from Raspberry Pis.

Supercomputing today

Another advance in supercomputing came when designers started using multiple processor types within their designs. For example, in 2014, Tianhe-2, or Milky Way-2, used both Intel Xeon Ivy Bridge processors and Xeon Phi coprocessors to become the fastest supercomputer of its day. The Xeon Phi is a high-performance many-core accelerator, and chips like it excel at floating-point calculations.

This style of combining two kinds of COTS processors has become more common. In the November 2020 TOP500 list, many of the fastest of the fast use floating-point accelerators such as the Xeon Phi, NVIDIA Tesla V100 GPUs, and PEZY-SCx accelerators. Today, 149 systems in the TOP500 rely on accelerator or coprocessor chips.

Why? Because the two chip types complement each other. Today's GPUs have a massively parallel architecture made up of thousands of cores handling many tasks at once; the NVIDIA V100 Tensor Core GPU, used in several supercomputers, has 640 Tensor Cores. Conventional CPUs have few cores, but those cores are optimized for sequential, serial processing. Yoked together, they are the foundation for much faster supercomputers.

Over 90 percent of the TOP500 use Intel Xeon or Xeon Phi chips. And while AMD processors, particularly the AMD Ryzen 9 Zen family, have taken over the desktop and laptop speed records, only 21 TOP500 systems use AMD CPUs.
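The pairing of a few fast, general-purpose CPU cores with thousands of simple accelerator cores is also easier to see in code than in prose. GPUs are normally programmed with CUDA, OpenCL, or similar toolkits; as a rough, hypothetical stand-in, this C and OpenMP sketch shows the same division of labor on an ordinary multicore machine, assuming a compiler with OpenMP support (for example, gcc -fopenmp): one sequential thread does the setup, then every available core takes a slice of the floating-point work.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000   /* size of the floating-point workload */

static double data[N];

int main(void)
{
    double sum = 0.0;

    /* Sequential part: a single thread prepares the data,
     * the kind of serial work a conventional CPU core is good at. */
    for (int i = 0; i < N; i++)
        data[i] = (double)i / N;

    /* Data-parallel part: each available core takes a slice of the
     * array, a small-scale version of how a GPU spreads identical
     * arithmetic across thousands of simple cores. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i] * data[i];

    printf("cores available: %d\n", omp_get_max_threads());
    printf("sum of squares:  %f\n", sum);
    return 0;
}
```

Swap the pragma for a kernel launch on an accelerator and the overall shape of the program barely changes, which is why so many TOP500 nodes pair conventional CPUs with GPU-style chips.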
Even so, AMD's 21 systems are twice as many as six months earlier. AMD is followed by ten IBM Power-based systems and just five Arm-based systems.

However, the Arm numbers are misleading. The current world champion of supercomputers, Japan's Fugaku, is powered by Arm A64FX CPUs with 7,630,848 cores. Its world-record speed is 442 petaFLOPS, a petaFLOP being a quadrillion floating-point operations per second, on the High-Performance Linpack (HPL) benchmark. If you're keeping score at home, that's roughly three times the speed of its closest competitor. Intel's processor lead will soon be challenged by both AMD and Arm.

Supercomputing tomorrow

Looking ahead, the next supercomputing goal is the exaFLOP: one quintillion (10^18) floating-point operations per second, or 1,000 petaFLOPS. We had hoped to be there by now, but it's proving harder than expected.

Still, Intel hopes to get there first with Aurora in 2021. Meanwhile, AMD and Cray think they'll get there first with El Capitan. And we can't count out Arm, which Nvidia is in the process of acquiring; the Fugaku supercomputer's architects have their eye on cracking the exaFLOP barrier too.

After that, the next mountain to climb is the zettaFLOP: 1,000 exaFLOPS. Is zettaFLOP computing even possible? Sterling, now a professor at Indiana University and co-inventor of Beowulf, used to think we couldn't do it: "I think we will never reach ZettaFlops, at least not by doing discrete floating-point operations."

More recently, Sterling has changed his mind. He says that by combining logic circuits with memory to cut I/O delays, reaching zettaFLOPS speeds, and beyond, is possible. Indeed, he thinks that by using non-von Neumann architectures and superconducting logic we might reach yottaFLOP supercomputer speeds, a thousand zettaFLOPS, by 2030. Chinese researchers aren't that optimistic, but they predict we'll see zettascale systems in 2035.

The one thing we can say for certain is that the race for more computing speed will never end. We may not be able to get our minds around what that speed will mean for us, but we will. Remember, Bill Gates was once rumored to have said, "640K is all the memory anybody would ever need." Well, we certainly found something to do with all that memory, and we'll certainly find useful things to do with all that speed.