India Chooses Arm’s Neoverse for National Chip Design Push

India’s Center for Development of Advanced Computing (C-DAC) this week announced the nation’s first self-designed High Performance Computing (HPC) CPU. Named Aum, the chip is a chiplet design that can scale up to 96 cores and is based on Arm’s v8.4 “Zeus” Neoverse V1 design (the same one AWS uses in its Graviton3). It is expected to hit the market as early as 2024 on TSMC’s 5 nm process.

Aum was developed as part of India’s National Supercomputing Mission, a program that aims to reduce the country’s exposure to possible export restrictions. To that end, the goal is to deploy a nationally developed processor architecture. Perhaps alarmingly for both Intel and AMD, however, the intention is for Aum’s design to be usable in high performance computing scenarios as well as in personal computing chips. And wherever Aum is deployed, their addressable market shrinks.

The reasoning is simple: if India has the capability to design chips (whether that means mixing and matching pieces from Arm’s generous portfolio or steering specific implementations toward the final, manufacturable design), potential technology export restrictions would matter much less. At the same time, the National Supercomputing Mission also aims to improve protection against eventual backdoors; a neutral design provider such as Arm fits naturally with these concerns. And while controlling the design process itself doesn’t get all the way there (not when backdoors can be applied on the factory floor by willing and capable adversaries), it’s a strong start. The planned use of open source software to prop up a specialized software ecosystem also paints a more diversified software future, so hardware isn’t the only segment that’s likely to fragment, given enough time.

Aum’s package and individual A48Z chiplet design. (Image credit: C-DAC)

The A48Z chiplets at the heart of the 96-core Aum chip each feature 48 Arm Zeus cores (3 GHz base, 3.5 GHz turbo), supported by 96 MB of L2 cache and another 96 MB cache layer that buffers the cores from system memory. All in all, each Aum package supports up to 16 DDR5 memory channels (DDR5-5200, delivering 332.8 GB/s of bandwidth) and 64 GB of HBM3 memory (rated at 6.4 Gb/s per pin, geared down to 5.6 Gb/s at initial launch for a staggering 2.87 TB/s). Additional I/O throughput comes from 128 PCIe Gen 5 lanes, 64 of which can attach additional accelerators (such as GPUs or FPGAs).
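
The headline bandwidth figures can be reproduced with simple peak-bandwidth arithmetic: transfer rate × bytes per transfer × number of channels. The sketch below is a back-of-the-envelope check, not C-DAC’s methodology. It treats the 5200 and 5.6/6.4 figures as per-pin transfer rates, and it assumes (these details are not in C-DAC’s material as quoted here) that the 16 DDR5 “channels” are 32-bit DDR5 subchannels and that the 64 GB of HBM3 comes as four 16 GB stacks, each with the standard 1024-bit interface. Those are simply the values that make the published numbers line up.

```python
# Back-of-the-envelope peak memory bandwidth for Aum, in GB/s (1 GB = 1e9 bytes).
# ASSUMPTIONS (not stated in C-DAC's slides as quoted): the 16 DDR5 "channels"
# are 32-bit DDR5 subchannels, and the 64 GB of HBM3 is four 16 GB stacks,
# each with a 1024-bit interface.

def peak_bandwidth_gbs(rate_gtps: float, bus_bits: int, channels: int) -> float:
    """Peak bandwidth = transfers per second x bytes per transfer x channels."""
    return rate_gtps * (bus_bits / 8) * channels

ddr5 = peak_bandwidth_gbs(rate_gtps=5.2, bus_bits=32, channels=16)
hbm3_launch = peak_bandwidth_gbs(rate_gtps=5.6, bus_bits=1024, channels=4)
hbm3_rated = peak_bandwidth_gbs(rate_gtps=6.4, bus_bits=1024, channels=4)

print(f"DDR5:               {ddr5:.1f} GB/s")
print(f"HBM3 (5.6 Gb/s):    {hbm3_launch:.1f} GB/s")
print(f"HBM3 (6.4 Gb/s):    {hbm3_rated:.1f} GB/s")
print(f"DDR5 + HBM3 launch: {ddr5 + hbm3_launch:.1f} GB/s")
```

Running it prints 332.8 GB/s for DDR5, 2867.2 GB/s for HBM3 at launch speed (3276.8 GB/s at the rated 6.4 Gb/s), and 3200.0 GB/s combined. Note that 332.8 GB/s is equally consistent with eight conventional 64-bit channels, so the slide’s channel count is ambiguous on this point.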

Aum’s interconnect and memory subsystem. (Image credit: C-DAC)

The remaining 64 lanes are likely routed to the chip’s internal communications fabric, a coherent mesh network of NUMA-style, fully memory-coherent links based on the CCIX protocol. This link is used by two Aum sockets to communicate, and it takes a design page or two from AMD’s Infinity Fabric.

A specifications comparison between C-DAC’s Aum HPC processor and Fujitsu’s A64FX, from Fugaku. (Image credit: C-DAC)

According to the documentation, Aum’s design primarily aims to increase the amount of memory bandwidth available per flop of computing power (the byte/flop ratio), which has been found to be a highly limiting factor in performance scaling for HPC workloads. Too many cars (floating-point operations per second) on too few lanes (memory throughput) can only end one way. The result is that Aum and its Arm architecture target 4.6 teraflops per socket and roughly 3 TB/s of aggregate memory bandwidth. That gives it a byte/flop ratio of 0.7, much higher than the 0.38 achieved by the world’s fastest Arm supercomputer, Japan’s Fugaku, and decisively beating the US’s IBM- and Nvidia-based Summit (<0.2 bytes/flop). At an expected 300 W TDP, however, energy efficiency appears to have actually declined compared to Fugaku’s A64FX Arm cores.
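
The 4.6-teraflop and 0.7 byte/flop figures can be sanity-checked the same way. The sketch below assumes each Neoverse V1 core retires 16 double-precision FLOPs per cycle (two 256-bit SVE pipes, each performing a fused multiply-add on four doubles) at the 3 GHz base clock, and it takes aggregate bandwidth as 332.8 GB/s of DDR5 plus 2867.2 GB/s of launch-speed HBM3. The per-core FLOP rate and the bandwidth sum are assumptions chosen to match the quoted totals, not figures taken from C-DAC’s material.

```python
# Sanity check of Aum's peak FP64 throughput, byte/flop ratio and peak efficiency.
# ASSUMPTIONS: 16 FP64 FLOPs/cycle per Neoverse V1 core (2 SVE pipes x 4 doubles
# per 256-bit vector x 2 ops per FMA), 3.0 GHz base clock, and aggregate
# bandwidth = 332.8 GB/s DDR5 + 2867.2 GB/s HBM3 at launch speed.

CORES = 96
CLOCK_GHZ = 3.0
FLOPS_PER_CYCLE = 2 * 4 * 2   # pipes x FP64 lanes per vector x (multiply + add)
TDP_W = 300

peak_gflops = CORES * CLOCK_GHZ * FLOPS_PER_CYCLE   # GFLOP/s
bandwidth_gbs = 332.8 + 2867.2                       # GB/s

bytes_per_flop = bandwidth_gbs / peak_gflops
gflops_per_watt = peak_gflops / TDP_W                # peak compute over TDP, not measured

print(f"Peak FP64:      {peak_gflops / 1000:.2f} TFLOP/s")
print(f"Bytes per flop: {bytes_per_flop:.2f}")
print(f"Efficiency:     {gflops_per_watt:.1f} GFLOP/s per W (peak, at TDP)")
```

Under these assumptions the chip lands at about 4.61 TFLOP/s and roughly 0.69 bytes per flop, which rounds to the 0.7 quoted by C-DAC, along with about 15.4 GFLOP/s per watt. That last number is a peak-over-TDP ratio rather than a measured result, so it says little about sustained efficiency on real workloads.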

Bytes/flop efficiency metrics for several HPC systems. (Image credit: C-DAC)

If everything goes to plan, India’s Aum Arm CPU will be a strong entry into the supercomputing field. Crucially, it would be a homegrown one, even if not dramatically so, at least in its first iterations. Much work was clearly put into advancing the memory subsystem as a whole, and in general, memory is simpler and more accessible to source than the TSMC 5 nm chips Aum will be manufactured on. Customizing the CPU core itself could be C-DAC’s next step, paving the way for India and adding momentum to the “chip nationalization” process in other countries. China, too, has had an interest in Arm, by the way; but that’s a whole different story.