Groq launches the first AI accelerator card capable of 1 PetaOPS

Contents hide

1 Why it matters: Groq is the hundredth startup to take a shot at making an AI accelerator card, the second to market, and the first to have a product reach the 1 quadrillion operations per second threshold. That’s quadruple the performance of Nvidia’s most powerful card.

2 AI accelerator card capable of 1 PetaOPS

2.1 Related

[ad_1]

Why it matters: Groq is the hundredth startup to take a shot at making an AI accelerator card, the second to market, and the first to have a product reach the 1 quadrillion operations per second threshold. That’s quadruple the performance of Nvidia’s most powerful card.

The Groq Tensor Streaming Processor (TSP) demands 300W per core, so luckily, it’s only got one. Even luckier, Groq has turned that from a disadvantage into the TSP’s greatest strength.

AI accelerator card capable of 1 PetaOPS

You should probably throw everything you know about GPUs or AI processing out the window, because the TSP is just plain weird. It’s a giant piece of silicon with almost nothing but Vector and Matrix processing units and cache, so no controllers or backend whatsoever. The compiler has direct control.

The TSP is divided into 20 superlanes. Superlanes are built from, in order of left to right: a Matrix Unit (320 MACs), Switch Unit, Memory Unit (5.5 MB), Vector Unit (16 ALUs), Memory Unit (5.5 MB), Switch Unit, Matrix Unit (320 MACs). You’ll notice that the components are mirrored around the Vector Unit, this divides the superlane into two hemispheres that can act almost independently.

The instruction stream (there is only one) is fed into every component of superlane 0, with 6 instructions for the Matrix Units, 14 for the Switch Units, 44 for the Memory Units, and 16 for the Vector Unit. Every clock cycle, the units perform their operations and move the piece of data to where it’s going next within the superlane. Each component can send and receive 512B from its next-door neighbors.

Once the superlane’s operations are complete, it passes everything down to the next superlane and receives whatever the superlane above (or the instruction controller) has. Instructions are always passed down vertically between the superlanes, while data only transfers horizontally within a superlane.

	Groq TSP	Nvidia Tesla V100	Nvidia Tesla T4
Cores	1	5120	2560
Maximum Frequency	1250 MHz	1530 MHz	1590 MHz
FP16 TFLOPS	205 TFLOPS	125 TFLOPS	65 TFLOPS
INT8 TOPS	1000 TOPS	250 TOPS	130 TOPS
Chip Cache (L1)	220 MB	10 MB	2.6 MB
Board Memory	N/A	32 GB HBM2	16 GB GDDR6
Board Power (TDP)	300W	300W	70W
Process	14nm	12nm	12nm
Die Area	725 mm²	815 mm²	545 mm²

Read the full article by Isaiah Mayersen

danielparente

Emprendedor en el sector de los videojuegos, tecnologica y educación. + de 20 años en videojuegos. CEO de Hydra Interactive. Fundador de Devsfromspains.

Why it matters: Groq is the hundredth startup to take a shot at making an AI accelerator card, the second to market, and the first to have a product reach the 1 quadrillion operations per second threshold. That’s quadruple the performance of Nvidia’s most powerful card.

AI accelerator card capable of 1 PetaOPS

Related

Leave a ReplyCancel Reply

Trending now