Groq is fast
and kicking off the age of inference-focused hardware?
The buzzy news from this morning: a company called Groq showed off a demo that performs LLM inference at roughly 500 tokens/s. That’s… really fast. I’m on a flight and going mostly from memory, so let’s break this down.
First, if you just go to groq.com and type in anything, you’ll notice the text appears way, way faster than it does in ChatGPT. That’s what “LLM inference at X tokens/second” means. A token is machine learning jargon for a piece of a word: more than a letter, less than a full word. Groq is serving a publicly available model called Mixtral at these speeds.
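To put 500 tokens/s in context, here’s some quick back-of-the-envelope math. The numbers below are illustrative assumptions on my part (a rough 0.75 words-per-token ratio and ~50 tokens/s for a typical hosted model), not measurements:

```python
# Rough estimate of how long a ~300-word answer takes to generate at different
# decode speeds. Words-per-token and the "typical" speed are ballpark guesses.
words = 300
tokens = words / 0.75  # ~400 tokens, using a rough 0.75 words-per-token ratio

for name, tokens_per_sec in [("typical hosted LLM", 50), ("Groq demo", 500)]:
    seconds = tokens / tokens_per_sec
    print(f"{name}: ~{tokens:.0f} tokens at {tokens_per_sec} tok/s -> {seconds:.1f}s")

# typical hosted LLM: ~400 tokens at 50 tok/s -> 8.0s
# Groq demo: ~400 tokens at 500 tok/s -> 0.8s
```

At that point a response stops feeling like waiting for a reply and starts feeling instantaneous.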
Why do we care about inference speed? A lot of applications need it. If you want a large language model doing backend work for a virtual assistant, you have to capture spoken audio, convert it to text, pass that text through the model, take the generated text back out, and convert it to speech. Different startups are working on different pieces of this pipeline (ElevenLabs does text-to-speech, for instance), but the whole thing is really hard to get right, especially since human conversations move so fast.
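Here’s a minimal sketch of that loop, just to make the latency argument concrete. Every function below is a placeholder for whichever speech-recognition, LLM, and text-to-speech service you’d actually call; none of them are real APIs:

```python
# Sketch of the voice-assistant pipeline: audio in -> text -> LLM -> audio out.
# Each stage adds latency, and a slow LLM step dominates the overall budget.
import time

def transcribe(audio: bytes) -> str:      # speech-to-text (placeholder)
    ...

def generate_reply(prompt: str) -> str:   # the LLM call (placeholder)
    ...

def synthesize(text: str) -> bytes:       # text-to-speech (placeholder)
    ...

def handle_turn(audio: bytes) -> bytes:
    start = time.perf_counter()
    text = transcribe(audio)
    reply = generate_reply(text)
    speech = synthesize(reply)
    print(f"end-to-end latency: {time.perf_counter() - start:.2f}s")
    return speech
```

In a natural conversation the gap between turns is only a few hundred milliseconds, so every stage in that chain is fighting for a slice of a very small budget.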
Groq does this using an in-house chip called an LPU (language processing unit). There isn’t a lot of public info on these right now, but plenty of people have been heralding a new age of chips designed specifically for machine learning. (Besides GPUs, which were originally designed for graphics, Google’s TPUs have been on the scene for a while; they’re an enormous pain to use, so I think it’s mostly Google that uses them internally.) It remains to be seen whether chip design ends up being an in-house thing, with every company keeping its own chip as a competitive advantage, or whether we start seeing more competitors to NVIDIA and AMD.
I’m not entirely sure where Groq is getting its speedups. Gains like these can come from anywhere between better software layers and very physical hardware changes (e.g., “I’m going to change the size of this thing on my chip so it’s better at adding numbers together”).
The positioning of this company is really significant. GPUs are great at model training, and we all sort of used them for inference too without really asking whether that was the right thing to do. But inference looks very different from training, so it makes sense that there are lots of optimizations available to speed it up.
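One concrete difference, for what it’s worth: generation at inference time is a sequential loop where each new token depends on the ones before it, whereas training processes whole batches of already-known text in parallel. A toy sketch (model.predict_next is a made-up method, just standing in for one forward pass):

```python
# Toy autoregressive decoding loop: one full forward pass per generated token,
# and each step has to wait for the previous token before it can start.
def generate(model, prompt_tokens, max_new_tokens=100):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model.predict_next(tokens)  # hypothetical single-step call
        tokens.append(next_token)
    return tokens
```

That serial dependency is part of why hardware tuned for inference can look pretty different from hardware tuned for training.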
I’ll wrap by leaving this here, because anything that makes Siri not terrible is a blessing to mankind.

