In training, cost has been an afterthought, but in inference, where the money will be made, cost is key, and the computing needs are mixed. When you type a question into your favorite AI chatbot, the system turns it into tokens representing words, parts of words, and punctuation. It processes all of these tokens at once, a step called prefill, which favors the parallel computing of GPUs. But the answer comes one token at a time, a bit like speaking, where each word builds on the last. CPUs can excel at this kind of sequential computing, but what you'd really like are purpose-built chips that handle decode cheaply and efficiently, without, for example, the need for pricey off-chip memory.
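The prefill/decode split can be sketched in miniature. In this toy example, a bigram lookup table stands in for the neural network, and whitespace splitting stands in for a real subword tokenizer; every name here is illustrative, not from any real library. The point is the shape of the work: prefill touches the whole prompt in one pass, while decode must loop, because each new token depends on the one before it.

```python
# Tokenization: real tokenizers use subword units; whitespace
# splitting is a deliberate simplification.
def tokenize(text):
    return text.split()

# Prefill: process every prompt token in one pass. In a real model
# this is a single batched forward pass over the whole prompt, which
# is why it maps well onto parallel hardware like GPUs. Here the
# "context" stands in for the model's cached state.
def prefill(tokens):
    return list(tokens)

# Decode: generate one token at a time. Each step reads the previous
# token before producing the next, so the work is inherently
# sequential -- the loop cannot be parallelized across steps.
def decode(context, bigram_model, max_new_tokens):
    generated = []
    for _ in range(max_new_tokens):
        nxt = bigram_model.get(context[-1])
        if nxt is None:
            break  # no known continuation; stop generating
        context.append(nxt)
        generated.append(nxt)
    return generated

# A stand-in "model": maps each token to its most likely successor.
bigram = {"the": "cat", "cat": "sat", "sat": "down"}

ctx = prefill(tokenize("ask about the"))
print(decode(ctx, bigram, 5))  # prints ['cat', 'sat', 'down']
```

The asymmetry this sketch exposes is exactly why the hardware needs differ: the prefill function could process a million-token prompt as one parallel batch, but the decode loop produces its millionth token only after the 999,999 steps before it.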
