

That’s how LLMs work. When they say 175 billion parameters, it means at least that many calculations for every token the model generates.
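Just to put rough numbers on that claim (the ~2 FLOPs per parameter per token figure is the usual napkin estimate, and the A100 throughput is my assumption, not something stated above):

```python
# Napkin math: FLOPs needed to generate one token with a dense transformer.
# Rule of thumb: each parameter contributes ~2 FLOPs per forward pass
# (one multiply and one add in some matrix multiplication).
params = 175e9                  # GPT-3-scale parameter count
flops_per_token = 2 * params    # ~350 GFLOPs per generated token

# Assumed accelerator: NVIDIA A100, ~312 TFLOP/s peak dense FP16/BF16.
gpu_flops_per_s = 312e12
ceiling_tokens_per_s = gpu_flops_per_s / flops_per_token

print(f"~{flops_per_token / 1e9:.0f} GFLOPs per token")
print(f"compute-only ceiling: ~{ceiling_tokens_per_s:.0f} tokens/s on one GPU")
# Note: this is only the arithmetic ceiling; single-stream decoding is
# typically limited by memory bandwidth, not raw compute.
```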
I don’t get it: how is it possible that so many people all over the world use this concurrently, doing all kinds of lengthy chats, problem solving, code generation, image generation and so on?
So do they load all those matrices (totalling 175B params in this case) onto the available GPUs for every token of every user?
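To put a number on what "all those matrices" would mean in memory (the 16-bit precision is my assumption; quantized serving would shrink this):

```python
# Size of the weights alone at 2 bytes (FP16/BF16) per parameter.
params = 175e9
bytes_per_param = 2
weight_bytes = params * bytes_per_param

print(f"~{weight_bytes / 1e9:.0f} GB of weights")  # ~350 GB
# No single 80 GB GPU can hold that, so the weights have to be sharded
# across several GPUs - and presumably kept resident there, rather than
# reloaded from somewhere for every token?
```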