Typically you need about 1GB graphics RAM for each billion parameters (i.e. one byte per parameter). This is a 405B parameter model. Ouch.
Edit: you can try quantizing it. This reduces the amount of memory required per parameter to 4 bits, 2 bits or even 1 bit. As you reduce the size, the performance of the model can suffer. So in the extreme case you might be able to run this in under 64GB of graphics RAM.
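Rough arithmetic behind those numbers, as a quick sketch (weights only; this ignores KV cache, activations and runtime overhead, which add more on top):

```python
# Back-of-the-envelope memory needed just for the weights of a 405B-parameter
# model at different quantization levels. Ignores KV cache, activations and
# framework overhead.
PARAMS = 405e9

for bits in (16, 8, 4, 2, 1):
    bytes_per_param = bits / 8
    gb = PARAMS * bytes_per_param / 1e9  # 1 GB = 1e9 bytes, matching the "1GB per billion params" rule of thumb
    print(f"{bits:>2}-bit: ~{gb:,.0f} GB")
```

At 1 bit that works out to roughly 51GB of weights, which is where the "under 64GB" figure comes from.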
When the 8-bit quants hit, you could probably lease a 128GB system on RunPod.
Can you run this in a distributed manner, like with Kubernetes and lots of smaller machines?
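Not Kubernetes-native out of the box, but something like vLLM can shard the model across GPUs and across nodes with tensor/pipeline parallelism (multi-node runs go through a Ray cluster, and Ray can itself be hosted on Kubernetes). A minimal sketch, with the parallelism sizes and model ID as illustrative assumptions rather than a tested setup:

```python
from vllm import LLM, SamplingParams

# Illustrative only: shard the weights across 8 GPUs per node (tensor parallel)
# and 2 nodes (pipeline parallel). Multi-node execution assumes a Ray cluster
# that every node has already joined before this script starts.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",  # assumed model ID
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```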
Or you could run it on CPU and system RAM at a much slower rate.
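For the CPU-and-RAM route, something like llama-cpp-python with a quantized GGUF works, just slowly. A sketch, assuming a GGUF quant of the model exists locally and fits in system RAM; the file name and thread count are placeholders:

```python
from llama_cpp import Llama

# Load a quantized GGUF entirely into system RAM; n_gpu_layers=0 keeps all
# layers on the CPU. Path and thread count are placeholders, and a 405B quant
# will still need hundreds of GB of RAM.
llm = Llama(
    model_path="llama-3.1-405b-instruct-q2_k.gguf",  # hypothetical local file
    n_ctx=4096,
    n_threads=32,
    n_gpu_layers=0,
)

out = llm("Q: Why is CPU inference so much slower than GPU inference? A:", max_tokens=64)
print(out["choices"][0]["text"])
```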
Finally! My dumb dumb 1TB RAM server (4x E5-4640 + 32x32GB DDR3 ECC) can shine.
Yeah uh let me just put in my 512GB RAM stick…
Samsung do make them.
Good luck finding 512GB of VRAM.
At work we have a small cluster totalling around 4TB of RAM.
It has 4 cooling units, a cubic metre of PSUs, and it must take up something like 30 m² of floor space.
According to Hugging Face, you can run a 34B model using 22.4GB of RAM max. That fits on an RTX 3090 Ti (24GB).