OK, so using my “older” 2070 Super, I was able to get a response from a 70B-parameter model (Llama 3 in this case) in 9-12 minutes.
I’m fairly certain that you’re either running on your CPU or hitting some other issue. Would you like to try to debug your configuration together?
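As a starting point for that debugging, here is a rough sketch of one way to check whether the GPU is doing any work while the model generates. It assumes an NVIDIA card with the driver's `nvidia-smi` tool on your PATH; the function names (`parse_gpu_rows`, `gpu_snapshot`) are made up for illustration and are not part of any library.

```python
import shutil
import subprocess

# nvidia-smi's CSV query mode; "nounits" drops the "%" and "MiB" suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(text):
    """Parse lines of 'name, util, mem_used, mem_total' into dicts.

    Lines that do not have four fields or whose numbers cannot be
    parsed (for example '[N/A]') are skipped.
    """
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 4:
            continue
        name, util, used, total = parts
        try:
            rows.append({
                "name": name,
                "util_pct": int(util),
                "mem_used_mib": int(used),
                "mem_total_mib": int(total),
            })
        except ValueError:
            continue
    return rows


def gpu_snapshot():
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    gpus = gpu_snapshot()
    if not gpus:
        print("No GPU data (nvidia-smi missing or returned nothing).")
    for gpu in gpus:
        print(f"{gpu['name']}: {gpu['util_pct']}% busy, "
              f"{gpu['mem_used_mib']}/{gpu['mem_total_mib']} MiB VRAM in use")
```

Run it once while idle and once in the middle of a generation. If utilization and VRAM use barely change between the two, the model is most likely running on the CPU.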
xcjs@programming.dev 7 months ago
No offense intended, but are you sure it’s using your GPU? Twenty minutes is about how long my CPU-locked instance takes to run some 70B parameter models.
On my RTX 3060, I generally get responses in seconds.
kiku123@feddit.de 7 months ago
I agree. My 3070 responds in about 250 ms with the 8B Llama 3 model, at least for short responses.