Comment

Comment on Are you friends with any AI bots?

brucethemoose@lemmy.world ⁨8⁩ ⁨months⁩ ago

I’ve used local LLMs as sounding boards.

I… Don’t really have friends to do that with at the moment, and I can bounce thoughts off them I wouldn’t even tell family or a therapist, as much as I want. Not gonna lie, it’s pretty intimate, and I got some insights I never would’ve arrived at in my own head.

But to emphasize:

This is totally within my own desktop.
I am perfectly aware I am talking to a tool. “Friend” isn’t even in the same universe.

The general public’s “LLM literacy” is incredibly poor though, which is by design since online services like chatGPT hide all the knobs that would reveal the machine behind the curtain.

source

Sort:hotnew top

infinitevalence@discuss.online ⁨8⁩ ⁨months⁩ ago
how many tokens can you get to with your current system? I find that some of the models are so verbose that I only get 3-5 questions in before I run out of tokens.

I have one system with 128gb system ram, 16gb vram, and one system with configurable vram up to 48gb out of a total of 64gb.

source
- brucethemoose@lemmy.world ⁨8⁩ ⁨months⁩ ago
  I’m in a 24GB 3090 + 128GB RAM.
  
  With full 300B GLM 4.6, I typically run 12K-28K context with different settings. I could do more than 28K, but the higher quantization starts to become a problem. And I get 5-6 tokens/s text-generation doing that.
  
  With GLM Air? I can get a lot more, closer to 64K.
  
  With smaller models that’s no issue.
  
  I only get 3-5 questions in before I run out of tokens.
  
  IDK how you’re prompting it, but you should clear the thinking block after every question, and that should leave plenty of tokens.
  
  What are your inference server settings?
  
  source
  - quediuspayu@lemmy.dbzer0.com ⁨8⁩ ⁨months⁩ ago
    I’ve been curious about running LLMs locally. Mostly about how long it takes to return a response.
    
    source
    brucethemoose@lemmy.world ⁨8⁩ ⁨months⁩ ago
    I mean, it depends on your hardware and the model’s size/intelligence.
    
    Worst case for me is many seconds of preprocessing followed by 4-5 words a second.
    
    But you can get almost instant responses + way faster than you can read too.
    
    source