
For local models, are you wanting to do:

A) Embeddings.

B) Things like classification, structured outputs, image labelling etc.

C) Image generation.

D) LLM chatbot for answering questions, improving email drafts etc.

E) Agentic coding.

?

I have an MBP with M1 Max and 32GB RAM. I can run a 20GB mlx_vlm model like mlx-community/Qwen3.5-35B-A3B-4bit. But:

- it's not very fast

- the context window is small

- it's not useful for agentic coding

I asked "What was mary j blige's first album?" and it output 332 tokens (mostly reasoning) before arriving at the correct answer.

mlx_vlm reported:

  Prompt: 20 tokens @ 28.5 t/s | Generation: 332 tokens @ 56.0 t/s | Peak memory: 21.67 GB
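
For anyone curious, the Python side is only a few lines. This is a minimal sketch from memory of mlx_vlm's load/generate API (the exact generate() signature and kwargs may differ between versions, so check the project README):

  from mlx_vlm import load, generate

  # Model name as above; weights download on first load.
  model, processor = load("mlx-community/Qwen3.5-35B-A3B-4bit")

  # verbose=True prints the prompt/generation t/s and peak-memory stats quoted above.
  output = generate(model, processor, "What was mary j blige's first album?",
                    max_tokens=512, verbose=True)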



Thanks for the info.

I’d like to do agentic coding first, with chatbot and classification as lower priorities. I don’t really care about image gen.

Also, if you’re only able to run 35B models in 32GB, it seems like I’d definitely want at least 64GB for the newer, larger models (Qwen has a 122B model, right?). My theory there is that models are only getting larger, though perhaps also more efficient.
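
Rough numbers behind that, as a back-of-envelope sketch (my own assumption: ~0.5 bytes per parameter at 4-bit quantization, before KV cache and OS overhead):

  def weight_gb(params_billion, bits=4):
      """Approximate weight memory in GB for a quantized model."""
      return params_billion * bits / 8  # 1e9 params * (bits/8) bytes = GB

  print(weight_gb(35))   # ~17.5 GB -> consistent with the ~20GB model above, plus overhead
  print(weight_gb(122))  # ~61.0 GB -> weights alone nearly fill 64GB

So 64GB looks like the floor for a 122B-class model at 4-bit, and you'd want headroom beyond that for any real context window.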



