Hacker News

Are those numbers for the 4/8 bit quants or the full fp16?


It is the 4-bit quant, gemma-3-4b-it-Q4_K_M.gguf. I just use "describe" as the prompt, or "short description" if I want less verbose output.
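As a concrete sketch of this setup: llama.cpp's `llama-server` exposes an OpenAI-compatible chat endpoint that accepts images as base64 data URIs when started with a vision projector. The model/mmproj paths, port, and image path below are assumptions, not anything from the thread:

```python
# Hypothetical sketch: querying a local llama.cpp server, started e.g. with
#   llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-gemma-3-4b.gguf
# via its OpenAI-compatible /v1/chat/completions endpoint. Paths, port, and
# the mmproj filename are assumptions.
import base64
import json
import urllib.request

def build_payload(image_bytes: bytes, prompt: str = "describe") -> dict:
    """Build an OpenAI-style chat payload embedding the image as a data URI."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }]
    }

def describe(path: str,
             url: str = "http://localhost:8080/v1/chat/completions") -> str:
    """Send an image file to the local server and return the description."""
    with open(path, "rb") as f:
        payload = build_payload(f.read())
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Swapping the prompt string to "short description" is all it takes to get the terser output mentioned above.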

Since you are a photographer, I ran a picture from your website through Gemma 4B; it produces the following:

"A stylish woman stands in the shade of a rustic wooden structure, overlooking a landscape of rolling hills and distant mountains. She is wearing a flowing, patterned maxi dress with a knotted waist and strappy sandals. The overall aesthetic is warm, summery, and evokes a sense of relaxed elegance."

This description is pretty spot on.

The picture I used is from the series L'Officiel.02 (L-officel_lanz_08_1369.jpg) on zamadatix's website.


I can neither claim to be a photographer nor that https://www.dansmithphotography.com/ is my website, but I appreciate the example! The specific photo, for others' reference, based on the filename: https://payload.cargocollective.com/1/15/509333/14386490/L-o...

That said, I'm not as impressed by the description. The structure has some wood, but it's certainly not just wooden, and there are distant mountains but not much in the way of rolling hills to speak of. The dress is flowing, but the waist is not knotted; the more striking note might have been the sleeves.

For 4 GB of model I'm not going to ding it too badly, though. The question about which quant was mainly from the tokens/second angle (q4 requires roughly a quarter of the memory bandwidth the full model would) rather than the quality angle. As a note: a larger multimodal model gets all of these points right (e.g. "wooden and stone rustic structure"); they aren't just things I noted myself.
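The bandwidth claim follows from the fact that token generation is memory-bound: each generated token requires reading roughly the whole set of weights, so tokens/second scales with bytes per weight. A back-of-the-envelope check (the ~4.5 bits/weight figure for Q4_K_M is an approximation, since llama.cpp's K-quants carry per-block scales on top of the 4-bit values, and the 4B parameter count is taken at face value):

```python
# Rough arithmetic only: compares bytes read per generated token for fp16
# vs a Q4_K_M-style quant. 4.5 bits/weight is an approximate effective size.
PARAMS = 4e9  # gemma-3-4b, taken at face value

def weights_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Size of the weights in GB, i.e. bytes streamed per generated token."""
    return params * bits_per_weight / 8 / 1e9

fp16 = weights_gb(16.0)   # ~8 GB read per token at full precision
q4km = weights_gb(4.5)    # ~2.25 GB read per token at ~4.5 bits/weight
ratio = q4km / fp16       # ~0.28, i.e. roughly a quarter of the bandwidth
```

So on the same hardware the q4 quant should generate tokens roughly 3-4x faster than fp16, independent of any quality difference.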


n.b. the image processing is done by a separate model; it basically has to load the image and generate ~1000 tokens

(source: vision was available in llama.cpp but Very Hard to use; I've been maintaining an implementation)

(n.b. it's great work, extremely welcome, and new in the sense that the vision code badly needed a rebase and refactoring after a year or two of each model adding in more stuff)


Wait, sorry, can you explain how this works? I thought Gemma 3 used SigLIP, which can output all 256 embeddings in parallel.

(Also, would you mind sharing a code pointer if you have one handy? I found https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd... but I'm not sure if that's the codepath taken.)



