iMatrix Quant Script

#252
by MikeRoz - opened

Hello,

I'm curious how you are able to create iMatrix quants of models that don't fit in your system RAM. You've also mentioned you've put work into automating your workflow. Would you mind sharing as much of your script(s) as you're comfortable with? Or, at least, the commands you use to pin the first 80 GB of the model in RAM? I'd like to try quantizing larger models locally myself.

Sorry if you've already shared this elsewhere - I looked through the other discussions on this repo and read through the ones that didn't look like specific model requests.

Thanks!

I'm curious how you are able to create iMatrix quants of models

You don't actually have to do anything special for that: llama-imatrix does not load the whole model into RAM (on Unix); it memory-maps the file and pages through it on every iteration.
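For illustration (the file names are placeholders, not any particular setup), a plain run along these lines is all it takes, since mmap is the default and the weights are simply paged in from disk during each pass:

# mmap is the default, so the weights are read from disk on demand during each pass
llama-imatrix -m Model-123B.f16.gguf -f calibration.txt -o Model-123B.imatrix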

pin the first 80 GB of the model in RAM?

It's quite the hack, and probably not useful to anybody, but I used a script I wrote a decade ago to load some database files into memory (http://data.plan9.de/mlock). It is overkill for this purpose, but it was at hand. I recommend finding something else.

Usage would be:

MLOCK_LIMIT=80000000000 mlock file.gguf
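If you'd rather use something less idiosyncratic, a commonly packaged tool for the same job is vmtouch (just a suggestion, not what is used here). Assuming its -p range option accepts a size suffix like this, locking roughly the first 80 GB would look like:

# -l locks the mapped pages with mlock(2); -d keeps a daemon running to hold the lock
# needs root or a sufficiently high "ulimit -l"
vmtouch -l -d -p 0-80G file.gguf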

Sorry if you've already shared this elsewhere

I haven't. The script is basically a convert_hf_to_gguf.py followed by a glorified for loop, full of details only relevant to my setup. Adapting it to something else or making it usable by others is likely more work than writing it from scratch, so I don't see a point in publishing it.
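For anyone who wants to roll their own, the general shape of such a loop (a rough sketch with placeholder paths and an arbitrary quant list, not the actual script) would be something like:

# placeholder names throughout; adjust paths, calibration data and quant types
python convert_hf_to_gguf.py /models/SomeModel --outtype f16 --outfile SomeModel.f16.gguf
llama-imatrix -m SomeModel.f16.gguf -f calibration.txt -o SomeModel.imatrix
for q in Q2_K Q3_K_M Q4_K_M Q5_K_M Q6_K Q8_0; do
    llama-quantize --imatrix SomeModel.imatrix SomeModel.f16.gguf SomeModel.$q.gguf $q
done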

But if you have questions, I'll be happy to answer.

mradermacher changed discussion status to closed
