Offline LLM Inferencing
When it comes to running LLMs locally (or offline, i.e., not in the cloud), we have several tools to choose from. Coming from "cloud-native" services, I was quite surprised by how popular these tools are. However, after watching the cloud costs accumulate rapidly, I decided to do a little research on the subject.
Probably the best article on the subject is https://getstream.io/blog/best-local-llm-tools/ - a great one-stop shop for getting started with offline inferencing on open-source models.
Tools for Offline LLM Inferencing:
- Ollama
- Llamafile / Executable and Linkable Format (ELF)
- Jan
- LM Studio
- llama.cpp
One more great article from the same site: https://getstream.io/blog/local-deepseek-r1/
Ollama takes first place on my list since it is essentially a "Docker for models" - pretty much the same concept and the same command-line interface. Apparently people use the same approach to run inferencing in the cloud as well - for example, https://cloud.google.com/run/docs/tutorials/gpu-gemma2-with-ollama
Ollama
- Repo https://github.com/ollama/ollama
- Docs https://github.com/ollama/ollama/blob/main/docs/README.md
- Model library https://ollama.com/search
- HuggingFace models https://huggingface.co/models?other=llama
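Newer Ollama versions can also pull GGUF models straight from Hugging Face by prefixing the repository with hf.co/ (the syntax below follows the Hugging Face docs on Ollama integration; the repository name is a placeholder):
# assumed syntax - double-check against the Hugging Face / Ollama docs
ollama run hf.co/<username>/<repository>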
Ollama model CRUD:
ollama list
ollama show <model>
ollama pull <model>
ollama create <model> -f ./Modelfile
ollama run <model>
ollama serve # start the server, then ollama run <model>
ollama ps
ollama stop <model>
ollama rm <model>
ollama cp <model> <copy>
# Multiline input in the interactive session is wrapped in triple quotes:
# """
# ...
# """
# Multimodal
ollama run llava "What's in this image? <path>"
# Prompt as an argument
ollama run llama3.2 "Summarize this file: $(cat <file>)"
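The same models can also be queried over Ollama's local REST API - by default the server listens on http://localhost:11434 (the model name and the prompt below are just examples):
# generate a completion via the local API (example model and prompt)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?"}'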
Ollama service control:
sudo systemctl start ollama
sudo systemctl status ollama
sudo systemctl edit ollama
journalctl -e -u ollama
sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
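For reference, sudo systemctl edit ollama opens a drop-in override where environment variables for the service can be set; a minimal sketch (OLLAMA_HOST and OLLAMA_MODELS are documented in the Ollama FAQ, the values here are illustrative):
[Service]
# listen on all interfaces instead of localhost only (illustrative value)
Environment="OLLAMA_HOST=0.0.0.0"
# keep pulled models on a bigger disk (illustrative path)
Environment="OLLAMA_MODELS=/data/ollama/models"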
GGUF
GPT-Generated Unified Format (GGUF) is a file format for storing large language models (LLMs) for inference.
A Modelfile is the analog of a Dockerfile. As you can see from the example in the docs, this is a very familiar Docker-build kind of workflow.
FROM ./vicuna-33b.Q4_0.gguf
PARAMETER temperature 1
SYSTEM """
You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
"""
ollama create example -f Modelfile
ollama run example
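Once created, the model's effective Modelfile can be printed back for verification (the --modelfile flag is listed in ollama show --help; treat it as an assumption if your version differs):
# print the Modelfile that the "example" model was built from
ollama show example --modelfile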
DeepSeek-coder
https://ollama.com/library/deepseek-coder
ollama run deepseek-coder
The interesting part: I was able to run the DeepSeek-Coder model on my Chromebook!
This DeepSeek thing works on that tiny, energy-efficient AMD 5 with 8 GB of memory and no GPU.
Was it useful? Probably not - it loses context much faster than Copilot, so the generated code becomes convoluted more quickly than with GPT-based tools, especially for Python, since DeepSeek inserts stray spaces here and there.
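The library page lists the model in several sizes; on a machine this small, the smaller tags are the realistic choice (the tag names below are taken from the library page and should be double-checked there):
# smaller variants, pulled by tag (tags assumed from https://ollama.com/library/deepseek-coder)
ollama run deepseek-coder:1.3b
ollama run deepseek-coder:6.7b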
Templating
One more interesting thing: templating in Ollama models - https://github.com/ollama/ollama/blob/main/docs/template.md
This is the same Go text/template engine that Hugo uses to render static websites, including this one. It was great to see the investment in learning Hugo pay off so quickly.
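As a rough sketch of what this looks like in a Modelfile, a TEMPLATE block uses regular Go template syntax; the .System, .Prompt, and .Response variables come from the template docs linked above, while the surrounding prompt markers are purely illustrative, not any particular model's real format:
# illustrative template only - real models ship their own prompt format
FROM llama3.2
TEMPLATE """{{ if .System }}{{ .System }}
{{ end }}User: {{ .Prompt }}
Assistant: {{ .Response }}"""
Building and running it is the same ollama create / ollama run flow shown above.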