Offline LLM Inferencing
When it comes to running LLMs locally (or offline, i.e., not in the cloud), we have several tools to choose from. Coming from "cloud-native" services, I was quite surprised by how popular these tools are. However, after watching the cloud costs accumulate rapidly, I decided to do a little research on the subject.
Probably the best article on the subject is https://getstream.io/blog/best-local-llm-tools/ - a great one-stop shop for getting started with offline inferencing on open-source models.
Tools for Offline LLM Inferencing:
- Ollama
- Llamafile / Executable and Linkable Format (ELF)
- Jan
- LM Studio
- llama.cpp
One more great article from the same site: https://getstream.io/blog/local-deepseek-r1/
Ollama takes first place on my list since it is essentially a "Docker for models" - pretty much the same concept and the same command-line interface. Apparently people use the same approach to run inferencing in the cloud as well - for example, https://cloud.google.com/run/docs/tutorials/gpu-gemma2-with-ollama
Ollama
- Repo https://github.com/ollama/ollama
- Docs https://github.com/ollama/ollama/blob/main/docs/README.md
- Model library https://ollama.com/search
- HuggingFace models https://huggingface.co/models?other=llama
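Newer Ollama versions can also pull GGUF models straight from Hugging Face by prefixing the repository with hf.co/ (the syntax below follows the Hugging Face docs on Ollama integration; the repository name is a placeholder):
# assumed syntax - double-check against the Hugging Face / Ollama docs
ollama run hf.co/<username>/<repository>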
Ollama model CRUD:
ollama list
ollama show <model>
ollama pull <model>
ollama create <model> -f ./Modelfile
ollama run <model>
ollama serve # start the server, then ollama run <model>
ollama ps
ollama stop <model>
ollama rm <model>
ollama cp <model> <copy>
# Multiline input in the interactive session is wrapped in triple quotes:
# """
# ...
# """
# Multimodal
ollama run llava "What's in this image? <path>"
# Prompt as an argument
ollama run llama3.2 "Summarize this file: $(cat <file>)"
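The same models can also be queried over Ollama's local REST API - by default the server listens on http://localhost:11434 (the model name and the prompt below are just examples):
# generate a completion via the local API (example model and prompt)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?"}'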
Ollama service control:
sudo systemctl start ollama
sudo systemctl status ollama
sudo systemctl edit ollama
journalctl -e -u ollama
sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
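For reference, sudo systemctl edit ollama opens a drop-in override where environment variables for the service can be set; a minimal sketch (OLLAMA_HOST and OLLAMA_MODELS are documented in the Ollama FAQ, the values here are illustrative):
[Service]
# listen on all interfaces instead of localhost only (illustrative value)
Environment="OLLAMA_HOST=0.0.0.0"
# keep pulled models on a bigger disk (illustrative path)
Environment="OLLAMA_MODELS=/data/ollama/models"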
GGUF
GPT-Generated Unified Format (GGUF) is a file format for storing large language models (LLMs) for inference.
A Modelfile is the analog of a Dockerfile. As you can see from the example in the docs, this is a very familiar Docker-build kind of workflow.
FROM ./vicuna-33b.Q4_0.gguf
PARAMETER temperature 1
SYSTEM """
You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
"""
ollama create example -f Modelfile
ollama run example
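Once created, the model's effective Modelfile can be printed back for verification (the --modelfile flag is listed in ollama show --help; treat it as an assumption if your version differs):
# print the Modelfile that the "example" model was built from
ollama show example --modelfile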
DeepSeek-coder
https://ollama.com/library/deepseek-coder
ollama run deepseek-coder
The interesting part: I was able to run the DeepSeek-Coder model on my Chromebook!
This DeepSeek thing works on that tiny, energy-efficient AMD 5 with 8 GB of memory and no GPU.
Was it useful? Probably not - it loses context much faster than Copilot, so the generated code becomes convoluted more quickly than with GPT-based tools, especially for Python, since DeepSeek inserts stray spaces here and there.
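The library page lists the model in several sizes; on a machine this small, the smaller tags are the realistic choice (the tag names below are taken from the library page and should be double-checked there):
# smaller variants, pulled by tag (tags assumed from https://ollama.com/library/deepseek-coder)
ollama run deepseek-coder:1.3b
ollama run deepseek-coder:6.7b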
Templating
One more interesting thing: templating in Ollama models - https://github.com/ollama/ollama/blob/main/docs/template.md
This is the same Go text/template engine that Hugo uses to render static websites, including this one. It was great to see the investment in learning Hugo pay off so quickly.
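As a rough sketch of what this looks like in a Modelfile, a TEMPLATE block uses regular Go template syntax; the .System, .Prompt, and .Response variables come from the template docs linked above, while the surrounding prompt markers are purely illustrative, not any particular model's real format:
# illustrative template only - real models ship their own prompt format
FROM llama3.2
TEMPLATE """{{ if .System }}{{ .System }}
{{ end }}User: {{ .Prompt }}
Assistant: {{ .Response }}"""
Building and running it is the same ollama create / ollama run flow shown above.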