Local unrestricted LLM on Linux

April 19, 2025.






In this article we will see how to run a local unrestricted large language model (LLM) that you can use via command line or via browser GUI. We will be running the model on a separate dedicated system to offload the work from our main workstation.
In my case I will be using an old NVIDIA Jetson Nano that has 4GB memory and 128 CUDA cores.

We will be using Ollama, a platform for easily running a variety of LLMs (including unrestricted ones) locally.

Additionally, we will be doing some basic kernel tuning in order to better utilize virtual memory resources: adding swap, enabling and tuning zram, and other fun system administration tasks.
If you happen to be using an NVIDIA Jetson Nano, you may find this setup guide on my GitHub useful. I wrote it 5 years ago to remember the setup steps, but I used it again for this project and it worked fine:
musoto96/jetson-nano-setup-guide.

I also recommend using a PWM fan for the Jetson and running the following script to install a fan control daemon, since things will be getting toasty:
Pyrestone/jetson-fan-ctl.

Without further ado, let's see how to run local, unrestricted LLMs on a Linux system.




Table of contents

Installing Ollama on a Linux system
Pulling and running LLM models locally
Using swap to increase memory and avoid OOM errors
Kernel tuning for swap-heavy workload
Setting up open-webui for web access
Closing thoughts
Documentation


Installing Ollama on a Linux system

Install Ollama using the instructions provided on the official Ollama website. On Linux this is a one-line install script; the command below is what the site documented at the time of writing, so check the website for the current version:
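msoto@jetson:~$ curl -fsSL https://ollama.com/install.sh | sh
Once installed, verify the server is running by checking the default port 11434: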

msoto@jetson:~$ sudo ss -plutn |grep -i ollama
tcp    LISTEN   0        128             127.0.0.1:11434          0.0.0.0:*      users:(("ollama",pid=3785,fd=3))
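
You can also check the HTTP API directly; requesting the root endpoint should return a short confirmation message (exact wording may vary between versions):
msoto@jetson:~$ curl http://127.0.0.1:11434
Ollama is running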
      
A command line interface (cli) will be available for use:
msoto@jetson:~$ ollama --help
Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.
      

We are now ready to download LLM models from the Ollama catalog.

[Top]


Pulling and running LLM models locally

Pull a model with ollama pull <MODEL_NAME>:<VERSION>, e.g.:
msoto@jetson:~$ ollama pull dolphin-phi:2.7b
pulling manifest 
pulling 4ec... 100%  ****************** 1.6 GB
pulling 876... 100%  ******************  10 KB
pulling a47... 100%  ****************** 106 KB
...
verifying sha256 digest 
writing manifest 
success
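The pulled model should now show up in the local model list; the ID and timestamp below are illustrative:
msoto@jetson:~$ ollama list
NAME                ID      SIZE    MODIFIED
dolphin-phi:2.7b    c57...  1.6 GB  2 minutes ago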
Run the model in the terminal with ollama run <MODEL_NAME>:<VERSION>, e.g.:
msoto@jetson:~$ ollama run dolphin-phi:2.7b
>>> Send a message (/? for help)
You can test whether a model is restricted by asking it a potentially dangerous or nefarious question; a restricted model will typically refuse to answer, while an unrestricted one will attempt a response.

[Top]


Using swap to increase memory and avoid OOM errors

If we try to run a model that is large relative to the available memory, we will see that the process is killed with the following message:
msoto@jetson:~$ ollama run dolphin-llama3:8b
Error: llama runner process has terminated: signal: killed
If we check the kernel messages after this, we will see that the Linux Out of Memory (OOM) killer killed the process:
msoto@jetson:~$ dmesg |tail
[ 4394.072907] Out of memory: Kill process 9602 (ollama) score 548 or sacrifice child
[ 4394.142428] Killed process 9602 (ollama) total-vm:6335248kB, anon-rss:2441340kB, file-rss:0kB, shmem-rss:0kB
[ 4395.245813] oom_reaper: reaped process 9602 (ollama), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
If we are unlucky the system may hang and we may need to reboot the machine.
We can make use of swap, which essentially trades speed (disk I/O) for memory: when physical memory fills up, the kernel begins moving pages out to the swap area on disk. Performance will take a hit, but this extends virtual memory and allows us to run a larger model that we would otherwise not be able to run.

A swap area can be backed by either a file or a dedicated partition; in this case I will use a file. First we allocate a certain number of zero bytes to a file using dd, reading from the special device /dev/zero. Here I allocate 4 GiB for the swap file (1024 blocks of 4 MiB, i.e. 4 * 1024 = 4096 MiB):
msoto@jetson:~$ sudo dd if=/dev/zero of=/swap.img bs=4M count=1024 status=progress
4290772992 bytes (4.3 GB, 4.0 GiB) copied, 101 s, 42.4 MB/s
1024+0 records in
1024+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 111.516 s, 38.5 MB/s
Next we set the correct permissions with chmod and mark the file as swap space with mkswap:
msoto@jetson:~$ sudo chmod 0600 /swap.img 
msoto@jetson:~$ sudo mkswap /swap.img
Setting up swapspace version 1, size = 4 GiB (4294963200 bytes)
no label, UUID=e871dfd4-9be7-45b8-90eb-3bb15e80b681
Finally we activate the new swap area with swapon and verify the extended swap space with free (I see roughly 6 GB total since I already had 2 GB of swap preallocated):
msoto@jetson:~$ sudo swapon /swap.img 
msoto@jetson:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           3.9G        969M        472M        4.1M        2.5G        2.7G
Swap:          5.9G        1.1G        4.9G
To make sure this new swap file is activated automatically at boot, we can add it to /etc/fstab:
msoto@jetson:~$ sudo su -
root@jetson:~# cat 0<<EOF 1>>/etc/fstab
/swap.img  none  swap  defaults  0 0
EOF
root@jetson:~# tail -1 /etc/fstab
/swap.img  none  swap  defaults  0 0
root@jetson:~#
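If the swap file is not heavily in use yet, we can confirm the fstab entry works without rebooting by deactivating the file and letting swapon -a re-activate everything listed in fstab; the swapon --show output below is illustrative, and any pre-existing swap devices will also be listed:
root@jetson:~# swapoff /swap.img && swapon -a
root@jetson:~# swapon --show
NAME       TYPE  SIZE  USED  PRIO
/swap.img  file    4G    0B    -2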
Now we are able to run a larger model with the help of the increased virtual memory:
root@jetson:~# exit
logout
msoto@jetson:~$ ollama run dolphin-llama3:8b
>>> Send a message (/? for help)

[Top]


Kernel tuning for swap-heavy workload

Swapping is the activity the kernel performs when moving memory pages between physical memory and the swap area. How prone the system is to swapping is controlled by the kernel parameter vm.swappiness, which can be changed dynamically with sysctl. You can check the current value (the default is 60) with:
msoto@jetson:~$ sudo sysctl vm.swappiness
vm.swappiness = 60
msoto@jetson:~$
We can choose a value from 0 to 100 (or higher than 100 on some setups); a higher value will cause swapping to be more aggressive.
I will bump this to 100:
msoto@jetson:~$ sudo sysctl vm.swappiness=100
vm.swappiness = 100
msoto@jetson:~$
I will also be modifying vm.vfs_cache_pressure and vm.page-cluster, since I saw a slight improvement in model initialization when increasing these values from their defaults.
These variables are exposed under /proc/sys/vm in the /proc pseudo-filesystem. For more information on the virtual memory (vm) variables, see the documentation for /proc/sys/vm:
msoto@jetson:~$ sudo sysctl vm.vfs_cache_pressure=400
vm.vfs_cache_pressure = 400
msoto@jetson:~$ sudo sysctl vm.page-cluster=4
vm.page-cluster = 4
msoto@jetson:~$
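Values set with sysctl this way are lost on reboot. To make them persistent, we can drop them into a file under /etc/sysctl.d/ and reload; the file name below is arbitrary:
msoto@jetson:~$ cat 0<<EOF | sudo tee /etc/sysctl.d/99-llm-tuning.conf
vm.swappiness = 100
vm.vfs_cache_pressure = 400
vm.page-cluster = 4
EOF
msoto@jetson:~$ sudo sysctl --system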
We can fine-tune these variables by conducting a proper benchmark, but for now this should be OK.

[Top]


Setting up open-webui for web access

Instead of using the model through a terminal, we can set up a web frontend with Docker. There are various frontend UIs that integrate with Ollama and other AI tools; below we will use Open-WebUI.

First, create a docker compose file with the following command (the entire block is one command, from cat to EOF):
msoto@jetson:~$ cat 0<<EOF 1>$HOME/docker-compose.yml
version: "3.3"

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    restart: always
    environment:
      OLLAMA_BASE_URL: http://127.0.0.1:11434 
      PORT: 80
    volumes: 
      - ./open-webui:/app/backend/data
    network_mode: host
EOF
msoto@jetson:~$
This will create a container from the official Open-WebUI image and connect it to the Ollama server running locally on port 11434. It will also mount a host volume, ./open-webui, for data persistence.
The Open-WebUI server will be accessible on port 80 of the host.

To start the container simply run docker compose up -d (or docker-compose if using an older version of Docker Compose).
You can monitor the status of the container with docker ps. Since the compose file has the restart: always directive, the container will automatically restart if it crashes and will start again when the computer reboots. To stop the container you can use docker compose down in the same directory where the docker-compose.yml file is located.
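A minimal start-and-check sequence looks like the following; the first start can take a while on the Nano while the image downloads, and the curl check (which assumes curl is installed) should return 200 once the container is up:
msoto@jetson:~$ docker compose up -d
msoto@jetson:~$ docker compose ps
msoto@jetson:~$ curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:80
200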

With this, we have enabled access to our models from a browser on any other computer on the network, by using the private IP address of the machine running Ollama.

[Top]


Closing thoughts

In this article we saw how to run a local LLM using Ollama on a dedicated Linux system, which helps isolate the workload and avoid degrading performance on our main workstation.
We allocated a swap area to extend virtual memory and tuned a few virtual memory kernel variables using sysctl, which allowed us to run a larger model than the physical memory alone would allow. Finally, we set up a browser UI frontend in front of the Ollama server with a simple docker compose file, for easy access to the LLM models from other systems on the network.

[Top]




Documentation