Local unrestricted LLM on Linux
April 19, 2025.
In this article we will see how to run a local unrestricted large language model (LLM) that you can use via command line or via browser GUI. We will be running the model on a separate dedicated system to offload the work from our main workstation.
In my case I will be using an old NVIDIA Jetson Nano that has 4GB memory and 128 CUDA cores.
We will be using Ollama, a platform that makes it easy to run a variety of LLMs (including unrestricted ones) locally.
Additionally, we will be doing some basic kernel tuning in order to better utilize virtual memory resources, adding swap, enabling and tuning zram, and other fun system administration tasks.
If you happen to be using an NVIDIA Jetson Nano, you may find this setup guide on my GitHub useful. I wrote it 5 years ago to remember the setup steps, and when I followed it again for this project the steps still worked fine:
musoto96/jetson-nano-setup-guide.
I also recommend using a PWM fan for the Jetson and running the following script to install a fan control daemon, since things will be getting toasty:
Pyrestone/jetson-fan-ctl.
Without further ado, let's see how to run local, unrestricted LLMs on a Linux system.
Table of contents
- Table of contents
- Installing Ollama on a Linux system
- Pulling and running LLM models locally
- Using swap to increase memory and avoid OOM errors
- Kernel tuning for swap-heavy workload
- Setting up open-webui for web access
- Closing thoughts
Installing Ollama on a Linux system
Install Ollama using the instructions provided on the official Ollama website, and verify the server is running by checking the default port 11434:
msoto@jetson:~$ sudo ss -plutn |grep -i ollama
tcp LISTEN 0 128 127.0.0.1:11434 0.0.0.0:* users:(("ollama",pid=3785,fd=3))
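Optionally, you can also confirm the API itself responds with a quick curl request to the root endpoint (this assumes curl is installed on the system); the server answers with a short status message:
msoto@jetson:~$ curl http://127.0.0.1:11434/
Ollama is running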
A command line interface (CLI) will be available for use:
msoto@jetson:~$ ollama --help
Large language model runner
Usage:
ollama [flags]
ollama [command]
Available Commands:
serve Start ollama
create Create a model from a Modelfile
show Show information for a model
run Run a model
stop Stop a running model
pull Pull a model from a registry
push Push a model to a registry
list List models
ps List running models
cp Copy a model
rm Remove a model
help Help about any command
Flags:
-h, --help help for ollama
-v, --version Show version information
Use "ollama [command] --help" for more information about a command.
We are now ready to download models from the Ollama catalog.
Pulling and running LLM models locally
Pull a model with ollama pull <MODEL_NAME>:<VERSION>,
e.g.:
msoto@jetson:~$ ollama pull dolphin-phi:2.7b
pulling manifest
pulling 4ec... 100% ****************** 1.6 GB
pulling 876... 100% ****************** 10 KB
pulling a47... 100% ****************** 106 KB
...
verifying sha256 digest
writing manifest
success
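To confirm the download you can list the locally available models; the ID and timestamp columns are elided below since they will differ on your system:
msoto@jetson:~$ ollama list
NAME                ID        SIZE      MODIFIED
dolphin-phi:2.7b    ...       1.6 GB    ...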
Run the model in the terminal with ollama run <MODEL_NAME>:<VERSION>,
e.g.:
msoto@jetson:~$ ollama run dolphin-phi:2.7b
>>> Send a message (/? for help)
You can test whether a model is restricted by asking it a potentially dangerous or nefarious question; a quick way to do that from the command line is shown below.
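Besides the interactive prompt, ollama run also accepts a prompt directly as a command-line argument, which makes quick one-off tests easy; the prompt below is just a placeholder for whatever question you want to probe the model with:
msoto@jetson:~$ ollama run dolphin-phi:2.7b "YOUR_TEST_QUESTION_HERE"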
Using swap to increase memory and avoid OOM errors
If we try to run a model that is large relative to the available memory, we will see that the process is killed with the following message:
msoto@jetson:~$ ollama run dolphin-llama3:8b
Error: llama runner process has terminated: signal: killed
If we check the kernel messages afterwards, we will see that the Linux Out of Memory (OOM) killer killed the process:
msoto@jetson:~$ dmesg |tail
[ 4394.072907] Out of memory: Kill process 9602 (ollama) score 548 or sacrifice child
[ 4394.142428] Killed process 9602 (ollama) total-vm:6335248kB, anon-rss:2441340kB, file-rss:0kB, shmem-rss:0kB
[ 4395.245813] oom_reaper: reaped process 9602 (ollama), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
If we are unlucky the system may even hang and need a reboot. We can make use of swap, which essentially trades speed for memory capacity: when physical memory fills up, the kernel starts moving pages out to the swap area. Performance takes a hit, but this extends virtual memory and lets us run a larger model than we otherwise could.
A swap area can be backed by either a file or a dedicated partition; in this case I will use a file. First we allocate zeroed bytes to a file using dd, reading from the special device /dev/zero. Here I will allocate 4 GB for the swap file (4 MB blocks * 1024 blocks = 4096 MB):
msoto@jetson:~$ sudo dd if=/dev/zero of=/swap.img bs=4M count=1024 status=progress
4290772992 bytes (4.3 GB, 4.0 GiB) copied, 101 s, 42.4 MB/s
1024+0 records in
1024+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 111.516 s, 38.5 MB/s
Next we set the correct permissions with chmod and mark the file as a swap area with mkswap:
msoto@jetson:~$ sudo chmod 0600 /swap.img
msoto@jetson:~$ sudo mkswap /swap.img
Setting up swapspace version 1, size = 4 GiB (4294963200 bytes)
no label, UUID=e871dfd4-9be7-45b8-90eb-3bb15e80b681
Finally we activate the new swap area with swapon and verify the extended swap with free (I see 6 GB here because the system already had 2 GB of swap preallocated):
msoto@jetson:~$ sudo swapon /swap.img
msoto@jetson:~$ free -h
total used free shared buff/cache available
Mem: 3.9G 969M 472M 4.1M 2.5G 2.7G
Swap: 5.9G 1.1G 4.9G
To make sure this new swap file is used automatically on boot, we can add it to /etc/fstab:
msoto@jetson:~$ sudo su -
root@jetson:~# cat 0<<EOF 1>>/etc/fstab
/swap.img none swap defaults 0 0
EOF
root@jetson:~# tail -1 /etc/fstab
/swap.img none swap defaults 0 0
root@jetson:~#
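While still in the root shell, we can double-check that the new swap file is active alongside any preexisting swap area; swapon lists all active swap devices together with their sizes and priorities:
root@jetson:~# swapon --show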
Now we are able to run the larger model with the help of the increased virtual memory:
root@jetson:~# exit
logout
msoto@jetson:~$ ollama run dolphin-llama3:8b
>>> Send a message (/? for help)
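While the 8B model loads, it can be interesting to watch memory and swap pressure from a second terminal; vmstat prints a new line every two seconds here, and its si/so columns show how much memory is being swapped in and out per second:
msoto@jetson:~$ vmstat 2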
Kernel tuning for swap-heavy workload
Swapping is the activity the kernel performs when moving memory pages between physical memory and the swap area. How prone the system is to swapping is controlled by the kernel parameter vm.swappiness, which can be changed dynamically with sysctl. You can check the current value (the default is 60) with:
msoto@jetson:~$ sudo sysctl vm.swappiness
vm.swappiness = 60
msoto@jetson:~$
We can choose a value from 0-100 (or higher than 100 on some setups); a higher value makes swapping more aggressive. I will bump this to 100:
msoto@jetson:~$ sudo sysctl vm.swappiness=100
vm.swappiness = 100
msoto@jetson:~$
I will also be modifying vm.vfs_cache_pressure and vm.page-cluster, since I saw a slight improvement in model initialization when increasing these values from their defaults. These variables are exposed under /proc/sys/vm in the /proc pseudo-filesystem. For more information on the virtual memory (vm) variables, see the kernel documentation for /proc/sys/vm:
msoto@jetson:~$ sudo sysctl vm.vfs_cache_pressure=400
vm.vfs_cache_pressure = 400
msoto@jetson:~$ sudo sysctl vm.page-cluster=4
vm.page-cluster = 4
msoto@jetson:~$
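Keep in mind that values set with sysctl this way do not survive a reboot. A simple way to persist them is to drop them into a file under /etc/sysctl.d/ (the file name below is arbitrary) and reload all settings with sysctl --system:
msoto@jetson:~$ sudo tee /etc/sysctl.d/99-llm-vm-tuning.conf 0<<EOF
vm.swappiness = 100
vm.vfs_cache_pressure = 400
vm.page-cluster = 4
EOF
msoto@jetson:~$ sudo sysctl --system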
We can fine-tune these variables by conducting a proper benchmark, but for now this should be OK.
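The introduction also mentioned zram, which provides a compressed swap device in RAM. Below is a minimal sketch using zramctl from util-linux; the 2G size and lz4 algorithm are only examples, the Jetson's stock image may already ship with zram devices enabled (check swapon --show first), and the device name should be adjusted if zramctl allocates something other than /dev/zram0:
msoto@jetson:~$ sudo modprobe zram
msoto@jetson:~$ sudo zramctl --find --size 2G --algorithm lz4    # prints the device it configured, e.g. /dev/zram0
msoto@jetson:~$ sudo mkswap /dev/zram0
msoto@jetson:~$ sudo swapon --priority 100 /dev/zram0
Giving the zram device a higher priority than the disk-backed swap file means the kernel will use the faster compressed device first.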
Setting up open-webui for web access
Instead of using the model through a terminal, we can set up a web frontend with Docker; there are various frontend UIs that integrate with Ollama and other AI tools. Below we will use Open-WebUI. First, create a Docker Compose file with the following command (the entire block is one command, from cat to EOF):
msoto@jetson:~$ cat 0<<EOF 1>$HOME/docker-compose.yml
version: "3.3"
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    restart: always
    environment:
      OLLAMA_BASE_URL: http://127.0.0.1:11434
      PORT: 80
    volumes:
      - ./open-webui:/app/backend/data
    network_mode: host
EOF
msoto@jetson:~$
This will create a container from the official Open-WebUI image and connect it to the Ollama server running locally on port 11434. It will also mount a host volume, ./open-webui, for data persistence. The Open-WebUI server will be accessible on port 80 of the host.
To start the container simply run docker compose up -d (or docker-compose up -d if you are using an older version of Docker Compose). You can monitor the status of the container with docker ps. Since the compose file has the restart: always directive, the container will automatically restart if it crashes, and start again when the computer reboots. To stop the container you can run docker compose down in the same directory where the docker-compose.yml file is located. With this we have enabled access to our models from the browser of any other computer on the network, using the private IP address of the machine running Ollama. The relevant commands are summarized below.
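For reference, the commands are as follows (run them from the directory containing docker-compose.yml, and substitute docker-compose for docker compose on older installations):
msoto@jetson:~$ cd $HOME
msoto@jetson:~$ docker compose up -d    # start Open-WebUI in the background
msoto@jetson:~$ docker ps               # check the container status
msoto@jetson:~$ docker compose down     # stop and remove the container
Once the container is up, a browser on any machine on the same network should reach the UI at http://<JETSON_PRIVATE_IP>/, where <JETSON_PRIVATE_IP> is a placeholder for the private address of the machine running Ollama.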

Closing thoughts
In this article we saw how to run a local LLM using Ollama on a dedicated Linux system; this isolates the workload and avoids degrading performance on our main workstation.
We allocated a swap area to extend virtual memory and tuned a few virtual memory kernel variables with sysctl, which allowed us to run a larger model than physical memory alone would permit. Finally, we set up a browser frontend for the Ollama server with a simple Docker Compose file, for easy access to the LLM models from other systems on the network.