The practical question with local AI models is not whether your machine can run them. Most can. It is which model runs fast enough to be worth using. I built Oracle VII around Ollama, which handles model management and inference. Getting started takes a few commands.
    # Install Ollama from ollama.com, then:
    ollama pull phi3      # 3.8B params (roughly a 2.3 GB download), fast on almost anything
    ollama pull mistral   # 7B params, good quality, needs ~8GB RAM
    ollama pull llama3    # 8B params, best quality/speed balance
    ollama serve          # start the local server, if the installer has not already started it
On a machine with no dedicated GPU (integrated Intel graphics), phi3 is the practical choice. Responses come in at roughly 10 to 15 tokens per second, which is fast enough for real back-and-forth conversation. Mistral drops to around 4 to 6 tokens per second on the same hardware, which starts to feel slow in conversation but is fine for one-off queries where you can do something else while the answer generates.
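If you want to check these numbers on your own hardware, here is a minimal sketch that sends one prompt to the local Ollama server and works out tokens per second from the timing fields in the reply. It assumes the server is listening on Ollama's default address (`localhost:11434`) and that the non-streaming `/api/generate` response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The function names are mine, not part of Oracle VII.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def generate(model, prompt, url=OLLAMA_GENERATE_URL, timeout=300):
    """Send one non-streaming prompt to Ollama and return the parsed JSON reply."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def tokens_per_second(result):
    """Compute generation speed from a reply's eval_count and eval_duration (ns)."""
    duration_ns = result.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return result.get("eval_count", 0) * 1e9 / duration_ns
```

With the server running, `tokens_per_second(generate("phi3", "Explain what a mutex is."))` gives a figure you can compare across models. Run it a couple of times, since the first request also pays the cost of loading the model into memory.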
If you have an older NVIDIA GPU (GTX 1060 or newer), Ollama will offload model layers to it and generation gets significantly faster. It does this automatically if CUDA is available.
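To confirm that offloading is actually happening, one option is to ask the server which models are loaded and how much of each sits in GPU memory. The sketch below assumes Ollama's `/api/ps` endpoint on the default port and that each entry reports `size` and `size_vram` in bytes. Treat those field names as my reading of the API rather than something Oracle VII depends on. The `ollama ps` command shows similar information on the command line.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_PS_URL = "http://localhost:11434/api/ps"


def loaded_models(url=OLLAMA_PS_URL, timeout=10):
    """Return the list of models the local Ollama server currently has loaded."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response).get("models", [])


def gpu_fraction(model_entry):
    """Fraction of a loaded model held in GPU memory (0.0 means CPU only)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size
```

After sending a prompt, something like `[(m.get("name"), gpu_fraction(m)) for m in loaded_models()]` should show a value near 1.0 for a model that fits entirely on the GPU, and a smaller value when only some layers were offloaded.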
The thing I keep coming back to when using Oracle VII is that it is just fundamentally different to use a model where nothing leaves your machine. No rate limits, no API costs, no terms of service to worry about, no context being logged somewhere. You can be experimental with it in a way you cannot with an API service.
The SQLite knowledge base in Oracle VII stores things you tell it to remember and can retrieve them in future sessions. It is not magic — it is keyword search over stored notes — but it is useful for building up context about ongoing projects without re-explaining everything each session.
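To make "keyword search over stored notes" concrete, here is a rough sketch of how such a store could look using Python's built-in `sqlite3` module. This is not Oracle VII's actual schema or code. The table name, column names, and the `remember`/`recall` functions are hypothetical, and the matching is a plain `LIKE` lookup that requires every keyword to appear in the note.

```python
import sqlite3


def open_notes(path=":memory:"):
    """Open (or create) a notes database. The schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        " id INTEGER PRIMARY KEY,"
        " created TEXT DEFAULT CURRENT_TIMESTAMP,"
        " body TEXT NOT NULL)"
    )
    return conn


def remember(conn, body):
    """Store one note."""
    with conn:  # commits on success
        conn.execute("INSERT INTO notes (body) VALUES (?)", (body,))


def _like_pattern(word):
    # Escape LIKE wildcards so the keyword is matched literally.
    escaped = word.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return f"%{escaped}%"


def recall(conn, query, limit=5):
    """Return the newest notes that contain every keyword in the query."""
    words = query.split()
    if not words:
        return []
    where = " AND ".join(["body LIKE ? ESCAPE '\\'"] * len(words))
    sql = f"SELECT body FROM notes WHERE {where} ORDER BY id DESC LIMIT ?"
    params = [_like_pattern(w) for w in words] + [limit]
    return [row[0] for row in conn.execute(sql, params)]
```

Passing a file path to `open_notes` instead of the in-memory default is what makes notes survive between sessions. The retrieved notes can then be prepended to the prompt sent to the model, which is one plausible way to carry project context forward without retyping it.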