Models and Model Parameters
We’ve installed Ollama locally and launched our first large language model (LLM). In this guide, we’ll explore the full catalog of models that Ollama supports and learn how to interpret their technical specifications. Whether you’re a beginner or an experienced developer, you don’t need deep AI knowledge to get started—our analogies and examples will clarify each concept.
To browse all available models, visit ollama.com/search. Below, we’ll show you how to search for models, inspect their details, and choose the right one for your project.
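Once you've found a candidate, you can manage it entirely from the terminal. Here's a minimal sketch of the basic workflow, assuming the Ollama daemon is running locally (the model tag is an example):

```bash
# Download a model from the Ollama registry
ollama pull llama3.2

# List every model stored locally
ollama list

# Remove a model to free disk space
ollama rm llama3.2
```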
Consider Jane, a developer building a local AI assistant. When deciding on a model, she balances:
- Output quality (accuracy, coherence)
- Computational requirements (RAM, CPU/GPU)
- Hardware availability (laptop vs. server)
Ollama’s catalog includes families like Meta’s Llama, Qwen’s QwQ, Mistral, and more. You’ll also see multiple versions within a family, such as Llama 3.3, 3.2, and 3.1. What do those numbers mean? Let’s dive in.
If you click a model on the website, you’ll see a detailed info page. For instance, on Llama 3.2 the specifications include its architecture, parameter count, and quantization format. You can also retrieve this info locally:

```bash
ollama run llama3.2
# At the >>> prompt, type:
/show info
```
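Either way, you should see a summary roughly like the sketch below (this mirrors the output of `ollama show llama3.2`; exact values vary by model version and tag):

```
Model
  architecture        llama
  parameters          3.2B
  context length      131072
  embedding length    3072
  quantization        Q4_K_M
```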
Understanding these terms will help you select a model that aligns with your application’s requirements and hardware constraints. Let’s break them down.
Architecture
A model’s architecture is its blueprint, defining the core design and the family it belongs to. For example, Llama 3.1, 3.2, and 3.3 all share the Llama architecture, Meta’s transformer-based model line.
Note
If you’ve already built pipelines around one architecture, sticking with the same family ensures consistent behavior.
Parameters
Parameters are the “knowledge” learned during training, stored as numerical weights. Think of a model as a library: each parameter is like a book on the shelf. More parameters = more books = more stored information.
- 3.2B parameters means 3.2 billion “books.”
- For comparison, GPT-3 has 175B parameters.
Larger models typically yield better accuracy but require more memory and compute power.
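Many families ship several sizes under different tags, so you can trade quality for footprint without switching ecosystems. A hedged example using the Llama 3.2 size tags listed on ollama.com (tag names can change over time):

```bash
# Smaller variant: faster and lighter, but less capable
ollama pull llama3.2:1b

# Larger variant: better output quality, more RAM and compute required
ollama pull llama3.2:3b
```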
Warning
High-parameter models can exceed your machine’s RAM and slow down inference. Choose a smaller model if you have limited resources.
Weights
Weights determine the strength of connections within the neural network. During training, these values are optimized—much like refining a recipe by adjusting ingredient ratios to get the best flavor.
- Each input feature is an ingredient.
- The weight assigns its importance in the final prediction.
Once training finishes, weights are fixed. If the model “knows” that 2 + 2 = 4, that pathway has a high weight compared to incorrect routes.
Context Length
Context length (or context window) is the maximum number of tokens the model can process at once. Think of it as how many pages of a book you can hand the model at a time.
- 131,072 tokens ≈ roughly 100K English words (a common rule of thumb is about 0.75 words per token).
Larger context lengths maintain coherence over longer documents or conversations but also increase memory usage.
Note
For long transcripts or codebases, choose a model with an extended context window to avoid cutting off important information.
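Also note that Ollama may load a model with a default context window smaller than the architecture’s maximum to conserve memory. You can raise it per session from the interactive prompt; a sketch (8192 is just an example value, and memory usage grows with it):

```bash
ollama run llama3.2
# Inside the >>> prompt:
/set parameter num_ctx 8192
```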
Embedding Length
When processing text, each token is converted into a vector of fixed length—this is the embedding length. Larger embeddings capture richer contextual information.
- 3,072-dimensional vector → each token is represented in 3,072 dimensions.
Big embeddings help the model understand subtle nuances but increase computational load.
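You can verify a model’s embedding length yourself by requesting an embedding from the local Ollama server and counting the vector’s dimensions. A rough sketch, assuming Ollama is running on its default port 11434 and jq is installed:

```bash
# Request an embedding for a short prompt, then count its dimensions
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "llama3.2", "prompt": "hello"}' | jq '.embedding | length'
# For Llama 3.2 this should print 3072
```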
Quantization
Quantization reduces numeric precision (e.g., from 32-bit floats to 4-bit integers) to save memory and speed up inference. It’s similar to compressing an image: you lose a bit of detail but gain storage and performance benefits.
Note
Quantized models (e.g., Q4, Q8) strike a balance between speed and accuracy, ideal for local development.
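The memory savings are easy to estimate: weight storage is roughly parameter count × bytes per weight, so a 3.2B-parameter model needs about 6.4 GB at 16-bit precision but only about 1.6 GB at 4-bit (plus overhead for the context cache). Many models publish several quantizations as separate tags; the tag names below are illustrative, so check the model’s Tags tab on ollama.com for the real ones:

```bash
# 4-bit quantization: smallest footprint, slight quality loss
ollama pull llama3.2:3b-instruct-q4_K_M

# 8-bit quantization: larger download, closer to full precision
ollama pull llama3.2:3b-instruct-q8_0
```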
Next, we’ll navigate back to the Ollama website to explore additional models and run a second experiment on our local machine.