nvidia-smi, system monitoring with htop, and cloud/provisioning considerations for GPU workloads.
System summary (neofetch)
A representative neofetch output from the example system:
Important architectural note
Discrete NVIDIA GPUs have dedicated VRAM that is separate from system RAM. Unlike Apple Silicon’s unified memory, discrete GPUs cannot transparently page model memory between system RAM and GPU VRAM. If a model requires more VRAM than the GPU provides (24 GB for an RTX 4090 in this example), it will usually fail to load unless you use model-specific sharding or offloading strategies.
Tip: model selection and performance
For RTX 4090-class GPUs you can typically run models that fit within ~22–24 GB of VRAM. Models that fit will often perform substantially faster than on many Apple M-series chips. When choosing models, check their VRAM footprint for the batch sizes and sequence lengths you plan to use.
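As a rough first check on whether a model fits, you can estimate the memory its weights need from the parameter count and the precision. The sketch below is only an approximation: the function names, the 2 GB overhead allowance, and the example model sizes are illustrative assumptions, and it ignores the KV cache and activations, which grow with batch size and sequence length.

```python
def weights_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


def fits_in_vram(params_billion: float, bits_per_param: int,
                 vram_gb: float = 24.0, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit within vram_gb.

    overhead_gb is a hypothetical allowance for the CUDA context and buffers;
    measure the real figure for your framework and workload.
    """
    return weights_gb(params_billion, bits_per_param) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Illustrative (parameter count in billions, bits per parameter) pairs.
    for size, bits in [(7, 16), (13, 16), (13, 4), (70, 4)]:
        need = weights_gb(size, bits)
        print(f"{size}B @ {bits}-bit: ~{need:.1f} GB weights, "
              f"fits in 24 GB: {fits_in_vram(size, bits)}")
```

Treat the result as a lower bound: a model that passes this check can still run out of VRAM at long sequence lengths or large batch sizes.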
Monitoring the GPU with nvidia-smi
NVIDIA ships nvidia-smi to inspect driver and CUDA versions, VRAM usage, power/temperature, and running GPU processes. Use it to verify drivers are installed correctly and to see which processes consume VRAM.
Example invocation and sample output:
nvidia-smi
- Driver Version and CUDA Version — required for many CUDA libraries and PyTorch/CUDA compatibility.
- Fan/Temp/Power and GPU Util — indicates idle vs heavy load.
- Memory-Usage — shows VRAM usage per GPU; if a model exceeds total VRAM it will not load.
- Processes — lists which processes use the GPU and how much VRAM they hold.
| Field | Meaning |
|---|---|
| Driver Version / CUDA Version | Installed driver version and the highest CUDA version that driver supports; check both against the requirements of CUDA-based frameworks |
| Fan / Temp / Power | Thermal and power telemetry to detect heavy workloads |
| Memory-Usage | VRAM allocation: used / total (important for model fitting) |
| GPU-Util | Percent utilization of the GPU compute engines |
| Processes | Active GPU processes and per-process VRAM consumption |
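The same fields can be read programmatically through nvidia-smi's CSV query mode. The following is a minimal sketch, assuming nvidia-smi is on the PATH and that every queried field returns a number (some fields can report `[N/A]` on certain GPUs or platforms, which this sketch does not handle). The function names and dictionary keys are illustrative choices.

```python
import csv
import io
import subprocess

# Field names accepted by `nvidia-smi --query-gpu`.
QUERY_FIELDS = ["name", "memory.used", "memory.total",
                "utilization.gpu", "temperature.gpu"]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        name, used, total, util, temp = (field.strip() for field in record)
        rows.append({
            "name": name,
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
            "memory_free_mib": int(total) - int(used),
            "gpu_util_percent": int(util),
            "temperature_c": int(temp),
        })
    return rows


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return parsed per-GPU telemetry."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Calling `query_gpus()` before loading a model gives you the free VRAM figure to compare against the model's expected footprint.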
Monitoring CPU and system processes
For host-level resource monitoring (CPU, system RAM, threads), htop or top are useful. Use them alongside nvidia-smi to correlate VRAM usage with system memory pressure.
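One way to correlate the two views is to list each GPU process's VRAM next to its host resident memory. The sketch below assumes a Linux host where the PIDs reported by nvidia-smi are visible under /proc (this may not hold inside containers) and where per-process VRAM is reported as a number. The function names are illustrative.

```python
import subprocess
from pathlib import Path


def parse_compute_apps(text: str) -> dict[int, int]:
    """Map PID to VRAM in MiB from 'pid, used_memory' CSV lines."""
    usage = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        pid, used = (part.strip() for part in line.split(","))
        usage[int(pid)] = int(used)
    return usage


def parse_vmrss_kib(status_text: str):
    """Return VmRSS in KiB from /proc/<pid>/status content, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def gpu_processes_with_host_rss() -> list[tuple]:
    """Return (pid, vram_mib, host_rss_kib) for each GPU compute process."""
    cmd = ["nvidia-smi",
           "--query-compute-apps=pid,used_memory",
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    report = []
    for pid, vram_mib in parse_compute_apps(out).items():
        status = Path(f"/proc/{pid}/status")
        rss = parse_vmrss_kib(status.read_text()) if status.exists() else None
        report.append((pid, vram_mib, rss))
    return report
```

A process with high VRAM and high host RSS at the same time is a hint that part of the model is held in system RAM, for example through offloading.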
Sample trimmed htop output:
Cloud and provisioning considerations
- Most cloud GPU offerings use NVIDIA hardware; the management plane typically abstracts the host OS. For GPU-heavy LLM work, NVIDIA instances are the common choice.
- If you need Apple Silicon specifically, some cloud providers expose macOS instances, but they are less common for GPU workloads.
- Memory rules described here apply across frameworks: Ollama, Hugging Face Transformers, llama.cpp, LM Studio, and various Llama frontends. If a model’s VRAM requirement exceeds what the GPU provides, you must use model sharding, tensor offloading, or other model-parallel strategies.
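To illustrate the offloading idea, the sketch below works out how many transformer layers could be placed on the GPU when the whole model does not fit, with the remainder staying in system RAM. Some runtimes expose a setting for this (for example, llama.cpp's `--n-gpu-layers` option). The function name, the equal-sized-layers simplification, and the 2 GB reserve are assumptions made for illustration.

```python
def gpu_layer_split(n_layers: int, layer_gb: float,
                    vram_gb: float = 24.0, reserve_gb: float = 2.0) -> tuple[int, int]:
    """Return (layers_on_gpu, layers_on_cpu) for a simple layer-wise offload.

    Assumes every layer is the same size and keeps reserve_gb of VRAM free
    for the KV cache and runtime buffers (a hypothetical allowance).
    """
    if layer_gb <= 0:
        raise ValueError("layer_gb must be positive")
    budget = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(n_layers, int(budget // layer_gb))
    return on_gpu, n_layers - on_gpu


if __name__ == "__main__":
    # Hypothetical 40 GB model: 80 layers of 0.5 GB each on a 24 GB GPU.
    gpu, cpu = gpu_layer_split(80, 0.5)
    print(f"{gpu} layers on GPU, {cpu} layers in system RAM")
```

Layers left in system RAM run much more slowly, so the split trades speed for the ability to load the model at all.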
| Characteristic | Discrete NVIDIA GPU (e.g., RTX 4090) | Apple M-series (unified memory) |
|---|---|---|
| Memory type | Dedicated VRAM (e.g., 24 GB) | Unified system/GPU memory (varies by model) |
| Paging between RAM and VRAM | Not transparent; model must fit VRAM or support sharding/offload | Unified memory and swap can allow larger models to run (with performance cost) |
| Typical outcome if model > memory | Usually fails to load without sharding | May run slowly using unified memory & swap |
| Best use case | High-performance inference for models that fit VRAM | Flexible development and smaller models; good integration on Apple devices |
- Discrete NVIDIA GPUs have isolated VRAM; a model must fit within that VRAM (roughly 24 GB on an RTX 4090) unless you use sharding/offloading.
- The RTX 4090 can deliver much faster inference for models that fit its VRAM compared to many M-series chips.
- Use nvidia-smi to check driver/CUDA versions, VRAM usage, power/temperature, and active GPU processes; use htop/top for system CPU and RAM monitoring.
- These principles apply across local LLM frameworks like Ollama, Hugging Face Transformers, llama.cpp, and LM Studio.
- Ollama course — Running local LLMs with Ollama
- Hugging Face Transformers documentation
- llama.cpp repository
- LM Studio