A Python utility that detects your hardware capabilities and recommends appropriate Large Language Models (LLMs) for local deployment.
- Cross-Platform Hardware Detection: Supports NVIDIA CUDA, Apple Metal Performance Shaders (MPS), and CPU-only configurations
- Smart Memory Analysis: Estimates usable VRAM/memory for AI workloads
- Model Recommendations: Suggests optimal LLM models based on your hardware capacity
- 4-bit Quantization Support: Calculations based on modern quantization techniques
- Beautiful HTML Reports: Generates modern, styled HTML reports with gradient backgrounds and interactive elements
- Automatic Browser Launch: Opens the generated report in your default browser
- Python 3.8+
- PyTorch
- psutil
1. Clone or download this repository
2. Create a virtual environment:

   ```bash
   python3 -m venv venv
   ```

3. Activate the virtual environment:
   - macOS/Linux: `source venv/bin/activate`
   - Windows: `venv\Scripts\activate`
4. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
Run the script to analyze your hardware and get model recommendations:

```bash
python ll_model_selection.py
```

The script will:
- Detect your hardware capabilities
- Print analysis results to the console
- Generate a beautiful HTML report (`hardware_report.html`)
- Automatically open the report in your default browser
```
### Hardware Audit: Darwin ###
* System RAM: 16.00 GB
* Backend: Apple Metal Performance Shaders (MPS)
* Unified Memory: 16.00 GB (Est. Usable for AI: 12.00 GB)

### Feasible Local Deployment (4-bit Quantization) ###
* Effective VRAM Cap: 12.00 GB
* Theoretical Max Parameter Count: ~13B

Recommended Models (4-bit Quantization):
- Llama-3.1-8B: Gold standard for consumer hardware
- Mistral-7B: Excellent general-purpose model
- Gemma-7B: Google's efficient open model
- Qwen2.5-7B: Strong multilingual capabilities

✅ HTML report generated: /path/to/hardware_report.html
Opening in browser...
```
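The final report-and-open step can be sketched with the standard library alone; `open_report` is a hypothetical helper name, not the script's actual API:

```python
import webbrowser
from pathlib import Path

def open_report(html: str, path: str = "hardware_report.html") -> Path:
    """Write the HTML report to disk, then open it in the default browser."""
    report = Path(path)
    report.write_text(html, encoding="utf-8")
    # A file:// URI works cross-platform; webbrowser.open() returns False
    # when no browser is available (e.g. headless environments).
    webbrowser.open(report.resolve().as_uri())
    return report
```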
The generated HTML report includes:
- Modern gradient design with purple-to-blue background
- Performance tier badges (Enterprise, Professional, Advanced, Standard, Entry)
- Color-coded stats for quick visual assessment
- Interactive hover effects on all cards
- Responsive layout that works on desktop and mobile
- Detailed model recommendations with descriptions
The tool uses the following formula to estimate maximum model size:
```
Max Parameters (B) = (Available VRAM - Context Overhead) / GB per Billion Parameters
```
Where:
- Context Overhead: 2 GB (for KV cache and runtime overhead)
- GB per Billion: 0.75 GB (for 4-bit quantization)
- NVIDIA CUDA: Detects GPU VRAM on Windows/Linux systems
- Apple MPS: Uses unified memory on Apple Silicon Macs
- CPU: Falls back to system RAM (with performance warnings)
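The detection order above can be sketched with PyTorch and psutil (both listed as dependencies); `detect_backend` is a hypothetical helper, and the exact logic in `ll_model_selection.py` may differ:

```python
import psutil
import torch

def detect_backend():
    """Return (backend_name, total_memory_gb), checking GPU backends first."""
    if torch.cuda.is_available():
        # Dedicated NVIDIA GPU: query VRAM on device 0
        vram = torch.cuda.get_device_properties(0).total_memory / 1e9
        return "CUDA", vram
    if torch.backends.mps.is_available():
        # Apple Silicon: the GPU shares the system's unified memory
        return "MPS", psutil.virtual_memory().total / 1e9
    # CPU fallback: system RAM, with much slower inference expected
    return "CPU", psutil.virtual_memory().total / 1e9
```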
| VRAM Range | Recommended Models |
|---|---|
| 48GB+ | Llama-3.1-70B, Qwen2.5-72B, Mixtral 8x7B |
| 24GB+ | Llama-3.1-70B (quantized), Command R 35B, Qwen2.5-32B |
| 16GB+ | Llama-3.1-13B, Qwen2.5-14B, Mistral-7B |
| 8GB+ | Llama-3.1-8B, Mistral-7B, Gemma-7B, Qwen2.5-7B |
| <8GB | Phi-3-mini, Gemma-2B, TinyLlama |
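The tier table can be expressed as a simple threshold lookup; `recommend_models` is an illustrative sketch mirroring the table above, not the script's actual implementation:

```python
from typing import List

def recommend_models(vram_gb: float) -> List[str]:
    """Map effective VRAM (GB) to the recommendation tiers in the table."""
    tiers = [
        (48, ["Llama-3.1-70B", "Qwen2.5-72B", "Mixtral 8x7B"]),
        (24, ["Llama-3.1-70B (quantized)", "Command R 35B", "Qwen2.5-32B"]),
        (16, ["Llama-3.1-13B", "Qwen2.5-14B", "Mistral-7B"]),
        (8,  ["Llama-3.1-8B", "Mistral-7B", "Gemma-7B", "Qwen2.5-7B"]),
    ]
    for floor_gb, models in tiers:
        if vram_gb >= floor_gb:
            return models
    return ["Phi-3-mini", "Gemma-2B", "TinyLlama"]

print(recommend_models(12.0))  # the 8GB+ tier from the sample report
```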
MIT License - feel free to use and modify as needed.