# 🔍 Diagnostic Guide: Timeout vs Memory

## How to identify the problem

### 1️⃣ Run the diagnostic tool

In your HF Space, execute:

```bash
python hf-spaces/diagnostic_tool.py
```

This tool will tell you **exactly** which kind of problem you have:

- ❌ **MEMORY_ERROR**: the system ran out of RAM
- ⏰ **TIMEOUT_ERROR**: the operation took too long
- ❓ **OTHER_ERROR**: another type of problem

### 2️⃣ Interpret the results

#### If you see "MEMORY_ERROR":

```
❌ PROBLEM DETECTED: OUT OF MEMORY
Memory used at failure: 15.8 GB (98.5%)
```

**Cause**: The model is too large for the memory available in HF Spaces.

**Solutions**:

1. **Use smaller models** (1B-1.7B parameters)
2. **Upgrade to HF Spaces PRO** (more RAM available)
3. **Use int8 quantization** (weights shrink to 1 byte per parameter: half of FP16, a quarter of FP32)
4. **Load models with `low_cpu_mem_usage=True`**

#### If you see "TIMEOUT_ERROR":

```
⏰ TIMEOUT ERROR after 298.5s
Memory used: 8.2 GB (51.2%)
```

**Cause**: The model takes too long to load even though memory is still available.

**Solutions**:

1. **Increase the timeout** from 300s to 600s or 900s
2. **Cache pre-loaded models** at startup
3. **Use faster-loading models**

## 🛠️ Implemented Solutions

### Solution 1: Increase Timeout (Easy)

Edit `hf-spaces/optipfair_frontend.py`:

```python
# Change from:
response = requests.post(url, json=payload, timeout=300)

# To:
response = requests.post(url, json=payload, timeout=600)  # 10 minutes
```

### Solution 2: Use Quantization (For memory issues)

Edit the model-loading code in the backend (note that bitsandbytes int8 quantization generally requires a GPU):

```python
from transformers import AutoModel, BitsAndBytesConfig

# Configure int8 quantization (weights drop to 1 byte per parameter)
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

model = AutoModel.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    low_cpu_mem_usage=True,
)
```

### Solution 3: Model Cache (For timeout)

Pre-load models at startup in `hf-spaces/app.py`:

```python
import logging
import threading

from transformers import AutoModel, AutoTokenizer

logger = logging.getLogger(__name__)

# Global model cache
MODEL_CACHE = {}

def preload_models():
    """Pre-load common models at startup."""
    common_models = [
        "meta-llama/Llama-3.2-1B",
        "oopere/pruned40-llama-3.2-1B",
    ]

    logger.info("🔄 Pre-loading common models...")
    for model_name in common_models:
        try:
            logger.info(f"  Loading {model_name}...")
            MODEL_CACHE[model_name] = {
                "model": AutoModel.from_pretrained(model_name, low_cpu_mem_usage=True),
                "tokenizer": AutoTokenizer.from_pretrained(model_name),
            }
            logger.info(f"  ✓ {model_name} loaded")
        except Exception as e:
            logger.warning(f"  ✗ Could not pre-load {model_name}: {e}")
    logger.info("✅ Pre-loading complete")

def main():
    # Pre-load models before starting services
    preload_models()

    # Rest of the code...
    fastapi_thread = threading.Thread(target=run_fastapi, daemon=True)
    fastapi_thread.start()
    # ...
```

### Solution 4: Improved Error Messages

Better error messages are already included to help you identify the problem:

```python
except requests.exceptions.Timeout:
    return (
        None,
        "❌ **Timeout Error:**\nThe model took too long to load (>5min). "
        "This is normal with large models. Options:\n"
        "1. Try a smaller model\n"
        "2. Wait and try again (the model may be caching)\n"
        "3. Contact the admin to increase the timeout",
        ""
    )
except MemoryError:
    return (
        None,
        "❌ **Memory Error:**\nNot enough RAM for this model. Options:\n"
        "1. Use a smaller model (1B parameters)\n"
        "2. Upgrade the Space (this model requires more memory "
        "than is available in HF Spaces)",
        ""
    )
```
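The messages above cover the frontend; if you also want the backend itself to report the same categories the diagnostic tool prints (MEMORY_ERROR / TIMEOUT_ERROR / OTHER_ERROR), the sketch below shows one way to do it. It assumes `psutil` is installed; the `load_and_classify` helper and the 300s budget are illustrative names, not part of `diagnostic_tool.py`.

```python
import time

import psutil
from transformers import AutoModel

LOAD_BUDGET_S = 300  # illustrative: mirrors the frontend's 300s request timeout

def load_and_classify(model_name: str):
    """Load a model; on failure, print the diagnostic category before re-raising."""
    start = time.time()
    try:
        return AutoModel.from_pretrained(model_name, low_cpu_mem_usage=True)
    except MemoryError:
        mem = psutil.virtual_memory()
        print(f"❌ MEMORY_ERROR — memory used at failure: "
              f"{mem.used / 1e9:.1f} GB ({mem.percent:.1f}%)")
        raise
    except Exception as exc:
        elapsed = time.time() - start
        # Heuristic: if the load blew past the budget before failing, call it a timeout
        if elapsed >= LOAD_BUDGET_S:
            mem = psutil.virtual_memory()
            print(f"⏰ TIMEOUT_ERROR after {elapsed:.1f}s — "
                  f"memory used: {mem.used / 1e9:.1f} GB ({mem.percent:.1f}%)")
        else:
            print(f"❓ OTHER_ERROR after {elapsed:.1f}s: {exc}")
        raise
```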
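The RAM column in the table below follows a simple rule of thumb: FP32 weights take 4 bytes per parameter (FP16: 2, int8: 1), so a model with N billion parameters needs roughly 4N GB in FP32. A quick back-of-the-envelope helper (the function name is ours, not from the repo):

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def estimate_ram_gb(params_billions: float, dtype: str = "fp32") -> float:
    """Rough weight-memory estimate; activations and overhead come on top."""
    return params_billions * BYTES_PER_PARAM[dtype]

# Examples matching the table: 1B FP32 -> ~4 GB, 70B FP32 -> ~280 GB
print(estimate_ram_gb(1))          # 4.0
print(estimate_ram_gb(70))         # 280.0
print(estimate_ram_gb(3, "int8"))  # 3.0 — a 3B model in int8 fits where FP32 would not
```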
## 📊 Model Size Comparison

| Model | Parameters | RAM Needed* | Load Time** |
|--------------|------------|-------------|-------------|
| Llama-3.2-1B | 1B | ~4 GB | ~30s |
| Llama-3.2-3B | 3B | ~12 GB | ~90s |
| Llama-3-8B | 8B | ~32 GB | ~240s |
| Llama-3-70B | 70B | ~280 GB | ~600s+ |

\*Without quantization, FP32.
\*\*On typical HF Spaces hardware.

## 🎯 Recommended Action Plan

1. **Run the diagnostic**:

   ```bash
   python hf-spaces/diagnostic_tool.py
   ```

2. **Read the results** and follow the specific recommendations
3. **Apply the appropriate solution**:
   - If timeout → increase the timeout or use the model cache
   - If memory → use smaller models or quantization
4. **Test again** with the adjusted configuration

## 📝 Useful Logs in HF Spaces

Check the logs in HF Spaces for messages like:

```
🔍 MODEL LOADING DIAGNOSTIC: meta-llama/Llama-3.2-1B
📊 INITIAL SYSTEM STATE:
   - Available memory: 12.50 GB
   - Used memory: 3.45 GB (21.6%)
⏳ Starting model loading (timeout: 300s)...
   [1/2] Loading tokenizer...
   ✓ Tokenizer loaded in 2.31s
   - Memory used: 3.48 GB (21.8%)
   [2/2] Loading model...
   ✓ Model loaded in 45.67s
✅ LOADING SUCCESSFUL in 47.98s
```

This tells you exactly how much memory and time each step uses.
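If your backend does not emit logs like these yet, here is a minimal sketch of instrumentation that produces output in the same style. It assumes `psutil` is available; `log_memory` and `load_with_logs` are illustrative names, not an existing API.

```python
import logging
import time

import psutil
from transformers import AutoModel, AutoTokenizer

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger(__name__)

def log_memory(prefix: str) -> None:
    """Log current RAM usage in the same style as the diagnostic output."""
    mem = psutil.virtual_memory()
    logger.info(f"{prefix} Memory used: {mem.used / 1e9:.2f} GB ({mem.percent:.1f}%)")

def load_with_logs(model_name: str):
    """Load tokenizer and model, logging the time and memory of each step."""
    logger.info(f"🔍 MODEL LOADING DIAGNOSTIC: {model_name}")
    log_memory("📊 INITIAL SYSTEM STATE:")

    start = time.time()
    logger.info("   [1/2] Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    logger.info(f"   ✓ Tokenizer loaded in {time.time() - start:.2f}s")
    log_memory("   -")

    step = time.time()
    logger.info("   [2/2] Loading model...")
    model = AutoModel.from_pretrained(model_name, low_cpu_mem_usage=True)
    logger.info(f"   ✓ Model loaded in {time.time() - step:.2f}s")

    logger.info(f"✅ LOADING SUCCESSFUL in {time.time() - start:.2f}s")
    return model, tokenizer
```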