# Language Identification Model
This repository contains a Multinomial Logistic Regression model built from scratch using NumPy. It is optimized to distinguish between Swahili, Dholuo, and Kalenjin based on character n-gram frequency distributions.
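Concretely, the classifier applies a softmax over linear scores of the bias-augmented n-gram count vector $x$, with one weight row $w_k$ per language (this is the `softmax(W · x)` step in the usage code below):

$$
P(y = k \mid x) = \frac{\exp(w_k^\top x)}{\sum_{j=1}^{3} \exp(w_j^\top x)}
$$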
## 📊 Performance Metrics
The model was trained on 325,071 training observations and evaluated on 36,120 test observations.
- Macro-averaged F1: 0.977
- Micro-averaged F1: 0.979
- Training epochs: 1 (reaching a peak micro-F1 of 0.980)
### Confusion Matrix (Test Set)
The counts are concentrated on the diagonal, indicating strong per-class accuracy:
|  | Pred Swahili | Pred Dholuo | Pred Kalenjin |
|---|---|---|---|
| True Swahili | 8,356 | 210 | 90 |
| True Dholuo | 135 | 11,368 | 127 |
| True Kalenjin | 60 | 137 | 15,637 |
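As a sanity check, both averaged F1 scores can be recomputed directly from the matrix above with a few lines of NumPy:

```python
import numpy as np

# Confusion matrix from the table above (rows = true, columns = predicted)
cm = np.array([
    [8356,   210,    90],   # True Swahili
    [ 135, 11368,   127],   # True Dholuo
    [  60,   137, 15637],   # True Kalenjin
])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)  # column sums = predicted counts per class
recall = tp / cm.sum(axis=1)     # row sums = true counts per class
f1 = 2 * precision * recall / (precision + recall)

print(f"Macro F1: {f1.mean():.3f}")            # 0.977
print(f"Micro F1: {tp.sum() / cm.sum():.3f}")  # 0.979 (equals accuracy for single-label tasks)
```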
## 🧪 Live Testing
Use the Inference Widget on the right to test the model immediately.
- Input: Enter a sentence in Swahili, Dholuo, or Kalenjin.
- Output: Probability scores for each language.
## 💻 Usage (Reference)
You can integrate this model into your Python workflow using the following code. It handles the download from Hugging Face and the custom vectorization logic.
### Local Inference

Requires `numpy` and `huggingface_hub`.
```python
import pickle
from collections import Counter

import numpy as np
from huggingface_hub import hf_hub_download

# 1. Download model weights and feature mapping from Hugging Face
model_path = hf_hub_download(repo_id="amidblue/ke-lang-id", filename="lang_id_model.pkl")
with open(model_path, "rb") as f:
    model_data = pickle.load(f)

def softmax(z):
    # Subtract the max score for numerical stability
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def extract_ngrams(text, n):
    # Slide a window of length n over the text,
    # e.g. n=3 on "habari" -> ["hab", "aba", "bar", "ari"]
    return ["".join(s) for s in zip(*[text[i:] for i in range(n)])]

def predict(text):
    # Vectorize the text based on the trained feature map
    ngrams = extract_ngrams(text, model_data["ngram_length"])
    counts = Counter(ngrams)
    x = np.zeros(len(model_data["feature_map"]))
    for ngram, count in counts.items():
        if ngram in model_data["feature_map"]:
            x[model_data["feature_map"][ngram]] = count
    # Add the bias term (1.0) at the start of the vector
    x_aug = np.insert(x, 0, 1.0)
    # Compute class scores and apply softmax
    z = model_data["W"].dot(x_aug)
    probs = softmax(z)
    return model_data["lang_list"][np.argmax(probs)]

# Example usage
text = "Kuna tashwishi ambao umetokea kulingana na mazungumzo ya wanaharakati"
print(f"Predicted Language: {predict(text)}")
```
### Inference API
Hosted inference via the Inference API is not available for this model type. For programmatic use, follow the local inference approach above; for a quick test, use the demo linked on the right.