# Language Identification Model
This repository contains a Multinomial Logistic Regression model built from scratch using NumPy. It is optimized to distinguish between Swahili, Dholuo, and Kalenjin based on character n-gram frequency distributions.
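Concretely, the classifier applies a softmax over linear scores of the bias-augmented n-gram count vector $x$, with one weight row $w_k$ per language (this is the `softmax(W · x)` step in the usage code below):

$$
P(y = k \mid x) = \frac{\exp(w_k^\top x)}{\sum_{j=1}^{3} \exp(w_j^\top x)}
$$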
## 📊 Performance Metrics
The model was trained on 325,071 training observations and evaluated on 36,120 test observations.
- Macro-averaged F1: 0.977
- Micro-averaged F1: 0.979
- Training epochs: 1 (reaching a peak micro-F1 of 0.980)
### Confusion Matrix (Test Set)
The counts are concentrated on the diagonal, indicating strong per-class accuracy:
|  | Pred Swahili | Pred Dholuo | Pred Kalenjin |
|---|---|---|---|
| True Swahili | 8,356 | 210 | 90 |
| True Dholuo | 135 | 11,368 | 127 |
| True Kalenjin | 60 | 137 | 15,637 |
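As a sanity check, both averaged F1 scores can be recomputed directly from the matrix above with a few lines of NumPy:

```python
import numpy as np

# Confusion matrix from the table above (rows = true, columns = predicted)
cm = np.array([
    [8356,   210,    90],   # True Swahili
    [ 135, 11368,   127],   # True Dholuo
    [  60,   137, 15637],   # True Kalenjin
])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)  # column sums = predicted counts per class
recall = tp / cm.sum(axis=1)     # row sums = true counts per class
f1 = 2 * precision * recall / (precision + recall)

print(f"Macro F1: {f1.mean():.3f}")            # 0.977
print(f"Micro F1: {tp.sum() / cm.sum():.3f}")  # 0.979 (equals accuracy for single-label tasks)
```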
## 🧪 Live Testing
Use the Inference Widget on the right to test the model immediately.
- Input: Enter a sentence in Swahili, Dholuo, or Kalenjin.
- Output: Probability scores for each language.
## 💻 Usage (Reference)
You can integrate this model into your Python workflow using the following code. It handles the download from Hugging Face and the custom vectorization logic.
### Local Inference

Requires `numpy` and `huggingface_hub`.
```python
import pickle
from collections import Counter

import numpy as np
from huggingface_hub import hf_hub_download

# 1. Download model weights and feature mapping from Hugging Face
model_path = hf_hub_download(repo_id="amidblue/ke-lang-id", filename="lang_id_model.pkl")
with open(model_path, "rb") as f:
    model_data = pickle.load(f)

def softmax(z):
    # Subtract the max score for numerical stability
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def extract_ngrams(text, n):
    # Slide a window of length n over the text,
    # e.g. n=3 on "habari" -> ["hab", "aba", "bar", "ari"]
    return ["".join(s) for s in zip(*[text[i:] for i in range(n)])]

def predict(text):
    # Vectorize the text based on the trained feature map
    ngrams = extract_ngrams(text, model_data["ngram_length"])
    counts = Counter(ngrams)
    x = np.zeros(len(model_data["feature_map"]))
    for ngram, count in counts.items():
        if ngram in model_data["feature_map"]:
            x[model_data["feature_map"][ngram]] = count
    # Add the bias term (1.0) at the start of the vector
    x_aug = np.insert(x, 0, 1.0)
    # Compute class scores and apply softmax
    z = model_data["W"].dot(x_aug)
    probs = softmax(z)
    return model_data["lang_list"][np.argmax(probs)]

# Example usage
text = "Kuna tashwishi ambao umetokea kulingana na mazungumzo ya wanaharakati"
print(f"Predicted Language: {predict(text)}")
```
### Inference API
Hosted inference via the Inference API is not available for this model type. For programmatic use, follow the local inference approach above; for a quick test, use the demo linked on the right.