AI & ML interests

Unlocked LLM

Cossale
posted an update 9 days ago
Releasing 8 multilingual datasets from the People's Archive of Rural India (PAARI).
Indian languages represent 1B+ speakers but remain underrepresented in quality training data. These datasets help address that gap.
Languages: Hindi, Urdu, Punjabi, Tamil, Telugu, Marathi, Gujarati, English
Scripts: Devanagari, Arabic, Gurmukhi, Tamil, Telugu, Gujarati
Total: 7,650 articles, 19.9M tokens, 51MB
Content covers rural life, agriculture, social issues, and cultural traditions. Professionally written journalism, not web scrapes.
Free to use.
Collection: https://huggingface.co/collections/keplersystems/paari-datasets
Technical details: https://kepler.systems/blog/introducing-paari-datasets
ehartford
posted an update almost 2 years ago