The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text • Paper • 2506.05209 • Published Jun 5 • 58
Lapa v0.1.2 Release • Collection • Release of SOTA Ukrainian LLM and Datasets • 18 items • Updated 24 days ago • 23
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks • Paper • 2005.11401 • Published May 22, 2020 • 14
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only • Paper • 2306.01116 • Published Jun 1, 2023 • 41