GAIR-ProX

community

https://gair-nlp.github.io/ProX/

AI & ML interests

NLP Research

Organization Card



GAIR-ProX, a subsidiary of GAIR, spearheads the 🫐 ProX Project, which aims to make pre-training more efficient by using language models to refine corpus documents at scale. 🫐 ProX expresses each refinement operation (e.g., document-level filtering and chunk-level cleaning) as a scalable, executable program, with the goal of improving pre-training data quality and ultimately producing more robust and efficient language models.
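To make the idea of "refinement as executable programs" concrete, here is a minimal Python sketch of the two stages named above. It is an illustration under assumptions, not the project's actual implementation: the operation names (`drop_doc`, `keep_doc`, `remove_lines`, `normalize`, `keep_chunk`), the tuple-based program format, and the function names are all hypothetical, and the programs are hard-coded here, whereas in a pipeline of this kind they would be produced by a language model.

```python
from typing import List, Optional, Tuple

# A "program" is a list of (operation_name, arguments) pairs.
# This format and all operation names are hypothetical, for illustration only.
Op = Tuple[str, tuple]


def run_doc_program(doc: str, program: List[Op]) -> Optional[str]:
    """Document-level filtering: keep the whole document or drop it."""
    for name, _args in program:
        if name == "drop_doc":
            return None
        if name != "keep_doc":
            raise ValueError(f"unknown document-level operation: {name}")
    return doc


def run_chunk_program(chunk: str, program: List[Op]) -> str:
    """Chunk-level cleaning: remove noisy lines and normalize strings."""
    lines = chunk.split("\n")
    to_remove = set()  # indices refer to the ORIGINAL line numbering
    for name, args in program:
        if name == "remove_lines":
            start, end = args  # inclusive, 0-based line range
            to_remove.update(range(start, end + 1))
        elif name == "normalize":
            source, target = args  # plain string replacement
            lines = [ln.replace(source, target) for ln in lines]
        elif name == "keep_chunk":
            continue
        else:
            raise ValueError(f"unknown chunk-level operation: {name}")
    return "\n".join(ln for i, ln in enumerate(lines) if i not in to_remove)


# Toy input: a navigation bar, two content lines, and a page footer.
chunk = (
    "Home | Login | Share\n"
    "ProX refines corpora.\n"
    "It uses small models.&nbsp;\n"
    "Page 3 of 9"
)
chunk_program = [
    ("remove_lines", (0, 0)),
    ("remove_lines", (3, 3)),
    ("normalize", ("&nbsp;", "")),
]

kept = run_doc_program(chunk, [("keep_doc", ())])
cleaned = run_chunk_program(kept, chunk_program) if kept is not None else None
print(cleaned)
```

Collecting line removals into a set before applying them keeps every line index tied to the original chunk, so one `remove_lines` call does not shift the targets of the next.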

Read our technical report!

Collections 3

models 10

gair-prox/FW-ProX-1.7B

Text Generation • Updated 2 days ago • 9 • 2

gair-prox/Mistral-7B-ProXMath

Text Generation • Updated 2 days ago • 11 • 2

gair-prox/Llama-2-7B-ProXMath

Text Generation • Updated 10 days ago • 4

gair-prox/TinyLlama-1.1B-ProXMath

Updated 10 days ago • 2

gair-prox/CodeLlama-7B-ProXMath

Updated 10 days ago • 7

gair-prox/RedPJ-ProX-0.3B

Updated 10 days ago • 4

gair-prox/C4-ProX-1.7B

Updated 10 days ago • 4

gair-prox/RedPJ-ProX-1.7B

Updated 10 days ago • 4

gair-prox/RedPJ-ProX-0.7B

Updated 10 days ago • 4

gair-prox/ProX-RedPJ-1.7B-25B

Updated 11 days ago • 4

datasets 4

gair-prox/RedPajama-pro

Viewer • Updated 2 days ago • 10.2M

gair-prox/c4-pro

Viewer • Updated 2 days ago • 40.1M

gair-prox/open-web-math-pro

Viewer • Updated 2 days ago • 2.58M • 2

gair-prox/FineWeb-pro

Viewer • Updated 2 days ago • 63.1M • 5 • 4