A team of researchers from MIT, Cornell, the University of Toronto, and other institutions has successfully trained a large language model using only openly licensed or public domain data, The Washington Post reports. The dataset, called Common Pile v0.1, comprises over eight terabytes of text and required extensive manual labor to verify copyright status and reformat content.
“This isn’t a thing where you can just scale up the resources that you have available,” said Stella Biderman, coauthor and director at EleutherAI. “We use automated tools, but all of our stuff was manually annotated… and checked by people.”
Their model, built with seven billion parameters, competes with Meta’s Llama 1 and 2, despite limited funding. The work challenges the tech industry’s claim that ethical data sourcing is impossible. While it does not eliminate every ethical concern, the effort highlights a path forward. “Even partial transparency has a huge amount of social value,” Biderman noted.