Comment on What happened to techbros from the 90s to now?
SnotFlickerman@lemmy.blahaj.zone 2 days ago
huggingface.co/datasets/…/the_pile_books3
This dataset is Shawn Presser’s work and is part of EleutherAi/The Pile dataset.
This dataset contains all of bibliotik in plain .txt form, aka 197,000 books processed in exactly the same way as was done for bookcorpusopen (a.k.a. books1). It seems to be similar to OpenAI’s mysterious “books2” dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know very little about any differences. People suspect it’s “all of libgen”, but it’s purely conjecture.
I say “well known” because it was literally in the description when it was initially uploaded to the internet. It was always right out in front that this was all the ebooks from the private torrent tracker Bibliotik. Shawn Presser/books3 never lied about where it came from.
BaroqueInMind@lemmy.one 2 days ago
Thank you for the links and reading!