
#Text deduplicator plus install
To run the rust deduplicator you will need to install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

You'll also need a C compiler; sudo apt-get install gcc will do that if you don't have one already. If you additionally want to generate datasets to run the rust script on (and you probably do, at least to follow this demo) then you will also need the python dependencies used by the demo scripts.
If your machine is big enough, there should be no upper bound on the size of the dataset the code can handle (well, 2^64-1 bytes is the limit, but I think we can all agree that's essentially unlimited). You will also need >1TB of hard drive space. Small datasets can be deduplicated on a fairly modest machine; for the large datasets in the paper we used a machine with 600GB of RAM.

We build a suffix array (based on Andrew Gallant's suffix array implementation) in src/table.rs. It has some minor changes from the original version, which is why we can't just import that library as a crate. The original implementation says that u32 works for "reasonably sized documents (~4GB)", but we're working with unreasonably sized documents, so the table is built over 64-bit offsets instead. Everything is a byte array, because we might be working over token sequences which aren't valid UTF-8.

The main complication in the rest of src/main.rs is that we want things to run in parallel, and we probably can't fit the entire suffix array into memory. And so all of our algorithms are designed around these constraints.
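As a rough illustration of those constraints (and not the repository's actual src/table.rs), the sketch below builds a suffix table over a raw byte buffer using 64-bit offsets and serializes it to disk. A real run would use a fast construction algorithm and file-backed arrays rather than this naive in-memory sort; the file name and helper names here are made up for the example.

```rust
// Toy illustration of the data layout described above (NOT the repository's
// implementation): the text is a plain byte array and the suffix table is a
// flat list of u64 offsets that can be written to disk and read back later,
// instead of being held entirely in RAM.
use std::fs::File;
use std::io::{BufWriter, Write};

/// Naively build a suffix array over raw bytes, returning 64-bit offsets.
/// Real implementations use far faster algorithms; this O(n^2 log n) version
/// only shows the shape of the output.
fn build_suffix_array(text: &[u8]) -> Vec<u64> {
    let mut table: Vec<u64> = (0..text.len() as u64).collect();
    table.sort_by(|&a, &b| text[a as usize..].cmp(&text[b as usize..]));
    table
}

/// Serialize the table as little-endian u64s, one per suffix.
fn write_table(path: &str, table: &[u64]) -> std::io::Result<()> {
    let mut out = BufWriter::new(File::create(path)?);
    for &offset in table {
        out.write_all(&offset.to_le_bytes())?;
    }
    out.flush()
}

fn main() -> std::io::Result<()> {
    // Bytes, not a &str: token streams are not guaranteed to be valid UTF-8.
    let text: &[u8] = b"abracadabra abracadabra";
    let table = build_suffix_array(text);
    write_table("suffixes.u64", &table)?;
    println!("wrote {} suffix offsets", table.len());
    Ok(())
}
```

Keeping the table as a flat array of fixed-width integers on disk is what lets later passes stream over it, or map pieces of it, instead of holding the whole thing in memory.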

We also release the document clusters resulting from running NearDup deduplication on C4, RealNews, LM1B, and Wiki-4B-en. This is not an officially supported Google product.

When datasets are created by scraping raw text from the Internet, this will often result in the same sequences being repeated multiple times (e.g., we find a single 50-word sequence that is repeated in the C4 dataset 60,000 times). Training models on deduplicated datasets is faster (because they see fewer total examples) and experimentally results in models with similar or better perplexity than models trained on data that hasn't been deduplicated. Moreover, language models are less likely to exhibit memorization when their training data has been well-deduplicated.

If you use this repository or our deduplicated datasets you can cite:

    @inproceedings{lee2022deduplicating,
      title = "Deduplicating Training Data Makes Language Models Better",
      author = "Lee, Katherine and Ippolito, Daphne and Nystrom, Andrew and Zhang, Chiyuan and Eck, Douglas and Callison-Burch, Chris and Carlini, Nicholas",
      booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
      publisher = "Association for Computational Linguistics",
      year = "2022"
    }

We provide an implementation of the exact deduplication technique used in the paper. This is very much research code: it works well for what we designed it to do, which is to deduplicate text datasets, but it might not directly do what you want it to do. We did clean it up fairly significantly for a Version 1.0.0 release (see below for release history).
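To make the idea concrete, here is a minimal sketch of exact-substring duplicate detection, not the paper's tooling: once suffixes are sorted, any sequence that occurs twice shows up as a long shared prefix between adjacent suffixes, so scanning adjacent pairs for prefixes above a length threshold finds the duplicated spans. The 10-byte threshold and toy string below are illustrative only; the paper works at a much larger threshold (on the order of the 50-word sequences mentioned above).

```rust
// Minimal sketch of the exact-substring idea (not the repository's code):
// any byte sequence that occurs twice appears as a long common prefix
// between two adjacent entries of the sorted suffix array.
fn common_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Return (position_a, position_b, length) for adjacent-suffix matches of at
/// least `threshold` bytes. Counting raw bytes here keeps the example simple;
/// the paper thresholds on tokens instead.
fn find_duplicates(text: &[u8], threshold: usize) -> Vec<(usize, usize, usize)> {
    // Naive in-memory suffix array; the real tool builds this out-of-core.
    let mut suffixes: Vec<usize> = (0..text.len()).collect();
    suffixes.sort_by(|&a, &b| text[a..].cmp(&text[b..]));

    let mut hits = Vec::new();
    for pair in suffixes.windows(2) {
        let (a, b) = (pair[0], pair[1]);
        let len = common_prefix_len(&text[a..], &text[b..]);
        if len >= threshold {
            hits.push((a, b, len));
        }
    }
    hits
}

fn main() {
    let text: &[u8] = b"the cat sat on the mat. the cat sat on the rug.";
    for (start_a, start_b, len) in find_duplicates(text, 10) {
        println!("offsets {} and {} share a {}-byte substring", start_a, start_b, len);
    }
}
```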

We release the ExactSubstr deduplication implementation (written in Rust) along with the scripts we used in the paper to perform ExactSubstr deduplication and inspect the results (written in Python).

#Text deduplicator plus code
This repository contains code to deduplicate language model datasets as described in the paper "Deduplicating Training Data Makes Language Models Better" by Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch and Nicholas Carlini.
