Why is plagiarism an accomplishment? I can reproduce the entire Harry Potter book with a scanner and some OCR software.
That’s the point of the article, I think. They’re saying AI models are copyright violators.
It isn’t. It’s a sign of poor-quality work.
It’s also interesting that this is the most conservative, pro-“it’s not just memorizing” estimate possible: they multiplied the probabilities of consecutive tokens. Basically, it means that once it starts shitting out a quote, it can’t stop quoting until their anti-“copy the whole book” fine-tuning kicks in after 50 words or so.
It can probably output far more under a realistic test (always picking the top token, temperature = 0).
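To put toy numbers on that, here is a rough sketch. Everything in it is made up for illustration: the 0.9 per-token probability, the 50-token passage, and the helper names `sequence_probability` and `greedy_matches` are my assumptions, not figures or code from the article. It just shows why multiplying per-token probabilities gives a small number even when top-token decoding would reproduce the passage every time.

```python
import math

def sequence_probability(steps, target):
    """Chance of sampling the exact passage: product of per-token probabilities."""
    # Sum logs instead of multiplying directly to avoid underflow on long passages.
    log_p = sum(math.log(dist[tok]) for dist, tok in zip(steps, target))
    return math.exp(log_p)

def greedy_matches(steps, target):
    """True if the passage's token is the most likely one at every step."""
    return all(max(dist, key=dist.get) == tok for dist, tok in zip(steps, target))

# Hypothetical 50-token passage where the model gives the "right" token 0.9 each time.
target = [f"tok{i}" for i in range(50)]
steps = [{tok: 0.9, "other": 0.1} for tok in target]

p = sequence_probability(steps, target)
greedy = greedy_matches(steps, target)

print(f"product of token probabilities: {p:.4f}")
print(f"top-token decoding reproduces the passage: {greedy}")
```

With these invented numbers the product comes out to roughly half a percent, while always picking the top token gets the whole passage back on every run.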
Which half?
Just the pronouns and articles, some of the verbs and adjectives