blagie 5 hours ago

I asked AI to complete an AGPL code file I wrote a decade ago. It did a pretty good job. What came out wasn't 100% identical, but clearly a paraphrased copy of my original.

Even if we accept the house of cards of shaky arguments this essay is built on, if only for the sake of argument, OpenAI still breaks my copyright by having a computer "memorize" my work. That's a form of copying.

If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation. If I can paraphrase it, ditto. If I encode it in a different format (e.g. bits on magnetic media, or weights in a model), it still includes a duplicate.

On the face of it, OpenAI, Hugging Face, Anthropic, Google, and all other companies are breaking copyright law as written.

Usually, when reality and law diverge, it's the law that eventually shifts, not reality. Personally, I'm not a big fan of copyright law as written. We should have a discussion of what it should look like. That's a big discussion. I'll make a few claims:

- We no longer need to encourage technological progress; it's moving fast enough. If anything, slowing it down makes sense.

- "Fair use" is increasingly vague in an era where I can use AI to take your picture, tweak it, and reproduce an altered version in seconds

- Transparency is increasingly important as technology defines the world around us. If the TikTok algorithm controls elections, and Google analyzes my data, it's important that I know what those systems are doing.

That's the bigger discussion to have.

  • protimewaster 5 hours ago

    > If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation. If I can paraphrase it, ditto.

    Yeah, that's something that I've not seen a good answer to from the "everything AI does is legal" people. Even if the training is completely legal, how do you verify that the generated output is not illegally similar to a copyrighted work that was ingested? Humans get in legal trouble if they produce a work that's too similar. Does AI not? If AI doesn't, can I just write an AI whose job is to reproduce copyrighted content and now I have a loophole to reproduce copyrighted content?

    • bko 4 hours ago

      I think you have to be practical. It would be difficult to train an AI to consume Harry Potter and compress it but prevent it from recreating it. You can try, and people do, but there are always ways around it.

      But it's on an individual prompt basis. It's not like ChatGPT can produce the entirety of the text and sell it as a PDF. It's just a device that could reproduce it, much like a word processor is a device with which you can read the book and type out its contents.

      So the question is one of practicality. Do we ensure that no copyrighted material is in the training data? Difficult, but probably not impossible. But what you can't do is target the content in all its various other forms: descriptions of the plot, reviews, fan fiction, etc. So in the end it's pretty much a lost cause.

      So what to do about it? I don't know. In the utilitarian sense, I think the world in which this technology exists in a non-crippled form is a better, richer world than one in which there are all these procedural steps trying to prevent it (and ultimately failing).

      What's the harm here? Are people not buying Harry Potter books and just having an LLM painfully recreate the plot? I would imagine Harry Potter fans would be able to explore their love of the media through LLMs, and that would drive more revenue to Harry Potter media, much like fan fiction and pirated music led to more engagement and concert sales.

      In the case of new art, maybe fewer artists get commissioned, but let's be real, Mike Tyson wasn't going to contract an artist to create a Ghibli-style animation of him anyway, so there's really little harm to artists in LLMs here. If anything it expands the market and interest.

      • mdp2021 4 hours ago

        > So what to do about it

        We proceed towards AGIs that implement proper understanding, and have them read all the masterpieces and essays and textbooks (otherwise they will be useless), as is fully legitimate in any system that provides for libraries.

    • the_snooze 5 hours ago

      Seems like so much tech "innovation" these days is really just to sneak around laws and social norms in pursuit of a rent-seeking position.

    • mdp2021 4 hours ago

      > Humans get in legal trouble if they [???] a work that's too similar

      If they sell a work that's too similar.

      What intellectuals do is quote. That is of course legal.

    • morkalork 5 hours ago

      If gen-AI had been used to produce Nosferatu there still would have been a case, right?

  • naming_the_user 5 hours ago

    Cleanroom implementation comes to mind.

    If I just remember the source code of a 100 line program and then reproduce it verbatim a week later that doesn’t suddenly make it a new work.

    • franktankbank 3 hours ago

      This is why I don't believe in restrictive licensing of open work.

  • HPsquared 5 hours ago

    Maybe the infringement occurs when a user uses the model to produce the facsimile output.

    • throwaway173738 5 hours ago

      Good idea. Let’s make it a minefield of copyright infringement for the user so they never know whether it’s emitting something novel or it’s emitting AGPL code.

      • formerly_proven 5 hours ago

        That's what the legal department at my employer (huge multinational corp) came up with: When an employee uses one of the approved gen-ai models it's on the employee to check the output isn't infringing - legal argues that not doing so would be grossly negligent, making the employee personally liable.

        • throwaway173738 2 hours ago

          If my company had a policy like that I would just not use AI tools at all for anything.

      • CuriouslyC 5 hours ago

        Why the heavy-handed internet sarcasm? YouTube has handled this exact issue with fingerprints for a while.
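
        As a rough sketch of the idea (Content ID itself is proprietary and works on audio/video fingerprints, so this is only a hypothetical text analogue): hash overlapping word n-grams into a fingerprint set and flag output that overlaps a protected work too heavily.

            def shingles(text, n=8):
                # Hash overlapping n-word windows into a crude fingerprint set.
                words = text.lower().split()
                return {hash(tuple(words[i:i + n])) for i in range(len(words) - n + 1)}

            def jaccard(a, b):
                # Overlap between two fingerprint sets, from 0.0 to 1.0.
                return len(a & b) / len(a | b) if a and b else 0.0

            THRESHOLD = 0.3  # made-up cutoff; a real system would tune this empirically

            def looks_infringing(generated, protected_works):
                # Flag output whose fingerprint overlaps any protected work too much.
                gen = shingles(generated)
                return any(jaccard(gen, shingles(w)) > THRESHOLD for w in protected_works)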

    • SilasX 5 hours ago

      I agree that you should call the output copyright infringement (or not) without regard to how you got there. So if it produces an identical copy of the text, or of, e.g., Indiana Jones, which you then distribute, sure, that is copyright infringement.

      But the mere act of using them for training and producing new works shouldn't be! In fact, until 2022, pretty much no one regarded it as a copyright violation to "learn from copyrighted works to create new ones" -- just the opposite! That's how it's supposed to work!

      Only when hated corporations did it with bots did the internet hive mind suddenly decide that's stealing and take this expansive view of IP rights (while, of course, having historically screamed bloody murder about any attempt to fight piracy).

  • mdp2021 4 hours ago

    > If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation

    Of course not: in fact, memorizing has always been a right. (Edit: that would depend on what is meant by "reproduction", though. As written elsewhere, properly done quoting is of course fair use.)

    > If I can paraphrase it, ditto

    Even more so: people have lost legal actions because they sued authors of parodies.

  • pitaj 4 hours ago

    > If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation.

    Yes.

    > If I can paraphrase it, ditto.

    Not necessarily. Summarizing, for instance, is typically fair use.

  • dietr1ch 5 hours ago

    What's stopping me from paraphrasing movies by peppering the least significant color bits? Would that make copying them legal?
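
    Concretely, something like this toy sketch (assuming Pillow and NumPy, and a hypothetical frame.png): every byte of the file changes, while the picture stays visually identical.

        import numpy as np
        from PIL import Image

        # Replace the least significant bit of every color channel with noise.
        frame = np.array(Image.open("frame.png"), dtype=np.uint8)
        noise = np.random.randint(0, 2, size=frame.shape, dtype=np.uint8)
        peppered = (frame & 0xFE) | noise  # keep the top 7 bits, randomize the LSB
        Image.fromarray(peppered).save("frame_peppered.png")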

    • wongarsu 4 hours ago

      By that reasoning any VHS copy would be legal. Pretty sure Hollywood takes a slightly broader view on what constitutes a copy.

  • tempodox 5 hours ago

    Even if we define “AI” as lossy storage with random error insertions, it still amounts to unlicensed reproduction.

    • mdp2021 5 hours ago

      > if we define “AI” as

      something completely oblivious of the past, it would be a great disservice.

      > as lossy storage

      well that would just be a great misunderstanding of what "learning" is.

      • Jensson 42 minutes ago

        > well that would just be a great misunderstanding of what "learning" is.

        You need to assume a malicious actor here: it's trivial to use an AI as lossy storage, and if that counts as legal copyright washing, then large corporations will do it at industrial scale.
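
        As a toy sketch of what I mean (PyTorch, with a placeholder string standing in for the protected text): overfit a tiny network until its weights store the text verbatim, at which point the "model" is just an obfuscated copy.

            import torch
            import torch.nn as nn

            text = "any text you want to smuggle through, verbatim"  # stand-in
            chars = sorted(set(text))
            x = torch.arange(len(text))                          # position index
            y = torch.tensor([chars.index(c) for c in text])     # character class

            # One free embedding per position makes memorization trivial.
            model = nn.Sequential(nn.Embedding(len(text), 64),
                                  nn.Linear(64, len(chars)))
            opt = torch.optim.Adam(model.parameters(), lr=0.01)
            loss_fn = nn.CrossEntropyLoss()
            for _ in range(500):
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()

            recovered = "".join(chars[i] for i in model(x).argmax(dim=1).tolist())
            print(recovered == text)  # True once the loss is ~zero: the weights are the copy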

    • mysterydip 5 hours ago

      Like saving an image as a JPEG doesn't make it a new work.

  • SilasX 4 hours ago

    >Even if we accept the house-of-cards of shaky arguments this essay is built on, even just for the sake of argument, where Open AI breaks my copyright is by having a computer "memorize" my work. That's a form of copy.

    No, it isn't, unless you're also going to call it copyright infringement when search engines store internal, undistributed copies of websites for purposes of helping answer your search queries.

    Edit: or, for that matter, accessing a website at all, which creates a copy on your computer.

    • toshinoriyagi 4 hours ago

      Correct, the above poster ignored the four factors of fair use:

      1. Is the use educational or commercial?

      2. What is the nature of the underlying work? (Creative works are more protected.)

      3. Is the use transformative?

      4. What is the effect on the underlying work's market?

      Search engines do not make their internal copies available, compete in an entirely different market (that benefits the makers of the underlying works) and are considered quite transformative because they enable discovering vast information on the internet.

      On the other hand, almost zero LLMs/text-to-image generators are educational in nature (certainly none of the ones being sued for copyright infringement). They are frequently trained on highly creative works like art and writing. Some of the output could be transformative, depending on where on the learned data manifold the output of your request lies, but a huge amount is similar to the training data. Lastly, these models have an outsized negative impact on the underlying markets, being vastly cheaper than human labor, and of dubious quality at that.

      • SilasX 4 hours ago

        What is that replying to? I was addressing the argument that "well, you stored the work in memory, therefore you infringed copyright". It seems like you agree that that jump doesn't follow, just as I argued?

        • toshinoriyagi 4 hours ago

          You are totally correct, I misread your original comment and thought you took the opposite stance.

  • wiseowise 5 hours ago

    > If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation.

    By your logic, anyone with a good enough memory violates copyright law just by the act of remembering something.

    • Apreche 5 hours ago

      No, because you don’t actually violate copyright law until you produce and distribute copies.

      It’s perfectly legal to memorize a book, type up a copy from memory, and to even print out a copy that you keep for yourself. But as soon as you start trying to sell or distribute copies, even for free, now you’re breaking the law as written.

    • great_wubwub 5 hours ago

      No - it's the reproduction, not the memorization.

      • wiseowise 5 hours ago

        So training AI on copyrighted data isn’t a problem unless it spits out the data verbatim. Correct?

        • lelanthran 4 hours ago

          Close.

          It isn't a problem until it spits out a similar enough copy.

          Copyright violation doesn't have to be verbatim; taking a 4K movie and re-encoding it to 320x200 before distribution isn't legal.

      • greyface- 5 hours ago

        It's reproduced in the memorizer's neural connectome.

    • earthnail 5 hours ago

      That was actually the case in music until a decade or so ago. It led to ridiculous lawsuits (for example Ed Sheeran’s).

      Previously, artists needed to prove they hadn’t heard the song they were accused of infringing. That was virtually impossible, because there’s a lot of music you can hear anywhere, even just a car driving by. Artists continually lost these court cases.

      Nowadays the burden of proof is luckily no longer on the defendant. But I think that only changed a decade ago or so, thanks to some efforts by music industry lawyer Damien Riehl. I know, ridiculous.

    • shakna 5 hours ago

      Memory? No.

      Reproduction? Yes.

      If you write down a sizeable quote, that can be a copyright violation.

    • ArinaS 5 hours ago

    I don't see that point in the original comment. Remembering copyrighted content ≠ reproducing it verbatim.

    • techpineapple 5 hours ago

      This doesn’t seem true. I mean, it might be true if memory could be seen or manipulated, but what would you bring into a court of law to prove that I remembered something too clearly?

basch 5 hours ago

"I think the unambiguous answer to this question is that the act of training is viewing and analysis, not copying. There is no particular copy of the work (or any copyrightable elements) stored in the model. While some models are capable of producing work similar to their inputs, this isn’t their intended function, and that ability is instead an effect of their general utility. Models use input work as the subject of analysis, but they only “keep” the understanding created, not the original work."

The author just seems to have decided the answer and worked backwards, when in reality this is very much a Ship of Theseus type problem. At what point does a compressed JPEG stop being the original image and become a transformation? The same thing applies here. If I ask a model to recite Frankenstein and it largely does, is that not a lossy compression of the original? Would the author argue an MP3 isn't a copy of a song because not all the information is there?

Calling it "training" instead of compression lets the author play semantic games.

  • Retr0id 5 hours ago

    There clearly is a point when a compressed jpeg becomes a transformation, even if the precise point is ambiguous.

    Take 'The Bee Movie at 3000% speed except when they say "bee"', for example - https://www.youtube.com/watch?v=7apltfVJBwU. It hasn't been taken down for over 5 years so I'm going to assume it's considered acceptable/transformational use.

    Personally, I'd say what matters is whether you'd plausibly use the transformed/compressed version as a drop-in substitute for the original. ChatGPT can probably reproduce the complete works of Shakespeare verbatim if prompted appropriately, but is anyone seriously going to read it that way?

    • kemotep 4 hours ago

      Do you know that YouTube hasn’t reassigned the video’s ad revenue to the IP holder and the IP holder hasn’t requested it be taken down because they now receive compensation for it from YouTube?

    • basch 5 hours ago

      Agreed. So not EVERY LLM is automatically copying in principle, but most of the current implementations probably retain TOO MUCH of the original sources to NOT be copies.

  • mdp2021 5 hours ago

    The author in the quote wrote «understanding», whereas the poster here is talking of «compress[ion]», and the two are very different.

    Understanding a text is not memorizing a details-fuzzy version of it.

    • basch 5 hours ago

      At some threshold, understanding becomes memorization, which becomes the ability to recite. They're not two different things; they're points on a spectrum.

      • mdp2021 4 hours ago

        > becomes

        Why should it? "The artist sees in the wild horse a metaphor of death, as expressed in the fleeting image painted with solemnity, stillness and evanescence". That could be a (sketchy) example of understanding - and it does not need a verbatim or interpolated text to be there.

  • Calwestjobs 4 hours ago

    Yeah, but is the IP the exact wording/data, or is it the patterns underneath?

EdwardDiego 5 hours ago

That's a lot of words to justify what I presume to be the author's pre-existing viewpoint.

Given that "training" on someone else's IP will lead to a regurgitation of some slight permutation of that IP (e.g., all the Studio Ghibli style AI images), I think the author is pushing shit up hill with the word "can't".

  • mdp2021 4 hours ago

    > Given that "training" on someone else's [would] lead to a regurgitation of some slight permutation

    That is not necessary. It may happen with "bad" NNs.

  • seanhunter 4 hours ago

    Yup. Nothing quite like someone who clearly has no legal background trying to use first principles reasoning + bullshit to make a quasi-legal argument that justifies their own prior opinion.

TimorousBestie 5 hours ago

The assumption that human learning and “machine learning” are somehow equivalent (in a physical, ethical, or legal sense—the domain shifts throughout the essay) is not supported with evidence here. They spend a long time describing how machine learning is different from human learning on a computational level, but that doesn’t seem to impact the rest of the argument.

I wish AI proponents would use the plain meaning of words in their persuasive arguments, instead of muddying the waters with anthropomorphic metaphors that smuggle in the conclusion.

gavinhoward 5 hours ago

Copyright reserves most rights to the author by default. And copyright laws anticipated future changes.

Copyright laws (in the US) added fair use, which has four tests. Not all of the tests need to fail for fair use to disappear. Usually two are enough.

The one courts love the most is if the copy is used to create something commercial that competes with the original work.

From near the top of the article:

> I agree that the dynamic of corporations making for-profit tools using previously published material to directly compete with the original authors, especially when that work was published freely, is “bad.”

So essentially, the author admits that AI fails this test.

Thus, if authors can show the AI fails another test (and AI usually fails the substantive difference test), AI is copyright infringement. Period.

The fact that the article gives up that point so early makes me feel I would be wasting time reading more, but I will still do it.

Edit: still reading, but the author talks about enumerated rights. Most lawsuits target the distribution of model outputs because that is reproduction, an enumerated right.

Edit 2: the author talks about substantive differences, admits they happen about 2% of the time, but then seems to argue that means they are not infringing at all. No, they are infringing in those instances.

Edit 3: the author claims that model users are the infringing ones, but at least one AI company (Microsoft?) has agreed to indemnify users, so plaintiffs have every right to go after the company instead.

djoldman 5 hours ago

There are a few stages involved in delivering the output of a LLM or text-to-image model:

1. acquire training data

2. train on training data

3. run inference on trained model

4. deliver outputs of inference

One can subdivide the above however one likes.

My understanding is that most lawsuits are targeting 4. deliver outputs of inference.

This is presumably because it has the best chance of resulting in a verdict favorable to the plaintiff.

The issue of whether or not it's legal to train on training data to which one does not hold copyright is probably moot - businesses don't care too much about what you do unless you're making money off it.

  • mdp2021 4 hours ago

    > businesses don't care too much

    Not really so, since the deranged application of an idea of "loss of revenue" decades ago.

prophesi 4 hours ago

I think it can be IP theft, and also require labor negotiations. And global technical infrastructure for people to opt-in to having their data trained on. And a method for creators to be compensated if they do opt-in and their work is ingested. And ways for their datasets to be audited by third parties.

It sounds like a pipe dream, but ethical enforcement of AI training across the globe will require multifaceted solutions that still won't stamp out all bad actors.

Calwestjobs 4 hours ago

Look, the quickest test of whether it IS or IS NOT IP theft: go to any image-generation ML wizardry prompt machine and ask it this:

"generate image of jack ryan investigating nuclear bomb. he has to look like morgan freeman."

(and do it quickly, before someone at FAANGM manually plays with something altering the result of that prompt)

The problem is the opposite: is the "original" work's IP original in itself, or is it just a remix?

Or did someone just hand a lawyer some generic text and make it arbitrarily protected for adding 0.000000001% to a previous work?

EPWN3D 4 hours ago

I couldn't get through it. Did he actually make an argument eventually?

fithisux 2 hours ago

Training AI is IP theft, period.

  • mdp2021 2 hours ago

    That only teaches us your opinion: too little information and too much.

light_hue_1 5 hours ago

This is totally the wrong analysis.

Think of AI tools like any other tools. If I include code I'm not allowed to use, that's copyright infringement, just like reading a book I pirated. If I include an image as an example in my image editor, that's OK if I am allowed to copy it.

If someone decides to use my image editor to create an image that's copyrighted or trademarked, that's not the fault of the software. Even if my software says "hey look, here are some cool logos that you might want to draw inspiration from".

People are getting too hung up on the AI part. That's irrelevant.

This is just software. You need a license for the inputs and if the output is copyrighted that's on the user of the software. It's a significant risk of just using these models carelessly.

re-thc 6 hours ago

The argument in the article breaks down by taking marketing terms at face value and trying to apply them to a technical argument.

You might as well start by saying that the "cloud" means computers really floating in the sky. Does AWS rain?

This "AI" or rather program is not "training" or "learning" - at least not the way these laws conceived by humans were anticipated or created for. It doesn't fit the usual dictionary term of training or learning. If it did we'd have real AI, i.e. the current term AGI.

  • Latty 5 hours ago

    I agree you can't just say it's learning and be done with it, but I think there is a discussion to be had about what training a model is.

    When they made the MP3 format, for example, they took a lot of music and used that to create algorithms that are effective for reproducing real-world music using less data. Is that a copyright violation? I think the answer is obviously no, so there is a way to use copyrighted material to produce something new based on it, that isn't reproduction.

    The obvious answer is that MP3 doesn't replace the music itself commercially and doesn't damage the market, while the things produced by an AI model can. But by that logic, is it a copyright violation for an instrument manufacturer to use a bunch of music to tailor a better instrument, if that instrument could be used to create music that competes with it? Again, no, but clearly there is a difference in how much that instrument draws from the works. AI models have the potential to spit out very similar works, which makes them much more harmful to the original works' value.

    I think looking at it through the lens of copyright just isn't useful: it's not exactly the same thing, and the rules around copyright aren't good for managing it. Rather, we should be asking what we want from models and what they provide to society. As I see it, we should be asking how we can address the artists having their work fed into something that may reduce the value of their work, it's clearly a problem, and I don't think pushing the onus onto the person using the model not to create anything that infringes is a strategy that will actually work.

    I do think the author correctly calls out gatekeeping as a huge potential issue. I think a reasonable route is that models shouldn't be copyrightable/patentable themselves, companies should not be allowed to rent-seek on something largely based on other people's work, they should be inherently in the public domain like recipes. Of course, legislating something like that is hard at the best of times, and the current environment is hostile to passing anything, let alone something pro-consumer.

    • lelanthran 3 hours ago

      > When they made the MP3 format, for example, they took a lot of music and used that to create algorithms that are effective for reproducing real-world music using less data

      Your entire argument is predicated on this incorrect assertion. Your argument, as is, is therefore completely invalid.

      • Latty 2 hours ago

        Even if we accept it's wrong (which I suspect is me being unclear: I wasn't suggesting MP3 is some kind of trained algorithm, just that humans developed it while testing on a range of music, which is well documented, with Tom's Diner famously being the first song encoded; that's how any such product gets developed, and I accept the context makes it read like I was implying something else, my bad), I give separate examples with varying degrees of similarity to training and then make my own comments. I explicitly say after this that I don't think the MP3 example is very comparable.

        While I get why you'd read what I said that way given the context, I wasn't clear. But maybe don't reject my entire post immediately after making an assumption about my point barely any way into it.

    • re-thc 5 hours ago

      > they took a lot of music and used that to create algorithms that are effective for reproducing real-world music using less data

      That's rewriting history. MP3 didn't evolve like this. There was a series of studies, experiments, etc., and it took many steps to get there.

      MP3 was not created by dumping music somewhere to get back an algorithm.

      > I think a reasonable route is that models shouldn't be copyrightable/patentable themselves

      Why? Why can't they just pay? AI companies have the highest valuations, and so the most $$$, and yet they can't pay? This is the equivalent of the rich claiming they are poor and then stealing from the poor.

      • Latty 2 hours ago

        > MP3 was not created by dumping music somewhere to get back an algorithm.

        This wasn't what I was trying to suggest; clearly I wasn't clear enough given the context. My point was to give a very distant example of humans using copyrighted works to test their algorithms, as a starting point. As I later go on to say, I think the two cases are fundamentally different, but the point was to make the case that there are different types of "using copyrighted works to create tools", which is distinct from "learning".

        > Why? Why can't they just pay. AI companies have the highest valuation and so have the most $$$ and yet they can't pay? This is the equivalent of the rich claiming they are poor and then stealing from the poor.

        I don't think them paying solves the problem.

        1) These are trained on such enormous amounts of unreliably sourced data; how are these companies going to negotiate with all of the rights holders?

        2) How do you deal with the fact that the original artists who previously sold rights to companies will now have their future work replaced in the market by these tools, when they sold a specific work, not expecting that? Sure, the rights owners might make some money, but the artists end up getting nothing and suffering the impact of having their work devalued.

        3) You then create a world where only giant megacorps who can afford the training rights can make models; they can then demand that all work made with them (potentially necessary to compete in future markets) give them back the rights, creating a vicious cycle of rent-seeking where a few companies control the tools necessary to be a commercial artist.

        Paying might, at best, help satisfy current rights holders, which is a fraction of the problems at hand, in my opinion. I think making models inherently public domain solves far more of them.

        • re-thc 19 minutes ago

          > 1) These are trained on such enormous amounts of data that is sourced unreliably, how are these companies going to negotiate with all of the rights holders?

          i.e. the business was impossible? Then don't do it. That's like saying: I couldn't pass the exam honestly, sir, so I cheated, and you should accept it.

          > 2) How do you deal with the fact the original artists who previously sold rights to companies will now have their future work replaced

          Charge differently / appropriately. This has already been done; e.g. there are different prices/licenses for one-off individual use vs. business vs. unlimited use vs. SaaS, etc.

          > 3) You then create a world where only giant megacorps who can afford to

          Isn't this currently the case? Who can afford the GPUs? What difference does that make? These AI companies are already getting sky-high valuations with cost as the excuse...

          > I think making models inherently public domain solves far more of them.

          How is that even enforceable? Companies just don't have to announce that there is a new model, or which model they used, and life goes on.

techpineapple 5 hours ago

“If humans were somehow required to have an explicit license to learn from work, it would be the end of individual creativity as we know it“

What about textbooks? In order to train on a textbook, I have to pay a licensing fee.

  • realharo 4 hours ago

    Even if you accept the premise, what does it matter? AI are not humans.

    Laws were made up by people at a specific time for a specific purpose. Obviously our existing laws are "designed" around human limitations.

    As for future laws, it's just a matter of who is powerful and persuasive enough to push through their vision of the future.

  • pitaj 5 hours ago

    If you pirate a text book, learn from it, and then apply that knowledge to write your own textbook: your textbook would not be a copyright violation of the original, even though you "stole" the original.

  • mdp2021 5 hours ago

    > I have to pay

    Fortunately, others have libraries. There is no need to pay for the examination of material stored in libraries (and similar).

  • Ylpertnodi 5 hours ago

    >What about text books, in order to train on a textbook, I have to pay a licensing fee.

    Would that also apply if you bought the textbook second-hand (or were given it)?