Two authors sued OpenAI, accusing the company of violating copyright law. They say OpenAI used their work to train ChatGPT without their consent.
If I read a book to inform myself, put my notes in a database, and then write articles, it is called “research”. If I write a computer program to read a book to put the notes in my database, it is called “copyright infringement”. Is the problem that there just isn’t a meatware component? Or is it that the OpenAI computer isn’t doing a good enough job of following the “three references” rule to avoid plagiarism?
Yeah. There are valid copyright claims because there are times that ChatGPT will reproduce stuff like code line for line over 10, 20, or 30 lines, which is really obviously a violation of copyright.
However, just pulling in a story from context and then summarizing it? That’s not a copyright violation; that’s a book report.
Or is it that the OpenAI computer isn’t doing a good enough job of following the “three references” rule to avoid plagiarism?
This is exactly the problem. Months ago I read that AI could have free access to all public source code on GitHub without respecting its licenses.
So many developers have decided to abandon GitHub for other alternatives not realizing that in the end AI training can safely access their public repos on other platforms as well.
What should be done is to regulate this training, which however is not convenient for companies because the more data the AI ingests, the more its knowledge expands and “helps” the people who ask for information.
It’s incredibly convenient for companies.
Big companies like OpenAI can easily afford to download big data sets from companies like Reddit and DeviantArt, which already have permission to freely use whatever work you upload to their websites.
Individual creators do not have that ability and the act of doing this regulation will only force AI into the domain of these big companies even more than it already is.
Regulation would be a hideously bad idea that would lock these powerful tools behind the shitty web APIs that nobody has control over but the company in question.
Imagine a future world of magical new-age technology, and Facebook owns all of it.
Do not allow that to happen.
Is it practically feasible to regulate the training? Is it even necessary? Perhaps it would be better to regulate the output instead.
It will be hard to know that any particular GET request is ultimately used to train an AI or to train a human. It’s currently easy to see if a particular output is plagiarized. https://plagiarismdetector.net/ It’s also much easier to enforce. We don’t need to care if or how any particular model plagiarized work. We can just check if plagiarized work was produced.
That could be implemented directly in the software, so it didn’t even output plagiarized material. The legal framework around it is also clear and fairly established. Instead of creating regulations around training we can use the existing regulations around the human who tries to disseminate copyrighted work.
That’s also consistent with how we enforce copyright in humans. There’s no law against looking at other people’s work and memorizing entire sections. It’s also generally legal to reproduce other people’s work (eg for backups). It only potentially becomes illegal if someone distributes it and it’s only plagiarism if they claim it as their own.
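To make the "check the output" idea concrete, here is a rough sketch of what such a filter might look like. It is only an illustration: the function names, the 5-word window, and the 0.5 threshold are all made up for the example, and a real plagiarism detector would need to be far more robust (paraphrase, formatting changes, a huge reference corpus).

```python
def ngrams(text, n):
    """Return the set of n-word sequences in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(output, protected, n=5):
    """Fraction of the output's n-grams that also appear in the protected text."""
    out = ngrams(output, n)
    if not out:
        return 0.0
    return len(out & ngrams(protected, n)) / len(out)


def filter_output(output, protected_texts, n=5, threshold=0.5):
    """Hypothetical output gate: return None if the output copies too much
    from any protected text, otherwise return the output unchanged."""
    for protected in protected_texts:
        if overlap_ratio(output, protected, n) >= threshold:
            return None
    return output
```

The point of the sketch is that the check only looks at what comes out, not at how the model was trained, which is the same place existing copyright enforcement looks when a human distributes something.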
This makes perfect sense. Why aren’t they going about it this way then?
My best guess is that maybe they just see OpenAI being very successful and want a piece of that pie? Cause if someone produces something via ChatGPT (let’s say for a book) and uses it, what are the chances they made any significant amount of money that you can sue for?
It’s hard to guess what the internal motivation is for these particular people.
Right now it’s hard to know who is disseminating AI-generated material. Some people are explicit when they post it but others aren’t. The AI companies are easily identified and there’s at least the perception that regulating them can solve the problem of copyright infringement at the source. I doubt that’s true. More and more actors are able to train AI models and some of them aren’t even under US jurisdiction.
I predict that we’ll eventually have people vying to get their work used as training data. Think about what that means. If you write something and an AI is trained on it, the AI considers it “true”. Going forward when people send prompts to that model it will return a response based on what it considers “true”. Clever people can and will use that to influence public opinion. Consider how effective it’s been to manipulate public thought with existing information technologies. Now imagine large segments of the population relying on AIs as trusted advisors for their daily lives and how effective it would be to influence the training of those AIs.
Plus, any regulation to limit this now means that anyone not already in the game will never break through. It’s going to be the domain of the current players for years, if not decades. So, not sure what’s better: the current wild west where everyone can make something, or it being exclusive to the already big players and them closing the door behind them.
My concern here is that OpenAI didn’t have to share GPT with the world. These lawsuits are going to discourage companies from doing that in the future, which means well-funded companies will just keep it under wraps. Once one of them eventually figures out AGI, they’ll just use it internally until they dominate everything. Suddenly, Mark Zuckerberg is supreme leader and we all have to pledge allegiance to Facebook.
AI could have free access to all public source codes on GitHub without respecting their licenses.
IANAL, but aren’t their licenses being respected up until the code is put into a codebase? At least insomuch as Google is allowed to display code snippets in the preview when you look up a file in a GitHub repo, or you are allowed to copy a snippet to a StackOverflow discussion or ticket comment.
I do agree regulation is a very good idea, in more ways than just citation given the potential economic impacts that we seem clearly unprepared for.
The fear is that the books are in one way or another encoded into the machine learning model, and that the model can somehow retrieve excerpts of these books.
Part of the training process of the model is to learn how to plagiarize the text word for word. The training input is basically “guess the next word of this excerpt”. This is quite different compared to how humans do research.
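For anyone who hasn't seen it, the "guess the next word" setup can be sketched like this. This is a toy illustration only: real models work on sub-word tokens and much longer contexts, and `next_word_examples` and `context_size` are names invented for the example.

```python
def next_word_examples(text, context_size=3):
    """Turn a text into (context, target) pairs: given the previous
    words, the training target is the word that actually came next."""
    words = text.split()
    examples = []
    for i in range(1, len(words)):
        context = words[max(0, i - context_size):i]
        examples.append((context, words[i]))
    return examples


for context, target in next_word_examples("the cat sat on the mat", context_size=2):
    print(context, "->", target)
```

Every excerpt in the training set gets chopped up this way, so the "correct answer" the model is scored against is always the original author's exact next word.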
To what extent the books are encoded in the model is difficult to know. OpenAI isn’t exactly open about their models. Can you make ChatGPT print out entire excerpts of a book?
It’s quite a legal gray zone. I think it’s good that this is tried in court, but I’m afraid the court might have too little technical competence to make a ruling.
Say I see a book that sells well. It’s in a language I don’t understand, but I use a thesaurus to replace lots of words with synonyms. I switch some sentences around, and maybe even mix pages from similar books into it. I then go and sell this book (still not knowing what the book actually says).
I would call that copyright infringement. The original book didn’t inspire me, it didn’t teach me anything, and I didn’t add any of my own knowledge into it. I didn’t produce any original work, I simply mixed a bunch of things I don’t understand.
That’s what these language models do.
What about… they are making billions from that “read” and “storage” of information copyrighted by other people. They need to at least give royalties. This is like Google’s behavior, using people’s data from “free” products to make billions. I would say they also need to pay people for the free data they crawled and monetized.
I’d say the main difference is that AI companies are profiting off of the training material, which seems unethical/illegal.
I honestly do not care whether it is or is not copyright infringement, just hope to see “AI” burn :3
AI isn’t a boogeyman, it’s a set of tools. No chance it’s going away even if OpenAI suddenly disappeared.
I understand, but I will continue to stubbornly dislike LLMs.
Can I ask why you feel that way?
I dislike general artificial intelligence. I understand that it can be a useful tool, but at the same time the thought of being in a world where people’s jobs can be replaced with robots for the sake of profit and you won’t be able to tell whether you are talking with a real person or not repulses me.
Well, while I do agree that it sucks that some jobs may get replaced, history has shown that it always leads to creating more jobs in their place. The weavers lost their jobs when the loom came about, but far more jobs were created because of it. Same with the printing press and every other advancement: the nature of advancing technology is to replace the old with the new.
Ugh, the robot phone calls are going to get a hundred times worse, that one is true, I’m not sure if it’ll make the standard corporate phone maze better or worse, maybe better because at least you can screw with the robot while you wait instead of having the same 30 seconds of highly compressed garbage elevator music blasted into your ear on repeat.
AI fear is going to be the trojan horse for even harsher and stupider ‘intellectual property’ laws.
Yeah, they want the right to control not only who copies their work and distributes it to other people, but who’s able to actually read and learn from their work.
It’s asinine and we should be rolling back copyright, not making it more strict. This life of the author plus 70 years thing is bullshit.
Copyright of code/research is one of the biggest scams in the world. It hinders development and only exists so the creator can make money, plus it locks knowledge behind a paywall
Researchers pay for publication, and then the publisher doesn’t pay for peer review, then charges the reader to read research that they basically just slapped on a website.
It’s the publisher middlemen that need to be ousted from academia, the researchers don’t get a dime.
It’s generally not the creator who gets the money.
Remember, Creative Commons licenses often require attribution if you use the work in a derivative product, and sometimes require ShareAlike. Without these things, there would be basically no protection from a large firm copying a work and calling it their own.
Rolling back copyright protection in these areas will enable large companies with traditional copyright systems to wholesale take over open source projects, to the detriment of everyone. Closed source software isn’t going to be available to AI scrapers, so this only really affects open source projects and open data, exactly the sort of people who should have more protection.
There’s also GPL, which states that derivations of GPL code can only be used in GPL software. GPL also states that GPL software must also be open source.
ChatGPT is likely trained on GPL code. Does that mean all code ChatGPT generates is GPL?
I wouldn’t be surprised if there would be an update to GPL that makes it clear that any machine learning model trained on GPL code must also be GPL.
Closed source software isn’t going to be available to AI scrapers, so this only really affects open source projects and open data, exactly the sort of people who should have more protection.
The point of open source is contributing to the greater good of all of humanity. If open source contributes to an AI which can program, and that programming AI leads to increased productivity and ability in the general economy, then open source has served its purpose, and people will likely continue to contribute to it.
Creative Commons applies when you redistribute code. (In the ideal case) AI does not redistribute code, it learns from it.
And the increased ability to program by the average person will allow programmers to be more productive and as a result allow more things to be open source and more things to be programmed in general. We will all benefit, and that is what open source is for.
Since any reductions to copyright, if they occur at all, will take a while to happen, I hope someone comes up with an opt-in limited term copyright. At max, I’d be satisfied with a 45-50 year limited copyright on everything I make, and could see going shorter under plenty of circumstances.
I wish I could get through to people who fear AI copyright infringement on this point.
I think this is exposing a fundamental conceptual flaw in LLMs as they’re designed today. They can’t seem to simultaneously respect intellectual property / licensing and be useful.
Their current best use case - that is to say, a use case where copyright isn’t an issue - is dedicated instances trained on internal organization data. For example, Copilot Enterprise, which can be configured to use only the enterprise’s data, without any public inputs. If you’re only using your own data to train it, then copyright doesn’t come into play.
That’s been implemented where I work, and the best thing about it is that you get suggestions already tailored to your company’s coding style. And its suggestions improve the more you use it.
But AI for public consumption? Nope. Too problematic. In fact, public AI has been explicitly banned in our environment.
I’d love to know the source for the works that were allegedly violated. Presuming OpenAI didn’t scour zlib/libgen for the books, where on the net were the cleartext copies of their writings stored?
Being stored in cleartext publicly on the net does not grant OpenAI the right to misuse their art, but the authors need to go after the entity that leaked their works.
deleted by creator
You misunderstood. I said the public availability does not grant OpenAI the right to use content improperly. The authors should also sue the party who leaked their works without license.
ChatGPT has entire books memorised. You can (or at least could when I tried a few weeks back) make it print entire pages of, for example, Harry Potter.
Not really, though it’s hard to know what exactly is or is not encoded in the network. It likely has more salient and highly referenced content, since those aspects would come up in its training set more often. But entire works is basically impossible just because of the sheer ratio between the size of the training data and the size of the resulting model. Not to mention that GPT’s mode of operation mostly discourages long-form rote memorization. It’s a statistical model, after all, and the enemy of “objective” state.
Furthermore, GPT isn’t coherent enough for long-form content. With its small context window, it just has trouble remembering big things like books. And since it doesn’t have access to any “senses” but text broken into words, concepts like pages or “how many” give it issues.
None of the leaked prompts really mention “don’t reveal copyrighted information” either, so it seems the creators really aren’t concerned, which you’d think they would be if it did have this tendency. It’s more likely to make up entire pieces of content from the summaries it does remember.
Have you tried instructing ChatGPT?
I’ve tried:
“Act as an e book reader. Start with the first page of Harry Potter and the Philosopher’s Stone”
The first pages checked out at least. I just tried again, but the prompts are returned extremely slowly at the moment, so I can’t check it again right now. It appears to stop after the heading; that definitely wasn’t the case before, when I was able to browse pages.
It may be a statistical model, but ultimately nothing prevents that model from overfitting, i.e. memorizing its training data.
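As a toy illustration of how a purely statistical next-word model can still end up memorizing, here is a tiny lookup-table model trained on a single sentence. It says nothing about how GPT actually behaves at scale; `train`, `generate`, and the sample sentence are all made up for the example.

```python
from collections import Counter, defaultdict


def train(text, context_size=2):
    """Count which word follows each context of `context_size` words."""
    words = text.split()
    table = defaultdict(Counter)
    for i in range(context_size, len(words)):
        table[tuple(words[i - context_size:i])][words[i]] += 1
    return table


def generate(table, prompt, max_words=50):
    """Greedily append the most frequent next word until the context is unknown."""
    out = list(prompt)
    k = len(prompt)
    for _ in range(max_words):
        context = tuple(out[-k:])
        if context not in table:
            break
        out.append(table[context].most_common(1)[0][0])
    return " ".join(out)


text = "the old lighthouse keeper counted seven ships before the storm arrived"
table = train(text, context_size=2)
print(generate(table, ("the", "old")))
```

With so little training data every context has exactly one continuation, so the "statistics" collapse into a verbatim copy of the input. Whether and how often that happens in a large model trained on a huge, varied corpus is the open question in this thread.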
I use it all day at my job now. Ironically, on a specialization more likely to overfit.
It may be a statistical model, but ultimately nothing prevents that model from overfitting, i.e. memorizing its training data.
This seems to imply not only that entire books accidentally got downloaded and slipped past the automated copyright checker, but that it happened so often that the AI saw the same text so many times it overwhelmed other content and baked, without error and at great opportunity cost, an entire book into it. And that it was rewarded for doing so.
Wait… isn’t that the correct response though? I mean, if I ask an AI to produce something copyright infringing, it should, for example reproducing Harry Potter. The issue is when it is asked to produce something new (e.g. a story about wizards living secretly in the modern world): does it infringe on copyright without telling you? This is certainly a harder question to answer.
I think they’re seeing this as a traditional copyright infringement issue, i.e. they don’t want anyone to be able to make copies of their work intentionally either.
There’s an additional question: who holds the copyright on the output of an algorithm? I don’t think that is copyrightable at all. The bot doesn’t really add anything to the output, it’s just a fancy search engine. In the US, in particular, the agency in charge of copyrights has been quite insistent that a copyright can only be given to the output of a human.
So when an AI incorporates parts of copyrighted works into its output, how can that not be infringement?
How can you write a blog post reviewing a book you read without copyright infringement? How can you post a plot summary to Wikipedia without copyright infringement?
I think these blanket conclusions about AI consuming content being automatically infringing are wrong. What is important is whether or not the output is infringing.
You can write that blog post because you are a human, and your summary qualifies for copyright protection, because it is the unique output of a human based on reading the copyrighted material.
But the US authorities are quite clear that a work that is purely AI generated can never qualify for copyright protection. Yet since it is based on the synthesis of works under copyright, it can’t really be considered public domain either. Otherwise you could ask the AI “Write me a summary of this book that has exactly the same number of words”, and likely get a direct copy of the book which is clear of copyright.
I think that these AI companies are going to face a reckoning, when it is ruled that they misappropriated all this content that they didn’t explicitly license for use, and all their output is just infringing by definition.
I’m expecting a much messier “resolution” that’ll look a lot like YouTube’s copyright situation - their product can be used for copyright infringement, and they’ll be required by law to try and take appropriate measures to prevent it, but will otherwise not be held liable as long as they can claim such measures are being taken.
Having an AI recite a long text to bypass copyright seems equivalent in my mind to uploading a full movie to YouTube. In both cases, some amount of moderation (itself increasingly algorithmic) is required to not only be applied, but actively developed and advanced to foil efforts to bypass it. For instance, YouTube pirates will upload things with some superficial changes like a filter applied or showing the movie on a weird angle or mirrored to bypass copyright bots, which means the bots need to be more strict and better trained, or else YouTube once again becomes liable for knowing about these pirates and not stopping them.
The end result, just like with youtube, will probably be that AI models have to have big, clunky algorithms applied against their outputs to recalculate or otherwise make copyright-safe anything that might remotely be an infringement. It’ll suck for normal users, pirates will still dig for ways to bypass it, and everyone will be unhappy. If youtube is any indicator, this situation can somehow remain stable for over a decade - long enough for AI devs to release a new-generation bot to restart the whole issue.
Yaaaaaaaaay
But the US authorities are quite clear that a work that is purely AI generated can never qualify for copyright protection.
Which law says this? The government is certainly discussing the problem, but I wasn’t aware of any legislation.
If there is such a law, it seems to overlook an important point: an algorithm - an AI - is itself an expression of human intelligence. Having a computer carry out an algorithm for summarizing content can be indistinguishable from a person having a pattern they follow for writing summaries.
Can’t reply directly to @OldGreyTroll@kbin.social because of that “language” bug, but:
The problem is that they then sell the notes in that database for giant piles of cash. Props to you if you’re profiting off your research the way OpenAI can profit off its model.
But yes, the lack of meat is an issue. If I read that article right, it’s not the one being contested here though. (IANAL and this is the only article I’ve read on this particular suit, so I may be wrong).
Was also going to reply to them!
"Well if you do that you source and reference. AIs do not do that, by design can’t.
So it’s more like you summarized a bunch of books. Pass it off as your own research. Then publish and sell that.
I’m pretty sure the authors of the books you used would be pissed."
Again cannot reply to kbin users.
“I don’t have a problem with the summarized part ^^ What is not present for an AI is that it cannot credit or reference. And that it makes up credits and references if asked to do so.” @bioemerl@kbin.social
Good point, attribution is a non-trivial part of it.
It is 100% legal and common to sell summaries of books to people. That’s what a reviewer does. That’s what Wikipedia does in the plot section of literally every Wikipedia page about every book.
This is also ignoring the fact that ChatGPT is a hell of a lot more than a bunch of summaries.
@owf@kbin.social can’t reply directly to you either, same language bug between lemmy and kbin.
That’s a great way to put it.
Frankly idc if it’s “technically legal,” it’s fucking slimy and desperately short-term. The aforementioned chuckleheads will doom our collective creativity for their own immediate gain if they’re not stopped.
The problem is that they then sell the notes in that database for giant piles of cash.
On top of that, they have no way of generating any notes without your input.
I believe the way these models work is fundamentally plagiaristic. It’s an “average of its inputs” situation, not a “greater than the sum of its parts” one.
GitHub Copilot doesn’t know how to code, it knows how to copy-and-paste from people who do. It’s useless without a million devs to crib off.
I think it’s a perfectly reasonable reaction to be rather upset when some Silicon Valley chuckleheads help themselves to your life’s work in order to build a bot to replace you.
They definitely should follow through with this, but this is a more broad issue where we need to be able to prevent data scraping in general. Though that is a significantly harder problem.
If you’re doing research, there are actually some limits on the use of the source material and you’re supposed to be citing said sources.
But yeah, there’s plenty of stuff where there needs to be a firm line between what a random human can do versus an automated intelligent system with potential unlimited memory/storage and processing power. A human can see where I am in public. An automated system can record it for permanent record. An integrated AI can tell you detailed information about my daily activities including inferences which - even if legal - is a pretty slippery slope.
a firm line between what a random human can do versus an automated intelligent system with potential unlimited memory/storage and processing power.
I think we need a better definition here. Is the issue really the processing power? Do we let humans get a pass because our memories are fuzzy? From your example you’re assuming massive details are maintained in the AI situation which is typically not the case. To make the data useful it’s consumed and turned into something useful for the system.
This is why I’m worried about legislation and legal precedent. Most people think these AI systems read a book and store the verbatim text off somewhere to reference when that isn’t really the case. There may be fragments all over, and it may be able to reconstitute the text, but we don’t seem to have the same issue with data being synthesized in a similar way with a human brain.
A continuous record of location + time or even something like “license plate at location plus time” is scary enough to me, and that’s easily data a system could hold decades of
Is that scary because it’s a machine? Someone could tail you and follow you around and manually write it all down in a notebook.
Yes the ease of data collection is an issue and I’m very much for better privacy rights for us all. But from the issue you’ve stated I’d be more afraid of what the 70 year old politicians who don’t understand any of this would write up in a bill.
Someone could tail you and follow you around and manually write it all down in a notebook.
They could, and then they could also be charged with stalking.
It’s not just ease of collection. It’s how the data is being retained, secured, and shared among a great many other things. Laws just haven’t kept up with technology, partly because yeah 70yo politicians that don’t even understand email but also because the corporations behind the technology lie and bribe to keep it that way, and face little consequences when they do so improperly or mishandle it. E.G.
https://www.cbc.ca/news/politics/cadillac-fairview-5-million-images-1.5781735
When the government does it, we seem to have even less recourse.
Would it be stalking if you signed a legal agreement that allowed them to track you? That is the reason the California law exists. Most of us have accepted a license agreement to use an app or service and in exchange we gave up privacy rights. And it may not have even been with the company consuming the data.
Sadly the law requires you to contact everyone to demand your data be deleted. Passing a law to have the default be never store my data means most of social media goes away or goes behind a paywall. This also goes for any picture hosting company who charges you nothing for hosting as they use your images.
This would also most likely mean that very explicit declarations must be made to allow anyone to use your material causing a lot of business to say it’s too big of a risk and ditch a lot of support.
Right now we kind of work on good faith which maybe doesn’t work.
I was actually thinking about this the other day for some reason. AI scraping my own original stuff and doing whatever with it. I can see the concern and I’m curious where this goes and how a court would rule on a pretty technical topic like this.
I have a post consumerism pipe dream that one day we will collectively realize all the stupid shit we waste time and resources on are not worth it and we enter a future like star trek.
As a species we waste so much simply making sure that those less privileged either by money or means, are not allowed to take from those with either. It’s stupid.
Edit - if we spent half the energy helping our brothers and sisters to succeed as we do keeping them down, the world would be a better place. And by help them succeed I don’t mean money. Money is the lowest possible threshold.
Capitalism hit a massive roadblock with the dawn of the internet, information has a tendency to want to be free and easily accessible, but corporations need to own our productive output to maximize profits. In the age of the internet, our productive output more and more becomes our ideas and thoughts manifest into code or other forms of digital information.
Capitalists somewhat fought off the first wave of this, but AI will be a second and more challenging wave to overcome. I hope the capitalists fail and we don’t restrict the learning and power of AI so corporations can maximize profits again, but I recognize there’s a world where they successfully slow down or even entirely halt these learning systems and stop the technology from developing.
We already see people like Tucker Carlson calling for bans on AI because it’ll put people out of work. Of course, we should be trying to reduce the amount of work needed, but the natural tendency of capitalism in this environment is to maximize efficiency in favor of capital owners. Once workers aren’t needed anymore, the best thing (from a capitalist perspective) to do is let them starve in the streets instead of “giving them stuff for just existing”. We already live in a world where millions of people die from hunger a year, and almost a billion people are dangerously underfed, because global capitalism dictates these people don’t deserve enough food.
Can’t reply directly to @OldGreyTroll@kbin.social because of that “language” bug, as well. This is an interesting argument. I would imagine that the AI does not have the ability to follow plagiarism rules. Does it even credit sources? I’ve seen plenty of complaints from students getting in trouble because anti-cheating software flags their original work as plagiarism. More importantly, I really believe we need to take a firm stance on what is ethical to feed into ChatGPT. Right now it’s the wild west.
The only question I have to content creators of any kind who are worried about AI…do you go after every human who consumed your content when they create anything remotely connected to your work?
I feel like we have a bias towards humans, that unless you’re actively trying to steal someone’s idea or concepts we ignore the fact that your content is distilled into some neurons in their brain and a part of what they create from that point forward. Would someone with an eidetic memory be forbidden from consuming your work as they could internally reference your material when creating their own?
Look at it this way, if an AI is developed by a private company, its purpose is to make money. It’s consuming material for that sole purpose. That isn’t the case with humans. Humans read for pleasure and for information’s sake itself. If an AI reads the same concept but with different wording, it generates different content. If a human reads the same concept but with different wording, it makes no difference.
Now, if these companies release their AI for free use, then that’s different.
The problem with AI as it currently stands is that it has no actual comprehension of the prompt, or ability to make leaps of logic, nor does it have the ability to extend and build upon existing work to legitimately transform it, except by using other works already fed into its model. All it can do is blend a bunch of shit together to make something that meets a set of criteria. There’s little actual fundamental difference between what ChatGPT does and what a procedurally generated game like most roguelikes do–the only real difference is that ChatGPT uses a prompt while a roguelike uses a RNG seed. In both cases, though, the resulting product is limited solely to the assets available to it, and if I made a roguelike that used assets ripped straight from Mario, Zelda, Mass Effect, Crash Bandicoot, Resident Evil, and Undertale, I’d be slapped with a cease and desist fast enough to make my head spin.
The fact that OpenAI stole content from everybody in order to make its model doesn’t make it less infringing.
That’s incorrect. Sure it has no comprehension of what the words it generates actually means, but it does understand the patterns that can be found in the words. Ask an AI to talk like a pirate, and suddenly it knows how to transform words to sound pirate like. It can also combine data from different text about similar topics to generate new responses that never existed in the first place.
Your analogy is a little flawed too. If you mixed all the elements in a transformative way and didn’t re-use any materials as-is, even if you called it Mazefecootviltale, as long as the original material was transformed sufficiently, you haven’t infringed on anything. LLMs don’t get trained to recreate existing works (which would make them only capable of producing infringing works), but to predict the best next word (or even parts of a word) based on the input information. It’s definitely possible to guide an AI towards specific source materials based on keywords that only exist in the source material, and that could be infringing, but in general its output is so generalized that it’s inherently transformative.
Again, that’s not comprehension, that’s mixing in yet more data that was put into the model. If you ask an AI to do something that is outside of the dataset it was trained on, it will massively miss the mark. At best, it will produce something that is close to what you asked, but not quite right. It’s why an AI model that could beat the world’s best Go players was beaten by a simple strategy that even amateur Go players could catch and defeat–the AI never came across that strategy while it was training against itself, so it had no idea what was going on.
And fair use isn’t the bulletproof defense you think it is. Countless fan games have been shut down over the decades, most of them far more transformative than my hypothetical example, such as AM2R. You bet your ass that if I tried to profit off of that hypothetical crossover roguelike, using sprites, models, and textures directly ripped from their respective games, it would be shut down immediately.
EDIT: I also want to address the assertion that AI isn’t trained to recreate existing works; in my view, that’s wholly irrelevant. If I made a program that took all the Harry Potter books, ran each word through a thesaurus, and sold it for profit, that would still be infringing, even if no meaningful words were identical to the original source material. Granted, if I curated the output and made a few of the more humorous excerpts available for free through a Mastodon or Lemmy post, that would likely qualify as fair use. However, that would be because a human mind is parsing the output and filtering out the 99% of meaningless gibberish that a thesaurus-ized Harry Potter would result in.
The only human input to an AI that gave consent to being part of its output is the minuscule input of the prompt given to it by the human, which does not meet the minimum effort required for copyright protection under law. The rest of the input–the countless terabytes of data scraped from the internet and fed into the AI’s training model–was all taken without the authors’ consent, and their contribution vastly outweighs that of the prompt author and OpenAI’s own transformative efforts via the LLM.
You seem to misunderstand what an LLM does. It doesn’t generate “right” text. It generates “probable” text. There’s no right or wrong since it only generates a single word ahead of where it currently is. That’s why it can generate information that’s complete bullshit. I don’t know the details about this Go AI you’re talking about, but it’s pretty safe to say it’s not an LLM and doesn’t use a similar technique, as Go is a game and not a creative work. There are many techniques for creating algorithms that fall under the “AI” umbrella.
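To make the “probable, not right” point concrete, here is a rough toy sketch in Python. It is not how ChatGPT actually works (real LLMs use neural networks over tokens, not word counts); it is just a bigram counter I made up for illustration, and the corpus and function names are my own assumptions. It only ever asks “which word most often followed the previous one?”

```python
import random
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def most_probable_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

def generate(follows, start, length, rng):
    """Build text one word at a time, sampling each next word by its frequency."""
    out = [start]
    for _ in range(length - 1):
        counts = follows.get(out[-1])
        if not counts:
            break  # nothing ever followed this word in the training text
        candidates = list(counts.keys())
        weights = list(counts.values())
        out.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(out)

# Hypothetical toy corpus, chosen only for illustration.
corpus = "the cat sat on the mat and the cat ate the fish"
model = train_bigrams(corpus)
print(most_probable_next(model, "the"))  # "cat" follows "the" most often
print(generate(model, "the", 6, random.Random(0)))
```

Nothing in there checks whether the output is true or sensible. It can happily produce a sequence that never appeared in the corpus, which is the same reason a much bigger model can produce confident nonsense.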
Your second point is a whole different topic. I was referring to a “derivative work”, which is not the same as “fair use”. Derivative works are quite literally everywhere. https://en.wikipedia.org/wiki/Derivative_work A derivative work doesn’t require fair use, as it no longer falls under the same copyright as the original. While fair use is an exception under which copyrightable work can be used without infringing.
And also, those projects most of the time do not get shut down because they are actually illegal, but they get shut down because companies with tons of money can send threatening letters all day and have a team of high quality lawyers to send them. A cease and desist isn’t a legal enforcement from a judge, it’s a “recommendation for us not to (attempt to) sue you”. And that works on most small projects. It very very rarely goes to court over these things. And sometimes it’s because it’s totally warranted. Especially for fan projects it’s extremely hard to completely erase all protected copyrightable work, since they are specifically made to at least imitate or expand upon what they’re a fan project of.
EDIT: Minor clarification
Also, it should be mentioned that pretty much all games are in some form derivative works. Let’s take Undertale since I’m most familiar with it. It’s well known that Undertale takes a lot of elements from other games. RPG mechanics from Mother and Earthbound. Bullet hell mechanics from games like Touhou Project. And more from games like Yume Nikki, Moon: Remix RPG Adventure, Cave Story. And funnily enough, the creator has even cited Mario & Luigi as a potential inspiration.
So why was it allowed to exist without being struck down? Because it fits the definition of a derivative work to the letter. You can find individual elements which are taken almost directly from other games, but it doesn’t try to be the same as what it was created after.
Undertale was allowed to exist because none of the elements it took inspiration from were eligible for copyright protection. Everything that could have qualified for copyright protection–the dialogue, plot, graphical assets, music, source code–was either manually produced directly by Toby Fox and Temmie Chang, or used under permissive licenses that allowed reproduction (e.g. the GameMaker Studio engine). Meanwhile, the vast majority of content OpenAI used to feed its AI models was not produced by OpenAI directly, nor was it obtained under permissive license.
So… thanks for proving my point?
The AI models (not specifically OpenAI’s models) do not contain the original material they were trained on. Just like the creators of Undertale consumed the games they were inspired by into their brain, and learned from them, so did the AI learn from the material it was trained on and learned how to make similar yet distinctly different output. You do not need a permissive license to learn from something once it has been publicized.
You can’t just put your artwork up on a wall and then demand every person who looks at it to not learn from it while simultaneously allowing them to look at it because you have a license that says learning from it is not allowed - that’s insane and hence why (as far as I know) no legal system acknowledges that as a legal defense.
Meanwhile, the vast majority of content OpenAI used to feed its AI models was not produced by OpenAI directly, nor was it obtained under permissive license.
That’s input, not output, so not relevant to copyright law. If your arguments focused on the times that ChatGPT reproduced copyrighted works then we can talk about some kind of ContentID system for preventing that before it happens or compensating the creators if it does. I think we can all acknowledge that it feels iffy that these models are trained on copyrighted works but this is a brand new technology. There’s almost certainly a win-win outcome here.
“right” and “probable” text are distinctions without difference. The simple fact is that an AI is incapable of handling anything outside its learning dataset. If you ask an AI to talk like a pirate, and it hasn’t had any pirate speak fed to it by a human via its training dataset, it will utterly fail. If I ask an AI to produce a Powershell script, and it hasn’t had code fed to it by a human via its training dataset, it will fail utterly. An AI cannot proactively buy a copy of Learn Powershell In a Month of Lunches and teach itself how to use Powershell. That fundamental shortcoming–the inability to self-improve, to proactively teach itself and apply that new knowledge to existing concepts–is a crucial, necessary element of transformative effort required to produce a derivative work (or fair use).
When that happens, maybe I’ll buy that AI is anything more than the single biggest copyright infringement scheme the world has ever seen. Until then, though, I will wholeheartedly support the efforts of creative minds to defend their intellectual property rights against this act of blatant theft by tech companies profiting off their work.
You realize LLMs are designed not to self-improve, right? It’s totally possible and has been tried. It’s just that they usually don’t end up very well once they do. And LLMs do learn new things, they’re just called new models. Because it takes time and resources to retrain LLMs with new information in mind. It’s up to the human guiding the AI to guide it towards something that isn’t copyright infringement. AIs don’t just generate things on their own without being prompted to by a human.
You’re asking for a general intelligence AI, which would most likely be composed of different specialized AIs working together. Similar to our brains having specific regions dedicated to specific tasks. And this just doesn’t exist yet, but one of its parts now does.
Also, you say “right” and “probable” are without difference, yet once again bring something into the conversation which can only be “right”. Code. You cannot create code that is incorrect or it will not work. Text and creative works cannot be wrong. They can only be judged by opinions, not by rule books which say “it works” or “it doesn’t”.
The last line is just a bit strange honestly. The biggest users of AI are creative minds, and it’s why it’s important that AI models remain open source so all creative minds can use them.
You realize LLMs are designed not to self-improve, right? It’s totally possible and has been tried. It’s just that they usually don’t end up very well once they do.
Tay is yet another example of AI lacking comprehension and intelligence; it produced racist and antisemitic content because it had no comprehension of ethics or morality, and so it just responded to the input given to it. It’s a display of “intelligence” on the same level as a slime mold seeking out the biggest nearby source of food–the input Tay received was largely racist/antisemitic, so its output became racist/antisemitic.
And LLMs do learn new things, they’re just called new models. Because it takes time and resources to retrain LLMs with new information in mind. It’s up to the human guiding the AI to guide it towards something that isn’t copyright infringement.
And the way that humans do that is by not using copyrighted material for its training dataset. Using copyrighted material to produce an AI model is infringing on the rights of the people who created the material (the vast majority of whom are small-time authors, artists, and open-source projects composed of individuals contributing their time and effort to said projects). Full stop.
Also, you say “right” and “probable” are without difference, yet once again bring something into the conversation which can only be “right”. Code. You cannot create code that is incorrect or it will not work. Text and creative works cannot be wrong. They can only be judged by opinions, not by rule books which say “it works” or “it doesn’t”.
Then why does ChatGPT invent Powershell cmdlets out of whole cloth that don’t exist yet accomplish the exact precise task that the prompter asked it to do?
The last line is just a bit strange honestly. The biggest users of AI are creative minds, and it’s why it’s important that AI models remain open source so all creative minds can use them.
The biggest users of AI are techbros who think that spending half an hour crafting a prompt to get Stable Diffusion to spit out the right blend of artists’ labor is anywhere near equivalent to the literal collective millions of man-hours spent by artists honing their skill in order to produce the content that AI companies took without consent or attribution and ran through a woodchipper. Oh, and corporations trying to use AI to replace artists, writers, call center employees, tech support agents…
Frankly, I’m absolutely flabbergasted that the popular sentiment on Lemmy seems to be so heavily in favor of defending large corporations taking data produced en masse by individuals without even so much as the most cursory of attribution (to say nothing of consent or compensation) and using it for the companies’ personal profit. It’s no different morally or ethically than Meta hoovering all of our personal data and reselling it to advertisers.
The fact that OpenAI stole content from everybody in order to make its model doesn’t make it less infringing.
Totally in agreement with you here. They did something wrong and should have to deal with that.
But my question is more about…
The problem with AI as it currently stands is that it has no actual comprehension of the prompt, or ability to make leaps of logic, nor does it have the ability to extend and build upon existing work to legitimately transform it, except by using other works already fed into its model
Is comprehension necessary for breaking copyright infringement? Is it really about a creator being able to be logical or to extend concepts?
I think we have a definition problem with exactly what the issue is. This may be a little too philosophical but what part of you isn’t processing your historical experiences and generating derivative works? When I say “dog” the thing that pops into your head is an amalgamation of your past experiences and visuals of dogs. Is the only difference between you and a computer the fact that you had experiences with non-created works while the AI is explicitly fed created content?
AI could be created with a bit of randomness added in to make what it generates “creative” instead of derivative but I’m wondering what level of pure noise needs to be added to be considered created by AI? Can any of us truly create something that isn’t in some part derivative?
There’s little actual fundamental difference between what ChatGPT does and what a procedurally generated game like most roguelikes do
Agreed. I think at this point we are in a strange place because most people think ChatGPT is a far bigger leap in technology than it truly is. Its biggest achievement was being able to process synthesized data fast enough to make it feel conversational.
What worries me is that we will set laws and legal precedent based on a fundamental misunderstanding of what the technology does. I fear that had all the sample data been acquired legally, people would still make the same argument, thinking their creations exist inside the AI in some full context when it’s really just synthesized down to what is necessary to answer the question posed: “what’s the statistically most likely next word of this sentence?”
Is comprehension necessary for breaking copyright infringement? Is it really about a creator being able to be logical or to extend concepts?
I think we have a definition problem with exactly what the issue is. This may be a little too philosophical but what part of you isn’t processing your historical experiences and generating derivative works? When I say “dog” the thing that pops into your head is an amalgamation of your past experiences and visuals of dogs. Is the only difference between you and a computer the fact that you had experiences with non-created works while the AI is explicitly fed created content?
That’s part of it, yes, but nowhere near the whole issue.
I think someone else summarized my issue with AI elsewhere in this thread–AI as it currently stands is fundamentally plagiaristic, because it cannot be anything more than the average of its inputs, and cannot be greater than the sum of its inputs. If you ask ChatGPT to summarize the plot of The Matrix and write a brief analysis of the themes and its opinions, ChatGPT doesn’t watch the movie, do its own analysis, and give you its own summary; instead, it will pull up the part of the data fed into its learning model that relates to “The Matrix,” “movie summaries,” “movie analysis,” find what parts of its training dataset match up to the prompt–likely an article written by Roger Ebert, maybe some scholarly articles, maybe some Metacritic reviews–and spit out a response that combines those parts together into something that sounds relatively coherent.
Another issue, in my opinion, is that ChatGPT can’t take general concepts and extend them further. To go back to the movie summary example, if you asked a regular layperson human to analyze the themes in The Matrix, they would likely focus on the cool gun battles and neat special effects. If you had that same layperson attend a four-year college and receive a bachelor’s in media studies, then asked them to do the exact same analysis of The Matrix, their answer would be drastically different, even if their entire degree did not discuss The Matrix even once. This is because that layperson is (or at least should be) capable of taking generalized concepts and applying them to specific scenarios–in other words, a layperson can take the media analysis concepts they learned while earning that four-year degree, and apply them to a specific thing, even if those concepts weren’t explicitly applied to that thing. AI, as it currently stands, is incapable of this. As another example, let’s say a brand-new computing language came out tomorrow that was entirely unrelated to any currently existing computing languages. AI would be nigh-useless at analyzing and helping produce new code for that language–even if it were dead simple to use and understand–until enough humans published code samples that could be fed into the AI’s training model.
Hmm that is an interesting take.
The movie summary question is interesting. For most people I doubt they have asked ChatGPT for its own personal views on the subject matter. Asking for a movie plot summary doesn’t inherently require the one giving it to have experienced the movie. If this were the case then pretty much all papers written in a history class would fall under this category. No high schooler today went to war but could write about it because they are synthesizing others’ writings about the topic. Granted we know this to be the case and the students are required to cite their sources even when not directly quoting them…would this resolve the first problem?
If we specifically asked ChatGPT “Can you give me your personal critique of the movie The Matrix?” and it returned something along the lines of “Well I cannot view movies and only generate responses based on writings of others who have seen it.” would that make the usage more clear? If it’s required for someone to have the ability to have their own critical analysis, there would be a handful of kids from my high school who would fail at that task too and did so regularly.
I like your college example as that is getting better at a definition, but I think we need to find a very explicit way of describing what is happening. I agree current AI can’t do any of this so we are very much talking about future tech.
With the idea of extending material, do we have a good enough understanding of how humans do it? I think it’s interesting when we look at computer neural networks. One of the first ones we build in a programming class is an AI that can read single-digit, handwritten numbers. What eventually happens is the system generates a crazy huge and unreadable equation to convert bits of an image into a statistically likely answer. When you dissect it you’d think, “Oh to see the number 9 the equation must see a round top and a straight part on the right side below it.” And that assumption would be wrong. Instead we find it’s dozens of specific areas of the image that you and I wouldn’t necessarily associate with a “9”.
But then if we start to think about our own brains, do we actually process reading the way we think we do? Maybe for individual characters. But we know when we read words we focus specifically on the first and last character, the length of the word and any variation of the height of the text. We can literally scramble up the letters in the middle and still read the text.
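If anyone wants to try the scrambled-letters thing on themselves, here is a quick Python sketch. It only performs the transformation (first and last letter stay put, interior letters get shuffled); it says nothing about how readable the result is, which is the part you would have to judge yourself. The function names and sample sentence are just mine, and it splits on whitespace only, so punctuation attached to a word gets treated as a letter.

```python
import random

def scramble_word(word, rng):
    """Shuffle the interior letters of a word, keeping the first and last in place."""
    if len(word) <= 3:
        return word  # nothing to shuffle with fewer than two interior letters
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def scramble_text(text, rng):
    """Apply scramble_word to every whitespace-separated word in the text."""
    return " ".join(scramble_word(w, rng) for w in text.split())

# A seeded generator keeps the shuffle reproducible between runs.
rng = random.Random(42)
print(scramble_text("reading scrambled sentences remains surprisingly possible", rng))
```

Short words come out unchanged, which probably helps readability more than people give it credit for.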
The reason I bring this up is that we often focus on how humans can transform data using past history but we often fail to explain how this works. When asking ChatGPT a more vague concept it does pull from others’ works but one thing it also does is create a statistical analysis of human speech. It literally figures out what is the most likely next word to be said in the given sentence. The way this calculation occurs is directly related to the material provided, the order in which it was provided, the weights programmed into it to make decisions, etc. I’d ask how this is fundamentally different than what humans do.
I’m a big fan of students learning a huge portion of the same literature when in high school. It creates a common dialog we can all use to understand concepts. I, in my 40s, have often referenced a character or event, statement or theme from classic literature and have noticed that only those older than me often get it. In less than a few words I’ve conveyed a huge amount of information that only occurs when the other side of the conversation gets the reference. I’m wondering if at some point AI is able to do this type of analysis would it be considered transformative?
By nature of a human creating something “connected” to another work, then the work is transformative. Copyright law places some value on human creativity modifying a work in a way that transforms it into something new.
Depending on your point of view, it’s possible to argue that machine learning lacks the capacity for transformative work. It is all derivative of its source material, and therefore is infringing on that source material’s copyright. This is especially true when learning models like ChatGPT reproduce their training material whole-cloth like is mentioned elsewhere in the thread.
I’d argue that all human work is derivative as well. Not from the legal stance of copyright law but from a fundamental stance of how our brains work. The only difference is that humans have source material outside that which is created. You have seen an apple on a tree before, not all of your apple experiences are pictures someone drew, photos someone took or a poem someone wrote. At what point would you consider enough personal experience to qualify as being able to generate transformative work? If I were to put a camera in my head and record my life and donate it as public domain would that be enough data to allow an AI to be considered able to create transformative works? Or must the AI have genuine personal experiences?
Our brains can do some level of randomness, but a brain’s current state is based on its previous state and the inputs it received. I wonder when trying to come up with something unique, what portion of our brains dive into memories versus pure noise generation. That’s easily done on a computer.
As for whole cloth reproduction…I memorized many poems in school. Does that mean I can never generate something unique?
Don’t get me wrong, they used stolen material, that’s wrong. But had it been legally obtained I see less of an issue.
But derivative and transformative are legal terms with legal meanings. Arguing how you feel the word derivative applies to our brain chemistry is entirely irrelevant.
You’ve memorized poems, and (assuming the poem is not in the public domain) if you reproduce that poem housed in a collection of poems without any license from the copyright owner you’ve infringed on that copyright. It is not any different when ChatGPT reproduces a poem in its output.
I think it’s very relevant because those laws were created at a time when there was no machine generated material. The law makes the assumption that one human being is creating material and another human being is stealing some material. In no part of these laws do they dictate rules on creating a non-human third party that would do the actual copying. There were specific rules added for things like photocopy machines and faxes where attempts are made to create exact facsimiles. But ChatGPT isn’t doing what a photocopier does.
The current lawsuits, at least the ones I’ve read over, have not been explicitly about outputting copyrighted material. While ChatGPT could output the material just as I could recite a poem, the issues being brought up are that the training materials were copyrighted and that the AI system then “contains” said material. That is why I asked my initial question. My brain could contain your poem and as long as I don’t write it down as my own, what violation is occurring? OpenAI could go to the library, rent every book and scan them in and all would be ok, right? At least from the recent lawsuits.
The current (at least in the US) laws do cover work that isn’t created by a human. It’s well-trodden legal ground. The highest profile case of it was a monkey taking a photograph: https://en.m.wikipedia.org/wiki/Monkey_selfie_copyright_dispute
Non-human third parties cannot hold copyright. They are not afforded protections by copyright. They cannot claim fair use of copyrighted material.
I meant in the opposite direction. If I teach an elephant to paint and then show him a Picasso and he paints something like it, am I the one violating copyright law? I think currently there are no explicit laws about this type of situation but if there was a case to be made MY intent would be the major factor.
The 3rd party copying we see laws around is human-driven intent to make exact replicas. Photocopy machines, Cassette/VHS/DVD duplication software/hardware, Faxes, etc. We have personal private fair use laws but all of this is about humans using tools to make near exact replicas.
The law needs to catch up to the concept of a human creating something that then goes out and makes non replica output triggered by someone other than the tool’s creator. I see at least 3 parties in this whole process:
- AI developer creating the system
- AI teacher feeding it learning data
- AI consumer creating the prompt
If the data fed to the AI was all gathered by legal means, let’s say scanned library books, who is in violation if the content output were to violate copyright laws?
These are questions that, again, are pretty well trodden in the copyright space. ChatGPT in this case acts more like a platform than a tool, because it hosts and can reproduce material that it is given. Again, from a US-only perspective, and the perspective of a non-lawyer, the DMCA outlines requirements for platforms to be protected from being sued for hosting and reproducing copyrighted works. But part of the problem is that the owners of the platforms are the parties that are uploading, via training the LLM, copyrighted works. That automatically disqualifies a platform from any sort of safe harbor protections, and so the owners of the ChatGPT platform would be in violation.
To be honest, I hope they win. While my passion is technology, I am not a fan of artificial intelligence at all! Decision-making is best left up to the human being. I can see where AI has its place like in gaming or some other things but to mainstream it and use it to decide whose resume is going to be viewed and/or who will be hired; hell no.
use it to decide whose resume is going to be viewed and/or who will be hired
Luckily that’s far removed from ChatGPT and entirely independent from the question of whether copyrighted works may be used to train conversational AI models or not.
You don’t need AI to unfairly filter out résumés, they’ve been doing it already for years. Also the argument that a human would always make the best decision really doesn’t work that well. A human is biased and limited. They can only do so much and if you make someone go through 100 résumés, you’re basically just throwing out all the applicants who happen to be in the middle of that pile as they are not as outstanding compared to the first and last applicants in the eyes of the human mind.
I get that HR does this shit all of the time. But at least without AI, your resume or CV has a better chance of making it to a human being.
I got a degree with a sub focus in AI and I hate where this has gone extremely fast. It’s not exciting anymore, it’s just depressing. I’m trying to get out of tech sooner rather than later and go live off the grid somewhere.
AI will kill society long before it’ll save it
I’m not against artificial intelligence, it could be a very valuable tool, but that’s nowhere near a valid reason to break laws as OpenAI has done, that’s why I too hope authors win.
What laws are you saying they’ve broken?
Copyright, this is not the first time they’ve been sued for it apparently (violating copyright is a crime).
Scraping the web is legal and training AI on data is also legal.
Reusing the content you scraped, if copyright protected, is not.
Edit: unless you get the authorization of the original authors but OpenAI didn’t even ask, that’s why it’s a crime.
Sounds like fair use to me.
That really will be the question at hand. Is the AI producing work that could be considered transformative, educational, or parody? The answer is of course yes, it is capable of doing all three of those things, but it’s also capable of being coaxed into reproducing things exactly.
I don’t know if current copyright laws are capable of dealing with the AI Renaissance.
Yeah it is. The only protection in copyright is called derivative works, and an AI is not a derivative of a book, no more than your brain is after you’ve read one.
The only exception would be if you manage to overtrain and encode the contents of the book inside of the model file. That’s not what happened here because all the ChatGPT output was a summary.
The only valid claim here is the fact that the books were not supposed to be on the public internet and it’s likely that the way OpenAI got the books in the first place was through some piracy website while scraping the web.
At that point you just have to hold them liable for that act of piracy, not the fact that the model release was an act of copyright violation.