AI, Creators and the Commons

3 Nov

On October 2nd, the day before the Creative Commons (CC) Global Summit in Mexico City began, OpenFuture and Creative Commons hosted an all-day workshop on “AI, Creators and the Commons”.

The goal of this workshop, organized as a side event to the CC Summit, was to understand and explore the impact of generative machine learning (ML) on creative practices and the commons, and on the mission of Creative Commons in particular.

The workshop brought together members of the Creative Commons global network with expertise in copyright law, CC licenses as legal tools, and issues in generative AI, in order to develop an understanding of these issues, as they play out across different jurisdictions around the world.

The workshop focused on the "input" side of generative AI, particularly on the data used to train ML models. The morning session focused on:

How do copyright systems around the world deal with the use of copyrighted works for training generative ML models?

The aim was to understand whether there are differences between jurisdictions that affect whether, and how, copyright-protected works (including CC-licensed works) can be used for AI training. Questions asked included:

Are there differences between jurisdictions that affect whether, and how, copyright-protected works (including CC-licensed works) can be used for AI training?
How do different legal frameworks deal with this issue and what balance do they strike?
What are the implications for creators?
What are the implications for using open licenses?

Here are a few of my takeaways and responses to those questions from this session.

In the USA, two areas of copyright activity related to AI are copyright over AI outputs and copyright related to inputs used for AI training.

The legal case around the comic book Zarya of the Dawn was used as an example of copyright related to AI outputs. The work was originally granted full copyright, but that was subsequently revoked. Instead, the text as well as the selection, coordination, and arrangement of the work’s written and visual elements were granted copyright, but the images in the work, generated by Midjourney, were not, as they were deemed “not the product of human authorship”.

A comparison was drawn between a creator’s use of AI technology and a creator’s use of photography-related tools. In taking a shot, a photographer engages in composition, timing, lighting, and setting. After taking a shot, they engage in things like post editing, combining images, and final form. In photography, copyright is assigned to the person who takes the shot. In what way is use of AI technology different?

Pertaining to issues related to copyright of inputs used for AI training, reference was made to the many class action legal cases that are underway contending that the use of copyrighted works to train AI constitutes copyright infringement.

The legal case of Getty Images suing Stability AI, the company behind the AI art generator Stable Diffusion, in the US for copyright infringement was used as an example. Getty has licensed its images and metadata to other AI art generators, but claims that Stability AI willfully scraped its images without permission. This claim is supported by Stable Diffusion recreating Getty’s watermark in some of its output images. This case is interesting because it alleges not only copyright infringement but also manipulation of copyright data (the watermark).

Another example was the Authors Guild’s class action suit against OpenAI. This complaint draws attention to the fact that the plaintiffs’ books were downloaded from pirate ebook repositories and used to train ChatGPT, from which OpenAI expects to earn billions. The class action asserts that this threatens the role and livelihood of writers as a whole and seeks a settlement that gives authors choice and a reasonable licensing fee related to use of their copyrighted works.

US defendants in these cases are expected to argue that their use of these works to train their AI is allowed under “fair use”. From a copyright perspective, generative AI is particularly disruptive because outputs are not identical to inputs. But a key question will be whether they are similar enough. Are the outputs sufficiently transformative?

Some artists are suing on principle. Some see AI competing with them unfairly. Some are concerned about AI replacing human labour. And still others see AI-created works as a violation of their integrity and reputation.

Canada, Australia, New Zealand, Japan and the UK all have something called fair dealing, which is similar to US fair use. So in these jurisdictions, AI companies using copyrighted works are expected to argue that fair dealing allows them to do what they are doing.

In addition to fair dealing, some jurisdictions have copyright exceptions that allow for text and data mining. Text and data mining is an automated process that analyzes massive amounts of text and data in digital form in order to discover new knowledge through patterns, trends and correlations. Initial efforts to establish text and data mining exceptions were made largely in the context of supporting research. Creators and general rights holders did not pay much attention to them. But now, with generative AI, text and data mining is affecting everyone. Text and data mining exceptions, where they exist, present another means by which AI tech companies can argue they are legally allowed to do what they do.

In 2016, Japan identified AI as one of the most important technological foundations for establishing a super-smart society they call Society 5.0. In 2017 they amended copyright legislation to allow text and data mining, classifying the activities into four categories: (1) extraction, (2) comparison, (3) classification, or (4) other statistical analysis. The Japanese exception is regarded as the broadest text and data mining exception in the world because: (1) it applies to both commercial and noncommercial purposes; (2) it applies to any exploitation regardless of the rightholder’s reservations; (3) exploitation by any means is permitted; and (4) no lawful access is required.

Other countries such as Singapore, South Korea and Taiwan have adopted similar rules with the intention of removing uncertainties for their tech industries and positioning themselves in the AI race, unencumbered.

The EU's Directive on Copyright in the Digital Single Market, adopted in 2019, introduced two text and data mining exceptions. One exception is for scientific research and cultural heritage, with a caveat that use be non-commercial. The second is a more general purpose exception which allows commercial use as long as source data is lawfully accessed and creators have the option to opt out. This general purpose exception is seen as the one applicable to generative AI.

Opting out is seen as giving a creator leverage and as the first step in securing a licensing deal. However, the opt-out requirement is seen as extremely difficult to implement at a practical level. At this time creators have no idea whether their works are being used by AI, and AI tech players are not disclosing what data is being used to train their models. In some cases there may be multiple copies of works, and there is no simple way of ensuring all copies of your work have been excluded. In addition, it is not clear whether opt-out only applies to new uses or whether it is retroactive. How does opt-out affect AI models that already include your work? Opt-out is also seen as unfair to those who have passed away. They can't say no to their work being used by AI. Mass opt-out may result in even greater bias being present in AI models.

There is a big challenge around ensuring opt-out is respected. Early attempts to enable opt-out, such as those from ArtStation, are seen as cumbersome, difficult to enforce, and placing a large part of the onus on the creator. Complex opt-out systems will favour large players who are the biggest rights holders. Copyright holders are often not the creators themselves but large publishers or intermediaries who have acquired the rights to those works. Opt-out needs to be simple enough that anyone can do it. There are some who don’t want opt-out but instead mandatory compensation.

An African perspective is one of being left behind and of livelihoods being upended. Big tech AI and transnationals have created a problem through their use of data sourced in Africa without collaboration or recognition of local communities. But what to do? The data is already taken.

Financing of AI research and use of data all come from the global north. AI is data colonization and a threat to sovereignty. Local languages are absent from AI training. Creators are facing the realization that they can be replaced. Text and data mining was thought of as something for scientists analyzing literature to get insights. Now there is a growing realization that text and data mining touches everyone. Do text and data mining exceptions and fair use / fair dealing really allow use of everything?

There is interest in creating a licensing market. But copyright is not and never has been a good or effective jobs program. Copyright has done a poor job of helping creators make a living. Copyright is wielded by a few big players to benefit a few. Participants want creators to be fairly rewarded and remunerated.

The treatment of Kenyan workers cleaning AI data is unethical, but it is not a copyright issue. We need to go beyond copyright and address labour issues.

Are we entering a knowledge renaissance? Or a desert where sharing is not allowed? Are countries taking a 21st century lens going to allow anything?

Latin America does not have copyright exceptions with a big enough scope to enable AI. They do not have fair use or text and data mining exceptions. Lack of these copyright exceptions is not really a current issue. Data privacy is the more pressing issue. Health data is an area of focus along with open science discussions on who owns data.

Remuneration tends to go not to the creator but to large rights holders. There is a desire to see a creator right to remuneration against big platforms, including but going beyond AI.

It is difficult for a country to figure out how to enter the AI field.

AI is data colonization. AI is extracting local community data and using it for commercial purposes. Local community data is communal not individual. It should not be used without local community permission. Western civilization notions like copyright are counter to traditional knowledge.

This session generated a lot of observations and questions for me:

Creators push for copyright because it is the only tool they have. What are other strategies?
Current copyright law is deficient in being able to handle all these AI issues. Ensuring ethical, responsible AI that is fair for all will require going beyond copyright law.
Data “mining” sounds like exploitation.
Rights holders want consent, credit, and compensation.
Opting out of something is different from opting in to something. What alternative to big tech AI can creators and AI users opt in to?
What is the commons? Are differentiations such as public domain, commons licensed, copyrighted still relevant? Is the entire Internet just a big commons database available for AI to freely scrape and use?

ML Training and Creators

This moderated group discussion after lunch focused on understanding the position of creators in relation to generative ML systems. What are the threats and opportunities for them? To what extent do creators have agency in determining how their work can or should be used? What tools (legal and/or technical) are available to creators to manage how their works can or should be used? How do CC licences fit into this, and is there a need for CC to adapt or expand the range of tools that it provides to creators?

Here are some of my points of learning and takeaways:

Law is slower than technology.

Creative Commons tools have reduced relevance in the context of generative AI. The way Creative Commons licenses involve attribution, giving back to the community, and creating a commons has been disrupted by AI. The original idea around Creative Commons was to give creators choice. How does Creative Commons support creator choice in the context of AI?

In the context of generative AI users are not just traditional creators but business enterprises, educators, biotech and health care. In what way are Creative Commons licenses useful to these new users?

Creative Commons has played a key historical role in the ethics of consent, expression of preferences, and norms around responsible use. In the context of AI how can CC continue with these roles? Perhaps there is a role around commons based models, commons based open outputs, and AI for the public good?

Generative AI represents a fundamental breaking of the reciprocity of the commons. It has spawned a lack of trust in copyright. AI needs to restore trust associated with data.

AI creates a different power structure. It breaks the social compact of the commons. Many rights holders have become overtly hostile to the commons. Opting out is in some ways an expression of “you can’t learn from me”, an undesirable outcome. One way some trust could be restored is if AI models were by default in the commons. This would restore some balance and giving back.

AI companies still don’t have a business model. The traditional big platform model of selling ads seems inappropriate. AI needs a model that does not give big tech disproportionate benefit.

We need a legal technical innovation (other than copyright) that builds a shared culture where we can all participate. We need a more global AI approach. We don’t have to be western to be modern. What we really need is guardianship not ownership. A means of connecting to one another across national boundaries. AI needs a set of principles and values shared across cultures. We need to give people agency back.

Creative Commons licensing is not a way to accomplish this. But, if it is not licensing then what?

ML Training and the Commons

This final session of the day, a moderated group discussion, continued exploration of the relationship between generative ML and the commons. Questions posed included: How do generative ML models impact the commons? How important are the commons when it comes to ML training? How can we best manage the digital commons in the face of generative ML? How do traditional approaches to protecting the commons from appropriation, such as copyleft and share alike, interact with generative ML?

My note taking diminished this late in the day, but here are a few brief points of interest that came up for me:

“Creative Commons, what are you going to do about AI?” is a question CC is hearing loud and clear.

Creative Commons is certainly about licenses, but not solely. CC is about reducing legal barriers to sharing and creativity. It’s not just about copyright, it’s about growing the commons.

CC tools should never overrule exceptions and limitations.

Copyright is not the best tool for resolving AI issues. CC licenses have held up well over time but other big issues have surfaced. What is legal vs what is not? What is right vs wrong? There are calls for a code of ethics, community guidelines and social policies.

In creating new works, AI could empower creators, not betray them. AI’s unwillingness to cite sources and identify where data comes from is not helping.

Better sharing is sharing in the public interest.

AI could broaden the commons.

Is AI a wonder of the world?

Closing Thoughts

I found this day very thought-provoking. This blog post is by no means a complete, comprehensive summary, merely things I took note of throughout the day.

Kudos to OpenFuture and Creative Commons for co-hosting this day and to all the participants for actively sharing their perspectives, experiences, and advice. I’m especially impressed with CC’s willingness to ask these hard questions and engage in self-critique while at the same time actively seeking to define its position and role in generative AI and the commons going forward.

I appreciate being invited to this session and participating as a vocal active contributor. It helped develop a common understanding and generated observations, questions and discussion that carried forward into the follow-on three days of Creative Commons Global Summit.

I’m a strong advocate for being proactive in defining the future we want. As part of my participation in this workshop I mentioned being in a position to comment on and make recommendations related to Canada’s draft AI Act. I expressed interest in developing a shared collective position of principles, ideas and values that we could bring forward as part of an effort to shape AI legislation in all our respective countries right from the start.

I was thrilled when Paul Keller ran with this idea and over the ensuing three days of the Summit worked with a group of contributors to develop a statement on Making AI Work For Creators and the Commons.

When I returned home to Canada I did submit a briefing note of comments and recommendations on Canada’s AI Act and was delighted to reference and include the text of the Making AI Work For Creators and the Commons statement in my submission. My blog Policy Recommendations for Canada’s AI Act provides context and my full response.

I've created an AI, Creators and the Commons discussion topic in OEGlobal's Connect space. If you have thoughts or ideas on any of this I welcome discussion and suggestions there.

Paul Stacey

PAUL STACEY, BA, BSc, BEd, MEd

Open Leader
