Data mining: why the EU's proposed copyright measures get it wrong
Data that is mined with the help of machine learning techniques has been a rapid area of technological advancement – with good and bad consequences for everyone. And EU copyright law is currently caught in the crossfire.
Cambridge Analytica and Facebook’s recent data scandal, which involved the profiling of users from their online behaviour facilitated by social networks, brought important issues to the surface about web privacy, only after it was reported that millions of people had their data harvested and improperly shared with a political consultancy.
Many people think that regulating the use of data is a matter of data protection or privacy laws. However, where the raw material subjected to analysis is not “personal data” but material protected under copyright law, such as texts or certain structured databases, another set of legal norms come into play. This has far reaching and little understood consequences.
Fit for the digital age?
The proposal for a Digital Single Market Copyright Directive, which is currently at a critical stage in the European Parliament and in the Council of the EU, contains a number of provisions intended to modernise the bloc’s copyright law, making it – in EU jargon – “fit for the digital age”.
One of the provisions relates to a mandatory text and data mining (TDM) exception that would benefit research organisations acting for research purposes, for example a university researcher scanning and analysing scientific articles.
At first sight, permitting such uses appears to be a progressive idea, because copyright law exists to protect new expressions, and by doing so, to promote creativity. However, in our view, the debate around text and data mining and the need for a dedicated copyright exception has been focusing on the wrong premise.
Why should text and data mining of copyright protected materials be a copyright matter at all? There would be no need for a copyright exception if the labelling of TDM as a copyright infringement was removed.
TDM refers to the use of ideas, principles, facts and correlations contained in literary works, other types of texts or in datasets. Put simply, TDM techniques do not use copyright works as works, they access the information stored in them.
The informational value in texts is not, and should not, be protected by copyright law.
Copyright protects original expressions. Its protection does not extend to ideas, procedures, methods of operation or mathematical concepts as such.
This principle can be found not only in all major copyright traditions, but also in the treaties that establish the international copyright framework.
Under the proposed directive, here’s how copyright law may treat text and data mining. While the idea of a young wizard studying magic with his fellow students is not protected by copyright law, the original expression rendered by JK Rowling with her series of Harry Potter novels is protected.
If a temporary reproduction of a Harry Potter book is made in order to produce unauthorised copies, this is clearly a copyright infringement. But if the temporary copy is made within the scope of a machine learning application to study natural language processing, only the informational value of Rowling’s text is extracted. These text extracts are annotated with labels, such as named entities and sentiment tags, which are then processed, for example, to create or improve automatic translation tools.
But the novel is not replicated as a novel, so such an extraction method should not constitute a copyright infringement.
It’s far from clear whether the current EU legal framework recognises this, however. The Information Society Directive gives a broad definition to the right of reproduction and provides a narrowly defined “closed” list of exceptions. It means that, at present, most TDM activities in the EU are deemed copyright infringements, or their legal status is so uncertain that it is safer to assume a violation has occurred.
It is important to test the wording of the exception proposed in Article 3 of the EU’s Copyright Directive against the goal of making the informational value of copyright works available to all. In our view, it remains unsatisfactory in at least three respects:
Societies are better off with more knowledge not less. They will be more culturally attractive, economically competitive and technologically advanced, all of which leads to more innovation, more jobs and well-informed citizens. Copyright law is an instrument to achieve this public policy goal.
We suggest that – in order to rebalance the relationship between investment and innovation – anyone who has lawful access to copyright materials, including public interest research organisations and businesses, should be able to conduct text and data mining.
EU copyright law really needs an open exception that is responsive to technological development, similar to those already available in many other jurisdictions that permit TDM activities. But it seems that legislators in the EU are unable to offer even modest certainty.
Martin Kretschmer is Director of CREATe, the UK Centre for Copyright and New Business Models in the Creative Economy at the University of Glasgow (www.create.ac.uk). CREATe has received funding from AHRC, EPSRC and ESRC. Martin Kretschmer also has received external funding from ALCS, Microsoft, UK Intellectual Property Office, UK Cabinet Office and the European Parliament. Martin Kretschmer is a member of the Steering Group of the UK Scholarly Communications Licence (http://ukscl.ac.uk/) and a member of the Research Expert Advisory Group of the UK Intellectual Property Office.
Thomas Margoni is member of CREATe and receives additional funding from the EU H2020 framework programme under projects openminted.eu and openaire.eu.