Mass Web Scraping for Copyright Enforcement: A Legal Risk under the GDPR?
In recent years, a growing number of companies have begun offering services to photographers and photo agencies to track down copyright infringements online. While the aim of protecting intellectual property rights is entirely legitimate, the methods of mass web scraping or crawling employed by some of these companies raise significant legal concerns.
In previous blog posts, we discussed how to respond to copyright claims by such companies and outlined the general principles of copyright on pictures, with reference to a specific Belgian court case. This article focuses on the use of mass web scraping tools for copyright enforcement and how this practice raises issues under the General Data Protection Regulation (GDPR), particularly in relation to the principles of legitimate interest, data minimisation, transparency, and the potential shared liability of clients who engage these services.
Two Methods for Online Copyright Detection
Broadly speaking, two models are in use to detect online copyright infringements involving photographs:
1. Mass Web Scraping and Indexing:
In this approach, service providers build their own index or private database by downloading and storing large portions of the internet to scan for matches (i.e., instances where a client’s images are being used online). This typically involves the large-scale storage of pictures (often including identifiable persons), texts (including names and contact details of natural persons), IP addresses, and other data (a simplified sketch contrasting the two models follows below).
2. Targeted Search via Public Databases:
In this second approach, providers use public search engines like Google Images to perform reverse image searches and identify where clients’ photographs are being used online. When a potential copyright infringement is found, the copyright service provider will likely download the relevant part of the website, not the entire site.
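To make the contrast concrete, the following simplified sketch illustrates the two models side by side. It is purely illustrative: all function names, endpoints, and data structures are our own hypothetical assumptions and do not describe any particular provider’s actual implementation.

```python
# Purely illustrative sketch of the two detection models described above.
# All names and endpoints are hypothetical; this is not any provider's code.
import hashlib

import requests
from bs4 import BeautifulSoup


def mass_crawl_and_index(seed_urls, private_index):
    """Model 1 ('ante factum'): fetch and store entire pages up front,
    before any infringement is suspected."""
    for url in seed_urls:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        # The full page is retained in a private database, sweeping up
        # names, contact details, and photos of uninvolved third parties.
        private_index[url] = {
            "html": response.text,
            "image_urls": [img.get("src") for img in soup.find_all("img")],
            "checksum": hashlib.sha256(response.content).hexdigest(),
        }


def targeted_reverse_search(client_image_url):
    """Model 2 ('post factum'): query an existing public index (e.g. a
    reverse image search engine) and retrieve only pages where a match
    is already suspected. The search API below is a placeholder."""
    matches = requests.get(
        "https://reverse-image-search.example/api/search",  # hypothetical endpoint
        params={"image_url": client_image_url},
        timeout=10,
    ).json()
    # Only the specific matching pages are subsequently reviewed.
    return [m["page_url"] for m in matches.get("results", [])]
```

The data protection difference is visible in the code itself: the first function stores everything it encounters, while the second touches only pages already linked to a suspected infringement.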
The first method has increasingly come under scrutiny from data protection authorities and legal scholars. Compared to the more targeted second approach – which involves ‘post factum’ processing of specific data – the first method amounts to ‘ante factum’ bulk data collection and storage. This article focuses on the legal risks associated with the first method. Unfortunately, many of these service providers do not disclose which approach they use, thus creating uncertainty both for internet users in general and for their own clients.[1]
Dutch DPA: Legitimate Interest Cannot Justify Mass Web Scraping
On 2 April 2025, the Dutch Data Protection Authority (Autoriteit Persoonsgegevens) published detailed guidance on web scraping or web crawling by private organisations. While the document is not limited to web scraping for copyright enforcement and also covers, for instance, scraping for AI training, it contains several important conclusions relevant to copyright contexts.
The Dutch DPA concluded: “Scraping of personal data from the internet quickly constitutes a serious infringement of the data protection rights of those whose data are being scraped. Private organisations and individuals who wish to use scraped data must fully comply with the principles and requirements of the GDPR”.
Notably, the Dutch DPA expressed doubt that large-scale web scraping can be justified on the basis of legitimate interest (Article 6(1)(f) GDPR). It specifically noted that, when using scraping techniques or scraped personal data from the internet, it will be difficult, if not impossible, to meet the criteria of legitimate interest.
When using legitimate interest as a legal basis for mass crawling and storing of websites, not only must the objective be legitimate, but the data processing must also meet the tests of necessity and proportionality. It seems to us that, in the context of enforcing copyright, the aim will indeed be legitimate, but mass collection and storage of data – including images and personal details unrelated to any specific copyright infringement – will likely fail to meet the necessity and proportionality thresholds.
The Dutch DPA rightly observed: “In general, the broader the scraper searches, the greater the infringement on the privacy of those concerned.” It also emphasised that when individuals do not reasonably expect their personal data to be scraped, their privacy interests will weigh more heavily against the interests of the scraper or its clients.
In other words, legitimate interest will likely not provide a valid legal basis for such large-scale crawling and processing activities. The Dutch DPA added that the other legal bases under Article 6(1) GDPR – consent of the data subject, performance of a contract, legal obligation, vital interests, and tasks carried out in the public interest or in the exercise of official authority – are also unlikely to apply in this context.
Data Minimisation: A Core GDPR Principle at Risk
The principle of data minimisation (Article 5(1)(c) GDPR) requires that only data necessary for a specific purpose be collected. It seems to us that mass scraping of vast portions of the internet – capturing entire websites, images of identifiable persons, IP addresses, names and contact details, etc. – and storing them in a private database almost inevitably leads to excessive processing of personal data. This may even extend to special categories of personal data, or to images of children and other data that merit specific protection.
The European Court of Justice (ECJ) has ruled that even for legitimate purposes, data processing must remain within the boundaries of what is strictly necessary (Case C-13/16, Rīgas satiksme). Guidelines 4/2019 of the European Data Protection Board (EDPB) on data protection by design and by default likewise put forward an “obligation to only process personal data which are necessary for each specific purpose.”
The European Data Protection Supervisor (EDPS), in its Orientations for ensuring data protection compliance when using generative AI systems, has also expressed concerns about scraping practices, particularly the large-scale collection of data from websites in the context of training AI systems. These concerns focus on potential violations of the principles of data minimisation and data accuracy.
In short, bulk processing of websites, including images in which natural persons are depicted and other personal data of uninvolved third parties, is unlikely to be compatible with the principle of data minimisation, particularly where more targeted and less intrusive methods exist for fighting online copyright infringements (see the second method discussed above).
Transparency Obligations when Crawling and Indexing the Internet
Another major concern is the lack of transparency concerning these internet crawling activities (Article 5(1)(a) GDPR). Many scraping service providers do not adequately inform individuals that their personal data – such as their images – are being collected, stored, and processed. These scraping and storing activities are typically not disclosed on their websites, in their privacy policies, or in their terms and conditions.[2] Even if they were mentioned, such disclosures would likely be insufficient to properly inform data subjects.
This lack of transparency is particularly problematic given the scale and impact of their data processing. This seems to violate Articles 13 and 14 GDPR, which impose obligations to proactively inform data subjects when their personal data are being processed. It does not seem to us that these transparency obligations could be set aside by relying on the exemption of Article 14(5)(b) GDPR.
Accuracy and Retention of Scraped Data
Personal data that are being processed must be accurate and kept up to date (Article 5(1)(d) GDPR). When processing scraped data from different websites, it seems difficult or even impossible to verify the accuracy of the data at every stage of the processing, especially when stored for a long time. This raises further concerns regarding retention periods and storage limitation obligations.
Joint Control and Client Liability for GDPR Infringements
Responsibility for GDPR compliance does not lie solely with the service providers, but also with the photo agencies and photographers using such services. Under the GDPR, clients may be considered joint controllers and held jointly liable for potential violations (see ECJ Fashion ID, C-40/17; and ECJ Wirtschaftsakademie, C-210/16). Whether a client using such scraping services qualifies as a joint controller will depend on the circumstances, but those who make use of mass scraping and indexing services are likely to meet the threshold and may be jointly liable for GDPR breaches, including fines of up to EUR 20 million or 4% of total worldwide annual turnover, whichever is higher (Article 83(5) GDPR).
This liability cannot be contractually waived. Clients must be able to demonstrate GDPR compliance and fulfil requests from data subjects, such as rights of access, rectification, or erasure.
It seems to us that few clients of these scraping services clearly disclose on their websites or elsewhere that they are using large-scale scraping services to collect personal data from the internet.
Conclusion: A Risk-Based Approach is Required
While enforcing copyright is a legitimate goal, the methods used must respect data protection law. Copyright holders and their legal advisors must evaluate whether their enforcement partners comply with fundamental data protection rules under the GDPR.
Key questions to consider include:
- What method does your copyright enforcement provider use? Do they employ mass web scraping to build their own private databases, or do they use more targeted search tools in public databases?
- Is your service provider transparent about its data processing practices? Are data subjects – and clients – adequately informed (if not, why not)?
- Does your service provider comply with the principle of data minimisation?
- Is the processing based on mass web scraping truly necessary and proportionate?
- Could your organisation be deemed a joint controller with shared GDPR liability? Since your liability cannot be contractually excluded, what assurances, support, or information does your provider offer?
More regulatory guidance and case law on this topic are likely to follow. In the meantime, copyright owners should carefully assess the methods of the copyright enforcement services they engage to avoid GDPR liability risks.
Disclaimer: This article is provided for informational purposes only and does not constitute legal advice regarding the use of any particular service, technology, or method. It is based on limited publicly available information at the time of writing. Due to the lack of transparency surrounding many copyright enforcement providers and their operations, we make no representations or warranties as to the accuracy, completeness, or currency of the information contained herein. We expressly disclaim any and all liability for any loss or damage arising from (reliance on) this article. Readers are advised to seek independent legal counsel for guidance regarding their specific situation.
For more information on data protection and copyright enforcement, or to assess your organisation’s compliance, feel free to contact us at Finnian & Columba.
Footnotes:
[1] For example, Maik Piel, CTO of Pixray (commercial name Fair Licensing) stated in a 2019 interview that the company used customised versions of StormCrawler to run three types of web crawls: broad regional scans (e.g. across the entire EU or North America) involving over 10 billion URLs and tens of millions of domains; deep, domain-specific crawls; and near real-time discovery scans on thousands of targeted domains. According to that interview, Pixray’s technology stack included StormCrawler, Elasticsearch and Kibana, integrated via RabbitMQ, and operated on substantial server infrastructure. Based on this information, it appears highly likely that Pixray stored crawled webpages and associated metadata (particularly given the use of Elasticsearch); that they created a searchable index of URLs, extracted content, and metadata; and that such index underpinned their image-matching and copyright monitoring activities. However, as this interview dates from 2019, we cannot confirm whether Pixray currently continues to use this approach or infrastructure.
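For readers unfamiliar with this stack: a StormCrawler-style pipeline typically writes one document per crawled page into Elasticsearch, which is what makes the stored content searchable afterwards. The sketch below shows the kind of record such a pipeline might persist; the field names, index name, and client call are our own assumptions based on common crawler configurations, not Pixray’s actual schema.

```python
# Hypothetical illustration of the kind of record a StormCrawler-style
# pipeline writes to Elasticsearch for each crawled page. Field names are
# assumptions based on common crawler setups, not Pixray's actual schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

crawled_page = {
    "url": "https://example.com/blog/post",       # where the content was found
    "content": "<extracted page text>",           # may contain names and contact details
    "image_urls": ["https://example.com/a.jpg"],  # may depict identifiable persons
    "fetch_time": "2019-06-01T12:00:00Z",         # timestamps enable long-term retention
}
es.index(index="crawled_pages", document=crawled_page)
```

If pages are stored in this manner, each record persists personal data from a website regardless of whether any infringement is ever found there – precisely the ‘ante factum’ concern discussed in the main text.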
A further example may be found on PicRights’ website, where the company mentions that its image recognition technology crawls commercial websites and reports matches to clients, which are then uploaded to the client’s account on its platform. To enable the detection of unauthorised uses of images, it seems reasonable to infer that PicRights would need to crawl a wide array of websites – including many that are not suspected of copyright infringement – since potential infringements are typically not known in advance. Such crawling would likely involve indexing and storing content in order to compare it against protected images supplied by clients. While we have not found publicly available figures regarding the number of URLs or domains crawled by PicRights, its stated objective of monitoring for global copyright infringement suggests the deployment of a high-volume, large-scale crawling and indexing operation.
Disclaimer: This analysis is based on publicly available information and reasonable inference. It does not purport to describe the current operations or technological practices of the companies mentioned with absolute certainty. Unfortunately, neither Pixray nor PicRights offer full public transparency regarding the scope, methods, or data practices of their crawling and enforcement technologies. This lack of disclosure necessitates reliance on cautious assumptions, historical sources, and indirect evidence – hence the use of qualified language and multiple disclaimers throughout this article. For the avoidance of doubt: we make no representations or warranties as to the accuracy, completeness, or currency of the information presented herein and disclaim all liability for any loss or damage arising from this article.
[2] To give the same examples as above: The Privacy Policy of Pixray – Fair Licensing refers only to personal information collected when visiting its own website. It does not mention any crawling, scraping, or storing activities of data from third-party websites. Likewise, the T&Cs and Privacy Policy of PicRights (version of March 2024) do not clearly mention any crawling, scraping, or processing activities of other websites. Its Privacy Policy, under heading III, outlines the data collected when visiting its own website, submitting a case, using the settlement portal, etc., but does not reference its broader data collection activities on third-party websites. Further in the document – after the numbering of the headings restarts – heading XI lists legitimate interests as the legal basis for processing. In heading XII, PicRights states that it collects data from websites where it has identified potential infringements, without mentioning that it also collects data from unrelated websites (its privacy policy reads: “We collect the data via freely accessible sources, i.e. from the website where we have established a possible infringement of copyright law, from the WhoIs data of the aforementioned website, yellow pages, commercial registers etc.”).