Under GDPR, can personal data be processed based on legitimate interests for AI-related purposes, whether for training, deployment or beyond? That was the key focus of the EDPB stakeholder event on AI models on 5 Nov 24, which I had registered for and was fortunate enough to get a place at - many thanks to the EDPB for holding this event!
The event was intended to gather cross-stakeholder views to inform the EDPB's drafting of an Art.64(2) consistency opinion on AI models (defined quite broadly), requested by the Irish DPC. The EDPB said it would issue this opinion by the end of 2024 but, unlike EDPB guidelines, such consistency opinions can't be updated - which is concerning given how important this area is.
The specific questions were:
- AI models and "personal data" - technical ways to evaluate whether an AI model trained using personal data still processes personal data? Any specific tools / methods to assess risks of regurgitation and extraction of personal data from AI models trained using personal data? Which measures (upstream or downstream) can help reduce risks of extracting personal data from such AI models trained using personal data? (including effectiveness, metrics, residual risk)
- Can "legitimate interest” be relied on as a lawful basis for processing personal data in AI models?
- When training AI models - and what measures to ensure an appropriate balance of interests, considering both first-party and third-party personal data?
- In the post-training phase, like deployment or retraining - and what measures to ensure an appropriate balance, and what if the competent supervisory authority found the model's initial training involved unlawful processing?
There wasn't enough time for me to explain my planned input properly or to comment on some issues, given the number of attendees, so I am doing it here. I'll take the second set first.
Training AI models - legitimate interests
I strongly believe legitimate interest should be a valid legal basis for training AI with personal data - particularly training AI to reduce the risk of bias or discrimination against the people in relation to whom the AI is used.
I had a negative experience with facial biometrics. The UK Passport Office's system kept insisting my eyes were shut when they were wide open - they're just small East Asian eyes; white people's eyes are usually bigger. Others have suffered far worse from facial biometrics and facial recognition, including wrongful arrests, denial of food purchases and debanking (see my book and 23.5 of the free companion PDF under Facial recognition).
Had the AI concerned been trained on more, and enough, non-white faces, it would be much less likely to claim that facial features not matching typical white features (like eye size or hair shape) were "inappropriate", or to misidentify non-white people, leading to their wrongful arrests.
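To make the "detection" half of bias detection and correction concrete, here's a minimal, invented sketch of the kind of check a developer might run: comparing false-rejection rates for compliant photos across demographic groups in a labelled evaluation set. The group names and results below are made up purely for illustration.

```python
# A minimal, invented illustration of bias detection: compare false-rejection
# rates for compliant photos across demographic groups in a labelled
# evaluation set. Group names and results are made up for this sketch.
from collections import defaultdict

# Each record: (group label, model rejected the photo?, photo actually compliant?)
evaluation_results = [
    ("group_a", False, True),
    ("group_a", False, True),
    ("group_a", True,  False),  # correctly rejected a non-compliant photo
    ("group_b", True,  True),   # compliant photo wrongly rejected
    ("group_b", False, True),
    ("group_b", True,  True),   # compliant photo wrongly rejected
]

wrong_rejections = defaultdict(int)
compliant_photos = defaultdict(int)
for group, model_rejected, actually_compliant in evaluation_results:
    if actually_compliant:
        compliant_photos[group] += 1
        if model_rejected:
            wrong_rejections[group] += 1

# A large gap between groups here is the sort of disparity that more
# representative training data is intended to reduce.
for group, total in compliant_photos.items():
    rate = wrong_rejections[group] / total
    print(f"{group}: false-rejection rate on compliant photos = {rate:.0%}")
```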
The EU AI Act is aware of this risk: Art.10(5) (and see Rec.70) specifically permits providers of high-risk AI systems to process special categories of personal data, subject to appropriate safeguards and meeting certain conditions:
- the bias detection and correction cannot be effectively fulfilled by processing other data, including synthetic or anonymised data;
- the special categories of personal data are subject to technical limitations on the re-use of the personal data, and state-of-the-art security and privacy-preserving measures, including pseudonymisation;
- the special categories of personal data are subject to measures to ensure that the personal data processed are secured, protected, subject to suitable safeguards, including strict controls and documentation of the access, to avoid misuse and ensure that only authorised persons have access to those personal data with appropriate confidentiality obligations;
- the special categories of personal data are not to be transmitted, transferred or otherwise accessed by other parties;
- the special categories of personal data are deleted once the bias has been corrected or the personal data has reached the end of its retention period, whichever comes first;
- (GDPR) records of processing activities include reasons why processing of special categories of personal data was strictly necessary to detect and correct biases, and why that objective could not be achieved by processing other data.
(Aside: I know that Article also mentions "appropriate safeguards", but I'd argue that meeting those conditions would provide the minimum required safeguards - although in some cases others could be considered necessary.)
The Act confines this permission to the use of special category data in high-risk AI systems, but I'd argue that legitimate interests should permit the use of non-special category personal data, subject to meeting the above conditions (and any other appropriate safeguards).
Recall that personal data can be processed under GDPR's legitimate interests legal basis if "necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child." The EDPB's recent guidelines on processing based on Art.6(1)(f) note three cumulative conditions to enable processing based on legitimate interests:
- the pursuit of a legitimate interest by the controller or by a third party;
- the need to process personal data for the purposes of the legitimate interest(s) pursued; and
- the interests or fundamental freedoms and rights of the concerned data subjects do not take precedence over the legitimate interest(s) of the controller or of a third party.
Let's review those in turn.
Detecting and correcting bias involves the pursuit of a legitimate interest of the controller, i.e. the AI developer, and of third parties. I'd argue that many, many third parties - those in relation to whom the AI is to be used - have a legitimate interest in not being discriminated against due to biased AI. I've already mentioned biased AI resulting in wrongful arrests, denial of services important to daily life like buying food, and debanking (see 23.5 of my free PDF under Facial recognition).
It is indeed necessary to process the personal data of people in certain groups in order to train AI models to reduce bias, as the experiences noted above (and many others) show.
Finally, the balancing test in the final limb must clearly consider the legitimate interests, not just of the controller, but also of "a third party" - in this case, the legitimate interest of third parties, in relation to whom the AI is to be used, not to be discriminated against. (While fairness is a core principle of the GDPR, this only concerns fairness to the individual whose personal data is being processed. Processing A's personal data to try to ensure fairness to B isn't a concept explicitly provided for in GDPR. There are mentions of "rights and freedoms of others" or other data subjects, but more in the sense of not adversely affecting their rights/freedoms, rather than positive obligations in their favour.)
I argue that, if the conditions in Art.10(5) AI Act are implemented as a minimum when training AI using personal data, that should tilt the balancing test in favour of the controller and those third parties, and enable legitimate interests to be used as the legal basis for the training - at least in the case of non-special category data - even when training non-high-risk AI. I really hope the EDPB will agree.
However, the problem remains of how to use special category data to train non-high-risk AI systems to detect and address bias. Some examples I mentioned could fall through the cracks.
The UK Passport Office's AI system, designed to reject photos with "inappropriate" facial features, is probably a high-risk AI system within Annex III para.5(a) (if the Act applied in the UK). Yet, para.5 (and Annex III more generally) does not protect anyone from being refused a private bank account or being debanked as a result of biased AI being applied to them.
And a huge hole in the AI Act is this: Annex III para.1(a) excludes "AI systems intended to be used for biometric verification the sole purpose of which is to confirm that a specific natural person is the person he or she claims to be". What if an AI biometric verification system used by a bank mistakenly says someone is not who they claim to be, because it can't verify the identity of non-white people properly, not having been trained on enough non-white faces - and the bank's systems therefore automatically debank that individual? How can such a biased AI biometric verification system be "fixed" if it can't be fully trained in this way?
Such an AI system is not classed as a high-risk AI system, because of the biometric verification exclusion. Therefore, the developer isn't allowed to train the AI using special category data, because Art.10(5) AI Act only allows this for high-risk AI systems! (Yes, I know there's the odd situation where biometric data is "special category" data only when used for the purpose of uniquely identifying someone, so it could be argued that using non-white people's facial biometrics to train AI models isn't processing their special category data, because the processing purpose isn't to identify those specific people - and I'd certainly be putting that argument and pushing for being able to use legitimate interests for that training. But - really? Why should those arguments be necessary?)
It was argued that Art.9(2)(g) (necessary for reasons of substantial public interest etc.) doesn't allow processing of special category data to train AI, even though there is a substantial public interest in addressing bias. I agree there is a huge public interest there, but I also agree that, due to the wording of that provision, it can't apply unless EU or Member State law (proportionate to the aim pursued, etc.) provides a basis for such processing. EU law, in the form of AI Act Art.10(5), does provide a basis for processing special category data in high-risk AI systems - but it doesn't provide such a basis in the case of non-high-risk AI, or non-special category data - hence the need to argue that biometric data isn't special category when used for training! I guess it'll have to be down to national laws to provide for this clearly enough. France, Germany or Ireland, perhaps?
(Consent isn't feasible in practice here, given the volumes involved, and issues like having to repeat AI training after removing, from the training data, any personal data for which consent has been withdrawn. It was argued that financial costs or training time for AI developers shouldn't be relevant in data protection, but equally it was argued that the environmental costs etc. of repeating training are relevant. I'll only briefly mention practical workarounds, like not removing that data but preventing it from appearing in outputs through technical measures whose efficacy is debated.)
If including my personal data in training datasets can help to reduce the risk of otherwise biased AI systems discriminating against you when deployed (should you be in the same ethnic or other grouping as me), then personally I'd be OK with that - partly informed by my own bad experiences with AI biometrics. Shouldn't such processing of data for AI training be permitted, even encouraged? But, currently, this issue is not properly or fully addressed, as I've shown above. So there's a big data dilemma here that still remains to be dealt with.
AI models and personal data
Does an AI model "contain" personal data, given that strictly it's not a database per se? Or is it just something that can be used to produce personal data when deployed, with personal data being processed only at the usage stage? Much debate, and diametrically opposing views (and difficult questions like: can a GPAI model developer be said to control the purposes and means for which deployers of the model use it?). [Added: I meant to expand on that to clarify the question - is the model developer controlling the purposes of processing personal data, particularly with general-purpose/foundation models, or is it merely providing part of the means of processing to others, i.e. is it really a "controller"?]
Rather than pinhead-dancing around that question, personally I think the use of a deployed AI system is the most relevant processing here, because that's the main point at which LLMs/large language models (which the event focused on almost exclusively) could regurgitate accurate or inaccurate personal data - whether through prompt injection attacks or similar, or because a model's guardrails weren't strong enough.
I feel the EDPB's query on technical ways to evaluate whether an AI model trained using personal data "still processes personal data" is really one more for technical AI experts to answer, and that what merits more attention is preventing the regurgitation/extraction of training data, whether personal data or otherwise, at the deployment/use stage. It's well known that attacks have successfully extracted training data, including personal data, from models - although with some limitations and caveats (paper & article; another article). This has been shown to be possible not only with open source models (where attackers obviously have access to more info about the model, its parameters etc., and indeed to the model itself), but even with semi-open and closed source models like ChatGPT.
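Purely for flavour, here's a toy sketch of the sort of probe that extraction research relies on: feed the model a prefix of a string known to be in its training data and check whether it completes the rest verbatim. It's a simplification, not the method of any particular paper; the `generate` function is a hypothetical stand-in for the model under test, and the snippet is invented.

```python
# Toy memorisation probe (hugely simplified): give the model the start of a
# known training string and see whether it reproduces the remainder verbatim.
# `generate` is a hypothetical stand-in for whatever interface the model exposes.

def generate(prompt: str) -> str:
    # Placeholder: a real probe would call the deployed model/API here.
    return ""

known_training_snippets = [
    "Jane Example, 12 High Street, Anytown, phone 01234 567890",  # invented
]

for snippet in known_training_snippets:
    prefix, expected_suffix = snippet[:25], snippet[25:]
    completion = generate(prefix)
    if expected_suffix.strip() and expected_suffix.strip() in completion:
        print("Possible regurgitation of training data:", snippet)
```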
Again, my view is that assessing and reducing training data regurgitation/extraction risks are essentially questions for technical AI experts. Reducing such risks mainly involves technical measures, and this is an emerging area where much research is still being conducted, so I feel it's premature to rule on such measures at this point (although organisational measures are also possible, and recommended, like deployers prohibiting their users from trying to extract personal data from any AI).
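By way of example of a downstream technical measure, here's a minimal, deliberately naive sketch of filtering a model's output for obvious personal-data-like patterns before it reaches the user - exactly the sort of measure whose efficacy and residual risk are debated, and no substitute for proper guardrails.

```python
# Minimal sketch of one downstream measure: redact obvious personal-data-like
# patterns from a model's output before showing it to the user. Deliberately
# naive and purely illustrative.
import re

PII_LIKE_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),   # email-like strings
    re.compile(r"(?:\+?\d[\s-]?){10,14}"),     # phone-number-like strings
]

def redact_output(model_output: str) -> str:
    """Replace obvious PII-like substrings in the model's output with a placeholder."""
    for pattern in PII_LIKE_PATTERNS:
        model_output = pattern.sub("[REDACTED]", model_output)
    return model_output

print(redact_output("You can reach Jane at jane@example.com or +44 1234 567890."))
# -> "You can reach Jane at [REDACTED] or [REDACTED]."
```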
AI value chain: controllers, processors
More interesting, and difficult, from a GDPR perspective are the crucial questions of: who is a controller, who is a processor, who is liable for what, and at which stages in the AI lifecycle?
Unfortunately, these weren't really discussed at the event. To be fair, the focus of the event was meant to be legitimate interests, not the controller/processor position of AI model/system providers.
I still tried to raise them, but wasn't allowed to speak again to clarify my points, so I'll do that below in the form of some "exam questions". But, first, I want to spell out some issues with the AI supply chain that I couldn't expand on during the event.
If a developer organisation makes its own AI model available for customers to use then, depending on the business model adopted by the organisation (and the following isn't comprehensive!), the supply chain can involve several alternative options (a simplified code sketch of the basic access routes follows below):
- The model could be accessed via the model developer's API, and/or
- The model could be permitted to be:
  - Downloaded by customers as a standalone model, then
    - Embedded/integrated within an AI system developed by the customer (which the customer could use internally only, or offer to its own customers in turn), or
    - Accessed via API by a customer-developed AI system (which the customer could use internally only, or offer to its own customers in turn), where the downloaded model is hosted
      - on-prem, or
      - (more likely) in-cloud, using the customer's IaaS/PaaS provider, but with all AI-related operations being self-managed by the customer, or
  - (Common nowadays) deployed and used by the customer for the customer's AI system (which the customer could use internally only, or offer to its own customers in turn), through the customer using a provider's cloud AI management platform with the benefit of tools/services available from the cloud provider to ease AI-related operations like fine-tuning models, building AI systems, using RAG, etc.
    - Note: the model used could be one of the cloud provider's own models (i.e. where the cloud provider is the model developer), or it could be a third-party model offered through the cloud provider's own AI marketplace or similar. Exactly what licence/contract terms apply to the customer in such a scenario, particularly with third-party models, let alone what the controller/processor position is there, is still clear as mud (see below).
And I won't even mention the twists introduced by using RAG/retrieval-augmented generation in LLMs, at this point.
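As trailed above, here's a purely illustrative sketch of the two basic access routes, because who actually hosts and runs the model is the factual question that much of the controller/processor analysis turns on. The endpoint, model name and local model stand-in are hypothetical assumptions, not any real provider's API.

```python
# Illustrative sketch of the two basic access routes. The endpoint, model name
# and LocalModel class are hypothetical, not a real provider's API.
import requests

class LocalModel:
    """Stand-in for a model downloaded and hosted by the customer itself."""
    def generate(self, prompt: str) -> str:
        return ""  # placeholder - a real model would produce text here

def call_developer_hosted_model(prompt: str) -> str:
    """Route 1: the model stays on the developer's infrastructure, so the customer's
    prompts (which may contain personal data) are processed on the developer's systems."""
    response = requests.post(
        "https://api.model-developer.invalid/v1/generate",  # hypothetical endpoint
        json={"model": "example-model", "prompt": prompt},
        timeout=30,
    )
    return response.json().get("output", "")

def call_self_hosted_model(prompt: str) -> str:
    """Route 2: the customer downloads the model and hosts it on-prem or in its own
    cloud tenancy, so prompts stay on infrastructure the customer manages."""
    return LocalModel().generate(prompt)
```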
All that spelt out, now on to my exam questions:

1. After an organisation deploys a third-party model in an AI system:
    1.1 If a user in the organisation deliberately extracts personal data from the AI without the deploying/employing organisation's authorisation:
        - Is the rogue user a controller in their own right, so that the organisation is not responsible as a controller under data protection law (as with the Morrisons case in the UK)?
        - Does or should the AI model developer bear any responsibility or liability at the deployment and use stage as a controller in some way, if the guardrails they implemented against the extraction weren't appropriate? Or could it be a processor, particularly if the model is hosted by the model or system developer?
        - Even if the model is considered not to "contain" any personal data, so that the model developer is not a controller of the model itself, could the model developer be considered to have some responsibility if and when personal data is extracted from the AI at the deployment and use stage?
        - Remember, for security measures under GDPR, a security breach alone doesn't mean the security measures weren't appropriate; it's quite possible for an organisation that had implemented appropriate security measures to suffer a personal data breach nevertheless.
        - Also, to reiterate, measures to reduce the risk of extracting training data from AI models are still being developed; this is very much a nascent research area.
        - Recall that a developer providing software for download/on-prem installation is not generally considered a processor or controller, but when it offers software via the cloud as SaaS, it is at least a processor, and even a controller to the extent it uses customer data for its own purposes. If a model developer makes a model (software) available, but doesn't host it for customers, it seems the developer shouldn't even be a processor?
    1.2 If a user in the organisation deliberately extracts personal data from an AI with the deploying/employing organisation's authorisation (e.g. for research, or for the organisation's own purposes):
        - Is the organisation a controller, responsible/liable for that extraction as "processing"? (And could the GDPR research exemption apply there, if for research?)
        - Could the AI model developer and/or AI system developer bear any responsibility or liability for this extraction as a controller in some way, if the guardrails they implemented against the extraction weren't appropriate, as above, or as a processor?
        - Note that the same points/queries apply as in 1.1 above!
    1.3 If a user in the organisation uses the AI in such a way that, without the user intending it, the AI regurgitates personal data, who is responsible as controller for that output, which is "processing"?
        - Remember, a user could process personal data by including it in the input provided to the AI (not discussed further here), but personal data could also be processed if it is included in the AI's output.
        - Does or should the AI model developer and/or AI system developer bear any responsibility or liability as a controller in some way, if the guardrails they implemented against inadvertent regurgitation weren't appropriate? Or could it be a processor, or neither?
        - Note that the same points/queries apply as with deliberate extraction in 1.1 above.
        - What difference, if any, does it make if the personal data in the output is accurate, or inaccurate (e.g. defamatory of the individual concerned)?
    1.4 If a person unrelated to the organisation, e.g. a third-party hacker, manages to access the deployed AI to extract training data such as personal data, is the deploying organisation responsible as controller? What about the model/system developer?
2. Do any of the above apply, and are they relevant, when an AI developer makes its model available to customers via the developer's API only? Is the model developer/provider a processor for customers in that situation?
    - Again, see 1.1 above. In particular, it seems the AI developer hosting the model offered to customers would at least be a processor here.
3. What if a customer uses a third-party AI model hosted by the customer's cloud provider? Is the cloud provider only a processor for the customer, or could it be a controller in any way?
    - Does it make a difference if the model used by the customer is the cloud provider's own model, or another party's model?
    - Does it make a difference if the model's use is completely self-managed by the customer, or if the customer is using a cloud provider's cloud AI management platform?
    - Do the licence terms, cloud agreement terms and/or other terms applicable to the customer's use of the cloud service/AI platform affect the position? (Under GDPR it's the factual control of purposes and means that matters, and contract terms are not determinative, but terms could nevertheless influence the factual position in some cases, especially in what they permit or prohibit...)
    - Indeed, back to the AI Act: who is the model provider - the AI platform provider, or the model developer?
4. Rinse and repeat for AI system developers/providers - could they be responsible/liable as controllers and/or processors, especially if a model provider hosts its model, or AI systems using its model, for customers in-cloud?