Representative or Minimal? Training Data Between the AI Act and the GDPR

Training data is the point where two major regimes collide. The EU AI Act, through Article 10, wants training data that is relevant, representative, and examined for bias. The GDPR wants personal data minimised, collected for a clear purpose, and processed on a lawful basis. Pull both ways at once and you have the central tension of AI data governance.

The conflict is not theoretical. Article 10's representativeness requirement can push toward collecting more data, more demographic coverage, and more edge cases, precisely to detect and reduce bias. GDPR data minimisation pushes the other way. Neither regime yields to the other, because they apply cumulatively. The resolution is therefore not to pick one but to document proportionality: to record why the data you hold is the minimum necessary to meet the representativeness and bias-examination duties.

The special-category problem is sharper still. Examining a system for bias on protected characteristics often requires processing exactly the sensitive data that GDPR Article 9 restricts. AI Act Article 10(5) and GDPR Article 9 apply together rather than one excusing the other. The lawful route exists, but it has to be reasoned and recorded, not assumed.

Sequence matters too. The bias examination required by AI Act Article 10(2)(f) is, in practice, a precondition for defensible automated decision-making under GDPR Article 22, so the order in which you do the work affects whether the result holds up.

Purpose compatibility is the threshold most pipelines skip. Before a dataset enters training, GDPR Article 6(4) compatibility should be treated as a gate, not an afterthought, because data repurposed into training without that analysis carries risk into everything built on it.
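As a rough illustration of what treating compatibility as a gate could look like in a pipeline, here is a minimal Python sketch. Everything in it is hypothetical: the `DatasetRecord` fields, the `admit_to_training` function, and the idea of storing a reference to a written assessment are assumptions made for the example, not a prescribed implementation. The gate only checks that an Article 6(4) analysis is on record; it does not perform the legal analysis itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata a pipeline might keep for each dataset."""
    name: str
    contains_personal_data: bool
    collection_purpose: str
    # Pointer to a documented Article 6(4) compatibility assessment, if any.
    # (Assumed field; the reference format is up to the organisation.)
    compatibility_assessment_ref: Optional[str] = None


class PurposeGateError(Exception):
    """Raised when a dataset lacks a recorded compatibility assessment."""


def admit_to_training(record: DatasetRecord) -> DatasetRecord:
    """Return the record if it may enter training, otherwise raise.

    Datasets without personal data pass through unchanged. Datasets with
    personal data need a non-empty assessment reference on file.
    """
    if not record.contains_personal_data:
        return record
    if not record.compatibility_assessment_ref:
        raise PurposeGateError(
            f"{record.name}: no purpose-compatibility assessment on record; "
            f"original purpose was '{record.collection_purpose}'"
        )
    return record
```

Placed in front of the training step, a check like this makes the absence of the analysis a hard stop rather than something discovered after a model has been built on the data.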

Then there is the question the guidance has not answered: erasure. The GDPR's right to erasure under Article 17 sits awkwardly against trained model parameters, where the data is no longer stored as records but is, in some sense, reflected in the weights. The law here is genuinely open, and a defensible position has to be reasoned rather than asserted.

The exposure spans both regimes: aggregate penalties can reach 7% of global annual turnover. Separately, a DPIA under GDPR Article 35 is required for virtually all high-risk AI training involving personal data.

Our whitepaper, Training Data Governance, maps the interactions in detail, names the conflicts and open questions, and sets out a practical framework, including the tooling and governance structures that make documented proportionality sustainable. The two regimes are not going to reconcile themselves, so the work is to hold both, on the record.

About the author