The Data Problem:
Who Owns the Fuel That Runs the AI Revolution — and Did Anyone Actually Ask?
Training data, consent, power concentration, and the privacy reckoning nobody planned for. The battle over data is the battle over AI’s future.
“Data is the new oil.” It is one of those metaphors that contains just enough truth to be useful and just enough distortion to be dangerous. Oil is finite. Oil pollutes. Data grows when used. Data can be in multiple places simultaneously. Data, in the right hands, compounds in value in ways oil cannot. The metaphor is catchy. The reality is considerably stranger and more consequential. The battle over data is the battle over AI’s future — and it is being fought largely without the people whose data is at stake.
The Fuel Nobody Talks About Honestly
Modern AI systems have been trained on hundreds of billions of words of text, billions of images, and vast repositories of code. The internet, in many ways, is the training set. Every Wikipedia article, every digitised book, every Reddit thread: all of it has contributed to the models we interact with daily. Where did that data come from? Largely from people who had no idea their words, images, and creative work would be used to train systems sold as commercial products by companies worth hundreds of billions of dollars.
“The human beings who produced the raw material of the AI revolution were, in the overwhelming majority of cases, not consulted, not compensated, and not informed. Their contribution was extracted rather than purchased. This is the foundational economic arrangement of the AI industry.”
Neal Lloyd · Inside The Machine, Day 10
Did Anyone Actually Ask?
Most platform terms of service were written before large-scale AI training existed as a concept. Treating acceptance of those terms as consent to AI training means stretching language written for one purpose to cover a categorically different use. Whether courts accept that stretch will define the legal landscape for decades.
And even if scraping public data is technically legal, is it right? When someone writes a personal essay, posts it on a platform, and that essay becomes part of the training data that teaches an AI to simulate emotional depth, did they consent to that use?
The Power Asymmetry Nobody Wants to Name
The organisations that control the largest, highest-quality datasets have a structural advantage that compounds over time. Training data is not easily replicated. The competitive moat in AI is not primarily algorithmic — algorithms can be replicated. The moat is data. And the organisations that control the most comprehensive datasets will have disproportionate influence over what AI systems know, what perspectives they reflect, and what biases they embed — for a very long time. The battle over data is not over. The precedents being set now will determine who benefits from AI and who provides the raw material without sharing in it.
Inside The Machine, Day 10 · May 2026
Neal Lloyd writes about technology, human adaptation, and the uncomfortable questions nobody wants to answer at dinner. Inside The Machine is his ongoing daily series on AI.
- Day 01: What Is This Thing?
- Day 02: Survive the Machine
- Day 03: The Great Debate
- Day 04: Who Gets Hurt?
- Day 05: Who's In Charge?
- Day 06: The Industries That Win
- Day 07: The Human Edge
- Day 08: The Creativity Question
- Day 09: Does AI Feel Anything?
- Day 10: The Data Problem
- Day 11: The Trust Question