The Data Heist Nobody Is Talking About

A few months ago, we set out to assess a potential new product opportunity at Corrata. Like most organisations today, we turned to AI tools to accelerate the research. We asked the models the obvious questions: what are the limitations of current approaches, which gaps are organisations struggling hardest to close, and what is the lived reality of InfoSec practitioners trying to protect their users?
The responses were confident, detailed and almost entirely useless.
What came back was a polished reflection of the public record – vendor white papers, analyst reports, conference presentations, marketing copy. The kind of material that describes the security landscape as vendors would like it to appear, not as practitioners actually experience it. When we pressed for the harder truths – where defences are genuinely failing, what problems remain unsolved, where budgets fall short of rhetoric – the models had nothing credible to offer. That information simply does not exist in the sources LLMs have traditionally been trained on.
So we reverted to traditional market research: we picked up the phone and spoke to InfoSec practitioners. Within a handful of conversations, the picture that emerged was fundamentally different from the AI-generated assessment – different enough to change our view of the opportunity entirely. The models had not just been incomplete. They had pointed us in the wrong direction, swayed by the optimism of vendor communications and structurally blind to the gap between published best practice and operational reality.
This experience crystallised something important: the most strategically valuable information in any organisation is precisely the information that has never been publicly shared. Internal assessments, honest post-mortems, unvarnished performance data, real-world limitations. This is the information that shapes decisions – and that makes organisations competitive. And it is exactly the information that AI models, trained predominantly on public sources, have until now been unable to access.
Which is why what is now happening in the AI training data industry should concern every organisation’s leadership team.
A Systematic Effort to Close the Gap
In January 2026, Wired reported that OpenAI, working with training data firm Handshake AI, had been asking third-party contractors to upload real examples of work they had produced in past and current jobs. As TechCrunch confirmed, contractors were asked to submit “concrete outputs (not a summary of the file, but the actual file)” – Word documents, PDFs, PowerPoint presentations, Excel spreadsheets, code repositories. The goal was explicitly to obtain authentic professional work product: the kind of complex, time-intensive output that reflects how white-collar work actually gets done.
The commercial logic is straightforward. AI labs have extracted most of the value available from public sources. The frontier of model improvement now lies in exactly the kind of authentic, context-rich professional knowledge that our own research experience illustrated – the information that has never been published because its value depends on it staying private.
OpenAI is not alone. A network of specialist data firms – Mercor, Surge, Scale AI, Micro1, and others – has built a substantial industry around recruiting knowledge workers to contribute professional expertise to AI training. Mercor alone reportedly pays out more than $1.5 million to contractors daily, has tens of thousands of contributors, and counts OpenAI, Anthropic, and Meta among its clients. Surge reportedly brought in $1.2 billion in revenue in 2024. These are not marginal operations. They represent a well-funded, systematically organised effort to harvest the institutional knowledge that organisations have spent decades accumulating.
The Gap Between Policy and Practice
To be fair, the guidelines these companies issue to contractors include instructions to remove proprietary and personally identifiable information before uploading. OpenAI reportedly provides a ChatGPT “Superstar Scrubbing” tool for this purpose, and Mercor instructs contributors to work from personal knowledge rather than employer documents.
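Whatever form such scrubbing takes, pattern-based redaction can only remove what it can recognise. As a purely illustrative sketch (not a description of any vendor's actual tooling; the patterns and sample text are hypothetical), the following shows how a simple scrubber can mask obvious personal identifiers while leaving the commercially sensitive substance of a document untouched:

```python
import re

# Hypothetical illustration only: a minimal pattern-based scrubber of the kind
# a contributor might rely on before uploading a document.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace recognisable personal identifiers with placeholder tokens."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

sample = (
    "Q3 revenue forecast: $4.2M against a $5.1M target. "
    "Churn driven by the Acme renewal slipping to Q4. "
    "Contact jane.doe@example.com for the underlying model."
)

print(scrub(sample))
# The email address is masked, but every figure, client reference and
# internal judgement (the genuinely confidential substance) passes
# through untouched.
```

Deciding whether that remaining substance is confidential is a judgement no pattern matcher can make, which is exactly where the trust problem begins.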
The problem, as intellectual property lawyer Evan Brown told Wired, is that this system requires an enormous amount of trust in individual contractors to make judgements about what is and is not confidential. In his assessment, AI labs adopting this methodology are “putting themselves at great risk” because the process inherently depends on “a lot of trust in its contractors to decide what is and isn’t confidential.”
That trust is misplaced for a simple reason: the people uploading this material are, overwhelmingly, full-time employees of other organisations. They are not independent freelancers with no confidentiality obligations – they are consultants uploading client deliverables, analysts uploading financial models, engineers uploading proprietary code, security practitioners uploading internal assessments of exactly the kind our research was seeking. The work they are contributing was created on company time, using company resources, for company clients. It belongs to their employers, not to them.
Even Mercor's CEO has acknowledged that confidential material slipping through is among the “things that happen” given the sheer volume of activity on the platform.
What Is Actually Being Shared
The types of documents being solicited paint a clear picture of the risk. AI labs want authentic, complex, high-value work product precisely because it reflects real professional environments. That means strategic planning documents and internal market analyses, financial models and performance metrics, client presentations and pitch decks, proprietary methodologies and process documentation, internal security assessments and vulnerability analyses.
This is the accumulated intellectual capital of organisations – the distillation of years of experience, competitive intelligence, and hard-won operational knowledge. It is information that derives its value precisely from not being publicly known. The painful irony is that the more genuinely useful a piece of information is – the more it would have helped us in our own research – the more likely it is that uploading it constitutes a serious confidentiality breach.
Capturing the Value of Institutional Knowledge
The most important consequence of what is happening here is not that a specific document might be reproduced verbatim for the wrong person. It is something more fundamental: institutional knowledge that was previously the exclusive possession of an enterprise is being absorbed into AI models and made available to anyone who asks.
This is a transfer of value on a significant scale, and it has happened before.
News organisations spent decades and enormous resources building editorial expertise, investigative capability, and audience relationships. That value was systematically absorbed by AI models trained on their published output, without compensation and without consent. For journalists, authors, designers, and photographers, the products of their professional labour became training data, and the resulting capability now sits inside models that compete directly with the people whose work built them. The legal battles are ongoing; the commercial damage is already done.
The same dynamic is now playing out with enterprise institutional knowledge, but the mechanism is less visible and the affected organisations are largely unaware that it is happening.
Mercor’s CEO has articulated this dynamic directly, describing his company’s mission as recruiting Goldman Sachs analysts, McKinsey consultants, and elite lawyers to train models on the workflows their employers will not share. The pitch to contributors is that the knowledge in their heads belongs to them. But the knowledge in a financial model, a strategy deck, or an internal security assessment belongs to the organisation that commissioned and paid for it.
The scale of what is being transferred is not trivial: as noted above, Mercor distributes more than $1.5 million to contractors daily across tens of thousands of contributors, and Surge reportedly generated $1.2 billion in revenue in 2024 from the same model. This is not a cottage industry of individual bad actors. It is a well-capitalised, systematically organised effort to extract the institutional knowledge that enterprises have spent years accumulating and to concentrate that value inside a handful of AI platforms.
What Organisations Should Do
The first step is acknowledging that this is a policy and governance problem as much as a technical one. Organisations need to explicitly address AI training data participation in their employment contracts, acceptable use policies, and security training. Most existing agreements do not cover it; legal experts are only now beginning to advise that NDAs be updated to explicitly prohibit the contribution of confidential material to AI training pipelines. Employees need to understand clearly that contributing professional work product to AI data platforms is not acceptable. At the technical level, endpoint controls that monitor for uploads of sensitive file types to data collection platforms are an increasingly important part of the answer. The same controls organisations deploy to prevent data exfiltration to consumer cloud storage apply directly here.
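To make that concrete, the sketch below shows the shape of such a control, assuming an endpoint agent or secure web gateway that emits upload events with a destination host and file name. The domain names, event format, and file types are hypothetical placeholders, not a vetted list of data collection platforms or a reference to any particular product:

```python
from dataclasses import dataclass

# Hypothetical destinations and file types, for illustration only.
DATA_COLLECTION_DOMAINS = {"upload.example-datafirm.com", "tasks.example-annotation.ai"}
WORK_PRODUCT_EXTENSIONS = {".docx", ".pdf", ".pptx", ".xlsx", ".csv", ".zip"}

@dataclass
class UploadEvent:
    user: str
    destination_host: str
    file_name: str
    size_bytes: int

def should_flag(event: UploadEvent) -> bool:
    """Flag work-product file types leaving for known data collection platforms."""
    name = event.file_name.lower()
    extension = "." + name.rsplit(".", 1)[-1] if "." in name else ""
    return (
        event.destination_host in DATA_COLLECTION_DOMAINS
        and extension in WORK_PRODUCT_EXTENSIONS
    )

events = [
    UploadEvent("analyst01", "upload.example-datafirm.com", "q3_board_deck.pptx", 4_200_000),
    UploadEvent("analyst01", "intranet.corp.local", "q3_board_deck.pptx", 4_200_000),
]

for event in events:
    if should_flag(event):
        print(f"ALERT: {event.user} uploaded {event.file_name} to {event.destination_host}")
```

In practice this logic would sit inside an existing DLP or secure web gateway policy rather than standalone code; the point is that the control surface already exists, and only the list of destinations is new.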
The AI data supply chain has created a systematic, financially incentivised mechanism for extracting precisely the corporate intellectual property that makes organisations competitive. The organisations whose information is being consumed have largely not noticed. The gap that made our own AI-assisted research so frustrating is actively being closed, but not in a way that benefits the organisations whose knowledge is fuelling that improvement.
Corrata provides mobile security solutions that help organisations protect sensitive data from unauthorised upload and exfiltration via employee mobile devices.