I’ve had a version of the same conversation with half a dozen business leaders in the past year.
They’d invested in an AI initiative. They had a credible vendor. The technology was sound. The use case made sense. And six months in, the results were disappointing – lower accuracy than expected, outputs that couldn’t be trusted, a team that had quietly stopped using the system.
In most of these cases, the technology wasn’t the problem. The data was.
This is the conversation the AI industry doesn’t have loudly enough: AI is only as good as the data it learns from. And most business data – the kind that exists in real companies running real operations – is messier than anyone wants to admit.
What “Bad Data” Actually Means
When people talk about data quality problems, it usually sounds abstract. Let me make it concrete.
I worked with a manufacturing client who wanted to build a predictive quality control system – using computer vision and sensor data to catch defects before they reached the end of the production line. Genuinely high-value use case. The technology to do it exists and works well.
The problem was their historical data. Four years of production records, theoretically perfect for training a machine learning model. Except the data had been entered by multiple operators across three shifts, with different conventions for recording defects. The sensor readings had gaps from equipment downtime that nobody had documented. And the labelling – which batches were defective, which were good – turned out to be inconsistent because the definition of “defective” had changed twice over the four-year period without anyone updating the historical records.
The model trained on that data learned the noise as well as the signal. It made predictions, but they weren’t reliable enough to act on.
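To make that failure mode tangible, here’s a minimal sketch – toy data, hypothetical column names, and an arbitrary threshold – of the kind of check that surfaces a changed defect definition: the recorded defect rate jumps at the changeover point rather than drifting gradually with the process.

```python
import pandas as pd

# Hypothetical shape for the records described above:
# one row per batch, with a date and a defect flag.
records = pd.DataFrame({
    "batch_date": pd.to_datetime([
        "2020-02-10", "2020-06-01", "2020-10-15", "2020-12-03",
        "2021-03-22", "2021-07-08", "2021-11-19", "2021-12-30",
        "2022-01-14", "2022-05-25", "2022-09-02", "2022-12-11",
        "2023-02-07", "2023-08-16",
    ]),
    "defective": [0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0],
})

# If the definition of "defective" changed, the recorded defect rate
# tends to jump at the changeover rather than drift with the process.
yearly_rate = (
    records
    .assign(year=records["batch_date"].dt.year)
    .groupby("year")["defective"]
    .mean()
)
print(yearly_rate)

# Flag year-over-year shifts large enough to warrant a conversation with
# the people who recorded the data (the 0.25 threshold is an assumption).
shifts = yearly_rate.diff().abs()
print(shifts[shifts > 0.25])  # here, only 2022 is flagged
```

The point isn’t the code – it’s that a definition change leaves a statistical fingerprint you can go looking for before you train anything.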
This isn’t unusual. It’s the norm. Most business data was collected to support operations, not to train AI. It’s good enough for its original purpose – running reports, processing transactions, tracking inventory – but it carries years of inconsistencies, gaps, and undocumented changes that create serious problems when you try to build something intelligent on top of it.
The Three Data Problems We Find in Almost Every Engagement
The first is inconsistency. The same thing recorded in different ways by different people or systems over time. Units that changed. Categories that were renamed. Fields that were repurposed. This is almost universal in businesses that have been operating for more than a few years.
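For readers who work hands-on with data, here’s what resolving that can look like in practice. The column names and categories below are invented, but the pattern matters: an explicit mapping table that makes every reconciliation decision visible and auditable, rather than a pile of ad hoc string fixes.

```python
import pandas as pd

# Toy example: the same defect category recorded three different ways.
raw = pd.DataFrame({
    "defect_type": ["scratch", "Scratch ", "surface scratch", "dent", "DENT"],
    "depth_mm": [1.2, 0.8, 1.5, 3.0, 2.1],
})

# An explicit mapping table is the reconciliation record –
# anyone can review it, extend it, or challenge it later.
CANONICAL = {
    "scratch": "scratch",
    "surface scratch": "scratch",
    "dent": "dent",
}

cleaned = raw.assign(
    defect_type=raw["defect_type"].str.strip().str.lower().map(CANONICAL)
)
print(cleaned)
```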
The second is incompleteness. Gaps in the record that nobody thought mattered at the time. Sensor readings that stopped during maintenance windows. Customer records that were never fully populated. Transaction logs that don’t capture the full context of what happened. Machine learning models are particularly sensitive to incomplete data – gaps aren’t neutral, they introduce bias into what the model learns.
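A small illustration of the same idea in code – toy sensor readings, with an assumed hourly sampling grid. Reindexing onto the grid you expected forces the gaps into the open, where they can be documented, instead of letting them silently vanish:

```python
import pandas as pd

# Toy hourly sensor series with an undocumented outage in the middle.
readings = pd.Series(
    [21.0, 21.3, 21.1, 20.9, 21.4],
    index=pd.to_datetime([
        "2024-05-01 00:00", "2024-05-01 01:00", "2024-05-01 02:00",
        "2024-05-01 07:00", "2024-05-01 08:00",  # four readings missing before this
    ]),
)

# Reindex onto the sampling grid we *expected* – the gaps become explicit.
expected = pd.date_range(readings.index.min(), readings.index.max(), freq="h")
on_grid = readings.reindex(expected)

missing = on_grid[on_grid.isna()]
print(f"{len(missing)} of {len(on_grid)} expected readings are missing")
print(missing.index.tolist())
```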
The third is lack of labels. This one catches people by surprise. Building a supervised machine learning model requires not just data about what happened, but data about what the outcome was – and whether it was good or bad. Many businesses have extensive records of events and almost no systematic record of outcomes. Without labels, you can’t train a model to distinguish success from failure.
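One quick diagnostic here is label coverage: what fraction of your historical events actually has a recorded outcome attached? A toy sketch, with invented table and column names:

```python
import pandas as pd

# Plenty of records of what happened...
events = pd.DataFrame({"order_id": [101, 102, 103, 104, 105]})

# ...but outcomes were only ever written down for a fraction of them.
outcomes = pd.DataFrame({"order_id": [101, 104],
                         "outcome": ["good", "bad"]})

labelled = events.merge(outcomes, on="order_id", how="left")
coverage = labelled["outcome"].notna().mean()
print(f"Label coverage: {coverage:.0%}")  # 40% here – far too thin to train on
```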
What Good Data Preparation Looks Like
Fixing these problems isn’t glamorous work, but it’s foundational. And it’s work that pays dividends beyond the specific AI project – better data infrastructure improves almost every operational and analytical function in a business.
The process starts with a data audit – a clear-eyed assessment of what data exists, what quality it’s in, how consistent it is across sources, and what gaps exist relative to what the AI use case actually needs.
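The first pass of an audit can be surprisingly mechanical. Here’s a minimal sketch of per-column profiling in Python – the function name and the commented file path are placeholders, and a real audit goes much further than this:

```python
import pandas as pd

def audit(df: pd.DataFrame) -> pd.DataFrame:
    """First-pass profile: one row per column, summarising completeness
    and variety. A starting point for questions, not a full assessment."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null_pct": (df.notna().mean() * 100).round(1),
        "distinct_values": df.nunique(),
        "example": df.apply(
            lambda col: col.dropna().iloc[0] if col.notna().any() else None
        ),
    })

# Usage – point it at any table you're considering training on:
# print(audit(pd.read_csv("production_records.csv")))
```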
From there, the work is a combination of technical cleaning – standardising formats, resolving inconsistencies, filling gaps where possible – and process changes that prevent the same problems from recurring. This is where we often find that the data problem is actually a process problem in disguise. The inconsistency in the data reflects an inconsistency in how the business operates, and fixing one surfaces the other.
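On the technical-cleaning side, one useful pattern is standardising without destroying the trail. A toy example – the unit change below stands in for the broader class of format problems, and the column names are invented:

```python
import pandas as pd

# Toy example: the same measurement recorded in two units after an
# undocumented process change.
raw = pd.DataFrame({
    "value": [25.4, 1.0, 50.8, 2.0],
    "unit": ["mm", "in", "mm", "in"],
})

TO_MM = {"mm": 1.0, "in": 25.4}

# Standardise to one unit, keep the original, and record that a
# conversion happened – never overwrite silently.
cleaned = raw.assign(
    value_mm=raw["value"] * raw["unit"].map(TO_MM),
    converted=raw["unit"] != "mm",
)
print(cleaned)
```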
For some use cases, we also work with clients on labelling exercises – systematically going back through historical records to add the outcome data that was never captured, or establishing new processes to capture it going forward.
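The forward-looking half of that can be mechanically very simple. What matters is capturing the outcome, who or what decided it, and when – at the moment the outcome is actually known. A minimal sketch, with hypothetical identifiers and file name:

```python
import csv
from datetime import datetime, timezone

def record_outcome(batch_id: str, outcome: str, source: str,
                   path: str = "outcomes.csv") -> None:
    """Append one outcome label as soon as it's decided. Recording the
    source and timestamp keeps future labels consistent and auditable."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            batch_id,
            outcome,                                 # e.g. "pass" / "fail"
            source,                                  # e.g. "final-inspection"
            datetime.now(timezone.utc).isoformat(),  # when the label was made
        ])

# Usage, wired into the step where the outcome is actually determined:
# record_outcome("BATCH-0417", "fail", "final-inspection")
```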
None of this is fast. Depending on the state of the data and the complexity of the use case, preparation work can take anywhere from a few weeks to a few months. But it’s the difference between an AI project that delivers results and one that produces a system nobody trusts.
What to Ask Before Any AI Project Starts
If you’re evaluating an AI initiative – whether you’re building something custom or implementing a third-party solution – there are three data questions worth asking before anything else.
What historical data exists that’s relevant to this use case, and how far back does it go? What’s the known quality of that data – has it ever been audited? And what outcome data exists to validate whether the AI’s predictions are actually correct?
The answers won’t tell you whether the project is worth doing. But they’ll tell you what you’re getting into – and whether there’s groundwork needed before the AI work can deliver what you’re expecting from it.

At Do Systems, every AI engagement starts with a data readiness assessment. We’ve found it’s the most reliable predictor of project success – more than the use case, more than the technology choice, more than the budget. The projects that deliver are the ones where the data foundation was solid before the model training started.