Some call data the new oil. Others call it the new gold. Philosophers and economists may argue about the quality of the metaphor, but there’s no doubt that organizing and analyzing data is a vital endeavor for any enterprise looking to deliver on the promise of data-driven decision-making.
And to do so, a solid data management strategy is key. Encompassing data governance, data ops, data warehousing, data engineering, data analytics, data science, and more, data management, when done right, can provide businesses in every industry a competitive edge.
The good news is that many facets of data management are well-understood and are grounded in sound principles that have evolved over decades. They may not be easy to apply or simple to comprehend, but thanks to bench scientists and mathematicians alike, companies now have a range of logical frameworks for analyzing data and coming to conclusions. More importantly, we also have statistical models that draw error bars to delineate the limits of our analysis.
But for all the good that’s come out of the study of data science and the various disciplines that fuel it, sometimes we’re still left scratching our heads. Enterprises often bump up against the limits of the field. Some of the paradoxes relate to the practical challenges of gathering and organizing so much data. Others are philosophical, testing our ability to reason about abstract qualities. And then there is the rise of privacy concerns around so much data being collected in the first place.
Following are some of the dark secrets that make data management such a challenge for so many enterprises.
Unstructured data is difficult to analyze
Much of the data stored away in corporate archives doesn’t have much structure at all. One of my friends yearns to use an AI to search through the text notes taken by call center staff at his bank. These sentences may contain insights that could help improve the bank’s lending and services. Perhaps. But the notes were taken by hundreds of different people with different ideas of what to write down about a given call. Moreover, staff members have different writing styles and abilities. Some didn’t write much at all. Others wrote down too much information about their calls. Text by itself doesn’t have much structure to begin with, but when you’ve got a pile of text written by hundreds or thousands of employees over dozens of years, whatever structure there is might be even weaker.
Even structured data is often unstructured
Good scientists and database administrators guide databases by specifying the type and structure of each field. Sometimes, in the name of even more structure, they limit the values in a given field to integers in certain ranges or to predefined choices. Even then, the people filling out the forms that the database stores find ways to add wrinkles and glitches. Sometimes fields are left empty. Other people put in a dash or the initials “n.a.” when they think a question doesn’t apply. People even spell their names differently from year to year, day to day, or even line to line on the same form. Good developers can catch some of these issues through validation. Good data scientists can also reduce some of this uncertainty through cleansing. But it’s still maddening that even the most structured tables have questionable entries — and that those questionable entries can introduce unknowns and even errors in analysis.
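To make the cleansing step concrete, here is a minimal sketch in Python of how a team might fold the dashes, blanks, and “n.a.” entries into a single missing value and smooth over inconsistent spelling of names. The list of placeholder markers and the function names are assumptions made up for this illustration, not a standard; every real data set has its own quirks.

```python
from typing import Optional

# Hypothetical placeholders people type when a question does not apply.
MISSING_MARKERS = {"", "-", "--", "n.a.", "n/a", "na", "none"}


def normalize_field(raw: Optional[str]) -> Optional[str]:
    """Return a tidied value, or None if the entry is only a placeholder."""
    if raw is None:
        return None
    value = " ".join(raw.split())  # trim and collapse runs of whitespace
    if value.lower() in MISSING_MARKERS:
        return None
    return value


def normalize_name(raw: Optional[str]) -> Optional[str]:
    """Case-fold a name so 'JANE  doe' and 'Jane Doe' compare as equal."""
    value = normalize_field(raw)
    return value.casefold() if value is not None else None


entries = ["Jane Doe", "JANE  doe", " n.a. ", "-", ""]
cleaned = [normalize_name(e) for e in entries]
```

Even a routine like this only narrows the problem. It cannot tell whether “J. Doe” is the same person, which is exactly the kind of unknown that leaks into the analysis.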
Data schemas are either too strict or too loose
No matter how hard data teams try to spell out schema constraints, the resulting schemas for defining the values in the various data fields are either too strict or too loose. If the data team adds tight constraints, users complain that their answers aren’t found on the narrow list of acceptable values. If the schema is too accommodating, users can add strange values with little consistency. It’s almost impossible to tune the schema just right.
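The tension is easy to reproduce in a few lines of Python. The sketch below compares a strict rule with a loose one for a single form field. The list of allowed titles and the sample answers are invented for this illustration.

```python
# Hypothetical closed list of accepted values for a "title" field.
ALLOWED_TITLES = {"Mr", "Ms", "Dr"}


def validate_strict(value: str) -> bool:
    """Strict schema: only values on the predefined list pass."""
    return value in ALLOWED_TITLES


def validate_loose(value: str) -> bool:
    """Loose schema: any non-empty text passes."""
    return len(value.strip()) > 0


answers = ["Dr", "Prof", "dr.", "Mx", "???"]

# The strict rule turns away legitimate answers such as "Prof" and "Mx".
strict_rejected = [a for a in answers if not validate_strict(a)]

# The loose rule lets everything through, including "???".
loose_accepted = [a for a in answers if validate_loose(a)]
```

Neither result is satisfying: the strict version rejects four of the five answers, while the loose version accepts all five, junk included.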
Data laws are very strict
Laws about privacy and data protection are strong and are only getting stronger. Between regulations such as the GDPR, HIPAA, and a dozen or so more, it can be very difficult to assemble data, and even more dangerous to keep it lying around waiting for a hacker to break in. In many cases, it’s easier to spend more money on lawyers than on programmers or data scientists. These headaches are why some companies simply dispose of their data as soon as they can.
Data cleansing costs are huge
Many data scientists will confirm that 90% of the job is just collecting the data, putting it in a consistent form, and dealing with the endless holes or mistakes. The person with the data will always say, “It’s all in a CSV and ready to go.” But they don’t mention the empty fields or the mischaracterizations. It’s easy to spend 10 times as much time cleaning up data for a data science project as on starting up the routine in R or Python to actually perform the statistical analysis.
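A quick audit can at least show how far from “ready to go” a CSV really is. The following sketch uses only Python’s standard library to count the empty fields in each column. The sample data and the function name are made up for this illustration.

```python
import csv
import io


def count_empty_fields(csv_text: str) -> dict:
    """Count blank or missing entries in each column of a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name in counts:
            value = row.get(name)
            # DictReader fills short rows with None; blanks arrive as "".
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts


# Hypothetical sample with a header and three rows, two of them incomplete.
sample = "name,age,city\nAna,34,Lisbon\nBo,,\n,41,Oslo\n"
empty_counts = count_empty_fields(sample)
```

Counting the holes is the easy part. Deciding what to do about each one is where the hours go.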
Users are increasingly suspicious of your data practices
End users and customers are getting ever more suspicious about companies’ data management practices, and some AI algorithms and their use are only amplifying the fear, leaving many people very uneasy about what’s happening to the data capturing their every move. Those fears are fueling regulation and often ensnaring companies and even well-meaning data scientists in public relations blowback. Not only that, but people are deliberately jamming data collection with fake values or wrong answers. Sometimes half of the work is dealing with malicious partners and customers.
Integrating outside data can reap rewards — and bring disaster
It’s one thing for a company to take ownership of the data it gathers. The IT department and data scientists have control over that. But increasingly aggressive companies are figuring out how to integrate their homegrown information with third-party data and the vast seas of personalized information floating on the internet. Some tools openly promise to suck in data about each and every customer to build personalized dossiers on each purchase. Yes, to track your fast-food purchases and credit scores, they use the same words as the spy agencies going after terrorists. Is it any wonder that people fret and panic?
Regulators are cracking down on data use
No one knows when clever data analysis crosses some line, but once it does, the regulators show up. In one recent example from Canada, the government explored how some of the doughnut shops were tracking customers who were also shopping at competitors. A recent news release announced, “The investigation found that Tim Hortons’ contract with an American third-party location services supplier contained language so vague and permissive that it would have allowed the company to sell ‘de-identified’ location data for its own purposes.” And for what? To sell more doughnuts? Regulators are increasingly taking notice of anything involving personal information.
Your data scheme may not be worth it
We imagine that a brilliant algorithm may make everything more efficient and profitable. And sometimes such an algorithm is actually possible, but the price can also be too high. For instance, consumers — and even companies — are increasingly questioning the value of targeted marketing that comes from elaborate data management schemes. Some point to the way that we often see ads for something we already purchased because the ad trackers haven’t figured out that we’re not in the market anymore. The same fate often awaits other clever schemes. Sometimes a rigorous data analysis identifies the worst performing factory, but it doesn’t matter because the company signed a 30-year lease on the building. Companies need to be ready for the likelihood that all that genius of data science might produce an answer that isn’t acceptable.
In the end, data decisions are often just judgment calls
Numbers can offer plenty of precision, but how humans interpret them is often what matters. After all the data analysis and AI magic, most algorithms require a decision to be made about whether some value is over or under a threshold. Sometimes scientists want a p-value lower than 0.05. Sometimes a cop is looking to give tickets to cars going 20% over the speed limit. These thresholds are often just arbitrary values. For all the science and mathematics that can be applied to data, many “data-driven” processes have more gray area in them than we would like to believe, leaving decisions up to what amounts to gut instinct despite all the resources a company may have put into its data management practices.
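The two thresholds mentioned above can be written down in a few lines of Python, which makes it plain how much rides on a single chosen number. This is only a sketch: the function names and sample values are invented for illustration, and the defaults simply mirror the 0.05 and 20% figures from the text.

```python
def is_significant(p_value: float, alpha: float = 0.05) -> bool:
    """Call a result significant if its p-value falls below the cutoff."""
    return p_value < alpha


def should_ticket(speed: float, limit: float, tolerance: float = 0.20) -> bool:
    """Ticket a car only if it exceeds the limit by more than the tolerance."""
    return speed > limit * (1 + tolerance)


# The same measurement flips its verdict when someone picks another cutoff.
p = 0.04
significant_at_005 = is_significant(p)              # cutoff of 0.05
significant_at_001 = is_significant(p, alpha=0.01)  # stricter cutoff

speed, limit = 59, 50
ticket_at_20_percent = should_ticket(speed, limit)                  # cutoff 60
ticket_at_10_percent = should_ticket(speed, limit, tolerance=0.10)  # cutoff 55
```

Nothing in the data says whether 0.05 or 0.01, or 20% or 10%, is the right line. Someone has to choose.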
Data storage costs are exploding
Yes, disk drives keep getting fatter and the price per terabyte keeps dropping, but the programmers are gathering bits faster than the prices can fall. The devices from the internet of things (IoT) keep uploading data and users expect to browse a rich collection of these bytes forever. In the meantime, compliance officers and regulators keep asking for more and more data in case of future audits. It would be one thing if someone actually looked at some of the bits, but we only have so much time in the day. The percentage of data that is actually accessed again keeps dropping lower and lower. Yet the price for storing the expanding bundle keeps drifting up.