Following on from Vanessa Douglas-Savage’s post about the use of predictive analytics in law and order services, and with the topic of predictive analytics changing the way we interact with our personal mobile devices in the news this week, I thought it would be interesting to look at some of the challenges that currently hamper the effectiveness of analytics efforts.
As stated in the blog post, “predictive analytics is the practice of extracting patterns from information to predict future trends and outcomes. Typically used as a decision-making guide, predictive analytics is steadily impacting the way in which governments will design and deliver public services to their citizens.”
The success of predictive analytics tools hinges on overcoming some key big data analytics challenges.
Firstly, they must consolidate and integrate data from a range of sources, including complex legacy systems, isolated silos of information, and external sources that maintain citizen information. With so many sources, the volume of data is overwhelming.
To derive analyses and value from these systems, the key questions to ask are: ‘What information is important?’ and ‘How do the pieces of information fit together?’
Historically, the big data wave has been poor at answering these questions, and we still rely on human intervention to reach decisions, determine direction and separate the noise from the real message. The key to navigating this effectively is recognising patterns in the data, and this is where machines can assist.
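To make the ‘how do the pieces fit together’ question concrete, here is a minimal sketch of one such pattern-matching task: linking citizen records held in two disconnected systems by fuzzy name comparison. Everything here, from the record values to the similarity threshold, is an illustrative assumption rather than a real system.

```python
from difflib import SequenceMatcher

# Hypothetical records from two disconnected systems; in practice these
# would come from legacy databases, external agencies and so on.
system_a = [{"id": "A-101", "name": "Jane Elizabeth Smith", "dob": "1980-04-02"}]
system_b = [{"id": "B-877", "name": "SMITH, Jane E.", "dob": "1980-04-02"}]

def normalise(name: str) -> str:
    """Reduce a name to a comparable form: lower-case, sorted tokens."""
    tokens = name.replace(",", " ").replace(".", " ").lower().split()
    return " ".join(sorted(tokens))

def link(rec_a: dict, rec_b: dict, threshold: float = 0.7) -> bool:
    """Treat two records as the same person if the date of birth matches
    and the normalised names are sufficiently similar. The threshold is
    an illustrative choice, not a recommendation."""
    if rec_a["dob"] != rec_b["dob"]:
        return False
    score = SequenceMatcher(
        None, normalise(rec_a["name"]), normalise(rec_b["name"])
    ).ratio()
    return score >= threshold

for a in system_a:
    for b in system_b:
        if link(a, b):
            print(f"Likely match: {a['id']} <-> {b['id']}")
```

Even this toy version shows why human judgement stays in the loop: the threshold is a trade-off between missed matches and false ones, and someone has to decide where to set it.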
Secondly, an estimated 80% of the data landscape is unstructured. That is, the data is contained in documents, emails, images or text-based social media posts rather than in structured databases. Where structured databases provide a level of consistency in format and metadata, unstructured data requires a level of organisation before any analysis can take place. The aim is to establish context and connections to structured data sources. This e-discovery activity can be complex and costly, particularly in the review phase, where traditionally every piece of data would have to be read by a human, sometimes armies of humans.
One step towards breaking through this challenge is the practice of predictive coding. This involves a level of machine intelligence where systems are taught by human subject matter experts to gather data, perform analysis and make decisions about what is relevant. In large, complex information environments, machines are taught to do the heavy lifting, which in the long run can cut the cost of e-discovery significantly. The downside of predictive coding is that it works on text: media files such as video, images and audio cannot be read.
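As a minimal sketch of the idea, assuming a Python environment with scikit-learn available: an expert labels a small seed set of documents as relevant or not, a text classifier learns from those labels, and the model then scores the rest of the corpus so reviewers can prioritise. The documents and labels below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Seed set labelled by a human subject matter expert (1 = relevant).
seed_docs = [
    "invoice for consulting services rendered in march",
    "minutes of the quarterly safety committee meeting",
    "payment schedule attached for the disputed contract",
    "lunch menu for the staff canteen next week",
]
seed_labels = [1, 0, 1, 0]

# Learn term weights from the expert-labelled examples.
vectoriser = TfidfVectorizer()
X = vectoriser.fit_transform(seed_docs)
model = LogisticRegression().fit(X, seed_labels)

# Score the unreviewed corpus; high scores go to human review first.
corpus = [
    "final invoice and payment terms for the contract",
    "reminder to book the meeting room for friday",
]
scores = model.predict_proba(vectoriser.transform(corpus))[:, 1]
for doc, score in sorted(zip(corpus, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {doc}")
```

The machine does not replace the expert; it multiplies the expert’s labelling effort across a corpus far too large to read end to end.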
Another method of unstructured content organisation is automated metadata tagging for classification, description, indexing and management of content. In the case of automated tagging, a machine is taught via rules, suggestion-based tuning and previous experience to apply tags to content.
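At its simplest, rule-based tagging is just keyword patterns mapped to tags, which suggestion-based tuning and learned rules then build upon. A minimal sketch, with rules and documents that are purely illustrative:

```python
import re

# Illustrative tagging rules: pattern -> metadata tag.
RULES = {
    r"\b(invoice|payment|remittance)\b": "finance",
    r"\b(complaint|grievance)\b": "complaints",
    r"\b(licence|permit|registration)\b": "licensing",
}

def auto_tag(text: str) -> list[str]:
    """Apply every matching rule and return the resulting tags."""
    return sorted({tag for pattern, tag in RULES.items()
                   if re.search(pattern, text, re.IGNORECASE)})

print(auto_tag("Attached is the invoice for your permit application."))
# ['finance', 'licensing']
```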
However, predictive coding and automated tagging highlight the third big data analysis challenge: imperfect data. As with all learning experiences, mistakes will be made by both machine and human. Humans must learn how to teach the machine, and the machine must go through the learning curve of understanding subject matter and reaching relevant conclusions.
But more significantly, the challenge of garbage in, garbage out remains at large. Because the machine is learning, if the human input is incorrect, or there are errors in the source data, the machine will arrive at incorrect conclusions. As a result, predictive coding and automated tagging are not yet fully trusted to deliver the same outcomes as human review.
And finally, there is the challenge of de-identifying data so that the identity of a person cannot be determined. This activity is often underestimated, because conclusions are drawn about the importance and use of the data before it is released. A recent example was in New York City, where taxi trip logs were made publicly available. Within a short space of time it was found that, despite some anonymisation, it could easily be determined who drove which vehicle, a driver’s gross income and where they lived. There is ongoing debate about whether de-identification works and whether you can ever truly anonymise a dataset.
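The reported weakness in that taxi release was that identifiers were obscured with an unsalted hash, but the space of valid identifiers was so small that every possible value could be hashed and matched back. Here is a sketch of that failure mode; the one-letter-plus-three-digits identifier format is a simplification invented for illustration.

```python
import hashlib
import string
from itertools import product

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

# Simplified, hypothetical identifier format: one letter plus three
# digits, giving only 26,000 possible values to try.
def all_identifiers():
    for letter in string.ascii_uppercase:
        for digits in product(string.digits, repeat=3):
            yield letter + "".join(digits)

# The 'anonymised' value as it might appear in a released dataset.
released_hash = md5_hex("K417")  # pretend we only know the hash

# Brute force: hash every possible identifier and compare.
lookup = {md5_hex(ident): ident for ident in all_identifiers()}
print(lookup[released_hash])  # K417 -- the identifier is recovered
```

Hashing looks like anonymisation, but when the input space is small and predictable it is little more than an encoding that anyone can reverse in seconds.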
While there are no quick fixes or out-of-the-box solutions to these complex problems, awareness and discussion are crucial to moving towards resolution. In particular, progressively overcoming the unstructured data problem would be a significant contribution to the data landscape.