By Phani Arega
In recent times, with the Web, social media, smartphones and the Internet of Things, a colossal amount of data is being generated continuously. It is estimated that over a zebibyte (2⁷⁰ bytes, roughly 10²¹ bytes) of data is now generated every year, and the volume keeps growing. Advances in hardware over the last two decades have made processors, memory and storage more powerful and cheaper. On the software front, technologies like MapReduce, Hadoop, Spark, and NoSQL databases have emerged to tackle the many challenges of Big Data processing.
Together with cloud computing, these developments have made the processing of large amounts of data practical. Naturally, commercial entities and governments stand to benefit greatly by leveraging these developments, which enable data-driven decision making and ultimately lead to improved customer experiences.
With the stage set by these developments, one would expect most businesses, organizations and governments to be successfully leveraging Big Data technologies. Yet success in achieving good data-driven decision making has been limited. The many irrelevant advertisements flooding our email inboxes, SMS, banner ads, paper mail, social media and so on are evidence of such failures. Spamming is not only annoying to consumers but also an ineffective way for companies to market goods and services. If a consumer is genuinely interested in something, and a company really has what that consumer needs, the result is a win-win: less annoyance for the consumer and better utilization of marketing spend by the company.
Governments, too, can leverage accurate information about the characteristics, propensities and specific needs of citizens and households, and accordingly devise campaigns targeting the respective cohorts. Government communications for campaigns and advisories would then reach only the relevant subset of citizens instead of everyone, improving the signal-to-noise ratio and hence their effectiveness.
Among the top 20 companies (by revenue) in India, only 8 have a significant focus on analytics – study by analyticsindiamag.com
Bad Data Quality
One of the main reasons for the limited adoption of big data technologies is insufficient data and poor data quality. The age-old principle of Garbage In, Garbage Out (GIGO) very much applies here. While the developments mentioned above address the 3Vs of big data, namely volume, variety, and velocity, the fourth V, veracity, remains to be addressed.
De-duplication is one of the key challenges that must be solved to clean data, and de-duplicating data in countries like India is harder than in the West. Names, for instance, vary in how they are abbreviated, the surname may appear at the beginning or at the end, and the principles for determining a surname are not uniform. Addresses pose further challenges: they are often very unstructured (e.g. “Behind Sangeet theatre”) and the same road name may be spelled differently (e.g. “Marg” versus “Road”). Tackling such challenges is highly non-trivial.
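As a minimal sketch of how such variations might be handled, the snippet below normalises names with a small, hypothetical abbreviation map, ignores surname position by sorting name tokens, and compares the results with a fuzzy score from Python's standard difflib module. Real de-duplication systems rely on far richer rules, curated reference data and trained models.

```python
from difflib import SequenceMatcher

# Hypothetical expansions; a production system would use curated lists.
ABBREVIATIONS = {"k.": "kumar", "md.": "mohammed", "marg": "road", "rd": "road"}

def normalise(name: str) -> str:
    """Lowercase, expand known abbreviations, and sort tokens so that
    surname position (first or last) does not affect the comparison."""
    tokens = [ABBREVIATIONS.get(t, t) for t in name.lower().split()]
    return " ".join(sorted(tokens))

def name_similarity(a: str, b: str) -> float:
    """Fuzzy similarity between two names, in the range 0.0 to 1.0."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

print(name_similarity("Rao K. Srinivas", "Srinivas Kumar Rao"))  # same person, reordered
print(name_similarity("M G Marg", "M G Road"))                   # same road, different spelling
```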
Another challenge is obtaining holistic data. A person's demographic and behavioural data is scattered across the computer systems of their mobile service provider, grocery store, car dealer, the online vendors where they have shopped, utility providers, different government departments and so on. The data in each of these systems alone is piecemeal and hence insufficient for deriving dependable insights. So holistic information is feasible only by merging several such data sets at a granular (person) level. The diversity of data formats and the absence of a common unique key to match a person across systems mean that straightforward key-based merging won’t work.
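As an illustration of how records might still be linked without a shared key, the sketch below scores a candidate pair of records on several noisy fields and merges them when the combined score crosses a threshold. The field weights, the 0.75 threshold and the sample records are assumptions chosen for the example, not tuned values.

```python
from difflib import SequenceMatcher

# Illustrative weights and threshold; a real linkage system would tune these.
FIELD_WEIGHTS = {"name": 0.5, "phone": 0.3, "address": 0.2}
LINK_THRESHOLD = 0.75

def text_similarity(a: str, b: str) -> float:
    """Order-insensitive fuzzy similarity between two short strings."""
    norm = lambda s: " ".join(sorted(s.lower().split()))
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def field_similarity(field: str, a: str, b: str) -> float:
    if field == "phone":
        # Compare only the trailing 10 digits to ignore "+91" / "0" prefixes.
        da = "".join(ch for ch in a if ch.isdigit())[-10:]
        db = "".join(ch for ch in b if ch.isdigit())[-10:]
        return 1.0 if da and da == db else 0.0
    return text_similarity(a, b)          # fuzzy match for names and addresses

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-field similarities for two partial records."""
    return sum(w * field_similarity(f, rec_a.get(f, ""), rec_b.get(f, ""))
               for f, w in FIELD_WEIGHTS.items())

telecom = {"name": "Srinivas Kumar Rao", "phone": "+91 98480 12345",
           "address": "12 MG Marg, Hyderabad"}
retail  = {"name": "Srinivas K. Rao", "phone": "09848012345",
           "address": "12 M G Road, Hyderabad"}

if match_score(telecom, retail) >= LINK_THRESHOLD:
    unified = {**telecom, **retail}       # merge the two partial views on the fly
```

In practice, candidate pairs would first be narrowed down by blocking, for example on city or pincode, so that not every pair of records across two large systems has to be scored.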
Worsening the situation, every dataset carries a tangible amount of noise and stale data. It is in the nature of data to go stale unless its custodian continually cleanses it and keeps it up to date.
“Half the money I spend on advertising is wasted; the trouble is I don’t know which half.” — Nineteenth century retailer John Wanamaker
Addressing the above issues improves the effectiveness of data-driven decision making for corporates and governments, which in turn benefits their customers and citizens. It enables corporates and governments to accurately target the right cohorts of people for a campaign. Customers also benefit because a flood of irrelevant messages is replaced by fewer, relevant messages, and the right messages are not missed.
Holistic Data
This calls for entity-matching technology that solves all three of the aforementioned issues. It should unify (merge) data from heterogeneous systems at a granular (person) level, overcoming the above-mentioned challenges, to derive holistic information about a person. Analytics based on such holistic data is far superior to analytics based on only one of the constituent datasets. Good unification exponentially increases the value of a dataset.
The entity-matching capability also effectively de-duplicates the data. The technology should, in addition, continually stamp an accuracy rating on each piece of data; this rating insulates dependent programs from bad and stale data.
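One simple way to picture such an accuracy stamp is a per-field rating that starts from an assumed reliability of the source system and decays as the value ages without re-verification. The sketch below is only an illustration of the idea; the reliability figure and decay rate are made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class StampedValue:
    """A data value carrying a provenance- and freshness-based accuracy stamp."""
    value: str
    source_reliability: float       # 0.0 to 1.0, assumed rating of the source system
    last_verified: datetime
    monthly_decay: float = 0.02     # assumed loss of accuracy per month without re-verification

    def accuracy(self, now: Optional[datetime] = None) -> float:
        """Current accuracy: the source's reliability reduced by staleness."""
        now = now or datetime.now(timezone.utc)
        months_stale = (now - self.last_verified).days / 30
        return max(0.0, self.source_reliability - self.monthly_decay * months_stale)

address = StampedValue("12 M G Road, Hyderabad", source_reliability=0.9,
                       last_verified=datetime(2017, 1, 1, tzinfo=timezone.utc))
if address.accuracy() < 0.6:
    print("stale: skip this value or trigger re-verification")
```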
All of the above should be accomplished while meeting or exceeding national and international data privacy and data security regulations and norms. From this standpoint, it is not advisable to store the full unified dataset of an individual in a single place, as it would become a honeypot for hackers. The model should instead be to unify on the fly, just before the unified data is consumed.
Tokenising Data
PII (Personally Identifiable Information) in the data should be non-reversibly tokenised to anonymise the data and avoid compromising an individual’s privacy. In tokenisation, each value in a sensitive field is substituted with a non-meaningful but unique character string called a token. To make this non-reversible, methods such as generating tokens from random numbers should be used, and the mapping between the original value and the token should not be saved.
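A minimal sketch of this approach is shown below, assuming the value-to-token map is held only in memory for a single tokenisation run (so that repeated values receive the same token and can still be linked) and is then discarded, leaving nothing from which the original values can be recovered.

```python
import secrets

class Tokeniser:
    """Substitutes sensitive values with random tokens. The value-to-token map
    is held only in memory during a run, so repeated values get the same token,
    and is discarded afterwards, making the tokens non-reversible."""

    def __init__(self) -> None:
        self._map = {}   # original value -> token, never persisted

    def tokenise(self, value: str) -> str:
        if value not in self._map:
            self._map[value] = secrets.token_hex(16)   # random, not derived from the value
        return self._map[value]

tk = Tokeniser()
records = [{"phone": "9848012345", "spend": 1200},
           {"phone": "9848012345", "spend": 300}]
anonymised = [{**r, "phone": tk.tokenise(r["phone"])} for r in records]
del tk   # drop the in-memory mapping; the tokens can no longer be reversed
print(anonymised)
```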
Another important aspect is the consent of the data subjects involved for storing or using their data. In addition, fair revenue sharing across legitimate stakeholders should be implemented whenever data is monetised, rather than the data’s custodian pocketing all of the monetisation revenue.
Data-exchange frameworks should maintain a tamper-proof audit trail of data accesses, consents exchanged and revenue shared. That brings in accountability and acts as a deterrent to those who would misuse data access. Storing the audit trail on a blockchain, and enforcing consent and revenue sharing through smart contracts on a blockchain, serves this purpose. Blockchain is a distributed ledger technology in which data, once written, is immutable, and the behaviour of deployed software (smart contracts) cannot be manipulated by anyone.
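The tamper-evidence property can be illustrated, in a deliberately simplified form that leaves out distribution, consensus and smart contracts, with a hash chain in which every audit entry commits to the hash of the previous entry, so any later alteration is detectable.

```python
import hashlib
import json
from datetime import datetime, timezone

def _hash(entry: dict) -> str:
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

class AuditTrail:
    """Append-only log where each entry commits to the previous entry's hash."""

    def __init__(self) -> None:
        self.entries: list = []

    def record(self, event: str, actor: str) -> None:
        entry = {"event": event, "actor": actor,
                 "time": datetime.now(timezone.utc).isoformat(),
                 "prev": _hash(self.entries[-1]) if self.entries else "genesis"}
        self.entries.append(entry)

    def verify(self) -> bool:
        """True only if no past entry has been altered."""
        return all(e["prev"] == _hash(self.entries[i - 1])
                   for i, e in enumerate(self.entries) if i > 0)

trail = AuditTrail()
trail.record("consent granted for marketing use", actor="citizen:token_ab12")
trail.record("dataset accessed", actor="advertiser:acme")
assert trail.verify()
trail.entries[0]["event"] = "consent revoked"   # tampering with history...
assert not trail.verify()                       # ...is detected
```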
The author is the senior vice president of engineering at Zebi Data (www.zebi.io).