It is widely accepted that companies such as Facebook, Google, and Amazon hold vast amounts of data on our friends, interests, and spending habits, amongst other things. At times, for example for data mining or scientific collaboration, it can be useful for companies to access internal and external data. However, they are rightfully blocked by privacy laws.
Hence, there is increasing work on ways to properly anonymise data to enable mining. Aggregating data to wash out PII (personally identifiable information) is one way to achieve this, but it can lose important granularity and detail.
Raffael Strassnig, VP Data Scientist at Barclays retail bank, spoke at a summit last month to stress the importance of protecting privacy. Anonymising data at scale is a very hard problem, but Strassnig’s team has implemented an algorithm by a PhD candidate and modified it to work on Barclays’ Big Data. The method involves:
“clustering the data into k-means clusters, with no cluster overlapping, the clusters being a certain size to comply with k-anonymity constraint, and minimising the loss of data when applying the procedure to the dataset by using a dissimilarity measure”
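To make the quoted idea concrete, here is a minimal Python sketch of clustering under a k-anonymity constraint. It is not the Barclays algorithm, which the talk does not spell out. It assumes purely numeric records, uses Euclidean distance as the dissimilarity measure, and swaps k-means for a simple greedy grouping so that every cluster is guaranteed at least k members. The names `k_anonymise` and `centroid` are hypothetical.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def centroid(points):
    """Mean of each coordinate across a list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def k_anonymise(records, k):
    """Greedy sketch (an assumption, not the published method).

    Partitions records into non-overlapping clusters of at least k members,
    replaces each record with its cluster centroid, and returns the
    anonymised records plus the information loss (sum of squared distances
    between each original record and its replacement).
    """
    if k < 1 or len(records) < k:
        raise ValueError("need at least k records")

    remaining = list(range(len(records)))
    clusters = []
    while len(remaining) >= k:
        seed = remaining.pop(0)
        # Sort the unassigned records by dissimilarity to the seed
        # and take the k - 1 closest ones to complete the cluster.
        remaining.sort(key=lambda i: dist(records[seed], records[i]))
        clusters.append([seed] + remaining[:k - 1])
        remaining = remaining[k - 1:]

    # Fewer than k records are left: attach each to the nearest cluster.
    for i in remaining:
        nearest = min(
            clusters,
            key=lambda c: dist(records[i], centroid([records[j] for j in c])),
        )
        nearest.append(i)

    anonymised = [None] * len(records)
    loss = 0.0
    for members in clusters:
        centre = centroid([records[j] for j in members])
        for j in members:
            anonymised[j] = centre
            loss += dist(records[j], centre) ** 2
    return anonymised, loss


# Made-up (age, income) pairs, purely for illustration.
people = [(25, 30000), (27, 32000), (26, 31000),
          (60, 90000), (62, 88000), (61, 91000), (40, 50000)]
anonymised, loss = k_anonymise(people, k=3)
```

After the call, every row in `anonymised` is shared by at least three people, so no individual can be singled out from these attributes alone. The `loss` value gives a rough sense of how much detail was traded away, which is what a real implementation would try to minimise.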
Future developments in application of Machine Learning techniques may enable use of PII without anonymisation. Until then, the Data Science team at Barclays is leading the way in protecting their users’ data while processing it.
Michael Li of the Data Incubator has written a timely article in VentureBeat on what a Data Scientist is not. In short, a Data Scientist is:
- Not just a Business Analyst working on more data,
- Not just a rebranded Software Engineer,
- Not just a Machine Learning expert with no business knowledge.
A Data Scientist needs to be able to extract insights from datasets that are orders of magnitude larger than they were five years ago. And they need to extract these insights carefully, with statistical significance and integrity. Moreover, an insight is only as useful as the business need it solves.
As a regular interviewer at Airbnb for junior and senior Data Scientists, I find that attention to data cleaning and diligence in statistical analysis are fundamental for successful candidates. Moreover, we look for people who understand the ‘why’ of a problem and the business impact of a solution. This is what differentiates a really smart candidate from a hired candidate.
An article on Thursday in the UK online tech journal Ars Technica reviews the surprising power of mobile communications data to identify trends in unemployment.
A PLOS One paper and a Journal of the Royal Society Interface paper, both published last week, look at changes in the frequency, location, and timing of interactions between people, as captured in their mobile phone records. The correlations between these changes and observed layoffs can be used to train models for future predictions.
The article asks: is this harvesting of phone records to get ahead of employment shocks a critical tool for planners and government officials? Or actually a very creepy and invasive use of personal information? Comments welcome!
This image, unrelated to the unemployment study, shows seasonal population changes in France and Portugal, measured by cellphone activity.
I recently came across this 2010 TED talk by Michael Specter of the New Yorker and found its message powerful:
“You can have your own opinions…but you’re not entitled to your own facts”
This particularly hit home because, as a data scientist at Airbnb, I have a tremendous responsibility to report accurately and fully on the data we collect. And it’s important to get this right because, as the talk poignantly demonstrates, while views can differ, the data they are based on should not.