I had the pleasure of video-conferencing into Kellogg’s MBA class on Social Media at Northwestern University yesterday. Brayden King kindly invited me to talk about how Airbnb thinks about Trust and the challenges facing sharing economies.
We spoke about the role of Data Science at the company and how it has changed over the years. As the volume of data has grown, we have more often than not moved away from explanatory models towards predictive Machine Learning algorithms.
One thing that stood out to me as top of mind for the MBA students was the process of Trust development for first-time users. How does a first-time guest get accepted by a host on Airbnb? How does a first-time host get selected by a guest?
At Airbnb we have a team of highly skilled Data Scientists and Engineers working on matching algorithms designed to help first-time guests and hosts. More than this, though, the community is its own best resource: experienced hosts help new hosts manage their listings and help new guests book their first experience.
At the heart of everything data-related we work on at Airbnb is the community: enabling its members to make more connections, both amongst themselves and with new users.
A recent survey of eFinancialCareers’ CV database has put Imperial College London at number 8 in the worldwide ranking of the best places to prepare for a financial career in Data Science. This comes hot on the heels of the launch of its new Data Science Institute last year and its new MSc in Business Analytics.
The usual East Coast schools (CMU, Columbia, NYU) and West Coast schools (Stanford, Berkeley) also make the top 10, as do Cambridge and Oxford from the UK.
In an exciting new partnership, Airbnb has teamed up with Kaggle to create an online Data Science challenge. In this challenge we provide historical data on the first destination country each guest books, and then ask candidates to predict future first bookings.
Try the challenge yourself! You have until February 11th 2016 to submit your entries. And if you have any questions you can use the forum and I will respond as soon as possible. Good luck, and I hope you have fun playing with our data!
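If you want somewhere to start, a trivial baseline is to predict the overall most popular destinations for every new user. This is only a sketch with made-up toy data (the real competition data and its scoring are described on the Kaggle page), but it shows the shape of a simplest-possible submission:

```python
from collections import Counter

def baseline_predictions(train_countries, n_test, k=5):
    """Rank destination countries by how often they appear in the
    training data, and predict that same top-k list for every test
    user: the simplest possible baseline."""
    ranked = [country for country, _ in Counter(train_countries).most_common(k)]
    return [list(ranked) for _ in range(n_test)]

# toy data: 'NDF' (no destination found, i.e. no booking) is the
# majority outcome here, purely for illustration
train = ["NDF", "NDF", "US", "NDF", "FR", "US"]
preds = baseline_predictions(train, n_test=2)
```

Any model you build should at least beat this frequency-ranked list before it is worth submitting.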
Along with a team of Stanford University Sociologists led by Karen Cook and Paolo Parigi, I am conducting a study on behalf of Airbnb to understand the social consequences of sharing goods and services with strangers.
Karen has published multiple books on the formation of Trust in modern societies and more recently on the role of Trust in the online world. Paolo is also interested in social networks and has conducted previous studies of Trust in the sharing economy.
Together we will be surveying Airbnb members to better understand Trust inside and outside of the sharing economy, as well as what drives changes in Trust. Stay tuned for more!
My latest Machine Learning blog post, ‘Confidence Splitting Criterions Can Improve Precision And Recall in Random Forest Classifiers’, is out on the Airbnb Data blog:
The Trust and Safety Team maintains a number of models for predicting and detecting fraudulent online and offline behaviour. A common challenge we face is attaining high confidence in the identification of fraudulent actions, both in terms of classifying a fraudulent action as fraudulent (recall) and not classifying a good action as fraudulent (precision).
A classification model we often use is a Random Forest Classifier (RFC). However, by adjusting the logic of this algorithm slightly, so that we look for high confidence regions of classification, we can significantly improve the recall and precision of the classifier’s predictions. To do this we introduce a new splitting criterion (explained below) and show experimentally that it can enable more accurate fraud detection.
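The post itself explains the new splitting criterion, so I won't repeat it here. As a reminder of the two metrics it targets, here is a minimal pure-Python sketch (with toy labels, not real fraud data) of how a fraud classifier's predictions are scored:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision: of the actions we flagged as fraud, how many really were.
    Recall: of the truly fraudulent actions, how many we flagged."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# toy labels: 1 = fraudulent action, 0 = good action
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
precision, recall = precision_recall(y_true, y_pred)
```

A confidence-focused splitter aims to push both of these numbers up at the same time, rather than trading one off against the other.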
Have a read and let me know what you think!
A world superpower in electronics, Japanese company Fujitsu claims that it no longer needs Data Scientists and has automated their job! The company states that
Data scientists use their skill to select a combination of algorithm and configuration to get the most accurate predictive model from the starting data
and that they have found a way to automate this searching over different configurations and models for the optimum. Their diagram depicts a meta machine-learning pipeline that tunes the hyper-parameters of a model built in Spark or another framework.
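The kind of search being automated is easy to sketch. The toy objective below is made up (a real pipeline would score each configuration by cross-validated model performance), but the exhaustive loop is the essence of it:

```python
from itertools import product

def grid_search(score_fn, grid):
    """Evaluate every hyper-parameter combination and keep the best:
    the tedious loop that automated tuners take off our hands."""
    best_cfg, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# made-up objective: pretend validation accuracy peaks at depth 8, 100 trees
def toy_score(cfg):
    return -abs(cfg["max_depth"] - 8) - abs(cfg["n_trees"] - 100) / 100

best, _ = grid_search(toy_score, {"max_depth": [4, 8, 16],
                                  "n_trees": [50, 100, 200]})
```

Nothing in that loop requires human judgement, which is exactly why it is so automatable.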
While it certainly makes sense to automate this potentially tedious optimisation, this will by no means deprecate the role of a Data Scientist. It is of course true that a Data Scientist has to intelligently choose an algorithm and its configuration, but this is a small part of the full life cycle of a data product that a Data Scientist is responsible for.
The processes of defining a metric to optimise, obtaining and cleaning data, transforming data into informative features, and perhaps also obtaining and cleaning labels (in the case of supervised learning) are all part of a Data Scientist’s responsibilities, and must be completed before an algorithm can be optimised.
Moreover, these processes constitute 90% of the blood, sweat, and tears of a Data Scientist that go into making a successful data product. Algorithm and configuration optimisation can give you a few percentage points boost in performance at most, but it is the accuracy of the data and intelligent feature sculpting which make the real difference.
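To make that concrete, here is a hedged sketch of the unglamorous part. Every field name and rule below is invented for illustration (not Airbnb's actual features); the point is that someone has to decide which rows are garbage and which raw fields become features before any tuner can run:

```python
import math

def clean_and_featurise(raw_rows):
    """The unglamorous 90%: drop malformed rows, coerce types, and turn
    raw fields into model-ready features (all names here are invented)."""
    features, labels = [], []
    for row in raw_rows:
        try:
            amount = float(row["amount"])   # coerce strings, reject garbage
        except (KeyError, ValueError):
            continue                        # cleaning: skip unusable rows
        features.append({
            "log_amount": math.log1p(max(amount, 0.0)),  # tame heavy tails
            "is_new_account": int(row.get("account_age_days", 0) < 30),
        })
        labels.append(int(row.get("is_fraud", 0)))
    return features, labels

rows = [
    {"amount": "10", "account_age_days": 5, "is_fraud": 1},
    {"amount": "oops"},                                  # malformed, dropped
    {"amount": "100", "account_age_days": 400},
]
features, labels = clean_and_featurise(rows)
```

Choices like the log transform or the 30-day cut-off are exactly the kind of feature sculpting no hyper-parameter search will make for you.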
Let’s hope Fujitsu does not sack all its Data Scientists just yet, or it may have a machine learning tuner with no data for the machine to learn from!
I am hoping to give a talk with Eric Levine on behalf of Airbnb at next year’s SXSW Interactive conference in Austin. Please vote for our submission and leave some comments too!
A recent article by TechCrunch reported that New York Mayor de Blasio has promised to invest $70M over the next 10 years to install universal broadband wifi in New York City.
De Blasio’s office comments that “Broadband is no longer a luxury – it’s as central to education, jobs, businesses and our civic life as water and electricity. For the first time in the history of the City, broadband is in the capital budget”.
This is an exciting and natural direction. I think there are few who would argue against viewing the internet as a utility in this day and age. And those who do not have access are unfairly disadvantaged.
The implications for the tech industry and data science are enormous. Not only will there be more data to collect and analyse, but many online companies will now have access to a segment of users that was previously unreachable. Some ‘wireless corridors’ already exist to serve disadvantaged communities, e.g. the ‘Harlem free wifi zone’. But this is the first time a major US city has so openly underlined its dependence on the internet. The next decade in NYC should be an exciting time for data consumers.
Yesterday I was invited by David Webster to talk to the team at innovative design company IDEO. IDEO is a cutting-edge digital and physical design studio in Palo Alto that has been leading creativity for over 30 years. I was lucky enough to be given a tour by David through their workshop, engineering office, and toy lab.
After the tour we had a joint Q&A with the whole team about how big data is used at Airbnb and how it might be used more in the design process at IDEO. Some key thoughts emerged:
- The world is moving towards more wearable sensor technology, e.g. Google Glass, the Apple Watch, Fitbit. With this comes a wealth of feedback data on the user in the offline world. The internet of things (IoT) will make, for example, A/B testing in the offline (physical) world possible.
- For designers to be more data empowered, we first need the analytics and prediction tools to catch up. Currently it is easy to log data, cheap to store data and there are standardised tools to query data. However, no leader has emerged for extracting insights from data. This democratisation of insights needs to happen before data can permeate design.
- Data science works best with design when they collaborate early. At the start of a project it is easier to scope what data is necessary and easy to collect, so that decisions can be informed and iterations can be faster.
The future for Data Science in Design is exciting and, when they start to overlap more, we will see changes in the world around us accelerate even faster.
Michael Li of the Data Incubator has written a timely article in VentureBeat on what a Data Scientist is not. In short, a Data Scientist is:
- Not just a Business Analyst working on more data,
- Not just a rebranded Software Engineer,
- Not just a Machine Learning expert with no business knowledge.
A Data Scientist needs to be able to extract insights from datasets that are orders of magnitude larger than they were 5 years ago. And they need to extract this insight carefully, with statistical significance and integrity. Moreover, the insight is only as useful as the business need it solves.
As a regular interviewer of junior and senior Data Scientist candidates at Airbnb, I can confirm that attention to data cleaning and diligence in statistical analysis are fundamental for successful candidates. Moreover, we look for people who understand the ‘why’ of a problem and the business impact of a solution. This is what differentiates a really smart candidate from a hired candidate.