Check out my new Machine Learning blog post on Airbnb


While almost all members of the Airbnb community interact in good faith, there is an ever-shrinking group of bad actors who seek to take advantage of the platform for profit. This problem is not unique to Airbnb: social networks battle attempts to spam or phish users for their details, and ecommerce sites try to prevent the use of stolen credit cards. The Trust and Safety team at Airbnb works tirelessly to remove bad actors from the Airbnb community and to help make the platform a safer and more trustworthy place to experience belonging.

Missing Values In A Random Forest

We can train machine learning models to identify new bad actors (for more details see the previous blog post Architecting a Machine Learning System for Risk). One particular family of models we use is Random Forest Classifiers (RFCs). An RFC is a collection of trees, each independently grown using labeled and complete input training data. By complete we explicitly mean that there are no missing values, i.e. no NULL or NaN values. In practice, though, the data often has (many) missing values. In particular, very predictive features do not always have values available, so the missing values must be imputed before a random forest can be trained.
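As a rough illustration of what imputing before training can look like, here is a minimal sketch that fills each missing value with the median of the observed values in its column. Median imputation is only an assumption for the example; the excerpt does not say which method the post uses, and `impute_median` and the toy `rows` are hypothetical names.

```python
import math
from statistics import median


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_median(rows):
    """Return a copy of rows with missing entries replaced by the column median."""
    n_cols = len(rows[0])
    fill = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not is_missing(row[j])]
        # If a column has no observed values at all, fall back to 0.0
        # (an arbitrary choice made only for this sketch).
        fill.append(median(observed) if observed else 0.0)
    return [
        [fill[j] if is_missing(v) else v for j, v in enumerate(row)]
        for row in rows
    ]


# Toy feature matrix: two features, two missing entries.
rows = [
    [1.0, None],
    [3.0, 10.0],
    [float("nan"), 20.0],
]
completed = impute_median(rows)
print(completed)
```

The completed matrix has no missing entries, so it could be passed to a random forest implementation that requires complete inputs.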

Read more…


When to wait for flight prices to drop

Bing Flights Price Predictor

Kayak Flights Price Predictor

I’ve often heard people talk about the best time to book flights (apparently it’s Tuesday nights). There has also been a rise in airfare blogs such as Airfare Watchdog and CheapAir’s Blog.

Even online flight booking platforms such as Bing and Kayak are starting to offer advice on whether prices are trending up or down and whether now is the best time to buy.

Model Parameters Values Over Time

Recently, I came across a dataset covering about six months of price data for domestic US flights. For about 100 popular routes, the dataset recorded the time of each observation and the price quoted at that time for a future flight. I wanted to see whether we could actually predict directional changes in price with any confidence.

I built a model to try to predict whether the price would drop by at least 10% in the next 7 days. Using only historical price returns and weekly updating of the model parameters, I calculated the daily out-of-sample performance. The results were much better than I expected.
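To make the prediction target concrete, here is a minimal sketch of how such a label could be built from a daily price series. It assumes one price per day and reads "drop by at least 10%" as the lowest price in the following window being at least 10% below today's price; the function name `drop_labels`, its parameters, and the toy prices are hypothetical, not taken from the actual model.

```python
def drop_labels(prices, horizon=7, threshold=0.10):
    """Label day t with 1 if some price in the next `horizon` days is at
    least `threshold` below prices[t], else 0.

    Days without a full `horizon` of future prices are skipped.
    """
    labels = []
    for t in range(len(prices) - horizon):
        future_min = min(prices[t + 1 : t + 1 + horizon])
        labels.append(1 if future_min <= prices[t] * (1 - threshold) else 0)
    return labels


# Toy example with a 3-day horizon so the series can stay short.
prices = [100, 98, 95, 88, 90, 91, 92]
labels = drop_labels(prices, horizon=3)
print(labels)
```

With the post's setup, the call would use `horizon=7` and `threshold=0.10`, and the labels would only ever be paired with features computed from prices up to day t.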

Model R2 In Test Data Over Time

Firstly, the two parameters in my model were reasonably stable over time, a key property of a well-defined model. Secondly, the out-of-sample R² (a measure of performance) was consistently positive, at around 5%.
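For readers who want the performance measure spelled out, here is a minimal sketch of R² computed on held-out data, using the common definition of one minus the ratio of residual to total sum of squares. Whether the model above used exactly this definition (for example, which mean serves as the baseline) is an assumption, and the numbers below are toy values.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with the mean of y_true as the baseline."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Toy held-out targets and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]
r2 = r_squared(y_true, y_pred)

# Always predicting the mean of the held-out targets scores exactly zero.
r2_baseline = r_squared(y_true, [2.5] * 4)
print(r2, r2_baseline)
```

Under this definition, a positive value means the predictions beat the constant-mean baseline on data the model was not fitted to, and a negative value means they did worse.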

More concretely, and more actionably: in the dataset I was looking at, the price dropped by at least 10% within the following week 18% of the time. The model predicted such a drop 13% of the time, and 73% of those predictions were correct.
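These three percentages can be combined with a little arithmetic. The sketch below uses only the figures quoted above and assumes all three are measured over the same set of observations; the variable names are illustrative.

```python
base_rate = 0.18       # share of observations where the price really dropped
predicted_rate = 0.13  # share of observations where the model predicted a drop
precision = 0.73       # share of those predictions that were correct

# Share of all observations that were correctly flagged as drops.
correct_flag_rate = predicted_rate * precision

# Share of the real drops that the model caught (recall).
recall = correct_flag_rate / base_rate

print(f"correctly flagged: {correct_flag_rate:.1%} of all observations")
print(f"recall: {recall:.1%} of real drops")
```

Under that assumption, the model catches a little over half of the real price drops, while roughly a quarter of its drop predictions are false alarms.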

With more features, such as flight duration, number of changes, oil prices and seasonality, I’m confident that the 13% could get closer to 18% and the 73% could be pushed even higher, maybe to 95%.