Tech Projects | Pang Sheng Wei

Recommender Systems: Netflix

Summary: This project explores several algorithms for recommendation systems along with its real-world use cases. View the code .
Project Type: Data Science
Skills: Content-Based, Collaborative Filtering, Matrix Factorization
Tech Stack: Python, Pandas, Scikit-Surprise, Scikit-Learn

When we binge our favourite content on Netflix, Youtube or Facebook, it often seems like there's a never ending stream of awesome things to watch - but why? The truth is there are hundreds of recommendation algorithms running behind the scenes, learning your preferences based on your content history and serving you with videos that you love, to keep you hooked for as long as possible. In this blog post, we explore some examples of these algorithms to answer the following questions:

What are the various types of recommendation algorithms?
How do they work and what are their limitations?
What are the pros and cons of each type, and when should we use them?
What is Netflix's recommendation strategy and how do they scale it?

Types of Recommendation Algorithms

Let's start with a brief overview of the various types of recommendation algorithms. Remember, the goal is to accurately recommend a user the top N items based on some kind of ranking.

Rule-Based - defining simple rules to recommend an item, for example the top N highest rated movies or top 10 best selling shoes or users who bought X also bought Y
Content-Based - recommending items based on how similar they are, for example because you watched the movie Dark Knight, we might also recommend you other Batman, Superhero or Christopher Nolan Films
Collaborative-Filtering - predicting the score each user would give to each item that they have not rated, by learning from what other similar users with the same interests have rated

There are many algorithms in literature for each type so we cant possibly cover all, but we'll see at least 1 example of each.

Rule-Based

The datasets we shall be exploring today will be the famous Netflix Prize Movies in which Netflix created an open competition rewarding $1M USD to the person who could build the best recommender system. The first dataset we see consists of 17k+ movies and the year they were released.

Next we load another dataset consisting of the description of these movies

And in our last dataset we get the 24M+ ratings of each movie by various users.

For efficiency purposes of this example, lets exclude movies that have less than 10k ratings and users who rated less than 200 movies, leaving us with a more reasonable dataset size of about 4M+ user-movie ratings.

Further, we also split our dataset into training and testing to evaluate the algorithms. Now let's transform our dataframe into a sparse matrix of users and movie ratings for computation later on.

For our rule-based algorithm, lets recommend the all-time most popular movies. Computing the mean rating for all movies creates a ranking. The recommendation will be the same for all users and can be used if there is no information on the user. Variations of this approach can be separate rankings for each movie genre or even getting recent vs old movies. However, if we only use the average rating of a movie alone, it could mean that some movies with a smaller number of ratings could be skewed unfairly towards higher ratings. To tackle the problem of the unstable mean with few ratings we shall use the popular IMDB weighted rating formula:

We basically apply this formula to all our movies and volia we generate a scored ranking of each movie. As you can see lord of the rings came out at the top with the highest weighted rating at 4.37, while the top 10 movies include popular shows such as The Godfather, The Simpsons and Batman Begins.

Content-Based

Now let's turn to content-based strategies, which involve recommending either 1) items that similar users also like or 2) items that are similar to what the user has previously liked

Taking a look at our matrix again, each row can be seen as a vector of movie preferences for each user, a similarity between all user-vectors can be computed. This enables us to find all similar users and to work on user-specific recommendations. Recommending high rated movies of similar users to a specific user seems reasonable.

Since there are still empty values left in the matrix, we have to use a reliable way to impute a decent value. A simple approach is to fill in the mean of each user into the empty values. Afterwards the ratings of all similar users for each movie will be weighted with their similarity score and the mean rating will be computed.

After filling the NA values with the user's average rating, we can compute the cosine similarity for each user. Next, we can define a function that takes in a user index and similarity matrix, and returns the top 100 movie recommendations based on similar users! We can see some of the top movies recommended were Shrek 2 and The Sixth Sense, along with their corresponding scores.

Next, instead of similar users, we can recommend similar movies based on the movie metadata such as the title, description and actors/directors. In this approach, we can use a metric called Term Frequency Inverse Document Frequency (TFIDF), which basically tells us:

TF - how frequently a term(t) appears in a document(d)
IDF - the ratio of number of documents(d) containing a term(t)

The higher the TFIDF of a term, the more important and useful it is for differentiating between documents. For a more detailed example, read .

We can use a TFIDF-Vectorizer on the movie description to create a TFIDF-matrix, which counts and weights words in all descriptions. and compute a cosine similarity between all of those sparse text-vectors.

The similarity matrix tells us the similarity of each movie based on its metadata, which we can use to retrieve the top N similar movies, as shown below.

This can easily be extended to more or different features if you like. Unfortunately it is impossible for this model to compute a RMSE score, since the model does not recommend the movies directly. In this way it is possible to find movies closely related to each other, but it is hard to find movies of different genres.

Collaborative Filtering

Now we get to the types of algorithms mainly used by giants like Netflix and Amazon, one type of which involve a concept known as Matrix Factorization. It's used mostly when you have many hundreds of thousands of customers and items, so you can create a matrix (that's usually sparse) and try to Raters probably follow some logic where they weight the things they like in a movie (a specific actress or a genre) against things they don’t like (long duration or bad jokes) and then come up with a score. These are known as latent factors.

The Singular Value Decomposition (SVD), a matrix factorisation method from linear algebra that has been generally used as a dimensionality reduction technique in machine learning. In the context of the recommender system, the SVD is used as a collaborative filtering technique. It uses a matrix structure where each row represents a user, and each column represents an item. The elements of this matrix are the ratings that are given to items by users. In the User-Item matrix, each value represents the rating a user has given to an item. The blank squares mean that the user has not rated that item, and that we need to construct latent factors to predict those values. This can be done by decomposing into the user and item matrices with as many latent factors as possible.

The factorisation of this matrix is done by the singular value decomposition. It finds factors of matrices from the factorisation of a high-level (user-item-rating) matrix. The singular value decomposition is a method of decomposing a matrix into three other matrices as given below:

The matrices are as follows

A (m x n) represents the current rating each user gives to a movie
U (m x r ) represents the relationship between users and latent factors
S (r x r ) represents the weighted coefficient of each latent factor
V (r x n) indicates the similarity between items and latent factors.

Once again, the latent factors here are the characteristics of the movies, for example, the genre of the music. The SVD decreases the dimension of the utility matrix A by extracting its latent factors. It maps each user and each item into a r-dimensional latent space. This mapping facilitates a clear representation of relationships between users and items. The SVD Matrix Factorization approach is basically building a set of weights in the matrix S, that predicts what each user would rate a movie they haven't watched yet based on all other users. This is similar to doing linear regression, but on a much bigger scale for hundreds of thousands of users at a time, which is why doing matrix vectorization is much more computationally efficient at scale for online inference.

The library was built for creating and analyzing recommender systems. It has to be mentioned that most of the built-in algorithms use some kind of the above approaches. I am going to compare these algorithms to each other in this section using 5-fold cross-validation. Since the algorithms and the dataset have a large memory footprint the comparison will be executed on a subsampled dataset which is not comparable to the above models. We can see that it achieves a pretty respectable average RMSE of about 0.98.

Great but how do we use it? After training and prediction, can write a function that retrieves the predicted top N highest rated movies for each user that they have not watched yet. From the results output, we can see that for each userid, we have a generated a list of recommended movie_ids.

Finally, a great way to measure how well each set of K items has been recommended to a user is to compute the precision and recall at K, given by the following formula:

Precision is the fraction of recommended items that are actually relevant to the user while Recall is the fraction of items relevant to the user that were recommended. We can define wether an item is relevant or not based on a certain threshold.

Summary and Best Practices From Netflix

Let's take a quick look at a real-life example in the case of Netflix. Everyone watches it. But how exactly does it recommend you shows? It does this at 2 levels:

1.First it computes the top N shows for each category using various techniques mentioned above. For the trending-now section, it could get the most watched shows over the past week. For each genre-section, it could get the highest rated shows for each genre. For shows-for-you-section, it could use some kind of collaborative filtering algorithm like SVD combined with other behavioural factors like the time of the day, relevant world events, how past shows were watched.

2. Next, it treats each of the hundreds of rows as items, computes a score and ranks them based on the likelihood of each user being interested, depending on factors such as diversity, accessibility, appropriateness, hardware limitations. This means that Netflix wants to accurately predict what users want to watch in that session, but not forgetting that he/she might want to pick up on videos that were left off halfway. At the same time, it wants to highlight the depth of its catalog by providing something fresh, and perhaps capture trends that are going on in the user's region.

The result is a personalised page for each user, with the top rows being the most interested genres and the left most columns being the most relevant shows for each genre. There's a lot more that goes behind Netflix's platform that truly makes it an engineering marvel, checkout their for more info.

In summary, recommender systems are pretty much embedded in all the content we consume from e-commerce to youtube videos. There are mainly 3 types, rule-based, content-based and collaborative filtering, each with varying complexity and use cases. For example, one common issue is the cold start problem where there is simply not enough item ratings from a brand new user. For that, using simpler techniques like rule-based or content-based is more appropriate. We have explored several types of algorithms but there are still many more such as KNNs, SVD++, various flavours of and other .

A/B Testing From Google

Summary: This project aims to explore the power of using A/B testing to make better product decisions. View the code
Project Type: Data Science
Skills: A/B Testing, Online Experimentation
Tech Stack: Python, Pandas, Statsmodels, Scipy

is a type of experimental methodology to make data-driven decisions about users responses to different product features. It works by having a control (current product) and a test (feature change), and monitoring wether this positively impacts user behaviour. Almost all internet companies like Google or Facebook leverage A/B Testing to determine product decisions from like which colour of buttons improve click through rate of ads to wether certain notifications increase daily active user count.

This blog post show cases the final project of an . Thus thought it would be useful to summarize the entire course and final project to provide a succinct and clear methodology of how to do A/B Testing properly from Google themselves. Here's some of the critical questions we will be covering:

How to test wether a new feature actually makes a product better?
How to determine the evaluation and invariant metrics to measure?
How many data points do we need to collect and for how long?
How to conduct sanity checks on our results to ensure we can trust them?
How to analyse the results and make a recommendation to senior leadership

Define The Hypothesis

is an online education company that provides both free and paid courses which people can sign up. They currently provide a free trial option to allow users to try the course free for 2 weeks and decide wether they want to purchase it or not at the end. One of the main reasons why customers decide to not purchase the course after 2 weeks is because they realise that they are not able to commit that much time. To improve the conversion rate from free-trial to paid users, Udacity is thinking of adding a time commitment screener to ask users how many hours they can commit to learning per week, after the user has clicked on the start free-trial button. If a user enters a number below 5 hours per week, they will be dissuaded from enrolling in the free-trial as the time commitment level is insufficient for completing the course.

The hypothesis is that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course. You can view the detailed problem and scenario as defined .

Metrics To Measure

The next step is to decide which metrics do we want to track that would indicate the success of our experiment. There are 2 types of metrics we need to track:

1) Evaluation Metrics
- Metrics that we expect to change favourably if the experiment is a success
- It must be sensitive to desired change but robust enough not to be affected by undesired changes
- Usually sums, counts, mean, median, click-through-rate (CTR), click-through-probability (CTP) and ratios between any two metrics defined earlier
- For this experiment we shall go with:
  - 1) Gross conversion: No. of user-ids to enroll in the free trial divided by no. of unique cookies to click the "Start free trial" button. (dmin= 0.01)
  - 2) Retention: No. of user-ids to continue paying after the free-trial divided by number of user-ids who enrolled in the free trial (dmin=0.01)
  - 3) Net conversion: No. of user-ids to continue paying after the free-trial
    divided by no. cookies to click the "Start free trial" button. (dmin= 0.0075)
2) Invariant Metrics
- Metrics that we expect to remain constant throughout our experiment so that we know the changes in our test results are really due to the feature and not due to some external events such as sales or holiday seasons.
- We have to do a sanity check on these metrics first to ensure that there are no biases in our evaluation metrics and that we can trust it
- For this experiment we shall go with:
  - 1) Number of unique cookies to view the main web page (dmin=3000)
  - 2) Number of clicks of the "start free trial" button (dmin=240)
  - 3) Click through probability (CTP) of a user who joins the free trial after landing on the main page (dmin=0.01)

Dmin refers to the minimum amount of unit change that we must observe with the test results, to say that there was a significant enough change and not due to random standard deviation. This practical significance level was given to us by Udacity but we could also compute it ourselves if we wanted to . It is also important to note that our unit of divergence for this experiment is a web browser cookie which means that is how we are identifying each test subject. There can be other possible ones as well such as login user_id, an event (click), device_id or even IP Addresses.

Scaling and Standard Error

Here are the current metrics and their mean (estimated) and dmin values provided by Udacity. Since the sample size given by Udacity is n = 5000 cookies, we first need to scale the collected count data by a factor of 8.

Since the unit of diversion is the same as the unit of analysis (denominator of the metric formula) for each evaluation metric (cookie in the case of Gross Conversion and Net Conversion and user-id in the case of Retention) and we can make assumptions about the distributions of the metrics (binominal), we can calculate the standard errors analytically (instead of empirically).

1) Given N unique Cookies that click, what is Prob(enrollment)?
2) Given N unique Cookies that click, what is Prob(payment)?
3) Given N enrollments, what is Prob(payment)?

Further, as n is relatively large in each case, we can assume that the sampling distribution of a sample proportion approaches a normal distribution (due to the Central Limit Theorem). We can also use a rule such as the to check if n is large enough:

Great, we can assume our data follows a normal distribution with 3 standard deviations. Next we need to calculate the standard error for each of the 3 evaluation metrics, which is the tells us how accurate our estimated sample is co compared to the true population mean. When the standard error increases, i.e. the means are more spread out, it becomes more likely that any given mean is an inaccurate representation of the true population mean. We can compute it using the standard deviation of the distribution's sample mean, given by the formula:

Determine sample size and duration

The next question we need to ask is, what is the minimum sample size of online traffic that we need to observe for a conclusive result? Since we do not assume common standard deviations then a more precise way to determine the required sample size would be:

This is also the approach used by many online sample size calculators such as the one by . Further, we want to calculate the experiment sample size in terms of cookies that visit the page. Thus, we also need to account for the circumstance that our evaluation metrics' units of analysis are clicks and user-ids, respectively.The total experiment sample size per evaluation metric is hence given by:

Given our calculations, we would need around 638,940 pageviews (cookies) to test the first hypothesis (given our assumptions on alpha, beta, baseline conversions and dmin). To additionally test the third hypothesis, we would need a total of 685,336 pageviews. And, in case we would like to also test the second hypothesis, we would need a total of around 4,737,771 pageviews.

Now, for each case, we can calculate how many days we would approximately need to run the experiment in order to reach n_C. According to the challenge description, we are thereby assuming that there are no other experiments we want to run simultaneously. So, theoretically, we could divert 100% of the traffic to our experiment (i.e. about 50% of all visitors would then be in the treatment condition). Given our estimation that there are about 40,000 unique pageviews per day, this would result in:

We see that we would need to run the experiment for about 119 days in order to test all three hypotheses (and this does not even take into account the 14 additional days (free trial period) we have to wait until we can evaluate the experiment). Such a duration (esp. with 100% traffic diverted to it) appears to be very risky.

First, we cannot perfom any other experiment during this period (opportunity costs).
Secondly, if the treatment harms the user experience (frustrated students, inefficient coaching resources) and decreases conversion rates, we won't notice it (or cannot really say so) for more than four months (business risk).
Consequently, it seems more reasonable to only test the first and third hypothesis and to discard retention as an evaluation metric. Especially since net conversion is a product of retention and gross conversion, so that we might be able to draw inferences about the retention rate from the two remaining evaluation metrics.

So, how much traffic should we divert to the experiment? Given the considerations above, we want the experiment to run relatively fast and for not more than a few weeks. Also, as the nature of the experiment itself does not seem to be very risky (e.g. the treatment doesn't involve a feature that is critical with regards to potential media coverage), we can be confident in diverting a high percentage of traffic to the experiment.

Still, since there is always the potential that something goes wrong during implementation, we may not want to divert all of our traffic to it. Hence, 80% (22 days) would seem to be quite reasonable. However, when we look at the data provided by Udacity, we see that it takes 37 days to collect 690,203 pageviews, meaning that they most likely diverted somewhere between 45% and 50% of their traffic to the experiment

Sanity Check

Finally, after 37 days, we have collected all the data we need to determine our hypothesis. But hold up. To ensure that the experiment has been run properly, we first conduct a sanity check using the three invariant metrics outlined above. We have two counts (number of cookies, number of clicks) and one probability. As stated earlier, we would expect that these metrics do not differ significantly between control and treatment group. Otherwise, this would imply that something is wrong with the experiment setup and that our results are biased.

number of cookies + number of clicks

In the provided data, the column "pageviews" represents the number of cookies that browse the course overview page. Given our assumptions, we would expect that the total number of cookies in the treatment group and the total number of cookies in the control group each account for about 50% of the combined number of cookies of both groups (treatment + control) as they should have been assigned randomly. We can calculate the test-statistic Z and compared the corresponding p-value against our selected alpha level.

click-through probabilities

To check whether the click-through probabilities in the control and treatment groups are significantly different from each other, we conduct a two proportion z-test with a click being interpreted as a success. We thereby assume that the two populations have normal distributions but not necessarily equal variances (hence p is not pooled below).can calculate the Z-test-statistic and then check the corresponding p-value as shown:

Since the p-value of our 2-sided binomial test is greater than 0.05, we cannot reject the null-hypothesis and can conclude that there was no signifcant change in the number of cookies, clicks and click through probability over the duration of the experiment, as expected.

Evaluation Metrics

CI_left - lower bound of confidence interval of the metric
CI_right - upper bound of confidence interval of the metric
d - the observed change
dmin - the minimum observed change for the metric to be practically relevant
A metric is statistically significant if the confidence interval does not include 0 (you can be confident there was a change),
A metric is practically relevant if the confidence interval does not include the practical significance boundary (that is, you can be confident there is a change that matters to the business.)

For our evaluation metric hypotheses using two proportion z-tests with a click being interpreted as a success. We thereby assume that the two populations have normal distributions but not necessarily equal variances (hence p is not pooled below). To perform the test, we can calculate a 95% confidence interval around the expected difference of the two metrics which is 0, we will compute the respective confidence interval around the observed difference between the conversion metrics.

Gross conversion: the observed gross conversion in the treatment group is around 2.06% smaller than the gross conversion observed in the control group. Further, we see that also the values within the confidence interval are most compatible with a negative effect. Lastly, this effect appears to be practically relevant as those values are smaller than dmin, the minimum effect size to be considered relevant for the business.

Net conversion: While we cannot reject the null hypothesis for this test, we see that the observed net conversion in the treatment group is around 0.49% smaller than the net conversion observed in the control group. Further, the values that are considered most reasonably compatible with the data range from -1.16% to 0.19%.

Given these results, we can assume that the introduction of the "Free Trial Screener" may indeed help to set clearer expectations for students upfront. However, the results are less compatible with the assumption that the decrease in gross conversion is entirely absorbed by an improvement in the overall student experience and still less compatible with dmin(net conversion), the minimum effect size to be considered relevant for the business. Consequently, assuming that Udacity has a fair interest in increasing revenues, we would recommend to not roll out the "Free Trial Screener" feature.

This being said the feature may increase the total number of people who opt for the freely available materials. If true and assuming a steady conversion rate from users who first learn with the freely accessible materials and then upgrade, the feature may still help to increase net conversion. However, if at all, this effect is more likely to happen over a longer time period and, hence, would require a test with a longer timeframe.

Conclusion

How that we have gone through the entire process of A/B Testing here is a summary of the key best practices that Google takes into consideration as well:

Ethical considerations
- what are the risks vs benefits to participants?
- What data is being collected? How sensitive or anonymised are they?
- Do you have a privacy policy?
Metric considerations
- sums, counts, means, medians, probabilities, rates, ratios
- Use a rate when measuring usability of a product, for example you can see which buttons are clicked most frequently
- Use a probability when you want to measure total impact, for example how many unique users moved down each funnel
- Metric is not good if hard to compute, dont have access to data or takes too long
Analyzing Results
- Triple check experiment was correct, along with sanity checks
- Does the change produce statistically and practically significant results
- What are various ways this change would actually impact user experience?
- Is it worth the effort to make that change?
- If results look weird / unexpected, segment it by country/city/demographic for more granularity and find anomalies
- Your impact will be: Did you recommend to launch or not and was it correct?
Follow Up
- Always do a ramp up on the roll-out of the change
- Effect may differ overtime due to external factors, maintain control group
- What if it makes it better for some users but worse for others?
- Focus on business and engineering costs and impact of that change
- Test on more experiments to get more confident results

Marketing Data Science II

Summary: This project aims to explore several useful techniques for using data science to solve problems in the domain of marketing. View the code .
Project Type: Data Science
Skills: Uplift Modelling, Campaign ROI
Tech Stack: Python, Scikit-Learn, KMEANs Clustering, XGBoost

In this 2nd part of our marketing data science series, we shall explore how to use Uplift Modelling to improve a business' marketing efforts and strategic decision making.

Uplift Modelling

Imagine a case that you are about to launch a promotional campaign and you know which segment you want to target. Do you need to send the offer to everyone? The answer is no. In your current target group, there will be customers who are going to purchase anyways. You will cannibalize yourself by giving the promotion. We can summarize the segments based on this approach like below:

Treatment Responders (TR): Customers that will purchase only if they receive an offer
Treatment Non-Responders (TN): Customer that won’t purchase in any case
Control Responders (CR): Customers that will purchase without an offer
Control Non-Responders (CN): Customers that will not purchase if they don’t receive an offer

As you can see, to maximise profitability, we need to target Treatment Responders (TR) or persuadables as our first priority, since they won’t purchase unless you give an offer. After that we can also try the Control Non-Responders (CN) as they might also have a change to increase conversion rate.

On the other hand, you need to avoid targeting Treatment Non-Responders (TN) and Control Responders (CR), since they have been proven to not respond to marketing offers and so we should not waste marketing resources on them.

Hence the purpose of uplift modelling is to identify which customers fall into which buckets using a 2 step approach:

Predict the probabilities of being in each group for all customers: we are going to build a multi-classification model for that.
We will calculate the uplift score by summing up the probability of being TR and CN and subtract the probability of falling into other buckets. The higher score means higher uplift.

Let's use this dataset of about 64K rows of unique customers, which consists of their recent activity. For each customer, we have randomly offered them either 1) a discount or 2) a buy one get one (BOGO) offer. Conversion means wether or not the customer actually took up the offer and made a purchase.

Calculating Uplift

First let's define a function to calculate the uplift for each type of promotion. To do so we first calculate the mean base conversion rate of purchases without any offers. Then we calculate the mean conversion rate for both offers and subtract the mean. Finally we multiply by the average order value for each offer type to get the uplift. From the results, we can see that the discount generated a higher uplift of $40.7K compared to the BOGO of $24.1K.

Remember that our goal is to predict which class a customer is likely to be in. Hence we need to add some class labels to each row defined as:

CR: If the customer purchased even without an offer
CN: If the customer did not purchase without an offer
TR: If the customer purchased using an offer
TN: If the customer did not purchase despite using an offer

Clustering and One-Hot-Encoding

Next we can do some feature engineering to generate new columns such as using KMeans clustering to generate additional bin labels according to each customer's history of number of days since they've joined the store. We also perform one-hot-encoding on some of the categorical columns and drop the labels we are predicting.

Next, we can train our classifier model to predict the probability for a customer being in each class, and further compute the predicted uplift score to be P(CN) + P(TR) - P(TN) - P(CR) .

Model Evaluation

Finally, to evaluate our model, we will split the predicted customers two different groups and compare them with our benchmark:

High Uplift Score: Customers whose predicted uplift score > 3rd quantile
Low Uplift Score: Customers whose predicted uplift score < 2nd quantile

From the results, we see that by choosing to target only customers with higher probabilities of purchase, the conversion uplift of each type of campaign almost doubles to between 15-19% while the revenue increases by about 15%. Hence to conclude, we have shown that by correctly predicting and targeting customers who are more likely to buy and excluding the ones that inherently are not, we can more efficiently allocate our marketing budget to increase ROI.

Marketing Data Science I

Summary: This project aims to explore several useful techniques for using data science to solve problems in the domain of marketing. View the code .
Project Type: Data Science
Skills: Marketing Metrics, Ensemble Modelling, Churn Prediction, Customer Segmentation
Tech Stack: Python, Scikit-Learn, KMEANs Clustering, Correlation Plots, Pandas

In the age of attention arbitrage, marketing is more important for businesses than ever. In this project, we will be looking at various ways data science can be applied to deliver value to a business marketing team, covering the following:

How to calculate key metrics?
How to estimate customer lifetime value?
How to segment customers into different demographics?
How to predict users who churn?

Calculating Key Metrics

The first thing to do when it comes to marketing analytics, is to always measure our north star metrics that best reflect the core value that our product delivers to customers. Lets start by looking at our dataset, where each row is a purchase made by a customer on an online retail marketplace.

Most businesses would be interested in computing these metrics KPIs to indicate how well the business is performing for each month:

Monthly Revenue & Growth Rate
Monthly Sign up rate - number of new customers who sign up over time
Monthly Order Count - sum of the number of orders made per month
Average Revenue Per User (ARPU) - average spending by each user
Average Revenue Per Order - average spending for each order
Monthly Retention Rate (MRR) - percentage of customers who order again for this month and the previous one

We can write some simple functions to do so as shown:

First we see that monthly revenue looks like it increases steadily throughout the year, peaking towards the end of year at $140K in November perhaps due to holiday sales. Revenue in December was the lowest because the data was still being recorded then.

Next we also see that the number of new customers is increasing steadily every month, however fewer of them are actually spending, falling to about 400 every month.

Next we see that the average revenue per order remains consistent around $17.5 whole revenue per user seems seasonal, being lowest around July at $20 but can double to $40 towards end of the year.

Finally, we see that monthly retention rate seems to be improving slowly going from 30% in Jan 2011 to 50% in Dec 2011.

Estimating Customer Lifetime Value (CLTV)

The next metric that is critical to businesses is CLTV, which we can define as

CLTV = freq of spending x avg amount per spend x time before churn

For each customer, we should calculate:

how frequent do they spend?
how much do they spend on average per purchase?
what is the average time customers spend before churning?

Lets start by calculating the revenue for each transaction:

Followed by the mean revenue of each invoice, and the earliest and latest invoice dates of each customer.

Next we calculate the average daily spend of each user and number of days active. We see that the average lifetime value is about 134 days or 4.5 months, which is quite short but note that this is because we only have a year's worth of data. You could also assume this to be 1 or 2 years if we had more data. Finally, we compute the CLTV by multiplying the average daily spend with the average lifetime of each user.

Visualising the results, we are looking for high value customers who spend alot and frequently, which would be those data points on the top left corner of our chart. We could further analyze them this segment to see what kind of demographics they are and try to target more of them in our marketing campaigns.

Customer Segmentation

The next use case is when a business wants to find out what types of customers are using the product? Can we split common ones into distinct groups? Segmenting helps allow more targeted ads and communication to each customer group since their needs might be different, for example students vs professionals. Given a dataset about customer info, we want to answer the questions:

What are our various types of customer personas?
Which customers should we focus on to drive revenue growth?
Which customers should we avoid?

For this approach we can use unsupervised learning techniques such as KMEANs to cluster customers with common attributes together. Here we shall use another rich dataset on US customers who purchased vehicle insurance with attributes such as their state, CLTV, education, employment status etc.

Before running the KMEANs model, we need to preprocess the data by first performing one-hot-encoding on categorical variables, followed by scaling them using MinMaxScaler() since KMEANs is sensitive to values that are too large from each other. Now we have 65 columns instead of 24.

Moving on to modelling, let's try a range of values of clusters (K) from 1 to 10 to determine which value is most appropriate. Its not super obvious, but seems like the inflection point occurs around 4 or 5. Let's go with 5 clusters.

We simply initialize the KMeans model with 5 clusters and and make our predictions in a new column named cluster.

There are several ways to visualize the clusters for each attribute, the first is to plot a stacked bar chart where each color represents a cluster and we can see for each attribute, which clusters are present.

Another way is to unstack the bar charts and plot each attribute to see each cluster more visually, here are some of the charts (there are many so we wont be viewing them all but you get the idea)

From this analysis, we can derive the following about each class

Class 0: High Income, most complaints, mostly men
Class 1: People who are low income, unemployed, single, claim the most, most fit, least educated, suburban
Class 2: High Income, corporate insurance
Class 3: Mainly females, highest CLTV, high income
Class 4: Retirees with good savings, most educated, disabled, highest divorced

Hence for this vehicle insurance business, it makes most sense to focus on acquiring users from classes 2 and 3 since they have the highest CLTV and have the lowest insurance claims. Class 0 could also be possible but customer support would have a harder time. Class 4 is riskier due to the older demographic which tend to result in higher accident claims and hence lower revenues. Class 1 is to absolutely avoid since they not only claim the most but are also highest risk of defaulting on payments due to low income.

One more way of potentially visualizing the classes would be to perform dimensionality reduction using PCA and plotting the first 2 principle components. However, we would have to reconstruct the original attributes of those 2 components to derive meaningful conclusions, which is why I did not do it for this example.

Churn Prediction

The final use case, here's a simple example on how to predict users who churn. It's critical for any business to know when a customer is potentially going to stop using its service and try to 'save' the customer if he is likely going to switch to a competitor by say offering discounts or promotions, because often the cost of acquiring a new customer would be higher. For this example we shall use a past dataset of a telco company where each row has attributes about the customer's spending patterns and wether or not he has churned. For this case the definition of churn would be to end the telco contract. However, for an internet business, churn could be defined in other ways such as if the customer did not visit the website for 60 days.

Plotting the correlation matrix, we observe that

monthly charges are highly correlated with the type of internet service, and also streaming TV and movies
phone service and multiple lines are highly correlated
churn is inversely correlated with having a contract, certain payment methods, online security and moderate correlation with tenure

We can perform a straightforward classification on the churn as a target, but we can enhance the our performance by using an ensemble of 3 models - Logistic Regression, Random Forest and Support Vector Classifiers. We see that the ensemble voting classifier outperforms each of the 3 individually by compensating for each other's biases during prediction. This is technique of stacking models is used alot by kaggle competition experts to achieve superior performance.

In summary, we have covered a range of use cases related to marketing, showing how data science can add value to businesses by:

Monitoring key business metrics to make informed decisions
Calculating customer lifetime value
Segmenting customers into demographics for more focused targeting
Predicting user churn and intervening before it happens.

Fraud Detection Techniques

Summary: This project aims to explore several techniques for detecting fraudulent behaviour in datasets. View the code .
Project Type: Data Science
Skills: Classification, Unsupervised Learning, Model Selection, Grid Search, NLP
Tech Stack: Python, Scikit-Learn, Genism, NLTK

Fraud is prevalent in all industries, especially so in corporations that handle a large amount of transactions such as banks, ecommerce platforms and consumer apps. In this article, we will explore several techniques for detecting fraudulent transactions in datasets, exploring labelled data, unlabelled data and even text using natural language processing.

Supervised Learning: Classification

Let us start by exploring this dataset of 7.3K credit card transactions which have been labelled as either fraudulent or non-fraduelent. Just like in real-life, fradulent cases are rare, only making up 300 or 4% of all transactions.

Lets start by naively fitting a vanilla decision tree classifier, as we can see, the ROC-AUC score is surprisingly not bad at 0.89. But we can do better.

First we do model selection by repeating the steps above but for several models to compare which has the best vanilla performance. However, one change we need to make is to upsample our minority classes to a 1:1 ratio by using to synthetically create more datapoints of our fraud classes.

From the results, we see that our best performing models are boosted trees such as AdaBoost and XGBoost classifiers.

Lets go with XGBoost. The next step would be to tune its hyperparameters using gridsearch cross validation to try all possible combinations. If time permits, the ideal way is to , but we dont have to for this example.

Overall we can see that XGBoost after tuning, improved our model performance by a decent margin from 0.89 to 0.98 which looks like its good enough for deployment now.

This was a pretty example of having labelled fraud data. But what happens if we dont know which datapoints are actually fraudulent? This is what we will be exploring next.

Unsupervised Learning: KMeans

Let's turn our attention into a different dataset, this time it contains ~7K rows on various bank customers and their spending. Here we can see the features such as how long the customer has been with the bank (age), amount spent per transaction and the category of the spending. Our goal is to figure out which of the spending seems anomalous and we can flag them. The dataset actually comes with fraud labels, but we shall drop them in order to simulate a real-world example.

The first thing we need to do is to perform dimensionality reduction on our data to extract the two most important principle components by using PCA. Next we can then plot an elbow curve to test out different number of clusters. Here we see that the silhoutte score decreases dramatically with k > 3. Hence our optimal number of clusters is 3.

In order to make predictions using our KMeans model, we first run the algorithm to classify each datapoint into one of the three classes. Next, we compute the euclidean distances for each datapoint to each centroid. Finally, we set a rule to classify all datapoints that are further than the 95th percentile to be outliers (and hence suspected frauds). These would be the datapoints in pink that are far out from the rest of the points in the cluster. By adding back the label, we can evaluate our classification performance and see that we achieve quite a good f1-score of 0.95. The ROC score is respectable, but could be improved further if we get more data and more feature engineering.

But what if our dataset isnt just unlabelled but also contains text? Next we will see how to handle that.

NLP: Topic Modelling

For this example, let's use a dataset consisting of ~2K enron emails and try to potentially find fraudulent conversations within them.

Of course the most basic way is to simply search the emails for certain suspicious key words such as police or money laundering, but we can do better than that.

We can do topic modelling by potentially identifying key topics amongst the emails based on common words and phrases. But first, we need to clean our text data using the following steps:

Convert to lower case and remove punctuation
Removing Stop Words - ridding common, un-meaningful words such as the, a, she, I
Lemmatization - reducing words to their basic meaning e.g. have, had --> has. We can use WordNetLemmatizer to group similar words.
Stemming - reducing words to their basic structure e.g. helpful, helping, helped --> help. We can use the PorterStemmer library to remove the suffix.

As you can see, our text has now been processed into clean and succinct tokens. The next step is to build a bag of words, which is basically:

Encoding each word into a unique number (dictionary)
Generating a list of (word, frequency) for each email

We can use the genism package to help us with this by first building a corpora.Dictionary() of unique words and numbers, followed by applying dictionary.doc2bow(text) on each email. The result is a corpus of list of email documents each represented by a list of [(word, frequency)].

Next we can run a on our corpus. LDA works by generating topics that are related to groups of words and assigning each text document to a topic. In simplifed terms, here's how the algorithm learns - given a set of text documents (D), a dictionary of words (W) and number of topics to generate (K) :

For each document in D, randomly assign each word to one of the topics K.
For each d, w, k compute 1) the probability that the topic is assigned to the document and 2) the probability that the word is assigned to the topic. Reassign each word to a new topic with probabiliy P(1) x P(2).
Repeat step (2) until a steady state is found where P(1) and P(2) remain constant.

The intuition behind this is that if a text document contains many words such as apples, orange, bannanas, and these words are associated to the same topic for example fruits, then there is a high probability that one of the topics in the document is fruit. But there could be other topics as well, depending on the type of words. I recommend reading this excellent and if you're interested to dive deeper.

We can import the LDA model from the genism library and pass in a number of topics, along with the corpus and dictionary that we created earlier. We can see that each of the 10 topics is made up of a linear combination of probabilities x words, but of course it up to use to identify those topics in the end.

One really cool way of visualizing our model is by using the pyLDAvis library, which generates an interactive dashboard to show the topics after PCA dimenstionality reduction. Clicking each topic (blue circle) shows the top 30 most relevant words that are related to that topic, and the frequency of the words appearing in all documents.

So how do we interpret this? We can basically try out various topic numbers and see the words associated with each topic to flag suspicious documents. For example, we see that topic 5 and 8 are outliers from the other main topics, and topic 5 in particular contains suspicous words related to fraud such as Enron, Million, Fund, Bankruptcy, Donation, Pleas hence we can query all documents containing these words and manually read them to determine if they really are fradulent or not.

In summary, these are only a few of many ways to apply data science to fraud. There are so many other ways fraud can occur from bogus ecommerce sellers, to investigating customer complaints to even image detection. I hope these examples of supervised, unsupervised and NLP Topic Modelling shed some light as to how to approach fraud detection.

Time Series Forecasting

Summary: This project aims to forecast app sales over time using various time series models. View the code .
Project Type: Data Science
Skills: Box-Jenkins, Stationarity, Autocorrelation, Seasonality, SARIMAX, Regression
Tech Stack: Python, Statsmodels, Pymdarina, Sktime

Being able to forecast data into the future is an invaluable skill for any organization, with many applications such as:

Predicting monthly store product sales for revenue reporting
Forecasting seasonal weather conditions
Projecting marketing ad spend and customer life time value
Estimating patient levels in a hospital to plan for demand
Calculating minimum order quantities for a supply chain

But what are some techniques, beyond basic extrapolation and excel sheets, that allow us to do that? Turns out there are many machine learning methods that allow us to make intelligent forecasts, and today we will be mainly exploring 2 types - statistical time series using SARIMAX and ML Regression methods. Let's start by using this dataset that shows the daily in-game currency spent from a mobile games developer company. They want to answer the question - how much are users expected to spend in the app over the next few months?

The first thing to introduce is the Box-Jenkins Framework, which is a series of best practices to build the best time series model that can capture the trend, seasonality and fluctuations in the data.

Stationarity and Seasonality

Before even starting modeling, we need to ensure that the data is stationary, because non-stationary behaviour would be too random to observe a reliable trend from many statistical models. Intuitively a time series variable is stationary about some equilibrium path if after a event it tends to return to that path. A series is non-stationary if its mean and variance changes randomly over time, making it almost impossible to model.

While most real-world data like sales or stock prices are non-stationary, it is still possible to predict their core trends by extracting their stationary parts. We can use the , to determine wether a series is stationary or not. The intuition behind the test is as follows. If the series is , then it has a tendency to return to a constant mean.

We can use the statsmodels.adfuller() function to compute stationarity. We see that the test statistic is negative and p-value > 0.05, indicating that it is likely not stationary. Hence, we can make it stationary by computing this 1st order difference.

Next, we need to find out if there is seasonality so that we can factor it into our model. To do so, we can detrend the data by subtracting its rolling mean of 90 days, in order to reveal any seasonal patterns. From this it is obvious that there is a ~30 day cycle.

Now that we know the seasonal period, we can further do seasonal decomposition to visualize the trend, season and residual (noise). Overall, there is an increasing trend, which is a good sign as people tend to spend more over time on the app. We see a seasonal spike of user spending near the end of each month probably because people receive their paychecks and are happier to splurge.

Determining Model Order

The first model that we will be using is Autoregressive Integrated Moving Average (ARIMA) Model, which is one of the most widely used forecasting algorithms but its quite a mouthful so lets break it down:

Autoregressive (AR) - dependent variable (x) is the sum linear combination of one or more lagged values. The idea is that the current value depends is correlated to one or more weighted previous/lagged values of the time series. p is the order / number of autocorrelated variables, alpha (a) is the coefficient, and omega (w) is a noise term. It explains the momentum and mean reversion effects often observed in trading markets.

Integrated (I) - The I in ARIMA stands for Integrated, which is simply the order of differencing required to transform the time series into a stationary one, which we saw earlier was d=1.

Moving Average (MA) - similar to AR, except that the dependent variable (x) is the sum linear combination of one or more lagged values of its error terms. Omega (w) is white noise with E(wt) = 0 and variance of sigma squared. It explains the shock effects observed in the white noise terms, such as unexpected events e.g. Surprise earnings, A terrorist attack, etc.

Combining both of these, logically we get the ARMA(p, q) model equation, but the question is how do we determine the model orders p and q?

For this we have to investigate the and Partial Autocorrelation Functions (PACF) of our time series, which show how any two values in the time series correlate with each other. We can import these functions once again from the statsmodels library.

This gives a good summary of how to interpret the plots:

From this we can see that the ACF tails off (not indicative) while the PACF cuts off at around 5, indicating that q could be either 4 or 5. Finally, we can add the seasonality component to form , which requires computing extra P, D, Q parameters for the seasonal part of the time series, in addition to the trend.

Officially the way to do it is to compute the seasonal difference and analyse the ACF and PACF as well, however as you can see, it's not very clear what they should be. Fortunately for us, there are auto_arima packages made available that can help us find these values which I will demonstrate later.

Model Optimisation

Let's first plot a baseline ARMA(1, 1) model, without taking into account stationarity and seasonality. We can use statsmodels.SARIMAX() to help us build the model and make a dynamic 60-day ahead forecast.

The model output shows very detailed results, and we see that the p-values Prob(Q) and Prob(JB) are close to zero which means the residuals are uncorrelated and normally distributed. Here, the Mean Absolute Percentage Error (MAPE) of 29% is very high and shows that the predictions are quite far from the actual data. We can see that the model fails to capture the baseline trend and the variance of fluctuations, because it is non-stationary.

Plotting the model diagnostics, we see from the histogram that the green and orange curves are similar while the q-q plot is in-line with the red, indicating that the data points follows a normal distribution. However we see obvious patterns in the standardized residuals and high weights for the correlogram, indicating that it is non-stationary, which is correct. Hence this model is unacceptable.

Lets add 1st order differencing and seasonality of 12-months:

We see a big improvement in the MAPE dropping to 17.59% because the model is now able to capture the trend-line, but still does not predict the fluctuations.

Since there are many permutations of p, d, q, P, D and Q, fortunately there are now AutoARIMA libraries out there that can help us perform hyper-parameter tuning.

It basically does a series of intelligent stepwise for loops to guess combination of parameters that result in the lowest which is a measure of the lowest model order that gives the best fits the data.

Here we see that the MAPE drops even more to only 12.79% and that our model seems to be able to capture some of the fluctuations now.

Of course, we see that the error drops even more if we do a one-step ahead prediction instead of predicting 60 days ahead.

Regression Time Series

Next up, we can also perform time series predictions by using normal regression based models. For this, lets use a different but similar dataset, which is the ad spend over time by the game company:

Here are some of the features we can generate:

Lags of time series
Window statistics:
- Max/min value of series in a window
- Average/median value in a window
- Window variance
Date and time features:
Minute of an hour, hour of a day, day of the week, and so on
- Is this day a holiday? Maybe there is a special event? Represent that as a boolean feature
- Target encoding

The most basic features we can generate are to shift the series n-steps back for each column. For example, getting the last 30 days of data to use as features to predict the 31st day. We can even predict multiple days in advance by increasing the lag to n days. However If something fundamentally changes the series during that unobserved period, the model will not catch these changes and will return forecasts with a large error.

Therefore, during the initial lag selection, one has to find a balance between the optimal prediction quality and the length of the forecasting horizon. Here, lets try to predict 6 days in advanced by creating some features from 6 days before and more.

Next we can also create certain categorical features of each day, such as which day of the week is it and which days are the weekend.

Next we can go one step further to encode each with the mean value of the target variable. In our example, every day of the week and every hour of the day can be encoded by the corresponding average number of ads watched during that day or hour. It's very important to make sure that the mean value is calculated over the training set only (or over the current cross-validation fold only) so that the model is not aware of the future.

Fitting the data to a gradient boosting regressor model, we can see that it performs quite well with only a MAPE of 5.32% with lags_24 and hour_average being the most predictive features. However, gradient boosting models tend to overfit and I would likely try another with L1 or L2 regularisation.

Finally there is a new unified library for time-series called which abstracts on top of sklearn, statsmodels and pmdarima, that is up and coming. It is able to perform cross validation over time by training our model on a small segment of the time series from the beginning until some t , make predictions for the next t+n steps, and calculate an error.

Then, we expand our training sample to t+n value, make predictions from t+n until t+2∗n , and continue moving our test segment of the time series until we hit the last available observation. As a result, we have as many folds as n will fit between the initial training sample and the last observation. Here we can do so by fitting a reduce regression forecaster on top of a gradient boosting regressor.

It even has AutoArima built in and achieves similar results with lesser lines of code!

In conclusion, I have shown how we can build both statistical and regression time series models to forecast any trend over time such as sales or even stock prices. However this is usually a very hard problem to solve as real-world data is often non-stationary and future values can change alot depending on both macro and micro factors influencing a process. The further into the future one tries to forecast, the higher the error will be. There are also many other techniques we have not covered such as , but hopefully this gives a fundamental approach on time series modelling.

GCP: Scalable Data Pipelines

Summary: This project aims to build a scalable end-to-end ML solution for real-time streaming and batch predictions. View the code .
Project Type: Full Stack Data Science
Skills: Data Engineering, Machine Learning, Website Deployment, Model Deployment, Real-Time Streaming
Tech Stack: Python, Google Cloud Platform, Apache Beam, Linux, SQL

Data is the new oil, and companies are racing ahead to collect, process and mine more insights from data than ever. Some of the benefits include:

Identifying key business metrics to track and forecast
Building predictive models of customer behavior
Running experiments to test product changes
Building data products that enable new product features

But how does one go about building and scaling a big data pipeline, from a few gigabytes per week to petabytes per day? Enter,. While AWS provides a larger breadth of microservices for monolithic applications, Google is quickly catching up, particularly in the area of big data and machine learning. In this post, we be doing a hands-on deep dive into actually building this.

A Production Grade Data Pipeline

Every big data pipeline needs to be able to accomplish the following at scale:

Rapidly ingest different types of live data coming from multiple sources
Store the data in scalable warehouses for downstream use
Process, clean and transform data for more meaningful results
Enable people to visualise and analyse important metrics on dashboards
Perform efficient batch and streaming ML predictions in real-time

Trying to build big solutions like these take years even for the most experienced teams in MNCs. This is why managed solutions from cloud providers such as AWS, GCP and Azure are increasingly becoming more attractive for both small start ups and large enterprises. Analytics teams are able to easily spin up data infrastructure for small projects, scale them up dynamically and decomission them whenever they want.

Defining The Problem

Lets start with a real-world use case where an e-commerce business owner is trying to predict wether their users would buy a certain product or not, and thereafter recommend certain products to up-sell and cross-sell. Here we have a dataset of ~16k website visitor events such as visiting certain pages, which country they are from and the type of transactions they made.

Our goal is to predict the probability of each user making a purchase after visiting the store, hence we can group them by visitor_id, and compute certain aggregated metrics such as the average session quality, product price, page views and number of visits. We see there are about ~700+ unique visitors to this store.

Since only about 1% of site visitors actually made purchases, we can do some simple synthetic upsampling of our minority class before making a prediction using a Vanilla XGBoost model.

A quick look at our feature importance scores tell us that the product price and time spent browsing through the site are the top 2 most important features.

With an acceptable model built, we can now save it into a binary file using joblib package, and can use it later as so:

Since our goal for these predictions is mainly for marketing use later, we can do batch predictions instead of real-time. Now according to Google there are , we can either call the model as an API on the AI platform or run the model file directly in our ML pipeline. Since they have recommended that the second approach is a more efficent, we shall go with that instead. Hence we start by first uploading our model.joblib file into a model bucket that we created.

Data Ingestion Using Google Pub/Sub

The first piece of our data pipeline is to have a way to collect data in real time from the website. Instead of using API calls or files to store data, the robust and scalable way would be a stream processing system such as Kafka, Amazon Kinesis, or Google’s PubSub. These systems are able to handle petabyte volume of data in real-time, and PubSub offers the added advantage of being a serverless, fully managed service so you do not have to set up and maintain your own server infrastructure.

Key Benefits:

Asynchronous - subscribers can receive data from many publishers concurrently based on various topics
At-least-once delivery - guarantees high durability by storing copies of the same message on multiple servers
Serverless - Fully managed sharding and partitioning for high latency
Security - HIPAA-compliant granular controls + end-to-end encryption
Integrations - send notifications or data to any google service, database, Web APIs or even clients like email

We can easily create a new pub/sub topic and integrate the pub/sub API to our website as follows:

We shall also deploy a simple client website on Google Compute Engine that allows us to simulate collection of the raw data fields, using this tutorial .

Data Storage In Google Big Query

Next, we can store this data in Google BigQuery, which lets users gain real-time business insights from massive amounts of data without any up-front hardware or software investments. Accessible via a simple UI or REST interface, Google BigQuery lets you take advantage of Google’s massive computing power, store as much data as needed, and pay only for what you use. Your data is protected with multiple layers of security, replicated across multiple data centers, and can be easily exported. With Google BigQuery, you can run ad hoc, SQL-like queries against datasets with billions of rows.

Key Benefits:

Scalability - Relational Data storage that scales seamlessly to hundreds of terabytes, with no management required
Fault Tolerant - Automatic sharding and partioning with data replication across multiple regions
Speed and Flexibility - Ad hoc queries and JOINS on multi-terabyte datasets
Accessible - Easily ingest and export to and from many data sources like data bases, data studio dashboards, excel files and machine learning models

We can simply create an automated pipeline using DataFlow, which I'll elaborate on more later, to process the data streaming from our Pub/Sub service and dump them into our BigQuery Table.

This pipeline can be easily created from one of the many templates provided and it is even able to be handle incoming data that have errors. We can now do a simple query and our data streaming from our app is now nicely populated into the SQL tables, vola!

Data Processing Using DataFlow

So back to the question, what exactly is ? It's basically a fully-managed data processing service that allows you to easily build scalable analytics ETL pipelines. It leverages Apache Beam, an open source, unified model for defining both batch and streaming data-parallel processing pipelines. It also allows you to choose any open-source distributed processing back-ends, which include , , and .

Key Benefits

Serverless - Fully automated provisioning of processing resources
Scalable - Efficiently processes large volumes of data in real-time using horizontal autoscaling and rebalancing of worker resources to minimize latency
Flexible - Build pipelines between any two resources from Pub/Sub, BigQuery, Storage and other APIs
Customizable - define any processing functions you want using Apache Beam for an open source, unified model for defining both batch and streaming data-parallel processing pipelines.

We can write a custom DataFlow script that uses Apache Beam to do the following:

Extract newly populated rows from our raw BigQuery table
Transform the data by grouping by customer_id and calculating average and sum statistics to generate our features such as page views
Loading our machine learning model that we stored earlier, and using that to make batch predictions
Storing the prediction results into a new BigQuery Table for use later.

It looks similar to this except that we are using an XGBoost model instead of TensorFlow to predict the purchase probabilities of each customer and store them in BigQuery.

As you can see, Apache Beam allows us to define custom ETL functions and classes such as Predict() and collect() which we can call in our custom pipeline. This script can either be run ad-hoc or using any scheduler like Chron or Airflow job, both locally or on another server.

After running our DataFlow job, we can see that indeed our predictions are uploaded into a new BigQuery Table! Once there, the possibilities are endless because the data can be exported to analytics tools like Excel, DataStudio or Alteryx, or even be ingested by another API that can leverage these predictions, for example to perform marketing campaigns based on each user's purchase probability.

Summary

Building efficient and scalable data pipelines are critical to any company's success, and fortunately today there are many SaaS providers that provide a wide variety of solutions out there. Of course GCP is only one of the many, but I felt it was interesting to try it out and compare it with the incumbent AWS. What I've personally noticed is that its definitely more affordable and developer friendly, and it's . It's defintely not going to replace AWS anytime soon, but it makes for a compelling option for SMEs to quickly get started with a robust suite of AI and ML tools.

Bonus: If you've read until this far and want to learn more, I highly recommend this gem of an e-book on .

Singapore HDB Resale Prices - A Geospatial Analysis

Summary: This data analysis project aims to investigate flat prices in Singapore. View the code .
Project Type: Data Science
Skills: Data Analysis, Data Visualization, Machine Learning
Tech Stack: Geopandas, Folium, Shapely, SHAP, XGBoost, Google Maps API

Disclaimer: This article represents the author's own analysis/views and does not constitute property investment advice. The author is not liable for any financial losses to anyone who makes financial decisions reading this article.

In Singapore, a whopping own their own homes, much higher than the average rate in other smaller city states like . The Singapore government achieved this through the Housing Development Board (HDB) over the last 5 decades, by building densely populated high rise flats. As a Singaporean, owning our own home is one of life's major milestones - but what is a reasonable price you should be paying for your HDB flat?

While owning a home is a safe space for building a family, wouldn't it be great if the price of your home appreciated over time as well? For new couples who qualify for Build To Order (BTO) Flats, the cost is usually signficantly lower but for this article, we will be doing some data analysis on resale flat prices from to shed light on some burning questions:

How have the overall prices of HDB's changed over time?
How do HDB prices appreciate over time by flat-type and location?
What are the most significant factors that influence the resale price of HDB Flats?
What type and location of flats should you buy to achieve the highest appreciation value?

How have HDB prices changed over time?

To answer this question, lets start by looking at the median HDB resale flat prices from 2007 to 2020. Unfortunately, this dataset does not have enough data on 2-room and executive flats (we'll zoom in on a more granular one later). We see that there were ~8000+ median resale flat prices during that period in various estates/towns.

After grouping them by year, we see that overall that the median prices of flats generally increased between 47-60% from 2007 to 2020. However, we notice that there was a significant price dip of about 20% for all flat types between 2013 to 2015, now why did that happen? In Jan 2013, the Monetary Authority Of Singapore (MAS) introduced a slew of aimed at keeping flat prices affordable for Singaporeans/PRs, which included:

Increased buyer stamp duty (ABSD) by 5-7%
Lower loan to value (LTV) financing quantum up to 50%
Higher cash downpayment for 2nd mortgages from 10 to 25%
PRs disallowed from subletting their entire flat

Overall this seems appropriate to ensure that public housing remains within reach for most people but it also means that the rate of appreciation of flats after 2015 have slowed drastically between 2015 and 2020:

Median 3-room flat prices have depreciated by 13%
Median 4-room flat prices have appreciated by 1.01%
Median 5-room flat prices have appreciated by 10.4%

This puts the compounded annual growth rate (CAGR) at only ~0.5-1% on average, a far cry from the 8.65% between 2007 to 2013. We also see that bigger flats (4 and 5 rooms) tend to appreciate more in price albeit due to higher demand by couples who want more space to raise a family. As long as the cooling measures remain in place, it means we can only rely on the prices from 2015 onwards to accurately reflect and project the resale prices in the future. It also means that unfortunately, it will be very hard to regard on public housing as decent investment that will generate ROI over time, or can it still be?

To visualize the most recent median prices of HDBs by location, we can group them by town and use geopandas to generate the lat and long coordinates, before using the folium package to plot a nice openstreetmap of Singapore. Here, we can see that flats closer to the South Central Business District (CBD) tend to be unsurprisingly higher, and prices tend to decrease the further out one goes from there.

If you really to live near the CBD without paying a high premium, flats around Toa Payoh and Kallang appear to be relatively cheaper while being close. We also see that flats around the north-east punggol area also seem to command high price - probably due to close proximity to the newly renovated punggol waterway park. These trends are similar for 4 and 5 room flats as well.

How do HDB prices appreciate over time by flat-type and location?

Now that we only know that we should look at prices from 2015 onwards, lets do a deeper dive by using more granular datasets. Here, we can see the resale prices of ~108K HDB flats sold between Jan 2015 to May 2020, and it also has alot more information besides town and flat_type, such as the flat_model, remaining lease date, storey_rate and square_area.

Visualizing the data, we can see how prices for all room types have changed from 2015 to 2020:

Median 2-room flat prices have depreciated by 6.5%
Median 3-room flat prices have depreciated by 8.7%
Median 4-room flat prices have apppreciated by 0.5%
Median 5-room flat prices have apppreciated by 10.5%
Median Executive flat prices have remained pretty much the same

What is interesting about this is that when we break it down by quartile. The top 75th percentile by room types have changed from 2015 to 2020:

2-room flat prices have depreciated by 6%
3-room flat prices have depreciated by 5.4%
4-room flat prices have apppreciated by 2.1%
5-room flat prices have apppreciated by 6.3%
Median Executive flat prices have apppreciated by 2.2%

The bottom 25th percentile by room types have changed from 2015 to 2020:

2-room flat prices have depreciated by 12.3%
3-room flat prices have depreciated by 13.7%
4-room flat prices have depreciated by 4.5%
5-room flat prices have depreciated by 0.6%
Median Executive flat prices have apppreciated by 0.9%

Firstly, this reinforces the earlier dataset, that 4 and 5 room flats in general will likely appreciate in value ~1-2% annually while 2 and 3 room flats tend to depreciate by the same amount, probably because there is growing demand as citizens prefer having more spacious flats to raise children. Secondly, we also observe that flats at the top 75% percentile of prices tend to appreciate more while those at the bottom 25% tend to depreciate more, regardless of flat type. This could be happening because of other factors such as location and development of surrounding amenities. This means that whatever flat type that you buy, its better to buy a more expensive one that has better location or amenities as it would likely continue to appreciate in value.

Next, lets see which locations tend to appreciate in value over time. Since most people would be buying 3, 4 and 5 room flats, let's focus on these few.

For median resale 3-Room Flat prices in 2020, we see that for locations:

The top 5 most expensive are: Bukit Timah, CBD, Bishan, Pasir Ris, Punggol
The top 5 least expensive are: Geylang, Toa Payoh, Bukit Batok, Woodlands, Jurong West
Towns that appreciated most in value over the last 5 years:
- Pasir Ris (35%)
- Sembawang (6%)
- Bukit Panjang (5.17%)
- Bukit Timah (2.5%)
Towns that depreciated most in value over the last 5 years:
- Geylang (-22%)
- Toa Payoh (-19%)
- Queenstown & Marine Parade (-18%)
- Kallang (-17%)

For median resale 4-Room Flat prices in 2020, we see that for locations:

The top 5 most expensive are: Queenstown, Bukit Merah, Clementi, Bukit Timah, CBD
The top 5 least expensive are: CCK, Woodlands, Bukit Batok, Sembawang, Jurong West
Towns that appreciated most in value over the last 5 years:
- Clementi (20.7%)
- Bukit Panjang (14.1%)
- Punggol (9.88%)
- Queenstown/Bukit Merah (~5.5%)
- Sengkang / Toa Payoh (~3.5%)
Towns that depreciated most in value over the last 5 years:
- CBD Area (-28%)
- Bukit Batok (-14.4%)
- Ang Mo Kio / Marine Parade / Jurong East (~ -11-13%)
- Geylang (-8.7%)
- Woodlands / Sembawang / Choa Chu Kang [CCK] (~-5%)

For median resale 5-Room Flat prices in 2020, we see that for locations:

The top 5 most expensive are: CBD, Bukit Timah, Queenstown, Bukit Merah, Marine Parade
The top 5 least expensive are: CCK, Woodlands, Sembawang, Jurong West & East, CCK
Towns that appreciated most in value over the last 5 years:
- Bukit Panjang (11.5%)
- Punggol / Clementi (~9.5%)
- Pasir Ris / Serangoon / Bukit Merah (~6.2%)
- Bukit Timah / Sengkang (~4.5%)
- Hougang / Tampines (~2-3%)
Towns that depreciated most in value over the last 5 years:
- Jurong East (-12.8%)
- Ang Mo Kio (-8.6%)
- Bukit Batok (-5%)
- Bishan / Marine Parade / Bedok (~ -4-5%)

What are the most significant factors that affect the price of HDB Flats?

To start we can do a proximity analysis, for example, I suspect that towns closer to more amenities such as mrt stations, hawker centres, malls, supermarkets and clinics, would be valued at a higher resale price. First we group our new dataset by town and use google maps API to get the geometry.

Next we can obtain all the amenities from data.gov as well with geometry data and plot a map to visualize each town with a 1km radius and the amenities.

Next we can count the number of each type of amenity within a 1km radius of each town and visualize a heatmap correlation plot using the seaborn library.

Below, we see that indeed the number of MRT stations and malls in the town does correlate more with higher resale prices.

Next, we can build a simple machine learning model and observe the feature importance scores. This time we will be training on the full, ungrouped dataset of 100K resale flats from 2015 to 2020. First we count encode the categorical variables and compute other features such as age and highest floor.

Next, we simply train and fit an XGBoost Regressor to predict on the resale price, using an 80-20 train-test split and plot its feature importance score.

From this we can see that resale price is impacted most by the street/town, size of the flat and also the age of the flat. The flat age was computed by taking 2020 subtracting the lease commencement date. Interestingly we see that flat_type is not as important, likely because some flats within a the same flat_types had more floor area and so were valued more.

Finally we can also do some Shapely Additive Plots (SHAP) to explain which factors increased or decreased the resale price prediction. For the first one, it had a small floor area of only 67m and only 3 floors. Wheras the 2nd one had almost double floor area and was probably in a better location.

Summarizing our findings

We see that HDB resale prices have cooled considerably since the new property measures were in place since 2013. If you purchase as 2 or 3 room flat, you can expect it to depreciate about 1-2 % per annum on average. If you purchase a 4 or 5 room flat, you can expect it to appreciate ~1-2% per annum on average.
However, there were some towns that bucked this trend which will likely continue appreciating over the next few years:
- 3-Room: Pasir Ris, Sembawang, Bukit Timah, Bukit Panjang
- 4-Room: Clementi, Bukit Panjang, Punggol, Queenstown, Bukit Merah, Toa Payoh
- 5-Room: Bukit Panjang, Punggol, Clementi, Pasir Ris, Serangoon, Bukit Merah, Sengkang
We observe that the following factors correlate most to a higher flat price unit
- Proximity to Amenities such as MRT stations and Malls
- Larger square foot area
- Located at higher Floors
- The newer the flat's age
It might be better buy a more expensive flat with better location and amenities that will likely appreciate over time if you can afford it than a cheap one that will likely depreciate. Of course there are other factors to consider such as proximity to your work place or children's schools or even parent's house.

AWS FullStack Cloud Webapp

Summary: This project aims to build a full stack web application using a suite of AWS microservices. View the code .
Project Type: Software Engineering
Skills: AWS Cloud Solution Architect, Web Development
Tech Stack: AWS EC2, Fargate, S3, DynamoDB, CodeBuild, Lambda, Kinesis, API, Flask

In this project, we explore how to build a full stack web app using purely AWS services. However, instead of focusing on the nitty gritty code, we will be mostly exploring why and how we can use various microservices for our needs. The website we shall be building will be a simple game platform for users to adopt mythical pets as shown here. But what really goes behind the scenes?

Setting Up Dev Environment

The first thing we can do is to set up our development environment, which we can either choose to do using your own IDE + CI/CD workflow or use amazon's IDE.

is amazon's service that we can use build automated release pipelines for various testing stages such as UI testing, load testing, integration testing, API reliability testing, etc.where every code change is built, tested, and then pushed to staging environment, before it is finally released into production.

We can additionally integrate this with which is a fully managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy. It automatically provisions, manages, and concurrently scales build servers for you when you need it. Any time a build execution is triggered after you push new code, AWS CodeBuild automatically provisions a build server to our configuration, executes the steps to build our Docker image and push a new version of it to the ECR repository we created (and spins the server down after).

In order to do these, we can specify json files like code-pipeline.json, code-build-project.json and artifacts-bucket-policy.json to configure the pipeline, build and have a bucket to store the builds we create in our pipeline. Running the following commands will build our pipeline:

aws s3api put-bucket-policy --bucket REPLACE_ME_ARTIFACTS_BUCKET_NAME --policy file://~/environment/aws-modern-application-workshop/module-2/aws-cli/artifacts-bucket-policy.json
aws codepipeline create-pipeline --cli-input-json file://~/environment/aws-modern-application-workshop/module-2/aws-cli/code-pipeline.json
aws codebuild create-project --cli-input-json file://~/environment/aws-modern-application-workshop/module-2/aws-cli/code-build-project.json

Building A Dynamic Website

For our website, we can use Python to create a simple Flask app for the backend to provide the following features:

Allow login
List mythical pets from our database based on a filter
Retrieve all details of a pet
Enable people to like the pets

For the front end we can simply render an index.html file that queries the backend API for data, which we can simply upload to an S3 bucket for hosting using the command:

aws s3 cp ~/environment/aws-modern-application-workshop/module-2/web/index.html s3://INSERT-YOUR-BUCKET-NAME/index.html

However, while these are the bare essentials developing apps today requires more than the code logic - Multiple languages, frameworks, dependencies, pipelines, tests, architectures, and discontinuous interfaces between tools for each lifecycle stage creates enormous complexity. This is why we need to containerize our backend service using . A container is a standardized unit of software that allows developers to deploy and run their app from any environment, solving the “it works on my machine” headache.

To do so we simply need to install docker in our environement (is pre-installed in cloud9) followed by defining a docker file that looks something like this:

If we build and run the dockerfile as follows, we should see our flask app running locally:

docker build . -t account_id.dkr.ecr.us-east-1.amazonaws.com/mythicalmysfits/service:latest
docker run -p 8080:8080 111111111111.dkr.ecr.us-east-1.amazonaws.com/mythicalmysfits/service:latest

But in order to deploy it, we need to create an in order to store our image to be used later.

aws ecr create-repository --repository-name mythicalmysfits/service
docker push 111111111111.dkr.ecr.us-east-1.amazonaws.com/mythicalmysfits/service:latest

Finally, we can use , to deploy our container via a serverless compute engine that works that works with Kubernetes without having to manage clusters or servers, lets you specify and pay for resources per application, and improves security through application isolation by design.

On top of that, it is also good practice to route our traffic through a loadbalancer to ensure our system can handle the traffic load and scale appropriately if needed. We can use for that.

aws elbv2 create-load-balancer --name mysfits-nlb --scheme internet-facing --type network --subnets REPLACE_ME_PUBLIC_SUBNET_ONE REPLACE_ME_PUBLIC_SUBNET_TWO > ~/environment/nlb-output.json
aws elbv2 create-target-group --name MythicalMysfits-TargetGroup --port 8080 --protocol TCP --target-type ip --vpc-id REPLACE_ME_VPC_ID --health-check-interval-seconds 10 --health-check-path / --health-check-protocol HTTP --healthy-threshold-count 3 --unhealthy-threshold-count 3 > ~/environment/target-group-output.json
aws elbv2 create-listener --default-actions TargetGroupArn=REPLACE_ME_NLB_TARGET_GROUP_ARN,Type=forward --load-balancer-arn REPLACE_ME_NLB_ARN --port 80 --protocol TCP

Setting Up A Database

Externalize all of the mysfit data and persist it with a managed NoSQL database provided by Amazon DynamoDB. To add a DynamoDB table to the architecture, we have included another JSON CLI input file that defines a table called MysfitsTable. This table will have a primary index defined by a hash key attribute called MysfitId, and two more secondary indexes. The first secondary index will have the hash key of Species and a range key of MysfitId, and the second secondary index will have the hash key of Alignment and a range key of MysfitId.

These two secondary indexes will allow us to execute queries against the table to retrieve all of the mysfits that match a given Species or Alignment to enable the filter functionality.

aws dynamodb create-table --cli-input-json file://~/environment/aws-modern-application-workshop/module-3/aws-cli/dynamodb-table.json
aws dynamodb describe-table --table-name MysfitsTable
aws dynamodb batch-write-item --request-items file://~/environment/aws-modern-application-workshop/module-3/aws-cli/populate-dynamodb.json

User Authentication

In order to add some more critical aspects to the Mythical Mysfits website, like allowing users to vote for their favorite mysfit and adopt a mysfit, we need to first have users register on the website. To enable registration and authentication of website users, we will create a in - a fully managed user identity management service.

Then, to make sure that only registered users are authorized to like or adopt mysfits on the website, we will deploy an REST API with to provide commonly required REST API capabilities out of the box like SSL termination, request authorization, throttling, API stages and versioning, and much more.

aws cognito-idp create-user-pool --pool-name MysfitsUserPool --auto-verified-attributes email
aws cognito-idp create-user-pool-client --user-pool-id REPLACE_ME --client-name MysfitsUserPoolClient
aws apigateway create-vpc-link --name MysfitsApiVpcLink --target-arns REPLACE_ME_NLB_ARN > ~/environment/api-gateway-link-output.json
aws apigateway import-rest-api --parameters endpointConfigurationTypes=REGIONAL --body file://~/environment/aws-modern-application-workshop/module-4/aws-cli/api-swagger.json --fail-on-warnings
aws apigateway create-deployment --rest-api-id REPLACE_ME_WITH_API_ID --stage-name prod

Tracking User Traffic

Finally, understanding the actions your users are taking on the website before a decision to like or adopt a mysfit could help you design a better user experience in the future that leads to mysfits getting adopted even faster.

To help us gather these insights, we will implement the ability for the website frontend to submit a tiny request, each time a mysfit profile is clicked by a user, to a new microservice API we'll create. Those records will be processed in real-time by a serverless code function, aggregated, and stored for any future analysis that you may want to perform.Capture user behavior with a clickstream analysis microservice that will record and analyze clicks on the website using and .

is a highly available and managed real-time streaming service that accepts data records and automatically ingests them into several possible storage destinations within AWS, such as Amazon S3 bucket, or Amazon Redshift data warehouse cluster. This is great for data-driven applications that need to respond in real-time to changes in data for batch processing, stream analytics, and machine learning inference.

AWS Lambda enables developers to write code functions that only contain what their logic requires and have their code be deployed, invoked, made highly reliable, and scaled without having to manage infrastructure whatsoever.

We can first write a python function that retrieves additional attributes about the clicked on Mysfit to make the click record more meaningful using . Then we proceed to configure Kinesis Firehose delivery stream as an event source for the function, which will automatically deliver click records as events to code function we've created, receive the responses that our code returns, and deliver the updated records to the configured Amazon S3 bucket.

sam package --template-file ./real-time-streaming.yml --output-template-file ./transformed-streaming.yml --s3-bucket replace-with-your-bucket-name
aws cloudformation deploy --template-file /home/ec2-user/environment/MythicalMysfitsStreamingService-Repository/cfn/transformed-streaming.yml --stack-name MythicalMysfitsStreamingStack --capabilities CAPABILITY_IAM

Bringing It All Together

This experience was meant to give you a taste of what it's like to be a developer designing and building modern application architectures on top of AWS. Developers on AWS are able to programmatically provision resources using the AWS CLI, reuse infrastructure definitions via AWS CloudFormation, automatically build and deploy code changes using the AWS developer tool suite of Code services, and take advantage of multiple different compute and application service capabilities that do not require you to provision or manage any servers at all!

And the end result is a full stack cloud application that looks like this!

AWS SageMaker: Building an ML pipeline

Summary: This project aims to build a distributed ETL machine learning pipeline using AWS and Spark. View the and here.
Project Type: Applied Machine Learning
Skills: Regression, XGBoost, Feature Engineering, ETL, Data Engineering, Model Deployment
Tech Stack: Python, AWS SageMaker, Glue, S3, PySpark, Boto3

So far we have seen how to do data analysis and build machine learning models locally - but how do we actually productionize them? Some key concerns include:

how do we perform distributed preprocessing on large datasets?
how do we deploy our algorithm onto an API endpoint that can be easily consumed?
how do we continuously train the model with new data after deployment?

In the this project, we shall explore how to build an ML Pipeline leveraging Spark Feature Transformers and SageMaker XGBoost algorithm & after the model is trained, deploy the Pipeline (Feature Transformer and XGBoost) as an Inference Pipeline behind a single Endpoint for real-time inference.This involves performing a few high-level steps:

Using AWS Glue for executing the SparkML feature processing job.
Using SageMaker XGBoost to train on the processed dataset produced by SparkML job.
Building a Pipeline of SparkML & XGBoost models for a realtime inference endpoint.

The above architecture is what we will be building. Here are the tools that we will use:

- machine learning services on the amazon cloud
- python SDK for interfacing with AWS services
- simple storage service bucket for storing data in the cloud
- serverless ETL service which can be used to execute distributed computing jobs like Spark
- a high performance, fast, distributed computing service for big data
- a machine learning algorithm that uses extreme gradient boosting

Exploratory Data Analysis

The problem we will tackle is to predict the age of an Abalone from its physical measurements such as sec, length, diameter, height etc. The target variable is rings since it corresponds with the age. The dataset consists of about 4k+ rows and is available on the .

Initializing AWS Sagemaker

First, we begin setting up our and creating a new with the following trust relationship json.

Now, we are ready to initialize our session, create a new S3 bucket and upload our dataset!

Yay our dataset is now available on the S3 bucket in the console!

Building ETL Pipeline

Next, lets build the entire ETL pipeline and convert that into a script abalone_processing.py consisting of the following steps:

1.First we initialize the spark session and define the schema of our dataset, before retrieving it from our S3 bucket.

2.Next we build a feature processing ETL pipeline by converting our only categorical variable sex into numberical labels before one-hot-encoding them, followed by vectorizing our dataset to speed up computation. We then fit, transform and split our data into training and validation sets with 80:20 ratio.

3.Finally we convert our pipeline into a binary file and upload it back to S3 so that SageMaker will be able to use it later during training and deployment. Note that the pandas dataframes are converted into so as to speed up computation and also because this is the standard format for spark jobs.

Let's now upload our ETL script to S3 for use in Sagemaker later:

For our job, we will also have to pass MLeap dependencies to Glue. MLeap is an additional library we are using which does not come bundled with default Spark. Similar to most of the packages in the Spark ecosystem, MLeap is also implemented as a Scala package with a front-end wrapper written in Python so that it can be used from PySpark.

We need to make sure that the MLeap Python library as well as the JAR is available within the Glue job environment. In the following cell, we will download the MLeap Python dependency & JAR from a SageMaker hosted bucket and upload to the S3 bucket we created above in your account.

Executing Jobs with AWS Glue

Next we define the output location where the transformed dataset should be uploaded. We are also specifying a model location where the MLeap serialized model would be updated. This locations should be consumed as part of the Spark script using getResolvedOptionsmethod of AWS Glue library (see abalone_processing.py for details).

We'll be creating Glue client via AWS Boto3 so that we can invoke the create_job API which will allow us to define mutable jobs for execution. Note that his requires passing the code location as well as the dependencies location to Glue.

Our ETL spark job will be executed now by calling start_job_run API. This API creates an immutable run/execution corresponding to the job definition created above. We will require the job_run_id for the particular job execution to check for status. We'll pass the data and model locations as part of the job execution parameters.

Now we will check for the job status to see if it has succeeded, failed or stopped. Once the job is succeeded, we have the transformed data into S3 in CSV format which we can use with XGBoost for training. If the job fails, you can go to AWS Glue console, click on Jobs tab on the left, and from the page, click on this particular job and you will be able to find the CloudWatch logs (the link under Logs) link for these jobs which can help you to see what exactly went wrong in the job execution.

We can see the jobs being run successfully as well on the console =)

Train and Deploy Model

Now we will use SageMaker XGBoost algorithm to train on this dataset. We already know the S3 location where the preprocessed training data was uploaded as part of the Glue job.

We can see that the boosting algorithm ran for 10 epochs to achieve a final RMSE of 2.42! Next we will proceed with deploying the models in SageMaker to create an Inference Pipeline. You can create an Inference Pipeline with up to five containers. Deploying a model in SageMaker requires two components:

Model Docker image in ECR - we created the fitted model during training
ETL Pipeline - the serialized ETL pipeline we uploaded to S3 earlier

SparkML serving container needs to know the schema of the request that'll be passed to it while calling the predict method. In order to alleviate the pain of not having to pass the schema with every request, sagemaker-sparkml-serving allows you to pass it via an environment variable while creating the model definitions.

Making predictions using our deployed model

Now we will invoke the endpoint with a valid payload that SageMaker SparkML Serving can recognize. There are three ways in which input payload can be passed to the request:

Pass it as a valid CSV string.
Pass it as a valid JSON string.
Pass the request in JSON format along with the schema and the data.

Here we pass in a single payload of features either in json or csv and our model makes a prediction using our deployment endpoint, pretty neat eh?

With this endpoint, our model can now be used to make predictions from any client application that calls it!

Facebook Ads: Bot or Not?

Project Type: Kaggle Competition
Skills: Classification, Class Imbalance, XGBoost, Grid Search, SMOTE, Feature Engineering
Tech Stack: Python, Pandas, Numpy, Scikit-Learn, Imblearn, Matplotlib

Exploratory Data Analysis

The goal of this competition is to identify online auction bids that are placed by "robots", helping the site owners easily flag these users for removal from their site to prevent unfair auction activity. Lets get our hands dirty - digging into the data, we see that the first dataset includes a list of bidder information, including their id, payment account, and address. The target label to predict is the outcome column indicating whether a bidder is a robot (1.0) or human (0.0). Upon investigation, we see that only 5% of the dataset are 1s (bots) while the rest are 0s (humans), which means there is severe class imbalace which we have to deal with later.

The second is a bid dataset that includes 7.6 million bids on different auctions made via mobile devices. Here we can see that we are able to do a join on the bidder_id to the training table, and we have even more categorical features such as auction type, merchandise type, country and time of bid.

Fortunately for us, there are no null values. However, we see that payment_account and address are all unique to the bidder_id and hence are likely to be low informational features that we can drop.

Feature Engineering

Looking at the bids table, we know that for each auction, there will be multiple bidders, each possibly bidding multiple times as well. The time a bidder takes to bid could be useful because bots are able to make the bid much faster whereas a real human would have reaction time and also some thinking before bidding. Hence, the following aggregated time features could potentially be useful:

the mean and median time between a user's bids
the mean and median time between a user's bids for each auction
the mean and median time between a user's bids from the previous bid for each auction

One thing to note is that the timestamp given to us has been anonymized and converted into units instead so we shall be using relative time units instead of actual hours or days.

Next we can do a left join on bidder_id to merge our newly generated time features for each user to the training dataset and volia!

Next, we can generate more features that might be useful as well such as

total count of bids each user has bidded
min, max, mean median number of bids per user per auction
min, max, mean median number of bids per user per device
min, max, mean median number of bids per user per merchandise
min, max, mean median number of bids per user per url

Modeling I

At first, we see an accuracy score of 0.955 which appears to be excellent. However, plotting the confusion matrix shows us that the model is great at predicting humans (0s) but not bot (1s). This is the classical trap of using accuracy as our classification metric. Why? Because of the class imbalance we found earlier. Since only 5% of the target are bots, any model that predicts 1s all time time, would automatically have 95% accuracy and is not a true indication that the model actually knows how to differentiate between the classes.

Hence it is recommended to use the AUC-ROC probability curve to measure how well a classifier is able to distinguish between classes. The higher the AUC, better the model is at predicting humans vs bots, with less error. The ROC curve is plotted with True Positive Rate (TPR) against the False Positive Rate (FPR). Here we see that the AUC_ROC score is only 0.66 which is not great.

We can also evaluate the model using a precision-recall curve for different thresholds, where:

Precision (P) = True Positives / (True Positives + False Positives)
Recall (R) = True Positives / (True Positives + False Negatives)
F1-Score = 2 * (P * R) / (P + R)

From this , we can see that the precision-recall of class=1 (bots) is only 0.44. This implies that our model is merely predicting most classes to be humans, and is not actually very accurate at predicting bots. We can further use another metric F1-Score that calculates the harmonic mean of the precision and recall.

Since our model has a high rate of false negatives because we are predicting bots as humans, the recall value is lower, and hence the F1-Score captures that. This makes precision-recall and a plot of precision vs. recall and summary measures useful tools for binary classification problems that have an imbalance in the observations for each class. Hence we can see that our mode is not performing well due to the class imbalance and achieves a poor F1-Score of only 0.437.

Handling Class Imbalance

Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k=5). A randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space. The approach is effective because new synthetic examples from the minority class are created that are plausible, that is, are relatively close in feature space to existing examples from the minority class.

We can see that our new training dataset now has 3820 rows instead of only 2013 because SMOTE has upsampled the minority class to become balanced.

Modeling II

Finally, plotting the feature importance scores, we see that the top 5 features are as follows:

Mean bids per auction
Mean time between bids
Mean bids per URL
Mean time between bids per auction
Mean bids per device

Hence this shows that our engineered features play critical role in achieving a good performance, and that it is important that we handle class imblance well. Thats all for this project folks, hope you learned some good ML techniques in the process!

Airbnb Country Bookings

Project Type: Kaggle Competition
Skills: Classification, XGBoost, Grid Search, Feature Engineering, Model Explainability
Tech Stack: Python, Pandas, Numpy, Scikit-Learn, Matplotlib, Shap, Partial Dependence

Exploratory Data Analysis

This project tackles a Kaggle Competition by Airbnb to predict which country a new user will make their first booking.The training set consists of over 200K rows of unique users with features such as user activity, date of first booking, date of account creation etc.

Our target variable is country_destination which consists of 12 classes, one of which is NDF which means no country/booking was seleted. We see that while, more than half of users actually never even book a country at all, the classes are not too imbalanced in my opinion, with the exception of NL, AU and PT.

We also see that there are some null, categorical and time variables which we should deal with in data preprocessing using the following strategy:

Missing
- date_first_booking --> replace with 0
- age --> fill with mean age
- first_affiliate_tracked --> encode as a class
Categorical
- gender, signup_method, signup_flow, language, country
- affiliate_channel, affiliate_provider, first_browser
- first_affiliate_tracked, signup_app, first_device_type
Time
- date_account_created
- date_first_booking
- timestamp_first_activ
- split into individual

First we fill the NaN values in date_first_bookingso that we can perform some arbitrary calculations later. Next we also fill NaN values of age with the mean age and those with faulty age data of 2014 years. This is followed by converting the string dates into datetime objects.

Let's try to visualize the time series of countries booked for each month using a rolling stacked bar chart over time. We can generally see that the number of bookings increased over time and peaked mid 2014, after which it dropped significantly and continued decreasing over time. We also see that majority of bookings are for the US, followed by AU and FR.

Finally we perform count categorical encoding on the dataset to convert all non-numeric values into numeric ones.

Next let us explore the sessions dataset, which consists of 10 million rows of user activity. Unfortunately there are no timestamps on each of the users actions, only the duration of the action secs_elapsed.

Using the data preprocessing strategy below, we perform count encoding on the action types and remove rows where user_id is null since we cannot join them to our training set. We also see there are some null values.

Categorical values
- action
- action_type
- action_detail
- device_type
Missing Values
- user_id --> remove rows as cannot join
- action --> categorically encode as "no action"
- action_type --> categorically encode as "no action"
- action_detail --> categorically encode as "no action"
- secs_elapsed --> fill it with 0s (mean is also possible)

Further, only 135K out of 200K users actually did some kind of action on the website, the rest did not and so the will not have alot of the features.

Feature Engineering

The following features were generated as I felt they were useful. For the users table, I decided to break down individual dates.

Users:

extract day/month/year from the dates

Sessions:

Number of actions taken
Count/mean/min/max/median of unique Actions, ActionTypes, ActionDetails, Devices
Sum/mean/min/max/median/s.d./skewness/kurtosis of seconds_elapsed

Let us now compute features for secs_elaposed for each action, where we can compute the standard stuff like mean, median, max and standard deviation. min turned out to be all zeros so we can skip it. We can also compute the skewness and kurtosis.

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers. A uniform distribution would be the extreme case.

Let us finish computing the date variables

Model Building

Model Explainability

1. Feature Importance & Selection

Permutation importance is calculated after a model has been fitted. So we won't change the model or change what predictions we'd get for a given value of height, sock-count, etc.

Instead we will ask the following question: If I randomly shuffle a single column of the validation data, leaving the target and all other columns in place, how would that affect the accuracy of predictions in that now-shuffled data?

Here we can see that our top 3 features are age, median time spent for each action and the day that the account was created.

However do we really need so many features? Lets do some feature selection to reduce the dimensionality of our dataset. We could use something like PCA but I wanted to compare the results of using the top 15 features. Here we can see that the model achieves pretty much the same performance using less than half of the features.This is a good practice to speed up computation and reduce the number of features needed during prediction and deployment.

2. Decision Tree Structure

Viewing the tree structure is a good way to explain the logic behind decision tree models. For each split, the model computes the entropy and information gain for each feature/node. Each node represents the feature that the model using and age is at the top because it contributes to the most information gain from the split.

Entropy - a measure of how much variance/classes the data has. Entropy = 0 indicates the dataset only has 1 class. Entropy = 1 indicates there are same number of classes in the dataset.
Information Gain - a measure to quantify the quality of a split using a feature. It is computed using Entropy(original) - Entropy(feature_split). The higher the information gain, the better the feature is at helping us predict between classes.

3. Partial Dependence Plots

While feature importance shows what variables most affect predictions, partial dependence plots show how a feature affects predictions as it increases in value. This is useful to help us answer questions like how does age affect the country destination booked?

From the plots below we see that the relationship between age and each of the 12 classes are being plotted. The y axis is interpreted as change in the prediction from what it would be predicted at the baseline or leftmost value. A blue shaded area indicates level of confidence.

For countries 0, 1, 5, 9, 10 - we see that as age increases, the probability of people booking the country increases up to ~40 years old. Hence we can conclude that middle-aged people prefer them.

For countries - 2, 3, 4, 6, 7, 8 - we see that as age increases, the probability of people booking the country decreases. Hence we can conclude that younger people prefer them.

4. SHAP Values

A model says a bank shouldn't loan someone money, and the bank is legally required to explain the basis for each loan rejection
A healthcare provider wants to identify what factors are driving each patient's risk of some disease so they can directly address those risk factors with targeted health interventions

SHAP values interpret the impact of having a certain value for a given feature in comparison to the prediction we'd make if that feature took some baseline value. Feature values causing increased predictions are in pink, and their visual size shows the magnitude of the feature's effect. Feature values decreasing the prediction are in blue. If you subtract the length of the blue bars from the length of the pink bars, it equals the distance from the base value to the output. Note that due to the many multiclass labels, the shap values were taking forever to plot but the code works.

Conclusion

This project was rather laborious as the modeling part and shap values took forever and it took me some time to figure some of the packages - some of which did not work for me as well. Overall I'm satisfied with the results but I feel they could have been better. If I had more time, here's what else I would do:

create more features such as entropy of sec_elapsed and # actions over time for each user
use SMOTE to achieve a more balanced ratio of classes
ensemble learning by combining several estimators