Crash Course into Content Recommendation Algos

Mark A. C. Eggensperger
6 min read · Apr 9, 2021

Ever wonder how Pandora or Spotify can play the right songs, how Netflix and Hulu can recommend the right movies and TV shows to watch, or how Amazon knows what products to suggest? It’s all done by analyzing user behavior and building prediction models. This process existed long before the tech boom and is well demonstrated by this article here… It exists everywhere, and a good algo is key to a company’s success. TikTok is not a great app because it can play videos well; it is a great app because it can suck you into watching video after video. And for anyone using a dating app: you may not think you have a type, but you do, and Hinge and Bumble know exactly what it is.

So how do they do it? Well, it’s a mix of finding similar products, finding similar users and doing some crazy calculations.

The General Process

Recommendation engines work in a four-part process: collecting valuable data, storing it in an effective manner, analyzing it, and filtering the results down to what is relevant for each user. Let’s look more closely at each step.

Collection

This is collecting feedback from users; most commonly we see this as a thumbs up or down. Collecting it is easy for some apps. On a dating app, for example, you can’t move on to the next potential mating partner until you give feedback on the current one. For others it is more difficult. I, for one, have more fingers than Yelp and Amazon reviews written, and I cannot remember the last time I rated something on Netflix.

Storing

This is not as exciting a step, but it is crucial: how will you store the data you receive? SQL, NoSQL, object storage? All have their benefits and weaknesses, but what they have in common is that it would be extremely difficult to change midway through. A big consideration is whether you need to write data fast and frequently, or whether you need to read data fast and frequently and can tolerate slow writes.

Analyzing

This is the section that can make your head hurt, so let’s not get too deep here. In simple terms, each user is compared against every other user to determine how similar they are. Then, for each item of content, we compute a score by weighting other users’ feedback on it by how similar those users are to you.
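
To make the analysis step concrete, here is a minimal sketch in Python. It assumes thumbs up/down feedback stored as 1 and -1, and it uses cosine similarity to compare users; the user names, song names and the choice of cosine similarity are illustrative assumptions, not a description of what any particular company does.

```python
from math import sqrt

# Hypothetical feedback: 1 = thumbs up, -1 = thumbs down, missing = no feedback.
ratings = {
    "ana":  {"song_a": 1, "song_b": 1, "song_c": -1},
    "ben":  {"song_a": 1, "song_b": 1, "song_d": 1},
    "cara": {"song_a": -1, "song_c": 1, "song_d": -1},
}

def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dicts (missing items count as 0)."""
    shared = set(a) & set(b)
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def score_items(user, ratings):
    """Score each item the user has not rated by similarity-weighted feedback from others."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for item, value in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * value
    return scores

print(score_items("ana", ratings))
```

In this toy data ana agrees with ben and disagrees with cara, so ben liking song_d and cara disliking it both push song_d's score up for ana.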

Filtering

After the previous steps are complete, filtering picks out your top recommendations. Using statistical models you can attach a confidence score to each candidate and then recommend the best ones to the user. Think of the wise words of one Dwight K. Schrute: “First rule in road-side beet sales, put the most attractive beets on top. The ones that make you pull the car over and go, ‘Wow. I need this beet right now.’ Those are the money beets.”
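
Putting the money beets on top can be sketched as a simple sort and cut. The function below is a hypothetical helper that assumes the scores have already been computed by the analysis step; the item names, the score threshold and the cutoff `n` are made up for illustration.

```python
def top_recommendations(scores, already_seen, n=3, min_score=0.0):
    """Return up to n unseen items scoring above min_score, best first."""
    candidates = [
        (item, score)
        for item, score in scores.items()
        if item not in already_seen and score > min_score
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in candidates[:n]]

# Hypothetical scores from the analysis step.
scores = {"beet_a": 0.91, "beet_b": 0.15, "beet_c": 0.78, "beet_d": -0.2, "beet_e": 0.55}
print(top_recommendations(scores, already_seen={"beet_c"}, n=2))
```

Here beet_c is dropped because the user has already seen it, and beet_d is dropped because its score is below the threshold.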

Three Main Engine Types

Collaborative Filtering

Collaborative filtering is finding similar users. If I like X and you like X then we are similar, and therefore if I like Y then you should like Y too. This can be done on either a user-to-user or a product-to-product basis. It is the easiest approach for beginners and lets you tackle abstract products and offerings. Say I were to create a cocktail recommender. With collaborative filtering I don’t have to break down the ingredient mixture of each cocktail and the similarity of each ingredient. A cocktail with tequila and lime juice is very similar to a cocktail with mezcal and lemon juice, but since they share no common ingredients, it would be hard to recommend one based on the other from ingredients alone. With collaborative filtering, we can see that users tend to like both cocktails and therefore recommend one based on the other.
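
The cocktail idea can be sketched on a product-to-product basis. The snippet below never looks at ingredients; it only looks at which users liked which drink. The drink names, user ids and the use of cosine similarity over sets of users are assumptions made for this illustration.

```python
from math import sqrt

# Hypothetical data: the set of users who liked each cocktail.
likes = {
    "tequila_lime":     {"u1", "u2", "u3", "u4"},
    "mezcal_lemon":     {"u2", "u3", "u4"},
    "espresso_martini": {"u5", "u6"},
}

def item_similarity(users_a, users_b):
    """Cosine similarity between two items, each represented by the users who liked it."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / sqrt(len(users_a) * len(users_b))

def similar_items(item, likes):
    """Rank every other item by similarity to the given one, most similar first."""
    ranked = [
        (other, item_similarity(likes[item], users))
        for other, users in likes.items()
        if other != item
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

print(similar_items("tequila_lime", likes))
```

The two cocktails share no ingredients, yet they come out as the closest pair because mostly the same people like both.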

Content Based Filtering

This is recommending content or products because they are similar to other content, which requires breaking the content down to its genetic makeup. At Pandora, they use AI engines and musicologists to build a genetic makeup of every song by assigning a 1 to 10 rating on up to 450 factors. This is the extreme end, but it demonstrates how complex the approach can get. More commonly you will see AI used to analyze the description of a product or movie to determine how similar it is to another.
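
As a rough sketch of the factor-rating idea, each song below is described by a handful of 1 to 10 ratings, and the most similar song is the one with the smallest distance between rating vectors. The three factor names, the ratings and the choice of Euclidean distance are all assumptions for illustration; this is not Pandora's actual method.

```python
from math import sqrt

# Hypothetical factor ratings on a 1-10 scale (factor names are made up).
songs = {
    "song_a": {"tempo": 8, "acoustic": 2, "vocals": 7},
    "song_b": {"tempo": 7, "acoustic": 3, "vocals": 8},
    "song_c": {"tempo": 2, "acoustic": 9, "vocals": 3},
}

def distance(a, b):
    """Euclidean distance between two factor dicts that share the same keys."""
    return sqrt(sum((a[factor] - b[factor]) ** 2 for factor in a))

def most_similar(name, catalog):
    """Return the other item whose factor ratings are closest to the given item's."""
    others = [other for other in catalog if other != name]
    return min(others, key=lambda other: distance(catalog[name], catalog[other]))

print(most_similar("song_a", songs))
```

No user feedback is involved here at all: the recommendation comes purely from how the content itself is described.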

Hybrid Recommendations

This is the combination of both collaborative filtering and content based filtering. Netflix is a great example. By analyzing movies to find similar movies and then looking at what similar viewers are watching, they can create a more accurate prediction of content you would love. Finding movies that similar users like is helpful, but it does not work for a new movie that nobody has watched yet. When a new movie arrives, they can instead analyze its description against the movies you like and know who to recommend it to.
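
One simple way to combine the two signals is a weighted blend that leans on content similarity while an item is new and shifts toward collaborative scores as feedback accumulates. This is a sketch of that idea only; the function name, the linear weighting and the `full_trust_at` threshold are hypothetical, and the source does not say how Netflix actually blends its signals.

```python
def hybrid_score(collab_score, content_score, n_ratings, full_trust_at=50):
    """Blend two scores, trusting the collaborative score more as ratings accumulate."""
    weight = min(n_ratings / full_trust_at, 1.0)
    return weight * collab_score + (1 - weight) * content_score

# A brand-new movie has no ratings, so only the content score counts.
print(hybrid_score(collab_score=0.0, content_score=0.8, n_ratings=0))
# A well-rated movie relies entirely on what similar users thought.
print(hybrid_score(collab_score=0.9, content_score=0.2, n_ratings=200))
```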

Issues To Consider

Recommendation engines are great, but there are a few precautions you should consider when setting up your engine.

Synonymy

As we discussed in Content Based Filtering, using AI to break down the summary and find similar content can be very advantageous. But imagine somebody who is a fan of Lord of the Rings and watches all 3 movies religiously. Then a documentary comes out about jewelry belonging to medieval lords. They may find it somewhat interesting, but it is certainly not a fantasy action film and should not be recommended as one. This is actually a huge problem with basic engines. Adding more factors to analyze the content, such as genre, actors and box office take in this example, can help avoid the issue. The other way to avoid it is to use the hybrid method and check whether other fans of Lord of the Rings have enjoyed the documentary as well, which they likely haven’t.
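
The problem can be reproduced with a naive word-overlap comparison. In the sketch below the two descriptions are invented for illustration and share most of their words, so a text-only comparison rates them as quite similar; mixing in a genre factor pulls the score down. The Jaccard overlap measure and the `genre_weight` parameter are assumptions, not a prescribed fix.

```python
def tokens(text):
    """Split a description into a set of lowercase words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Share of words the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity(a, b, genre_weight=0.5):
    """Blend description overlap with an exact-match genre check."""
    text_sim = jaccard(tokens(a["description"]), tokens(b["description"]))
    genre_sim = 1.0 if a["genre"] == b["genre"] else 0.0
    return (1 - genre_weight) * text_sim + genre_weight * genre_sim

# Hypothetical catalog entries.
fantasy = {"description": "a lord seeks the ring of power", "genre": "fantasy"}
documentary = {"description": "the ring collection of a medieval lord", "genre": "documentary"}

print(similarity(fantasy, documentary, genre_weight=0.0))  # description only
print(similarity(fantasy, documentary))                    # description plus genre
```

Much of the overlap here comes from filler words like "a", "the" and "of", which is one more reason basic description matching needs extra factors before it can be trusted.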

Scalability, Latency and Processing

As your content store grows, your user base grows and the number of factors describing your content grows, the amount of analysis required grows much faster than any one of them. If each user is compared to every other user, the number of comparisons alone grows with the square of the number of users, and that is before considering each product as well. You are dealing with a large amount of data, which requires a lot of processing power and storage. This creates a big engineering decision: do you process data on demand, which calculates only small amounts at a time but can delay the user experience, or do you recalculate the entire database periodically, which produces less frequently updated results and takes up more storage?
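
To put a number on the growth: comparing every user with every other user once takes n(n-1)/2 comparisons for n users. The short calculation below shows how quickly that adds up; the user counts are arbitrary examples.

```python
def pairwise_comparisons(n_users):
    """Number of distinct user pairs when every user is compared with every other."""
    return n_users * (n_users - 1) // 2

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} users -> {pairwise_comparisons(n):>13,} comparisons")
```

Ten times more users means roughly a hundred times more comparisons, which is why the on-demand versus periodic decision matters so early.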

Sparsity of Responses

Everything is based on feedback. Dating apps have some of the best track records here, since each user is required to respond to the content (potential matches) before moving on to the next. But this requirement can be a hindrance to the user experience. Imagine if Pandora or Spotify required you to give a thumbs up or down on each song before moving on; users would hate the experience. This is where you have to go back to the collection step and get creative. To get around the lack of explicit user responses, here are some signals to use:

  • Search history: What did they search, what did they look at?
  • Purchase history: Did they buy or watch something?
  • Stickiness: Did they watch the whole movie, listen to the whole song?
  • Pausing: While scrolling through, did they take time to review it closely?
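
The signals above can be folded into a single implicit feedback score that stands in for a missing thumbs up. The sketch below is one hypothetical way to do it; the signal names and every weight are made-up placeholders that you would tune against your own data.

```python
# Hypothetical weights for yes/no signals; tune these against real data.
SIGNAL_WEIGHTS = {"searched": 0.2, "purchased": 1.0, "paused_on": 0.1}
COMPLETION_WEIGHT = 0.5  # applied to the fraction watched or listened to (0.0-1.0)

def implicit_score(events):
    """Turn one user's observed behavior toward one item into a single score."""
    score = sum(
        weight for signal, weight in SIGNAL_WEIGHTS.items() if events.get(signal)
    )
    score += COMPLETION_WEIGHT * events.get("completion", 0.0)
    return score

print(implicit_score({"purchased": True, "completion": 1.0}))
print(implicit_score({"searched": True, "paused_on": True, "completion": 0.1}))
```

A score like this can then be fed into the analysis step in place of explicit ratings.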

Conclusion

I hope this helped break down the theory of content recommendation engines. The actual calculations get pretty tricky, and that is an article for another day. Remember that engines operate on a four-step process: collection, storage, analysis and filtering. From there you can choose collaborative filtering, content based filtering or a hybrid of the two. Just make sure to keep in mind the issues of synonymy, scalability and sparsity. I hope you enjoyed this deep dive.
