BUILDING A RECOMMENDER SYSTEM USING PYSPARK WITH THE ALS ALGORITHM

5 min read · Nov 26, 2020

A recommender system is an information filtering tool that predicts which products a user is likely to prefer and, based on that, recommends a few products to the user. For example, Amazon can recommend new shopping items to buy, Netflix can recommend new movies to watch, and Google can recommend news that a user might be interested in. Two widely used approaches for building a recommender system are content-based filtering (CBF) and collaborative filtering (CF).

There are three main types of recommender systems:

Collaborative Filtering
Content-Based Filtering
Hybrid Recommendation Systems

Collaborative Filtering

Collaborative filtering methods are based on collecting and analyzing a large amount of information on users’ behaviors, activities or preferences and predicting what users will like based on their similarity to other users. A key advantage of the collaborative filtering approach is that it does not rely on machine-analyzable content, so it can accurately recommend complex items such as movies without requiring an “understanding” of the item itself. Many algorithms have been used to measure user similarity or item similarity in recommender systems, for example the k-nearest neighbor (k-NN) approach and the Pearson correlation.
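To make the idea of user similarity concrete, here is a minimal sketch of the Pearson correlation between users, followed by a nearest-neighbor lookup. The user names, movie ids and ratings are made up for illustration, and the helper names (pearson, nearest_neighbors) are hypothetical, not part of any library.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating}. All values are made up.
ratings = {
    "alice": {"m1": 5.0, "m2": 3.0, "m3": 4.0},
    "bob":   {"m1": 4.0, "m2": 2.0, "m3": 3.0, "m4": 5.0},
    "carol": {"m1": 1.0, "m2": 5.0, "m3": 2.0, "m4": 1.0},
}

def pearson(a, b):
    """Pearson correlation over the items both users rated; 0.0 if undefined."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def nearest_neighbors(target, k=1):
    """Return the k users most similar to `target` as (user, score), best first."""
    scores = [(pearson(ratings[target], ratings[u]), u)
              for u in ratings if u != target]
    scores.sort(reverse=True)
    return [(user, score) for score, user in scores[:k]]

# Bob rates the shared movies in the same pattern as Alice, Carol in the opposite one.
print(nearest_neighbors("alice", k=2))
```

In this toy data Bob is Alice's nearest neighbor, so a k-NN recommender would suggest to Alice the movies Bob rated highly that she has not seen yet (here, m4).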

Content-Based Filtering

Content-based filtering methods are based on a description of the item and a profile of the user’s preferences. In a content-based recommendation system, keywords are used to describe the items; in addition, a user profile is built to indicate the type of item this user likes. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended. This approach has its roots in information retrieval and information filtering research.
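The keyword-and-profile idea can be sketched in a few lines of plain Python. The movie titles and keywords below are invented for illustration, and Jaccard similarity is just one possible way to compare a profile with an item; the text above does not prescribe a specific measure.

```python
# Hypothetical item keywords (titles and keywords are made up).
item_keywords = {
    "Space Saga":    {"sci-fi", "space", "adventure"},
    "Robot Dawn":    {"sci-fi", "robots", "thriller"},
    "Love in Paris": {"romance", "drama", "paris"},
    "Star Voyage":   {"sci-fi", "space", "drama"},
}

def build_profile(liked_items):
    """User profile: the union of the keywords of the items the user liked."""
    profile = set()
    for item in liked_items:
        profile |= item_keywords[item]
    return profile

def jaccard(a, b):
    """Jaccard similarity of two keyword sets (size of overlap / size of union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(liked_items, top_n=2):
    """Rank the items the user has not liked yet by similarity to the profile."""
    profile = build_profile(liked_items)
    candidates = [i for i in item_keywords if i not in liked_items]
    scored = [(jaccard(profile, item_keywords[i]), i) for i in candidates]
    # Highest similarity first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored[:top_n]]

print(recommend(["Space Saga"]))
```

A user who liked "Space Saga" gets "Star Voyage" first, because it shares two of the three profile keywords, and "Robot Dawn" second.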

Hybrid Recommendation Systems

Recent research has demonstrated that a hybrid approach, combining collaborative filtering and content-based filtering, can be more effective in some cases. Hybrid approaches can be implemented in several ways: by making content-based and collaborative-based predictions separately and then combining them, by adding content-based capabilities to a collaborative-based approach (and vice versa), or by unifying the approaches into one model. Several studies empirically compare the performance of hybrid methods with pure collaborative and content-based methods and demonstrate that hybrid methods can provide more accurate recommendations than pure approaches. These methods can also be used to overcome some of the common problems in recommendation systems, such as cold start and sparsity.

Netflix is a good example of a hybrid system. They make recommendations by comparing the watching and searching habits of similar users (i.e. collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).
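The simplest of the hybrid strategies described above, making the two predictions separately and then combining them, can be sketched as a weighted average. The function name hybrid_score and the weight alpha are assumptions made for this illustration; real systems (including Netflix) use far more elaborate blending that is not described here.

```python
def hybrid_score(cf_score, cb_score, alpha=0.7):
    """Blend a collaborative (cf) and a content-based (cb) score.

    alpha is a hypothetical weight: 1.0 means pure collaborative filtering,
    0.0 means pure content-based filtering.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if cf_score is None:
        # Cold start: no collaborative signal yet, so fall back to content.
        return cb_score
    return alpha * cf_score + (1.0 - alpha) * cb_score

# A known user/item pair: both signals are available and get blended.
blended = hybrid_score(cf_score=4.0, cb_score=2.0, alpha=0.5)

# A brand-new item with no ratings yet: only the content score is used.
cold = hybrid_score(cf_score=None, cb_score=3.5)
print(blended, cold)
```

The fallback branch shows how a hybrid can soften the cold start problem: when there are no ratings to learn from, the content-based score still produces a usable prediction.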

# Before starting the notebook, let's get familiar with the data.

MOVIELENS_RATINGS.CSV

This dataset has three columns: movieId, rating and userId.

movieId is the id of the specific movie the user watched.

rating is the rating given by the user, from 1.0 to 5.0.

userId is the id of the particular user.

# We will run the code in Google Colab. Before starting, let's install the pyspark library.

!pip install pyspark

# Next, we create a SparkSession. It is the entry point for working with Spark and one of the basic steps before building the model.

from pyspark.sql import SparkSession

# Now we build the session and give the application a name.

spark = SparkSession.builder.appName('rec_sys').getOrCreate()

# Now we load the dataset into the session.

data = spark.read.csv('movielens_ratings.csv', inferSchema=True, header=True)
data.show(5)

# Now we call printSchema() to show the schema that Spark inferred for the data.

data.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- userId: integer (nullable = true)

# Now we call describe() to get summary statistics that make the data easier to understand.

data.describe().show()

# Build the recommendation model using ALS on the training data

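# The dataset is small, so we use an 80/20 train/test split.

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

# Randomly split the ratings into training (80%) and test (20%) sets
(training, test) = data.randomSplit([0.8, 0.2])
# coldStartStrategy="drop" is an addition to the original code: it drops test rows whose
# user or movie was not seen in training, so the metrics below do not come out as NaN
als = ALS(maxIter=2, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(training)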

# The parameters above configure the ALS model: userCol, itemCol and ratingCol name the input columns, maxIter sets the number of iterations and regParam sets the regularization strength.

# Then we apply the model to the test data with transform().
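To show what maxIter and regParam control, here is a minimal pure-Python sketch of alternating least squares with a single latent factor per user and per movie. It is a simplification for illustration, not Spark's implementation: Spark's ALS uses a higher rank by default and scales the regularization by the number of ratings. The toy ratings and the names als_rank1, predict and train_rmse are made up.

```python
from math import sqrt

# Hypothetical toy ratings as (user, movie, rating); user u3 has not rated i2.
toy_ratings = [
    ("u1", "i1", 1.0), ("u1", "i2", 2.0),
    ("u2", "i1", 2.0), ("u2", "i2", 4.0),
    ("u3", "i1", 1.5),
]

def als_rank1(ratings, max_iter=10, reg_param=0.01):
    """Alternating least squares with one latent factor per user and per item."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    for _ in range(max_iter):
        # Step 1: hold item factors fixed, solve each user's factor in closed form.
        for u in users:
            rated = [(i, r) for uu, i, r in ratings if uu == u]
            num = sum(r * item_f[i] for i, r in rated)
            den = reg_param + sum(item_f[i] ** 2 for i, _ in rated)
            user_f[u] = num / den
        # Step 2: hold user factors fixed, solve each item's factor the same way.
        for i in items:
            raters = [(u, r) for u, ii, r in ratings if ii == i]
            num = sum(r * user_f[u] for u, r in raters)
            den = reg_param + sum(user_f[u] ** 2 for u, _ in raters)
            item_f[i] = num / den
    return user_f, item_f

def predict(user_f, item_f, user, item):
    """Predicted rating is the product of the user factor and the item factor."""
    return user_f[user] * item_f[item]

def train_rmse(ratings, user_f, item_f):
    """Root-mean-square error on the observed ratings."""
    sq = [(r - predict(user_f, item_f, u, i)) ** 2 for u, i, r in ratings]
    return sqrt(sum(sq) / len(sq))

user_f, item_f = als_rank1(toy_ratings, max_iter=20, reg_param=0.01)
print("training RMSE:", train_rmse(toy_ratings, user_f, item_f))
print("predicted rating of u3 for i2:", predict(user_f, item_f, "u3", "i2"))
```

In this toy data every user rates i2 twice as high as i1, so the prediction for the missing (u3, i2) pair ends up close to 3.0. More iterations bring the factors closer to a fixed point, and a larger reg_param shrinks them toward zero, which reduces overfitting at the cost of some bias.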

predictions = model.transform(test)

# Further below, we look at a single user from the test set and compare the actual ratings with the model's predictions.

# First, we evaluate the model using the RMSE, R2 and MAE metrics.

evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))
evaluator = RegressionEvaluator(metricName="r2", labelCol="rating", predictionCol="prediction")
r2 = evaluator.evaluate(predictions)
print("r2 = " + str(r2))
evaluator = RegressionEvaluator(metricName="mae", labelCol="rating", predictionCol="prediction")
mae = evaluator.evaluate(predictions)
print("Mean absolute error = " + str(mae))
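For readers who want to see what the three metrics measure, here is a small pure-Python sketch that computes them by hand. The function name regression_metrics and the sample numbers are made up for illustration; they are not results from the model above.

```python
from math import sqrt

def regression_metrics(actual, predicted):
    """Compute RMSE, MAE and R2 for two equally long lists of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "rmse": sqrt(ss_res / n),                  # penalizes large errors more
        "mae": sum(abs(e) for e in errors) / n,    # average absolute error
        "r2": 1.0 - ss_res / ss_tot,               # share of variance explained
    }

# Made-up ratings and predictions, only to exercise the formulas.
metrics = regression_metrics([4.0, 3.0, 5.0, 2.0], [3.5, 3.0, 4.0, 3.0])
print(metrics)
```

Lower RMSE and MAE mean better predictions, while R2 is better the closer it gets to 1. An R2 below 0 means the model does worse than always predicting the mean rating.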

# Then we apply the model to a single user (userId 20) from the test set.

single_user = test.filter(test['userId'] == 20).select(['movieId', 'userId', 'rating'])
single_user.show(5)
recommendations = model.transform(single_user)
# Sort by predicted rating, highest first, so the best recommendations come on top
recommendations.sort('prediction', ascending=False).show(5)
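The ranking step can also be sketched without Spark. By default, Spark's ALS returns NaN as the prediction for users or movies it did not see during training, so the hypothetical helper below (top_n, with made-up ids and scores) filters those out before sorting from highest to lowest predicted rating.

```python
import math

def top_n(predictions, n=5):
    """Return the n (movie_id, predicted_rating) pairs with the highest prediction.

    NaN predictions (cold-start users or movies) are skipped.
    """
    valid = [(movie, score) for movie, score in predictions
             if not math.isnan(score)]
    return sorted(valid, key=lambda pair: pair[1], reverse=True)[:n]

# Made-up predictions for one user; movie 2 was unseen in training.
sample = [(1, 3.2), (2, float("nan")), (3, 4.7), (4, 1.1)]
print(top_n(sample, n=2))
```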
