```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
from tqdm import tqdm
from IPython.display import clear_output
import matplotlib.pyplot as plt
%matplotlib inline
import os

# Load the 100-playlist subset of track metadata
subset100 = pd.read_csv("../raw_data/track_meta_100subset_new.csv")
```
ALS, or Alternating Least Squares, is a method for computing a matrix factorization of the user-item matrix. We used ALS to examine the underlying latent factors between playlists and tracks.
We first created a playlist-track relationship table in which each row is one (playlist, track) pair. We then trained our recommendation system over various latent dimensions and numbers of ALS iterations. Finally, we output the score of each individual track for every playlist.
To build our prediction set, we recommend the top k tracks with the highest latent similarity scores, where k is pre-determined to be 15 times the test set size.
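As a hypothetical illustration of that step, the sketch below scores every track for one playlist and keeps the top k. The factor matrices `U` (playlists x factors) and `V` (factors x tracks) are introduced below; the helper name `recommend` is ours, not part of the class we actually used.

```python
def recommend(U, V, playlist_idx, track_uris, k):
    # Latent similarity score of every track for this playlist
    scores = U[playlist_idx] @ V
    # Indices of the k highest-scoring tracks, best first
    top = np.argsort(scores)[::-1][:k]
    return [track_uris[j] for j in top]
```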
We did a stratified split of the data (by Playlistid) into training and test sets with an 80-20 ratio. Stratified splitting ensures that playlists of different lengths keep the right proportion of tracks in train and test.
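A minimal sketch of the split, assuming `subset100` holds one row per (Playlistid, Track_uri) pair; the `random_state` value is an arbitrary choice of ours.

```python
# Stratifying on Playlistid keeps roughly 80% of each playlist's
# tracks in the training set and 20% in the test set.
train, test = train_test_split(
    subset100, test_size=0.2, stratify=subset100.Playlistid, random_state=42
)
```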
```python
# Playlist-track co-occurrence matrix: one row per playlist, one column per track
co_mat = pd.crosstab(train.Playlistid, train.Track_uri)
co_mat = co_mat.clip(upper=1)  # binarize: 1 if the track appears in the playlist

# Flatten the matrix into one (playlist, track) relationship per row
res = []
for i, row in co_mat.iterrows():
    for track in row[row == 1].index.values:
        res.append((i, track))
res = pd.DataFrame(np.array(res), columns=['user', 'items'])
res.head()
```
| | user | items |
|---|---|---|
| 0 | 430 | spotify:track:0E1NL6gkv5aQKGNjJfBE3A |
| 1 | 430 | spotify:track:0OuPMjmicFfmnB3SFFqdgQ |
| 2 | 430 | spotify:track:0TIv1rjOG6Wbc02T4p3y7o |
| 3 | 430 | spotify:track:1CtOCnWYfIwVgIKiR2Lufw |
| 4 | 430 | spotify:track:1lNGwNQX4IrvDwgETwyPjR |
ALS is an iterative optimization process: with every iteration we move closer to a factorized representation of our original data.

We have our original matrix R of size u x i, holding our playlists and tracks. We want to decompose it into a matrix U of users (playlists) and hidden features, of size u x f, and a matrix V of hidden features and items (tracks), of size f x i. In U and V we have weights for how each user/item relates to each feature. We then compute U and V so that their product approximates R as closely as possible: R ≈ U x V.
Picture source: https://medium.com/radon-dev/als-implicit-collaborative-filtering-5ed653ba39fe
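For intuition, here is a minimal NumPy sketch of the alternating updates, not the `ALSRecommenderSystem` class we actually used below: holding one factor matrix fixed, each half-step solves a regularized least squares problem for the other in closed form. The regularization strength `reg` and the initialization scale are assumptions of ours.

```python
def als_factorize(R, f=20, iterations=30, reg=0.1, seed=0):
    """Factor R (u x i) into U (u x f) and V (f x i) so that R ≈ U @ V."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, f))
    V = rng.normal(scale=0.1, size=(f, n_items))
    I = np.eye(f)
    for _ in range(iterations):
        # Fix V, solve the regularized least squares problem for U
        U = np.linalg.solve(V @ V.T + reg * I, V @ R.T).T
        # Fix U, solve the regularized least squares problem for V
        V = np.linalg.solve(U.T @ U + reg * I, U.T @ R)
    return U, V
```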
```python
# Fit our ALS recommender (custom class) on the (playlist, track) pairs
rs = ALSRecommenderSystem(res, True, latent_dimension=20)
rs.train(0.0001, iterations=30)  # run 30 ALS iterations
```
We used the following metrics to evaluate our model, based on the Spotify RecSys Challenge rules:
| Data | R-Precision | NDCG | Average of the two metrics |
|---|---|---|---|
| Test set of 100 playlists | 0.239692 | 0.150403 | 0.195048 |
We achieved a 0.239 R-precision score and a 0.1504 NDCG score with our Alternating Least Squares model. Interestingly, while it greatly improves R-precision, it has an NDCG score similar to the other model approaches we tried, e.g., k-means clustering.
The NDCG score is not as high as we expected, since ALS considers similarity and suggests the songs with the highest similarity first. One reason we suspect is that some items have very high, nearly identical similarity scores (e.g., scores > 0.999), so the ranking of the suggestions may be less sensitive.
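For reference, below are minimal sketches of the two metrics as we understand the challenge definitions (binary relevance); `recommended` is the ordered list of suggested tracks and `relevant` is the set of held-out tracks for one playlist.

```python
def r_precision(recommended, relevant):
    """Fraction of the first |relevant| recommendations that are held-out tracks."""
    rel_set = set(relevant)
    top = recommended[:len(relevant)]
    return len(set(top) & rel_set) / len(relevant)

def ndcg(recommended, relevant):
    """Normalized discounted cumulative gain with binary relevance."""
    rel_set = set(relevant)
    dcg = sum(1.0 / np.log2(i + 2) for i, t in enumerate(recommended) if t in rel_set)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(rel_set), len(recommended))))
    return dcg / ideal if ideal > 0 else 0.0
```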