k-NN Collaborative Filtering

The following notebook illustrates our k-NN collaborative filtering approach (used as our baseline model), which uses track co-occurrence in playlists to recommend tracks to playlists.

0. Motivation

In order to recommend relevant songs to playlists, it is natural to think about songs that are in a similar playlist but not currently in our target playlist.

Our k-NN collaborative filtering approach looks for playlists that share similar tracks (i.e., two playlists with a high number of overlapping tracks) and recommends songs that are in the similar playlists but not in the target playlist.

0.1 Strategy

To compute the prediction set for each playlist, our model does the following:

  1. The model finds k nearest neighbor playlists of our target playlist (measured by cosine distance between playlists).

  2. From the k-NN playlists, the model then builds the recommendation list by adding tracks that are in the neighbor playlists but not in the target playlist, starting from the most similar playlist. It stops once the list reaches the pre-determined length (test set size * 15 for each playlist).
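The two steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the function names, the toy playlists, and the set-based representation are all assumptions made for the example. For binary playlist vectors, cosine similarity reduces to the overlap size divided by the square root of the product of the two playlist lengths, and cosine distance is one minus that value.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two playlists given as sets of track IDs.

    For binary vectors this equals |a & b| / sqrt(|a| * |b|).
    """
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def recommend(target, playlists, k, n_rec):
    """Recommend up to n_rec tracks for `target` from its k nearest playlists."""
    # Step 1: rank the other playlists by similarity (highest first), keep k.
    neighbors = sorted(
        (pid for pid in playlists if pid != target),
        key=lambda pid: cosine_similarity(playlists[target], playlists[pid]),
        reverse=True,
    )[:k]

    # Step 2: walk the neighbors from most to least similar, collecting
    # tracks the target does not already contain, until the list is full.
    recommendations = []
    for pid in neighbors:
        for track in sorted(playlists[pid]):  # sorted only for determinism
            if track not in playlists[target] and track not in recommendations:
                recommendations.append(track)
                if len(recommendations) == n_rec:
                    return recommendations
    return recommendations


# Toy data (hypothetical): playlist ID -> set of track IDs.
playlists = {
    "p1": {"a", "b", "c"},
    "p2": {"a", "b", "d"},
    "p3": {"a", "e", "f"},
    "p4": {"x", "y"},
}
recs = recommend("p1", playlists, k=2, n_rec=3)
print(recs)
```

Here `n_rec` plays the role of the pre-determined list length; in the notebook's setup it would be set to the playlist's test set size * 15.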

1. Data Processing

1.1 Train-test split

We split the data into training and test sets (80-20), stratified by Playlistid. Stratifying ensures that playlists of different lengths each keep the intended train-test proportion.

train.shape, test.shape
((1970, 28), (616, 28))
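One way to picture a split stratified by Playlistid is to split each playlist's tracks separately. The sketch below uses only the standard library; the function name, the (playlist ID, track URI) row layout, and the toy data are assumptions for illustration, and the notebook may well have used a library routine instead.

```python
import random
from collections import defaultdict


def split_by_playlist(rows, test_frac=0.2, seed=0):
    """Split (playlist_id, track_uri) rows so each playlist is split ~80-20.

    Assumes every playlist has at least two tracks; a one-track playlist
    would end up entirely in the test set.
    """
    rng = random.Random(seed)
    by_playlist = defaultdict(list)
    for playlist_id, track_uri in rows:
        by_playlist[playlist_id].append(track_uri)

    train, test = [], []
    for playlist_id, tracks in by_playlist.items():
        tracks = tracks[:]  # copy so the grouped data is not shuffled in place
        rng.shuffle(tracks)
        n_test = max(1, round(len(tracks) * test_frac))  # at least one test track
        test += [(playlist_id, t) for t in tracks[:n_test]]
        train += [(playlist_id, t) for t in tracks[n_test:]]
    return train, test


# Toy data (hypothetical): one playlist with 10 tracks, one with 20.
rows = [(pid, f"track_{pid}_{i}") for pid, n in [(1, 10), (2, 20)] for i in range(n)]
train_rows, test_rows = split_by_playlist(rows)
print(len(train_rows), len(test_rows))
```

Because the split happens inside each playlist, short and long playlists both contribute roughly 20% of their tracks to the test set.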

1.2 Binary Sparse Matrix

We used the training set to create a binary sparse matrix with 100 playlists (rows) and 1534 unique songs (columns). Each entry marks whether a track is in the playlist (1) or not (0). The matrix is sparse because a playlist in our dataset has at most 350 songs, while the matrix has 1534 unique songs.

We then transformed the matrix to a Compressed Sparse Row matrix from scipy and fit a 5-nearest-neighbor model (using cosine as distance metric and brute-force search).
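A minimal sketch of this step on toy data, assuming pandas, scipy, and scikit-learn are available. The column names follow the Playlistid and Track_uri labels visible in the notebook output; the toy values, the use of `pd.crosstab`, and the variable names other than `co_mat` are assumptions, since the original construction code is not shown here.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the training set (hypothetical values).
train = pd.DataFrame({
    "Playlistid": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    "Track_uri": ["a", "b", "c", "a", "b", "d", "a", "e", "x", "y"],
})

# Binary playlist-by-track matrix: 1 if the track is in the playlist, else 0.
# clip(upper=1) guards against a track appearing twice in one playlist.
co_mat = pd.crosstab(train["Playlistid"], train["Track_uri"]).clip(upper=1)

# Compressed Sparse Row form, then a brute-force cosine k-NN model.
# The text fits 5 neighbors; 2 is used here because the toy data is tiny.
X = csr_matrix(co_mat.values.astype(float))
knn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
knn.fit(X)

# Ask for one extra neighbor: the closest match of a training row is itself.
query = csr_matrix(co_mat.loc[[1]].values.astype(float))
dist, idx = knn.kneighbors(query, n_neighbors=3)
neighbor_ids = co_mat.index[idx[0]].tolist()
print(neighbor_ids)
```

When the query playlist is itself part of the fitted matrix, its own row comes back first with distance 0, so it has to be dropped before building recommendations.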

co_mat.shape
(100, 1534)

The first few rows of the matrix are shown below.

co_mat.head()
Track_uri spotify:track:00LfFm08VWeZwB0Zlm24AT spotify:track:00qOE7OjRl0BpYiCiweZB2 spotify:track:01a0J96fRD91VnjQQUCqMK spotify:track:01iyCAUm8EvOFqVWYJ3dVX spotify:track:027h5P3kCyktHv9dpHUBBS spotify:track:02M6vucOvmRfMxTXDUwRXu spotify:track:03L2AoiRbWhvt7BDMx1jUB spotify:track:03LpkqucyYKcYclDs8HuxO spotify:track:03fT3OHB9KyMtGMt2zwqCT spotify:track:03tqyYWC9Um2ZqU0ZN849H ... spotify:track:7yFMhCJOsH7khgpdnyrZAZ spotify:track:7yHEDfrJNd0zWOfXwydNH0 spotify:track:7ySUcLPVX7KudhnmNcgY2D spotify:track:7yfg0Eer6UZZt5tZ1XdsWz spotify:track:7yq4Qj7cqayVTp3FF9CWbm spotify:track:7yyRTcZmCiyzzJlNzGC9Ol spotify:track:7zBPzAjKAqQpcv8F8GCq5s spotify:track:7zWj09xkFgA9tcV6YhfU6q spotify:track:7zbq8RT5Kd3ExOGVTiUQbR spotify:track:7zxRMhXxJMQCeDDg0rKAVo
Playlistid
430 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
622 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1990 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2259 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
2535 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 1534 columns

2. Model Performance

2.1 Metrics

We evaluated our model with the following metrics, based on the Spotify RecSys rules: R-Precision and NDCG.
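The results table in the next section reports R-Precision and NDCG. The sketch below shows simplified, track-level versions of both in plain Python. It is an approximation under stated assumptions: the function names and toy lists are hypothetical, relevance is binary, and the NDCG discount is the textbook 1 / log2(rank + 1) form, which may differ in detail from the variant defined in the challenge rules.

```python
from math import log2


def r_precision(recommended, relevant):
    """Fraction of relevant tracks found in the top-|relevant| recommendations."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    top = recommended[: len(relevant)]
    return len(set(top) & relevant) / len(relevant)


def ndcg(recommended, relevant):
    """Binary-relevance NDCG with a 1 / log2(rank + 1) discount (assumed form)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    # Discounted gain of the hits, by their 1-based rank in the list.
    dcg = sum(
        1.0 / log2(rank + 1)
        for rank, track in enumerate(recommended, start=1)
        if track in relevant
    )
    # Ideal ordering: all relevant tracks ranked at the top of the list.
    ideal_hits = min(len(relevant), len(recommended))
    idcg = sum(1.0 / log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Toy example (hypothetical): ranked recommendations vs. held-out test tracks.
recommended = ["d", "e", "f", "g"]
held_out = ["e", "z"]
rp = r_precision(recommended, held_out)
score = ndcg(recommended, held_out)
print(rp, round(score, 4))
```

R-Precision only looks at as many top recommendations as there are held-out tracks, while NDCG also rewards placing the hits nearer the top of the list.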

2.2 Model Test-Set Performance

Data                        R-Precision   NDCG       Average of the two metrics
Test set of 100 playlists   0.077025      0.080346   0.078685

3. Conclusion

We achieved an R-precision of about 0.077 and an NDCG of about 0.080 with our baseline k-NN collaborative filtering model. The current model only considers track co-occurrence between playlists; we are curious to know whether it will improve if audio features are added.

https://github.com/phoebewong/spotify-teamNPK/blob/master/src