Building an artist recommender — Part I


The Problem

I‘m constantly listening to music. Most of the time, that‘s on Spotify — I‘m a big fan of their personalized Discover Weekly and Daily Mix. Although I appreciate the little amount of effort it now takes to "discover" new music, I want a bit more control over what type of music the recommender serves.

For the past 3 years I‘ve had the same top 10 songs — here‘s my list from 2018. Let‘s gloss over the fact that Shawn Mendes is #3… most of the songs here are what I use to focus while working. They‘re particularly effective at drowning out the noise around me, without having a heavy bass that would bother my coworkers.

Since I spend most of my waking hours focused working, Spotify mostly recommends artists that fit the non-lyrical, don‘t-break-my-flow genre. Oddly enough, this isn‘t really the type of music I want to discover… for fun.

Not only do I want to have a say in the type of music that I‘m being recommended, but I also want some control over the novelty of the recommendation. Occasionally, I have the time to dedicate an afternoon to listening to an unknown artist‘s entire discography (and later, drop this obscure music reference at a party 😎). Otherwise, I want to listen to artists who have a significant following but I‘m not familiar with. In a way, I want to turn the dials of the recommender to fit my mood.

Since I couldn‘t find a similar product on Spotify — and I desperately need more party tricks — I set out to build it myself. Early on, I knew a few things to be true:

Now to the crux

How would I determine what artists to recommend?

Well, I would first need a sense of the user‘s tastes. Maybe not everyone thinks Shawn Mendes is the next Justin Timberlake — everyone likes J.T. though. I‘ll use the user‘s most common artists across their public playlists as a proxy for their taste. Based on those artists, I can recommend new + similar artists.

New to the user

Most sensible playlists are curated around a theme and have similar songs and artists. If artists often show up in many playlists together, I can assume they are somewhat similar. My best bet is to analyze a ton of public playlists on Spotify — the more the merrier. I‘ll use this same approach to assign a novelty rank to an artist. By ranking artists based on how well connected they are, I can assume the top X% are popular artists.

Similar artists

One way I could approach this is by counting the times any pair of artists show up in the same playlists together. Since we're only looking at public data, our best bet is to analyze a ton of public playlists on Spotify.

By ranking artists based on how well connected they are, I can filter out the top X% so I don't recommend people well known artists.

The Data

I‘ll start out on a small set of data, and worry about scaling it later. I‘m going to limit the pool of playlists to all those created by the Spotify user account. Given that these are professionally curated by Spotify editors, and they get paid for this, I expect a high level of quality and consistency.

Using Spotify‘s playlist API, I can pull all the playlists I need from the US market. I chose to use Spotify's client credential flow to get a token.
        # client_credentials_flow.py
def create_token(client_path, secret_path):
  client = open(client_path).read().strip()
  secret = open(secret_path).read().strip()

  url = "https://accounts.spotify.com/api/token"
  data = { 'grant_type': 'client_credentials' }
  r = requests.post(url, data=data, auth=(client, secret)).json()
  print("this access token will expire in {0} minutes".format(r["expires_in"] / 60.0))
  token = r['access_token']

  return token
      

You can check out the actual code on GitHub, but here's the simplified version of an API query.

          #search.py
from collections import defaultdict
import requests
import client_credentials_flow # the one I wrote above

# we'll store our data in these dictionaries
nodes = defaultdict(dict)
edges = defaultdict(set)

user = "your-spotify-id" # the one that show's up in your playlist urls
path_to_client = "/the/path/to/client"
path_to_secret = "/the/path/to/secret"
token = client_credentials_flow.create_token(path_to_client, path_to_secret)
session = requests.Session()
headers = { "Authorization": "Bearer {0}".format(token) }
href = "https://api.spotify.com/v1/users/{0}/playlists?market=us&limit=50".format(user) # focus on the US market
request = session.get(href, headers=headers).json() # send request to spotify api
        

The request variable holds up to 50 public playlists for our user. Each of those playlists then has a reference to all of the tracks, which contain some of the artist information that we're after. I‘ll store every playlist + artist combination to use later.

          while "items" in request and loop:
  loop = True

  # loop through all playlists
  for playlist in request["items"]:
      pid = playlist["id"]

      # add playlist to nodes
      nodes[pid] = { "node_id": pid, "name": playlist["name"], "type": "playlist" }

      # get all tracks from this playlist
      tracks = session.get(playlist["tracks"]["href"], headers=headers).json()

      # pull tracks to see artists (there can be multiple)
      if "items" in tracks:
          for item in tracks["items"]:
              for artist in item["track"]["artists"]:
                  aid = artist["id"]
                  nodes[aid] = { "node_id": aid, "name": artist["name"], "type": "artist" }
                  edges[pid].update([aid])

  # move on to other playlists
  if request["next"]:
      request = session.get(request["next"], headers=headers).json()
  else:
      loop = False
        

At this point,I know which artists are connected to which playlists. Here's a visual using my own public playlists. Click on a playlist 🙌

With this information we can build a network graph between artists. The code looks something like this:

        # search.py
def artist_graph(self):
  artist_edges = collections.defaultdict(dict)
  for pid in self.edges:
      artist_set = self.edges[pid]
      for combo in itertools.combinations(artist_set, 2):
          self.nodes[combo[0]]["artists_connections"] += 1
          self.nodes[combo[1]]["artists_connections"] += 1

          if not (combo in artist_edges.keys()):
              artist_edges[combo] = { "num_playlists": 1 }
              self.nodes[combo[0]]["artists_degree"] += 1
              self.nodes[combo[1]]["artists_degree"] += 1
          else:
              artist_edges[combo]["num_playlists"] += 1

  result = []
  columns = ["this_index", "other_index", "degree", "connections", "weight", "num_playlists"]

  for combo in artist_edges:
      in_id = combo[0]
      out_id = combo[1]
      np = artist_edges[combo]["num_playlists"] * 1.0
      out_w = np / self.nodes[out_id]["artists_connections"]
      in_w = np / self.nodes[in_id]["artists_connections"]

      result.append(
          [
              self.nodes[in_id]["index"],
              self.nodes[out_id]["index"],
              self.nodes[in_id]["artists_degree"],
              self.nodes[in_id]["artists_connections"],
              in_w,
              np
          ])

      result.append(
          [
              self.nodes[out_id]["index"],
              self.nodes[in_id]["index"],
              self.nodes[out_id]["artists_degree"],
              self.nodes[out_id]["artists_connections"],
              out_w,
              np
          ])

  return pd.DataFrame(result, columns=columns)
      

Now I can pull down the same data for the Spotify user! I‘ll use this to build out the recommender. As of the date of this post, this is the composition of the Spotify public playlists (in the US market) — or at least what I could pull using their API:

Which artists show up in most playlists?

name playlist_count degree connections
Drake 100 2,455 5,778
Khalid 99 3,151 7,772
Beyoncé 96 2,398 5,731
Rihanna 93 2,350 5,382
Ariana Grande 92 2,178 5,940
Nicki Minaj 91 2,278 5,158
Sam Smith 85 2,540 5,637
Taylor Swift 82 2,311 5,535
Lady Gaga 81 2,092 4,613
Calvin Harris 79 2,628 5,679
Kanye West 77 1,773 4,256
Ed Sheeran 75 2,363 4,804
John Mayer 75 2,608 4,884
Marshmello 73 2,763 6,104
Ellie Goulding 72 2,131 5,403
JAY Z 71 1,775 3,844
The Weeknd 71 1,896 4,370
Post Malone 69 2,123 5,065
Avicii 69 2,168 5,093
The Chainsmokers 66 2,039 5,166

This tells us that Drake shows up in 100 of the 1,469 Spotify playlists (~7%), he's connected to 2,455 unique artists (i.e. degree), and has a total of 5,778 connections. To clarify, if Drake is in 20 playlists with Beyoncé, that is 20 connections but 1 single (-lady) degree.

There is something you need to know about the music industry: The 1% phenomenon is real. A small number of artists account for a majority of the connections. I ranked each artist based on the number of playlists they show up in. Take a look at how quickly the number of playlists per artist drops, as their rank decreases.

As you can see, out of 30K artists, 20K only show up in 1 playlist. This creates a bit of a problem; I want to recommend artists that are relatively unknown, but similar enough to artists that the user listens to. I can‘t consider an artist similar to another artist if they only show up in 1 playlist together. This will be a problem until I pull more playlists, so for now I‘m comfortable with recommending artists that are in the top ⅓. Scaling will be the topic of a future post.

The Model

I have enough data to start making recommendations, I need to give artists a novelty score. Remember, we want to give the user control of the novelty of a recommendation. My initial approach was to treat the artist-to-artist graph as an undirected weighted acyclic graph and ran PageRank. Let me break this piece down:

Turns out that the output of PageRank for an undirected graph converges to the centrality measure being used when calculating the transition matrix — intuitively obvious to the most casual observer. This fact simplifies the process significantly, and now I can simply rank the artists based on their connections.

The Recommendations

I‘m a big fan of Tom Misch — he‘s somehow able to find the perfect blend of Hip Hop and Jazz, and embed it into the smoothest guitar licks. Turns out, he‘s also the artist that shows up in most of my playlists. Knowing this, the recommender will try to serve me similar artists to Tom Misch.

738 — The number of total artists that show up in the 15 Spotify playlists that feature Mr. Misch.

 17 — Number of artists that show up in more than 1 playlists with Tom, my guy.

 10 — Artists left over when it cross-references my public playlists.

Here‘s that list of 10, showing the artist name, the number of playlists they have in common with T.M.❤️ , and whether or not they show up in the top X% of the artist pool.

other_name num_playlists top_5% top_10% top_25% top_50%
Emotional Oranges 5.0 False True True True
GoldLink 5.0 True True True True
The Internet 4.0 False True True True
PJ Morton 3.0 False True True True
VanJess 3.0 True True True True
CeeLo Green 3.0 True True True True
Steve Lacy 3.0 True True True True
YEBBA 3.0 False True True True
Nao 3.0 False False True True
Shay Lia 3.0 False True True True

Who we decide to recommend depends on the novelty that the user is looking for. We can avoid recommending well known artists by filtering out the top 5%, or maybe 10% of connected artists. From the above artists I can say that I know CeeLo Green from the radio, The Internet from my "cool" West Coast friends, and I‘m very familiar with GoldLink‘s "Herside Story," but that‘s about it.

Now I get to perform some biased user-testing on the remaining artists!

In future posts, I‘ll cover how to scale the recommender and best store recommended artists; as well as how to display recommendations interactively through the magic of web development.

Gustavo