Data Science minor final project
Link to Shiny application zip. Includes:
- app.R - application
- w2v_90m - word2vec model trained on full dataset (3 files due to model size)
- artistsChoice.csv - list of artists (for app's input)
- models - word2vec models; (includes only the one trained on 10M)
- notebooks - notebooks
- slides - presentations
- other - code snippets and other stuff
- app -
shiny application🚧 --> https://vk.cc/9v83UO
Data: song playlists from russian SNs, including ~950K users and 90M user-item pairs in total.
Example:
user_id | song | artist |
---|---|---|
1 | Bohemian Rhapsody | Queen |
1 | The Immigrant Song | Led Zeppelin |
2 | LaBelle | Lady Marmalade |
2 | Non! Je Ne Regrette Rien | Edith Piaf |
2 | On était beau | Louane |
2 | Город | PRAVADA |
... | ... | ... |
968772 | Grand Piano | Nicki Minaj |
968772 | thank u, next | Ariana Grande |
Method: word2vec skip-gram
Idea: Each user's playlist is represented as a sentence, and if artists appear in the same playlists, they are similar and belong to the same context. The model takes artists (at least one) as an input, and recommends n artists.
The final model was trained on full dataset: approximately 90 000 000 user-item, ~ 950 000 users, time: 9 hours.
Case: something epic for the one who loves Game of Thrones
model_w2v.wv.most_similar('ramin djawadi', topn=10)
Model recommends other authors of soundtracks. Interesting case: soundtrack to 'the witcher 3 wild hunt' - 'The Trail' is quite similar to the TV series main song.
[('drake', 0.6897857189178467),
('hans zimmer', 0.774951577186584),
('ramin djawadi', 0.761010468006134),
('westworld', 0.747692465782166),
('the witcher 3 wild hunt', 0.722944676876068),
('daniel pemberton', 0.721352934837341),
('howard shore', 0.719944596290588),
('jeremy soule', 0.715254724025726),
('two steps from hell', 0.711450159549713),
('hans zimmer', 0.709886014461517),
('akihiro honda', 0.703315019607544)]
Case: the one who listens to placebo and radiohead and is not in a good mood
placebo_sim = [a[0].strip() for a in model_w2v.wv.most_similar(['placebo'], topn=15)]
placebo_sim
['iamx',
'arctic monkeys',
'blue october',
'franz ferdinand',
'radiohead',
'the smiths',
'the killers',
'him',
'the pretty reckless',
'30 seconds to mars',
'hypnogaja',
'my chemical romance',
'sea wolf',
'she wants revenge',
'stereophonics']
And who does not match there?
human prediction: 'my chemical romance' or 'franz ferdinand'
model:
print(model_w2v.wv.doesnt_match(placebo_sim))
my chemical romance