Skip to content

Million Song Dataset

Million Song Dataset also known as Echo Nest Taste Profile Subset is a part of MSD, which contains play history of songs. Other datasets, such as preprocessed song features can be found at dataset site.

Stats

  • 1,019,318 unique users
  • 384,546 unique songs
  • 48,373,586 user-song-play count triplets

Extra parameters

  • merge_kaggle_splits=True

    In MSD Challenge on Kaggle there were public and private parts. By default they are merged together.

  • drop_mismatches=True

    There is a matching error between track ids and song ids in MSD. It shouldn't matter if you don't use audio features, but by default these items are removed.

Warning

Dataset is quite big and loads for about 5 minutes first time, taking about 1.2GB RAM. If you use default recommended parameters, processed data is saved to disk so that consequent loads take about 30 seconds.

Example

from rs_datasets import MillionSongDataset
msd = MillionSongDataset()
msd.info()
train
                                    user_id             item_id  play_count
0  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAKIMP12A8C130995           1
1  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAPDEY12A81C210A9           1
2  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBBMDR12A8C13253B           2

val
                                    user_id             item_id  play_count
0  0007140a3796e901f3190f12e9de6d7548d4ac4a  SONVMBN12AC9075271           1
1  0007140a3796e901f3190f12e9de6d7548d4ac4a  SOVIGZG12A6D4FB188           1
2  0007140a3796e901f3190f12e9de6d7548d4ac4a  SOZGXYF12AB0185579           2

test
                                    user_id             item_id  play_count
0  00007a02388c208ea7176479f6ae06f8224355b3  SOAITVD12A6D4F824B           3
1  00007a02388c208ea7176479f6ae06f8224355b3  SONZGLW12A6D4FBBC1           1
2  00007a02388c208ea7176479f6ae06f8224355b3  SOXNWYP12A6D4FBDC4           1