
[paper]: https://arxiv.org/abs/1612.01840
[FMA]: https://freemusicarchive.org
[WFMU]: https://wfmu.org
[Wikipedia]: https://en.wikipedia.org/wiki/Free_Music_Archive

The dataset is a dump of the [Free Music Archive (FMA)][FMA], an interactive
library of high-quality, legal audio downloads. Below is the abstract from the
[paper].

> We introduce the Free Music Archive (FMA), an open and easily accessible
> dataset which can be used to evaluate several tasks in music information
> retrieval (MIR), a field concerned with browsing, searching, and organizing
> large music collections. The community's growing interest in feature and
> end-to-end learning is however restrained by the limited availability of
> large audio datasets. By releasing the FMA, we hope to foster research which
> will improve the state-of-the-art and hopefully surpass the performance
> ceiling observed in e.g. genre recognition (MGR). The data is made of 106,574
> tracks, 16,341 artists, 14,854 albums, arranged in a hierarchical taxonomy of
> 161 genres, for a total of 343 days of audio and 917 GiB, all under
> permissive Creative Commons licenses. It features metadata like song title,
> album, artist and genres; user data like play counts, favorites, and
> comments; free-form text like description, biography, and tags; together with
> full-length, high-quality audio, and some pre-computed features. We propose
> a train/validation/test split and three subsets: a genre-balanced set of
> 8,000 tracks from 8 major genres, a genre-unbalanced set of 25,000 tracks
> from 16 genres, and a 98 GiB version with clips trimmed to 30s. This paper
> describes the dataset and how it was created, proposes some tasks like music
> classification and annotation or recommendation, and evaluates some baselines
> for MGR. Code, data, and usage examples are available at
> <https://github.com/mdeff/fma>.

This is a **pre-publication release**. As such, this repository, the paper,
and the data are subject to change. Stay tuned!

## Data


[fma_small.zip]: https://os.unil.cloud.switch.ch/fma/fma_small.zip
[fma_medium.zip]: https://os.unil.cloud.switch.ch/fma/fma_medium.zip
[fma_large.zip]: https://os.unil.cloud.switch.ch/fma/fma_large.zip
[fma_full.zip]: https://os.unil.cloud.switch.ch/fma/fma_full.zip

All metadata and features for all tracks are distributed in
**[fma_metadata.zip]** (342 MiB). The tables below can be loaded with [pandas]
or any other data analysis tool; a minimal loading sketch follows the list.
See the [paper] or the [usage] notebook for a complete description.

* `tracks.csv`: per-track metadata such as ID, title, artist, genres, tags and
play counts, for all 106,574 tracks.
* `genres.csv`: all 163 genre IDs with their name and parent (used to infer the
genre hierarchy and top-level genres).
* `features.csv`: common features extracted with [librosa].
* `echonest.csv`: audio features provided by [Echonest] (now [Spotify]) for
a subset of 13,129 tracks.

[pandas]: http://pandas.pydata.org/
[librosa]: https://librosa.github.io/librosa/
[spotify]: https://www.spotify.com/
[echonest]: http://the.echonest.com/
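
As a quick sanity check after extracting **[fma_metadata.zip]**, the tables
can be read with plain [pandas] calls. Below is a minimal sketch, not the
official loading code: the extraction path, the single-row-header assumption
(adjust `header=` if a file ships a multi-row header), and the convention that
top-level genres carry parent `0` are all assumptions; the [usage] notebook
has the authoritative version.

```python
import pandas as pd

# Assumed extraction path: ./fma_metadata/ (adjust to your setup).
tracks = pd.read_csv('fma_metadata/tracks.csv', index_col=0)
genres = pd.read_csv('fma_metadata/genres.csv', index_col=0)

print(len(tracks))  # expected: 106574 tracks
print(len(genres))  # expected: 163 genres

def top_level(genre_id):
    """Walk the genre hierarchy up to a top-level genre.

    Assumes genres.csv has a 'parent' column and that top-level genres
    are marked with parent 0, as inferred from the description above.
    """
    parent = genres.at[genre_id, 'parent']
    return genre_id if parent == 0 else top_level(parent)
```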

Then, MP3-encoded audio data is available in various sizes:

1. **[fma_small.zip]**: 8,000 tracks of 30 seconds, 8 balanced genres
(GTZAN-like) (7.2 GiB)
2. **[fma_medium.zip]**: 25,000 tracks of 30 seconds, 16 unbalanced genres (22
GiB)
3. **[fma_large.zip]**: 106,574 tracks of 30 seconds, 161 unbalanced genres (93
GiB)
4. **[fma_full.zip]**: 106,574 untrimmed tracks, 161 unbalanced genres (879
GiB) (pending hosting agreement)

**Downloads are unavailable for the moment, as the dataset is being updated.
Please come back in a few days.**

## Code

The following notebooks and scripts, stored in this repository, have been
developed for the dataset.

1. [usage]: shows how to load the datasets and develop, train and test your own
models with them.
2. [analysis]: exploration of the metadata, data and features.
3. [baselines]: baseline models for genre recognition, both from audio and
features.
4. [features]: feature extraction from the audio (used to create
`features.csv`; a minimal librosa sketch follows the link list below).
5. [webapi]: query the web API of the [FMA]. Can be used to update the dataset.
6. [creation]: creation of the dataset (used to create `tracks.csv` and
`genres.csv`).

[usage]: https://nbviewer.jupyter.org/github/mdeff/fma/blob/outputs/usage.ipynb
[analysis]: https://nbviewer.jupyter.org/github/mdeff/fma/blob/outputs/analysis.ipynb
[baselines]: https://nbviewer.jupyter.org/github/mdeff/fma/blob/outputs/baselines.ipynb
[features]: features.py
[webapi]: https://nbviewer.jupyter.org/github/mdeff/fma/blob/outputs/webapi.ipynb
[creation]: https://nbviewer.jupyter.org/github/mdeff/fma/blob/outputs/creation.ipynb
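
To give a flavor of what the [features] script computes, here is a minimal,
hypothetical sketch of a librosa feature-extraction pass. The clip path is
made up for illustration, and the exact feature set stored in `features.csv`
is the one defined in `features.py`, not this one.

```python
import librosa

# Load an MP3 clip (hypothetical path; 22050 Hz is librosa's default rate).
y, sr = librosa.load('path/to/some_clip.mp3', sr=22050, mono=True)

# Two of the usual librosa features for music analysis.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # timbre
chroma = librosa.feature.chroma_cens(y=y, sr=sr)    # harmony

print(mfcc.shape, chroma.shape)  # (20, n_frames) and (12, n_frames)
```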

## Installation

1. Download some data and verify its integrity. A Python alternative to
`sha1sum` is sketched after this list.
```sh
echo "e731a5d56a5625f7b7f770923ee32922374e2cbf fma_small.zip" | sha1sum -c -
echo "fe23d6f2a400821ed1271ded6bcd530b7a8ea551 fma_medium.zip" | sha1sum -c -
echo "f0df49ffe5f2a6008d7dc83c6915b31835dfe733 fma_metadata.zip" | sha1sum -c -
echo "ade154f733639d52e35e32f5593efe5be76c6d70 fma_small.zip" | sha1sum -c -
echo "c67b69ea232021025fca9231fc1c7c1a063ab50b fma_medium.zip" | sha1sum -c -
echo "497109f4dd721066b5ce5e5f250ec604dc78939e fma_large.zip" | sha1sum -c -
echo "0f0ace23fbe9ba30ecb7e95f763e435ea802b8ab fma_full.zip" | sha1sum -c -
```

2. Optionally, use [pyenv] to install Python 3.6 and create a [virtual
environment].

4. Install the Python dependencies from `requirements.txt`. Depending on your
usage, you may need to install [ffmpeg] or [graphviz]. Install [CUDA] if you
want to train neural networks on GPUs (see
[TensorFlow's instructions](https://www.tensorflow.org/install/)).
```sh
make install
```
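
If `sha1sum` is not available (e.g. on stock macOS or Windows), the same
integrity check from step 1 can be done with a few lines of Python; the
expected digests are the ones listed above, and the chunked read keeps memory
usage constant on multi-GiB archives.

```python
import hashlib

def sha1_of(path, chunk_size=1 << 20):
    """Compute the SHA-1 digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

assert sha1_of('fma_metadata.zip') == 'f0df49ffe5f2a6008d7dc83c6915b31835dfe733'
```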

## History

* 2017-05-05 pre-publication release
* paper: [arXiv:1612.01840v2](https://arxiv.org/abs/1612.01840v2)
* code: [git tag rc1](https://github.com/mdeff/fma/releases/tag/rc1)
* `fma_metadata.zip` sha1: `f0df49ffe5f2a6008d7dc83c6915b31835dfe733`
* `fma_small.zip` sha1: `ade154f733639d52e35e32f5593efe5be76c6d70`
* `fma_medium.zip` sha1: `c67b69ea232021025fca9231fc1c7c1a063ab50b`
* `fma_large.zip` sha1: `497109f4dd721066b5ce5e5f250ec604dc78939e`
* `fma_full.zip` sha1: `0f0ace23fbe9ba30ecb7e95f763e435ea802b8ab`

* 2016-12-06 beta release
* paper: [arXiv:1612.01840v1](https://arxiv.org/abs/1612.01840v1)
* code: [git tag beta](https://github.com/mdeff/fma/releases/tag/beta)
* `fma_small.zip` sha1: `e731a5d56a5625f7b7f770923ee32922374e2cbf`
* `fma_medium.zip` sha1: `fe23d6f2a400821ed1271ded6bcd530b7a8ea551`

## Contributing

Please open an issue or a pull request if you want to contribute. Let's try to
keep this repository the central place for everything around the dataset!
Links to related resources are welcome. I hope the community will like the
dataset and that we can keep it lively by evolving it toward people's needs.

## License & co

* Please cite our [paper] if you use our code or data.
* The code in this repository is released under the terms of the [MIT license](LICENSE.txt).
* The code in this repository is released under the terms of the
[MIT license](LICENSE.txt).
* The metadata is released under the terms of the
[Creative Commons Attribution 4.0 International License (CC BY 4.0)][ccby40].
* We do not hold the copyright on the audio and distribute it under the terms
of the license chosen by the artist.