From 57a813ca217c00813cad0e500f2da030cef6cc42 Mon Sep 17 00:00:00 2001 From: Stefan Jansen Date: Sun, 16 Aug 2020 15:49:48 -0400 Subject: [PATCH] readme update --- 13_unsupervised_learning/README.md | 142 +++++++++------ 14_working_with_text_data/README.md | 76 +++++--- 15_topic_modeling/README.md | 80 +++++---- 16_word_embeddings/README.md | 107 ++++++++---- 17_deep_learning/README.md | 112 +++++++++--- 18_convolutional_neural_nets/README.md | 163 +++++++++++++----- 19_recurrent_neural_nets/README.md | 99 +++++++---- .../README.md | 147 ++++++++-------- 21_gans_for_synthetic_time_series/README.md | 148 ++++++++++------ 22_deep_reinforcement_learning/README.md | 121 ++++++++----- 23_next_steps/README.md | 86 ++++++++- 24_alpha_factor_library/README.md | 76 ++++++++ README.md | 4 + 13 files changed, 931 insertions(+), 430 deletions(-) create mode 100644 24_alpha_factor_library/README.md diff --git a/13_unsupervised_learning/README.md b/13_unsupervised_learning/README.md index 0dae4343e..af065e9cc 100644 --- a/13_unsupervised_learning/README.md +++ b/13_unsupervised_learning/README.md @@ -7,39 +7,45 @@ Dimensionality reduction and clustering are the main tasks for unsupervised lear - Clustering algorithms identify and group similar observations or features instead of identifying new features. Algorithms differ in how they define the similarity of observations and their assumptions about the resulting groups. 
More specifically, this chapter covers: -- how principal and independent component analysis perform linear dimensionality reduction -- how to apply PCA to identify risk factors and eigen portfolios from asset returns -- how to use non-linear manifold learning to summarize high-dimensional data for effective visualization -- how to use T-SNE and UMAP to explore high-dimensional alternative image data -- how k-Means, hierarchical, and density-based clustering algorithms work -- how to apply agglomerative clustering to build robust portfolios according to hierarchical risk parity +- How principal and independent component analysis (PCA and ICA) perform linear dimensionality reduction +- Identifying data-driven risk factors and eigenportfolios from asset returns using PCA +- Effectively visualizing nonlinear, high-dimensional data using manifold learning +- Using t-SNE and UMAP to explore high-dimensional image data +- How k-means, hierarchical, and density-based clustering algorithms work +- Using agglomerative clustering to build robust portfolios with hierarchical risk parity ## Content -1. [Dimensionality reduction](#dimensionality-reduction) - * [The curse of dimensionality](#the-curse-of-dimensionality) - * [Linear Dimensionality Reduction](#linear-dimensionality-reduction) - * [PCA](#pca) - - [Code Example](#code-example) - - [References](#references) - * [PCA for Algorithmic Trading ](#pca-for-algorithmic-trading-) - - [References](#references-2) - * [ICA](#ica) - * [Manifold Learning](#manifold-learning) - - [Data](#data) - * [Local Linear Embedding](#local-linear-embedding) - - [References](#references-3) +1. [Code Example: the curse of dimensionality](#code-example-the-curse-of-dimensionality) +2.
[Linear Dimensionality Reduction](#linear-dimensionality-reduction) + * [Code Example: Principal Component Analysis](#code-example-principal-component-analysis) + - [Visualizing key ideas behind PCA ](#visualizing-key-ideas-behind-pca-) + - [How the PCA algorithm works](#how-the-pca-algorithm-works) + * [References](#references) +3. [Code Examples: PCA for Trading ](#code-examples-pca-for-trading-) + * [Data-driven risk factors](#data-driven-risk-factors) + * [Eigenportfolios](#eigenportfolios) + * [References](#references-2) +4. [Independent Component Analysis](#independent-component-analysis) +5. [Manifold Learning](#manifold-learning) + * [Code Example: what a manifold looks like ](#code-example-what-a-manifold-looks-like-) + * [Code Example: Local Linear Embedding](#code-example-local-linear-embedding) + * [References](#references-3) +6. [Code Examples: visualizing high-dimensional image and asset price data with manifold learning](#code-examples-visualizing-high-dimensional-image-and-asset-price-data-with-manifold-learning) * [t-distributed stochastic neighbor embedding (t-SNE)](#t-distributed-stochastic-neighbor-embedding-t-sne) * [UMAP](#umap) - - [Code Examples](#code-examples) -2. [Cluster Algorithms](#cluster-algorithms) - * [k-Means](#k-means) - * [Hierarchical Clustering](#hierarchical-clustering) - * [Density-Based Clustering](#density-based-clustering) - * [Gaussian Mixture Models](#gaussian-mixture-models) - * [Hierarchical Risk Parity](#hierarchical-risk-parity) - - [References](#references-4) - +7. 
[Cluster Algorithms](#cluster-algorithms) + * [Code example: comparing cluster algorithms](#code-example-comparing-cluster-algorithms) + * [Code example: k-Means](#code-example-k-means) + - [The algorithm](#the-algorithm) + - [Evaluating the results](#evaluating-the-results) + * [Code example: Hierarchical Clustering](#code-example-hierarchical-clustering) + * [Code example: Density-Based Clustering](#code-example-density-based-clustering) + * [Code example: Gaussian Mixture Models](#code-example-gaussian-mixture-models) + * [Code example: Hierarchical Risk Parity](#code-example-hierarchical-risk-parity) + - [The algorithm](#the-algorithm-2) + - [Backtest comparison with alternatives](#backtest-comparison-with-alternatives) + * [References](#references-4) ## Code Example: the curse of dimensionality @@ -81,22 +87,34 @@ The notebook [the_math_behind_pca](01_linear_dimensionality_reduction/02_the_mat - [Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions](http://users.cms.caltech.edu/~jtropp/papers/HMT11-Finding-Structure-SIREV.pdf), N. Halko†, P. G. Martinsson, J. A. Tropp, SIAM REVIEW, Vol. 53, No. 2, pp. 217–288 - [Relationship between SVD and PCA. How to use SVD to perform PCA?](https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca), excellent technical CrossValidated StackExchange answer with visualization -## PCA for Trading +## Code Examples: PCA for Trading PCA is useful for algorithmic trading in several respects, including the data-driven derivation of risk factors by applying PCA to asset returns, and the construction of uncorrelated portfolios based on the principal components of the correlation matrix of asset returns. 
-### Code Example: PCA and risk factor models +### Data-driven risk factors + In [Chapter 07 - Linear Models](../07_linear_models/02_fama_macbeth.ipynb), we explored risk factor models used in quantitative finance to capture the main drivers of returns. These models explain differences in returns on assets based on their exposure to systematic risk factors and the rewards associated with these factors. In particular, we explored the Fama-French approach that specifies factors based on prior knowledge about the empirical behavior of average returns, treats these factors as observable, and then estimates risk model coefficients using linear regression. An alternative approach treats risk factors as latent variables and uses factor analytic techniques like PCA to simultaneously estimate the factors and how the drive returns from historical returns. + - The notebook [pca_and_risk_factor_models](01_linear_dimensionality_reduction/03_pca_and_risk_factor_models.ipynb) demonstrates how this method derives factors in a purely statistical or data-driven way with the advantage of not requiring ex-ante knowledge of the behavior of asset returns. +### Eigenportfolios + +Another application of PCA involves the covariance matrix of the normalized returns. The principal components of the correlation matrix capture most of the covariation among assets in descending order and are mutually uncorrelated. Moreover, we can use standardized principal components as portfolio weights. + +The notebook [pca_and_eigen_portfolios](01_linear_dimensionality_reduction/04_pca_and_eigen_portfolios.ipynb) illustrates how to create Eigenportfolios. + ### References - [Characteristics Are Covariances: A Unified Model of Risk and Return](http://www.nber.org/2018LTAM/kelly.pdf), Kelly, Pruitt and Su, NBER, 2018 - [Statistical Arbitrage in the U.S. 
Equities Market](https://math.nyu.edu/faculty/avellane/AvellanedaLeeStatArb20090616.pdf), Marco Avellaneda and Jeong-Hyun Lee, 2008 -### Independent Component Analysis +## Independent Component Analysis + +Independent component analysis (ICA) is another linear algorithm that identifies a new basis to represent the original data but pursues a different objective than PCA. See [Hyvärinen and Oja](https://www.sciencedirect.com/science/article/pii/S0893608000000265) (2000) for a detailed introduction. + +ICA emerged in signal processing, and the problem it aims to solve is called blind source separation. It is typically framed as the cocktail party problem where a given number of guests are speaking at the same time so that a single microphone would record overlapping signals. ICA assumes there are as many different microphones as there are speakers, each placed at different locations so that it records a different mix of the signals. ICA then aims to recover the individual signals from the different recordings. - [Independent Component Analysis: Algorithms and Applications](https://www.sciencedirect.com/science/article/pii/S0893608000000265), Aapo Hyvärinen and Erkki Oja, Neural Networks, 2000 - [Independent Components Analysis](http://cs229.stanford.edu/notes/cs229-notes11.pdf), CS229 Lecture Notes, Andrew Ng @@ -105,29 +123,33 @@ In particular, we explored the Fama-French approach that specifies factors based - [The Prediction Performance of Independent Factor Models](http://www.cs.cuhk.hk/~lwchan/papers/icapred.pdf), Chan, In: proceedings of the 2002 IEEE International Joint Conference on Neural Networks - [An Overview of Independent Component Analysis and Its Applications](http://www.informatica.si/ojs-2.4.3/index.php/informatica/article/download/334/333), Ganesh R. 
Naik, Dinesh K Kumar, Informatica 2011 -### Manifold Learning +## Manifold Learning The manifold hypothesis emphasizes that high-dimensional data often lies on or near a lower-dimensional non-linear manifold that is embedded in the higher dimensional space. -The notebook [manifold_learning_intro](02_manifold_learning/01_manifold_learning_intro.ipynb) contains several exampoles, including the two-dimensional swiss roll that illustrates such a topological structure. [Manifold learning](https://scikit-learn.org/stable/modules/manifold.html) aims to find the manifold of intrinsic dimensionality and then represent the data in this subspace. A simplified example uses a road as one-dimensional manifolds in a three-dimensional space and identifies data points using house numbers as local coordinates. - -#### Data +[Manifold learning](https://scikit-learn.org/stable/modules/manifold.html) aims to find the manifold of intrinsic dimensionality and then represent the data in this subspace. A simplified example uses a road as a one-dimensional manifold in a three-dimensional space and identifies data points using house numbers as local coordinates. -This section uses the following datasets: +### Code Example: what a manifold looks like -- [MNIST Data](http://yann.lecun.com/exdb/mnist/) -- [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) +The notebook [manifold_learning_intro](02_manifold_learning/01_manifold_learning_intro.ipynb) contains several examples, including the two-dimensional swiss roll that illustrates the topological structure of manifolds. -### Local Linear Embedding +### Code Example: Local Linear Embedding Several techniques approximate a lower dimensional manifold. One example is [locally-linear embedding](https://cs.nyu.edu/~roweis/lle/) (LLE) that was developed in 2000 by Sam Roweis and Lawrence Saul.
- The notebook [manifold_learning_lle](02_manifold_learning/02_manifold_learning_lle.ipynb) demonstrates how it ‘unrolls’ the swiss roll. For each data point, LLE identifies a given number of nearest neighbors and computes weights that represent each point as a linear combination of its neighbors. It finds a lower-dimensional embedding by linearly projecting each neighborhood on global internal coordinates on the lower-dimensional manifold and can be thought of as a sequence of PCA applications. +- The notebook [manifold_learning_lle](02_manifold_learning/02_manifold_learning_lle.ipynb) demonstrates how it ‘unrolls’ the swiss roll. For each data point, LLE identifies a given number of nearest neighbors and computes weights that represent each point as a linear combination of its neighbors. It finds a lower-dimensional embedding by linearly projecting each neighborhood on global internal coordinates on the lower-dimensional manifold and can be thought of as a sequence of PCA applications. + +The generic examples use the following datasets: + +- [MNIST Data](http://yann.lecun.com/exdb/mnist/) +- [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) -#### References +### References - [Locally Linear Embedding](https://cs.nyu.edu/~roweis/lle/), Sam T. Roweis and Lawrence K. Saul (LLE author website) +## Code Examples: visualizing high-dimensional image and asset price data with manifold learning + ### t-distributed stochastic neighbor embedding (t-SNE) [t-SNE](https://lvdmaaten.github.io/tsne/) is an award-winning algorithm developed in 2010 by Laurens van der Maaten and Geoff Hinton to detect patterns in high-dimensional data. It takes a probabilistic, non-linear approach to locating data on several different, but related low-dimensional manifolds. 
The algorithm emphasizes keeping similar points together in low dimensions, as opposed to maintaining the distance between points that are apart in high dimensions, which results from algorithms like PCA that minimize squared distances. @@ -146,9 +168,7 @@ It is faster and hence scales better to large datasets than t-SNE, and sometimes - [UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction](https://arxiv.org/abs/1802.03426), Leland McInnes, John Healy, 2018 -#### Code Examples - -The notebooks [manifold_learning_tsne_umap](02_manifold_learning/03_manifold_learning_tsne_umap.ipynb) and [manifold_learning_asset_prices](02_manifold_learning/04_manifold_learning_asset_prices.ipynb) demonstrate the usage of both t-SNE and UMAP with various data sets, including equity returns. +- The notebooks [manifold_learning_tsne_umap](02_manifold_learning/03_manifold_learning_tsne_umap.ipynb) and [manifold_learning_asset_prices](02_manifold_learning/04_manifold_learning_asset_prices.ipynb) demonstrate the usage of both t-SNE and UMAP with various data sets, including equity returns. ## Cluster Algorithms @@ -172,19 +192,27 @@ Important additional aspects of a clustering algorithm include whether - makes hard, i.e., binary, or soft, probabilistic assignment, and - is complete and assigns all data points to clusters. +### Code example: comparing cluster algorithms + The notebook [clustering_algos](03_clustering_algorithms/01_clustering_algos.ipynb) compares the clustering results for several algorithm using toy dataset designed to test clustering algorithms. -### k-Means +### Code example: k-Means k-Means is the most well-known clustering algorithm and was first proposed by Stuart Lloyd at Bell Labs in 1957. +#### The algorithm + The algorithm finds K centroids and assigns each data point to exactly one cluster with the goal of minimizing the within-cluster variance (called inertia). It typically uses Euclidean distance but other metrics can also be used. 
k-Means assumes that clusters are spherical and of equal size and ignores the covariance among features. -The notebooks [kmeans_implementation](03_clustering_algorithms/02_kmeans_implementation.ipynb) demonstrates how the k-Means algorithm works. +- The notebook [kmeans_implementation](03_clustering_algorithms/02_kmeans_implementation.ipynb) demonstrates how the k-Means algorithm works. -Cluster quality metrics help select among alternative clustering results. The notebook [kmeans_evaluation ](03_clustering_algorithms/03_kmeans_evaluation.ipynb) illustrates how to evaluate clustering quality using inertia and the [silhouette score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html). +#### Evaluating the results -### Hierarchical Clustering +Cluster quality metrics help select among alternative clustering results. + +- The notebook [kmeans_evaluation](03_clustering_algorithms/03_kmeans_evaluation.ipynb) illustrates how to evaluate clustering quality using inertia and the [silhouette score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html). + +### Code example: Hierarchical Clustering Hierarchical clustering avoids the need to specify a target number of clusters because it assumes that data can successively be merged into increasingly dissimilar clusters. It does not pursue a global objective but decides incrementally how to produce a sequence of nested clusters that range from a single cluster to clusters consisting of the individual data points. @@ -195,9 +223,9 @@ While hierarchical clustering does not have hyperparameters like k-Means, the me - Group average - Ward’s method: minimize within-cluster variance -The notebook [hierarchical_clusterin](03_clustering_algorithms/04_hierarchical_clustering.ipynb) demonstrates how this algorithm works, and how to visualize and evaluate the results.
+The notebook [hierarchical_clustering](03_clustering_algorithms/04_hierarchical_clustering.ipynb) demonstrates how this algorithm works, and how to visualize and evaluate the results. -### Density-Based Clustering +### Code example: Density-Based Clustering Density-based clustering algorithms assign cluster membership based on proximity to other cluster members. They pursue the goal of identifying dense regions of arbitrary shapes and sizes. They do not require the specification of a certain number of clusters but instead rely on parameters that define the size of a neighborhood and a density threshold. @@ -205,7 +233,7 @@ The notebook [density_based_clustering](03_clustering_algorithms/05_density_base - [Pairs Trading with density-based clustering and cointegration](https://www.quantopian.com/posts/pairs-trading-with-machine-learning) -### Gaussian Mixture Models +### Code example: Gaussian Mixture Models Gaussian mixture models (GMM) are a generative model that assumes the data has been generated by a mix of various multivariate normal distributions. The algorithm aims to estimate the mean & covariance matrices of these distributions. @@ -213,13 +241,19 @@ It generalizes the k-Means algorithm: it adds covariance among features so that The notebook [gaussian_mixture_models](03_clustering_algorithms/06_gaussian_mixture_models.ipynb) demonstrates the application of a GMM clustering model. -### Hierarchical Risk Parity +### Code example: Hierarchical Risk Parity The key idea of hierarchical risk parity (HRP) is to use hierarchical clustering on the covariance matrix to be able to group assets with similar correlations together and reduce the number of degrees of freedom by only considering 'similar' assets as substitutes when constructing the portfolio. -The notebook [hrp](04_hierarchical_risk_parity/hrp.ipynb) and the python files in subfolder [hierarchical_risk_parity](04_hierarchical_risk_parity) illustrate its application. 
+#### The algorithm + +The notebook [hierarchical_risk_parity](04_hierarchical_risk_parity/01_hierarchical_risk_parity.ipynb) in the subfolder [hierarchical_risk_parity](04_hierarchical_risk_parity) illustrates its application. -#### References +#### Backtest comparison with alternatives + +The notebook [pf_optimization_with_hrp_zipline_benchmark](04_hierarchical_risk_parity/02_pf_optimization_with_hrp_zipline_benchmark.ipynb) in the subfolder [hierarchical_risk_parity](04_hierarchical_risk_parity) compares HRP with other portfolio optimization methods discussed in [Chapter 5](../05_strategy_evaluation). + +### References - [Building Diversified Portfolios that Outperform Out-of-Sample](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2708678), Lopez de Prado, Journal of Portfolio Management, 2015 - [Hierarchical Clustering Based Asset Allocation](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2840729), Raffinot 2016 diff --git a/14_working_with_text_data/README.md b/14_working_with_text_data/README.md index 6882047f2..fe5192813 100644 --- a/14_working_with_text_data/README.md +++ b/14_working_with_text_data/README.md @@ -10,14 +10,37 @@ In the following two chapters, we build on these techniques and use ML algorithm In particular, in this chapter we will cover: - What the fundamental NLP workflow looks like -- How to build a multilingual feature extraction pipeline using spaCy and Textblob -- How to perform NLP tasks like part-of-speech tagging or named entity recognition -- How to convert tokens to numbers using the document-term matrix -- How to classify text using the Naive Bayes model +- How to build a multilingual feature extraction pipeline using spaCy and TextBlob +- Performing NLP tasks like part-of-speech tagging or named entity recognition +- Converting tokens to numbers using the document-term matrix +- Classifying text using the naive Bayes model - How to perform sentiment analysis +## Content + +1.
[ML with text data - from language to features](#ml-with-text-data---from-language-to-features) + * [Challenges of Natural Language Processing](#challenges-of-natural-language-processing) + * [Use cases](#use-cases) + * [The NLP workflow](#the-nlp-workflow) +2. [From text to tokens – the NLP pipeline](#from-text-to-tokens--the-nlp-pipeline) + * [Code example: NLP pipeline with spaCy and textacy](#code-example-nlp-pipeline-with-spacy-and-textacy) + - [Data](#data) + * [Code example: NLP with TextBlob](#code-example-nlp-with-textblob) +3. [Counting tokens – the document-term matrix](#counting-tokens--the-document-term-matrix) + * [Code example: document-term matrix with scikit-learn](#code-example-document-term-matrix-with-scikit-learn) +4. [NLP for trading: text classification and sentiment analysis](#nlp-for-trading-text-classification-and-sentiment-analysis) + * [The Naive Bayes classifier](#the-naive-bayes-classifier) + * [Code example: news article classification](#code-example-news-article-classification) + * [Code examples: sentiment analysis](#code-examples-sentiment-analysis) + - [Binary classification: twitter data](#binary-classification-twitter-data) + - [Comparing different ML algorithms on large, multiclass Yelp data](#comparing-different-ml-algorithms-on-large-multiclass-yelp-data) + +## ML with text data - from language to features + +Text data can be extremely valuable given how much information humans communicate and store using natural language. The diverse set of data sources relevant to investment range from formal documents like company statements, contracts, or patents to news, opinion, and analyst research or commentary to various types of social media postings or messages. + +Useful resources include: -## How to extract features from text data - [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf), Daniel Jurafsky & James H. 
Martin, 3rd edition, draft, 2018 - [Statistical natural language processing and corpus-based computational linguistics](https://nlp.stanford.edu/links/statnlp.html), Annotated list of resources, Stanford University - [NLP Data Sources](https://github.com/niderhoff/nlp-datasets) @@ -33,9 +56,7 @@ NLP is challenging because the effective use of text data for machine learning r - entity names can be tricky : ‘Where is A Bug's Life playing?’ - the need for knowledge about the world: ‘Mary and Sue are sisters’ vs ‘Mary and Sue are mothers’ -### Use Cases - -Key NLP use cases include: +### Use cases | Use Case | Description | Examples | |---|---|---| @@ -51,11 +72,19 @@ Key NLP use cases include: | Speech recognition and generation | Speech-to-text, text-to-speech | [Google's Web Speech API demo](https://www.google.com/intl/en/chrome/demos/speech.html), [Vocalware Text-to-Speech demo](https://www.vocalware.com/index/demo) | | Question answering | Determine the intent of the question, match query with knowledge base, evaluate hypotheses | [How did Watson beat Jeopardy champion Ken Jennings?](http://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/), [Watson Trivia Challenge](http://www.nytimes.com/interactive/2010/06/16/magazine/watson-trivia-game.html), [The AI Behind Watson](http://www.aaai.org/Magazine/Watson/watson.php) +### The NLP workflow + +A key goal for using machine learning from text data for algorithmic trading is to extract signals from documents. A document is an individual sample from a relevant text data source, e.g. a company report, a headline or news article, or a tweet. A corpus, in turn, is a collection of documents. +The following figure lays out key steps to convert documents into a dataset that can be used to train a supervised machine learning algorithm capable of making actionable predictions. + +

+ ## From text to tokens – the NLP pipeline The following table summarizes the key tasks of an NLP pipeline: - | Feature | Description | |-----------------------------|-------------------------------------------------------------------| | Tokenization | Segment text into words, punctuations marks etc. | @@ -66,23 +95,18 @@ The following table summarizes the key tasks of an NLP pipeline: | Named Entity Recognition | Label "real-world" objects, like persons, companies or locations. | | Similarity | Evaluate similarity of words, text spans, and documents. | - -### NLP pipeline with spaCy and textacy +### Code example: NLP pipeline with spaCy and textacy The notebook [nlp_pipeline_with_spaCy](01_nlp_pipeline_with_spaCy.ipynb) demonstrates how to construct an NLP pipeline using the open-source python library [spaCy]((https://spacy.io/)). The [textacy](https://chartbeat-labs.github.io/textacy/index.html) library builds on spaCy and provides easy access to spaCy attributes and additional functionality. - spaCy [docs](https://spacy.io/) and installation [instructions](https://spacy.io/usage/#installation) - textacy relies on `spaCy` to solve additional NLP tasks - see [documentation](https://chartbeat-labs.github.io/textacy/index.html) -#### Code Examples - -The code for this section is in the notebook `nlp_pipeline_with_spaCy` - #### Data - [BBC Articles](http://mlg.ucd.ie/datasets/bbc.html), use raw text files - [TED2013](http://opus.nlpl.eu/TED2013.php), a parallel corpus of TED talk subtitles in 15 langugages -### NLP with TextBlob +### Code example: NLP with TextBlob The `TextBlob` library provides a simplified interface for common NLP tasks including part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and others. 
@@ -95,13 +119,13 @@ A good alternative is NLTK, a leading platform for building Python programs to w - Natural Language ToolKit (NLTK) [Documentation](http://www.nltk.org/) -## From tokens to numbers – the document-term matrix +## Counting tokens – the document-term matrix This section introduces the bag-of-words model that converts text data into a numeric vector space representation that permits the comparison of documents using their distance. We demonstrate how to create a document-term matrix using the sklearn library. - [TF-IDF is about what matters](https://planspace.org/20150524-tfidf_is_about_what_matters/) -### Document-term matrix with sklearn +### Code example: document-term matrix with scikit-learn The scikit-learn preprocessing module offers two tools to create a document-term matrix. 1. The [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) uses binary or absolute counts to measure the term frequency tf(d, t) for each document d and token t. @@ -109,7 +133,7 @@ The scikit-learn preprocessing module offers two tools to create a document-term The notebook [document_term_matrix](03_document_term_matrix.ipynb) demonstrate usage and configuration. -## Text classification and sentiment analysis +## NLP for trading: text classification and sentiment analysis Once text data has been converted into numerical features using the natural language processing techniques discussed in the previous sections, text classification works just like any other classification task. @@ -127,20 +151,21 @@ The Naive Bayes algorithm is very popular for text classification because low co The model relies on Bayes theorem and the assumption that the various features are independent of each other given the outcome class. In other words, for a given outcome, knowing the value of one feature (e.g. the presence of a token in a document) does not provide any information about the value of another feature. 
- -### News article classification +### Code example: news article classification We start with an illustration of the Naive Bayes model to classify 2,225 BBC news articles that we know belong to five different categories. The notebook [text_classification](04_text_classification.ipynb) contains the relevant examples. -### Sentiment Analysis +### Code examples: sentiment analysis Sentiment analysis is one of the most popular uses of natural language processing and machine learning for trading because positive or negative perspectives on assets or other price drivers are likely to impact returns. Generally, modeling approaches to sentiment analysis rely on dictionaries as the TextBlob library or models trained on outcomes for a specific domain. The latter is preferable because it permits more targeted labeling, e.g. by tying text features to subsequent price changes rather than indirect sentiment scores. -#### Twitter Dataset +See [data](../data) directory for instructions on obtaining the data. + +#### Binary classification: twitter data We illustrate machine learning for sentiment analysis using a [Twitter dataset](http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip) with binary polarity labels, and a large Yelp business review dataset with a five-point outcome scale. @@ -148,11 +173,10 @@ The notebook [sentiment_analysis_twitter](05_sentiment_analysis_twitter.ipynb) c - [Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape](https://archive.org/details/twitter_cikm_2010) -#### Yelp Dataset +#### Comparing different ML algorithms on large, multiclass Yelp data To illustrate text processing and classification at larger scale, we also use the [Yelp Dataset](https://www.yelp.com/dataset). The notebook [sentiment_analysis_yelp](06_sentiment_analysis_yelp.ipynb) contains the relevant example. 
-- [Yelp Dataset Challenge](https://www.yelp.com/dataset/challenge) - +- [Yelp Dataset Challenge](https://www.yelp.com/dataset/challenge) \ No newline at end of file diff --git a/15_topic_modeling/README.md b/15_topic_modeling/README.md index f1a4bb847..e1f233d4a 100644 --- a/15_topic_modeling/README.md +++ b/15_topic_modeling/README.md @@ -12,6 +12,26 @@ Topic models permit the extraction of sophisticated, interpretable text features - How to implement LDA using sklearn and gensim - How to apply topic modeling to collections of earnings calls and Yelp business reviews +## Content + +1. [Learning latent topics: goals and approaches](#learning-latent-topics-goals-and-approaches) +2. [Latent semantic indexing (LSI)](#latent-semantic-indexing-lsi) + * [Code example: how to implement LSI using scikit-learn](#code-example-how-to-implement-lsi-using-scikit-learn) +3. [Probabilistic Latent Semantic Analysis (pLSA)](#probabilistic-latent-semantic-analysis-plsa) + * [Code example: how to implement pLSA using scikit-learn](#code-example-how-to-implement-plsa-using-scikit-learn) +4. [Latent Dirichlet Allocation (LDA)](#latent-dirichlet-allocation-lda) + * [Code example: the Dirichlet distribution](#code-example-the-dirichlet-distribution) + * [How to evaluate LDA topics](#how-to-evaluate-lda-topics) + * [Code example: how to implement LDA using scikit-learn](#code-example-how-to-implement-lda-using-scikit-learn) + * [How to visualize LDA results using pyLDAvis](#how-to-visualize-lda-results-using-pyldavis) + * [Code example: how to implement LDA using gensim](#code-example-how-to-implement-lda-using-gensim) + * [References](#references) +5. [Code example: Modeling topics discussed during earnings calls](#code-example-modeling-topics-discussed-during-earnings-calls) +6. [Code example: topic modeling with financial news articles](#code-example-topic-modeling-with-financial-news-articles) +7. 
[Resources](#resources) + * [Applications](#applications) + * [Topic Modeling libraries](#topic-modeling-libraries) + ## Learning latent topics: goals and approaches Initial attempts by topic models to improve on the vector space model (developed in the mid-1970s) applied linear algebra to reduce the dimensionality of the document-term matrix. This approach is similar to the algorithm we discussed as principal component analysis in chapter 12 on unsupervised learning. While effective, it is difficult to evaluate the results of these models absent a benchmark model. @@ -20,7 +40,6 @@ In response, probabilistic models emerged that assume an explicit document gener The below table highlights key milestones in the model evolution that we will address in more detail in the following sections. - | Model | Year | Description | |-----------------------------------------------|------|---------------------------------------------------------------------------------------------------------------| | Latent Semantic Indexing (LSI) | 1988 | Capture semantic document-term relationship by reducing the dimensionality of the word space | @@ -33,7 +52,7 @@ Latent Semantic Analysis set out to improve the results of queries that omitted LSI uses linear algebra to find a given number k of latent topics by decomposing the DTM. More specifically, it uses the Singular Value Decomposition (SVD) to find the best lower-rank DTM approximation using k singular values & vectors. In other words, LSI is an application of the unsupervised learning techniques of dimensionality reduction we encountered in chapter 12 (with some additional detail). The authors experimented with hierarchical clustering but found it too restrictive to explicitly model the document-topic and topic-term relationships or capture associations of documents or terms with several topics. 
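The decomposition described above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the notebook's code: the toy document-term matrix, the term list, and the helper `cosine` are made up for the example, and `k = 2` is an arbitrary choice.

```python
import numpy as np

# Hypothetical toy document-term matrix (rows: documents, columns: terms).
terms = ["stock", "market", "trade", "goal", "match", "team"]
dtm = np.array([
    [2, 1, 1, 0, 0, 0],   # finance-like document
    [1, 2, 0, 0, 0, 0],   # finance-like document
    [0, 0, 0, 2, 1, 1],   # sports-like document
    [0, 0, 0, 1, 2, 1],   # sports-like document
], dtype=float)

k = 2  # number of latent topics, chosen before the decomposition

# SVD factorizes dtm = U @ diag(s) @ Vt with singular values in decreasing order.
U, s, Vt = np.linalg.svd(dtm, full_matrices=False)

# Keep only the k largest singular values and their vectors.
doc_topic = U[:, :k] * s[:k]           # documents expressed in topic space
topic_term = Vt[:k]                    # topics expressed as term weights
dtm_approx = doc_topic @ topic_term    # rank-k approximation of the DTM


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_same_theme = cosine(doc_topic[0], doc_topic[1])
sim_other_theme = cosine(doc_topic[0], doc_topic[2])
```

In the reduced space, the two finance-like documents end up closer to each other than to the sports-like ones, even though they do not use exactly the same terms.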
-### How to implement LSI using sklearn +### Code example: how to implement LSI using scikit-learn The notebook [latent_semantic_indexing](01_latent_semantic_indexing.ipynb) demonstrates how to apply LSI to the BBC new articles we used in the last chapter. @@ -43,26 +62,21 @@ Probabilistic Latent Semantic Analysis (pLSA) takes a statistical perspective on pLSA explicitly models the probability each co-occurrence of documents d and words w described by the DTM as a mixture of conditionally independent multinomial distributions that involve topics t. The number of topics is a hyperparameter chosen prior to training and is not learned from the data. -### How to implement pLSA using sklearn +### Code example: how to implement pLSA using scikit-learn The notebook [probabilistic_latent_analysis](02_probabilistic_latent_analysis.ipynb) demonstrates how to apply LSI to the BBC new articles we used in the last chapter. - [Relation between PLSA and NMF and Implications](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.70.8839&rep=rep1&type=pdf), Gaussier, Goutte, 2005 -## LDA: +## Latent Dirichlet Allocation (LDA) Latent Dirichlet Allocation extends pLSA by adding a generative process for topics. It is the most popular topic model because it tends to produce meaningful topics that humans, can relate to, can assign topics to new documents, and is extensible. Variants of LDA models can include metadata like authors, or include image data, or learn hierarchical topics. LDA is a hierarchical Bayesian model that assumes topics are probability distributions over words, and documents are distributions over topics. More specifically, the model assumes that topics follow a sparse Dirichlet distribution, which implies that documents cover only a small set of topics, and topics use only a small set of words frequently. 
-#### References +### Code example: the Dirichlet distribution -- [David Blei Homepage @ Columbia](http://www.cs.columbia.edu/~blei/) -- [Introductory Paper](http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf) and [more technical review paper](http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf) -- [Blei Lab @ GitHub](https://github.com/Blei-Lab) - -### The Dirichlet Distribution The Dirichlet distribution produces probability vectors that can be used with discrete distributions. That is, it randomly generates a given number of values that are positive and sum to one. It has a parameter 𝜶 of positive real value that controls the concentration of the probabilities. The notebook [dirichlet_distribution](03_dirichlet_distribution.ipynb) contains a simulation so you can experiment with different parameter values. @@ -73,20 +87,10 @@ Unsupervised topic models do not provide a guarantee that the result will be mea Two options to evaluate results more objectively include perplexity that evaluates the model on unseen documents and topic coherence metrics that aim to evaluate the semantic quality of the uncovered patterns. -#### References - -- [Exploring Topic Coherence over many models and many topics](https://www.aclweb.org/anthology/D/D12/D12-1087.pdf) -- [Paper on various Methods](http://www.aclweb.org/anthology/N10-1012) -- [Blog Post - Overview](http://qpleple.com/topic-coherence-to-evaluate-topic-models/) - -### How to implement LDA using sklearn +### Code example: how to implement LDA using scikit-learn The notebook [lda_with_sklearn](04_lda_with_sklearn.ipynb) shows how to apply LDA to the BBC news articles. We use [sklearn.decomposition.LatentDirichletAllocation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) to train an LDA model with five topics. -#### Code Examples - -The notebook `lda_with_sklearn` contains the code examples used in this section. 
- ### How to visualize LDA results using pyLDAvis Topic visualization facilitates the evaluation of topic quality using human judgment. pyLDAvis is a python port of LDAvis, developed in R and D3.js. We will introduce the key concepts; each LDA implementation notebook contains examples. @@ -96,34 +100,42 @@ pyLDAvis displays the global relationships among topics while also facilitating - [Talk by the Author](https://speakerdeck.com/bmabey/visualizing-topic-models) and [Paper by (original) Author](http://www.aclweb.org/anthology/W14-3110) - [Documentation](http://pyldavis.readthedocs.io/en/latest/index.html) -### How to implement LDA using gensim +### Code example: how to implement LDA using gensim -Gensim is a specialized NLP library with a fast LDA implementation and many additional features. We will also use it in the next chapter on word vectors (see the notebook [lda_with_gensim](05_lda_with_gensim.ipynb) for details. +Gensim is a specialized NLP library with a fast LDA implementation and many additional features. We will also use it in the next chapter to learn word vectors (see the notebook [lda_with_gensim](05_lda_with_gensim.ipynb) for details. -### Topic modeling for earnings calls +### References -In Chapter 3 on [Alternative Data](../03_alternative_data/02_earnings_calls), we learned how to scrape earnings call data from the SeekingAlpha site. In this section, we will illustrate topic modeling on the harvested data. 
+- [David Blei Homepage @ Columbia](http://www.cs.columbia.edu/~blei/) +- [Introductory Paper](http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf) and [more technical review paper](http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf) +- [Blei Lab @ GitHub](https://github.com/Blei-Lab) +- [Exploring Topic Coherence over many models and many topics](https://www.aclweb.org/anthology/D/D12/D12-1087.pdf) +- [Paper on various Methods](http://www.aclweb.org/anthology/N10-1012) +- [Blog Post - Overview](http://qpleple.com/topic-coherence-to-evaluate-topic-models/) -We're using a (small) sample of some 1,000 earnings call transcripts from the second half of 2018. For a practical application, a larger dataset would be highly desirable. +## Code example: Modeling topics discussed during earnings calls -The directory [earnings_calls](06_earnings_calls) contains several files with examples mentioned below. See the notebook [lda_earnings_calls](06_earnings_calls/lda_earnings_calls.ipynb) for details on loading, exploring, and preprocessing the data, as well as training and evaluating individual models, and the [run_experiments.py](06_earnings_calls/run_experiments.py)) file for the experiments evaluated in the notebook. +In Chapter 3 on [Alternative Data](../03_alternative_data/02_earnings_calls), we learned how to scrape earnings call data from the SeekingAlpha site. -### Topic modeling for Yelp business reviews +In this section, we will illustrate topic modeling using this source. I’m using a sample of some 700 earnings call transcripts from 2018 and 2019 (see [data](../data) directory). This is a fairly small sample; for a practical application, we would need a larger dataset. + +The notebook [lda_earnings_calls](06_lda_earnings_calls.ipynb) contains details on loading, exploring, and preprocessing the data, as well as training and evaluating different models. 
-The notebook [lda_yelp_reviews](07_yelp/lda_yelp_reviews.ipynb) contains an example of LDA applied to six million business review on yelp. Reviews are a more uniform in length than the statements extracted from the earnings call transcript. After cleaning as above, the 10th and 90th percentile range from 14 to 90 tokens. +## Code example: topic modeling with financial news articles -## Applications +The notebook [lda_financial_news](07_lda_financial_news.ipynb) shows how to summarize a large corpus of financial news articles sourced from Reuters and others (see [data](../data) for sources) using LDA. -- [Applications of Topic Models](https://mimno.infosci.cornell.edu/papers/2017_fntir_tm_applications.pdf), Jordan Boyd-Graber, Yuening Hu, David Minmo, 2017 +## Resources + +### Applications +- [Applications of Topic Models](https://mimno.infosci.cornell.edu/papers/2017_fntir_tm_applications.pdf), Jordan Boyd-Graber, Yuening Hu, David Minmo, 2017 - [High Quality Topic Extraction from Business News Explains Abnormal Financial Market Volatility](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3675119/pdf/pone.0064846.pdf) - [What are You Saying? 
Using Topic to Detect Financial Misreporting](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2803733) - - [LDA in the browser - javascript implementation](https://github.com/mimno/jsLDA) - [David Mimno @ Cornell University](https://mimno.infosci.cornell.edu/) +### Topic Modeling libraries -## Topic Modeling Software - [David Blei's List of Open Source Topic Modeling Software](http://www.cs.columbia.edu/~blei/topicmodeling_software.html) - - [Mallet (MAchine Learning for LanguagE Toolkit (in Java)](http://mallet.cs.umass.edu/) diff --git a/16_word_embeddings/README.md b/16_word_embeddings/README.md index b2518b41a..db2284bd0 100644 --- a/16_word_embeddings/README.md +++ b/16_word_embeddings/README.md @@ -1,16 +1,40 @@ -# Extracting better features: Word Embeddings for SEC Filings +# Word Embeddings for Earnings Calls and SEC Filings This chapter introduces uses neural networks to learn a vector representation of individual semantic units like a word or a paragraph. These vectors are dense rather than sparse as in the bag-of-words model and have a few hundred real-valued rather than tens of thousand binary or discrete entries. They are called embeddings because they assign each semantic unit a location in a continuous vector space. Embeddings result from training a model to relate tokens to their context with the benefit that similar usage implies a similar vector. As a result, the embeddings encode semantic aspects like relationships among words by means of their relative location. They are powerful features for use in the deep learning models that we will introduce in the following chapters. 
More specifically, in this chapter, we will cover: -- What word embeddings are, how they work and capture semantic information -- How to use trained word vectors -- Which network architectures are useful to train word2vec models -- How to train a word2vec model using keras, gensim, and TensorFlow -- How to visualize and evaluate the quality of word vectors -- How to train a word2vec model using SEC filings -- How doc2vec extends word2vec using SEC filings -- How doc2vec extends word2vec +- What word embeddings are, how they work, and why they capture semantic information +- How to obtain and use pre-trained word vectors +- Which network architectures are most effective at training word2vec models +- How to train a word2vec model using Keras, Gensim, and TensorFlow +- Visualizing and evaluating the quality of word vectors +- How to train a word2vec model on SEC filings to predict stock price moves +- How doc2vec extends word2vec and can be used for sentiment analysis +- Why the transformer’s attention mechanism had such an impact on natural language processing +- How to fine-tune pre-trained BERT models on financial data and extract high-quality embeddings + +## Content + +1. [How Word Embeddings encode Semantics](#how-word-embeddings-encode-semantics) + * [How neural language models learn usage in context](#how-neural-language-models-learn-usage-in-context) + * [The word2vec Model: scalable word and phrase embeddings](#the-word2vec-model-scalable-word-and-phrase-embeddings) + * [Evaluating embeddings: vector arithmetic and analogies](#evaluating-embeddings-vector-arithmetic-and-analogies) +2. [Code example: Working with embedding models](#code-example-working-with-embedding-models) + * [Working with Global Vectors for Word Representation (GloVe)](#working-with-global-vectors-for-word-representation-glove) + * [Evaluating embeddings using analogies](#evaluating-embeddings-using-analogies) +3. 
[Code example: training domain-specific embeddings using financial news](#code-example-training-domain-specific-embeddings-using-financial-news) + * [Preprocessing financial news: sentence detection and n-grams](#preprocessing-financial-news-sentence-detection-and-n-grams) + * [Skip-gram architecture in TensorFlow 2 and visualization with TensorBoard](#skip-gram-architecture-in-tensorflow-2-and-visualization-with-tensorboard) + * [How to train embeddings faster with Gensim](#how-to-train-embeddings-faster-with-gensim) +4. [Code Example: word Vectors from SEC Filings using gensim](#code-example-word-vectors-from-sec-filings-using-gensim) + * [Preprocessing: content selection, sentence detection, and n-grams](#preprocessing-content-selection-sentence-detection-and-n-grams) + * [Model training and evaluation](#model-training-and-evaluation) +5. [Code example: sentiment Analysis with Doc2Vec](#code-example-sentiment-analysis-with-doc2vec) +6. [New Frontiers: Attention, Transformers, and Pretraining](#new-frontiers-attention-transformers-and-pretraining) + * [Attention is all you need: transforming natural language generation](#attention-is-all-you-need-transforming-natural-language-generation) + * [BERT: Towards a more universal, pretrained language model](#bert-towards-a-more-universal-pretrained-language-model) + * [Using pretrained state-of-the-art models](#using-pretrained-state-of-the-art-models) +7. [Additional Resources](#additional-resources) ## How Word Embeddings encode Semantics @@ -27,8 +51,8 @@ In contrast, the bag-of-words model uses the entire documents as context and use ### The word2vec Model: scalable word and phrase embeddings A word2vec model is a two-layer neural net that takes a text corpus as input and outputs a set of embedding vectors for words in that corpus. There are two different architectures to efficiently learn word vectors using shallow neural networks. 
-- The continuous-bag-of-words (CBOW) model predicts the target word using the average of the context word vectors as input so that their order does not matter. CBOW trains faster and tends to be slightly more accurate for frequent terms, but pays less attention to infrequent words. -- The skip-gram (SG) model, in contrast, uses the target word to predict words sampled from the context. It works well with small datasets and finds good representations even for rare words or phrases. +- The **continuous-bag-of-words** (CBOW) model predicts the target word using the average of the context word vectors as input so that their order does not matter. CBOW trains faster and tends to be slightly more accurate for frequent terms, but pays less attention to infrequent words. +- The **skip-gram** (SG) model, in contrast, uses the target word to predict words sampled from the context. It works well with small datasets and finds good representations even for rare words or phrases. ### Evaluating embeddings: vector arithmetic and analogies @@ -38,18 +62,14 @@ Just as words can be used in different contexts, they can be related to other wo The word2vec authors provide a list of several thousand relationships spanning aspects of geography, grammar and syntax, and family relationships to evaluate the quality of embedding vectors (see directory [analogies](data/analogies)). -## Working with embedding models +## Code example: Working with embedding models Similar to other unsupervised learning techniques, the goal of learning embedding vectors is to generate features for other tasks like text classification or sentiment analysis. There are several options to obtain embedding vectors for a given corpus of documents: - Use embeddings learned from a generic large corpus like Wikipedia or Google News - Train your own model using documents that reflect a domain of interest -### Using trained word vectors - -There are several sources for pre-trained word embeddings. 
Popular options include Stanford’s GloVE and spaCy’s built-in vectors (see the notebook [using_trained_vectors ](02_using_trained_vectors.ipynb) for details). - -#### GloVe: Global Vectors for Word Representation +### Working with Global Vectors for Word Representation (GloVe) GloVe is an unsupervised algorithm developed at the Stanford NLP lab that learns vector representations for words from aggregated global word-word co-occurrence statistics (see references). Vectors pre-trained on the following web-scale sources are available: - Common Crawl with 42B or 840B tokens and a vocabulary of 1.9M or 2.2M tokens @@ -68,30 +88,53 @@ The following table shows the accuracy on the word2vec semantics test achieved b | adjective-to-adverb | 992 | 22.58% | plural | 1332 | 78.08% | | opposite | 756 | 28.57% | plural-verbs | 870 | 58.51% | -### How to train your own word vector embeddings +There are several sources for pre-trained word embeddings. Popular options include Stanford’s GloVE and spaCy’s built-in vectors. +- The notebook [using_trained_vectors ](01_using_trained_vectors.ipynb) illustrates how to work with pretrained vectors. -Many tasks require embeddings or domain-specific vocabulary that pre-trained models based on a generic corpus may not represent well or at all. Standard word2vec models are not able to assign vectors to out-of-vocabulary words and instead use a default vector that reduces their predictive value. +### Evaluating embeddings using analogies -E.g., when working with industry-specific documents, the vocabulary or its usage may change over time as new technologies or products emerge. As a result, the embeddings need to evolve as well. In addition, corporate earnings releases use nuanced language not fully reflected in Glove vectors pre-trained on Wikipedia articles. +The notebook [evaluating_embeddings](02_evaluating_embeddings.ipynb) demonstrates how to test the quality of word vectors using analogies and other semantic relationships among words. 
-- [Word embeddings | TensorFlow Core](https://www.tensorflow.org/tutorials/text/word_embeddings) -- [Visualizing Data using the Embedding Projector in TensorBoard](https://www.tensorflow.org/tensorboard/tensorboard_projector_plugin) +## Code example: training domain-specific embeddings using financial news + +Many tasks require embeddings of domain-specific vocabulary that models pre-trained on a generic corpus may not be able to capture. Standard word2vec models are not able to assign vectors to out-of-vocabulary words and instead use a default vector that reduces their predictive value. + +For example, when working with industry-specific documents, the vocabulary or its usage may change over time as new technologies or products emerge. As a result, the embeddings need to evolve as well. In addition, documents like corporate earnings releases use nuanced language that GloVe vectors pre-trained on Wikipedia articles are unlikely to properly reflect. + +See the [data](../data) directory for instructions on sourcing the financial news dataset. + +### Preprocessing financial news: sentence detection and n-grams + +The notebook [financial_news_preprocessing](03_financial_news_preprocessing.ipynb) demonstrates how to prepare the source data for our model -### Bonus: word2vec for translation +### Skip-gram architecture in TensorFlow 2 and visualization with TensorBoard -- [Exploiting Similarities among Languages for Machine Translation](https://arxiv.org/abs/1309.4168), Tomas Mikolov, Quoc V. Le, Ilya Sutskever, arxiv 2013 -- [Word and Phrase Translation with word2vec](https://arxiv.org/abs/1705.03127), Stefan Jansen, arxiv, 2017 +The notebook [financal_news_word2vec_tensorflow](04_financal_news_word2vec_tensorflow.ipynb) illustrates how to build a word2vec model using the Keras interface of TensorFlow 2 that we will introduce in much more detail in the next chapter. 
-## Word Vectors from SEC Filings using gensim +### How to train embeddings faster with Gensim + +The TensorFlow implementation is very transparent in terms of its architecture, but it is not particularly fast. The natural language processing (NLP) library [gensim](https://radimrehurek.com/gensim/), which we also used for topic modeling in the last chapter, offers better performance and more closely resembles the C-based word2vec implementation provided by the original authors. + +The notebook [financial_news_word2vec_gensim](05_financial_news_word2vec_gensim.ipynb) shows how to learn word vectors more efficiently. + +## Code Example: word Vectors from SEC Filings using gensim In this section, we will learn word and phrase vectors from annual SEC filings using gensim to illustrate the potential value of word embeddings for algorithmic trading. In the following sections, we will combine these vectors as features with price returns to train neural networks to predict equity prices from the content of security filings. -In particular, we use a dataset containing over 22,000 10-K annual reports from the period 2013-2016 that are filed by listed companies and contain both financial information and management commentary (see chapter 3 on Alternative Data). For about half of 11K filings for companies that we have stock prices to label the data for predictive modeling (see references about data source and the notebooks in the folder [sec-filings](sec-filings) for details). +In particular, we use a dataset containing over 22,000 10-K annual reports from the period 2013-2016 that are filed by listed companies and contain both financial information and management commentary (see Chapter 3 on [Alternative Data](../03_alternative_data)). For about half of them, roughly 11K filings, we have stock prices to label the data for predictive modeling (see references about data source and the notebooks in the folder [sec-filings](sec-filings) for details).
- [2013-2016 Cleaned/Parsed 10-K Filings with the SEC](https://data.world/jumpyaf/2013-2016-cleaned-parsed-10-k-filings-with-the-sec) - [Stock Market Predictions with Natural Language Deep Learning](https://www.microsoft.com/developerblog/2017/12/04/predicting-stock-performance-deep-learning/) -## Sentiment Analysis with Doc2Vec +### Preprocessing: content selection, sentence detection, and n-grams + +The notebook [sec_preprocessing](06_sec_preprocessing.ipynb) shows how to parse and tokenize the text using spaCy, similar to the approach in Chapter 14, [Text Data for Trading: Sentiment Analysis](../14_working_with_text_data). + +### Model training and evaluation + +The notebook [sec_word2vec](07_sec_word2vec.ipynb) uses gensim's [word2vec](https://radimrehurek.com/gensim/models/word2vec.html) implementation of the skip-gram architecture to learn word vectors for the SEC filings dataset. + +## Code example: sentiment Analysis with Doc2Vec Text classification requires combining multiple word embeddings. A common approach is to average the embedding vectors for each word in the document. This uses information from all embeddings and effectively uses vector addition to arrive at a different location in the embedding space. However, relevant information about the order of words is lost. @@ -99,7 +142,7 @@ In contrast, the state-of-the-art generation of embeddings for pieces of text li - The distributed memory (DM) model corresponds to the word2vec CBOW model. The document vectors result from training a network on the synthetic task of predicting a target word based on both the context word vectors and the document's doc vector. - The distributed bag of words (DBOW) model corresponds to the word2vec skip-gram architecture. The doc vectors result from training a neural net to predict a target word using the full document’s doc vector.
-The notebook [yelp_sentiment](doc2vec/yelp_sentiment.ipynb) applied doc2vec to a random sample of 1mn Yelp reviews with their associated star ratings. +The notebook [doc2vec_yelp_sentiment](08_doc2vec_yelp_sentiment.ipynb) applies doc2vec to a random sample of 1mn Yelp reviews with their associated star ratings. ## New Frontiers: Attention, Transformers, and Pretraining @@ -136,11 +179,13 @@ The BERT model builds on two key ideas, namely the transformer architecture desc - [Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch] -BERT / RoBERTa / XLM-RoBERTa produces out-of-the-box rather bad sentence embeddings. This repository fine-tunes BERT / RoBERTa / DistilBERT / ALBERT / XLNet with a siamese or triplet network structure to produce semantically meaningful sentence embeddings that can be used in unsupervised scenarios: Semantic textual similarity via cosine-similarity, clustering, semantic search. -### Resources +## Additional Resources - [GloVe: Global Vectors for Word Representation](https://github.com/stanfordnlp/GloVe) - [Common Crawl Data](http://commoncrawl.org/the-data/) - [word2vec analogy samples](https://github.com/nicholas-leonard/word2vec/blob/master/questions-words.txt) - [spaCy word vectors and semantic similarity](https://spacy.io/usage/vectors-similarity) - [2013-2016 Cleaned/Parsed 10-K Filings with the SEC](https://data.world/jumpyaf/2013-2016-cleaned-parsed-10-k-filings-with-the-sec) -- [Stanford Sentiment Tree Bank](https://nlp.stanford.edu/sentiment/treebank.html) \ No newline at end of file +- [Stanford Sentiment Tree Bank](https://nlp.stanford.edu/sentiment/treebank.html) +- [Word embeddings | TensorFlow Core](https://www.tensorflow.org/tutorials/text/word_embeddings) +- [Visualizing Data using the Embedding Projector in TensorBoard](https://www.tensorflow.org/tensorboard/tensorboard_projector_plugin) diff --git a/17_deep_learning/README.md b/17_deep_learning/README.md index 
1f4ac575a..f129397a0 100644 --- a/17_deep_learning/README.md +++ b/17_deep_learning/README.md @@ -9,12 +9,37 @@ In the following chapters, we will build on this foundation to design various ar In particular, this chapter will cover - How DL solves AI challenges in complex domains - Key innovations that have propelled DL to its current popularity -- How feed-forward networks learn representations from data -- Designing and training deep neural networks in Python -- Implementing deep NN using Keras, TensorFlow, and PyTorch -- Building and tuning a deep NN to predict asset price moves - -## How Deep Learning Works +- How feedforward networks learn representations from data +- Designing and training deep neural networks (NNs) in Python +- Implementing deep NNs using Keras, TensorFlow, and PyTorch +- Building and tuning a deep NN to predict asset returns +- Designing and backtesting a trading strategy based on deep NN signals + +## Content + +1. [Deep learning: How it differs and why it matters](#deep-learning-how-it-differs-and-why-it-matters) + * [How hierarchical features help tame high-dimensional data](#how-hierarchical-features-help-tame-high-dimensional-data) + * [Automating feature extraction: DL as representation learning](#automating-feature-extraction-dl-as-representation-learning) + * [How DL relates to machine learning and artificial intelligence](#how-dl-relates-to-machine-learning-and-artificial-intelligence) +2. [Code example: Designing a neural network](#code-example-designing-a-neural-network) + * [Key design choices](#key-design-choices) + * [How to regularize deep neural networks](#how-to-regularize-deep-neural-networks) + * [Training faster: Optimizations for deep learning](#training-faster-optimizations-for-deep-learning) +3. 
[Popular Deep Learning libraries](#popular-deep-learning-libraries) + * [How to Leverage GPU Optimization](#how-to-leverage-gpu-optimization) + * [How to use Tensorboard](#how-to-use-tensorboard) + * [Code example: how to use PyTorch](#code-example-how-to-use-pytorch) + * [Code example: How to use TensorFlow](#code-example-how-to-use-tensorflow) +4. [Code example: Optimizing a neural network for a long-short trading strategy](#code-example-optimizing-a-neural-network-for-a-long-short-trading-strategy) + * [Optimizing the NN architecture](#optimizing-the-nn-architecture) + * [Backtesting a long-short strategy based on ensembled signals](#backtesting-a-long-short-strategy-based-on-ensembled-signals) + + +## Deep learning: How it differs and why it matters + +The machine learning (ML) algorithms covered in Part 2 work well on a wide variety of important problems, including on text data as demonstrated in Part 3. They have been less successful, however, in solving central AI problems such as recognizing speech or classifying objects in images. These limitations have motivated the development of DL, and the recent DL breakthroughs have greatly contributed to a resurgence of interest in AI. + +For a comprehensive introduction that includes and expands on many of the points in this section, see Goodfellow, Bengio, and Courville (2016), or for a much shorter version, see LeCun, Bengio, and Hinton (2015).
- [Deep Learning](https://www.deeplearningbook.org/), Ian Goodfellow, Yoshua Bengio and Aaron Courville, MIT Press, 2016 - [Deep learning](https://www.nature.com/articles/nature14539), Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, Nature 2015 @@ -23,21 +48,50 @@ In particular, this chapter will cover - [One Hundred Year Study on Artificial Intelligence (AI100)](https://ai100.stanford.edu/) - [TensorFlow Playground](http://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle®Dataset=reg-plane&learningRate=0.03®ularizationRate=0&noise=0&networkShape=4,2&seed=0.71056&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false), Interactive, browser-based Deep Learning platform +### How hierarchical features help tame high-dimensional data -### Backpropagation +As discussed throughout Part 2, the key challenge of supervised learning is to generalize from training data to new samples. Generalization becomes exponentially more difficult as the dimensionality of the data increases. We encountered the root causes of these difficulties as the curse of dimensionality in Chapter 13, [Unsupervised Learning: From Data-Driven Risk Factors to Hierarchical Risk Parity](../13_unsupervised_learning). -- [Gradient Checking & Advanced Optimization](http://ufldl.stanford.edu/wiki/index.php/Gradient_checking_and_advanced_optimization), Unsupervised Feature Learning and Deep Learning, Stanford University -- [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/index.html#momentum), Sebastian Ruder, 2016 +### Automating feature extraction: DL as representation learning + +Many AI tasks like image or speech recognition require knowledge about the world. One of the key challenges is to encode this knowledge so a computer can utilize it. 
For decades, the development of ML systems required considerable domain expertise to transform the raw data (such as image pixels) into an internal representation that a learning algorithm could use to detect or classify patterns. -## How to build a Neural Network using Python +### How DL relates to machine learning and artificial intelligence + +AI has a long history, going back at least to the 1950s as an academic field and much longer as a subject of human inquiry, but has experienced several waves of ebbing and flowing enthusiasm since (see [The Quest for Artificial Intelligence](https://ai.stanford.edu/~nilsson/QAI/qai.pdf), Nilsson, 2009 for an in-depth survey). +- ML is an important subfield with a long history in related disciplines such as statistics and became prominent in the 1980s. +- DL is a form of representation learning and itself a subfield of ML. + +## Code example: Designing a neural network To gain a better understanding of how NNs work, the notebook [01_build_and_train_feedforward_nn](build_and_train_feedforward_nn.ipynb) formulates a simple feedforward architecture and its forward propagation computations using matrix algebra and implements them using NumPy, Python's numerical linear algebra library. +

+ +

+ +### Key design choices + +Some NN design choices resemble those for other supervised learning models. For example, the output is dictated by the type of the ML problem such as regression, classification, or ranking. Given the output, we need to select a cost function to measure prediction success and failure, and an algorithm that optimizes the network parameters to minimize the cost. + +NN-specific choices include the numbers of layers and nodes per layer, the connections between nodes of different layers, and the type of activation functions. + +### How to regularize deep neural networks + +The downside of the capacity of NN to approximate arbitrary functions is the greatly increased risk of overfitting. The best protection against overfitting is to train the model on a larger dataset. Data augmentation, e.g. by creating slightly modified versions of images, is a powerful alternative approach. The generation of synthetic financial training data for this purpose is an active research area that we will address in [Chapter 21](../21_gans_for_synthetic_time_series). + +### Training faster: Optimizations for deep learning + +Backprop refers to the computation of the gradient of the cost function with respect to the internal parameters we wish to update and the use of this information to update the parameter values. The gradient is useful because it indicates the direction of parameter change that causes the maximal increase in the cost function. Hence, adjusting the parameters according to the negative gradient produces an optimal cost reduction, at least for a region very close to the observed samples. See Ruder (2016) for an excellent overview of key gradient descent optimization algorithms. 
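
The update rule described above can be illustrated with a minimal sketch. It assumes a toy one-parameter quadratic cost; the functions `cost`, `gradient`, and `gradient_descent`, the learning rate, and the number of steps are illustrative choices and are not taken from the chapter's notebooks.

```python
# Minimal gradient descent sketch on a toy one-parameter cost.
# All names and hyperparameters here are illustrative assumptions.

def cost(w):
    """Toy quadratic cost with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Analytical derivative of the toy cost with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w0, learning_rate=0.1, n_steps=100):
    """Repeatedly move the parameter against the gradient, starting at w0."""
    w = w0
    for _ in range(n_steps):
        # Step in the direction of the negative gradient to reduce the cost
        w -= learning_rate * gradient(w)
    return w


w_final = gradient_descent(w0=0.0)
print(w_final, cost(w_final))
```

In a real network, backprop supplies the gradient for every weight and bias, and optimizers such as those surveyed by Ruder (2016) modify how the step is taken.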
+ +- [Gradient Checking & Advanced Optimization](http://ufldl.stanford.edu/wiki/index.php/Gradient_checking_and_advanced_optimization), Unsupervised Feature Learning and Deep Learning, Stanford University +- [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/index.html#momentum), Sebastian Ruder, 2016 ## Popular Deep Learning libraries -Currently, the most popular DL libraries are TensorFlow (supported by Google), Keras (led by Francois Chollet, now at Google), and PyTorch (supported by Facebook). Development is very active with PyTorch just releasing version 1.0 and TensorFlow 2.0 expected in early Spring 2019 when it is expected to adopt Keras as its main interface. +Currently, the most popular DL libraries are [TensorFlow](https://www.tensorflow.org/) (supported by Google) and [PyTorch](https://pytorch.org/) (supported by Facebook). +Development is very active with PyTorch at version 1.4 and TensorFlow at 2.2 as of March 2020. TensorFlow 2.0 adopted [Keras](https://keras.io/) as its main interface, effectively combining both libraries into one. Additional options include: - [Microsoft Cognitive Toolkit (CNTK)](https://github.com/Microsoft/CNTK) @@ -55,15 +109,7 @@ All popular Deep Learning libraries support the use of GPU, and some also allow A more straightforward way to leverage GPU is via the Docker virtualization platform. There are numerous images available that you can run in a local container managed by Docker, which circumvents many of the driver and version conflicts that you may otherwise encounter. TensorFlow provides Docker images on its website that can also be used with Keras. 
- [Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning](http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/), Tim Dettmers - -### How to use Keras - -Keras was designed as a high-level or meta API to accelerate the iterative workflow when designing and training deep neural networks with computational backends like TensorFlow, Theano, or CNTK. It has been integrated into TensorFlow in 2017 and is set to become the principal TensorFlow interface with the 2.0 release. You can also combine code from both libraries to leverage Keras’ high-level abstractions as well as customized TensorFlow graph operations. - -The notebook [how_to_use_keras](02_how_to_use_keras.ipynb) demonstrates the functionality. - - [A Full Hardware Guide to Deep Learning](http://timdettmers.com/2018/12/16/deep-learning-hardware-guide/), Tim Dettmers -- [Keras documentation](https://keras.io/) ### How to use Tensorboard @@ -75,15 +121,16 @@ When you run the how_to_use_keras notebook, and with TensorFlow installed, you c ```python tensorboard --logdir=/full_path_to_your_logs ## e.g. ./tensorboard ``` - - [TensorBoard: Visualizing Learning](https://www.tensorflow.org/guide/summaries_and_tensorboard) -### How to use PyTorch 1.0 +### Code example: how to use PyTorch PyTorch was developed at the Facebook AI Research group led by Yann LeCun, and the first alpha version was released in September 2016. It provides deep integration with Python libraries like NumPy that can be used to extend its functionality, strong GPU acceleration, and automatic differentiation using its autograd system. It provides more granular control than Keras through a lower-level API and is mainly used as a deep learning research platform but can also replace NumPy while enabling GPU computation. It employs eager execution, in contrast to the static computation graphs used by, e.g., Theano or TensorFlow. 
Rather than initially defining and compiling a network for fast but static execution, it relies on its autograd package for automatic differentiation of Tensor operations, i.e., it computes gradients ‘on the fly’ so that network structures can be partially modified more easily. This is called define-by-run, meaning that backpropagation is defined by how your code runs, which in turn implies that every single iteration can be different. The PyTorch documentation provides a detailed tutorial on this. +The notebook [how_to_use_pytorch](03_how_to_use_pytorch.ipynb) illustrates how to use the 1.4 release. + - [PyTorch Documentation](https://pytorch.org/docs) - [PyTorch Tutorials](https://pytorch.org/tutorials) - [PyTorch Ecosystem](https://pytorch.org/ecosystem) @@ -91,19 +138,30 @@ It employs eager execution, in contrast to the static computation graphs used by - [Flair](https://github.com/zalandoresearch/flair), simple framework for state-of-the-art NLP developed at Zalando - [fast.ai](http://www.fast.ai/), simplifies training NN using modern best practices; offers online training -### How to use TensorFlow +### Code example: How to use TensorFlow TensorFlow became the leading deep learning library shortly after its release in November 2015, roughly one year before PyTorch. TensorFlow 2.0 aims to simplify the API that has grown increasingly complex over time by making the Keras API, integrated into TensorFlow as part of the contrib package since 2017, its principal interface, and adopting eager execution. It will continue to focus on a robust implementation across numerous platforms but will make it easier to experiment and do research. -The notebook [how_to_use_tensorflow](04_how_to_use_tensorflow.ipynb) will illustrateshow to use the 2.0 release (updated as the interface stabilizes). +The notebook [how_to_use_tensorflow](04_how_to_use_tensorflow.ipynb) illustrates how to use the 2.0 release. 
- [TensorFlow.org](https://www.tensorflow.org/) - [Standardizing on Keras: Guidance on High-level APIs in TensorFlow 2.0](https://medium.com/tensorflow/standardizing-on-keras-guidance-on-high-level-apis-in-tensorflow-2-0-bad2b04c819a) - [TensorFlow.js](https://js.tensorflow.org/), A JavaScript library for training and deploying ML models in the browser and on Node.js -## How to optimize Neural Network Architectures +## Code example: Optimizing a neural network for a long-short trading strategy + +In practice, we need to explore variations of the design options for the NN architecture and its training that we outlined previously, because we can never be sure from the outset which configuration best suits the data. + +This code example explores various architectures for a simple feedforward neural network to predict daily stock returns using the dataset developed in [Chapter 12](../12_gradient_boosting_machines) (see the notebook [preparing_the_model_data](../12_gradient_boosting_machines/04_preparing_the_model_data.ipynb)). + +To this end, we will define a function that returns a TensorFlow model based on several architectural input parameters and cross-validate alternative designs using the MultipleTimeSeriesCV we introduced in Chapter 7. To assess the signal quality of the model predictions, we build a simple ranking-based long-short strategy based on an ensemble of the models that perform best during the in-sample cross-validation period. To limit the risk of false discoveries, we then evaluate the performance of this strategy for an out-of-sample test period. + +### Optimizing the NN architecture + +The notebook [how_to_optimize_a_NN_architecure](04_how_to_use_tensorflow.ipynb) explores various options to build a simple feedforward Neural Network to predict asset returns. To develop our trading strategy, we use the daily stock returns for 995 US stocks for the eight-year period from 2010 to 2017. 
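
As a rough sketch of how the alternative designs might be enumerated before cross-validation, the following snippet builds every combination of a few architectural options. The option names and values (`dense_layer_options`, `activation_options`, `dropout_options`) are hypothetical and not taken from the notebook; each resulting dictionary would then be passed to a model-building function and evaluated with time-series cross-validation, which is not shown here.

```python
from itertools import product

# Hypothetical search space; the notebook may use different options and values
dense_layer_options = [(32, 16), (64, 32), (128, 64)]  # units per hidden layer
activation_options = ["relu", "tanh"]
dropout_options = [0.0, 0.1, 0.2]


def make_configs():
    """Return one dict per combination of the architectural options."""
    return [
        {"dense_layers": layers, "activation": activation, "dropout": dropout}
        for layers, activation, dropout in product(
            dense_layer_options, activation_options, dropout_options
        )
    ]


configs = make_configs()
print(len(configs))  # number of candidate architectures to cross-validate
```

Keeping track of how many configurations are tried matters because each additional experiment raises the risk of false discoveries.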
+ +### Backtesting a long-short strategy based on ensembled signals -In practice, we need to explore variations of the design options outlined above because we can rarely be sure from the outset which network architecture best suits the data. -The GridSearchCV class provided by scikit-learn that we encountered in Chapter 6, The Machine Learning Workflow conveniently automates this process. Just be mindful of the risk of false discoveries and keep track of how many experiments you are running to adjust the results accordingly. +To translate our NN model into a trading strategy, we generate predictions, evaluate their signal quality, create rules that define how to trade on these predictions, and backtest the performance of a strategy that implements these rules. -The notebook [how_to_optimize_a_NN_architecure](04_how_to_use_tensorflow.ipynb) explores various options to build a simple feedforward Neural Network to predict asset price moves for a one-month horizon. The python script of the same name aims to facilitate running the code on a server in order to speed up computation. +The notebook [backtesting_with_zipline](05_backtesting_with_zipline.ipynb) contains the code examples for this section. diff --git a/18_convolutional_neural_nets/README.md b/18_convolutional_neural_nets/README.md index 21e45dbb1..44db82ed3 100644 --- a/18_convolutional_neural_nets/README.md +++ b/18_convolutional_neural_nets/README.md @@ -7,14 +7,38 @@ CNNs are named after the linear algebra operation called convolution that replac Research into CNN architectures has proceeded very rapidly and new architectures that improve benchmark performance continue to emerge. We will describe a set of building blocks that consistently appears in successful applications and illustrate their application to image data and financial time series. We will also demonstrate how transfer learning can speed up learning by using pre-trained weights for some of the CNN layers. 
More specifically, in this chapter, you will learn about: -- How CNNs use key building blocks to efficiently model grid-like data -- Designing CNN architectures using Keras and PyTorch -- Training, tuning, and regularizing CNN for various data types -- Using transfer learning to streamline CNN, even with fewer data -- How to classify satellite images - - -## How to build a Deep ConvNet +- How CNNs employ several building blocks to efficiently model grid-like data +- Training, tuning and regularizing CNNs for images and time series data using TensorFlow +- Using transfer learning to streamline CNNs, even with fewer data +- Designing a trading strategy using return predictions by a CNN trained on time-series data formatted like images +- How to classify economic activity based on satellite images + +## Content + +1. [How CNNs learn to model grid-like data](#how-cnns-learn-to-model-grid-like-data) + * [Code example: From hand-coding to learning and synthesizing filters from data](#code-example-from-hand-coding-to-learning-and-synthesizing-filters-from-data) + * [How the key elements of a convolutional layer operate](#how-the-key-elements-of-a-convolutional-layer-operate) + * [Computer Vision Tasks](#computer-vision-tasks) + * [The evolution of CNN architectures: key innovations](#the-evolution-of-cnn-architectures-key-innovations) +2. 
[CNN for Images: From Satellite Data to Object Detection](#cnn-for-images-from-satellite-data-to-object-detection) + * [Code example: LeNet5: The first CNN with industrial applications](#code-example-lenet5-the-first-cnn-with-industrial-applications) + * [Code example: AlexNet - reigniting deep learning research](#code-example-alexnet---reigniting-deep-learning-research) + * [Code example: transfer learning with VGG16 in practice](#code-example-transfer-learning-with-vgg16-in-practice) + - [How to extract bottleneck features](#how-to-extract-bottleneck-features) + - [How to fine-tune a pre-trained model](#how-to-fine-tune-a-pre-trained-model) + * [Code example: identifying land use with satellite images using transfer learning](#code-example-identifying-land-use-with-satellite-images-using-transfer-learning) + * [Code example: object detection in practice with Google Street View House Numbers](#code-example-object-detection-in-practice-with-google-street-view-house-numbers) + - [Preprocessing the source images](#preprocessing-the-source-images) + - [Transfer learning with a custom final layer for multiple outputs](#transfer-learning-with-a-custom-final-layer-for-multiple-outputs) +3. 
[CNN for time series data: predicting stock returns](#cnn-for-time-series-data-predicting-stock-returns) + * [Code example: building an autoregressive CNN with 1D convolutions](#code-example-building-an-autoregressive-cnn-with-1d-convolutions) + * [Code example: CNN-TA - clustering financial time series in 2D image format](#code-example-cnn-ta---clustering-financial-time-series-in-2d-image-format) + - [Creating the 2D time series of financial indicators](#creating-the-2d-time-series-of-financial-indicators) + - [Select and cluster the most relevant features](#select-and-cluster-the-most-relevant-features) + - [Create and train a convolutional neural network](#create-and-train-a-convolutional-neural-network) + - [Backtesting a long-short trading strategy](#backtesting-a-long-short-trading-strategy) + +## How CNNs learn to model grid-like data CNNs are conceptually similar to the feedforward NNs we covered in the previous chapter. They consist of units that contain parameters called weights and biases, and the training process adjusts these parameters to optimize the network’s output for a given input. Each unit applies its parameters to a linear operation on the input data or activations received from other units, possibly followed by a non-linear transformation. @@ -22,14 +46,22 @@ CNNs differ because they encode the assumption that the input has a structure mo The most important element to encode the assumption of a grid-like topology is the convolution operation that gives CNNs their name, combined with pooling. We will see that the specific assumptions about the functional relationship between input and output data implies that CNNs need far fewer parameters and compute more efficiently. 
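
To make the parameter savings concrete, here is a back-of-the-envelope sketch that compares the number of trainable parameters in a fully-connected layer and a convolutional layer applied to the same input. The input size (a 32x32 RGB image), the 64 units or filters, and the 3x3 kernel are illustrative assumptions, and the helper functions are hypothetical.

```python
# Parameter counts for a dense layer vs. a 2D convolutional layer.
# Input shape, unit count, and kernel size are illustrative assumptions.

def dense_layer_params(n_inputs, n_units):
    """One weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


def conv2d_layer_params(kernel_height, kernel_width, in_channels, n_filters):
    """Each filter shares its weights across all image locations."""
    return kernel_height * kernel_width * in_channels * n_filters + n_filters


height, width, channels = 32, 32, 3
n_inputs = height * width * channels  # flattened image for the dense layer

dense_params = dense_layer_params(n_inputs, n_units=64)
conv_params = conv2d_layer_params(3, 3, channels, n_filters=64)

print(dense_params, conv_params)
```

The two layers produce outputs of different shapes, so this only compares parameter counts: the convolutional layer needs far fewer because its filter weights are reused at every position of the grid.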
-### How Convolutional Layers work +### Code example: From hand-coding to learning and synthesizing filters from data + +For image data, this local structure has traditionally motivated the development of hand-coded filters that extract such patterns for use as features in machine learning models. +- The notebook [filter_example](01_filter_example.ipynb) illustrates how to use hand-coded filters in a convolutional network and visualize the resulting transformation of the image. +- See [Interpretability of Deep Learning Models with Tensorflow 2.0](https://www.sicara.ai/blog/2019-08-28-interpretability-deep-learning-tensorflow) for an example visualization of the patterns learned by CNN filters. + +### How the key elements of a convolutional layer operate Fully-connected feedforward NNs make no assumptions about the topology, or local structure, of the input data so that arbitrarily reordering the features has no impact on the training result. For many data sources, however, local structure is quite significant. Examples include autocorrelation in time series or the spatial correlation among pixel values due to common patterns like edges or corners. For image data, this local structure has traditionally motivated the development of hand-coded filter methods that extract local patterns for use as features in machine learning models. - [Deep Learning](http://www.deeplearningbook.org/contents/convnets.html), Chapter 9, Convolutional Networks, Ian Goodfellow et al, MIT Press, 2016 +- [CS231n: Convolutional Neural Networks for Visual Recognition](http://cs231n.stanford.edu/syllabus.html), Stanford’s deep learning course. Helpful for building foundations, with engaging lectures and illustrative problem sets. 
- [Convolutional Neural Networks (CNNs / ConvNets)](http://cs231n.github.io/convolutional-networks/#conv), Module 2 in CS231n Convolutional Neural Networks for Visual Recognition, Lecture Notes by Andrej Karpathy, Stanford, 2016 +- [ImageNet Large Scale Visual Recognition Challenge (ILSVRC)](http://www.image-net.org/challenges/LSVRC/) - [Convnet Benchmarks](https://github.com/soumith/convnet-benchmarks), Benchmarking of all publicly accessible implementations of convnets - [ConvNetJS](https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html), ConvNetJS CIFAR-10 demo in the browser by Andrej Karpathy - [An Interactive Node-Link Visualization of Convolutional Neural Networks](http://scs.ryerson.ca/~aharley/vis/), interactive CNN visualization @@ -37,11 +69,6 @@ For many data sources, however, local structure is quite significant. Examples i - [Understanding Convolutions](http://colah.github.io/posts/2014-07-Understanding-Convolutions/), Christopher Olah, 2014 - [Multi-Scale Context Aggregation by Dilated Convolutions](https://arxiv.org/abs/1511.07122), Fisher Yu, Vladlen Koltun, ICLR 2016 -#### Code examples - -- The notebook [filter_example](01_filter_example.ipynb) illustrates how to use hand-coded filters in a convolutional network and visualize the resulting transformation of the image. -- See [Interpretability of Deep Learning Models with Tensorflow 2.0](https://www.sicara.ai/blog/2019-08-28-interpretability-deep-learning-tensorflow) for an example visualization of the patterns learned by CNN filters. - ### Computer Vision Tasks Image classification is a fundamental computer vision task that requires labeling an image based on certain objects it contains. Many practical applications, including investment and trading strategies, require additional information. 
@@ -54,7 +81,9 @@ Image classification is a fundamental computer vision task that requires labelin - [Playing around with RCNN](https://cs.stanford.edu/people/karpathy/rcnn/), Andrej Karpathy, Stanford - [R-CNN, Fast R-CNN, Faster R-CNN, YOLO — Object Detection Algorithms](https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e), Rohith Gandhi, 2018 -### Reference Architectures & Benchmarks +### The evolution of CNN architectures: key innovations + +Several CNN architectures have pushed performance boundaries over the past two decades by introducing important innovations. Predictive performance growth accelerated dramatically with the arrival of big data in the form of ImageNet (Fei-Fei 2015) with 14 million images assigned to 20,000 classes by humans via Amazon’s Mechanical Turk. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) became the focal point of CNN progress around a slightly smaller set of 1.2 million images from 1,000 classes. - [Fully Convolutional Networks for Semantic Segmentation](https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf), Long et al, Berkeley - [Mask R-CNN](https://arxiv.org/abs/1703.06870), Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick, arxiv, 2017 @@ -72,32 +101,29 @@ Image classification is a fundamental computer vision task that requires labelin - [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167), Sergey Ioffe, Christian Szegedy, arxiv 2015 - [An Overview of ResNet and its Variants](https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035), Vincent Fung, 2017 +## CNN for Images: From Satellite Data to Object Detection -## How to design and train a CNN using Python - -### LeNet5 and MNIST using Keras +This section demonstrates how to solve key computer vision tasks such as image classification and object detection. 
As mentioned in the introduction and in Chapter 3 on alternative data, image data can inform a trading strategy by providing clues about future trends, changing fundamentals, or specific events relevant for a target asset class or investment universe. Popular examples include exploiting satellite images for clues about the supply of agricultural commodities, consumer and economic activity, or the status of manufacturing or raw material supply chains. Specific tasks might include, for example: +- Image classification: identify whether cultivated land for certain crops is expanding or predict harvest quality and quantities, or +- Object detection: count the number of oil tankers on a certain transport route or the number of cars in a parking lot, or identify the location of shoppers in a mall. -All libraries we introduced in the last chapter provide support for convolutional layers. The notebook [mnist_with_ffnn_and_lenet5](02_mnist_with_ffnn_and_lenet5.ipynb) illustrates the LeNet5 architecture using the most basic MNIST handwritten digit dataset, and then use AlexNet on CIFAR10, a simplified version of the original ImageNet to demonstrate the use of data augmentation. +### Code example: LeNet5: The first CNN with industrial applications -### AlexNet and CIFAR10 with Keras +All libraries we introduced in the last chapter provide support for convolutional layers. -Fast-forward to 2012, and we move on to the deeper and more modern AlexNet architecture. We will use the CIFAR10 dataset that uses 60,000 ImageNet samples, compressed to 32x32 pixel resolution (from the original 224x224), but still with three color channels. There are only 10 of the original 1,000 classes. See the notebook [cifar10_image_classification](03_cifar10_image_classification.ipynb) for implementation. 
+The notebook [digit_classification_with_lenet5](02_digit_classification_with_lenet5.ipynb) illustrates the LeNet5 architecture using the most basic MNIST handwritten digit dataset. -### How to use CNN with time series data +### Code example: AlexNet - reigniting deep learning research -The regular measurements of time series result in a similar grid-like data structure as for the image data we have focused on so far. As a result, we can use CNN architectures for univariate and multivariate time series. In the latter case, we consider different time series as channels, similar to the different color signals. +Fast-forward to 2012, and we move on to the deeper and more modern AlexNet architecture. We will use the CIFAR10 dataset that uses 60,000 ImageNet samples, compressed to 32x32 pixel resolution (from the original 224x224), but still with three color channels. There are only 10 of the original 1,000 classes. -The notebook [cnn_with_time_series](04_cnn_with_time_series.ipynb) illustrates the time series use case with the univariate asset price forecast example we introduced in the last chapter. Recall that we create rolling monthly stock returns and use the 24 lagged returns alongside one-hot-encoded month information to predict whether the subsequent monthly return is positive or negative. +See the notebook [image_classification_with_alexnet](03_image_classification_with_alexnet.ipynb) for implementation, including the use of data augmentation. -## Transfer Learning +### Code example: transfer learning with VGG16 in practice In practice, we often do not have enough data to train a CNN from scratch with random initialization. Transfer learning is a machine learning technique that repurposes a model trained on one set of data for another task. Naturally, it works if the learning from the first task carries over to the task of interest. 
If successful, it can lead to better performance and faster training that requires less labeled data than training a neural network from scratch on the target task. -- [Building powerful image classification models using very little data](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html) -- [How transferable are features in deep neural networks?](https://papers.nips.cc/paper/5347-how-transferable-are-features-in-deep-neural-networks.pdf), Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson, NIPS, 2014 -- [PyTorch Transfer Learning Tutorial](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html) - -### How to build on a pre-trained CNN +TensorFlow 2, for example, contains pre-trained models for several of the reference architectures discussed previously, namely VGG16 and its larger version VGG19, ResNet50, InceptionV3, and InceptionResNetV2, as well as MobileNet, DenseNet, NASNet, and MobileNetV2. The transfer learning approach to CNN relies on pre-training on a very large dataset like ImageNet. The goal is that the convolutional filters extract a feature representation that generalizes to new images. In a second step, it leverages the result to either initialize and retrain a new CNN or as inputs to a new network that tackles the task of interest. @@ -109,39 +135,84 @@ Alternatively, we can use the bottleneck features as inputs into a different mac Alternatively, we can go a step further and not only replace and retrain the classifier on top of the CNN using new data but also fine-tune the weights of the pre-trained CNN. To achieve this, we continue training, but only for later layers, while freezing the weights of some earlier layers. The motivation is to preserve presumably more generic patterns learned by lower layers, such as edge or color blob detectors, while allowing later layers of the CNN to adapt to the details of a new task. 
ImageNet, e.g., contains a wide variety of dog breeds which may lead to feature representations specifically useful for differentiating between these classes. -### How to extract bottleneck features +- [Building powerful image classification models using very little data](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html) +- [How transferable are features in deep neural networks?](https://papers.nips.cc/paper/5347-how-transferable-are-features-in-deep-neural-networks.pdf), Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson, NIPS, 2014 +- [PyTorch Transfer Learning Tutorial](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html) -Modern CNNs can take weeks to train on multiple GPUs on ImageNet, but fortunately, many researchers share their final weights. Keras, e.g., contains pre-trained models for several of the reference architectures discussed above, namely VGG16 and 19, ResNet50, InceptionV3 and InceptionResNetV2, MobileNet, DenseNet, NASNet and MobileNetV2 +#### How to extract bottleneck features -The notebook [bottleneck_features](05_bottleneck_features.ipynb) illustrates how to download pre-trained VGG16 model, either with the final layers to generate predictions or without the final layers as illustrated in the figure below to extract the outputs produced by the bottleneck features. +The notebook [bottleneck_features](09_bottleneck_features.ipynb) illustrates how to download the pre-trained VGG16 model, either with the final layers to generate predictions or without the final layers to extract the outputs produced by the bottleneck features. -### How to further train a pre-trained model +#### How to fine-tune a pre-trained model -The notebook [transfer_learning](06_transfer_learning.ipynb) demonstrates how to freeze some or all of the layers of a pre-trained model and continue training using a new fully-connected set of layers and data with a different format. 
+The notebook [transfer_learning](10_transfer_learning.ipynb), adapted from a TensorFlow 2 tutorial, demonstrates how to freeze some or all of the layers of a pre-trained model and continue training using a new fully-connected set of layers and data with a different format. -## How to detect objects +### Code example: identifying land use with satellite images using transfer learning -Object detection requires the ability to distinguish between several classes of objects and to decide how many and which of these objects are present in an image. +Satellite images figure prominently among alternative data (see [Chapter 3](../03_alternative_data)). For instance, commodity traders may rely on satellite images to predict the supply of certain crops or activity at mining sites, oil or tanker traffic. -### Google Street View Housenumber Dataset +To illustrate working with this type of data, we load the [EuroSat dataset](https://arxiv.org/abs/1709.00029) included in the TensorFlow 2 datasets (Helber et al. 2017). The EuroSat dataset includes around 27,000 images in 64x64 format that represent 10 different types of land uses. + +The notebook [satellite_images](11_satellite_images.ipynb) downloads the [DenseNet201](https://www.tensorflow.org/api_docs/python/tf/keras/applications/DenseNet201) architecture from `tensorflow.keras.applications` and replaces its final layers. - A prominent example is Ian Goodfellow’s identification of house numbers from Google’s street view dataset. It requires to identify +We use 10 percent of the training images for validation purposes and achieve the best out-of-sample classification accuracy of 97.96 percent after ten epochs. This exceeds the performance cited in the original paper for the best performing ResNet-50 architecture with 90-10 split. 
+ +### Code example: object detection in practice with Google Street View House Numbers + +Object detection requires the ability to distinguish between several classes of objects and to decide how many and which of these objects are present in an image. + +A prominent example is Ian Goodfellow’s identification of house numbers from Google’s street view dataset. It requires identifying - how many of up to five digits make up the house number, - The correct digit for each component, and - The proper order of the constituent digits. -The notebooks [svhn_preprocessing](07_svhn_preprocessing.ipynb) contains code to produce a simplified, cropped dataset that uses bounding box information to create regularly shaped 32x32 images containing the digits; the original images are of arbitrary shape. +See the [data](../data) directory for instructions on obtaining the dataset. -The notebook [svhn_object_detection](08_svhn_object_detection.ipynb) goes on to illustrate how to build a deep CNN using Keras’ functional API to generate multiple outputs: one to predict how many digits are present, and five for the value of each in the order they appear. +#### Preprocessing the source images -## Capsule Nets +The notebook [svhn_preprocessing](12_svhn_preprocessing.ipynb) contains code to produce a simplified, cropped dataset that uses bounding box information to create regularly shaped 32x32 images containing the digits; the original images are of arbitrary shape. -- [Dynamic Routing Between Capsules](https://arxiv.org/abs/1710.09829), Sara Sabour, Nicholas Frosst, Geoffrey E Hinton, arxiv, 2017 +#### Transfer learning with a custom final layer for multiple outputs -## Resources +The notebook [svhn_object_detection](13_svhn_object_detection.ipynb) goes on to illustrate how to build a deep CNN using Keras’ functional API to generate multiple outputs: one to predict how many digits are present, and five for the value of each in the order they appear. 
-- [CS231n: Convolutional Neural Networks for Visual Recognition](http://cs231n.stanford.edu/syllabus.html), Stanford’s deep learning course. Helpful for building foundations, with engaging lectures and illustrative problem sets. -- [ImageNet Large Scale Visual Recognition Challenge (ILSVRC)](http://www.image-net.org/challenges/LSVRC/) +## CNN for time series data: predicting stock returns + +CNNs were originally developed to process image data and have achieved superhuman performance on various computer vision tasks. As discussed in the first section, time series data has a grid-like structure similar to that of images, and CNNs have been successfully applied to one-, two-, and three-dimensional representations of temporal data. + +The application of CNNs to time series will most likely bear fruit if the data meets the model’s key assumption that local patterns or relationships help predict the outcome. In the time-series context, local patterns could be autocorrelation or similar non-linear relationships at relevant intervals. Along the second and third dimensions, local patterns imply systematic relationships among different components of a multivariate series or among these series for different tickers. Since locality matters, it is important that the data is organized accordingly, in contrast to feed-forward networks where shuffling the elements of any dimension does not negatively affect the learning process. + +### Code example: building an autoregressive CNN with 1D convolutions + +We will introduce the time series use case for CNNs with a univariate autoregressive asset return model. More specifically, the model receives the most recent 12 months of returns and uses a single layer of one-dimensional convolutions to predict the subsequent month. + +The notebook [time_series_prediction](04_time_series_prediction.ipynb) illustrates the time series use case with the univariate asset price forecast example we introduced in the last chapter.
Recall that we create rolling monthly stock returns and use the 24 lagged returns alongside one-hot-encoded month information to predict whether the subsequent monthly return is positive or negative. + +### Code example: CNN-TA - clustering financial time series in 2D image format + +To exploit the grid-like structure of time-series data, we can use CNN architectures for univariate and multivariate time series. In the latter case, we consider different time series as channels, similar to the different color signals. + +An alternative approach converts a time series of alpha factors into a two-dimensional format to leverage the ability of CNNs to detect local patterns. [Sezer and Ozbayoglu](https://www.sciencedirect.com/science/article/abs/pii/S1568494618302151) (2018) propose [CNN-TA](https://github.com/omerbsezer/CNN-TA) that computes 15 technical indicators for different intervals and uses hierarchical clustering (see Chapter 13) to locate indicators that behave similarly close to each other in a 2D grid. + +#### Creating the 2D time series of financial indicators + +The notebook [engineer_cnn_features](05_engineer_cnn_features.ipynb) creates technical indicators at different intervals. + +#### Select and cluster the most relevant features + +The notebook [convert_cnn_features_to_image_format](06_convert_cnn_features_to_image_format.ipynb) selects the 15 most relevant features from the 20 candidates to fill the 15⨉15 input grid and then applies hierarchical clustering. + +#### Create and train a convolutional neural network + +Now we are ready to design, train and evaluate a CNN following the steps outlined in the previous section. The notebook [cnn_for_trading](07_cnn_for_trading.ipynb) contains the relevant code examples. 
+ +#### Backtesting a long-short trading strategy + +To get a sense of the signal quality, we compute the spread between equal-weighted portfolios invested in stocks selected according to the signal quintiles using [Alphalens](https://github.com/quantopian/alphalens) (see [Chapter 4](../04_alpha_factor_research)). + +
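The quintile spread described above can be illustrated without Alphalens. The following is a minimal pure-Python sketch for a single period, not the Alphalens API; the function name `quintile_spread` and the toy signal and return values are assumptions made for this example.

```python
from statistics import mean


def quintile_spread(signals, forward_returns, n_bins=5):
    """Return mean(top-bin returns) - mean(bottom-bin returns).

    Hypothetical helper for illustration only; Alphalens computes
    quantile returns per period and offers many more options.
    """
    if len(signals) != len(forward_returns):
        raise ValueError("signals and forward_returns must align")
    if len(signals) < n_bins:
        raise ValueError("need at least one asset per bin")
    # Rank assets by signal, lowest first.
    order = sorted(range(len(signals)), key=lambda i: signals[i])
    size = len(order) // n_bins  # assets per equal-weighted portfolio
    bottom = [forward_returns[i] for i in order[:size]]
    top = [forward_returns[i] for i in order[-size:]]
    return mean(top) - mean(bottom)


# Toy cross-section for one period: ten stocks with a signal and a realized return.
signals = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 0.0]
returns = [0.04, -0.02, 0.01, 0.02, 0.00, 0.03, -0.01, 0.01, 0.00, -0.03]
spread = quintile_spread(signals, returns)
```

A positive spread means the top-quintile portfolio outperformed the bottom-quintile portfolio in that period; a real backtest would repeat this per period and account for trading costs.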

+ +

diff --git a/19_recurrent_neural_nets/README.md b/19_recurrent_neural_nets/README.md index 6ae08687a..d8398be14 100644 --- a/19_recurrent_neural_nets/README.md +++ b/19_recurrent_neural_nets/README.md @@ -3,19 +3,36 @@ The major innovation of RNN is that each output is a function of both previous output and new data. As a result, RNN gain the ability to incorporate information on previous observations into the computation it performs on a new feature vector, effectively creating a model with memory. This recurrent formulation enables parameter sharing across a much deeper computational graph that includes cycles. Prominent architectures include Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) that aim to overcome the challenge of vanishing gradients associated with learning long-range dependencies, where errors need to be propagated over many connections. RNNs have been successfully applied to various tasks that require mapping one or more input sequences to one or more output sequences and are particularly well suited to natural language. RNN can also be applied to univariate and multivariate time series to predict market or fundamental data. This chapter covers how RNN can model alternative text data using the word embeddings that we covered in [Chapter 16](16_word_embeddings) to classify the sentiment expressed in documents. Most specifically, this chapter addresses: -- How to unroll and analyze the computational graph for an RNN -- How gated units learn to regulate an RNN’s memory from data to enable long-range dependencies -- How to design and train RNN for univariate and multivariate time series in Python -- How to leverage word embeddings for sentiment analysis with RNN - - -## How RNN work - -RNNs assume that data is sequential so that previous data points impact the current observation and are relevant for predictions of subsequent elements in the sequence. 
-They allow for more complex and diverse input-output relationships than feedforward networks (FFNN) and convolutional nets that are designed to map one input to one output vector, usually of fixed size and using a given number of computational steps. RNN, in contrast, can model data for tasks where the input, the output or both are best represented as a sequence of vectors. - -Note that input and output sequences can be of arbitrary lengths because the recurrent transformation that is fixed but learned from the data can be applied as many times as needed. -Just as CNN easily scale to large images and some CNN can process images of variable size, RNN scale to much longer sequences than networks not tailored to sequence-based tasks. Most RNN can also process sequences of variable length. +- How recurrent connections allow RNNs to memorize patterns and model a hidden state +- Unrolling and analyzing the computational graph of RNNs +- How gated units learn to regulate RNN memory from data to enable long-range dependencies +- Designing and training RNNs for univariate and multivariate time series in Python +- How to learn word embeddings or use pretrained word vectors for sentiment analysis with RNNs +- Building a bidirectional RNN to predict stock returns using custom word embeddings + +## Content + +1. [How recurrent neural nets work](#how-recurrent-neural-nets-work) + * [Backpropagation through Time](#backpropagation-through-time) + * [Alternative RNN Architectures](#alternative-rnn-architectures) + - [Long-Short Term Memory](#long-short-term-memory) + - [Gated Recurrent Units](#gated-recurrent-units) +2. 
[RNN for financial time series with TensorFlow 2](#rnn-for-financial-time-series-with-tensorflow-2) + * [Code example: Univariate time-series regression: predicting the S&P 500](#code-example-univariate-time-series-regression-predicting-the-sp-500) + * [Code example: Stacked LSTM for predicting weekly stock price moves and returns](#code-example-stacked-lstm-for-predicting-weekly-stock-price-moves-and-returns) + * [Code example: Predicting returns instead of directional price moves](#code-example-predicting-returns-instead-of-directional-price-moves) + * [Code example: Multivariate time-series regression for macro data](#code-example-multivariate-time-series-regression-for-macro-data) +3. [RNN for text data: sentiment analysis and return prediction](#rnn-for-text-data-sentiment-analysis-and-return-prediction) + * [Code example: LSTM with custom word embeddings for sentiment classification](#code-example-lstm-with-custom-word-embeddings-for-sentiment-classification) + * [Code example: Sentiment analysis with pretrained word vectors](#code-example-sentiment-analysis-with-pretrained-word-vectors) + * [Code example: SEC filings for a bidirectional RNN GRU to predict weekly returns](#code-example-sec-filings-for-a-bidirectional-rnn-gru-to-predict-weekly-returns) + +## How recurrent neural nets work + +RNNs assume that the input data has been generated as a sequence such that previous data points impact the current observation and are relevant for predicting subsequent values. Thus, they allow for more complex input-output relationships than FFNNs and CNNs, which are designed to map one input vector to one output vector using a given number of computational steps. +RNNs, in contrast, can model data for tasks where the input, the output, or both, are best represented as a sequence of vectors. 
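To make the recurrence concrete, here is a minimal sketch of a scalar recurrent unit in plain Python. It is an illustration under simplifying assumptions (one input feature, one hidden unit, hand-picked weights), and the function name `rnn_forward` is hypothetical rather than taken from the notebooks.

```python
import math


def rnn_forward(inputs, w_x, w_h, bias, h0=0.0):
    """Scalar recurrence: h_t = tanh(w_x * x_t + w_h * h_{t-1} + bias)."""
    h = h0
    states = []
    for x in inputs:
        # The same weights are reused at every step (parameter sharing).
        h = math.tanh(w_x * x + w_h * h + bias)
        states.append(h)
    return states


# Sequences of different lengths pass through the same function.
short = rnn_forward([1.0, 0.0], w_x=0.5, w_h=0.8, bias=0.0)
longer = rnn_forward([1.0, 0.0, 0.0, 0.0], w_x=0.5, w_h=0.8, bias=0.0)
```

Although only the first input is non-zero, the later hidden states remain positive: the hidden state carries information about earlier observations forward, which is the sense in which the model has memory.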
+ +For a thorough overview, see [Chapter 10](https://www.deeplearningbook.org/contents/rnn.html) in [Deep Learning](https://www.deeplearningbook.org/) by Goodfellow, Bengio, and Courville (2016). ### Backpropagation through Time @@ -28,11 +45,11 @@ The backpropagation algorithm that updates the weight parameters based on the gr - [Tutorial on LSTM Recurrent Networks](http://people.idsia.ch/~juergen/lstm/sld001.htm), Juergen Schmidhuber, 2003 - [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) -## Alternative RNN Architectures +### Alternative RNN Architectures RNNs can be designed in a variety of ways to best capture the functional relationship and dynamic between input and output data. In addition to the recurrent connections between the hidden states, there are several alternative approaches, including recurrent output relationships, bidirectional RNN, and encoder-decoder architectures. -### Long-Short Term Memory +#### Long-Short Term Memory RNNs with an LSTM architecture have more complex units that maintain an internal state and contain gates to keep track of dependencies between elements of the input sequence and regulate the cell’s state accordingly. These gates recurrently connect to each other instead of the usual hidden units we encountered above. They aim to address the problem of vanishing and exploding gradients by letting gradients pass through unchanged. @@ -41,49 +58,67 @@ A typical LSTM unit combines four parameterized layers that interact with each o - [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/), Christopher Olah, 2015 - [An Empirical Exploration of Recurrent Network Architectures](http://proceedings.mlr.press/v37/jozefowicz15.pdf), Rafal Jozefowicz, Ilya Sutskever, et al, 2015 -### Gated Recurrent Units +#### Gated Recurrent Units Gated recurrent units (GRU) simplify LSTM units by omitting the output gate.
They have been shown to achieve similar performance on certain language modeling tasks but do better on smaller datasets. - [Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation](https://arxiv.org/pdf/1406.1078.pdf), Kyunghyun Cho, Yoshua Bengio, et al 2014 - [Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling](https://arxiv.org/abs/1412.3555), Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio, 2014 -## How to build and train an RNN using Python +## RNN for financial time series with TensorFlow 2 We illustrate how to build RNN using the Keras library for various scenarios. The first set of models includes regression and classification of univariate and multivariate time series. The second set of tasks focuses on text data for sentiment analysis using text data converted to word embeddings (see [Chapter 15](../15_word_embeddings)). +- [Recurrent Neural Networks (RNN) with Keras](https://www.tensorflow.org/guide/keras/rnn) +- [Time series forecasting](https://www.tensorflow.org/tutorials/structured_data/time_series) - [Keras documentation](https://keras.io/getting-started/sequential-model-guide/) -- [LSTM documentation](https://keras.io/layers/recurrent/) -- [Keras-recommended approach for RNNs](https://keras.io/optimizers/) (use RMSProp) +- [LSTM documentation](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) +- [Working with RNNs](https://keras.io/guides/working_with_rnns/) by Scott Zhu and Francois Chollet -### Univariate Time Series Regression +### Code example: Univariate time-series regression: predicting the S&P 500 The notebook [univariate_time_series_regression](01_univariate_time_series_regression.ipynb) demonstrates how to get data into the requisite shape and how to forecast the S&P 500 index values using a Recurrent Neural Network. 
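Keras recurrent layers expect three-dimensional input of shape (samples, timesteps, features). The sketch below shows one way to cut a univariate series into such windows with next-step targets, using nested lists instead of arrays to stay dependency-free; the helper name `make_windows` and the toy values are assumptions for illustration, not the notebook's code.

```python
def make_windows(series, window):
    """Split a 1-D series into (samples, timesteps, 1) inputs and next-step targets."""
    if window < 1 or window >= len(series):
        raise ValueError("window must be in [1, len(series) - 1]")
    X, y = [], []
    for start in range(len(series) - window):
        # Each timestep holds a single feature, hence the inner one-element list.
        X.append([[value] for value in series[start:start + window]])
        y.append(series[start + window])
    return X, y


prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
X, y = make_windows(prices, window=3)
# X holds 3 samples, each with 3 timesteps and 1 feature.
```

In practice the nested lists would be converted to an array before being passed to a model, and the series would typically be rescaled first.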
-### Stacked LSTMs for time series classification +### Code example: Stacked LSTM for predicting weekly stock price moves and returns + +We'll now build a slightly deeper model by stacking two LSTM layers using the Quandl stock price data. Furthermore, we will include features that are not sequential in nature, namely indicator variables that identify the ticker and time periods like month and year. +- See the [stacked_lstm_with_feature_embeddings](02_stacked_lstm_with_feature_embeddings.ipynb) notebook for implementation details. + +### Code example: Predicting returns instead of directional price moves -We'll now build a slightly deeper model by stacking two LSTM layers using the Quandl stock price data (see the [stacked_lstm_with_feature_embeddings](02_stacked_lstm_with_feature_embeddings.ipynb) notebook for implementation details). Furthermore, we will include features that are not sequential in nature, namely indicator variables that identify the ticker and time periods like month and year. +The notebook [stacked_lstm_with_feature_embeddings_regression](03_stacked_lstm_with_feature_embeddings_regression.ipynb) illustrates how to adapt the model to the regression task of predicting returns rather than binary price changes. -### Multivariate Time Series Regression +### Code example: Multivariate time-series regression for macro data -So far, we have limited our modeling efforts to single time series. RNNs are naturally well suited to multivariate time series and represent a non-linear alternative to the Vector Autoregressive (VAR) models we covered in [Chapter 8, Time Series Models](../08_time_series_models). +So far, we have limited our modeling efforts to single time series. RNNs are naturally well suited to multivariate time series and represent a non-linear alternative to the Vector Autoregressive (VAR) models we covered in [Chapter 9, Time Series Models](../09_time_series_models). 
-The notebook [multivariate_timeseries](03_multivariate_timeseries.ipynb) demonstrates the application of RNNs to modeling and forecasting several time series using the same dataset we used for the [VAR example](../08_time_series_models/03_vector_autoregressive_model.ipynb), namely monthly data on consumer sentiment, and industrial production from the Federal Reserve's FRED service. +The notebook [multivariate_timeseries](04_multivariate_timeseries.ipynb) demonstrates the application of RNNs to modeling and forecasting several time series using the same dataset we used for the [VAR example](../09_time_series_models/04_vector_autoregressive_model.ipynb), namely monthly data on consumer sentiment, and industrial production from the Federal Reserve's FRED service. -### LSTM & Word Embeddings for Sentiment Classification +## RNN for text data: sentiment analysis and return prediction -RNNs are commonly applied to various natural language processing tasks. We've already encountered sentiment analysis using text data in part three of [this book](https://www.amazon.com/Hands-Machine-Learning-Algorithmic-Trading-ebook/dp/B07JLFH7C5/ref=sr_1_2?ie=UTF8&qid=1548455634&sr=8-2&keywords=machine+learning+algorithmic+trading). +### Code example: LSTM with custom word embeddings for sentiment classification -The notebook [sentiment_analysis](04_sentiment_analysis.ipynb) illustrates how to apply an RNN model to text data to detect positive or negative sentiment (which can easily be extended to a finer-grained sentiment scale). We are going to use word embeddings to represent the tokens in the documents. We covered word embeddings in [Chapter 15, Word Embeddings](../15_word_embeddings). They are an excellent technique to convert text into a continuous vector representation such that the relative location of words in the latent space encodes useful semantic aspects based on the words' usage in context. +RNNs are commonly applied to various natural language processing tasks. 
We've already encountered sentiment analysis using text data in part three of [this book](https://www.amazon.com/Machine-Learning-Algorithmic-Trading-alternative/dp/1839217715?pf_rd_r=VMKJPZC4N36TTZZCWATP&pf_rd_p=c5b6893a-24f2-4a59-9d4b-aff5065c90ec&pd_rd_r=8f331266-0d21-4c76-a3eb-d2e61d23bb31&pd_rd_w=kVGNF&pd_rd_wg=LYLKH&ref_=pd_gw_ci_mcx_mr_hp_d). -In this example, we again use Keras' built-in embedding layer that allows us to train vector representations specific to the task at hand. In the next example, we use pretrained vectors instead. +This example shows how to learn custom embedding vectors while training an RNN on the classification task. This differs from the word2vec model that learns vectors while optimizing predictions of neighboring tokens, resulting in their ability to capture certain semantic relationships among words (see Chapter 16). Learning word vectors with the goal of predicting sentiment implies that embeddings will reflect how a token relates to the outcomes it is associated with. +The notebook [sentiment_analysis_imdb](05_sentiment_analysis_imdb.ipynb) illustrates how to apply an RNN model to text data to detect positive or negative sentiment (which can easily be extended to a finer-grained sentiment scale). We are going to use word embeddings to represent the tokens in the documents. We covered word embeddings in [Chapter 16, Word Embeddings](../16_word_embeddings). They are an excellent technique to convert text into a continuous vector representation such that the relative location of words in the latent space encodes useful semantic aspects based on the words' usage in context. -### How to use pre-trained word embeddings +### Code example: Sentiment analysis with pretrained word vectors In [Chapter 15, Word Embeddings](../15_word_embeddings), we showed how to learn domain-specific word embeddings. Word2vec, and related learning algorithms, produce high-quality word vectors, but require large datasets.
Hence, it is common that research groups share word vectors trained on large datasets, similar to the weights for pretrained deep learning models that we encountered in the section on transfer learning in the [previous chapter](../17_convolutional_neural_nets). -The notebook [sentiment_analysis_pretrained_embeddings](05_sentiment_analysis_pretrained_embeddings.ipynb) illustrates how to use pretrained Global Vectors for Word Representation (GloVe) provided by the Stanford NLP group with the IMDB review dataset. +The notebook [sentiment_analysis_pretrained_embeddings](06_sentiment_analysis_pretrained_embeddings.ipynb) illustrates how to use pretrained Global Vectors for Word Representation (GloVe) provided by the Stanford NLP group with the IMDB review dataset. - [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/), Stanford AI Group -- [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/), Stanford NLP \ No newline at end of file +- [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/), Stanford NLP + +### Code example: SEC filings for a bidirectional RNN GRU to predict weekly returns + +In Chapter 16, we discussed important differences between product reviews and financial text data. While the former was useful to illustrate important workflows, in this section, we will tackle more challenging but also more relevant financial documents. + +More specifically, we will use the SEC filings data introduced in [Chapter 16](../16_word_embeddings) to learn word embeddings tailored to predicting the return of the ticker associated with the disclosures from before publication to one week after. + +The notebook [sec_filings_return_prediction](07_sec_filings_return_prediction.ipynb) contains the code examples for this application. 
+ +See the notebook [sec_preprocessing](../16_word_embeddings/06_sec_preprocessing.ipynb) in Chapter 16 and instructions in the data folder on GitHub on how to obtain the data. diff --git a/20_autoencoders_for_conditional_risk_factors/README.md b/20_autoencoders_for_conditional_risk_factors/README.md index a5af3c626..8cd01638a 100644 --- a/20_autoencoders_for_conditional_risk_factors/README.md +++ b/20_autoencoders_for_conditional_risk_factors/README.md @@ -1,116 +1,115 @@ -# Conditional Autoencoders for Asset Pricing and GANs +# Autoencoders for Conditional Risk Factors and Asset Pricing -This chapter presents two unsupervised learning techniques that leverage deep learning: autoencoders, which have been around for decades, and Generative Adversarial Networks (GANs), which were introduced by Ian Goodfellow in 2014 and which Yann LeCun has called the most exciting idea in AI in the last ten years. -- An autoencoder is a neural network trained to reproduce the input while learning a new representation of the data, encoded by the parameters of a hidden layer. Autoencoders have long been used for nonlinear dimensionality reduction and manifold learning. More recently, autoencoders have been designed as generative models that learn probability distributions over observed and latent variables. A variety of designs leverage the feedforward network, Convolutional Neural Network (CNN), and recurrent neural network (RNN) architectures we covered in the last three chapters. -- GANs are a recent innovation that train two neural nets—a generator and a discriminator—in a competitive setting. The generator aims to produce samples that the discriminator is unable to distinguish from a given class of training data. The result is a generative model capable of producing new (fake) samples that are representative of a certain target distribution. GANs have produced a wave of research and can be successfully applied in many domains. 
An example from the medical domain that could potentially be highly relevant for trading is the generation of time-series data that simulates alternative trajectories and can be used to train supervised or reinforcement algorithms. +This chapter shows how unsupervised learning can leverage deep learning for trading. More specifically, we’ll discuss autoencoders that have been around for decades but recently attracted fresh interest. + +An autoencoder is a neural network trained to reproduce the input while learning a new representation of the data, encoded by the parameters of a hidden layer. +Autoencoders have long been used for nonlinear dimensionality reduction and manifold learning (see [Chapter 13](../13_unsupervised_learning)). +A variety of designs leverage the feedforward, convolutional, and recurrent network architectures we covered in the last three chapters. +We will see how autoencoders can underpin a trading strategy: we will build a deep neural network that uses an [autoencoder to extract risk factors](https://www.aqr.com/Insights/Research/Working-Paper/Autoencoder-Asset-Pricing-Models) and predict equity returns, conditioned on a range of equity attributes (Gu, Kelly, and Xiu 2020). More specifically, this chapter covers: - Which types of autoencoders are of practical use and how they work -- How to build and train autoencoders using Python -- How GANs work, why they're useful, and how they could be applied to trading -- How to build GANs using Python - -- [Unsupervised Learning](https://cilvr.nyu.edu/lib/exe/fetch.php?media=deeplearning:2016:lecun-20160308-unssupervised-learning-nyu.pdf), Yann LeCun, 2016 - -## How Autoencoders work - -An autoencoder, in contrast, is a neural network designed exclusively to learn a new representation, that is, an encoding of the input. To this end, the training forces the network to faithfully reproduce the input. 
Since autoencoders typically use the same data as input and output, they are also considered an instance of self-supervised learning. In the process, the parameters of a hidden layer become the code that represents the input. +- Building and training autoencoders using Python +- Using autoencoders to extract data-driven risk factors that take into account asset characteristics to predict returns -- [Autoencoders](http://www.deeplearningbook.org/contents/autoencoders.html), Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning Book, Chapter 14, MIT Press 2016 +## Content -### Nonlinear dimensionality reduction +1. [Autoencoders for nonlinear feature extraction](#autoencoders-for-nonlinear-feature-extraction) + * [Code example: Generalizing PCA with nonlinear dimensionality reduction](#code-example-generalizing-pca-with-nonlinear-dimensionality-reduction) + * [Code example: convolutional autoencoders to compress and denoise images](#code-example-convolutional-autoencoders-to-compress-and-denoise-images) + * [Seq2seq autoencoders to extract time-series features for trading](#seq2seq-autoencoders-to-extract-time-series-features-for-trading) + * [Code example: Variational autoencoders - learning how to generate the input data](#code-example-variational-autoencoders---learning-how-to-generate-the-input-data) +2. 
[Code example: A conditional autoencoder for return forecasts and trading](#code-example-a-conditional-autoencoder-for-return-forecasts-and-trading) + * [Creating a new dataset with stock price and metadata information](#creating-a-new-dataset-with-stock-price-and-metadata-information) + * [Computing predictive asset characteristics](#computing-predictive-asset-characteristics) + * [Creating and training the conditional autoencoder architecture](#creating-and-training-the-conditional-autoencoder-architecture) + * [Evaluating the results](#evaluating-the-results) -A traditional use case includes dimensionality reduction, achieved by limiting the size of the hidden layer so that it performs lossy compression. Such an autoencoder is called undercomplete and the purpose is to force it to learn the most salient properties of the data by minimizing a loss function. In addition to feedforward architectures, autoencoders can also use convolutional layers to learn hierarchical feature representations. +## Autoencoders for nonlinear feature extraction -The powerful capabilities of neural networks to represent complex functions require tight limitations of the capacity of the encoder and decoder to force the extraction of a useful signal rather than noise. In other words, when it is too easy for the network to recreate the input, it fails to learn only the most interesting aspects of the data. This challenge is similar to the overfitting phenomenon that frequently occurs when using models with a high capacity for supervised learning. Just as in these settings, regularization can help by adding constraints to the autoencoder that facilitate the learning of a useful representation. +In Chapter 17, [Deep Learning for Trading](../17_deep_learning), we saw how neural networks succeed at supervised learning by extracting a hierarchical feature representation useful for the given task. 
Convolutional neural networks, e.g., learn and synthesize increasingly complex patterns from grid-like data, for example, to identify or detect objects in an image or to classify time series. +An autoencoder, in contrast, is a neural network designed exclusively to learn a new representation that encodes the input in a way that helps solve another task. To this end, the training forces the network to reproduce the input. Since autoencoders typically use the same data as input and output, they are also considered an instance of self-supervised learning. +In the process, the parameters of a hidden layer h become the code that represents the input, similar to the word2vec model covered in [Chapter 16](../16_word_embeddings). -### Sequence-to-Sequence Autoencoders +For a good overview, see Chapter 14 in Deep Learning: +- [Autoencoders](http://www.deeplearningbook.org/contents/autoencoders.html), Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning Book, MIT Press 2016 -Sequence-to-sequence autoencoders are based on RNN components, such as Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRUs). They learn a compressed representation of sequential data and have been applied to video, text, audio, and time-series data. - -- [A ten-minute introduction to sequence-to-sequence learning in Keras](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html), Francois Chollet, September 2017 -- [Unsupervised Learning of Video Representations using LSTMs](https://arxiv.org/abs/1502.04681), Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov, 2016 - -### Variational Autoencoders - -Variational Autoencoders (VAE) are more recent developments focused on generative modeling. More specifically, VAEs are designed to learn a latent variable model for the input data. Note that we encountered latent variables in Chapter 14, Topic Modeling. 
- -Hence, VAEs do not let the network learn arbitrary functions as long as it faithfully reproduces the input. Instead, they aim to learn the parameters of a probability distribution that generates the input data. In other words, VAEs are generative models because, if successful, you can generate new data points by sampling from the distribution learned by the VAE. - -- [Auto-encoding variational bayes](https://arxiv.org/abs/1312.6114), Diederik P Kingma, Max Welling, 2014 - -## How to build autoencoders using Python - -The Keras library makes it fairly straightforward to build various types of autoencoders and the following examples are adapted from Keras' tutorials. +TensorFlow's Keras interface makes it fairly straightforward to build various types of autoencoders, and the following examples are adapted from Keras' tutorials. - [Building Autoencoders in Keras](https://blog.keras.io/building-autoencoders-in-keras.html) -### Feedforward Autoencoders with Sparsity Constraints +### Code example: Generalizing PCA with nonlinear dimensionality reduction -The notebook [deep_autoencoders](01_deep_autoencoders.ipynb) illustrates how to implement several of the autoencoder models introduced in the preceding section using Keras. This includes autoencoders using deep feedforward nets and sparsity constraints. +A traditional use case includes dimensionality reduction, achieved by limiting the size of the hidden layer so that it performs lossy compression. Such an autoencoder is called undercomplete and the purpose is to force it to learn the most salient properties of the data by minimizing a loss function. In addition to feedforward architectures, autoencoders can also use convolutional layers to learn hierarchical feature representations.
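As a rough illustration of an undercomplete autoencoder, the sketch below trains a linear 2 -> 1 -> 2 network with plain gradient descent on points that lie on a line. It is deliberately tiny and dependency-free, and it is not the notebook's Keras model; the function name `train_linear_autoencoder`, the learning rate, and the toy data are all assumptions chosen for this example.

```python
def train_linear_autoencoder(data, lr=0.05, epochs=500):
    """Fit a 2 -> 1 -> 2 linear autoencoder with plain gradient descent.

    Returns (encoder weights, decoder weights, loss history).
    """
    w = [0.1, 0.1]  # encoder: code h = w . x
    v = [0.1, 0.1]  # decoder: reconstruction = v * h
    n = len(data)
    history = []
    for _ in range(epochs + 1):
        grad_w, grad_v, loss = [0.0, 0.0], [0.0, 0.0], 0.0
        for x in data:
            h = w[0] * x[0] + w[1] * x[1]
            err = [v[0] * h - x[0], v[1] * h - x[1]]
            loss += (err[0] ** 2 + err[1] ** 2) / n
            back = err[0] * v[0] + err[1] * v[1]  # error signal reaching the code
            for j in range(2):
                grad_v[j] += 2 * err[j] * h / n
                grad_w[j] += 2 * back * x[j] / n
        history.append(loss)  # mean squared reconstruction error before the update
        w = [w[j] - lr * grad_w[j] for j in range(2)]
        v = [v[j] - lr * grad_v[j] for j in range(2)]
    return w, v, history


# Two-dimensional points that all lie on the line y = 2x.
points = [(i / 10, 2 * i / 10) for i in range(-10, 11)]
w, v, history = train_linear_autoencoder(points)
```

Because the data has only one direction of variation, a single code unit suffices to reconstruct it, and the decoder weights end up aligned with that direction. This mirrors the known link between linear autoencoders with squared loss and PCA; nonlinear activations and deeper layers generalize the idea.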
-### Convolutional & Denoising Autoencoders +The notebook [deep_autoencoders](01_deep_autoencoders.ipynb) illustrates how to implement several autoencoder models using TensorFlow, including autoencoders using deep feedforward nets and sparsity constraints. + +### Code example: convolutional autoencoders to compress and denoise images -The notebook [convolutional_denoising_autoencoders](02_convolutional_denoising_autoencoders.ipynb) goes on to demonstrate how to implement convolutionals and denoising autencoders to recover corrupted image inputs. +As discussed in Chapter 18, [CNNs: Time Series as Images and Satellite Image Classification](../18_convolutional_neural_nets), fully-connected feedforward architectures are not well suited to capture local correlations typical of data with a grid-like structure. Instead, autoencoders can use convolutional layers to learn a hierarchical feature representation. Convolutional autoencoders leverage convolutions and parameter sharing to learn hierarchical patterns and features irrespective of their location, translation, or changes in size. -### Sequence-to-sequence autoencoders +The notebook [convolutional_denoising_autoencoders](02_convolutional_denoising_autoencoders.ipynb) goes on to demonstrate how to implement convolutional and denoising autoencoders to recover corrupted image inputs. -Sequence-to-sequence autoencoders are based on RNN components like long short-term memory (LSTM) or gated recurrent units (GRUs). They learn a compressed representation of sequential data and have been applied to video, text, audio, and time-series data. +### Seq2seq autoencoders to extract time-series features for trading + +Sequence-to-sequence autoencoders are based on RNN components, such as Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRUs). They learn a compressed representation of sequential data and have been applied to video, text, audio, and time-series data.
+- [A ten-minute introduction to sequence-to-sequence learning in Keras](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html), Francois Chollet, September 2017 +- [Unsupervised Learning of Video Representations using LSTMs](https://arxiv.org/abs/1502.04681), Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov, 2016 - [Gradient Trader Part 1: The Surprising Usefulness of Autoencoders](https://rickyhan.com/jekyll/update/2017/09/14/autoencoders.html) - [Code examples](https://github.com/0b01/recurrent-autoencoder) - [Deep Learning Financial Market Data](http://wp.doc.ic.ac.uk/hipeds/wp-content/uploads/sites/78/2017/01/Steven_Hutt_Deep_Networks_Financial.pdf) - Motivation: Regulators identify prohibited patterns of trading activity detrimental to orderly markets. Financial Exchanges are responsible for maintaining orderly markets. (e.g. Flash Crash and Hound of Hounslow.) - Challenge: Identify prohibited trading patterns quickly and efficiently. - Goal: Build a trading pattern search function using Deep Learning. Given a sample trading pattern identify similar patterns in historical LOB data. -### Variational Autoencoders + - **Goal**: Build a trading pattern search function using Deep Learning. Given a sample trading pattern, identify similar patterns in historical LOB data. + +### Code example: Variational autoencoders - learning how to generate the input data + +Variational Autoencoders (VAE) are more recent developments focused on generative modeling. More specifically, VAEs are designed to learn a latent variable model for the input data. Note that we encountered latent variables in Chapter 15, Topic Modeling. + +Hence, VAEs do not let the network learn arbitrary functions as long as it faithfully reproduces the input. Instead, they aim to learn the parameters of a probability distribution that generates the input data. 
In other words, VAEs are generative models because, if successful, you can generate new data points by sampling from the distribution learned by the VAE. The notebook [variational_autoencoder](03_variational_autoencoder.ipynb) shows how to build a Variational Autoencoder using Keras. +- [Auto-encoding variational bayes](https://arxiv.org/abs/1312.6114), Diederik P Kingma, Max Welling, 2014 - [Tutorial: What is a variational autoencoder?](https://jaan.io/what-is-variational-autoencoder-vae-tutorial/) - - [Variational Autoencoder / Deep Latent Gaussian Model in tensorflow and pytorch](https://github.com/altosaar/variational-autoencoder) +- [Variational Autoencoder / Deep Latent Gaussian Model in tensorflow and pytorch](https://github.com/altosaar/variational-autoencoder) + +## Code example: A conditional autoencoder for return forecasts and trading + +Recent research by [Gu, Kelly, and Xiu](https://www.aqr.com/Insights/Research/Working-Paper/Autoencoder-Asset-Pricing-Models) develops an asset pricing model based on the exposure of securities to risk factors. It builds on the concept of data-driven risk factors that we discussed in Chapter 13 when introducing PCA as well as the risk factor models covered in Chapter 4, Financial Feature Engineering: How to Research Alpha Factors. +The authors aim to show that the asset characteristics used by factor models to capture the systematic drivers of ‘anomalies’ are just proxies for the time-varying exposure to risk factors that cannot be directly measured. +In this context, anomalies are returns in excess of those explained by the exposure to aggregate market risk (see the discussion of the capital asset pricing model in [Chapter 5](../05_strategy_evaluation)). 
-## Generative Adversarial Networks +### Creating a new dataset with stock price and metadata information -The supervised learning algorithms that we focused on for most of this book receive input data that's typically complex and predicts a numerical or categorical label that we can compare to the ground truth to evaluate its performance. These algorithms are also called discriminative models because they learn to differentiate between different output classes. +The reference implementation uses stock price and firm characteristic data for over 30,000 US equities from the Center for Research in Security Prices (CRSP) from 1957-2016 at monthly frequency. It computes 94 metrics that include a broad range of asset attributes suggested as predictive of returns in previous academic research and listed in Green, Hand, and Zhang (2017), who set out to verify these claims. +Since we do not have access to the high-quality but costly CRSP data, we leverage [yfinance](https://github.com/ranaroussi/yfinance) (see Chapter 2, [Market and Fundamental Data: Sources and Techniques](../02_market_and_fundamental_data)) to download price and metadata from Yahoo Finance. There are downsides to choosing free data, including: +- the lack of quality control regarding adjustments, +- survivorship bias because we cannot get data for stocks that are no longer listed, and +- a smaller scope in terms of both the number of equities and the length of their history. -The goal of generative models is to produce complex output, such as realistic images, given simple input, which can even be random numbers. They achieve this by modeling a probability distribution over the possible output. This probability distribution can have many dimensions, for example, one for each pixel in an image or its character or token in a document. As a result, the model can generate output that are very likely representative of the class of output. 
+The notebook [build_us_stock_dataset](04_build_us_stock_dataset.ipynb) contains the relevant code examples for this section. -- [NIPS 2016 Tutorial: Generative Adversarial Networks](https://arxiv.org/pdf/1701.00160.pdf), Ian Goodfellow, 2017 -- [Why is unsupervised learning important?](https://www.quora.com/Why-is-unsupervised-learning-important), Yoshua Bengio on Quora, 2018 +### Computing predictive asset characteristics -### How GANs work +The authors test 94 asset attributes and identify the 20 most influential metrics while asserting that feature importance drops off quickly thereafter. The top 20 stock characteristics fall into three categories, namely: +- Price trend, including (industry) momentum, short- and long-term reversal, or the recent maximum return +- Liquidity such as turnover, dollar volume, or market capitalization +- Risk measures, for instance, total and idiosyncratic return volatility or market beta -- [GAN Lab: Understanding Complex Deep Generative Models using Interactive Visual Experimentation](https://www.groundai.com/project/gan-lab-understanding-complex-deep-generative-models-using-interactive-visual-experimentation/), Minsuk Kahng, Nikhil Thorat, Duen Horng (Polo) Chau, Fernanda B. Viégas, and Martin Wattenberg, IEEE Transactions on Visualization and Computer Graphics, 25(1) (VAST 2018), Jan. 2019 - - [GitHub](https://poloclub.github.io/ganlab/) -- [Generative Adversarial Networks](https://arxiv.org/abs/1406.2661), Ian Goodfellow, et al, 2014 -- [Generative Adversarial Networks: an Overview](https://arxiv.org/pdf/1710.07035.pdf), Antonia Creswell, et al, 2017 -- [Generative Models](https://blog.openai.com/generative-models/), OpenAI Blog +Of these 20, we limit the analysis to 16 for which we have or can approximate the relevant inputs. The notebook [conditional_autoencoder_for_trading_data](05_conditional_autoencoder_for_trading_data.ipynb) demonstrates how to calculate the relevant metrics. 
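
To make the three categories of characteristics more concrete, here is a minimal sketch that computes one simple metric per category from month-end prices and volumes. The definitions are common textbook conventions chosen as assumptions for illustration; they are not necessarily the exact formulas used in the paper or in the notebook, and the helper names are hypothetical.

```python
from statistics import stdev


def monthly_returns(prices):
    """Simple returns between consecutive month-end prices."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]


def momentum_skip_month(prices):
    """Price trend: return from 12 months ago to 1 month ago (skips the latest month)."""
    if len(prices) < 13:
        raise ValueError("need at least 13 month-end prices")
    return prices[-2] / prices[-13] - 1.0


def short_term_reversal(prices):
    """Price trend: the most recent one-month return."""
    return prices[-1] / prices[-2] - 1.0


def return_volatility(prices, window=12):
    """Risk: sample standard deviation of the last `window` monthly returns."""
    return stdev(monthly_returns(prices)[-window:])


def average_dollar_volume(prices, volumes, window=3):
    """Liquidity: mean of price times share volume over the last `window` months."""
    pairs = list(zip(prices, volumes))[-window:]
    return sum(p * v for p, v in pairs) / len(pairs)


# Hypothetical input: 13 month-end prices growing 1% per month, constant volume.
prices = [100.0 * 1.01 ** i for i in range(13)]
volumes = [1_000.0] * 13

print("momentum:", momentum_skip_month(prices))
print("reversal:", short_term_reversal(prices))
print("volatility:", return_volatility(prices))
print("dollar volume:", average_dollar_volume(prices, volumes))
```

In practice these metrics would be computed per stock and per month across the whole universe, typically with vectorized pandas operations rather than plain lists.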
-### Evolution of GAN Architectures +### Creating and training the conditional autoencoder architecture -- [Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (DCGAN)](https://arxiv.org/pdf/1511.06434.pdf), Luke Metz et al, 2016 -- [Conditional Generative Adversarial Net](https://arxiv.org/pdf/1411.1784.pdf), Medhi Mirza and Simon Osindero, 2014 -- [Infogan: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets](https://arxiv.org/pdf/1606.03657.pdf), Xi Chen et al, 2016 -- [Stackgan: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks](https://arxiv.org/pdf/1612.03242.pdf), Shaoting Zhang et al, 2016 -- [Photo-realistic Single Image Super-resolution Using a Generative Adversarial Network](https://arxiv.org/pdf/1609.04802.pdf), Alejando Acosta et al, 2016 -- [Unpaired Image-to-image Translation Using Cycle-consistent Adversarial Networks](https://arxiv.org/pdf/1703.10593.pdf), Juan-Yan Zhu et al, 2018 -- [Learning What and Where to Draw](https://arxiv.org/abs/1610.02454), Scott Reed, et al 2016 -- [Fantastic GANs and where to find them](http://guimperarnau.com/blog/2017/03/Fantastic-GANs-and-where-to-find-them) +The conditional autoencoder proposed by the authors allows for time-varying return distributions that take into account changing asset characteristics. +To this end, they extend standard autoencoder architectures that we discussed in the first section of this chapter to allow for features to shape the encoding. -### Applications of GANs +The notebook [conditional_autoencoder_for_asset_pricing_model](06_conditional_autoencoder_for_asset_pricing_model.ipynb) demonstrates how to create and train this architecture. -- [Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs](https://arxiv.org/abs/1706.02633), Cristóbal Esteban, Stephanie L. 
Hyland, Gunnar Rätsch, 2016 - - [GitHub Repo](https://github.com/ratschlab/RGAN) -- [MAD-GAN: Multivariate Anomaly Detection for Time Series Data with Generative Adversarial Networks](https://arxiv.org/pdf/1901.04997.pdf), Dan Li, Dacheng Chen, Jonathan Goh, and See-Kiong Ng, 2019 - - [GitHub Repo](https://github.com/LiDan456/MAD-GANs) -- [GAN — Some cool applications](https://medium.com/@jonathan_hui/gan-some-cool-applications-of-gans-4c9ecca35900), Jonathan Hui, 2018 -- [gans-awesome-applications](https://github.com/nashory/gans-awesome-applications), curated list of awesome GAN applications +### Evaluating the results -### How to build GANs using Python +The notebook [alphalens_analysis](07_alphalens_analysis.ipynb) measures the financial performance of the model's prediction. -The notebook [deep_convolutional_generative_adversarial_network](04_deep_convolutional_generative_adversarial_network.ipynb) illustrates the implementation of a GAN using Python. It uses the Deep Convolutional GAN (DCGAN) example to synthesize images from the fashion MNIST dataset -- [Keras-GAN](https://github.com/eriklindernoren/Keras-GAN), numerous Keras GAN implementations -- [PyTorch-GAN](https://github.com/eriklindernoren/PyTorch-GAN), numerous PyTorch GAN implementations \ No newline at end of file diff --git a/21_gans_for_synthetic_time_series/README.md b/21_gans_for_synthetic_time_series/README.md index a5af3c626..bc5341da9 100644 --- a/21_gans_for_synthetic_time_series/README.md +++ b/21_gans_for_synthetic_time_series/README.md @@ -1,94 +1,136 @@ -# Conditional Autoencoders for Asset Pricing and GANs +# Generative Adversarial Nets for Synthetic Time Series Data -This chapter presents two unsupervised learning techniques that leverage deep learning: autoencoders, which have been around for decades, and Generative Adversarial Networks (GANs), which were introduced by Ian Goodfellow in 2014 and which Yann LeCun has called the most exciting idea in AI in the last ten years. 
-- An autoencoder is a neural network trained to reproduce the input while learning a new representation of the data, encoded by the parameters of a hidden layer. Autoencoders have long been used for nonlinear dimensionality reduction and manifold learning. More recently, autoencoders have been designed as generative models that learn probability distributions over observed and latent variables. A variety of designs leverage the feedforward network, Convolutional Neural Network (CNN), and recurrent neural network (RNN) architectures we covered in the last three chapters. -- GANs are a recent innovation that train two neural nets—a generator and a discriminator—in a competitive setting. The generator aims to produce samples that the discriminator is unable to distinguish from a given class of training data. The result is a generative model capable of producing new (fake) samples that are representative of a certain target distribution. GANs have produced a wave of research and can be successfully applied in many domains. An example from the medical domain that could potentially be highly relevant for trading is the generation of time-series data that simulates alternative trajectories and can be used to train supervised or reinforcement algorithms. +This chapter introduces generative adversarial networks (GAN). GANs train a generator and a discriminator network in a competitive setting so that the generator learns to produce samples that the discriminator cannot distinguish from a given class of training data. The goal is to yield a generative model capable of producing synthetic samples representative of this class. +While most popular with image data, GANs have also been used to generate synthetic time-series data in the medical domain. Subsequent experiments with financial data explored whether GANs can produce alternative price trajectories useful for ML training or strategy backtests. 
We replicate the 2019 NeurIPS Time-Series GAN paper to illustrate the approach and demonstrate the results. -More specifically, this chapter covers: +

+ +

-- Which types of autoencoders are of practical use and how they work -- How to build and train autoencoders using Python -- How GANs work, why they're useful, and how they could be applied to trading -- How to build GANs using Python +More specifically, in this chapter you will learn about: +- How GANs work, why they are useful, and how they could be applied to trading +- Designing and training GANs using TensorFlow 2 +- Generating synthetic financial data to expand the inputs available for training ML models and backtesting -- [Unsupervised Learning](https://cilvr.nyu.edu/lib/exe/fetch.php?media=deeplearning:2016:lecun-20160308-unssupervised-learning-nyu.pdf), Yann LeCun, 2016 +## Content -## How Autoencoders work +1. [Generative adversarial networks for synthetic data](#generative-adversarial-networks-for-synthetic-data) + * [Comparing generative and discriminative models](#comparing-generative-and-discriminative-models) + * [Adversarial training: a zero-sum game of trickery](#adversarial-training-a-zero-sum-game-of-trickery) +2. [Code example: How to build a GAN using TensorFlow 2](#code-example-how-to-build-a-gan-using-tensorflow-2) +3. [Code example: TimeGAN: Adversarial Training for Synthetic Financial Data](#code-example-timegan-adversarial-training-for-synthetic-financial-data) + * [Learning the data generation process across features and time](#learning-the-data-generation-process-across-features-and-time) + * [Combining adversarial and supervised training with time-series embedding](#combining-adversarial-and-supervised-training-with-time-series-embedding) + * [The four components of the TimeGAN architecture](#the-four-components-of-the-timegan-architecture) + * [Implementing TimeGAN using TensorFlow 2](#implementing-timegan-using-tensorflow-2) + * [Evaluating the quality of synthetic time-series data](#evaluating-the-quality-of-synthetic-time-series-data) +4. 
[Resources](#resources) + * [How GANs work](#how-gans-work) + * [Implementation](#implementation) + * [The rapid evolution of the GAN architecture zoo](#the-rapid-evolution-of-the-gan-architecture-zoo) + * [Applications](#applications) -An autoencoder, in contrast, is a neural network designed exclusively to learn a new representation, that is, an encoding of the input. To this end, the training forces the network to faithfully reproduce the input. Since autoencoders typically use the same data as input and output, they are also considered an instance of self-supervised learning. In the process, the parameters of a hidden layer become the code that represents the input. +## Generative adversarial networks for synthetic data -- [Autoencoders](http://www.deeplearningbook.org/contents/autoencoders.html), Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning Book, Chapter 14, MIT Press 2016 +This book mostly focuses on supervised learning algorithms that receive input data and predict an outcome, which we can compare to the ground truth to evaluate their performance. Such algorithms are also called discriminative models because they learn to differentiate between different output values. +Generative adversarial networks (GANs) are an instance of generative models, like the variational autoencoder we encountered in the [last chapter](../20_autoencoders_for_conditional_risk_factors). -### Nonlinear dimensionality reduction +### Comparing generative and discriminative models -A traditional use case includes dimensionality reduction, achieved by limiting the size of the hidden layer so that it performs lossy compression. Such an autoencoder is called undercomplete and the purpose is to force it to learn the most salient properties of the data by minimizing a loss function. In addition to feedforward architectures, autoencoders can also use convolutional layers to learn hierarchical feature representations. 
+Discriminative models learn how to differentiate among outcomes y, given input data X. In other words, they learn the probability of the outcome given the data: p(y | X). Generative models, on the other hand, learn the joint distribution of inputs and outcome p(y, X). -The powerful capabilities of neural networks to represent complex functions require tight limitations of the capacity of the encoder and decoder to force the extraction of a useful signal rather than noise. In other words, when it is too easy for the network to recreate the input, it fails to learn only the most interesting aspects of the data. This challenge is similar to the overfitting phenomenon that frequently occurs when using models with a high capacity for supervised learning. Just as in these settings, regularization can help by adding constraints to the autoencoder that facilitate the learning of a useful representation. +While generative models can be used as discriminative models using Bayes Rule to compute which class is most likely (see [Chapter 10](../10_bayesian_machine_learning)), it appears often preferable to solve the prediction problem directly rather than by solving the more general generative challenge first. -### Sequence-to-Sequence Autoencoders +### Adversarial training: a zero-sum game of trickery -Sequence-to-sequence autoencoders are based on RNN components, such as Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRUs). They learn a compressed representation of sequential data and have been applied to video, text, audio, and time-series data. +The key innovation of GANs is a new way of learning the data-generating probability distribution. The algorithm sets up a competitive, or adversarial game between two neural networks called the generator and the discriminator. 
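
The adversarial game between generator and discriminator can be summarized by the two losses each network minimizes. The sketch below is a library-free illustration under stated assumptions: the discriminator outputs a probability that a sample is real, the discriminator loss is the usual binary cross-entropy over real and fake samples, and the generator uses the commonly applied non-saturating variant (it maximizes the log-probability that its fakes are classified as real). The function names are hypothetical and not taken from the notebooks.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy: real samples should score 1, generated samples 0.

    d_real / d_fake are the discriminator's probabilities of 'real' for
    batches of real and generated samples, respectively.
    """
    real_term = -sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: reward fakes that the discriminator rates as real."""
    return -sum(math.log(p + eps) for p in d_fake) / len(d_fake)


# A discriminator that separates real from fake well ...
confident_d = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
confident_g = generator_loss(d_fake=[0.1, 0.05])

# ... versus one that cannot tell them apart and outputs 0.5 everywhere.
fooled_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
fooled_g = generator_loss(d_fake=[0.5, 0.5])

print(f"confident discriminator: D loss {confident_d:.3f}, G loss {confident_g:.3f}")
print(f"fooled discriminator:    D loss {fooled_d:.3f}, G loss {fooled_g:.3f}")
```

The zero-sum character shows up in the comparison: outputs that lower the discriminator's loss raise the generator's loss, and the reverse holds when the discriminator can no longer tell the samples apart.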
-- [A ten-minute introduction to sequence-to-sequence learning in Keras](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html), Francois Chollet, September 2017 -- [Unsupervised Learning of Video Representations using LSTMs](https://arxiv.org/abs/1502.04681), Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov, 2016 +

+ +

-### Variational Autoencoders +## Code example: How to build a GAN using TensorFlow 2 -Variational Autoencoders (VAE) are more recent developments focused on generative modeling. More specifically, VAEs are designed to learn a latent variable model for the input data. Note that we encountered latent variables in Chapter 14, Topic Modeling. +To illustrate the implementation of a generative adversarial network using Python, we use the deep convolutional GAN (DCGAN) example discussed earlier in this section to synthesize images from the fashion MNIST dataset that we first encountered in Chapter 13. -Hence, VAEs do not let the network learn arbitrary functions as long as it faithfully reproduces the input. Instead, they aim to learn the parameters of a probability distribution that generates the input data. In other words, VAEs are generative models because, if successful, you can generate new data points by sampling from the distribution learned by the VAE. +The notebook [deep_convolutional_generative_adversarial_network](01_deep_convolutional_generative_adversarial_network.ipynb) illustrates the implementation of a GAN using Python. It uses the Deep Convolutional GAN (DCGAN) example to synthesize images from the fashion MNIST dataset -- [Auto-encoding variational bayes](https://arxiv.org/abs/1312.6114), Diederik P Kingma, Max Welling, 2014 +## Code example: TimeGAN: Adversarial Training for Synthetic Financial Data -## How to build autoencoders using Python +Generating synthetic time-series data poses specific challenges above and beyond those encountered when designing GANs for images. 
+In addition to the distribution over variables at any given point, such as pixel values or the prices of numerous stocks, a generative model for time-series data should also learn the temporal dynamics that shape how one sequence of observations follows another (see also the discussion in Chapter 9: [Time Series Models for Volatility Forecasts and Statistical Arbitrage](../09_time_series_models)). -The Keras library makes it fairly straightforward to build various types of autoencoders and the following examples are adapted from Keras' tutorials. +Very recent and promising [research](https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf) by Yoon, Jarrett, and van der Schaar, presented at NeurIPS in December 2019, introduces a novel [Time-Series Generative Adversarial Network](https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf) (TimeGAN) framework that aims to account for temporal correlations by combining supervised and unsupervised training. +The model learns a time-series embedding space while optimizing both supervised and adversarial objectives that encourage it to adhere to the dynamics observed while sampling from historical data during training. +The authors test the model on various time series, including historical stock prices, and find that the quality of the synthetic data significantly outperforms that of available alternatives. -- [Building Autoencoders in Keras](https://blog.keras.io/building-autoencoders-in-keras.html) +### Learning the data generation process across features and time -### Feedforward Autoencoders with Sparsity Constraints +A successful generative model for time-series data needs to capture both the cross-sectional distribution of features at each point in time and the longitudinal relationships among these features over time. 
+Expressed in the image context we just discussed, the model needs to learn not only what a realistic image looks like, but also how one image evolves into the next, as in a video. -The notebook [deep_autoencoders](01_deep_autoencoders.ipynb) illustrates how to implement several of the autoencoder models introduced in the preceding section using Keras. This includes autoencoders using deep feedforward nets and sparsity constraints. +### Combining adversarial and supervised training with time-series embedding -### Convolutional & Denoising Autoencoders +Prior attempts at generating time-series data like the recurrent (conditional) GAN relied on recurrent neural networks (RNN, see Chapter 19, [RNN for Multivariate Time Series and Sentiment Analysis](../19_recurrent_neural_nets)) in the roles of generator and discriminator. -The notebook [convolutional_denoising_autoencoders](02_convolutional_denoising_autoencoders.ipynb) goes on to demonstrate how to implement convolutionals and denoising autencoders to recover corrupted image inputs. +TimeGAN explicitly incorporates the autoregressive nature of time series by combining the unsupervised adversarial loss on both real and synthetic sequences familiar from the DCGAN example with a stepwise supervised loss with respect to the original data. +The goal is to reward the model for learning the distribution over transitions from one point in time to the next present in the historical data. -### Sequence-to-sequence autoencoders +### The four components of the TimeGAN architecture -Sequence-to-sequence autoencoders are based on RNN components like long short-term memory (LSTM) or gated recurrent units (GRUs). They learn a compressed representation of sequential data and have been applied to video, text, audio, and time-series data. 
+The TimeGAN architecture combines an adversarial network with an autoencoder and thus has four network components, as depicted in Figure 21.4: +- Autoencoder: embedding and recovery networks +- Adversarial network: sequence generator and sequence discriminator components +

+ +

-- [Gradient Trader Part 1: The Surprising Usefulness of Autoencoders](https://rickyhan.com/jekyll/update/2017/09/14/autoencoders.html) - - [Code examples](https://github.com/0b01/recurrent-autoencoder) -- [Deep Learning Financial Market Data](http://wp.doc.ic.ac.uk/hipeds/wp-content/uploads/sites/78/2017/01/Steven_Hutt_Deep_Networks_Financial.pdf) - - Motivation: Regulators identify prohibited patterns of trading activity detrimental to orderly markets. Financial Exchanges are responsible for maintaining orderly markets. (e.g. Flash Crash and Hound of Hounslow.) - - Challenge: Identify prohibited trading patterns quickly and efficiently. - Goal: Build a trading pattern search function using Deep Learning. Given a sample trading pattern identify similar patterns in historical LOB data. -### Variational Autoencoders +### Implementing TimeGAN using TensorFlow 2 -The notebook [variational_autoencoder](03_variational_autoencoder.ipynb) shows how to build a Variational Autoencoder using Keras. +In this section, we implement the TimeGAN architecture just described. The authors provide sample code using TensorFlow 1 that we port to TensorFlow 2. Building and training TimeGAN requires several steps: +1. Selecting and preparing real and random time series inputs +2. Creating the key TimeGAN model components +3. Defining the various loss functions and train steps used during the three training phases +4. Running the training loops and logging the results +5. Generating synthetic time series and evaluating the results -- [Tutorial: What is a variational autoencoder?](https://jaan.io/what-is-variational-autoencoder-vae-tutorial/) - - [Variational Autoencoder / Deep Latent Gaussian Model in tensorflow and pytorch](https://github.com/altosaar/variational-autoencoder) +The notebook [TimeGAN_TF2](02_TimeGAN_TF2.ipynb) shows how to implement these steps. 
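
As an illustration of the first step in the list above (preparing real and random time series inputs), the following sketch scales each series to the unit interval, slices the result into overlapping windows, and draws random sequences of the same shape as generator input. It is a minimal sketch under assumptions: the window length of 24, the uniform noise, the ticker names, and the helper functions are illustrative choices and not taken from the notebook.

```python
import random


def min_max_scale(series):
    """Rescale a series to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [(v - lo) / (hi - lo) for v in series]


def rolling_windows(rows, seq_len):
    """Overlapping windows of `seq_len` consecutive time steps."""
    return [rows[i:i + seq_len] for i in range(len(rows) - seq_len + 1)]


def random_sequences(n, seq_len, n_features, seed=42):
    """Uniform random inputs with the same shape as the real windows."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(n_features)]
             for _ in range(seq_len)]
            for _ in range(n)]


# Hypothetical price histories for two tickers (30 observations each).
prices = {
    "AAA": [100.0 + i for i in range(30)],
    "BBB": [50.0 + (i % 5) for i in range(30)],
}

seq_len = 24  # assumed window length
scaled = {ticker: min_max_scale(series) for ticker, series in prices.items()}
rows = [list(obs) for obs in zip(*scaled.values())]   # one row per time step
real_windows = rolling_windows(rows, seq_len)          # shape: (7, 24, 2)
noise = random_sequences(len(real_windows), seq_len, n_features=len(prices))

print(len(real_windows), len(real_windows[0]), len(real_windows[0][0]))
```

Scaling matters here because the networks typically end in bounded activations, so real and synthetic sequences need to live on the same range before the discriminator compares them.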
-## Generative Adversarial Networks +### Evaluating the quality of synthetic time-series data -The supervised learning algorithms that we focused on for most of this book receive input data that's typically complex and predicts a numerical or categorical label that we can compare to the ground truth to evaluate its performance. These algorithms are also called discriminative models because they learn to differentiate between different output classes. +The TimeGAN authors assess the quality of the generated data with respect to three practical criteria: +1. **Diversity**: the distribution of the synthetic samples should roughly match that of the real data +2. **Fidelity**: the sample series should be indistinguishable from the real data, and +3. **Usefulness**: the synthetic data should be as useful as their real counterparts for solving a predictive task -The goal of generative models is to produce complex output, such as realistic images, given simple input, which can even be random numbers. They achieve this by modeling a probability distribution over the possible output. This probability distribution can have many dimensions, for example, one for each pixel in an image or its character or token in a document. As a result, the model can generate output that are very likely representative of the class of output. +The authors apply three methods to evaluate whether the synthetic data actually exhibits these characteristics: +1. **Visualization**: for a qualitative diversity assessment of diversity, we use dimensionality reduction (principal components analysis (PCA) and t-SNE, see Chapter 13) to visually inspect how closely the distribution of the synthetic samples resembles that of the original data +2. 
**Discriminative Score**: for a quantitative assessment of fidelity, the test error of a time-series classifier such as a 2-layer LSTM (see Chapter 19) lets us evaluate whether real and synthetic time series can be differentiated or are, in fact, indistinguishable. +3. **Predictive Score**: for a quantitative measure of usefulness, we can compare the test errors of a sequence prediction model trained on, alternatively, real or synthetic data to predict the next time step for the real data. -- [NIPS 2016 Tutorial: Generative Adversarial Networks](https://arxiv.org/pdf/1701.00160.pdf), Ian Goodfellow, 2017 -- [Why is unsupervised learning important?](https://www.quora.com/Why-is-unsupervised-learning-important), Yoshua Bengio on Quora, 2018 +The notebook [evaluating_synthetic_data](03_evaluating_synthetic_data.ipynb) contains the relevant code samples. + +## Resources -### How GANs work +### How GANs work +- [NIPS 2016 Tutorial: Generative Adversarial Networks](https://arxiv.org/pdf/1701.00160.pdf), Ian Goodfellow, 2017 +- [Why is unsupervised learning important?](https://www.quora.com/Why-is-unsupervised-learning-important), Yoshua Bengio on Quora, 2018 - [GAN Lab: Understanding Complex Deep Generative Models using Interactive Visual Experimentation](https://www.groundai.com/project/gan-lab-understanding-complex-deep-generative-models-using-interactive-visual-experimentation/), Minsuk Kahng, Nikhil Thorat, Duen Horng (Polo) Chau, Fernanda B. Viégas, and Martin Wattenberg, IEEE Transactions on Visualization and Computer Graphics, 25(1) (VAST 2018), Jan. 
2019 - [GitHub](https://poloclub.github.io/ganlab/) - [Generative Adversarial Networks](https://arxiv.org/abs/1406.2661), Ian Goodfellow, et al, 2014 - [Generative Adversarial Networks: an Overview](https://arxiv.org/pdf/1710.07035.pdf), Antonia Creswell, et al, 2017 - [Generative Models](https://blog.openai.com/generative-models/), OpenAI Blog -### Evolution of GAN Architectures +### Implementation + +- [Deep Convolutional Generative Adversarial Network](https://www.tensorflow.org/tutorials/generative/dcgan) +- [CycleGAN](https://www.tensorflow.org/tutorials/generative/cyclegan) +- [Keras-GAN](https://github.com/eriklindernoren/Keras-GAN), numerous Keras GAN implementations +- [PyTorch-GAN](https://github.com/eriklindernoren/PyTorch-GAN), numerous PyTorch GAN implementations + + +### The rapid evolution of the GAN architecture zoo - [Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (DCGAN)](https://arxiv.org/pdf/1511.06434.pdf), Luke Metz et al, 2016 - [Conditional Generative Adversarial Net](https://arxiv.org/pdf/1411.1784.pdf), Medhi Mirza and Simon Osindero, 2014 @@ -99,7 +141,7 @@ The goal of generative models is to produce complex output, such as realistic im - [Learning What and Where to Draw](https://arxiv.org/abs/1610.02454), Scott Reed, et al 2016 - [Fantastic GANs and where to find them](http://guimperarnau.com/blog/2017/03/Fantastic-GANs-and-where-to-find-them) -### Applications of GANs +### Applications - [Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs](https://arxiv.org/abs/1706.02633), Cristóbal Esteban, Stephanie L. 
Hyland, Gunnar Rätsch, 2016 - [GitHub Repo](https://github.com/ratschlab/RGAN) @@ -108,9 +150,5 @@ The goal of generative models is to produce complex output, such as realistic im - [GAN — Some cool applications](https://medium.com/@jonathan_hui/gan-some-cool-applications-of-gans-4c9ecca35900), Jonathan Hui, 2018 - [gans-awesome-applications](https://github.com/nashory/gans-awesome-applications), curated list of awesome GAN applications -### How to build GANs using Python -The notebook [deep_convolutional_generative_adversarial_network](04_deep_convolutional_generative_adversarial_network.ipynb) illustrates the implementation of a GAN using Python. It uses the Deep Convolutional GAN (DCGAN) example to synthesize images from the fashion MNIST dataset -- [Keras-GAN](https://github.com/eriklindernoren/Keras-GAN), numerous Keras GAN implementations -- [PyTorch-GAN](https://github.com/eriklindernoren/PyTorch-GAN), numerous PyTorch GAN implementations \ No newline at end of file diff --git a/22_deep_reinforcement_learning/README.md b/22_deep_reinforcement_learning/README.md index 06cf3bbf9..a9f813b50 100644 --- a/22_deep_reinforcement_learning/README.md +++ b/22_deep_reinforcement_learning/README.md @@ -1,20 +1,60 @@ -# Chapter 21: Reinforcement Learning +# Deep Reinforcement Learning: Building a Trading Agent Reinforcement Learning (RL) is a computational approach to goal-directed learning performed by an agent that interacts with a typically stochastic environment which the agent has incomplete information about. RL aims to automate how the agent makes decisions to achieve a long-term objective by learning the value of states and actions from a reward signal. The ultimate goal is to derive a policy that encodes behavioral rules and maps states to actions. -This [chapter](20_reinforcement_learning) shows how to formulate an RL problem and how to apply various solution methods. 
It covers model-based and model-free methods, introduces the [OpenAI Gym](https://gym.openai.com/) environment, and combines deep learning with RL to train an agent that navigates a complex environment. Finally, we'll show you how to adapt RL to algorithmic trading by modeling an agent that interacts with the financial market while trying to optimize an objective function. +This chapter shows how to formulate an RL problem and how to apply various solution methods. It covers model-based and model-free methods, introduces the [OpenAI Gym](https://gym.openai.com/) environment, and combines deep learning with RL to train an agent that navigates a complex environment. Finally, we'll show you how to adapt RL to algorithmic trading by modeling an agent that interacts with the financial market while trying to optimize an objective function. More specifically, this chapter will cover: -- How to define a Markov Decision Problem (MDP) -- How to use Value and Policy Iteration to solve an MDP -- How to apply Q-learning in an environment with discrete states and actions -- How to build and train a deep Q-learning agent in a continuous environment -- How to use OpenAI Gym to train an RL trading agent - -## Key elements of RL - -RL problems aim to optimize an agent's decisions based on an objective function vis-a-vis an environment. The environment presents information about its state to the agent, assigns rewards for actions, and transitions the agent to new states subject to probability distributions the agent may or may not know about. It may be fully or partially observable, and may also contain other agents. The design of the environment typically requires significant up-front design effort to facilitate goal-oriented learning by the agent during training.
+- Define a Markov decision problem (MDP) +- Use value and policy iteration to solve an MDP +- Apply Q-learning in an environment with discrete states and actions +- Build and train a deep Q-learning agent in a continuous environment +- Use the OpenAI Gym to design a custom market environment and train an RL agent to trade stocks + +#### Table of contents + +1. [Key elements of a reinforcement learning system](#key-elements-of-a-reinforcement-learning-system) + * [The policy: translating states into actions](#the-policy-translating-states-into-actions) + * [Rewards: learning from actions](#rewards-learning-from-actions) + * [The value function: optimal decisions for the long run](#the-value-function-optimal-decisions-for-the-long-run) + * [The environment](#the-environment) + * [Components of an interactive RL system](#components-of-an-interactive-rl-system) +2. [How to solve RL problems](#how-to-solve-rl-problems) + * [Code example: dynamic programming – value and policy iteration](#code-example-dynamic-programming--value-and-policy-iteration) + * [Code example: Q-Learning](#code-example-q-learning) +3. [Deep Reinforcement Learning](#deep-reinforcement-learning) + * [Value function approximation with neural networks](#value-function-approximation-with-neural-networks) + * [The Deep Q-learning algorithm and extensions](#the-deep-q-learning-algorithm-and-extensions) + * [The Open AI Gym – the Lunar Lander environment](#the-open-ai-gym--the-lunar-lander-environment) + * [Code example: Double Deep Q-Learning using Tensorflow](#code-example-double-deep-q-learning-using-tensorflow) +4. [Code example: deep RL for trading with TensorFlow 2 and OpenAI Gym](#code-example-deep-rl-for-trading-with-tensorflow-2-and-openai-gym) + * [How to Design an OpenAI trading environment](#how-to-design-an-openai-trading-environment) + * [How to build a Deep Q-learning agent for the stock market](#how-to-build-a-deep-q-learning-agent-for-the-stock-market) +5. 
[Resources](#resources) + * [RL Algorithms](#rl-algorithms) + * [Investment Applications](#investment-applications) + +## Key elements of a reinforcement learning system + +RL problems feature several elements that set them apart from the ML settings we have covered so far. The following two sections outline the key features required for defining and solving an RL problem by learning a policy that automates decisions. +We’ll use the notation and generally follow [Reinforcement Learning: An Introduction](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) (Sutton and Barto 2018) and David Silver’s [UCL Courses on RL](https://www.davidsilver.uk/teaching/) that are recommended for further study beyond the brief summary that the scope of this chapter permits. + +RL problems aim to optimize an agent's decisions based on an objective function vis-a-vis an environment. + +### The policy: translating states into actions +At any point in time, the policy defines the agent’s behavior. It maps any state the agent may encounter to one or several actions. In an environment with a limited number of states and actions, the policy can be a simple lookup table filled in during training. + +### Rewards: learning from actions + +The reward signal is a single value that the environment sends to the agent at each time step. The agent’s objective is typically to maximize the total reward received over time. Rewards can also be a stochastic function of the state and the actions. They are typically discounted to facilitate convergence and reflect the time decay of value. + +### The value function: optimal decisions for the long run +The reward provides immediate feedback on actions. However, solving an RL problem requires decisions that create value in the long run. This is where the value function comes in: it summarizes the utility of states or of actions in a given state in terms of their long-term reward. 
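
The role of discounting in the rewards and value function paragraphs above can be made concrete with a small worked example. This is a minimal sketch, not code from the chapter's notebooks: the function name `discounted_return` and the sample reward sequence are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Return the sum of rewards, each discounted by gamma per elapsed time step."""
    total = 0.0
    # Work backwards so that every earlier step adds one more factor of gamma
    # to everything that follows it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical reward sequence: 1 + 0.5 * 0 + 0.25 * 2 = 1.5
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))
```

A value function estimates the expectation of this quantity from a given state (or state-action pair), which is why it captures long-run consequences that the immediate reward alone does not.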
+ +### The environment +The environment presents information about its state to the agent, assigns rewards for actions, and transitions the agent to new states subject to probability distributions the agent may or may not know about. +It may be fully or partially observable, and may also contain other agents. The design of the environment typically requires significant up-front design effort to facilitate goal-oriented learning by the agent during training. RL problems differ by the complexity of their state and action spaces that can be either discrete or continuous. The latter requires ML to approximate a functional relationship between states, actions, and their value. They also require us to generalize from the subset of states and actions that are experienced by the agent during training. @@ -28,16 +68,10 @@ The components of an RL system typically include: In addition, the environment emits a reward signal that reflects the new state resulting from the agent's action. At the core, the agent usually learns a value function that shapes its judgment over actions. The agent has an objective function to process the reward signal and translate the value judgments into an optimal policy. -- [Reinforcement Learning: An Introduction, 2nd eition](http://incompleteideas.net/book/RLbook2018.pdf), Richard S. Sutton and Andrew G. Barto, 2018 -- [University College of London Course on Reinforcement Learning](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html), David Silver, 2015 - - ## How to solve RL problems RL methods aim to learn from experience how to take actions that achieve a long-term goal. To this end, the agent and the environment interact over a sequence of discrete time steps via the interface of actions, state observations, and rewards that we described in the previous section.
-### Fundamental approaches to solving RL problems - There are numerous approaches to solving RL problems, which implies finding rules for the agent's optimal behavior: - **Dynamic programming** (DP) methods make the often unrealistic assumption of complete knowledge of the environment, but are the conceptual foundation for most other approaches. @@ -49,26 +83,20 @@ Approaches for continuous state and/or action spaces often leverage ML to approx - The reward signal does not directly reflect the target concept, such as a labeled sample - The distribution of the observations depends on the agent's actions and the policy which is itself the subject of the learning process -## Dynamic programming – value and policy iteration +### Code example: dynamic programming – value and policy iteration Finite MDPs are a simple yet fundamental framework. This section introduces the trajectories of rewards that the agent aims to optimize, and defines the policy and value functions that are used to formulate the optimization problem and the Bellman equations that form the basis for the solution methods. -### Dynamic programming in Python - The notebook [gridworld_dynamic_programming](01_gridworld_dynamic_programming.ipynb) applies Value and Policy Iteration to a toy environment that consists of a 3 x 4 grid. -## Q-Learning +### Code example: Q-Learning Q-learning was an early RL breakthrough when it was developed by Chris Watkins for his [PhD thesis](http://www.cs.rhul.ac.uk/~chrisw/thesis.html) in 1989. It introduces incremental dynamic programming to control an MDP without knowing or modeling the transition and reward matrices that we used for value and policy iteration in the previous section. A convergence proof followed three years later by [Watkins and Dayan](http://www.gatsby.ucl.ac.uk/~dayan/papers/wd92.html). -### The Q-learning algorithm - Q-learning directly optimizes the action-value function, q, to approximate q*.
The learning proceeds off-policy, that is, the algorithm does not need to select actions based on the policy that's implied by the value function alone. However, convergence requires that all state-action pairs continue to be updated throughout the training process. A straightforward way to ensure this is by using an ε-greedy policy. The Q-learning algorithm keeps improving a state-action value function after random initialization for a given number of episodes. At each time step, it chooses an action based on an ε-greedy policy, and uses a learning rate, α, to update the value function based on the reward and its current estimate of the value function for the next state. -### Training a Q-learning agent using Python - The notebook [gridworld_q_learning](02_gridworld_q_learning.ipynb) demonstrates how to build a Q-learning agent using the 3 x 4 grid of states from the previous section. ## Deep Reinforcement Learning @@ -100,28 +128,41 @@ The [OpenAI Gym](https://gym.openai.com/) is a RL platform that provides standar The [Lunar Lander](https://gym.openai.com/envs/LunarLander-v2) (LL) environment requires the agent to control its motion in two dimensions, based on a discrete action space and low-dimensional state observations that include position, orientation, and velocity. At each time step, the environment provides an observation of the new state and a positive or negative reward. Each episode consists of up to 1,000 time steps. -### Double Deep Q-Learning using Tensorflow +### Code example: Double Deep Q-Learning using Tensorflow The [lunar_lander_deep_q_learning](03_lunar_lander_deep_q_learning.ipynb) notebook implements a DDQN agent that uses TensorFlow and Open AI Gym's Lunar Lander environment. 
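
The tabular Q-learning procedure described in the Q-Learning section (an ε-greedy action choice followed by an update with learning rate α) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in the gridworld notebook: the five-state chain environment `chain_step`, the helper names, and the hyperparameter values are hypothetical, and the table is initialized with zeros rather than randomly to keep the example short.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise pick a greedy action (ties at random)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, value in enumerate(q_row) if value == best])


def q_update(q, state, action, reward, next_state, done, alpha, gamma):
    """Move q[state][action] a fraction alpha toward the one-step TD target."""
    target = reward if done else reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


def chain_step(state, action, n_states):
    """Hypothetical toy environment: action 1 moves right, action 0 moves left.

    Reaching the right-most state ends the episode with a reward of 1.
    """
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done


def train(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # one row of action values per state
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(q[state], epsilon, rng)
            next_state, reward, done = chain_step(state, action, n_states)
            q_update(q, state, action, reward, next_state, done, alpha, gamma)
            state = next_state
    return q


q_table = train()
# Greedy action per non-terminal state; the agent should learn to move right.
greedy_policy = [row.index(max(row)) for row in q_table[:-1]]
print(greedy_policy)
```

The update uses the maximum over next-state action values regardless of which action the ε-greedy behavior policy takes next, which is what makes the method off-policy.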
-## Reinforcement Learning for trading +## Code example: deep RL for trading with TensorFlow 2 and OpenAI Gym To train a trading agent, we need to create a market environment that provides price and other information, offers trading-related actions, and keeps track of the portfolio to reward the agent accordingly. ### How to Design an OpenAI trading environment -The OpenAI Gym allows for the design, registration, and utilization of environments that adhere to its architecture, as described in its [documentation](https://github.com/openai/gym/tree/master/gym/envs#how-to-create-new-environments-for-gym). The [trading_env.py](trading_env.py) file implements an example that illustrates how to create a class that implements the requisite `step()` and `reset()` methods. +The OpenAI Gym allows for the design, registration, and utilization of environments that adhere to its architecture, as described in its [documentation](https://github.com/openai/gym/tree/master/gym/envs#how-to-create-new-environments-for-gym). +- The [trading_env.py](trading_env.py) file implements an example that illustrates how to create a class that implements the requisite `step()` and `reset()` methods. The trading environment consists of three classes that interact to facilitate the agent's activities: 1. The `DataSource` class loads a time series, generates a few features, and provides the latest observation to the agent at each time step. 2. `TradingSimulator` tracks the positions, trades and cost, and the performance. It also implements and records the results of a buy-and-hold benchmark strategy. 3. `TradingEnvironment` itself orchestrates the process. 
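
To illustrate the `step()` and `reset()` interface mentioned above, here is a deliberately simplified, dependency-free sketch. It is not the repository's `trading_env.py`: the class `ToyTradingEnv`, its three-action encoding, the cost model, and the sample returns are all assumptions made for illustration, and the class does not inherit from `gym.Env`. It only mimics the classic Gym convention of returning `(observation, reward, done, info)` from `step()`.

```python
class ToyTradingEnv:
    """Hypothetical stand-in for a Gym-style trading environment (illustration only)."""

    def __init__(self, returns, trading_cost=0.001):
        self.returns = list(returns)      # per-period market returns
        self.trading_cost = trading_cost  # cost per unit of position change
        self.reset()

    def reset(self):
        """Start a new episode and return the first observation."""
        self.t = 0
        self.position = 0  # -1 short, 0 flat, 1 long
        return self._observation()

    def _observation(self):
        # The observation is just the most recent return (0.0 before any data).
        return self.returns[self.t - 1] if self.t > 0 else 0.0

    def step(self, action):
        """Apply an action in {0: short, 1: flat, 2: long} for the coming period."""
        new_position = action - 1
        cost = self.trading_cost * abs(new_position - self.position)
        reward = new_position * self.returns[self.t] - cost
        self.position = new_position
        self.t += 1
        done = self.t >= len(self.returns)
        return self._observation(), reward, done, {"cost": cost}


# Usage: a fixed "always long" policy on three made-up returns.
env = ToyTradingEnv([0.01, -0.02, 0.03])
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    observation, reward, done, info = env.step(2)
    total_reward += reward
print(round(total_reward, 4))
```

In this sketch a single class plays all three roles; the design described above separates data loading, simulation bookkeeping, and orchestration into `DataSource`, `TradingSimulator`, and `TradingEnvironment`.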
- ### How to build a Deep Q-learning agent for the stock market +### How to build a Deep Q-learning agent for the stock market - The notebook [q_learning_for_trading](04_q_learning_for_trading.ipynb) demonstrates how to set up a simple game with a limited set of options, a relatively low-dimensional state, and other parameters that can be easily modified and extended o train the same Deep Q-Learning agent used in [lunar_lander_deep_q_learning](03_lunar_lander_deep_q_learning.ipynb). +The notebook [q_learning_for_trading](04_q_learning_for_trading.ipynb) demonstrates how to set up a simple game with a limited set of options, a relatively low-dimensional state, and other parameters that can be easily modified and extended to train the Deep Q-Learning agent used in [lunar_lander_deep_q_learning](03_lunar_lander_deep_q_learning.ipynb). -## RL Algorithms - References +

+ +

+ + +## Resources + +- [Reinforcement Learning: An Introduction, 2nd edition](http://incompleteideas.net/book/RLbook2018.pdf), Richard S. Sutton and Andrew G. Barto, 2018 +- [University College of London Course on Reinforcement Learning](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html), David Silver, 2015 +- [Implementation of Reinforcement Learning Algorithms](https://github.com/dennybritz/reinforcement-learning), Denny Britz + - This repository provides code, exercises and solutions for popular Reinforcement Learning algorithms. These are meant to serve as a learning tool to complement the theoretical materials from Sutton/Barto and Silver (see above). + +### RL Algorithms - Q Learning - [Learning from Delayed Rewards](http://www.cs.rhul.ac.uk/~chrisw/thesis.html), PhD Thesis, Chris Watkins, 1989 @@ -152,28 +193,16 @@ The trading environment consists of three classes that interact to facilitate th - Categorical 51-Atom DQN (C51) - [A Distributional Perspective on Reinforcement Learning](https://arxiv.org/abs/1707.06887), Bellemare, et al 2017 - In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour. We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter. We then use the distributional perspective to design a new algorithm which applies Bellman's equation to the learning of approximate value distributions. We evaluate our algorithm using the suite of games from the Arcade Learning Environment.
We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribution in approximate reinforcement learning. Finally, we combine theoretical and empirical evidence to highlight the ways in which the value distribution impacts learning in the approximate setting. - -## Investment Applications + +### Investment Applications - [A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem](https://arxiv.org/abs/1706.10059), Zhengyao Jiang, Dixing Xu, Jinjun Liang 2017 - Financial portfolio management is the process of constant redistribution of a fund into different financial products. This paper presents a financial-model-free Reinforcement Learning framework to provide a deep machine learning solution to the portfolio management problem. The framework consists of the Ensemble of Identical Independent Evaluators (EIIE) topology, a Portfolio-Vector Memory (PVM), an Online Stochastic Batch Learning (OSBL) scheme, and a fully exploiting and explicit reward function. This framework is realized in three instants in this work with a Convolutional Neural Network (CNN), a basic Recurrent Neural Network (RNN), and a Long Short-Term Memory (LSTM). They are, along with a number of recently reviewed or published portfolio-selection strategies, examined in three back-test experiments with a trading period of 30 minutes in a cryptocurrency market. Cryptocurrencies are electronic and decentralized alternatives to government-issued money, with Bitcoin as the best-known example of a cryptocurrency. All three instances of the framework monopolize the top three positions in all experiments, outdistancing other compared trading algorithms. Although with a high commission rate of 0.25% in the backtests, the framework is able to achieve at least 4-fold returns in 50 days. 
- [PGPortfolio](https://github.com/ZhengyaoJiang/PGPortfolio); corresponding GitHub repo - [Financial Trading as a Game: A Deep Reinforcement Learning Approach](https://arxiv.org/pdf/1807.02787.pdf), Huang, Chien-Yi, 2018 - -## RL Implementations - -Several GitHub repositories offer open-source environments or sample algorithm implementations. - -### General RL - -- [Implementation of Reinforcement Learning Algorithms](https://github.com/dennybritz/reinforcement-learning), Denny Britz - - This repository provides code, exercises and solutions for popular Reinforcement Learning algorithms. These are meant to serve as a learning tool to complement the theoretical materials from Sutton/Baron and Silver (see above). - -### For Algorithmic Trading - - [Order placement with Reinforcement Learning](https://github.com/mjuchli/ctc-executioner) - CTC-Executioner is a tool that provides an on-demand execution/placement strategy for limit orders on crypto currency markets using Reinforcement Learning techniques. The underlying framework provides functionalities which allow to analyse order book data and derive features thereof. Those findings can then be used in order to dynamically update the decision making process of the execution strategy. - The methods being used are based on a research project (master thesis) currently proceeding at TU Delft. - [Q-Trader](https://github.com/edwardhdlu/q-trader) - An implementation of Q-learning applied to (short-term) stock trading. The model uses n-day windows of closing prices to determine if the best action to take at a given time is to buy, sell or sit. As a result of the short-term state representation, the model is not very good at making decisions over long-term trends, but is quite good at predicting peaks and troughs. 
- + \ No newline at end of file diff --git a/23_next_steps/README.md b/23_next_steps/README.md index f47c2d2e6..81c546403 100644 --- a/23_next_steps/README.md +++ b/23_next_steps/README.md @@ -1,21 +1,97 @@ -# Chapter 21 - Next Steps +# Chapter 23 - Next Steps + +In this concluding chapter, we will briefly summarize the key tools, applications, and lessons learned throughout the book to avoid losing sight of the big picture after so much detail. We will then identify areas that we did not cover but would be worthwhile to focus on as you expand on the many machine learning techniques we introduced and become productive in their daily use. +In sum, in this chapter, we will +- Review key takeaways and lessons learned +- Point out the next steps to build on the techniques in this book +- Suggest ways to incorporate ML into your investment process + +## Content + +1. [Key Takeaways and Lessons Learned](#key-takeaways-and-lessons-learned) + * [Data is the single most important ingredient](#data-is-the-single-most-important-ingredient) + * [Domain expertise: separate the signal from the noise](#domain-expertise-separate-the-signal-from-the-noise) + * [ML is a toolkit for solving problems with data](#ml-is-a-toolkit-for-solving-problems-with-data) + * [Beware of backtest overfitting](#beware-of-backtest-overfitting) + * [How to gain insights from black-box models](#how-to-gain-insights-from-black-box-models) +2. 
[Machine Learning for Trading in Practice](#machine-learning-for-trading-in-practice) + * [Data management technologies](#data-management-technologies) + * [Machine learning tools](#machine-learning-tools) + * [Online trading platforms](#online-trading-platforms) ## Key Takeaways and Lessons Learned +Important insights to keep in mind as you proceed to the practice of machine learning for trading include: +- Data is the single most important ingredient that requires careful sourcing and handling +- Domain expertise is key to realizing the value contained in data and avoiding some of the pitfalls of using ML. +- ML offers tools that you can adapt and combine to create solutions for your use case. +- The choices of model objectives and performance diagnostics are key to productive iterations towards an optimal system. +- Backtest overfitting is a huge challenge that requires significant attention. +- Transparency of black-box models can help build confidence and facilitate the adoption of ML by skeptics. + ### Data is the single most important ingredient -### Domain expertise helps unlock value in data +A key insight is that state-of-the-art ML techniques like deep neural networks are successful because their predictive performance continues to improve with more data. On the flip side, model and data complexity need to match to balance the bias-variance trade-off, which becomes more challenging the higher the noise-to-signal ratio of the data. Managing data quality and integrating data sets are key steps in realizing the potential value. + +### Domain expertise: separate the signal from the noise + +We emphasized that informative data is a necessary condition for successful ML applications. However, domain expertise is equally essential to define the strategic direction, select relevant data, engineer informative features, and design robust models. 
### ML is a toolkit for solving problems with data -### Model diagnostics help speed up optimization +Machine learning offers algorithmic solutions and techniques that can be applied to many use cases. Parts 2, 3 and 4 of the book have presented machine learning as a diverse set of tools that can add value to various steps of the strategy process, including +- Idea generation and alpha factor research +- Signal aggregation and portfolio optimization +- Strategy testing +- Trade execution +- Strategy evaluation ### Beware of backtest overfitting +We covered the risks of false discoveries due to overfitting to historical data repeatedly throughout the book. Chapter 5, on strategy evaluation, lays out the main drivers and potential remedies. The low signal-to-noise ratio and relatively small datasets (compared to web-scale image or text data) make this challenge particularly serious in the trading domain. Awareness is critical since the ease of access to data and tools to apply ML increases the risks significantly. + +There are no easy answers because the risks are inevitable. However, we presented methods to adjust backtest metrics to account for repeated trials such as the deflated Sharpe ratio. When working towards a live trading strategy, staged paper trading and close monitoring of performance during execution in the market need to be part of the implementation process. ### How to gain insights from black-box models +Deep neural networks and complex ensembles can raise suspicion when they are considered impenetrable black-box models, in particular in light of the risks of backtest overfitting. We introduced several methods to gain insights into how these models make predictions in Chapter 12, Boosting Your Trading Strategy. + +In addition to conventional measures of feature importance, the recent game-theoretic innovation of SHapley Additive exPlanations (SHAP) is a significant step towards understanding the mechanics of complex models.
SHAP values allow for the exact attribution of features and their values to predictions so that it becomes easier to validate the logic of a model in the light of specific theories about market behavior for a given investment target. Besides justification, exact feature importance scores and attribution of predictions allow for deeper insights into the drivers of the investment outcome of interest. + ## Machine Learning for Trading in Practice -### Machine Learning Tools and Big Data Technologies +As you proceed to integrate the numerous tools and techniques into your investment and trading process, there are numerous things you can focus your efforts on. If your goal is to make better decisions, you should select projects that are realistic yet ambitious given your current skill set. This will help you to develop an efficient workflow underpinned by productive tools and gain practical experience. + +### Data management technologies + +The central role of data in the ML4T process requires familiarity with a range of technologies to store, transform, and analyze data at scale, including the use of cloud-based services like Amazon Web Services, Microsoft Azure, and Google Cloud. + +### Machine learning tools + +We covered many libraries of the Python ecosystem in this book. Python has evolved to become the language of choice for data science and machine learning. The set of open-source libraries continues to both diversify and mature, and is built on the robust core of scientific computing libraries NumPy and SciPy. + +There are several providers that aim to facilitate the machine learning workflow: +- H2O.ai offers the H2O platform that integrates cloud computing with machine learning automation. It allows users to fit thousands of potential models to their data to explore patterns in the data. It has interfaces in Python as well as R and Java.
+- DataRobot aims to automate the model development process by providing a platform to rapidly build and deploy predictive models in the cloud or on-premises. +- Dataiku is a collaborative data science platform designed to help analysts and engineers explore, prototype, build, and deliver their own data products. + +There are also several open-source initiatives led by companies that build on and expand the Python ecosystem: +- The quantitative hedge fund [Two Sigma](https://www.twosigma.com/) contributes quantitative analysis tools to the Jupyter Notebook environment under the [BeakerX](https://github.com/twosigma/beakerx) project. +- Bloomberg has integrated the Jupyter Notebook into its terminal to facilitate the interactive analysis of its financial data. + +### Online trading platforms + +The main options to develop trading strategies that use machine learning are online platforms, which often look for and allocate capital to successful trading strategies. + +Popular solutions include +- [Quantopian](https://www.quantopian.com/), +- [QuantConnect](https://www.quantconnect.com/), and +- [QuantRocket](https://www.quantrocket.com/). + +In addition, [Interactive Brokers](https://www.interactivebrokers.com/en/home.php) offers a [Python API](https://www.interactivebrokers.com/en/index.php?f=44094) that you can use to develop your own trading solution. + +[Alpaca](https://alpaca.markets/algotrading?gclid=EAIaIQobChMInNybkbug6wIV1f_jBx1Z9AayEAAYASAAEgLu5fD_BwE) offers commission-free execution of algorithmic trading strategies.
Several libraries provide integration: +- [pipeline-live](https://github.com/alpacahq/pipeline-live): Zipline Pipeline Extension for Live Trading +- [pylivetrader](https://github.com/alpacahq/pylivetrader): a simple python live trading framework with zipline interface -- [Beakerx](https://github.com/twosigma/beakerx) by Two Sigma \ No newline at end of file +[Backtrader](https://www.backtrader.com/) is intended for both backtesting and trading with multiple broker integrations. \ No newline at end of file diff --git a/24_alpha_factor_library/README.md b/24_alpha_factor_library/README.md new file mode 100644 index 000000000..31ab011ee --- /dev/null +++ b/24_alpha_factor_library/README.md @@ -0,0 +1,76 @@ +# Appendix - Alpha Factor Library + +Throughout this book, we emphasized how the smart design of features, including appropriate preprocessing and denoising, typically leads to an effective strategy. +This appendix synthesizes some of the lessons learned on feature engineering and provides additional information on this vital topic. + +Chapter 4 categorized factors by the underlying risk they represent and for which an investor would earn a reward above and beyond the market return. +These categories include value vs growth, quality, and sentiment, as well as volatility, momentum, and liquidity. +Throughout the book, we used numerous metrics to capture these risk factors. +This appendix expands on those examples and collects popular indicators so you can use it as a reference or inspiration for your own strategy development. +It also shows you how to compute them and includes some steps to evaluate these indicators. + +To this end, we focus on the broad range of indicators implemented by TA-Lib (see [Chapter 4](04_alpha_factor_research)) and WorldQuant's [101 Formulaic Alphas](https://arxiv.org/pdf/1601.00991.pdf) paper (Kakushadze 2016), which presents real-life quantitative trading factors used in production with an average holding period of 0.6-6.4 days. 
+ +This chapter covers: +- How to compute several dozen technical indicators using TA-Lib and NumPy/pandas, +- Creating the formulaic alphas described in the above paper, and +- Evaluating the predictive quality of the results using various metrics from rank correlation and mutual information to feature importance, SHAP values and Alphalens. + +## Content + +1. [The Indicator Zoo](#the-indicator-zoo) +2. [Code example: common alpha factors implemented in TA-Lib](#code-example-common-alpha-factors-implemented-in-ta-lib) +3. [Code example: WorldQuant’s quest for formulaic alphas](#code-example-worldquants-quest-for-formulaic-alphas) +4. [Code example: Bivariate and multivariate factor evaluation](#code-example-bivariate-and-multivariate-factor-evaluation) + +## The Indicator Zoo + +Chapter 4, [Financial Feature Engineering: How to Research Alpha Factors](../04_alpha_factor_research), summarized the long-standing efforts of academics and practitioners to identify information or variables that help reliably predict asset returns. +This research led from the single-factor capital asset pricing model to a “[zoo of new factors](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.407.3913&rep=rep1&type=pdf)” (Cochrane 2011). + +This factor zoo contains hundreds of firm characteristics and security price metrics presented as statistically significant predictors of equity returns in the anomalies literature since 1970 (see a summary in [Green, Hand, and Zhang](https://academic.oup.com/rfs/article-abstract/30/12/4389/3091648), 2017). +- The notebook [indicator_zoo](00_indicator_zoo.ipynb) lists numerous examples. + +## Code example: common alpha factors implemented in TA-Lib + +The TA-Lib library is widely used by trading software developers to perform technical analysis of financial market data.
It includes over 150 popular indicators from multiple categories that range from Overlap Studies, including moving averages and Bollinger Bands, to Statistic Functions such as linear regression. + +**Function Group**|**# Indicators** +:-----:|:-----: +Overlap Studies|17 +Momentum Indicators|30 +Volume Indicators|3 +Volatility Indicators|3 +Price Transform|4 +Cycle Indicators|5 +Math Operators|11 +Math Transform|15 +Statistic Functions|9 + +The notebook [common_alpha_factors](02_common_alpha_factors.ipynb) contains the relevant code samples. + +## Code example: WorldQuant’s quest for formulaic alphas + +We introduced [WorldQuant](https://www.worldquant.com/home/) in Chapter 1, [Machine Learning for Trading: From Idea to Execution](../01_machine_learning_for_trading), as part of a trend towards crowd-sourcing investment strategies. +WorldQuant maintains a virtual research center where quants worldwide compete to identify alphas. +These alphas are trading signals in the form of computational expressions that help predict price movements just like the common factors described in the previous section. + +These formulaic alphas translate the mechanism to extract the signal from data into code and can be developed and tested individually with the goal to integrate their information into a broader automated strategy ([Tulchinsky 2019](https://onlinelibrary.wiley.com/doi/abs/10.1002/9781119571278.ch1)). +As stated repeatedly throughout the book, mining for signals in large datasets is prone to multiple testing bias and false discoveries. +Regardless of these important caveats, this approach represents a modern alternative to the more conventional features presented in the previous section. + +Kakushadze (2016) presents [101 examples](https://arxiv.org/pdf/1601.00991.pdf) of such alphas, 80 percent of which were used in a real-world trading system at the time. It defines a range of functions that operate on cross-sectional or time-series data and can be combined, e.g.
in nested form. + +The notebook [101_formulaic_alphas](03_101_formulaic_alphas.ipynb) contains the relevant code. + +## Code example: Bivariate and multivariate factor evaluation + +To evaluate the numerous factors, we rely on the various performance measures introduced in this book, including the following: +- Bivariate measures of the signal content of a factor with respect to the one-day forward returns +- Multivariate measures of feature importance for a gradient boosting model trained to predict the one-day forward returns using all factors +- Financial performance of portfolios invested according to factor quantiles using Alphalens + +The notebooks [factor_evaluation](04_factor_evaluation.ipynb) and [alphalens_analysis](05_alphalens_analysis.ipynb) contain the relevant code examples. + + + diff --git a/README.md b/README.md index fb39b88a3..1b24f00cb 100644 --- a/README.md +++ b/README.md @@ -19,6 +19,8 @@ This repo contains **over 150 notebooks** that put the concepts, algorithms, and - how to train and tune models that predict returns for different asset classes and investment horizons, including how to replicate recently published research, and - how to design, backtest, and evaluate trading strategies. +> We **highly recommend** reviewing the notebooks while reading the book; they are usually in an executed state and often contain additional information that the book's space constraints did not permit. + ## What's new in the 2nd Edition? First and foremost, this [book](https://www.amazon.com/Machine-Learning-Algorithmic-Trading-alternative/dp/1839217715?pf_rd_r=VMKJPZC4N36TTZZCWATP&pf_rd_p=c5b6893a-24f2-4a59-9d4b-aff5065c90ec&pd_rd_r=8f331266-0d21-4c76-a3eb-d2e61d23bb31&pd_rd_w=kVGNF&pd_rd_wg=LYLKH&ref_=pd_gw_ci_mcx_mr_hp_d) demonstrates how you can extract signals from a diverse set of data sources and design trading strategies for different asset classes using a broad range of supervised, unsupervised, and reinforcement learning algorithms.
It also provides relevant mathematical and statistical knowledge to facilitate the tuning of an algorithm or the interpretation of the results. Furthermore, it covers the financial background that will help you work with market and fundamental data, extract informative features, and manage the performance of a trading strategy. @@ -51,6 +53,8 @@ All applications now use the latest available (at the time of writing) software The [book](https://www.amazon.com/Machine-Learning-Algorithmic-Trading-alternative/dp/1839217715?pf_rd_r=GZH2XZ35GB3BET09PCCA&pf_rd_p=c5b6893a-24f2-4a59-9d4b-aff5065c90ec&pd_rd_r=91a679c7-f069-4a6e-bdbb-a2b3f548f0c8&pd_rd_w=2B0Q0&pd_rd_wg=GMY5S&ref_=pd_gw_ci_mcx_mr_hp_d) has four parts that address different challenges that arise when sourcing and working with market, fundamental and alternative data sourcing, developing ML solutions to various predictive tasks in the trading context, and designing and evaluating a trading strategy that relies on predictive signals generated by an ML model. +> The directory for each chapter contains a README with additional information on content, code examples and additional resources. + [Part 1: From Data to Strategy Development](#part-1-from-data-to-strategy-development) * [01 Machine Learning for Trading: From Idea to Execution](#01-machine-learning-for-trading-from-idea-to-execution) * [02 Market & Fundamental Data: Sources and Techniques](#02-market--fundamental-data-sources-and-techniques)