webscraping instructions update
stefan-jansen committed Jan 22, 2021
1 parent 6ed2fd9 commit 0e0d85b
Showing 3 changed files with 12 additions and 0 deletions.
2 changes: 2 additions & 0 deletions 03_alternative_data/01_opentable/README.md
@@ -6,6 +6,8 @@ The data needs to be extracted from the HTML source, barring any legal obstacles

### Building a dataset of restaurant bookings

> Note: unlike all other examples, the code that uses Selenium is written to run on a host rather than in the Docker image because it relies on a browser. The code has been tested on Ubuntu and macOS only.

With the browser automation tool [Selenium](https://www.seleniumhq.org/), you can follow the links to the next pages and quickly build a dataset of over 10,000 restaurants in NYC that you could then update periodically to track a time series.
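
The page-following loop described above could be sketched roughly as follows. This is a minimal illustration, not the code in `opentable_selenium.py`: the names `scrape_pages`, `parse_page`, and `click_next` are hypothetical, and the "Next" link text in the commented wiring is an assumption about the page layout that may not match the live site.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def scrape_pages(driver,
                 parse_page: Callable[[str], List[Row]],
                 click_next: Callable[[object], bool],
                 max_pages: int = 1000) -> List[Row]:
    """Collect rows from consecutive result pages.

    driver     -- any object exposing `page_source` (a Selenium WebDriver does)
    parse_page -- hypothetical helper that turns page HTML into a list of rows
    click_next -- hypothetical helper that advances to the next page and
                  returns False once there is no further page
    max_pages  -- safety cap so the loop always terminates
    """
    rows: List[Row] = []
    for _ in range(max_pages):
        # Parse whatever the browser currently shows, then try to move on.
        rows.extend(parse_page(driver.page_source))
        if not click_next(driver):
            break
    return rows


# With Selenium 4 installed on the host, `click_next` might look like this
# (assumes the pagination link is labelled "Next"):
#
# from selenium.common.exceptions import NoSuchElementException
# from selenium.webdriver.common.by import By
#
# def click_next(driver):
#     try:
#         driver.find_element(By.LINK_TEXT, "Next").click()
#         return True
#     except NoSuchElementException:
#         return False
```

Keeping the parsing and the click logic as separate callables means the loop can be exercised without a browser, and only those two helpers need to change if the site layout does.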

To set up Selenium, run
4 changes: 4 additions & 0 deletions 03_alternative_data/02_earnings_calls/README.md
@@ -1,11 +1,15 @@
## How to Scrape Earnings Call Transcripts

> Update: unfortunately, Seeking Alpha has updated its website to use CAPTCHA, so automatic downloads are no longer possible in the way described here.

Textual data is an essential alternative data source. One example of textual information is transcripts of earnings calls, where executives not only present the latest financial results but also respond to questions from financial analysts. Investors use transcripts to evaluate changes in sentiment, emphasis on particular topics, or style of communication.

We will illustrate the scraping and parsing of earnings call transcripts from the popular trading website [www.seekingalpha.com](https://www.seekingalpha.com).

### Instructions

> Note: unlike all other examples, the code is written to run on a host rather than in the Docker image because it relies on a browser. The code has been tested on Ubuntu and macOS only.

This section contains code to retrieve earnings call transcripts from Seeking Alpha.

Run `python sa_selenium.py` to scrape transcripts and store the results under `transcripts/parts` and the company's symbol, in CSV files named by the aspect of the earnings call they capture:
6 changes: 6 additions & 0 deletions 03_alternative_data/README.md
@@ -66,12 +66,18 @@ This section illustrates the acquisition of alternative data using web scraping,

### Code Example: OpenTable Web Scraping

> Note: unlike all other examples, the code that uses Selenium is written to run on a host rather than in the Docker image because it relies on a browser. The code has been tested on Ubuntu and macOS only.

The subfolder [01_opentable](01_opentable) contains the script [opentable_selenium](01_opentable/opentable_selenium.py) to scrape OpenTable data using Scrapy and Selenium.

- [How to View the Source Code of a Web Page in Every Browser](https://www.lifewire.com/view-web-source-code-4151702)

### Code Example: SeekingAlpha Earnings Transcripts

> Update: unfortunately, Seeking Alpha has updated its website to use CAPTCHA, so automatic downloads are no longer possible in the way described here.
>
> Note: unlike all other examples, the code is written to run on a host rather than in the Docker image because it relies on a browser. The code has been tested on Ubuntu and macOS only.

The subfolder [02_earnings_calls](02_earnings_calls) contains the script [sa_selenium](02_earnings_calls/sa_selenium.py) to scrape earnings call transcripts from the [SeekingAlpha](https://www.seekingalpha.com) website.

## Python Libraries & Documentation