- The process of buying and selling stocks involves transactions in the stock market
- Companies secure capital in the stock market through the sale of shares, or equity, to investors
- A stock signifies ownership in a company or organization, entailing a proportional stake in its assets and earnings
- Stock exchanges serve as secondary markets where shareholders can engage in transactions (Hayes, 2023)
- PySpark was developed to facilitate the integration of Spark and Python
- It is the Python API for Apache Spark, an open-source distributed computing framework and collection of libraries for processing massive amounts of data in real time
- PySpark not only offers an API for Spark but also aids in connecting with Resilient Distributed Datasets (RDDs)
- The key difference between Pandas and Spark dataframes lies in how they execute operations
- In PySpark, operations are evaluated lazily: they are deferred until a result is explicitly requested in the pipeline (PySpark, n.d.)
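The deferred execution described above can be illustrated with a plain-Python analogy. This is a minimal sketch using generators, not PySpark itself; the function names and the `log` list are hypothetical and exist only to show when work actually happens.

```python
# Plain-Python analogy for lazy evaluation (this is not PySpark code).
log = []  # records each unit of work as it is actually performed


def double(values):
    """Lazily double each value, logging when the work really runs."""
    for v in values:
        log.append(f"doubling {v}")
        yield v * 2


def keep_large(values, threshold):
    """Lazily keep only values above the threshold."""
    for v in values:
        if v > threshold:
            yield v


# Building the pipeline performs no work yet, much like chaining
# transformations on a Spark dataframe.
pipeline = keep_large(double([1, 2, 3, 4]), threshold=4)
work_before = len(log)

# Requesting the results triggers the computation, much like an action.
result = list(pipeline)
work_after = len(log)
```

Nothing is logged until `list(pipeline)` asks for the results, which mirrors how Spark waits for an action before running the queued transformations.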
- Long Short-Term Memory (LSTM) stands out for its proficiency in capturing intricate patterns and dependencies within historical data
- LSTM shows promise for understanding the unpredictable nature of the stock market
- LSTM uses memory cells, gates, and well-designed connections to selectively store and transmit information over extended time periods, allowing these models to effectively capture complex temporal patterns in sequential data, especially in predicting time series data like stock prices (Chauhan, 2023)
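To make the role of the memory cell and gates more concrete, the following is a minimal sketch of one step of a scalar LSTM cell in plain Python. It uses the standard LSTM gate equations; the weights are hypothetical values picked by hand for illustration, not learned parameters, and a real model would use vectors and matrices through a deep learning library.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell; w maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # candidate value to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g            # updated memory cell
    h = o * math.tanh(c)              # updated hidden state
    return h, c


# Hypothetical weights for illustration only.
weights = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, 0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
    "output": (0.4, 0.1, 0.0),
}

# Feed a short sequence through the cell, carrying state between steps.
h, c = 0.0, 0.0
for x in [0.1, 0.2, 0.3]:
    h, c = lstm_step(x, h, c, weights)
```

The cell state `c` is what carries information across time steps: the forget gate scales the old state and the input gate controls how much of the new candidate is added.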
- Date: Date of each transaction from 2015 to 2020
- Open: Opening price, the first transaction price per share at the start of a trading day
- High: The highest price reached during a trading day
- Low: The lowest price reached during a trading day
- Close: Closing price, the transaction price at the end of a trading day
- Close Adjusted: The closing price adjusted for dividends and stock splits
- Volume: The number of shares of a company traded in a day
- Symbol: The unique identifier or ticker for a stock of a company
- Number of rows: 68522038
- Minimum and maximum dates are determined
- Total count of distinct company symbols
- Temporal Features: Date (Timestamp)
- Quantitative Features/Continuous Variables: Volume, Open, High, Low, Close, Adjclose
- Qualitative Features: Symbol/Ticker
- symbol -> string
- date -> date (stores calendar date information)
- volume -> long (i.e., integer)
- open, high, low, close, and adjclose -> double (i.e., floating point)
- Check for missing values
- No null values
- Filtered the data to a list of companies by symbol
- Calculated the number of occurrences
- Arranged in descending order
- Apple -> AAPL
- Amazon -> AMZN
- Google -> GOOG
- Microsoft -> MSFT
- Tesla -> TSLA
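The filter, count, and sort steps above can be sketched in plain Python as follows. The rows are a tiny hypothetical sample standing in for the full dataset, and the ticker set uses the symbols that appear in the rest of this document (AAPL, AMZN, GOOG, MSFT, TSLA); the actual analysis was done in PySpark, so this only illustrates the logic.

```python
from collections import Counter

# Hypothetical rows; only the symbol column matters for this step.
rows = [
    {"symbol": "AAPL"}, {"symbol": "MSFT"}, {"symbol": "AAPL"},
    {"symbol": "IBM"}, {"symbol": "TSLA"}, {"symbol": "AAPL"},
    {"symbol": "MSFT"},
]
selected = {"AAPL", "AMZN", "GOOG", "MSFT", "TSLA"}

# 1. Keep only the rows for the selected companies.
filtered = [r for r in rows if r["symbol"] in selected]

# 2. Count occurrences per symbol.
counts = Counter(r["symbol"] for r in filtered)

# 3. Arrange in descending order of count.
ranked = counts.most_common()
```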
- Visualized outliers
- Plotted each attribute of all 5 stocks
- Even though there are outliers, they cannot be removed or modified, as doing so would discard important values
- Plotted each stock separately based on features
- Large volume: MSFT, followed by AAPL
- Highest price: AMZN
- Open is not considered, as the previous day's close is essentially the same as the next day's open
- Plotted a bar chart of the volume of each stock
- AAPL has the largest volume, followed by MSFT
- Histograms are used for continuous variables.
- AAPL and AMZN have the highest volume
- The highest prices (high, low, close, and adjclose) are for AMZN
- The lowest are for MSFT
AAPL
- Mean close : 168.11
- SD : 60.89
- Minimum close : 90.34
- Maximum close : 366.53
GOOG
- Mean close : 954.07
- SD : 257.03
- Minimum close : 491.20
- Maximum close : 1526.69
AMZN
- Mean close : 1213.50
- SD : 601.22
- Minimum close : 286.95
- Maximum close : 2890.30
MSFT
- Mean close : 89.55
- SD : 40.84
- Minimum close : 40.29
- Maximum close : 206.26
TSLA
- Mean close : 310.69
- SD : 152.18
- Minimum close : 143.67
- Maximum close : 1208.66
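Per-stock summary statistics like those listed above can be computed as in the sketch below. The closing prices are hypothetical, and using the sample standard deviation is an assumption, since the source does not state which variant was used.

```python
import statistics

# Hypothetical closing prices for one stock.
closes = [100.0, 102.0, 98.0, 104.0, 96.0]

summary = {
    "mean": statistics.mean(closes),
    "sd": statistics.stdev(closes),  # sample standard deviation (assumption)
    "min": min(closes),
    "max": max(closes),
}
```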
- To identify historical patterns, model price movements, and make predictions
- AMZN: Upward trend, exponential growth after 2018
- GOOG: Had the highest stock price in 2015, but was surpassed by AMZN after mid-2016
- TSLA: Steady until 2020
- AAPL and MSFT: No major changes
- To check the day-to-day fluctuations in pricing of each stock
- AMZN: High daily return percentage, touching 1000 in 2019
- TSLA: Initially high, then dropped to 100 mid-2019
- GOOG and AAPL: Less volatility, with some overlap
- MSFT: Low daily returns
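A common way to measure day-to-day fluctuation is the percentage change between consecutive closing prices. The sketch below assumes that definition; the source does not state the exact formula it used, and the prices are hypothetical.

```python
def daily_returns(closes):
    """Percentage change from each day's close to the next day's close."""
    return [
        (today - prev) / prev * 100.0
        for prev, today in zip(closes, closes[1:])
    ]


closes = [100.0, 110.0, 99.0]  # hypothetical closing prices
returns = daily_returns(closes)
```

A series of n prices yields n - 1 returns, since the first day has no previous close to compare against.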
- Help identify trends by reducing noise in price
- 30-day and 50-day rolling averages: the average stock price computed over a sliding window of the given length
- Overall trend: Upward
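The rolling average can be sketched in plain Python as below. A window of 3 is used only to keep the example small; the document's windows are 30 and 50 days. Returning `None` until the window is full is a design choice of this sketch.

```python
def rolling_mean(values, window):
    """Mean of each trailing window; None until the window is full."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out


prices = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical prices
ma3 = rolling_mean(prices, window=3)
```

Because each output averages several days, single-day spikes are damped, which is why the smoothed line shows the trend more clearly than the raw prices.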
- MSFT exhibits higher average trading volumes
- AAPL is closely followed by MSFT
- AMZN, GOOG, and TSLA have no noticeable movement
- To identify where the majority of values lie
- To determine whether there is any relationship between each feature
- Distribution of values for the numerical column
- Open, close, high, low, and adjclose are similar
To check the correlation:
- if the value is close to -1, the correlation is strongly negative
- if the value is close to 0, the correlation is weak
- if the value is close to +1, the correlation is strongly positive
- Because high, low, close, and adjclose take nearly identical values, their pairwise correlations all appear as 1.
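The correlation values discussed above can be computed with the Pearson coefficient. The sketch below implements it directly on hypothetical columns; whether the original analysis used the Pearson coefficient specifically is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical columns for illustration.
close = [10.0, 11.0, 12.0, 13.0]
high = [10.5, 11.4, 12.6, 13.5]     # tracks close almost exactly
inverse = [13.0, 12.0, 11.0, 10.0]  # moves opposite to close

r_close_high = pearson(close, high)
r_close_inverse = pearson(close, inverse)
```

Columns that move together, such as close and high here, give a value near +1, while a column that moves in the opposite direction gives a value near -1.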
- AAPL's stock price declined until 2016
- After 2016 there was an upward trend, even though the price dipped in later years
- High, low, close, and adjclose show no visible differences, indicating that they closely track one another.
- Filter the stock to be predicted
- In this case, AAPL is selected from the five stocks
- Target variable: close
The values are normalized so that they share a consistent scale
- All values are treated equally
- Optimization algorithms converge faster
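One common normalization is min-max scaling to the range [0, 1]. The sketch below assumes that method, since the source does not name the scaler it used, and also shows the inverse step needed to bring predictions back to the original price scale. The prices are hypothetical.

```python
def fit_min_max(values):
    """Return the (min, max) pair that defines the scaling."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Map values from [lo, hi] onto [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(scaled, lo, hi):
    """Map scaled values back to the original range."""
    return [s * (hi - lo) + lo for s in scaled]


closes = [90.0, 120.0, 150.0]  # hypothetical closing prices
lo, hi = fit_min_max(closes)
scaled = scale(closes, lo, hi)
restored = unscale(scaled, lo, hi)
```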
- Training: 80% of the data
- Testing: 20% of the data
Fixed-length sequences are created from the training and testing data for the LSTM model, where each sequence is a historical window used to predict the next time step.
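The split and the windowing can be sketched as follows. Splitting chronologically (first 80% for training) and the window length of 3 are assumptions made for this small example; the source does not state how the split was ordered or which sequence length it used.

```python
def split_series(values, train_fraction=0.8):
    """Split a series in time order into training and testing parts."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]


def make_sequences(values, window):
    """Build (window, next value) pairs for supervised training."""
    X, y = [], []
    for i in range(len(values) - window):
        X.append(values[i:i + window])  # historical window
        y.append(values[i + window])    # value to predict
    return X, y


series = [float(i) for i in range(10)]  # hypothetical scaled prices
train, test = split_series(series)
X_train, y_train = make_sequences(train, window=3)
```

Each input in `X_train` holds three consecutive values and the matching entry in `y_train` is the value that immediately follows them.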
A sequential model for a Long Short-Term Memory (LSTM) neural network with three layers is initialized
- 1 input layer
- 2 hidden layers
- 1 output
- 50 neurons
- dropout layers to prevent overfitting
- a dense output layer with one unit
An LSTM model is trained using historical data
- 50 epochs
- a batch size of 64
- validation split of 0.1
- Training loss remains consistently low and stable
- Validation loss exhibits slight variations
MAPE is calculated for the predicted values against the actual values, resulting in an accuracy of 95.03%.
The R2 value of 0.75 suggests that the model explains about 75% of the variability in the data, indicating a moderately good fit.
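Both metrics can be computed as in the sketch below, using hypothetical actual and predicted values. Reporting accuracy as 100 minus MAPE is an assumption about how the 95.03% figure was derived; the source does not give the formula.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [100.0, 200.0, 300.0]     # hypothetical true prices
predicted = [110.0, 190.0, 300.0]  # hypothetical model outputs

error_pct = mape(actual, predicted)
accuracy_pct = 100.0 - error_pct   # assumed definition of "accuracy"
r2 = r_squared(actual, predicted)
```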
- Visual comparison of the model's predictions against actual values over specified time steps
- To evaluate the performance of predictive models
- Degree of alignment indicates the accuracy of the model's predictions
- A close correspondence between the lines signifies that the model is making precise predictions
- Previous 60 time steps are considered for plotting
- Visualize the training and test predictions compared to the actual Apple stock prices, with predictions rescaled back to the original scale using the scaler.
- The stock prices for the next 10 days are predicted with the LSTM model by generating predictions iteratively and then transforming them back to the original scale.
- The graph suggests that the model forecasts a decrease in the AAPL stock price over the next 10 days.
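Iterative forecasting feeds each prediction back into the input window to produce the next one. The sketch below shows that loop with a hypothetical stand-in predictor (the mean of the window) in place of the trained LSTM; the window length and step count are kept small for illustration.

```python
def forecast(history, window, steps, predict_next):
    """Predict `steps` future values, reusing each prediction as input."""
    recent = list(history[-window:])
    predictions = []
    for _ in range(steps):
        nxt = predict_next(recent)
        predictions.append(nxt)
        recent = recent[1:] + [nxt]  # slide the window forward by one step
    return predictions


def mean_of_window(window_values):
    """Hypothetical stand-in for the trained model's prediction call."""
    return sum(window_values) / len(window_values)


history = [1.0, 2.0, 3.0]  # hypothetical scaled prices
next_values = forecast(history, window=3, steps=2, predict_next=mean_of_window)
```

Because later predictions depend on earlier ones, errors can compound over the horizon, so forecasts further out are less reliable than the first step.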
- Chauhan, P. (2023). Stock prediction and forecasting using LSTM (Long-Short-Term-Memory). https://medium.com/@prajjwalchauhan94017/stock-prediction-and-forecasting-using-lstm-long-short-term-memory-9ff56625de73
- Hayes, A. (2023). How does the stock market work? https://www.investopedia.com/articles/investing/082614/how-stock-market-works.asp#toc-what-is-a-stock
- PySpark. (n.d.). What is PySpark? https://domino.ai/data-science-dictionary/pyspark