AthulyaSG/Stock_Market_Analysis_Using_Big_Data-Spark


Big Data

Application of PySpark in the Stock Market

Stock Market

  • Stocks are bought and sold through transactions in the stock market
  • Companies secure capital in the stock market through the sale of shares, or equity, to investors
  • A stock signifies ownership in a company or organization, entailing a proportional stake in its assets and earnings
  • Stock exchanges serve as secondary markets where shareholders can engage in transactions (Hayes, 2023)

PySpark

  • PySpark facilitates the integration of Spark and Python
  • It is the Python API for Apache Spark, an open-source distributed computing framework and collection of libraries for processing massive amounts of data in real time
  • PySpark not only offers an API for Spark but also provides access to Resilient Distributed Datasets (RDDs)
  • Pandas and Spark DataFrames differ in how they execute: pandas evaluates operations eagerly, while Spark evaluates them lazily
  • In PySpark, transformations are deferred until an action in the pipeline explicitly requests a result (PySpark, n.d.)

Long Short-Term Memory

  • Long Short-Term Memory (LSTM) stands out for its proficiency in capturing intricate patterns and dependencies within historical data
  • LSTM shows promise for understanding the unpredictable nature of the stock market
  • LSTM uses memory cells, gates, and well-designed connections to selectively store and transmit information over extended time periods, allowing these models to effectively capture complex temporal patterns in sequential data, especially in predicting time series data like stock prices (Chauhan, 2023)

Data

  • Date: Date of each transaction from 2015 to 2020
  • Open: Opening price, the first transaction price per share at the beginning of a trading day
  • High: The highest price during a trading day
  • Low: The lowest price during a trading day
  • Close: Closing price, the transaction price at the end of a trading day
  • Close Adjusted: The closing price adjusted for dividends and stock splits
  • Volume: The number of shares of a company traded in a day
  • Symbol: The unique identifier or ticker for a stock of a company

image

Data Characteristics

  • Number of rows: 68522038
  • Number of columns: 8

image

  • Minimum and maximum dates are determined
  • Timeframe: January 2, 2015, to July 2, 2020

image

  • Total count of distinct company symbols is determined
  • 6335 companies are present

image

Data Categorization

  • Temporal Features: Date (Timestamp)
  • Quantitative Features/Continuous Variables: Volume, Open, High, Low, Close, Adjclose
  • Qualitative Features: Symbol/Ticker

image

  • symbol -> string
  • date -> date (stores date-related information)
  • volume -> long (i.e., integer)
  • open, high, low, close, and adjclose -> double (i.e., floating point)

Data Selection

  • Check for missing values
  • No null values

image

  • Filtered a list of companies by symbol
  • Calculated the number of occurrences
  • Arranged in descending order
  1. Apple -> AAPL
  2. Amazon -> AMZN
  3. Google -> GOOG
  4. Microsoft -> MSFT
  5. Tesla -> TSLA

image

Visualization

Boxplots

  • Visualized outliers
  • Plotted each attribute of all 5 stocks
  • Even though there are outliers, they cannot be removed or modified, as doing so would discard important values

image

  • Plotted each stock separately based on features
  • Large volume: MSFT, followed by AAPL
  • Highest price: AMZN
  • Open is not considered, as the previous day's close is the same as the next day's open

image

Bar Chart

  • Plotted a bar chart of the volume of each stock
  • AAPL has the largest volume, followed by MSFT

image

Histograms

  • Histograms are used for continuous variables.
  • AAPL and AMZN have the highest volume
  • Prices (high, low, close, and adjclose) are highest for AMZN
  • Prices are lowest for MSFT

image

Descriptive Analysis

AAPL

  • Mean close : 168.11
  • SD : 60.89
  • Minimum close : 90.34
  • Maximum close : 366.53

image

GOOG

  • Mean close : 954.07
  • SD : 257.03
  • Minimum close : 491.20
  • Maximum close : 1526.69

image

AMZN

  • Mean close : 1213.50
  • SD : 601.22
  • Minimum close : 286.95
  • Maximum close : 2890.30

image

MSFT

  • Mean close : 89.55
  • SD : 40.84
  • Minimum close : 40.29
  • Maximum close : 206.26

image

TSLA

  • Mean close : 310.69
  • SD : 152.18
  • Minimum close : 143.67
  • Maximum close : 1208.66

image

Time Series Analysis

  • To identify historical patterns, model price movements, and make predictions
  • AMZN: Upward trend, exponential growth after 2018
  • GOOG: Had the highest stock price in 2015 but was surpassed by AMZN after mid-2016
  • TSLA: Steady until 2020
  • AAPL and MSFT: No major changes

image

Daily Return

  • To check the day-to-day fluctuations in pricing of each stock
  • AMZN: High daily return percentage, touching 1000 in 2019
  • TSLA: Initially high, then dropped to 100 mid-2019
  • GOOG and AAPL: Less volatility with some overlaps
  • MSFT: Low daily returns
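
A minimal plain-Python sketch of the daily return computation, assuming it is the percentage change between consecutive closing prices. The function name and the sample prices are illustrative and are not taken from the repository:

```python
def daily_returns(closes):
    """Return the percentage change between consecutive closing prices."""
    return [
        (today - yesterday) / yesterday * 100.0
        for yesterday, today in zip(closes, closes[1:])
    ]


# Illustrative closing prices (not real market data).
closes = [100.0, 102.0, 99.96]
returns = daily_returns(closes)
print(returns)
```

The result has one fewer element than the input, because the first day has no previous close to compare against.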

image

Moving Average

  • Helps identify trends by smoothing out noise in prices
  • 30-day and 50-day rolling averages: the average stock price over a sliding window of the specified length
  • Overall trend: Upward
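
The rolling average described above can be sketched in plain Python as follows. This is an illustration under the assumption of a simple (unweighted) moving average; the function name is hypothetical, and a window of 3 is used only to keep the example short:

```python
def moving_average(values, window):
    """Simple moving average; positions without a full window yield None."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        averages.append(running_sum / window if i >= window - 1 else None)
    return averages


# Illustrative prices; a 30-day or 50-day window works the same way.
smoothed = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
print(smoothed)
```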

image

Volume Analysis

  • MSFT exhibits higher average trading volumes
  • AAPL is closely followed by MSFT
  • AMZN, GOOG, and TSLA have no noticeable movement

image

Correlation

  • To identify where the majority of values lie
  • To determine whether there is any relationship between each feature

image

  • Distribution of values for the numerical columns
  • Open, close, high, low, and adjclose are similarly distributed

image

To interpret the correlation coefficient:

  • if the value is near -1, there is a strong negative correlation
  • if the value is near 0, there is a weak correlation
  • if the value is near +1, there is a strong positive correlation
  • As the values of high, low, close, and adjclose are nearly identical, all of their pairwise correlations are shown as 1.
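
As an illustration of these ranges, here is a minimal plain-Python sketch of the Pearson correlation coefficient. It assumes Pearson correlation is the measure used, which the text does not state explicitly; the function name and sample values are hypothetical:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Two series that move together give a value near +1,
# and two that move in opposite directions give a value near -1.
together = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(together, opposite)
```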

image

  • AAPL stock decreased until 2016
  • After 2016, there was an upward trend, even though the price decreased at times in later years
  • No visible variations in high, low, close, and adjclose values, indicating that they are very similar to one another.

image

Model

  • Filter the stock to be predicted
  • In this case, AAPL is selected from the 5 stocks
  • Target variable: close

image

image

Data Preprocessing

The values are normalized to bring them onto a consistent scale

  • All values are treated equally
  • Optimization algorithms work faster
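
A minimal sketch of one common normalization, min-max scaling to the range [0, 1]. The choice of min-max scaling is an assumption, since the text does not name the scaler, and the class below is a hypothetical plain-Python stand-in rather than the scaler used in the repository:

```python
class MinMaxScaler1D:
    """Scale values to [0, 1] and map scaled values back to the original range."""

    def fit(self, values):
        self.low = min(values)
        self.high = max(values)
        if self.high == self.low:
            raise ValueError("cannot scale a constant series")
        return self

    def transform(self, values):
        span = self.high - self.low
        return [(v - self.low) / span for v in values]

    def inverse_transform(self, scaled):
        span = self.high - self.low
        return [s * span + self.low for s in scaled]


# Illustrative input: the minimum, mean, and maximum AAPL close reported above.
prices = [90.34, 168.11, 366.53]
scaler = MinMaxScaler1D().fit(prices)
scaled = scaler.transform(prices)
restored = scaler.inverse_transform(scaled)
print(scaled)
```

Keeping the fitted minimum and maximum is what later allows predictions to be mapped back to the original price scale.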

image

Train and Test Split

  • Training: 80% of the data
  • Testing: 20% of the data
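
A minimal sketch of an 80/20 split, assuming the split is chronological (earlier observations for training, later ones for testing) so that the model is not trained on data from after the test period. The function name is hypothetical:

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered series into a leading train part and a trailing test part."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Illustrative series of 10 time-ordered values.
train, test = chronological_split(list(range(10)))
print(len(train), len(test))
```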

image

Fixed-length sequences are created from the training and testing data for training an LSTM model, where each sequence represents a historical window used to predict the next time step.
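
The windowing step can be sketched in plain Python as follows. The function name is hypothetical, and the window length of 3 is chosen only to keep the example readable; it is not the length used in the project:

```python
def make_sequences(series, window):
    """Build (input window, next value) pairs from a time-ordered series."""
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])  # historical window
        targets.append(series[start + window])       # value to predict
    return inputs, targets


# Illustrative series.
X, y = make_sequences([1, 2, 3, 4, 5], window=3)
print(X, y)
```

A series of n values yields n minus window pairs, so the series must be longer than the window to produce any training examples.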

image

Creating and Fitting LSTM Model

A sequential model for a Long Short-Term Memory (LSTM) neural network with three layers is initialized

  • 1 input layer
  • 2 hidden layers
  • 1 output layer
    • 50 neurons
    • dropout layers to prevent overfitting
    • a dense output layer with one unit

image

An LSTM model is trained using historical data

  • 50 epochs
  • a batch size of 64
  • validation split of 0.1

image

  • Training loss remains consistently low and stable
  • Validation loss exhibits slight variations

image

Mean Absolute Percentage Error

The mean absolute percentage error (MAPE) is calculated for the predicted values against the actual values, and the resulting accuracy is 95.03%.
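
A minimal sketch of the MAPE computation, assuming that accuracy is reported as 100% minus MAPE (the text does not state how the accuracy figure is derived). The function name and sample values are illustrative:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Illustrative values (not the model's real predictions).
error = mape([100.0, 200.0], [95.0, 210.0])
accuracy = 100.0 - error  # assumed definition of accuracy
print(error, accuracy)
```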

image

R2 Value

The R2 value of 0.75 suggests that the model explains about 75% of the variability in the data, indicating a moderately good fit.
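
For reference, a minimal plain-Python sketch of the coefficient of determination, computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name and sample values are illustrative:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when the actual values are constant")
    return 1.0 - ss_res / ss_tot


# Illustrative values (not the model's real predictions).
score = r_squared([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 3.0, 3.5])
print(score)
```

A perfect prediction gives 1, and always predicting the mean of the actual values gives 0.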

image

Actual versus Predicted Values

  • Visual comparison of the model's predictions against actual values over the specified time steps
  • To evaluate the performance of predictive models
  • Degree of alignment indicates the accuracy of the model's predictions
  • A close correspondence between the lines signifies that the model is making precise predictions

image

  • Previous 60 time steps are considered for plotting
  • The training and test predictions are plotted against the actual Apple stock prices, with predictions rescaled back to the original scale using the scaler.

image

10-Day Prediction

  • The LSTM model predicts the stock prices for the next 10 days iteratively, and the predictions are then transformed back to the original scale.
  • According to the graph, the model forecasts that the price of AAPL stock will decrease over the next 10 days.
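
The iterative procedure can be sketched as follows: each prediction is appended to the input window and the oldest value is dropped before predicting the next step. In this sketch `mean_of_window` is a hypothetical stand-in for the trained LSTM, and the function names, window size, and values are all illustrative:

```python
def iterative_forecast(history, predict_next, window, steps):
    """Forecast several steps ahead by feeding each prediction back as input."""
    if len(history) < window:
        raise ValueError("history is shorter than the window")
    buffer = list(history[-window:])
    forecasts = []
    for _ in range(steps):
        next_value = predict_next(buffer)
        forecasts.append(next_value)
        # Slide the window: drop the oldest value, append the new prediction.
        buffer = buffer[1:] + [next_value]
    return forecasts


def mean_of_window(window_values):
    """Hypothetical stand-in for a trained model's one-step prediction."""
    return sum(window_values) / len(window_values)


forecast = iterative_forecast([1.0, 2.0, 3.0], mean_of_window, window=3, steps=10)
print(forecast)
```

Because later steps depend on earlier predictions rather than observed prices, errors can accumulate as the forecast horizon grows.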

image

image

References
