Skip to content

Latest commit

 

History

History
212 lines (145 loc) · 9.69 KB

File metadata and controls

212 lines (145 loc) · 9.69 KB

Analyzing Stack Overflow Annual Developer Survey Data with MariaDB

This repository provides data and information that will enable you to transform, import and analyze the raw Stack Overflow Annual Developer Survey data with MariaDB ColumnStore.

Starting at Step 2, Parse and transform the data, this repository will walk you through the process of preparing, importing and, ultimately, being able to analyze the raw Stack Overflow Annual Developer Survey data for 2020 (which is included in the developer_survey_2020 folder).

Note: This repository will be updated to include the raw 2021 data once it becomes available here.

Table of Contents

  1. Requirements
  2. Parse and transform the data
  3. Prepare the database
    1. Docker Container
    2. MariaDB SkySQL
  4. Create the schema
  5. Import the data
  6. Analyze the data
    1. SQL
    2. MariaDB SkySQL
  7. Support and contribution
  8. License

Requirements

Parsing and transforming the data

In this sample you will use the Python script file, parse_and_transform.py, to parse and transform the Stack Overflow Annual Survey data for 2020.

Executing parse_and_transform.py will parse through the [survey results data] and split each row, which contains all the responses from an single survey respondent, into multiple rows, one per response. The result of turns ~66k rows of data into ~5 million rows.

To execute the parse_and_transform.py you need to perform the following steps.

  1. Open a new terminal window at this location.

  2. Create a new Python virtual environment.

$ python3 -m venv venv
  1. Activate the virtual environment.
$ . venv/bin/activate
  1. Install the pandas Python package, which will be use for data manipulation within parse_and_transform.py.
$ pip install pandas
  1. And, finally, execute the parse_and_transform.py script!
$ python3 parse_and_transform.py

Preparing the database

To be able to store and analyze the survey data you're going to need a place to put it. MariaDB to the rescue! Below includes instructions on setting up a local database, using the official MariaDB Docker image, or MariaDB SkySQL, the ultimate MariaDB database in the cloud.

Docker Container

Running a single instance (container) of MariaDB ColumnStore is incredibly simple using the MariaDB Community Server ColumnStore image.

Check out the instructions here.

MariaDB SkySQL

SkySQL is the first and only database-as-a-service (DBaaS) to bring the full power of MariaDB Platform to the cloud, including its support for transactional, analytical and hybrid workloads. Built on Kubernetes, and optimized for cloud infrastructure and services, SkySQL combines ease of use and self-service with enterprise reliability and world-class support – everything needed to safely run mission-critical databases in the cloud, and with enterprise governance.

Get started with SkySQL!

IMPORTANT: Once you've registered for MariaDB SkySQL you will need to create a new analytics service so that you can take advantage of the MariaDB columnar storage engine, ColumnStore. For more information on how to do this check out this walk-through, or check out this short video on launching a new SkySQL service - don't worry it only takes a couple of minutes!

Create the schema

The survey result data contained in newly created answers.csv file will need to be imported to MariaDB. To accomodate that you will need to create a new database, survey_data, that contains a single table, answers.

To create the new database and table you can either copy and execute the following code within a database client of your choice.

DROP DATABASE IF EXISTS survey_data;
CREATE DATABASE survey_data;

CREATE TABLE answers (
    respondent_id INT unsigned NOT NULL, 
    question_id VARCHAR(25) NOT NULL,
    answer VARCHAR(65) NOT NULL
) ENGINE=ColumnStore DEFAULT CHARSET=utf8;

or use the MariaDB Client to execute the schema.sql script contained within this repository.

For example:

Locally

$ mariadb --host 127.0.0.1 --user root -pPassword123! < schema.sql

SkySQL

mariadb --host analytics-1.mdb0001265.db.skysql.net --port 5001 --user DB00004537 -p --ssl-ca ~/Downloads/skysql_chain.pem < schema.sql

Note: Remember to update the command above with your database location, user and SSL information accordingly!

Import the data

After you've created the new schema, you can import the answers.csv data using the MariaDB Client.

For example:

Locally

mariadb --host 127.0.0.1 --port 3306 --user root -pPassword123! -e "LOAD DATA LOCAL INFILE 'answers.csv' INTO TABLE answers FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'" survey_data

SkySQL

mariadb --host analytics-1.mdb0001265.db.skysql.net --port 5001 --user DB00004537 -p --ssl-ca ~/Downloads/skysql_chain.pem -e "LOAD DATA LOCAL INFILE 'answers.csv' INTO TABLE answers FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'" survey_data

Note: Remember to update the command above with your database location, user and SSL information accordingly!

Analyze the data

Once the data has been successfully imported into MariaDB there are many ways that you can use the data!

SQL

You can use a database client to execute SQL queries directly on the results data contained in the answers table.

For exmple, using the MariaDB Client:

  1. Start by connecting to your MariaDB database instance.
$ mariadb --host 127.0.0.1 --user root -pPassword123!
  1. Execute the following query to SELECT the top 10 programming langauges that have been used by respondents that have also used MariaDB.
SELECT
	answer, COUNT(answer) AS respondent_count
FROM
	survey_data.answers
WHERE 
	question_id = "LanguageWorkedWith" AND 
	respondent_id IN (SELECT respondent_id FROM answers WHERE question_id = "DatabaseWorkedWith" AND answer = "MariaDB")
GROUP BY
	answer
ORDER BY
	COUNT(answer) DESC
LIMIT 10;
+-----------------------+------------------+
| answer                | respondent_count |
+-----------------------+------------------+
| JavaScript            |             6878 |
| HTML/CSS              |             6597 |
| SQL                   |             6239 |
| PHP                   |             5149 |
| Python                |             4204 |
| Java                  |             4028 |
| Bash/Shell/PowerShell |             3746 |
| TypeScript            |             2558 |
| C#                    |             2396 |
| C++                   |             2277 |
+-----------------------+------------------+

Python & Jupyter Lab

You can also use modern data analysis and visualization tools like Jupyter Lab, in combination with MariaDB Connector/Python and Python libraries like Plotly and Pandas.

For more information on how you can do this please check out the following resources:

Support and Contribution

Please feel free to submit PR's, issues or requests to this project project directly.

If you have any other questions, comments, or looking for more information on MariaDB please check out:

Or reach out to us diretly via:

License

License