- 12.10.2024. Slides from CorrelCon 2024 are available here.
- 11.12.2023. Slides from DataKindUK SDS are available here.
- 11.11.2023. Slides from CorrelCon 2023 are available here.
You can clone this repository and use it locally as follows.
git clone https://github.com/darenasc/eda.git
cd eda
pip install pipenv
pipenv install
Or you can go to the eda.ipynb notebook and open it in Colab.
Mindmap created with Freeplane.
Some useful commands for the terminal.
# Explore directories
ls
# Explore content of files
cat
more
less
head
tail
# Count number of lines
wc -l
# Search in files
grep
# Get documentation of commands
man
# Download data or files
wget
curl
# Monitor resources
htop
btop
# Modify files
vim
sed
Some Python libraries to explore data.
- pandas
- ydata-profiling (formerly pandas-profiling)
- sweetviz
- pygwalker
- facets
- datapane
- streamlit
- gradio
- geopandas
- pysal
- networkx
- word_cloud
- great_expectations
- featuretools
- superset
- metabase
- openml-python
- What are the formats?
- Are there files with problems (for example, files that can't be opened)?
- How many files, tables, databases?
- Per item: How many columns and rows?
- Are there any encoding issues?
- Verify data types of columns: Discrete, Continuous, Dates, GIS, network, other.
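The first questions in this checklist (formats, unreadable files, number of files, rows and columns, encoding) can be answered with a small script before any analysis library is loaded. The following is a minimal sketch using only the Python standard library. The function names `inventory` and `count_formats` are hypothetical, and the sketch assumes that CSV files have a header row and that UTF-8 is the expected encoding.

```python
import csv
import io
from collections import Counter
from pathlib import Path


def inventory(data_dir):
    """Describe every file under data_dir: format, readability, encoding, shape."""
    report = []
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        entry = {
            "file": path.name,
            "format": path.suffix.lower() or "(none)",
            "readable": True,
            "utf8": None,
            "rows": None,
            "columns": None,
        }
        try:
            raw = path.read_bytes()
        except OSError:
            # The file exists but cannot be opened (permissions, broken link, ...).
            entry["readable"] = False
            report.append(entry)
            continue
        try:
            text = raw.decode("utf-8")
            entry["utf8"] = True
        except UnicodeDecodeError:
            # Flag the encoding issue; latin-1 decodes any byte sequence,
            # so rows and columns can still be counted.
            entry["utf8"] = False
            text = raw.decode("latin-1")
        if entry["format"] == ".csv":
            rows = list(csv.reader(io.StringIO(text, newline="")))
            entry["columns"] = len(rows[0]) if rows else 0
            # Assumption: the first row is a header, so it is not counted as data.
            entry["rows"] = max(len(rows) - 1, 0)
        report.append(entry)
    return report


def count_formats(report):
    """Count how many files of each format were found."""
    return Counter(entry["format"] for entry in report)
```

Calling `inventory("data/")` on a folder of your own (the path is a placeholder) returns one dictionary per file. Entries with `readable` set to `False` or `utf8` set to `False` are the ones to inspect by hand first.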
- Univariate analysis
  - Histogram
  - Bar plot
  - Boxplot
- Multivariate analysis
  - Correlations
  - Use target variable to visualize other features
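To make the univariate and multivariate steps concrete, here is a minimal sketch of the numbers behind each plot, written with the Python standard library only so that it runs anywhere. All function names are hypothetical. In practice a library such as pandas covers the same ground, for example with `DataFrame.describe` and `DataFrame.corr`.

```python
import statistics
from collections import Counter


def five_number_summary(values):
    """Minimum, quartiles and maximum: the numbers a boxplot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


def histogram_counts(values, bins):
    """Count values per equal-width bin: the bar heights of a histogram."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 when all values are equal
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return counts


def category_counts(values):
    """Count occurrences per category: the bar heights of a bar plot."""
    return Counter(values)


def pearson(xs, ys):
    """Pearson correlation between two numeric columns of equal length.

    Raises ZeroDivisionError if either column is constant.
    """
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mean_by_target(values, targets):
    """Average a feature within each value of the target variable."""
    groups = {}
    for v, t in zip(values, targets):
        groups.setdefault(t, []).append(v)
    return {t: statistics.mean(vs) for t, vs in groups.items()}
```

`mean_by_target` is one simple way to use the target variable to look at other features: a large gap between the group means suggests the feature is worth plotting against the target.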