Commit
Add documentation with mkdocs
darenasc committed Dec 23, 2024
1 parent d01bf79 commit 5416b7c
Showing 9 changed files with 291 additions and 11 deletions.
22 changes: 22 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
name: ci
on:
  push:
    branches:
      - main
permissions:
  contents: write
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: 3.x
      - uses: actions/cache@v2
        with:
          key: ${{ github.ref }}
          path: .cache
      - run: pip install mkdocs-material
      - run: pip install pillow cairosvg
      - run: mkdocs gh-deploy --force
4 changes: 4 additions & 0 deletions Pipfile
@@ -28,6 +28,10 @@ pipfile = "*"
pytest = "*"
pytest-cov = "*"
typing-extensions = "*"
mkdocs-material = "*"
mkdocs = "*"
mkdocstrings-python = "*"
mkdocstrings = {version = "*", extras = ["python"]}

[requires]
python_version = "3.10"
67 changes: 56 additions & 11 deletions README.md
@@ -4,7 +4,8 @@
![](https://img.shields.io/github/last-commit/darenasc/auto-fes)
![](https://img.shields.io/github/stars/darenasc/auto-fes?style=social)

Automated exploration of files in a folder structure to extract metadata and
potential usage of information.

If you have a bunch of structured data in plain files, this library is for you.

@@ -51,8 +52,6 @@ flowchart LR

## Explore

```python
from afes import afe

@@ -64,13 +63,31 @@ df_files = afe.explore(TARGET_FOLDER)
df_files
```

The `df_files` dataframe will look like the following table, depending on the
files you plan to explore.

```
| | path | name | extension | size | human_readable | rows | separator |
| ---: | :------------------------------------------------ | :----------------------- | :-------- | ------: | :------------- | ----: | :-------- |
| 1 | /content/sample_data/auto_mpg.csv | auto_mpg | .csv | 20854 | 20.4 KiB | 399 | comma |
| 2 | /content/sample_data/car_evaluation.csv | car_evaluation | .csv | 51916 | 50.7 KiB | 1729 | comma |
| 3 | /content/sample_data/iris.csv | iris | .csv | 4606 | 4.5 KiB | 151 | comma |
| 4 | /content/sample_data/wine_quality.csv | wine_quality | .csv | 414831 | 405.1 KiB | 6498 | comma |
| 5 | /content/sample_data/california_housing_test.csv | california_housing_test | .csv | 301141 | 294.1 KiB | 3001 | comma |
| 6 | /content/sample_data/california_housing_train.csv | california_housing_train | .csv | 1706430 | 1.6 MiB | 17001 | comma |
```
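The kind of metadata shown above can be collected with the standard library alone. A minimal, self-contained sketch of the idea (illustrative only — `explore_folder` is a hypothetical helper, not the library's actual implementation):

```python
from pathlib import Path


def explore_folder(target: str) -> list[dict]:
    """Collect basic metadata for every file under `target`, recursively."""
    records = []
    for path in Path(target).rglob("*"):
        if path.is_file():
            records.append(
                {
                    "path": str(path),
                    "name": path.stem,
                    "extension": path.suffix,
                    "size": path.stat().st_size,  # size in bytes
                }
            )
    return records
```

The real `afe.explore()` additionally estimates row counts and separators and returns a `pandas` dataframe rather than a list of dicts.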

Check out the [example.py](src/example.py) file and run it from a terminal
with Python, or use the Jupyter
[notebook](src/notebook-example.ipynb).

## Generate code

Using the dataframe `df_files` generated in the explore phase, you can generate
working Python `pandas` code.

The function `generate()` will generate Python code to load the files using
`pandas`.

```python
from afes import afe
@@ -83,13 +100,36 @@ df_files = afe.explore(TARGET_FOLDER)
afe.generate(df_files)
```

The generated code will look like this:

```
### Start of the code ###
import pandas as pd

df_auto_mpg = pd.read_csv('/content/sample_data/auto_mpg.csv', sep = ',')
df_car_evaluation = pd.read_csv('/content/sample_data/car_evaluation.csv', sep = ',')
df_iris = pd.read_csv('/content/sample_data/iris.csv', sep = ',')
df_wine_quality = pd.read_csv('/content/sample_data/wine_quality.csv', sep = ',')
df_california_housing_test = pd.read_csv('/content/sample_data/california_housing_test.csv', sep = ',')
df_california_housing_train = pd.read_csv('/content/sample_data/california_housing_train.csv', sep = ',')

### End of the code ###

"code.txt" has the generated Python code to load the files.
```

By default, the code is printed to standard output and also written to the
`./code.txt` file.

> Note: you can replace the `.txt` extension by `.py` to make it a working
> Python script.
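Conceptually, the generation step is string templating over the explored metadata. A standalone sketch of the idea (the `render_pandas_code` helper and `SEPARATORS` mapping are assumptions for illustration, not the library's internals):

```python
# Map the human-readable separator names from the explore table to characters.
SEPARATORS = {"comma": ",", "semicolon": ";", "tab": "\\t"}


def render_pandas_code(files: list[dict]) -> str:
    """Render one pd.read_csv call per explored file record."""
    lines = ["import pandas as pd", ""]
    for f in files:
        sep = SEPARATORS.get(f["separator"], ",")
        # Each record becomes a df_<name> assignment in the generated source.
        lines.append(f"df_{f['name']} = pd.read_csv('{f['path']}', sep='{sep}')")
    return "\n".join(lines)
```

For example, a single record with `name="iris"` and `separator="comma"` renders to `df_iris = pd.read_csv('<path>', sep=',')`, matching the generated output shown above.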

## Profile

Using the dataframe `df_files` generated in the explore phase, the function
`profile(df_files)` will automatically load and profile the files using
[ydata-profiling](https://github.com/ydataai/ydata-profiling) or
[sweetviz](https://github.com/fbdesignpro/sweetviz).

```python
# Path to folder with files to be explored
@@ -103,8 +143,13 @@ afe.profile(df_files, profile_tool="ydata-profiling", output_path=OUTPUT_FOLDER)
afe.profile(df_files, profile_tool="sweetviz", output_path=OUTPUT_FOLDER)
```

By default, it will process the files using `ydata-profiling` in size order,
starting with the smallest file. It creates the reports and exports them in
HTML format, storing them in the directory where the code is running or in a
directory given via the `output_path = '<YOUR_OUTPUT_PATH>'` argument.
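The size ordering described above amounts to a sort on the `size` column from the explore step, as in this standalone `pandas` sketch (file names and sizes taken from the example table earlier):

```python
import pandas as pd

# Metadata as produced by the explore step (values from the table above).
files = pd.DataFrame(
    {
        "name": ["california_housing_train", "iris", "auto_mpg"],
        "size": [1706430, 4606, 20854],
    }
)

# Sorting by size means the quickest reports are produced first.
ordered = files.sort_values(by="size")
print(ordered["name"].tolist())  # ['iris', 'auto_mpg', 'california_housing_train']
```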

# Contributing

* Open an [issue](https://github.com/darenasc/auto-fes/issues) to request more
  functionalities or give feedback.
1 change: 1 addition & 0 deletions docs/afe.md
@@ -0,0 +1 @@
::: src.afes.afe
1 change: 1 addition & 0 deletions docs/generate.md
@@ -0,0 +1 @@
::: src.afes.generate
146 changes: 146 additions & 0 deletions docs/index.md
@@ -0,0 +1,146 @@
# Automated File Exploration System

Automated exploration of files in a folder structure to extract metadata and
potential usage of information.

If you have a bunch of structured data in plain files, this library is for you.

# Installation

```bash
pip install -q git+https://github.com/darenasc/auto-fes.git
pip install -q ydata_profiling sweetviz # to make profiling tools work
```

## How to use it
```python
from afes import afe

# Path to folder with files to be explored
TARGET_FOLDER = "<PATH_TO_FILES_TO_EXPLORE>"
OUTPUT_FOLDER = "<PATH_TO_OUTPUTS>"

# Run exploration on the files
df_files = afe.explore(TARGET_FOLDER)

# Generate pandas code to load the files
afe.generate(df_files)

# Run profiling on each file
afe.profile(df_files, profile_tool="ydata-profiling", output_path=OUTPUT_FOLDER)
afe.profile(df_files, profile_tool="sweetviz", output_path=OUTPUT_FOLDER)
```

# What can you do with AFES

* Explore
* Generate code
* Profile

```mermaid
flowchart LR
Explore --> Generate
Explore --> Profile
Generate --> PandasCode
Profile --> ydata-profile@{ shape: doc }
Profile --> sweetviz@{ shape: doc }
```

## Explore

```python
from afes import afe

# Path to folder with files to be explored
TARGET_FOLDER = "<PATH_TO_FILES_TO_EXPLORE>"

# Run exploration on the files
df_files = afe.explore(TARGET_FOLDER)
df_files
```

The `df_files` dataframe will look like the following table, depending on the
files you plan to explore.

```
| | path | name | extension | size | human_readable | rows | separator |
| ---: | :------------------------------------------------ | :----------------------- | :-------- | ------: | :------------- | ----: | :-------- |
| 1 | /content/sample_data/auto_mpg.csv | auto_mpg | .csv | 20854 | 20.4 KiB | 399 | comma |
| 2 | /content/sample_data/car_evaluation.csv | car_evaluation | .csv | 51916 | 50.7 KiB | 1729 | comma |
| 3 | /content/sample_data/iris.csv | iris | .csv | 4606 | 4.5 KiB | 151 | comma |
| 4 | /content/sample_data/wine_quality.csv | wine_quality | .csv | 414831 | 405.1 KiB | 6498 | comma |
| 5 | /content/sample_data/california_housing_test.csv | california_housing_test | .csv | 301141 | 294.1 KiB | 3001 | comma |
| 6 | /content/sample_data/california_housing_train.csv | california_housing_train | .csv | 1706430 | 1.6 MiB | 17001 | comma |
```

## Generate code

Using the dataframe `df_files` generated in the explore phase, you can generate
working Python `pandas` code.

The function `generate()` will generate Python code to load the files using
`pandas`.

```python
from afes import afe

# Path to folder with files to be explored
TARGET_FOLDER = "<PATH_TO_FILES_TO_EXPLORE>"
OUTPUT_FOLDER = "<PATH_TO_OUTPUTS>"

df_files = afe.explore(TARGET_FOLDER)
afe.generate(df_files)
```

The generated code will look like this:

```
### Start of the code ###
import pandas as pd

df_auto_mpg = pd.read_csv('/content/sample_data/auto_mpg.csv', sep = ',')
df_car_evaluation = pd.read_csv('/content/sample_data/car_evaluation.csv', sep = ',')
df_iris = pd.read_csv('/content/sample_data/iris.csv', sep = ',')
df_wine_quality = pd.read_csv('/content/sample_data/wine_quality.csv', sep = ',')
df_california_housing_test = pd.read_csv('/content/sample_data/california_housing_test.csv', sep = ',')
df_california_housing_train = pd.read_csv('/content/sample_data/california_housing_train.csv', sep = ',')

### End of the code ###

"code.txt" has the generated Python code to load the files.
```

By default, the code is printed to standard output and also written to the
`./code.txt` file.

> Note: you can replace the `.txt` extension by `.py` to make it a working
> Python script.

## Profile

Using the dataframe `df_files` generated in the explore phase, the function
`profile(df_files)` will automatically load and profile the files using
[ydata-profiling](https://github.com/ydataai/ydata-profiling) or
[sweetviz](https://github.com/fbdesignpro/sweetviz).

```python
# Path to folder with files to be explored
TARGET_FOLDER = "<PATH_TO_FILES_TO_EXPLORE>"
OUTPUT_FOLDER = "<PATH_TO_OUTPUTS>"

# Run exploration on the files
df_files = afe.explore(TARGET_FOLDER)

afe.profile(df_files, profile_tool="ydata-profiling", output_path=OUTPUT_FOLDER) # or
afe.profile(df_files, profile_tool="sweetviz", output_path=OUTPUT_FOLDER)
```

By default, it will process the files using `ydata-profiling` in size order,
starting with the smallest file. It creates the reports and exports them in
HTML format, storing them in the directory where the code is running or in a
directory given via the `output_path = '<YOUR_OUTPUT_PATH>'` argument.

# Contributing

* Open an [issue](https://github.com/darenasc/auto-fes/issues) to request more
  functionalities or give feedback.
1 change: 1 addition & 0 deletions docs/profile.md
@@ -0,0 +1 @@
::: src.afes.profile
43 changes: 43 additions & 0 deletions mkdocs.yml
@@ -0,0 +1,43 @@
site_name: Auto File Exploration System

theme:
  name: "material"
  features:
    # - navigation.tabs
    # - navigation.sections
    # - toc.integrate
    # - navigation.top
    - search.suggest
    # - search.highlight
    # - content.tabs.link
    - content.code.annotation
    - content.code.copy
  language: en

plugins:
  - search
  - mkdocstrings

extra:
  social:
    - icon: fontawesome/brands/github-alt
      link: https://github.com/darenasc
    - icon: fontawesome/brands/twitter
      link: https://twitter.com/darenasc
    - icon: fontawesome/brands/linkedin
      link: https://www.linkedin.com/in/darenasc/

# markdown_extensions:
#   - pymdownx.highlight:
#       anchor_linenums: true
#   - pymdownx.inlinehilite
#   - pymdownx.snippets
#   - admonition
#   - pymdownx.arithmatex:
#       generic: true
#   - footnotes
#   - pymdownx.details
#   - pymdownx.superfences
#   - pymdownx.mark
#   - attr_list

copyright: |
  &copy; 2024 <a href="https://github.com/darenasc" target="_blank" rel="noopener">Diego Arenas</a>
17 changes: 17 additions & 0 deletions src/afes/afe.py
@@ -117,6 +117,14 @@ def generate(
    python_file: str = "code.txt",
    verbose: bool = True,
):
    """Generate pandas code to load the files.

    Args:
        df (pd.DataFrame): DataFrame with the explored files.
        python_file (str, optional): Name of the file to save the code.
            Defaults to "code.txt".
        verbose (bool, optional): Flag to print the code. Defaults to True.
    """
    generate_pandas_code(df, python_file=python_file, verbose=verbose)


@@ -125,6 +133,15 @@ def profile(
    output_path: str | Path = ".",
    profile_tool: str = "ydata-profiling",
):
    """Profile the structured data.

    Args:
        df (pd.DataFrame): DataFrame with the files to be profiled.
        output_path (str | Path, optional): Folder to save the HTML reports.
            Defaults to ".".
        profile_tool (str, optional): Select which profiling tool to use.
            Defaults to "ydata-profiling".
    """
    output_path = Path(output_path)
    output_path.mkdir(parents=True, exist_ok=True)
    df.sort_values(by="size", inplace=True)
