Fix compatibility issues with Featuretools (#41)
* tests pass

* upgraded sphinx

* trying to fix sphinx error

* added test

* removed comment

* removed ww.init()

* updated release notes

* fixed release note formatting

* fixed formatting

* changed entities to dataframes

* renamed normalize_entity to normalize_entityset

* fixed test

* added breaking change note to release notes
dvreed77 authored Mar 9, 2022
1 parent d02756f commit f05cb9d
Showing 6 changed files with 96 additions and 51 deletions.
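
For callers affected by the rename of `normalize_entity` to `normalize_entityset` described in the commit message, a small lookup shim is one possible way to support both names during a transition. This is only a sketch: `resolve_normalizer`, `old_module`, and `new_module` are hypothetical names introduced for illustration and are not part of autonormalize.

```python
from types import SimpleNamespace


def resolve_normalizer(module):
    """Return the EntitySet normalizer under whichever name `module` exposes.

    Hypothetical helper for illustration; not part of autonormalize.
    """
    # Prefer the post-rename name, then fall back to the pre-rename one.
    for name in ("normalize_entityset", "normalize_entity"):
        fn = getattr(module, name, None)
        if callable(fn):
            return fn
    raise AttributeError(
        "module exposes neither normalize_entityset nor normalize_entity"
    )


# Stand-ins for a pre-rename and a post-rename module (illustration only).
old_module = SimpleNamespace(normalize_entity=lambda es, accuracy=0.98: "old")
new_module = SimpleNamespace(normalize_entityset=lambda es, accuracy=0.98: "new")
```

With the real package, the call would presumably look like `resolve_normalizer(an)(es, accuracy)` after `import autonormalize as an`.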
48 changes: 24 additions & 24 deletions README.md
@@ -8,9 +8,9 @@ AutoNormalize is a Python library for automated datatable normalization. It allo

## Getting Started

* [Install](#install)
* [Demos](#demos)
* [API Reference](#api-reference)
- [Install](#install)
- [Demos](#demos)
- [API Reference](#api-reference)

## Install

@@ -26,11 +26,11 @@ pip uninstall autonormalize

## Demos

* [Blog Post](https://blog.featurelabs.com/automatic-dataset-normalization-for-feature-engineering-in-python/)
* [Machine Learning Demo with Featuretools](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/AutoNormalize%20%2B%20FeatureTools%20Demo.ipynb)
* [Kaggle Liquor Sales Dataset Demo](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Kaggle%20Liquor%20Sales%20Dataset%20Demo.ipynb)
* [Demo with Editing Dependencies](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Editing%20Dependnecies%20Demo.ipynb)
* [Kaggle Food Production Dataset Demo](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Kaggle%20Food%20%20Dataset%20Demo.ipynb)
- [Blog Post](https://blog.featurelabs.com/automatic-dataset-normalization-for-feature-engineering-in-python/)
- [Machine Learning Demo with Featuretools](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/AutoNormalize%20%2B%20FeatureTools%20Demo.ipynb)
- [Kaggle Liquor Sales Dataset Demo](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Kaggle%20Liquor%20Sales%20Dataset%20Demo.ipynb)
- [Demo with Editing Dependencies](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Editing%20Dependnecies%20Demo.ipynb)
- [Kaggle Food Production Dataset Demo](https://github.com/FeatureLabs/autonormalize/blob/master/autonormalize/demos/Kaggle%20Food%20%20Dataset%20Demo.ipynb)

## API Reference

@@ -44,19 +44,19 @@ Creates a normalized entityset from a dataframe.

**Arguments:**

* `df` (pd.Dataframe) : the dataframe containing data
- `df` (pd.Dataframe) : the dataframe containing data

* `accuracy` (0 < float <= 1.00; default = 0.98) : the accuracy threshold required in order to conclude a dependency (i.e. with accuracy = 0.98, 0.98 of the rows must hold true the dependency LHS --> RHS)
- `accuracy` (0 < float <= 1.00; default = 0.98) : the accuracy threshold required in order to conclude a dependency (i.e. with accuracy = 0.98, 0.98 of the rows must hold true the dependency LHS --> RHS)

* `index` (str, optional) : name of column that is intended index of df
- `index` (str, optional) : name of column that is intended index of df

* `name` (str, optional) : the name of created EntitySet
- `name` (str, optional) : the name of created EntitySet

* `time_index` (str, optional) : name of time column in the dataframe.
- `time_index` (str, optional) : name of time column in the dataframe.

**Returns:**

* `entityset` (ft.EntitySet) : created entity set
- `entityset` (ft.EntitySet) : created entity set

### `find_dependencies`

@@ -68,7 +68,7 @@ Finds dependencies within dataframe with the DFD search algorithm.

**Returns:**

* `dependencies` (Dependencies) : the dependencies found in the data within the contraints provided
- `dependencies` (Dependencies) : the dependencies found in the data within the contraints provided

### `normalize_dataframe`

@@ -78,13 +78,13 @@ normalize_dataframe(df, dependencies)

Normalizes dataframe based on the dependencies given. Keys for the newly created DataFrames can only be columns that are strings, ints, or categories. Keys are chosen according to the priority:

1) shortest lenghts
2) has "id" in some form in the name of an attribute
3) has attribute furthest to left in the table
1. shortest lenghts
2. has "id" in some form in the name of an attribute
3. has attribute furthest to left in the table

**Returns:**

* `new_dfs` (list[pd.DataFrame]) : list of new dataframes
- `new_dfs` (list[pd.DataFrame]) : list of new dataframes

<br />

@@ -98,25 +98,25 @@ Creates a normalized EntitySet from dataframe based on the dependencies given. K

**Returns:**

* `entityset` (ft.EntitySet) : created EntitySet
- `entityset` (ft.EntitySet) : created EntitySet

<br />

### `normalize_entity`
### `normalize_entityset`

```shell
normalize_entity(es, accuracy=0.98)
normalize_entityset(es, accuracy=0.98)
```

Returns a new normalized `EntitySet` from an `EntitySet` with a single entity.

**Arguments:**

* `es` (ft.EntitySet) : EntitySet with a single entity to normalize
- `es` (ft.EntitySet) : EntitySet with a single entity to normalize

**Returns:**

* `new_es` (ft.EntitySet) : new normalized EntitySet
- `new_es` (ft.EntitySet) : new normalized EntitySet

<br />

34 changes: 21 additions & 13 deletions autonormalize/autonormalize.py
@@ -85,24 +85,31 @@ def make_entityset(df, dependencies, name=None, time_index=None):
normalize.normalize_dataframe(depdf)
normalize.make_indexes(depdf)

entities = {}
dataframes = {}
relationships = []

stack = [depdf]

while stack != []:
current = stack.pop()
if (current.df.ww.schema is None):
current.df.ww.init(index=current.index[0], name=current.index[0])

current_df_name = current.df.ww.name
if time_index in current.df.columns:
entities[current.index[0]] = (current.df, current.index[0], time_index)
dataframes[current_df_name] = (current.df, current.index[0], time_index)
else:
entities[current.index[0]] = (current.df, current.index[0])
dataframes[current_df_name] = (current.df, current.index[0])
for child in current.children:
if (child.df.ww.schema is None):
child.df.ww.init(index=child.index[0], name=child.index[0])
child_df_name = child.df.ww.name
# add to stack
# add relationship
stack.append(child)
relationships.append((child.index[0], child.index[0], current.index[0], child.index[0]))
relationships.append((child_df_name, child.index[0], current_df_name, child.index[0]))

return ft.EntitySet(name, entities, relationships)
return ft.EntitySet(name, dataframes, relationships)


def auto_entityset(df, accuracy=0.98, index=None, name=None, time_index=None):
@@ -141,9 +148,9 @@ def auto_normalize(df):
return normalize_dataframe(df, find_dependencies(df))


def normalize_entity(es, accuracy=0.98):
def normalize_entityset(es, accuracy=0.98):
"""
Returns a new normalized EntitySet from an EntitySet with a single entity.
Returns a new normalized EntitySet from an EntitySet with a single dataframe.
Arguments:
es (ft.EntitySet) : EntitySet to normalize
@@ -152,13 +159,14 @@ def normalize_entity(es, accuracy=0.98):
Returns:
new_es (ft.EntitySet) : new normalized EntitySet
"""
# TO DO: add option to pass an EntitySet with more than one entity, and specify which one
# TO DO: add option to pass an EntitySet with more than one dataframe, and specify which one
# to normalize while preserving existing relationships

if len(es.entities) > 1:
raise ValueError('There is more than one entity in this EntitySet')
if len(es.entities) == 0:
if len(es.dataframes) > 1:
raise ValueError('There is more than one dataframe in this EntitySet')
if len(es.dataframes) == 0:
raise ValueError('This EntitySet is empty')
entity = es.entities[0]
new_es = auto_entityset(entity.df, accuracy, index=entity.index, name=es.id, time_index=entity.time_index)

df = es.dataframes[0]
new_es = auto_entityset(df, accuracy, index=df.ww.index, name=es.id, time_index=df.ww.time_index)
return new_es
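
The stack-based walk in `make_entityset` above can be sketched without Featuretools or Woodwork. The version below is a simplified illustration under stated assumptions: `Node` and `collect` are hypothetical stand-ins (the real code walks dependency-dataframe objects and stores the actual `pd.DataFrame` in each tuple), and the dataframe name is kept as a plain attribute rather than read from `df.ww.name`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for one normalized table in the dependency tree."""
    name: str
    index: List[str]
    columns: List[str]
    children: List["Node"] = field(default_factory=list)


def collect(root, time_index=None):
    """Gather per-table entries and relationship tuples with a depth-first walk."""
    dataframes = {}
    relationships = []
    stack = [root]
    while stack:
        current = stack.pop()
        key = current.index[0]
        # Only tables that actually contain the time column carry a time index.
        if time_index in current.columns:
            dataframes[current.name] = (current.name, key, time_index)
        else:
            dataframes[current.name] = (current.name, key)
        for child in current.children:
            stack.append(child)
            # Same tuple layout as the diff: the child node's table and key,
            # then the current table and the column of the same name in it.
            relationships.append(
                (child.name, child.index[0], current.name, child.index[0])
            )
    return dataframes, relationships


# Tree shaped like the relationships asserted in the test file.
customer = Node("customer_id", ["customer_id"], ["customer_id"])
session = Node("session_id", ["session_id"], ["session_id", "customer_id"], [customer])
product = Node("product_id", ["product_id"], ["product_id"])
root = Node(
    "transaction_id",
    ["transaction_id"],
    ["transaction_id", "session_id", "product_id", "transaction_time"],
    [session, product],
)
dataframes, relationships = collect(root, time_index="transaction_time")
```

Keying `dataframes` and the relationship tuples by the same table name is what keeps the two structures consistent, which appears to be the point of introducing `current_df_name` and `child_df_name` in this commit.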
30 changes: 30 additions & 0 deletions autonormalize/tests/test_example.py
@@ -1,5 +1,8 @@
import featuretools as ft
import pandas as pd
from unittest.mock import patch

import pytest
import autonormalize as an


@@ -21,3 +24,30 @@ def test_ft_mock_customer():
assert set([str(rel) for rel in entityset.relationships]) == set(['<Relationship: transaction_id.session_id -> session_id.session_id>',
'<Relationship: transaction_id.product_id -> product_id.product_id>',
'<Relationship: session_id.customer_id -> customer_id.customer_id>'])


@patch("autonormalize.autonormalize.auto_entityset")
def test_normalize_entityset(auto_entityset):
df1 = pd.DataFrame({"test": [0, 1, 2]})
df2 = pd.DataFrame({"test": [0, 1, 2]})
accuracy = 0.98

es = ft.EntitySet()

error = "This EntitySet is empty"
with pytest.raises(ValueError, match=error):
an.normalize_entityset(es, accuracy)

es.add_dataframe(df1, "df")

df_out = es.dataframes[0]

an.normalize_entityset(es, accuracy)

auto_entityset.assert_called_with(df_out, accuracy, index=df_out.ww.index, name=es.id, time_index=df_out.ww.time_index)

es.add_dataframe(df2, "df2")

error = "There is more than one dataframe in this EntitySet"
with pytest.raises(ValueError, match=error):
an.normalize_entityset(es, accuracy)
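
The three outcomes this test exercises (empty, exactly one dataframe, more than one) can be sketched without Featuretools by using a plain list in place of `es.dataframes`. `pick_single_dataframe` is a hypothetical name used only for illustration; the error messages are copied from the diff.

```python
def pick_single_dataframe(dataframes):
    """Return the only item, mirroring the guard clauses in normalize_entityset."""
    if len(dataframes) > 1:
        raise ValueError("There is more than one dataframe in this EntitySet")
    if len(dataframes) == 0:
        raise ValueError("This EntitySet is empty")
    return dataframes[0]


def error_message(dataframes):
    """Return the ValueError text raised for `dataframes`, or None if accepted."""
    try:
        pick_single_dataframe(dataframes)
    except ValueError as exc:
        return str(exc)
    return None
```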
6 changes: 3 additions & 3 deletions dev-requirements.txt
@@ -3,9 +3,9 @@ codecov==2.1.8
flake8==3.7.8
autopep8==1.4.4
isort==4.3.21
nbsphinx==0.8.5
pydata-sphinx-theme==0.4.0
Sphinx==3.2.1
nbsphinx==0.8.7
pydata-sphinx-theme==0.7.1
Sphinx==4.2.0
nbconvert==6.0.2
ipython==7.16.3
pygments==2.8.1
2 changes: 1 addition & 1 deletion docs/source/api_reference.rst
@@ -16,7 +16,7 @@ Autonormalize
make_entityset
auto_entityset
auto_normalize
normalize_entity
normalize_entityset

Dependencies
======================
27 changes: 17 additions & 10 deletions docs/source/release_notes.rst
@@ -3,34 +3,41 @@
Release Notes
-------------

.. Future Release
==============
Future Release
==============
* Enhancements
* Fixes
* Fix compatibility issues with featuretools (:pr:`41`)
* Changes
* Rename ``normalize_entity`` to ``normalize_entityset`` (:pr:`41`)
* Documentation Changes
* Testing Changes

.. Thanks to the following people for contributing to this release:
Thanks to the following people for contributing to this release:
:user:`dvreed77`

Breaking Changes
++++++++++++++++
* :pr:`41`: The function ``normalize_entity`` has been renamed to ``normalize_entityset``.

v1.0.1 Jan 7, 2022
==================
* Documentation Changes
* Update release notes and release format (:pr:`37`)
* Updated sphinx documentation and guides (:pr:`35`)
* Update release notes and release format (:pr:`37`)
* Updated sphinx documentation and guides (:pr:`35`)
* Testing Changes
* Updated tests to work with featuretools 1.0 (:pr:`35`)
* Updated tests to work with featuretools 1.0 (:pr:`35`)

Thanks to the following people for contributing to this release:
:user:`gsheni`, :user:`tuethan1999`
Thanks to the following people for contributing to this release:
:user:`gsheni`, :user:`tuethan1999`


v1.0.0 Aug 15, 2019
===================
* Initial Release

Thanks to the following people for contributing to this release:
:user:`allisonportis`
Thanks to the following people for contributing to this release:
:user:`allisonportis`

.. command
.. git log --pretty=oneline --abbrev-commit
