SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions

Raw data collected through data integration is seldom suitable for direct use in machine learning (or any other data analytics): appropriate data wrangling is typically needed to construct high-quality features. This process depends heavily on domain expertise and requires considerable manual effort by data scientists.

In this repo, we introduce SMARTFEAT, an efficient automated feature engineering tool to assist data users, even nonexperts, in constructing useful features. Leveraging the power of Foundation Models (FMs), our approach enables the creation of new features from the data, based on contextual information and open-world knowledge.

If you have any questions, please contact: Yin Lin ([email protected])

Prerequisites

Step 1: To install the packages and prepare for use, run:

$ pip install -r requirements.txt

Step 2: OpenAI configuration

Configure your OpenAI API key by following OpenAI's guide, Best Practices for API Key Safety. Then replace the placeholders on lines 3-5 of ./SMARTFEAT/gpt.py with your own organization ID and API key:

openai.organization = "[YOUR_ORG]"
openai.api_key = "[YOUR_OPENAI_API_KEY]"
os.environ["OPENAI_API_KEY"] = "[YOUR_OPENAI_API_KEY]"
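Hardcoding a key in a source file makes it easy to commit it by accident. As a minimal sketch, assuming you export the key in your shell as OPENAI_API_KEY (and, optionally, the organization as OPENAI_ORGANIZATION, a variable name chosen here purely for illustration), the values for the assignments above can be read from the environment instead. The helper `load_openai_credentials` is hypothetical and is not part of SMARTFEAT.

```python
import os


def load_openai_credentials(env=None):
    """Read OpenAI credentials from environment variables.

    Hypothetical helper, not part of SMARTFEAT. OPENAI_ORGANIZATION is an
    assumed variable name; adjust it to match your own setup.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running.")
    # The organization is optional, so fall back to None when it is absent.
    organization = env.get("OPENAI_ORGANIZATION", "").strip() or None
    return {"api_key": api_key, "organization": organization}
```

In ./SMARTFEAT/gpt.py you could then assign `creds["api_key"]` to `openai.api_key` and `creds["organization"]` to `openai.organization` in place of the string literals.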

Code execution example

$ cd run
$ ./test_adult.sh

We provide logs of the datasets with new features in ./log_new_datasets_dt/. The evaluation script that reports feature importance and prediction results is ./baselines/determine_useful_attrs_classication.py.

Running your own dataset

To run SMARTFEAT on your own data, add the source file and the corresponding feature descriptions. See the examples in ./dataset/.

To ensure good model performance, clean the input data first so that it (1) contains no NULL values and (2) contains no errors.
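A quick pre-check can catch missing values before a dataset is passed to the tool. The following is a minimal sketch using only the Python standard library; the function name `find_null_cells` and the set of tokens treated as missing are assumptions made for illustration, not part of SMARTFEAT. It covers requirement (1) only, since what counts as an error depends on the dataset.

```python
import csv
import io

# Tokens treated as missing values; this set is an assumption, extend as needed.
NULL_TOKENS = {"", "null", "nan", "na", "none"}


def find_null_cells(csv_file):
    """Return (row_number, column_name) pairs for every missing cell.

    row_number counts data rows starting at 1 (the header is not counted).
    """
    missing = []
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        for column, value in row.items():
            # DictReader yields None for cells absent from a short row.
            if value is None:
                missing.append((row_number, column))
            elif isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
                missing.append((row_number, column))
    return missing


# Example with an in-memory file; the column names are made up.
sample = io.StringIO("age,workclass\n39,State-gov\n,Private\n50,NULL\n")
print(find_null_cells(sample))
```

An empty result means no missing cells were found under these assumptions; to check a file on disk, pass `open(path, newline="")` instead of the in-memory sample.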

Handling groupby attributes

We allow 'y_label' to be used as the aggregate column. The aggregates are computed on the training set only and are then used to impute the groupby feature for the test set. When the mapping fails for a specific test row, we fall back to the aggregate value computed over the entire training set.

If the groupby columns have high cardinality, which could lead to numerous mapping failures during the imputation process, we drop this groupby feature.

For more information, refer to lines 45-110 of ./baselines/determine_useful_attrs_classication.py.
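The imputation logic described in this section can be sketched as follows. This is an illustrative sketch in plain Python, not the code from the evaluation script: the function name, the use of the mean as the aggregate, and the `max_miss_rate` threshold (used here as a stand-in for the high-cardinality check) are all assumptions.

```python
from collections import defaultdict


def groupby_mean_feature(train_rows, test_rows, group_col, agg_col, max_miss_rate=0.5):
    """Build a group-mean feature from training rows and impute it for test rows.

    Returns (train_feature, test_feature), or None when more than
    max_miss_rate of the test rows have a group unseen in training.
    max_miss_rate is an illustrative threshold, not a SMARTFEAT setting.
    """
    if not train_rows:
        raise ValueError("train_rows must not be empty")

    sums, counts = defaultdict(float), defaultdict(int)
    for row in train_rows:
        sums[row[group_col]] += row[agg_col]
        counts[row[group_col]] += 1
    group_mean = {g: sums[g] / counts[g] for g in sums}
    # Fallback for unseen groups: the aggregate over the entire training set.
    global_mean = sum(sums.values()) / sum(counts.values())

    misses = sum(1 for row in test_rows if row[group_col] not in group_mean)
    if test_rows and misses / len(test_rows) > max_miss_rate:
        return None  # too many mapping failures: drop the feature

    train_feature = [group_mean[row[group_col]] for row in train_rows]
    test_feature = [group_mean.get(row[group_col], global_mean) for row in test_rows]
    return train_feature, test_feature


# Made-up rows: 'occupation' is the groupby column, 'y_label' the aggregate column.
train = [
    {"occupation": "sales", "y_label": 1},
    {"occupation": "sales", "y_label": 0},
    {"occupation": "tech", "y_label": 1},
]
test = [{"occupation": "tech"}, {"occupation": "farming"}]
print(groupby_mean_feature(train, test, "occupation", "y_label"))
```

Because the aggregates are computed from training rows only, the test labels are never read, which is what makes it safe to aggregate over 'y_label'.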
