Support a variable number of columns in the array feature selection transform #135

Merged: 4 commits merged from array_transform_columns into main on May 31, 2024

Conversation

@riley-harper (Contributor)

This PR closes #134.

Previously, the array feature selection transform defined in hlink/linking/core/transforms.py required exactly two input columns. This change updates the transform to accept any number of columns instead. This is a small change because the pyspark.sql.functions.array() function already accepts a variable number of input columns to pack into the output column.

I also took this opportunity to add a few tests specifically for this feature and add type hints to the generate_transforms() function.
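
For context, pyspark.sql.functions.array() takes any number of column arguments and packs them into a single array column, which is what makes this change small. A minimal standalone sketch (the DataFrame and column names here are illustrative, not taken from hlink):

from pyspark.sql import SparkSession
from pyspark.sql.functions import array

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1, 2, 3]], schema=["A", "B", "C"])

# array() accepts one, two, or many columns, so the transform no longer
# needs to assume exactly two inputs.
df.withColumn("one_col", array("A")).show()
df.withColumn("three_cols", array("A", "B", "C")).show()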

Commit messages:
- This includes some failing tests which provide 1 or 3 input columns instead of just 2. #134 should make these tests pass.
- …ctionality — Now this feature selection transform handles any number of columns, not just 2.

@joegrover left a comment


Looks good to me. I just had one question which I attached to the specific line.

  output_col = feature_selection["output_column"]
- df_selected = df_selected.withColumn(output_col, array(col1, col2))
+ df_selected = df_selected.withColumn(output_col, array(input_cols))


I don't really know how pyspark.sql.functions.array() works when a list is passed (though your tests seem to indicate it's fine), but this makes it look like you can just pass the list itself as an "array literal". Or does passing it this way allow the "input_columns" to be either a single string or a list of strings?

@riley-harper (Contributor, Author) replied:


I didn't think to look too closely at that either since the tests passed.

PySpark automatically flattens the argument when you pass a single list to array(): https://api-docs.databricks.com/python/pyspark/latest/pyspark.sql/api/pyspark.sql.functions.array.html. So array(input_cols) and array(*input_cols) are equivalent here. I guess that passing it this way allows the input_columns in the config to be either a string or a list! But that was not the intention of doing it this way.
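
A quick illustration of that flattening behavior (a standalone sketch, not code from this PR):

from pyspark.sql import SparkSession
from pyspark.sql.functions import array

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1, 2]], schema=["A", "B"])

input_cols = ["A", "B"]
# PySpark flattens a single list argument, so both calls build the same
# two-element array column:
df.withColumn("D", array(input_cols)).show()
df.withColumn("D", array(*input_cols)).show()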

I think the "array literal" syntax might be out of date, because I get a type error from it.

from hlink.spark.factory import SparkFactory

factory = SparkFactory()
spark = factory.create()
df = spark.createDataFrame([[1, 2, 3], [2, 3, 4]], schema=["A", "B", "C"])
# Passing a plain Python list where a Column is expected:
df2 = df.withColumn("D", ["A", "B"])

gives me

pyspark.errors.exceptions.base.PySparkTypeError: [NOT_COLUMN] Argument `col` should be a Column, got list.
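
For contrast, wrapping the column names in array() works; a minimal continuation of the session above:

from pyspark.sql.functions import array

# array() turns the column names into a proper Column expression:
df2 = df.withColumn("D", array("A", "B"))
df2.show()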

@riley-harper merged commit c38617c into main on May 31, 2024. 6 checks passed.
@riley-harper deleted the array_transform_columns branch on May 31, 2024 at 16:52.
Closes: Allow 1 or 3+ columns as input in the array feature selection (#134)