Support a variable number of columns in the array feature selection transform #135

Merged: 4 commits merged from array_transform_columns into main on May 31, 2024

Conversation

@riley-harper (Contributor)

This PR closes #134.

Previously, the array feature selection transform defined in hlink/linking/core/transforms.py required exactly two input columns. This change updates the transform to accept any number of columns instead. This is a small change because the pyspark.sql.functions.array() function already accepts a variable number of input columns to pack into the output column.

I also took this opportunity to add a few tests specifically for this feature and add type hints to the generate_transforms() function.
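
For context, pyspark.sql.functions.array() takes any number of column arguments and packs them into a single array column, which is what makes this change small. A minimal standalone sketch (the DataFrame and column names here are illustrative, not taken from hlink):

from pyspark.sql import SparkSession
from pyspark.sql.functions import array

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1, 2, 3]], schema=["A", "B", "C"])

# array() accepts one, two, or many columns, so the transform no longer
# needs to assume exactly two inputs.
df.withColumn("one_col", array("A")).show()
df.withColumn("three_cols", array("A", "B", "C")).show()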

Commit messages:
- This includes some failing tests which provide 1 or 3 input columns instead of just 2. #134 should make these tests pass.
- …ctionality — Now this feature selection transform handles any number of columns, not just 2.

@joegrover left a comment


Looks good to me. I just had one question which I attached to the specific line.

  output_col = feature_selection["output_column"]
- df_selected = df_selected.withColumn(output_col, array(col1, col2))
+ df_selected = df_selected.withColumn(output_col, array(input_cols))


I don't really know how pyspark.sql.functions.array() works when a list is passed (though your tests seem to indicate it's fine), but this makes it look like you can just pass the list itself as an "array literal". Or does passing it this way allow the "input_columns" to be either a single string or a list of strings?

@riley-harper (Contributor, Author) replied:


I didn't think to look too closely at that either since the tests passed.

PySpark automatically flattens the argument when you pass a single list to array(): https://api-docs.databricks.com/python/pyspark/latest/pyspark.sql/api/pyspark.sql.functions.array.html. So array(input_cols) and array(*input_cols) are equivalent here. I guess that passing it this way allows the input_columns in the config to be either a string or a list! But that was not the intention of doing it this way.
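
A quick illustration of that flattening behavior (a standalone sketch, not code from this PR):

from pyspark.sql import SparkSession
from pyspark.sql.functions import array

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1, 2]], schema=["A", "B"])

input_cols = ["A", "B"]
# PySpark flattens a single list argument, so both calls build the same
# two-element array column:
df.withColumn("D", array(input_cols)).show()
df.withColumn("D", array(*input_cols)).show()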

I think the "array literal" syntax might be out of date, because I get a type error from it.

from hlink.spark.factory import SparkFactory

factory = SparkFactory()
spark = factory.create()
df = spark.createDataFrame([[1, 2, 3], [2, 3, 4]], schema=["A", "B", "C"])
# Passing a plain Python list where a Column is expected:
df2 = df.withColumn("D", ["A", "B"])

gives me

pyspark.errors.exceptions.base.PySparkTypeError: [NOT_COLUMN] Argument `col` should be a Column, got list.
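
For contrast, wrapping the column names in array() works; a minimal continuation of the session above:

from pyspark.sql.functions import array

# array() turns the column names into a proper Column expression:
df2 = df.withColumn("D", array("A", "B"))
df2.show()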

@riley-harper merged commit c38617c into main on May 31, 2024. 6 checks passed.
@riley-harper deleted the array_transform_columns branch on May 31, 2024 at 16:52.
Closes: Allow 1 or 3+ columns as input in the array feature selection (#134)