
Library used by FnF to create parquet files is different from the one Spark uses. #305

Open
VAIBHAVTARANGE opened this issue Apr 12, 2022 · 5 comments
Labels
question Further information is requested

Comments

@VAIBHAVTARANGE

The Library used by FnF is parquet-cpp-arrow version 7.0.0 and
The library used by Spark is parquet-mr version 1.10.1.

The schema of the timestamp column changes as shown below.
Pre FnF:-
############ Column(datetime) ############
name: datetime
path: datetime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE

Post FnF:-
############ Column(datetime) ############
name: datetime
path: datetime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE

Do you foresee any issues in the future as newer Spark versions are released?


@vivek-biradar

@matteofigus : Need your expertise here please
@VAIBHAVTARANGE

@ctd ctd added the question Further information is requested label Apr 12, 2022
@ctd
Contributor

ctd commented Apr 12, 2022

Hello,

Just to understand your question better, is there any underlying context/issue that led you to look at this? i.e. unexpected changes to the data stored in the parquet file?

@vivek-biradar

vivek-biradar commented Apr 12, 2022

@ctd : Thank you for your quick response. Yes, we observed a small issue (which does not impact us badly), since Spark 3 uses a different calendar, as per the JIRA below. I believe all dates/timestamps before 1900 are impacted:

https://issues.apache.org/jira/browse/SPARK-31404

We used the below workaround to read data through SPARK

https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-3-upgrade?view=sql-server-ver15 (Please refer section "SparkUpgradeException due to calendar mode change")
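For anyone hitting the same SparkUpgradeException, that workaround boils down to setting Spark's datetime rebase mode on read (a sketch, assuming an existing Spark 3.x session named `spark`; note the config key names shifted between Spark 3 minor versions, e.g. the `int96RebaseModeInRead` key was only added in 3.1, so verify against your version's documentation):

```python
# Tell Spark 3.x how to interpret pre-1900 dates/timestamps that were written
# under the legacy hybrid Julian/Gregorian calendar.
# "CORRECTED" reads the values as-is (Proleptic Gregorian);
# "LEGACY" rebases them from the old calendar.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")

df = spark.read.parquet("s3://your-bucket/your-dataset/")  # hypothetical path
```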

Hence, we wanted to get your expert opinion on whether other issues like this might pop up, given that parquet-mr and parquet-arrow are two different libraries.

@matteofigus

@matteofigus
Member

Hi @vivek-biradar @VAIBHAVTARANGE, thanks for opening an issue. I am not very familiar with the scenario you mentioned, but I know that manipulating dates and times is indeed risky due to compatibility issues; in fact, we mention it in the production readiness docs: https://github.com/awslabs/amazon-s3-find-and-forget/blob/master/docs/PRODUCTION_READINESS_GUIDELINES.md#4-run-your-test-queries

Will there be any other issues? To be honest, I don't know but I think you are on the right path to find out. For each dataset my recommendation is to have a sample in a test account, perform a test deletion, and validate the schema of the output to ensure all systems you use to read are backward-compatible with the newly created object. After you perform the necessary testing, you can onboard the dataset in production.

@vivek-biradar

@matteofigus : Thank you for the reply. We did test this in our test environment and are running in production based on that. Our question was more from a forward-looking perspective: will pyarrow and parquet-mr stay in sync on the parquet format as they are now? I know it's a hard question, but it would help if you could get an expert recommendation from within the AWS team (probably the EMR folks).

Again, thank you for such a quick response.

@VAIBHAVTARANGE
