
Library used by FnF to create parquet files is different from the one Spark uses. #305

Open
VAIBHAVTARANGE opened this issue Apr 12, 2022 · 5 comments
Labels
question Further information is requested

Comments

@VAIBHAVTARANGE

The Library used by FnF is parquet-cpp-arrow version 7.0.0 and
The library used by Spark is parquet-mr version 1.10.1.

The schema of the timestamp column changes as shown below.
Pre FnF:-
############ Column(datetime) ############
name: datetime
path: datetime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE

Post FnF:-
############ Column(datetime) ############
name: datetime
path: datetime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE

Do you foresee any issues in the future as newer Spark versions are released?


@vivek-biradar

@matteofigus : Need your expertise here please
@VAIBHAVTARANGE

@ctd ctd added the question Further information is requested label Apr 12, 2022
@ctd
Contributor

ctd commented Apr 12, 2022

Hello,

Just to understand your question better, is there any underlying context/issue that led you to look at this? i.e. unexpected changes to the data stored in the parquet file?

@vivek-biradar

vivek-biradar commented Apr 12, 2022

@ctd : Thank you for your quick response. Yes, we observed a small issue (which does not impact us badly), since Spark 3 uses a different calendar, as per the JIRA below. I believe all dates/timestamps before 1900 are impacted:

https://issues.apache.org/jira/browse/SPARK-31404

We used the below workaround to read data through SPARK

https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-3-upgrade?view=sql-server-ver15 (Please refer section "SparkUpgradeException due to calendar mode change")
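For anyone hitting the same SparkUpgradeException, that workaround boils down to setting Spark's datetime rebase mode on read (a sketch, assuming an existing Spark 3.x session named `spark`; note the config key names shifted between Spark 3 minor versions, e.g. the `int96RebaseModeInRead` key was only added in 3.1, so verify against your version's documentation):

```python
# Tell Spark 3.x how to interpret pre-1900 dates/timestamps that were written
# under the legacy hybrid Julian/Gregorian calendar.
# "CORRECTED" reads the values as-is (Proleptic Gregorian);
# "LEGACY" rebases them from the old calendar.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")

df = spark.read.parquet("s3://your-bucket/your-dataset/")  # hypothetical path
```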

Hence, we wanted to get your expert opinion on whether other issues like this might pop up, given that parquet-mr and parquet-arrow are two different libraries.

@matteofigus

@matteofigus
Member

Hi @vivek-biradar @VAIBHAVTARANGE, thanks for opening an issue. I am not very familiar with the scenario you mentioned, but I know that manipulating dates and times is indeed risky due to compatibility issues; in fact, we mention it in the production readiness docs: https://github.com/awslabs/amazon-s3-find-and-forget/blob/master/docs/PRODUCTION_READINESS_GUIDELINES.md#4-run-your-test-queries

Will there be any other issues? To be honest, I don't know but I think you are on the right path to find out. For each dataset my recommendation is to have a sample in a test account, perform a test deletion, and validate the schema of the output to ensure all systems you use to read are backward-compatible with the newly created object. After you perform the necessary testing, you can onboard the dataset in production.

@vivek-biradar

@matteofigus : Thank you for the reply. We did test this in our test environment and are running in production based on that. Our question was more from a forward-looking perspective: will pyarrow and parquet-mr stay in sync on the parquet format as they are now? I know it's a hard question, but it would help if you could get an expert recommendation from within the AWS team (probably the EMR folks).

Again, thank you for such a quick response.

@VAIBHAVTARANGE
