-
Notifications
You must be signed in to change notification settings - Fork 5.5k
fix(plugin-delta): Fix problem reading tables with spaces in location or partition values #26397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(plugin-delta): Fix problem reading tables with spaces in location or partition values #26397
Conversation
Reviewer's guide (collapsed on small PRs)Reviewer's GuideWrap raw URI-encoded file paths in Hadoop Path constructions to decode spaces correctly and refresh corresponding v3 test fixtures to validate reading partitions with spaces. ER diagram for partitioned Delta table test data with spaceserDiagram
"AddFileStatus" {
string path
int size
}
"PartitionValues" {
string columnName
string partitionValue
}
"AddFileStatus" ||--o| "PartitionValues" : contains
Class diagram for updated file path handling in Delta connectorclassDiagram
class DeltaSplitManager {
+getNextBatch()
}
class DeltaExpressionUtils {
<<static>>
+evaluatePartitionPredicate()
}
class InternalScanFileUtils {
+getAddFileStatus(row)
+getPartitionValues(row)
}
class Path {
+Path(URI)
+toString()
}
class URI {
+create(String)
}
DeltaSplitManager --> Path : uses
DeltaExpressionUtils --> Path : uses
DeltaExpressionUtils --> InternalScanFileUtils : uses
DeltaSplitManager --> InternalScanFileUtils : uses
Path <-- URI : constructed from
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and they look great!
Blocking issues:
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
- Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000002.json:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 2
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000000.crc:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 3
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000000.json:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 4
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000001.crc:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 5
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000001.json:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 6
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000002.crc:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 7
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000003.crc:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 8
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000003.json:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 9
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000004.crc:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 10
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000004.json:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 11
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000005.crc:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 12
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/country=UK/part-00000-e2a8a9e1-b475-45ae-8d2c-f49764644e0d.c000.snappy.parquet:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 13
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000005.json:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 14
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/country=BAH_AMAS/part-00000-33844059-3951-4d43-8902-555085c05f77.c000.snappy.parquet:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 15
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/country=SOUTH AFRICA/part-00000-4aad1292-1c82-4525-a3af-4d0573f9c62d.c000.snappy.parquet:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 16
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/country=SOUTH AFRICA/part-00000-e850600d-c9a8-428c-864e-308d42456618.c000.snappy.parquet:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>
### Comment 17
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/country=UK/part-00000-6a826aa4-bbfc-411b-a76c-c5fa37e2583e.c000.snappy.parquet:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>
<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.
*Source: gitleaks*
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
1ed3582 to
0eba9ae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New security issues found
...untry=UK/part-00000-e2a8a9e1-b475-45ae-8d2c-f49764644e0d.c000.snappy.parquet:Zone.Identifier
Outdated
Show resolved
Hide resolved
...BAH_AMAS/part-00000-33844059-3951-4d43-8902-555085c05f77.c000.snappy.parquet:Zone.Identifier
Outdated
Show resolved
Hide resolved
...H AFRICA/part-00000-4aad1292-1c82-4525-a3af-4d0573f9c62d.c000.snappy.parquet:Zone.Identifier
Outdated
Show resolved
Hide resolved
...H AFRICA/part-00000-e850600d-c9a8-428c-864e-308d42456618.c000.snappy.parquet:Zone.Identifier
Outdated
Show resolved
Hide resolved
...untry=UK/part-00000-6a826aa4-bbfc-411b-a76c-c5fa37e2583e.c000.snappy.parquet:Zone.Identifier
Outdated
Show resolved
Hide resolved
0eba9ae to
176e06a
Compare
6f91fd7 to
608111a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems you only need folder country= SOUTH AFRICA
|
If this fix applies to both v1 and v3 tables, do we get any additional test coverage by including both? If not, let's just use the v3 Delta logs to reduce the raw number of checked-in Delta logs. |
|
@denodo-research-labs the test failures appear related |
abb1df8 to
464f4d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- Extract the Path(URI.create(path)).toString() decoding logic into a shared utility method instead of duplicating it in multiple classes.
- Add a test for non-S3 URI schemes (e.g. file://, hdfs://) containing spaces to verify that the decoding fix works across all supported filesystems.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Extract the Path(URI.create(path)).toString() decoding logic into a shared utility method instead of duplicating it in multiple classes.
- Add a test for non-S3 URI schemes (e.g. file://, hdfs://) containing spaces to verify that the decoding fix works across all supported filesystems.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| deltaTable.getSchemaName(), | ||
| deltaTable.getTableName(), | ||
| addFileStatus.getPath(), | ||
| new Path(URI.create(addFileStatus.getPath())).toString(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for URL decoding the file path? Would this work?
| new Path(URI.create(addFileStatus.getPath())).toString(), | |
| URI.create(addFileStatus.getPath()).getPath(), |
464f4d7 to
97d5070
Compare
|
Thanks for the release note! Nits: |
|
@sourcery-ai resolve |
|
@sourcery-ai dismiss |
Automated Sourcery review dismissed.
Description
Fix the problem reading Delta tables with spaces in location or partition values #25864
Reading Delta tables with an external location containing spaces returns an error.
Motivation and Context
Fixes the problem reading Delta tables with spaces in location or partition values
Impact
If you try to read Delta tables with spaces in location or partition values you will get an error
Test Plan
There was already a test named
readPartitionedTableAllDataTypes, which verifies the reading of partitions containing various data types and spaces.Previously, the as_timestamp partition had an issue where incorrect escape characters were used instead of spaces.
This partition has now been corrected to replace the invalid characters. For version v3, it was necessary to regenerate the data to fully resolve the issue.
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.