@denodo-research-labs denodo-research-labs commented Oct 22, 2025

Description

Fix the problem reading Delta tables with spaces in location or partition values #25864

Reading Delta tables with an external location containing spaces returns an error.

com.facebook.presto.spi.PrestoException: File does not exist: s3a://delta/delta%20space%20test/test/test.parquet
	at com.facebook.presto.delta.DeltaPageSourceProvider.createParquetPageSource(DeltaPageSourceProvider.java:348)
	at com.facebook.presto.delta.DeltaPageSourceProvider.createPageSource(DeltaPageSourceProvider.java:164)
	at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:65)
	at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:81)
	at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:263)
	at com.facebook.presto.operator.Driver.processInternal(Driver.java:441)
	at com.facebook.presto.operator.Driver.lambda$processFor$10(Driver.java:324)
	at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:750)
	at com.facebook.presto.operator.Driver.processFor(Driver.java:317)
	at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1079)
	at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:165)
	at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:621)
	at com.facebook.presto.$gen.Presto_null__testversion____20250822_075611_1.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
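
The path in the exception still carries `%20` escapes. As a minimal illustration of why that can matter, the sketch below uses only `java.net.URI` (not the connector code) to show that the raw and decoded forms of the same location are different strings. `EncodedPathDemo` and its method names are hypothetical, and the sketch assumes the file is stored under the name with literal spaces, as the PR title suggests.

```java
import java.net.URI;

final class EncodedPathDemo {
    // Path component exactly as written in the URI string (escapes kept).
    static String rawPath(String location) {
        return URI.create(location).getRawPath();
    }

    // Path component with percent-escapes such as %20 decoded.
    static String decodedPath(String location) {
        return URI.create(location).getPath();
    }

    public static void main(String[] args) {
        // Location string taken from the exception message above.
        String location = "s3a://delta/delta%20space%20test/test/test.parquet";
        System.out.println("raw:     " + rawPath(location));
        System.out.println("decoded: " + decodedPath(location));
    }
}
```

If the raw form is used as a literal file name, the lookup targets a name containing `%20` rather than a space, which would be consistent with the "File does not exist" message.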

Motivation and Context

Fixes the problem reading Delta tables with spaces in location or partition values

Impact

Before this fix, reading a Delta table with spaces in its location or partition values fails with the following error:

com.facebook.presto.spi.PrestoException: File does not exist

Test Plan

An existing test, readPartitionedTableAllDataTypes, already verifies reading partitions that contain various data types and spaces.
Previously, the test data for the as_timestamp partition used incorrect escape characters where spaces were expected.
That partition data has now been corrected to replace the invalid characters. For v3, the data had to be regenerated to fully resolve the issue.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==
 
Delta Connector Changes
* Fix problem reading Delta Lake tables with spaces in location or partition values.

@denodo-research-labs denodo-research-labs requested a review from a team as a code owner October 22, 2025 08:57
sourcery-ai bot commented Oct 22, 2025

Reviewer's Guide

Wrap raw URI-encoded file paths in Hadoop Path constructions to decode spaces correctly and refresh corresponding v3 test fixtures to validate reading partitions with spaces.

ER diagram for partitioned Delta table test data with spaces

erDiagram
"AddFileStatus" {
  string path
  int size
}
"PartitionValues" {
  string columnName
  string partitionValue
}
"AddFileStatus" ||--o| "PartitionValues" : contains

Class diagram for updated file path handling in Delta connector

classDiagram
class DeltaSplitManager {
  +getNextBatch()
}
class DeltaExpressionUtils {
  <<static>>
  +evaluatePartitionPredicate()
}
class InternalScanFileUtils {
  +getAddFileStatus(row)
  +getPartitionValues(row)
}
class Path {
  +Path(URI)
  +toString()
}
class URI {
  +create(String)
}
DeltaSplitManager --> Path : uses
DeltaExpressionUtils --> Path : uses
DeltaExpressionUtils --> InternalScanFileUtils : uses
DeltaSplitManager --> InternalScanFileUtils : uses
Path <-- URI : constructed from

File-Level Changes

Change: Normalize file path handling to preserve spaces
Details:
  • Wrap getPath() output with URI.create()
  • Construct a new Hadoop Path and call toString()
  • Apply change in partition predicate and split manager
Files:
  • presto-delta/src/main/java/com/facebook/presto/delta/DeltaExpressionUtils.java
  • presto-delta/src/main/java/com/facebook/presto/delta/DeltaSplitManager.java

Change: Refresh v3 partitioned table test fixtures
Details:
  • Correct invalid escape sequences in JSON logs for as_timestamp partition
  • Add missing .crc files for log entries
  • Regenerate data for version v3 to reflect path normalization
Files:
  • presto-delta/src/test/resources/delta_v3/data-reader-partition-values/_delta_log/00000000000000000000.json
  • presto-delta/src/test/resources/delta_v3/data-reader-partition-values/_delta_log/00000000000000000001.json
  • presto-delta/src/test/resources/delta_v3/data-reader-partition-values/_delta_log/00000000000000000000.crc
  • presto-delta/src/test/resources/delta_v3/data-reader-partition-values/_delta_log/00000000000000000001.crc

sourcery-ai bot previously requested changes Oct 22, 2025

Hey there - I've reviewed your changes and they look great!

Blocking issues:

  • Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms. (link)
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000002.json:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 2
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000000.crc:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 3
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000000.json:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 4
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000001.crc:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 5
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000001.json:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 6
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000002.crc:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 7
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000003.crc:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 8
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000003.json:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 9
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000004.crc:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 10
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000004.json:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 11
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000005.crc:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 12
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/country=UK/part-00000-e2a8a9e1-b475-45ae-8d2c-f49764644e0d.c000.snappy.parquet:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 13
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/_delta_log/00000000000000000005.json:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 14
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/country=BAH_AMAS/part-00000-33844059-3951-4d43-8902-555085c05f77.c000.snappy.parquet:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 15
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/country=SOUTH AFRICA/part-00000-4aad1292-1c82-4525-a3af-4d0573f9c62d.c000.snappy.parquet:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 16
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/country=SOUTH AFRICA/part-00000-e850600d-c9a8-428c-864e-308d42456618.c000.snappy.parquet:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>

### Comment 17
<location> `presto-delta/src/test/resources/delta_v3/test-spaces/country=UK/part-00000-6a826aa4-bbfc-411b-a76c-c5fa37e2583e.c000.snappy.parquet:Zone.Identifier:4` </location>
<code_context>
ASIAZ37KJPSBC54RK3G6
</code_context>

<issue_to_address>
**security (aws-access-token):** Identified a pattern that may indicate AWS credentials, risking unauthorized cloud resource access and data breaches on AWS platforms.

*Source: gitleaks*
</issue_to_address>


@denodo-research-labs denodo-research-labs changed the title Fix problem reading Delta tables with spaces in location or partition values fix: Fix problem reading Delta tables with spaces in location or partition values Oct 22, 2025
@denodo-research-labs denodo-research-labs force-pushed the read_delta_tables_spaces branch 2 times, most recently from 1ed3582 to 0eba9ae Compare October 22, 2025 09:39
sourcery-ai bot previously requested changes Oct 22, 2025

New security issues found

@tdcmeehan tdcmeehan self-assigned this Oct 22, 2025
@denodo-research-labs denodo-research-labs force-pushed the read_delta_tables_spaces branch 4 times, most recently from 6f91fd7 to 608111a Compare October 22, 2025 17:05

It seems you only need the folder country=SOUTH AFRICA.

@tdcmeehan tdcmeehan changed the title fix: Fix problem reading Delta tables with spaces in location or partition values fix(plugin-delta): Fix problem reading Delta tables with spaces in location or partition values Oct 24, 2025
@tdcmeehan tdcmeehan changed the title fix(plugin-delta): Fix problem reading Delta tables with spaces in location or partition values fix(plugin-delta): Fix problem reading tables with spaces in location or partition values Oct 24, 2025
@tdcmeehan

If this fix applies to both v1 and v3 tables, do we get any additional test coverage by including both? If not, let's just use the v3 Delta logs to reduce the raw number of checked-in Delta logs.

@tdcmeehan

@denodo-research-labs the test failures appear related

@denodo-research-labs denodo-research-labs marked this pull request as draft October 27, 2025 08:59
@denodo-research-labs denodo-research-labs force-pushed the read_delta_tables_spaces branch 4 times, most recently from abb1df8 to 464f4d7 Compare October 29, 2025 13:27
@denodo-research-labs denodo-research-labs marked this pull request as ready for review October 29, 2025 16:25

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • Extract the Path(URI.create(path)).toString() decoding logic into a shared utility method instead of duplicating it in multiple classes.
  • Add a test for non-S3 URI schemes (e.g. file://, hdfs://) containing spaces to verify that the decoding fix works across all supported filesystems.
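
Both suggestions could be sketched together roughly as follows. `DeltaPathDecoder.decode` is a hypothetical helper, not code from this PR. It uses `java.net.URI` alone to approximate what `new Path(URI.create(path)).toString()` is expected to produce; the real connector relies on Hadoop's `Path`, whose normalization may differ. The host, bucket, and file names are made up for illustration.

```java
import java.net.URI;

final class DeltaPathDecoder {
    // Hypothetical shared helper: rebuilds the location from its scheme,
    // authority, and decoded path so that %20 becomes a literal space.
    static String decode(String encodedLocation) {
        URI uri = URI.create(encodedLocation);
        StringBuilder decoded = new StringBuilder();
        if (uri.getScheme() != null) {
            decoded.append(uri.getScheme()).append(':');
        }
        if (uri.getAuthority() != null) {
            decoded.append("//").append(uri.getAuthority());
        }
        if (uri.getPath() != null) {
            decoded.append(uri.getPath());
        }
        return decoded.toString();
    }

    public static void main(String[] args) {
        // One sample per scheme; all locations are illustrative assumptions.
        String[] samples = {
            "s3a://delta/delta%20space%20test/test/test.parquet",
            "hdfs://namenode:8020/warehouse/country=SOUTH%20AFRICA/part-0.parquet",
            "file:///tmp/delta%20space/test.parquet",
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + decode(sample));
        }
    }
}
```

Keeping the decoding in one place would let a single unit test cover every scheme instead of relying on each call site to repeat the expression.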

  deltaTable.getSchemaName(),
  deltaTable.getTableName(),
- addFileStatus.getPath(),
+ new Path(URI.create(addFileStatus.getPath())).toString(),

Is this for URL decoding the file path? Would this work?

Suggested change
- new Path(URI.create(addFileStatus.getPath())).toString(),
+ URI.create(addFileStatus.getPath()).getPath(),
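
One property of `java.net.URI` that bears on this question: `getPath()` returns only the decoded path component, without the scheme or authority. The sketch below shows this for the location from the PR description. `SuggestedChangeCheck` is a hypothetical name, and whether dropping the scheme and bucket matters depends on how the caller resolves the path, which this thread does not state.

```java
import java.net.URI;

final class SuggestedChangeCheck {
    // What URI.getPath() alone yields for a fully qualified location.
    static String pathOnly(String location) {
        return URI.create(location).getPath();
    }

    public static void main(String[] args) {
        URI uri = URI.create("s3a://delta/delta%20space%20test/test/test.parquet");
        // The scheme and authority (here, the bucket) are separate components
        // and are not part of getPath().
        System.out.println("scheme:    " + uri.getScheme());
        System.out.println("authority: " + uri.getAuthority());
        System.out.println("path:      " + uri.getPath());
    }
}
```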

@steveburnett

Thanks for the release note! Nits:

== RELEASE NOTES ==
 
Delta Lake Connector Changes
* Fix problem reading Delta Lake tables with spaces in location or partition values.

@tdcmeehan

@sourcery-ai resolve

@tdcmeehan

@sourcery-ai dismiss

@sourcery-ai sourcery-ai bot dismissed stale reviews from themself November 1, 2025 14:04

Automated Sourcery review dismissed.

@tdcmeehan tdcmeehan merged commit 5f36bf7 into prestodb:master Nov 1, 2025
82 checks passed