Skip to content

Conversation

@PingLiuPing
Copy link
Collaborator

Refactors the Iceberg connector implementation to separate it from Hive connector. Sets the foundation for independent evolution of both connectors.

@netlify
Copy link

netlify bot commented Nov 20, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit e967787
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/69206f0513e31c0008f5cb2a

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 20, 2025
@PingLiuPing PingLiuPing requested review from mbasmanova and removed request for majetideepak November 20, 2025 03:41
@PingLiuPing
Copy link
Collaborator Author

@mbasmanova This is a follow up PR from discussion #15440 (comment). Code changes in Prestissimo: prestodb/presto#26661


// Register IcebergConnector.
IcebergConnectorFactory icebergFactory;
auto icebergConnector = icebergFactory.newConnector(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a way to specify the prefix to use for transform functions? This way, the application can control the contents of the function registry better.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This makes sense.
But newConnector is an interface from ConnectorFactory, we cannot adding a new parameter to this method to specifying the prefix since the parameter doesn't mean anything to other connector. And I see Prestissimo calls this method for all types of connector with same parametes, so we cannot adding a new method either otherwise it requires special handling from caller.
Instead I hard-code the prefix in IcebergConnectorFactory::newConnector and pass it to IcebergConnector, then pass it to IcebergDataSink. This prefix is for internal use only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is desirable to avoid hard-coding this prefix. Can we define connector-specific configuration property and use that instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I looked at this too. But seems this way requires a new entry in the catalog property file.

See https://github.com/prestodb/presto/blob/9374cd8672d8e76a2c9db0ee296a5c8a320e5b58/presto-native-execution/presto_cpp/main/PrestoServer.cpp#L1270-L1300

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a problem?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that this prefix is primarily for internal use, and therefore the application may not need to specify it explicitly in the catalog property file. And we still need some logic to amend this if the property is missing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that currently a single application (Prestissimo) imposes restrictions on the Connector which may be used in other applications. If these other applications impose their own restrictions that conflict with restrictions imposed by Prestissimo there will be no way to satisfy these. Hence, it would be best to avoid such restrictions. Makes sense?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Yes, understood now.
Update the code to retrieve this prefix from config and setting the default prefix if not present.

}
#endif

TEST_F(IcebergInsertTest, connectorConfiguration) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this test belongs to 'insert' test? How do we verify that configuration properties do take effect and not just ignored?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.
Moved them to a new test file.

test(rowType);
}

TEST_F(IcebergInsertTest, connectorProperties) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test belong here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, moved.

exec::test::AssertQueryBuilder(plan).splits(splits).assertResults(vectors);
}

static void resetIcebergConnector(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go into xxxTestBase?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to new test file.

/// Upstream engines such as Prestissimo and Gluten should register the same
/// functions with different prefixes to avoid conflicts.
void registerIcebergInternalFunctions(const std::string_view& prefix);
static constexpr std::string_view kDefaultIcebergFunctionPrefix{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to remove this constant altogether and let the application specify the prefix when registering the connector.


IcebergConnectorFactory() : ConnectorFactory(kIcebergConnectorName) {}

std::shared_ptr<Connector> newConnector(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you document this API? Specifically, what are ioExecutor and cpuExecutor are for and why they are optional?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Sure. cpuExecutor is not used by Hive and Iceberg connector.

auto icebergInsertHandle =
std::dynamic_pointer_cast<const IcebergInsertTableHandle>(
connectorInsertTableHandle);
VELOX_USER_CHECK_NOT_NULL(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use checked_poiner_cast from velox/common/Casts.h

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

ConnectorInsertTableHandlePtr connectorInsertTableHandle,
ConnectorQueryCtx* connectorQueryCtx,
CommitStrategy commitStrategy) {
if (auto icebergInsertHandle =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

@PingLiuPing PingLiuPing force-pushed the lp_iceberg_connector branch 2 times, most recently from b7f5029 to 80633b7 Compare November 21, 2025 02:25

namespace facebook::velox::connector::hive::iceberg {

static constexpr std::string_view kIcebergFunctionPrefix{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want something similar to velox/connectors/hive/HiveConfig.h

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agree. I actually added IcebergConfig.h and then removed it since I only need one property in this PR.
Can I open another PR to add IcebergConfig.h? It requires some changes on other Iceberg classes too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Let's then add a TODO here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The names kIcebergFunctionPrefix and kDefaultIcebergFunctionPrefix suggest that these are both prefixes... however one is the name of the config property whose value is the prefix and the other is the prefix. Perhaps, rename kIcebergFunctionPrefix to kIcebergFunctionPrefixConfig to avoid confusion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

RowTypePtr inputType,
ConnectorInsertTableHandlePtr connectorInsertTableHandle,
ConnectorQueryCtx* connectorQueryCtx,
CommitStrategy commitStrategy) override;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Iceberg have support for all commit strategies? If not, let's clarify what is supported.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, only kNoCommit is supported.

/// @param connectorQueryCtx Query context for the write operation.
/// @param commitStrategy Strategy for committing the write operation.
/// @return IcebergDataSink instance configured for the write operation.
/// @throws VeloxUserError if connectorInsertTableHandle is not an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is redundant. Let's remove.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

const std::string functionPrefix_;
};

/// Factory for creating IcebergConnector instances.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks.

const std::string& id,
std::shared_ptr<const config::ConfigBase> config,
folly::Executor* ioExecutor = nullptr,
folly::Executor* cpuExecutor = nullptr) override {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

linter will complain about this parameter being unused. Add [[maybe_unused]].

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Added this to HiveConnector as well.

opPool_.reset();
root_.reset();
queryCtx_.reset();
// Unregister IcebergConnector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant comment

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@mbasmanova mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Nov 21, 2025
@meta-codesync
Copy link

meta-codesync bot commented Nov 21, 2025

@kagamiori has imported this pull request. If you are a Meta employee, you can view this in D87666371.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants