Commit 10cdf6f

JkSelf authored and facebook-github-bot committed
Support jvm version libhdfs in velox (facebookincubator#9835)
Summary: Currently, Gluten throws HDFS connection failures when executing queries if the user's HDFS system employs Kerberos authentication and Viewfs, because the existing libhdfs3 API does not support Kerberos authentication, whereas the JVM version of libhdfs can invoke the APIs that do. If the user's system has the `HADOOP_HOME` environment variable set, the JVM version of libhdfs is used during the compilation of Gluten; if not, the default libhdfs3 is used instead.

Pull Request resolved: facebookincubator#9835

Reviewed By: xiaoxmeng, DanielHunte

Differential Revision: D64872596

Pulled By: pedroerp

fbshipit-source-id: 995ee73a9f8474f8a6467b926a56246073fee75e
1 parent 8d77beb commit 10cdf6f

19 files changed, +2342 −139 lines

.github/workflows/linux-build.yml

Lines changed: 1 addition & 0 deletions

@@ -134,6 +134,7 @@ jobs:
           LIBHDFS3_CONF: "${{ github.workspace }}/scripts/hdfs-client.xml"
         working-directory: _build/release
         run: |
+          export CLASSPATH=`/usr/local/hadoop/bin/hdfs classpath --glob`
           ctest -j 8 --output-on-failure --no-tests=error
 
   ubuntu-debug:
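The exported CLASSPATH matters because the JVM libhdfs starts an in-process JVM that must find the Hadoop jars; `hdfs classpath --glob` expands the classpath wildcards into concrete jar paths. A hedged sketch of the precondition this CI line establishes (`requireHadoopClasspath` is a hypothetical helper, not part of the change):

#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical pre-test guard: the JVM libhdfs needs the Hadoop jars on
// CLASSPATH, which is what the `hdfs classpath --glob` export above provides.
void requireHadoopClasspath() {
  const char* cp = std::getenv("CLASSPATH");
  if (cp == nullptr || std::string(cp).find(".jar") == std::string::npos) {
    throw std::runtime_error(
        "CLASSPATH has no Hadoop jars; run "
        "`export CLASSPATH=$(hdfs classpath --glob)` before the HDFS tests.");
  }
}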

CMakeLists.txt

Lines changed: 3 additions & 5 deletions

@@ -259,11 +259,9 @@ if(VELOX_ENABLE_ABFS)
 endif()
 
 if(VELOX_ENABLE_HDFS)
-  find_library(
-    LIBHDFS3
-    NAMES libhdfs3.so libhdfs3.dylib
-    HINTS "${CMAKE_SOURCE_DIR}/hawq/depends/libhdfs3/_build/src/" REQUIRED)
-  add_definitions(-DVELOX_ENABLE_HDFS3)
+  add_definitions(-DVELOX_ENABLE_HDFS)
+  # JVM libhdfs requires arrow dependency.
+  set(VELOX_ENABLE_ARROW ON)
 endif()
 
 if(VELOX_ENABLE_PARQUET)

NOTICE.txt

Lines changed: 100 additions & 0 deletions

@@ -9,3 +9,103 @@ This product includes software from the QT project (BSD, 3-clause).
 
 This product includes software from HowardHinnant's date library (MIT License).
 * https://github.com/HowardHinnant/date/tree/master
+
+This product includes software from the Arrow project.
+* https://github.com/apache/arrow/blob/apache-arrow-15.0.0/cpp/src/arrow/io/hdfs_internal.h
+* https://github.com/apache/arrow/blob/apache-arrow-15.0.0/cpp/src/arrow/io/hdfs_internal.cc
+Which contain the following NOTICE file:
+-------
+Apache Arrow
+Copyright 2016-2024 The Apache Software Foundation
+This product includes software developed at
+The Apache Software Foundation (http://www.apache.org/).
+This product includes software from the SFrame project (BSD, 3-clause).
+* Copyright (C) 2015 Dato, Inc.
+* Copyright (c) 2009 Carnegie Mellon University.
+This product includes software from the Feather project (Apache 2.0)
+https://github.com/wesm/feather
+This product includes software from the DyND project (BSD 2-clause)
+https://github.com/libdynd
+This product includes software from the LLVM project
+* distributed under the University of Illinois Open Source
+This product includes software from the google-lint project
+* Copyright (c) 2009 Google Inc. All rights reserved.
+This product includes software from the mman-win32 project
+* Copyright https://code.google.com/p/mman-win32/
+* Licensed under the MIT License;
+This product includes software from the LevelDB project
+* Copyright (c) 2011 The LevelDB Authors. All rights reserved.
+* Use of this source code is governed by a BSD-style license that can be
+* Moved from Kudu http://github.com/cloudera/kudu
+This product includes software from the CMake project
+* Copyright 2001-2009 Kitware, Inc.
+* Copyright 2012-2014 Continuum Analytics, Inc.
+* All rights reserved.
+This product includes software from https://github.com/matthew-brett/multibuild (BSD 2-clause)
+* Copyright (c) 2013-2016, Matt Terry and Matthew Brett; all rights reserved.
+This product includes software from the Ibis project (Apache 2.0)
+* Copyright (c) 2015 Cloudera, Inc.
+* https://github.com/cloudera/ibis
+This product includes software from Dremio (Apache 2.0)
+* Copyright (C) 2017-2018 Dremio Corporation
+* https://github.com/dremio/dremio-oss
+This product includes software from Google Guava (Apache 2.0)
+* Copyright (C) 2007 The Guava Authors
+* https://github.com/google/guava
+This product include software from CMake (BSD 3-Clause)
+* CMake - Cross Platform Makefile Generator
+* Copyright 2000-2019 Kitware, Inc. and Contributors
+The web site includes files generated by Jekyll.
+--------------------------------------------------------------------------------
+This product includes code from Apache Kudu, which includes the following in
+its NOTICE file:
+Apache Kudu
+Copyright 2016 The Apache Software Foundation
+This product includes software developed at
+The Apache Software Foundation (http://www.apache.org/).
+Portions of this software were developed at
+Cloudera, Inc (http://www.cloudera.com/).
+--------------------------------------------------------------------------------
+This product includes code from Apache ORC, which includes the following in
+its NOTICE file:
+Apache ORC
+Copyright 2013-2019 The Apache Software Foundation
+This product includes software developed by The Apache Software
+Foundation (http://www.apache.org/).
+This product includes software developed by Hewlett-Packard:
+(c) Copyright [2014-2015] Hewlett-Packard Development Company, L.P
+-------
+
+This product includes software from the Hadoop project.
+* https://github.com/apache/hadoop/blob/release-3.3.0-RC0/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/include/hdfs/hdfs.h
+Which contains the following NOTICE file:
+----
+Apache Hadoop
+Copyright 2006 and onwards The Apache Software Foundation.
+This product includes software developed at
+The Apache Software Foundation (http://www.apache.org/).
+Export Control Notice
+---------------------
+This distribution includes cryptographic software. The country in
+which you currently reside may have restrictions on the import,
+possession, use, and/or re-export to another country, of
+encryption software. BEFORE using any encryption software, please
+check your country's laws, regulations and policies concerning the
+import, possession, or use, and re-export of encryption software, to
+see if this is permitted. See <http://www.wassenaar.org/> for more
+information.
+The U.S. Government Department of Commerce, Bureau of Industry and
+Security (BIS), has classified this software as Export Commodity
+Control Number (ECCN) 5D002.C.1, which includes information security
+software using or performing cryptographic functions with asymmetric
+algorithms. The form and manner of this Apache Software Foundation
+distribution makes it eligible for export under the License Exception
+ENC Technology Software Unrestricted (TSU) exception (see the BIS
+Export Administration Regulations, Section 740.13) for both object
+code and source code.
+The following provides more details on the included cryptographic software:
+This software uses the SSL libraries from the Jetty project written
+by mortbay.org.
+Hadoop Yarn Server Web Proxy uses the BouncyCastle Java
+cryptography APIs written by the Legion of the Bouncy Castle Inc.
+----

velox/CMakeLists.txt

Lines changed: 1 addition & 0 deletions

@@ -24,6 +24,7 @@ add_subdirectory(row)
 add_subdirectory(flag_definitions)
 add_subdirectory(external/date)
 add_subdirectory(external/md5)
+add_subdirectory(external/hdfs)
 #
 
 # examples depend on expression

velox/connectors/hive/storage_adapters/hdfs/CMakeLists.txt

Lines changed: 6 additions & 1 deletion

@@ -23,7 +23,12 @@ if(VELOX_ENABLE_HDFS)
       HdfsFileSystem.cpp
       HdfsReadFile.cpp
       HdfsWriteFile.cpp)
-  velox_link_libraries(velox_hdfs Folly::folly ${LIBHDFS3} xsimd)
+  velox_link_libraries(
+    velox_hdfs
+    velox_external_hdfs
+    velox_dwio_common
+    Folly::folly
+    xsimd)
 
   if(${VELOX_BUILD_TESTING})
     add_subdirectory(tests)

velox/connectors/hive/storage_adapters/hdfs/HdfsFileSystem.cpp

Lines changed: 23 additions & 10 deletions

@@ -14,11 +14,11 @@
  * limitations under the License.
  */
 #include "velox/connectors/hive/storage_adapters/hdfs/HdfsFileSystem.h"
-#include <hdfs/hdfs.h>
 #include <mutex>
 #include "velox/common/config/Config.h"
 #include "velox/connectors/hive/storage_adapters/hdfs/HdfsReadFile.h"
 #include "velox/connectors/hive/storage_adapters/hdfs/HdfsWriteFile.h"
+#include "velox/external/hdfs/ArrowHdfsInternal.h"
 
 namespace facebook::velox::filesystems {
 std::string_view HdfsFileSystem::kScheme("hdfs://");
@@ -29,21 +29,27 @@ class HdfsFileSystem::Impl {
   explicit Impl(
       const config::ConfigBase* config,
       const HdfsServiceEndpoint& endpoint) {
-    auto builder = hdfsNewBuilder();
-    hdfsBuilderSetNameNode(builder, endpoint.host.c_str());
-    hdfsBuilderSetNameNodePort(builder, atoi(endpoint.port.data()));
-    hdfsClient_ = hdfsBuilderConnect(builder);
-    hdfsFreeBuilder(builder);
+    auto status = filesystems::arrow::io::internal::ConnectLibHdfs(&driver_);
+    if (!status.ok()) {
+      LOG(ERROR) << "ConnectLibHdfs failed ";
+    }
+
+    // connect to HDFS with the builder object
+    hdfsBuilder* builder = driver_->NewBuilder();
+    driver_->BuilderSetNameNode(builder, endpoint.host.c_str());
+    driver_->BuilderSetNameNodePort(builder, atoi(endpoint.port.data()));
+    driver_->BuilderSetForceNewInstance(builder);
+    hdfsClient_ = driver_->BuilderConnect(builder);
     VELOX_CHECK_NOT_NULL(
         hdfsClient_,
         "Unable to connect to HDFS: {}, got error: {}.",
         endpoint.identity(),
-        hdfsGetLastError());
+        driver_->GetLastExceptionRootCause());
   }
 
   ~Impl() {
     LOG(INFO) << "Disconnecting HDFS file system";
-    int disconnectResult = hdfsDisconnect(hdfsClient_);
+    int disconnectResult = driver_->Disconnect(hdfsClient_);
     if (disconnectResult != 0) {
       LOG(WARNING) << "hdfs disconnect failure in HdfsReadFile close: "
                    << errno;
@@ -54,8 +60,13 @@ class HdfsFileSystem::Impl {
     return hdfsClient_;
   }
 
+  filesystems::arrow::io::internal::LibHdfsShim* hdfsShim() {
+    return driver_;
+  }
+
  private:
   hdfsFS hdfsClient_;
+  filesystems::arrow::io::internal::LibHdfsShim* driver_;
 };
 
 HdfsFileSystem::HdfsFileSystem(
@@ -79,13 +90,15 @@ std::unique_ptr<ReadFile> HdfsFileSystem::openFileForRead(
     path.remove_prefix(index);
   }
 
-  return std::make_unique<HdfsReadFile>(impl_->hdfsClient(), path);
+  return std::make_unique<HdfsReadFile>(
+      impl_->hdfsShim(), impl_->hdfsClient(), path);
 }
 
 std::unique_ptr<WriteFile> HdfsFileSystem::openFileForWrite(
     std::string_view path,
     const FileOptions& /*unused*/) {
-  return std::make_unique<HdfsWriteFile>(impl_->hdfsClient(), path);
+  return std::make_unique<HdfsWriteFile>(
+      impl_->hdfsShim(), impl_->hdfsClient(), path);
 }
 
 bool HdfsFileSystem::isHdfsFile(const std::string_view filePath) {
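Condensed, the new Impl constructor above amounts to the following connection recipe. This sketch relies only on the shim entry points visible in the diff; `connectViaShim` is a hypothetical helper name, and the config plumbing and Velox check macros are assumed to be in scope:

#include "velox/external/hdfs/ArrowHdfsInternal.h"

#include <string>

namespace facebook::velox::filesystems {

// Hypothetical free function condensing the flow of HdfsFileSystem::Impl.
hdfsFS connectViaShim(
    filesystems::arrow::io::internal::LibHdfsShim** driverOut,
    const std::string& host,
    int port) {
  filesystems::arrow::io::internal::LibHdfsShim* driver = nullptr;
  // Locates the JVM libhdfs at runtime and resolves each hdfs* symbol into
  // the shim's function table.
  auto status = filesystems::arrow::io::internal::ConnectLibHdfs(&driver);
  VELOX_CHECK(status.ok(), "ConnectLibHdfs failed");

  hdfsBuilder* builder = driver->NewBuilder();
  driver->BuilderSetNameNode(builder, host.c_str());
  driver->BuilderSetNameNodePort(builder, port);
  driver->BuilderSetForceNewInstance(builder);

  hdfsFS client = driver->BuilderConnect(builder);
  VELOX_CHECK_NOT_NULL(
      client,
      "Unable to connect to HDFS, got error: {}.",
      driver->GetLastExceptionRootCause());
  *driverOut = driver;
  return client;
}

} // namespace facebook::velox::filesystems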

velox/connectors/hive/storage_adapters/hdfs/HdfsFileSystem.h

Lines changed: 5 additions & 0 deletions

@@ -15,7 +15,12 @@
  */
 #include "velox/common/file/FileSystems.h"
 
+namespace velox::filesystems::arrow::io::internal {
+class LibHdfsShim;
+}
+
 namespace facebook::velox::filesystems {
+
 struct HdfsServiceEndpoint {
   HdfsServiceEndpoint(const std::string& hdfsHost, const std::string& hdfsPort)
       : host(hdfsHost), port(hdfsPort) {}

velox/connectors/hive/storage_adapters/hdfs/HdfsReadFile.cpp

Lines changed: 59 additions & 11 deletions

@@ -16,20 +16,65 @@
 
 #include "HdfsReadFile.h"
 #include <folly/synchronization/CallOnce.h>
-#include <hdfs/hdfs.h>
+#include "velox/external/hdfs/ArrowHdfsInternal.h"
 
 namespace facebook::velox {
 
-HdfsReadFile::HdfsReadFile(hdfsFS hdfs, const std::string_view path)
-    : hdfsClient_(hdfs), filePath_(path) {
-  fileInfo_ = hdfsGetPathInfo(hdfsClient_, filePath_.data());
+struct HdfsFile {
+  filesystems::arrow::io::internal::LibHdfsShim* driver_;
+  hdfsFS client_;
+  hdfsFile handle_;
+
+  HdfsFile() : driver_(nullptr), client_(nullptr), handle_(nullptr) {}
+  ~HdfsFile() {
+    if (handle_ && driver_->CloseFile(client_, handle_) == -1) {
+      LOG(ERROR) << "Unable to close file, errno: " << errno;
+    }
+  }
+
+  void open(
+      filesystems::arrow::io::internal::LibHdfsShim* driver,
+      hdfsFS client,
+      const std::string& path) {
+    driver_ = driver;
+    client_ = client;
+    handle_ = driver->OpenFile(client, path.data(), O_RDONLY, 0, 0, 0);
+    VELOX_CHECK_NOT_NULL(
+        handle_,
+        "Unable to open file {}. got error: {}",
+        path,
+        driver_->GetLastExceptionRootCause());
+  }
+
+  void seek(uint64_t offset) const {
+    VELOX_CHECK_EQ(
+        driver_->Seek(client_, handle_, offset),
+        0,
+        "Cannot seek through HDFS file, error is : {}",
+        driver_->GetLastExceptionRootCause());
+  }
+
+  int32_t read(char* pos, uint64_t length) const {
+    auto bytesRead = driver_->Read(client_, handle_, pos, length);
+    VELOX_CHECK(bytesRead >= 0, "Read failure in HDFSReadFile::preadInternal.");
+    return bytesRead;
+  }
+};
+
+HdfsReadFile::HdfsReadFile(
+    filesystems::arrow::io::internal::LibHdfsShim* driver,
+    hdfsFS hdfs,
+    const std::string_view path)
+    : driver_(driver), hdfsClient_(hdfs), filePath_(path) {
+  fileInfo_ = driver_->GetPathInfo(hdfsClient_, filePath_.data());
   if (fileInfo_ == nullptr) {
-    auto error = hdfsGetLastError();
+    auto error = fmt::format(
+        "FileNotFoundException: Path {} does not exist.", filePath_);
     auto errMsg = fmt::format(
         "Unable to get file path info for file: {}. got error: {}",
         filePath_,
         error);
-    if (std::strstr(error, "FileNotFoundException") != nullptr) {
+    if (error.find("FileNotFoundException") != std::string::npos) {
      VELOX_FILE_NOT_FOUND_ERROR(errMsg);
     }
     VELOX_FAIL(errMsg);
@@ -38,19 +83,22 @@ HdfsReadFile::HdfsReadFile(hdfsFS hdfs, const std::string_view path)
 
 HdfsReadFile::~HdfsReadFile() {
   // should call hdfsFreeFileInfo to avoid memory leak
-  hdfsFreeFileInfo(fileInfo_, 1);
+  if (fileInfo_) {
+    driver_->FreeFileInfo(fileInfo_, 1);
+  }
 }
 
 void HdfsReadFile::preadInternal(uint64_t offset, uint64_t length, char* pos)
     const {
   checkFileReadParameters(offset, length);
-  if (!file_->handle_) {
-    file_->open(hdfsClient_, filePath_);
+  folly::ThreadLocal<HdfsFile> file;
+  if (!file->handle_) {
+    file->open(driver_, hdfsClient_, filePath_);
   }
-  file_->seek(offset);
+  file->seek(offset);
   uint64_t totalBytesRead = 0;
   while (totalBytesRead < length) {
-    auto bytesRead = file_->read(pos, length - totalBytesRead);
+    auto bytesRead = file->read(pos, length - totalBytesRead);
     totalBytesRead += bytesRead;
     pos += bytesRead;
   }
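On the read path, every pread() now goes through a thread-local HdfsFile: the handle is opened lazily on first use in each thread, then seek() and a read() loop fill the requested range. A caller-side sketch under those assumptions; the URI, buffer size, and helper name are illustrative:

#include "velox/common/file/FileSystems.h"

#include <string>

using namespace facebook::velox;

// Illustrative only: reads the first 4 KiB of a (hypothetical) HDFS file
// through the FileSystem/ReadFile API wired up above.
size_t readHead(filesystems::FileSystem& fs) {
  auto file = fs.openFileForRead("hdfs://nameservice/tmp/data.orc");
  std::string buf(4096, '\0');
  // pread() lands in preadInternal(): it lazily opens a per-thread hdfsFile
  // handle, seeks to `offset`, and loops on read() until `length` bytes
  // have arrived.
  auto view = file->pread(/*offset=*/0, /*length=*/buf.size(), buf.data());
  return view.size();
}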
