Skip to content

Apache Hive (or extensible JDBC) support #807

Open
@christopherfrieler

Description

@christopherfrieler

I tried to connect to Apache Hive via JDBC and faced a few issues, which I would like to share here. Feel free to discuss my ideas or split them into multiple issues.

  1. When I tried it, first I ran into the IllegalArgumentExceptions from https://github.com/Kotlin/dataframe/blob/master/dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/util.kt telling me it's not possible. -> Hive support would be nice.
  2. I tried to add Hive support myself (in my code), but, although I could create a Hive instance of DbType, the when-statements in https://github.com/Kotlin/dataframe/blob/master/dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/util.kt checking the DbType are not extensible. -> The implementation should be easier extensible, as new database technologies arrise and supporting them in Kotlin Dataframe takes a while or even never happens, if it's a rather exotic one. I found the issues Add Apache Pinot® as supported database #637 and Redshift not supported #549, that also ask for further databases to be supported. Maybe there could be something like a "DbTypeRegistry", where users can add custom DbTypes at runtime instead of these static when-statements.
  3. I found a workaround (see below) to connect to Hive and read data from it. However, I had to deal with a lot of SQLFeatureNotSupportedExceptions from the Hive JDBC library, because they did not implement some (optional? I'm not a JDBC expert, so I'm not sure how "optional" these things are, i.e. if they should implement it or client code should be able to deal with the missing SQL features.) stuff, especially regarding metadata. -> Maybe there could be some kind of fallback implementation that might not be as performant, but allows to work with such incomplete JDBC driver implementations.

Finally, my workaround to connect to Hive:

import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.*
import org.jetbrains.kotlinx.dataframe.io.db.DbType
import org.jetbrains.kotlinx.dataframe.schema.ColumnSchema
import java.sql.DriverManager
import java.sql.ResultSet
import java.sql.ResultSetMetaData
import kotlin.reflect.KType

private object Hive: DbType("hive") {
    override val driverClassName: String
        get() = "org.apache.hive.jdbc.HiveDriver"

    override fun buildTableMetadata(tables: ResultSet): TableMetadata {
        TODO("Not yet implemented")
    }

    override fun convertSqlTypeToColumnSchemaValue(tableColumnMetadata: TableColumnMetadata): ColumnSchema? = null

    override fun convertSqlTypeToKType(tableColumnMetadata: TableColumnMetadata): KType? = null

    override fun isSystemTable(tableMetadata: TableMetadata): Boolean = false
}

private class HiveResultSet(private val original: ResultSet) : ResultSet by original {
    private class HiveResultSetMetaData(val original: ResultSetMetaData) : ResultSetMetaData by original {
        override fun getTableName(column: Int): String = ""

        override fun getColumnName(column: Int): String {
            return original.getColumnName(column).substringAfter(".")
        }
    }

    override fun getMetaData(): ResultSetMetaData {
        return HiveResultSetMetaData(original.metaData)
    }
}

public fun DataFrame.Companion.readHiveSqlQuery(
    query: String,
    dbConfig: DatabaseConfiguration,
): DataFrame<*> {
    DriverManager.getConnection(dbConfig.url, dbConfig.user, dbConfig.password).use { connection ->
        connection.createStatement().use { statement ->
            statement.executeQuery(query).use { rs ->
                return readResultSet(HiveResultSet(rs), Hive)
            }
        }
    }
}

This way I managed to get my own Hive DbType into readResultSet(). It definitely has limitations and I'm very unsure about my minimal implementations of the methods of the DbType. But at least it seems to work so far.

Metadata

Metadata

Assignees

Labels

databasesJDBC related issuesenhancementNew feature or request

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions