Hive's CombineTextInputFormat

I use the Hive metastore pretty heavily, and we currently have hundreds of tables that use CombineTextInputFormat in their storage descriptor. I was recently trying to upgrade from Trino 409 to Trino 445 and found https://github.com/trinodb/trino/issues/15921, which mentions removing Hive dependencies from the code base. I spun up 445, queried against my metastore, and got errors like this on text-based tables:

```
io.trino.spi.TrinoException: Unsupported storage format: mydatabase.mytable:<UNPARTITIONED> StorageFormat{serde=org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, inputFormat=org.apache.hadoop.mapred.lib.CombineTextInputFormat, outputFormat=org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat}
```

This error tracks with the code changes in [HiveStorageFormat](https://github.com/trinodb/trino/blob/master/plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveStorageFormat.java#L93-L96) and [HiveClassNames](https://github.com/trinodb/trino/blob/master/lib/trino-hive-formats/src/main/java/io/trino/hive/formats/HiveClassNames.java#L47). Our Hive tables are queried with Trino, but we also run Spark jobs against them, so CombineTextInputFormat is ideal for dealing with the small file problem in Spark. However, that means those tables can't be queried via Trino after the upgrade.

A few questions:
* [This comment](https://github.com/trinodb/trino/issues/19018#issuecomment-1795742792) mentions that the Trino devs are open to supporting popular and maintained input formats. Is CombineTextInputFormat a format Trino would consider supporting?
* If my Hive tables have a small file problem, would converting them to TextInputFormat degrade Trino's performance when querying these tables?
* Are there any suggestions for working around this that don't require migrating the tables' storage formats to work with Trino?
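
For context, the per-table migration that the last two questions are trying to avoid would look roughly like the sketch below. It only generates the DDL text and does not touch a metastore. `migration_ddl` is a hypothetical helper, the `ALTER TABLE ... SET FILEFORMAT` statement is an assumption about the Hive DDL that has not been run against these tables, and the serde and output format are copied from the error message above. Partitions that carry their own storage descriptor would presumably need the same change.

```python
# Sketch only: builds the Hive DDL that would switch a table's storage
# descriptor from CombineTextInputFormat to plain TextInputFormat.
# The statement shape is an assumption and should be checked against the
# Hive version in use before running it.

TEXT_INPUT_FORMAT = "org.apache.hadoop.mapred.TextInputFormat"
# Output format and serde are taken unchanged from the error message.
OUTPUT_FORMAT = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
SERDE = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"


def migration_ddl(database: str, table: str) -> str:
    """Return one ALTER TABLE statement (hypothetical helper)."""
    return (
        f"ALTER TABLE `{database}`.`{table}` SET FILEFORMAT "
        f"INPUTFORMAT '{TEXT_INPUT_FORMAT}' "
        f"OUTPUT" f"FORMAT '{OUTPUT_FORMAT}' "
        f"SERDE '{SERDE}'"
    )


if __name__ == "__main__":
    # Placeholder names from the error message; a real run would iterate
    # over every table that currently uses CombineTextInputFormat.
    for db, tbl in [("mydatabase", "mytable")]:
        print(migration_ddl(db, tbl) + ";")
```

With hundreds of tables this is scriptable, but it changes how Spark reads the same tables, which is why a Trino-side option would be preferable.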
