techascent/tech.parquet

Easy Parquet Bindings

A single jar that enables reading and writing CSV and Parquet files.

Use

Depend on these maven coordinates:

[com.techascent/tmd-parquet "1.000-beta-39"]
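For projects that use the Clojure CLI rather than Leiningen, the same coordinates can be expressed in `deps.edn`. This is a direct translation of the vector above, not something taken from the project itself, and it assumes the artifact resolves from a repository the CLI already searches by default (such as Clojars):

```clojure
;; deps.edn -- same artifact and version as the Leiningen vector above
{:deps {com.techascent/tmd-parquet {:mvn/version "1.000-beta-39"}}}
```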

Then, for small datasets:

(require '[tech.v3.dataset :as ds])
(require '[tech.v3.libs.parquet :as parquet])
(-> (ds/->>dataset {:x (concat (repeat 3 "a") (repeat 3 "b"))
                    :y (range 6)
                    :z (repeatedly 6 rand)})
    (parquet/ds->parquet "little.parquet"))
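To check the write, the file can be read back into a dataset. The following is a minimal sketch that assumes `parquet/parquet->ds` accepts a file path; the file name `round-trip.parquet` and the var `round-tripped` are illustrative only. Column names may come back as strings rather than keywords, so the sketch only inspects the row count.

```clojure
(require '[tech.v3.dataset :as ds])
(require '[tech.v3.libs.parquet :as parquet])

;; Write a small dataset, then read it back from disk.
(-> (ds/->dataset {:x (concat (repeat 3 "a") (repeat 3 "b"))
                   :y (range 6)
                   :z (repeatedly 6 rand)})
    (parquet/ds->parquet "round-trip.parquet"))

;; Assumption: parquet->ds takes a path and returns a single dataset.
(def round-tripped (parquet/parquet->ds "round-trip.parquet"))

;; The row count should match what was written.
(println (ds/row-count round-tripped))
```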

And for bigger datasets (streaming a batch at a time):

(require '[tech.v3.dataset :as ds])
(require '[tech.v3.dataset.io.csv :as ds-csv])
(require '[tech.v3.libs.parquet :as parquet])
(require '[tech.v3.io :as io])
(->> (io/gzip-input-stream "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/ames-train.csv.gz")
     (ds-csv/csv->dataset-seq)
     (parquet/ds-seq->parquet "result.parquet"))
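A file written in batches can also be consumed in batches. The sketch below assumes `parquet/parquet->ds-seq` returns a sequence of datasets for a given path; the file name `batches.parquet`, the helper `make-batch`, and the var `total-rows` are hypothetical and exist only for this example. It writes two small batches with the same schema and then sums the row counts without assuming how many datasets come back.

```clojure
(require '[tech.v3.dataset :as ds])
(require '[tech.v3.libs.parquet :as parquet])

;; Hypothetical helper: builds one small batch; every batch shares a schema.
(defn make-batch
  [start]
  (ds/->dataset {:id    (range start (+ start 3))
                 :score (map double (range start (+ start 3)))}))

;; Write two batches into one file.
(parquet/ds-seq->parquet "batches.parquet" [(make-batch 0) (make-batch 3)])

;; Read the file back one dataset at a time and total the rows.
(def total-rows
  (->> (parquet/parquet->ds-seq "batches.parquet")
       (map ds/row-count)
       (reduce +)))

(println total-rows)
```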

A Note About Possible Performance Issues

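Parquet reads and writes can slow down noticeably when debug-level logging is enabled, because the volume of log output becomes very large. If you see degraded performance, make sure the root log level is set to info or higher.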

An example logback.xml might look something like:

<configuration debug="false">
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <!-- encoders are assigned the type
         ch.qos.logback.classic.encoder.PatternLayoutEncoder by default -->
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <root level="info">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>
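If editing the XML configuration is not convenient, log levels can also be raised from the REPL. This is a sketch under two assumptions that do not come from the original text: that Logback is the active SLF4J backend (otherwise the cast fails), and that the noisy loggers live under the `org.apache.parquet` and `org.apache.hadoop` package names. The helper `quiet-logger!` is hypothetical.

```clojure
(import '[org.slf4j LoggerFactory]
        '[ch.qos.logback.classic Level Logger])

;; Hypothetical helper: raises the named logger's level to WARN.
;; Assumes Logback is the SLF4J binding on the classpath.
(defn quiet-logger!
  [^String logger-name]
  (let [^Logger logger (LoggerFactory/getLogger logger-name)]
    (.setLevel logger Level/WARN)))

;; Assumed logger names, based on the libraries' Java package prefixes.
(run! quiet-logger! ["org.apache.parquet" "org.apache.hadoop"])
```

A change made this way lasts only for the running JVM, so the `logback.xml` approach is still the better fit for anything permanent.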

About

Simple parquet bindings for tech.ml.dataset.
