spark_input_lib

A useful tool for reading data with Spark.

It contains the it.mgaido.spark.io.FileWithHeaderReader class, which reads files that start with a header and skips the header lines. To use it, set the it.mgaido.spark.io.InputFileWithHeaderReader.HEADER_NUMBER_OF_LINES property in the Hadoop job configuration to the number of header lines each file contains.
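
As a minimal sketch of the configuration step, the header size can be stored as an integer in a Hadoop `Configuration`. The key string and the `HeaderConfigSketch` helper below are hypothetical placeholders. This README does not show the value of `HEADER_NUMBER_OF_LINES`, so real code would presumably pass that constant as the key instead, assuming it is a string.

```scala
import org.apache.hadoop.conf.Configuration

object HeaderConfigSketch {
  // Assumption: placeholder key for illustration only. In real code, use the
  // library's HEADER_NUMBER_OF_LINES constant here (assumed to be a String key).
  val headerLinesKey: String = "example.header.number.of.lines"

  // Stores the number of header lines per file in the job configuration.
  def configure(conf: Configuration, headerLines: Int): Configuration = {
    conf.setInt(headerLinesKey, headerLines)
    conf
  }
}
```

For example, `HeaderConfigSketch.configure(new Configuration(), 2)` records that each input file starts with two header lines.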

This class does not work correctly if the header is spread over multiple blocks, although this should not normally happen. In that case, only the header lines in the first block are discarded.
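
The limitation can be illustrated with a small model in plain Scala. This is not the library's code: `HeaderBlockModel` is a hypothetical stand-in that treats a file as a sequence of blocks, each holding whole lines, and drops header lines from the first block only.

```scala
object HeaderBlockModel {
  // Hypothetical model: a file is a sequence of blocks, each a sequence of lines.
  def readSkippingHeader(blocks: Seq[Seq[String]], headerLines: Int): Seq[String] =
    blocks.zipWithIndex.flatMap {
      case (lines, 0) => lines.drop(headerLines) // header is dropped in the first block only
      case (lines, _) => lines                   // later blocks pass through untouched
    }
}

object HeaderBlockModelDemo extends App {
  // Header fits in the first block: both header lines are removed.
  println(HeaderBlockModel.readSkippingHeader(Seq(Seq("h1", "h2", "a"), Seq("b")), 2))
  // Header spills into the second block: "h2" survives as if it were data.
  println(HeaderBlockModel.readSkippingHeader(Seq(Seq("h1"), Seq("h2", "a", "b")), 2))
}
```

In the second call the header line that falls in the second block is returned with the data, which matches the behavior described above.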

The Scala object it.mgaido.spark.io.IOHelper performs the same operation: it reads a path and returns an RDD of strings without the header lines.
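
The method names of IOHelper are not listed in this README, so the sketch below does not call it. Instead it shows one way to get a comparable result with plain Spark, under the assumption that every file is small enough to be loaded whole. `HeaderSkipSketch` and `readWithoutHeader` are hypothetical names, and the approach (`wholeTextFiles`) is a swapped-in technique, not the library's implementation.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object HeaderSkipSketch {
  // Reads every file under `path` and drops the first `headerLines` lines of each file.
  // Assumption: files are small, because wholeTextFiles loads each file as one string.
  def readWithoutHeader(sc: SparkContext, path: String, headerLines: Int): RDD[String] =
    sc.wholeTextFiles(path).flatMap { case (_, content) =>
      content.split("\r?\n").iterator.drop(headerLines)
    }
}
```

Because each file is handled as a single record, the header is removed per file regardless of how the file is split into blocks, at the cost of holding each file in memory.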

mgaido91/spark_input_lib
