Skip to content

geneSpark is a bioinformatics software program written in Python and Apache Spark for big data epigenetic histone modification ChIP-seq analysis.

License

Notifications You must be signed in to change notification settings

Bohdan-Khomtchouk/geneSpark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

geneSpark

Installation

Requirements

How to run

Regular version:
  • python src/geneSpark.py [-o OUTPUT_FILE] [-u UPSTREAM_BASE_PAIRS] [-d DOWNSTREAM_BASE_PAIRS] INPUT_FILE
Spark version:
  • /PATH/TO/spark-VERSION/bin/spark-submit --master local[2] src/geneSpark_spark.py [-o OUTPUT_FILE] [-u UPSTREAM_BASE_PAIRS] [-d DOWNSTREAM_BASE_PAIRS] INPUT_FILE
  • This is just an example using 2 cores of a local machine. Change local[2] to customize to the number of cores in your machine. To learn how many cores your machine has, type sysctl -n hw.ncpu in the Terminal (command-line). For more options, please have a look at Spark documentation

The Apache Spark version of geneSpark runs approximately 5X faster (relative to geneSpark using only the Pandas and Numpy library of Python) in a local machine with 2 cores. In a local machine with 8 cores, the Spark version of geneSpark runs 11X faster. The more cores your system has, the faster Spark geneSpark finishes its run.

Apache Spark geneSpark is designed to scale up and leverage the power of thousands of computing cores of any HPC environment via the MapReduce framework.

About

geneSpark is a bioinformatics software program written in Python and Apache Spark for big data epigenetic histone modification ChIP-seq analysis.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages