wikipedia-dump-analysis

Experiment on dumps of Wikipedia

Starting the experiment on Grid5000

Requirements

A Grid5000 environment with Java 8
A JAR of this project with its dependencies (build it with mvn clean install)

Submission of the job

Example of a job with 15 nodes on the paravance cluster and a walltime of 2 hours:

oarsub -l "{cluster='paravance'}/nodes=15,walltime=2" -t deploy "./start-spark-cluster.sh /PATH/TO/WIKIPEDIA/DUMP/en.preprocessed.xml.bz2 en /PATH/TO/OUTPUT/DIRECTORY false 1000"

Parameters

path to dump (bz2 file)
language of the dump (e.g. en for the English version of Wikipedia)
path to output directory
boolean that indicates whether the program needs to export the PCMs. WARNING: exporting them takes a lot of space!
number of partitions for Spark (1000 seems to work well)
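
The five positional parameters are easy to mix up, so it can help to assemble the submission line from named variables. The sketch below is only an illustration: the variable names and the build_job_command helper are hypothetical and not part of this project, and the paths are the placeholders from the example above. It prints the oarsub line instead of running it, so the line can be checked before submission.

```shell
#!/usr/bin/env bash
# Illustrative sketch: variable and function names are assumptions, not part of the project.
set -euo pipefail

DUMP_PATH="/PATH/TO/WIKIPEDIA/DUMP/en.preprocessed.xml.bz2"  # 1. path to dump (bz2 file)
DUMP_LANG="en"                                               # 2. language of the dump
OUTPUT_DIR="/PATH/TO/OUTPUT/DIRECTORY"                       # 3. path to output directory
EXPORT_PCMS="false"                                          # 4. export the PCMs? (takes a lot of space)
PARTITIONS="1000"                                            # 5. number of partitions for Spark

# Join the script name and its five arguments in the order the script expects.
build_job_command() {
  printf './start-spark-cluster.sh %s %s %s %s %s' \
    "$DUMP_PATH" "$DUMP_LANG" "$OUTPUT_DIR" "$EXPORT_PCMS" "$PARTITIONS"
}

JOB_COMMAND="$(build_job_command)"

# Print the full submission line rather than executing it.
echo "oarsub -l \"{cluster='paravance'}/nodes=15,walltime=2\" -t deploy \"$JOB_COMMAND\""
```

With the placeholder values unchanged, the printed line is the same as the example command in the previous section. Replace the placeholders with real paths before copying the line into a Grid5000 frontend shell.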

