# common-crawl

Here are 39 public repositories matching this topic...

fizerkhan / cdx-index-client

A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/

Updated Nov 30, 2016
Python

ErikGartner / prometheus-cc-extractor

This repository contains MapReduce extractors that preprocess and extract websites from the Common Crawl corpus.

big-data spark data-extraction mapreduce common-crawl

Updated Mar 21, 2017
Python

Vzzarr / Common-Crawl-Client

Updated Jul 4, 2017
Java

Vikasg7 / warc-reader

ES6 class to read a .warc or .warc.gz file member by member in Node.js

nodejs generator yield warc next common-crawl warc-reader warc-record warc-headers

Updated Aug 25, 2017
TypeScript

fizerkhan / CommonCrawlDocumentDownload

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types for mass-testing of frameworks like Apache POI and Apache Tika

Updated Sep 25, 2017
Java

fizerkhan / KeywordAnalysis

Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends

Updated Oct 26, 2017
Python

srmocher / fake-science

Analyzing Common Crawl data to classify it as fake or real using trained deep learning models (LSTM, CNN)

deep-learning fake-news common-crawl

Updated May 29, 2018
Python

hrbrmstr / cc

⛏ Extract metadata of a specific target based on the results of "commoncrawl.org"

r domains urls rstats recon reconnaissance common-crawl r-cyber

Updated Aug 31, 2018
R

ggodreau / huhdewp

Hadoop streaming EMR job

big-data hadoop bigdata warc hadoop-streaming common-crawl

Updated Dec 18, 2018
Python

socket-var / nyt-twitter-cc-hadoop

Perform big data analysis on data from the New York Times, Twitter, and Common Crawl APIs

twitter-api hadoop-mapreduce nyt-api common-crawl

Updated Apr 22, 2019
Jupyter Notebook

Mgosi / Big-Data-Analysis-using-MapReduce-in-Hadoop

We explore data using big data analysis and visualization skills. To do this, we perform three main operations: (i) data aggregation from different sources, (ii) big data analysis using MapReduce, and (iii) visualization through Tableau. Data analysis is critical to understanding the data and what we can do with it. For small d…

docker big-data twitter-api hdfs tableau data-processing data-pipeline hadoop-docker common-crawl big-data-analytics tweet-collector

Updated Oct 5, 2019
Jupyter Notebook

bottomless-archive-project / common-crawl-client

This library is a very lightweight client to Common Crawl's WARC files.

warc common-crawl

Updated Jan 16, 2020
Java

tokenmill / common-crawl-utils

Various Common Crawl utilities in Clojure.

clojure clojure-library warc common-crawl cdx-api

Updated Dec 5, 2023
Clojure

skyler-myers-db / Common-Crawl-Analysis

Parsing the Common Crawl database using Scala and Spark

emr scala big-data spark s3 s3-bucket common-crawl emr-cluster

Updated Mar 17, 2021
Scala

oscar-project / goclassy

An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.

nlp corpus-linguistics fasttext common-crawl language-classification

Updated Apr 21, 2021
Go

code402 / warc-benchmark

Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.

warc commoncrawl common-crawl

Updated Apr 30, 2021
Shell

alumik / common-crawl-downloader

Distributed download scripts for Common Crawl data

downloader common-crawl

Updated Jul 2, 2021
Python

HRN-Projects / common_crawl_with_scrapy

Parsing huge web archive files from the Common Crawl data index to fetch any required domain's data concurrently with Python and Scrapy.

python data-mining python3 web-scraping scrapy web-crawling webarchive common-crawl common-crawl-with-scrapy parse-common-crawl common-crawl-with-python common-crawl-scrapy common-crawl-python common-crawl-data webarchive-data-scraping

Updated Jul 14, 2021
Python

bottomless-archive-project / url-collector

An application that crawls the Common Crawl corpus for URLs with the specified file extensions.

crawler common-crawl url-crawler

Updated Oct 15, 2021
Java

IBM / cc-dbp

A dataset for knowledge base population research using Common Crawl and DBpedia.

dbpedia common-crawl ibm-research-ai knowledge-base-population

Updated Jan 27, 2022
Java
