CI-Research

Popular repositories

  1. KeywordAnalysis Public

    Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends

    56 11

  2. spark-Jupyter-AWS Public

    Forked from PiercingDan/spark-Jupyter-AWS

    A guide on how to set up Jupyter with Pyspark painlessly on AWS EC2 clusters, with S3 I/O support

    Jupyter Notebook 1

  3. cdx-index-client Public

    Forked from ikreymer/cdx-index-client

    A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/

    Python 1 1

  4. commoncrawl-examples Public

    Forked from commoncrawl/commoncrawl-examples

    A library of examples showing how to use the Common Crawl corpus.

    Java

  5. dkpro-c4corpus Public

    Forked from dkpro/dkpro-c4corpus

    DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.

    Java

  6. common_crawl_index Public

    Forked from trivio/common_crawl_index

    Index URLs in Common Crawl

    Python 1

Repositories

Showing 9 of 9 repositories
  • KeywordAnalysis Public

    Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends

    56 11 0 0 Updated Jan 28, 2024
  • CI-HiBench Public

    Big Data benchmark from Intel called HiBench

    Shell 0 0 0 0 Updated Mar 5, 2018
  • CommonCrawlDocumentDownload Public Forked from centic9/CommonCrawlDocumentDownload

    A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types for mass-testing of frameworks like Apache POI and Apache Tika

    Java 0 BSD-2-Clause 19 0 0 Updated Feb 9, 2018
  • cdx-index-client Public Forked from ikreymer/cdx-index-client

    A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/

    Python 1 MIT 48 0 0 Updated Feb 9, 2018
  • HiBench Public Forked from Intel-bigdata/HiBench

    HiBench is a big data benchmark suite.

    Java 0 779 0 0 Updated Jan 31, 2018
  • spark-Jupyter-AWS Public Forked from PiercingDan/spark-Jupyter-AWS

    A guide on how to set up Jupyter with Pyspark painlessly on AWS EC2 clusters, with S3 I/O support

    Jupyter Notebook 1 18 0 0 Updated Apr 24, 2017
  • dkpro-c4corpus Public Forked from dkpro/dkpro-c4corpus

    DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.

    Java 0 Apache-2.0 7 0 0 Updated Dec 20, 2016
  • common_crawl_index Public Forked from trivio/common_crawl_index

    Index URLs in Common Crawl

    Python 0 47 0 0 Updated Sep 6, 2016
  • commoncrawl-examples Public Forked from commoncrawl/commoncrawl-examples

    A library of examples showing how to use the Common Crawl corpus.

    Java 0 44 0 0 Updated Aug 5, 2016

People

This organization has no public members. You must be a member to see who’s a part of this organization.
