jaitl/cloud-crawler

Latest commit: 16ae55e · Jan 26, 2021

cloudCrawler

A framework for building fast, distributed crawler systems that gather open data.

Run

  1. Run the master node:
    sbt master/run
    
  2. Run one or more worker nodes, each in its own terminal (two are started here):
    sbt simple-worker/run
    sbt simple-worker/run
    

MongoDB data example

  1. ConfigurationCollection
    {
        "workerExecuteInterval" : "35.seconds",
        "workerFilePath" : "/",
        "workerBatchSize" : 2.0,
        "workerBaseUrl" : "https://habr.com/ru",
        "workerTaskType" : "HabrTasks",
        "workerParallelBatches" : 1,
        "workerResource" : "Tor",
        "workerNotification" : false
    }
    
  2. TorCollection
    {
        "workerTorHost" : "127.0.0.1",
        "workerTorLimit" : 1,
        "workerTorPort" : 9150,
        "workerTorControlPort" : 0,
        "workerTorPassword" : "",
        "workerTorTimeoutUp" : "30.seconds",
        "workerTorTimeoutDown" : "30.seconds",
        "workerTaskType" : [ 
            "HabrTasks"
        ],
        "usedCount" : 0
    }
    
  3. CrawlTasks
    {
        "taskType" : "HabrTasks",
        "taskData" : "438886",
        "taskStatus" : "taskWait",
        "attempt" : 0
    }
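
The three documents above can be mirrored on the client side when preparing or validating data. The following is only a minimal sketch: `CrawlTask`, `TorEndpoint`, and `DataExamples` are hypothetical names that are not taken from the cloud-crawler source, the interval parser assumes values always look like `"35.seconds"`, and the proxy helper assumes `workerTorPort` is Tor's SOCKS port.

```scala
import java.net.{InetSocketAddress, Proxy}
import java.util.concurrent.TimeUnit
import scala.concurrent.duration.FiniteDuration

// Hypothetical models: field names follow the documents above, but these
// classes are illustrative and are not part of the cloud-crawler API.
final case class CrawlTask(
  taskType: String,
  taskData: String,
  taskStatus: String = "taskWait", // the only status shown in the example
  attempt: Int = 0
)

final case class TorEndpoint(workerTorHost: String, workerTorPort: Int)

object DataExamples {
  // Assumed format: "<digits>.<unit>", for example "35.seconds".
  private val IntervalPattern = """(\d{1,9})\.([a-z]+)""".r

  private val units: Map[String, TimeUnit] = Map(
    "milliseconds" -> TimeUnit.MILLISECONDS,
    "seconds"      -> TimeUnit.SECONDS,
    "minutes"      -> TimeUnit.MINUTES,
    "hours"        -> TimeUnit.HOURS
  )

  /** Parses a value such as workerExecuteInterval; None if the format differs. */
  def parseInterval(raw: String): Option[FiniteDuration] = raw.trim match {
    case IntervalPattern(length, unit) =>
      units.get(unit).map(u => FiniteDuration(length.toLong, u))
    case _ => None
  }

  /** Describes a SOCKS proxy for a Tor endpoint without resolving the host. */
  def socksProxy(tor: TorEndpoint): Proxy =
    new Proxy(
      Proxy.Type.SOCKS,
      InetSocketAddress.createUnresolved(tor.workerTorHost, tor.workerTorPort)
    )

  /** Builds one waiting task per identifier, ready to be stored in CrawlTasks. */
  def seedTasks(taskType: String, ids: Seq[String]): Seq[CrawlTask] =
    ids.map(id => CrawlTask(taskType, id))
}
```

For example, `DataExamples.seedTasks("HabrTasks", Seq("438886"))` yields a single task with status `taskWait` and `attempt` 0, matching the CrawlTasks document above. Writing the tasks to MongoDB is left to whichever driver you use.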
    

Worker dependency

sbt

resolvers += "Cloud Crawler Repository" at "https://dl.bintray.com/jaitl/cloud-crawler"
libraryDependencies += "com.github.jaitl.crawler" %% "worker" % "<version>" // replace <version> with the release version

gradle

Repository:

repositories {
    maven {
        url "https://dl.bintray.com/jaitl/cloud-crawler"
    }
}

Dependency (replace <version> with the release version):

implementation 'com.github.jaitl.crawler:worker_2.13:<version>'