Harvit harvests data from different sources (e.g. websites, APIs), then converts and transforms it.
- Go 1.18+
- Mage - Makefile replacement written in Go.
- Golangci-lint - Fast Go linters runner.
- Ginkgo - Expressive testing framework.
- Docker - Containerization.
Harvit uses a plan in YAML format (see example) to define the data source, the fields to extract and the transformer to apply.
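As a rough illustration of what a plan contains, the sketch below models the keys that appear in the example plan of this README as Go structs. The `Plan` and `Field` types and the `Validate` rules are hypothetical and are not taken from Harvit's source code.

```go
package main

import (
	"errors"
	"fmt"
)

// Field mirrors one entry under `fields` in the example plan.
// Hypothetical type: Harvit's real definition may differ.
type Field struct {
	Name     string
	Type     string // "raw", "text", "number" or "datetime" in the example
	Selector string
	Regex    string // optional
	Format   string // optional, seen on datetime fields
	Timezone string // optional, seen on datetime fields
}

// Plan mirrors the top-level keys of the example plan.
type Plan struct {
	Source      string
	Type        string
	Fields      []Field
	Transformer string
}

// Validate runs a few basic sanity checks (assumed rules, for illustration only).
func (p Plan) Validate() error {
	if p.Source == "" {
		return errors.New("plan: source is required")
	}
	known := map[string]bool{"raw": true, "text": true, "number": true, "datetime": true}
	for _, f := range p.Fields {
		if f.Name == "" || f.Selector == "" {
			return fmt.Errorf("plan: field %q needs a name and a selector", f.Name)
		}
		if !known[f.Type] {
			return fmt.Errorf("plan: field %q has unknown type %q", f.Name, f.Type)
		}
	}
	return nil
}

func main() {
	p := Plan{
		Source: "https://mgjules.dev",
		Type:   "website",
		Fields: []Field{
			{Name: "contributionsYears", Type: "datetime", Selector: "#contributions span", Regex: `(\d{4})`, Format: "Y"},
		},
		Transformer: "transformers/sample.js",
	}
	if err := p.Validate(); err != nil {
		fmt.Println("invalid plan:", err)
		return
	}
	fmt.Println("plan looks valid")
}
```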
$ ./harvit harvest [command options] plan
NAME:
   harvit harvest - Let's harvest some data!

USAGE:
   harvit harvest [command options] plan

OPTIONS:
   --debug     whether running in PROD or DEBUG mode (default: false) [$HARVIT_DEBUG]
   --help, -h  show help (default: false)
$ ./harvit harvest | jq
source: https://mgjules.dev
type: website
fields:
  - name: firstJobName
    type: raw
    selector: "#experience > div:nth-child(2) > ul > li:nth-child(1) > div.flex.flex-wrap.items-center.justify-between > h3"
  - name: secondJobStartYear
    type: datetime
    selector: "#experience > div:nth-child(2) > ul > li:nth-child(2) > div.flex.flex-wrap.items-center.justify-between > span"
    regex: \d{2}/(\d{4})\s→
    format: Y
  - name: secondJobEndDateTime
    type: datetime
    selector: "#experience > div:nth-child(2) > ul > li:nth-child(2) > div.flex.flex-wrap.items-center.justify-between > span"
    regex: →\s(?:[a-zA-Z]+|(\d{2}/\d{4}))
    format: m/Y
    timezone: Indian/Mauritius
  - name: topLinks
    type: text
    selector: "body > div.relative.px-4.pt-4.sm\\:pt-16.print\\:pt-0.sm\\:px-6.lg\\:px-8 > div.max-w-4xl.mx-auto.text-lg > div:nth-child(2) > div.flex.flex-wrap.items-center.justify-center.gap-x-4.gap-y-2.print\\:hidden > a > div > span"
  - name: experiencePlaces
    type: text
    selector: "#experience > div:nth-child(2) > ul > li > div.flex.flex-wrap.items-center.justify-between > h3"
  - name: contributionsYears
    type: datetime
    selector: "#contributions > div:nth-child(2) > ul > li > div > span"
    regex: (\d{4})
    format: Y
  - name: contributionsYearsNumbers
    type: number
    selector: "#contributions > div:nth-child(2) > ul > li > div > span"
    regex: (\d{4})
  - name: interestsTitle
    type: text
    selector: "#interests > div:nth-child(2) > ul > li > span"
transformer: transformers/sample.js
data['interestsTitle'] = data['interestsTitle'].map(v => v === 'Space Exploration' ? 'SpaceX' : v);
{
"contributionsYears": [
"2022-01-01T00:00:00+04:00",
"2021-01-01T00:00:00+04:00",
"2020-01-01T00:00:00+04:00",
"2020-01-01T00:00:00+04:00",
"2019-01-01T00:00:00+04:00",
"2019-01-01T00:00:00+04:00",
"2019-01-01T00:00:00+04:00",
"2019-01-01T00:00:00+04:00"
],
"contributionsYearsNumbers": [
2022,
2021,
2020,
2020,
2019,
2019,
2019,
2019
],
"experiencePlaces": [
"Ringier SA",
"Bocasay",
"La Sentinelle Digital Ltd",
"Expat-Blog Ltd",
"Noveo IT Ltd"
],
"firstJobName": "<h3 class=\"my-0\">Ringier SA</h3>",
"interestsTitle": [
"SpaceX",
"Artificial Intelligence",
"Skateboarding",
"Anime",
"Gaming",
"Movie"
],
"secondJobEndDateTime": "2021-02-01T00:00:00+04:00",
"secondJobStartYear": "2020-01-01T00:00:00+04:00",
"topLinks": [
"Developer",
"Github",
"LinkedIn",
"Mail",
"Mauritius"
]
}
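To make the `regex`, `format` and `timezone` keys of the two `secondJob…` fields more concrete, here is a minimal Go sketch of one way such a field could be processed: apply the regex, take the first capture group, and parse it as a date in the given timezone. This is not Harvit's implementation. The `extract` and `layoutFor` helpers, the mapping of `Y` and `m` to Go layouts, the sample input text, and the use of `Indian/Mauritius` for the start year (whose field sets no `timezone`) are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
	_ "time/tzdata" // embed the timezone database so LoadLocation works on any host
)

// layoutFor translates the plan's format tokens into a Go time layout.
// Assumption: Y is a 4-digit year and m a 2-digit month; Harvit may map them differently.
func layoutFor(format string) string {
	return strings.NewReplacer("Y", "2006", "m", "01").Replace(format)
}

// extract applies pattern to text, then parses the first capture group
// with the given format and timezone. It returns an RFC 3339 string.
func extract(text, pattern, format, timezone string) (string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", err
	}
	m := re.FindStringSubmatch(text)
	if len(m) < 2 || m[1] == "" {
		return "", fmt.Errorf("pattern %q captured nothing in %q", pattern, text)
	}
	loc, err := time.LoadLocation(timezone)
	if err != nil {
		return "", err
	}
	t, err := time.ParseInLocation(layoutFor(format), m[1], loc)
	if err != nil {
		return "", err
	}
	return t.Format(time.RFC3339), nil
}

func main() {
	// Hypothetical text of the selected <span>; the real page content may differ.
	text := "03/2020 → 02/2021"

	start, err := extract(text, `\d{2}/(\d{4})\s→`, "Y", "Indian/Mauritius")
	fmt.Println(start, err) // 2020-01-01T00:00:00+04:00 <nil>

	end, err := extract(text, `→\s(?:[a-zA-Z]+|(\d{2}/\d{4}))`, "m/Y", "Indian/Mauritius")
	fmt.Println(end, err) // 2021-02-01T00:00:00+04:00 <nil>
}
```

In this sketch, an end date written as a word (the `[a-zA-Z]+` branch of the second regex) leaves the capture group empty, so `extract` returns an error instead of a date.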
Harvit is Apache 2.0 licensed.
This project follows SemVer strictly and is not yet v1. Breaking changes might be introduced until v1 is released.
This project follows the Go Release Policy. Each major version of Go is supported until there are two newer major releases.