feat(#9): task ids, more docs
h1alexbel committed Jul 8, 2024
1 parent a3f12a6 commit e09f677
Showing 2 changed files with 23 additions and 9 deletions.
24 changes: 15 additions & 9 deletions sr-data/README.md
@@ -5,7 +5,7 @@ data about GitHub repositories.

## How it works?

-`sr-data` runs a [task](#tasks) or a set of tasks and produces output.
+`sr-data` runs a task or a set of [tasks](#tasks) and produces some output.
To run all tasks:

```bash
@@ -25,22 +25,28 @@ All tasks will be executed in order, and you should expect to have these
* `v.csv`
* `w.csv`

-In order to run single task in isolation, use it like that:
+In order to run single task in isolation, then use it like that
+(run it inside `/sr-data` dir!):

```bash
-poetry run sr-data:collect
+poetry run <task id> # e.g. collect
```

## Tasks

* `collect`, collects data about public repositories through
-[GitHub GraphQL API] and outputs `repos.csv` CSV with gathered repositories
+[GitHub GraphQL API] and outputs `repos.csv` with gathered repositories
and their [metadata](#collected-metadata).
-* `en-filter`, filters out repositories with non-English README file and
-outputs `filtered.csv` CSV with English-only entries.
-* `text`, converts README markdown content into plain text, outputs
-`text.csv` CSV.
-TBD..
+* `filter`, filters out repositories with non-English README and outputs
+`filtered.csv`.
+* `text`, converts README markdown content into plain text, outputs `text.csv`.
+* `highlight`, highlights READMEs by annotating them with a help of LLM,
+outputs `annotated.csv`.
+* `embed`, generates embeddings for README content, outputs `embeddings.csv`.
+* `u`, constructs a set of vectors with numerical metadata, outputs `u.csv`.
+* `v`, constructs a set of vectors with README, outputs `v.csv`.
+* `w`, constructs a set of vectors with both: numerical metadata and README,
+outputs `w.csv`.

## Collected metadata

8 changes: 8 additions & 0 deletions sr-data/pyproject.toml
@@ -33,6 +33,14 @@ python = "^3.10 || ^3.11 || ^3.12"

[tool.poetry.scripts]
sr-data = "sr_data.all:main"
+collect = "sr_data.tasks.collect:main"
+filter = "sr_data.tasks.filter:main"
+text = "sr_data.tasks.text:main"
+highlight = "sr_data.tasks.highlight:main"
+embed = "sr_data.tasks.embed:main"
+u = "sr_data.tasks.u:main"
+v = "sr_data.tasks.v:main"
+w = "sr_data.tasks.w:main"

[build-system]
requires = ["setuptools", "wheel"]
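
Note on the `[tool.poetry.scripts]` entries above: each one maps a command name to a `module:function` target, which is why `poetry run collect` ends up calling `main` in `sr_data.tasks.collect`. The sketch below only illustrates how such a target string can be resolved to a callable. It is not Poetry's actual code, `resolve_script` is a hypothetical helper, and it uses the standard-library target `json:dumps` as a stand-in so that it runs without `sr_data` installed.

```python
from importlib import import_module
from typing import Callable


def resolve_script(target: str) -> Callable:
    """Resolve a 'package.module:function' string to the callable it names.

    Hypothetical helper for illustration; real entry-point handling
    also supports dotted attributes and extras, which are skipped here.
    """
    module_name, _, attr = target.partition(":")
    if not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {target!r}")
    module = import_module(module_name)
    return getattr(module, attr)


# Stand-in target from the standard library (assumption: sr_data is not installed).
dumps = resolve_script("json:dumps")
print(dumps({"task": "collect"}))  # prints {"task": "collect"}
```

With the package installed, the same lookup applied to `sr_data.tasks.collect:main` would presumably return the function that `poetry run collect` invokes.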