Merge pull request #46 from h1alexbel/45
feat(#45): filter step, experiment folder
h1alexbel committed Sep 3, 2024
2 parents 200994c + 316fe45 commit bc230e9
Showing 4 changed files with 46 additions and 14 deletions.
25 changes: 23 additions & 2 deletions README.md
@@ -1,6 +1,6 @@
# sr-detection

[![poetry](https://github.com/h1alexbel/sr-detection/actions/workflows/poetry.yml/badge.svg)](https://github.com/h1alexbel/sr-detection/actions/workflows/poetry.yml)
[![build](https://github.com/h1alexbel/sr-detection/actions/workflows/build.yml/badge.svg)](https://github.com/h1alexbel/sr-detection/actions/workflows/build.yml)
[![Hits-of-Code](https://hitsofcode.com/github/h1alexbel/sr-detection)](https://hitsofcode.com/view/github/h1alexbel/sr-detection)
[![PDD status](http://www.0pdd.com/svg?name=h1alexbel/sr-detection)](http://www.0pdd.com/p?name=h1alexbel/sr-detection)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/h1alexbel/sr-detection/blob/master/LICENSE.txt)
@@ -43,12 +43,33 @@ To run this:
just collect
```

Or if you want to capture smaller amount of repositories:
You should expect to have `sr-data/experiment/repos.csv` with collected
repositories and their metadata.

To capture a smaller number of repositories, you can run this:

```bash
just test-collect
```

You should expect to have `sr-data/tmp/test-repos.csv`, with the same structure
as `repos.csv`, but smaller.
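
If you want a quick sanity check of the collected dataset, here is a minimal sketch with pandas; the exact columns depend on what ghminer produces and are not guaranteed here:

```python
# Optional sanity check of the collected CSV: print the row count and
# whatever columns ghminer produced for this query.
import pandas as pd

frame = pd.read_csv("sr-data/experiment/repos.csv")
print(f"Collected {len(frame)} repositories")
print(frame.columns.tolist())
```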

### Filter

We filter the collected repositories. First, we remove repositories with an
empty README file. Then, we convert each README to plain text and detect which
languages it is written in. Repositories whose README is not fully written in
English are removed.
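
The `english` check used by the filter is not shown in this diff; here is a minimal sketch of how such a predicate could look with `langdetect`, assuming a paragraph-by-paragraph check (the actual helper in `sr-data/src/sr_data/tasks/filter.py` may differ):

```python
# Hypothetical shape of the `english` predicate; the real implementation
# is not part of this diff and may use a different strategy.
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make language detection deterministic


def english(text: str) -> bool:
    """True when every non-empty paragraph of the text is detected as English."""
    try:
        return all(
            detect(paragraph) == "en"
            for paragraph in text.split("\n\n")
            if paragraph.strip()
        )
    except LangDetectException:
        return False
```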

To run this:

```bash
just filter repos.csv
```

You should expect to have `sr-data/experiment/after-filter.csv`.

## How to contribute

Make sure that you have [Python 3.10+], [just], and [npm] installed on your
21 changes: 16 additions & 5 deletions justfile
@@ -50,26 +50,37 @@ check:

# Run experiment.
@experiment:
NOW=$(date +%F):$(TZ=UTC date +%T); echo Experiment datetime is: "$NOW (UTC)"
NOW=$(date +%F):$(TZ=UTC date +%T) && echo "$NOW" >> now.txt; \
echo Experiment datetime is: "$NOW (UTC)"
mkdir -p sr-data/experiment
mv now.txt sr-data/experiment/now.txt
just collect
just filter

# Clean up experiment.
clean:
echo "Cleaning up sr-data/experiment..."
rm sr-data/experiment/* && rmdir sr-data/experiment

# Collect repositories.
# Here, $PATS is the name of a file with GitHub PATs, one per line.
collect:
mkdir -p sr-data/experiment
ghminer --query "stars:>10 language:java size:>=20 mirror:false template:false" \
--start "2019-01-01" --end "2024-05-01" --tokens "$PATS"
--start "2019-01-01" --end "2024-05-01" --tokens "$PATS" \
--filename "repos" && mv repos.csv sr-data/experiment/repos.csv

# Collect test repositories.
test-collect:
mkdir -p tmp
ghminer --query "stars:>10 language:java size:>=20 mirror:false template:false" \
mkdir -p sr-data/tmp
cd sr-data && ghminer --query "stars:>10 language:java size:>=20 mirror:false template:false" \
--start "2024-05-01" --end "2024-05-01" --tokens "$PATS" \
--filename "tmp/test-repos"

# Filter collected repositories.
filter:
filter repos out="experiment/after-filter.csv":
cd sr-data && poetry poe filter --repos {{repos}} --out {{out}}

# Build paper with LaTeX.
paper:
4 changes: 2 additions & 2 deletions sr-data/pyproject.toml
@@ -41,8 +41,8 @@ requests = "^2.32.3"
pytest = "^8.2.2"

[tool.poe.tasks.filter]
script = "sr_data.tasks.filter:main(csv, out)"
args = [{name = "csv"}, {name = "out"}]
script = "sr_data.tasks.filter:main(repos, out)"
args = [{name = "repos"}, {name = "out"}]
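
Since the poe script task forwards its named arguments to the module's entry point, the filter can also be driven directly from Python; here is a sketch with paths borrowed from the justfile defaults (relative to `sr-data/`), not a prescribed invocation:

```python
# Direct call to the task entry point; the paths mirror the justfile
# defaults and are assumptions of this sketch.
from sr_data.tasks.filter import main

main("experiment/repos.csv", "experiment/after-filter.csv")
```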

[tool.poe.tasks.embed]
script = "sr_data.tasks.embed:main(key, checkpoint, csv, out)"
10 changes: 5 additions & 5 deletions sr-data/src/sr_data/tasks/filter.py
@@ -29,28 +29,28 @@
from langdetect import detect


def main(csv, out):
def main(repos, out):
"""
Filter.
:param csv: CSV filename
:param repos: CSV with repositories
:param out: Output CSV file name
:return: Filtered CSV
"""
print("Start filtering...")
DetectorFactory.seed = 0
frame = pd.read_csv(csv)
frame = pd.read_csv(repos)
start = len(frame)
print(f"Repositories in {start}")
frame = frame.dropna(subset=["readme"])
non_null = start - len(frame)
after_null = len(frame)
print(f"Skipped {non_null} repositories with NULL READMEs")
print(f"Skipped {non_null} repositories with empty README files")
frame["readme"] = frame["readme"].apply(md_to_text)
frame = frame[frame["readme"].apply(english)]
non_english = after_null - len(frame)
print(f"Skipped {non_english} non-english repositories")
print(f"Total skipped: {non_null + non_english}")
print(f"Staying with {len(frame)} good repositories")
print(f"Saving {len(frame)} good repositories to {out}")
print(frame)
frame.to_csv(out, index=False)
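
The `md_to_text` helper applied above is imported outside the shown hunk and is not defined in this diff; here is a minimal sketch of one possible implementation, assuming the `markdown` and `beautifulsoup4` packages (the real helper may work differently):

```python
# Hypothetical md_to_text: render Markdown to HTML, then strip the tags
# to obtain plain text for language detection.
import markdown
from bs4 import BeautifulSoup


def md_to_text(md: str) -> str:
    """Convert a Markdown document to plain text."""
    html = markdown.markdown(md)
    return BeautifulSoup(html, "html.parser").get_text()
```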

