Initial commit

Toloka · Mar 2, 2021 · commit 5591e3d

Showing 64 changed files with 4,221 additions and 0 deletions.
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -0,0 +1,25 @@
+name: Release
+
+on:
+  release:
+    types: [ published ]
+
+jobs:
+  release:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python
+      uses: actions/setup-python@v2
+      with:
+        python-version: 3.8
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install tox tox-gh-actions
+    - name: Run tox
+      env:
+        TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
+        TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
+      run: |
+        python -m tox -e release
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -0,0 +1,27 @@
+name: Tests
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main ]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: [3.7, 3.8, 3.9]
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v2
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install tox tox-gh-actions
+    - name: Run tox with Python ${{ matrix.python-version }}
+      run: |
+        python -m tox
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,20 @@
+# Ignore all
+*
+
+# Unignore dirs
+!*/
+
+# Unignore specific files without extensions
+!AUTHORS
+!LICENSE
+!py.typed
+!.gitignore
+
+# Unignore useful extensions
+!*.in
+!*.ini
+!*.md
+!*.py
+!*.pyi
+!*.toml
+!*.yml
diff --git a/AUTHORS b/AUTHORS
@@ -0,0 +1,6 @@
+The following authors have created the source code of "crowd-kit" published and distributed by YANDEX LLC as the owner:
+
+Dmitry Ustalov dustalov@yandex-team.ru
+Evgeny Tulin tulinev@yandex-team.ru
+Nikita Pavlichenko pavlichenko@yandex-team.ru
+Vladimir Losev losev@yandex-team.ru
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,35 @@
+# Notice to external contributors
+
+
+## General info
+
+Hello! In order for us (YANDEX LLC) to accept patches and other contributions from you, you will have to adopt our Yandex Contributor License Agreement (the “**CLA**”). The current version of the CLA can be found here:
+1) https://yandex.ru/legal/cla/?lang=en (in English) and 
+2) https://yandex.ru/legal/cla/?lang=ru (in Russian).
+
+By adopting the CLA, you state the following:
+
+* You obviously wish and are willingly licensing your contributions to us for our open source projects under the terms of the CLA,
+* You have read the terms and conditions of the CLA and agree with them in full,
+* You are legally able to provide and license your contributions as stated,
+* We may use your contributions for our open source projects and for any other project too,
+* We rely on your assurances concerning the rights of third parties in relation to your contributions.
+
+If you agree with these principles, please read and adopt our CLA. By providing us your contributions, you hereby declare that you have already read and adopt our CLA, and we may freely merge your contributions with our corresponding open source project and use it further in accordance with terms and conditions of the CLA.
+
+## Provide contributions
+
+If you have already adopted terms and conditions of the CLA, you are able to provide your contributions. When you submit your first pull request, please add the following information into it:
+
+```
+I hereby agree to the terms of the CLA available at: [link].
+```
+
+Replace the bracketed text as follows:
+* [link] is the link to the current version of the CLA: https://yandex.ru/legal/cla/?lang=en (in English) or https://yandex.ru/legal/cla/?lang=ru (in Russian).
+
+It is enough to provide this notification only once.
+
+## Other questions
+
+If you have any questions, please mail us at opensource@yandex-team.ru.
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,13 @@
+Copyright 2020 YANDEX LLC
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -0,0 +1,6 @@
+# Legal
+include LICENSE AUTHORS CONTRIBUTING.md
+
+# Stubs
+recursive-include src py.typed
+recursive-include src *.pyi
diff --git a/README.md b/README.md
@@ -0,0 +1,27 @@
+# Crowd-kit
+
+[![GitHub Tests][github_tests_badge]][github_tests_link]
+
+[github_tests_badge]: https://github.com/Toloka/crowdlib/workflows/Tests/badge.svg?branch=main
+[github_tests_link]: https://github.com/Toloka/crowdlib/actions?query=workflow:Tests
+
+
+`crowd-kit` is a Python module for crowdsourcing distributed under the Apache-2.0 license. We strive to implement functionality that eases working with crowdsourced data. Currently, the module contains:
+* Implementations of commonly used aggregation methods
+* A set of metrics
+
+The module is currently under heavy development, and its interfaces are subject to change.
+
+Install
+--------------
+Installing Crowd-kit is as easy as `pip install crowd-kit`.
+
+
+Questions and bug reports
+--------------
+To report bugs, please use the [Toloka/bugreport](https://github.com/Toloka/crowdlib/issues) page.
+
+
+License
+-------
+© YANDEX LLC, 2020-2021. Licensed under the Apache License, Version 2.0. See LICENSE file for more details.
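
The README mentions aggregation methods without showing what one does. As a rough illustration only, majority voting over crowdsourced answers can be sketched in plain Python as follows. The record layout `(task, performer, label)`, the sample data, and the function name `majority_vote` are assumptions made for this sketch; it is not the `MajorityVote` class added in this commit and does not use its API.

```python
from collections import Counter, defaultdict

# Hypothetical answers as (task, performer, label) records.
answers = [
    ('t1', 'p1', 'cat'),
    ('t1', 'p2', 'cat'),
    ('t1', 'p3', 'dog'),
    ('t2', 'p1', 'dog'),
    ('t2', 'p2', 'dog'),
]


def majority_vote(records):
    """Return the most frequent label for every task."""
    votes = defaultdict(Counter)
    for task, _performer, label in records:
        votes[task][label] += 1
    # On a tie, Counter.most_common keeps the label that was counted first.
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


labels = majority_vote(answers)
print(labels)
```

Every performer's vote carries the same weight here; methods that estimate performer skill would weight the votes instead.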
diff --git a/pyproject.toml b/pyproject.toml
@@ -0,0 +1,3 @@
+[build-system]
+requires = ["setuptools", "wheel"]
+build-backend = "setuptools.build_meta"
diff --git a/setup.py b/setup.py
@@ -0,0 +1,27 @@
+#!/usr/bin/env python
+# coding: utf8
+
+from setuptools import setup, find_packages
+
+PREFIX = 'crowdkit'
+
+setup(
+    name='crowd-kit',
+    package_dir={PREFIX: 'src'},
+    packages=[f'{PREFIX}.{package}' for package in find_packages('src')],
+    version='0.0.1',
+    description='Python libraries for crowdsourcing',
+    license='Apache 2.0',
+    author='Vladimir Losev',
+    author_email='losev@yandex-team.ru',
+    python_requires='>=3.7.0',
+    install_requires=[
+        'attrs',
+        'numpy',
+        'pandas',
+        'tqdm',
+        'scikit-learn',
+        'nltk',
+    ],
+    include_package_data=True,
+)
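
The `packages=` comprehension in `setup.py` above prefixes every package found under `src` with `crowdkit`. The sketch below shows the resulting names on a temporary directory tree. It is an approximation: `discover_packages` is a simplified, hypothetical stand-in for `setuptools.find_packages`, written so the example needs only the standard library, and the `metrics` directory name is an assumption.

```python
import os
import tempfile

PREFIX = 'crowdkit'


def discover_packages(root):
    """Simplified, hypothetical stand-in for setuptools.find_packages(root)."""
    found = []
    for current, dirs, files in os.walk(root):
        if current != root:
            if '__init__.py' not in files:
                dirs[:] = []  # do not descend into directories that are not packages
                continue
            found.append(os.path.relpath(current, root).replace(os.sep, '.'))
        dirs.sort()  # visit subdirectories in a stable order
    return found


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'src')
    for package in ('aggregation', 'metrics'):  # 'metrics' is an assumed name
        os.makedirs(os.path.join(src, package))
        open(os.path.join(src, package, '__init__.py'), 'w').close()
    # Same comprehension as in setup.py, with the stand-in discovery function.
    packages = [f'{PREFIX}.{package}' for package in discover_packages(src)]

print(packages)
```

Under these assumptions the list contains only the prefixed subpackages, which `package_dir={PREFIX: 'src'}` then maps back to directories under `src`.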
diff --git a/src/aggregation/__init__.py b/src/aggregation/__init__.py
@@ -0,0 +1,9 @@
+from .dawid_skene import DawidSkene
+from .gold_majority_vote import GoldMajorityVote
+from .majority_vote import MajorityVote
+from .m_msr import MMSR
+from .wawa import Wawa
+from .zero_based_skill import ZeroBasedSkill
+from .hrrasa import HRRASA, RASA
+
+__all__ = ['DawidSkene', 'MajorityVote', 'MMSR', 'Wawa', 'GoldMajorityVote', 'ZeroBasedSkill', 'HRRASA', 'RASA']
diff --git a/src/aggregation/annotations.py b/src/aggregation/annotations.py
@@ -0,0 +1,124 @@
+"""
+This module contains reusable annotations that encapsulate both typing
+and description for commonly used parameters. These annotations are
+used to automatically generate stub files with proper docstrings.
+"""
+
+import inspect
+import textwrap
+from io import StringIO
+from typing import ClassVar, Dict, Optional, Type, get_type_hints
+
+import attr
+import pandas as pd
+
+
+@attr.s
+class Annotation:
+    type: Optional[Type] = attr.ib(default=None)
+    title: Optional[str] = attr.ib(default=None)
+    description: Optional[str] = attr.ib(default=None)
+
+    def format_google_style_attribute(self, name: str) -> str:
+        type_str = f' ({getattr(self.type, "__name__", str(self.type))})' if self.type else ''
+        title = f' {self.title}\n' if self.title else '\n'
+        description_str = textwrap.indent(f'{self.description}\n', ' ' * 4).lstrip('\n') if self.description else ''
+        return f'{name}{type_str}:{title}{description_str}'
+
+    def format_google_style_return(self) -> str:
+        type_str = f'{getattr(self.type, "__name__", str(self.type))}' if self.type else ''
+        title = f' {self.title}\n' if self.title else '\n'
+        description_str = textwrap.indent(f'{self.description}\n', ' ' * 4).lstrip('\n') if self.description else ''
+        return f'{type_str}:{title}{description_str}'
+
+
+def manage_docstring(obj):
+    """Move Annotation metadata from type hints into a Google-style docstring."""
+    attributes: Dict[str, Annotation] = {}
+    new_annotations = {}
+
+    for key, value in get_type_hints(obj).items():
+        if isinstance(value, Annotation):
+            attributes[key] = value
+            if value.type is not None:
+                new_annotations[key] = value.type
+        else:
+            new_annotations[key] = value
+
+    return_section = attributes.pop('return', None)
+
+    sio = StringIO()
+    sio.write(inspect.cleandoc(obj.__doc__ or ''))
+
+    if attributes:
+        sio.write('\nArgs:\n' if inspect.isfunction(obj) else '\nAttributes:\n')
+        for key, ann in attributes.items():
+            sio.write(textwrap.indent(ann.format_google_style_attribute(key), ' ' * 4))
+
+    if return_section:
+        sio.write('Returns:\n')
+        sio.write(textwrap.indent(return_section.format_google_style_return(), ' ' * 4))
+
+    obj.__annotations__ = new_annotations
+    obj.__doc__ = sio.getvalue()
+    return obj
+
+
+PERFORMERS_SKILLS = Annotation(
+    type=pd.Series,
+    title='Predicted skills for each performer',
+    description=textwrap.dedent("A series of performers' skills indexed by performers"),
+)
+
+PROBAS = Annotation(
+    type=pd.DataFrame,
+    title='Estimated label probabilities',
+    description=textwrap.dedent('''
+        A frame indexed by `task` and a column for every label id found
+        in `data` such that `result.loc[task, label]` is the probability of `task`'s
+        true label to be equal to `label`.
+    '''),
+)
+
+PRIORS = Annotation(
+    type=pd.Series,
+    title='A prior label distribution',
+    description="A series of labels' probabilities indexed by labels",
+)
+
+TASKS_LABELS = Annotation(
+    type=pd.DataFrame,
+    title='Estimated labels',
+    description=textwrap.dedent('''
+        A pandas.DataFrame indexed by `task` with a single column `label` containing
+        `task`'s most probable label for the last fitted data, or None otherwise.
+    '''),
+)
+
+ERRORS = Annotation(
+    type=pd.DataFrame,
+    title="Performers' error matrices",
+    description=textwrap.dedent('''
+        A pandas.DataFrame indexed by `performer` and `label` with a column for every
+        label_id found in `data` such that `result.loc[performer, observed_label, true_label]`
+        is the probability of `performer` producing an `observed_label` given that a task's
+        true label is `true_label`
+    '''),
+)
+
+DATA = Annotation(
+    type=pd.DataFrame,
+    title='Input data',
+    description='A pandas.DataFrame containing `task`, `performer` and `label` columns',
+)
+
+
+def _make_opitonal_classlevel(annotation: Annotation):
+    return attr.evolve(annotation, type=ClassVar[Optional[annotation.type]])
+
+
+OPTIONAL_CLASSLEVEL_PERFORMERS_SKILLS = _make_opitonal_classlevel(PERFORMERS_SKILLS)
+OPTIONAL_CLASSLEVEL_PROBAS = _make_opitonal_classlevel(PROBAS)
+OPTIONAL_CLASSLEVEL_PRIORS = _make_opitonal_classlevel(PRIORS)
+OPTIONAL_CLASSLEVEL_TASKS_LABELS = _make_opitonal_classlevel(TASKS_LABELS)
+OPTIONAL_CLASSLEVEL_ERRORS = _make_opitonal_classlevel(ERRORS)
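
To make the annotation pattern in `annotations.py` easier to follow, here is a reduced sketch of the same idea using only the standard library. The names `Note`, `document`, `COUNT`, `SHARE`, and `share` are hypothetical and stand in for `Annotation`, `manage_docstring`, and the module-level constants. The sketch reads `__annotations__` directly instead of calling `get_type_hints`, and it omits descriptions and the `Attributes:` branch.

```python
import inspect
from dataclasses import dataclass
from typing import Optional, Type


@dataclass
class Note:
    """Hypothetical stand-in for the Annotation class in the diff."""
    type: Optional[Type] = None
    title: Optional[str] = None


def _type_name(note):
    # Fall back to str() for objects without __name__, as the diff does.
    return getattr(note.type, '__name__', str(note.type))


def document(func):
    """Move Note metadata from annotations into a Google-style docstring."""
    notes = {}
    plain = {}
    # Assumption: annotations are real objects, not strings to be resolved.
    for name, value in func.__annotations__.items():
        if isinstance(value, Note):
            notes[name] = value
            if value.type is not None:
                plain[name] = value.type
        else:
            plain[name] = value

    returns = notes.pop('return', None)
    lines = [inspect.cleandoc(func.__doc__ or '')]
    if notes:
        lines.append('Args:')
        for name, note in notes.items():
            lines.append(f'    {name} ({_type_name(note)}): {note.title}')
    if returns:
        lines.append('Returns:')
        lines.append(f'    {_type_name(returns)}: {returns.title}')

    func.__annotations__ = plain  # leave only the plain types behind
    func.__doc__ = '\n'.join(lines)
    return func


COUNT = Note(type=int, title='Number of votes')
SHARE = Note(type=float, title='Share of votes')


@document
def share(votes: COUNT, total: COUNT) -> SHARE:
    """Compute the share of votes."""
    return votes / total


print(share.__doc__)
print(share.__annotations__)
```

Defining each description once as a constant and reusing it in signatures keeps the wording of the generated docstrings consistent, which appears to be the motivation stated in the module docstring.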