Project to create a pipeline that uses Graph Recommendation Systems to recommend whether Throughput Article is of interest to the Neotoma Database.
Using Graph Databases and a Data Science approach, identify whether a paper is suitable for Neotoma and detect interesting features such as 'number of commits', 'last updates', 'number of linked repos' and 'linked databases'.
Baseline Model will be a Recommender System in unstructured Graph Databases and data will also tried to be structured to create a Recommendre System using Python.
This project is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a code of conduct. Please review and follow this code of conduct as part of your contribution.
Issues and bug reports are always welcome. Code clean-up, and feature additions can be done either through branches.
All products of the Throughput Annotation Project are licensed under an MIT License unless otherwise noted.
Files and directory structure in the repository are as follows: This structure might be modified as the project progresses.
throughput-ec/neotoma_recommender/
├── data
│ ├── input # input data
│ │ ├── throughput neo4j db # data: paleoecology db - dummy file for reproducibility
│ │ └── neotoma_db # data: bibliography json dummy file for reproducibility
│ ├── output # output file
│ │ └── predictions.csv # file to describe whether an article belongs to neotoma or not
├── img # all images or docs for explaining processes
├── reports
│ ├── milestones
│ └── supporting documents
├── src
│ ├── preprocessing
│ │ └── data_preprocessing.py # script to clean data
│ ├── train # scripts for helping functions and training model
│ │ ├── utils.py
│ │ └── train.py
│ └── predict # prediction scripts to validate test data and to try in new data
│ │ └── predict.py
├── .gitignore
├── config_sample.py # file for credentials
├── CODE_OF_CONDUCT.md
├── LICENSE
└── README.md
This project uses the Throughput Graph Database as an input from neo4j:
neotoma:
tsv file
These files are used as input that will help create a Recommender System.
- Predict whether an article is suitable for Neotoma.
- Create nodes in Throughput graph / Create a Graph for Neotoma Files
- Benefit from operations using Graph Databases.
This project is developed using Python and Neo4j.
This project will need Neo4j installed.
It runs on a MacOS system.
Continuous integration uses TravisCI.
The project pulls data from the Throughput database.
This project will generate a structured dataset that provides the following information:
- Whether the paper is useful for Neotoma based on its characteristics or not.
TODO: \n \n
[include workflow chart]
For now, only the notebook is available. There will be scripts shortly.