throughput-ec/NeotomaRecommender

lifecycle NSF-1928366

Neotoma Recommender

A project to build a pipeline that uses graph-based recommender systems to determine whether a Throughput article is of interest to the Neotoma Database.

Using graph databases and a data science approach, the project identifies whether a paper is suitable for Neotoma and detects informative features such as 'number of commits', 'last updates', 'number of linked repos' and 'linked databases'.

The baseline model will be a recommender system built on the unstructured graph database. An attempt will also be made to structure the data in order to build a recommender system using Python.
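As a rough illustration of what a graph-based baseline could look like, the sketch below scores an unlabeled article by how many graph neighbours it shares with articles already known to be relevant to Neotoma. It is not the project's model: the article names, link labels, the `KNOWN_NEOTOMA` set and the `0.3` threshold are all hypothetical, and Jaccard similarity is simply one common choice for neighbour overlap.

```python
from typing import Dict, Set

# Hypothetical toy graph: each article maps to the set of nodes it links to
# (repositories, databases, keywords). All names are made up for illustration.
ARTICLE_LINKS: Dict[str, Set[str]] = {
    "article_a": {"repo:pollen-tools", "db:neotoma", "kw:paleoecology"},
    "article_b": {"repo:pollen-tools", "kw:paleoecology", "kw:holocene"},
    "article_c": {"repo:web-framework", "kw:javascript"},
}

# Articles assumed to be already labeled as relevant to Neotoma.
KNOWN_NEOTOMA: Set[str] = {"article_a"}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: shared neighbours divided by all neighbours."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neotoma_score(article: str) -> float:
    """Highest similarity between `article` and any known Neotoma article."""
    links = ARTICLE_LINKS[article]
    return max(
        (
            jaccard(links, ARTICLE_LINKS[known])
            for known in KNOWN_NEOTOMA
            if known != article
        ),
        default=0.0,
    )


def recommend(threshold: float = 0.3) -> Dict[str, bool]:
    """Flag each unlabeled article whose score reaches the threshold."""
    return {
        article: neotoma_score(article) >= threshold
        for article in ARTICLE_LINKS
        if article not in KNOWN_NEOTOMA
    }
```

Calling `recommend()` on this toy graph flags `article_b`, which shares two of four neighbours with `article_a`, and rejects `article_c`, which shares none. Features such as commit counts or last-update dates would need a different scoring scheme than set overlap.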

Contributors

This project is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a code of conduct. Please review and follow this code of conduct as part of your contribution.

Tips for Contributing

Issues and bug reports are always welcome. Code clean-up and feature additions can be done through branches.

All products of the Throughput Annotation Project are licensed under an MIT License unless otherwise noted.

How to use this repository

The files and directories in the repository are organized as follows. This structure may change as the project progresses.

throughput-ec/neotoma_recommender/
├── data
│   ├── input                          # input data
│   │     ├── throughput neo4j db      # data: paleoecology db - dummy file for reproducibility
│   │     └── neotoma_db               # data: bibliography json dummy file for reproducibility
│   └── output                         # output files
│         └── predictions.csv          # file describing whether an article belongs to Neotoma or not
├── img                                # images or docs explaining processes
├── reports
│   ├── milestones
│   └── supporting documents
├── src
│   ├── preprocessing
│   │     └── data_preprocessing.py    # script to clean data
│   ├── train                          # helper functions and model-training scripts
│   │     ├── utils.py
│   │     └── train.py
│   └── predict                        # prediction scripts to validate test data and to try on new data
│         └── predict.py
├── .gitignore
├── config_sample.py                   # file for credentials
├── CODE_OF_CONDUCT.md
├── LICENSE
└── README.md

Workflow Overview

This project takes its input from the Throughput Graph Database, stored in Neo4j, together with the following file:

  • neotoma: tsv file

These inputs are used to build a recommender system with the following goals:

  • Predict whether an article is suitable for Neotoma.
  • Create nodes in Throughput graph / Create a Graph for Neotoma Files
  • Benefit from operations using Graph Databases.
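The neotoma TSV input listed above could be read with Python's standard `csv` module. The following is a minimal sketch under stated assumptions: the helper `load_tsv` and the `doi` and `title` columns in the sample are hypothetical, since the actual file layout is not documented here.

```python
import csv
import io
from typing import Dict, List, TextIO


def load_tsv(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-separated file into a list of row dictionaries.

    The first line is treated as the header row.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [dict(row) for row in reader]


# Hypothetical sample data; the real column names may differ.
SAMPLE = "doi\ttitle\n10.0000/example\tAn example record\n"

rows = load_tsv(io.StringIO(SAMPLE))
```

For a file on disk, the same function can be called as `load_tsv(open(path, newline="", encoding="utf-8"))`, ideally inside a `with` block so the file is closed afterwards.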

System Requirements

This project is developed in Python and requires a Neo4j installation. It runs on macOS. Continuous integration uses Travis CI.

Data Requirements

The project pulls data from the Throughput database.

Key Outputs

This project will generate a structured dataset that provides the following information:

  • Whether or not a paper is useful for Neotoma, based on its characteristics.
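One possible shape for this output, matching the `predictions.csv` file named in the directory structure, is one row per article with a binary flag. The sketch below is illustrative only: the `write_predictions` helper and the `article_id` and `is_neotoma` column names are assumptions, not the project's actual schema.

```python
import csv
import io
from typing import Dict, TextIO


def write_predictions(predictions: Dict[str, bool], handle: TextIO) -> None:
    """Write one row per article: its identifier and a 0/1 relevance flag."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["article_id", "is_neotoma"])  # hypothetical column names
    # Sort by identifier so the output is stable between runs.
    for article_id, flag in sorted(predictions.items()):
        writer.writerow([article_id, int(flag)])


# Hypothetical predictions written to an in-memory buffer for demonstration.
buffer = io.StringIO()
write_predictions({"article_b": True, "article_c": False}, buffer)
output = buffer.getvalue()
```

To write to disk instead, pass a file opened with `open("predictions.csv", "w", newline="", encoding="utf-8")` in place of the buffer.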

Pipeline

TODO: include workflow chart

Instructions

For now, only the notebook is available. There will be scripts shortly.
