inDexDa - Natural Language Processing of academic papers for dataset identification and indexing.

An Initiative for human-centered Innovation in the Knowledge Sphere of the ETH Library Lab.

Getting Started

This project is divided into multiple pipelines to make it more modular and easier to use and modify. These are as follows:

Pipeline Description

  • PaperScraper: Combs through a specified online archive of academic papers to find papers relating to a field, topic, or search term, and stores them in a MongoDB database. See the PaperScraper folder for more information and usage instructions.
  • NLP: Uses natural language processing techniques on the papers found by PaperScraper to determine whether each paper shows that a new dataset was created. If so, it stores this information in the MongoDB database for later use. See the NaturalLanguageProcessing folder for more information and usage instructions.
  • Dataset Extraction: Collects information from the papers the BERT network predicts contain new datasets, such as links to the dataset, the type of data used, the size of the dataset, etc.
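At a high level, the three pipelines form a chain: scrape, classify, extract. The sketch below illustrates that flow only; the function names and data shapes are hypothetical stand-ins (a plain list replaces MongoDB, and a keyword check replaces the BERT classifier), not the actual inDexDa API.

```python
# Illustrative sketch of the inDexDa pipeline flow (NOT the real API):
# scrape -> classify -> extract, with a plain list standing in for MongoDB.

def scrape_papers(query):
    """Stand-in for PaperScraper: fetch paper records matching a query."""
    return [
        {"title": "A New Corpus for X", "abstract": "We introduce a new dataset of 10k images."},
        {"title": "Survey of Y", "abstract": "We review existing methods."},
    ]

def classify_papers(papers):
    """Stand-in for the BERT classifier: keep papers that announce a dataset."""
    return [p for p in papers if "dataset" in p["abstract"].lower()]

def extract_dataset_info(papers):
    """Stand-in for Dataset Extraction: pull out basic dataset details."""
    return [{"title": p["title"], "source_abstract": p["abstract"]} for p in papers]

results = extract_dataset_info(classify_papers(scrape_papers("computer vision")))
print(len(results))  # 1 of the 2 scraped papers announces a dataset
```

In the real system each stage reads from and writes to the MongoDB database, so the stages can also be run independently.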

Setup

This code has been tested on a computer with the following specifications:

  • OS Platform and Distribution: Linux Ubuntu 18.04LTS
  • CUDA/cuDNN version: CUDA 10.0.130, cuDNN 7.6.4
  • GPU model and memory: NVidia GeForce GTX 1080, 8GB
  • Python: 3.6.8
  • TensorFlow: 1.14

Installation Instructions

To install the virtual environment and most of the required dependencies, run:

pip install pew
pew new inDexDa
pew in inDexDa

git clone https://github.com/eth-library-lab/inDexDa.git
cd inDexDa
./install.sh

Networks used in this project are run using a TensorFlow backend.

Usage

Before running inDexDa, check the args.json file in the main directory. It contains the configuration used throughout the process; make sure the fields described below are filled in.

Configuration

inDexDa is configured primarily through the args.json file, which includes options for web-scraping, network training, and dataset extraction. Each section is explained more thoroughly in the PaperScraper README, but the following steps will allow you to run inDexDa quickly.

  1. Choose the online academic paper repository you wish to scrape in the archives_to_scrape section. InDexDa natively supports both arXiv and ScienceDirect scraping APIs. You can use either a single scraper or multiple scrapers in sequence.
  2. Replace the default search query with your specific word or phrase. A more specific search query will yield fewer results, but will run much faster.
  3. If using the ScienceDirect scraper, apply for an API key (https://dev.elsevier.com/apikey/manage). Once a key has been obtained, include it in the ScienceDirect apikey field of archive_info. Also make sure to include the start and end years for the search.
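Putting the steps above together, a configured args.json might look roughly like the fragment below. Only archives_to_scrape, archive_info, and apikey are named in this README; the other field names and the exact structure are illustrative guesses, so check the PaperScraper README for the authoritative schema.

```json
{
  "archives_to_scrape": ["arxiv", "ScienceDirect"],
  "search_query": "point cloud segmentation",
  "archive_info": {
    "ScienceDirect": {
      "apikey": "YOUR-ELSEVIER-API-KEY",
      "start_year": 2015,
      "end_year": 2020
    }
  }
}
```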

Running inDexDa

Once the args.json file has been configured, run run.py with the following flags as desired, but include EITHER the --train flag or the --scrape flag, not both:

python3 run.py
    --first_time  # Must be included the first time you run inDexDa
    --scrape      # Will run inDexDa and output datasets it finds
    --train       # Will re-train the BERT network

Contact

For any inquiries, use the ETH Library Lab contact form.

License

MIT