Databio Project

This code reads an XML file and extracts data from it to create nodes and relationships in a Neo4j graph database. It uses the py2neo library to connect to the database and the xmltodict library to parse the XML file.
Explore the docs »

View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Author
  2. About The Project
  3. Getting Started
  4. Contributing
  5. License
  6. Contact
  7. Documentation

About The Author

👤 Rodrigo Wurdig

About The Project

This code reads an XML file and extracts data from it to create nodes and relationships in a Neo4j graph database. It uses the py2neo library to connect to the database and the xmltodict library to parse the XML file.
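As a rough illustration of the parsing side, the sketch below walks a dictionary shaped like the output of xmltodict.parse (attributes keyed with an `@` prefix, element text under `#text`). The sample dictionary is hand-written with placeholder values, and the helper names `as_list` and `gene_names` are hypothetical, not taken from this repository. It covers one practical detail: xmltodict yields a single dict for an element that occurs once and a list when the element repeats, so values are normalized before iterating.

```python
def as_list(value):
    """Normalize an xmltodict-style value to a list (None -> [], single item -> [item])."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def gene_names(entry):
    """Collect (type, name) pairs from an entry's gene/name elements."""
    pairs = []
    for gene in as_list(entry.get("gene")):
        for name in as_list(gene.get("name")):
            if isinstance(name, str):
                # An element without attributes is represented as a plain string.
                pairs.append((None, name))
            else:
                pairs.append((name.get("@type"), name.get("#text")))
    return pairs


# Hand-written sample in xmltodict's output shape; gene names are placeholders.
sample_entry = {
    "accession": "Q9Y261",
    "gene": {
        "name": [
            {"@type": "primary", "#text": "GENE1"},
            {"@type": "synonym", "#text": "GENE1A"},
        ]
    },
}

print(gene_names(sample_entry))
```

The same normalization would apply to any repeated element in the entry, such as references.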

(back to top)

Built With

  • Docker and Docker-Compose
  • Neo4j (docker container)
  • Python (docker container)
  • Airflow (docker container)
  • Bash scripting

(back to top)

Getting Started

This section explains how to set up the project locally.

To get a local copy up and running, follow the steps below.

(back to top)

Installation and Prerequisites

To run this code, you will need the following software and libraries installed:

  • Airflow 2.5.2
  • Neo4j 5.6.0
  • pendulum 2.1.2
  • pip 23.0.1
  • postgres:14.0
  • Python 3.x
  • py2neo 2021.2.3
  • xmltodict 0.13.0

After installing Python and pip, follow these steps:

1. Install the Python packages listed above:

  pip install apache-airflow==2.5.2 py2neo==2021.2.3 xmltodict==0.13.0 pendulum==2.1.2

2. Clone the repository

   git clone https://github.com/rwurdig/Databio_project.git
   cd Databio_project

3. Run the build.sh file with admin privileges.

  chmod +x build.sh
  ./build.sh

4. The script builds all the images defined in the Docker Compose file and starts the containers.

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See License for more information.

(back to top)

Contact

👤 Rwurdig: E-mail

Project Link: https://github.com/rwurdig/Databio_project

(back to top)

Documentation

Documentation of the Biomedical Engineering Project

tl;dr: The objective is to create a data pipeline that will ingest a UniProt XML file (data/Q9Y261.xml) and store the data in a Neo4j graph database.

Task

Read the XML file Q9Y261.xml located in the data directory. The XML file contains information about a protein. The task is to create a data pipeline that will ingest the XML file and store as much information as possible in a Neo4j graph database.

Requirements & Tools

  • Use Apache Airflow or a similar workflow management tool to orchestrate the pipeline
  • The pipeline should run on a local machine
  • Use open-source tools as much as possible

Source Data

Please use the XML file provided in the data directory. The XML file is a subset of the UniProt Knowledgebase.

The XML contains information about proteins, associated genes, and other biological entities. The root element of the XML is uniprot. Each uniprot element contains an entry element. Each entry element contains various elements such as protein, gene, organism, and reference. Use this structure for the graph data model.
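To make the element hierarchy concrete, here is a minimal sketch that reads an abbreviated entry and pulls out a few fields. It uses the standard library's xml.etree.ElementTree instead of xmltodict so that it runs without extra dependencies. The inline sample is hand-written: only the accession comes from this README, and the protein, gene, and organism names are placeholders. The namespace URI and the function name `summarize_entry` are assumptions; check the root element of the real file before relying on them.

```python
import xml.etree.ElementTree as ET

# Assumed UniProt namespace; confirm against the xmlns on the real root element.
NS = {"up": "http://uniprot.org/uniprot"}

# Abbreviated, hand-written sample; names other than the accession are placeholders.
SAMPLE_XML = """
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>Q9Y261</accession>
    <protein>
      <recommendedName>
        <fullName>Example protein full name</fullName>
      </recommendedName>
    </protein>
    <gene>
      <name type="primary">GENE1</name>
    </gene>
    <organism>
      <name type="scientific">Example organism</name>
    </organism>
  </entry>
</uniprot>
"""


def summarize_entry(xml_text):
    """Return accession, full name, gene names, and organism of the first entry."""
    root = ET.fromstring(xml_text.strip())
    entry = root.find("up:entry", NS)
    return {
        "accession": entry.findtext("up:accession", namespaces=NS),
        "full_name": entry.findtext(
            "up:protein/up:recommendedName/up:fullName", namespaces=NS
        ),
        "genes": [n.text for n in entry.findall("up:gene/up:name", NS)],
        "organism": entry.findtext("up:organism/up:name", namespaces=NS),
    }


print(summarize_entry(SAMPLE_XML))
```

For the real file, `ET.parse("data/Q9Y261.xml").getroot()` would replace the `ET.fromstring` call.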

The full XML schema is available here.

Neo4j Target Database

Please run a Neo4j database locally. You can download Neo4j from https://neo4j.com/download-center/ or run it in Docker:

docker run \
  --publish=7474:7474 --publish=7687:7687 \
  --volume=$HOME/neo4j/data:/data \
  neo4j:latest

Getting Started with Neo4j: https://neo4j.com/docs/getting-started/current/

Data Model

The data model should contain nodes for proteins, genes, organisms, references, and more. The graph should contain edges for the relationships between these nodes. The relationships should be based on the XML schema. For example, the protein element contains a recommendedName element. The recommendedName element contains a fullName element. The fullName element contains the full name of the protein. The graph should contain an edge between the protein node and the fullName node.
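The protein-to-fullName edge described above could be expressed in Cypher roughly as follows. This is a sketch under assumptions: the labels `Protein` and `FullName`, the relationship type `HAS_FULL_NAME`, the property names, and the function `full_name_statement` are illustrative choices, not names defined by the task or this repository. The function only builds a parameterized query and does not connect to a database.

```python
def full_name_statement(accession, full_name):
    """Build parameterized Cypher linking a Protein node to a FullName node."""
    # MERGE creates each node or relationship only if it does not already exist,
    # so re-running the pipeline does not duplicate data.
    query = (
        "MERGE (p:Protein {id: $accession}) "
        "MERGE (n:FullName {name: $full_name}) "
        "MERGE (p)-[:HAS_FULL_NAME]->(n)"
    )
    params = {"accession": accession, "full_name": full_name}
    return query, params


# Placeholder full name; only the accession comes from this README.
query, params = full_name_statement("Q9Y261", "Example protein full name")
print(query)
print(params)
```

The query and its parameters can then be handed to whichever Neo4j client the pipeline uses, such as py2neo. Passing values as parameters instead of formatting them into the query string avoids quoting problems with names that contain apostrophes.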

Here is an example for the target data model:

Example Data Model

Example Code

In the example_code directory, you will find some example Python code for loading data to Neo4j.