RDF Generator

The rdf_generator library provides tools for generating RDF datasets from various data sources, including scraped websites, Excel sheets, CSV files, text files, PDFs, and relational databases (PostgreSQL, MySQL, etc.). It aims to simplify the process of building RDF datasets, enabling seamless integration into linked data workflows.

Features

Modular Parsers: Support for CSV, Excel, PDFs, relational databases (PostgreSQL, MySQL), and BEACON files.
Web Scraping: Extract structured data from websites.
RDF Generation: Build RDF graphs using rdflib, complete with namespaces and serialization options.
Customizable Workflows: Easily extend and integrate with your data pipelines.
Serialization Formats: Generate RDF in Turtle, RDF/XML, JSON-LD, and other formats.

Installation

1. Install from PyPI (Standard Method)

pip install rdf_generator

2. Install Directly from GitHub (Alternative Method)

You can clone the repository and install the library manually if it's not on PyPI yet:

git clone https://github.com/judaicalink/rdf_generator.git
cd rdf_generator
pip install .

Or, install it directly from GitHub using:

pip install git+https://github.com/judaicalink/rdf_generator.git

Requirements

Python 3.7 or higher
The libraries listed in requirements.txt

Install dependencies with:

pip install -r requirements.txt

Core dependencies include:
- rdflib
- pandas
- requests
- beautifulsoup4
- PyPDF2
- mysql-connector-python
- psycopg2

Usage

The rdf_generator library is designed to provide parsers for multiple data sources and utilities to generate RDF datasets. Below are examples for various data sources.

Generate RDF from CSV Files

from rdf_generator.parsers.csv_parser import CSVParser
from rdf_generator.rdf_builder import RDFBuilder

csv_parser = CSVParser("data/people.csv")
data = csv_parser.read_csv()

# Generate RDF
rdf_builder = RDFBuilder()
for row in data:
    rdf_builder.add_person(row['Name'], row['Email'])

# Serialize RDF
print(rdf_builder.serialize(format="turtle"))
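
The internals of CSVParser and RDFBuilder.add_person are not shown in this README. As a rough, standard-library-only sketch of the idea, each CSV row can be mapped to a pair of N-Triples lines. The http://example.org/person/ namespace, the choice of FOAF properties, and all function names below are assumptions made for illustration, not the library's actual modelling.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base namespace, chosen only for this illustration.
BASE = "http://example.org/person/"
FOAF = "http://xmlns.com/foaf/0.1/"


def escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def person_triples(name, email):
    """Return N-Triples lines describing one person (assumed modelling)."""
    subject = f"<{BASE}{quote(name, safe='')}>"
    mailbox = f"<mailto:{quote(email, safe='@')}>"
    return [
        f'{subject} <{FOAF}name> "{escape_literal(name)}" .',
        f"{subject} <{FOAF}mbox> {mailbox} .",
    ]


def csv_to_ntriples(handle):
    """Read rows with csv.DictReader and emit N-Triples, one triple per line."""
    lines = []
    for row in csv.DictReader(handle):
        lines.extend(person_triples(row["Name"], row["Email"]))
    return "\n".join(lines)


sample = io.StringIO("Name,Email\nAda Lovelace,ada@example.org\n")
print(csv_to_ntriples(sample))
```

Percent-encoding the name keeps the subject a valid IRI even when the value contains spaces; in a real pipeline rdflib would take care of escaping and serialization.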

Generate RDF from PostgreSQL

from rdf_generator.parsers.sql_parser import PostgreSQLParser
from rdf_generator.rdf_builder import RDFBuilder

# Connect to the database
db_parser = PostgreSQLParser(
    host="localhost",
    database="testdb",
    user="your_username",
    password="your_password"
)

# Fetch data
query = "SELECT name, email FROM people;"
data = db_parser.fetch_data(query)

# Generate RDF
rdf_builder = RDFBuilder()
for row in data:
    rdf_builder.add_person(row['name'], row['email'])

# Serialize RDF
print(rdf_builder.serialize(format="turtle"))
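
The loop above indexes each row by column name (row['name']), which suggests that fetch_data returns dictionary-like rows. How PostgreSQLParser does this is not documented here; the sketch below shows one way to get that row shape, using an in-memory SQLite database from the standard library as a stand-in for PostgreSQL. The fetch_data helper and the sample table are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
conn.execute("CREATE TABLE people (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO people (name, email) VALUES (?, ?)",
    [("Alan Turing", "alan@example.org"), ("Ada Lovelace", "ada@example.org")],
)


def fetch_data(connection, query, params=()):
    """Hypothetical helper: run a query and return the rows as plain dicts."""
    cursor = connection.execute(query, params)
    return [dict(row) for row in cursor.fetchall()]


rows = fetch_data(conn, "SELECT name, email FROM people ORDER BY name;")
for row in rows:
    print(row["name"], row["email"])
conn.close()
```

With psycopg2, a cursor created with psycopg2.extras.RealDictCursor yields comparable dictionary rows.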

Generate RDF from Websites (Web Scraping)

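from urllib.parse import quote

from rdflib import Literal

from rdf_generator.parsers.web_scraper import WebScraper
from rdf_generator.rdf_builder import RDFBuilder

# Scrape the website
scraper = WebScraper("https://example.com")
data = scraper.extract_data("h1")  # Extract all H1 elements

# Generate RDF (assumes extract_data returns the heading text as strings)
rdf_builder = RDFBuilder()
for item in data:
    # Percent-encode the text for the subject URI and store the title as a literal
    rdf_builder.graph.add((rdf_builder.ns[quote(item)], rdf_builder.ns.title, Literal(item)))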

# Serialize RDF
print(rdf_builder.serialize(format="turtle"))
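
The library lists requests and beautifulsoup4 among its dependencies, but the implementation of WebScraper.extract_data is not shown. As a dependency-free sketch of what extracting all H1 elements involves, the following uses the standard library's html.parser on an inline HTML string. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text content of every <h1> element."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self._parts = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self._parts = []

    def handle_data(self, data):
        # Text inside nested tags such as <em> still belongs to the heading.
        if self._in_h1:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            # Join the fragments and collapse runs of whitespace.
            self.headings.append(" ".join("".join(self._parts).split()))
            self._in_h1 = False


def extract_h1(html):
    """Return the text of all <h1> elements in document order."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


page = "<html><body><h1>Example <em>Domain</em></h1><p>Text</p><h1> Second </h1></body></html>"
print(extract_h1(page))
```

Heading text is free text and may contain spaces or punctuation, which is worth keeping in mind when deciding whether a scraped value becomes a URI or a literal in the graph.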

Generate RDF from Excel Files

from rdf_generator.parsers.excel_parser import ExcelParser
from rdf_generator.rdf_builder import RDFBuilder

# Parse the Excel file
excel_parser = ExcelParser("data/people.xlsx")
data = excel_parser.read_sheet(sheet_name="People")

# Generate RDF
rdf_builder = RDFBuilder()
for row in data:
    rdf_builder.add_person(row['Name'], row['Email'])

# Serialize RDF
print(rdf_builder.serialize(format="turtle"))

Supported Parsers

CSV: Parses CSV files and extracts data as dictionaries.
Excel: Parses Excel (.xls/.xlsx) files and handles multiple sheets.
PDF: Extracts text and tables from PDF files.
SQL: Fetches data from relational databases like PostgreSQL and MySQL.
BEACON: Parses BEACON link dump files for RDF generation.
Web: Scrapes websites to extract structured data.
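
BEACON is the least familiar of these inputs, so a small illustration may help. A BEACON link dump consists of meta lines starting with # (such as #PREFIX and #TARGET) followed by one link per line. The sketch below handles only a simplified subset, namely links with a single source token and {ID} placeholders; it is not the library's BEACON parser, and the function names and sample target URL are assumptions.

```python
def expand(pattern, token):
    """Fill a URI pattern; a pattern without {ID} gets the token appended."""
    if "{ID}" not in pattern:
        pattern += "{ID}"
    return pattern.replace("{ID}", token)


def parse_beacon(text):
    """Parse a simplified subset of BEACON: meta lines plus one-token links."""
    meta = {}
    links = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Meta lines look like "#PREFIX: http://..."
            key, _, value = line[1:].partition(":")
            meta[key.strip().upper()] = value.strip()
            continue
        # Only the first token (the source identifier) is used in this sketch.
        source_id = line.split("|")[0].strip()
        source = expand(meta.get("PREFIX", "{ID}"), source_id)
        target = expand(meta.get("TARGET", "{ID}"), source_id)
        links.append((source, target))
    return meta, links


dump = """#FORMAT: BEACON
#PREFIX: http://d-nb.info/gnd/
#TARGET: http://example.org/person/{ID}

118540238
118559796
"""
meta, links = parse_beacon(dump)
for source, target in links:
    print(source, "->", target)
```

Each (source, target) pair could then be added to the graph as a triple with whatever linking property suits the dataset.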

Serialization Formats

The rdf_generator library supports the following RDF serialization formats:

Turtle: rdf_builder.serialize(format="turtle")
RDF/XML: rdf_builder.serialize(format="xml")
JSON-LD: rdf_builder.serialize(format="json-ld")
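
When the output file name varies, it can be convenient to derive the format name from the file extension. The helper below is a small sketch of that idea, not part of rdf_generator: the format names are the ones used in the serialize() calls above, while the extension mapping and the function name are assumptions based on common conventions.

```python
from pathlib import Path

# Common file extensions mapped to the format names shown above
# (an illustrative assumption, not taken from the library).
EXTENSION_TO_FORMAT = {
    ".ttl": "turtle",
    ".rdf": "xml",
    ".xml": "xml",
    ".jsonld": "json-ld",
}


def format_for(path, default="turtle"):
    """Pick a serialization format name from a file name's extension."""
    return EXTENSION_TO_FORMAT.get(Path(path).suffix.lower(), default)


print(format_for("output.ttl"))
print(format_for("people.JSONLD"))
print(format_for("dump.txt"))
```

The result could be passed on as rdf_builder.serialize(format=format_for("output.ttl")).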

Example Dataset Workflow

Here’s an example pipeline to process multiple data sources and generate a combined RDF dataset:

from urllib.parse import quote

from rdflib import Literal

from rdf_generator.parsers.csv_parser import CSVParser
from rdf_generator.parsers.web_scraper import WebScraper
from rdf_generator.rdf_builder import RDFBuilder

rdf_builder = RDFBuilder()

# Parse CSV
csv_parser = CSVParser("data/people.csv")
for row in csv_parser.read_csv():
    rdf_builder.add_person(row['Name'], row['Email'])

# Scrape Website (assumes extract_data returns the heading text as strings)
web_scraper = WebScraper("https://example.com")
titles = web_scraper.extract_data("h1")
for title in titles:
    # Percent-encode the text for the subject URI and store the label as a literal
    rdf_builder.graph.add((rdf_builder.ns[quote(title)], rdf_builder.ns.label, Literal(title)))

# Serialize RDF
with open("output.ttl", "w") as f:
    f.write(rdf_builder.serialize(format="turtle"))

Development

Clone the Repository

To contribute or use the library without installation:

git clone https://github.com/yourusername/rdf_generator.git
cd rdf_generator

Install Dependencies

Install dependencies using:

pip install -r requirements.txt

Run Tests

Run unit tests using:

python -m unittest discover -s tests
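
The discover command collects files matching test*.py inside the tests directory. The repository's own tests are not reproduced here; the sketch below only shows the general shape such a test module might take. The mailto_iri helper is hypothetical and is defined inline so that the example is self-contained.

```python
import unittest


def mailto_iri(email):
    """Hypothetical helper under test: turn an address into a mailto: IRI."""
    email = email.strip()
    if "@" not in email:
        raise ValueError(f"not an email address: {email!r}")
    return f"mailto:{email}"


class MailtoIriTest(unittest.TestCase):
    def test_builds_iri(self):
        self.assertEqual(mailto_iri(" ada@example.org "), "mailto:ada@example.org")

    def test_rejects_invalid_address(self):
        with self.assertRaises(ValueError):
            mailto_iri("not-an-address")
```

Saved as, for example, tests/test_mailto.py, a module like this would be picked up by the command above.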

License

This library is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository.
2. Create a feature branch: git checkout -b feature-name
3. Commit your changes: git commit -m "Add feature name"
4. Push to the branch: git push origin feature-name
5. Submit a pull request.
