
Integrate Halyard with SANSA Stack #71

Open
asotona opened this issue Aug 7, 2019 · 3 comments

Comments

@asotona
Collaborator

asotona commented Aug 7, 2019

Halyard is a powerful distributed triplestore that answers the majority of SPARQL queries instantly. However, it is weak in some complex operations (like ORDER BY and GROUP BY), and implementing custom code that goes beyond SPARQL is complicated.
SANSA Stack (and similar Spark-based SPARQL frameworks) seems to be complementary to Halyard: powerful in ordering and aggregations, and easy to extend with custom transformation logic in the pipeline, but slow in ad-hoc SPARQL queries and unable to serve as a SPARQL endpoint.

The idea is to provide a hybrid solution, where SANSA Stack (or any other Spark framework) can directly use Halyard data and Halyard query engine as a (distributed) source of RDF data for further processing.

  1. A minimal implementation would provide a Halyard library for Spark, so that SANSA Stack can directly consume Halyard data (reading the RDF data directly from HBase) and call the Halyard SPARQL Query Engine (consuming results of a Halyard SPARQL Graph Query locally and directly).
  2. An integrated solution would require including Halyard as a Service Provider in the SANSA SPARQL Query Engine, so that hybrid access to Halyard data would be available inside SANSA SPARQL as a Federated Service Provider.
  3. An optimal solution would also include transparent integration of Halyard SPARQL parallelization (similar to the halyard:forkAndFilterBy function used in Halyard BulkExport), so that the Spark engine could manage Halyard parallelization directly (transparently for the user).
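
To make the parallelization idea in point 3 concrete, here is a minimal sketch of the general fork-and-filter pattern: every worker evaluates the same query, and each keeps only the bindings whose hash falls into its own fork. This is an illustration only, not Halyard's implementation. The hashing scheme, the function names `fork_index` and `run_fork`, and the in-memory `bindings` list are all assumptions made for the example. In a Spark job, each fork would presumably map to one task.

```python
import hashlib


def fork_index(value: str, forks: int) -> int:
    """Map a binding value to a fork in [0, forks) using a stable hash.

    Assumption: any deterministic hash works for the illustration; this is
    not necessarily what halyard:forkAndFilterBy does internally.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % forks


def run_fork(bindings, fork: int, forks: int, key: str):
    """Keep only the bindings that belong to the given fork."""
    return [b for b in bindings if fork_index(b[key], forks) == fork]


# Hypothetical query results standing in for a SPARQL result set.
bindings = [{"s": f"http://example.org/item/{i}"} for i in range(10)]

FORKS = 3
# Each fork could run as an independent task; together they cover all bindings.
partitions = [run_fork(bindings, f, FORKS, "s") for f in range(FORKS)]
```

Because the hash is deterministic, every binding lands in exactly one fork, so the forks can run independently and their outputs can be concatenated without deduplication.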

This is an idea for a potential synergy between Halyard and SANSA Stack that seems worth testing.

@peterjohnlawrence

Have you taken this idea any further, or do you have any sample code? I am interested in prototyping something, so anything to help me kick-start it would be useful.

@asotona
Collaborator Author

asotona commented Jul 20, 2020 via email

@peterjohnlawrence

I'm not familiar with SANSA. It looks like it partitions its RDF vertically, unlike Halyard. Because of this, as I understand it, it prefers to reload the RDF from the RDF source and partition it into predicate tables with the subject as key and the object as value. I assume you would not want to reload the Halyard triples simply to repartition them. Are you therefore suggesting that Halyard be injected into the SANSA query planner/executor so that SANSA uses Halyard's SPARQL in preference?
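
For reference, the vertical partitioning described above can be sketched in a few lines. This is only an illustration of the layout (one table per predicate, rows of subject and object), assuming that reading of it is right. It is not SANSA's code, and the `vertical_partition` name and the sample triples are made up for the example.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Group (subject, predicate, object) triples into one table per predicate.

    Each table is a list of (subject, object) rows, in input order.
    """
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


# Hypothetical sample data.
triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:name", '"Alice"'),
]

tables = vertical_partition(triples)
```

The repartitioning cost mentioned above comes from this step: every triple has to be read once and routed to its predicate table before any query can run.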

There are a few problems that SPARQL does not solve. For example, shortest-path requires, as a minimum, iterative SPARQL. I am hoping that SANSA Spark would offer an alternative.
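
To show what I mean by iterative SPARQL, here is a rough sketch of a breadth-first shortest-path search where each loop iteration stands for one query round trip. The `neighbours` function is a hypothetical stand-in for a SPARQL query that expands the current frontier by one hop. Here it just scans an in-memory edge list, so nothing in it is a real Halyard or SANSA API.

```python
def neighbours(edges, frontier):
    """Stand-in for one SPARQL round trip: nodes one hop away from the frontier."""
    return {o for s, o in edges if s in frontier}


def shortest_path_length(edges, source, target):
    """Return the number of hops from source to target, or None if unreachable."""
    if source == target:
        return 0
    visited = {source}
    frontier = {source}
    hops = 0
    while frontier:
        hops += 1
        # One iteration corresponds to one query against the store.
        frontier = neighbours(edges, frontier) - visited
        if target in frontier:
            return hops
        visited |= frontier
    return None


# Hypothetical directed edges (subject, object) for a single predicate.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
```

The number of round trips grows with the path length, which is why running the loop inside a Spark job next to the data looks attractive.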
