DBpedia Information Extraction Framework

About DBpedia

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. We hope that this work will make it easier for the huge amount of information in Wikipedia to be used in new and interesting ways. Furthermore, it might inspire new mechanisms for navigating, linking, and improving the encyclopedia itself.
To check out the projects of DBpedia, visit the official DBpedia website.

The DBpedia Extraction Framework

The DBpedia community uses a flexible and extensible framework to extract different kinds of structured information from Wikipedia. The DBpedia extraction framework is written in Scala 2.8 and is available from the DBpedia GitHub repository (GNU GPL License). The change log may reveal more recent developments. More recent configuration options can be found here: https://github.com/dbpedia/extraction-framework/wiki

The DBpedia extraction framework is structured into the following modules:

Core Module : Contains the core components of the framework.
Dump extraction Module : Contains the DBpedia dump extraction application.

Core Module

Components

Source : The Source package provides an abstraction over a source of MediaWiki pages.
WikiParser : The WikiParser package specifies a parser, which transforms a MediaWiki page source into an Abstract Syntax Tree (AST).
Extractor : An Extractor is a mapping from a page node to a graph of statements about it.
Destination : The Destination package provides an abstraction over a destination of RDF statements.

In addition to the core components, a number of utility packages offer essential functionality to be used by the extraction code:

Ontology : Classes used to represent an ontology. Methods for both reading and writing ontologies are provided. All classes are located in the namespace org.dbpedia.extraction.ontology
DataParser : Parsers to extract data from nodes in the abstract syntax tree. All classes are located in the namespace org.dbpedia.extraction.dataparser
Util : Various utility classes. All classes are located in the namespace org.dbpedia.extraction.util

Dump extraction Module

More recent configuration options can be found here: https://github.com/dbpedia/extraction-framework/wiki/Extraction-Instructions.

To know more about the extraction framework, click here

Quickstart

Before you can start developing you need to take care of some prerequisites:

DBpedia Extraction Framework : Get the most recent revision from the GitHub repository.

$ git clone https://github.com/dbpedia/extraction-framework.git
Java Development Kit : The DBpedia extraction framework uses Java. Get the most recent JDK from http://java.sun.com/. DBpedia requires at least Java 7 (v1.7.0). To compile and run it with an earlier version, delete or blank the following two files. (The launchers purge-download and purge-extract in the dump module won't work, but they are not essential.)

core/src/main/scala/org/dbpedia/extraction/util/RichPath.scala

dump/src/main/scala/org/dbpedia/extraction/dump/clean/Clean.scala
Maven : Used for project management and build automation. Get Maven 3 from http://maven.apache.org/.

This is enough to compile and run the DBpedia extraction framework. The required input files, the Wikimedia dumps, will be downloaded by the extractor code if it is configured to do so (see here). Check this out to learn more about Development Environment Setup.
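The Java prerequisite above can be checked from the command line before building. The following is a minimal sketch, not part of the framework: the function name `java_major` is hypothetical, and it assumes version strings of the form `1.7.0_80` (legacy scheme) or `11.0.2` (current scheme).

```shell
#!/bin/sh
# Hypothetical helper (not part of the framework): print the major Java
# version for a version string such as "1.7.0_80" or "11.0.2".
java_major() {
    version=$1
    case "$version" in
        1.*) version=${version#1.} ;;   # legacy "1.x" scheme: drop the "1." prefix
    esac
    echo "${version%%[!0-9]*}"          # keep only the leading digits
}

# Example check against the Java 7 minimum stated above.
# `java -version` writes to stderr; the version is the quoted token on line 1.
if command -v java >/dev/null 2>&1; then
    installed=$(java -version 2>&1 | sed -n '1s/.*"\(.*\)".*/\1/p')
    if [ "$(java_major "$installed")" -ge 7 ] 2>/dev/null; then
        echo "Java $installed is recent enough"
    else
        echo "Java 7 or later is required (found: $installed)"
    fi
else
    echo "java not found on PATH"
fi
```

A similar check for Maven could parse the output of `mvn -version`; it is left out here to keep the sketch short.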

DBpedia Extraction-Framework now powered by Apache Spark

The dump extraction of the DBpedia Extraction Framework now has an Apache Spark implementation.

$ cd extraction-framework/dump/
$ ../install-run sparkextraction extraction.spark.properties
or
$ ../run sparkextraction extraction.spark.properties

The Spark extraction currently supports every extractor except MappingsExtractor, ImageExtractor and the NIF extraction.
The Spark master, an alternative Spark temporary directory, the languages and the extractors can be configured in /dump/extraction.spark.properties

Contribution Guidelines

If you want to work on one of the issues, assign yourself to it or at least leave a comment that you are working on it and how.
If you have an idea for a new feature, make an issue first, assign yourself to it, then start working.
Please make sure you have read the Developer's Certificate of Origin, further down on this page!

Fork the main extraction-framework repository on GitHub.
Clone this fork onto your machine (git clone <your_repo_url_on_github>).
Make a new development branch from the latest revision of the master branch. Name the branch something meaningful, for example fixRestApiParams (git checkout -b fixRestApiParams master).
Make changes and commit them to this branch.

Please commit regularly in small batches of things "that go together" (for example, changing a constructor and all the instance creating calls). Putting a huge batch of changes in one commit is bad for code reviews.
In the commit messages, summarize the commit in the first line using no more than 70 characters. Leave one line blank and describe the details in the following lines, preferably in bullet points, like in 7776e31....
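The commit message convention above can be checked mechanically. The following is a minimal sketch under stated assumptions: the function `check_commit_message` is hypothetical and is not shipped with the framework. It only tests the two rules given here (a summary line of at most 70 characters, followed by a blank line).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the framework): check a commit
# message file against the convention described above.
# Returns 0 if the message passes, 1 otherwise.
check_commit_message() {
    msg_file=$1
    summary=$(sed -n '1p' "$msg_file")   # first line: the summary
    second=$(sed -n '2p' "$msg_file")    # second line: must be blank

    if [ -z "$summary" ]; then
        echo "summary line is empty"
        return 1
    fi
    if [ "${#summary}" -gt 70 ]; then
        echo "summary line has ${#summary} characters (limit: 70)"
        return 1
    fi
    if [ -n "$second" ]; then
        echo "second line must be blank"
        return 1
    fi
    return 0
}
```

Git passes the path of the message file as the first argument to a `commit-msg` hook, so a local hook could call `check_commit_message "$1"` and exit with its status.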

When you are done with a bugfix or feature, rebase your branch onto extraction-framework/master (git pull --rebase https://github.com/dbpedia/extraction-framework.git). Resolve possible conflicts and commit.
Push your branch to GitHub (git push origin fixRestApiParams).
Send a pull request from your development branch into extraction-framework/master via GitHub.

In the description, reference the associated issue (for example, "Fixes #123 by ..." for issue number 123).
Your changes will be reviewed and discussed on GitHub.
In addition, Travis-CI will test if the merged version passes the build.
If there are further changes you need to make, because Travis said the build fails or because somebody caught something you overlooked, go back to item 4 ("Make changes and commit them to this branch"). Stay on the same branch (if it is still related to the same issue). GitHub will add the new commits to the same pull request.
Finally, when everything is fine, your changes will be merged into extraction-framework/master.

Read the complete contribution guidelines here

Wiki

For more information about DBpedia, check out the wiki page.

License

The source code is under the terms of the GNU General Public License, version 2.
