Data Journalism Extractor

This project is an attempt to create a tool that helps journalists with limited programming knowledge extract and process data at scale from multiple heterogeneous data sources, while leveraging powerful database, information extraction and NLP tools.

Features

This software is based on Apache Flink, a stream processing framework written in Java and Scala and similar to Spark. It executes dataflow programs, is highly scalable and integrates easily with other Big Data frameworks and tools such as Kafka, HDFS, YARN, Cassandra or Elasticsearch.

Although you can write custom dataflow programs that suit your specific needs, you don't need to know programming, Flink or Scala to use this tool and build complex dataflow programs that perform operations such as:

Extract data from relational databases (Postgres, MySQL, Oracle), NoSQL databases (MongoDB), CSV files, HDFS, etc.
Use complex processing tools such as soft string-matching functions, link extraction, etc.
Store outputs in multiple data sinks (CSV files, databases, HDFS, etc.)
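
To give a rough idea of what a "soft string-matching" step between two sources does, here is a minimal standalone Python sketch. It is not the code this tool generates and does not use Flink: it assumes two small hypothetical CSV inputs (donors and companies), uses the standard library's difflib as a stand-in similarity measure, and writes the matches to an in-memory CSV sink. The function names, column names and the 0.8 threshold are illustrative assumptions.

```python
import csv
import io
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def soft_join(left_rows, right_rows, left_key, right_key, threshold=0.8):
    """Pair each left row with every right row whose key is similar enough."""
    matches = []
    for left in left_rows:
        for right in right_rows:
            score = similarity(left[left_key], right[right_key])
            if score >= threshold:
                matches.append({**left, **right, "score": round(score, 2)})
    return matches


# Hypothetical inputs standing in for two CSV sources.
donors_csv = "donor,amount\nACME Corp.,1000\nGlobex,250\n"
companies_csv = "company,country\nAcme Corp,US\nInitech,US\n"

donors = list(csv.DictReader(io.StringIO(donors_csv)))
companies = list(csv.DictReader(io.StringIO(companies_csv)))

# Soft join on the (hypothetical) name columns of both sources.
matched = soft_join(donors, companies, "donor", "company")

# Write the result to an in-memory CSV sink.
sink = io.StringIO()
writer = csv.DictWriter(
    sink, fieldnames=["donor", "amount", "company", "country", "score"]
)
writer.writeheader()
writer.writerows(matched)
print(sink.getvalue())
```

The nested loop compares every pair of rows, which is fine for a small illustration but would not scale; a distributed engine such as Flink is what makes this kind of join practical on large inputs.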

Documentation

Documentation about the project is available at this link.

Run an example

The generated code is a Flink application project using Scala and SBT.

To run and test your application locally, execute sbt run, then select the main class that contains the Flink job.

You can also package the application into a fat jar with sbt assembly, then submit it as usual, with something like:

flink run -c core.ScalaTempTest scala/target/scala-2.11/test-assembly-0.1-SNAPSHOT.jar

You can also run your application from within IntelliJ by using the classpath of the 'mainRunner' module in the run/debug configurations: open 'Run -> Edit configurations...' and select 'mainRunner' from the "Use classpath of module" drop-down.

Run the tests

Python tests

The Python tests can be run with the command make test in the parent directory.

All the Python tests are in $ROOT/python/tests.
