Skip to content

swadhin-/wordnet

 
 

Repository files navigation

Wordnet

Open Source browsing application for Wordnet database

Requirements

  • Ruby 2.1.10
  • PostgreSQL 9
  • Neo4J 2
  • MySQL (for WordnetSQL import)

Installation

# Load wordnet database to MySQL

rbenv install 2.1.10
rbenv use 2.1.10
bundle install

bin/rake db:create db:migrate
bin/rake wordnet:import

Ubuntu 12.04 deployment

Video: https://www.youtube.com/watch?v=kJVyO9I173o.

  • Create hosting account (e.g. digitalocean)
  • Create user with sudo permissions
  • Remember to ssh-copy-id your public key to this new user account
  • Add your public key to data/playbook.yml
  • Ensure ansible is installed on your machine
  • Run bin/setup-host USER@HOST:PORT command to setup your server. It installs:
    • common tools
    • ruby, java
    • mysql, postgresql, neo4j
    • deployment framework
  • ssh wordnet@HOST -p PORT and run ./deploy script
  • download mysql database to server
  • import database to mysql via mysql -D wordnet < wordnet.sql
  • go to current deployment location (cd production/current) and run:
    • RAILS_ENV=production bin/rake wordnet:import (imports mysql to postgresql)
    • RAILS_ENV=production bin/rake wordnet:export (exports postgresql to neo4j)
    • RAILS_ENV=production bin/rake wordnet:translations (generates translations)
    • RAILS_ENV=production bin/rake wordnet:stats (generates statistics)
  • to change url prefix of application:
    • add URL_ROOT=/wordnet to .env file in application's deployment directory
    • touch tmp/restart.txt to restart an application

Project overview

Słowosieć is a Polish equivalent of Princeton Wordnet, a lexical database of word senses and relations between them.

The purpose of this document is to describe a successful effort of making the web interface of Polish Wordnet more performant and user-friendly. In particular we'll elaborate on developed architecture, used components, and database designs.

The front-end and back-end of application were rebuilt from scratch. As as result the browsing latency dropped from 30 seconds in some cases to 110ms on average.

Architecture

Following decisions has been made:

  • Data is stored in normalised form using relational database
  • Data is indexed and queried using graph database
  • Data is rendered on client-side using templates
  • Data is loaded through a well-crafted API endpoint

Given multiple issues with MySQL database and performance issues with handling UUIDs, the PostgreSQL were chosen as relational database backend. This has an additional advantage of storing data in Hstore and Array types (where sensible), avoiding unnecessary JOIN statements for data retrieval.

Neo4J has been chosen as relational database backend. The main reasons included being open-source, mature, and reliable graph store. Neo4J is one of the few graph databases providing declarative way of querying data, using Cypher language (similar in some ways to SQL).

On front-end an Angular.js framework is used. It is relatively new, but popular product developed and maintained by Google. It allows for easy decoupling of application logic and template rendering using unique concepts of directives, services, and controllers.

Rails 4 web-framework is used for both API endpoint, and serving front-end. Rails is mature software, allowing for robust development of modern web applications. Made in Ruby, allows us to use use tens of thousands of Ruby Gems, significantly boosting the development.

API allows for disjoint development of front-end and back-end.

Other technologies used

Experience made us choose following set of tool for application development:

  • CoffeeScript replacing plain JavaScript
  • SASS replacing plain CSS stylesheets
  • SLIM for rendering front-end HTML markup

Definitions

  • Lexeme - unit of lexical meaning that exists regardless of the number of inflectional endings it may have or the number of words it may contain (e.g. run, ran, runs)
  • Lemma - particular form of a lexeme that is chosen by convention to represent a canonical form of a lexeme (e.g. run)
  • Sense - a Lexeme associated with particular meaning. Each Lexeme can have multiple Senses. In Wordnet each Sense is associated with number to easily distinguish (e.g. I can write run 4 meaning an unbroken series of events, or run 5 meaning the act of running)
  • Synset - a set of Senses (not Lexemes) with similar meaning, i.e. synonyms (e.g. run 2 forms Synset with following Senses: bunk 3, escape 6, turn tail 1).
  • Sense Relation - a relationship between two Senses, i.e. relationship between two particular meanings of words (e.g. big 1 is antonym of little 1)
  • Synset Relation - a relationship between two Synsets, i.e. relationship between two groups of Senses (e.g. Synset { act 10, play 25 } is hyponym of Synset { overact 1, overplay 1 }).
  • Relation Type - each SenseRelation and SynsetRelation has its type, it can be among others: antonym, hyponym, hyperonym, meronym, ...

In summary: Each Lexeme is represented by Lemma. Each Lexeme has multiple Senses. Each Sense forms Synset with other Senses. Each Sense can be in SenseRelation to other Senses. Each Synset can be in SynsetRelation to other Synsets. Each Relation has its own RelationType.

Above concepts of Wordnet are modelled in application in following way:

Class Diagram

Relational Database

Introducing Relational Database as primary store had two purposes:

  1. Reliably and economically storing data in normalised form
  2. Ability to use de-normalised graph database as index

The data is imported to normalised form from Polish Wordnet, but the process allows for importing arbitrary Wordnet-alike database.

Non-conventionally the primary keys of database tables are UUIDs, instead of auto-incrementing values. It has few advantages:

  • Plays well with graph databases, each node has its own unique ID
  • UUIDs for records can be generated by application code what makes inserting interconnected data into the database easier & performant.
  • Makes replication of relational database trivial
  • Allows for easy merging of two databases with same schema

The overall schema closely reassembles concepts described earlier:

senses

  • id: The UUID identifier
  • synset_id: The UUID of connected Synset
  • external_id: The ID from external database, used for importing
  • lemma: The lemma of Lexeme that Sense belongs to (e.g. car)
  • sense_index: The index of sense in context of its Synset (e.g. 1)
  • comment: The short comment, used in UI (e.g. transporting machine)
  • language: Currently can be en_GB or pl_PL
  • part_of_speech: The part of speech of Sense (noun etc.)
  • domain_id: The ID of the Domain of Sense (not used yet)

synsets

  • id: The UUID identifier
  • external_id: The ID from external database, used for importing
  • comment: The short comment by Słowosieć, used in UI
  • definition: The short comment by Princeton Wordnet, used in UI
  • examples: The examples of usage of synset from Princeton Wordnet

relation_types

  • name: Name of the relation
  • reverse_relation: Name of reverse relation (see: normalisation)
  • parent_id: Name of parent RelationType (inheritance-like)
  • priority: It is used for sorting relation types in UI (lower-better)
  • description: Description of the relation (not used yet)

sense_relations and synset_relations

  • parent_id: UUID of base sense (or synset)
  • child_id: UUID of of related sense (or synset)
  • relation_id: UUID of relation in which child is toward parent (e.g. UUID hyponymy relation means child is hyponym of parent)

Normalisation of Relations

Imported relations are normalised in few ways:

  1. For reverse relation types we leave only one relation type (by convention the one where where are more children than parents, e.g. hyponymes, not hyperonymes).
  2. The name of removed reverse relation is assigned to reverse_name
  3. Name and reverse_name are in plural form for for UI purposes
  4. Even name has it’s parent, the name describes full relation type name (for example “Meronymes (place)”, not “place”)

Relations

Graph Database

Graph Database Graph database has slightly different structure than relational database. Most importantly Sense and Synset nodes don’t contain any data except their IDs. The relationships of type relation exist only between Synset and Senses. All data displayed in UI columns is hold in Data nodes.

Each Synset and each Sense is represented by connected Data node in UI.

Data node holds following data from Sense model:

  • lemma
  • sense_index
  • comment
  • language
  • part_of_speech
  • domain_id

Importing data from external Wordnets

Wordnet uses internal, normalised representation of database. The normalised structure is defined in Relational Database section.

The data mapping is done by 5 classes inherited from Importer class:

  • WordnetPl::RelationType
  • WordnetPl::Sense
  • WordnetPl::Synset
  • WordnetPl::SenseRelation
  • WordnetPl::SynsetRelation

Each class is responsible for importing data to corresponding models.

Importer class processes data in batches for performance reasons. It handles progress bar rendering, parallelising import process, and synchronising writes. It expects following methods to be defined in descendants:

  • total_count: The total count of items to be imported
  • load_entities(limit, offset): This method should load limit records from external database with given offset and return hash consumed later by process_entities! method
  • process_entities!(entities): This method is responsible for processing data returned from load_entites and passing them to persist_entities! method described below

persist_entities!(table_name, collection, unique_attributes) uses Upsert method to insert or update data in database in performant way. It accepts table in database where the record should be inserted/updated, the actual collection of records as array of hashes where keys are column names (see relational database schema) and values are row values. The unique_attributes is an array of column names that upsert method will use for selecting data to merge (usually “id”, but can be for example [“parent_id”, “child_id”] for relations.

Import process can be triggered by issuing command:

bin/rake wordnet:import

The source database defaults to mysql2://root@localhost/wordnet, but you can change it by passing SOURCE_URL environment variable.

Exporting to Neo4J index

The same way importer classes inherit from Importer, exporter classes inherit from Exporter. The are only 4 exporter classes:

  • Neo4J::Sense
  • Neo4J::Synset
  • Neo4J::SenseRelation
  • Neo4J::SynsetRelation

Each exporter is supposed to define 2 methods:

  • export_index!: that ensures at the beginning of export that proper indexes are present in Neo4J database
  • process_batch(entities): method that accepts array of entity hashes, just like process_entities! and returns array of queries to execute in batch request by Neography gem.

Export process can be triggered by issuing command:

bin/rake wordnet:export

The destination defaults to http://127.0.0.1:7474, but you can change it by passing NEO4J_URL environment variable.

Deployment

Application is supposed to be run on at least 3 servers:

  1. Application server
  2. PostgreSQL server
  3. Neo4J server

On application server the Rails application should be deployed, using any method. At least Node.js, Ruby 2.0, and development libraries of Postgresql and Mysql are required to be installed on system.

The addresses of PostgreSQL database and Neo4J database are passed by NEO4J_URL environment variable, and database information is configured in config/database.yml.

The assets need to be precompiled before deploying app on production:

RAILS_ENV=production bin/rake assets:precompile

The server can be started by hand with:

RAILS_ENV=production bin/rails server --port 80

Or by tool you choose (Capistrano or other).

License

Wordnet is released under the MIT License.

About

Lexical database of any language

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Ruby 55.4%
  • HTML 18.8%
  • Shell 11.1%
  • CSS 7.1%
  • CoffeeScript 6.7%
  • Python 0.9%