Usage with Postgres

The advantage of using Postgres is that you can have it hosted by Compose, performance is predictable, and it's typically going to be faster than a graph database. The tradeoff for those performance gains is that you need to set your schema in advance, and 'deeper' queries become much more complicated to write. With Postgres you can mitigate the former a bit by using the jsonb type.
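
For example, a hypothetical documents table on your side could keep rigid columns for the fields you know about up front and park everything else in a jsonb column, which you can still query without a schema change:

CREATE TABLE documents (
    id    uuid PRIMARY KEY,
    title text NOT NULL,
    body  text NOT NULL,
    attrs jsonb           -- free-form fields you didn't anticipate up front
);

-- dig into the unstructured part with a containment query, no ALTER TABLE needed
SELECT id, title FROM documents WHERE attrs @> '{"source": "email"}';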

Schema

The Postgres adapter for toomba requires that 2 tables be created, each with a specific schema. At a high level, we need a table which toomba will populate with metadata about a concept, and we need a table that links your original documents to these concepts.

The concept table needs to be called concepts and have this schema:

postgres=# \d concepts;
                        Table "public.concepts"
   Column    | Type | Modifiers 
-------------+------+-----------
 id          | text | not null  
 type        | text | not null  
 annotator   | text | not null  
 label       | text | not null  
 ontology    | text |           
 description | text |           
 meta        | text |           
Indexes:
    "concepts_pkey" PRIMARY KEY, btree (id)

There can be several edge tables. When you make the request to toomba, you need to specify the name of the edge table responsible for the connection from your original document. This is so that you can enforce foreign key constraints on the various edge tables if you want to. If you're not interested in that, then you can create an all-purpose 'edges' table and use that for everything.

This edge table needs to have the following schema:

postgres=# \d edges;
                         Table "public.edges"
  Column   | Type | Modifiers 
-----------+------+-----------
 foreignid | uuid | not null  
 conceptid | text | not null  
 score     | real | not null  
 sentiment | real |           
 evidence  | text |           
 meta      | text |           
Foreign-key constraints:
    "edges_conceptid_fkey" FOREIGN KEY (conceptid) REFERENCES concepts(id)

Note that foreignid does not need to be of uuid type. It's passed to toomba in the request as a string, and can be cast to whatever type SQL will allow (text, varchar, uuid, possibly int?).
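
Again for reference, a CREATE TABLE statement matching the schema above:

CREATE TABLE edges (
    foreignid uuid NOT NULL,
    conceptid text NOT NULL REFERENCES concepts (id),
    score     real NOT NULL,
    sentiment real,
    evidence  text,
    meta      text
);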

Ingestion

Once you've established your schema, taking into account what toomba requires, you'll want to start ingesting documents into toomba. As with all of the toomba endpoints, you make a POST request to toomba with JSON in the body and receive a JSON response back, which will look like:

{
   "message": "",
   "transactions": [],
   "status": "OK/ERROR"
}

When you get the response from toomba, you need to loop through the transactions array and execute each statement against your database. You could also roll the transactions array up into a single transaction to execute against the db, if your driver supports that.

It's convenient to abstract this 'ingestion' of a document into a method called 'toombatize'. Toombatizing a piece of text essentially links it to every other toombatized document in your database, so you can easily leverage whatever unstructured data you want to include in your project.
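
Here's a minimal sketch of what toombatize might look like in Python, assuming the requests and psycopg2 libraries; the /ingest path, payload fields, and connection string are illustrative placeholders, not toomba's actual API:

import psycopg2
import requests

TOOMBA_URL = 'http://localhost:3000/ingest'  # hypothetical endpoint

def toombatize(conn, doc_id, text, edge_table='edges'):
    # send the document to toomba (payload fields are illustrative)
    resp = requests.post(TOOMBA_URL, json={
        'id': doc_id,
        'text': text,
        'edgeTable': edge_table,
    })
    body = resp.json()
    if body['status'] != 'OK':
        raise RuntimeError(body['message'])
    # execute every returned statement inside a single db transaction
    with conn:
        with conn.cursor() as cur:
            for statement in body['transactions']:
                cur.execute(statement)

conn = psycopg2.connect('dbname=mydb')
toombatize(conn, '0b0e8bbb-8f1c-4f5a-9d7e-2a6c1f3d4e5b', 'Some document text.')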

Similarity

Toomba links your documents together through 'concepts', which come from a variety of Watson services. This enables you to implement advanced NLP-enabled comparator functions relatively easily. One of the more common comparators we've implemented is text-to-text similarity.

There are a lot of approaches you could take to solving this, and here we want to give an example of a very basic algorithm. This algorithm goes through two stages - retrieve and rank (hah). In the retrieve phase, we want to get the top n documents that share the most concepts with your input document. Your postgres query might look something like this:

SELECT
    edges.foreignid AS docid,
    array_agg(edges.conceptid) AS evidence,
    array_agg(question_concepts.score) AS question_to_concept_scores,
    array_agg(edges.score) AS document_to_concept_scores
FROM
    ( SELECT conceptid, score FROM edges WHERE foreignid = $1 ) AS question_concepts
    JOIN edges USING (conceptid)
WHERE edges.foreignid <> $1   -- don't return the input document itself
GROUP BY edges.foreignid
ORDER BY COUNT(*) DESC
LIMIT $2

This should get us document ids, plus for each document a list of concept-ids telling us what it shared with the input question, a corresponding list of question-to-concept scores, and another list of document-to-concept scores.
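
To get rows back as dictionaries, the way the ranking pseudocode below assumes, you could run the query with psycopg2's RealDictCursor (a sketch reusing the conn from the ingestion example; retrieve_sql and question_id stand in for the query above and your input document's id):

from psycopg2.extras import RealDictCursor

# retrieve_sql is the query above, with $1 and $2 rewritten as the
# named placeholders %(qid)s and %(limit)s that psycopg2 expects
with conn.cursor(cursor_factory=RealDictCursor) as cur:
    cur.execute(retrieve_sql, {'qid': question_id, 'limit': 20})
    similar = cur.fetchall()  # each row is a dict keyed by column name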

The next step is to rank these results with a slightly more sophisticated method. So far we only care about how many concepts the question shared with each document, but we haven't taken into account how strongly each concept is related to both the question and the target document. An example method to rank these (pseudocode):

# pretend each row is a dictionary
for row in similar:
    row['score'] = 0
    for index, conceptid in enumerate(row['evidence']):
        # weight the question-side and document-side scores equally
        row['score'] += (row['question_to_concept_scores'][index] * 0.5
                         + row['document_to_concept_scores'][index] * 0.5)

ranked = sorted(similar, key=lambda row: -row['score'])[:limit]

This is just to illustrate an example approach. You should use what you think is best for your specific use-case. Maybe you want to take sentiment into account, or take into account the overall frequency of each concept (ex: is 'United States of America' an extremely popular concept in your corpus? Maybe that should make it count for less). You also probably want the full concept metadata available to you, and probably want more than just the document id, which would mean you need 2 more joins in your retrieve phase. We recommend setting up a lot of views and using them to keep your query sane.
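
As an example of that last point, a hypothetical view that bundles the edges-to-concepts join in one place, so your retrieve query can select from it instead of repeating the joins:

-- one place to hide the edges-to-concepts join
CREATE VIEW edge_details AS
SELECT
    edges.foreignid,
    edges.conceptid,
    edges.score,
    edges.sentiment,
    concepts.label,
    concepts.type,
    concepts.ontology,
    concepts.description
FROM edges
JOIN concepts ON concepts.id = edges.conceptid;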
