
MatrixTDBProcedures

anonymous edited this page Sep 17, 2009 · 13 revisions

MatrixTDB is the regression test facility for the Grammar Matrix and the Matrix customization system. It allows us to create gold standard tsdb++ profiles on demand for language types defined in choices files.

High Level Overview

There are three main things you might want to do with MatrixTDB, either to put data in or to get data out: add new strings, add a language type, or extract a profile for a language type. These three high-level tasks break down into smaller sub-tasks. The breakdown into sub-tasks is displayed here, while the Detailed Processes section of this page breaks each of those down into step-by-step instructions.

Adding New Strings Breakdown

  • Create a source profile
  • Import the source profile
  • Add permutes
  • Run specific filters

Adding A Language Type Breakdown

  • Adding a language type (this task does not break down further; it is a single sub-task)

Extracting a Profile Breakdown

  • Import the language type
  • Generate a profile
  • Evaluate the profile using [incr_tsdb()]

Detailed Processes

This section describes step-by-step instructions on how to perform various tasks and sub-tasks with MatrixTDB.

Database Dump

If you're not sure of the effect of what you are about to do, you may want to make a dump of the database so that the data can be quickly restored if what you do doesn't go the way you want. To do this:

  • Run the following command:

    • $ mysqldump -h capuchin.ling.washington.edu -u username -p --result-file=resultfile dbname
    • The arguments are as follows:
      • username - your username for the database
      • resultfile - the name of the file you want to dump the backup to.
      • dbname - the name of the database on capuchin to backup. Currently MatrixTDB2 is the database we are using.

Note: This won't actually back up the data per se, but it will create a (very large) file full of SQL statements that can be used to restore the database to its state at the time of the dump.

Restore From Database Dump

If you want to revert the database to a previous point:

  • Log in to the database

  • Issue the following command:

    • mysql> source filename
    • where filename is the name of the file with the dump you want to revert to.

Create A Source Profile

Source profiles (sometimes also called 'original source profiles') are what are used to bring the big, hairy mrs semantics into the database. To create one:

  • Create a flat file with one harvester string per line
  • Start LKB and load the grammar you want to use to create the mrs semantics to import
  • Start [incr_tsdb()] and process all the items in that file
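For example, a minimal flat file of harvester strings might look like the following (one string per line; the second string here is purely illustrative):

```
n1 iv
n1 tv n2
```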

Import A Source Profile

Source profiles (sometimes also called 'original source profiles') are what are used to bring the big, hairy mrs semantics into the database. When you have a [incr_tsdb()] profile that was created by processing a flat file of items, you can use it to import a source profile into the database. To do so:

  • Create a file that has each harvester string in your profile on a line, preceded by its mrs tag and an '@', e.g., wo1@n1 iv (mrs tag wo1, harvester string n1 iv)

  • Run the following command:

    • $ python import_from_itsdb.py itsdbDir harvMrsFilename choicesFilename
    • The arguments are as follows:
      • itsdbDir - the absolute path of your [incr_tsdb()] profile directory. Be sure to end it in a '/'
      • harvMrsFilename - the name of the file you created above with mrs tags and harvester strings
      • choicesFilename - the choices file of the grammar you used to create the profile
  • The system will prompt for a username and password to the database

  • The system may ask you if the tags you're adding really are new or if you want to replace the existing tags with the new semantics you are importing. Answer appropriately. If the system indicates you are changing some semantics, make sure that is what you want to do.

  • The system will also ask you for a description of the source profile. It can be up to 1000 characters.

  • The system will import the profile and, if the choices file you used represents a language type not already in the database, will create a language type for it, too. It will return the osp_id and the language type ID, both of which you will need to add permutes and run specific filters
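The tag-and-string file format described above can be parsed as follows (an illustrative sketch, not the actual MatrixTDB code):

```python
# Illustrative sketch: parsing one line of the mrs-tag/harvester-string
# file described above, where each line is '<mrs_tag>@<harvester string>'.
def parse_harv_line(line):
    """Split a line like 'wo1@n1 iv' into (mrs_tag, harvester_string)."""
    tag, string = line.strip().split('@', 1)
    return tag, string

print(parse_harv_line('wo1@n1 iv'))  # ('wo1', 'n1 iv')
```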

Add Permutes

At this point you will have imported a profile with its harvester strings. But a harvester string gives rise to potentially millions of other possible strings with the same semantics. Specifically, each harvester string gives rise to seed strings, which are then permuted and added to the database as strings to be run through specific filters. (Earlier versions of MatrixTDB added all permutations, which were run through universal filters and then specific filters; more recently, only those string/semantic-tag combos that pass all universal filters are added to the database.) Seed strings are stored in a canonical form: words in alphabetical order, followed by prefixes in alphabetical order, followed by suffixes in alphabetical order. The permutations are then every possible permutation of the words, combined with every possible permutation of prefixes and suffixes on every word in each of those permutations. Seed strings are generated from harvester strings by the stringmods in stringmod.py. Here is how to generate all the permutations for an imported original source profile:

  • Make sure stringmods is updated to meet your needs (optional)
  • Create a condor file. A template is in the repository named addPerms.cmd
    • change ospID to be the ID of the source profile you want to create permutes for
    • change username to be your username to the MatrixTDB database
    • change password to be your password to the MatrixTDB database
  • Submit your command file to condor with the following line:
    • $ condor_submit addPerms.cmd
  • The process may take many hours depending on how many strings you have and how long they are. Two ways to monitor the progress are to check the count of records in the result table or to check the .warning file from time to time.
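The canonical seed-string form and the word-permutation step described above can be sketched as follows (an illustrative sketch, not the actual MatrixTDB code, and ignoring prefix/suffix permutation):

```python
# Illustrative sketch of the permutation step described above: a seed
# string's words are stored in alphabetical order, and every ordering of
# those words becomes a candidate string for filtering.
from itertools import permutations

def canonical_seed(words):
    """Return the canonical (alphabetically ordered) seed string."""
    return ' '.join(sorted(words))

def word_permutations(seed):
    """Yield every permutation of the words of a seed string."""
    for perm in permutations(seed.split()):
        yield ' '.join(perm)

seed = canonical_seed(['n1', 'iv'])      # 'iv n1'
print(sorted(word_permutations(seed)))   # ['iv n1', 'n1 iv']
```

This also makes clear why the process may take many hours: the number of word orders alone grows factorially with string length.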

Run Specific Filters

At this point you will have inserted every permutation that passes all universal filters for your source profile into the item_tsdb, parse, and result tables. Next we need to record how the string/mrs combos fare when run through specific filters, so that we can generate profiles for language types based on the results of those runs. Here's how:

  • Make sure s_filters.py has the filters in it specific to the phenomena you are concerned with (optional)
  • Create a condor file. A template is in the repository named runSFltrs.cmd
    • change ospID to be the ID of the source profile you want to run specific filters for
    • change username to be your username to the MatrixTDB database
    • change password to be your password to the MatrixTDB database
  • Submit your command file to condor with the following line:
    • $ condor_submit runSFltrs.cmd
  • The process may take a few hours depending on how many permutations passed all universal filters in this OSP. Two ways to monitor the progress are to check the count of records in the res_sfltr table or to check the .warning file from time to time.
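A condor submit file such as runSFltrs.cmd typically looks something like the following. This is a hypothetical sketch only; the actual template is in the repository, and its executable path and argument order may differ.

```
# Hypothetical sketch of a condor submit file; see the runSFltrs.cmd
# template in the repository for the real argument order.
universe   = vanilla
executable = /usr/bin/python
arguments  = run_specific_filters.py ospID username password
output     = runSFltrs.out
error      = runSFltrs.err
log        = runSFltrs.log
queue
```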

Import A Language Type

  • Create a choices file from the customization system

  • Run the following command:

    • $ python sql_lg_type.py filename
    • where filename is the full path and filename of the choices file you downloaded
  • You will be prompted for the username and password to the database.

  • If the language type already exists, it will return its ID.

  • If the language type does not already exist, the system will ask if the language type is randomly created or purpose-built. Answer r or p as appropriate. Unless you're just testing phenomena coverage of MatrixTDB itself, yours is probably purpose-built.

  • The system will prompt you for a short comment describing your language type. You can enter the name of the language or the phenomena you're testing (e.g., v-final), or whatever you feel is appropriate

  • The system will then create a language type in the database and give you an lt_id (language type ID) that you can use to generate a [incr_tsdb()] profile for validation

Generate A Profile

  • Make sure you have a language type ID that you want to generate the profile for. You can get the ID by querying the lt table or by importing a language type (see Import A Language Type above)

  • Run the following command:

    • $ python generate_s_profile.py lt_id dbroot profilename
    • where the three arguments are:
      • lt_id - the language type ID you are generating the profile for
      • dbroot - the absolute path where you want the profile created
      • profilename - the name of the profile to create
  • You will be prompted for the username and password to the database.

  • The system will generate two profiles in dbroot: [profilename] and [profilename]gold. The gold version should be processed in [incr_tsdb()] and used as the gold standard

Definitions

Old Stuff

Matrix developers should update MatrixTDB as follows:

  • Determine the harvester strings you'll need to illustrate your library.

  • Determine the semantically neutral variants your library allows for each harvester string, at the level of bags of words. For example, the basic lexical library allows for case-marking adpositions. So p-nom can be added to any string with an overt subject to get a semantically equivalent string, provided p-nom is in the right place.

  • Update customize/sql_profiles/stringmods.py to reflect the modifications.

  • Create a harvester grammar to process your harvester strings with. Save the choices file from that grammar.

  • Create a file listing the harvester strings and their mrs_tags (see customize/sql_profiles/harv_str/harv_mrs_1 for an example).

  • Create a file with just the harvester strings.

  • Start the LKB and tsdb++

  • Load the harvester grammar into the LKB.

  • In tsdb++, go to File > Import > Test items to import the harvester strings.

  • Make sure tsdb++ is set to write the mrs field.

  • Process the items you imported with your grammar.

  • The resulting profile will be your source_profile.

  • Next, run customize/sql_profiles/import_from_itsdb.py with your source_profile, choices file and harv_mrs file as arguments.

  • Get the resulting osp_id

  • Then run customize/sql_profiles/add_permutes.py and give it the osp_id

  • Update the universal and specific filters in u_filters.py and s_filters.py

  • Run run_u_filters.py

  • Run the SQL query that separates the universally ungrammatical from universally grammatical results.

  • Run run_specific_filters.py

At this point, MatrixTDB is up to date. We can also use import_from_tsdb.py to update the mrss we want to have corresponding to particular mrs_tags.
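The semantically neutral variants described above (e.g., adding p-nom to a string with an overt subject) can be illustrated with a sketch in the spirit of stringmod.py. This is a hypothetical example, not the project's actual code; the function name and the position argument are invented for illustration.

```python
# Hypothetical sketch of a string modification in the spirit of
# customize/sql_profiles/stringmod.py: insert the case-marking adposition
# 'p-nom' after the subject noun to produce a semantically equivalent
# variant of a seed string.
def add_p_nom(seed_words, subject_index):
    """Return a new word list with 'p-nom' inserted after the subject."""
    words = list(seed_words)
    words.insert(subject_index + 1, 'p-nom')
    return words

print(add_p_nom(['n1', 'iv'], 0))  # ['n1', 'p-nom', 'iv']
```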

To export a profile corresponding to a given choices file:

  • _FIX_ME_ instructions here.

TODO:

  • Work out how to run filters recursively for coordination et al
  • Update filters for coordination
  • Update filters for inflection version of question particles