TODO.txt

h1. Overall goal: allow searches in DOBES corpora in a semasiological and onomasiological approach to extract grammatical information
But: From a technical point of view the search types a descriptive linguist would like to apply to a DOBES corpus are independent of the distinction between the two basic approaches to grammar mentioned above. (wirklich? vielleicht finde ich noch Gegenbeispiele)

h1. User stories:
* Search results are presented in interlinear version
* Search results/hits can be stored as a separate data set which may be the basis for another search (ist letzters nicht ausreichend? muss man das wirklich speichern, als Datei?)
* It is possible to determine the context size of the concordances with respect to the different levels of syntagmatic complexity (???)
* It is possible to search for the cooccurrence of two or more lexical or grammatical forms F1 and F2, or of construction C1 and C2 in one specific tier x1 or across different tiers x1-n within a linguistic context defined by the level of syntagmatic complexity/grammatical level (ab "within" versteh ichs nicht)
* Between the linguistic items (i.e. forms/constructions) F1-2/C1-2 which are searched for in this search type – on a single tier x1 (!) - a logical operation can be defined such F1 {AND, OR, AND NOT etc.} F2, or F1 {AND, OR, AND NOT etc.} C1-C1, and so forth
* Between the linguistic items F1 in tier x1 and F2 in tier x2 or C1 in tier x1 and C2 in tier x2, and so forth, which are searched for in this search type, a logical operation can be defined such that F1 in tier 1 {AND, OR, AND NOT etc.} F2 in tier x2, or C1-C1 in tier x1 {AND, OR, AND NOT etc.} C2-C2 in tier x2, and so forth
* After a search the user has access to a frequency view of the hits (wie funktioniert das in ELAN genau?) (see also Other Stories)

h2. Semasiological search

h3. Lexical identification
* It is possible to find lexical items/allomorphs by searchin the longest common prefix in all words, inconsistent spelling is regared, the user can give alternations for characters/substrings (for example long/short vowels)
* There is a search for strings which are hypothetically part of the word stem (similar to previous story, but for longest common substring instead of prefix)
* It is possible to search for a lexical item in the translation tier (also for identification of allomorphs of the word/lexeme and to find lexical paradigms)
* It is necessary to view context of the search result, which is either a word or a sentence); a simple number of context items is irrelevant here
* With mo/gl tier: The user can search an item on the morpheme or gloss tier and for the same item in the utterance tier (to identify lexical items); the items are not necessarily the same, but can be very similar

h3. Identifictation of grammatical units
* It is possible to look for systematic varations with regards to form-meaning pairs in the tx and ft tier (idea: use statistics like for parallel texts, partext package by Matthias Ongyerth)
* There is a possibility to search for prototypical nouns/verbs/... to look for systematical form-meaning variation. The user can enter preliminary wordlists (for example a list of nouns from fieldwork) for this.
* There is a semi-automatically POS-tagger that derives part-of-speech information from existing tiers (maybe with the help of a lexicon/wordlists, in first step)
* It is possible to search for grammatical information in the ft tier (for example -ing forms, to search for progressives)
* It is possible to search for grammatical glosses in gl tier
* It is possible to search for one string in the tx tier and for another string in the gl tier (to analyze morphophonetic processes)
* It is possible to search over word boundaries on gl/mo tiers, (for example search "V-kje" and then "heesge" right of it) without specifying how many words are between

h3. Prase level analysis
* It is possible to search for list of words, for example from a lexicon (as lexicon specifies POS, and it may be possible to find phrases by searching for POS information)
* It is possible to search for elements of phrases, for example DET or "one" for NPs (problem: both have other functions, and there are NPs without DET)
* It is possible to search for proper noun markers ("-ga" in Hocank)
* It is possible to search for first n nouns/verbs/... of a frequency list
* It is possible to search across tiers, for example "'ee" on mo tier and not "say" on gl tier (as "'ee" is both free pronomina and "say")
* It is possible to search for prepositions in ft tier (to find PP in Hocank)

h3. Clause level analysis
* It is possible to search for question words in ft tier
* It is possible to search for puctuation (in tx and ft tier)
* It is possible to search for elements on gl tier with logical operation (gl tier has "item1" AND "item2", see User story above)
* It is possible to search for semantic classes of verbs on gl and mo tier (for Hocank there is a (Toolbox) lexicon which contains information about semantic classes of verbs)
* It is possible to search for elements on different tiers with logical operations (gl tier as "item1" AND ft tier has "item2")

h2. General stories
* After each search a summary of hits, hits by ovarall items, etc. (frequency lists) is available. Those are simple lists that can be copy and pasted.
* It is possible to add a POS tier semi-automatically (via word lists, lexicon?)
* It is possible to search define word lists and search for word lists
* It is possible to extract word lists from Toolbox lexicon (and other source data like wordlists from fieldwork; maybe Copy&Paste is enough for the latter...)
* It is possible to do some analysis on the ft tier (POS, syntactic tree, morphemes with glosses, ...) and to search that additional information (for example, after morpho-syntactical tagging of the english translation it is possible to search for comparatives)

20 Oktober 2009 - Treffen mit Helmrecht
PeterBouda, 8 November 2009 (created 8 November 2009)

* Suche nach Morphemen / Glossen, Anzeige der Umgebung
* Wie sieht die Nominalphrase aus? N _ _ DET
* Suche nach Nomen / Klassen von Nomen (aus dem Lexikon) und ihre Umgebung
* Bsp. Numeralia vor oder nach dem DET, Numeralia nur im Lexikon markiert
* Gibt es kopflose Nominalphrasen? Zumindest bei Relativsätzen...
* Relativsätze sind nominalsiert, wie sieht hier die Nominalphrase aus?
* Wie findet man Relativsätze? (Hocank mit "internal head" oder auch nicht) N [ N (N) .. V Def]; kann man evtl. in der Übersetzung suchen?
* Wie findet man Satzstrukturen? (Hocank: V am Schluss als Hinweis; aber: Verbserien)
* Mit welchen Satzstrukturen sind Fragepronomen verknüpft?
* Wie findet man interne Struktur von Wörtern? z.B. (Verb)Klitika oder nicht? Reihenfolge der Pre- und Suffixe?
* Wie sucht man mit Toolbox und Elan (z.B. die Wortstrukturen)?
* Valenz von Verben? Wie kann man da (semi-)automatisch was machen?
* Satztypen: modal, Frage, Nebensätze (wie findet man Komplementsätze (vs. Adverbialsätze, Juxtapunktionen, Kosubordination)? "hearsay"-Marker, aber nicht nur bei Komplementsätzen)...; * Wortstellung? -> Suche nach Verben (also wieder nach Klassen von Wörtern); Suche nach Übersetzung (nach "weil" etc. für Adverbialsätze)
* Wie findet man pragmatisch markierte Sätze? -> Cleft-Strukturen in der engl./dt. Übersetzung
* Wie untersucht man Fokus? Wie ist der Fokus verteilt, welche Formen? Ohne Form: evtl. Übersetzung
* allg.: Distributionen: z.B. von Demonstrativpronomen, Verben mit Hilfsverben, Eigennamen (gibt's spezielle Kasusmarkierungen? Im Hocank gibts Markierung für Eigennamen; Wie verhalten die sich syntaktisch, wie und wann werden sie benutzt (S, A, O oder gar nicht)?)
* Welche morphophonologischen Verfahren mit welchen Morphemen/Worten? e -> a im Hocank unter Bedingungen
* Wie findet man Verbkonjugationen?