Meeting 31.8.2020

  • Skype
  • 10:00–11:30am EET
  • Micha, Zhenya

General

  • general intro to IKDP and expected work by Zhenya
  • create user account in GitHub and add to LangDoc
  • 75% paid working time for the project
  • the remaining (unpaid) 25% of working time could be invested in the dissertation (working title "Conjunctions in language contact: Even, Komi, Kildin compared")

Agreed working plan for Zhenya

  • regular Skype meetings with Micha every second Friday, 10:00am EET, starting Sep 11 (perhaps we will later also have meetings with other team members)
  • spontaneous Skype meetings with Micha during the upcoming two weeks whenever necessary for configuring and learning GitHub
  • UPCOMING 1 WEEK from now
    • reads overview articles on the Komi language (sent to her by Micha/Rogier)
    • reads project publications in order to become familiar with our project work (sent to her by Micha)
  • UPCOMING 2 WEEKS from now
    • familiarizes herself with GitHub (under supervision by Micha)
  • UPCOMING 3 WEEKS from now
    • familiarizes herself with IKDP's data and workflows in GitHub, issue tracking, etc. (under supervision by Micha)
  • UPCOMING 6 WEEKS from now (task starting Sep 14)
    • Language tagging, see below (will need input by Niko)
  • UPCOMING 6 WEEKS from now (task starting Sep 14)
    • Pseudonymization checking, see below
  • UPCOMING 14 WEEKS from now (task starting Oct 12)
    • OCR validation, see below (will need input by Niko)
  • UPCOMING 6 MONTHS from now
    • grammar chapter on coordination and subordination
    • this draft (or parts of it) will also be relevant for dissertation
  • LATER, better explanation of our
    • grammaticographic work (together with Niko and Rogier)
    • FST/CG and Tromsø infrastructure (perhaps together with Sjur)

More detailed explanations for tasks

  • Language tagging (duration estimated 1 month)

    • Working with Russian language tags is also something Zhenya could do: go manually through some hours of recordings and add language tags, which Jauhiainen's system would then try to predict. This could take maybe a week of active work to get done to a large enough extent.
    • This task would probably work as follows: the suspected sentences are extracted into a table, Zhenya marks all wrongly tagged ones, and then a script inserts the correct tags back into the ELAN files. Basically every component is ready for doing it this way. Another approach is to go through the files in ELAN and tag everything, but this is probably too inefficient, as she would be reading Komi 90% of the time. It depends a bit on what kind of tagging we want: currently we mainly need actual Russian neatly separated, and more mixed processes are a different matter.
  • Pseudonymization checking

    • Part of the language tagging task could also be going through the Russian portions and checking that our pseudonymization system handles those sections adequately. No thorough checking of these parts has been done, which is one major reason for the delay of the Korp release.
  • Validate text collections (duration estimated 2 months)

    • There are several text collections with Izhma texts not currently incorporated into our collections, and one thing Zhenya could very well do is check some of the transcriptions. She can use the Transkribus platform, which works almost perfectly, so it is actually quite light work. What I have in mind is not only "proofreading" but a wider kind of preparatory work, which is still clearly defined.
    • In some cases I would even argue that proofreading as such is not always necessary, since we can also use the FST to check which words are not recognized (after automatic transliteration), and fixing those would often go really far already. In fact, when the accuracy of OCR is somewhere around 99.99%, the only real problems are parts where the page is somehow damaged or something has gone wrong in printing or digitization. Glancing through pages for those parts is also different from normal proofreading, mainly in being faster and less tedious.
    • There is also one part with handwritten Russian translations of Komi transcriptions, and she can of course read handwritten Russian easily, so for a few weeks that would be a really good focused task. Those texts are already translated into Russian, so let's grab them.
    • The first collection could be Erkki Itkonen's and Raisa Batalova's materials, which also contain dialects other than Izhma (which is not a problem).
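
The idea mentioned above, using the FST to find unrecognized words instead of proofreading every page, could be sketched roughly as follows. This is only a minimal illustration under assumptions: the function names, the page dictionary, and the toy word list are hypothetical, and `is_recognized` is a stand-in for whatever call the real FST analyser provides.

```python
import string
from collections import Counter
from typing import Callable, Dict, List


def unrecognized_words(
    pages: Dict[str, str],
    is_recognized: Callable[[str], bool],
) -> Dict[str, Counter]:
    """Collect, per page, the words the analyser does not recognize.

    `is_recognized` is a placeholder for a real FST lookup (assumption).
    Pages without any unrecognized word are left out of the report.
    """
    report: Dict[str, Counter] = {}
    for page_id, text in pages.items():
        misses: Counter = Counter()
        for token in text.split():
            # Strip surrounding ASCII punctuation and normalize case.
            word = token.strip(string.punctuation).lower()
            if word and not is_recognized(word):
                misses[word] += 1
        if misses:
            report[page_id] = misses
    return report


def pages_by_miss_count(report: Dict[str, Counter]) -> List[str]:
    """Order pages so that those with the most unrecognized words come first."""
    return sorted(report, key=lambda p: sum(report[p].values()), reverse=True)


# Toy data: a small word set stands in for the FST (assumption).
toy_lexicon = {"alpha", "beta"}
toy_pages = {
    "p1": "alpha, beta alpha.",
    "p2": "alpha b3ta bcta beta",
    "p3": "xx yy xx",
}

toy_report = unrecognized_words(toy_pages, toy_lexicon.__contains__)
print(pages_by_miss_count(toy_report))  # prints ['p3', 'p2']
```

Ranking pages by the number of unrecognized words would put damaged or badly printed pages at the top of the list, so that only those need a closer look.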