|
207 | 207 | <div xml:id="schema">
|
208 | 208 | <head>The Parla-CLARIN Schema</head>
|
209 | 209 | <p>Parla-CLARIN is written as a TEI ODD document, consisting of the prose guidelines and
|
210 |
| - the schema specification, on the basis of which it is possible, using the standard TEI |
211 |
| - XSLT stylesheets, to derive an XML schema expressed either as a RelaxNG schema, a DTD, |
212 |
| - or a W3C schema, which is then used for formal validations of a Parla-CLARIN |
213 |
| - parliamentary corpus.</p> |
| 210 | + the schema specification, on the basis of which it is possible, using the <ptr |
| 211 | + type="software" xml:id="R5" target="#teistylesheets"/><rs type="soft.name" ref="#R5" |
| 212 | + >standard TEI XSLT stylesheets</rs>, to derive an XML schema expressed either as a |
| 213 | + RelaxNG schema, a DTD, or a W3C schema, which is then used for formal validations of a |
| 214 | + Parla-CLARIN parliamentary corpus.</p> |
214 | 215 | <p>While the proposal tries to cater for many encoding needs, it is possible that new
|
215 | 216 | users will have to use TEI elements or attributes that are not discussed in the prose
|
216 | 217 | guidelines. Since the recommendations are still under development, the formal schema
|
|
324 | 325 | <div xml:id="presentation">
|
325 | 326 | <head>Presentation of Parla-CLARIN</head>
|
326 | 327 | <p>Like the TEI Guidelines, the Parla-CLARIN recommendations are available on <ref
|
327 |
| - target="https://github.com/clarin-eric/parla-clarin/"><ptr type="software" |
328 |
| - xml:id="GitHub" target="#GitHub"/><rs type="soft.name" ref="#GitHub" |
329 |
| - >GitHub</rs></ref>, as a project<note>Tomaž Erjavec and Andrej Pančur, Parla-CLARIN |
330 |
| - project <ptr type="software" xml:id="GitHub" target="#GitHub"/><rs type="soft.name" |
331 |
| - ref="#GitHub">GitHub</rs> site, last updated March 17, 2021, <ptr |
332 |
| - target="https://github.com/clarin-eric/parla-clarin/"/>.</note> of the CLARIN ERIC |
333 |
| - collection. The project contains a folder for the schema (i.e., the Parla-CLARIN ODD |
334 |
| - document and XML schemas derived from it), a folder for the programs that convert the |
335 |
| - ODD into the XML schemas and to the HTML of the prose and schema definitions, and a |
336 |
| - folder for examples, which contains an artificial but fully worked out example of a |
337 |
| - Parla-CLARIN document and subfolders with various example resources, where each should |
338 |
| - contain: <list rend="ordered"> |
| 328 | + target="https://github.com/clarin-eric/parla-clarin/"><ptr type="software" xml:id="R1" |
| 329 | + target="#GitHub"/><rs type="soft.name" ref="#R1">GitHub</rs></ref>, as a |
| 330 | + project<note>Tomaž Erjavec and Andrej Pančur, Parla-CLARIN project <ptr |
| 331 | + type="software" xml:id="R2" target="#GitHub"/><rs type="soft.name" ref="#R2" |
| 332 | + >GitHub</rs> site, last updated March 17, 2021, <ptr type="software" xml:id="R9" |
| 333 | + target="#parlaclarinscripts"/><rs type="soft.url" ref="#R9"><ptr |
| 334 | + target="https://github.com/clarin-eric/parla-clarin/"/></rs>.</note> of the CLARIN |
| 335 | + ERIC collection. The project contains a folder for the schema (i.e., the Parla-CLARIN |
| 336 | + ODD document and XML schemas derived from it), a folder for the <rs type="soft.name" |
| 337 | + ref="#R9">programs that convert the ODD into the XML schemas and to the HTML of the |
| 338 | + prose and schema definitions</rs>, and a folder for examples, which contains an |
| 339 | + artificial but fully worked out example of a Parla-CLARIN document and subfolders with |
| 340 | + various example resources, where each should contain: <list rend="ordered"> |
339 | 341 | <item>a sample of a corpus in its source encoding;</item>
|
340 |
| - <item>XSLT script to convert it into Parla-CLARIN; and</item> |
| 342 | + <item><rs type="soft.name" ref="#R9">XSLT script to convert it into Parla-CLARIN</rs>; |
| 343 | + and</item> |
341 | 344 | <item>the output of the conversion.</item>
|
342 | 345 | </list>
|
343 | 346 | </p>
|
|
495 | 498 | <p>Nevertheless, AKN is an important schema for modeling parliamentary proceedings,
|
496 | 499 | especially as the primary encoding standard used by various legislative bodies, so some
|
497 | 500 | of AKN’s solutions were used in developing the Parla-CLARIN proposal, in particular the
|
498 |
| - typology of divisions of a document. Also developed was a partial, but non-trivial, |
499 |
| - conversion from AKN to Parla-CLARIN, which covers several AKN example documents. As |
500 |
| - mentioned in <ptr type="crossref" target="#presentation"/>, the example documents and |
501 |
| - conversion script can be found in the <ident>Examples</ident> folder of the Parla-CLARIN |
502 |
| - Git repository. The <ident>akn2tei.xsl</ident> script attempts to preserve the IDs of |
503 |
| - the source AKN document, converts the AKN addressee, role, and questions and answers to |
| 501 | + typology of divisions of a document. Also developed was a partial, but non-trivial, <ptr |
| 502 | + type="software" xml:id="R10" target="#parlaclarinscripts"/><rs type="soft.name" |
| 503 | + ref="#R10">conversion from AKN to Parla-CLARIN</rs>, which covers several AKN example |
| 504 | + documents. As mentioned in <ptr type="crossref" target="#presentation"/>, the example |
| 505 | + documents and conversion script can be found in the <ident>Examples</ident> folder of |
| 506 | + the Parla-CLARIN Git repository. The <ptr type="software" xml:id="R11" |
| 507 | + target="#parlaclarinscripts"/><rs type="soft.name" ref="#R11" |
| 508 | + ><ident>akn2tei.xsl</ident></rs> script attempts to preserve the IDs of the source |
| 509 | + AKN document, converts the AKN addressee, role, and questions and answers to |
504 | 510 | Parla-CLARIN, and maps FRBR data (which distinguishes a <soCalled>work</soCalled> from
|
505 | 511 | its <soCalled>expression</soCalled> and its expression from its
|
506 | 512 | <soCalled>manifestation</soCalled>) to the appropriate TEI elements and attributes.
|
|
572 | 578 | parliamentary proceedings meant for scholarly investigations. This scheme is currently a
|
573 | 579 | straightforward customization of the TEI Guidelines, with the majority of the effort
|
574 | 580 | having gone into the writing of the prose guidelines of the Parla-CLARIN recommendations
|
575 |
| - and into developing the conversion from Akoma Ntoso to Parla-CLARIN. We have not included |
576 |
| - examples of the encoding, as these are readily available on the <ptr type="software" |
577 |
| - xml:id="GitHub" target="#GitHub"/><rs type="soft.name" ref="#GitHub">GitHub</rs> |
| 581 | + and into developing the <ptr type="software" xml:id="R12" target="#parlaclarinscripts" |
| 582 | + /><rs type="soft.name" ref="#R12">conversion from Akoma Ntoso to Parla-CLARIN</rs>. We |
| 583 | + have not included examples of the encoding, as these are readily available on the <ptr |
| 584 | + type="software" xml:id="R3" target="#GitHub"/><rs type="soft.name" ref="#R3">GitHub</rs> |
578 | 585 | documentation page of the project, and large Parla-CLARIN encoded corpora are openly
|
579 | 586 | available.</p>
|
580 | 587 | <p>Apart from the siParl 2.0 corpus mentioned above (<ptr type="crossref"
|
|
601 | 608 | <p>As we wanted to have corpora that are not only interchangeable but interoperable as well,
|
602 | 609 | we created a bespoke ParlaMint XML schema directly in RelaxNG – the schema is compatible
|
603 | 610 | with Parla-CLARIN as it validates a subset of documents that would be validated against
|
604 |
| - Parla-CLARIN. We produced common scripts that can convert any of the four corpora to plain |
605 |
| - text, to CoNLL-U format as used by the Universal Dependencies project, and to vertical |
606 |
| - format as used by the <ref target="http://cwb.sourceforge.net/">CWB</ref><note>The IMS |
607 |
| - Open Corpus Workbench (CWB), last modified March 30, 2021, <ptr |
608 |
| - target="http://cwb.sourceforge.net/"/>.</note> and <ref |
609 |
| - target="http://www.sketchengine.eu/">Sketch Engine</ref><note>Accessed January 13, 2022, |
610 |
| - <ptr target="http://www.sketchengine.eu/"/>.</note> (<ref type="bibl" |
611 |
| - target="#kilgarriff14">Kilgarriff et al. 2014</ref>) concordancers, as well as to |
612 |
| - extract complete speech metadata into TSV files.</p> |
| 611 | + Parla-CLARIN. We produced <ptr type="software" xml:id="R13" target="#parlaclarinscripts" |
| 612 | + /><rs type="soft.url" ref="#R13">common scripts that can convert any of the four corpora |
| 613 | + to plain text, to CoNLL-U format as used by the Universal Dependencies project, and to |
| 614 | + vertical format as used by the <ptr type="software" xml:id="R14" target="#cwb"/><rs |
| 615 | + type="soft.url" ref="#R14"><ref target="http://cwb.sourceforge.net/" |
| 616 | + >CWB</ref></rs></rs><note>The <rs type="soft.name" ref="#R14">IMS Open Corpus Workbench |
| 617 | + (CWB)</rs>, last modified March 30, 2021, <rs type="soft.url" ref="#R14"><ptr |
| 618 | + target="http://cwb.sourceforge.net/"/></rs>.</note> and <ptr type="software" |
| 619 | + xml:id="R15" target="#sketchengine"/><rs type="soft.url" ref="#R15"><ref |
| 620 | + target="http://www.sketchengine.eu/"><rs type="soft.name" ref="#R15">Sketch |
| 621 | + Engine</rs></ref></rs><note>Accessed January 13, 2022, <rs type="soft.url" |
| 622 | + ref="#R15"><ptr target="http://www.sketchengine.eu/"/></rs>.</note> (<rs |
| 623 | + type="soft.bib.ref" ref="#R15"><ref type="bibl" target="#kilgarriff14">Kilgarriff et al. |
| 624 | + 2014</ref></rs>) concordancers, as well as to extract complete speech metadata into |
| 625 | + TSV files.</p> |
613 | 626 | <p>In order for Parla-CLARIN to achieve its goal of becoming a widely recognized encoding
|
614 | 627 | format for corpora of parliamentary proceedings, significant work remains to be done. On
|
615 | 628 | the basis of the lessons learned in creating ParlaMint, we plan to revise the prose
|
|
619 | 632 | specification from the default ones in the TEI Guidelines to ones taken or adapted from
|
620 | 633 | the collected parliamentary corpora.</p>
|
621 | 634 | <p>Second, as we have already done for ParlaMint, we plan to add to the <ptr type="software"
|
622 |
| - xml:id="GitHub" target="#GitHub"/><rs type="soft.name" ref="#GitHub">GitHub</rs> |
623 |
| - Parla-CLARIN project more down-conversion scripts with which we would increase the |
624 |
| - usability of the Parla-CLARIN corpora. As mentioned, work also needs to be done to develop |
625 |
| - a conversion to RDF.</p> |
| 635 | + xml:id="R4" target="#GitHub"/><rs type="soft.name" ref="#R4">GitHub</rs> Parla-CLARIN |
| 636 | + project more down-conversion scripts with which we would increase the usability of the |
| 637 | + Parla-CLARIN corpora. As mentioned, work also needs to be done to develop a conversion to |
| 638 | + RDF.</p> |
626 | 639 | <p>Last, but not least, one of the great benefits of Git is the ability to support
|
627 | 640 | collaborative work, be it through posting issues, or through using pull requests to
|
628 | 641 | incorporate changes. While the community has not so far made use of these options, we hope
|
|
790 | 803 | <bibl xml:id="kilgarriff14"><author>Kilgarriff, Adam</author>, <author>Vít Baisa</author>,
|
791 | 804 | <author>Jan Bušta</author>, <author>Miloš Jakubíček</author>, <author>Vojtěch
|
792 | 805 | Kovář</author>, <author>Jan Michelfeit</author>, <author>Pavel Rychlý</author>, and
|
793 |
| - <author>Vít Suchomel</author>. <date>2014</date>. <title level="a">The Sketch Engine: |
794 |
| - Ten Years On.</title> |
| 806 | + <author>Vít Suchomel</author>. <rs type="soft.bib.ref" ref="ewfew"><date>2014</date>. |
| 807 | + <title level="a">The Sketch Engine: Ten Years On.</title></rs> |
795 | 808 | <title level="j">Lexicography: Journal of ASIALEX</title>
|
796 | 809 | <biblScope unit="volume">1</biblScope> (<biblScope unit="issue">1</biblScope>):
|
797 | 810 | <biblScope unit="page">7–36</biblScope>. doi:<idno type="DOI"
|
|