WeScience

Background

The WeScience initiative is an ongoing effort to provide resources that enable eScience research and development in our own field, i.e. Computational Linguistics (or Natural Language Processing). Some of the motivating ideas and goals are sketched by [http://www.delph-in.net/wescience/tlt09.pdf Ytrestøl, Flickinger, & Oepen (2009)]. WeScience aims to help improve the accessibility of scholarly literature and digital libraries, with a special emphasis on community or open access resources. Current development is focused on semantic parsing of encyclopedic articles (from the on-line community resource [http://en.wikipedia.org Wikipedia]), with the long-term goal of relating natural language semantics and taxonomic knowledge, for example in relation extraction or ontology learning applications. As a complementary element, we plan to include a selection of scientific articles (from the [http://aclweb.org/anthology-new/ ACL Anthology]); candidate applications include function and attitude analysis for citations, attribution tracking, indexing by complex content properties (for example specific sub-fields, hypotheses, or methods used), association with encyclopedia entries (or ontology nodes), and so-called 'semantic search'.

WeScience, in its early stages of 2008 and 2009, is a semi-formal collaboration between the [http://www.ifi.uio.no/research/groups/lns/lt.html University of Oslo], the [http://lingo.stanford.edu/ Center for the Study of Language and Information], and [http://www.coli.uni-saarland.de Saarland University], with partial funding from the University of Oslo, the [http://www.ub.uit.no/wiki/openaccess/index.php/NORA Norwegian Open Research Archives], and the [http://www.notur.no Norwegian Metacenter for Computational Science].

Current State of Development

WeScience, at least as of early 2009, comprises two components: the WeScience Corpus (discussed in more detail by Ytrestøl et al. (2009)) and the WeScience Treebank. The corpus comprises a selection of [http://en.wikipedia.org Wikipedia] articles in the domain of Natural Language Processing, pre-processed to strip irrelevant markup and segmented into sentence-like units. WeScience defines a simple, line-oriented textual exchange format for the corpus, aiming to strike a good balance between computer and human readability (there are also formal considerations that make the use of XML infeasible). Each sentence-like unit has a unique 8-digit identifier, whose first four digits reference the underlying article. The corpus is broken into 16 sections of at most 1000 segments each, and no article is split across sections. Sections 14 through 16 are reserved for evaluation purposes.
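The identifier scheme described above can be illustrated with a minimal Python sketch. It only relies on what is stated here: identifiers have eight digits, and the first four reference the article. The names SegmentId, parse_segment_id, and same_article are hypothetical helpers and not part of any WeScience tool, the sample identifiers in the usage note are made up, and the meaning of the last four digits is deliberately left uninterpreted.

```python
from typing import NamedTuple


class SegmentId(NamedTuple):
    """Hypothetical container for the two halves of a WeScience identifier."""
    article: str  # first four digits: reference to the underlying article
    rest: str     # remaining four digits; their meaning is not assumed here


def parse_segment_id(identifier: str) -> SegmentId:
    """Split an 8-digit identifier into its article prefix and remainder."""
    # Reject anything that is not exactly eight ASCII digits.
    if len(identifier) != 8 or not (identifier.isascii() and identifier.isdigit()):
        raise ValueError(f"expected exactly 8 digits, got {identifier!r}")
    return SegmentId(article=identifier[:4], rest=identifier[4:])


def same_article(first: str, second: str) -> bool:
    """Return True if two identifiers share the same four-digit article prefix."""
    return parse_segment_id(first).article == parse_segment_id(second).article
```

For instance, same_article("10010020", "10010450") holds because both (made-up) identifiers start with 1001, which is the kind of check needed to verify that no article is split across sections.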

Development of the WeScience Treebank builds on the LinGO [http://www.delph-in.net/erg English Resource Grammar] (ERG) and the [http://www.delph-in.net/redwoods Redwoods] discriminant-based treebanking approach. The [http://svn.delph-in.net/erg/tags/0902 February 2009] release of the ERG includes a subset of the WeScience Corpus in treebanked form. An enlarged release of the treebank will be part of the forthcoming July 2009 release of the grammar.

Obtaining the Corpus and Treebank

As of early 2009, the WeScience Corpus has been released in three versions. Revisions 0.1 and 0.2 were purely internal releases and are now superseded by the present release, revision 0.3, which is publicly and freely available in a variety of formats. The recommended method of obtaining the WeScience Corpus is through the Subversion (SVN) revision management system. A command like:

  svn co http://svn.emmtee.net/trunk/uio/wescience wescience

will retrieve the latest development version (i.e. revision 0.3, as of early 2009) and create a new subdirectory wescience/. This directory will contain both the raw, unprocessed [http://en.wikipedia.org Wikipedia] articles (in the raw/ sub-directory) and the actual WeScience Corpus, in the format described above (in the txt/ sub-directory). For those without a functional SVN client (some Windows users, for example), this data is also available as a compressed Unix tar(1) [http://www.delph-in.net/wescience/corpus.0.2.tgz archive].

The WeScience Corpus is available as so-called itsdb skeletons too, i.e. the result of importing the text files (the pre-processed ones, obviously) into the itsdb database. These skeletons have been part of the itsdb distribution through the LOGON tree (see the LogonTop page) since late 2008. The WeScience skeletons are called ws01 through ws16, and these same names are used in organizing the WeScience Treebank.
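As a rough illustration of the naming scheme, the following Python sketch generates the skeleton names ws01 through ws16 and separates out the sections reserved for evaluation. It assumes that skeleton wsNN corresponds to corpus section NN, which the naming suggests but which is not stated explicitly; skeleton_name and split_skeletons are hypothetical helper names, not part of itsdb.

```python
SECTION_COUNT = 16                  # the corpus is broken into 16 sections
EVALUATION_SECTIONS = range(14, 17)  # sections 14 through 16 are held out


def skeleton_name(section: int) -> str:
    """Return the skeleton name for a section number, e.g. 1 -> 'ws01'."""
    if not 1 <= section <= SECTION_COUNT:
        raise ValueError(f"section must be in 1..{SECTION_COUNT}, got {section}")
    # Assumption: two-digit, zero-padded section number after the 'ws' prefix.
    return f"ws{section:02d}"


def split_skeletons() -> tuple[list[str], list[str]]:
    """Return (development, evaluation) lists of skeleton names."""
    development = [
        skeleton_name(n)
        for n in range(1, SECTION_COUNT + 1)
        if n not in EVALUATION_SECTIONS
    ]
    evaluation = [skeleton_name(n) for n in EVALUATION_SECTIONS]
    return development, evaluation
```

Under these assumptions, split_skeletons() yields thirteen development skeletons (ws01 through ws13) and three evaluation skeletons (ws14 through ws16).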

Regarding availability of the first release of the WeScience Treebank, please watch this space (or the [http://www.delph-in.net DELPH-IN] [http://lists.delph-in.net mailing lists]) for an imminent announcement.

Outlook: Next Steps

Acknowledgements
