Linguistic Linked Open Data (LLOD) is a technology and a movement in
several disciplines working with language resources, including Natural
Language Processing, general linguistics, computational lexicography
and the localization industry. This talk describes the basic principles of
Linguistic Linked Open Data and their application to linguistically
annotated corpora, summarizes the current status of the Linguistic
Linked Open Data cloud, and gives an overview of selected LLOD
vocabularies and their uses.
A resource constitutes Linguistic Linked Open Data if it is published
in accordance with the following principles:
1. The dataset is relevant for linguistic research or NLP algorithms.
2. The elements in the dataset should be uniquely identified by means
of a URI.
3. The URI should resolve, so users can access more information using
web browsers.
4. Resolving an LLOD resource should return results using web
standards such as Resource Description Framework (RDF).
5. Links to other resources should be included to help users discover
new resources and provide semantics.
6. Data should be openly licensed using licenses such as the Creative
Commons licenses.
Criterion (1) defines linguistic (or linguistically relevant) data,
criteria (2-5) define linked data, and criterion (6) defines open
data; their combination thus yields Linguistic Linked Open Data.
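As a rough illustration of the principles above, the following minimal Python sketch models a tiny dataset as RDF-style triples and serializes it in the N-Triples syntax. Everything under example.org and other.example.net is a hypothetical placeholder, and the helper names (to_ntriples, external_links) are invented for this sketch; only the RDFS, OWL, Dublin Core and Creative Commons URIs refer to existing vocabularies.

```python
# Hypothetical dataset namespace (assumption): example.org hosts no real data.
EX = "http://example.org/lexicon/"

# Predicates taken from existing, widely used vocabularies.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DCT_LICENSE = "http://purl.org/dc/terms/license"

# Each triple is (subject URI, predicate URI, object). Objects are either
# URIs (plain str) or literals, modelled here as ("literal", text, language).
triples = [
    # Elements are identified by URIs and described with a web standard (RDF).
    (EX + "entry/dog", RDFS_LABEL, ("literal", "dog", "en")),
    # A link to another (hypothetical) resource supports discovery.
    (EX + "entry/dog", OWL_SAME_AS, "http://other.example.net/dict/dog"),
    # An open license attached to the dataset as a whole.
    (EX, DCT_LICENSE, "https://creativecommons.org/licenses/by/4.0/"),
]


def to_ntriples(triples):
    """Serialize the triples as N-Triples, one statement per line."""
    lines = []
    for s, p, o in triples:
        if isinstance(o, tuple):
            _, text, lang = o
            escaped = text.replace("\\", "\\\\").replace('"', '\\"')
            obj = '"%s"@%s' % (escaped, lang)
        else:
            obj = "<%s>" % o
        lines.append("<%s> <%s> %s ." % (s, p, obj))
    return "\n".join(lines)


def external_links(triples, base):
    """Return object URIs that point outside the dataset's own namespace."""
    return [
        o
        for s, p, o in triples
        if isinstance(o, str) and s.startswith(base) and not o.startswith(base)
    ]


print(to_ntriples(triples))
```

Whether the URIs actually resolve cannot be shown in a self-contained snippet; that part depends on how the data is hosted.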
The primary benefits of LLOD have been identified as:
- Representation: Linked graphs are a more flexible representation
format for linguistic data.
- Interoperability: Common RDF models can easily be integrated.
- Federation: Data from multiple sources can trivially be combined.
- Ecosystem: Tools for RDF and linked data are widely available under
open source licenses.
- Expressivity: Existing vocabularies help express linguistic resources.
- Semantics: Links to shared vocabularies make the intended meaning
explicit.
- Dynamicity: Web data can be continuously improved.
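The federation benefit can be sketched in a few lines of Python, under the simplifying assumption that each source is just a set of triples: because both sources use the same URI for the same entity, combining them amounts to a set union. All URIs and the helper names (objects, glosses_for_token) are hypothetical and serve only as illustration.

```python
# Hypothetical predicates and resources (assumptions for illustration).
LEMMA = "http://example.org/vocab/lemma"
GLOSS = "http://example.org/vocab/gloss"
TOKEN = "http://example.org/corpus/tok1"
ENTRY = "http://example.org/lexicon/dog"

# Source 1: a corpus that links a token to a lexical entry by URI.
corpus = {(TOKEN, LEMMA, ENTRY)}

# Source 2: an independently published lexicon describing that entry.
lexicon = {(ENTRY, GLOSS, "domesticated canine")}

# Federation: shared URIs make the union of both graphs meaningful.
merged = corpus | lexicon


def objects(graph, subject, predicate):
    """Return all objects of triples matching subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def glosses_for_token(graph, token):
    """Follow token -> lexical entry -> gloss across the combined graph."""
    result = set()
    for entry in objects(graph, token, LEMMA):
        result |= objects(graph, entry, GLOSS)
    return result


print(glosses_for_token(merged, TOKEN))
```

Queried on its own, the corpus graph yields no gloss; the answer only becomes available once both sources are combined.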
I specifically focus on linguistically annotated corpora and discuss
the potential of Linked Data in relation to four standing problems in
the field:
- representing highly interlinked corpora (e.g., multi-layer
corpora, annotated parallel corpora),
- integrating corpora with lexical resources available from the web
of data,
- facilitating annotation interoperability using terminology
resources available from the web of data, and
- streamlining data manipulation processes in a modular and
domain-independent fashion.
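To make the annotation interoperability problem more concrete, here is a minimal Python sketch, not a description of any particular LLOD vocabulary. It assumes two corpora annotated with different part-of-speech tagsets (Penn Treebank and Universal Dependencies tags) whose tags are mapped to concept URIs in a hypothetical terminology repository under example.org; the names TAG_TO_CONCEPT and tokens_with_concept are invented for this example.

```python
# Hypothetical terminology repository (assumption for illustration).
TERMS = "http://example.org/terms/"

# Mapping from (tagset, tag) pairs to shared concept URIs.
TAG_TO_CONCEPT = {
    ("penn", "NN"): TERMS + "Noun",   # Penn Treebank: noun, singular or mass
    ("penn", "VBZ"): TERMS + "Verb",  # Penn Treebank: verb, 3rd person singular present
    ("ud", "NOUN"): TERMS + "Noun",   # Universal Dependencies: noun
    ("ud", "VERB"): TERMS + "Verb",   # Universal Dependencies: verb
}

# Annotations from two hypothetical corpora: (token URI, tagset, tag).
annotations = [
    ("http://example.org/corpusA/tok1", "penn", "NN"),
    ("http://example.org/corpusA/tok2", "penn", "VBZ"),
    ("http://example.org/corpusB/tok1", "ud", "NOUN"),
]


def tokens_with_concept(annotations, concept):
    """Find tokens whose tag maps to the given concept, whatever the tagset."""
    return sorted(
        token
        for token, tagset, tag in annotations
        if TAG_TO_CONCEPT.get((tagset, tag)) == concept
    )


print(tokens_with_concept(annotations, TERMS + "Noun"))
```

A single query against the shared concept retrieves nouns from both corpora without the user needing to know either tagset.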
These aspects will be discussed in relation to two selected resources
from both general linguistics and Natural Language
Processing. Finally, the talk will discuss some of the challenges that
LLOD is still facing in both areas.