Parallel Corpora and Linguistic Analysis

Parallel Corpora and Linguistic Analysis (PCLA) is an initiative of an informal network of linguists working on parallel corpora who share the aims set out below and exchange knowledge, resources, and ideas on a regular basis. This webpage keeps track of their activities and of parallel corpus research more generally.

Parallel corpora are corpora that contain the same texts in different languages. The best-known example is Europarl [Koehn 2005], which compiles the multilingual proceedings of the European Parliament.
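
To make the definition concrete, here is a minimal sketch of how a sentence-aligned parallel corpus can be represented. It assumes a line-aligned layout in which line i of one language is the translation of line i of the other. The names AlignedPair and read_parallel are hypothetical, and the example sentences are invented for illustration and are not taken from Europarl.

```python
from typing import Iterable, List, NamedTuple


class AlignedPair(NamedTuple):
    """One translation unit: the same sentence in two languages (hypothetical type)."""
    source: str
    target: str


def read_parallel(source_lines: Iterable[str], target_lines: Iterable[str]) -> List[AlignedPair]:
    """Pair line i of one language with line i of the other.

    Assumes a line-aligned layout; raises ValueError if the two sides
    do not have the same number of lines.
    """
    source = [line.strip() for line in source_lines]
    target = [line.strip() for line in target_lines]
    if len(source) != len(target):
        raise ValueError("line-aligned texts must have the same number of lines")
    return [AlignedPair(s, t) for s, t in zip(source, target)]


# Invented example sentences (not from any real corpus).
english = ["The house is small.\n", "The book is new.\n"]
german = ["Das Haus ist klein.\n", "Das Buch ist neu.\n"]

pairs = read_parallel(english, german)
for pair in pairs:
    print(pair.source, "|", pair.target)
```

In practice the two arguments could just as well be open file handles, one per language. Real corpora often need an explicit alignment step before they reach this shape, which this sketch does not cover.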

Parallel corpora are central to translation studies, contrastive linguistics, and statistical machine translation. Typology and comparative linguistics have followed suit in recent years, and the number of parallel corpus compilation initiatives, small and large, keeps growing.

However, from the perspective of linguistic analysis, the question remains how to best exploit the potential of parallel corpora in a methodologically sound way. What we need for this is a joint effort to extend methodological insights from other branches of corpus linguistics to parallel corpus linguistics, to explore new avenues of analysis that do justice to the parallel nature of the data, and – above all – to probe the limits of parallel corpus data by investing in their analysis and in the replication of findings.