
6.1. A historical perspective

Pioneering activities in Transnational Language Resources and Evaluation include the four-language Speech Recognition evaluation campaigns organized by NATO back in 1981 and the opening of the DARPA-sponsored NIST Speech Recognition evaluation campaigns to the international community in 1992.

Several projects sponsored by the EC also addressed this area, such as SAM (Speech Assessment Methodology) (1987-1989), SQALE (Speech Quality Assessment for Language Engineering) (1993-1995) and EAGLES (European Advisory Group on Language Engineering Standards) (1993-1998).

Coordinating the activities in Language Resources with permanent entities started with the creation of the Coordinating Committee on Speech Databases and Speech Input/Output Systems Evaluation (Cocosda) in 1991, as a satellite event of the Eurospeech conference organized by the European Speech Communication Association (ESCA). The Oriental chapter of Cocosda launched an initiative to serve the needs of their languages (Oriental-Cocosda), which organizes a yearly conference. More recently, the written language community had a similar initiative (WRITE).

The Linguistic Data Consortium (LDC) was launched in 1992 in the US, with a grant from the Advanced Research Projects Agency (ARPA). It is now partly supported by grant IRI-9528587 from the Information and Intelligent Systems division of the National Science Foundation. Its creation required the approval of Congress, and the LDC has been hosted since then by the University of Pennsylvania.

Europe launched its European Language Resources Association (ELRA) in 1995, as a spin-off of the EC Relator project and with the support of the European Language and Speech Network (Elsnet) and of the EC Eurococosda project. ELRA immediately created its Evaluation and Language Resources Distribution Agency (ELDA) in Paris. ELRA then took the initiative in 1998 to organize a biennial Conference on Language Resources and Evaluation (LREC) and now has a strong involvement in the Language Resources and Evaluation Journal published by Springer.

Similarly, the Gengo-Shigen-Kyokai (GSK) (literally “Language Resources Association”) was launched in Japan in 2003. GSK aims at distributing LR as an international contribution to language and speech research and technology, not limiting its target to resources within Japan but expanding throughout the whole of Asia. GSK has a close relationship with JEITA (Japan Electronics and Information Technology Industries Association).

The situation remained stable for several years, organized around those central agencies, which are mostly fed with Language Resources by the DARPA programs (LDC) and by the EC Framework Programs (ELRA), with projects such as the SpeechDat series (1996-2003). Their activity extended from LR distribution to LR validation, LR production and Language Technology Evaluation. Although it is a European association, ELRA has non-European subscribers. LDC also has an international scope: it produces and distributes Language Resources in many languages, has members from many different countries, and provides many of the Language Resources used for the Evaluation campaigns organized by DARPA and NIST (the US National Institute of Standards and Technology).

More recently, new initiatives started in Europe in a more distributed way, with the launch in 2008 of the CLARIN (Common LANguage Resources and technology Infrastructure) project supported by the ESFRI program (European Strategy Forum on Research Infrastructure), with a budget of 4.1 M€ over 3 years (2008-2010). CLARIN aims at distributing LR (data and tools) for the needs of the Human and Social Science research community. CLARIN has about 180 members from 33 countries. It now aims at federating European countries and getting funding from them, with a targeted budget of 200 M€ over 12 years (2008-2020). There already exists a CLARIN-NL in The Netherlands, with a budget of 9 M€ over 5 years (2009-2014).

FLaReNet (Fostering Language Resources Network) was launched in 2008. It is a thematic Network supported by the e-Content program, aiming at promoting the production and distribution of Language Resources for all purposes, and especially for the needs of developing and evaluating Language Technologies. Its total budget is 900 K€ over 3 years (2008-2010). Apart from the coordinator, the partners only get funding for participating in meetings, workshops and conferences. FLaReNet has about 80 institutional members in 31 countries, and 330 individual members. It has also set up a network of 102 International Contact Points in 76 countries or regions all over the world.

META-NET (Multilingual Europe Technology Alliance Network) is a Network of Excellence funded by the EC FP7-ICT program. It has a budget of 6 M€ over 3 years (2010-2013). Its aim is to establish a network of all stakeholders involved in Machine Translation, and more generally in Multilingual Language Technologies. Part of its activity is the building of an infrastructure (META-SHARE) for the sharing of Language Resources (data, tools and services), mainly to serve the needs of Language Technologies, in both research and industry. The situation regarding Language Technologies has been analyzed through Language Reports and Language Matrixes. META-NET has 44 members in Europe.

In the area of HLT Evaluation, the main transnational activities have been conducted by NIST since 1992, and more recently by CLEF (Cross-Language Evaluation Forum) on Crosslingual Information Retrieval (2000-2009). But many smaller-scale evaluation campaigns have been organized all over the world, making it difficult to ensure the participation of a large enough community in each of them.

Interesting national initiatives which have a multilingual scope, given the linguistic diversity of the country, may also be reported: in India with the TDIL program (Technology Development for Indian Languages – 20 languages), in South Africa with the NHN program (National HLT Network – 11 languages), and more recently in Egypt with ALTEC (Arabic Language Technologies Center) for the Arabic language (see Annex 5). In Africa, the African HLT Association for the African languages was launched in Djibouti in January 2010[1], initiated by the World Network for Linguistic Diversity.

OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by developing consensus on best current practice for the digital archiving of language resources, and by developing a network of interoperating repositories and services for housing and accessing such resources.

In order to have a clear view of the various infrastructural initiatives for sharing Language Resources (Data, Tools, Services), a joint Cocosda/WRITE/FLaReNet satellite workshop entitled “Language Technologies issues for International cooperation” was organized at LREC 2010 on May 22. These initiatives have different levels of generality and different purposes, and the workshop tried to respond to the following questions: How do they compare? What are their respective roles, synergies, complementarities and overlaps? How can they profit from each other? What are the lessons and priorities? The workshop included contributions from LanguageGrid, CLARIN, Language Commons, LDC, ELRA, the National Centre for Text Mining in UK, the EC Panacea Platform, the NICT ALAGIN Platform, FLaReNet (LREC Map) and META-NET (META-SHARE).