Publications
SmartBiC: Smart Harvesting of Bilingual Corpora from the Internet
2024
SmartBiC, an 18-month innovation project funded by the Spanish Government, aims at improving the full process of collecting, filtering and selecting in-domain parallel content to be used for machine translation and language model...
HPLT's First Release of Data and Models
2024
We describe the first results of the High Performance Language Technologies project (HPLT), a 3-year EU-funded project that started in September 2022. The first data release includes 75 monolingual datasets and 18 parallel dataset...
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from...
Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. Howeve...
Language identification is a crucial component in the automated production of language resources, particularly in multilingual and big data contexts. However, commonly used language identifiers struggle to differentiate between si...
Developing high quality machine translation systems is a labour intensive, challenging and confusing process for newcomers to the field. We present a pair of tools OpusCleaner and OpusTrainer that aim to simplify the process, redu...
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
We describe the High Performance Language Technologies project (HPLT), a 3-year EU-funded project started in September 2022. HPLT will build a space combining petabytes of natural language data with large-scale model training. It...
Proceedings of the 1st Workshop on Open Community-Driven Machine Translation
Proceedings of the 1st Workshop on Open Community-Driven Machine Translation, June 15, 2023, Tampere, Finland. Edited by Miquel Esplà-Gomis (Universi...
Proceedings of the 1st Workshop on Open Community-Driven Machine Translation
We present MutNMT, an open-source web application for educational purposes to introduce non-experts to NMT. The tool, developed within the MultiTraiNMT project along with other training materials (a book and activities), gath...
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper describes the experiments carried out during the development of the latest version of Bicleaner, named Bicleaner AI, a tool that aims at detecting noisy sentences in parallel corpora. The tool, which now implements a ne...
23rd Annual Conference of the European Association for Machine Translation, EAMT 2022
We introduce the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel...
Jožef Stefan Institute
Quality assessment has been an ongoing activity of the series of ParaCrawl efforts to crawl massive amounts of parallel data from multilingual websites for 29 languages. The goal of ParaCrawl is to get parallel data that is good f...
Machine translation for everyone: Empowering users in the age of artificial intelligence
This chapter gives an overview of the theoretical and practical implications of customizing machine translation (MT) to make it fit for a particular purpose. The chapter is written for readers who have just a basic knowledge of MT...
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
The MultiTraiNMT Erasmus+ project has developed an open innovative syllabus in machine translation, focusing on neural machine translation (NMT) and targeting both language learners and translators. The training materials include...
TRITON 2021 (Translation and Interpreting Technology Online)
The aim of the MultiTraiNMT Erasmus+ project is to develop an open innovative syllabus in neural machine translation (NMT) for language learners and translators as multilingual citizens. Machine translation is seen as a resource t...
Proceedings of the Fifth Conference on Machine Translation
This paper describes the joint submission of Universitat d’Alacant and Prompsit Language Engineering to the WMT 2020 shared task on parallel corpus filtering. Our submission, based on the free/open-source tool Bicleaner, enhances...
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (2020)
This paper shows the utility of two open-source tools designed for parallel data cleaning: Bifixer and Bicleaner. Already used to clean highly noisy parallel content from crawled multilingual websites, we evaluate their performanc...
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks
We describe two projects funded by the Connecting Europe Facility, Provision of Web-Scale Parallel Corpora for Official European Languages (2016-EU-IA-0114, completed) and Broader Web-Scale Provision of Parallel Corpora for Europe...
Universitat Politècnica de València
We understand paraphrasing as the act of rewriting a text with different words while preserving its meaning. Paraphrasing has many applications, such as rewriting words while a text is being written, providing transla...
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks
This paper reports the results of an in-depth evaluation of 34 state-of-the-art domain-adapted machine translation (MT) systems that were built by four leading MT companies as part of the EU-funded iADAATPA project. These systems s...
Proceedings of the third conference on machine translation: shared task papers
This paper describes Prompsit Language Engineering’s submissions to the WMT 2018 parallel corpus filtering shared task. Our four submissions were based on an automatic classifier for identifying pairs of sentences that are mutual...
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation
This paper describes Apertium: a free/open-source machine translation platform (engine, toolbox and data), its history, its philosophy of design, its technology, the community of developers, the research and business based on it,...
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)
How does AltLang work? The basics: it automatically and quickly replaces differences between two variants of the same language, which suits dynamic content; it performs only controlled changes, so risks are low or absent; it is highly customisable...
MTradumàtica: Free Statistical Machine Translation Customisation for Translators
2017
Annual Conference of the European Association for Machine Translation
MTradumàtica is a free, Moses-based web platform for training and using statistical machine translation systems with a user-friendly graphical interface. Its goal is to offer translators a free tool to customise their own statisti...
Proceedings of the 19th Annual Conference of the European Association for Machine Translation
This paper describes the development and current state of a bidirectional Croatian-Serbian machine translation system based on the open-source Apertium platform. It has been created inside the Abu-MaTran project with the aims of c...
Re-assessing the Impact of SMT Techniques with Human Evaluation: a Case Study on English—Croatian
2016
Proceedings of the 19th Annual Conference of the European Association for Machine Translation
We re-assess the impact brought by a set of widely-used SMT models and techniques by means of human evaluation. These include different types of development sets (crowdsourced vs translated professionally), reordering, operation s...
Baltic Journal of Modern Computing
We present the current status of Abu-MaTran (http://www.abumatran.eu), a 4-year project (January 2013-December 2016) on rapid development of machine translation for under-resourced languages. It is funded under Marie Curie's Indu...
EAMT (Projects/Products)
AltLang is a rule-based automatic converter for language varieties. It deals with differences in spelling, lexicon and local grammar along with numeric, style and punctuation conventions. It is available for varieties of English,...
Proceedings of the 18th Annual Conference of the European Association for Machine Translation
It has been a huge honour for me to serve as president of the European Association for Machine Translation (EAMT) over the past six years. As I step down from office, I am delighted that the last EAMT annual conference under my pr...
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data fr...
Cloudlm: a cloud-based language model for machine translation
2016
Prague Bulletin of Mathematical Linguistics
Language models (LMs) are an essential element in statistical approaches to natural language processing for tasks such as speech recognition and machine translation (MT). The advent of big data leads to the availability of massive...
Proceedings of the Tenth Workshop on Statistical Machine Translation
This paper presents the machine translation systems submitted by the Abu-MaTran project for the Finnish–English language pair at the WMT 2015 translation task. We tackle the lack of resources and complex morphology of the Finnish...
EAMT 2015
2015
Antalya, Turkey
This paper presents the work done to port a deep-transfer rule-based machine translation system to translate from a different source language by maximizing the exploitation of existing resources and by limiting the development wor...
Proceedings of the 17th Annual conference of the European Association for Machine Translation
We present an extrinsic evaluation of crawlers of parallel corpora from multilingual web sites in machine translation (MT). Our case study is on Croatian to English translation in the tourism domain. Given two crawlers, we build p...
Proceedings of the ninth workshop on statistical machine translation
This paper presents the machine translation systems submitted by the Abu-MaTran project to the WMT 2014 translation task. The language pair concerned is English–French with a focus on French as the target language. The French to E...
European Language Resources Association (ELRA)
In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific...
Procesamiento del Lenguaje Natural
OLE aims to establish a linguistic Olympiad in Spain. We introduce the Linguistic Olympiads, our rationale and objectives for setting up OLE, as well as our implementation plan. We foresee our work to be useful for other countries...
Raphael Rubino, Antonio Toral, Nikola Ljubešić, Gema Ramírez-Sánchez
This paper presents a novel approach for parallel data generation using machine translation and quality estimation. Our study focuses on pivot-based machine translation from English to Croatian through Slovene. We generate an Engl...
arXiv preprint arXiv:1303.0446
The classification of opinion texts into positive and negative is becoming a subject of great interest in sentiment analysis. The existence of many labeled opinions motivates the use of statistical and machine-learning methods. Firs...
Automatic acquisition of machine translation resources in the Abu-MaTran project
2013
Procesamiento del Lenguaje Natural, Sociedad Española para el Procesamiento del Lenguaje Natural
This paper provides an overview of the research and development activities carried out to alleviate the language resources’ bottleneck in machine translation within the Abu-MaTran project. We have developed a range of tools for th...
Springer Berlin Heidelberg
The Universitat Oberta de Catalunya (Open University of Catalonia, UOC) is a public university based in Barcelona. The UOC is characterised by three main factors: (a) it is a virtual university based on an e-learning model, (b) i...
Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis
The classification of opinion texts into positive and negative can be tackled by evaluating separate keywords, but this is a very limited approach. We propose an approach based on the order of the words without using any syntactic a...
Machine translation - Springer Netherlands
Apertium is a free/open-source platform for rule-based machine translation. It is being widely used to build machine translation systems for a variety of language pairs, especially in those cases (mainly with related-language pair...
Using Apertium linguistic data for tokenization to improve Moses SMT performance
2011
LIHMT 2011
This paper describes a new method to tokenize texts, both to train a Moses SMT system and to be used during the translation process. The new method involves reusing the morphological analyser and part-of-speech tagger of the Apert...
Free/open-source resources in the Apertium platform for machine translation research and development
2010
Charles University in Prague. Institute of Formal and Applied Linguistics
This paper describes the resources available in the Apertium platform, a free/open-source framework for creating rule-based machine translation systems. Resources within the platform take the form of finite-state morphologies for...
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
This paper describes the participation of Prompsit Language Engineering and the Universitat d’Alacant in the shared task on document alignment at the First Conference on Machine Translation (WMT 2016). Two systems have been submit...
Francois Masselot, Petra Ribiczey, Gema Ramírez‐Sánchez
We present a use case of the free/open-source Spanish↔Brazilian Portuguese Apertium machine translation system inside the localization workflow of Autodesk. This system, initially developed to perform general-domain translations,...
Proceedings of the International Multiconference on Computer Science and Information Technology
In this paper, we describe the adaptation process of Apertium, a free/open-source rule-based machine translation platform which is operating in a number of different real-life contexts, to the linguistic needs of the Universitat O...
Development of a free Basque to Spanish machine translation system
2009
Sociedad Española para el Procesamiento del Lenguaje Natural
This paper presents a free (or open-source) rule-based machine translation system between Basque and Spanish, based on the Apertium machine translation platform aimed at assimilation, that is, as a help for the understanding of te...
Procesamiento del Lenguaje Natural
This article presents a free (open-source) rule-based machine translation system between Basque and Spanish, built on the Apertium machine translation platform and intended for assimilation, that is...
Proceedings of the First International Workshop on Free/Open-Source Rule-based Machine Translation
This article describes the needs of UOC regarding translation and how these needs are satisfied by Prompsit further developing a free rule-based machine translation system: Apertium. We initially describe the general framework reg...
Documentation of the open-source shallow-transfer machine translation platform Apertium
2007
Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant
This documentation describes the Apertium platform, one of the open-source machine translation systems which originated within the project "Open-Source Machine Translation for the Languages of Spain" ("Traducción automática de cód...
Apertium, una plataforma de código abierto para el desarrollo de sistemas de traducción automática
2007
Universidad de Cádiz. Servicio de Publicaciones
One of the main challenges for computer science in the coming decades is the development of systems capable of processing natural (or human) language effectively. Within this field, machine translation systems...
International Workshop on Computational Processing of the Portuguese Language
This paper describes the current status of development of an open-source shallow-transfer machine translation (MT) system for the [European] Portuguese-Spanish language pair, developed using the OpenTrad Apertium MT toolbox (www.a...
Opentrad Apertium open-source machine translation system: an opportunity for business and research
2006
Aslib
Most successful machine translation systems built until now use proprietary software and data, and are either distributed as commercial products or are accessible on the net with some restrictions. This kind of machine translation...
Sociedad Española para el Procesamiento del Lenguaje Natural
This article presents a paradigm-based dictionary management model for building lexical processors. To that end, we first show some examples that highlight the expressive power of the...
Advances in Natural Language Processing: 5th International Conference on NLP, FinTAL 2006 Turku, Finland, August 23-25, 2006 Proceedings
The Internet constitutes a potentially huge store of parallel text that may be collected and exploited by many applications such as multilingual information retrieval, machine translation, etc. These applications usually require a...