Desarrollo de un sistema libre de traducción automática del euskera al castellano
Mireia Ginestí-Rosell, Gema Ramírez-Sánchez, Sergio Ortiz-Rojas, Francis M Tyers, Mikel L Forcada
This article presents a free (open-source) rule-based machine translation system between Basque and Spanish, built on the Apertium machine translation platform and intended for assimilation, that is, as an aid to understanding texts written in Basque. It describes the development and current status of the system and presents an evaluation of translation quality.
EAMT 2015
İlknur Durgar El-Kahlout, Mehmed Özkan, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, Fred Hollowood, Andy Way
This paper presents the work done to port a deep-transfer rule-based machine translation system to translate from a different source language by maximizing the exploitation of existing resources and by limiting the development work. Specifically, we report the changes and effort required in each of the system’s modules to obtain an English-Basque translator, ENEUS, starting from the Spanish-Basque Matxin system. We run a human pairwise comparison for the new prototype and two statistical systems and see that ENEUS is preferred in over 30% of the test sentences.
Abu-MaTran at WMT 2014 translation task: Two-step data selection and RBMT-style synthetic rules
Raphael Rubino, Antonio Toral, Victor M Sánchez-Cartagena, Jorge Ferrández-Tordera, Sergio Ortiz-Rojas, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, Andy Way
This paper presents the machine translation systems submitted by the Abu-MaTran project to the WMT 2014 translation task. The language pair concerned is English–French with a focus on French as the target language. The French to English translation direction is also considered, based on the word alignment computed in the other direction. Large language and translation models are built using all the datasets provided by the shared task organisers, as well as the monolingual data from LDC. To build the translation models, we apply a two-step data selection method based on bilingual cross-entropy difference and vocabulary saturation, considering each parallel corpus individually. Synthetic translation rules are extracted from the development sets and used to train another translation model. We then interpolate the translation models, minimising the perplexity on the development sets, to obtain our final SMT system. Our submission for the English to French translation task was ranked second amongst nine teams and a total of twenty submissions.
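The two-step selection described above starts from bilingual cross-entropy difference (Moore-Lewis-style scoring). As an illustration only, the core idea can be sketched with toy unigram language models standing in for the real n-gram LMs; the function names and add-one smoothing here are assumptions for the sketch, not the paper's implementation:

```python
import math
from collections import Counter

def unigram_lm(corpus):
    """Build an add-one-smoothed unigram LM from a list of sentences;
    returns a function giving the per-token average log-probability."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 for unseen tokens
    def avg_logprob(sentence):
        toks = sentence.split()
        return sum(math.log((counts[t] + 1) / (total + vocab)) for t in toks) / max(len(toks), 1)
    return avg_logprob

def rank_by_cross_entropy_difference(candidates, in_domain, general):
    """Rank candidates by H_in(s) - H_gen(s); lower scores mean the
    sentence looks more in-domain relative to the general corpus."""
    lp_in = unigram_lm(in_domain)
    lp_gen = unigram_lm(general)
    # cross-entropy is the negative average log-probability
    scored = [((-lp_in(s)) - (-lp_gen(s)), s) for s in candidates]
    return [s for _, s in sorted(scored)]
```

Selecting the lowest-scoring sentences keeps data that the in-domain LM finds fluent but the general LM does not, which is the intuition behind the first selection step.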
Evaluation of alignment methods for HTML parallel text
Enrique Sánchez-Villamil, Susana Santos-Antón, Sergio Ortiz-Rojas, Mikel L Forcada
The Internet constitutes a potentially huge store of parallel text that may be collected and exploited by many applications such as multilingual information retrieval, machine translation, etc. These applications usually require at least sentence-aligned bilingual text. This paper presents new aligners designed to improve the performance of classical sentence-level aligners when aligning structured text such as HTML. The new aligners are compared with other well-known geometric aligners.
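For context, the classical length-based idea behind sentence-level aligners can be sketched as a small dynamic program. This is a simplified stand-in (only 1-1, 1-0 and 0-1 moves, an ad-hoc normalised length cost instead of Gale-Church's statistical model, and an assumed skip penalty), not one of the aligners evaluated in the paper:

```python
def align_by_length(src, tgt, skip_penalty=3.0):
    """Toy length-based sentence aligner: finds the cheapest path of
    1-1 matches (cost = normalised character-length difference) and
    1-0/0-1 skips (fixed penalty) through the two sentence lists."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j])) / max(len(src[i]), len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n:  # source sentence left unmatched
                c = cost[i][j] + skip_penalty
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j)
            if j < m:  # target sentence left unmatched
                c = cost[i][j] + skip_penalty
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j)
    # backtrace, keeping only the 1-1 matches
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if i - pi == 1 and j - pj == 1:
            pairs.append((pi, pj))
        i, j = pi, pj
    return list(reversed(pairs))
```

The aligners described in the paper improve on this kind of baseline by also exploiting the HTML structure surrounding the text.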
Producing monolingual and parallel web corpora at the same time – SpiderLing and Bitextor’s love affair
Nikola Ljubešić, Miquel Esplà-Gomis, Antonio Toral, Sergio Ortiz-Rojas, Filip Klubička
This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain “.hr” and the Slovene top-level domain “.si”, and extrinsically on the English-Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs.
MTradumàtica: Free Statistical Machine Translation Customisation for Translators
Gökhan Doğru, Adrià Martín-Mor, Sergio Ortiz-Rojas
MTradumàtica is a free, Moses-based web platform for training and using statistical machine translation systems with a user-friendly graphical interface. Its goal is to offer translators a free tool to customise their own statistical machine translation engines and enhance their productivity. In this paper, we aim to describe the features of MTradumàtica and its advantages for translators by focusing on its current capabilities and limitations from a user perspective.
Joint efforts to further develop and incorporate Apertium into the document management flow at Universitat Oberta de Catalunya
Luis Villarejo Muñoz, Sergio Ortiz-Rojas, Mireia Ginestí-Rosell
This article describes the translation needs of UOC and how these needs are satisfied by Prompsit through the further development of a free rule-based machine translation system: Apertium. We initially describe the general framework regarding linguistic needs inside UOC. Section 2 then introduces Apertium and outlines the development scenario that Prompsit executed. After that, section 3 outlines the specific needs of UOC and why Apertium was chosen as the machine translation engine. Section 4 describes some of the features specially developed in this project, and section 5 explains how the linguistic data were improved to increase the quality of the output in Catalan and Spanish. Finally, we draw conclusions and outline further work originating from the project.
CloudLM: a cloud-based language model for machine translation
Jorge Ferrández-Tordera, Sergio Ortiz-Rojas, Antonio Toral
Language models (LMs) are an essential element in statistical approaches to natural language processing for tasks such as speech recognition and machine translation (MT). The advent of big data has led to the availability of massive amounts of data for building LMs, and in fact, for the most prominent languages, using current techniques and hardware, it is not feasible to train LMs with all the data available nowadays. At the same time, it has been shown that the more data is used for an LM the better the performance, e.g. for MT, without any indication yet of reaching a plateau. This paper presents CloudLM, an open-source cloud-based LM intended for MT, which allows querying distributed LMs. CloudLM relies on Apache Solr and provides the functionality of state-of-the-art language modelling (it builds upon KenLM), while allowing massive LMs to be queried (as the use of local memory is drastically reduced), at the expense of slower decoding speed.
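To make the querying idea concrete: CloudLM looks n-grams up in a remote index rather than in local memory. The sketch below fakes that index with a plain dict and uses stupid-backoff-style scoring; the 0.4 backoff weight, the OOV floor and the function name are illustrative assumptions, not CloudLM's actual API:

```python
import math

def ngram_logprob(tokens, store, order=3):
    """Score a token sequence against a key-value n-gram store (a dict
    here, standing in for a remote index), backing off to shorter
    histories with a stupid-backoff-style penalty on each miss."""
    lp = 0.0
    for i in range(len(tokens)):
        max_n = min(order, i + 1)  # longest history available at position i
        for n in range(max_n, 0, -1):
            key = " ".join(tokens[i - n + 1:i + 1])
            if key in store:
                # one 0.4 penalty per level of backoff from max_n down to n
                lp += math.log(store[key]) + (max_n - n) * math.log(0.4)
                break
        else:
            lp += math.log(1e-7)  # OOV floor for tokens never seen at all
    return lp
```

In the real system each `store` lookup is a network query, which is why reduced memory use is traded against slower decoding.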
Using Apertium linguistic data for tokenization to improve Moses SMT performance
Sergio Ortiz Rojas, Santiago Cortés Vaíllo
This paper describes a new method to tokenize texts, both to train a Moses SMT system and to be used during the translation process. The new method involves reusing the morphological analyser and part-of-speech tagger of the Apertium rule-based machine translation system to enrich the default tokenization used in Moses with part-of-speech-based truecasing, multi-word-unit chunking, number preprocessing and fixed translation patterns. The experimental results show an improvement in final quality similar to the improvement attained by using minimum-error-rate training (MERT), as well as an increase in the overall consistency of the output.
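One of the enrichments mentioned, multi-word-unit chunking, can be illustrated with a greedy longest-match pass over a small MWU lexicon. The lexicon, the underscore-joining convention and the function name are illustrative assumptions; in the paper the MWUs come from Apertium's morphological analyser rather than a hand-written set:

```python
def chunk_mwus(tokens, mwu_lexicon):
    """Greedily merge known multi-word units (given as tuples of tokens)
    into single underscore-joined tokens, longest match first, so a
    downstream SMT system treats each unit as one symbol."""
    max_len = max((len(m) for m in mwu_lexicon), default=1)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if n > 1 and candidate in mwu_lexicon:
                out.append("_".join(candidate))
                i += n
                break
        else:
            out.append(tokens[i])  # no MWU starts here; keep token as-is
            i += 1
    return out
```

Merging units this way keeps fixed expressions from being translated word by word.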
Prompsit’s submission to WMT 2018 Parallel Corpus Filtering shared task
Víctor M Sánchez-Cartagena, Marta Bañón, Sergio Ortiz Rojas, Gema Ramírez-Sánchez
This paper describes Prompsit Language Engineering’s submissions to the WMT 2018 parallel corpus filtering shared task. Our four submissions were based on an automatic classifier for identifying pairs of sentences that are mutual translations. A set of hand-crafted hard rules for discarding sentences with evident flaws was applied before the classifier. We explored different strategies for achieving a training corpus with diverse vocabulary and fluent sentences: language model scoring, an active-learning-inspired data selection algorithm and n-gram saturation. Our submissions were very competitive in comparison with other participants on the 100 million word training corpus.
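The hard rules mentioned above are of the "discard on evident flaws" kind. A minimal illustrative pair filter might look like the following; the specific checks and thresholds are assumptions for the sketch, not the actual submission's rules:

```python
def passes_hard_rules(src, tgt, max_ratio=2.0, max_len=100):
    """Return False for sentence pairs with evident flaws: empty sides,
    extreme lengths, implausible length ratios, untranslated copies,
    or mostly non-alphabetic source text (illustrative thresholds)."""
    s, t = src.split(), tgt.split()
    if not s or not t:
        return False  # one side is empty
    if len(s) > max_len or len(t) > max_len:
        return False  # overly long sentence
    if max(len(s), len(t)) / min(len(s), len(t)) > max_ratio:
        return False  # implausible length ratio between the sides
    if src.strip() == tgt.strip():
        return False  # source copied verbatim into the target
    if sum(c.isalpha() for c in src) / max(len(src), 1) < 0.5:
        return False  # mostly digits, punctuation or other noise
    return True
```

Pairs surviving rules like these would then be scored by the classifier.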