
2009

Desarrollo de un sistema libre de traducción automática del euskera al castellano

Mireia Ginestí-Rosell, Gema Ramírez-Sánchez, Sergio Ortiz-Rojas, Francis M. Tyers, Mikel L. Forcada
This article presents a free (open-source) rule-based machine translation system between Basque and Spanish, built on the Apertium machine translation platform and intended for assimilation, that is, as an aid to understanding texts written in Basque. We describe its development and current status and present an evaluation of translation quality.
2015

EAMT 2015

İlknur Durgar El-Kahlout, Mehmed Özkan, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, Fred Hollowood, Andy Way
This paper presents the work done to port a deep-transfer rule-based machine translation system to translate from a different source language by maximizing the exploitation of existing resources and by limiting the development work. Specifically, we report the changes and effort required in each of the system's modules to obtain an English-Basque translator, ENEUS, starting from the Spanish-Basque Matxin system. We run a human pairwise comparison of the new prototype against two statistical systems and find that ENEUS is preferred in over 30% of the test sentences.
2014

Abu-MaTran at WMT 2014 Translation Task: Two-step Data Selection and RBMT-style Synthetic Rules

Raphael Rubino, Antonio Toral, Victor M. Sánchez-Cartagena, Jorge Ferrández-Tordera, Sergio Ortiz-Rojas, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, Andy Way
This paper presents the machine translation systems submitted by the Abu-MaTran project to the WMT 2014 translation task. The language pair concerned is English–French with a focus on French as the target language. The French to English translation direction is also considered, based on the word alignment computed in the other direction. Large language and translation models are built using all the datasets provided by the shared task organisers, as well as the monolingual data from LDC. To build the translation models, we apply a two-step data selection method based on bilingual cross-entropy difference and vocabulary saturation, considering each parallel corpus individually. Synthetic translation rules are extracted from the development sets and used to train another translation model. We then interpolate the translation models, minimising the perplexity on the development sets, to obtain our final SMT system. Our submission for the English to French translation task was ranked second amongst nine teams and a total of twenty submissions.
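The two-step selection described in this abstract combines bilingual cross-entropy difference ranking with vocabulary saturation. The sketch below illustrates the general idea only, not the submission's actual configuration; it assumes KenLM models trained on in-domain and general-domain data (file names are hypothetical).

```python
# Step 1: rank sentence pairs by bilingual cross-entropy difference
# (Moore-Lewis style). Step 2: vocabulary saturation over the ranked pairs.
import kenlm

# Hypothetical model paths: in-domain and general-domain LMs per language.
lm_in_src = kenlm.Model("in_domain.en.arpa")
lm_gen_src = kenlm.Model("general.en.arpa")
lm_in_tgt = kenlm.Model("in_domain.fr.arpa")
lm_gen_tgt = kenlm.Model("general.fr.arpa")

def xent(model, sentence):
    """Per-word cross-entropy of a sentence under a KenLM model (log10-based)."""
    words = sentence.split()
    return -model.score(sentence, bos=True, eos=True) / max(len(words), 1)

def bilingual_score(src, tgt):
    """Lower is better: pairs close to the in-domain LMs and far from the
    general-domain LMs get the smallest scores."""
    return ((xent(lm_in_src, src) - xent(lm_gen_src, src)) +
            (xent(lm_in_tgt, tgt) - xent(lm_gen_tgt, tgt)))

def select(pairs, max_ngram_count=10, n=2):
    """Vocabulary saturation: keep a ranked pair only while it still
    contributes source n-grams seen fewer than max_ngram_count times."""
    seen, kept = {}, []
    for src, tgt in sorted(pairs, key=lambda p: bilingual_score(*p)):
        tokens = src.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if any(seen.get(g, 0) < max_ngram_count for g in ngrams):
            kept.append((src, tgt))
            for g in ngrams:
                seen[g] = seen.get(g, 0) + 1
    return kept
```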
2006

Evaluation of alignment methods for HTML parallel text

Enrique Sánchez-Villamil, Susana Santos-Antón, Sergio Ortiz-Rojas, Mikel L. Forcada
The Internet constitutes a potentially huge store of parallel text that may be collected and exploited by many applications such as multilingual information retrieval, machine translation, etc. These applications usually require at least sentence-aligned bilingual text. This paper presents new aligners designed to improve the performance of classical sentence-level aligners when aligning structured text such as HTML. The new aligners are compared with other well-known geometric aligners.
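The paper's aligners are not reproduced here; as a point of reference for the "classical sentence-level aligners" it improves on, this is a minimal sketch of a length-based geometric aligner: dynamic programming over 1-1, 1-0, 0-1, 2-1 and 1-2 alignment moves with a simple character-length cost (the cost function and skip penalty are assumptions).

```python
def length_cost(src_len, tgt_len):
    """Penalty that grows as the character lengths of the two sides diverge."""
    return abs(src_len - tgt_len) / (src_len + tgt_len + 1)

def align(src_sents, tgt_sents, skip_penalty=0.5):
    n, m = len(src_sents), len(tgt_sents)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]  # (src sentences, tgt sentences)
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj in moves:
                if i + di > n or j + dj > m:
                    continue
                if di == 0 or dj == 0:
                    c = skip_penalty  # unmatched sentence
                else:
                    s = sum(len(x) for x in src_sents[i:i + di])
                    t = sum(len(x) for x in tgt_sents[j:j + dj])
                    c = length_cost(s, t)
                if cost[i][j] + c < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = cost[i][j] + c
                    back[i + di][j + dj] = (di, dj)
    # Trace back the best path into (source span, target span) beads.
    beads, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((src_sents[i - di:i], tgt_sents[j - dj:j]))
        i, j = i - di, j - dj
    return list(reversed(beads))
```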
2016

Producing Monolingual and Parallel Web Corpora at the Same Time: SpiderLing and Bitextor's Love Affair

Nikola Ljubešić, Miquel Esplà-Gomis, Antonio Toral, Sergio Ortiz-Rojas, Filip Klubička
This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain “.hr” and the Slovene top-level domain “.si”, and extrinsically on the English-Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs.
2017

MTradumàtica: Free Statistical Machine Translation Customisation for Translators

Gökhan Doğru, Adrià Martín-Mor, Sergio Ortiz-Rojas
MTradumàtica is a free, Moses-based web platform for training and using statistical machine translation systems with a user-friendly graphical interface. Its goal is to offer translators a free tool to customise their own statistical machine translation engines and enhance their productivity. In this paper, we describe the features of MTradumàtica and its advantages for translators by focusing on its current capabilities and limitations from a user perspective.
2009

Joint efforts to further develop and incorporate Apertium into the document management flow at Universitat Oberta de Catalunya

Luis Villarejo Muñoz, Sergio Ortiz-Rojas, Mireia Ginestí-Rosell
This article describes the needs of UOC regarding translation and how these needs are satisfied by Prompsit further developing a free rule-based machine translation system: Apertium. We initially describe the general framework regarding linguistic needs inside UOC. Section 2 then introduces Apertium and outlines the development scenario that Prompsit executed. After that, section 3 outlines the specific needs of UOC and why Apertium was chosen as the machine translation engine. Section 4 then describes some of the features specially developed in this project. Section 5 explains how the linguistic data was improved to increase the quality of the output in Catalan and Spanish. Finally, we draw conclusions and outline further work originating from the project.
2016

CloudLM: a Cloud-based Language Model for Machine Translation

Jorge Ferrández-Tordera, Sergio Ortiz-Rojas, Antonio Toral
Language models (LMs) are an essential element in statistical approaches to natural language processing for tasks such as speech recognition and machine translation (MT). The advent of big data has led to the availability of massive amounts of data to build LMs, and in fact, for the most prominent languages, using current techniques and hardware, it is not feasible to train LMs on all the data available nowadays. At the same time, it has been shown that the more data is used for an LM the better the performance, e.g. for MT, with no indication yet of reaching a plateau. This paper presents CloudLM, an open-source cloud-based LM intended for MT, which allows distributed LMs to be queried. CloudLM relies on Apache Solr and provides the functionality of state-of-the-art language modelling (it builds upon KenLM), while allowing massive LMs to be queried (as the use of local memory is drastically reduced), at the expense of slower decoding speed.
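CloudLM's actual Solr schema and query protocol are not reproduced here; the sketch below only illustrates the general idea of scoring a sentence against n-gram log-probabilities stored in a remote Apache Solr index. The core name ("ngrams"), the field names ("ngram", "logprob") and the back-off floor are assumptions.

```python
import requests

SOLR_URL = "http://localhost:8983/solr/ngrams/select"  # hypothetical endpoint

def ngram_logprob(ngram, backoff=-7.0):
    """Look up an n-gram in the remote index; fall back to a floor value if
    it is not stored (a real LM would apply proper back-off weights)."""
    params = {"q": 'ngram:"%s"' % ngram, "wt": "json", "rows": 1}
    docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]
    return docs[0]["logprob"] if docs else backoff

def sentence_logprob(sentence, order=3):
    """Score a sentence as the sum of its n-gram log-probabilities, one
    HTTP query per n-gram (this is where the slower decoding comes from)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - order + 1):i + 1]
        total += ngram_logprob(" ".join(context))
    return total
```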
2011

Using Apertium linguistic data for tokenization to improve Moses SMT performance

Sergio Ortiz Rojas, Santiago Cortés Vaíllo
This paper describes a new method to tokenize texts, both to train a Moses SMT system and to be used during the translation process. The new method involves reusing the morphological analyser and part-of-speech tagger of the Apertium rule-based machine translation system to enrich the default tokenization used in Moses with part-of-speech-based truecasing, multi-word-unit chunking, number preprocessing and fixed translation patterns. The experimental results show an improvement in final quality similar to the improvement attained by using minimum-error-rate training (MERT), as well as an increase in the overall consistency of the output.
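As a rough illustration of the kind of enrichment the abstract mentions, the sketch below operates on tokens already annotated with Apertium-style part-of-speech tags. The tag names, the number placeholder and the multi-word list are assumptions; the paper's actual rules are richer and come from Apertium's linguistic data rather than a hand-written table.

```python
import re

MULTIWORDS = {("a", "pesar", "de"): "a_pesar_de"}  # hypothetical fixed unit

def enrich(tagged_tokens):
    """tagged_tokens: list of (surface, tag) pairs, e.g. ('Casa', 'n')."""
    out, i = [], 0
    while i < len(tagged_tokens):
        # Multi-word-unit chunking: join known fixed expressions.
        for length in (3, 2):
            key = tuple(w.lower() for w, _ in tagged_tokens[i:i + length])
            if key in MULTIWORDS:
                out.append(MULTIWORDS[key])
                i += length
                break
        else:
            word, tag = tagged_tokens[i]
            if re.fullmatch(r"\d+([.,]\d+)?", word):
                out.append("@NUM@")        # number preprocessing placeholder
            elif i == 0 and tag != "np":
                out.append(word.lower())   # truecasing: lowercase the sentence-
                                           # initial word unless it is a proper noun
            else:
                out.append(word)
            i += 1
    return out

print(enrich([("Casa", "n"), ("a", "pr"), ("pesar", "n"), ("de", "pr"), ("1999", "num")]))
# ['casa', 'a_pesar_de', '@NUM@']
```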
2018

Prompsit’s submission to WMT 2018 Parallel Corpus Filtering shared task

Víctor M. Sánchez-Cartagena, Marta Bañón, Sergio Ortiz Rojas, Gema Ramírez-Sánchez
This paper describes Prompsit Language Engineering's submissions to the WMT 2018 parallel corpus filtering shared task. Our four submissions were based on an automatic classifier for identifying pairs of sentences that are mutual translations. A set of hand-crafted hard rules for discarding sentences with evident flaws was applied before the classifier. We explored different strategies for achieving a training corpus with diverse vocabulary and fluent sentences: language model scoring, an active-learning-inspired data selection algorithm and n-gram saturation. Our submissions were very competitive in comparison with other participants on the 100 million word training corpus.
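The hard rules mentioned in this abstract are applied before any learned scoring. The sketch below shows rules of that general kind (length bounds, length ratio, identical sides, mostly non-alphabetic content); the thresholds and the exact rule set are assumptions, not the submission's actual configuration.

```python
import re

def passes_hard_rules(src, tgt, max_ratio=3.0, max_len=200, min_len=3):
    src_tok, tgt_tok = src.split(), tgt.split()
    if not (min_len <= len(src_tok) <= max_len and
            min_len <= len(tgt_tok) <= max_len):
        return False                               # too short or too long
    ratio = len(src_tok) / len(tgt_tok)
    if ratio > max_ratio or ratio < 1.0 / max_ratio:
        return False                               # implausible length ratio
    if src.strip() == tgt.strip():
        return False                               # identical sides are rarely
                                                   # mutual translations
    for side in (src, tgt):
        letters = len(re.findall(r"[^\W\d_]", side))
        if letters < 0.5 * len(side.replace(" ", "")):
            return False                           # mostly digits/punctuation
    return True

pairs = [("Hello world , how are you ?", "Hola mundo , ¿ cómo estás ?"),
         ("12345 6789", "12345 6789")]
kept = [p for p in pairs if passes_hard_rules(*p)]  # keeps only the first pair
```

Pairs surviving these rules would then be ranked by the classifier and the data selection strategies described in the abstract.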
NLP Research & Publications | Machine Translation Papers | Prompsit