2014

Abu-MaTran at WMT 2014 translation task: Two-step data selection and RBMT-style synthetic rules

Raphael Rubino, Antonio Toral, Víctor M Sánchez-Cartagena, Jorge Ferrández-Tordera, Sergio Ortiz-Rojas, Gema Ramírez‐Sánchez, Felipe Sánchez‐Martínez, Andy Way
This paper presents the machine translation systems submitted by the Abu-MaTran project to the WMT 2014 translation task. The language pair concerned is English–French with a focus on French as the target language. The French to English translation direction is also considered, based on the word alignment computed in the other direction. Large language and translation models are built using all the datasets provided by the shared task organisers, as well as the monolingual data from LDC. To build the translation models, we apply a two-step data selection method based on bilingual cross-entropy difference and vocabulary saturation, considering each parallel corpus individually. Synthetic translation rules are extracted from the development sets and used to train another translation model. We then interpolate the translation models, minimising the perplexity on the development sets, to obtain our final SMT system. Our submission for the English to French translation task was ranked second amongst nine teams and a total of twenty submissions.
2022

Bicleaner AI: Bicleaner goes neural

Jaume Zaragoza-Bernabeu, Gema Ramírez‐Sánchez, Marta Bañón, Sergio Ortiz-Rojas
This paper describes the experiments carried out during the development of the latest version of Bicleaner, named Bicleaner AI, a tool that aims at detecting noisy sentences in parallel corpora. The tool, which now implements a new neural classifier, uses state-of-the-art techniques based on pre-trained transformer-based language models fine-tuned on a binary classification task. After that, parallel corpus filtering is performed, discarding the sentences that have lower probability of being mutual translations. Our experiments, based on the training of neural machine translation (NMT) with corpora filtered using Bicleaner AI for two different scenarios, show significant improvements in translation quality compared to the previous version of the tool which implemented a classifier based on Extremely Randomized Trees.
2020

Bicleaner at WMT 2020: Universitat d’Alacant-Prompsit’s submission to the parallel corpus filtering shared task

Miquel Espla-Gomis, Víctor M Sánchez-Cartagena, Jaume Zaragoza-Bernabeu, Felipe Sánchez‐Martínez
This paper describes the joint submission of Universitat d’Alacant and Prompsit Language Engineering to the WMT 2020 shared task on parallel corpus filtering. Our submission, based on the free/open-source tool Bicleaner, enhances it with Extremely Randomised Trees and lexical similarity features that account for the frequency of the words in the parallel sentences to determine if two sentences are parallel. To train this classifier, we used the clean corpora provided for the task and synthetic noisy parallel sentences. In addition, we re-score the output of Bicleaner using character-level language models and n-gram saturation.
2022

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Marta Bañón, Miquel Espla-Gomis, Mikel L Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza
We introduce the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release the free/open-source web crawling and curation software used.
2023

HPLT: High Performance Language Technologies

Mikko Aulamo, Nikolay Bogoychev, Shaoxiong Ji, Graeme Nail, Gema Ramírez‐Sánchez, Jörg Tiedemann, Jelmer Van Der Linde, Jaume Zaragoza
We describe the High Performance Language Technologies project (HPLT), a 3-year EU-funded project started in September 2022. HPLT will build a space combining petabytes of natural language data with large-scale model training. It will derive monolingual and bilingual datasets from the Internet Archive and CommonCrawl and build efficient and solid machine translation (MT) as well as large language models (LLMs). HPLT aims at providing free, sustainable and reusable datasets, models and workflows at scale using high-performance computing (HPC).
2022

Human evaluation of web-crawled parallel corpora for machine translation

Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)
Quality assessment has been an ongoing activity of the series of ParaCrawl efforts to crawl massive amounts of parallel data from multilingual websites for 29 languages. The goal of ParaCrawl is to get parallel data that is good for machine translation. To verify this, both automatic (extrinsic) and human (intrinsic and extrinsic) evaluation tasks have been included as part of the quality assessment activity of the project. We sum up the various methods followed to address these evaluation tasks for the web-crawled corpora produced and their results. We review their advantages and disadvantages for the final goal of the ParaCrawl project and the related ongoing project MaCoCu.
2019

Neural Paraphrasing Generation System

Jaume Zaragoza Bernabeu
We understand paraphrasing as the act of rewriting a text in different words while preserving its meaning. Paraphrasing has many applications, such as rewriting words while a text is being written, providing alternative translations for a target sentence, identifying similar sentences, obtaining synonyms, or expanding search queries to find more information. To support all of these applications, the goal of this project is to build a system that produces paraphrases of a given sentence. To build this system, we will explore several state-of-the-art techniques based on neural networks, specifically ones inspired by neural machine translation. First, we will carry out an unsupervised task focused on generating sentence embeddings (vectors of real numbers) that represent semantic information in a continuous space. To generate these embeddings, we will use large corpora containing millions of sentences from public books or from subtitles of television series, films and documentaries. These embeddings will then be tested on semantic relatedness tasks (how similar two sentences are) and paraphrase identification tasks (whether two sentences are paraphrases). Finally, we will build a paraphrase generation system that uses these embeddings to improve its performance.
2010

Free/open-source resources in the Apertium platform for machine translation research and development

Francis M Tyers, Felipe Sánchez-Martínez, Sergio Ortiz Rojas, Mikel L Forcada
This paper describes the resources available in the Apertium platform, a free/open-source framework for creating rule-based machine translation systems. Resources within the platform take the form of finite-state morphologies for morphological analysis and generation, bilingual transfer lexica, probabilistic part-of-speech taggers and transfer rule files, all in standardised formats. These resources are described and some examples are given of their reuse and recycling in combination with other machine translation systems.
2016

Bitextor’s participation in WMT’16: shared task on document alignment

Miquel Espla-Gomis, Mikel L Forcada, Sergio Ortiz-Rojas, Jorge Ferrández-Tordera
This paper describes the participation of Prompsit Language Engineering and the Universitat d’Alacant in the shared task on document alignment at the First Conference on Machine Translation (WMT 2016). Two systems have been submitted, corresponding to two different versions of the tool Bitextor: the last stable release, version 4.1, and the newest one, version 5.0. The paper describes the main features of each version of the tool and discusses the results obtained on the data sets published for the shared task.
2015

Abu-MaTran at WMT 2015 translation task: Morphological segmentation and web crawling

Raphael Rubino, Tommi A Pirinen, Miquel Espla-Gomis, Nikola Ljubešić, Sergio Ortiz-Rojas, Vassilis Papavassiliou, Prokopis Prokopidis, Antonio Toral
This paper presents the machine translation systems submitted by the Abu-MaTran project for the Finnish–English language pair at the WMT 2015 translation task. We tackle the lack of resources and complex morphology of the Finnish language by (i) crawling parallel and monolingual data from the Web and (ii) applying rule-based and unsupervised methods for morphological segmentation. Several statistical machine translation approaches are evaluated and then combined to obtain our final submissions, which are the top performing English-to-Finnish unconstrained (all automatic metrics) and constrained (BLEU), and Finnish-to-English constrained (TER) systems.
NLP Research & Publications | Machine Translation Papers | Prompsit