Committed to research
that brings advanced results to language technology



2014

Extrinsic evaluation of web-crawlers in machine translation: a study on Croatian-English for the tourism domain

Antonio Toral, Raphael Rubino, Miquel Espla-Gomis, Tommi A Pirinen, Andy Way, Gema Ramírez‐Sánchez

We present an extrinsic evaluation of crawlers of parallel corpora from multilingual web sites in machine translation (MT). Our case study is on Croatian to English translation in the tourism domain. Given two crawlers, we build phrase-based statistical MT systems on the datasets produced by each crawler using different settings. We also combine the best datasets produced by each crawler (union and intersection) to build additional MT systems. Finally we combine the best of the previous systems (union) with general-domain data. This last system outperforms all the previous systems built on crawled data as well as two baselines (a system built on general-domain data and a well-known online MT system).
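The union and intersection of two crawled datasets mentioned above can be illustrated with a minimal sketch. It assumes each dataset is a list of (source, target) sentence pairs and that pairs are compared by exact string match; the function name `combine_corpora` and the sample sentences are hypothetical and do not come from the paper.

```python
def combine_corpora(pairs_a, pairs_b):
    """Return (union, intersection) of two lists of (source, target) pairs.

    Duplicates are dropped and first-seen order is preserved.
    Exact string matching is an assumption of this sketch.
    """
    set_a, set_b = set(pairs_a), set(pairs_b)
    seen = set()
    union = []
    for pair in list(pairs_a) + list(pairs_b):
        if pair not in seen:
            seen.add(pair)
            union.append(pair)
    # A pair belongs to the intersection only if both crawlers produced it.
    intersection = [p for p in union if p in set_a and p in set_b]
    return union, intersection


# Hypothetical output of two crawlers on the same site.
crawler_a = [("Dobar dan", "Good day"), ("Hvala", "Thank you")]
crawler_b = [("Hvala", "Thank you"), ("Plaža", "Beach")]

union, intersection = combine_corpora(crawler_a, crawler_b)
```

The union favours coverage, while the intersection keeps only pairs that both crawlers agree on.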

2016

Collaborative development of a rule-based machine translator between Croatian and Serbian

Filip Klubička, Gema Ramírez‐Sánchez, Nikola Ljubešić

This paper describes the development and current state of a bidirectional Croatian-Serbian machine translation system based on the open-source Apertium platform. It has been created inside the Abu-MaTran project with the aims of creating free linguistic resources as well as having non-experts and experts work together. We describe the collaborative way of collecting the necessary data to build our system, which outperforms other available systems.

2014

Abu-MaTran at WMT 2014 translation task: Two-step data selection and RBMT-style synthetic rules

Raphael Rubino, Antonio Toral, Victor M Sánchez-Cartagena, Jorge Ferrández-Tordera, Sergio Ortiz-Rojas, Gema Ramírez‐Sánchez, Felipe Sánchez‐Martínez, Andy Way

This paper presents the machine translation systems submitted by the Abu-MaTran project to the WMT 2014 translation task. The language pair concerned is English–French with a focus on French as the target language. The French to English translation direction is also considered, based on the word alignment computed in the other direction. Large language and translation models are built using all the datasets provided by the shared task organisers, as well as the monolingual data from LDC. To build the translation models, we apply a two-step data selection method based on bilingual cross-entropy difference and vocabulary saturation, considering each parallel corpus individually. Synthetic translation rules are extracted from the development sets and used to train another translation model. We then interpolate the translation models, minimising the perplexity on the development sets, to obtain our final SMT system. Our submission for the English to French translation task was ranked second amongst nine teams and a total of twenty submissions.
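As a rough illustration of the first selection step named above, the sketch below scores sentence pairs by bilingual cross-entropy difference: the in-domain minus general-domain cross-entropy, summed over the source and target sides, where lower scores suggest a pair is closer to the in-domain data. It is not the submitted system. The add-one-smoothed unigram models, the class `UnigramLM`, the function `bilingual_ced`, and the toy sentences are all assumptions made for brevity, and the vocabulary saturation step is not shown.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one-smoothed unigram model (a simple stand-in for an n-gram LM)."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # +1 reserves mass for unseen tokens

    def cross_entropy(self, sentence):
        """Per-token cross-entropy of the sentence in bits."""
        tokens = sentence.split()
        if not tokens:
            return 0.0
        log_prob = sum(
            math.log2((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in tokens
        )
        return -log_prob / len(tokens)


def bilingual_ced(src, tgt, in_src, gen_src, in_tgt, gen_tgt):
    """Sum of source-side and target-side cross-entropy differences."""
    src_diff = in_src.cross_entropy(src) - gen_src.cross_entropy(src)
    tgt_diff = in_tgt.cross_entropy(tgt) - gen_tgt.cross_entropy(tgt)
    return src_diff + tgt_diff


# Toy in-domain (tourism) and general-domain corpora, invented for this sketch.
in_src = UnigramLM(["the hotel is near the beach", "book a room at the hotel"])
gen_src = UnigramLM(["the parliament voted on the law", "the minister signed the law"])
in_tgt = UnigramLM(["l'hôtel est près de la plage", "réserver une chambre à l'hôtel"])
gen_tgt = UnigramLM(["le parlement a voté la loi", "le ministre a signé la loi"])

score_close = bilingual_ced("the hotel room", "la chambre", in_src, gen_src, in_tgt, gen_tgt)
score_far = bilingual_ced("the law", "la loi", in_src, gen_src, in_tgt, gen_tgt)
```

Sorting a corpus by this score and keeping the lowest-scoring pairs gives a data selection step of this general kind; how the cut-off is chosen is left open here.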

2022

Bicleaner AI: Bicleaner goes neural

Jaume Zaragoza-Bernabeu, Gema Ramírez‐Sánchez, Marta Bañón, Sergio Ortiz-Rojas

This paper describes the experiments carried out during the development of the latest version of Bicleaner, named Bicleaner AI, a tool that aims at detecting noisy sentences in parallel corpora. The tool, which now implements a new neural classifier, uses state-of-the-art techniques based on pre-trained transformer-based language models fine-tuned on a binary classification task. After that, parallel corpus filtering is performed, discarding the sentences that have lower probability of being mutual translations. Our experiments, based on the training of neural machine translation (NMT) with corpora filtered using Bicleaner AI for two different scenarios, show significant improvements in translation quality compared to the previous version of the tool which implemented a classifier based on Extremely Randomized Trees.
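The filtering step described above, discarding pairs with a low probability of being mutual translations, can be sketched generically. This is not the Bicleaner AI interface: `filter_parallel`, the 0.5 threshold, and the length-ratio scorer `toy_score` are hypothetical stand-ins for the neural classifier, used only to show the thresholding logic.

```python
def filter_parallel(pairs, score_fn, threshold=0.5):
    """Keep the (source, target) pairs whose score reaches the threshold."""
    return [(src, tgt) for src, tgt in pairs if score_fn(src, tgt) >= threshold]


def toy_score(src, tgt):
    """Stand-in scorer: character-length ratio in [0, 1].

    A real classifier would estimate the probability that the two
    sentences are translations of each other.
    """
    if not src or not tgt:
        return 0.0
    return min(len(src), len(tgt)) / max(len(src), len(tgt))


# Invented sample: one plausible pair and one obviously mismatched pair.
pairs = [
    ("Good morning", "Buenos días"),
    ("Click here", "Lorem ipsum dolor sit amet consectetur"),
]
kept = filter_parallel(pairs, toy_score, threshold=0.5)
```

Raising the threshold trades corpus size for cleanliness, which is why the choice is usually validated on downstream translation quality.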

2020

Bicleaner at WMT 2020: Universitat d’Alacant-Prompsit’s submission to the parallel corpus filtering shared task

Miquel Espla-Gomis, Víctor M Sánchez-Cartagena, Jaume Zaragoza-Bernabeu, Felipe Sánchez‐Martínez

This paper describes the joint submission of Universitat d’Alacant and Prompsit Language Engineering to the WMT 2020 shared task on parallel corpus filtering. Our submission, based on the free/open-source tool Bicleaner, enhances it with Extremely Randomised Trees and lexical similarity features that account for the frequency of the words in the parallel sentences to determine if two sentences are parallel. To train this classifier we used the clean corpora provided for the task and synthetic noisy parallel sentences. In addition we re-score the output of Bicleaner using character-level language models and n-gram saturation.

2022

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Marta Bañón, Miquel Espla-Gomis, Mikel L Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza

We introduce the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release the free/open-source web crawling and curation software used.

2023

HPLT: High Performance Language Technologies

Mikko Aulamo, Nikolay Bogoychev, Shaoxiong Ji, Graeme Nail, Gema Ramírez‐Sánchez, Jörg Tiedemann, Jelmer Van Der Linde, Jaume Zaragoza

We describe the High Performance Language Technologies project (HPLT), a 3-year EU-funded project started in September 2022. HPLT will build a space combining petabytes of natural language data with large-scale model training. It will derive monolingual and bilingual datasets from the Internet Archive and CommonCrawl and build efficient and solid machine translation (MT) as well as large language models (LLMs). HPLT aims at providing free, sustainable and reusable datasets, models and workflows at scale using high-performance computing (HPC).

2022

Human evaluation of web-crawled parallel corpora for machine translation

Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

Quality assessment has been an ongoing activity of the series of ParaCrawl efforts to crawl massive amounts of parallel data from multilingual websites for 29 languages. The goal of ParaCrawl is to get parallel data that is good for machine translation. To verify this, both automatic (extrinsic) and human (intrinsic and extrinsic) evaluation tasks have been included as part of the quality assessment activity of the project. We sum up the various methods followed to address these evaluation tasks for the web-crawled corpora produced and their results. We review their advantages and disadvantages for the final goal of the ParaCrawl project and the related ongoing project MaCoCu.

2019

Neural Paraphrasing Generation System

Jaume Zaragoza Bernabeu

We understand paraphrasing as the act of rewriting a text with different words while preserving its meaning. Paraphrasing has many applications, such as rewording while writing a text, providing alternative translations for a target sentence, identifying similar sentences, obtaining synonyms, or expanding search queries to find more information. To support all these applications, the goal of the project is to build a system that produces paraphrases of a given sentence. To build this system, we will explore different state-of-the-art techniques based on neural networks, more specifically, techniques inspired by neural machine translation. First, we will carry out an unsupervised task focused on generating sentence embeddings (vectors of real numbers) that represent semantic information in a continuous space. To generate these embeddings we will use large corpora, with millions of sentences from public books or from subtitles of TV series, films and documentaries. These embeddings will then be tested on tasks of semantic relatedness (how similar two sentences are) and paraphrase identification (whether two sentences are paraphrases). Finally, we will build a paraphrase generation system that uses these embeddings to improve its performance.

2010

Free/open-source resources in the Apertium platform for machine translation research and development

Francis M Tyers, Felipe Sánchez-Martínez, Sergio Ortiz Rojas, Mikel L Forcada

This paper describes the resources available in the Apertium platform, a free/open-source framework for creating rule-based machine translation systems. Resources within the platform take the form of finite-state morphologies for morphological analysis and generation, bilingual transfer lexica, probabilistic part-of-speech taggers and transfer rule files, all in standardised formats. These resources are described and some examples are given of their reuse and recycling in combination with other machine translation systems.
