NLP Research and Publications | Translation Articles

2026

Prompsit’s API and CLI: planet-friendly, privacy-first, open-source translation services for everyone

Lev Nikolaevich Berezhnoy, Gema Ramírez Sánchez, Sergio Ortiz Rojas, Mikel L. Forcada

Prompsit is launching an updated API and CLI for its open-source, planet-friendly machine translation services. Operating on a freemium model, the tools offer free limited access alongside tiered pricing for advanced features like MT evaluation, quality estimation, corpus scoring, and multilingual dataset annotation.

2026

MultiSynt/MT: Trillion-Token Multi-Parallel Pre-Training Data Translated Across 36 Languages

Maximilian Idahl, Jörg Tiedemann, Sampo Pyysalo, David Salinas, Tomasz Galica, Shenbin Qian, Tudor Nicolae Mateiu, Zihao Li, Anna Lokrantz, Fedor Vitiugin, André F. T. Martins, Jenna Kanerva, Filip Ginter, Matthias Lindemann, Tim Isbister, Birger Moell, Jonas Lindh, Jan Hajič, Jenia Jitsev, Andrey Kutuzov, Stephan Oepen, Gema Ramírez-Sánchez

Open web-scale pre-training corpora remain concentrated in English, limiting multilingual LLM development. We introduce MultiSynt/MT, an open synthetic parallel corpus with approximately 4.8 trillion target-language tokens across 36 European languages, produced by translating 100 billion high-quality Nemotron-CC tokens with Tower+ and OPUS-MT/HPLT-MT systems. For many medium- and lower-resource European languages, this is the largest openly available pre-training resource. On a broad multilingual benchmark suite, reference LLMs trained on MultiSynt/MT reach the final score of HPLT 2.0, a native-data baseline, using roughly 72% fewer pre-training tokens, and outperform it by approximately 15% relative at a matched 100B-token training budget. Our analyses also identify evaluation blind spots: standard multiple-choice benchmarks miss translation-quality differences that a fluency-sensitive LLM-as-judge evaluation cleanly recovers on the trained LLMs (with no fluency deficit in MultiSynt itself), and Norwegian idiomatic and culturally grounded tasks remain better served by native data. We release the corpus, including row-aligned translations from multiple systems, to support controlled research on multilingual pre-training data and evaluation.

2025

A New Massive Multilingual Dataset for High-Performance Language Technologies

Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

2025

Quality Beyond A Glance: Revealing Large Quality Differences Between Web-Crawled Parallel Corpora

Rik van Noord, Miquel Esplà-Gomis, Malina Chichirau, Gema Ramírez-Sánchez, Antonio Toral

Parallel corpora play a vital role in advanced multilingual natural language processing tasks, notably in machine translation (MT). The recent emergence of numerous large parallel corpora, often extracted from multilingual documents on the Internet, has expanded the available resources. Nevertheless, the quality of these corpora remains largely unexplored, while there are large differences in how the corpora are constructed. Moreover, how the potential differences affect the performance of neural MT (NMT) systems has also received limited attention. This study addresses this gap by manually and automatically evaluating four well-known publicly available parallel corpora across eleven language pairs. Our findings are quite concerning: all corpora contain a substantial amount of noisy sentence pairs, with CCMatrix and CCAligned having well below 50% reasonably clean pairs. MaCoCu and ParaCrawl generally have higher quality texts, though around a third of the texts still have clear issues. While corpus size impacts NMT models’ performance, our study highlights the critical role of quality: higher-quality corpora consistently yield better-performing NMT models when controlling for size.

2024

SmartBiC: Smart Harvesting of Bilingual Corpora from the Internet

Gema Ramírez-Sánchez, Sergio Ortiz Rojas, Alicia Núñez Alcover, Tudor Nicolae Mateiu, Mikel L. Forcada, Pedro Luis Díez-Orzas, Almudena Ballester Carrillo, Giuseppe Deriard Nolasco, Noelia Jiménez Listón

SmartBiC, an 18-month innovation project funded by the Spanish Government, aims at improving the full process of collecting, filtering and selecting in-domain parallel content to be used for machine translation and language model tuning purposes in industrial settings. Based on state-of-the-art technology in the free/open-source parallel web corpora harvester Bitextor, SmartBiC develops a web-based application around it including novel components such as a language- and domain-focused crawler and a domain-specific corpora selector. SmartBiC also addresses specific industrial use cases for individual components of the Bitextor pipeline, such as parallel data cleaning. Relevant improvements to the current Bitextor pipeline will be publicly released.

2024

HPLT's First Release of Data and Models

Nikolay Arefyev, Mikko Aulamo, Pinzhen Chen, Ona De Gibert Bonet, Barry Haddow, Jindřich Helcl, Bhavitvya Malik, Gema Ramírez-Sánchez, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Jaume Zaragoza-Bernabeu

We describe the first results of the High Performance Language Technologies project (HPLT), a 3-year EU-funded project that started in September 2022. The first data release includes 75 monolingual datasets and 18 parallel datasets derived from 1.8 petabytes of the Internet Archive and CommonCrawl. Building upon automated and reusable pipelines, the first machine translation (MT) models as well as large language models (LLMs) have been trained and released. Several data processing tools and pipelines have also been made public. HPLT aims to provide free, sustainable, and reusable datasets, models, and workflows at scale using high-performance computing.

2024

A New Massive Multilingual Dataset for High-Performance Language Technologies

Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

2024

Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages

Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral

Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant large, web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across eleven lower-resourced European languages. Our approach is two-fold: first, we perform an intrinsic evaluation by performing a human evaluation of the quality of samples taken from different corpora; then, we assess the practical impact of the qualitative differences by training specific LMs on each of the corpora and evaluating their performance on downstream tasks. We find that there are clear differences in quality of the corpora, with MaCoCu and OSCAR obtaining the best results. However, during the extrinsic evaluation, we actually find that the CC100 corpus achieves the highest scores. We conclude that, in our experiments, the quality of the web-crawled corpora does not seem to play a significant role when training LMs.

2024

FastSpell: the LangId Magic Spell

Marta Bañón, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Sergio Ortiz-Rojas

Language identification is a crucial component in the automated production of language resources, particularly in multilingual and big data contexts. However, commonly used language identifiers struggle to differentiate between similar or closely-related languages. This paper introduces FastSpell, a language identifier that combines fastText (a pre-trained language identifier tool) and Hunspell (a spell checker) with the aim of having a refined second-opinion before deciding which language should be assigned to a text. We provide a description of the FastSpell algorithm along with an explanation on how to use and configure it. To that end, we motivate the need of such a tool and present a benchmark including some popular language identifiers evaluated during the development of FastSpell. We show how FastSpell is useful not only to improve identification of similar languages, but also to identify new ones ignored by other tools.
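The "second opinion" idea in this abstract can be illustrated with a minimal sketch. This is not FastSpell's code or API: the `first_pass` callable stands in for a fastText-style identifier, the toy word lists stand in for Hunspell dictionaries, and the `SIMILAR` groups, `misspelled_ratio` and `identify` names are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, Set

# Hypothetical groups of easily confused languages (illustrative only).
SIMILAR: Dict[str, Set[str]] = {
    "es": {"es", "gl"},
    "gl": {"es", "gl"},
}

# Toy word lists standing in for real spell-checker dictionaries.
TOY_DICTS: Dict[str, Set[str]] = {
    "es": {"la", "casa", "es", "muy", "bonita", "y", "grande"},
    "gl": {"a", "casa", "é", "moi", "bonita", "e", "grande"},
}


def misspelled_ratio(text: str, lang: str) -> float:
    """Fraction of whitespace-separated tokens missing from the toy dictionary."""
    tokens = text.lower().split()
    if not tokens:
        return 1.0
    unknown = sum(1 for tok in tokens if tok not in TOY_DICTS[lang])
    return unknown / len(tokens)


def identify(text: str, first_pass: Callable[[str], str]) -> str:
    """Take a first-pass guess, then ask the spell checker for a second opinion."""
    guess = first_pass(text)
    candidates = SIMILAR.get(guess)
    if not candidates:
        # No confusable neighbours are configured: trust the first pass.
        return guess
    # Choose the candidate with the fewest unknown words;
    # on a tie, keep the first-pass guess.
    return min(
        sorted(candidates),
        key=lambda lang: (misspelled_ratio(text, lang), lang != guess),
    )


def always_spanish(text: str) -> str:
    """Stand-in first-pass identifier that always answers 'es'."""
    return "es"
```

With these toy dictionaries, a Galician sentence that the stand-in first pass labels as Spanish is relabelled by the dictionary check, while a sentence that fits both word lists equally keeps the first-pass label.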

2023

OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models

Nikolay Bogoychev, Jelmer van der Linde, Graeme Nail, Barry Haddow, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Lukas Weymann, Tudor Nicolae Mateiu, Jindřich Helcl, Mikko Aulamo

Developing high quality machine translation systems is a labour intensive, challenging and confusing process for newcomers to the field. We present a pair of tools, OpusCleaner and OpusTrainer, that aim to simplify the process, reduce the amount of work and lower the entry barrier for newcomers. OpusCleaner is a data downloading, cleaning, and preprocessing toolkit. It is designed to allow researchers to quickly download, visualise and preprocess bilingual (or monolingual) data that comes from many different sources, each of them with different quality, issues, and unique filtering/preprocessing requirements. OpusTrainer is a data scheduling and data augmenting tool aimed at building large scale, robust machine translation systems and large language models. It features deterministic data mixing from many different sources, on-the-fly data augmentation and more. Using these tools, we showcase how we can use them to create high quality machine translation models robust to noisy user input, multilingual models, and terminology-aware models.
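As a rough illustration of what "deterministic data mixing" can mean, the sketch below draws batches from several sources at fixed ratios using a seeded random generator, so that two runs with the same seed see the same data in the same order. It is an assumption-laden toy, not OpusTrainer's implementation or configuration format; `mix_batches`, its parameters and the example sources are hypothetical.

```python
import random
from itertools import islice
from typing import Dict, Iterator, List, Tuple


def mix_batches(
    sources: Dict[str, List[str]],
    weights: Dict[str, float],
    batch_size: int,
    seed: int,
) -> Iterator[List[Tuple[str, str]]]:
    """Yield batches of (source_name, line) pairs sampled at the given ratios.

    All randomness comes from one generator seeded with `seed`, so the
    sequence of batches is reproducible across runs.
    """
    rng = random.Random(seed)
    names = sorted(sources)  # fixed order, independent of dict insertion order
    probs = [weights[name] for name in names]
    while True:
        picked = rng.choices(names, weights=probs, k=batch_size)
        yield [(name, rng.choice(sources[name])) for name in picked]


# Hypothetical example data: a clean corpus mixed 4:1 with a noisy one.
EXAMPLE_SOURCES = {"clean": ["a", "b", "c"], "noisy": ["x", "y"]}
EXAMPLE_WEIGHTS = {"clean": 0.8, "noisy": 0.2}


def first_batches(seed: int, count: int = 3) -> List[List[Tuple[str, str]]]:
    """Collect the first `count` batches of four items for a given seed."""
    stream = mix_batches(EXAMPLE_SOURCES, EXAMPLE_WEIGHTS, batch_size=4, seed=seed)
    return list(islice(stream, count))
```

Calling `first_batches(13)` twice returns identical batches, which is the property that makes a training run repeatable.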

Committed to research to improve language technologies
