Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

17 January 2022
Julien Abadji
Pedro Ortiz Suarez
Laurent Romary
Benoît Sagot
    CLL

Papers citing "Towards a Cleaner Document-Oriented Multilingual Crawled Corpus"

47 / 97 papers shown
Title
CroissantLLM: A Truly Bilingual French-English Language Model
Manuel Faysse
Patrick Fernandes
Nuno M. Guerreiro
António Loison
Duarte M. Alves
...
François Yvon
André F.T. Martins
Gautier Viaud
Céline Hudelot
Pierre Colombo
43
33
0
01 Feb 2024
TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese
N. Corrêa
Sophia Falk
Shiza Fatimah
Aniket Sen
N. D. Oliveira
20
9
0
30 Jan 2024
MultiMUC: Multilingual Template Filling on MUC-4
William Gantt
Shabnam Behzad
Hannah YoungEun An
Yunmo Chen
Aaron Steven White
Benjamin Van Durme
M. Yarmohammadi
32
3
0
29 Jan 2024
TURNA: A Turkish Encoder-Decoder Language Model for Enhanced Understanding and Generation
Gökçe Uludoğan
Zeynep Yirmibeşoğlu Balal
Furkan Akkurt
Melikşah Türker
Onur Gungor
S. Uskudarli
31
12
0
25 Jan 2024
LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language
Pierpaolo Basile
Elio Musacchio
Marco Polignano
Lucia Siciliani
G. Fiameni
Giovanni Semeraro
26
36
0
15 Dec 2023
Toxic language detection: a systematic review of Arabic datasets
Imene Bensalem
Paolo Rosso
Hanane Zitouni
25
4
0
12 Dec 2023
TurkishBERTweet: Fast and Reliable Large Language Model for Social Media Analysis
Ali Najafi
Onur Varol
VLM
19
11
0
29 Nov 2023
Oasis: Data Curation and Assessment System for Pretraining of Large Language Models
Tong Zhou
Yubo Chen
Pengfei Cao
Kang Liu
Jun Zhao
Shengping Liu
14
3
0
21 Nov 2023
MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China
Chen Zhang
Mingxu Tao
Quzhe Huang
Jiuheng Lin
Zhibin Chen
Yansong Feng
25
2
0
14 Nov 2023
GreekT5: A Series of Greek Sequence-to-Sequence Models for News Summarization
Nikolaos Giarelis
Charalampos Mastrokostas
N. Karacapilidis
29
2
0
13 Nov 2023
Efficiently Adapting Pretrained Language Models To New Languages
Zoltan Csaki
Pian Pawakapan
Urmish Thakker
Qiantong Xu
CLL
21
17
0
09 Nov 2023
Question answering using deep learning in low resource Indian language Marathi
Dhiraj Amin
S. Govilkar
Sagar Kulkarni
21
2
0
27 Sep 2023
Sequence-to-Sequence Spanish Pre-trained Language Models
Vladimir Araujo
Maria Mihaela Truşcǎ
Rodrigo Tufino
Marie-Francine Moens
22
2
0
20 Sep 2023
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Thuat Nguyen
Chien Van Nguyen
Viet Dac Lai
Hieu Man
Nghia Trung Ngo
Franck Dernoncourt
Ryan A. Rossi
Thien Huu Nguyen
18
93
0
17 Sep 2023
ChatGPT MT: Competitive for High- (but not Low-) Resource Languages
Nathaniel R. Robinson
Perez Ogayo
David R. Mortensen
Graham Neubig
18
29
0
14 Sep 2023
Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages
C.M. Downey
Terra Blevins
Nora Goldfine
Shane Steinert-Threlkeld
25
8
0
09 Sep 2023
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
Sneha Kudugunta
Isaac Caswell
Biao Zhang
Xavier Garcia
Christopher A. Choquette-Choo
...
Derrick Xin
Aditya Kusupati
Romi Stella
Ankur Bapna
Orhan Firat
59
118
0
09 Sep 2023
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
Lucas Bandarkar
Davis Liang
Benjamin Muller
Mikel Artetxe
Satya Narayan Shukla
Don Husa
Naman Goyal
Abhinandan Krishnan
Luke Zettlemoyer
Madian Khabsa
28
128
0
31 Aug 2023
Empowering Cross-lingual Abilities of Instruction-tuned Large Language Models by Translation-following demonstrations
Leonardo Ranaldi
Giulia Pucci
André Freitas
25
33
0
27 Aug 2023
A Survey of Spanish Clinical Language Models
Guillem García Subies
Á. Jiménez
Paloma Martínez
LM&MA
ELM
LRM
14
0
0
04 Aug 2023
Developing and Evaluating Tiny to Medium-Sized Turkish BERT Models
Himmet Toprak Kesgin
M. K. Yuce
M. Amasyalı
14
6
0
26 Jul 2023
A Novel Pipeline for Improving Optical Character Recognition through Post-processing Using Natural Language Processing
Aishik Rakshit
Samyak Mehta
Anirban Dasgupta
18
0
0
09 Jul 2023
Matching Pairs: Attributing Fine-Tuned Models to their Pre-Trained Large Language Models
Myles Foley
Ambrish Rawat
Taesung Lee
Yufang Hou
Gabriele Picco
Giulio Zizzo
DeLMO
30
5
0
15 Jun 2023
GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration
Aleksandra Piktus
Odunayo Ogundepo
Christopher Akiki
Akintunde Oladipo
Xinyu Crystina Zhang
Hailey Schoelkopf
Stella Biderman
Martin Potthast
Jimmy J. Lin
CVBM
28
10
0
02 Jun 2023
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Guilherme Penedo
Quentin Malartic
Daniel Hesslow
Ruxandra-Aimée Cojocaru
Alessandro Cappelli
Hamza Alobeidli
B. Pannier
Ebtesam Almazrouei
Julien Launay
21
744
0
01 Jun 2023
DUMB: A Benchmark for Smart Evaluation of Dutch Models
Wietse de Vries
Martijn B. Wieling
Malvina Nissim
ELM
ALM
MoE
26
6
0
22 May 2023
Exploiting Biased Models to De-bias Text: A Gender-Fair Rewriting Model
Chantal Amrhein
Florian Schottmann
Rico Sennrich
Samuel Laubli
17
17
0
18 May 2023
Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*
João Rodrigues
Luís Gomes
Joao Silva
António Branco
Rodrigo Santos
Henrique Lopes Cardoso
T. Osório
22
43
0
11 May 2023
Effects of sub-word segmentation on performance of transformer language models
Jue Hou
Anisia Katinskaia
Anh Vu
R. Yangarber
11
4
0
09 May 2023
CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data
M. Turski
Tomasz Stanislawek
Karol Kaczmarek
Pawel Dyda
Filip Graliński
14
10
0
28 Apr 2023
A Survey of Corpora for Germanic Low-Resource Languages and Dialects
Verena Blaschke
Hinrich Schütze
Barbara Plank
19
13
0
19 Apr 2023
GreekBART: The First Pretrained Greek Sequence-to-Sequence Model
Iakovos Evdaimon
Hadi Abdine
Christos Xypolopoulos
Stamatis Outsios
Michalis Vazirgiannis
Giorgos Stamou
VLM
23
7
0
03 Apr 2023
Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data
Timm Jansen
Yangling Tong
V. Zevallos
Pedro Ortiz Suarez
16
17
0
20 Dec 2022
Modern French Poetry Generation with RoBERTa and GPT-2
Mika Hämäläinen
Khalid Alnajjar
Thierry Poibeau
BDL
19
10
0
06 Dec 2022
RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use
Pieter Delobelle
Thomas Winters
Bettina Berendt
8
5
0
15 Nov 2022
The VolcTrans System for WMT22 Multilingual Machine Translation Task
Xian Qian
Kai Hu
Jiaqiang Wang
Yifeng Liu
Xingyuan Pan
Jun Cao
Mingxuan Wang
21
1
0
20 Oct 2022
Language Varieties of Italy: Technology Challenges and Opportunities
Alan Ramponi
19
7
0
20 Sep 2022
Visual Grounding of Inter-lingual Word-Embeddings
W. Mohammed
Hassan Shahmohammadi
Hendrik P. A. Lensch
R. Baayen
8
1
0
08 Sep 2022
naab: A ready-to-use plug-and-play corpus for Farsi
Sadra Sabouri
Elnaz Rahmati
S. Gooran
Hossein Sameti
AI4CE
21
3
0
29 Aug 2022
Multilingual Coreference Resolution in Multiparty Dialogue
Boyuan Zheng
Patrick Xia
M. Yarmohammadi
Benjamin Van Durme
45
3
0
02 Aug 2022
esCorpius: A Massive Spanish Crawling Corpus
Asier Gutiérrez-Fandiño
David Pérez-Fernández
Jordi Armengol-Estapé
D. Griol
Z. Callejas
31
2
0
30 Jun 2022
Harnessing Multilingual Resources to Question Answering in Arabic
Khalid Alnajjar
Mika Hämäläinen
RALM
26
2
0
16 May 2022
Building Machine Translation Systems for the Next Thousand Languages
Ankur Bapna
Isaac Caswell
Julia Kreutzer
Orhan Firat
D. Esch
...
Apurva Shah
Yanping Huang
Z. Chen
Yonghui Wu
Macduff Hughes
54
98
0
09 May 2022
Analysis of Data Augmentation Methods for Low-Resource Maltese ASR
A. DeMarco
C. Mena
Albert Gatt
Claudia Borg
A. Williams
Lonneke van der Plas
11
0
0
15 Nov 2021
OpenHands: Making Sign Language Recognition Accessible with Pose-based Pretrained Models across Languages
Prem Selvaraj
Gokul Nc
Pratyush Kumar
Mitesh Khapra
VLM
SLR
48
53
0
12 Oct 2021
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
237
588
0
14 Jul 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
248
1,986
0
31 Dec 2020