A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics

19 December 2018

Martin Gerlach

Francesc Font-Clos

Papers citing "A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics"

When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

315

10 Apr 2026

Decoding the Past: Explainable Machine Learning Models for Dating Historical Texts

287

28 Nov 2025

Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation

164

11 Nov 2025

Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements

Patrick Haller

Jonas Golde

Alan Akbik

128

04 Nov 2025

LLM one-shot style transfer for Authorship Attribution and Verification

Pablo Miralles-González

287

15 Oct 2025

Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling

Bianca-Mihaela Ganescu

175

09 Oct 2025

LongTail-Swap: benchmarking language models' abilities on rare words

Robin Algayres

Charles-Éric Saint-James

149

05 Oct 2025

Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs

International Conference on Text, Speech and Dialogue (TSD), 2025

Haoyang Chen

Kumiko Tanaka-Ishii

AILaw

124

22 Sep 2025

Once Upon a Time: Interactive Learning for Storytelling with Small Language Models

Jonas Mayer Martins

Ali Hamza Bashir

Muhammad Rehan Khalid

Lisa Beinborn

192

19 Sep 2025

Influence-driven Curriculum Learning for Pre-training on Limited Data

285

21 Aug 2025

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

...

727

202

10 Apr 2025

BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training

Nikitas Theodoropoulos

591

24 Feb 2025

BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop

...

418

15 Feb 2025

Is a Peeled Apple Still Red? Evaluating LLMs' Ability for Conceptual Combination with Property Type

North American Chapter of the Association for Computational Linguistics (NAACL), 2025

719

10 Feb 2025

A Distributional Perspective on Word Learning in Neural Language Models

North American Chapter of the Association for Computational Linguistics (NAACL), 2025

Filippo Ficarra

Robert Bamler

Alex Warstadt

282

09 Feb 2025

BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation

Omnilingual MT Team

Pierre Yves Andrews

Mikel Artetxe

Mariano Coria Meglioli

...

Albert Ventayol-Boada

Shireen Yates

537

06 Feb 2025

Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

530

06 Dec 2024

AntLM: Bridging Causal and Masked Language Models

378

04 Dec 2024

When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets?

Srikrishna Iyer

FedML

451

25 Nov 2024

What Should Baby Models Read? Exploring Sample-Efficient Data Composition on Model Performance

Hong Meng Yam

Nathan J Paek

278

11 Nov 2024

Building, Reusing, and Generalizing Abstract Representations from Concrete Sequences

International Conference on Learning Representations (ICLR), 2024

327

27 Oct 2024

From Tokens to Words: On the Inner Lexicon of LLMs

International Conference on Learning Representations (ICLR), 2024

Guy Kaplan

Matanel Oren

Yuval Reif

Roy Schwartz

597

08 Oct 2024

Customizing Large Language Model Generation Style using Parameter-Efficient Finetuning

International Conference on Natural Language Generation (INLG), 2024

209

06 Sep 2024

Capturing Style in Author and Document Representation

Enzo Terreau

Antoine Gourru

Julien Velcin

313

18 Jul 2024

M2QA: Multi-domain Multilingual Question Answering

Iryna Gurevych

380

01 Jul 2024

YuLan: An Open-source Large Language Model

Yutao Zhu

Kun Zhou

Kelong Mao

Wentong Chen

Yiding Sun

...

Wenbing Huang

Ze-Feng Gao

Yueguo Chen

Weizheng Lu

Ji-Rong Wen

ALM ELM

201

28 Jun 2024

BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM

298

17 Jun 2024

From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models

Harsh Nishant Lalai

Aashish Anantha Ramakrishnan

Raj Sanjay Shah

Dongwon Lee

WaLM VLM

308

17 Jun 2024

Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

Xueyan Niu

Bo Bai

Lei Deng

Wei Han

275

14 May 2024

[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

383

09 Apr 2024

Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs

Kanishka Misra

Kyle Mahowald

601

28 Mar 2024

Not all layers are equally as important: Every Layer Counts BERT

Lucas Georges Gabriel Charpentier

David Samuel

311

03 Nov 2023

Mean BERTs make erratic language teachers: the effectiveness of latent bootstrapping in low-resource settings

David Samuel

234

30 Oct 2023

BabyStories: Can Reinforcement Learning Teach Baby Language Models to Write Better Stories?

291

25 Oct 2023

LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery

Tianyi Chen

390

24 Oct 2023

ChapGTP, ILLC's Attempt at Raising a BabyLM: Improving Data Efficiency by Automatic Task Formation

Jaap Jumelet

Michael Hanna

Marianne de Heer Kloots

Anna Langedijk

Charlotte Pouw

Oskar van der Wal

255

17 Oct 2023

Understanding writing style in social media with a supervised contrastively pre-trained transformer

Knowledge-Based Systems (KBS), 2023

Javier Huertas-Tato

Alejandro Martín

David Camacho

420

17 Oct 2023

A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages

Findings (Findings), 2023

Nikita Martynov

Mark Baushenko

Anastasia Kozlova

Katerina Kolomeytseva

Aleksandr Abramov

Alena Fenogenova

315

18 Aug 2023

Quantifying the Dissimilarity of Texts

Benjamin Shade

E. Altmann

183

03 May 2023

Extension of Dictionary-Based Compression Algorithms for the Quantitative Visualization of Patterns from Log Files

Igor Cherepanov

Jonathan Geraldi Joewono

Arjan Kuijper

Jörn Kohlhammer

249

10 Apr 2023

Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

318

27 Jan 2023

Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

Yinghao Aaron Li

Cong Han

Xilin Jiang

N. Mesgarani

192

20 Jan 2023

PART: Pre-trained Authorship Representation Transformer

Javier Huertas-Tato

Álvaro Huertas-García

Alejandro Martín

463

30 Sep 2022

On the State of the Art in Authorship Attribution and Authorship Verification

Jacob Tyo

Bhuwan Dhingra

Zachary Chase Lipton

329

14 Sep 2022

A decomposition of book structure through ousiometric fluctuations in cumulative word-time

Humanities and Social Sciences Communications (HSSC), 2022

584

19 Aug 2022

Controllable Data Generation by Deep Learning: A Review

ACM Computing Surveys (ACM CSUR), 2022

810

19 Jul 2022

Text characterization based on recurrence networks

Information Sciences (Inf. Sci.), 2022

Bárbara C. e Souza

F. N. Silva

Henrique F. de Arruda

182

17 Jan 2022

Risks of AI Foundation Models in Education

Su Lin Blodgett

Michael A. Madaio

UQCV

173

19 Oct 2021

Joint prediction of truecasing and punctuation for conversational speech in low-resource scenarios

R. Pappagari

Piotr Żelasko

Agnieszka Mikołajczyk

Piotr Pęzik

Najim Dehak

169

13 Sep 2021

A Statistical Model of Word Rank Evolution

Alex John Quijano

Rick Dale

Suzanne S. Sindi

410

21 Jul 2021