1812.08092
Cited By
A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics
19 December 2018
Martin Gerlach
Francesc Font-Clos
Papers citing
"A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics"
50 / 54 papers shown
When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection
Lang Gao
Xuhui Li
Chenxi Wang
Mingzhe Li
Wei Liu
Zirui Song
J. Zhang
Rui Yan
Preslav Nakov
Xiuying Chen
DeLMO
315
1
0
10 Apr 2026
Decoding the Past: Explainable Machine Learning Models for Dating Historical Texts
Paulo J. N. Pinto
A. Pinho
Diogo Pratas
AI4CE
287
0
0
28 Nov 2025
Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation
Nan Bao
Yifan Zhao
Lin Zhu
Jia Li
164
0
0
11 Nov 2025
Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements
Patrick Haller
Jonas Golde
Alan Akbik
128
1
0
04 Nov 2025
LLM one-shot style transfer for Authorship Attribution and Verification
Pablo Miralles-González
Javier Huertas-Tato
Alejandro Martín
David Camacho
DeLMO
287
1
0
15 Oct 2025
Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling
Bianca-Mihaela Ganescu
Suchir Salhan
Andrew Caines
P. Buttery
VLM
175
2
0
09 Oct 2025
LongTail-Swap: benchmarking language models' abilities on rare words
Robin Algayres
Charles-Éric Saint-James
Mahi Luthra
Jiayi Shen
Dongyan Lin
Youssef Benchekroun
Rashel Moritz
Juan Pino
Emmanuel Dupoux
149
1
0
05 Oct 2025
Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs
International Conference on Text, Speech and Dialogue (TSD), 2025
Haoyang Chen
Kumiko Tanaka-Ishii
AILaw
124
0
0
22 Sep 2025
Once Upon a Time: Interactive Learning for Storytelling with Small Language Models
Jonas Mayer Martins
Ali Hamza Bashir
Muhammad Rehan Khalid
Lisa Beinborn
192
0
0
19 Sep 2025
Influence-driven Curriculum Learning for Pre-training on Limited Data
Loris Schoenegger
Lukas Thoma
Terra Blevins
Benjamin Roth
285
1
0
21 Aug 2025
Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt
Aaron Mueller
Leshem Choshen
E. Wilcox
Chengxu Zhuang
...
Rafael Mosquera
Bhargavi Paranjape
Adina Williams
Tal Linzen
Ryan Cotterell
727
202
0
10 Apr 2025
BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training
Nikitas Theodoropoulos
Giorgos Filandrianos
Vassilis Lyberatos
Maria Lymperaiou
Giorgos Stamou
SyDa
591
3
0
24 Feb 2025
BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop
Lucas Charpentier
Leshem Choshen
Ryan Cotterell
Mustafa Omer Gul
Michael Y. Hu
...
Candace Ross
Raj Sanjay Shah
Alex Warstadt
Ethan Gotlieb Wilcox
Adina Williams
418
31
0
15 Feb 2025
Is a Peeled Apple Still Red? Evaluating LLMs' Ability for Conceptual Combination with Property Type
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Seokwon Song
Taehyun Lee
Jaewoo Ahn
Jae Hyuk Sung
Gunhee Kim
CoGe
719
1
0
10 Feb 2025
A Distributional Perspective on Word Learning in Neural Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Filippo Ficarra
Ryan Cotterell
Alex Warstadt
282
2
0
09 Feb 2025
BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation
Omnilingual MT Team
Pierre Yves Andrews
Mikel Artetxe
Mariano Coria Meglioli
Marta R. Costa-jussá
...
Eduardo Sánchez
Ioannis Tsiamas
Arina Turkatenko
Albert Ventayol-Boada
Shireen Yates
537
4
0
06 Feb 2025
Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Michael Y. Hu
Aaron Mueller
Candace Ross
Adina Williams
Tal Linzen
Chengxu Zhuang
Ryan Cotterell
Leshem Choshen
Alex Warstadt
Ethan Gotlieb Wilcox
530
53
0
06 Dec 2024
AntLM: Bridging Causal and Masked Language Models
Xinru Yu
Bin Guo
Shiwei Luo
Jiadong Wang
Changzhi Sun
Man Lan
CLL
378
6
0
04 Dec 2024
When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets?
Srikrishna Iyer
FedML
451
0
0
25 Nov 2024
What Should Baby Models Read? Exploring Sample-Efficient Data Composition on Model Performance
Hong Meng Yam
Nathan J Paek
278
2
0
11 Nov 2024
Building, Reusing, and Generalizing Abstract Representations from Concrete Sequences
International Conference on Learning Representations (ICLR), 2024
Shuchen Wu
Mirko Thalmann
Peter Dayan
Zeynep Akata
Eric Schulz
VLM
327
2
0
27 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
International Conference on Learning Representations (ICLR), 2024
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
597
40
0
08 Oct 2024
Customizing Large Language Model Generation Style using Parameter-Efficient Finetuning
International Conference on Natural Language Generation (INLG), 2024
Xinyue Liu
Harshita Diddee
Daphne Ippolito
ALM
209
12
0
06 Sep 2024
Capturing Style in Author and Document Representation
Enzo Terreau
Antoine Gourru
Julien Velcin
313
2
0
18 Jul 2024
M2QA: Multi-domain Multilingual Question Answering
Leon Arne Engländer
Hannah Sterz
Clifton A. Poth
Jonas Pfeiffer
Ilia Kuznetsov
Iryna Gurevych
VLM
380
6
0
01 Jul 2024
YuLan: An Open-source Large Language Model
Yutao Zhu
Kun Zhou
Kelong Mao
Wentong Chen
Yiding Sun
...
Wenbing Huang
Ze-Feng Gao
Yueguo Chen
Weizheng Lu
Ji-Rong Wen
ALM
ELM
201
3
0
28 Jun 2024
BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM
Zhewen Shen
Aditya Joshi
Ruey-Cheng Chen
CLL
298
5
0
17 Jun 2024
From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models
Harsh Nishant Lalai
Aashish Anantha Ramakrishnan
Raj Sanjay Shah
Dongwon Lee
WaLM
VLM
308
5
0
17 Jun 2024
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
Xueyan Niu
Bo Bai
Lei Deng
Wei Han
275
14
0
14 May 2024
[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
Leshem Choshen
Ryan Cotterell
Michael Y. Hu
Tal Linzen
Aaron Mueller
Candace Ross
Alex Warstadt
Ethan Gotlieb Wilcox
Adina Williams
Chengxu Zhuang
383
39
0
09 Apr 2024
Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs
Kanishka Misra
Kyle Mahowald
601
51
0
28 Mar 2024
Not all layers are equally as important: Every Layer Counts BERT
Lucas Georges Gabriel Charpentier
David Samuel
311
32
0
03 Nov 2023
Mean BERTs make erratic language teachers: the effectiveness of latent bootstrapping in low-resource settings
David Samuel
234
4
0
30 Oct 2023
BabyStories: Can Reinforcement Learning Teach Baby Language Models to Write Better Stories?
Xingmeng Zhao
Tongnian Wang
Sheri Osborn
Anthony Rios
291
11
0
25 Oct 2023
LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery
Tianyi Chen
Tianyu Ding
Badal Yadav
Ilya Zharkov
Luming Liang
390
42
0
24 Oct 2023
ChapGTP, ILLC's Attempt at Raising a BabyLM: Improving Data Efficiency by Automatic Task Formation
Jaap Jumelet
Michael Hanna
Marianne de Heer Kloots
Anna Langedijk
Charlotte Pouw
Oskar van der Wal
255
4
0
17 Oct 2023
Understanding writing style in social media with a supervised contrastively pre-trained transformer
Knowledge-Based Systems (KBS), 2023
Javier Huertas-Tato
Alejandro Martín
David Camacho
420
15
0
17 Oct 2023
A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages
Findings (Findings), 2023
Nikita Martynov
Mark Baushenko
Anastasia Kozlova
Katerina Kolomeytseva
Aleksandr Abramov
Alena Fenogenova
315
10
0
18 Aug 2023
Quantifying the Dissimilarity of Texts
Benjamin Shade
E. Altmann
183
4
0
03 May 2023
Extension of Dictionary-Based Compression Algorithms for the Quantitative Visualization of Patterns from Log Files
Igor Cherepanov
Jonathan Geraldi Joewono
Arjan Kuijper
Jörn Kohlhammer
249
0
0
10 Apr 2023
Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
Alex Warstadt
Leshem Choshen
Aaron Mueller
Adina Williams
Ethan Gotlieb Wilcox
Chengxu Zhuang
318
77
0
27 Jan 2023
Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Yinghao Aaron Li
Cong Han
Xilin Jiang
N. Mesgarani
192
34
0
20 Jan 2023
PART: Pre-trained Authorship Representation Transformer
Javier Huertas-Tato
Álvaro Huertas-García
Alejandro Martín
463
16
0
30 Sep 2022
On the State of the Art in Authorship Attribution and Authorship Verification
Jacob Tyo
Bhuwan Dhingra
Zachary Chase Lipton
329
37
0
14 Sep 2022
A decomposition of book structure through ousiometric fluctuations in cumulative word-time
Humanities and Social Sciences Communications (HSSC), 2022
M. Fudolig
Thayer Alshaabi
Kathryn Cramer
C. Danforth
P. Dodds
584
5
0
19 Aug 2022
Controllable Data Generation by Deep Learning: A Review
ACM Computing Surveys (ACM CSUR), 2022
Shiyu Wang
Yuanqi Du
Xiaojie Guo
Bo Pan
Zhaohui Qin
Liang Zhao
810
43
0
19 Jul 2022
Text characterization based on recurrence networks
Information Sciences (Inf. Sci.), 2022
Bárbara C. e Souza
F. N. Silva
Henrique F. de Arruda
Giovana D. da Silva
L. D. F. Costa
D. R. Amancio
AI4CE
182
10
0
17 Jan 2022
Risks of AI Foundation Models in Education
Su Lin Blodgett
Michael A. Madaio
UQCV
173
18
0
19 Oct 2021
Joint prediction of truecasing and punctuation for conversational speech in low-resource scenarios
R. Pappagari
Piotr Żelasko
Agnieszka Mikołajczyk
Piotr Pęzik
Najim Dehak
169
12
0
13 Sep 2021
A Statistical Model of Word Rank Evolution
Alex John Quijano
Rick Dale
Suzanne S. Sindi
410
0
0
21 Jul 2021