A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics

19 December 2018

Papers citing "A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics"

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora Alex Warstadt Aaron Mueller Leshem Choshen E. Wilcox Chengxu Zhuang ... Rafael Mosquera Bhargavi Paranjape Adina Williams Tal Linzen Ryan Cotterell 202 121 0 10 Apr 2025
BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training Nikitas Theodoropoulos Giorgos Filandrianos Vassilis Lyberatos Maria Lymperaiou Giorgos Stamou SyDa 219 1 0 24 Feb 2025
BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop Lucas Charpentier Leshem Choshen Ryan Cotterell Mustafa Omer Gul Michael Y. Hu ... Candace Ross Raj Sanjay Shah Alex Warstadt Ethan Gotlieb Wilcox Adina Williams 122 5 0 15 Feb 2025
Is a Peeled Apple Still Red? Evaluating LLMs' Ability for Conceptual Combination with Property Type Seokwon Song Taehyun Lee Jaewoo Ahn Jae Hyuk Sung Gunhee Kim CoGe 183 1 0 10 Feb 2025
A Distributional Perspective on Word Learning in Neural Language Models Filippo Ficarra Ryan Cotterell Alex Warstadt 82 1 0 09 Feb 2025
BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation Omnilingual MT Team Pierre Yves Andrews Mikel Artetxe Mariano Coria Meglioli Marta R. Costa-jussá ... Eduardo Sánchez Ioannis Tsiamas Arina Turkatenko Albert Ventayol-Boada Shireen Yates 185 0 0 06 Feb 2025
Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora Michael Y. Hu Aaron Mueller Candace Ross Adina Williams Tal Linzen Chengxu Zhuang Ryan Cotterell Leshem Choshen Alex Warstadt Ethan Gotlieb Wilcox 180 14 0 06 Dec 2024
AntLM: Bridging Causal and Masked Language Models Xinru Yu Bin Guo Shiwei Luo Jiadong Wang Tao Ji Yuanbin Wu CLL 135 1 0 04 Dec 2024
When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets? Srikrishna Iyer FedML 172 0 0 25 Nov 2024
What Should Baby Models Read? Exploring Sample-Efficient Data Composition on Model Performance Hong Meng Yam Nathan J Paek 119 1 0 11 Nov 2024
Building, Reusing, and Generalizing Abstract Representations from Concrete Sequences Shuchen Wu Mirko Thalmann Peter Dayan Zeynep Akata Eric Schulz VLM 109 0 0 27 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs Guy Kaplan Matanel Oren Yuval Reif Roy Schwartz 107 14 0 08 Oct 2024
Customizing Large Language Model Generation Style using Parameter-Efficient Finetuning Xinyue Liu Harshita Diddee Daphne Ippolito ALM 42 3 0 06 Sep 2024
Capturing Style in Author and Document Representation Enzo Terreau Antoine Gourru Julien Velcin 78 1 0 18 Jul 2024
M2QA: Multi-domain Multilingual Question Answering Leon Arne Engländer Hannah Sterz Clifton A. Poth Jonas Pfeiffer Ilia Kuznetsov Iryna Gurevych VLM 76 2 0 01 Jul 2024
YuLan: An Open-source Large Language Model Yutao Zhu Kun Zhou Kelong Mao Wentong Chen Yiding Sun ... Wenbing Huang Ze-Feng Gao Yueguo Chen Weizheng Lu Ji-Rong Wen ALM ELM 65 1 0 28 Jun 2024
BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM Zhewen Shen Aditya Joshi Ruey-Cheng Chen CLL 92 2 0 17 Jun 2024
From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models Harsh Nishant Lalai Aashish Anantha Ramakrishnan Raj Sanjay Shah Dongwon Lee WaLM VLM 69 2 0 17 Jun 2024
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory Xueyan Niu Bo Bai Lei Deng Wei Han 83 8 0 14 May 2024
[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus Leshem Choshen Ryan Cotterell Michael Y. Hu Tal Linzen Aaron Mueller Candace Ross Alex Warstadt Ethan Gotlieb Wilcox Adina Williams Chengxu Zhuang 105 24 0 09 Apr 2024
Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs Kanishka Misra Kyle Mahowald 122 27 0 28 Mar 2024
Not all layers are equally as important: Every Layer Counts BERT Lucas Georges Gabriel Charpentier David Samuel 106 18 0 03 Nov 2023
Mean BERTs make erratic language teachers: the effectiveness of latent bootstrapping in low-resource settings David Samuel 54 4 0 30 Oct 2023
BabyStories: Can Reinforcement Learning Teach Baby Language Models to Write Better Stories? Xingmeng Zhao Tongnian Wang Sheri Osborn Anthony Rios 53 6 0 25 Oct 2023
LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery Tianyi Chen Tianyu Ding Badal Yadav Ilya Zharkov Luming Liang 113 32 0 24 Oct 2023
ChapGTP, ILLC's Attempt at Raising a BabyLM: Improving Data Efficiency by Automatic Task Formation Jaap Jumelet Michael Hanna Marianne de Heer Kloots Anna Langedijk Charlotte Pouw Oskar van der Wal 82 3 0 17 Oct 2023
Understanding writing style in social media with a supervised contrastively pre-trained transformer Javier Huertas-Tato Alejandro Martín David Camacho 123 6 0 17 Oct 2023
A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages Nikita Martynov Mark Baushenko Anastasia Kozlova Katerina Kolomeytseva Aleksandr Abramov Alena Fenogenova 66 4 0 18 Aug 2023
Quantifying the Dissimilarity of Texts Benjamin Shade E. Altmann 73 1 0 03 May 2023
Extension of Dictionary-Based Compression Algorithms for the Quantitative Visualization of Patterns from Log Files Igor Cherepanov Jonathan Geraldi Joewono Arjan Kuijper Jörn Kohlhammer 102 0 0 10 Apr 2023
Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus Alex Warstadt Leshem Choshen Aaron Mueller Adina Williams Ethan Gotlieb Wilcox Chengxu Zhuang 115 57 0 27 Jan 2023
Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions Yinghao Aaron Li Cong Han Xilin Jiang N. Mesgarani 64 24 0 20 Jan 2023
PART: Pre-trained Authorship Representation Transformer Javier Huertas-Tato Álvaro Huertas-García Alejandro Martín 137 9 0 30 Sep 2022
On the State of the Art in Authorship Attribution and Authorship Verification Jacob Tyo Bhuwan Dhingra Zachary Chase Lipton 102 25 0 14 Sep 2022
A decomposition of book structure through ousiometric fluctuations in cumulative word-time M. Fudolig Thayer Alshaabi Kathryn Cramer C. Danforth P. Dodds 117 5 0 19 Aug 2022
Controllable Data Generation by Deep Learning: A Review Shiyu Wang Yuanqi Du Xiaojie Guo Bo Pan Zhaohui Qin Liang Zhao 99 28 0 19 Jul 2022
Text characterization based on recurrence networks Bárbara C. e Souza F. N. Silva Henrique F. de Arruda Giovana D. da Silva L. D. F. Costa D. R. Amancio AI4CE 59 9 0 17 Jan 2022
Risks of AI Foundation Models in Education Su Lin Blodgett Michael A. Madaio UQCV 53 15 0 19 Oct 2021
Joint prediction of truecasing and punctuation for conversational speech in low-resource scenarios R. Pappagari Piotr Żelasko Agnieszka Mikołajczyk Piotr Pęzik Najim Dehak 64 11 0 13 Sep 2021
A Statistical Model of Word Rank Evolution Alex John Quijano Rick Dale Suzanne S. Sindi 51 0 0 21 Jul 2021
Large-Scale Intelligent Microservices Mark Hamilton Nick Gonsalves Christina Lee Anand Raman Brendan Walsh ... Dalitso Banda Lucy Zhang Mei Gao Lei Zhang William T. Freeman SyDa AI4TS 35 5 0 17 Sep 2020
Critical Thinking for Language Models Gregor Betz Christian Voigt Kyle Richardson SyDa ReLM LRM AI4CE 111 35 0 15 Sep 2020
Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter Thayer Alshaabi J. L. Adams M. V. Arnold J. Minot D. R. Dewhurst A. J. Reagan C. Danforth P. Dodds 110 41 0 25 Jul 2020
Pull out all the stops: Textual analysis via punctuation sequences Alexandra N. M. Darmon Marya Bazzi S. Howison M. A. Porter 44 11 0 31 Dec 2018