A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics

19 December 2018
Martin Gerlach
Francesc Font-Clos

Papers citing "A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics"

44 / 44 papers shown
Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt
Aaron Mueller
Leshem Choshen
E. Wilcox
Chengxu Zhuang
...
Rafael Mosquera
Bhargavi Paranjape
Adina Williams
Tal Linzen
Ryan Cotterell
202
121
0
10 Apr 2025
BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training
Nikitas Theodoropoulos
Giorgos Filandrianos
Vassilis Lyberatos
Maria Lymperaiou
Giorgos Stamou
SyDa
219
1
0
24 Feb 2025
BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop
Lucas Charpentier
Leshem Choshen
Ryan Cotterell
Mustafa Omer Gul
Michael Y. Hu
...
Candace Ross
Raj Sanjay Shah
Alex Warstadt
Ethan Gotlieb Wilcox
Adina Williams
122
5
0
15 Feb 2025
Is a Peeled Apple Still Red? Evaluating LLMs' Ability for Conceptual Combination with Property Type
Seokwon Song
Taehyun Lee
Jaewoo Ahn
Jae Hyuk Sung
Gunhee Kim
CoGe
183
1
0
10 Feb 2025
A Distributional Perspective on Word Learning in Neural Language Models
Filippo Ficarra
Ryan Cotterell
Alex Warstadt
82
1
0
09 Feb 2025
BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation
Omnilingual MT Team
Pierre Yves Andrews
Mikel Artetxe
Mariano Coria Meglioli
Marta R. Costa-jussá
...
Eduardo Sánchez
Ioannis Tsiamas
Arina Turkatenko
Albert Ventayol-Boada
Shireen Yates
185
0
0
06 Feb 2025
Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Michael Y. Hu
Aaron Mueller
Candace Ross
Adina Williams
Tal Linzen
Chengxu Zhuang
Ryan Cotterell
Leshem Choshen
Alex Warstadt
Ethan Gotlieb Wilcox
180
14
0
06 Dec 2024
AntLM: Bridging Causal and Masked Language Models
Xinru Yu
Bin Guo
Shiwei Luo
Jiadong Wang
Tao Ji
Yuanbin Wu
CLL
135
1
0
04 Dec 2024
When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets?
Srikrishna Iyer
FedML
172
0
0
25 Nov 2024
What Should Baby Models Read? Exploring Sample-Efficient Data Composition on Model Performance
Hong Meng Yam
Nathan J Paek
119
1
0
11 Nov 2024
Building, Reusing, and Generalizing Abstract Representations from Concrete Sequences
Shuchen Wu
Mirko Thalmann
Peter Dayan
Zeynep Akata
Eric Schulz
VLM
109
0
0
27 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
107
14
0
08 Oct 2024
Customizing Large Language Model Generation Style using Parameter-Efficient Finetuning
Xinyue Liu
Harshita Diddee
Daphne Ippolito
ALM
40
3
0
06 Sep 2024
Capturing Style in Author and Document Representation
Enzo Terreau
Antoine Gourru
Julien Velcin
76
1
0
18 Jul 2024
M2QA: Multi-domain Multilingual Question Answering
Leon Arne Engländer
Hannah Sterz
Clifton A. Poth
Jonas Pfeiffer
Ilia Kuznetsov
Iryna Gurevych
VLM
76
2
0
01 Jul 2024
YuLan: An Open-source Large Language Model
Yutao Zhu
Kun Zhou
Kelong Mao
Wentong Chen
Yiding Sun
...
Wenbing Huang
Ze-Feng Gao
Yueguo Chen
Weizheng Lu
Ji-Rong Wen
ALM
ELM
65
1
0
28 Jun 2024
BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM
Zhewen Shen
Aditya Joshi
Ruey-Cheng Chen
CLL
92
2
0
17 Jun 2024
From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models
Harsh Nishant Lalai
Aashish Anantha Ramakrishnan
Raj Sanjay Shah
Dongwon Lee
WaLM
VLM
69
2
0
17 Jun 2024
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
Xueyan Niu
Bo Bai
Lei Deng
Wei Han
83
8
0
14 May 2024
[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
Leshem Choshen
Ryan Cotterell
Michael Y. Hu
Tal Linzen
Aaron Mueller
Candace Ross
Alex Warstadt
Ethan Gotlieb Wilcox
Adina Williams
Chengxu Zhuang
105
24
0
09 Apr 2024
Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs
Kanishka Misra
Kyle Mahowald
122
27
0
28 Mar 2024
Not all layers are equally as important: Every Layer Counts BERT
Lucas Georges Gabriel Charpentier
David Samuel
106
18
0
03 Nov 2023
Mean BERTs make erratic language teachers: the effectiveness of latent bootstrapping in low-resource settings
David Samuel
54
4
0
30 Oct 2023
BabyStories: Can Reinforcement Learning Teach Baby Language Models to Write Better Stories?
Xingmeng Zhao
Tongnian Wang
Sheri Osborn
Anthony Rios
53
6
0
25 Oct 2023
LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery
Tianyi Chen
Tianyu Ding
Badal Yadav
Ilya Zharkov
Luming Liang
113
32
0
24 Oct 2023
ChapGTP, ILLC's Attempt at Raising a BabyLM: Improving Data Efficiency by Automatic Task Formation
Jaap Jumelet
Michael Hanna
Marianne de Heer Kloots
Anna Langedijk
Charlotte Pouw
Oskar van der Wal
82
3
0
17 Oct 2023
Understanding writing style in social media with a supervised contrastively pre-trained transformer
Javier Huertas-Tato
Alejandro Martín
David Camacho
123
6
0
17 Oct 2023
A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages
Nikita Martynov
Mark Baushenko
Anastasia Kozlova
Katerina Kolomeytseva
Aleksandr Abramov
Alena Fenogenova
66
4
0
18 Aug 2023
Quantifying the Dissimilarity of Texts
Benjamin Shade
E. Altmann
73
1
0
03 May 2023
Extension of Dictionary-Based Compression Algorithms for the Quantitative Visualization of Patterns from Log Files
Igor Cherepanov
Jonathan Geraldi Joewono
Arjan Kuijper
Jörn Kohlhammer
100
0
0
10 Apr 2023
Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
Alex Warstadt
Leshem Choshen
Aaron Mueller
Adina Williams
Ethan Gotlieb Wilcox
Chengxu Zhuang
115
57
0
27 Jan 2023
Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions
Yinghao Aaron Li
Cong Han
Xilin Jiang
N. Mesgarani
64
24
0
20 Jan 2023
PART: Pre-trained Authorship Representation Transformer
Javier Huertas-Tato
Álvaro Huertas-García
Alejandro Martín
137
9
0
30 Sep 2022
On the State of the Art in Authorship Attribution and Authorship Verification
Jacob Tyo
Bhuwan Dhingra
Zachary Chase Lipton
102
25
0
14 Sep 2022
A decomposition of book structure through ousiometric fluctuations in cumulative word-time
M. Fudolig
Thayer Alshaabi
Kathryn Cramer
C. Danforth
P. Dodds
117
5
0
19 Aug 2022
Controllable Data Generation by Deep Learning: A Review
Shiyu Wang
Yuanqi Du
Xiaojie Guo
Bo Pan
Zhaohui Qin
Liang Zhao
99
28
0
19 Jul 2022
Text characterization based on recurrence networks
Bárbara C. e Souza
F. N. Silva
Henrique F. de Arruda
Giovana D. da Silva
L. D. F. Costa
D. R. Amancio
AI4CE
52
9
0
17 Jan 2022
Risks of AI Foundation Models in Education
Su Lin Blodgett
Michael A. Madaio
UQCV
53
15
0
19 Oct 2021
Joint prediction of truecasing and punctuation for conversational speech in low-resource scenarios
R. Pappagari
Piotr Żelasko
Agnieszka Mikołajczyk
Piotr Pęzik
Najim Dehak
62
11
0
13 Sep 2021
A Statistical Model of Word Rank Evolution
Alex John Quijano
Rick Dale
Suzanne S. Sindi
49
0
0
21 Jul 2021
Large-Scale Intelligent Microservices
Mark Hamilton
Nick Gonsalves
Christina Lee
Anand Raman
Brendan Walsh
...
Dalitso Banda
Lucy Zhang
Mei Gao
Lei Zhang
William T. Freeman
SyDa
AI4TS
35
5
0
17 Sep 2020
Critical Thinking for Language Models
Gregor Betz
Christian Voigt
Kyle Richardson
SyDa
ReLM
LRM
AI4CE
111
35
0
15 Sep 2020
Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter
Thayer Alshaabi
J. L. Adams
M. V. Arnold
J. Minot
D. R. Dewhurst
A. J. Reagan
C. Danforth
P. Dodds
110
41
0
25 Jul 2020
Pull out all the stops: Textual analysis via punctuation sequences
Alexandra N. M. Darmon
Marya Bazzi
S. Howison
M. A. Porter
42
11
0
31 Dec 2018