arXiv:2505.15024
Diagnosing our datasets: How does my language model learn clinical information?
21 May 2025
Furong Jia
David Sontag
Monica Agrawal
LM&MA
Papers citing
"Diagnosing our datasets: How does my language model learn clinical information?"
8 / 8 papers shown
RedPajama: an Open Dataset for Training Large Language Models
Maurice Weber
Daniel Y. Fu
Quentin Anthony
Yonatan Oren
S. Adams
...
Tri Dao
Percy Liang
Christopher Ré
Irina Rish
Ce Zhang
182
66
0
19 Nov 2024
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini
Rodney Michael Kinney
Akshita Bhagia
Dustin Schwenk
David Atkinson
...
Hanna Hajishirzi
Iz Beltagy
Dirk Groeneveld
Jesse Dodge
Kyle Lo
56
255
0
31 Jan 2024
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Jiacheng Liu
Sewon Min
Luke Zettlemoyer
Yejin Choi
Hannaneh Hajishirzi
84
54
0
30 Jan 2024
Matching Patients to Clinical Trials with Large Language Models
Qiao Jin
Zifeng Wang
C. Floudas
Fangyuan Chen
Changlin Gong
Dara Bracken-Clarke
Elisabetta Xue
Yifan Yang
Jimeng Sun
Zhiyong Lu
LM&MA
84
98
0
27 Jul 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
220
4,085
0
09 Jun 2023
Large Language Models Struggle to Learn Long-Tail Knowledge
Nikhil Kandpal
H. Deng
Adam Roberts
Eric Wallace
Colin Raffel
RALM
KELM
82
409
0
15 Nov 2022
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
62
437
0
18 Apr 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
376
2,051
0
31 Dec 2020