What's In My Big Data?

31 October 2023
Yanai Elazar
Akshita Bhagia
Ian H. Magnusson
Abhilasha Ravichander
Dustin Schwenk
Alane Suhr
Pete Walsh
Dirk Groeneveld
Luca Soldaini
Sameer Singh
Hanna Hajishirzi
Noah A. Smith
Jesse Dodge

Papers citing "What's In My Big Data?"

50 / 76 papers shown
Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
Hao Xu
Jiacheng Liu
Yejin Choi
Noah A. Smith
Hannaneh Hajishirzi
10
0
0
13 Jun 2025
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Pierre-Carl Langlais
Carlos Rosas Hinostroza
Mattia Nee
Catherine Arnett
Pavel Chizhov
Eliot Jones
Irène Girard
David Mach
Anastasia Stasenko
Ivan P. Yamshchikov
AILaw
51
1
0
02 Jun 2025
We Need to Measure Data Diversity in NLP -- Better and Broader
Dong Nguyen
Esther Ploeger
30
0
0
26 May 2025
TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation
Wiebke Hutiri
Mircea Cimpoi
M. Scheuerman
Victoria Matthews
Alice Xiang
165
0
0
23 May 2025
Semantic Pivots Enable Cross-Lingual Transfer in Large Language Models
Kaiyu He
Tong Zhou
Yubo Chen
Delai Qiu
Shengping Liu
Kang Liu
Jun Zhao
LRM
55
0
0
22 May 2025
The Rise of Parameter Specialization for Knowledge Storage in Large Language Models
Yihuai Hong
Yiran Zhao
Wei Tang
Yang Deng
Yu Rong
Wenxuan Zhang
KELM
32
0
0
22 May 2025
Diagnosing our datasets: How does my language model learn clinical information?
Furong Jia
David Sontag
Monica Agrawal
LM&MA
206
1
0
21 May 2025
Tracing Multilingual Factual Knowledge Acquisition in Pretraining
Yihong Liu
Mingyang Wang
Amir Hossein Kargaran
Felicia Körner
Ercong Nie
Yun Xue
François Yvon
Hinrich Schütze
HILM KELM
127
0
0
20 May 2025
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
Carlo Siebenschuh
Kyle Hippe
Ozan Gokdemir
Alexander Brace
A. Khan
...
V. Vishwanath
R. Stevens
Arvind Ramanathan
Ian Foster
Robert Underwood
MoE
84
0
0
23 Apr 2025
Advancing AI-assisted Hardware Design with Hierarchical Decentralized Training and Personalized Inference-Time Optimization
Hao Mark Chen
Zehuan Zhang
Wanru Zhao
Nicholas D. Lane
Hongxiang Fan
18
0
0
21 Apr 2025
STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings
Saksham Rastogi
Pratyush Maini
Danish Pruthi
169
0
0
18 Apr 2025
On Linear Representations and Pretraining Data Frequency in Language Models
Jack Merullo
Noah A. Smith
Sarah Wiegreffe
Yanai Elazar
102
4
0
16 Apr 2025
Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models
Vishakh Padmakumar
Chen Yueh-Han
Jane Pan
Valerie Chen
He He
71
0
0
13 Apr 2025
WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization
I. Gevers
Victor De Marez
Luna De Bruyne
Walter Daelemans
65
0
0
31 Mar 2025
Developing and Utilizing a Large-Scale Cantonese Dataset for Multi-Tasking in Large Language Models
Jiyue Jiang
Alfred Kar Yin Truong
Yuxiao Chen
Qinghang Bao
Sheng Wang
Pengan Chen
Jinqiao Wang
Dianbo Sui
Yu Li
Chuan Wu
ALM
80
0
0
05 Mar 2025
Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases
Shanshan Xu
T. Y. S. S. Santosh
Yanai Elazar
Quirin Vogel
Barbara Plank
Matthias Grabmair
AILaw
165
0
0
25 Feb 2025
On Relation-Specific Neurons in Large Language Models
Yihong Liu
Runsheng Chen
Lea Hirlimann
Ahmad Dawar Hakimi
Mingyang Wang
Amir Hossein Kargaran
S. Rothe
François Yvon
Hinrich Schütze
KELM
89
0
0
24 Feb 2025
Interrogating LLM design under a fair learning doctrine
Johnny Tian-Zheng Wei
Maggie Wang
Ameya Godbole
Jonathan H. Choi
Robin Jia
113
0
0
22 Feb 2025
MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Andreas Opedal
Haruki Shirakami
Bernhard Schölkopf
Abulhair Saparov
Mrinmaya Sachan
LRM
134
3
0
17 Feb 2025
The Vendiscope: An Algorithmic Microscope For Data Collections
Amey P. Pasarkar
Adji Bousso Dieng
90
2
0
15 Feb 2025
Do we really have to filter out random noise in pre-training data for language models?
Jinghan Ru
Yuxin Xie
Xianwei Zhuang
Yuguo Yin
Zhihui Guo
Zhiming Liu
Qianli Ren
Yuexian Zou
190
6
0
10 Feb 2025
FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration
Youngjun Son
Chaewon Kim
Jaejin Lee
130
0
0
02 Jan 2025
QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs
Mohammad Aflah Khan
Neemesh Yadav
Sarah Masud
Md. Shad Akhtar
167
0
0
16 Dec 2024
How far can bias go? -- Tracing bias from pretraining data to alignment
Marion Thaler
Abdullatif Köksal
Alina Leidinger
Anna Korhonen
Hinrich Schütze
148
1
0
28 Nov 2024
Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
Sohee Yang
Nora Kassner
E. Gribovskaya
Sebastian Riedel
Mor Geva
LRM KELM ReLM
171
9
0
25 Nov 2024
Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?
Aaditya K. Singh
Muhammed Yusuf Kocyigit
Andrew Poulton
David Esiobu
Maria Lomeli
Gergely Szilvasy
Dieuwke Hupkes
82
13
0
06 Nov 2024
A Systematic Review of NeurIPS Dataset Management Practices
Yiwei Wu
Leah Ajmani
Shayne Longpre
Hanlin Li
89
0
0
31 Oct 2024
From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes
Zébulon Goriely
Richard Diehl Martinez
Andrew Caines
Lisa Beinborn
P. Buttery
CLL
98
5
0
30 Oct 2024
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Vipul Gupta
Candace Ross
David Pantoja
R. Passonneau
Megan Ung
Adina Williams
303
2
0
26 Oct 2024
What's New in My Data? Novelty Exploration via Contrastive Generation
Masaru Isonuma
Ivan Titov
59
0
0
18 Oct 2024
Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models
Eddie L. Ungless
Nikolas Vitsakis
Zeerak Talat
James Garforth
Bjorn Ross
Arno Onken
Atoosa Kasirzadeh
Alexandra Birch
71
1
0
17 Oct 2024
Measuring Spiritual Values and Bias of Large Language Models
Songyuan Liu
Ziyang Zhang
Runze Yan
Wei Wu
Carl Yang
Jiaying Lu
51
0
0
15 Oct 2024
TSDS: Data Selection for Task-Specific Model Finetuning
Zifan Liu
Amin Karbasi
Theodoros Rekatsinas
73
6
0
15 Oct 2024
Solving the Challenge Set without Solving the Task: On Winograd Schemas as a Test of Pronominal Coreference Resolution
Ian Porada
Jackie C.K. Cheung
71
0
0
12 Oct 2024
DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models
Ranchi Zhao
Zhen Leng Thai
Yifan Zhang
Shengding Hu
Yunqi Ba
Jie Zhou
Jie Cai
Zhiyuan Liu
Maosong Sun
145
1
0
08 Oct 2024
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
Amir Hossein Kargaran
Ali Modarressi
Nafiseh Nikeghbal
Jana Diesner
François Yvon
Hinrich Schütze
ELM
97
7
0
08 Oct 2024
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
Ezra Karger
Houtan Bastani
Chen Yueh-Han
Zachary Jacobs
Danny Halawi
Fred Zhang
P. Tetlock
139
9
0
30 Sep 2024
WinoPron: Revisiting English Winogender Schemas for Consistency, Coverage, and Grammatical Case
Vagrant Gautam
Julius Steuer
Eileen Bingert
Ray Johns
Anne Lauscher
Dietrich Klakow
82
4
0
09 Sep 2024
WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild
Yuntian Deng
Wenting Zhao
Jack Hessel
Xiang Ren
Claire Cardie
Yejin Choi
VLM
41
6
0
05 Sep 2024
Data Contamination Report from the 2024 CONDA Shared Task
Oscar Sainz
Iker García-Ferrero
Alon Jacovi
Jonas Hanselle
Yanai Elazar
...
Yu-Min Tseng
Vishaal Udandarao
Zengzhi Wang
Ruijie Xu
Jinglin Yang
109
6
0
31 Jul 2024
From Pre-training Corpora to Large Language Models: What Factors Influence LLM Performance in Causal Discovery Tasks?
Tao Feng
Zhuang Li
Niket Tandon
Zhuang Li
Xiaoxi Kang
Gholamreza Haffari
LRM
78
5
0
29 Jul 2024
Benchmarks as Microscopes: A Call for Model Metrology
Michael Stephen Saxon
Ari Holtzman
Peter West
William Y. Wang
Naomi Saphra
109
13
0
22 Jul 2024
A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training
Michał Perełkiewicz
Rafał Poświata
71
3
0
10 Jul 2024
On Leakage of Code Generation Evaluation Datasets
Alexandre Matton
Tom Sherborne
Dennis Aumiller
Elena Tommasone
Milad Alizadeh
Jingyi He
Raymond Ma
Maxime Voisin
Ellen Gilsenan-McMahon
Matthias Gallé
89
18
0
10 Jul 2024
Black Big Boxes: Do Language Models Hide a Theory of Adjective Order?
Jaap Jumelet
Lisa Bylinina
Willem H. Zuidema
Jakub Szymanik
108
4
0
02 Jul 2024
Detection and Measurement of Syntactic Templates in Generated Text
Chantal Shaib
Yanai Elazar
Junyi Jessy Li
Byron C. Wallace
90
20
0
28 Jun 2024
Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets
Melanie Walsh
Anna Preus
Maria Antoniak
75
9
0
27 Jun 2024
Evaluating Copyright Takedown Methods for Language Models
Boyi Wei
Weijia Shi
Yangsibo Huang
Noah A. Smith
Chiyuan Zhang
Luke Zettlemoyer
Kai Li
Peter Henderson
145
25
0
26 Jun 2024
Fantastic Copyrighted Beasts and How (Not) to Generate Them
Luxi He
Yangsibo Huang
Weijia Shi
Tinghao Xie
Haotian Liu
Yue Wang
Luke Zettlemoyer
Chiyuan Zhang
Danqi Chen
Peter Henderson
108
12
0
20 Jun 2024
Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG
William Merrill
Noah A. Smith
Yanai Elazar
ELM TDI
116
12
0
18 Jun 2024