2210.04261
Noise-Robust De-Duplication at Scale
9 October 2022
Emily Silcock
Luca D'Amico-Wong
Jinglin Yang
Melissa Dell
SyDa
Papers citing "Noise-Robust De-Duplication at Scale"
16 / 16 papers shown
Title
Towards the Three-Phase Dynamics of Generalization Power of a DNN
Yuxuan He
Junpeng Zhang
Hongyuan Zhang
Quanshi Zhang
AI4CE
26
0
0
11 May 2025
Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page
Michael McRae
AI4CE
36
0
0
17 Feb 2025
Mitigating Memorization In Language Models
Mansi Sakarvadia
Aswathy Ajith
Arham Khan
Nathaniel Hudson
Caleb Geniesse
Kyle Chard
Yaoqing Yang
Ian Foster
Michael W. Mahoney
KELM
MU
50
0
0
03 Oct 2024
Data Contamination Report from the 2024 CONDA Shared Task
Oscar Sainz
Iker García-Ferrero
Alon Jacovi
Jonas Hanselle
Yanai Elazar
...
Yu-Min Tseng
Vishaal Udandarao
Zengzhi Wang
Ruijie Xu
Jinglin Yang
34
5
0
31 Jul 2024
News Deja Vu: Connecting Past and Present with Semantic Search
Brevin Franklin
Emily Silcock
Abhishek Arora
Tom Bryan
Melissa Dell
21
1
0
21 Jun 2024
How Do Large Language Models Acquire Factual Knowledge During Pretraining?
Hoyeon Chang
Jinho Park
Seonghyeon Ye
Sohee Yang
Youngkyung Seo
Du-Seong Chang
Minjoon Seo
KELM
33
30
0
17 Jun 2024
Newswire: A Large-Scale Structured Database of a Century of Historical News
Emily Silcock
Abhishek Arora
Luca D'Amico-Wong
Melissa Dell
AI4TS
GNN
37
3
0
13 Jun 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping-Chia Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
47
36
0
26 May 2024
RETSim: Resilient and Efficient Text Similarity
Marina Zhang
Owen Vallis
Aysegul Bumin
Tanay Vakharia
Elie Bursztein
23
1
0
28 Nov 2023
LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models
Abhishek Arora
Melissa Dell
KELM
23
8
0
02 Sep 2023
American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers
Melissa Dell
Jacob Carlson
Tom Bryan
Emily Silcock
Abhishek Arora
Zejiang Shen
Luca D'Amico-Wong
Q. Le
Pablo Querubin
Leander Heldring
AI4TS
23
12
0
24 Aug 2023
A Massive Scale Semantic Similarity Dataset of Historical English
Emily Silcock
Melissa Dell
23
5
0
30 Jun 2023
A Language Model of Java Methods with Train/Test Deduplication
Chia-Yi Su
Aakash Bansal
Vijayanta Jain
S. Ghanavati
Collin McMillan
SyDa
VLM
21
9
0
15 May 2023
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
Amro Abbas
Kushal Tirumala
Daniel Simig
Surya Ganguli
Ari S. Morcos
15
162
0
16 Mar 2023
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
237
590
0
14 Jul 2021
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Nandan Thakur
Nils Reimers
Andreas Rucklé
Abhishek Srivastava
Iryna Gurevych
VLM
229
964
0
17 Apr 2021