Noise-Robust De-Duplication at Scale


9 October 2022
Emily Silcock
Luca D'Amico-Wong
Jinglin Yang
Melissa Dell
    SyDa

Papers citing "Noise-Robust De-Duplication at Scale"

16 / 16 papers shown
Towards the Three-Phase Dynamics of Generalization Power of a DNN
Yuxuan He
Junpeng Zhang
Hongyuan Zhang
Quanshi Zhang
AI4CE
26
0
0
11 May 2025
Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page
Michael McRae
AI4CE
36
0
0
17 Feb 2025
Mitigating Memorization In Language Models
Mansi Sakarvadia
Aswathy Ajith
Arham Khan
Nathaniel Hudson
Caleb Geniesse
Kyle Chard
Yaoqing Yang
Ian Foster
Michael W. Mahoney
KELM
MU
50
0
0
03 Oct 2024
Data Contamination Report from the 2024 CONDA Shared Task
Oscar Sainz
Iker García-Ferrero
Alon Jacovi
Jonas Hanselle
Yanai Elazar
...
Yu-Min Tseng
Vishaal Udandarao
Zengzhi Wang
Ruijie Xu
Jinglin Yang
34
5
0
31 Jul 2024
News Deja Vu: Connecting Past and Present with Semantic Search
Brevin Franklin
Emily Silcock
Abhishek Arora
Tom Bryan
Melissa Dell
21
1
0
21 Jun 2024
How Do Large Language Models Acquire Factual Knowledge During Pretraining?
Hoyeon Chang
Jinho Park
Seonghyeon Ye
Sohee Yang
Youngkyung Seo
Du-Seong Chang
Minjoon Seo
KELM
33
30
0
17 Jun 2024
Newswire: A Large-Scale Structured Database of a Century of Historical News
Emily Silcock
Abhishek Arora
Luca D'Amico-Wong
Melissa Dell
AI4TS
GNN
37
3
0
13 Jun 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping-Chia Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
47
36
0
26 May 2024
RETSim: Resilient and Efficient Text Similarity
Marina Zhang
Owen Vallis
Aysegul Bumin
Tanay Vakharia
Elie Bursztein
23
1
0
28 Nov 2023
LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models
Abhishek Arora
Melissa Dell
KELM
23
8
0
02 Sep 2023
American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers
Melissa Dell
Jacob Carlson
Tom Bryan
Emily Silcock
Abhishek Arora
Zejiang Shen
Luca D'Amico-Wong
Q. Le
Pablo Querubin
Leander Heldring
AI4TS
23
12
0
24 Aug 2023
A Massive Scale Semantic Similarity Dataset of Historical English
Emily Silcock
Melissa Dell
21
5
0
30 Jun 2023
A Language Model of Java Methods with Train/Test Deduplication
Chia-Yi Su
Aakash Bansal
Vijayanta Jain
S. Ghanavati
Collin McMillan
SyDa
VLM
21
9
0
15 May 2023
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
Amro Abbas
Kushal Tirumala
Daniel Simig
Surya Ganguli
Ari S. Morcos
13
162
0
16 Mar 2023
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
237
590
0
14 Jul 2021
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Nandan Thakur
Nils Reimers
Andreas Rucklé
Abhishek Srivastava
Iryna Gurevych
VLM
229
964
0
17 Apr 2021