arXiv: 2403.19340

Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models

28 March 2024

Yungi Kim

arXiv (abs) | PDF | HTML | GitHub (93★)

Papers citing "Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models"

13 / 13 papers shown

Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning

537

128

0

01 Feb 2024

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Luca Soldaini

Rodney Michael Kinney

Akshita Bhagia

...

Hanna Hajishirzi

Dirk Groeneveld

Kyle Lo

444

445

0

31 Jan 2024

Rethinking Benchmark and Contamination for Language Models with Rephrased Samples

Joseph E. Gonzalez

445

184

0

08 Nov 2023

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

North American Chapter of the Association for Computational Linguistics (NAACL), 2023

677

331

0

23 Aug 2023

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Guilherme Penedo

Quentin Malartic

Ruxandra-Aimée Cojocaru

Alessandro Cappelli

Hamza Alobeidli

Ebtesam Almazrouei

542

924

0

01 Jun 2023

DMOps: Data Management Operation and Recipes

279

7

0

02 Jan 2023

Toxicity Detection with Generative Prompt-based Inference

358

46

0

24 May 2022

On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model

North American Chapter of the Association for Computational Linguistics (NAACL), 2022

...

427

111

0

28 Apr 2022

Handling Bias in Toxic Speech Detection: A Survey

ACM Computing Surveys (ACM CSUR), 2022

Tanmoy Chakraborty

428

123

0

26 Jan 2022

Deduplicating Training Data Makes Language Models Better

Daphne Ippolito

Chris Callison-Burch

Nicholas Carlini

956

822

0

14 Jul 2021

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

Gabriel Ilharco

Dirk Groeneveld

Margaret Mitchell

389

615

0

18 Apr 2021

Improved and efficient inter-vehicle distance estimation using road gradients of both ego and target vehicles

International Conference on Autonomic and Autonomous Systems (ICAAS), 2021

Jinkyu Lee

Il Yong Chun

209

13

0

01 Apr 2021

Scaling Laws for Neural Language Models

2.2K

7,434

0

23 Jan 2020
