ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2403.19340
  4. Cited By
Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models
v1v2 (latest)

Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models

28 March 2024
Hyunbyung Park
Sukyung Lee
Gyoungjin Gim
Yungi Kim
Dahyun Kim
Chanjun Park
    VLM
ArXiv (abs)PDFHTMLGithub (93★)

Papers citing "Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models"

13 / 13 papers shown
Superfiltering: Weak-to-Strong Data Filtering for Fast
  Instruction-Tuning
Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
Ming Li
Yong Zhang
Shwai He
Zhitao Li
Hongyu Zhao
Jianzong Wang
Ning Cheng
Wanrong Zhu
537
128
0
01 Feb 2024
Dolma: an Open Corpus of Three Trillion Tokens for Language Model
  Pretraining Research
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini
Rodney Michael Kinney
Akshita Bhagia
Dustin Schwenk
David Atkinson
...
Hanna Hajishirzi
Iz Beltagy
Dirk Groeneveld
Jesse Dodge
Kyle Lo
444
445
0
31 Jan 2024
Rethinking Benchmark and Contamination for Language Models with
  Rephrased Samples
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
Shuo Yang
Wei-Lin Chiang
Lianmin Zheng
Joseph E. Gonzalez
Ion Stoica
ALM
445
184
0
08 Nov 2023
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data
  Selection for Instruction Tuning
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction TuningNorth American Chapter of the Association for Computational Linguistics (NAACL), 2023
Ming Li
Yong Zhang
Zhitao Li
Jiuhai Chen
Lichang Chen
Ning Cheng
Jianzong Wang
Wanrong Zhu
Jing Xiao
677
331
0
23 Aug 2023
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora
  with Web Data, and Web Data Only
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Guilherme Penedo
Quentin Malartic
Daniel Hesslow
Ruxandra-Aimée Cojocaru
Alessandro Cappelli
Hamza Alobeidli
B. Pannier
Ebtesam Almazrouei
Julien Launay
542
924
0
01 Jun 2023
DMOps: Data Management Operation and Recipes
DMOps: Data Management Operation and Recipes
E. Choi
Chanjun Park
279
7
0
02 Jan 2023
Toxicity Detection with Generative Prompt-based Inference
Toxicity Detection with Generative Prompt-based Inference
Yau-Shian Wang
Y. Chang
358
46
0
24 May 2022
On the Effect of Pretraining Corpora on In-context Learning by a
  Large-scale Language Model
On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language ModelNorth American Chapter of the Association for Computational Linguistics (NAACL), 2022
Seongjin Shin
Sang-Woo Lee
Hwijeen Ahn
Sungdong Kim
Hyoungseok Kim
...
Dong Wang
Gichang Lee
W. Park
Jung-Woo Ha
Nako Sung
LRM
427
111
0
28 Apr 2022
Handling Bias in Toxic Speech Detection: A Survey
Handling Bias in Toxic Speech Detection: A SurveyACM Computing Surveys (ACM CSUR), 2022
Tanmay Garg
Sarah Masud
Tharun Suresh
Tanmoy Chakraborty
428
123
0
26 Jan 2022
Deduplicating Training Data Makes Language Models Better
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
956
822
0
14 Jul 2021
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean
  Crawled Corpus
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusConference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
389
615
0
18 Apr 2021
Improved and efficient inter-vehicle distance estimation using road
  gradients of both ego and target vehicles
Improved and efficient inter-vehicle distance estimation using road gradients of both ego and target vehiclesInternational Conference on Autonomic and Autonomous Systems (ICAAS), 2021
Robik Shrestha
Jinkyu Lee
Kushal Kafle
S. Hwang
Il Yong Chun
209
13
0
01 Apr 2021
Scaling Laws for Neural Language Models
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
2.2K
7,434
0
23 Jan 2020
1
Page 1 of 1