2406.20094
Cited By
Scaling Synthetic Data Creation with 1,000,000,000 Personas
28 June 2024
Tao Ge
Xin Chan
Dian Yu
Haitao Mi
Dong Yu
SyDa
Papers citing
"Scaling Synthetic Data Creation with 1,000,000,000 Personas"
50 / 81 papers shown
Title
Towards Multi-Agent Reasoning Systems for Collaborative Expertise Delegation: An Exploratory Design Study
Baixuan Xu
Chunyang Li
Weiqi Wang
Wei Fan
Tianshi Zheng
H. Shi
Tao Fan
Yangqiu Song
Qiang Yang
9
0
0
12 May 2025
Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map
Alessandro Simoni
Francesco Pelosin
33
0
0
06 May 2025
HyPerAlign: Hypotheses-driven Personalized Alignment
Cristina Garbacea
Chenhao Tan
39
0
0
29 Apr 2025
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
Mihai Nadas
Laura Diosan
Andrei Piscoran
Andreea Tomescu
VGen
50
0
0
29 Apr 2025
Improving LLM Personas via Rationalization with Psychological Scaffolds
Brihi Joshi
Xiang Ren
Swabha Swayamdipta
Rik Koncel-Kedziorski
Tim Paek
68
0
0
25 Apr 2025
Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction
Yuxin Jiang
Y. Wang
Chuhan Wu
Xinyi Dai
Yan Xu
...
Y. Wang
Xin Jiang
Lifeng Shang
R. Tang
W. Wang
17
0
0
22 Apr 2025
Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale
Bowen Jiang
Zhuoqun Hao
Y. Cho
B. Li
Yuan Yuan
Sihao Chen
Lyle Ungar
Camillo J. Taylor
Dan Roth
21
0
0
19 Apr 2025
MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation
Haris Riaz
Sourav Sanjukta Bhabesh
Vinayak Arannil
Miguel Ballesteros
Graham Horwood
SyDa
33
0
0
17 Apr 2025
OpenTuringBench: An Open-Model-based Benchmark and Framework for Machine-Generated Text Detection and Attribution
Lucio La Cava
Andrea Tagarelli
DeLMO
VLM
46
0
0
15 Apr 2025
Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning
Syeda Nahida Akter
Shrimai Prabhumoye
Matvei Novikov
Seungju Han
Ying Lin
...
Eric Nyberg
Yejin Choi
M. Patwary
M. Shoeybi
Bryan Catanzaro
ReLM
OffRL
LRM
44
0
1
15 Apr 2025
SynthTRIPs: A Knowledge-Grounded Framework for Benchmark Query Generation for Personalized Tourism Recommenders
Ashmi Banerjee
Adithi Satish
Fitri Nur Aisyah
Wolfgang Wörndl
Yashar Deldjoo
AI4TS
28
0
0
12 Apr 2025
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay
Akshara Prabhakar
Z. Liu
Weiran Yao
Jianguo Zhang
Ming Zhu
...
Juan Carlos Niebles
Shelby Heinecke
H. Wang
S.
Caiming Xiong
VGen
64
1
0
04 Apr 2025
We Need Improved Data Curation and Attribution in AI for Scientific Discovery
Mara Graziani
Antonio Foncubierta
Dimitrios Christofidellis
Irina Espejo Morales
Malina Molnar
Marvin Alberts
Matteo Manica
Jannis Born
29
0
0
03 Apr 2025
XL-Instruct: Synthetic Data for Cross-Lingual Open-Ended Generation
Vivek Iyer
Ricardo Rei
Pinzhen Chen
Alexandra Birch
SyDa
LM&MA
57
0
0
29 Mar 2025
Factored Agents: Decoupling In-Context Learning and Memorization for Robust Tool Use
Nicholas Roth
Christopher Hidey
Lucas Spangher
William Arnold
Chang Ye
Nick Masiewicki
Jinoo Baek
Peter Grabowski
Eugene Ie
LLMAG
43
0
0
29 Mar 2025
Generating Synthetic Oracle Datasets to Analyze Noise Impact: A Study on Building Function Classification Using Tweets
Shanshan Bai
Anna Kruspe
X. X. Zhu
44
0
0
28 Mar 2025
Scaling Laws of Synthetic Data for Language Models
Zeyu Qin
Qingxiu Dong
Xingxing Zhang
Li Dong
Xiaolong Huang
...
Hany Awadalla
Yi R. Fung
Weizhu Chen
Minhao Cheng
Furu Wei
SyDa
63
1
0
25 Mar 2025
A Survey of Large Language Model Agents for Question Answering
Murong Yue
LLMAG
LM&MA
ELM
50
0
0
24 Mar 2025
ConvoGen: Enhancing Conversational AI with Synthetic Data: A Multi-Agent Approach
Reem Gody
Mahmoud Goudy
Ahmed Tawfik
SyDa
47
0
0
21 Mar 2025
LLM Generated Persona is a Promise with a Catch
Ang Li
Haozhe Chen
Hongseok Namkoong
Tianyi Peng
43
1
0
18 Mar 2025
Not All Personas Are Worth It: Culture-Reflective Persona Data Augmentation
Ji-Eun Han
Yoonseok Heo
37
0
0
17 Mar 2025
MetaScale: Test-Time Scaling with Evolving Meta-Thoughts
Qin Liu
Wenxuan Zhou
Nan Xu
James Y. Huang
Fei-Yue Wang
Sheng Zhang
Hoifung Poon
M. Chen
LLMAG
ReLM
AI4Cl
LRM
77
1
0
17 Mar 2025
Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification
Jiayu Jiang
Changxing Ding
Wentao Tan
Junhong Wang
Jin Tao
Xiangmin Xu
47
1
0
13 Mar 2025
Group Preference Alignment: Customized LLM Response Generation from In-Situ Conversations
Ishani Mondal
Jack W. Stokes
S. Jauhar
Longqi Yang
Mengting Wan
Xiaofeng Xu
Xia Song
Jennifer Neville
38
0
0
11 Mar 2025
Toward Multi-Session Personalized Conversation: A Large-Scale Dataset and Hierarchical Tree Framework for Implicit Reasoning
Xintong Li
Jalend Bantupalli
Ria Dharmani
Yuwei Zhang
Jingbo Shang
38
1
0
10 Mar 2025
VizTrust: A Visual Analytics Tool for Capturing User Trust Dynamics in Human-AI Communication
Xin Wang
Stephanie Tulk Jesso
Sadamori Kojaku
David M Neyens
Min Sun Kim
34
0
0
10 Mar 2025
Language Model Personalization via Reward Factorization
Idan Shenfeld
Felix Faltings
Pulkit Agrawal
Aldo Pacchiano
36
1
0
08 Mar 2025
TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records
Hejie Cui
Alyssa Unell
Bowen Chen
Jason Alan Fries
Emily Alsentzer
Sanmi Koyejo
N. Shah
65
0
0
06 Mar 2025
MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems
Rui Ye
Shuo Tang
Rui Ge
Yaxin Du
Zhenfei Yin
S. Chen
Jing Shao
LLMAG
74
1
0
05 Mar 2025
Adaptively evaluating models with task elicitation
Davis Brown
Prithvi Balehannina
Helen Jin
Shreya Havaldar
Hamed Hassani
Eric Wong
ALM
ELM
75
0
0
03 Mar 2025
PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data
Juntao Tan
Liangwei Yang
Zuxin Liu
Z. Liu
Rithesh Murthy
...
Shirley Kokane
S.
H. Wang
Caiming Xiong
Shelby Heinecke
SyDa
30
1
0
28 Feb 2025
LLM as a Broken Telephone: Iterative Generation Distorts Information
Amr Mohamed
Mingmeng Geng
Michalis Vazirgiannis
Guokan Shang
49
1
0
27 Feb 2025
FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response
Mollie Shichman
C. Bonial
Austin Blodgett
Taylor Hudson
Francis Ferraro
Rachel Rudinger
SyDa
56
0
0
25 Feb 2025
PersonaMath: Boosting Mathematical Reasoning via Persona-Driven Data Augmentation
Jing Luo
Longze Chen
Run Luo
Liang Zhu
Chang Ao
...
A. Argha
Hamid Alinejad-Rokny
Chengming Li
Shiwen Ni
Min Yang
SyDa
AIMat
75
0
0
24 Feb 2025
WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale
Jiaxi Li
Xingxing Zhang
Xun Wang
Xiaolong Huang
Li Dong
Liang Wang
Si-Qing Chen
Wei Lu
Furu Wei
SyDa
52
0
0
23 Feb 2025
Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation
Shuo Tang
Xianghe Pang
Zexi Liu
Bohan Tang
Rui Ye
Xiaowen Dong
Y. Wang
Yanfeng Wang
S. Chen
SyDa
LLMAG
106
3
0
21 Feb 2025
Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases
Rena Gao
Xuetong Wu
Tatsuki Kuribayashi
Mingrui Ye
Siya Qi
Carsten Roever
Y. Liu
Zheng Yuan
Jey Han Lau
74
0
0
21 Feb 2025
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
Y. Yang
Ajay Patel
Matt Deitke
Tanmay Gupta
Luca Weihs
...
Mark Yatskar
Chris Callison-Burch
Ranjay Krishna
Aniruddha Kembhavi
Christopher Clark
SyDa
56
1
0
21 Feb 2025
Preference Optimization for Reasoning with Pseudo Feedback
Fangkai Jiao
Geyang Guo
Xingxing Zhang
Nancy F. Chen
Shafiq R. Joty
Furu Wei
LRM
87
8
0
17 Feb 2025
Designing Role Vectors to Improve LLM Inference Behaviour
Daniele Potertì
Andrea Seveso
Fabio Mercorio
LLMSV
35
0
0
17 Feb 2025
From Personas to Talks: Revisiting the Impact of Personas on LLM-Synthesized Emotional Support Conversations
Shenghan Wu
Yang Deng
Yimo Zhu
W. Hsu
M. Lee
48
2
0
17 Feb 2025
InfoQuest: Evaluating Multi-Turn Dialogue Agents for Open-Ended Conversations with Hidden Context
Bryan L. M. de Oliveira
Luana G. B. Martins
Bruno Brandão
L. Melo
ELM
67
1
0
17 Feb 2025
FinMTEB: Finance Massive Text Embedding Benchmark
Yixuan Tang
Yi Yang
AIFin
42
0
0
16 Feb 2025
LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient
Peiwen Yuan
Shaoxiong Feng
Yiwei Li
X. U. Wang
Y. Zhang
Jiayi Shi
Chuyi Tan
Boyuan Pan
Yao Hu
Kan Li
59
2
0
02 Feb 2025
OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas
Xiaoyang Wang
H. Zhang
Tao Ge
W. Yu
Dian Yu
Dong Yu
AI4CE
26
2
0
28 Jan 2025
Aligning Instruction Tuning with Pre-training
Yiming Liang
Tianyu Zheng
Xinrun Du
Ge Zhang
J. Liu
...
Zhaoxiang Zhang
Wenhao Huang
Xiang Yue
Jiajun Zhang
68
1
0
16 Jan 2025
Map2Text: New Content Generation from Low-Dimensional Visualizations
Xingjian Zhang
Ziyang Xiong
Shixuan Liu
Yutong Xie
Tolga Ergen
Dongsub Shim
Hua Xu
Honglak Lee
Qiaozhu Mei
33
0
0
24 Dec 2024
MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification
Saptarshi Sengupta
Kristal Curtis
Akshay Mallipeddi
Abhinav Mathur
Joseph Ross
Liang Gou
LLMAG
SyDa
88
1
0
28 Nov 2024
Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
Xu Ouyang
Tao Ge
Thomas Hartvigsen
Zhisong Zhang
Haitao Mi
Dong Yu
MQ
80
1
0
26 Nov 2024
Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions
Yutao Hou
Yajing Luo
Zhiwen Ruan
H. Wang
Weifeng Ge
Y. Chen
Guanhua Chen
ELM
34
0
0
15 Nov 2024