ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2311.01149
  4. Cited By
ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with
  Effective Evaluation Model

ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model

2 November 2023
Jianghao Chen
Pu Jian
Tengxiao Xi
Yidong Yi
Qianlong Du
Chenglin Ding
Guibo Zhu
Chengqing Zong
Jinqiao Wang
Jiajun Zhang
ArXivPDFHTML

Papers citing "ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model"

8 / 8 papers shown
Title
Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
Y. Wang
Z. Fu
Jie Cai
Peijun Tang
Hongya Lyu
...
Jie Zhou
Guoyang Zeng
Chaojun Xiao
Xu Han
Zhiyuan Liu
49
0
0
08 May 2025
Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support
Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support
G. Wang
Minyu Gao
Shuai Yang
Ya Zhang
Lizhi He
...
Yexuan Zhang
Wanyue Li
Lu Chen
Jintao Fei
Xin Li
104
1
0
25 Feb 2025
Gradient Routing: Masking Gradients to Localize Computation in Neural
  Networks
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
Alex Cloud
Jacob Goldman-Wetzler
Evžen Wybitul
Joseph Miller
Alexander Matt Turner
28
2
0
06 Oct 2024
Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method
Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method
Weichao Zhang
Ruqing Zhang
Jiafeng Guo
Maarten de Rijke
Yixing Fan
Xueqi Cheng
32
7
0
23 Sep 2024
Whose Language Counts as High Quality? Measuring Language Ideologies in
  Text Data Selection
Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Suchin Gururangan
Dallas Card
Sarah K. Drier
E. K. Gade
Leroy Z. Wang
Zeyu Wang
Luke Zettlemoyer
Noah A. Smith
172
73
0
25 Jan 2022
Deduplicating Training Data Makes Language Models Better
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
242
591
0
14 Jul 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
253
1,986
0
31 Dec 2020
Efficient Estimation of Word Representations in Vector Space
Efficient Estimation of Word Representations in Vector Space
Tomáš Mikolov
Kai Chen
G. Corrado
J. Dean
3DV
233
31,253
0
16 Jan 2013
1