The Stack: 3 TB of permissively licensed source code

20 November 2022
Denis Kocetkov
Raymond Li
Loubna Ben Allal
Jia Li
Chenghao Mou
Carlos Muñoz Ferrandis
Yacine Jernite
Margaret Mitchell
Sean M. Hughes
Thomas Wolf
Dzmitry Bahdanau
Leandro von Werra
Harm de Vries

Papers citing "The Stack: 3 TB of permissively licensed source code"

50 / 51 papers shown
Title
Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding
Yifeng Di
Tianyi Zhang
18
0
0
12 May 2025
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
Kazuki Fujii
Yukito Tajima
Sakae Mizuki
Hinari Shimada
Taihei Shiotani
...
Kakeru Hattori
Youmi Ma
Hiroya Takamura
Rio Yokota
Naoaki Okazaki
SyDa
47
0
0
05 May 2025
On Benchmarking Code LLMs for Android Malware Analysis
Yiling He
Hongyu She
Xingzhi Qian
Xinran Zheng
Zhuo Chen
Z. Qin
Lorenzo Cavallaro
ELM
43
1
0
01 Apr 2025
The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation
Olivier Gouvert
Julie Hunter
Jérôme Louradour
Christophe Cerisara
Evan Dufraisse
Yaya Sy
Laura Rivière
Jean-Pierre Lorré
OpenLLM-France community
105
0
0
15 Mar 2025
OASIS: Order-Augmented Strategy for Improved Code Search
Zuchen Gao
Zizheng Zhan
Xianming Li
Erxin Yu
Haotian Zhang
Bin Chen
Yuqun Zhang
Jing Li
55
0
0
11 Mar 2025
Robust Learning of Diverse Code Edits
Tushar Aggarwal
Swayam Singh
Abhijeet Awasthi
Aditya Kanade
Nagarajan Natarajan
SyDa
109
0
0
05 Mar 2025
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking
Shahriar Kabir Nahin
R. N. Nandi
Sagor Sarker
Quazi Sarwar Muhtaseem
Md. Kowsher
Apu Chandraw Shill
Md Ibrahim
Mehadi Hasan Menon
Tareq Al Muntasir
Firoj Alam
66
0
0
24 Feb 2025
KernelBench: Can LLMs Write Efficient GPU Kernels?
Anne Ouyang
Simon Guo
Simran Arora
Alex L. Zhang
William Hu
Christopher Ré
Azalia Mirhoseini
ALM
38
1
0
14 Feb 2025
Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents
Ilia Karmanov
A. Deshmukh
Lukas Voegtle
Philipp Fischer
Kateryna Chumachenko
...
Jarno Seppänen
Jupinder Parmar
Joseph Jennings
Andrew Tao
Karan Sapra
68
0
0
06 Feb 2025
Are Large Language Models Memorizing Bug Benchmarks?
Daniel Ramos
Claudia Mamede
Kush Jain
Paulo Canelas
Catarina Gamboa
Claire Le Goues
PILM
ELM
94
6
0
20 Nov 2024
Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
Yuqi Luo
Chenyang Song
Xu Han
Y. Chen
Chaojun Xiao
Zhiyuan Liu
Maosong Sun
47
3
0
04 Nov 2024
Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies
L. Wang
Sheng Chen
Linnan Jiang
Shu Pan
Runze Cai
Sen Yang
Fei Yang
44
3
0
24 Oct 2024
CursorCore: Assist Programming through Aligning Anything
Hao Jiang
Qi Liu
Rui Li
Shengyu Ye
Shijin Wang
48
1
0
09 Oct 2024
How to Train Long-Context Language Models (Effectively)
Tianyu Gao
Alexander Wettig
Howard Yen
Danqi Chen
RALM
72
37
0
03 Oct 2024
Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
Ulyana Piterbarg
Lerrel Pinto
Rob Fergus
SyDa
37
2
0
03 Oct 2024
CraftRTL: High-quality Synthetic Data Generation for Verilog Code Models with Correct-by-Construction Non-Textual Representations and Targeted Code Repair
Mingjie Liu
Yun-Da Tsai
Wenfei Zhou
Haoxing Ren
SyDa
3DV
45
5
0
19 Sep 2024
Strong Copyright Protection for Language Models via Adaptive Model Fusion
Javier Abad
Konstantin Donhauser
Francesco Pinto
Fanny Yang
35
4
0
29 Jul 2024
RegMix: Data Mixture as Regression for Language Model Pre-training
Qian Liu
Xiaosen Zheng
Niklas Muennighoff
Guangtao Zeng
Longxu Dou
Tianyu Pang
Jing Jiang
Min-Bin Lin
MoE
67
36
1
01 Jul 2024
Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency
Leonidas Gee
Milan Gritta
Gerasimos Lampouras
Ignacio Iacobacci
21
10
0
18 Jun 2024
Kotlin ML Pack: Technical Report
Sergey Titov
Mikhail Evtikhiev
Anton Shapkin
Oleg Smirnov
Sergei Boytsov
...
Dariia Karaeva
Maksim Sheptyakov
Mikhail Arkhipov
T. Bryksin
Egor Bogomolov
32
0
0
29 May 2024
Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment
Abhinav Agarwalla
Abhay Gupta
Alexandre Marques
Shubhra Pandit
Michael Goin
...
Tuan Nguyen
Mahmoud Salem
Dan Alistarh
Sean Lie
Mark Kurtz
MoE
SyDa
33
11
0
06 May 2024
Performance-Aligned LLMs for Generating Fast Code
Daniel Nichols
Pranav Polasam
Harshitha Menon
Aniruddha Marathe
T. Gamblin
A. Bhatele
27
8
0
29 Apr 2024
JaFIn: Japanese Financial Instruction Dataset
Kota Tanabe
Masahiro Suzuki
Hiroki Sakaji
Itsuki Noda
39
1
0
14 Apr 2024
AdapterSwap: Continuous Training of LLMs with Data Removal and Access-Control Guarantees
William Fleshman
Aleem Khan
Marc Marone
Benjamin Van Durme
CLL
KELM
42
3
0
12 Apr 2024
Edisum: Summarizing and Explaining Wikipedia Edits at Scale
Marija Sakota
Isaac Johnson
Guosheng Feng
Robert West
SyDa
KELM
25
2
0
04 Apr 2024
Stable Code Technical Report
Nikhil Pinnaparaju
Reshinth Adithyan
Duy Phung
J. Tow
James Baicoianu
...
Maksym Zhuravinskyi
Dakota Mahan
Marco Bellagente
Carlos Riquelme
Nathan Cooper
LRM
ALM
18
13
0
01 Apr 2024
SaulLM-7B: A pioneering Large Language Model for Law
Pierre Colombo
T. Pires
Malik Boudiaf
Dominic Culver
Rui Melo
...
Andre F. T. Martins
Fabrizio Esposito
Vera Lúcia Raposo
Sofia Morgado
Michael Desa
ELM
AILaw
39
63
0
06 Mar 2024
Semi-Instruct: Bridging Natural-Instruct and Self-Instruct for Code Large Language Models
Xianzhen Luo
Qingfu Zhu
Zhiming Zhang
Xu Wang
Qing Yang
Dongliang Xu
Wanxiang Che
ALM
19
2
0
01 Mar 2024
Text-to-Code Generation with Modality-relative Pre-training
Fenia Christopoulou
Guchun Zhang
Gerasimos Lampouras
AI4TS
13
1
0
08 Feb 2024
CroissantLLM: A Truly Bilingual French-English Language Model
Manuel Faysse
Patrick Fernandes
Nuno M. Guerreiro
António Loison
Duarte M. Alves
...
François Yvon
André F.T. Martins
Gautier Viaud
Céline Hudelot
Pierre Colombo
43
33
0
01 Feb 2024
A Prompt Learning Framework for Source Code Summarization
Weisong Sun
Chunrong Fang
Yudu You
Yuchen Chen
Yi Liu
...
Quanjun Zhang
Hanwei Qian
Wei-Ye Zhao
Yang Liu
Zhenyu Chen
LLMAG
37
13
0
26 Dec 2023
RETSim: Resilient and Efficient Text Similarity
Marina Zhang
Owen Vallis
Aysegul Bumin
Tanay Vakharia
Elie Bursztein
16
1
0
28 Nov 2023
PaSS: Parallel Speculative Sampling
Giovanni Monea
Armand Joulin
Edouard Grave
MoE
8
31
0
22 Nov 2023
CompCodeVet: A Compiler-guided Validation and Enhancement Approach for Code Dataset
Le Chen
Arijit Bhattacharjee
Nesreen K. Ahmed
N. Hasabnis
Gal Oren
Bin Lei
Ali Jannesari
LRM
24
3
0
11 Nov 2023
SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation
Zhehuai Chen
He Huang
A. Andrusenko
Oleksii Hrinchuk
Krishna C. Puvvada
Jason Chun Lok Li
Subhankar Ghosh
Jagadeesh Balam
Boris Ginsburg
LRM
21
48
0
13 Oct 2023
Retentive Network: A Successor to Transformer for Large Language Models
Yutao Sun
Li Dong
Shaohan Huang
Shuming Ma
Yuqing Xia
Jilong Xue
Jianyong Wang
Furu Wei
LRM
51
301
0
17 Jul 2023
Deduplicating and Ranking Solution Programs for Suggesting Reference Solutions
Atsushi Shirafuji
Yutaka Watanobe
19
1
0
16 Jul 2023
LongNet: Scaling Transformers to 1,000,000,000 Tokens
Jiayu Ding
Shuming Ma
Li Dong
Xingxing Zhang
Shaohan Huang
Wenhui Wang
Nanning Zheng
Furu Wei
CLL
35
151
0
05 Jul 2023
Neural Machine Translation for Code Generation
K. Dharma
Clayton T. Morrison
25
4
0
22 May 2023
A Language Model of Java Methods with Train/Test Deduplication
Chia-Yi Su
Aakash Bansal
Vijayanta Jain
S. Ghanavati
Collin McMillan
SyDa
VLM
16
9
0
15 May 2023
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Dũng Nguyễn Mạnh
Nam Le Hai
An Dau
A. Nguyen
Khanh N. Nghiem
Jingnan Guo
Nghi D. Q. Bui
26
13
0
09 May 2023
CodeGen2: Lessons for Training LLMs on Programming and Natural Languages
Erik Nijkamp
A. Ghobadzadeh
Caiming Xiong
Silvio Savarese
Yingbo Zhou
147
164
0
03 May 2023
Multi-lingual Evaluation of Code Generation Models
Ben Athiwaratkun
Sanjay Krishna Gouda
Zijian Wang
Xiaopeng Li
Yuchen Tian
...
Baishakhi Ray
Parminder Bhatia
Sudipta Sengupta
Dan Roth
Bing Xiang
ELM
112
117
0
26 Oct 2022
A Systematic Evaluation of Large Language Models of Code
Frank F. Xu
Uri Alon
Graham Neubig
Vincent J. Hellendoorn
ELM
ALM
202
628
0
26 Feb 2022
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
Yue Wang
Weishi Wang
Shafiq R. Joty
S. Hoi
210
1,485
0
02 Sep 2021
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
237
588
0
14 Jul 2021
Measuring Coding Challenge Competence With APPS
Dan Hendrycks
Steven Basart
Saurav Kadavath
Mantas Mazeika
Akul Arora
...
Collin Burns
Samir Puranik
Horace He
D. Song
Jacob Steinhardt
ELM
AIMat
ALM
194
623
0
20 May 2021
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Shuai Lu
Daya Guo
Shuo Ren
Junjie Huang
Alexey Svyatkovskiy
...
Nan Duan
Neel Sundaresan
Shao Kun Deng
Shengyu Fu
Shujie Liu
ELM
196
853
0
09 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
245
1,986
0
31 Dec 2020
Extracting Training Data from Large Language Models
Nicholas Carlini
Florian Tramèr
Eric Wallace
Matthew Jagielski
Ariel Herbert-Voss
...
Tom B. Brown
D. Song
Ulfar Erlingsson
Alina Oprea
Colin Raffel
MLAU
SILM
267
1,808
0
14 Dec 2020