1707.06347
Cited By
Proximal Policy Optimization Algorithms
20 July 2017
John Schulman
Filip Wolski
Prafulla Dhariwal
Alec Radford
Oleg Klimov
OffRL
Papers citing
"Proximal Policy Optimization Algorithms"
50 / 7,399 papers shown
Title
Soft Best-of-n Sampling for Model Alignment
C. M. Verdun
Alex Oesterling
Himabindu Lakkaraju
Flavio du Pin Calmon
BDL
293
0
0
06 May 2025
DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning
Borui Wang
Kathleen McKeown
Rex Ying
OffRL
58
0
0
06 May 2025
Automated Hybrid Reward Scheduling via Large Language Models for Robotic Skill Learning
Changxin Huang
Junyang Liang
Yanbin Chang
Jingzhao Xu
Jianqiang Li
55
0
0
05 May 2025
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
Tianjian Li
Daniel Khashabi
60
0
0
05 May 2025
AKD : Adversarial Knowledge Distillation For Large Language Models Alignment on Coding tasks
Ilyas Oulkadda
Julien Perez
ALM
56
0
0
05 May 2025
Aerodynamic and structural airfoil shape optimisation via Transfer Learning-enhanced Deep Reinforcement Learning
David Ramos
Lucas Lacasa
E. Valero
G. Rubio
AI4CE
58
0
0
05 May 2025
Knowing You Don't Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing
Diji Yang
Linda Zeng
Jinmeng Rao
Yize Zhang
42
0
0
05 May 2025
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
Yi-Fan Zhang
Xingyu Lu
X. Hu
Chaoyou Fu
Bin Wen
...
Jianfei Chen
Fan Yang
Zheng Zhang
Yan Li
Liang Wang
OffRL
LRM
54
2
0
05 May 2025
Bielik 11B v2 Technical Report
Krzysztof Ociepa
Łukasz Flis
Krzysztof Wróbel
Adrian Gwoździej
Remigiusz Kinas
43
0
0
05 May 2025
Enhancing LLMs' Clinical Reasoning with Real-World Data from a Nationwide Sepsis Registry
J. Kim
Chaeeun Shim
Sungjin Park
Su Yeon Lee
Gee Young Suh
...
Yong Soo Kim
Hee-Joon Bae
Sung Yoon Lim
Han-Gil Jeong
Edward Choi
LRM
63
0
0
05 May 2025
RM-R1: Reward Modeling as Reasoning
Xiusi Chen
Gaotang Li
Zehua Wang
Bowen Jin
Cheng Qian
...
Yu Zhang
D. Zhang
Tong Zhang
Hanghang Tong
Heng Ji
ReLM
OffRL
LRM
224
7
0
05 May 2025
A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji
Yanqiu Wu
Zhibin Wu
Shoujin Wang
Jian Yang
Mark Dras
Usman Naseem
48
1
0
05 May 2025
Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL
Jiarui Yao
Yifan Hao
Hanning Zhang
Hanze Dong
Wei Xiong
Nan Jiang
Tong Zhang
LRM
62
1
0
05 May 2025
Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
Xiaobao Wu
LRM
95
2
0
05 May 2025
TWIST: Teleoperated Whole-Body Imitation System
Yanjie Ze
Zixuan Chen
Joao Pedro Araujo
Zi-ang Cao
Xue Bin Peng
Jiajun Wu
Chao Liu
47
2
0
05 May 2025
Resolving Conflicting Constraints in Multi-Agent Reinforcement Learning with Layered Safety
Jason J. Choi
Jasmine Jerry Aloor
Jingqi Li
Maria G. Mendoza
H. Balakrishnan
Claire J. Tomlin
46
0
0
04 May 2025
Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study
Xiaoyu Tian
Sitong Zhao
Haotian Wang
Shuaiting Chen
Yiping Peng
Yunjie Ji
Han Zhao
Xiangang Li
OffRL
LRM
45
0
0
04 May 2025
Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach
Jiancong Xiao
Bojian Hou
Zhanliang Wang
Ruochen Jin
Q. Long
Weijie Su
Li Shen
50
1
0
04 May 2025
Interpretable Emergent Language Using Inter-Agent Transformers
Mannan Bhardwaj
AI4CE
245
0
0
04 May 2025
Prompt-responsive Object Retrieval with Memory-augmented Student-Teacher Learning
Malte Mosbach
Sven Behnke
41
0
0
04 May 2025
SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations
Runyi Yu
Yinhuai Wang
Qihan Zhao
Hok Wai Tsui
Jingbo Wang
P. Tan
Qifeng Chen
OffRL
47
1
0
04 May 2025
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation
Meng-Hao Guo
Jiajun Xu
Yi Zhang
Jiaxi Song
Haoyang Peng
...
Yongming Rao
Houwen Peng
Han Hu
Gordon Wetzstein
Shi-Min Hu
ELM
LRM
62
3
0
04 May 2025
Analytic Energy-Guided Policy Optimization for Offline Reinforcement Learning
Jifeng Hu
Sili Huang
Zhiyong Yang
Shengchao Hu
Li Shen
Hechang Chen
Lichao Sun
Yi-Ju Chang
Dacheng Tao
OffRL
292
0
0
03 May 2025
CAMOUFLAGE: Exploiting Misinformation Detection Systems Through LLM-driven Adversarial Claim Transformation
Mazal Bethany
Nishant Vishwamitra
Cho-Yu Chiang
Peyman Najafirad
AAML
38
0
0
03 May 2025
A Generalised and Adaptable Reinforcement Learning Stopping Method
Reem Bin-Hezam
Mark Stevenson
34
0
0
03 May 2025
Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory
Huy Q. Ngo
Mingyu Guo
Hung Nguyen
AAML
48
0
0
02 May 2025
Model Tensor Planning
An T. Le
Khai Nguyen
Minh Nhat Vu
João Carvalho
Jan Peters
40
0
0
02 May 2025
Fast Flow-based Visuomotor Policies via Conditional Optimal Transport Couplings
Andreas Sochopoulos
Nikolay Malkin
Nikolaos Tsagkas
João Moura
Michael Gienger
S. Vijayakumar
52
1
0
02 May 2025
MULE: Multi-terrain and Unknown Load Adaptation for Effective Quadrupedal Locomotion
Vamshi Kumar Kurva
Shishir Kolathaya
48
0
0
01 May 2025
Wasserstein Policy Optimization
David Pfau
Ian Davies
Diana Borsa
Joao G. M. Araujo
Brendan D. Tracey
H. V. Hasselt
47
0
0
01 May 2025
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
D. Jiang
Ziyu Guo
Renrui Zhang
Zhuofan Zong
Hao Li
Le Zhuo
Shilin Yan
Pheng-Ann Heng
Haoyang Li
LRM
78
11
0
01 May 2025
A General Approach of Automated Environment Design for Learning the Optimal Power Flow
Thomas Wolgast
Astrid Nieße
AI4CE
48
0
0
01 May 2025
A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI
Lik Hang Kenny Wong
Xueyang Kang
Kaixin Bai
Jianwei Zhang
69
0
0
01 May 2025
Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning
Lang Feng
Weihao Tan
Zhiyi Lyu
Longtao Zheng
Haiyang Xu
Ming Yan
Fei Huang
Jingyi Wang
34
0
0
01 May 2025
Leveraging Partial SMILES Validation Scheme for Enhanced Drug Design in Reinforcement Learning Frameworks
Xinyu Wang
Jinbo Bi
Minghu Song
CLL
78
0
0
01 May 2025
LangWBC: Language-directed Humanoid Whole-Body Control via End-to-end Learning
Yiyang Shao
Xiaoyu Huang
Bike Zhang
Qiayuan Liao
Yuman Gao
Yufeng Chi
Zhongyu Li
Sophia Shao
Koushil Sreenath
LM&Ro
300
0
0
30 Apr 2025
Adaptive 3D UI Placement in Mixed Reality Using Deep Reinforcement Learning
Feiyu Lu
Mengyu Chen
Hsiang Hsu
Pranav Deshpande
Cheng Yao Wang
Blair MacIntyre
40
3
0
30 Apr 2025
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
Haoran Xu
Baolin Peng
Hany Awadalla
DongDong Chen
Yen-Chun Chen
...
Yelong Shen
Shuaiqiang Wang
Weijian Xu
Jianfeng Gao
Weizhu Chen
ReLM
LRM
93
4
0
30 Apr 2025
Whleaper: A 10-DOF Flexible Bipedal Wheeled Robot
Yinglei Zhu
Sixiao He
Zhenghao Qi
Zhuoyuan Yong
Yihua Qin
Jianyu Chen
38
0
0
30 Apr 2025
One Net to Rule Them All: Domain Randomization in Quadcopter Racing Across Different Platforms
Robin Ferede
Till Blaha
Erin Lucassen
Christophe De Wagter
Guido de Croon
45
1
0
30 Apr 2025
ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning
Jingyang Yi
Jiazheng Wang
Sida Li
ReLM
OODD
LRM
296
4
0
30 Apr 2025
Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models
Guanghao Zhou
Panjia Qiu
Chong Chen
Jiadong Wang
Zheming Yang
Jian Xu
Minghui Qiu
OffRL
LRM
65
3
0
30 Apr 2025
Designing Control Barrier Function via Probabilistic Enumeration for Safe Reinforcement Learning Navigation
Luca Marzari
Francesco Trotti
Enrico Marchesini
Alessandro Farinelli
55
0
0
30 Apr 2025
Neuro-Symbolic Generation of Explanations for Robot Policies with Weighted Signal Temporal Logic
Mikihisa Yuasa
R. Sreenivas
Huy T. Tran
52
0
0
30 Apr 2025
Multi-Agent Reinforcement Learning for Resources Allocation Optimization: A Survey
Mohamad Abdul Hady
Siyi Hu
Mahardhika Pratama
Jimmy Cao
Ryszard Kowalczyk
29
0
0
29 Apr 2025
A Domain-Agnostic Scalable AI Safety Ensuring Framework
Beomjun Kim
Kangyeon Kim
Sunwoo Kim
Heejin Ahn
57
0
0
29 Apr 2025
Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception
Yuanchen Wu
Lu Zhang
Hang Yao
Junlong Du
Ke Yan
Shouhong Ding
Yunsheng Wu
Xuzhao Li
MLLM
87
0
0
29 Apr 2025
Return Capping: Sample-Efficient CVaR Policy Gradient Optimisation
Harry Mead
Clarissa Costen
Bruno Lacerda
Nick Hawes
50
0
0
29 Apr 2025
Token-Efficient RL for LLM Reasoning
Alan Lee
Harry Tong
OffRL
270
0
0
29 Apr 2025
XPG-RL: Reinforcement Learning with Explainable Priority Guidance for Efficiency-Boosted Mechanical Search
Yiting Zhang
Shichen Li
Elena Shrestha
55
1
0
29 Apr 2025