ResearchTrend.AI
Proximal Policy Optimization Algorithms


20 July 2017
John Schulman
Filip Wolski
Prafulla Dhariwal
Alec Radford
Oleg Klimov
    OffRL

Papers citing "Proximal Policy Optimization Algorithms"

50 / 7,044 papers shown
Title
DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data
Yuhang Zhou
Jing Zhu
Shengyi Qian
Zhuokai Zhao
Xiyao Wang
Xiaoyu Liu
Ming Li
Paiheng Xu
Wei Ai
Furong Huang
27
0
0
21 May 2025
Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
Christian Walder
Deep Karkhanis
OffRL
17
0
0
21 May 2025
Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment
Weixiang Zhao
Xingyu Sui
Yulin Hu
Jiahe Guo
Haixiao Liu
Biye Li
Yanyan Zhao
Bing Qin
Ting Liu
OffRL
26
0
0
21 May 2025
Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning
Yurun Yuan
Fan Chen
Zeyu Jia
Alexander Rakhlin
Tengyang Xie
OffRL
36
0
0
21 May 2025
HCRMP: A LLM-Hinted Contextual Reinforcement Learning Framework for Autonomous Driving
Zhehuai Chen
Bo Leng
Zhuoren Li
Hanming Deng
Guizhe Jin
Ran Yu
Huanxi Wen
24
0
0
21 May 2025
GCNT: Graph-Based Transformer Policies for Morphology-Agnostic Reinforcement Learning
Yingbo Luo
Meibao Yao
Xueming Xiao
27
0
0
21 May 2025
Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities
Jinyang Wu
Chonghua Liao
Mingkuan Feng
Shuai Zhang
Zhengqi Wen
Pengpeng Shao
Huazhe Xu
Jianhua Tao
OffRL
LRM
19
0
0
21 May 2025
A Temporal Difference Method for Stochastic Continuous Dynamics
Haruki Settai
Naoya Takeishi
Takehisa Yairi
12
0
0
21 May 2025
MMaDA: Multimodal Large Diffusion Language Models
Ling Yang
Ye Tian
Bowen Li
Xinchen Zhang
Ke Shen
Yunhai Tong
Mengdi Wang
VLM
LRM
41
0
0
21 May 2025
lmgame-Bench: How Good are LLMs at Playing Games?
Lanxiang Hu
Mingjia Huo
Yu Zhang
Haoyang Yu
Eric P. Xing
Ion Stoica
Tajana Rosing
Haojian Jin
Hao Zhang
22
0
0
21 May 2025
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
Shivam Agarwal
Zimin Zhang
Lifan Yuan
Jiawei Han
Hao Peng
22
0
0
21 May 2025
Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning
Zhe Xu
Cheng Jin
Yihui Wang
Ziyi Liu
Hao Chen
9
0
0
21 May 2025
AnyBody: A Benchmark Suite for Cross-Embodiment Manipulation
Meenal Parakh
Alexandre Kirchmeyer
Beining Han
Jia Deng
LM&Ro
21
0
0
21 May 2025
AM-PPO: (Advantage) Alpha-Modulation with Proximal Policy Optimization
Soham Sane
12
0
0
21 May 2025
Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision
Eric Hanchen Jiang
Haozheng Luo
Shengyuan Pang
Xiaomin Li
Zhenting Qi
...
Zongyu Lin
Xinfeng Li
Hao Xu
Kai-Wei Chang
Ying Nian Wu
LRM
18
0
0
21 May 2025
Hadamax Encoding: Elevating Performance in Model-Free Atari
Jacob E. Kooi
Zhao Yang
Vincent François-Lavet
27
0
0
21 May 2025
When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners
Weixiang Zhao
Jiahe Guo
Yang Deng
Tongtong Wu
Wenxuan Zhang
...
Yanyan Zhao
Wanxiang Che
Bing Qin
Tat-Seng Chua
Ting Liu
LRM
21
0
0
21 May 2025
When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning
Xiaoyun Zhang
Jingqing Ruan
Xing Ma
Yawen Zhu
Haodong Zhao
Hao Li
Jiansong Chen
Ke Zeng
Xunliang Cai
LRM
27
0
0
21 May 2025
Human in the Loop Adaptive Optimization for Improved Time Series Forecasting
Malik Tiomoko
Hamza Cherkaoui
Giuseppe Paolo
Zhang Yili
Yu Meng
Zhang Keli
Hafiz Tiomoko Ali
AI4TS
AI4CE
22
0
0
21 May 2025
Solving Normalized Cut Problem with Constrained Action Space
Qize Jiang
Linsey Pang
Alice Gatti
Mahima Aggarwa
Giovanna Vantin
Xiaosong Ma
Weiwei Sun
Sanjay Chawla
AI4CE
22
0
0
20 May 2025
Safety Subspaces are Not Distinct: A Fine-Tuning Case Study
Kaustubh Ponkshe
Shaan Shah
Raghav Singhal
Praneeth Vepakomma
19
0
0
20 May 2025
s3: You Don't Need That Much Data to Train a Search Agent via RL
Pengcheng Jiang
Xueqiang Xu
Jiacheng Lin
Jinfeng Xiao
Zifeng Wang
Jimeng Sun
Jiawei Han
OffRL
RALM
AI4TS
LRM
31
0
0
20 May 2025
SHARP: Synthesizing High-quality Aligned Reasoning Problems for Large Reasoning Models Reinforcement Learning
Xiong Jun Wu
Zhenduo Zhang
ZuJie Wen
Zhiqiang Zhang
Wang Ren
...
Cai Chen
Deng Zhao
Dingnan Jin
Qing Cui
Jun Zhou
LRM
14
0
0
20 May 2025
Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders
Agam Goyal
Vedant Rathi
William Yeh
Yian Wang
Yuen Chen
Hari Sundaram
22
0
0
20 May 2025
Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models
Wenhui Zhu
Xuanzhao Dong
Xin Li
Peijie Qiu
Xiwen Chen
Abolfazl Razi
Aris Sotiras
Yi Su
Yalin Wang
OffRL
LM&MA
34
0
0
20 May 2025
The Hallucination Tax of Reinforcement Finetuning
Linxin Song
Taiwei Shi
Jieyu Zhao
HILM
LRM
16
0
0
20 May 2025
ThinkSwitcher: When to Think Hard, When to Think Fast
Guosheng Liang
Longguang Zhong
Ziyi Yang
Xiaojun Quan
LRM
20
0
0
20 May 2025
Self-Evolving Curriculum for LLM Reasoning
Xiaoyin Chen
Jiarui Lu
Minsu Kim
Dinghuai Zhang
Jian Tang
Alexandre Piché
Nicolas Angelard-Gontier
Yoshua Bengio
Ehsan Kamalloo
ReLM
LRM
38
0
0
20 May 2025
NavBench: A Unified Robotics Benchmark for Reinforcement Learning-Based Autonomous Navigation
Matteo El Hariry
Antoine Richard
Ricard M. Castan
Luis F W Batista
Matthieu Geist
Cédric Pradalier
Miguel Olivares-Mendez
OffRL
19
0
0
20 May 2025
Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning
Zhaohui Yang
Shilei Jiang
Chen Hu
Linjing Li
Shihong Deng
D. Jiang
OffRL
22
0
0
20 May 2025
Toward Embodied AGI: A Review of Embodied AI and the Road Ahead
Yequan Wang
Aixin Sun
LM&Ro
AI4CE
14
0
0
20 May 2025
Think-J: Learning to Think for Generative LLM-as-a-Judge
Hui Huang
Yancheng He
Hongli Zhou
Rui Zhang
Wei Liu
Weixun Wang
Wenbo Su
Bo Zheng
Jiaheng Liu
LLMAG
AILaw
ELM
LRM
17
0
0
20 May 2025
Toward Real-World Cooperative and Competitive Soccer with Quadrupedal Robot Teams
Zhi Su
Yuman Gao
Emily Lukas
Yunfei Li
Jiaze Cai
...
Fei Gao
Chao Yu
Zhongyu Li
Yi Wu
Koushil Sreenath
17
0
0
20 May 2025
APEX: Empowering LLMs with Physics-Based Task Planning for Real-time Insight
Wanjing Huang
Weixiang Yan
Zhen Zhang
Ambuj Singh
LRM
17
0
0
20 May 2025
Interpretable Reinforcement Learning for Load Balancing using Kolmogorov-Arnold Networks
Kamal Singh
Sami Marouani
Ahmad Al Sheikh
Pham Tran Anh Quang
Amaury Habrard
7
0
0
20 May 2025
Preference Learning with Lie Detectors can Induce Honesty or Evasion
Chris Cundy
Adam Gleave
7
0
0
20 May 2025
AAPO: Enhance the Reasoning Capabilities of LLMs with Advantage Momentum
Jian Xiong
Jingbo Zhou
Jingyong Ye
Dejing Dou
LRM
28
0
0
20 May 2025
KIPPO: Koopman-Inspired Proximal Policy Optimization
Andrei Cozma
Landon Harris
Hairong Qi
7
0
0
20 May 2025
InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models
Yanggan Gu
Zhaoyi Yan
Yuanyi Wang
Yiming Zhang
Qi Zhou
Fei Wu
Hongxia Yang
12
0
0
20 May 2025
RLVR-World: Training World Models with Reinforcement Learning
Jialong Wu
Shaofeng Yin
Ningya Feng
Mingsheng Long
OffRL
VGen
17
0
0
20 May 2025
Multi-parameter Control for the (1+($λ$,$λ$))-GA on OneMax via Deep Reinforcement Learning
Tai Nguyen
Phong Le
Carola Doerr
Nguyen Dang
34
0
0
19 May 2025
Composing Dextrous Grasping and In-hand Manipulation via Scoring with a Reinforcement Learning Critic
Lennart Röstel
Dominik Winkelbauer
Johannes Pitz
Leon Sievers
Berthold Bäuml
7
0
0
19 May 2025
Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
Hengli Li
Chenxi Li
Tong Wu
Xuekai Zhu
Yuxuan Wang
...
Eric Hanchen Jiang
Song-Chun Zhu
Zixia Jia
Ying Nian Wu
Zilong Zheng
LRM
22
0
0
19 May 2025
Reasoning BO: Enhancing Bayesian Optimization with Long-Context Reasoning Power of LLMs
Zhuo Yang
Lingli Ge
Dong Han
Tianfan Fu
Yuqiang Li
32
0
0
19 May 2025
Policy-Driven World Model Adaptation for Robust Offline Model-based Reinforcement Learning
Jiayu Chen
Aravind Venugopal
Jeff Schneider
OffRL
25
0
0
19 May 2025
Action-Dependent Optimality-Preserving Reward Shaping
Grant C. Forbes
Jianxun Wang
Leonardo Villalobos-Arias
Arnav Jhala
David L. Roberts
OffRL
27
0
0
19 May 2025
Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately
Yuhang Wang
Youhe Jiang
Bin Cui
Fangcheng Fu
LRM
17
0
0
19 May 2025
ToTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential through Puzzles Solving
Haoyuan Wu
Xueyi Chen
Rui Ming
Jilong Gao
Shoubo Hu
Zhuolun He
Bei Yu
LRM
26
0
0
19 May 2025
Dynamic Sight Range Selection in Multi-Agent Reinforcement Learning
Wei-Chen Liao
Ti-Rong Wu
I-Chen Wu
24
0
0
19 May 2025
Dribble Master: Learning Agile Humanoid Dribbling Through Legged Locomotion
Zhuoheng Wang
Jinyin Zhou
Qi Wu
27
0
0
19 May 2025