arXiv:2409.01141
Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
2 September 2024
Sungmin Yun, Kwanhee Kyung, Juhwan Cho, Jaewan Choi, Jongmin Kim, Byeongho Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn
MoE
Papers citing "Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching"
HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing
Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hosik Kim
18 Apr 2025
Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environments
Nikoleta Iliakopoulou, Jovan Stojkovic, Chloe Alverti, Tianyin Xu, Hubertus Franke, Josep Torrellas
24 Nov 2024
DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation
Seongmin Hong, Seungjae Moon, Junsoo Kim, Sungjae Lee, Minsub Kim, Dongsoo Lee, Joo-Young Kim
22 Sep 2022