arXiv:2409.01141
Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
2 September 2024
Sungmin Yun, Kwanhee Kyung, Juhwan Cho, Jaewan Choi, Jongmin Kim, Byeongho Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn
MoE
Papers citing "Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching"
HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing
Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hosik Kim
18 Apr 2025
Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environments
Nikoleta Iliakopoulou, Jovan Stojkovic, Chloe Alverti, Tianyin Xu, Hubertus Franke, Josep Torrellas
24 Nov 2024
DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation
Seongmin Hong, Seungjae Moon, Junsoo Kim, Sungjae Lee, Minsub Kim, Dongsoo Lee, Joo-Young Kim
22 Sep 2022