Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding

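Key Summary

Google used DFlash-based block-diffusion inference on Google TPUs to remove the sequential bottleneck, achieving an average 3.13x speedup with peak performance nearly double that of EAGLE-3, and integrated the work into the vLLM ecosystem as open source to make efficient use of TPU hardware.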

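Implementation

Applies DFlash (block-diffusion speculative decoding) to process candidate tokens a block at a time in a single forward pass
Uses parallel verification tailored to TPU hardware together with high-quality draft predictions
Integrated into the vLLM ecosystem as open source, strengthening reproducibility and scalability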
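
The draft-then-verify loop described above can be illustrated with a minimal sketch. This is not DFlash or vLLM code: the drafter and target here are hypothetical toy functions (`toy_draft_block`, `toy_target_predict_all`) standing in for a block-diffusion draft model and the full target model, and acceptance uses simple greedy matching. The point is only to show how a whole block of candidates is proposed in one call, checked against the target in one call, and how the output stays identical to ordinary token-by-token decoding.

```python
from typing import List, Tuple


def toy_target_predict_all(context: List[int], draft: List[int]) -> List[int]:
    """Hypothetical target model: greedy next token is (last + 1) % 10.

    Returns len(draft) + 1 predictions; prediction i is the target's choice
    given context + draft[:i]. A real model would produce all of these in
    one parallel forward pass; the loop here only simulates that result.
    """
    seq = list(context)
    preds = []
    for i in range(len(draft) + 1):
        preds.append((seq[-1] + 1) % 10)
        if i < len(draft):
            seq.append(draft[i])
    return preds


def toy_draft_block(context: List[int], block_size: int) -> List[int]:
    """Hypothetical drafter: proposes a whole block in one call.

    It counts upward but never wraps at 10, so it is sometimes wrong.
    """
    return [context[-1] + 1 + i for i in range(block_size)]


def verify_block(draft: List[int], preds: List[int]) -> List[int]:
    """Keep the longest draft prefix the target agrees with.

    At the first mismatch, substitute the target's own token and stop.
    If every draft token matches, append the target's extra prediction.
    """
    accepted: List[int] = []
    for tok, pred in zip(draft, preds):
        if tok != pred:
            accepted.append(pred)
            return accepted
        accepted.append(tok)
    accepted.append(preds[len(draft)])
    return accepted


def speculative_generate(
    prompt: List[int], max_new_tokens: int, block_size: int
) -> Tuple[List[int], int]:
    """Generate tokens with one draft call and one verify call per step."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = toy_draft_block(out, block_size)
        preds = toy_target_predict_all(out, draft)
        out.extend(verify_block(draft, preds))
        steps += 1
    return out[len(prompt): len(prompt) + max_new_tokens], steps


def sequential_generate(prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(toy_target_predict_all(out, [])[0])
    return out[len(prompt):]


if __name__ == "__main__":
    tokens, steps = speculative_generate([1], max_new_tokens=12, block_size=4)
    print(tokens, "in", steps, "target steps")
    print(tokens == sequential_generate([1], 12))
```

In this toy setup the speculative loop needs fewer target calls than the sequential baseline while producing the same tokens. How large the saving is in practice depends on how often the drafter's block is accepted, which this sketch does not attempt to model.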

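Key Results

Average speedup of 3.13x, with peak performance nearly 2x that of EAGLE-3
Open-source integration improves reproducibility and applicability to a range of complex reasoning tasks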

Related Feed

TorchTPU: Running PyTorch Natively on TPUs at Google Scale

Unlocking Peak Performance on Qualcomm NPU with LiteRT

MediaTek NPU and LiteRT: Powering the next generation of on-device AI
