Roofline Analysis

A roof (upper-bound) based method that classifies the performance bottleneck of LLM inference as compute-bound or memory-bound, and uses that split to plot latency and cost curves as a function of batch size (B).

Key formulas

T = max(t_compute, t_memory)

t_compute = (B × N_active) / FLOPs_per_second
t_memory  = (N_total + B × L × bytes) / bandwidth

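B: batch size (number of users/tokens processed per cycle)
N_active: number of parameters actually activated per token in an MoE model
N_total: total number of parameters that must be loaded into memory
L: context length (KV cache length)
bytes: KV cache size per token of context, in bytes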
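The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a calibrated model: it follows the note's simplified units (parameter counts are used directly as FLOPs and as bytes), and the hardware and model numbers are normalized placeholders chosen for this example, not measurements. The function and variable names are hypothetical.

```python
def step_time(batch, n_active, n_total, context_len, kv_bytes_per_token,
              flops_per_second, bandwidth):
    """Return (t_compute, t_memory, T) for one step, using the note's formulas."""
    t_compute = (batch * n_active) / flops_per_second
    t_memory = (n_total + batch * context_len * kv_bytes_per_token) / bandwidth
    return t_compute, t_memory, max(t_compute, t_memory)


def regime(t_compute, t_memory):
    """Label the bottleneck; an exact tie is reported as memory-bound."""
    return "compute-bound" if t_compute > t_memory else "memory-bound"


# Assumed, normalized numbers for illustration only:
# bandwidth = 1 unit/s, compute = 300 units/s, 1 of every 8 parameter units active.
# The KV cache term is switched off here (context_len = 0) to isolate the weights.
HW = dict(flops_per_second=300.0, bandwidth=1.0)
MODEL = dict(n_active=1.0, n_total=8.0, context_len=0, kv_bytes_per_token=0.0)

for batch in (1, 100, 10_000):
    tc, tm, t = step_time(batch, **MODEL, **HW)
    print(f"B={batch:>6}  t_compute={tc:.4f}  t_memory={tm:.4f}  T={t:.4f}  {regime(tc, tm)}")
```

Setting `context_len` and `kv_bytes_per_token` to non-zero values makes `t_memory` grow with the batch size as well, which is the second term of the t_memory formula.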

Key insights

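When B is small: t_memory > t_compute → memory-bound
When B is large: t_compute > t_memory → compute-bound
Throughput is maximized at the intersection of the two curves (B*)
B* ≈ (FLOPs/bandwidth) × (1/sparsity) ≈ 300 / sparsity, where sparsity = N_active / N_total and the KV cache term is neglected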
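The expression for B* follows from setting the two times equal. The short derivation below is a sketch under two assumptions that the note leaves implicit: the KV cache term B × L × bytes is neglected, and sparsity denotes the active fraction N_active / N_total. The symbols F (FLOPs_per_second) and W (bandwidth) are shorthand introduced only for this block.

```latex
\begin{aligned}
t_{\text{compute}} &= t_{\text{memory}} \\
\frac{B^{*}\, N_{\text{active}}}{F} &\approx \frac{N_{\text{total}}}{W} \\
B^{*} &\approx \frac{F}{W} \cdot \frac{N_{\text{total}}}{N_{\text{active}}}
      = \frac{F}{W} \cdot \frac{1}{\text{sparsity}}
\end{aligned}
```

With F/W ≈ 300, this reduces to B* ≈ 300 / sparsity.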

FLOPs/bandwidth ratio

At FP4, the ratio stays at roughly 300 across hardware generations:

H100: ~200 at FP8; GB300: ~300 at FP4

→ DeepSeek V3 sparsity ≈ 1/8 → B* ≈ 2400
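The arithmetic behind this figure can be checked with a small sketch. It uses only the ratio (~300) and sparsity (~1/8) quoted in this note, works in normalized units (bandwidth = 1, N_total = 1), and ignores the KV cache term; the helper names are hypothetical.

```python
def crossover_batch(flops_per_bandwidth, sparsity):
    """B* ~ (FLOPs/bandwidth) / sparsity, with sparsity = N_active / N_total."""
    return flops_per_bandwidth / sparsity


def per_token_time(batch, flops_per_bandwidth, sparsity):
    """Step time divided by batch size, in normalized units.

    Assumes bandwidth = 1 and N_total = 1, so N_active = sparsity,
    and ignores the KV cache term.
    """
    t_compute = batch * sparsity / flops_per_bandwidth
    t_memory = 1.0
    return max(t_compute, t_memory) / batch


RATIO = 300       # FLOPs/bandwidth figure quoted in the note
SPARSITY = 1 / 8  # DeepSeek V3 sparsity figure quoted in the note

b_star = crossover_batch(RATIO, SPARSITY)
print(b_star)  # 2400.0

# Below B*, doubling the batch halves the time per token; above B*, it no longer helps.
for batch in (b_star / 2, b_star, 2 * b_star):
    print(batch, per_token_time(batch, RATIO, SPARSITY))
```

In this simplified model the time per token stops improving once the batch reaches B*, which is why the intersection is the point of interest on the cost curve.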

Related concepts

References

Source: yt-V_Z-ydQJ54c-LLM-추론-인프라와-토큰-경제학
Original talk: Dwarkesh Patel × Reiner Pope (2026-04-30)