DeepSeek-V4 Architecture

A 1.6T-parameter MoE model announced by DeepSeek in April 2026. It is substantially larger than V3 (671B), and three algorithmic innovations sharply reduce its long-context cost.

Three Key Algorithmic Innovations

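Component	Approach
Sliding Window Attention	Full attention over only the most recent ~500 tokens
Block-sparse Attention	Compress the entire context 100:1, then apply full attention
Compressed Sparse Attention	Compress 4:1, then select the top-k entries with the Lightning Indexer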
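
The three approaches in the table can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: mean pooling stands in for whatever learned compressor the model uses, a plain dot-product score stands in for the Lightning Indexer, and the function names are hypothetical rather than taken from any DeepSeek code.

```python
from typing import List

Vector = List[float]


def sliding_window_positions(query_pos: int, window: int) -> List[int]:
    """Causal positions a query may attend to under a sliding window."""
    start = max(0, query_pos - window + 1)
    return list(range(start, query_pos + 1))


def compress(keys: List[Vector], ratio: int) -> List[Vector]:
    """Mean-pool every `ratio` consecutive key vectors into one entry.

    Mean pooling is an assumed stand-in for the real compressor.
    """
    blocks = [keys[i:i + ratio] for i in range(0, len(keys), ratio)]
    return [[sum(col) / len(block) for col in zip(*block)] for block in blocks]


def top_k_entries(query: Vector, compressed: List[Vector], k: int) -> List[int]:
    """Indices of the k compressed entries with the highest dot-product score.

    The dot product is an assumed stand-in for the indexer's scoring function.
    Indices are returned in ascending order.
    """
    scores = [sum(q * c for q, c in zip(query, entry)) for entry in compressed]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy usage: 8 keys of dimension 2, compressed 4:1 into 2 entries.
keys = [[float(i), 1.0] for i in range(8)]
compressed = compress(keys, ratio=4)
selected = top_k_entries([1.0, 0.0], compressed, k=1)
```

With a 100:1 ratio the compressed sequence is short enough to attend to in full; with a 4:1 ratio it is still long, which is why a selection step such as `top_k_entries` is applied before attention.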

Effect: compute at 27% and KV cache memory at 10% (relative to V3 Pro)

Cheaply widens the residual connection pathway to keep the training of deep models stable.
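
The notes do not name the mechanism, so the following is only a generic sketch of one way a residual pathway can be widened at low cost: keep several parallel residual streams, run the expensive layer once on a weighted mix of them, and write a weighted share of the output back to each stream. The function name `widened_residual_step` and the `read_weights` and `write_weights` parameters are hypothetical, and in a real model they would be learned.

```python
from typing import Callable, List

Vector = List[float]


def widened_residual_step(
    streams: List[Vector],
    layer: Callable[[Vector], Vector],
    read_weights: List[float],
    write_weights: List[float],
) -> List[Vector]:
    """Apply one layer to n parallel residual streams.

    The layer runs once on a mixed input, so its cost does not grow
    with the number of streams; only the cheap mixing does.
    """
    dim = len(streams[0])
    # Read: combine the n streams into a single layer input.
    layer_in = [
        sum(w * s[d] for w, s in zip(read_weights, streams)) for d in range(dim)
    ]
    layer_out = layer(layer_in)
    # Write: each stream keeps its own value (the skip path) and adds
    # its weighted share of the layer output.
    return [
        [s[d] + w * layer_out[d] for d in range(dim)]
        for s, w in zip(streams, write_weights)
    ]


# Toy usage: two streams of dimension 2 and a layer that doubles its input.
streams = [[1.0, 2.0], [3.0, 4.0]]
updated = widened_residual_step(
    streams,
    layer=lambda x: [2.0 * v for v in x],
    read_weights=[0.5, 0.5],
    write_weights=[1.0, 0.0],
)
```

In this toy case the second stream has a write weight of zero, so it passes through the layer unchanged, which shows how extra streams can carry information across depth without extra layer compute.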

The optimizer that Chinese models have been adopting as their standard after Adam. It speeds up training and improves data efficiency.