vllm.models.deepseek_v4.attention
DeepseekV4 MLA Attention Layer
_resolve_dsv4_backend
_resolve_dsv4_backend(vllm_config: VllmConfig | None)
Return the explicitly requested DSv4 sparse backend enum, or None if no backend was requested.
Source code in vllm/models/deepseek_v4/attention.py
_resolve_dsv4_kv_cache_dtype
_resolve_dsv4_kv_cache_dtype(
    backend,
    kv_cache_dtype: str,
    cache_config: CacheConfig | None,
) -> tuple[str, dtype]
Map (backend, --kv-cache-dtype) to (cache_dtype_str, torch_dtype).
FlashInfer V4 reads a contiguous 512-wide KV row (bf16 or per-tensor FP8 E4M3), whereas FlashMLA V4 reads the legacy UE8M0 paged layout (uint8 / fp8_ds_mla). For FlashMLA, the canonical fp8_ds_mla string is written back onto cache_config so that the page-size specs pick the 576B layout.
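The mapping described above can be sketched roughly as follows. This is an illustrative stand-in, not the vLLM implementation: the `Backend` enum, the `CacheConfigStub` class and its `cache_dtype` attribute, the rule that any `fp8*` string selects FP8 on the FlashInfer path, and the treatment of every non-FlashInfer backend as FlashMLA are all assumptions. Torch dtypes are represented by their names as plain strings so the sketch runs without torch installed.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple


class Backend(Enum):
    """Hypothetical stand-in for the real backend enum."""
    FLASHINFER = auto()
    FLASHMLA = auto()


@dataclass
class CacheConfigStub:
    """Hypothetical stand-in for CacheConfig; the attribute name is assumed."""
    cache_dtype: str = "auto"


def resolve_kv_cache_dtype(
    backend: Optional[Backend],
    kv_cache_dtype: str,
    cache_config: Optional[CacheConfigStub] = None,
) -> Tuple[str, str]:
    """Return (cache_dtype_str, torch_dtype_name) for the given backend."""
    if backend is Backend.FLASHINFER:
        # Contiguous 512-wide KV row: per-tensor FP8 E4M3 or bf16.
        # Assumption: any "fp8*" request selects the FP8 variant.
        if kv_cache_dtype.startswith("fp8"):
            return "fp8_e4m3", "float8_e4m3fn"
        return "bfloat16", "bfloat16"

    # Assumption: every other case is the FlashMLA path, which stores the
    # legacy UE8M0 paged layout as raw bytes. The canonical string is
    # written back so later page-size calculations see the same layout.
    if cache_config is not None:
        cache_config.cache_dtype = "fp8_ds_mla"
    return "fp8_ds_mla", "uint8"
```

The write-back on the FlashMLA branch is the part worth noting: the function has a side effect on the config object in addition to returning the pair, so callers that inspect the config afterwards see the canonical string rather than whatever was passed on the command line.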
_select_v4_sparse_impl
_select_v4_sparse_impl(
    vllm_config: VllmConfig | None = None,
) -> type[DeepseekV4SparseMLAAttentionImpl]
Pick the V4 sparse MLA impl class.
An explicit --attention-backend FLASHINFER_MLA_SPARSE_DSV4 selects the FlashInfer TRTLLM-gen path; otherwise the platform default (FlashMLA on NVIDIA, ROCm Aiter on AMD) is used.
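The selection rule above amounts to "explicit request wins, otherwise fall back to the platform default". A minimal sketch of that shape follows. The three impl classes, the `platform` string argument, and the function name are hypothetical placeholders; only the backend name `FLASHINFER_MLA_SPARSE_DSV4` comes from the text above.

```python
from typing import Optional, Type


class SparseImplBase:
    """Hypothetical placeholder for DeepseekV4SparseMLAAttentionImpl."""


class FlashInferImpl(SparseImplBase):
    """Placeholder for the FlashInfer TRTLLM-gen path."""


class FlashMLAImpl(SparseImplBase):
    """Placeholder for the NVIDIA platform default."""


class RocmAiterImpl(SparseImplBase):
    """Placeholder for the AMD platform default."""


def select_sparse_impl(
    requested_backend: Optional[str],
    platform: str,
) -> Type[SparseImplBase]:
    """Pick an impl class: an explicit request first, then the platform default."""
    if requested_backend == "FLASHINFER_MLA_SPARSE_DSV4":
        return FlashInferImpl
    # No explicit FlashInfer request: use the platform default.
    # Assumption: the platform is identified by the string "rocm" or "cuda".
    if platform == "rocm":
        return RocmAiterImpl
    return FlashMLAImpl
```

Returning the class rather than an instance mirrors the documented return type, `type[DeepseekV4SparseMLAAttentionImpl]`, and leaves construction to the caller.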