vlm_loader
Vision Language Model loader with support for multiple VLM architectures.
This module provides a unified interface for loading and running inference with various Vision Language Models, including Llama 4 Maverick, Gemma 3, InternVL3, Pixtral Large, and Qwen2.5-VL. Models can be loaded with different quantization strategies and inference frameworks (SGLang, vLLM, or Transformers).
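A minimal end-to-end sketch (assuming the module is importable as `vlm_loader`; the HuggingFace model id and image path below are illustrative placeholders, not values prescribed by this module):

```python
from PIL import Image

from vlm_loader import (
    InferenceFramework,
    QuantizationType,
    VLMConfig,
    create_vlm_loader,
)

# Configure a 4-bit quantized Gemma 3 checkpoint served through vLLM.
config = VLMConfig(
    model_id="google/gemma-3-27b-it",  # placeholder HuggingFace id
    quantization=QuantizationType.FOUR_BIT,
    framework=InferenceFramework.VLLM,
    max_memory_gb=40,
)

loader = create_vlm_loader("gemma-3-27b", config)
loader.load()
try:
    answer = loader.generate(
        images=[Image.open("page_001.png")],  # placeholder image path
        prompt="Summarize the contents of this page.",
        max_new_tokens=256,
        temperature=0.2,
    )
    print(answer)
finally:
    loader.unload()  # release GPU memory even if generation fails
```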
Module-level imports: logging, ABC, abstractmethod, dataclass, Enum, Any, torch, Image, AutoModel, AutoModelForVision2Seq, AutoProcessor, AutoTokenizer, BitsAndBytesConfig, Qwen2VLForConditionalGeneration.
logger
QuantizationType Objects
class QuantizationType(str, Enum)
Supported quantization types for model compression.
NONE
FOUR_BIT
EIGHT_BIT
AWQ
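As an illustration only, a hypothetical helper showing how these values could map to a transformers `BitsAndBytesConfig` (the actual mapping is internal to the loaders and may differ; `quantization_to_bnb` is not part of this module's API):

```python
import torch
from transformers import BitsAndBytesConfig

from vlm_loader import QuantizationType


def quantization_to_bnb(quantization: QuantizationType) -> BitsAndBytesConfig | None:
    """Hypothetical mapping from QuantizationType to bitsandbytes settings."""
    if quantization == QuantizationType.FOUR_BIT:
        # NF4 with bfloat16 compute is a common 4-bit configuration.
        return BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    if quantization == QuantizationType.EIGHT_BIT:
        return BitsAndBytesConfig(load_in_8bit=True)
    # NONE loads full-precision weights; AWQ checkpoints ship pre-quantized,
    # so neither needs a bitsandbytes config here.
    return None
```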
InferenceFramework Objects
class InferenceFramework(str, Enum)
Supported inference frameworks for model execution.
SGLANG
VLLM
TRANSFORMERS
VLMConfig Objects
@dataclass
class VLMConfig()
Configuration for Vision Language Model loading and inference.
Parameters
model_id : str
    HuggingFace model identifier or local path.
quantization : QuantizationType
    Quantization strategy to apply.
framework : InferenceFramework
    Inference framework to use for model execution.
max_memory_gb : int | None, default=None
    Maximum GPU memory to allocate in GB. If None, uses all available.
device : str, default="cuda"
    Device to load the model on.
trust_remote_code : bool, default=True
    Whether to trust remote code from HuggingFace.
model_id
quantization
framework
max_memory_gb
device
trust_remote_code
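Two illustrative configurations, one leaning on the defaults and one fully specified (the model ids are placeholders):

```python
from vlm_loader import InferenceFramework, QuantizationType, VLMConfig

# Minimal config: the defaults give device="cuda", trust_remote_code=True,
# and max_memory_gb=None (use all available GPU memory).
minimal = VLMConfig(
    model_id="OpenGVLab/InternVL3-78B",  # placeholder HuggingFace id
    quantization=QuantizationType.NONE,
    framework=InferenceFramework.SGLANG,
)

# Explicit config: an AWQ checkpoint with a hard GPU memory cap.
capped = VLMConfig(
    model_id="mistralai/Pixtral-Large-Instruct-2411",  # placeholder id
    quantization=QuantizationType.AWQ,
    framework=InferenceFramework.VLLM,
    max_memory_gb=48,
    device="cuda",
    trust_remote_code=True,
)
```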
VLMLoader Objects
class VLMLoader(ABC)
Abstract base class for Vision Language Model loaders.
All VLM loaders must implement the load and generate methods.
__init__
def __init__(config: VLMConfig) -> None
Initialize the VLM loader with configuration.
Parameters
config : VLMConfig
    Configuration for model loading and inference.
load
@abstractmethod
def load() -> None
Load the model into memory with configured settings.
Raises
RuntimeError
    If model loading fails.
generate
@abstractmethod
def generate(images: list[Image.Image],
prompt: str,
max_new_tokens: int = 512,
temperature: float = 0.7) -> str
Generate text response from images and prompt.
Parameters
images : list[Image.Image]
    List of PIL images to process.
prompt : str
    Text prompt for the model.
max_new_tokens : int, default=512
    Maximum number of tokens to generate.
temperature : float, default=0.7
    Sampling temperature for generation.
Returns
str
    Generated text response.
Raises
RuntimeError
    If generation fails or the model is not loaded.
unload
def unload() -> None
Unload the model from memory to free GPU resources.
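A schematic subclass showing the contract; this is not one of the loaders documented below (the `EchoLoader` name and its behavior are purely illustrative):

```python
from PIL import Image

from vlm_loader import VLMLoader


class EchoLoader(VLMLoader):
    """Toy loader used only to illustrate the VLMLoader contract."""

    def load(self) -> None:
        # A real implementation would build the processor and model from the
        # config passed to __init__ and raise RuntimeError on failure.
        self._loaded = True

    def generate(
        self,
        images: list[Image.Image],
        prompt: str,
        max_new_tokens: int = 512,
        temperature: float = 0.7,
    ) -> str:
        if not getattr(self, "_loaded", False):
            raise RuntimeError("Model is not loaded; call load() first.")
        return f"Received {len(images)} image(s) for prompt: {prompt!r}"
```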
Llama4MaverickLoader Objects
class Llama4MaverickLoader(VLMLoader)
Loader for Llama 4 Maverick Vision Language Model.
Llama 4 Maverick is a 400B-parameter mixture-of-experts (MoE) model with 17B active parameters, supporting multimodal input with a 1M-token context window.
load
def load() -> None
Load Llama 4 Maverick model with configured settings.
generate
def generate(images: list[Image.Image],
prompt: str,
max_new_tokens: int = 512,
temperature: float = 0.7) -> str
Generate text response from images and prompt using Llama 4 Maverick.
Gemma3Loader Objects
class Gemma3Loader(VLMLoader)
Loader for Gemma 3 27B Vision Language Model.
Gemma 3 27B excels at document analysis, OCR, and multilingual tasks while offering fast inference.
load
def load() -> None
Load Gemma 3 model with configured settings.
generate
def generate(images: list[Image.Image],
prompt: str,
max_new_tokens: int = 512,
temperature: float = 0.7) -> str
Generate text response from images and prompt using Gemma 3.
InternVL3Loader Objects
class InternVL3Loader(VLMLoader)
Loader for InternVL3-78B Vision Language Model.
InternVL3-78B achieves state-of-the-art results on vision benchmarks with strong scientific reasoning capabilities.
load
def load() -> None
Load InternVL3 model with configured settings.
generate
def generate(images: list[Image.Image],
prompt: str,
max_new_tokens: int = 512,
temperature: float = 0.7) -> str
Generate text response from images and prompt using InternVL3.
PixtralLargeLoader Objects
class PixtralLargeLoader(VLMLoader)
Loader for Pixtral Large Vision Language Model.
Pixtral Large is a 123B-parameter model with a 128k-token context length, optimized for batch processing of long documents.
load
def load() -> None
Load Pixtral Large model with configured settings.
generate
def generate(images: list[Image.Image],
prompt: str,
max_new_tokens: int = 512,
temperature: float = 0.7) -> str
Generate text response from images and prompt using Pixtral Large.
Qwen25VLLoader Objects
class Qwen25VLLoader(VLMLoader)
Loader for Qwen2.5-VL 72B Vision Language Model.
Qwen2.5-VL 72B is a proven, stable model with strong performance across vision-language tasks.
load
def load() -> None
Load Qwen2.5-VL model with configured settings.
generate
def generate(images: list[Image.Image],
prompt: str,
max_new_tokens: int = 512,
temperature: float = 0.7) -> str
Generate text response from images and prompt using Qwen2.5-VL.
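An illustrative direct use of this loader, bypassing the factory (the model id and image path are placeholders):

```python
from PIL import Image

from vlm_loader import InferenceFramework, QuantizationType, VLMConfig, Qwen25VLLoader

config = VLMConfig(
    model_id="Qwen/Qwen2.5-VL-72B-Instruct",  # placeholder HuggingFace id
    quantization=QuantizationType.AWQ,
    framework=InferenceFramework.TRANSFORMERS,
)

loader = Qwen25VLLoader(config)
loader.load()
try:
    table_markdown = loader.generate(
        images=[Image.open("invoice.png")],  # placeholder image path
        prompt="Extract the line items from this invoice as a Markdown table.",
        max_new_tokens=1024,
        temperature=0.1,
    )
    print(table_markdown)
finally:
    loader.unload()
```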
create_vlm_loader
def create_vlm_loader(model_name: str, config: VLMConfig) -> VLMLoader
Factory function to create appropriate VLM loader based on model name.
Parameters
model_name : str
    Name of the model to load. Supported values:
    - "llama-4-maverick" or "llama4-maverick"
    - "gemma-3-27b" or "gemma3"
    - "internvl3-78b" or "internvl3"
    - "pixtral-large" or "pixtral"
    - "qwen2.5-vl-72b" or "qwen25vl"
config : VLMConfig
    Configuration for model loading and inference.
Returns
VLMLoader
    Appropriate loader instance for the specified model.
Raises
ValueError
    If model_name is not recognized.
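A short sketch of the factory in use, including the ValueError path (the model id is a placeholder):

```python
from vlm_loader import InferenceFramework, QuantizationType, VLMConfig, create_vlm_loader

config = VLMConfig(
    model_id="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # placeholder id
    quantization=QuantizationType.FOUR_BIT,
    framework=InferenceFramework.SGLANG,
)

# Both aliases resolve to the same loader class.
loader_a = create_vlm_loader("llama-4-maverick", config)
loader_b = create_vlm_loader("llama4-maverick", config)
assert type(loader_a) is type(loader_b)

# Unrecognized names raise ValueError.
try:
    create_vlm_loader("not-a-model", config)
except ValueError as err:
    print(f"Rejected: {err}")
```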