llm_loader

Configurable LLM loader with multi-model support and quantization.

This module provides a loader for text-only language models, with support for multiple model options (Llama 4 Scout, Llama 3.3 70B, DeepSeek V3, Gemma 3), 4-bit quantization via bitsandbytes, the SGLang inference framework, and automatic fallback handling.

Imports: asyncio, dataclass, Enum, Path, Any, torch, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, PreTrainedModel, PreTrainedTokenizer
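
As a quick orientation before the per-member reference, here is a minimal end-to-end sketch of the intended usage, assuming only the classes and signatures documented below (the model ID is the example given in the docstrings, not a verified HuggingFace identifier):

import asyncio
from pathlib import Path

config = LLMConfig(
    model_id="meta-llama/Llama-4-Scout",  # example ID from the docstrings below
    quantization="4bit",
    framework=LLMFramework.TRANSFORMERS,
)

async def main() -> None:
    loader = LLMLoader(config, cache_dir=Path.home() / ".cache" / "models")
    await loader.load()
    try:
        result = await loader.generate("Explain 4-bit quantization in one sentence.")
        print(result.text, result.finish_reason)
    finally:
        await loader.unload()

asyncio.run(main())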

LLMFramework Objects

class LLMFramework(str, Enum)

Inference framework options for LLM models.

SGLANG

TRANSFORMERS

LLMConfig Objects

@dataclass
class LLMConfig()

Configuration for a language model.

Parameters

model_id : str
    HuggingFace model identifier (e.g., "meta-llama/Llama-4-Scout").
quantization : str
    Quantization mode (e.g., "4bit", "8bit", "none").
framework : LLMFramework
    Inference framework to use (sglang or transformers).
max_tokens : int, default=4096
    Maximum number of tokens to generate.
temperature : float, default=0.7
    Sampling temperature for generation.
top_p : float, default=0.9
    Nucleus sampling parameter.
context_length : int, default=131072
    Maximum context length in tokens.

Fields: model_id, quantization, framework, max_tokens, temperature, top_p, context_length
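
For illustration, a config that overrides only the sampling defaults; the model is one of the supported options named above, though the exact HuggingFace identifier is an assumption:

config = LLMConfig(
    model_id="deepseek-ai/DeepSeek-V3",  # assumed HF identifier for DeepSeek V3
    quantization="4bit",
    framework=LLMFramework.SGLANG,
    temperature=0.2,  # more deterministic sampling than the 0.7 default
    top_p=0.95,
)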

GenerationConfig Objects

@dataclass
class GenerationConfig()

Configuration for text generation.

Parameters

max_tokens : int, default=4096
    Maximum number of tokens to generate.
temperature : float, default=0.7
    Sampling temperature (0.0 for greedy, higher for more randomness).
top_p : float, default=0.9
    Nucleus sampling parameter.
stop_sequences : list[str] | None, default=None
    List of sequences that stop generation when encountered.

Fields: max_tokens, temperature, top_p, stop_sequences
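
A short construction sketch, e.g. for deterministic extraction that should halt at the first blank line:

gen_config = GenerationConfig(
    max_tokens=256,
    temperature=0.0,          # greedy decoding
    stop_sequences=["\n\n"],  # stop at the first blank line
)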

GenerationResult Objects

@dataclass
class GenerationResult()

Result from text generation.

Parameters

text : str
    Generated text.
tokens_used : int
    Number of tokens used in generation.
finish_reason : str
    Reason generation stopped (e.g., "length", "stop_sequence", "eos").

Fields: text, tokens_used, finish_reason

LLMLoader Objects

class LLMLoader()

Loader for text-only language models with quantization support.

This class handles loading language models with configurable quantization, supports multiple model options, and provides text generation utilities with error handling and fallback logic.

__init__

def __init__(config: LLMConfig, cache_dir: Path | None = None) -> None

Initialize the LLM loader.

Parameters

config : LLMConfig
    Model configuration specifying model ID, quantization, and framework.
cache_dir : Path | None, default=None
    Directory for caching model weights. If None, uses the default HuggingFace cache.

load

async def load() -> None

Load the language model and tokenizer.

This method loads the model with the specified quantization settings and prepares it for inference. Loading is protected by a lock to prevent concurrent loading attempts.

Raises

RuntimeError
    If model loading fails due to memory, invalid model ID, or other issues.
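
The module's imports suggest a standard transformers load path for the 4-bit case. A sketch of what that might look like; the BitsAndBytesConfig arguments are standard transformers/bitsandbytes options, not confirmed details of this module:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def _load_4bit(model_id: str, cache_dir=None):
    # NF4 4-bit weights with bf16 compute is the common bitsandbytes recipe.
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb,
        device_map="auto",  # let accelerate shard layers across available devices
        cache_dir=cache_dir,
    )
    return model, tokenizer

Per the description above, the real load() additionally holds a lock (e.g., an asyncio.Lock) so concurrent calls cannot race.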

generate

async def generate(
prompt: str,
generation_config: GenerationConfig | None = None) -> GenerationResult

Generate text from a prompt using the loaded model.

Parameters

prompt : str
    Input text prompt for generation.
generation_config : GenerationConfig | None, default=None
    Generation parameters. If None, uses default configuration.

Returns

GenerationResult
    Generated text with metadata (tokens used, finish reason).

Raises

RuntimeError
    If the model is not loaded or generation fails.
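
Usage inside an async context, after load() has succeeded; a sketch using only the documented API:

# Inside an async function, with `loader` already loaded:
result = await loader.generate(
    "Summarize the plot of Hamlet in two sentences.",
    generation_config=GenerationConfig(max_tokens=128, temperature=0.3),
)
print(result.text)
print(f"used {result.tokens_used} tokens; finish reason: {result.finish_reason}")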

unload

async def unload() -> None

Unload the model from memory.

This method releases the model and tokenizer, freeing GPU/CPU memory.
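
A plausible sketch of the release logic; the private attribute names are assumptions:

import gc
import torch

async def unload(self) -> None:
    # Drop the Python references first so the objects become collectable...
    self._model = None      # assumed attribute name
    self._tokenizer = None  # assumed attribute name
    gc.collect()
    # ...then ask the CUDA caching allocator to return freed blocks.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()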

is_loaded

def is_loaded() -> bool

Check if the model is currently loaded.

Returns

bool
    True if model and tokenizer are loaded, False otherwise.

get_memory_usage

def get_memory_usage() -> dict[str, int]

Get current GPU memory usage for the model.

Returns

dict[str, int]
    Dictionary with "allocated" and "reserved" memory in bytes. Returns zeros if CUDA is not available.
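
The documented behavior maps directly onto two standard torch calls; a sketch consistent with it:

import torch

def get_memory_usage(self) -> dict[str, int]:
    if not torch.cuda.is_available():
        return {"allocated": 0, "reserved": 0}
    return {
        # Bytes currently held by live tensors on the default device.
        "allocated": torch.cuda.memory_allocated(),
        # Bytes the caching allocator has reserved from the driver (>= allocated).
        "reserved": torch.cuda.memory_reserved(),
    }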

create_llm_config_from_dict

def create_llm_config_from_dict(model_dict: dict[str, Any]) -> LLMConfig

Create an LLMConfig from a dictionary (e.g., from YAML).

Parameters

model_dict : dict[str, Any]
    Dictionary containing model configuration keys.

Returns

LLMConfig
    Configured LLMConfig instance.

Raises

ValueError
    If required keys are missing or framework is invalid.
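
For example, with a dict of the shape this factory presumably expects; the key names mirror the LLMConfig fields, but the exact schema and model identifier are assumptions:

model_dict = {
    "model_id": "google/gemma-3-27b-it",  # assumed HF identifier for Gemma 3
    "quantization": "4bit",
    "framework": "transformers",
    "max_tokens": 2048,
}
config = create_llm_config_from_dict(model_dict)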

create_llm_loader_with_fallback

async def create_llm_loader_with_fallback(
primary_config: LLMConfig,
fallback_configs: list[LLMConfig],
cache_dir: Path | None = None) -> LLMLoader

Create an LLM loader with automatic fallback to alternative models.

Parameters

primary_config : LLMConfig
    Primary model configuration to try first.
fallback_configs : list[LLMConfig]
    List of fallback model configurations to try if primary fails.
cache_dir : Path | None, default=None
    Directory for caching model weights.

Returns

LLMLoader
    Successfully loaded LLM loader.

Raises

RuntimeError
    If all model loading attempts fail.
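
A closing sketch of the fallback flow, using only the documented names; the model IDs are illustrative assumptions:

import asyncio

async def main() -> None:
    primary = LLMConfig(
        model_id="meta-llama/Llama-4-Scout",
        quantization="4bit",
        framework=LLMFramework.SGLANG,
    )
    fallbacks = [
        LLMConfig(
            model_id="meta-llama/Llama-3.3-70B-Instruct",  # assumed HF identifier
            quantization="4bit",
            framework=LLMFramework.TRANSFORMERS,
        ),
    ]
    # Tries `primary` first; on failure, walks `fallbacks` in order and
    # returns the first loader whose load() succeeds.
    loader = await create_llm_loader_with_fallback(primary, fallbacks)
    print("loaded:", loader.is_loaded())
    await loader.unload()

asyncio.run(main())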