tuning module

This module contains functions and classes for hyperparameter tuning and distributed training using Ray Tune.

class modelforge.train.tuning.RayTuner(model: Type[Module])

Bases: object

Initializes the RayTuner with the given model.

Parameters: model (torch.nn.Module) – The model to be tuned and trained using Ray.

get_ray_trainer(number_of_workers: int = 2, use_gpu: bool = False)

Initializes and returns a Ray Trainer for distributed training.

Configures a Ray Trainer with a specified number of workers and GPU usage settings. This trainer is prepared for distributed training using Ray, with support for checkpointing.

Parameters:

number_of_workers (int, optional) – The number of distributed workers to use, by default 2.
use_gpu (bool, optional) – Specifies whether to use GPUs for training, by default False.

Returns:

The configured Ray Trainer for distributed training.

Return type:

TorchTrainer
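
The configuration described above might look roughly like the following with the Ray Train 2.x API. This is a minimal sketch, not the actual body of get_ray_trainer, which the documentation does not show. The helper names `scaling_kwargs` and `build_trainer` are hypothetical, and the checkpoint policy (keep the two best checkpoints, scored by the validation RMSE) is an assumption made for illustration.

```python
from typing import Any, Callable, Dict


def scaling_kwargs(number_of_workers: int = 2, use_gpu: bool = False) -> Dict[str, Any]:
    """Translate the documented arguments into ScalingConfig keyword arguments."""
    if number_of_workers < 1:
        raise ValueError("number_of_workers must be at least 1")
    return {"num_workers": number_of_workers, "use_gpu": use_gpu}


def build_trainer(
    train_func: Callable[[], None],
    number_of_workers: int = 2,
    use_gpu: bool = False,
):
    """Hypothetical stand-in for get_ray_trainer; needs `ray[train]` installed."""
    # Imported lazily so that scaling_kwargs can be used without Ray.
    from ray.train import CheckpointConfig, RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    scaling_config = ScalingConfig(**scaling_kwargs(number_of_workers, use_gpu))
    # Assumed checkpoint policy: keep the two checkpoints with the lowest metric.
    run_config = RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="val/per_system_energy/rmse",
            checkpoint_score_order="min",
        )
    )
    return TorchTrainer(train_func, scaling_config=scaling_config, run_config=run_config)
```

Separating the plain-Python argument mapping from the Ray calls keeps the defaults easy to inspect without starting a Ray cluster.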

train_func()

Defines the training function to be used with Ray for distributed training.

This function configures a PyTorch Lightning trainer with the Ray Distributed Data Parallel (DDP) strategy for efficient distributed training. The training process utilizes a custom training loop and environment setup provided by Ray.

Note: This function should be passed to a Ray Trainer or directly used with Ray tasks.
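
A training function of this kind could be sketched as follows, using the Lightning integration that ships with Ray Train 2.x (`ray.train.lightning`). The factory `make_train_func`, its arguments, and the use of the `lightning` package (rather than `pytorch_lightning`) are assumptions for illustration; the real train_func takes no arguments and reads its model and data from the RayTuner instance.

```python
from typing import Any, Callable


def make_train_func(
    model_factory: Callable[[], Any],
    train_dataloader: Any,
    val_dataloader: Any,
    max_epochs: int = 5,
) -> Callable[[], None]:
    """Return a zero-argument training function suitable for a Ray TorchTrainer."""

    def train_func() -> None:
        # Lazy imports: this body runs on each Ray worker, not on the driver.
        import lightning.pytorch as pl
        from ray.train.lightning import (
            RayDDPStrategy,
            RayLightningEnvironment,
            RayTrainReportCallback,
            prepare_trainer,
        )

        model = model_factory()
        trainer = pl.Trainer(
            max_epochs=max_epochs,
            devices="auto",
            accelerator="auto",
            strategy=RayDDPStrategy(),            # DDP coordinated by Ray
            plugins=[RayLightningEnvironment()],  # Ray-provided cluster environment
            callbacks=[RayTrainReportCallback()], # report metrics and checkpoints to Ray
            enable_progress_bar=False,
        )
        trainer = prepare_trainer(trainer)  # validates the Ray-specific setup
        trainer.fit(
            model,
            train_dataloaders=train_dataloader,
            val_dataloaders=val_dataloader,
        )

    return train_func
```

The returned function is what would be handed to a TorchTrainer as its per-worker training loop.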

tune_with_ray(train_dataloader, val_dataloader, number_of_epochs: int = 5, number_of_samples: int = 10, number_of_ray_workers: int = 2, train_on_gpu: bool = False, metric: str = 'val/per_system_energy/rmse')

Performs hyperparameter tuning using Ray Tune.

This method sets up and starts a Ray Tune hyperparameter tuning session, utilizing the ASHA scheduler for efficient trial scheduling and early stopping.
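
To illustrate the early-stopping idea behind ASHA, the sketch below implements plain synchronous successive halving in pure Python. It is a simplification: ASHA itself promotes trials asynchronously, and Ray's scheduler is not used here. The function name, the `loss_at` callback, and the default `grace_period` and `reduction_factor` values are all assumptions chosen for the example.

```python
from typing import Callable, Dict, List


def successive_halving(
    trial_ids: List[str],
    loss_at: Callable[[str, int], float],
    max_epochs: int = 5,
    grace_period: int = 1,
    reduction_factor: int = 2,
) -> Dict[str, int]:
    """Return how many epochs each trial was trained before it was stopped."""
    if grace_period < 1 or reduction_factor < 2:
        raise ValueError("grace_period must be >= 1 and reduction_factor >= 2")

    epochs_trained = {trial: 0 for trial in trial_ids}
    survivors = list(trial_ids)
    budget = grace_period
    while survivors:
        budget = min(budget, max_epochs)
        for trial in survivors:
            epochs_trained[trial] = budget  # train every survivor up to the budget
        if budget == max_epochs:
            break
        # Keep the best 1/reduction_factor of the trials (lower loss is better).
        ranked = sorted(survivors, key=lambda t: loss_at(t, budget))
        survivors = ranked[: max(1, len(ranked) // reduction_factor)]
        budget *= reduction_factor
    return epochs_trained


# Toy example: each trial's loss shrinks with the number of epochs trained.
final_loss = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.2}
print(successive_halving(list(final_loss), lambda t, e: final_loss[t] / e, max_epochs=4))
# {'a': 1, 'b': 2, 'c': 1, 'd': 4}
```

Only the most promising trial receives the full epoch budget, which is why this family of schedulers saves compute compared with training every sample for number_of_epochs.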

Parameters:

train_dataloader (DataLoader) – The DataLoader for training data.
val_dataloader (DataLoader) – The DataLoader for validation data.
number_of_epochs (int, optional) – The maximum number of epochs for training, by default 5.
number_of_samples (int, optional) – The number of samples (trial runs) to perform, by default 10.
number_of_ray_workers (int, optional) – The number of Ray workers to use for distributed training, by default 2.
train_on_gpu (bool, optional) – Whether to use GPUs for training, by default False.
metric (str, optional) – The metric to use for evaluation and early stopping, by default “val/per_system_energy/rmse”.

Returns:

The result of the hyperparameter tuning session, containing performance metrics and the best hyperparameters found.

Return type:

ExperimentAnalysis
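
A call to this method might look like the sketch below, which follows the documented signature and defaults. The helpers `tuning_kwargs` and `run_tuning` are hypothetical wrappers, not part of modelforge. Note that the constructor signature is annotated as Type[Module] while the parameter description says torch.nn.Module, so whether a model class or an instance should be passed is left open here.

```python
from typing import Any, Dict


def tuning_kwargs(
    number_of_epochs: int = 5,
    number_of_samples: int = 10,
    number_of_ray_workers: int = 2,
    train_on_gpu: bool = False,
    metric: str = "val/per_system_energy/rmse",
) -> Dict[str, Any]:
    """Collect the documented keyword arguments (with their documented defaults)."""
    return {
        "number_of_epochs": number_of_epochs,
        "number_of_samples": number_of_samples,
        "number_of_ray_workers": number_of_ray_workers,
        "train_on_gpu": train_on_gpu,
        "metric": metric,
    }


def run_tuning(model: Any, train_dataloader: Any, val_dataloader: Any, **overrides: Any):
    """Hypothetical wrapper; requires modelforge and Ray to be installed."""
    # Imported lazily so tuning_kwargs can be used without modelforge.
    from modelforge.train.tuning import RayTuner

    tuner = RayTuner(model)
    return tuner.tune_with_ray(
        train_dataloader=train_dataloader,
        val_dataloader=val_dataloader,
        **tuning_kwargs(**overrides),
    )
```

For a quick smoke test, something like `run_tuning(model, train_loader, val_loader, number_of_samples=2, number_of_epochs=1)` keeps the search small before a full run is launched.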

modelforge.train.tuning.tune_model(model: Module, dataset: Dataset, num_samples: int = 100, name: str = 'tune')

Tunes the hyperparameters of a model on the given dataset.

Parameters:

model (torch.nn.Module) – The model to tune.
dataset (torch.utils.data.Dataset) – The dataset to use for tuning.
num_samples (int, optional) – The number of samples to use for tuning. Default is 100.
name (str, optional) – The name of the tuning run. Default is 'tune'.