tuning module
This module contains functions and classes for hyperparameter tuning and distributed training using Ray Tune.
- class modelforge.train.tuning.RayTuner(model: Type[Module])[source]
Bases: object
Initializes the RayTuner with the given model.
- Parameters:
model (torch.nn.Module) – The model to be tuned and trained using Ray.
- get_ray_trainer(number_of_workers: int = 2, use_gpu: bool = False)[source]
Initializes and returns a Ray Trainer for distributed training.
Configures a Ray Trainer with a specified number of workers and GPU usage settings. This trainer is prepared for distributed training using Ray, with support for checkpointing.
- Parameters:
number_of_workers (int, optional) – The number of distributed workers to use, by default 2.
use_gpu (bool, optional) – Specifies whether to use GPUs for training, by default False.
- Returns:
The configured Ray Trainer for distributed training.
- Return type:
TorchTrainer
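A minimal usage sketch for get_ray_trainer, assuming modelforge and Ray are installed. The helper name build_trainer is hypothetical. It uses only the constructor and method signature documented above, and defers the import so the snippet loads even where modelforge is absent.

```python
def build_trainer(model, number_of_workers=2, use_gpu=False):
    """Hypothetical helper: wrap a model in RayTuner and return its Ray trainer."""
    # Deferred import: modelforge (and Ray) are only required when this runs.
    from modelforge.train.tuning import RayTuner

    tuner = RayTuner(model)
    # Keyword names follow the documented signature of get_ray_trainer.
    return tuner.get_ray_trainer(
        number_of_workers=number_of_workers,
        use_gpu=use_gpu,
    )
```

The returned object is documented as a TorchTrainer. Ray's TorchTrainer exposes a fit() method that launches training, though this page does not state whether callers are expected to invoke it directly.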
- train_func()[source]
Defines the training function to be used with Ray for distributed training.
This function configures a PyTorch Lightning trainer with the Ray Distributed Data Parallel (DDP) strategy for efficient distributed training. The training process utilizes a custom training loop and environment setup provided by Ray.
Note: This function should be passed to a Ray Trainer or directly used with Ray tasks.
- tune_with_ray(train_dataloader, val_dataloader, number_of_epochs: int = 5, number_of_samples: int = 10, number_of_ray_workers: int = 2, train_on_gpu: bool = False, metric: str = 'val/per_system_energy/rmse')[source]
Performs hyperparameter tuning using Ray Tune.
This method sets up and starts a Ray Tune hyperparameter tuning session, utilizing the ASHA scheduler for efficient trial scheduling and early stopping.
- Parameters:
train_dataloader (DataLoader) – The DataLoader for training data.
val_dataloader (DataLoader) – The DataLoader for validation data.
number_of_epochs (int, optional) – The maximum number of epochs for training, by default 5.
number_of_samples (int, optional) – The number of samples (trial runs) to perform, by default 10.
number_of_ray_workers (int, optional) – The number of Ray workers to use for distributed training, by default 2.
train_on_gpu (bool, optional) – Whether to use GPUs for training, by default False.
metric (str, optional) – The metric to use for evaluation and early stopping, by default “val/per_system_energy/rmse”.
- Returns:
The result of the hyperparameter tuning session, containing performance metrics and the best hyperparameters found.
- Return type:
ExperimentAnalysis
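The call documented above could be sketched as follows, assuming modelforge and Ray are installed and that the two DataLoader objects are built elsewhere. The wrapper name run_tuning is hypothetical. The keyword values are the documented defaults, written out so they are easy to change.

```python
def run_tuning(model, train_dataloader, val_dataloader):
    """Hypothetical wrapper: start a Ray Tune session with the documented defaults."""
    # Deferred import: modelforge (and Ray) are only required when this runs.
    from modelforge.train.tuning import RayTuner

    tuner = RayTuner(model)
    # Every keyword below mirrors the documented tune_with_ray signature.
    return tuner.tune_with_ray(
        train_dataloader=train_dataloader,
        val_dataloader=val_dataloader,
        number_of_epochs=5,        # upper bound on epochs per trial
        number_of_samples=10,      # number of trial runs
        number_of_ray_workers=2,   # distributed workers per trial
        train_on_gpu=False,
        metric="val/per_system_energy/rmse",
    )
```

The return value is documented as an ExperimentAnalysis. With the ASHA scheduler, poorly performing trials may be stopped before reaching number_of_epochs, so that value is best read as a maximum rather than a guaranteed run length.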
- modelforge.train.tuning.tune_model(model: Module, dataset: Dataset, num_samples: int = 100, name: str = 'tune')[source]
A function to tune a model.
- Parameters:
model (torch.nn.Module) – The model to tune.
dataset (torch.utils.data.Dataset) – The dataset to use for tuning.
num_samples (int, optional) – The number of samples to use for tuning, by default 100.
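A minimal sketch of calling tune_model, assuming modelforge is installed and that the model and dataset are constructed elsewhere. The helper name run_tune_model is hypothetical. This page does not document the return value, so the sketch passes it through unchanged.

```python
def run_tune_model(model, dataset, num_samples=100, name="tune"):
    """Hypothetical helper: forward arguments to modelforge's tune_model."""
    # Deferred import: modelforge is only required when this runs.
    from modelforge.train.tuning import tune_model

    # Argument names and defaults follow the documented signature.
    return tune_model(model, dataset, num_samples=num_samples, name=name)
```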