# Hyper-Parameter Optimization
Please note that this is very much a work in progress!
This is a small example of how Hydra and submitit make it very easy to launch lots of jobs on SLURM clusters, which is exactly what is needed for hyper-parameter optimization (HPO).
## Hyper-Parameter Optimization with the Orion Hydra Sweeper
Here is a configuration file that you can use to launch a hyper-parameter optimization (HPO) sweep locally:
```yaml
# @package _global_
defaults:
  - example.yaml # A configuration for a single run (that works!)
  - override /hydra/sweeper: orion # Select the orion sweeper plugin

log_level: DEBUG
name: "local-sweep-example"
seed: 123

algorithm:
  optimizer:
    # This here will get overwritten by the sweeper.
    lr: 0.002

trainer:
  accelerator: auto
  devices: 1
  max_epochs: 1
  logger:
    wandb:
      _target_: lightning.pytorch.loggers.wandb.WandbLogger
      project: "ResearchTemplate"
      # TODO: Use the Orion trial name?
      # name: ${oc.env:SLURM_JOB_ID}_${oc.env:SLURM_ARRAY_TASK_ID,0}_${oc.env:SLURM_PROCID}
      save_dir: "${hydra:runtime.output_dir}"
      offline: False # set True to store all logs only locally
      # id: ${oc.env:SLURM_JOB_ID}_${oc.env:SLURM_ARRAY_TASK_ID,0}_${oc.env:SLURM_PROCID} # pass correct id to resume experiment!
      # entity: "" # set to name of your wandb team
      log_model: False
      prefix: ""
      job_type: "train"
      group: ["${name}"]
      tags: ["${name}"]

hydra:
  mode: MULTIRUN
  run:
    # output directory, generated dynamically on each run
    dir: logs/${name}/runs
  sweep:
    dir: logs/${name}/multiruns/
    # subdir: ${hydra.job.num}
    subdir: ${hydra.job.id}/task${hydra.job.num}
  sweeper:
    params:
      algorithm:
        optimizer:
          lr: "loguniform(1e-6, 1.0, default_value=3e-4)"
          # weight_decay: "loguniform(1e-6, 1e-2, default_value=0)"
    experiment:
      name: "${name}"
      version: 1
    algorithm:
      type: tpe
      config:
        seed: 1
    worker:
      n_workers: 1
      max_broken: 10000
      max_trials: 10
    storage:
      type: legacy
      use_hydra_path: false
      database:
        type: pickleddb
        host: "logs/${name}/multiruns/database.pkl"
    parametrization: null
```
You can use it like so:
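As a minimal sketch, assuming the config above is saved as `local_sweep_example.yaml` under your `experiment` config folder and that the project's entry point is `project/main.py` (adjust both names to match your setup):

```bash
# Launch the local HPO sweep. Since the config sets `hydra.mode: MULTIRUN`,
# this single command runs the whole sweep (up to `max_trials` trials).
python project/main.py experiment=local_sweep_example
```

Because Orion keeps its state in the PickledDB file configured under `storage` (`logs/${name}/multiruns/database.pkl`), re-running the same command should resume the sweep and keep adding trials until `max_trials` is reached.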
## Hyper-Parameter Optimization on a SLURM cluster
Here is a configuration file for launching the same kind of sweep on a SLURM cluster:
```yaml
# @package _global_
# This is an "experiment" config, that groups together other configs into a ready-to-run example.
defaults:
  - example.yaml # A configuration for a single run (that works!)
  - override /trainer/logger: wandb
  - override /hydra/sweeper: orion
  - override /resources: gpu
  - override /cluster: ??? # use `current` if you are already on a cluster, otherwise use one of the `cluster` configs.

log_level: DEBUG
name: "sweep-example"

# Set the seed to be the SLURM_PROCID, so that if we run more than one task per GPU, we get different seeds for each task.
# TODO: This should technically be something like the "run_id", which would be different than SLURM_PROCID when using >1 gpus per "run".
seed: ${oc.env:SLURM_PROCID,123}

algorithm:
  optimizer:
    # This here will get overwritten by the sweeper.
    lr: 0.002

trainer:
  accelerator: gpu
  devices: 1
  max_epochs: 1
  logger:
    wandb:
      project: "ResearchTemplate"
      # TODO: Use the Orion trial name?
      name: ${oc.env:SLURM_JOB_ID}_${oc.env:SLURM_ARRAY_TASK_ID,0}_${oc.env:SLURM_PROCID}
      save_dir: "${hydra:runtime.output_dir}"
      offline: False # set True to store all logs only locally
      id: ${oc.env:SLURM_JOB_ID}_${oc.env:SLURM_ARRAY_TASK_ID,0}_${oc.env:SLURM_PROCID} # pass correct id to resume experiment!
      # entity: "" # set to name of your wandb team
      log_model: False
      prefix: ""
      job_type: "train"
      group: ${oc.env:SLURM_JOB_ID}
      # tags: ["${name}"]

hydra:
  mode: MULTIRUN
  # TODO: Make it so running the same command twice in the same job id resumes from the last checkpoint.
  run:
    # output directory, generated dynamically on each run
    dir: logs/${name}/runs
  sweep:
    dir: logs/${name}/multiruns/
    # subdir: ${hydra.job.num}
    subdir: ${hydra.job.id}/task${oc.env:SLURM_PROCID,0}
  launcher:
    # todo: bump this up.
    array_parallelism: 5 # max num of jobs to run in parallel
    additional_parameters:
      time: 0-00:10:00 # maximum wall time allocated for the job (D-HH:MM:SS)
    # TODO: Pack more than one job on a single GPU, and support this with both a
    # patched submitit launcher as well as our remote submitit launcher, as well as by patching the
    # orion sweeper to not drop these other results.
    # ntasks_per_gpu: 1
  sweeper:
    params:
      algorithm:
        optimizer:
          lr: "loguniform(1e-6, 1.0, default_value=3e-4)"
          # weight_decay: "loguniform(1e-6, 1e-2, default_value=0)"
      # todo: setup a fidelity parameter. Seems to not be working right now.
      # trainer:
      #   # Let the HPO algorithm allocate more epochs to more promising HP configurations.
      #   max_epochs: "fidelity(1, 10, default_value=1)"
    parametrization: null
    experiment:
      name: "${name}"
      version: 1
    algorithm:
      # BUG: Getting a weird bug with TPE: KeyError in `dum_below_trials = [...]` at line 397.
      type: tpe
      config:
        seed: 1
    worker:
      n_workers: ${hydra.launcher.array_parallelism}
      max_broken: 10000
      max_trials: 10
    storage:
      type: legacy
      use_hydra_path: false
      database:
        type: pickleddb
        host: "logs/${name}/multiruns/database.pkl"
```
Here's how you can easily launch a sweep remotely on the Mila cluster. If you are already on a SLURM cluster, use the `cluster=current` config instead.
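As a sketch of the launch command, assuming the config above is saved as `cluster_sweep_example.yaml` under your `experiment` config folder and that a `mila` cluster config exists (both names are assumptions, adjust them to your setup):

```bash
# Launch the sweep from your local machine; the remote submitit launcher
# submits the individual jobs to the Mila cluster's SLURM scheduler.
python project/main.py experiment=cluster_sweep_example cluster=mila

# If you are already logged in on a SLURM cluster, use the `current` cluster config instead:
python project/main.py experiment=cluster_sweep_example cluster=current
```

Since `cluster` is set to `???` in the defaults list above, Hydra will refuse to run until you pick one of the cluster configs on the command line.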