# Hyper-Parameter Optimization

**Work in progress:** Please note that this page is very much a work in progress!

This is a small example of how Hydra and submitit make it easy to launch lots of hyper-parameter optimization (HPO) jobs on SLURM clusters.
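For instance, a plain Hydra multirun already launches one job per value of an overridden parameter; the configs below go one step further and let a sweeper pick the values for you. A minimal sketch (assuming `project/main.py` is the Hydra entry point used throughout this page):

```bash
# Plain Hydra multirun (no HPO): -m / --multirun launches one job per listed
# value. With a submitit launcher config selected, each job is submitted to
# SLURM instead of running locally.
python project/main.py -m algorithm.optimizer.lr=0.001,0.01,0.1
```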

## Hyper-Parameter Optimization with the Orion Hydra Sweeper

Here is a configuration file that you can use to launch an HPO sweep:

```yaml
# @package _global_
defaults:
  - example.yaml # A configuration for a single run (that works!)
  - override /hydra/sweeper: orion # Select the orion sweeper plugin

log_level: DEBUG
name: "local-sweep-example"
seed: 123

algorithm:
  optimizer:
    # This value will be overwritten by the sweeper.
    lr: 0.002

trainer:
  accelerator: auto
  devices: 1
  max_epochs: 1
  logger:
    wandb:
      _target_: lightning.pytorch.loggers.wandb.WandbLogger
      project: "ResearchTemplate"
      # TODO: Use the Orion trial name?
      # name: ${oc.env:SLURM_JOB_ID}_${oc.env:SLURM_ARRAY_TASK_ID,0}_${oc.env:SLURM_PROCID}
      save_dir: "${hydra:runtime.output_dir}"
      offline: False # set True to store all logs only locally
      # id: ${oc.env:SLURM_JOB_ID}_${oc.env:SLURM_ARRAY_TASK_ID,0}_${oc.env:SLURM_PROCID} # pass correct id to resume experiment!
      # entity: ""  # set to name of your wandb team
      log_model: False
      prefix: ""
      job_type: "train"
      group: ["${name}"]
      tags: ["${name}"]

hydra:
  mode: MULTIRUN
  run:
    # output directory, generated dynamically on each run
    dir: logs/${name}/runs
  sweep:
    dir: logs/${name}/multiruns/
    # subdir: ${hydra.job.num}
    subdir: ${hydra.job.id}/task${hydra.job.num}

  sweeper:
    params:
      algorithm:
        optimizer:
          lr: "loguniform(1e-6, 1.0, default_value=3e-4)"
          # weight_decay: "loguniform(1e-6, 1e-2, default_value=0)"

    experiment:
      name: "${name}"
      version: 1

    algorithm:
      type: tpe
      config:
        seed: 1

    worker:
      n_workers: 1
      max_broken: 10000
      max_trials: 10

    storage:
      type: legacy
      use_hydra_path: false
      database:
        type: pickleddb
        host: "logs/${name}/multiruns/database.pkl"
    parametrization: null
```

You can use it like so:

```bash
python project/main.py experiment=local_sweep_example
```
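Since the sweeper's options live under the `hydra:` section, you can also tweak them from the command line instead of editing the file. A hedged sketch, assuming the `hydra.sweeper` structure shown in the config above:

```bash
# Run a longer sweep with more parallel workers, without touching the yaml.
# The dotted paths mirror the `hydra: sweeper:` section of the config above.
python project/main.py experiment=local_sweep_example \
    hydra.sweeper.worker.max_trials=50 \
    hydra.sweeper.worker.n_workers=4
```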

## Hyper-Parameter Optimization on a SLURM cluster

Here is a similar configuration file for launching the sweep on a SLURM cluster:

```yaml
# @package _global_

# This is an "experiment" config, that groups together other configs into a ready-to-run example.

defaults:
  - example.yaml # A configuration for a single run (that works!)
  - override /trainer/logger: wandb
  - override /hydra/sweeper: orion
  - override /resources: gpu
  - override /cluster: ??? # use `current` if you are already on a cluster, otherwise use one of the `cluster` configs.

log_level: DEBUG
name: "sweep-example"

# Set the seed using SLURM_PROCID, so that if we run more than one task per GPU, each task gets a different seed.
# TODO: This should technically be something like the "run_id", which would be different than SLURM_PROCID when using >1 gpus per "run".
seed: ${oc.env:SLURM_PROCID,123}

algorithm:
  optimizer:
    # This value will be overwritten by the sweeper.
    lr: 0.002

trainer:
  accelerator: gpu
  devices: 1
  max_epochs: 1
  logger:
    wandb:
      project: "ResearchTemplate"
      # TODO: Use the Orion trial name?
      name: ${oc.env:SLURM_JOB_ID}_${oc.env:SLURM_ARRAY_TASK_ID,0}_${oc.env:SLURM_PROCID}
      save_dir: "${hydra:runtime.output_dir}"
      offline: False # set True to store all logs only locally
      id: ${oc.env:SLURM_JOB_ID}_${oc.env:SLURM_ARRAY_TASK_ID,0}_${oc.env:SLURM_PROCID} # pass correct id to resume experiment!
      # entity: ""  # set to name of your wandb team
      log_model: False
      prefix: ""
      job_type: "train"
      group: ${oc.env:SLURM_JOB_ID}
      # tags: ["${name}"]

hydra:
  mode: MULTIRUN
  # TODO: Make it so running the same command twice in the same job id resumes from the last checkpoint.
  run:
    # output directory, generated dynamically on each run
    dir: logs/${name}/runs
  sweep:
    dir: logs/${name}/multiruns/
    # subdir: ${hydra.job.num}
    subdir: ${hydra.job.id}/task${oc.env:SLURM_PROCID,0}

  launcher:
    # todo: bump this up.
    array_parallelism: 5 # max num of jobs to run in parallel
    additional_parameters:
      time: 0-00:10:00 # maximum wall time allocated for the job (D-HH:MM:SS)
      # TODO: Pack more than one job on a single GPU, and support this with both a
      # patched submitit launcher as well as our remote submitit launcher, as well as by patching the
      # orion sweeper to not drop these other results.
      # ntasks_per_gpu: 1
  sweeper:
    params:
      algorithm:
        optimizer:
          lr: "loguniform(1e-6, 1.0, default_value=3e-4)"
          # weight_decay: "loguniform(1e-6, 1e-2, default_value=0)"
      # TODO: set up a fidelity parameter. It does not seem to be working right now.
      # trainer:
      #   # Let the HPO algorithm allocate more epochs to more promising HP configurations.
      #   max_epochs: "fidelity(1, 10, default_value=1)"

    parametrization: null
    experiment:
      name: "${name}"
      version: 1

    algorithm:
      #  BUG: Getting a weird bug with TPE: KeyError in `dum_below_trials = [...]` at line 397.
      type: tpe
      config:
        seed: 1

    worker:
      n_workers: ${hydra.launcher.array_parallelism}
      max_broken: 10000
      max_trials: 10

    storage:
      type: legacy
      use_hydra_path: false
      database:
        type: pickleddb
        host: "logs/${name}/multiruns/database.pkl"

Here's how you can easily launch a sweep remotely on the Mila cluster. If you are already on a SLURM cluster, use the `cluster=current` config:

```bash
python project/main.py experiment=cluster_sweep_example cluster=mila
```
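Once submitted, these are ordinary SLURM jobs, so the usual SLURM tooling applies. Also, since Orion keeps its trials in the pickleddb file configured above, re-running the same command should resume the experiment until `max_trials` trials have completed, rather than start a new sweep. A sketch with standard SLURM commands:

```bash
# Watch the submitted jobs (standard SLURM, nothing template-specific).
squeue -u $USER

# Re-running the same command picks the sweep back up from the database at
# logs/${name}/multiruns/database.pkl until max_trials trials have completed.
python project/main.py experiment=cluster_sweep_example cluster=mila
```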