
Hyper-Parameter Optimization

Work-in-progress!

Please note that this page is very much a work in progress!

This is a small example showing how Hydra and submitit make it very easy to launch lots of hyper-parameter optimization (HPO) jobs on SLURM clusters.

Hyper-Parameter Optimization with the Orion Hydra Sweeper

Here is a configuration file that you can use to launch a hyper-parameter optimization (HPO) sweep:

# @package _global_
defaults:
  - example.yaml  # A configuration for a single run (that works!)
  - override /hydra/sweeper: orion  # Select the orion sweeper plugin

log_level: DEBUG
name: "local-sweep-example"
seed: 123

algorithm:
  optimizer_config:
    # This value will be overwritten by the sweeper.
    lr: 0.002

trainer:
  accelerator: gpu
  devices: 1
  max_epochs: 1
  logger:
    wandb:
      _target_: lightning.pytorch.loggers.wandb.WandbLogger
      project: "ResearchTemplate"
      # TODO: Use the Orion trial name?
      # name: ${oc.env:SLURM_JOB_ID}_${oc.env:SLURM_ARRAY_TASK_ID,0}_${oc.env:SLURM_PROCID}
      save_dir: "${hydra:runtime.output_dir}"
      offline: False # set True to store all logs only locally
      # id: ${oc.env:SLURM_JOB_ID}_${oc.env:SLURM_ARRAY_TASK_ID,0}_${oc.env:SLURM_PROCID} # pass correct id to resume experiment!
      # entity: ""  # set to name of your wandb team
      log_model: False
      prefix: ""
      job_type: "train"
      group: "${name}"
      tags: ["${name}"]

hydra:
  mode: MULTIRUN
  run:
    # output directory, generated dynamically on each run
    dir: logs/${name}/runs
  sweep:
    dir: logs/${name}/multiruns/
    # subdir: ${hydra.job.num}
    subdir: ${hydra.job.id}/task${hydra.job.num}

  sweeper:
    params:
      algorithm:
        optimizer_config:
          lr: "loguniform(1e-6, 1.0, default_value=3e-4)"
          # weight_decay: "loguniform(1e-6, 1e-2, default_value=0)"

    experiment:
      name: "${name}"
      version: 1

    algorithm:
      type: tpe
      config:
        seed: 1

    worker:
      n_workers: 1
      max_broken: 10000
      max_trials: 10

    storage:
      type: legacy
      use_hydra_path: false
      database:
        type: pickleddb
        host: "logs/${name}/multiruns/database.pkl"
    parametrization: null

You can use it like so:

python project/main.py experiment=local_sweep_example
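
Since everything above is regular Hydra configuration, you can also tweak the sweep directly from the command line instead of editing the yaml file. For example, the following command (a sketch; it assumes the config keys shown above are left unchanged) runs more trials and trains each of them for a few more epochs:

python project/main.py experiment=local_sweep_example hydra.sweeper.worker.max_trials=25 trainer.max_epochs=5

Hydra applies these overrides on top of the composed config, so the Orion sweeper picks up the new `max_trials` value while the search space for `lr` stays as defined under `hydra.sweeper.params`.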

Hyper-Parameter Optimization on a SLURM cluster

Here is a configuration file that you can use to launch the same HPO sweep on a SLURM cluster with the submitit launcher:

# @package _global_
defaults:
  - example.yaml # A configuration for a single run (that works!)
  - override /trainer/logger: wandb
  - override /hydra/launcher: submitit_slurm
  - override /hydra/sweeper: orion

log_level: DEBUG
name: "sweep-example"
# TODO: This should technically be something like the "run_id", which would be different from SLURM_PROCID when using more than one GPU per "run".
seed: ${oc.env:SLURM_PROCID,123}

algorithm:
  optimizer_config:
    # This value will be overwritten by the sweeper.
    lr: 0.002

trainer:
  accelerator: gpu
  devices: 1
  max_epochs: 1
  logger:
    wandb:
      project: "ResearchTemplate"
      # TODO: Use the Orion trial name?
      name: ${oc.env:SLURM_JOB_ID}_${oc.env:SLURM_ARRAY_TASK_ID,0}_${oc.env:SLURM_PROCID}
      save_dir: "${hydra:runtime.output_dir}"
      offline: False # set True to store all logs only locally
      id: ${oc.env:SLURM_JOB_ID}_${oc.env:SLURM_ARRAY_TASK_ID,0}_${oc.env:SLURM_PROCID} # pass correct id to resume experiment!
      # entity: ""  # set to name of your wandb team
      log_model: False
      prefix: ""
      job_type: "train"
      group: ${oc.env:SLURM_JOB_ID}
      # tags: ["${name}"]

hydra:
  mode: MULTIRUN
  # TODO: Make it so running the same command twice in the same job id resumes from the last checkpoint.
  run:
    # output directory, generated dynamically on each run
    dir: logs/${name}/runs
  sweep:
    dir: logs/${name}/multiruns/
    # subdir: ${hydra.job.num}
    subdir: ${hydra.job.id}/task${oc.env:SLURM_PROCID}

  launcher:
    cpus_per_task: 2
    nodes: 1
    # TODO: Pack more than one job on a single GPU.
    tasks_per_node: 1
    mem_gb: 16
    gres: gpu:1
    # note: Can't currently use this argument with the submitit launcher plugin. We would need to
    # create a subclass to add support for it.
    # ntasks_per_gpu: 2

    # TODO: bump this up.
    array_parallelism: 1 # max num of jobs to run in parallel

    stderr_to_stdout: true

    ## A list of commands to add to the generated sbatch script before running srun:
    setup:
    # - export LD_PRELOAD=/some/folder/with/libraries/

    # Other things to pass to `sbatch`:
    additional_parameters:
      time: 0-00:10:00 # maximum wall time allocated for the job (D-HH:MM:SS)

  sweeper:
    params:
      algorithm:
        optimizer_config:
          lr: "loguniform(1e-6, 1.0, default_value=3e-4)"
          # weight_decay: "loguniform(1e-6, 1e-2, default_value=0)"

    experiment:
      name: "${name}"
      version: 1

    algorithm:
      #  BUG: Getting a weird bug with TPE: KeyError in `dum_below_trials = [...]` at line 397.
      type: tpe
      config:
        seed: 1

    worker:
      n_workers: ${hydra.launcher.array_parallelism}
      max_broken: 10000
      max_trials: 10

    storage:
      type: legacy
      use_hydra_path: false
      database:
        type: pickleddb
        host: "logs/${name}/multiruns/database.pkl"

You can use it like so:

python project/main.py experiment=cluster_sweep_example
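
Because Orion stores every trial in the pickleddb database configured under `hydra.sweeper.storage` (here `logs/${name}/multiruns/database.pkl`), re-running the same command with the same experiment `name` and `version` should pick up the existing experiment and keep suggesting new trials until `max_trials` is reached, instead of restarting the search from scratch. As with the local sweep, the launcher and sweeper settings can be overridden from the command line; the command below is a sketch that assumes the config keys shown above and a cluster where these SLURM options are available:

python project/main.py experiment=cluster_sweep_example hydra.launcher.array_parallelism=4 hydra.sweeper.worker.max_trials=20 hydra.launcher.additional_parameters.time=0-00:30:00

Since `hydra.sweeper.worker.n_workers` is interpolated from `hydra.launcher.array_parallelism`, raising the latter also lets Orion evaluate that many trials in parallel. The submitted job array can then be monitored with the usual SLURM tools, e.g. `squeue -u $USER`.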