Recipes

Base Setup

salloc -w "cn-d[003-004]" --ntasks=1 --gpus-per-task=a100l:8 --exclusive --nodes=1 --cpus-per-task=128 --time=120:00:00 --ntasks-per-node=1 --mem=0
cd /tmp/
mkdir milabench
cd milabench
git clone https://github.com/mila-iqia/milabench.git

conda activate base
python --version
Python 3.11.4

virtualenv ./env
source ./env/bin/activate
pip install -e milabench/

export MILABENCH_WORDIR="$(pwd)"
export MILABENCH_BASE="$MILABENCH_WORDIR/results"
export MILABENCH_CONFIG="$MILABENCH_WORDIR/milabench/config/standard.yaml"
export BENCHMARK_VENV="$MILABENCH_WORDIR/results/venv/torch"

module load cuda/12.3.2                                          # <= or set CUDA_HOME to the right spot

milabench install
milabench prepare
milabench run

The current setup runs on 8xA100 SXM4 80 GB. Note that some benchmarks require more than 40 GB of VRAM. One benchmark might be problematic: rwkv requires nvcc, but it can be ignored.
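
Before launching install, prepare and run, it can help to confirm that the shell is set up the way the recipe expects. The sketch below is not part of milabench: preflight is a hypothetical helper that only checks the two variables exported above and whether nvcc is on the PATH, since rwkv needs it.

```shell
# Hypothetical helper, not part of milabench: sanity-check the base setup.
preflight() {
    missing=0
    # Variables exported in the base setup recipe
    for name in MILABENCH_BASE MILABENCH_CONFIG; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name is not set"
            missing=1
        fi
    done
    # Only a warning: nvcc is needed by rwkv, which can be ignored
    if ! command -v nvcc >/dev/null 2>&1; then
        echo "warning: nvcc not found, rwkv is expected to fail"
    fi
    return "$missing"
}

# Usage: preflight && milabench install
```

The function returns a non-zero status only when a variable is missing, so a missing nvcc does not block the run.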

Install+Prepare on Network nodes

Network node

conda activate py310

export NETWORK_FOLDER=/network/shared/setup

cd $NETWORK_FOLDER
git clone https://github.com/mila-iqia/milabench.git
pip install -e milabench

export MILABENCH_CONFIG="$NETWORK_FOLDER/milabench/config/standard.yaml"

milabench install --base $NETWORK_FOLDER --config $MILABENCH_CONFIG
milabench prepare --base $NETWORK_FOLDER --config $MILABENCH_CONFIG

Compute node

# Sync data to local but use code from the network location
milabench sharedsetup --network $NETWORK_FOLDER --local /tmp/local/results

milabench run --base /tmp/local/results --config $NETWORK_FOLDER/milabench/config/standard.yaml

Reuse scaling model for newer GPUs

# Use H100 scaling model for auto batch size
export MILABENCH_SIZER_CONFIG="/home/milabench/workspace/milabench/config/scaling/H100.yaml"

# Save the results to the H200 file
export MILABENCH_SIZER_SAVE="/home/milabench/workspace/milabench/config/scaling/H200.yaml"

# Enable batch resizing
export MILABENCH_SIZER_AUTO=1

milabench run
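
Since this recipe reads one scaling file and writes another, a small guard can catch a wrong path before a long run starts. This is a minimal sketch: check_sizer_files is a hypothetical helper, and the concern that saving to the same path may overwrite the reference model is an assumption, not documented milabench behavior.

```shell
# Hypothetical guard, not part of milabench: check the scaling files before a run.
check_sizer_files() {
    # $1: scaling model to read, $2: file the new measurements are saved to
    if [ ! -f "$1" ]; then
        echo "scaling model not found: $1"
        return 1
    fi
    if [ "$1" = "$2" ]; then
        echo "save path equals source path, the reference model may be overwritten"
        return 1
    fi
    return 0
}

# Usage (the benchmark only starts when both checks pass):
# check_sizer_files "$MILABENCH_SIZER_CONFIG" "$MILABENCH_SIZER_SAVE" && milabench run
```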

Batch Update Dependencies / Dependency pinning

Milabench comes with a tool to manage the dependencies of all the benchmarks and update them seamlessly.

Major version updates

export MILABENCH_BASE=../
export MILABENCH_GPU_ARCH=cuda
milabench pin -c constraints/cuda.txt --config config/standard.yaml --from-scratch

export MILABENCH_GPU_ARCH=rocm
milabench pin -c constraints/rocm.txt --config config/standard.yaml --from-scratch

export MILABENCH_GPU_ARCH=xpu
milabench pin -c constraints/xpu.txt --config config/standard.yaml --from-scratch

export MILABENCH_GPU_ARCH=hpu
milabench pin -c constraints/hpu.txt --config config/standard.yaml --from-scratch

Minor version updates

export MILABENCH_GPU_ARCH=cuda
milabench pin -c constraints/cuda.txt --config config/standard.yaml

export MILABENCH_GPU_ARCH=rocm
milabench pin -c constraints/rocm.txt --config config/standard.yaml

export MILABENCH_GPU_ARCH=xpu
milabench pin -c constraints/xpu.txt --config config/standard.yaml

export MILABENCH_GPU_ARCH=hpu
milabench pin -c constraints/hpu.txt --config config/standard.yaml
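
The pin commands above differ only in the architecture name and the --from-scratch flag, so they can be driven by a loop. The following is a sketch under assumptions: pin_all and PIN_DRY_RUN are hypothetical names introduced here, and the function is expected to run from the repository root with MILABENCH_BASE already exported as in the recipe.

```shell
# Sketch: loop over the architectures instead of repeating the command.
# PIN_DRY_RUN is a switch of this sketch, not a milabench option.
pin_all() {
    # $1: optional extra flag, e.g. --from-scratch for major version updates
    extra="${1:-}"
    for arch in cuda rocm xpu hpu; do
        cmd="milabench pin -c constraints/$arch.txt --config config/standard.yaml"
        if [ -n "$extra" ]; then
            cmd="$cmd $extra"
        fi
        if [ "${PIN_DRY_RUN:-0}" = "1" ]; then
            # Print the command instead of running it
            echo "MILABENCH_GPU_ARCH=$arch $cmd"
        else
            # Stop at the first architecture that fails to pin
            MILABENCH_GPU_ARCH="$arch" $cmd || return 1
        fi
    done
}

# Major update: pin_all --from-scratch
# Minor update: pin_all
```

Setting PIN_DRY_RUN=1 prints the four commands without executing them, which is a cheap way to review them first.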

Increase Runtime

For profiling, it might be useful to run the benchmarks for longer than the default configuration allows. You can update the yaml file (config/base.yaml or config/standard.yaml) to increase the runtime limits. Two values govern the runtime of a benchmark: max_duration, which is a pure timeout to avoid benchmark hangs, and voir.options.stop, which is the target number of observations milabench gathers before stopping.

_defaults:
  max_duration: 600           # <= Maximum number of seconds the bench can run;
                              #    if this timeout triggers, the bench is marked as failed
  voir:
    options:
      stop: 60                # <= Maximum number of observations we are gathering;
                              #    this is usually what triggers the early exit of the benchmark.
                              #    An observation is usually one train step
                              #    (batch forward/backward/optimizer.step)
      interval: "1s"
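
If editing the file by hand is inconvenient, the two limits can be raised in a copy of the config. The helper below is a rough sketch, not a milabench feature: bump_limits is a hypothetical name, and it assumes that max_duration and stop appear as plain "key: integer" lines like in the excerpt above. It rewrites every matching line, including any per-benchmark overrides.

```shell
# Hypothetical helper: write a copy of a config with larger runtime limits.
bump_limits() {
    # $1: source yaml, $2: destination yaml
    # $3: new max_duration in seconds, $4: new stop (number of observations)
    sed -E \
        -e "s/^([[:space:]]*max_duration:[[:space:]]*)[0-9]+/\1$3/" \
        -e "s/^([[:space:]]*stop:[[:space:]]*)[0-9]+/\1$4/" \
        "$1" > "$2"
}

# Usage:
# bump_limits config/standard.yaml config/standard-long.yaml 3600 500
# milabench run --config config/standard-long.yaml
```

Keeping the copy in the same directory as the original is the safer choice in case the config refers to neighbouring files by relative path.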

One Env

If you are using a container with dependencies such as pytorch already installed, you can force milabench to use a single environment for everything.

milabench install --use-current-env
milabench prepare --use-current-env
milabench run --use-current-env --select bert-fp32

Batch resizer

If the GPU you are using has less VRAM, automatic batch resizing can be enabled with the command below. Note that this will not affect benchmarks that already use a batch size of one, such as opt-6_7b and possibly opt-1_3b.

MILABENCH_SIZER_AUTO=1 milabench run

Device Select

To run on a subset of GPUs, set CUDA_VISIBLE_DEVICES. By default milabench tries to use all the GPUs all the time, which might make a run take a bit longer; reducing the number of visible devices (to 2, for example) might make experimentation faster.

CUDA_VISIBLE_DEVICES=0,1,2,3 milabench run
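
When the device count changes often during experimentation, the device list can be generated instead of typed. A minimal sketch, where first_n_gpus is a hypothetical helper that simply counts up from index 0:

```shell
# Hypothetical helper: build a comma-separated list of the first N device indices.
first_n_gpus() {
    # $1: number of devices to expose, counted from index 0
    i=0
    list=""
    while [ "$i" -lt "$1" ]; do
        # Append a comma only when the list already has an entry
        list="${list:+$list,}$i"
        i=$((i + 1))
    done
    echo "$list"
}

# Usage: CUDA_VISIBLE_DEVICES="$(first_n_gpus 2)" milabench run
```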

Update Package

To update pytorch to use a newer version of cuda, upgrade it inside the benchmark environment (milabench creates a separate environment for benchmarks).

# can be executed after `milabench install` at the earliest
source $BENCHMARK_VENV/bin/activate
pip install -U torch torchvision torchaudio

Arguments

If environment variables are troublesome, the values can also be passed as arguments.

milabench install --base $MILABENCH_BASE --config $MILABENCH_CONFIG
milabench prepare --base $MILABENCH_BASE --config $MILABENCH_CONFIG
milabench run --base $MILABENCH_BASE --config $MILABENCH_CONFIG
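
The three stages take the same arguments, so they can be wrapped in one function that stops at the first failure. This is only a sketch: run_stages and the MILABENCH_CMD override are hypothetical names introduced here; the override exists so the sequence can be printed with echo before running it for real.

```shell
# Sketch: run install, prepare and run with explicit arguments.
# MILABENCH_CMD is an override of this sketch, not a milabench variable.
run_stages() {
    # $1: base directory, $2: config file
    for stage in install prepare run; do
        ${MILABENCH_CMD:-milabench} "$stage" --base "$1" --config "$2" || {
            echo "stage '$stage' failed" >&2
            return 1
        }
    done
}

# Dry run:  MILABENCH_CMD=echo run_stages "$MILABENCH_BASE" "$MILABENCH_CONFIG"
# Real run: run_stages "$MILABENCH_BASE" "$MILABENCH_CONFIG"
```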

To help us troubleshoot future issues, you can forward your results directory. It holds all the benchmark-specific logs and metrics gathered by milabench.

zip -r results.zip results

Run a benchmark without milabench

milabench dev {benchname}  # will open bash with the benchmark venv sourced

# alternatively

source $MILABENCH_BASE/venv/torch/bin/activate

Containers

NGC

When using containers where some dependencies are already installed, we need to use a dummy virtualenv to make milabench install its dependencies there; the duplicate dependencies can then be removed.

podman run --rm --device nvidia.com/gpu=all --storage-opt ignore_chown_errors=true --security-opt=label=disable --ipc=host -it -e HOME=$HOME -e USER=$USER -v $HOME:$HOME nvcr.io/nvidia/pytorch:24.02-py3

cd $HOME
rm -rf env
pip install virtualenv

# Create a virtual env with system packages to get the container's pytorch
virtualenv --system-site-packages env
source ./env/bin/activate
git clone https://github.com/mila-iqia/milabench.git
pip install -e milabench

export MILABENCH_BASE="$HOME/results"
export MILABENCH_CONFIG="$HOME/milabench/config/standard.yaml"
export MILABENCH_GPU_ARCH=cuda

# This updates the requirements for cuda
# milabench pin --from-scratch --variant cuda -c constraints/cuda.txt

# Install the new requirements (note: this will still install a new pytorch)
milabench install --use-current-env

# uninstall pytorch that was installed in the venv
# so we use the system packages instead
pip uninstall torch torchvision torchaudio

milabench prepare --use-current-env
milabench run --use-current-env

Nightly

podman run -it --rm --ipc=host --gpus=all                        \
   -e MILABENCH_HF_TOKEN=<TOKEN>                                 \
   -v /tmp/workspace/data:/milabench/envs/data                   \
   -v /tmp/workspace/runs:/milabench/envs/runs                   \
   ghcr.io/mila-iqia/milabench:cuda-nightly milabench prepare

podman run -it --rm --ipc=host --gpus=all                        \
   -e MILABENCH_HF_TOKEN=<TOKEN>                                 \
   -v /tmp/workspace/data:/milabench/envs/data                   \
   -v /tmp/workspace/runs:/milabench/envs/runs                   \
   ghcr.io/mila-iqia/milabench:cuda-nightly milabench run

Multi Node & Docker

Create a system file with the right docker configuration

system:
   # Default arch
   arch: cuda

   # sshkey used in remote milabench operations
   sshkey: ~/.ssh/id_ed25519

   # Configures how to use docker
   docker:
      executable: podman
      image: ghcr.io/mila-iqia/milabench:${system.arch}-nightly
      base: /tmp/workspace
      args: [
         -it, --rm, --ipc=host, --gpus=all, --network, host, --privileged,
         -e, MILABENCH_HF_TOKEN=<TOKEN>,
         -v, "${system.docker.base}/data:/milabench/envs/data",
         -v, "${system.docker.base}/runs:/milabench/envs/runs",
      ]

   # Nodes list
   nodes:
      # Alias used to reference the node
      - name: manager
        ip: 192.168.11.11
        sshport: 5000
        # Use this node as the master node or not
        main: true
        # User to use in remote milabench operations
        user: manager

      - name: node1
        ip: 192.168.11.12
        main: false
        user: username
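
Since the system file lists an ssh key, user, and port for remote operations, it can be worth checking that each node accepts a non-interactive ssh login before starting. The sketch below is an assumption about what a useful check looks like, not something milabench provides: check_node and the SSH_CMD override are hypothetical, and port 22 is assumed when a node entry has no sshport.

```shell
# Hypothetical check, not part of milabench: verify ssh access to one node.
# SSH_CMD is an override of this sketch so the command can be printed with echo.
check_node() {
    # $1: user, $2: ip, $3: ssh port (empty means 22), $4: private key path
    ${SSH_CMD:-ssh} -i "$4" -p "${3:-22}" \
        -o BatchMode=yes -o ConnectTimeout=5 \
        "$1@$2" true
}

# Usage with the values from the system file above:
# check_node manager  192.168.11.11 5000 "$HOME/.ssh/id_ed25519"
# check_node username 192.168.11.12 ""   "$HOME/.ssh/id_ed25519"
```

BatchMode=yes makes ssh fail instead of prompting for a password, which surfaces key problems immediately.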

Use milabench docker to get the suggested commands for executing the benchmarks.

cp system.yaml /tmp/workspace/data/system.yaml
milabench docker --system system.yaml

podman run -it --rm --ipc=host --gpus=all --network host --privileged   \
   -e MILABENCH_HF_TOKEN=<TOKEN>                                        \
   -v /tmp/workspace/data:/milabench/envs/data                          \
   -v /tmp/workspace/runs:/milabench/envs/runs                          \
   -e OMP_NUM_THREADS='12' ghcr.io/mila-iqia/milabench:cuda-nightly     \
   milabench prepare --system /milabench/envs/data/system.yaml

podman run -it --rm --ipc=host --gpus=all --network host --privileged   \
   -e MILABENCH_HF_TOKEN=<TOKEN>                                        \
   -v /tmp/workspace/data:/milabench/envs/data                          \
   -v /tmp/workspace/runs:/milabench/envs/runs                          \
   -e OMP_NUM_THREADS='12' ghcr.io/mila-iqia/milabench:cuda-nightly     \
   milabench run --system /milabench/envs/data/system.yaml

Execute the prepare command first, then the run command.

Example Reports

8xA100 SXM4 80 GB

milabench run
=================
Benchmark results
=================
bench                          | fail | n |       perf |   sem% |   std% | peak_memory |      score | weight
bert-fp16                      |    0 | 8 |     154.92 |   0.3% |   4.5% |       28500 |    1240.06 |  0.00
bert-fp32                      |    0 | 8 |      29.55 |   0.0% |   0.5% |       35464 |     236.54 |  0.00
bert-tf32                      |    0 | 8 |     120.02 |   0.3% |   4.9% |       35466 |     960.04 |  0.00
bert-tf32-fp16                 |    0 | 8 |     154.87 |   0.3% |   4.5% |       28500 |    1239.70 |  3.00
bf16                           |    0 | 8 |     293.43 |   0.3% |   7.2% |        5688 |    2363.29 |  0.00
convnext_large-fp16            |    0 | 8 |     247.31 |   2.4% |  37.6% |       31362 |    1986.27 |  0.00
convnext_large-fp32            |    0 | 8 |      45.58 |   0.7% |  11.5% |       53482 |     360.90 |  0.00 ** High memory **
convnext_large-tf32            |    0 | 8 |     117.54 |   1.2% |  18.8% |       53482 |     940.03 |  0.00 ** High memory **
convnext_large-tf32-fp16       |    0 | 8 |     214.41 |   2.9% |  46.4% |       31362 |    1713.47 |  3.00
davit_large                    |    0 | 8 |     308.33 |   0.3% |   7.3% |       37900 |    2475.47 |  1.00
davit_large-multi              |    0 | 1 |    2242.69 |   2.0% |  15.2% |       45610 |    2242.69 |  5.00 ** High memory **
dlrm                           |    0 | 1 |  398088.30 |   2.5% |  19.3% |        7030 |  398088.30 |  1.00
focalnet                       |    0 | 8 |     391.21 |   0.3% |   6.8% |       29808 |    3143.46 |  2.00
fp16                           |    0 | 8 |     289.62 |   0.2% |   4.8% |        5688 |    2327.60 |  0.00
fp32                           |    0 | 8 |      19.13 |   0.0% |   1.3% |        6066 |     153.20 |  0.00
llama                          |    0 | 8 |     496.84 |   4.4% |  79.2% |       32326 |    3778.17 |  1.00
opt-1_3b                       |    0 | 1 |      28.23 |   0.1% |   0.4% |       45976 |      28.23 |  5.00 ** High memory **
opt-6_7b                       |    0 | 1 |      14.22 |   0.0% |   0.1% |       57196 |      14.22 |  5.00 ** High memory **
reformer                       |    0 | 8 |      61.45 |   0.0% |   1.0% |       29304 |     492.17 |  1.00
regnet_y_128gf                 |    0 | 8 |      82.25 |   0.3% |   6.8% |       35454 |     658.46 |  2.00
resnet152                      |    0 | 8 |     669.61 |   0.4% |   9.6% |       37878 |    5378.33 |  1.00
resnet152-multi                |    0 | 1 |    5279.39 |   1.2% |   9.2% |       42532 |    5279.39 |  5.00 ** High memory **
resnet50                       |    0 | 8 |     456.63 |   3.0% |  66.1% |        8630 |    3620.48 |  1.00
rwkv                           |    8 | 8 |        nan |   nan% |   nan% |        5458 |        nan |  1.00
stargan                        |    0 | 8 |      34.07 |   2.1% |  45.4% |       41326 |     271.44 |  1.00
super-slomo                    |    0 | 8 |      35.55 |   1.4% |  30.7% |       37700 |     285.19 |  1.00
t5                             |    0 | 8 |      47.77 |   0.2% |   4.0% |       39344 |     382.20 |  2.00
tf32                           |    0 | 8 |     147.05 |   0.2% |   4.9% |        6066 |    1181.93 |  0.00
whisper                        |    0 | 8 |     145.26 |   2.2% |  48.3% |       40624 |    1160.69 |  1.00

 Scores
 ------
 Failure rate:       4.06% (FAIL)
 Score:             567.57

 Errors
 ------
 8 errors, details in HTML report

4xA100 SXM4 80 GB

CUDA_VISIBLE_DEVICES=0,1,2,3 milabench run
=================
Benchmark results
=================
bench                          | fail | n |       perf |   sem% |   std% | peak_memory |      score | weight
bert-fp16                      |    0 | 4 |     154.86 |   0.4% |   4.5% |       28500 |     619.75 |  0.00
bert-fp32                      |    0 | 4 |      29.58 |   0.0% |   0.5% |       35464 |     118.38 |  0.00
bert-tf32                      |    0 | 4 |     119.99 |   0.4% |   4.4% |       35466 |     480.05 |  0.00
bert-tf32-fp16                 |    0 | 4 |     155.04 |   0.4% |   4.6% |       28500 |     620.50 |  3.00
bf16                           |    0 | 4 |     293.40 |   0.3% |   6.6% |        5688 |    1180.12 |  0.00
convnext_large-fp16            |    0 | 4 |     265.18 |   2.8% |  30.6% |       31362 |    1065.59 |  0.00
convnext_large-fp32            |    0 | 4 |      46.35 |   1.3% |  14.2% |       53482 |     182.25 |  0.00  ** High memory **
convnext_large-tf32            |    0 | 4 |     122.58 |   1.4% |  15.9% |       53482 |     490.00 |  0.00  ** High memory **
convnext_large-tf32-fp16       |    0 | 4 |     295.47 |   2.1% |  22.8% |       31362 |    1191.62 |  3.00
davit_large                    |    0 | 4 |     310.47 |   0.4% |   6.5% |       38144 |    1247.04 |  1.00
davit_large-multi              |    0 | 1 |    1183.76 |   1.1% |   8.2% |       45336 |    1183.76 |  5.00 ** High memory **
dlrm                           |    0 | 1 |  430871.61 |   2.6% |  20.2% |        7758 |  430871.61 |  1.00
focalnet                       |    0 | 4 |     391.96 |   0.4% |   6.4% |       29812 |    1575.26 |  2.00
fp16                           |    0 | 4 |     289.99 |   0.2% |   4.1% |        5688 |    1164.13 |  0.00
fp32                           |    0 | 4 |      19.13 |   0.0% |   0.9% |        6066 |      76.58 |  0.00
llama                          |    0 | 4 |     492.72 |   6.2% |  78.3% |       32326 |    1884.58 |  1.00
opt-1_3b                       |    0 | 1 |      14.45 |   0.0% |   0.2% |       46016 |      14.45 |  5.00 ** High memory **
opt-6_7b                       |    0 | 1 |       5.96 |   0.0% |   0.1% |       75444 |       5.96 |  5.00 ** High memory **
reformer                       |    0 | 4 |      61.39 |   0.1% |   1.0% |       29304 |     245.83 |  1.00
regnet_y_128gf                 |    0 | 4 |      82.67 |   0.3% |   5.1% |       35454 |     330.98 |  2.00
resnet152                      |    0 | 4 |     672.09 |   0.4% |   6.9% |       39330 |    2694.83 |  1.00
resnet152-multi                |    0 | 1 |    2470.38 |   1.5% |  11.2% |       47288 |    2470.38 |  5.00 ** High memory **
resnet50                       |    0 | 4 |     454.49 |   3.2% |  50.5% |        8630 |    1800.61 |  1.00
rwkv                           |    4 | 4 |        nan |   nan% |   nan% |        5458 |        nan |  1.00
stargan                        |    0 | 4 |      42.30 |   1.9% |  29.9% |       53412 |     169.73 |  1.00 ** High memory **
super-slomo                    |    0 | 4 |      40.67 |   0.8% |  13.1% |       37700 |     163.08 |  1.00
t5                             |    0 | 4 |      47.74 |   0.3% |   3.9% |       39344 |     190.95 |  2.00
tf32                           |    0 | 4 |     146.72 |   0.2% |   4.0% |        6066 |     588.99 |  0.00
whisper                        |    0 | 4 |     207.47 |   1.0% |  15.4% |       40624 |     832.75 |  1.00

Scores
------
Failure rate:       3.96% (FAIL)
Score:             300.23

4xA100 SXM4 80 GB limited to 40 GB of VRAM

CUDA_VISIBLE_DEVICES=0,1,2,3 MILABENCH_SIZER_AUTO=True MILABENCH_SIZER_CAPACITY=40000Mo milabench run
 =================
 Benchmark results
 =================
                          fail n       perf   sem%   std% peak_memory          score weight
 bert-fp16                   0 4     147.52   0.2%   1.9%       41938     588.500016   0.00
 bert-fp32                   0 4      29.08   0.9%  10.3%       42138     116.083048   0.00
 bert-tf32                   0 4     117.82   0.1%   1.0%       42140     470.743584   0.00
 bert-tf32-fp16              0 4     147.67   0.2%   2.4%       41938     588.804052   3.00
 bf16                        0 4     293.92   0.3%   6.0%        5688    1181.627938   0.00
 convnext_large-fp16         0 4     269.92   2.9%  32.5%       42628    1085.129084   0.00
 convnext_large-fp32         0 4      50.31   0.7%   7.8%       42136     199.292499   0.00
 convnext_large-tf32         0 4     136.86   0.5%   5.0%       42138     549.100135   0.00
 convnext_large-tf32-fp16    0 4     266.48   3.1%  33.8%       42628    1071.146282   3.00
 davit_large                 0 4     300.29   0.5%   7.7%       41728    1203.538777   1.00
 davit_large-multi           0 1    1171.04   1.2%   9.3%       50030    1171.042025   5.00
 dlrm                        0 1  454625.69   2.1%  16.4%        7758  454625.687871   1.00
 focalnet                    0 4     391.81   0.3%   5.1%       41802    1569.986673   2.00
 fp16                        0 4     289.96   0.2%   3.9%        5688    1163.810339   0.00
 fp32                        0 4      19.14   0.0%   0.8%        6066      76.603551   0.00
 llama                       0 4     493.43   6.1%  78.2%       32326    1888.979344   1.00
 opt-1_3b                    0 1      14.52   0.1%   0.3%       45930      14.518303   5.00
 opt-6_7b                    0 1       5.96   0.0%   0.1%       75444       5.955118   5.00 ** High memory **
 reformer                    0 4      46.27   0.0%   0.3%       41986     185.104527   1.00
 regnet_y_128gf              0 4     105.08   0.7%  10.8%       42318     421.706539   2.00
 resnet152                   0 4     674.90   0.5%   7.3%       43688    2706.277411   1.00
 resnet152-multi             0 1    2350.25   2.2%  16.9%       52338    2350.245540   5.00
 resnet50                    0 4     420.09   5.8%  91.1%       42262    1653.944065   1.00
 rwkv                        4 4        NaN    NaN    NaN        5458            NaN   1.00
 stargan                     0 4      36.75   1.3%  20.5%       32310     147.651415   1.00
 super-slomo                 0 4      41.87   0.8%  12.0%       41986     167.928514   1.00
 t5                          0 4      49.55   0.3%   4.5%       41444     198.383370   2.00
 tf32                        0 4     146.74   0.2%   3.8%        6066     588.944520   0.00
 whisper                     0 4     209.19   0.7%  10.5%       42242     838.753126   1.00

 Scores
 ------
 Failure rate:       4.00% (FAIL)
 Score:             444.18

 Errors
 ------
 4 errors, details in HTML report.

Issues

> Traceback (most recent call last):
>   File "/gpfs/home3/pmorillas/mila/milabench/milabench/utils.py", line 69, in wrapped
>   return fn(*args, **kwargs)
>   File "/gpfs/home3/pmorillas/mila/milabench/milabench/summary.py", line 50, in aggregate
>   assert config and start and end
> AssertionError
> Source: mila_installation/runs/

This indicates that the configuration might be missing or invalid. It can happen when generating a report from an incomplete run, as either the first metric entry (config) or the last metric entry (end) might be missing. It can also be a symptom of another problem that prevented the benchmarks from running successfully.

>   File "/gpfs/home3/pmorillas/mila2/milabench/milabench/cli/run.py", line 82, in cli_run
>     arch = next(iter(mp.packs.values())).config["system"]["arch"]
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> StopIteration

This indicates that no benchmarks were found to run; either the configuration was invalid or the --select filter excluded all benchmarks.