Recipes
=======

Base Setup
----------

.. code-block:: bash

   salloc -w "cn-d[003-004]" --ntasks=1 --gpus-per-task=a100l:8 --exclusive --nodes=1 --cpus-per-task=128 --time=120:00:00 --ntasks-per-node=1 --mem=0

   cd /tmp/
   mkdir milabench
   cd milabench

   git clone https://github.com/mila-iqia/milabench.git

   conda activate base

   python --version
   Python 3.11.4

   virtualenv ./env
   source ./env/bin/activate

   pip install -e milabench/

   export MILABENCH_WORDIR="$(pwd)"
   export MILABENCH_BASE="$MILABENCH_WORDIR/results"
   export MILABENCH_CONFIG="$MILABENCH_WORDIR/milabench/config/standard.yaml"
   export BENCHMARK_VENV="$MILABENCH_WORDIR/results/venv/torch"

   module load cuda/12.3.2   # <= or set CUDA_HOME to the right spot

   milabench install

   milabench prepare

   milabench run

The current setup runs on 8xA100 SXM4 80GB. Note that some benchmarks require
more than 40GB of VRAM. One bench might be problematic: ``rwkv``, which
requires ``nvcc``, but it can be ignored.

Install+Prepare on Network nodes
--------------------------------

1. Network node

.. code-block:: bash

   conda activate py310

   export NETWORK_FOLDER=/network/shared/setup
   cd $NETWORK_FOLDER

   git clone https://github.com/mila-iqia/milabench.git
   pip install -e milabench

   export MILABENCH_CONFIG="$NETWORK_FOLDER/milabench/config/standard.yaml"

   milabench install --base $NETWORK_FOLDER --config $MILABENCH_CONFIG
   milabench prepare --base $NETWORK_FOLDER --config $MILABENCH_CONFIG

2. Compute node

.. code-block:: bash

   # Sync the data to the local node but use the code from the network location
   milabench sharedsetup --network $NETWORK_FOLDER --local /tmp/local/results

   milabench run --base /tmp/local/results --config $NETWORK_FOLDER/milabench/config/standard.yaml

Batch Update Dependencies / Dependencies Pinning
------------------------------------------------

Milabench comes with tooling to manage the dependencies of all the benchmarks
and update them seamlessly.

Major version updates
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   export MILABENCH_BASE=../
   export MILABENCH_GPU_ARCH=cuda
   milabench pin -c constraints/cuda.txt --config config/standard.yaml --from-scratch

   export MILABENCH_GPU_ARCH=rocm
   milabench pin -c constraints/rocm.txt --config config/standard.yaml --from-scratch

   export MILABENCH_GPU_ARCH=xpu
   milabench pin -c constraints/xpu.txt --config config/standard.yaml --from-scratch

   export MILABENCH_GPU_ARCH=hpu
   milabench pin -c constraints/hpu.txt --config config/standard.yaml --from-scratch

Minor version updates
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   export MILABENCH_GPU_ARCH=cuda
   milabench pin -c constraints/cuda.txt --config config/standard.yaml

   export MILABENCH_GPU_ARCH=rocm
   milabench pin -c constraints/rocm.txt --config config/standard.yaml

   export MILABENCH_GPU_ARCH=xpu
   milabench pin -c constraints/xpu.txt --config config/standard.yaml

   export MILABENCH_GPU_ARCH=hpu
   milabench pin -c constraints/hpu.txt --config config/standard.yaml
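The files passed with ``-c`` are pip-style constraints files used to cap the
resolved versions. To review what a pinning pass actually changed before
committing, diffing the regenerated pin files is the quickest check; a minimal
sketch, assuming the commands above are run from the milabench repository root
and that each benchmark keeps its per-arch pins in a ``requirements.<arch>.txt``
file next to it (the usual layout):

.. code-block:: bash

   # Hypothetical post-pin check: list the regenerated pin files,
   # then inspect the cuda pin changes before committing
   git status --short
   git diff -- 'benchmarks/*/requirements.cuda.txt'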
Increase Runtime
----------------

For profiling, it might be useful to run the benchmarks for longer than the
default configuration allows. You can update the yaml file (``config/base.yaml``
or ``config/standard.yaml``) to increase the runtime limits. Two values govern
the runtime of a benchmark: ``max_duration``, a pure timeout meant to catch
benchmark hangs, and ``voir.options.stop``, the target number of observations
milabench will gather before stopping.

.. code-block:: yaml

   _defaults:
     max_duration: 600   # <= Maximum number of seconds the bench can run;
                         #    note that if this triggers, the bench is marked as failed
     voir:
       options:
         stop: 60        # <= Maximum number of observations we are gathering;
                         #    this is usually what triggers the premature exit of the benchmark.
                         #    An observation is usually a batch forward/backward/optimizer.step
                         #    (i.e. one train step)
         interval: "1s"

One Env
-------

If you are using a container with dependencies such as pytorch already
installed, you can force milabench to use a single environment for everything.

.. code-block:: bash

   milabench install --use-current-env
   milabench prepare --use-current-env
   milabench run --use-current-env --select bert-fp32

Batch resizer
-------------

If the GPUs you are using have less VRAM, automatic batch resizing can be
enabled with the command below. Note that it will not impact benchmarks that
already use a batch size of one, such as ``opt-6_7b`` and possibly ``opt-1_3b``.

.. code-block:: bash

   MILABENCH_SIZER_AUTO=1 milabench run

Device Select
-------------

To run on a subset of GPUs, restrict the visible devices. Note that by default
milabench will try to use all the GPUs all the time, which might make a run
take a bit longer; reducing the number of visible devices to 2 might make
experimentation faster.

.. code-block:: bash

   CUDA_VISIBLE_DEVICES=0,1,2,3 milabench run

Update Package
--------------

To update pytorch to use a newer version of cuda (milabench creates a separate
environment for the benchmarks):

.. code-block:: bash

   # can be executed after `milabench install` at the earliest
   source $BENCHMARK_VENV/bin/activate
   pip install -U torch torchvision torchaudio

Arguments
---------

If environment variables are troublesome, the values can also be passed as
arguments.

.. code-block:: bash

   milabench install --base $MILABENCH_BASE --config $MILABENCH_CONFIG
   milabench prepare --base $MILABENCH_BASE --config $MILABENCH_CONFIG
   milabench run --base $MILABENCH_BASE --config $MILABENCH_CONFIG

To help us troubleshoot future issues, you can forward your result directory.
It holds all the benchmark-specific logs and metrics gathered by milabench.

.. code-block:: bash

   zip -r results.zip results

Run a benchmark without milabench
---------------------------------

.. code-block:: bash

   milabench dev {benchname}   # will open bash with the benchmark venv sourced

   # alternatively
   source $MILABENCH_BASE/venv/torch/bin/activate
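From inside that environment you can sanity-check which dependencies a bench
will actually use. A small sketch, using ``bert-fp32`` (a bench name from the
standard config) as an example:

.. code-block:: bash

   milabench dev bert-fp32   # opens a shell with the bench venv sourced

   # then, inside that shell, confirm the torch build and its cuda version
   python -c "import torch; print(torch.__version__, torch.version.cuda)"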
Containers
----------

NGC
^^^

When using containers where some dependencies are already installed, we need
to use a dummy virtualenv to make milabench install its own dependencies
there; the duplicated dependencies can then be removed.

.. code-block:: bash

   podman run --rm --device nvidia.com/gpu=all --storage-opt ignore_chown_errors=true --security-opt=label=disable --ipc=host -it -e HOME=$HOME -e USER=$USER -v $HOME:$HOME nvcr.io/nvidia/pytorch:24.02-py3

   cd $HOME
   rm -rf env

   pip install virtualenv

   # Create a virtual env with system packages to get the container's pytorch
   virtualenv --system-site-packages env
   source ./env/bin/activate

   git clone https://github.com/mila-iqia/milabench.git
   pip install -e milabench

   export MILABENCH_BASE="$HOME/results"
   export MILABENCH_CONFIG="$HOME/milabench/config/standard.yaml"
   export MILABENCH_GPU_ARCH=cuda

   # This updates the requirements for cuda
   # milabench pin --from-scratch --variant cuda -c constraints/cuda.txt

   # Install the new requirements (note: this will still install a new pytorch)
   milabench install --use-current-env

   # uninstall the pytorch that was installed in the venv
   # so we use the system packages instead
   pip uninstall torch torchvision torchaudio

   milabench prepare --use-current-env
   milabench run --use-current-env

Nightly
^^^^^^^

.. code-block:: bash

   podman run -it --rm --ipc=host --gpus=all \
         -e MILABENCH_HF_TOKEN= \
         -v /tmp/workspace/data:/milabench/envs/data \
         -v /tmp/workspace/runs:/milabench/envs/runs \
         ghcr.io/mila-iqia/milabench:cuda-nightly milabench prepare

   podman run -it --rm --ipc=host --gpus=all \
         -e MILABENCH_HF_TOKEN= \
         -v /tmp/workspace/data:/milabench/envs/data \
         -v /tmp/workspace/runs:/milabench/envs/runs \
         ghcr.io/mila-iqia/milabench:cuda-nightly milabench run

Multi Node & Docker
^^^^^^^^^^^^^^^^^^^

1. Create a system file with the right docker configuration

.. code-block:: yaml

   system:
     # Default arch
     arch: cuda

     # sshkey used in remote milabench operations
     sshkey: ~/.ssh/id_ed25519

     # Configures how to use docker
     docker:
       executable: podman
       image: ghcr.io/mila-iqia/milabench:${system.arch}-nightly
       base: /tmp/workspace
       args: [
         -it, --rm, --ipc=host, --gpus=all, --network, host, --privileged,
         -e, MILABENCH_HF_TOKEN=,
         -v, "${system.docker.base}/data:/milabench/envs/data",
         -v, "${system.docker.base}/runs:/milabench/envs/runs",
       ]

     # Nodes list
     nodes:
       # Alias used to reference the node
       - name: manager
         ip: 192.168.11.11
         port: 5000
         # Use this node as the master node or not
         main: true
         # User to use in remote milabench operations
         user: manager

       - name: node1
         ip: 192.168.11.12
         main: false
         user: username

2. Use ``milabench docker`` to get the suggested commands for executing the benchmark

.. code-block:: bash

   cp system.yaml /tmp/workspace/data/system.yaml
   milabench docker --system system.yaml

.. code-block:: bash

   podman run -it --rm --ipc=host --gpus=all --network host --privileged \
         -e MILABENCH_HF_TOKEN= \
         -v /tmp/workspace/data:/milabench/envs/data \
         -v /tmp/workspace/runs:/milabench/envs/runs \
         -e OMP_NUM_THREADS='12' ghcr.io/mila-iqia/milabench:cuda-nightly \
         milabench prepare --system /milabench/envs/data/system.yaml

   podman run -it --rm --ipc=host --gpus=all --network host --privileged \
         -e MILABENCH_HF_TOKEN= \
         -v /tmp/workspace/data:/milabench/envs/data \
         -v /tmp/workspace/runs:/milabench/envs/runs \
         -e OMP_NUM_THREADS='12' ghcr.io/mila-iqia/milabench:cuda-nightly \
         milabench run --system /milabench/envs/data/system.yaml

3. Execute the suggested ``milabench prepare`` command
4. Execute the suggested ``milabench run`` command
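Before executing the prepare and run steps above, it is worth confirming that
the main node can reach every node in the system file over SSH without an
interactive password prompt, since remote milabench operations rely on the
configured key. A quick check, using the example addresses and users from the
system file above:

.. code-block:: bash

   # Each node should answer without prompting for a password
   ssh -i ~/.ssh/id_ed25519 manager@192.168.11.11 hostname
   ssh -i ~/.ssh/id_ed25519 username@192.168.11.12 hostname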
Example Reports
---------------

8xA100 SXM4 80GB
^^^^^^^^^^^^^^^^

.. code-block:: bash

   milabench run

.. code-block:: txt

   =================
   Benchmark results
   =================
   bench                    | fail | n | perf      | sem% | std%  | peak_memory | score     | weight
   bert-fp16                | 0    | 8 | 154.92    | 0.3% | 4.5%  | 28500       | 1240.06   | 0.00
   bert-fp32                | 0    | 8 | 29.55     | 0.0% | 0.5%  | 35464       | 236.54    | 0.00
   bert-tf32                | 0    | 8 | 120.02    | 0.3% | 4.9%  | 35466       | 960.04    | 0.00
   bert-tf32-fp16           | 0    | 8 | 154.87    | 0.3% | 4.5%  | 28500       | 1239.70   | 3.00
   bf16                     | 0    | 8 | 293.43    | 0.3% | 7.2%  | 5688        | 2363.29   | 0.00
   convnext_large-fp16      | 0    | 8 | 247.31    | 2.4% | 37.6% | 31362       | 1986.27   | 0.00
   convnext_large-fp32      | 0    | 8 | 45.58     | 0.7% | 11.5% | 53482       | 360.90    | 0.00 ** High memory **
   convnext_large-tf32      | 0    | 8 | 117.54    | 1.2% | 18.8% | 53482       | 940.03    | 0.00 ** High memory **
   convnext_large-tf32-fp16 | 0    | 8 | 214.41    | 2.9% | 46.4% | 31362       | 1713.47   | 3.00
   davit_large              | 0    | 8 | 308.33    | 0.3% | 7.3%  | 37900       | 2475.47   | 1.00
   davit_large-multi        | 0    | 1 | 2242.69   | 2.0% | 15.2% | 45610       | 2242.69   | 5.00 ** High memory **
   dlrm                     | 0    | 1 | 398088.30 | 2.5% | 19.3% | 7030        | 398088.30 | 1.00
   focalnet                 | 0    | 8 | 391.21    | 0.3% | 6.8%  | 29808       | 3143.46   | 2.00
   fp16                     | 0    | 8 | 289.62    | 0.2% | 4.8%  | 5688        | 2327.60   | 0.00
   fp32                     | 0    | 8 | 19.13     | 0.0% | 1.3%  | 6066        | 153.20    | 0.00
   llama                    | 0    | 8 | 496.84    | 4.4% | 79.2% | 32326       | 3778.17   | 1.00
   opt-1_3b                 | 0    | 1 | 28.23     | 0.1% | 0.4%  | 45976       | 28.23     | 5.00 ** High memory **
   opt-6_7b                 | 0    | 1 | 14.22     | 0.0% | 0.1%  | 57196       | 14.22     | 5.00 ** High memory **
   reformer                 | 0    | 8 | 61.45     | 0.0% | 1.0%  | 29304       | 492.17    | 1.00
   regnet_y_128gf           | 0    | 8 | 82.25     | 0.3% | 6.8%  | 35454       | 658.46    | 2.00
   resnet152                | 0    | 8 | 669.61    | 0.4% | 9.6%  | 37878       | 5378.33   | 1.00
   resnet152-multi          | 0    | 1 | 5279.39   | 1.2% | 9.2%  | 42532       | 5279.39   | 5.00 ** High memory **
   resnet50                 | 0    | 8 | 456.63    | 3.0% | 66.1% | 8630        | 3620.48   | 1.00
   rwkv                     | 8    | 8 | nan       | nan% | nan%  | 5458        | nan       | 1.00
   stargan                  | 0    | 8 | 34.07     | 2.1% | 45.4% | 41326       | 271.44    | 1.00
   super-slomo              | 0    | 8 | 35.55     | 1.4% | 30.7% | 37700       | 285.19    | 1.00
   t5                       | 0    | 8 | 47.77     | 0.2% | 4.0%  | 39344       | 382.20    | 2.00
   tf32                     | 0    | 8 | 147.05    | 0.2% | 4.9%  | 6066        | 1181.93   | 0.00
   whisper                  | 0    | 8 | 145.26    | 2.2% | 48.3% | 40624       | 1160.69   | 1.00

   Scores
   ------
   Failure rate: 4.06% (FAIL)
   Score: 567.57

   Errors
   ------
   8 errors, details in HTML report
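The table above is printed at the end of ``milabench run``, but it can also be
regenerated later from the stored run data. A sketch, assuming the default
layout where runs are kept under the base directory:

.. code-block:: bash

   # Rebuild the summary table (and error details) from a finished run
   milabench report --runs $MILABENCH_BASE/runs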
4xA100 SXM4 80GB
^^^^^^^^^^^^^^^^

.. code-block:: bash

   CUDA_VISIBLE_DEVICES=0,1,2,3 milabench run

.. code-block:: txt

   =================
   Benchmark results
   =================
   bench                    | fail | n | perf      | sem% | std%  | peak_memory | score     | weight
   bert-fp16                | 0    | 4 | 154.86    | 0.4% | 4.5%  | 28500       | 619.75    | 0.00
   bert-fp32                | 0    | 4 | 29.58     | 0.0% | 0.5%  | 35464       | 118.38    | 0.00
   bert-tf32                | 0    | 4 | 119.99    | 0.4% | 4.4%  | 35466       | 480.05    | 0.00
   bert-tf32-fp16           | 0    | 4 | 155.04    | 0.4% | 4.6%  | 28500       | 620.50    | 3.00
   bf16                     | 0    | 4 | 293.40    | 0.3% | 6.6%  | 5688        | 1180.12   | 0.00
   convnext_large-fp16      | 0    | 4 | 265.18    | 2.8% | 30.6% | 31362       | 1065.59   | 0.00
   convnext_large-fp32      | 0    | 4 | 46.35     | 1.3% | 14.2% | 53482       | 182.25    | 0.00 ** High memory **
   convnext_large-tf32      | 0    | 4 | 122.58    | 1.4% | 15.9% | 53482       | 490.00    | 0.00 ** High memory **
   convnext_large-tf32-fp16 | 0    | 4 | 295.47    | 2.1% | 22.8% | 31362       | 1191.62   | 3.00
   davit_large              | 0    | 4 | 310.47    | 0.4% | 6.5%  | 38144       | 1247.04   | 1.00
   davit_large-multi        | 0    | 1 | 1183.76   | 1.1% | 8.2%  | 45336       | 1183.76   | 5.00 ** High memory **
   dlrm                     | 0    | 1 | 430871.61 | 2.6% | 20.2% | 7758        | 430871.61 | 1.00
   focalnet                 | 0    | 4 | 391.96    | 0.4% | 6.4%  | 29812       | 1575.26   | 2.00
   fp16                     | 0    | 4 | 289.99    | 0.2% | 4.1%  | 5688        | 1164.13   | 0.00
   fp32                     | 0    | 4 | 19.13     | 0.0% | 0.9%  | 6066        | 76.58     | 0.00
   llama                    | 0    | 4 | 492.72    | 6.2% | 78.3% | 32326       | 1884.58   | 1.00
   opt-1_3b                 | 0    | 1 | 14.45     | 0.0% | 0.2%  | 46016       | 14.45     | 5.00 ** High memory **
   opt-6_7b                 | 0    | 1 | 5.96      | 0.0% | 0.1%  | 75444       | 5.96      | 5.00 ** High memory **
   reformer                 | 0    | 4 | 61.39     | 0.1% | 1.0%  | 29304       | 245.83    | 1.00
   regnet_y_128gf           | 0    | 4 | 82.67     | 0.3% | 5.1%  | 35454       | 330.98    | 2.00
   resnet152                | 0    | 4 | 672.09    | 0.4% | 6.9%  | 39330       | 2694.83   | 1.00
   resnet152-multi          | 0    | 1 | 2470.38   | 1.5% | 11.2% | 47288       | 2470.38   | 5.00 ** High memory **
   resnet50                 | 0    | 4 | 454.49    | 3.2% | 50.5% | 8630        | 1800.61   | 1.00
   rwkv                     | 4    | 4 | nan       | nan% | nan%  | 5458        | nan       | 1.00
   stargan                  | 0    | 4 | 42.30     | 1.9% | 29.9% | 53412       | 169.73    | 1.00 ** High memory **
   super-slomo              | 0    | 4 | 40.67     | 0.8% | 13.1% | 37700       | 163.08    | 1.00
   t5                       | 0    | 4 | 47.74     | 0.3% | 3.9%  | 39344       | 190.95    | 2.00
   tf32                     | 0    | 4 | 146.72    | 0.2% | 4.0%  | 6066        | 588.99    | 0.00
   whisper                  | 0    | 4 | 207.47    | 1.0% | 15.4% | 40624       | 832.75    | 1.00

   Scores
   ------
   Failure rate: 3.96% (FAIL)
   Score: 300.23

4xA100 SXM4 80GB limited to 40GB of VRAM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash

   CUDA_VISIBLE_DEVICES=0,1,2,3 MILABENCH_SIZER_AUTO=True MILABENCH_SIZER_CAPACITY=40000Mo milabench run

.. code-block:: txt

   =================
   Benchmark results
   =================
   bench                    | fail | n | perf      | sem% | std%  | peak_memory | score          | weight
   bert-fp16                | 0    | 4 | 147.52    | 0.2% | 1.9%  | 41938       | 588.500016     | 0.00
   bert-fp32                | 0    | 4 | 29.08     | 0.9% | 10.3% | 42138       | 116.083048     | 0.00
   bert-tf32                | 0    | 4 | 117.82    | 0.1% | 1.0%  | 42140       | 470.743584     | 0.00
   bert-tf32-fp16           | 0    | 4 | 147.67    | 0.2% | 2.4%  | 41938       | 588.804052     | 3.00
   bf16                     | 0    | 4 | 293.92    | 0.3% | 6.0%  | 5688        | 1181.627938    | 0.00
   convnext_large-fp16      | 0    | 4 | 269.92    | 2.9% | 32.5% | 42628       | 1085.129084    | 0.00
   convnext_large-fp32      | 0    | 4 | 50.31     | 0.7% | 7.8%  | 42136       | 199.292499     | 0.00
   convnext_large-tf32      | 0    | 4 | 136.86    | 0.5% | 5.0%  | 42138       | 549.100135     | 0.00
   convnext_large-tf32-fp16 | 0    | 4 | 266.48    | 3.1% | 33.8% | 42628       | 1071.146282    | 3.00
   davit_large              | 0    | 4 | 300.29    | 0.5% | 7.7%  | 41728       | 1203.538777    | 1.00
   davit_large-multi        | 0    | 1 | 1171.04   | 1.2% | 9.3%  | 50030       | 1171.042025    | 5.00
   dlrm                     | 0    | 1 | 454625.69 | 2.1% | 16.4% | 7758        | 454625.687871  | 1.00
   focalnet                 | 0    | 4 | 391.81    | 0.3% | 5.1%  | 41802       | 1569.986673    | 2.00
   fp16                     | 0    | 4 | 289.96    | 0.2% | 3.9%  | 5688        | 1163.810339    | 0.00
   fp32                     | 0    | 4 | 19.14     | 0.0% | 0.8%  | 6066        | 76.603551      | 0.00
   llama                    | 0    | 4 | 493.43    | 6.1% | 78.2% | 32326       | 1888.979344    | 1.00
   opt-1_3b                 | 0    | 1 | 14.52     | 0.1% | 0.3%  | 45930       | 14.518303      | 5.00
   opt-6_7b                 | 0    | 1 | 5.96      | 0.0% | 0.1%  | 75444       | 5.955118       | 5.00 ** High memory **
   reformer                 | 0    | 4 | 46.27     | 0.0% | 0.3%  | 41986       | 185.104527     | 1.00
   regnet_y_128gf           | 0    | 4 | 105.08    | 0.7% | 10.8% | 42318       | 421.706539     | 2.00
   resnet152                | 0    | 4 | 674.90    | 0.5% | 7.3%  | 43688       | 2706.277411    | 1.00
   resnet152-multi          | 0    | 1 | 2350.25   | 2.2% | 16.9% | 52338       | 2350.245540    | 5.00
   resnet50                 | 0    | 4 | 420.09    | 5.8% | 91.1% | 42262       | 1653.944065    | 1.00
   rwkv                     | 4    | 4 | NaN       | NaN  | NaN   | 5458        | NaN            | 1.00
   stargan                  | 0    | 4 | 36.75     | 1.3% | 20.5% | 32310       | 147.651415     | 1.00
   super-slomo              | 0    | 4 | 41.87     | 0.8% | 12.0% | 41986       | 167.928514     | 1.00
   t5                       | 0    | 4 | 49.55     | 0.3% | 4.5%  | 41444       | 198.383370     | 2.00
   tf32                     | 0    | 4 | 146.74    | 0.2% | 3.8%  | 6066        | 588.944520     | 0.00
   whisper                  | 0    | 4 | 209.19    | 0.7% | 10.5% | 42242       | 838.753126     | 1.00

   Scores
   ------
   Failure rate: 4.00% (FAIL)
   Score: 444.18

   Errors
   ------
   4 errors, details in HTML report

Issues
------

.. code-block:: txt

   > Traceback (most recent call last):
   >   File "/gpfs/home3/pmorillas/mila/milabench/milabench/utils.py", line 69, in wrapped
   >     return fn(*args, **kwargs)
   >   File "/gpfs/home3/pmorillas/mila/milabench/milabench/summary.py", line 50, in aggregate
   >     assert config and start and end
   > AssertionError
   > Source: mila_installation/runs/

This indicates that the configuration might be missing or invalid. It can
happen when generating a report from an incomplete run, as either the first
metric entry (``config``) or the last one (``end``) might be missing. It can
also be the symptom of another problem that caused benchmarks to fail to run
successfully.

.. code-block:: txt

   >   File "/gpfs/home3/pmorillas/mila2/milabench/milabench/cli/run.py", line 82, in cli_run
   >     arch = next(iter(mp.packs.values())).config["system"]["arch"]
   >            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   > StopIteration

This indicates that no benchmarks were found to run; either the configuration
was invalid or the ``--select`` argument filtered out all benchmarks.
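One rough way to validate a ``--select`` value before launching is to list the
benchmark entries defined in the config. In ``standard.yaml`` the bench names
are the top-level keys (entries starting with ``_`` are shared defaults, not
runnable benches); a sketch:

.. code-block:: bash

   # List candidate bench names from the config, then run a valid one
   grep -E '^[A-Za-z0-9_-]+:' $MILABENCH_CONFIG
   milabench run --select bert-fp32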