Webpage for the University of Chicago Data Science Clinic
For some projects we use `submitit` to submit Slurm jobs from Python. Another option that is sometimes used is writing `sbatch` commands directly. Each method has advantages and disadvantages. This document covers `submitit`.
When we use Slurm, we must be respectful and not overuse nodes. Please:

- Test jobs on the `dev` queue, or test them with less data in an interactive session.
- Move jobs to the `general` queue only after they have been tested.

To understand how we use submitit, some background knowledge will be useful:
- `if __name__ == "__main__":` blocks. The code under these blocks runs when the file is run as a script and not when it is imported as a module. For more information, see the Python documentation for `__main__`.

To make use of submitit, a long script with no functions or a Jupyter notebook will not work. You will need to write your code more abstractly, using Python functions and classes. Your code should be ready for change, easy to understand, and safe from bugs. There are plenty of good resources on software design. For the bare minimum to work with submitit:
Convert code like this:

```python
df = pd.read_csv("test.csv")
df = df[df["year"] > 2004]
average = df["amount"].mean()
print(average)
```

into a function that is general (hint: if a descriptive name for your function is very long, you may want to make it more general) and that returns results instead of printing them. Do this:
```python
import pandas as pd


def get_mean_amount_after_year(path_to_csv: str, earliest_year: int):
    """Return mean value of 'amount' column with year > earliest_year"""
    df = pd.read_csv(path_to_csv)
    df = df[df["year"] > earliest_year]
    return df["amount"].mean()
```
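Because the function returns its result instead of printing it, it can be checked quickly on a small table before any job is submitted. The sketch below repeats the function so that it runs on its own. The sample rows are made up for illustration, and it relies on `pandas.read_csv` accepting a file-like object in place of a path.

```python
import io

import pandas as pd


def get_mean_amount_after_year(path_to_csv, earliest_year: int):
    """Return mean value of 'amount' column with year > earliest_year"""
    df = pd.read_csv(path_to_csv)
    df = df[df["year"] > earliest_year]
    return df["amount"].mean()


# made-up sample data: only the 2005 and 2006 rows pass the year filter
csv_text = "year,amount\n2003,100\n2005,10\n2006,20\n"

mean_amount = get_mean_amount_after_year(io.StringIO(csv_text), 2004)
print(mean_amount)  # 15.0
```

The same call works unchanged with a real file path, which is what the submitted job would receive.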
Submitit eliminates the need to remember long, complicated configurations and lets us work entirely in Python. The sample program in `main.py` runs a test version.
Add an `if __name__ == "__main__":` block at the end of your Python file. No submitit code should exist in your actual function. This way we can easily pivot between submitting jobs with submitit and local execution. Call your function here.

Put your Slurm options in a JSON query file, prefixed with `slurm_` rather than the `--` you use on the command line. Include a `submitit` key that maps to true when you want to submit the job and false when you want to run it normally (either locally or for debugging). Finally, include any arguments to your Python function. For example:
```json
{
    "path_to_csv": "test_file.csv",
    "earliest_year": 1994,
    "submitit": true,
    "slurm": {
        "slurm_partition": "general",
        "slurm_job_name": "sample",
        "slurm_nodes": 1,
        "slurm_time": "60:00",
        "slurm_gres": "gpu:1",
        "slurm_mem_per_cpu": 16000
    }
}
```
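To illustrate the naming convention, the hypothetical helper below (not part of submitit) turns `slurm_`-prefixed keys back into the `--` flags you would type for `sbatch`. It assumes that underscores in a key correspond to hyphens in the flag, as in `--mem-per-cpu`. Treat it as a sketch for checking a query by eye, not as a description of how submitit builds its job script.

```python
from typing import Dict, List


def to_sbatch_flags(slurm_params: Dict[str, object]) -> List[str]:
    """Translate 'slurm_'-prefixed keys into command-line style flags."""
    flags = []
    for key, value in slurm_params.items():
        if not key.startswith("slurm_"):
            raise ValueError(f"expected a 'slurm_' prefix on {key!r}")
        # drop the prefix and use hyphens, as on the command line
        option = key[len("slurm_"):].replace("_", "-")
        flags.append(f"--{option}={value}")
    return flags


slurm_params = {
    "slurm_partition": "general",
    "slurm_time": "60:00",
    "slurm_mem_per_cpu": 16000,
}
flags = to_sbatch_flags(slurm_params)
print(flags)
# ['--partition=general', '--time=60:00', '--mem-per-cpu=16000']
```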
Use `argparse` to pass in the path to a query file that contains both the Slurm configuration and a `submitit` key that maps to a boolean. Your file will look something like this:

```python
from pathlib import Path

import pandas as pd
import submitit


# your actual code will have more and longer functions than this sample
def get_mean_amount_after_year(path_to_csv: str, earliest_year: int):
    """Return mean value of 'amount' column with year > earliest_year"""
    df = pd.read_csv(path_to_csv)
    df = df[df["year"] > earliest_year]
    return df["amount"].mean()


if __name__ == "__main__":
    import argparse
    import json

    # set up command line arguments; the query path is required because
    # nothing below can run without it
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--query", help="path to json file containing query", required=True
    )
    args = parser.parse_args()

    # read in query
    query_path = Path(args.query).resolve()
    if not query_path.exists():
        raise ValueError(f"Could not locate query file: {args.query}")
    with open(query_path) as f:
        query = json.load(f)

    # save query parameters to variables. if you want a default, better to put
    # at the outermost call to a function.
    path_to_csv = query.get("path_to_csv")
    default_earliest_year = 2005
    earliest_year = query.get("earliest_year", default_earliest_year)

    output_directory = Path("results").resolve()
    executor = submitit.AutoExecutor(folder=output_directory)
    # here we unpack the query dictionary and pull any slurm commands that
    # are in 'slurm' key. For more info on the ** syntax, see:
    # https://stackoverflow.com/a/36908. The slurm options here are the same
    # as those you use on the command line but instead of prepending with '--'
    # we prepend with 'slurm_'
    executor.update_parameters(**query.get("slurm", {}))

    # if submitit is true in our query json, we'll use submitit
    if query.get("submitit", False):
        job = executor.submit(
            get_mean_amount_after_year,
            path_to_csv,
            earliest_year,
        )
        # report the job id so the run can be tracked
        print(f"Submitted job {job.job_id}")
    else:
        # run in the current process and show the returned value
        print(
            get_mean_amount_after_year(
                path_to_csv,
                earliest_year,
            )
        )
```
Then, with a query like the JSON example above, you can run `python path/to/script.py --query path/to/query.json` and get your result.
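As queries grow, pulling each argument out of the query by hand becomes repetitive. One possible variation, offered here as an assumption and not as the Clinic's documented pattern, is to treat every key other than `submitit` and `slurm` as a keyword argument for your function. The helper name `split_query` is hypothetical.

```python
from typing import Any, Dict, Tuple

# keys that steer the run and are not arguments to your function
CONTROL_KEYS = ("submitit", "slurm")


def split_query(query: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any], bool]:
    """Separate function arguments, slurm parameters, and the submitit switch."""
    use_submitit = bool(query.get("submitit", False))
    slurm_params = dict(query.get("slurm", {}))
    function_kwargs = {k: v for k, v in query.items() if k not in CONTROL_KEYS}
    return function_kwargs, slurm_params, use_submitit


query = {
    "path_to_csv": "test_file.csv",
    "earliest_year": 1994,
    "submitit": True,
    "slurm": {"slurm_partition": "general", "slurm_nodes": 1},
}
function_kwargs, slurm_params, use_submitit = split_query(query)
print(function_kwargs)
# {'path_to_csv': 'test_file.csv', 'earliest_year': 1994}
```

The function arguments could then be passed along with `executor.submit(get_mean_amount_after_year, **function_kwargs)` or, for a local run, `get_mean_amount_after_year(**function_kwargs)`.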
To run or debug locally, run the script with the `submitit` flag in your query JSON set to false. To debug, use the VS Code debugger. Add command line arguments (such as `--query`) to the debugger through the `args` field of its launch configuration.
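If you would rather not edit a query by hand each time you debug, a small helper can write a local copy with `submitit` switched off. This is only a sketch: the function name and the `_local` file suffix are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_local_copy(query_path: Path) -> Path:
    """Write a copy of the query with 'submitit' set to false; return its path."""
    query = json.loads(query_path.read_text())
    query["submitit"] = False
    local_path = query_path.with_name(f"{query_path.stem}_local{query_path.suffix}")
    local_path.write_text(json.dumps(query, indent=4))
    return local_path


# demonstrate on a throwaway query file
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "query.json"
    original.write_text(json.dumps({"path_to_csv": "test_file.csv", "submitit": True}))
    local = write_local_copy(original)
    local_query = json.loads(local.read_text())
    original_query = json.loads(original.read_text())

print(local.name, local_query["submitit"])  # query_local.json False
```

The original query is left untouched, so the same file can still be used to submit the job once debugging is done.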