Webpage for the University of Chicago Data Science Clinic

Coding standards for the Data Science Clinic

This repository contains notes and documents regarding the coding standards expected by DSI projects.

All code produced should follow the requirements listed below. For questions and assistance on completing these requirements, please refer to the FAQ at the bottom of this page and the docs of this repository before reaching out to TAs and mentors.

If you are looking for an example of a well-documented piece of code from a previous clinic, we have put an example here.

Motivation:

Much of the code that is produced here at the Data Science Clinic is 80% complete and not in a state where it can be easily turned over. The purpose of this document is to provide a set of best practices in checklist form so that we can quickly do code reviews and provide expectations on them.

We should always keep the following in mind: Analysis is useless without a good repo.

Principles:

  1. Automation – All tasks should be automated
  2. Reproducibility – All results should be reproducible
  3. Documentation – All tasks and results should be clearly documented
  4. Data chain of ownership – Data should have an obvious chain from source to result

Requirements

Structure

  1. Directory structure and naming should be obvious and easy to understand.
  2. File names and directories should be useful:
    • There should be no v2, dates, or people’s names in filenames.
    • Spaces, punctuation marks, and parentheses should not be in any file or directory names.
  3. .gitignore should be used to avoid committing data and intermediate data files which are not appropriate for the repo.
    • There should be no .DS_Store files or .ipynb_checkpoints directories.
    • You should start with the default Python .gitignore from GitHub.
    • Make sure that there are no unnecessary files in the repo. If they are generated by the code, put them in .gitignore.
  4. Secrets and API Keys should not be in the repository.
  5. All file paths should be relative (unless work is done inside a Docker container, where absolute paths can be assumed).
  6. Bash scripts:
    • Should be set as executable (chmod +x)
    • Should end with .sh
    • Should begin with #!/bin/bash
    • Should also have set -e
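
Items 4 and 5 above (no secrets in the repository, relative file paths) can be illustrated with a minimal Python sketch. The names `DATA_DIR`, `data_path`, `get_secret`, and the `EXAMPLE_API_KEY` variable are hypothetical and chosen only for illustration; the sketch assumes code is run from the repository root.

```python
import os
from pathlib import Path

# Relative to the repository root; no machine-specific prefix is hard-coded.
DATA_DIR = Path("data")


def data_path(*parts: str) -> Path:
    """Return a relative path under data/ built from the given parts."""
    return DATA_DIR.joinpath(*parts)


def get_secret(name: str) -> str:
    """Read a secret from an environment variable and fail loudly if it is missing."""
    try:
        return os.environ[name]
    except KeyError as err:
        # Re-raise with a clearer message instead of continuing without the secret.
        raise RuntimeError(f"Set the {name} environment variable before running.") from err
```

A script would then call something like `get_secret("EXAMPLE_API_KEY")` at run time, so the key itself never appears in a committed file.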

Code Quality

  1. Function names should be descriptive.
  2. There should be no commented-out code.
  3. Code should be organized so that function definition is separate from execution. There should be __main__ blocks in files that are expected to be executed, and files intended to be imported should not contain code execution.
  4. Code should never silently break (such as using try/except without raising an error).
  5. A code formatter should be used for readability. All code should pass the checks in pre-commit run --all-files.
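
Items 3 and 4 above can be sketched as follows: functions are defined at module level, execution happens only under a `__main__` guard, and a caught exception is re-raised with context rather than swallowed. The function `load_config` and the JSON keys are hypothetical examples, not part of any clinic project.

```python
import json


def load_config(text: str) -> dict:
    """Parse a JSON configuration string.

    Raises:
        ValueError: If the text is not valid JSON.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # Re-raise with context instead of silently returning a default.
        raise ValueError(f"Invalid configuration: {err}") from err


def main() -> None:
    """Entry point used only when the file is executed directly."""
    config = load_config('{"input": "data/raw.csv", "output": "output/summary.csv"}')
    print(sorted(config))


if __name__ == "__main__":
    main()
```

Importing this file from another module defines `load_config` and `main` but runs nothing, which is the separation the requirement asks for.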

Notebooks

  1. Notebooks should generally not contain function definitions.
  2. Notebooks should have fewer than 10 cells, and each cell should contain 10 lines of code or fewer.
  3. Notebooks should have documentation (preferably markdown) which describes their purpose.
  4. There should be no ! pip install XXX in any notebooks. All environment requirements should be handled using a requirements.txt file.
  5. Documentation should include (at a minimum):
    • Doc strings on all functions
    • README files in directories specifying the contents.
    • README file in the root directory describing the purpose of the code, where to look for things, and how to run the code. If there are other locations for information regarding this project, links should be provided.
    • README file should describe your development process (e.g., how you did branches)

Dependency Management

  1. The following Python libraries are banned unless given explicit permission:
    • subprocess or any subprocess-like library
  2. All non-standard Python libraries need to be justified:
    • If asked why you used library X, there needs to be a good answer.

Git

  1. Working branches need to be up to date with main upon completion of task/code review and should not stray behind main for more than a day.

Docker

  1. Repos should contain a Dockerfile:
    • Clear instructions on how to run the code (via Docker) in the main README.md file.
    • All code in the repo should be executable via Docker.
    • The Dockerfile should use a requirements.txt to manage modules and should have versions on all modules.
    • There should be no conda / pyenv etc.
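
The requirement that every module in requirements.txt carries a version can be checked mechanically. The sketch below is one possible check, assuming that "versions on all modules" means an exact `==` pin; the function name `unpinned_requirements` is hypothetical.

```python
def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that lack an exact '==' version pin."""
    missing = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if "==" not in line:
            missing.append(line)
    return missing
```

Passing the contents of requirements.txt to this function returns the lines that still need a pinned version; an empty list means every entry is pinned.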

DSI Cluster

  1. Include a conda recipe or micromamba to manage the environment.

FAQ

How do I handle output images or tables?

Use an output/ directory to store images and other results.
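
A minimal sketch of writing a result table into an output directory, using only the standard library. The function `save_table` and the file name `counts.csv` are hypothetical; the default directory assumes the script runs from the repository root.

```python
import csv
from pathlib import Path


def save_table(rows, header, filename, output_dir=Path("output")):
    """Write rows to <output_dir>/<filename> as CSV and return the path written."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    path = output_dir / filename
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path
```

For example, `save_table([("a", 1), ("b", 2)], ["name", "count"], "counts.csv")` would create the file under the output directory, which keeps generated results in one place that is easy to list in .gitignore.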

If I can’t put functions in notebooks, where should they go?

Functions should be put in a utils directory and loaded via import.
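
As an illustration, a function that might otherwise be defined in a notebook cell could live in a module under the utils directory. The file name `utils/cleaning.py` and the function `normalize_column_name` are hypothetical, and the docstring layout shown is one common convention rather than a required format.

```python
# Contents of a hypothetical utils/cleaning.py
import re


def normalize_column_name(name: str) -> str:
    """Convert a raw column header to snake_case.

    Args:
        name: Column header as it appears in the source file.

    Returns:
        The lowercase name, with each run of non-alphanumeric characters
        replaced by a single underscore and no leading or trailing underscores.
    """
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


# In a notebook cell, the function would then be loaded with:
#     from utils.cleaning import normalize_column_name
```

The notebook then contains only the import and the call, which keeps cells short and lets the function be reused and tested outside the notebook.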

How should I document notebooks?

Notebooks, just like any other piece of code, need to be well documented and readable. Some questions that we ask when evaluating:

What about Docker README.md information?

This example is a good starting point. Replace project-name with your project name.

This repository contains a basic Dockerfile that will run a Jupyter notebook instance. To build the Docker image, run:

docker build . -t [project-name]

Note that the image name in the above command is [project-name].

To run the image, use the following:

docker run -p 8888:8888 -v ${PWD}:/tmp [project-name]

As you can see, we are running the [project-name] image.

What is a good folder structure?

For the simplest projects something like the below should work:

.
├── README.md
├── .gitignore
├── Dockerfile
├── notebooks/
│   ├── README.md
│   └── notebook Files
├── utils/
│   ├── README.md
│   ├── __init__.py
│   └── python utility files
├── data/
│   ├── README.md (or SOURCES.md)
│   └── Data files
├── output/
│   ├── README.md
│   └── output images, tables, etc.
└── documentation/
    └── README.md

What if I’m doing work outside of traditional python code?

If you are doing work outside of Python, it still requires documentation. There should be zero work that isn’t shareable.

What should a docstring look like?

Docstrings should contain, at a minimum: