Coding standards for the Data Science Clinic

This repository contains notes and documents regarding the coding standards expected by DSI projects.

All code produced should follow the requirements listed below. For questions and assistance on completing these requirements, please refer to the FAQ at the bottom of this page and the docs of this repository before reaching out to TAs and mentors.

If you are looking for an example of a well-documented piece of code from a previous clinic, we have put an example here.

Motivation:

Much of the code that is produced here at the Data Science Clinic is 80% complete and not in a state where it can be easily turned over. The purpose of this document is to provide a set of best practices in checklist form so that we can do code reviews quickly and set clear expectations for them.

We should always keep the following in mind: Analysis is useless without a good repo.

Principles:

Automation – All tasks should be automated
Reproducibility – All results should be reproducible
Documentation – All tasks and results should be clearly documented
Data chain of ownership – Data should have an obvious chain from source to result

Requirements

Structure

Directory structure and naming should be obvious and easy to understand.
File names and directories should be useful:
- There should be no v2, dates, or people’s names in filenames.
- Spaces, punctuation marks, and parentheses should not be in any file or directory names.
.gitignore should be used to avoid committing data and intermediate data files which are not appropriate for the repo.
- There should be no .DS_Store files or .ipynb_checkpoints directories.
- You should start with the default python .gitignore from GitHub.
- Make sure that there are no unnecessary files in the repo. If they are generated by the code, put them in .gitignore.
Secrets and API Keys should not be in the repository.
All file paths should be relative (unless work is done inside a docker container where full pathing can be assumed).
Bash scripts:
- Should be set as executable (chmod +x)
- Should end with .sh
- Should begin with #!/bin/bash
- Should also have set -e
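
For the relative-path requirement above, a minimal Python sketch might look like the following. The data directory name, the data_file helper, and the survey.csv file name are illustrative assumptions, and the sketch assumes code is run from the repository root.

```python
from pathlib import Path

# Written relative to the repository root, so the same code works on any
# machine as long as it is run from that root (an assumption of this sketch).
DATA_DIR = Path("data")


def data_file(name: str) -> Path:
    """Return the relative path of a file inside the data directory."""
    path = DATA_DIR / name
    # Joining with an absolute name discards DATA_DIR, so reject that case.
    if path.is_absolute():
        raise ValueError(f"Expected a relative file name, got {name!r}")
    return path
```

A call such as data_file("raw/survey.csv") then refers to data/raw/survey.csv without hard-coding anyone's home directory.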

Code Quality

Function names should be descriptive.
No commented-out code
Code should be organized so that function definition is separate from execution. There should be __main__ blocks on files that are expected to be executed and files intended to be imported should not contain code execution.
Code should never silently break (such as using try/except without raising an error).
A code formatter should be used for readability. All code should pass the checks in pre-commit run --all-files.
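
As a rough illustration of two of the points above (definitions separate from execution, and no silent failures), a script could be laid out like this. The function parse_row_count and its inputs are hypothetical and exist only for the example.

```python
def parse_row_count(text: str) -> int:
    """Convert a row-count string to a non-negative int, failing loudly on bad input."""
    try:
        count = int(text)
    except ValueError as err:
        # Add context and re-raise instead of silently returning a default.
        raise ValueError(f"Row count must be an integer, got {text!r}") from err
    if count < 0:
        raise ValueError(f"Row count must be non-negative, got {count}")
    return count


def main() -> None:
    """Entry point: runs only when the file is executed, not when it is imported."""
    print(parse_row_count("42"))


if __name__ == "__main__":
    main()
```

Importing this file from another module defines the functions but runs nothing, while executing it directly calls main().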

Notebooks

Notebooks should generally not contain function definitions.
Notebooks should have fewer than 10 cells and all cells should be 10 lines of code or less.
Notebooks should have documentation (preferably markdown) which describes their purpose.
There should be no ! pip install XXX in any notebooks. All environment requirements should be handled using a requirements.txt file.
Documentation should include (at a minimum):
- Doc strings on all functions
- README files in directories specifying the contents.
- README file in the root directory describing the purpose of the code, where to look for things, and how to run the code. If there are other locations for information regarding this project, links should be provided.
- README file should describe your development process (e.g., how you did branches)

Dependency Management

The following python libraries are banned unless given explicit permission:
- subprocess or any subprocess-like library
All non-standard python libraries need to be justified:
- If asked why you used library X, there needs to be a good answer.

Git

Working branches need to be up to date with main upon completion of task/code review and should not stray behind main for more than a day.

Docker

Repos should contain a Dockerfile:
- Clear Instructions on how to run the code (via docker) in the main README.md file.
- All code in the repo should be executable via docker.
- The Dockerfile should use a requirements.txt to manage modules and should have versions on all modules.
- There should be no conda / pyenv etc.

DSI Cluster

Include a conda recipe or micromamba to manage the environment.

FAQ

How do I handle output images or tables?

Use an output directory to put in images and other results.
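
One possible way to do this from Python is sketched below. The save_table helper, its parameters, and the CSV format are assumptions made for illustration, not a required interface.

```python
import csv
from pathlib import Path


def save_table(rows: list[dict], filename: str, output_dir: Path = Path("output")) -> Path:
    """Write rows to a CSV file inside output_dir and return the file's path."""
    if not rows:
        raise ValueError("rows is empty; refusing to write an empty table")
    # Create the output directory on first use so a fresh clone still works.
    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = output_dir / filename
    with out_path.open("w", newline="") as f:
        # Column order follows the keys of the first row.
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Because the default directory is a relative path, results land in the repository's output directory when the code is run from the repository root.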

If I can’t put functions in notebooks, where should they go?

Functions should be put in a utils directory and loaded via import.

How should I document notebooks?

Notebooks, just like any other piece of code, need to be well documented and readable. Some questions that we ask when evaluating:

Does the notebook begin with a title, byline, date, and summary/description?
Is its content logically organized into sections with headers?
Does it walk the viewer through what the code is doing and why using both Markdown and comments, and in clear language?
Is cell output formatted for easier viewing (e.g., to avoid scrolling)? Are there any cells that were obviously used for testing/scratchwork and have not yet been removed? Are numbers rounded for display purposes?
Are Python module imports located together near the top of the notebook, following PEP 8, rather than scattered throughout many cells?
Are all cells 15 lines of code or less?
Do notebooks have fewer than 10 cells?
Are pyflakes and black being run on the code for standardization?

What about Docker README.md information?

This example is a good starting point. Replace project-name with your project name.

This repository contains a basic Dockerfile that will run a Jupyter notebook instance. To build the docker image, please type in:

docker build . -t [project-name]

Note that the image name in the above command is [project-name].

To run the image type in the following:

docker run -p 8888:8888 -v ${PWD}:/tmp [project-name]

As you can see, we are running the [project-name] image.

What is a good folder structure?

For the simplest projects something like the below should work:

.
├── README.md
├── .gitignore
├── Dockerfile
├── notebooks/
│   ├── README.md
│   └── notebook Files
├── utils/
│   ├── README.md
│   ├── __init__.py
│   └── python utility files
├── data/
│   ├── README.md (or SOURCES.md)
│   └── Data files
├── output/
│   ├── README.md
│   └── output images, tables, etc.
└── documentation/
    └── README.md

What if I’m doing work outside of traditional python code?

If you are doing work outside of python it still requires documentation. There should be zero work that isn’t shareable.

What should a docstring look like?

Docstrings should contain, at a minimum:

A brief description of what the function does (if you are finding it difficult to briefly describe it, consider whether your function is too big)
Requirements of input parameters. What types are expected? Does your function make any assumptions about the inputs (e.g., that an input dictionary has a ‘results’ key)? Document it!

Docstring format should be consistent across a repository. Google has a popular format, described here.
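
As a rough illustration of those minimums, here is a sketch of a docstring written in Google's format. The function summarize_results and its report argument are hypothetical, chosen to match the "results" key example above.

```python
def summarize_results(report: dict) -> float:
    """Compute the mean score from a report.

    Args:
        report: Dictionary with a "results" key that maps to a non-empty
            list of numeric scores.

    Returns:
        The arithmetic mean of the scores as a float.

    Raises:
        KeyError: If report has no "results" key.
        ValueError: If the "results" list is empty.
    """
    scores = report["results"]
    if not scores:
        raise ValueError('report["results"] is empty')
    return sum(scores) / len(scores)
```

The Args section states both the expected type and the assumption about the dictionary's contents, so a caller does not have to read the function body to learn them.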