
UChicago Data Science Clinic

TA Technical Standards Training

Training Goals

Remember: We're preparing students for real data science careers, not just academic assignments.

Core Principles

Enforce, Don't Code

Guide students to solutions rather than fixing code yourself

Consistency

Fair standards across all projects and teams

Professionalism

Mirror real data science team practices

Escalation

Raise systemic issues to mentors promptly

Key Insight: Your job is to develop student capabilities, not deliver perfect code.

Why These Standards Matter

Reproducibility

Code should work the same way for everyone, everywhere

Professional Skills

Students internalize industry-level practices

Prevent Bugs

Catch "works on my machine" issues early

Fair Reviews

Enable consistent, actionable feedback

Key Technical Areas

  • Code Style & Structure
    • Organization and documentation
  • Debugging & Testing
    • Systematic problem-solving
  • Docker & Make
    • Reproducible environments
  • GitHub Workflows
    • Professional collaboration

For a full list of technical standards and best practices: https://dsi-clinic.github.io/ta-training/technical/

Code Style and Structure

Poor Organization

import pandas as pd
import numpy as np

def proc(d):
    return d.groupby('cat').mean()

def ld(f):
    return pd.read_csv(f)

def sv(d, f):
    d.to_csv(f)

def main():
    d = ld('data.csv')
    r = proc(d)
    sv(r, 'out.csv')

Professional Structure

"""Survey data analysis module."""
from pathlib import Path
import pandas as pd
import logging

logger = logging.getLogger(__name__)

def load_survey_data(file_path: Path) -> pd.DataFrame:
    """Load survey data with validation."""
    logger.info(f"Loading survey data from {file_path}")
    
    if not file_path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")
    
    return pd.read_csv(file_path)

def calculate_category_averages(
    df: pd.DataFrame, 
    category_col: str = 'category'
) -> pd.DataFrame:
    """Calculate mean values grouped by category."""
    # numeric_only avoids errors on text columns; reset_index keeps the
    # category as a column so it survives index=False when saved
    return df.groupby(category_col).mean(numeric_only=True).reset_index()

def save_results(df: pd.DataFrame, output_path: Path) -> None:
    """Save analysis results to CSV."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)
    logger.info(f"Results saved to {output_path}")

Key Teaching Points: Good structure isn't about following rules - it's about making code maintainable and collaborative. Help students see how clear names and documentation save time for their future selves and teammates.
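
One prompt that works well in review: ask where the module's entry point is. A minimal sketch of a main() that ties the functions above together (the file paths here are placeholders, not project requirements):

def main() -> None:
    """Run the survey analysis end to end."""
    logging.basicConfig(level=logging.INFO)
    data = load_survey_data(Path("data/survey.csv"))
    averages = calculate_category_averages(data)
    save_results(averages, Path("outputs/category_averages.csv"))

if __name__ == "__main__":
    main()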

Debugging and Testing

Ad Hoc Debugging

def clean_survey_data(df):
    # Print debugging - can't be controlled
    print("Starting with", len(df), "rows")
    
    # Remove invalid responses
    df = df[df['score'] > 0]
    print("After score filter:", len(df))
    
    # Remove outliers
    df = df[df['score'] < 100]
    print("After outlier removal:", len(df))
    
    return df

# No tests - hope it works!

No Systematic Testing

# Manual testing in notebooks
df = pd.read_csv('data.csv')
result = clean_survey_data(df)
# Visual inspection only
result.head()

Structured Debugging

import logging

import pandas as pd

logger = logging.getLogger(__name__)

def clean_survey_data(df: pd.DataFrame) -> pd.DataFrame:
    """Clean survey data with logging and validation."""
    logger.info(f"Starting data cleaning with {len(df)} rows")
    
    # Validate input
    required_cols = ['score', 'user_id']
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    
    # Remove invalid responses
    initial_count = len(df)
    df_clean = df[df['score'] > 0]
    removed = initial_count - len(df_clean)
    logger.info(f"Removed {removed} rows with invalid scores")
    
    # Remove outliers (configurable threshold)
    outlier_threshold = 100
    df_final = df_clean[df_clean['score'] < outlier_threshold]
    outliers = len(df_clean) - len(df_final)
    logger.info(f"Removed {outliers} outlier rows")
    
    logger.info(f"Cleaning complete: {len(df_final)} rows remaining")
    return df_final

Arrange-Act-Assert Testing

import pandas as pd

# ARRANGE: Set up test data
test_data = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'score': [85, -5, 95, 0]
})

# ACT: Call the function
result = clean_survey_data(test_data)

# ASSERT: Verify the expected behavior
assert len(result) == 2, f"Expected 2 rows, got {len(result)}"
assert all(result['score'] > 0), "All scores should be positive"

print("✓ Test passed: Invalid scores removed correctly")


# Test error handling
test_data_bad = pd.DataFrame({'other_col': [1, 2, 3]})

try:
    clean_survey_data(test_data_bad)
    assert False, "Should have raised ValueError"
except ValueError as e:
    assert "Missing columns" in str(e)
    print("✓ Test passed: Missing columns detected")

Key Teaching Points: Systematic debugging saves time and builds professional habits. Watch for print statements where logging belongs and for functions without error handling. The Arrange-Act-Assert pattern structures tests clearly - simple assertions are usually sufficient. While pytest is the professional standard, basic assertions help students start testing without additional tools.
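
When a team is ready for pytest, the same checks translate directly. A sketch, assuming clean_survey_data lives in an importable module (survey_cleaning is a made-up name):

import pandas as pd
import pytest

from survey_cleaning import clean_survey_data  # hypothetical module name

def test_removes_invalid_scores():
    # ARRANGE: set up test data
    test_data = pd.DataFrame({
        "user_id": [1, 2, 3, 4],
        "score": [85, -5, 95, 0],
    })

    # ACT: call the function
    result = clean_survey_data(test_data)

    # ASSERT: verify the expected behavior
    assert len(result) == 2
    assert (result["score"] > 0).all()

def test_missing_columns_raise():
    bad_data = pd.DataFrame({"other_col": [1, 2, 3]})
    with pytest.raises(ValueError, match="Missing columns"):
        clean_survey_data(bad_data)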

Docker & Make for Reproducibility

Modern Project Structure with uv

# Dockerfile
FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim

WORKDIR /app

# Copy dependency files
COPY pyproject.toml uv.lock ./

# Install dependencies using uv
RUN uv sync --frozen --no-dev

# Copy source code
COPY . .

# Default command
CMD ["uv", "run", "python", "app.py"]

# Makefile
IMAGE=project-name
.PHONY: build run test lint sync

build:
	docker build -t $(IMAGE) .

run:
	docker run --rm -it \
		-v $(PWD):/app -w /app \
		$(IMAGE) uv run python app.py

test:
	docker run --rm -it \
		-v $(PWD):/app -w /app \
		$(IMAGE) uv run pytest -q

lint:
	docker run --rm -it \
		-v $(PWD):/app -w /app \
		$(IMAGE) uv run ruff check .

sync:
	docker run --rm -it \
		-v $(PWD):/app -w /app \
		$(IMAGE) uv sync

Key Teaching Points: Modern Python packaging with uv provides faster, more reliable dependency management. Check that each project has a pyproject.toml, that uv.lock is committed, and that make commands use `uv run`. This workflow eliminates "works on my machine" problems.

Workflow: make build → make run → reproduce anywhere. uv manages dependencies, the lock file pins exact versions, volume mounts give live code updates for fast iteration, and any machine produces the same results (no "works for me").

Git Workflow

Branch diagram: a feature branch jd/add-validation (commits "Feature added", "Add tests", "Fix bugs", "Update docs") is squash-merged into main as a single commit, alongside main's own "Initial" and "Other work" history.

Student Workflow

# 1. Create feature branch
git checkout -b jd/my-feature

# 2. Work and commit regularly  
git add .
git commit -m "Add data validation"

# 3. Keep current with main
git checkout main && git pull
git checkout jd/my-feature && git rebase main

# 4. Open PR when ready
# 5. Address feedback
# 6. TA merges after approval

Benefits

  • Clean History: Main shows features, not "fix typo" commits
  • Safe Experimentation: Feature branches can be messy
  • Easy Rollback: Each feature is one commit
  • Parallel Work: Multiple students work simultaneously

Common Issues

  • Working directly on main
  • Not rebasing with main

Key Teaching Points: Feature branches encourage experimentation without fear of breaking main. Squash merging keeps history clean and makes it easy to rollback entire features. Help students understand that messy commits on feature branches are fine - the final merge is what matters.

The Review Feedback Loop

Student Code
Initial implementation
TA Review
Specific, actionable feedback
Student Revision
Address feedback & learn
Final Approval
Standards met

Keys to Effective Feedback

  • Specific: "Add type hints to this function" not "improve code quality"
  • Actionable: Suggest concrete next steps
  • Educational: Explain the "why" behind requests
  • Encouraging: Acknowledge good practices you see

Unhelpful Feedback

  • "This is wrong"
  • "Fix the Docker setup"
  • "Code quality issues"

Helpful Feedback

  • "This function needs type hints - what does it return?"
  • "Can you add a .env.example file showing required variables?"
  • "Consider extracting the file reading into a separate function"

Coaching Techniques

Ask Guiding Questions

  • "What happens if...?"
  • "How would a teammate run this?"
  • "What if this file didn't exist?"

Encourage Small PRs

  • 200-400 lines max
  • One feature at a time
  • Easier to review and learn from

Model Good Feedback

  • Specific and actionable
  • Explain the "why"
  • Suggest improvements, don't just criticize

Celebrate Progress

  • Acknowledge improvements
  • Progress over perfection
  • Build confidence incrementally

For more code review guidelines and best practices: https://dsi-clinic.github.io/ta-training/technical/code-review.html

Common Review Scenarios

What to Look For

  • Hardcoded paths: /Users/alice/project/data.csv
  • Missing error handling: Files that might not exist
  • Unclear function names: process(), calc()
  • Giant files: 500+ lines in one script
  • No type hints: Unclear parameter expectations
  • Secrets in code: API keys, passwords

How to Give Feedback

  • Ask questions: "What happens if this file doesn't exist?"
  • Suggest alternatives: "Could we use pathlib.Path here?"
  • Explain benefits: "Type hints help teammates understand your code"
  • Celebrate improvements: "Great job adding docstrings!"

Sample Review Comment

# Instead of: "Fix the paths"
# Try: 

"I notice this hardcoded path might break on other machines:
`df = pd.read_csv('/Users/alice/project/data.csv')`

Could we use a relative path instead? Something like:
`from pathlib import Path`
`data_path = Path(__file__).parent / 'data' / 'survey.csv'`
`df = pd.read_csv(data_path)`

This way anyone can run your code."

When to Escalate to Mentors

Immediate Escalation

  • Security issues: Committed secrets, exposed credentials
  • Plagiarism concerns: Taking credit for someone else's work without attribution
  • Scope creep: Project requirements changing significantly
  • Team conflicts: Students not collaborating effectively

Pattern-Based Escalation

  • Repeated issues: Same problems across multiple teams
  • Systematic gaps: Everyone struggling with the same concept
  • Resource constraints: Students need tools/access they don't have
  • Timeline concerns: Teams consistently behind schedule

How to Escalate Effectively

  • Document the issue: Screenshots, links, specific examples
  • Suggest solutions: Don't just raise problems, propose fixes
  • Note broader impact: How many students/teams affected?
  • Timeline urgency: Does this need immediate attention?

# Good escalation message:
"I've noticed 3 teams this week struggling with Docker setup on Windows machines. 
They're getting permission errors when mounting volumes. This is blocking their 
ability to run `make test`. 

Suggested fix: Add Windows-specific Make targets or update the handbook with 
Windows Docker Desktop configuration steps.

This affects ~40% of our current cohort. Can we address this in Friday's 
all-hands meeting?"

Practice Time - Let's Review Code Together

def process_survey_data():
    data = pd.read_csv('/Users/alice/Documents/clinic/survey_data.csv')
    
    print("Data loaded successfully!")
    
    # Remove invalid responses
    clean_data = data[data['satisfaction_score'] > 0]
    
    # Calculate statistics  
    avg = clean_data['satisfaction_score'].mean()
    dept_stats = clean_data.groupby('department')['satisfaction_score'].mean()
    
    # Save results
    results = {'overall_avg': avg, 'dept_averages': dept_stats.to_dict()}
    
    with open('analysis_results.json', 'w') as f:
        json.dump(results, f)
    
    print(f"Analysis complete! Average satisfaction: {avg}")
    return results

Discussion Questions

  • What specific issues do you see?
  • How would you prioritize your feedback?
  • What questions would you ask the student?
  • How could this code be improved incrementally? (one possible first step is sketched below)
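
If the discussion needs an anchor, here is one possible first increment - a sketch, not an answer key: take the hardcoded path and the printing out of the computation, in line with the earlier sections.

from pathlib import Path

import pandas as pd

def summarize_satisfaction(df: pd.DataFrame) -> dict:
    """Pure computation: no file paths, no printing."""
    clean = df[df["satisfaction_score"] > 0]
    return {
        "overall_avg": clean["satisfaction_score"].mean(),
        "dept_averages": clean.groupby("department")["satisfaction_score"].mean().to_dict(),
    }

def main(data_path: Path) -> dict:
    """I/O layer: the caller decides where the data lives."""
    return summarize_satisfaction(pd.read_csv(data_path))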

Next Steps

Use the Handbook

  • Reference during code reviews
  • Share specific sections with students
  • Bookmark common patterns
  • Suggest handbook updates when needed

Practice First

  • Try the included examples
  • Run through Docker/Make workflows
  • Practice giving constructive feedback
  • Test handbook recommendations

Start Small

  • Focus on 2-3 key areas initially
  • Build consistency gradually
  • Celebrate student improvements
  • Document what works well

Communicate Regularly

  • Weekly TA check-ins
  • Share challenging scenarios
  • Escalate patterns promptly
  • Contribute to handbook improvements

Remember: You're Building Professional Data Scientists

Every code review, every standard enforced, every coaching conversation shapes how students will approach data science throughout their careers. The habits they learn here will serve them for years to come.

Questions & Discussion

Thank you!

Let's discuss any questions about:

  • Specific technical standards
  • Code review techniques
  • Student coaching strategies
  • Escalation scenarios
  • Handbook usage and improvements

Full handbook available for reference

Examples in technical/examples/ directory

Additional topics covered in appendix slides

Questions? Reach out anytime!

Appendix

Essential TA Guidelines

Critical topics for effective student coaching

These slides cover essential topics for TAs working with students. Use them as reference during code reviews and student interactions.

Topics covered: AI usage guidelines, repository hygiene, type hints, Python best practices, Pydantic validation, configuration management, dependency management, I/O separation patterns, and logging & error messages.

AI Usage Guidelines

Appropriate AI Use

  • Learning & Understanding: Explain concepts, debug errors
  • Code Review: Generate test cases, suggest improvements
  • Documentation: Help write clear docstrings and comments
  • Problem-Solving: Brainstorm approaches, not solutions

Detecting AI-Generated Code

  • Overly generic: Perfect but lacks project context
  • Unnecessary complexity: Over-engineered for simple tasks
  • Missing project specifics: No hardcoded paths or project structure
  • Inconsistent style: Doesn't match student's usual patterns

Inappropriate AI Use

  • Complete solutions: Copy-pasting entire functions/scripts
  • Without understanding: Can't explain the code
  • No attribution: Presenting AI work as original

How to Address AI Usage

  • Ask questions: "Can you walk me through this logic?"
  • Request explanation: "Help me understand this approach"
  • Focus on learning: "Let's trace through this together"
  • Encourage attribution: "Did you use any AI assistance here?"

Escalation Guidelines

  • Immediate escalation: Clear evidence of AI-generated solutions for graded work
  • Pattern escalation: Consistent use of AI without understanding across multiple submissions
  • Documentation: Save examples, note inability to explain code, record conversation
  • Educational approach: Focus on learning outcomes rather than punishment

Key Teaching Points: AI is a tool for learning, not a shortcut to avoid understanding. Help students use AI responsibly by focusing on comprehension and attribution. The goal is building skills, not just producing code.

Repository Hygiene

Code Debt Management

  • Address review comments promptly: Don't let PRs sit for weeks
  • Close completed PRs: Merge or close within 1-2 weeks
  • Clean up branches: Delete merged feature branches
  • Regular maintenance: Weekly time for technical debt

Time Allocation

  • 20-30% of time responding to code review feedback
  • 3-5 day response time for review comments
  • Maximum 2-3 open PRs per student
  • Batch similar changes into single commits

Continuous Work Strategy

# While PR is under review, continue building on it
git checkout jd/current-feature
git checkout -b jd/current-feature-part2

# Continue development
git add . && git commit -m "Build on current feature"

# When current PR merges, update and open new PR
git checkout main && git pull
git checkout jd/current-feature-part2
git rebase main

Preventing Bloat

  • Remove unused files: Old notebooks, temp scripts
  • Consolidate code: Merge duplicate functions
  • Clean imports: Remove unused dependencies
  • Archive old work: Move experiments to separate branches

Key Teaching Points: Code debt accumulates quickly over a quarter. Help students develop habits of regular maintenance and prompt response to feedback. A clean repository is easier to review, understand, and maintain.

Type Hints for Clarity

Unclear Function

def process_data(data, config):
    # What type is data? 
    # What does config contain?
    # What gets returned?
    result = analyze(data, config)
    return result

Runtime Error

def calculate_average(scores):
    return sum(scores) / len(scores)

# Later...
calculate_average("85,90,78")  # TypeError!

Clear Expectations

def process_survey_data(
    data: pd.DataFrame, 
    config: AnalysisConfig
) -> dict[str, float]:
    """Process survey data and return statistics."""
    result = analyze(data, config)
    return result

Early Detection

def calculate_average(scores: list[float]) -> float:
    """Calculate mean of numeric scores."""
    return sum(scores) / len(scores)

# Type checker catches error before runtime
calculate_average("85,90,78")  # mypy error!

Key Teaching Points: Type hints aren't just documentation - they're early error detection. Focus on public functions and anywhere data types aren't obvious. Help students see how type hints make code self-documenting and catch bugs before runtime.

Python Best Practices

Mutable Default Arguments

def add_item(item, items=[]):
    items.append(item)
    return items

# Dangerous! Same list shared across calls
list1 = add_item("apple")     # ["apple"]
list2 = add_item("banana")    # ["apple", "banana"]

Variable Scope Issues

results = []
for i in range(3):
    # Late binding closure problem
    results.append(lambda: i * 2)

# All functions return 4 (i=2)
[f() for f in results]  # [4, 4, 4]

Safe Default Arguments

def add_item(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items

# Each call gets fresh list
list1 = add_item("apple")     # ["apple"]
list2 = add_item("banana")    # ["banana"]

Explicit Closures

results = []
for i in range(3):
    # Capture i explicitly
    results.append(lambda x=i: x * 2)

# Each function has its own value
[f() for f in results]  # [0, 2, 4]

Key Teaching Points: These aren't style issues - they're bugs waiting to happen. Help students understand why Python behaves this way.
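
A tiny demonstration makes the "why" concrete: the default list is evaluated once, when the function is defined, so every call that omits the argument shares that single object.

def buggy_add(item, items=[]):
    items.append(item)
    return items

print(buggy_add("a") is buggy_add("b"))  # True: both calls return the same list object
print(buggy_add.__defaults__)            # (['a', 'b'],) - the "empty" default has grown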

Data Validation with Pydantic

Without Validation

def process_config(config_dict):
    # What if keys are missing?
    # What if values are wrong type?
    db_url = config_dict["database_url"]
    max_workers = config_dict["max_workers"]
    debug = config_dict["debug_mode"]
    
    # Runtime errors waiting to happen
    return setup_pipeline(db_url, max_workers, debug)

With Pydantic

from pydantic import BaseModel, Field, field_validator

class PipelineConfig(BaseModel):
    database_url: str = Field(..., min_length=1)
    max_workers: int = Field(default=2, ge=1, le=16)
    debug_mode: bool = Field(default=False)
    
    @field_validator('database_url')
    @classmethod
    def validate_db_url(cls, v):
        if not v.startswith(('postgresql://', 'sqlite://')):
            raise ValueError('Invalid database URL')
        return v

# Validates automatically on creation
config = PipelineConfig(**config_dict)

Key Teaching Points: Pydantic isn't just validation - it's early error detection that saves debugging time. When students load configuration or API data, suggest Pydantic to catch problems before they become mysterious runtime failures.
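
To show students what "early detection" looks like in practice, a failed load might be handled like this (a sketch using the PipelineConfig class above):

from pydantic import ValidationError

try:
    config = PipelineConfig(database_url="mysql://host/db", max_workers=100)
except ValidationError as exc:
    # Every failing field is reported at once, with the violated constraint,
    # before any pipeline code runs on bad values.
    print(exc)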

Configuration Management

Hardcoded Configuration

def analyze_data():
    # Hardcoded values scattered throughout
    df = pd.read_csv("/Users/alice/data/survey.csv")
    
    # Magic numbers with no context
    threshold = 0.95
    min_samples = 50
    
    # Database connection hardcoded
    conn = sqlite3.connect("project_db.sqlite")
    
    # API key in source code
    api_key = "sk-1234567890abcdef"

Environment Variables

# .env file (often missing .env.example)
DATA_PATH=/Users/alice/data/survey.csv
CONFIDENCE_LEVEL=0.95
MIN_SAMPLES=50
DATABASE_URL=sqlite:///project_db.sqlite
API_KEY=sk-1234567890abcdef

Pydantic Settings Class

from pydantic import Field, field_validator
from pydantic_settings import BaseSettings, SettingsConfigDict
from pathlib import Path

class AnalysisConfig(BaseSettings):
    """Centralized configuration with validation."""
    
    model_config = SettingsConfigDict(env_file=".env")
    
    # Data paths with validation
    data_path: Path = Field(..., description="Path to survey data")
    output_dir: Path = Field(default=Path("./outputs"))
    
    # Analysis parameters with constraints
    confidence_level: float = Field(default=0.95, ge=0.5, le=0.99)
    min_samples: int = Field(default=50, ge=1)
    
    # External services
    database_url: str = Field(..., description="Database connection string")
    api_key: str = Field(..., description="External API key")
    
    @field_validator('data_path')
    @classmethod
    def validate_data_path(cls, v):
        if not v.exists():
            raise ValueError(f"Data file not found: {v}")
        return v

# Usage with automatic validation
config = AnalysisConfig()

.env.example File

# .env.example - commit this file
DATA_PATH=./data/survey.csv
CONFIDENCE_LEVEL=0.95
MIN_SAMPLES=50
DATABASE_URL=sqlite:///your_database.sqlite
API_KEY=your_api_key_here

Key Teaching Points: Configuration management prevents "works on my machine" problems. Look for hardcoded values, missing .env.example files, and secrets in code. Pydantic settings classes catch configuration errors early and document what's required.
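
Downstream code then reads everything from the validated settings object. A rough sketch (the summary output and the minimum-row check are illustrative, not project requirements):

import pandas as pd

def analyze_data(config: AnalysisConfig) -> None:
    """Run the analysis using validated settings instead of hardcoded values."""
    df = pd.read_csv(config.data_path)
    if len(df) < config.min_samples:
        raise ValueError(f"Need at least {config.min_samples} rows, got {len(df)}")
    config.output_dir.mkdir(parents=True, exist_ok=True)
    df.describe().to_csv(config.output_dir / "summary.csv")

if __name__ == "__main__":
    analyze_data(AnalysisConfig())  # values come from the environment / .env file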

Dependencies and Environments

Unpinned Dependencies

# Old requirements.txt approach
pandas
numpy
scikit-learn
matplotlib
requests

Local Installation

# Student runs on their laptop
pip install pandas numpy scikit-learn

# Works for them, breaks for teammates
# Different package versions
# Different Python versions
# Different operating systems

Missing Documentation

# README.md
## Setup
1. Install Python
2. Run the code
3. Hope it works

Modern Dependencies

# pyproject.toml
[project]
name = "survey-analysis"
dependencies = [
    "pandas==2.2.2",
    "numpy==1.26.4",
    "scikit-learn==1.4.2",
]

[project.optional-dependencies]
dev = [
    "pytest==8.1.1",
    "ruff==0.4.2",
]

uv + Docker Environment

# Dockerfile
FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim

WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY . .
CMD ["uv", "run", "python", "app.py"]

Clear Setup Instructions

# README.md
## Setup
1. Install Docker Desktop
2. Clone this repository
3. Run: `make build`
4. Run: `make test`
5. Start analysis: `make run`

## Development
- `make shell` - Interactive container
- `make sync` - Update dependencies

Key Teaching Points: Modern dependency management with uv and lock files prevents collaboration breakdowns. Check for missing pyproject.toml, unpinned versions, and missing uv.lock files. Docker + uv isn't complexity - it's consistency across team members.

Separate I/O from Computation

Mixed Together

def analyze_survey():
    # I/O mixed with computation
    df = pd.read_csv("data/survey.csv")
    
    # Analysis logic
    avg_score = df['satisfaction'].mean()
    
    # More I/O
    with open("results.json", "w") as f:
        json.dump({"avg": avg_score}, f)
    
    print(f"Average: {avg_score}")

Separated

def calculate_satisfaction_stats(
    df: pd.DataFrame
) -> dict[str, float]:
    """Pure function - easy to test."""
    return {
        'avg_score': df['satisfaction'].mean(),
        'std_score': df['satisfaction'].std()
    }

def main():
    # I/O layer
    df = pd.read_csv(DATA_PATH)
    
    # Pure computation  
    stats = calculate_satisfaction_stats(df)
    
    # I/O layer
    save_results(stats, OUTPUT_PATH)

Key Teaching Points: I/O separation makes code testable and reusable. Look for functions that both read files AND do computation - that's a refactoring opportunity. Pure functions are easier to test, debug, and understand.
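
The payoff shows up immediately in tests: the pure function can be exercised with an in-memory DataFrame, no files or fixtures required (a sketch; the import path is a made-up name):

import pandas as pd

from analysis import calculate_satisfaction_stats  # hypothetical module name

def test_satisfaction_stats():
    # Build the input in memory - no CSV files needed
    df = pd.DataFrame({"satisfaction": [3.0, 4.0, 5.0]})

    stats = calculate_satisfaction_stats(df)

    assert stats["avg_score"] == 4.0
    assert "std_score" in stats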

Logging & Specific Error Messages

Poor Error Handling

def load_data(path):
    df = pd.read_csv(path)  
    # FileNotFoundError: [Errno 2] No such file
    
    assert 'score' in df.columns  
    # AssertionError (no context!)
    
    print("Data loaded")  # Can't control output
    return df

Helpful Error Messages

import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger(__name__)

def load_survey_data(path: Path) -> pd.DataFrame:
    if not path.exists():
        raise FileNotFoundError(
            f"Survey file not found: {path}\n"
            f"Current directory: {Path.cwd()}\n"  
            f"Expected location: {path.absolute()}"
        )
    
    df = pd.read_csv(path)
    
    required_cols = ['user_id', 'satisfaction']
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise ValueError(
            f"Missing columns in {path.name}: {missing}\n"
            f"Available: {list(df.columns)}"
        )
    
    logger.info(f"Loaded {len(df)} survey responses")
    return df

Key Teaching Points: Good error messages are love letters to future debuggers. Replace generic exceptions with specific, actionable messages. Logging beats print statements - it can be controlled and filtered appropriately.
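
One concrete way to show the difference: a single basicConfig call in the entry point controls how much detail every module's logger emits (a minimal sketch).

import logging

# In the entry point only - library modules just call logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,  # switch to logging.DEBUG when investigating a problem
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)

logger = logging.getLogger(__name__)
logger.info("Pipeline starting")          # shown at INFO and below
logger.debug("Row-level details here")    # hidden unless the level is DEBUG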