Methods

How the explorer is built

The site is deliberately strict about what it claims. Exact DOC jail histories are shown as exact. Arrest-to-jail links are labeled as candidate. Court and prison linkages are kept out unless the public data can actually support them.

The current build covers 196,713 exact DOC people, 86,347 of them with repeat admissions (43.9%), and a median readmission gap of 182 days.

Truth Standard

What the app can claim

The unit of confidence is the join, not the visualization. Every route in the app inherits the confidence of the underlying linkage.

  • Exact DOC joins. DOC admissions and discharges join on INMATEID + admit_date, which supports exact jail episode histories, repeat-admission counts, and cohort return metrics.
  • Candidate arrest bridge. The arrest bridge keeps only unique 1:1 matches after filtering on same date, normalized sex, parsed penal code, and compatible age bucket.
  • Unsupported cross-stage claims. Public court bulk extracts are de-identified and public prison releases are aggregate, so this site does not claim full cross-stage identity resolution.
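One way to read "every route inherits the confidence of the underlying linkage" is that a route is labeled by the weakest file it reads. The sketch below illustrates that reading only. The `Confidence` enum, the `FILE_CONFIDENCE` registry, and `route_confidence` are hypothetical names; the labels are taken from this page.

```python
from enum import IntEnum


class Confidence(IntEnum):
    """Join confidence levels, ordered weakest to strongest."""
    UNSUPPORTED = 0
    CANDIDATE = 1
    EXACT = 2


# Hypothetical registry: the label this page gives each derived file.
FILE_CONFIDENCE = {
    "doc_recidivism_persons.parquet": Confidence.EXACT,
    "doc_recidivism_episodes.parquet": Confidence.EXACT,
    "doc_cohort_recidivism.parquet": Confidence.EXACT,
    "arrest_doc_bridge.parquet": Confidence.CANDIDATE,
}


def route_confidence(files_read):
    """Label a route by the weakest file it reads (assumed rule)."""
    if not files_read:
        return Confidence.UNSUPPORTED
    # Files that are not in the registry are treated as unsupported.
    return min(
        FILE_CONFIDENCE.get(name, Confidence.UNSUPPORTED) for name in files_read
    )
```

Under this rule, a route that mixes the persons file with the arrest bridge would be labeled candidate, not exact.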

Current Runtime

What the live app reads

The deployed explorer is intentionally narrow. It only needs the four derived Parquet files below, all built from public inputs.

doc_recidivism_persons.parquet

Join DOC admissions to discharges on INMATEID + admit_date, sequence episodes, then aggregate to person-level metrics.

Exact. One row per DOC person. Used for the homepage leader table, search results, person summary cards, and tiering.

doc_recidivism_episodes.parquet

Exact DOC episode history with discharge date, stay_days, gap_days, episode order, and imputed birth year.

Exact. One row per DOC admission episode. Used for person timelines, episode charts, stay/gap statistics, and raw episode history.
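The episode and person files can be pictured with a small plain-Python sketch. It is not the build script. It assumes records are dicts, that `discharge_date` is the name of the discharge column, and that `gap_days` means the days between the previous discharge and the next admission. All three are assumptions.

```python
from collections import defaultdict
from datetime import date


def build_episodes(admissions, discharges):
    """Join on (INMATEID, admit_date), then order each person's episodes."""
    discharge_by_key = {
        (d["INMATEID"], d["admit_date"]): d["discharge_date"] for d in discharges
    }
    by_person = defaultdict(list)
    for a in admissions:
        key = (a["INMATEID"], a["admit_date"])
        by_person[a["INMATEID"]].append({
            "INMATEID": a["INMATEID"],
            "admit_date": a["admit_date"],
            # None when no discharge row shares the join key.
            "discharge_date": discharge_by_key.get(key),
        })
    episodes = []
    for rows in by_person.values():
        rows.sort(key=lambda r: r["admit_date"])
        prev_discharge = None
        for order, row in enumerate(rows, start=1):
            end = row["discharge_date"]
            row["episode_order"] = order
            row["stay_days"] = (end - row["admit_date"]).days if end else None
            # The gap is undefined for a first episode or an unknown prior discharge.
            row["gap_days"] = (
                (row["admit_date"] - prev_discharge).days if prev_discharge else None
            )
            prev_discharge = end
            episodes.append(row)
    return episodes


def summarize_persons(episodes):
    """Aggregate episodes to one summary per INMATEID."""
    counts = defaultdict(int)
    for e in episodes:
        counts[e["INMATEID"]] += 1
    return {pid: {"admissions": n, "repeat": n > 1} for pid, n in counts.items()}
```

Because the join key is exact, no fuzzy matching is involved; an admission without a matching discharge simply carries empty stay fields.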

doc_cohort_recidivism.parquet

Build release cohorts from DOC episodes and mark whether each person returned within 1, 2, or 3 years when follow-up is observable.

Exact. One row per person-cohort outcome. Used for the homepage 1-year return-rate trend and cohort-based recidivism framing.
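The phrase "when follow-up is observable" matters: a release too close to the end of the data cannot be counted as a non-return. A minimal sketch of that censoring rule follows. The function names, the `data_end` parameter, and the 365-day year are assumptions, and the cohort script may define the window differently.

```python
from datetime import date, timedelta


def return_outcome(release_date, next_admit_date, data_end, years):
    """True or False when the full window is observed, None when censored."""
    window_end = release_date + timedelta(days=365 * years)
    if window_end > data_end:
        # The window runs past the last observed date, so the outcome is unknown.
        return None
    return next_admit_date is not None and next_admit_date <= window_end


def return_rate(outcomes):
    """Share of observable outcomes that are returns; censored rows are excluded."""
    observed = [o for o in outcomes if o is not None]
    return sum(observed) / len(observed) if observed else None
```

Dropping censored rows from both numerator and denominator is what "among observable cohorts" amounts to in this sketch.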

arrest_doc_bridge.parquet

Match arrests to DOC admissions on same date, normalized sex, parsed penal code, and compatible age bucket, then keep only unique 1:1 pairs.

Candidate. One row per candidate arrest-to-admission pair. Used for person arrest tables, arrest map points, and bridge subset counts.
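The "unique 1:1" rule is the part that trades coverage for precision. As a rough sketch, assume each record already carries a match key of (date, normalized sex, penal code, age bucket). The `id` and `key` field names are hypothetical.

```python
from collections import Counter, defaultdict


def candidate_pairs(arrests, admissions):
    """Pair every arrest with every admission that shares its match key."""
    admissions_by_key = defaultdict(list)
    for adm in admissions:
        admissions_by_key[adm["key"]].append(adm["id"])
    return [
        (arr["id"], adm_id)
        for arr in arrests
        for adm_id in admissions_by_key.get(arr["key"], [])
    ]


def unique_one_to_one(pairs):
    """Keep only pairs whose arrest and admission each appear exactly once."""
    arrest_counts = Counter(arrest_id for arrest_id, _ in pairs)
    admission_counts = Counter(adm_id for _, adm_id in pairs)
    return [
        (a, d) for a, d in pairs
        if arrest_counts[a] == 1 and admission_counts[d] == 1
    ]
```

If two arrests compete for the same admission, both pairs are dropped instead of one being picked, which is why the bridge stays a subset.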

Build Chain

Pipeline steps

Each step writes a concrete artifact. The app does not synthesize methods at runtime; it reads the outputs of these build scripts.

scripts/download_public_data.py

Pull the core NYC open-data inputs used across the repo, including arrests, complaints, summonses, and DOC feeds.

Outputs: Raw CSVs under data/raw/

scripts/analyze_doc_recidivism.py

Construct exact jail episode histories and person-level repeat-admission summaries from DOC admissions/discharges.

Outputs: doc_recidivism_persons.parquet, doc_recidivism_episodes.parquet, doc_recidivism_summary.json

scripts/analyze_doc_cohort_recidivism.py

Create cohort-based 1/2/3-year return outcomes with censoring based on observed follow-up windows.

Outputs: doc_cohort_recidivism.parquet, doc_cohort_recidivism_summary.json

scripts/build_arrest_doc_bridge.py

Build the candidate arrest-to-jail bridge and keep only unique 1:1 matches.

Outputs: arrest_doc_bridge.parquet, arrest_doc_bridge_episodes.parquet, arrest_doc_bridge_summary.json

scripts/build_arrest_research_dataset_polars.py

Annotate yearly arrests with complaint-match quality for the broader event-graph work.

Outputs: nypd_arrests_<year>_research_dataset.parquet

scripts/build_public_event_spine_polars.py

Standardize arrests, complaints, summonses, and DOC events into one event schema.

Outputs: public_event_spine_<year>.parquet

scripts/build_public_event_spine_census_geo.py

Geocode unique coordinates with the Census batch geocoder and cache the results.

Outputs: public_event_spine_<year>_census_geo.parquet
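The geocoding step is a dedupe-then-cache pattern: each distinct coordinate is sent out at most once, and later builds reuse earlier answers. The sketch below shows only that pattern. `geocode_batch` is a hypothetical callable standing in for the Census batch geocoder, and the `lat`, `lon`, and `geo` field names are assumptions.

```python
def geocode_with_cache(events, cache, geocode_batch):
    """Geocode each distinct (lat, lon) once, reusing cached results."""
    coords = {
        (e["lat"], e["lon"])
        for e in events
        if e["lat"] is not None and e["lon"] is not None
    }
    # Only coordinates that are not already cached are sent to the geocoder.
    missing = sorted(c for c in coords if c not in cache)
    if missing:
        cache.update(geocode_batch(missing))
    # Events without coordinates, or without a result, get geo=None.
    return [{**e, "geo": cache.get((e["lat"], e["lon"]))} for e in events]
```

Persisting `cache` between runs (for example as its own file) is what would keep yearly rebuilds from re-querying the same points.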

scripts/build_public_event_panel.py

Concatenate yearly event spines into the multi-year public event panel.

Outputs: public_event_panel_<start>_<end>_census_geo.parquet

Join Quality

What is exact, candidate, or unsupported

This is the core methodological boundary of the project. Public criminal-justice data is mostly event-level. Only some layers can be stitched into person histories without inventing certainty.

| Join | Fields | Status | Supports | Caveat |
| --- | --- | --- | --- | --- |
| DOC admissions ↔ DOC discharges | INMATEID + admit_date | Exact | Stay lengths, gap lengths, ordered jail episodes, and DOC person histories. | Exact within the public DOC feeds, but only for the jail stage. |
| DOC episodes ↔ DOC person summaries | Aggregation over INMATEID | Exact | Repeat-admission counts, tiers, charge-change counts, and person profiles. | Still a DOC-only identity, not a citywide criminal-justice person key. |
| NYPD arrests ↔ DOC admissions | date + sex + parsed penal code + imputed age bucket | Candidate | A narrow arrest-to-jail bridge subset for mapped/contextual arrest detail. | Not ground truth. Only unique 1:1 matches are kept to favor precision over coverage. |
| NYPD arrests ↔ NYPD complaints | date + precinct + offense code + borough + demographics | Candidate | Broader repo event-graph analysis outside the current web explorer. | Ambiguous and incomplete, especially in earlier years. |
| Anything ↔ public court bulk data | None | Unsupported | No public person-level court linkage in this app. | The court extracts documented in the repo are intentionally de-identified. |
| Anything ↔ state prison person records | None | Unsupported | No person-level prison linkage in this app. | Public DOCCS releases are aggregate only. |

Bridge Method

How arrests are linked to DOC

The bridge is intentionally narrow. Coverage is sacrificed to reduce obvious false positives.

  1. Parse NYPD LAW_CODE into a penal-law format that can match DOC TOP_CHARGE, and require the two to agree.
  2. Require the arrest date to equal the DOC admission date.
  3. Require normalized sex to agree across systems.
  4. Use the birth year derived from DOC discharge age to infer the expected NYPD age bucket, and require the arrest's age bucket to be compatible with it.
  5. Keep only rows that survive as unique 1:1 arrest-to-admission matches.

Current build: 12,436 candidate pairs, 11,409 unique DOC people, and 904 people with 2+ linked episodes.
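Steps 1 through 4 can be sketched as a single predicate. Several details here are assumptions, not the bridge script's actual rules: that LAW_CODE looks like `PL 1552500` and TOP_CHARGE like `155.25`, that NYPD age groups are `<18`, `18-24`, `25-44`, `45-64`, and `65+`, that "compatible" allows one year of slack around the imputed age, and every field name other than LAW_CODE and TOP_CHARGE.

```python
import re
from datetime import date

# Assumed NYPD age groups as (label, min age, max age).
NYPD_AGE_BUCKETS = [
    ("<18", 0, 17), ("18-24", 18, 24), ("25-44", 25, 44),
    ("45-64", 45, 64), ("65+", 65, 200),
]


def parse_law_code(law_code):
    """Turn an assumed code such as 'PL 1552500' into '155.25', else None."""
    m = re.fullmatch(r"PL\s*(\d{3})(\d{2})\w*", law_code.strip().upper())
    return f"{m.group(1)}.{m.group(2)}" if m else None


def normalize_sex(value):
    """Reduce values such as 'Male' or 'm' to a single upper-case letter."""
    return value.strip().upper()[:1]


def expected_age_buckets(birth_year, arrest_year):
    """Buckets consistent with an imputed birth year, with +/- 1 year of slack."""
    base = arrest_year - birth_year
    ages = (base - 1, base, base + 1)
    return {
        label for label, lo, hi in NYPD_AGE_BUCKETS
        if any(lo <= age <= hi for age in ages)
    }


def is_candidate(arrest, admission):
    """Apply the date, sex, penal code, and age bucket filters to one pair."""
    code = parse_law_code(arrest["LAW_CODE"])
    buckets = expected_age_buckets(admission["birth_year"], arrest["date"].year)
    return (
        arrest["date"] == admission["admit_date"]
        and normalize_sex(arrest["sex"]) == normalize_sex(admission["sex"])
        and code is not None
        and code == admission["TOP_CHARGE"]
        and arrest["age_group"] in buckets
    )
```

Pairs that pass this predicate are still only candidates; step 5 then discards any arrest or admission that appears in more than one surviving pair.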

Non-Claims

What the site does not assert

These are not small footnotes. They define the ceiling of what a public-only NYC criminal-justice explorer can truthfully say.

  • A DOC person page is an exact jail-stage history keyed by INMATEID, not a citywide criminal-justice identity.
  • The arrest-to-DOC bridge is a candidate subset designed for precision, not a full arrest coverage layer.
  • Court outcomes are not linked in the current web app because the public court extracts do not expose a cross-case public person key.
  • State prison, parole, and statewide multi-arrest recidivism remain unsupported in public bulk data.
  • DOC race and charge fields are not strong enough to support broad cross-system identity claims on their own.

Broader Repo Outputs

Additional datasets in the workspace

The repo builds more than the live explorer currently exposes. Those outputs stay labeled by their own confidence level.

nypd_arrests_<year>_research_dataset.parquet

Filter arrests by year, join to year-specific complaint subset on date + precinct + offense code + borough, then tighten with demographics.

Candidate. Used for broader public event-graph work outside the current web routes.

public_event_spine_<year>_census_geo.parquet

Stack yearly arrests, complaints, summonses, DOC admissions, and DOC discharges into one schema, then geocode unique coordinates with a cache.

Mixed. Used for broader panel analysis and cross-source coverage profiling.

public_event_panel_<start>_<end>_census_geo.parquet

Concatenate yearly geocoded event spines into a longitudinal public event panel.

Mixed. Used for repo-level inventory and cross-source coverage analysis.

Live Build Notes

Numbers on this page

These values are pulled from current derived outputs so the methods copy stays in sync with the latest local build.

Exact DOC layer: 196,713 people, 447,391 jail episodes, max 54 admissions for a single person, and a 32.5% 1-year return rate among observable cohorts.
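As an illustration of keeping copy in sync with a build, the sentence above can be rendered from a summary dict. The key names below are hypothetical and may not match the fields in doc_recidivism_summary.json or doc_cohort_recidivism_summary.json; the values are the ones quoted on this page.

```python
def format_live_build_note(summary):
    """Render the exact-layer sentence from summary values."""
    return (
        f"Exact DOC layer: {summary['people']:,} people, "
        f"{summary['episodes']:,} jail episodes, "
        f"max {summary['max_admissions']} admissions for a single person, "
        f"and a {summary['one_year_return_rate']:.1%} 1-year return rate "
        "among observable cohorts."
    )


# Values quoted on this page; key names are assumptions.
summary = {
    "people": 196_713,
    "episodes": 447_391,
    "max_admissions": 54,
    "one_year_return_rate": 0.325,
}
note = format_live_build_note(summary)

# The repeat-admission share follows from the two counts on this page.
repeat_share = f"{86_347 / 196_713:.1%}"
```

Here `repeat_share` reproduces the 43.9% figure from the 86,347 and 196,713 counts.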

The broader event panel is optional at runtime, so this page does not require those larger files to be deployed.

The website itself still reads only the four app Parquet files listed under Current Runtime. The broader repo outputs are documented here so the methodology stays honest about what exists versus what is actually surfaced.