Methods
How the explorer is built
The site is deliberately strict about what it claims. Exact DOC jail histories are shown as exact. Arrest-to-jail links are labeled as candidate. Court and prison linkages are kept out unless the public data can actually support them.
Current build: 196,713 exact DOC people, 86,347 of them with repeat admissions (43.9%), and a median readmission gap of 182 days.
Truth Standard
What the app can claim
The unit of confidence is the join, not the visualization. Every route in the app inherits the confidence of the underlying linkage.
- Exact DOC joins. DOC admissions and discharges join on INMATEID + admit_date, which supports exact jail episode histories, repeat-admission counts, and cohort return metrics.
- Candidate arrest bridge. The arrest bridge keeps only unique 1:1 matches after filtering on same date, normalized sex, parsed penal code, and compatible age bucket.
- Unsupported cross-stage claims. Public court bulk extracts are de-identified and public prison releases are aggregate, so this site does not claim full cross-stage identity resolution.
Current Runtime
What the live app reads
The deployed explorer is intentionally narrow. It only needs the four derived Parquet files below, all built from public inputs.
doc_recidivism_persons.parquet
Join DOC admissions to discharges on INMATEID + admit_date, sequence episodes, then aggregate to person-level metrics.
doc_recidivism_episodes.parquet
Exact DOC episode history with discharge date, stay_days, gap_days, episode order, and imputed birth year.
doc_cohort_recidivism.parquet
Build release cohorts from DOC episodes and mark whether each person returned within 1, 2, or 3 years when follow-up is observable.
arrest_doc_bridge.parquet
Match arrests to DOC admissions on same date, normalized sex, parsed penal code, and compatible age bucket, then keep only unique 1:1 pairs.
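As a rough illustration of the exact DOC layer described above, the sketch below joins admissions to discharges on INMATEID + admit_date, orders each person's episodes, and derives stay_days, gap_days, and a simple person-level summary. It is not the repo's build script: the record layout (plain dicts), the discharge_date field name, and the helper names build_episodes and summarize_persons are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date


def build_episodes(admissions, discharges):
    """Join on (INMATEID, admit_date), then sequence episodes per person.

    Hypothetical layout: each record is a dict; discharge_date is an
    assumed field name, not confirmed by the source.
    """
    # Index discharges by the exact join key.
    discharge_by_key = {
        (d["INMATEID"], d["admit_date"]): d["discharge_date"] for d in discharges
    }

    by_person = defaultdict(list)
    for a in admissions:
        key = (a["INMATEID"], a["admit_date"])
        by_person[a["INMATEID"]].append(
            {
                "INMATEID": a["INMATEID"],
                "admit_date": a["admit_date"],
                # None when no discharge row matches (for example, an open stay).
                "discharge_date": discharge_by_key.get(key),
            }
        )

    episodes = []
    for rows in by_person.values():
        rows.sort(key=lambda r: r["admit_date"])
        prev_discharge = None
        for order, row in enumerate(rows, start=1):
            out = row["discharge_date"]
            row["episode_order"] = order
            row["stay_days"] = (out - row["admit_date"]).days if out else None
            # Gap is measured from the previous discharge to this admission.
            row["gap_days"] = (
                (row["admit_date"] - prev_discharge).days if prev_discharge else None
            )
            episodes.append(row)
            prev_discharge = out
    return episodes


def summarize_persons(episodes):
    """Aggregate episodes to person-level admission counts."""
    counts = defaultdict(int)
    for e in episodes:
        counts[e["INMATEID"]] += 1
    return {pid: {"admissions": n, "repeat": n > 1} for pid, n in counts.items()}
```

Because the join key is exact, an admission without a matching discharge simply keeps a missing discharge date instead of being matched approximately.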
Build Chain
Pipeline steps
Each step writes a concrete artifact. The app does not synthesize methods at runtime; it reads the outputs of these build scripts.
scripts/download_public_data.py
Pull the core NYC open-data inputs used across the repo, including arrests, complaints, summonses, and DOC feeds.
scripts/analyze_doc_recidivism.py
Construct exact jail episode histories and person-level repeat-admission summaries from DOC admissions/discharges.
scripts/analyze_doc_cohort_recidivism.py
Create cohort-based 1/2/3-year return outcomes with censoring based on observed follow-up windows.
scripts/build_arrest_doc_bridge.py
Build the candidate arrest-to-jail bridge and keep only unique 1:1 matches.
scripts/build_arrest_research_dataset_polars.py
Annotate yearly arrests with complaint-match quality for the broader event-graph work.
scripts/build_public_event_spine_polars.py
Standardize arrests, complaints, summonses, and DOC events into one event schema.
scripts/build_public_event_spine_census_geo.py
Geocode unique coordinates with the Census batch geocoder and cache the results.
scripts/build_public_event_panel.py
Concatenate yearly event spines into the multi-year public event panel.
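To make the censoring idea behind the cohort step concrete, here is a minimal sketch of how 1/2/3-year return flags could be assigned only when the follow-up window is fully observed. The function names, the data_end parameter, and the 365-day year are assumptions for illustration; the actual script may define windows differently.

```python
from datetime import date, timedelta


def return_flags(release_date, next_admit_date, data_end, windows=(1, 2, 3)):
    """Return {years: True | False | None} for one release.

    None means censored: the window extends past the last observed date
    (data_end), so the outcome is not counted either way.
    Assumes a 365-day year; that convention is illustrative only.
    """
    flags = {}
    for years in windows:
        horizon = release_date + timedelta(days=365 * years)
        if horizon > data_end:
            flags[years] = None  # follow-up not fully observable
        else:
            flags[years] = next_admit_date is not None and next_admit_date <= horizon
    return flags


def return_rate(all_flags, years):
    """Share of observable releases that returned within the window."""
    observed = [f[years] for f in all_flags if f[years] is not None]
    return sum(observed) / len(observed) if observed else None
```

Censored releases drop out of both the numerator and the denominator, so a rate is only ever computed over releases whose full window falls inside the data.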
Join Quality
What is exact, candidate, or unsupported
This is the core methodological boundary of the project. Public criminal-justice data is mostly event-level. Only some layers can be stitched into person histories without inventing certainty.
| Join | Fields | Status | Supports | Caveat |
|---|---|---|---|---|
| DOC admissions ↔ DOC discharges | INMATEID + admit_date | Exact | Stay lengths, gap lengths, ordered jail episodes, and DOC person histories. | Exact within the public DOC feeds, but only for the jail stage. |
| DOC episodes ↔ DOC person summaries | Aggregation over INMATEID | Exact | Repeat-admission counts, tiers, charge-change counts, and person profiles. | Still a DOC-only identity, not a citywide criminal-justice person key. |
| NYPD arrests ↔ DOC admissions | date + sex + parsed penal code + imputed age bucket | Candidate | A narrow arrest-to-jail bridge subset for mapped/contextual arrest detail. | Not ground truth. Only unique 1:1 matches are kept to favor precision over coverage. |
| NYPD arrests ↔ NYPD complaints | date + precinct + offense code + borough + demographics | Candidate | Broader repo event-graph analysis outside the current web explorer. | Ambiguous and incomplete, especially in earlier years. |
| Anything ↔ public court bulk data | None | Unsupported | No public person-level court linkage in this app. | The court extracts documented in the repo are intentionally de-identified. |
| Anything ↔ state prison person records | None | Unsupported | No person-level prison linkage in this app. | Public DOCCS releases are aggregate only. |
Bridge Method
How arrests are linked to DOC
The bridge is intentionally narrow. Coverage is sacrificed to reduce obvious false positives.
- Parse NYPD LAW_CODE into a penal-law format that can match DOC TOP_CHARGE.
- Require the arrest date to equal the DOC admission date.
- Require normalized sex to agree across systems.
- Use discharge-age-derived birth year to infer the expected NYPD age bucket.
- Keep only rows that survive as unique 1:1 arrest-to-admission matches.
Current build: 12,436 candidate pairs, 11,409 unique DOC people, and 904 people with 2+ linked episodes.
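The filtering steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the bridge script itself: the field names (arrest_id, admission_id, date, sex, charge, age_group, birth_year) are hypothetical, charges are assumed to be already parsed into a shared format, the age buckets follow the NYPD age-group labels, and the one-year tolerance on the imputed birth year is a guess at what "compatible" means.

```python
from collections import Counter, defaultdict
from datetime import date


def age_bucket(age):
    """Map an age in years to an NYPD-style age group label."""
    if age < 18:
        return "<18"
    if age <= 24:
        return "18-24"
    if age <= 44:
        return "25-44"
    if age <= 64:
        return "45-64"
    return "65+"


def compatible_buckets(birth_year, event_year):
    """Buckets consistent with an imputed birth year (assumed +/- 1 year)."""
    age = event_year - birth_year
    return {age_bucket(age - 1), age_bucket(age), age_bucket(age + 1)}


def build_bridge(arrests, admissions):
    """Return (arrest_id, admission_id) pairs that are unique on both sides."""
    # Block admissions on the exact-match fields: date, sex, charge.
    index = defaultdict(list)
    for adm in admissions:
        index[(adm["date"], adm["sex"], adm["charge"])].append(adm)

    pairs = []
    for arr in arrests:
        for adm in index.get((arr["date"], arr["sex"], arr["charge"]), []):
            if arr["age_group"] in compatible_buckets(adm["birth_year"], arr["date"].year):
                pairs.append((arr["arrest_id"], adm["admission_id"]))

    # Keep a pair only if neither side appears in any other candidate pair.
    arrest_counts = Counter(a for a, _ in pairs)
    admission_counts = Counter(d for _, d in pairs)
    return [
        (a, d) for a, d in pairs
        if arrest_counts[a] == 1 and admission_counts[d] == 1
    ]
```

Dropping every pair that is ambiguous on either side is what trades coverage for precision: two arrests competing for one admission both fall out rather than one being picked arbitrarily.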
Non-Claims
What the site does not assert
These are not small footnotes. They define the ceiling of what a public-only NYC criminal-justice explorer can truthfully say.
- A DOC person page is an exact jail-stage history keyed by INMATEID, not a citywide criminal-justice identity.
- The arrest-to-DOC bridge is a candidate subset designed for precision, not a full arrest coverage layer.
- Court outcomes are not linked in the current web app because the public court extracts do not expose a cross-case public person key.
- State prison, parole, and statewide multi-arrest recidivism remain unsupported in public bulk data.
- DOC race and charge fields are not strong enough to support broad cross-system identity claims on their own.
Broader Repo Outputs
Additional datasets in the workspace
The repo builds more than the live explorer currently exposes. Those outputs stay labeled by their own confidence level.
nypd_arrests_<year>_research_dataset.parquet
Filter arrests by year, join to the year-specific complaint subset on date + precinct + offense code + borough, then tighten with demographics.
public_event_spine_<year>_census_geo.parquet
Stack yearly arrests, complaints, summonses, DOC admissions, and DOC discharges into one schema, then geocode unique coordinates with a cache.
public_event_panel_<start>_<end>_census_geo.parquet
Concatenate yearly geocoded event spines into a longitudinal public event panel.
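The "geocode unique coordinates with a cache" step mentioned for the event spine can be pictured as a deduplicate-then-look-up pattern. The sketch below assumes a caller-supplied lookup function standing in for whatever geocoding request the repo makes; the function name geocode_unique, the rounding precision, and the dict-based cache are all hypothetical.

```python
def geocode_unique(points, cache, lookup, precision=6):
    """Geocode each distinct (lat, lon) once and reuse cached results.

    points: iterable of (lat, lon) floats, possibly with many duplicates.
    cache: dict mapping rounded (lat, lon) to a geocode result; mutated in place.
    lookup: callable(lat, lon) -> result; a stand-in for the real geocoder call.
    """
    # Round so that near-identical floats collapse to one cache key.
    unique = {(round(lat, precision), round(lon, precision)) for lat, lon in points}
    for key in sorted(unique):
        if key not in cache:
            cache[key] = lookup(*key)
    return cache
```

Event data tends to repeat the same coordinates many times, so the number of geocoder calls scales with distinct locations not yet cached rather than with event rows.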
Live Build Notes
Numbers on this page
These values are pulled from current derived outputs so the methods copy stays in sync with the latest local build.
Exact DOC layer: 196,713 people, 447,391 jail episodes, max 54 admissions for a single person, and a 32.5% 1-year return rate among observable cohorts.
The broader event panel is optional at runtime, so this page does not require those larger files to be deployed.
The website itself still reads only the narrow set of current app Parquet outputs. The broader repo outputs are documented here so the methodology stays honest about what exists versus what is actually surfaced.