Messy Dataset Cleaning Protocol Designer
Produces a reproducible, column-by-column cleaning plan with validation checks.
You are a data-quality specialist designing a repeatable cleaning protocol.
Context: My dataset [DATASET_NAME] has [ROW_COUNT] rows and these columns with issues: [COLUMN_LIST_WITH_PROBLEMS, e.g. dates in mixed formats, inconsistent country names, trailing whitespace, duplicate IDs]. Target tool is [TOOL: Python pandas/SQL/Power Query].
Task:
1. For each problem column, state the detection rule, the cleaning rule, and the edge case most likely to bite.
2. Define the canonical format each column should end in (types, casing, units, timezone).
3. Specify a deduplication strategy and which record to keep when duplicates conflict.
4. List 5 post-cleaning validation checks that must pass before the data is trusted.
Constraints: prioritize reversible, logged transformations; never silently drop rows without flagging them; call out any step that needs a human decision.
Output format: a markdown table (Column | Issue | Detection | Fix | Edge case), then a Validation Checklist, then tool-specific code stubs.
My columns and problems:
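
Example of the kind of code stub this prompt asks for (illustrative only, not part of the prompt text). It is a minimal pandas sketch of the constraints above: raw values are kept so each step is reversible, problem rows are flagged rather than dropped, and values that need a human decision are surfaced in their own columns. The column names (`id`, `country`, `signup_date`), the `COUNTRY_MAP` lookup, and the `DATE_FORMATS` list are all hypothetical placeholders, not something the prompt prescribes.

```python
import pandas as pd

# Hypothetical lookup; a real protocol would keep this in a reviewed reference table.
COUNTRY_MAP = {
    "usa": "United States",
    "uk": "United Kingdom",
    "germany": "Germany",
}
# Assumed date formats, tried in order. Ambiguous values such as 03/04/2024
# cannot be resolved by code alone and need a human decision on day/month order.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy with *_raw and flag columns; no rows are dropped."""
    out = df.copy()

    # Keep the raw values so every transformation can be reversed or audited.
    for col in ["country", "signup_date"]:
        out[f"{col}_raw"] = out[col]

    # Whitespace: strip leading and trailing spaces.
    out["country"] = out["country"].str.strip()

    # Inconsistent country names: map known variants, flag the rest for review.
    key = out["country"].str.lower()
    out["country_unmapped"] = ~key.isin(list(COUNTRY_MAP)) & key.notna()
    out["country"] = key.map(COUNTRY_MAP).fillna(out["country"])

    # Mixed date formats: try each format in turn, flag whatever still fails.
    raw_dates = out["signup_date"].str.strip()
    parsed = pd.Series(pd.NaT, index=out.index, dtype="datetime64[ns]")
    for fmt in DATE_FORMATS:
        parsed = parsed.fillna(pd.to_datetime(raw_dates, format=fmt, errors="coerce"))
    out["signup_date"] = parsed
    out["date_unparsed"] = parsed.isna() & raw_dates.notna()

    # Duplicate IDs: flag every occurrence after the first instead of dropping it.
    out["is_duplicate_id"] = out["id"].duplicated(keep="first")
    return out
```

The flag columns (`country_unmapped`, `date_unparsed`, `is_duplicate_id`) double as inputs to the validation checklist: a check such as "no unparsed dates remain" becomes a single count over a boolean column.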