Messy Dataset Cleaning Protocol Designer

Data & Spreadsheets
#cleaning #data-quality #etl

Produces a reproducible, column-by-column cleaning plan with validation checks.

You are a data-quality specialist designing a repeatable cleaning protocol.

Context: My dataset [DATASET_NAME] has [ROW_COUNT] rows and these columns with issues: [COLUMN_LIST_WITH_PROBLEMS, e.g. dates in mixed formats, inconsistent country names, trailing whitespace, duplicate IDs]. Target tool is [TOOL: Python pandas/SQL/Power Query].

Task:
1. For each problem column, state the detection rule, the cleaning rule, and the edge case most likely to bite.
2. Define the canonical format each column should end in (types, casing, units, timezone).
3. Specify a deduplication strategy and which record to keep when duplicates conflict.
4. List 5 post-cleaning validation checks that must pass before the data is trusted.

Constraints: prioritize reversible, logged transformations; never silently drop rows without flagging them; call out any step that needs a human decision.

Output format: a markdown table (Column | Issue | Detection | Fix | Edge case), then a Validation Checklist, then tool-specific code stubs.

My columns and problems:
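
For orientation, here is a minimal sketch of the kind of code stub the prompt asks for, showing logged, non-destructive cleaning. It uses only the Python standard library rather than pandas so it runs anywhere. The column names (`id`, `signup_date`), the list of date formats, and the helper names `parse_date` and `clean_rows` are all hypothetical, chosen for illustration; a real protocol would derive them from the detection step.

```python
from datetime import datetime

# Hypothetical format list; the real one comes from the detection step.
# Note: "03/04/2024" is ambiguous (day-first vs month-first) and needs a
# human decision. This sketch assumes day-first.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def parse_date(raw):
    """Return an ISO 8601 date string, or None if no known format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_rows(rows):
    """Clean every row without dropping any; return (cleaned_rows, change_log)."""
    cleaned, log, first_seen = [], [], {}
    for i, row in enumerate(rows):
        out = dict(row, flags=[])

        # Trim whitespace on the ID and log the change so it can be reversed.
        out["id"] = row["id"].strip()
        if out["id"] != row["id"]:
            log.append((i, "id", row["id"], out["id"]))

        # Normalize the date; flag (never drop) rows that cannot be parsed.
        iso = parse_date(row["signup_date"])
        if iso is None:
            out["flags"].append("unparsed_date")
        else:
            out["signup_date"] = iso
            if iso != row["signup_date"]:
                log.append((i, "signup_date", row["signup_date"], iso))

        # Flag duplicates by cleaned ID; which record to keep is decided later.
        if out["id"] in first_seen:
            out["flags"].append(f"duplicate_of_row_{first_seen[out['id']]}")
        else:
            first_seen[out["id"]] = i

        cleaned.append(out)
    return cleaned, log


# Made-up sample rows, for illustration only.
rows = [
    {"id": "A1 ", "signup_date": "2024-01-05"},
    {"id": "A1", "signup_date": "05/01/2024"},
    {"id": "B2", "signup_date": "sometime in May"},
]
cleaned, log = clean_rows(rows)
```

Each log entry records the row index, column, old value, and new value, so every transformation can be audited or undone, and flagged rows stay in the output for a human to review.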