Data Wrangling (Preparation)



Overview



• One purpose of data wrangling is to keep the data formats consistent so that it can be uploaded to database or other programs smoothly

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Folder Strcture



• Create a Template Folder and name it !YYYYMMDD_<Project Name>
• Inside the Template folder, include the following sub-folders
◇ 1. Original Data
◇ 2. Prepared Data
◇ 3. Uploaded Data
◇ 4. Analysis
◇ 5. Insights
◇ 6. Final

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

How Excel Can Mess Up Your Data



• Excel can mess up your Dates and Numeric formats if your locale configuartion is not the default settings
• Example: a CSV file with dates and credit card number (16-digits)
• After you open the CSV file in Excel and exit without saving it, if you open the same file with text editor again, notice the following changes were made by Excel implicitly
Date format: Changed from American format to European format as per Regional settings
Long Integer: Changed to Scienific Display format

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Two Things You HAVE to Do Before Uploading Data



1) Fix Up the Dates in CSV Files



• Accounts for different date formats
• In Excel, right click the column > Format Cells > choose / create “yyyy-mm-dd”
• After the fix, if you open the file in Excel, notice that the dates will be displayed according to your Regional Setting so you may not see the dates displayed in “yyyy-mm-dd”
• The proper way to check the result is to open the file in your favourite text editor

2) Fix Up the Amounts in CSV Files



• To get rid of the dollar signs ($), commas (,) and make sure that decimals places are represented by dots (.)
◇ In Excel, right click the column > Format Cells > Number
◇ Make sure that “Use 1000 separator” is unchecked
◇ Some country settings will represent decimals as comma

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Quick Check List



- Amounts are without $ sign and comma
- Dates are in “YYYY-MM-DD” format
- Long integers like credit card numbers can be seen with all digits (no scientific notation)
- No hash tags on long text


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Bulletproof Blueprint for Data Wrangling



• Ideal for database upload. Follow the steps below:

1. Copy file <file_name>.csv from folder “1. Original Data” to folder “2. Prepared Data”

2. In folder “2. Prepared Data”, change the file extension from .csv to .txt

3. Open <file_name>.txt in Excel

4. Use File > Open as this will bring up the Import Wizard

5. In the Import Wizard, Select all columns, choose “Text” as data format and click “Finish”

6. Fix Date formatting: highlight the column with dates, select Data > Text to Columns
◇ In the Convert Text to Column Wizard
6-1: Choose “Delimited” and click “Next”
6-2: De-select all delimiters and click “Next”
6-3: Choose the Date format and click “Finish”
◇ Select the column, right click and choose “Format Cells...”. Go to Custom and specify the format we need “YYYY-MM-DD”
◇ Do the same for other columns with dates

7. Fix Number formatting: highlight column with numbers, select Data > Text to Column
◇ In the Convert Text to Columns Wizard, perform the same thing in steps 1 and 2. In step 3, select “General” and click “Finish”
◇ Highlight column with numbers, right click and choose “Format Cells...”, choose “Number”, 2 decimal places, no thousand separator and use “-” for negative numbers and then click “OK”
◇ Do the same for other columns with numbers

8. Fix Long Text (> 256 characters)
• If the text is too long (> 256 characters), the text is replaced by hash tag
• To fix this, select the column, right click and select “Format cells...” . Select “General” category and click OK
• Do the same for other columns with long text

9. Save the File as CSV file

10. The file is ready to upload

Index