Date

How to work with numbers beginning with zero in pandas

If you have tried to work with geographic codes like ZIP, FIPS etc. in python pandas, you might have experienced a problem. The default way of loading data into pandas removes leading zeros (0, nil, naught, nought, what ever you prefer) in numbers. This behavior essentially corrupts our data.

Other data sets where this problem might occur include: phone numbers, social security numbers, bank details or any kind of product/account/user ID.

In this short how-to we will show how to properly load ZIP codes into a DataFrame. Of course everything applies to other numbers with significant leading zeros.

The solution is to load the ZIP code data as strings instead of as integers.

The ZIP code problem

Assume we have a small file consisting of three columns (ZIP, name and some kind of amount):

In [13]:
!head zip/sample.csv
zip,name,amount
01221,"Alice Alicedottir",3000.33
12334,"Bob Bobson",2000
00123,"Claire Clairette",1000.50

If we load this file through the DataFrame convenience method we get the following result:

In [14]:
import pandas
bad_df = pandas.DataFrame.from_csv("zip/sample.csv", index_col=None)
bad_df
Out[14]:
zip name amount
0 1221 Alice Alicedottir 3000.33
1 12334 Bob Bobson 2000.00
2 123 Claire Clairette 1000.50

You can see that the ZIP of Alice which was originally 01221 has been converted to 1221. This happens because pandas tries to infer the dtype of a column automatically.

Actual numbers can never start with a zero:

$0100 should be converted to $100. pandas is trying do do us a favor by converting our data into well formed numbers.

A short note on dtypes

(skip to the next headline if you just want the solution)

dtype stands for data-type and is numpy's term for the type of a value. We can see that pandas assumed that the ZIP column is a numerical value and picked the numpy.int64 dtype

In [15]:
bad_df['zip'].dtype
Out[15]:
dtype('int64')

int stands for integer, ie. whole numbers. 64 indicates that the variable is internally stored with 64 bits. The amount of bits determines the maximum value a variable can take. In the case of int64 this is -9223372036854775808 to 9223372036854775807. Usually you can forget about this technical detail and just remember that pandas/numpy call whole numbers int64.

The problem is that ZIP codes are not numerical values at all. They are actually categorical data which just look like numbers.

An easy way to determine if data is numerical is to ask you self whether you could calculate a distance (ie. subtraction) between two points. It doesn't make any sense to subtract two ZIP codes.

As our ZIP codes are categorical, we want to treat them as strings. 'string' is short for 'string of characters', and the programmer's way of saying 'text'. In python (and pandas/numpy) string are abbreviated with str.

The right way to load a dataset with ZIP codes into a DataFrame

In [16]:
df_zip_proper = pandas.io.parsers.read_csv('zip/sample.csv', dtype={'zip': 'str'})
df_zip_proper
Out[16]:
zip name amount
0 01221 Alice Alicedottir 3000.33
1 12334 Bob Bobson 2000.00
2 00123 Claire Clairette 1000.50

Our leading zeros have been preserved!

We called the csv parser directly and not through the convenience method attached to DataFrame. Thus we gained access to the dtype argument.

By passing a dictionary that maps from column-names to dtypes, we can explicitly tell pandas to use a certain type for a column.

As explained above we chose to handle our ZIP codes as textual data (dtype str)

Advanced functionality for categorical data

Version 0.15 of pandas introduced a dtype called Categorical:

In [17]:
df_zip_proper = pandas.io.parsers.read_csv('zip/sample.csv', dtype={'zip': pandas.Categorical})
df_zip_proper
Out[17]:
zip name amount
0 01221 Alice Alicedottir 3000.33
1 12334 Bob Bobson 2000.00
2 00123 Claire Clairette 1000.50

The Categorical dtype is similar to a factor in R and allows for ordering of data even though an exact numerical distance is not defined. See the pandas documentation on categorical data for more information

In []: