How to work with numbers beginning with zero in pandas¶
If you have tried to work with geographic codes like ZIP, FIPS etc. in python pandas
, you might have experienced a problem. The default way of loading data into pandas
removes leading zeros (0
, nil, naught, nought, what ever you prefer) in numbers. This behavior essentially corrupts our data.
Other data sets where this problem might occur include: phone numbers, social security numbers, bank details or any kind of product/account/user ID.
In this short how-to we will show how to properly load ZIP codes into a DataFrame
. Of course everything applies to other numbers with significant leading zeros.
The solution is to load the ZIP code data as strings instead of as integers.
The ZIP code problem¶
Assume we have a small file consisting of three columns (ZIP, name and some kind of amount):
!head zip/sample.csv
If we load this file through the DataFrame
convenience method we get the following result:
import pandas
bad_df = pandas.DataFrame.from_csv("zip/sample.csv", index_col=None)
bad_df
You can see that the ZIP of Alice which was originally 01221
has been converted to 1221
. This happens because pandas
tries to infer the dtype of a column automatically.
Actual numbers can never start with a zero:
$0100
should be converted to $100
. pandas
is trying do do us a favor by converting our data into well formed numbers.
A short note on dtypes
¶
(skip to the next headline if you just want the solution)
dtype stands for data-type and is numpy
's term for the type of a value.
We can see that pandas
assumed that the ZIP column is a numerical value and picked the numpy.int64
dtype
bad_df['zip'].dtype
int
stands for integer, ie. whole numbers. 64
indicates that the variable is internally stored with 64 bits.
The amount of bits determines the maximum value a variable can take. In the case of int64
this is -9223372036854775808 to 9223372036854775807. Usually you can forget about this technical detail and just remember that pandas
/numpy
call whole numbers int64
.
The problem is that ZIP codes are not numerical values at all. They are actually categorical data which just look like numbers.
An easy way to determine if data is numerical is to ask you self whether you could calculate a distance (ie. subtraction) between two points. It doesn't make any sense to subtract two ZIP codes.
As our ZIP codes are categorical, we want to treat them as strings. 'string' is short for 'string of characters', and the programmer's way of saying 'text'. In python
(and pandas
/numpy
) string are abbreviated with str
.
The right way to load a dataset with ZIP codes into a DataFrame¶
df_zip_proper = pandas.io.parsers.read_csv('zip/sample.csv', dtype={'zip': 'str'})
df_zip_proper
Our leading zeros have been preserved!
We called the csv parser directly and not through the convenience method attached to DataFrame
. Thus we gained access to the dtype
argument.
By passing a dictionary that maps from column-names to dtype
s, we can explicitly tell pandas
to use a certain type for a column.
As explained above we chose to handle our ZIP codes as textual data (dtype
str
)
Advanced functionality for categorical data¶
Version 0.15
of pandas
introduced a dtype
called Categorical
:
df_zip_proper = pandas.io.parsers.read_csv('zip/sample.csv', dtype={'zip': pandas.Categorical})
df_zip_proper
The Categorical
dtype
is similar to a factor
in R and allows for ordering of data even though an exact numerical distance is not defined. See the pandas
documentation on categorical data for more information