# How to work with numbers beginning with zero in pandas¶

If you have tried to work with geographic codes like ZIP, FIPS etc. in python `pandas`

, you might have experienced a problem. The default way of loading data into `pandas`

removes leading zeros (`0`

, *nil*, *naught*, *nought*, what ever you prefer) in numbers. This behavior essentially corrupts our data.

Other data sets where this problem might occur include: phone numbers, social security numbers, bank details or any kind of product/account/user ID.

In this short how-to we will show how to properly load ZIP codes into a `DataFrame`

. Of course everything applies to other numbers with significant leading zeros.

The solution is to load the ZIP code data as strings instead of as integers.

## The ZIP code problem¶

Assume we have a small file consisting of three columns (ZIP, name and some kind of amount):

```
!head zip/sample.csv
```

If we load this file through the `DataFrame`

convenience method we get the following result:

```
import pandas
bad_df = pandas.DataFrame.from_csv("zip/sample.csv", index_col=None)
bad_df
```

You can see that the ZIP of Alice which was originally `01221`

has been converted to `1221`

. This happens because `pandas`

tries to infer the *dtype* of a column automatically.

Actual numbers can never start with a zero:

$`0100`

*should* be converted to $`100`

. `pandas`

is trying do do us a favor by converting our data into well formed numbers.

## A short note on `dtypes`

¶

(skip to the next headline if you just want the solution)

dtype stands for data-type and is `numpy`

's term for the type of a value.
We can see that `pandas`

assumed that the ZIP column is a numerical value and picked the `numpy.int64`

`dtype`

```
bad_df['zip'].dtype
```

`int`

stands for integer, ie. whole numbers. `64`

indicates that the variable is internally stored with 64 bits.
The amount of bits determines the maximum value a variable can take. In the case of `int64`

this is -9223372036854775808 to 9223372036854775807. Usually you can forget about this technical detail and just remember that `pandas`

/`numpy`

call whole numbers `int64`

.

The problem is that ZIP codes are not numerical values at all. They are actually categorical data which just look like numbers.

An easy way to determine if data is numerical is to ask you self whether you could calculate a distance (ie. subtraction) between two points. It doesn't make any sense to subtract two ZIP codes.

As our ZIP codes are categorical, we want to treat them as strings. 'string' is short for 'string of characters', and the programmer's way of saying 'text'. In `python`

(and `pandas`

/`numpy`

) string are abbreviated with `str`

.

## The right way to load a dataset with ZIP codes into a DataFrame¶

```
df_zip_proper = pandas.io.parsers.read_csv('zip/sample.csv', dtype={'zip': 'str'})
df_zip_proper
```

Our leading zeros have been preserved!

We called the csv parser directly and not through the convenience method attached to `DataFrame`

. Thus we gained access to the `dtype`

argument.

By passing a dictionary that maps from column-names to `dtype`

s, we can explicitly tell `pandas`

to use a certain type for a column.

As explained above we chose to handle our ZIP codes as textual data (`dtype`

`str`

)

## Advanced functionality for categorical data¶

Version `0.15`

of `pandas`

introduced a `dtype`

called `Categorical`

:

```
df_zip_proper = pandas.io.parsers.read_csv('zip/sample.csv', dtype={'zip': pandas.Categorical})
df_zip_proper
```

The `Categorical`

`dtype`

is similar to a `factor`

in R and allows for ordering of data even though an exact numerical distance is not defined. See the `pandas`

documentation on categorical data for more information