Pandas: Need to knows

If you are using Python in the postgraduate research space, then chances are you will need pandas. There are many resources and tutorials out there that will give you an introduction to pandas, or even time series analysis with pandas. There are, however, things you should know how to do in pandas that no one tells you about.

Display all rows/columns of the data frame

In most cases, your data frame (df) will contain thousands of rows and in most consoles only a contracted part of your df will display.

You can display all rows:

import pandas as pd
pd.set_option('display.max_rows', None)

Similarly, you can display all columns:

pd.set_option('display.max_columns', None)

Remember, your console may also limit the number of lines buffered. Spyder IDE (via Anaconda distribution) has a default configuration of 10,000 lines. You can change this setting in Tools > Preferences > IPython console > Source code > Buffer as shown below. Restart Spyder to apply changes.

Setting the buffer size in Spyder.

You can also display a specific number of rows or columns, the justification of headers and several other display settings for pandas. See the full list here.

Fill missing index datetimes

In general, working with datetimes is a beast on its own. We’ll later get into time zones, curve fits of datetimes and plotting options. However, the first thing is to ensure that missing time steps are filled.

Here is the code using minute averaged solar radiation data for 2019 from SAURAN for Stellenbosch University, and we’ll unpack after:

import pandas as pd
import datetime as dt
dateparse = lambda x: dt.datetime.strptime(x, '%d/%m/%Y %H:%M:%S')
df = pd.read_csv('Data/SUN_Minute.csv', skiprows=1, parse_dates=['TmStamp'], index_col='TmStamp', date_parser=dateparse)
print(df.shape)
r = pd.date_range(start='2019-01-01 00:00', end='2019-12-31 23:59', freq='min')
df = df.reindex(r).reset_index().set_index('index')
print(df.shape)
  • Line #1: we simply import our pandas library.
  • Line #2: we import the Python standard datetime class since strptime() from the pandas class, as used in line #3, was recently deprecated.
  • Line #3: we indicate the format of the date and time in the solar radiation data file that we are reading in line #4.
  • Line #4: The file imported is called SUN_Minute.csv and is located in a folder Data. Set ‘TmStamp’ as the index column.
  • Line #5: output: (508503, 15) The df has 508503 rows and 15 columns.
  • Line #6: Since in this example we have a single CSV file containing an entire year we can use date_range() to indicate what we expect the index should contain and at what frequency the discrete measurement points are.
  • Line #7: Without .set_index('index') the df will have no index.
  • Line #8: output: (525600, 15) The df now has 525600 rows and still 15 columns.

So the original output before the filling the missing time steps is:

2019-06-29 13:03:00  1464016     571.038100  ...   9.722885  2.161459
2019-06-29 13:04:00 1464017 466.012900 ... 9.345224 2.058094
2019-06-29 13:05:00 1464018 784.429500 ... 10.609600 2.433831
2019-07-05 01:05:00 1471922 0.000000 ... 0.000000 0.012182
2019-07-05 01:06:00 1471923 0.000000 ... 0.000000 0.012182

The resulting df now have a datetime index for every minute within the year range but all other missing values were set to NaN (Not a Number):

2019-06-29 13:03:00 1464016.0 571.038100 … 9.722885 2.161459
2019-06-29 13:04:00 1464017.0 466.012900 … 9.345224 2.058094
2019-06-29 13:05:00 1464018.0 784.429500 … 10.609600 2.433831
2019-06-29 13:06:00 NaN NaN … NaN NaN
2019-06-29 13:07:00 NaN NaN … NaN NaN
2019-06-29 13:08:00 NaN NaN … NaN NaN

The missing data could also have been replaced with zeros (or other values) by adding fill_value in Line #7: .reindex(r, fill_value=0).

Avoid Pythonic loops

Pandas is an extremely powerful tool however, misuse will cause frustration and excessive code execution times. Applying the usual Python loops is possibly contributing to your misery.

There are many optimised solutions to looping through a df and here is a brilliant guide and here are some more options. In summary:

  • Loop with iterrows(): less efficient and therefore slower than the other options but feels quite Pythonic, especially if you are newer to Python.
  • Loop with itertuples(): very similar to the previous option but faster.
  • Loop with apply(): usually more efficient than the previous options and a useful solution when grouping is required.
  • Vectorisation with pandas or numpy: both are faster than the previous options, with vectorisation with numpy often the fastest solution.

The speed of these solutions are dependant on the type of function executed “within” the loop, however, vectorisation seem to be the fastest solution in general. I have often found that my choice is more based on the function, i.e. do I need to group or resample before looping. Most important though, avoid Pythonic looping.

Let us know if you have any other need-to-knows for pandas so that we all may conquer our unoptimised pandas code.

Leave a Reply

Your email address will not be published. Required fields are marked *