API Reference

pandahelper.profiles

Panda-Helper data profiles.

DataFrameProfile

DataFrameProfile(df, *, name='', fmt='simple')

Pandas DataFrame data profile.

Prepare data profile of Pandas DataFrame that can be displayed or saved.

Attributes:

  • name (str) –

    Name of DataFrame profile if provided. Default value is "".

  • shape (tuple) –

    Shape of the DataFrame as (rows, columns).

  • dtypes (pandas.Series) –

    Data types of DataFrame index and Series in DataFrame.

  • memory_usage (pandas.Series) –

    Memory usage (MB) of index and Series in DataFrame.

  • num_duplicates (int) –

    Number of duplicated rows.

  • nulls_per_row (pandas.Series) –

    Count of null values per row.

  • null_stats (list) –

    Distribution statistics on nulls per row.

  • time_diffs (pandas.Series) –

    Time diffs (gaps) if the DataFrame has a DatetimeIndex.

Parameters:

  • df (pandas.DataFrame) –

    DataFrame to profile.

  • name (str, default: '' ) –

    Name to assign to profile.

  • fmt (str, default: 'simple' ) –

    Printed table format. See https://github.com/astanin/python-tabulate for options.

Raises:

  • TypeError

    If input is not a Pandas DataFrame.
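
Several of the attributes listed above correspond to quantities that plain pandas can compute directly. The following is a minimal sketch of such calculations, not the library's implementation. The sample DataFrame, the variable names, and the choice of 1 MB = 1,000,000 bytes are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data: one repeated row and a few missing values.
df = pd.DataFrame(
    {
        "city": ["Springfield", "Quahog", "Springfield", None],
        "temp": [21.5, None, 21.5, None],
    }
)

# Rows that repeat an earlier row exactly.
num_duplicates = int(df.duplicated(keep="first").sum())

# Count of null values in each row.
nulls_per_row = df.isna().sum(axis=1)

# Memory usage of the index and each column, converted to MB
# (assumption: decimal megabytes).
memory_mb = df.memory_usage(index=True, deep=True) / 1_000_000

print(df.shape, num_duplicates, nulls_per_row.tolist())
```

Whether DataFrameProfile counts duplicates or converts memory units in exactly this way is not stated in this reference, so treat the numbers as illustrative only.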

save

save(path)

Save profile to provided path.

Parameters:

  • path (str) –

    Where to save profile.

SeriesProfile

SeriesProfile(series, *, fmt='simple', freq_most_least=(10, 5), time_index=False)

Pandas Series data profile.

Prepare data profile of Pandas Series that can be displayed or saved.

Attributes:

  • name (str) –

    Name of Series.

  • dtype (numpy.dtype or Pandas dtype) –

    Data type of the Series.

  • count (int) –

    Count of non-null values.

  • num_unique (int) –

    Number of unique values.

  • num_nulls (int) –

    Number of null values.

  • frequency (pandas.DataFrame) –

    Frequency table with counts and percentage.

  • stats (dict) –

    Distribution statistics for Series.

  • time_diffs (pandas.Series) –

    Time diffs (gaps) if the Series is of type datetime64. Alternatively, time diffs of the Series' DatetimeIndex if the time_index parameter was set to True when creating the SeriesProfile.

Parameters:

  • series (pandas.Series) –

    Pandas Series to profile.

  • fmt (str, default: 'simple' ) –

    Printed table format. See https://github.com/astanin/python-tabulate for options.

  • freq_most_least (tuple, default: (10, 5) ) –

    Tuple (x, y) of the x most common and y least common values to display in the frequency table.

  • time_index (bool, default: False ) –

    Whether to use the index for calculating time diffs of a time-indexed Pandas Series. Not relevant for Series that are not time-related.

Raises:

  • TypeError

    If input is not a Pandas Series.
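
To make the count attributes and the freq_most_least parameter more concrete, here is a hedged sketch in plain pandas. It is not taken from the library. The sample Series and the names most, least, top, and bottom are hypothetical, and whether SeriesProfile treats nulls the same way is an assumption.

```python
import pandas as pd

# Hypothetical sample: four distinct values plus one null.
series = pd.Series(["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d"] + [None])

count = int(series.count())           # non-null values
num_nulls = int(series.isna().sum())  # null values
num_unique = int(series.nunique())    # distinct non-null values

# value_counts() sorts from most to least common and skips nulls by default.
counts = series.value_counts()

# Analogue of freq_most_least=(2, 1): keep the 2 most common
# and the 1 least common values.
most, least = 2, 1
top = counts.head(most)
bottom = counts.tail(least)

print(count, num_nulls, num_unique)
print(top.index.tolist(), bottom.index.tolist())
```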

save

save(path)

Save profile to provided path.

Parameters:

  • path (str) –

    Where to save profile.


pandahelper.stats

Panda-Helper statistics functions.

distribution_stats

distribution_stats(series)

Return single-column Pandas DataFrame of distribution statistics.

Parameters:

  • series (pandas.Series) –

    Pandas Series used to calculate distribution statistics. Distribution statistics will depend on series dtype. Supported dtypes are:

    • int64
    • float64
    • bool
    • complex128
    • datetime64
    • timedelta64
    • period[]
    • interval

Returns:

  • pandas.DataFrame

    Single column of calculated values with statistics as index.

Raises:

  • TypeError

    If input is not a numeric-like Pandas Series.

Examples:

Distribution stats for Pandas Series of type float64:

>>> from random import seed, gauss, expovariate
>>> import pandahelper as ph
>>> import pandas as pd
>>>
>>> seed(314)
>>> series = pd.Series([gauss(mu=30, sigma=20) for x in range(200)])
>>> ph.distribution_stats(series)
                               Statistic Value
    count                           200.000000
    min                             -23.643007
    1%                              -11.918955
    5%                                2.833604
    25%                              17.553793
    50%                              31.420759
    75%                              42.074998
    95%                              60.305435
    99%                              72.028633
    max                              81.547828
    mean                             30.580535
    standard deviation               18.277706
    median                           31.420759
    median absolute deviation        12.216607
    skew                             -0.020083

Distribution stats for Pandas Series of type datetime64:

>>> start = pd.Timestamp(2000, 1, 1)
>>> tds = [pd.Timedelta(hours=int(expovariate(lambd=.003))) for x in range(200)]
>>> times = [start + td for td in tds]
>>> series = pd.Series(times)
>>> ph.distribution_stats(series)
                               Statistic Value
count                                      200
min                        2000-01-01 00:00:00
1%                         2000-01-01 01:59:24
5%                         2000-01-01 09:00:00
25%                        2000-01-04 08:00:00
50%                        2000-01-08 04:30:00
75%                        2000-01-16 21:00:00
95%                        2000-02-08 01:36:00
99%                        2000-02-22 10:20:24
max                        2000-04-01 17:00:00
mean                       2000-01-12 14:24:18
standard deviation  12 days 16:47:15.284423042
median                     2000-01-08 04:30:00
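
For readers unfamiliar with some of the statistics in these tables, the sketch below computes a few of them by hand with plain pandas. It only illustrates the definitions and is not how distribution_stats is implemented. The sample values are hypothetical, and the median absolute deviation shown is the unscaled form, which is an assumption about what the table reports.

```python
import pandas as pd

# Hypothetical sample with one large outlier.
series = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

median = series.median()

# Unscaled median absolute deviation: the median of the absolute
# distances from the median. Less sensitive to outliers than std.
mad = (series - median).abs().median()

# Percentiles like those in the tables above (linear interpolation).
quantiles = series.quantile([0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])

print(median, mad)
print(quantiles)
```

The outlier of 100.0 pulls the mean far from the median, while the median absolute deviation stays small, which is why it is useful alongside the standard deviation.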

frequency_table

frequency_table(series)

Return value counts and relative frequency.

Parameters:

  • series (pandas.Series) –

    Pandas Series used to calculate value counts and relative frequencies.

Returns:

  • pandas.DataFrame

    Pandas DataFrame of value counts and percentages indexed by value.

Raises:

  • TypeError

    If input is not a Pandas Series.

Examples:

>>> import random
>>> import pandahelper as ph
>>> import pandas as pd
>>>
>>> random.seed(314)
>>> cities = ["Springfield", "Quahog", "Philadelphia", "Shelbyville"]
>>> series = pd.Series(random.choices(cities, k = 200))
>>> ph.frequency_table(series)
                  Count % of Total
    Springfield      66     33.00%
    Quahog           51     25.50%
    Philadelphia     44     22.00%
    Shelbyville      39     19.50%
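
A table of this shape can be approximated with plain pandas, as sketched below. This is an illustration under assumptions, not the library's code. The sample data is hypothetical, the column labels are copied from the output above, and the handling of null values may differ from frequency_table.

```python
import pandas as pd

# Hypothetical sample data.
series = pd.Series(["A", "B", "A", "C", "A", "B", "A", "A"])

# Counts per value, sorted from most to least common.
counts = series.value_counts()

# Share of the total, formatted as a percentage string.
percent = (counts / counts.sum()).map("{:.2%}".format)

table = pd.DataFrame({"Count": counts, "% of Total": percent})
print(table)
```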


pandahelper.times

Panda-Helper time-series functions.

category_gaps

category_gaps(series, threshold, max_cat=50)

Calculate sum of gaps for each category in time-indexed Series.

Gaps are time differences in excess of the expected time increment (threshold). The gap for each category is measured relative to the minimum and maximum times in the whole Series. Intended for use with categorical-like Series.

Parameters:

  • series (pandas.Series) –

    Categorical-like Series.

  • threshold (pandas.Timedelta) –

    Threshold for the time difference to be considered a gap. For hourly data, threshold should be pd.Timedelta(hours=1).

  • max_cat (int, default: 50 ) –

    Maximum number of categories (unique values) before issuing a warning and returning None.

Returns:

  • pandas.DataFrame or None

    Cumulative gap for each category, indexed by category name. Returns None if the number of categories exceeds max_cat.

Warns:

  • UserWarning

    If the number of categories (unique values) in the series exceeds max_cat.

Examples:

>>> import pandahelper as ph
>>> import pandas as pd
>>>
>>> start = pd.Timestamp(year=1999, month=1, day=1)
>>> a = pd.Series(["A"] * 30, index=pd.date_range(start, periods=30, freq="D"))
>>> b = pd.Series(["B"] * 15, index=pd.date_range(start, periods=15, freq="2D"))
>>> c = pd.Series(["C"] * 10, index=pd.date_range(start, periods=10, freq="D"))
>>> ph.category_gaps(pd.concat([a, b, c]), threshold=pd.Timedelta(days=1))
              Cumulative Gap
    C        20 days
    B        15 days
    A         0 days
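
One way to read the figures in this example: the whole Series spans 30 daily slots, so a category observed on only 10 of them is missing 20 days. The sketch below reproduces the example's numbers with that reading in plain pandas. It is an interpretation, not the implementation of category_gaps, and it assumes each category has at most one observation per slot.

```python
import pandas as pd

start = pd.Timestamp(year=1999, month=1, day=1)
a = pd.Series(["A"] * 30, index=pd.date_range(start, periods=30, freq="D"))
b = pd.Series(["B"] * 15, index=pd.date_range(start, periods=15, freq="2D"))
c = pd.Series(["C"] * 10, index=pd.date_range(start, periods=10, freq="D"))
series = pd.concat([a, b, c])
threshold = pd.Timedelta(days=1)

# Time covered by the whole Series, counted in threshold-sized slots.
span = series.index.max() - series.index.min() + threshold

# Time covered by each category if every observation fills one slot.
covered = series.value_counts() * threshold

# Assumed reading of "cumulative gap": slots a category did not fill.
gaps = (span - covered).sort_values(ascending=False)
print(gaps)
```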

id_gaps

id_gaps(series, threshold)

Identify time gaps above threshold in datetime64 Series or DatetimeIndex.

Sorts input by time before calculating gaps.

Parameters:

  • series (pandas.Series or pandas.DatetimeIndex) –

    datetime64 Series or DatetimeIndex.

  • threshold (pandas.Timedelta) –

    Time differences above this threshold are reported as gaps. Differences at or below it are treated as the expected time increment.

Returns:

  • pandas.DataFrame

    One-column Pandas DataFrame of gaps indexed by when gap was calculated.

Examples:

Identify time gaps in a Series of timestamps with a 2-hour and a 4-hour gap after its order has been randomized:

>>> import pandahelper as ph
>>> import pandas as pd
>>>
>>> start = pd.Timestamp(year=1999, month=1, day=1)
>>> rng = pd.date_range(start, periods=24, freq="1h").delete([3, 8, 9, 10])
>>> series = pd.Series(rng).sample(frac=1, random_state=3)  # randomize order
>>> ph.id_gaps(series, pd.Timedelta(hours=1))
                              diffs
1999-01-01 11:00:00 0 days 04:00:00
1999-01-01 04:00:00 0 days 02:00:00
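
The underlying idea (sort, take differences, keep those above the threshold) can be sketched with plain pandas as follows. This is an illustration, not the body of id_gaps. The data is an hourly range with 03:00 and 08:00 through 10:00 removed, and ordering the result from largest to smallest gap is an assumption based on the output shown above.

```python
import pandas as pd

start = pd.Timestamp(year=1999, month=1, day=1)
rng = pd.date_range(start, periods=24, freq="h").delete([3, 8, 9, 10])
shuffled = pd.Series(rng).sample(frac=1, random_state=3)  # randomize order
threshold = pd.Timedelta(hours=1)

# Sort the timestamps, then index each difference by the later timestamp.
idx = pd.DatetimeIndex(shuffled).sort_values()
diffs = pd.Series(idx, index=idx, name="diffs").diff()

# Keep only differences strictly above the threshold, largest first.
gaps = diffs[diffs > threshold].sort_values(ascending=False).to_frame()
print(gaps)
```

The first difference is NaT because the earliest timestamp has no predecessor, and the comparison with the threshold drops it automatically.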

id_gaps_index

id_gaps_index(df, threshold)

Identify time gaps above threshold in time-indexed Series or DataFrame.

Sorts input by time index before calculating diffs.

Parameters:

  • df (pandas.Series or pandas.DataFrame) –

    Time-indexed Series or DataFrame.

  • threshold (pandas.Timedelta) –

    Time differences above this threshold are reported as gaps. Differences at or below it are treated as the expected time increment.

Returns:

  • pandas.DataFrame

    One-column Pandas DataFrame of gaps indexed by when gap was calculated.

Examples:

Identify time gaps in an hourly, time-indexed DataFrame with a 2-hour and a 4-hour gap after its rows have been randomized:

>>> import pandahelper as ph
>>> import pandas as pd
>>>
>>> start = pd.Timestamp(year=1999, month=1, day=1)
>>> rng = pd.date_range(start, periods=24, freq="1h").delete([3, 8, 9, 10])
>>> # index by time then randomize order
>>> df = pd.DataFrame(range(len(rng)), index=rng).sample(frac=1, random_state=3)
>>> ph.id_gaps_index(df, pd.Timedelta(hours=1))
                              diffs
1999-01-01 11:00:00 0 days 04:00:00
1999-01-01 04:00:00 0 days 02:00:00

time_diffs

time_diffs(series)

Calculate time difference between subsequent observations.

Sorts input by time before calculating diffs.

Parameters:

  • series (pandas.Series or pandas.DatetimeIndex) –

    Pandas Series or DatetimeIndex to calculate time diffs on.

Returns:

  • pandas.Series(pandas.Timedelta)

    Series of diffs (gaps) indexed by the time the diff was calculated.

Raises:

  • TypeError

    If input is not a Series of type datetime64 or a DatetimeIndex.

Examples:

Calculate time differences between observations on Series of timestamps after it has been randomized:

>>> import pandahelper as ph
>>> import pandas as pd
>>>
>>> start = pd.Timestamp(year=1999, month=1, day=1)
>>> rng = pd.date_range(start, periods=10, freq="D").delete([3, 4, 5, 8])
>>> series = pd.Series(rng).sample(frac=1, random_state=3)  # randomize order
>>> ph.time_diffs(series)
1999-01-01      NaT
1999-01-02   1 days
1999-01-03   1 days
1999-01-07   4 days
1999-01-08   1 days
1999-01-10   2 days
Name: diffs, dtype: timedelta64[ns]

time_diffs_index

time_diffs_index(df)

Calculate time difference between subsequent time-indexed observations.

Sorts input by time index before calculating diffs.

Parameters:

  • df (pandas.Series or pandas.DataFrame) –

    Pandas Series or DataFrame with a DatetimeIndex to calculate time diffs on.

Returns:

  • pandas.Series(pandas.Timedelta)

    Series of diffs (gaps) indexed by the time the diff was calculated.

Raises:

  • TypeError

    If input does not have a DatetimeIndex.

Examples:

Calculate time differences between observations on time-indexed DataFrame after it has been randomized:

>>> import pandahelper as ph
>>> import pandas as pd
>>>
>>> start = pd.Timestamp(year=1999, month=1, day=1)
>>> rng = pd.date_range(start, periods=10, freq="D").delete([3, 4, 5, 8])
>>> # index by time then randomize order
>>> df = pd.DataFrame(range(len(rng)), index=rng).sample(frac=1, random_state=3)
>>> ph.time_diffs_index(df)
1999-01-01      NaT
1999-01-02   1 days
1999-01-03   1 days
1999-01-07   4 days
1999-01-08   1 days
1999-01-10   2 days
Name: diffs, dtype: timedelta64[ns]
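
For comparison, a similar result can be obtained from the index alone with plain pandas. The sketch below is an approximation under assumptions, not the library's implementation. The column name "value" and the explicit index check are hypothetical additions for illustration.

```python
import pandas as pd

start = pd.Timestamp(year=1999, month=1, day=1)
rng = pd.date_range(start, periods=10, freq="D").delete([3, 4, 5, 8])
# Index by time, then randomize row order.
df = pd.DataFrame({"value": range(len(rng))}, index=rng).sample(
    frac=1, random_state=3
)

# Mirror the documented TypeError for inputs without a DatetimeIndex.
if not isinstance(df.index, pd.DatetimeIndex):
    raise TypeError("expected a DatetimeIndex")

# Sort by time, then difference the index against itself.
diffs = df.sort_index().index.to_series().diff().rename("diffs")
print(diffs)
```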