API Reference
pandahelper.profiles
Panda-Helper data profiles.
DataFrameProfile
Pandas DataFrame data profile.
Prepare data profile of Pandas DataFrame that can be displayed or saved.
Attributes:
- name (str) – Name of DataFrame profile if provided. Default value is "".
- shape (tuple) – DataFrame shape.
- dtypes (pandas.Series) – Data types of DataFrame index and Series in DataFrame.
- memory_usage (pandas.Series) – Memory usage (MB) of index and Series in DataFrame.
- num_duplicates (int) – Number of duplicated rows.
- nulls_per_row (pandas.Series) – Count of null values per row.
- null_stats (list) – Distribution statistics on nulls per row.
- time_diffs (pandas.Series) – Time diffs (gaps) if DataFrame has a DatetimeIndex.
Parameters:
- df (pandas.DataFrame) – DataFrame to profile.
- name (str, default: '') – Name to assign to profile.
- fmt (str, default: 'simple') – Printed table format. See https://github.com/astanin/python-tabulate for options.
Raises:
- TypeError – If input is not a Pandas DataFrame.
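As a rough illustration of what the attributes above describe, the following pandas-only sketch computes comparable quantities by hand. It is not Panda-Helper's implementation: the sample data, the variable names, and the choice of 1,000,000 bytes per MB are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data: one exact duplicate row and one null value.
df = pd.DataFrame(
    {
        "city": ["Springfield", "Quahog", "Springfield", None],
        "visits": [3, 5, 3, 7],
    }
)

shape = df.shape    # (rows, columns)
dtypes = df.dtypes  # dtype of each column

# Memory usage of the index and each column, converted from bytes to MB
# (assumption: 1 MB = 1,000,000 bytes).
memory_usage_mb = df.memory_usage(index=True, deep=True) / 1_000_000

# Rows that repeat an earlier row exactly.
num_duplicates = int(df.duplicated().sum())

# Number of null values in each row.
nulls_per_row = df.isna().sum(axis=1)
```

Each value here is plain pandas output, so it can be compared directly against the corresponding profile attribute when checking a profile by hand.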
SeriesProfile
Pandas Series data profile.
Prepare data profile of Pandas Series that can be displayed or saved.
Attributes:
- name (str) – Name of Series.
- dtype (numpy.dtype or Pandas dtype) – Data type of Series.
- count (int) – Count of non-null values.
- num_unique (int) – Number of unique values.
- num_nulls (int) – Number of null values.
- frequency (pandas.DataFrame) – Frequency table with counts and percentage.
- stats (dict) – Distribution statistics for Series.
- time_diffs (pandas.Series) – Time diffs (gaps) if series is of type datetime64. Alternately, can be time diffs in a Series with a DatetimeIndex if the time_index parameter was set to True when creating Series Profile.
Parameters:
- series (pandas.Series) – Pandas Series to profile.
- fmt (str, default: 'simple') – Printed table format. See https://github.com/astanin/python-tabulate for options.
- freq_most_least (tuple, default: (10, 5)) – Tuple (x, y) of the x most common and y least common values to display in frequency table.
- time_index (bool, default: False) – Whether to use the index for calculating time diffs for a datetime64-related Pandas Series. Not relevant for non-time related Series.
Raises:
- TypeError – If input is not a Pandas Series.
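The count-style attributes above map onto standard pandas operations. The sketch below shows one way to derive them by hand. It uses only pandas, the sample Series and variable names are hypothetical, and whether Panda-Helper counts nulls among the unique values is not stated in this reference, so treat the result as an approximation.

```python
import pandas as pd

# Hypothetical sample data with one null value.
series = pd.Series(["a", "b", "a", None, "c", "a"], name="letters")

count = int(series.count())          # non-null values
num_unique = int(series.nunique())   # unique values, nulls excluded
num_nulls = int(series.isna().sum()) # null values

# The single most common value and its count, similar in spirit to the
# "x most common" half of freq_most_least.
most_common = series.value_counts().head(1)
```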
pandahelper.stats
Panda-Helper statistics functions.
distribution_stats
Return single-column Pandas DataFrame of distribution statistics.
Parameters:
- series (pandas.Series) – Pandas Series used to calculate distribution statistics. Distribution statistics will depend on series dtype. Supported dtypes are:
    - int64
    - float64
    - bool
    - complex128
    - datetime64
    - timedelta64
    - period[]
    - interval
Returns:
- pandas.DataFrame – Single-column DataFrame of calculated values with statistics as index.
Raises:
- TypeError – If input is not a numeric-like pd.Series.
Examples:
Distribution stats for Pandas Series of type float64:
>>> from random import seed, gauss, expovariate
>>> import pandahelper as ph
>>> import pandas as pd
>>>
>>> seed(314)
>>> series = pd.Series([gauss(mu=30, sigma=20) for x in range(200)])
>>> ph.distribution_stats(series)
Statistic Value
count 200.000000
min -23.643007
1% -11.918955
5% 2.833604
25% 17.553793
50% 31.420759
75% 42.074998
95% 60.305435
99% 72.028633
max 81.547828
mean 30.580535
standard deviation 18.277706
median 31.420759
median absolute deviation 12.216607
skew -0.020083
Distribution stats for Pandas Series of type datetime64:
>>> start = pd.Timestamp(2000, 1, 1)
>>> tds = [pd.Timedelta(hours=int(expovariate(lambd=.003))) for x in range(200)]
>>> times = [start + td for td in tds]
>>> series = pd.Series(times)
>>> ph.distribution_stats(series)
Statistic Value
count 200
min 2000-01-01 00:00:00
1% 2000-01-01 01:59:24
5% 2000-01-01 09:00:00
25% 2000-01-04 08:00:00
50% 2000-01-08 04:30:00
75% 2000-01-16 21:00:00
95% 2000-02-08 01:36:00
99% 2000-02-22 10:20:24
max 2000-04-01 17:00:00
mean 2000-01-12 14:24:18
standard deviation 12 days 16:47:15.284423042
median 2000-01-08 04:30:00
frequency_table
Return value counts and relative frequency.
Parameters:
- series (pandas.Series) – Pandas Series used to calculate value counts and relative frequencies.
Returns:
- pandas.DataFrame – Pandas DataFrame of value counts and percentages indexed by value.
Raises:
- TypeError – If input is not a Pandas Series.
Examples:
>>> import random
>>> import pandahelper as ph
>>> import pandas as pd
>>>
>>> random.seed(314)
>>> cities = ["Springfield", "Quahog", "Philadelphia", "Shelbyville"]
>>> series = pd.Series(random.choices(cities, k = 200))
>>> ph.frequency_table(series)
Count % of Total
Springfield 66 33.00%
Quahog 51 25.50%
Philadelphia 44 22.00%
Shelbyville 39 19.50%
pandahelper.times
Panda-Helper time-series functions.
category_gaps
Calculate sum of gaps for each category in time-indexed Series.
Gaps are time differences in excess of expected time increment (threshold). Gap per category is relative to the minimum and maximum times in the Series. Intended for use with categorical-like Series.
Parameters:
- series (pandas.Series) – Categorical-like Series.
- threshold (pandas.Timedelta) – Threshold for the time difference to be considered a gap. For hourly data, threshold should be pd.Timedelta(hours=1).
- max_cat (int, default: 50) – Maximum number of categories (unique values) before issuing warning and returning None.
Returns:
- [pandas.DataFrame, None] – Key-value pairs with category name and associated gap. Will return None if number of categories exceeds max_cat.
Warns:
- UserWarning – If the number of categories (unique values) in the series exceeds max_cat.
Examples:
>>> import pandahelper as ph
>>> import pandas as pd
>>>
>>> start = pd.Timestamp(year=1999, month=1, day=1)
>>> a = pd.Series(["A"] * 30, index=pd.date_range(start, periods=30, freq="D"))
>>> b = pd.Series(["B"] * 15, index=pd.date_range(start, periods=15, freq="2D"))
>>> c = pd.Series(["C"] * 10, index=pd.date_range(start, periods=10, freq="D"))
>>> ph.category_gaps(pd.concat([a, b, c]), threshold=pd.Timedelta(days=1))
Cumulative Gap
C 20 days
B 15 days
A 0 days
id_gaps
Identify time gaps above threshold in datetime64 Series or DatetimeIndex.
Sorts input by time before calculating gaps.
Parameters:
- series (pandas.Series or pandas.DatetimeIndex) – datetime64 Series or DatetimeIndex.
- threshold (pandas.Timedelta) – Threshold to identify gaps (and not expected time differences).
Returns:
- pandas.DataFrame – One-column Pandas DataFrame of gaps indexed by when gap was calculated.
Examples:
Identify time gaps on Series of timestamps with a 2 and 4 hour gap after it has been randomized:
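The behavior described for id_gaps (sort by time, take differences, keep those above the threshold) can be approximated with plain pandas. The sketch below is an illustration under that reading, not Panda-Helper's implementation. The variable names and the "diffs" column label are hypothetical.

```python
import pandas as pd

# Hourly timestamps for one day with a 2-hour and a 4-hour gap, then shuffled.
start = pd.Timestamp(year=1999, month=1, day=1)
hourly = pd.date_range(start, periods=24, freq=pd.Timedelta(hours=1))
rng = hourly.delete([3, 8, 9, 10])
series = pd.Series(rng).sample(frac=1, random_state=3)
threshold = pd.Timedelta(hours=1)

# Sort by time, then difference each timestamp against the previous one.
ordered = series.sort_values()
diffs = ordered.diff()

# Index each difference by the later timestamp of the pair.
diffs.index = pd.DatetimeIndex(ordered)

# Keep only differences strictly above the threshold.
# Result: a 2-hour gap at 04:00 and a 4-hour gap at 11:00.
gaps = diffs[diffs > threshold].to_frame(name="diffs")
```

The first difference is NaT because the earliest timestamp has no predecessor. It compares as False against the threshold and is dropped by the filter.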
id_gaps_index
Identify time gaps above threshold in time-indexed Series or DataFrame.
Sorts input by time index before calculating diffs.
Parameters:
- df (pandas.Series or pandas.DataFrame) – Time-indexed Series or DataFrame.
- threshold (pandas.Timedelta) – Threshold to identify gaps (and not expected time differences).
Returns:
- pandas.DataFrame – One-column Pandas DataFrame of gaps indexed by when gap was calculated.
Examples:
Identify time gaps on an hourly, time-indexed Series with a 2 and 4 hour gap after it has been randomized:
>>> import pandahelper as ph
>>> import pandas as pd
>>>
>>> start = pd.Timestamp(year=1999, month=1, day=1)
>>> rng = pd.date_range(start, periods=24, freq="1h").delete([3, 8, 9, 10])
>>> # index by time then randomize order
>>> df = pd.DataFrame(range(len(rng)), index=rng).sample(frac=1, random_state=3)
time_diffs
Calculate time difference between subsequent observations.
Sorts input by time before calculating diffs.
Parameters:
- series (pandas.Series or pandas.DatetimeIndex) – Pandas Series or DatetimeIndex to calculate time diffs on.
Returns:
- pandas.Series(pandas.Timedelta) – Series of diffs (gaps) indexed by the time the diff was calculated.
Raises:
- TypeError – If input is not Series of type datetime64 or DatetimeIndex.
Examples:
Calculate time differences between observations on Series of timestamps after it has been randomized:
time_diffs_index
Calculate time difference between subsequent time-indexed observations.
Sorts input by time index before calculating diffs.
Parameters:
- df (pandas.Series or pandas.DataFrame) – Pandas Series or DataFrame with DatetimeIndex to calculate time diffs on.
Returns:
- pandas.Series(pandas.Timedelta) – Series of diffs (gaps) indexed by the time the diff was calculated.
Raises:
- TypeError – If input does not have a DatetimeIndex.
Examples:
Calculate time differences between observations on time-indexed DataFrame after it has been randomized:
>>> import pandahelper as ph
>>> import pandas as pd
>>>
>>> start = pd.Timestamp(year=1999, month=1, day=1)
>>> rng = pd.date_range(start, periods=10, freq="D").delete([3, 4, 5, 8])
>>> # index by time then randomize order
>>> df = pd.DataFrame(range(len(rng)), index=rng).sample(frac=1, random_state=3)
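For comparison, the sort-then-difference behavior described for time_diffs_index can be reproduced with plain pandas on the same kind of shuffled, time-indexed DataFrame. This is a sketch of that idea only, not Panda-Helper's implementation, and the variable names are hypothetical.

```python
import pandas as pd

# Daily index with four days removed, then shuffled.
start = pd.Timestamp(year=1999, month=1, day=1)
rng = pd.date_range(start, periods=10, freq="D").delete([3, 4, 5, 8])
df = pd.DataFrame(range(len(rng)), index=rng).sample(frac=1, random_state=3)

# Sort the time index, then difference each timestamp against the previous one.
# to_series() keeps the timestamps as both index and values, so each diff
# stays labeled by the time at which it was calculated.
ordered_index = df.index.sort_values()
diffs = ordered_index.to_series().diff()
```

The remaining dates are January 1, 2, 3, 7, 8 and 10, so after the leading NaT the differences are 1, 1, 4, 1 and 2 days.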