2.1.1. rainfallqc.checks package

2.1.1.1. Submodules

2.1.1.2. rainfallqc.checks.comparison_checks module

Quality control checks relying on comparison with a benchmark dataset.

Comparison checks are defined as QC checks that: “detect abnormalities in rainfall record based on benchmarks.”

Classes and functions ordered by appearance in IntenseQC framework.

rainfallqc.checks.comparison_checks.add_daily_year_col(data: DataFrame) → DataFrame[source]

Make a year column for the data. This method first upsamples the data to a daily time step.

Parameters:

data: Rainfall data

Returns:

data_w_year_col: Rainfall data with year column

rainfallqc.checks.comparison_checks.check_annual_exceedance_etccdi_prcptot(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float) → list[source]

Check annual exceedance of maximum PRCPTOT from ETCCDI dataset.

This is QC9 from the IntenseQC framework.

Parameters:

data: Rainfall data
target_gauge_col: Column with rainfall data
gauge_lat: latitude of the rain gauge
gauge_lon: longitude of the rain gauge

Returns:

exceedance_flags: List of flags (see exceedance_flagger function)

rainfallqc.checks.comparison_checks.check_annual_exceedance_etccdi_r99p(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float) → list[source]

Check annual exceedance of maximum R99p from ETCCDI dataset.

This is QC8 from the IntenseQC framework.

Parameters:

data: Rainfall data
target_gauge_col: Column with rainfall data
gauge_lat: latitude of the rain gauge
gauge_lon: longitude of the rain gauge

Returns:

flag_list: List of flags

rainfallqc.checks.comparison_checks.check_exceedance_of_rainfall_world_record(data: DataFrame, target_gauge_col: str, time_res: str) → DataFrame[source]

Check exceedance of rainfall world record.

See also utils/stats.py for world record sources.

This is QC10 from the IntenseQC framework.

Parameters:

data: Rainfall data
target_gauge_col: Column with rainfall data
time_res: Time resolution

Returns:

data_w_flags: Rainfall data with exceedance of World Record (see flag_exceedance_of_ref_val_as_col function)

rainfallqc.checks.comparison_checks.check_hourly_exceedance_etccdi_rx1day(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float) → DataFrame[source]

Check whether hourly rainfall exceeds the ETCCDI Rx1day (maximum 1-day rainfall) record.

This is QC11 from the IntenseQC framework.

Parameters:

data: Rainfall data
target_gauge_col: Column with rainfall data
gauge_lat: latitude of the rain gauge
gauge_lon: longitude of the rain gauge

Returns:

data_w_flags: Rainfall data with exceedance of Rx1day Record (see flag_exceedance_of_ref_val_as_col function)

rainfallqc.checks.comparison_checks.flag_exceedance_of_max_etccdi_variable(annual_sum_rainfall: DataFrame, target_gauge_col: str, nearby_etccdi_data: Dataset, etccdi_var: str) → list[source]

Flag exceedance of maximum ETCCDI variable, comparing the maximum sums of each year.

Parameters:

annual_sum_rainfall: Rainfall data as annual sums
target_gauge_col: Column with rainfall data
nearby_etccdi_data: ETCCDI data with given variable to check
etccdi_var: variable to load from ETCCDI

Returns:

exceedance_flags: Flags of exceedances of max ETCCDI value

rainfallqc.checks.comparison_checks.flag_exceedance_of_ref_val(val: int | float, ref_val: int | float) → int[source]

Exceedance flagger from intenseqc.

Parameters:

val: Value to check
ref_val: Reference value to compare against

Returns:

Flag: Exceedance flag
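
As a rough illustration of how a graded exceedance flag can be derived from a value and a reference value, here is a minimal sketch. The function name sketch_exceedance_flag and the multipliers (1.0, 1.2, 1.33, 1.5) are assumptions chosen for illustration; this documentation only states that the resulting flags lie between 0 and 4.

```python
def sketch_exceedance_flag(val: float, ref_val: float) -> int:
    """Return a flag from 0 (no exceedance) to 4 (largest exceedance)."""
    # Hypothetical multipliers, checked from the largest to the smallest.
    thresholds = ((1.5, 4), (1.33, 3), (1.2, 2), (1.0, 1))
    for multiplier, flag in thresholds:
        if val >= ref_val * multiplier:
            return flag
    return 0


# A value 25% above the reference falls into the second band.
print(sketch_exceedance_flag(125, 100))
```

Checking the largest multiplier first means each value receives the highest flag it qualifies for.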

rainfallqc.checks.comparison_checks.flag_exceedance_of_ref_val_as_col(data: DataFrame, target_gauge_col: str, ref_val: int | float, new_col_name: str) → DataFrame[source]

Flag exceedance of maximum reference value and return as column.

Used in QC11 of the IntenseQC framework. TODO: could this be used in QC8+9?

Parameters:

data: Rainfall data.
target_gauge_col: Column with rainfall data
ref_val: Reference value.
new_col_name: New column name.

Returns:

data: Data with exceedance flags between 0-4.

rainfallqc.checks.comparison_checks.get_sum_rainfall_above_percentile_per_year(data: DataFrame, target_gauge_col: str, percentile: float) → DataFrame[source]

Get the sum of rainfall above a given percentile for each year.

Parameters:

data: Rainfall data
target_gauge_col: Column with rainfall data
percentile: nth percentile to check for values above

Returns:

exceedance_flags: List of flags (see exceedance_flagger function)

2.1.1.3. rainfallqc.checks.gauge_checks module

Quality control checks examining suspicious rain gauges.

Gauge checks are defined as QC checks that: “detect abnormalities in summary and descriptive statistics of rain gauges.”

Classes and functions ordered by appearance in IntenseQC framework.

rainfallqc.checks.gauge_checks.check_breakpoints(data: DataFrame, target_gauge_col: str, p_threshold: float = 0.01) → int[source]

Use a Pettitt test on rainfall data to check for breakpoints.

This is QC6 from the IntenseQC framework.

Parameters:

data: Rainfall data.
target_gauge_col: Column with rainfall data.
p_threshold: Significance level for the test.

Returns:

flagint: 1 if breakpoint is detected (p < p_threshold), 0 otherwise
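
For readers unfamiliar with the Pettitt test, the following pure-Python sketch shows the statistic and the commonly used p-value approximation, then applies the p < p_threshold rule described above. The function names pettitt_test and breakpoint_flag are hypothetical, and this is not necessarily how the package computes the test.

```python
import math


def pettitt_test(values: list[float]) -> tuple[int, float]:
    """Return (index of the most likely change point, approximate p-value)."""
    n = len(values)
    if n < 2:
        return 0, 1.0
    best_k, best_index = 0, 0
    for t in range(1, n):
        # U_t sums the signs of all differences between the two segments.
        u_t = 0
        for before in values[:t]:
            for after in values[t:]:
                u_t += (after > before) - (after < before)
        if abs(u_t) > best_k:
            best_k, best_index = abs(u_t), t
    # Approximate two-sided p-value, capped at 1.
    p_value = min(1.0, 2.0 * math.exp(-6.0 * best_k**2 / (n**3 + n**2)))
    return best_index, p_value


def breakpoint_flag(values: list[float], p_threshold: float = 0.01) -> int:
    """Return 1 if a breakpoint is detected (p < p_threshold), 0 otherwise."""
    _, p_value = pettitt_test(values)
    return 1 if p_value < p_threshold else 0


shifted = [1.0] * 20 + [5.0] * 20  # clear shift half-way through
print(pettitt_test(shifted)[0], breakpoint_flag(shifted))
```

The double loop is quadratic in the record length, which is acceptable for a sketch but slow for long records.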

rainfallqc.checks.gauge_checks.check_intermittency(data: DataFrame, target_gauge_col: str, no_data_threshold: int = 2, annual_count_threshold: int = 5) → list[source]

Return years where more than five periods of missing data are bounded by zeros.

TODO: split into multiple sub-functions and write more tests! This is QC5 from the IntenseQC framework.

Parameters:

data: Rainfall data
target_gauge_col: Column with rainfall data
no_data_threshold: Number of missing values needed to be counted as a no data period (default: 2 (days))
annual_count_threshold: Number of missing data periods above no_data_threshold per year (default: 5)

Returns:

years_w_intermittency: List of years with intermittency issues.
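
The counting logic can be sketched as follows, assuming missing values are represented as None and the data has already been grouped by year. The function names and the dictionary input are hypothetical. The sketch counts a gap when it has at least no_data_threshold missing values and flags a year when it has more than annual_count_threshold such gaps, following the one-line summary above; the package may treat these boundaries differently.

```python
from typing import Optional


def count_gaps_bounded_by_zeros(
    values: list[Optional[float]], no_data_threshold: int = 2
) -> int:
    """Count runs of missing values that have a zero directly on both sides."""
    count, i, n = 0, 0, len(values)
    while i < n:
        if values[i] is None:
            start = i
            while i < n and values[i] is None:
                i += 1
            run_length = i - start
            bounded = start > 0 and i < n and values[start - 1] == 0 and values[i] == 0
            if run_length >= no_data_threshold and bounded:
                count += 1
        else:
            i += 1
    return count


def years_with_intermittency(
    values_by_year: dict[int, list[Optional[float]]],
    no_data_threshold: int = 2,
    annual_count_threshold: int = 5,
) -> list[int]:
    """Return years with more than annual_count_threshold bounded gaps."""
    return [
        year
        for year, values in sorted(values_by_year.items())
        if count_gaps_bounded_by_zeros(values, no_data_threshold) > annual_count_threshold
    ]


example = {2001: [0, None, None, 0] * 6, 2002: [0, None, None, 0] * 5}
print(years_with_intermittency(example))
```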

rainfallqc.checks.gauge_checks.check_min_val_change(data: DataFrame, target_gauge_col: str, expected_min_val: float) → list[source]

Return years when the minimum recorded value changes.

Used to determine whether there are possible changes to the measuring equipment. This is QC7 from the IntenseQC framework.

Parameters:

data: Rainfall data
target_gauge_col: Column with rainfall data.
expected_min_val: Expected minimum rainfall value, i.e. the resolution of the data.

Returns:

yr_list: List of years with minimum value changes.

rainfallqc.checks.gauge_checks.check_temporal_bias(data: DataFrame, target_gauge_col: str, time_granularity: str, p_threshold: float = 0.01) → int[source]

Perform a two-sided t-test on the distribution of mean rainfall over time slices.

This is QC3 (day of week bias) and QC4 (hour-of-day bias) from the IntenseQC framework.

Parameters:

data: Rainfall data
target_gauge_col: Column with rainfall data
time_granularity: Temporal grouping, either ‘weekday’ or ‘hour’
p_threshold: Significance level for the test

Returns:

flagint: 1 if bias is detected (p < threshold), 0 otherwise

rainfallqc.checks.gauge_checks.check_years_where_annual_mean_k_top_rows_are_zero(data: DataFrame, target_gauge_col: str, k: int) → list[source]

Return a list of years where the mean of the top-k rows is zero.

This is QC2 from the IntenseQC framework.

Parameters:

data: Rainfall data
target_gauge_col: Column with rainfall data
k: Number of top values to check, i.e. k=5 is the top 5

Returns:

year_list: List of years where k-largest are zero.
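
A minimal sketch of this check, assuming the rainfall values have already been grouped into a dictionary keyed by year (the function name and input shape are hypothetical and not the package's DataFrame-based interface):

```python
def years_where_top_k_mean_is_zero(
    values_by_year: dict[int, list[float]], k: int
) -> list[int]:
    """Return years whose k largest values have a mean of zero."""
    flagged = []
    for year, values in sorted(values_by_year.items()):
        top_k = sorted(values, reverse=True)[:k]
        if top_k and sum(top_k) / len(top_k) == 0:
            flagged.append(year)
    return flagged


example = {2000: [0, 0, 0, 0, 0, 0], 2001: [0, 0, 0.2, 0, 1.0, 0]}
print(years_where_top_k_mean_is_zero(example, k=5))
```

Because rainfall cannot be negative, a zero mean of the k largest values implies the whole year was recorded as dry.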

rainfallqc.checks.gauge_checks.check_years_where_nth_percentile_is_zero(data: DataFrame, target_gauge_col: str, quantile: float) → list[source]

Return years where the n-th percentile is zero.

This is QC1 from the IntenseQC framework.

Parameters:

data: Rainfall data
target_gauge_col: Column with rainfall data
quantile: Between 0 & 1

Returns:

year_list: List of years where n-th percentile is zero.

2.1.1.4. rainfallqc.checks.neighbourhood_checks module

Quality control checks using neighbouring gauges to identify suspicious data.

Neighbourhood checks are QC checks that: “detect abnormalities in a gauge given measurements in neighbouring gauges.”

Classes and functions ordered by appearance in IntenseQC framework.

rainfallqc.checks.neighbourhood_checks.add_wet_flags_to_data(neighbour_data_diff: DataFrame, target_gauge_col: str, nearest_neighbour: str, expon_percentiles: dict, wet_threshold: float) → DataFrame[source]

Add flags to data based on when target gauge is wetter than neighbour above certain exponential thresholds.

Parameters:

neighbour_data_diff: Data with normalised diff to neighbour
target_gauge_col: Target gauge column
nearest_neighbour: Neighbouring gauge column
expon_percentiles: Thresholds at percentile of fitted distribution (needs 0.95, 0.99 & 0.999)
wet_threshold: Threshold for rainfall intensity in given time period

Returns:

neighbour_data_wet_flags: Data with wet flags applied
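
To illustrate how thresholds at the 0.95, 0.99 and 0.999 percentiles of a fitted exponential distribution could translate into flags 1 to 3, here is a small sketch. It assumes a one-parameter exponential fit whose scale is the sample mean of the positive normalised differences; the function names are hypothetical and the package's fitting procedure may differ.

```python
import math


def exponential_thresholds(
    positive_diffs: list[float], percentiles: tuple = (0.95, 0.99, 0.999)
) -> dict[float, float]:
    """Fit an exponential distribution (scale = mean) and return its quantiles."""
    if not positive_diffs:
        raise ValueError("need at least one positive difference to fit")
    scale = sum(positive_diffs) / len(positive_diffs)
    # Quantile function of the exponential distribution: -scale * ln(1 - p).
    return {p: -scale * math.log(1.0 - p) for p in percentiles}


def wet_flag(diff: float, thresholds: dict[float, float]) -> int:
    """Return 3, 2 or 1 for exceeding the 99.9th, 99th or 95th percentile, else 0."""
    if diff > thresholds[0.999]:
        return 3
    if diff > thresholds[0.99]:
        return 2
    if diff > thresholds[0.95]:
        return 1
    return 0


thresholds = exponential_thresholds([0.5, 1.0, 1.5])
print([wet_flag(d, thresholds) for d in (1.0, 3.5, 5.0, 7.0)])
```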

rainfallqc.checks.neighbourhood_checks.check_daily_factor(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, averaging_method: str = 'mean') → float[source]

Daily factor difference between target and neighbouring gauge.

Flag: Scalar factor difference.

This is QC24 from the IntenseQC framework.

Parameters:

neighbour_data: Daily rainfall data with target and neighbouring gauge and time col
target_gauge_col: Target gauge column
nearest_neighbour: Neighbouring gauge column
averaging_method: Method to use to get average i.e. mean or median (default mean)

Returns:

daily_factor: Average factor diff between target and neighbour

Raises:

ValueError: If averaging method not ‘mean’ or ‘median’
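
A sketch of the averaging step, including the ValueError for an unsupported method. It assumes the factor is the ratio target / neighbour, computed only on time steps where both gauges record rain; that filtering rule and the function name are assumptions rather than documented behaviour.

```python
from statistics import mean, median


def daily_factor_sketch(
    target: list[float], neighbour: list[float], averaging_method: str = "mean"
) -> float:
    """Average ratio of target to neighbour rainfall on jointly wet days."""
    if averaging_method not in ("mean", "median"):
        raise ValueError("averaging_method must be 'mean' or 'median'")
    # Assumption: skip days where either gauge is dry to avoid dividing by zero.
    factors = [t / n for t, n in zip(target, neighbour) if t > 0 and n > 0]
    if not factors:
        return float("nan")
    return mean(factors) if averaging_method == "mean" else median(factors)


target = [2.0, 4.0, 0.0, 40.0]
neighbour = [1.0, 1.0, 5.0, 2.0]
print(daily_factor_sketch(target, neighbour, "median"))
```

The median is less sensitive than the mean to a single day with an extreme ratio, which is one reason to expose both options.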

rainfallqc.checks.neighbourhood_checks.check_dry_neighbours(neighbour_data: DataFrame, target_gauge_col: str, list_of_nearest_stations: List[str], time_res: str, min_n_neighbours: int, dry_period_days: int = 15, n_neighbours_ignored: int = 0, hour_offset: int = 0, min_count: int = None) → DataFrame[source]

Identify suspicious dry periods by comparison to neighbour for hourly or daily data.

Flags (majority voting where flag is the highest value across all neighbours):

3, if >= 3 average number of wet days in neighbours during a dry period in target.
2, …if 2 days
1, …if 1 day
0, if not neighbours on average dry during dry target gauge period.

This is QC18 & QC19 from the IntenseQC framework.

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col
target_gauge_col: Target gauge column
list_of_nearest_stations: List of columns with neighbouring gauges
time_res: Time resolution of data
min_n_neighbours: Minimum number of neighbours needed to be checked for flag
dry_period_days: Length of a “dry_spell” (default: 15 days)
n_neighbours_ignored: Number of zero flags allowed for majority voting (default: 0)
hour_offset: Time offset of hourly data in hours (i.e. if 7am-7am, then set this to 7) (default: 0)
min_count: Minimum number of time steps needed per time period (default: 1)

Returns:

data_w_dry_flags: Target data with dry flags

rainfallqc.checks.neighbourhood_checks.check_monthly_factor(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str) → DataFrame[source]

Monthly factor difference between target and neighbouring gauge.

Flags:

1, when ~10 x greater than neighbour monthly total
2, when ~25.4 x greater …
3, when ~2.54 x greater …
4, when ~10 x smaller than neighbour monthly total
5, when ~25.4 x smaller …
6, when ~2.54 x smaller …
else, 0

This is QC25 from the IntenseQC framework.

Parameters:

neighbour_data: Daily rainfall data with target and neighbouring gauge and time col
target_gauge_col: Target gauge column
nearest_neighbour: Neighbouring gauge column

Returns:

monthly_factor_flag: Factor diff flags between target and neighbour

rainfallqc.checks.neighbourhood_checks.check_monthly_neighbours(neighbour_data: DataFrame, target_gauge_col: str, list_of_nearest_stations: List[str], time_res: str, min_n_neighbours: int, n_neighbours_ignored: int = 0, hour_offset: int = 0, min_count: int = None) → DataFrame[source]

Identify suspicious monthly totals by comparison to neighbouring monthly gauges.

Flags (majority voting where flag is the highest value across all neighbours):

Flags -3 to 3 based on percentage difference:
-3, -100% (i.e. gauge dry but neighbours not)
-2, <= 50%
-1, <= 25%
1, >= 25%
2, >= 50%
3, >= 100%

Flags equal to 3 may be upgraded to:
4, >=1.25 x record maximum for all neighbours
5, >=2 x record maximum for all neighbours

Or:
0, if not in extreme exceedance of neighbours

This is QC20 from the IntenseQC framework.

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col
target_gauge_col: Target gauge column
list_of_nearest_stations: List of columns with neighbouring gauges
time_res: Time resolution of data (e.g. ‘monthly’ or ‘daily’, ‘hourly’ or ‘15m’ - will be resampled to monthly)
min_n_neighbours: Minimum number of neighbours needed to be checked for flag
n_neighbours_ignored: Number of zero flags allowed for majority voting (default: 0)
hour_offset: Time offset of hourly data in hours (i.e. if 7am-7am, then set this to 7) (default: 0)
min_count: Minimum number of time steps needed per time period (default: will be half of possible time steps)

Returns:

data_w_monthly_flags: Target data with monthly flags

rainfallqc.checks.neighbourhood_checks.check_nearest_neighbour_columns(neighbour_data: DataFrame, target_gauge_col: str, list_of_nearest_stations: list) → None[source]

Run checks of neighbouring gauge columns to check if there are any columns and if the target gauge is there.

Parameters:

neighbour_data: Rainfall data of all neighbouring gauges with time col
target_gauge_col: Target gauge column
list_of_nearest_stations: List of columns with neighbouring gauges

Raises:

ValueError: If there are no neighbouring gauges in the ‘list_of_nearest_stations’ list
AssertionError: If ‘target_gauge_col’ not in neighbour_data

rainfallqc.checks.neighbourhood_checks.check_neighbour_affinity_index(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str) → float[source]

Pre-QC Affinity index calculated between target and nearest neighbouring gauge.

Flag: Between 0-1 for affinity index

This is QC22 from the IntenseQC framework.

Parameters:

neighbour_data: Rainfall data with target and neighbouring gauge and time col
target_gauge_col: Target gauge column
nearest_neighbour: Neighbouring gauge column

Returns:

affinity_index: Between 0 and 1
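
One way to picture an index of this kind is as the fraction of time steps on which both gauges agree on being wet or dry. The sketch below uses that definition as an assumption; this documentation only states that the result lies between 0 and 1. The function name and the wet_threshold argument are hypothetical.

```python
from typing import Optional


def affinity_index_sketch(
    target: list[Optional[float]],
    neighbour: list[Optional[float]],
    wet_threshold: float = 0.0,
) -> float:
    """Fraction of jointly observed time steps with matching wet/dry state."""
    pairs = [
        (t, n) for t, n in zip(target, neighbour) if t is not None and n is not None
    ]
    if not pairs:
        return float("nan")
    matches = sum((t > wet_threshold) == (n > wet_threshold) for t, n in pairs)
    return matches / len(pairs)


print(affinity_index_sketch([0, 1.0, 0, 2.0], [0, 3.0, 1.0, 0]))
```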

rainfallqc.checks.neighbourhood_checks.check_neighbour_correlation(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str) → float[source]

Pre-QC Pearson correlation calculated between target and neighbouring gauge.

Flag: Between -1 and +1 for the Pearson correlation coefficient

This is QC23 from the IntenseQC framework.

Parameters:

neighbour_data: Rainfall data with target and neighbouring gauge and time col
target_gauge_col: Target gauge column
nearest_neighbour: Neighbouring gauge column

Returns:

r_squared: Between -1 and 1

rainfallqc.checks.neighbourhood_checks.check_timing_offset(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, time_res: str, offsets_to_check: Iterable[int] = (-1, 0, 1)) → int[source]

Identify suspicious data offset using Affinity Index and correlation (r^2) between target and nearest neighbour.

Flags:

-1, -1 day offset
0, no offset
1, +1 day offset

This is QC21 from the IntenseQC framework.

Parameters:

neighbour_data: Rainfall data with target and neighbouring gauge and time col
target_gauge_col: Target gauge column
nearest_neighbour: Neighbouring gauge column
time_res: Time resolution of data
offsets_to_check: Offset values to check (default: -1, 0, 1)

Returns:

offset_flag: e.g. -1, 0 or 1
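
The idea of testing several offsets can be sketched by shifting one series against the other and keeping the offset with the best agreement. This simplified version uses only the Pearson correlation, whereas the check described above also uses the Affinity Index. The function names and the sign convention (a positive offset means the target lags the neighbour) are assumptions.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def best_offset(target: list[float], neighbour: list[float], offsets_to_check=(-1, 0, 1)) -> int:
    """Return the offset whose overlapping segments correlate best."""
    best, best_r = 0, -math.inf
    for offset in offsets_to_check:
        # Align target[i + offset] with neighbour[i] and keep the overlap only.
        if offset > 0:
            t, n = target[offset:], neighbour[:-offset]
        elif offset < 0:
            t, n = target[:offset], neighbour[-offset:]
        else:
            t, n = target, neighbour
        r = pearson(t, n)
        if r > best_r:  # NaN never wins this comparison
            best, best_r = offset, r
    return best


neighbour = [0, 5, 0, 0, 3, 0, 0, 8, 0, 1]
target = [0] + neighbour[:-1]  # same record, reported one step late
print(best_offset(target, neighbour))
```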

rainfallqc.checks.neighbourhood_checks.check_wet_neighbours(neighbour_data: DataFrame, target_gauge_col: str, list_of_nearest_stations: List[str], time_res: str, wet_threshold: int | float, min_n_neighbours: int, n_neighbours_ignored: int = 0, hour_offset: int = 0, min_count: int = None) → DataFrame[source]

Identify suspicious large values by comparison to neighbour for hourly or daily data.

Flags (majority voting where flag is the highest value across all neighbours):

3, if normalised difference between target gauge and neighbours is above the 99.9th percentile
2, …if above 99th percentile
1, …if above 95th percentile
0, if not in extreme exceedance of neighbours

This is QC16 & QC17 from the IntenseQC framework.

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col
target_gauge_col: Target gauge column
list_of_nearest_stations: List of columns with neighbouring gauges
time_res: Time resolution of data
wet_threshold: Threshold for rainfall intensity in given time period
min_n_neighbours: Minimum number of neighbours needed to be checked for flag
n_neighbours_ignored: Number of zero flags allowed for majority voting (default: 0)
hour_offset: Time offset of hourly data in hours (i.e. if 7am-7am, then set this to 7) (default: 0)
min_count: Minimum number of time steps needed per time period (default: 2)

Returns:

data_w_wet_flags: Target data with wet flags

rainfallqc.checks.neighbourhood_checks.filter_data_based_on_unusual_wetness(neighbour_data_diff: DataFrame, target_gauge_col: str, nearest_neighbour: str, wet_threshold: float) → DataFrame[source]

Filter data based on wet threshold.

Parameters:

neighbour_data_diff: Data with normalised diff to neighbour
target_gauge_col: Target gauge column
nearest_neighbour: Neighbouring gauge column
wet_threshold: Threshold for rainfall intensity in given time period

Returns:

filtered_diff: Data filtered to wet threshold and where diff is positive (thus more wet)

rainfallqc.checks.neighbourhood_checks.flag_dry_spell_fractions(one_neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, proportion_of_dry_day_for_flags: dict) → DataFrame[source]

Flag dry spell fractions.

Parameters:

one_neighbour_data: Rainfall data of one neighbouring gauge with time col
target_gauge_col: Target gauge column
nearest_neighbour: Neighbouring gauge column
proportion_of_dry_day_for_flags: Proportion of dry days needed to be flagged 1, 2, or 3

Returns:

data_w_dry_spell_fraction: Target data with dry spell fractions

rainfallqc.checks.neighbourhood_checks.flag_monthly_factor_differences(monthly_factor: DataFrame) → DataFrame[source]

Flag monthly difference flag after IntenseQC framework for QC25.

Flags:

1, when ~10 x greater than neighbour monthly total
2, when ~25.4 x greater …
3, when ~2.54 x greater …
4, when ~10 x smaller than neighbour monthly total
5, when ~25.4 x smaller …
6, when ~2.54 x smaller …
else, 0

Parameters:

monthly_factor: Rainfall data with ‘factor_diff’ and gauge_col
target_gauge_col: Rain column

Returns:

monthly_factor_w_flag: Rainfall data with flags based on monthly factor difference
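
The factor flags can be pictured as a lookup of the ratio target / neighbour against a few reference factors, which presumably correspond to unit mix-ups (10 for cm versus mm, 25.4 for inches versus mm, 2.54 for inches versus cm). The sketch below is illustrative only: the function name and the 10% relative tolerance used to interpret "~" are assumptions, not values from the package.

```python
def monthly_factor_flag_sketch(factor: float, tolerance: float = 0.1) -> int:
    """Map a target/neighbour ratio to a flag from 0 to 6."""
    # (reference factor, flag) pairs in the order given in the flag description.
    references = (
        (10.0, 1),
        (25.4, 2),
        (2.54, 3),
        (1 / 10.0, 4),
        (1 / 25.4, 5),
        (1 / 2.54, 6),
    )
    for ref, flag in references:
        # Hypothetical rule: within +/- tolerance (relative) of the reference.
        if abs(factor - ref) <= tolerance * ref:
            return flag
    return 0


print([monthly_factor_flag_sketch(f) for f in (10.2, 25.0, 2.5, 0.1, 0.04, 0.4, 1.0)])
```

With a 10% tolerance the six bands do not overlap, so the order of the lookup does not affect the result.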

rainfallqc.checks.neighbourhood_checks.flag_percentage_diff_of_neighbour(neighbour_data: DataFrame, nearest_neighbour: str) → DataFrame[source]

Flag percentage difference between target gauge and neighbouring gauge.

Flags -3 to 3 based on percentage difference:

-3, -100% (i.e. gauge dry but neighbours not)
-2, <= 50%
-1, <= 25%
1, >= 25%
2, >= 50%
3, >= 100%

Parameters:

neighbour_data: Rainfall data of all neighbouring gauges with time col
nearest_neighbour: Neighbouring gauge column

Returns:

neighbour_data_w_flags: Data with perc_diff flags
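
A sketch of the flag assignment for a single pair of values. It assumes the percentage difference is (target - neighbour) / neighbour * 100 and that the thresholds for the negative flags are meant to be negative (-50% and -25%); both points, and the function name, are assumptions.

```python
def percentage_diff_flag_sketch(target: float, neighbour: float) -> int:
    """Return a flag from -3 to 3 based on the percentage difference."""
    if neighbour == 0:
        # Percentage difference is undefined; this sketch returns no flag.
        return 0
    perc_diff = (target - neighbour) / neighbour * 100
    if perc_diff <= -100:
        return -3  # target dry but neighbour not
    if perc_diff <= -50:
        return -2
    if perc_diff <= -25:
        return -1
    if perc_diff >= 100:
        return 3
    if perc_diff >= 50:
        return 2
    if perc_diff >= 25:
        return 1
    return 0


print([percentage_diff_flag_sketch(t, 10) for t in (0, 4, 7, 10, 13, 16, 25)])
```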

rainfallqc.checks.neighbourhood_checks.flag_wet_day_errors_based_on_neighbours(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, wet_threshold: float) → DataFrame[source]

Flag wet days with errors based on the percentile difference with neighbouring gauge.

Parameters:

neighbour_data: Rainfall data of all neighbouring gauges with time col
target_gauge_col: Target gauge column
nearest_neighbour: Neighbouring gauge column
wet_threshold: Threshold for rainfall intensity in given time period

Returns:

neighbour_data_wet_flags: Data with wet flags

rainfallqc.checks.neighbourhood_checks.get_dry_spell_fraction_col(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, dry_period_days: int) → DataFrame[source]

Get dry spell fraction column.

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col
target_gauge_col: Target gauge column
nearest_neighbour: Neighbouring gauge column
dry_period_days: Length of a “dry_spell” (default: 15 days)

Returns:

data_w_dry_spell_fraction: Target data with dry spell fractions

rainfallqc.checks.neighbourhood_checks.get_majority_positive_or_negative_flags(monthly_neighbour_data: DataFrame, list_of_nearest_stations: list, min_n_neighbours: int, n_neighbours_ignored: int) → DataFrame[source]

Get majority voted positive or negative flags i.e. get minimum positive flag, or maximum negative flag.

Parameters:

monthly_neighbour_data: Monthly rainfall data of neighbouring gauges with time col
list_of_nearest_stations: List of columns with neighbouring gauges
min_n_neighbours: Minimum number of neighbours needed to be checked for flag
n_neighbours_ignored: Number of zero flags allowed for majority voting

Returns:

data_w_monthly_flag: Data with majority_monthly_flag

rainfallqc.checks.neighbourhood_checks.get_majority_voting_flag(neighbour_data: DataFrame, list_of_nearest_stations: list[str], min_n_neighbours: int, n_zeros_allowed: int, flag_col_prefix: str, new_flag_col_name: str, aggregation: str) → DataFrame[source]

Get the highest flag that is in all neighbours.

For this function, we introduce the ‘n_zeros_allowed’ parameter to allow some leeway for problematic neighbours. This prevents a problematic neighbour that resembles a problematic target from blocking the flag.

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col
list_of_nearest_stations: List of columns with neighbouring gauges
min_n_neighbours: Minimum number of neighbours online that will be considered
n_zeros_allowed: Number of zero flags allowed (default: 0)
flag_col_prefix: Prefix for flag column e.g. “wet_flag_”
new_flag_col_name: New flag column name
aggregation: “min” or “max”

Returns:

neighbour_data_w_majority_wet_flag: Data with majority wet flag
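
The voting rule for one time step can be sketched as follows: the highest flag shared by all neighbours is the minimum of their flags, after discarding up to n_zeros_allowed zero flags. The sketch covers only the "min" aggregation; the function name, the use of None for offline neighbours, and returning None when too few neighbours are online are assumptions.

```python
from typing import Optional


def majority_voting_flag_sketch(
    neighbour_flags: list[Optional[int]],
    min_n_neighbours: int,
    n_zeros_allowed: int = 0,
) -> Optional[int]:
    """Return the highest flag reached by all neighbours at one time step."""
    online = [flag for flag in neighbour_flags if flag is not None]
    if len(online) < min_n_neighbours:
        return None  # not enough neighbours to vote
    kept, zeros_dropped = [], 0
    for flag in online:
        # Ignore a limited number of zero flags from possibly faulty neighbours.
        if flag == 0 and zeros_dropped < n_zeros_allowed:
            zeros_dropped += 1
            continue
        kept.append(flag)
    return min(kept) if kept else 0


print(majority_voting_flag_sketch([3, 0, 3], min_n_neighbours=3, n_zeros_allowed=1))
```

Without the allowance, a single neighbour reporting 0 would pull the voted flag down to 0 regardless of the others.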

rainfallqc.checks.neighbourhood_checks.make_neighbour_monthly_max_climatology(monthly_neighbour_data: DataFrame, list_of_nearest_stations: list) → DataFrame[source]

Make neighbourhood monthly max climatology.

Parameters:

monthly_neighbour_data: Monthly rainfall data of neighbouring gauges with time col
list_of_nearest_stations: List of columns with neighbouring gauges

Returns:

data_w_monthly_flags: Target data with monthly flags

rainfallqc.checks.neighbourhood_checks.make_num_neighbours_online_col(neighbour_data: DataFrame, list_of_nearest_stations: list[str]) → DataFrame[source]

Get number of neighbours online column.

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col
list_of_nearest_stations: Neighbouring columns to check if not null

Returns:

neighbour_data_online_neighbours: Data with column for number of online neighbours

rainfallqc.checks.neighbourhood_checks.normalised_diff_between_target_neighbours(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str) → DataFrame[source]

Normalised difference between target rain col and neighbouring rain col.

Parameters:

neighbour_data: Rainfall data of all neighbouring gauges with time col
target_gauge_col: Target gauge column
nearest_neighbour: Neighbouring gauge column

Returns:

neighbour_data_w_diff: Data with normalised diff to each neighbour

rainfallqc.checks.neighbourhood_checks.upgrade_monthly_flag_using_neighbour_max_climatology(monthly_neighbour_data_w_flags: DataFrame, target_gauge_col: str, min_n_neighbours: int) → DataFrame[source]

Upgrade flags to 4 and 5 flags for monthly neighbours in excess of neighbourhood monthly climatological max.

Parameters:

monthly_neighbour_data_w_flags: Monthly rainfall data of neighbouring gauges with time col and ‘majority_monthly_flag’
target_gauge_col: Target gauge column
min_n_neighbours: Minimum number of neighbours needed to be checked for flag

Returns:

data_w_monthly_flags: Target data with monthly flags

2.1.1.5. rainfallqc.checks.timeseries_checks module

Quality control checks based on suspicious time-series artefacts.

Time-series checks are defined as QC checks that: “detect abnormalities in patterns of the data record.”

Classes and functions ordered by appearance in IntenseQC framework.

rainfallqc.checks.timeseries_checks.check_daily_accumulations(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float, wet_day_threshold: int | float = 1.0, accumulation_multiplying_factor: int | float = 2.0, accumulation_threshold: float = None) → DataFrame[source]

Identify suspicious periods where an hour of rainfall is preceded by 23 hours with no rain.

Uses a simple precipitation intensity index (SDII) from ETCCDI.

This is QC13 from the IntenseQC framework.

Please see ‘Notes’ below for any additional information about the implementation of this method.

Parameters:

data: Hourly or 15-min rainfall data
target_gauge_col: Column with rainfall data
gauge_lat: latitude of the rain gauge
gauge_lon: longitude of the rain gauge
wet_day_threshold: Threshold for rainfall intensity in one day (default is 1 mm)
accumulation_multiplying_factor: Factor by which to multiply the SDII value to identify an accumulation of rain recordings
accumulation_threshold: Rain accumulation for detecting possible daily accumulations

Returns:

data_w_daily_accumulation_flags: Data with daily accumulation flags

Notes

This method returns only 0 and 1 flags. This differs from the description of the daily accumulation check from IntenseQC. This decision was taken as the IntenseQC Python package only returns 0 and 1 flags.

rainfallqc.checks.timeseries_checks.check_dry_period_cdd(data: DataFrame, target_gauge_col: str, time_res: str, gauge_lat: int | float, gauge_lon: int | float) → DataFrame[source]

Identify suspiciously long dry periods in time-series using the ETCCDI Consecutive Dry Days (CDD) index.

This is QC12 from the IntenseQC framework.

Parameters:

data: Rainfall data
target_gauge_col: Column with rainfall data
time_res: Temporal resolution of the time series either ‘15m’, ‘daily’ or ‘hourly’
gauge_lat: latitude of the rain gauge
gauge_lon: longitude of the rain gauge

Returns:

data_w_dry_spell_flags: Data with dry spell flags

rainfallqc.checks.timeseries_checks.check_monthly_accumulations(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float, min_dry_spell_duration_in_days: int = 28, wet_day_threshold: int | float = 1.0, accumulation_multiplying_factor: int | float = 2.0, accumulation_threshold: float = None) → DataFrame[source]

Identify suspicious periods when an hour of rainfall is preceded by 1 month with no rain.

Flags two different types of accumulations:

1) dry, when the isolated high value is followed by no rain
2) wet, when the isolated value is followed by a few more wet values

Uses a simple precipitation intensity index (SDII) from ETCCDI.

This is QC14 from the IntenseQC framework.

Parameters:

data: Daily or Hourly or 15 min rainfall data
target_gauge_col: Column with rainfall data
gauge_lat: latitude of the rain gauge
gauge_lon: longitude of the rain gauge
min_dry_spell_duration_in_days: Minimum number of days in dry spell preceding monthly accumulation (default is 28 i.e. Feb)
wet_day_threshold: Threshold for rainfall intensity in one day (default is 1 mm)
accumulation_multiplying_factor: Factor by which to multiply the SDII value to identify an accumulation of rain recordings (default is 2)
accumulation_threshold: Rain accumulation for detecting possible monthly accumulations

Returns:

data_w_monthly_accumulation_flags: Data with monthly accumulation flags

Notes

The original method filters out dry spells less than

rainfallqc.checks.timeseries_checks.check_streaks(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float, smallest_measurable_rainfall_amount: float, accumulation_threshold: float = None) → DataFrame[source]

Check for suspected repeated values.

Flags (TODO: could change numbers as original includes unhelpful 2):

1, if streaks of 2 or more repeated values exceeding 2 * mean wet day rainfall
3, if streaks of 12 or more greater than smallest measurable rainfall amount
4, if streaks of 24 or more greater than zero
5, if period of zeros bounded by streaks of >= 24

This is QC15 from the IntenseQC framework.

Parameters:

data: Hourly or 15-min data with rainfall.
target_gauge_col: Column with rainfall data.
gauge_lat: latitude of the rain gauge.
gauge_lon: longitude of the rain gauge.
smallest_measurable_rainfall_amount: Resolution of rainfall data (i.e. minimum rainfall recording).
accumulation_threshold: Rain accumulation for detecting possible monthly accumulations

Returns:

data_w_streak_flags: Data with streak flags.

rainfallqc.checks.timeseries_checks.compute_dry_spell_days(dry_spell_data: Dataset) → Dataset[source]

Compute dry spells in days from ETCCDI Consecutive Dry Days data.

Parameters:

dry_spell_data: ETCCDI CDD index data

Returns:

dry_spell_days: ETCCDI CDD index data with CDD_days variable

rainfallqc.checks.timeseries_checks.fill_in_monthly_accumulation_flags(monthly_accumulation_flags: DataFrame, time_step: str, min_dry_spell_duration: int | float, max_dry_spell_duration: int | float) → DataFrame[source]

Fill in flags preceding monthly accumulation.

Parameters:

monthly_accumulation_flags: Rainfall data with monthly accumulation flag and dry spell info
time_step: Time step of data i.e. ‘1h’, ‘1d’, ‘15m’.
min_dry_spell_duration: Minimum dry spell duration
max_dry_spell_duration: Maximum dry spell duration

Returns:

monthly_accumulation_flags: Data with accumulation flag filled in

rainfallqc.checks.timeseries_checks.flag_accumulation_based_on_next_dry_spell_duration(data: DataFrame, min_dry_spell_duration: int | float, accumulation_col_name: str) → DataFrame[source]

Flag possible accumulation based on subsequent minimum dry spell duration.

Flags:

3, if dry spell followed with high value then wet period (wet)
1, if dry spell followed with high value then no rain for next 23 hours (dry)
0, if neither

Parameters:

data: Rainfall data with dry spell info and possible accumulation label
min_dry_spell_duration: Minimum dry spell duration
accumulation_col_name: Name for accumulation column

Returns:

data_w_flag: Data with accumulation flag

rainfallqc.checks.timeseries_checks.flag_accumulation_periods(data: DataFrame, target_gauge_col: str, accumulation_threshold: float, accumulation_period_in_hours: int) → ndarray[source]

Flag accumulation in a given period of hourly data.

TODO: make work for daily using: DAILY_DIVIDING_FACTOR

Parameters:

data: Hourly rainfall data
target_gauge_col: Column with rainfall data
accumulation_threshold: Rain accumulation for detecting possible period accumulations
accumulation_period_in_hours: Accumulation period in hours

Returns:

pa_flags: Accumulation flags

rainfallqc.checks.timeseries_checks.flag_dry_spell_duration(dry_spell_lengths: DataFrame, ref_dry_spell_length: int | float, time_res: str) → DataFrame[source]

Flag the dry spell duration using reference local dry spell length.

Parameters:

dry_spell_lengths: Data with dry spell lengths
ref_dry_spell_length: Reference dry spell length
time_res: Temporal resolution of the time series either ‘daily’ or ‘hourly’

Returns:

dry_spell_lengths_flags: Data with dry spell flags

rainfallqc.checks.timeseries_checks.flag_n_hours_accumulation_based_on_threshold(period_rain_vals: Series, accumulation_threshold: float, n_hours: int) → int | float[source]

Flag a period as accumulation if a value is preceded by n hourly recordings of 0.

Parameters:

period_rain_vals: One period of rain values
accumulation_threshold: Reference SDII threshold
n_hours: Number of hours in reference period

Returns:

flag: 1 if period accumulation, otherwise 0
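
The rule above can be sketched on a plain Python list. This is an illustrative sketch, not the library's implementation: the helper name is hypothetical, and it assumes that one period holds exactly n_hours values, that the candidate value is the last one, and that "exceeds" means strictly greater than the threshold.

```python
def sketch_flag_n_hours_accumulation(period_rain_vals, accumulation_threshold, n_hours):
    """Return 1 if the period looks like an accumulation, otherwise 0."""
    values = list(period_rain_vals)
    # Assumption: one period holds exactly n_hours values, with the candidate last.
    if n_hours < 1 or len(values) != n_hours:
        return 0
    *preceding, last = values
    # Every recording before the candidate must be exactly zero.
    all_dry_before = all(v == 0 for v in preceding)
    return 1 if all_dry_before and last > accumulation_threshold else 0


# 23 dry hours followed by one large value.
example_period = [0.0] * 23 + [12.4]
flag = sketch_flag_n_hours_accumulation(example_period, accumulation_threshold=5.0, n_hours=24)
```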

rainfallqc.checks.timeseries_checks.flag_streaks_exceeding_smallest_measurable_rainfall_amount(data: DataFrame, target_gauge_col: str, streak_length: int, smallest_measurable_rainfall_amount: float) → DataFrame[source]

Flag streaks exceeding smallest measurable rainfall amount in data.

Parameters:

data: Rainfall data with streak_id.
target_gauge_col: Column with rainfall data.
streak_length: Only streaks longer than this will be considered.
smallest_measurable_rainfall_amount: Resolution of rainfall data (i.e. minimum rainfall recording).

Returns:

data_w_flags: Data with streak flag 3

rainfallqc.checks.timeseries_checks.flag_streaks_exceeding_wet_day_rainfall_threshold(data: DataFrame, target_gauge_col: str, streak_length: int, accumulation_threshold: float) → DataFrame[source]

Flag values exceeding wet day rainfall accumulation threshold.

Parameters:

data: Rainfall data with streak_id.
target_gauge_col: Column with rainfall data.
streak_length: Only streaks longer than this will be considered
accumulation_threshold: Threshold for rain accumulation.

Returns:

data_w_flags: Data with streak flag 1

rainfallqc.checks.timeseries_checks.flag_streaks_exceeding_zero(data: DataFrame, target_gauge_col: str, streak_length: int) → DataFrame[source]

Flag streaks of repeated values exceeding zero.

Parameters:

data: Rainfall data with streak_id.
target_gauge_col: Column with rainfall data.
streak_length: Only streaks longer than this will be considered.

Returns:

data_w_flags: Data with streak flag 4

rainfallqc.checks.timeseries_checks.flag_streaks_of_zero_bounded_by_days(data: DataFrame, target_gauge_col: str, time_res: str) → DataFrame[source]

Flag streaks of zeros, bounded by records, whose length is a multiple of 24 hours.

Parameters:

data: Hourly, 15-min or daily data with rainfall.
target_gauge_col: Column with rainfall data.
time_res: Time resolution: “1h”, “15m”, “1d”, or “hourly”, “daily”

Returns:

streaks_w_flag5: Data with streak flag 5.
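
One possible reading of this check, sketched on a plain list of hourly values. The helper name is hypothetical, and the sketch assumes that "bounded" means a non-zero record on both sides of the zero streak and that the flag value is 5, as the return name suggests. The library itself works on DataFrames and may differ in detail.

```python
from itertools import groupby


def sketch_zero_streaks_multiple_of_day(hourly_rain, steps_per_day=24):
    """Mark zero streaks whose length is a whole number of days with flag 5."""
    flags = [0] * len(hourly_rain)
    position = 0
    for value, run in groupby(hourly_rain):
        run_length = len(list(run))
        end = position + run_length
        # Assumption: the streak must have a record before and after it.
        bounded = position > 0 and end < len(hourly_rain)
        if value == 0 and bounded and run_length % steps_per_day == 0:
            flags[position:end] = [5] * run_length
        position = end
    return flags


# One wet hour, exactly 24 dry hours, then another wet hour.
example_flags = sketch_zero_streaks_multiple_of_day([1.0] + [0.0] * 24 + [2.0])
```

For 15-minute data the same sketch would take steps_per_day=96.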

rainfallqc.checks.timeseries_checks.get_accumulation_threshold(etccdi_sdii: float, gauge_sdii: float, accumulation_multiplying_factor: int | float) → float[source]

Get rainfall accumulation threshold based on the ETCCDI or rain gauge Simple Precipitation Intensity Index (SDII).

Parameters:

etccdi_sdii: SDII value from ETCCDI
gauge_sdii: SDII value from rain gauge
accumulation_multiplying_factor: Factor by which to multiply the SDII value to identify an accumulation of rain recordings

Returns:

accumulation_threshold: Reference SDII threshold
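
A minimal sketch of how such a threshold could be derived. The helper name is hypothetical, and the fallback order is an assumption: the ETCCDI value is used when it is a valid number, and the gauge value otherwise. The actual selection rule is not stated in the description above.

```python
import math


def sketch_accumulation_threshold(etccdi_sdii, gauge_sdii, accumulation_multiplying_factor):
    """Scale an SDII value by the multiplying factor to get a threshold."""
    # Assumption: prefer the ETCCDI SDII and fall back to the gauge SDII when it is NaN.
    sdii = gauge_sdii if math.isnan(etccdi_sdii) else etccdi_sdii
    return sdii * accumulation_multiplying_factor


threshold_from_etccdi = sketch_accumulation_threshold(6.5, 4.0, 2)
threshold_from_gauge = sketch_accumulation_threshold(float("nan"), 4.0, 2)
```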

rainfallqc.checks.timeseries_checks.get_accumulation_threshold_from_etccdi(data: DataFrame, target_gauge_col: str, time_res: str, gauge_lat: int | float, gauge_lon: int | float, wet_day_threshold: float, accumulation_multiplying_factor: float) → float[source]

Get rain accumulation threshold from ETCCDI data.

Parameters:

data: Rainfall data.
target_gauge_col: Column with rainfall data.
time_res: Temporal resolution of the time series either ‘15m’, ‘daily’ or ‘hourly’
gauge_lat: latitude of the rain gauge.
gauge_lon: longitude of the rain gauge.
wet_day_threshold: Threshold for rainfall intensity in one day (whether it is a wet day or not)
accumulation_multiplying_factor: Factor by which to multiply the SDII value to identify an accumulation of rain recordings

Returns:

accumulation_threshold: Rain accumulation threshold, e.g. 2 * SDII (the simple precipitation intensity index)

rainfallqc.checks.timeseries_checks.get_consecutive_dry_days(gauge_dry_spells: DataFrame) → DataFrame[source]

Get consecutive groups of 0 rainfall days.

Parameters:

gauge_dry_spells: Data with ‘is_dry’ column

Returns:

gauge_dry_spell_groups: Data with group ids for consecutive dry days
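
The grouping idea can be sketched on a plain list of ‘is_dry’ values. The helper name and the numbering scheme (ids start at 1, wet records get None) are assumptions for illustration. The library returns a DataFrame and may label groups differently.

```python
def sketch_dry_group_ids(is_dry):
    """Assign one id per run of consecutive dry records; wet records get None."""
    group_ids = []
    current = 0
    previous_dry = False
    for dry in is_dry:
        # A new group starts whenever a dry record follows a wet one.
        if dry and not previous_dry:
            current += 1
        group_ids.append(current if dry else None)
        previous_dry = bool(dry)
    return group_ids


example_groups = sketch_dry_group_ids([1, 1, 0, 1, 0, 0, 1, 1, 1])
```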

rainfallqc.checks.timeseries_checks.get_daily_non_wr_data(data: DataFrame, target_gauge_col: str, time_res: str) → DataFrame[source]

Get daily non-world record data.

Parameters:

data: Hourly rainfall data
target_gauge_col: Column with rainfall data
time_res: Temporal resolution of the time series either ‘15m’, ‘daily’ or ‘hourly’

Returns:

daily_data_not_wr: Daily rainfall data with world records filtered out

rainfallqc.checks.timeseries_checks.get_dry_spell_duration(data: DataFrame, target_gauge_col: str) → DataFrame[source]

Get consecutive dry spell duration.

Parameters:

data: Rainfall data
target_gauge_col: Column with rainfall data

Returns:

gauge_dry_spell_lengths: Data with dry spell start, end and duration
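
A rough sketch of extracting dry spell start, end and duration from parallel lists of timestamps and rainfall values. The helper name is hypothetical, and expressing the duration as a count of time steps is an assumption. The library may report durations in other units.

```python
from datetime import datetime, timedelta


def sketch_dry_spells(times, rain):
    """Return (start_time, end_time, n_steps) for each run of zero rainfall."""
    spells = []
    start = None
    for i, value in enumerate(rain):
        if value == 0 and start is None:
            start = i  # a dry spell begins here
        elif value != 0 and start is not None:
            spells.append((times[start], times[i - 1], i - start))
            start = None
    # Close a dry spell that runs to the end of the record.
    if start is not None:
        spells.append((times[start], times[-1], len(rain) - start))
    return spells


example_times = [datetime(2020, 1, 1) + timedelta(hours=h) for h in range(6)]
example_spells = sketch_dry_spells(example_times, [0.0, 0.0, 1.2, 0.0, 0.0, 0.0])
```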

rainfallqc.checks.timeseries_checks.get_dry_spell_info(data: DataFrame, target_gauge_col: str) → DataFrame[source]

Get summary of dry spells (i.e. duration and first wet value after dry and previous and next dry spells duration).

Parameters:

data: Hourly rainfall data
target_gauge_col: Column with rainfall data

Returns:

gauge_dry_spell_info: Data with dry spell information

rainfallqc.checks.timeseries_checks.get_first_wet_after_dry_spell(data: DataFrame, target_gauge_col: str) → DataFrame[source]

Get first non-zero rainfall value after dry spell.

Parameters:

data: Rainfall data
target_gauge_col: Column with rainfall data

Returns:

data_w_first_wet: Data with binary column denoting first wet after dry spell
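
The binary marker can be sketched on a plain list of rainfall values. The helper name is hypothetical, and the sketch assumes that a single zero record already counts as a dry spell. The library's own definition may require a longer dry period.

```python
def sketch_first_wet_after_dry(rain):
    """Mark with 1 each non-zero value that directly follows a zero value."""
    markers = []
    previous = None
    for value in rain:
        # The first record has no predecessor, so it is never marked.
        markers.append(1 if previous == 0 and value > 0 else 0)
        previous = value
    return markers


example_markers = sketch_first_wet_after_dry([0.0, 0.0, 3.1, 0.4, 0.0, 2.0])
```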

rainfallqc.checks.timeseries_checks.get_local_etccdi_sdii_mean(gauge_lat: int | float, gauge_lon: int | float) → float[source]

Get the mean Simple Precipitation Intensity Index (SDII) from nearby ETCCDI data.

Parameters:

gauge_lat: latitude of the rain gauge
gauge_lon: longitude of the rain gauge

Returns:

nearby_etccdi_sdii_mean: Local mean SDII value

rainfallqc.checks.timeseries_checks.get_possible_accumulations(gauge_dry_spell_info: DataFrame, target_gauge_col: str, accumulation_threshold: float) → DataFrame[source]

Get possible accumulations as 0 or 1 based on dry spell info.

Parameters:

gauge_dry_spell_info: Rainfall data with columns with dry spell info (durations, first_wet_after_dry, etc.)
target_gauge_col: Column with rainfall data
accumulation_threshold: Threshold of rainfall intensity

Returns:

gauge_data_possible_accumulations: Data where 1 marks a possible accumulation, otherwise 0.

rainfallqc.checks.timeseries_checks.get_streaks_above_threshold(data: DataFrame, target_gauge_col: str, streak_length: int, value_threshold: int | float) → DataFrame[source]

Get streak groups above given threshold.

Parameters:

data: Rainfall data with streak_id.
target_gauge_col: Column with rainfall data.
streak_length: Minimum length of streaks.
value_threshold: Threshold to check.

Returns:

streaks_above_accumulation: All streaks above the given value
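
A minimal sketch of selecting streaks by length and value, using a plain list instead of a DataFrame with a streak_id column. The helper name is hypothetical. The sketch treats streak_length as an inclusive minimum and the threshold as a strict lower bound, both of which are assumptions.

```python
from itertools import groupby


def sketch_streaks_above_threshold(values, streak_length, value_threshold):
    """Return (start_index, run_length, value) for each qualifying streak."""
    kept = []
    position = 0
    for value, run in groupby(values):
        run_length = len(list(run))
        # Assumption: length is an inclusive minimum, threshold a strict lower bound.
        if run_length >= streak_length and value > value_threshold:
            kept.append((position, run_length, value))
        position += run_length
    return kept


example_values = [0.0, 0.0, 0.0, 1.2, 1.2, 1.2, 1.2, 0.2, 0.2]
example_streaks = sketch_streaks_above_threshold(example_values, streak_length=3, value_threshold=0)
```

In this example the run of zeros is long enough but does not exceed the threshold, and the run of 0.2 exceeds the threshold but is too short, so only the run of 1.2 is kept.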

rainfallqc.checks.timeseries_checks.get_streaks_of_repeated_values(data: DataFrame, data_col: str) → DataFrame[source]

Get streaks of repeated values in time series.

Parameters:

data: Data with time column.
data_col: Column with values to check streaks in.

Returns:

streak_data: Data with streak column.
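
Streak detection is essentially run-length labelling. The sketch below shows the idea on a plain list with a hypothetical helper name. Numbering streaks from 0 is an assumption, and the library adds its streak column to a DataFrame instead.

```python
from itertools import groupby


def sketch_streak_ids(values):
    """Give every run of identical consecutive values its own integer id."""
    streak_ids = []
    for streak_id, (_, run) in enumerate(groupby(values)):
        # Repeat the id once for each member of the run.
        streak_ids.extend([streak_id] * len(list(run)))
    return streak_ids


example_streak_ids = sketch_streak_ids([0.2, 0.2, 0.2, 0.0, 0.4, 0.4])
```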

rainfallqc.checks.timeseries_checks.get_surrounding_dry_spell_lengths(data: DataFrame) → DataFrame[source]

Make prev_dry_spell and next_dry_spell columns from dry_spell_lengths.

Parameters:

data: Data with dry_spell_lengths

Returns:

data: Data with columns of previous and next dry spell durations

rainfallqc.checks.timeseries_checks.join_dry_spell_data_back_to_original(data: DataFrame, dry_spell_lengths_flags: DataFrame) → DataFrame[source]

Join dry spell flags back onto the original rainfall data.

Parameters:

data: Rainfall data
dry_spell_lengths_flags: Data with dry spell flags

Returns:

dry_spell_flag_data: Data with dry spell flags

2.1.1.6. rainfallqc.checks.pypwsqc_filters module

Quality control checks translated from the pyPWSQC framework (https://pypwsqc.readthedocs.io/en/latest/).

The PWSQC framework includes filters originally developed for automated personal weather stations (PWS) within the COST Action OPENSENSE.

Functions prefixed with ‘run_’ and ‘check_’ correspond to the algorithms from pyPWSQC.

Functions are ordered alphabetically.

rainfallqc.checks.pypwsqc_filters.check_faulty_zeros(neighbour_data: DataFrame, neighbour_metadata: DataFrame, neighbouring_gauge_ids: List[str], neighbour_metadata_gauge_id_col: str, time_res: str, projection: str, nint: int, n_stat: int, max_distance_for_neighbours: int | float = 10000.0, time_units: str = 'seconds since 1970-01-01 00:00:00', rainfall_attributes: dict = {'coverage_contant_type': 'physicalMeasurement', 'long_name': 'rainfall amount per time unit', 'name': 'rainfall', 'units': 'mm'}, lat_lon_attributes: dict = {'unit': 'degrees in WGS84 projection'}, global_attributes: dict = None) → Dataset[source]

Will flag faulty zeros based on neighbours …

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col
neighbour_metadata: Metadata for the rainfall data with ‘latitude’ and ‘longitude’
neighbour_metadata_gauge_id_col: Column with the gauge ID
target_gauge_col: Target gauge column
neighbouring_gauge_ids: List of ids with neighbouring gauges
time_res: Time resolution of data
projection: cartesian/metric coordinate system
nint: Number of intervals
n_stat: Number of stations
max_distance_for_neighbours: Maximum distance to consider for neighbours
time_units: Units and encoding of the ‘time’ column
rainfall_attributes: Attributes for rainfall in the xarray Dataset
lat_lon_attributes: Attributes for lat and lon in the xarray Dataset
global_attributes: Global attributes for xarray Dataset

Returns:

neighbour_data_ds_filtered: Data with flags for faulty zeros

Examples

available at: https://pypwsqc.readthedocs.io/en/latest/notebooks/merged_filters.html

rainfallqc.checks.pypwsqc_filters.check_high_influx_filter(neighbour_data: DataFrame) → None[source]

High influx filter.

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col

Returns:

neighbour_data: todo

rainfallqc.checks.pypwsqc_filters.check_station_outlier(neighbour_data: DataFrame, neighbour_metadata: DataFrame, neighbouring_gauge_ids: List[str], neighbour_metadata_gauge_id_col: str, time_res: str, projection: str, evaluation_period: int, mmatch: int, gamma: float, n_stat: int, max_distance_for_neighbours: int | float = 10000.0, time_units: str = 'seconds since 1970-01-01 00:00:00', rainfall_attributes: dict = {'coverage_contant_type': 'physicalMeasurement', 'long_name': 'rainfall amount per time unit', 'name': 'rainfall', 'units': 'mm'}, lat_lon_attributes: dict = {'unit': 'degrees in WGS84 projection'}, global_attributes: dict = None) → Dataset[source]

Station outlier.

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col
neighbour_metadata: Metadata for the rainfall data with ‘latitude’ and ‘longitude’
neighbour_metadata_gauge_id_col: Column with the gauge ID
target_gauge_col: Target gauge column
neighbouring_gauge_ids: List of ids with neighbouring gauges
time_res: Time resolution of data
projection: cartesian/metric coordinate system
evaluation_period: length of (rolling) window for correlation calculation
mmatch: threshold for number of matching rainy intervals in evaluation period
gamma: threshold for rolling median pearson correlation
n_stat: Number of stations
max_distance_for_neighbours: Maximum distance to consider for neighbours
time_units: Units and encoding of the ‘time’ column
rainfall_attributes: Attributes for rainfall in the xarray Dataset
lat_lon_attributes: Attributes for lat and lon in the xarray Dataset
global_attributes: Global attributes for xarray Dataset

Returns:

neighbour_data_ds_filtered: Data with flags for station outliers

Examples

available at: https://pypwsqc.readthedocs.io/en/latest/notebooks/merged_filters.html

rainfallqc.checks.pypwsqc_filters.compute_distance_matrix(neighbour_data_ds: Dataset) → Dataset[source]

Compute a distance matrix.

Parameters:

neighbour_data_ds: xarray dataset of neighbour data

Returns:

distance_matrix: A distance matrix of all neighbouring gauges
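
To illustrate what a distance matrix between gauges contains, here is a sketch using plain dictionaries rather than an xarray Dataset. The helper name and gauge ids are hypothetical. It assumes the coordinates are already in a projected (metric) system, so straight-line Euclidean distance in metres is appropriate.

```python
import math


def sketch_distance_matrix(coords):
    """Return pairwise Euclidean distances between gauges as nested dicts."""
    gauge_ids = list(coords)
    return {
        a: {b: math.dist(coords[a], coords[b]) for b in gauge_ids}
        for a in gauge_ids
    }


# Hypothetical projected coordinates (x, y) in metres.
example_coords = {"g1": (0.0, 0.0), "g2": (3000.0, 4000.0)}
example_matrix = sketch_distance_matrix(example_coords)
```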

rainfallqc.checks.pypwsqc_filters.convert_neighbour_data_to_xarray(neighbour_data: DataFrame, neighbour_metadata: DataFrame, projection: str, time_units: str = 'seconds since 1970-01-01 00:00:00', rainfall_attributes: dict = {'coverage_contant_type': 'physicalMeasurement', 'long_name': 'rainfall amount per time unit', 'name': 'rainfall', 'units': 'mm'}, lat_lon_attributes: dict = {'unit': 'degrees in WGS84 projection'}, global_attributes: dict = None) → Dataset[source]

Convert neighbour data in polars format to xarray dataset.

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col
neighbour_metadata: Metadata for the rainfall data with ‘latitude’ and ‘longitude’
projection: cartesian/metric coordinate system
time_units: Units and encoding of the ‘time’ column
rainfall_attributes: Attributes for rainfall in the xarray Dataset
lat_lon_attributes: Attributes for lat and lon in the xarray Dataset
global_attributes: Global attributes for xarray Dataset

Returns:

neighbour_data_ds: xarray dataset with assigned attributes

rainfallqc.checks.pypwsqc_filters.run_bias_correction(neighbour_data: DataFrame) → None[source]

Bias correction.

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col

Returns:

neighbour_data: todo

rainfallqc.checks.pypwsqc_filters.run_event_based_filter(neighbour_data: DataFrame) → None[source]

Event based filter (EBF).

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col

Returns:

neighbour_data: todo

rainfallqc.checks.pypwsqc_filters.run_indicator_correlation(neighbour_data: DataFrame) → None[source]

Run indicator correlation.

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col

Returns:

neighbour_data: todo

rainfallqc.checks.pypwsqc_filters.run_peak_removal(neighbour_data: DataFrame) → None[source]

Peak removal.

Parameters:

neighbour_data: Rainfall data of neighbouring gauges with time col

Returns:

neighbour_data: todo

rainfallqc.checks.pypwsqc_filters.subset_distance_matrix(neighbour_data_ds: Dataset, distance_matrix: Dataset, max_distance_for_neighbours: int | float) → Dataset[source]

Subset a distance matrix to gauges within the maximum neighbour distance.

Parameters:

neighbour_data_ds: xarray dataset of neighbour data
distance_matrix: A distance matrix of all neighbouring gauges
max_distance_for_neighbours: Maximum distance to consider for neighbours

Returns:

neighbour_data_ds: A distance matrix of all neighbouring gauges
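
Subsetting by distance can be sketched on a nested-dictionary distance matrix. The helper name, the gauge ids and the inclusive comparison are assumptions for illustration. The library operates on xarray Datasets.

```python
def sketch_neighbours_within(distance_matrix, max_distance_for_neighbours):
    """For each gauge, list the other gauges no farther than the maximum distance."""
    return {
        gauge: sorted(
            other
            for other, distance in row.items()
            if other != gauge and distance <= max_distance_for_neighbours
        )
        for gauge, row in distance_matrix.items()
    }


# Hypothetical distances in metres between three gauges.
example_distances = {
    "g1": {"g1": 0.0, "g2": 5000.0, "g3": 20000.0},
    "g2": {"g1": 5000.0, "g2": 0.0, "g3": 18000.0},
    "g3": {"g1": 20000.0, "g2": 18000.0, "g3": 0.0},
}
example_neighbours = sketch_neighbours_within(example_distances, 10000.0)
```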

2.1.1.7. Module contents

Quality checks.