2.1.1. rainfallqc.checks package

2.1.1.1. Submodules

2.1.1.2. rainfallqc.checks.comparison_checks module

Quality control checks relying on comparison with a benchmark dataset.

Comparison checks are defined as QC checks that: “detect abnormalities in rainfall record based on benchmarks.”

Classes and functions ordered by appearance in IntenseQC framework.

rainfallqc.checks.comparison_checks.add_daily_year_col(data: DataFrame) → DataFrame

Make a year column for the data. This method first upsamples the data to daily resolution.

Parameters:
data

Rainfall data

Returns:
data_w_year_col

Rainfall data with year column

rainfallqc.checks.comparison_checks.check_annual_exceedance_etccdi_prcptot(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float) → list

Check annual exceedance of maximum PRCPTOT from ETCCDI dataset.

This is QC9 from the IntenseQC framework.

Parameters:
data

Rainfall data

target_gauge_col

Column with rainfall data

gauge_lat

latitude of the rain gauge

gauge_lon

longitude of the rain gauge

Returns:
exceedance_flags

List of flags (see exceedance_flagger function)

rainfallqc.checks.comparison_checks.check_annual_exceedance_etccdi_r99p(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float) → list

Check annual exceedance of maximum R99p from ETCCDI dataset.

This is QC8 from the IntenseQC framework.

Parameters:
data

Rainfall data

target_gauge_col

Column with rainfall data

gauge_lat

latitude of the rain gauge

gauge_lon

longitude of the rain gauge

Returns:
flag_list

List of flags

rainfallqc.checks.comparison_checks.check_exceedance_of_rainfall_world_record(data: DataFrame, target_gauge_col: str, time_res: str) → DataFrame

Check exceedance of rainfall world record.

See also utils/stats.py for world record sources.

This is QC10 from the IntenseQC framework.

Parameters:
data

Rainfall data

target_gauge_col

Column with rainfall data

time_res

Time resolution

Returns:
data_w_flags

Rainfall data with exceedance of World Record (see flag_exceedance_of_ref_val_as_col function)

rainfallqc.checks.comparison_checks.check_hourly_exceedance_etccdi_rx1day(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float) → DataFrame

Check hourly exceedance of the ETCCDI Rx1day (maximum 1-day rainfall) record.

This is QC11 from the IntenseQC framework.

Parameters:
data

Rainfall data

target_gauge_col

Column with rainfall data

gauge_lat

latitude of the rain gauge

gauge_lon

longitude of the rain gauge

Returns:
data_w_flags

Rainfall data with exceedance of Rx1day Record (see flag_exceedance_of_ref_val_as_col function)

rainfallqc.checks.comparison_checks.flag_exceedance_of_max_etccdi_variable(annual_sum_rainfall: DataFrame, target_gauge_col: str, nearby_etccdi_data: Dataset, etccdi_var: str) → list

Flag exceedance of maximum ETCCDI variable, comparing the maximum sums of each year.

Parameters:
annual_sum_rainfall

Rainfall data as annual sums

target_gauge_col

Column with rainfall data

nearby_etccdi_data

ETCCDI data with given variable to check

etccdi_var

variable to load from ETCCDI

Returns:
exceedance_flags

Flags of exceedances of max ETCCDI value

rainfallqc.checks.comparison_checks.flag_exceedance_of_ref_val(val: int | float, ref_val: int | float) → int

Exceedance flagger from intenseqc.

Parameters:
val

Value to check

ref_val

Reference value to compare against

Returns:
Flag

Exceedance flag
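
As a rough illustration of an exceedance flagger that yields flags between 0 and 4, the sketch below compares a value against multiples of a reference value. The function name flag_exceedance_sketch and the multipliers 1.0, 1.2, 1.33 and 1.5 are assumptions made for this example; the documentation above does not state the thresholds rainfallqc uses.

```python
def flag_exceedance_sketch(val, ref_val, factors=(1.0, 1.2, 1.33, 1.5)):
    """Return 0 if val is below ref_val, otherwise 1-4 for growing exceedance.

    The multipliers in `factors` are illustrative assumptions, not values
    taken from the rainfallqc documentation.
    """
    flag = 0
    for level, factor in enumerate(factors, start=1):
        # Keep the highest level whose scaled reference value is reached.
        if val >= ref_val * factor:
            flag = level
    return flag


if __name__ == "__main__":
    for value in (90.0, 100.0, 125.0, 140.0, 200.0):
        print(value, flag_exceedance_sketch(value, ref_val=100.0))
```

Keeping the multipliers in a tuple means the number of flag levels follows directly from the thresholds supplied.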

rainfallqc.checks.comparison_checks.flag_exceedance_of_ref_val_as_col(data: DataFrame, target_gauge_col: str, ref_val: int | float, new_col_name: str) → DataFrame

Flag exceedance of maximum reference value and return as column.

Used in QC11 of the IntenseQC framework. TODO: could this be used in QC8+9?

Parameters:
data

Rainfall data.

target_gauge_col

Column with rainfall data

ref_val

Reference value.

new_col_name

New column name.

Returns:
data

Data with exceedance flags between 0 and 4.

rainfallqc.checks.comparison_checks.get_sum_rainfall_above_percentile_per_year(data: DataFrame, target_gauge_col: str, percentile: float) → DataFrame

Get the annual sum of rainfall above a given percentile.

Parameters:
data

Rainfall data

target_gauge_col

Column with rainfall data

percentile

nth percentile to check for values above

Returns:
exceedance_flags

List of flags (see exceedance_flagger function)

2.1.1.3. rainfallqc.checks.gauge_checks module

Quality control checks examining suspicious rain gauges.

Gauge checks are defined as QC checks that: “detect abnormalities in summary and descriptive statistics of rain gauges.”

Classes and functions ordered by appearance in IntenseQC framework.

rainfallqc.checks.gauge_checks.check_breakpoints(data: DataFrame, target_gauge_col: str, p_threshold: float = 0.01) → int

Use a Pettitt test on rainfall data to check for breakpoints.

This is QC6 from the IntenseQC framework.

Parameters:
data

Rainfall data.

target_gauge_col

Column with rainfall data.

p_threshold

Significance level for the test.

Returns:
flag : int

1 if breakpoint is detected (p < p_threshold), 0 otherwise
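
For readers unfamiliar with the Pettitt test, the sketch below computes the standard statistic K = max |U_t| and the usual approximate p-value 2 exp(-6 K^2 / (n^3 + n^2)), then converts it to a 0/1 flag. It is a self-contained illustration on a plain list of values; which implementation rainfallqc calls internally is not stated above, and the names pettitt_test and breakpoint_flag are hypothetical.

```python
import math


def pettitt_test(values):
    """Return (change_index, K, approximate p-value) for a Pettitt test."""
    n = len(values)
    best_k, best_t, u = 0, 0, 0
    for t in range(n - 1):
        # U_t = U_{t-1} + sum_j sgn(x_t - x_j), accumulated incrementally.
        u += sum((values[t] > x) - (values[t] < x) for x in values)
        if abs(u) > best_k:
            best_k, best_t = abs(u), t
    p_value = min(1.0, 2.0 * math.exp(-6.0 * best_k**2 / (n**3 + n**2)))
    return best_t, best_k, p_value


def breakpoint_flag(values, p_threshold=0.01):
    """Return 1 if the approximate p-value is below p_threshold, else 0."""
    _, _, p_value = pettitt_test(values)
    return int(p_value < p_threshold)


if __name__ == "__main__":
    step_series = [1.0] * 10 + [10.0] * 10  # abrupt shift halfway through
    print(pettitt_test(step_series))
    print(breakpoint_flag(step_series))
```

The returned index is the last position of the first segment, counted from zero.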

rainfallqc.checks.gauge_checks.check_intermittency(data: DataFrame, target_gauge_col: str, no_data_threshold: int = 2, annual_count_threshold: int = 5) → list

Return years where more than five periods of missing data are bounded by zeros.

TODO: split into multiple sub-functions and write more tests! This is QC5 from the IntenseQC framework.

Parameters:
data

Rainfall data

target_gauge_col

Column with rainfall data

no_data_threshold

Number of missing values needed to be counted as a no data period (default: 2 days)

annual_count_threshold

Number of missing data periods above no_data_threshold per year (default: 5)

Returns:
years_w_intermittency

List of years with intermittency issues.
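
The idea behind the intermittency check can be sketched on plain Python lists: find runs of missing values, keep those that are long enough and have a zero on both sides, and count them per year. The function name, the use of None for missing data, attributing a gap to the year in which it starts, and the strict "more than" comparison are all assumptions made for this example.

```python
from collections import Counter
from itertools import groupby


def years_with_intermittency(years, values, no_data_threshold=2,
                             annual_count_threshold=5):
    """years and values are parallel lists; missing data is None."""
    # Collapse the series into runs of (is_missing, start_index, length).
    runs, start = [], 0
    for is_missing, group in groupby(values, key=lambda v: v is None):
        length = len(list(group))
        runs.append((is_missing, start, length))
        start += length

    counts = Counter()
    for is_missing, begin, length in runs:
        end = begin + length  # index of the first value after the run
        if not is_missing or length < no_data_threshold:
            continue
        # Only gaps with a zero directly before and after are counted.
        if begin > 0 and end < len(values) \
                and values[begin - 1] == 0 and values[end] == 0:
            counts[years[begin]] += 1  # assumption: gap belongs to its start year
    return sorted(y for y, c in counts.items() if c > annual_count_threshold)


if __name__ == "__main__":
    vals = [0, None, None] * 6 + [0] + [0, None, None] * 2 + [0]
    yrs = [2000] * 19 + [2001] * 7
    print(years_with_intermittency(yrs, vals))
```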

rainfallqc.checks.gauge_checks.check_min_val_change(data: DataFrame, target_gauge_col: str, expected_min_val: float) → list

Return years when the minimum recorded value changes.

Used to determine whether there are possible changes to the measuring equipment. This is QC7 from the IntenseQC framework.

Parameters:
data

Rainfall data

target_gauge_col

Column with rainfall data.

expected_min_val

Expected minimum rainfall value, i.e. the resolution of the data.

Returns:
yr_list

List of years with minimum value changes.
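
A minimal sketch of this kind of check, assuming the input is a list of (year, rainfall) pairs and that only non-zero values count towards the yearly minimum. The function name and these input conventions are illustrative assumptions, not the library's behaviour.

```python
import math


def years_with_min_val_change(records, expected_min_val):
    """Return years whose smallest non-zero value differs from expected_min_val."""
    minima = {}
    for year, rain in records:
        # Assumption: zeros and missing values say nothing about gauge resolution.
        if rain is not None and rain > 0:
            minima[year] = min(rain, minima.get(year, rain))
    return sorted(
        year for year, smallest in minima.items()
        if not math.isclose(smallest, expected_min_val)
    )


if __name__ == "__main__":
    records = [
        (2000, 0.0), (2000, 0.1), (2000, 0.5),
        (2001, 0.2), (2001, 1.0),
        (2002, 0.1),
    ]
    print(years_with_min_val_change(records, expected_min_val=0.1))
```

math.isclose is used instead of == so that floating-point rounding in the recorded values does not trigger a spurious change.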

rainfallqc.checks.gauge_checks.check_temporal_bias(data: DataFrame, target_gauge_col: str, time_granularity: str, p_threshold: float = 0.01) → int

Perform a two-sided t-test on the distribution of mean rainfall over time slices.

This is QC3 (day of week bias) and QC4 (hour-of-day bias) from the IntenseQC framework.

Parameters:
data

Rainfall data

target_gauge_col

Column with rainfall data

time_granularity

Temporal grouping, either ‘weekday’ or ‘hour’

p_threshold

Significance level for the test

Returns:
flag : int

1 if bias is detected (p < threshold), 0 otherwise

rainfallqc.checks.gauge_checks.check_years_where_annual_mean_k_top_rows_are_zero(data: DataFrame, target_gauge_col: str, k: int) → list

Return the list of years where the annual mean of the top-k rows is zero.

This is QC2 from the IntenseQC framework.

Parameters:
data

Rainfall data

target_gauge_col

Column with rainfall data

k

Number of top values to check, i.e. k == 5 checks the top 5

Returns:
year_list

List of years where k-largest are zero.

rainfallqc.checks.gauge_checks.check_years_where_nth_percentile_is_zero(data: DataFrame, target_gauge_col: str, quantile: float) → list

Return years where the n-th percentile is zero.

This is QC1 from the IntenseQC framework.

Parameters:
data

Rainfall data

target_gauge_col

Column with rainfall data

quantile

Quantile to check, between 0 and 1

Returns:
year_list

List of years where n-th percentile is zero.
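
The two zero-value checks (QC1 and QC2) can be sketched with the standard library alone. The example assumes the data is a list of (year, rainfall) pairs, uses a nearest-rank percentile, and reads QC2 as "all k largest values in the year are zero"; the helper names and these choices are assumptions for illustration only.

```python
import heapq
import math
from collections import defaultdict


def group_by_year(records):
    """Collect rainfall values into a dict keyed by year."""
    by_year = defaultdict(list)
    for year, rain in records:
        by_year[year].append(rain)
    return by_year


def years_where_quantile_is_zero(records, quantile):
    """Years whose nearest-rank quantile (0 < quantile <= 1) equals zero."""
    years = []
    for year, vals in sorted(group_by_year(records).items()):
        ordered = sorted(vals)
        rank = max(1, math.ceil(quantile * len(ordered)))
        if ordered[rank - 1] == 0:
            years.append(year)
    return years


def years_where_k_largest_are_zero(records, k):
    """Years in which every one of the k largest values is zero."""
    return [
        year for year, vals in sorted(group_by_year(records).items())
        if all(v == 0 for v in heapq.nlargest(k, vals))
    ]


if __name__ == "__main__":
    records = (
        [(2000, v) for v in (0, 0, 0, 0, 5)]
        + [(2001, v) for v in (0, 0, 0, 0, 0)]
        + [(2002, v) for v in (1, 2, 3, 4, 5)]
    )
    print(years_where_quantile_is_zero(records, 0.5))
    print(years_where_k_largest_are_zero(records, 1))
```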

2.1.1.4. rainfallqc.checks.neighbourhood_checks module

Quality control checks using neighbouring gauges to identify suspicious data.

Neighbourhood checks are QC checks that: “detect abnormalities in a gauge given measurements in neighbouring gauges.”

Classes and functions ordered by appearance in IntenseQC framework.

rainfallqc.checks.neighbourhood_checks.add_wet_flags_to_data(neighbour_data_diff: DataFrame, target_gauge_col: str, nearest_neighbour: str, expon_percentiles: dict, wet_threshold: float) → DataFrame

Add flags to data based on when target gauge is wetter than neighbour above certain exponential thresholds.

Parameters:
neighbour_data_diff

Data with normalised diff to neighbour

target_gauge_col

Target gauge column

nearest_neighbour

Neighbouring gauge column

expon_percentiles

Thresholds at percentile of fitted distribution (needs 0.95, 0.99 & 0.999)

wet_threshold

Threshold for rainfall intensity in given time period

Returns:
neighbour_data_wet_flags

Data with wet flags applied

rainfallqc.checks.neighbourhood_checks.check_daily_factor(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, averaging_method: str = 'mean') → float

Daily factor difference between target and neighbouring gauge.

Flag: Scalar factor difference.

This is QC24 from the IntenseQC framework.

Parameters:
neighbour_data

Daily rainfall data with target and neighbouring gauge and time col

target_gauge_col

Target gauge column

nearest_neighbour

Neighbouring gauge column

averaging_method

Method to use to get average i.e. mean or median (default mean)

Returns:
daily_factor

Average factor diff between target and neighbour

Raises:
ValueError

If averaging method not ‘mean’ or ‘median’
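
A hedged sketch of a daily factor calculation on two parallel lists. It assumes the factor is the ratio target / neighbour and that only time steps where both gauges recorded rain are used; neither detail is confirmed by the documentation above, and the function name is hypothetical.

```python
import statistics


def daily_factor_sketch(target, neighbour, averaging_method="mean"):
    """Average ratio between target and neighbour on jointly wet time steps."""
    if averaging_method not in ("mean", "median"):
        raise ValueError("averaging_method must be 'mean' or 'median'")
    # Assumption: skip missing values and steps where either gauge is dry,
    # which also avoids dividing by zero.
    ratios = [
        t / n for t, n in zip(target, neighbour)
        if t is not None and n is not None and t > 0 and n > 0
    ]
    if not ratios:
        return float("nan")
    average = statistics.mean if averaging_method == "mean" else statistics.median
    return average(ratios)


if __name__ == "__main__":
    target = [2.0, 4.0, 0.0, 30.0]
    neighbour = [1.0, 2.0, 5.0, 3.0]
    print(daily_factor_sketch(target, neighbour, "mean"))
    print(daily_factor_sketch(target, neighbour, "median"))
```

The median is less sensitive than the mean to a single day with a very large ratio, which is one reason to expose both options.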

rainfallqc.checks.neighbourhood_checks.check_dry_neighbours(neighbour_data: DataFrame, target_gauge_col: str, list_of_nearest_stations: List[str], time_res: str, min_n_neighbours: int, dry_period_days: int = 15, n_neighbours_ignored: int = 0, hour_offset: int = 0, min_count: int = None) → DataFrame

Identify suspicious dry periods by comparison to neighbour for hourly or daily data.

Flags (majority voting where flag is the highest value across all neighbours):
3, if >= 3 average number of wet days in neighbours during a dry period in target.
2, …if 2 days
1, …if 1 day
0, if not neighbours on average dry during dry target gauge period.

This is QC18 & QC19 from the IntenseQC framework.

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

target_gauge_col

Target gauge column

list_of_nearest_stations

List of columns with neighbouring gauges

time_res

Time resolution of data

min_n_neighbours

Minimum number of neighbours needed to be checked for flag

dry_period_days

Length of a “dry spell” in days (default: 15)

n_neighbours_ignored

Number of zero flags allowed for majority voting (default: 0)

hour_offset

Time offset of hourly data in hours (i.e. if 7am-7am, then set this to 7) (default: 0)

min_count

Minimum number of time steps needed per time period (default: 1)

Returns:
data_w_dry_flags

Target data with dry flags

rainfallqc.checks.neighbourhood_checks.check_monthly_factor(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str) → DataFrame

Monthly factor difference between target and neighbouring gauge.

Flags:
1, when ~10 x greater than neighbour monthly total
2, when ~25.4 x greater …
3, when ~2.54 x greater …
4, when ~10 x smaller than neighbour monthly total
5, when ~25.4 x smaller …
6, when ~2.54 x smaller …
else, 0

This is QC25 from the IntenseQC framework.

Parameters:
neighbour_data

Daily rainfall data with target and neighbouring gauge and time col

target_gauge_col

Target gauge column

nearest_neighbour

Neighbouring gauge column

Returns:
monthly_factor_flag

Factor diff flags between target and neighbour

rainfallqc.checks.neighbourhood_checks.check_monthly_neighbours(neighbour_data: DataFrame, target_gauge_col: str, list_of_nearest_stations: List[str], time_res: str, min_n_neighbours: int, n_neighbours_ignored: int = 0, hour_offset: int = 0, min_count: int = None) → DataFrame

Identify suspicious monthly totals by comparison to neighbouring monthly gauges.

Flags (majority voting where flag is the highest value across all neighbours):
Flags -3 to 3 based on percentage difference:
-3, -100% (i.e. gauge dry but neighbours not)
-2, <= 50%
-1, <= 25%
1, >= 25%
2, >= 50%
3, >= 100%
Flags equal to 3 may be upgraded to:
4, >=1.25 x record maximum for all neighbours
5, >=2 x record maximum for all neighbours
Or:
0, if not in extreme exceedance of neighbours

This is QC20 from the IntenseQC framework.

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

target_gauge_col

Target gauge column

list_of_nearest_stations

List of columns with neighbouring gauges

time_res

Time resolution of data (e.g. ‘monthly’, ‘daily’, ‘hourly’ or ‘15m’; data will be resampled to monthly)

min_n_neighbours

Minimum number of neighbours needed to be checked for flag

n_neighbours_ignored

Number of zero flags allowed for majority voting (default: 0)

hour_offset

Time offset of hourly data in hours (i.e. if 7am-7am, then set this to 7) (default: 0)

min_count

Minimum number of time steps needed per time period (default: will be half of possible time steps)

Returns:
data_w_monthly_flags

Target data with monthly flags

rainfallqc.checks.neighbourhood_checks.check_nearest_neighbour_columns(neighbour_data: DataFrame, target_gauge_col: str, list_of_nearest_stations: list) → None

Check that neighbouring gauge columns have been provided and that the target gauge column is present in the data.

Parameters:
neighbour_data

Rainfall data of all neighbouring gauges with time col

target_gauge_col

Target gauge column

list_of_nearest_stations

List of columns with neighbouring gauges

Raises:
ValueError

If there are no neighbouring gauges in the ‘list_of_nearest_stations’ list

AssertionError

If ‘target_gauge_col’ not in neighbour_data

rainfallqc.checks.neighbourhood_checks.check_neighbour_affinity_index(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str) → float

Pre-QC Affinity index calculated between target and nearest neighbouring gauge.

Flag: Between 0-1 for affinity index

This is QC22 from the IntenseQC framework.

Parameters:
neighbour_data

Rainfall data with target and neighbouring gauge and time col

target_gauge_col

Target gauge column

nearest_neighbour

Neighbouring gauge column

Returns:
affinity_index

Between 0 and 1

rainfallqc.checks.neighbourhood_checks.check_neighbour_correlation(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str) → float

Pre-QC Pearson correlation calculated between target and neighbouring gauge.

Flag: Pearson correlation coefficient, between -1 and +1

This is QC23 from the IntenseQC framework.

Parameters:
neighbour_data

Rainfall data with target and neighbouring gauge and time col

target_gauge_col

Target gauge column

nearest_neighbour

Neighbouring gauge column

Returns:
r_squared

Between -1 to 1
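
The two pre-QC similarity measures can be illustrated on plain lists. The sketch assumes the affinity index is the share of time steps on which both gauges agree about being wet or dry, which is a common definition but is not spelled out above; the Pearson coefficient follows the textbook formula. Function names and the wet_threshold parameter are hypothetical.

```python
import math


def _paired(target, neighbour):
    """Keep only time steps where both gauges have a value."""
    return [(t, n) for t, n in zip(target, neighbour)
            if t is not None and n is not None]


def affinity_index(target, neighbour, wet_threshold=0.0):
    """Share of time steps where both gauges agree on wet versus dry."""
    pairs = _paired(target, neighbour)
    matches = sum((t > wet_threshold) == (n > wet_threshold) for t, n in pairs)
    return matches / len(pairs)


def pearson_r(target, neighbour):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    pairs = _paired(target, neighbour)
    mean_t = sum(t for t, _ in pairs) / len(pairs)
    mean_n = sum(n for _, n in pairs) / len(pairs)
    cov = sum((t - mean_t) * (n - mean_n) for t, n in pairs)
    spread = math.sqrt(sum((t - mean_t) ** 2 for t, _ in pairs)
                       * sum((n - mean_n) ** 2 for _, n in pairs))
    return cov / spread if spread else 0.0


if __name__ == "__main__":
    print(affinity_index([0, 1, 0, 2, 0], [0, 2, 0, 4, 1]))
    print(pearson_r([1, 2, 3], [2, 4, 6]))
```

The two measures complement each other: the affinity index ignores rainfall amounts and only looks at wet/dry agreement, while the correlation is driven by the amounts.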

rainfallqc.checks.neighbourhood_checks.check_timing_offset(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, time_res: str, offsets_to_check: Iterable[int] = (-1, 0, 1)) → int

Identify suspicious data offset using Affinity Index and correlation (r^2) between target and nearest neighbour.

Flags:
-1, -1 day offset
0, no offset
1, +1 day offset

This is QC21 from the IntenseQC framework.

Parameters:
neighbour_data

Rainfall data with target and neighbouring gauge and time col

target_gauge_col

Target gauge column

nearest_neighbour

Neighbouring gauge column

time_res

Time resolution of data

offsets_to_check

Offset values to check (default: -1, 0, 1)

Returns:
offset_flag

e.g. -1, 0 or 1
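
One way to picture an offset check is to shift one series against the other and keep the shift with the best agreement. The sketch below scores each candidate offset with the Pearson correlation only; the documented check also uses the affinity index, and the sign convention for the offset here (target[i] paired with neighbour[i + offset]) is an assumption, as are the function names.

```python
import math


def _pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                       * sum((y - mean_y) ** 2 for y in ys))
    return cov / spread if spread else 0.0


def best_offset(target, neighbour, offsets_to_check=(-1, 0, 1)):
    """Return the offset whose shifted pairing has the highest correlation."""
    scores = {}
    for offset in offsets_to_check:
        # Pair target[i] with neighbour[i + offset]; drop steps that fall
        # off either end of the neighbour series.
        pairs = [(target[i], neighbour[i + offset])
                 for i in range(len(target))
                 if 0 <= i + offset < len(neighbour)]
        xs, ys = zip(*pairs)
        scores[offset] = _pearson(xs, ys)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    neighbour = [0, 0, 5, 1, 0, 3, 0, 0, 2, 0]
    target = neighbour[1:] + [0]  # same record, reported one step early
    print(best_offset(target, neighbour))
```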

rainfallqc.checks.neighbourhood_checks.check_wet_neighbours(neighbour_data: DataFrame, target_gauge_col: str, list_of_nearest_stations: List[str], time_res: str, wet_threshold: int | float, min_n_neighbours: int, n_neighbours_ignored: int = 0, hour_offset: int = 0, min_count: int = None) → DataFrame

Identify suspicious large values by comparison to neighbour for hourly or daily data.

Flags (majority voting where flag is the highest value across all neighbours):
3, if normalised difference between target gauge and neighbours is above the 99.9th percentile
2, …if above 99th percentile
1, …if above 95th percentile
0, if not in extreme exceedance of neighbours

This is QC16 & QC17 from the IntenseQC framework.

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

target_gauge_col

Target gauge column

list_of_nearest_stations

List of columns with neighbouring gauges

time_res

Time resolution of data

wet_threshold

Threshold for rainfall intensity in given time period

min_n_neighbours

Minimum number of neighbours needed to be checked for flag

n_neighbours_ignored

Number of zero flags allowed for majority voting (default: 0)

hour_offset

Time offset of hourly data in hours (i.e. if 7am-7am, then set this to 7) (default: 0)

min_count

Minimum number of time steps needed per time period (default: 2)

Returns:
data_w_wet_flags

Target data with wet flags

rainfallqc.checks.neighbourhood_checks.filter_data_based_on_unusual_wetness(neighbour_data_diff: DataFrame, target_gauge_col: str, nearest_neighbour: str, wet_threshold: float) → DataFrame

Filter data based on wet threshold.

Parameters:
neighbour_data_diff

Data with normalised diff to neighbour

target_gauge_col

Target gauge column

nearest_neighbour

Neighbouring gauge column

wet_threshold

Threshold for rainfall intensity in given time period

Returns:
filtered_diff

Data filtered to wet threshold and where diff is positive (thus more wet)

rainfallqc.checks.neighbourhood_checks.flag_dry_spell_fractions(one_neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, proportion_of_dry_day_for_flags: dict) → DataFrame

Flag dry spell fractions.

Parameters:
one_neighbour_data

Rainfall data of one neighbouring gauge with time col

target_gauge_col

Target gauge column

nearest_neighbour

Neighbouring gauge column

proportion_of_dry_day_for_flags

Proportion of dry days needed to be flagged 1, 2, or 3

Returns:
data_w_dry_spell_fraction

Target data with dry spell fractions

rainfallqc.checks.neighbourhood_checks.flag_monthly_factor_differences(monthly_factor: DataFrame) → DataFrame

Flag monthly factor differences following QC25 of the IntenseQC framework.

Flags:
1, when ~10 x greater than neighbour monthly total
2, when ~25.4 x greater …
3, when ~2.54 x greater …
4, when ~10 x smaller than neighbour monthly total
5, when ~25.4 x smaller …
6, when ~2.54 x smaller …
else, 0

Parameters:
monthly_factor

Rainfall data with ‘factor_diff’ and gauge_col

target_gauge_col

Rain column

Returns:
monthly_factor_w_flag

Rainfall data with flags based on monthly factor difference
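
The flag table for monthly factor differences maps naturally onto a small lookup. In the sketch below, "approximately" is implemented as a relative tolerance of 10%, which is purely an assumption for illustration; the documentation only says "~". The names FACTOR_FLAGS and monthly_factor_flag are hypothetical. The factors 25.4 and 2.54 presumably target inch to millimetre or centimetre unit mix-ups.

```python
import math

# factor: (flag when target is greater, flag when target is smaller)
FACTOR_FLAGS = {10.0: (1, 4), 25.4: (2, 5), 2.54: (3, 6)}


def monthly_factor_flag(factor_diff, rel_tol=0.1):
    """Map a target/neighbour monthly factor onto flags 0-6.

    rel_tol is an illustrative tolerance for "approximately equal".
    """
    for factor, (greater_flag, smaller_flag) in FACTOR_FLAGS.items():
        if math.isclose(factor_diff, factor, rel_tol=rel_tol):
            return greater_flag
        if math.isclose(factor_diff, 1.0 / factor, rel_tol=rel_tol):
            return smaller_flag
    return 0


if __name__ == "__main__":
    for factor in (9.5, 25.0, 2.5, 0.1, 0.04, 0.4, 1.0):
        print(factor, monthly_factor_flag(factor))
```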

rainfallqc.checks.neighbourhood_checks.flag_percentage_diff_of_neighbour(neighbour_data: DataFrame, nearest_neighbour: str) → DataFrame

Flag percentage difference between target gauge and neighbouring gauge.

Flags -3 to 3 based on percentage difference:
-3, -100% (i.e. gauge dry but neighbours not)
-2, <= 50%
-1, <= 25%
1, >= 25%
2, >= 50%
3, >= 100%

Parameters:
neighbour_data

Rainfall data of all neighbouring gauges with time col

nearest_neighbour

Neighbouring gauge column

Returns:
neighbour_data_w_flags

Data with perc_diff flags
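
The percentage-difference flags can be sketched as a chain of threshold tests. This example assumes that the thresholds for the negative flags are meant as -50% and -25% (the listing above omits the minus signs) and that the input is already a percentage; both points, and the function name, are assumptions.

```python
def percentage_diff_flag(perc_diff):
    """Map a percentage difference (target vs neighbour) onto flags -3 to 3."""
    # Negative side: target is drier than the neighbour.
    if perc_diff <= -100:
        return -3
    if perc_diff <= -50:
        return -2
    if perc_diff <= -25:
        return -1
    # Positive side: target is wetter than the neighbour.
    if perc_diff >= 100:
        return 3
    if perc_diff >= 50:
        return 2
    if perc_diff >= 25:
        return 1
    return 0


if __name__ == "__main__":
    for diff in (-100, -60, -30, 0, 30, 75, 150):
        print(diff, percentage_diff_flag(diff))
```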

rainfallqc.checks.neighbourhood_checks.flag_wet_day_errors_based_on_neighbours(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, wet_threshold: float) → DataFrame

Flag wet days with errors based on the percentile difference with neighbouring gauge.

Parameters:
neighbour_data

Rainfall data of all neighbouring gauges with time col

target_gauge_col

Target gauge column

nearest_neighbour

Neighbouring gauge column

wet_threshold

Threshold for rainfall intensity in given time period

Returns:
neighbour_data_wet_flags

Data with wet flags

rainfallqc.checks.neighbourhood_checks.get_dry_spell_fraction_col(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, dry_period_days: int) → DataFrame

Get dry spell fraction column.

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

target_gauge_col

Target gauge column

nearest_neighbour

Neighbouring gauge column

dry_period_days

Length of a “dry spell” in days

Returns:
data_w_dry_spell_fraction

Target data with dry spell fractions

rainfallqc.checks.neighbourhood_checks.get_majority_positive_or_negative_flags(monthly_neighbour_data: DataFrame, list_of_nearest_stations: list, min_n_neighbours: int, n_neighbours_ignored: int) → DataFrame

Get majority voted positive or negative flags i.e. get minimum positive flag, or maximum negative flag.

Parameters:
monthly_neighbour_data

Monthly rainfall data of neighbouring gauges with time col

list_of_nearest_stations

List of columns with neighbouring gauges

min_n_neighbours

Minimum number of neighbours needed to be checked for flag

n_neighbours_ignored

Number of zero flags allowed for majority voting

Returns:
data_w_monthly_flag

Data with majority_monthly_flag

rainfallqc.checks.neighbourhood_checks.get_majority_voting_flag(neighbour_data: DataFrame, list_of_nearest_stations: list[str], min_n_neighbours: int, n_zeros_allowed: int, flag_col_prefix: str, new_flag_col_name: str, aggregation: str) → DataFrame

Get the highest flag that is shared by all neighbours.

This function introduces the ‘n_zeros_allowed’ parameter to give some leeway for problematic neighbours. It prevents a problematic neighbour that resembles a problematic target from blocking a flag.

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

list_of_nearest_stations

List of columns with neighbouring gauges

min_n_neighbours

Minimum number of neighbours online that will be considered

n_zeros_allowed

Number of zero flags allowed (default: 0)

flag_col_prefix

Prefix for flag column e.g. “wet_flag_”

new_flag_col_name

New flag column name

aggregation

“min” or “max”

Returns:
neighbour_data_w_majority_wet_flag

Data with majority wet flag
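
The majority-voting idea can be sketched for a single time step: the flag shared by all neighbours is the minimum of their individual flags, and a limited number of zero flags may be ignored first. The treatment of offline neighbours (None), the return value when too few neighbours are online, and the function name are assumptions made for this illustration.

```python
def majority_voting_flag(flags, min_n_neighbours, n_zeros_allowed=0,
                         aggregation="min"):
    """Combine per-neighbour flags for one time step.

    flags holds one flag per neighbour; None marks an offline neighbour.
    """
    if aggregation not in ("min", "max"):
        raise ValueError("aggregation must be 'min' or 'max'")
    online = [f for f in flags if f is not None]
    if len(online) < min_n_neighbours:
        return None  # assumption: too few neighbours means no decision
    # Ignore up to n_zeros_allowed zero flags so that one quiet neighbour
    # cannot veto a flag raised by all the others.
    kept, zeros_dropped = [], 0
    for flag in online:
        if flag == 0 and zeros_dropped < n_zeros_allowed:
            zeros_dropped += 1
        else:
            kept.append(flag)
    if not kept:
        return 0
    return min(kept) if aggregation == "min" else max(kept)


if __name__ == "__main__":
    print(majority_voting_flag([3, 2, 3], min_n_neighbours=3))
    print(majority_voting_flag([3, 0, 3], min_n_neighbours=3))
    print(majority_voting_flag([3, 0, 3], min_n_neighbours=3, n_zeros_allowed=1))
```

Taking the minimum is what makes the result "the highest flag present in all neighbours": every neighbour reports at least that level.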

rainfallqc.checks.neighbourhood_checks.make_neighbour_monthly_max_climatology(monthly_neighbour_data: DataFrame, list_of_nearest_stations: list) → DataFrame

Make neighbourhood monthly max climatology.

Parameters:
monthly_neighbour_data

Monthly rainfall data of neighbouring gauges with time col

list_of_nearest_stations

List of columns with neighbouring gauges

Returns:
data_w_monthly_flags

Target data with monthly flags

rainfallqc.checks.neighbourhood_checks.make_num_neighbours_online_col(neighbour_data: DataFrame, list_of_nearest_stations: list[str]) → DataFrame

Get number of neighbours online column.

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

list_of_nearest_stations

Neighbouring columns to check if not null

Returns:
neighbour_data_online_neighbours

Data with column for number of online neighbours

rainfallqc.checks.neighbourhood_checks.normalised_diff_between_target_neighbours(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str) → DataFrame

Normalised difference between target rain col and neighbouring rain col.

Parameters:
neighbour_data

Rainfall data of all neighbouring gauges with time col

target_gauge_col

Target gauge column

nearest_neighbour

Neighbouring gauge column

Returns:
neighbour_data_w_diff

Data with normalised diff to each neighbour

rainfallqc.checks.neighbourhood_checks.upgrade_monthly_flag_using_neighbour_max_climatology(monthly_neighbour_data_w_flags: DataFrame, target_gauge_col: str, min_n_neighbours: int) → DataFrame

Upgrade flags to 4 and 5 flags for monthly neighbours in excess of neighbourhood monthly climatological max.

Parameters:
monthly_neighbour_data_w_flags

Monthly rainfall data of neighbouring gauges with time col and ‘majority_monthly_flag’

target_gauge_col

Target gauge column

min_n_neighbours

Minimum number of neighbours needed to be checked for flag

Returns:
data_w_monthly_flags

Target data with monthly flags

2.1.1.5. rainfallqc.checks.timeseries_checks module

Quality control checks based on suspicious time-series artefacts.

Time-series checks are defined as QC checks that: “detect abnormalities in patterns of the data record.”

Classes and functions ordered by appearance in IntenseQC framework.

rainfallqc.checks.timeseries_checks.check_daily_accumulations(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float, wet_day_threshold: int | float = 1.0, accumulation_multiplying_factor: int | float = 2.0, accumulation_threshold: float = None) → DataFrame

Identify suspicious periods where an hour of rainfall is preceded by 23 hours with no rain.

Uses a simple precipitation intensity index (SDII) from ETCCDI.

This is QC13 from the IntenseQC framework.

Please see ‘Notes’ below for any additional information about the implementation of this method.

Parameters:
data

Hourly or 15-min rainfall data

target_gauge_col

Column with rainfall data

gauge_lat

latitude of the rain gauge

gauge_lon

longitude of the rain gauge

wet_day_threshold

Threshold for rainfall intensity in one day (default is 1 mm)

accumulation_multiplying_factor

Factor by which the SDII value is multiplied to identify an accumulation of rain recordings (default is 2)

accumulation_threshold

Rain accumulation for detecting possible daily accumulations

Returns:
data_w_daily_accumulation_flags

Data with daily accumulation flags

Notes

This method returns only 0 and 1 flags. This differs from the description of the daily accumulation check in IntenseQC. The decision was taken because the IntenseQC Python package only returns 0 and 1 flags.

rainfallqc.checks.timeseries_checks.check_dry_period_cdd(data: DataFrame, target_gauge_col: str, time_res: str, gauge_lat: int | float, gauge_lon: int | float) → DataFrame

Identify suspiciously long dry periods in time-series using the ETCCDI Consecutive Dry Days (CDD) index.

This is QC12 from the IntenseQC framework.

Parameters:
data

Rainfall data

target_gauge_col

Column with rainfall data

time_res

Temporal resolution of the time series, either ‘15m’, ‘daily’ or ‘hourly’

gauge_lat

latitude of the rain gauge

gauge_lon

longitude of the rain gauge

Returns:
data_w_dry_spell_flags

Data with dry spell flags

rainfallqc.checks.timeseries_checks.check_monthly_accumulations(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float, min_dry_spell_duration_in_days: int = 28, wet_day_threshold: int | float = 1.0, accumulation_multiplying_factor: int | float = 2.0, accumulation_threshold: float = None) → DataFrame

Identify suspicious periods when an hour of rainfall is preceded by 1 month with no rain.

Flags two different types of accumulations:
1) dry, when the isolated high value is followed by a dry period
2) wet, when the isolated value is followed by a few more wet values

Uses a simple precipitation intensity index (SDII) from ETCCDI.

This is QC14 from the IntenseQC framework.

Parameters:
data

Daily, hourly or 15-min rainfall data

target_gauge_col

Column with rainfall data

gauge_lat

latitude of the rain gauge

gauge_lon

longitude of the rain gauge

min_dry_spell_duration_in_days

Minimum number of days in dry spell preceding monthly accumulation (default is 28, i.e. the length of February)

wet_day_threshold

Threshold for rainfall intensity in one day (default is 1 mm)

accumulation_multiplying_factor

Factor by which the SDII value is multiplied to identify an accumulation of rain recordings (default is 2)

accumulation_threshold

Rain accumulation for detecting possible monthly accumulations

Returns:
data_w_monthly_accumulation_flags

Data with monthly accumulation flags

Notes

The original method filters out dry spells less than

rainfallqc.checks.timeseries_checks.check_streaks(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float, smallest_measurable_rainfall_amount: float, accumulation_threshold: float = None) → DataFrame

Check for suspected repeated values.

Flags (TODO: could change numbers as original includes unhelpful 2):
1, if streaks of 2 or more repeated values exceeding 2 * mean wet day rainfall
3, if streaks of 12 or more greater than smallest measurable rainfall amount
4, if streaks of 24 or more greater than zero
5, if period of zeros bounded by streaks of >= 24

This is QC15 from the IntenseQC framework.

Parameters:
data

Hourly or 15-min data with rainfall.

target_gauge_col

Column with rainfall data.

gauge_lat

latitude of the rain gauge.

gauge_lon

longitude of the rain gauge.

smallest_measurable_rainfall_amount

Resolution of rainfall data (i.e. minimum rainfall recording).

accumulation_threshold

Rain accumulation for detecting possible monthly accumulations

Returns:
data_w_streak_flags

Data with streak flags.
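
Streak detection boils down to run-length encoding the series and flagging runs that are long enough and large enough. The sketch below works on a plain list; the function names are hypothetical, and treating "streaks of N or more" as length >= N is an assumption about the boundary case.

```python
from itertools import groupby


def streak_lengths(values):
    """For each value, the length of the run of identical values it sits in."""
    lengths = []
    for _, group in groupby(values):
        run = len(list(group))
        lengths.extend([run] * run)
    return lengths


def flag_streaks_above(values, streak_length, min_value, flag):
    """Return `flag` where a run is at least streak_length long and above min_value."""
    return [
        flag if length >= streak_length and v is not None and v > min_value else 0
        for v, length in zip(values, streak_lengths(values))
    ]


if __name__ == "__main__":
    rain = [0.0, 0.2, 0.2, 0.2, 0.0, 0.0, 1.0]
    print(streak_lengths(rain))
    print(flag_streaks_above(rain, streak_length=3, min_value=0.0, flag=4))
```

Excluding values at or below min_value keeps ordinary dry spells, which are long runs of zeros, from being flagged as repeated readings.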

rainfallqc.checks.timeseries_checks.compute_dry_spell_days(dry_spell_data: Dataset) → Dataset

Compute dry spells in days from ETCCDI Consecutive Dry Days data.

Parameters:
dry_spell_data

ETCCDI CDD index data

Returns:
dry_spell_days

ETCCDI CDD index data with CDD_days variable

rainfallqc.checks.timeseries_checks.fill_in_monthly_accumulation_flags(monthly_accumulation_flags: DataFrame, time_step: str, min_dry_spell_duration: int | float, max_dry_spell_duration: int | float) → DataFrame

Fill in flags preceding monthly accumulation.

Parameters:
monthly_accumulation_flags

Rainfall data with monthly accumulation flag and dry spell info

time_step

Time step of data i.e. ‘1h’, ‘1d’, ‘15m’.

min_dry_spell_duration

Minimum dry spell duration

max_dry_spell_duration

Maximum dry spell duration

Returns:
monthly_accumulation_flags

Data with accumulation flag filled in

rainfallqc.checks.timeseries_checks.flag_accumulation_based_on_next_dry_spell_duration(data: DataFrame, min_dry_spell_duration: int | float, accumulation_col_name: str) → DataFrame

Flag possible accumulation based on subsequent minimum dry spell duration.

Flags:
3, if dry spell followed with high value then wet period (wet)
1, if dry spell followed with high value then no rain for next 23 hours (dry)
0, if neither

Parameters:
data

Rainfall data with dry spell info and possible accumulation label

min_dry_spell_duration

Minimum dry spell duration

accumulation_col_name

Name for accumulation column

Returns:
data_w_flag

Data with accumulation flag

rainfallqc.checks.timeseries_checks.flag_accumulation_periods(data: DataFrame, target_gauge_col: str, accumulation_threshold: float, accumulation_period_in_hours: int) → ndarray

Flag accumulation in a given period of hourly data.

TODO: make work for daily using: DAILY_DIVIDING_FACTOR

Parameters:
data

Hourly rainfall data

target_gauge_col

Column with rainfall data

accumulation_threshold

Rain accumulation for detecting possible period accumulations

accumulation_period_in_hours

Accumulation period in hours

Returns:
pa_flags

Accumulation flags

rainfallqc.checks.timeseries_checks.flag_dry_spell_duration(dry_spell_lengths: DataFrame, ref_dry_spell_length: int | float, time_res: str) → DataFrame

Flag the dry spell duration using reference local dry spell length.

Parameters:
dry_spell_lengths

Data with dry spell lengths

ref_dry_spell_length

Reference dry spell length

time_res

Temporal resolution of the time series, either ‘daily’ or ‘hourly’

Returns:
dry_spell_lengths_flags

Data with dry spell flags

rainfallqc.checks.timeseries_checks.flag_n_hours_accumulation_based_on_threshold(period_rain_vals: Series, accumulation_threshold: float, n_hours: int) → int | float

Flag a period as accumulation if a value is preceded by n hourly recordings of 0.

Parameters:
period_rain_vals

One period of rain values

accumulation_threshold

Reference SDII threshold

n_hours

Number of hours in reference period

Returns:
flag

1 if period accumulation, otherwise 0
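
The rule described above can be sketched in plain Python. This is only an illustration, not the library's code: the name sketch_flag_period_accumulation is hypothetical, and it assumes the window holds exactly n_hours values with the candidate value last.

```python
from typing import Sequence


def sketch_flag_period_accumulation(
    period_rain_vals: Sequence[float],
    accumulation_threshold: float,
    n_hours: int,
) -> int:
    """Return 1 if the last value looks like an accumulation, otherwise 0."""
    # Assumption: an incomplete window cannot be judged, so it is not flagged.
    if len(period_rain_vals) != n_hours:
        return 0
    # Split the window into the preceding recordings and the candidate value.
    *preceding, candidate = period_rain_vals
    all_dry_before = all(value == 0 for value in preceding)
    # Flag only when every preceding recording is zero and the candidate
    # exceeds the accumulation threshold.
    return int(all_dry_before and candidate > accumulation_threshold)
```

A window such as [0, 0, 0, 12.0] with a threshold of 10.0 would be flagged, whereas one containing any earlier non-zero recording would not.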

rainfallqc.checks.timeseries_checks.flag_streaks_exceeding_smallest_measurable_rainfall_amount(data: DataFrame, target_gauge_col: str, streak_length: int, smallest_measurable_rainfall_amount: float) DataFrame[source]

Flag streaks exceeding smallest measurable rainfall amount in data.

Parameters:
data

Rainfall data with streak_id.

target_gauge_col

Column with rainfall data.

streak_length

Only streaks longer than this will be considered

smallest_measurable_rainfall_amount

Resolution of rainfall data (i.e. minimum rainfall recording).

Returns:
data_w_flags

Data with streak flag 3

rainfallqc.checks.timeseries_checks.flag_streaks_exceeding_wet_day_rainfall_threshold(data: DataFrame, target_gauge_col: str, streak_length: int, accumulation_threshold: float) DataFrame[source]

Flag values exceeding wet day rainfall accumulation threshold.

Parameters:
data

Rainfall data with streak_id.

target_gauge_col

Column with rainfall data.

streak_length

Only streaks longer than this will be considered

accumulation_threshold

Threshold for rain accumulation.

Returns:
data_w_flags

Data with streak flag 1

rainfallqc.checks.timeseries_checks.flag_streaks_exceeding_zero(data: DataFrame, target_gauge_col: str, streak_length: int) DataFrame[source]

Flag streaks of repeated values exceeding zero.

Parameters:
data

Rainfall data with streak_id.

target_gauge_col

Column with rainfall data.

streak_length

Only streaks longer than this will be considered.

Returns:
data_w_flags

Data with streak flag 4

rainfallqc.checks.timeseries_checks.flag_streaks_of_zero_bounded_by_days(data: DataFrame, target_gauge_col: str, time_res: str) DataFrame[source]

Flag streaks of zeros that are bounded by records and whose length is a multiple of 24 hours.

Parameters:
data

Hourly, 15-min or daily data with rainfall.

target_gauge_col

Column with rainfall data.

time_res

Time resolution: “1h”, “15m”, “1d”, or “hourly”, “daily”

Returns:
streaks_w_flag5

Data with streak flag 5.
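
One way to picture this check for hourly data is the sketch below. It is not the library's implementation: the name sketch_flag_zero_streaks is hypothetical, "bounded by records" is assumed to mean a non-zero recording on both sides of the streak, and the flag value 5 is taken from the return description above.

```python
from itertools import groupby
from typing import List, Sequence


def sketch_flag_zero_streaks(rain: Sequence[float], steps_per_day: int = 24) -> List[int]:
    """Flag runs of zeros whose length is a whole number of days."""
    flags = [0] * len(rain)
    index = 0
    # groupby yields consecutive runs of zero / non-zero values.
    for is_zero, run in groupby(rain, key=lambda value: value == 0):
        run_length = len(list(run))
        # Assumption: a streak is "bounded" when it neither starts nor ends the series.
        bounded = index > 0 and index + run_length < len(rain)
        if is_zero and bounded and run_length % steps_per_day == 0:
            flags[index:index + run_length] = [5] * run_length
        index += run_length
    return flags
```

For 15-minute data the same idea would apply with steps_per_day set to 96.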

rainfallqc.checks.timeseries_checks.get_accumulation_threshold(etccdi_sdii: float, gauge_sdii: float, accumulation_multiplying_factor: int | float) float[source]

Get rainfall accumulation threshold based on the ETCCDI or rain gauge Simple Daily Intensity Index (SDII).

Parameters:
etccdi_sdii

SDII value from ETCCDI

gauge_sdii

SDII value from rain gauge

accumulation_multiplying_factor

Factor by which to multiply the SDII value to identify an accumulation of rain recordings

Returns:
accumulation_threshold

Reference SDII threshold
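
The calculation can be pictured as choosing one SDII value and scaling it. The sketch below is illustrative only: sketch_accumulation_threshold is a hypothetical name, and the rule of preferring the ETCCDI value and falling back to the gauge value when it is not finite is an assumption, not something stated in this reference.

```python
import math


def sketch_accumulation_threshold(
    etccdi_sdii: float,
    gauge_sdii: float,
    accumulation_multiplying_factor: float,
) -> float:
    """Scale an SDII value into an accumulation threshold."""
    # Assumption: use the ETCCDI SDII when it is a usable number,
    # otherwise fall back to the SDII computed from the gauge itself.
    sdii = etccdi_sdii if math.isfinite(etccdi_sdii) else gauge_sdii
    return sdii * accumulation_multiplying_factor
```

With an ETCCDI SDII of 5.0 mm/day and a factor of 2, the threshold would be 10.0 mm.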

rainfallqc.checks.timeseries_checks.get_accumulation_threshold_from_etccdi(data: DataFrame, target_gauge_col: str, time_res: str, gauge_lat: int | float, gauge_lon: int | float, wet_day_threshold: float, accumulation_multiplying_factor: float) float[source]

Get rain accumulation threshold from ETCCDI data.

Parameters:
data

Rainfall data.

target_gauge_col

Column with rainfall data.

time_res

Temporal resolution of the time series either ‘15m’, ‘daily’ or ‘hourly’

gauge_lat

latitude of the rain gauge.

gauge_lon

longitude of the rain gauge.

wet_day_threshold

Threshold for rainfall intensity in one day (whether it is a wet day or not)

accumulation_multiplying_factor

Factor by which to multiply the SDII value to identify an accumulation of rain recordings

Returns:
accumulation_threshold

Rain accumulation threshold, e.g. 2 * SDII

rainfallqc.checks.timeseries_checks.get_consecutive_dry_days(gauge_dry_spells: DataFrame) DataFrame[source]

Get consecutive groups of 0 rainfall days.

Parameters:
gauge_dry_spells

Data with ‘is_dry’ column

Returns:
gauge_dry_spell_groups

Data with group ids for consecutive dry days
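
Grouping consecutive dry days amounts to run-length labelling of the 'is_dry' column. A minimal sketch in plain Python follows; sketch_label_dry_groups is a hypothetical name, and the choice of integer ids starting at 0 with None for wet days is an assumption.

```python
from itertools import groupby
from typing import List, Optional, Sequence


def sketch_label_dry_groups(is_dry: Sequence[bool]) -> List[Optional[int]]:
    """Give each run of consecutive dry days its own id; wet days get None."""
    group_ids: List[Optional[int]] = []
    next_id = 0
    # groupby splits the sequence into runs of identical consecutive values.
    for dry, run in groupby(is_dry):
        run_length = len(list(run))
        if dry:
            group_ids.extend([next_id] * run_length)
            next_id += 1
        else:
            group_ids.extend([None] * run_length)
    return group_ids
```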

rainfallqc.checks.timeseries_checks.get_daily_non_wr_data(data: DataFrame, target_gauge_col: str, time_res: str) DataFrame[source]

Get daily non-world record data.

Parameters:
data

Hourly rainfall data

target_gauge_col

Column with rainfall data

time_res

Temporal resolution of the time series either ‘15m’, ‘daily’ or ‘hourly’

Returns:
daily_data_not_wr

Daily rainfall data with world records filtered out

rainfallqc.checks.timeseries_checks.get_dry_spell_duration(data: DataFrame, target_gauge_col: str) DataFrame[source]

Get consecutive dry spell duration.

Parameters:
data

Rainfall data

target_gauge_col

Column with rainfall data

Returns:
gauge_dry_spell_lengths

Data with dry spell start, end and duration
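
As an illustration of the output shape, the sketch below derives the start, end and length of each dry spell from parallel lists of timestamps and rainfall values. The name sketch_dry_spell_durations and the dictionary keys are hypothetical, and a dry record is assumed to be one with a value of exactly 0.

```python
from datetime import datetime
from itertools import groupby
from typing import Dict, List, Sequence


def sketch_dry_spell_durations(times: Sequence[datetime], rain: Sequence[float]) -> List[Dict]:
    """List each run of zero rainfall with its first and last timestamp."""
    spells = []
    index = 0
    for is_dry, run in groupby(rain, key=lambda value: value == 0):
        run_length = len(list(run))
        if is_dry:
            spells.append(
                {
                    "start": times[index],
                    "end": times[index + run_length - 1],
                    "n_steps": run_length,  # number of consecutive dry records
                }
            )
        index += run_length
    return spells
```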

rainfallqc.checks.timeseries_checks.get_dry_spell_info(data: DataFrame, target_gauge_col: str) DataFrame[source]

Get summary of dry spells (i.e. duration and first wet value after dry and previous and next dry spells duration).

Parameters:
data

Hourly rainfall data

target_gauge_col

Column with rainfall data

Returns:
gauge_dry_spell_info

Data with dry spell information

rainfallqc.checks.timeseries_checks.get_first_wet_after_dry_spell(data: DataFrame, target_gauge_col: str) DataFrame[source]

Get first non-zero rainfall value after dry spell.

Parameters:
data

Rainfall data

target_gauge_col

Column with rainfall data

Returns:
data_w_first_wet

Data with binary column denoting first wet after dry spell
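
The binary column can be pictured as marking every non-zero value whose immediate predecessor is zero. A minimal sketch, assuming no missing values; sketch_first_wet_after_dry is a hypothetical name:

```python
from typing import List, Sequence


def sketch_first_wet_after_dry(rain: Sequence[float]) -> List[int]:
    """Mark with 1 each non-zero value that directly follows a zero value."""
    flags = [0] * len(rain)
    for i in range(1, len(rain)):
        # A wet record counts as "first wet" only if the previous record was dry.
        if rain[i] > 0 and rain[i - 1] == 0:
            flags[i] = 1
    return flags
```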

rainfallqc.checks.timeseries_checks.get_local_etccdi_sdii_mean(gauge_lat: int | float, gauge_lon: int | float) float[source]

Get the mean Simple Daily Intensity Index (SDII) from nearby ETCCDI data.

Parameters:
gauge_lat

latitude of the rain gauge

gauge_lon

longitude of the rain gauge

Returns:
nearby_etccdi_sdii_mean

Local mean SDII value

rainfallqc.checks.timeseries_checks.get_possible_accumulations(gauge_dry_spell_info: DataFrame, target_gauge_col: str, accumulation_threshold: float) DataFrame[source]

Get possible accumulations as 0 or 1 based on dry spell info.

Parameters:
gauge_dry_spell_info

Rainfall data with columns with dry spell info (durations, first_wet_after_dry, etc.)

target_gauge_col

Column with rainfall data

accumulation_threshold

Threshold of rainfall intensity

Returns:
gauge_data_possible_accumulations

Data with 1 where there is a possible accumulation, otherwise 0.

rainfallqc.checks.timeseries_checks.get_streaks_above_threshold(data: DataFrame, target_gauge_col: str, streak_length: int, value_threshold: int | float) DataFrame[source]

Get streak groups above given threshold.

Parameters:
data

Rainfall data with streak_id.

target_gauge_col

Column with rainfall data.

streak_length

Minimum length of streaks.

value_threshold

Threshold to check.

Returns:
streaks_above_accumulation

All streaks above the given value
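
The selection can be sketched as filtering runs of repeated values by length and by value. This is an illustration rather than the library's logic: sketch_streaks_above is a hypothetical name, and treating streak_length as an inclusive minimum and value_threshold as a strict lower bound are both assumptions.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def sketch_streaks_above(
    values: Sequence[float],
    streak_length: int,
    value_threshold: float,
) -> List[Tuple[int, int, float]]:
    """Return (start_index, length, value) for each qualifying streak."""
    streaks = []
    index = 0
    # Each group is a run of identical consecutive values.
    for value, run in groupby(values):
        run_length = len(list(run))
        if run_length >= streak_length and value > value_threshold:
            streaks.append((index, run_length, value))
        index += run_length
    return streaks
```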

rainfallqc.checks.timeseries_checks.get_streaks_of_repeated_values(data: DataFrame, data_col: str) DataFrame[source]

Get streaks of repeated values in time series.

Parameters:
data

Data with time column.

data_col

Column with values to check streaks in.

Returns:
streak_data

Data with streak column.
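
A streak column of this kind can be built by incrementing an id each time the value changes. The sketch below shows the idea on a plain list; sketch_streak_ids is a hypothetical name and the ids starting at 0 are an assumption.

```python
from typing import List, Sequence


def sketch_streak_ids(values: Sequence[float]) -> List[int]:
    """Assign the same id to consecutive repeated values."""
    ids = []
    current_id = 0
    for i, value in enumerate(values):
        # Start a new streak whenever the value differs from the previous one.
        if i > 0 and value != values[i - 1]:
            current_id += 1
        ids.append(current_id)
    return ids
```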

rainfallqc.checks.timeseries_checks.get_surrounding_dry_spell_lengths(data: DataFrame) DataFrame[source]

Make prev_dry_spell and next_dry_spell columns from dry_spell_lengths.

Parameters:
data

Data with dry_spell_lengths

Returns:
data

Data with columns of previous and next dry spell durations

rainfallqc.checks.timeseries_checks.join_dry_spell_data_back_to_original(data: DataFrame, dry_spell_lengths_flags: DataFrame) DataFrame[source]

Join dry spell flags back to the original rainfall data using dry spell lengths.

Parameters:
data

Rainfall data

dry_spell_lengths_flags

Data with dry spell flags

Returns:
dry_spell_flag_data

Data with dry spell flags

2.1.1.6. rainfallqc.checks.pypwsqc_filters module

Quality control checks translated from the pyPWSQC framework (https://pypwsqc.readthedocs.io/en/latest/).

The PWSQC framework includes filters originally developed for automated personal weather stations (PWS) within the COST Action OPENSENSE.

Functions prefixed with ‘run_’ and ‘check_’ relate to the algorithms from pyPWSQC.

Functions are ordered alphabetically.

rainfallqc.checks.pypwsqc_filters.check_faulty_zeros(neighbour_data: DataFrame, neighbour_metadata: DataFrame, neighbouring_gauge_ids: List[str], neighbour_metadata_gauge_id_col: str, time_res: str, projection: str, nint: int, n_stat: int, max_distance_for_neighbours: int | float = 10000.0, time_units: str = 'seconds since 1970-01-01 00:00:00', rainfall_attributes: dict = {'coverage_contant_type': 'physicalMeasurement', 'long_name': 'rainfall amount per time unit', 'name': 'rainfall', 'units': 'mm'}, lat_lon_attributes: dict = {'unit': 'degrees in WGS84 projection'}, global_attributes: dict = None) Dataset[source]

Will flag faulty zeros based on neighbours …

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

neighbour_metadata

Metadata for the rainfall data with ‘latitude’ and ‘longitude’

neighbour_metadata_gauge_id_col

Column with the gauge ID

target_gauge_col

Target gauge column

neighbouring_gauge_ids

List of ids with neighbouring gauges

time_res

Time resolution of data

projection

cartesian/metric coordinate system

nint

Number of intervals

n_stat

Number of stations

max_distance_for_neighbours

Maximum distance to consider for neighbours

time_units

Units and encoding of the ‘time’ column

rainfall_attributes

Attributes for rainfall in the xarray Dataset

lat_lon_attributes

Attributes for lat and lon in the xarray Dataset

global_attributes

Global attributes for xarray Dataset

Returns:
neighbour_data_ds_filtered

Data with flags for faulty zeros

Examples

available at: https://pypwsqc.readthedocs.io/en/latest/notebooks/merged_filters.html
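
For orientation, the idea behind a faulty-zeros filter can be sketched as follows. This is a simplified reading, not pyPWSQC's algorithm: the name sketch_faulty_zero_flags, the use of None for missing values, the flag values 1, 0 and -1, and the exact role given to nint and n_stat are all assumptions.

```python
from statistics import median
from typing import List, Optional, Sequence


def sketch_faulty_zero_flags(
    station: Sequence[float],
    neighbours: Sequence[Sequence[Optional[float]]],
    nint: int,
    n_stat: int,
) -> List[int]:
    """Flag intervals where a station stays at zero while its neighbours are wet."""
    flags = []
    run = 0  # consecutive intervals with a zero at the station and wet neighbours
    for t, value in enumerate(station):
        reporting = [series[t] for series in neighbours if series[t] is not None]
        # Assumption: with fewer than n_stat reporting neighbours no decision is made.
        if len(reporting) < n_stat:
            flags.append(-1)
            run = 0
            continue
        if value == 0 and median(reporting) > 0:
            run += 1
        else:
            run = 0
        # Assumption: flag once the suspicious run has lasted at least nint intervals.
        flags.append(1 if run >= nint else 0)
    return flags
```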

rainfallqc.checks.pypwsqc_filters.check_high_influx_filter(neighbour_data: DataFrame) None[source]

High influx filter.

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

Returns:
neighbour_data

todo

rainfallqc.checks.pypwsqc_filters.check_station_outlier(neighbour_data: DataFrame, neighbour_metadata: DataFrame, neighbouring_gauge_ids: List[str], neighbour_metadata_gauge_id_col: str, time_res: str, projection: str, evaluation_period: int, mmatch: int, gamma: float, n_stat: int, max_distance_for_neighbours: int | float = 10000.0, time_units: str = 'seconds since 1970-01-01 00:00:00', rainfall_attributes: dict = {'coverage_contant_type': 'physicalMeasurement', 'long_name': 'rainfall amount per time unit', 'name': 'rainfall', 'units': 'mm'}, lat_lon_attributes: dict = {'unit': 'degrees in WGS84 projection'}, global_attributes: dict = None) Dataset[source]

Station outlier.

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

neighbour_metadata

Metadata for the rainfall data with ‘latitude’ and ‘longitude’

neighbour_metadata_gauge_id_col

Column with the gauge ID

target_gauge_col

Target gauge column

neighbouring_gauge_ids

List of ids with neighbouring gauges

time_res

Time resolution of data

projection

cartesian/metric coordinate system

evaluation_period

length of (rolling) window for correlation calculation

mmatch

threshold for number of matching rainy intervals in evaluation period

gamma

threshold for rolling median pearson correlation

n_stat

Number of stations

max_distance_for_neighbours

Maximum distance to consider for neighbours

time_units

Units and encoding of the ‘time’ column

rainfall_attributes

Attributes for rainfall in the xarray Dataset

lat_lon_attributes

Attributes for lat and lon in the xarray Dataset

global_attributes

Global attributes for xarray Dataset

Returns:
neighbour_data_ds_filtered

Data with flags for station outliers

Examples

available at: https://pypwsqc.readthedocs.io/en/latest/notebooks/merged_filters.html

rainfallqc.checks.pypwsqc_filters.compute_distance_matrix(neighbour_data_ds: Dataset) Dataset[source]

Compute a distance matrix.

Parameters:
neighbour_data_ds

xarray dataset of neighbour data

Returns:
distance_matrix

A distance matrix of all neighbouring gauges
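
Conceptually, a distance matrix holds the pairwise distance between every two gauges. The sketch below uses plain dictionaries rather than an xarray Dataset and assumes the coordinates are already projected to a metric (x, y) system; sketch_distance_matrix is a hypothetical name.

```python
import math
from typing import Dict, Tuple


def sketch_distance_matrix(coords: Dict[str, Tuple[float, float]]) -> Dict[str, Dict[str, float]]:
    """Compute Euclidean distances between all pairs of gauges."""
    gauge_ids = list(coords)
    # math.dist gives the straight-line distance between two points.
    return {
        a: {b: math.dist(coords[a], coords[b]) for b in gauge_ids}
        for a in gauge_ids
    }
```

Two gauges at (0, 0) and (3000, 4000) in metres would be 5000 m apart under this assumption.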

rainfallqc.checks.pypwsqc_filters.convert_neighbour_data_to_xarray(neighbour_data: DataFrame, neighbour_metadata: DataFrame, projection: str, time_units: str = 'seconds since 1970-01-01 00:00:00', rainfall_attributes: dict = {'coverage_contant_type': 'physicalMeasurement', 'long_name': 'rainfall amount per time unit', 'name': 'rainfall', 'units': 'mm'}, lat_lon_attributes: dict = {'unit': 'degrees in WGS84 projection'}, global_attributes: dict = None) Dataset[source]

Convert neighbour data in polars format to xarray dataset.

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

neighbour_metadata

Metadata for the rainfall data with ‘latitude’ and ‘longitude’

projection

cartesian/metric coordinate system

time_units

Units and encoding of the ‘time’ column

rainfall_attributes

Attributes for rainfall in the xarray Dataset

lat_lon_attributes

Attributes for lat and lon in the xarray Dataset

global_attributes

Global attributes for xarray Dataset

Returns:
neighbour_data_ds

xarray dataset with assigned attributes

rainfallqc.checks.pypwsqc_filters.run_bias_correction(neighbour_data: DataFrame) None[source]

Bias correction.

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

Returns:
neighbour_data

todo

rainfallqc.checks.pypwsqc_filters.run_event_based_filter(neighbour_data: DataFrame) None[source]

Event based filter (EBF).

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

Returns:
neighbour_data

todo

rainfallqc.checks.pypwsqc_filters.run_indicator_correlation(neighbour_data: DataFrame) None[source]

Run indicator correlation.

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

Returns:
neighbour_data

todo

rainfallqc.checks.pypwsqc_filters.run_peak_removal(neighbour_data: DataFrame) None[source]

Peak removal.

Parameters:
neighbour_data

Rainfall data of neighbouring gauges with time col

Returns:
neighbour_data

todo

rainfallqc.checks.pypwsqc_filters.subset_distance_matrix(neighbour_data_ds: Dataset, distance_matrix: Dataset, max_distance_for_neighbours: int | float) Dataset[source]

Subset a distance matrix to neighbours within the maximum distance.

Parameters:
neighbour_data_ds

xarray dataset of neighbour data

distance_matrix

A distance matrix of all neighbouring gauges

max_distance_for_neighbours

Maximum distance to consider for neighbours

Returns:
neighbour_data_ds

A distance matrix of all neighbouring gauges
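
Subsetting by distance can be pictured as dropping every pair that lies further apart than the limit. The sketch below works on a nested dictionary instead of an xarray Dataset; sketch_neighbours_within is a hypothetical name, and excluding a gauge from its own neighbour list is an assumption.

```python
from typing import Dict


def sketch_neighbours_within(
    distances: Dict[str, Dict[str, float]],
    max_distance_for_neighbours: float,
) -> Dict[str, Dict[str, float]]:
    """Keep, for each gauge, only the other gauges within the maximum distance."""
    return {
        gauge: {
            other: distance
            for other, distance in row.items()
            if other != gauge and distance <= max_distance_for_neighbours
        }
        for gauge, row in distances.items()
    }
```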

2.1.1.7. Module contents

Quality checks.