2.1.1. rainfallqc.checks package
2.1.1.1. Submodules
2.1.1.2. rainfallqc.checks.comparison_checks module
Quality control checks relying on comparison with a benchmark dataset.
Comparison checks are defined as QC checks that: “detect abnormalities in rainfall record based on benchmarks.”
Classes and functions ordered by appearance in IntenseQC framework.
- rainfallqc.checks.comparison_checks.add_daily_year_col(data: DataFrame) DataFrame[source]
Make a year column for the data. This method first upsamples the data to a daily time step.
- Parameters:
- data
Rainfall data
- Returns:
- data_w_year_col
Rainfall data with year column
- rainfallqc.checks.comparison_checks.check_annual_exceedance_etccdi_prcptot(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float) list[source]
Check annual exceedance of maximum PRCPTOT from ETCCDI dataset.
This is QC9 from the IntenseQC framework.
- Parameters:
- data
Rainfall data
- target_gauge_col
Column with rainfall data
- gauge_lat
latitude of the rain gauge
- gauge_lon
longitude of the rain gauge
- Returns:
- exceedance_flags
List of flags (see exceedance_flagger function)
- rainfallqc.checks.comparison_checks.check_annual_exceedance_etccdi_r99p(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float) list[source]
Check annual exceedance of maximum R99p from ETCCDI dataset.
This is QC8 from the IntenseQC framework.
- Parameters:
- data
Rainfall data
- target_gauge_col
Column with rainfall data
- gauge_lat
latitude of the rain gauge
- gauge_lon
longitude of the rain gauge
- Returns:
- flag_list
List of flags
- rainfallqc.checks.comparison_checks.check_exceedance_of_rainfall_world_record(data: DataFrame, target_gauge_col: str, time_res: str) DataFrame[source]
Check exceedance of rainfall world record.
See also utils/stats.py for the world record sources.
This is QC10 from the IntenseQC framework.
- Parameters:
- data
Rainfall data
- target_gauge_col
Column with rainfall data
- time_res
Time resolution
- Returns:
- data_w_flags:
Rainfall data with exceedance of World Record (see flag_exceedance_of_ref_val_as_col function)
- rainfallqc.checks.comparison_checks.check_hourly_exceedance_etccdi_rx1day(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float) DataFrame[source]
Check hourly rainfall for exceedance of the ETCCDI Rx1day (maximum 1-day rainfall) record.
This is QC11 from the IntenseQC framework.
- Parameters:
- data
Rainfall data
- target_gauge_col
Column with rainfall data
- gauge_lat
latitude of the rain gauge
- gauge_lon
longitude of the rain gauge
- Returns:
- data_w_flags:
Rainfall data with exceedance of Rx1day Record (see flag_exceedance_of_ref_val_as_col function)
- rainfallqc.checks.comparison_checks.flag_exceedance_of_max_etccdi_variable(annual_sum_rainfall: DataFrame, target_gauge_col: str, nearby_etccdi_data: Dataset, etccdi_var: str) list[source]
Flag exceedance of maximum ETCCDI variable, comparing the maximum sums of each year.
- Parameters:
- annual_sum_rainfall
Rainfall data as annual sums
- target_gauge_col
Column with rainfall data
- nearby_etccdi_data
ETCCDI data with given variable to check
- etccdi_var
variable to load from ETCCDI
- Returns:
- exceedance_flags
Flags of exceedances of max ETCCDI value
- rainfallqc.checks.comparison_checks.flag_exceedance_of_ref_val(val: int | float, ref_val: int | float) int[source]
Exceedance flagger from intenseqc.
- Parameters:
- val
Value to check
- ref_val
Reference value to compare against
- Returns:
- Flag
Exceedance flag
- rainfallqc.checks.comparison_checks.flag_exceedance_of_ref_val_as_col(data: DataFrame, target_gauge_col: str, ref_val: int | float, new_col_name: str) DataFrame[source]
Flag exceedance of maximum reference value and return as column.
Used in QC11 of the IntenseQC framework. TODO: could this be used in QC8+9?
- Parameters:
- data
Rainfall data.
- target_gauge_col
Column with rainfall data
- ref_val
Reference value.
- new_col_name
New column name.
- Returns:
- data
Data with exceedance flags between 0-4.
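As an illustration of how a 0-4 exceedance flag can be derived from a reference value, here is a minimal sketch in plain Python. It is not the rainfallqc implementation: the function names and the multipliers (1.0, 1.2, 1.33, 1.5) are assumptions chosen for the example, and the data is a plain list rather than a DataFrame.

```python
def flag_exceedance(val, ref_val, multipliers=(1.0, 1.2, 1.33, 1.5)):
    """Return a flag from 0 to 4; a higher flag means a larger exceedance.

    The multipliers are hypothetical values used for illustration only.
    """
    flag = 0
    for level, multiplier in enumerate(multipliers, start=1):
        if val >= ref_val * multiplier:
            flag = level
    return flag


def flag_exceedance_as_list(values, ref_val):
    """Flag every value in a series; missing values (None) stay None."""
    return [None if v is None else flag_exceedance(v, ref_val) for v in values]


flags = flag_exceedance_as_list([10.0, 100.0, 125.0, 140.0, 160.0, None], ref_val=100.0)
print(flags)  # [0, 1, 2, 3, 4, None]
```

Keeping the scalar flagger separate from the series wrapper mirrors the split between flag_exceedance_of_ref_val and flag_exceedance_of_ref_val_as_col.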
- rainfallqc.checks.comparison_checks.get_sum_rainfall_above_percentile_per_year(data: DataFrame, target_gauge_col: str, percentile: float) DataFrame[source]
Get the sum of rainfall above the given percentile for each year.
- Parameters:
- data
Rainfall data
- target_gauge_col
Column with rainfall data
- percentile
nth percentile to check for values above
- Returns:
- exceedance_flags
List of flags (see exceedance_flagger function)
2.1.1.3. rainfallqc.checks.gauge_checks module
Quality control checks examining suspicious rain gauges.
Gauge checks are defined as QC checks that: “detect abnormalities in summary and descriptive statistics of rain gauges.”
Classes and functions ordered by appearance in IntenseQC framework.
- rainfallqc.checks.gauge_checks.check_breakpoints(data: DataFrame, target_gauge_col: str, p_threshold: float = 0.01) int[source]
Use a Pettitt test on rainfall data to check for breakpoints.
This is QC6 from the IntenseQC framework.
- Parameters:
- data
Rainfall data.
- target_gauge_col
Column with rainfall data.
- p_threshold
Significance level for the test.
- Returns:
- flag (int)
1 if breakpoint is detected (p < p_threshold), 0 otherwise
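For readers unfamiliar with the Pettitt test, the sketch below shows the standard rank-based statistic and its usual approximate p-value, followed by the 0/1 flag described above. It is a stand-alone illustration in plain Python: the function names are hypothetical, and how rainfallqc aggregates the data before testing is not shown here.

```python
import math


def pettitt_test(values):
    """Return (change_index, p_value) for the Pettitt change-point test."""
    n = len(values)
    if n < 2:
        return 0, 1.0
    best_k, best_t, u = 0, 0, 0
    for t in range(n - 1):
        # U_t = U_{t-1} + sum over j of sign(x_t - x_j)
        u += sum((values[t] > x) - (values[t] < x) for x in values)
        if abs(u) > best_k:
            best_k, best_t = abs(u), t
    # Standard approximation of the two-sided p-value, capped at 1.
    p_value = min(1.0, 2.0 * math.exp(-6.0 * best_k**2 / (n**3 + n**2)))
    return best_t, p_value


def check_breakpoint_flag(values, p_threshold=0.01):
    """Return 1 if a breakpoint is detected (p < p_threshold), 0 otherwise."""
    _, p_value = pettitt_test(values)
    return 1 if p_value < p_threshold else 0


step_series = [1.0] * 10 + [5.0] * 10
print(check_breakpoint_flag(step_series))  # 1
```

In this sketch, change_index is the zero-based position of the last value before the detected shift.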
- rainfallqc.checks.gauge_checks.check_intermittency(data: DataFrame, target_gauge_col: str, no_data_threshold: int = 2, annual_count_threshold: int = 5) list[source]
Return years where more than five periods of missing data are bounded by zeros.
TODO: split into multiple sub-functions and write more tests! This is QC5 from the IntenseQC framework.
- Parameters:
- data
Rainfall data
- target_gauge_col
Column with rainfall data
- no_data_threshold
Number of missing values needed to be counted as a no data period (default: 2 (days))
- annual_count_threshold
Number of missing data periods above no_data_threshold per year (default: 5)
- Returns:
- years_w_intermittency
List of years with intermittency issues.
- rainfallqc.checks.gauge_checks.check_min_val_change(data: DataFrame, target_gauge_col: str, expected_min_val: float) list[source]
Return years when the minimum recorded value changes.
Used to determine whether there are possible changes to the measuring equipment. This is QC7 from the IntenseQC framework.
- Parameters:
- data
Rainfall data
- target_gauge_col
Column with rainfall data.
- expected_min_val
Expected minimum rainfall value, i.e. the resolution of the data.
- Returns:
- yr_list
List of years with minimum value changes.
- rainfallqc.checks.gauge_checks.check_temporal_bias(data: DataFrame, target_gauge_col: str, time_granularity: str, p_threshold: float = 0.01) int[source]
Perform a two-sided t-test on the distribution of mean rainfall over time slices.
This is QC3 (day of week bias) and QC4 (hour-of-day bias) from the IntenseQC framework.
- Parameters:
- data
Rainfall data
- target_gauge_col
Column with rainfall data
- time_granularity
Temporal grouping, either ‘weekday’ or ‘hour’
- p_threshold
Significance level for the test
- Returns:
- flag (int)
1 if bias is detected (p < threshold), 0 otherwise
- rainfallqc.checks.gauge_checks.check_years_where_annual_mean_k_top_rows_are_zero(data: DataFrame, target_gauge_col: str, k: int) list[source]
Return the list of years where the mean of the k largest values is zero.
This is QC2 from the IntenseQC framework.
- Parameters:
- data
Rainfall data
- target_gauge_col
Column with rainfall data
- k
Number of top values to check, e.g. k=5 checks the top 5
- Returns:
- year_list
List of years where k-largest are zero.
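One way to read this check is sketched below in plain Python: group the record by year, take the k largest values in each year, and report the years whose mean is zero. The helper name and the (year, rainfall) input format are assumptions made for the example; the library itself works on a DataFrame.

```python
from collections import defaultdict


def years_where_top_k_mean_is_zero(records, k):
    """records: iterable of (year, rainfall) pairs; rainfall may be None."""
    by_year = defaultdict(list)
    for year, value in records:
        if value is not None:
            by_year[year].append(value)
    flagged = []
    for year in sorted(by_year):
        top_k = sorted(by_year[year], reverse=True)[:k]
        if top_k and sum(top_k) / len(top_k) == 0:
            flagged.append(year)
    return flagged


records = [(2001, 0.0), (2001, 0.0), (2001, 0.0),
           (2002, 0.0), (2002, 3.2), (2002, 0.4)]
print(years_where_top_k_mean_is_zero(records, k=2))  # [2001]
```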
- rainfallqc.checks.gauge_checks.check_years_where_nth_percentile_is_zero(data: DataFrame, target_gauge_col: str, quantile: float) list[source]
Return years where the n-th percentile is zero.
This is QC1 from the IntenseQC framework.
- Parameters:
- data
Rainfall data
- target_gauge_col
Column with rainfall data
- quantile
Quantile to check, between 0 and 1
- Returns:
- year_list
List of years where n-th percentile is zero.
2.1.1.4. rainfallqc.checks.neighbourhood_checks module
Quality control checks using neighbouring gauges to identify suspicious data.
Neighbourhood checks are QC checks that: “detect abnormalities in a gauge given measurements in neighbouring gauges.”
Classes and functions ordered by appearance in IntenseQC framework.
- rainfallqc.checks.neighbourhood_checks.add_wet_flags_to_data(neighbour_data_diff: DataFrame, target_gauge_col: str, nearest_neighbour: str, expon_percentiles: dict, wet_threshold: float) DataFrame[source]
Add flags to data based on when target gauge is wetter than neighbour above certain exponential thresholds.
- Parameters:
- neighbour_data_diff
Data with normalised diff to neighbour
- target_gauge_col
Target gauge column
- nearest_neighbour
Neighbouring gauge column
- expon_percentiles
Thresholds at percentile of fitted distribution (needs 0.95, 0.99 & 0.999)
- wet_threshold
Threshold for rainfall intensity in given time period
- Returns:
- neighbour_data_wet_flags
Data with wet flags applied
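The expon_percentiles argument holds thresholds at the 0.95, 0.99 and 0.999 quantiles of a fitted exponential distribution. The sketch below shows one way such thresholds could be derived and turned into flags 1-3. It assumes a maximum-likelihood fit with the location fixed at zero, which may differ from how rainfallqc fits the distribution, and the function names are hypothetical.

```python
import math


def exponential_percentiles(positive_diffs, probabilities=(0.95, 0.99, 0.999)):
    """Fit an exponential distribution (location fixed at 0) and return its quantiles."""
    # For an exponential distribution the maximum-likelihood scale is the sample mean.
    scale = sum(positive_diffs) / len(positive_diffs)
    return {p: -scale * math.log(1.0 - p) for p in probabilities}


def wet_flag(diff, expon_percentiles):
    """Return 3, 2, 1 or 0 for the highest threshold the difference exceeds."""
    if diff > expon_percentiles[0.999]:
        return 3
    if diff > expon_percentiles[0.99]:
        return 2
    if diff > expon_percentiles[0.95]:
        return 1
    return 0


thresholds = exponential_percentiles([0.5, 1.5, 1.0, 1.0])
print([wet_flag(d, thresholds) for d in (2.0, 3.5, 5.0, 8.0)])  # [0, 1, 2, 3]
```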
- rainfallqc.checks.neighbourhood_checks.check_daily_factor(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, averaging_method: str = 'mean') float[source]
Daily factor difference between target and neighbouring gauge.
Flag: Scalar factor difference.
This is QC24 from the IntenseQC framework.
- Parameters:
- neighbour_data
Daily rainfall data with target and neighbouring gauge and time col
- target_gauge_col
Target gauge column
- nearest_neighbour
Neighbouring gauge column
- averaging_method
Method to use to get average i.e. mean or median (default mean)
- Returns:
- daily_factor
Average factor diff between target and neighbour
- Raises:
- ValueError
If averaging method not ‘mean’ or ‘median’
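A minimal sketch of a daily factor calculation follows, written in plain Python on lists. It assumes the factor is the ratio of target to neighbour on days when both gauges record rain; that restriction and the function name are assumptions for the example, not confirmed details of the library.

```python
from statistics import mean, median


def daily_factor(target, neighbour, averaging_method="mean"):
    """Average ratio target / neighbour over days where both gauges are wet."""
    if averaging_method not in ("mean", "median"):
        raise ValueError("averaging_method must be 'mean' or 'median'")
    ratios = [
        t / n
        for t, n in zip(target, neighbour)
        if t is not None and n is not None and t > 0 and n > 0
    ]
    if not ratios:
        return float("nan")  # no overlapping wet days
    return mean(ratios) if averaging_method == "mean" else median(ratios)


target = [10.0, 0.0, 20.0, None, 5.0]
neighbour = [1.0, 2.0, 2.0, 3.0, 5.0]
print(daily_factor(target, neighbour))            # 7.0
print(daily_factor(target, neighbour, "median"))  # 10.0
```

The median is less sensitive to a few extreme ratios, which is one reason to offer both averaging methods.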
- rainfallqc.checks.neighbourhood_checks.check_dry_neighbours(neighbour_data: DataFrame, target_gauge_col: str, list_of_nearest_stations: List[str], time_res: str, min_n_neighbours: int, dry_period_days: int = 15, n_neighbours_ignored: int = 0, hour_offset: int = 0, min_count: int = None) DataFrame[source]
Identify suspicious dry periods by comparison to neighbour for hourly or daily data.
Flags (majority voting, where the flag is the highest value across all neighbours):
3, if the average number of wet days in neighbours during a dry period in the target is >= 3
2, …if 2 days
1, …if 1 day
0, if neighbours are on average also dry during the dry period in the target gauge
This is QC18 & QC19 from the IntenseQC framework.
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- target_gauge_col
Target gauge column
- list_of_nearest_stations:
List of columns with neighbouring gauges
- time_res
Time resolution of data
- min_n_neighbours
Minimum number of neighbours needed to be checked for flag
- dry_period_days
Length of a “dry_spell” (default: 15 days)
- n_neighbours_ignored
Number of zero flags allowed for majority voting (default: 0)
- hour_offset
Time offset of hourly data in hours (i.e. if 7am-7am, then set this to 7) (default: 0)
- min_count
Minimum number of time steps needed per time period (default: 1)
- Returns:
- data_w_dry_flags
Target data with dry flags
- rainfallqc.checks.neighbourhood_checks.check_monthly_factor(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str) DataFrame[source]
Monthly factor difference between target and neighbouring gauge.
Flags:
1, when ~10 x greater than neighbour monthly total
2, when ~25.4 x greater …
3, when ~2.54 x greater …
4, when ~10 x smaller than neighbour monthly total
5, when ~25.4 x smaller …
6, when ~2.54 x smaller …
else, 0
This is QC25 from the IntenseQC framework.
- Parameters:
- neighbour_data
Daily rainfall data with target and neighbouring gauge and time col
- target_gauge_col
Target gauge column
- nearest_neighbour
Neighbouring gauge column
- Returns:
- monthly_factor_flag
Factor diff flags between target and neighbour
- rainfallqc.checks.neighbourhood_checks.check_monthly_neighbours(neighbour_data: DataFrame, target_gauge_col: str, list_of_nearest_stations: List[str], time_res: str, min_n_neighbours: int, n_neighbours_ignored: int = 0, hour_offset: int = 0, min_count: int = None) DataFrame[source]
Identify suspicious monthly totals by comparison to neighbouring monthly gauges.
Flags (majority voting where flag is the highest value across all neighbours):
Flags -3 to 3 based on percentage difference:
-3, -100% (i.e. gauge dry but neighbours not)
-2, <= 50%
-1, <= 25%
1, >= 25%
2, >= 50%
3, >= 100%
Flags equal to 3 may be upgraded to:
4, >=1.25 x record maximum for all neighbours
5, >=2 x record maximum for all neighbours
Or:
0, if not in extreme exceedance of neighbours
This is QC20 from the IntenseQC framework.
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- target_gauge_col
Target gauge column
- list_of_nearest_stations:
List of columns with neighbouring gauges
- time_res
Time resolution of data (e.g. ‘monthly’ or ‘daily’, ‘hourly’ or ‘15m’ - will be resampled to monthly)
- min_n_neighbours
Minimum number of neighbours needed to be checked for flag
- n_neighbours_ignored
Number of zero flags allowed for majority voting (default: 0)
- hour_offset
Time offset of hourly data in hours (i.e. if 7am-7am, then set this to 7) (default: 0)
- min_count
Minimum number of time steps needed per time period (default: will be half of possible time steps)
- Returns:
- data_w_monthly_flags
Target data with monthly flags
- rainfallqc.checks.neighbourhood_checks.check_nearest_neighbour_columns(neighbour_data: DataFrame, target_gauge_col: str, list_of_nearest_stations: list) None[source]
Run checks of neighbouring gauge columns to check if there are any columns and if the target gauge is there.
- Parameters:
- neighbour_data
Rainfall data of all neighbouring gauges with time col
- target_gauge_col
Target gauge column
- list_of_nearest_stations:
List of columns with neighbouring gauges
- Raises:
- ValueError
If there are no neighbouring gauges in the ‘list_of_nearest_stations’ list
- AssertionError
If ‘target_gauge_col’ not in neighbour_data
- rainfallqc.checks.neighbourhood_checks.check_neighbour_affinity_index(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str) float[source]
Pre-QC Affinity index calculated between target and nearest neighbouring gauge.
Flag: Between 0-1 for affinity index
This is QC22 from the IntenseQC framework.
- Parameters:
- neighbour_data
Rainfall data with target and neighbouring gauge and time col
- target_gauge_col
Target gauge column
- nearest_neighbour
Neighbouring gauge column
- Returns:
- affinity_index
Between 0 and 1
- rainfallqc.checks.neighbourhood_checks.check_neighbour_correlation(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str) float[source]
Pre-QC Pearson correlation calculated between target and neighbouring gauge.
Flag: Between -1 and +1 for the Pearson correlation coefficient
This is QC23 from the IntenseQC framework.
- Parameters:
- neighbour_data
Rainfall data with target and neighbouring gauge and time col
- target_gauge_col
Target gauge column
- nearest_neighbour
Neighbouring gauge column
- Returns:
- r_squared
Between -1 to 1
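The two pre-QC similarity measures can be sketched as follows. The affinity index is taken here to be the fraction of time steps on which both gauges agree on wet or dry; that definition, the wet_threshold parameter and the function names are assumptions for illustration. The Pearson coefficient is the standard formula.

```python
import math


def _valid_pairs(target, neighbour):
    """Keep only time steps where both gauges have data."""
    return [(t, n) for t, n in zip(target, neighbour) if t is not None and n is not None]


def affinity_index(target, neighbour, wet_threshold=0.0):
    """Fraction of time steps where both gauges are wet or both are dry."""
    pairs = _valid_pairs(target, neighbour)
    if not pairs:
        return float("nan")
    matches = sum((t > wet_threshold) == (n > wet_threshold) for t, n in pairs)
    return matches / len(pairs)


def pearson_correlation(target, neighbour):
    """Pearson correlation coefficient, between -1 and 1."""
    pairs = _valid_pairs(target, neighbour)
    if len(pairs) < 2:
        return float("nan")
    mean_t = sum(t for t, _ in pairs) / len(pairs)
    mean_n = sum(n for _, n in pairs) / len(pairs)
    cov = sum((t - mean_t) * (n - mean_n) for t, n in pairs)
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs)
    var_n = sum((n - mean_n) ** 2 for _, n in pairs)
    if var_t == 0 or var_n == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / math.sqrt(var_t * var_n)


print(affinity_index([0, 1, 2, 0, 3], [0, 2, 4, 1, 6]))  # 0.8
```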
- rainfallqc.checks.neighbourhood_checks.check_timing_offset(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, time_res: str, offsets_to_check: Iterable[int] = (-1, 0, 1)) int[source]
Identify suspicious data offset using Affinity Index and correlation (r^2) between target and nearest neighbour.
Flags:
-1, -1 day offset
0, no offset
1, +1 day offset
This is QC21 from the IntenseQC framework.
- Parameters:
- neighbour_data
Rainfall data with target and neighbouring gauge and time col
- target_gauge_col
Target gauge column
- nearest_neighbour
Neighbouring gauge column
- time_res
Time resolution of data
- offsets_to_check
Offset values to check (default: -1, 0, 1)
- Returns:
- offset_flag
e.g. -1, 0 or 1
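The idea of an offset check can be sketched by shifting the target series by each candidate offset and keeping the offset that agrees best with the neighbour. This example scores agreement with a simple wet/dry affinity only. How rainfallqc combines the affinity index with the correlation, and its sign convention for the offset, are not specified above, so both are assumptions here, as are the function names.

```python
def wet_dry_agreement(target, neighbour):
    """Fraction of time steps where both series are wet or both are dry."""
    pairs = [(t, n) for t, n in zip(target, neighbour) if t is not None and n is not None]
    if not pairs:
        return 0.0
    return sum((t > 0) == (n > 0) for t, n in pairs) / len(pairs)


def shift(values, offset):
    """Shift a list by `offset` steps, padding the gap with None."""
    if offset > 0:
        return [None] * offset + values[:-offset]
    if offset < 0:
        return values[-offset:] + [None] * (-offset)
    return list(values)


def best_timing_offset(target, neighbour, offsets_to_check=(-1, 0, 1)):
    """Return the offset whose shifted target agrees best with the neighbour."""
    scores = {o: wet_dry_agreement(shift(target, o), neighbour) for o in offsets_to_check}
    return max(scores, key=scores.get)


neighbour = [0, 0, 5, 0, 0, 3, 0, 0]
late_target = [0, 0, 0, 5, 0, 0, 3, 0]  # same events, recorded one step late
print(best_timing_offset(late_target, neighbour))  # -1
```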
- rainfallqc.checks.neighbourhood_checks.check_wet_neighbours(neighbour_data: DataFrame, target_gauge_col: str, list_of_nearest_stations: List[str], time_res: str, wet_threshold: int | float, min_n_neighbours: int, n_neighbours_ignored: int = 0, hour_offset: int = 0, min_count: int = None) DataFrame[source]
Identify suspiciously large values by comparison to neighbours for hourly or daily data.
Flags (majority voting where flag is the highest value across all neighbours):
3, if normalised difference between target gauge and neighbours is above the 99.9th percentile
2, …if above 99th percentile
1, …if above 95th percentile
0, if not in extreme exceedance of neighbours
This is QC16 & QC17 from the IntenseQC framework.
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- target_gauge_col
Target gauge column
- list_of_nearest_stations:
List of columns with neighbouring gauges
- time_res
Time resolution of data
- wet_threshold
Threshold for rainfall intensity in given time period
- min_n_neighbours
Minimum number of neighbours needed to be checked for flag
- n_neighbours_ignored
Number of zero flags allowed for majority voting (default: 0)
- hour_offset
Time offset of hourly data in hours (i.e. if 7am-7am, then set this to 7) (default: 0)
- min_count
Minimum number of time steps needed per time period (default: 2)
- Returns:
- data_w_wet_flags
Target data with wet flags
- rainfallqc.checks.neighbourhood_checks.filter_data_based_on_unusual_wetness(neighbour_data_diff: DataFrame, target_gauge_col: str, nearest_neighbour: str, wet_threshold: float) DataFrame[source]
Filter data based on wet threshold.
- Parameters:
- neighbour_data_diff
Data with normalised diff to neighbour
- target_gauge_col
Target gauge column
- nearest_neighbour
Neighbouring gauge column
- wet_threshold
Threshold for rainfall intensity in given time period
- Returns:
- filtered_diff
Data filtered to wet threshold and where diff is positive (thus more wet)
- rainfallqc.checks.neighbourhood_checks.flag_dry_spell_fractions(one_neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, proportion_of_dry_day_for_flags: dict) DataFrame[source]
Flag dry spell fractions.
- Parameters:
- one_neighbour_data
Rainfall data of one neighbouring gauge with time col
- target_gauge_col
Target gauge column
- nearest_neighbour
Neighbouring gauge column
- proportion_of_dry_day_for_flags
Proportion of dry days needed to be flagged 1, 2, or 3
- Returns:
- data_w_dry_spell_fraction
Target data with dry spell fractions
- rainfallqc.checks.neighbourhood_checks.flag_monthly_factor_differences(monthly_factor: DataFrame) DataFrame[source]
Flag monthly factor differences, following QC25 of the IntenseQC framework.
Flags:
1, when ~10 x greater than neighbour monthly total
2, when ~25.4 x greater …
3, when ~2.54 x greater …
4, when ~10 x smaller than neighbour monthly total
5, when ~25.4 x smaller …
6, when ~2.54 x smaller …
else, 0
- Parameters:
- monthly_factor
Rainfall data with ‘factor_diff’ and gauge_col
- target_gauge_col
Rain column
- Returns:
- monthly_factor_w_flag
Rainfall data with flags based on monthly factor difference
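The monthly factor flags can be illustrated with a small lookup. Because the documentation only gives approximate factors ("~10", "~25.4", "~2.54"), the relative tolerance used below is a hypothetical value, and the function name is an assumption. The factors 25.4 and 2.54 correspond to inch-to-millimetre and inch-to-centimetre conversion, which suggests the check is aimed at unit errors.

```python
def flag_monthly_factor(factor, rel_tolerance=0.1):
    """Return a flag from 0 to 6 for a monthly target / neighbour ratio.

    rel_tolerance is a hypothetical tolerance around each reference factor.
    """
    if factor is None or factor <= 0:
        return 0
    greater = {10.0: 1, 25.4: 2, 2.54: 3}
    smaller = {10.0: 4, 25.4: 5, 2.54: 6}
    for ref, flag in greater.items():
        if abs(factor - ref) <= rel_tolerance * ref:
            return flag
    for ref, flag in smaller.items():
        # "x times smaller" means the inverse ratio is close to the reference.
        if abs(1.0 / factor - ref) <= rel_tolerance * ref:
            return flag
    return 0


factors = [10.2, 25.0, 2.5, 0.1, 0.04, 0.4, 1.0]
print([flag_monthly_factor(f) for f in factors])  # [1, 2, 3, 4, 5, 6, 0]
```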
- rainfallqc.checks.neighbourhood_checks.flag_percentage_diff_of_neighbour(neighbour_data: DataFrame, nearest_neighbour: str) DataFrame[source]
Flag percentage difference between target gauge and neighbouring gauge.
Flags -3 to 3 based on percentage difference:
-3, -100% (i.e. gauge dry but neighbours not)
-2, <= 50%
-1, <= 25%
1, >= 25%
2, >= 50%
3, >= 100%
- Parameters:
- neighbour_data
Rainfall data of all neighbouring gauges with time col
- nearest_neighbour:
Neighbouring gauge column
- Returns:
- neighbour_data_w_flags
Data with perc_diff flags
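A possible mapping from percentage difference to the -3 to 3 flags is sketched below. Two assumptions are made and should be checked against the source: the percentage difference is computed as (target - neighbour) / neighbour x 100, and the thresholds for the negative flags are -50% and -25%. The function names are hypothetical.

```python
def percentage_diff(target, neighbour):
    """Percentage difference of target relative to neighbour (assumed definition)."""
    if target is None or neighbour is None or neighbour == 0:
        return None
    return (target - neighbour) / neighbour * 100.0


def flag_percentage_diff(perc_diff):
    """Map a percentage difference to a flag between -3 and 3."""
    if perc_diff is None:
        return None
    if perc_diff <= -100:
        return -3  # target dry while the neighbour is not
    if perc_diff <= -50:
        return -2
    if perc_diff <= -25:
        return -1
    if perc_diff >= 100:
        return 3
    if perc_diff >= 50:
        return 2
    if perc_diff >= 25:
        return 1
    return 0


diffs = [-100, -60, -30, 0, 30, 60, 150]
print([flag_percentage_diff(d) for d in diffs])  # [-3, -2, -1, 0, 1, 2, 3]
```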
- rainfallqc.checks.neighbourhood_checks.flag_wet_day_errors_based_on_neighbours(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, wet_threshold: float) DataFrame[source]
Flag wet days with errors based on the percentile difference with neighbouring gauge.
- Parameters:
- neighbour_data
Rainfall data of all neighbouring gauges with time col
- target_gauge_col
Target gauge column
- nearest_neighbour:
Neighbouring gauge column
- wet_threshold
Threshold for rainfall intensity in given time period
- Returns:
- neighbour_data_wet_flags
Data with wet flags
- rainfallqc.checks.neighbourhood_checks.get_dry_spell_fraction_col(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str, dry_period_days: int) DataFrame[source]
Get dry spell fraction column.
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- target_gauge_col
Target gauge column
- nearest_neighbour:
Neighbouring gauge column
- dry_period_days
Length of a “dry_spell” in days
- Returns:
- data_w_dry_spell_fraction
Target data with dry spell fractions
- rainfallqc.checks.neighbourhood_checks.get_majority_positive_or_negative_flags(monthly_neighbour_data: DataFrame, list_of_nearest_stations: list, min_n_neighbours: int, n_neighbours_ignored: int) DataFrame[source]
Get majority voted positive or negative flags i.e. get minimum positive flag, or maximum negative flag.
- Parameters:
- monthly_neighbour_data
Monthly rainfall data of neighbouring gauges with time col
- list_of_nearest_stations:
List of columns with neighbouring gauges
- min_n_neighbours
Minimum number of neighbours needed to be checked for flag
- n_neighbours_ignored
Number of zero flags allowed for majority voting
- Returns:
- data_w_monthly_flag
Data with majority_monthly_flag
- rainfallqc.checks.neighbourhood_checks.get_majority_voting_flag(neighbour_data: DataFrame, list_of_nearest_stations: list[str], min_n_neighbours: int, n_zeros_allowed: int, flag_col_prefix: str, new_flag_col_name: str, aggregation: str) DataFrame[source]
Get the highest flag that is in all neighbours.
For this function, we introduce the ‘n_zeros_allowed’ parameter to allow some leeway for problematic neighbours. This prevents a problematic neighbour that behaves like a problematic target from suppressing the flag.
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- list_of_nearest_stations:
List of columns with neighbouring gauges
- min_n_neighbours
Minimum number of neighbours online that will be considered
- n_zeros_allowed
Number of zero flags allowed (default: 0)
- flag_col_prefix
Prefix for flag column e.g. “wet_flag_”
- new_flag_col_name
New flag column name
- aggregation
“min” or “max”
- Returns:
- neighbour_data_w_majority_wet_flag
Data with majority wet flag
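The majority voting rule can be sketched for a single time step as follows. The sketch assumes non-negative flags and a "min" aggregation, since the highest flag shared by all neighbours is their minimum. The return value when too few neighbours are online (None here) and the function name are assumptions.

```python
def majority_voting_flag(neighbour_flags, min_n_neighbours, n_zeros_allowed=0):
    """neighbour_flags: one flag per neighbour at a time step (None = offline)."""
    online = [f for f in neighbour_flags if f is not None]
    if len(online) < min_n_neighbours:
        return None  # too few neighbours online to decide
    # Discard up to n_zeros_allowed zero flags so one dissenting neighbour
    # cannot veto the flag on its own.
    ordered = sorted(online)
    n_zeros = sum(1 for f in ordered if f == 0)
    kept = ordered[min(n_zeros, n_zeros_allowed):]
    # The highest flag present in all remaining neighbours is their minimum.
    return min(kept) if kept else 0


print(majority_voting_flag([3, 2, 3], min_n_neighbours=3))                     # 2
print(majority_voting_flag([3, 0, 3], min_n_neighbours=3))                     # 0
print(majority_voting_flag([3, 0, 3], min_n_neighbours=3, n_zeros_allowed=1))  # 3
```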
- rainfallqc.checks.neighbourhood_checks.make_neighbour_monthly_max_climatology(monthly_neighbour_data: DataFrame, list_of_nearest_stations: list) DataFrame[source]
Make neighbourhood monthly max climatology.
- Parameters:
- monthly_neighbour_data
Monthly rainfall data of neighbouring gauges with time col
- list_of_nearest_stations:
List of columns with neighbouring gauges
- Returns:
- data_w_monthly_flags
Target data with monthly flags
- rainfallqc.checks.neighbourhood_checks.make_num_neighbours_online_col(neighbour_data: DataFrame, list_of_nearest_stations: list[str]) DataFrame[source]
Get number of neighbours online column.
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- list_of_nearest_stations
Neighbouring columns to check if not null
- Returns:
- neighbour_data_online_neighbours
Data with column for number of online neighbours
- rainfallqc.checks.neighbourhood_checks.normalised_diff_between_target_neighbours(neighbour_data: DataFrame, target_gauge_col: str, nearest_neighbour: str) DataFrame[source]
Normalised difference between target rain col and neighbouring rain col.
- Parameters:
- neighbour_data
Rainfall data of all neighbouring gauges with time col
- target_gauge_col
Target gauge column
- nearest_neighbour
Neighbouring gauge column
- Returns:
- neighbour_data_w_diff
Data with normalised diff to each neighbour
- rainfallqc.checks.neighbourhood_checks.upgrade_monthly_flag_using_neighbour_max_climatology(monthly_neighbour_data_w_flags: DataFrame, target_gauge_col: str, min_n_neighbours: int) DataFrame[source]
Upgrade flags to 4 and 5 flags for monthly neighbours in excess of neighbourhood monthly climatological max.
- Parameters:
- monthly_neighbour_data_w_flags
Monthly rainfall data of neighbouring gauges with time col and ‘majority_monthly_flag’
- target_gauge_col
Target gauge column
- min_n_neighbours
Minimum number of neighbours needed to be checked for flag
- Returns:
- data_w_monthly_flags
Target data with monthly flags
2.1.1.5. rainfallqc.checks.timeseries_checks module
Quality control checks based on suspicious time-series artefacts.
Time-series checks are defined as QC checks that: “detect abnormalities in patterns of the data record.”
Classes and functions ordered by appearance in IntenseQC framework.
- rainfallqc.checks.timeseries_checks.check_daily_accumulations(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float, wet_day_threshold: int | float = 1.0, accumulation_multiplying_factor: int | float = 2.0, accumulation_threshold: float = None) DataFrame[source]
Identify suspicious periods where an hour of rainfall is preceded by 23 hours with no rain.
Uses a simple precipitation intensity index (SDII) from ETCCDI.
This is QC13 from the IntenseQC framework.
Please see ‘Notes’ below for any additional information about the implementation of this method.
- Parameters:
- data
Hourly or 15-min rainfall data
- target_gauge_col
Column with rainfall data
- gauge_lat
latitude of the rain gauge
- gauge_lon
longitude of the rain gauge
- wet_day_threshold
Threshold for rainfall intensity in one day (default is 1 mm)
- accumulation_multiplying_factor
Factor by which the SDII value is multiplied to identify an accumulation of rain recordings (default is 2)
- accumulation_threshold
Rain accumulation for detecting possible daily accumulations
- Returns:
- data_w_daily_accumulation_flags
Data with daily accumulation flags
Notes
This method returns only 0 and 1 flags. This differs from the description of the daily accumulation check from IntenseQC. This decision was taken as the IntenseQC Python package only returns 0 and 1 flags.
- rainfallqc.checks.timeseries_checks.check_dry_period_cdd(data: DataFrame, target_gauge_col: str, time_res: str, gauge_lat: int | float, gauge_lon: int | float) DataFrame[source]
Identify suspiciously long dry periods in time-series using the ETCCDI Consecutive Dry Days (CDD) index.
This is QC12 from the IntenseQC framework.
- Parameters:
- data
Rainfall data
- target_gauge_col
Column with rainfall data
- time_res
Temporal resolution of the time series either ‘15m’, ‘daily’ or ‘hourly’
- gauge_lat
latitude of the rain gauge
- gauge_lon
longitude of the rain gauge
- Returns:
- data_w_dry_spell_flags
Data with dry spell flags
- rainfallqc.checks.timeseries_checks.check_monthly_accumulations(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float, min_dry_spell_duration_in_days: int = 28, wet_day_threshold: int | float = 1.0, accumulation_multiplying_factor: int | float = 2.0, accumulation_threshold: float = None) DataFrame[source]
Identify suspicious periods when an hour of rainfall is preceded by 1 month with no rain.
Flags two different types of accumulations:
1) dry, when the isolated high value is followed by no rain
2) wet, when the isolated value is followed by a few more wet values
Uses a simple precipitation intensity index (SDII) from ETCCDI.
This is QC14 from the IntenseQC framework.
- Parameters:
- data
Daily, hourly or 15-min rainfall data
- target_gauge_col
Column with rainfall data
- gauge_lat
latitude of the rain gauge
- gauge_lon
longitude of the rain gauge
- min_dry_spell_duration_in_days
Minimum number of days in dry spell preceding monthly accumulation (default is 28 i.e. Feb)
- wet_day_threshold
Threshold for rainfall intensity in one day (default is 1 mm)
- accumulation_multiplying_factor
Factor by which the SDII value is multiplied to identify an accumulation of rain recordings (default is 2)
- accumulation_threshold
Rain accumulation for detecting possible monthly accumulations
- Returns:
- data_w_monthly_accumulation_flags
Data with monthly accumulation flags
Notes
The original method filters out dry spells less than
- rainfallqc.checks.timeseries_checks.check_streaks(data: DataFrame, target_gauge_col: str, gauge_lat: int | float, gauge_lon: int | float, smallest_measurable_rainfall_amount: float, accumulation_threshold: float = None) DataFrame[source]
Check for suspected repeated values.
Flags (TODO: could change numbers as original includes unhelpful 2):
1, if streaks of 2 or more repeated values exceeding 2 x mean wet day rainfall
3, if streaks of 12 or more greater than smallest measurable rainfall amount
4, if streaks of 24 or more greater than zero
5, if period of zeros bounded by streaks of >= 24
This is QC15 from the IntenseQC framework.
- Parameters:
- data
Hourly or 15-min data with rainfall.
- target_gauge_col
Column with rainfall data.
- gauge_lat
latitude of the rain gauge.
- gauge_lon
longitude of the rain gauge.
- smallest_measurable_rainfall_amount
Resolution of rainfall data (i.e. minimum rainfall recording).
- accumulation_threshold
Rain accumulation for detecting possible monthly accumulations
- Returns:
- data_w_streak_flags
Data with streak flags.
- rainfallqc.checks.timeseries_checks.compute_dry_spell_days(dry_spell_data: Dataset) Dataset[source]
Compute dry spells in days from ETCCDI Consecutive Dry Days data.
- Parameters:
- dry_spell_data
ETCCDI CDD index data
- Returns:
- dry_spell_days
ETCCDI CDD index data with CDD_days variable
- rainfallqc.checks.timeseries_checks.fill_in_monthly_accumulation_flags(monthly_accumulation_flags: DataFrame, time_step: str, min_dry_spell_duration: int | float, max_dry_spell_duration: int | float) DataFrame[source]
Fill in flags preceding monthly accumulation.
- Parameters:
- monthly_accumulation_flags
Rainfall data with monthly accumulation flag and dry spell info
- time_step
Time step of data i.e. ‘1h’, ‘1d’, ‘15m’.
- min_dry_spell_duration
Minimum dry spell duration
- max_dry_spell_duration
Maximum dry spell duration
- Returns:
- monthly_accumulation_flags
Data with accumulation flag filled in
- rainfallqc.checks.timeseries_checks.flag_accumulation_based_on_next_dry_spell_duration(data: DataFrame, min_dry_spell_duration: int | float, accumulation_col_name: str) DataFrame[source]
Flag possible accumulation based on subsequent minimum dry spell duration.
Flags:
3, if dry spell followed by high value then wet period (wet)
1, if dry spell followed by high value then no rain for next 23 hours (dry)
0, if neither
- Parameters:
- data
Rainfall data with dry spell info and possible accumulation label
- min_dry_spell_duration
Minimum dry spell duration
- accumulation_col_name
Name for accumulation column
- Returns:
- data_w_flag
Data with accumulation flag
- rainfallqc.checks.timeseries_checks.flag_accumulation_periods(data: DataFrame, target_gauge_col: str, accumulation_threshold: float, accumulation_period_in_hours: int) ndarray[source]
Flag accumulation in a given period of hourly data.
TODO: make work for daily using: DAILY_DIVIDING_FACTOR
- Parameters:
- data
Hourly rainfall data
- target_gauge_col
Column with rainfall data
- accumulation_threshold
Rain accumulation for detecting possible period accumulations
- accumulation_period_in_hours
Accumulation period in hours
- Returns:
- pa_flags
Accumulation flags
- rainfallqc.checks.timeseries_checks.flag_dry_spell_duration(dry_spell_lengths: DataFrame, ref_dry_spell_length: int | float, time_res: str) DataFrame[source]
Flag the dry spell duration using reference local dry spell length.
- Parameters:
- dry_spell_lengths
Data with dry spell lengths
- ref_dry_spell_length
Reference dry spell length
- time_res
Temporal resolution of the time series either ‘daily’ or ‘hourly’
- Returns:
- dry_spell_lengths_flags
Data with dry spell flags
- rainfallqc.checks.timeseries_checks.flag_n_hours_accumulation_based_on_threshold(period_rain_vals: Series, accumulation_threshold: float, n_hours: int) int | float[source]
Flag a period as accumulation if a value is preceded by n hourly recordings of 0.
- Parameters:
- period_rain_vals
One period of rain values
- accumulation_threshold
Reference SDII threshold
- n_hours
Number of hours in reference period
- Returns:
- flag
1 if period accumulation, otherwise 0
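The rule above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name `sketch_flag_period_accumulation` is hypothetical, and it assumes the period is ordered in time so that the candidate accumulation value is the last element.

```python
from typing import Sequence


def sketch_flag_period_accumulation(
    period_rain_vals: Sequence[float], accumulation_threshold: float
) -> int:
    """Hypothetical helper: 1 if only the last value is wet and it exceeds the threshold."""
    if len(period_rain_vals) < 2:
        return 0  # need at least one preceding record to call it an accumulation
    *preceding, last = period_rain_vals
    # Assumption: every record before the last one must be exactly 0.
    all_dry_before = all(value == 0 for value in preceding)
    return int(all_dry_before and last > accumulation_threshold)


# 23 dry hours followed by one large reading looks like an accumulation.
print(sketch_flag_period_accumulation([0.0] * 23 + [30.0], accumulation_threshold=20.0))
```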
- rainfallqc.checks.timeseries_checks.flag_streaks_exceeding_smallest_measurable_rainfall_amount(data: DataFrame, target_gauge_col: str, streak_length: int, smallest_measurable_rainfall_amount: float) DataFrame[source]
Flag streaks exceeding smallest measurable rainfall amount in data.
- Parameters:
- data
Rainfall data with streak_id.
- target_gauge_col
Column with rainfall data.
- streak_length
Only streaks longer than this will be considered.
- smallest_measurable_rainfall_amount
Resolution of rainfall data (i.e. minimum rainfall recording).
- Returns:
- data_w_flags
Data with streak flag 3
- rainfallqc.checks.timeseries_checks.flag_streaks_exceeding_wet_day_rainfall_threshold(data: DataFrame, target_gauge_col: str, streak_length: int, accumulation_threshold: float) DataFrame[source]
Flag values exceeding wet day rainfall accumulation threshold.
- Parameters:
- data
Rainfall data with streak_id.
- target_gauge_col
Column with rainfall data.
- streak_length
Only streaks longer than this will be considered
- accumulation_threshold
Threshold for rain accumulation.
- Returns:
- data_w_flags
Data with streak flag 1
- rainfallqc.checks.timeseries_checks.flag_streaks_exceeding_zero(data: DataFrame, target_gauge_col: str, streak_length: int) DataFrame[source]
Flag streaks of values exceeding zero.
- Parameters:
- data
Rainfall data with streak_id.
- target_gauge_col
Column with rainfall data.
- streak_length
Only streaks longer than this will be considered.
- Returns:
- data_w_flags
Data with streak flag 4
- rainfallqc.checks.timeseries_checks.flag_streaks_of_zero_bounded_by_days(data: DataFrame, target_gauge_col: str, time_res: str) DataFrame[source]
Flag streaks of zeros bounded by records that are a multiple of 24 hours.
- Parameters:
- data
Hourly, 15-min or daily data with rainfall.
- target_gauge_col
Column with rainfall data.
- time_res
Time resolution: “1h”, “15m”, “1d”, or “hourly”, “daily”
- Returns:
- streaks_w_flag5
Data with streak flag 5.
- rainfallqc.checks.timeseries_checks.get_accumulation_threshold(etccdi_sdii: float, gauge_sdii: float, accumulation_multiplying_factor: int | float) float[source]
Get rainfall accumulation threshold based on the ETCCDI or rain gauge Simple Precipitation Intensity Index (SDII).
- Parameters:
- etccdi_sdii
SDII value from ETCCDI
- gauge_sdii
SDII value from rain gauge
- accumulation_multiplying_factor
Factor by which to multiply the SDII value to identify an accumulation of rain recordings
- Returns:
- accumulation_threshold
Reference SDII threshold
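One way such a threshold could be derived is sketched below. The selection rule is an assumption made for illustration (use the ETCCDI value when it is a finite number, otherwise fall back to the gauge value); the documentation above does not state how the library chooses between the two. The function name `sketch_accumulation_threshold` is hypothetical.

```python
import math


def sketch_accumulation_threshold(
    etccdi_sdii: float, gauge_sdii: float, accumulation_multiplying_factor: float
) -> float:
    """Hypothetical helper: scale an SDII value into an accumulation threshold."""
    # Assumption: prefer the ETCCDI SDII, fall back to the gauge SDII if it is NaN or inf.
    sdii = etccdi_sdii if math.isfinite(etccdi_sdii) else gauge_sdii
    return sdii * accumulation_multiplying_factor


print(sketch_accumulation_threshold(5.0, 4.0, 2))            # uses the ETCCDI value
print(sketch_accumulation_threshold(float("nan"), 4.0, 2))   # falls back to the gauge value
```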
- rainfallqc.checks.timeseries_checks.get_accumulation_threshold_from_etccdi(data: DataFrame, target_gauge_col: str, time_res: str, gauge_lat: int | float, gauge_lon: int | float, wet_day_threshold: float, accumulation_multiplying_factor: float) float[source]
Get rain accumulation threshold from ETCCDI data.
- Parameters:
- data
Rainfall data.
- target_gauge_col
Column with rainfall data.
- time_res
Temporal resolution of the time series either ‘15m’, ‘daily’ or ‘hourly’
- gauge_lat
latitude of the rain gauge.
- gauge_lon
longitude of the rain gauge.
- wet_day_threshold
Threshold for rainfall intensity in one day (whether it is a wet day or not)
- accumulation_multiplying_factor
Factor by which to multiply the SDII value to identify an accumulation of rain recordings
- Returns:
- accumulation_threshold
Rain accumulation threshold, e.g. 2 * the SDII value
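For context, the ETCCDI SDII is the mean precipitation on wet days. The sketch below computes it from a list of daily totals and scales it by a factor of 2. It is a plain-Python illustration, not the library's code: the name `sketch_sdii` is hypothetical, and the default wet-day threshold of 1.0 mm is the usual ETCCDI convention rather than something stated above.

```python
from typing import Sequence


def sketch_sdii(daily_totals: Sequence[float], wet_day_threshold: float = 1.0) -> float:
    """Hypothetical helper: mean rainfall over days at or above the wet-day threshold."""
    wet_days = [total for total in daily_totals if total >= wet_day_threshold]
    if not wet_days:
        return float("nan")  # no wet days, so the index is undefined
    return sum(wet_days) / len(wet_days)


daily = [0.0, 0.5, 2.0, 4.0, 0.0, 6.0]
gauge_sdii = sketch_sdii(daily)          # (2.0 + 4.0 + 6.0) / 3
accumulation_threshold = 2 * gauge_sdii  # multiplying factor of 2, as in the example above
print(gauge_sdii, accumulation_threshold)
```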
- rainfallqc.checks.timeseries_checks.get_consecutive_dry_days(gauge_dry_spells: DataFrame) DataFrame[source]
Get consecutive groups of 0 rainfall days.
- Parameters:
- gauge_dry_spells
Data with ‘is_dry’ column
- Returns:
- gauge_dry_spell_groups
Data with group ids for consecutive dry days
- rainfallqc.checks.timeseries_checks.get_daily_non_wr_data(data: DataFrame, target_gauge_col: str, time_res: str) DataFrame[source]
Get daily non-world record data.
- Parameters:
- data
Hourly rainfall data
- target_gauge_col
Column with rainfall data
- time_res
Temporal resolution of the time series either ‘15m’, ‘daily’ or ‘hourly’
- Returns:
- daily_data_not_wr
Daily rainfall data with world records filtered out
- rainfallqc.checks.timeseries_checks.get_dry_spell_duration(data: DataFrame, target_gauge_col: str) DataFrame[source]
Get consecutive dry spell duration.
- Parameters:
- data
Rainfall data
- target_gauge_col
Column with rainfall data
- Returns:
- gauge_dry_spell_lengths
Data with dry spell start, end and duration
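The idea of extracting dry spell start, end and duration can be illustrated with a run-length pass over the values. This is a minimal sketch using positional indices in place of timestamps; the helper name `sketch_dry_spells` is hypothetical and the library itself works on DataFrames.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def sketch_dry_spells(rain_vals: Sequence[float]) -> List[Tuple[int, int, int]]:
    """Hypothetical helper: (start_index, end_index, duration) for each run of zeros."""
    spells = []
    index = 0
    # groupby yields consecutive runs of dry (value == 0) and wet records.
    for is_dry, run in groupby(rain_vals, key=lambda value: value == 0):
        length = sum(1 for _ in run)
        if is_dry:
            spells.append((index, index + length - 1, length))
        index += length
    return spells


print(sketch_dry_spells([0, 0, 1.2, 0, 0, 0, 0.4]))
```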
- rainfallqc.checks.timeseries_checks.get_dry_spell_info(data: DataFrame, target_gauge_col: str) DataFrame[source]
Get a summary of dry spells (i.e. duration, first wet value after a dry spell, and durations of the previous and next dry spells).
- Parameters:
- data
Hourly rainfall data
- target_gauge_col
Column with rainfall data
- Returns:
- gauge_dry_spell_info
Data with dry spell information
- rainfallqc.checks.timeseries_checks.get_first_wet_after_dry_spell(data: DataFrame, target_gauge_col: str) DataFrame[source]
Get first non-zero rainfall value after dry spell.
- Parameters:
- data
Rainfall data
- target_gauge_col
Column with rainfall data
- Returns:
- data_w_first_wet
Data with binary column denoting first wet after dry spell
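A binary "first wet after dry" marker can be sketched as follows. The assumption here, which the documentation does not confirm, is that a record qualifies when it is non-zero and the record immediately before it is zero. The name `sketch_first_wet_after_dry` is hypothetical.

```python
from typing import List, Sequence


def sketch_first_wet_after_dry(rain_vals: Sequence[float]) -> List[int]:
    """Hypothetical helper: 1 where a wet record directly follows a dry record."""
    flags = [0] * len(rain_vals)
    for i in range(1, len(rain_vals)):
        # Assumption: "dry" means exactly 0 and "wet" means greater than 0.
        if rain_vals[i] > 0 and rain_vals[i - 1] == 0:
            flags[i] = 1
    return flags


print(sketch_first_wet_after_dry([0, 0, 1.5, 0.3, 0, 2.0]))
```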
- rainfallqc.checks.timeseries_checks.get_local_etccdi_sdii_mean(gauge_lat: int | float, gauge_lon: int | float) float[source]
Get the mean SDII (Simple Precipitation Intensity Index) from nearby ETCCDI data.
- Parameters:
- gauge_lat
latitude of the rain gauge
- gauge_lon
longitude of the rain gauge
- Returns:
- nearby_etccdi_sdii_mean
Local mean SDII value
- rainfallqc.checks.timeseries_checks.get_possible_accumulations(gauge_dry_spell_info: DataFrame, target_gauge_col: str, accumulation_threshold: float) DataFrame[source]
Get possible accumulations as 0 or 1 based on dry spell info.
- Parameters:
- gauge_dry_spell_info
Rainfall data with columns with dry spell info (durations, first_wet_after_dry, etc.)
- target_gauge_col
Column with rainfall data
- accumulation_threshold
Threshold of rainfall intensity
- Returns:
- gauge_data_possible_accumulations
Data with 1 if possible accumulation, otherwise 0.
- rainfallqc.checks.timeseries_checks.get_streaks_above_threshold(data: DataFrame, target_gauge_col: str, streak_length: int, value_threshold: int | float) DataFrame[source]
Get streak groups above given threshold.
- Parameters:
- data
Rainfall data with streak_id.
- target_gauge_col
Column with rainfall data.
- streak_length
Minimum length of streaks.
- value_threshold
Threshold to check.
- Returns:
- streaks_above_accumulation
All streaks above the given value
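Selecting streaks by length and value can be illustrated with the sketch below. It treats `streak_length` as an inclusive minimum, following the parameter description for this function; that reading, the helper name `sketch_streaks_above_threshold`, and the use of positional indices are assumptions for illustration.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def sketch_streaks_above_threshold(
    values: Sequence[float], streak_length: int, value_threshold: float
) -> List[Tuple[int, int, float]]:
    """Hypothetical helper: (start_index, length, value) for qualifying streaks."""
    streaks = []
    index = 0
    for value, run in groupby(values):
        length = sum(1 for _ in run)
        # Assumption: streak_length is an inclusive minimum and the value test is strict.
        if length >= streak_length and value > value_threshold:
            streaks.append((index, length, value))
        index += length
    return streaks


series = [0.2, 0.2, 0.2, 0.0, 0.0, 0.0, 0.0, 1.4, 1.4]
print(sketch_streaks_above_threshold(series, streak_length=3, value_threshold=0))
```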
- rainfallqc.checks.timeseries_checks.get_streaks_of_repeated_values(data: DataFrame, data_col: str) DataFrame[source]
Get streaks of repeated values in time series.
- Parameters:
- data
Data with time column.
- data_col
Column with values to check streaks in.
- Returns:
- streak_data
Data with streak column.
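Assigning a streak identifier to each record is a run-length labelling problem. The sketch below shows the idea on a plain list; the name `sketch_streak_ids` is hypothetical and the numbering scheme (starting at 0) is an assumption, not necessarily what the library's streak column contains.

```python
from typing import List, Sequence


def sketch_streak_ids(values: Sequence[float]) -> List[int]:
    """Hypothetical helper: label each record with the id of its run of repeated values."""
    streak_ids = []
    current_id = -1
    previous = object()  # sentinel that compares unequal to every value
    for value in values:
        if value != previous:
            current_id += 1  # value changed, so a new streak starts here
        streak_ids.append(current_id)
        previous = value
    return streak_ids


print(sketch_streak_ids([0.2, 0.2, 0.2, 0.0, 0.4, 0.4]))
```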
- rainfallqc.checks.timeseries_checks.get_surrounding_dry_spell_lengths(data: DataFrame) DataFrame[source]
Make prev_dry_spell and next_dry_spell columns from dry_spell_lengths.
- Parameters:
- data
Data with dry_spell_lengths
- Returns:
- data
Data with columns of previous and next dry spell durations
- rainfallqc.checks.timeseries_checks.join_dry_spell_data_back_to_original(data: DataFrame, dry_spell_lengths_flags: DataFrame) DataFrame[source]
Flag dry spell data using dry spell lengths.
- Parameters:
- data
Rainfall data
- dry_spell_lengths_flags
Data with dry spell flags
- Returns:
- dry_spell_flag_data
Data with dry spell flags
2.1.1.6. rainfallqc.checks.pypwsqc_filters module
Quality control checks translated from the pyPWSQC framework (https://pypwsqc.readthedocs.io/en/latest/).
The PWSQC framework includes filters originally developed for automated PWS within the COST Action OPENSENSE.
‘run_’ and ‘check_’ relate to the algorithms from pyPWSQC.
Functions are ordered alphabetically.
- rainfallqc.checks.pypwsqc_filters.check_faulty_zeros(neighbour_data: DataFrame, neighbour_metadata: DataFrame, neighbouring_gauge_ids: List[str], neighbour_metadata_gauge_id_col: str, time_res: str, projection: str, nint: int, n_stat: int, max_distance_for_neighbours: int | float = 10000.0, time_units: str = 'seconds since 1970-01-01 00:00:00', rainfall_attributes: dict = {'coverage_contant_type': 'physicalMeasurement', 'long_name': 'rainfall amount per time unit', 'name': 'rainfall', 'units': 'mm'}, lat_lon_attributes: dict = {'unit': 'degrees in WGS84 projection'}, global_attributes: dict = None) Dataset[source]
Will flag faulty zeros based on neighbours …
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- neighbour_metadata
Metadata for the rainfall data with ‘latitude’ and ‘longitude’
- neighbour_metadata_gauge_id_col
Column with the gauge ID
- target_gauge_col
Target gauge column
- neighbouring_gauge_ids
List of ids with neighbouring gauges
- time_res
Time resolution of data
- projection
cartesian/metric coordinate system
- nint
Number of intervals
- n_stat
Number of stations
- max_distance_for_neighbours
Maximum distance to consider for neighbours
- time_units
Units and encoding of the ‘time’ column
- rainfall_attributes
Attributes for rainfall in the xarray Dataset
- lat_lon_attributes
Attributes for lat and lon in the xarray Dataset
- global_attributes
Global attributes for xarray Dataset
- Returns:
- neighbour_data_ds_filtered
Data with flags for faulty zeros
Examples
available at: https://pypwsqc.readthedocs.io/en/latest/notebooks/merged_filters.html
- rainfallqc.checks.pypwsqc_filters.check_high_influx_filter(neighbour_data: DataFrame) None[source]
High influx filter.
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- Returns:
- neighbour_data
todo
- rainfallqc.checks.pypwsqc_filters.check_station_outlier(neighbour_data: DataFrame, neighbour_metadata: DataFrame, neighbouring_gauge_ids: List[str], neighbour_metadata_gauge_id_col: str, time_res: str, projection: str, evaluation_period: int, mmatch: int, gamma: float, n_stat: int, max_distance_for_neighbours: int | float = 10000.0, time_units: str = 'seconds since 1970-01-01 00:00:00', rainfall_attributes: dict = {'coverage_contant_type': 'physicalMeasurement', 'long_name': 'rainfall amount per time unit', 'name': 'rainfall', 'units': 'mm'}, lat_lon_attributes: dict = {'unit': 'degrees in WGS84 projection'}, global_attributes: dict = None) Dataset[source]
Station outlier.
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- neighbour_metadata
Metadata for the rainfall data with ‘latitude’ and ‘longitude’
- neighbour_metadata_gauge_id_col
Column with the gauge ID
- target_gauge_col
Target gauge column
- neighbouring_gauge_ids
List of ids with neighbouring gauges
- time_res
Time resolution of data
- projection
cartesian/metric coordinate system
- evaluation_period
length of (rolling) window for correlation calculation
- mmatch
threshold for number of matching rainy intervals in evaluation period
- gamma
threshold for rolling median pearson correlation
- n_stat
Number of stations
- max_distance_for_neighbours
Maximum distance to consider for neighbours
- time_units
Units and encoding of the ‘time’ column
- rainfall_attributes
Attributes for rainfall in the xarray Dataset
- lat_lon_attributes
Attributes for lat and lon in the xarray Dataset
- global_attributes
Global attributes for xarray Dataset
- Returns:
- neighbour_data_ds_filtered
Data with flags for station outliers
Examples
available at: https://pypwsqc.readthedocs.io/en/latest/notebooks/merged_filters.html
- rainfallqc.checks.pypwsqc_filters.compute_distance_matrix(neighbour_data_ds: Dataset) Dataset[source]
Compute a distance matrix.
- Parameters:
- neighbour_data_ds
xarray dataset of neighbour data
- Returns:
- distance_matrix
A distance matrix of all neighbouring gauges
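As an illustration of what a distance matrix between gauges contains, the sketch below computes pairwise Euclidean distances from projected (metric) coordinates. It uses a plain dictionary instead of an xarray Dataset; the helper name `sketch_distance_matrix` and the gauge IDs are hypothetical.

```python
import math
from typing import Dict, Tuple


def sketch_distance_matrix(
    coords: Dict[str, Tuple[float, float]]
) -> Dict[str, Dict[str, float]]:
    """Hypothetical helper: pairwise distances between gauges with (x, y) in metres."""
    gauge_ids = list(coords)
    return {
        a: {b: math.dist(coords[a], coords[b]) for b in gauge_ids}
        for a in gauge_ids
    }


# Hypothetical gauges in a projected coordinate system (metres).
gauges = {"G1": (0.0, 0.0), "G2": (3000.0, 4000.0), "G3": (0.0, 12000.0)}
matrix = sketch_distance_matrix(gauges)
print(matrix["G1"]["G2"])
```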
- rainfallqc.checks.pypwsqc_filters.convert_neighbour_data_to_xarray(neighbour_data: DataFrame, neighbour_metadata: DataFrame, projection: str, time_units: str = 'seconds since 1970-01-01 00:00:00', rainfall_attributes: dict = {'coverage_contant_type': 'physicalMeasurement', 'long_name': 'rainfall amount per time unit', 'name': 'rainfall', 'units': 'mm'}, lat_lon_attributes: dict = {'unit': 'degrees in WGS84 projection'}, global_attributes: dict = None) Dataset[source]
Convert neighbour data in polars format to xarray dataset.
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- neighbour_metadata
Metadata for the rainfall data with ‘latitude’ and ‘longitude’
- projection
cartesian/metric coordinate system
- time_units
Units and encoding of the ‘time’ column
- rainfall_attributes
Attributes for rainfall in the xarray Dataset
- lat_lon_attributes
Attributes for lat and lon in the xarray Dataset
- global_attributes
Global attributes for xarray Dataset
- Returns:
- neighbour_data_ds
xarray dataset with assigned attributes
- rainfallqc.checks.pypwsqc_filters.run_bias_correction(neighbour_data: DataFrame) None[source]
Bias correction.
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- Returns:
- neighbour_data
todo
- rainfallqc.checks.pypwsqc_filters.run_event_based_filter(neighbour_data: DataFrame) None[source]
Event based filter (EBF).
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- Returns:
- neighbour_data
todo
- rainfallqc.checks.pypwsqc_filters.run_indicator_correlation(neighbour_data: DataFrame) None[source]
Run indicator correlation.
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- Returns:
- neighbour_data
todo
- rainfallqc.checks.pypwsqc_filters.run_peak_removal(neighbour_data: DataFrame) None[source]
Peak removal.
- Parameters:
- neighbour_data
Rainfall data of neighbouring gauges with time col
- Returns:
- neighbour_data
todo
- rainfallqc.checks.pypwsqc_filters.subset_distance_matrix(neighbour_data_ds: Dataset, distance_matrix: Dataset, max_distance_for_neighbours: int | float) Dataset[source]
Subset a distance matrix using a maximum neighbour distance.
- Parameters:
- neighbour_data_ds
xarray dataset of neighbour data
- distance_matrix
A distance matrix of all neighbouring gauges
- max_distance_for_neighbours
Maximum distance to consider for neighbours
- Returns:
- neighbour_data_ds
A distance matrix of all neighbouring gauges
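Restricting a distance matrix to nearby gauges can be sketched as a simple filter. The example uses a nested dictionary rather than an xarray Dataset, and the helper name `sketch_neighbours_within` and the gauge IDs are hypothetical. The 10000.0 cut-off mirrors the default `max_distance_for_neighbours` shown in the signatures above.

```python
from typing import Dict, List


def sketch_neighbours_within(
    distance_matrix: Dict[str, Dict[str, float]], max_distance: float
) -> Dict[str, List[str]]:
    """Hypothetical helper: for each gauge, list the other gauges within max_distance."""
    return {
        gauge: [
            other
            for other, distance in row.items()
            if other != gauge and distance <= max_distance
        ]
        for gauge, row in distance_matrix.items()
    }


# Hypothetical distances in metres between three gauges.
distances = {
    "G1": {"G1": 0.0, "G2": 5000.0, "G3": 12000.0},
    "G2": {"G1": 5000.0, "G2": 0.0, "G3": 8500.0},
    "G3": {"G1": 12000.0, "G2": 8500.0, "G3": 0.0},
}
nearby = sketch_neighbours_within(distances, max_distance=10000.0)
print(nearby)
```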
2.1.1.7. Module contents
Quality checks.