quantificationlib.search module

Search functions, needed by those quantifiers based on distribution matching that compute the estimated prevalence using a search algorithm.

compute_quantiles(prevalence=None, probabilities=None, n_quantiles=None, y=None, classes=None)[source]

Compute quantiles.

Used by QUANTy. It computes the quantiles both for the testing distribution (in which case the value of the prevalence is ignored) and for the weighted mixture of positives and negatives (which depends on the value of the prevalence parameter).

Parameters:
  • prevalence (float or None) – The value of the prevalence of the positive class used to compute the mixture of the positives and the negatives. To compute the quantiles of the testing set, this parameter must be None

  • probabilities (ndarray, shape (n_examples, 1)) – The ordered probabilities for all examples. Notice that, when computing the mixture of the positives and the negatives, this array contains the probabilities for all the examples of the training set

  • n_quantiles (int) – Number of quantiles. This parameter is used with Quantiles-based algorithms.

  • y (array) – The true label of each example. Used with Quantiles-based algorithms, which need the true labels

  • classes (ndarray, shape (n_classes, )) – Class labels. Used by Quantiles-based algorithms

Returns:

quantiles – The value of the quantiles given the probabilities (and the value of the prevalence if we are computing the quantiles of the training mixture distribution)

Return type:

array, shape (n_quantiles,)
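As an illustration of the quantile representation, a minimal sketch is shown below. The name `quantiles_sketch` and the equal-sized grouping of the sorted probabilities are assumptions for illustration, not the library's implementation:

```python
import numpy as np

def quantiles_sketch(probabilities, n_quantiles):
    # Sort the probabilities, split them into n_quantiles equal-sized
    # groups and summarize each group by its mean probability.
    probs = np.sort(np.asarray(probabilities).ravel())
    groups = np.array_split(probs, n_quantiles)
    return np.array([g.mean() for g in groups])
```

With 10 evenly spaced probabilities and 5 quantiles, each quantile is the mean of two consecutive probabilities.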

compute_sord_weights(prevalence=None, union_labels=None, classes=None)[source]

Computes the weight of each example, depending on the prevalence, that is later used to compute the SORD distance.

Parameters:
  • prevalence (float) – The prevalence of the positive class

  • union_labels (ndarray, shape (n_examples_train+n_examples_test, 1)) – Indicates, for each prediction, the set it belongs to or its label. If the prediction corresponds to a training example, the value is the true class of that example; if the example belongs to the testing distribution, the value is NaN

  • classes (ndarray, shape (n_classes, )) – Class labels

Returns:

weights – The weight of each example, which equals:

negative class: (1 - prevalence) / |D^-|

positive class: prevalence / |D^+|

testing examples: -1 / |T|

Return type:

array, same shape as union_labels

References

André Maletzke, Denis dos Reis, Everton Cherman, and Gustavo Batista: DyS: A framework for mixture models in quantification. In AAAI 2019, volume 33, pp. 4552–4560. 2019.
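The weighting rules above can be sketched as follows. The function name `sord_weights_sketch` and the exact handling of the class labels are illustrative assumptions, not the library's code:

```python
import numpy as np

def sord_weights_sketch(prevalence, union_labels, classes):
    # Training negatives get (1 - prevalence)/|D^-|, training positives
    # get prevalence/|D^+|, and testing examples (marked NaN) get -1/|T|.
    labels = np.asarray(union_labels, dtype=float).ravel()
    is_test = np.isnan(labels)
    is_neg = labels == classes[0]
    is_pos = labels == classes[1]
    weights = np.empty_like(labels)
    weights[is_neg] = (1.0 - prevalence) / is_neg.sum()
    weights[is_pos] = prevalence / is_pos.sum()
    weights[is_test] = -1.0 / is_test.sum()
    return weights
```

By construction, the training weights sum to 1 and the testing weights sum to -1, so the total is 0.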

Search function for non-V-shape distance functions

Given a function distance_func with a single local minimum in the interval [0, 1], the method returns the prevalence that minimizes the difference between the mixture training distribution and the testing distribution according to distance_func.

This method is based on Golden Section Search, but that kind of search only works when the loss is V-shaped. We found that some combinations of quantifiers/loss functions do not produce a V shape. Instead of just checking for that, this method first computes the loss for all the points in the range [0, 1] with a step of 0.01. Then, around each of the minima, a Golden Section Search is performed to find the global minimum.

Used by QUANTy, SORDy and DF-based classes. Only useful for binary quantification.

Parameters:
  • distance_func (function) – This is the loss function minimized during the search

  • mixture_func (function) – The function used to generate the training mixture distribution given a value for the prevalence

  • test_distrib (array) – The distribution of the testing set. The exact shape depends on the representation (pdfs, quantiles…)

  • tol (float) – The precision of the solution

  • mixtures (array) – Contains the mixtures for all the prevalences in the range [0, 1] with a step of 0.01. These mixtures can be computed just once, for the first testing bag, and reused for the rest. This is useful when computing the mixtures is time consuming. Only used by QUANTy.

  • return_mixtures (boolean) – True if the method must return the precomputed mixtures

  • kwargs (keyword arguments) – Here we pass the set of arguments needed by the mixture functions: mixture_two_pdfs (for pdf-based classes) and compute_quantiles (for quantile-based classes). See the help of these two functions

Returns:

  • mixtures (array or None) – Computed mixtures for the range [0, 1] step 0.01

  • prevalences (array, shape(2,)) – The predicted prevalence for the negative and the positive class
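The coarse-scan step that handles non-V-shaped losses can be sketched as follows. The name `candidate_intervals_sketch` is hypothetical; the real method would then refine each returned bracket with a Golden Section Search:

```python
import numpy as np

def candidate_intervals_sketch(loss, step=0.01):
    # Evaluate the loss on the grid 0, step, ..., 1 and return an
    # (a, b) bracket around every grid point that is a local minimum.
    # Each bracket is a candidate region for golden section refinement.
    grid = np.arange(0.0, 1.0 + step / 2, step)
    values = np.array([loss(p) for p in grid])
    n = len(grid)
    brackets = []
    for i, v in enumerate(values):
        left = values[i - 1] if i > 0 else np.inf
        right = values[i + 1] if i < n - 1 else np.inf
        if v <= left and v <= right:  # grid local minimum
            brackets.append((grid[max(i - 1, 0)], grid[min(i + 1, n - 1)]))
    return brackets
```

A W-shaped loss with minima near 0.2 and 0.8 yields two brackets, whereas a pure V-shape search would risk getting stuck in the wrong valley.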

Golden section search

Only useful for binary quantification. Given a function distance_func with a single local minimum in the interval [a, b], golden_section_search returns the prevalence that minimizes the difference between the mixture training distribution and the testing distribution according to distance_func.

Parameters:
  • distance_func (function) – This is the loss function minimized during the search

  • mixture_func (function) – The function used to generate the training mixture distribution given a value for the prevalence

  • test_distrib (array) – The distribution of the testing set. The exact shape depends on the representation (pdfs, quantiles…)

  • tol (float) – The precision of the solution

  • a (float) – The lower bound of the interval

  • b (float) – The upper bound of the interval

  • kwargs (keyword arguments) – Here we pass the set of arguments needed by the mixture functions: mixture_two_pdfs (for pdf-based classes) and compute_quantiles (for quantile-based classes). See the help of these two functions

Returns:

  • loss (float) – Distance between mixture and testing distribution for the returned prevalence according to distance_func

  • prevalences (array, shape(2,)) – The predicted prevalence for the negative and the positive class
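The textbook golden section search behind this function can be sketched as below; `golden_section_sketch` is an illustrative reimplementation under the documented return convention, not the library's code:

```python
import numpy as np

def golden_section_sketch(loss, a, b, tol=1e-4):
    # Keep two interior probe points at golden-ratio positions and
    # shrink the bracket [a, b] toward the smaller loss until it is
    # narrower than tol. Assumes the loss is unimodal on [a, b].
    invphi = (np.sqrt(5.0) - 1.0) / 2.0
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = loss(c), loss(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = loss(c)
        else:
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = loss(d)
    p = (a + b) / 2.0
    # Return (loss, prevalences) as documented: negative class first.
    return loss(p), np.array([1.0 - p, p])
```

Each iteration reuses one of the two previous loss evaluations, so only one new evaluation of the (possibly expensive) distance is needed per step.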

mixture_of_pdfs(prevalence=None, pos_distrib=None, neg_distrib=None)[source]

Mix two pdfs given a value for the prevalence of the positive class

Parameters:
  • prevalence (float) – The prevalence of the positive class

  • pos_distrib (array, shape(n_bins,)) – The distribution of the positive class. The exact shape depends on the representation (pdfs, quantiles…)

  • neg_distrib (array, shape(n_bins,)) – The distribution of the negative class. The exact shape depends on the representation (pdfs, quantiles…)

Returns:

mixture – The pdf mixture of positives and negatives

Return type:

array, same shape as positives and negatives
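The mixture is a simple convex combination of the two class-conditional pdfs; a minimal sketch (the name `mixture_sketch` is illustrative):

```python
import numpy as np

def mixture_sketch(prevalence, pos_distrib, neg_distrib):
    # Convex combination of the two pdfs, bin by bin, weighted by the
    # prevalence of the positive class.
    return (prevalence * np.asarray(pos_distrib)
            + (1.0 - prevalence) * np.asarray(neg_distrib))

pos = np.array([0.1, 0.2, 0.7])  # pdf of the positive class over 3 bins
neg = np.array([0.6, 0.3, 0.1])  # pdf of the negative class
mix = mixture_sketch(0.25, pos, neg)
```

Because both inputs are normalized pdfs and the weights sum to 1, the mixture is itself a normalized pdf.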

sord(weights, union_distrib)[source]

Computes the SORD distance for the SORDy algorithm given union_distrib and the weights of the examples (which depend on the prevalence used to compute the mixture of the training distribution). This method corresponds to the implementation of Algorithm 1 in (Maletzke et al. 2019).

Parameters:
  • weights (array, shape (n_examples_train+n_examples_test, 1), same shape as union_labels) –

    The weight of each example, which equals:

    negative class: (1 - prevalence) / |D^-|

    positive class: prevalence / |D^+|

    testing examples: -1 / |T|

  • union_labels (ndarray, shape (n_examples_train+n_examples_test, 1)) – Indicates, for each prediction, the set it belongs to or its label. If the prediction corresponds to a training example, the value is the true class of that example; if the example belongs to the testing distribution, the value is NaN

Returns:

total_cost – SORD distance

Return type:

float

References

André Maletzke, Denis dos Reis, Everton Cherman, and Gustavo Batista: DyS: A framework for mixture models in quantification. In AAAI 2019, volume 33, pp. 4552–4560. 2019.
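The sweep in Algorithm 1 of (Maletzke et al. 2019) can be sketched as follows; `sord_sketch` is an illustrative reconstruction, not the library's implementation:

```python
import numpy as np

def sord_sketch(weights, union_distrib):
    # Sort the pooled scores, sweep left to right accumulating the
    # signed weights, and add |accumulated weight| * gap for every gap
    # between consecutive scores. A result near zero means the weighted
    # training mixture matches the test distribution along the score axis.
    scores = np.asarray(union_distrib).ravel()
    order = np.argsort(scores)
    w = np.asarray(weights, dtype=float).ravel()[order]
    s = scores[order]
    acc, total = 0.0, 0.0
    for i in range(len(s) - 1):
        acc += w[i]
        total += abs(acc) * (s[i + 1] - s[i])
    return total
```

For example, a unit positive weight at score 0 and a unit negative weight at score 1 give a distance of 1, while coincident scores give 0.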