quantificationlib.search module¶
Search functions. Needed by distribution-matching quantifiers that compute the estimated prevalence using a search algorithm
- compute_quantiles(prevalence=None, probabilities=None, n_quantiles=None, y=None, classes=None)[source]¶
Compute quantiles
Used by QUANTy. It computes the quantiles both for the testing distribution (in this case the value of the prevalence parameter is ignored) and for the weighted mixture of positives and negatives (which depends on the value of the prevalence parameter)
- Parameters:
prevalence (float or None) – The value of the prevalence of the positive class to compute the mixture of the positives and the negatives. To compute the quantiles of the testing set this parameter must be None
probabilities (ndarray, shape (nexamples, 1)) – The ordered probabilities for all examples. Notice that in the case of computing the mixture of the positives and the negatives, this array contains the probability for all the examples of the training set
n_quantiles (int) – Number of quantiles. This parameter is used with Quantiles-based algorithms.
y (array, labels) – This parameter is used with Quantiles-based algorithms. They need the true label of each example
classes (ndarray, shape (n_classes, )) – Class labels. Used by Quantiles-based algorithms
- Returns:
quantiles – The value of the quantiles given the probabilities (and the value of the prevalence if we are computing the quantiles of the training mixture distribution)
- Return type:
array, shape(n_quantiles,)
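The mixture branch of the computation above can be sketched in plain NumPy. This is a hypothetical illustration for the binary case, not the library's implementation; `quantiles_of_mixture` and its weighted-quantile scheme are assumptions:

```python
import numpy as np

def quantiles_of_mixture(prevalence, probs_pos, probs_neg, n_quantiles):
    """Weighted quantiles of a two-class mixture of probability scores.

    Positives receive a total weight of `prevalence`, negatives a total
    weight of 1 - prevalence; quantiles are read off the weighted CDF.
    """
    probs = np.concatenate([probs_pos, probs_neg])
    weights = np.concatenate([
        np.full(len(probs_pos), prevalence / len(probs_pos)),
        np.full(len(probs_neg), (1 - prevalence) / len(probs_neg)),
    ])
    order = np.argsort(probs)
    probs, weights = probs[order], weights[order]
    cum = np.cumsum(weights)                       # weighted CDF
    qs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    idx = np.clip(np.searchsorted(cum, qs), 0, len(probs) - 1)
    return probs[idx]
```

When prevalence is None the documented function instead computes the quantiles of the testing scores directly; the sketch covers only the mixture branch.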
- compute_sord_weights(prevalence=None, union_labels=None, classes=None)[source]¶
Computes the weight of each example as a function of the prevalence; these weights are used afterwards to compute the SORD distance
- Parameters:
prevalence (float,) – The prevalence for the positive class
union_labels (ndarray, shape (n_examples_train+n_examples_test, 1)) – Contains the label of each prediction. If the prediction corresponds to a training example, the value is the true class of that example; if the example belongs to the testing distribution, the value is NaN
classes (ndarray, shape (n_classes, )) – Class labels
- Returns:
weights –
The weight of each example, which is equal to:
negative class: (1 - prevalence) / |D^-|
positive class: prevalence / |D^+|
testing examples: -1 / |T|
- Return type:
array, same shape of union_labels
References
André Maletzke, Denis dos Reis, Everton Cherman, and Gustavo Batista: DyS: A framework for mixture models in quantification. In AAAI 2019, volume 33, pp. 4552–4560, 2019.
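The weighting scheme above can be sketched as follows. This is a hypothetical re-implementation, assuming binary labels 0/1 and NaN marking testing examples:

```python
import numpy as np

def sord_weights(prevalence, union_labels, pos_label=1, neg_label=0):
    # Training positives/negatives receive weights summing to `prevalence`
    # and 1 - prevalence respectively; testing examples (NaN label) receive
    # weight -1/|T|, so the two distributions cancel when they match.
    union_labels = np.asarray(union_labels, dtype=float)
    is_test = np.isnan(union_labels)
    is_pos = union_labels == pos_label
    is_neg = union_labels == neg_label
    w = np.empty_like(union_labels)
    w[is_pos] = prevalence / is_pos.sum()
    w[is_neg] = (1 - prevalence) / is_neg.sum()
    w[is_test] = -1.0 / is_test.sum()
    return w
```

By construction the training weights sum to +1 and the testing weights to -1, so the total signed mass is zero for any prevalence.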
- global_search(distance_func, mixture_func, test_distrib, tol, mixtures, return_mixtures, **kwargs)[source]¶
Search function for non-V-shaped distance functions
Given a function distance_func defined over the interval [0, 1], the method returns the prevalence that minimizes the difference between the mixture training distribution and the testing distribution according to distance_func
This method is based on Golden Section Search, but that kind of search only works when the loss is V-shaped. We found that some combinations of quantifiers/loss functions do not produce a V shape. Instead of just checking for that, this method first computes the loss for every point in the range [0, 1] with a step of 0.01; then a Golden Section Search is performed around each local minimum to find the global minimum
Used by QUANTy, SORDy and DF-based classes. Only useful for binary quantification
- Parameters:
distance_func (function) – This is the loss function minimized during the search
mixture_func (function) – The function used to generate the training mixture distribution given a value for the prevalence
test_distrib (array) – The distribution of the positive class. The exact shape depends on the representation (pdfs, quantiles…)
tol (float) – The precision of the solution
mixtures (array) – Contains the mixtures for all the prevalences in the range [0, 1] with step 0.01. These mixtures can be computed just once, for the first testing bag, and reused for the rest; this is useful when computing the mixture is time consuming. Only used by QUANTy.
return_mixtures (boolean) – True if the method must return the precomputed mixtures
kwargs (keyword arguments) – Here we pass the set of arguments needed by the mixture functions: mixture_two_pdfs (for pdf-based classes) and compute_quantiles (for quantiles-based classes). See the help of these two functions
- Returns:
mixtures (array or None) – Computed mixtures for the range [0, 1] step 0.01
prevalences (array, shape(2,)) – The predicted prevalence for the negative and the positive class
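The coarse-scan-plus-refinement strategy described above can be sketched as follows. This is an illustrative reconstruction, not the library code: `global_search_sketch` takes a plain scalar loss over the prevalence instead of the distance_func/mixture_func pair, and `_golden_section` is a hypothetical helper:

```python
import numpy as np

def _golden_section(loss, a, b, tol):
    # Standard golden-section search for a unimodal loss on [a, b].
    invphi = (np.sqrt(5.0) - 1.0) / 2.0
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    while b - a > tol:
        if loss(c) < loss(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    p = (a + b) / 2.0
    return p, loss(p)

def global_search_sketch(loss, tol=1e-4, step=0.01):
    # 1) Evaluate the loss on a coarse grid over [0, 1].
    grid = np.arange(0.0, 1.0 + step / 2, step)
    losses = np.array([loss(p) for p in grid])
    best_p, best_loss = None, np.inf
    # 2) Detect every local minimum of the coarse scan ...
    for i in range(len(grid)):
        left = losses[i - 1] if i > 0 else np.inf
        right = losses[i + 1] if i < len(grid) - 1 else np.inf
        if losses[i] <= left and losses[i] <= right:
            # 3) ... and refine each one with golden-section search.
            a = grid[max(i - 1, 0)]
            b = grid[min(i + 1, len(grid) - 1)]
            p, l = _golden_section(loss, a, b, tol)
            if l < best_loss:
                best_p, best_loss = p, l
    return best_p, best_loss
```

A loss with two local minima, such as min((p - 0.2)^2, (p - 0.8)^2 + 0.05), would trap a single golden-section search started near 0.8; the coarse scan finds both basins and the refinement keeps the deeper one at 0.2.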
- golden_section_search(distance_func, mixture_func, test_distrib, tol, a, b, **kwargs)[source]¶
Golden section search
Only useful for binary quantification. Given a function distance_func with a single local minimum in the interval [a, b], golden_section_search returns the prevalence that minimizes the difference between the mixture training distribution and the testing distribution according to distance_func
- Parameters:
distance_func (function) – This is the loss function minimized during the search
mixture_func (function) – The function used to generate the training mixture distribution given a value for the prevalence
test_distrib (array) – The distribution of the positive class. The exact shape depends on the representation (pdfs, quantiles…)
tol (float) – The precision of the solution
a (float) – The lower bound of the interval
b (float) – The upper bound of the interval
kwargs (keyword arguments) – Here we pass the set of arguments needed by the mixture functions: mixture_two_pdfs (for pdf-based classes) and compute_quantiles (for quantiles-based classes). See the help of these two functions
- Returns:
loss (float) – Distance between mixture and testing distribution for the returned prevalence according to distance_func
prevalences (array, shape(2,)) – The predicted prevalence for the negative and the positive class
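A minimal self-contained sketch of golden-section search over a scalar loss (the documented function works on a distance_func/mixture_func pair; the pure-scalar version below is an assumption for illustration):

```python
import numpy as np

def golden_section_search_sketch(loss, a, b, tol):
    # Shrink [a, b] around golden-ratio interior points; this is only
    # valid when `loss` has a single local minimum in the interval.
    invphi = (np.sqrt(5.0) - 1.0) / 2.0  # 1/phi, about 0.618
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    while b - a > tol:
        if loss(c) < loss(d):      # minimum lies in [a, d]
            b, d = d, c
            c = b - invphi * (b - a)
        else:                      # minimum lies in [c, b]
            a, c = c, d
            d = a + invphi * (b - a)
    p = (a + b) / 2.0
    return loss(p), p
```

Each iteration shrinks the bracket by a factor of about 0.618 while reusing one interior point, so the number of loss evaluations grows only logarithmically in 1/tol.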
- mixture_of_pdfs(prevalence=None, pos_distrib=None, neg_distrib=None)[source]¶
Mix two pdfs given a value for the prevalence of the positive class
- Parameters:
prevalence (float,) – The prevalence for the positive class
pos_distrib (array, shape(n_bins,)) – The distribution of the positive class. The exact shape depends on the representation (pdfs, quantiles…)
neg_distrib (array, shape(n_bins,)) – The distribution of the negative class. The exact shape depends on the representation (pdfs, quantiles…)
- Returns:
mixture – The pdf mixture of positives and negatives
- Return type:
array, same shape of positives and negatives
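The mixture itself is a convex combination of the two binned pdfs; a one-line sketch (hypothetical name, not the library function):

```python
import numpy as np

def mixture_of_pdfs_sketch(prevalence, pos_distrib, neg_distrib):
    # Convex combination of the two class pdfs (e.g. binned histograms);
    # the result sums to 1 whenever both inputs do.
    pos = np.asarray(pos_distrib, dtype=float)
    neg = np.asarray(neg_distrib, dtype=float)
    return prevalence * pos + (1.0 - prevalence) * neg
```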
- sord(weights, union_distrib)[source]¶
Computes the SORD distance used by the SORDy algorithm for a given union distribution and the example weights (which depend on the prevalence used to compute the training mixture distribution). This method corresponds to the implementation of Algorithm 1 in (Maletzke et al. 2019)
- Parameters:
weights (array, shape (n_examples_train+n_examples_test, 1) (same shape of union_labels)) –
- The weight of each example, which is equal to:
negative class: (1 - prevalence) / |D^-|
positive class: prevalence / |D^+|
testing examples: -1 / |T|
union_labels (ndarray, shape (n_examples_train+n_examples_test, 1)) – Contains the label of each prediction. If the prediction corresponds to a training example, the value is the true class of that example; if the example belongs to the testing distribution, the value is NaN
- Returns:
total_cost – SORD distance
- Return type:
float
References
André Maletzke, Denis dos Reis, Everton Cherman, and Gustavo Batista: DyS: A framework for mixture models in quantification. In AAAI 2019, volume 33, pp. 4552–4560, 2019.
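The accumulation in Algorithm 1 can be sketched as follows. This is an illustrative re-implementation under the weighting scheme stated above, taking raw scores plus per-example weights rather than the library's union_distrib representation:

```python
import numpy as np

def sord_distance(scores, weights):
    # Sort all (train + test) examples by score, then sweep left to right
    # accumulating signed weight; the cost of each gap is the absolute
    # accumulated mass times the gap width, i.e. the mass that must be
    # moved across it (cf. Algorithm 1 in Maletzke et al. 2019).
    order = np.argsort(scores)
    s = np.asarray(scores, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    acc, total = 0.0, 0.0
    for i in range(len(s) - 1):
        acc += w[i]
        total += abs(acc) * (s[i + 1] - s[i])
    return total
```

Because training weights sum to +1 and testing weights to -1, the distance is zero exactly when the weighted training mixture and the testing scores coincide, which is the condition the prevalence search exploits.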