quantificationlib.optimization module

Optimization-related functions. They are needed by those quantifiers that solve an optimization problem to compute the estimated prevalence of a given testing bag.

compute_ed_param_test(distance_func, train_distrib, test_distrib, K, classes, n_cls_i)[source]

Computes the parameters related to the test distribution that are needed to solve ED problems using quadprog.solve_qp

Parameters:
  • distance_func (function) – The function used to measure the distance between each pair of examples

  • train_distrib (array, shape (n_bins * n_classes, n_classes)) – Represents the distribution of each class in the training set

  • test_distrib (array, shape (n_bins * n_classes, 1)) – Represents the distribution of the testing set

  • K (array, shape (n_classes, n_classes)) – Average distance between each pair of classes in the training set

  • classes (ndarray, shape (n_classes, )) – Class labels

  • n_cls_i (ndarray, shape (n_classes, )) – The number of examples of each class

Returns:

a – Term a of the ED optimization problem solved by quadprog.solve_qp

Return type:

array, shape (n_classes - 1, )

References

Alberto Castaño, Laura Morán-Fernández, Jaime Alonso, Verónica Bolón-Canedo, Amparo Alonso-Betanzos, Juan José del Coz: An analysis of quantification methods based on matching distributions

Hideko Kawakubo, Marthinus Christoffel du Plessis, and Masashi Sugiyama: Computationally efficient class-prior estimation under class balance change using energy distance. IEICE Transactions on Information and Systems 99(1):176–186, 2016.
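For illustration, distance_func can be as simple as a pairwise Euclidean distance between two sets of examples. That this is the expected interface is an assumption drawn from the parameter description above, not from the library's source:

from scipy.spatial.distance import cdist

def euclidean_distance_func(X, Y):
    # pairwise Euclidean distances: (n_x, n_features) x (n_y, n_features) -> (n_x, n_y)
    return cdist(X, Y, metric='euclidean')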

compute_ed_param_train(distance_func, train_distrib, classes, n_cls_i)[source]

Computes the parameters related to the train distribution that are needed to solve ED problems using quadprog.solve_qp

Parameters:
  • distance_func (function) – The function used to measure the distance between each pair of examples

  • train_distrib (array, shape (n_bins * n_classes, n_classes)) – Represents the distribution of each class in the training set

  • classes (ndarray, shape (n_classes, )) – Class labels

  • n_cls_i (ndarray, shape (n_classes, )) – The number of examples of each class

Returns:

  • K (array, shape (n_classes, n_classes)) – Average distance between each pair of classes in the training set

  • G (array, shape (n_classes - 1, n_classes - 1))

  • C (array, shape (n_classes - 1, n_constraints)) – n_constraints will be equal to the number of classes (n_classes)

  • b (array, shape (n_constraints,))

References

Alberto Castaño, Laura Morán-Fernández, Jaime Alonso, Verónica Bolón-Canedo, Amparo Alonso-Betanzos, Juan José del Coz: An analysis of quantification methods based on matching distributions

Hideko Kawakubo, Marthinus Christoffel du Plessis, and Masashi Sugiyama: Computationally efficient class-prior estimation under class balance change using energy distance. IEICE Transactions on Information and Systems 99(1):176–186, 2016.

compute_l2_param_train(train_distrib, classes)[source]

Computes the parameters related to the train distribution that are needed to solve QP problems using the L2 loss function. This library uses quadprog.solve_qp, which solves problems of the form:

Minimize     1/2 x^T G x - a^T x
Subject to   C.T x >= b

Thus, the values of G, C and b must be the following:

G = 2 * train_distrib.T * train_distrib
G shape (n_classes, n_classes)

C = [[1, 1, ..., 1],
     [1, 0, ..., 0],
     [0, 1, ..., 0],
     ...
     [0, 0, ..., 1]].T
C shape (n_classes, n_classes + 1)

b = [1, 0, ..., 0]
b shape (n_classes + 1, )

This function computes and returns G, C and b.
Parameters:
  • train_distrib (array, shape (n_bins * n_classes, n_classes)) – Represents the distribution of each class in the training set

  • classes (ndarray, shape (n_classes, )) – Class labels

Returns:

  • G (array, shape (n_classes, n_classes))

  • C (array, shape (n_classes, n_classes + 1)) – The number of columns is equal to the number of constraints of the optimization problem. It must be n_classes + 1: n_classes constraints to guarantee that prevalences_i >= 0, plus an additional constraint ensuring that sum(prevalences) == 1

  • b (array, shape (n_classes + 1,)) – The size of b corresponds to the number of constraints of the optimization problem.
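As a sketch, the constraint matrices described above can be built with NumPy as follows. This is an illustrative construction for n_classes = 3, not necessarily the library's exact code; note that quadprog.solve_qp accepts a meq argument that can treat the first constraint, sum(prevalences) == 1, as an equality:

import numpy as np

n_classes = 3
C = np.vstack([np.ones((1, n_classes)),  # sum(prevalences) == 1
               np.eye(n_classes)]).T     # prevalences_i >= 0
b = np.array([1] + [0] * n_classes, dtype=float)
print(C.shape)  # (3, 4) -> (n_classes, n_classes + 1)
print(b.shape)  # (4,)   -> (n_classes + 1,)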

dpofa(m)[source]

Factors a symmetric positive definite matrix

This is a version of the dpofa function included in the quadprog library. Here, it is mainly used to check whether a matrix is positive definite.

Parameters:

m (symmetric matrix, typically the shape is (n_classes, n_classes)) – The matrix to be factored. Only the diagonal and upper triangle are used

Returns:

  • k (int) – This value is:

    == 0 if m is positive definite and the factorization has been completed

    > 0 when the leading minor of order k is not positive definite

  • r (array, an upper triangular matrix) – When k == 0, the factorization is complete and r.T.dot(r) == m, considering only the upper triangle of r. The strict lower triangle is unaltered (it is equal to the strict lower triangle of matrix m), so it may be nonzero.
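A usage sketch of the two return values (the import path follows this module's name; the example matrix is arbitrary):

import numpy as np
from quantificationlib.optimization import dpofa

m = np.array([[4.0, 2.0],
              [2.0, 3.0]])  # symmetric positive definite
k, r = dpofa(m)
if k == 0:
    # keep only the upper triangle of r, since its strict lower
    # triangle is unaltered, and verify the factorization
    upper = np.triu(r)
    assert np.allclose(upper.T.dot(upper), m)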

is_pd(m)[source]

Checks whether a matrix is positive definite or not

It is based on dpofa, a version of the function of the same name included in the quadprog library. When dpofa returns 0, the matrix is positive definite.

Parameters:

m (symmetric matrix, typically the shape is (n_classes, n_classes)) – The matrix to check whether it is positive definite or not

Return type:

A boolean, True when m is positive definite and False otherwise

nearest_pd(A)[source]

Find the nearest positive-definite matrix to the input

A Python/NumPy port of John D’Errico’s nearestSPD MATLAB code [1], which credits [2].

References

[1] https://www.mathworks.com/matlabcentral/fileexchange/42885-nearestspd

[2] N.J. Higham, “Computing a nearest symmetric positive semidefinite matrix”, Linear Algebra and its Applications, 1988. https://doi.org/10.1016/0024-3795(88)90223-6
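A minimal sketch combining is_pd and nearest_pd to repair a matrix before handing it to a QP solver (the example matrix is arbitrary):

import numpy as np
from quantificationlib.optimization import is_pd, nearest_pd

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])  # indefinite: eigenvalues are 3 and -1
if not is_pd(A):
    A = nearest_pd(A)  # replace by the nearest positive-definite matrix
assert is_pd(A)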

solve_ed(G, a, C, b)[source]

Solves the optimization problem for ED-based quantifiers

It just calls quadprog.solve_qp with the appropriate parameters. These parameters were computed beforehand by calling compute_ed_param_train and compute_ed_param_test. In the derivation of the optimization problem, the prevalence of the last class is expressed in terms of those of the rest of the classes. Thus, 1 - prevalences.sum(), which is the prevalence of the last class, has to be appended to the solution.

Parameters:
  • G (array, shape (n_classes - 1, n_classes - 1)) –

  • C (array, shape (n_classes - 1, n_constraints)) – n_constraints will be equal to the number of classes (n_classes)

  • b (array, shape (n_constraints,)) –

  • a (array, shape (n_classes - 1, )) –

Returns:

prevalences – Vector containing the predicted prevalence for each class

Return type:

array, shape=(n_classes, )

Notes

G, C and b are computed by compute_ed_param_train and a by compute_ed_param_test

References

Alberto Castaño, Laura Morán-Fernández, Jaime Alonso, Verónica Bolón-Canedo, Amparo Alonso-Betanzos, Juan José del Coz: An analysis of quantification methods based on matching distributions

Hideko Kawakubo, Marthinus Christoffel du Plessis, and Masashi Sugiyama: Computationally efficient class-prior estimation under class balance change using energy distance. IEICE Transactions on Information and Systems 99(1):176–186, 2016.
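Putting the three ED functions together, the typical flow looks like this. It is a sketch: train_distrib, test_distrib, classes and n_cls_i are assumed to be already built with the shapes documented above, and cdist is used as an example distance_func:

from scipy.spatial.distance import cdist
from quantificationlib.optimization import (compute_ed_param_train,
                                            compute_ed_param_test, solve_ed)

K, G, C, b = compute_ed_param_train(cdist, train_distrib, classes, n_cls_i)
a = compute_ed_param_test(cdist, train_distrib, test_distrib,
                          K, classes, n_cls_i)
prevalences = solve_ed(G, a, C, b)  # shape (n_classes, )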

solve_hd(train_distrib, test_distrib, n_classes, problem=None, solver='ECOS')[source]

Solves the optimization problem for DF methods using Hellinger Distance

This method uses the cvxpy library; an illustrative formulation sketch is given below

Parameters:
  • train_distrib (array, shape (n_bins * n_classes, n_classes)) – Represents the distribution of each class in the training set

  • test_distrib (array, shape (n_bins * n_classes, 1)) – Represents the distribution of the testing set

  • n_classes (int) – Number of classes

  • problem (a cvxpy Problem object (default=None)) – The first time a problem is solved (this corresponds to the first testing bag) this parameter is None. For the remaining testing bags, a Problem object is passed here to allow a warm start, which accelerates the solving process.

  • solver (str, (default='ECOS')) – The solver used to solve the optimization problem. 'ECOS' is used by default. If another solver is preferred, you may need to install additional libraries.

Returns:

prevalences – Vector containing the predicted prevalence for each class

Return type:

array, shape=(n_classes, )
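The library's exact formulation is not shown here; as a rough illustration (an assumption for exposition, with toy data), minimizing the Hellinger distance between the mixture train_distrib * prevalences and test_distrib is equivalent to maximizing the Bhattacharyya coefficient, which is concave and therefore keeps the problem convex. The sketch assumes the ECOS solver is installed:

import cvxpy
import numpy as np

n_classes, n_bins = 3, 8
rng = np.random.default_rng(0)
train_distrib = rng.random((n_bins * n_classes, n_classes))
train_distrib /= train_distrib.sum(axis=0)  # each column is a distribution
true_prev = np.array([0.2, 0.5, 0.3])
test_distrib = (train_distrib @ true_prev).reshape(-1, 1)

p = cvxpy.Variable(n_classes, nonneg=True)
mixture = train_distrib @ p  # mixture of the class-wise training distributions
objective = cvxpy.Maximize(
    cvxpy.sum(cvxpy.sqrt(cvxpy.multiply(np.squeeze(test_distrib), mixture))))
problem = cvxpy.Problem(objective, [cvxpy.sum(p) == 1])
problem.solve(solver='ECOS')
prevalences = p.value  # close to true_prev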

solve_l1(train_distrib, test_distrib, n_classes, problem=None, solver=None)[source]

Solves AC, PAC, DF and Friedman optimization problems for the L1 loss function

min ||train_distrib * prevalences - test_distrib||_1

s.t. prevalences_i >= 0, sum prevalences_i = 1

Parameters:
  • train_distrib (array, shape depends on the optimization problem) –

    Represents the distribution of each class in the training set

    • DF: shape (n_bins * n_classes, n_classes)

    • AC, PAC, Friedman: shape (n_classes, n_classes)

  • test_distrib (array, shape depends on the optimization problem) –

    Represents the distribution of the testing set

    • DF: shape (n_bins * n_classes, 1)

    • AC, PAC, Friedman: shape (n_classes, 1)

  • n_classes (int) – Number of classes

  • problem (a cvxpy Problem object (default=None)) – The first time a problem is solved (this corresponds to the first testing bag) this parameter is None. For the remaining testing bags, a Problem object is passed here to allow a warm start, which accelerates the solving process (see the sketch below).

  • solver (str, (default=None)) – The solver used to solve the optimization problem. cvxpy automatically selects the best solver (among those installed) according to the type of the optimization problem. If a particular solver is preferred, you may need to install additional libraries.

Returns:

  • problem (a cvxpy Problem) – A cvxpy Problem already created

  • prevalences (array, shape=(n_classes, )) – Vector containing the predicted prevalence for each class
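The documented warm-start pattern then looks like this, with toy AC-style distributions built only for illustration:

import numpy as np
from quantificationlib.optimization import solve_l1

n_classes = 3
rng = np.random.default_rng(0)
train_distrib = rng.random((n_classes, n_classes))  # AC-style shape
test_bags = [train_distrib @ rng.dirichlet(np.ones(n_classes)).reshape(-1, 1)
             for _ in range(5)]

problem = None
all_prevalences = []
for test_distrib in test_bags:
    # reuse the Problem returned for the first bag on the following ones
    problem, prevalences = solve_l1(train_distrib, test_distrib,
                                    n_classes, problem=problem)
    all_prevalences.append(prevalences)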

solve_l2(train_distrib, test_distrib, G, C, b)[source]

Solves AC, PAC, DF and Friedman optimization problems for the L2 loss function

min (test_distrib - train_distrib * prevalences).T * (test_distrib - train_distrib * prevalences)

s.t. prevalences_i >=0, sum prevalences_i = 1

Expanding the objective function, we obtain:

prevalences.T * train_distrib.T * train_distrib * prevalences

- 2 * prevalences.T * train_distrib.T * test_distrib + test_distrib.T * test_distrib

Notice that the last term is constant w.r.t. prevalences.

Letting G = 2 * train_distrib.T * train_distrib and a = 2 * train_distrib.T * test_distrib, we can directly use quadprog.solve_qp because it solves problems of the form:

Minimize 1/2 x^T G x - a^T x

Subject to C.T x >= b

solve_l2 just computes the term a, with shape (n_classes, 1), and then calls quadprog.solve_qp. G, C and b were computed beforehand by compute_l2_param_train, in the fit method of the DF/Friedman object. quadprog is used here because it is faster than cvxpy.

Parameters:
  • train_distrib (array, shape depends on the optimization problem) –

    Represents the distribution of each class in the training set

    • DF: shape (n_bins * n_classes, n_classes)

    • AC, PAC, Friedman: shape (n_classes, n_classes)

  • test_distrib (array, shape depends on the optimization problem) –

    Represents the distribution of the testing set

    • DF: shape (n_bins * n_classes, 1)

    • AC, PAC, Friedman: shape (n_classes, 1)

  • G (array, shape (n_classes, n_classes)) –

  • C (array, shape (n_classes, n_constraints)) – n_constraints will be n_classes + 1

  • b (array, shape (n_constraints,)) –

Returns:

prevalences – Vector containing the predicted prevalence for each class

Return type:

array, shape=(n_classes, )

Notes

G, C and b are computed by compute_l2_param_train in the fit method
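A sketch of the call described above, following the derivation in the text; train_distrib, test_distrib, G, C and b are assumed to exist with the documented shapes. The meq=1 argument is an assumption: quadprog's meq parameter turns the first constraint, sum(prevalences) == 1, into an equality:

import quadprog

a = 2 * train_distrib.T.dot(test_distrib)  # linear term, shape (n_classes, 1)
sol = quadprog.solve_qp(G=G, a=a.squeeze(), C=C, b=b, meq=1)
prevalences = sol[0]  # quadprog returns the solution vector first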