quantificationlib.optimization module¶
Optimization related functions. Needed by those quantifiers that solve optimization problems to compute the estimated prevalence given a testing bag
- compute_ed_param_test(distance_func, train_distrib, test_distrib, K, classes, n_cls_i)[source]¶
Computes params related to the test distribution for solving ED-problems using quadprog.solve_qp
- Parameters:
distance_func (function) – The function used to measure the distance between each pair of examples
train_distrib (array, shape (n_bins * n_classes, n_classes)) – Represents the distribution of each class in the training set
test_distrib (array, shape (n_bins * n_classes, 1)) – Represents the distribution of the testing set
K (array, shape (n_classes, n_classes)) – Average distance between each pair of classes in the training set
classes (ndarray, shape (n_classes, )) – Class labels
n_cls_i (ndarray, shape (n_classes, )) – The number of examples of each class
- Returns:
a – Term a for solving optimization problems using quadprog.solve_qp
- Return type:
array, shape (n_classes, )
References
Alberto Castaño, Laura Morán-Fernández, Jaime Alonso, Verónica Bolón-Canedo, Amparo Alonso-Betanzos, Juan José del Coz: An analysis of quantification methods based on matching distributions
Hideko Kawakubo, Marthinus Christoffel Du Plessis, and Masashi Sugiyama. 2016. Computationally efficient class-prior estimation under class balance change using energy distance. Transactions on Information and Systems 99, 1 (2016), 176–186.
- compute_ed_param_train(distance_func, train_distrib, classes, n_cls_i)[source]¶
Computes params related to the train distribution for solving ED-problems using quadprog.solve_qp
- Parameters:
distance_func (function) – The function used to measure the distance between each pair of examples
train_distrib (array, shape (n_bins * n_classes, n_classes)) – Represents the distribution of each class in the training set
classes (ndarray, shape (n_classes, )) – Class labels
n_cls_i (ndarray, shape (n_classes, )) – The number of examples of each class
- Returns:
K (array, shape (n_classes, n_classes)) – Average distance between each pair of classes in the training set
G (array, shape (n_classes - 1, n_classes - 1))
C (array, shape (n_classes - 1, n_constraints)) – n_constraints will be equal to the number of classes (n_classes)
b (array, shape (n_constraints,))
References
Alberto Castaño, Laura Morán-Fernández, Jaime Alonso, Verónica Bolón-Canedo, Amparo Alonso-Betanzos, Juan José del Coz: An analysis of quantification methods based on matching distributions
Hideko Kawakubo, Marthinus Christoffel Du Plessis, and Masashi Sugiyama. 2016. Computationally efficient class-prior estimation under class balance change using energy distance. Transactions on Information and Systems 99, 1 (2016), 176–186.
- compute_l2_param_train(train_distrib, classes)[source]¶
Computes params related to the train distribution for solving QP using L2 loss function. For instance, this library uses quadprog.solve_qp that solves the following kind of problems:
Minimize 1/2 x^T G x - a^T x Subject to C.T x >= b Thus, the values of G, C and b must be the following G = train_distrib.T * train_distrib G shape (n_classes, n_classes) C = [[ 1, 1, ..., 1], [ 1, 0, ..., 0], [ 0, 1, 0,.., 0], ... [ 0, 0, ..,0, 1]].T C shape (n_classes+1, n_classes) b = [1, 0, ..., 0] b shape (n_classes, ) This function computes and returns G, C and b.
- Parameters:
train_distrib (array, shape (n_bins * n_classes, n_classes)) – Represents the distribution of each class in the training set
classes (ndarray, shape (n_classes, )) – Class labels
- Returns:
G (array, shape (n_classes, n_classes))
C (array, shape (n_classes, n_classes+1)) – The number of columns is equal to the number of the constraints of the optimization problem. It must be n_classes + 1: n_classes constraints to guarantee that prevalences_i>=0, and an additional constraint for ensuring that sum(prevalences)==1
b (array, shape (n_classes + 1,)) – The size of b corresponds to the number of constraints of the optimization problem.
- dpofa(m)[source]¶
Factors a symmetric positive definite matrix
This is a version of the dpofa function included in quadprog library. Here, it is mainly used to check whether a matrix is positive definite or not
- Parameters:
m (symmetric matrix, typically the shape is (n_classes, n_classes)) – The matrix to be factored. Only the diagonal and upper triangle are used
- Returns:
k (int,) – This value is:
== 0 if m is positive definite and the factorization has been completed
> 0 when the leading minor of order k is not positive definite
r (array, an upper triangular matrix) – When k==0, the factorization is complete and r.T.dot(r) == m. The strict lower triangle is unaltered (it is equal to the strict lower triangle of matrix m), so it could be different from 0.
- is_pd(m)[source]¶
Checks whether a matrix is positive definite or not
It is based on dpofa function, a version of the dpofa function included in quadprog library. When dpofa returns 0 the matrix is positive definite.
- Parameters:
m (symmetric matrix, typically the shape is (n_classes, n_classes)) – The matrix to check whether it is positive definite or not
- Return type:
A boolean, True when m is positive definite and False otherwise
- nearest_pd(A)[source]¶
Find the nearest positive-definite matrix to input
A Python/Numpy port of John D’Errico’s nearestSPD MATLAB code [1], which credits [2].
References
[1] https://www.mathworks.com/matlabcentral/fileexchange/42885-nearestspd
- [2] N.J. Higham, “Computing a nearest symmetric positive semidefinite matrix” (1988):
- solve_ed(G, a, C, b)[source]¶
Solves the optimization problem for ED-based quantifiers
It just calls quadprog.solve_qp with the appropriate parameters. These paremeters were computed before by calling compute_ed_param_train and compute_ed_param_test. In the derivation of the optimization problem, the last class is put in terms of the rest of classes. Thus, we have to add 1-prevalences.sum() which it is the prevalence of the last class
- Parameters:
G (array, shape (n_classes, n_classes)) –
C (array, shape (n_classes, n_constraints)) – n_constraints will be n_classes + 1
b (array, shape (n_constraints,)) –
a (array, shape (n_classes, )) –
- Returns:
prevalences – Vector containing the predicted prevalence for each class
- Return type:
array, shape=(n_classes, )
Notes
G, C and b are computed by compute_ed_param_train and a by compute_ed_param_test
References
Alberto Castaño, Laura Morán-Fernández, Jaime Alonso, Verónica Bolón-Canedo, Amparo Alonso-Betanzos, Juan José del Coz: An analysis of quantification methods based on matching distributions
Hideko Kawakubo, Marthinus Christoffel Du Plessis, and Masashi Sugiyama. 2016. Computationally efficient class-prior estimation under class balance change using energy distance. Transactions on Information and Systems 99, 1 (2016), 176–186.
- solve_hd(train_distrib, test_distrib, n_classes, problem=None, solver='ECOS')[source]¶
Solves the optimization problem for DF methods using Hellinger Distance
This method just uses cvxpy library
- Parameters:
train_distrib (array, shape (n_bins * n_classes, n_classes)) – Represents the distribution of each class in the training set
test_distrib (array, shape (n_bins * n_classes, 1)) – Represents the distribution of the testing set
n_classes (int) – Number of classes
problem (a cvxpy Problem object (default=None)) – The first time a problem is solved (this corresponds to the first testing bag) this parameter is None. For the rest testing bags, a Problem object is passed here to allow a warm start. This accelerates the solving process.
solver (str, (default='ECOS')) – The solver used to solve the optimization problem. Here ‘ECOS’ is used. If another solver is prefered, you may need to install additional libraries.
- Returns:
prevalences – Vector containing the predicted prevalence for each class
- Return type:
array, shape=(n_classes, )
- solve_l1(train_distrib, test_distrib, n_classes, problem=None, solver=None)[source]¶
Solves AC, PAC, DF and Friedman optimization problems for L1 loss function
min |train_distrib * prevalences - test_distrib|
s.t. prevalences_i >=0, sum prevalences_i = 1
- Parameters:
train_distrib (array, shape depends on the optimization problem) –
Represents the distribution of each class in the training set
DF: shape (n_bins * n_classes, n_classes)
AC, PAC, Friedman: shape (n_classes, n_classes)
test_distrib (array, shape depends on the optimization problem) –
Represents the distribution of the testing set
DF: shape shape (n_bins * n_classes, 1)
AC, PAC, Friedman: shape (n_classes, 1)
n_classes (int) – Number of classes
problem (a cvxpy Problem object (default=None)) – The first time a problem is solved (this corresponds to the first testing bag) this parameter is None. For the rest testing bags, a Problem object is passed here to allow a warm start. This accelerates the solving process.
solver (str, (default=None)) – The solver used to solve the optimization problem. cvxpy automatically selects the best solver (among those installed) according to the type of the optimization problem. If a particular solver is prefered maybe you need to install additional libraries
- Returns:
problem (a cvxpy Problem) – A cvxpy Problem already created
prevalences (array, shape=(n_classes, )) – Vector containing the predicted prevalence for each class
- solve_l2(train_distrib, test_distrib, G, C, b)[source]¶
Solves AC, PAC, DF and Friedman optimization problems for L2 loss function
min (test_distrib - train_distrib * prevalences).T * (test_distrib - train_distrib * prevalences)
s.t. prevalences_i >=0, sum prevalences_i = 1
Expanding the objective function, we obtain:
prevalences.T * train_distrib.T * train_distrib * prevalences
- 2 * prevalences * train_distrib.T * test_distrib + test_distrib.T * test_distrib
Notice that the last term is constant w.r.t prevalences.
Let G = 2 * train_distrib.T * train_distrib and a = 2 * train_distrib.T * test_distrib, we can use directly quadprog.solve_qp because it solves the following kind of problems:
Minimize 1/2 x^T G x - a^T x
Subject to C.T x >= b
solve_l2 just computes the term a, shape (n_classes,1), and then calls quadprog.solve_qp. G, C and b were computed by compute_l2_param_train before, in the fit method of the DF/Friedman object. quadprog is used here because it is faster than cvxpy.
- Parameters:
train_distrib (array, shape depends on the optimization problem) –
Represents the distribution of each class in the training set
DF: shape (n_bins * n_classes, n_classes)
AC, PAC Friedman: shape (n_classes, n_classes)
test_distrib (array, shape depends on the optimization problem) –
Represents the distribution of the testing set
DF: shape shape (n_bins * n_classes, 1)
AC, PAC, Friedman: shape (n_classes, 1)
G (array, shape (n_classes, n_classes)) –
C (array, shape (n_classes, n_constraints)) – n_constraints will be n_classes + 1
b (array, shape (n_constraints,)) –
- Returns:
prevalences – Vector containing the predicted prevalence for each class
- Return type:
array, shape=(n_classes, )
Notes
G, C and b are computed by compute_l2_param_train in the fit method