Author name cluster

Koji Tsuda

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers

2 author rows

AAAI Conference 2026 Conference Paper

Statistically Robust Sparse High-order Interaction Model

Diptesh Das
Ichiro Takeuchi
Koji Tsuda

Deep learning models often achieve high accuracy but lack interpretability, making them unsuitable for critical applications such as medical diagnosis, biomolecule design, criminal justice, etc. The Sparse High-order Interaction Model (SHIM) addresses this limitation by providing both transparency and predictive reliability. However, real-world data often contain outliers, which can distort model performance. To overcome this, we propose Huberized-SHIM, an extension of SHIM that integrates Huber loss-based robust regression to mitigate the impact of outliers. We introduce a homotopy-based exact regularization path algorithm and a novel tree-pruning criterion to efficiently manage interaction complexity. Additionally, we incorporate the conformal prediction framework to enhance statistical reliability. Empirical evaluations on synthetic and real-world datasets demonstrate the superior robustness and accuracy of Huberized-SHIM in high-stakes decision-making contexts.

PDF Details DOI

AAAI Conference 2022 Conference Paper

Fast and More Powerful Selective Inference for Sparse High-Order Interaction Model

Diptesh Das
Vo Nguyen Le Duy
Hiroyuki Hanada
Koji Tsuda
Ichiro Takeuchi

Automated high-stake decision-making, such as medical diagnosis, requires models with high interpretability and reliability. We consider the sparse high-order interaction model as an interpretable and reliable model with a good prediction ability. However, finding statistically significant high-order interactions is challenging because of the intrinsically high dimensionality of the combinatorial effects. Another problem in data-driven modeling is the effect of “cherry-picking” (i. e. , selection bias). Our main contribution is extending the recently developed parametric programming approach for selective inference to high-order interaction models. An exhaustive search over the cherry tree (all possible interactions) can be daunting and impractical, even for small-sized problems. We introduced an efficient pruning strategy and demonstrated the computational efficiency and statistical power of the proposed method using both synthetic and real data.

PDF Details

NeurIPS Conference 2018 Conference Paper

Legendre Decomposition for Tensors

Mahito Sugiyama
Hiroyuki Nakahara
Koji Tsuda

We present a novel nonnegative tensor decomposition method, called Legendre decomposition, which factorizes an input tensor into a multiplicative combination of parameters. Thanks to the well-developed theory of information geometry, the reconstructed tensor is unique and always minimizes the KL divergence from an input tensor. We empirically show that Legendre decomposition can more accurately reconstruct tensors than other nonnegative tensor decomposition methods.

PDF Details

ICML Conference 2017 Conference Paper

Selective Inference for Sparse High-Order Interaction Models

Shinya Suzumura
Kazuya Nakagawa
Yuta Umezu
Koji Tsuda
Ichiro Takeuchi

Finding statistically significant high-order interactions in predictive modeling is important but challenging task because the possible number of high-order interactions is extremely large (e. g. , $> 10^{17}$). In this paper we study feature selection and statistical inference for sparse high-order interaction models. Our main contribution is to extend recently developed selective inference framework for linear models to high-order interaction models by developing a novel algorithm for efficiently characterizing the selection event for the selective inference of high-order interactions. We demonstrate the effectiveness of the proposed algorithm by applying it to an HIV drug response prediction problem.

Details

ICML Conference 2017 Conference Paper

Tensor Balancing on Statistical Manifold

Mahito Sugiyama
Hiroyuki Nakahara
Koji Tsuda

We solve tensor balancing, rescaling an Nth order nonnegative tensor by multiplying N tensors of order N - 1 so that every fiber sums to one. This generalizes a fundamental process of matrix balancing used to compare matrices in a wide range of applications from biology to economics. We present an efficient balancing algorithm with quadratic convergence using Newton’s method and show in numerical experiments that the proposed algorithm is several orders of magnitude faster than existing ones. To theoretically prove the correctness of the algorithm, we model tensors as probability distributions in a statistical manifold and realize tensor balancing as projection onto a submanifold. The key to our algorithm is that the gradient of the manifold, used as a Jacobian matrix in Newton’s method, can be analytically obtained using the Möbius inversion formula, the essential of combinatorial mathematics. Our model is not limited to tensor balancing, but has a wide applicability as it includes various statistical and machine learning models such as weighted DAGs and Boltzmann machines.

Details

NeurIPS Conference 2009 Conference Paper

Submodularity Cuts and Applications

Yoshinobu Kawahara
Kiyohito Nagano
Koji Tsuda
Jeff Bilmes

Several key problems in machine learning, such as feature selection and active learning, can be formulated as submodular set function maximization. We present herein a novel algorithm for maximizing a submodular set function under a cardinality constraint --- the algorithm is based on a cutting-plane method and is implemented as an iterative small-scale binary-integer linear programming procedure. It is well known that this problem is NP-hard, and the approximation factor achieved by the greedy algorithm is the theoretical limit for polynomial time. As for (non-polynomial time) exact algorithms that perform reasonably in practice, there has been very little in the literature although the problem is quite important for many applications. Our algorithm is guaranteed to find the exact solution in finite iterations, and it converges fast in practice due to the efficiency of the cutting-plane mechanism. Moreover, we also provide a method that produces successively decreasing upper-bounds of the optimal solution, while our algorithm provides successively increasing lower-bounds. Thus, the accuracy of the current solution can be estimated at any point, and the algorithm can be stopped early once a desired degree of tolerance is met. We evaluate our algorithm on sensor placement and feature selection applications showing good performance.

PDF Details

ICML Conference 2007 Conference Paper

Entire regularization paths for graph data

Koji Tsuda

Graph data such as chemical compounds and XML documents are getting more common in many application domains. A main difficulty of graph data processing lies in the intrinsic high dimensionality of graphs, namely, when a graph is represented as a binary feature vector of indicators of all possible subgraph patterns, the dimensionality gets too large for usual statistical methods. We propose an efficient method to select a small number of salient patterns by regularization path tracking. The generation of useless patterns is minimized by progressive extension of the search space. In experiments, it is shown that our technique is considerably more efficient than a simpler approach based on frequent substructure mining.

Details

ICML Conference 2006 Conference Paper

Clustering graphs by weighted substructure mining

Koji Tsuda
Taku Kudo

Details

JMLR Journal 2005 Journal Article

Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

Koji Tsuda
Gunnar Rätsch
Manfred K. Warmuth

PDF Details

ICML Conference 2005 Conference Paper

Propagating distributions on a hypergraph by dual information regularization

Koji Tsuda

Details

NeurIPS Conference 2004 Conference Paper

Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

Koji Tsuda
Gunnar Rätsch
Manfred K. Warmuth

We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann diver- gence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponen- tials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements. 1 Introduction Most learning algorithms have been developed to learn a vector of parameters from data. However, an increasing number of papers are now dealing with more structured parame- ters. More specifically, when learning a similarity or a distance function among objects, the parameters are defined as a symmetric positive definite matrix that serves as a kernel (e. g. [14, 11, 13]). Learning is typically formulated as a parameter updating procedure to optimize a loss function. The gradient descent update [6] is one of the most commonly used algorithms, but it is not appropriate when the parameters form a positive definite matrix, because the updated parameter is not necessarily positive definite. Xing et al. [14] solved this problem by always correcting the updated matrix to be positive. However no bound has been proven for this update-and-correction approach. In this paper, we introduce the Matrix Exponentiated Gradient update which works as follows: First, the matrix logarithm of the current parameter matrix is computed. Then a step is taken in the direction of the steepest descent. Finally, the parameter matrix is updated to the exponential of the modified log-matrix. Our update preserves symmetry and positive definiteness because the matrix exponential maps any symmetric matrix to a positive definite matrix. Bregman divergences play a central role in the motivation and the analysis of on-line learn- ing algorithms [5]. A learning problem is essentially defined by a loss function, and a di- vergence that measures the discrepancy between parameters. More precisely, the updates are motivated by minimizing the sum of the loss function and the Bregman divergence, where the loss function is multiplied by a positive learning rate. Different divergences lead to radically different updates [6]. For example, the gradient descent is derived from the squared Euclidean distance, and the exponentiated gradient from the Kullback-Leibler di- vergence. We use the von Neumann divergence (also called quantum relative entropy) for measuring the discrepancy between two positive definite matrices [8]. We derive a new Matrix Exponentiated Gradient update from this divergence (which is a Bregman diver- gence for positive definite matrices). Finally we prove relative loss bounds using the von Neumann divergence as a measure of progress. Also the following related key problem has received a lot of attention recently [14, 11, 13]: Find a symmetric positive definite matrix that satisfies a number of symmetric linear inequality constraints. The new DefiniteBoost algorithm greedily chooses the most violated constraint and performs an approximated Bregman projection. In the diagonal case, we recover AdaBoost [9]. We also show how the convergence proof of AdaBoost generalizes to the non-diagonal case. 2 von Neumann Divergence or Quantum Relative Entropy If F is a real convex differentiable function on the parameter domain (symmetric d d positive definite matrices) and f (W): = F(W), then the Bregman divergence between two parameters W and W is defined as F(W, W) = F(W) - F(W) - tr[(W - W)f(W)]. When choosing F(W) = tr(W log W -W), then f(W) = log W and the corresponding Bregman divergence becomes the von Neumann divergence [8]: F(W, W) = tr(W log W - W log W - W + W). (1) In this paper, we are primarily interested in the normalized case (when tr(W) = 1). In this case, the positive symmetric definite matrices are related to density matrices commonly used in Statistical Physics and the divergence simplifies to F(W, W) = tr(W log W - W log W). If W = i ivivi is our notation for the eigenvalue decomposition, then we can rewrite the normalized divergence as ~ ~ F(W, W) = i ln ~ i + i ln j(~ vi vj )2. i i, j So this divergence quantifies the difference in the eigenvalues as well as the eigenvectors. 3 On-line Learning In this section, we present a natural extension of the Exponentiated Gradient (EG) up- date [6] to an update for symmetric positive definite matrices. At the t-th trial, the algorithm receives a symmetric instance matrix Xt Rdd. It then produces a prediction ^ yt = tr(WtXt) based on the algorithm's current symmetric positive definite parameter matrix Wt. Finally it incurs for instance1 a quadratic loss (^ yt - yt)2, 1For the sake of simplicity, we use the simple quadratic loss: L W t (W) = (tr(Xt ) - yt)2. For the general update, the gradient Lt(Wt) is exponentiated in the update (4) and this gradient must be symmetric. Following [5], more general loss functions (based on Bregman divergences) are amenable to our techniques. and updates its parameter matrix Wt. In the update we aim to solve the following problem: Wt+1 = argmin W F(W, Wt) + (tr(WXt) - yt)2, (2) where the convex function F defines the Bregman divergence. Setting the derivative with respect to W to zero, we have f (Wt+1) - f(Wt) + [(tr(Wt+1Xt) - yt)2] = 0. (3) The update rule is derived by solving (3) with respect to Wt+1, but it is not solvable in closed form. A common way to avoid this problem is to approximate tr(Wt+1Xt) by tr(WtXt) [5]. Then, we have the following update: Wt+1 = f -1(f (Wt) - 2(^yt - yt)Xt). In our case, F(W) = tr(W log W - W) and thus f(W) = log W and f-1(W) = exp W. We also augment (2) with the constraint tr(W) = 1, leading to the following Matrix Exponential Gradient (MEG) Update: 1 Wt+1 = exp(log W Z t - 2(^yt - yt)Xt), (4) t where the normalization factor Zt is tr[exp(log Wt - 2(^yt - yt)Xt)]. Note that in the above update, the exponent log Wt - 2(^yt - yt)Xt is an arbitrary symmetric matrix and the matrix exponential converts this matrix back into a symmetric positive definite matrix. A numerically stable version of the MEG update is given in Section 3. 2. 3. 1 Relative Loss Bounds We now begin with the definitions needed for the relative loss bounds. Let S = (X1, y1), .. ., (XT, yT ) denote a sequence of examples, where the instance matrices Xt Rdd are symmetric and the labels yt R. For any symmetric positive semi-definite ma- trix U with tr(U) = 1, define its total loss as LU(S) = T (tr(UX t=1 t ) - yt)2. The total loss of the on-line algorithm is LMEG(S) = T (tr(W t=1 tXt) - yt)2. We prove a bound on the relative loss LMEG(S) -LU(S) that holds for any U. The proof generalizes a sim- ilar bound for the Exponentiated Gradient update (Lemmas 5. 8 and 5. 9 of [6]). The relative loss bound is derived in two steps: Lemma 3. 1 bounds the relative loss for an individual trial and Lemma 3. 2 for a whole sequence (Proofs are given in the full paper). Lemma 3. 1 Let Wt be any symmetric positive definite matrix. Let Xt be any symmetric matrix whose smallest and largest eigenvalues satisfy max - min r. Assume Wt+1 is produced from Wt by the MEG update and let U be any symmetric positive semi-definite matrix. Then for any constants a and b such that 0 a(yt - tr(WtXt))2 - b(yt - tr(UXt))2 (U, Wt) - (U, Wt+1) (5) In the proof, we use the Golden-Thompson inequality [3], i. e. , tr[exp(A + B)] tr[exp(A) exp(B)] for symmetric matrices A and B. We also needed to prove the fol- lowing generalization of Jensen's inequality to matrices: exp(1A + 2(I - A)) exp(1)A + exp(2)(I - A) for finite 1, 2 R and any symmetric matrix A with 0 c 1 1 LMEG(S) 1 + LU(S) + + r2(U, W 2 2 c 1). (6) Proof For the maximum tightness of (5), a should be chosen as a = = 2b/(2 + r2b). Let b = c/r2, and thus a = 2c/(r2(2 + c)). Then (5) is rewritten as 2c (y 2 + c t - tr(WtXt))2 - c(yt - tr(UXt))2 r2((U, Wt) - (U, Wt+1)) Adding the bounds for t = 1, , T, we get 2c L 2 + c MEG(S) - cLU(S) r2((U, W1) - (U, Wt+1)) r2(U, W1), which is equivalent to (6). Assuming LU(S) max and (U, W1) dmax, the bound (6) is tightest when c = r 2dmax/ max. Then we have LMEG(S) - LU(S) r2 maxdmax + r2(U, W 2 1). 3. 2 Numerically stable MEG update The MEG update is numerically unstable when the eigenvalues of Wt are around zero. However we can "unwrap" Wt+1 as follows: 1 t Wt+1 = exp(c (^ y ~ tI + log W1 s Z - 2 - ys)Xs), (7) t s=1 where the constant ~ Zt normalizes the trace of Wt+1 to one. As long as the eigen values of W1 are not too small then the computation of log Wt is stable. Note that the update is inde- pendent of the choice of ct R. We incrementally maintain an eigenvalue decomposition of the matrix in the exponent (O(n3) per iteration): t VttVTt = ctI + log W1 - 2 (^ys - ys)Xs), s=1 where the constant ct is chosen so that the maximum eigenvalue of the above is zero. Now Wt+1 = Vt exp(t)VTt /tr(exp(t)). 4 Bregman Projection and DefiniteBoost In this section, we address the following Bregman projection problem2 W = argmin W F (W, W1), tr(W) = 1, tr(WCj ) 0, for j = 1, .. ., n, (8) where the symmetric positive definite matrix W1 of trace one is the initial parameter ma- trix, and C1, .. ., Cn are arbitrary symmetric matrices. Prior knowledge about W is en- coded in the constraints, and the matrix closest to W1 is chosen among the matrices satis- fying all constraints. Tsuda and Noble [13] employed this approach for learning a kernel matrix among graph nodes, and this method can be potentially applied to learn a kernel matrix in other settings (e. g. [14, 11]). The problem (8) is a projection of W1 to the intersection of convex regions defined by the constraints. It is well known that the Bregman projection into the intersection of convex regions can be solved by sequential projections to each region [1]. In the original papers only asymptotic convergence was shown. More recently a connection [4, 7] was made to the AdaBoost algorithm which has an improved convergence analysis [2, 9]. We generalize the latter algorithm and its analysis to symmetric positive definite matrices and call the new algorithm DefiniteBoost. As in the original setting, only approximate projections (Figure 1) are required to show fast convergence. 2Note that if is large then the on-line update (2) becomes a Bregman projection subject to a single equality constraint tr(WXt) = yt. Approximate Figure 1: In (exact) Bregman projections, the intersection Projection of convex sets (i. e. , two lines here) is found by iterating pro- jections to each set. We project only approximately, so the projected point does not satisfy the current constraint. Nev- ertheless, global convergence to the optimal solution is guar- anteed via our proofs. Exact Projection Before presenting the algorithm, let us derive the dual problem of (8) by means of Lagrange multipliers, n = argmin log trexp(logW1- jCj), j 0. (9) j=1 See [13] for a detailed derivation of the dual problem. When (8) is feasible, the opti- mal solution is described as W = 1 exp(log W C ) = Z( 1 j ), where Z( ) - nj=1 j tr[exp(log W1 - n C j=1 j j )]. 4. 1 Exact Bregman Projections First, let us present the exact Bregman projection algorithm to solve (8). We start from the initial parameter W1. At the t-th step, the most unsatisfied constraint is chosen, jt = argmaxj=1, ,n tr(WtCj). Let us use Ct as the short notation for Cj. Then, the t following Bregman projection with respect to the chosen constraint is solved. Wt+1 = argmin (W, W W t), tr(W) = 1, tr(WCt) 0. (10) By means of a Lagrange multiplier, the dual problem is described as t = argmin tr[exp(log Wt - Ct)], 0. (11) Using the solution of the dual problem, Wt is updated as 1 Wt+1 = exp(log W Z t - tCt) (12) t(t) where the normalization factor is Zt(t) = tr[exp(log Wt -tCt)]. Note that we can use the same numerically stable update as in the previous section. 4. 2 Approximate Bregman Projections The solution of (11) cannot be obtained in closed form. However, one can use the following approximate solution: 1 1 + r t/max t t = log, (13) max t - min t 1 + rt/min t when the eigenvalues of Ct lie in the interval [min t, max t ] and rt = tr(WtCt). Since the most unsatisfied constraint is chosen, rt 0 and thus t 0. Although the projection is done only approximately, 3 the convergence of the dual objective (9) can be shown using the following upper bound. 3The approximate Bregman projection (with t as in (13) can also be motivated as an online algorithm based on an entropic loss and learning rate one (following Section 3 and [4]). Theorem 4. 1 The dual objective (9) is bounded as n T tr explogW1- jCj (rt) (14) j=1 t=1 max min t -t rt max max t -min t rt t -min t where (rt) = 1 - 1. max - t min t The dual objective is monotonically decreasing, because (rt) 1. Also, since rt corre- sponds to the maximum value among all constraint violations {rj}nj=1, we have (rt) = 1 only if rt = 0. Thus the dual objective continues to decrease until all constraints are satisfied. 4. 3 Relation to Boosting When all matrices are diagonal, the DefiniteBoost degenerates to AdaBoost [9]: Let {xi, yi}di=1 be the training samples, where xi Rm and yi {-1, 1}. Let h1(x), .. ., hn(x) [-1, 1] be the weak hypotheses. For the j-th hypothesis hj(x), let us define Cj = diag(y1hj(x1), .. ., ydhj(xd)). Since |yhj(x)| 1, max/min t = 1 for any t. Setting W1 = I/d, the dual objective (14) is rewritten as d n 1 exp d -yi jhj(xi), i=1 j=1 which is equivalent to the exponential loss function used in AdaBoost. Since Cj and W1 are diagonal, the matrix Wt stays diagonal after the update. If wti = [Wt]ii, the updating formula (12) becomes the AdaBoost update: wt+1, i = wti exp(-tyiht(xi))/Zt(t). The approximate solution of t (13) is described as t = 1 log 1+rt, where r 2 1-r t is the weighted t training error of the t-th hypothesis, i. e. rt = d w i=1 tiyiht(xi). 5 Experiments on Learning Kernels In this section, our technique is applied to learning a kernel matrix from a set of distance measurements. This application is not on-line per se, but it shows nevertheless that the theoretical bounds can be reasonably tight on natural data. When K is a d d kernel matrix among d objects, then the Kij characterizes the similarity between objects i and j. In the feature space, Kij corresponds to the inner product between object i and j, and thus the Euclidean distance can be computed from the entries of the kernel matrix [10]. In some cases, the kernel matrix is not given explicitly, but only a set of distance measurements is available. The data are represented either as (i) quantitative distance values (e. g. , the distance between i and j is 0. 75), or (ii) qualitative evaluations (e. g. , the distance between i and j is small) [14, 13]. Our task is to obtain a positive definite kernel matrix which fits well to the given distance data. On-line kernel learning In the first experiment, we consider the on-line learning scenario in which only one distance example is shown to the learner at each time step. The distance example at time t is described as {at, bt, yt}, which indicates that the squared Euclidean distance between objects at and bt is yt. Let us define a time-developing sequence of kernel matrices as {Wt}Tt=1, and the corresponding points in the feature space as {xti}di=1 (i. e. [Wt]ab = xtaxtb). Then, the total loss incurred by this sequence is T T 2 2 xta = (tr(W t - xtbt - yt tXt) - yt)2, t=1 t=1 1. 8 0. 45 1. 6 0. 4 1. 4 0. 35 1. 2 0. 3 1 0. 25 0. 8 Total Loss 0. 2 0. 6 Classification Error 0. 4 0. 15 0. 2 0. 1 0 0. 05 0 0. 5 1 1. 5 2 2. 5 3 0 0. 5 1 1. 5 2 2. 5 3 5 Iterations 5 Iterations x 10 x 10 Figure 2: Numerical results of on-line learning. (Left) total loss against the number of iterations. The dashed line shows the loss bound. (Right) classification error of the nearest neighbor classifier using the learned kernel. The dashed line shows the error by the target kernel. where Xt is a symmetric matrix whose (at, at) and (bt, bt) elements are 0. 5, (at, bt) and (bt, at) elements are -0. 5, and all the other elements are zero. We consider a controlled experiment in which the distance examples are created from a known target kernel matrix. We used a 52 52 kernel matrix among gyrB proteins of bacteria (d = 52). This data contains three bacteria species (see [12] for details). Each distance example is created by randomly choosing one element of the target kernel. The initial parameter was set as W1 = I/d. When the comparison matrix U is set to the target matrix, LU (S) = 0 and max = 0, because all the distance examples are derived from the target matrix. Therefore we choose learning rate = 2, which minimizes the relative loss bound of Lemma 3. 2. The total loss of the kernel matrix sequence obtained by the matrix exponential update is shown in Figure 2 (left). In the plot, we have also shown the relative loss bound. The bound seems to give a reasonably tight performance guarantee--it is about twice the actual total loss. To evaluate the learned kernel matrix, the prediction accuracy of bacteria species by the nearest neighbor classifier is calculated (Figure 2, right), where the 52 proteins are randomly divided into 50% training and 50% testing data. The value shown in the plot is the test error averaged over 10 different divisions. It took a large number of iterations ( 2 105) for the error rate to converge to the level of the target kernel. In practice one can often increase the learning rate for faster convergence, but here we chose the small rate suggested by our analysis to check the tightness of the bound. Kernel learning by Bregman projection Next, let us consider a batch learning sce- nario where we have a set of qualitative distance evaluations (i. e. inequality constraints). Given n pairs of similar objects {aj, bj}nj=1, the inequality constraints are constructed as xaj - xbj, j = 1, .. ., n, where is a predetermined constant. If Xj is de- fined as in the previous section and Cj = Xj - I, the inequalities are then rewritten as tr(WCj) 0, j = 1, .. ., n. The largest and smallest eigenvalues of any Cj are 1 - and -, respectively. As in the previous section, distance examples are generated from the target kernel matrix between gyrB proteins. Setting = 0. 2/d, we collected all object pairs whose distance in the feature space is less than to yield 980 inequalities (n = 980). Figure 3 (left) shows the convergence of the dual objective function as proven in Theo- rem 4. 1. The convergence was much faster than the previous experiment, because, in the batch setting, one can choose the most unsatisfied constraint, and optimize the step size as well. Figure 3 (right) shows the classification error of the nearest neighbor classifier. As opposed to the previous experiment, the error rate is higher than that of the target kernel matrix, because substantial amount of information is lost by the conversion to inequality constraints. 55 0. 8 50 0. 7 45 0. 6 40 0. 5 35 0. 4 Dual Obj 30 0. 3 Classification Error 25 0. 2 20 0. 1 15 0 0 50 100 150 200 250 300 0 50 100 150 200 250 300 Iterations Iterations Figure 3: Numerical results of Bregman projection. (Left) convergence of the dual objective function. (Right) classification error of the nearest neighbor classifier using the learned kernel. 6 Conclusion We motivated and analyzed a new update for symmetric positive matrices using the von Neumann divergence. We showed that the standard bounds for on-line learning and Boost- ing generalize to the case when the parameters are a symmetric positive definite matrix (of trace one) instead of a probability vector. As in quantum physics, the eigenvalues act as probabilities. Acknowledgment We would like to thank B. Sch olkopf, M. Kawanabe, J. Liao and W. S. Noble for fruitful discussions. M. W. was supported by NSF grant CCR 9821087 and UC Discovery grant LSIT02-10110. K. T. and G. R. gratefully acknowledge partial support from the PASCAL Network of Excellence (EU #506778). Part of this work was done while all three authors were visiting the National ICT Australia in Canberra.

PDF Details

NeurIPS Conference 2003 Conference Paper

Image Reconstruction by Linear Programming

Koji Tsuda
Gunnar Rätsch

A common way of image denoising is to project a noisy image to the sub- space of admissible images made for instance by PCA. However, a major drawback of this method is that all pixels are updated by the projection, even when only a few pixels are corrupted by noise or occlusion. We pro- pose a new method to identify the noisy pixels by (cid: 1) 1-norm penalization and update the identiﬁed pixels only. The identiﬁcation and updating of noisy pixels are formulated as one linear program which can be solved efﬁciently. Especially, one can apply the ν-trick to directly specify the fraction of pixels to be reconstructed. Moreover, we extend the linear program to be able to exploit prior knowledge that occlusions often ap- pear in contiguous blocks (e. g. sunglasses on faces). The basic idea is to penalize boundary points and interior points of the occluded area dif- ferently. We are able to show the ν-property also for this extended LP leading a method which is easy to use. Experimental results impressively demonstrate the power of our approach.

PDF Details

ICML Conference 2003 Conference Paper

Marginalized Kernels Between Labeled Graphs

Hisashi Kashima
Koji Tsuda
Akihiro Inokuchi

Details

JMLR Journal 2003 Journal Article

The em Algorithm for Kernel Matrix Completion with Auxiliary Data

Koji Tsuda
Shotaro Akaho
Kiyoshi Asai

In biological data, it is often the case that observed data are available only for a subset of samples. When a kernel matrix is derived from such data, we have to leave the entries for unavailable samples as missing. In this paper, the missing entries are completed by exploiting an auxiliary kernel matrix derived from another information source. The parametric model of kernel matrices is created as a set of spectral variants of the auxiliary kernel matrix, and the missing entries are estimated by fitting this model to the existing entries. For model fitting, we adopt the em algorithm (distinguished from the EM algorithm of Dempster et al., 1977) based on the information geometry of positive definite matrices. We will report promising results on bacteria clustering experiments using two marker sequences: 16S and gyrB. [abs] [ pdf ][ ps.gz ][ ps ]

PDF Details

NeurIPS Conference 2002 Conference Paper

Clustering with the Fisher Score

Koji Tsuda
Motoaki Kawanabe
Klaus-Robert Müller

Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classiﬁcation problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is pre- served in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clus- tering algorithm specialized for the Fisher score, which can exploit im- portant dimensions. This algorithm is successfully tested in experiments with artiﬁcial data and real data (amino acid sequences). as follows: Let us assume that a probabilistic model parameter estimate

PDF Details

NeurIPS Conference 2001 Conference Paper

A New Discriminative Kernel From Probabilistic Models

Koji Tsuda
Motoaki Kawanabe
Gunnar Rätsch
Sören Sonnenburg
Klaus-Robert Müller

Recently, Jaakkola and Haussler proposed a method for construct(cid: 173) ing kernel functions from probabilistic models. Their so called "Fisher kernel" has been combined with discriminative classifiers such as SVM and applied successfully in e. g. DNA and protein analysis. Whereas the Fisher kernel (FK) is calculated from the marginal log-likelihood, we propose the TOP kernel derived from Tangent vectors Of Posterior log-odds. Furthermore we develop a theoretical framework on feature extractors from probabilistic models and use it for analyzing FK and TOP. In experiments our new discriminative TOP kernel compares favorably to the Fisher kernel.

PDF Details