Arrow Research search

Author name cluster

Yasutoshi Ida

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers (13)

NeurIPS 2024 · Conference Paper

Fast Iterative Hard Thresholding Methods with Pruning Gradient Computations

  • Yasutoshi Ida
  • Sekitoshi Kanai
  • Atsutoshi Kumagai
  • Tomoharu Iwata
  • Yasuhiro Fujiwara

We accelerate the iterative hard thresholding (IHT) method, which finds the k most important elements of a parameter vector in a linear regression model. The plain IHT repeatedly updates the parameter vector during optimization, and computing the gradients for these updates is the main bottleneck. Our method safely prunes unnecessary gradient computations to reduce the processing time. The main idea is to efficiently construct, at each iteration, a candidate set that contains the k important elements of the parameter vector. Specifically, before computing the gradients, we prune elements that cannot enter the candidate set by utilizing upper bounds on the absolute values of the parameters. Because the pruning is safe, our method guarantees the same optimization results as the plain IHT. Experiments show that our method is up to 73 times faster than the plain IHT without degrading accuracy.
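
For context, a minimal NumPy sketch of the plain IHT iteration being accelerated is shown below; the safe pruning with upper bounds on the parameters is the paper's contribution and is not reproduced here, and all names (iht, step, n_iter) are illustrative.

```python
import numpy as np

def iht(X, y, k, step=None, n_iter=100):
    """Plain iterative hard thresholding for k-sparse linear regression.

    Each iteration takes a gradient step on the least-squares loss and then
    keeps only the k largest-magnitude entries.  The paper's speed-up comes
    from pruning gradient computations before this step, which is omitted.
    """
    n, d = X.shape
    if step is None:
        step = 1.0 / (np.linalg.norm(X, 2) ** 2)   # conservative step size
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)                   # gradient of 0.5*||Xw - y||^2
        w = w - step * grad                        # gradient step
        idx = np.argpartition(np.abs(w), -k)[-k:]  # indices of the k largest |w_i|
        w_hard = np.zeros(d)
        w_hard[idx] = w[idx]                       # hard thresholding
        w = w_hard
    return w
```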

AAAI 2023 · Conference Paper

Fast Regularized Discrete Optimal Transport with Group-Sparse Regularizers

  • Yasutoshi Ida
  • Sekitoshi Kanai
  • Kazuki Adachi
  • Atsutoshi Kumagai
  • Yasuhiro Fujiwara

Regularized discrete optimal transport (OT) is a powerful tool for measuring the distance between two discrete distributions constructed from data samples on two different domains. While it has a wide range of applications in machine learning, in some cases, such as unsupervised domain adaptation, only the data sampled from one of the domains have class labels. In this kind of problem setting, a group-sparse regularizer is frequently used as the regularization term to handle the class labels. In particular, it can preserve the label structure of the data samples by assigning samples with the same class label to a single group-sparse regularization term. As a result, we can measure the distance while utilizing label information by solving the regularized optimization problem with gradient-based algorithms. However, the gradient computation is expensive when the number of classes or data samples is large, because the number of regularization terms and their respective sizes also become large. This paper proposes fast discrete OT with group-sparse regularizers. Our method is based on two ideas. The first is to safely skip the computations of the gradients that must be zero. The second is to efficiently extract the gradients that are expected to be nonzero. Our method is guaranteed to return the same value of the objective function as the original approach. Experiments demonstrate that our method is up to 8.6 times faster than the original method without degrading accuracy.
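
As a concrete illustration of the regularizer involved, the sketch below evaluates a group-sparse (group-lasso) penalty on a transport plan, with one group per class label and target column, in the style commonly used for OT-based domain adaptation; the function name and grouping are assumptions, and the paper's skipping and extraction machinery is not shown.

```python
import numpy as np

def group_sparse_penalty(T, labels):
    """Group-lasso penalty on a transport plan T of shape (n_source, n_target).

    Source rows that share a class label form one group per target column,
    and the penalty sums the L2 norms of these blocks.  Many of these group
    terms are zero in practice, which is the structure that safe skipping of
    gradient computations exploits.
    """
    penalty = 0.0
    for c in np.unique(labels):
        block = T[labels == c, :]                        # rows of one class
        penalty += np.linalg.norm(block, axis=0).sum()   # column-wise group norms
    return penalty
```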

AAAI 2023 · Conference Paper

Fast Saturating Gate for Learning Long Time Scales with Recurrent Neural Networks

  • Kentaro Ohno
  • Sekitoshi Kanai
  • Yasutoshi Ida

Gate functions in recurrent models, such as LSTMs and GRUs, play a central role in learning the various time scales of time series data by using a bounded activation function. However, it is difficult to train gates to capture extremely long time scales because the bounded function suffers from vanishing gradients for large inputs, which is known as the saturation problem. We closely analyze the relation between the saturation of the gate function and the efficiency of training. We prove that the gradient vanishing of the gate function can be mitigated by accelerating the convergence of the saturating function, i.e., making the output of the function converge to 0 or 1 faster. Based on this analysis, we propose a gate function called the fast gate that has a doubly exponential convergence rate with respect to its inputs, obtained by simple function composition. We empirically show that our method outperforms previous methods in accuracy and computational efficiency on benchmark tasks involving extremely long time scales.
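
As a rough illustration of how function composition can yield doubly exponential saturation, the sketch below composes the sigmoid with sinh (whose magnitude grows exponentially), so the gate output approaches 0 or 1 doubly exponentially in the input; this is one simple composition with the stated behavior, not necessarily the exact fast gate defined in the paper.

```python
import torch

def standard_gate(x):
    # Ordinary sigmoid gate: 1 - sigmoid(x) decays only exponentially in x.
    return torch.sigmoid(x)

def fast_gate(x):
    # Composing sigmoid with sinh gives doubly exponential convergence to
    # 0 or 1, since |sinh(x)| itself grows like exp(|x|) / 2 for large |x|.
    return torch.sigmoid(torch.sinh(x))

x = torch.linspace(-4.0, 4.0, steps=9)
print(standard_gate(x))   # saturates slowly
print(fast_gate(x))       # saturates much faster
```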

ICML 2023 · Conference Paper

One-vs-the-Rest Loss to Focus on Important Samples in Adversarial Training

  • Sekitoshi Kanai
  • Shin'ya Yamaguchi
  • Masanori Yamada
  • Hiroshi Takahashi
  • Kentaro Ohno
  • Yasutoshi Ida

This paper proposes a new loss function for adversarial training. Since adversarial training has difficulties, e.g., the necessity of high model capacity, focusing on important data points by weighting the cross-entropy loss has attracted much attention. However, such importance-aware methods are vulnerable to sophisticated attacks, e.g., Auto-Attack. This paper experimentally reveals that the cause of their vulnerability is their small margins between the logit for the true label and the other logits. Since neural networks classify data points based on the logits, the logit margins should be large enough to keep the largest logit from being flipped by the attacks. Importance-aware methods do not increase the logit margins of important samples, and they decrease those of less important samples compared with cross-entropy loss. To increase the logit margins of important samples, we propose switching one-vs-the-rest loss (SOVR), which switches from cross-entropy to one-vs-the-rest loss for important samples that have small logit margins. We prove that one-vs-the-rest loss increases logit margins twice as much as the weighted cross-entropy loss does for a simple problem. We experimentally confirm that SOVR increases the logit margins of important samples, unlike existing methods, and achieves better robustness against Auto-Attack than importance-aware methods.
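
The sketch below shows one way such a switching loss can be written in PyTorch: a sigmoid-based one-vs-the-rest loss is used for samples whose logit margin falls below a threshold, and cross-entropy otherwise. The exact OVR form, threshold, and weighting in the paper may differ; all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def ovr_loss(logits, targets):
    """Sigmoid-based one-vs-the-rest loss: the true class is treated as the
    positive class against each of the other classes."""
    onehot = F.one_hot(targets, num_classes=logits.size(1)).float()
    return F.binary_cross_entropy_with_logits(logits, onehot, reduction="none").sum(dim=1)

def sovr_loss(logits, targets, margin_threshold=0.0):
    """Switching loss (illustrative): one-vs-the-rest loss for samples whose
    logit margin is small, ordinary cross-entropy for the rest."""
    true_logit = logits.gather(1, targets.unsqueeze(1)).squeeze(1)
    others = logits.clone()
    others.scatter_(1, targets.unsqueeze(1), float("-inf"))
    margin = true_logit - others.max(dim=1).values      # true logit minus best other
    ce = F.cross_entropy(logits, targets, reduction="none")
    use_ovr = margin < margin_threshold
    return torch.where(use_ovr, ovr_loss(logits, targets), ce).mean()
```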

NeurIPS 2022 · Conference Paper

Few-shot Learning for Feature Selection with Hilbert-Schmidt Independence Criterion

  • Atsutoshi Kumagai
  • Tomoharu Iwata
  • Yasutoshi Ida
  • Yasuhiro Fujiwara

We propose a few-shot learning method for feature selection that can select relevant features given a small number of labeled instances. Existing methods require many labeled instances for accurate feature selection, but sufficient instances are often unavailable. We use labeled instances in multiple related tasks to alleviate the lack of labeled instances in a target task. To measure the dependency between each feature and the label, we use the Hilbert-Schmidt Independence Criterion, a kernel-based independence measure. By modeling the kernel functions with neural networks that take a few labeled instances in a task as input, we can encode task-specific information into the kernels so that they are appropriate for the task. Feature selection with such kernels is performed by iterative optimization methods in which each update step is obtained in closed form. This formulation enables us to directly and efficiently minimize the expected test error on features selected from a small number of labeled instances. We experimentally demonstrate that the proposed method outperforms existing feature selection methods.
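
For reference, the quantity at the core of the method, the (biased) empirical Hilbert-Schmidt Independence Criterion between two Gram matrices, can be computed as below with fixed Gaussian kernels; in the paper the kernels are instead produced by neural networks conditioned on the few labeled instances, which is not shown, and the function names are illustrative.

```python
import numpy as np

def rbf_gram(Z, gamma=1.0):
    """Gaussian (RBF) Gram matrix over the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-gamma * d2)

def hsic(K, L):
    """Biased empirical HSIC estimate between Gram matrices K and L."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Example: dependence between feature j and the labels of a few instances.
# score = hsic(rbf_gram(X[:, [j]]), rbf_gram(y.reshape(-1, 1)))
```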

NeurIPS 2022 · Conference Paper

Meta-ticket: Finding optimal subnetworks for few-shot learning within randomly initialized neural networks

  • Daiki Chijiwa
  • Shin'ya Yamaguchi
  • Atsutoshi Kumagai
  • Yasutoshi Ida

Few-shot learning for neural networks (NNs) is an important problem that aims to train NNs with only a few data points. The main challenge is avoiding overfitting, since over-parameterized NNs can easily overfit to such small datasets. Previous work (e.g., MAML by Finn et al., 2017) tackles this challenge with meta-learning, which learns how to learn from a few data points by using various tasks. On the other hand, one conventional approach to avoiding overfitting is to restrict the hypothesis space by imposing sparse NN structures, such as the convolution layers used in computer vision. However, although such manually designed sparse structures are sample-efficient for sufficiently large datasets, they are still insufficient for few-shot learning. Two questions then naturally arise: (1) Can we find sparse structures effective for few-shot learning by meta-learning? (2) What benefits does this bring in terms of meta-generalization? In this work, we propose a novel meta-learning approach, called Meta-ticket, that finds optimal sparse subnetworks for few-shot learning within randomly initialized NNs. We empirically validate that Meta-ticket successfully discovers sparse subnetworks that can learn specialized features for each given task. Owing to this task-wise adaptation ability, Meta-ticket achieves superior meta-generalization compared with MAML-based methods, especially with large NNs.
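
To make the notion of a subnetwork within a randomly initialized network concrete, the sketch below parameterizes a mask over frozen random weights with learnable per-weight scores and a straight-through estimator, in the style of supermask training; the actual meta-learning procedure of Meta-ticket is not shown, and the class and parameter names are assumptions.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer with frozen random weights and a learnable binary mask.

    Only the per-weight scores are trained: the forward pass keeps the weights
    whose scores are in the top k, and a straight-through estimator lets
    gradients flow back to the scores.
    """
    def __init__(self, in_features, out_features, sparsity=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)       # frozen random weights
        self.scores = nn.Parameter(torch.randn(out_features, in_features))
        self.k = max(1, int((1.0 - sparsity) * self.weight.numel()))

    def forward(self, x):
        threshold = self.scores.flatten().topk(self.k).values.min()
        hard_mask = (self.scores >= threshold).float()
        # Straight-through: forward uses the hard mask, backward treats the
        # mask as if it were the (differentiable) scores.
        mask = hard_mask + self.scores - self.scores.detach()
        return x @ (self.weight * mask).t()
```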

NeurIPS 2021 · Conference Paper

Pruning Randomly Initialized Neural Networks with Iterative Randomization

  • Daiki Chijiwa
  • Shin'ya Yamaguchi
  • Yasutoshi Ida
  • Kenji Umakoshi
  • Tomohiro INOUE

Pruning the weights of randomly initialized neural networks plays an important role in the context of the lottery ticket hypothesis. Ramanujan et al. (2020) empirically showed that pruning alone, without optimizing the weight values, can achieve remarkable performance. However, to reach the same level of performance as weight optimization, the pruning approach requires more parameters in the networks before pruning, and thus more memory. To overcome this parameter inefficiency, we introduce a novel framework for pruning randomly initialized neural networks by iteratively randomizing the weight values (IteRand). Theoretically, we prove an approximation theorem in our framework, which indicates that the randomizing operations are provably effective in reducing the required number of parameters. We also empirically demonstrate the parameter efficiency in multiple experiments on CIFAR-10 and ImageNet.
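
A minimal sketch of the re-randomization step is shown below: the weights that are currently pruned are replaced with freshly drawn random values, so they can re-enter the subnetwork later with new values. The distribution and schedule used here are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def rerandomize_pruned(weight, mask, std=0.05):
    """Replace the currently pruned weights (mask == 0) with fresh random values.

    IteRand interleaves operations like this with the optimization of the
    pruning mask, so pruned weights re-enter the candidate pool with new
    values.  The i.i.d. Gaussian re-draw and the fixed std are illustrative.
    """
    with torch.no_grad():
        fresh = torch.randn_like(weight) * std
        weight.copy_(torch.where(mask.bool(), weight, fresh))
    return weight
```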

AAAI 2020 · Conference Paper

Absum: Simple Regularization Method for Reducing Structural Sensitivity of Convolutional Neural Networks

  • Sekitoshi Kanai
  • Yasutoshi Ida
  • Yasuhiro Fujiwara
  • Masanori Yamada
  • Shuichi Adachi

We propose Absum, a regularization method for improving the adversarial robustness of convolutional neural networks (CNNs). Although CNNs can accurately recognize images, recent studies have shown that the convolution operations in CNNs commonly have structural sensitivity to specific noise composed of Fourier basis functions. By exploiting this sensitivity, those studies proposed a simple black-box adversarial attack: the single Fourier attack. To reduce structural sensitivity, we can regularize the convolution filter weights, since the sensitivity of a linear transform can be assessed by the norm of its weights. However, standard regularization methods can prevent minimization of the loss function because they impose a tight constraint to obtain high robustness. To solve this problem, Absum imposes a loose constraint: it penalizes the absolute values of the summations of the parameters in the convolution layers. Absum can improve robustness against the single Fourier attack while being as simple and efficient as standard regularization methods (e.g., weight decay and L1 regularization). Our experiments demonstrate that Absum improves robustness against the single Fourier attack more than standard regularization methods do. Furthermore, we reveal that robust CNNs trained with Absum are more robust than those trained with standard regularization against transferred attacks, owing to the reduced common sensitivity, and against high-frequency noise. We also reveal that Absum can improve robustness against gradient-based attacks (projected gradient descent) when used with adversarial training.
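
As a sketch under one natural reading of the penalty, the code below sums each convolution kernel over its spatial dimensions and penalizes the absolute values of those sums; the exact grouping of the summation in the paper may differ, and the coefficient lam is an illustrative assumption.

```python
import torch
import torch.nn as nn

def absum_penalty(model, lam=1e-4):
    """Absum-style penalty (illustrative): absolute values of summed kernel weights.

    Each Conv2d kernel is summed over its spatial dimensions, giving one term
    per (output channel, input channel) pair; the absolute values of these
    sums are then penalized.
    """
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            # weight shape: (out_ch, in_ch, kH, kW) -> sum over kH and kW
            penalty = penalty + module.weight.sum(dim=(2, 3)).abs().sum()
    return lam * penalty

# Hypothetical usage: loss = criterion(model(x), y) + absum_penalty(model)
```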

ICML 2020 · Conference Paper

Fast Deterministic CUR Matrix Decomposition with Accuracy Assurance

  • Yasutoshi Ida
  • Sekitoshi Kanai
  • Yasuhiro Fujiwara
  • Tomoharu Iwata
  • Koh Takeuchi 0001
  • Hisashi Kashima

The deterministic CUR matrix decomposition is a low-rank approximation method for analyzing a data matrix. It has attracted considerable attention due to its high interpretability, which results from the fact that the decomposed matrices consist of subsets of the original columns and rows of the data matrix. The subsets are obtained by optimizing an objective function with sparsity-inducing norms via coordinate descent. However, the existing optimization algorithms incur high computation costs because coordinate descent iteratively updates all the parameters in the objective until convergence. This paper proposes a fast deterministic CUR matrix decomposition. Our algorithm safely skips unnecessary updates by efficiently evaluating the optimality conditions under which parameters must be zero. In addition, we preferentially update the parameters that must be nonzero. Theoretically, our approach guarantees the same result as the original approach. Experiments demonstrate that our algorithm speeds up the deterministic CUR while achieving the same accuracy.
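
For context, once the column and row indices have been selected (the sparse-regression step that the paper accelerates), the CUR factors can be assembled as below; the helper name is illustrative.

```python
import numpy as np

def cur_from_indices(A, col_idx, row_idx):
    """Assemble a CUR approximation A ≈ C @ U @ R from selected columns and rows.

    The deterministic CUR selects col_idx and row_idx via the sparse-regression
    step described above (the part the paper accelerates); only the assembly of
    the factors is shown here.
    """
    C = A[:, col_idx]                                  # chosen columns of A
    R = A[row_idx, :]                                  # chosen rows of A
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)      # least-squares middle factor
    return C, U, R
```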

AAAI 2019 · Conference Paper

Efficient Data Point Pruning for One-Class SVM

  • Yasuhiro Fujiwara
  • Sekitoshi Kanai
  • Junya Arai
  • Yasutoshi Ida
  • Naonori Ueda

One-class SVM is a popular method for one-class classification, but it incurs a high computation cost. This paper proposes Quix, an efficient training algorithm for one-class SVM. It prunes unnecessary data points before applying the SVM solver by computing upper and lower bounds on a parameter that determines the hyperplane. Since we can efficiently check the optimality of the hyperplane by using the bounds, the method guarantees classification results identical to those of the original approach. Experiments show that it is up to 6800 times faster than existing approaches without degrading optimality.

NeurIPS 2019 · Conference Paper

Fast Sparse Group Lasso

  • Yasutoshi Ida
  • Yasuhiro Fujiwara
  • Hisashi Kashima

Sparse Group Lasso is a method for linear regression analysis that finds parameters that are sparse in terms of both feature groups and individual features. Block Coordinate Descent is a standard approach to obtaining the parameters of Sparse Group Lasso, and it iteratively updates the parameters for each parameter group. However, because the update of even a single parameter group depends on all the parameter groups or data points, the computation cost is high when the number of parameters or data points is large. This paper proposes a fast Block Coordinate Descent for Sparse Group Lasso. It efficiently skips the updates of groups whose parameters must be zero by using only the parameters within the group. In addition, it preferentially updates the parameters in a candidate group set, which contains the groups whose parameters must not be zero. Theoretically, our approach guarantees the same results as the original Block Coordinate Descent. Experiments show that our algorithm enhances the efficiency of the original algorithm without any loss of accuracy.
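
The kind of group-level check that allows an update to be skipped can be sketched as below, following the standard zero-group optimality condition of the sparse group lasso; thresholds and scaling are illustrative, and the paper's additional machinery for avoiding these computations in the first place is not shown.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def group_can_be_skipped(X_g, residual, lam1, lam2):
    """Zero-group check for the sparse group lasso (illustrative scaling).

    If the soft-thresholded correlation between the group's features and the
    current residual has a small enough norm, all of the group's parameters
    are zero at the blockwise optimum, so the group's update can be skipped.
    """
    corr = X_g.T @ residual
    return np.linalg.norm(soft_threshold(corr, lam1)) <= lam2
```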

IJCAI 2017 · Conference Paper

Adaptive Learning Rate via Covariance Matrix Based Preconditioning for Deep Neural Networks

  • Yasutoshi Ida
  • Yasuhiro Fujiwara
  • Sotetsu Iwamura

Adaptive learning rate algorithms such as RMSProp are widely used for training deep neural networks. RMSProp offers efficient training because it uses first-order gradients to approximate Hessian-based preconditioning. However, since the first-order gradients include noise caused by stochastic optimization, the approximation may be inaccurate. In this paper, we propose a novel adaptive learning rate algorithm called SDProp. Its key idea is to handle the noise effectively by preconditioning based on the covariance matrix of the gradients. For various neural networks, our approach is more efficient and effective than RMSProp and its variant.
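
A minimal sketch of the idea, under the assumption of a diagonal covariance, is shown below: in contrast to RMSProp's uncentered second moment, a running mean and a running centered second moment of the gradients are maintained, and the update divides by the resulting standard deviation. Names and constants are illustrative, not the paper's exact update rule.

```python
import numpy as np

def sdprop_step(w, grad, state, lr=1e-3, gamma=0.99, eps=1e-8):
    """One covariance-preconditioned update (SDProp-style, illustrative).

    A running mean and a running *centered* second moment of the gradients
    (a diagonal covariance) are maintained, and the step divides by the
    resulting standard deviation instead of RMSProp's uncentered RMS.
    """
    m, v = state
    m_new = gamma * m + (1.0 - gamma) * grad
    v_new = gamma * v + gamma * (1.0 - gamma) * (grad - m) ** 2
    w_new = w - lr * grad / (np.sqrt(v_new) + eps)
    return w_new, (m_new, v_new)
```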

AAAI 2016 · Conference Paper

Fast Lasso Algorithm via Selective Coordinate Descent

  • Yasuhiro Fujiwara
  • Yasutoshi Ida
  • Hiroaki Shiokawa
  • Sotetsu Iwamura

For the AI community, the lasso proposed by Tibshirani is an important regression approach for finding explanatory predictors in high-dimensional data. The coordinate descent algorithm is a standard approach to solving the lasso; it iteratively updates the weights of the predictors in a round-robin style until convergence. However, it has a high computation cost. This paper proposes Sling, a fast approach to the lasso. It achieves high efficiency by skipping unnecessary updates for the predictors whose weights are zero during the iterations. Sling can obtain high prediction accuracy with fewer predictors than the standard approach. Experiments show that Sling can enhance the efficiency and the effectiveness of the lasso.
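
A simple heuristic version of such selective updating is sketched below: a coordinate-descent pass over the lasso that skips predictors whose weights are currently zero (with an occasional full pass so new predictors can enter). Sling itself uses conditions that guarantee results identical to full coordinate descent, which this sketch does not; names are illustrative and the columns of X are assumed standardized.

```python
import numpy as np

def lasso_cd_pass(X, y, w, lam, active_only=True):
    """One coordinate-descent pass for the lasso (columns of X standardized).

    With active_only=True, predictors whose weight is currently zero are
    skipped; an occasional full pass (active_only=False) lets new predictors
    enter the active set.
    """
    n = X.shape[0]
    r = y - X @ w                                      # current residual
    for j in range(X.shape[1]):
        if active_only and w[j] == 0.0:
            continue                                   # skip inactive predictor
        r += X[:, j] * w[j]                            # remove j's contribution
        rho = X[:, j] @ r / n                          # partial correlation
        w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) # soft-threshold update
        r -= X[:, j] * w[j]                            # restore updated contribution
    return w, r
```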