Arrow Research search

Author name cluster

Jose Principe

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
1 author row

Possible papers

10

AAAI Conference 2023 Conference Paper

Causal Recurrent Variational Autoencoder for Medical Time Series Generation

  • Hongming Li
  • Shujian Yu
  • Jose Principe

We propose causal recurrent variational autoencoder (CR-VAE), a novel generative model that is able to learn a Granger causal graph from a multivariate time series x and incorporates the underlying causal mechanism into its data generation process. Distinct to the classical recurrent VAEs, our CR-VAE uses a multi-head decoder, in which the p-th head is responsible for generating the p-th dimension of x (i.e., x^p). By imposing a sparsity-inducing penalty on the weights (of the decoder) and encouraging specific sets of weights to be zero, our CR-VAE learns a sparse adjacency matrix that encodes causal relations between all pairs of variables. Thanks to this causal matrix, our decoder strictly obeys the underlying principles of Granger causality, thereby making the data generating process transparent. We develop a two-stage approach to train the overall objective. Empirically, we evaluate the behavior of our model in synthetic data and two real-world human brain datasets involving, respectively, the electroencephalography (EEG) signals and the functional magnetic resonance imaging (fMRI) data. Our model consistently outperforms state-of-the-art time series generative models both qualitatively and quantitatively. Moreover, it also discovers a faithful causal graph with similar or improved accuracy over existing Granger causality-based causal inference methods. Code of CR-VAE is publicly available at https://github.com/hongmingli1995/CR-VAE.

IJCAI Conference 2021 Conference Paper

Information-Theoretic Methods in Deep Neural Networks: Recent Advances and Emerging Opportunities

  • Shujian Yu
  • Luis Sanchez Giraldo
  • Jose Principe

We present a review on the recent advances and emerging opportunities around the theme of analyzing deep neural networks (DNNs) with information-theoretic methods. We first discuss popular information-theoretic quantities and their estimators. We then introduce recent developments on information-theoretic learning principles (e. g. , loss functions, regularizers and objectives) and their parameterization with DNNs. We finally briefly review current usages of information-theoretic concepts in a few modern machine learning problems and list a few emerging opportunities.

AAAI Conference 2021 Conference Paper

Measuring Dependence with Matrix-based Entropy Functional

  • Shujian Yu
  • Francesco Alesiani
  • Xi Yu
  • Robert Jenssen
  • Jose Principe

Measuring the dependence of data plays a central role in statistics and machine learning. In this work, we summarize and generalize the main idea of existing information-theoretic dependence measures into a higher-level perspective by the Shearer’s inequality. Based on our generalization, we then propose two measures, namely the matrix-based normalized total correlation and the matrix-based normalized dual total correlation, to quantify the dependence of multiple variables in arbitrary dimensional space, without explicit estimation of the underlying data distributions. We show that our measures are differentiable and statistically more powerful than prevalent ones. We also show the impact of our measures in four different machine learning problems, namely the gene regulatory network inference, the robust machine learning under covariate shift and non-Gaussian noises, the subspace outlier detection, and the understanding of the learning dynamics of convolutional neural networks, to demonstrate their utilities, advantages, as well as implications to those problems.

IJCAI Conference 2020 Conference Paper

Measuring the Discrepancy between Conditional Distributions: Methods, Properties and Applications

  • Shujian Yu
  • Ammar Shaker
  • Francesco Alesiani
  • Jose Principe

We propose a simple yet powerful test statistic to quantify the discrepancy between two conditional distributions. The new statistic avoids the explicit estimation of the underlying distributions in high-dimensional space and it operates on the cone of symmetric positive semidefinite (SPS) matrix using the Bregman matrix divergence. Moreover, it inherits the merits of the correntropy function to explicitly incorporate high-order statistics in the data. We present the properties of our new statistic and illustrate its connections to prior art. We finally show the applications of our new statistic on three different machine learning problems, namely the multi-task learning over graphs, the concept drift detection, and the information-theoretic feature selection, to demonstrate its utility and advantage. Code of our statistic is available at https: //bit. ly/BregmanCorrentropy.

RLDM Conference 2013 Conference Abstract

Metric Learning for Invariant Feature Generation in Reinforcement Learning

  • Evan Kriminger
  • Austin Brockmeier
  • Luis Sanchez-Giraldo
  • Jose Principe

The success of reinforcement learning in real-world problems depends on careful selection of fea- tures to represent the state. Proper feature selection results in the increased ability to approximate the value function and in quicker learning, as only relevant information is emphasized. It is desirable to generate such features automatically, as this would otherwise be a trial and error process requiring a human expert. We propose a method to automatically map states to feature vectors, which are not only sufficiently descriptive of the environment, but are invariant to information not relevant to the agent’s goal. The mapping is based on the principle that if the Q-values for two states under the same action are similar, then the states themselves should be similar. This similarity is realized in a space defined by a metric computed using an information- theoretic approach to metric learning. Our method works in conjunction with a Q-learning algorithm of choice and is suitable for large and continuous state spaces. We test our algorithm on a non-sequential deci- sion task in which the state is an image. While the problem is in fact linear, our method greatly outperforms linear Q-learning.

NeurIPS Conference 2010 Conference Paper

A novel family of non-parametric cumulative based divergences for point processes

  • Sohan Seth
  • Park Il
  • Austin Brockmeier
  • Mulugeta Semework
  • John Choi
  • Joseph Francis
  • Jose Principe

Hypothesis testing on point processes has several applications such as model fitting, plasticity detection, and non-stationarity detection. Standard tools for hypothesis testing include tests on mean firing rate and time varying rate function. However, these statistics do not fully describe a point process and thus the tests can be misleading. In this paper, we introduce a family of non-parametric divergence measures for hypothesis testing. We extend the traditional Kolmogorov--Smirnov and Cramer--von-Mises tests for point process via stratification. The proposed divergence measures compare the underlying probability structure and, thus, is zero if and only if the point processes are the same. This leads to a more robust test of hypothesis. We prove consistency and show that these measures can be efficiently estimated from data. We demonstrate an application of using the proposed divergence as a cost function to find optimally matched spike trains.

NeurIPS Conference 2004 Conference Paper

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

  • Robert Jenssen
  • Deniz Erdogmus
  • Jose Principe
  • Torbjørn Eltoft

A new distance measure between probability density functions (pdfs) is introduced, which we refer to as the Laplacian pdf dis- tance. The Laplacian pdf distance exhibits a remarkable connec- tion to Mercer kernel based learning theory via the Parzen window technique for density estimation. In a kernel feature space defined by the eigenspectrum of the Laplacian data matrix, this pdf dis- tance is shown to measure the cosine of the angle between cluster mean vectors. The Laplacian data matrix, and hence its eigenspec- trum, can be obtained automatically based on the data at hand, by optimal Parzen window selection. We show that the Laplacian pdf distance has an interesting interpretation as a risk function connected to the probability of error. 1 Introduction In recent years, spectral clustering methods, i. e. data partitioning based on the eigenspectrum of kernel matrices, have received a lot of attention [1, 2]. Some unresolved questions associated with these methods are for example that it is not always clear which cost function that is being optimized and that is not clear how to construct a proper kernel matrix. In this paper, we introduce a well-defined cost function for spectral clustering. This cost function is derived from a new information theoretic distance measure between cluster pdfs, named the Laplacian pdf distance. The information theoretic/spectral duality is established via the Parzen window methodology for density estimation. The resulting spectral clustering cost function measures the cosine of the angle between cluster mean vectors in a Mercer kernel feature space, where the feature space is determined by the eigenspectrum of the Laplacian matrix. A principled approach to spectral clustering would be to optimize this cost function in the feature space by assigning cluster memberships. Because of space limitations, we leave it to a future paper to present an actual clustering algorithm optimizing this cost function, and focus in this paper on the theoretical properties of the new measure. Corresponding author. Phone: (+47) 776 46493. Email: robertj@phys. uit. no An important by-product of the theory presented is that a method for learning the Mercer kernel matrix via optimal Parzen windowing is provided. This means that the Laplacian matrix, its eigenspectrum and hence the feature space mapping can be determined automatically. We illustrate this property by an example. We also show that the Laplacian pdf distance has an interesting relationship to the probability of error. In section 2, we briefly review kernel feature space theory. In section 3, we utilize the Parzen window technique for function approximation, in order to introduce the new Laplacian pdf distance and discuss some properties in sections 4 and 5. Section 6 concludes the paper. 2 Kernel Feature Spaces Mercer kernel-based learning algorithms [3] make use of the following idea: via a nonlinear mapping: Rd F, x (x) (1) the data x1, .. ., xN Rd is mapped into a potentially much higher dimensional feature space F. For a given learning problem one now considers the same algorithm in F instead of in Rd, that is, one works with (x1), .. ., (xN) F. Consider a symmetric kernel function k(x, y). If k: C C R is a continuous kernel of a positive integral operator in a Hilbert space L2(C) on a compact set C Rd, i. e. L2(C): k(x, y)(x)(y)dxdy 0, (2) C then there exists a space F and a mapping: Rd F, such that by Mercer's theorem [4] NF k(x, y) = (x), (y) = ii(x)i(y), (3) i=1 where, denotes an inner product, the i's are the orthonormal eigenfunctions of the kernel and NF [3]. In this case (x) = [ 11(x), 22(x), .. . ]T, (4) can potentially be realized. In some cases, it may be desirable to realize this mapping. This issue has been addressed in [5]. Define the (N N) Gram matrix, K, also called the affinity, or kernel matrix, with elements Kij = k(xi, xj), i, j = 1, .. ., N. This matrix can be diagonalized as ET KE =, where the columns of E contains the eigenvectors of K and is a diagonal matrix containing the non-negative eigenvalues ~ 1, .. ., ~ N, ~ 1 ~N. In [5], it was shown that the eigenfunctions and eigenvalues of (4) can ~ be approximated as j j (xi) Neji, j, where e N ji denotes the ith element of the jth eigenvector. Hence, the mapping (4), can be approximated as (xi) [ ~1e1i, .. ., ~NeNi]T. (5) Thus, the mapping is based on the eigenspectrum of K. The feature space data set may be represented in matrix form as NN = [(x1), .. ., (xN )]. Hence, = 1 2 ET. It may be desirable to truncate the mapping (5) to C-dimensions. Thus, T only the C first rows of are kept, yielding ^. It is well-known that ^ K = ^ ^ is the best rank-C approximation to K wrt. the Frobenius norm [6]. The most widely used Mercer kernel is the radial-basis-function (RBF) k(x, y) = exp -||x - y||2. (6) 22 3 Function Approximation using Parzen Windowing Parzen windowing is a kernel-based density estimation method, where the resulting density estimate is continuous and differentiable provided that the selected kernel is continuous and differentiable [7]. Given a set of iid samples {x1, .. ., xN} drawn from the true density f (x), the Parzen window estimate for this distribution is [7] N ^ 1 f (x) = W N 2 (x, xi), (7) i=1 where W2 is the Parzen window, or kernel, and 2 controls the width of the kernel. The Parzen window must integrate to one, and is typically chosen to be a pdf itself with mean xi, such as the Gaussian kernel 1 W2 (x, xi) = exp, (8) d -||x - xi||2 (22) 2 22 which we will assume in the rest of this paper. In the conclusion, we briefly discuss the use of other kernels. Consider a function h(x) = v(x)f (x), for some function v(x). We propose to estimate h(x) by the following generalized Parzen estimator N ^ 1 h(x) = v(xi)W N 2 (x, xi). (9) i=1 This estimator is asymptotically unbiased, which can be shown as follows 1 N Ef v(xi)W N 2 (x, xi) = v(z)f (z)W2 (x, z)dz = [v(x)f (x)] W2(x), i=1 (10) where Ef () denotes expectation with respect to the density f(x). In the limit as N and (N) 0, we have lim [v(x)f (x)] W2(x) = v(x)f(x). (11) N (N )0 Of course, if v(x) = 1 x, then (9) is nothing but the traditional Parzen estimator of h(x) = f (x). The estimator (9) is also asymptotically consistent provided that the kernel width (N ) is annealed at a sufficiently slow rate. The proof will be presented in another paper. Many approaches have been proposed in order to optimally determine the size of the Parzen window, given a finite sample data set. A simple selection rule was proposed by Silverman [8], using the mean integrated square error (MISE) between the estimated and the actual pdf as the optimality metric: 1 d+4 opt = X 4N -1(2d + 1)-1, (12) where d is the dimensionality of the data and 2 = d-1, where are the X i Xii Xii diagonal elements of the sample covariance matrix. More advanced approximations to the MISE solution also exist. 4 The Laplacian PDF Distance Cost functions for clustering are often based on distance measures between pdfs. The goal is to assign memberships to the data patterns with respect to a set of clusters, such that the cost function is optimized. Assume that a data set consists of two clusters. Associate the probability density function p(x) with one of the clusters, and the density q(x) with the other cluster. Let f (x) be the overall probability density function of the data set. Now define the f -1 weighted inner product between p(x) and q(x) as p, q f p(x)q(x)f-1(x)dx. In such an inner product space, the Cauchy-Schwarz inequality holds, that is, p, q 2 q, q. Based on this discussion, an information theoretic distance f p, p f f measure between the two pdfs can be expressed as p, q D f L = - log 0. (13) p, p q, q f f We refer to this measure as the Laplacian pdf distance, for reasons that we discuss next. It can be seen that the distance DL is zero if and only if the two densities are equal. It is non-negative, and increases as the overlap between the two pdfs decreases. However, it does not obey the triangle inequality, and is thus not a distance measure in the strict mathematical sense. We will now show that the Laplacian pdf distance is also a cost function for clus- tering in a kernel feature space, using the generalized Parzen estimators discussed in the previous section. Since the logarithm is a monotonic function, we will derive the expression for the argument of the log in (13). This quantity will for simplicity be denoted by the letter "L" in equations. Assume that we have available the iid data points {xi}, i = 1, .. ., N1, drawn from p(x), which is the density of cluster C1, and the iid {xj}, j = 1, .. ., N2, drawn from q(x), the density of C2. Let h(x) = f - 12 (x)p(x) and g(x) = f - 12 (x)q(x). Hence, we may write h(x)g(x)dx L =. (14) h2(x)dx g2(x)dx We estimate h(x) and g(x) by the generalized Parzen kernel estimators, as follows N1 N2 ^ 1 1 h(x) = f - 12 (xi)W f - 12 (xj )W N 2 (x, xi ), ^ g(x) = 2 (x, xj ). (15) 1 N2 i=1 j=1 The approach taken, is to substitute these estimators into (14), to obtain N N 1 1 1 2 h(x)g(x)dx f - 12 (xi)W f - 12 (xj )W N 2 (x, xi ) 2 (x, xj ) 1 N2 i=1 j=1 N 1 1, N2 = f - 12 (xi)f - 12 (xj ) W N 2 (x, xi )W2 (x, xj )dx 1N2 i, j=1 N 1 1, N2 = f - 12 (xi)f - 12 (xj )W N 22 (xi, xj ), (16) 1N2 i, j=1 where in the last step, the convolution theorem for Gaussians has been employed. Similarly, we have N 1 1, N1 h2(x)dx f - 12 (xi)f - 12 (xi )W N 2 22 (xi, xi ), (17) 1 i, i =1 N 1 2, N2 g2(x)dx f - 12 (xj)f - 12 (xj )W N 2 22 (xj, xj ). (18) 2 j, j =1 Now we define the matrix Kf, such that Kf = K ij f (xi, xj ) = f - 1 2 (xi)f - 12 (xj )K(xi, xj ), (19) where K(xi, xj ) = W22 (xi, xj) for i, j = 1, .. ., N and N = N1 + N2. As a consequence, (14) can be re-written as follows N1, N2 Kf (xi, xj) L = i, j=1 (20) N1, N1 K K i, i =1 f (xi, xi ) N2, N2 j, j =1 f (xj, xj ) The key point of this paper, is to note that the matrix K = Kij = K(xi, xj), i, j = 1, .. ., N, is the data affinity matrix, and that K(xi, xj) is a Gaussian RBF kernel function. Hence, it is also a kernel function that satisfies Mercer's theorem. Since K(xi, xj) satisfies Mercer's theorem, the following by definition holds [4]. For any set of examples {x1, .. ., xN} and any set of real numbers 1, .. ., N N N ijK(xi, xj) 0, (21) i=1 j=1 in analogy to (3). Moreover, this means that N N N N ijf - 12 (xi)f - 12 (xj )K(xi, xj) = ijKf (xi, xj) 0, (22) i=1 j=1 i=1 j=1 hence Kf (xi, xj ) is also a Mercer kernel. Now, it is readily observed that the Laplacian pdf distance can be analyzed in terms of inner products in a Mercer kernel-based Hilbert feature space, since Kf (xi, xj) = f (xi), f (xj). Consequently, (20) can be written as follows N1, N2 f (xi), f (xj) L = i, j=1 N1, N1 i, i =1 f (xi), f (xi ) N2, N2 j, j =1 f (xj ), f (xj ) 1 N1 N2 N i=1 f (xi ), 1 N j=1 f (xj ) = 1 2 1 N1 N1 N2 N2 N f (xi), 1 f (xi ) 1 f (xj ), 1 f (xj ) 1 i=1 N1 i =1 N2 j=1 N2 j =1 m1, m2 = f f = cos (m, m ), (23) ||m 1f 2f 1f ||||m2f || where m Ni i = 1 f N f (xl), i = 1, 2, that is, the sample mean of the ith cluster i l=1 in feature space. This is a very interesting result. We started out with a distance measure between densities in the input space. By utilizing the Parzen window method, this distance measure turned out to have an equivalent expression as a measure of the distance between two clusters of data points in a Mercer kernel feature space. In the feature space, the distance that is measured is the cosine of the angle between the cluster mean vectors. The actual mapping of a data point to the kernel feature space is given by the eigendecomposition of Kf, via (5). Let us examine this mapping in more detail. 1 Note that f 2 (xi) can be estimated from the data by the traditional Parzen pdf estimator as follows N 1 1 f 2 (xi) = W (xi, xl) = di. (24) N 2 f l=1 Define the matrix D = diag(d1, .. ., dN ). Then Kf can be expressed as Kf = D- 12 KD- 12. (25) Quite interestingly, for 2 = 22, this is in fact the Laplacian data matrix. 1 f The above discussion explicitly connects the Parzen kernel and the Mercer kernel. Moreover, automatic procedures exist in the density estimation literature to opti- mally determine the Parzen kernel given a data set. Thus, the Mercer kernel is also determined by the same procedure. Therefore, the mapping by the Laplacian matrix to the kernel feature space can also be determined automatically. We regard this as a significant result in the kernel based learning theory. As an example, consider Fig. 1 (a) which shows a data set consisting of a ring with a dense cluster in the middle. The MISE kernel size is opt = 0. 16, and the Parzen pdf estimate is shown in Fig. 1 (b). The data mapping given by the corresponding Laplacian matrix is shown in Fig. 1 (c) (truncated to two dimensions for visualization purposes). It can be seen that the data is distributed along two lines radially from the origin, indicating that clustering based on the angular measure we have derived makes sense. The above analysis can easily be extended to any number of pdfs/clusters. In the C-cluster case, we define the Laplacian pdf distance as C-1 pi, pj L = f. (26) i=1 j=i C pi, pi p f j, pj f In the kernel feature space, (26), corresponds to all cluster mean vectors being pairwise as orthogonal to each other as possible, for all possible unique pairs. 4. 1 Connection to the Ng et al. [2] algorithm Recently, Ng et al. [2] proposed to map the input data to a feature space determined by the eigenvectors corresponding to the C largest eigenvalues of the Laplacian ma- trix. In that space, the data was normalized to unit norm and clustered by the C-means algorithm. We have shown that the Laplacian pdf distance provides a 1It is a bit imprecise to refer to Kf as the Laplacian matrix, as readers familiar with spectral graph theory may recognize, since the definition of the Laplacian matrix is L = I - Kf. However, replacing Kf by L does not change the eigenvectors, it only changes the eigenvalues from i to 1 - i. 0 0 (a) Data set (b) Parzen pdf estimate (c) Feature space data Figure 1: The kernel size is automatically determined (MISE), yielding the Parzen estimate (b) with the corresponding feature space mapping (c). clustering cost function, measuring the cosine of the angle between cluster means, in a related kernel feature space, which in our case can be determined automati- cally. A more principled approach to clustering than that taken by Ng et al. is to optimize (23) in the feature space, instead of using C-means. However, because of the normalization of the data in the feature space, C-means can be interpreted as clustering the data based on an angular measure. This may explain some of the success of the Ng et al. algorithm; it achieves more or less the same goal as cluster- ing based on the Laplacian distance would be expected to do. We will investigate this claim in our future work. Note that we in our framework may choose to use only the C largest eigenvalues/eigenvectors in the mapping, as discussed in section 2. Since we incorporate the eigenvalues in the mapping, in contrast to Ng et al. , the actual mapping will in general be different in the two cases. 5 The Laplacian PDF distance as a risk function We now give an analysis of the Laplacian pdf distance that may further motivate its use as a clustering cost function. Consider again the two cluster case. The overall data distribution can be expressed as f (x) = P1p(x) + P2q(x), were Pi, i = 1, 2, are the priors. Assume that the two clusters are well separated, such that for xi C1, f (xi) P1p(xi), while for xi C2, f(xi) P2q(xi). Let us examine the numerator of (14) in this case. It can be approximated as p(x)q(x) dx f (x) p(x)q(x) p(x)q(x) 1 1 dx + dx q(x)dx + p(x)dx. (27) C f (x) f (x) P1 P2 1 C2 C1 C2 By performing a similar calculation for the denominator of (14), it can be shown to be approximately equal to 1. Hence, the Laplacian pdf distance can be written P1P1 as a risk function, given by 1 1 L P1P2 q(x)dx + p(x)dx. (28) P1 C P2 1 C2 Note that if P1 = P2 = 1, then L = 2P 2 e, where Pe is the probability of error when assigning data points to the two clusters, that is Pe = P1 q(x)dx + P2 p(x)dx. (29) C1 C2 Thus, in this case, minimizing L is equivalent to minimizing Pe. However, in the case that P1 = P2, (28) has an even more interesting interpretation. In that situation, it can be seen that the two integrals in the expressions (28) and (29) are weighted exactly oppositely. For example, if P1 is close to one, L p(x)dx, while P C e 2 q(x)dx. Thus, the Laplacian pdf distance emphasizes to cluster the most un- C1 likely data points correctly. In many real world applications, this property may be crucial. For example, in medical applications, the most important points to classify correctly are often the least probable, such as detecting some rare disease in a group of patients. 6 Conclusions We have introduced a new pdf distance measure that we refer to as the Laplacian pdf distance, and we have shown that it is in fact a clustering cost function in a kernel feature space determined by the eigenspectrum of the Laplacian data matrix. In our exposition, the Mercer kernel and the Parzen kernel is equivalent, making it possible to determine the Mercer kernel based on automatic selection procedures for the Parzen kernel. Hence, the Laplacian data matrix and its eigenspectrum can be determined automatically too. We have shown that the new pdf distance has an interesting property as a risk function. The results we have derived can only be obtained analytically using Gaussian ker- nels. The same results may be obtained using other Mercer kernels, but it requires an additional approximation wrt. the expectation operator. This discussion is left for future work. Acknowledgments. This work was partially supported by NSF grant ECS- 0300340.

NeurIPS Conference 1994 Conference Paper

Dynamic Modelling of Chaotic Time Series with Neural Networks

  • Jose Principe
  • Jyh-Ming Kuo

The auditory system of the barn owl contains several spatial maps. In young barn owls raised with optical prisms over their eyes, these auditory maps are shifted to stay in register with the visual map, suggesting that the visual input imposes a frame of reference on the auditory maps. However, the optic tectum, the first site of convergence of visual with auditory information, is not the site of plasticity for the shift of the auditory maps; the plasticity occurs instead in the inferior colliculus, which contains an auditory map and projects into the optic tectum. We explored a model of the owl remapping in which a global reinforcement signal whose delivery is controlled by visual foveation. A hebb learning rule gated by rein(cid: 173) forcement learned to appropriately adjust auditory maps. In addi(cid: 173) tion, reinforcement learning preferentially adjusted the weights in the inferior colliculus, as in the owl brain, even though the weights were allowed to change throughout the auditory system. This ob(cid: 173) servation raises the possibility that the site of learning does not have to be genetically specified, but could be determined by how the learning procedure interacts with the network architecture. 126 Alexandre Pouget, Cedric Deffayet, Te"ence J. Sejnowski c: :: ======: :: :: » •

NeurIPS Conference 1993 Conference Paper

Analysis of Short Term Memories for Neural Networks

  • Jose Principe
  • Hui-H. Hsu
  • Jyh-Ming Kuo

Short term memory is indispensable for the processing of time varying information with artificial neural networks. In this paper a model for linear memories is presented, and ways to include memories in connectionist topologies are discussed. A comparison is drawn among different memory types, with indication of what is the salient characteristic of each memory model.