Tech Blog

Our Data Scientists Comment on The Grapples

Optimal Transport vs. Fisher-Rao distance between copulas for clustering multivariate time series

6 June 2016 by Gautier MARTI

Copulas are distributions that encode the dependence between random variables. Because working with the whole distribution is complicated, one usually extracts from it a few meaningful numbers, called correlation, association, or dependence coefficients. These numbers measure only particular aspects of the dependence structure: monotone association, tail dependence, etc. When extracted from copulas that encode a more complex dependence pattern, they can even be misleading. This motivates the need for tools that deal with the whole distribution. We may need to:

  • compare two dependence structures (distance between two copulas);
  • summarize several dependence structures (barycenter of several copulas);
  • extract the strength of the association between the variables (projection of the copula onto [0,1]).
Basically, we aim at defining a relevant geometry for the space of copulas.
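
To make the point about misleading coefficients concrete, here is a minimal sketch in plain Python. It is an illustration written for this summary, not code from the post: the helper names (`pseudo_observations`, `spearman_rho`) and the toy data are assumptions. It rank-transforms two samples into pseudo-observations in (0, 1), which is the usual first step towards an empirical copula, and computes Spearman's rho on them. For a U-shaped relationship the coefficient is close to zero even though one variable is a deterministic function of the other.

```python
def pseudo_observations(values):
    """Map a sample to (0, 1) through its ranks: rank / (n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed samples."""
    return pearson(pseudo_observations(x), pseudo_observations(y))


# Toy data (hypothetical): a regular grid on (0, 1).
n = 1000
x = [(i + 0.5) / n for i in range(n)]
y_monotone = [xi ** 3 for xi in x]          # increasing function of x
y_ushaped = [(xi - 0.5) ** 2 for xi in x]   # deterministic, but not monotone

rho_monotone = spearman_rho(x, y_monotone)  # perfect monotone association
rho_ushaped = spearman_rho(x, y_ushaped)    # near zero despite full dependence
print(rho_monotone, rho_ushaped)
```

The second coefficient says "no association" while the scatter of pseudo-observations traces a clear V shape in the unit square, which is the kind of information only the whole copula retains.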

DataGrapple's Clustering takes on the STOXX Europe 600

18 April 2016 by Gautier MARTI

In this article, we apply our clustering approach, which powers the DataGrapple engine, to the stocks in the STOXX Europe 600. We leverage a dataset of historical prices adjusted for all cash and special dividends, splits, and all capital changes to produce homogeneous time series (courtesy of Finaltis). We may want to know: Are there groups of stocks that behave similarly? How similar are they? What are these groups? Using our clustering technology, we study their comovements and answer these questions. We also assess how well the groups persist under noise and economic changes. Finally, we illustrate and comment on the results using interactive visualizations that convey the main information about the groups of stocks found.

Optimal Copula Transport - A tutorial

18 January 2016 by Gautier MARTI

In this tutorial (based on our paper), we present a new methodology for clustering multivariate time series that leverages optimal transport between copulas. Copulas are used to encode both (i) the intra-dependence of a multivariate time series, and (ii) the inter-dependence between two time series. Optimal copula transport then allows us to define two distances between multivariate time series: (i) one for measuring intra-dependence dissimilarity, and (ii) another for measuring inter-dependence dissimilarity, based on a new multivariate dependence coefficient that is robust to noise, deterministic, and able to target specified dependencies. We illustrate the methodology and its benefits with Python and R code, which is available for download at the end of the tutorial.
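
As a rough illustration of what "optimal transport between copulas" means, the toy sketch below compares empirical copulas of very small bivariate samples. It is not the tutorial's code: the function names, the Euclidean ground cost, and the brute-force search over permutations are all assumptions made to keep the example self-contained. For two point clouds of equal size with uniform weights, an optimal transport plan can be taken to be a one-to-one matching, so for a handful of points the exact cost can be found by trying every matching.

```python
import math
from itertools import permutations


def empirical_copula_points(xs, ys):
    """Rank-transform a bivariate sample into points of the unit square."""
    n = len(xs)

    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        r = [0.0] * n
        for k, i in enumerate(order, start=1):
            r[i] = k / (n + 1)
        return r

    return list(zip(ranks(xs), ranks(ys)))


def transport_cost(p, q):
    """Exact optimal transport cost between two equal-size, uniformly
    weighted point clouds, found by brute force over all matchings.
    Only usable for tiny samples (n! candidate matchings)."""
    n = len(p)
    best = min(
        sum(math.hypot(p[i][0] - q[s[i]][0], p[i][1] - q[s[i]][1])
            for i in range(n))
        for s in permutations(range(n))
    )
    return best / n


# Toy sample (hypothetical values).
xs = [0.1, 0.4, 0.5, 0.9, 1.3, 2.0]
copula_cubic = empirical_copula_points(xs, [x ** 3 for x in xs])
copula_affine = empirical_copula_points(xs, [2 * x + 1 for x in xs])
copula_reversed = empirical_copula_points(xs, [-x for x in xs])

# Two increasing transforms share the same empirical copula: cost 0.
same_dependence = transport_cost(copula_cubic, copula_affine)
# Increasing versus decreasing dependence: strictly positive cost.
opposite_dependence = transport_cost(copula_cubic, copula_reversed)
print(same_dependence, opposite_dependence)
```

Real data would need a proper optimal transport solver working on binned copulas or larger point clouds; the brute-force search here only serves to show that the distance depends on the dependence structure and not on the marginals.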

A GNPR tutorial: How to cluster random walks

19 November 2014 by Gautier MARTI

We present in this note a novel non-parametric approach for clustering Markov processes. For technical details, please refer to the related scientific paper (preprint). We introduce a mapping from stationary time series to what we call the generic non-parametric representation (GNPR), which separates dependency information from distribution information without losing any of either. We also propose an associated metric that leverages this representation, together with its statistical estimate, which can be used in distance-based machine learning algorithms working on independent and identically distributed realizations of random variables. Python code illustrates the workflow, which is applied to a generated dataset consisting of random walks; the IPython Notebook file is available for download so that the experiments can be reproduced. We compare our results with those obtained by running the same clustering algorithm directly on the data and show that the presented method recovers a finer clustering than the straightforward approach: it essentially recovers the information generated by the random walks model.
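
The idea of separating dependency from distribution can be sketched in a few lines of plain Python. What follows is an assumption-laden illustration, not the notebook's code: the function names, the histogram-based Hellinger estimate, the number of bins, and the weight `theta` that mixes the two parts are choices made here; the paper gives the actual definition and estimator. The dependence part compares ranks only, the distribution part compares histograms of values only.

```python
import random


def rank_uniform(values):
    """Ranks rescaled to (0, 1]: the 'dependency' coordinates."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for k, i in enumerate(order, start=1):
        u[i] = k / n
    return u


def dependence_distance_sq(x, y):
    """Squared distance between rank-transformed series (ranks only)."""
    ux, uy = rank_uniform(x), rank_uniform(y)
    return 3 * sum((a - b) ** 2 for a, b in zip(ux, uy)) / len(x)


def histogram(values, lo, hi, bins):
    """Normalized histogram of values on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return [c / len(values) for c in counts]


def distribution_distance_sq(x, y, bins=20):
    """Squared Hellinger distance between histograms (values only)."""
    lo, hi = min(min(x), min(y)), max(max(x), max(y))
    px, py = histogram(x, lo, hi, bins), histogram(y, lo, hi, bins)
    return 0.5 * sum((a ** 0.5 - b ** 0.5) ** 2 for a, b in zip(px, py))


def combined_distance_sq(x, y, theta=0.5):
    """Assumed convex combination of the two parts, weighted by theta."""
    return (theta * dependence_distance_sq(x, y)
            + (1 - theta) * distribution_distance_sq(x, y))


# Toy increments of random walks (hypothetical data).
random.seed(0)
x = [random.gauss(0, 1) for _ in range(500)]
shuffled = x[:]
random.shuffle(shuffled)        # same values, different ordering
scaled = [3 * v for v in x]     # same ordering, different values

print(dependence_distance_sq(x, scaled), distribution_distance_sq(x, scaled))
print(dependence_distance_sq(x, shuffled), distribution_distance_sq(x, shuffled))
```

The scaled series is identical to `x` in the dependence part and differs only in distribution, while the shuffled series is the opposite case; a distance computed directly on the raw values would mix the two effects.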