Working with short time series


Although much of the time-series analysis literature has focused on methods for quantifying complex temporal structure in long time-series recordings, many time series analyzed in practice are relatively short. hctsa has been successfully applied to time-series classification problems in the data mining literature, including datasets of time series as short as 60 samples (link to paper). However, time-series data are sometimes even shorter, such as yearly economic data spanning perhaps six years, or biological data measured at, say, 10 points across a lifespan. Although many features in hctsa will not give a meaningful output when applied to a short time series, hctsa includes methods for filtering out such features (cf. TS_normalize), after which the remaining features can be used for analysis.
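The idea of filtering out features that fail on short time series can be illustrated with a minimal sketch. This is plain Python for illustration only and is not how hctsa or TS_normalize is implemented: the four toy features, the `filter_features` helper, and the `min_good` threshold are all hypothetical names introduced here. A feature that needs more samples than are available returns NaN, and any feature column that is not finite for a sufficient fraction of time series is dropped.

```python
import math

def lag_autocorr(x, lag):
    """Autocorrelation at a given lag; NaN if the series is too short or constant."""
    n = len(x)
    if n - lag < 2:
        return math.nan  # not enough overlapping samples for this lag
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    if var == 0:
        return math.nan
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

# Hypothetical toy feature set (hctsa's real feature library is far larger).
FEATURES = {
    "mean": lambda x: sum(x) / len(x),
    "range": lambda x: max(x) - min(x),
    "ac_lag1": lambda x: lag_autocorr(x, 1),
    "ac_lag10": lambda x: lag_autocorr(x, 10),  # cannot be computed on 6 samples
}

def compute_features(dataset):
    """Rows are time series, columns are features."""
    return [[f(ts) for f in FEATURES.values()] for ts in dataset]

def filter_features(matrix, names, min_good=1.0):
    """Keep feature columns whose fraction of finite values is at least min_good."""
    n_rows = len(matrix)
    keep = [
        j for j in range(len(names))
        if sum(math.isfinite(row[j]) for row in matrix) / n_rows >= min_good
    ]
    return [[row[j] for j in keep] for row in matrix], [names[j] for j in keep]

# Three short time series of six samples each (made-up values).
dataset = [
    [1.0, 2.0, 4.0, 3.0, 5.0, 6.0],
    [2.0, 1.0, 3.0, 5.0, 4.0, 7.0],
    [5.0, 4.0, 4.5, 3.0, 2.0, 1.0],
]
matrix = compute_features(dataset)
filtered, kept = filter_features(matrix, list(FEATURES))
print(kept)
```

In this toy case the lag-10 autocorrelation is undefined for six-sample series, so it is removed while the other three features are retained.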

The number of features with a meaningful output, from time series as short as 5 samples, up to those with as many as 500 samples, is shown below (where the maximum set of 7749 is shown as a dashed horizontal line):

In each case, over 3000 features can be computed. Note that one must be careful when representing a 5-dimensional object with thousands of features, the vast majority of which will be highly intercorrelated.
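The caution about intercorrelated features can be made concrete with a small check. The sketch below is plain Python for illustration, not part of hctsa: the `pearson` and `redundant_pairs` helpers, the toy features, and the 0.9 threshold are assumptions chosen for the example. It computes the Pearson correlation between pairs of feature columns across a set of 5-sample time series and lists the pairs that are nearly redundant.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    if sa == 0 or sb == 0:
        return math.nan
    return cov / (sa * sb)

def redundant_pairs(columns, threshold=0.9):
    """Return (name_i, name_j, r) for feature pairs with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(columns[names[i]], columns[names[j]])
            if math.isfinite(r) and abs(r) >= threshold:
                pairs.append((names[i], names[j], r))
    return pairs

# Four made-up time series of five samples each.
series = [
    [1.0, 3.0, 2.0, 5.0, 4.0],
    [2.0, 2.5, 3.0, 3.5, 4.0],
    [9.0, 1.0, 8.0, 2.0, 7.0],
    [0.5, 0.7, 0.6, 0.9, 0.8],
]

# Hypothetical toy features: one value per time series.
columns = {
    "mean": [sum(s) / len(s) for s in series],
    "sum": [sum(s) for s in series],
    "range": [max(s) - min(s) for s in series],
}

pairs = redundant_pairs(columns)
for name_a, name_b, r in pairs:
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
```

For fixed-length time series the sum is just a rescaled mean, so that pair is perfectly correlated; with thousands of features computed from five numbers, redundancy of this kind is widespread.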

Example application to developmental gene expression data

To demonstrate the feasibility of running hctsa analysis on datasets of short time series, we applied hctsa to gene expression data in the cerebellar brain region, r1A, across seven developmental time points (from the Allen Institute's Developing Mouse Brain Atlas), for a subset of 50 genes. After filtering and normalizing (TS_normalize), then clustering (TS_cluster), we plotted the clustered time-series data matrix (TS_plot_DataMatrix('cl')):

Inspecting the time-series plots to the left of the colored matrix, we can see that genes with similar temporal expression profiles are clustered together based on their 2829-long feature vector representations. Thus, these feature-based representations capture meaningful structure in these short time series. Further, while these 2829-long feature vectors are shorter than those that can be computed from longer time series, they still constitute a highly comprehensive representation that can serve as a starting point for obtaining interpretable insight into specific domain questions.
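The normalize-then-cluster step described above can be sketched conceptually as follows. This is an illustrative stand-in in plain Python, not the hctsa code: it assumes z-score normalization and average-linkage clustering on Euclidean distances, which may differ from the defaults used by TS_normalize and TS_cluster, and the helper names and the tiny feature matrix are hypothetical.

```python
import math

def zscore_columns(matrix):
    """Z-score each feature column (constant columns are mapped to zeros)."""
    n = len(matrix)
    out_cols = []
    for col in zip(*matrix):
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        out_cols.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*out_cols)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linkage_order(rows):
    """Average-linkage agglomerative clustering; returns a row ordering
    in which rows merged early end up next to each other."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Mean pairwise distance between members of the two clusters.
                d = sum(euclidean(rows[i], rows[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0]

# Made-up feature matrix: 4 time series (rows) by 2 features (columns).
# Rows 0 and 2 are similar to each other, as are rows 1 and 3.
features = [
    [1.0, 10.0],
    [5.0, 2.0],
    [1.1, 9.8],
    [5.2, 2.1],
]
normalized = zscore_columns(features)
order = linkage_order(normalized)
print(order)
```

Reordering the rows of the data matrix by `order` places time series with similar feature vectors next to each other, which is what makes block structure visible in a clustered data-matrix plot.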