Advice and common pitfalls

Thinking about running an hctsa analysis? Read this first.

The typical data analysis pipeline starts with inspecting and understanding the data, processing it in accordance with the questions of interest (and to be consistent with the assumptions of the analysis methods that will be applied), and then formulating and testing appropriate analysis methods. A typical *hctsa* pipeline inverts this process: many analysis methods are first applied, and then their results are interpreted.

Good practice involves thinking carefully about this full *hctsa *pipeline, including the type of questions and interpretations that are sought from it, and thus how the data are to be prepared, and how the results can be interpreted accurately.

Data processing

The following should be considered:

If your time-series data are measured on scales that differ across different recordings, then the data should be appropriately standardized.**Standardizing.**If your data contain low-order trends that are not meaningful (e.g., sensor drift) or not a signal of relevance (your question is based more around the structure of deviations from a low-frequency trend), then this should be removed.**Detrending.**. If your data contain high-frequency measurement noise, the analyst should consider removing it. For example, using a filter (e.g., moving average), wavelet denoising, or using a phase-space reconstruction (e.g., cf. Schreiber's method).**Denoising**. Features assume that the data are sampled at a rate that properly resolves the temporal patterns of interest. If your data are over-sampled, then many features will be sensitive to this dominating autocorrelation structure, and will be less sensitive to interesting patterns in the data. In this case, you can consider downsampling your data, for which there are many heuristics (e.g., cf. Toker et al.).*Downsampling*

Interpreting Features

Checking for simpler explanations

There are often many routes to solving a given data analysis challenge. For example, in a time-series classification problem, the two classes may be perfectly distinguished based on their lag-1 autocorrelation, and also on their Lyapunov exponent spectrum, and also on hundreds of other properties. In general, one should avoid interpreting the most complex features (like Lyapunov exponents) as being uniquely useful for a problem, as they reproduce the behavior of much simpler features, which provide a more interpretable and parsimonious interpretation of the relevant patterns in the dataset. For other problems, time-series analysis methods (that are sensitive to the time-ordering of the data samples) may not provide any benefit at all over properties of the data distribution (e.g., the variance), or more trivial differences in time-series length across classes.

In general, **complex explanations of patterns in a dataset can only be justified when simpler explanations have been ruled out**. E.g., Do not write a paper about a complex (e.g., powerlaw fit to a visibility graph degree-distribution) feature when your data can be just as well (or better) distinguished by their variance.

`TS_TopFeatures`

) to basic checks on different keyword-labeled classes of features (in `TS_CompareFeatureSets`

).Last modified 8mo ago

Copy link