Last updated 10 months ago

An *hctsa* dataset has been initialized (specifying details of a time-series dataset and operations to include using `TS_init`

), all results entries in the resulting `HCTSA.mat`

are set to `NaN`

, corresponding to results that are as yet uncomputed.

Calculations are performed using the function `TS_compute`

, which stores results back into the matrices in `HCTSA.mat`

. This function can be run without inputs to compute all missing values in the default hctsa file, `HCTSA.mat`

:

% Compute all missing values in HCTSA.mat:TS_compute();

`TS_compute`

will begin evaluating operations on time series in `HCTSA.mat`

for which elements in `TS_DataMat`

are `NaN`

(i.e., computations that have not been run previously). Results are stored back in the matrices of `HCTSA.mat`

: `TS_DataMat`

(output of each operation on each time series), `TS_CalcTime`

(calculation time for each operation on each time series), and `TS_Quality`

(labels indicating errors or special-valued outputs).

(1) Computing features in parallel across available cores using Matlab's Parallel Processing Toolbox. This can be achieved by setting the first input to `true`

:

% Compute all missing values in HCTSA.mat using parallel processing:TS_compute(true);

(2) Computing across a custom range of time-series IDs (`ts_id`

) and operation IDs (`op_id`

). This can be achieved by setting the second and third inputs:

% Compute missing values in HCTSA.mat for ts_ids from 1:10 and op_ids from 1:1000TS_compute(false,1:10,1:1000);

(3) Specifying what types of values should be computed:

% Compute all values that have never been computed before (default)TS_compute(false,[],[],'missing');% Compute all values that have never previous been calculated OR have previously been computed but returned an error:TS_compute(false,[],[],'error');

(4) Specifying a custom .mat file to operate on (`HCTSA.mat`

is the default):

% Compute all missing values in my_HCTSA_file.mat:TS_compute(false,[],[],'missing','my_HCTSA_file.mat');

(5) Suppress commandline output. All computations are displayed to screen by default (which can be overwhelming but is useful for error checking). This functionality can be suppressed by setting the final (6th) input to `false`

:

% Compute all missing values in HCTSA.mat, suppressing output to screen:TS_compute(false,[],[],'missing','',false);

Computing features for full time-series datasets can be time consuming, especially for large datasets of long time series. An appropriate strategy therefore depends on the time-series length, the number of time series in the dataset, and the computational resources available. When multiple cores are available, it is always recommended to use the parallel setting (i.e., as `TS_compute(true)`

).

The first thing to think about is how the time taken to compute 7749 features of v0.93 of *hctsa* scales with the length of time series in your dataset (see plot below). The figure compares results using a single core (e.g., `TS_compute(false)`

) to results using a 16-core machine, with parallelization enabled (e.g., `TS_compute(true)`

).

Times may vary across on individual machines, but the above plot can be used to estimate the computation time per time series, and thus help decide on an appropriate computation strategy for a given dataset.

Note that if computation times are too long for the computational resources at hand, one can always choose a reduced set of features, rather than the full set of >7000, to get a preliminary understanding of the dataset. One such reduced set of features is in `INP_ops_reduced.txt`

. We plan to reduced additional reduced feature sets, determined according to different criteria, in future.

If only a single machine is available for computation, there are a couple of options:

For small datasets, when it is feasible to run all computations in a single go, it is easiest to run computations within Matlab in a single call of

`TS_compute`

.For larger datasets that may run for a long time on a single machine, one may wish to use something like the provided

`sample_runscript_matlab`

script, where`TS_compute`

commands are run in a loop over time series, compute small sections of the dataset at a time (and then saving the results to file, e.g.,`HCTSA.mat`

), eventually covering the full dataset iteratively.

With a distributed computing setup, a local Matlab file (`HCTSA.mat`

) can be split into smaller pieces using `TS_subset`

, which outputs a new data file for a particular subset of your data, e.g., `TS_subset('raw',1:100)`

will generate a new file, `HCTSA_subset.mat`

that contains just time series with IDs from 1 to 100. Computing features for time series in each such subset can then be run on a distributed computing setup. For example, with a different compute node computing a different subset (by queuing batch jobs that each work on a given subset of time series). Once all subsets have been computed, the results can then be recombined into a single `HCTSA.mat`

file using `TS_combine`

commands, as described here.

Distributing feature computations on a large-scale distributed computing setup can be better suited to a linked mySQL database, especially for datasets that grow with time, as new time series can be easily added to the database. In this case, computation proceeds similarly to above, where shell scripts on a distributed cluster computing environment can be used to distribute jobs across cores, with all individual jobs writing to a centralized *mySQL* server. A set of Matlab code that generates an appropriately formatted mySQL database and interfaces with the database to facilitate *hctsa* feature computation is included with the software package, and is described in detail here.