Dataset¶
- class ims.dataset.Dataset(data, name=None, files=None, samples=None, labels=None)[source]¶
Bases: object
The Dataset class coordinates multiple GC-IMS spectra (instances of the ims.Spectrum class) together with labels, file names, and sample names.
ims.Spectrum methods are applied to all spectra in the dataset. It also provides additional functionality and methods that require multiple spectra at a time, such as alignments and calculating means. Most operations are performed in place for memory efficiency.
- Parameters:
data (list) – Lists instances of ims.Spectrum.
name (str) – Name of the dataset.
files (list) – Lists one file name per spectrum. Must be unique.
samples (list) – Lists sample names. A sample can have multiple files in case of repeat determination. Needed to calculate means.
labels (list or numpy.ndarray) – Classification or regression labels.
- preprocessing¶
Keeps track of applied preprocessing steps.
- Type:
list
- weights¶
Stores the weights from scaling when the method is called. Needed to correct the loadings in PCA automatically.
- Type:
numpy.ndarray of shape (n_samples, n_features)
- train_index¶
Keeps the indices from train_test_split method. Used for plot annotations in PLS_DA and PLSR classes.
- Type:
list
- test_index¶
Keeps the indices from train_test_split method. Used for plot annotations in PLS_DA and PLSR classes.
- Type:
list
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> print(ds)
Dataset: IMS_data, 58 Spectra
- add_spectrum(spectrum, sample, label)[source]¶
Adds a ims.Spectrum to the dataset. Sample name and label must be provided because they are not stored in the ims.Spectrum class.
- Parameters:
spectrum (ims.Spectrum) – GC-IMS spectrum to add to the dataset.
sample (str) – The sample name is added to the sample attribute. Necessary because sample names are not stored in ims.Spectrum class.
label (various) – Classification or regression label is added to the label attribute. Necessary because labels are not stored in ims.Spectrum class.
- Returns:
Dataset with the spectrum added.
- Return type:
ims.Dataset
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> sample = ims.Spectrum.read_mea("sample.mea")
>>> ds.add_spectrum(sample, "sample_name", "class_label")
- align_ret_time(reference='mean')[source]¶
Retention time alignment based on dynamic time warping.
- Parameters:
reference (str, int or Spectrum, optional) – Reference intensity values and retention time. If “mean” is used, calculates the mean from all samples in dataset. An integer is used to index the dataset and select a Spectrum. If a Spectrum is given, uses this external sample as reference, by default “mean”.
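For instance, aligning all spectra to the dataset mean (a sketch, assuming a local “IMS_data” folder of mea files as in the other examples):
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.align_ret_time(reference="mean")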
- asymcorr(lam=10000000.0, p=0.001, niter=20)[source]¶
Retention time baseline correction using asymmetric least squares.
- Parameters:
lam (float, optional) – Controls smoothness. Larger numbers return smoother curves, by default 1e7
p (float, optional) – Controls asymmetry, by default 1e-3
niter (int, optional) – Number of iterations during optimization, by default 20
- Return type:
ims.Dataset
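A usage sketch with the default parameters (assumes a local “IMS_data” folder as in the other examples):
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.asymcorr(lam=1e7, p=1e-3, niter=20)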
- binning(n=2)[source]¶
Downsamples each spectrum by binning the array with factor n. Similar to Spectrum.resampling but works on both dimensions simultaneously. If a dimension is not divisible by the binning factor, the array is shortened by the remainder at the long end. Very effective data reduction because a factor of n=2 already reduces the number of features to a quarter.
- Parameters:
n (int, optional) – Binning factor, by default 2.
- Returns:
Downsampled data matrix.
- Return type:
ims.Dataset
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_Data")
>>> print(ds[0].shape)
(4082, 3150)
>>> ds.binning(2)
>>> print(ds[0].shape)
(2041, 1575)
- bootstrap(n_bootstraps=5, n_samples=None, random_state=None)[source]¶
Iteratively resamples dataset with replacement. Samples can be included multiple times or not at all in the training data. Uses all samples that are not present in the training data as test data.
- Parameters:
n_bootstraps (int, optional) – Number of iterations, by default 5.
n_samples (int, optional) – Number of samples to draw per iteration. Is set to the length of the dataset if None, by default None.
random_state (int, optional) – Controls randomness, pass an int for reproducible output, by default None.
- Yields:
tuple – (X_train, X_test, y_train, y_test) per iteration
Example
>>> import ims
>>> from sklearn.metrics import accuracy_score
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> model = ims.PLS_DA(ds)
>>> accuracy = []
>>> for X_train, X_test, y_train, y_test in ds.bootstrap():
...     model.fit(X_train, y_train)
...     y_pred = model.predict(X_test)
...     accuracy.append(accuracy_score(y_test, y_pred))
- copy()[source]¶
Uses deepcopy from the copy module in the standard library. Most operations happen inplace. Use this method if you do not want to change the original variable.
- Returns:
deepcopy of self.
- Return type:
ims.Dataset
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> new_variable = ds.copy()
- cut_dt(start, stop=None)[source]¶
Cuts data along drift time coordinate. Range in between start and stop is kept. If stop is not given uses the end of the array instead. Combination with RIP relative drift time values makes it easier to cut the RIP away and focus on the peak area.
- Parameters:
start (int or float) – Start value on drift time coordinate.
stop (int or float, optional) – Stop value on drift time coordinate. If None uses the end of the array, by default None.
- Returns:
New drift time range.
- Return type:
ims.Dataset
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> print(ds[0].shape)
(4082, 3150)
>>> ds.interp_riprel().cut_dt(1.05, 2)
>>> print(ds[0].shape)
(4082, 1005)
- cut_rt(start, stop=None)[source]¶
Cuts data along retention time coordinate. Range in between start and stop is kept. If stop is not given uses the end of the array instead.
- Parameters:
start (int or float) – Start value on retention time coordinate.
stop (int or float, optional) – Stop value on retention time coordinate. If None uses the end of the array, by default None.
- Returns:
New retention time range.
- Return type:
ims.Dataset
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> print(ds[0].shape)
(4082, 3150)
>>> ds.cut_rt(80, 500)
>>> print(ds[0].shape)
(2857, 3150)
- drop(labels=None, samples=None, files=None)[source]¶
Removes all spectra of specified labels, samples, or files from the dataset. Must set at least one of the parameters.
- Parameters:
labels (str or list of str, optional) – List of label names to remove, by default None
samples (str or list of str, optional) – List of sample names to remove, by default None
files (str or list of str, optional) – List of file names to remove, by default None
- Returns:
Contains only spectra that do not match the specified criteria.
- Return type:
ims.Dataset
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds = ds.drop(labels=["GroupA"], files=["FileA", "FileB"])
- export_plots(folder_name=None, file_format='jpg', **kwargs)[source]¶
Saves a figure per spectrum as image file. See the docs for matplotlib savefig function for supported file formats and kwargs (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html).
Creates a new folder for the plots in the current working directory.
- Parameters:
folder_name (str, optional) – New directory to save the images to.
file_format (str, optional) – See matplotlib savefig docs for information about supported formats, by default ‘jpg’
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.export_plots("IMS_data_plots")
- get_xy(flatten=True)[source]¶
Returns features (X) and labels (y) as numpy arrays.
- Parameters:
flatten (bool, optional) – Flattens 3D datasets to 2D, by default True
- Returns:
(X, y)
- Return type:
tuple
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> X, y = ds.get_xy()
- groupby(key='label')[source]¶
Groups dataset by label or sample.
- Parameters:
key (str, optional) – “label” or “sample” are valid keys, by default “label”
- Returns:
List of one ims.Dataset instance per group or sample.
- Return type:
list
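For instance, grouping by label (a sketch, assuming a local “IMS_data” folder as in the other examples):
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> groups = ds.groupby("label")
>>> for group in groups:
...     print(group)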
- interp_riprel(rip_position=None)[source]¶
Interpolates all spectra to common RIP relative drift time coordinate. Alignment along drift time coordinate.
- Parameters:
rip_position (int or float, optional) – Position of the RIP in ms. If None the maximum value is used, by default None.
- Returns:
With RIP relative spectra.
- Return type:
ims.Dataset
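A usage sketch (assumes a local “IMS_data” folder; cutting the RIP away afterwards with cut_dt is a common follow-up):
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.interp_riprel().cut_dt(1.05)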
- kfold_split(n_splits=5, shuffle=True, random_state=None, stratify=False)[source]¶
K-Folds cross-validator (sklearn.model_selection.KFold). Splits the dataset into k consecutive folds and provides train and test data.
If stratify is True uses StratifiedKFold instead.
- Parameters:
n_splits (int, optional) – Number of folds. Must be at least 2, by default 5.
shuffle (bool, optional) – Whether to shuffle the data before splitting, by default True.
random_state (int, optional) – When shuffle is True random_state affects the order of the indices. Pass an int for reproducible splits, by default None.
stratify (bool, optional) – Whether to stratify output or not. Preserves the percentage of samples from each class in each split, by default False.
- Yields:
tuple – (X_train, X_test, y_train, y_test) per iteration
Example
>>> import ims
>>> from sklearn.metrics import accuracy_score
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> model = ims.PLS_DA(ds)
>>> accuracy = []
>>> for X_train, X_test, y_train, y_test in ds.kfold_split():
...     model.fit(X_train, y_train)
...     y_pred = model.predict(X_test)
...     accuracy.append(accuracy_score(y_test, y_pred))
- leave_one_out()[source]¶
Leave-One-Out cross-validator. Provides train test splits and uses each sample once as test set while the remaining data is used for training.
- Yields:
tuple – X_train, X_test, y_train, y_test
Example
>>> import ims
>>> from sklearn.metrics import accuracy_score
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> model = ims.PLS_DA(ds)
>>> accuracy = []
>>> for X_train, X_test, y_train, y_test in ds.leave_one_out():
...     model.fit(X_train, y_train)
...     y_pred = model.predict(X_test)
...     accuracy.append(accuracy_score(y_test, y_pred))
- mean()[source]¶
Calculates means for each sample, in case of repeat determinations. Automatically determines which file belongs to which sample. Sample names are used for mean spectra and file names are no longer needed.
- Returns:
With mean spectra.
- Return type:
ims.Dataset
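A usage sketch (assumes a local “IMS_data” folder with repeat determinations per sample):
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.mean()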
- normalization()[source]¶
Normalizes each spectrum in the dataset by scaling its intensity values to the range [0, 1].
This method ensures that the intensity values of each spectrum are individually normalized between 0 and 1 based on their unique minimum and maximum values. This standardizes the dataset, facilitating consistent comparison and analysis across different spectra.
- Returns:
self – The dataset with normalized spectra.
- Return type:
ims.Dataset
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.normalization()
- plot(index=0, **kwargs)[source]¶
Plots the spectrum of selected index and adds the label to the title.
- Parameters:
index (int, optional) – Index of spectrum to plot, by default 0
- Return type:
matplotlib.axes._subplots.AxesSubplot
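A usage sketch (assumes a local “IMS_data” folder; the index selects which spectrum to draw):
Example
>>> import ims
>>> import matplotlib.pyplot as plt
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.plot(index=0)
>>> plt.show()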
- classmethod read_csv(path, subfolders=False, extension='.csv', **kwargs)[source]¶
Reads generic csv files. The first row must be the drift time values and the first column must be the retention time values. The values in between form the intensity matrix. Uses the file creation time as the timestamp.
If subfolders=True expects the following folder structure for each label and sample:
- Data
- Group A
- Sample A
File A
File B
- Sample B
File A
File B
Labels can then be auto-generated from directory names. If subfolders=False, labels are generated by searching for keywords in file names. Default keywords are “blank”, “sample”, “qc”, and “control”, for example:
- blank_air.csv –> label “blank”
- control.csv –> label “control”
All files that do not match any of the keywords are assigned the label “sample”. A list of custom keywords can be passed with the label_keywords argument.
- Parameters:
path (str) – Absolute or relative file path.
subfolders (bool, optional) – Uses subdirectory names as labels, by default False
extension (str or list of str, optional) – File extension(s) to include, by default “.csv”
label_keywords (list of str, optional) – List of keywords to search for in filenames to assign labels. If not provided, defaults to [“blank”, “sample”, “qc”, “control”].
- Return type:
ims.Dataset
Example
>>> import ims
>>> ds = ims.Dataset.read_csv("IMS_data", subfolders=True)
>>> print(ds)
Dataset: IMS_data, 58 Spectra
- classmethod read_hdf5(path)[source]¶
Reads hdf5 files exported by the Dataset.to_hdf5 method. Convenient way to store preprocessed spectra. Especially useful for larger datasets as preprocessing requires more time. Preferred to csv because of faster read and write speeds.
- Parameters:
path (str) – Absolute or relative file path.
- Return type:
ims.Dataset
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.to_hdf5("IMS_data_hdf5")
>>> ds = ims.Dataset.read_hdf5("IMS_data_hdf5")
- classmethod read_mea(path, subfolders=False, sequential=False, compr='wavecompr', extension='.mea', **kwargs)[source]¶
Reads all mea files from G.A.S Dortmund instruments in the given directory and combines them into a dataset. Much faster than reading csv files and therefore preferred.
If subfolders=True expects the following folder structure for each label and sample:
- Data
- Group A
- Sample A
File A
File B
- Sample B
File A
File B
Labels can then be auto-generated from directory names. If subfolders=False, labels are generated by searching for keywords in file names. Default keywords are “blank”, “sample”, “qc”, and “control”, for example:
- blank_air.mea –> label “blank”
- control.mea –> label “control”
All files that do not match any of the keywords are assigned the label “sample”. A list of custom keywords can be passed with the label_keywords argument.
If sequential=True, each file is compressed with the chosen compr method (“wavecompr” or “binning”) directly after reading, instead of compressing the full dataset afterwards. Especially useful for big datasets or systems with lower RAM specifications. Valid kwargs for the compression methods can be found in the respective definitions.
- Parameters:
path (str) – Absolute or relative file path.
subfolders (bool, optional) – Uses subdirectory names as labels, by default False
sequential (bool, optional) – Sequential compression of each file, by default False
compr (str, optional) – Compression method “binning” or “wavecompr” are valid, by default “wavecompr”
extension (str or list of str, optional) – File extension(s) to include, by default “.mea”
label_keywords (list of str, optional) – List of keywords to search for in filenames to assign labels. If not provided, defaults to [“blank”, “sample”, “qc”, “control”].
- Return type:
ims.Dataset
- Raises:
ValueError – If compression method is not supported
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data", subfolders=True)
>>> print(ds)
Dataset: IMS_data, 58 Spectra
- classmethod read_zip(path, subfolders=False, extension=['.csv', '.json'], **kwargs)[source]¶
Reads zipped csv and json files from G.A.S Dortmund mea2zip converting tool. Present for backwards compatibility. Reading mea files is much faster and saves the manual extra step of converting.
If subfolders=True expects the following folder structure for each label and sample:
- Data
- Group A
- Sample A
File A
File B
- Sample B
File A
File B
Labels can then be auto-generated from directory names. If subfolders=False, labels are generated by searching for keywords in file names. Default keywords are “blank”, “sample”, “qc”, and “control”, for example:
- blank_air.csv –> label “blank”
- control.csv –> label “control”
All files that do not match any of the keywords are assigned the label “sample”. A list of custom keywords can be passed with the label_keywords argument.
- Parameters:
path (str) – Absolute or relative file path.
subfolders (bool, optional) – Uses subdirectory names as labels, by default False
extension (str or list of str, optional) – File extension(s) to include, by default [“.csv”, “.json”]
label_keywords (list of str, optional) – List of keywords to search for in filenames to assign labels. If not provided, defaults to [“blank”, “sample”, “qc”, “control”].
- Return type:
ims.Dataset
Example
>>> import ims
>>> ds = ims.Dataset.read_zip("IMS_data", subfolders=True)
>>> print(ds)
Dataset: IMS_data, 58 Spectra
- resample(n=2)[source]¶
Resamples each spectrum by calculating the mean of every n rows. If the length of the retention time axis is not divisible by n, both it and the data matrix are cropped by the remainder at the long end.
- Parameters:
n (int, optional) – Number of rows to mean, by default 2.
- Returns:
Resampled values.
- Return type:
ims.Dataset
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_Data")
>>> print(ds[0].shape)
(4082, 3150)
>>> ds.resample(2)
>>> print(ds[0].shape)
(2041, 3150)
- rip_scaling()[source]¶
Scales values relative to global maximum. Can be useful to directly compare spectra from instruments with different sensitivity.
- Returns:
With scaled values.
- Return type:
ims.Dataset
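A usage sketch (assumes a local “IMS_data” folder as in the other examples):
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.rip_scaling()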
- property sample_indices¶
This property returns information about all spectra indices for each sample in the dataset. Useful to select or remove specific samples or files.
- Returns:
Sample names as keys, lists with indices of spectra as values.
- Return type:
dict
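For instance (a sketch; the sample names in the returned dict depend on your data):
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> indices = ds.sample_indices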
- savgol(window_length=10, polyorder=2, direction='both')[source]¶
Applies a Savitzky-Golay filter to intensity values. Can be applied in the drift time, retention time or both directions.
- Parameters:
window_length (int, optional) – The length of the filter window, by default 10
polyorder (int, optional) – The order of the polynomial used to fit the samples, by default 2
direction (str, optional) – The direction in which to apply the filter. Can be ‘drift time’, ‘retention time’ or ‘both’. By default ‘both’
- Return type:
ims.Dataset
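A usage sketch with the default settings (assumes a local “IMS_data” folder):
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.savgol(window_length=10, polyorder=2, direction="both")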
- scaling(method='pareto', mean_centering=True)[source]¶
Scales and mean centers features according to the selected method.
- Parameters:
method (str, optional) – “pareto”, “auto”, “var” or None are valid options, by default “pareto”. If method=None and mean_centering=True, only mean centering is applied.
mean_centering (bool, optional) – If true center the data before scaling, by default True.
- Return type:
ims.Dataset
- Raises:
ValueError – If scaling method is not supported.
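A usage sketch, e.g. pareto scaling before a PCA (assumes a local “IMS_data” folder):
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.scaling("pareto")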
- select(labels=None, samples=None, files=None)[source]¶
Selects all spectra of specified labels, samples, or files. Must set at least one of the parameters.
- Parameters:
labels (list of str, optional) – List of label names to keep, by default None
samples (list of str, optional) – List of sample names to keep, by default None
files (list of str, optional) – List of file names to keep, by default None
- Returns:
Contains only matching spectra.
- Return type:
ims.Dataset
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> subset = ds.select(labels=["GroupA", "GroupB"], samples=["SampleA"])
- shuffle_split(n_splits=5, test_size=0.2, random_state=None)[source]¶
Shuffled splits for Monte Carlo cross-validation. Randomly selects a fraction of the dataset, without replacement, per split (sklearn.model_selection.ShuffleSplit).
- Parameters:
n_splits (int, optional) – Number of re-shuffling and splitting iterations, by default 5.
test_size (float, optional) – Proportion of the dataset to include in the test split, by default 0.2.
random_state (int, optional) – Controls randomness. Pass an int for reproducible output, by default None.
- Yields:
tuple – (X_train, X_test, y_train, y_test) per iteration
Example
>>> import ims
>>> from sklearn.metrics import accuracy_score
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> model = ims.PLS_DA(ds)
>>> accuracy = []
>>> for X_train, X_test, y_train, y_test in ds.shuffle_split():
...     model.fit(X_train, y_train)
...     y_pred = model.predict(X_test)
...     accuracy.append(accuracy_score(y_test, y_pred))
- sub_first_rows(n=1)[source]¶
Subtracts the first row from every row in each spectrum. An effective and simple baseline correction if RIP tailing is a concern, but it can hide small peaks.
- Return type:
ims.Dataset
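A usage sketch (assumes a local “IMS_data” folder as in the other examples):
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.sub_first_rows()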
- property timestamps¶
Property of timestamps when each spectrum in dataset was recorded.
- Returns:
List of Python datetime objects.
- Return type:
list
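For instance (a sketch; the actual datetime values depend on when the files were recorded):
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> times = ds.timestamps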
- to_hdf5(name=None, path=None)[source]¶
Exports the dataset as hdf5 file. It contains one group per spectrum and one with labels etc. Use ims.Dataset.read_hdf5 to read the file and construct a dataset.
- Parameters:
name (str, optional) – Name of the hdf5 file. File extension is not needed. If not set, uses the dataset name attribute, by default None.
path (str, optional) – Path to save the file. If not set uses the current working directory, by default None.
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.to_hdf5()
>>> ds = ims.Dataset.read_hdf5("IMS_data.hdf5")
- tophat(size=15)[source]¶
Applies white tophat filter on data matrix as a baseline correction. Size parameter is the diameter of the circular structuring element. (Slow with large size values.)
- Parameters:
size (int, optional) – Size of structuring element, by default 15.
- Return type:
ims.Dataset
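A usage sketch with the default size (assumes a local “IMS_data” folder):
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.tophat(size=15)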
- train_test_split(test_size=0.2, stratify=False, random_state=None)[source]¶
Splits the dataset in train and test sets.
- Parameters:
test_size (float, optional) – Proportion of the dataset to be used for validation. Should be between 0.0 and 1.0, by default 0.2
stratify (bool, optional) – Whether to stratify output or not. Preserves the percentage of samples from each class in each split, by default False.
random_state (int, optional) – Controls the randomness. Pass an int for reproducible output, by default None.
- Returns:
X_train, X_test, y_train, y_test
- Return type:
tuple of numpy.ndarray
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_Data")
>>> X_train, X_test, y_train, y_test = ds.train_test_split()
- wavecompr(direction='ret_time', wavelet='db3', level=3)[source]¶
Data reduction by wavelet compression. Can be applied to the drift time axis, the retention time axis, or both.
- Parameters:
direction (str, optional) – The direction in which to apply the compression. Can be ‘ret_time’, ‘drift_time’ or ‘both’, by default ‘ret_time’.
wavelet (str, optional) – Wavelet object or name string, by default “db3”.
level (int, optional) – Decomposition level (must be >= 0), by default 3.
- Return type:
ims.Dataset
- Raises:
ValueError – When direction is neither ‘ret_time’, ‘drift_time’ or ‘both’.
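A usage sketch, compressing along the retention time axis with the defaults (assumes a local “IMS_data” folder):
Example
>>> import ims
>>> ds = ims.Dataset.read_mea("IMS_data")
>>> ds.wavecompr(direction="ret_time", wavelet="db3", level=3)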