shnitsel.analyze.pca#

Attributes#

Classes#

PCAResult

Class to hold the results of a PCA analysis.

Functions#

pca_and_hops(…)

Get PCA projected data and a mask to provide information on which of the data points represent hopping points.

pca(…)

Function to perform a PCA decomposition on the data of various origins and formats.

pca_direct(data, dim[, n_components])

Wrapper function to directly apply the PCA decomposition to the values in a dataarray.

Module Contents#

OriginType#
ResultType#
DataType#
class PCAResult(pca_inputs, pca_dimension, pca_pipeline, pca_object, pca_projected_inputs)#

Bases: Generic[OriginType, ResultType]

Class to hold the results of a PCA analysis.

Also retains input data as well as corresponding results of the PCA decomposition. Input and output types are parametrized to allow for tree structures to be accurately represented.

Provides accessors for all result metadata as well as the method project_array(data_array), which projects another array of appropriate shape with dimension pca_mapped_dimension onto the PCA principal components.

Parameters:
  • OriginType – The type of the original input data. Should be either xr.DataArray for simple types (meaning a feature array was provided) or, for tree types, a flat DataGroup with xr.DataArrays in its leaves.

  • ResultType – Matching structure to OriginType but with the projected PCA decomposed input data as data within it. Either an xr.DataArray or a DataGroup same as for OriginType.

  • pca_inputs (OriginType)

  • pca_dimension (Hashable)

  • pca_pipeline (sklearn.pipeline.Pipeline)

  • pca_object (sklearn.decomposition.PCA)

  • pca_projected_inputs (ResultType)

_pca_inputs: OriginType#
_pca_pipeline: sklearn.pipeline.Pipeline#
_pca_dimension: Hashable#
_pca_components: xarray.DataArray#
_pca_object: sklearn.decomposition.PCA#
_pca_inputs_projected: ResultType#
property inputs: OriginType#
Return type:

OriginType

property fitted_pca_object: sklearn.decomposition.PCA#
Return type:

sklearn.decomposition.PCA

property pca_mapped_dimension: Hashable#
Return type:

Hashable

property pca_pipeline: sklearn.pipeline.Pipeline#
Return type:

sklearn.pipeline.Pipeline

property principal_components: xarray.DataArray#
Return type:

xarray.DataArray

property loadings: xarray.DataArray#
Return type:

xarray.DataArray

property projected_inputs: ResultType#
Return type:

ResultType

property results: ResultType#
Return type:

ResultType

get_most_significant_loadings(top_n_per=5, top_n_total=5)#

Function to retrieve the most significant loadings in the PCA result for each individual component and in total.

You can configure the number of loadings reported per component and in total via top_n_per and top_n_total.

Parameters:
  • top_n_per (int, optional) – Number of loadings to report per component, ranked by absolute value, by default 5

  • top_n_total (int, optional) – Number of features to report overall, ranked by the 2-norm of their loadings across all PCs, by default 5

Returns:

First the mapping of each PC to the array holding the data of all their most significant loadings. Second the overall most significant loadings across all components.

Return type:

tuple[Mapping[Hashable, xr.DataArray], xr.DataArray]
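The two rankings described above can be sketched in plain NumPy. This is an illustration only and not the shnitsel implementation: the loadings matrix, the feature names and the helper functions top_per_component and top_overall are hypothetical, and the sketch returns plain lists rather than xr.DataArray objects.

```python
import numpy as np

# Hypothetical loadings: rows are principal components, columns are features.
feature_names = ["d(0,1)", "d(0,2)", "d(1,2)", "d(1,3)"]
loadings = np.array([
    [0.9, -0.1, 0.3, 0.0],  # PC0
    [0.2, -0.8, 0.1, 0.4],  # PC1
])


def top_per_component(loadings, names, top_n_per):
    """For each component, pick the features with the largest absolute loading."""
    result = {}
    for pc_index, row in enumerate(loadings):
        order = np.argsort(-np.abs(row))[:top_n_per]
        result[pc_index] = [(names[i], float(row[i])) for i in order]
    return result


def top_overall(loadings, names, top_n_total):
    """Rank features by the 2-norm of their loadings across all components."""
    norms = np.linalg.norm(loadings, axis=0)
    order = np.argsort(-norms)[:top_n_total]
    return [(names[i], float(norms[i])) for i in order]


per_pc = top_per_component(loadings, feature_names, top_n_per=2)
overall = top_overall(loadings, feature_names, top_n_total=2)
```

Here per_pc maps each component index to its two strongest features, while overall ranks features by their combined weight across both components.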

explain_loadings(top_n_per=5, top_n_total=5)#

Generate a textual explanation of the top influential loadings in the PCA result.

Tries to put the results of get_most_significant_loadings() into a textual form.

Parameters:
  • top_n_per (int, optional) – Number of loadings to report per component, ranked by absolute value, by default 5

  • top_n_total (int, optional) – Number of features to report overall, ranked by the 2-norm of their loadings across all PCs, by default 5

Returns:

A text describing the results of the principal components analysis.

Return type:

str

project_array(other_da)#
Parameters:

other_da (xarray.DataArray)

Return type:

xarray.DataArray
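Conceptually, projecting another array means reusing the transformation fitted on the original inputs rather than fitting a new one. A minimal NumPy sketch of that idea follows. It assumes a plain SVD-based PCA that centers on the mean of the fitting data; the helpers fit_pca and project are hypothetical, and the actual method applies the stored sklearn pipeline to an xr.DataArray.

```python
import numpy as np

rng = np.random.default_rng(0)
fit_data = rng.normal(size=(50, 6))    # (frame, feature) data used for fitting
other_data = rng.normal(size=(10, 6))  # new data with the same feature dimension


def fit_pca(x, n_components):
    """Fit a PCA via SVD; returns the feature means and the component vectors."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(x, mean, components):
    """Project data onto previously fitted components (no refitting)."""
    return (x - mean) @ components.T


mean, components = fit_pca(fit_data, n_components=2)
projected_inputs = project(fit_data, mean, components)
projected_other = project(other_data, mean, components)
```

The second array only needs to share the feature dimension with the fitting data; its number of frames may differ.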

static get_extra_coords_for_loadings(data, dim)#
Parameters:
Return type:

Mapping[Hashable, xarray.DataArray]

pca_and_hops(frames: shnitsel.data.tree.node.TreeNode[Any, shnitsel.data.dataset_containers.shared.ShnitselDataset | xarray.Dataset], structure_selection: shnitsel.filtering.structure_selection.StructureSelection | shnitsel.filtering.structure_selection.StructureSelectionDescriptor | None = None, center_mean: bool = False, n_components: int = 2) shnitsel.data.tree.node.TreeNode[Any, tuple[PCAResult, xarray.DataArray]]#
pca_and_hops(frames: shnitsel.data.dataset_containers.shared.ShnitselDataset | xarray.Dataset, structure_selection: shnitsel.filtering.structure_selection.StructureSelection | shnitsel.filtering.structure_selection.StructureSelectionDescriptor | None = None, center_mean: bool = False, n_components: int = 2) tuple[PCAResult, xarray.DataArray]

Get PCA projected data and a mask to provide information on which of the data points represent hopping points.

Parameters:
  • frames (xr.Dataset | ShnitselDataset | TreeNode[Any, ShnitselDataset | xr.Dataset]) – A Dataset (or tree of those) containing ‘atXYZ’ and ‘astate’ variables

  • structure_selection (StructureSelection | StructureSelectionDescriptor, optional) – An optional selection of features to calculate and base the PCA fitting on. If not provided, will calculate a PCA for full pairwise distances.

  • center_mean (bool) – If True, center the data on its mean before the PCA, by default False.

  • n_components (int, optional) – The number of principal components to return, by default 2

Returns:

A tuple of the following two parts:

  • pca_res

    The result object of the call to pca(), holding all results of the PCA analysis (see documentation of pca()).

  • hopping_point_masks

    The mask of the hopping point events. Can be used to extract only the hopping point PCA results from the projected input result in pca_res.

Return type:

tuple[PCAResult, xr.DataArray]
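How such a mask is typically used can be sketched with NumPy. Note the assumption: this page does not define how hopping points are detected, so the sketch treats a frame as a hopping point when its active state differs from that of the previous frame. The arrays astate and projected are hypothetical stand-ins for the 'astate' variable and the projected inputs.

```python
import numpy as np

# Hypothetical active-state index per frame of a single trajectory.
astate = np.array([2, 2, 2, 1, 1, 0, 0, 0])
# Hypothetical PCA projection of the same frames, shape (frame, PC).
projected = np.arange(16, dtype=float).reshape(8, 2)

# Assumption: a frame is a hopping point if its active state differs
# from the active state of the preceding frame.
is_hop = np.zeros(astate.shape, dtype=bool)
is_hop[1:] = astate[1:] != astate[:-1]

# Keep only the projected points that belong to hopping events.
hop_points = projected[is_hop]
```

With the real return value, the same boolean selection would be applied to the projected inputs held by pca_res.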

pca(data: shnitsel.data.tree.node.TreeNode[Any, shnitsel.data.dataset_containers.shared.ShnitselDataset | xarray.Dataset], structure_selection: shnitsel.filtering.structure_selection.StructureSelection | shnitsel.filtering.structure_selection.StructureSelectionDescriptor | None = None, dim: None = None, n_components: int = 2, center_mean: bool = False) shnitsel.data.tree.node.TreeNode[Any, PCAResult[shnitsel.data.tree.data_group.DataGroup[xarray.DataArray], shnitsel.data.tree.data_group.DataGroup[xarray.DataArray]]]#
pca(data: shnitsel.data.dataset_containers.shared.ShnitselDataset | xarray.Dataset | xarray.DataArray, structure_selection: shnitsel.filtering.structure_selection.StructureSelection | shnitsel.filtering.structure_selection.StructureSelectionDescriptor | None = None, dim: None = None, n_components: int = 2, center_mean: bool = False) PCAResult
pca(data: shnitsel.data.dataset_containers.shared.ShnitselDataset | xarray.Dataset | shnitsel.data.tree.node.TreeNode[Any, shnitsel.data.dataset_containers.shared.ShnitselDataset | xarray.Dataset], structure_selection: shnitsel.filtering.structure_selection.StructureSelection | shnitsel.filtering.structure_selection.StructureSelectionDescriptor | None = None, dim: None = None, n_components: int = 2, center_mean: bool = False) PCAResult | shnitsel.data.tree.node.TreeNode[Any, PCAResult[shnitsel.data.tree.data_group.DataGroup[xarray.DataArray], shnitsel.data.tree.data_group.DataGroup[xarray.DataArray]]]
pca(data: xarray.DataArray, structure_selection: None = None, dim: Hashable | None = None, n_components: int = 2, center_mean: bool = False) PCAResult[xarray.DataArray, xarray.DataArray]

Function to perform a PCA decomposition on the data of various origins and formats.

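Accepts full trajectory data either in one of the hierarchical formats (Frames, Trajectory or ShnitselDB) or as a raw xr.Dataset. Alternatively, a feature xr.DataArray can be passed directly, in which case dim names the dimension along which the features lie.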

Parameters:
  • data (xr.DataArray) – A DataArray with at least one dimension whose name matches dim. The dtype should be integer or floating-point, with no NaN or inf entries.

  • structure_selection (StructureSelection | StructureSelectionDescriptor, optional) – Optional selection of geometric features to include in the PCA. If not provided, will fall back to pairwise distances.

  • dim – The name of the array-dimension to reduce (i.e. the axis along which different features lie)

  • n_components (int, optional) – The number of principal components to return, by default 2

  • center_mean (bool, optional) – Flag to center data before being passed to the PCA if set to True, by default False.

Returns:

  • PCAResult[xr.DataArray, xr.DataArray] – The full information obtained from fitting the PCA. Contains the PCA inputs, the principal components, the projected inputs, and the full pipeline for applying the PCA transformation to other data.

    The projected inputs are a DataArray with the same dimensions as data, except for the dimension indicated by dim, which is replaced by a dimension PC of size n_components.

    result.principal_components holds the fitted principal components. result.projected_inputs provides the PCA projection result when applied to the inputs.

  • ShnitselDB[PCAResult[DataGroup[xr.DataArray], DataGroup[xr.DataArray]]] – The hierarchical structure of PCA results, where each flat group is used for a PCA analysis.

Examples:

>>> pca_results1 = pca(data1)
>>> pca_results1.projected_inputs  # The inputs projected onto the principal components
>>> pca_results2 = pca_results1.project_array(data2)
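When no structure_selection is given, the documentation above states that the PCA falls back to pairwise distances. The NumPy sketch below shows what such a feature set looks like and how it reduces to n_components dimensions. It is an illustration under assumptions: the random positions array stands in for 'atXYZ', and the SVD-based PCA with mean centering is not necessarily how the shnitsel pipeline is configured.

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_atoms = 20, 4
# Hypothetical Cartesian coordinates, shape (frame, atom, xyz).
positions = rng.normal(size=(n_frames, n_atoms, 3))

# Indices of all unique atom pairs (i < j).
i_idx, j_idx = np.triu_indices(n_atoms, k=1)

# Pairwise distances per frame, shape (frame, pair); these are the features.
distances = np.linalg.norm(positions[:, i_idx] - positions[:, j_idx], axis=-1)

# Plain PCA via SVD on the mean-centered features.
n_components = 2
centered = distances - distances.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:n_components].T  # shape (frame, PC)
```

For four atoms there are six unique pairs, so each frame is reduced from six distance features to two principal-component scores.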

pca_direct(data, dim, n_components=2)#

Wrapper function to directly apply the PCA decomposition to the values in a dataarray.

Contrary to the pca() function, the features for the PCA are not derived from the data parameter; the values of data along dim are used as features directly.

Parameters:
  • data (xr.DataArray) – A DataArray with at least a dimension with a name matching dim

  • dim (Hashable) – The name of the array-dimension to reduce (i.e. the axis along which different features lie)

  • n_components (int, optional) – The number of principal components to return, by default 2

Returns:

  • PCAResult – The full information obtained from fitting the PCA. Contains the PCA inputs, the principal components, the projected inputs, and the full pipeline for applying the PCA transformation to other data.

    The projected inputs are a DataArray with the same dimensions as data, except for the dimension indicated by dim, which is replaced by a dimension PC of size n_components.

Examples:

>>> pca_results1 = pca_direct(data1, 'features')
>>> pca_results1.projected_inputs  # The inputs projected onto the principal components
>>> pca_results2 = pca_results1.project_array(data2)

Return type:

PCAResult
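The replacement of dim by a PC dimension can be illustrated for an array with more than two dimensions. The following NumPy sketch is not the library's code: the helper pca_along_axis and the dimension names are hypothetical, and a tuple of names stands in for the labelled dimensions of an xr.DataArray.

```python
import numpy as np


def pca_along_axis(values, dims, dim, n_components=2):
    """Reduce the axis named `dim` to `n_components` principal components."""
    axis = dims.index(dim)
    # Move the feature axis last and flatten all other axes into samples.
    moved = np.moveaxis(values, axis, -1)
    flat = moved.reshape(-1, moved.shape[-1])
    centered = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    # Restore the original layout with the feature axis replaced by "PC".
    out = scores.reshape(moved.shape[:-1] + (n_components,))
    out = np.moveaxis(out, -1, axis)
    new_dims = tuple("PC" if d == dim else d for d in dims)
    return out, new_dims


rng = np.random.default_rng(2)
values = rng.normal(size=(3, 5, 7))
projected, new_dims = pca_along_axis(
    values, ("trajectory", "feature", "frame"), "feature"
)
```

All dimensions other than the reduced one keep their size and position; only the feature axis shrinks to n_components.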

principal_component_analysis#
PCA#