shnitsel.io.read#

Attributes#

Classes#

Functions#

read(path[, kind, sub_pattern, multiple, ...])

Read all trajectories from a folder of trajectory folders.

read_folder_multi(path[, kind, sub_pattern, parallel, ...])

Function to read multiple trajectories from an input directory.

read_single(path, kind[, error_reporting, ...])

Helper function to read input from a single input path.

identify_or_check_input_kind(path, kind_hint)

Function to identify/guess which kind of input type the current path has if no kind was provided.

_per_traj(trajdir, reader, format_info, ...[, ...])

Internal function to carry out loading of trajectories to allow for parallel processing with a ProcessExecutor.

Module Contents#

DataType#
read(path, kind=None, *, sub_pattern=None, multiple=True, concat_method='db', parallel=True, error_reporting='log', input_units=None, input_state_types=None, input_state_names=None, input_trajectory_id_maps=None, expect_dtype=None)#

Read all trajectories from a folder of trajectory folders.

The function will attempt to automatically detect the type of the trajectory if kind is not set. If path is a directory containing multiple trajectory sub-directories or files and multiple=True, this function will attempt to load all of those entries in parallel. To limit the number of considered trajectories, you can provide sub_pattern as a glob pattern to filter the directory entries to be considered. The function will extract as much information from each trajectory as possible and return it in a standard shnitsel format.
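The effect of a sub_pattern filter can be previewed with the standard library before any data is read. The following is a minimal sketch, assuming the pattern is joined to path and expanded by glob.glob() as described for the sub_pattern parameter; the helper name candidate_entries and the directory names are illustrative and not part of shnitsel.

```python
import glob
import os
import tempfile


def candidate_entries(path: str, sub_pattern: str) -> list[str]:
    """Return the sorted entries under `path` that match `sub_pattern`."""
    return sorted(glob.glob(os.path.join(path, sub_pattern)))


# Build a throwaway directory layout to demonstrate the filtering.
with tempfile.TemporaryDirectory() as root:
    for name in ("TRAJ_00001", "TRAJ_00002", "ICOND_00001"):
        os.mkdir(os.path.join(root, name))
    matched = [os.path.basename(p) for p in candidate_entries(root, "TRAJ_*")]

print(matched)  # ['TRAJ_00001', 'TRAJ_00002']
```

Only the two TRAJ_ directories match, so the ICOND_ directory would not be considered for loading under this assumption.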

If multiple trajectories are loaded, they need to be combined into one return object. The method for this can be configured via concat_method (default: concat_method='db'). With concat_method='layers', a new dimension trajid will be introduced and different trajectories can be identified by their index along this dimension.

Please note that additional entries along the time dimension in any variable will be padded with default values. You can either check the max_ts attribute for the maximum time index in the respective directory or check whether there are np.nan values in any of the observables. We recommend using the energy variable.

concat_method='frames' introduces a new dimension frame where each tick is a combination of trajid and time in the respective trajectory. Therefore, only valid frames will be present and no padding is performed. concat_method='list' simply returns the list of successfully loaded trajectories without merging them. concat_method='db' returns a tree-structured ShnitselDB object containing all of the trajectories. This only works if all trajectories contain the same compound/molecule. For every concatenation method except 'list', the same number of atoms and states must be present in all individual trajectories.
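The difference between the padded 'layers' layout and the unpadded 'frames' layout can be illustrated without any shnitsel or xarray objects. This is a plain-Python sketch under stated assumptions: the two trajectories, their values and the dictionary/list containers are made up for illustration and do not reflect the actual return types.

```python
import math

# Two hypothetical trajectories of different length; the values stand in for
# an observable such as the energy at each time step.
trajectories = {1: [0.5, 0.4, 0.3], 2: [0.7, 0.6]}

# 'layers'-style: one row per trajid, padded with NaN to the longest trajectory.
max_len = max(len(values) for values in trajectories.values())
layers = {
    trajid: values + [math.nan] * (max_len - len(values))
    for trajid, values in trajectories.items()
}

# 'frames'-style: one entry per valid (trajid, time) pair, no padding needed.
frames = [
    ((trajid, t), value)
    for trajid, values in trajectories.items()
    for t, value in enumerate(values)
]

# Recover the number of valid time steps per trajectory from the padding.
valid_steps = {
    trajid: sum(not math.isnan(v) for v in row) for trajid, row in layers.items()
}

print(valid_steps)  # {1: 3, 2: 2}
print(len(frames))  # 5
```

Counting non-NaN entries in one observable mirrors the recommendation above to inspect a variable such as energy when the padded layout is used.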

Error reporting can be configured via error_reporting to either log errors or raise exceptions.

If parallel=True, multiple processes will be used to load multiple different trajectories in parallel.

As some formats do not contain sufficient information to extract the input units of all variables, you can provide units (see shnitsel.units.definitions.py for unit names) of individual variables via input_units. input_units should be a dict mapping default variable names to the respective unit. The individual variable names should adhere to the shnitsel-format standard, e.g. atXYZ, force, energy, dip_perm. Unknown names or names not present in the loaded data will be ignored without warning. If no overrides are provided, the read function will use internal defaults for all variables.

Similarly, as many output formats do not provide state multiplicity or state name information, we allow for the provision of state types (via input_state_types) and of state names (via input_state_names). Both can either be provided as a list of values for the states in the input in ascending index order or as a function that assigns the correct values to the coordinates state_types or state_names in the trajectory respectively. Types are either 1, 2, or 3, whereas names are commonly of the format “S0”, “D0”, “T0”. Do not modify any other variables within the respective function. If you modify any variable, use the mark_variable_assigned(variable) function, i.e. mark_variable_assigned(dataset.state_types) or mark_variable_assigned(dataset.state_names) respectively, to notify shnitsel of the respective update. If the notification is not applied, the coordinate may be dropped due to a supposed lack of assigned values.
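When state types are passed as a list, the list has one entry per state in ascending index order. The sketch below builds such a list from the numbers of singlets, doublets and triplets, following the default rule stated for the input_state_types parameter (num_singlet entries of 1, then 2*num_doublet entries of 2, then 3*num_triplets entries of 3). The helper default_state_types is hypothetical and only mimics that rule; it is not a shnitsel function.

```python
def default_state_types(
    num_singlets: int, num_doublets: int, num_triplets: int
) -> list[int]:
    """Build a state-type list: 1 = singlet, 2 = doublet, 3 = triplet."""
    return (
        [1] * num_singlets
        + [2] * (2 * num_doublets)
        + [3] * (3 * num_triplets)
    )


# Hypothetical system with three singlets and two triplets.
state_types = default_state_types(num_singlets=3, num_doublets=0, num_triplets=2)
print(state_types)  # [1, 1, 1, 3, 3, 3, 3, 3, 3]
```

A list like this could then be passed as input_state_types when the input format does not carry multiplicity information.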

If multiple trajectories are merged, it is important to be able to distinguish which trajectory you are referring to. By setting input_trajectory_id_maps, you can provide a mapping between input paths and the id you would like to assign to the trajectory read from that individual path as a dict. The key should be the absolute path as a posix-conforming string. The value should be the desired id. Note that ids should be pairwise distinct. Alternatively, input_trajectory_id_maps can be a function that is provided the pathlib.Path object of the trajectory input path and should return an associated id. By default, ids are extracted from integers in the directory names of directory-based inputs. If no integer is found or the format does not support the directory-style input, a random id will be assigned by default.
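Both forms of input_trajectory_id_maps can be prepared with the standard library. The following sketch assumes directory names such as TRAJ_00007; the function id_from_dirname and the example paths are illustrative only. It raises an error when no integer is found instead of falling back to a random id, which is a deliberate deviation from the default behaviour described above.

```python
import re
from pathlib import PurePath, PurePosixPath


def id_from_dirname(path: PurePath) -> int:
    """Return the last integer found in the final path component."""
    matches = re.findall(r"\d+", path.name)
    if not matches:
        raise ValueError(f"no integer id found in {path.name!r}")
    return int(matches[-1])


# Hypothetical absolute input paths.
paths = [
    PurePosixPath("/data/run/TRAJ_00007"),
    PurePosixPath("/data/run/TRAJ_00012"),
]

# Dict form: absolute posix path string -> desired id.
id_map = {p.as_posix(): id_from_dirname(p) for p in paths}

# Ids should be pairwise distinct.
assert len(set(id_map.values())) == len(id_map)

print(id_map)  # {'/data/run/TRAJ_00007': 7, '/data/run/TRAJ_00012': 12}
```

Either id_map or id_from_dirname itself could then be supplied as input_trajectory_id_maps.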

Parameters:
  • path (PathOptionsType) – The path to the folder of folders. Can be provided as str, os.PathLike or pathlib.Path. Depending on the kind of trajectory to be loaded, this should denote the path of the trajectory file (kind='shnitsel' or kind='ase') or a directory containing the files of the respective file format. Alternatively, if multiple=True, this can also denote a directory containing multiple sub-directories with the actual trajectories. In that case, the concat_method parameter should be set to specify how the loaded trajectories are combined.

  • kind (FormatIdentifierType, optional) – The kind of trajectory, i.e. whether it was produced by SHARC, Newton-X, PyRAI2MD or Shnitsel-Tools. If None is provided, the function will make a best-guess effort to identify which kind of trajectory has been provided.

  • sub_pattern (str, optional) – If the input is a format with multiple input trajectories in different directories, this is the search pattern to append to the path (the whole thing will be read by glob.glob()). The default will be chosen based on kind, e.g., for SHARC ‘TRAJ_*’ or ‘ICOND*’ and for NewtonX ‘TRAJ*’. If the kind does not support multi-folder inputs (like shnitsel), this will be ignored. If multiple=False, this pattern will be ignored.

  • multiple (bool, optional) – A flag to enable loading of multiple trajectories from the subdirectories of the provided path. If set to False, only the provided path will be attempted to be loaded. If sub_pattern is provided, this parameter should not be set to False or the matching will be ignored.

  • concat_method (Literal['db', 'layers', 'list', 'frames']) – How to combine the loaded trajectories if multiple trajectories have been loaded. Defaults to concat_method='db'. The available methods are:
      • 'db': Returns the trajectories/data points in a hierarchical tree structure to allow for easier management of complex data hierarchies.
      • 'layers': Introduce a new axis trajid along which the different trajectories are indexed in a combined xr.Dataset structure.
      • 'list': Return the multiple trajectories as a list of individually loaded data.
      • 'frames': Concatenate the individual trajectories along the time axis ('frames') using a xarray.indexes.PandasMultiIndex.

  • parallel (bool, optional) – Whether to read multiple trajectories at the same time via parallel processing (which, in the current implementation, is only faster on storage that allows non-sequential reads). By default True.

  • error_reporting (Literal['log','raise'], optional) – Choose whether to log or to raise errors as they occur during the import process. Currently, the implementation does not support error_reporting=’raise’ while parallel=True.

  • input_units (dict[str, str], optional) – An optional dictionary to set the units in the loaded trajectory. Only necessary if the units differ from that tool’s default convention or if there is no default convention for the tool. Please refer to the names of the different unit kinds and possible values for different units in shnitsel.units.definitions.

  • input_state_types (list[int] | Callable[[xr.Dataset], xr.Dataset], optional) – Either a list of state types/multiplicities to assign to states in the loaded trajectories or a function that assigns a state multiplicity to each state. The function may use all of the information in the trajectory if required and should return the updated Dataset. If not provided or set to None, default types/multiplicities will be applied based on extracted numbers of singlets, doublets and triplets. The first num_singlet types will be set to 1, then 2*num_doublet types will be set to 2 and then 3*num_triplets types will be set to 3. Will be invoked/applied before the input_state_names setting.

  • input_state_names (list[str] | Callable[[xr.Dataset], xr.Dataset], optional) – Either a list of names to assign to states in the loaded file or a function that assigns a state name to each state. The function may use all of the information in the trajectory, i.e. the state_types array, and should return the updated Dataset. If not provided or set to None, default naming will be applied, naming singlet states S0, S1,.., doublet states D0,… and triplet states T0, etc in ascending order. Will be invoked/applied after the input_state_types setting.

  • input_trajectory_id_maps (dict[str, int] | Callable[[pathlib.Path], int], optional) – A dict mapping absolute posix paths to the ids to be applied or a function to convert a path into an integer id to assign to the trajectory. If not provided, the id will be chosen either based on the last integer matched from the path or at random up to 2**31-1.

  • expect_dtype (type[DataType] | types.UnionType | None, optional) – An explicit type hint to control the output type of this function where template arguments are concerned. Will be explicitly set on ShnitselDB nodes. If not provided, may be inferred internally.

Returns:

  • Trajectory | Frames | DataType | xr.Dataset | xr.DataArray – For simple inputs like single trajectories or non-hierarchical inputs, this function will return trajectory data or the data stored in the file that was attempted to be read. If concat_method=’frames’, multiple data entries will be combined into a single MultiFrames object.

  • List[Trajectory | Frames] | List[DataType] – If concat_method=’list’ and multiple data entries were read, a list of that data may be returned.

  • ShnitselDB[Trajectory | Frames] | ShnitselDB[DataType] | CompoundGroup[DataType] | DataGroup[DataType] | DataLeaf[DataType] – If a file with hierarchical data was read or if concat_method='db' was set, a hierarchical structure will be returned. If such a structure was constructed, it will always complete the tree up to the ShnitselDB root. If a tree structure is read from file, completion is not automatically performed.

  • xr.Dataset – If no conversion is possible, the data is most likely returned as an xr.Dataset or a list thereof.

  • None – If no data could be loaded.

Raises:
  • FileNotFoundError – If the kind does not match the provided path format, e.g. because the path does not exist or does not denote a file/directory with the required contents.

  • FileNotFoundError – If the search (= path + pattern) doesn't match any paths according to glob.glob().

  • ValueError – If an invalid value for concat_method is passed.

  • ValueError – If error_reporting is set to 'raise' in combination with parallel=True, because the code cannot execute correctly in that configuration. Only 'log' is supported for parallel reading.

Return type:

xarray.Dataset | xarray.DataArray | shnitsel.data.dataset_containers.shared.ShnitselDataset | shnitsel.data.xr_io_compatibility.SupportsFromXrConversion | shnitsel.data.tree.node.TreeNode[Any, shnitsel.data.dataset_containers.shared.ShnitselDataset | shnitsel.data.xr_io_compatibility.SupportsFromXrConversion | xarray.Dataset | xarray.DataArray] | shnitsel.data.tree.node.TreeNode[Any, DataType] | Sequence[xarray.Dataset | shnitsel.data.dataset_containers.shared.ShnitselDataset | shnitsel.data.xr_io_compatibility.SupportsFromXrConversion | xarray.DataArray] | DataType
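The documented argument constraints can be checked before a long-running import is started. The following is a minimal sketch, not shnitsel's own validation: check_read_options and read_strict are hypothetical helpers, and the import path shnitsel.io.read is taken from this page's module name and assumed to be importable as written.

```python
VALID_CONCAT_METHODS = ("db", "layers", "list", "frames")
VALID_ERROR_REPORTING = ("log", "raise")


def check_read_options(
    concat_method: str = "db",
    parallel: bool = True,
    error_reporting: str = "log",
) -> None:
    """Raise ValueError for argument combinations documented as invalid."""
    if concat_method not in VALID_CONCAT_METHODS:
        raise ValueError(f"invalid concat_method: {concat_method!r}")
    if error_reporting not in VALID_ERROR_REPORTING:
        raise ValueError(f"invalid error_reporting: {error_reporting!r}")
    if error_reporting == "raise" and parallel:
        raise ValueError("error_reporting='raise' requires parallel=False")


def read_strict(path, concat_method: str = "db"):
    """Read with exceptions raised instead of logged (sequential loading)."""
    check_read_options(concat_method, parallel=False, error_reporting="raise")
    # Imported lazily so that this sketch can be defined without shnitsel.
    from shnitsel.io.read import read

    return read(
        path,
        concat_method=concat_method,
        parallel=False,
        error_reporting="raise",
    )
```

Since read may also return None when no data could be loaded, callers of a wrapper like read_strict would still want to check the result before using it.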

read_folder_multi(path, kind=None, sub_pattern=None, parallel=True, error_reporting='log', base_loading_parameters=None, expect_dtype=None)#

Function to read multiple trajectories from an input directory.

You can either specify the kind and pattern to match relevant entries or the default pattern for kind will be used. If no kind is specified, all possible input formats will be checked.

If multiple formats fit, no input will be read and either an error will be raised or an error will be logged and None returned.

Otherwise, all successful reads will be returned as a list.

Parameters:
  • path (PathOptionsType) – The path pointing to the directory whose subdirectories may contain multiple trajectories.

  • kind (FormatIdentifierType, optional) – The key indicating the input format, will be inferred if not provided.

  • sub_pattern (str, optional) – The pattern provided to “glob” to identify relevant entries in the path subtree. Defaults to None.

  • parallel (bool, optional) – A flag to enable parallel loading of trajectories. Only faster if postprocessing of read data takes up significant amounts of time. Defaults to True.

  • error_reporting (Literal["log", "raise"], optional) – Whether to raise or to log resulting errors. If errors are raised, they may also be logged. ‘raise’ conflicts with parallel=True setting. Defaults to “log”.

  • base_loading_parameters (LoadingParameters, optional) – Base parameters to influence the loading of individual trajectories. Can be used to set default inputs and variable name mappings. Defaults to None.

  • expect_dtype (type[DataType] | TypeForm[DataType], optional) – An explicit type hint to control the output type of this function where template arguments are concerned. Will be explicitly set on ShnitselDB nodes. If not provided, may be inferred internally.

Returns:

Either a list of individual trajectories, a list of various possible result types read from file or None if loading failed.

Return type:

list[Trajectory] | list[…] | None

Raises:
  • FileNotFoundError – If the path does not exist or the files were not found.

  • ValueError – If conflicting information about the file format is detected in the target directory.

read_single(path, kind, error_reporting='log', base_loading_parameters=None, expect_dtype=None)#

Helper function to read input from a single input path.

May yield complex and iterable data structures depending on the input format.

Parameters:
  • path (PathOptionsType) – Path to a directory that is checked for whether it can be read by the available input readers.

  • kind (shnitsel.io.format_registry.FormatIdentifierType | None) – If set, the input format specified by the user. Only that reader's result will be used eventually.

  • error_reporting (Literal['log', 'raise'])

  • base_loading_parameters (shnitsel.io.shared.helpers.LoadingParameters | None)

  • expect_dtype (type[DataType] | types.UnionType | None)

Returns:

  • xr.Dataset | xr.DataArray | ShnitselDataset | SupportsFromXrConversion | TreeNode[Any, ShnitselDataset | SupportsFromXrConversion | xr.Dataset | xr.DataArray] | TreeNode[Any, DataType] | Sequence[xr.Dataset | ShnitselDataset | SupportsFromXrConversion] | DataType | None – The data that has been read from the path location.

Return type:

xarray.Dataset | xarray.DataArray | shnitsel.data.dataset_containers.shared.ShnitselDataset | shnitsel.data.xr_io_compatibility.SupportsFromXrConversion | shnitsel.data.tree.node.TreeNode[Any, shnitsel.data.dataset_containers.shared.ShnitselDataset | shnitsel.data.xr_io_compatibility.SupportsFromXrConversion | xarray.Dataset | xarray.DataArray] | shnitsel.data.tree.node.TreeNode[Any, DataType] | Sequence[xarray.Dataset | shnitsel.data.dataset_containers.shared.ShnitselDataset | shnitsel.data.xr_io_compatibility.SupportsFromXrConversion | xarray.DataArray] | DataType | None

identify_or_check_input_kind(path, kind_hint)#

Function to identify/guess which kind of input type the current path has if no kind was provided. If a kind_hint is provided, it will verify whether the path actually is of that kind.

Parameters:
  • path (PathOptionsType) – Path to a directory that is checked for whether it can be read by the available input readers.

  • kind_hint (str | None) – If set, the input format specified by the user. Only that reader’s result will be used eventually.

Raises:
  • FileNotFoundError – If the path is not valid.

  • ValueError – If the specified reader for kind_hint does not confirm validity of the directory.

  • ValueError – If multiple readers match and no kind_hint was provided.

Returns:

The FormatInformation returned by the only successful check or None if no reader matched.

Return type:

FormatInformation | None
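The decision logic described for identify_or_check_input_kind can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: the checker callables, the format names format_a and format_b and the dict used as format information are all hypothetical stand-ins, and the FileNotFoundError case for invalid paths is omitted.

```python
from typing import Callable, Optional

# Hypothetical stand-in: each checker returns format info, or None if the
# path is not readable by that format.
Checker = Callable[[str], Optional[dict]]


def identify_kind(
    path: str,
    checkers: dict[str, Checker],
    kind_hint: Optional[str] = None,
) -> Optional[dict]:
    """Return the format info of the single matching checker, if any."""
    if kind_hint is not None:
        # With a hint, only the hinted reader decides.
        info = checkers[kind_hint](path)
        if info is None:
            raise ValueError(f"{path!r} is not a valid {kind_hint!r} input")
        return info

    # Without a hint, try every reader and require a unique match.
    matches = []
    for check in checkers.values():
        info = check(path)
        if info is not None:
            matches.append(info)
    if len(matches) > 1:
        raise ValueError(f"multiple formats match {path!r}; provide a kind")
    return matches[0] if matches else None


checkers = {
    "format_a": lambda p: {"kind": "format_a"} if p.endswith(".a") else None,
    "format_b": lambda p: {"kind": "format_b"} if "data" in p else None,
}

print(identify_kind("run.a", checkers))    # {'kind': 'format_a'}
print(identify_kind("run.txt", checkers))  # None
```

A path such as "data.a" would match both hypothetical checkers and raise ValueError unless kind_hint is given.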

class Trajres#
path: pathlib.Path#
misc_error: tuple[Exception, Any] | Iterable[tuple[Exception, Any]] | None#
data: xarray.Dataset | xarray.DataArray | shnitsel.data.dataset_containers.shared.ShnitselDataset | shnitsel.data.xr_io_compatibility.SupportsFromXrConversion | shnitsel.data.tree.node.TreeNode | Sequence[xarray.Dataset | shnitsel.data.dataset_containers.shared.ShnitselDataset | shnitsel.data.xr_io_compatibility.SupportsFromXrConversion | xarray.DataArray] | None#
log_records: list[logging.LogRecord] | None#
_per_traj(trajdir, reader, format_info, base_loading_parameters, expect_dtype=None)#

Internal function to carry out loading of trajectories to allow for parallel processing with a ProcessExecutor.

Parameters:
  • trajdir (pathlib.Path) – The path to read a single trajectory from

  • reader (FormatReader) – The reader instance to use for reading from that directory path.

  • format_info (FormatInformation) – FormatInformation obtained from previous checks of the format.

  • base_loading_parameters (LoadingParameters) – Settings for Loading individual trajectories like initial units and mappings of parameter names to Shnitsel variable names.

  • expect_dtype (type | types.UnionType | None)

Returns:

Either the successfully loaded trajectory in a wrapper, or the wrapper containing error information.

Return type:

Trajres|None
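The pattern of returning either data or error information in a wrapper is what lets a worker process report failures without aborting the whole batch. The sketch below shows that general pattern with concurrent.futures.ProcessPoolExecutor; it is an assumption about the approach, not the actual _per_traj code. LoadResult, load_one, load_all and name_length are hypothetical names, and LoadResult is a simplified stand-in for Trajres.

```python
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass
from functools import partial
from pathlib import Path
from typing import Any, Callable, Optional


@dataclass
class LoadResult:
    """Simplified result wrapper: either data or the captured error is set."""

    path: Path
    data: Any = None
    error: Optional[Exception] = None


def load_one(loader: Callable[[Path], Any], path: Path) -> LoadResult:
    """Run `loader` on one path and capture any exception in the wrapper."""
    try:
        return LoadResult(path=path, data=loader(path))
    except Exception as exc:
        return LoadResult(path=path, error=exc)


def load_all(
    paths: list[Path], loader: Callable[[Path], Any], max_workers: int = 2
) -> list[LoadResult]:
    """Load all paths in worker processes; `loader` must be picklable."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(partial(load_one, loader), paths))


def name_length(path: Path) -> int:
    """Toy loader: fails for paths without a final component."""
    if not path.name:
        raise ValueError("path has no final component")
    return len(path.name)


ok = load_one(name_length, Path("TRAJ_00001"))
failed = load_one(name_length, Path("/"))
print(ok.data, type(failed.error).__name__)  # 10 ValueError
```

On platforms that start worker processes by spawning a fresh interpreter, a call to load_all should sit under an `if __name__ == "__main__":` guard, and the loader has to be a module-level function so that it can be pickled.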