niquery.analysis.filtering module

niquery.analysis.filtering.filter_modality_datasets(df: pandas.DataFrame, modality: str | list) → pandas.Series

Filter non-relevant modality data records.

Filters out datasets whose ‘modalities’ field does not contain any of the items in modality.

Parameters:

df (DataFrame) – Dataset records.
modality (str or list) – Modalities to consider (case-insensitive).

Returns:

Mask of relevant datasets.

Return type:

Series
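
To illustrate the masking behavior described above, here is a minimal sketch that does not call niquery itself. It assumes that each entry of the ‘modalities’ column is a list of strings; the helper name modality_mask and the sample records are hypothetical.

```python
import pandas as pd


def modality_mask(df: pd.DataFrame, modality) -> pd.Series:
    """Boolean mask: True where 'modalities' contains any requested item."""
    # Accept a single string or a list, and compare case-insensitively.
    if isinstance(modality, str):
        wanted = {modality.lower()}
    else:
        wanted = {m.lower() for m in modality}
    return df["modalities"].apply(
        lambda mods: any(m.lower() in wanted for m in mods)
    )


# Hypothetical dataset records (assumed layout, not niquery's actual schema).
records = pd.DataFrame(
    {
        "id": ["ds01", "ds02", "ds03"],
        "modalities": [["MRI", "bold"], ["eeg"], ["BOLD", "dwi"]],
    }
)

mask = modality_mask(records, "bold")
relevant = records[mask]
```

Because the result is a mask rather than a filtered frame, it can be combined with other masks before indexing the DataFrame.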

niquery.analysis.filtering.filter_modality_records(fname: str, sep: str, suffix: str | list) → pandas.DataFrame

Keep records where the filename matches the provided modality naming convention.

Following the BIDS modality suffix convention, keeps records where the ‘filename’ attribute ends with the given suffix, i.e. ‘_{suffix}.nii.gz’.

Parameters:

fname (str) – Filename. A delimiter-separated file containing the list of records to be inspected.
sep (str) – Separator.
suffix (str or list) – Suffix of the relevant files.

Returns:

Modality file records.

Return type:

DataFrame
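
The ‘_{suffix}.nii.gz’ naming check described above can be sketched with plain string operations. This is only an illustration of the convention, not the library's implementation; the helper name matches_suffix and the example paths are hypothetical.

```python
def matches_suffix(filename: str, suffix) -> bool:
    """Return True if filename ends with '_{suffix}.nii.gz' for any given suffix."""
    # Accept a single suffix or a list of suffixes.
    suffixes = [suffix] if isinstance(suffix, str) else list(suffix)
    endings = tuple(f"_{s}.nii.gz" for s in suffixes)
    # str.endswith accepts a tuple and returns True if any ending matches.
    return filename.endswith(endings)


# Hypothetical BIDS-style file names.
filenames = [
    "sub-01/func/sub-01_task-rest_bold.nii.gz",
    "sub-01/anat/sub-01_T1w.nii.gz",
    "sub-01/func/sub-01_task-rest_bold.json",
]

kept = [f for f in filenames if matches_suffix(f, "bold")]
```

Note that the JSON sidecar is rejected even though it carries the same suffix, because the check includes the ‘.nii.gz’ extension.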

niquery.analysis.filtering.filter_nonrelevant_datasets(df: pandas.DataFrame, species: str | list, modality: str | list) → pandas.DataFrame

Filter non-relevant data records.

Return datasets that belong to the provided species and modality.

Parameters:

df (DataFrame) – Dataset records.
species (str or list) – Species to consider (case-insensitive).
modality (str or list) – Modalities to consider (case-insensitive).

Returns:

Relevant dataset records.

Return type:

DataFrame
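
A dataset is relevant only when both the species and the modality conditions hold, which can be expressed as the logical AND of two boolean masks. The sketch below assumes that the ‘species’ and ‘modalities’ columns both hold lists of strings; the helper contains_any and the sample records are hypothetical.

```python
import pandas as pd


def contains_any(column: pd.Series, items) -> pd.Series:
    """Mask rows whose list-valued entry shares an item with items (case-insensitive)."""
    wanted = {items.lower()} if isinstance(items, str) else {i.lower() for i in items}
    return column.apply(lambda values: any(v.lower() in wanted for v in values))


# Hypothetical dataset records (assumed layout).
records = pd.DataFrame(
    {
        "id": ["ds01", "ds02", "ds03"],
        "species": [["Human"], ["mouse"], ["human"]],
        "modalities": [["bold"], ["bold"], ["eeg"]],
    }
)

# Keep a dataset only if it passes both the species and the modality check.
keep = contains_any(records["species"], "human") & contains_any(
    records["modalities"], ["BOLD"]
)
relevant = records[keep]
```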

niquery.analysis.filtering.filter_on_run_contribution(df: pandas.DataFrame, contrib_thr: int, seed: int) → pandas.DataFrame

Filter BOLD runs of datasets to keep their total contribution under a threshold.

Randomly picks BOLD runs of a dataset if the total number of runs exceeds the given threshold.

Parameters:

df (DataFrame) – BOLD run information.
contrib_thr (int) – Contribution threshold in terms of number of runs.
seed (int) – Random seed value.

Returns:

Filtered BOLD runs.

Return type:

DataFrame
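
The random selection described above can be sketched with a seeded per-dataset sample. This is an illustration under stated assumptions, not the library's code: the ‘dataset’ column name is hypothetical, and the sketch assumes that a dataset exceeding the threshold is reduced to exactly contrib_thr runs.

```python
import pandas as pd


def cap_runs_per_dataset(df: pd.DataFrame, contrib_thr: int, seed: int) -> pd.DataFrame:
    """Keep at most contrib_thr randomly chosen runs per dataset."""
    kept = []
    for _, runs in df.groupby("dataset", sort=False):
        if len(runs) > contrib_thr:
            # A fixed random_state makes the selection reproducible.
            runs = runs.sample(n=contrib_thr, random_state=seed)
        kept.append(runs)
    return pd.concat(kept)


# Hypothetical run table: ds01 has five runs, ds02 has two.
runs = pd.DataFrame(
    {
        "dataset": ["ds01"] * 5 + ["ds02"] * 2,
        "run": [f"run-{i}" for i in range(1, 8)],
    }
)

capped = cap_runs_per_dataset(runs, contrib_thr=3, seed=1234)
```

Datasets at or below the threshold pass through unchanged, and calling the function again with the same seed yields the same selection.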

niquery.analysis.filtering.filter_on_timepoint_count(df: pandas.DataFrame, min_timepoints: int, max_timepoints: int) → pandas.DataFrame

Filter BOLD runs of datasets that are below or above a given number of timepoints.

Filters out BOLD runs whose timepoint count is not within the range [min_timepoints, max_timepoints].

Parameters:

df (DataFrame) – BOLD run information.
min_timepoints (int) – Minimum number of time points.
max_timepoints (int) – Maximum number of time points.

Returns:

Filtered BOLD runs.

Return type:

DataFrame
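
The closed-interval check [min_timepoints, max_timepoints] maps naturally onto pandas' Series.between, which is inclusive on both ends by default. In this sketch the ‘timepoints’ column name and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical run table with an assumed 'timepoints' column.
runs = pd.DataFrame(
    {
        "run": ["a", "b", "c", "d"],
        "timepoints": [80, 300, 1200, 5000],
    }
)

min_timepoints, max_timepoints = 300, 1200

# Both bounds are included, matching the [min, max] notation.
in_range = runs["timepoints"].between(min_timepoints, max_timepoints)
filtered = runs[in_range]
```

Runs with exactly min_timepoints or max_timepoints time points are kept, since the interval is closed.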

niquery.analysis.filtering.filter_runs(df: pandas.DataFrame, contrib_thr: int, min_timepoints: int, max_timepoints: int, seed: int) → pandas.DataFrame

Filter BOLD runs based on run count and timepoint criteria.

Filters the BOLD runs to include only those that fulfil:

Criterion 1: the number of runs for a given dataset is below the threshold contrib_thr.

Criterion 2: the number of timepoints per BOLD run is between [min_timepoints, max_timepoints].

Parameters:

df (DataFrame) – BOLD run information.
contrib_thr (int) – Contribution threshold in terms of number of runs.
min_timepoints (int) – Minimum number of time points.
max_timepoints (int) – Maximum number of time points.
seed (int) – Random seed value.

Returns:

Filtered BOLD runs.

Return type:

DataFrame

niquery.analysis.filtering.filter_species_datasets(df: pandas.DataFrame, species: str | list) → pandas.Series

Filter non-relevant species data records.

Filters out datasets whose ‘species’ field does not contain any of the items in species.

Parameters:

df (DataFrame) – Dataset records.
species (str or list) – Species to consider (case-insensitive).

Returns:

Mask of relevant datasets.

Return type:

Series

niquery.analysis.filtering.identify_modality_files(datasets: dict, sep: str, suffix: str | list, max_workers: int = 8) → dict

Identify dataset files having a particular suffix.

For each dataset, and following the BIDS modality suffix convention, keeps records where the ‘filename’ attribute ends with ‘_{suffix}.nii.gz’.

Parameters:

datasets (dict) – Dataset file information. Contains the dataset IDs and the corresponding delimiter-separated files listing the records to be inspected.
sep (str) – Separator.
suffix (str or list) – Suffix of the relevant files.
max_workers (int, optional) – Maximum number of parallel threads to use.

Returns:

results – Dictionary of dataset modality-specific file records.

Return type:

dict

See also

filter_modality_records
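
The per-dataset fan-out described above can be sketched with a thread pool from the standard library. This is not niquery's implementation: it assumes that datasets maps each dataset ID to the path of its delimiter-separated file, that each file has a ‘filename’ column, and it uses the hypothetical helpers read_matching and scan_datasets.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_matching(fname: str, sep: str, suffix: str) -> list:
    """Return the rows whose 'filename' ends with '_{suffix}.nii.gz'."""
    with open(fname, newline="") as f:
        rows = list(csv.DictReader(f, delimiter=sep))
    return [r for r in rows if r["filename"].endswith(f"_{suffix}.nii.gz")]


def scan_datasets(datasets: dict, sep: str, suffix: str, max_workers: int = 8) -> dict:
    """Inspect every dataset file in parallel and collect results by dataset ID."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            ds_id: pool.submit(read_matching, fname, sep, suffix)
            for ds_id, fname in datasets.items()
        }
        return {ds_id: fut.result() for ds_id, fut in futures.items()}


# Build two small hypothetical TSV listings and scan them.
with tempfile.TemporaryDirectory() as tmp:
    listings = {
        "ds01": ["sub-01_bold.nii.gz", "sub-01_T1w.nii.gz"],
        "ds02": ["sub-02_dwi.nii.gz"],
    }
    datasets = {}
    for ds_id, names in listings.items():
        path = Path(tmp) / f"{ds_id}.tsv"
        path.write_text("filename\n" + "\n".join(names) + "\n")
        datasets[ds_id] = str(path)
    results = scan_datasets(datasets, sep="\t", suffix="bold")
```

Threads suit this kind of task because reading listing files is I/O-bound; a dataset with no matching files still appears in the result with an empty list.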

niquery.analysis.filtering.identify_relevant_runs(df: pandas.DataFrame, contrib_thr: int, min_timepoints: int, max_timepoints: int, seed: int) → pandas.DataFrame

Identify relevant BOLD runs in terms of run and timepoint count constraints.

Identifies the BOLD runs that fulfill the following criteria:

Criterion 1: the number of runs for a given dataset is below the threshold contrib_thr.

Criterion 2: the number of timepoints per BOLD run is between [min_timepoints, max_timepoints].

Runs are shuffled before the filtering process.

Parameters:

df (DataFrame) – BOLD run information.
contrib_thr (int) – Contribution threshold in terms of the number of runs a dataset can contribute to the total number of runs.
min_timepoints (int) – Minimum number of time points.
max_timepoints (int) – Maximum number of time points.
seed (int) – Random seed value.

Returns:

Identified relevant BOLD runs.

Return type:

DataFrame

See also

filter_runs