niquery.query.querying module

niquery.query.querying.MAX_QUERY_SIZE = 100

Maximum page size (number of records requested per GraphQL page).

niquery.query.querying.edges_to_dataframe(edges: list) → pandas.DataFrame

Convert a list of dataset edges (GraphQL response) into a pandas DataFrame.

The returned rows are sorted by dataset ‘id’.

Parameters:

edges (list) – GraphQL edges. Each edge contains a ‘node’ with dataset metadata.

Returns:

A DataFrame with the relevant dataset information, namely ‘remote’, ‘id’, ‘name’, ‘species’, ‘tag’, ‘dataset_doi’, ‘modalities’, and ‘tasks’.

Return type:

DataFrame
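
A usage sketch, pairing it with fetch_page (documented next); the endpoint URL is an assumption:

   from niquery.query.querying import edges_to_dataframe, fetch_page

   page = fetch_page("https://openneuro.org/crn/graphql")  # hypothetical endpoint
   df = edges_to_dataframe(page["edges"])
   print(df[["id", "name", "tag"]].head())  # rows arrive sorted by 'id'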

niquery.query.querying.fetch_page(gql_url: str, after_cursor: str | None = None) → dict

Fetch a single page of datasets from a remote server via its URL.

The remote server needs to offer a GraphQL API.

Parameters:
  • gql_url (str) – GraphQL URL to fetch data from.

  • after_cursor (str, optional) – The pagination cursor indicating where to start. If None, fetches the first page.

Returns:

Dictionary with keys ‘edges’ (list of datasets) and ‘pageInfo’ (pagination metadata).

Return type:

dict
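
A sketch of serial pagination with fetch_page; the endpoint URL is an assumption, and the ‘hasNextPage’/‘endCursor’ keys inside ‘pageInfo’ follow the standard GraphQL Relay convention (‘endCursor’ is the field get_cursors records):

   from niquery.query.querying import fetch_page

   GQL_URL = "https://openneuro.org/crn/graphql"  # hypothetical endpoint

   cursor, all_edges = None, []
   while True:
       page = fetch_page(GQL_URL, after_cursor=cursor)
       all_edges.extend(page["edges"])
       info = page["pageInfo"]
       if not info.get("hasNextPage"):  # Relay-style flag (assumed)
           break
       cursor = info["endCursor"]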

niquery.query.querying.fetch_pages(cursors: list, max_workers: int = 8) → list

Fetch all dataset pages in parallel using a precomputed list of cursors.

Parameters:
  • cursors (list) – List of (remote server name, cursor) tuples.

  • max_workers (int, optional) – Maximum number of parallel threads to use.

Returns:

results – List of datasets collected from all fetched pages.

Return type:

list
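
A minimal sketch showing the expected (remote name, cursor) tuple shape; both values are placeholders, and in practice the list comes from get_cursors (documented next):

   from niquery.query.querying import fetch_pages

   cursors = [
       ("openneuro", None),                # first page of a hypothetical remote
       ("openneuro", "Y3Vyc29yOjEwMA=="),  # placeholder page marker
   ]
   datasets = fetch_pages(cursors, max_workers=4)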

niquery.query.querying.get_cursors(remote: str) → list

Serially walk through the entire dataset list from the given remote to collect all pagination cursors.

This function starts from the beginning and keeps fetching pages until the last one, recording the ‘endCursor’ of each page to enable parallel fetching later.

The remote server needs to offer a GraphQL API.

Parameters:

remote (str) – Name of the remote to fetch data from.

Returns:

cursors – List of (remote, cursor) tuples, where the first cursor is None (start of the list) and the rest are page markers returned by GraphQL.

Return type:

list
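
get_cursors and fetch_pages are designed to be chained: walk the list once serially to record cursors, then fetch all pages in parallel. A sketch, assuming ‘openneuro’ is a known remote name and that the fetched datasets are edges in the shape edges_to_dataframe expects:

   from niquery.query.querying import edges_to_dataframe, fetch_pages, get_cursors

   cursors = get_cursors("openneuro")           # serial walk, cursors only
   edges = fetch_pages(cursors, max_workers=8)  # pages fetched in parallel
   df = edges_to_dataframe(edges)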

niquery.query.querying.post_with_retry(url: str, headers: dict, payload: dict, retries: int = 5, backoff: float = 1.5) → Response | None

Post an HTTP request, retrying on failure.

If the request is unsuccessful, retry up to retries times, waiting backoff**attempt seconds before each retry (exponential backoff).

Parameters:
  • url (str) – URL to post to.

  • headers (dict) – HTTP headers.

  • payload (dict) – HTTP payload.

  • retries (int, optional) – Number of retry attempts.

  • backoff (float, optional) – Base of the exponential retry delay, in seconds.

Returns:

The request response, or None if all attempts failed.

Return type:

requests.Response or None
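
A usage sketch; the endpoint URL is an assumption and the payload is a trivial GraphQL probe:

   from niquery.query.querying import post_with_retry

   response = post_with_retry(
       "https://openneuro.org/crn/graphql",  # hypothetical endpoint
       headers={"Content-Type": "application/json"},
       payload={"query": "{ __typename }"},  # minimal GraphQL query
       retries=3,
       backoff=2.0,  # waits 2.0**attempt seconds between attempts
   )
   if response is None:
       print("all attempts failed")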

niquery.query.querying.query_dataset_files(gql_url: str, dataset_id: str, snapshot_tag: str) → list

Retrieve all files for a given dataset snapshot.

Given a dataset ID and snapshot tag (typically taken from a row of the dataset DataFrame), this function recursively queries all files in the corresponding snapshot. If the snapshot tag is missing or the request fails, an empty list is returned.

Parameters:
  • gql_url (str) – GraphQL URL to query data from.

  • dataset_id (str) – Dataset ID (e.g., ‘ds000001’).

  • snapshot_tag (str) – Snapshot tag (e.g., ‘1.0.0’).

Returns:

List of file metadata dictionaries, each including the fields ‘id’, ‘filename’, ‘size’, ‘directory’, ‘annexed’, ‘key’, ‘urls’, and ‘fullpath’.

Return type:

list

Notes

  • If ‘tag’ is missing or marked as NA, no files are returned.

  • Errors during querying are caught and logged, returning an empty list.
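
A usage sketch; the endpoint URL is an assumption, and the dataset ID and tag reuse the examples from the parameter list:

   from niquery.query.querying import query_dataset_files

   gql_url = "https://openneuro.org/crn/graphql"  # hypothetical endpoint
   files = query_dataset_files(gql_url, "ds000001", "1.0.0")
   for entry in files:  # empty if the tag was missing or the query failed
       print(entry["fullpath"], entry["size"])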

niquery.query.querying.query_datasets(df: pandas.DataFrame, max_workers: int = 8) → tuple

Perform file queries over a DataFrame of datasets.

Parameters:
  • df (DataFrame) – Dataset records.

  • max_workers (int, optional) – Maximum number of parallel threads to use.

Returns:

A mapping from dataset ID to its list of file metadata dictionaries, and a list of (dataset ID, snapshot tag) pairs that failed.

Return type:

tuple
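
A sketch, assuming ‘openneuro’ is a known remote name and that the full pipeline above produces the DataFrame this function expects:

   from niquery.query.querying import (
       edges_to_dataframe, fetch_pages, get_cursors, query_datasets)

   df = edges_to_dataframe(fetch_pages(get_cursors("openneuro")))
   files_by_dataset, failures = query_datasets(df, max_workers=8)
   print(f"{len(files_by_dataset)} datasets resolved, {len(failures)} failed")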

niquery.query.querying.query_snapshot_files(gql_url: str, dataset_id: str, snapshot_tag: str, tree: str | None = None) → list

Query the list of files at a specific level of a dataset snapshot.

Parameters:
  • gql_url (str) – GraphQL URL to query data from.

  • dataset_id (str) – The dataset ID (e.g., ‘ds000001’).

  • snapshot_tag (str) – The tag of the snapshot to query (e.g., ‘1.0.0’).

  • tree (str, optional) – ID of a directory within the snapshot tree to query; use None to start at the root.

Returns:

A list of dictionaries, each representing a file or directory with the fields ‘id’, ‘filename’, ‘size’, ‘directory’, ‘annexed’, ‘key’, and ‘urls’.

Return type:

list
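
A sketch listing only the snapshot root (tree=None) and splitting entries on the documented ‘directory’ flag; the endpoint URL is an assumption:

   from niquery.query.querying import query_snapshot_files

   gql_url = "https://openneuro.org/crn/graphql"  # hypothetical endpoint
   entries = query_snapshot_files(gql_url, "ds000001", "1.0.0", tree=None)
   subdirs = [e for e in entries if e["directory"]]
   files = [e for e in entries if not e["directory"]]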

niquery.query.querying.query_snapshot_tree(gql_url: str, dataset_id: str, snapshot_tag: str, tree: str | None = None, parent_path='') → list

Recursively query all files in a dataset snapshot.

Parameters:
  • gql_url (str) – GraphQL URL to query data from.

  • dataset_id (str) – The dataset ID (e.g., ‘ds000001’).

  • snapshot_tag (str) – The tag of the snapshot to query (e.g., ‘1.0.0’).

  • tree (str, optional) – ID of a directory within the snapshot tree to query; use None to start at the root.

  • parent_path (str, optional) – Relative path used to construct full file paths (used during recursion).

Returns:

all_files – List of all file entries (not directories), each including a ‘fullpath’ key that shows the complete path from the root.

Return type:

list
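
A sketch of a full recursive listing; the endpoint URL is an assumption:

   from niquery.query.querying import query_snapshot_tree

   gql_url = "https://openneuro.org/crn/graphql"  # hypothetical endpoint
   all_files = query_snapshot_tree(gql_url, "ds000001", "1.0.0")
   paths = sorted(f["fullpath"] for f in all_files)  # files only; directories are recursed into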