niquery.query.querying module¶
- niquery.query.querying.MAX_QUERY_SIZE = 100¶
Maximum page size.
- niquery.query.querying.edges_to_dataframe(edges: list) pandas.DataFrame[source]¶
Convert a list of dataset edges (GraphQL response) into a pandas DataFrame.
Returned values are sorted by the dataset ‘id’.
- Parameters:
  edges (list) – GraphQL edges. Each edge contains a ‘node’ with dataset metadata.
- Returns:
A DataFrame with the relevant dataset information, namely ‘remote’, ‘id’, ‘name’, ‘species’, ‘tag’, ‘dataset_doi’, ‘modalities’, and ‘tasks’.
- Return type:
DataFrame
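The conversion can be sketched as follows. This is a minimal illustration, not the library’s implementation: it assumes each edge’s ‘node’ is a flat mapping that already carries the documented columns, and the sample edges below are invented.

```python
import pandas as pd

# Output columns documented for edges_to_dataframe.
COLUMNS = ["remote", "id", "name", "species", "tag", "dataset_doi", "modalities", "tasks"]

def edges_to_dataframe_sketch(edges: list, remote: str = "https://example.org") -> pd.DataFrame:
    """Flatten GraphQL edges into a DataFrame sorted by dataset 'id'."""
    # Assumption: every edge wraps its payload in a 'node' mapping.
    rows = [{"remote": remote, **edge["node"]} for edge in edges]
    return pd.DataFrame(rows, columns=COLUMNS).sort_values("id").reset_index(drop=True)

# Invented sample edges, deliberately out of order.
edges = [
    {"node": {"id": "ds000002", "name": "B", "species": "human", "tag": "1.0.0",
              "dataset_doi": "10.0/b", "modalities": ["MRI"], "tasks": ["rest"]}},
    {"node": {"id": "ds000001", "name": "A", "species": "human", "tag": "2.0.0",
              "dataset_doi": "10.0/a", "modalities": ["EEG"], "tasks": []}},
]
df = edges_to_dataframe_sketch(edges)
```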
- niquery.query.querying.fetch_page(gql_url: str, after_cursor: str | None = None) dict[source]¶
Fetch a single page of datasets from a remote server via its URL.
The remote server needs to offer a GraphQL API.
- niquery.query.querying.fetch_pages(cursors: list, max_workers: int = 8) list[source]¶
Fetch all dataset pages in parallel using a precomputed list of cursors.
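The parallel fetch can be sketched with a thread pool. The page fetcher is injected as a callable here so the sketch stays network-free (the real function posts to the GraphQL endpoint itself); max_workers mirrors the documented default.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_pages_sketch(fetch_page, cursors: list, max_workers: int = 8) -> list:
    """Fetch every page concurrently; map() preserves the cursor order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, cursors))

# Stand-in fetcher: returns a fake page keyed by its cursor.
pages = fetch_pages_sketch(lambda cursor: {"after": cursor}, [None, "c1", "c2"])
```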
- niquery.query.querying.get_cursors(remote: str) list[source]¶
Serially walk through the entire dataset list from the given remote to collect all pagination cursors.
This function starts from the beginning and keeps fetching pages until the last one, recording the ‘endCursor’ of each page to enable parallel fetching later.
The remote server needs to offer a GraphQL API.
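The cursor walk can be sketched as below. The fetcher is injected to keep the sketch offline, and the response shape (data → datasets → pageInfo with hasNextPage/endCursor) follows the standard GraphQL connections convention; the real server’s field names may differ.

```python
def get_cursors_sketch(fetch_page) -> list:
    """Serially record the cursor that starts each page; None starts the first."""
    cursors, after = [None], None
    while True:
        info = fetch_page(after)["data"]["datasets"]["pageInfo"]
        if not info["hasNextPage"]:
            break  # the last page's endCursor points past the data
        after = info["endCursor"]
        cursors.append(after)
    return cursors

# Fake three-page remote, keyed by the 'after' cursor.
pages = {
    None: {"data": {"datasets": {"pageInfo": {"hasNextPage": True, "endCursor": "c1"}}}},
    "c1": {"data": {"datasets": {"pageInfo": {"hasNextPage": True, "endCursor": "c2"}}}},
    "c2": {"data": {"datasets": {"pageInfo": {"hasNextPage": False, "endCursor": "c3"}}}},
}
cursors = get_cursors_sketch(lambda after: pages[after])
```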
- niquery.query.querying.post_with_retry(url: str, headers: dict, payload: dict, retries: int = 5, backoff: float = 1.5) Response | None[source]¶
Post an HTTP request, retrying on failure.
If the request is unsuccessful, retry up to retries times, waiting an exponential backoff delay computed as \(backoff^{attempt}\) between attempts.
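The retry policy can be sketched generically. The HTTP call and the sleep are injected so the backoff schedule is visible without a network or real waits; this is an illustration of the documented \(backoff^{attempt}\) schedule, not the library’s actual code.

```python
import time

def post_with_retry_sketch(do_post, retries: int = 5, backoff: float = 1.5, sleep=time.sleep):
    """Call do_post(); on failure wait backoff**attempt and retry, up to `retries` times."""
    for attempt in range(retries):
        try:
            return do_post()
        except Exception:
            if attempt == retries - 1:
                return None  # all attempts exhausted
            sleep(backoff ** attempt)

# Stand-in request that fails twice, then succeeds; record the waits instead of sleeping.
calls, waits = {"n": 0}, []

def flaky_post():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "response"

result = post_with_retry_sketch(flaky_post, sleep=waits.append)
```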
- niquery.query.querying.query_dataset_files(gql_url: str, dataset_id: str, snapshot_tag: str) list[source]¶
Retrieve all files for a given dataset snapshot.
Given a GraphQL URL, a dataset ID, and a snapshot tag, this function recursively queries all files in the corresponding snapshot. If the snapshot tag is missing or the request fails, an empty list is returned.
- Parameters:
  - gql_url (str) – GraphQL URL to query data from.
  - dataset_id (str) – The dataset ID (e.g., ‘ds000001’).
  - snapshot_tag (str) – The tag of the snapshot to query (e.g., ‘1.0.0’).
- Returns:
List of files containing their metadata dictionaries, each including the fields ‘id’, ‘filename’, ‘size’, ‘directory’, ‘annexed’, ‘key’, ‘urls’, and ‘fullpath’.
- Return type:
  list
Notes
If ‘tag’ is missing or marked as NA, no files are returned. Errors during querying are caught and logged, returning an empty list.
- niquery.query.querying.query_datasets(df: pandas.DataFrame, max_workers: int = 8) tuple[source]¶
Perform file queries over a DataFrame of datasets.
- niquery.query.querying.query_snapshot_files(gql_url: str, dataset_id: str, snapshot_tag: str, tree: str | None = None) list[source]¶
Query the list of files at a specific level of a dataset snapshot.
- Parameters:
  - gql_url (str) – GraphQL URL to query data from.
  - dataset_id (str) – The dataset ID (e.g., ‘ds000001’).
  - snapshot_tag (str) – The tag of the snapshot to query (e.g., ‘1.0.0’).
  - tree (str, optional) – ID of a directory within the snapshot tree to query; use None to start at the root.
- Returns:
  List of dicts.
Each dict represents a file or directory with fields ‘id’, ‘filename’, ‘size’, ‘directory’, ‘annexed’, ‘key’, and ‘urls’.
- Return type:
  list
- niquery.query.querying.query_snapshot_tree(gql_url: str, dataset_id: str, snapshot_tag: str, tree: str | None = None, parent_path='') list[source]¶
Recursively query all files in a dataset snapshot.
- Parameters:
  - gql_url (str) – GraphQL URL to query data from.
  - dataset_id (str) – The dataset ID (e.g., ‘ds000001’).
  - snapshot_tag (str) – The tag of the snapshot to query (e.g., ‘1.0.0’).
  - tree (str, optional) – ID of a directory within the snapshot tree to query; use None to start at the root.
  - parent_path (str, optional) – Relative path used to construct full file paths (used during recursion).
- Returns:
all_files – List of all file entries (not directories), each including a ‘fullpath’ key that shows the complete path from the root.
- Return type:
  list
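The recursive walk can be sketched as below, with the per-level file listing injected as a callable (the real function queries the GraphQL endpoint at each level). The two-level fake tree is invented for illustration.

```python
import posixpath

def query_snapshot_tree_sketch(list_level, tree=None, parent_path=""):
    """Recurse through directories, attaching a 'fullpath' to every file entry."""
    all_files = []
    for entry in list_level(tree):
        fullpath = posixpath.join(parent_path, entry["filename"])
        if entry["directory"]:
            # Descend using the directory's id as the next 'tree' cursor.
            all_files += query_snapshot_tree_sketch(list_level, entry["id"], fullpath)
        else:
            all_files.append({**entry, "fullpath": fullpath})
    return all_files

# Invented two-level snapshot: a root file plus a 'sub' directory.
levels = {
    None: [{"id": "f1", "filename": "README", "directory": False},
           {"id": "d1", "filename": "sub", "directory": True}],
    "d1": [{"id": "f2", "filename": "data.tsv", "directory": False}],
}
files = query_snapshot_tree_sketch(levels.get)
```

Note that directories themselves are not emitted; only files reach the result, each with its root-relative ‘fullpath’.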