es_sfgtools.workflows.preprocess_ingest.data_handler module
Contains the DataHandler class for handling data operations.
- class es_sfgtools.workflows.preprocess_ingest.data_handler.DataHandler(directory: Path | str)
Bases: WorkflowABC
Handles data operations including searching, adding, downloading, and processing.
- add_data_remote(remote_filepaths: List[str], remote_type: REMOTE_TYPE | str = REMOTE_TYPE.HTTP) None
Adds remote data files to the catalog.
- Parameters:
remote_filepaths (list of str) – A list of remote file paths.
remote_type (REMOTE_TYPE or str, default REMOTE_TYPE.HTTP) – The type of the remote storage.
- Raises:
ValueError – If the specified remote type is not recognized.
- add_data_to_catalog(local_filepaths: List[Path])
Adds a list of local files to the data catalog.
- Parameters:
local_filepaths (list of Path) – A list of paths to the files to add.
- discover_data_and_add_files(directory_path: Path) None
Scans a directory for data files and adds them to the catalog.
- Parameters:
directory_path (Path) – The path to the directory to scan.
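A directory scan of this kind can be sketched with `pathlib`. This is a minimal illustration and not the library's implementation: the helper name `find_data_files`, the `suffixes` parameter, and the choice to recurse into subdirectories are all assumptions, since the text above does not say which files count as data files or whether the scan is recursive.

```python
from pathlib import Path
from typing import Iterable, List


def find_data_files(directory_path: Path, suffixes: Iterable[str]) -> List[Path]:
    """Hypothetical helper: collect files under directory_path by suffix.

    The suffix filter and the recursive walk are assumptions made for
    illustration; DataHandler may select files differently.
    """
    directory_path = Path(directory_path)
    if not directory_path.is_dir():
        raise NotADirectoryError(f"{directory_path} is not a directory")
    # Compare suffixes case-insensitively so ".RAW" and ".raw" both match.
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in directory_path.rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )
```

The returned list of `Path` objects has the same shape as the `local_filepaths` argument that `add_data_to_catalog` accepts.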
- download_HTTP_files(http_assets: List[AssetEntry], file_type: AssetType | None = None)
Downloads files from an HTTP server and updates the catalog.
- Parameters:
http_assets (list of AssetEntry) – A list of HTTP assets to download.
file_type (AssetType, optional) – The type of file being downloaded.
- download_data(file_types: List[AssetType] | List[str] | str = [DOWNLOAD_TYPES.SONARDYNE, DOWNLOAD_TYPES.NOVATEL, DOWNLOAD_TYPES.NOVATEL000, DOWNLOAD_TYPES.NOVATEL770, DOWNLOAD_TYPES.DFPO00, DOWNLOAD_TYPES.CTD, DOWNLOAD_TYPES.SEABIRD], override: bool = False)
Downloads files of specified types from remote storage.
- Parameters:
file_types (list of AssetType, list of str, or str, default DEFAULT_FILE_TYPES_TO_DOWNLOAD) – The types of files to download.
override (bool, default False) – If True, re-downloads files even if they already exist locally.
- Raises:
ValueError – If a specified file type is not recognized.
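Because `file_types` accepts a single string, a list of strings, or a list of enum members, the argument has to be normalized before use. The sketch below shows one way to do that and to raise the documented `ValueError`. It assumes a stand-in enum: `FileType`, its three members, their string values, and `normalize_file_types` are hypothetical and are not taken from es_sfgtools.

```python
from enum import Enum
from typing import List, Union


class FileType(Enum):
    """Stand-in for AssetType; members and values are assumptions."""
    SONARDYNE = "sonardyne"
    NOVATEL = "novatel"
    CTD = "ctd"


def normalize_file_types(
    file_types: Union[FileType, str, List[Union[FileType, str]]]
) -> List[FileType]:
    """Hypothetical helper: coerce the argument to a list of FileType."""
    # A bare string or enum member is treated as a one-element list.
    if isinstance(file_types, (str, FileType)):
        file_types = [file_types]
    normalized = []
    for item in file_types:
        if isinstance(item, FileType):
            normalized.append(item)
            continue
        try:
            # Match string input against enum values, ignoring case.
            normalized.append(FileType(str(item).lower()))
        except ValueError:
            raise ValueError(f"File type {item!r} is not recognized") from None
    return normalized
```

Wrapping a bare string before iterating matters here: iterating over `"ctd"` directly would yield single characters rather than one file type.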
- geolab_get_s3(overwrite: bool = False)
Synchronize seafloor geodesy data from S3 storage to local GeoLab environment.
This method downloads and synchronizes data files from AWS S3 to the local GeoLab environment for the currently selected network and station. It handles both metadata files and campaign data, creating the necessary local directory structure and maintaining catalog consistency.
The synchronization process:
1. Validates GeoLab environment and S3 bucket configuration
2. Loads or creates an S3 directory catalog
3. Downloads station metadata files from S3 to local storage
4. Downloads campaign data files from S3 to local storage
5. Updates local and remote directory catalogs
- Parameters:
overwrite (bool, optional) – If True, re-downloads files even if they already exist locally. If False, only downloads missing files. Defaults to False.
- Raises:
AssertionError – If not running in GEOLAB environment
ValueError – If S3 bucket configuration is missing or invalid
Note
Only processes data for the currently set network and station context
Requires valid AWS credentials and S3 bucket access
Creates local directory structure to match S3 organization
Maintains both local and remote directory catalogs for consistency
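The `overwrite` behavior and the mirrored directory layout described above can be sketched without touching S3. The example only plans which files would be fetched; it performs no download and does not reflect the method's actual code. The name `plan_downloads` and the representation of remote objects as `/`-separated key strings are assumptions.

```python
from pathlib import Path, PurePosixPath
from typing import Iterable, List, Tuple


def plan_downloads(
    remote_keys: Iterable[str],
    local_root: Path,
    overwrite: bool = False,
) -> List[Tuple[str, Path]]:
    """Hypothetical helper: pair each remote key with its local target.

    Keys are assumed to be "/"-separated object keys. Files that already
    exist locally are skipped unless overwrite is True.
    """
    plan = []
    for key in remote_keys:
        # Mirror the remote layout under local_root, one folder per key part.
        parts = PurePosixPath(key.lstrip("/")).parts
        local_path = Path(local_root).joinpath(*parts)
        if overwrite or not local_path.exists():
            plan.append((key, local_path))
    return plan
```

A caller would then create each target's parent directory (for example with `local_path.parent.mkdir(parents=True, exist_ok=True)`) before writing the downloaded file.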
- get_dtype_counts()
Retrieves the counts of different data types for the current operational context.
- Returns:
A dictionary mapping data types to their counts.
- Return type:
dict of {str: int}
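The return shape can be illustrated with `collections.Counter`. This is only a sketch of the mapping, assuming the catalog can be reduced to one type label per entry; `count_dtypes` and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def count_dtypes(asset_types: Iterable[str]) -> Dict[str, int]:
    """Hypothetical helper: count how often each type label occurs."""
    return dict(Counter(asset_types))


# Sample labels are made up for illustration.
counts = count_dtypes(["ctd", "novatel", "ctd"])
print(counts)  # {'ctd': 2, 'novatel': 1}
```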
- get_site_metadata(site_metadata: Site | Path | None = None) Site | None
Loads or validates site metadata for the current station.
- mid_process_workflow: bool = False
- set_network_station_campaign(network_id: str, station_id: str, campaign_id: str)
Changes the operational context to a specific network, station, and campaign.
Overrides the parent method to add DataHandler-specific setup including TileDB array initialization and logging configuration.
- Parameters:
network_id (str) – The network identifier.
station_id (str) – The station identifier.
campaign_id (str) – The campaign identifier.
- set_network_station_campaign_with_metadata(network_id: str, station_id: str, campaign_id: str, site_metadata: Site | Path | str | None = None)
Changes the operational context and loads specific site metadata.
This method extends set_network_station_campaign() by allowing custom site metadata to be loaded for the station context.
- Parameters:
network_id (str) – The network identifier.
station_id (str) – The station identifier.
campaign_id (str) – The campaign identifier.
site_metadata (Site, Path, str, optional) – Optional site metadata. If not provided, it will be loaded if available.
- update_catalog_from_archive() None
Updates the catalog with remote file paths from the data archive.