BioThings v0.12.5 (#349)

* test add volume feature

* added in named volume option

* fixed parsing in named volume option

* needs to be tested

* remove named volume when the keep_container flag is true in the docker dumper

* added option to create multiple named volumes

* changed to a list

* fix: moved jmespath after other transformations

* add sqlite except for duplicate record _id

* revert to pymongo 4.6.3 due to pymongo logs leaking into biothings hub logs

* rolling back pymongo version to see errors

* hide pymongo logs

Unnecessary pymongo logs were showing up at the hub level after the pymongo 4.7 update.

https://pymongo.readthedocs.io/en/latest/changelog.html
Added support for Python’s native logging library, enabling developers to customize the verbosity of log messages for their applications.
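
For reference, one minimal way to silence these messages from the hub side using only the standard logging library (the "pymongo" logger name comes from the pymongo 4.7 changelog; the chosen level is an illustrative choice, not something the hub mandates):

import logging

# Raise the threshold of pymongo's logger so its DEBUG/INFO records
# no longer propagate into the hub's log output.
logging.getLogger("pymongo").setLevel(logging.WARNING)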

* Add metadata query field validation (#330)

* Change the constructor for the ESQueryBuilder ...

Pass the actual BiothingsESMetadata object rather than a property
of that object into the constructor for the ESQueryBuilder

<main change>
def __init__(
    ...
    metadata: BiothingsMetadata = None, <- used to be BiothingsMetadata.biothings_metadata
)

* Add additional logging to the QStringParser.parse method

* Make logical changes to ensure parity with 0.12.x branch

* Add initial attempt at building metadata fields

* Remove the metadata from the parser constructor

* Create metadata field set generation method

* Add metadata field checking at query runtime

* Add significant docstring and logging to the parse method

* Add docstring comments to the structure of the metadata

* Improve error handling for the metadata access

---------

Co-authored-by: jschaff <jschaff@scripps.edu>

* added comments for new dockercontainerdumper options volumes and named_volumes

* metadata query logical enhancements and bugfixes (#331)

* Correct minor logical errors and improve logging

* Add ResultFormatter for metadata field formatting

* Fix pipeline constructor arguments

* Remove breakpoint

---------

Co-authored-by: jschaff <jschaff@scripps.edu>

* Fix metadata fields. (#334)

* Improve metadata field index search (#333)

* Iterate over all potential metadata index fields

* Add 0 length check to ensure we return None for empty set

* Fix the regex tests by updating the method calls

* Update the metadata tests

---------

Co-authored-by: jschaff <jschaff@scripps.edu>

* client.snapshot.delete keyword arguments

Elasticsearch Python Client 8.0 requires keyword arguments

Credit to @ctrl-schaff for discovering this on pending.api
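
For illustration (the connection URL, repository name, and snapshot name below are placeholders):

from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")  # placeholder connection

# The 7.x positional style, client.snapshot.delete("my_repository", "my_snapshot"),
# no longer works with the 8.x client; keyword arguments are required:
client.snapshot.delete(repository="my_repository", snapshot="my_snapshot")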

* Add more idiomatic sqlite multiple document insert (#335); a short sketch of the idea appears after this entry

* Add executemany improvement and extra exception handling

* Modify the signature for bulk_write

* Fix the tuple malformatting

* Remove breakpoints

* Correct the arguments syntax

---------

Co-authored-by: jschaff <jschaff@scripps.edu>
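
A minimal sketch of the executemany-style bulk insert mentioned in #335 above; the table layout and column names are illustrative placeholders, not the actual BioThings storage schema:

import json
import sqlite3

def bulk_insert(connection, table, documents):
    # Convert each document into an (_id, serialized document) row and
    # insert the whole batch in a single executemany call / transaction.
    rows = [(doc["_id"], json.dumps(doc)) for doc in documents]
    with connection:
        connection.executemany(f"INSERT INTO {table} (_id, document) VALUES (?, ?)", rows)

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE docs (_id TEXT PRIMARY KEY, document TEXT)")
bulk_insert(connection, "docs", [{"_id": "a", "x": 1}, {"_id": "b", "x": 2}])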

* Api customization (#336)

Initial implementation allowing the use of custom config_webs when creating APIs using the hub.

---------

Co-authored-by: mygene_hub <mygene@su05.scripps.edu>
Co-authored-by: Everaldo <everaldorodrigo@gmail.com>
Co-authored-by: Dylan Welzel <dylanwelzel@gmail.com>

* Add advanced plugin support for the command line tooling (#329)

* Remove requirements around the manifest file

* Revert the import order

* Re-add the btinspect import that was accidentally removed

---------

Co-authored-by: jschaff <jschaff@scripps.edu>

* Implement notifiers using asyncio (#337)

* Implement notifiers using asyncio library.

* Implement notifiers using asyncio library.

* Raise error if notifier is not implemented.

* Add certificate to the slack channel message.

* Add exponential backoff strategy (see the sketch after this list).

* Add exponential backoff strategy.

* Clean code.

* Clean code.

* Add comment.

* Clean code.

* Move var to outside of the loop.

* Add comment.

* Keep same parameter name: event.

* Retry if get HTTP 5xx from GA4.

* Add tests for Notifiers.

* Add tests for Notifiers.

* Add tests for Notifiers.

* Add tests for Notifiers.

* Configure SSL for Slack Notifier.
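
A rough sketch of the exponential backoff behaviour described in the bullets above; the notifier internals are not shown, and `send` stands in for whatever coroutine performs the HTTP request and returns a status code:

import asyncio
import random

async def send_with_backoff(send, max_retries=5, base_delay=1.0):
    # Retry only on HTTP 5xx responses, doubling the delay each attempt
    # and adding a little jitter so retries do not synchronize.
    status = None
    for attempt in range(max_retries):
        status = await send()
        if status < 500:
            return status
        await asyncio.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    return status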

* Fix metadata mapping and remove duplicated code. (#341)

* Fix Slack notifier test. (#343)

* Upgrade to tornado 6.4.1 (#342)

* Correct the pyproject.toml black configuration ... (#340)

See https://black.readthedocs.io/en/stable/usage_and_configuration/the_basics.html#configuration-format

Co-authored-by: jschaff <jschaff@scripps.edu>

* Replace deprecated imp module with importlib (#339)

The imp module has been deprecated since Python 3.4, which was released in March 2014. Python 3.12, released in October 2023, removed support for the imp module entirely.
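
A common replacement pattern, assuming the old code loaded modules from file paths with imp.load_source (module and path names below are placeholders):

import importlib.util

def load_module_from_path(module_name, file_path):
    # Rough equivalent of the removed imp.load_source(module_name, file_path)
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module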

* Fix pytest "Set" issue for older python versions (3.8 3.7)

* Build config error fix

If an error occurs in one build config, every item in the list after it carries the same error. Resetting the error to None inside the loop fixes this.
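
A simplified sketch of the pattern being fixed (names here are hypothetical, not the hub's actual build-config code): the per-config error has to be reset at the top of each iteration, otherwise one failure is reported for every subsequent config as well.

def summarize_build_configs(build_configs, validate):
    results = []
    for name in build_configs:
        error = None  # reset every iteration; this is the fix
        try:
            validate(name)
        except Exception as exc:
            error = str(exc)
        results.append({"build_config": name, "error": error})
    return results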

* Add CLI mongodb support (#338)

* Remove requirements around the manifest file

* Revert the import order

* Generalize structure for mongodb support

---------

Co-authored-by: jschaff <jschaff@scripps.edu>

* Add id checking to IgnoreDuplicatedStorage ... (#344)

* Add id checking to IgnoreDuplicatedStorage ...

In the sequential case, this storage works as expected because each
document is checked for uniqueness individually. When processing in
batch, more filtering is needed, especially when the number of documents
is small and the ratio of duplicates among them is high.

If each batch uploaded to the database contains duplicates, the entire
batch is repeatedly thrown out, so no documents end up being uploaded at
all. If we instead verify the uniqueness constraint before uploading, by
ensuring a one-to-one ratio of _id values to documents in the batch, we
can ensure a safe upload.

This still discards identical or near-identical documents, but without
throwing out an entire batch of otherwise valid documents during the
batch upload.
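
A minimal sketch of the in-batch check described above (a hypothetical helper, not the actual IgnoreDuplicatedStorage code): keep the first document seen for each _id so the rest of the batch can still be uploaded.

def deduplicate_batch(documents):
    seen_ids = set()
    unique_documents = []
    for document in documents:
        if document["_id"] in seen_ids:
            continue  # drop the in-batch duplicate, keep the remaining documents
        seen_ids.add(document["_id"])
        unique_documents.append(document)
    return unique_documents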

* Remove breakpoint

---------

Co-authored-by: jschaff <jschaff@scripps.edu>

* Refactor ES exceptions (#346)

* Refactor ES exceptions.

* Add tests.

* Rollback changes in the connections tests file.

* Refactor ES exceptions.

* Review changes.

* Raise exceptions correctly. (#347)

* Raise exceptions correctly.

* Raise exceptions correctly.

---------

Co-authored-by: jal347 <linjason.03@gmail.com>
Co-authored-by: Everaldo <everaldorodrigo@gmail.com>
Co-authored-by: Chunlei Wu <anewgene@yahoo.com>
Co-authored-by: Dylan Welzel <dylanwelzel@gmail.com>
Co-authored-by: jschaff <jschaff@scripps.edu>
Co-authored-by: Chunlei Wu <newgene@users.noreply.github.com>
Co-authored-by: Jason Lin <35415519+jal347@users.noreply.github.com>
Co-authored-by: mygene_hub <mygene@su05.scripps.edu>
9 people authored Jul 22, 2024
1 parent 6573562 commit e154eaa
Showing 30 changed files with 1,355 additions and 420 deletions.
62 changes: 44 additions & 18 deletions biothings/cli/__init__.py
@@ -1,5 +1,7 @@
import functools
import importlib
import importlib.util
import inspect
import logging
import os
import pathlib
@@ -69,35 +71,59 @@


def setup_config():
"""Setup a config module necessary to launch the CLI"""
"""
Setup a config module necessary to launch the CLI
Depending on the backend hub database, the order of configuration
matters. If we attempt to load a module that checks for the configuration
we'll have to ensure that the configuration is properly configured prior
to loading the module
"""
working_dir = pathlib.Path().resolve()
_config = DummyConfig("config")
_config.HUB_DB_BACKEND = {
"module": "biothings.utils.sqlite3",
"sqlite_db_folder": ".biothings_hub",
}
_config.DATA_SRC_DATABASE = ".data_src_database"
_config.DATA_ARCHIVE_ROOT = ".biothings_hub/archive"
# _config.LOG_FOLDER = ".biothings_hub/logs"
_config.LOG_FOLDER = None # disable file logging, only log to stdout
_config.DATA_PLUGIN_FOLDER = f"{working_dir}"
_config.hub_db = importlib.import_module(_config.HUB_DB_BACKEND["module"])
configuration_instance = DummyConfig("config")

try:
config_mod = importlib.import_module("config")
for attr in dir(config_mod):
value = getattr(config_mod, attr)
if isinstance(value, ConfigurationError):
raise ConfigurationError("%s: %s" % (attr, str(value)))
setattr(_config, attr, value)
raise ConfigurationError(f"{attr}: {value}")
setattr(configuration_instance, attr, value)
except ModuleNotFoundError:
logger.debug("The config.py does not exists in the working directory, use default biothings.config")
logger.warning(ModuleNotFoundError)
logger.warning("Unable to find `config` module. Using the default configuration")
finally:
sys.modules["config"] = configuration_instance
sys.modules["biothings.config"] = configuration_instance

configuration_instance.HUB_DB_BACKEND = {
"module": "biothings.utils.sqlite3",
"sqlite_db_folder": ".biothings_hub",
}
configuration_instance.DATA_SRC_SERVER = "localhost"
configuration_instance.DATA_SRC_DATABASE = "data_src_database"
configuration_instance.DATA_ARCHIVE_ROOT = ".biothings_hub/archive"
configuration_instance.LOG_FOLDER = ".biothings_hub/logs"
configuration_instance.DATA_PLUGIN_FOLDER = f"{working_dir}"

try:
configuration_instance.hub_db = importlib.import_module(configuration_instance.HUB_DB_BACKEND["module"])
except ImportError as import_err:
logger.exception(import_err)
raise import_err

sys.modules["config"] = _config
sys.modules["biothings.config"] = _config
configuration_repr = [
f"{configuration_key}: [{configuration_value}]"
for configuration_key, configuration_value in inspect.getmembers(configuration_instance)
if configuration_value is not None
]
logger.info("CLI Configuration:\n%s", "\n".join(configuration_repr))


def main():
"""The main entry point for running the BioThings CLI to test your local data plugins."""
"""
The entrypoint for running the BioThings CLI to test your local data plugin
"""
if not typer_avail:
logger.error(
'"typer" package is required for CLI feature. Use "pip install biothings[cli]" or "pip install typer[all]" to install.'
165 changes: 107 additions & 58 deletions biothings/cli/utils.py
@@ -4,9 +4,11 @@
import os
import pathlib
import shutil
import sys
import time
from pprint import pformat
from types import SimpleNamespace
from typing import Union

import tornado.template
import typer
@@ -16,17 +18,19 @@
from rich.panel import Panel
from rich.table import Table

import biothings.utils.inspect as btinspect
from biothings.utils import es
from biothings.utils.common import timesofar
from biothings.utils.dataload import dict_traverse
from biothings.utils.serializer import load_json, to_json
from biothings.utils.sqlite3 import get_src_db
from biothings.utils.workers import upload_worker
import biothings.utils.inspect as btinspect


def get_logger(name=None):
"""Get a logger with the given name. If name is None, return the root logger."""
"""
Get a logger with the given name.
If name is None, return the root logger.
"""
# basicConfig has been setup in cli.py, so we don't need to do it again here
# If everything works as expected, we can delete this block.
# logging.basicConfig(
@@ -35,10 +39,13 @@ def get_logger(name=None):
# datefmt="[%X]",
# handlers=[RichHandler(rich_tracebacks=True, show_path=False)],
# )
logger = logging.getLogger(name=None)
logger = logging.getLogger(name)
return logger


logger = get_logger(name=__name__)


def run_sync_or_async_job(func, *args, **kwargs):
"""When func is defined as either normal or async function/method, we will call this function properly and return the results.
For an async function/method, we will use CLIJobManager to run it.
@@ -54,15 +61,30 @@ def run_sync_or_async_job(func, *args, **kwargs):
return func(*args, **kwargs)


def load_plugin_managers(plugin_path, plugin_name=None, data_folder=None):
"""Load a data plugin from <plugin_path>, and return a tuple of (dumper_manager, upload_manager)"""
from biothings import config as btconfig
def load_plugin_managers(
plugin_path: Union[str, pathlib.Path], plugin_name: str = None, data_folder: Union[str, pathlib.Path] = None
):
"""
Load a data plugin from <plugin_path>, and return a tuple of (dumper_manager, upload_manager)
"""
from biothings import config
from biothings.hub.dataload.dumper import DumperManager
from biothings.hub.dataload.uploader import UploaderManager
from biothings.hub.dataplugin.assistant import LocalAssistant
from biothings.hub.dataplugin.manager import DataPluginManager
from biothings.utils.hub_db import get_data_plugin

_plugin_path = pathlib.Path(plugin_path).resolve()
config.DATA_PLUGIN_FOLDER = _plugin_path.parent.as_posix()
sys.path.append(str(_plugin_path.parent))

if plugin_name is None:
plugin_name = _plugin_path.name

if data_folder is None:
data_folder = pathlib.Path(f"./{plugin_name}")
data_folder = pathlib.Path(data_folder).resolve().absolute()

plugin_manager = DataPluginManager(job_manager=None)
dmanager = DumperManager(job_manager=None)
upload_manager = UploaderManager(job_manager=None)
@@ -71,12 +93,9 @@ def load_plugin_managers(plugin_path, plugin_name=None, data_folder=None):
LocalAssistant.dumper_manager = dmanager
LocalAssistant.uploader_manager = upload_manager

_plugin_path = pathlib.Path(plugin_path).resolve()
btconfig.DATA_PLUGIN_FOLDER = _plugin_path.parent.as_posix()
plugin_name = plugin_name or _plugin_path.name
data_folder = data_folder or f"./{plugin_name}"
assistant = LocalAssistant(f"local://{plugin_name}")
# print(assistant.plugin_name, plugin_name, _plugin_path.as_posix(), btconfig.DATA_PLUGIN_FOLDER)
logger.debug(assistant.plugin_name, plugin_name, _plugin_path.as_posix(), config.DATA_PLUGIN_FOLDER)

dp = get_data_plugin()
dp.remove({"_id": assistant.plugin_name})
dp.insert_one(
@@ -88,7 +107,7 @@
"active": True,
},
"download": {
"data_folder": data_folder, # tmp path to your data plugin
"data_folder": str(data_folder), # tmp path to your data plugin
},
}
)
@@ -115,40 +134,40 @@ def get_plugin_name(plugin_name=None, with_working_dir=True):
return plugin_name, working_dir if with_working_dir else plugin_name


def is_valid_data_plugin_dir(data_plugin_dir):
"""Return True/False if the given folder is a valid data plugin folder (contains either manifest.yaml or manifest.json)"""
return (
pathlib.Path(data_plugin_dir, "manifest.yaml").exists()
or pathlib.Path(data_plugin_dir, "manifest.json").exists()
)


def load_plugin(plugin_name=None, dumper=True, uploader=True, logger=None):
"""Return a plugin object for the given plugin_name.
def load_plugin(plugin_name: str = None, dumper: bool = True, uploader: bool = True, logger: logging.Logger = None):
"""
Return a plugin object for the given plugin_name.
If dumper is True, include a dumper instance in the plugin object.
If uploader is True, include uploader_classes in the plugin object.
If <plugin_name> is not valid, raise the proper error and exit.
"""
logger = logger or get_logger(__name__)

_plugin_name, working_dir = get_plugin_name(plugin_name, with_working_dir=True)
data_plugin_dir = pathlib.Path(working_dir) if plugin_name is None else pathlib.Path(working_dir, _plugin_name)
if not is_valid_data_plugin_dir(data_plugin_dir):
if plugin_name is None:
# current working_dir has the data plugin
data_plugin_dir = pathlib.Path(working_dir)
data_folder = pathlib.Path(".").resolve().absolute()
plugin_args = {"plugin_path": working_dir, "plugin_name": None, "data_folder": data_folder}
else:
data_plugin_dir = pathlib.Path(working_dir, _plugin_name)
plugin_args = {"plugin_path": _plugin_name, "plugin_name": None, "data_folder": None}
try:
dumper_manager, uploader_manager = load_plugin_managers(**plugin_args)
except Exception as gen_exc:
logger.exception(gen_exc)
if plugin_name is None:
err = (
plugin_loading_error_message = (
"This command must be run inside a data plugin folder. Please go to a data plugin folder and try again!"
)
else:
err = f'The data plugin folder "{data_plugin_dir}" is not a valid data plugin folder. Please try another.'
logger.error(err, extra={"markup": True})
plugin_loading_error_message = (
f'The data plugin folder "{data_plugin_dir}" is not a valid data plugin folder. Please try another.'
)
logger.error(plugin_loading_error_message, extra={"markup": True})
raise typer.Exit(1)

if plugin_name is None:
# current working_dir has the data plugin
dumper_manager, uploader_manager = load_plugin_managers(working_dir, data_folder=".")
else:
dumper_manager, uploader_manager = load_plugin_managers(_plugin_name)

current_plugin = SimpleNamespace(
plugin_name=_plugin_name,
data_plugin_dir=data_plugin_dir,
@@ -288,7 +307,12 @@ def do_dump_and_upload(plugin_name, logger=None):


def process_inspect(source_name, mode, limit, merge, logger, do_validate, output=None):
"""Perform inspect for the given source. It's used in do_inspect function below"""
"""
Perform inspect for the given source. It's used in do_inspect function below
"""
from biothings import config
from biothings.utils import hub_db

VALID_INSPECT_MODES = ["jsonschema", "type", "mapping", "stats"]
mode = mode.split(",")
if "jsonschema" in mode:
@@ -305,14 +329,14 @@ def process_inspect(source_name, mode, limit, merge, logger, do_validate, output
t0 = time.time()
data_provider = ("src", source_name)

src_db = get_src_db()
src_db = hub_db.get_src_db()
pre_mapping = "mapping" in mode
src_cols = src_db[source_name]
inspected = {}
converters, mode = btinspect.get_converters(mode)
for m in mode:
inspected.setdefault(m, {})
cur = src_cols.findv2()
cur = src_cols.find()
res = btinspect.inspect_docs(
cur,
mode=mode,
@@ -500,18 +524,24 @@ def do_inspect(
process_inspect(source_name, mode, limit, merge, logger=logger, do_validate=True, output=output)


def get_manifest_content(working_dir):
"""return the manifest content of the data plugin in the working directory"""
manifest_file_yml = os.path.join(working_dir, "manifest.yaml")
manifest_file_json = os.path.join(working_dir, "manifest.json")
if os.path.isfile(manifest_file_yml):
manifest = yaml.safe_load(open(manifest_file_yml))
return manifest
elif os.path.isfile(manifest_file_json):
manifest = load_json(open(manifest_file_json).read())
return manifest
def get_manifest_content(working_dir: Union[str, pathlib.Path]) -> dict:
"""
return the manifest content of the data plugin in the working directory
"""
working_dir = pathlib.Path(working_dir).resolve().absolute()
manifest_file_yml = working_dir.joinpath("manifest.yaml")
manifest_file_json = working_dir.joinpath("manifest.json")

manifest = {}
if manifest_file_yml.exists():
with open(manifest_file_yml, "r", encoding="utf-8") as yaml_handle:
manifest = yaml.safe_load(yaml_handle)
elif manifest_file_json.exists():
with open(manifest_file_json, "r", encoding="utf-8") as json_handle:
manifest = load_json(json_handle)
else:
raise FileNotFoundError("manifest file does not exits in current working directory")
logger.info("No manifest file discovered")
return manifest


########################
@@ -520,14 +550,21 @@ def get_manifest_content(working_dir):


def get_uploaders(working_dir: pathlib.Path):
"""A helper function to get the uploaders from the manifest file in the working directory, used in show_uploaded_sources function below"""
"""
A helper function to get the uploaders from the manifest file in the working directory
used in show_uploaded_sources function below
"""
data_plugin_name = working_dir.name

manifest = get_manifest_content(working_dir)
upload_section = manifest.get("uploader")
table_space = [data_plugin_name]
if not upload_section:
upload_sections = manifest.get("uploaders")
upload_section = manifest.get("uploader", None)
upload_sections = manifest.get("uploaders", None)

if upload_section is None and upload_sections is None:
table_space = [data_plugin_name]
elif upload_section is None and upload_sections is not None:
table_space = [item["name"] for item in upload_sections]

return table_space


@@ -573,10 +610,15 @@ def get_uploaded_collections(src_db, uploaders):


def show_uploaded_sources(working_dir, plugin_name):
"""A helper function to show the uploaded sources from given plugin."""
"""
A helper function to show the uploaded sources from given plugin.
"""
from biothings import config
from biothings.utils import hub_db

console = Console()
uploaders = get_uploaders(working_dir)
src_db = get_src_db()
src_db = hub_db.get_src_db()
uploaded_sources, archived_sources, temp_sources = get_uploaded_collections(src_db, uploaders)
if not uploaded_sources:
console.print(
@@ -670,10 +712,14 @@ def do_list(plugin_name=None, dump=False, upload=False, hubdb=False, logger=None


def serve(host, port, plugin_name, table_space):
"""Serve a simple API server to query the data plugin source."""
"""
Serve a simple API server to query the data plugin source.
"""
from .web_app import main
from biothings import config
from biothings.utils import hub_db

src_db = get_src_db()
src_db = hub_db.get_src_db()
rprint(f"[green]Serving data plugin source: {plugin_name}[/green]")
asyncio.run(main(host=host, port=port, db=src_db, table_space=table_space))

@@ -724,8 +770,11 @@

def do_clean_uploaded_sources(working_dir, plugin_name):
"""Remove all uploaded sources by a data plugin in the working directory."""
from biothings import config
from biothings.utils import hub_db

uploaders = get_uploaders(working_dir)
src_db = get_src_db()
src_db = hub_db.get_src_db()
uploaded_sources = []
for item in src_db.collection_names():
if item in uploaders: