Author: Tek Raj Chhetri | tekraj@mit.edu
BrainKB serves as a knowledge base platform that provides scientists worldwide with tools for searching, exploring, and visualizing Neuroscience knowledge represented by knowledge graphs (KGs). Moreover, BrainKB provides cutting-edge tools that enable scientists to contribute new information (or knowledge) to the platform and is expected to be a go-to destination for all neuroscience-related research needs.
The main objective of BrainKB is to represent neuroscience knowledge as a knowledge graph so that it can be used for different downstream tasks, such as making predictions and new inferences, in addition to querying and viewing information. The expected outcomes of BrainKB include the following:
- (Semi-)Automated extraction of neuroscience knowledge from structured, semi-structured, and unstructured sources, and representation of that knowledge as KGs.
- Visualization of the KGs.
- A platform for performing different analytics operations over the BrainKB KGs.
- (Semi-)Automated validation of the BrainKB KGs to ensure the high quality of the content.
- The ability to ingest data in batch or streaming mode for the automated extraction of KGs.
- Limited Availability of Platforms for Integrating Neuroscience Data into Knowledge Graphs: In fields such as biomedicine, many platforms exist, such as SPOKE and CIViC. However, such resources are comparatively limited in the domain of neuroscience. LinkRBrain, a web-based platform that integrates anatomical, functional, and genetic knowledge, is one of the few such resources. BrainKnow, the most recent, is designed to synthesize and integrate neuroscience knowledge from scientific literature. Additionally, projects like DANDI, EBRAINS, and the Open Metadata Initiative are making strides by enabling the sharing of neurophysiology data together with its metadata.
- Lack of Support for Heterogeneous Data Sources: The current platforms in neuroscience are limited in their ability to handle a diverse range of data sources. For instance, LinkRBrain can only integrate knowledge from 41 databases, whereas BrainKnow focuses solely on scientific literature. However, knowledge is not restricted to databases or scientific literature, and there is a need for platforms that can accommodate a wider variety of sources (e.g., structured, semi-structured, and unstructured sources).
All information stored in the KG has associated data models or can be extracted to models. The information will be linked to formal ontologies and linked across datasets. All data models will have well defined schemas and descriptors for human and programmatic consumption.
The KG will allow for both information retrieval and upload. This involves a set of services and an API layer that allows for curation of information. The curation of information will reflect the data models. In addition, the KG will link, ingest, or cache other authoritative sources of information.
To support being an authoritative source, information entering the KG will indicate levels of curation. Such curation may take the form of expertise that is embedded into algorithms (e.g., quality metrics, alignment, mapping), is incorporated into data models (e.g., genes, anatomy), and is derived from computational and human analysis (e.g., atlases as outputs of working groups).
The architecture of the KG will be usable by humans and computational entities. The application interfaces will provide user interactivity and programmatic access. The KG will support competencies needed by the community.
To increase trust, the provenance of all information in the KG shall be maintained (including recording when provenance is absent) and made available through the KG interfaces.
The information stored will lend itself to compute through appropriate APIs, data formats, and services. The KG shall connect to computational services to generate and provide inferred or derived information relevant to scientists.
BrainKB will support knowledge extraction from various sources in different data formats (e.g., texts, JSON (JavaScript Object Notation)) via the BrainKB user interface (UI) and the application programming interface (API) endpoints. Both batch and streaming data ingestion modes will be supported.
KGs evolve over time. For example, the president of a country changes over time, so a KG storing information about the president must be updated accordingly. There are many similar cases in neuroscience and other domains. Knowledge may change based on new research findings, making previous knowledge obsolete or factually incorrect. Additionally, the schema might change due to standardization, alignment, or updates. While schema changes may not always be necessary, they may be required to accommodate new information. Therefore, BrainKB will support this evolution by allowing the addition (or removal) of entities and relationships.
- Example: In fields like biology, newer findings can invalidate existing terms, requiring flexibility in the schema to account for future changes.
BrainKB shall be maintainable, allowing operations such as KG enrichment and validation to be performed easily. When we mention "validation to be performed easily," we are referring to processes that require minimal human intervention, either through semi-automated or fully-automated methods.
BrainKB will allow community-driven curation of the KGs as well as (semi-)automated extraction and construction of KGs from external sources, e.g., scientific literature.
BrainKB shall check the accuracy of the knowledge by performing multi-step (semi-)automated validations. Additionally, checks will be performed to ensure that the KG triples are complete, i.e., that the mandatory information is present. To further ensure accuracy and completeness, BrainKB shall guarantee that the addition of new facts (or KG triples) will not introduce inconsistencies (see Figure 1) with existing knowledge due to factual errors, data inconsistencies, or incompleteness.
Figure 1: KGs. The image on the left shows the original knowledge graph, while the image on the right demonstrates the updated knowledge graph. The green highlighted box indicates new knowledge that has been added, while the red highlighted box indicates any inconsistencies caused by factual changes, i.e., incorrect date of birth.
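The kind of inconsistency shown in Figure 1 can be sketched as a check on single-valued predicates. This is a minimal illustration, not BrainKB's validation logic: the tuple representation of triples, the `SINGLE_VALUED` set, the `find_conflicts` helper, and the sample data are all assumptions made for the example.

```python
# Hypothetical sketch: triples are plain (subject, predicate, object) tuples.
# Predicates assumed to allow at most one value per subject (an assumption).
SINGLE_VALUED = {"dateOfBirth"}


def find_conflicts(existing, incoming):
    """Return incoming triples that contradict a single-valued fact in `existing`."""
    known = {(s, p): o for s, p, o in existing if p in SINGLE_VALUED}
    conflicts = []
    for s, p, o in incoming:
        if p in SINGLE_VALUED and (s, p) in known and known[(s, p)] != o:
            conflicts.append((s, p, o))
    return conflicts


# Made-up sample data for illustration only.
existing = [
    ("person:1", "dateOfBirth", "1950-01-01"),
    ("person:1", "name", "A. Example"),
]
incoming = [
    ("person:1", "dateOfBirth", "1960-01-01"),  # contradicts the stored value
    ("person:1", "worksAt", "org:1"),           # new, non-conflicting knowledge
]
conflicts = find_conflicts(existing, incoming)
print(conflicts)
```

A real check would more likely be expressed against the schema (for example, as cardinality constraints) than as a hard-coded set of predicates.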
The ACC process will ensure human-centricity is maintained alongside automated validation. Figure 2 shows a high-level overview of the framework used for the automated extraction and validation of KG triples. Each agent performs an individual task. For example, Agent 1 and Agent 2 extract KG triples from the raw text and align them with the schema (or ontology). Similarly, validator agents 1, 2, and 3 validate the aggregated KG triples. Each of these validator agents uses a different source for the validation, and validator agent 4 takes the validation results from the three validator agents and makes the final decision. If validator agent 4 is unable to make a decision for any reason, such as an unresolvable conflict, it triggers an alert to the user, who then performs the manual validation (or confirms the validation). Although the framework in Figure 2 shows the complete pipeline for extraction and validation, each task can be performed independently. For example, to validate existing KG triples, one can use just the validation component.
Figure 2: KGs. A Multi-Agent Framework for Neuroscience Knowledge Graph Construction & Validation
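The final-decision step can be sketched as follows. The text does not say how validator agent 4 combines the three results, so the strict-majority rule, the verdict strings, and the `final_decision` function are assumptions used purely for illustration.

```python
from collections import Counter


def final_decision(votes):
    """Combine validator verdicts ("valid" / "invalid") by strict majority.

    A verdict of None means the validator could not decide. If no verdict
    reaches a strict majority of all validators, the triple is escalated
    to a human reviewer.
    """
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return "needs_human_review"
    verdict, n = counts.most_common(1)[0]
    if n * 2 > len(votes):
        return verdict
    return "needs_human_review"


print(final_decision(["valid", "valid", "invalid"]))
print(final_decision(["valid", "invalid", None]))
```

Escalating on anything short of a clear majority keeps the human in the loop for exactly the cases the automated validators cannot settle.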
To enable trust, the provenance, i.e., documentation of the source and the curators (in case of manual curation) of all the information, shall be maintained. The provenance conflict resolution mechanism will also be implemented to ensure the accuracy of the provenance information.
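One way to picture a provenance-carrying statement is sketched below. The `Provenance` and `Statement` classes and their field names are hypothetical and do not reflect BrainKB's data models; the point is only that a missing source is recorded explicitly rather than silently dropped.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    source: Optional[str] = None   # e.g. a publication or dataset identifier
    curator: Optional[str] = None  # set only when the fact was curated manually

    @property
    def is_missing(self) -> bool:
        """True when neither a source nor a curator was recorded."""
        return self.source is None and self.curator is None


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    provenance: Provenance = field(default_factory=Provenance)


# Made-up statements for illustration only.
curated = Statement("gene:1", "expressedIn", "region:1",
                    Provenance(source="dataset:example", curator="alice"))
unsourced = Statement("gene:2", "expressedIn", "region:1")
print(curated.provenance.is_missing, unsourced.provenance.is_missing)
```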
BrainKB shall support querying of and reasoning over the KGs. It shall also support other downstream analytics tasks, such as link prediction (see Figure 3) using machine learning techniques.
Figure 3: Link prediction. The figure on the left indicates a KG with a missing link (or relation) indicated by dotted lines and the figure on the right displays the KG after the link prediction.
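To make the idea of link prediction concrete, the sketch below scores unlinked node pairs by how many neighbours they share. This common-neighbours heuristic is a deliberately simple stand-in, not the machine learning approach the text refers to; it also treats the graph as undirected and ignores relation types, both of which are simplifying assumptions. The function name and sample edges are hypothetical.

```python
from itertools import combinations


def common_neighbor_scores(edges):
    """Score each unlinked node pair by the number of neighbours they share."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already linked, nothing to predict
        shared = len(neighbors[u] & neighbors[v])
        if shared:
            scores[(u, v)] = shared
    return scores


# Made-up graph: a square A-B-D-C-A with both diagonals missing.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
scores = common_neighbor_scores(edges)
print(scores)
```

Pairs with higher scores are the candidates for a missing link; a learned model would replace the scoring function but produce the same kind of ranked output.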
To ensure interoperability and ease of integration, BrainKB will focus on using standardized ontologies or schemas. However, standardized ontologies or schemas are not always available. In such cases, other schemas or ontologies must be used, and alignment will be performed where necessary to preserve interoperability.
As BrainKB will also provide features for performing analytics operations in addition to querying the information (or knowledge), special emphasis shall be placed on ensuring that the information presented to the user does not cause cognitive burden and data fatigue. Cognitive burden occurs when the brain must exert more effort to understand information, typically as a result of an overload of visual content. For example, in the figure below, the visualization on the left imposes a greater cognitive burden than the one on the right.
Assumption: We operate under the open-world assumption (OWA), not the closed-world assumption (CWA). Under OWA, we make no assumptions about the absence of statements, while under CWA an absent statement is evaluated as false, i.e., assumed to be false.
Example: Let's consider a university scenario. We want to determine if Jane Doe is enrolled in the AI 101 course.
- Under CWA, if Jane Doe's enrollment information for AI 101 is not present in the university database, this absence of information is interpreted as Jane Doe not being enrolled in the course.
- Conversely, under OWA, the absence of Jane Doe's enrollment information for AI 101 means that the information is simply missing, and it remains uncertain whether Jane Doe is enrolled in the course.
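The difference between the two assumptions can be sketched in a few lines. The `enrollments` set, the second student, and both function names are hypothetical; the only point is that the closed-world lookup answers `False` for a missing record while the open-world lookup answers `None` (unknown).

```python
# Hypothetical known facts: one recorded enrollment, nothing about Jane Doe.
enrollments = {("John Smith", "AI 101")}


def is_enrolled_closed_world(student, course):
    """Closed world: anything not recorded is treated as false."""
    return (student, course) in enrollments


def is_enrolled_open_world(student, course):
    """Open world: a missing record means unknown (None), not false."""
    return True if (student, course) in enrollments else None


print(is_enrolled_closed_world("Jane Doe", "AI 101"))
print(is_enrolled_open_world("Jane Doe", "AI 101"))
```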
Figure 4 below shows BrainKB's architecture. It is divided into three layers: application layer (layer 1), service layer (layer 2), and resource layer (layer 3).
Application: The application layer (or layer 1) is the entry point that provides access to BrainKB, for example via the UI.
Service: The service layer (or layer 2) implements the core logic and is broken down into multiple services based on functionality (e.g., the ingestion service, which allows ingestion of data and is shown in Figure 5).
Resource: The resource layer (or layer 3) will provide the computational resources that BrainKB needs to deliver its services.
Figure 4: Architecture of BrainKB
Additionally, Figure 5 shows the architecture of the ingest service, a component of BrainKB.
Figure 5: Ingest Service, one of the service components of BrainKB
- Neuroscience researchers: BrainKB's primary audience will be neuroscience researchers, who will be able to use the platform to integrate, visualize, and analyze neuroscience data. They will be able to capitalize on the platform's ability to synthesize their data (or knowledge) into KGs.
- Research Labs and Academic Institutions: BrainKB will be a resource for teaching in academic contexts specializing in neuroscience. It offers convenient access to integrated neuroscience data for faculty and students.
- Policy Makers: Neurology policymakers will be able to use the neuroscience knowledge that BrainKB hosts to make policy decisions.
- Healthcare Professionals: Healthcare professionals in neurology (or clinical neuroscience) may use BrainKB knowledge to understand and improve neurological disease outcomes.
- Neuroscience-related Companies: Companies specializing in developing drugs for neurological diseases will be able to use the platform's KGs to gain insights into neurological conditions and treatments.
Actor: Alice (Neuroscientist)
Task: Alice wants to know if she can gain new insights from her newly collected neuroscience data.
Precondition: The newly collected neuroscience dataset, which includes demographics, gene expression maps, and structural and functional MRI scans, is usable and uncorrupted.
Flow:
- Alice uploads the data into the BrainKB platform through the BrainKB UI (User Interface).
- BrainKB then analyzes the data. If it finds an error, e.g., an unsupported file format, it returns the error; otherwise, it proceeds to the next step, knowledge extraction.
- The system performs the knowledge extraction, validation, and alignment operations. If a validation or alignment issue cannot be resolved automatically, the extracted knowledge, represented as a KG, is flagged for expert review. Upon successful review, the KGs are integrated (or stored) in the BrainKB storage and are available for visualization and analysis.
Postcondition: Alice discovers new insights through the integration of diverse knowledge sources represented in BrainKB's KGs.
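The flow above can be sketched as a small function. This is a hedged illustration only: the `process_upload` name, the set of supported file suffixes, the status strings, and the `extract`/`validate` callables are assumptions standing in for BrainKB's actual services.

```python
from pathlib import Path

# Assumed set of accepted formats; the real list is not given in the text.
SUPPORTED_SUFFIXES = {".csv", ".json", ".txt"}


def process_upload(filename, extract, validate):
    """Run the upload flow: format check, extraction, validation, routing.

    `extract` maps a filename to a list of triples; `validate` maps triples
    to a list of issues that could not be resolved automatically.
    """
    if Path(filename).suffix.lower() not in SUPPORTED_SUFFIXES:
        return {"status": "error", "reason": "unsupported file format"}
    triples = extract(filename)
    unresolved = validate(triples)
    if unresolved:
        # Anything the automated steps cannot settle goes to an expert.
        return {"status": "flagged_for_expert_review",
                "triples": triples, "issues": unresolved}
    return {"status": "stored", "triples": triples}


# Stub services for illustration only.
def fake_extract(filename):
    return [("subject:1", "relatedTo", "object:1")]


ok = process_upload("scan_metadata.csv", fake_extract, lambda triples: [])
flagged = process_upload("scan_metadata.csv", fake_extract,
                         lambda triples: ["alignment conflict"])
bad = process_upload("scan.xyz", fake_extract, lambda triples: [])
print(ok["status"], flagged["status"], bad["status"])
```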
- Extraction/Integration/Refinement: BrainKB will provide features to extract knowledge from diverse sources, such as raw text and scientific publications, and integrate it with the knowledge represented via KGs. Additionally, BrainKB will provide features to refine the extracted knowledge, e.g., with a human in the loop.
- Cards: The BrainKB web application allows easy visualization of the knowledge of interest to scientists/researchers stored in KGs, together with its interconnected knowledge. Figure 6 shows a snippet of the entity card from the BrainKB web application, which can be accessed at http://beta.brainkb.org.
Figure 6: Snippet of Entity card from BrainKB web application
- Causal Inference: Causal inference helps distinguish causation from correlation, which is particularly important in domains like neuroscience [1,2]. Because BrainKB stores knowledge as KGs, which can encode causal relationships between entities and enable causal reasoning [2], it supports causal inference.
[1] Danks, D. and Davis, I., 2023. Causal inference in cognitive neuroscience. Wiley Interdisciplinary Reviews: Cognitive Science, 14(5), p.e1650.
[2] Huang, H. and Vidal, M.E., 2024. CauseKG: A Framework Enhancing Causal Inference with Implicit Knowledge Deduced from Knowledge Graphs. IEEE Access.
- Human in the loop: BrainKB allows the creation of KGs constructed from heterogeneous sources, e.g., text and CSV files, in a (semi-)automated fashion (e.g., using NLP) and through community contribution. BrainKB includes human-in-the-loop features, which ensure quality control of the KGs. The human in the loop is also a step in the maturity model for operations in neuroscience [1], helping to optimize KG curation.
  - Example:
    - When new evidence is submitted, it is placed in a queued (or hold) stage and progresses only after the moderators' review. Changes might be required based on the review before it appears in the evidence entity card.
    - Whether the KGs are created manually or automatically, the moderators will review the concepts' alignment and determine whether the resolution (e.g., entity resolution) has been performed correctly.
[1] Johnson, E.C., Nguyen, T.T., Dichter, B.K., Zappulla, F., Kosma, M., Gunalan, K., Halchenko, Y.O., Neufeld, S.Q., Schirner, M., Ritter, P. and Martone, M.E., 2023. A maturity model for operations in neuroscience research. arXiv preprint arXiv:2401.00077.
- Compare Atlases: Where other knowledge platform services are available for integration, BrainKB integrates their knowledge, providing a feature to compare knowledge across different atlases (e.g., Allen Brain Atlases).
- Find/correct Errors: BrainKB will provide a feature to search existing knowledge and correct any errors found.
- Add information/API: BrainKB offers API endpoints that enable integration with its platform. These endpoints facilitate data ingestion from various sources, such as CSV files or raw text, for constructing KGs, performing search operations, and conducting analyses on the stored KGs.
- Doing meta-analysis: Meta-analysis is a knowledge-intensive task that requires significant time and effort to find related studies, identify evidence items, annotate the contents, and aggregate the results [1]. BrainKB, which stores knowledge from diverse data sources, including scientific publications, facilitates meta-analysis.
[1] Tiddi, I., Balliet, D. and ten Teije, A., 2020. Fostering scientific meta-analyses with knowledge graphs: a case-study. In The Semantic Web: 17th International Conference, ESWC 2020, Heraklion, Crete, Greece, May 31–June 4, 2020, Proceedings 17 (pp. 287-303). Springer International Publishing.
Models currently used in BrainKB:
- Genome Annotation Registry Service (GARS) Model
- Anatomical Structure Reference Service (AnSRS) Model
- Library Generation Model
Detailed descriptions of the models above are available at https://brain-bican.github.io/models/.
- FastAPI
- Docker
- RabbitMQ
- Serverless (OpenFaaS)
- Python
- Language models (e.g., Google BERT and LLaMA)
- SPARQL
The sequence diagram below shows the interactions between different service components for the KG construction.
sequenceDiagram
autonumber
participant User
participant UI
participant KG_Construction as (Semi-) structured KG construction
participant Mapping as Mapping & Annotation
participant Alignment as Alignment & resolution
participant Validation as Validation & Quality assurance
participant Expert
participant Triplestore
User->>+ UI: Upload CSV
UI->>+KG_Construction: Forward uploaded CSV
KG_Construction->>+KG_Construction: Perform initial check, e.g., presence of required columns
alt is invalid
KG_Construction-->>+UI: Return Error message
UI-->>+User: Return Error message
else is valid
KG_Construction->>+ Mapping: Perform mapping & annotation as necessary
Mapping->>+ Validation: Perform validation of KG triples
Validation->>+Validation: Validation checks, e.g., SHACL, provenance conflict
Validation->>+ Alignment: Resolve conflicts
alt conflict identified, perform resolution
Alignment->>+ Alignment: Perform automated conflict resolution and alignment operation
Alignment-->>+ Validation: Return response (triples with resolved conflicts)
else conflict identified, perform resolution-requires human oversight
Alignment->>+ Expert: Send to expert for manual conflict resolution
Expert-->>+Alignment: Return response (triples with resolved conflicts)
Alignment-->>+ Validation: Return response (triples with resolved conflicts)
end
Validation-->>+ Mapping: Return response (validated and conflict resolved KG triples)
Mapping-->>+ KG_Construction: Updated KG triples
KG_Construction->>+Triplestore: Store KG in database
Triplestore-->>+KG_Construction: Return acknowledgement
KG_Construction-->>+UI: Return response (operation status notification)
UI-->>+User:Send notification
end
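The initial check in step 3 of the diagram (presence of required columns) might look like the sketch below. The required column names and the `missing_columns` helper are hypothetical; the actual columns BrainKB expects are not stated in this document.

```python
import csv
import io

# Hypothetical required columns; the real schema is not given in the text.
REQUIRED_COLUMNS = {"subject", "predicate", "object"}


def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header, sorted."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])  # fieldnames is None for empty input
    return sorted(set(required) - header)


print(missing_columns("subject,predicate\na,b\n"))
print(missing_columns("subject,predicate,object\na,b,c\n"))
```

An empty result means the upload can proceed to mapping and annotation; a non-empty result corresponds to the "is invalid" branch, where the error is returned to the UI.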
We recognize the current beta site's issues and are working on addressing them. In particular, we are working on the following problems of the beta site.
- Performance: We are currently using the free version of GraphDB, which is limited to two simultaneous queries. Because fetching details such as Library Aliquot and its interrelated information requires running more than two queries, this limitation affects the beta site's knowledge base page. We are currently considering the premium version of GraphDB and open-source triple store alternatives, and we expect the performance of future versions of BrainKB to improve significantly.
- Source code
- https://github.com/sensein/BrainKB - Backend services
- https://github.com/sensein/brainkb-ui/tree/admin-ui - UI
- Developer documentation
Features | Status |
---|---|
UI with NextJS | Implementation in progress |
KG construction from scientific publication | Designed the approach and implementation in progress |
Structured models (or ontology) design | Implementation in progress |
BrainKB documentation including deployment instructions and lessons learned | Practically complete. Updates will be made as the work progresses |
Date | Event | Status |
---|---|---|
2024-03-26 | Project Conceptualization | Completed |
2024-04-05 | Initial Architecture Design Phase Completed | Completed, updated the architecture |
2024-04-23 | Work on Design Document | Initial version completed and is updating |
2024-04-25 | Development Phase Started | Completed |
2024-05-25 | First version of BrainKB | Completed and first version has been deployed to AWS |
2024-12-25 | Second version of BrainKB | |
2025-04-10 | First complete version of BrainKB with all conceptualized features | |