
refactor: remove score_decay_rate and update formatting
Removed unused score_decay_rate from schema and documentation.
Cleaned up code formatting and improved consistency across methods.
Updated markdown documentation to align with current schema changes.
Septimus4 committed Nov 7, 2024
1 parent 1d199fe commit 8ae5876
Showing 3 changed files with 40 additions and 52 deletions.
53 changes: 10 additions & 43 deletions docs/db/schema.md
@@ -1,9 +1,3 @@
Here’s an updated markdown version with explanations for `SemanticVector` and `score_decay_rate`:

---

# Graph Representation

## Nodes

### Channel
@@ -24,9 +18,9 @@ Here’s an updated markdown version with explanations for `SemanticVector` and

- **topic_id**: Unique identifier
- **name**: Summary of the topic
- **keywords**: List of key terms with scores
- **overall_score**: Average or cumulative score
- **keywords**: List of key terms with associated weights (e.g., `[{"term": "AI", "weight": 0.35}, {"term": "neural networks", "weight": 0.28}]`)
- **bertopic_metadata**: BERTopic metadata
- **topic_embedding**: Topic embedding vector
- **updated_at**: Last updated timestamp

---
@@ -45,58 +39,31 @@ Here’s an updated markdown version with explanations for `SemanticVector` and
### SemanticVector

- **vector_id**: Unique identifier
- **semantic_vector**: Aggregated representation of recent message semantics in a channel. This vector captures the
summarized, anonymized essence of new content without storing individual messages, aligning with privacy requirements.
- **semantic_vector**: Aggregated representation of recent message semantics in a channel, preserving privacy by summarizing content instead of storing individual messages.
- **created_at**: Creation date

> **Explanation**: The `SemanticVector` represents the semantic profile of recent messages in a channel, allowing
> Concord to adjust topic relevance without storing each message. Each vector aggregates the semantics of recent content
> into a general representation, which can influence the `channel_score` in `ASSOCIATED_WITH` relationships between
> channels and topics. This approach maintains user privacy while updating topic relevance dynamically.
> **Explanation**: The SemanticVector node represents a general semantic profile of recent messages in a channel, supporting dynamic topic relevance without storing each message individually. This approach aligns with privacy requirements while allowing for the adjustment of topic relevance.
---
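The aggregation described above — many messages in, one privacy-preserving vector out — could be sketched as mean pooling over per-message embeddings. This is an illustrative sketch only; `aggregate_semantics` and the toy vectors are not part of the Concord codebase:

```python
from statistics import fmean

def aggregate_semantics(message_embeddings):
    """Mean-pool per-message embeddings into one anonymized semantic vector."""
    if not message_embeddings:
        raise ValueError("need at least one message embedding")
    # zip(*...) groups values per dimension across all messages
    return [fmean(dim) for dim in zip(*message_embeddings)]

vec = aggregate_semantics([[1.0, 0.0], [0.0, 1.0]])
print(vec)  # [0.5, 0.5]
```

Only `vec` would be persisted on the `SemanticVector` node; the individual message embeddings can be discarded immediately, which is what makes the scheme privacy-preserving.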

## Relationships

### ASSOCIATED_WITH (Channel → Topic)

- **channel_score**: Cumulative or weighted score representing a topic’s importance or relevance to the channel
- **keywords_weights**: Channel-specific keywords and their weights, reflecting the unique relationship between the
channel and topic
- **topic_score**: Cumulative or weighted score representing a topic’s importance or relevance to the channel
- **keywords_weights**: Channel-specific keywords and their weights, highlighting the unique relationship between the channel and topic
- **message_count**: Number of messages analyzed in relation to the topic
- **last_updated**: Timestamp of the last update
- **score_decay_rate**: Rate at which `channel_score` decreases over time if no new relevant messages are analyzed. This
decay rate allows topic scores to adjust gradually, so less active or outdated topics diminish in relevance without
active content.
- **trend**: Indicator of topic trend over time within the channel

> **Explanation**: `score_decay_rate` ensures that topics associated with a channel decrease in relevance if no new
> messages support their ongoing importance. This helps maintain an accurate and current reflection of active discussions
> in a channel, giving more weight to trending or frequently discussed topics while allowing older or less relevant topics
> to fade naturally.
> **Explanation**: This relationship captures the importance of each topic to specific channels, with channel-specific keyword weights providing additional insight into unique topic-channel dynamics. `trend` enables tracking how each topic's relevance changes over time within the channel.
---
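With `score_decay_rate` removed, the `trend` property carries the temporal signal on its own. One way such an indicator could be derived from message counts in consecutive time windows — a sketch under stated assumptions, not the repository's actual logic (`classify_trend` and the 10% threshold are ours):

```python
def classify_trend(prev_count, curr_count, threshold=0.1):
    """Label a topic's direction from message counts in two windows."""
    if prev_count == 0:
        return "rising" if curr_count > 0 else "flat"
    change = (curr_count - prev_count) / prev_count
    if change > threshold:
        return "rising"
    if change < -threshold:
        return "falling"
    return "flat"

print(classify_trend(40, 55))  # rising
```

A string label like this slots directly into the `trend` property of the `ASSOCIATED_WITH` relationship.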

### RELATED_TO (Topic ↔ Topic)

- **similarity_score**: Degree of similarity between two topics
- **temporal_similarity**: Time-based similarity metric to track changing topic relationships over time
- **co-occurrence_rate**: Frequency with which two topics are discussed together across channels
- **temporal_similarity**: Metric to track similarity over time
- **co-occurrence_rate**: Frequency of concurrent discussion of topics across channels
- **common_channels**: Number of shared channels discussing both topics
- **topic_trend_similarity**: Similarity in trends or changes in relevance for each topic

```mermaid
graph TD
%% Nodes
Channel["Channel<br>-------------------------<br>channel_id: Unique identifier<br>platform: Platform (e.g., Telegram)<br>name: Name of the channel<br>description: Description of the channel<br>created_at: Creation date<br>active_members_count: Number of active members<br>language: Language of the channel<br>region: Geographical region<br>activity_score: Posting activity score"]
Topic["Topic<br>-------------------------<br>topic_id: Unique identifier<br>name: Summary of the topic<br>keywords: List of key terms with scores<br>overall_score: Average or cumulative score<br>bertopic_metadata: BerTopic metadata<br>updated_at: Last updated timestamp"]
TopicUpdate["TopicUpdate<br>-------------------------<br>update_id: Unique identifier<br>channel_id: Associated channel<br>topic_id: Associated topic<br>keywords: Keywords from the update<br>score_delta: Change in topic score<br>timestamp: Update time"]
SemanticVector["SemanticVector<br>-------------------------<br>vector_id: Unique identifier<br>semantic_vector: Aggregated semantics<br>created_at: Creation date"]
%% Relationships
Channel -.-> ASSOCIATED_WITH["ASSOCIATED_WITH Relationship<br>-------------------------<br>channel_score: Cumulative or weighted score<br>keywords_weights: Channel-specific keywords and weights<br>message_count: Number of messages analyzed<br>last_updated: Timestamp of last update<br>score_decay_rate: Rate of score decay<br>trend: Topic trend over time"] --> Topic
Topic -.-> RELATED_TO["RELATED_TO Relationship<br>-------------------------<br>similarity_score: Degree of similarity<br>temporal_similarity: Time-based similarity<br>co-occurrence_rate: Co-occurrence of keywords<br>common_channels: Number of shared channels<br>topic_trend_similarity: Trend alignment"] --> Topic
TopicUpdate --> Topic
SemanticVector --> Channel
```

---
- **topic_trend_similarity**: Measure of similarity in topic trends across channels
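The `co-occurrence_rate` and `common_channels` properties above can be derived from per-topic channel sets. A minimal sketch — the function name and inputs are illustrative, not from the repository:

```python
def co_occurrence(channels_a, channels_b, total_channels):
    """Return (co-occurrence rate, number of shared channels) for two topics."""
    if total_channels <= 0:
        raise ValueError("total_channels must be positive")
    common = channels_a & channels_b  # channels discussing both topics
    return len(common) / total_channels, len(common)

rate, shared = co_occurrence({"c1", "c2", "c3"}, {"c2", "c3", "c4"}, 10)
print(rate, shared)  # 0.2 2
```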
8 changes: 6 additions & 2 deletions src/bert/concord.py
@@ -1,9 +1,13 @@
# concord.py

from bert.pre_process import preprocess_documents
from graph.schema import Topic


def concord(topic_model, documents):
def concord(
topic_model,
documents,
):
# Report how many documents were loaded
print(f"Loaded {len(documents)} documents.")

@@ -40,4 +44,4 @@ def concord(topic_model, documents):
print(f" {word_score_str}")

print("\nTopic modeling completed.")
return len(documents), None
return len(documents), Topic.create_topic()
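The `word_score_str` printed inside `concord` suggests per-topic word/score pairs in BERTopic's `(word, score)` tuple shape. A hedged sketch of how such pairs could be rendered — the helper name and formatting are our assumptions, not the repository's code:

```python
def format_word_scores(word_scores):
    """Render (word, score) pairs as 'word (0.123)' joined by commas."""
    return ", ".join(f"{word} ({score:.3f})" for word, score in word_scores)

line = format_word_scores([("ai", 0.3521), ("model", 0.2014)])
print(line)  # ai (0.352), model (0.201)
```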
31 changes: 24 additions & 7 deletions src/graph/schema.py
@@ -10,11 +10,10 @@

# Relationship Models
class AssociatedWithRel(StructuredRel):
channel_score = FloatProperty()
topic_score = FloatProperty()
keywords_weights = ArrayProperty()
message_count = IntegerProperty()
last_updated = DateTimeProperty()
score_decay_rate = FloatProperty()
trend = StringProperty()


@@ -60,14 +59,13 @@ def create_channel(cls, platform: str, name: str, description: str,

def associate_with_topic(self, topic: 'Topic', channel_score: float,
keywords_weights: List[str], message_count: int,
score_decay_rate: float, trend: str) -> None:
trend: str) -> None:
self.topics.connect(
topic, {
'channel_score': channel_score,
'keywords_weights': keywords_weights,
'message_count': message_count,
'last_updated': datetime.utcnow(),
'score_decay_rate': score_decay_rate,
'trend': trend
})

@@ -83,8 +81,8 @@ class Topic(StructuredNode):
topic_id = UniqueIdProperty()
name = StringProperty()
keywords = ArrayProperty()
overall_score = FloatProperty()
bertopic_metadata = JSONProperty()
topic_embedding = ArrayProperty()
updated_at = DateTimeProperty(default_now=True)

# Relationships
@@ -96,17 +94,24 @@ class Topic(StructuredNode):

# Wrapper Functions
@classmethod
def create_topic(cls, name: str, keywords: List[str], overall_score: float,
def create_topic(cls, name: str, keywords: List[str],
bertopic_metadata: Dict[str, Any]) -> 'Topic':
"""
Create a new topic node with the given properties.
"""
return cls(name=name,
keywords=keywords,
overall_score=overall_score,
bertopic_metadata=bertopic_metadata).save()

def relate_to_topic(self, other_topic: 'Topic', similarity_score: float,
temporal_similarity: float, co_occurrence_rate: float,
common_channels: int,
topic_trend_similarity: float) -> None:
"""
Create a relationship to another topic with various similarity metrics.
"""
if not isinstance(other_topic, Topic):
raise ValueError("The related entity must be a Topic instance.")
self.related_topics.connect(
other_topic, {
'similarity_score': similarity_score,
@@ -118,10 +123,22 @@ def relate_to_topic(self, other_topic: 'Topic', similarity_score: float,

def add_update(self, update_keywords: List[str],
score_delta: float) -> 'TopicUpdate':
"""
Add an update to the topic with keyword changes and score delta.
"""
update = TopicUpdate.create_topic_update(update_keywords, score_delta)
update.topic.connect(self)
return update

def set_topic_embedding(self, embedding: List[float]) -> None:
"""
Set the topic embedding vector, ensuring all values are floats.
"""
if not all(isinstance(val, float) for val in embedding):
raise ValueError("All elements in topic_embedding must be floats.")
self.topic_embedding = embedding
self.save()


class TopicUpdate(StructuredNode):
update_id = UniqueIdProperty()
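The float check added in `set_topic_embedding` can be exercised without a Neo4j connection. This standalone mirror of the validation (the function name is illustrative; only the check itself comes from the diff) shows which inputs pass:

```python
def validate_topic_embedding(embedding):
    """Mirror of the schema check: every element must be a float."""
    if not all(isinstance(val, float) for val in embedding):
        raise ValueError("All elements in topic_embedding must be floats.")
    return embedding

validate_topic_embedding([0.12, 0.98, -0.33])  # passes
try:
    validate_topic_embedding([1, 2, 3])         # ints are rejected
except ValueError as err:
    print(err)
```

Note that with a strict `isinstance(val, float)` check, integer components such as `0` are rejected; callers would need to pass `0.0` instead.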
