feat: Add Platform management and refine Neo4j initialization
Introduce `Platform` class with relationships and methods in schema.py.
Refactor Neo4j initialization using neomodel in graph.py and update related
functions to handle platforms in concord.py.
sajz authored and Septimus4 committed Nov 8, 2024
1 parent 8ae5876 commit b45e5b6
Showing 11 changed files with 371 additions and 154 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -152,3 +152,6 @@ cython_debug/
/concord/bertopic_model.pkl
/.idea/rust.xml
/nltk_data/
+/concord/dataset_topic_messages.csv
+/topic_model
+/topic_visualization.html
35 changes: 0 additions & 35 deletions .idea/runConfigurations/Concord.xml

This file was deleted.

17 changes: 17 additions & 0 deletions .idea/runConfigurations/Server.xml

Some generated files are not rendered by default.

92 changes: 55 additions & 37 deletions docs/db/schema.md
@@ -2,68 +2,86 @@

### Channel

-- **channel_id**: Unique identifier
-- **platform**: Platform (e.g., Telegram)
-- **name**: Name of the channel
-- **description**: Description of the channel
-- **created_at**: Creation date
-- **active_members_count**: Number of active members
-- **language**: Language of the channel
-- **region**: Geographical region
-- **activity_score**: Posting activity score, indicating channel activity level
+- **channel_id**: Unique identifier for the channel.
+- **name**: Name of the channel.
+- **description**: Brief description of the channel.
+- **created_at**: Timestamp indicating when the channel was created.
+- **language**: Language predominantly used in the channel.
+- **activity_score**: Numerical score representing the activity level in the channel.

+**Methods**:
+- `create_channel`: Creates a new channel with specified details.
+- `associate_with_topic`: Connects a topic to the channel, setting scores and trend.
+- `add_semantic_vector`: Adds a semantic vector to the channel.
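
A short usage sketch of these methods, mirroring the calls in `src/bert/concord.py`; the IDs are hypothetical and the exact signatures live in `graph/schema.py`:

```python
# Sketch only: mirrors calls from src/bert/concord.py; IDs are hypothetical.
from graph.schema import Channel, Topic

channel = Channel.create_channel(
    channel_id="tg-1024",      # hypothetical channel ID
    name="Channel tg-1024",
    description="",
    language="English",
    activity_score=0.0,
).save()

# Link a Topic node with a channel-specific score and trend.
topic = Topic.create_topic(name="Topic 0", keywords=[],
                           bertopic_metadata={}).save()
channel.associate_with_topic(topic, channel_score=0.5, trend="")
```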

---

### Topic

-- **topic_id**: Unique identifier
-- **name**: Summary of the topic
-- **keywords**: List of key terms with associated weights (e.g., `[{"term": "AI", "weight": 0.35}, {"term": "neural networks", "weight": 0.28}]`)
-- **bertopic_metadata**: BerTopic metadata
-- **topic_embedding**: Topic embedding
-- **updated_at**: Last updated timestamp
+- **topic_id**: Unique identifier for the topic.
+- **name**: Summary name of the topic.
+- **keywords**: List of key terms and associated weights (e.g., `[{"term": "AI", "weight": 0.35}]`).
+- **bertopic_metadata**: Metadata from BerTopic processing.
+- **topic_embedding**: Vector embedding for the topic.
+- **updated_at**: Timestamp of the last update.

+**Methods**:
+- `create_topic`: Creates a new topic with specified keywords and metadata.
+- `relate_to_topic`: Relates this topic to another, setting similarity metrics.
+- `add_update`: Adds a topic update with score change and keywords.
+- `set_topic_embedding`: Sets the embedding vector for the topic.
+- `get_topic_embedding`: Retrieves the embedding as a numpy array.
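
The embedding helpers round-trip a vector through the node; a minimal sketch, assuming a 384-dimensional placeholder vector (the real size depends on the BERTopic backend):

```python
# Sketch only: the 384-dim zero vector is a placeholder.
import numpy as np
from graph.schema import Topic

topic = Topic.create_topic(
    name="Topic 7",
    keywords=[{"term": "AI", "weight": 0.35}],
    bertopic_metadata={"frequency": 42},
).save()

topic.set_topic_embedding(np.zeros(384))   # store the vector
embedding = topic.get_topic_embedding()    # returned as np.ndarray
```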

---

### TopicUpdate

-- **update_id**: Unique identifier
-- **channel_id**: Associated channel
-- **topic_id**: Associated topic
-- **keywords**: Keywords from the update
-- **score_delta**: Change in topic score
-- **timestamp**: Update time
+- **update_id**: Unique identifier for the update.
+- **keywords**: Keywords associated with this update.
+- **score_delta**: Numerical change in the topic score.
+- **timestamp**: Time when the update was made.
+- **topic_embedding**: Vector embedding for the topic.

+**Methods**:
+- `create_topic_update`: Creates a new update for a topic.
+- `link_to_channel`: Links this update to a channel.
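
Continuing the sketches above (`topic` and `channel` as created earlier): `create_topic_update` matches its use in `src/bert/topic_update.py`, while the `link_to_channel` signature is an assumption here:

```python
# Sketch only: link_to_channel's argument is assumed to be the Channel node.
from graph.schema import TopicUpdate

update = TopicUpdate.create_topic_update(
    keywords=[{"term": "AI", "weight": 0.35}],
    score_delta=0.1,
)
update.topic.connect(topic)      # as done in src/bert/topic_update.py
update.link_to_channel(channel)  # assumed signature
```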

---

### SemanticVector

-- **vector_id**: Unique identifier
-- **semantic_vector**: Aggregated representation of recent message semantics in a channel, preserving privacy by summarizing content instead of storing individual messages.
-- **created_at**: Creation date
+- **vector_id**: Unique identifier for the semantic vector.
+- **semantic_vector**: Aggregated vector summarizing recent message semantics.
+- **created_at**: Timestamp indicating creation.

-> **Explanation**: The SemanticVector node represents a general semantic profile of recent messages in a channel, supporting dynamic topic relevance without storing each message individually. This approach aligns with privacy requirements while allowing for the adjustment of topic relevance.
+**Methods**:
+- `create_semantic_vector`: Creates a new semantic vector.
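
A hedged sketch of creating a vector and attaching it to a channel; the keyword argument name is an assumption, and only the aggregate is stored, never individual messages:

```python
# Sketch only: the keyword argument name is an assumption.
from graph.schema import SemanticVector

vector = SemanticVector.create_semantic_vector(
    semantic_vector=[0.12, -0.03, 0.88],   # aggregated message embedding
)
channel.add_semantic_vector(vector)        # attach it to the channel
```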

---

## Relationships

### ASSOCIATED_WITH (Channel → Topic)

-- **topic_score**: Cumulative or weighted score representing a topic’s importance or relevance to the channel
-- **keywords_weights**: Channel-specific keywords and their weights, highlighting the unique relationship between the channel and topic
-- **message_count**: Number of messages analyzed in relation to the topic
-- **last_updated**: Timestamp of the last update
-- **trend**: Indicator of topic trend over time within the channel
-
-> **Explanation**: This relationship captures the importance of each topic to specific channels, with channel-specific keyword weights providing additional insight into unique topic-channel dynamics. `trend` enables tracking how each topic's relevance changes over time within the channel.
+- **topic_score**: Weighted score indicating a topic's relevance to the channel.
+- **last_updated**: Last time the relationship was updated.
+- **trend**: Trend indication for the topic within the channel.
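
Since the commit moves `schema.py` onto neomodel, one plausible model of this relationship is a `StructuredRel` carrying the three properties above; the actual definition lives in `graph/schema.py`:

```python
# Sketch only: one plausible neomodel definition, not the committed one.
from neomodel import (StructuredRel, FloatProperty, DateTimeProperty,
                      StringProperty)


class AssociatedWithRel(StructuredRel):
    topic_score = FloatProperty(default=0.0)   # topic relevance to channel
    last_updated = DateTimeProperty()          # last modification time
    trend = StringProperty(default="")         # e.g. "New" in topic_update.py
```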

---

### RELATED_TO (Topic ↔ Topic)

-- **similarity_score**: Degree of similarity between two topics
-- **temporal_similarity**: Metric to track similarity over time
-- **co-occurrence_rate**: Frequency of concurrent discussion of topics across channels
-- **common_channels**: Number of shared channels discussing both topics
-- **topic_trend_similarity**: Measure of similarity in topic trends across channels
+- **similarity_score**: Similarity metric between topics.
+- **temporal_similarity**: Measure of similarity persistence over time.
+- **co_occurrence_rate**: Rate of joint appearance in discussions.
+- **common_channels**: Count of shared channels discussing both topics.
+- **topic_trend_similarity**: Trend similarity between topics across channels.
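
With `topic_a` and `topic_b` created as in the Topic sketch above, wiring them together might look like this; the keyword names follow the attribute list, but the exact `relate_to_topic` signature is an assumption:

```python
# Sketch only: keyword names mirror the attributes above; signature assumed.
topic_a.relate_to_topic(
    topic_b,
    similarity_score=0.82,
    temporal_similarity=0.75,
    co_occurrence_rate=0.4,
    common_channels=3,
    topic_trend_similarity=0.6,
)
```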

---

+### HasRel (General Relationship)
+
+This relationship can be used as a generic placeholder for relationships that do not have specific attributes.

> **Note**: The relationships provide both dynamic and static metrics, such as similarity scores and temporal similarity, enabling analytical insights into evolving topic relationships.
77 changes: 51 additions & 26 deletions src/bert/concord.py
@@ -1,47 +1,72 @@
 # concord.py

 from bert.pre_process import preprocess_documents
-from graph.schema import Topic
+from graph.schema import Topic, Channel, Platform


-def concord(
-    topic_model,
-    documents,
-):
-    # Load the dataset and limit to 100 documents
-    print(f"Loaded {len(documents)} documents.")
+def concord(bert_topic, channel_id, platform_id, documents):
+    platform, channel = platform_channel_handler(channel_id, platform_id)

-    # Preprocess the documents
+    # Load and preprocess documents
+    print(f"Loaded {len(documents)} documents.")
     print("Preprocessing documents...")
     documents = preprocess_documents(documents)

-    # Fit the model on the documents
+    # Fit the topic model
     print("Fitting the BERTopic model...")
-    topics, probs = topic_model.fit_transform(documents)
-
-    # Get topic information
-    topic_info = topic_model.get_topic_info()
+    bert_topic.fit(documents)
+    topic_info = bert_topic.get_topic_info()

-    # Print the main topics with importance scores
+    # Log main topics
     print("\nMain Topics with Word Importance Scores:")
     for index, row in topic_info.iterrows():
         topic_id = row['Topic']
         if topic_id == -1:
             continue  # Skip outliers
         topic_freq = row['Count']
-        topic_words = topic_model.get_topic(topic_id)
+        topic_words = bert_topic.get_topic(topic_id)

-        # Prepare a list of formatted word-score pairs
-        word_score_list = [
-            f"{word} ({score:.4f})" for word, score in topic_words
-        ]
+        # Create a list of word-score pairs
+        word_score_list = [{
+            "term": word,
+            "weight": score
+        } for word, score in topic_words]

-        # Join the pairs into a single string
-        word_score_str = ', '.join(word_score_list)
+        # Create or update a Topic node
+        topic = Topic.create_topic(name=f"Topic {topic_id}",
+                                   keywords=word_score_list,
+                                   bertopic_metadata={
+                                       "frequency": topic_freq
+                                   }).save()
+        topic.set_topic_embedding(bert_topic.topic_embeddings_[topic_id])
+        channel.associate_with_topic(topic, channel_score=0.5, trend="")

-        # Print the topic info and the word-score string
         print(f"\nTopic {topic_id} (Frequency: {topic_freq}):")
-        print(f"  {word_score_str}")
+        print(
+            f"  {', '.join([f'{word} ({score:.4f})' for word, score in topic_words])}"
+        )

-    print("\nTopic modeling completed.")
-    return len(documents), Topic.create_topic()
+    print("\nTopic modeling and channel update completed.")
+    return len(documents), None
+
+
+def platform_channel_handler(channel_id, platform_id):
+    platform = Platform.nodes.get_or_none(platform_id=platform_id)
+    if not platform:
+        print(
+            f"Platform with ID '{platform_id}' not found. Creating new platform..."
+        )
+        platform = Platform(platform_id=platform_id).save()
+    channel = Channel.nodes.get_or_none(channel_id=channel_id)
+    if not channel:
+        print(
+            f"Channel with ID '{channel_id}' not found. Creating new channel..."
+        )
+        channel = Channel.create_channel(
+            channel_id=channel_id,
+            name=f"Channel {channel_id}",
+            description="",
+            language="English",
+            activity_score=0.0,
+        ).save()
+    platform.channels.connect(channel)
+    return platform, channel
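
A minimal driver for the new entry point; the model is fitted inside `concord`, and the IDs and documents below are hypothetical:

```python
# Sketch only: hypothetical IDs and documents; BERTopic() with defaults.
from bertopic import BERTopic

from bert.concord import concord

documents = [
    "Transformers keep improving machine translation.",
    "New GPU kernels cut training time in half.",
    # ... more raw messages from the channel
]

n_docs, _ = concord(BERTopic(), "tg-1024", "telegram", documents)
print(f"Processed {n_docs} documents.")
```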
98 changes: 98 additions & 0 deletions src/bert/topic_update.py
@@ -0,0 +1,98 @@
# topic_update.py
from sklearn.metrics.pairwise import cosine_similarity
from datetime import datetime
from graph.schema import Topic, TopicUpdate, Channel

SIMILARITY_THRESHOLD = 0.8
AMPLIFY_INCREMENT = 0.1
DIMINISH_DECREMENT = 0.05
NEW_TOPIC_INITIAL_SCORE = 0.1


def compute_cosine_similarity(vector_a, vector_b):
    return cosine_similarity([vector_a], [vector_b])[0][0]


def update_channel_topics(channel_topics, new_topics, channel_id):
    initial_scores = {
        topic.topic_id: topic.topic_score
        for topic in channel_topics
    }
    topic_updates = []

    for new_topic in new_topics:
        print(
            f"\nProcessing new topic: {new_topic['name']} with weight {new_topic['weight']:.4f}"
        )
        similarities = {
            idx:
            compute_cosine_similarity(new_topic['embedding'],
                                      channel_topic.topic_embedding)
            for idx, channel_topic in enumerate(channel_topics)
        }
        print("Similarity scores:", similarities)

        topic_amplified = False
        for idx, similarity in similarities.items():
            if similarity >= SIMILARITY_THRESHOLD:
                channel_topic = channel_topics[idx]
                original_score = channel_topic.topic_score
                channel_topic.topic_score = min(
                    1, channel_topic.topic_score + AMPLIFY_INCREMENT)
                delta = channel_topic.topic_score - original_score
                channel_topic.updated_at = datetime.utcnow()
                channel_topic.save()
                print(
                    f"Amplifying topic '{channel_topic.name}' from {original_score:.4f} to "
                    f"{channel_topic.topic_score:.4f} (delta = {delta:.4f})")

                topic_update = TopicUpdate.create_topic_update(
                    keywords=channel_topic.keywords, score_delta=delta)
                topic_update.topic.connect(channel_topic)
                topic_updates.append(topic_update)

                topic_amplified = True

        if not topic_amplified:
            print(
                f"Creating new topic '{new_topic['name']}' with initial score {NEW_TOPIC_INITIAL_SCORE:.4f}"
            )
            topic_node = Topic(name=new_topic['name'],
                               topic_embedding=new_topic['embedding'],
                               topic_score=NEW_TOPIC_INITIAL_SCORE,
                               updated_at=datetime.utcnow()).save()
            topic_node.add_update(new_topic.get('keywords', []),
                                  NEW_TOPIC_INITIAL_SCORE)
            Channel.nodes.get(channel_id=channel_id).associate_with_topic(
                topic_node, NEW_TOPIC_INITIAL_SCORE,
                new_topic.get('keywords', []), 1, 'New')
            channel_topics.append(topic_node)

    for channel_topic in channel_topics:
        if channel_topic.name not in [nt['name'] for nt in new_topics]:
            original_score = channel_topic.topic_score
            channel_topic.topic_score = max(
                0, channel_topic.topic_score - DIMINISH_DECREMENT)
            delta = original_score - channel_topic.topic_score
            channel_topic.updated_at = datetime.utcnow()
            channel_topic.save()
            print(
                f"Diminishing topic '{channel_topic.name}' from {original_score:.4f} to "
                f"{channel_topic.topic_score:.4f} (delta = -{delta:.4f})")

            if delta != 0:
                topic_update = TopicUpdate.create_topic_update(
                    keywords=channel_topic.keywords, score_delta=-delta)
                topic_update.topic.connect(channel_topic)
                topic_updates.append(topic_update)

    print("\nUpdated Channel Topics:")
    print("{:<30} {:<15} {:<15}".format("Topic Name", "Initial Score",
                                        "Updated Score"))
    for topic in channel_topics:
        initial_score = initial_scores.get(topic.topic_id,
                                           NEW_TOPIC_INITIAL_SCORE)
        print("{:<30} {:<15.4f} {:<15.4f}".format(topic.name, initial_score,
                                                  topic.topic_score))

    return topic_updates
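
A hedged sketch of the expected input shape: each entry in `new_topics` is a dict with the `name`, `weight`, `embedding`, and optional `keywords` keys read above, while the channel's topic list is fetched however `graph/schema.py` exposes it (the relationship name below is an assumption):

```python
# Sketch only: the `topics` relationship name and vector size are assumptions.
from graph.schema import Channel
from bert.topic_update import update_channel_topics

channel = Channel.nodes.get(channel_id="tg-1024")   # hypothetical ID
channel_topics = list(channel.topics.all())         # assumed relationship name

new_topics = [{
    "name": "gpu training",
    "weight": 0.42,
    "embedding": [0.1] * 384,                       # placeholder vector
    "keywords": [{"term": "gpu", "weight": 0.3}],
}]

updates = update_channel_topics(channel_topics, new_topics, "tg-1024")
```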