
refactor: remove score_decay_rate and update formatting
Removed unused score_decay_rate from schema and documentation.
Cleaned up code formatting and improved consistency across methods.
Updated markdown documentation to align with current schema changes.
Septimus4 committed Nov 7, 2024
1 parent 1d199fe commit 8ae5876
Showing 3 changed files with 40 additions and 52 deletions.
53 changes: 10 additions & 43 deletions docs/db/schema.md
@@ -1,9 +1,3 @@
Here’s an updated markdown version with explanations for `SemanticVector` and `score_decay_rate`:

---

# Graph Representation

## Nodes

### Channel
@@ -24,9 +18,9 @@ Here’s an updated markdown version with explanations for `SemanticVector` and

- **topic_id**: Unique identifier
- **name**: Summary of the topic
- **keywords**: List of key terms with scores
- **overall_score**: Average or cumulative score
- **keywords**: List of key terms with associated weights (e.g., `[{"term": "AI", "weight": 0.35}, {"term": "neural networks", "weight": 0.28}]`)
- **bertopic_metadata**: BERTopic metadata
- **topic_embedding**: Topic embedding vector
- **updated_at**: Last updated timestamp

---
@@ -45,58 +39,31 @@ Here’s an updated markdown version with explanations for `SemanticVector` and
### SemanticVector

- **vector_id**: Unique identifier
- **semantic_vector**: Aggregated representation of recent message semantics in a channel. This vector captures the
summarized, anonymized essence of new content without storing individual messages, aligning with privacy requirements.
- **semantic_vector**: Aggregated representation of recent message semantics in a channel, preserving privacy by summarizing content instead of storing individual messages.
- **created_at**: Creation date

> **Explanation**: The `SemanticVector` represents the semantic profile of recent messages in a channel, allowing
> Concord to adjust topic relevance without storing each message. Each vector aggregates the semantics of recent content
> into a general representation, which can influence the `channel_score` in `ASSOCIATED_WITH` relationships between
> channels and topics. This approach maintains user privacy while updating topic relevance dynamically.
> **Explanation**: The SemanticVector node represents a general semantic profile of recent messages in a channel, supporting dynamic topic relevance without storing each message individually. This approach aligns with privacy requirements while allowing for the adjustment of topic relevance.
---
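The aggregation described above — many messages in, one privacy-preserving vector out — could be sketched as mean pooling over per-message embeddings. This is an illustrative sketch only; `aggregate_semantics` and the toy vectors are not part of the Concord codebase:

```python
from statistics import fmean

def aggregate_semantics(message_embeddings):
    """Mean-pool per-message embeddings into one anonymized semantic vector."""
    if not message_embeddings:
        raise ValueError("need at least one message embedding")
    # zip(*...) groups values per dimension across all messages
    return [fmean(dim) for dim in zip(*message_embeddings)]

vec = aggregate_semantics([[1.0, 0.0], [0.0, 1.0]])
print(vec)  # [0.5, 0.5]
```

Only `vec` would be persisted on the `SemanticVector` node; the individual message embeddings can be discarded immediately, which is what makes the scheme privacy-preserving.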

## Relationships

### ASSOCIATED_WITH (Channel → Topic)

- **channel_score**: Cumulative or weighted score representing a topic’s importance or relevance to the channel
- **keywords_weights**: Channel-specific keywords and their weights, reflecting the unique relationship between the
channel and topic
- **topic_score**: Cumulative or weighted score representing a topic’s importance or relevance to the channel
- **keywords_weights**: Channel-specific keywords and their weights, highlighting the unique relationship between the channel and topic
- **message_count**: Number of messages analyzed in relation to the topic
- **last_updated**: Timestamp of the last update
- **score_decay_rate**: Rate at which `channel_score` decreases over time if no new relevant messages are analyzed. This
decay rate allows topic scores to adjust gradually, so less active or outdated topics diminish in relevance without
active content.
- **trend**: Indicator of topic trend over time within the channel

> **Explanation**: `score_decay_rate` ensures that topics associated with a channel decrease in relevance if no new
> messages support their ongoing importance. This helps maintain an accurate and current reflection of active discussions
> in a channel, giving more weight to trending or frequently discussed topics while allowing older or less relevant topics
> to fade naturally.
> **Explanation**: This relationship captures the importance of each topic to specific channels, with channel-specific keyword weights providing additional insight into unique topic-channel dynamics. `trend` enables tracking how each topic's relevance changes over time within the channel.
---
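With `score_decay_rate` removed, the `trend` property carries the temporal signal on its own. One way such an indicator could be derived from message counts in consecutive time windows — a sketch under stated assumptions, not the repository's actual logic (`classify_trend` and the 10% threshold are ours):

```python
def classify_trend(prev_count, curr_count, threshold=0.1):
    """Label a topic's direction from message counts in two windows."""
    if prev_count == 0:
        return "rising" if curr_count > 0 else "flat"
    change = (curr_count - prev_count) / prev_count
    if change > threshold:
        return "rising"
    if change < -threshold:
        return "falling"
    return "flat"

print(classify_trend(40, 55))  # rising
```

A string label like this slots directly into the `trend` property of the `ASSOCIATED_WITH` relationship.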

### RELATED_TO (Topic ↔ Topic)

- **similarity_score**: Degree of similarity between two topics
- **temporal_similarity**: Time-based similarity metric to track changing topic relationships over time
- **co-occurrence_rate**: Frequency with which two topics are discussed together across channels
- **temporal_similarity**: Metric to track similarity over time
- **co-occurrence_rate**: Frequency of concurrent discussion of topics across channels
- **common_channels**: Number of shared channels discussing both topics
- **topic_trend_similarity**: Similarity in trends or changes in relevance for each topic

```mermaid
graph TD
%% Nodes
Channel["Channel<br>-------------------------<br>channel_id: Unique identifier<br>platform: Platform (e.g., Telegram)<br>name: Name of the channel<br>description: Description of the channel<br>created_at: Creation date<br>active_members_count: Number of active members<br>language: Language of the channel<br>region: Geographical region<br>activity_score: Posting activity score"]
Topic["Topic<br>-------------------------<br>topic_id: Unique identifier<br>name: Summary of the topic<br>keywords: List of key terms with scores<br>overall_score: Average or cumulative score<br>bertopic_metadata: BerTopic metadata<br>updated_at: Last updated timestamp"]
TopicUpdate["TopicUpdate<br>-------------------------<br>update_id: Unique identifier<br>channel_id: Associated channel<br>topic_id: Associated topic<br>keywords: Keywords from the update<br>score_delta: Change in topic score<br>timestamp: Update time"]
SemanticVector["SemanticVector<br>-------------------------<br>vector_id: Unique identifier<br>semantic_vector: Aggregated semantics<br>created_at: Creation date"]
%% Relationships
Channel -.-> ASSOCIATED_WITH["ASSOCIATED_WITH Relationship<br>-------------------------<br>channel_score: Cumulative or weighted score<br>keywords_weights: Channel-specific keywords and weights<br>message_count: Number of messages analyzed<br>last_updated: Timestamp of last update<br>score_decay_rate: Rate of score decay<br>trend: Topic trend over time"] --> Topic
Topic -.-> RELATED_TO["RELATED_TO Relationship<br>-------------------------<br>similarity_score: Degree of similarity<br>temporal_similarity: Time-based similarity<br>co-occurrence_rate: Co-occurrence of keywords<br>common_channels: Number of shared channels<br>topic_trend_similarity: Trend alignment"] --> Topic
TopicUpdate --> Topic
SemanticVector --> Channel
```

---
- **topic_trend_similarity**: Measure of similarity in topic trends across channels
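The `co-occurrence_rate` and `common_channels` properties above can be derived from per-topic channel sets. A minimal sketch — the function name and inputs are illustrative, not from the repository:

```python
def co_occurrence(channels_a, channels_b, total_channels):
    """Return (co-occurrence rate, number of shared channels) for two topics."""
    if total_channels <= 0:
        raise ValueError("total_channels must be positive")
    common = channels_a & channels_b  # channels discussing both topics
    return len(common) / total_channels, len(common)

rate, shared = co_occurrence({"c1", "c2", "c3"}, {"c2", "c3", "c4"}, 10)
print(rate, shared)  # 0.2 2
```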
8 changes: 6 additions & 2 deletions src/bert/concord.py
@@ -1,9 +1,13 @@
# concord.py

from bert.pre_process import preprocess_documents
from graph.schema import Topic


def concord(topic_model, documents):
def concord(
topic_model,
documents,
):
# Report how many documents were loaded
print(f"Loaded {len(documents)} documents.")

@@ -40,4 +44,4 @@ def concord(topic_model, documents):
print(f" {word_score_str}")

print("\nTopic modeling completed.")
return len(documents), None
return len(documents), Topic.create_topic()
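The `word_score_str` printed inside `concord` suggests per-topic word/score pairs in BERTopic's `(word, score)` tuple shape. A hedged sketch of how such pairs could be rendered — the helper name and formatting are our assumptions, not the repository's code:

```python
def format_word_scores(word_scores):
    """Render (word, score) pairs as 'word (0.123)' joined by commas."""
    return ", ".join(f"{word} ({score:.3f})" for word, score in word_scores)

line = format_word_scores([("ai", 0.3521), ("model", 0.2014)])
print(line)  # ai (0.352), model (0.201)
```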
31 changes: 24 additions & 7 deletions src/graph/schema.py
@@ -10,11 +10,10 @@

# Relationship Models
class AssociatedWithRel(StructuredRel):
channel_score = FloatProperty()
topic_score = FloatProperty()
keywords_weights = ArrayProperty()
message_count = IntegerProperty()
last_updated = DateTimeProperty()
score_decay_rate = FloatProperty()
trend = StringProperty()


@@ -60,14 +59,13 @@ def create_channel(cls, platform: str, name: str, description: str,

def associate_with_topic(self, topic: 'Topic', channel_score: float,
keywords_weights: List[str], message_count: int,
score_decay_rate: float, trend: str) -> None:
trend: str) -> None:
self.topics.connect(
topic, {
'channel_score': channel_score,
'keywords_weights': keywords_weights,
'message_count': message_count,
'last_updated': datetime.utcnow(),
'score_decay_rate': score_decay_rate,
'trend': trend
})

@@ -83,8 +81,8 @@ class Topic(StructuredNode):
topic_id = UniqueIdProperty()
name = StringProperty()
keywords = ArrayProperty()
overall_score = FloatProperty()
bertopic_metadata = JSONProperty()
topic_embedding = ArrayProperty()
updated_at = DateTimeProperty(default_now=True)

# Relationships
@@ -96,17 +94,24 @@ class Topic(StructuredNode):

# Wrapper Functions
@classmethod
def create_topic(cls, name: str, keywords: List[str], overall_score: float,
def create_topic(cls, name: str, keywords: List[str],
bertopic_metadata: Dict[str, Any]) -> 'Topic':
"""
Create a new topic node with the given properties.
"""
return cls(name=name,
keywords=keywords,
overall_score=overall_score,
bertopic_metadata=bertopic_metadata).save()

def relate_to_topic(self, other_topic: 'Topic', similarity_score: float,
temporal_similarity: float, co_occurrence_rate: float,
common_channels: int,
topic_trend_similarity: float) -> None:
"""
Create a relationship to another topic with various similarity metrics.
"""
if not isinstance(other_topic, Topic):
raise ValueError("The related entity must be a Topic instance.")
self.related_topics.connect(
other_topic, {
'similarity_score': similarity_score,
@@ -118,10 +123,22 @@ def relate_to_topic(self, other_topic: 'Topic', similarity_score: float,

def add_update(self, update_keywords: List[str],
score_delta: float) -> 'TopicUpdate':
"""
Add an update to the topic with keyword changes and score delta.
"""
update = TopicUpdate.create_topic_update(update_keywords, score_delta)
update.topic.connect(self)
return update

def set_topic_embedding(self, embedding: List[float]) -> None:
"""
Set the topic embedding vector, ensuring all values are floats.
"""
if not all(isinstance(val, float) for val in embedding):
raise ValueError("All elements in topic_embedding must be floats.")
self.topic_embedding = embedding
self.save()


class TopicUpdate(StructuredNode):
update_id = UniqueIdProperty()
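The float check added in `set_topic_embedding` can be exercised without a Neo4j connection. This standalone mirror of the validation (the function name is illustrative; only the check itself comes from the diff) shows which inputs pass:

```python
def validate_topic_embedding(embedding):
    """Mirror of the schema check: every element must be a float."""
    if not all(isinstance(val, float) for val in embedding):
        raise ValueError("All elements in topic_embedding must be floats.")
    return embedding

validate_topic_embedding([0.12, 0.98, -0.33])  # passes
try:
    validate_topic_embedding([1, 2, 3])         # ints are rejected
except ValueError as err:
    print(err)
```

Note that with a strict `isinstance(val, float)` check, integer components such as `0` are rejected; callers would need to pass `0.0` instead.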
