feat: Add Platform management and refine Neo4j initialization
Introduce `Platform` class with relationships and methods in schema.py.
Refactor Neo4j initialization using neomodel in graph.py and update related
functions to handle platforms in concord.py.
sajz authored and Septimus4 committed Nov 8, 2024
1 parent 8ae5876 commit b45e5b6
Showing 11 changed files with 371 additions and 154 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -152,3 +152,6 @@ cython_debug/
/concord/bertopic_model.pkl
/.idea/rust.xml
/nltk_data/
+/concord/dataset_topic_messages.csv
+/topic_model
+/topic_visualization.html
35 changes: 0 additions & 35 deletions .idea/runConfigurations/Concord.xml

This file was deleted.

17 changes: 17 additions & 0 deletions .idea/runConfigurations/Server.xml

Some generated files are not rendered by default.

92 changes: 55 additions & 37 deletions docs/db/schema.md
@@ -2,68 +2,86 @@

### Channel

-- **channel_id**: Unique identifier
-- **platform**: Platform (e.g., Telegram)
-- **name**: Name of the channel
-- **description**: Description of the channel
-- **created_at**: Creation date
-- **active_members_count**: Number of active members
-- **language**: Language of the channel
-- **region**: Geographical region
-- **activity_score**: Posting activity score, indicating channel activity level
+- **channel_id**: Unique identifier for the channel.
+- **name**: Name of the channel.
+- **description**: Brief description of the channel.
+- **created_at**: Timestamp indicating when the channel was created.
+- **language**: Language predominantly used in the channel.
+- **activity_score**: Numerical score representing the activity level in the channel.

+**Methods**:
+- `create_channel`: Creates a new channel with specified details.
+- `associate_with_topic`: Connects a topic to the channel, setting scores and trend.
+- `add_semantic_vector`: Adds a semantic vector to the channel.
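
A short usage sketch of these methods, mirroring the calls in `src/bert/concord.py`; the IDs are hypothetical and the exact signatures live in `graph/schema.py`:

```python
# Sketch only: mirrors calls from src/bert/concord.py; IDs are hypothetical.
from graph.schema import Channel, Topic

channel = Channel.create_channel(
    channel_id="tg-1024",      # hypothetical channel ID
    name="Channel tg-1024",
    description="",
    language="English",
    activity_score=0.0,
).save()

# Link a Topic node with a channel-specific score and trend.
topic = Topic.create_topic(name="Topic 0", keywords=[],
                           bertopic_metadata={}).save()
channel.associate_with_topic(topic, channel_score=0.5, trend="")
```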

---

### Topic

-- **topic_id**: Unique identifier
-- **name**: Summary of the topic
-- **keywords**: List of key terms with associated weights (e.g., `[{"term": "AI", "weight": 0.35}, {"term": "neural networks", "weight": 0.28}]`)
-- **bertopic_metadata**: BerTopic metadata
-- **topic_embedding**: Topic embedding
-- **updated_at**: Last updated timestamp
+- **topic_id**: Unique identifier for the topic.
+- **name**: Summary name of the topic.
+- **keywords**: List of key terms and associated weights (e.g., `[{"term": "AI", "weight": 0.35}]`).
+- **bertopic_metadata**: Metadata from BerTopic processing.
+- **topic_embedding**: Vector embedding for the topic.
+- **updated_at**: Timestamp of the last update.

+**Methods**:
+- `create_topic`: Creates a new topic with specified keywords and metadata.
+- `relate_to_topic`: Relates this topic to another, setting similarity metrics.
+- `add_update`: Adds a topic update with score change and keywords.
+- `set_topic_embedding`: Sets the embedding vector for the topic.
+- `get_topic_embedding`: Retrieves the embedding as a numpy array.
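
The embedding helpers round-trip a vector through the node; a minimal sketch, assuming a 384-dimensional placeholder vector (the real size depends on the BERTopic backend):

```python
# Sketch only: the 384-dim zero vector is a placeholder.
import numpy as np
from graph.schema import Topic

topic = Topic.create_topic(
    name="Topic 7",
    keywords=[{"term": "AI", "weight": 0.35}],
    bertopic_metadata={"frequency": 42},
).save()

topic.set_topic_embedding(np.zeros(384))   # store the vector
embedding = topic.get_topic_embedding()    # returned as np.ndarray
```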

---

### TopicUpdate

-- **update_id**: Unique identifier
-- **channel_id**: Associated channel
-- **topic_id**: Associated topic
-- **keywords**: Keywords from the update
-- **score_delta**: Change in topic score
-- **timestamp**: Update time
+- **update_id**: Unique identifier for the update.
+- **keywords**: Keywords associated with this update.
+- **score_delta**: Numerical change in the topic score.
+- **timestamp**: Time when the update was made.
+- **topic_embedding**: Vector embedding for the topic.

+**Methods**:
+- `create_topic_update`: Creates a new update for a topic.
+- `link_to_channel`: Links this update to a channel.
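
Continuing the sketches above (`topic` and `channel` as created earlier): `create_topic_update` matches its use in `src/bert/topic_update.py`, while the `link_to_channel` signature is an assumption here:

```python
# Sketch only: link_to_channel's argument is assumed to be the Channel node.
from graph.schema import TopicUpdate

update = TopicUpdate.create_topic_update(
    keywords=[{"term": "AI", "weight": 0.35}],
    score_delta=0.1,
)
update.topic.connect(topic)      # as done in src/bert/topic_update.py
update.link_to_channel(channel)  # assumed signature
```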

---

### SemanticVector

-- **vector_id**: Unique identifier
-- **semantic_vector**: Aggregated representation of recent message semantics in a channel, preserving privacy by summarizing content instead of storing individual messages.
-- **created_at**: Creation date
+- **vector_id**: Unique identifier for the semantic vector.
+- **semantic_vector**: Aggregated vector summarizing recent message semantics.
+- **created_at**: Timestamp indicating creation.

-> **Explanation**: The SemanticVector node represents a general semantic profile of recent messages in a channel, supporting dynamic topic relevance without storing each message individually. This approach aligns with privacy requirements while allowing for the adjustment of topic relevance.
+**Methods**:
+- `create_semantic_vector`: Creates a new semantic vector.
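
A hedged sketch of creating a vector and attaching it to a channel; the keyword argument name is an assumption, and only the aggregate is stored, never individual messages:

```python
# Sketch only: the keyword argument name is an assumption.
from graph.schema import SemanticVector

vector = SemanticVector.create_semantic_vector(
    semantic_vector=[0.12, -0.03, 0.88],   # aggregated message embedding
)
channel.add_semantic_vector(vector)        # attach it to the channel
```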

---

## Relationships

### ASSOCIATED_WITH (Channel → Topic)

-- **topic_score**: Cumulative or weighted score representing a topic’s importance or relevance to the channel
-- **keywords_weights**: Channel-specific keywords and their weights, highlighting the unique relationship between the channel and topic
-- **message_count**: Number of messages analyzed in relation to the topic
-- **last_updated**: Timestamp of the last update
-- **trend**: Indicator of topic trend over time within the channel
-
-> **Explanation**: This relationship captures the importance of each topic to specific channels, with channel-specific keyword weights providing additional insight into unique topic-channel dynamics. `trend` enables tracking how each topic's relevance changes over time within the channel.
+- **topic_score**: Weighted score indicating a topic's relevance to the channel.
+- **last_updated**: Last time the relationship was updated.
+- **trend**: Trend indication for the topic within the channel.
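
Since the commit moves `schema.py` onto neomodel, one plausible model of this relationship is a `StructuredRel` carrying the three properties above; the actual definition lives in `graph/schema.py`:

```python
# Sketch only: one plausible neomodel definition, not the committed one.
from neomodel import (StructuredRel, FloatProperty, DateTimeProperty,
                      StringProperty)


class AssociatedWithRel(StructuredRel):
    topic_score = FloatProperty(default=0.0)   # topic relevance to channel
    last_updated = DateTimeProperty()          # last modification time
    trend = StringProperty(default="")         # e.g. "New" in topic_update.py
```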

---

### RELATED_TO (Topic ↔ Topic)

-- **similarity_score**: Degree of similarity between two topics
-- **temporal_similarity**: Metric to track similarity over time
-- **co-occurrence_rate**: Frequency of concurrent discussion of topics across channels
-- **common_channels**: Number of shared channels discussing both topics
-- **topic_trend_similarity**: Measure of similarity in topic trends across channels
+- **similarity_score**: Similarity metric between topics.
+- **temporal_similarity**: Measure of similarity persistence over time.
+- **co_occurrence_rate**: Rate of joint appearance in discussions.
+- **common_channels**: Count of shared channels discussing both topics.
+- **topic_trend_similarity**: Trend similarity between topics across channels.
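
With `topic_a` and `topic_b` created as in the Topic sketch above, wiring them together might look like this; the keyword names follow the attribute list, but the exact `relate_to_topic` signature is an assumption:

```python
# Sketch only: keyword names mirror the attributes above; signature assumed.
topic_a.relate_to_topic(
    topic_b,
    similarity_score=0.82,
    temporal_similarity=0.75,
    co_occurrence_rate=0.4,
    common_channels=3,
    topic_trend_similarity=0.6,
)
```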

---

+### HasRel (General Relationship)
+
+This relationship can be used as a generic placeholder for relationships that do not have specific attributes.

> **Note**: The relationships provide both dynamic and static metrics, such as similarity scores and temporal similarity, enabling analytical insights into evolving topic relationships.
77 changes: 51 additions & 26 deletions src/bert/concord.py
@@ -1,47 +1,72 @@
 # concord.py

 from bert.pre_process import preprocess_documents
-from graph.schema import Topic
+from graph.schema import Topic, Channel, Platform


-def concord(
-    topic_model,
-    documents,
-):
-    # Load the dataset and limit to 100 documents
-    print(f"Loaded {len(documents)} documents.")
+def concord(bert_topic, channel_id, platform_id, documents):
+    platform, channel = platform_channel_handler(channel_id, platform_id)

-    # Preprocess the documents
+    # Load and preprocess documents
+    print(f"Loaded {len(documents)} documents.")
     print("Preprocessing documents...")
     documents = preprocess_documents(documents)

-    # Fit the model on the documents
+    # Fit the topic model
     print("Fitting the BERTopic model...")
-    topics, probs = topic_model.fit_transform(documents)
-
-    # Get topic information
-    topic_info = topic_model.get_topic_info()
+    bert_topic.fit(documents)
+    topic_info = bert_topic.get_topic_info()

-    # Print the main topics with importance scores
+    # Log main topics
     print("\nMain Topics with Word Importance Scores:")
     for index, row in topic_info.iterrows():
         topic_id = row['Topic']
         if topic_id == -1:
             continue  # Skip outliers
         topic_freq = row['Count']
-        topic_words = topic_model.get_topic(topic_id)
+        topic_words = bert_topic.get_topic(topic_id)

-        # Prepare a list of formatted word-score pairs
-        word_score_list = [
-            f"{word} ({score:.4f})" for word, score in topic_words
-        ]
+        # Create a list of word-score pairs
+        word_score_list = [{
+            "term": word,
+            "weight": score
+        } for word, score in topic_words]

-        # Join the pairs into a single string
-        word_score_str = ', '.join(word_score_list)
+        # Create or update a Topic node
+        topic = Topic.create_topic(name=f"Topic {topic_id}",
+                                   keywords=word_score_list,
+                                   bertopic_metadata={
+                                       "frequency": topic_freq
+                                   }).save()
+        topic.set_topic_embedding(bert_topic.topic_embeddings_[topic_id])
+        channel.associate_with_topic(topic, channel_score=0.5, trend="")

-        # Print the topic info and the word-score string
         print(f"\nTopic {topic_id} (Frequency: {topic_freq}):")
-        print(f"  {word_score_str}")
+        print(
+            f"  {', '.join([f'{word} ({score:.4f})' for word, score in topic_words])}"
+        )

-    print("\nTopic modeling completed.")
-    return len(documents), Topic.create_topic()
+    print("\nTopic modeling and channel update completed.")
+    return len(documents), None
+
+
+def platform_channel_handler(channel_id, platform_id):
+    platform = Platform.nodes.get_or_none(platform_id=platform_id)
+    if not platform:
+        print(
+            f"Platform with ID '{platform_id}' not found. Creating new platform..."
+        )
+        platform = Platform(platform_id=platform_id).save()
+    channel = Channel.nodes.get_or_none(channel_id=channel_id)
+    if not channel:
+        print(
+            f"Channel with ID '{channel_id}' not found. Creating new channel..."
+        )
+        channel = Channel.create_channel(
+            channel_id=channel_id,
+            name=f"Channel {channel_id}",
+            description="",
+            language="English",
+            activity_score=0.0,
+        ).save()
+    platform.channels.connect(channel)
+    return platform, channel
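
A minimal driver for the new entry point; the model is fitted inside `concord`, and the IDs and documents below are hypothetical:

```python
# Sketch only: hypothetical IDs and documents; BERTopic() with defaults.
from bertopic import BERTopic

from bert.concord import concord

documents = [
    "Transformers keep improving machine translation.",
    "New GPU kernels cut training time in half.",
    # ... more raw messages from the channel
]

n_docs, _ = concord(BERTopic(), "tg-1024", "telegram", documents)
print(f"Processed {n_docs} documents.")
```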
98 changes: 98 additions & 0 deletions src/bert/topic_update.py
@@ -0,0 +1,98 @@
# topic_update.py
from sklearn.metrics.pairwise import cosine_similarity
from datetime import datetime
from graph.schema import Topic, TopicUpdate, Channel

SIMILARITY_THRESHOLD = 0.8
AMPLIFY_INCREMENT = 0.1
DIMINISH_DECREMENT = 0.05
NEW_TOPIC_INITIAL_SCORE = 0.1


def compute_cosine_similarity(vector_a, vector_b):
    return cosine_similarity([vector_a], [vector_b])[0][0]


def update_channel_topics(channel_topics, new_topics, channel_id):
    initial_scores = {
        topic.topic_id: topic.topic_score
        for topic in channel_topics
    }
    topic_updates = []

    for new_topic in new_topics:
        print(
            f"\nProcessing new topic: {new_topic['name']} with weight {new_topic['weight']:.4f}"
        )
        similarities = {
            idx:
            compute_cosine_similarity(new_topic['embedding'],
                                      channel_topic.topic_embedding)
            for idx, channel_topic in enumerate(channel_topics)
        }
        print("Similarity scores:", similarities)

        topic_amplified = False
        for idx, similarity in similarities.items():
            if similarity >= SIMILARITY_THRESHOLD:
                channel_topic = channel_topics[idx]
                original_score = channel_topic.topic_score
                channel_topic.topic_score = min(
                    1, channel_topic.topic_score + AMPLIFY_INCREMENT)
                delta = channel_topic.topic_score - original_score
                channel_topic.updated_at = datetime.utcnow()
                channel_topic.save()
                print(
                    f"Amplifying topic '{channel_topic.name}' from {original_score:.4f} to "
                    f"{channel_topic.topic_score:.4f} (delta = {delta:.4f})")

                topic_update = TopicUpdate.create_topic_update(
                    keywords=channel_topic.keywords, score_delta=delta)
                topic_update.topic.connect(channel_topic)
                topic_updates.append(topic_update)

                topic_amplified = True

        if not topic_amplified:
            print(
                f"Creating new topic '{new_topic['name']}' with initial score {NEW_TOPIC_INITIAL_SCORE:.4f}"
            )
            topic_node = Topic(name=new_topic['name'],
                               topic_embedding=new_topic['embedding'],
                               topic_score=NEW_TOPIC_INITIAL_SCORE,
                               updated_at=datetime.utcnow()).save()
            topic_node.add_update(new_topic.get('keywords', []),
                                  NEW_TOPIC_INITIAL_SCORE)
            Channel.nodes.get(channel_id=channel_id).associate_with_topic(
                topic_node, NEW_TOPIC_INITIAL_SCORE,
                new_topic.get('keywords', []), 1, 'New')
            channel_topics.append(topic_node)

    for channel_topic in channel_topics:
        if channel_topic.name not in [nt['name'] for nt in new_topics]:
            original_score = channel_topic.topic_score
            channel_topic.topic_score = max(
                0, channel_topic.topic_score - DIMINISH_DECREMENT)
            delta = original_score - channel_topic.topic_score
            channel_topic.updated_at = datetime.utcnow()
            channel_topic.save()
            print(
                f"Diminishing topic '{channel_topic.name}' from {original_score:.4f} to "
                f"{channel_topic.topic_score:.4f} (delta = -{delta:.4f})")

            if delta != 0:
                topic_update = TopicUpdate.create_topic_update(
                    keywords=channel_topic.keywords, score_delta=-delta)
                topic_update.topic.connect(channel_topic)
                topic_updates.append(topic_update)

    print("\nUpdated Channel Topics:")
    print("{:<30} {:<15} {:<15}".format("Topic Name", "Initial Score",
                                        "Updated Score"))
    for topic in channel_topics:
        initial_score = initial_scores.get(topic.topic_id,
                                           NEW_TOPIC_INITIAL_SCORE)
        print("{:<30} {:<15.4f} {:<15.4f}".format(topic.name, initial_score,
                                                  topic.topic_score))

    return topic_updates
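
A hedged sketch of the expected input shape: each entry in `new_topics` is a dict with the `name`, `weight`, `embedding`, and optional `keywords` keys read above, while the channel's topic list is fetched however `graph/schema.py` exposes it (the relationship name below is an assumption):

```python
# Sketch only: the `topics` relationship name and vector size are assumptions.
from graph.schema import Channel
from bert.topic_update import update_channel_topics

channel = Channel.nodes.get(channel_id="tg-1024")   # hypothetical ID
channel_topics = list(channel.topics.all())         # assumed relationship name

new_topics = [{
    "name": "gpu training",
    "weight": 0.42,
    "embedding": [0.1] * 384,                       # placeholder vector
    "keywords": [{"term": "gpu", "weight": 0.3}],
}]

updates = update_channel_topics(channel_topics, new_topics, "tg-1024")
```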