
Commit 7a76336

sajz authored and Septimus4 committed
feat: Add Platform management and refine Neo4j initialization
Introduce `Platform` class with relationships and methods in schema.py. Refactor Neo4j initialization using neomodel in graph.py and update related functions to handle platforms in concord.py.
1 parent 8ae5876 commit 7a76336
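Since `schema.py` itself is not rendered in this view, here is a minimal sketch of what the new `Platform` node could look like in neomodel, inferred from how it is used in `platform_channel_handler` further down. Only `platform_id` and the `channels` relationship are implied by the diff; the `HAS_CHANNEL` relationship type, the `created_at` property, and the reuse of `HasRel` are assumptions:

```python
# Hypothetical sketch of the Platform node added to graph/schema.py.
# Only platform_id and the channels relationship are implied by the
# usage in concord.py; everything else here is an assumption.
from neomodel import (StructuredNode, StructuredRel, StringProperty,
                      DateTimeProperty, RelationshipTo)


class HasRel(StructuredRel):
    """Generic relationship with no specific attributes (see schema.md)."""


class Platform(StructuredNode):
    platform_id = StringProperty(unique_index=True, required=True)
    created_at = DateTimeProperty(default_now=True)  # assumed

    # Platform -> Channel, used as platform.channels.connect(channel)
    channels = RelationshipTo('Channel', 'HAS_CHANNEL', model=HasRel)
```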

File tree

11 files changed: +371 −154 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -152,3 +152,6 @@ cython_debug/
 /concord/bertopic_model.pkl
 /.idea/rust.xml
 /nltk_data/
+/concord/dataset_topic_messages.csv
+/topic_model
+/topic_visualization.html
```

.idea/runConfigurations/Concord.xml

Lines changed: 0 additions & 35 deletions
This file was deleted.

.idea/runConfigurations/Server.xml

Lines changed: 17 additions & 0 deletions
Some generated files are not rendered by default.

docs/db/schema.md

Lines changed: 55 additions & 37 deletions
```diff
@@ -2,68 +2,86 @@
 
 ### Channel
 
-- **channel_id**: Unique identifier
-- **platform**: Platform (e.g., Telegram)
-- **name**: Name of the channel
-- **description**: Description of the channel
-- **created_at**: Creation date
-- **active_members_count**: Number of active members
-- **language**: Language of the channel
-- **region**: Geographical region
-- **activity_score**: Posting activity score, indicating channel activity level
+- **channel_id**: Unique identifier for the channel.
+- **name**: Name of the channel.
+- **description**: Brief description of the channel.
+- **created_at**: Timestamp indicating when the channel was created.
+- **language**: Language predominantly used in the channel.
+- **activity_score**: Numerical score representing the activity level in the channel.
+
+**Methods**:
+- `create_channel`: Creates a new channel with specified details.
+- `associate_with_topic`: Connects a topic to the channel, setting scores and trend.
+- `add_semantic_vector`: Adds a semantic vector to the channel.
 
 ---
 
 ### Topic
 
-- **topic_id**: Unique identifier
-- **name**: Summary of the topic
-- **keywords**: List of key terms with associated weights (e.g., `[{"term": "AI", "weight": 0.35}, {"term": "neural networks", "weight": 0.28}]`)
-- **bertopic_metadata**: BerTopic metadata
-- **topic_embedding**: Topic embedding
-- **updated_at**: Last updated timestamp
+- **topic_id**: Unique identifier for the topic.
+- **name**: Summary name of the topic.
+- **keywords**: List of key terms and associated weights (e.g., `[{"term": "AI", "weight": 0.35}]`).
+- **bertopic_metadata**: Metadata from BerTopic processing.
+- **topic_embedding**: Vector embedding for the topic.
+- **updated_at**: Timestamp of the last update.
+
+**Methods**:
+- `create_topic`: Creates a new topic with specified keywords and metadata.
+- `relate_to_topic`: Relates this topic to another, setting similarity metrics.
+- `add_update`: Adds a topic update with score change and keywords.
+- `set_topic_embedding`: Sets the embedding vector for the topic.
+- `get_topic_embedding`: Retrieves the embedding as a numpy array.
 
 ---
 
 ### TopicUpdate
 
-- **update_id**: Unique identifier
-- **channel_id**: Associated channel
-- **topic_id**: Associated topic
-- **keywords**: Keywords from the update
-- **score_delta**: Change in topic score
-- **timestamp**: Update time
+- **update_id**: Unique identifier for the update.
+- **keywords**: Keywords associated with this update.
+- **score_delta**: Numerical change in the topic score.
+- **timestamp**: Time when the update was made.
+- **topic_embedding**: Vector embedding for the topic.
+
+**Methods**:
+- `create_topic_update`: Creates a new update for a topic.
+- `link_to_channel`: Links this update to a channel.
 
 ---
 
 ### SemanticVector
 
-- **vector_id**: Unique identifier
-- **semantic_vector**: Aggregated representation of recent message semantics in a channel, preserving privacy by summarizing content instead of storing individual messages.
-- **created_at**: Creation date
+- **vector_id**: Unique identifier for the semantic vector.
+- **semantic_vector**: Aggregated vector summarizing recent message semantics.
+- **created_at**: Timestamp indicating creation.
 
-> **Explanation**: The SemanticVector node represents a general semantic profile of recent messages in a channel, supporting dynamic topic relevance without storing each message individually. This approach aligns with privacy requirements while allowing for the adjustment of topic relevance.
+**Methods**:
+- `create_semantic_vector`: Creates a new semantic vector.
 
 ---
 
 ## Relationships
 
 ### ASSOCIATED_WITH (Channel → Topic)
 
-- **topic_score**: Cumulative or weighted score representing a topic’s importance or relevance to the channel
-- **keywords_weights**: Channel-specific keywords and their weights, highlighting the unique relationship between the channel and topic
-- **message_count**: Number of messages analyzed in relation to the topic
-- **last_updated**: Timestamp of the last update
-- **trend**: Indicator of topic trend over time within the channel
-
-> **Explanation**: This relationship captures the importance of each topic to specific channels, with channel-specific keyword weights providing additional insight into unique topic-channel dynamics. `trend` enables tracking how each topic's relevance changes over time within the channel.
+- **topic_score**: Weighted score indicating a topic's relevance to the channel.
+- **last_updated**: Last time the relationship was updated.
+- **trend**: Trend indication for the topic within the channel.
 
 ---
 
 ### RELATED_TO (Topic ↔ Topic)
 
-- **similarity_score**: Degree of similarity between two topics
-- **temporal_similarity**: Metric to track similarity over time
-- **co-occurrence_rate**: Frequency of concurrent discussion of topics across channels
-- **common_channels**: Number of shared channels discussing both topics
-- **topic_trend_similarity**: Measure of similarity in topic trends across channels
+- **similarity_score**: Similarity metric between topics.
+- **temporal_similarity**: Measure of similarity persistence over time.
+- **co_occurrence_rate**: Rate of joint appearance in discussions.
+- **common_channels**: Count of shared channels discussing both topics.
+- **topic_trend_similarity**: Trend similarity between topics across channels.
+
+---
+
+### HasRel (General Relationship)
+
+This relationship can be used as a generic placeholder for relationships that do not have specific attributes.
+
+> **Note**: The relationships provide both dynamic and static metrics, such as similarity scores and temporal similarity, enabling analytical insights into evolving topic relationships.
```
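
For readers unfamiliar with how such relationship properties map onto neomodel, a sketch like the following would express the revised ASSOCIATED_WITH relationship. The property names follow schema.md, but the class name and property types are assumptions, not the commit's actual `schema.py`:

```python
# Hypothetical neomodel rendering of ASSOCIATED_WITH (Channel -> Topic)
# as described in docs/db/schema.md; class name and types are assumed.
from neomodel import (StructuredNode, StructuredRel, StringProperty,
                      FloatProperty, DateTimeProperty, RelationshipTo)


class AssociatedWithRel(StructuredRel):
    topic_score = FloatProperty(default=0.0)   # weighted relevance score
    last_updated = DateTimeProperty(default_now=True)
    trend = StringProperty(default="")         # e.g. "New", "Rising"


class Channel(StructuredNode):
    channel_id = StringProperty(unique_index=True, required=True)
    # Channel -> Topic edges carry the per-channel relevance properties
    topics = RelationshipTo('Topic', 'ASSOCIATED_WITH',
                            model=AssociatedWithRel)
```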

src/bert/concord.py

Lines changed: 51 additions & 26 deletions
```diff
@@ -1,47 +1,72 @@
 # concord.py
-
 from bert.pre_process import preprocess_documents
-from graph.schema import Topic
+from graph.schema import Topic, Channel, Platform
 
 
-def concord(
-    topic_model,
-    documents,
-):
-    # Load the dataset and limit to 100 documents
-    print(f"Loaded {len(documents)} documents.")
+def concord(bert_topic, channel_id, platform_id, documents):
+    platform, channel = platform_channel_handler(channel_id, platform_id)
 
-    # Preprocess the documents
+    # Load and preprocess documents
+    print(f"Loaded {len(documents)} documents.")
     print("Preprocessing documents...")
     documents = preprocess_documents(documents)
 
-    # Fit the model on the documents
+    # Fit the topic model
     print("Fitting the BERTopic model...")
-    topics, probs = topic_model.fit_transform(documents)
+    bert_topic.fit(documents)
+    topic_info = bert_topic.get_topic_info()
 
-    # Get topic information
-    topic_info = topic_model.get_topic_info()
-
-    # Print the main topics with importance scores
+    # Log main topics
     print("\nMain Topics with Word Importance Scores:")
     for index, row in topic_info.iterrows():
         topic_id = row['Topic']
         if topic_id == -1:
             continue  # Skip outliers
         topic_freq = row['Count']
-        topic_words = topic_model.get_topic(topic_id)
+        topic_words = bert_topic.get_topic(topic_id)
 
-        # Prepare a list of formatted word-score pairs
-        word_score_list = [
-            f"{word} ({score:.4f})" for word, score in topic_words
-        ]
+        # Create a list of word-score pairs
+        word_score_list = [{
+            "term": word,
+            "weight": score
+        } for word, score in topic_words]
 
-        # Join the pairs into a single string
-        word_score_str = ', '.join(word_score_list)
+        # Create or update a Topic node
+        topic = Topic.create_topic(name=f"Topic {topic_id}",
+                                   keywords=word_score_list,
+                                   bertopic_metadata={
+                                       "frequency": topic_freq
+                                   }).save()
+        topic.set_topic_embedding(bert_topic.topic_embeddings_[topic_id])
+        channel.associate_with_topic(topic, channel_score=0.5, trend="")
 
-        # Print the topic info and the word-score string
         print(f"\nTopic {topic_id} (Frequency: {topic_freq}):")
-        print(f"  {word_score_str}")
+        print(
+            f"  {', '.join([f'{word} ({score:.4f})' for word, score in topic_words])}"
+        )
+
+    print("\nTopic modeling and channel update completed.")
+    return len(documents), None
+
 
-    print("\nTopic modeling completed.")
-    return len(documents), Topic.create_topic()
+def platform_channel_handler(channel_id, platform_id):
+    platform = Platform.nodes.get_or_none(platform_id=platform_id)
+    if not platform:
+        print(
+            f"Platform with ID '{platform_id}' not found. Creating new platform..."
+        )
+        platform = Platform(platform_id=platform_id).save()
+    channel = Channel.nodes.get_or_none(channel_id=channel_id)
+    if not channel:
+        print(
+            f"Channel with ID '{channel_id}' not found. Creating new channel..."
+        )
+        channel = Channel.create_channel(
+            channel_id=channel_id,
+            name=f"Channel {channel_id}",
+            description="",
+            language="English",
+            activity_score=0.0,
+        ).save()
+    platform.channels.connect(channel)
+    return platform, channel
```
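
For orientation, a driver for the refactored entry point might look like the sketch below. Only the `concord(bert_topic, channel_id, platform_id, documents)` signature comes from this diff; the BERTopic construction, the identifiers, the `load_channel_messages` loader, and the connection string are made up, and neomodel is assumed to be pointed at a running Neo4j instance:

```python
# Hypothetical driver for the refactored concord().
from bertopic import BERTopic
from neomodel import config

from bert.concord import concord

config.DATABASE_URL = "bolt://neo4j:password@localhost:7687"  # assumed

# In practice this would be hundreds of channel messages; BERTopic needs
# a reasonably large corpus to fit its UMAP/HDBSCAN pipeline.
documents = load_channel_messages()  # hypothetical loader

bert_topic = BERTopic()  # default embedding model
doc_count, _ = concord(bert_topic,
                       channel_id="chan-42",    # made-up identifiers
                       platform_id="telegram",
                       documents=documents)
print(f"Processed {doc_count} documents.")
```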

src/bert/topic_update.py

Lines changed: 98 additions & 0 deletions
(New file.)

```python
# topic_update.py
from sklearn.metrics.pairwise import cosine_similarity
from datetime import datetime
from graph.schema import Topic, TopicUpdate, Channel

SIMILARITY_THRESHOLD = 0.8
AMPLIFY_INCREMENT = 0.1
DIMINISH_DECREMENT = 0.05
NEW_TOPIC_INITIAL_SCORE = 0.1


def compute_cosine_similarity(vector_a, vector_b):
    return cosine_similarity([vector_a], [vector_b])[0][0]


def update_channel_topics(channel_topics, new_topics, channel_id):
    initial_scores = {
        topic.topic_id: topic.topic_score
        for topic in channel_topics
    }
    topic_updates = []

    for new_topic in new_topics:
        print(
            f"\nProcessing new topic: {new_topic['name']} with weight {new_topic['weight']:.4f}"
        )
        similarities = {
            idx:
            compute_cosine_similarity(new_topic['embedding'],
                                      channel_topic.topic_embedding)
            for idx, channel_topic in enumerate(channel_topics)
        }
        print("Similarity scores:", similarities)

        topic_amplified = False
        for idx, similarity in similarities.items():
            if similarity >= SIMILARITY_THRESHOLD:
                channel_topic = channel_topics[idx]
                original_score = channel_topic.topic_score
                channel_topic.topic_score = min(
                    1, channel_topic.topic_score + AMPLIFY_INCREMENT)
                delta = channel_topic.topic_score - original_score
                channel_topic.updated_at = datetime.utcnow()
                channel_topic.save()
                print(
                    f"Amplifying topic '{channel_topic.name}' from {original_score:.4f} to "
                    f"{channel_topic.topic_score:.4f} (delta = {delta:.4f})")

                topic_update = TopicUpdate.create_topic_update(
                    keywords=channel_topic.keywords, score_delta=delta)
                topic_update.topic.connect(channel_topic)
                topic_updates.append(topic_update)

                topic_amplified = True

        if not topic_amplified:
            print(
                f"Creating new topic '{new_topic['name']}' with initial score {NEW_TOPIC_INITIAL_SCORE:.4f}"
            )
            topic_node = Topic(name=new_topic['name'],
                               topic_embedding=new_topic['embedding'],
                               topic_score=NEW_TOPIC_INITIAL_SCORE,
                               updated_at=datetime.utcnow()).save()
            topic_node.add_update(new_topic.get('keywords', []),
                                  NEW_TOPIC_INITIAL_SCORE)
            Channel.nodes.get(channel_id=channel_id).associate_with_topic(
                topic_node, NEW_TOPIC_INITIAL_SCORE,
                new_topic.get('keywords', []), 1, 'New')
            channel_topics.append(topic_node)

    for channel_topic in channel_topics:
        if channel_topic.name not in [nt['name'] for nt in new_topics]:
            original_score = channel_topic.topic_score
            channel_topic.topic_score = max(
                0, channel_topic.topic_score - DIMINISH_DECREMENT)
            delta = original_score - channel_topic.topic_score
            channel_topic.updated_at = datetime.utcnow()
            channel_topic.save()
            print(
                f"Diminishing topic '{channel_topic.name}' from {original_score:.4f} to "
                f"{channel_topic.topic_score:.4f} (delta = -{delta:.4f})")

            if delta != 0:
                topic_update = TopicUpdate.create_topic_update(
                    keywords=channel_topic.keywords, score_delta=-delta)
                topic_update.topic.connect(channel_topic)
                topic_updates.append(topic_update)

    print("\nUpdated Channel Topics:")
    print("{:<30} {:<15} {:<15}".format("Topic Name", "Initial Score",
                                        "Updated Score"))
    for topic in channel_topics:
        initial_score = initial_scores.get(topic.topic_id,
                                           NEW_TOPIC_INITIAL_SCORE)
        print("{:<30} {:<15.4f} {:<15.4f}".format(topic.name, initial_score,
                                                  topic.topic_score))

    return topic_updates
```
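
To make the update rule concrete: with the defaults above, an incoming topic whose embedding has cosine similarity ≥ 0.8 to an existing channel topic bumps that topic's score by 0.1 (capped at 1), while channel topics absent from the new batch decay by 0.05 (floored at 0). A quick check of the similarity gate, using made-up vectors rather than real topic embeddings:

```python
# Sanity check of the similarity gate; the vectors are illustrative only.
from sklearn.metrics.pairwise import cosine_similarity

SIMILARITY_THRESHOLD = 0.8

existing = [0.9, 0.1, 0.0]
incoming = [0.8, 0.2, 0.1]

score = cosine_similarity([existing], [incoming])[0][0]
print(f"similarity = {score:.4f}")  # ~0.9838, above the 0.8 threshold
print("amplify" if score >= SIMILARITY_THRESHOLD else "create new topic")
```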
