Commit 96b95d0

Support parsing Json export from Telegram Desktop Client
1 parent 079fb53 commit 96b95d0

6 files changed: +130 -21 lines changed

README.md

Lines changed: 23 additions & 12 deletions
@@ -20,12 +20,13 @@ Can also generate histograms and word clouds from the chat logs.

 ### Support Matrix

-| Platform | Direct Chat | Group Chat |
-|:------------------:|:-----------: |:----------:|
-| Facebook Messenger |||
-| Google Hangouts |||
-| Telegram |||
-| WhatsApp |||
+| Platform | Direct Chat | Group Chat |
+|:-------------------------:|:-----------:|:----------:|
+| Facebook Messenger |||
+| Google Hangouts |||
+| Telegram (API) |||
+| Telegram (Desktop Client) |||
+| WhatsApp |||

 ### Exported data

@@ -76,9 +77,16 @@ Unfortunately, WhatsApp only lets you export your conversations **from your phon
 4. Send chat to yourself eg via Email
 5. Unpack the archive and add the individual .txt files to the folder `./raw_data/whatsapp/`

-### Telegram
+### Telegram (Desktop Client)

-The Telegram API works differently: you will first need to setup Chatistics, then query your chat logs programmatically. This process is documented below. Exporting Telegram chat logs is very fast.
+1. Open Telegram Desktop Client
+2. Open Settings > Export Telegram data
+5. Unpack result.json file to the folder `./raw_data/telegram/`
+
+### Telegram (API)
+
+The Telegram API works differently: you will first need to setup Chatistics, then query your chat logs programmatically.
+This process is documented below. Exporting Telegram chat logs is very fast.

 ## 2. Setup Chatistics

@@ -102,18 +110,21 @@ python parse.py messenger

 # WhatsApp
 python parse.py whatsapp
+
+# Telegram (Desktop Client)
+python parse.py telegram_json
 ```

-### Telegram
+### Telegram (API)
 1. Create your Telegram application to access chat logs ([instructions](https://core.telegram.org/api/obtaining_api_id)).
 You will need `api_id` and `api_hash` which we will now set as environment variables.
 2. Run `cp secrets.sh.example secrets.sh` and fill in the values for the environment variables `TELEGRAM_API_ID`, `TELEGRAMP_API_HASH` and `TELEGRAM_PHONE` (your phone number including country code).
 3. Run `source secrets.sh`
-4. Execute the parser script using `python parse.py telegram`
+4. Execute the parser script using `python parse.py telegram_api`

 The pickle files will now be ready for analysis in the `data` folder!

-For more options use the `-h` argument on the parsers (e.g. `python parse.py telegram --help`).
+For more options use the `-h` argument on the parsers (e.g. `python parse.py telegram_api --help`).


 ## 3. All done! Play with your data
@@ -144,7 +155,7 @@ Among other options you can filter messages as needed (also see `python visualiz

 ```
 --platforms {telegram,whatsapp,messenger,hangouts}
-Use data only from certain platforms (default: ['telegram', 'whatsapp', 'messenger', 'hangouts'])
+Use data only from certain platforms (default: ['telegram_api', 'telegram_json', 'whatsapp', 'messenger', 'hangouts'])
 --filter-conversation
 Limit by conversations with this person/group (default: [])
 --filter-sender

config.yml

Lines changed: 5 additions & 2 deletions
@@ -15,9 +15,12 @@ hangouts:
 messenger:
   DEFAULT_RAW_LOCATION: 'raw_data/messenger'
   OUTPUT_PICKLE_NAME: 'messenger.pkl'
-telegram:
+telegram_api:
   USER_DIALOG_MESSAGES_LIMIT: 100000
-  OUTPUT_PICKLE_NAME: 'telegram.pkl'
+  OUTPUT_PICKLE_NAME: 'telegram_api.pkl'
+telegram_json:
+  DEFAULT_RAW_LOCATION: 'raw_data/telegram/result.json'
+  OUTPUT_PICKLE_NAME: 'telegram_json.pkl'
 whatsapp:
   DEFAULT_RAW_LOCATION: 'raw_data/whatsapp'
   OUTPUT_PICKLE_NAME: 'whatsapp.pkl'

parse.py

Lines changed: 15 additions & 5 deletions
@@ -8,7 +8,8 @@
 python parse.py <command> [<args>]

 Available commands:
-telegram Parse logs from telegram
+telegram_api Parse logs from telegram (api)
+telegram_json Parse logs from telegram (desktop client)
 hangouts Parse logs from hangouts
 messenger Parse logs from messenger
 whatsapp Parse logs from whatsapp
@@ -41,15 +42,24 @@ def __init__(self):
             sys.exit(1)
         getattr(self, args.command)()

-    def telegram(self):
-        from parsers.telegram import main
-        parser = ArgParseDefault(description='Parse message logs from Telegram')
+    def telegram_api(self):
+        from parsers.telegram_api import main
+        parser = ArgParseDefault(description='Parse message logs from Telegram (API)')
         parser = add_common_parse_arguments(parser)
-        parser.add_argument('--max-dialog', dest='max_dialog', type=int, default=config['telegram']['USER_DIALOG_MESSAGES_LIMIT'],
+        parser.add_argument('--max-dialog', dest='max_dialog', type=int, default=config['telegram_api']['USER_DIALOG_MESSAGES_LIMIT'],
                             help='Maximum number of messages to export per dialog')
         args = parser.parse_args(sys.argv[2:])
         main(args.own_name, max_exported_messages=args.max, user_dialog_messages_limit=args.max_dialog)

+    def telegram_json(self):
+        from parsers.telegram_json import main
+        parser = ArgParseDefault(description='Parse message logs from Telegram (Desktop Client)')
+        parser = add_common_parse_arguments(parser)
+        parser.add_argument('-f', '--file-path', dest='file_path', default=config['telegram_json']['DEFAULT_RAW_LOCATION'],
+                            help='Path to Telegram chat log file (json file)')
+        args = parser.parse_args(sys.argv[2:])
+        main(args.own_name, args.file_path, args.max)
+
     def hangouts(self):
         from parsers.hangouts import main
         parser = ArgParseDefault(description='Parse message logs from Google Hangouts')
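
Review note on the parse.py change: the new `telegram_json` method becomes a subcommand only because `__init__` resolves the command name with `getattr(self, args.command)()`. The sketch below shows that dispatch pattern in isolation. It is not the repository's code: the class name `Dispatcher`, the returned dict, and the use of plain `argparse.ArgumentParser` in place of the project's `ArgParseDefault` are assumptions made for illustration.

```python
import argparse


class Dispatcher:
    """Hypothetical stand-in for the command class in parse.py."""

    def run(self, argv):
        # The first positional argument selects the subcommand.
        parser = argparse.ArgumentParser(usage='parse.py <command> [<args>]')
        parser.add_argument('command', help='Subcommand to run')
        args = parser.parse_args(argv[:1])
        if not hasattr(self, args.command):
            raise SystemExit(f'Unrecognized command: {args.command}')
        # Look up the method named after the command and pass it the
        # remaining arguments.
        return getattr(self, args.command)(argv[1:])

    def telegram_json(self, argv):
        parser = argparse.ArgumentParser(description='Parse Telegram Desktop export')
        parser.add_argument('-f', '--file-path', dest='file_path',
                            default='raw_data/telegram/result.json',
                            help='Path to Telegram chat log file (json file)')
        args = parser.parse_args(argv)
        return {'command': 'telegram_json', 'file_path': args.file_path}


result = Dispatcher().run(['telegram_json', '-f', 'export/result.json'])
default_result = Dispatcher().run(['telegram_json'])
```

With this pattern, renaming a method (here `telegram` to `telegram_api`) also renames the command, so the old `python parse.py telegram` invocation stops resolving unless an alias method is kept.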

parsers/telegram.py renamed to parsers/telegram_api.py

Lines changed: 1 addition & 1 deletion
@@ -61,7 +61,7 @@ async def _main_loop(client):
     df['platform'] = 'telegram'
     log.info('Detecting languages...')
     df = detect_language(df)
-    export_dataframe(df, config['telegram']['OUTPUT_PICKLE_NAME'])
+    export_dataframe(df, config['telegram_api']['OUTPUT_PICKLE_NAME'])
     log.info('Done.')



parsers/telegram_json.py

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
+from parsers.config import config
+from parsers.utils import export_dataframe, detect_language
+from dateutil.parser import parse
+import json
+import pandas as pd
+import logging
+from collections import defaultdict
+import os
+
+log = logging.getLogger(__name__)
+
+
+def main(own_name, file_path, max_exported_messages):
+    global MAX_EXPORTED_MESSAGES
+    MAX_EXPORTED_MESSAGES = max_exported_messages
+    log.info('Parsing Google Hangouts data...')
+    if not os.path.isfile(file_path):
+        log.error(f'No input file under {file_path}')
+        exit(0)
+    archive = read_archive(file_path)
+    if own_name is None:
+        own_name = " ".join([archive["personal_information"]["first_name"], archive["personal_information"]["last_name"]])
+    own_id = archive["personal_information"]["user_id"]
+    data = parse_messages(archive, own_id)
+    log.info('{:,} messages parsed.'.format(len(data)))
+    if len(data) < 1:
+        log.info('Nothing to save.')
+        exit(0)
+    log.info('Converting to DataFrame...')
+    df = pd.DataFrame(data, columns=config['ALL_COLUMNS'])
+    df['platform'] = 'telegram'
+    log.info('Detecting languages...')
+    df = detect_language(df)
+    export_dataframe(df, config['telegram_json']['OUTPUT_PICKLE_NAME'])
+    log.info('Done.')
+
+
+def parse_messages(archive, own_id):
+    def json_to_text(data):
+        result = ""
+        for v in data:
+            if isinstance(v, dict):
+                result += v["text"]
+            else:
+                result += v
+        return result
+
+    data = []
+    log.info('Extracting messages...')
+    for chat in archive["chats"]["list"]:
+        chat_type = chat["type"]
+        if chat_type == "personal_chat" or chat_type == "private_group" or chat_type == "private_supergroup":
+            conversation_with_id = chat["id"]
+            conversation_with_name = chat["name"]
+            for message in chat["messages"]:
+                if message["type"] != "message":
+                    continue
+                timestamp = parse(message["date"]).timestamp()
+                # skip text from forwarded messages
+                text = message["text"] if "forwarded_from" not in message else ""
+                if "sticker_emoji" in message:
+                    text = message["sticker_emoji"]
+                if isinstance(text, list):
+                    text = json_to_text(text)
+                sender_name = message["from"]
+                sender_id = message["from_id"]
+                if sender_name is None:
+                    # unknown sender
+                    log.error(f"No senderName could be found for senderId ({sender_id})")
+
+                # saves the message
+                outgoing = sender_id == own_id
+                data += [[timestamp, conversation_with_id, conversation_with_name, sender_name, outgoing, text, '', '']]
+
+                if len(data) >= MAX_EXPORTED_MESSAGES:
+                    log.warning(f'Reached max exported messages limit of {MAX_EXPORTED_MESSAGES}. Increase limit in order to parse all messages.')
+                    return data
+    return data
+
+
+def read_archive(file_path):
+    log.info(f'Reading archive file {file_path}...')
+    with open(file_path, encoding='utf-8') as f:
+        archive = json.loads(f.read())
+    return archive
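
Review note on parsers/telegram_json.py: the parser depends on two details of the export. The `text` field may be either a plain string or a list mixing strings with formatted fragments, and only three chat types are kept. The sketch below walks a tiny hand-written archive to show both. The sample data, the names `flatten_text` and `KEPT_CHAT_TYPES`, and the `public_channel` entry are illustrative assumptions; only the keys and the three kept chat types are taken from the diff above.

```python
def flatten_text(text):
    """Join a Telegram Desktop `text` field into one plain string."""
    if isinstance(text, str):
        return text
    parts = []
    for item in text:
        # Formatted fragments (links, bold, ...) are dicts with a "text" key;
        # plain fragments are already strings.
        parts.append(item["text"] if isinstance(item, dict) else item)
    return "".join(parts)


# Hypothetical sample shaped like the fields the committed parser reads.
sample_archive = {
    "personal_information": {"user_id": 1, "first_name": "Ada", "last_name": "L"},
    "chats": {"list": [
        {"type": "personal_chat", "id": 42, "name": "Bob", "messages": [
            {"type": "message", "date": "2019-05-01T10:00:00", "from": "Ada L",
             "from_id": 1, "text": ["see ", {"type": "link", "text": "example.org"}]},
            {"type": "service", "date": "2019-05-01T10:01:00"},
            {"type": "message", "date": "2019-05-01T10:02:00", "from": "Bob",
             "from_id": 2, "text": "ok"},
        ]},
        # Illustrative chat type outside the allow-list; it is skipped.
        {"type": "public_channel", "id": 7, "name": "News", "messages": []},
    ]},
}

KEPT_CHAT_TYPES = {"personal_chat", "private_group", "private_supergroup"}

own_id = sample_archive["personal_information"]["user_id"]
rows = []
for chat in sample_archive["chats"]["list"]:
    if chat["type"] not in KEPT_CHAT_TYPES:
        continue
    for message in chat["messages"]:
        if message["type"] != "message":
            continue  # service entries carry no sender or text
        outgoing = message["from_id"] == own_id
        rows.append((chat["name"], message["from"], outgoing,
                     flatten_text(message["text"])))
```

A set membership test stands in for the chained `==` comparisons of the commit; the behavior is the same for these three types. Separately, the committed `main` still logs `'Parsing Google Hangouts data...'`, which looks like a leftover from the Hangouts parser.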

utils.py

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ def __init__(self, **kwargs):

 def add_load_data_args(parser):
     """Adds common data loader arguments to arg parser"""
-    platforms = ['telegram', 'whatsapp', 'messenger', 'hangouts']
+    platforms = ['telegram_api', 'telegram_json', 'whatsapp', 'messenger', 'hangouts']
     parser.add_argument('-p', '--platforms', default=platforms, choices=platforms, nargs='+', help='Use data only from certain platforms')
     parser.add_argument('--filter-conversation', dest='filter_conversation', nargs='+', default=[], help='Limit by conversations with this person/group')
     parser.add_argument('--filter-sender', dest='filter_sender', nargs='+', default=[], help='Limit by messages by this sender')
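
Review note on the utils.py change: because `platforms` is passed as both `default` and `choices`, renaming the entries changes which values `-p/--platforms` accepts. The sketch below reproduces only that one argument with the standard `argparse` module to show the effect. The variable names `all_platforms`, `telegram_only` and `old_name_accepted` are made up for this example.

```python
import argparse

platforms = ['telegram_api', 'telegram_json', 'whatsapp', 'messenger', 'hangouts']

parser = argparse.ArgumentParser()
parser.add_argument('-p', '--platforms', default=platforms, choices=platforms,
                    nargs='+', help='Use data only from certain platforms')

# Without -p, the full default list is used.
all_platforms = parser.parse_args([]).platforms

# Both Telegram sources can be selected together.
telegram_only = parser.parse_args(['-p', 'telegram_api', 'telegram_json']).platforms

# The pre-commit name is no longer among the choices; argparse reports an
# error and raises SystemExit.
try:
    parser.parse_args(['-p', 'telegram'])
    old_name_accepted = True
except SystemExit:
    old_name_accepted = False
```

Scripts that previously passed `--platforms telegram` would need to switch to `telegram_api`, `telegram_json`, or both.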
