#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import json
import os
import re
# error check
if len(sys.argv) < 2:
    sys.stderr.write("USAGE: ./dumpFileIntoDir.py [out directory]\n")
    sys.exit(1)
root = ''.join([sys.argv[1], '/'])
if os.path.exists(root):
    sys.stderr.write("%s already exists\n" % root)
    sys.exit(1)
# read file into json object
with open('top5people-sentbox.json') as f:
    doc = json.load(f)
"""Main Block
Usage: ./dumpFileIntoDir.py [output directory]
This code reads the dataset from the file and dumps its contents into
directories under the directory given as an argument. Each directory is a label
(here, a recipient), and instances are placed in these directories. File names
are the unique message IDs used in the Enron database. Only the 'TO' list is
considered.
Because of the many file system operations, the code runs quite slowly
(it took several minutes on my laptop). I guess directory operations are really
slow. In the future, we might have to figure out a way to write a feature file
directly rather than building a directory structure and importing it into a
feature file.
The code assumes that the file 'top5people-sentbox.json' exists in the same
directory. The dataset size amounts to approximately half of the entire dataset.
Currently, the directory structure does not reflect the date. However, the file
'top5people-sentbox.json' also includes the date information, and I can modify
this code later to generate time-sensitive training and test sets.
"""
os.mkdir(root)
# .items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
for sender, messages in doc.items():
    # for each sender, we make a separate dataset
    sname = re.match(r'(.*)@.*', sender).group(1).strip()
    sname = re.sub(' ', '', sname)
    datadir = ''.join([root, sname, '/'])
    os.mkdir(datadir)
    for m in messages:
        if m['rtype'] == 'TO':
            # create a label directory if it does not exist
            rname = re.match(r'(.*)@.*', m['recipient']).group(1).strip()
            rname = re.sub(' ', '', rname)
            labeldir = ''.join([datadir, rname, '/'])
            if not os.path.exists(labeldir):
                os.mkdir(labeldir)
            filename = ''.join([labeldir, str(m['mid'])])
            with open(filename, 'w') as f:
                f.write(m['body'])
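

# --- Sketch: writing one feature file instead of a directory tree ---
# The docstring above suggests writing a feature file directly to avoid slow
# directory operations. The function below is a minimal sketch of that idea
# and is not part of the original script. The name write_feature_file, the
# tab-separated layout, and the JSON-encoded body column are assumptions made
# for illustration; only the keys 'rtype', 'recipient', 'mid', and 'body'
# come from the code above.
import io
import json
import re


def write_feature_file(doc, out_path):
    """Write one line per 'TO' message: sender, label, message id, body."""
    def local_part(address):
        # Same normalization as the main block: keep the text before '@'
        # and drop spaces.
        return re.match(r'(.*)@.*', address).group(1).strip().replace(' ', '')

    count = 0
    with io.open(out_path, 'w', encoding='utf-8') as out:
        for sender, messages in doc.items():
            for m in messages:
                if m['rtype'] != 'TO':
                    continue
                # json.dumps escapes newlines and tabs, so each message
                # stays on a single line of the output file.
                row = [local_part(sender), local_part(m['recipient']),
                       str(m['mid']), json.dumps(m['body'])]
                out.write(u'\t'.join(row) + u'\n')
                count += 1
    # Number of rows written, useful as a quick sanity check.
    return count

# Possible usage (hypothetical output name): write_feature_file(doc, 'features.tsv')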