Sessionization, or session identification, is the process of identifying a collection of consecutive requests made by one user to a website, known as a session, from data saved by the web server, e.g., a web log. Sessionization makes it possible to study users' trends and usage patterns and helps develop a successful business strategy.
This pipeline, which is written in standard C++, provides a simple, fast, scalable and robust way to identify sessions from large-scale web logs. For example, it takes only 2 minutes on my laptop to identify the sessions in log20170630.csv (caution: large file!).
Given a real-time data streaming service, the pipeline therefore allows a real-time analysis of how users are accessing a website, including how long they stay and the number of documents they access during their visit.
log20170630 is one of the web logs of the Electronic Data Gathering, Analysis and Retrieval (EDGAR) system maintained by the Securities and Exchange Commission (SEC). We also manually constructed small CSV files in a similar format to test the code.
A user session starts when an IP address first requests a document from the EDGAR system and continues as long as the same user keeps making requests. The session ends after a set period of inactivity. An example of sessionization is illustrated below:
Figure. 1 Example of session identifications
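Given an inactivity limit of 2 seconds, the identified sessions would read as follows. Each line lists the IP address, the start and end timestamps of the session, its duration in seconds (counting both endpoints), and the number of documents requested.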
101.81.133.jja,2017-06-30 00:00:00,2017-06-30 00:00:00,1,1
108.91.91.hbc,2017-06-30 00:00:01,2017-06-30 00:00:01,1,1
107.23.85.jfd,2017-06-30 00:00:00,2017-06-30 00:00:03,4,4
106.120.173.jie,2017-06-30 00:00:02,2017-06-30 00:00:02,1,1
107.178.195.aag,2017-06-30 00:00:02,2017-06-30 00:00:04,3,2
108.91.91.hbc,2017-06-30 00:00:04,2017-06-30 00:00:04,1,1
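The fourth column in the listing above is consistent with an inclusive duration, i.e. end minus start plus one second (00:00:00 to 00:00:03 gives 4). The following is a minimal sketch of that arithmetic, assuming timestamps in the `YYYY-MM-DD HH:MM:SS` form shown above. The helper names `days_from_civil`, `to_seconds` and `inclusive_duration` are hypothetical and not taken from the pipeline's source.

```cpp
#include <cstdio>
#include <string>

// Days since 1970-01-01 for a Gregorian calendar date (standard civil-date arithmetic).
long long days_from_civil(int y, int m, int d) {
    y -= (m <= 2) ? 1 : 0;                                  // treat Jan/Feb as part of the previous year
    const long long era = (y >= 0 ? y : y - 399) / 400;     // 400-year cycle index
    const long long yoe = y - era * 400;                    // year of era, 0..399
    const int mp = (m > 2) ? m - 3 : m + 9;                 // March = 0 ... February = 11
    const long long doy = (153 * mp + 2) / 5 + d - 1;       // day of the March-based year
    const long long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + doe - 719468;
}

// Parse "YYYY-MM-DD HH:MM:SS" into seconds since the epoch; false on malformed input.
bool to_seconds(const std::string& ts, long long& out) {
    int y, mo, d, h, mi, s;
    if (std::sscanf(ts.c_str(), "%d-%d-%d %d:%d:%d", &y, &mo, &d, &h, &mi, &s) != 6) {
        return false;
    }
    out = days_from_civil(y, mo, d) * 86400LL + h * 3600LL + mi * 60LL + s;
    return true;
}

// Inclusive session duration in seconds; returns -1 if either timestamp cannot be parsed.
long long inclusive_duration(const std::string& start, const std::string& end) {
    long long a = 0, b = 0;
    if (!to_seconds(start, a) || !to_seconds(end, b)) return -1;
    return b - a + 1;
}
```

Computing the day number by hand avoids any dependence on the local time zone, which matters when a session spans midnight.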
We use a doubly linked list to store pending sessions; each element in the list is a session pointer. A doubly linked list lets us append and remove individual sessions in constant time, which keeps the process efficient. We use an unordered map to store the key-value pair <session_ip, list_iterator>, where the iterator points to the list element associated with session_ip. This way we can quickly update existing sessions without an inefficient lookup in the list.
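As a rough illustration of this layout, the sketch below pairs a `std::list` of session pointers with a `std::unordered_map` from IP to list iterator. The `Session` fields and the helpers `add_session` and `remove_session` are assumptions made for the example; the actual class and member names in the pipeline may differ.

```cpp
#include <iterator>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical session record; field names are illustrative only.
struct Session {
    std::string ip;
    long long start;  // time of the first request, in seconds
    long long end;    // time of the most recent request, in seconds
    int count;        // number of requests seen so far
};

using SessionList = std::list<std::unique_ptr<Session>>;
using SessionIndex = std::unordered_map<std::string, SessionList::iterator>;

// Append a new session at the back of the list and record its position in the map.
void add_session(SessionList& sessions, SessionIndex& index,
                 const std::string& ip, long long time) {
    sessions.push_back(std::unique_ptr<Session>(new Session{ip, time, time, 1}));
    index[ip] = std::prev(sessions.end());
}

// Remove a session through its stored iterator: an average O(1) map lookup
// followed by an O(1) list erase, with no traversal of the list.
bool remove_session(SessionList& sessions, SessionIndex& index, const std::string& ip) {
    auto found = index.find(ip);
    if (found == index.end()) return false;
    sessions.erase(found->second);
    index.erase(found);
    return true;
}
```

Storing iterators is safe here because `std::list` iterators stay valid until the element they refer to is erased, regardless of insertions or removals elsewhere in the list.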
- For each new event retrieved from the log, we use its timestamp and the fixed inactivity period to pop expired sessions from the front of the existing list (which is empty for the first event). These sessions are printed out and erased from both the list and the map.
- We then integrate the new event into a session in the list as follows:
  - Case 1: the list already holds a pending session for the new event's IP. We update that session's end timestamp and page count, and move it to the end of the list.
  - Case 2: the list holds no pending session for this IP. We create a new session from the event's properties and append it to the end of the list.
- We update the map to keep track of the iterator of this pending session.
- We move on to the next event in the log and return to the first step. When there are no more incoming events, e.g., at end-of-file, we print out the remaining pending sessions.
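The steps above can be sketched as a small class. This is an illustration under stated assumptions, not the pipeline's actual code: `Event`, `Session`, `Sessionizer`, `push` and `finish` are hypothetical names, times are plain seconds, and sessions are stored by value rather than through pointers to keep the example short. It also uses `std::list::splice` to move an updated session to the back, in place of an explicit remove-and-append; since `splice` keeps the iterator valid, the map entry needs no refresh in case 1.

```cpp
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

struct Event { std::string ip; long long time; };  // hypothetical parsed log record
struct Session { std::string ip; long long start; long long end; int count; };

class Sessionizer {
public:
    explicit Sessionizer(long long inactivity) : inactivity_(inactivity) {}

    // Feed one event; returns the sessions that had expired by the event's timestamp.
    std::vector<Session> push(const Event& e) {
        std::vector<Session> expired;
        // The list is ordered by end time, so expired sessions sit at the front.
        while (!pending_.empty() && e.time - pending_.front().end > inactivity_) {
            index_.erase(pending_.front().ip);
            expired.push_back(pending_.front());
            pending_.pop_front();
        }
        auto found = index_.find(e.ip);
        if (found != index_.end()) {
            // Case 1: update the pending session and move it to the back.
            auto it = found->second;
            it->end = e.time;
            it->count += 1;
            pending_.splice(pending_.end(), pending_, it);
        } else {
            // Case 2: start a new session at the back and index it.
            pending_.push_back(Session{e.ip, e.time, e.time, 1});
            index_[e.ip] = std::prev(pending_.end());
        }
        return expired;
    }

    // End of input: hand back every pending session, in list order.
    std::vector<Session> finish() {
        std::vector<Session> rest(pending_.begin(), pending_.end());
        pending_.clear();
        index_.clear();
        return rest;
    }

private:
    long long inactivity_;
    std::list<Session> pending_;
    std::unordered_map<std::string, std::list<Session>::iterator> index_;
};
```

A caller would invoke `push` once per log line, print whatever it returns, and call `finish` after the last line. A session counts as expired only when the gap strictly exceeds the inactivity period, matching the description of expiry in this document.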
For example, imagine we have built the following doubly linked list from incoming events:
Figure. 2 Example of storing active sessions in a doubly linked list.
If the IP of the next incoming event already appears in the session list, we remove session ip1 from its previous location and append it at the end of the list with an updated end time:
Figure. 3 Update existing session in the session list.
When the next event comes in with a timestamp ts4, if (ts4-te2) and (ts4-te3) both exceed the inactivity limit, sessions ip2 and ip3 are identified as expired. We pop them one by one from the head of the list, and the remainder looks like this:
Figure. 4 A session list after popping out the expired sessions.
- Easy to identify expired sessions: with the implementation above, sessions in the list are sorted by end timestamp, so for any given timestamp the expired sessions can simply be printed from the front.
- Memory-efficient: the list stores only pending sessions, not expired ones, and holds at most one session per user IP.
- Dynamic processing: the map gives direct access to any list element, so it can be erased without searching the list.
- Handling edge cases: a sorting function is applied to each set of expired sessions before printing, which handles the edge case where several sessions share the same last update time.
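The sort key used for that last point is not stated here. The sketch below assumes that a batch of sessions is ordered by start time, with ties left in list order via `std::stable_sort`; this assumption reproduces the order of the last four rows in the session listing near the top of this document. `Session` and `order_for_output` are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Session { std::string ip; long long start; long long end; int count; };

// Order a batch of sessions for printing. ASSUMPTION: the key is the start time;
// stable_sort leaves sessions with equal start times in their incoming order.
void order_for_output(std::vector<Session>& batch) {
    std::stable_sort(batch.begin(), batch.end(),
                     [](const Session& a, const Session& b) { return a.start < b.start; });
}
```

For instance, a batch held in end-time order as jie (start 2, end 2), jfd (start 0, end 3), aag (start 2, end 4), hbc (start 4, end 4) comes out as jfd, jie, aag, hbc.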
Using the pipeline takes two simple steps:
- Compiling: at the top level of the package, execute make all to compile the code. By default, the executable (SGenerator) is stored in ./bin.
- Running: at the top level, type ./run.sh to run the pipeline.
By default, the input (log.csv) and parameter (e.g., inactivity_period.txt) files are under ./input, and the generated sessions (sessionization.csv) are stored in the ./output directory. It is easy to change run.sh to process different input files.
In run.sh, you can specify different input, parameter, and output files:
./bin/SGenerator -i your_path_to_input/your_input_file -p your_path_to_param/your_param_file -o your_path_to_output/your_output_file
Go to the ./insight_testsuite directory and execute run_tests.sh after compiling the code with make all as described in the previous section.
There are four tests included:
- Sample data, as shown in Figure 1.
- A single session with multiple events.
- Multiple sessions with the same start/end times.
- A different inactivity period.
- C++11
- make