Analyzing Stack Overflow data
This article is a description of a program written in Clojure which does simple analysis on data from Stack Overflow. The code can be found on github here. I hope someone finds it useful or interesting. Feedback is always welcome!
Stack Overflow provides data of all questions and answers that have
been posted on the site. It is released under a Creative Commons
license. Every question is tagged with tags describing its
contents. These tags will be analyzed in this program. The final
"product" will be a function, comptags, which, given an input-stream
and some tags, will generate a sequence of maps which can be analyzed
further or, as here, printed to the screen:
> (-> (input-stream "resources/posts.xml")
(comptags "iphone" "android" "blackberry")
print-reports)
Aug 2008 volume: 23 android: 21.7% blackberry: 0.0% iphone: 78.3%
Sep 2008 volume: 99 android: 9.1% blackberry: 9.1% iphone: 81.8%
Oct 2008 volume: 207 android: 5.3% blackberry: 1.0% iphone: 93.7%
Nov 2008 volume: 216 android: 4.2% blackberry: 6.0% iphone: 89.8%
Dec 2008 volume: 231 android: 3.0% blackberry: 4.3% iphone: 92.6%
Jan 2009 volume: 322 android: 10.9% blackberry: 2.5% iphone: 86.6%
Feb 2009 volume: 422 android: 11.8% blackberry: 2.1% iphone: 86.0%
Mar 2009 volume: 652 android: 7.5% blackberry: 1.5% iphone: 91.0%
Apr 2009 volume: 725 android: 6.5% blackberry: 2.3% iphone: 91.2%
May 2009 volume: 1149 android: 6.9% blackberry: 4.0% iphone: 89.1%
Jun 2009 volume: 1300 android: 10.3% blackberry: 2.7% iphone: 87.0%
Jul 2009 volume: 1702 android: 10.0% blackberry: 3.6% iphone: 86.3%
Aug 2009 volume: 1863 android: 9.0% blackberry: 4.0% iphone: 87.0%
Sep 2009 volume: 1818 android: 10.6% blackberry: 4.2% iphone: 85.2%
Oct 2009 volume: 1991 android: 12.8% blackberry: 3.4% iphone: 83.8%
Nov 2009 volume: 2075 android: 15.1% blackberry: 4.5% iphone: 80.4%
Dec 2009 volume: 2416 android: 22.6% blackberry: 4.3% iphone: 73.2%
Jan 2010 volume: 3377 android: 28.7% blackberry: 2.3% iphone: 69.0%
Feb 2010 volume: 3481 android: 31.3% blackberry: 2.7% iphone: 66.0%
Mar 2010 volume: 4106 android: 32.3% blackberry: 2.7% iphone: 65.1%
Apr 2010 volume: 4323 android: 33.4% blackberry: 2.4% iphone: 64.2%
May 2010 volume: 4611 android: 35.5% blackberry: 3.0% iphone: 61.5%
Jun 2010 volume: 5998 android: 36.9% blackberry: 2.5% iphone: 60.7%
Jul 2010 volume: 7257 android: 38.1% blackberry: 2.2% iphone: 59.7%
Aug 2010 volume: 7744 android: 42.9% blackberry: 1.6% iphone: 55.5%
Sep 2010 volume: 7319 android: 43.5% blackberry: 1.9% iphone: 54.5%
Oct 2010 volume: 7573 android: 45.5% blackberry: 1.9% iphone: 52.6%
Nov 2010 volume: 8126 android: 51.5% blackberry: 2.4% iphone: 46.1%
Dec 2010 volume: 8434 android: 52.3% blackberry: 2.2% iphone: 45.5%
Jan 2011 volume: 9890 android: 52.2% blackberry: 2.0% iphone: 45.8%
Feb 2011 volume: 10943 android: 55.4% blackberry: 2.0% iphone: 42.7%
Mar 2011 volume: 13552 android: 55.7% blackberry: 1.9% iphone: 42.4%
May 2011 volume: 14325 android: 59.0% blackberry: 2.1% iphone: 38.9%
The program prints monthly reports about the tags and the total tag
volume. For example, in February 2010 there were 3481 questions tagged
iphone, android or blackberry. Of these questions, 31.3% were tagged
"android", 2.7% were tagged "blackberry" and 66.0% were tagged
"iphone". In May 2011 many more questions were asked, of which the
majority were tagged "android".
The Stack Overflow data can be downloaded from here. The folder
contains data from several Stack Exchange sites, including Stack
Overflow. In the zipped folder called stackoverflow.com.7z there is
a file called posts.xml. This is the file we need.
The file is 5.9GB and contains over 5.5 million lines of XML. Running
wc -l posts.xml takes almost three minutes on my machine.
Despite its size, the file has a very simple structure. It only contains a series of 'row' tags and all the information is represented as attributes.
<posts>
<row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="...
<row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="...
<row Id="7" PostTypeId="2" ParentId="4" CreationDate="...
<row Id="8" PostTypeId="1" AcceptedAnswerId="162" CreationDate="...
<row Id="9" PostTypeId="1" AcceptedAnswerId="1404" CreationDate="...
<row Id="11" PostTypeId="1" AcceptedAnswerId="1248" CreationDate="...
...
Questions are marked with PostTypeId="1" and answers with
PostTypeId="2". There are many attributes in each row. This program
will only consider questions and the attributes:
- CreationDate: When the question was posted.
- Tags: A list of tags encoded as <php><mysql><jquery>. This attribute is only present for questions.
Due to the size of the XML file it's not feasible to parse the entire
content of the file into an in-memory DOM. The function
clojure.xml/parse can therefore not be used.
Also, I will not use clojure.contrib.lazy-xml since the old
Clojure Contrib is more or less deprecated at this point and the new
xml library isn't ready for use just yet.
It is not too difficult to write custom XML processing functionality
on top of the javax.xml.stream package. We shall therefore create
a function, attribute-seq, which returns a lazy sequence of
attributes. The function will have the following interface:
(attribute-seq <input-stream> <predicate> <attribute-parser>)
- <input-stream> should be an open java.io.InputStream object pointing to the file to be read.
- <predicate> should be a function which, given an XMLStreamReader object, decides if the parser should parse the attributes or ignore the tag and continue reading.
- <attribute-parser> should be a function which parses out some attributes from the XMLStreamReader.
The function attribute-seq is designed in a way that allows the API
consumer to parse out only the attributes of interest and nothing
else. This is good for both performance and flexibility.
The attribute-seq
function is in its own namespace attseq
. This
namespace contains the helper function attribute-seq*
which has the
same interface as attribute-seq
except that instead of an
InputStream
it takes an XMLStreamReader
object. It is defined as
follows:
(defn- attribute-seq* [^XMLStreamReader stream pred attribute-parser]
(lazy-seq
(when (and (.hasNext stream)
(.next stream))
(if (and (.isStartElement stream)
(pred stream))
(cons (attribute-parser stream)
(attribute-seq* stream pred attribute-parser))
(attribute-seq* stream pred attribute-parser)))))
(The ^XMLStreamReader
part of the argument list is a type hint to
avoid reflection.)
Note that the body of the function is wrapped in lazy-seq. The body
of a lazy-seq should return either nil (indicating the end of the
sequence) or some object which implements the ISeq interface. The
function returns nil when there is no more data to read. When there
is more data, the function only considers start elements (which are the
only ones that contain attributes) and tags that satisfy the
supplied predicate function (pred). The attributes are parsed with
attribute-parser, and only when a consumer of the sequence requires
the next set of attributes does it recursively continue to read the
next part of the file.
The function xml-stream-reader wraps an InputStream in an
XMLStreamReader:
(defn- xml-stream-reader [^InputStream input-stream]
(.createXMLStreamReader ^XMLInputFactory (XMLInputFactory/newInstance)
input-stream))
Finally, attribute-seq
uses the two helper functions to provide the
desired interface:
(defn attribute-seq [input-stream pred attribute-parser]
(attribute-seq* (xml-stream-reader input-stream)
pred
attribute-parser))
Here are two simple examples of how to use attribute-seq:
- To generate a sequence of all attributes one could write:
(attribute-seq input
(constantly true)
#(into {} (for [i (range (.getAttributeCount %))]
[(.getAttributeLocalName % i)
(.getAttributeValue % i)])))
- To only parse attributes foo and bar in tag foobar:
(attribute-seq input
#(= (.getLocalName %) "foobar")
(fn [stream]
{:foo (.getAttributeValue stream nil "foo")
:bar (.getAttributeValue stream nil "bar")}))
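As a quick sanity check, attribute-seq can also be run against a small in-memory document instead of the big file. This is only a sketch: it assumes attribute-seq is already loaded, and relies on clojure.java.io/input-stream accepting a byte array.

```clojure
(require '[clojure.java.io :as io])

;; Parse the Id attribute of every row tag from an in-memory XML document.
(with-open [in (io/input-stream
                 (.getBytes "<posts><row Id=\"1\"/><row Id=\"2\"/></posts>"))]
  (doall (attribute-seq in
                        #(= (.getLocalName %) "row")
                        #(.getAttributeValue % nil "Id"))))
;; => ("1" "2")
```

The doall matters here: the lazy sequence must be fully realized before with-open closes the underlying stream.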
The goal of this section is to define a function, question-seq
,
which will generate a lazy sequence of questions.
First, we define a helper function get-attr
which, given an
attribute name
and an XMLStreamReader
object, returns the value of
the attribute named name
from the current position in the stream.
(defn get-attr [name ^XMLStreamReader stream]
(.getAttributeValue stream nil name))
With this function, it's easy to define a question?
predicate which
determines if the stream is at a tag representing a question or not:
(defn question? [stream]
(= (get-attr "PostTypeId" stream) "1"))
Next, we will define the attribute parser. Dates are parsed with DatatypeConverter:
(defn date-parser [stream]
(DatatypeConverter/parseDate (get-attr "CreationDate" stream)))
and the tags are parsed with a regular expression:
(defn tags-parser [stream]
  (into #{} (re-seq #"[\w#\Q+-.\E]+"
                    (get-attr "Tags" stream))))
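The \Q...\E part of the regular expression quotes the characters +, - and . literally inside the character class, so tags like c++ and c# survive intact. A quick REPL check of the expression on its own (with a made-up Tags string):

```clojure
;; The character class matches word characters, #, and the quoted +, -, .
(re-seq #"[\w#\Q+-.\E]+" "<objective-c><c++><c#>")
;; => ("objective-c" "c++" "c#")
```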
The attribute parser for questions can now be defined as
(defn parse-question [stream]
{:creation-date (date-parser stream)
:tags (tags-parser stream)})
Finally, pulling it all together into the question-seq
function is
as easy as
(defn question-seq [input-stream]
(attribute-seq input-stream
question?
parse-question))
question-seq will return a lazy sequence of maps containing:
- :creation-date: a java.util.Calendar object.
- :tags: a set of tags, for example #{"php" "mysql" "jquery"}.
Let's now turn to tag counting. Given a sequence of questions it's
quite easy to filter
on a particular tag and count the remaining
questions. For example, to count the number of questions tagged
scala
we could write
> (count (filter #((:tags %) "scala") (question-seq input)))
or, with the convenient threading macro ->>
:
> (->> (question-seq input)
(filter #((:tags %) "scala"))
count)
where input is an open InputStream
pointing to the beginning of the
posts.xml
file.
Going forward, filtering on tags will be important so let's define a
function (tagged
) which returns a predicate function that can tell
if a question is tagged with one of the argument tags:
(defn tagged [& tags]
(fn [question]
(seq (intersection (set tags)
(:tags question)))))
tagged returns a function that takes one argument, question, and
looks at the intersection between the question's tags and the closed-over
set of tags originally passed to tagged. The function seq
returns nil if the intersection is empty.
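Since any non-nil value is truthy in Clojure, the returned predicate works directly with filter. A couple of REPL calls on hypothetical question maps illustrate this (assuming tagged and clojure.set/intersection are loaded):

```clojure
(require '[clojure.set :refer [intersection]])

((tagged "scala" "clojure") {:tags #{"clojure" "jvm"}})
;; => ("clojure")  ; truthy, so filter keeps this question

((tagged "scala" "clojure") {:tags #{"python"}})
;; => nil          ; falsey, so filter drops this question
```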
With this function we can count questions tagged either mysql
,
postgresql
or oracle
like this:
> (->> (question-seq input)
(filter (tagged "mysql" "postgresql" "oracle"))
count)
This is great and all, but it's not always what we want. What if we need to count how many occurrences there are of each individual tag and not the grand total? For this we will look at how to count elements in a sequence of sets.
Given a sequence of sets, for example
(def sets [#{:a :b :c}
#{:b :c :e}
#{:c :e :f}])
we define a function which returns a map of element counts:
> (element-count sets)
{:a 1 :b 2 :c 3 :e 2 :f 1}
It's not always desirable to count all the elements in a sequence of
sets. Sometimes we are only interested in a few of them. This is why
element-count is designed to take an optional predicate function
argument which decides which of the elements are counted. Sets, being
functions in Clojure, work perfectly with this approach:
> (element-count #{:a :c} sets)
{:a 1 :c 3}
But it's also possible to count elements that satisfy any predicate:
> (element-count even? [#{1 2 3} #{2 3 4} #{2 8 9}])
{2 3, 4 1, 8 1}
The implementation of element-count
looks like this:
(defn element-count
([sets]
(element-count (constantly true) sets))
([pred sets]
(->> sets
(map (comp frequencies #(select pred %)))
(reduce #(merge-with + %1 %2) {}))))
select is a very useful function from the clojure.set
namespace which selects items from a set that satisfy some
predicate. The frequencies function returns a map of items to their
frequencies (which, given a set, is always 1, as in
(frequencies #{:a :b :c}) -> {:a 1 :b 1 :c 1}).
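Tracing the two transformation steps on the even? example from above makes the implementation easier to follow:

```clojure
(require '[clojure.set :refer [select]])

;; Step 1: select the matching elements from each set, then count them per set.
(map (comp frequencies #(select even? %)) [#{1 2 3} #{2 3 4} #{2 8 9}])
;; => ({2 1} {2 1, 4 1} {2 1, 8 1})

;; Step 2: sum the per-set counts into a single map.
(reduce #(merge-with + %1 %2) {} [{2 1} {2 1, 4 1} {2 1, 8 1}])
;; => {2 3, 4 1, 8 1}
```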
With this function it's possible to count all the tags on Stack
Overflow with (element-count (map :tags (question-seq input)))
. The
tags "php" and "mysql" can be counted with (element-count #{"php" "mysql"} (map :tags (question-seq input)))
which returns the map
{"php" 115290, "mysql" 49385}
.
Let's define the first version of the function (comptags [input & tags] ...). It is defined as a series of lazy sequence
transformations:
- Filter the sequence of questions so only questions tagged with at least one of tags remain.
- Partition the questions by month. This results in a sequence of sequences.
- For every month, generate a monthly report with make-report.
(defn comptags [input & tags]
(->> (question-seq input)
(filter (apply tagged tags))
(partition-by #(.get ^Calendar (:creation-date %) Calendar/MONTH))
(map #(make-report tags %))))
The function make-report calculates the tag-count and the total
number of tags, i.e., the volume. Note that this might be more than
the number of questions, since a question can be tagged with several
tags. Finally, the report map is constructed by calculating the
percentage distribution between tags together with the "month" and
"volume":
(defn make-report [tags questions]
(let [tag-count (element-count (set tags) (map :tags questions))
volume (reduce + (vals tag-count))]
(into {"month" (format "%1$tb %1$tY" (:creation-date (first questions)))
"volume" volume}
(map (fn [tag]
[tag (* (/ (get tag-count tag 0) volume) 100.0)])
tags))))
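A small made-up example with two questions from the same month shows the shape of a report. Here tag-count is {"iphone" 2, "android" 1} and volume is 3, which yields the percentages below (assuming an English default locale for the month name; values shown truncated):

```clojure
(import 'java.util.Calendar)

;; A Calendar pointing somewhere in February 2010.
(def feb-2010
  (doto (Calendar/getInstance)
    (.set 2010 Calendar/FEBRUARY 1)))

(make-report ["iphone" "android"]
             [{:creation-date feb-2010 :tags #{"iphone"}}
              {:creation-date feb-2010 :tags #{"iphone" "android"}}])
;; => {"month" "Feb 2010", "volume" 3,
;;     "iphone" 66.66..., "android" 33.33...}
```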
Finally, we need a way to print the report. It would be possible to
use print-table
from clojure.pprint
but then we would have to
wait for the entire computation to finish before we see any
results. With the function print-reports
we get the monthly reports
as soon as they are available:
(defn print-reports [reports]
(doseq [report reports]
(printf "%s volume: %5d" (report "month") (report "volume"))
(doseq [[tag percent] (dissoc report "month" "volume")]
(printf " %s: %5.1f%%" tag percent))
(println)))
Let's try it out on some tags. Here are some NoSQL databases:
> (use 'clojure.java.io)
> (with-open [input (input-stream "resources/posts.xml")]
(-> input
(comptags "mongodb" "couchdb" "redis")
print-reports))
Aug 2008 volume: 1 couchdb: 100.0% mongodb: 0.0% redis: 0.0%
Sep 2008 volume: 4 couchdb: 100.0% mongodb: 0.0% redis: 0.0%
Oct 2008 volume: 3 couchdb: 100.0% mongodb: 0.0% redis: 0.0%
...
Mar 2011 volume: 357 couchdb: 18.5% mongodb: 70.6% redis: 10.9%
Apr 2011 volume: 368 couchdb: 15.5% mongodb: 70.9% redis: 13.6%
May 2011 volume: 403 couchdb: 18.6% mongodb: 71.0% redis: 10.4%
Let's instead try a few popular programming languages:
> (with-open [input (input-stream "resources/posts.xml")]
(-> input
(comptags "java" "c++" "c#")
print-reports))
...
Apr 2010 volume: 12387 c#: 47.4% c++: 20.5% java: 32.1%
May 2010 volume: 2700 c#: 44.8% c++: 20.5% java: 34.7%
Feb 2010 volume: 1 c#: 100.0% c++: 0.0% java: 0.0%
May 2010 volume: 10060 c#: 48.2% c++: 18.8% java: 33.0%
Jun 2010 volume: 13956 c#: 46.3% c++: 21.1% java: 32.6%
...
After April 2010 something strange happened. The creation dates in the
XML file are not perfectly sorted, so (partition-by ... MONTH) does
not always behave as expected. Fixing this is not as easy as one might
first think. Here are a few possible solutions:
- Sort the XML data by creation date. This requires (as far as I understand) that the entire sequence is either sorted in memory or sorted on disk prior to running comptags.
- Calculate the longest increasing subsequence. I don't think this can be done lazily, as it involves backtracking and the algorithm must look at every element in the sequence in order to decide whether or not it has found the longest subsequence.
- Remove the outliers. It might be possible to analyze the sequence and remove the questions whose timestamps differ a lot from their neighbours.
- Require that the sequence increases monotonically. It is quite easy to do this lazily, so I went with this approach. However, some questions will be dropped, so the final results will not be 100% correct. The implementation is quite naive and somewhat flawed, but I couldn't figure out a simple way to do it better. Here is the implementation:
(defn monotonic-inc-by [key coll]
(lazy-seq
(let [[x y & more] coll]
(if y
(if-not (pos? (compare (key x) (key y)))
(cons x (monotonic-inc-by key (cons y more)))
(monotonic-inc-by key (cons x more)))
[x]))))
Testing it on simple numeric data, it's obvious why I consider it suboptimal:
> (monotonic-inc-by identity [1 2 3 0 4 5 6]) ;; Works fine
(1 2 3 4 5 6)
> (monotonic-inc-by identity [1 2 3 9 4 5 6]) ;; Oops
(1 2 3 9)
The percentage of questions dropped can be calculated (in parallel) like this:
> (let [c1 (-> (input-stream "resources/posts.xml")
question-seq
count
future)
c2 (-> (input-stream "resources/posts.xml")
question-seq
(monotonic-inc-by :creation-date)
count
future)]
(* (/ (- @c1 @c2) @c1) 100.0))
0.281019290307216
Only 0.28% of the questions are dropped, so it will probably not distort the statistics too much.
Finally, let's update the comptags
function with this additional
sequence transformation:
(defn comptags [input & tags]
(->> (question-seq input)
(filter (apply tagged tags))
(monotonic-inc-by :creation-date)
(partition-by #(.get ^Calendar (:creation-date %) Calendar/MONTH))
(map #(make-report tags %))))
and try it on the same set of tags:
...
Oct 2010 volume: 15542 c#: 45.7% c++: 19.6% java: 34.7%
Nov 2010 volume: 17101 c#: 44.8% c++: 20.5% java: 34.7%
Dec 2010 volume: 16744 c#: 44.5% c++: 19.4% java: 36.2%
Jan 2011 volume: 18939 c#: 44.1% c++: 19.3% java: 36.5%
Feb 2011 volume: 19566 c#: 44.2% c++: 18.4% java: 37.4%
Mar 2011 volume: 24001 c#: 43.8% c++: 18.3% java: 37.9%
Apr 2011 volume: 22658 c#: 43.3% c++: 19.4% java: 37.3%
May 2011 volume: 24299 c#: 44.1% c++: 17.4% java: 38.5%
Many of these functions can probably be improved. I would like to see
an improved version of monotonic-inc-by or some alternative
workaround for the fact that the dates are not in sorted order. I would
also like to know if it's possible to parse XML attributes in a more
efficient manner.