Analyzing Stack Overflow data
This article is a description of a program written in Clojure which does simple analysis on data from Stack Overflow. The code can be found on github here. I hope someone finds it useful or interesting. Feedback is always welcome!
Stack Overflow provides data of all questions and answers that have
been posted on the site. It is released under a Creative Commons
license. Every question is tagged with tags describing its
contents. These tags will be analyzed in this program. The final
"product" will be a function, comptags, which, given an input-stream
and some tags, will generate a sequence of maps which can be analyzed
further or, as here, printed to the screen:
> (-> (input-stream "resources/posts.xml")
(comptags "iphone" "android" "blackberry")
print-reports)
Aug 2008 volume: 23 android: 21.7% blackberry: 0.0% iphone: 78.3%
Sep 2008 volume: 99 android: 9.1% blackberry: 9.1% iphone: 81.8%
Oct 2008 volume: 207 android: 5.3% blackberry: 1.0% iphone: 93.7%
Nov 2008 volume: 216 android: 4.2% blackberry: 6.0% iphone: 89.8%
Dec 2008 volume: 231 android: 3.0% blackberry: 4.3% iphone: 92.6%
Jan 2009 volume: 322 android: 10.9% blackberry: 2.5% iphone: 86.6%
Feb 2009 volume: 422 android: 11.8% blackberry: 2.1% iphone: 86.0%
Mar 2009 volume: 652 android: 7.5% blackberry: 1.5% iphone: 91.0%
Apr 2009 volume: 725 android: 6.5% blackberry: 2.3% iphone: 91.2%
May 2009 volume: 1149 android: 6.9% blackberry: 4.0% iphone: 89.1%
Jun 2009 volume: 1300 android: 10.3% blackberry: 2.7% iphone: 87.0%
Jul 2009 volume: 1702 android: 10.0% blackberry: 3.6% iphone: 86.3%
Aug 2009 volume: 1863 android: 9.0% blackberry: 4.0% iphone: 87.0%
Sep 2009 volume: 1818 android: 10.6% blackberry: 4.2% iphone: 85.2%
Oct 2009 volume: 1991 android: 12.8% blackberry: 3.4% iphone: 83.8%
Nov 2009 volume: 2075 android: 15.1% blackberry: 4.5% iphone: 80.4%
Dec 2009 volume: 2416 android: 22.6% blackberry: 4.3% iphone: 73.2%
Jan 2010 volume: 3377 android: 28.7% blackberry: 2.3% iphone: 69.0%
Feb 2010 volume: 3481 android: 31.3% blackberry: 2.7% iphone: 66.0%
Mar 2010 volume: 4106 android: 32.3% blackberry: 2.7% iphone: 65.1%
Apr 2010 volume: 4323 android: 33.4% blackberry: 2.4% iphone: 64.2%
May 2010 volume: 4611 android: 35.5% blackberry: 3.0% iphone: 61.5%
Jun 2010 volume: 5998 android: 36.9% blackberry: 2.5% iphone: 60.7%
Jul 2010 volume: 7257 android: 38.1% blackberry: 2.2% iphone: 59.7%
Aug 2010 volume: 7744 android: 42.9% blackberry: 1.6% iphone: 55.5%
Sep 2010 volume: 7319 android: 43.5% blackberry: 1.9% iphone: 54.5%
Oct 2010 volume: 7573 android: 45.5% blackberry: 1.9% iphone: 52.6%
Nov 2010 volume: 8126 android: 51.5% blackberry: 2.4% iphone: 46.1%
Dec 2010 volume: 8434 android: 52.3% blackberry: 2.2% iphone: 45.5%
Jan 2011 volume: 9890 android: 52.2% blackberry: 2.0% iphone: 45.8%
Feb 2011 volume: 10943 android: 55.4% blackberry: 2.0% iphone: 42.7%
Mar 2011 volume: 13552 android: 55.7% blackberry: 1.9% iphone: 42.4%
May 2011 volume: 14325 android: 59.0% blackberry: 2.1% iphone: 38.9%
The program prints monthly reports about the tags and the total tag
volume. For example, in February 2010 there were 3481 questions tagged
iphone, android or blackberry. Of these questions, 31.3% were tagged
"android", 2.7% were tagged "blackberry" and 66.0% were tagged
"iphone". In May 2011 many more questions were asked, of which the
majority were tagged "android".
The Stack Overflow data can be downloaded from here. The folder
contains data from several Stack Exchange sites, including Stack
Overflow. In the zipped folder called stackoverflow.com.7z there is
a file called posts.xml. This is the file we need.
The file is 5.9GB and contains over 5.5 million lines of XML. Running
wc -l posts.xml takes almost three minutes on my machine.
Despite its size, the file has a very simple structure. It only contains a series of 'row' tags and all the information is represented as attributes.
<posts>
<row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="...
<row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="...
<row Id="7" PostTypeId="2" ParentId="4" CreationDate="...
<row Id="8" PostTypeId="1" AcceptedAnswerId="162" CreationDate="...
<row Id="9" PostTypeId="1" AcceptedAnswerId="1404" CreationDate="...
<row Id="11" PostTypeId="1" AcceptedAnswerId="1248" CreationDate="...
...
Questions are marked with PostTypeId="1" and answers with
PostTypeId="2". There are many attributes in each row. This program
will only consider questions and the attributes:
- CreationDate: When the question was posted.
- Tags: A list of tags encoded as <php><mysql><jquery>. This attribute is only present for questions.
Due to the size of the XML file it's not feasible to parse the entire
content of the file into an in-memory DOM. The function
clojure.xml/parse can therefore not be used.
Also, I will not use clojure.contrib.lazy-xml since the old
Clojure Contrib is more or less deprecated at this point and the new
xml library isn't ready for use just yet.
It is not too difficult to write custom XML processing functionality
on top of the javax.xml.stream package. We shall therefore create
a function, attribute-seq, which returns a lazy sequence of
attributes. The function will have the following interface:
(attribute-seq <input-stream> <predicate> <attribute-parser>)
- <input-stream> should be an open java.io.InputStream object pointing to the file to be read.
- <predicate> should be a function which, given an XMLStreamReader object, decides if the parser should parse the attributes or ignore the tag and continue reading.
- <attribute-parser> should be a function which parses out some attributes from the XMLStreamReader.
The function attribute-seq is designed in a way that allows the API
consumer to parse out only the attributes of interest and nothing
else. This is good for both performance and flexibility.
The attribute-seq
function is in its own namespace attseq
. This
namespace contains the helper function attribute-seq*
which has the
same interface as attribute-seq
except that instead of an
InputStream
it takes an XMLStreamReader
object. It is defined as
follows:
(defn- attribute-seq* [^XMLStreamReader stream pred attribute-parser]
(lazy-seq
(when (and (.hasNext stream)
(.next stream))
(if (and (.isStartElement stream)
(pred stream))
(cons (attribute-parser stream)
(attribute-seq* stream pred attribute-parser))
(attribute-seq* stream pred attribute-parser)))))
(The ^XMLStreamReader
part of the argument list is a type hint to
avoid reflection.)
Note that the body of the function is wrapped in lazy-seq. The body
of a lazy-seq should return either nil (indicating the end of the
sequence) or some object which implements the ISeq interface. The
function returns nil when there is no more data to read. When there
is more data, the function only considers start elements (which are the
only ones that contain attributes) and tags that satisfy the
supplied predicate function (pred). The attributes are parsed with
attribute-parser, and only when a consumer of the sequence requires
the next set of attributes does it recursively continue to read the
next part of the file.
The function xml-stream-reader wraps an InputStream in an
XMLStreamReader:
(defn- xml-stream-reader [^InputStream input-stream]
(.createXMLStreamReader ^XMLInputFactory (XMLInputFactory/newInstance)
input-stream))
Finally, attribute-seq
uses the two helper functions to provide the
desired interface:
(defn attribute-seq [input-stream pred attribute-parser]
(attribute-seq* (xml-stream-reader input-stream)
pred
attribute-parser))
Here are two simple examples of how to use attribute-seq:
- To generate a sequence of all attributes one could write:
(attribute-seq input
(constantly true)
#(into {} (for [i (range (.getAttributeCount %))]
[(.getAttributeLocalName % i)
(.getAttributeValue % i)])))
- To only parse attributes foo and bar in tag foobar:
(attribute-seq input
#(= (.getLocalName %) "foobar")
(fn [stream]
{:foo (.getAttributeValue stream nil "foo")
:bar (.getAttributeValue stream nil "bar")}))
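As a quick sanity check, attribute-seq can also be run against a small in-memory document instead of the big file. This is only a sketch: it assumes attribute-seq is already loaded, and relies on clojure.java.io/input-stream accepting a byte array.

```clojure
(require '[clojure.java.io :as io])

;; Parse the Id attribute of every row tag from an in-memory XML document.
(with-open [in (io/input-stream
                 (.getBytes "<posts><row Id=\"1\"/><row Id=\"2\"/></posts>"))]
  (doall (attribute-seq in
                        #(= (.getLocalName %) "row")
                        #(.getAttributeValue % nil "Id"))))
;; => ("1" "2")
```

The doall matters here: the lazy sequence must be fully realized before with-open closes the underlying stream.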
The goal of this section is to define a function, question-seq
,
which will generate a lazy sequence of questions.
First, we define a helper function get-attr
which, given an
attribute name
and an XMLStreamReader
object, returns the value of
the attribute named name
from the current position in the stream.
(defn get-attr [name ^XMLStreamReader stream]
(.getAttributeValue stream nil name))
With this function, it's easy to define a question?
predicate which
determines if the stream is at a tag representing a question or not:
(defn question? [stream]
(= (get-attr "PostTypeId" stream) "1"))
Next, we will define the attribute parser. Dates are parsed with DatatypeConverter:
(defn date-parser [stream]
(DatatypeConverter/parseDate (get-attr "CreationDate" stream)))
and the tags are parsed with a regular expression:
(defn tags-parser [stream]
  (into #{} (re-seq #"[\w#\Q+-.\E]+"
                    (get-attr "Tags" stream))))
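The \Q...\E part of the regular expression quotes the characters +, - and . literally inside the character class, so tags like c++ and c# survive intact. A quick REPL check of the expression on its own (with a made-up Tags string):

```clojure
;; The character class matches word characters, #, and the quoted +, -, .
(re-seq #"[\w#\Q+-.\E]+" "<objective-c><c++><c#>")
;; => ("objective-c" "c++" "c#")
```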
The attribute parser for questions can now be defined as
(defn parse-question [stream]
{:creation-date (date-parser stream)
:tags (tags-parser stream)})
Finally, pulling it all together into the question-seq
function is
as easy as
(defn question-seq [input-stream]
(attribute-seq input-stream
question?
parse-question))
question-seq will return a lazy sequence of maps containing:
- :creation-date: a java.util.Calendar object.
- :tags: a set of tags, for example #{"php" "mysql" "jquery"}.
Let's now turn to tag counting. Given a sequence of questions it's
quite easy to filter
on a particular tag and count the remaining
questions. For example, to count the number of questions tagged
scala
we could write
> (count (filter #((:tags %) "scala") (question-seq input)))
or, with the convenient threading macro ->>
:
> (->> (question-seq input)
(filter #((:tags %) "scala"))
count)
where input is an open InputStream
pointing to the beginning of the
posts.xml
file.
Going forward, filtering on tags will be important so let's define a
function (tagged
) which returns a predicate function that can tell
if a question is tagged with one of the argument tags:
(defn tagged [& tags]
(fn [question]
(seq (intersection (set tags)
(:tags question)))))
tagged returns a function that takes one argument, question, and
looks at the intersection between the question's tags and the closed-over
set of tags originally passed to tagged. The function seq
returns nil if the intersection is empty.
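Since any non-nil value is truthy in Clojure, the returned predicate works directly with filter. A couple of REPL calls on hypothetical question maps illustrate this (assuming tagged and clojure.set/intersection are loaded):

```clojure
(require '[clojure.set :refer [intersection]])

((tagged "scala" "clojure") {:tags #{"clojure" "jvm"}})
;; => ("clojure")  ; truthy, so filter keeps this question

((tagged "scala" "clojure") {:tags #{"python"}})
;; => nil          ; falsey, so filter drops this question
```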
With this function we can count questions tagged either mysql
,
postgresql
or oracle
like this:
> (->> (question-seq input)
(filter (tagged "mysql" "postgresql" "oracle"))
count)
This is great and all, but it's not always what we want. What if we need to count how many occurrences there are of each individual tag and not the grand total? For this we will look at how to count elements in a sequence of sets.
Given a sequence of sets, for example
(def sets [#{:a :b :c}
#{:b :c :e}
#{:c :e :f}])
we define a function which returns a map of element counts:
> (element-count sets)
{:a 1 :b 2 :c 3 :e 2 :f 1}
It's not always desirable to count all the elements in a sequence of
sets. Sometimes we are only interested in a few of them. This is why
element-count is designed to take an optional predicate function
argument which decides which of the elements are counted. Sets, being
functions in Clojure, work perfectly with this approach:
> (element-count #{:a :c} sets)
{:a 1 :c 3}
But it's also possible to count elements that satisfy any predicate:
> (element-count even? [#{1 2 3} #{2 3 4} #{2 8 9}])
{2 3, 4 1, 8 1}
The implementation of element-count
looks like this:
(defn element-count
([sets]
(element-count (constantly true) sets))
([pred sets]
(->> sets
(map (comp frequencies #(select pred %)))
(reduce #(merge-with + %1 %2) {}))))
select is a very useful function from the clojure.set
namespace which selects items from a set that satisfy some
predicate. The frequencies function returns a map of items to their
frequencies (which, given a set, is always 1, as in
(frequencies #{:a :b :c}) -> {:a 1 :b 1 :c 1}).
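Tracing the two transformation steps on the even? example from above makes the implementation easier to follow:

```clojure
(require '[clojure.set :refer [select]])

;; Step 1: select the matching elements from each set, then count them per set.
(map (comp frequencies #(select even? %)) [#{1 2 3} #{2 3 4} #{2 8 9}])
;; => ({2 1} {2 1, 4 1} {2 1, 8 1})

;; Step 2: sum the per-set counts into a single map.
(reduce #(merge-with + %1 %2) {} [{2 1} {2 1, 4 1} {2 1, 8 1}])
;; => {2 3, 4 1, 8 1}
```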
With this function it's possible to count all the tags on Stack
Overflow with (element-count (map :tags (question-seq input)))
. The
tags "php" and "mysql" can be counted with (element-count #{"php" "mysql"} (map :tags (question-seq input)))
which returns the map
{"php" 115290, "mysql" 49385}
.
Let's define the first version of the function (comptags [input & tags] ...). It is defined as a series of lazy sequence
transformations:
- Filter the sequence of questions so only questions tagged with at least one of tags remain.
- Partition the questions by month. This results in a sequence of sequences.
- For every month, generate a monthly report with make-report.
(defn comptags [input & tags]
(->> (question-seq input)
(filter (apply tagged tags))
(partition-by #(.get ^Calendar (:creation-date %) Calendar/MONTH))
(map #(make-report tags %))))
The function make-report calculates the tag-count and the total
number of tags, i.e., the volume. Note that this might be more than
the number of questions, since a question can be tagged with several
tags. Finally, the report map is constructed by calculating the
percentage distribution between tags together with the "month" and
"volume":
(defn make-report [tags questions]
(let [tag-count (element-count (set tags) (map :tags questions))
volume (reduce + (vals tag-count))]
(into {"month" (format "%1$tb %1$tY" (:creation-date (first questions)))
"volume" volume}
(map (fn [tag]
[tag (* (/ (get tag-count tag 0) volume) 100.0)])
tags))))
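A small made-up example with two questions from the same month shows the shape of a report. Here tag-count is {"iphone" 2, "android" 1} and volume is 3, which yields the percentages below (assuming an English default locale for the month name; values shown truncated):

```clojure
(import 'java.util.Calendar)

;; A Calendar pointing somewhere in February 2010.
(def feb-2010
  (doto (Calendar/getInstance)
    (.set 2010 Calendar/FEBRUARY 1)))

(make-report ["iphone" "android"]
             [{:creation-date feb-2010 :tags #{"iphone"}}
              {:creation-date feb-2010 :tags #{"iphone" "android"}}])
;; => {"month" "Feb 2010", "volume" 3,
;;     "iphone" 66.66..., "android" 33.33...}
```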
Finally, we need a way to print the report. It would be possible to
use print-table
from clojure.pprint
but then we would have to
wait for the entire computation to finish before we see any
results. With the function print-reports
we get the monthly reports
as soon as they are available:
(defn print-reports [reports]
(doseq [report reports]
(printf "%s volume: %5d" (report "month") (report "volume"))
(doseq [[tag percent] (dissoc report "month" "volume")]
(printf " %s: %5.1f%%" tag percent))
(println)))
Let's try it out on some tags. Here are some NoSQL databases:
> (use 'clojure.java.io)
> (with-open [input (input-stream "resources/posts.xml")]
(-> input
(comptags "mongodb" "couchdb" "redis")
print-reports))
Aug 2008 volume: 1 couchdb: 100.0% mongodb: 0.0% redis: 0.0%
Sep 2008 volume: 4 couchdb: 100.0% mongodb: 0.0% redis: 0.0%
Oct 2008 volume: 3 couchdb: 100.0% mongodb: 0.0% redis: 0.0%
...
Mar 2011 volume: 357 couchdb: 18.5% mongodb: 70.6% redis: 10.9%
Apr 2011 volume: 368 couchdb: 15.5% mongodb: 70.9% redis: 13.6%
May 2011 volume: 403 couchdb: 18.6% mongodb: 71.0% redis: 10.4%
Let's instead try a few popular programming languages:
> (with-open [input (input-stream "resources/posts.xml")]
(-> input
(comptags "java" "c++" "c#")
print-reports))
...
Apr 2010 volume: 12387 c#: 47.4% c++: 20.5% java: 32.1%
May 2010 volume: 2700 c#: 44.8% c++: 20.5% java: 34.7%
Feb 2010 volume: 1 c#: 100.0% c++: 0.0% java: 0.0%
May 2010 volume: 10060 c#: 48.2% c++: 18.8% java: 33.0%
Jun 2010 volume: 13956 c#: 46.3% c++: 21.1% java: 32.6%
...
After April 2010 something strange happened. The creation dates in the
XML file are not perfectly sorted, so (partition-by ... MONTH) does
not always behave as expected. Fixing this is not as easy as one might
first think. Here are a few possible solutions:
- Sort the XML data by creation date. This requires (as far as I understand) that the entire sequence is either sorted in memory or sorted on disk prior to running comptags.
- Calculate the longest increasing subsequence. I don't think this can be done lazily, as it involves backtracking and the algorithm must look at every element in the sequence in order to decide whether or not it has found the longest subsequence.
- Remove the outliers. It might be possible to analyze the sequence and remove the questions whose timestamps differ a lot from their neighbours.
- Require that the sequence increases monotonically. It is quite easy to do this lazily, so I went with this approach. However, some questions will be dropped, so the final results will not be 100% correct. The implementation is quite naive and somewhat flawed, but I couldn't figure out a simple way to do it better. Here is the implementation:
(defn monotonic-inc-by [key coll]
(lazy-seq
(let [[x y & more] coll]
(if y
(if-not (pos? (compare (key x) (key y)))
(cons x (monotonic-inc-by key (cons y more)))
(monotonic-inc-by key (cons x more)))
[x]))))
Testing it on simple numeric data, it's obvious why I consider it suboptimal:
> (monotonic-inc-by identity [1 2 3 0 4 5 6]) ;; Works fine
(1 2 3 4 5 6)
> (monotonic-inc-by identity [1 2 3 9 4 5 6]) ;; Oops
(1 2 3 9)
The percentage of questions dropped can be calculated (in parallel) like this:
> (let [c1 (-> (input-stream "resources/posts.xml")
question-seq
count
future)
c2 (-> (input-stream "resources/posts.xml")
question-seq
(monotonic-inc-by :creation-date)
count
future)]
(* (/ (- @c1 @c2) @c1) 100.0))
0.281019290307216
Only 0.28% of the questions are dropped, so it will probably not distort the statistics too much.
Finally, let's update the comptags
function with this additional
sequence transformation:
(defn comptags [input & tags]
(->> (question-seq input)
(filter (apply tagged tags))
(monotonic-inc-by :creation-date)
(partition-by #(.get ^Calendar (:creation-date %) Calendar/MONTH))
(map #(make-report tags %))))
and try it on the same set of tags:
...
Oct 2010 volume: 15542 c#: 45.7% c++: 19.6% java: 34.7%
Nov 2010 volume: 17101 c#: 44.8% c++: 20.5% java: 34.7%
Dec 2010 volume: 16744 c#: 44.5% c++: 19.4% java: 36.2%
Jan 2011 volume: 18939 c#: 44.1% c++: 19.3% java: 36.5%
Feb 2011 volume: 19566 c#: 44.2% c++: 18.4% java: 37.4%
Mar 2011 volume: 24001 c#: 43.8% c++: 18.3% java: 37.9%
Apr 2011 volume: 22658 c#: 43.3% c++: 19.4% java: 37.3%
May 2011 volume: 24299 c#: 44.1% c++: 17.4% java: 38.5%
Many of these functions can probably be improved. I would like to see
an improved version of monotonic-inc-by or some alternative
workaround for the fact that the dates are not in sorted order. I would
also like to know if it's possible to parse XML attributes in a more
efficient manner.