Showing 15 changed files with 361 additions and 17 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
# JSON | ||
|
||
## Using the data | ||
|
||
The JSON output, like the source XML data from the data dump, is partly pretty-printed. Each line of the resulting JSON array, except for the first and last lines, represents a single JSON entry. For example, the JSON output of a data dump can look like: | ||
```json | ||
[ | ||
{"field1": "value1", "field2": "value2", "...": "..." }, | ||
{"field1": "value1", "field2": "value2", "...": "..." }, | ||
{"field1": "value1", "field2": "value2", "...": "..." }, | ||
{"field1": "value1", "field2": "value2", "...": "..." } | ||
] | ||
``` | ||
|
||
This is to work around many JSON parsers not supporting incremental parsing, i.e. where you feed the parser the file and read it one entry at a time, parsing as you go. | ||
|
||
If your JSON parser doesn't support incremental parsing, you can read a line, strip the trailing comma if there is one (unless your JSON parser supports JSON5, where trailing commas are allowed), and pass the resulting JSON to the parser. This gives you a single entry from the data dump, which you can discard when you're done, before moving on to the next line. | ||
|
||
This is, essentially, meant to allow incremental parsing without a JSON parser that does incremental parsing. | ||
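
The line-by-line approach described above can be sketched as follows. This is a minimal illustration, not code from the transformer: `forEachEntry`, `forEachEntryInText`, and `EntryCallback` are hypothetical names, and the sketch assumes the layout shown earlier, with `[` and `]` on their own lines and exactly one entry per line. Each callback invocation receives one standalone JSON object as text, which you would then hand to the JSON parser of your choice.

```cpp
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical callback type: receives one entry as standalone JSON text.
using EntryCallback = std::function<void(const std::string&)>;

// Reads the stream one line at a time, so only a single entry is held in
// memory at once. Returns the number of entries passed to the callback.
inline std::size_t forEachEntry(std::istream& in, const EntryCallback& onEntry) {
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        // Trim trailing whitespace, including '\r' from CRLF line endings.
        const auto last = line.find_last_not_of(" \t\r\n");
        if (last == std::string::npos) continue; // blank line
        line.erase(last + 1);
        // Trim leading whitespace.
        line.erase(0, line.find_first_not_of(" \t"));

        // The first and last lines only open and close the array.
        if (line == "[" || line == "]") continue;

        // Strip the trailing comma that separates entries.
        if (line.back() == ',') line.pop_back();
        if (line.empty()) continue;

        onEntry(line); // parse `line` with any JSON parser here
        ++count;
    }
    return count;
}

// Convenience wrapper for text that is already in memory (small inputs only).
inline std::size_t forEachEntryInText(const std::string& text, const EntryCallback& onEntry) {
    std::istringstream in(text);
    return forEachEntry(in, onEntry);
}
```

For real data dumps, you would pass a `std::ifstream` opened on the output file to `forEachEntry`; the in-memory wrapper is only meant for small inputs and tests.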
|
||
As with any file-based format, **it's a bad idea to read the entire file at once**. If you're *absolutely sure* that the specific file you're reading is small, carry on. But if you have a system set up to read from any site, you can and will run into problems on the bigger sites, especially Stack Overflow, which is almost 200 GB in the source XML. If you need to read anything big, use some form of incremental parsing, either with support from your JSON parser or by reading the file one line at a time. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
# Documentation index | ||
* [Schema](Schema.md): Information about the structure of the data | ||
* File-based output formats | ||
|
||
This does not include the source XML part of the data dump | ||
* [JSON](JSON.md): Everything you might want to know about the JSON output |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# Schema | ||
|
||
For the exact schema, see [this question](https://meta.stackexchange.com/q/2677/332043) on meta.SE. All output versions of the data dump follow the exact same schema as the source data dump. It is worth noting that the fields are hard-coded to manage type conversion to more sensible formats; if the data dump schema changes, the code will need to be updated to reflect this change. | ||
|
||
That said, the types used to represent the values may deviate slightly between formats. For example, file-based formats are likely to represent the date as a special Date type, while the source XML format and, for example, the transformer's JSON output, use a normal string type. | ||
|
||
The exact type changes, if any, will remain consistent within a format. If you're unsure about the exact type of a field, you only need to look at one entry: every other entry will reliably use the same type for that field[^1], provided, naturally, that the field is defined and not null. | ||
|
||
## Schema definition within the transformer | ||
|
||
The transformer has a `.hpp` file dedicated to defining the schema. This is necessary because SE doesn't define a namespace with the types for the attributes, so the types have to be defined manually. This is a consequence of XML attribute values being plain strings, with no other types supported. | ||
|
||
The definitions operate with five basic types: long, double, string, date, and bool. Note that for most file parsers, dates are treated as strings. | ||
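
A type-driven conversion over such definitions could be sketched as follows. This is an illustration under stated assumptions, not the transformer's actual implementation: `Value`, `convert`, `convertField`, and `badgeFields` are hypothetical names, the `FieldType` enum is redeclared locally so the sketch compiles on its own, dates are left as strings, and boolean attributes are assumed to be written as `True`/`False` in the source XML.

```cpp
#include <map>
#include <string>
#include <variant>

// Local copy of the enum so that the sketch is self-contained.
enum FieldType { LONG, DOUBLE, STRING, DATE, BOOL };

// Hypothetical typed value; dates stay strings in this sketch.
using Value = std::variant<long long, double, bool, std::string>;

// Converts one raw XML attribute value according to its declared type.
// std::stoll and std::stod throw if `raw` is not a valid number.
inline Value convert(FieldType type, const std::string& raw) {
    switch (type) {
    case LONG:   return std::stoll(raw);
    case DOUBLE: return std::stod(raw);
    case BOOL:   return raw == "True" || raw == "true"; // assumed spelling
    case DATE:
    case STRING: break;
    }
    return raw;
}

// A subset of the Badges fields, for illustration only.
inline const std::map<std::string, FieldType> badgeFields = {
    {"Id", LONG}, {"Name", STRING}, {"Date", DATE}, {"TagBased", BOOL}};

// Looks up the field's type; unknown fields fall back to STRING here.
inline Value convertField(const std::map<std::string, FieldType>& fields,
                          const std::string& name, const std::string& raw) {
    const auto it = fields.find(name);
    return convert(it == fields.end() ? STRING : it->second, raw);
}
```

Falling back to `STRING` for unknown fields is a design choice of this sketch; a stricter variant could reject them instead, so that a schema change in the data dump is noticed immediately.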
|
||
[^1]: This excludes scenarios where SE decides to make single fields have multiple types. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,113 @@ | ||
#pragma once | ||
|
||
#include "data/ArchiveParser.hpp" | ||
#include <map> | ||
#include <string> | ||
|
||
namespace sedd::Schema { | ||
|
||
// Target type that each XML attribute value is converted to. | ||
enum FieldType { | ||
LONG, | ||
DOUBLE, | ||
STRING, | ||
DATE, | ||
BOOL | ||
}; | ||
|
||
inline std::map<DataDumpFileType::DataDumpFileType, std::map<std::string, FieldType>> schema = { | ||
{DataDumpFileType::BADGES, { | ||
{"Id", LONG}, | ||
{"UserId", LONG}, | ||
{"Name", STRING}, | ||
{"Date", DATE}, | ||
{"Class", LONG}, | ||
{"TagBased", BOOL} | ||
}}, | ||
{DataDumpFileType::COMMENTS, { | ||
{"Id", LONG}, | ||
{"PostId", LONG}, | ||
{"Score", LONG}, | ||
{"Text", STRING}, | ||
{"CreationDate", DATE}, | ||
{"UserDisplayName", STRING}, | ||
{"UserId", LONG}, | ||
{"ContentLicense", STRING} | ||
}}, | ||
{DataDumpFileType::POST_HISTORY, { | ||
{"Id", LONG}, | ||
{"PostHistoryTypeId", LONG}, | ||
{"PostId", LONG}, | ||
{"RevisionGUID", STRING}, | ||
{"CreationDate", DATE}, | ||
{"UserId", LONG}, | ||
{"UserDisplayName", STRING}, | ||
{"Comment", STRING}, | ||
{"Text", STRING}, | ||
{"ContentLicense", STRING} | ||
}}, | ||
{DataDumpFileType::POST_LINKS, { | ||
{"Id", LONG}, | ||
{"CreationDate", DATE}, | ||
{"PostId", LONG}, | ||
{"RelatedPostId", LONG}, | ||
{"LinkTypeId", LONG} | ||
}}, | ||
{DataDumpFileType::POSTS, { | ||
{"Id", LONG}, | ||
{"PostTypeId", LONG}, | ||
{"AcceptedAnswerId", LONG}, | ||
{"ParentId", LONG}, | ||
{"CreationDate", DATE}, | ||
{"Score", LONG}, | ||
{"ViewCount", LONG}, | ||
{"Body", STRING}, | ||
{"OwnerUserId", LONG}, | ||
{"OwnerDisplayName", STRING}, | ||
{"LastEditorUserId", LONG}, | ||
{"LastEditorDisplayName", STRING}, | ||
{"LastEditDate", DATE}, | ||
{"LastActivityDate", DATE}, | ||
{"Title", STRING}, | ||
{"Tags", STRING}, | ||
{"AnswerCount", LONG}, | ||
{"CommentCount", LONG}, | ||
{"FavoriteCount", LONG}, | ||
{"ClosedDate", DATE}, | ||
{"CommunityOwnedDate", DATE}, | ||
{"ContentLicense", STRING} | ||
}}, | ||
{DataDumpFileType::TAGS, { | ||
{"Id", LONG}, | ||
{"TagName", STRING}, | ||
{"Count", LONG}, | ||
{"ExcerptPostId", LONG}, | ||
{"WikiPostId", LONG}, | ||
{"IsModeratorOnly", BOOL}, | ||
{"IsRequired", BOOL} | ||
}}, | ||
{DataDumpFileType::USERS, { | ||
{"Id", LONG}, | ||
{"Reputation", LONG}, | ||
{"CreationDate", DATE}, | ||
{"DisplayName", STRING}, | ||
{"LastAccessDate", DATE}, | ||
{"WebsiteUrl", STRING}, | ||
{"Location", STRING}, | ||
{"AboutMe", STRING}, | ||
{"Views", LONG}, | ||
{"UpVotes", LONG}, | ||
{"DownVotes", LONG}, | ||
{"ProfileImageUrl", STRING}, | ||
{"AccountId", LONG} | ||
}}, | ||
{DataDumpFileType::VOTES, { | ||
{"Id", LONG}, | ||
{"PostId", LONG}, | ||
{"VoteTypeId", LONG}, | ||
{"UserId", LONG}, | ||
{"CreationDate", DATE}, | ||
{"BountyAmount", LONG} | ||
}}, | ||
}; | ||
|
||
} |