Finish the base JSONTransformer
Now to run it on more data :MildPanic:
LunarWatcher committed Jul 21, 2024
1 parent fa130dd commit a966f58
Showing 15 changed files with 361 additions and 17 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/transformer.yml
Original file line number Diff line number Diff line change
@@ -28,3 +28,7 @@ jobs:
mkdir build && cd build
cmake ..
cmake --build . -j 2
- name: Run test
run: |
cd build
cmake --build . -j 2 --target test
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@ compile_commands.json
# Contains the downloaded data dump
/downloads/
/out/
*.7z

# Source: https://github.com/github/gitignore/blob/main/Python.gitignore {{{
# Byte-compiled / optimized / DLL files
22 changes: 22 additions & 0 deletions docs/JSON.md
@@ -0,0 +1,22 @@
# JSON

## Using the data

The JSON output, like the source XML from the data dump, is partly pretty-printed. Each line of the resulting JSON array, except for the first and last lines, represents a single JSON entry. For example, the JSON output of a data dump can look like:
```json
[
{"field1": "value1", "field2": "value2", "...": "..." },
{"field1": "value1", "field2": "value2", "...": "..." },
{"field1": "value1", "field2": "value2", "...": "..." },
{"field1": "value1", "field2": "value2", "...": "..." }
]
```

This layout works around the fact that many JSON parsers don't support incremental parsing, i.e. feeding the file to the parser a piece at a time and parsing as you go.

If your JSON parser doesn't support incremental parsing, you can read a line, strip the trailing comma if there is one (and your parser doesn't support JSON5, where trailing commas are allowed), and feed the resulting JSON to the parser. This gives you a single entry from the data dump, which you can discard when you're done before moving on to the next line.

This is, essentially, meant to allow incremental parsing without a JSON parser that does incremental parsing.

As with any file-based format, **it's a bad idea to read the entire file at once**. If you're *absolutely sure* that the specific file you're reading is small, carry on - but if you have a system set up to read from any site, you can and will run into problems on the bigger sites - especially Stack Overflow, which is almost 200GB in the source XML. If you need to read anything big, use some form of incremental parsing - either with support from your JSON parser, or by reading the file one line at a time.
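The line-by-line approach described above can be sketched in Python; this is a minimal illustration, not part of the transformer, and the filename in the usage comment is hypothetical:

```python
import json

def iter_entries(path):
    """Yield one data dump entry at a time without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip the opening/closing brackets of the array, and blank lines
            if line in ("[", "]", ""):
                continue
            # Strip the trailing comma so a plain (non-JSON5) parser accepts the line
            yield json.loads(line.rstrip(","))

# Usage: process entries one at a time, discarding each when done
# for entry in iter_entries("Badges.json"):
#     print(entry["Id"])
```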

6 changes: 6 additions & 0 deletions docs/README.md
@@ -0,0 +1,6 @@
# Documentation index
* [Schema](Schema.md): Information about the structure of the data
* File-based output formats (this does not include the source XML part of the data dump):
    * [JSON](JSON.md): Everything you might want to know about the JSON output
16 changes: 16 additions & 0 deletions docs/Schema.md
@@ -0,0 +1,16 @@
# Schema

For the exact schema, see [this question](https://meta.stackexchange.com/q/2677/332043) on meta.SE. All output versions of the data dump follow the exact same schema as the source data dump. It is worth noting that the fields are hard-coded to manage type conversion to more sensible formats; if the data dump schema changes, the code will need to be updated to reflect this change.

That said, the types used to represent the values may deviate slightly between formats. For example, file-based formats are likely to represent the date as a special Date type, while the source XML format and, for example, the transformer's JSON output, use a normal string type.

The exact type changes, if any, will remain consistent within a format. If you're unsure about the exact type, look at a single entry; provided the field is defined and not null, the type it uses there will reliably apply throughout[^1].

## Schema definition within the transformer

The transformer has a `.hpp` file dedicated to defining the schema. This is necessary because SE doesn't define a namespace declaring the types of the attributes, so the types have to be defined manually; plain XML attributes carry no type information beyond strings.

The definitions operate with five basic types: long, double, string, date, and bool. Note that most file parsers treat dates as strings.

[^1]: This excludes scenarios where SE decides to make single fields have multiple types.
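The schema-driven conversion can be illustrated with a small sketch; the field names below come from the Badges schema in this commit, but the function itself is hypothetical and only mirrors the idea, not the transformer's actual code:

```python
# A few Badges fields mapped to their schema types, as in the transformer's Schema.hpp
BADGES_SCHEMA = {
    "Id": "long",
    "Name": "string",
    "Date": "date",
    "TagBased": "bool",
}

def convert(field, raw):
    """Convert a raw XML attribute string according to its schema type."""
    kind = BADGES_SCHEMA[field]
    if kind == "long":
        return int(raw)
    if kind == "bool":
        # Assumes the dump spells booleans as "True"/"False"
        return raw == "True"
    # Strings and dates stay as strings, matching the JSON output
    return raw
```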

47 changes: 35 additions & 12 deletions transformer/CMakeLists.txt
@@ -10,14 +10,9 @@ set (CMAKE_POSITION_INDEPENDENT_CODE ON)

set (ENABLE_TEST OFF CACHE STRING "" FORCE)

add_executable(sedd-transformer
src/Main.cpp

src/data/ArchiveParser.cpp
src/data/ArchiveWriter.cpp

src/data/transformers/JSONTransformer.cpp
)
if (UNIX)
set (CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -fsanitize=undefined")
endif()

include(FetchContent)
FetchContent_Declare(
@@ -43,20 +38,48 @@ FetchContent_Declare(
GIT_REPOSITORY https://github.com/CLIUtils/CLI11.git
GIT_TAG v2.4.2
)
FetchContent_Declare(
yyjson
GIT_REPOSITORY https://github.com/ibireme/yyjson
GIT_TAG 0.10.0
)

FetchContent_MakeAvailable(yyjson)
FetchContent_MakeAvailable(cli11)
FetchContent_MakeAvailable(spdlog)
FetchContent_MakeAvailable(stc)
FetchContent_MakeAvailable(archive)
FetchContent_MakeAvailable(pugixml)

target_include_directories(sedd-transformer PUBLIC src)
target_link_Libraries(
sedd-transformer
PUBLIC
add_executable(sedd-transformer
src/Main.cpp

)
add_library(sedd-src STATIC
src/data/ArchiveParser.cpp
src/data/ArchiveWriter.cpp

src/data/transformers/JSONTransformer.cpp
)
target_include_directories(sedd-src PUBLIC src)
target_link_libraries(
sedd-src
PUBLIC
archive
stc
spdlog::spdlog
pugixml
CLI11::CLI11
yyjson
)
target_link_libraries(
sedd-transformer
PUBLIC
sedd-src
)

add_subdirectory(tests EXCLUDE_FROM_ALL)
add_custom_target(test
COMMAND tests
DEPENDS tests
COMMENT "Test the data dump transformer ")
3 changes: 0 additions & 3 deletions transformer/src/data/ArchiveParser.cpp
@@ -155,9 +155,6 @@ void ArchiveParser::read(const GlobalContext& conf) {
throw std::runtime_error("Failed to parse line as XML");
}
const auto& node = doc.first_child();
//for (pugi::xml_attribute attr : node.attributes()) {
//spdlog::debug("{} = {}", attr.name(), attr.value());
//}

if (conf.transformer) {
conf.transformer->parseLine(node, ctx);
2 changes: 2 additions & 0 deletions transformer/src/data/ArchiveParser.hpp
@@ -11,6 +11,7 @@ namespace sedd {

namespace DataDumpFileType {
enum DataDumpFileType {
// If badges stops being first, update the CheckSchema test
BADGES,
COMMENTS,
POST_HISTORY,
@@ -19,6 +20,7 @@
TAGS,
USERS,
VOTES,
// Must always be last; place any other values ahead of this
_UNKNOWN
};

113 changes: 113 additions & 0 deletions transformer/src/data/Schema.hpp
@@ -0,0 +1,113 @@
#pragma once

#include "data/ArchiveParser.hpp"
#include <map>
#include <string>

namespace sedd::Schema {

enum FieldType {
LONG,
DOUBLE,
STRING,
DATE,
BOOL
};

inline std::map<DataDumpFileType::DataDumpFileType, std::map<std::string, FieldType>> schema = {
{DataDumpFileType::BADGES, {
{"Id", LONG},
{"UserId", LONG},
{"Name", STRING},
{"Date", DATE},
{"Class", LONG},
{"TagBased", BOOL}
}},
{DataDumpFileType::COMMENTS, {
{"Id", LONG},
{"PostId", LONG},
{"Score", LONG},
{"Text", STRING},
{"CreationDate", DATE},
{"UserDisplayName", STRING},
{"UserId", LONG},
{"ContentLicense", STRING}
}},
{DataDumpFileType::POST_HISTORY, {
{"Id", LONG},
{"PostHistoryTypeId", LONG},
{"PostId", LONG},
{"RevisionGUID", STRING},
{"CreationDate", DATE},
{"UserId", LONG},
{"UserDisplayName", STRING},
{"Comment", STRING},
{"Text", STRING},
{"ContentLicense", STRING}
}},
{DataDumpFileType::POST_LINKS, {
{"Id", LONG},
{"CreationDate", DATE},
{"PostId", LONG},
{"RelatedPostId", LONG},
{"LinkTypeId", LONG}
}},
{DataDumpFileType::POSTS, {
{"Id", LONG},
{"PostTypeId", LONG},
{"AcceptedAnswerId", LONG},
{"ParentId", LONG},
{"CreationDate", DATE},
{"Score", LONG},
{"ViewCount", LONG},
{"Body", STRING},
{"OwnerUserId", LONG},
{"OwnerDisplayName", STRING},
{"LastEditorUserId", LONG},
{"LastEditorDisplayName", STRING},
{"LastEditDate", DATE},
{"LastActivityDate", DATE},
{"Title", STRING},
{"Tags", STRING},
{"AnswerCount", LONG},
{"CommentCount", LONG},
{"FavoriteCount", LONG},
{"ClosedDate", DATE},
{"CommunityOwnedDate", DATE},
{"ContentLicense", STRING}
}},
{DataDumpFileType::TAGS, {
{"Id", LONG},
{"TagName", STRING},
{"Count", LONG},
{"ExcerptPostId", LONG},
{"WikiPostId", LONG},
{"IsModeratorOnly", BOOL},
{"IsRequired", BOOL}
}},
{DataDumpFileType::USERS, {
{"Id", LONG},
{"Reputation", LONG},
{"CreationDate", DATE},
{"DisplayName", STRING},
{"LastAccessDate", DATE},
{"WebsiteUrl", STRING},
{"Location", STRING},
{"AboutMe", STRING},
{"Views", LONG},
{"UpVotes", LONG},
{"DownVotes", LONG},
{"ProfileImageUrl", STRING},
{"AccountId", LONG}
}},
{DataDumpFileType::VOTES, {
{"Id", LONG},
{"PostId", LONG},
{"VoteTypeId", LONG},
{"UserId", LONG},
{"CreationDate", DATE},
{"BountyAmount", LONG}
}},
};

}
55 changes: 53 additions & 2 deletions transformer/src/data/transformers/JSONTransformer.cpp
@@ -1,15 +1,21 @@
#include "JSONTransformer.hpp"
#include "data/ArchiveWriter.hpp"
#include "data/Schema.hpp"
#include "spdlog/spdlog.h"
#include "wrappers/yyjson.hpp"
#include <stdexcept>

namespace sedd {

void JSONTransformer::beginFile(const ParserContext& ctx) {
this->writer->open(DataDumpFileType::toFilename(ctx.currType) + ".json");
this->writer->write("[\n");
started = false;
}

void JSONTransformer::endFile() {
this->writer->write("]");
this->writer->write("\n]");
started = false;
this->writer->close();
}

@@ -27,7 +33,52 @@ void JSONTransformer::endArchive(const ParserContext& ctx) {
}

void JSONTransformer::parseLine(const pugi::xml_node& row, const ParserContext& ctx) {
this->writer->write("This is where a line would go\n");
YYJsonWriter jw;
yyjson_mut_val* obj = yyjson_mut_obj(*jw);
if (obj == nullptr) {
throw std::runtime_error("Failed to allocate JSON object");
}
yyjson_mut_doc_set_root(*jw, obj);


const auto& types = Schema::schema.at(ctx.currType);

for (const auto& attr : row.attributes()) {
// TODO: check if the second condition is necessary or not
if (attr.empty() || attr.value() == nullptr) {
yyjson_mut_obj_add_null(*jw, obj, attr.name());
continue;
}

switch (types.at(attr.name())) {
case Schema::LONG:
yyjson_mut_obj_add_int(*jw, obj, attr.name(), attr.as_llong());
break;
case Schema::DOUBLE:
yyjson_mut_obj_add_real(*jw, obj, attr.name(), attr.as_double());
break;
case Schema::BOOL:
yyjson_mut_obj_add_bool(*jw, obj, attr.name(), attr.as_bool());
break;
case Schema::STRING:
case Schema::DATE:
yyjson_mut_obj_add_str(*jw, obj, attr.name(), attr.as_string());
break;
default:
throw std::runtime_error("Invalid type for field " + std::string(attr.name()));
}
}

auto json = jw.write();
if (json.error()) {
throw std::runtime_error("Failed to construct JSON string");
}
if (this->started) {
this->writer->write(",\n");
} else {
this->started = true;
}
this->writer->write(json.str);
}

}
2 changes: 2 additions & 0 deletions transformer/src/data/transformers/JSONTransformer.hpp
@@ -2,13 +2,15 @@

#include "data/ArchiveWriter.hpp"
#include "data/Transformer.hpp"
#include "wrappers/yyjson.hpp"
#include <filesystem>

namespace sedd {

class JSONTransformer : public Transformer {
private:
std::shared_ptr<ArchiveWriter> writer;
bool started = false;
public:
void endFile() override;
void beginFile(const ParserContext& ctx) override;