Finish the base JSONTransformer
Now to run it on more data :MildPanic:
LunarWatcher committed Jul 21, 2024
1 parent fa130dd commit a966f58
Showing 15 changed files with 361 additions and 17 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/transformer.yml
Original file line number Diff line number Diff line change
@@ -28,3 +28,7 @@ jobs:
mkdir build && cd build
cmake ..
cmake --build . -j 2
- name: Run test
run: |
cd build
cmake --build . -j 2 --target test
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@ compile_commands.json
# Contains the downloaded data dump
/downloads/
/out/
*.7z

# Source: https://github.com/github/gitignore/blob/main/Python.gitignore {{{
# Byte-compiled / optimized / DLL files
22 changes: 22 additions & 0 deletions docs/JSON.md
@@ -0,0 +1,22 @@
# JSON

## Using the data

The JSON output, like the source XML from the data dump, is partly pretty-printed. Each line of the resulting JSON array, except for the first and last lines, represents a single JSON entry. For example, the JSON output of a data dump can look like:
```json
[
{"field1": "value1", "field2": "value2", "...": "..." },
{"field1": "value1", "field2": "value2", "...": "..." },
{"field1": "value1", "field2": "value2", "...": "..." },
{"field1": "value1", "field2": "value2", "...": "..." }
]
```

This layout works around the fact that many JSON parsers don't support incremental parsing, i.e. feeding the file to the parser a piece at a time and parsing as you go.

If your JSON parser doesn't support incremental parsing, you can read a line, strip the trailing comma if there is one (and your parser doesn't support JSON5, where trailing commas are allowed), and feed the resulting JSON to the parser. This gives you a single entry from the data dump, which you can discard when you're done before moving on to the next line.

This is, essentially, meant to allow incremental parsing without a JSON parser that does incremental parsing.

As with any file-based format, **it's a bad idea to read the entire file at once**. If you're *absolutely sure* that the specific file you're reading is small, carry on - but if you have a system set up to read from any site, you can and will run into problems on the bigger sites - especially Stack Overflow, which is almost 200GB in the source XML. If you need to read anything big, use some form of incremental parsing - either with support from your JSON parser, or by reading the file one line at a time.
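The line-by-line approach described above can be sketched in Python; this is a minimal illustration, not part of the transformer, and the filename in the usage comment is hypothetical:

```python
import json

def iter_entries(path):
    """Yield one data dump entry at a time without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip the opening/closing brackets of the array, and blank lines
            if line in ("[", "]", ""):
                continue
            # Strip the trailing comma so a plain (non-JSON5) parser accepts the line
            yield json.loads(line.rstrip(","))

# Usage: process entries one at a time, discarding each when done
# for entry in iter_entries("Badges.json"):
#     print(entry["Id"])
```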

6 changes: 6 additions & 0 deletions docs/README.md
@@ -0,0 +1,6 @@
# Documentation index
* [Schema](Schema.md): Information about the structure of the data
* File-based output formats (this does not include the source XML part of the data dump):
    * [JSON](JSON.md): Everything you might want to know about the JSON output
16 changes: 16 additions & 0 deletions docs/Schema.md
@@ -0,0 +1,16 @@
# Schema

For the exact schema, see [this question](https://meta.stackexchange.com/q/2677/332043) on meta.SE. All output versions of the data dump follow the exact same schema as the source data dump. It is worth noting that the fields are hard-coded to manage type conversion to more sensible formats; if the data dump schema changes, the code will need to be updated to reflect this change.

That said, the types used to represent the values may deviate slightly between formats. For example, file-based formats are likely to represent the date as a special Date type, while the source XML format and, for example, the transformer's JSON output, use a normal string type.

The exact type changes, if any, will remain consistent within a format. If you're unsure about the exact type, look at a single entry; provided the field is defined and not null, the type it uses there will reliably apply throughout[^1].

## Schema definition within the transformer

The transformer has a `.hpp` file dedicated to defining the schema. This is necessary because SE doesn't define a namespace declaring the types of the attributes, so the types have to be defined manually; plain XML attributes carry no type information beyond strings.

The definitions operate with five basic types: long, double, string, date, and bool. Note that most file parsers treat dates as strings.

[^1]: This excludes scenarios where SE decides to make single fields have multiple types.
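The schema-driven conversion can be illustrated with a small sketch; the field names below come from the Badges schema in this commit, but the function itself is hypothetical and only mirrors the idea, not the transformer's actual code:

```python
# A few Badges fields mapped to their schema types, as in the transformer's Schema.hpp
BADGES_SCHEMA = {
    "Id": "long",
    "Name": "string",
    "Date": "date",
    "TagBased": "bool",
}

def convert(field, raw):
    """Convert a raw XML attribute string according to its schema type."""
    kind = BADGES_SCHEMA[field]
    if kind == "long":
        return int(raw)
    if kind == "bool":
        # Assumes the dump spells booleans as "True"/"False"
        return raw == "True"
    # Strings and dates stay as strings, matching the JSON output
    return raw
```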

47 changes: 35 additions & 12 deletions transformer/CMakeLists.txt
@@ -10,14 +10,9 @@ set (CMAKE_POSITION_INDEPENDENT_CODE ON)

set (ENABLE_TEST OFF CACHE STRING "" FORCE)

add_executable(sedd-transformer
src/Main.cpp

src/data/ArchiveParser.cpp
src/data/ArchiveWriter.cpp

src/data/transformers/JSONTransformer.cpp
)
if (UNIX)
set (CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -fsanitize=undefined")
endif()

include(FetchContent)
FetchContent_Declare(
@@ -43,20 +38,48 @@ FetchContent_Declare(
GIT_REPOSITORY https://github.com/CLIUtils/CLI11.git
GIT_TAG v2.4.2
)
FetchContent_Declare(
yyjson
GIT_REPOSITORY https://github.com/ibireme/yyjson
GIT_TAG 0.10.0
)

FetchContent_MakeAvailable(yyjson)
FetchContent_MakeAvailable(cli11)
FetchContent_MakeAvailable(spdlog)
FetchContent_MakeAvailable(stc)
FetchContent_MakeAvailable(archive)
FetchContent_MakeAvailable(pugixml)

target_include_directories(sedd-transformer PUBLIC src)
target_link_Libraries(
sedd-transformer
PUBLIC
add_executable(sedd-transformer
src/Main.cpp

)
add_library(sedd-src STATIC
src/data/ArchiveParser.cpp
src/data/ArchiveWriter.cpp

src/data/transformers/JSONTransformer.cpp
)
target_include_directories(sedd-src PUBLIC src)
target_link_libraries(
sedd-src
PUBLIC
archive
stc
spdlog::spdlog
pugixml
CLI11::CLI11
yyjson
)
target_link_libraries(
sedd-transformer
PUBLIC
sedd-src
)

add_subdirectory(tests EXCLUDE_FROM_ALL)
add_custom_target(test
COMMAND tests
DEPENDS tests
COMMENT "Test the data dump transformer ")
3 changes: 0 additions & 3 deletions transformer/src/data/ArchiveParser.cpp
@@ -155,9 +155,6 @@ void ArchiveParser::read(const GlobalContext& conf) {
throw std::runtime_error("Failed to parse line as XML");
}
const auto& node = doc.first_child();
//for (pugi::xml_attribute attr : node.attributes()) {
//spdlog::debug("{} = {}", attr.name(), attr.value());
//}

if (conf.transformer) {
conf.transformer->parseLine(node, ctx);
2 changes: 2 additions & 0 deletions transformer/src/data/ArchiveParser.hpp
@@ -11,6 +11,7 @@ namespace sedd {

namespace DataDumpFileType {
enum DataDumpFileType {
// If badges stops being first, update the CheckSchema test
BADGES,
COMMENTS,
POST_HISTORY,
@@ -19,6 +20,7 @@
TAGS,
USERS,
VOTES,
// Must always be last; place any other values ahead of this
_UNKNOWN
};

113 changes: 113 additions & 0 deletions transformer/src/data/Schema.hpp
@@ -0,0 +1,113 @@
#pragma once

#include "data/ArchiveParser.hpp"
#include <map>
#include <string>

namespace sedd::Schema {

enum FieldType {
LONG,
DOUBLE,
STRING,
DATE,
BOOL
};

inline std::map<DataDumpFileType::DataDumpFileType, std::map<std::string, FieldType>> schema = {
{DataDumpFileType::BADGES, {
{"Id", LONG},
{"UserId", LONG},
{"Name", STRING},
{"Date", DATE},
{"Class", LONG},
{"TagBased", BOOL}
}},
{DataDumpFileType::COMMENTS, {
{"Id", LONG},
{"PostId", LONG},
{"Score", LONG},
{"Text", STRING},
{"CreationDate", DATE},
{"UserDisplayName", STRING},
{"UserId", LONG},
{"ContentLicense", STRING}
}},
{DataDumpFileType::POST_HISTORY, {
{"Id", LONG},
{"PostHistoryTypeId", LONG},
{"PostId", LONG},
{"RevisionGUID", STRING},
{"CreationDate", DATE},
{"UserId", LONG},
{"UserDisplayName", STRING},
{"Comment", STRING},
{"Text", STRING},
{"ContentLicense", STRING}
}},
{DataDumpFileType::POST_LINKS, {
{"Id", LONG},
{"CreationDate", DATE},
{"PostId", LONG},
{"RelatedPostId", LONG},
{"LinkTypeId", LONG}
}},
{DataDumpFileType::POSTS, {
{"Id", LONG},
{"PostTypeId", LONG},
{"AcceptedAnswerId", LONG},
{"ParentId", LONG},
{"CreationDate", DATE},
{"Score", LONG},
{"ViewCount", LONG},
{"Body", STRING},
{"OwnerUserId", LONG},
{"OwnerDisplayName", STRING},
{"LastEditorUserId", LONG},
{"LastEditorDisplayName", STRING},
{"LastEditDate", DATE},
{"LastActivityDate", DATE},
{"Title", STRING},
{"Tags", STRING},
{"AnswerCount", LONG},
{"CommentCount", LONG},
{"FavoriteCount", LONG},
{"ClosedDate", DATE},
{"CommunityOwnedDate", DATE},
{"ContentLicense", STRING}
}},
{DataDumpFileType::TAGS, {
{"Id", LONG},
{"TagName", STRING},
{"Count", LONG},
{"ExcerptPostId", LONG},
{"WikiPostId", LONG},
{"IsModeratorOnly", BOOL},
{"IsRequired", BOOL}
}},
{DataDumpFileType::USERS, {
{"Id", LONG},
{"Reputation", LONG},
{"CreationDate", DATE},
{"DisplayName", STRING},
{"LastAccessDate", DATE},
{"WebsiteUrl", STRING},
{"Location", STRING},
{"AboutMe", STRING},
{"Views", LONG},
{"UpVotes", LONG},
{"DownVotes", LONG},
{"ProfileImageUrl", STRING},
{"AccountId", LONG}
}},
{DataDumpFileType::VOTES, {
{"Id", LONG},
{"PostId", LONG},
{"VoteTypeId", LONG},
{"UserId", LONG},
{"CreationDate", DATE},
{"BountyAmount", LONG}
}},
};

}
55 changes: 53 additions & 2 deletions transformer/src/data/transformers/JSONTransformer.cpp
@@ -1,15 +1,21 @@
#include "JSONTransformer.hpp"
#include "data/ArchiveWriter.hpp"
#include "data/Schema.hpp"
#include "spdlog/spdlog.h"
#include "wrappers/yyjson.hpp"
#include <stdexcept>

namespace sedd {

void JSONTransformer::beginFile(const ParserContext& ctx) {
this->writer->open(DataDumpFileType::toFilename(ctx.currType) + ".json");
this->writer->write("[\n");
started = false;
}

void JSONTransformer::endFile() {
this->writer->write("]");
this->writer->write("\n]");
started = false;
this->writer->close();
}

@@ -27,7 +33,52 @@ void JSONTransformer::endArchive(const ParserContext& ctx) {
}

void JSONTransformer::parseLine(const pugi::xml_node& row, const ParserContext& ctx) {
this->writer->write("This is where a line would go\n");
YYJsonWriter jw;
yyjson_mut_val* obj = yyjson_mut_obj(*jw);
if (obj == nullptr) {
throw std::runtime_error("Failed to allocate JSON object");
}
yyjson_mut_doc_set_root(*jw, obj);


const auto& types = Schema::schema.at(ctx.currType);

for (const auto& attr : row.attributes()) {
// TODO: check if the second condition is necessary or not
if (attr.empty() || attr.value() == nullptr) {
yyjson_mut_obj_add_null(*jw, obj, attr.name());
continue;
}

switch (types.at(attr.name())) {
case Schema::LONG:
yyjson_mut_obj_add_int(*jw, obj, attr.name(), attr.as_llong());
break;
case Schema::DOUBLE:
yyjson_mut_obj_add_real(*jw, obj, attr.name(), attr.as_double());
break;
case Schema::BOOL:
yyjson_mut_obj_add_bool(*jw, obj, attr.name(), attr.as_bool());
break;
case Schema::STRING:
case Schema::DATE:
yyjson_mut_obj_add_str(*jw, obj, attr.name(), attr.as_string());
break;
default:
throw std::runtime_error("Invalid type for field " + std::string(attr.name()));
}
}

auto json = jw.write();
if (json.error()) {
throw std::runtime_error("Failed to construct JSON string");
}
if (this->started) {
this->writer->write(",\n");
} else {
this->started = true;
}
this->writer->write(json.str);
}

}
2 changes: 2 additions & 0 deletions transformer/src/data/transformers/JSONTransformer.hpp
@@ -2,13 +2,15 @@

#include "data/ArchiveWriter.hpp"
#include "data/Transformer.hpp"
#include "wrappers/yyjson.hpp"
#include <filesystem>

namespace sedd {

class JSONTransformer : public Transformer {
private:
std::shared_ptr<ArchiveWriter> writer;
bool started = false;
public:
void endFile() override;
void beginFile(const ParserContext& ctx) override;