Merge pull request #4 from gdcc/3-numeric
stop hard-coding sc:Integer, allow for sc:Float
pdurbin authored Jun 3, 2024
2 parents 113919d + ea5e106 commit 2460e27
Showing 4 changed files with 28 additions and 5 deletions.
1 change: 0 additions & 1 deletion README.md
@@ -152,7 +152,6 @@ Same as above but use a JVM option in domain.xml such as the example below.
### Differences from Kaggle

- I see an `encodingFormat` of `text/comma-separated-values`. Kind of curious about that since I think `text/csv` is more the MIME type that's on https://www.iana.org/assignments/media-types/media-types.xhtml and https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types . See https://github.com/IQSS/dataverse/issues/4943#issuecomment-2145333830
-- Another thing that sticks out is that I see all of the `field`s have a `dataType` of `sc:Integer`. But nearly all of the columns (excluding `quality` and `Id`) are `sc:Float`. On the Kaggle side, we have a column type of "Id" and so if that's set on a column, we set the `dataType` to `sc:Text` since Ids can often be non-numerical. Just a minor difference there, though, so nothing alarming to me personally.

### Differences from pyDataverse

21 changes: 19 additions & 2 deletions src/main/java/io/gdcc/spi/export/croissant/CroissantExporter.java
@@ -282,14 +282,19 @@ public void exportDataset(ExportDataProvider dataProvider, OutputStream outputSt
 String variableDescription = dataVariableObject.getString("label", "");
 String variableFormatType =
         dataVariableObject.getString("variableFormatType");
+String variableIntervalType =
+        dataVariableObject.getString("variableIntervalType");
 String dataType = null;
+/**
+ * There are only two variableFormatType types on the Dataverse side:
+ * CHARACTER and NUMERIC. (See VariableType in DataVariable.java.)
+ */
 switch (variableFormatType) {
     case "CHARACTER":
         dataType = "sc:Text";
         break;
     case "NUMERIC":
-        // TODO: Integer? What about other numeric types?
-        dataType = "sc:Integer";
+        dataType = getNumericType(variableIntervalType);
         break;
     default:
         break;
@@ -400,4 +405,16 @@ private String getBibtex(
         sb.append("}");
         return sb.toString();
     }
+
+    private String getNumericType(String variableIntervalType) {
+        /**
+         * According to DataVariable.java in Dataverse, the four possibilities are: discrete, contin
+         * (continuous), nominal, and dichotomous.
+         */
+        return switch (variableIntervalType) {
+            case "discrete" -> "sc:Integer";
+            case "contin" -> "sc:Float";
+            default -> "sc:Text";
+        };
+    }
 }
@@ -301,6 +301,13 @@ public void testExportDatasetMax() throws Exception {
         assertEquals(prettyPrint(expected), prettyPrint(outputStreamMax.toString()));
     }
 
+    /*
+    The data in stata13-auto.dta looks something like this:
+    make price mpg rep78 headroom trunk weight length turn displacement gear_ratio foreign
+    "AMC Concord" 4099 22 3 2.5 11 2930 186 40 121 3.58 0
+    "AMC Pacer" 4749 17 3 3.0 11 3350 173 40 258 2.53 0
+    "AMC Spirit" 3799 22 3.0 12 2640 168 35 121 3.08 0
+    */
     @Test
     public void testExportDatasetCars() throws Exception {
         exporter.exportDataset(dataProviderCars, outputStreamCars);
4 changes: 2 additions & 2 deletions src/test/resources/cars/expected/cars-croissant.json
@@ -167,7 +167,7 @@
 "@type": "cr:Field",
 "name": "headroom",
 "description": "Headroom (in.)",
-"dataType": "sc:Integer",
+"dataType": "sc:Float",
 "source": {
   "@id": "7",
   "fileObject": {
@@ -239,7 +239,7 @@
 "@type": "cr:Field",
 "name": "gear_ratio",
 "description": "Gear Ratio",
-"dataType": "sc:Integer",
+"dataType": "sc:Float",
 "source": {
   "@id": "8",
   "fileObject": {
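For readers who want to try the new mapping outside the exporter, the following is a minimal standalone sketch of what this commit appears to do: `variableFormatType` selects between text and numeric, and for numeric variables `variableIntervalType` selects between `sc:Integer` and `sc:Float`. The class name `CroissantDataTypeSketch` and the method names are hypothetical and not part of the repository. The null guards are also an assumption added here, because a `switch` on a null `String` throws a `NullPointerException`; the commit itself does not add such a guard.

```java
// Hypothetical standalone sketch; not code from the repository.
class CroissantDataTypeSketch {

    /** Returns a Croissant dataType, or null when the format type is not recognized. */
    static String toCroissantDataType(String variableFormatType, String variableIntervalType) {
        // Assumption: treat a missing format type like an unrecognized one.
        if (variableFormatType == null) {
            return null;
        }
        return switch (variableFormatType) {
            case "CHARACTER" -> "sc:Text";
            case "NUMERIC" -> numericType(variableIntervalType);
            default -> null; // mirrors the diff, where dataType stays null
        };
    }

    /** Same mapping as getNumericType in the diff, plus a null guard. */
    static String numericType(String variableIntervalType) {
        // Assumption: fall back to sc:Text when no interval type is present.
        if (variableIntervalType == null) {
            return "sc:Text";
        }
        return switch (variableIntervalType) {
            case "discrete" -> "sc:Integer";
            case "contin" -> "sc:Float";
            default -> "sc:Text"; // nominal, dichotomous, or anything else
        };
    }

    public static void main(String[] args) {
        System.out.println(toCroissantDataType("NUMERIC", "contin"));   // sc:Float
        System.out.println(toCroissantDataType("NUMERIC", "discrete")); // sc:Integer
        System.out.println(toCroissantDataType("NUMERIC", "nominal"));  // sc:Text
        System.out.println(toCroissantDataType("CHARACTER", null));     // sc:Text
    }
}
```

Under this mapping, `headroom` and `gear_ratio` in the cars fixture would come out as `sc:Float` if their interval type is `contin`, which is consistent with the updated expected JSON. Note that numeric variables marked nominal or dichotomous are reported as `sc:Text`.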