Merge branch 'main' into list-workflow
sterlingwes committed Feb 6, 2024
2 parents a6ef69b + d763f19 commit 2097964
Showing 4 changed files with 93 additions and 9 deletions.
22 changes: 13 additions & 9 deletions site/docs/killed-in-gaza.mdx
@@ -22,22 +22,26 @@ import { JSONFileLinks } from "@site/src/components";

Each record will have the following fields:

| Key | Value |
| ------- | ---------------------------------------------------------------------------- |
| name | original arabic name from the source list |
| en_name | english name translation |
| id | unique string (do not depend on format, it may change) |
| dob | date of birth: string in YYYY-MM-DD format, or empty string if not available |
| sex | string of one of `m` for male or `f` for female |
| age | age as a number or `-1` if unavailable |
| Key | Value |
| ------- | ------------------------------------------------------------------------------------------------ |
| name | original arabic name from the source list |
| en_name | english name translation<br/>(`?` indicates part of the name was "unknown" in the arabic source) |
| id | unique string (do not depend on format, it may change) |
| dob | date of birth: string in YYYY-MM-DD format, or empty string if not available |
| sex | string of one of `m` for male or `f` for female |
| age | age as a number or `-1` if unavailable |
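
For consumers working in TypeScript, the field table above can be sketched as a type plus a small normalizer. This is only an illustration: the names `KilledInGazaRecord`, `NormalizedRecord`, and `normalizeRecord` are hypothetical and not part of the published dataset or its tooling. The sentinel values (`-1` for age, empty string for `dob`, `?` inside `en_name`) are taken from the table.

```typescript
// Hypothetical type for one record, derived from the field table above.
interface KilledInGazaRecord {
  name: string; // original Arabic name from the source list
  en_name: string; // English translation; "?" marks an unknown name part
  id: string; // unique string; do not rely on its format
  dob: string; // "YYYY-MM-DD", or "" when not available
  sex: "m" | "f";
  age: number; // -1 when unavailable
}

// Hypothetical shape that replaces the sentinel values with null.
interface NormalizedRecord {
  id: string;
  name: string;
  enName: string;
  hasUnknownNamePart: boolean;
  dob: string | null;
  sex: "m" | "f";
  age: number | null;
}

function normalizeRecord(record: KilledInGazaRecord): NormalizedRecord {
  return {
    id: record.id,
    name: record.name,
    enName: record.en_name,
    // "?" indicates part of the name was "unknown" in the Arabic source.
    hasUnknownNamePart: record.en_name.includes("?"),
    dob: record.dob === "" ? null : record.dob,
    sex: record.sex,
    age: record.age === -1 ? null : record.age,
  };
}
```

Mapping the sentinels to `null` is a design choice for this sketch; it keeps an unavailable age from being averaged in as `-1` by accident.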

---

## English Name Translation

For a discussion of our translation methodology, [see our February update](/updates/killed-in-gaza-update-1).

## Source

The file is updated when a new list is released by Gaza's Ministry of Health.

Last updated as of January 5th for hospitals reporting in the South and November 2nd for the North.
Last updated as of January 5th for hospitals reporting in the South and November 2nd for the North. Additionally, 21 records were included from an earlier release as [noted in our February update](/updates/killed-in-gaza-update-1#notable-changes).

Additional details about the list provided by the ministry:

9 changes: 9 additions & 0 deletions site/docusaurus.config.ts
@@ -45,6 +45,11 @@ const config: Config = {
theme: {
customCss: "./src/css/custom.css",
},
blog: {
path: "updates",
routeBasePath: "updates",
blogSidebarTitle: "Recent Updates",
},
} satisfies Preset.Options,
],
],
@@ -71,6 +76,10 @@ const config: Config = {
to: "docs/examples/",
label: "Examples",
},
{
to: "updates/",
label: "Updates",
},
{
href: repoUrl,
label: "GitHub",
5 changes: 5 additions & 0 deletions site/sidebars.ts
@@ -21,6 +21,11 @@ const sidebars: SidebarsConfig = {
},
items: ["killed-in-gaza", "casualties-daily", "summary"],
},
{
type: "link",
label: "Updates",
href: "/updates",
},
{
type: "link",
label: "Example Usage",
66 changes: 66 additions & 0 deletions site/updates/2024-02-06-killed-list-update.md
@@ -0,0 +1,66 @@
---
title: Improved Name Translations
description: Describing the latest changes to our list of names dataset, and what to look out for.
slug: killed-in-gaza-update-1
tags: [killed-in-gaza]
---

We've made some significant changes to our previously published [Killed in Gaza list](/docs/killed-in-gaza), which has the names of those known to have been killed in Gaza since October 7th. This post provides more detail on our new methodology and what to expect from the changes.

## Prior Method

Our prior list relied heavily on an existing library ([arabic-names-to-en](https://github.com/hamdongunner/arabic-names-to-en)) which first tried to translate a name segment using a dictionary mapping, then fell back to a character-by-character lookup. We then had some volunteers do a visual review and incorporated manual changes. For a list of over 14 thousand names, this proved hard to manage.

## New Method

We've since built our own dictionary mapping with more name coverage, and the process now looks like this:

1. clean arabic names in the [original list](https://github.com/TechForPalestine/palestine-datasets/tree/main/scripts/data/common/killed-in-gaza/data/raw.csv) of formatting issues (using [dict_ar_ar.csv](https://github.com/TechForPalestine/palestine-datasets/tree/main/scripts/data/common/killed-in-gaza/data/dict_ar_ar.csv))
1. look up / translate each name part into english (using [dict_ar_en.csv](https://github.com/TechForPalestine/palestine-datasets/tree/main/scripts/data/common/killed-in-gaza/data/dict_ar_en.csv))
1. run final transformations when converting to JSON (see [JSON export script](https://github.com/TechForPalestine/palestine-datasets/tree/main/scripts/data/v2/killed-in-gaza.ts))

The final step includes a fallback to the old library for any remaining arabic translations that are not yet in our curated `dict_ar_ar.csv`. Currently, fewer than 2% of names are partially handled by this fallback mechanism, and we'll be working to reduce that number.
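
As a rough illustration of the steps above, here is a minimal sketch of a per-part dictionary lookup with a fallback. It is not the project's actual script: `translateName`, the `Dictionary` type, and the `fallback` callback (standing in for the old library) are hypothetical names, and it assumes both dictionaries can be treated as simple part-to-part mappings, which the real CSV files may not be.

```typescript
// Hypothetical in-memory form of a dictionary CSV: source part -> replacement.
type Dictionary = Map<string, string>;

// Hypothetical stand-in for the old library's character-level translation.
type Fallback = (arabicPart: string) => string;

interface TranslationResult {
  enName: string;
  // True when at least one name part was not found in the curated dictionary.
  usedFallback: boolean;
}

function translateName(
  rawName: string,
  arToAr: Dictionary,
  arToEn: Dictionary,
  fallback: Fallback,
): TranslationResult {
  const enParts: string[] = [];
  let usedFallback = false;

  // Split on whitespace and drop empty parts left by stray spacing.
  const rawParts = rawName.trim().split(/\s+/).filter((part) => part.length > 0);

  for (const rawPart of rawParts) {
    // Step 1: clean formatting issues via the Arabic-to-Arabic mapping.
    const cleaned = arToAr.get(rawPart) ?? rawPart;

    // Step 2: look up the cleaned part in the Arabic-to-English mapping.
    const translated = arToEn.get(cleaned);
    if (translated !== undefined) {
      enParts.push(translated);
      continue;
    }

    // Fallback: defer to the caller-supplied function for unmapped parts.
    usedFallback = true;
    enParts.push(fallback(cleaned));
  }

  return { enName: enParts.join(" "), usedFallback };
}
```

Returning `usedFallback` alongside the name makes it straightforward to count how many names still depend on the fallback.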

## Notable Changes

We've avoided what we believe would have been breaking changes to the dataset per our [versioning guide](/docs/guides/versioning), but we did add 21 new records from the original official list released in November 2023. The IDs that were introduced from that November list include:

- 401771530
- 401844790
- 405424524
- 407194836
- 411518053
- 425923364
- 436788202
- 437391725
- 438240293
- 438445371
- 441199296
- 800328817
- 802335927
- 803827518
- 804662112
- 804669000
- 901494161
- 930025457
- 932076094
- 942125832
- 95270068

The list before this change can be found on GitHub:

- [unminified](https://github.com/TechForPalestine/palestine-datasets/blob/8f1e630f5561ca13b004b1a8bb4d75f37ad58778/killed-in-gaza.json)
- [minified](https://github.com/TechForPalestine/palestine-datasets/blob/8f1e630f5561ca13b004b1a8bb4d75f37ad58778/killed-in-gaza.min.json)

Here are some additional details about the current list & the latest revision:

- there are 14,140 names
- english name changes between this and the last published list, using Levenshtein distance:
  - 24% of names had no change
  - 60% of names differed by 1 to 4 edits, inclusive
  - 15% of names differed by 5 to 9 edits, inclusive
  - 1.9% of names differed by 10 or more edits
- 92 records (0.65%) had age changes from the prior release (all 1 year less than before)
- 29 names have "unknown" for part or all of the name, and those are now represented in the english translation as `?`
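
For readers who want to run a similar comparison between two versions of the list, here is a minimal sketch of Levenshtein distance with the same edit-count buckets as above. The function names (`levenshtein`, `bucketFor`, `summarizeChanges`) are hypothetical, and this is not necessarily how the figures above were computed.

```typescript
// Classic dynamic-programming Levenshtein distance, keeping two rows at a time.
// Operates on UTF-16 code units, which is adequate for the English names.
function levenshtein(a: string, b: string): number {
  // prev[j] = distance between the first (i - 1) chars of a and first j chars of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + substitutionCost, // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

type Bucket = "no change" | "1-4 edits" | "5-9 edits" | "10+ edits";

function bucketFor(distance: number): Bucket {
  if (distance === 0) return "no change";
  if (distance <= 4) return "1-4 edits";
  if (distance <= 9) return "5-9 edits";
  return "10+ edits";
}

// Counts how many [oldName, newName] pairs fall into each bucket.
function summarizeChanges(pairs: Array<[string, string]>): Record<Bucket, number> {
  const counts: Record<Bucket, number> = {
    "no change": 0,
    "1-4 edits": 0,
    "5-9 edits": 0,
    "10+ edits": 0,
  };
  for (const [oldName, newName] of pairs) {
    counts[bucketFor(levenshtein(oldName, newName))] += 1;
  }
  return counts;
}
```

Pairs would typically be built by joining the old and new lists on `id`; dividing each count by the total number of pairs gives percentages comparable to those listed above.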

We're continually working to improve translations and the list in general. If you have ideas or want to contribute a change, please see our [contributing guide](/docs/guides/contributing).
