This project aims to provide a mapping scheme that converts PANGO lineages (like B.1.1.529 and AY.2) to their corresponding WHO labels or names (like Omicron and Delta). The goal is to facilitate general statistical studies on COVID-19 variants and to make the conversion process more easily accessible.
Most COVID-19 statistics aggregate all cases without distinguishing between variants. While some resources do group data by variant type, they often use technical strain names that cater primarily to a scientific audience.
That is where this tool comes in handy: it quickly adds a WHO label column to your data, making it easy to compile statistics for the general public using the variant names we all know.
For example, our mapping scheme can accurately assign WHO labels to over 85% of GISAID EpiCoV records (tested with over 200,000 Australian entries). The rest were mostly recombinant lineages, new lineages, or records without a lineage designation.
Core Mapping List: mapping.core.csv | mapping.core.sql (03Sep24)
Full Mapping List: mapping.full.json (03Sep24)
Supported WHO Labels
Alpha, Beta, Gamma, Delta, Epsilon, Zeta, Eta, Theta, Iota, Kappa, Lambda, Mu, Omicron
A WHO label is a name from the standardized nomenclature used by the World Health Organization (WHO) to classify and refer to different COVID-19 variants. By using letters of the Greek alphabet (e.g., Alpha, Beta), it simplifies communication and helps the general public, media, and health officials easily understand and refer to these variants.
Supported PANGO Lineages
All key lineages in the Cov-Lineages dataset and the PANGO consensus sequences dataset.
PANGO (Phylogenetic Assignment of Named Global Outbreak) lineages are a system for naming and tracking COVID-19 lineages. A lineage can have a shorter alias, which keeps names manageable as they grow longer with each newly identified sublineage.
Lineage Unaliasing
Unaliasing a PANGO lineage involves mapping an alias back to its original, longer lineage name.
For example: An alias like "BA.1" might represent a more complex and longer lineage name such as "B.1.1.529.1". Unaliasing "BA.1" would result in the full designation "B.1.1.529.1".
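A minimal Python sketch of unaliasing, assuming a small hand-written alias table. `ALIAS_KEY` and `unalias` are illustrative names, and the entries are limited to pairs mentioned in this document; a real table would be far larger.

```python
# Illustrative subset of an alias table: alias prefix -> unaliased prefix.
# The entries mirror examples used in this document (assumption: a real
# table would be loaded from a PANGO alias source rather than hard-coded).
ALIAS_KEY = {
    "BA": "B.1.1.529",
    "AY": "B.1.617.2",
    "Q": "B.1.1.7",
}


def unalias(lineage: str) -> str:
    """Expand the leading alias of a lineage into its full prefix."""
    prefix, sep, rest = lineage.partition(".")
    # Prefixes that are not aliases (such as "B") pass through unchanged.
    full_prefix = ALIAS_KEY.get(prefix, prefix)
    return full_prefix + sep + rest


print(unalias("BA.1"))     # B.1.1.529.1
print(unalias("AY.2"))     # B.1.617.2.2
print(unalias("B.1.1.7"))  # B.1.1.7
```

Only the first dot-separated component is treated as an alias, so the numeric suffix is carried over untouched.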
Lineage Aliasing
Aliasing is the process of finding aliases of a PANGO lineage. This involves identifying the shorter, simplified alias that corresponds to a more complex lineage name in the PANGO system.
For example, mapping "B.1.1.529.1" to its shorter alias "BA.1" would be considered alias mapping.
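The reverse direction can be sketched in Python as a longest-prefix search. `REVERSE_ALIAS_KEY` and `alias` are hypothetical names for this illustration, and the table holds only prefixes that can be inferred from examples in this document.

```python
# Illustrative subset: unaliased prefix -> alias (hypothetical table).
REVERSE_ALIAS_KEY = {
    "B.1.1.529": "BA",
    "B.1.617.2": "AY",
    "B.1.1.529.2.75.5": "BN",
}


def alias(lineage: str) -> str:
    """Shorten a full lineage name using the longest known unaliased prefix."""
    parts = lineage.split(".")
    # Try the longest proper prefix first so that deeper aliases win.
    for i in range(len(parts) - 1, 0, -1):
        prefix = ".".join(parts[:i])
        if prefix in REVERSE_ALIAS_KEY:
            return ".".join([REVERSE_ALIAS_KEY[prefix], *parts[i:]])
    return lineage  # no alias applies


print(alias("B.1.1.529.1"))            # BA.1
print(alias("B.1.1.529.2.75.5.1.2"))   # BN.1.2
```

Searching from the longest prefix downwards matters here: `B.1.1.529.2.75.5.1.2` also starts with `B.1.1.529`, but the deeper `BN` prefix gives the shorter name.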
mapping.core.csv contains only the necessary and generalised matching rules. For example, B.1.1.7 -- Alpha means both B.1.1.7 and B.1.1.7.* map to Alpha. Sublineages of B.1.1.7 (i.e., B.1.1.7.*) are not explicitly stored in the table to allow the largest descendant coverage in the lookup process.
+-----------+------------+
| Lineage   | wholabel   |
+-----------+------------+
| AY        | Delta      |  # AY + AY.* -> Delta
| BB        | Mu         |  # BB + BB.* -> Mu
| BA        | Omicron    |  # BA + BA.* -> Omicron
| ...       | ...        |
+-----------+------------+
mapping.core.sql provides the same data as mapping.core.csv but in the form of SQL statements for replicating the table in a database.
PANGO to WHO mapping in this approach is done by approximately matching each record's lineage field in your data with our mapping table. Approximate matching means finding the most specific rule that equals the lineage itself or one of its ancestors.
    # mapping: dict of lineage -> WHO label, loaded from mapping.core.csv
    def get_wholabel(lineage):
        # split the lineage by periods to handle hierarchical structure
        cpn = lineage.split('.')
        # try the full lineage first, then progressively shorter ancestors
        for i in reversed(range(len(cpn))):
            sub = '.'.join(cpn[:i + 1])
            if sub in mapping:
                return mapping.get(sub)
        # return Unknown if no match is found
        return "Unknown"
Using only strict lookups on the core mapping table will significantly reduce labeling coverage, because sublineages are not stored explicitly in it.
Approximate Lookup offers a better labeling coverage by considering all sublineages in matching, but at the expense of more computation during lookup.
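To make the trade-off concrete, here is a small Python sketch that runs both lookups over the same rules. The CSV excerpt, its header names, and the function names `strict_lookup` and `approximate_lookup` are assumptions for illustration, not the contents of the real file.

```python
import csv
import io

# Hypothetical excerpt of a core mapping file; the header names are assumed.
CORE_CSV = """lineage,wholabel
AY,Delta
B.1.1.7,Alpha
BA,Omicron
"""

mapping = {
    row["lineage"]: row["wholabel"]
    for row in csv.DictReader(io.StringIO(CORE_CSV))
}


def strict_lookup(lineage):
    """Exact key match only."""
    return mapping.get(lineage, "Unknown")


def approximate_lookup(lineage):
    """Match the lineage itself, then each ancestor from longest to shortest."""
    parts = lineage.split(".")
    for i in range(len(parts), 0, -1):
        label = mapping.get(".".join(parts[:i]))
        if label is not None:
            return label
    return "Unknown"


for lineage in ["AY.2", "B.1.1.7", "BA.2.75", "C.37"]:
    print(lineage, strict_lookup(lineage), approximate_lookup(lineage))
```

With these sample rules, strict lookup labels only `B.1.1.7`, while approximate lookup also labels `AY.2` and `BA.2.75`; `C.37` stays `Unknown` in both because no rule covers it.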
mapping.full.json compiles mapping entries from a full list of commonly known lineages obtained here. This approach requires no approximate matching; a simple equality lookup on the mapping table keys suffices.
    {
      "B.1.617.2.86": {
        "aliased": "AY.86",
        "nextclade": "21J",
        "unaliased": "B.1.617.2.86",
        "wholabel": "Delta"
      },
      "B.1.1.529.2.75.5.1.2": {
        "aliased": "BN.1.2",
        "nextclade": "22D",
        "unaliased": "B.1.1.529.2.75.5.1.2",
        "wholabel": "Omicron"
      }
    }
The file is generated using generate_full_mapping.py. Additional details such as nextstrainClade, aliasing, and unaliasing info are obtained from this dataset.
PANGO to WHO mapping in this approach is done by a simple equality match on the lookup keys.
    def get_wholabel(lineage):
        # check if the lineage is in the lookup dictionary
        if lineage in mapping:
            return mapping.get(lineage)
        # return Unknown if the lineage is not found
        return "Unknown"
You may use approximate lookup on this .json file to improve labeling accuracy at the expense of performance.
Strict Lookup offers efficient retrieval but may omit sublineages or ancestors not included in the predefined definitions.
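A possible way to work with entries of this shape in Python is sketched below. It assumes the keys are unaliased names, as in the excerpt above, and adds a secondary index so that aliased names resolve too. `by_alias`, `lookup_entry`, and `wholabel_of` are illustrative names.

```python
import json

# Two entries copied from the excerpt above; a real file holds many more.
FULL_JSON = """
{
  "B.1.617.2.86": {"aliased": "AY.86", "nextclade": "21J",
                   "unaliased": "B.1.617.2.86", "wholabel": "Delta"},
  "B.1.1.529.2.75.5.1.2": {"aliased": "BN.1.2", "nextclade": "22D",
                           "unaliased": "B.1.1.529.2.75.5.1.2",
                           "wholabel": "Omicron"}
}
"""

full = json.loads(FULL_JSON)
# Secondary index so that aliased names (e.g. "AY.86") can be looked up too.
by_alias = {entry["aliased"]: entry for entry in full.values()}


def lookup_entry(lineage):
    """Return the record for a lineage given in either form, or None."""
    return full.get(lineage) or by_alias.get(lineage)


def wholabel_of(lineage):
    """Strict lookup of the WHO label; 'Unknown' when no record exists."""
    entry = lookup_entry(lineage)
    return entry["wholabel"] if entry else "Unknown"


print(wholabel_of("AY.86"))                 # Delta
print(wholabel_of("B.1.1.529.2.75.5.1.2"))  # Omicron
print(wholabel_of("XBB.1.5"))               # Unknown (not in this excerpt)
```

Each record carries more than the label, so the same lookup can also return the `nextclade` or the alternative name of a lineage.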
- Download mapping.core.csv.
- Copy the data from the .csv file into a new worksheet named mapping in your Excel file that contains the data to be mapped.
- Sort data in the mapping worksheet in ascending order based on the first column.
- Assuming the data worksheet contains your data with the lineage column, apply the following formula in the data worksheet:
    =IF(ISNUMBER(SEARCH(INDEX(mapping!A:A, MATCH(C2, mapping!A:A, 1)), C2)), INDEX(mapping!B:B, MATCH(C2, mapping!A:A, 1)), "Unknown")
  Replace C2 with the cell containing the lineage data.
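For readers who want to see what the formula does, the following Python sketch mimics its two steps on a few hypothetical rows. It is an approximation under stated assumptions: Excel's text ordering is not identical to Python's string ordering, and `ROWS` and `excel_style_lookup` are names invented for this illustration.

```python
from bisect import bisect_right

# Hypothetical mapping rows, sorted ascending by lineage like the worksheet.
ROWS = sorted([("AY", "Delta"), ("BA", "Omicron"), ("BB", "Mu")])
KEYS = [key for key, _ in ROWS]


def excel_style_lookup(lineage):
    # MATCH(C2, mapping!A:A, 1): position of the largest key <= the lineage.
    i = bisect_right(KEYS, lineage) - 1
    if i < 0:
        return "Unknown"  # nothing sorts at or before the lineage
    key, label = ROWS[i]
    # SEARCH(key, C2): case-insensitive test that the key occurs in the lineage.
    return label if key.lower() in lineage.lower() else "Unknown"


print(excel_style_lookup("AY.2"))  # Delta
print(excel_style_lookup("BA.1"))  # Omicron
print(excel_style_lookup("A.1"))   # Unknown
```

The sort step in the instructions is what makes this work: the nearest key at or before the lineage is only meaningful when the mapping column is in ascending order.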
Refer to excel_example.xlsx in the example folder for a practical demonstration.
- Download mapping.core.sql or mapping.core.csv.
- Create an instance of the mapping table in your database by either importing the .csv file or running the SQL commands in the .sql file. Name this table mapping.
- Assuming the data table contains your data with the lineage column, create a new materialised view or table in your database using the following query:
    CREATE TABLE dataview AS
    SELECT
        a.year,   -- data columns from your data table
        a.month,  -- data columns from your data table
        COALESCE(a.lineage, 'Unknown') AS ori_lineage,  -- lineage column from your data table
        COALESCE(b.lineage, 'Unknown') AS ref_lineage,
        COALESCE(b.wholabel, 'Unknown')
    FROM data AS a          -- your data table
    LEFT JOIN mapping AS b  -- mapping table provided by this repo
    ON b.lineage = (
        SELECT lineage
        FROM mapping
        WHERE lineage = ori_lineage               -- strict match
           OR ori_lineage LIKE lineage || '.%'    -- approximate match
        ORDER BY lineage DESC                     -- most specific match
        LIMIT 1
    );
  The above query is provided in example/maping_query.sql.
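The same idea can be tried end to end from Python with an in-memory SQLite database. This is a simplified variant rather than the repository's query: it uses a correlated subquery in the select list, orders candidates by key length, and works on a few made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mapping (lineage TEXT PRIMARY KEY, wholabel TEXT);
    CREATE TABLE data (year INTEGER, month INTEGER, lineage TEXT);
""")
# Made-up sample rows for illustration only.
conn.executemany(
    "INSERT INTO mapping VALUES (?, ?)",
    [("AY", "Delta"), ("B.1.1.7", "Alpha"), ("BA", "Omicron")],
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(2021, 7, "AY.2"), (2021, 1, "B.1.1.7"), (2021, 6, "C.37"), (2022, 1, None)],
)

rows = conn.execute("""
    SELECT COALESCE(a.lineage, 'Unknown') AS ori_lineage,
           COALESCE((SELECT m.wholabel
                     FROM mapping AS m
                     WHERE a.lineage = m.lineage              -- strict match
                        OR a.lineage LIKE m.lineage || '.%'   -- approximate match
                     ORDER BY LENGTH(m.lineage) DESC          -- most specific match
                     LIMIT 1), 'Unknown') AS wholabel
    FROM data AS a
    ORDER BY a.rowid
""").fetchall()

for row in rows:
    print(row)
```

Rows with a missing lineage or a lineage that no rule covers come back as `Unknown`, mirroring the `COALESCE` defaults in the query above.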
Refer to sqlite_example.db in the example folder for a practical demonstration using a SQLite Database.
We have provided two sample scripts in the example folder to get you started.
- get_wholabel_from_csv.py: Example script using the CSV file with Approximate Lookup. It takes a lineage input and outputs the WHO label. The script uses the mapping.core.csv file.
- get_wholabel_from_json.py: Example script using the JSON file with Strict Lookup. It takes a lineage input and outputs the WHO label. The script uses the mapping.full.json file.
These scripts are ready to run, and you can easily modify them for your specific needs.
You do not need to perform any of the following steps to do the conversion. They only give an overview of how we derived the conversion / mapping table.
More Details
- Base List Creation:
  - Source: Utilise definitions from CoVariants and Wikipedia.
  - Purpose: Establish initial mappings of PANGO lineages to WHO labels based on consensus data.
- Base List Refinement:
  - Sources for Expansion:
  - Example Expansion:
    - Alpha: B.1.1.7 = Q
    - Delta: B.1.617.2 = AY
    - Omicron: B.1.1.529 = BA
  - Purpose: Add direct aliases to the base list and merge sublineages into their common ancestors for better coverage, such as combining all BA and XBB sublineages into their respective categories.
- Reference Table Creation:
  - Source: Extract data from PANGO Consensus Sequences Summary and PANGO Designation Alias Key available on GitHub.
  - Purpose: Form the reference table metadata for aliasing and unaliasing PANGO lineages.
- Lineage Unaliasing:
  - SQL Query:
      SELECT * FROM (
          SELECT a.lineage, a.wholabel, GROUP_CONCAT(DISTINCT b.unaliased) AS c
          FROM mapping AS a
          LEFT JOIN metadata AS b
              ON a.lineage = b.lineage OR b.lineage LIKE a.lineage || '.%'
          GROUP BY a.lineage, a.wholabel
          ORDER BY a.wholabel ASC
      )
      WHERE c IS NOT NULL AND lineage != c;
  - Purpose: Identify the root lineage of a given lineage from the base list through unaliasing. Determine the most specific common ancestors of similar sublineages to formulate matching rules with broader coverage. For example, if CH.1.1 maps to B.1.1.529.2.75.3.4.1.1.1.1, and B.1.1.529.* is Omicron, then CH.* should be classified as Omicron.
- Lineage Aliasing:
  - SQL Query:
      SELECT
          GROUP_CONCAT(DISTINCT a.lineage),
          GROUP_CONCAT(DISTINCT a.wholabel),
          SUBSTR(b.lineage, 1, INSTR(b.lineage, '.') - 1) AS plin,
          GROUP_CONCAT(b.lineage),
          GROUP_CONCAT(DISTINCT b.unaliased)
      FROM mapping AS a
      LEFT JOIN metadata AS b
          ON unaliased = a.lineage OR unaliased LIKE a.lineage || '.%'
      GROUP BY plin
      ORDER BY a.wholabel, plin ASC;
  - Purpose: Identify all possible aliases for a given lineage, with a particular focus on Omicron due to its numerous sublineages and descendants.
- Cross-Checking:
  - SQL Query:
      WITH lineage_cte AS (
          SELECT SUBSTR(lineage, 1, INSTR(lineage, '.') - 1) AS plin, nextclade
          FROM metadata
          WHERE nextclade LIKE '23_' AND plin != ''
          GROUP BY plin, nextclade
      )
      SELECT plin, nextclade, b.lineage
      FROM lineage_cte AS a
      LEFT JOIN mapping AS b ON b.lineage = a.plin
      WHERE b.lineage IS NULL;
  - Purpose: Identify lineages that may be missing from the list but should be labeled according to the Nextstrain Clade consensus (e.g., 23I mapping to Omicron) using data from CoVariants. Pay special attention to Omicron, given its frequent emergence of new recombinant lineages.
Manual inspection is involved at each step to ensure accurate generalisation and concise addition of new matching rules.
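The generalisation idea behind the unaliasing and aliasing steps can be illustrated with a short Python sketch. It is not the procedure used to build the repository's tables: `BASE_RULES`, `METADATA`, `label_for`, and `candidate_prefix_rules` are hypothetical, and the sample pairs are limited to lineages mentioned in this document.

```python
# Hypothetical inputs: base rules on unaliased names, and
# (lineage, unaliased) pairs as they might appear in a metadata table.
BASE_RULES = {"B.1.1.529": "Omicron", "B.1.617.2": "Delta"}
METADATA = [
    ("CH.1.1", "B.1.1.529.2.75.3.4.1.1.1.1"),
    ("BA.1", "B.1.1.529.1"),
    ("AY.2", "B.1.617.2.2"),
    ("B.1.1.7", "B.1.1.7"),
    ("B.1.1.529", "B.1.1.529"),
]


def label_for(unaliased):
    """Most specific base rule covering an unaliased lineage, or None."""
    parts = unaliased.split(".")
    for i in range(len(parts), 0, -1):
        label = BASE_RULES.get(".".join(parts[:i]))
        if label is not None:
            return label
    return None


def candidate_prefix_rules(metadata):
    """Propose prefix -> label rules; ambiguous prefixes are dropped."""
    seen = {}
    for lineage, unaliased in metadata:
        prefix = lineage.split(".")[0]
        seen.setdefault(prefix, set()).add(label_for(unaliased))
    return {
        prefix: next(iter(labels))
        for prefix, labels in seen.items()
        if len(labels) == 1 and None not in labels
    }


print(candidate_prefix_rules(METADATA))
```

With these inputs the sketch proposes `CH`, `BA`, and `AY` as prefix rules, while `B` is rejected because its sublineages do not share a single label. Candidates like these would still need the manual inspection described above before being added.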
The file mapping.core.csv represents the final output of the process described above. The generate_full_mapping.py script is then executed to produce mapping.full.json, which includes more detailed information for direct lookup.
This project is intended for educational purposes only. The creator assumes no responsibility for its use. Users should verify the accuracy of the conversions before applying them to their projects or analyses.