In version 5 of the “OpenCitations Meta CSV dataset of all bibliographic metadata” (https://doi.org/10.6084/m9.figshare.21747461.v5) and version 5 of the “OpenCitations Meta RDF dataset of all bibliographic metadata and its provenance information” (https://doi.org/10.6084/m9.figshare.21747536.v5) the number of bibliographic resources differs from the number of bibliographic resources in the triplestore (105,953,699 BRs, from what can be read in the two datasets’ metadata on their Figshare pages).
More specifically, 99,270,517 bibliographic resources can be found in the CSV files, and 105,912,463 in the RDF files. For the CSV files, the lower count may be (partly?) explained by the fact that entities were counted by OMID, and the OMIDs of journal issues and journal volumes are not represented in the CSV dump files.
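To make the size of the discrepancy explicit, here is a small sketch that computes how many bibliographic resources each dump is missing relative to the triplestore figure. It uses only the three counts quoted above; the constant names and the `gap` helper are illustrative and not part of any OpenCitations tooling.

```python
# Counts quoted in this issue (triplestore figure taken from the Figshare metadata).
TRIPLESTORE_BRS = 105_953_699
CSV_BRS = 99_270_517
RDF_BRS = 105_912_463


def gap(expected: int, found: int) -> tuple[int, float]:
    """Return the absolute and percentage shortfall of `found` against `expected`."""
    missing = expected - found
    return missing, 100 * missing / expected


for label, found in (("CSV", CSV_BRS), ("RDF", RDF_BRS)):
    missing, pct = gap(TRIPLESTORE_BRS, found)
    print(f"{label}: {missing:,} BRs fewer than the triplestore ({pct:.3f}%)")
```

The CSV dump is short by 6,683,182 BRs and the RDF dump by 41,236, so the two gaps differ by two orders of magnitude and may well have different causes.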
The observations can be reproduced with the following script.
import csv
from os import listdir
from os.path import join, isdir
from tqdm import tqdm
import re
from zipfile import ZipFile
import json


def get_br_data_from_rdf(br_rdf_path):
    with ZipFile(br_rdf_path) as archive:
        for filepath in archive.namelist():
            if 'prov' not in filepath and filepath.endswith('.zip'):
                with ZipFile(archive.open(filepath)) as br_data_archive:
                    for file in br_data_archive.namelist():
                        if file.endswith('.json'):
                            with br_data_archive.open(file) as f:
                                data: list = json.load(f)
                                for obj in data:
                                    for br in obj['@graph']:
                                        yield br


def read_csv_tables(*dirs):
    """
    Reads the output CSV non-compressed tables from one or more directories and yields rows as dictionaries.

    :param dirs: One or more directories to read files from, provided as variable-length arguments.
    :return: Yields rows as dictionaries.
    """
    csv.field_size_limit(131072 * 12)  # increase the default field size limit
    for directory in dirs:
        if isdir(directory):
            files = [file for file in listdir(directory) if file.endswith('.csv')]
            for file in tqdm(files, desc=f"Processing {directory}", unit="file"):
                file_path = join(directory, file)
                with open(file_path, 'r', encoding='utf-8') as f:
                    reader = csv.DictReader(f, dialect='unix')
                    for row in reader:
                        yield row
        else:
            raise ValueError("Each argument must be a string representing the path to an existing directory.")


def count_brs_in_rdf(br_rdf_path):
    brcount = 0
    brset = set()
    for br in tqdm(get_br_data_from_rdf(br_rdf_path), desc='Counting BRs in RDF files', unit='br'):
        if br['@id']:
            brcount += 1
            brset.add(br['@id'].replace('https://w3id.org/oc/meta/', 'omid:'))
    print('Number of BRs in the RDF files: ', brcount)
    print('Are there duplicates?', len(brset) != brcount)
    return brcount, len(brset)


def count_brs_in_csv(meta_csv_dump):
    all_brs = set()
    row_count = 0
    pattern = r'omid:[^ \[\]]+'
    reader = read_csv_tables(meta_csv_dump)
    for row in reader:
        row_count += 1
        interested_fields = ' '.join([row['id'], row['venue'], row['volume'], row['issue']])
        omids_in_row = re.findall(pattern, interested_fields)
        all_brs.update(set(omids_in_row))
    print('Total number of BRs in CSV files: ', len(all_brs))
    print('Total number of rows: ', row_count)
    return (len(all_brs), row_count)


if __name__ == '__main__':
    csv_dump_path = 'path/to/csv/files'  # directory w/ uncompressed files of CSV dump
    print(count_brs_in_csv(csv_dump_path))
    br_rdf_path = 'path/to/rdf/br.zip'  # zip archive path!
    print(count_brs_in_rdf(br_rdf_path))
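
To illustrate why volumes and issues contribute nothing to the CSV count, the sketch below applies the same regular expression used in `count_brs_in_csv` to a single row. The row is a made-up example (the DOI, ISSN, venue name, and OMIDs are hypothetical), and it assumes that the `volume` and `issue` columns hold plain values without an OMID, as described above.

```python
import re

# Same pattern as in count_brs_in_csv: an OMID runs until a space or a square bracket.
OMID_PATTERN = r'omid:[^ \[\]]+'

# Hypothetical row; all identifiers are invented for illustration only.
sample_row = {
    'id': 'doi:10.1234/example omid:br/0601',
    'venue': 'Example Journal [issn:1234-5678 omid:br/0602]',
    'volume': '12',
    'issue': '3',
}


def omids_in_row(row: dict) -> list[str]:
    """Extract every OMID from the id, venue, volume and issue fields of one row."""
    fields = ' '.join([row['id'], row['venue'], row['volume'], row['issue']])
    return re.findall(OMID_PATTERN, fields)


print(omids_in_row(sample_row))
```

Only the article and the venue are matched in this row. The volume and the issue are bibliographic resources too, but since their OMIDs do not appear in the row, they cannot be counted from the CSV files.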