Utilities which support the processing of XML based USPTO trademark bulk download files
The USPTO makes trademark data available to the public on both its own Bulk Data Download System site and the external Reed Tech USPTO data portal. The TM application data is made available in XML format on both a daily and an annual basis. The collection of ZIP files on the Reed Tech site contains both the daily XML files (front files) and the annual XML files (back files). The XML files are created and uploaded daily and contain pending and registered trademark text data, including word mark, serial number, registration number, filing date, registration date, goods and services, classification number(s), status code(s) and design search code(s).
The TM annual bulk download application files are available from either the USPTO Bulk Data Storage System site or the Reed Tech site. They consist of a series of ZIP files containing all trademark application data from April 7, 1884 through the last day of the previous year. Once unzipped, the concatenated XML files range in size from roughly 400 MB to 3 GB and contain upwards of 80,000 trademark records per file. The files are too large to be opened with most standard text editors or IDEs. Some commercial XML tools, such as Oxygen XML Editor, support viewing files of this size, but these tools require a license.
There are currently 55 TM annual application ZIP files in the series, containing all trademark data through the end of 2018. The annual TM files can be distinguished from the daily TM XML ZIP files available on the site by their file names. Annual files are named using the last day of the year followed by the series number (01-55):
e.g.
apc18840407-20181231-01.zip
apc18840407-20181231-55.zip
Daily XML application files are named using the date with no series number:
e.g.
apc190101.zip
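The naming convention above can be checked programmatically. The following is a minimal sketch, not part of the splitter itself: the function name `classify_tm_file` and both regular expressions are assumptions inferred from the example names in this document, and the date part of an annual name is accepted with either 6 or 8 digits because both forms appear here.

```python
import re

# Assumed patterns, inferred from the example file names in this README.
ANNUAL_RE = re.compile(r"^apc18840407-(?:\d{6}|\d{8})-(\d{2})\.(?:zip|xml)$")
DAILY_RE = re.compile(r"^apc(\d{6})\.(?:zip|xml)$")


def classify_tm_file(name):
    """Return ('annual', series_number), ('daily', 'YYMMDD') or ('unknown', None)."""
    annual = ANNUAL_RE.match(name)
    if annual:
        return "annual", int(annual.group(1))
    daily = DAILY_RE.match(name)
    if daily:
        return "daily", daily.group(1)
    return "unknown", None


print(classify_tm_file("apc18840407-20181231-55.zip"))
print(classify_tm_file("apc190101.zip"))
```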
The trademark splitter is a Python based utility which separates out the trademarks contained within each bulk download application file and then builds a corpus using a directory structure based on the name and date of the ZIP file. The utility currently supports both the TM daily bulk download files (front files) and the TM annual bulk download files (back files). The splitter uses a buffered reader to read and process the large XML input file in chunks, so it does not run out of memory.
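The chunked reading approach described above can be illustrated with a short sketch. This is not the code in tm_splitter.py: the function name `iter_records` is hypothetical, and the `<case-file>` record tags are an assumption about the USPTO XML layout, which is why they are passed as parameters. The 4096 byte default mirrors the buffer size reported in the tool's status output.

```python
import io


def iter_records(stream, start_tag="<case-file>", end_tag="</case-file>",
                 buffer_size=4096):
    """Yield each complete record from a text stream, reading it in chunks."""
    keep = len(start_tag) - 1  # longest possible partial start tag
    buffer = ""
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            break
        buffer += chunk
        while True:
            start = buffer.find(start_tag)
            if start == -1:
                # No record starts here; keep only a possible partial tag.
                buffer = buffer[-keep:] if keep else ""
                break
            end = buffer.find(end_tag, start)
            if end == -1:
                # Record is incomplete; drop the text before it and read more.
                buffer = buffer[start:]
                break
            end += len(end_tag)
            yield buffer[start:end]
            buffer = buffer[end:]


sample = ("<root><case-file><serial-number>1</serial-number></case-file>"
          "<case-file><serial-number>2</serial-number></case-file></root>")
records = list(iter_records(io.StringIO(sample), buffer_size=16))
print(len(records))
```

Only the current, unfinished record is held in memory at any time, so memory use depends on the size of a single trademark record rather than on the size of the input file. For a real file, the stream would be an open file object, for example `open(path, encoding="utf-8")`.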
For each individual trademark extracted from the bulk application XML file, the TM splitter creates two files:
- a complete trademark file containing all fields in USPTO standard trademark XML format
- a file containing a subset of fields in Solr ready XML format

The subset of fields in the Solr ready file includes:
- trademark serial number, used as the unique document id in Solr
- mark name
- mark drawing code
- design codes
- goods and services codes and descriptions
Example output files for trademark serial number 87275954:
- c:\tm_corpus\apc18840407-181231-55-87275954\87275954.xml
- c:\tm_corpus\apc18840407-181231-55-87275954\solr\solr_87275954.xml
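To make the Solr ready output more concrete, the sketch below builds a document in Solr's standard `<add><doc><field name="...">` XML update format. It is only an illustration: the function name `build_solr_doc` and every field name other than `id` are hypothetical, and the values in the usage example are made up. The field names the splitter actually writes may differ.

```python
import xml.etree.ElementTree as ET


def build_solr_doc(serial_number, mark_name, drawing_code, design_codes,
                   goods_services):
    """Return a Solr <add><doc> XML string for one trademark."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def field(name, value):
        element = ET.SubElement(doc, "field", {"name": name})
        element.text = value

    # The serial number doubles as the unique Solr document id.
    field("id", serial_number)
    # All field names below are hypothetical placeholders.
    field("mark_name", mark_name)
    field("mark_drawing_code", drawing_code)
    for code in design_codes:
        field("design_code", code)
    for code, description in goods_services:
        field("goods_services_code", code)
        field("goods_services_description", description)
    return ET.tostring(add, encoding="unicode")


# Made-up values, for illustration only.
solr_xml = build_solr_doc("87275954", "EXAMPLE MARK", "4000", ["01.01.01"],
                          [("025", "Clothing")])
print(solr_xml)
```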
Compatible Eclipse, Java and PyDev versions:
- Eclipse 4.6 (Neon), Java 8: PyDev 5.5
- Eclipse 4.5, Java 8: PyDev 5.2.0
- Eclipse 3.8, Java 7: PyDev 4.5.5
- Eclipse 3.x, Java 6: PyDev 2.8.2
The tool was built and tested using Python 3.6, which can be downloaded from http://www.python.org.
Install Python 3.6 and run the following command from a command, terminal or shell window to confirm the version:
C:\TrademarkPublicData\TMProcessing>python -V
Python 3.6.0
The tool can be tested with the following trademark annual application XML input file:
apc18840407-181231-55.zip
Unzip the file to a location with plenty of storage and confirm that there is an XML file of the same name:
e.g. e:\TMData\Applications_XML\apc18840407-181231-55.xml
The annual XML ZIP files are named using the last day of the previous year and the sequence number. For example, the file “apc18840407-181231-55.zip” is one of 55 files that together represent a snapshot of all USPTO registered trademarks as of Dec. 31, 2018. This sample file contains 82,652 trademark records. It is 106 MB zipped and 1.72 GB unzipped.
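The unzip-and-confirm step can also be scripted with the standard library. This is a minimal sketch; the function name `unzip_and_confirm` is hypothetical, and the demonstration uses a tiny stand-in archive created in a temporary directory rather than a real USPTO download.

```python
import os
import tempfile
import zipfile


def unzip_and_confirm(zip_path, target_dir):
    """Extract zip_path into target_dir and return the same-named XML path."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    base = os.path.splitext(os.path.basename(zip_path))[0]
    xml_path = os.path.join(target_dir, base + ".xml")
    if not os.path.isfile(xml_path):
        raise FileNotFoundError("expected XML file not found: " + xml_path)
    return xml_path


# Demonstration with a small stand-in archive.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "apc190329.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("apc190329.xml", "<trademarks/>")
    print(unzip_and_confirm(demo_zip, os.path.join(tmp, "unzipped")))
```

For the annual files, make sure the target directory has enough free space for the unzipped XML before extracting.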
The tool can be tested with the following trademark daily XML input file:
apc190329.zip
Unzip the file and confirm that there is an XML file with the same name:
e.g. apc190329.xml
Daily XML files are named using the date with no series number and only contain trademark records for a single day. This file contains records for all trademark transactions for March 29th, 2019.
The complete set of annual XML files forms a snapshot of the USPTO public trademark data as of the last day of the previous year. To build an up-to-date trademark corpus representing all publicly available trademark data, the splitter must first process all annual XML files. It can then process all daily XML files up to the previous day, which together include all changes made since the snapshot date.
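The processing order described above (annual files first, then daily files in date order) could be computed as follows. This is a sketch under assumptions: the function name `processing_order` is hypothetical, the file name patterns are inferred from the examples in this document, and names that match neither pattern are skipped.

```python
import re

# Assumed patterns for unzipped annual and daily XML file names.
ANNUAL_RE = re.compile(r"^apc18840407-(?:\d{6}|\d{8})-(\d{2})\.xml$")
DAILY_RE = re.compile(r"^apc(\d{6})\.xml$")


def processing_order(file_names):
    """Return annual files by series number, followed by daily files by date."""
    annual, daily = [], []
    for name in file_names:
        annual_match = ANNUAL_RE.match(name)
        daily_match = DAILY_RE.match(name)
        if annual_match:
            annual.append((int(annual_match.group(1)), name))
        elif daily_match:
            # YYMMDD strings sort chronologically within one century.
            daily.append((daily_match.group(1), name))
    ordered_annual = [name for _, name in sorted(annual)]
    ordered_daily = [name for _, name in sorted(daily)]
    return ordered_annual + ordered_daily


print(processing_order([
    "apc190102.xml",
    "apc18840407-181231-02.xml",
    "apc190101.xml",
    "apc18840407-181231-01.xml",
]))
```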
Copy the Python script to the same directory as the XML test data:
e.g. C:\TrademarkPublicData\TMProcessing\tm_splitter.py
Launch a command line tool such as the Windows command prompt, Cygwin or other shell window.
Navigate to the installation directory:
$ cd C:\TrademarkPublicData\TMProcessing
Run the utility by providing the name of the input XML file and the location of the output directory as command line arguments:
$ python tm_splitter.py -i e:\TMData\Applications_XML\apc18840407-181231-55.xml -d c:\tm_corpus
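For readers who want to see how the `-i` and `-d` options might be handled, here is a minimal sketch using `argparse`. It is an assumption for illustration only, not the parser used in tm_splitter.py; the function name `parse_args` and the attribute names `input_file` and `output_dir` are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the input file (-i) and output directory (-d) options."""
    parser = argparse.ArgumentParser(
        description="Split a USPTO trademark bulk XML file into a corpus.")
    parser.add_argument("-i", dest="input_file", required=True,
                        help="path of the unzipped trademark XML input file")
    parser.add_argument("-d", dest="output_dir", required=True,
                        help="directory under which the corpus is created")
    return parser.parse_args(argv)


args = parse_args([
    "-i", r"e:\TMData\Applications_XML\apc18840407-181231-55.xml",
    "-d", r"c:\tm_corpus",
])
print(args.input_file)
print(args.output_dir)
```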
The utility will echo status messages to standard output as it builds a trademark corpus under the directory specified on the command line using the input file:
Number of arguments: 5 arguments
Argument List: ['D:\\TrademarkPublicData\\TMProcessing\\tm_splitter.py', '-i', 'e:\\TMData\\Applications_XML\\apc18840407-181231-55-87275954.xml', '-d', 'c:\\tm_corpus']
Argv: ['-i', 'e:\\TMData\\Applications_XML\\apc18840407-181231-55-87275954.xml', '-d', 'c:\\tm_corpus']
reading input file:
e:\TMData\Applications_XML\apc18840407-181231-55-87275954.xml
using record separator:
read buffer size in bytes: 4096
reading TMs...please wait...
processing tms:
processing tm number: 1 / 1
1
tm_file: 87275954.xml
tm_file_solr: solr_87275954.xml
c:\\tm_corpus/apc18840407-181231-55-87275954/87275954.xml'
tm file creation complete
creating Solr ready XML file:
c:\\tm_corpus/apc18840407-181231-55-87275954/solr/solr_87275954.xml'
tm Solr file creation complete
tm processing summary:
Number of lines processed from input file:
Number of tm classifications extracted: 1
Number of opening xml elements matched: 0
Number of doctype elements matched: 0
Number of start tm elements matched: 1
Number of end tm elements matched: 1
Number of complete tms matched: 1
Corpus created under output directory:
c:\tm_corpus
tm processing complete
After the tool has completed processing, confirm that files were created for each trademark under the directory specified on the command line.
The utility creates a corpus structure that uses the input file name as the directory name and the trademark serial number as the file name, as in the example output paths listed earlier for serial number 87275954.
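The confirmation step can be automated. The sketch below derives the expected paths from the layout shown in the example paths for serial number 87275954 (a directory named after the input file, with the Solr file in a `solr` subdirectory) and reports any that are missing. The function names `corpus_paths` and `missing_files` are hypothetical, and the layout is an assumption based only on those examples.

```python
import os


def corpus_paths(output_dir, input_file, serial_number):
    """Return the expected (trademark XML, Solr XML) paths for one trademark."""
    base = os.path.splitext(os.path.basename(input_file))[0]
    tm_dir = os.path.join(output_dir, base)
    tm_file = os.path.join(tm_dir, serial_number + ".xml")
    solr_file = os.path.join(tm_dir, "solr", "solr_" + serial_number + ".xml")
    return tm_file, solr_file


def missing_files(output_dir, input_file, serial_numbers):
    """Return the expected corpus files that do not exist on disk."""
    missing = []
    for serial_number in serial_numbers:
        for path in corpus_paths(output_dir, input_file, serial_number):
            if not os.path.isfile(path):
                missing.append(path)
    return missing


print(corpus_paths("tm_corpus", "apc18840407-181231-55-87275954.xml",
                   "87275954"))
```

An empty list returned by `missing_files` means both files exist for every serial number checked.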
Choose the PyDev version that matches your Eclipse release, using the Eclipse, Java and PyDev version combinations listed earlier in this document.
e.g. MyEclipse version 2016 CI 7 uses Eclipse 4.5
Configure Eclipse 4.5 with PyDev 5.2
Read the notes on the manual installation of PyDev with Eclipse:
PyDev Install Manual 101
Download the PyDev zip file:
e.g. to download the PyDev 5.2 zip, use the following link:
PyDev 5.2
C:\Users\mdhen_000\Downloads\PyDev_5.2.0.zip
Copy the PyDev 5.2 zip to the Eclipse dropins directory and unzip:
e.g.
C:\MyEclipse2016CI\dropins\PyDev5.2.0.zip
C:\MyEclipse2016CI\dropins\org.python.pydev.mylyn.feature_0.3.0.zip
Restart MyEclipse
## Creating Python project in Eclipse:
Create Eclipse project for tm_splitter.py
Navigate to Package Explorer tab in Eclipse:
Right mouse click:
Select New->Project->Other->Pydev->PyDev Project
Name: Python_TM_Splitter
Grammar: 3.0-3.5
Interpreter: C:\Python3.6\python.exe
## Link to source code from Python project:
Right mouse click in Package Explorer
New->Folder->Advanced
Select: Link to alternate location (Linked folder) ->Browse
Select: C drive:
C:\TrademarkPublicData\TMProcessing
This will link to the external files and not create them in the workspace itself.
## Configuring Preferences in Eclipse:
Window->Preferences->PyDev->Interpreters:
Python 3.6:
C:\Python3.6\python.exe
Window->Preferences->PyDev->Editor:
Hover->PyDevDocstring Hover->Unchecked
## Open PyDev Perspective:
Right mouse click the Open Perspective icon (with the plus symbol) in the right-hand corner of the tool bar
Open Perspective-> PyDev
This source code is a work in progress and has not been fully vetted for a production environment.
The United States Department of Commerce (DOC) and the United States Patent and Trademark Office (USPTO) GitHub project code is provided on an ‘as is’ basis without any warranty of any kind, either expressed, implied or statutory, including but not limited to any warranty that the subject software will conform to specifications, any implied warranties of merchantability, fitness for a particular purpose, or freedom from infringement, or any warranty that the documentation, if provided, will conform to the subject software. DOC and USPTO disclaim all warranties and liabilities regarding third party software, if present in the original software, and distribute it as is. The user or recipient assumes responsibility for its use. DOC and USPTO have relinquished control of the information and no longer have responsibility to protect the integrity, confidentiality, or availability of the information.
User and recipient agree to waive any and all claims against the United States Government, its contractors and subcontractors as well as any prior recipient, if any. If user or recipient’s use of the subject software results in any liabilities, demands, damages, expenses or losses arising from such use, including any damages from products based on, or resulting from recipient’s use of the subject software, user or recipient shall indemnify and hold harmless the United States government, its contractors and subcontractors as well as any prior recipient, if any, to the extent permitted by law. User or recipient’s sole remedy for any such matter shall be immediate termination of the agreement. This agreement shall be subject to United States federal law for all purposes including but not limited to the validity of the readme or license files, the meaning of the provisions and rights and the obligations and remedies of the parties. Any claims against DOC or USPTO stemming from the use of its GitHub project will be governed by all applicable Federal law. “User” or “Recipient” means anyone who acquires or utilizes the subject code, including all contributors. “Contributors” means any entity that makes a modification.
This agreement or any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not in any manner constitute or imply their endorsement, recommendation or favoring by DOC or the USPTO, nor does it constitute an endorsement by DOC or USPTO or any prior recipient of any results, resulting designs, hardware, software products or any other applications resulting from the use of the subject software. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, including USPTO, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC, USPTO or the United States Government.
To the extent possible under law, https://github.com/USPTO/TrademarkPublicData has waived all copyright and related or neighboring rights to Trademark Public Data. This work is published from: United States.