Parsing Classes (sequence and values) #13

laironald · 2013-06-26T01:04:15Z

does the parsing of the classes strip away the "." and other punctuation? when i compare patent # 8087209 (ipg120103) with the USPTO equivalent, I see differences.

The parser returns [[u'52', u'7168'], [u'52', u'7161'], [u'52', u'463'], [u'52', u'464'], [u'52', u'2881']]

On the USPTO site, I see 52/716.8. Also, do we know why we see things in this order? The order differs from what is on the USPTO website.

gtfierro · 2013-06-27T20:59:32Z

Is this related to #8?

laironald · 2013-06-27T21:39:47Z

not related

gtfierro · 2013-07-01T18:18:51Z

Looking in the XML for ipg120103, I can see that the main-classification tag that indicates the US class is the following

<main-classification> 527168</main-classification>

According to the current USPTO XML schema 4.2, the first 3 characters are the class and the remaining characters are the subclass. This gives us class = 52 and subclass = 7168. Again according to schema 4.2, the first 3 positions of the subclass are to the left of the decimal point, giving us subclass = 716.8, so it's definitely possible to parse out 52/716.8. I believe the other classes are from the US classifications of the cited patents.

Should we extract the classes in this way?
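For illustration, here is a minimal sketch of that extraction for the plain numeric case. The function name `format_us_class` is hypothetical and is not part of the existing parser; it assumes the raw text always follows the fixed-position layout described above.

```python
def format_us_class(raw):
    """Turn a raw main-classification string into 'class/subclass'.

    Hypothetical helper; assumes the fixed-position layout:
    3 positions of class, 3 of subclass, then any digits that
    sit to the right of the assumed decimal point.
    """
    cls = raw[:3].strip()    # positions 1-3: class, right justified
    sub = raw[3:6].strip()   # positions 4-6: subclass, right justified
    frac = raw[6:].strip()   # positions 7+: digits after the decimal point
    subclass = sub + "." + frac if frac else sub
    return cls + "/" + subclass


print(format_us_class(" 527168"))  # 52/716.8
print(format_us_class(" 52463"))   # 52/463
```

Slicing happens before stripping, so the leading space in the class field keeps the positions aligned.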

laironald · 2013-07-01T22:03:50Z

gotcha. so if the subclass has more than 3 digits, then we can add a period after the third. what happens in other cases?

gtfierro · 2013-07-01T22:06:10Z

From the documentation

Table 6 - U.S. Patent Classifications
Class – A 3-position alphanumeric field right justified with leading spaces.
Design Patents – The first position will contain a “D”. Positions 2 and 3, right justified,
with a leading space when required for a single digit class.
Plant Patents – Positions 1-3 will contain a “PLT”
All Other Patents – Three alphanumeric positions, right justified,
with leading spaces
Sub-Class – Three alphanumeric positions, right justified with leading spaces, and, if present, one to three positions to the right of the decimal point (assumed decimal in the Red Book XML), left justified.

Note: An unstructured US classification would identify a sub-class
as a range with the sub-class range being separated by a hyphen “-”
A digest entry as a sub-class would appear as follows:
Three positions containing “DIG”, followed by one to three alphanumeric positions, left justified.
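As a rough sketch of how those rules could be applied, the function below labels the class field as design, plant, or other, and flags digest sub-classes. The name `describe_us_class`, the returned keys, and the sample strings for the design, plant, and digest cases are all assumptions made up for illustration; hyphenated sub-class ranges are not handled.

```python
def describe_us_class(raw):
    """Split a fixed-position US classification string into labelled parts.

    Hypothetical helper based only on the Table 6 rules quoted above.
    """
    cls_field = raw[:3]
    rest = raw[3:]

    if cls_field == "PLT":
        kind = "plant"       # positions 1-3 contain "PLT"
    elif cls_field.startswith("D"):
        kind = "design"      # first position contains "D"
    else:
        kind = "other"

    if rest.startswith("DIG"):
        # Digest entry: "DIG" followed by 1-3 left-justified positions.
        digest, subclass, suffix = True, "DIG", rest[3:].strip()
    else:
        # Regular entry: 3 positions, then positions right of the decimal.
        digest, subclass, suffix = False, rest[:3].strip(), rest[3:].strip()

    return {
        "kind": kind,
        "class": cls_field.strip(),
        "subclass": subclass,
        "suffix": suffix,
        "digest": digest,
    }


print(describe_us_class(" 527168"))
print(describe_us_class("D 2123"))    # made-up design example
print(describe_us_class("PLT156"))    # made-up plant example
print(describe_us_class(" 52DIG1"))   # made-up digest example
```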

laironald · 2013-07-01T22:42:58Z

right, this stuff is so confusing. it's like creating structure within a small field because the peeps at that USPTO team didn't want to think about creating new tags. i think we can definitely add value by applying those rules as most people wouldn't bother with this... what do you think? (i know it's painful)

doolin · 2013-07-01T22:55:32Z

BNF -> PEG (if possible) -> Test drive.

http://fdik.org/pyPEG/

gtfierro · 2013-07-02T16:50:57Z

I don't think it's really that complicated; we just have to decide how we want to transform the strings.

The basic form is <class>/<sub-class>.<more-sub-class>. This is simple enough for Design Patents. For Plant patents, the first 3 characters are PLT, which seems to function as a class (http://www.uspto.gov/web/offices/ac/ido/oeip/taf/def/plt.htm).

I think if we break up the class strings as:

class: string[:3]
subclass: string[3:6]
moresubclass: string[6:]

and don't strip the spaces, we should be fine
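A small sketch of why the spaces matter, using the raw string from ipg120103. The function name `split_class_string` is hypothetical; it only applies the three slices listed above.

```python
def split_class_string(raw):
    """Slice a raw classification string by fixed positions.

    Hypothetical helper: returns (class, subclass, moresubclass)
    exactly as sliced, without stripping anything.
    """
    return raw[:3], raw[3:6], raw[6:]


raw = " 527168"

# Slicing the untouched string keeps the fields aligned.
print(split_class_string(raw))          # (' 52', '716', '8')

# Stripping first shifts every field one position to the left.
print(split_class_string(raw.strip()))  # ('527', '168', '')
```

Stripping can still be done afterwards on each field separately, once the positions have been used to split the string.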

laironald · 2013-07-02T17:24:25Z

hey gabe. what does this data look like in DVN? to whatever extent possible we might want to match that, so it's compatible.

gtfierro · 2013-07-02T17:30:55Z

From what I can see, all the rows in /data/patentdata/DVNFIXED/class.sqlite3 look like

Patent | Prim | Class | Subclass
03930270 | 1 | 360 | 130.24

So the only gap is that the current code doesn't handle the subclass decimals. Once we handle that, we should be fine in terms of backwards compatibility.
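A sketch of what a DVN-shaped row could look like once the decimal is handled. Everything here is an assumption drawn from the single row shown above: the function name `to_dvn_row` is hypothetical, the patent number is assumed to be zero-padded to 8 characters, `Prim` is assumed to be 1 for the primary class and 0 otherwise, and the raw string "36013024" is constructed by hand to reproduce that row rather than taken from real data.

```python
def to_dvn_row(patent_number, raw, primary=True):
    """Build a (Patent, Prim, Class, Subclass) tuple from a raw class string.

    Hypothetical helper; the column layout is inferred from one sample row.
    """
    cls = raw[:3].strip()
    sub = raw[3:6].strip()
    frac = raw[6:].strip()
    subclass = f"{sub}.{frac}" if frac else sub
    patent = str(patent_number).zfill(8)  # assumed 8-character padding
    prim = 1 if primary else 0            # assumed encoding of Prim
    return (patent, prim, cls, subclass)


print(to_dvn_row(8087209, " 527168"))     # ('08087209', 1, '52', '716.8')
print(to_dvn_row("3930270", "36013024"))  # ('03930270', 1, '360', '130.24')
```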
