Commit 992e6e2

Merge branch '510k-article'
2 parents 0cfc98f + 9f040df commit 992e6e2

9 files changed

+359
-7
lines changed

articles/medical-device-analysis.mdx

Lines changed: 349 additions & 0 deletions
@@ -0,0 +1,349 @@
1+
---
2+
title: "A Copy of a Copy of a Copy: the Story of FDA Medical Device Clearances"
3+
date: "2024-03-10"
4+
thumbnail: "/medical-device-analysis/thumbnail.png"
5+
thumbnailAlt: ""
6+
description: "Uncovering medical device ancestry from the FDA's 510k data and creating a free website for exploring it."
7+
tags: ["python", "sqlite", "510k"]
8+
---
9+
10+
## Motivation
11+
12+
Back in the Fall of 2022, I watched the 2018 documentary *The Bleeding Edge*[^the-bleeding-edge],
13+
which investigates the FDA's medical device clearance process.
14+
I was surprised to learn that for many devices, clinical trials are not required.
15+
16+
The documentary scrutinizes the FDA's 510(k) clearance process,
17+
through which some devices can be fast-tracked for use if they are
18+
"substantially equivalent" to an existing device,
19+
even without a clinical trial.
20+
21+
Several of these devices led to patient injuries, including
22+
bleeding, organ puncture, and even cobalt poisoning.
23+
24+
![The Bleeding Edge Poster - Copyright imdb](/medical-device-analysis/the-bleeding-edge.png)
25+
26+
Although the documentary led to some of these devices being pulled off the market[^essure],
27+
I was curious whether similar devices might exist that haven't gotten the publicity yet.
28+
However, after some research, I found that the public data available was insufficient to answer my questions.
29+
30+
So, I put on my detective hat and began investigating the data that was available.
31+
32+
Before I dive into the details, it's important to know a little backstory on
33+
how medical devices get regulated in America.
34+
35+
### A Brief History of Medical Device Regulation
36+
37+
Before 1938, there were a few laws regulating food and drug quality, but
38+
the Federal government did not have much authority to enforce them.
39+
40+
In 1938, the Federal Food, Drug, and Cosmetic Act[^food-drug-cosmetic-act]
41+
was signed into law, granting the FDA (Food and Drug Administration) much
42+
more authority and requiring a pre-market
43+
review for new drugs to verify safety and effectiveness.
44+
45+
Until 1976, however, medical devices were not regulated by the FDA.
46+
A 1976 amendment to the Federal Food, Drug, and Cosmetic Act
47+
extended the FDA's jurisdiction to include medical devices.
48+
The amendment established 3 classes of medical devices: Class I, Class II, and Class III,
49+
ordered by the level of risk to the patient.
50+
51+
For example, latex examination gloves are a Class I device (low risk),
52+
but a pacemaker is a Class III device (high risk).
53+
Devices like joint implants and tooth fillings fall somewhere in the middle, at Class II.
54+
55+
There are a few pathways for device manufacturers to begin selling their devices.[^device-pathways]
56+
I won't get into all of them, but the main pathways are PMA or 510(k).
57+
58+
### PMA
59+
60+
Premarket Approval (PMA) is a more rigorous process which requires a clinical trial to
61+
demonstrate that the device is safe and effective.
62+
All Class III devices must go through this approval process to be legally marketed.
63+
However, only 1% [^510k-study] of medical devices are approved through this process.
64+
65+
### 510(k)
66+
67+
The 510(k) process provides a faster route to marketing a new device.
68+
New devices are allowed to be marketed if it is shown that they are "substantially equivalent"
69+
to a legally marketed device, and are either Class I or Class II.
70+
This process does not necessarily require a clinical trial to demonstrate safety or effectiveness;
71+
the applicant need only show that the device is equivalent to something legally on the market already.
72+
This marketed device is referred to as the "predicate" device.
73+
74+
The predicate device could itself have been allowed through a few pathways.
75+
The predicate could be a pre-amendment device, meaning something that was on the market before 1976 and grandfathered in.
76+
It could also be a device that received Premarket Approval.
77+
78+
Lastly, the predicate device could itself have been cleared by the 510(k) process.
79+
80+
This last detail is what sparked my curiosity.
81+
82+
A cleared device could be equivalent to some long chain of 510(k) cleared
83+
devices, without any of these devices requiring a clinical trial!
84+
85+
This was not how I expected medical devices to be cleared for the market.
86+
87+
## Wait, it's just a graph?
88+
89+
This is where I had to dust off my computer science knowledge.
90+
We can model this as a graph, where each 510(k) submission is a node,
91+
and the predicate relationship is an edge.
92+
A device may use multiple predicates in its application.
93+
94+
For example, if device A has predicate devices X and Y, the graph might look like this:
95+
96+
![graph of predicates to a device](/medical-device-analysis/predicate-graph.png)
97+
98+
We could imagine this extending out across dozens or hundreds of devices in
99+
a 510(k)'s "ancestry".
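
To make the idea of "ancestry" concrete, here is a minimal plain-Python sketch. The device names and the `predicates` mapping are made up for illustration (they are not real 510(k) numbers), and the `ancestry` helper is hypothetical; it just follows predicate edges with a breadth-first search.

```python
from collections import deque

# Hypothetical submissions: each key is a device, each value lists its predicates.
predicates = {
    "A": ["X", "Y"],
    "X": ["P"],
    "Y": ["P", "Q"],
    "P": [],
    "Q": [],
}


def ancestry(device, predicates):
    """Return every device reachable by following predicate edges from `device`."""
    seen = set()
    queue = deque(predicates.get(device, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a predicate shared by several devices is only visited once
        seen.add(current)
        queue.extend(predicates.get(current, []))
    return seen


print(sorted(ancestry("A", predicates)))  # ['P', 'Q', 'X', 'Y']
```

Device `P` is a predicate of both `X` and `Y`, which is why the structure is a graph rather than a simple tree.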
100+
101+
## Finding the data
102+
103+
The next problem I had was how to find the data.
104+
There were two sources for FDA 510(k) data that I found:
105+
The Premarket Notification Database[^pmn-database] and even better, the
106+
OpenFDA API dataset[^fda-api-dataset], which provided a fairly comprehensive dataset as a single
107+
JSON file.
108+
109+
"Great!", I thought, this data will be perfect for mapping out predicate devices as a graph.
110+
111+
Except for one issue: the data does not include predicate devices.
112+
113+
As it turns out, the only way to find the predicates of a given device is by checking if a
114+
PDF summary of the application is available, and if so, examining it manually.
115+
These PDFs are completely free-form and do not even have a standardized template.
116+
117+
As of March 2024, there are 85,791 510(k) applications with summaries available, so doing this
118+
manually was just not going to work.
119+
120+
Another problem is that the FDA does not provide a single dataset containing all of these PDFs, as they do
121+
with the API data.
122+
123+
So, left with no other option, I decided to scrape the FDA's website to download all 85,791 PDFs.
124+
125+
## Scraping PDFs
126+
127+
I used Python's BeautifulSoup library to scrape the database.
128+
In the database, each entry contains a link to the summary PDF file.
129+
130+
![device database](/medical-device-analysis/database-screenshot.png)
131+
132+
The "Summary" link goes to a PDF file hosted on the FDA's website.
133+
134+
![device summary PDF](/medical-device-analysis/device-summary-screenshot.png)
135+
136+
With BeautifulSoup, it was fairly simple to search for the link text "Summary" to find this link:
137+
138+
139+
```python
140+
from bs4 import BeautifulSoup
141+
142+
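# `response` is the HTTP response for one database entry page;
# `.data` holds the raw HTML body (as in urllib3).
soup = BeautifulSoup(response.data, features="html.parser")
# Find the <a> tag whose text is exactly "Summary" (None if the entry has no summary).
summary = soup.find("a", string="Summary")
url = summary.attrs.get("href") if summary else None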
145+
```
146+
147+
Once I had this link, I just downloaded each PDF and stored it locally.
148+
149+
### Being a polite scraper
150+
151+
The FDA's robots.txt includes the following lines:
152+
153+
```
154+
Hit-rate: 30 # wait 30 seconds before starting a new URL request default=30
155+
Visiting-hours: 23:00EDT-05:00EDT #index this site between 11PM - 5AM EDT
156+
```
157+
158+
This means that, to be polite (and not get blocked), my scraper can only run between 11 PM and 5 AM (6 hours),
159+
and can only make a request every 30 seconds.
160+
161+
This means we can only scrape 6 * 60 * (60 / 30) = 720 files per day.
162+
So it took about 85,791 / 720 ≈ 119 days to scrape every PDF.
163+
164+
To accomplish this, I ran the scraper on a $6/month DigitalOcean droplet and let it go to work,
165+
then I checked back in 4 months and copied the PDFs to a local directory.
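
The scheduling rules above can be sketched roughly as follows. This is not the actual scraper code: the constant and function names are hypothetical, and the sketch assumes the caller passes in the current wall-clock time in Eastern time.

```python
from datetime import time

HIT_RATE_SECONDS = 30        # from robots.txt: one request every 30 seconds
WINDOW_START = time(23, 0)   # 11 PM Eastern
WINDOW_END = time(5, 0)      # 5 AM Eastern
WINDOW_HOURS = 6


def in_visiting_hours(now):
    """True if `now` (an Eastern wall-clock time) falls inside the overnight window."""
    # The window wraps past midnight, so it is the union of two ranges.
    return now >= WINDOW_START or now < WINDOW_END


def requests_per_night():
    """How many requests fit in one window at the allowed hit rate."""
    return WINDOW_HOURS * 60 * 60 // HIT_RATE_SECONDS


total_pdfs = 85_791
print(requests_per_night())               # 720
print(total_pdfs / requests_per_night())  # a little over 119 nights
```

A real scraping loop would check `in_visiting_hours` before each request and sleep for `HIT_RATE_SECONDS` between requests.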
166+
167+
## Parsing Predicates
168+
169+
Now that I had all the PDFs, the next problem was parsing them.
170+
171+
With the `pypdf` Python library, it was easy to grab the embedded text from each document.
172+
Once I had the embedded text, I just ran a regex match for strings with a K followed by 6 digits like `K123456`.
173+
There were some common variations like using a `#` or a space after the `K`, which I also matched.
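
A regex along these lines would cover the variations described above. This is a sketch rather than the exact pattern I used: the pattern, the `find_k_numbers` helper, and the sample text are illustrative, and the `own_id` parameter is an assumption for skipping the K number of the summary's own device.

```python
import re

# "K", optionally followed by spaces and/or "#", then exactly six digits.
K_NUMBER = re.compile(r"\bK[\s#]*(\d{6})\b")


def find_k_numbers(text, own_id=None):
    """Return normalized K numbers in order of first appearance, skipping `own_id`."""
    found = []
    for match in K_NUMBER.finditer(text):
        k_number = "K" + match.group(1)
        if k_number != own_id and k_number not in found:
            found.append(k_number)
    return found


sample = (
    "Device K200001 is substantially equivalent to K123456, "
    "K# 654321 and K 111222 (see K123456)."
)
print(find_k_numbers(sample, own_id="K200001"))
# ['K123456', 'K654321', 'K111222']
```

The trailing `\b` keeps the pattern from matching the first six digits of a longer number.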
174+
175+
However, I hit another issue with older summary documents.
176+
The older PDF documents did not have embedded text, because they were often scanned PDFs,
177+
not digital.
178+
Using tesseract, I ran OCR (Optical Character Recognition) on the PDFs where I could not find predicate device IDs.
179+
This worked pretty well, but the OCR quality was fairly low.
180+
181+
Finally, for documents that were still missing predicates, I manually entered them
182+
using a Python script to display the PDF and accept the ID as input.
183+
184+
Sometimes this required searching for the predicate manually by name in the database.
185+
186+
Out of the 85,791 devices with summaries, I was able to find predicates for 63,389 (74%) devices.
187+
188+
Sometimes the summary would omit the exact predicate used or would provide a name that was
189+
not specific enough to identify a 510(k) application.
190+
These I ignored.
191+
192+
### Storing the data
193+
194+
I stored the predicate data in a simple SQLite database with a table containing two columns:
195+
`node_to` and `node_from`. This allowed me to run some simple queries on the data, and also
196+
join it against the device data found in the API dump.
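
As a rough illustration of this edge table, the sketch below builds it in an in-memory database and looks up the direct predicates of one device. The K numbers are placeholders, not real submissions, and the table name `predicate_graph_edge` matches the schema shown later in this article.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE predicate_graph_edge ("
    "node_from TEXT, node_to TEXT, PRIMARY KEY (node_from, node_to))"
)

# Each row is (predicate, new device). The last row is a deliberate duplicate.
edges = [
    ("K000001", "K000003"),
    ("K000002", "K000003"),
    ("K000001", "K000003"),
]
# OR IGNORE skips rows that would violate the primary key.
con.executemany("INSERT OR IGNORE INTO predicate_graph_edge VALUES (?, ?)", edges)

rows = con.execute(
    "SELECT node_from FROM predicate_graph_edge WHERE node_to = ? ORDER BY node_from",
    ("K000003",),
).fetchall()
direct_predicates = [row[0] for row in rows]
print(direct_predicates)  # ['K000001', 'K000002']
```

The composite primary key means that parsing the same summary twice cannot create duplicate edges.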
197+
198+
## Answering some questions
199+
200+
### How many predicates does a device have on average?
201+
202+
Let's fetch the edges from the database and load them into a networkx graph.
203+
204+
I'm deliberately ignoring devices that are missing predicate information.
205+
I only care about the devices where we do have ancestry available.
206+
207+
```python
208+
import networkx as nx
209+
import sqlite3
210+
211+
con = sqlite3.connect("../scripts/devices.db")
212+
cur = con.cursor()
213+
214+
cur.execute("SELECT node_from, node_to FROM predicate_graph_edge;")
215+
all_edges = cur.fetchall()
216+
217+
g = nx.DiGraph(all_edges)
218+
```
219+
220+
We use a `DiGraph` object, which stands for Directed Graph, because
221+
each edge in our graph is directed.
222+
Device A being the predicate of device B does not imply that
223+
B is the predicate of A.
224+
225+
We might wonder how many predicates, on average, a device has.
226+
In graph theory terms this is called the *in-degree* of the node,
227+
in other words, the number of edges pointing into it (one edge per predicate).
228+
229+
Now we can calculate the average degree of a device:
230+
231+
```python
232+
print("Average degree:")
233+
print(sum(map(lambda x: x[1], g.in_degree())) / len(g))
234+
```
235+
236+
```
237+
Average degree:
238+
1.7794783986208664
239+
```
240+
241+
We can also calculate the median degree:
242+
243+
```python
244+
import statistics

print("Median degree")
print(statistics.median(map(lambda x: x[1], g.in_degree())))
246+
```
247+
248+
```
249+
Median degree
250+
1
251+
```
252+
253+
The average is skewed higher than the median because the degree appears to follow a power law distribution,
254+
which we can see by checking a logarithm-scaled histogram of the degrees.
255+
256+
![Graph showing the histogram of node degrees](/medical-device-analysis/degree-histogram.png)
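
To see why a few heavily connected devices pull the mean above the median, here is a small plain-Python sketch on made-up edges (not real FDA data). It uses `collections.Counter` instead of networkx so it runs with only the standard library; all names in it are illustrative.

```python
import statistics
from collections import Counter

# Hypothetical (predicate, new_device) edges.
edges = [
    ("P1", "D1"),
    ("P1", "D2"),
    ("P1", "D3"),
    ("D1", "D4"),
    ("D2", "D4"),
    ("D3", "D4"),
    ("P1", "D4"),
    ("P2", "D4"),
]

nodes = {node for edge in edges for node in edge}
in_degree = Counter(node_to for _, node_to in edges)

# Include nodes with no predicates (in-degree 0), as the graph-wide average does.
degrees = [in_degree.get(node, 0) for node in sorted(nodes)]

mean_degree = statistics.mean(degrees)
median_degree = statistics.median(degrees)
histogram = Counter(degrees)

print(mean_degree)      # 8 edges / 6 nodes, about 1.33
print(median_degree)    # 1.0
print(sorted(histogram.items()))  # [(0, 2), (1, 3), (5, 1)]
```

The single device with five predicates raises the mean, while the median stays at 1. The mean in-degree is always the number of edges divided by the number of nodes.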
257+
258+
## Setting up a website
259+
260+
To make this data more accessible to anyone, I set up a website to visualize it: [510k.fyi](https://www.510k.fyi/)
261+
262+
The website is open source, free to use, and requires no account. Check it out!
263+
264+
The tech stack for this website is:
265+
266+
* Python and FastAPI for the backend
267+
* SQLite for the database (although I plan to switch to PostgreSQL later)
268+
* NextJS and React for the frontend, along with react-force-graph for visualizing the data
269+
270+
The most interesting part of this setup is using SQLite instead of a graph database.
271+
I was thinking a dedicated graph DB like neo4j might be needed, but the performance
272+
so far with SQLite has been great.
273+
274+
My schema looks like this:
275+
276+
```sql
277+
CREATE TABLE predicate_graph_edge(node_from TEXT,node_to TEXT,
278+
FOREIGN KEY(node_from) REFERENCES device(k_number),
279+
FOREIGN KEY(node_to) REFERENCES device(k_number),
280+
PRIMARY KEY(node_from, node_to));
281+
```
282+
283+
In other words, the `predicate_graph_edge` table holds all edges (relationships from predicate devices to the new device).
284+
These edge columns are also foreign keys to the `device` table, which represents the nodes in our graph.
285+
286+
Using a recursive CTE, we can query for all the predicate ancestors of a given device.
287+
This query will return all the edges.
288+
289+
```sql
290+
WITH RECURSIVE ancestor(n)
291+
AS (
292+
VALUES(?)
293+
UNION
294+
SELECT node_from FROM predicate_graph_edge, ancestor
295+
WHERE predicate_graph_edge.node_to=ancestor.n
296+
)
297+
SELECT node_to, node_from
298+
FROM predicate_graph_edge
299+
WHERE predicate_graph_edge.node_to IN ancestor
300+
```
301+
302+
But we also want the node data, so we can embed this in a subquery and JOIN it on the
303+
device table (and the recalls table).
304+
305+
```sql
306+
SELECT device.k_number, recall_id, recall.reason_for_recall
307+
FROM (
308+
WITH RECURSIVE ancestor(n)
309+
AS (
310+
VALUES(?)
311+
UNION
312+
SELECT node_from FROM predicate_graph_edge, ancestor
313+
WHERE predicate_graph_edge.node_to=ancestor.n
314+
)
315+
SELECT node_to, node_from
316+
FROM predicate_graph_edge
317+
WHERE predicate_graph_edge.node_to IN ancestor
318+
) ancestry
319+
JOIN device ON ancestry.node_from = device.k_number
320+
LEFT JOIN device_recall ON device_recall.k_number = device.k_number
321+
LEFT JOIN recall ON device_recall.recall_id = recall.id;
322+
```
323+
324+
The `?` in this case would be replaced with the root device in question.
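
For a sense of how that parameter gets bound, here is a minimal sketch that runs the edges-only ancestry query from Python's `sqlite3` module against an in-memory database. The K numbers are placeholders, the `ancestry_edges` helper is hypothetical, and the `ORDER BY` is added only to make the output deterministic.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE predicate_graph_edge ("
    "node_from TEXT, node_to TEXT, PRIMARY KEY (node_from, node_to))"
)
con.executemany(
    "INSERT INTO predicate_graph_edge VALUES (?, ?)",
    [
        ("K000001", "K000002"),
        ("K000002", "K000004"),
        ("K000003", "K000004"),
        ("K000004", "K000005"),  # a descendant of K000004, not an ancestor
    ],
)

ANCESTRY_QUERY = """
WITH RECURSIVE ancestor(n) AS (
    VALUES(?)
    UNION
    SELECT node_from FROM predicate_graph_edge, ancestor
    WHERE predicate_graph_edge.node_to = ancestor.n
)
SELECT node_from, node_to
FROM predicate_graph_edge
WHERE predicate_graph_edge.node_to IN ancestor
ORDER BY node_from, node_to
"""


def ancestry_edges(con, root):
    """Return every (predicate, device) edge in the ancestry of `root`."""
    return con.execute(ANCESTRY_QUERY, (root,)).fetchall()


print(ancestry_edges(con, "K000004"))
# [('K000001', 'K000002'), ('K000002', 'K000004'), ('K000003', 'K000004')]
```

Using `UNION` rather than `UNION ALL` de-duplicates ancestors, so a predicate reachable through several paths is only expanded once.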
325+
326+
For a nontrivial query (327 rows), SQLite's `.timer on` command shows how fast the query is:
329+
330+
```
331+
Run Time: real 0.155 user 0.135287 sys 0.019959
332+
```
333+
334+
You can also see this data visualized [here](https://www.510k.fyi/devices/?id=K121623).
335+
336+
## Source
337+
338+
The source code for the scraper, website, and data analysis is all available under an open source
339+
license here: [https://github.com/wcedmisten/fda-510k-analysis](https://github.com/wcedmisten/fda-510k-analysis)
340+
341+
#
342+
343+
[^the-bleeding-edge]: https://en.wikipedia.org/wiki/The_Bleeding_Edge
344+
[^essure]: https://en.wikipedia.org/wiki/Essure
345+
[^food-drug-cosmetic-act]: https://en.wikipedia.org/wiki/Federal_Food,_Drug,_and_Cosmetic_Act
346+
[^device-pathways]: https://www.fda.gov/medical-devices/device-advice-comprehensive-regulatory-assistance/how-study-and-market-your-device
347+
[^510k-study]: https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/227466
348+
[^pmn-database]: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm
349+
[^fda-api-dataset]: https://open.fda.gov/apis/device/510k/download/

pages/about.tsx

Lines changed: 3 additions & 7 deletions
@@ -11,8 +11,8 @@ function AboutPage() {
 </p>
 
 <p>
-I graduated from Virginia Tech with a B.S.
-in Computer Science.
+I have a Bachelor of Science
+in Computer Science from Virginia Tech.
 </p>
 
 <p>I frequently start hobby projects, and I'm hoping this blog
@@ -25,10 +25,6 @@ function AboutPage() {
 I often write about Python, OpenStreetMap, web scraping, and data visualization.
 </p>
 
-<p>
-My other hobbies include programming, gardening, cooking, and hiking.
-</p>
-
 <p>
 You can contact me at <a href="mailto:wcedmisten@gmail.com?subject=Feedback on wcedmisten.fyi">
 wcedmisten@gmail.com
@@ -41,7 +37,7 @@ function AboutPage() {
 src="/william.jpg"
 rounded={true}
 fluid={true}
-alt="William pointing at a lime tree at Lewis Ginter Botanical Gardens" />
+alt="William in front of a sunset" />
 </Col>
 </Row>
 </Container>

pages/projects.tsx

Lines changed: 7 additions & 0 deletions
@@ -6,6 +6,13 @@ function ProjectsPage() {
 <div className="post-wrapper">
 <h1 className="post-title">Projects</h1>
 <ListGroup variant="flush">
+<PostItem
+href="https://www.510k.fyi"
+thumbnailURL="/medical-device-analysis/thumbnail.png"
+thumbnailAlt="A graph of predicate device data"
+title="510k.fyi"
+description="A webapp to improve the FDA's 510k medical device database with enhanced predicate device data."
+date='2024-03-10' />
 <PostItem
 href="/project/north-america-hospital-distance"
 thumbnailURL="/og-images/north-america-hospital-distance.png"