|
| 1 | +--- |
| 2 | +title: "A Copy of a Copy of a Copy: the Story of FDA Medical Device Clearances" |
| 3 | +date: "2024-03-10" |
| 4 | +thumbnail: "/medical-device-analysis/thumbnail.png" |
| 5 | +thumbnailAlt: "" |
| 6 | +description: "Uncovering medical device ancestry from the FDA's 510k data and creating a free website for exploring it." |
| 7 | +tags: ["python", "sqlite", "510k"] |
| 8 | +--- |
| 9 | + |
| 10 | +## Motivation |
| 11 | + |
| 12 | +Back in the fall of 2022, I watched the 2018 documentary *The Bleeding Edge*[^the-bleeding-edge], |
| 13 | +which investigates the FDA's medical device clearance process. |
| 14 | +I was surprised to learn that for many devices, clinical trials are not required. |
| 15 | + |
| 16 | +The documentary scrutinizes the FDA's 510(k) clearance process, |
| 17 | +through which some devices can be fast-tracked for use if they are |
| 18 | +"substantially equivalent" to an existing device, |
| 19 | +even without a clinical trial. |
| 20 | + |
| 21 | +Several of these devices led to patient injuries, including |
| 22 | +bleeding, organ puncture, and even cobalt poisoning. |
| 23 | + |
| 24 | + |
| 25 | + |
| 26 | +Although the documentary led to some of these devices being pulled off the market[^essure], |
| 27 | +I was curious whether similar devices might exist that haven't gotten the publicity yet. |
| 28 | +However, after some research, I found that the public data available was insufficient to answer my questions. |
| 29 | + |
| 30 | +So, I put on my detective hat and began investigating the data that was available. |
| 31 | + |
| 32 | +Before I dive into the details, it's important to know a little backstory on |
| 33 | +how medical devices get regulated in America. |
| 34 | + |
| 35 | +### A Brief History of Medical Device Regulation |
| 36 | + |
| 37 | +Before 1938, there were a few laws regulating food and drug quality, but |
| 38 | +the Federal government did not have much authority to enforce them. |
| 39 | + |
| 40 | +In 1938, the Federal Food, Drug, and Cosmetic Act[^food-drug-cosmetic-act] |
| 41 | +was signed into law, granting the FDA (Food and Drug Administration) much |
| 42 | +more authority and requiring a pre-market |
| 43 | +review for new drugs to verify safety and effectiveness. |
| 44 | + |
| 45 | +Until 1976, however, medical devices were not subject to premarket review by the FDA. |
| 46 | +A 1976 amendment to the Federal Food, Drug, and Cosmetic Act |
| 47 | +extended the FDA's premarket authority to include medical devices. |
| 48 | +The amendment established 3 classes of medical devices: Class I, Class II, and Class III, |
| 49 | +ordered by the level of risk to the patient. |
| 50 | + |
| 51 | +For example, latex examination gloves are a Class I device (low risk), |
| 52 | +but a pacemaker is a Class III device (high risk). |
| 53 | +Devices like joint implants and tooth fillings fall somewhere in the middle, at Class II. |
| 54 | + |
| 55 | +There are a few pathways for device manufacturers to begin selling their devices.[^device-pathways] |
| 56 | +I won't get into all of them, but the two main pathways are PMA and 510(k). |
| 57 | + |
| 58 | +### PMA |
| 59 | + |
| 60 | +Premarket Approval (PMA) is a more rigorous process which requires a clinical trial to |
| 61 | +demonstrate that the device is safe and effective. |
| 62 | +All Class III devices must go through this approval process to be legally marketed. |
| 63 | +However, only 1%[^510k-study] of medical devices are approved through this process. |
| 64 | + |
| 65 | +### 510(k) |
| 66 | + |
| 67 | +The 510(k) process provides a faster route to marketing a new device. |
| 68 | +New devices are allowed to be marketed if it is shown that they are "substantially equivalent" |
| 69 | +to a legally marketed device, and are either Class I or Class II. |
| 70 | +This process does not necessarily require a clinical trial to demonstrate safety or effectiveness; |
| 71 | +the applicant need only show that the device is equivalent to something legally on the market already. |
| 72 | +This marketed device is referred to as the "predicate" device. |
| 73 | + |
| 74 | +The predicate device could itself have been allowed through a few pathways. |
| 75 | +The predicate could be a pre-amendment device, meaning something that was on the market before 1976 and grandfathered in. |
| 76 | +It could also be a device that received Premarket Approval. |
| 77 | + |
| 78 | +Lastly, the predicate device could itself have been cleared by the 510(k) process. |
| 79 | + |
| 80 | +This last detail is what sparked my curiosity. |
| 81 | + |
| 82 | +A cleared device could be equivalent to some long chain of 510(k) cleared |
| 83 | +devices, without any of these devices requiring a clinical trial! |
| 84 | + |
| 85 | +This was not how I expected medical devices to be cleared for the market. |
| 86 | + |
| 87 | +## Wait, it's just a graph? |
| 88 | + |
| 89 | +This is where I had to dust off my computer science knowledge. |
| 90 | +We can model this as a graph, where each 510(k) submission is a node, |
| 91 | +and the predicate relationship is an edge. |
| 92 | +A device may use multiple predicates in its application. |
| 93 | + |
| 94 | +For example, if device A has predicate devices X and Y, the graph might look like this: |
| 95 | + |
| 96 | + |
| 97 | + |
| 98 | +We could imagine this extending out across dozens or hundreds of devices in |
| 99 | +a 510(k)'s "ancestry". |
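
As a rough sketch of this model, the graph can be stored as a mapping from each device to the predicates it cites, and a device's ancestry can be found with a breadth-first walk. The device IDs, the `PREDICATES` mapping, and the `ancestry` helper below are all made up for illustration and are not taken from the FDA data.

```python
from collections import deque

# Hypothetical example data: each device maps to the predicates it cites.
PREDICATES = {
    "A": ["X", "Y"],
    "X": ["W"],
    "Y": ["W"],
    "W": [],
}


def ancestry(device, predicates=PREDICATES):
    """Return every device reachable by following predicate edges."""
    seen = set()
    queue = deque([device])
    while queue:
        current = queue.popleft()
        for predicate in predicates.get(current, []):
            if predicate not in seen:
                seen.add(predicate)
                queue.append(predicate)
    return seen


print(sorted(ancestry("A")))  # ['W', 'X', 'Y']
```

The `seen` set keeps the walk from revisiting a device that is shared by several branches, such as `W` here.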
| 100 | + |
| 101 | +## Finding the data |
| 102 | + |
| 103 | +The next problem I had was how to find the data. |
| 104 | +There were two sources for FDA 510(k) data that I found: |
| 105 | +The Premarket Notification Database[^pmn-database] and even better, the |
| 106 | +OpenFDA API dataset[^fda-api-dataset], which provided a fairly comprehensive dataset as a single |
| 107 | +JSON file. |
| 108 | + |
| 109 | +"Great!" I thought. "This data will be perfect for mapping out predicate devices as a graph." |
| 110 | + |
| 111 | +Except for one issue: the data does not include predicate devices. |
| 112 | + |
| 113 | +As it turns out, the only way to find the predicates of a given device is by checking if a |
| 114 | +PDF summary of the application is available, and if so, examining it manually. |
| 115 | +These PDFs are completely free-form and do not even have a standardized template. |
| 116 | + |
| 117 | +As of March 2024, there are 85,791 510(k) applications with summaries available, so doing this |
| 118 | +manually was just not going to work. |
| 119 | + |
| 120 | +Another problem is that the FDA does not provide a single dataset containing all of these PDFs, as they do |
| 121 | +with the API data. |
| 122 | + |
| 123 | +So, left with no other option, I decided to scrape the FDA's website to download all 85,791 PDFs. |
| 124 | + |
| 125 | +## Scraping PDFs |
| 126 | + |
| 127 | +I used Python's BeautifulSoup library to scrape the database. |
| 128 | +In the database, each entry contains a link to the summary PDF file. |
| 129 | + |
| 130 | + |
| 131 | + |
| 132 | +The "Summary" link goes to a PDF file hosted on the FDA's website. |
| 133 | + |
| 134 | + |
| 135 | + |
| 136 | +With BeautifulSoup, it was fairly simple to find this link by searching for an anchor with the text "Summary": |
| 137 | + |
| 138 | + |
| 139 | +```python |
| 140 | +from bs4 import BeautifulSoup |
| 141 | + |
| 142 | +soup = BeautifulSoup(response.data, features="html.parser")  # response: the fetched entry page |
| 143 | +summary = soup.find("a", string="Summary")  # None when the entry has no summary link |
| 144 | +url = summary.attrs.get("href") if summary else None |
| 145 | +``` |
| 146 | + |
| 147 | +Once I had this link, I just downloaded each PDF and stored it locally. |
| 148 | + |
| 149 | +### Being a polite scraper |
| 150 | + |
| 151 | +The FDA's robots.txt includes the following lines: |
| 152 | + |
| 153 | +``` |
| 154 | +Hit-rate: 30 # wait 30 seconds before starting a new URL request default=30 |
| 155 | +Visiting-hours: 23:00EDT-05:00EDT #index this site between 11PM - 5AM EDT |
| 156 | +``` |
| 157 | + |
| 158 | +This means that to be polite (and not get blocked), my scraper could only run between 11 PM and 5 AM (6 hours), |
| 159 | +and could only make one request every 30 seconds. |
| 160 | + |
| 161 | +This means we can only scrape 6 * 60 * (60 / 30) = 720 files per day. |
| 162 | +So it took about 85,791 / 720 ≈ 119 days to scrape every PDF. |
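
The arithmetic above, together with a check for the visiting hours, can be sketched as follows. The helper name `in_visiting_hours` is hypothetical, and it assumes the time passed in is already in Eastern time. This is an illustration of the constraints, not the scraper's actual code.

```python
import datetime

HIT_RATE_SECONDS = 30   # from robots.txt: one request every 30 seconds
WINDOW_HOURS = 6        # 11 PM to 5 AM
TOTAL_PDFS = 85_791

requests_per_night = WINDOW_HOURS * 60 * 60 // HIT_RATE_SECONDS
nights_needed = TOTAL_PDFS / requests_per_night


def in_visiting_hours(now: datetime.time) -> bool:
    """True between 23:00 and 05:00; the window wraps past midnight."""
    return now >= datetime.time(23, 0) or now < datetime.time(5, 0)


print(requests_per_night)       # 720
print(round(nights_needed, 1))  # 119.2
```

Because the window crosses midnight, the check uses `or` rather than a single range comparison.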
| 163 | + |
| 164 | +To accomplish this, I ran the scraper on a $6/month DigitalOcean droplet and let it go to work, |
| 165 | +then I checked back in 4 months and copied the PDFs to a local directory. |
| 166 | + |
| 167 | +## Parsing Predicates |
| 168 | + |
| 169 | +Now that I had all the PDFs, the next problem was parsing them. |
| 170 | + |
| 171 | +With the `pypdf` Python library, it was easy to grab the embedded text from each document. |
| 172 | +Once I had the embedded text, I just ran a regex match for strings with a K followed by 6 digits like `K123456`. |
| 173 | +There were some common variations like using a `#` or a space after the `K`, which I also matched. |
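
A minimal sketch of this kind of match is shown below. The exact pattern I used is not reproduced here, so `K_NUMBER` and `find_k_numbers` are illustrative assumptions that cover only the variations described above.

```python
import re

# Assumed pattern: "K", an optional "#" or whitespace, then exactly six digits.
K_NUMBER = re.compile(r"\bK[#\s]?\s*(\d{6})\b")


def find_k_numbers(text):
    """Return normalized IDs like 'K123456', in order, without duplicates."""
    found = []
    for digits in K_NUMBER.findall(text):
        k_number = "K" + digits
        if k_number not in found:
            found.append(k_number)
    return found


sample = "Predicate devices: K123456, K 234567 and K#345678 (see K123456)."
print(find_k_numbers(sample))  # ['K123456', 'K234567', 'K345678']
```

The trailing `\b` rejects runs of seven or more digits. A summary may also mention the application's own K number, which would need to be filtered out separately.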
| 174 | + |
| 175 | +However, I hit another issue with older summary documents. |
| 176 | +The older PDF documents did not have embedded text, because they were often scanned PDFs, |
| 177 | +not digital. |
| 178 | +Using Tesseract, I ran OCR (Optical Character Recognition) on the PDFs where I could not find predicate device IDs. |
| 179 | +This worked pretty well overall, even though the OCR quality was fairly low. |
| 180 | + |
| 181 | +Finally, for documents that were still missing predicates, I manually entered them |
| 182 | +using a Python script to display the PDF and accept the ID as input. |
| 183 | + |
| 184 | +Sometimes this required searching for the predicate manually by name in the database. |
| 185 | + |
| 186 | +Out of the 85,791 devices with summaries, I was able to find predicates for 63,389 (74%) devices. |
| 187 | + |
| 188 | +Sometimes the summary would omit the exact predicate used or would provide a name that was |
| 189 | +not specific enough to identify a 510(k) application. |
| 190 | +These I ignored. |
| 191 | + |
| 192 | +### Storing the data |
| 193 | + |
| 194 | +I stored the predicate data in a simple SQLite database with a table containing two columns: |
| 195 | +`node_to` and `node_from`. This allowed me to run some simple queries on the data, and also |
| 196 | +join it against the device data found in the API dump. |
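
As a rough illustration of that layout, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. The K numbers are invented, and the table is a simplified two-column version without keys.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory database, for illustration only
con.execute("CREATE TABLE predicate_graph_edge (node_from TEXT, node_to TEXT)")

# Hypothetical edges: node_from is the predicate, node_to is the newer device.
edges = [("K000001", "K000003"), ("K000002", "K000003")]
con.executemany("INSERT INTO predicate_graph_edge VALUES (?, ?)", edges)

# Look up the direct predicates of one device.
predicates = [
    row[0]
    for row in con.execute(
        "SELECT node_from FROM predicate_graph_edge "
        "WHERE node_to = ? ORDER BY node_from",
        ("K000003",),
    )
]
print(predicates)  # ['K000001', 'K000002']
```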
| 197 | + |
| 198 | +## Answering some questions |
| 199 | + |
| 200 | +### How many predicates does a device have on average? |
| 201 | + |
| 202 | +Let's fetch the edges from the database and load them into a networkx graph. |
| 203 | + |
| 204 | +I'm deliberately ignoring devices that are missing predicate information. |
| 205 | +I only care about the devices where we do have ancestry available. |
| 206 | + |
| 207 | +```python |
| 208 | +import networkx as nx |
| 209 | +import sqlite3 |
| 210 | + |
| 211 | +con = sqlite3.connect("../scripts/devices.db") |
| 212 | +cur = con.cursor() |
| 213 | + |
| 214 | +cur.execute("SELECT node_from, node_to FROM predicate_graph_edge;") |
| 215 | +all_edges = cur.fetchall() |
| 216 | + |
| 217 | +g = nx.DiGraph(all_edges) |
| 218 | +``` |
| 219 | + |
| 220 | +We use a `DiGraph` object, which stands for Directed Graph, because |
| 221 | +each edge in our graph is directed. |
| 222 | +Device A being the predicate of device B does not imply that |
| 223 | +B is the predicate of A. |
| 224 | + |
| 225 | +We might wonder how many predicates, on average, a device has. |
| 226 | +In graph theory terms this is called the *in-degree* of the node, |
| 227 | +in other words, the number of edges pointing into it (here, one per predicate). |
| 228 | + |
| 229 | +Now we can calculate the average degree of a device: |
| 230 | + |
| 231 | +```python |
| 232 | +print("Average degree:") |
| 233 | +print(sum(map(lambda x: x[1], g.in_degree())) / len(g)) |
| 234 | +``` |
| 235 | + |
| 236 | +``` |
| 237 | +Average degree: |
| 238 | +1.7794783986208664 |
| 239 | +``` |
| 240 | + |
| 241 | +We can also calculate the median degree: |
| 242 | + |
| 243 | +```python |
| 244 | +import statistics |
| 245 | +print("Median degree", statistics.median(map(lambda x: x[1], g.in_degree())), sep="\n") |
| 246 | +``` |
| 247 | + |
| 248 | +``` |
| 249 | +Median degree |
| 250 | +1 |
| 251 | +``` |
| 252 | + |
| 253 | +The average is skewed higher than the median because the degree follows a power law distribution, |
| 254 | +which we can see by checking a logarithm-scaled histogram of the degrees. |
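
To see why a few heavily connected devices pull the mean above the median, here is a toy example. The numbers are made up and are not drawn from the dataset.

```python
import statistics

# Made-up predicate counts: most devices cite one predicate, one cites many.
in_degrees = [1, 1, 1, 1, 1, 1, 2, 2, 3, 12]

print(statistics.mean(in_degrees))    # 2.5
print(statistics.median(in_degrees))  # 1.0
```

The single device with 12 predicates raises the mean to 2.5, while the median stays at 1.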
| 255 | + |
| 256 | + |
| 257 | + |
| 258 | +## Setting up a website |
| 259 | + |
| 260 | +To make this data more accessible, I set up a website to visualize it: [510k.fyi](https://www.510k.fyi/). |
| 261 | + |
| 262 | +The website is open source, free to use, and requires no account. Check it out! |
| 263 | + |
| 264 | +The tech stack for this website is: |
| 265 | + |
| 266 | +* Python and FastAPI for the backend |
| 267 | +* SQLite for the database (although I plan to switch to PostgreSQL later) |
| 268 | +* Next.js and React for the frontend, along with react-force-graph for visualizing the data |
| 269 | + |
| 270 | +The most interesting part of this setup is using SQLite instead of a graph database. |
| 271 | +I had thought a dedicated graph DB like Neo4j might be needed, but the performance |
| 272 | +so far with SQLite has been great. |
| 273 | + |
| 274 | +My schema looks like this: |
| 275 | + |
| 276 | +```sql |
| 277 | +CREATE TABLE predicate_graph_edge(node_from TEXT,node_to TEXT, |
| 278 | +FOREIGN KEY(node_from) REFERENCES device(k_number), |
| 279 | +FOREIGN KEY(node_to) REFERENCES device(k_number), |
| 280 | +PRIMARY KEY(node_from, node_to)); |
| 281 | +``` |
| 282 | + |
| 283 | +In other words, the `predicate_graph_edge` table holds all edges (relationships from predicate devices to the new device). |
| 284 | +These edge columns are also foreign keys to the `device` table, which represents the nodes in our graph. |
| 285 | + |
| 286 | +Using a recursive CTE, we can query for all the predicate ancestors of a given device. |
| 287 | +This query returns all the edges in that device's ancestry. |
| 288 | + |
| 289 | +```sql |
| 290 | +WITH RECURSIVE ancestor(n) |
| 291 | +AS ( |
| 292 | + VALUES(?) |
| 293 | + UNION |
| 294 | + SELECT node_from FROM predicate_graph_edge, ancestor |
| 295 | + WHERE predicate_graph_edge.node_to=ancestor.n |
| 296 | +) |
| 297 | +SELECT node_to, node_from |
| 298 | +FROM predicate_graph_edge |
| 299 | +WHERE predicate_graph_edge.node_to IN ancestor |
| 300 | +``` |
| 301 | + |
| 302 | +But we also want the node data, so we can embed this in a subquery and JOIN it on the |
| 303 | +device table (and the recalls table). |
| 304 | + |
| 305 | +```sql |
| 306 | +SELECT device.k_number, recall_id, recall.reason_for_recall |
| 307 | +FROM ( |
| 308 | + WITH RECURSIVE ancestor(n) |
| 309 | + AS ( |
| 310 | + VALUES(?) |
| 311 | + UNION |
| 312 | + SELECT node_from FROM predicate_graph_edge, ancestor |
| 313 | + WHERE predicate_graph_edge.node_to=ancestor.n |
| 314 | + ) |
| 315 | + SELECT node_to, node_from |
| 316 | + FROM predicate_graph_edge |
| 317 | + WHERE predicate_graph_edge.node_to IN ancestor |
| 318 | +) ancestry |
| 319 | +JOIN device ON ancestry.node_from = device.k_number |
| 320 | +LEFT JOIN device_recall ON device_recall.k_number = device.k_number |
| 321 | +LEFT JOIN recall ON device_recall.recall_id = recall.id; |
| 322 | +``` |
| 323 | + |
| 324 | +The `?` in this case would be replaced with the root device in question. |
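
A minimal sketch of binding that parameter from Python is shown below, using the built-in `sqlite3` module and an in-memory table with invented K numbers. The helper `ancestry_edges` is hypothetical, and the final filter is written as `IN (SELECT n FROM ancestor)`, which is intended to be equivalent to the `IN ancestor` form above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE predicate_graph_edge ("
    "node_from TEXT, node_to TEXT, PRIMARY KEY (node_from, node_to))"
)
# Hypothetical data: K000001 -> K000002 -> K000003, plus an unrelated edge.
con.executemany(
    "INSERT INTO predicate_graph_edge VALUES (?, ?)",
    [("K000002", "K000003"), ("K000001", "K000002"), ("K000009", "K000008")],
)

ANCESTRY_QUERY = """
WITH RECURSIVE ancestor(n) AS (
    VALUES(?)
    UNION
    SELECT node_from FROM predicate_graph_edge, ancestor
    WHERE predicate_graph_edge.node_to = ancestor.n
)
SELECT node_from, node_to
FROM predicate_graph_edge
WHERE node_to IN (SELECT n FROM ancestor)
ORDER BY node_from
"""


def ancestry_edges(connection, k_number):
    """Return every (predicate, device) edge in the ancestry of k_number."""
    return connection.execute(ANCESTRY_QUERY, (k_number,)).fetchall()


print(ancestry_edges(con, "K000003"))
# [('K000001', 'K000002'), ('K000002', 'K000003')]
```

Passing the root device as a bound parameter, rather than formatting it into the SQL string, avoids quoting problems and SQL injection.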
| 325 | + |
| 326 | +I timed this on a nontrivial query (327 rows). |
| 327 | + |
| 328 | +Using SQLite's `.timer on` command, we can see how fast the query is: |
| 329 | + |
| 330 | +``` |
| 331 | +Run Time: real 0.155 user 0.135287 sys 0.019959 |
| 332 | +``` |
| 333 | + |
| 334 | +You can also see this data visualized [here](https://www.510k.fyi/devices/?id=K121623). |
| 335 | + |
| 336 | +## Source |
| 337 | + |
| 338 | +The source code for the scraper, website, and data analysis is all available under an open source |
| 339 | +license here: [https://github.com/wcedmisten/fda-510k-analysis](https://github.com/wcedmisten/fda-510k-analysis) |
| 340 | + |
| 341 | +# |
| 342 | + |
| 343 | +[^the-bleeding-edge]: https://en.wikipedia.org/wiki/The_Bleeding_Edge |
| 344 | +[^essure]: https://en.wikipedia.org/wiki/Essure |
| 345 | +[^food-drug-cosmetic-act]: https://en.wikipedia.org/wiki/Federal_Food,_Drug,_and_Cosmetic_Act |
| 346 | +[^device-pathways]: https://www.fda.gov/medical-devices/device-advice-comprehensive-regulatory-assistance/how-study-and-market-your-device |
| 347 | +[^510k-study]: https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/227466 |
| 348 | +[^pmn-database]: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm |
| 349 | +[^fda-api-dataset]: https://open.fda.gov/apis/device/510k/download/ |