It's not uncommon for molecular biologists to express frustration with the outdated software tools they rely on for research. Many scientific software applications were developed years ago and have not kept pace with advances in technology or the evolving needs of researchers. As a result, scientists run into limitations such as poor user interfaces, incompatibility with modern operating systems, and inefficient data processing. The work of molecular biologists has far-reaching implications for human health, agriculture, industry, and the environment, so it's important that their technology is up to par!
Our application is an all-in-one space for molecular biologists! It's an LLM-powered electronic lab notebook with easy access to common biochemical information, a fine-tuned LLM lab partner for thinking through hard problems together, and simple recordkeeping in line with the Design-Build-Test-Learn workflow common to synthetic biology. Each notebook has autogenerated summaries via the Gemini API, the chemical encyclopedia is easy to understand, and the main fine-tuned chatbot is capable of answering your most technical questions! Please see photos.
The LLMs were fine-tuned on datasets we curated and annotated ourselves from web-scraped journal articles, alongside the cleaner PubMedQA and BiosQA datasets. The main tech stack was a Python + Flask backend, with the LLMs served from Intel Developer Cloud through a separate Flask server. The frontend was primarily React + Tailwind CSS, with MongoDB as the database.
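As a rough illustration of how a question-answering record can be turned into fine-tuning text, here is a minimal sketch. It is not our actual preprocessing code. The field names (`question`, `context.contexts`, `long_answer`, `final_decision`) are assumed to follow the PubMedQA layout, the prompt template and the `format_qa_example` helper are hypothetical, and the sample record is a made-up placeholder rather than real data.

```python
def format_qa_example(record: dict) -> dict:
    """Turn one PubMedQA-style record into a prompt/completion pair.

    Assumes the record has 'question', 'long_answer', 'final_decision',
    and a 'context' dict holding a list of passages under 'contexts'.
    """
    passages = record["context"]["contexts"]
    # Drop empty passages and join the rest into one context block.
    context_text = "\n".join(p.strip() for p in passages if p.strip())
    prompt = (
        f"Context:\n{context_text}\n\n"
        f"Question: {record['question'].strip()}\n"
        "Answer:"
    )
    # Lead with the short decision, then the longer explanation.
    completion = f" {record['final_decision']}. {record['long_answer'].strip()}"
    return {"prompt": prompt, "completion": completion}


# Placeholder record, invented purely for illustration.
sample_record = {
    "question": "Does X bind Y?",
    "context": {"contexts": ["Passage one about X. ", "", "Passage two about Y."]},
    "long_answer": "The passages suggest that X binds Y.",
    "final_decision": "yes",
}

example = format_qa_example(sample_record)
```

Keeping the template in one small function makes it easy to apply the same formatting to both the curated articles and the public datasets.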
We had never worked with LLMs or with Intel Developer Cloud before, so simply getting up to speed was our first big hurdle. Because we wanted our LLM to solve a real problem, it necessarily had to rely at least partially on datasets that we would have to preprocess ourselves. If the data is there, then so is the solution, but the data is often where the real bottleneck lies. Parsing journal articles came with its own set of legal and technical difficulties. We were able to work through them, but they severely limited the scope of what we could investigate (primarily PubMed PMC open access articles, many of which were not properly labeled or even active!).
The LLMs we delivered were reliably fast and made smart use of the Intel Developer Cloud hardware we had available. Many similar projects relied on the far superior Intel Max Series GPU instances, which we did not have access to, so it was important that we optimize both training and serving of our models. We implemented our own collators to take advantage of the BF16 data type's memory-saving design and applied INT8 quantization to reduce numerical precision without losing out on accuracy, which let us use the standard advertised tools to speed up our LLM. We were happy to take this a step further and use OpenVINO to optimize the model specifically for Intel CPU architecture.
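To make the two number formats mentioned above a little more concrete, here is a minimal pure-Python sketch of what BF16 rounding and symmetric INT8 quantization do to individual values. It is an illustration only, not the code we ran: in practice this is handled by the framework and Intel tooling, and the helper names (`round_trip_bf16`, `quantize_int8`, `dequantize_int8`) are hypothetical.

```python
import struct


def round_trip_bf16(x: float) -> float:
    """Round a value to BF16 (the top 16 bits of a float32) and back.

    Uses round-to-nearest-even. NaN and infinity are not handled here.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bf16_bits = ((bits + rounding_bias) >> 16) & 0xFFFF
    # BF16 takes 2 bytes per value instead of float32's 4.
    return struct.unpack("<f", struct.pack("<I", bf16_bits << 16))[0]


def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Toy weights, chosen only for illustration.
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is why accuracy can hold up when the weights are well spread across the representable range.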
In these past 24 hours, we learned for the first time anything and everything to do with LLM implementation, Intel Developer Cloud, and honestly a good deal of graphic design. Molecular biologists have written entire journal articles criticizing the poor user design of existing ELN and LIMS software, so we had to make a real effort with our own UI and UX! It was also our first time using PropelAuth, which was pleasantly surprising in its ease of use, and we hope to see its popularity rise as a tool.
We hope that this project can serve as a stepping stone for other developers. Because the LLMs are hosted on the Hugging Face leaderboard, others can quickly and easily see and compare our work and the overall idea itself. Beefy LLMs like CoScientist have been built to advance the sciences in inorganic chemistry, so we hope that our project can inspire similar efforts in organic chemistry.