
Commit eafe170

Author: Xuelong An Wang
Commit message: added inquery and lang-biology
1 parent 8582361 commit eafe170

File tree: 3 files changed, +72 / -21 lines


_posts/2025-01-15-inquery.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
---
layout: post
title: "InQuery: Text2SQL App coded with LLMs"
date: 2025-01-15 15:42
description: a brief description of an app written in Dart with the help of Perplexity.AI, ChatGPT and DeepSeek-V3
tags: research
categories: blog-post
disqus_comments: true
related_posts: true
authors:
  - name: Xuelong An
thumbnail: /assets/img/inquery_logo.png
---

# Why Dart?

Dart is a programming language that lets the same code be compiled for multiple platforms, be it iOS, Android, or the web. This means you can build "universal" applications accessible from any platform.

# Why Text to SQL?

Not everyone is an expert in Structured Query Language (SQL). There is a whole field of research on text-to-SQL dedicated to helping users unfamiliar with this syntax navigate a database. In our project, we grounded it in the context of querying a medical database. As a sample use case, we query a [free, open-source demo of the MIMIC-IV Clinical Database](https://physionet.org/content/mimic-iv-demo/1.0/).
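
To make the use case concrete, here is a minimal, hypothetical sketch of the kind of mapping the app performs: a natural-language question becomes a SQL query that is run against a local SQLite copy of the demo data. The database path, table and column names are assumptions for illustration, not the app's actual code.

```python
import sqlite3

# Hypothetical question-to-SQL pair of the kind a text-to-SQL model emits.
question = "How many female patients are in the database?"
generated_sql = "SELECT COUNT(*) FROM patients WHERE gender = 'F';"

# Assumes the MIMIC-IV demo tables were previously loaded into this SQLite file.
with sqlite3.connect("mimic_iv_demo.db") as conn:
    (count,) = conn.execute(generated_sql).fetchone()
    print(f"{question} -> {count}")
```

In the app, producing `generated_sql` from `question` is delegated to an LLM, as discussed below.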

# Are LLMs enough to build this app?

While I acknowledge the use of LLMs in building this app, their role is better described as that of an assistant rather than a substitute. An app is far more complicated than that: in the process of building it I faced many problems that required a human's debugging skills and continuous iterative improvement (tons of breakpoints and trying to pinpoint the source of a problem).

For example, here are some of the problems the LLMs couldn't solve:

1. I faced an API query error that required me to try different API connection points to identify the one that works and bridges the Flask backend with the front-end interface (see the first sketch after this list).

2. There was a problem when querying the database: I didn't know the database had to be copied into the device's storage first, otherwise I would get "unknown table" errors. I needed a lot of breakpoints to figure this out. Originally I wanted to query an external database, but as a proof of concept I started easy and queried a local one; the idea should generalize quite smoothly (see the second sketch after this list).

3. There was a lot of trial and error before settling on the app's current design (something no LLM could anticipate at the beginning of the project).

3.1 Initially I wanted to use a BERT-based model, [MedTS](https://pmc.ncbi.nlm.nih.gov/articles/PMC8701710/), for my app. However, it is difficult to use and requires intricate input preprocessing. In my app, I'm just assuming the user provides a natural-language input, not information such as a database schema parsed into a JSON file and conveniently preprocessed into tree structures to be fed to MedTS. MedTS is lightweight and would definitely have been good, but it would require extensive automation of that input preprocessing.

3.2 Another approach to a powerful text-to-SQL model is to take a large open-source model, such as one based on Llama, and compress it with TFLite to deploy it on-device. This was not possible for me, because such a model is simply too big to compress to under 2 GB. After many trial-and-error attempts, I ended up (thanks to brainstorming with Perplexity.AI) hosting the LLM in a Flask-based backend and calling it through an API from the app (again, see the first sketch below). I'll still explore quantization to reduce the memory footprint of the sqlcoder-7b-2 model used.
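
To give a flavour of what problems 1 and 3.2 converge on, here is a minimal sketch of a Flask backend serving a quantized text-to-SQL model. The endpoint path, prompt format and the use of 4-bit quantization via bitsandbytes are assumptions for illustration, not the project's exact code.

```python
# Hedged sketch: a Flask endpoint wrapping a quantized sqlcoder-7b-2.
# Endpoint path, prompt format and quantization settings are illustrative assumptions.
import torch
from flask import Flask, jsonify, request
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

app = Flask(__name__)

MODEL_ID = "defog/sqlcoder-7b-2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
)

@app.route("/text2sql", methods=["POST"])
def text2sql():
    question = request.json["question"]
    schema = request.json.get("schema", "")  # optional CREATE TABLE statements
    prompt = f"### Task\nTranslate into SQL: {question}\n### Schema\n{schema}\n### SQL\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return jsonify({"sql": tokenizer.decode(new_tokens, skip_special_tokens=True).strip()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

The Dart front end then only needs to POST the user's question to such an endpoint and read back the generated SQL, which keeps the heavy model entirely off the device.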
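
And for problem 2, the likely underlying issue generalises beyond Dart: if you connect to a database path that does not exist yet, SQLite silently creates an empty database, and every query then fails with "no such table". Here is a minimal sketch of the fix in Python (the app does the equivalent in Dart; the paths are placeholders): copy the bundled database into writable storage before connecting.

```python
import shutil
import sqlite3
from pathlib import Path

# Placeholder paths: a read-only bundled copy and the app's writable storage.
BUNDLED_DB = Path("assets/mimic_iv_demo.db")
WRITABLE_DB = Path.home() / ".inquery" / "mimic_iv_demo.db"

def open_database() -> sqlite3.Connection:
    # Copy the bundled database on first run; connecting straight to a
    # missing path would create an empty database with no tables at all.
    if not WRITABLE_DB.exists():
        WRITABLE_DB.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(BUNDLED_DB, WRITABLE_DB)
    return sqlite3.connect(WRITABLE_DB)
```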

The app's name, InQuery, is a play on the words "inquiry" and "query", as in SQL.

The code, along with instructions on how to run it, can be found at https://github.com/awxlong/ai_sql_coder. I hope you find it helpful!

_posts/lang-biology.md renamed to _posts/2025-01-19-lang-biology.md

Lines changed: 28 additions & 21 deletions
@@ -1,7 +1,7 @@
---
layout: post
title: The interesting analogy between language and biology
- date: 2023-11-21 12:42
+ date: 2025-01-19 12:42
description: comments on the interesting parallels that surface in NLP and computational biology, and how each informs problems and solutions in the other
tags: research food-for-thought
categories: blog-post
@@ -11,14 +11,11 @@ authors:
- name: Xuelong An
toc:
  sidebar: left
- thumbnail: /assets/img/alphafold-pipeline.png
+ thumbnail: /assets/img/alphafold-pipelin.png

---
# On the interesting parallels of language and biology: a taster

- <!-- Reading an interesting paper by Lauren,
-
- Hi Antonio, I have a question on VAEs. -->

There are interesting parallels between language and biology. As a result, problems surfacing in natural language processing (NLP) tasks may well inform us about possible problems and solutions in computational biology.

@@ -28,14 +25,11 @@ Another common point concerns the representation of words and biological entitie

# Language-agnostic grammar: structure for language and biology

- <!-- Another important i

- A priori we need to distinguish between -->
Inspired by Chomsky's linguistics, there is work on learning a language-agnostic, tree-like structure for embedding text, called the [Self-Structured AutoEncoder (Self-StrAE)](https://arxiv.org/abs/2305.05588). This structure combines bottom-up composition of local contextual embeddings with top-down decomposition of fully contextual embeddings, trained in an unsupervised routine. Self-StrAE achieves autoencoding performance competitive with a BERT-based LLM variant while using far fewer parameters, a parameter efficiency that reflects the parsimonious nature of grammar.

Given the parallels with language, it is interesting to explore how Self-StrAE would behave if applied to a DNA or RNA sequence. What would the learnt structure tell us about the sequence? The learnt embeddings could shed light on stretches of DNA with functional similarities, and the compositions found by Self-StrAE might surface sequence motifs, short recurring patterns of DNA that map to specific biological functions such as protein docking, because their frequent co-occurrence leads to similar embeddings.
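
As a purely illustrative toy (this is not the Self-StrAE implementation: Self-StrAE learns its embeddings and composition function, whereas here the embeddings are random and the composition is a plain average with greedy merging by cosine similarity), bottom-up composition over a nucleotide sequence could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy per-nucleotide embeddings; Self-StrAE would learn these instead.
vocab = {nt: rng.normal(size=8) for nt in "ACGT"}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compose_bottom_up(seq):
    """Greedily merge the most similar adjacent nodes into a binary tree."""
    nodes = [(nt, vocab[nt]) for nt in seq]
    while len(nodes) > 1:
        sims = [cosine(nodes[i][1], nodes[i + 1][1]) for i in range(len(nodes) - 1)]
        i = int(np.argmax(sims))
        label = (nodes[i][0], nodes[i + 1][0])
        embedding = (nodes[i][1] + nodes[i + 1][1]) / 2  # toy composition: average
        nodes[i:i + 2] = [(label, embedding)]
    return nodes[0]  # (nested tuple describing the induced tree, root embedding)

tree, root_embedding = compose_bottom_up("ACGTGCA")
print(tree)
```

On a DNA sequence, the hope expressed above is that frequently co-occurring subsequences would be merged early and end up with similar embeddings, which is roughly what a motif is.
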
- <!--
- how would it work with genes? will it compose motifs -->

# Language respects constraints, so does biology
4034

4135
Another interesting parallel are the constraints underlying both generation of sentences and biological entities/processes.
@@ -56,17 +50,13 @@ Generating implausible proteins, or invalid angle rotations between bonds is aki
5650

5751
It is debatable whether the rules can be learnt from data. Rather, the role of rules and constraints might be to shape learning.
5852

59-
are programs the language for thought for computers, if so how NLP mahines aid biological scientific discovery.
53+
<figure>
54+
<img src="/assets/img/nesy-cell.png" alt="Sorry. Image couldn't load." width="100%" height="auto">
55+
<figcaption id="cell-embedder">A depiction of a neurosymbolic programming approach to virtual cell. Consider a single nucleotide sequence🧬 changing over time. A neural-network driven search of programs takes the sequence as input, and outputs a sequence of program or steps that transforms the input sequence to a final state. The sequence of steps serve as an explanation to the final state, which could correspond to cancer. </figcaption>
56+
</figure>
6057

6158
For further thought experiments, imagine whether a model can learn what are the rules of programming languages by only observing code, or infer the rules of grammar from just observing written text.
6259

63-
Structural
64-
65-
So I know that autoencoders are able to learn embeddings of the training data, however the training data is usually "static". Imagine now that the training data has a temporal dimension, and measures features that change over time of, say, a cell. Is there work that trains a "recurrent" VAE such that it can learn embeddings that are dependent on time? so if I want to visualize the embeddings, they change depending the time step in which I access them.
66-
67-
Does this make sense?
68-
69-
Lorenzo recommended VAEs that disentangle features of the input, such as a $$\beta$$-VAE, and see whether it can disentangle time. however, the paper he shared mainly worked with celebA. While there are multiple images of faces, each face is just a "snapshot", so maybe the VAE learns to disentangle noses from eyes from mouths, but not necessarily how they can change over time. Also, I'm not thinking of working with images. I'm thinking of tabular data that measures the features of multiple cells across time.
7060

7161
# NLP inspiring future biomedical research directions
7262

@@ -79,16 +69,33 @@ An interesting thought experiment hence, is can this grammar be expanded given t
7969
<figcaption id="ontology"> . </figcaption>
8070
</figure>
8171

82-
Mutations as contradictions? -> this sentence "this sentence has five words" has five words
8372

84-
For example, challenges in NLP involve processing a sequence not only forward, but also backward. Consider the sentence: "shift two positions backward of each letter of the word 'trapelo' per the alphabet to decode it", where it is needed to read the sentence back and forth to process the word. Do we have to read a genome forward and backwards?
8573
# Final thoughts
74+
8675
I believe that because we can natively read Human language, research on NLP is more commoditized and well-received by the public than computational genomics, despite the parallels outlined above. I think it is worth for me and anyone interested in computational biology to closely follow the research literature on language modelling to draw inspiration ofr DNA/RNA sequence modelling.
8776

88-
Finally, I leave you the following thought experiment: we humans can only read human language. While we can not natively understand the language of biology, our AI-based tools can understand them. Statement paraphrase from Demis Hassabis's [statement](https://www.youtube.com/watch?v=Gfr50f6ZBvo) that 'AI may just turn out to be the language to describe biology'
77+
Finally, I leave you the following thought experiment: we humans can only read human language. While we can not natively understand the language of biology, our AI-based tools can understand them.
78+
79+
'AI may just turn out to be the language to describe biology' - Demis Hassabis's [statement](https://www.youtube.com/watch?v=Gfr50f6ZBvo)
8980

9081
If you have answers, share thoughts, you can leave a comment or please email me!
9182

9283
[^1]: Sentences don't follow this simple template, but it helps get the idea across that placing structure into the support of a probabilistic model helps guide learning.
9384
[^2]: While it is true that some bonds between atoms like a carbon-carbon bond have no restricted rotation angle, others like a [hydrogen peroxide](https://www.sciencedirect.com/science/article/pii/S0022285217302990) is constrained to a setting of rotation degrees (with some uncertainty).
94-
[^3]: A probabilistic model's support refers to the domain of values for which the output of the model is non-zero.
85+
[^3]: A probabilistic model's support refers to the domain of values for which the output of the model is non-zero.
86+
87+
88+
<!--
89+
are programs the language for thought for computers, if so how NLP mahines aid biological scientific discovery.
90+
91+
92+
Structural
93+
94+
So I know that autoencoders are able to learn embeddings of the training data, however the training data is usually "static". Imagine now that the training data has a temporal dimension, and measures features that change over time of, say, a cell. Is there work that trains a "recurrent" VAE such that it can learn embeddings that are dependent on time? so if I want to visualize the embeddings, they change depending the time step in which I access them.
95+
96+
Does this make sense?
97+
98+
Lorenzo recommended VAEs that disentangle features of the input, such as a $$\beta$$-VAE, and see whether it can disentangle time. however, the paper he shared mainly worked with celebA. While there are multiple images of faces, each face is just a "snapshot", so maybe the VAE learns to disentangle noses from eyes from mouths, but not necessarily how they can change over time. Also, I'm not thinking of working with images. I'm thinking of tabular data that measures the features of multiple cells across time.
99+
Mutations as contradictions? -> this sentence "this sentence has five words" has five words
100+
101+
For example, challenges in NLP involve processing a sequence not only forward, but also backward. Consider the sentence: "shift two positions backward of each letter of the word 'trapelo' per the alphabet to decode it", where it is needed to read the sentence back and forth to process the word. Do we have to read a genome forward and backwards? -->

assets/img/inquery_logo.png

136 KB
