From 4bed4d754e8ce173218ec2225f5618db45287718 Mon Sep 17 00:00:00 2001
From: HYoon-24
Date: Sun, 27 Oct 2024 17:05:56 -0400
Subject: [PATCH] added projects md pages

---
 _projects/RNA.md          | 23 ++++++++++++++++
 _projects/fake_news.md    | 14 ++++++++++
 _projects/hyperTDA.md     | 36 +++++++++++++++++++++++++
 _projects/multiscale.md   | 29 +++++++++++++++++++++
 _projects/musicians.md    | 17 ++++++++++++
 _projects/neuroscience.md | 55 +++++++++++++++++++++++++++++++++++++++
 _projects/pbf.md          | 25 ++++++++++++++++++
 7 files changed, 199 insertions(+)
 create mode 100644 _projects/RNA.md
 create mode 100644 _projects/fake_news.md
 create mode 100644 _projects/hyperTDA.md
 create mode 100644 _projects/multiscale.md
 create mode 100644 _projects/musicians.md
 create mode 100644 _projects/neuroscience.md
 create mode 100644 _projects/pbf.md

diff --git a/_projects/RNA.md b/_projects/RNA.md
new file mode 100644
index 0000000..c8673b7
--- /dev/null
+++ b/_projects/RNA.md
@@ -0,0 +1,23 @@
+---
+layout: page
+title: RNA structure prediction
+description: sampling & clustering
+img: assets/img/SAMIII.png
+importance: 3
+category: research
+---
+
+A RiboNucleic Acid (RNA) can form complex structures through intra-molecular base-pairing. Some classes of RNAs can regulate biological functions by changing their conformations. An example is illustrated below.
+Identifying multiple structures of an RNA can bring therapeutic advancements for RNA viruses. A popular approach is to sample low-energy structures from the nearest neighbor thermodynamic model. Most algorithms follow the general flow of sampling, clustering, and reporting cluster representatives.
+
+I worked on improving the clustering aspect of an RNA structure prediction algorithm called profiling. The existing method produced too many clusters with negligible biological differences. I proposed algorithmic ways to identify clusters that should be merged based on structural similarity. The enhanced version of profiling is under development by the Georgia Tech Discrete Mathematics and Molecular Biology group.
+
+I also examined the prospect of using current methods to identify new multimodal RNAs. I found that there is a class of RNAs (kinetic riboswitches) that is difficult to detect with current sampling methods. I proposed a simple co-transcription simulation method to identify the multimodality of such RNAs. The results have been published in this paper.
+
+*Georgia Tech (2018-2019), joint work with Christine Heitsch (Georgia Tech) and Alain Laederach (UNC).*
\ No newline at end of file
diff --git a/_projects/fake_news.md b/_projects/fake_news.md
new file mode 100644
index 0000000..1120dad
--- /dev/null
+++ b/_projects/fake_news.md
@@ -0,0 +1,14 @@
+---
+layout: page
+title: fake news detector
+description: using BERT & transfer learning
+img: assets/img/wordcloud_real_news_titles.png
+importance: 3
+category: data science
+---
+
+During my five-week fellowship at the Data Science for All Women's Summit (Fall 2020), my teammates and I built a fake news detector using various natural language processing tools such as word embeddings, RNNs, BERT, and transfer learning. We performed careful preprocessing to remove biases in the dataset, and we used a model interpretability tool called LIME to identify points of improvement for our model.
+
+Take a look at our code on GitHub!
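+
+The detector itself was BERT-based, but here is a minimal sketch of the LIME step with a stand-in TF-IDF + logistic regression classifier (assuming the `lime` and `scikit-learn` packages; the texts and labels below are hypothetical toy data, not our dataset):
+
+```python
+# Explain a text classifier's prediction with LIME.
+# Stand-in model: TF-IDF + logistic regression (the actual project used BERT).
+from lime.lime_text import LimeTextExplainer
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.linear_model import LogisticRegression
+from sklearn.pipeline import make_pipeline
+
+train_texts = ["markets rally after fed decision", "aliens endorse candidate, sources say"]
+train_labels = [0, 1]  # 0 = real, 1 = fake
+
+pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
+pipeline.fit(train_texts, train_labels)
+
+explainer = LimeTextExplainer(class_names=["real", "fake"])
+explanation = explainer.explain_instance(
+    "shocking cure doctors do not want you to know",
+    pipeline.predict_proba,  # LIME perturbs the text and queries this function
+    num_features=5,
+)
+print(explanation.as_list())  # (word, weight) pairs driving the prediction
+```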
\ No newline at end of file
diff --git a/_projects/hyperTDA.md b/_projects/hyperTDA.md
new file mode 100644
index 0000000..de59bff
--- /dev/null
+++ b/_projects/hyperTDA.md
@@ -0,0 +1,36 @@
+---
+layout: page
+title: Topology & Hypergraphs
+description: Understanding topology through hypergraphs
+img: assets/img/PH_hypergraph2.png
+importance: 2
+category: research
+---
+
+Persistence diagrams summarize the presence of loops in data. However, they do not tell us where the loops are. Can we understand which points are “integral” to the overall structure of the data? The goal is to quantify each point in the data according to its contribution to the overall structure. There are two challenges. The first is that, given a persistence diagram, localizing the loops in a concise and interpretable way is an NP-hard problem. The second is the lack of methods to consolidate all loop information.
+
+To address the first challenge, I developed a method called minimal generators that outputs a small and interpretable collection of data (generators) representing the loops. I utilized a recent technique that performs the optimization over rational coefficients to circumvent the NP-hardness of the problem.
+
+To consolidate all generators, my collaborators and I used hypergraphs, a generalization of graphs in which an edge can contain two or more vertices. We constructed a PH-hypergraph whose vertex set is the original data and whose hyperedges are the generators of a persistence diagram. To encode the “importance” of each point of the persistence diagram, we weighted each hyperedge by its persistence (the distance between the corresponding point and the diagonal in the persistence diagram). Figure 1 illustrates the construction.
+
+Figure 1. Pipeline for constructing and analyzing PH-hypergraph. A. Given point cloud data, we compute the persistence diagram that summarizes the presence of loops. For each point in the persistence diagram, we compute the generator (collection of vertices representing the loop). We then construct a hypergraph whose vertex set is the original data and whose hyperedges are the generators. B. We analyze the PH-hypergraph via hypergraph centrality and community detection.
+
+We use graph theory and network science to study the PH-hypergraph. We first compute hypergraph centrality, which ranks the original data points according to their participation in large loops. We also perform community detection, which partitions the data according to how often a group of points lies on a shared loop. See Figure 1B for the complete method and Figure 2 for examples of outputs.
+
+Figure 2. Outputs of hypergraph analysis on a random curve. (Left) PH-centrality assigns a high importance score to vertices that constitute the large loop in the center. (Right) PH-community reflects which subsets of data constitute the same loop.
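+
+Given the generators and their persistences, here is a minimal sketch of a persistence-weighted, degree-style vertex score (a simplified stand-in, not necessarily the centrality measure used in the paper; the generators below are hypothetical toy data):
+
+```python
+# Build a persistence-weighted hypergraph and score each vertex by the total
+# persistence of the hyperedges (generators) it participates in.
+import numpy as np
+
+n_points = 10
+# Each hyperedge is the set of vertices forming one loop; its weight is the
+# persistence of the corresponding point in the persistence diagram.
+hyperedges = [{0, 1, 2, 3}, {2, 3, 4, 5, 6}, {7, 8, 9}]
+persistences = np.array([0.9, 0.4, 0.1])
+
+# Incidence matrix: rows = vertices, columns = hyperedges.
+incidence = np.zeros((n_points, len(hyperedges)))
+for j, edge in enumerate(hyperedges):
+    for v in edge:
+        incidence[v, j] = 1.0
+
+centrality = incidence @ persistences  # persistence-weighted participation
+centrality /= centrality.max()         # normalize to [0, 1]
+print(np.argsort(centrality)[::-1])    # vertices ranked by importance
+```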
+
+The method, called hyperTDA, identifies subsets of data integral to the overall structure and partitions the data into functional modules. We validated our method on simulated and experimental data, including particle diffusion and animal movement trajectories.
+
+Preprint: Hypergraphs for multiscale cycles in structured data
+
+Code: github:hyperTDA
+
+*Joint work with Agnese Barbensi (University of Melbourne), Christian Madsen (University of Melbourne), Deborah Ajayi (University of Ibadan), Heather Harrington (University of Oxford), and Michael Stumpf (University of Melbourne).*
+
diff --git a/_projects/multiscale.md b/_projects/multiscale.md
new file mode 100644
index 0000000..88fe76f
--- /dev/null
+++ b/_projects/multiscale.md
@@ -0,0 +1,29 @@
+---
+layout: page
+title: Sheaf theory for data science
+description: Applications of persistent sheaf cohomology.
+img: assets/img/PointCloudExample.png
+importance: 4
+category: research
+---
+
+For my PhD dissertation, I worked on applications of topology to data science. I used cosheaves and spectral sequences to compute persistence in a distributed manner. I applied this distributed computation to study multi-density data and recovered information that is lost in persistence diagrams.
+
+For example, consider the following point cloud and its corresponding persistence diagram in dimension one.
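+
+A minimal sketch of the kind of point cloud and dimension-one persistence computation in question (a standard global computation with the `ripser` package, not the distributed construction described above; the data below is illustrative, not the example from this page):
+
+```python
+# A point cloud with one large loop and one small, densely sampled loop,
+# and its dimension-one persistence diagram.
+import numpy as np
+from ripser import ripser
+
+rng = np.random.default_rng(0)
+
+def circle(n, radius, center, noise=0.01):
+    theta = rng.uniform(0, 2 * np.pi, n)
+    points = radius * np.column_stack([np.cos(theta), np.sin(theta)]) + center
+    return points + noise * rng.normal(size=(n, 2))
+
+cloud = np.vstack([circle(150, 1.0, [0.0, 0.0]), circle(150, 0.15, [1.5, 0.0])])
+
+h1 = ripser(cloud, maxdim=1)["dgms"][1]     # (birth, death) pairs in dimension one
+lifetimes = h1[:, 1] - h1[:, 0]
+print(h1[np.argsort(lifetimes)[::-1][:2]])  # the large loop dominates the diagram
+```
+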
+By observing the persistence diagram, one would conclude that there is one significant feature. However, one can see from the point cloud that there are small but significant features that are densely sampled. My distributed computation makes it possible to identify such significant features, which are neglected by traditional methods.
+
+Here is a 30-minute video of my presentation at the IMA special workshop on Bridging Statistics and Sheaves.
+
+The paper can be found on arXiv. Here is a copy of my PhD dissertation.
+
+*University of Pennsylvania (2013-2018), PhD dissertation. Joint work with Robert Ghrist (U. Penn).*
diff --git a/_projects/musicians.md b/_projects/musicians.md
new file mode 100644
index 0000000..ac458c0
--- /dev/null
+++ b/_projects/musicians.md
@@ -0,0 +1,17 @@
+---
+layout: page
+title: classical musicians recommender
+description:
+img: assets/img/smallgraph.png
+importance: 1
+category: data science
+---
+
+I built a recommendation system for classical music performers. The recommender is based on the idea that musicians with frequent collaborations likely have similar performance styles. It first creates a graph of classical musicians and their collaborations and uses node2vec embeddings to find vector representations of the musicians. Given a list of a user's favorite artists, the recommender uses the similarity of the vector representations to recommend artists that the user may enjoy.
+
+Check out the app at https://musicians-rec.herokuapp.com
+Code: github
+Blog post: medium
+For an interactive exploration of the artist graph, click below:
+
+[Interactive artist graph](http://irisyoon.com/musicians_recommendation/graph_80000/graph_visualization/network/)
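+
+A minimal sketch of the embedding-and-recommendation step (assuming the `node2vec` and `networkx` packages; the collaboration graph and artist names below are hypothetical toy data, not the actual graph):
+
+```python
+# Embed a collaboration graph with node2vec, then recommend artists by
+# similarity of the learned vectors.
+import networkx as nx
+from node2vec import Node2Vec
+
+collaborations = [
+    ("Argerich", "Kremer"), ("Kremer", "Maisky"), ("Argerich", "Maisky"),
+    ("Perahia", "Lupu"), ("Lupu", "Kremer"),
+]
+graph = nx.Graph(collaborations)
+
+# Random walks + skip-gram give each musician a vector representation.
+embedder = Node2Vec(graph, dimensions=32, walk_length=10, num_walks=50, workers=1)
+model = embedder.fit(window=5, min_count=1)
+
+# Recommend the artists closest to a favorite artist in embedding space.
+for artist, score in model.wv.most_similar("Argerich", topn=3):
+    print(f"{artist}: {score:.2f}")
+```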
diff --git a/_projects/neuroscience.md b/_projects/neuroscience.md
new file mode 100644
index 0000000..acc815d
--- /dev/null
+++ b/_projects/neuroscience.md
@@ -0,0 +1,55 @@
+---
+layout: page
+title: topology & neuroscience
+description: A topological approach to neural encoding
+img: assets/img/neuro_image.png
+importance: 1
+category: research
+---
+
+What does it mean for there to be circular structures in neural activity?
+
+Consider a collection of images with different orientations, shown in Figure 1A. The red and dark orange images have high similarity, whereas the red and green images have low similarity. What would happen if we arranged the eight images in a way that respects this similarity? The images would be arranged in a circular fashion, as shown in Figure 1B.
+
+Figure 1. Circular structure of data. A. Collection of images. B. An arrangement of images based on the similarity of orientation reveals a circular structure. C. Collection of neural activities (spike trains). D. Consider two spike trains to be similar if the vertical lines are well-aligned after "sliding" one spike train by a small amount. An arrangement of neural activity based on spike train similarity reveals a circular structure.
+
+Similarly, consider a collection of neural activities (called spike trains) shown in Figure 1C. Each row indicates the activity of a single neuron over some time period, and each vertical line indicates that the neuron fired at the corresponding time. If we observe the neurons for, say, $$ M $$ time intervals, then each spike train is a binary vector in $$ \mathbb{R}^M $$. Given two spike trains $$ s_1 $$ and $$ s_2 $$, let's measure their similarity as the amount one needs to "slide" $$ s_1 $$ to "match" $$ s_2 $$. Then, the red spike train is similar to the dark orange spike train, but it is quite dissimilar to the green spike train. Again, if we were to arrange the spike trains in a way that respects this similarity, we would arrange them in a cyclic manner (Figure 1D).
+
+Now, suppose there are so many images and such long spike trains that we cannot make the arrangements by hand. How would a computer recognize that these high-dimensional data contain cyclic structures? Let $$ P $$ denote the point cloud representing a system of interest, such as the collection of stimuli or neural activities. We calculate the similarity between every pair of elements in the system. We then construct a sequence of simplicial complexes that represents the connectivity of the system as we vary the similarity level. The loops in this sequence are summarized by a persistence diagram, where points far from the diagonal represent the large loops. See the following figure.
+
+Figure 2. Detecting circular structures in high-dimensional data. Given the data (either images or spike trains), we first compute a matrix encoding the pairwise similarity between all elements in the system. We then create a sequence of simplicial complexes that represents the connectivity of the system at various similarity values. Finally, we summarize the loops in the simplicial complexes using a persistence diagram. Points far from the diagonal represent significant structures.
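+
+A minimal sketch of this pipeline on spike trains, using a toy circular-shift version of the "slide-to-match" similarity (assuming `numpy` and the `ripser` package; the spike trains are random placeholders, and this is not the exact measure used in the project):
+
+```python
+# Pairwise "slide-to-match" dissimilarity between binary spike trains,
+# followed by persistence on the resulting distance matrix.
+import numpy as np
+from ripser import ripser
+
+def shift_dissimilarity(s1, s2, max_shift=5):
+    """Smallest Hamming distance over circular shifts of s1 by at most max_shift bins."""
+    return min(np.sum(np.roll(s1, k) != s2) for k in range(-max_shift, max_shift + 1))
+
+rng = np.random.default_rng(1)
+spike_trains = rng.integers(0, 2, size=(20, 100))  # 20 neurons, 100 time bins
+
+n = len(spike_trains)
+distance = np.zeros((n, n))
+for i in range(n):
+    for j in range(i + 1, n):
+        d = shift_dissimilarity(spike_trains[i], spike_trains[j])
+        distance[i, j] = distance[j, i] = d
+
+# Feed the pairwise dissimilarity matrix directly to ripser.
+diagrams = ripser(distance, maxdim=1, distance_matrix=True)["dgms"]
+print(diagrams[1])  # dimension-one features as (birth, death) pairs
+```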
+
+So far, we have seen that persistence diagrams can indicate whether a collection of images or spike trains contains circular structures. Consider a hypothetical experiment in which we present a stimulus video to a mouse while measuring its neural activity. Let's assume that the persistence diagrams indicate that there are two circular structures in the stimulus and one circular structure in the neural activity. Does the unique circular feature in the neural activity reflect one of the circular features in the stimulus? If so, which one? Such questions are topological manifestations of a fundamental problem in neuroscience called neural encoding: how do neurons represent information?
+
+Figure 3. Neural encoding, stated as a problem in topology. Consider an experiment in which we present some stimulus while measuring neural activity. The persistence diagrams indicate that there are two circular features in the stimulus while there is only one circular feature in the neural activity. Which feature of the stimulus is represented by the neurons?
+
+To address the above questions, I developed a framework for comparing persistence diagrams called the analogous bars method.
+
+The methods paper has been accepted to the Journal of Applied and Computational Topology, conditional on minor revisions. A follow-up paper implementing the method on simulated and experimental neuroscience datasets is in preparation.
+
+Preprint: Persistent Extension and Analogous Bars: Data-Induced Relations Between Persistence Barcodes.
+
+Code: github: analogous bars
+
+*Joint work with Chad Giusti (U. Delaware) and Robert Ghrist (U. Penn).*
+
diff --git a/_projects/pbf.md b/_projects/pbf.md
new file mode 100644
index 0000000..d1d4140
--- /dev/null
+++ b/_projects/pbf.md
@@ -0,0 +1,25 @@
+---
+layout: page
+title: Philadelphia Bail Fund
+description:
+img: assets/img/philly.png
+importance: 2
+category: data science
+---
+From 2020 to 2021, I volunteered as a data scientist for Code for Philly, specifically for the Philadelphia Bail Fund. The volunteers and I wanted to understand how Philadelphia's bail system was affecting its citizens. The following are a few of the questions we wanted to address.
+
+* Which neighborhoods are most heavily impacted by the bail system?
+* How do the defendant's race and gender impact the bail amount?
+* Is there consistency across magistrates (the officials who set bail)? That is, do two different magistrates set similar bail amounts for similar cases?
+
+The volunteers and I gathered new criminal filing records from the municipal court. We then performed various statistical analyses to address the above questions. One challenge of this analysis was that many variables were correlated, and we needed to control for the correlations. For example, if one magistrate is likely to handle more severe offenses than another, then one has to take such differences into account when comparing the bail amounts set by the two magistrates.
+
+To that end, we used **topic modeling** on the criminal filing records to group cases by offense type and severity. We then performed a **matched study** to test whether two magistrates set similar bail amounts given similar offense types and severity. We found that bail amounts still vary widely across magistrates even after controlling for differences in offense type and severity.
+
+For more info, please visit the following app: PBF app
+Code: github
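+
+A minimal sketch of the topic-modeling step (assuming `scikit-learn`; the filing descriptions below are hypothetical placeholders, not actual court records, and this is not the exact PBF pipeline):
+
+```python
+# Group criminal filings into offense-type topics with LDA.
+from sklearn.decomposition import LatentDirichletAllocation
+from sklearn.feature_extraction.text import CountVectorizer
+
+filings = [
+    "retail theft under 500 dollars",
+    "possession of a controlled substance",
+    "aggravated assault with a weapon",
+    "theft by unlawful taking",
+    "simple assault",
+    "drug possession with intent to deliver",
+]
+
+counts = CountVectorizer(stop_words="english").fit_transform(filings)
+lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
+
+# Each filing gets a topic distribution; its dominant topic is the offense group
+# used downstream when matching cases across magistrates.
+topic_of_filing = lda.transform(counts).argmax(axis=1)
+print(topic_of_filing)
+```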