Commit

added projects md pages
irishryoon committed Oct 27, 2024

1 parent f69bf2d commit 4bed4d7
Showing 7 changed files with 199 additions and 0 deletions.
23 changes: 23 additions & 0 deletions _projects/RNA.md
@@ -0,0 +1,23 @@
---
layout: page
title: RNA structure prediction
description: sampling & clustering
img: assets/img/SAMIII.png
importance: 3
category: research
---

A ribonucleic acid (RNA) molecule can form complex structures through intramolecular base pairing. Some classes of RNAs regulate biological functions by changing their conformations. An example is illustrated below.
<p align="center">
<img width="350" src="https://irisyoon.com/assets/img/SAMIII_conformation1.png">
<img width="350" src="https://irisyoon.com/assets/img/SAMIII_conformation2.png">
</p>


Identifying multiple structures of an RNA can bring therapeutic advancements for RNA viruses. A popular approach is to sample low-energy structures from the nearest-neighbor thermodynamic model. Most algorithms follow the general flow of <b>sampling</b>, <b>clustering</b>, and reporting <b>cluster representatives</b>.

I worked on improving the <b>clustering</b> aspect of an RNA structure prediction algorithm called <a href="https://github.com/gtDMMB/RNAStructProfiling">profiling</a>. The existing method produced too many clusters with negligible biological differences. I proposed algorithmic ways to identify clusters that should be merged based on structural similarity. The enhanced version of profiling is under development by the Georgia Tech <a href="https://github.com/gtDMMB">Discrete Mathematics and Molecular Biology</a> group.
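
To make the idea of merging by structural similarity concrete, here is a minimal Python sketch. It is not the profiling algorithm itself: it assumes structures are given in dot-bracket notation, uses base-pair distance as the similarity measure, and groups structures greedily around the first member of each cluster. The names `base_pairs`, `bp_distance`, and `merge_similar`, the threshold, and the toy structures are all hypothetical.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.add((stack.pop(), i))
    return pairs


def bp_distance(s, t):
    """Number of base pairs present in exactly one of the two structures."""
    return len(base_pairs(s) ^ base_pairs(t))


def merge_similar(structures, threshold):
    """Greedy grouping: a structure joins the first cluster whose
    representative (its first member) is within `threshold` base pairs."""
    clusters = []
    for s in structures:
        for cluster in clusters:
            if bp_distance(cluster[0], s) <= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


# Toy structures, invented for illustration only.
sampled = ["((...))", "(.....)", ".(...).", "......."]
clusters = merge_similar(sampled, threshold=1)
```

With a threshold of one base pair, the first three toy structures fall into one cluster and the unpaired structure forms its own. A greedy pass like this depends on the order of the input, which is one reason a real method would need a more careful merging criterion.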

I also examined the prospect of using current methods to identify new multimodal RNAs. I found that there is a class of RNAs (kinetic riboswitches) that is difficult to detect with current sampling methods. I proposed a simple co-transcription simulation method to identify the multimodality of such RNAs. The results have been published in this <a href="https://www.researchgate.net/publication/337314911_Towards_an_understanding_of_RNA_structural_modalities_a_riboswitch_case_study">paper.</a>

*Georgia Tech (2018-2019), joint work with <a href="https://sites.google.com/site/christineheitsch/">Christine Heitsch</a> (Georgia Tech) and <a href="https://ribosnitch.bio.unc.edu/">Alain Laederach</a> (UNC).*
14 changes: 14 additions & 0 deletions _projects/fake_news.md
@@ -0,0 +1,14 @@
---
layout: page
title: fake news detector
description: using BERT & transfer learning
img: assets/img/wordcloud_real_news_titles.png
importance: 3
category: data science
---

During my five-week fellowship at the <a href="https://www.correlation-one.com/ds4a">Data Science for All Women's Summit</a> (Fall 2020), my teammates and I built a fake news detector using various natural language processing tools such as embeddings, RNN, BERT, and transfer learning. We performed careful preprocessing to remove biases in the dataset, and we used a model interpretability tool called LIME to identify points of improvement for our model.

Take a look at our <a href="https://github.com/s-chrodinger/fake-news-detection">code</a> on GitHub!

<iframe src="//www.slideshare.net/slideshow/embed_code/key/f5TLIG5Ag7wuGg" width="595" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe> <div style="margin-bottom:5px"> </div>
36 changes: 36 additions & 0 deletions _projects/hyperTDA.md
@@ -0,0 +1,36 @@
---
layout: page
title: Topology & Hypergraphs
description: Understanding topology through hypergraphs
img: assets/img/PH_hypergraph2.png
importance: 2
category: research
---

Persistence diagrams summarize the presence of loops in data. However, they do not tell us where the loops are. Can we understand which points are “integral” to the overall structure of the data? The goal is to quantify each point's contribution to the overall structure. There are two challenges. The first is that, given a persistence diagram, localizing the loops in a concise and interpretable way is an NP-hard problem. The second is the lack of methods to consolidate all loop information.
To address the first challenge, I developed a method called <a href="https://github.com/irishryoon/minimal_generators_curves">minimal generators</a> that outputs a small and interpretable collection of data (generators) that represents the loops. I utilize a recent technique of performing the optimization over rational coefficients to circumvent the NP-hardness of the problem.
To consolidate all generators, my collaborators and I utilized hypergraphs, a generalization of graphs in which each edge is a collection of two or more vertices. We constructed a PH-hypergraph that has the original data as the vertex set and whose hyperedges are the generators of a persistence diagram. To encode the “importance” of each point of the persistence diagram, we weighted each hyperedge according to its persistence (the distance between the point and the diagonal line in the persistence diagram). Figure 1 illustrates the construction.

<p align="center">
<img width="750" src="https://irisyoon.com/assets/img/hyperTDA.png">
</p>
<div class="caption">
Figure 1. Pipeline for constructing and analyzing PH-hypergraph. <span style="font-weight:bold">A.</span> Given point cloud data, we compute the persistence diagram that summarizes the presence of loops. For each point in the persistence diagram, we compute the generator (collection of vertices representing the loops). We then construct a hypergraph whose vertex set is the original data and whose hyperedges are the generators. <span style="font-weight:bold">B.</span> We analyze the PH-hypergraph via hypergraph centrality and community detection.
</div>

We use graph theory and network science to study the PH-hypergraph. We first compute hypergraph centrality, which ranks the original data according to their participation in large loops. We also perform community detection, which partitions the data according to how often a collection of points constitutes a shared loop. See Figure 1B for the complete method. See Figure 2 for examples of outputs.

<p align="center">
<img width="750" src="https://irisyoon.com/assets/img/hyperTDA_outputs.png">
</p>
<div class="caption">
Figure 2. Outputs of hypergraph analysis on a random curve. (Left) PH-centrality assigns a high importance score to vertices that constitute the large loop in the center. (Right) PH-community reflects which subsets of data constitute the same loop.
</div>
The method, called <a href="https://github.com/degnbol/hyperTDA">hyperTDA,</a> identifies subsets of data integral to the overall structure and partitions the data into functional modules. We validated our method on simulated and experimental data, including diffusion of particles and animal movement trajectories.
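
As a rough illustration of the PH-hypergraph construction, the Python sketch below builds weighted hyperedges from a list of bars and scores each vertex by the total weight of the hyperedges that contain it. This weighted-degree score is only a simple stand-in and is an assumption on my part; it is not necessarily the centrality implemented in hyperTDA. The function names and the toy bars are hypothetical, and the weight `death - birth` is used because it is proportional to the distance from the diagonal.

```python
def ph_hypergraph(bars):
    """Turn (birth, death, generator_vertices) triples into weighted hyperedges.

    Each hyperedge is (frozenset of vertices, persistence), where persistence
    is death - birth, which is proportional to the distance from the diagonal.
    """
    return [(frozenset(vertices), death - birth) for birth, death, vertices in bars]


def weighted_degree(hyperedges):
    """Score each vertex by the total weight of the hyperedges containing it."""
    score = {}
    for vertices, weight in hyperedges:
        for v in vertices:
            score[v] = score.get(v, 0.0) + weight
    return score


# Toy bars, invented for illustration: one large loop and one small loop.
bars = [
    (1.0, 5.0, [0, 1, 2, 3]),  # persistence 4.0
    (2.0, 3.0, [2, 3, 4]),     # persistence 1.0
]
hyperedges = ph_hypergraph(bars)
scores = weighted_degree(hyperedges)
```

In this toy example, vertices 2 and 3 lie on both loops and receive the highest score, while vertex 4 lies only on the small loop and receives the lowest.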

Preprint: <a href="https://arxiv.org/abs/2210.07545">Hypergraphs for multiscale cycles in structured data</a>

Code: <a href="https://github.com/degnbol/hyperTDA">github:hyperTDA</a>

*Joint work with Agnese Barbensi (University of Melbourne), Christian Madsen (University of Melbourne), Deborah Ajayi (University of Ibadan), Heather Harrington (University of Oxford), and Michael Stumpf (University of Melbourne).*

29 changes: 29 additions & 0 deletions _projects/multiscale.md
@@ -0,0 +1,29 @@
---
layout: page
title: Sheaf theory for data science
description: Applications of persistent sheaf cohomology.
img: assets/img/PointCloudExample.png
importance: 4
category: research
---


For my PhD dissertation, I worked on applications of topology to data science. I used <b>cosheaves</b> and <b>spectral sequences</b> to compute <b>persistence</b> in a distributed manner. I applied such distributed computation to study <b>multi-density data</b> and recovered the information lost in persistence diagrams.

For example, consider the following point cloud and its corresponding persistence diagram in dimension one.

<p align="center">
<img width="350" src="https://irisyoon.com/assets/img/PointCloudExample.png">
<img width="350" src="https://irisyoon.com/assets/img/PD.png">
</p>



By observing the persistence diagram, one would conclude that there is one significant feature. However, one can see from the point cloud that there are small but significant features that are densely sampled. My construction of distributed computation allows one to identify such significant features that are neglected by traditional methods.

Here is a 30-minute video of my presentation at the <a href="https://www.ima.umn.edu/2017-2018/SW5.21-25.18/27292">IMA special workshop on Bridging Statistics and Sheaves.</a>

The paper can be found on <a href="https://arxiv.org/abs/2001.01623">arXiv.</a> Here is a copy of my <a href="https://repository.upenn.edu/edissertations/2936/">PhD dissertation.</a>


*University of Pennsylvania (2013-2018), PhD dissertation. Joint work with <a href="https://www.math.upenn.edu/~ghrist/">Robert Ghrist</a> (U. Penn).*
17 changes: 17 additions & 0 deletions _projects/musicians.md
@@ -0,0 +1,17 @@
---
layout: page
title: classical musicians recommender
description:
img: assets/img/smallgraph.png
importance: 1
category: data science
---

I built a recommendation system for classical music performers. The recommender is based on the idea that musicians with frequent collaborations likely have similar performance styles. It first creates a graph of classical musicians and their collaborations and uses node2vec embeddings to find vector representations of the musicians. Given a list of users' favorite artists, the recommender uses similarity of the vector representations to recommend artists that a user may enjoy.
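
The recommendation step can be sketched in a few lines of Python. This is a simplified illustration, not the app's code: it assumes the node2vec embeddings have already been computed and are available as a dictionary, averages the embeddings of the favorite artists, and ranks everyone else by cosine similarity. The function names, the artist labels, and the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def recommend(embeddings, favorites, k=3):
    """Return the k artists most similar to the average favorite embedding."""
    dim = len(next(iter(embeddings.values())))
    profile = [
        sum(embeddings[name][d] for name in favorites) / len(favorites)
        for d in range(dim)
    ]
    candidates = [name for name in embeddings if name not in favorites]
    candidates.sort(key=lambda name: cosine(profile, embeddings[name]), reverse=True)
    return candidates[:k]


# Toy embeddings standing in for node2vec output (invented for illustration).
embeddings = {
    "artist_a": [1.0, 0.0],
    "artist_b": [0.9, 0.1],
    "artist_c": [0.0, 1.0],
    "artist_d": [-1.0, 0.0],
}
suggestions = recommend(embeddings, favorites=["artist_a"], k=2)
```

Averaging the favorites is one of several reasonable choices; another is to score each candidate by its best match against any single favorite.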

Check out the app at <a href="https://musicians-rec.herokuapp.com">https://musicians-rec.herokuapp.com</a>
Code: <a href="https://github.com/irishryoon/musicians_recommendation">github</a>
Blog post: <a href="https://medium.com/@irishryoon/classical-musicians-recommender-22ee176daee8">medium</a>
For interactive exploration of the artist graph, click below

[<center><img src="http://irisyoon.com/assets/img/graph.png" height ="400"></center>](http://irisyoon.com/musicians_recommendation/graph_80000/graph_visualization/network/)
55 changes: 55 additions & 0 deletions _projects/neuroscience.md
@@ -0,0 +1,55 @@
---
layout: page
title: topology & neuroscience
description: A topological approach to neural encoding
img: assets/img/neuro_image.png
importance: 1
category: research
---

What does it mean for there to be circular structures in neural activity?

Consider a collection of images with different orientations shown in Figure 1A. The red and dark orange images have high similarity, whereas the red and green images have low similarity. What would happen if we arranged the eight images in a way that respects this similarity? The images would be arranged in a circular fashion, as shown in Figure 1B.


<p align="center">
<img width="750" src="https://irisyoon.com/assets/img/neuro_cyclic_structures.png">
</p>
<div class="caption">
Figure 1. Circular structure of data. <span style="font-weight:bold">A.</span> Collection of images. <span style="font-weight:bold">B.</span> An arrangement of images based on the similarity of orientation reveals a circular structure. <span style="font-weight:bold">C.</span> Collection of neural activities (spike trains). <span style="font-weight:bold">D.</span> Consider two spike trains to be similar if the vertical lines are well-aligned after "sliding" one spike train by a small amount. An arrangement of neural activity based on spike train similarity reveals a circular structure.
</div>


Similarly, consider a collection of neural activities (called spike trains) shown in Figure 1C. Each row indicates the activity of a single neuron over some time period. Each vertical line indicates that the neuron fired at the corresponding time. If we observe the neuron for, say, $$ M $$ time intervals, then each spike train is a binary vector in $$ \mathbb{R}^M $$. Given two spike trains $$ s_1 $$ and $$ s_2 $$, let's measure their similarity by the amount one needs to "slide" $$ s_1 $$ to "match" $$ s_2 $$: the smaller the slide, the more similar the spike trains. Then, the red spike train is similar to the dark orange spike train, but it is quite dissimilar to the green spike train. Again, if we were to arrange the spike trains in a way that respects this similarity, we would arrange them in a cyclic manner (Figure 1D).
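
One way to make the "sliding" comparison concrete is the small Python sketch below. It rests on assumptions that go beyond the description above: spike trains are lists of 0s and 1s of equal length, sliding wraps around the end of the recording (a circular shift), and the best match is the shift with the largest number of coincident spikes. The function name `slide_distance` and the toy spike trains are hypothetical.

```python
def slide_distance(s1, s2):
    """Smallest circular slide of s1 (in time bins) that best matches s2.

    The best match is the shift with the most coincident spikes; ties are
    broken in favor of the smaller slide.
    """
    if len(s1) != len(s2):
        raise ValueError("spike trains must have the same length")
    m = len(s1)
    if m == 0:
        return 0
    best = None  # (negated overlap, slide amount), compared lexicographically
    for k in range(m):
        shifted = s1[m - k:] + s1[:m - k]  # circular shift to the right by k bins
        overlap = sum(a * b for a, b in zip(shifted, s2))
        candidate = (-overlap, min(k, m - k))
        if best is None or candidate < best:
            best = candidate
    return best[1]


# Toy spike trains over M = 8 time bins, invented for illustration.
red = [1, 1, 0, 0, 0, 0, 0, 0]
dark_orange = [0, 1, 1, 0, 0, 0, 0, 0]
green = [0, 0, 0, 0, 1, 1, 0, 0]

d_red_orange = slide_distance(red, dark_orange)
d_red_green = slide_distance(red, green)
```

Here the red and dark orange toy trains are one bin apart, while the red and green trains are four bins apart, the largest possible slide when the shift wraps around eight bins. That wrap-around is what produces a circular arrangement.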

Now, suppose there are so many images and such long spike trains that we cannot make the arrangements by hand. How would a computer recognize that these high-dimensional data contain cyclic structures? Let $$ P $$ denote the point cloud representing a system of interest, such as the collection of stimuli or neural activity. We calculate the similarity between every pair of elements in the system. We then represent the system by a sequence of simplicial complexes, one for each similarity level. The loops in this sequence are summarized by a persistence diagram, where the points far from the diagonal represent the large loops. See the following figure.

<p align="center">
<img width="750" src="https://irisyoon.com/assets/img/neural_PH.png">
</p>
<div class="caption">
Figure 2. Detecting circular structures from high dimensional data. Given the data (either images or spike trains), we first compute a matrix encoding pairwise similarity between all elements in the system. We then create the sequence of simplicial complexes that represents the connectivity of the system at various similarity values. Finally, we summarize the loops in the simplicial complexes using a persistence diagram. Points far from the diagonal represent significant structures.
</div>
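
To give a feel for one step of the pipeline in Figure 2, the Python sketch below counts independent loops at a single similarity level. It is a simplification under stated assumptions: it builds a Vietoris-Rips complex from a pairwise dissimilarity matrix at one threshold `eps`, and computes the first Betti number with rational coefficients by Gaussian elimination. A persistence diagram tracks how such loops appear and disappear across all thresholds, which this sketch does not do, and in practice one would use a dedicated persistent homology library. The function names and the eight-point circle are hypothetical.

```python
import math
from fractions import Fraction
from itertools import combinations


def rank(rows):
    """Rank of a matrix (list of rows) over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    ncols = len(m[0]) if m else 0
    r = 0
    for c in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            if m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def betti_1(dist, eps):
    """Number of independent loops in the Rips complex at threshold eps."""
    n = len(dist)
    edges = [(i, j) for i, j in combinations(range(n), 2) if dist[i][j] <= eps]
    edge_index = {e: k for k, e in enumerate(edges)}
    triangles = [t for t in combinations(range(n), 3)
                 if all(e in edge_index for e in combinations(t, 2))]
    # Boundary of an edge (i, j) is j - i; one row per edge.
    d1 = []
    for i, j in edges:
        row = [0] * n
        row[i], row[j] = -1, 1
        d1.append(row)
    # Boundary of a triangle (a, b, c) is (b, c) - (a, c) + (a, b).
    d2 = []
    for a, b, c in triangles:
        row = [0] * len(edges)
        row[edge_index[(b, c)]] = 1
        row[edge_index[(a, c)]] = -1
        row[edge_index[(a, b)]] = 1
        d2.append(row)
    # dim ker(d1) - rank(d2), where dim ker(d1) = #edges - rank(d1).
    return len(edges) - rank(d1) - rank(d2)


# Eight points evenly spaced on a unit circle (toy data for illustration).
points = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8)) for k in range(8)]
dist = [[math.dist(p, q) for q in points] for p in points]
loops_small_scale = betti_1(dist, eps=1.0)
loops_large_scale = betti_1(dist, eps=2.1)
```

At `eps=1.0` only neighboring points are connected, so the complex is a single cycle with one loop. At `eps=2.1` every pair is connected and the loop is filled in, so the count drops to zero. A loop that survives over a wide range of thresholds corresponds to a point far from the diagonal.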


So far, we have seen that persistence diagrams can indicate whether a collection of images or spike trains contains circular structures. Consider a hypothetical experiment in which we present a stimulus video to a mouse while measuring its neural activity. Let's assume that the persistence diagrams indicate that there are two circular structures in the stimulus and one circular structure in the neural activity. Is the unique circular feature in the neural activity reflecting one of the circular features in the stimulus? If so, which one? Such questions are topological manifestations of a fundamental problem in neuroscience called neural encoding, which studies how neurons represent information.


<p align="center">
<img width="450" src="https://irisyoon.com/assets/img/encoding.png">
</p>
<div class="caption">
Figure 3. Neural encoding, stated as a problem in topology. Consider an experiment in which we present some stimulus while measuring neural activity. The persistence diagrams indicate that there are two circular features in the stimulus while there is only one circular feature in the neural activity. Which feature of the stimulus is represented by the neurons?
</div>

To address the above questions, I developed a framework for comparing persistence diagrams called the <a href="https://arxiv.org/abs/2201.05190">analogous bars method</a>.

The methods paper has been accepted in the Journal of Applied and Computational Topology, conditional on minor revisions. A follow-up paper implementing the method on simulated and experimental neuroscience datasets is under preparation.

Preprint: <a href="https://arxiv.org/abs/2201.05190">Persistent Extension and Analogous Bars: Data-Induced Relations Between Persistence Barcodes</a>.

Code: <a href="https://github.com/UDATG/analogous_bars">github: analogous bars</a>

*Joint work with <a href="http://www.chadgiusti.com/">Chad Giusti</a> (U. Delaware) and <a href="https://www.math.upenn.edu/~ghrist/">Robert Ghrist</a> (U. Penn).*


25 changes: 25 additions & 0 deletions _projects/pbf.md
@@ -0,0 +1,25 @@
---
layout: page
title: Philadelphia Bail Fund
description:
img: assets/img/philly.png
importance: 2
category: data science
---
From 2020 to 2021, I volunteered as a data scientist for <a href="https://codeforphilly.org/">Code for Philly</a>, specifically the <a href="https://www.phillybailfund.org/">Philadelphia Bail Fund</a>. The volunteers and I wanted to understand how the bail system of Philadelphia was affecting the citizens. The following lists a few questions that we wanted to address.

* Which neighborhoods are most heavily impacted by the bail system?
* How do the defendant's race and gender impact the bail amount?
* Is there consistency across magistrates (the person who sets the bail)? That is, do two different magistrates set a similar bail amount for similar cases?

The volunteers and I gathered new criminal filing records from the municipal court. We then performed various statistical analyses to address the above questions. One challenge of this analysis was that many variables were correlated, and we needed to control for the correlations. For example, if one magistrate is likely to handle more severe offenses than another, then one would have to take such differences into account when comparing the bail amounts set by the two magistrates.

To that end, we used **topic modeling** on the criminal filing records to group cases by offense type and severity. We then performed a **matched study** to test whether two magistrates set similar bail amounts given similar offense severity. We found that there is still a high variance in the bail amounts set across magistrates even after controlling for the difference in the offense type and severity.
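
The idea of comparing magistrates only on similar cases can be sketched as follows. This is a simplified stratified comparison standing in for the matched study, not the analysis code from the project: it assumes each case already carries an offense group (for example, from topic modeling) and a severity level, and it compares mean bail only within groups handled by both magistrates. The function name, the field names, and the records are all invented for illustration and are not real data.

```python
from collections import defaultdict
from statistics import mean


def matched_differences(cases, magistrate_a, magistrate_b):
    """Mean bail difference (A minus B) within each (offense_group, severity)
    stratum that both magistrates handled."""
    strata = defaultdict(lambda: defaultdict(list))
    for case in cases:
        key = (case["offense_group"], case["severity"])
        strata[key][case["magistrate"]].append(case["bail"])
    differences = {}
    for key, by_magistrate in strata.items():
        if magistrate_a in by_magistrate and magistrate_b in by_magistrate:
            differences[key] = (
                mean(by_magistrate[magistrate_a]) - mean(by_magistrate[magistrate_b])
            )
    return differences


# Invented toy records, for illustration only.
cases = [
    {"offense_group": "theft", "severity": 1, "magistrate": "A", "bail": 1000},
    {"offense_group": "theft", "severity": 1, "magistrate": "A", "bail": 3000},
    {"offense_group": "theft", "severity": 1, "magistrate": "B", "bail": 1500},
    {"offense_group": "assault", "severity": 2, "magistrate": "A", "bail": 5000},
]
differences = matched_differences(cases, "A", "B")
```

Only the theft stratum is compared in this toy example, because magistrate B handled no assault cases. Restricting the comparison to shared strata is what keeps a difference in caseload from being mistaken for a difference in how bail is set.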

For more info, please visit the following app: <a href="https://codeforphilly-pbf-analysis-app-hzafyl.streamlitapp.com/">PBF app</a>
Code: <a href="https://github.com/CodeForPhilly/pbf-analysis">github</a>

<p align="center">
<img width="700" src="https://irisyoon.com/assets/img/pbf_magistrate.png">
</p>
