Language model, Sequence and function prediction

USPNet

task

Takes an amino acid sequence as input and predicts the cleavage site and the signal peptide (SP) type

motivation

data imbalance problem
Previous methods rely on additional group information about the proteins to improve performance, but this information is not always available

technical contribution

LSTM + attention as the model
Combines the class-balanced loss with the label-distribution-aware margin (LDAM) loss as the USPNet loss function, to address the data imbalance problem
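
A minimal sketch of how the two losses could be combined for a single sample. It uses the standard definitions (LDAM margin proportional to n_j^(-1/4), class-balanced weight (1 - beta) / (1 - beta^n_j)); multiplying the weight into the LDAM term, the function names, and the default values of `beta` and `max_margin` are assumptions for illustration, not details taken from USPNet.

```python
import math

def ldam_margins(class_counts, max_margin=0.5):
    # LDAM margin per class is proportional to n_j^(-1/4);
    # rescaled so the rarest class gets max_margin (assumed value).
    raw = [1.0 / (n ** 0.25) for n in class_counts]
    scale = max_margin / max(raw)
    return [r * scale for r in raw]

def class_balanced_weights(class_counts, beta=0.999):
    # Class-balanced weight: inverse of the "effective number" of samples,
    # normalized so the weights sum to the number of classes.
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    total = sum(raw)
    return [w * len(raw) / total for w in raw]

def cb_ldam_loss(logits, label, class_counts, beta=0.999, max_margin=0.5):
    margins = ldam_margins(class_counts, max_margin)
    weights = class_balanced_weights(class_counts, beta)
    shifted = list(logits)
    shifted[label] -= margins[label]  # subtract the margin from the true-class logit only
    m = max(shifted)
    log_z = m + math.log(sum(math.exp(z - m) for z in shifted))  # stable log-sum-exp
    return weights[label] * (log_z - shifted[label])
```

With the same logits, a sample from a rare class receives both a larger margin and a larger weight, so it contributes more to the loss than a sample from a frequent class.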

pipeline

base model: Bi-LSTM
Steps:

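1. Amino acid sequence -> L x 20 matrix (L is the sequence length) (an MSA is also generated and fed into the MSA Transformer)
2. Pass the sequence from step 1 through the feature extraction module + embedding layer
3. Feed the output of step 2 into the Bi-LSTM (Bi-LSTM with self-attention + CNN), which captures forward and backward dependencies, global features, and local features together
4. Feed the output of step 3 into an MLP-based head that predicts the cleavage site and the SP type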
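
The first step above (sequence -> L x 20 matrix) can be sketched as a one-hot encoding. Whether USPNet uses exactly a one-hot encoding, the alphabet ordering, and the all-zero row for unknown residues are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, in an assumed order
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Return an L x 20 matrix (list of rows) for an amino acid sequence."""
    matrix = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        if residue in INDEX:  # non-standard residues (e.g. X) stay all-zero
            row[INDEX[residue]] = 1
        matrix.append(row)
    return matrix
```

Each row has at most one nonzero entry, so the matrix has L rows and 20 columns regardless of the sequence content.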

MSA Transformer

task

input: multiple sequence alignment data
output: embedding matrix as feature representation

motivation

technical contribution

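Row and column attention

Row attention: learns what the different sequences in the MSA have in common
Column attention: global information; learns structural information
Complexity drops from O(M^2 L^2) to O(M L^2) + O(L M^2):
M is the number of sequences and L is the sequence length. O(M^2 L^2): full attention over all M x L tokens compares every token with every other token, so the cost is (ML)^2. O(M L^2) + O(L M^2): in row attention, each of the M rows computes attention weights only within itself, costing M L^2 in total; column attention does the same within each of the L columns, costing L M^2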
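
To make the saving concrete, the small sketch below counts pairwise attention scores for both schemes. The function names and the example sizes are chosen for illustration only.

```python
def full_attention_pairs(num_seqs, seq_len):
    # Every one of the M*L tokens attends to every other token.
    tokens = num_seqs * seq_len
    return tokens * tokens

def axial_attention_pairs(num_seqs, seq_len):
    # Row attention: M rows, each with L*L scores.
    row = num_seqs * seq_len ** 2
    # Column attention: L columns, each with M*M scores.
    col = seq_len * num_seqs ** 2
    return row + col
```

For M = 64 and L = 256, full attention needs 268,435,456 scores while row plus column attention needs 5,242,880, a 51.2x reduction.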

Trained with masking

Trained on unlabeled data

Trained across many different protein families

So the trained model also performs well out of sample
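
A rough sketch of how masked training needs no labels: tokens in the MSA are hidden at random and the originals become the prediction targets. The `<mask>` token string, the 15% default rate, and the function name are assumptions for this example, not values taken from the notes.

```python
import random

MASK = "<mask>"  # hypothetical mask token

def mask_msa(msa, mask_prob=0.15, seed=0):
    """Randomly mask tokens of an MSA given as a list of equal-length strings.

    Returns the masked MSA (list of token lists) and a dict mapping
    (row, column) -> original token, which serves as the training target.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for r, seq in enumerate(msa):
        row = []
        for c, token in enumerate(seq):
            if rng.random() < mask_prob:
                row.append(MASK)
                targets[(r, c)] = token  # the model must recover this token
            else:
                row.append(token)
        masked.append(row)
    return masked, targets
```

The targets come from the sequences themselves, which is why any unlabeled MSA from any protein family can be used for training.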

model

ProtENN

task

input: unaligned amino acid sequences from the Pfam database
output: the Pfam family the sequence belongs to

motivation

technical contribution

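A heatmap of the correlation coefficients between the amino acid embeddings produced by ProtCNN is compared with the BLOSUM62 matrix; the resemblance indicates that ProtCNN has learned meaningful information about the amino acids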
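
The comparison starts from a matrix of pairwise correlations between per-residue embedding vectors. A minimal sketch of that step, assuming Pearson correlation and a dict that maps each residue letter to its embedding (both are assumptions; the embeddings themselves would come from the trained model):

```python
import math

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length vectors.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(embeddings):
    """embeddings: dict mapping residue letter -> embedding vector.

    Returns the sorted residue letters and the matrix of pairwise
    correlations in that order, ready to plot as a heatmap.
    """
    keys = sorted(embeddings)
    matrix = [[pearson(embeddings[a], embeddings[b]) for b in keys] for a in keys]
    return keys, matrix
```

Residues that substitute for each other frequently would be expected to show high correlation here, which is what makes a side-by-side view with BLOSUM62 informative.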

model

ResNet, CNN
Ensemble of multiple ProtCNN models
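
One simple way to combine several models is a majority vote over their predicted families. The sketch below assumes that scheme and breaks ties in favor of the earliest model; the function name and the tie rule are illustrative choices.

```python
from collections import Counter

def ensemble_vote(predictions):
    """predictions: list of family labels, one per ensemble member."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    # Ties are broken in favor of the earliest model in the list.
    for label in predictions:
        if counts[label] == best:
            return label
```

For example, `ensemble_vote(["PF00001", "PF00002", "PF00001"])` returns `"PF00001"` (the labels here are placeholders in Pfam accession format).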

RNA-FM

motivation

technical contribution

task

input: RNA nucleotide sequence
output: an L x 640 embedding matrix that encodes RNA structural and functional information

model

12 transformer layers (multi-head attention + FFN)
Residual connections around each module
masked language modeling objective
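
A minimal sketch of the masked language modeling loss: cross-entropy is averaged over the masked positions only, so unmasked positions do not contribute. The four-letter vocabulary and the function signature are simplifying assumptions for this example.

```python
import math

VOCAB = ["A", "C", "G", "U"]  # simplified nucleotide vocabulary (assumption)

def masked_lm_loss(logits, targets, masked_positions):
    """Mean cross-entropy over masked positions.

    logits: one list of len(VOCAB) scores per sequence position.
    targets: the original (unmasked) nucleotide sequence.
    masked_positions: indices that were hidden from the model.
    """
    if not masked_positions:
        raise ValueError("no masked positions to score")
    total = 0.0
    for pos in masked_positions:
        row = logits[pos]
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # stable log-sum-exp
        total += log_z - row[VOCAB.index(targets[pos])]
    return total / len(masked_positions)
```

With uniform logits the loss equals log(4), the cost of guessing among four nucleotides; it approaches 0 as the model grows confident in the correct token.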

ProGen

motivation

technical contribution

task

input:
output:

model

NetGPI

motivation

technical contribution

task

input:
output:

model

AcrNET

motivation

technical contribution

task

input:
output:

model