
Data loading


This tutorial shows how to load protein and amino acid scale datasets.


Loading of protein benchmarks


Load the overview table of protein benchmark datasets using the default settings:

import aaanalysis as aa
df_info = aa.load_dataset()
df_info.iloc[:, :7].head(13)

    Level       Dataset        # Sequences  # Amino acids  # Positives  # Negatives  Predictor
0   Amino acid  AA_CASPASE3            233         185605          705       184900  PROSPERous
1   Amino acid  AA_FURIN                71          59003          163        58840  PROSPERous
2   Amino acid  AA_LDR                 342         118248        35469        82779  IDP-Seq2Seq
3   Amino acid  AA_MMP2                573         312976         2416       310560  PROSPERous
4   Amino acid  AA_RNABIND             221          55001         6492        48509  GMKSVM-RU
5   Amino acid  AA_SA                  233         185605       101082        84523  PROSPERous
6   Sequence    SEQ_AMYLO             1414           8484          511          903  ReRF-Pred
7   Sequence    SEQ_CAPSID            7935        3364680         3864         4071  VIRALpro
8   Sequence    SEQ_DISULFIDE         2547         614470          897         1650  Dipro
9   Sequence    SEQ_LOCATION          1835         732398         1045          790  NaN
10  Sequence    SEQ_SOLUBLE          17408        4432269         8704         8704  SOLpro
11  Sequence    SEQ_TAIL              6668        2671690         2574         4094  VIRALpro
12  Domain      DOM_GSEC               126          92964           63           63  NaN

The benchmark datasets are grouped into amino acid (‘AA’), domain (‘DOM’), and sequence (‘SEQ’) level datasets, as indicated by their name prefix. One dataset of each level is loaded in the following example:

df_seq1 = aa.load_dataset(name="AA_CASPASE3")
df_seq2 = aa.load_dataset(name="SEQ_CAPSID")
df_seq3 = aa.load_dataset(name="DOM_GSEC")
df_seq2.head(2)
# Compare columns of three types

   entry     sequence                                           label
0  CAPSID_1  MVTHNVKINKHVTRRSYSSAKEVLEIPPLTEVQTASYKWFMDKGIK...  0
1  CAPSID_2  MKKRQKKMTLSNFTDTSFQDFVSAEQVDDKSAMALINRAEDFKAGQ...  0
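The prefix convention can be made explicit with a small lookup. This is only an illustrative sketch: `dataset_level` and `LEVEL_BY_PREFIX` are hypothetical names and not part of aaanalysis; the mapping itself follows the overview table.

```python
# Hypothetical helper (not part of aaanalysis): derive the dataset level
# from the name prefix used by the benchmark datasets.
LEVEL_BY_PREFIX = {"AA": "Amino acid", "SEQ": "Sequence", "DOM": "Domain"}


def dataset_level(name: str) -> str:
    """Return the level encoded in the prefix of a benchmark dataset name."""
    prefix = name.split("_", 1)[0]
    if prefix not in LEVEL_BY_PREFIX:
        raise ValueError(f"Unknown dataset prefix in {name!r}")
    return LEVEL_BY_PREFIX[prefix]


levels = {name: dataset_level(name)
          for name in ("AA_CASPASE3", "SEQ_CAPSID", "DOM_GSEC")}
print(levels)
```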

Each dataset can be used for binary classification, with samples labeled as positive (1) or negative (0). The n parameter sets the number of samples per class, which yields a balanced dataset.

df_seq = aa.load_dataset(name="SEQ_CAPSID", n=100)
# Returns 200 samples, 100 positives and 100 negatives
df_seq["label"].value_counts()

label
0    100
1    100
Name: count, dtype: int64

Alternatively, the samples can be selected randomly by setting random=True:

df_seq = aa.load_dataset(name="SEQ_CAPSID", n=100, random=True)
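The effect of n and random can be pictured with a minimal, library-independent sketch. The helper `sample_balanced` and its `shuffle` and `seed` arguments are hypothetical and not part of aaanalysis. Taking the first n entries per class in the non-random case is an assumption, since that behavior is not stated here.

```python
import random


def sample_balanced(records, n, shuffle=False, seed=None):
    """Pick n records per class from (entry, sequence, label) tuples.

    Assumption: without shuffling, the first n records of each class are kept.
    """
    rng = random.Random(seed)
    selected = []
    for label in (0, 1):
        group = [rec for rec in records if rec[2] == label]
        if n > len(group):
            raise ValueError(f"Class {label} has only {len(group)} samples")
        selected.extend(rng.sample(group, n) if shuffle else group[:n])
    return selected


# Toy records with alternating labels (illustrative data only)
records = [(f"TOY_{i}", "ACDEFGHIK", i % 2) for i in range(20)]
subset = sample_balanced(records, n=3)
subset_random = sample_balanced(records, n=3, shuffle=True, seed=0)
print(len(subset), sum(rec[2] for rec in subset))
```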

Protein sequences can vary in length:

# Plot distribution
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import matplotlib.pyplot as plt
import seaborn as sns
# Utility AAanalysis function for publication ready plots
aa.plot_settings(font_scale=1.2)
df_seq = aa.load_dataset(name="SEQ_CAPSID", n=100)
list_seq_lens = df_seq["sequence"].apply(len)
sns.histplot(list_seq_lens, binwidth=50)
sns.despine()
plt.xlim(0, 1500)
plt.show()

Figure: histogram of sequence lengths (../_images/output_9_0.png)

Sequences can be filtered by length using the min_len and max_len parameters:

df_seq = aa.load_dataset(name="SEQ_CAPSID", n=100, min_len=200, max_len=800)
list_seq_lens = df_seq["sequence"].apply(len)
aa.plot_settings(font_scale=1.2)  # Utility AAanalysis function for publication ready plots
sns.histplot(list_seq_lens, binwidth=50)
sns.despine()
plt.xlim(0, 1500)
plt.show()

Figure: histogram of sequence lengths after filtering (../_images/output_11_0.png)
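Conceptually, length filtering keeps only sequences whose length lies within the given bounds. The following sketch uses a hypothetical helper, `filter_by_length`, that is not part of aaanalysis. It assumes both bounds are inclusive, which is not stated here.

```python
def filter_by_length(sequences, min_len=None, max_len=None):
    """Keep sequences with min_len <= len(seq) <= max_len (bounds assumed inclusive)."""
    kept = []
    for seq in sequences:
        if min_len is not None and len(seq) < min_len:
            continue
        if max_len is not None and len(seq) > max_len:
            continue
        kept.append(seq)
    return kept


# Toy sequences of different lengths (illustrative data only)
toy_sequences = ["A" * length for length in (150, 200, 500, 800, 900)]
kept_lengths = [len(seq) for seq in filter_by_length(toy_sequences, min_len=200, max_len=800)]
print(kept_lengths)
```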

Loading of protein benchmarks: Amino acid window size


For amino acid level datasets, labels are provided for each residue position, which can be seen by setting aa_window_size=None:

df_seq = aa.load_dataset(name="AA_CASPASE3", aa_window_size=None)
df_seq.head(4)

   entry       sequence                                           label
0  CASPASE3_1  MSLFDLFRGFFGFPGPRSHRDPFFGGMTRDEDDDEEEEEEGGSWGR...  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...
1  CASPASE3_2  MEVTGDAGVPESGEIRTLKPCLLRRNYSREQHGVAASCLEDLRSKA...  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...
2  CASPASE3_3  MRARSGARGALLLALLLCWDPTPSLAGIDSGGQALPDSFPSAPAEQ...  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...
3  CASPASE3_4  MDAKARNCLLQHREALEKDIKTSYIMDHMISDGFLTISEEEKVRNE...  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...
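The label column above appears to hold one comma-separated label per residue. Assuming it is stored as a string in that form (an inference from the displayed output, not a documented guarantee), it could be turned into a list of integers as sketched below. `parse_residue_labels` is a hypothetical helper, and the toy labels are illustrative only.

```python
def parse_residue_labels(label_string):
    """Convert a comma-separated label string such as '0,0,1' into a list of ints."""
    return [int(value) for value in label_string.split(",")]


# Toy example: a 9-residue sequence with one positive residue (illustrative data only)
toy_sequence = "MSLFDLFRG"
toy_labels = parse_residue_labels("0,0,0,0,1,0,0,0,0")

# One label per residue is expected
print(len(toy_labels) == len(toy_sequence), toy_labels.index(1))
```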

For convenience, we provide an “amino acid window” of length n. This window represents a specific amino acid, flanked by (n-1)/2 residues on both its N-terminal and C-terminal sides. Therefore, n must be odd so that both sides contain the same number of residues. The default window size is 9, but sizes between 5 and 15 are also common.

df_seq = aa.load_dataset(name="AA_CASPASE3")
df_seq.head(4)

   entry            sequence   label
0  CASPASE3_1_pos4  MSLFDLFRG  0
1  CASPASE3_1_pos5  SLFDLFRGF  0
2  CASPASE3_1_pos6  LFDLFRGFF  0
3  CASPASE3_1_pos7  FDLFRGFFG  0
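The windowing described above can be sketched in plain Python. `amino_acid_windows` is a hypothetical helper, not the library's implementation. Two details are inferred from the output shown above rather than documented here: the `pos` suffix is read as the 0-based index of the central residue, and residues too close to either terminus to have full flanks are skipped.

```python
def amino_acid_windows(entry, sequence, labels, window_size=9):
    """Return (entry_pos, window, label) for every residue with complete flanks."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd")
    half = (window_size - 1) // 2  # residues on each side of the central residue
    rows = []
    for center in range(half, len(sequence) - half):
        window = sequence[center - half:center + half + 1]
        rows.append((f"{entry}_pos{center}", window, labels[center]))
    return rows


# First 12 residues of CASPASE3_1 as displayed above; all-zero labels for illustration
toy_sequence = "MSLFDLFRGFFG"
rows = amino_acid_windows("CASPASE3_1", toy_sequence, [0] * len(toy_sequence))
for row in rows:
    print(row)
```

With a window size of 9, the first window is centered on index 4, which matches the first entry name (CASPASE3_1_pos4) in the table above.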

Sequences can be pre-filtered using min_len and max_len, and n residues per class can be randomly selected by setting random=True. Both options can be combined with a different aa_window_size.

df_seq = aa.load_dataset(name="AA_CASPASE3", min_len=20, n=3, random=True, aa_window_size=21)
df_seq

   entry               sequence               label
0  CASPASE3_55_pos170  KKRKLEEEEDGKLKKPKNKDK  1
1  CASPASE3_29_pos185  CPHHERCSDSDGLAPPQHLIR  1
2  CASPASE3_64_pos431  DNPLNWPDEKDSSFYRNFGST  1
3  CASPASE3_93_pos455  FVKNMNRDSTFIVNKTITAEV  0
4  CASPASE3_38_pos129  SSFDLDYDFQRDYYDRMYSYP  0
5  CASPASE3_8_pos33    RPPQLRPGAPTSLQTEPQGNP  0

Loading of protein benchmarks: Positive-Unlabeled (PU) datasets


In typical binary classification, data is labeled as positive (1) or negative (0). However, many protein sequence datasets pose challenges: they may be small, unbalanced, or lack a clear negative class. For datasets containing only positive (1) and unlabeled (2) samples, we use PU learning. This approach identifies reliable negatives among the unlabeled data to make binary classification possible. We provide benchmark datasets for this scenario, denoted by the _PU suffix. For example, the DOM_GSEC_PU dataset corresponds to the DOM_GSEC set.

df_seq = aa.load_dataset(name="DOM_GSEC")
df_seq

     entry   sequence                                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c
0    P05067  MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLNMHMN...  1      701        723       FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
1    P14925  MAGRARSGLLLLLLGLLALQSSCLAFRSPLSVFKRFKETTRSFSNE...  1      868        890       KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
2    P70180  MRSLLLFTFSACVLLARVLLAGGASSGAGDTRPGSRRRAREALAAQ...  1      477        499       PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER
3    Q03157  MGPTSPAARGQGRRWRPPPLPLLLPLSLLLLRAQLAVGNLAVGSPS...  1      585        607       APSGTGVSRE  ALSGLLIMGAGGGSLIVLSLLLL  RKKKPYGTIS
4    Q06481  MAATGTAAAAATGRLLLLLLVGLTAPALALAGYIEALAANAGTGFA...  1      694        716       LREDFSLSSS  ALIGLLVIAVAIATVIVISLVML  RKRQYGTISH
...  ...     ...                                                ...    ...        ...       ...         ...                      ...
121  P36941  MLLPWATSAPGLAWGPLVLGLFGLLAASQPQAVPPYASENQTCRDQ...  0      226        248       PLPPEMSGTM  LMLAVLLPLAFFLLLATVFSCIW  KSHPSLCRKL
122  P25446  MLWIWAVLPLVLAGSQLRVHTQGTNSISESLKLRRRVRETDKNCSE...  0      170        187       NCRKQSPRNR  LWLLTILVLLIPLVFIYR       KYRKRKCWKR
123  Q9P2J2  MVWCLGLAVLSLVISQGADGRGKPEVVSVVGRAGESVVLGCDLLPP...  0      738        760       PGLLPQPVLA  GVVGGVCFLGVAVLVSILAGCLL  NRRRAARRRR
124  Q96J42  MVPAAGRRPPRVMRLLGWWQVLLWVLGLPVRGVEVAEESGRLWSEE...  0      324        342       LPSTLIKSVD  WLLVFSLFFLISFIMYATI      RTESIRWLIP
125  P0DPA2  MRVGGAFHLLLVCLSPALLSAVRINGDGQEVLYLAEGDNVRLGCPY...  0      265        287       KVSDSRRIGV  IIGIVLGSLLALGCLAVGIWGLV  CCCCGGSGAG

126 rows × 8 columns

df_seq_pu = aa.load_dataset(name="DOM_GSEC_PU")
df_seq_pu

     entry   sequence                                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c
0    P05067  MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLNMHMN...  1      701        723       FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
1    P14925  MAGRARSGLLLLLLGLLALQSSCLAFRSPLSVFKRFKETTRSFSNE...  1      868        890       KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
2    P70180  MRSLLLFTFSACVLLARVLLAGGASSGAGDTRPGSRRRAREALAAQ...  1      477        499       PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER
3    Q03157  MGPTSPAARGQGRRWRPPPLPLLLPLSLLLLRAQLAVGNLAVGSPS...  1      585        607       APSGTGVSRE  ALSGLLIMGAGGGSLIVLSLLLL  RKKKPYGTIS
4    Q06481  MAATGTAAAAATGRLLLLLLVGLTAPALALAGYIEALAANAGTGFA...  1      694        716       LREDFSLSSS  ALIGLLVIAVAIATVIVISLVML  RKRQYGTISH
...  ...     ...                                                ...    ...        ...       ...         ...                      ...
689  P60852  MAGGSATTWGYPVALLLLVATLGLGRWLQPDPGLPGLRHSYDCGIK...  2      602        624       DSNGNSSLRP  LLWAVLLLPAVALVLGFGVFVGL  SQTWAQKLWE
690  P20239  MARWQRKASVSSPCGRSIYRFLSLLFTLVTSVNSVSLPQSENPAFP...  2      684        703       IIAKDIASKT  LGAVAALVGSAVILGFICYL     YKKRTIRFNH
691  P21754  MELSYRLFICLLLWGSTELCYPQPLWLLQGGASHPETSVQPVLVEC...  2      387        409       EQWALPSDTS  VVLLGVGLAVVVSLTLTAVILVL  TRRCRTASHP
692  Q12836  MWLLRCVLLCVSLSLAVSGQHKPEAPDYSSVLHCGPWSFQFAVNLN...  2      506        528       EKLRVPVDSK  VLWVAGLSGTLILGALLVSYLAV  KKQKSCPDQM
693  Q8TCW7  MEQIWLLLLLTIRVLPGSAQFNGYNCDANLHSRFPAERDISVYCGV...  2      374        396       PFQLNAITSA  LISGMVILGVTSFSLLLCSLALL  HRKGPTSLVL

694 rows × 8 columns
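The label convention of the PU datasets (1 for positive, 2 for unlabeled) can be handled with a simple split, sketched here in plain Python. `split_pu` is a hypothetical helper, and the sketch deliberately stops before the PU learning step itself: it does not identify reliable negatives. The entries and labels are taken from the DOM_GSEC_PU rows shown above.

```python
from collections import Counter

POSITIVE, UNLABELED = 1, 2  # label codes as used in the _PU datasets


def split_pu(records):
    """Split (entry, label) pairs into positive and unlabeled entries."""
    positives = [entry for entry, label in records if label == POSITIVE]
    unlabeled = [entry for entry, label in records if label == UNLABELED]
    return positives, unlabeled


# A few (entry, label) pairs from the table above
records = [("P05067", 1), ("P14925", 1), ("P60852", 2), ("P20239", 2), ("P21754", 2)]
positives, unlabeled = split_pu(records)
label_counts = Counter(label for _, label in records)
print(positives, unlabeled, dict(label_counts))
```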
