fcg cfg #1

Open
lqc09 opened this issue Apr 10, 2022 · 14 comments

Comments

@lqc09

lqc09 commented Apr 10, 2022

Hello author, how do I process the FCG and CFG into JSONL?

@ryderling
Owner

Thanks for your attention.
Based on the implementation of Genius (see https://github.com/qian-feng/Gencoding for details), we disassemble all PE samples in the dataset with IDA Pro 6.4, then generate their FCGs and CFGs accordingly, and finally store them in the JSONL file format.

@lqc09
Author

lqc09 commented Apr 11, 2022

Thanks a lot. Can both FCGs and CFGs be handled by Genius (https://github.com/qian-feng/Gencoding)?

@ryderling
Owner

Not yet. As I recall, Genius is used to extract CFGs, and it is easy to generate the FCG based on the Genius framework.

@lqc09
Author

lqc09 commented Apr 11, 2022

Thanks, got it.

@lizhangtan

Hello author, I have read your paper and tried to use Genius (https://github.com/qian-feng/Gencoding) to get the CFGs from PE samples. I ran preprocessing_ida.py from Genius and got output (XXX.ida) like this:
(i__main__
raw_graphs
p1
(dp2
S'raw_graph_list'
p3
(lp4
(iraw_graphs
raw_graph
p5
(dp6
S'entry'
p7
I0
sS'fun_features'
...

So how can I get the CFG in JSONL file format? I would appreciate more details on how to use Genius to generate CFGs in JSONL file format.

@KennenH

KennenH commented Jun 25, 2023

@lizhangtan Any solution? I have run into the same problem.
There is also a file called train_external_function_name_vocab.jsonl that is needed before model training, and I have no idea how to generate it either.

@Divine-sh

@KennenH I don't know how to generate the file train_external_function_name_vocab.jsonl either. Do you have a solution?

@ryderling
Owner

Reply to @KennenH and @Divine-sh:

As described in Section IV.A.2 of the paper, each node representing an external function in the FCG is one-hot encoded based on its function name, and we limit the vocabulary of external functions to the 10,000 that are most frequently used in the training dataset.
The file train_external_function_name_vocab.jsonl stores these top 10,000 external function names from the training dataset.
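
For readers following along, here is a minimal sketch of the encoding described above. The names `one_hot` and `vocab_index` are hypothetical, the three-name vocabulary is a toy stand-in for the 10,000-entry one, and mapping out-of-vocabulary names to an all-zero vector is an assumption, since the description above does not say how such names are handled.

```python
def one_hot(name, vocab_index, size):
    """Return a one-hot list for `name`; all zeros if the name is not in the vocabulary."""
    vec = [0] * size
    idx = vocab_index.get(name)
    if idx is not None:
        vec[idx] = 1
    return vec


# Toy vocabulary standing in for the top-10,000 list (names are illustrative).
vocab = ["CreateFileA", "ReadFile", "ExitProcess"]
vocab_index = {name: i for i, name in enumerate(vocab)}

known = one_hot("ReadFile", vocab_index, len(vocab))
unknown = one_hot("NotInVocab", vocab_index, len(vocab))
```

With a real vocabulary of 10,000 names, a sparse representation would likely be preferable to a dense list.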

@20521862

20521862 commented Jul 1, 2023

@ryderling Can you explain how to obtain the top 10,000 you mentioned?

@KennenH

KennenH commented Jul 3, 2023

@20521862
I think the paper is quite clear:
"it is one-hot encoded based on its function name and we limit the vocabulary size of external functions to 10,000 that are most frequently used in the training dataset"
Count the number of calls to every external function, then sort.

@20521862

20521862 commented Jul 4, 2023

@KennenH So does that mean I have to parse all the PE files, collect the function names that have been called 10,000 times in the training data, and save them in train_external_function_name_vocab.jsonl?

@KennenH

KennenH commented Jul 4, 2023

@20521862
"10,000 (external functions) that are most frequently used"
That is, not the functions that were called 10,000 times, but the 10,000 functions that were called most often.
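
A minimal sketch of that counting step, under stated assumptions: `build_vocab` and `write_vocab` are hypothetical names, the input is assumed to be one list of external function names per training sample, every occurrence in those lists is counted, and the output layout (one JSON string per line) is a guess rather than the confirmed format of train_external_function_name_vocab.jsonl.

```python
import io
import json
from collections import Counter


def build_vocab(calls_per_sample, top_k=10000):
    """Count external function names over all samples and keep the top_k most frequent."""
    counts = Counter()
    for names in calls_per_sample:
        counts.update(names)
    return [name for name, _ in counts.most_common(top_k)]


def write_vocab(vocab, out):
    # Assumed layout: one JSON string per line.
    for name in vocab:
        out.write(json.dumps(name) + "\n")


# Toy input: external function names seen in three hypothetical training samples.
training = [
    ["CreateFileA", "ReadFile"],
    ["CreateFileA"],
    ["ExitProcess", "CreateFileA", "ReadFile"],
]
top2 = build_vocab(training, top_k=2)
buf = io.StringIO()
write_vocab(top2, buf)
```

Here `top2` keeps the two most frequent names; with the real training set, `top_k` would stay at 10,000.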

@KennenH

KennenH commented Jul 19, 2023

@ryderling Thank you very much for your reply. I have another question, the one raised earlier by @lizhangtan in this issue (the sixth post of this issue).

I also used Genius (https://github.com/qian-feng/Gencoding) to process the assembly file of a PE sample and obtained its output .ida file. How can I obtain the CFG in JSONL file format from this .ida file? Could you please provide more details? Any help would be greatly appreciated!

@KennenH

KennenH commented Jul 26, 2023

@lizhangtan I've figured it out: the .ida file is actually data saved with pickle. Reload it with pickle and you get a readable object.
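
A rough sketch of that reload step, assuming Python 3 and that the Genius classes are not importable. The placeholder classes, `IdaUnpickler`, `load_ida`, and `write_jsonl` are hypothetical names, and the output record (only the `entry` field visible in the dump) is an illustration, not the JSONL schema used by this repository. If the Genius `raw_graphs` module is on the path, a plain `pickle.load` under the interpreter that produced the file may be enough. Only unpickle files you generated yourself, since pickle can execute code on load.

```python
import io
import json
import pickle


class raw_graphs(object):
    """Placeholder for the Genius container class of the same name."""


class raw_graph(object):
    """Placeholder for the Genius per-function class of the same name."""


PLACEHOLDERS = {"raw_graphs": raw_graphs, "raw_graph": raw_graph}


class IdaUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Map the two class names seen in the dump to the placeholders,
        # whichever module they were pickled from; defer everything else.
        if name in PLACEHOLDERS:
            return PLACEHOLDERS[name]
        return super().find_class(module, name)


def load_ida(fileobj):
    # encoding="latin1" lets Python 3 decode strings pickled by Python 2.
    return IdaUnpickler(fileobj, encoding="latin1").load()


def write_jsonl(container, out):
    # One JSON object per function; only the "entry" field shown in the dump is emitted.
    for g in container.raw_graph_list:
        out.write(json.dumps({"entry": g.entry}) + "\n")


# Hand-written protocol-0 pickle with the same shape as the dump, cut down to one field.
SAMPLE = (
    b"(i__main__\nraw_graphs\np1\n(dp2\nS'raw_graph_list'\np3\n(lp4\n"
    b"(iraw_graphs\nraw_graph\np5\n(dp6\nS'entry'\np7\nI0\nsbasb."
)
container = load_ida(io.BytesIO(SAMPLE))
out = io.StringIO()
write_jsonl(container, out)
```

For a real file, `load_ida(open(path, "rb"))` would replace the in-memory sample, and the remaining attributes of each object (elided in the dump above) would need to be inspected before deciding what to write out.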


6 participants