Module Todo: Document Metadata Extraction #717

TheTechromancer · 2023-03-20T15:03:10Z

TheTechromancer
Mar 20, 2023
Maintainer

It would be useful to have a collection of modules that download documents (.pdf, .docx, etc.) and extract useful metadata such as usernames and internal domain names. Thanks to @pjhartlieb and @Sw3d1shPh1sh for requesting.

Also, per @nicpenning:

"FOCA is a tool used mainly to find metadata and hidden information in the documents it scans. These documents may be on web pages, and can be downloaded and analysed with FOCA."

Link to the project: https://github.com/ElevenPaths/FOCA/tree/master

It's hard to say what the best implementation approach is, so it's just an idea :) Some of these features may already be covered by other modules.

Would require:

A module that listens for UNVERIFIED_URLs matching PDF, DOCX, etc., downloads them, and re-raises them as binary blobs
A module that consumes the binary blobs and extracts metadata
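The first of those two steps, deciding which URLs are worth downloading, could be sketched roughly as below. This is only an illustration using the standard library: the function name and the extension list are assumptions (the thread names only .pdf and .docx explicitly), and it is not code from any existing BBOT module.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical extension list; only pdf and docx are named in this thread.
DOCUMENT_EXTENSIONS = {"pdf", "docx", "doc", "xlsx", "pptx"}


def is_document_url(url: str) -> bool:
    """Return True if the URL's path ends in one of the document extensions."""
    # Only look at the path so that query strings and fragments are ignored.
    path = urlparse(url).path
    suffix = PurePosixPath(path).suffix.lower().lstrip(".")
    return suffix in DOCUMENT_EXTENSIONS
```

Matching on the parsed path rather than the raw string keeps something like `Report.PDF?x=1` from being missed and avoids false positives on a path segment that merely contains "pdf".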

EDIT: Possible sources of metadata-extraction logic:

https://github.com/datacoon/metawarc (supports MS Office old + new, PDF, Images)
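For the newer MS Office formats specifically, the usernames live in `docProps/core.xml` inside the file's zip container, so a minimal extractor needs nothing beyond the standard library. The sketch below is an assumption about what such a step might look like, not how metawarc or FOCA do it; the function name and the returned keys are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces used by docProps/core.xml in Office Open XML files.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def docx_core_metadata(blob: bytes) -> dict:
    """Pull the creator and last-modified-by names out of a .docx/.xlsx/.pptx blob."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        try:
            xml_bytes = zf.read("docProps/core.xml")
        except KeyError:
            # The archive has no core properties part.
            return {}
    root = ET.fromstring(xml_bytes)
    fields = {"creator": "dc:creator", "last_modified_by": "cp:lastModifiedBy"}
    result = {}
    for key, tag in fields.items():
        node = root.find(tag, NS)
        if node is not None and node.text:
            result[key] = node.text
    return result
```

Old-style .doc files and PDFs store their metadata differently and would need separate handling.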

TheTechromancer · 2023-08-24T15:40:33Z

TheTechromancer
Aug 24, 2023
Maintainer Author

Note: it would also be nice to emit the text from these documents as a generic event consumable by excavate, secretsdb, etc -- e.g. RAW_TEXT or something similar to Spiderfoot's RAW_RIR_DATA event type. This would enable them to feed the recursive discovery process with more subdomains, emails, URLs, etc.
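To make the idea concrete, here is a rough sketch of what a consumer of such a raw-text event might do with the text. The regexes are deliberately simple and are assumptions for illustration only; they are not the patterns excavate or secretsdb actually use, and `extract_indicators` is a hypothetical name.

```python
import re

# Simplified patterns for illustration; real-world matching needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")


def extract_indicators(text: str) -> dict:
    """Return the unique emails and URLs found in a block of raw text."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "urls": sorted(set(URL_RE.findall(text))),
    }
```

Each extracted value could then be fed back into the scan as a new event, which is what makes the discovery recursive.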

0 replies

TheTechromancer · 2023-09-20T21:00:21Z

TheTechromancer
Sep 20, 2023
Maintainer Author

As a precursor to this, I have written a proof-of-concept filedownload module that watches for interesting filetypes and downloads them into the scan's output folder. Development is happening on the filedownload-module branch.

@nicpenning here is the module:
https://github.com/blacklanternsecurity/bbot/blob/filedownload-module/bbot/modules/filedownload.py

You can use it like this:

bbot -t evilcorp.com -f subdomain-enum -m filedownload

Pairing it with the web spider can also be very effective:

bbot -t evilcorp.com -f subdomain-enum -m filedownload -c web_spider_depth=2 web_spider_distance=2

0 replies

domwhewell-sage · 2024-05-16T07:52:14Z

domwhewell-sage
May 16, 2024

This is probably relevant to this discussion: #907 (comment).

Now that there are FILESYSTEM events, the downloads can probably be raised as those.

As mentioned in the linked discussion, that is an ML model that detects human passwords in several file formats.

Perhaps more interesting, though, is that it uses Apache Tika to extract the strings. Tika

"extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF)"

which we could then raise as RAW_DATA events for ingestion by other modules.

9 replies

TheTechromancer Jun 1, 2024
Maintainer Author

Issue created: #1421. I likely won't have time to work on it for a bit, since I'm busy with BBOT 2.0. Anyone is welcome to work on it in the meantime.

nicpenning Jun 1, 2024

Where does the FILESYSTEM event type exist? I am trying to get a better understanding of how these event producers/consumers work via https://www.blacklanternsecurity.com/bbot/dev/event/. I looked at some modules that use it, but they don't clearly show me what FILESYSTEM is doing or how.

I see most of the other event types here: https://github.com/blacklanternsecurity/bbot/blob/stable/bbot/core/event/base.py

But FILESYSTEM doesn't exist and WEBSCREENSHOT is very limited:

bbot/bbot/core/event/base.py, lines 1180 to 1182 in eeae1cb

    class WEBSCREENSHOT(DictHostEvent):
        _always_emit = True
        _quick_emit = True

Could you please point me in the right direction where Event code lives?

TheTechromancer Jun 1, 2024
Maintainer Author

The reason there isn't any unique code for those events is because we haven't needed it. In BBOT, an event is made up of a type (e.g. FILESYSTEM) and data (e.g. {"path": "~/.bbot/scans/suspicious_dobby/git_clone/bbot"}). You can specify whatever type and whatever data you want. There aren't any schemas or rules except for a few certain predefined event types.
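As a rough mental model of the "type plus data" idea, here is a toy sketch. The `SketchEvent` and `SketchConsumer` names are hypothetical and are not BBOT's real classes; the point is only that a consumer picks events by their type string and otherwise treats the data as free-form.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchEvent:
    """Toy stand-in for an event: an arbitrary type string plus arbitrary data."""
    type: str
    data: Any


@dataclass
class SketchConsumer:
    """Toy consumer that only reacts to the event types it declares interest in."""
    watched_types: frozenset = frozenset({"FILESYSTEM"})
    seen_paths: list = field(default_factory=list)

    def handle(self, event: SketchEvent) -> None:
        # Events of other types are simply ignored; no schema is enforced.
        if event.type in self.watched_types:
            self.seen_paths.append(event.data["path"])
```

Nothing here validates the shape of `data`, which mirrors the point above: structure only gets enforced when someone adds a dedicated class for that type.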

If you want to add custom validation / functionality to an event type, you can create the class and override whichever methods you need, just like we've done with some of the other ones. The code would live in core/event/base.py, so you're already in the right place.

nicpenning Jun 1, 2024

That makes more sense then! Got it. Is that likely what will be needed when we try to implement Base64 blobs, or is that a different layer? I was going to take a crack at what that would look like, but I am still learning BBOT dev.

TheTechromancer Jun 1, 2024
Maintainer Author

Yep, I think it would end up looking something like this:

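import base64
import io
import tarfile
from pathlib import Path


# DictEvent is already defined in core/event/base.py, where this class would live
class FILESYSTEM(DictEvent):
    def sanitize_data(self, data):
        new_data = dict(data)
        if self.scan is not None:
            include_base64 = self.scan.config.get("include_base64_blob", False)
            if include_base64:
                data_path = Path(data["path"])
                blob = None
                if data_path.is_file():
                    # read file into blob
                    blob = data_path.read_bytes()
                elif data_path.is_dir():
                    # tar up directory into blob (gzip'd tar is an assumption)
                    buf = io.BytesIO()
                    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
                        tar.add(data_path, arcname=data_path.name)
                    blob = buf.getvalue()
                if blob is not None:
                    # only attach a blob when the path actually exists
                    new_data["blob"] = base64.b64encode(blob).decode()
        return new_data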
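On the receiving side, something would need to turn that blob back into usable content. The sketch below assumes the blob is base64-encoded and that directories were packed as a tar archive, which is one reading of the comments above rather than a settled format; `blob_member_names` is a hypothetical helper name.

```python
import base64
import io
import tarfile


def blob_member_names(blob_b64: str) -> list:
    """Decode a base64 blob and list its members if it is a tar archive.

    Returns an empty list when the decoded bytes are not a tar archive,
    e.g. when the blob is the raw contents of a single file.
    """
    raw = base64.b64decode(blob_b64)
    try:
        # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz).
        with tarfile.open(fileobj=io.BytesIO(raw), mode="r:*") as tar:
            return tar.getnames()
    except tarfile.ReadError:
        return []
```

A real consumer would probably want an explicit field saying whether the blob is a single file or an archive instead of sniffing the bytes like this.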

nicpenning · 2024-07-04T13:30:39Z

nicpenning
Jul 4, 2024

Some work being done for this:
#1438
#1440
#1433

#1434
#1421

3 replies

nicpenning Jul 29, 2024

All of these items are now closed.

@TheTechromancer Thoughts on next steps for this Discuss item?

In the interim, we can do some testing and report back on how the current state operates. We really appreciate all of the work being done here!

TheTechromancer Jul 29, 2024
Maintainer Author

Awesome, yeah next step I think is to add RAW_TEXT support to excavate, so we can start extracting URLs etc from these things.

domwhewell-sage Aug 5, 2024

I have opened a new issue, #1634, which I will look at tackling this week.

TheTechromancer · 2024-10-20T00:00:42Z

TheTechromancer
Oct 20, 2024
Maintainer Author

Circling back around to this one, as recently we've run into problems with unstructured.

Overall, it's great that unstructured runs without a server component and without a Java dependency. However, we should be on the lookout for a better alternative, preferably one written in Rust or Go. It seems they are just now starting to emerge.

@domwhewell-sage this is one to keep an eye on:

https://github.com/yobix-ai/extractous

In contrast, Extractous maintains a dedicated focus on text and metadata extraction. It achieves significantly faster processing speeds and lower memory utilization through native code execution.

Built with Rust: The core is developed in Rust, leveraging its high performance, memory safety, multi-threading capabilities, and zero-cost abstractions.
Extended format support with Apache Tika: For file formats not natively supported by the Rust core, we compile the well-known Apache Tika into native shared libraries using GraalVM ahead-of-time compilation technology. These shared libraries are then linked to and called from our Rust core. No local servers, no virtual machines, or any garbage collection, just pure native execution.
Bindings for many languages: we plan to introduce bindings for many languages. At the moment we offer only a Python binding, which is essentially a wrapper around the Rust core with the potential to circumvent the Python GIL limitation and make efficient use of multiple cores.

0 replies

Replies: 5 comments · 12 replies
