Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests (and refactor) API extraction and symbol generation #1899

Open
mr-tz opened this issue Dec 9, 2023 · 22 comments · May be fixed by #2597
Open

Add tests (and refactor) API extraction and symbol generation #1899

mr-tz opened this issue Dec 9, 2023 · 22 comments · May be fixed by #2597
Assignees
Labels
good first issue Good for newcomers

Comments

@mr-tz
Copy link
Collaborator

mr-tz commented Dec 9, 2023

the routine you're updating is starting to look a little complicated, so maybe soon it's time to attempt to refactor, but without tests, i wouldn't be comfortable. even one trivial test would encode the special case/scenario that you encountered.

Originally posted by @williballenthin in #1897 (comment)

We should have more tests to encode all the various API patterns we support.

Refactoring of the existing routines and maybe introducing separate functions for .NET vs. native vs. others may make sense as part of that.

@v1bh475u
Copy link
Contributor

v1bh475u commented Feb 3, 2025

Hello, @mr-tz . I would like to work on this issue. Can you please assign it to me?

@v1bh475u
Copy link
Contributor

v1bh475u commented Feb 4, 2025

@williballenthin can you please give more insights about what the issue is about? I know that we need to do something about the api feature extracting but want more details.

@williballenthin
Copy link
Collaborator

Please read the PR #1897.

See that it updates the routine here:

Image

Take a look at that routine and see if you agree with the prompt:

the routine you're updating is starting to look a little complicated, so maybe soon it's time to attempt to refactor, but without tests, i wouldn't be comfortable. even one trivial test would encode the special case/scenario that you encountered.

If so, then you can take on this issue. If it doesn't make sense, we can address at another time.

@v1bh475u
Copy link
Contributor

v1bh475u commented Feb 4, 2025

I feel the best we can do is add some comments to give it a little verbosity, but I cannot think of currently how it can be further simplified.

@williballenthin
Copy link
Collaborator

i think the most important immediate goal is to add some unit tests to demonstrate he current behavior. then we can attempt any refactor knowing the behavior is the same.

so, let's start with tests.

if that doesn't interest you, no worries!

@v1bh475u
Copy link
Contributor

v1bh475u commented Feb 6, 2025

@williballenthin Should I make an explicit test for checking the function or modify the existing tests that can use it internally while testing rules?

@williballenthin
Copy link
Collaborator

let's create new dedicated unit test functions (one is probably sufficient) to exercise this functionality

@v1bh475u
Copy link
Contributor

v1bh475u commented Feb 6, 2025

def test_trim_dll_part():
    from capa.rules import trim_dll_part

    assert trim_dll_part("kernel32.CreateFileA") == "CreateFileA"
    assert trim_dll_part("System.Convert::FromBase64String") == "System.Convert::FromBase64String"
    assert trim_dll_part("ws2_32.#1") == "ws2_32.#1"

@williballenthin I am thinking of implementing something along this line. Is this suffice or do we need more rigorous tests? If more rigorous, then please give some idea about that.

@williballenthin
Copy link
Collaborator

yup that's the right idea. ideally we have a case that covers each possible branch, and then maybe some weird made up cases to show what's expected (like both :: and .dll and more).

@v1bh475u
Copy link
Contributor

v1bh475u commented Feb 6, 2025

You mean something like kernel32.ws2_32.#1?

@v1bh475u
Copy link
Contributor

v1bh475u commented Feb 6, 2025

So for the above case, we need to ensure that it gives ws2_32.#1 or kernel32.ws2_32.#1? Current implementation gives 2nd result.

@v1bh475u
Copy link
Contributor

v1bh475u commented Feb 7, 2025

@williballenthin, I can see it can a little bit confusing as here we are overlapping cases between .NET functions and Windows PE functions. So, I think it would be better to change this so that names for all different types are handled separately.

capa/capa/rules/__init__.py

Lines 584 to 587 in 736ad1c

if api.count(".") == 1:
if "::" not in api:
# skip System.Convert::FromBase64String
api = api.split(".")[1]

Here's my idea for it.

    # ordinal imports, like ws2_32.#1, or .NET namespace, like System.Diagnostics.Debugger::IsLogging, keep dll/namespace part
    if ".#" in api or "::" in api:
        return api

    # kernel32.CreateFileA
    if api.count(".") == 1:
        api = api.split(".")[1]
    return api

Thoughts?

@williballenthin
Copy link
Collaborator

looking good! only suggestion might be to break the first 'if' into two so they can be documented separately.

@v1bh475u
Copy link
Contributor

v1bh475u commented Feb 7, 2025

Fair point

@v1bh475u v1bh475u linked a pull request Feb 7, 2025 that will close this issue
3 tasks
@williballenthin
Copy link
Collaborator

if you're up for it, would you please extend this review to cover the following functions:

def is_aw_function(symbol: str) -> bool:
"""
is the given function name an A/W function?
these are variants of functions that, on Windows, accept either a narrow or wide string.
"""
if len(symbol) < 2:
return False
# last character should be 'A' or 'W'
if symbol[-1] not in ("A", "W"):
return False
return True
def is_ordinal(symbol: str) -> bool:
"""
is the given symbol an ordinal that is prefixed by "#"?
"""
if symbol:
return symbol[0] == "#"
return False
def generate_symbols(dll: str, symbol: str, include_dll=False) -> Iterator[str]:
"""
for a given dll and symbol name, generate variants.
we over-generate features to make matching easier.
these include:
- CreateFileA
- CreateFile
- ws2_32.#1
note that since capa v7 only `import` features and APIs called via ordinal include DLL names:
- kernel32.CreateFileA
- kernel32.CreateFile
- ws2_32.#1
for `api` features dll names are good for documentation but not used during matching
"""
# normalize dll name
dll = dll.lower()
# trim extensions observed in dynamic traces
dll = dll[0:-4] if dll.endswith(".dll") else dll
dll = dll[0:-4] if dll.endswith(".drv") else dll
dll = dll[0:-3] if dll.endswith(".so") else dll
if include_dll or is_ordinal(symbol):
# ws2_32.#1
# kernel32.CreateFileA
yield f"{dll}.{symbol}"
if not is_ordinal(symbol):
# CreateFileA
yield symbol
if is_aw_function(symbol):
if include_dll:
# kernel32.CreateFile
yield f"{dll}.{symbol[:-1]}"
# CreateFile
yield symbol[:-1]
def reformat_forwarded_export_name(forwarded_name: str) -> str:
"""
a forwarded export has a DLL name/path and symbol name.
we want the former to be lowercase, and the latter to be verbatim.
"""
# use rpartition so we can split on separator between dll and name.
# the dll name can be a full path, like in the case of
# ef64d6d7c34250af8e21a10feb931c9b
# which i assume means the path can have embedded periods.
# so we don't want the first period, we want the last.
forwarded_dll, _, forwarded_symbol = forwarded_name.rpartition(".")
forwarded_dll = forwarded_dll.lower()
return f"{forwarded_dll}.{forwarded_symbol}"

this is what we originally intended, but i totally understand it's not obvious from the issue title.

@v1bh475u
Copy link
Contributor

v1bh475u commented Feb 7, 2025

Would be glad to do so. 😊

@v1bh475u
Copy link
Contributor

v1bh475u commented Feb 8, 2025

def generate_symbols(dll: str, symbol: str, include_dll=False) -> Iterator[str]:
    """
    for a given dll and symbol name, generate variants.
    we over-generate features to make matching easier.
    these include:
      - CreateFileA
      - CreateFile
      - ws2_32.#1

    note that since capa v7 only `import` features and APIs called via ordinal include DLL names:
      - kernel32.CreateFileA
      - kernel32.CreateFile
      - ws2_32.#1

    for `api` features dll names are good for documentation but not used during matching
    """
    # normalize dll name
    dll = dll.lower()

    # trim extensions observed in dynamic traces
    dll = dll[0:-4] if dll.endswith(".dll") else dll
    dll = dll[0:-4] if dll.endswith(".drv") else dll
    dll = dll[0:-3] if dll.endswith(".so") else dll

    if is_ordinal(symbol):
        # ws2_32.#1
        # kernel32.CreateFileA
        yield f"{dll}.{symbol}"
        return

    # For non-ordinal symbols
    if include_dll:
        yield f"{dll}.{symbol}"

    # CreateFileA
    yield symbol

    if is_aw_function(symbol):
        if include_dll:
            # kernel32.CreateFile
            yield f"{dll}.{symbol[:-1]}"
        # CreateFile
        yield symbol[:-1]

@williballenthin, this is my idea of separating paths for ordinals and other symbols. Thoughts?

@v1bh475u
Copy link
Contributor

v1bh475u commented Feb 8, 2025

Also, I have added some comments in PR. Please review that too.

@v1bh475u
Copy link
Contributor

@williballenthin, can you please share your thoughts for my suggestions?

@v1bh475u
Copy link
Contributor

@williballenthin , are you available today for this?

@williballenthin
Copy link
Collaborator

no, my child is sick, so i'm not at my computer today. i'll review when i am able.

@v1bh475u
Copy link
Contributor

Oh. Sorry to hear that. May he/she get feel soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants