Developer Guide (#109)
* Migrate Developer Guide from Joern README here
* Added @bbrehm's PR content from joernio/joern#4240
* Added an entry for standalone-ext
DavidBakerEffendi authored Mar 15, 2024
1 parent 863c302 commit 58f15b9
Showing 6 changed files with 307 additions and 1 deletion.
6 changes: 6 additions & 0 deletions docs.joern.io/content/developer-guide/_index.md
@@ -0,0 +1,6 @@
---
id: developer-guide
title: Developer Guide
bookCollapseSection: true
weight: 160
---
33 changes: 33 additions & 0 deletions docs.joern.io/content/developer-guide/contribution-guidelines.md
@@ -0,0 +1,33 @@
---
id: contributing
title: Contributing to Joern
weight: 10
---

Thank you for taking the time to contribute to Joern! Here are a few guidelines to help your pull
request get merged as soon as possible.

### Creating a Pull Request

Use the pull request templates wherever possible, although they may not suit every need. The
minimum we would like to see is:
- A title that briefly describes the change and purpose of the PR, preferably with the affected
module in square brackets, e.g. `[javasrc] Addition Operator Fix`.
- A short description of the changes in the body of the PR. This could be in bullet points or
paragraphs.
- A link or reference to the related issue, if any exists.

### Dos and Don'ts

Do not:
- Immediately CC, @-mention, or email other contributors. The team will review the PR and assign
the most appropriate contributor to review it. Joern is maintained by industry partners and
researchers alike, for the most part with their own goals and priorities, and additional help is
largely volunteer work. If your PR is going stale, reach out to us in follow-up comments with
@-mentions, asking about its priority or when it may be addressed (if ever, depending on
quality).
- Leave the description body empty. This makes it difficult to review the purpose of the PR.

Do remember to:
- Format your code, i.e. run `sbt scalafmt Test/scalafmt`
- Add a unit test to verify your change.
51 changes: 51 additions & 0 deletions docs.joern.io/content/developer-guide/custom-tool.md
@@ -0,0 +1,51 @@
---
id: custom-tool
title: Creating a Custom Static Analysis with Joern
weight: 40
---

So you want to develop tools with Joern? Let's get started!

## Simple Standalone Application Template

[standalone-ext](https://github.com/joernio/standalone-ext) is the core template for developing
Joern-based tooling. Here are some tasks you can perform using this template to suit your needs:

* Update to latest Joern versions? Run `./updateDependencies.sh`
* Extend the CPG schema for custom nodes/edges/properties? See [`CpgExtSchema`](https://github.com/joernio/standalone-ext/blob/master/schema/src/main/scala/CpgExtSchema.scala)
* Want to create a CLI tool? See [`Main`](https://github.com/joernio/standalone-ext/blob/master/src/main/scala/org/codeminers/standalone/Main.scala)
* Want to create a REPL? See [`ReplMain`](https://github.com/joernio/standalone-ext/blob/master/src/main/scala/org/codeminers/standalone/ReplMain.scala)
* Want to add custom query steps? See [`package`](https://github.com/joernio/standalone-ext/blob/master/src/main/scala/org/codeminers/standalone/package.scala)

Joern modules can be imported, as well as their test resources, e.g.

```scala
// build.sbt

// parsed by project/Versions.scala, updated by updateDependencies.sh
val cpgVersion = "1.6.5"
val joernVersion = "2.0.262"
val overflowdbVersion = "1.187"
// ...
val joernDeps =
  Seq("x2cpg", "javasrc2cpg", "joern-cli", "semanticcpg", "dataflowengineoss")
    .flatMap { x =>
      val dep     = "io.joern" %% x % Versions.joern
      val testDep = "io.joern" %% x % Versions.joern % Test classifier "tests"
      Seq(dep, testDep)
    }
libraryDependencies ++= Seq(/*...*/) ++ joernDeps
```

With the test resources, you have access to the same test fixtures and tooling that Joern has,
notably the ability to generate a CPG from source code blocks to test against.

## Examples

Here are some open-source tools developed from `standalone-ext`:

* [Privado Core](https://github.com/Privado-Inc/privado-core)
* [JoernTI](https://github.com/joernio/joernti-codetidal5)
* [CPG Miner](https://github.com/DavidBakerEffendi/cpg-miner)

Add your project here!
31 changes: 31 additions & 0 deletions docs.joern.io/content/developer-guide/ide-setup.md
@@ -0,0 +1,31 @@
---
id: ide-setup
title: Setting Up Your IDE
weight: 30
---

#### IntelliJ IDEA
* [Download IntelliJ Community](https://www.jetbrains.com/idea/download)
* Install and run it
* Install the [Scala Plugin](https://plugins.jetbrains.com/plugin/1347-scala) - just search and
install from within IntelliJ.
* Important: open `sbt` in your local joern repository, run `compile` and keep it open - this will
allow us to use the BSP build in the next step
* Back to IntelliJ: open project: select your local joern clone: select to open as `BSP Project`
(i.e. _not_ `sbt project`!)
* Wait for the import and indexing to complete, then you can start, e.g. `Build -> Build project`
or run a test

Pro tip: Scala 3 support in the IntelliJ Scala plugin is limited, so opting for the plugin's
Nightly builds is highly recommended.

#### VSCode
- Install VSCode and Docker
- Install the plugin `ms-vscode-remote.remote-containers`
- Open the Joern project folder in
[VSCode](https://docs.microsoft.com/en-us/azure-sphere/app-development/container-build-vscode#build-and-debug-the-project).
Visual Studio Code detects the new files and opens a message box saying: `Folder contains a Dev
Container configuration file. Reopen folder to develop in a container.`
- Select the `Reopen in Container` button to reopen the folder in the container created by the
`.devcontainer/Dockerfile` file
- Switch to `scalameta.metals` sidebar in VSCode, and select `import build` in `BUILD COMMANDS`
- After `import build` succeeds, you are ready to start writing code for Joern
185 changes: 185 additions & 0 deletions docs.joern.io/content/developer-guide/learning-scala.md
@@ -0,0 +1,185 @@
---
id: learning-scala
title: Learning Scala
weight: 20
---

Joern is written in Scala 3 and is not designed with interoperability with other languages in
mind. This means that you will need to learn Scala one way or another. A great resource is the
[Scala 3 Book](https://docs.scala-lang.org/scala3/book/introduction.html).

The rest of this page outlines Scala best practices, and how we try to follow them in the
development of Joern.

# TLDR

Pay close attention to the following sections on collection classes and their performance
characteristics:

* [Concrete Immutable Collection Classes](https://docs.scala-lang.org/overviews/collections-2.13/concrete-immutable-collection-classes.html)
* [Performance Characteristics](https://docs.scala-lang.org/overviews/collections-2.13/performance-characteristics.html)

# Best Practices

This section contains some best practices and general remarks about the performance of various
common tasks, as well as best practices for Joern development.

This section exists because we have at various points encountered these issues.

Some points are very general to programming -- but if something is observed as a common pitfall
inside Joern, then it deserves to be addressed here.

## Folds

Suppose we have `input: Iterable[List[T]]` and want to construct the concatenation.

A natural way could be
```scala
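// A typed empty list, so that the accumulator is inferred as List[T] rather than Nil.type
val res = input.foldLeft(List.empty[T]) { _ ++ _ }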
```
This is a catastrophe: It boils down to
```scala
( ((Nil ++ input(0)) ++ input(1)) ++ input(2) ) ++ ...
```
However, the runtime of `A ++ B` scales like `A.length` for linked lists, so this can end up
costing us `res.length ** 2`, i.e. quadratic runtime.

Instead, the correct way is
```scala
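// A typed empty list, so that the accumulator is inferred as List[T] rather than Nil.type
val res = input.foldRight(List.empty[T]) { _ ++ _ }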
```
This is, however, contingent on the internals of the `List` implementation! If we instead were to
collect into an `ArrayBuffer`, the correct way would be
```scala
input.foldLeft(mutable.ArrayBuffer.empty[T]) { case (acc, items) => acc.appendAll(items) }
```
This is because `List` allows O(1) prepend, and `ArrayBuffer` allows O(1) append.
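
To make the difference concrete, the following sketch counts how many list cells each fold copies. The cost model (`a ++ b` copies exactly `a.length` cells) and the names `concatCost`, `leftFoldCost` and `rightFoldCost` are assumptions made for illustration; they are not measurements of the real `List` implementation.

```scala
// Assumed cost model: `a ++ b` on an immutable List copies every cell of `a`.
def concatCost[T](a: List[T], b: List[T]): Long = a.length.toLong

// foldLeft keeps the growing accumulator on the left, so it is re-copied at every step.
def leftFoldCost[T](input: Iterable[List[T]]): (List[T], Long) =
  input.foldLeft((List.empty[T], 0L)) { case ((acc, cost), items) =>
    (acc ++ items, cost + concatCost(acc, items))
  }

// foldRight keeps the accumulator on the right, so only the new chunk is copied.
def rightFoldCost[T](input: Iterable[List[T]]): (List[T], Long) =
  input.foldRight((List.empty[T], 0L)) { case (items, (acc, cost)) =>
    (items ++ acc, cost + concatCost(items, acc))
  }

// 100 chunks with one element each
val chunks: List[List[Int]] = List.tabulate(100)(i => List(i))
val leftOutcome             = leftFoldCost(chunks)
val rightOutcome            = rightFoldCost(chunks)
```

Both folds produce the same list. Under this cost model the left fold copies 4950 cells for the 100 single-element chunks, while the right fold copies 100.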

Do not write `buf.appendedAll` -- that pulls a copy and runtime will always be quadratic!
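
The difference between the two method names is easy to miss, so here is a small sketch of it. The value names are placeholders chosen for this example.

```scala
import scala.collection.mutable

val buf = mutable.ArrayBuffer(1, 2)

// appendAll mutates `buf` in place and returns the very same instance.
val sameBuffer = buf.appendAll(List(3, 4))

// appendedAll leaves `buf` untouched and allocates a new buffer holding a full copy.
val copiedBuffer = buf.appendedAll(List(5))
```

Inside a fold, the second form copies the whole accumulator at every step.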

Now, if you write `input.fold`, then the associativity is indeterminate. This means that you must
always assume the worst imaginable execution order!

### Fundamental Theorem

The fundamental theorem on getting the right complexity class for your folds is the following:

Assume that each item has a weight, and that
```
combine(left, right).weight == left.weight + right.weight
```
and assume that the runtime of `combine(left, right)` is bounded by
`min(left.weight, right.weight)`.

Then, the total runtime of accumulation is upper bounded by
```
input.map{_.weight}.sum * log2(input.map{_.weight}.sum)
```

Proof: Consider the function `F(a)=a*log2(a)` and track the evolution of
`time_already_spent - remaining_work.map{F(_.weight)}.sum`. We will show that this quantity
decreases with every combine step. Since this quantity is initially non-positive (no time has been
spent yet, and `F(w) >= 0` assuming weights of at least 1), it must be negative once we are done,
which gives the desired inequality
```
time_spent - F(remaining_work.map{_.weight}.sum) == time_spent - F(output.weight) < 0
```
So to prove this inductively, suppose we combine two items with weights `a <= b`. We compare the
critical quantity before and after this update, to obtain
```
Delta == time_already_spent_after  - remaining_work_after.map{F(_.weight)}.sum
       - time_already_spent_before + remaining_work_before.map{F(_.weight)}.sum
      <= a - F(a+b) + F(a) + F(b)
      == a - a*log2(a+b) - b*log2(a+b) + a*log2(a) + b*log2(b)
      == a - a*log2(1 + b/a) - b*log2(1 + a/b)
      <  a - a*log2(1 + b/a)
      <= 0
```
The last inequality was because `a <= b` and hence `log2(1+b/a) >= 1`.
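
As a sanity check of the bound, the sketch below charges `min(left.weight, right.weight)` per combine for two different accumulation orders and compares the totals with `W * log2(W)`. All names (`theoremBound`, `leftToRightCost`, `balancedCost`) are made up for this illustration, and weights are assumed to be at least 1.

```scala
def log2(x: Double): Double = math.log(x) / math.log(2.0)

// Upper bound from the theorem: W * log2(W), where W is the total weight.
def theoremBound(weights: Seq[Long]): Double = {
  val total = weights.sum.toDouble
  total * log2(total)
}

// Left-to-right accumulation over a non-empty input:
// each combine is charged min(accumulator weight, next weight).
def leftToRightCost(weights: Seq[Long]): Long =
  weights.tail.foldLeft((weights.head, 0L)) { case ((acc, cost), w) =>
    (acc + w, cost + math.min(acc, w))
  }._2

// Balanced, tree-shaped accumulation, as a parallel fold might perform it.
def balancedCost(weights: Seq[Long]): Long =
  if (weights.size <= 1) 0L
  else {
    val (left, right) = weights.splitAt(weights.size / 2)
    balancedCost(left) + balancedCost(right) + math.min(left.sum, right.sum)
  }

val unitWeights: Seq[Long] = Seq.fill(8)(1L)
```

For eight items of weight 1, the left-to-right order costs 7 and the balanced order costs 12; both stay below the bound of `8 * log2(8)`, which is 24.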

### More examples

A good example of what not to do is Java's `Collectors.toList()`. This is intended for use on
parallel streams; it uses an internal `ArrayList`. However, `ArrayList` only permits fast append, no
fast prepend. Yeah, lol, quadratic runtime (because associativity depends on races).

How it should be done can be seen in `ForkJoinParallelCpgPass`. Morally speaking, the relevant
function is
```scala
def combine[T](left: mutable.ArrayDeque[T], right: mutable.ArrayDeque[T]): mutable.ArrayDeque[T] = {
  // Always move the smaller deque into the larger one, so the cost is min(left.size, right.size)
  if (left.size < right.size) right.prependAll(left)
  else left.appendAll(right)
}
```
This is the fundamental point: If you want `fold` to be fast, then you must handle both cases.
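
A small sketch of why handling both cases makes the result independent of associativity: the same parts are reduced once from the left and once from the right. `combineDeques` mirrors the function above under a different name, and `freshParts` is a helper invented for this example.

```scala
import scala.collection.mutable

// Same shape as `combine` above: always move the smaller deque into the larger one.
def combineDeques[T](left: mutable.ArrayDeque[T], right: mutable.ArrayDeque[T]): mutable.ArrayDeque[T] =
  if (left.size < right.size) right.prependAll(left)
  else left.appendAll(right)

// Fresh inputs for every run, because combineDeques mutates its arguments.
def freshParts(): List[mutable.ArrayDeque[Int]] =
  List(mutable.ArrayDeque(1), mutable.ArrayDeque(2, 3), mutable.ArrayDeque(4, 5, 6))

val leftAssociated  = freshParts().reduceLeft((l, r) => combineDeques(l, r))
val rightAssociated = freshParts().reduceRight((l, r) => combineDeques(l, r))
```

Both orders yield the elements 1 to 6 in their original order, and each step only moves the smaller operand.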

Another example that works is `++` for Scala `Vector`. It is a fun exercise to step over the code
and see that both `prependedAll` and `appendedAll` are fast!

With respect to performance, it is recommended to use mutable data structures by default, and only
use all the immutable stuff if necessary: That is, if you would otherwise require locks or if your
algorithm requires snapshots.

## Java Stream Collector

It is worthwhile to take another look at the Java stream collector. It is used in `CpgPass` like this

```scala
// needs: import java.util.function.{BiConsumer, Supplier}
// parts: Array[T]
externalBuilder.absorb(
  java.util.Arrays
    .stream(parts)
    .parallel()
    .collect(
      // accumulator factory
      new Supplier[DiffGraphBuilder] {
        override def get(): DiffGraphBuilder =
          new DiffGraphBuilder
      },
      // absorb one part into an accumulator
      new BiConsumer[DiffGraphBuilder, AnyRef] {
        override def accept(builder: DiffGraphBuilder, part: AnyRef): Unit =
          runOnPart(builder, part.asInstanceOf[T])
      },
      // merge two accumulators
      new BiConsumer[DiffGraphBuilder, DiffGraphBuilder] {
        override def accept(leftBuilder: DiffGraphBuilder, rightBuilder: DiffGraphBuilder): Unit =
          leftBuilder.absorb(rightBuilder)
      }
    )
)
```
The stream collect API is, at its core, a glorified parallel fold. Noteworthy things are:
1. We don't supply a single accumulator, we supply an accumulator factory. This is important for
parallelism, otherwise we'd degrade to a `foldLeft`!
2. Ideally we only get one accumulator per CPU core. Each accumulator uses its `runOnPart` to absorb
the output of the next `part`. We especially don't allocate an accumulator (i.e. a new
`DiffGraphBuilder`) for every `part` -- that would limit us to parallelisms where each `part` is large
(e.g. each `method`) and would be bad for the case of many cheap `parts`.
3. Note how we collect everything into an array before building the stream. We do _not_ use some
kind of generic `Spliterator` like `java.util.Spliterators.spliteratorUnknownSize`. Read the code of
that function and consider whether that would be a good idea in the context of `CpgPass`.
4. The code for merging, i.e. `leftBuilder.absorb(rightBuilder)` has already been discussed above
(especially the fact that it is `ArrayDeque`-based).
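
The same three-argument `collect` call can be tried in isolation. In the sketch below, an `ArrayBuffer[Int]` is a hypothetical stand-in for `DiffGraphBuilder`, and recording the length of each string stands in for `runOnPart`; none of these names come from Joern.

```scala
import java.util.function.{BiConsumer, Supplier}
import scala.collection.mutable

// Hypothetical stand-in for DiffGraphBuilder: an accumulator of word lengths.
type Acc = mutable.ArrayBuffer[Int]

// An accumulator *factory*, so that each worker can get its own fresh accumulator.
val newAcc: Supplier[Acc] = () => mutable.ArrayBuffer.empty[Int]

// Absorb one part into an existing accumulator (the role of `runOnPart`).
val absorbPart: BiConsumer[Acc, String] = (acc, part) => { acc.append(part.length); () }

// Merge two accumulators (the role of `leftBuilder.absorb(rightBuilder)`).
val mergeAccs: BiConsumer[Acc, Acc] = (left, right) => { left.appendAll(right); () }

// The parts are materialized in an array before the stream is built.
val parts: Array[String] = Array("a", "bb", "ccc", "dddd")

val lengths: Acc =
  java.util.Arrays.stream(parts).parallel().collect(newAcc, absorbPart, mergeAccs)
```

Because the array-backed stream is ordered and the merge step appends the right accumulator to the left one, the collected lengths keep the order of `parts` even though the stream is parallel.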

## API considerations: Beware of Iterator!

You should not offer a generic API that works on (lazy) Iterator. If you decide to do that, then
immediately and single-threadedly collect the Iterator, before passing it on to complex higher order
functions like fold or stream-collect.

The reason is that iterators are lazy. You don't know what side-effects and computations are hidden
in them! And if execution of the iterator is triggered by complex higher order functions, then this
may very well contain data races!

Or look at some code that we found in Joern:
```scala
iter.map{item => executer.submit{() => expensive_function(item)}}.map{_.get()}.toList
```
where `iter` was an `Iterator`. This code wanted to run `expensive_function` in parallel on all
items in the iterator, and collect the results into a `List`.

In fact this code was single-threaded, because iterators and maps of iterators are lazy. Nothing
is called until `toList` tries to collect the first output; it then schedules the first
computation, awaits its result, and only then schedules the second computation.
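
The scheduling order can be observed without a real thread pool. In this sketch, `submit` is a hypothetical stand-in for `executor.submit` that only logs when a task is scheduled, and calling the returned handle plays the role of `Future.get()`.

```scala
import scala.collection.mutable

// Returns the order in which tasks are submitted and awaited.
def traceSchedule(collectFirst: Boolean): List[String] = {
  val events = mutable.ArrayBuffer.empty[String]

  // Stand-in for executor.submit: logs the scheduling, returns a handle to await.
  def submit(item: Int): () => Int = {
    events += s"submit $item"
    () => { events += s"get $item"; item * 10 }
  }

  val handles = Iterator(1, 2, 3).map(submit)
  val results =
    if (collectFirst) handles.toList.map(handle => handle()) // submit everything, then await
    else handles.map(handle => handle()).toList              // lazy: submit and await one by one
  events.toList
}
```

With `collectFirst = false` the log alternates between `submit` and `get` for each item, which is the accidental single-threaded behaviour described above. With `collectFirst = true` all three submissions happen before the first `get`.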

TLDR: Iterators are scary. Collect them eagerly.

Special thanks to [@bbrehm](https://github.com/bbrehm) for the write-up!
2 changes: 1 addition & 1 deletion docs.joern.io/content/upgrade-guides.md
@@ -1,6 +1,6 @@
---
id: upgrade-guides
-title: Upgrade guides
+title: Upgrade Guide
weight: 150
---
