* Migrate Developer Guide from Joern README here
* Added @bbrehm's PR content from joernio/joern#4240
* Added an entry for standalone-ext
1 parent 863c302, commit 58f15b9
Showing 6 changed files with 307 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
--- | ||
id: developer-guide | ||
title: Developer Guide | ||
bookCollapseSection: true | ||
weight: 160 | ||
--- |
33 changes: 33 additions & 0 deletions
docs.joern.io/content/developer-guide/contribution-guidelines.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
--- | ||
id: contributing | ||
title: Contributing to Joern | ||
weight: 10 | ||
--- | ||
|
||
Thank you for taking time to contribute to Joern! Here are a few guidelines to ensure your pull | ||
request will get merged as soon as possible. | ||
|
||
### Creating a Pull Request | ||
|
||
Use the templates as far as possible, although they may not suit all needs. The | |
minimum we would like to see is: | ||
- A title that briefly describes the change and purpose of the PR, preferably with the affected | ||
module in square brackets, e.g. `[javasrc] Addition Operator Fix`. | ||
- A short description of the changes in the body of the PR. This could be in bullet points or | ||
paragraphs. | ||
- A link or reference to the related issue, if any exists. | ||
|
||
### Dos and Don'ts | ||
|
||
Do not: | ||
- Immediately CC/@/email spam other contributors. The team will review the PR and assign the most | |
appropriate contributor to review it. Joern is maintained by industry partners and | |
researchers alike, for the most part with their own goals and priorities, and additional help is | |
largely volunteer work. If your PR is going stale, then reach out to us in follow-up comments | |
with @'s asking for an explanation of priority or planning of when it may be addressed (if ever, | |
depending on quality). | |
- Leave the description body empty, as this makes reviewing the purpose of the PR difficult. | |
|
||
Do remember to: | ||
- Format your code, i.e. run `sbt scalafmt Test/scalafmt` | |
- Add a unit test to verify your change. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
--- | ||
id: custom-tool | ||
title: Creating a Custom Static Analysis with Joern | ||
weight: 40 | ||
--- | ||
|
||
So you want to develop tools with Joern? Let's get started! | ||
|
||
## Simple Standalone Application Template | ||
|
||
[standalone-ext](https://github.com/joernio/standalone-ext) is the core template for developing | ||
Joern-based tooling. Here are some tasks you can perform using this template to suit your needs: | |
|
||
* Update to latest Joern versions? Run `./updateDependencies.sh` | ||
* Extend the CPG schema for custom nodes/edges/properties? See [`CpgExtSchema`](https://github.com/joernio/standalone-ext/blob/master/schema/src/main/scala/CpgExtSchema.scala) | ||
* Want to create a CLI tool? See [`Main`](https://github.com/joernio/standalone-ext/blob/master/src/main/scala/org/codeminers/standalone/Main.scala) | ||
* Want to create a REPL? See [`ReplMain`](https://github.com/joernio/standalone-ext/blob/master/src/main/scala/org/codeminers/standalone/ReplMain.scala) | ||
* Want to add custom query steps? See [`package`](https://github.com/joernio/standalone-ext/blob/master/src/main/scala/org/codeminers/standalone/package.scala) | ||
|
||
Joern modules can be imported, as well as their test resources, e.g. | ||
|
||
```scala | ||
// build.sbt | ||
|
||
// parsed by project/Versions.scala, updated by updateDependencies.sh | ||
val cpgVersion = "1.6.5" | ||
val joernVersion = "2.0.262" | ||
val overflowdbVersion = "1.187" | ||
// ... | ||
val joernDeps = | ||
Seq("x2cpg", "javasrc2cpg", "joern-cli", "semanticcpg", "dataflowengineoss") | ||
.flatMap { x => | ||
val dep = "io.joern" %% x % Versions.joern | ||
val testDep = "io.joern" %% x % Versions.joern % Test classifier "tests" | ||
Seq(dep, testDep) | ||
} | ||
libraryDependencies ++= Seq(/*...*/) ++ joernDeps | ||
``` | ||
|
||
With the test resources, you have access to the same test fixtures and tooling that Joern has, | |
notably, the ability to generate a CPG to test against from source code blocks. | |
|
||
## Examples | ||
|
||
Here are some open-source tools developed from `standalone-ext`: | ||
|
||
* [Privado Core](https://github.com/Privado-Inc/privado-core) | ||
* [JoernTI](https://github.com/joernio/joernti-codetidal5) | ||
* [CPG Miner](https://github.com/DavidBakerEffendi/cpg-miner) | ||
|
||
Add your project here! |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
--- | ||
id: ide-setup | ||
title: Setting Up Your IDE | ||
weight: 30 | ||
--- | ||
|
||
#### IntelliJ IDEA | ||
* [Download IntelliJ Community](https://www.jetbrains.com/idea/download) | ||
* Install and run it | ||
* Install the [Scala Plugin](https://plugins.jetbrains.com/plugin/1347-scala) - just search and | ||
install from within IntelliJ. | ||
* Important: open `sbt` in your local joern repository, run `compile` and keep it open - this will | ||
allow us to use the BSP build in the next step | ||
* Back to IntelliJ: open project: select your local joern clone: select to open as `BSP Project` | ||
(i.e. _not_ `sbt project`!) | ||
* Wait for the import and indexing to complete, then you can start, e.g. `Build -> Build project` or | |
run a test | ||
|
||
Pro tip: Scala 3 support is limited and opting for Nightly builds is highly recommended. | ||
|
||
#### VSCode | ||
- Install VSCode and Docker | ||
- Install the plugin `ms-vscode-remote.remote-containers` | ||
- Open the Joern project folder in | |
[VSCode](https://docs.microsoft.com/en-us/azure-sphere/app-development/container-build-vscode#build-and-debug-the-project). | |
Visual Studio Code detects the new files and opens a message box saying: `Folder contains a Dev | ||
Container configuration file. Reopen to folder to develop in a container.` | ||
- Select the `Reopen in Container` button to reopen the folder in the container created by the | ||
`.devcontainer/Dockerfile` file | ||
- Switch to `scalameta.metals` sidebar in VSCode, and select `import build` in `BUILD COMMANDS` | ||
- After `import build` succeeds, you are ready to start writing code for Joern |
185 changes: 185 additions & 0 deletions
docs.joern.io/content/developer-guide/learning-scala.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,185 @@ | ||
--- | ||
id: learning-scala | ||
title: Learning Scala | ||
weight: 20 | ||
--- | ||
|
||
Joern is built on Scala 3, and is not built with interoperability with other languages in mind. This | ||
means that you will need to learn Scala one way or another. A great resource is the [Scala 3 | ||
Book](https://docs.scala-lang.org/scala3/book/introduction.html). | ||
|
||
The rest of this page outlines Scala best practices, and how we try to follow them in the | ||
development of Joern. | ||
|
||
# TLDR | ||
|
||
Pay close attention to the following sections on collection classes and their performance | ||
characteristics: | ||
|
||
* [Concrete Immutable Collection Classes](https://docs.scala-lang.org/overviews/collections-2.13/concrete-immutable-collection-classes.html) | ||
* [Performance Characteristics](https://docs.scala-lang.org/overviews/collections-2.13/performance-characteristics.html) | ||
|
||
# Best Practices | ||
|
||
This section contains some best practices and general remarks about the performance of various | ||
common tasks, as well as best practices for Joern development. | ||
|
||
This section exists because we have at various points encountered these issues. | ||
|
||
Some points are very general to programming -- but if something is observed as a common pitfall | ||
inside Joern, then it deserves to be addressed here. | ||
|
||
## Folds | ||
|
||
Suppose we have `input: Iterable[List[T]]` and want to construct the concatenation. | ||
|
||
A natural way could be | ||
```scala | ||
val res = input.foldLeft(List.empty[T]) { _ ++ _ } // typed accumulator, so the result is a List[T] | |
``` | ||
This is a catastrophe: It boils down to | ||
```scala | ||
( ((Nil ++ input(0)) ++ input(1)) ++ input(2) ) ++ ... | ||
``` | ||
However, the runtime of `A ++ B` scales like `A.length` for a linked list, and this can end up | |
costing us `res.length ** 2`, i.e. quadratic runtime. | |
|
||
Instead, the correct way is | ||
```scala | ||
val res = input.foldRight(List.empty[T]) { _ ++ _ } // typed accumulator, so the result is a List[T] | |
``` | ||
This is, however, contingent on the internals of the `List` implementation! If we instead were to | ||
collect into an `ArrayBuffer`, the correct way would be | ||
```scala | ||
input.foldLeft(mutable.ArrayBuffer.empty[T]) { case (acc, items) => acc.appendAll(items) } | ||
``` | ||
This is because `List` allows O(1) prepend, and `ArrayBuffer` allows O(1) append. | ||
|
||
Do not write `buf.appendedAll` -- that makes a copy each time, and the runtime will always be quadratic! | |
|
||
Now, if you write `input.fold`, then the associativity is indeterminate. This means that you must | ||
always assume the worst imaginable execution order! | ||
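
To make the comparison concrete, here is a minimal sketch of the three variants discussed above. The helper names `slowConcat`, `fastConcat` and `bufferConcat` are made up for illustration and are not part of Joern. All three return the same elements in the same order; only the cost differs.

```scala
import scala.collection.mutable

// Hypothetical helper names, chosen for illustration only.

// Quadratic: the ever-growing accumulator is the left operand of ++ and is copied on each step.
def slowConcat[T](input: Iterable[List[T]]): List[T] =
  input.foldLeft(List.empty[T])(_ ++ _)

// Linear: each (short) item is the left operand and the accumulator is reused as the tail.
def fastConcat[T](input: Iterable[List[T]]): List[T] =
  input.foldRight(List.empty[T])(_ ++ _)

// Linear: appendAll mutates the buffer in place instead of copying it.
def bufferConcat[T](input: Iterable[List[T]]): mutable.ArrayBuffer[T] =
  input.foldLeft(mutable.ArrayBuffer.empty[T]) { case (acc, items) => acc.appendAll(items) }
```

For example, `fastConcat(List(List(1, 2), List(3)))` yields `List(1, 2, 3)`, exactly like `slowConcat`, but without re-copying the accumulated prefix on each step.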
|
||
### Fundamental Theorem | ||
|
||
The fundamental theorem on getting the right complexity class for your folds is the following: | ||
|
||
Assume that each item has a weight, and that | |
``` | ||
combine(left, right).weight == left.weight + right.weight | ||
``` | ||
and assume that the runtime of `combine(left, right)` is bounded by | ||
`min(left.weight, right.weight)`. | ||
|
||
Then, the total runtime of accumulation is upper bounded by | ||
``` | ||
input.map{_.weight}.sum * log2(input.map{_.weight}.sum) | ||
``` | ||
|
||
Proof: Consider the function `F(a)=a*log2(a)` and then track the evolution of | |
`time_already_spent - remaining_work.map{F(_.weight)}.sum`. We will show that this quantity is | |
non-increasing. Since this quantity is initially non-positive (no time was spent!) it must be | |
non-positive once we are done, which gives the desired inequality | |
``` | |
time_spent - F(remaining_work.map{_.weight}.sum) == time_spent - F(output.weight) <= 0 | |
``` | |
So to prove this inductively, suppose we combine two items with weights `a <= b`. We compare the | ||
critical quantity before and after this update, to obtain | ||
``` | ||
Delta == time_already_spent_after - remaining_work_after.map{F(_.weight)}.sum | ||
- time_already_spent_before + remaining_work_before.map{F(_.weight)}.sum | ||
<= a - F(a+b) + F(a) + F(b) | ||
== a - a*log2(a+b) - b*log2(a+b) + a*log2(a) + b*log2(b) | |
== a - a*log2(1 + b/a) - b*log2(1 + a/b) | ||
< a - a*log2(1 + b/a) | ||
<= 0 | ||
``` | ||
The last inequality was because `a <= b` and hence `log2(1+b/a) >= 1`. | ||
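
The bound can be sanity-checked numerically. The following is a small sketch under the theorem's cost model, where combining weights `a` and `b` costs `min(a, b)`. The function names are hypothetical, and the weights are assumed to be a non-empty sequence of values `>= 1`.

```scala
// Hypothetical cost model: merging weights a and b costs min(a, b), as assumed in the theorem.
def log2(x: Double): Double = math.log(x) / math.log(2.0)

// Total cost of combining a non-empty `weights` left to right (a foldLeft-shaped order).
def leftToRightCost(weights: Seq[Long]): Long =
  weights.tail
    .foldLeft((weights.head, 0L)) { case ((acc, cost), w) => (acc + w, cost + math.min(acc, w)) }
    ._2

// Total cost of combining in a balanced, tree-shaped order (as a parallel fold might).
def pairwiseCost(weights: Seq[Long]): Long =
  if (weights.size <= 1) 0L
  else {
    val (l, r) = weights.splitAt(weights.size / 2)
    pairwiseCost(l) + pairwiseCost(r) + math.min(l.sum, r.sum)
  }

// The bound from the theorem: W * log2(W), where W is the total weight.
def bound(weights: Seq[Long]): Double = {
  val total = weights.sum.toDouble
  total * log2(total)
}
```

With 1024 items of weight 1, the left-to-right order costs 1023 and the balanced order costs 5120, both below the bound of `1024 * 10 = 10240`. The point of the theorem is that no execution order can exceed the bound, as long as `combine` is cheap in both directions.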
|
||
### More examples | ||
|
||
A good example of what not to do is Java's `Collectors.toList()`. This is intended for use on | ||
parallel streams; it uses an internal `ArrayList`. However, `ArrayList` only permits fast append, no | ||
fast prepend. Yeah, lol, quadratic runtime (because associativity depends on races). | ||
|
||
How it should be done can be seen in `ForkJoinParallelCpgPass`. Morally speaking, the relevant | |
function is | ||
```scala | ||
def combine[T](left:mutable.ArrayDeque[T], right:mutable.ArrayDeque[T]):mutable.ArrayDeque[T] = { | ||
if (left.size < right.size) right.prependAll(left) | ||
else left.appendAll(right) | ||
} | ||
``` | ||
This is the fundamental point: If you want `fold` to be fast, then you must handle both cases. | ||
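
As a usage sketch, the `combine` above can be plugged into a reduction whose grouping the caller does not control. The driver `concatAll` is a hypothetical name introduced here for illustration; it is not taken from `ForkJoinParallelCpgPass`.

```scala
import scala.collection.mutable

// Same shape as the `combine` above: always copy the smaller side into the larger one.
def combine[T](left: mutable.ArrayDeque[T], right: mutable.ArrayDeque[T]): mutable.ArrayDeque[T] =
  if (left.size < right.size) right.prependAll(left)
  else left.appendAll(right)

// Hypothetical driver: the grouping chosen by the reduction does not matter for the
// complexity class, because `combine` is cheap in either direction.
def concatAll[T](parts: Seq[mutable.ArrayDeque[T]]): mutable.ArrayDeque[T] =
  parts.reduceOption(combine[T]).getOrElse(mutable.ArrayDeque.empty[T])
```

Note that `combine` mutates and returns one of its arguments, so the inputs must not be reused afterwards. That is the price for avoiding copies.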
|
||
Another example that works is `++` for Scala `Vector`. It is a fun exercise to step through the code | |
and see that both `prependedAll` and `appendedAll` are fast! | |
|
||
With respect to performance, it is recommended to use mutable data structures by default, and only | ||
use all the immutable stuff if necessary: That is, if you would otherwise require locks or if your | ||
algorithm requires snapshots. | ||
|
||
## Java Stream Collector | ||
|
||
It is worthwhile to take another look at the Java stream collector. It is used in `CpgPass` like this | ||
|
||
```scala | ||
// parts:Array[T] | ||
externalBuilder.absorb( | ||
java.util.Arrays | ||
.stream(parts) | ||
.parallel() | ||
.collect( | ||
new Supplier[DiffGraphBuilder] { | ||
override def get(): DiffGraphBuilder = | ||
new DiffGraphBuilder | ||
}, | ||
new BiConsumer[DiffGraphBuilder, AnyRef] { | ||
override def accept(builder: DiffGraphBuilder, part: AnyRef): Unit = | ||
runOnPart(builder, part.asInstanceOf[T]) | ||
}, | ||
new BiConsumer[DiffGraphBuilder, DiffGraphBuilder] { | ||
override def accept(leftBuilder: DiffGraphBuilder, rightBuilder: DiffGraphBuilder): Unit = | ||
leftBuilder.absorb(rightBuilder) | ||
} | ||
) | ||
) | ||
``` | ||
The stream collect API is, at its core, a glorified parallel fold. Noteworthy things are: | ||
1. We don't supply a single accumulator, we supply an accumulator factory. This is important for | ||
parallelism, otherwise we'd degrade to a `foldLeft`! | ||
2. Ideally we only get one accumulator per CPU core. Each accumulator uses its `runOnPart` to absorb | ||
the output of the next `part`. We especially don't allocate an accumulator (i.e. a new | ||
`DiffGraphBuilder`) for every `part` -- that would limit us to parallelisms where each `part` is large | ||
(e.g. each `method`) and would be bad for the case of many cheap `parts`. | ||
3. Note how we collect everything into an array before building the stream. We do _not_ use some | ||
kind of generic `Spliterator` like `java.util.Spliterators.spliteratorUnknownSize`. Read the code of | |
that function and consider whether that would be a good idea in the context of `CpgPass`. | ||
4. The code for merging, i.e. `leftBuilder.absorb(rightBuilder)` has already been discussed above | ||
(especially the fact that it is `ArrayDeque`-based). | ||
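
The same three-argument `collect` can be tried in isolation. Below is a minimal sketch that mirrors the structure above with stand-ins: the parts are plain strings, the accumulator is an `ArrayDeque[Int]` instead of a `DiffGraphBuilder`, and "running on a part" just records the string length. `collectLengths` and `Acc` are hypothetical names used only for this illustration.

```scala
import java.util.Arrays
import java.util.function.{BiConsumer, Supplier}
import scala.collection.mutable

// Stand-in accumulator type (an assumption for this sketch, not Joern's DiffGraphBuilder).
type Acc = mutable.ArrayDeque[Int]

def collectLengths(parts: Array[String]): Acc =
  Arrays
    .stream(parts)
    .parallel()
    .collect(
      // Accumulator factory: called once per chunk of work, not once per part.
      new Supplier[Acc] {
        override def get(): Acc = mutable.ArrayDeque.empty[Int]
      },
      // Absorb the output of the next part into the chunk-local accumulator.
      new BiConsumer[Acc, String] {
        override def accept(acc: Acc, part: String): Unit =
          acc.append(part.length)
      },
      // Merge two accumulators; the right one is folded into the left one.
      new BiConsumer[Acc, Acc] {
        override def accept(left: Acc, right: Acc): Unit =
          left.appendAll(right)
      }
    )
```

Because the stream is built from an array, it is ordered, and the merged result keeps the order of `parts` even though the work is split across threads.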
|
||
## API considerations: Beware of Iterator! | ||
|
||
You should not offer a generic API that works on (lazy) Iterator. If you decide to do that, then | ||
immediately and single-threadedly collect the Iterator, before passing it on to complex higher order | ||
functions like fold or stream-collect. | ||
|
||
The reason is that iterators are lazy. You don't know what side-effects and computations are hidden | ||
in them! And if execution of the iterator is triggered by complex higher order functions, then this | ||
may very well contain data races! | ||
|
||
Or look at some code that we found in Joern: | ||
```scala | ||
iter.map{item => executer.submit{() => expensive_function(item)}}.map{_.get()}.toList | ||
``` | ||
where `iter` was an `Iterator`. This code wanted to run `expensive_function` in parallel on all | ||
items in the iterator, and collect the results into a `List`. | ||
|
||
In fact this code was single-threaded, because iterators and maps of iterators are lazy. Nothing | |
is called until `toList` tries to collect the first output; only then is the first computation | |
scheduled and its result awaited, and only after that is the second computation scheduled. | |
|
||
TLDR: Iterators are scary. Collect them eagerly. | ||
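
One possible fix for the snippet above is to force the iterator between submitting and awaiting. This is a sketch with a hypothetical helper `runAll` and a fixed-size thread pool chosen for illustration; it is not the code that ended up in Joern.

```scala
import java.util.concurrent.{Callable, Executors}

// Hypothetical helper: run `f` on all items in parallel and keep the input order.
def runAll[A, B](items: Iterator[A], threads: Int = 4)(f: A => B): List[B] = {
  val executor = Executors.newFixedThreadPool(threads)
  try {
    // `.toList` forces the lazy iterator here, so every task is submitted
    // before the first `get()` blocks.
    val futures = items
      .map(item => executor.submit(new Callable[B] { override def call(): B = f(item) }))
      .toList
    futures.map(_.get())
  } finally executor.shutdown()
}
```

For example, `runAll(Iterator(1, 2, 3, 4))(_ * 2)` returns `List(2, 4, 6, 8)`. The only change from the problematic version is the eager `.toList` between the two `map` calls.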
|
||
Special thanks to [@bbrehm](https://github.com/bbrehm) for the write-up! |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
--- | ||
id: upgrade-guides | ||
title: Upgrade guides | ||
title: Upgrade Guide | ||
weight: 150 | ||
--- | ||
|
||
|