Developer Guide (#109)

* Migrate Developer Guide from Joern README here
* Added @bbrehm's PR content from joernio/joern#4240
* Added an entry for standalone-ext
joernio · Mar 15, 2024 · 58f15b9
1 parent 863c302
commit 58f15b9
Showing 6 changed files with 307 additions and 1 deletion.
diff --git a/docs.joern.io/content/developer-guide/_index.md b/docs.joern.io/content/developer-guide/_index.md
@@ -0,0 +1,6 @@
+---
+id: developer-guide
+title: Developer Guide
+bookCollapseSection: true
+weight: 160
+---
diff --git a/docs.joern.io/content/developer-guide/contribution-guidelines.md b/docs.joern.io/content/developer-guide/contribution-guidelines.md
@@ -0,0 +1,33 @@
+---
+id: contributing
+title: Contributing to Joern
+weight: 10
+---
+
+Thank you for taking the time to contribute to Joern! Here are a few guidelines to ensure your pull
+request will get merged as soon as possible.
+
+### Creating a Pull Request
+
+Try to use the templates as far as possible, although they may not suit all needs. At a minimum,
+we would like to see:
+- A title that briefly describes the change and purpose of the PR, preferably with the affected
+    module in square brackets, e.g. `[javasrc] Addition Operator Fix`.
+- A short description of the changes in the body of the PR. This could be in bullet points or
+    paragraphs.
+- A link or reference to the related issue, if any exists.
+
+### Dos and Don'ts
+
+Do not:
+- Immediately CC, @-mention, or email other contributors. The team will review the PR and assign
+    the most appropriate contributor to it. Joern is maintained by industry partners and
+    researchers alike, for the most part with their own goals and priorities, and additional help is
+    largely volunteer work. If your PR is going stale, reach out to us in follow-up comments with
+    @-mentions, asking about its priority or when it may be addressed (if ever, depending on
+    quality).
+- Leave the description body empty; this makes it difficult to review the purpose of the PR.
+
+Do:
+- Format your code, i.e. run `sbt scalafmt Test/scalafmt`.
+- Add a unit test to verify your change.
diff --git a/docs.joern.io/content/developer-guide/custom-tool.md b/docs.joern.io/content/developer-guide/custom-tool.md
@@ -0,0 +1,51 @@
+---
+id: custom-tool
+title: Creating a Custom Static Analysis with Joern
+weight: 40
+---
+
+So you want to develop tools with Joern? Let's get started!
+
+## Simple Standalone Application Template
+
+[standalone-ext](https://github.com/joernio/standalone-ext) is the core template for developing
+Joern-based tooling. Here are some tasks you can perform using this template to suit your needs:
+
+* Want to update to the latest Joern version? Run `./updateDependencies.sh`
+* Want to extend the CPG schema with custom nodes/edges/properties? See [`CpgExtSchema`](https://github.com/joernio/standalone-ext/blob/master/schema/src/main/scala/CpgExtSchema.scala)
+* Want to create a CLI tool? See [`Main`](https://github.com/joernio/standalone-ext/blob/master/src/main/scala/org/codeminers/standalone/Main.scala)
+* Want to create a REPL? See [`ReplMain`](https://github.com/joernio/standalone-ext/blob/master/src/main/scala/org/codeminers/standalone/ReplMain.scala)
+* Want to add custom query steps? See [`package`](https://github.com/joernio/standalone-ext/blob/master/src/main/scala/org/codeminers/standalone/package.scala)
+
+Joern modules can be imported, as well as their test resources, e.g.
+
+```scala
+// build.sbt
+
+// parsed by project/Versions.scala, updated by updateDependencies.sh
+val cpgVersion        = "1.6.5"
+val joernVersion      = "2.0.262"
+val overflowdbVersion = "1.187"
+// ...
+val joernDeps =
+  Seq("x2cpg", "javasrc2cpg", "joern-cli", "semanticcpg", "dataflowengineoss")
+    .flatMap { x =>
+      val dep     = "io.joern" %% x % Versions.joern
+      val testDep = "io.joern" %% x % Versions.joern % Test classifier "tests"
+      Seq(dep, testDep)
+    }
+libraryDependencies ++= Seq(/*...*/) ++ joernDeps
+```
+
+With the test resources, you have access to the same test fixtures and tooling that Joern has,
+notably the ability to generate CPGs from source code snippets to test against.
+
+## Examples
+
+Here are some open-source tools developed from `standalone-ext`:
+
+* [Privado Core](https://github.com/Privado-Inc/privado-core)
+* [JoernTI](https://github.com/joernio/joernti-codetidal5)
+* [CPG Miner](https://github.com/DavidBakerEffendi/cpg-miner)
+
+Add your project here!
diff --git a/docs.joern.io/content/developer-guide/ide-setup.md b/docs.joern.io/content/developer-guide/ide-setup.md
@@ -0,0 +1,31 @@
+---
+id: ide-setup
+title: Setting Up Your IDE
+weight: 30
+---
+
+#### IntelliJ IDEA
+* [Download IntelliJ Community](https://www.jetbrains.com/idea/download)
+* Install and run it
+* Install the [Scala Plugin](https://plugins.jetbrains.com/plugin/1347-scala) - just search and
+  install from within IntelliJ.
+* Important: open `sbt` in your local joern repository, run `compile` and keep it open - this
+  allows us to use the BSP build in the next step
+* Back in IntelliJ: choose to open a project, select your local joern clone, and open it as a
+  `BSP Project` (i.e. _not_ as an `sbt project`!)
+* Wait for the import and indexing to complete, then you can start, e.g. `Build -> Build project`
+  or run a test
+
+Pro tip: Scala 3 support is limited and opting for Nightly builds is highly recommended.
+
+#### VSCode
+- Install VSCode and Docker
+- Install the plugin `ms-vscode-remote.remote-containers`
+- Open the Joern project folder in
+  [VSCode](https://docs.microsoft.com/en-us/azure-sphere/app-development/container-build-vscode#build-and-debug-the-project).
+  Visual Studio Code detects the new files and opens a message box saying: `Folder contains a Dev
+  Container configuration file. Reopen folder to develop in a container.`
+- Select the `Reopen in Container` button to reopen the folder in the container created by the
+  `.devcontainer/Dockerfile` file
+- Switch to `scalameta.metals` sidebar in VSCode, and select `import build` in `BUILD COMMANDS`
+- After `import build` succeeds, you are ready to start writing code for Joern
diff --git a/docs.joern.io/content/developer-guide/learning-scala.md b/docs.joern.io/content/developer-guide/learning-scala.md
@@ -0,0 +1,185 @@
+---
+id: learning-scala
+title: Learning Scala
+weight: 20
+---
+
+Joern is built on Scala 3, and is not built with interoperability with other languages in mind. This
+means that you will need to learn Scala one way or another. A great resource is the [Scala 3
+Book](https://docs.scala-lang.org/scala3/book/introduction.html).
+
+The rest of this page outlines Scala best practices, and how we try to follow them in the
+development of Joern.
+
+# TLDR
+
+Pay close attention to the following sections on collection classes and their performance
+characteristics:
+
+* [Concrete Immutable Collection Classes](https://docs.scala-lang.org/overviews/collections-2.13/concrete-immutable-collection-classes.html)
+* [Performance Characteristics](https://docs.scala-lang.org/overviews/collections-2.13/performance-characteristics.html)
+
+# Best Practices
+
+This section contains some best practices and general remarks about the performance of various
+common tasks, as well as best practices for Joern development.
+
+This section exists because we have at various points encountered these issues.
+
+Some points are very general to programming -- but if something is observed as a common pitfall
+inside Joern, then it deserves to be addressed here.
+
+## Folds
+
+Suppose we have `input: Iterable[List[T]]` and want to construct the concatenation.
+
+A natural way could be
+```scala
+val res = input.foldLeft(List.empty[T]) { _ ++ _ }
+```
+This is a catastrophe: It boils down to
+```scala
+( ((Nil ++ input(0)) ++ input(1)) ++ input(2) ) ++ ...
+```
+However, the runtime of `A ++ B` scales like `A.length` for a linked list, so this can end up
+costing us `res.length ** 2`, i.e. quadratic runtime.
+
+Instead, the correct way is
+```scala
+val res = input.foldRight(List.empty[T]) { _ ++ _ }
+```
+This is, however, contingent on the internals of the `List` implementation! If we instead were to
+collect into an `ArrayBuffer`, the correct way would be
+```scala
+input.foldLeft(mutable.ArrayBuffer.empty[T]) { case (acc, items) => acc.appendAll(items) }
+```
+This is because `List` allows O(1) prepend, and `ArrayBuffer` allows O(1) append.
+
+Do not write `buf.appendedAll` -- that pulls a copy and runtime will always be quadratic!
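+
+A minimal sketch of the difference, assuming nothing beyond the standard library (the object and
+value names are made up for illustration): `appendAll` mutates the buffer in place, whereas
+`appendedAll` leaves it untouched and builds a new one.
+```scala
+import scala.collection.mutable
+
+object AppendDemo {
+  // `appendAll` mutates the receiver and returns it: no copy is made.
+  val inPlace: mutable.ArrayBuffer[Int]    = mutable.ArrayBuffer(1, 2)
+  val sameBuffer: mutable.ArrayBuffer[Int] = inPlace.appendAll(Seq(3, 4))
+
+  // `appendedAll` leaves the receiver untouched and returns a fresh copy.
+  val original: mutable.ArrayBuffer[Int] = mutable.ArrayBuffer(1, 2)
+  val copy: mutable.ArrayBuffer[Int]     = original.appendedAll(Seq(3, 4))
+}
+```
+Inside a fold, the copying variant re-copies the whole accumulator on every step.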
+
+Now, if you write `input.fold`, then the associativity is indeterminate. This means that you must
+always assume the worst imaginable execution order!
+
+### Fundamental Theorem
+
+The fundamental theorem on getting the right complexity class for your folds is the following:
+
+Assume that each item has a weight, and that
+```
+combine(left, right).weight == left.weight + right.weight
+```
+and assume that the runtime of `combine(left, right)` is bounded by 
+`min(left.weight, right.weight)`.
+
+Then, the total runtime of accumulation is upper bounded by 
+```
+input.map{_.weight}.sum * log2(input.map{_.weight}.sum)
+```
+
+Proof: Consider the function `F(a) = a*log2(a)` and track the evolution of the quantity
+`time_already_spent - remaining_work.map{F(_.weight)}.sum`. We will show that this quantity is
+non-increasing. Since it is initially non-positive (no time has been spent yet!), it must still be
+non-positive once we are done, which gives the desired inequality
+```
+time_spent - F(remaining_work.map{_.weight}.sum) == time_spent - F(output.weight) <= 0
+```
+So to prove this inductively, suppose we combine two items with weights `a <= b`. We compare the
+critical quantity before and after this update, to obtain
+```
+Delta == time_already_spent_after - remaining_work_after.map{F(_.weight)}.sum
+        -  time_already_spent_before + remaining_work_before.map{F(_.weight)}.sum
+    <= a - F(a+b) + F(a) + F(b) 
+    == a - a*log2(a+b) - b*log2(a+b) + a*log2(a) + b*log2(b)
+    == a - a*log2(1 + b/a) - b*log2(1 + a/b)
+    <  a - a*log2(1 + b/a)
+    <= 0
+```
+The last inequality holds because `a <= b` and hence `log2(1+b/a) >= 1`.
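+
+To get a feel for the bound, the following sketch evaluates the assumed cost model
+`min(left.weight, right.weight)` for two combination orders and compares the totals against
+`F(total weight)`. The object and helper names are hypothetical and exist only for this
+illustration.
+```scala
+object FoldCostCheck {
+  // The bound from the theorem: F(w) = w * log2(w).
+  def bound(totalWeight: Long): Double =
+    totalWeight * (math.log(totalWeight.toDouble) / math.log(2.0))
+
+  // Left-nested order ((w0 + w1) + w2) + ..., charging min(a, b) per combine.
+  def leftNestedCost(weights: Seq[Long]): Long =
+    weights.tail.foldLeft((weights.head, 0L)) { case ((acc, cost), w) =>
+      (acc + w, cost + math.min(acc, w))
+    }._2
+
+  // Balanced order: combine neighbours pairwise, level by level.
+  @annotation.tailrec
+  def balancedCost(weights: Seq[Long], cost: Long = 0L): Long =
+    if (weights.size <= 1) cost
+    else {
+      val pairs    = weights.grouped(2).toSeq
+      val stepCost = pairs.collect { case Seq(a, b) => math.min(a, b) }.sum
+      balancedCost(pairs.map(_.sum), cost + stepCost)
+    }
+}
+```
+For 1024 items of weight 1, `leftNestedCost` yields 1023 and `balancedCost` yields 5120, both
+below `bound(1024)`, which is about 10240.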
+
+### More examples
+
+A good example of what not to do is Java's `Collectors.toList()`. This is intended for use on
+parallel streams; it uses an internal `ArrayList`. However, `ArrayList` only permits fast append, no
+fast prepend. Yeah, lol, quadratic runtime (because associativity depends on races).
+
+How it should be done can be seen in `ForkJoinParallelCpgPass`. Morally speaking, the relevant
+function is
+```scala
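+import scala.collection.mutable
+
+def combine[T](left: mutable.ArrayDeque[T], right: mutable.ArrayDeque[T]): mutable.ArrayDeque[T] = {
+  // Always move the smaller side into the larger one, so the cost is min(left.size, right.size)
+  if (left.size < right.size) right.prependAll(left)
+  else left.appendAll(right)
+}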
+```
+This is the fundamental point: If you want `fold` to be fast, then you must handle both cases.
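+
+As a quick sanity check, the sketch below repeats the function and applies it to two small,
+made-up inputs to show that element order is preserved whichever side ends up being the
+accumulator. `CombineDemo` and its values are illustrative names only.
+```scala
+import scala.collection.mutable
+
+object CombineDemo {
+  // Same shape as the function above: move the smaller side into the larger one.
+  def combine[T](left: mutable.ArrayDeque[T], right: mutable.ArrayDeque[T]): mutable.ArrayDeque[T] =
+    if (left.size < right.size) right.prependAll(left)
+    else left.appendAll(right)
+
+  // Smaller left side: it is prepended to `right`, which becomes the result.
+  val a: mutable.ArrayDeque[Int] = combine(mutable.ArrayDeque(1, 2), mutable.ArrayDeque(3, 4, 5))
+  // Smaller right side: it is appended to `left`, which becomes the result.
+  val b: mutable.ArrayDeque[Int] = combine(mutable.ArrayDeque(1, 2, 3), mutable.ArrayDeque(4))
+}
+```
+Both calls return the elements in their original left-to-right order; only the side that gets
+copied differs.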
+
+Another example that works is `++` for Scala `Vector`. It is a fun exercise to step through the
+code and see that both `prependedAll` and `appendedAll` are fast!
+
+With respect to performance, it is recommended to use mutable data structures by default, and only
+use all the immutable stuff if necessary: That is, if you would otherwise require locks or if your
+algorithm requires snapshots.
+
+## Java Stream Collector
+
+It is worthwhile to take another look at the Java stream collector. It is used in `CpgPass` like this
+
+```scala
+// parts: Array[T]
+externalBuilder.absorb(
+  java.util.Arrays
+    .stream(parts)
+    .parallel()
+    .collect(
+      new Supplier[DiffGraphBuilder] {
+        override def get(): DiffGraphBuilder =
+          new DiffGraphBuilder
+      },
+      new BiConsumer[DiffGraphBuilder, AnyRef] {
+        override def accept(builder: DiffGraphBuilder, part: AnyRef): Unit =
+          runOnPart(builder, part.asInstanceOf[T])
+      },
+      new BiConsumer[DiffGraphBuilder, DiffGraphBuilder] {
+        override def accept(leftBuilder: DiffGraphBuilder, rightBuilder: DiffGraphBuilder): Unit =
+          leftBuilder.absorb(rightBuilder)
+      }
+    )
+)
+```
+The stream collect API is, at its core, a glorified parallel fold. Noteworthy things are:
+1. We don't supply a single accumulator, we supply an accumulator factory. This is important for
+parallelism, otherwise we'd degrade to a `foldLeft`!
+2. Ideally we only get one accumulator per CPU core. Each accumulator uses its `runOnPart` to absorb
+the output of the next `part`. We especially don't allocate an accumulator (i.e. a new
+`DiffGraphBuilder`) for every `part` -- that would limit us to parallelisms where each `part` is large
+(e.g. each `method`) and would be bad for the case of many cheap `parts`.
+3. Note how we collect everything into an array before building the stream. We do _not_ use some
+kind of generic `Spliterator` like `java.util.Spliterators.spliteratorUnknownSize`. Read the code of
+that function and consider whether that would be a good idea in the context of `CpgPass`.
+4. The code for merging, i.e. `leftBuilder.absorb(rightBuilder)` has already been discussed above
+(especially the fact that it is `ArrayDeque`-based).
+
+## API considerations: Beware of Iterator!
+
+You should not offer a generic API that works on a (lazy) `Iterator`. If you decide to do that
+anyway, then immediately collect the `Iterator` on a single thread before passing it on to complex
+higher-order functions like `fold` or stream `collect`.
+
+The reason is that iterators are lazy. You don't know what side-effects and computations are hidden
+in them! And if execution of the iterator is triggered by complex higher order functions, then this
+may very well contain data races!
+
+Or look at some code that we found in Joern: 
+```scala
+iter.map{item => executer.submit{() => expensive_function(item)}}.map{_.get()}.toList
+```
+where `iter` was an `Iterator`. This code wanted to run `expensive_function` in parallel on all
+items in the iterator, and collect the results into a `List`.
+
+In fact this code was single-threaded, because iterators and maps of iterators are lazy. Nothing
+is called until `toList` tries to collect the first output; only then is the first computation
+scheduled and its result awaited, and only after that is the second computation scheduled.
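+
+One possible fix, sketched under assumptions: force all `submit` calls by collecting the futures
+eagerly, and only then wait for the results. `EagerSubmit`, `runAll`, `demo`, and
+`expensiveFunction` are hypothetical names, and the pool size of 4 is an arbitrary choice.
+```scala
+import java.util.concurrent.{Callable, ExecutorService, Executors, Future}
+
+object EagerSubmit {
+  // Hypothetical stand-in for the expensive computation.
+  def expensiveFunction(item: Int): Int = item * item
+
+  def runAll(items: Iterator[Int], executor: ExecutorService): List[Int] = {
+    // Materialise the futures first: `toList` forces every `submit` call
+    // before any `get()` blocks.
+    val futures: List[Future[Int]] = items.map { item =>
+      executor.submit(new Callable[Int] { def call(): Int = expensiveFunction(item) })
+    }.toList
+    // Only now wait for the results, in submission order.
+    futures.map(_.get())
+  }
+
+  def demo(): List[Int] = {
+    val pool = Executors.newFixedThreadPool(4)
+    try runAll(Iterator(1, 2, 3), pool)
+    finally pool.shutdown()
+  }
+}
+```
+The only change from the problematic snippet is the intermediate `toList` between the two `map`
+calls, which separates scheduling from waiting.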
+
+TLDR: Iterators are scary. Collect them eagerly.
+
+Special thanks to [@bbrehm](https://github.com/bbrehm) for the write-up!
diff --git a/docs.joern.io/content/upgrade-guides.md b/docs.joern.io/content/upgrade-guides.md
@@ -1,6 +1,6 @@
 ---
 id: upgrade-guides
-title: Upgrade guides
+title: Upgrade Guide
 weight: 150
 ---