From 58f15b9e865d3209bf8e28238a006b19f850d018 Mon Sep 17 00:00:00 2001
From: David Baker Effendi
Date: Fri, 15 Mar 2024 13:14:17 +0200
Subject: [PATCH] Developer Guide (#109)

* Migrate Developer Guide from Joern README here
* Added @bbrehm's PR content from https://github.com/joernio/joern/pull/4240
* Added an entry for standalone-ext
---
 .../content/developer-guide/_index.md         |   6 +
 .../contribution-guidelines.md                |  33 ++++
 .../content/developer-guide/custom-tool.md    |  51 +++++
 .../content/developer-guide/ide-setup.md      |  31 +++
 .../content/developer-guide/learning-scala.md | 185 ++++++++++++++++++
 docs.joern.io/content/upgrade-guides.md       |   2 +-
 6 files changed, 307 insertions(+), 1 deletion(-)
 create mode 100644 docs.joern.io/content/developer-guide/_index.md
 create mode 100644 docs.joern.io/content/developer-guide/contribution-guidelines.md
 create mode 100644 docs.joern.io/content/developer-guide/custom-tool.md
 create mode 100644 docs.joern.io/content/developer-guide/ide-setup.md
 create mode 100644 docs.joern.io/content/developer-guide/learning-scala.md

diff --git a/docs.joern.io/content/developer-guide/_index.md b/docs.joern.io/content/developer-guide/_index.md
new file mode 100644
index 00000000..7a04bb58
--- /dev/null
+++ b/docs.joern.io/content/developer-guide/_index.md
@@ -0,0 +1,6 @@
+---
+id: developer-guide
+title: Developer Guide
+bookCollapseSection: true
+weight: 160
+---
diff --git a/docs.joern.io/content/developer-guide/contribution-guidelines.md b/docs.joern.io/content/developer-guide/contribution-guidelines.md
new file mode 100644
index 00000000..43e505a2
--- /dev/null
+++ b/docs.joern.io/content/developer-guide/contribution-guidelines.md
@@ -0,0 +1,33 @@
+---
+id: contributing
+title: Contributing to Joern
+weight: 10
+---
+
+Thank you for taking the time to contribute to Joern! Here are a few guidelines to ensure your pull
+request will get merged as soon as possible.
+
+### Creating a Pull Request
+
+Try to make use of the templates as far as possible; however, they may not suit all needs. The
+minimum we would like to see is:
+- A title that briefly describes the change and purpose of the PR, preferably with the affected
+  module in square brackets, e.g. `[javasrc] Addition Operator Fix`.
+- A short description of the changes in the body of the PR. This could be in bullet points or
+  paragraphs.
+- A link or reference to the related issue, if any exists.
+
+### Dos and Don'ts
+
+Do not:
+- Immediately CC/@/email spam other contributors. The team will review the PR and assign the most
+  appropriate contributor to review it. Joern is maintained by industry partners and
+  researchers alike, for the most part with their own goals and priorities, and additional help is
+  largely volunteer work. If your PR is going stale, then reach out to us in follow-up comments
+  with @'s asking for an explanation of priority or planning of when it may be addressed (if ever,
+  depending on quality).
+- Leave the description body empty; this makes reviewing the purpose of the PR difficult.
+
+Do remember to:
+- Format your code, i.e. run `sbt scalafmt Test/scalafmt`
+- Add a unit test to verify your change.
\ No newline at end of file
diff --git a/docs.joern.io/content/developer-guide/custom-tool.md b/docs.joern.io/content/developer-guide/custom-tool.md
new file mode 100644
index 00000000..a534830b
--- /dev/null
+++ b/docs.joern.io/content/developer-guide/custom-tool.md
@@ -0,0 +1,51 @@
+---
+id: custom-tool
+title: Creating a Custom Static Analysis with Joern
+weight: 40
+---
+
+So you want to develop tools with Joern? Let's get started!
+
+## Simple Standalone Application Template
+
+[standalone-ext](https://github.com/joernio/standalone-ext) is the core template for developing
+Joern-based tooling. Here are some tasks you can perform using this template to suit your needs:
+
+* Update to the latest Joern versions? Run `./updateDependencies.sh`
+* Extend the CPG schema for custom nodes/edges/properties? See [`CpgExtSchema`](https://github.com/joernio/standalone-ext/blob/master/schema/src/main/scala/CpgExtSchema.scala)
+* Want to create a CLI tool? See [`Main`](https://github.com/joernio/standalone-ext/blob/master/src/main/scala/org/codeminers/standalone/Main.scala)
+* Want to create a REPL? See [`ReplMain`](https://github.com/joernio/standalone-ext/blob/master/src/main/scala/org/codeminers/standalone/ReplMain.scala)
+* Want to add custom query steps? See [`package`](https://github.com/joernio/standalone-ext/blob/master/src/main/scala/org/codeminers/standalone/package.scala)
+
+Joern modules can be imported, as well as their test resources, e.g.
+
+```scala
+// build.sbt
+
+// parsed by project/Versions.scala, updated by updateDependencies.sh
+val cpgVersion = "1.6.5"
+val joernVersion = "2.0.262"
+val overflowdbVersion = "1.187"
+// ...
+val joernDeps =
+  Seq("x2cpg", "javasrc2cpg", "joern-cli", "semanticcpg", "dataflowengineoss")
+    .flatMap { x =>
+      val dep = "io.joern" %% x % Versions.joern
+      val testDep = "io.joern" %% x % Versions.joern % Test classifier "tests"
+      Seq(dep, testDep)
+    }
+libraryDependencies ++= Seq(/*...*/) ++ joernDeps
+```
+
+With the test resources, you have access to the same test fixtures and tooling that Joern has,
+notably the ability to generate a CPG to test against from source code blocks.
+
+## Examples
+
+Here are some open-source tools developed from `standalone-ext`:
+
+* [Privado Core](https://github.com/Privado-Inc/privado-core)
+* [JoernTI](https://github.com/joernio/joernti-codetidal5)
+* [CPG Miner](https://github.com/DavidBakerEffendi/cpg-miner)
+
+Add your project here!
diff --git a/docs.joern.io/content/developer-guide/ide-setup.md b/docs.joern.io/content/developer-guide/ide-setup.md
new file mode 100644
index 00000000..1dd264a2
--- /dev/null
+++ b/docs.joern.io/content/developer-guide/ide-setup.md
@@ -0,0 +1,31 @@
+---
+id: ide-setup
+title: Setting Up Your IDE
+weight: 30
+---
+
+#### IntelliJ IDEA
+* [Download IntelliJ Community](https://www.jetbrains.com/idea/download)
+* Install and run it
+* Install the [Scala Plugin](https://plugins.jetbrains.com/plugin/1347-scala) - just search and
+  install from within IntelliJ.
+* Important: open `sbt` in your local joern repository, run `compile` and keep it open - this will
+  allow us to use the BSP build in the next step
+* Back to IntelliJ: open project: select your local joern clone: select to open as `BSP Project`
+  (i.e. _not_ `sbt project`!)
+* Wait for the import and indexing to complete, then you can start, e.g. `Build -> Build project` or
+  run a test
+
+Pro tip: IntelliJ's Scala 3 support is limited, so opting for Nightly builds of the Scala plugin is highly recommended.
+
+#### VSCode
+- Install VSCode and Docker
+- Install the plugin `ms-vscode-remote.remote-containers`
+- Open the Joern project folder in
+  [VSCode](https://docs.microsoft.com/en-us/azure-sphere/app-development/container-build-vscode#build-and-debug-the-project).
+  Visual Studio Code detects the new files and opens a message box saying: `Folder contains a Dev
+  Container configuration file. Reopen folder to develop in a container.`
+- Select the `Reopen in Container` button to reopen the folder in the container created by the
+  `.devcontainer/Dockerfile` file
+- Switch to the `scalameta.metals` sidebar in VSCode, and select `import build` in `BUILD COMMANDS`
+- After `import build` succeeds, you are ready to start writing code for Joern
diff --git a/docs.joern.io/content/developer-guide/learning-scala.md b/docs.joern.io/content/developer-guide/learning-scala.md
new file mode 100644
index 00000000..7f34b621
--- /dev/null
+++ b/docs.joern.io/content/developer-guide/learning-scala.md
@@ -0,0 +1,185 @@
+---
+id: learning-scala
+title: Learning Scala
+weight: 20
+---
+
+Joern is built on Scala 3, and is not built with interoperability with other languages in mind. This
+means that you will need to learn Scala one way or another. A great resource is the [Scala 3
+Book](https://docs.scala-lang.org/scala3/book/introduction.html).
+
+The rest of this page outlines Scala best practices, and how we try to follow them in the
+development of Joern.
+
+# TLDR
+
+Pay close attention to the following sections on collection classes and their performance
+characteristics:
+
+* [Concrete Immutable Collection Classes](https://docs.scala-lang.org/overviews/collections-2.13/concrete-immutable-collection-classes.html)
+* [Performance Characteristics](https://docs.scala-lang.org/overviews/collections-2.13/performance-characteristics.html)
+
+# Best Practices
+
+This section contains some best practices and general remarks about the performance of various
+common tasks, as well as best practices for Joern development.
+
+This section exists because we have at various points encountered these issues.
+
+Some points are very general to programming -- but if something is observed as a common pitfall
+inside Joern, then it deserves to be addressed here.
+
+## Folds
+
+Suppose we have `input: Iterable[List[T]]` and want to construct the concatenation of all its lists.
+
+A natural way could be
+```scala
+val res = input.foldLeft(List.empty[T]) { _ ++ _ }
+```
+This is a catastrophe: It boils down to
+```scala
+( ((Nil ++ input(0)) ++ input(1)) ++ input(2) ) ++ ...
+```
+However, the runtime of `A ++ B` scales like `A.length` for linked lists, and this can end up costing
+us `res.length ** 2`, i.e. quadratic runtime.
+
+Instead, the correct way is
+```scala
+val res = input.foldRight(List.empty[T]) { _ ++ _ }
+```
+This is, however, contingent on the internals of the `List` implementation! If we instead were to
+collect into an `ArrayBuffer`, the correct way would be
+```scala
+input.foldLeft(mutable.ArrayBuffer.empty[T]) { case (acc, items) => acc.appendAll(items) }
+```
+This is because `List` allows O(1) prepend, and `ArrayBuffer` allows O(1) append.
+
+Do not write `buf.appendedAll` -- that makes a copy on every call, so the runtime will always be quadratic!
+
+Now, if you write `input.fold`, then the associativity is unspecified. This means that you must
+always assume the worst imaginable execution order!
+
+### Fundamental Theorem
+
+The fundamental theorem on getting the right complexity class for your folds is the following:
+
+Assume that each item has a weight (e.g. its length), and that
+```
+combine(left, right).weight == left.weight + right.weight
+```
+and assume that the runtime of `combine(left, right)` is bounded by
+`min(left.weight, right.weight)`.
+
+Then, the total runtime of accumulation is upper bounded by
+```
+input.map{_.weight}.sum * log2(input.map{_.weight}.sum)
+```
+
+Proof: Consider the function `F(a) = a*log2(a)` and track the evolution of the quantity
+`time_already_spent - remaining_work.map{F(_.weight)}.sum`. We will show that this quantity is
+non-increasing. Since it is initially non-positive (no time has been spent yet!), it must still be
+non-positive once we are done, when only the output remains, which
+gives the desired inequality
+```
+time_spent - remaining_work.map{F(_.weight)}.sum == time_spent - F(output.weight) <= 0
+```
+So to prove this inductively, suppose we combine two items with weights `a <= b`. We compare the
+critical quantity before and after this update, to obtain
+```
+Delta == time_already_spent_after - remaining_work_after.map{F(_.weight)}.sum
+       - time_already_spent_before + remaining_work_before.map{F(_.weight)}.sum
+      <= a - F(a+b) + F(a) + F(b)
+      == a - a*log2(a+b) - b*log2(a+b) + a*log2(a) + b*log2(b)
+      == a - a*log2(1 + b/a) - b*log2(1 + a/b)
+       < a - a*log2(1 + b/a)
+      <= 0
+```
+The last inequality holds because `a <= b` and hence `log2(1 + b/a) >= 1`.
+
+### More examples
+
+A good example of what not to do is Java's `Collectors.toList()`. This is intended for use on
+parallel streams; it uses an internal `ArrayList`. However, `ArrayList` only permits fast append, no
+fast prepend. Yeah, lol, quadratic runtime (because associativity depends on races).
+
+How it should be done can be seen in `ForkJoinParallelCpgPass`. Morally speaking, the relevant
+function is
+```scala
+def combine[T](left: mutable.ArrayDeque[T], right: mutable.ArrayDeque[T]): mutable.ArrayDeque[T] = {
+  if (left.size < right.size) right.prependAll(left)
+  else left.appendAll(right)
+}
+```
+This is the fundamental point: If you want `fold` to be fast, then you must handle both cases.
+
+Another example that works is `++` for Scala `Vector`. It is a fun exercise to step through the code
+and see that both `prependedAll` and `appendedAll` are fast!
+
+With respect to performance, it is recommended to use mutable data structures by default, and only
+use all the immutable stuff if necessary: That is, if you would otherwise require locks or if your
+algorithm requires snapshots.
+
+## Java Stream Collector
+
+It is worthwhile to take another look at the Java stream collector. It is used in `CpgPass` like this:
+
+```scala
+// parts: Array[T]
+  externalBuilder.absorb(
+    java.util.Arrays
+      .stream(parts)
+      .parallel()
+      .collect(
+        new Supplier[DiffGraphBuilder] {
+          override def get(): DiffGraphBuilder =
+            new DiffGraphBuilder
+        },
+        new BiConsumer[DiffGraphBuilder, AnyRef] {
+          override def accept(builder: DiffGraphBuilder, part: AnyRef): Unit =
+            runOnPart(builder, part.asInstanceOf[T])
+        },
+        new BiConsumer[DiffGraphBuilder, DiffGraphBuilder] {
+          override def accept(leftBuilder: DiffGraphBuilder, rightBuilder: DiffGraphBuilder): Unit =
+            leftBuilder.absorb(rightBuilder)
+        }
+      )
+  )
+```
+The stream collect API is, at its core, a glorified parallel fold. Noteworthy things are:
+1. We don't supply a single accumulator; we supply an accumulator factory. This is important for
+parallelism, otherwise we'd degrade to a `foldLeft`!
+2. Ideally we only get one accumulator per CPU core. Each accumulator uses its `runOnPart` to absorb
+the output of the next `part`. In particular, we don't allocate an accumulator (i.e. a new
+`DiffGraphBuilder`) for every `part` -- that would limit us to parallelisms where each `part` is large
+(e.g. each `method`) and would be bad for the case of many cheap `parts`.
+3. Note how we collect everything into an array before building the stream. We do _not_ use some
+kind of generic `Spliterator` like the one from `java.util.Spliterators.spliteratorUnknownSize`. Read the code of
+that function and consider whether that would be a good idea in the context of `CpgPass`.
+4. The code for merging, i.e. `leftBuilder.absorb(rightBuilder)`, has already been discussed above
+(especially the fact that it is `ArrayDeque`-based).
+
+## API considerations: Beware of Iterator!
+
+You should not offer a generic API that works on a (lazy) `Iterator`. If you decide to do that, then
+immediately and single-threadedly collect the `Iterator`, before passing it on to complex higher order
+functions like fold or stream-collect.
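+
+As a minimal sketch of this advice (the names `processAll` and `expensive` and the pool size are
+hypothetical and not part of Joern), an API that accepts an `Iterator` could drain it on the
+calling thread before handing any work to an executor:
+```scala
+import java.util.concurrent.{Callable, Executors}
+
+// Hypothetical stand-in for an expensive computation.
+def expensive(item: Int): Int = item * item
+
+def processAll(iter: Iterator[Int]): List[Int] = {
+  // Drain the iterator on the calling thread; whatever it hides runs here, single-threaded.
+  val items = iter.toVector
+  val executor = Executors.newFixedThreadPool(4) // pool size is an arbitrary assumption
+  try {
+    // `Vector.map` is strict, so every task is submitted before the first `get()` is called.
+    val futures = items.map { item =>
+      executor.submit(new Callable[Int] { override def call(): Int = expensive(item) })
+    }
+    futures.map(_.get()).toList
+  } finally executor.shutdown()
+}
+```
+With these definitions, `processAll(Iterator(1, 2, 3))` should evaluate to `List(1, 4, 9)`, with
+the results in input order because the futures are awaited in the order they were submitted.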
+
+The reason is that iterators are lazy. You don't know what side-effects and computations are hidden
+in them! And if execution of the iterator is triggered by complex higher order functions, then this
+may very well cause data races!
+
+Or look at some code that we found in Joern:
+```scala
+iter.map{item => executer.submit{() => expensive_function(item)}}.map{_.get()}.toList
+```
+where `iter` was an `Iterator`. This code wanted to run `expensive_function` in parallel on all
+items in the iterator, and collect the results into a `List`.
+
+In fact this code was single-threaded, because iterators and maps of iterators are lazy. So
+nothing is called until `toList` tries to collect the first output; it then
+schedules the first computation, awaits its result, and only then schedules the second computation.
+
+TLDR: Iterators are scary. Collect them eagerly.
+
+Special thanks to [@bbrehm](https://github.com/bbrehm) for the write-up!
diff --git a/docs.joern.io/content/upgrade-guides.md b/docs.joern.io/content/upgrade-guides.md
index 5a6727e4..d3cd6cc6 100644
--- a/docs.joern.io/content/upgrade-guides.md
+++ b/docs.joern.io/content/upgrade-guides.md
@@ -1,6 +1,6 @@
 ---
 id: upgrade-guides
-title: Upgrade guides
+title: Upgrade Guide
 weight: 150
 ---