Next-generation data transformation framework for TypeScript that puts developer experience first
Nowadays, almost every developer works with increasingly complex and varied types of data. While tooling for this problem already exists, current solutions are cumbersome to use, targeted at big enterprises, and put little to no emphasis on developer experience.
TypeStream allows you to get started within seconds, iterate blazingly fast over type-safe transformation code and work with common data storage services either locally or in the cloud.
Here's how it could be integrated into your workflow:
Make sure you have Node.js (at least 16.0.0) installed and scaffold a new project using:
$ npm init typestream -- --get-started
Note: Right now, we only officially support Visual Studio Code, as some important TypeStream features like zero-setup debugging require editor-specific configuration.
To get started developing your project, open the created folder in VS Code. At this point, you will probably be asked whether you want to use the workspace TypeScript version: press "Allow" to continue. If you don't see the prompt, you can also configure this manually.
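One way to do that manually is a .vscode/settings.json entry along these lines (this is a standard VS Code setting, not something TypeStream-specific):

{
  // Use the TypeScript version from the project's node_modules
  // instead of the one bundled with VS Code
  "typescript.tsdk": "node_modules/typescript/lib"
}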
Pipes are at the core of what TypeStream does as they contain the data transformation code of your project. Since you've specified the --get-started flag while creating the project, you should already see a pipe under src/pipes/transform-product.ts. Feel free to read through it to get a general idea of what it contains.
To try out the pipe and experiment with changes, you can start TypeStream in watch mode. To do that, open up an integrated terminal (this is necessary for debugging support) and run the following command:
$ npx tyst watch <pipe-name>
Make sure to replace <pipe-name> with the name of the pipe you want to work on. If you're following the getting started guide, that's going to be transform-product.
If everything's working correctly, TypeStream should now download a number of sample files and then attempt to process them using the pipe. Since you're in watch mode, TypeStream will start over whenever you save the file, allowing you to quickly experiment with changes to your transformation.
At this point, feel free to play around with the code and give all of TypeStream's different features a try, some of which are documented in the example file, others right here in the README.
If you get stuck with anything, want to suggest a new feature, or share general feedback, please don't hesitate to reach out to us by creating an issue — we'd love to hear from you! ❤️
When writing software, being able to directly see how the changes you've made affect the output is key for efficient and fun development. That's why we have designed TypeStream in a way that lets you see your transformed data anywhere in your pipeline and update it every time you save your code. If there are errors in your transformation, you will get an aggregated overview of the complete sample of data points you're testing on.
When working with a lot of data, it's impossible to know every edge case upfront. That's why you'll hit a breakpoint right when an edge case breaks your transformation code, so you can see what the outlier data looks like. You can also set your own breakpoints anywhere in your transformation code and step through one data sample at a time.
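As a rough sketch, assuming the pipe runs under VS Code's debugger via tyst watch, a plain debugger statement works as a manual breakpoint (the rawProducts resource and the price check are made up for illustration):

export default definePipe(rawProducts, async ctx => {
  const product = await ctx.doc.asJson()

  // Pause whenever a product is missing a price so the raw document
  // can be inspected in the debugger
  if (product.price === undefined) debugger

  // ...rest of your transformation
})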
Everyone who has used a strictly typed language before will love features like advanced IntelliSense, catching bugs at compile time and the like. Using typed(), you can infer the type of any variable in your pipe based on a statistically relevant sample.
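For instance, mirroring the getting-started pipe (the title field is only for illustration), wrapping a freshly parsed document in typed() gives you a concrete type to work with:

// Infer a 'RawProduct' type from a representative sample of documents
const product = typed('RawProduct', await ctx.doc.asJson())

// product now carries the inferred type, so the editor can autocomplete
// fields and flag keys that never appear in the sample
console.log(product.title)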
Want to read and write data from your local file system, Google Cloud Storage, S3, BigQuery or Redshift? All at once? No problem! TypeStream’s modular resource system allows you to read from and write to the most common storage systems.
To keep things more maintainable or to aggregate multiple streams of data into one you can push into a resource in one pipe and consume it in the next.
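A rough sketch of that chaining, using the definePipe and ctx.publish API described later in this README (the rawEvents and cleanedEvents resources are made up, and metadata is omitted for brevity):

// Pipe 1: clean up raw events and push them into an intermediate resource
export default definePipe(rawEvents, async ctx => {
  const event = await ctx.doc.asJson()
  ctx.publish({ resource: cleanedEvents, data: event })
})

// Pipe 2 (in its own file): consume the intermediate resource
export default definePipe(cleanedEvents, async ctx => {
  const event = await ctx.doc.asJson()
  // ...aggregate or enrich the cleaned events here
})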
The three core concepts to understand when working with TypeStream are resources, documents and pipes. To make each of them more tangible, we will work with an example use case. If you want to get a more hands-on feel for them, you can also follow the getting started guide. An example use case could be that you have raw product data from two different eCommerce platforms - let's say Amazon and eBay. Your goal is to take the raw data from each provider, transform it into a common format and put it into a common storage so you can work with it.
One resource holds many documents that are all described by the same concept and have a similar structure. Each resource also has metadata that describes where its data can be retrieved from. Thus, for all of your raw Amazon and eBay products you could define your resources as follows:
// Raw Amazon products stored in S3
const amazonProduct = new S3Resource('raw-amazon-product', {
  region: 'eu-central-1',
  bucket: 'business-data',
  pathPrefix: 'amazon-products/2022/',
})

// Raw eBay products stored in Google Cloud Storage
const ebayProduct = new CloudStorageResource('raw-ebay-product', {
  cloudStorageProject: 'typestream',
  bucket: 'business-data',
  pathPrefix: 'ebay-products/2022/',
})

// Used to write the transformed data into
const allProducts = new FileResource('transformed-product', {
  basePath: '/Users/typestream/data',
  recursive: true,
})
Note that each type of storage has its own resource class with its own set of required parameters. As of now, TypeStream supports the following resources:
- Google Cloud Storage
- AWS S3
- BigQuery
- AWS Redshift (coming soon...)
- Local file system
The standard authentication method for both GCP and AWS is via default credentials. You can find the documentation on how to set these up for each platform here:
Alternatively, you can also provide explicit authentication for a project. If these environment variables are set, default credentials will be ignored entirely. You can set the environment variables by putting their values in the generated .env file of your project:
- GOOGLE_APPLICATION_CREDENTIALS, which has to be a path to a service-account key. Use the docs for reference.
- AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Use the docs for reference.
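A sketch of what the relevant part of that .env file could look like (the path and keys below are placeholders):

# Explicit GCP authentication: path to a service-account key file
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# Explicit AWS authentication
AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=your-secret-access-key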
Documents are the containers of the data you’re working with. While you will never have to create a document yourself because TypeStream takes care of this under the hood, it makes sense to understand their properties.
Each document has data, which will usually be in the form of a Buffer. You can call the read() method of the document to retrieve the data in raw form, or use helpers like asJson(), asHtml() or asText() to automatically parse the data into the respective format. If the document doesn't contain valid JSON, for example, an error will be thrown.
const buffer = await doc.read() // Buffer
const json = await doc.asJson() // any
const html = await doc.asHtml() // HTMLElement (node-html-parser)
const text = await doc.asText() // string
You can also work with the document’s metadata without ever calling read() on it. What this looks like depends on what kind of resource the document belongs to. Metadata could for example hold information about the MIME type of a Google Cloud Storage object or the path of a file in the local file system.
if (doc.metadata.contentType === 'application/json') console.log('Found JSON!')
Pipes are the essential building blocks when working with TypeStream. You can think of them as connectors between resources.
Each pipe has an origin resource from which it will consume data. When defining the pipe, you can transform the data of a document and then publish it to one or more target resources.
Working with the example from above, you could write a pipe that reads the documents from amazonProduct, transforms them in any desired way and publishes them to the allProducts resource.
export default definePipe(amazonProduct, async ctx => {
  const rawProduct = typed('RawProduct', await ctx.doc.asJson())

  // Your transformation code goes here...
  const transformedData = rawProduct

  ctx.publish({
    resource: allProducts,
    data: transformedData,
    metadata: { name: transformedData.name },
  })
})
You can now write a second pipe for your ebayProduct resource and also publish them into allProducts.
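A sketch of that second pipe, reusing the resources defined above (the eBay-specific mapping is left as a comment and the RawEbayProduct type name is just an example):

export default definePipe(ebayProduct, async ctx => {
  const rawProduct = typed('RawEbayProduct', await ctx.doc.asJson())

  // Map the eBay-specific fields into your common format here...
  const transformedData = rawProduct

  ctx.publish({
    resource: allProducts,
    data: transformedData,
    metadata: { name: transformedData.name },
  })
})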
When hosted via TypeStream Cloud, these pipes will listen for new objects being added to your resources
and process them automatically.
When transforming a lot of data, you easily find yourself repeating the same processes time and time again. To mitigate this problem, TypeStream comes with a few simple utilities. Each of these utilities is further documented in the TypeStream library.
While using tyst watch on a pipe, dump() can be used to store all intermediate results in a single file. This can be used to quickly understand how changes in the transformation code affect the output. Every time you save your pipe, dump() will overwrite the file with the new intermediate results.
const intermediateResult = {
  /** ...your data here */
}
dump(intermediateResult)
pick() can be used to comfortably select a few keys from a messy object. If the object is typed, you will also get autocomplete and type errors on the keys you choose.
const messyObject = { key1: 1, key2: 2, key3: 3, key4: 4, key5: 5 }
const prunedObject = pick(messyObject, ['key1', 'key3'])
When extracting data from server-side rendered applications, automatically extracting the hydration data from an HTML response can save a lot of time and nerves.
const hydration = extractJsonAssignments(htmlString) // from a raw HTML string
const hydration = extractJsonAssignmentsFromDocument(htmlElement) // from a parsed HTMLElement (e.g. await doc.asHtml())
const hydration = extractJsonScriptsFromDocument(htmlElement) // JSON <script> contents from a parsed HTMLElement
Utilities to write more readable code when dealing with arrays:
products.sort(basedOn(_ => _.price, 'desc'))
products.sort(basedOnKey('price', 'desc'))
products.sort(
  basedOnMultiple([
    ['price', 'desc'],
    ['discount', 'asc'],
  ]),
)
sumOf(products.map(product => product.price))