Merge pull request #32 from unicef/feature/feature-doc
Add more information about magasin features
merlos authored Feb 2, 2024
2 parents f299862 + 3ff0b46 commit 08942e0
Showing 4 changed files with 79 additions and 25 deletions.
19 changes: 6 additions & 13 deletions docs/architecture.qmd
@@ -3,7 +3,6 @@ title: "Architecture"
description-meta: "More about the architecture of magasin"
format: html
---
# TODO: Complete

Magasin is a scalable end-to-end data platform based on open-source components that runs natively in a [Kubernetes cluster](https://kubernetes.io).

@@ -25,8 +24,13 @@ Kubernetes is a container orchestration system designed to automate the deployme
## Helm charts
Magasin uses Kubernetes in combination with [Helm](https://helm.sh/), a package manager for Kubernetes applications. Helm is the Kubernetes equivalent of package managers such as apt, pip, npm, pacman, snap, and conda. Using Helm, users specify the configuration of the Kubernetes resources required to deploy magasin through a values file or command-line overrides. A package in Helm is called a **chart**.


## Loosely-coupled architecture

A fundamental difference between magasin and other Helm-based Kubernetes applications lies in the architectural approach. Typically, an application has a single root Helm chart that governs all deployment rules. In magasin, however, each component is an independent Helm chart. This design yields a loosely coupled architecture: rather than mandating a rigid structure for the whole platform, magasin lets you choose which components to deploy and how to integrate them.
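
The one-chart-per-component idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not magasin's installer: the `magasin/` repository alias, the chart names, and the namespace naming convention are hypothetical. The only things taken from Helm itself are the `helm install` command and its `--namespace`, `--create-namespace`, `--values`, and `--set` options.

```python
from typing import Dict, List, Optional

# Hypothetical chart references: one independent Helm chart per component.
# The "magasin/" repository alias and the chart names are assumptions.
COMPONENT_CHARTS: Dict[str, str] = {
    "dagster": "magasin/dagster",
    "minio": "magasin/minio",
    "drill": "magasin/drill",
}


def helm_install_command(
    component: str,
    values_file: Optional[str] = None,
    overrides: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build the `helm install` argument list for a single component."""
    chart = COMPONENT_CHARTS[component]  # KeyError for unknown components
    namespace = f"magasin-{component}"   # assumed naming convention
    cmd = ["helm", "install", component, chart,
           "--namespace", namespace, "--create-namespace"]
    if values_file:
        # Configuration supplied through a values file.
        cmd += ["--values", values_file]
    for key, value in sorted((overrides or {}).items()):
        # Configuration supplied through command-line overrides.
        cmd += ["--set", f"{key}={value}"]
    return cmd


def plan_installation(selected: List[str]) -> List[List[str]]:
    """Return one independent install command per selected component."""
    return [helm_install_command(name) for name in selected]


if __name__ == "__main__":
    # Only the selected components are installed; the rest are simply skipped.
    for command in plan_installation(["dagster", "drill"]):
        print(" ".join(command))
```

Because each command stands alone, leaving a component out of the selection does not affect how the others are deployed, which is the practical meaning of loose coupling here.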



# Core components

## Ingestion: Dagster
@@ -38,13 +42,12 @@ Dagster's Dagit UI provides visibility of pipelines' tasks, scheduling, run stat

## Cloud storage: MinIO

TODO
MinIO is an open-source, high-performance object storage system designed for cloud-native and containerized applications. Founded in 2014, MinIO offers an S3-compatible API, so tools and libraries built for Amazon S3 can use it with little or no change. It is known for its simplicity, scalability, and speed. MinIO's architecture is optimized for modern data workloads and uses erasure coding and distributed techniques to provide data resilience and high availability. With its lightweight footprint and easy deployment on standard hardware, MinIO lets developers build scalable storage infrastructure for on-premises, hybrid, or multi-cloud environments.
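
As a minimal sketch of what S3 compatibility means in practice, the snippet below points a generic S3 client at a MinIO endpoint instead of AWS. It assumes the third-party `boto3` package is installed; the endpoint URL and credentials in the usage comment are placeholders, and the helper names are hypothetical.

```python
from typing import Any, Dict, List


def s3_client_kwargs(endpoint_url: str, access_key: str, secret_key: str) -> Dict[str, Any]:
    """Keyword arguments that point a generic S3 client at a MinIO endpoint."""
    return {
        "service_name": "s3",
        "endpoint_url": endpoint_url,  # MinIO address instead of the AWS default
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }


def make_client(endpoint_url: str, access_key: str, secret_key: str):
    """Create a boto3 S3 client; boto3 is imported lazily (third-party package)."""
    import boto3

    return boto3.client(**s3_client_kwargs(endpoint_url, access_key, secret_key))


def list_bucket_names(client) -> List[str]:
    """Return the bucket names visible to the given S3 client."""
    return [bucket["Name"] for bucket in client.list_buckets()["Buckets"]]


# Usage (placeholder endpoint and credentials, requires a reachable MinIO):
#   client = make_client("http://localhost:9000", "ACCESS_KEY", "SECRET_KEY")
#   print(list_bucket_names(client))
```

The only MinIO-specific part is `endpoint_url`; everything else is the same code one would write against any S3-compatible store.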

## Query engine: Apache Drill

[Apache Drill](https://drill.apache.org/) is an open-source, schema-free query engine that provides a SQL interface to a wide range of non-relational datastores, such as NoSQL databases and collections of files such as JSON, CSV, ESRI shapefiles, SPSS & SAS formats, Parquet, and others.


While [data marts](https://en.wikipedia.org/wiki/Data_mart) for specific business functions or locations traditionally require hosting and maintaining a relational database on a server or virtual machine, Apache Drill enables comparable functionality without the need to run and host a database or to track schema changes from source systems over time.

Instead, a Dagster ingestion and transformation pipeline stores an 'analyst-ready' dataset that Apache Drill can query directly.
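
Querying a stored dataset directly can be sketched with Drill's REST interface, which accepts SQL as JSON on the `/query.json` endpoint. The example below is an assumption-laden sketch: the storage plugin name `s3`, the file path, and the base URL are hypothetical and depend on how Drill is configured in a given deployment.

```python
import json
import urllib.request


def build_drill_request(base_url: str, sql: str) -> urllib.request.Request:
    """Build a POST request for Drill's REST query endpoint (/query.json)."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(base_url: str, sql: str) -> list:
    """Send the query and return the result rows (requires a reachable Drill)."""
    with urllib.request.urlopen(build_drill_request(base_url, sql)) as response:
        return json.load(response).get("rows", [])


# Hypothetical storage plugin ("s3") and file path; no table or schema is
# declared beforehand, the Parquet file itself is the query target.
EXAMPLE_SQL = "SELECT * FROM s3.`datasets/analyst_ready.parquet` LIMIT 10"

# Usage (placeholder URL, requires a running Drill instance):
#   rows = run_query("http://localhost:8047", EXAMPLE_SQL)
```

The query names a file rather than a table, which is what removes the need for a hosted relational database in this setup.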
@@ -61,13 +64,3 @@ The multi-tenant JupyterHub component creates on-demand, isolated pods for authe
### Dask Gateway
Dask Gateway allows easy utilization of a Dask cluster from notebook environments for distributed computation of massive datasets or parallelizable operations.


References:

* Jupyterhub - https://jupyterhub.readthedocs.io/en/stable/reference/index.html

* Jupyterhub - kubernetes https://z2jh.jupyter.org/en/latest/index.html

* Authentication for Jupyterhub https://oauthenticator.readthedocs.io/en/latest/index.html

* AWS Public Sector Blog article on [Analyze terabyte-scale geospatial datasets with Dask and Jupyter on AWS](https://aws.amazon.com/blogs/publicsector/analyze-terabyte-scale-geospatial-datasets-with-dask-and-jupyter-on-aws/)
23 changes: 17 additions & 6 deletions docs/docs-home.qmd
@@ -3,11 +3,22 @@ title: "Magasin documentation"
format: html
---

# TODO
Magasin is the cloud-native, open-source, end-to-end data platform.
Magasin enables organizations to perform automatic data ingestion, storage, analysis, ML/AI compute, and visualization at scale.


## Learn more about magasin
* **[Why magasin?](why-magasin.qmd)**
* **[Architecture](architecture.qmd)**

## Start using magasin
* **[Get Started](get-started/)**. Quick
* **[Installation](install/)**. More detailed explanation of the base installation.
* **[End user guides](end-user-guides.qmd)**. References to learn how to use the components.
* **[Deployment](deployment.qmd)**. Learn more about the configuration of the different components.

## Develop magasin

* **[Contributing](contributing/)**. For developers and contributors.

Some intro of magasin, and the main categories to get started,

- more details about magasin
- get started
- Amnisitrator guides
Main page
Binary file added docs/images/why-of-fry/data-landscape.png
62 changes: 56 additions & 6 deletions docs/why-magasin.qmd
@@ -1,12 +1,62 @@
---
title: Why magasin?
number-sections: false
---
# TODO

Write down the reasoning behind creating magasin that we've used
- complex landscape
- open source
## The challenge

- good value for money
In today's data-informed world, governments and organizations face a monumental challenge: fragmented data spread across siloed systems. Departments, divisions, and units gather data independently, leading to inefficiencies and risks:

etc.
### Fragmentation in tools and capacity

* **Tool Fragmentation:** Organizations, especially those without centralized structures, struggle with diverse technologies across teams, hindering resource mobility and causing technology duplication.

* **Capacity Issues:** Siloed work exacerbates resource allocation challenges, limiting the organization's overall potential.

### Risks of Data Breaches
* **Security Concerns:** Without secure data storage and sharing mechanisms, organizations risk data breaches and unauthorized access to sensitive information.

### Myopic Data Analysis
* **Lack of Comprehensive Insights:** Siloed data prevents organizations from gaining a holistic understanding of their operations and stakeholders, leading to shortsighted decision-making.

To overcome these challenges and unlock the full potential of modern data analysis, machine learning, and artificial intelligence, organizations need a comprehensive set of tools.

## Marketplace gaps
A look at the global market reveals several gaps.

### Overwhelming landscape

![Big Data Landscape 2019 by Matt Turck. [Source](https://mattturck.com/wp-content/uploads/2019/07/2019_Matt_Turck_Big_Data_Landscape_Final_Fullsize.png)](images/why-of-fry/data-landscape.png)


Entering the world of data can be daunting, with a myriad of products each requiring trade-offs.

### Leaders are solving a specific set of problems

Most data systems are optimized for massive scale and low latency, which are crucial for time-sensitive tasks like targeted advertising. However, not all organizations face such time-pressured scenarios.

### With systems that have a high cost of entry

These data systems are also not designed for low-end hardware or a low cost of entry, which further complicates the landscape for organizations exploring data solutions without deep pockets.

### That are proprietary

Traditional end-to-end data platforms often come with proprietary restrictions, limiting flexibility and tying organizations to specific cloud vendors or industry niches.

This presents significant challenges for entities with decentralized structures and external collaborations, in particular UNICEF's government partners, who demand a cloud-agnostic, open-source solution that delivers maximum value for their investment.

# Magasin was needed

Our solution is a game-changer in the data landscape. We offer a democratized approach to data analysis, empowering organizations of all sizes and structures to harness the power of data without compromising on cost or flexibility, and without proprietary vendor lock-in.

Magasin offers an open-source, cloud-native end-to-end solution covering data ingestion, storage, analysis, and visualization. Our [architecture](./architecture.qmd) is composed of loosely coupled, mature open-source solutions.

Magasin fits the needs of any public or private organization that generates data from multiple sources and needs to perform data science and ML/AI at scale.

For the public sector and NGOs, it is a cross-sectoral tool that can be implemented at the national, subnational, or department level.

For the private sector, it fits any organization that wants to leverage data science to stay relevant and improve business outcomes, with a low barrier to entry and full control of its data.


**Don't just keep up – lead the charge with magasin and transform your organization's future today**.
