Giant vs Aleph/Datashare? #40

Open
ninoppp opened this issue Mar 25, 2022 · 16 comments

Comments

@ninoppp

ninoppp commented Mar 25, 2022

Hey there

I'm currently checking out different platforms for investigations to set up for a small organization.
While we're doing local testing, I'd also like to ask you directly:

What are the main differences between Giant and the two main alternatives? Which features do the others not have, which ones is Giant missing? Why are you maintaining Giant instead of using Datashare or Aleph? Would you recommend it to a small media organization?

Thanks for bothering :)

Edit: another question: You mention a Platform for Investigations suite. I did not find any other tools belonging to that suite publicly. Is Giant the only open source one?

@philmcmahon
Contributor

philmcmahon commented Mar 28, 2022

Hey @ninoppp - thanks for raising this issue! Tagging my colleagues @itsibitzi and @joelochlann who should be able to help with some of the differences between Aleph and Datashare as they've got more experience of using these other platforms and might have a better idea of the pros/cons of each vs giant.

In terms of whether I'd recommend giant to a small media organisation, the main thing is that I think you would be the first external organisation to use giant. This would be exciting for us and we'd love to help you get it set up - but it may not be a smooth journey! You may find the Aleph/Datashare docs are more complete. In particular, Aleph has been open source from the beginning as far as I know, so there's a greater chance their documentation has been tried and tested by external users. From a technical perspective, Giant currently uses both Neo4j and Elasticsearch databases, which can be a challenge to manage - though you can use managed versions of these to reduce the maintenance effort.

Giant was originally designed with a focus on being able to ingest data as fast as possible. It has an ingestion pipeline that can scale horizontally in order to boost performance. This was based on the Guardian's experience of other investigations where getting data in front of journalists as fast as possible was essential.

One defining feature of giant, which is hopefully coming (very) soon, is the ability to search for text and then view the text highlighted in place - see #38. With Aleph/Datashare, my understanding is that it is common to need to download documents rather than viewing/searching within the platform.

Re the 'platform for investigations' question - this is the only open source tool in the suite...for now! Originally giant was called 'pfi' with the idea that it would include multiple different tools, whereas now it is focussed on searching/sharing documents securely.

Both Aleph and Datashare have versions you can quickly try out I think! For giant we could potentially give you a demo on a zoom call at some point if you're interested.

@ninoppp
Author

ninoppp commented Mar 28, 2022

Thanks a lot for the response @philmcmahon !

I see the point about being the first ones to adopt it. However, if the software's good and you are motivated to help us every now and then, I think it would be worth it.

Actually, my biggest concern: You seem to have AWS S3 baked into the software. Using an external service, especially Amazon, wouldn't really be compatible with our OpSec model. How difficult would it be to use it with local storage only? And, since we're probably starting out with somewhat limited hardware (24 cores, 64 GB RAM, main storage on HDDs), would it be feasible performance-wise?

The text highlighting indeed isn't present in the other options - Aleph doesn't have it at all (I think) and Datashare only has it for the extracted text, not inside the PDF/whatever. Quite fancy :)

I've already taken a look at Aleph's and Datashare's demos. Will probably do some testing on our production hardware once that's in place (a few weeks). Was planning to try setting up Giant in a VM, but a virtual demo would of course make things a lot easier :) Maybe we could continue this conversation over email.

this is the only open source tool in the suite...for now
Are there any concrete plans for releasing other tools? If they have compatibility benefits with Giant, that would of course make it an even more interesting option...

@philmcmahon
Contributor

philmcmahon commented Mar 28, 2022

Hey @ninoppp, I've sent you an email, but I'll answer some of your other questions here in case that's useful for others.

How difficult would it be to use it with local storage only?

Definitely possible - we use https://min.io/ when running giant locally. Whilst we've been mostly running giant in AWS recently it was always designed with the idea that it should work offline.

Are there any concrete plans on releasing other tools?

Sorry for the air of mystery there. Right now we don't have any plans. Our team is growing at the moment though so giant development should pick up a bit after a fair while leaving it untouched.

@pirhoo

pirhoo commented May 4, 2022

Hi guys,

Pierre from ICIJ here!

Wonderful news to see you opened the source code of Giant on GitHub.

As suggested in this thread, it would be cool to establish a comparison matrix to help our communities choose a solution.

That might be a joint effort with our friends from Aleph (cc @Rosencrantz).

WDYT?

Thanks @ninoppp for raising the issue :)

@Rosencrantz

Hi @pirhoo @philmcmahon

This sounds like an excellent idea. Some sort of side by side comparison would, I think, be worthwhile to help people make an informed decision on the right platform for their organisation.

@ninoppp Just to clarify a couple of points. Aleph does support/provide search term highlighting and document search without the need to download documents. If you'd like to learn more about using/running aleph we can always add you to our community slack channel.

:)

@philmcmahon
Contributor

Hey @pirhoo @Rosencrantz sounds good! Should we have a quick call some time to work out what format it should be in? I guess we could just list the features and work out where there's overlap.

A key thing with giant is that nobody outside the guardian has tried to run it (yet!) so we'll need a bit of a content warning.

@pirhoo

pirhoo commented May 6, 2022

Hi guys, what about a call on Friday 13th? Let's say 2pm London time?

My team and I are on Paris time :)

@Rosencrantz

Perfect timing... except I'm going to be on the way back from a meeting in Sarajevo and will be offline, but is there a chance that we'll all be at dataharvest together?

@ninoppp
Author

ninoppp commented May 7, 2022

Very nice to see all of you connecting here :) In case you need a third party for something, feel free to hit me up.

@pirhoo

pirhoo commented May 9, 2022

Perfect timing... except I'm going to be on the way back from a meeting in Sarajevo and will be offline, but is there a chance that we'll all be at dataharvest together?

I'll be at Dataharvest too :)

@philmcmahon
Contributor

Hey, sorry for the slow reply! Sadly we (the Guardian) won't be at Dataharvest :( Could we meet when you get back? We're on London time.

@Rosencrantz

Absolutely! Sounds like a plan. How about the 26th of May?

@pirhoo

pirhoo commented May 17, 2022

I can't make it on the 26th. Maybe the 25th?

@pirhoo

pirhoo commented May 24, 2022

So @Rosencrantz and I met at Dataharvest, we partied a little too much and talked about everything but this very specific topic! I'm off for some time until June 8. Maybe we can plan a meeting after that?

@philmcmahon
Contributor

Doh, I'm not doing a great job of keeping up with this - how about 9th June?

@pirhoo

pirhoo commented May 25, 2022

Could work for me!

4 participants