Create new script for adding oral argument (and later other forms of) data to Elastic Search #2677

Closed
mlissner opened this issue Apr 24, 2023 · 3 comments

@mlissner
Member

Currently we have cl_update_index, but I think django-elasticsearch might have something it supplies out of the box, or perhaps we'll need to make our own to get the performance we need. In any case, we'll need something like this so we can ingest the old data in the DB.

@albertisfu
Contributor

According to #2676, the main function of cl_update_index is to populate the index for the first time.

django-elasticsearch-dsl has built-in management commands for indexing existing data.
Its --parallel option is based on elasticsearch-py's parallel_bulk, which uses multiprocessing.pool.ThreadPool. By default it uses 4 threads and sends chunks of 500 objects to ES at a time.

docker exec -it cl-django python /opt/courtlistener/manage.py search_index --rebuild --models audio.Audio --parallel
(Creates the index and populates it.)

docker exec -it cl-django python /opt/courtlistener/manage.py search_index --populate --models audio.Audio --parallel
(Index exists, only populates it.)
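To make the thread and chunk defaults above concrete, here is a minimal standard-library sketch of the same pattern: split the objects into chunks of 500 and hand the chunks to a 4-thread ThreadPool. It is only an illustration, not the elasticsearch-py implementation. The names chunked, send_chunk and parallel_index are hypothetical, and send_chunk stands in for a real bulk request to ES.

```python
from itertools import islice
from multiprocessing.pool import ThreadPool


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def send_chunk(chunk):
    """Hypothetical stand-in for one bulk request to ES.

    Returns the number of documents it was given.
    """
    return len(chunk)


def parallel_index(docs, thread_count=4, chunk_size=500, send=send_chunk):
    """Send chunks of docs through a thread pool; return the total sent."""
    pool = ThreadPool(thread_count)
    try:
        # imap feeds chunks to worker threads and yields results in order.
        return sum(pool.imap(send, chunked(docs, chunk_size)))
    finally:
        pool.close()
        pool.join()
```

With these defaults, 20,000 objects become 40 chunks spread over 4 threads.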

I ran some tests: adding 20,000 Audio objects to ES took ~35 minutes (using 1 shard).
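As a rough sanity check on that figure, the arithmetic can be sketched as below. The 1,000,000-object total is a hypothetical number chosen only for illustration, and the projection assumes throughput scales linearly, which may not hold in practice.

```python
def docs_per_second(n_docs, minutes):
    """Average indexing rate over a timed run."""
    return n_docs / (minutes * 60)


def hours_to_index(total_docs, rate):
    """Projected wall-clock hours at a constant rate (an assumption)."""
    return total_docs / rate / 3600


# Measured run from the test above: 20,000 objects in ~35 minutes.
rate = docs_per_second(20_000, 35)

# Hypothetical total, for illustration only.
projected_hours = hours_to_index(1_000_000, rate)
```

At this rate the test run works out to roughly 9.5 objects per second.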

However, a couple of issues related to parallel_bulk report performance problems and memory leaks when indexing millions of items:

elastic/elasticsearch-py#1101
django-es/django-elasticsearch-dsl#433 (This one mentions that the issue is present when using MySQL)

So we could try this command and see how it works for OA, or we could go straight to writing a new ES version of cl_update_index based on Celery.

What do you think?

@mlissner
Member Author

Yeah, let's give it a try and see how far we can take it. It'll be great if it's good enough and we can use it without needing to get Celery involved (!)

@mlissner
Member Author

I think it's safe to say this is done. As we need more data types, we'll add them.

Projects
Status: Done