Connect the external_user_ids to the User Retirement Process #449
Comments
fyi @Ian2012
We should discuss the desired outcomes of such an operation. The default system doesn't store this information for exactly this reason, but we need to consider the consistency of downstream data and decide where it's ok to keep aggregated or obfuscated data before implementing anything.
I think that is very reasonable: discuss and decide before implementing anything. In my view, the fact that the system does not store PII is a great starting point. All info from the xAPI statements is already safe to store, since it is already anonymized. For on-campus usage, however, having only anonymized data is not enough: teachers mostly want to understand trends in the course so they can still have an effect on individual students who are at risk of failing. To achieve this we can turn on the optional sink that transfers external_user_ids to ClickHouse.
Now, even if teachers get access to de-anonymized data for campus usage, this should not go on forever. Especially if the user deletes their account, the PII should no longer be accessible. I understand this as simply deleting the row in the sink table in ClickHouse (the LMS retirement process already does more than this), but since ClickHouse mostly appends new data, as it works now the old records would not be deleted when the row in the LMS is deleted. The only desired outcome I propose in this issue is that the corresponding row in the sink table is deleted or obfuscated. Any other xAPI statement or dbt-processed facts table should remain untouched.
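In ClickHouse terms, deleting a single sink row means issuing a mutation rather than a normal transactional delete. A minimal sketch of building such a statement, assuming hypothetical table and column names (`event_sink.external_id`, `lms_user_id`) that may differ from the real sink schema:

```python
# Sketch of the kind of ClickHouse mutation a retirement step could issue.
# The table and column names below are assumptions for illustration only.

def build_retirement_mutation(table: str, user_id: int) -> str:
    """Build an ALTER TABLE ... DELETE mutation for one retired user."""
    if not isinstance(user_id, int):
        raise TypeError("user_id must be an integer to avoid SQL injection")
    return f"ALTER TABLE {table} DELETE WHERE lms_user_id = {user_id}"

sql = build_retirement_mutation("event_sink.external_id", 42)
print(sql)
```

Note that `ALTER TABLE ... DELETE` is asynchronous by default in ClickHouse and rewrites affected data parts, which is why the cost concern below matters.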
Right, the problems I can foresee with this are:
I'm sure there are other small issues, but I want to make sure you're aware that just deleting those rows isn't a sufficient fix on its own. I don't see the delete itself as a huge problem unless there are a lot of users in the system. Currently I think we would have at most 3 rows per user in the LMS users database. If many users retired in a short period of time it might cause performance issues, but probably limited to the tables/reports that reference the external id directly, or tables downstream of it.
I had not considered this. I imagined that we already had the data anonymized, so we would do all calculations with the actor_id, and if and when someone wanted to de-anonymize, that would be done by joining the profile table right in the graph query. Right now I'm thinking about two possible paths worth exploring. The first would be to double down and make an obfuscation transformation so that we can alter the stored rows. The other path I can imagine at the moment would be to split the user_profile table into the PII fields and the non-PII fields. Then we could propose guidelines recommending that PII data is not combined into downstream tables, while maintaining the capability to further process things like language, gender, or country.
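As a rough illustration of the first path, an obfuscation transformation could replace PII fields with a one-way salted hash while keeping non-PII fields such as language, gender, or country usable for trend analysis. The field names and salt handling here are assumptions, not the actual user_profile schema:

```python
import hashlib

# Illustrative split of profile fields; the real schema may differ.
PII_FIELDS = {"name", "email"}
NON_PII_FIELDS = {"language", "gender", "country"}

def obfuscate_profile(row: dict, salt: str) -> dict:
    """Return a copy of the row with PII fields replaced by salted hashes."""
    out = {}
    for key, value in row.items():
        if key in PII_FIELDS:
            out[key] = hashlib.sha256((salt + str(value)).encode()).hexdigest()
        elif key in NON_PII_FIELDS:
            out[key] = value  # safe to keep for aggregate trend analysis
    return out

row = {"name": "Jane Doe", "email": "jane@example.com", "language": "es", "country": "CO"}
print(obfuscate_profile(row, salt="per-deployment-secret"))
```

Hashing rather than deleting keeps row counts stable in downstream joins, which is relevant to the "confusing users" point below.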
These two are related to confusing users, which I think can be mitigated if we do obfuscation instead of deleting the rows.
Do you think this risk warrants some rate limiting in the LMS? For instance, in the task that is dispatched when the user_profile changes.
We can mitigate some of the impact on database performance by creating a command to regenerate the MVs entirely, or to batch-update the records in downstream tables, processing multiple users per batch if needed.
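The batching idea could look roughly like this: collect retired user ids and emit one mutation per batch instead of one per user, so ClickHouse rewrites the affected data parts far fewer times. Table and column names are again hypothetical:

```python
from typing import Iterator

def batched(ids: list, size: int) -> Iterator[list]:
    """Yield successive fixed-size chunks of the id list."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def build_batch_mutations(table: str, user_ids: list, batch_size: int = 1000) -> list:
    """One ALTER TABLE ... DELETE mutation per batch of retired users."""
    return [
        f"ALTER TABLE {table} DELETE WHERE lms_user_id IN ({', '.join(map(str, batch))})"
        for batch in batched(user_ids, batch_size)
    ]

stmts = build_batch_mutations("event_sink.user_profile", list(range(1, 2501)), batch_size=1000)
print(len(stmts))  # 3 mutations instead of 2500
```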
I'm talking with Altinity about this now to get their advice. Rebuilding MV tables could result in lengthy downtimes once the database has significant data in it. One possibility is to use a DICTIONARY type to cache PII data in memory like we do for course data. That should be performant enough to join on, and act as a buffer to updates on the base PII tables, at the cost of using more ClickHouse memory. I'll let you know what Altinity suggests.
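For reference, the dictionary approach mentioned above might be declared along these lines. This is a sketch assuming a hypothetical `event_sink.user_profile` base table and field set, held as a SQL string so it could sit alongside migration code:

```python
# Sketch of a ClickHouse dictionary that caches PII lookups in memory.
# Database, table, and field names are assumptions for illustration.
USER_PII_DICT_DDL = """
CREATE DICTIONARY IF NOT EXISTS event_sink.user_pii_dict (
    lms_user_id UInt64,
    name String,
    email String
)
PRIMARY KEY lms_user_id
SOURCE(CLICKHOUSE(DB 'event_sink' TABLE 'user_profile'))
LIFETIME(MIN 300 MAX 600)  -- refresh window buffers updates to the base table
LAYOUT(HASHED())           -- whole dictionary kept in memory for fast joins
""".strip()

print(USER_PII_DICT_DDL)
```

The `LIFETIME` refresh interval is what provides the buffering against frequent updates to the base PII table, traded against dictionary staleness and memory use.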
So based on their feedback I think this is our best option:
This makes a few tradeoffs:
I would guess that the retirement steps would live in event-sink-clickhouse and be run as part of LMS retirement. It would have to run the mutation logic in ClickHouse, which could take a very long time. I'm not sure offhand if there's a way to fire those off async, or if we'd rather make Celery tasks for them. There are definitely concerns about Celery tasks failing invisibly and users not being retired.
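The "failing invisibly" concern can be sketched independently of Celery: wrap the (hypothetical) ClickHouse retirement call in bounded retries, and on permanent failure emit a loud, greppable log line that monitoring can alert on. Everything here is illustrative, not the event-sink-clickhouse API:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("retirement")

def retire_with_retries(run_mutation, user_id, max_attempts=3):
    """Run a retirement mutation with bounded retries; log loudly on
    permanent failure so a monitoring system can alert on unretired users."""
    for attempt in range(1, max_attempts + 1):
        try:
            run_mutation(user_id)
            return True
        except Exception as exc:
            log.warning("retirement attempt %d/%d failed for user %s: %s",
                        attempt, max_attempts, user_id, exc)
    log.error("USER RETIREMENT FAILED PERMANENTLY for user %s", user_id)
    return False

# Simulated mutation that fails once, then succeeds.
calls = []
def flaky_mutation(user_id):
    calls.append(user_id)
    if len(calls) < 2:
        raise RuntimeError("ClickHouse timeout")

ok = retire_with_retries(flaky_mutation, 42)
print(ok)  # True after one retry
```

With Celery, the same shape would map onto a task with `max_retries` plus an on-failure handler, but the key point is the alertable failure signal either way.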
I'm thinking of tackling this with the following steps. Comments/corrections/suggestions @bmtcril @felipemontoya @Ian2012?
ClickHouse/Alembic migrations in #529:
Retire user with openedx-event-sink-clickhouse:
Monitor event sink/user retirements:
This all sounds great, my expectation is that if the
@bmtcril I'm leaning towards leaving the "retire user" event sink on all the time, since User Retirement is in the platform by default. If
Makes sense to me, and it's one less setting to manage. 👍 |
@bmtcril I think this issue can be resolved? |
I believe so, @felipemontoya does the completed work meet your needs? |
I believe this is set now, closing. |
Now that we have an optional sink to transfer external_user_ids to ClickHouse, we would need to keep this table up to date.
When a user is retired in the LMS, the view destroys the PII information for the user.
https://github.com/openedx/edx-platform/blob/a313b3fd20b4e12d3d1b0b35a0c5eaef1d0cc771/openedx/core/djangoapps/user_api/accounts/views.py#L1175
However, the update should also go through to ClickHouse, even if the delete operation might be costly.
This issue came to my attention thanks to the Spanish universities when discussing Aspects and GDPR.