
Connect the external_user_ids to the User Retirement Process #449

Closed
felipemontoya opened this issue Oct 5, 2023 · 15 comments

felipemontoya commented Oct 5, 2023

Now that we have an optional sink to transfer external_user_ids to ClickHouse, we need to keep this table up to date.
When a user is retired in the LMS, the view destroys the PII for that user:
https://github.com/openedx/edx-platform/blob/a313b3fd20b4e12d3d1b0b35a0c5eaef1d0cc771/openedx/core/djangoapps/user_api/accounts/views.py#L1175

However, the update should also propagate to ClickHouse, even if the delete operation might be costly.

This issue came to my attention when discussing Aspects and GDPR with the Spanish universities.

@felipemontoya

fyi @Ian2012


bmtcril commented Oct 6, 2023

We should discuss the desired outcomes of such an operation. The default system doesn't store this information for this reason, but consistency of downstream data will need to be considered and decisions about where it's ok to keep aggregated or obfuscated data will need to be made before implementing anything.

@felipemontoya

I think that is very reasonable. Discuss and decide before implementing anything.

In my view, the fact that the system does not store PII is a great starting point. All info from the xAPI statements is already safe to store, as it is already anonymized.

For on-campus usage, however, having only anonymized data is not enough: teachers mostly want to understand trends in the course so they can still intervene with individual students who are at risk of failing. To achieve this we can turn on the optional event_sink.external_id table from PR#43. This sink dumps data from the ExternalId model into ClickHouse. According to the code it copies:

            "external_user_id",
            "external_id_type",
            "username",
            "user_id",
            "dump_id",
            "time_last_dumped",

Now the thing is that even if teachers get access to de-anonymized data for on-campus usage, this should not go on forever. Especially if users delete their accounts, the PII should no longer be accessible. I understand this as simply deleting the row in the sink table in ClickHouse (the LMS retirement process already does more than this), but since ClickHouse mostly appends new data, as it works now the old records would not be deleted when the row in the LMS is deleted.

The only desired outcome I propose in this issue is that the corresponding row in the sink table is deleted or obfuscated. Any other xAPI statement or DBT-processed facts table should remain untouched.
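Deleting a row in ClickHouse is not a transactional DELETE but an asynchronous mutation. A minimal sketch of the proposed cleanup, assuming the sink table is named `event_sink.external_id` (the table name and user id here are illustrative assumptions, not the actual Aspects schema):

```sql
-- Hypothetical sketch: remove a retired user's row from the sink table.
-- ClickHouse runs this as an asynchronous mutation that rewrites the
-- affected data parts in the background.
ALTER TABLE event_sink.external_id
    DELETE WHERE user_id = 12345;
```

This is the "costly" part mentioned above: without partitioning, such a mutation can rewrite large portions of the table.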


bmtcril commented Oct 10, 2023

Right, the problems I can foresee with this are:

  • We will very likely want to combine the data from the external id table in ClickHouse into downstream tables for performance reasons, meaning more tables will need to be touched (deleted from / updated / rebuilt) to clean up the PII
  • If not carefully considered, operations on those downstream tables may cause data to change, confusing users
  • If any downstream tables need to be rebuilt (for example if a new column is added in an update) and they inner join back to the external id table which no longer has rows for retired users, the data contained in them will change, confusing users

I'm sure there are other small issues, but I want to make sure that you're aware that just deleting those rows isn't a sufficient fix on its own. I don't see the delete itself as a huge problem unless there are a lot of users in the system. Currently at most I think we would have 3 rows per user in the LMS users database. If a lot of users retired in a short period of time it might cause performance issues, but probably limited to the tables / reports that reference the external id or tables downstream of it directly.

@felipemontoya

> We will very likely want to combine the data from the external id table in ClickHouse into downstream tables for performance reasons, meaning more tables will need to be touched (deleted from / updated / rebuilt) to clean up the PII

I had not considered this. I imagined that we already had the data anonymized, so we would do all calculations with the actor_id, and if and when someone wanted to de-anonymize, that would be done by joining the profile table right in the graph query.

Right now I'm thinking about two possible paths worth exploring. The first would be to double down and write an obfuscation transformation so that we can alter the user_profile table as well. In this case I'm no longer sure if removing the actor_id to user_id mapping is necessary. That would probably also have limitations, as there might be cases where the obfuscation does not propagate well into the downstream tables.

The other path I can imagine at the moment would be to split the user_profile table into the PII fields and the non-PII fields. We could then propose guidelines recommending that PII data is not combined into downstream tables, while maintaining the capability to further process things like language, gender, or country.

> If not carefully considered, operations on those downstream tables may cause data to change, confusing users
> If any downstream tables need to be rebuilt (for example if a new column is added in an update) and they inner join back to the external id table which no longer has rows for retired users, the data contained in them will change, confusing users

These two are related to confusing users, which I think can be mitigated if we obfuscate the rows instead of deleting them.

> If a lot of users retired in a short period of time it might cause performance issues

Do you think this risk warrants some rate limiting in the LMS? For instance, in the task that is dispatched when the user_profile changes.


Ian2012 commented Oct 11, 2023

We can mitigate some of the impact on database performance by creating a command to re-generate the entire MVs, or by batch-updating the records in downstream tables, doing it in batches for multiple users if needed.


bmtcril commented Oct 24, 2023

I'm talking with Altinity about this now to get their advice. Rebuilding MV tables could result in lengthy downtimes once the database has significant data in it. One possibility is to use a DICTIONARY type to cache PII data in memory like we do for course data. That should be performant enough to join on, and act as a buffer to updates on the base PII tables, at the cost of using more ClickHouse memory.

I'll let you know what Altinity suggests.
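For context, the DICTIONARY approach mentioned above could look roughly like the following sketch. The dictionary, table, and column names here are illustrative assumptions, not the actual Aspects schema:

```sql
-- Hypothetical in-memory dictionary caching PII, keyed by the anonymous
-- (external) id and refreshed in the background on a timer.
CREATE DICTIONARY event_sink.user_pii_dict
(
    external_user_id String,
    username String,
    name String
)
PRIMARY KEY external_user_id
SOURCE(CLICKHOUSE(QUERY '
    SELECT ext.external_user_id, ext.username, p.name
    FROM event_sink.external_id AS ext
    LEFT JOIN event_sink.user_profile AS p ON p.user_id = ext.user_id
'))
LAYOUT(COMPLEX_KEY_SPARSE_HASHED())
LIFETIME(MIN 600 MAX 900);  -- non-blocking reload every 10-15 minutes
```

Report queries then join against the in-memory copy, so mutations on the base PII tables only become visible at the next refresh, at the cost of extra ClickHouse memory.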


bmtcril commented Oct 26, 2023

So based on their feedback I think this is our best option:

  1. Update the ASPECTS_EVENT_SINK_USER_PROFILE_TABLE to partition data by user_id % 100 (creating 100 partitions) so that when users are being obfuscated or deleted, ClickHouse only modifies one partition instead of the whole table.
  2. Do the same for the anonymous id table. I'm not sure where you're getting it from; it doesn't seem to be in Aspects.
  3. Make a dictionary (probably complex_key_sparse_hashed) combining those tables that does lookups into the PII, with the anonymous id as the key and a reasonable LIFETIME, which will cause the entire dataset to reload (non-blocking) in the background every X seconds.
  4. All access to the PII must go through that dictionary: it gets joined in at report query time and should never be stored in a materialized view or downstream table.
  5. Now when we get retirement requests, we can use the update mutations on the base tables with PII without severe impact to performance. The tables will only be queried at dictionary refresh time, and mutations will only work on one partition at a time, minimizing the potential time blocking dictionary access to 1% of the data.
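Steps 1 and 5 could be sketched as follows, assuming illustrative table and column names (not the actual Aspects DDL):

```sql
-- Hypothetical partitioned PII table: a retirement mutation only rewrites
-- one of the 100 partitions instead of the whole table.
CREATE TABLE event_sink.user_profile
(
    user_id UInt64,
    name String,
    email String,
    time_last_dumped DateTime
)
ENGINE = ReplacingMergeTree(time_last_dumped)
PARTITION BY user_id % 100
ORDER BY (user_id, time_last_dumped);

-- Retirement then becomes a mutation confined to a single partition:
ALTER TABLE event_sink.user_profile
    DELETE WHERE user_id = 12345;
```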

This makes a few tradeoffs:

  1. A dictionary with millions of users will take up significant RAM in ClickHouse, but joining on it will be very fast
  2. Not as fast as pre-baking the PII into a downstream table would be, though
  3. New user data will be delayed by up to the lifetime of the dictionary, so the setting of the lifetime is important to balance freshness vs performance, I would probably start with 15 mins and see how it feels

I would guess that the retirement steps would live in event-sink-clickhouse and be run as part of LMS retirement. It would have to run the mutation logic in ClickHouse, which could take a very long time. I'm not sure off hand if there's a way to fire those off async, or if we'd rather make Celery tasks for them. There are definitely concerns about Celery tasks failing invisibly and users not being retired.

@pomegranited pomegranited self-assigned this Nov 23, 2023

pomegranited commented Nov 28, 2023

I'm thinking of tackling this with the following steps.

Comments/corrections/suggestions @bmtcril @felipemontoya @Ian2012 ?

Clickhouse/Alembic migrations in #529:

  • Partition the ASPECTS_EVENT_SINK_USER_PROFILE_TABLE by user_id % 100 (configurable)
  • Partition the external_ids table (feat: add external id table #521) by user_id % 100
  • Create an in-memory dictionary to join user profile PII with external_id, which refreshes every 15m (configurable).
    Join to this dictionary for all external_id or PII dataset/chart queries (worth an ADR).
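As a sketch of what joining to such a dictionary could look like in a chart query (the dictionary name, attribute, and event table are illustrative assumptions):

```sql
-- Hypothetical report query: PII is resolved at query time via dictGet,
-- never stored in downstream tables.
SELECT
    actor_id,
    dictGet('event_sink.user_pii_dict', 'username', tuple(actor_id)) AS username,
    count() AS events
FROM xapi_events
GROUP BY actor_id, username
```

When the dictionary refreshes after a retirement, the username simply stops resolving; the underlying event counts are untouched.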

Retire user with openedx-event-sink-clickhouse:

  • On receipt of the USER_RETIRE_LMS_MISC signal, queue request to retire user data in Clickhouse.
    Records will be deleted from the external_id and user_profile tables; dictionary will refresh with these changes automatically.
    Optimize the affected tables + partitions - ?
    Management command provided to manually retire a given user.

Monitor event sink/user retirements:

  • Add Event Sink tab to the Operator Dashboard, showing data from the event sink's celery queue.
    Chart queue size and error count over time.

This was referenced Nov 29, 2023

bmtcril commented Nov 30, 2023

This all sounds great. My expectation is that if the ASPECTS_ENABLE_PII flag is off, the table and dictionary will be empty, making my main concern about memory usage mostly moot. I think we will want a separate flag for whether retirement runs, defaulting to the same value as ASPECTS_ENABLE_PII, just in case someone turns the flag on and then back off for some reason and still needs to retire data. Or perhaps it's just always on; I just want to minimize the chances of there being orphaned PII in the database.

@pomegranited

@bmtcril I'm leaning towards leaving the "retire user" event sink on all the time, since User Retirement is in the platform by default. If ASPECTS_ENABLE_PII is false, then there won't be any data to delete, but that should run quickly. And if someone does toggle ASPECTS_ENABLE_PII on and off, then user retirement will still happen.


bmtcril commented Dec 4, 2023

Makes sense to me, and it's one less setting to manage. 👍

@pomegranited

@bmtcril I think this issue can be resolved?


bmtcril commented Dec 13, 2023

I believe so, @felipemontoya does the completed work meet your needs?


bmtcril commented Jan 12, 2024

I believe this is set now, closing.

@bmtcril bmtcril closed this as completed Jan 12, 2024