Data Clean-Up
Types of data clean-up
Usually, data "clean-up" comes in two primary forms:
- Cleaning up data mess that accumulates naturally over time
- Deleting problematic data stemming from acute issues (e.g. implementation bugs, bot traffic, PII)
In this page, we'll help you address both forms of data clean-up
1. Cleaning up data when it naturally gets messy
Even if you start with strong data governance and best practices about how to structure your data, data can get messy over time (e.g. as your product expands or is updated, or as new data governance owners move on).
In this section, we will look at tips and tricks on how to clean up messy data when it gets out of hand.
Leverage Lexicon for a quick spot check
Dropping events that are not queried
Lexicon is our data dictionary in Mixpanel, and is where project owners and admins can add / manage descriptions for all events and properties to organize your data.
- Order by volume descending. Are your users querying events with large volume? If not, you might want to think about dropping these events. Dropping events means that they will no longer be ingested in Mixpanel. This is especially helpful for apps, since you don’t need to release new app versions and wait for users to download the latest versions for the code change to take effect (we still recommend that you clean up the code on the backend though!).
- Sometimes, customers may still want these large volume events because they rely on them downstream (via data pipelines). But if you’re not querying the events in Mixpanel itself, you shouldn’t send it in - if you’re on the events plan, this increases your utilisation; also, this clutters up the UI for users who don’t need to see it there. You could hide the event in the UI in Lexicon, but our recommended approach is to not send this data into Mixpanel at all.
- If you’re leveraging our SDKs and want to continue doing so for ease of leveraging our ID management and default properties, an alternative is to route events from Mixpanel’s SDKs via a proxy in your own domain. This way, you get to push the events required for backend analyses directly into your warehouse, and send only the events users query in our UI to Mixpanel
Cleaning up display names and adding descriptions for events
It’s always helpful to add descriptions to your events so users know where they are triggered. If you’re an owner or admin, you can add the descriptions directly on the UI itself. You could also export the data into a CSV, update the description there, then import it back into Lexicon if it’s easier to clean up that way. If you leverage Segment, mParticle, or Avo.app, you can use that to import your event names and descriptions that way too.
If you use Figma to identify your events, some customers add their links to the event descriptions as well.
Adding tags to events
Using custom events to combine or filter events
Cleaning up user profiles
If you want to clean up old user profile properties that are no longer being used, you can use our Engage API (opens in a new tab) to remove these old user properties. We also provide a ‘people_unset’ method in the Mixpanel-utils Library here (opens in a new tab).
2. Deleting problematic data
Data Deletion allows you to delete noisy or sensitive data from your Mixpanel project, helping you maintain the integrity of your analytics project, and preserve a clear view of user behavior.
This feature is progressively being rolled out, and may not yet be available for all customers.
You might find this useful if you:
- accidentally send Personal Identifiable Information (PII) in a property
- have implementation issues leading to duplicative tracking
- experience bot traffic issues that clutter an important event
In addition to permanently deleting the chosen data from our servers, we will immediately hide the data in your project so your users don’t accidentally view or analyze incorrect/sensitive data.
Unlike Hiding Events (opens in a new tab) or Data Views (opens in a new tab), Data Deletion affects your entire project, and can allow you to remove a subset of problematic events from an otherwise useful event.
Is data deletion right for my problem?
Data Deletion is a severe action with permanent consequences. Thus, we recommend it only in a few scenarios:
Scenario | Recommendation |
---|---|
Bot attack sent spam events I don’t want to consider in my analysis | Yes, use Data Deletion with targeted event property filters to delete spam event data. |
Due to an implementation issue, duplicate events were ingested | Yes, use Data Deletion with targeted event property filters to delete duplicate event data. |
Due to an implementation issue, events were sent with wrong timestamp | Yes, use Data Deletion with targeted event property filters to delete affected data, and then re-upload with corrected timestamps. Please refer to our 'Reminders with ETL Approach' section on this page as you proceed |
We have sensitive data that we accidentally ingested that we legally cannot have and need to urgently delete | Yes, use Data Deletion with targeted event property filters to delete events with sensitive data. You can then re-upload data without sensitive PII data included. Please refer to our 'Reminders with ETL Approach' section on this page as you proceed |
I no longer want an event/ property entirely, as it’s no longer relevant to my team | No, we recommend you actually use Dropping and Hiding Data (opens in a new tab) instead, as these will hide the events from your team, without using up your limited Data Deletion requests. |
I don’t need to legally delete some events with sensitive data, but I don’t want some of my users to be able to see the sensitive data | No, we recommend you use Mark Properties as Classified (opens in a new tab) to remove the data from the analysis experience for certain user types, while keeping accessible for other users. |
Who can use Data Deletion?
We limit access to Data Deletion in these scenarios:
- Your Mixpanel role is not Owner or Admin
- Your data extends beyond the past 90 days
- Your event volumes are > 5 billion events per month, any month over the trailing 3 months
- You have made 10 deletes over the past calendar month
How to use
Data Deletion is irreversible. Please exercise caution when using this feature.
Once you've identified the problematic data, and confirmed you want to delete:
- Navigate to the Data Deletion section in Project Settings
- Click ‘Request an Event Deletion’
- Select which event you want to delete
- Select time range (cannot be more than 90 days in the past)
- Add any Event property filters if needed
- Validate in the preview that the right data is being deleted
- Submit deletion request
Data Deletion will be permanent and irreversible in 7 days after submitting. Until that time, you may undo the action. Ensure that you carefully review and confirm your selections before proceeding.
During this intermediate 7 day period, we will automatically hide the data in your Deletion request from your users, so they don’t accidentally analyze incorrect or sensitive data.
Undoing data deletion
You may undo Data Deletion requests for only 7 days after submitting, after which deletions become permanent. To do so, take the following steps:
- Navigate to the Data Deletion section in Project Settings
- Locate the Deletion request you wish to undo in the table
- Click ‘Undo’ button in the Deletion Data column
Reminders with ETL approach
- For projects created prior to January 11th, 2023, offset your timestamps
- If your project was created prior to January 11th, 2023, you cannot just clean the data you don't want and re-import it. Your data is stored in project time, so you need to adjust the offset before importing
- Don’t forget to regenerate $insert_id’s when you ETL
- When you submit a deletion request, we hide your data immediately from your project to reduce any privacy concerns. We call this “soft deletion”, an interim phase before our “hard deletion” kicks in (where your data is permanently deleted from our servers) so that you can review the impact of your changes and undo when necessary. If you re-import data while the data is soft deleted with the same $insert_id, our deduplication systems may keep the old (deleted) event and toss the new event. Since this data is soft-deleted, your re-import won’t reflect the data as imported. Thus, when the ETL is done, you should regenerate the $insert_id value if possible, to avoid this possible collision
Frequently asked questions
How does Data Deletion impact billing?
When billing runs for the current month, deleted events will not be included in the tally calculation for that current month.
Do you support property deletion?
Not as of now. If you have events with properties you want to delete, you should do an Extract, Transform, Load approach. The goal would be to run a raw export on the event and then delete the bad imported data. Once this data has been transformed, re-import the data into the project. Then, call this event something slightly different and then hide the original one
Please refer to 'Reminders with ETL Approach' section as you proceed, and submit feedback (accessible in-app with our help menu) if this is a blocker for you and your team
Other resources
Was this page useful?