Ente encrypts your data, and backs it up to the cloud.
In between these two steps, we write the encrypted data to disk, and this is then uploaded to the cloud.
Our Android app is designed to run this process both when the app is in the foreground and when it is in the background.
There are scenarios where these two processes could run at the same time. For example, if you open the app while the background sync is already running, we could trigger two parallel backup processes.
We have a locking mechanism where these two processes can acquire disk locks to make sure that only one is in charge of uploading a particular file. In addition, each of these processes is designed to write to a separate file path, preventing collisions while running in parallel.
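To make the two safeguards concrete, here is a minimal sketch of a lock file plus per-process-type staging paths. The function names, directory layout, and lock-file pattern are assumptions for illustration only; this is not Ente's actual implementation, and it does not model how the real lock behaved.

```python
import os
import tempfile


def try_acquire_lock(lock_dir: str, file_id: str) -> bool:
    """Atomically create a lock file; return False if one already exists."""
    lock_path = os.path.join(lock_dir, f"{file_id}.lock")
    try:
        # O_CREAT | O_EXCL fails if the file exists, so only one caller wins.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release_lock(lock_dir: str, file_id: str) -> None:
    """Remove the lock file so another process can take over."""
    os.remove(os.path.join(lock_dir, f"{file_id}.lock"))


def staging_path(base_dir: str, process_type: str, file_id: str) -> str:
    """Hypothetical layout: one staging directory per process type."""
    return os.path.join(base_dir, process_type, f"{file_id}.encrypted")


lock_dir = tempfile.mkdtemp()
first = try_acquire_lock(lock_dir, "file-42")   # True: lock acquired
second = try_acquire_lock(lock_dir, "file-42")  # False: already held
release_lock(lock_dir, "file-42")

fg = staging_path("staging", "foreground", "file-42")
bg = staging_path("staging", "background", "file-42")
fg_duplicate = staging_path("staging", "foreground", "file-42")
```

In this sketch `fg` and `bg` differ, so a foreground and a background process never write to the same file. A path derived from the process type alone is identical for two processes of the same type (`fg == fg_duplicate`), which is why this safeguard assumes at most one process per type.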
This week we learned the hard way that in addition to concurrent background and foreground processes, Flutter on Android can also spawn multiple foreground processes for the same app.
In the case of Ente, this would happen when the app received data through a share intent while it was already running.
This had the potential to break the locking system we had in place, since it was not designed to handle multiple concurrent foreground processes, and it did.
When a customer reported an issue with loading a particular file, we discovered from their logs that the client was able to decrypt both the thumbnail and metadata, but not the file blob.
Note that all three of these attributes (thumbnail, metadata, file) are encrypted with the same key and different nonces.
One possibility that could have led to this was bit rot on the storage layer. We were able to eliminate this possibility by performing integrity checks against the other two copies of this blob that we maintain.
The other possibility was that there were two concurrent uploads from the client, one of which overwrote the encrypted blob created by the other process, which then posted this now overwritten blob with the original but now invalid key.
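The overwrite scenario can be reproduced in a few lines. The sketch below assumes each upload attempt generates its own key, and it uses a toy cipher (XOR keystream plus an HMAC tag) purely as a stand-in for real authenticated encryption. The dictionary `disk`, the path, and all function names are hypothetical.

```python
import hashlib
import hmac
import os


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy authenticated encryption for illustration. Not secure."""
    stream = hashlib.shake_256(key).digest(len(plaintext))
    body = bytes(p ^ s for p, s in zip(plaintext, stream))
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return tag + body


def toy_decrypt(key: bytes, blob: bytes) -> bytes:
    """Verify the tag first; fail if the blob was made with another key."""
    tag, body = blob[:32], blob[32:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed: wrong key for this blob")
    stream = hashlib.shake_256(key).digest(len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


photo = b"holiday photo bytes"
disk = {}  # simulated disk: path -> encrypted blob
path = "staging/file-42.encrypted"

key_a = os.urandom(32)
disk[path] = toy_encrypt(key_a, photo)  # process A writes its blob

key_b = os.urandom(32)
disk[path] = toy_encrypt(key_b, photo)  # process B overwrites the same path

# Process A now uploads whatever is on disk, paired with its own key.
uploaded_blob, uploaded_key = disk[path], key_a

try:
    toy_decrypt(uploaded_key, uploaded_blob)
    outcome = "decrypted"
except ValueError:
    outcome = "undecryptable"
```

Here `outcome` ends up as `"undecryptable"`: the blob on the server is valid, but only under `key_b`, which was never uploaded alongside it.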
Given the safeguards (locks + separate paths) that were already in place, the only case where such a situation could arise was if there were multiple processes of the same type running concurrently.
After looking at all entry points into the app, we discovered that when the app received a share intent while it was already running, Flutter was spawning a duplicate process in the foreground, instead of invoking the current active process.
This had the potential to trigger the issue that the customer faced, where one of the foreground processes could upload an overwritten blob, with a stale key.
We fixed the issue and pushed a release to all channels on 10.08.2023 (a day after we received updated logs from the customer that revealed the issue).
It wasn't that the issue would break all files shared with Ente from a third-party app. It would only occur if you shared a file to Ente through the system share sheet while the app was already processing the same file, and the newly spawned process ended up winning the race against the existing one.
Lessons learned
- Rely on UUIDs while dealing with concurrent writes to disk, since the most damage they can cause is duplication, never corruption.
- Set up anonymous alerting to proactively detect such failures.
- Verify assumptions about the concurrency guarantees offered by underlying platforms.
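As an illustration of the first lesson, here is a minimal sketch in which every write gets its own UUID-based path. The helper name and file extension are assumptions, not Ente's actual code.

```python
import os
import tempfile
import uuid


def unique_staging_path(base_dir: str) -> str:
    """Return a fresh path per write, so writers never share a file."""
    return os.path.join(base_dir, f"{uuid.uuid4()}.encrypted")


base_dir = tempfile.mkdtemp()

# Two writers handling the same source file each get their own path.
path_a = unique_staging_path(base_dir)
path_b = unique_staging_path(base_dir)

with open(path_a, "wb") as f:
    f.write(b"blob encrypted by process A")
with open(path_b, "wb") as f:
    f.write(b"blob encrypted by process B")

with open(path_a, "rb") as f:
    blob_a = f.read()
with open(path_b, "rb") as f:
    blob_b = f.read()
```

Both blobs survive intact. The worst case is that the same file is uploaded twice, which is duplication rather than corruption.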
Conclusion
We are very grateful to the customer who notified us and helped us through the process of finding the root cause.
If you have shared files to Ente from other apps after having placed Ente in the background, please verify whether you were impacted by this issue by exporting your data using our desktop app. If you were impacted, please reach out to support@ente.io.
We strive to write safe code, but this was an instance where we should have done better. We apologise. We are learning, and we will do better.