Recovery of Historic Data in Email Accounts – Datacenter Updates

Update – 21st May 2014

Data restoration continues
First pass of restoration for selected users complete

Data Recovery Progress

As mentioned in our last update, we are recovering data in phases. Phase 1 of the recovery process is currently in progress, and we have started to recover emails across affected accounts. Phase 1 will continue for at least a couple more weeks. Our customer-facing teams are emailing Hosts with the recovery status of each account, and we have already done this for a large number of affected users. These emails now go out every day as we continue to restore additional data.

While the Phase 1 recovery process is running, the data recovery team is simultaneously building a Phase 2 recovery program. It is a variation of the Phase 1 program that uses additional techniques to support even lower-level, bit-by-bit data recovery. With the Phase 2 process we expect to recover additional data that was not recovered during Phase 1.

Our engineering team continues to be fully engaged in this activity and is leaving no stone unturned. We are fully aware that we are now into the 4th week of this recovery process. However, given the complexity of the effort, we expect it to take several more weeks. We thank you again for your continued patience. This remains a very critical issue for us and will continue to be until we exhaust each and every avenue available to us to recover your data.


 

Update – 20th May 2014

In the aftermath of the email storage outage on 24th April, we set up interim storage devices so that mail services could be resumed for affected accounts. In the meantime, we have worked to create a more hardened storage infrastructure.

We will now be migrating the affected accounts from the temporary servers to this infrastructure. The maintenance is planned to be conducted as below:

Date: 22nd May 2014 & 23rd May 2014

Mail delivery and access to accounts will be stopped for a couple of hours. Inbound mail will be queued and delivered to the respective accounts after the maintenance is completed. Sending of emails will still work; however, sent messages will not be saved to ‘Sent items’. Migration will begin from the current storage to the permanent one.

At the end of the migration, mail delivery will be re-enabled and inbound emails that were sent to our server during the maintenance will start to be delivered.

We regret the inconvenience this may cause; thank you for your continued patience. If you have any queries regarding this, feel free to reach out to our support teams.


 

Update – 13th May 2014

– Post-mortem findings
– Details on the email restoration effort
– Frequently asked questions

Post-mortem For The Outage

Our senior tech and management team has concluded a detailed root-cause analysis of this outage, and here are the findings. At around 2 PM IST on Thursday, April 24th, we were in the process of adding a new storage cluster (consisting of hundreds of disks / servers) at one of our global data centers.

As part of this process, an aggregate (which is a collection of multiple servers & disks spanning RAID groups) holding the production data (live email data) and backup snapshot volumes (backups) for many email users was rendered inoperable while attempting to build a new aggregate.

We noticed the services on this storage cluster failing almost immediately; this was flagged by our network and systems operations team, which operates 24×7 across all our facilities. We halted the offending deletion/new aggregate creation processes as soon as we detected the issue.

Our first goal at that point was to restore email services. We did this in almost 2 hours by deploying a full cluster (comprising hundreds of servers) and migrating existing user email accounts from the faulty storage cluster to it.

We simultaneously started work with a team of experts to rebuild the storage cluster and bring historic data back online. As we have informed you, this is a complex process because the data is spread across multiple servers. Those efforts continue to date, with more updates in the following sections.

In response to this incident, our senior technical team immediately put into place a series of stringent change control measures and oversight, which went over and above the systems already in place, to ensure that no further opportunity exists for an outage caused by a similar event in the future.

Email Historic Data Restoration Process

At around 10 PM IST on April 24th, we began bringing the storage cluster back online by constituting an advanced team dedicated to this purpose. This process involves reconstructing individual files across the storage cluster. As we have highlighted earlier, the restoration is very time-consuming because it requires us to reconstruct the storage cluster and all of its components from the ground up.

The restoration effort, carried out by experts working 24×7, involves a multitude of software, scripts, hardware systems, and manual inspection processes to carefully rebuild the cluster. We informed you at the very outset that this process would take several days or weeks, and our communication to you has been consistent with this. A secondary task force composed of our most senior engineers and managers has been reviewing progress every day and continues to do so.

As our engineers have progressed further on this restoration effort, they have also advised us that certain files on the aggregate are not recoverable. They estimate that at least 20% of the underlying data (not to be confused with users/mailboxes) on the storage cluster is not recoverable.

We do not have the ability at this time to tell you what this means for an individual mailbox in your account. However, this restoration effort will continue until we are certain that we have done everything in our power to restore emails for each and every affected account. We cannot guarantee the end results, and we deeply apologize for presenting you with this uncertainty. Our goal has been, and will continue to be, to share meaningful updates as soon as possible.

Frequently asked questions

Over the past few weeks, we have had conversations with a number of hosting companies about this outage. We have created the following FAQ based on the questions we have heard most often.

What about backups? Don’t you have any?

As said earlier, our backup strategy (as per the industry standard) consisted of creating periodic snapshots of the email data. Because these snapshots were stored on the same storage cluster, which is now offline, we have no direct access to them at this time.
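
To illustrate why snapshot placement matters, here is a minimal sketch of a check that flags backups stored in the same failure domain as the data they protect. The `Volume` record, the `cluster` field, and the example names are hypothetical and are not taken from our actual systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    name: str
    cluster: str  # hypothetical identifier for the storage cluster (failure domain)


def colocated_backups(pairs):
    """Return the (primary, backup) pairs whose backup sits on the primary's cluster."""
    return [(primary, backup) for primary, backup in pairs
            if primary.cluster == backup.cluster]


# Illustrative inventory only; names are made up for this example.
inventory = [
    (Volume("mail-live-01", "cluster-a"), Volume("mail-snap-01", "cluster-a")),
    (Volume("mail-live-02", "cluster-a"), Volume("mail-snap-02", "cluster-b")),
]

for primary, backup in colocated_backups(inventory):
    print(f"{backup.name} shares {primary.cluster} with {primary.name}")
```

In this example only the first pair is reported, because its snapshot would become unavailable together with the live data if `cluster-a` went offline.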

Why is this taking so much time?

The team needs this time for the following reasons.

  1. They are working on building the meta-data of mailbox files for each individual account (hundreds and thousands of accounts in number) on this storage cluster.
  2. They are writing scripts/tools to assist with both the manual and automated restoration of files on the cluster.
  3. They are running tests to validate that the restored data is valid and usable.

All of these processes require building a series of complex software systems and specialized hardware to operate on those systems. The team also has to double back often and try out new approaches when certain efforts do not yield the desired results. We remain committed to allowing this team, which we believe is composed of some of the best engineers in the industry, time to complete this effort.
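
As a rough illustration of the validation step in point 3 above, the sketch below checks whether a restored message file can be parsed as an email and carries a few basic headers. It is an assumption about what such a check could look like, not the tooling our engineers use; the function name and the list of required headers are hypothetical.

```python
from email import message_from_bytes
from email.policy import default

# Headers we assume a usable restored message should carry (illustrative choice).
REQUIRED_HEADERS = ("From", "Date", "Message-ID")


def check_restored_message(raw: bytes) -> list:
    """Return a list of problems found in one restored message (empty if none)."""
    if not raw.strip():
        return ["empty file"]
    msg = message_from_bytes(raw, policy=default)
    problems = []
    for name in REQUIRED_HEADERS:
        # Membership test only; it does not parse the header value.
        if name not in msg:
            problems.append(f"missing header: {name}")
    if msg.defects:
        # The parser records structural problems here instead of raising.
        problems.append(f"{len(msg.defects)} parser defect(s)")
    return problems


sample = (
    b"From: user@example.com\r\n"
    b"Date: Thu, 24 Apr 2014 14:00:00 +0530\r\n"
    b"Message-ID: <1@example.com>\r\n"
    b"\r\n"
    b"hello\r\n"
)
print(check_restored_message(sample))
```

A check like this only confirms that a file is structurally readable as an email; it cannot tell whether the content matches what was originally stored.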

Is my data lost? Will you ever be able to recover it?

We cannot and do not want to promise specific outcomes, because the restoration process has not yet progressed to a stage that would allow us to do so. At some point over the next week(s), we will know definitively the status of each and every affected email account. We do not know the exact date, since this is a complex effort akin to a reasonably large software/hardware engineering project. At that point we will communicate with you specifically about the results our engineering team has been able to achieve.

At this time, the goal for this team is to re-build and re-construct the cluster and we request your patience while they continue to do that.

These answers are not sufficient – you need to tell me more

We are sorry if these answers appear to be insufficient. Please realize that our front line support team and our senior managers are all working hard to create the best possible outcome they can under the circumstances. We would be glad to answer a different question if it is sent our way.

What are you doing to prevent this or something like this from happening again?

We realize that this incident impacts your trust in us and our services, and that your clients’ trust in you has been shaken as well.

When we learned of this outage, our most senior technical engineers immediately put in place a series of measures to ensure that a similar outage or issue would not happen again on our systems. As stated earlier, this kind of outage has never occurred before and was considered highly unlikely until it happened.

Please allow us more time to come back and discuss with you the concrete steps we took immediately after this outage, and that we will continue to take, to create the best possible product and experience. We do not take our responsibility lightly and we know that we have let you down on this occasion. We know you deserve more detailed answers and we will provide them.

We also commenced a detailed deep dive into our systems and processes which goes far beyond this particular incident with the goal of demonstrating the rigor and confidence with which we build and deliver our services to you. We do not take this responsibility lightly. We know you deserve more detailed understanding of the work we are doing in this area and we will share this with you over the coming days.


 

Update – 12th May 2014

At this point we have no significant updates to present related to the restoration process, other than the fact that the work is continuing at the fastest pace possible, limited only by technical hindrances. These efforts are currently underway 24×7 across our global teams as indicated earlier. We will update you again as soon as we have some significant information to share.


Update – 10th May 2014

Thank you as always for your patience.

We realize that this has been a very challenging week. As informed time and again earlier, we are doing our absolute best, and our senior technology & management teams, both internal and external, are monitoring this restoration process very closely. We are already in the 2nd phase of the recovery process, deploying scripts and software as we work to get the cluster back. We will inform you again when we have something significant to report, or within the next 24 hours.


Update – 9th May 2014

Our engineering teams are continuing the restoration effort. As highlighted earlier, this process will take several more days since it is a complex engineering effort involving multiple engineers, locations, hardware and software resources. Given our goal of providing meaningful information to you as it becomes available, we will post it as soon as we have significant information to share.


Update – 8th May 2014

Our engineering teams are continuing to attempt email restoration through a number of approaches at this point – both manual as well as automated.

As highlighted earlier, a large part of the effort to date has gone towards building processes, deploying hardware, and developing software and scripts to drive this recovery process further. This has turned out to be a massive task, and it continues 24×7.

Our engineering team continues to emphasize that while they are making steady progress, the expected timeline for completing the recovery effort will be several days or weeks, and even at the conclusion of those efforts we have no guarantees on the recoverability of all the emails for specific accounts.

Over the next few days, as the first reconnection attempt to the historic data is made, our engineers have indicated that they may be able to partially recover some emails for certain users during their initial pass through the system. We will continue to communicate with those users (companies / vendors) as and when that happens. In the meantime, we very much appreciate your patience and understanding while our teams continue their efforts.


Update – 6th May 2014

IMAP Access Restored – 1st Step towards Recovery

As per our previous update, we have re-enabled IMAP on the accounts today. You can now configure your email accounts via IMAP on all mail clients as well as your mobile and hand-held devices. However, before doing so, we strongly recommend that you back up all your locally saved emails. This ensures that local copies of your emails are not lost when synchronization with the mail servers happens.

Process to Backup your Emails

Follow the link below for instructions on how to back up your emails on email clients:
http://www.cibol.net/blog/wp-content/uploads/2014/05/email-clients-backup-instructions.pdf
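
For readers comfortable with scripting, the following is a minimal sketch of one way to archive a mail client's local data folder before reconnecting over IMAP. It is not part of the linked instructions; the function name is hypothetical, and the folder location depends on your mail client and operating system, so the paths in the commented example are placeholders.

```python
import shutil
import time
from pathlib import Path


def backup_mail_folder(mail_dir, dest_dir):
    """Zip the contents of mail_dir into a timestamped archive inside dest_dir."""
    mail_dir = Path(mail_dir)
    if not mail_dir.is_dir():
        raise FileNotFoundError(f"mail folder not found: {mail_dir}")
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = dest_dir / f"mail-backup-{stamp}"
    # make_archive appends ".zip" and returns the path of the archive it wrote.
    return Path(shutil.make_archive(str(base), "zip", root_dir=str(mail_dir)))


# Example call with placeholder paths; adjust both to your own setup.
# archive = backup_mail_folder("/path/to/mail/client/data", "/path/to/backup/drive")
```

Close the mail client before running such a script, and keep the destination folder outside the mail folder so the archive does not try to include itself.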

 

Note from CIBOL: Even though our partners have the world’s best storage infrastructure and employ the finest human capital, and we partner only with the “topmost” providers of the world for all our services, we still strongly advise you to use a mail client and back up your email data regularly on your local machine as well as on another storage device, so that it can be used in any eventuality.

Rest assured that we are doing everything in our capacity to address the impact of this.


Update – 5th May 2014

At this time we have no additional information to provide over and above our update from yesterday. Efforts continue unabated, and the primary purpose of this update is to continue our engagement with you every 24 hours as promised and to reassure you that we continue to drive this issue with the highest priority.


Posted on: Thursday, May 8th, 2014 at 1:34 am
Category: General.