Wednesday, July 16, 2014

The Case Of Lost Files


Overview


Recently a customer asked me to investigate a strange case in their system.

The system itself is a set of web services interacting with each other. Let's say there is a service "Designer" which maintains some design artifacts: a template in the form of an XML file plus related images and texts. A "Generator" service ingests those artifacts and generates a PDF file. When a user makes a particular 'download' request to "Designer", the latter calls "Generator", which produces a file, puts it into file storage and returns the full path to the generated file. "Designer" then takes that file and sends it to the user, completing the request.

Now to the point. Users sometimes ran into an intermittent bug: once "Designer" has made the call and received a path, it tries to access the file and gets a "FileNotFound" exception.

Some additional information: the system runs as IaaS in the Azure cloud, and the error appeared after the move to the Chinese Azure region.




Research

My first intent was to look at the exception myself, which was rather hard, as the whole process is complicated and involves several services (the description above is a little simplified). So I made a simple "test tool" consisting of two programs. A "writer" writes a randomly named file to the storage, then calls a "reader" via a REST interface. The "reader" immediately tries to read the file, and if the file is not found, retries until the file is read correctly or the attempt limit is reached.

Unfortunately, the issue didn't reproduce when I ran both reader and writer on the same VM. So I made a note and moved on.

When running on different servers, let's say A1 and A2, the issue reproduced all right. I then performed a series of tests to check different file sizes and intervals between file writes. The result was interesting enough.
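The original tool is not shown here, so the following is only a minimal sketch of the idea, assuming both programs see the storage as an ordinary directory path. The function names, parameters and defaults are hypothetical, and the REST call between writer and reader is left out.

```python
import os
import time
import uuid
from pathlib import Path


def write_random_file(directory, size_bytes):
    """Writer side: create a randomly named file of the given size.

    Returns the full path, which the real tool would then pass to the
    reader over REST (omitted in this sketch).
    """
    path = Path(directory) / f"{uuid.uuid4().hex}.bin"
    path.write_bytes(os.urandom(size_bytes))
    return path


def read_with_retry(path, max_attempts=10, delay_s=0.5):
    """Reader side: try to read the file, retrying while it is not found.

    Returns (data, attempts_used, elapsed_seconds). data is None if the
    file never became visible within max_attempts.
    """
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            data = Path(path).read_bytes()
            return data, attempt, time.monotonic() - start
        except FileNotFoundError:
            # Not visible yet: wait and retry unless this was the last attempt.
            if attempt < max_attempts:
                time.sleep(delay_s)
    return None, max_attempts, time.monotonic() - start
```

Recording the number of attempts and the elapsed time per file is what makes it possible to compare runs with different file sizes and write intervals.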


  • The bigger the file, the longer it takes to become visible to the reader.
  • The interval between writes didn't seem to correlate much with the time to read.

At this point I was ready to blame the Azure storage system. These symptoms look like those of a distributed system with replication. As it works fine from the same VM, it looks like "read-your-writes" consistency.

What confused me is that, in general, such a thing should be impossible in Azure: it guarantees strong consistency. So I felt really uneasy and was far from understanding what was going on with the files.

Research continued

Since I thought the storage was the root of all evil, I tried writing and reading files in another shared folder. And the files appeared immediately.

After I talked to the person responsible for infrastructure, new facts were revealed. He had also tried to reproduce the issue with a test tool of his own and, surprisingly, was not able to. We compared our approaches, and they were pretty much the same.

The difference was that he had been running his tests on some other servers. I ran a new series of tests and got the following results. The issue reproduced when Writer and Reader were on servers A1 and A2 and the primary storage \\domain\share was used. It also reproduced for another pair of VMs, B1 and B2.

When Writer and Reader were on the same VM, there was no issue. The same held when I placed Writer on any of the A machines and Reader on any of the B machines. And when some other shared folder was used, files were read without any problem.

That was interesting enough. Obviously, there was some combination of "VM pair" and primary storage which triggered the FileNotFound exceptions.

The "VM pair" part could indicate some Azure network issue, so I ran the tests again with network monitoring.

Network monitoring

In short, the network data flow on the Reader side looked as follows:

  1. A1 asks for the shared folder \\domain\share via the SMB protocol.
  2. The domain controller returns the location as \\filestore01\share.
  3. A1 sends a message to FILESTORE01 using the DFSC protocol (!): a Get referrals request.
  4. FILESTORE01 returns a list of referrals, and (ta-da!) the first referral listed is FILESTORE02.
  5. A1 tries to read \\FILESTORE02\share\file and, naturally, gets an exception as a result.
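To make the effect of that flow concrete, here is a toy model in Python. It is not the DFS API and does not talk to any real server: the class, its methods and the explicit `replicate()` step are all hypothetical stand-ins for two file servers with delayed replication and a client that uses the first referral it is given.

```python
class ToyNamespace:
    """Toy model: several folder targets with delayed replication between them."""

    def __init__(self, targets):
        # One in-memory "share" per target server.
        self.stores = {name: {} for name in targets}
        # Writes that have not been copied to the other targets yet.
        self.pending = []

    def write(self, target, name, data):
        """Store the file on one target only; replication happens later."""
        self.stores[target][name] = data
        self.pending.append((target, name, data))

    def replicate(self):
        """Copy all pending writes to every other target."""
        for source, name, data in self.pending:
            for target, files in self.stores.items():
                if target != source:
                    files[name] = data
        self.pending.clear()

    def read(self, referrals, name):
        """Read from the first referral in the list, as the client did in step 5."""
        target = referrals[0]
        if name not in self.stores[target]:
            raise FileNotFoundError(f"\\\\{target}\\share\\{name}")
        return self.stores[target][name]


ns = ToyNamespace(["FILESTORE01", "FILESTORE02"])
ns.write("FILESTORE01", "file", b"pdf bytes")

try:
    ns.read(["FILESTORE02", "FILESTORE01"], "file")
    found_before_replication = True
except FileNotFoundError:
    found_before_replication = False

ns.replicate()
data_after_replication = ns.read(["FILESTORE02", "FILESTORE01"], "file")
```

In this model the read fails until `replicate()` has run, and succeeds at once if FILESTORE01 comes first in the referral list. That matches the symptoms: the file does exist, just not yet on the server the reader was sent to.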

Conclusion

DFS! Network monitoring showed that DFS had been trying to load-balance between its nodes, so when we tried to read a file which had not been replicated yet, we naturally got a "FileNotFound" exception.

In this case DFS should have been working as a fail-over cluster: backing up its data to FILESTORE02 while performing all reads and writes against FILESTORE01. Once DFS was configured properly, the tests which had failed previously passed successfully. Azure is now acquitted :)










