I have written previously about my work as a support analyst for a large data warehouse. This means that if something goes wrong with the data warehouse, my group is responsible for fixing it. Our database houses all the data for a very large brokerage and financial services firm, and there are thousands of inbound and outbound connections to this data. If something goes wrong, it needs to be fixed, usually immediately.
We run nightly batch processing. During this processing we receive data feeds from various systems with the day's transactions. We import these transactions and perform various transformations and calculations on the data. We then extract files to send the data to downstream systems. Other systems also connect to our database directly and source our data for their own purposes. Our nightly process involves about 500 different jobs, which in turn run thousands of scripts, many of them SQL, to perform this processing.
Some people have asked me what kinds of things could go wrong that I would have to fix. Here is a short list of reasons I could receive a call in the middle of the night.
- The most obvious and most frequent reason I receive a call is that a batch job failed while processing. However, the reasons jobs fail are many. For example, the database could be out of space. This happens when the database log files fill up, or when some other server process uses all the available space. Another example is a code change or new program that wasn't properly tested and failed during processing.
- The data file from the source is delayed. Our processes watch for the arrival of each file and have a deadline for receiving it. If the file doesn't arrive by the deadline, we receive a page from system operations asking us to find out why. Luckily, most of these issues are resolved by our Tier 1 offshore team, so I don't get calls for this issue very often.
- The database or server goes down. When this happens it is an all-hands-on-deck emergency, and everybody gets woken up to help resolve the issue.
- A source file from an upstream system is in the wrong format. This usually occurs when the upstream system made a change to the format and didn't tell us. Of course, this causes the import job to fail, so it could also fall under the first reason above.
- A file that is supposed to be sent from our data warehouse doesn't arrive at its destination. There are many reasons why this can happen: a failure in the code that extracts the data, a failure in the code that actually sends the data, or a failure in the system used for file delivery.
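To make the delayed-file case a little more concrete, here is a minimal sketch in Python of what a deadline check for inbound feeds might look like. This is not our actual tooling. The `ExpectedFeed` class, the `find_late_feeds` function, and the idea of checking a landing path on disk are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path


@dataclass
class ExpectedFeed:
    """Hypothetical description of one inbound feed file."""
    name: str           # illustrative label for the feed
    path: Path          # assumed location where the file should land
    deadline: datetime  # latest acceptable arrival time


def find_late_feeds(feeds, now):
    """Return the names of feeds whose deadline has passed and whose file is still missing."""
    late = []
    for feed in feeds:
        deadline_passed = now >= feed.deadline
        if deadline_passed and not feed.path.exists():
            late.append(feed.name)
    return late
```

In a setup like the one described above, a scheduler would run a check of this kind periodically and page someone for every name it returns. Passing `now` in as an argument, rather than reading the clock inside the function, keeps the check easy to test.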
The bottom line is that there are many reasons a production support analyst can receive a call in the middle of the night. The operation of the business depends on the data that is processed every night, and if a problem will prevent or delay that processing, someone has to fix it. Thank goodness there are people like me who are willing to wake up in the middle of the night and handle these situations.
I'm thankful my turn comes only once every 4 to 5 weeks!