Getting up to Speed

2022-11-05

I recently started a new role working on data infrastructure at LinkedIn. Working with data at a vast scale at an organization with a lot of experience in the space is exciting.

The area I work in is described internally as “Big Data.” That term has been used in many marketing materials and can induce a lot of cringing. For this post, “Big Data” refers to working with large, semi-structured datasets. These can be hosted on distributed filesystems or streamed through a distributed event platform like Kafka. The datasets are typically not pre-loaded into an OLTP or OLAP database.

I’ve worked for 22 years in this industry, but most of my experience applies to online or real-time systems. Working with offline data is new to me, so I’ve had to get up to speed quickly with the history, trends, and vernacular common to the domain. Before starting in this role, I had some ambient awareness of terms like data warehouse, data lake, MPP, DAG, and others. Still, I wasn’t comfortable discussing them in depth, and I wasn’t up to date on developments in the data ecosystem.

I focused my research on two categories: learning about the history of the space and learning about the customer profiles (data scientists and data engineers). In this post, I will focus on how I researched the former.

Reading Papers

Reading papers is something I started doing back when I worked in Bioinformatics. Most of the concepts I explored then came from academic research, and reading research papers was the only way to stay up to date. I picked the habit up again when I worked at PagerDuty.

Reading papers can be daunting for someone like me without an academic background. I’ve found that the approach Caitie McCaffrey describes in her blog works well for me (see “A Note on Reading Papers” in the linked article). Marc Brooker’s article Reading Research: A Guide for Software Engineers is also an excellent resource. Watching PapersWeLove talks is another great way to engage with computer science papers. I like reading a paper, watching a talk on it, then returning to it to re-read sections that confused me the first time.

Creating a Reading List

In Ten simple rules for developing good reading habits for grad school and beyond, the author recommends reading in chronological order when you have to learn about a new topic. I decided to do this and compiled a list of papers, starting with the Google File System paper published in 2003, continuing with the MapReduce paper in 2004, and so on.

I used a straightforward process to build my reading list. I knew that the GFS and MapReduce papers were among the earlier publications in the domain, so I anchored my search with them. I searched for them on Google Scholar and then looked at lists of papers that cited them. Many familiar pieces turned up, including the original Spark paper, but also many that were unfamiliar to me, such as the TensorFlow paper, the Flink paper, and eventually, the Photon paper and others that represent more recent contributions to the domain.

Here’s my list so far. I’m working through it, but please comment if you have anything to add!

Connecting to Experience

One exciting thing about this exercise was that it allowed me to compare what I was reading to what I was doing in the industry when the paper was published.

As mentioned, I’ve never worked directly in the “Big Data” domain, but I have worked with data in some form or another in my career. When the GFS and MapReduce papers were published, I worked in Bioinformatics. Bioinformaticians often need to analyze large datasets and feed the output of their analysis into some other process. The team I worked with built a workflow automation language and runtime engine designed to run on OpenMosix clusters with datasets stored on large NFS volumes. NFS was a pain to maintain and susceptible to common failure modes, so reading about how GFS was designed to run on commodity hardware and treated component failures as the norm made a lot of sense to me. Our lives would have been much easier if we had had GFS and MapReduce (or the Hadoop ecosystem, the open-source project primarily inspired by these projects). (In retrospect, what we built was similar to Azkaban or Airflow.)

Some of what I read doesn’t relate to my experience, but I still try to imagine how I’d solve the problem stated in the paper. It’s a valuable exercise that helps me get into the right headspace to read the rest of the paper.

Being Hands On

Reading papers is a great way to build a broad understanding of work done in a particular domain. I also like to combine it with hands-on operational work. Thankfully, most of the topics covered relate to some open-source work that I can use to gain hands-on experience, even if just with toy examples. For example, after reading the GFS and MapReduce papers, I set up an HDFS cluster on a small set of Raspberry Pis I had lying around (with my Honeycomb LX2 acting as the ResourceManager and NameNode). Maybe it’s my ops background, but I like seeing what’s happening at an OS level when running MapReduce jobs or Spark tasks. Some people get the same level of understanding while skipping this step; as always, YMMV.
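If you want a toy example to play with before standing up a cluster, here is a minimal sketch of the map, shuffle, and reduce phases from the MapReduce paper, written as a word count in plain Python. It is not Hadoop’s API and it runs in a single process; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, the step a real framework does for you."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in order over a list of input splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each string stands in for one input split (hypothetical sample data).
    splits = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(splits))
```

On a real cluster the splits would live on different nodes and the shuffle would move data over the network, which is exactly the part that becomes visible at the OS level.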

Making Time

Researching and deep diving into a particular topic or domain takes a lot of time. I try to get through two papers per week. I’ve recently started working from the NYC office, about a 35-minute commute from my house in Brooklyn. I use this time to read through a paper on my reMarkable 2 tablet and make notes for later. I’ll also use time in the evenings after my daughter has gone to bed. Do whatever works for you; if you can carve out an extra hour or so per day, that’ll likely be enough!