A few years back I heard about a conversation among some professors discussing when CS researchers peak. Part of the group believed that researchers generally peak right around the time they get their PhDs and that it’s all downhill from there, so older professors should just retire and make room for new, younger brains. Another part of the group felt that more experience makes for better researchers and that it’s actually rare for researchers to peak early in their careers, so no rush on the retirement.
After hearing about this discussion, I figured the best way to get an answer was to get some data. Google Scholar seemed like a promising source. I’d been working on a Programming by Demonstration tool for web automation – the tool that I’d ultimately develop into Helena – so collecting data would be easy.
I queried Google Scholar Author Search for “label:computer_science” – the top authors tagged with the ‘computer_science’ label. Unfortunately, it turns out this is just a small slice of CS researchers; most authors aren’t tagged with ‘computer_science’ but rather with subfield- or technique-specific labels. So I took a step back. Instead of collecting those authors’ papers and calling it good, I collected all their tags – which gave me a nice dataset of the subfield and technique tags that co-occur most often with ‘computer_science.’ I sorted the tags by frequency, picked as many of the top labels as would fit in the search bar (66), and ran a query built from those labels, which gave me the list of the top 10,000 authors. From there, it was just a matter of iterating through the top authors and, for each author, iterating through all papers.
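The tag-counting step can be sketched in a few lines of Python. This is only an illustration, not the Helena program I actually ran: the function name `top_cooccurring_labels`, the input shape (one list of labels per author), and the toy labels are all assumptions made for the example.

```python
from collections import Counter


def top_cooccurring_labels(author_labels, seed, limit):
    """Return the `limit` labels that most often appear alongside `seed`.

    author_labels: one list of label strings per author (hypothetical shape).
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for labels in author_labels:
        unique = set(labels)  # count each label once per author
        if seed in unique:
            counts.update(unique - {seed})
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ranked[:limit]]


# Toy data, invented for illustration only.
authors = [
    ["computer_science", "machine_learning", "computer_vision"],
    ["computer_science", "machine_learning"],
    ["computer_science", "databases"],
    ["physics", "machine_learning"],
]
top = top_cooccurring_labels(authors, "computer_science", 2)
print(top)
```

In the real run the limit was however many labels fit in the search bar, which came out to 66.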
Writing the programs for collecting this data with Helena takes about 3 minutes. Here’s how you do it:
Once I had the data, I had to operationalize peaking. I decided the key question was: how far into their careers do researchers publish their most-cited works? Essentially, we’ll say a researcher peaks at <year of most-cited paper> - <year of first paper>. Now clearly this isn’t a definitive answer to when the researcher peaks. There are many, many ways to operationalize peaking, but this was a pretty good proxy that I could extract from publication years and citation counts.
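Here is a minimal sketch of that definition in Python. The `(year, citations)` tuple format and the function names are assumptions for illustration. So are two details the definition above leaves open: if several papers tie for most citations I take the earliest one, and I treat career length as the year of the last paper minus the year of the first.

```python
def years_to_peak(papers):
    """Years between an author's first paper and their most-cited paper.

    papers: list of (year, citations) tuples (hypothetical format).
    Assumption: if several papers tie for most citations, use the earliest.
    """
    first_year = min(year for year, _ in papers)
    # Highest citation count wins; among ties, the smallest year wins.
    peak_year, _ = max(papers, key=lambda p: (p[1], -p[0]))
    return peak_year - first_year


def career_length(papers):
    """Assumed definition: year of last paper minus year of first paper."""
    years = [year for year, _ in papers]
    return max(years) - min(years)


# Toy author, invented for illustration only.
papers = [(2001, 10), (2008, 250), (2015, 40)]
print(years_to_peak(papers), career_length(papers))
```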
Ok! So let’s plot this. How many researchers peak 1 year into their publishing careers? How many peak at 2 years?
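Before plotting, the data just needs to be tallied: one count per peak year. A rough sketch, assuming `peaks` is a list holding one years-to-peak value per author (the name and the toy values are made up for the example):

```python
from collections import Counter


def peak_histogram(peaks):
    """Return (years_to_peak, number_of_authors) pairs, sorted by year."""
    counts = Counter(peaks)
    return [(year, counts[year]) for year in sorted(counts)]


# Toy values, invented for illustration only.
peaks = [7, 2, 7, 0, 15, 7, 2]
histogram = peak_histogram(peaks)
for year, n_authors in histogram:
    print(year, n_authors)
```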
Well that’s looking pretty discouraging. As someone who hopes to be in CS research for a nice long time, that’s not what I’m hoping to see. There’s that spike right at 7 years, right around the time folks are getting their PhDs. Looks like that peaking-at-PhD faction had the right idea.
But of course this is not at all how we should plot this data. Turns out there are a lot of people who have pretty short publication careers. So what we’re seeing here reflects the fact that most of these authors will never have the chance to publish their most-cited work after year 30 because they’re not even publishing then.
Here are the career lengths of the top 10,000 most-cited authors in CS:
I actually think this data is pretty interesting all on its own.
Anyway, at this point I’m feeling some hope. Maybe that 7-year peak doesn’t come from researchers sticking around for 40 years and never recovering their early glory – maybe it’s there because not everyone bothers to stick around for 40 years. So when is that researcher with the 40-year career peaking?
And now those of us who want to do CS research long into the future can breathe a sigh of relief. In the above chart, each dot marks one combination of career length and peak, shaded by the number of authors who have it – pale dots where there are few authors, darker dots where there are many. The black line is the line of best fit. As we can see, the longer the career, the later the most-cited paper, on average. There doesn’t even seem to be a year beyond which influential work stops happening. Age and experience just keep helping!
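For anyone who wants to reproduce a chart like this, here is a sketch of the two computations behind it: counting authors per (career length, peak) pair for the dot shading, and an ordinary least-squares fit for the line. I'm assuming a plain least-squares fit here, the function names are hypothetical, and the points are made-up numbers rather than the real dataset.

```python
from collections import Counter


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept.

    points: list of (career_length, years_to_peak) pairs, one per author.
    Requires at least two distinct x values.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points for illustration; not the real data.
points = [(10, 4), (20, 9), (30, 14), (40, 19), (10, 4)]

# Number of authors at each (career length, peak) pair drives dot darkness.
dot_counts = Counter(points)

slope, intercept = fit_line(points)
print(dot_counts[(10, 4)], slope, intercept)
```

A positive slope is what "the longer the career, the later the most-cited paper" looks like in numbers.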
If you want to read a little more about this data, we used it as an example in this paper: Browser Record and Replay as a Building Block for End-User Web Automation Tools. The paper discusses Ringer, our record and replay tool, and how we can use straight-line replay scripts to produce components for more complicated programs.