a3nm's blog

Finding the members of the theoretical database community with DBLP

— updated

The DBLP service is a great bibliographical tool for computer science research. In this post, I explain how to use it to prepare the list of members of a research community. I will be using the theoretical database community, whose two conferences are PODS and ICDT.

The list of publications for one edition of a conference can be found on DBLP as XML, e.g., for ICDT'18. It is then easy to use xmlstarlet to find the list of people who have published at that conference:

curl -s 'https://dblp.uni-trier.de/db/conf/icdt/icdt2018.xml' |
  xmlstarlet sel -T -t -m "//inproceedings/author" -m . -c '.' -n |
  sort | uniq

For each person in the list, we can obtain detailed XML information, including its homepage, ORCID, etc., using the DBLP API again. (This also gives us a canonical form for the name, which may appear in different ways in various inproceedings entries.) This is just a bit more complicated than it should, because of a limitation of the DBLP search API: when queried with a name, sometimes the API inexplicably favors non-exact matches even in some cases where an exact match exist. So we must filter the matches ourselves to use an exact match if one exists, and a non-exact match otherwise. Of course, independently from this problem, you may be getting the wrong author, in particular because of homonyms, so these results should be taken with a grain of salt.

NAME="Antoine Amarilli"
ENAME=$(echo "$NAME" | sed 's/ /%20/g')
curl -s "https://dblp.org/search/author/api?h=1000&q=$ENAME" > matches.xml
URL=$(xmlstarlet sel -T -t -m "/result/hits/hit/info[author='$NAME']" \
    -c url -n < matches.xml | head -1)
if [[ -z "$URL" ]]
then
  URL=$(xmlstarlet sel -T -t -m /result/hits/hit/info/url \
      -c . -n < matches.xml | head -1)
fi
curl -L "${URL}.xml"

From there, we can use this to prepare a list of community members. Of course, any criterion for inclusion is completely arbitrary... My criterion to get a list of "active community members" is to select the who have published on three different years, with one publication in 2015 or later. Which gives:

Click to see the list...

Another inclusion criterion for a "historical" list would be the list of people who are not necessarily still active but have published over a long period, say, 10 different (not necessarily contiguous) years. Here is the resulting list, sorted by the year where the person has last published in ICDT or PODS.

Click to see the list...

Another kind of statistics that can be computed in this way is the "neighboring" conferences, i.e., the other conferences where members of the community have published. Here is the list of the top neighboring conferences of PODS and ICDT, sorted by the number of active community members who have published at least once there since 2015 (with hyperlinks and descriptions added manually):

  • 38: SIGMOD Conference, the practical database conference held jointly with PODS
  • 34: AMW, the database theory workshop held in honor of Alberto O. Mendelzon (whom you may remember from the previous list)
  • 29: IJCAI, an AI conference
  • 28: ICALP, a theoretical CS conference on logics and automata
  • 26: LICS, another theoretical CS conference about logics
  • 20: SODA, a theoretical CS conference on algorithms
  • 19: AAAI, another AI conference
  • 19: EDBT, the practical database conference held jointly with ICDT
  • 18: WWW, a conference about the World Wide Web
  • 15: Description Logics, the workshop on description logics
  • 15: ICDE, a practical data management conference
  • 13: CIKM, an information and knowledge management conference
  • 12: SEBD, the Italian conference on databases
  • 12: STOC, a general-purpose theoretical computer science conference
  • 11: KR, a conference on knowledge representation and reasoning
  • 11: FOCS, another general-purpose theoretical computer science conference

It would be interesting to visualize this data differently, e.g., visualize a world map with the community members, but sadly the affiliation information in DBLP is too sparse for this to work.

comments welcome at a3nm<REMOVETHIS>@a3nm.net