fixes following discussion with thomas - conference_footprint - compute the CO2 footprint of an academic conference

commit edcbb997c5b2264872c74ce04b266a43766ab3a4
parent 1936730c02e2ebe1d3c8cd4180e304a36b370200
Author: Antoine Amarilli <a3nm@a3nm.net>
Date:   Mon,  3 Oct 2022 19:07:44 +0200

fixes following discussion with thomas

Diffstat:
README.md  | 148 ++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
addnoise.py  | 16 ++++++++++++++++
co2.py  | 29 ++++++++---------------------
compute.py  | 37 +++++++++++++++++--------------------
run.sh  | 4 +++-

5 files changed, 127 insertions(+), 107 deletions(-)
diff --git a/README.md b/README.md
@@ -6,38 +6,66 @@ an academic conference.
 It was used to compute the footprint of the [Highlights'22
 conference](https://highlights-conference.org/2022/).
 
-## Input data format
-
-The input data should be provided as a CSV field containing the following
-fields:
-- Field 1: Name of participant
-- Field 2: Institution of participant
-- Field 3: 3-letter airport or metropolitan area code of origin (first leg, before the conference)
-- Field 4: Transportation means of the first leg: "train", "plane", "bus/coach", or
-    "other" or "" to mean it is unknown.
-- Field 5: 3-letter code of destination (second leg, after the conference)
-- Field 6: Transportation means of the second leg
-- Field 7: "True" if the participant is extending their stay, i.e., travelling for
-    scientific reasons other than the conference. For such participants, the
-    computation will only take the longest of the two legs into account.
+## Data collection
+
+We collected information about the travel plans of participants using a [web
+form](https://framaforms.org/highlights-participant-travel-information-1664806487)
+([archive](https://web.archive.org/web/20221003161159/https://framaforms.org/highlights-participant-travel-information-1664806487)).
+To ensure that everyone filled the form, the link to payment was only given once
+the form was completed.
+
+We manually removed duplicate records and fake data.
+
+For people who did not fill in the details of their travel, we:
+
+- assumed that they were coming to/from the institution of their first
+  affiliation
+- when the transportation mechanism was unspecified, we assumed that trips of
+  <=400km were done by rail and trips of >400km were done by plane, following:
+  https://github.com/ConferenceCarbonTracker/CarbonFootprintAGU#44-mode-of-transport
+
+Afterwards, we discarded the name and institution of participants.
+
+We manually translated the free-form city and country to a
+machine-understandable location by searching by hand for the closest
+three-letter code (airport or metropolitan area). This step could be automated.
+
+The result is a CSV file in the following format:
+
+- Field 1: 3-letter airport or metropolitan area code of origin (first leg, before the conference)
+- Field 2: Transportation means of the first leg: "train", "plane", or "bus/coach".
+- Field 3: 3-letter code of destination (second leg, after the conference)
+- Field 4: Transportation means of the second leg
+- Additional fields, e.g., a field indicating if the participant is extending
+    their stay for scientific reasons other than the conference.
 
 ## Running the computation
 
 You need python3, standard shell utilities, and `GeodSolve` from Debian package
 `geographiclib-tools`.
 
-Run `./run.sh FILE CODE LAT LON` where:
+Run `./run.sh FILE CODE LAT LON NOISE` where:
 
 - FILE is the CSV file above
 - CODE is the 3-letter code used for local participants (their trips will be
     ignored)
-- LAT and LON are the geographical coordinates 
+- LAT and LON are the geographical coordinates where the conference is taking
+    place.
+- NOISE is the percentage of random error added to the distance (e.g., 0.2 for
+    20%)
 
 The script will generate:
 
 - map.geojson: a Geojson file displaying the various points of travel with color
     describing whether they are by plane or not. This can be plotted, e.g., with
     [uMap](http://umap.openstreetmap.fr/fr/).
+- `anonymized_participants`, a comma-separated list of participants with headers
+    and with the following fields:
+      - Field 1: mode of first leg (as above)
+      - Field 2: distance of first leg in meters, with random error added
+      - Field 3: mode of second leg (as above)
+      - Field 4: distance of second leg in meters, with random error added
+      - All additional fields in the input are left as-is.
 - `trips_with_footprint`, a comma-separated list of trips with the following
     fields:
       - Field 1: name (note that commas are dropped from names)
@@ -48,65 +76,22 @@ The script will generate:
 - It will also output some aggregate values on the standard error output, and prepare temporary files `trips`
     and `trips_with_dist`
 
-## Highlights'22 methodology
+## Footprint computation methodology
 
-### Registration form data collection
+### Local participants
 
-The Highlights registration form asked particiants:
-
-- "To estimate the carbon footprint of this edition of Highlights, please give
-  us some information about your travel"
-- "Arriving from": city and country, free text
-- "Arriving by": other / plane / train / bus or coach / car / local transportation (for locals)
-- ditto for departure
-- Extended stays: we asked whether:
-  - They participated to a co-located conference
-  - They participated to an extended stay support scheme
-  - They were "extending their stay for scientific reasons by another way"
-
-The fields were optional, but almost everyone filled them.
-
-### Processing and completing the registration form information
-
-We took the registration data and manually removed obviously fake submissions
-and apparent duplicates.
-
-We ignored local participants, for which we estimate a CO2 footprint of 0.
-
-For people who did not fill in the details of their travel, we:
-
-- assumed that they were coming to/from the institution of their first
-  affiliation
-- when the transportation mechanism was unspecified, we assumed that trips of
-  <=400km were done by rail and trips of >400km were done by plane, following:
-  https://github.com/ConferenceCarbonTracker/CarbonFootprintAGU#44-mode-of-transport
-
-This gives us a list of trips: each participant has 2 trips, each trip has an origin and
-destination (one of them conference venue) and a transportation mode.
+We ignore local participants, for which we estimate a CO2 footprint of 0.
 
 ### Geocoding and distance computation
 
-We manually translated the free-form city and country to a
-machine-understandable location by searching by hand for the closest
-three-letter code (airport or metropolitan area). We used the OpenFlights
+ We used the OpenFlights
 database airport-extended.dat on [this page](https://openflights.org/data.html) to convert these
 to geographical coordinates, and used known geographic coordinates for the
 conference venue. We used GeodSolve to compute the distance of each trip.
 
-### Adjusting for other scientific reasons
-
-For participants whose stay had other scientific justifications (no matter
-which), we counted only the longest of the two trips. The effect is basically to
-halve their emissions by considering that the conference carries half the
-responsibility. The reason why we do this instead of dividing the total by two
-is to make sure that we account for one of the "long trips" required between
-their institution and conference venue: indeed, some participants gave details
-of these long trips, whereas other gave details of one long trip and one trip to
-a neighboring place, e.g., for an extended stay.
+### Carbon footprint
 
-### Footprint computation
-
-Given this list of trips, we then compute their CO2 fotprint following the
+We compute the CO2 fotprint following the
 [labos1point5](https://labos1point5.org/ges-1point5) data, which is adapted from
 the French agency [Ademe](https://www.ademe.fr/).
 
@@ -129,3 +114,36 @@ the French agency [Ademe](https://www.ademe.fr/).
 
 We then sum the total emissions to arrive at the final value.
 
+## Highlights'22 methodology
+
+### Data collection
+
+The [Highlights registration
+form](https://framaforms.org/highlights2022-on-site-registration-1652701135) 
+([archive](https://web.archive.org/web/20220622164245/https://framaforms.org/highlights2022-on-site-registration-1652701135))
+asked particiants:
+
+- "To estimate the carbon footprint of this edition of Highlights, please give
+  us some information about your travel"
+- "Arriving from": city and country, free text
+- "Arriving by": other / plane / train / bus or coach / car / local transportation (for locals)
+- ditto for departure
+- Extended stays: we asked whether:
+  - They participated to a co-located conference
+  - They participated to an extended stay support scheme
+  - They were "extending their stay for scientific reasons by another way"
+
+The fields were optional, but almost everyone filled them.
+
+### Adjusting for other scientific reasons
+
+In the carbon footprint, to account for participants whose stay had other
+scientific justifications (no matter which), we counted only the longest of the
+two trips. The effect is basically to halve their emissions by considering that
+the conference carries half the responsibility. The reason why we do this
+instead of dividing the total by two is to make sure that we account for one of
+the "long trips" required between their institution and conference venue:
+indeed, some participants gave details of these long trips, whereas other gave
+details of one long trip and one trip to a neighboring place, e.g., for an
+extended stay.
+
diff --git a/addnoise.py b/addnoise.py
@@ -0,0 +1,16 @@
+#!/usr/bin/env python3
+
+import sys
+from random import uniform
+
+noise = float(sys.argv[1])
+
+print( "mode,distance in meters")
+
+for l in sys.stdin.readlines():
+    f = l.strip().split(',')
+    mode = f[0]
+    dist = float(f[3])
+    dist_anon = round(uniform(dist * (1-noise), dist * (1+noise)))
+    print(','.join((mode, str(dist_anon))))
+
diff --git a/co2.py b/co2.py
@@ -8,8 +8,6 @@ import json
 import sys
 from collections import defaultdict
 
-seen = set()
-
 places = defaultdict(lambda: (0, 0, None))
 
 n_trips = 0
@@ -22,23 +20,11 @@ co2_by_type = defaultdict(lambda : 0)
 
 for l in sys.stdin.readlines():
     f = l.strip().split(',')
-    person = f[0]
-    inst = f[1]
-    mode = f[2]
-    multitrip = f[3] == "True"
-    lat = f[4]
-    lon = f[5]
-
-    if multitrip and person in seen:
-        # for a multi-purpose trip, only count the first transport leg of that
-        # person
-        # we assume that the input is sorted by decreasing distance so that it's
-        # the longest leg
-        continue
-
-    seen.add(person)
+    mode = f[0]
+    lat = f[1]
+    lon = f[2]
 
-    distance = float(f[6])
+    distance = float(f[3])
 
     if mode.strip() not in ['plane', 'train', 'bus/coach']:
         if distance > 400000:
@@ -52,7 +38,7 @@ for l in sys.stdin.readlines():
 
     k = (lat,lon)
     plane = mode == "plane"
-    places[k] = (places[k][0] + (1 if plane else 0), places[k][1] + 1, inst)
+    places[k] = (places[k][0] + (1 if plane else 0), places[k][1] + 1)
 
     dist_by_type[mode] += distance
     n_trips += 1
@@ -72,7 +58,7 @@ for l in sys.stdin.readlines():
     co2 = (g_km_person * (distance / 1000.))/1000.
     co2_by_type[mode] += co2
     total_co2 += co2
-    print (','.join([person, inst, str(distance), mode, str(co2)]))
+    print (','.join([str(distance), mode, str(co2)]))
 
 
 ## OUTPUT GEOJSON
@@ -86,7 +72,8 @@ for k in places.keys():
     feature = {
       "type": "Feature",
       "properties": {
-          "name":places[k][2], "_umap_options": {"color": color}
+          #"name":places[k][2],
+          "_umap_options": {"color": color}
       },
       "geometry": {
         "type": "Point",
diff --git a/compute.py b/compute.py
@@ -27,28 +27,25 @@ n_extend = 0
 with open(datafile, 'r') as f:
     csvreader = csv.reader(f)
     for row in csvreader:
-        name = row[0].replace(',', '')
-        institution = row[1].replace(',', '')
-        fromcode = row[2]
-        frommode = row[3]
-        tocode = row[4]
-        tomode = row[5]
-        extend = row[6] == "True"
-        added = False
+        fromcode = row[0]
+        frommode = row[1]
+        tocode = row[2]
+        tomode = row[3]
         n_participants += 1
-        if extend:
-            n_extend += 1
+        if fromcode == localcode:
+            assert (frommode in ["local", "other", ""])
+            assert (tomode in ["local", "other", ""])
+            assert (tocode == localcode)
+            continue
+        if frommode == "":
+            frommode = "other"
+        if tomode == "":
+            tomode = "other"
+        assert (frommode in ["bus/coach", "plane", "train", "other"])
+        assert (tomode in ["bus/coach", "plane", "train", "other"])
+        n_nonlocals += 1
         for (mode, code) in [(frommode, fromcode), (tomode, tocode)]:
-            if code == localcode:
-                continue
-            added = True
-            n_nonlocal_trips += 1
-            print (','.join((name, institution, mode, str(extend),
-                airports[code][0], airports[code][1])))
-        if added:
-            n_nonlocals += 1
+            print (','.join((mode, airports[code][0], airports[code][1])))
 
 print("%d total participants" % n_participants, file=sys.stderr)
 print("%d nonlocal participants" % n_nonlocals, file=sys.stderr)
-print("%d nonlocal trips" % n_nonlocal_trips, file=sys.stderr)
-print("%d extending" % n_extend, file=sys.stderr)
diff --git a/run.sh b/run.sh
@@ -7,6 +7,7 @@ FILE="$1"
 LOCALCODE="$2"
 LAT="$3"
 LON="$4"
+NOISE="$5"
 
 if [ ! -f airports.dat ]
 then
@@ -15,6 +16,7 @@ then
 fi
 
 ./compute.py "$FILE" "$LOCALCODE" > trips
-paste -d, trips <(cut -d, -f5,6 trips | tr ',' ' ' | sed "s/^/$LAT $LON /" | GeodSolve -i| cut -d ' ' -f3 ) | sort -t',' -k7,7nr > trips_with_dist
+paste -d, trips <(cut -d, -f2,3 trips | tr ',' ' ' | sed "s/^/$LAT $LON /" | GeodSolve -i| cut -d ' ' -f3 ) | sort -t',' -k4,4nr > trips_with_dist
+./addnoise.py "$NOISE" < trips_with_dist > trips_anonymized.csv
 python3 co2.py < trips_with_dist > trips_with_footprint
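
Reading note on the per-trip logic touched by this commit: the sketch below restates, under assumptions, the three steps a trip goes through after the distance is known. These are the mode fallback in `co2.py` (unknown modes become rail up to 400 km and plane beyond, as the README states), the footprint arithmetic `(g_km_person * (distance / 1000.)) / 1000.`, and the uniform noise from `addnoise.py`. The function names `effective_mode`, `trip_co2_kg` and `add_noise` are hypothetical and do not exist in the repository. The emission factor is passed in as a parameter because the factor table is not visible in this diff, and the result is read as kilograms on the assumption that `g_km_person` is in grams per kilometre per person.

```python
import random

# Modes that co2.py accepts as-is; anything else falls back on distance.
KNOWN_MODES = ("plane", "train", "bus/coach")


def effective_mode(mode, distance_m):
    """Return the mode to use for a trip (hypothetical helper).

    Unknown modes (e.g. "other" or "") become "plane" above 400 km
    and "train" otherwise, mirroring the README's stated assumption.
    """
    mode = mode.strip()
    if mode in KNOWN_MODES:
        return mode
    return "plane" if distance_m > 400_000 else "train"


def trip_co2_kg(distance_m, g_per_km_person):
    """Footprint of one trip (hypothetical helper).

    Same arithmetic as co2.py: metres -> km, then grams -> kilograms.
    The emission factor is an input here, not a value from the repository.
    """
    return (g_per_km_person * (distance_m / 1000.0)) / 1000.0


def add_noise(distance_m, noise, rng=random):
    """Perturb a distance uniformly within +/- noise (hypothetical helper).

    `noise` is a fraction, e.g. 0.2 for 20%, as in addnoise.py.
    """
    return round(rng.uniform(distance_m * (1 - noise), distance_m * (1 + noise)))


if __name__ == "__main__":
    dist = 1_000_000.0  # an illustrative 1000 km trip, not real data
    mode = effective_mode("other", dist)
    # 100.0 g/km/person is a placeholder factor chosen for the example only.
    print(mode, trip_co2_kg(dist, 100.0), add_noise(dist, 0.2))
```

Because `add_noise` draws from a uniform distribution, two runs on the same input give different published distances unless a seeded `random.Random` instance is passed as `rng`.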

	conference_footprint compute the CO2 footprint of an academic conference
	git clone https://a3nm.net/git/conference_footprint/