Frequently Asked Questions

1. What are some available data sources? What is the resolution of these datasets?

You can refer to the COVID-19 open datasets; for further details, please see the README in the pet-epi-notebooks repository. Most COVID-19 datasets have a daily time resolution and a city-level spatial resolution. The exception is the data for Bogotá D.C., which is available at the individual level, with each case assigned to a localidad (a group of postal codes).

Furthermore, even though there is no merchant ID column in the dataset, we confirm that the dataset is merchant-level. Each record contains a merchant’s data for a specific week. There are currently no plans to make merchant IDs available. However, please feel free to include considerations on merchant IDs in your project proposal, e.g. by explaining how they would benefit your proposed solution, to motivate their inclusion in future data releases by the data holder.

2. What is the level of granularity for the transaction data provided?

The transactional data are merchant-level weekly data (sum of transaction amounts and average transaction amounts for each week for each merchant), and the location granularity is at the postal code level. Participants have access to mock data to experiment with the dataset. If you have not received this access, please contact petschallenge@data.org.

3. Are there any other covariates available for data analysis?

There are currently no plans to make other covariates available. However, please feel free to include considerations on additional covariates in your project proposal, e.g. by explaining how they would benefit your proposed solution, to motivate their inclusion in future data releases by the data holder.

For Epi data we suggest using the following datasets:

4. In policy scenario 1, what state-of-the-art epidemiological techniques are you referring to?

Any techniques and methods, other than those listed in the remaining policy scenarios, that can be supported by the literature. It is up to the participants to propose relevant analyses for the Unconstrained Scenario.

5. Will the participants be able to access COVID-19 cases by postcode?

To the best of our knowledge, the only COVID-19 public dataset for the considered regions containing coarse-grained spatial information is the confirmed cases for Bogotá city published by datosabiertos.bogota.gov.co, where the variable ‘Localidad’ refers to the residence area of each case; each residence area can correspond to several postcodes. A data dictionary to convert each residence area into the corresponding postal codes can be consulted in the pet-epi-notebooks repository.
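As a rough illustration of how such a data dictionary might be used, the sketch below inverts a residence-area-to-postcodes mapping so that a postal code from the transaction data can be matched to a residence area in the case data. The names LOCALIDAD_A, PC_001 and so on are placeholders rather than real localidades or postal codes, and the sketch assumes that each postal code belongs to exactly one residence area, which may not hold in the real dictionary.

```python
from typing import Dict, List

# Placeholder mapping: NOT real localidades or postal codes.
LOCALIDAD_TO_POSTCODES: Dict[str, List[str]] = {
    "LOCALIDAD_A": ["PC_001", "PC_002"],
    "LOCALIDAD_B": ["PC_003"],
}


def postcode_lookup(mapping: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert residence area -> postal codes into postal code -> residence area.

    Raises ValueError if a postal code appears under two residence areas,
    because this sketch assumes a one-to-one assignment.
    """
    lookup: Dict[str, str] = {}
    for localidad, postcodes in mapping.items():
        for postcode in postcodes:
            if postcode in lookup:
                raise ValueError(f"{postcode} is listed under more than one residence area")
            lookup[postcode] = localidad
    return lookup


lookup = postcode_lookup(LOCALIDAD_TO_POSTCODES)
print(lookup["PC_002"])  # LOCALIDAD_A
```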

6. Can the prize money be used to fund salaries of e.g. RA or PDRA? More generally, are there conditions on its use?

Yes, as long as the individual’s salary can be justified as part of the development and implementation team. The prize money is designated for the development and implementation of the tools that have been proposed and not for any generic support of staff.

7. Are the sample transaction data that were shared real data and of the same format that we would have access to in the real competition?

The merchant-level weekly transactional data shared is mock data for the project proposal phase. This mock dataset is in the same format as the real dataset.

8. We were thinking our key aim would be to estimate two sets of Rt — one set (a base scenario) where Rt is informed only by case counts, and another set where we use measures of contact/mobility derived from the financial data to estimate Rt. Given that these estimates both represent highly aggregated measures, we weren’t sure how any merchant-specific info could be derived from these estimates. Is it a must to use differential privacy for this project?

A key component of the question about the effective reproductive number, Rt, is to show how privacy-enhanced transactional data can be used to inform contact patterns and improve real-time Rt estimations. To answer this question, we believe that the following resources should be useful: Carvalho et al., 2021; the GitHub repo TRACE-LAC/pet-epi-notebooks for the epidemiology part; and yanisvdc/opendp_baseline for the privacy part, along with resources on differential privacy.

You are free to use the merchant-level transactional data in whatever way you consider best. The rule is that you need to use differential privacy whenever you want to access information from the transactions (e.g. size of the dataset, count / mean / sum of any subpopulation). If you plan to estimate Rt from the financial data, you will need to incorporate differential privacy within your epidemiological model, which may mean accounting for the added noise in your statistical estimation problem.
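To make the rule concrete, here is a minimal sketch of a differentially private count and sum using the Laplace mechanism. It is an illustration only and not the challenge's reference implementation: the function names, the clamping bound `upper`, the sample amounts and the choice of one record as the unit of privacy are all assumptions. Sampling noise with the standard library is also not hardened against floating-point attacks, so a vetted library such as OpenDP would be the safer choice for actual submissions.

```python
import random
from typing import Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values: Sequence[float], epsilon: float, rng: random.Random) -> float:
    """Noisy count. Adding or removing one record changes the count by at most 1."""
    return len(values) + laplace_noise(1.0 / epsilon, rng)


def dp_sum(values: Sequence[float], upper: float, epsilon: float, rng: random.Random) -> float:
    """Noisy sum of values clamped to [0, upper], so one record shifts the sum by at most `upper`."""
    clamped = [min(max(v, 0.0), upper) for v in values]
    return sum(clamped) + laplace_noise(upper / epsilon, rng)


rng = random.Random(0)
weekly_amounts = [120.0, 80.0, 310.0, 45.0]  # made-up values for illustration
noisy_n = dp_count(weekly_amounts, epsilon=0.5, rng=rng)
noisy_total = dp_sum(weekly_amounts, upper=500.0, epsilon=0.5, rng=rng)
print(noisy_n, noisy_total)
```

The two releases above each consume epsilon = 0.5, so under basic sequential composition the pair costs epsilon = 1.0 in total. An Rt model fed with such outputs would need to treat them as noisy observations with known Laplace noise scale.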

9. I/My team missed the webinar, can you share a list of resources?

Yes, here is a list of resources we are making available to every participant.

Accessing Code: We’ve made the code samples discussed during the webinar available on GitHub. To access them, simply follow this link: GitHub Repository. These samples are designed to provide hands-on experience and further your understanding of the practical applications of privacy-enhancing technologies.
Sample Data: You can access the data using this S3 link – HERE. The link downloads a password-protected zip file. Please reach out to the PETs Challenge team at petschallenge@data.org to receive the password to open the dataset.

Resources on Epidemiology modelling:
- You may review examples for some of the epidemiological decision-making policy scenarios of the challenge, which use open-source databases available online for the 4 selected locations. Please access them HERE
- R scripts for downloading Epi data: HERE

10. Could you please explain what the avg_amt in the dataset represents?

Here is the definition of avg_amt:

avg_amt = amount divided by transaction count, for a given week.
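The definition can be sketched in a few lines. Only avg_amt is named in this FAQ; the record type and the field names `week`, `amount` and `txn_count` are hypothetical, and returning None for a week with no transactions is a choice made for this sketch rather than documented behaviour of the dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WeeklyRecord:
    # Hypothetical field names for one merchant in one week.
    week: str
    amount: float   # sum of transaction amounts for the week
    txn_count: int  # number of transactions in the week


def avg_amt(record: WeeklyRecord) -> Optional[float]:
    """Return amount divided by transaction count, or None if the count is zero."""
    if record.txn_count == 0:
        return None
    return record.amount / record.txn_count


record = WeeklyRecord(week="2020-W14", amount=1250.0, txn_count=50)
print(avg_amt(record))  # 25.0
```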

11. What would be the recommended size and composition of the team?

Our official website states that while we do not permit collusion, we understand the importance of collaboration. Hence, teams are capped at three members. Should multiple submissions come from the same organization, we simply require a justification.

12. What are the time commitments associated with the first phase of the challenge?

The challenge consists of two phases. The first phase is about crafting a proposal in response to the questions and problems we’ve presented on our website. The time this phase takes will vary based on your team’s availability.

13. Is there any financial support available to the participants?

We provide financial support in the second phase, which involves tool development. This funding is designed to allow your team to concentrate on developing the tool described in your proposal, which should be scalable and which we will evaluate for its potential.

14. Is there a universal scale for Epsilon to compare implementations of a DP pipeline?

The scale for Epsilon isn’t universally defined; however, a smaller Epsilon, around one or two, is generally preferred for statistical releases. In machine learning models, it’s more common to see Epsilon values pushed to between eight and ten. The key consideration isn’t adhering to a strict rule but rather demonstrating that the privacy budget has been utilized judiciously. This means showing how, within the confines of a fixed privacy budget, the application remains useful, particularly for purposes like epidemiological research. The overarching guideline is to keep Epsilon as low as feasible while still extracting meaningful data. It’s important to clearly articulate how the privacy budget is allocated throughout your methodology to derive insights from the actual dataset. This approach to using and justifying the Epsilon value is what’s generally expected.
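One way to document how a budget is allocated is to log every release against a fixed total. The sketch below assumes basic sequential composition, where the Epsilon values of successive releases on the same data simply add up; tighter accounting methods exist and may be more appropriate. The class name, the purposes and the numbers are illustrative assumptions.

```python
from typing import List, Tuple


class PrivacyBudget:
    """Tracks Epsilon spent under basic sequential composition (values add up)."""

    def __init__(self, total_epsilon: float) -> None:
        self.total = total_epsilon
        self.spent = 0.0
        self.log: List[Tuple[str, float]] = []

    def spend(self, epsilon: float, purpose: str) -> float:
        """Record a release, refusing it if it would exceed the total budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:  # small tolerance for float rounding
            raise RuntimeError(f"privacy budget exceeded by release: {purpose}")
        self.spent += epsilon
        self.log.append((purpose, epsilon))
        return epsilon


budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.25, "weekly transaction counts")
budget.spend(0.75, "weekly transaction sums")
print(budget.spent)  # 1.0
```

The resulting log gives a per-release breakdown that can be quoted directly when justifying the overall Epsilon in a proposal.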

15. Regarding the tools we develop as part of this challenge, are teams permitted to publish academic papers or articles on them afterward?

Teams are allowed to publish papers on the tools they develop. Our funders are advocates for supporting research. However, it is essential to follow a specific process for publication. If your project is selected, the terms of reference will outline this process. So, while we encourage research and publication, adhering to the outlined procedures is important.