Something that's been bouncing around in my head is the topic of data collection and analysis, specifically in the security realm. I was part of the OWASP Top 10 2017 data collection and analysis process (and might be for future ones), and I'm also starting to work on the OWASP SAMM Benchmarking project, which will likely encounter similar challenges.
Data collection and management is a large challenge for projects in open-source organizations like OWASP. For a group of people who come together around a common goal and want to remain mostly neutral, things a large organization takes for granted can present hard challenges to a loose band of volunteers trying to help in an area they are passionate about.
For many years, data collection for the OWASP Top 10 was a fairly quiet affair. Then a few people started making noise that it should be a fully public dataset so anyone could work on it. While this is a noble goal, there are real-world trade-offs. For the Top 10 2017, we had to explain to organizations considering contributing data that, yes, anyone in the world would be able to see the raw data. Several companies declined to contribute because of that stipulation. I was working with an organization that had over half a million records it could have contributed, but legal couldn't approve the release because they feared (rightly so) that someone could use the data against them. Not to breach anything, but to attempt to damage their brand.
So you have to ask yourself: would you rather have a larger dataset that isn't public, or a smaller one that is? Almost all the contributors to the Top 10 2017 were service providers, and the numbers they reported were already anonymized aggregates across a number of their clients. That makes sense: there is a tangible risk in disclosing your internal vulnerability metrics to the world, but once the data is aggregated, it generally can't be traced back to any individual organization.
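To make that aggregation point concrete, here's a minimal sketch in Python of the kind of roll-up a service provider might do before contributing. The field names, client names, and the minimum-client threshold are entirely my own assumptions for illustration, not the actual Top 10 submission format: per-client findings collapse into category totals, and a bucket is only published if enough distinct clients back it that no single organization's numbers can be inferred.

```python
from collections import defaultdict

# Hypothetical per-client findings a service provider might hold internally.
# Field names and the k=3 threshold are my own assumptions, not the actual
# Top 10 submission format.
findings = [
    {"client": "acme",    "category": "Injection", "count": 120},
    {"client": "acme",    "category": "XSS",       "count": 45},
    {"client": "globex",  "category": "Injection", "count": 80},
    {"client": "initech", "category": "Injection", "count": 10},
    {"client": "initech", "category": "XSS",       "count": 5},
]

K_MIN_CLIENTS = 3  # suppress any bucket backed by too few clients

totals = defaultdict(int)
clients_per_category = defaultdict(set)
for f in findings:
    totals[f["category"]] += f["count"]
    clients_per_category[f["category"]].add(f["client"])

# Only publish categories where enough distinct clients contributed that
# no single organization's numbers can be inferred from the aggregate.
contribution = {
    cat: total
    for cat, total in totals.items()
    if len(clients_per_category[cat]) >= K_MIN_CLIENTS
}
print(contribution)  # {'Injection': 210} -- XSS suppressed (only 2 clients)
```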
We've had similar discussions for OWASP SAMM. We are building a benchmark of software security maturity scores and are working through how to collect, manage, and distribute the data. It's a challenging set of details: how to collect submissions, how to allow updates, what metadata could or should be collected, how anonymous submissions need to be, how to handle SAMM versions, and how to provide meaningful metrics that help organizations learn from others while protecting the submitting organizations and earning their trust.
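As a thought experiment, here's a rough sketch of what an anonymized benchmark submission record could look like. Every field name, bucket, and the salted-hash scheme here is an assumption on my part, not the project's actual design; the idea is that a stable pseudonym lets an organization update its own scores later without its real name ever being stored.

```python
import hashlib
from dataclasses import dataclass, field

SALT = "project-secret-salt"  # assumption: a salt kept private by the project


def anon_org_id(org_name: str) -> str:
    """Stable pseudonym so an org can update its own submission later
    without its real name being stored anywhere in the dataset."""
    return hashlib.sha256((SALT + org_name.lower()).encode()).hexdigest()[:16]


@dataclass
class SammSubmission:
    # All fields are illustrative guesses at useful benchmark metadata.
    org_id: str        # pseudonymous, from anon_org_id()
    samm_version: str  # e.g. "2.0" -- scores aren't comparable across versions
    industry: str      # coarse bucket, e.g. "finance", to limit re-identification
    org_size: str      # coarse bucket, e.g. "1000-5000 employees"
    assessed: str      # assessment date, for versioning and updates
    scores: dict = field(default_factory=dict)  # practice -> maturity score (0-3)


submission = SammSubmission(
    org_id=anon_org_id("Example Corp"),
    samm_version="2.0",
    industry="finance",
    org_size="1000-5000 employees",
    assessed="2019-06-01",
    scores={"Threat Assessment": 1.5, "Secure Build": 2.0},
)
```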
Should the raw contributed SAMM scores be public data?
Honestly, in my opinion, the answer is no.
And I think the raw contributed data for the Top 10 2020 shouldn't be public either.
Look at all the industry analysis produced by numerous organizations. I don't believe any of that raw data has been provided to the public, and I'm OK with that. I have my own opinions (as I'm sure you do) about some of those reports, but at the end of the day we have to choose whether or not to trust their analysis. Sometimes we can; sometimes, not so much. If keeping the raw data private gets us more data, with broader reach and possibly a little more detail, I think it's totally worth it. You may not agree with me, and that's fine; I'm just stating my opinion based on my experience.
I would love to start building a knowledge base of which vulnerabilities are more prevalent in specific languages and frameworks, among other correlations. We have all this data in tiny silos; we really need to put it to good use. I've previously talked about the many questions we haven't been able to answer for lack of a clean, solid, large dataset. With one, we could start fixing what we can in the languages and frameworks themselves and teach people how to code securely for the rest. So much could be done in this space if we sit down and make it a bit more of a priority; a toy sketch of the kind of analysis I mean follows.
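The records and numbers below are made up purely to show the shape of the question; the point is that with a real, large dataset this kind of prevalence-by-language breakdown becomes a few lines of analysis.

```python
from collections import Counter, defaultdict

# Entirely fabricated example records -- the interesting part is the question
# a real dataset would let us ask: which vulnerability classes cluster in
# which languages and frameworks?
apps = [
    {"language": "java",   "framework": "spring",  "vulns": ["injection", "xxe"]},
    {"language": "java",   "framework": "spring",  "vulns": ["xxe"]},
    {"language": "python", "framework": "django",  "vulns": ["xss"]},
    {"language": "python", "framework": "django",  "vulns": []},
    {"language": "php",    "framework": "laravel", "vulns": ["injection", "xss"]},
]

apps_per_language = Counter(app["language"] for app in apps)
hits = defaultdict(Counter)
for app in apps:
    for vuln in set(app["vulns"]):
        hits[app["language"]][vuln] += 1

# Prevalence: the fraction of assessed apps in each language with at least
# one finding in each vulnerability category.
for language, counts in sorted(hits.items()):
    for vuln, n in counts.most_common():
        rate = n / apps_per_language[language]
        print(f"{language:>6} {vuln:<10} {rate:.0%} of apps")
```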