
The GitHub Blog

Stay inspired with updates, ideas, and insights from GitHub to aid developers in software design and development.

December 20, 2024  19:08:03

I’m excited to share an interview with two researchers that I’ve had the privilege of collaborating with on a recently released paper studying how open source maintainers adjust their work after they start using GitHub Copilot:

  • Manuel Hoffmann is a postdoctoral scholar at the Laboratory for Innovation Science at Harvard housed within the Digital, Data, and Design Institute at Harvard Business School. He is also affiliated with Stanford University. His research interests lie in social and behavioral aspects around open source software and artificial intelligence, under the broader theme of innovation and technology management, with the aim of better understanding strategic aspects for large, medium-sized, and entrepreneurial firms.
  • Sam Boysel is a postdoctoral fellow at the Laboratory for Innovation Science at Harvard. He is an applied microeconomist with research interests at the intersection of digital economics, labor and productivity, industrial organization, and socio-technical networks. Specifically, his work has centered around the private provision of public goods, productivity in open collaboration, and welfare effects within the context of open source software (OSS) ecosystems.

Research Q&A

Kevin: Thanks so much for chatting, Manuel and Sam! So, it seems like the paper’s really making the rounds. Could you give a quick high-level summary for our readers here?

Manuel: Thanks to the great collaboration with you, Sida Peng from Microsoft, and Frank Nagle from Harvard Business School, we study the impact of GitHub Copilot on developers and how this generative AI alters the nature of work. We find that when open source developers are given a generative AI tool that reduces the cost of the core work of coding, they increase their coding activities and reduce their project management activities. The results are strongest in the first year after the introduction but still persist even two years later. They are driven by developers working more autonomously and less collaboratively, since they no longer have to engage with other humans to solve a problem; they can solve it with AI assistance.

GitHub Copilot, on average, increases the share of coding for maintainers who use it. These effects are strongest in the first year, but they still persist over the course of two years.
GitHub Copilot, on average, reduces the share of project management for maintainers who use it. These effects are strongest in the first year, but still persist over the course of two years.

Sam: That’s exactly right. We tried to understand the nature of work even further by digging into the paradigm of exploration vs. exploitation. Loosely speaking, exploitation means exerting effort toward the most lucrative of the already known options, while exploration means experimenting to find new options with a higher potential return. We tested this idea in the context of GitHub. Developers who had access to GitHub Copilot engaged in more exploration and less exploitation: they started new projects and had a lower propensity to work on older ones. Additionally, they exposed themselves to more languages that they had not worked with before, and in particular to languages that are valued more highly in the labor market. A back-of-the-envelope calculation of the value of this experimentation with new languages due to GitHub Copilot suggests around half a billion USD within a year.

Kevin: Interesting! Could you provide an overview of the methods you used in your analysis?

Manuel: Happy to! We use a regression discontinuity design in this work, which is as close as you can get to a randomized controlled trial when working purely with pre-existing data, such as GitHub’s, without the researcher introducing randomization. Instead, the regression discontinuity design is based on a ranking of developers and a threshold that GitHub uses to determine eligibility for free access to GitHub Copilot through its top maintainer program.

Sam: The main idea of this method is that a developer just below the threshold is roughly identical to a developer just above it. Stated differently, by chance a developer happened to have a ranking that made them eligible for the generative AI tool when they just as easily could not have been eligible. Combined with the fact that developers know neither the threshold nor GitHub’s internal ranking, we can be confident that the changes we observe in coding, project management, and the other activities on the platform are driven by access to GitHub Copilot and nothing else.

Kevin: Nice! Follow-up question: could you provide an “explain like I’m 5” overview of the methods you used in your analysis?

Manuel: Sure thing, let’s illustrate the problem a bit more. Some people use GitHub Copilot and others don’t. If we just looked at the differences between people who use GitHub Copilot vs. those who don’t, we’d be able to see that certain behaviors and characteristics are associated with GitHub Copilot usage. For example, we might find that people who use GitHub Copilot push more code than those who don’t. Crucially, though, that would be a statement about correlation and not about causation. Often, we want to figure out whether X causes Y, and not just that X is correlated with Y. Going back to the example, if it’s the case that those who use GitHub Copilot push more code than those who don’t, these are a few of the different explanations that might be at play:

  1. Using GitHub Copilot causes people to push more code.
  2. Pushing more code causes people to use GitHub Copilot.
  3. There’s something else that both causes people to use GitHub Copilot and causes them to push more code (for example, being a professional developer).

Because (1), (2), and (3) could each result in data showing a correlation between GitHub Copilot usage and code pushes, just finding a correlation isn’t super interesting. One way you could isolate the cause and effect relationship between GitHub Copilot and code pushes, though, is through a randomized controlled trial (RCT).

In an RCT, we randomly assign people to use GitHub Copilot (the treatment group), while others are forbidden from using GitHub Copilot (the control group). As long as the assignment process is truly random and the users comply with their assignments, any outcome differences between the treatment and control groups can be attributed to GitHub Copilot usage. In other words, you could say that GitHub Copilot caused those effects. However, as anyone in the healthcare field can tell you, large-scale RCTs over long time periods are often prohibitively expensive, as you’d need to recruit subjects to participate, monitor them to see if they complied with their assignments, and follow up with them over time.

Sam: That’s right. So, instead, wouldn’t it be great if there were a way to observe developers without running an RCT and still draw valid causal conclusions about GitHub Copilot usage? That’s where the regression discontinuity design (RDD) comes in. The random assignment aspect of an RCT allows us to compare the outcomes of two virtually identical groups. Sometimes, however, randomness already exists in a system, and we can use it as a natural experiment. In the case of our paper, this randomness came in the form of GitHub’s internal ranking for determining which open source maintainers were eligible for free access to GitHub Copilot.

Let’s walk through a simplified example. Let’s imagine that there were one million repositories that were ranked on some set of metrics and the rule was that the top 500,000 repositories are eligible for free access to GitHub Copilot. If we compared the #1 ranked repository with the #1,000,000 ranked repository, then we would probably find that those two are quite different from each other. After all, the #1 repository is the best repository on GitHub by this metric while the #1,000,000 repository is a whole 999,999 rankings away from it. There are probably meaningful differences in code quality, documentation quality, project purpose, maintainer quality, etc. between the two repositories, so we would not be able to say that the only reason why there was a difference in outcomes for the maintainers of repository #1 vs. repository #1,000,000 was because of free access to GitHub Copilot.

However, what about repository #499,999 vs. repository #500,001? Those repositories are probably very similar to each other, and it was down to random chance which one made it over the eligibility threshold and which one did not. As a result, there is a strong argument that any differences in outcomes between those two repositories are solely due to repository #499,999 having free access to GitHub Copilot and repository #500,001 not having it. Practically, you’ll want a larger sample size than just two, so you would compare a narrow set of repositories just above and below the eligibility threshold against each other.
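To make that concrete, here is a minimal sketch in Python of the comparison described above. The rankings, outcomes, and bandwidth are all made up for illustration; the paper’s actual analysis uses the real eligibility data and proper RDD estimation rather than a simple difference in means.

import random

random.seed(0)

THRESHOLD = 500_000  # hypothetical eligibility cutoff from the example above
BANDWIDTH = 2_000    # how far on each side of the threshold we look

# Made-up data: each repository gets a rank and a simulated outcome
# (say, the change in its maintainer's share of coding activity).
repos = [
    {"rank": rank, "outcome": random.gauss(0.05 if rank <= THRESHOLD else 0.0, 0.1)}
    for rank in range(1, 1_000_001)
]

# Keep only repositories in a narrow band around the threshold.
just_eligible = [r["outcome"] for r in repos if THRESHOLD - BANDWIDTH < r["rank"] <= THRESHOLD]
just_ineligible = [r["outcome"] for r in repos if THRESHOLD < r["rank"] <= THRESHOLD + BANDWIDTH]

# Repositories this close to the cutoff should be comparable, so the difference
# in mean outcomes is attributed to eligibility for free GitHub Copilot access.
effect = sum(just_eligible) / len(just_eligible) - sum(just_ineligible) / len(just_ineligible)
print(f"Estimated effect near the threshold: {effect:.3f}")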

Kevin: Thanks, that’s super helpful. I’d be curious about the limitations of your paper and data that you wished you had for further work. What would the ideal dataset(s) look like for you?

Manuel: Fantastic question! Certainly, no study is perfect and there will be limitations. We are excited to better understand generative AI and how it affects work in the future. As such, one limitation is the availability of information from private repositories. We believe that if we were to have information on private repositories we could test whether there is some more experimentation going on in private projects and that project improvements that were done with generative AI in private may spill over to the public to some degree over time.

Sam: Another limitation of our study is the language-based exercise we use to put a dollar value on GitHub Copilot. We show that developers focus on higher-value languages that they did not know previously, and we extrapolated this estimate to all top developers. This estimate is only a partial-equilibrium value, since developer wages may change over time in full equilibrium if more individuals offer their services for a given language. Despite that limitation, the value is likely an underestimate, since it does not include any non-language-specific experimentation value or the non-experimentation value derived from GitHub Copilot.

Kevin: Predictions for the future? Recommendations for policymakers? Recommendations for developers?

Manuel: One simple prediction for the future is that AI incentivizes the activity for which it lowers the cost. However, it is not clear yet which parts will be incentivized through AI tools since they can be applied to many domains. It is likely that there are going to be a multitude of AI tools that incentivize different work activities which will eventually lead to employees, managers at firms, and policy-makers having to consider on which activity they want to put weight. We also would have to think about new recombinations of work activities. Those are difficult to predict. Avi Goldfarb, a prolific professor from the University of Toronto, gave an example of the steam engine with his colleagues. Namely, work was organized in the past around the steam engine as a power source but once electric motors were invented, that was not necessary anymore and structural changes happened. Instead of arranging all of the machinery around a giant steam engine in the center of the factory floor, electricity enabled people to design better arrangements, which led to boosts in productivity. I find this historical narrative quite compelling and can imagine similarly for AI that the greatest power still remains to be unlocked and that it will be unlocked once we know how work processes can be re-organized. Developers can think as well about how their work may change in the future. Importantly, developers can actively shape that future since they are closest to the development of machine learning algorithms and artificial intelligence technologies.

Sam: Adding on to those points, it is not clear when work processes will change and whether the effect will reduce or increase inequality. Many predictions point toward an inequality-enhancing effect, since training large language models requires substantial computing power that is often in the hands of only a few players. On the other hand, it has been documented that lower-ability individuals in particular seem to benefit the most from generative AI, at least in the short term. As such, it’s imperative to understand how the benefits of generative AI are distributed across society and, where they are not distributed equitably, whether there are welfare-improving interventions that can correct these imbalances. An encouraging result of our study suggests that generative AI can be especially impactful for relatively less skilled workers:

Low ability developers are increasing their coding activity by more than high ability developers. The Max Centrality proxy measures the largest share of commits the developer made across all public repositories where they have made at least one commit. The Achievements proxy counts the cumulative total of “badges” the developer has earned on the GitHub platform. The Followers proxy counts the number of peers who subscribe to notifications for the developer’s activity on GitHub. The Tenure proxy is measured by counting the number of days since the developer created their GitHub account. A developer would be considered to be “high ability” for a proxy if they were above the median for that proxy and “low ability” for a proxy if they were below the median for that proxy. See Table A5 in the [paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5007084) for more information with summary statistics of each proxy.
Low ability developers are reducing their project management activity by more than high ability developers.

Sam (continued): Counter to widespread speculation that generative AI will replace many entry level tasks, we find reason to believe that AI can also lower the costs of experimentation and exploration, reduce barriers to entry, and level the playing field in certain segments of the labor market. It would be prudent for policymakers to monitor distributional effects of generative AI, allowing the new technology to deliver equitable benefits where it does so naturally but at the same time intervening in cases where it falls short.

Personal Q&A

Kevin: I’d like to change gears a bit to chat more about your personal stories. Manuel, I know you’ve worked on research analyzing health outcomes with Stanford, diversity in TV stations, and now you’re studying nerds on the internet. Would love to learn about your journey to getting there.

Manuel: Sure! I was actually involved with “nerds on the internet” longer than my vita might suggest. Prior to my studies, I was using open source software, including Linux and Ubuntu, and programming was a hobby for me. I enjoyed the freedom that one had on the personal computer and the internet. During my studies, I discovered economics and business studies as fields of particular interest. Since I was interested in causal inference and welfare from a broader perspective, I learned how to use experimental and quasi-experimental studies to better understand social, medical, and technological innovations that are relevant for individuals, businesses, and policy makers. I focused on labor and health during my PhD, and afterwards I was able to lean a bit more into health at Stanford University. During my time at Harvard Business School, the pendulum swung back a bit towards labor. As such, I was in the fortunate position—thanks to the study of the exciting field of open source software—to continuously better understand both spaces.

Kevin: Haha, great to hear your interest in open source runs deep! Sam, you also have quite the varied background, analyzing cleantech market conditions and the effects of employment verification policies, and you also seem to have been studying nerds on the internet for the past several years. Could you share a bit about your path?

Sam: I’ve been a computing and open source enthusiast since I got my hands on a copy of “OpenSUSE for Dummies” in middle school. As an undergraduate, I was drawn to the social science aspect of economics and its ability to explain or predict human behavior across a wide range of settings. After bouncing around a number of subfields in graduate school, I got the crazy idea to combine my field of study with my passion and never looked back. Open source is an incredibly data-rich environment with a wealth of research questions interesting to economists. I’ve studied the role of peer effects in driving contribution, modelled the formation of software dependency networks using strategic behavior and risk aversion, and explored how labor market competition shapes open source output.

And thanks, Kevin. I’ll be sure to work “nerds on the internet” into the title of the next paper.

Kevin: Finding a niche that you’re passionate about is such a joy, and I’m curious about how you’ve found living in that niche. What’s the day-to-day like for you both?

Manuel: The day-to-day can vary, but as an academic there are a few recurring buckets at a big-picture level: research, teaching, and other work. Let’s focus on the research bucket. I am quite busy working on causal inference papers, refining them, but also speaking to audiences to communicate our work. Some of the work is done jointly, some alone, so there is a lot of variation, and the great part of being an academic is that one can choose that variation oneself through the projects one selects. Over time one has to juggle many balls. Hence, I am working on finishing prior papers in the space of health, for example, some that you alluded to previously on Television, Health and Happiness and on Vaccination at Work, but also in the space of open source software, importantly, improving the paper on Generative AI and the Nature of Work. We continuously have more ideas to better understand the world we live in, and will live in, at the intersection of open source software and generative AI, and, as such, it is very valuable to relate to the literature and eventually answer exciting questions around the future of work with real-world data. GitHub is a great resource for that.

Sam: As an applied researcher, I use data to answer questions. The questions can come from current events, from conversations with both friends and colleagues, or simply musing on the intricacies of the open source space. Oftentimes the data comes from publicly observed behavior of firms and individuals recorded on the internet. For example, I’ve used static code analysis and version control history to characterize the development of open source codebases over time. I’ve used job postings data to measure firm demand for open source skills. I’ve used the dependency graphs of packaging ecosystems to track the relationships between projects. I can then use my training in economic theory, econometric methodology, and causal inference to rigorously explore the question. The end result is written up in an article, presented to peers, and iteratively improved from feedback.

Kevin: Have things changed since generative AI tooling came along? Have you found generative AI tools to be helpful?

Manuel: Definitely. I use GitHub Copilot when developing experiments and programming in JavaScript together with a good colleague, Daniel Stephenson from Virginia Commonwealth University. It is interesting to observe how often Copilot makes code suggestions based on context that are correct. As such, it is an incredibly helpful tool. However, the big picture of what our needs are can only be determined by us. In my experience, Copilot speeds up the process and helps avoid some mistakes, conditional on not just blindly following the AI.

Sam: I’ve only recently begun using GitHub Copilot, but it’s had quite an impact on my workflow. Most social science researchers are not skilled software engineers, yet they must write code to hit deadlines. Before generative AI, the gap between problem and solution was usually characterized by many search queries, parsing Q&A forums or documentation, and a significant time cost. Being able to resolve uncertainty within the IDE is incredible for productivity.

Kevin: Advice you might have for folks who are starting out in software engineering or research? What tips might you give to a younger version of yourself, say, from 10 years ago?

Manuel: I will talk a bit at a higher level. Find work, questions, or problems that you deeply care about. I would say that is a universal rule for being satisfied, be it in software engineering, research, or any other area. In a way, think about the motto, “You only live once,” as a pretty applicable misnomer. Another piece of universal advice with high relevance is to not care too much about the things you cannot change but to focus on what you can control. Finally, think about getting advice from many different people, then pick and choose. People have different mindsets and ideas. As such, that can be quite helpful.

Sam:

  • Being able to effectively communicate with both groups (software eng and research) is extremely important.
  • You’ll do your best work on things you’re truly passionate about.
  • Premature optimization truly is the root of all evil.

Kevin: Learning resources you might recommend to someone interested in learning more about this space?

Manuel: There are several aspects we have touched upon—Generative AI Research, Coding with Copilot, Causal Inference and Machine Learning, Economics, and Business Studies. As such, here is one link for each topic that I can recommend:

I am sure that there are many other great resources out there, many that can also be found on GitHub.

Sam: If you’re interested to get more into how economists and other social scientists think about open source, I highly recommend the following (reasonably entry-level) articles that have helped shape my research approach.

For those interested in learning more about the toolkit used by empirical social scientists:

(shameless plug) We’ve been putting together a collection of data sources for open source researchers. Contributions welcome!

Kevin: Thank you, Manuel and Sam! We really appreciate you taking the time to share about what you’re working on and your journeys into researching open source.

The post Inside the research: How GitHub Copilot impacts the nature of work for open source maintainers appeared first on The GitHub Blog.

December 20, 2024  17:23:42

Hey devs! We have some exciting news for ya.

So, backstory first: in case you missed it, OpenAI launched its o1-preview and o1-mini models back in September, optimized for advanced tasks like coding, science, and math. We brought them to our GitHub Copilot Chat and to GitHub Models so you could start playing with them right away, and y’all seemed to love it!

Fast forward to today, and OpenAI has just shipped the release version of o1, an update to o1-preview with improved performance on complex tasks, including a 44% gain on Codeforces’ competitive coding test.

Sounds pretty good, huh? Well, exciting news: We’re bringing o1 to you in Copilot Chat across Copilot Pro, Business, and Enterprise, and including GitHub Models. It’s also available for you to start using right now!

How can I try o1 out in chat?

In Copilot Chat in Visual Studio Code and on GitHub, you can now pick “o1 (Preview)” via the model picker if you have a paid Copilot subscription (this is not available on our newly announced free Copilot tier).

Before, it said “o1-preview (Preview)” and now it just says “o1 (Preview)” because the GitHub model picker is in preview, but not the model itself. We have proven once again that the hardest problem in computer science is naming things, but yo dawg, we no longer have previews on previews. We are a very professional business.

Anyway.

Now you can use Copilot with o1 to explain, debug, refactor, modernize, test… all with the latest and greatest that o1 has to offer!

o1 is included in your paid subscription to GitHub Copilot, allowing up to 10 messages every 12 hours. As a heads up, if you’re a Copilot Business or Copilot Enterprise subscriber, an administrator will have to enable access to o1 models first before you can use it.

Beyond the superpowers that o1 already gives you out of the box, it also pulls context from your workspace and GitHub repositories! It’s a match made in developer heaven.

A GIF showing a developer selecting the OpenAI’s o1 model in GitHub Copilot to power Copilot Chat within the integrated development environment (IDE) VS Code.

What about o1 in GitHub Models?

If you haven’t played with GitHub Models yet, you’re in for a treat. It lets you start building AI apps with a playground and an API (and more). Learn more about GitHub Models in our documentation and start experimenting and building with a variety of AI models today on the GitHub Marketplace.

Head here to start with o1 in the playground, plus you can compare it to other models from providers such as Mistral, Cohere, Microsoft, or Meta, and use the code samples you get to start building. Every model is different, and this kind of experimentation is really helpful to get the results you want!

A GIF showing a sample prompt with OpenAI’s o1 model in GitHub Models, a platform that enables developers to experiment with leading AI models, compare their performance, and build them into applications. The sample prompt asks “How do you prove the Pythagorean theorem geometrically.” The AI model then offers an answer that displays its reasoning capabilities.

It’s all about developer choice!

Having choices fuels developer creativity! We don’t want you to just write better code, but for you to have the freedom to build, innovate, commit, and push in the way that works best for you.

I hope you’re as excited as I am. GitHub is doubling down on its commitment to give you the most advanced tools available with OpenAI’s new o1 model, and if you keep your eye out, there are even better things to come. 👀

So, let’s build from here—together!

The post OpenAI’s latest o1 model now available in GitHub Copilot and GitHub Models appeared first on The GitHub Blog.

December 20, 2024  16:59:34

The need for software build security is more pressing than ever. High-profile software supply chain attacks like SolarWinds, MOVEit, 3CX, and Applied Materials have revealed just how vulnerable the software build process can be. As attackers exploit weaknesses in the build pipeline to inject their malicious components, traditional security measures—like scanning source code, securing production environments and source control allow lists—are no longer enough. To defend against these sophisticated threats, organizations must treat their build systems with the same level of care and security as their production environments.

These supply chain attacks are particularly dangerous because they can undermine trust in your business itself: if an attacker can infiltrate your build process, they can distribute compromised software to your customers, partners, and end-users. So, how can organizations secure their build processes, and ensure that what they ship is exactly what they intended to build?

The Supply-chain Levels for Software Artifacts (SLSA) framework was developed to address these needs. SLSA provides a comprehensive, step-by-step methodology for building integrity and provenance guarantees into your software supply chain. This might sound complicated, but the good news is that GitHub Artifact Attestations simplify the journey to SLSA Level 3!

In this post, we’ll break down what you need to know about SLSA, how Artifact Attestations work, and how they can boost your GitHub Actions build security to the next level.

Securing your build process: An introduction to SLSA

What is build security?

When we build software, we convert source code into deployable artifacts—whether those are binaries, container images or packaged libraries. This transformation occurs through multiple stages, such as compilation, packaging and testing, each of which could potentially introduce vulnerabilities or malicious modifications.

A properly secured build process can:

  • Help ensure the integrity of your deployed artifacts by providing a higher level of assurance that the code has not been tampered with during the build process.
  • Provide transparency into the build process, allowing you to audit the provenance of your deployed artifacts.
  • Maintain confidentiality by safeguarding sensitive data and secrets used in the build process.

By securing the build process, organizations can ensure that the software reaching end-users is the intended and unaltered version. This makes securing the build process just as important as securing the source code and deployment environments.

Introducing SLSA: A framework for build security

SLSA is a community-driven framework governed by the Open Source Security Foundation (OpenSSF), designed to help organizations systematically secure their software supply chains through a series of progressively stronger controls and best practices.

The framework is organized into four levels, each representing a higher degree of security maturity:

  • Level 0: No security guarantees
  • Level 1: Provenance exists for traceability, but minimal tamper resistance
  • Level 2: Provenance signed by a managed build platform, deterring simple tampering
  • Level 3: Provenance from a hardened, tamper-resistant build platform, ensuring high security against compromise

Provenance refers to the cryptographic record generated for each artifact, providing an unforgeable paper trail of its build history. This record allows you to trace artifacts back to their origins, allowing for verification of how, when and by whom the artifact was created.

Why SLSA Level 3 matters for build security

Achieving SLSA Level 3 is a critical step in building a secure and trustworthy software supply chain. This level requires organizations to implement rigorous standards for provenance and isolation, ensuring that artifacts are produced in a controlled and verifiable manner. An organization that has achieved SLSA Level 3 is capable of significantly mitigating the most common attack vectors targeting software build pipelines. Here’s a breakdown of the specific requirements for reaching SLSA Level 3:

  • Provenance generation and availability: A detailed provenance record must be generated for each build, documenting how, when and by whom each artifact was produced. This provenance must be accessible to users for verification.
  • Managed build system: Builds must take place on ephemeral build systems—short-lived, on-demand environments that are provisioned for each build in order to isolate builds from one another, reducing the risk of cross-contamination and unauthorized access.
  • Restricted access to signing material: User-defined build steps should not have access to sensitive signing material to authenticate provenance, keeping signing operations separate and secure.

GitHub Artifact Attestations help simplify your journey to SLSA Level 3 by enabling secure, automated build verification within your GitHub Actions workflows. While generating build provenance records is foundational to SLSA Level 1, the key distinction at SLSA Level 3 is the separation of the signature process from the rest of your build job. At Level 3, the signing happens on dedicated infrastructure, separated from the build workflow itself.

The importance of verifying signatures

While signing artifacts is a critical step, it becomes meaningless without verification. Simply having attestations does not provide any security advantages if they are not verified. Verification ensures that the signed artifacts are authentic and have not been tampered with.

The GitHub CLI makes this process easy, allowing you to verify signatures at any stage of your CI/CD pipeline. For example, you can verify Terraform plans before applying them, ensure that Ansible or Salt configurations are authentic before deployment, validate containers before they are deployed to Kubernetes, or use it as part of a GitOps workflow driven by tools like Flux.

GitHub offers several native ways to verify Artifact Attestations:

  • GitHub CLI: This is the easiest way to verify signatures.
  • Kubernetes Admission Controller: Use GitHub’s distribution of the admission controller for automated verification in Kubernetes environments.
  • Offline verification: Download the attestations and verify them offline using the GitHub CLI for added security and flexibility in isolated environments.

By verifying signatures during deployment, you can ensure that what you deploy to production is indeed what you built.

Achieving SLSA Level 3 compliance with GitHub Artifact Attestations

Reaching SLSA Level 3 may seem complex, but GitHub’s Artifact Attestations feature makes it remarkably straightforward. Generating build provenance puts you at SLSA Level 1, and by using GitHub Artifact Attestations on GitHub-hosted runners, you reach SLSA Level 2 by default. From this point, advancing to SLSA Level 3 is a straightforward journey!

The critical difference between SLSA Level 2 and Level 3 lies in using a reusable workflow for provenance generation. This allows you to centrally enforce build security across all projects and enables stronger verification, as you can confirm that a specific reusable workflow was used for signing. With just a few lines of YAML added to your workflow, you can gain build provenance without the burden of managing cryptographic key material or setting up additional infrastructure.

Build provenance made simple

GitHub Artifact Attestations streamline the process of establishing provenance for your builds. By enabling provenance generation directly within GitHub Actions workflows, you ensure that each artifact includes a verifiable record of its build history. This level of transparency is crucial for SLSA Level 3 compliance.

Best of all, you don’t need to worry about the onerous process of handling cryptographic key material. GitHub manages all of the required infrastructure, from running a Sigstore instance to serving as a root signing certificate authority for you.

Check out our earlier blog to learn more about how to set up Artifact Attestations in your workflow.

Secure signing with ephemeral machines

GitHub Actions-hosted runners, executing workflows on ephemeral machines, ensure that each build process occurs in a clean and isolated environment. This model is fundamental for SLSA Level 3, which mandates secure and separate handling of key material used in signing.

When you create a reusable workflow for provenance generation, your organization can use it centrally across all projects. This establishes a consistent, trusted source for provenance records. Additionally, signing occurs on dedicated hardware that is separate from the build machine, ensuring that neither the source code nor the developer triggering the build system can influence or alter the build process. With this level of separation, your workflows inherently meet SLSA Level 3 requirements.

Below is an example of a reusable workflow that can be utilized across the organization to sign artifacts:

name: Sign Artifact

on:
  workflow_call:
    inputs:
      subject-name:
        required: true
        type: string
      subject-digest:
        required: true
        type: string

jobs:
  sign-artifact:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      attestations: write
      contents: read

    steps:
      - name: Attest Build Provenance
        uses: actions/attest-build-provenance@<version>
        with:
          subject-name: ${{ inputs.subject-name }}
          subject-digest: ${{ inputs.subject-digest }}

When you want to use this reusable workflow for signing in any other workflow, you can call it as follows:

name: Sign Artifact Workflow

on:
  push:
    branches:
      - main

jobs:
  sign:
    permissions:
      id-token: write
      attestations: write
      contents: read
    uses: <repository>/.github/workflows/sign-artifact.yml@<version>
    with:
      subject-name: "your-artifact.tar.gz" # Replace with actual artifact name
      subject-digest: "your-artifact-digest" # Replace with SHA-256 digest

This architecture of ephemeral environments and centralized provenance generation guarantees that signing operations are isolated from the build process itself, preventing unauthorized access to the signing process. By ensuring that signing occurs in a dedicated, controlled environment, the risk of compromising the signing workflow is greatly reduced: malicious actors cannot tamper with the signing action’s code or deviate from the intended process. Additionally, provenance is generated consistently across all builds, providing a unified record of build history for the entire organization.

To verify that an artifact was signed using this reusable workflow, you can use the GitHub CLI with the following command:

gh attestation verify <file-path> --signer-workflow <owner>/<repository>/.github/workflows/sign-artifact.yml

This verification process ensures that the artifact was built and signed using the anticipated pipeline, reinforcing the integrity of your software supply chain.

A more secure future

GitHub Artifact Attestations bring the assurance and structure of SLSA Level 3 to your builds without having to manage additional security infrastructure. By simply adding a few lines of YAML and moving the provenance generation into a reusable workflow, you’re well on your way to achieving SLSA Level 3 compliance with ease!

Ready to strengthen your build security and achieve SLSA Level 3?

Start using GitHub Artifact Attestations today or explore our documentation to learn more.

The post Enhance build security and reach SLSA Level 3 with GitHub Artifact Attestations appeared first on The GitHub Blog.

December 19, 2024  17:00:28

What it is

Annotated Logger is a Python package that allows you to decorate functions and classes, which then log when complete and can request a customized logger object, which has additional fields pre-added. GitHub’s Vulnerability Management team created this tool to make it easier to find and filter logs in Splunk.

Why we made it

We have several Python projects that have grown in complexity over the years, and we use Splunk to ingest and search their logs. We have always sent our logs in as JSON, which makes it easy to add extra fields. However, there were a number of fields, like which Git branch was deployed, that we wanted to send with every message, plus fields, like the CVE name of the vulnerability being processed, that we wanted to add for messages within a given function. Both are possible with the base Python logger, but it’s a lot of manual work repeating the same thing over and over, or building and managing a dictionary of extra fields to include in every log message.

The Annotated Logger started out as a simple decorator in one of our repositories, but was extracted into a package in its own right as we started to use it in all of our projects. As we’ve continued to use it, its features have grown and been updated.

How and why to use it

Now that I’ve gotten a bit of the backstory out of the way, here’s what it does, why you should use it, and how to configure it for your specific needs. At its simplest, you decorate a function with @annotate_logs() and it will “just work.” If you’d like to dive right in and poke around, the example folder contains examples that fully exercise the features.

@annotate_logs()
def foo():
    return True
>>> foo()
{"created": 1733176439.5067494, "levelname": "DEBUG", "name": "annotated_logger.8fcd85f5-d47f-4925-8d3f-935d45ceeefc", "message": "start", "action": "__main__:foo", "annotated": true}
{"created": 1733176439.506998, "levelname": "INFO", "name": "annotated_logger.8fcd85f5-d47f-4925-8d3f-935d45ceeefc", "message": "success", "action": "__main__:foo", "success": true, "run_time": "0.0", "annotated": true}
True

Here is a more complete example that makes use of a number of the features. Make sure to install the package: pip install annotated-logger first.

import os
from annotated_logger import AnnotatedLogger
al = AnnotatedLogger(
    name="annotated_logger.example",
    annotations={"branch": os.environ.get("BRANCH", "unknown-branch")}
)
annotate_logs = al.annotate_logs

@annotate_logs()
def split_username(annotated_logger, username):
    annotated_logger.annotate(username=username)
    annotated_logger.info("This is a very important message!", extra={"important": True})
    return list(username)
>>> split_username("crimsonknave")
{"created": 1733349907.7293086, "levelname": "DEBUG", "name": "annotated_logger.example.c499f318-e54b-4f54-9030-a83607fa8519", "message": "start", "action": "__main__:split_username", "branch": "unknown-branch", "annotated": true}
{"created": 1733349907.7296104, "levelname": "INFO", "name": "annotated_logger.example.c499f318-e54b-4f54-9030-a83607fa8519", "message": "This is a very important message!", "important": true, "action": "__main__:split_username", "branch": "unknown-branch", "username": "crimsonknave", "annotated": true}
{"created": 1733349907.729843, "levelname": "INFO", "name": "annotated_logger.example.c499f318-e54b-4f54-9030-a83607fa8519", "message": "success", "action": "__main__:split_username", "branch": "unknown-branch", "username": "crimsonknave", "success": true, "run_time": "0.0", "count": 12, "annotated": true}
['c', 'r', 'i', 'm', 's', 'o', 'n', 'k', 'n', 'a', 'v', 'e']
>>>
>>> split_username(1)
{"created": 1733349913.719831, "levelname": "DEBUG", "name": "annotated_logger.example.1c354f32-dc76-4a6a-8082-751106213cbd", "message": "start", "action": "__main__:split_username", "branch": "unknown-branch", "annotated": true}
{"created": 1733349913.719936, "levelname": "INFO", "name": "annotated_logger.example.1c354f32-dc76-4a6a-8082-751106213cbd", "message": "This is a very important message!", "important": true, "action": "__main__:split_username", "branch": "unknown-branch", "username": 1, "annotated": true}
{"created": 1733349913.7200255, "levelname": "ERROR", "name": "annotated_logger.example.1c354f32-dc76-4a6a-8082-751106213cbd", "message": "Uncaught Exception in logged function", "exc_info": "Traceback (most recent call last):\n  File \"/home/crimsonknave/code/annotated-logger/annotated_logger/__init__.py\", line 758, in wrap_function\n  result = wrapped(*new_args, **new_kwargs)  # pyright: ignore[reportCallIssue]\n  File \"<stdin>\", line 5, in split_username\nTypeError: 'int' object is not iterable", "action": "__main__:split_username", "branch": "unknown-branch", "username": 1, "success": false, "exception_title": "'int' object is not iterable", "annotated": true}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<makefun-gen-0>", line 2, in split_username
  File "/home/crimsonknave/code/annotated-logger/annotated_logger/__init__.py", line 758, in wrap_function
    result = wrapped(*new_args, **new_kwargs)  # pyright: ignore[reportCallIssue]
  File "<stdin>", line 5, in split_username
TypeError: 'int' object is not iterable

There are a few things going on in this example. Let’s break it down piece by piece.

  • The Annotated Logger requires a small amount of setup to use; specifically, you need to instantiate an instance of the AnnotatedLogger class. This class contains all of the configuration for the loggers.
    • Here we set the name of the logger. (You will need to update the logging config if your name does not start with annotated_logger or there will be nothing configured to log your messages.)
    • We also set a branch annotation that will be sent with all log messages.
  • After that, we create an alias for the decorator. You don’t have to do this, but I find it’s easier to read than @al.annotate_logs().
  • Now, we decorate and define our method, but this time we’re going to ask the decorator to provide us with a logger object, annotated_logger. This annotated_logger variable can be used just like a standard logger object but has some extra features.
    • This annotated_logger argument is added by the decorator before calling the decorated method.
    • The signature of the decorated method is adjusted so that it does not have an annotated_logger parameter (see how it’s called with just username).
    • There are optional parameters to the decorator that allow type hints to correctly parse the modified signature.
  • We make use of one of those features right away by calling the annotate method, which will add whatever kwargs we pass to the extra field of all log messages that use the logger.
    • Any field added as an annotation will be included in each subsequent log message that uses that logger.
    • You can override an annotation by annotating again with the same name.
  • At last, we send a log message! In this message we also pass in a field that’s only for that log message, in the same way you would when using logger.
  • In the second call, we passed an int as the username and list threw an exception.
    • This exception is logged automatically and then re-raised.
    • This makes it much easier to know if/when a method ended (unless the process was killed).

Let’s break down each of the fields in the log message:

| Field | Source | Description |
| --- | --- | --- |
| created | logging | Standard logging field. |
| levelname | logging | Standard logging field. |
| name | annotated_logger | Logger name (set via class instantiation). |
| message | logging | Standard logging field for log content. |
| action | annotated_logger | Method name the logger was created for. |
| branch | AnnotatedLogger() | Set from the configuration’s branch annotation. |
| annotated | annotated_logger | Boolean indicating the message was sent via Annotated Logger. |
| important | annotated_logger.info | Annotation set for a specific log message. |
| username | annotated_logger.annotate | Annotation set by the user. |
| success | annotated_logger | Indicates if the method completed successfully (True/False). |
| run_time | annotated_logger | Duration of the method execution. |
| count | annotated_logger | Length of the return value (if applicable). |

The success, run_time and count fields are added automatically to the message (“success”) that is logged after a decorated method is completed without an exception being raised.

Under the covers

How it’s implemented

The Annotated Logger interacts with Logging via two main classes: AnnotatedAdapter and AnnotatedFilter. AnnotatedAdapter is a subclass of logging.LoggerAdapter and is what all annotated_logger arguments are instances of. AnnotatedFilter is a subclass of logging.Filter and is where the annotations are actually injected into the log messages. Outside of config and plugins, the only parts of the code you will interact with as a user are AnnotatedAdapter in your methods and the decorator itself. Each instance of the AnnotatedAdapter class has an AnnotatedFilter instance—the AnnotatedAdapter.annotate method passes annotations on to the filter, where they are stored. When a message is logged, that filter calculates all the annotations it should have and then updates the existing LogRecord object with them.

Because each invocation of a method gets its own AnnotatedAdapter object it also has its own AnnotatedFilter object. This ensures that there is no leaking of annotations from one method call to another.
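Conceptually, the relationship looks something like the sketch below. This is not the package’s actual source, just a simplified illustration of the adapter/filter pattern described above; the class and variable names prefixed with Sketch are invented for the example.

import logging
import uuid

class SketchFilter(logging.Filter):
    """Stand-in for AnnotatedFilter: stores annotations and injects them into records."""

    def __init__(self):
        super().__init__()
        self.annotations = {}

    def filter(self, record):
        # Copy the stored annotations onto the LogRecord so a formatter can emit them.
        for key, value in self.annotations.items():
            setattr(record, key, value)
        return True

class SketchAdapter(logging.LoggerAdapter):
    """Stand-in for AnnotatedAdapter: exposes annotate() and owns one filter."""

    def __init__(self, logger, filter_):
        super().__init__(logger, {})
        self._filter = filter_
        logger.addFilter(filter_)

    def annotate(self, **kwargs):
        # Annotations are handed to the filter, which applies them to every record.
        self._filter.annotations.update(kwargs)

# Each invocation gets its own uniquely named logger plus adapter/filter pair
# (note the UUID suffixes in the log output above), so annotations don't leak.
adapter = SketchAdapter(logging.getLogger(f"sketch.{uuid.uuid4()}"), SketchFilter())
adapter.annotate(request_id="abc123")
adapter.info("hello")  # the emitted LogRecord now carries request_id="abc123"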

Type hinting

The Annotated Logger is fully type hinted internally and fully supports type hinting of decorated methods. But a little bit of additional detail is required in the decorator invocation. The annotate_logs method takes a number of optional arguments. For type hinting, _typing_self, _typing_requested, _typing_class and provided are relevant. The three arguments that start with _typing have no impact on the behavior of the decorator and are only used in method signature overrides for type hinting. Setting provided to True tells the decorator that the annotated_logger should not be created and will be provided by the caller (thus the signature shouldn’t be altered).

_typing_self defaults to True as that is how most of my code is written. provided, _typing_class and _typing_requested default to False.

class Example:
    @annotate_logs(_typing_requested=True)
    def foo(self, annotated_logger):
        ...

e = Example()
e.foo()
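The provided option works the other way around. Here is a hedged sketch based on the description above (the helper function, its arguments, and the item annotation are invented for the example): the decorator leaves the signature alone, and the caller passes in an annotated_logger it already has.

@annotate_logs(provided=True)
def helper(annotated_logger, item):
    # No logger is created here; the caller supplies one, so the signature is unchanged.
    annotated_logger.annotate(item=item)
    annotated_logger.info("processing item")
    return item

@annotate_logs()
def process_all(annotated_logger, items):
    # Reuse this method's annotated_logger for each helper call.
    return [helper(annotated_logger, item) for item in items]

process_all(["a", "b", "c"])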

Plugins

There are a number of plugins that come packaged with the Annotated Logger. Plugins allow for the user to hook into two places: when an exception is caught by the decorator and when logging a message. You can create your own plugin by creating a class that defines the filter and uncaught_exception methods (or inherits from annotated_logger.plugins.BasePlugin which provides noop methods for both).

The filter method of a plugin is called when a message is being logged. Plugins are called in the order they are set in the config. They are called by the AnnotatedFilter object of the AnnotatedAdapter and work like any logging.Filter. They take a record argument which is a logging.LogRecord object. They can manipulate that record in any way they want and those modifications will persist. Additionally, just like any logging filter, they can stop a message from being logged by returning False.

The uncaught_exception method of a plugin is called when the decorator catches an exception in the decorated method. It takes two arguments, exception and logger. The logger argument is the annotated_logger for the decorated method. This allows the plugin to annotate the log message stating that there was an uncaught exception that is about to be logged once the plugins have all processed their uncaught_exception methods.

Here is an example of a simple plugin. The plugin inherits from the BasePlugin, which isn’t strictly needed here since it implements both filter and uncaught_exception, but if it didn’t, inheriting from the BasePlugin means it would fall back to the default noop methods. The plugin has an __init__ so that it can take and store arguments. The filter and uncaught_exception methods end up with the same result: flagged=True is set if a word matches. But they do it slightly differently. filter is called while a given log message is being processed, so the annotation it adds goes directly onto that record. uncaught_exception is called if an exception is raised and not caught during the execution of the decorated method, so it doesn’t have a specific log record to interact with and instead sets an annotation on the logger. The only difference in outcome would be if another plugin emitted a log message during its uncaught_exception method after FlagWordPlugin; in that case, the additional log message would also have flagged=True on it.

from annotated_logger import AnnotatedLogger
from annotated_logger.plugins import BasePlugin

class FlagWordPlugin(BasePlugin):
    """Plugin that flags any log message/exception that contains a word in a list."""

    def __init__(self, *wordlist):
        """Save the wordlist."""
        self.wordlist = wordlist

    def filter(self, record):
        """Add annotation if the message contains words in the wordlist."""
        for word in self.wordlist:
            if word in record.msg:
                record.flagged = True
        # Returning True lets the record through, like a standard logging.Filter.
        return True

    def uncaught_exception(self, exception, logger):
        """Add annotation if the exception contains words in the wordlist."""
        for word in self.wordlist:
            if word in str(exception):
                logger.annotate(flagged=True)


AnnotatedLogger(plugins=[FlagWordPlugin("danger", "Will Robinson")])

Plugins are stored in a list and the order they are added can matter. The BasePlugin is always the first plugin in the list; any that are set in configuration are added after it.

When a log message is being sent the filter methods of each plugin will be called in the order they appear in the list. Because the filter methods often modify the record directly, one filter can break another if, for example, one filter removed or renamed a field that another filter used. Conversely, one filter could expect another to have added or altered a field before its run and would fail if it was ahead of the other filter. Finally, just like in the logging module, the filter method can stop a log from being emitted by returning False. As soon as a filter does so the processing ends and any Plugins later in the list will not have their filter methods called.
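As an illustration of this ordering dependence, here is a hedged sketch with two made-up plugins (DropHostnamePlugin and TagByHostnamePlugin are invented for the example and not shipped with the package, and the hostname field is assumed to have been added elsewhere): the second plugin’s filter reads a field that the first one deletes, so swapping their order changes the output.

from annotated_logger import AnnotatedLogger
from annotated_logger.plugins import BasePlugin

class TagByHostnamePlugin(BasePlugin):
    """Hypothetical plugin that tags records based on a 'hostname' field."""

    def filter(self, record):
        if getattr(record, "hostname", "").startswith("prod-"):
            record.production = True
        return True

class DropHostnamePlugin(BasePlugin):
    """Hypothetical plugin that removes the 'hostname' field from records."""

    def filter(self, record):
        if hasattr(record, "hostname"):
            del record.hostname
        return True

# In this order, records are tagged with production=True before hostname is removed.
# Reversing the list would mean TagByHostnamePlugin never sees the hostname field.
AnnotatedLogger(plugins=[TagByHostnamePlugin(), DropHostnamePlugin()])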

If the decorated method raises an exception that is not caught, then the plugins again execute in order. The most common interaction is plugins attempting to set or modify the same annotation. The BasePlugin and RequestsPlugin both set the exception_title annotation; since the BasePlugin is always first, the title it sets will be overridden. Other interactions would be one plugin setting an annotation before or after another plugin that emits a log message or sends data to a third party. In both of those cases, the order determines whether the annotation is present or not.

Plugins that come with the Annotated Logger:

  • GitHubActionsPlugin—Set a level of log messages to also be emitted in actions notation (notice::).
  • NameAdjusterPlugin—Add a pre/postfix to a name to avoid collisions in any log processing software (source is a field in Splunk, but we often include it as a field and it’s just hidden).
  • RemoverPlugin—Remove a field, for example, to exclude password/key fields when you log an object’s attributes, or to drop fields like taskName that are set when running async, but not sync.
  • NestedRemoverPlugin—Remove a field no matter how deep in a dictionary it is.
  • RenamerPlugin—Rename one field to another (don’t like levelname and want level, this is how you do that).
  • RequestsPlugin—Adds a title and status code to the annotations if the exception inherits from requests.exceptions.HTTPError.
  • RuntimeAnnotationsPlugin—Sets dynamic annotations.

dictConfig

When adding the Annotated Logger to an existing project, or one that uses other packages that log messages (flask, django, and so on), you can configure all of the Annotated Logger via dictConfig by supplying a dictConfig compliant dictionary as the config argument when initializing the Annotated Logger class. If, instead, you wish to do this yourself you can pass config=False and reference annotated_logger.DEFAULT_LOGGING_CONFIG to obtain the config that is used when none is provided and alter/extract as needed.

There is one special case where the Annotated Logger will modify the config passed to it: if there is a filter named annotated_filter that entry will be replaced with a reference to a filter that is created by the instance of the Annotated Logger that’s being created. This allows any annotations or other options set to be applied to messages that use that filter. You can instead create a filter that uses the AnnotatedFilter class, but it won’t have any of the config the rest of your logs have.
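
As a rough sketch of the above (the import path, the placeholder filter entry, and the handler layout are assumptions for illustration; only the config argument and the special annotated_filter name come from the behavior described here), wiring this up might look something like:

from annotated_logger import AnnotatedLogger  # import path is an assumption

config = {
    "version": 1,
    "disable_existing_loggers": False,
    "filters": {
        # Hypothetical placeholder: an entry with this exact name is replaced by
        # a filter created by the AnnotatedLogger instance below.
        "annotated_filter": {},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "filters": ["annotated_filter"],
        },
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}

# AnnotatedLogger calls logging.config.dictConfig on this after it has the
# chance to add/adjust the config.
annotated_logger = AnnotatedLogger(config=config)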

Notes

dictConfig only partly works when merging dictionaries. I have found that some parts of the config are not overwritten, while other parts seem to lose their references. So, I would encourage you to build up a single logging config for everything and call dictConfig only once. If you pass config, the Annotated Logger will call logging.config.dictConfig on your config after it has had the chance to add/adjust it.

The logging_config.py example has a much more detailed breakdown and set of examples.

Pytest mock

Included with the package is a pytest mock to assist in testing for logged messages. I know that there are some strong opinions about testing log messages, and I don’t suggest doing it extensively or frequently, but sometimes it’s the easiest way to check a loop, or the log message is tied to an alert and its formatting matters. In these cases, you can ask for the annotated_logger_mock fixture, which will intercept, record, and forward all log messages.

def test_logs(annotated_logger_mock):
    with pytest.raises(KeyError):
        complicated_method()
    annotated_logger_mock.assert_logged(
        "ERROR",  # Log level
        "That's not the right key",  # Log message
        present={"success": False, "key": "bad-key"},  # annotations and their values that are required
        absent=["fake-annotations"],  # annotations that are forbidden
        count=1  # Number of times log messages should match
    )

The assert_logged method makes use of pychoir for flexible matching. None of the parameters are required, so feel free to use whichever makes sense. Below is a breakdown of the default and valid values for each parameter.

  • level—Default: matches anything. Valid values: a string or string-based matcher. The log level to check (e.g., “ERROR”).
  • message—Default: matches anything. Valid values: a string or string-based matcher. The log message to check.
  • present—Default: empty dictionary. Valid values: a dictionary with string keys and any values. Annotations that are required in the log.
  • absent—Default: empty set. Valid values: `ALL`, a set, or a list of strings. Annotations that must not be present in the log.
  • count—Default: all positive integers. Valid values: an integer or integer-based matcher. The number of times the log message should match.

The present key is often what makes the mock truly useful. It allows you to require the things you care about and ignore the things you don’t care about. For example, nobody wants their tests to fail because the run_time of a method went from 0.0 to 0.1 or fail because the hostname is different on different test machines. But both of those are useful things to have in the logs. This mock should replace everything you use the caplog fixture for and more.

Other features

Class decorators and persist

Classes can be decorated with @annotate_logs as well. These classes will have an annotated_logger attribute added after the init (I was unable to get it to work inside the __init__). Any decorated methods of that class will have an annotated_logger that’s based on the class logger. Calls to annotate that pass persist=True will set the annotations on the class Annotated Logger and so subsequent calls of any decorated method of that instance will have those annotations. The class instance’s annotated_logger will also have an annotation of class specifying which class the logs are coming from.
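
A minimal sketch of what this can look like (the decorator invocation details and how the annotated_logger is injected into each method are assumptions based on the description above; annotate_logs is imported from your own logging setup module, as suggested in the tips section below):

from project.log import annotate_logs  # your project's logging setup module

@annotate_logs()
class ImageProcessor:
    @annotate_logs()
    def process(self, image_id, annotated_logger=None):
        # persist=True stores the annotation on the instance's logger, so later
        # decorated calls on this instance include it too.
        annotated_logger.annotate(image_id=image_id, persist=True)
        annotated_logger.info("Processing image")

    @annotate_logs()
    def finish(self, annotated_logger=None):
        # Carries the persisted image_id plus the automatic class annotation.
        annotated_logger.info("Finished")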

Iterators

The Annotated Logger also supports logging iterations of an enumerable object. annotated_logger.iterator will log the start, each step of the iteration, and when the iteration is complete. This can be useful for pagination in an API if your results object is enumerable, logging each time a page is fetched instead of sitting for a long time with no indication of whether the pages are hanging or there are simply many pages.

By default the iterator method will log the value of each iteration, but this can be disabled by setting value=False. You can also specify the level to log the iterations at if you don’t want the default of info.
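
For example (the exact signature of iterator is an assumption here—it's sketched as taking a description and the iterable, with optional value and level keywords, based on the behavior described above):

@annotate_logs()
def fetch_all_pages(api, annotated_logger=None):
    results = []
    # Logs the start, each page fetched, and completion; value=False skips
    # logging each page's contents.
    for page in annotated_logger.iterator("fetching pages", api.pages(), value=False):
        results.extend(page)
    return results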

Provided

Because each decorated method gets its own annotated_logger, calls to other methods will not carry any annotations from the caller. Instead of simply passing the annotated_logger object to the method being called, you can specify provided=True in the decorator invocation. This does two things. First, the method won’t have an annotated_logger created and passed automatically; instead, its first argument must be an existing annotated_logger, which is used as the basis for the annotated_logger object created for the function. Second, it adds a subaction annotation set to the decorated function’s name, while the action annotation is preserved from the method that called and provided the annotated_logger. Annotations are not persisted from a method decorated with provided=True back to the method that called it, unless the calling method’s class was decorated and the annotation was set with persist=True, in which case the annotation is set on the instance’s annotated_logger and shared with all methods, as is normal for decorated classes.

The most common use of this is with private methods, especially ones created during a refactor to extract some self-contained logic. It is also useful for common methods that are called from a number of different places.
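
A rough sketch of the shape this takes (the decorator usage and injection details are illustrative; the key points from the description above are provided=True and passing the caller's logger as the first argument):

@annotate_logs(provided=True)
def _build_payload(annotated_logger, record):
    # Uses the caller's logger as its basis; messages here get a subaction
    # annotation with this function's name, while action stays the caller's.
    annotated_logger.annotate(record_id=record.id)
    annotated_logger.info("Building payload")
    return {"id": record.id}

@annotate_logs()
def handle(record, annotated_logger=None):
    payload = _build_payload(annotated_logger, record)
    annotated_logger.info("Handled record")
    return payload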

Split messages

Long messages wreak havoc on log parsing tools. I’ve encountered cases where the HTML of a 500 error page was too long for Splunk to parse, causing the entire log entry to be discarded and its annotations to go unprocessed. Setting max_length when configuring the Annotated Logger will break long messages into multiple log messages each annotated with split=True, split_complete=False, message_parts=# and message_part=#. The last part of the long message will have split_complete=True when it is logged.

Only messages can be split like this; annotations will not trigger the splitting. However, a plugin could truncate any values with a length over a certain size.
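
Assuming max_length is simply passed when constructing the Annotated Logger (the parameter name comes from the description above; exactly where it is set is an assumption), enabling this could look like:

# Hypothetical limit: break any message longer than 2,000 characters into parts,
# each annotated with split/message_part fields as described above.
annotated_logger = AnnotatedLogger(max_length=2000)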

Pre/Post hooks

You can register hooks that are executed before and after the decorated method is called. The pre_call and post_call parameters of the decorator take a reference to a function and will call that function right before passing in the same arguments that the function will be/was called with. This allows the hooks to add annotations and/or log anything that is desired (assuming the decorated function requested an annotated_logger).

Examples of this would be having a set of annotations that annotate fields on a model and a pre_call that sets them in a standard way. Or a post_call that logs if the function left a model in an unsaved state.
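
As a sketch (the hook receives the same arguments as the decorated function, per the description above; the model and its has_unsaved_changes method are hypothetical):

def check_saved(model, annotated_logger=None):
    # Hypothetical post_call hook: warn if the function left the model unsaved.
    if model.has_unsaved_changes():
        annotated_logger.warning("Model left in an unsaved state")

@annotate_logs(post_call=check_saved)
def update_model(model, annotated_logger=None):
    model.name = "updated"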

Runtime annotations

Most annotations are static, but sometimes you need something that’s dynamic. These are achieved via the RuntimeAnnotationsPlugin in the Annotated Logger config. The RuntimeAnnotationsPlugin takes a dict of names and references to functions. These functions will be called and passed the log record when the plugin’s filter method is invoked just before the log message is emitted. Whatever is returned by the function will be set as the value of the annotation of the log message currently being logged.

A common use case is to annotate a request/correlation id, which identifies all of the log messages that were part of a given API request. For Django, one way to do this is via django-guid.
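
For instance (the constructor shape is taken from the “dict of names to functions” description above; the import path and the correlation-id lookup are placeholders):

import uuid

from annotated_logger import AnnotatedLogger  # import paths are assumptions
from annotated_logger.plugins import RuntimeAnnotationsPlugin

def current_request_id(record):
    # Placeholder: in practice, look this up from django-guid or a context variable.
    return str(uuid.uuid4())

annotated_logger = AnnotatedLogger(
    plugins=[RuntimeAnnotationsPlugin({"request_id": current_request_id})],
)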

Tips, tricks and gotchas

  • When using the decorator in more than one file, it’s useful to do all of the configuration in a file like log.py. That allows you to from project.log import annotate_logs everywhere you want to use it and you know it’s all configured and everything will be using the same setup.
  • Namespacing your loggers helps when there are two projects that both use the Annotated Logger (a package and a service that uses the package). If you are setting anything via dictConfig you will want to have a single config that has everything for all Annotated Loggers.
  • In addition to setting a correlation id for the API request being processed, passing the correlation id of the caller and then annotating that will allow you to trace from the logs of service A to the specific logs in Service B that relate to a call made by service A.
  • Plugins are very flexible. For example:
    • Send every exception log message to a service like Sentry.
    • Suppress logs from another package, like Django, that you don’t want to see (assuming you’ve configured Django’s logs to use a filter for your Annotated Logger).
    • Add annotations for extra information about specific types of exceptions (see the RequestsPlugin).
    • Set runtime annotations on a subset of messages (instead of on all messages, as the RuntimeAnnotationsPlugin does).

Questions, feedback and requests

We’d love to hear any questions, comments or requests you might have in an issue. Pull requests welcome as well!

The post Introducing Annotated Logger: A Python package to aid in adding metadata to logs appeared first on The GitHub Blog.

December 18, 2024  18:11:48

GitHub has a long history of offering free products and services to developers. Starting with free open source and public collaboration, we added free private repos, free minutes for GitHub Actions and GitHub Codespaces, and free package and release storage. Today, we are adding GitHub Copilot to the mix by launching GitHub Copilot Free.

Copilot is now automatically integrated into VS Code, and all of you have access to 2,000 code completions and 50 chat messages per month simply by signing in with your personal GitHub account, or by creating a new one. And just last week, we passed the mark of 150M developers on GitHub. 🎉

Copilot Free gives you the choice between Anthropic’s Claude 3.5 Sonnet or OpenAI’s GPT-4o model. You can ask a coding question, explain existing code, or have it find a bug. You can execute edits across multiple files. And you can access Copilot’s third-party agents or build your own extension.

Did you know that Copilot Chat is now directly available from the GitHub dashboard and it works with Copilot Free, so you can start using it today? We couldn’t be more excited to make Copilot available to the 150M developers on GitHub.

Happy coding!

P.S. Students, educators, and open source maintainers: your free access to unlimited Copilot Pro accounts continues, unaffected! 😉

The post Announcing 150M developers and a new free tier for GitHub Copilot in VS Code appeared first on The GitHub Blog.

December 17, 2024  13:51:51

In this blog post, I’ll show the results of my recent security research on GStreamer, the open source multimedia framework at the core of GNOME’s multimedia functionality.

I’ll also go through the approach I used to find some of the most elusive vulnerabilities, generating a custom input corpus from scratch to enhance fuzzing results.

GStreamer

GStreamer is an open source multimedia framework that provides extensive capabilities, including audio and video decoding, subtitle parsing, and media streaming, among others. It also supports a broad range of codecs, such as MP4, MKV, OGG, and AVI.

GStreamer is distributed by default on any Linux distribution that uses GNOME as the desktop environment, including Ubuntu, Fedora, and openSUSE. It provides multimedia support for key applications like Nautilus (Ubuntu’s default file browser), GNOME Videos, and Rhythmbox. It’s also used by tracker-miners, Ubuntu’s metadata indexer–an application that my colleague, Kev, was able to exploit last year.

This makes GStreamer a very interesting target from a security perspective, as critical vulnerabilities in the library can open numerous attack vectors. That’s why I picked it as a target for my security research.

It’s worth noting that GStreamer is a large library that includes more than 300 different sub-modules. For this research, I decided to focus on only the “Base” and “Good” plugins, which are included by default in the Ubuntu distribution.

Results

During my research I found a total of 29 new vulnerabilities in GStreamer, most of them in the MKV and MP4 formats.

Below you can find a summary of the vulnerabilities I discovered:

GHSL CVE DESCRIPTION
GHSL-2024-094 CVE-2024-47537 OOB-write in isomp4/qtdemux.c
GHSL-2024-115 CVE-2024-47538 Stack-buffer overflow in vorbis_handle_identification_packet
GHSL-2024-116 CVE-2024-47607 Stack-buffer overflow in gst_opus_dec_parse_header
GHSL-2024-117 CVE-2024-47615 OOB-Write in gst_parse_vorbis_setup_packet
GHSL-2024-118 CVE-2024-47613 OOB-Write in gst_gdk_pixbuf_dec_flush
GHSL-2024-166 CVE-2024-47606 Memcpy parameter overlap in qtdemux_parse_theora_extension leading to OOB-write
GHSL-2024-195 CVE-2024-47539 OOB-write in convert_to_s334_1a
GHSL-2024-197 CVE-2024-47540 Uninitialized variable in gst_matroska_demux_add_wvpk_header leading to function pointer overwriting
GHSL-2024-228 CVE-2024-47541 OOB-write in subparse/gstssaparse.c
GHSL-2024-235 CVE-2024-47542 Null pointer dereference in id3v2_read_synch_uint
GHSL-2024-236 CVE-2024-47543 OOB-read in qtdemux_parse_container
GHSL-2024-238 CVE-2024-47544 Null pointer dereference in qtdemux_parse_sbgp
GHSL-2024-242 CVE-2024-47545 Integer underflow in FOURCC_strf parsing leading to OOB-read
GHSL-2024-243 CVE-2024-47546 Integer underflow in extract_cc_from_data leading to OOB-read
GHSL-2024-244 CVE-2024-47596 OOB-read in FOURCC_SMI_ parsing
GHSL-2024-245 CVE-2024-47597 OOB-read in qtdemux_parse_samples
GHSL-2024-246 CVE-2024-47598 OOB-read in qtdemux_merge_sample_table
GHSL-2024-247 CVE-2024-47599 Null pointer dereference in gst_jpeg_dec_negotiate
GHSL-2024-248 CVE-2024-47600 OOB-read in format_channel_mask
GHSL-2024-249 CVE-2024-47601 Null pointer dereference in gst_matroska_demux_parse_blockgroup_or_simpleblock
GHSL-2024-250 CVE-2024-47602 Null pointer dereference in gst_matroska_demux_add_wvpk_header
GHSL-2024-251 CVE-2024-47603 Null pointer dereference in gst_matroska_demux_update_tracks
GHSL-2024-258 CVE-2024-47778 OOB-read in gst_wavparse_adtl_chunk
GHSL-2024-259 CVE-2024-47777 OOB-read in gst_wavparse_smpl_chunk
GHSL-2024-260 CVE-2024-47776 OOB-read in gst_wavparse_cue_chunk
GHSL-2024-261 CVE-2024-47775 OOB-read in parse_ds64
GHSL-2024-262 CVE-2024-47774 OOB-read in gst_avi_subtitle_parse_gab2_chunk
GHSL-2024-263 CVE-2024-47835 Null pointer dereference in parse_lrc
GHSL-2024-280 CVE-2024-47834 Use-After-Free read in Matroska CodecPrivate

Fuzzing media files: The problem

Nowadays, coverage-guided fuzzers have become the “de facto” tools for finding vulnerabilities in C/C++ projects. Their ability to discover rare execution paths, combined with their ease of use, has made them the preferred choice among security researchers.

The most common approach is to start with an initial input corpus, which is then successively mutated by the different mutators. The standard method to create this initial input corpus is to gather a large collection of sample files that provide a good representative coverage of the format you want to fuzz.

But with multimedia files, this approach has a major drawback: media files are typically very large (often in the range of megabytes or gigabytes). So, using such large files as the initial input corpus greatly slows down the fuzzing process, as the fuzzer usually goes over every byte of the file.

There are various minimization approaches that try to reduce file size, but they tend to be quite simplistic and often yield poor results. And, in the case of complex file formats, they can even break the file’s logic.

It’s for this reason that for my GStreamer fuzzing journey, I opted for “generating” an initial input corpus from scratch.

The alternative: corpus generators

An alternative to gathering files is to create an input corpus from scratch. Or in other words, without using any preexisting files as examples.

To do this, we need a way to transform the target file format into a program that generates files compliant with that format. Two possible solutions arise:

  1. Use a grammar-based generator. This category of generators makes use of formal grammars to define the file format, and subsequently generate the input corpus. In this category, we can mention tools like Grammarinator, an open source grammar-based fuzzer that creates test cases according to an input ANTLR v4 grammar. In this past blog post, I also explained how I used AFL++ Grammar-Mutator for fuzzing Apache HTTP server.
  2. Create a generator specifically for the target software. In this case, we rely on analyzing how the software parses the file format to build a compatible input generator.

Of course, the second solution is more time-consuming, as we need not only to understand the file format structure but also to analyze how the target software works.

But at the same time, it solves two problems in one shot:

  • On one hand, we’ll generate much smaller files, drastically speeding up the fuzzing process.
  • On the other hand, these “custom” files are likely to produce better code coverage and potentially uncover more vulnerabilities.

This is the method I opted for, and it allowed me to find some of the most interesting vulnerabilities in the MP4 and MKV parsers–vulnerabilities that, until then, had not been detected by the fuzzer.

Implementing an input corpus generator for MP4

In this section, I will explain how I created an input corpus generator for the MP4 format. I used the same approach for fuzzing the MKV format as well.

MP4 format

To start, I will show a brief description of the MP4 format.

MP4, officially known as MPEG-4 Part 14, is one of the most widely used multimedia container formats today, due to its broad compatibility and widespread support across various platforms and devices. It supports packaging of multiple media types such as video, audio, images, and complex metadata.

MP4 is basically an evolution of Apple’s QuickTime media format, which was standardized by ISO as MPEG-4. The .mp4 container format is specified by the “MPEG-4 Part 14: MP4 file format” section.

MP4 files are structured as a series of “boxes” (or “atoms”), each containing specific multimedia data needed to construct the media. Each box has a designated type that describes its purpose.

These boxes can also contain other nested boxes, creating a modular and hierarchical structure that simplifies parsing and manipulation.

Each box/atom includes the following fields:

  • Size: A 32-bit integer indicating the total size of the box in bytes, including the header and data.
  • Type: A 4-character code (FourCC) that identifies the box’s purpose.
  • Data: The actual content or payload of the box.

Some boxes may also include:

  • Extended size: A 64-bit integer that allows for boxes larger than 4GB.
  • User type: A 16-byte (128-bit) UUID that enables the creation of custom boxes without conflicting with standard types.

Mp4 box structure
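
To make that layout concrete, here is a small Python sketch (not GStreamer code) that serializes a basic box from the size/type/data fields described above:

import struct

def mp4_box(fourcc: bytes, payload: bytes) -> bytes:
    """Serialize a simple box: 32-bit big-endian size, 4-byte type, then the payload."""
    assert len(fourcc) == 4
    size = 8 + len(payload)  # header (size + type fields) plus payload
    return struct.pack(">I", size) + fourcc + payload

# A "moov" container box holding a single empty "trak" child box.
trak = mp4_box(b"trak", b"")
moov = mp4_box(b"moov", trak)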

An MP4 file is typically structured in the following way:

  • ftyp (File Type Box): Indicates the file type and compatibility.
  • mdat (Media Data Box): Contains the actual media data (for example, audio and video frames).
  • moov (Movie Box): Contains metadata for the entire presentation, including details about tracks and their structures:
  • trak (Track Box): Represents individual tracks (for example, video, audio) within the file.
  • udta (User Data Box): Stores user-defined data that may include additional metadata or custom information.

Common MP4 file structure

Once we understand how an MP4 file is structured, we might ask ourselves, “Why are fuzzers not able to successfully mutate an MP4 file?”

To answer this question, we need to take a look at how coverage-guided fuzzers mutate input files. Let’s take AFL–one of the most widely used fuzzers out there–as an example. AFL’s default mutators can be summarized as follows:

  • Bit/Bytes mutators: These mutators flip some bits or bytes within the input file. They don’t change the file size.
  • Block insertion/deletion: These mutators insert new data blocks or delete sections from the input file. They modify the file size.

The main problem lies in the latter category of mutators. As soon as the fuzzer modifies the data within an mp4 box, the size field of the box must also be updated to reflect the new size. Furthermore, if the size of a box changes, the size fields of all its parent boxes must also be recalculated and updated accordingly.

Implementing this functionality as a simple mutator can be quite complex, as it requires the fuzzer to track and update the implicit structure of the MP4 file.

Generator implementation

The algorithm I used for implementing my generator follows these steps:

Step 1: Generating unlabelled trees

Structurally, an MP4 file can be visualized as a tree-like structure, where each node corresponds to an MP4 box. Thus, the first step in our generator implementation involves creating a set of unlabelled trees.

In this phase, we create trees with empty nodes that do not yet have a tag assigned. Each node represents a potential MP4 box. To make sure we have a variety of input samples, we generate trees with various structures and different node counts.

3 different 9-node unlabelled trees

In the following code snippet, we see the constructor of the RandomTree class, which generates a random tree structure with a specified total nodes (total_nodes):

RandomTree::RandomTree(uint32_t total_nodes){
    uint32_t curr_level = 0;

    //Root node
    new_node(-1, curr_level);
    curr_level++;

    uint32_t rem_nodes = total_nodes - 1;
    uint32_t current_node = 0;

    while(rem_nodes > 0){

        uint32_t num_children = rand_uint32(1, rem_nodes);
        uint32_t min_value = this->levels[curr_level-1].front();
        uint32_t max_value = this->levels[curr_level-1].back();

        for(int i=0; i<num_children; i++){
            uint32_t parent_id = rand_uint32(min_value, max_value);
            new_node(parent_id, curr_level);
        }

        curr_level++;
        rem_nodes -= num_children;
    }
}

This code traverses the tree level by level (Level Order Traversal), adding a random number (rand_uint32) of children nodes (num_children). This approach of assigning a random number of child nodes to each parent node will generate highly diverse tree structures.

Random generation of child nodes

After all children are added for the current level, curr_level is incremented to move to the next level.

Once rem_nodes is 0, the RandomTree generation is complete, and we move on to generate another new RandomTree.

Step 2: Assigning tags to nodes

Once we have a set of unlabelled trees, we proceed to assign random tags to each node.

These tags correspond to the four-character codes (FOURCCs) used to identify the types of MP4 boxes, such as moov, trak, or mdat.

In the following code snippet, we see two different arrays of fourcc_info structs: FOURCC_LIST, which holds the FourCCs used for the leaf nodes of the tree, and CONTAINER_LIST, which holds those used for the rest of the nodes.

The fourcc_info struct includes the following fields:

  • fourcc: A 4-byte FourCC ID
  • description: A string describing the FourCC
  • minimum_size: The minimum size of the data associated with this FourCC

const fourcc_info CONTAINER_LIST[] = {
    {FOURCC_moov, "movie", 0,},
    {FOURCC_vttc, "VTTCueBox 14496-30", 0},
    {FOURCC_clip, "clipping", 0,},
    {FOURCC_trak, "track", 0,},
    {FOURCC_udta, "user data", 0,},
    …

const fourcc_info FOURCC_LIST[] = {
    {FOURCC_crgn, "clipping region", 0,},
    {FOURCC_kmat, "compressed matte", 0,},
    {FOURCC_elst, "edit list", 0,},
    {FOURCC_load, "track load settings", 0,},

Then, the MP4_labeler constructor takes a RandomTree instance as input, iterates through its nodes, and assigns a label to each node based on whether it is a leaf (no children) or a container (has children):

…

MP4_labeler::MP4_labeler(RandomTree *in_tree) {
    …
    for(int i=1; i < this->tree->size(); i++){

        Node &node = this->tree->get_node(i);
        …
        if(node.children().size() == 0){
            //LEAF
            uint32_t random = rand_uint32(0, FOURCC_LIST_SIZE-1);
            fourcc = FOURCC_LIST[random].fourcc;
            …
        }else{
            //CONTAINER
            uint32_t random = rand_uint32(0, CONTAINER_LIST_SIZE-1);
            fourcc = CONTAINER_LIST[random].fourcc;
            …
        }
        …
        node.set_label(label);
    }
}

After this stage, all nodes will have an assigned tag:

Labeled trees with MP4 box tags

Step 3: Adding random-size data fields

The next step is to add a random-size data field to each node. This data simulates the content within each MP4 box.
In the following code, we first set padding to the minimum size (min_size) specified in the selected fourcc_info entry, and pick a random amount of extra data for leaf nodes. Then, we append padding null bytes (\x00) and random_data filler bytes (\x41) to the label:

if(node.children().size() == 0){
    //LEAF
    …
    padding = FOURCC_LIST[random].min_size;
    random_data = rand_uint32(4, 16);
}else{
    //CONTAINER
    …
    padding = CONTAINER_LIST[random].min_size;
    random_data = 0;
}
…
std::string label = uint32_to_string(fourcc);
label += std::string(padding, '\x00');
label += std::string(random_data, '\x41');

By varying the data sizes, we make sure the fuzzer has sufficient space to inject data into the box data sections, without needing to modify the input file size.

Step 4: Calculating box sizes

Finally, we calculate the size of each box and recursively update the tree accordingly.

The traverse method recursively walks the tree structure, serializing each node’s data and calculating the resulting box size (size). It then propagates size updates up the tree (via traverse(child)) so that parent boxes include the sizes of their child boxes:

std::string MP4_labeler::traverse(Node &node){
    …
    for(int i=0; i < node.children().size(); i++){
        Node &child = tree->get_node(node.children()[i]);
        output += traverse(child);
    }

    uint32_t size;
    if(node.get_id() == 0){
        size = 20;
    }else{
        size = node.get_label().size() + output.size() + 4;
    }

    std::string label = node.get_label();
    uint32_t label_size = label.size();

    output = uint32_to_string_BE(size) + label + output;
    …
}

The number of generated input files can vary depending on the time and resources you can dedicate to fuzzing. In my case, I generated an input corpus of approximately 4 million files.

Code

You can find my C++ code example here.

Acknowledgments

A big thank you to the GStreamer developer team for their collaboration and responsiveness, and especially to Sebastian Dröge for his quick and effective bug fixes.

I would also like to thank my colleague, Jonathan Evans, for managing the CVE assignment process.

The post Uncovering GStreamer secrets appeared first on The GitHub Blog.

December 13, 2024  17:00:21

In November, we experienced one incident that resulted in degraded performance across GitHub services.

November 19 10:56 UTC (lasting 1 hour and 7 minutes)

On November 19, 2024, between 10:56:00 UTC and 12:03:00 UTC, the notifications service was degraded and stopped sending notifications to all our dotcom customers. On average, notification delivery was delayed by about one hour. This was due to a database host coming out of a regular maintenance process in read-only mode.

We mitigated the incident by making the host writable again. Following this, notification delivery recovered, and any delivery jobs that had failed during the incident were successfully retried. All notifications were fully recovered and available to users by 12:36 UTC.

To prevent similar incidents moving forward, we are working to improve our observability across database clusters to reduce our time to detection and improve our system resilience on startup.


Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: November 2024 appeared first on The GitHub Blog.

December 12, 2024  13:51:13

Large language models (LLMs), such as those used by GitHub Copilot, do not operate directly on bytes but on tokens, which can complicate scaling. In this post, we explain how we solved that challenge at GitHub to support the growing number of Copilot users and features since first launching Copilot two years ago.

Tokenization is the process of turning bytes into tokens. The byte-pair encoding (BPE) algorithm is such a tokenizer, used (for example) by the OpenAI models we use at GitHub. The fastest of these algorithms not only have at least O(n log(n)) complexity, but they are also not incremental, and are therefore badly suited for our use cases, which go beyond encoding an upfront-known input. This limitation resulted in a number of scaling challenges that led us to create a novel algorithm to address them. Our algorithm not only scales linearly, but also easily outperforms popular libraries for all inputs.

Read on to find out more about why and how we created the open source bpe algorithm that substantially improves state of the art BPE implementations in order to address our broader set of use cases.

The importance of fast, flexible tokenization

Retrieval augmented generation (RAG) is essential for GitHub Copilot’s capabilities. RAG is used to improve model output by augmenting the user’s prompt with relevant snippets of text or code. A typical RAG approach works as follows:

  • Index repository content into an embeddings database to allow semantic search.
  • Given a user’s prompt, search the embeddings database for relevant snippets and include those snippets in the prompt for the LLM.

Tokenization is important for both of these steps. Most code files exceed the number of tokens that can be encoded into a single embedding, so we need to split the files into chunks that are within the token limit. When building a prompt, there are also limits on the number of tokens that can be used. The number of tokens can also impact response time and cost. Therefore, it is common to have some kind of budgeting strategy, which requires being able to track the number of tokens that components of the prompt contribute. In both of these cases, we are dynamically constructing a text and constantly need the updated token count during that process to decide how to proceed. However, most tokenizers provide only a single operation: encode a complete input text into tokens.
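
As an illustration of why incremental counting matters, here is a simple greedy chunking loop (illustrative Python, not GitHub’s actual pipeline). Note that count_tokens has to be re-run on the whole candidate chunk each time, because token counts are not additive:

def chunk_by_tokens(lines, count_tokens, limit):
    """Greedily pack lines into chunks whose token count stays within a limit."""
    chunks, current = [], []
    for line in lines:
        candidate = "".join(current) + line
        if current and count_tokens(candidate) > limit:
            # The current chunk is full; start a new one with this line.
            chunks.append("".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks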

When scaling to millions of repositories and billions of embeddings, the efficiency of token calculation really starts to matter. Additionally, we need to consider the worst-case performance of tokenization for the stability of our production systems. A system that processes untrusted user input in the form of billions of source code files cannot allow data that, intentionally or not, causes pathological run times and threatens availability. (See this discussion of potential denial-of-service issues in the context of OpenAI’s tiktoken tokenizer.)

Some of the features that can help address these needs are:

  • Keep track of the token count for a chunk while it is being built up.
  • Count tokens for slices of an original text or abort counting when the text exceeds a given limit.
  • Split text within a certain amount of tokens at a proper UTF-8 character boundary.

Implementing these operations using current tokenization algorithms would result in at least quadratic runtime, when we would like the runtime to be linear.

bpe: A fast tokenizer

We were able to make substantial improvements to the state of the art BPE implementations in order to address the broader set of use cases that we have. Not only were we able to support more features, but we do so with much better performance and scalability than the existing libraries provide.

Our implementation is open source under an MIT license and can be found at https://github.com/github/rust-gems. The Rust crates are also published to crates.io as bpe and bpe-openai. The former contains the BPE implementation itself. The latter exposes convenience tokenizers (including pre-tokenization) for recent OpenAI token models.

Read on for benchmark results and an introduction to the algorithm itself.

Performance comparison

We compare performance with two benchmarks. Both compare tokenization on randomly generated inputs of different sizes.

  • The first uses any inputs, most of which will be split into smaller pieces during pre-tokenization. Pre-tokenization splits the input text into pieces (for example, using regular expressions). BPE is applied to those pieces instead of the complete input text. Since these pieces are typically small, this can significantly impact overall performance and hide the performance characteristics of the underlying BPE implementation. This benchmark allows comparing the performance that can be expected in practice.
  • The second uses pathological inputs that won’t be split by pre-tokenization. This benchmark allows comparing the performance of the underlying BPE implementations and reflects worst-case performance.

We used OpenAI’s o200k_base token model and compared our implementation with tiktoken-rs, a wrapper for OpenAI’s tiktoken library, and Huggingface’s tokenizers. All benchmarks were run single-threaded on an Apple M1 MacBook Pro.

Here are the results, showing single-threaded throughput in MiB/s:

Line graph displaying results for the benchmark that includes pre-tokenization. Our tokenizer outperforms tiktoken by almost 4x and Huggingface by about 10x.

Line graph displaying the worst case complexity difference between our linear, Huggingface’s heap-based, and tiktoken’s quadratic implementation.

The first figure shows the results for the benchmark that includes pre-tokenization. We see that our tokenizer outperforms tiktoken by almost 4x and Huggingface by about 10x. (These numbers are in line with tiktoken’s reported performance results. Note that our single-threaded performance is only matched when using eight threads.) The second figure shows the worst case complexity difference between our linear, Huggingface’s heap-based, and tiktoken’s quadratic implementation.

The rest of this post details how we achieved this result. We explain the basic principle of byte-pair encoding, the insight that allows the faster algorithm, and a high-level description of the algorithm itself.

Byte-pair encoding

BPE is a technique to encode text as a sequence of tokens from a token dictionary. The token dictionary is just an ordered list of tokens. Each token is either a single byte, or the concatenation of a pair of previously defined tokens. A string is encoded by replacing bytes with single-byte tokens and token pairs by concatenated tokens, in dictionary order.

Let’s see how the string abacbb is tokenized using the following dictionary:

a b c ac bb ab acbb

Initially, the string is tokenized into the single-byte tokens. Next, all occurrences (left to right) of the token pair a c are replaced by the token ac. This procedure is repeated until no more replacements are possible. For our input string abacbb, tokenization proceeds as follows:

1. a b a c b b
2. a b ac  b b
3. a b ac  bb
4. ab  ac  bb
5. ab  acbb

Note that initially we have several pairs of single-byte tokens that appear in the dictionary, such as a b and a c. Even though ab appears earlier in the string, ac is chosen because the token appears first in the token dictionary. It is this behavior that makes BPE non-incremental with respect to string operations such as slicing or appending. For example, the substring abacb is tokenized as ab ac b, but if another b is added, the resulting string abacbb is tokenized as ab acbb. Two tokens from the prefix abacb are gone, and the encoding for the longer string even ends up being shorter.

The two main strategies for implementing BPE are:

  • A naive approach that repeatedly iterates over the tokens to merge the next eligible token pair, resulting in quadratic complexity.
  • A heap-based approach that keeps eligible token pairs sorted, resulting in O(n log(n)) complexity.

However, all tokenizers require the full text up front. Tokenizing a substring or an extended string means starting the encoding from scratch. For that reason, the more interesting use cases from above quickly become very expensive (at least O(n² log(n))). So, how can we do better?
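
Before looking at the answer, here is a minimal Python sketch of the naive strategy described above (illustrative only—not the tiktoken or Hugging Face implementation). It reproduces the abacbb example:

def naive_bpe_encode(text, vocab):
    """Encode text with the naive quadratic BPE strategy.

    vocab is the token dictionary in order: single characters first, then merges.
    """
    rank = {tok: i for i, tok in enumerate(vocab)}
    tokens = list(text)  # start from single-character tokens
    while True:
        # Find the adjacent pair whose concatenation appears earliest in the dictionary.
        best = None
        for i in range(len(tokens) - 1):
            merged = tokens[i] + tokens[i + 1]
            if merged in rank and (best is None or rank[merged] < rank[best]):
                best = merged
        if best is None:
            return tokens
        # Replace all occurrences of that pair, left to right.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] == best:
                out.append(best)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out

print(naive_bpe_encode("abacbb", ["a", "b", "c", "ac", "bb", "ab", "acbb"]))
# ['ab', 'acbb']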

Composing valid encodings

The difficulty of the byte-pair encoding algorithm (as described above) is that token pair replacements can happen anywhere in the string and can influence the final tokens at the beginning of the string. However, it turns out that there is a property, which we call compatibility, that allows us to build up tokenizations left-to-right:

Given a valid encoding, we can append an additional token to produce a new valid encoding if the pair of the last token and the appended token are a valid encoding.

A valid encoding means that the original BPE algorithm produces that same encoding. We’ll show what this means with an example, and refer to the crate’s README for a detailed proof.

The sequence ab ac is a valid encoding for our example token dictionary.

  • Is ab ac b a valid encoding? Check if the pair ac b is compatible:
    1. a c b
    2. ac  b
    

    We got the same tokens back, which means ab ac b is a valid encoding (it is exactly what the original BPE algorithm produces).

  • Is ab ac bb a valid encoding? Again, check if the pair ac bb is compatible:

    1. a c b b
    2. ac  b b
    3. ac  bb 
    4. acbb 
    

    In this case, the tokens are incompatible, and ab ac bb is not valid.

The next section explains how we can go from building valid encodings to finding the encoding for a given input string.

Linear encoding

Using the compatibility rule, we can implement linear encoding with a dynamic programming algorithm.

The algorithm works by checking for each of the possible last tokens whether it is compatible with the tokenization of the remaining prefix. As we saw in the previous section, we only need the last token of the prefix’s encoding to decide this.

Let’s apply this idea to our example abacbb and write down the full encodings for every prefix:

  • a ——-> a
  • ab ——> ab
  • aba —–> ab a
  • abac —-> ab ac
  • abacb —> ab ac b
  • abacbb –> ab acbb

We only store the last token for every prefix. This gives us a ab a ac b for the first five prefixes. We can find the last token for a prefix with a simple lookup in the list. For example, the last token for ab is ab, and the last token for abac is ac.

For the last token of abacbb we have three token candidates: b, bb, and acbb. For each of these we must check whether it is compatible with the last token of the remaining prefix: b b, ac bb, or ab acbb. Retokenizing these combinations gives bb, acbb, and ab acbb, which means acbb is the only valid choice here. The algorithm works forward by computing the last token for every position in the input, using the last tokens for previous positions in the way we just described.

The resulting algorithm looks roughly like this:

let mut last_tokens = vec![];
for pos in 0..text.len() {
  for candidate in all_potential_tokens_for_suffix(text[0..pos + 1]) {
    if token_len(candidate) == pos + 1 {
      last_tokens.push(candidate);
      break;
    } else if is_compatible(
      last_tokens[pos + 1 - token_len(candidate)],
      candidate,
    ) {
      last_tokens.push(candidate);
      break;
    }
  }
}

How do we implement this efficiently?

  • Use an Aho-Corasick string matching automaton to get all suffix tokens of the string until position i. Start with the longest token, since those are more likely to survive than shorter tokens.
  • Retokenize the token pair efficiently to implement the compatibility check. We could precompute and store all valid pairings, but with large token dictionaries this will be a lot of data. It turns out that we can retokenize pairs on the fly in effectively constant time.

The string matching automaton is linear for the input length. Both the number of overlapping tokens and retokenization are bounded by constants determined by the token dictionary. Together, this gives us a linear runtime.

Putting it together

The Rust crate contains several different encoders based on this approach:

  • Appending and prepending encoders that incrementally encode a text when content is added to it. The algorithm works using the approach outlined above, storing the last token for every text position. Whenever the text is extended, the last tokens for the new positions are computed using the previously computed last tokens. At any point the current token count is available in constant time. Furthermore, constant time snapshots and rollbacks are supported, which make it easy to implement dynamic chunk construction approaches.
  • A fast full-text encoder based on backtracking. Instead of storing the last token for every text position, it only stores the tokens for the full input. The algorithm works left to right, picking a candidate token for the remaining input, and checking if it is compatible with the last token. The algorithm backtracks on the last token if none of the candidates are compatible. By trying the longest candidates first, very little backtracking happens in practice.
  • An interval encoder which allows O(1) token counting on subranges of the original text (after preprocessing the text in O(n) time). The algorithm works by encoding the substring until the last token lines up with the token at that position for the original text. Usually, only a small prefix needs to be encoded before this alignment happens.

We have explained our algorithm at a high level so far. The crate’s README contains more technical details and is a great starting point for studying the code itself.

The post So many tokens, so little time: Introducing a faster, more flexible byte-pair tokenizer appeared first on The GitHub Blog.

December 11, 2024  15:34:21

Gradio is a Python web framework for demoing machine learning applications, which in the past few years has exploded in popularity. In this blog, you’ll follow along with the process in which I modeled (that is, added support for) the Gradio framework, finding 11 vulnerabilities to date in a number of open source projects, including AUTOMATIC1111/stable-diffusion-webui—one of the most popular projects on GitHub, which was included in the 2023 Octoverse report and 2024 Octoverse report. Check out the vulnerabilities I’ve found on the GitHub Security Lab’s website.

Following the process outlined in this blog, you will learn how to model new frameworks and libraries in CodeQL and scale your research to find more vulnerabilities.

This blog is written to be read standalone; however, if you are new to CodeQL or would like to dig deeper into static analysis and CodeQL, you may want to check out the previous parts of my CodeQL zero to hero blog series. Each deals with a different topic: static analysis fundamentals, writing CodeQL, and using CodeQL for security research.

  1. CodeQL zero to hero part 1: The fundamentals of static analysis for vulnerability research
  2. CodeQL zero to hero part 2: Getting started with CodeQL
  3. CodeQL zero to hero part 3: Security research with CodeQL

Each also has accompanying exercises, which are in the above blogs, and in the CodeQL zero to hero repository.

Quick recap

CodeQL uses data flow analysis to find vulnerabilities. It uses models of sources (for example, an HTTP GET request parameter) and sinks in libraries and frameworks that could cause a vulnerability (for example, cursor.execute from MySQLdb, which executes SQL queries), and checks if there is a data flow path between the two. If there is, and there is no sanitizer on the way, CodeQL will report a vulnerability in the form of an alert.

Diagram showing a node titled ‘Sources’ and another node titled ‘Sinks’. An arrow titled ‘Data flow path’ points from ‘Sources’ to ‘Sinks’. ‘Sources’ node gives examples of sources, that is ‘untrusted input, for example HTTP GET request parameters and ‘Sinks’ node gives examples of sinks, that is ‘dangerous functions, for example cursor.execute() from MySQLdb library’.

Check out the first blog of the CodeQL zero to hero series to learn more about sources, sinks, and data flow analysis. See the second blog to learn about the basics of writing CodeQL. Head on to the third blog of the CodeQL zero to hero series to implement your own data flow analysis on certain code elements in CodeQL.

Motivation

CodeQL has models for the majority of most popular libraries and frameworks, and new ones are continuously being added to improve the detection of vulnerabilities.

One such example is Gradio. We will go through the process of analyzing it and modeling it today. Looking into a new framework or a library and modeling it with CodeQL is a perfect chance to do research on the projects using that framework, and potentially find multiple vulnerabilities at once.

Hat tip to my coworker, Alvaro Munoz, for giving me a tip about Gradio, and for his guide on researching Apache Dubbo and writing models for it, which served as inspiration for my research (if you are learning CodeQL and haven’t checked it out yet, you should!).

Research and CodeQL modeling process

Frameworks have sources and sinks, and that’s what we are interested in identifying, and later modeling in CodeQL. A framework may also provide user input sanitizers, which we are also interested in.

The process consisted, among others, of:

  • Reviewing the documentation and code for sources—any entry points to the application, which take user input.
  • Reviewing the documentation and code for sinks—any functionality that is potentially dangerous, for example, a function that executes raw SQL queries and that may take user input.
  • Dynamically testing interesting elements.
  • Checking for previous security issues related to Gradio, and vulnerabilities in the applications that use Gradio.

Let me preface here that it’s natural that a framework like Gradio has sources and sinks, and there’s nothing inherently wrong about it. All frameworks have sources and sinks—Django, Flask, Tornado, and so on. The point is that if the classes and functions provided by the frameworks are not used in a secure way, they may lead to vulnerabilities. And that’s what we are interested in catching here, for applications that use the Gradio framework.

Gradio

Gradio is a Python web framework for demoing machine learning applications, which in the past few years has become increasingly popular.

Gradio’s documentation is thorough and gives a lot of good examples to get started with using Gradio. We will use some of them, modifying them a little bit, where needed.

Gradio Interface

We can create a simple interface in Gradio by using the Interface class.

import gradio as gr

def greet(name, intensity):
    return "Hello, " + name + "!" * int(intensity)

demo = gr.Interface(
    fn=greet,
    inputs=[gr.Textbox(), gr.Slider()],
    outputs=[gr.Textbox()])

demo.launch()

In this example, the Interface class takes three arguments:

  • The fn argument takes a reference to a function that contains the logic of the program. In this case, it’s the reference to the greet function.
  • The inputs argument takes a list of input components that will be used by the function passed to fn. Here, inputs takes the values from a text (which is equivalent to a gr.Textbox component) and a slider (equivalent to a gr.Slider component).
  • The outputs argument specifies what will be returned by the function passed to fn in the return statement. Here, the output will be returned as text (so, gr.Textbox).

Running the code will start an application with the following interface. We provide example inputs, “Sylwia” in the textbox and “3” in the slider, and submit them, which results in an output, “Hello, Sylwia!!!”.

Website with an input box with the value “Sylwia”, an “intensity” slider set to “3” and an output box with the value “Hello, Sylwia!!!”
Example app written using Gradio Interface class

Gradio Blocks

Another popular way of creating applications is by using the gr.Blocks class with a number of components; for example, a dropdown list, a set of radio buttons, checkboxes, and so on. Gradio documentation describes the Blocks class in the following way:

Blocks offers more flexibility and control over: (1) the layout of components (2) the events that trigger the execution of functions (3) data flows (for example, inputs can trigger outputs, which can trigger the next level of outputs).

With gr.Blocks, we can use certain components as event listeners, for example, a click of a given button, which will trigger execution of functions using the input components we provided to them.

In the following code we create a number of input components: a slider, a dropdown, a checkbox group, a radio buttons group, and a checkbox. Then, we define a button, which will execute the logic of the sentence_builder function on a click of the button and output the results of it as a textbox.

import gradio as gr

def sentence_builder(quantity, animal, countries, place, morning):
    return f"""The {quantity} {animal}s from {" and ".join(countries)} went to the {place} in the {"morning" if morning else "night"}"""

with gr.Blocks() as demo:

    gr.Markdown("Choose the options and then click **Run** to see the output.")
    with gr.Row():
        quantity = gr.Slider(2, 20, value=4, label="Count", info="Choose between 2 and 20")
        animal = gr.Dropdown(["cat", "dog", "bird"], label="Animal", info="Will add more animals later!")
        countries = gr.CheckboxGroup(["USA", "Japan", "Pakistan"], label="Countries", info="Where are they from?")
        place = gr.Radio(["park", "zoo", "road"], label="Location", info="Where did they go?")
        morning = gr.Checkbox(label="Morning", info="Did they do it in the morning?")

    btn = gr.Button("Run")
    btn.click(
        fn=sentence_builder,
        inputs=[quantity, animal, countries, place, morning],
        outputs=gr.Textbox(label="Output")
    )

if __name__ == "__main__":
    demo.launch(debug=True)

Running the code and providing example inputs will give us the following results:

Website with a slider titled “Count” set to 4, a dropdown list titled “Animal” set to “cat”, a checkbox group titled “Countries” with checked “USA” and “Japan” and unchecked “Pakistan”, a radio button titled “Locations” set to “park” and a checked checkbox with question with explanation “Did they do it in the morning?”. Below the form is a button titled “Run” and an output box with the value “The 4 cats from USA and Japan went to the park in the morning."
Example app written using Gradio Blocks

Identifying attack surface in Gradio

Given the code examples, the next step is to identify how a Gradio application might be written in a vulnerable way, and what we could consider a source or a sink. A good point to start is to run a few code examples, use the application the way it is meant to be used, observe the traffic, and then poke at the application for any interesting areas.

The first interesting point that stood out for me for investigation are the variables passed to the inputs keyword argument in the Interface example app above, and the on click button event handler in the Blocks example.

Let’s start by running the above Gradio Interface example app in your favorite proxy (or observing the traffic in your browser’s DevTools), and filling out the form with example values (string "Sylwia" and integer 3) shows the data being sent:

Screenshot from Firefox DevTools showing the traffic in the Gradio Interface app. Two requests are visible, and one of them is selected, showing data being sent in a form of JSON.
Traffic in the Gradio Interface example app, observed in Firefox DevTools

The values we set are sent as a string “Sylwia” and an integer 3 in a JSON, in the value of the “data” key. Here are the values as seen in Burp Suite:

Screenshot showing request with data in a form of a JSON. The “data” key takes a list with two values: “Sylwia”, 3.
JSON in the request to the example Interface application

The text box naturally allows for setting any string value we would like, and that data will later be processed by the application as a string. What if I try to set it to something else than a string, for example, an integer 1000?

Screenshot showing request with data in a form of a JSON. The “data” key takes a list with two values: 1000, 3.
Testing setting the textbox value to an integer

Turns out that’s allowed. What about a slider? You might expect that we can only set the values that are restricted in the code (so, here these should be integer values from 2 to 20). Could we send something else, for example, a string "high”?

Screenshot showing request with data in a form of a JSON. The “data” key takes a list with two values: 1000, “high”.
Testing setting the slider value to a string

We see an error:

File "/**/**/**/example.py", line 4, in greet
    return "Hello, " + name + "!" * int(intensity)
                                    ^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'high'

That’s very interesting. 🤔

The error didn’t come up from us setting a string value on a slider (which should only allow integer values from 2 to 20), but from the int function, which converts a value to an integer in Python. Meaning, until that point, the value from the slider can be set to anything, and can be used in any dangerous functions (sinks) or otherwise sensitive functionality. All in all, perfect candidate for sources.

We can do a similar check with a more complex example with gr.Blocks. We run the example code from the previous section and observe our data being sent:

Screenshot showing request with data in a form of a JSON. The “data” key takes a list with the values: 4,"cat",["USA","Japan"],"park",true
Observed values in the request

The values sent correspond to values given in the inputs list, which comes from the components:

  • A Slider with values from 2 to 20.
  • A Dropdown with values: “cat”, “dog”, “bird”.
  • A CheckboxGroup with values: “USA”, “Japan”, “Pakistan”.
  • A Radio with values: “park”, “zoo”, “road”.
  • A Checkbox.

Then, we test what we can send to the application. We pass values that are not expected for the given components:

  • A Slider—a string "a thousand".
  • A Dropdown—a string "turtle", which is not one of the dropdown options.
  • A CheckboxGroup—a list with ["USA","Japan", "Poland"], of which "Poland" is not one of the options.
  • A Radio—a list ["a", "b"]. Radio is supposed to allow only one value, and not a list.
  • A Checkbox—an integer 123. Checkbox is supposed to take only true or false.

Screenshot showing request with data in the form of a JSON. The “data” key takes a list with the values: "a thousand","turtle",["USA","Japan", "Poland"],["a", "b"],123
Observed values in the request

No issues reported—we can set the source values to anything no matter which component they come from. That makes them perfect candidates for modeling as sources and later doing research at scale.

Except.

Echo of sources past: gr.Dropdown example

Not long after I had a look at Gradio version 4.x.x and wrote models for its sources, the Gradio team asked Trail of Bits (ToB) to conduct a security audit of the framework, which resulted in Maciej Domański and Vasco Franco creating a report on Gradio’s security with a lot of cool findings. The fixes for the issues reported were incorporated into Gradio 5.0, which was released on October 9, 2024. One of the issues that ToB’s team reported was TOB-GRADIO-15: Dropdown component pre-process step does not limit the values to those in the dropdown list.

Screenshot from Trail of Bits’ Gradio audit showing finding number 15, “Dropdown component pre-process step does not limit the values to those in the dropdown list,” with severity set to low and difficulty set to low. The description says: “The Dropdown component allows a user to set arbitrary values, even when the allow_custom_value parameter is set to False.”
TOB-GRADIO-15 from Trail of Bits’ Gradio review.

The issue was subsequently fixed in Gradio 5.0 and so, submitting values that are not valid choices in gr.Dropdown, gr.Radio, and gr.CheckboxGroup results in an error, namely:

gradio.exceptions.Error: "Value: turtle is not in the list of choices: ['cat', 'dog', 'bird']"

In this case, the vulnerabilities which may result from these sources, may not be exploitable in Gradio version 5.0 and later. There were also a number of other changes regarding security to the Gradio framework in version 5.0, which can be explored in the ToB report on Gradio security.

The change made me ponder whether to update the CodeQL models I had written and added to CodeQL. However, since the sources can still be abused in applications running Gradio versions below 5.0, I decided to leave them as they are.

Modeling Gradio with CodeQL

We have now identified a number of potential sources. Next, we can write CodeQL models for them and later use these sources together with existing sinks to find vulnerabilities in Gradio applications at scale. But let’s go back to the beginning: how do we model these Gradio sources with CodeQL?

Preparing example CodeQL database for testing

Recall that to run CodeQL queries, we first need to create a CodeQL database from the source code that we are interested in. Then, we can run our CodeQL queries on that database to find vulnerabilities in the code.

A diagram titled “How CodeQL works”. The diagram starts on the left with a picture of example source code and an arrow going right, pointing to a database icon titled “CodeQL database”. Another arrow goes through the “CodeQL database”: at its start it says “CodeQL queries” and at its end, “Vulnzzz”.
How CodeQL works – create a CodeQL database, then run queries on that database

We start with intentionally vulnerable source code using gr.Interface, which we will use as our test case for finding Gradio sources. The code is vulnerable to command injection via both the folder and logs arguments, which end up in the first argument to an os.system call.

import gradio as gr
import os

def execute_cmd(folder, logs):
    # folder and logs come straight from user input and are interpolated,
    # unsanitized, into a shell command
    cmd = f"python caption.py --dir={folder} --logs={logs}"
    os.system(cmd)


folder = gr.Textbox(placeholder="Directory to caption")
logs = gr.Checkbox(label="Add verbose logs")

demo = gr.Interface(fn=execute_cmd, inputs=[folder, logs])

if __name__ == "__main__":
    demo.launch(debug=True)

Let’s create another example using gr.Blocks and gr.Button.click to mimic our earlier examples. Similarly, the code is vulnerable to command injection via both folder and logs arguments. The code is a simplified version of a vulnerability I found in an open source project.

import gradio as gr
import os

def execute_cmd(folder, logs):
    cmd = f"python caption.py --dir={folder} --logs={logs}"
    os.system(cmd)


with gr.Blocks() as demo:
    gr.Markdown("Create caption files for images in a directory")
    with gr.Row():
        folder = gr.Textbox(placeholder="Directory to caption")
        logs = gr.Checkbox(label="Add verbose logs")

    btn = gr.Button("Run")
    btn.click(fn=execute_cmd, inputs=[folder, logs])


if __name__ == "__main__":
    demo.launch(debug=True)

I also added two more code snippets which use positional arguments instead of keyword arguments. The database and the code snippets are available in the CodeQL zero to hero repository.
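For reference, a positional-argument variant of the gr.Interface example looks roughly like the sketch below (an illustration only; the exact snippets used for the test database live in the repository linked above). Passing fn and inputs positionally is what the queries later account for with getParameter(0, "fn") and getParameter(1, "inputs").

import gradio as gr
import os

def execute_cmd(folder, logs):
    cmd = f"python caption.py --dir={folder} --logs={logs}"
    os.system(cmd)
    return f"Command: {cmd}"


folder = gr.Textbox(placeholder="Directory to caption")
logs = gr.Checkbox(label="Add verbose logs")
output = gr.Textbox()

# fn, inputs, and outputs are passed positionally instead of as keyword arguments
demo = gr.Interface(execute_cmd, [folder, logs], output)

if __name__ == "__main__":
    demo.launch(debug=True)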

Now that we have the code, we can create a CodeQL database for it by using the CodeQL CLI. It can be installed either as an extension to the gh tool (recommended) or as a binary. Using the gh tool with the CodeQL CLI makes it much easier to update CodeQL.

CodeQL CLI installation instructions
  1. Install GitHub’s command line tool, gh, using the installation instructions for your system.
  2. Install CodeQL CLI:
    gh extensions install github/gh-codeql
  3. (optional, but recommended) To use the CodeQL CLI directly in the terminal (without having to type gh), run:
    gh codeql install-stub
  4. Make sure to regularly update the CodeQL CLI with:
    codeql set-version latest

After installing the CodeQL CLI, we can create a CodeQL database. First, we move to the folder where all our source code is located. Then, to create a CodeQL database called gradio-cmdi-db for the Python code in the gradio-tests folder, run:

codeql database create gradio-cmdi-db --language=python --source-root='./gradio-tests'

This command creates a new folder gradio-cmdi-db with the extracted source code elements. We will use this database to test and run CodeQL queries on.

I assume here you already have the VS Code CodeQL starter workspace set up. If not, follow the instructions to create a starter workspace and come back when you are finished.

To run queries on the database we created, we need to add it to the VS Code CodeQL extension. We can do that by clicking the “Choose Database from Folder” button and pointing it to the gradio-cmdi-db folder.

A screenshot from the VS Code CodeQL extension showing the “Databases” section and highlighted “Choose Database from Folder” button

Now that we are all set up, we can move on to actual CodeQL modeling.

Identifying code elements to query for

Let’s have another look at our intentionally vulnerable code.

import gradio as gr
import os

def execute_cmd(folder, logs):
    cmd = f"python caption.py --dir={folder} --logs={logs}"
    os.system(cmd)
    return f"Command: {cmd}"


folder = gr.Textbox(placeholder="Directory to caption")
logs = gr.Checkbox(label="Save verbose logs")
output = gr.Textbox()

demo = gr.Interface(
    fn=execute_cmd,
    inputs=[folder, logs],
    outputs=output)

if __name__ == "__main__":
    demo.launch(debug=True)

In the first example with the Interface class, we had several components in the application. On the left side, we have input components that take user input – one of them a textbox. On the right side, we have an output component, which is also a textbox. So, it’s not enough for a component to be, for example, of the gr.Textbox type to be considered a source.

To be considered a source of untrusted input, a component has to be passed to the inputs keyword argument, which takes an input component or a list of input components. These are then used by the function passed to fn and processed by the logic of the application in a potentially vulnerable way. So, not all components are sources. In our case, any values passed to inputs in the gr.Interface class are sources. We could go a step further and say that if gr.Interface is used, then anything passed to the execute_cmd function is a source – here, the folder and logs parameters.

The same situation occurs in the second example with the gr.Blocks class and the gr.Button.click event listener. Any arguments passed to inputs in the Button.click method, and thus to the execute_cmd function, are sources.

Modeling gr.Interface

Let’s start by looking at the gr.Interface class.

Since we are interested in identifying any values passed to inputs in the gr.Interface class, we first need to identify any calls to gr.Interface.

CodeQL for Python has a library called ApiGraphs for finding references to classes and functions defined in library code, which we can use to identify any calls to gr.Interface. See CodeQL zero to hero part 2 for a refresher on writing CodeQL and CodeQL zero to hero part 3 for a refresher on using ApiGraphs.

We can get all references to gr.Interface calls with the following query. Note that:

  • We set the query to be @kind problem, which will format the results of the select as an alert.
  • In from, we define a node variable of the API::CallNode type, which gives us the set of all API::CallNodes in the program.
  • In where, we filter the node to be a gr.Interface call.
  • In select, we choose our output to be the node and the string “Call to gr.Interface”, which will be formatted as an alert due to setting @kind problem.
/**
 * @id codeql-zero-to-hero/4-1
 * @severity error
 * @kind problem
 */

import python
import semmle.python.ApiGraphs

from API::CallNode node
where node =
    API::moduleImport("gradio").getMember("Interface").getACall()
select node, "Call to gr.Interface"

Run the query by right-clicking and selecting “CodeQL: Run Query on Selected Database”. If you are using the same CodeQL database, you should see two results, which shows that the query is working as expected (recall that I added more code snippets to the test database; the database and the code snippets are available in the CodeQL zero to hero repository).

Screenshot from the VS Code CodeQL extension showing two alerts in the files “cmdi-interface-list.py” and “cmdi-interface.py”. The first one is highlighted and shows the “cmdi-interface-list.py” file open on the right side. In the file, the code line “demo = gr.Interface(fn=execute_cmd, inputs=[folder, logs], outputs=output)” is highlighted.
Query results in two alerts

Next, we want to identify values passed to inputs in the gr.Interface class, which are passed to the execute_cmd function. We could do it in two ways: by identifying the values passed to inputs and then linking them to the function referenced in fn, or by looking at the parameters of the function referenced in fn directly. The latter is a bit easier, so let’s focus on that solution. If you’re interested in the former approach, check out the Taint step section.

To sum up, we are interested in getting the folder and logs parameters.

import gradio as gr
import os

def execute_cmd(folder, logs):
    cmd = f"python caption.py --dir={folder} --logs={logs}"
    os.system(cmd)
    return f"Command: {cmd}"


folder = gr.Textbox(placeholder="Directory to caption")
logs = gr.Checkbox(label="Save verbose logs")
output = gr.Textbox()

demo = gr.Interface(
    fn=execute_cmd,
    inputs=[folder, logs],
    outputs=output)

if __name__ == "__main__":
    demo.launch(debug=True)

We can get folder and logs with the query below:

/**
 * @id codeql-zero-to-hero/4-2
 * @severity error
 * @kind problem
 */

 import python
 import semmle.python.ApiGraphs

 from API::CallNode node
 where node =
     API::moduleImport("gradio").getMember("Interface").getACall()

 select node.getParameter(0, "fn").getParameter(_), "Gradio sources"

To get the function referenced in fn (or in the 1st positional argument), we use the getParameter(0, "fn") predicate – 0 refers to the 1st positional argument and "fn" refers to the fn keyword argument. Then, to get the parameters of that function, we use the getParameter(_) predicate. Note that the underscore here is a wildcard, meaning it will output all of the parameters of the function referenced in fn. Running the query results in three alerts.

Screenshot from the VS Code CodeQL extension showing three alerts, two in file “cmdi-interface-list.py” and one in “cmdi-interface.py”. The first one is highlighted and shows the “cmdi-interface-list.py” file open on the right side. In the file, the “folder” parameter to “execute_cmd” function is highlighted.
Query results in three alerts

We can also encapsulate the logic of the query into a class to make it more portable. This query will give us the same results. If you need a refresher on classes and using the exists mechanism, see CodeQL zero to hero part 2.

/**
 * @id codeql-zero-to-hero/4-3
 * @severity error
 * @kind problem
 */

 import python
 import semmle.python.ApiGraphs
 import semmle.python.dataflow.new.RemoteFlowSources

 class GradioInterface extends RemoteFlowSource::Range {
    GradioInterface() {
        exists(API::CallNode n |
        n = API::moduleImport("gradio").getMember("Interface").getACall() |
        this = n.getParameter(0, "fn").getParameter(_).asSource())
    }
    override string getSourceType() { result = "Gradio untrusted input" }

 }

from GradioInterface inp
select inp, "Gradio sources"

Note that for the GradioInterface class we extend the RemoteFlowSource::Range supertype. This allows us to add the sources matched by this class to the RemoteFlowSource abstract class.

An abstract class is a union of all its subclasses – in this case, the GradioInterface class we just modeled, as well as the sources already present in CodeQL, such as those from the Flask, Django, or Tornado web frameworks. An abstract class is useful if you want to group multiple existing classes together under a common name.

This means that if we now query for all sources using the RemoteFlowSource class, the results will include those produced by our class above. Try it!

/**
 * @id codeql-zero-to-hero/4-4
 * @severity error
 * @kind problem
 */

 import python
 import semmle.python.ApiGraphs
 import semmle.python.dataflow.new.RemoteFlowSources

class GradioInterface extends RemoteFlowSource::Range {
    GradioInterface() {
        exists(API::CallNode n |
        n = API::moduleImport("gradio").getMember("Interface").getACall() |
        this = n.getParameter(0, "fn").getParameter(_).asSource())
    }
    override string getSourceType() { result = "Gradio untrusted input" }

 }


from RemoteFlowSource rfs
select rfs, "All python sources"

For a refresher on RemoteFlowSource, and how to use it in a query, head to CodeQL zero to hero part 3.

Note that since we modeled the new sources using the RemoteFlowSource abstract class, all Python queries that already use RemoteFlowSource will automatically pick up our new sources once we add them to the library files, like I did in this pull request to add Gradio models. Almost all of CodeQL’s Python security queries use RemoteFlowSource. For example, if you run the SQL injection query, it will also report vulnerabilities that use the sources we’ve modeled. See how to run prewritten queries in CodeQL zero to hero part 3.

Modeling gr.Button.click

We model gr.Button.click in a very similar way.

/**
 * @id codeql-zero-to-hero/4-5
 * @severity error
 * @kind problem
 */

 import python
 import semmle.python.ApiGraphs

 from API::CallNode node
 where node =
     API::moduleImport("gradio").getMember("Button").getReturn()
     .getMember("click").getACall()

select node.getParameter(0, "fn").getParameter(_), "Gradio sources"

Note that in the code we first create a Button object with gr.Button() and then call the click() event listener on it. Due to that, we need to use API::moduleImport("gradio").getMember("Button").getReturn() to first get the nodes representing the result of calling gr.Button() and then we continue with .getMember("click").getACall() to get all calls to gr.Button.click. Running the query results in 3 alerts.

Screenshot from the VS Code CodeQL extension showing three alerts, two in file “cmdi-list.py” and one in “cmdi.py”. The first one is highlighted and shows the “cmdi-list.py” file open on the right side. In the file, the “folder” parameter to “execute_cmd” function is highlighted.
Query results in three alerts

We can also encapsulate the logic of this query into a class too.

/**
 * @id codeql-zero-to-hero/4-6
 * @severity error
 * @kind problem
 */


 import python
 import semmle.python.ApiGraphs
 import semmle.python.dataflow.new.RemoteFlowSources

class GradioButton extends RemoteFlowSource::Range {
    GradioButton() {
        exists(API::CallNode n |
        n = API::moduleImport("gradio").getMember("Button").getReturn()
        .getMember("click").getACall() |
        this = n.getParameter(0, "fn").getParameter(_).asSource())
    }

    override string getSourceType() { result = "Gradio untrusted input" }

 }

from GradioButton inp
select inp, "Gradio sources"

Vulnerabilities using Gradio sources

We can now use our two classes as sources in a taint tracking query, to detect vulnerabilities that have a Gradio source, and, continuing with our command injection example, an os.system sink (the first argument to the os.system call is the sink). See CodeQL zero to hero part 3 to learn more about taint tracking queries.

The os.system call is matched by the OsSystemSink class, and the sink, that is, the first argument to the os.system call, is defined in the isSink predicate.

/**
 * @id codeql-zero-to-hero/4-7
 * @severity error
 * @kind path-problem
 */


 import python
 import semmle.python.dataflow.new.DataFlow
 import semmle.python.dataflow.new.TaintTracking
 import semmle.python.ApiGraphs
 import semmle.python.dataflow.new.RemoteFlowSources
 import MyFlow::PathGraph

class GradioButton extends RemoteFlowSource::Range {
    GradioButton() {
        exists(API::CallNode n |
        n = API::moduleImport("gradio").getMember("Button").getReturn()
        .getMember("click").getACall() |
        this = n.getParameter(0, "fn").getParameter(_).asSource())
    }

    override string getSourceType() { result = "Gradio untrusted input" }

 }

 class GradioInterface extends RemoteFlowSource::Range {
    GradioInterface() {
        exists(API::CallNode n |
        n = API::moduleImport("gradio").getMember("Interface").getACall() |
        this = n.getParameter(0, "fn").getParameter(_).asSource())
    }
    override string getSourceType() { result = "Gradio untrusted input" }

 }


class OsSystemSink extends API::CallNode {
    OsSystemSink() {
        this = API::moduleImport("os").getMember("system").getACall()
    }
}


 private module MyConfig implements DataFlow::ConfigSig {
   predicate isSource(DataFlow::Node source) {
     source instanceof GradioButton
     or
     source instanceof GradioInterface
   }

   predicate isSink(DataFlow::Node sink) {
    exists(OsSystemSink call |
        sink = call.getArg(0)
        )
   }
 }

 module MyFlow = TaintTracking::Global<MyConfig>;

 from MyFlow::PathNode source, MyFlow::PathNode sink
 where MyFlow::flowPath(source, sink)
 select sink.getNode(), source, sink, "Data Flow from a Gradio source to `os.system`"

Running the query results in six alerts, which show us the path from source to sink. Note that the os.system sink we used is already modeled in CodeQL, but we are using it here to illustrate the approach.

Screenshot from the VS Code CodeQL extension showing six alerts: two in the file “cmdi-interface-list.py”, one in “cmdi-interface.py”, two in the file “cmdi-list.py” and one in “cmdi.py”. The first alert in the “cmdi-interface-list.py” file is open and shows three steps, with the last step highlighted. The file is open on the right side. In the file, the “cmd” argument to “os.system” is highlighted.
Query results in six alerts

Just as with the GradioInterface class, since we have already written the source models, we can use them (as well as the query we’ve written above) on any Python project. We just have to add them to the library files, like I did in this pull request to add Gradio models.

We can run any query on up to 1,000 projects at once using a tool called Multi-Repository Variant Analysis (MRVA).

But before that.

Other sources in Gradio

Based on the tests we did in the Identifying attack surface in Gradio section, we can identify other sources that behave in a similar way. For example, there’s gr.LoginButton.click, an event listener that also takes inputs and could be considered a source. I’ve modeled these cases and added them to CodeQL, which you can see in the pull request to add Gradio models. The modeling of these event listeners is very similar to what we’ve done in the previous section.

Taint step

We’ve mentioned that there are two ways to model gr.Interface and other Gradio sources—by identifying the values passed to inputs and then linking them to the function referenced in fn, or by looking at the parameters to the function referenced in fn directly.

import gradio as gr
import os

def execute_cmd(folder, logs):
    cmd = f"python caption.py --dir={folder} --logs={logs}"
    os.system(cmd)
    return f"Command: {cmd}"


folder = gr.Textbox(placeholder="Directory to caption")
logs = gr.Checkbox(label="Save verbose logs")
output = gr.Textbox()

demo = gr.Interface(
    fn=execute_cmd,
    inputs=[folder, logs],
    outputs=output)

if __name__ == "__main__":
    demo.launch(debug=True)

As it turns out, machine learning applications written using Gradio often take a lot of input variables, which are later processed by the application. In these cases, the inputs argument gets a list of components, which at times can be very long; I’ve found several cases that used lists with more than 10 elements.

In such cases, it would be nice to be able to track the source all the way back to the component that introduces it – in our case, gr.Textbox and gr.Checkbox.

To do that, we need to use a taint step. Taint steps are usually used when taint analysis stops at a specific code element and we want to make it propagate further. In our case, however, we are going to write a taint step to track a variable in inputs – an element of a list – back to the component that introduced it.

The Gradio.qll file in CodeQL upstream contains all the Gradio source models and the taint step if you’d like to see the whole modeling.

We start by identifying the variables passed to inputs in, for example, gr.Interface:

class GradioInputList extends RemoteFlowSource::Range {
    GradioInputList() {
      exists(GradioInput call |
        // limit only to lists of parameters given to `inputs`.
        (
          (
            call.getKeywordParameter("inputs").asSink().asCfgNode() instanceof ListNode
            or
            call.getParameter(1).asSink().asCfgNode() instanceof ListNode
          ) and
          (
            this = call.getKeywordParameter("inputs").getASubscript().getAValueReachingSink()
            or
            this = call.getParameter(1).getASubscript().getAValueReachingSink()
          )
        )
      )
    }

    override string getSourceType() { result = "Gradio untrusted input" }
  }

Next, we identify the function in fn and link the elements of the list of variables in inputs to the parameters of the function referenced in fn.

class ListTaintStep extends TaintTracking::AdditionalTaintStep {
  override predicate step(DataFlow::Node nodeFrom, DataFlow::Node nodeTo) {
    exists(GradioInput node, ListNode inputList |
       inputList = node.getParameter(1, "inputs").asSink().asCfgNode() |
       exists(int i | 
         nodeTo = node.getParameter(0, "fn").getParameter(i).asSource() |
         nodeFrom.asCfgNode() =
         inputList.getElement(i))
       )
    }
  }

Let’s explain the taint step, step by step.

exists(GradioInput node, ListNode inputList |

In the taint step, we define two temporary variables in the exists mechanism—node of type GradioInput and inputList of type ListNode.

inputList = node.getParameter(1, "inputs").asSink().asCfgNode() |

Then, we set our inputList to the value of inputs. Note that because inputList has type ListNode, we are looking only for lists.

exists(int i | 
nodeTo = node.getParameter(0, "fn").getParameter(i).asSource() |
nodeFrom.asCfgNode() = inputList.getElement(i))

Next, we identify the function in fn and link the parameters of the function referenced in fn to the elements of the list of variables in inputs, by using a temporary variable i.

All in all, the taint step provides us with a nicer display of the paths, from the component used as a source to a potential sink.

Scaling the research to thousands of repositories with MRVA

CodeQL zero to hero part 3 introduced Multi-Repository Variant Analysis (MRVA) and variant analysis. Head over there if you need a refresher on the topics.

In short, MRVA allows you to run a query on up to 1,000 projects hosted on GitHub at once. It comes preconfigured with dynamic lists of the top 10, top 100, and top 1,000 most popular repositories for each language. You can also configure your own lists of repositories to run CodeQL queries on and potentially find more variants of vulnerabilities that use our new models. @maikypedia wrote a neat case study about using MRVA to find SSTI vulnerabilities in Ruby and unsafe deserialization vulnerabilities in Python.

MRVA is used together with the VS Code CodeQL extension and can be configured in the extension, in the “Variant Analysis” section. It uses GitHub Actions to run, so you need a repository that will be used as a controller to run these actions. You can create a public repository, in which case running the queries will be free, but then you can run MRVA only on public repositories. The docs contain more information about MRVA and its setup.

Using MRVA, I’ve found 11 vulnerabilities to date in several Gradio projects. Check out the vulnerability reports on GitHub Security Lab’s website.

Reach out!

Today, we learned how to model a new framework in CodeQL, using Gradio as an example, and how to use those models for finding vulnerabilities at scale. I hope that this post helps you with finding new cool vulnerabilities! 🔥

If CodeQL and this post helped you to find a vulnerability, we would love to hear about it! Reach out to us on GitHub Security Lab on Slack or tag us @ghsecuritylab on X.

If you have any questions, or run into issues with the challenges or with writing a CodeQL query, feel free to join the GitHub Security Lab server on Slack and ask there. The Slack server is open to anyone and lets you ask questions about CodeQL, CodeQL modeling, or anything else CodeQL related, and receive answers from a number of CodeQL engineers. If you prefer to stay off Slack, feel free to ask any questions in CodeQL repository discussions or in GitHub Security Lab repository discussions.

The post CodeQL zero to hero part 4: Gradio framework case study appeared first on The GitHub Blog.

December 17, 2024  17:32:32

Three years from today, all obligations of the EU Cyber Resilience Act (CRA) will be fully applicable, with vulnerability reporting obligations applying as early as September 2026. By that time, the CRA, from idea to implementation, will have been high on GitHub’s developer policy agenda for nearly six years.

Together with our partners, GitHub has engaged with EU lawmakers to ensure that the CRA avoids unintended consequences for the open source ecosystem, which forms the backbone of any modern software stack. This joint effort has paid off: while the first draft of the CRA created significant legal uncertainty for open source projects, the end result allocates responsibility for cybersecurity more clearly with those entities that have the resources to act.

As an open source developer, you’re too busy to become a policy wonk in your spare time. That’s why we’re here: to understand the origins and purpose of the CRA and to help you assess whether the CRA applies to you and your project.

Why GitHub got involved

The reasons why the EU decided to start regulating software products are clear: cheap IoT devices that end up being abused in botnets, smartphones that fail to receive security updates just a couple of years after purchase, apps that accidentally leak user data, and more. These are not just a nuisance for the consumers directly affected by them; they can be a starting point for cyberattacks that weaken entire industries and disrupt public services. To address this problem, the CRA creates cybersecurity, maintenance, and vulnerability disclosure requirements for software products on the EU market.

Cybersecurity is just as important for open source as it is for any other software. At the same time, open source has flourished precisely because it has always been offered as-is, free to re-use by anyone, but without any guarantees offered by the community that developed it. Despite their great value to society, open source projects are frequently understaffed and underresourced.

That’s why GitHub has been advocating for a stronger focus on supporting, rather than regulating, open source projects. For example, GitHub supports scaling up public funding programs like Germany’s Sovereign Tech Agency, which is joining us for a GitTogether today to discuss open source sustainability. This approach has also informed the GitHub Secure Open Source Fund, which will help projects improve their cybersecurity posture through funding, education, and free tools. Applications are open until January 7.

What open source developers need to know

The CRA creates obligations for software manufacturers, importers, and distributors. If you contribute to open source in a personal capacity, the most important question will be whether the CRA applies to you at all. In many cases, the answer will be “no,” but some uncertainty remains because the EU is still in the process of developing follow-up legislation and guidance on how the law should be applied. As with any law, courts will have the final word on how the CRA is to be interpreted.

From its inception, the CRA sought to exclude any free and open source software that falls outside of a commercial activity. To protect the open source ecosystem, GitHub has fought for a clear definition of “commercial activity” throughout the legislative process. In the end, the EU legislature has added clarifying language that should help place the principal burden of compliance on companies that benefit from open source, not on the open source projects they are relying on.

In determining whether the CRA applies to your open source project, the key question will be whether the open source software is “ma[de] available on the market,” that is, intended for “distribution or use in the course of a commercial activity, whether in return for payment or free of charge” (Art. 3(22); see also Art. 3(1), (22), (48); Recitals 18-19). Generally, as long as it’s not, the CRA’s provisions won’t apply at all. If you intend to monetize the open source software yourself, then you are likely a manufacturer under the CRA and subject to its full requirements (Art. 3(13)).

Lighter regulation of open source stewards

If you are part of an organization that supports and maintains open source software that is intended for commercial use by others, that organization may be classified as an “open source software steward” under the CRA (Art. 3(14), Recital 19).

Following conversations with the open source community, EU lawmakers introduced this new category to better reflect the diverse landscape of actors that take responsibility for open source. Examples of stewards include “certain foundations as well as entities that develop and publish free and open-source software in a business context, including not-for-profit entities” (Recital 19). Stewards have limited obligations under the CRA: they have to put in place and document a cybersecurity policy, notify cybersecurity authorities of actively exploited vulnerabilities in their projects that they become aware of, and promote sharing of information on discovered vulnerabilities with the open source community (Art. 24). The details of these obligations will be fleshed out by industry standards over the next few years, but the CRA’s text makes clear a key distinction—that open source stewards should be subjected to a “light-touch . . . regulatory regime” (Recital 19).

Developers should be mindful of some key provisions as they decipher whether they will be deemed manufacturers, stewards, or wholly unaffected by the CRA:

  • On donations: “Accepting donations without the intention of making a profit should not be considered to be a commercial activity” (Recital 15). However, in cases “where manufacturers that integrate a component into their own products with digital elements either contribute to the development of that component in a regular manner or provide regular financial assistance to ensure the continuity of a software product” (Recital 19), the recipient of that assistance may qualify as an open source software steward.
  • On distinction between development and supply of software: “[The] provision of products with digital elements qualifying as free and open-source software that are not monetised by their manufacturers should not be considered to be a commercial activity.” (Recital 18)
  • On contributions from manufacturers: “[T]he mere fact that an open-source software product with digital elements receives financial support from manufacturers or that manufacturers contribute to the development of such a product should not in itself determine that the activity is of commercial nature.” (Recital 18)
  • On regular releases: “[T]he mere presence of regular releases should not in itself lead to the conclusion that a product with digital elements is supplied in the course of a commercial activity.” (Recital 18)
  • On OSS nonprofits: “[T]he development of products with digital elements qualifying as free and open-source software by not-for-profit organisations should not be considered to be a commercial activity provided that the organisation is set up in such a way that ensures that all earnings after costs are used to achieve not-for-profit objectives.” (Recital 18)
  • On OSS contributors: “This Regulation does not apply to natural or legal persons who contribute with source code to products with digital elements qualifying as free and open-source software that are not under their responsibility.” (Recital 18)

Making the rules work for open source in practice

Even with the clarifications added during the legislative process, many open source developers still struggle to make sense of whether they’re covered or not. If you think your open source activities may fall within the scope of the CRA (either as a manufacturer or as a steward), the Eclipse ORC Working Group is a good place to get connected with other open source organizations that are thinking about the CRA and related standardization activities. The Linux Foundation and OpenSSF are currently co-hosting a workshop for open source stewards and manufacturers to prepare for compliance, and OpenSSF is publishing resources about the CRA.

While it’s good news that some classes of open source developers will face minimal obligations, that probably won’t stop some companies from contacting projects about their CRA compliance. In the best case, this will lead to constructive conversations, with companies taking responsibility for improving the cybersecurity of the open source projects they depend upon. The CRA offers opportunities for that: for example, it allows manufacturers to finance a voluntary security attestation of open source software (Art. 25), and it requires manufacturers who have developed a fix for a vulnerability in an open source component to contribute that fix back to the project (Art. 13(6)). In the worst case, companies working on their own CRA compliance will send requests phrased as demands to overworked open source maintainers, who will spend hours talking to lawyers only to find out, in many cases, that they are not legally required to help manufacturers.

To avoid the worst-case scenario and help realize the opportunities of the CRA for open source projects, GitHub’s developer policy team is engaging with the cybersecurity agencies and regulators, such as the European Commission, that are creating guidance on how the rules will work in practice.

Most urgently, we call on the European Commission to issue guidance (required under Art. 26 (2) (a)) that will make it easy for every open source developer to determine whether they face obligations under the CRA or not. We are also giving input to related legislation, such as the NIS2 Directive, which defines supply chain security rules for certain critical services.

Our north star is advocating for rules that support open source projects rather than unduly burden them. We are convinced that this approach will lead to better security outcomes for everyone and a more sustainable open source ecosystem.

The post What the EU’s new software legislation means for developers appeared first on The GitHub Blog.