How Palantir Manages Continuous Vulnerability Scanning at Scale


The Challenge

Effective vulnerability management is a cornerstone of any established security program. For complex cloud software providers like Palantir, staying on top of vulnerabilities and quickly remediating them is critical to staying ahead of our adversaries. If undetected or unmitigated, vulnerabilities in container images and software dependencies can rapidly become a blind spot that may be exploited. As increasingly complex software and infrastructure rely on ever-growing inventories of third-party libraries, manually scanning and individually remediating vulnerabilities in dependencies is no longer sufficient; enterprises must programmatically identify and quantify their exposure, prioritize remediation, and strictly enforce fix timeframes for all vulnerabilities if they hope to keep pace.

Palantir’s Response

Palantir’s products power the work of preeminent public and private institutions around the world; our customers rely on us to safeguard their data against threats. As such, we’ve developed a service we (very creatively) named the Container Vulnerability Scanner, or CVS. While this tool was initially created as a standalone vulnerability scanning service, we have also integrated it into Apollo, our continuous deployment platform, to streamline and even automate vulnerability remediation efforts.

This endeavor aims not only to surpass our own stringent security requirements, but to comply with some of the most rigorous accredited security frameworks in the field today. For example, our cloud environment is certified for FedRAMP Moderate/DoD Impact Level 5 (IL5); by extension, CVS is designed to meet or exceed the FedRAMP vulnerability management requirements across all of our production environments. These requirements dictate considerations such as authenticated scanning, CVE numbering and CVSS risk scoring, signature updates, and asset identification.

At a high level, CVS ensures that all software components and products we build and deploy meet strict security requirements by:

  • Centrally collecting and managing software and vulnerabilities to simplify tracking and remediation efforts;
  • Scanning containers to detect vulnerabilities in container layers and malicious software;
  • Scanning product artifacts for known dependency vulnerabilities;
  • Fixing all identified issues within strict service-level agreements (SLAs), with any breach of resolution timeline automatically preventing impacted assets from being deployed; and
  • Requiring a stringent review process for all exceptions and accepted risks. By default, any findings that have been suppressed must also define an expiration date for the suppression, which mandates periodic re-review and re-acceptance.

Based on both FedRAMP requirements and our own high bar for security posture, Palantir enforces strict SLAs for vulnerability remediation. While these SLAs represent the absolute maximum time permitted to address a vulnerability, we aim to fix vulnerabilities significantly faster in practice. We prioritize critical and high vulnerabilities for expedient mitigation and remediation when they are detected.

For vulnerabilities on underlying infrastructure, containers, or hosts, we adhere to the following maximum SLAs:

  • CRITICAL: 72 hours
  • HIGH: 30 days
  • MEDIUM: 90 days
  • LOW: 120 days

For vulnerabilities in Palantir-developed software products, which may be significantly more complicated to remediate, we adhere to the following maximum SLAs:

  • CRITICAL: 30 days
  • HIGH: 30 days
  • MEDIUM: 90 days
  • LOW: 120 days
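
The two SLA tables above can be read as a simple lookup keyed by asset type and severity. The following is a minimal sketch of that lookup, not CVS's actual implementation: the names `AssetKind`, `MAX_SLA`, `remediation_deadline`, and `is_sla_breached` are hypothetical, and the sketch assumes the SLA clock starts at the time of detection.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Severity(Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


class AssetKind(Enum):
    # Hypothetical labels for the two SLA tables in the text.
    INFRASTRUCTURE = "infrastructure"  # underlying infrastructure, containers, hosts
    PRODUCT = "product"                # Palantir-developed software products


# Maximum remediation windows, copied from the two lists above.
MAX_SLA = {
    AssetKind.INFRASTRUCTURE: {
        Severity.CRITICAL: timedelta(hours=72),
        Severity.HIGH: timedelta(days=30),
        Severity.MEDIUM: timedelta(days=90),
        Severity.LOW: timedelta(days=120),
    },
    AssetKind.PRODUCT: {
        Severity.CRITICAL: timedelta(days=30),
        Severity.HIGH: timedelta(days=30),
        Severity.MEDIUM: timedelta(days=90),
        Severity.LOW: timedelta(days=120),
    },
}


def remediation_deadline(kind: AssetKind, severity: Severity,
                         detected_at: datetime) -> datetime:
    """Latest permitted fix time (assumes the clock starts at detection)."""
    return detected_at + MAX_SLA[kind][severity]


def is_sla_breached(kind: AssetKind, severity: Severity,
                    detected_at: datetime, now: datetime) -> bool:
    """True once the maximum remediation window has elapsed."""
    return now > remediation_deadline(kind, severity, detected_at)


detected = datetime(2022, 9, 1, tzinfo=timezone.utc)
print(remediation_deadline(AssetKind.INFRASTRUCTURE, Severity.CRITICAL, detected))
```

Keeping the windows in one table means a breach check, and any deployment block that depends on it, reads from a single source of truth.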

Managing and Integrating Multiple Vulnerability Scanners in CVS

Fundamentally, CVS is a framework that allows for implementation of multiple discrete scanners. As our security requirements shift over time, we can simply add a new scanner to perform a discrete function, and then use the output of that scanner as part of conditional logic downstream.
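
To illustrate the idea of a framework that hosts multiple discrete scanners, here is a minimal sketch under stated assumptions. `Finding`, `ScannerRegistry`, `has_blocking_findings`, and the two stub scanners are hypothetical names invented for this example; the stubs return fixed data and do not call any real scanning tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Finding:
    scanner: str     # name of the scanner that produced the finding
    identifier: str  # e.g. a CVE ID or a malware signature name
    severity: str    # e.g. "CRITICAL", "HIGH", "MEDIUM", "LOW"


# Assumption: a scanner is any callable that takes an artifact reference
# and returns the findings it detected for that artifact.
Scanner = Callable[[str], List[Finding]]


class ScannerRegistry:
    """Holds discrete scanners and runs all of them against an artifact."""

    def __init__(self) -> None:
        self._scanners: Dict[str, Scanner] = {}

    def register(self, name: str, scanner: Scanner) -> None:
        if name in self._scanners:
            raise ValueError(f"scanner already registered: {name}")
        self._scanners[name] = scanner

    def scan(self, artifact: str) -> List[Finding]:
        findings: List[Finding] = []
        for scanner in self._scanners.values():
            findings.extend(scanner(artifact))
        return findings


def has_blocking_findings(findings: Sequence[Finding],
                          blocking: Sequence[str] = ("CRITICAL", "HIGH")) -> bool:
    """Example of downstream conditional logic over scanner output."""
    return any(f.severity in blocking for f in findings)


# Stub scanners with canned results, standing in for real integrations.
def stub_cve_scanner(artifact: str) -> List[Finding]:
    return [Finding("stub-cve", "CVE-2022-XXXX", "HIGH")]


def stub_malware_scanner(artifact: str) -> List[Finding]:
    return []


registry = ScannerRegistry()
registry.register("stub-cve", stub_cve_scanner)
registry.register("stub-malware", stub_malware_scanner)
print(has_blocking_findings(registry.scan("example-image:1.0.0")))
```

Because each scanner only has to satisfy the same small interface, adding a new one is a single `register` call, and downstream rules operate on a uniform list of findings.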

There are three major scanners that we rely on today:

  • Trivy, an open-source vulnerability scanner from Aqua Security. Trivy scans an arbitrary container image to detect known CVEs in underlying layers and components included within the container.
  • ClamAV, an open-source anti-malware engine. ClamAV scans an arbitrary container image to detect known malware and other threats. ClamAV can be used with an open-source rules engine, commercial rules, or custom rules authored by the Information Security team. This provides a first pass at using signatures to detect anomalous or malicious files present in shipped containers.
  • JFrog Xray, a software composition analysis tool. Xray scans all types of artifacts within Artifactory to detect known vulnerabilities in the artifact’s dependencies. This offers comprehensive dependency scan coverage of all deployed applications, regardless of source code language, including Java, JavaScript, Go, and Rust.

Figure: CVS architecture as it stands today.

CVS + Apollo

Base Container Images

In Rubix, Palantir’s Kubernetes-based cloud deployment architecture, we use minimized and hardened container images based on Ubuntu Linux for our underlying infrastructure and hosts. All hosts within a Rubix environment are ephemeral — they are built, destroyed, and rebuilt at regular intervals. An up-to-date machine image is maintained with the latest patches, guaranteeing that hosts will be patched at least every time they are rebuilt. This method of container image management allows us to ensure that every host is appropriately updated and scanned by CVS prior to deployment.

For our application containers, we take this even further with a scratch-based image. It has an extremely minimal footprint, reducing the attack surface by excluding packages that are not strictly necessary. Whereas a standard container might contain ~90 packages, our container image ships with 6. In particular, it does not include shells, package managers, coreutils (e.g., ls, cp), or any other packages beyond glibc.

Using a minimized container image as the starting point for our products significantly reduces our application attack surface and eliminates a huge swath of vulnerabilities. This approach ensures we can focus our limited resources and effort on applicable security defects, rather than on remediating ancillary package or dependency vulnerabilities that have no security impact on our products.

Apollo Catalog Scanning

All software in our cloud environment is managed via Apollo. Each individual software component we wish to deploy in our cloud environment is first registered in the Apollo Catalog. Once registered, it is enrolled in CVS and must meet specific security criteria before it is deployed to any environment.

In practice, this means every version of every software component we release receives a CVS scan when initially added to the catalog, and must continually meet the security bar as prescribed in the above SLAs. This highly scalable automation provides strong guarantees on security baselines across all production environments managed by Apollo.

Recalls and Suppressions

Given the nature of our customers and their work, policy-based enforcement of remediation SLAs is woefully insufficient. High uptime requirements, the testing required to safely patch in place, and coordination of maintenance windows make human intervention impossible to scale. Instead, CVS and Apollo work together to automatically take action based upon scan results and logical rules.

In Apollo, we use a practice known as a software recall. The exact behavior is configurable to meet specific needs, but by default, artifacts marked as recalled cannot be deployed, either by fresh installation or by upgrade. Instead, existing installations remain on their current version until a new, non-recalled version is released.

If a container or artifact scan fails a CVS check, the respective product is automatically marked as recalled in Apollo. This checkpoint ensures that if a vulnerability is introduced in a new product version, we do not make ourselves susceptible to it by widely deploying it across our fleet. Whenever the recalled version is also the latest version in the Apollo catalog, a product support ticket is filed, which is routed immediately to the product owner for the recalled service. This escalates the issue directly to the team, at an appropriate priority level, to intervene and determine the best short- and long-term fixes. By placing responsibility for vulnerability management early in the build pipeline, this “shift-left” approach incentivizes developers to keep dependencies up-to-date and provides flexibility for remediation.
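
The default recall behavior described above can be sketched as follows. This is an illustrative model only, not Apollo's API: `CatalogEntry`, `record_scan_result`, `can_deploy`, and `upgrade_target` are hypothetical names, tickets are modeled as plain strings, and the choice to upgrade to the newest non-recalled version is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class CatalogEntry:
    product: str
    versions: List[str]                               # ordered oldest to newest
    recalled: Set[str] = field(default_factory=set)   # versions marked as recalled
    tickets: List[str] = field(default_factory=list)  # stand-in for a ticket queue

    @property
    def latest(self) -> str:
        return self.versions[-1]


def record_scan_result(entry: CatalogEntry, version: str, passed: bool) -> None:
    """Recall a version that fails its check; ticket the owner if it is the latest."""
    if passed:
        return
    entry.recalled.add(version)
    if version == entry.latest:
        entry.tickets.append(f"{entry.product} {version} recalled after failed scan")


def can_deploy(entry: CatalogEntry, version: str) -> bool:
    """Recalled versions cannot be installed fresh or upgraded to."""
    return version not in entry.recalled


def upgrade_target(entry: CatalogEntry, current: str) -> str:
    """Newest non-recalled version after `current`, else stay on `current`."""
    newer = entry.versions[entry.versions.index(current) + 1:]
    for candidate in reversed(newer):
        if can_deploy(entry, candidate):
            return candidate
    return current


entry = CatalogEntry("example-service", ["1.0.0", "1.1.0", "1.2.0"])
record_scan_result(entry, "1.2.0", passed=False)
print(upgrade_target(entry, "1.0.0"), entry.tickets)
```

In this model an installation on 1.1.0 simply stays put while 1.2.0 is recalled, which mirrors the behavior of remaining on the existing version until a non-recalled release appears.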

Edge cases with vulnerability detection do happen. False positives may appear, packages may be inappropriately marked as vulnerable, vendor-provided patches may not yet exist, and other externalities can occur. CVS and Apollo have a flexible suppression feature set that allows for handling these edge cases in a time-bound and risk-adjusted way.

In our cloud environment, all suppressions must be manually reviewed and approved by the Information Security team. As a secondary control, these suppressions are periodically reviewed by the Tech Compliance team to ensure we are meeting our regulatory and compliance requirements.

Product owners must specify a suppression using a structured format. An example suppression might look like the following:

CVE-2022-XXXX:
    category: VENDOR_DEPENDENCY
    rationale: package-name | No stable vendored fix available yet
    validUntil: "2022-10-15" # One month from now

In this instance, a temporary 30-day suppression has been requested for the specific CVE, as no vendor fix is yet available. Once the validUntil date has passed, the suppression is automatically removed, and Apollo begins its recall process with no warning.
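
The time-bound nature of suppressions can be sketched as below. This is a hedged illustration rather than the real CVS logic: `Suppression`, `parse_suppressions`, `is_active`, and `unsuppressed_findings` are hypothetical names, the input is an already-parsed mapping with the same fields as the example above, and the sketch assumes a suppression remains valid through its validUntil date inclusive.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Set


@dataclass(frozen=True)
class Suppression:
    cve: str
    category: str
    rationale: str
    valid_until: date


def parse_suppressions(raw: Dict[str, Dict[str, str]]) -> List[Suppression]:
    """Build suppressions from a mapping shaped like the structured example."""
    parsed = []
    for cve, body in raw.items():
        # Every suppression must carry an expiration date.
        if "validUntil" not in body:
            raise ValueError(f"{cve}: suppression must define validUntil")
        parsed.append(Suppression(
            cve=cve,
            category=body["category"],
            rationale=body["rationale"],
            valid_until=date.fromisoformat(body["validUntil"]),
        ))
    return parsed


def is_active(suppression: Suppression, today: date) -> bool:
    """Assumption: valid through the validUntil date, expired the day after."""
    return today <= suppression.valid_until


def unsuppressed_findings(findings: Set[str],
                          suppressions: List[Suppression],
                          today: date) -> Set[str]:
    """Findings that still count against the artifact on the given day."""
    active = {s.cve for s in suppressions if is_active(s, today)}
    return findings - active


raw = {
    "CVE-2022-XXXX": {
        "category": "VENDOR_DEPENDENCY",
        "rationale": "package-name | No stable vendored fix available yet",
        "validUntil": "2022-10-15",
    }
}
suppressions = parse_suppressions(raw)
print(unsuppressed_findings({"CVE-2022-XXXX"}, suppressions, date(2022, 10, 16)))
```

Because expiry is evaluated at check time, nothing needs to delete the suppression explicitly: once the date has passed, the finding counts again and the normal recall path applies.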

Categories that may be considered for suppression include:

  • Vendor Dependency: A fix has not yet been released by the responsible third-party vendor.
  • False Positive: A vulnerability detection is demonstrably incorrect.
  • Recasted Risk: A vulnerability is recast to a different criticality rating, based on available mitigations and other factors.
  • Accepted Risk: A vulnerability may need to be accepted for a period outside of SLA due to external factors (e.g., complexity in remediation, infrastructure migration, etc.).
  • Base Image: A vulnerability in a published container originates from the base image it uses. Given that there are many downstream containers that use the same base image, a vulnerability that affects one of the base images can in turn be present in many others. This type of suppression is thus used to suppress a vulnerability that does not yet have a good fix in one of the base images, to prevent recalls of every product image that uses it.

Having a robust, auditable, reviewed suppression and recall system allows developers to quickly track and fix commonplace patches in their products, while allowing the Information Security team to focus more deeply on the most critical vulnerabilities impacting our fleet.

CVS and the Future of Vulnerability Scanning at Palantir

CVS has been instrumental in exceeding the security requirements and expectations of our customers in both our cloud and on-premises environments. Since implementing CVS, we have dramatically improved patching velocity, increased confidence that security controls are applied uniformly and effectively, and significantly improved our ability to respond to security defects in our products and infrastructure.

When CVE-2021-44228, the remote code execution vulnerability in Log4j, rocked the tech industry in December 2021, we relied on CVS, Apollo, and Foundry as our engines for global response. With these products working in concert, we were able to positively identify our exposure, recall and block known failed versions, and manage thousands of production service upgrades within hours across the 230+ customer environments managed by Palantir at the time, including cloud, on-premises, classified, and edge networks.

Looking to the future, we are building tighter integration between CVS and Apollo, adding more configurable controls and rules logic, and introducing multi-party reviews/approvals into the platform. We are also working towards native software bill of materials (SBOM) generation and validation in Apollo. To protect our software supply chain, we are modifying CVS and Apollo to enforce security controls that guarantee provenance and integrity for our published software artifacts, to better withstand supply chain attacks that commonly target software companies.

We are excited to add scanners that we believe will significantly improve the security value we get from this platform. For instance, our incident detection teams use YARA rules to hunt for malicious and anomalous implants and malware across our fleet globally. A native YARA scanner as part of CVS could give our network defenders better insight into, and protection against, actors attempting to insert malicious code into our containers and software.

Palantir strives to empower our customers with capabilities to make their own security seamless. To that end, we’ve undertaken a project to expose CVS within Foundry and Apollo to enable customers to natively secure their own authored containers and software directly in our platforms.