Module: /

Skia Perf Technical Documentation

This documentation provides a comprehensive technical overview of the Skia Performance Dashboard (Perf), targeting software engineers new to the project. It focuses on the architectural rationale, data lifecycle, and core concepts beyond simple feature descriptions.

Project Purpose

Skia Perf is a large-scale performance monitoring and regression detection platform. It ingests high-frequency telemetry data from diverse sources (Chrome, Android, Fuchsia, Skia), organizes it into searchable time-series “traces,” and automatically identifies performance regressions using statistical analysis.

The system is designed to handle millions of data points across years of history, translating non-linear source control history into linear, searchable performance trends.

Fundamental Concepts & Terminology

  • Trace: A single line on a graph representing measurements for a specific test over time. A trace is uniquely identified by a Key (a set of key-value pairs like benchmark=motion_mark, bot=pixel_6, unit=ms).
  • CommitNumber: An internal, zero-indexed integer assigned to every Git commit. This linearizes the X-axis for high-performance database lookups and consistent graphing, regardless of branching or merge frequency.
  • Tile: A fixed-size partition of data (typically 256 commits). Storage and queries are optimized by loading only the specific “Tiles” relevant to a time range.
  • ParamSet: An inverted index of all keys and values present in a set of traces (e.g., all available bot names). This powers the query UI.
  • Regression: A statistically significant change in a trace’s value (a “step” up or down) identified by fitting a step function to the data.
  • Shortcut: A persisted, shortened ID representing a complex query or a mathematical formula (e.g., #base_memory_usage).
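
To make the Trace/Key/ParamSet relationship concrete, here is a minimal Python sketch. This is illustrative only, not the actual Go implementation; it assumes Perf's convention of sorted, comma-delimited structured keys, and the helper names are invented:

```python
def make_key(params: dict[str, str]) -> str:
    """Serialize params into a canonical trace key: sorted, comma-delimited."""
    return "," + ",".join(f"{k}={params[k]}" for k in sorted(params)) + ","

def param_set(keys: list[str]) -> dict[str, set[str]]:
    """Build the inverted index (ParamSet) of every key=value pair seen."""
    ps: dict[str, set[str]] = {}
    for key in keys:
        for pair in key.strip(",").split(","):
            k, v = pair.split("=", 1)
            ps.setdefault(k, set()).add(v)
    return ps

keys = [
    make_key({"benchmark": "motion_mark", "bot": "pixel_6", "unit": "ms"}),
    make_key({"benchmark": "motion_mark", "bot": "pixel_7", "unit": "ms"}),
]
# param_set(keys)["bot"] == {"pixel_6", "pixel_7"}
```

The ParamSet is what lets the query UI offer "all available bot names" without scanning trace values.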

High-Level Architecture

Perf is built as a modular system where a single Go binary, perfserver, performs different roles based on its execution mode.

[ Data Sources ]     [ Ingestion Pipeline ]      [ Storage Layer ]      [ Analysis & UI ]
      |                        |                         |                      |
[ GCS Buckets ] ------> [ perfserver ingest ] ---> [ Spanner/CRDB ] <---- [ perfserver cluster ]
      |            (Parses JSON, Maps Commits)      (Trace Store)        (Finds Regressions)
      |                                                  ^                      |
      |                                                  |                      v
[ Git Repos ] -------------------------------------------+---------- [ perfserver frontend ]
                                                                       (Web UI & API)

Design Rationale & Implementation Details

1. Data Ingestion and Format

Perf mandates a specific JSON schema for incoming data to decouple performance producers from the dashboard.

  • Rationale: By using a strictly versioned format, the system can ingest data from any build system (Buildbot, GitHub Actions, LUCI) without logic changes.
  • Path Logic: Files must be stored in Google Cloud Storage using a YYYY/MM/DD/HH directory structure. This allows the ingester to process files in chronological order and prevents GCS directory listing bottlenecks.
  • Commit Resolution: The perfgit module monitors Git repositories to map every incoming git_hash to a CommitNumber. If the database is empty, it reconstructs this mapping from the Git source of truth on startup.
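
The hour-bucketed path convention can be sketched as follows; the bucket and prefix names here are hypothetical stand-ins:

```python
from datetime import datetime, timezone

def gcs_path(bucket: str, prefix: str, ts: datetime, filename: str) -> str:
    """Hour-bucketed object path; listing by prefix yields chronological order."""
    return f"gs://{bucket}/{prefix}/{ts.strftime('%Y/%m/%d/%H')}/{filename}"

gcs_path("skia-perf", "ingest",
         datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc), "results.json")
# -> "gs://skia-perf/ingest/2024/03/05/14/results.json"
```

Because each hour gets its own "directory", no single prefix ever accumulates an unbounded number of objects.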

2. Tiled Trace Storage

Traces are stored in a specialized “tiled” format within the database (Google Cloud Spanner or CockroachDB).

  • Why: Storing millions of individual floating-point numbers as independent rows would lead to massive index overhead.
  • How: Data is grouped into Tiles. A query for a specific time range resolves which Tiles are needed, fetches the compressed buffers, and reconstructs a DataFrame for the UI. This ensures O(1) lookup time for any specific point and linear scaling for range queries.
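
The tile arithmetic implied above reduces to integer division on the CommitNumber. A minimal Python sketch (the production logic is Go; 256 is just the typical tile size):

```python
TILE_SIZE = 256  # typical tile_size from the instance configs

def tile_number(commit: int) -> int:
    """Which tile a CommitNumber falls into."""
    return commit // TILE_SIZE

def tile_offset(commit: int) -> int:
    """Position of the commit within its tile."""
    return commit % TILE_SIZE

def tiles_for_range(begin: int, end: int) -> list[int]:
    """Every tile a query over the inclusive range [begin, end] must fetch."""
    return list(range(tile_number(begin), tile_number(end) + 1))
```

A query over commits 250-600 therefore touches exactly three tiles (0, 1, and 2), regardless of how many traces are involved.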

3. Regression Detection (Clustering & Step-Fitting)

Regression detection is not just simple thresholding; it uses shape-based analysis.

  • K-Means Clustering: Similar traces are grouped together. If 100 different tests all show the same “spike” at the same commit, they will likely fall into the same cluster.
  • Step Function Fitting: The system fits a mathematical step function to the centroid of a cluster.
  • Interestingness Score: Calculated as StepSize / LeastSquaresError. A high score means a clean, significant jump with low noise.
  • Fingerprinting: To avoid re-alerting on the same issue, clusters are fingerprinted using the IDs of the first 20 traces closest to the centroid. If a new cluster matches an old fingerprint, it is treated as a continuation of the same event.
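
A simplified Python sketch of the step-fit scoring: the real algorithm is more involved (it searches over split points and operates on cluster centroids, for example), but this shows the StepSize / LeastSquaresError idea at a fixed split:

```python
import math

def step_fit(trace: list[float], split: int) -> float:
    """Fit a two-level step function split at index `split` and return
    the interestingness score: StepSize / LeastSquaresError."""
    left, right = trace[:split], trace[split:]
    mean_l = sum(left) / len(left)
    mean_r = sum(right) / len(right)
    sse = sum((v - mean_l) ** 2 for v in left) + \
          sum((v - mean_r) ** 2 for v in right)
    lse = math.sqrt(sse / len(trace))       # residual error of the fit
    step = abs(mean_r - mean_l)             # height of the jump
    return step / lse if lse > 0 else float("inf")
```

A clean 1-to-5 jump with zero noise scores infinitely high; the same jump buried in noise scores lower, which is exactly the "clean, significant jump with low noise" intuition.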

4. Event-Driven vs. Continuous Alerting

The system supports two alerting strategies based on data density.

  • Continuous: Iterates over all configured alerts every few minutes. Ideal for smaller, dense datasets (like Skia).
  • Event-Driven: Triggered by PubSub events from the ingester. When new data arrives, only the alerts matching the incoming trace IDs are processed.
  • Rationale: For massive datasets (like Android, with 40M+ traces), continuous clustering is too computationally expensive. Event-driven detection reduces latency from hours to seconds.
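
The event-driven filtering step can be sketched as follows. The data shapes are hypothetical, and real alert queries are richer than exact-match key/value pairs, but the principle is the same: only alerts matched by the incoming traces do any work.

```python
def matches(alert_query: dict[str, str], trace_params: dict[str, str]) -> bool:
    """True when a trace satisfies every key=value constraint in an alert's query."""
    return all(trace_params.get(k) == v for k, v in alert_query.items())

def alerts_to_run(alerts: list[dict], incoming: list[dict[str, str]]) -> list[dict]:
    """Event-driven mode: keep only alerts matched by at least one incoming trace."""
    return [a for a in alerts if any(matches(a["query"], t) for t in incoming)]
```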

Significant Modules & Files

go/tracestore (Backend Storage)

This module is the “brain” of data persistence. It manages the separation of trace values (the numbers) from the trace parameters (the metadata). It implements the logic for “joining” disparate tiles into a cohesive DataFrame for the frontend.

go/regression (Anomaly Logic)

Coordinates the regression detection lifecycle. It pulls configurations (Alerts), fetches data from the tracestore, executes the clustering algorithms, and writes found anomalies to the regressionstore. It is responsible for the “Start-Status-Result” polling pattern used by the UI.

modules/explore-simple-sk (Frontend Orchestrator)

The primary UI component for data interaction. It manages the reactive loop:

  1. Observes URL state changes (queries, zoom).
  2. Requests a “Frame” (data chunk) from the backend.
  3. Coordinates with plot-google-chart-sk to render the SVG lines.
  4. Overlays HTML elements (anomalies, bug icons) on top of the SVG to maintain interactive performance.

go/notify (Alert Delivery)

A modular system for dispatching alerts. It formats regression data into HTML or Markdown templates and interacts with external APIs like the Google Issue Tracker (Buganizer) or SMTP servers. It handles deduplication to ensure a single regression doesn't result in multiple redundant bugs.

Technical Workflows

The Query-to-Chart Workflow

When a user selects a filter in the UI:

  1. URL Sync: The stateReflector updates the URL with the new query.
  2. Request Initiation: explore-simple-sk calls /_/frame/start with the query.
  3. Backend Processing: dfbuilder identifies the necessary Tiles, fetches trace data, and applies any requested formulas (e.g., norm(), moving_average()).
  4. Progress Polling: The frontend polls /_/frame/status until the data is ready.
  5. Rendering: The resulting DataFrame is passed to the chart, which translates CommitNumbers into X-coordinates using the ChartLayoutInterface.
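
From the client's perspective, the Start-Status polling in steps 2 and 4 looks roughly like this. The endpoint names come from the text above; the `start` and `status` callables are stand-ins for real HTTP requests:

```python
import time

def poll_frame(start, status, interval_s=0.1):
    """Start-Status-Result pattern: kick off a long-running frame
    computation, then poll until the backend reports completion."""
    request_id = start()               # POST /_/frame/start
    while True:
        state = status(request_id)     # GET /_/frame/status
        if state["status"] == "Finished":
            return state["results"]    # the DataFrame payload
        time.sleep(interval_s)
```

Decoupling the request from the result keeps the HTTP layer responsive even when assembling a large DataFrame takes many seconds.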

Data Recovery and Backups

  • Backups: Handled via perf-tool database backup. Only “user-generated” data (Alerts, Regressions, Shortcuts) is backed up.
  • Reconstruction: Trace data is not backed up because it can be 100% reconstructed by re-ingesting the JSON files from GCS. Commits are similarly reconstructed from the Git repository.

Module: /cockroachdb

CockroachDB Management Module

This module provides the operational glue for interacting with the CockroachDB cluster deployed within the Skia infrastructure. Rather than managing the database engine itself, this module focuses on providing developer and administrator access to the data layer for debugging, manual SQL intervention, and performance monitoring.

Connectivity Design and Interaction

The module is designed around the principle of ephemeral, secure access to a distributed database running inside a Kubernetes environment (perf-cockroachdb). Because the database is not exposed directly to the public internet, the scripts implement two primary access patterns:

  1. Direct SQL Execution (In-Cluster): The connect.sh script facilitates a “cloud-native” way to interact with the database. It spins up a temporary, short-lived Kubernetes pod running the official CockroachDB image. This pod connects to the perf-cockroachdb-public service internally. This approach is preferred for quick SQL queries as it avoids local toolchain dependencies and keeps the traffic entirely within the cluster's private network.

  2. Local Tunneling (Port-Forwarding): For more complex operations—such as using a local IDE, a native cockroach binary, or accessing the web-based Admin UI—the module utilizes Kubernetes port-forwarding.

    • Administrative UI: admin.sh bridges the local port 8080 to the database's status server. This allows developers to use a local browser to inspect cluster health, node status, and query performance metrics.
    • Remote SQL Access: skia-infra-public-port-forward.sh establishes a tunnel from a local port to the database's wire protocol port (26257). The tunnel is routed through the skia-infra-public context, enabling developers to point local CLI tools at 127.0.0.1 as if the database were running locally.

Key Components

  • SQL Access Utilities (connect.sh, skia-infra-public-port-forward.sh): These scripts manage the lifecycle of a connection. The design choice to use the --insecure flag suggests that the cluster is configured to rely on network-level isolation and Kubernetes RBAC rather than client-side certificate management for these specific administrative entry points.

  • Observability Bridge (admin.sh): This component targets the CockroachDB built-in HTTP console. It automates the dual step of establishing the network tunnel and launching the browser, reducing the friction required to monitor the database during performance testing or troubleshooting.

Workflow: Remote Administration

The following diagram illustrates the lifecycle of a remote administrative session using the tunneling scripts:

[ Developer Machine ]          [ Kubernetes Cluster (perf) ]
          |                               |
  (1) Run admin.sh  ----------------------> [ Pod: perf-cockroachdb-0 ]
          |        <-- Port-Forwarding -- |          :8080 (HTTP)
          |                               |
  (2) Browser opens localhost:8080        |
          |                               |
          |                               |
  (3) Run skia-...-port-forward.sh -------> [ Pod: perf-cockroachdb-0 ]
          |        <-- Port-Forwarding -- |          :26257 (SQL)
          |                               |
  (4) Local SQL Client -------------------+
      connects to 127.0.0.1:25000

Implementation Choices

  • Pod Targeting: Scripts target perf-cockroachdb-0 specifically for port-forwarding. This implies a StatefulSet deployment where the zero-ordinal pod acts as a reliable entry point for administrative tasks, even if the service itself is distributed across multiple nodes.
  • Version Pinning: The connection script pins the client image to v19.2.5. This ensures compatibility with the server-side wire protocol and guarantees that the administrative environment is reproducible across different developer machines.

Module: /configs

Perf Instance Configurations

The /configs directory serves as the central repository for the operational environment definitions of Skia Perf. Each JSON file in this directory represents a unique instance of the Perf service, defining how it interacts with data stores, ingestion pipelines, version control systems, and alerting mechanisms.

These configurations are designed to be deserialized into the config.InstanceConfig Go struct, which acts as the “source of truth” for the application's behavior at runtime.

Design Philosophy and Implementation Choices

Configuration Driven Architecture

Perf is designed as a generic engine for time-series visualization and anomaly detection. Rather than hard-coding logic for different projects (like Chrome, Android, or V8), the system uses these configuration files to define the project-specific “shape” of data. This allows a single codebase to support diverse use cases, from local development to massive-scale production environments.

Data Ingestion and Synchronization

The module defines how performance data moves from build systems into the database.

  • Ingestion Sources: While production instances typically use Google Cloud Storage (GCS) and Pub/Sub for real-time data streaming, the module also supports local directory monitoring (source_type: "dir"). This is used in demo_spanner.json and local.json to allow developers to test the full ingestion stack without cloud dependencies.
  • Commit Mapping: A core responsibility of these configs is defining the relationship between performance “traces” and the source code history. Through git_repo_config, instances specify how to parse commit positions (e.g., using commit_number_regex) to ensure the X-axis of performance graphs remains linear and meaningful across thousands of commits.

Scalability and Performance Tuning

The configurations allow for fine-grained control over the underlying storage engine (primarily Google Cloud Spanner).

  • Tile-Based Indexing: The tile_size parameter determines the granularity of data partitioning. Smaller tiles (e.g., 256) are optimized for sparse datasets or frequent small-range queries, whereas larger tiles (e.g., 8192) in high-traffic instances like Chrome minimize overhead during massive bulk ingestion.
  • Caching Layers: To maintain UI responsiveness under heavy load, the configurations define caching strategies. By specifying level1_cache_key (often set to bot or benchmark), the system can pre-index and cache common query patterns in Redis or local memory.

Workflow Orchestration

Beyond simple visualization, these configurations integrate Perf into the broader CI/CD ecosystem:

[ Ingestion ] -> [ Trace Store ] -> [ Regression Detection ] -> [ Bug Filing ]
      ^                                     |                         |
      |                                     v                         v
[ Git Repo ] <---------------------- [ Anomaly Grouping ] -------- [ Issue Tracker ]

  • Anomaly Management: Configurations determine how regressions are identified and reported. Modern instances use use_regression2_schema to enable advanced SQL-based anomaly tracking.
  • Issue Tracking: The issue_tracker_config and notify_config blocks define the “where” and “how” of alerting—ranging from simple email notifications to automated bug creation in Google's Issue Tracker, including the use of specific API keys and secrets.

Key Components and Responsibilities

Instance Metadata and UI Customization

The top-level fields (e.g., instance_name, contact, favorites, extra_links) define the identity of the instance. The favorites and extra_links sections are particularly important for usability, allowing administrators to curate specific views or link to external documentation and dashboards directly within the Perf UI.

Data Store Configuration (data_store_config)

This component defines the backend storage technology. While the project is shifting towards Spanner (as seen in demo_spanner.json and the /spanner subdirectory), the config remains flexible enough to define connection strings and database types, ensuring the application knows how to communicate with the PostgreSQL-compatible Spanner interface.

Query and Discovery (query_config)

This section controls how users interact with the data:

  • Include Params: Lists the metadata keys (like benchmark, bot, test) that the UI should expose for filtering and searching.
  • Default URL Values: Allows setting the “personality” of an instance—for example, deciding whether a specific instance should default to showing a zero-based Y-axis or use a specialized test picker by default.

Subdirectory Roles

  • /spanner: Specifically contains configurations for instances that have migrated to the Spanner-based backend. These files represent the current standard for high-performance, horizontally scalable Perf deployments.
  • Local and Demo Configs: Files like local.json and demo_spanner.json are essential for the development lifecycle. They point to local data directories (./demo/data/) and use simplified auth schemes to allow developers to run the entire Perf stack on a single workstation for testing and debugging.

Module: /configs/spanner

Spanner-Based Perf Configurations

The /configs/spanner directory contains JSON configuration files for various Skia Perf instances that utilize Google Cloud Spanner as their primary data store. These configurations define how performance data is ingested, stored, queried, and reported for specific projects such as Chrome, Android, V8, Flutter, and Fuchsia.

Overview

Each file in this directory represents a distinct environment (production, experiment, or internal) for a performance monitoring dashboard. By using Cloud Spanner, these instances benefit from a horizontally scalable, globally consistent relational database, which is particularly suited for handling the high volume of “traces” (time-series performance data) generated by large-scale CI/CD systems.

The move to Spanner (referenced in these configs via datastore_type: "spanner" and a PostgreSQL-compatible connection string) represents an architectural shift toward high-performance SQL-based storage for performance metrics.

Design Decisions and Implementation Choices

Data Storage Strategy

The configurations use a “tile-based” storage approach, controlled by the tile_size parameter.

  • Small Tiles (256-512): Used by projects like Skia, Angle, and Fuchsia. Smaller tiles are often more efficient for sparse data or instances where users frequently query small ranges of commits.
  • Large Tiles (4096-8192): Used by high-traffic instances like Chrome Internal. Larger tiles optimize for massive ingestion throughput and batch reading of dense performance data.
  • Follower Reads: Many internal configs enable enable_follower_reads. This improves read latency and reduces costs by allowing the application to read from Spanner replicas that might be slightly behind the leader, which is acceptable for dashboard visualization.

Ingestion Workflow

Data flow is standardized across instances using a Google Cloud Storage (GCS) to Pub/Sub pipeline.

[ Build System ] -> [ GCS Bucket ] -> [ Pub/Sub Topic ] -> [ Perf Ingestion Service ] -> [ Spanner DB ]

  • Source Type: Predominantly gcs, identifying the bucket where performance JSON files are uploaded.
  • Pub/Sub Integration: The topic and subscription fields define the “push” mechanism that triggers ingestion as soon as new data arrives in GCS.
  • Dead Letter (DL) Queues: Critical instances (like Chrome and WebRTC) include dl_topic and dl_subscription to handle failed ingestion attempts without losing data.

Anomaly Detection and Notification

The configurations define how the system reacts to performance regressions:

  • Notification Types: Options include markdown_issuetracker (for automated bug creation), html_email, or anomalygroup (which clusters related regressions before alerting).
  • Sheriffing: enable_sheriff_config allows these instances to pull alert thresholds and ownership data from a central management system.
  • Regression Schema: Newer instances use use_regression2_schema: true and fetch_anomalies_from_sql: true, indicating a transition to a more robust, queryable SQL schema for tracking performance changes over time.

Key Submodules and Components

Git Repository Configuration (git_repo_config)

Determines how commits are mapped to performance data.

  • Provider: Most use gitiles, which is optimized for Google-hosted source code.
  • Commit Parsing: Configurations like v8 and chrome use commit_number_regex to extract “Commit Positions” (e.g., refs/heads/main@{#12345}), which are used as a linear X-axis instead of raw Git hashes.
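
Extracting a commit position with such a regex might look like the following; the exact pattern is instance-specific and set via commit_number_regex, so this one is only an example:

```python
import re
from typing import Optional

# Hypothetical pattern; each instance configures its own commit_number_regex.
COMMIT_POSITION_RE = re.compile(r"refs/heads/main@\{#(\d+)\}")

def commit_number(footer: str) -> Optional[int]:
    """Pull the linear commit position out of a commit-message footer."""
    m = COMMIT_POSITION_RE.search(footer)
    return int(m.group(1)) if m else None
```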

Query and Visualization (query_config)

Customizes the UI and discovery experience for each project's unique metric structure.

  • Include Params: Defines which metadata fields (e.g., benchmark, bot, test, subtest_1) are indexed and searchable in the Perf UI.
  • Conditional Defaults: Instances like android2 use this to automatically select specific stats (like min for timeNs) when a user selects a certain metric, reducing the manual effort required to find meaningful data.
  • Caching: High-load instances utilize Redis (cache_config) to store common query results, specifically targeting level1_cache_key (usually benchmark) to speed up dashboard loading.

Temporal Integration (temporal_config)

Specific to internal Chrome and Fuchsia instances, this links the Perf dashboard to Temporal, a workflow orchestration engine. This is used to trigger automated “bisects” (pinpoint_task_queue)—a process that automatically finds the exact CL responsible for a performance regression.

Directory Responsibilities

  • Production Configs: (e.g., chrome-public.json, v8-public.json) Primary dashboards used by developers.
  • Internal Configs: (e.g., chrome-internal.json, eskia-internal.json) Restricted instances for proprietary code or sensitive performance metrics.
  • Autopush/Experiment: (e.g., v8-internal-autopush.json, chrome-internal-experiment.json) Testing grounds for new Perf UI features or experimental Spanner schemas.

Module: /coverage

Perf Coverage Module

The /coverage module provides a comprehensive quality assurance suite for the Perf project. Instead of relying on a single metric, it implements a “triangulated” approach to code health by measuring type safety, test execution coverage, and test effectiveness through mutation testing.

The primary goal of this module is to generate actionable reports, collected into a unified dashboard, that let developers identify not just untested code but also “weakly” tested code and type-system gaps.

Key Quality Dimensions

The module is structured around three distinct methodologies for evaluating the codebase:

  1. Type Coverage: Measures the “strictness” and completeness of TypeScript types across the project. It identifies where the any type or missing annotations might be bypassing the safety checks of the compiler. This ensures that the codebase remains maintainable and less prone to runtime errors.
  2. Test Execution (Line) Coverage: Uses c8 and mocha to track which lines and branches of code are executed during unit tests. This is the traditional metric for identifying “dead” zones in the test suite.
  3. Mutation Testing: Evaluates the quality of existing tests by injecting small bugs (mutants) into the source code (e.g., changing > to <). If the test suite still passes despite these changes, the mutation “survived,” indicating that the tests are not sensitive enough to detect logic regressions in that area.

Component Responsibilities

  • perf-coverage.sh (The Orchestrator): This script acts as the central entry point for generating coverage reports. It encapsulates the complex CLI arguments required for various tools, ensuring consistency between local execution and CI/CD pipelines. It allows for targeted runs (e.g., only running mutation tests) or a full suite execution.

  • add-coverage-links.py (Navigation Post-Processor): Most coverage tools generate static HTML reports that are isolated from one another. This script uses BeautifulSoup to programmatically inject a navigation header (“Back to Perf Coverage Dashboard”) into the generated HTML files. This transforms a collection of disparate reports into a cohesive, navigable documentation site. It handles idempotency by removing existing links before inserting new ones to prevent duplicate UI elements during re-runs.

  • stryker.config.json: Configures the Mutation Testing framework. It is specifically tuned to exclude Puppeteer (integration) tests and page objects, focusing purely on the core business logic within perf/modules. It balances performance and thoroughness by defining precise ignorePatterns.

  • tsconfig.coverage.json: A specialized TypeScript configuration used specifically for type-coverage reporting. It extends the base project configuration but restricts the scope to source files, excluding tests and demo files to ensure the reported coverage percentage reflects the production logic accurately.

Workflow Process

The following diagram illustrates how the module transforms source code and tests into a unified quality dashboard:

Source Code + Tests
       |
       |----(typescript-coverage-report)----> [Type Coverage HTML]
       |                                           |
       |----(c8 + mocha)--------------------> [Test Coverage HTML]
       |                                           |
       |----(stryker)-----------------------> [Mutation Report HTML]
       |                                           |
       V                                           V
[Raw HTML Reports] <----------------------- (add-coverage-links.py)
       |
       | (Injects Navigation UI)
       V
[Unified Coverage Dashboard]

Design Implementation Choices

  • Exclusion of Integration Tests: The test and mutation configurations specifically exclude *_puppeteer_test.ts. This is a deliberate choice to keep the coverage feedback loop fast. Integration tests are often too slow and “noisy” for mutation testing, which requires thousands of test re-runs.
  • Robust HTML Parsing: The Python post-processor parses reports with BeautifulSoup backed by the lxml parser, so that even if the reporting tools produce slightly malformed HTML, the navigation links can be reliably injected without corrupting the reports.
  • Static Analysis vs. Runtime Analysis: By combining tsconfig checks with stryker runtime analysis, the module covers the entire lifecycle of code reliability—from compile-time correctness to runtime logic validation.

Module: /csv2days

csv2days

csv2days is a command-line utility designed to post-process CSV files exported from Skia Perf. Its primary purpose is to aggregate time-series data from sub-day granularity (RFC3339 timestamps) into daily granularity.

Overview

When exporting data from Perf, a CSV may contain multiple columns representing different data points collected on the same calendar day. This granularity can be excessive for certain types of reporting or spreadsheet analysis. csv2days simplifies these files by collapsing all columns belonging to the same date into a single column.

Design Decisions

Aggregation via Maximum Value

When multiple columns from the same day are merged, the tool must decide how to represent the data for that day. csv2days implements a Max strategy. For any set of columns being collapsed, the tool calculates the maximum numerical value across those columns for each row.

This decision is rooted in the common use case of monitoring performance metrics where the “peak” value for a day is often more significant than an average or a sum, particularly when dealing with sparse data where different columns represent different runs of the same task. If a value cannot be parsed as a float, the tool defaults to the first available string in that “run” of columns.

Header-Driven Transformation

The transformation logic is strictly driven by the headers of the CSV. The tool assumes that the CSV contains a horizontal timeline where headers follow the RFC3339 format.

  1. Identification: It uses regular expressions to identify date-time strings in the header.
  2. Date Truncation: It strips the time and timezone information, keeping only the YYYY-MM-DD portion.
  3. Run Identification: It identifies “runs” of columns—consecutive headers that resolve to the same date.

Key Components

main.go

The core logic resides in transformCSV. The process follows a specific pipeline:

  1. Header Analysis: It scans the first row to determine which columns are duplicates (same day). It records the “run lengths” (how many columns belong to one day) and the indices that need to be removed to reach a unique set of dates.
  2. Data Processing: For every subsequent row:
    • Max Application: It looks at the indices identified as a “run” and calculates the maximum value within that range.
    • Column Reduction: It removes the redundant columns (the indices that were merged into the first column of the run).
  3. Streaming Output: To maintain efficiency, the tool processes the file row-by-row and streams the output to stdout.
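
A condensed Python sketch of this pipeline follows. Note two simplifications versus the real Go tool: it groups all columns sharing a date (rather than strictly consecutive runs), and it loads rows into memory instead of streaming:

```python
import re
from collections import OrderedDict

DATE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})T")  # RFC3339 header prefix

def collapse_days(header, rows):
    """Collapse same-day columns by taking the max per row; non-numeric
    cells fall back to the first value in the run."""
    runs = OrderedDict()  # truncated date (or original label) -> column indices
    for i, h in enumerate(header):
        m = DATE_RE.match(h)
        runs.setdefault(m.group(1) if m else h, []).append(i)
    new_rows = []
    for row in rows:
        out = []
        for idxs in runs.values():
            vals = [row[i] for i in idxs]
            nums = []
            for v in vals:
                try:
                    nums.append(float(v))
                except ValueError:
                    pass
            out.append(max(nums) if nums else vals[0])
        new_rows.append(out)
    return list(runs), new_rows
```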

Workflow Diagram

The following diagram illustrates how multiple timestamped columns are collapsed into a single date column:

INPUT CSV:
[Header]  | Key | 2023-01-01T08:00Z | 2023-01-01T12:00Z | 2023-01-02T09:00Z |
[Row 1]   |  A  |        10         |        20         |        15         |

PROCESS:
1. Identify 2023-01-01 columns as a "Run"
2. Calculate Max(10, 20) for Row 1 -> 20
3. Truncate headers to YYYY-MM-DD
4. Remove redundant indices

OUTPUT CSV:
[Header]  | Key | 2023-01-01 | 2023-01-02 |
[Row 1]   |  A  |     20     |     15     |

Usage

The tool requires an input file specified by the --in flag and outputs the transformed CSV directly to standard output:

csv2days --in=perf_export.csv > daily_summary.csv

Module: /demo

Perf Demo Data Module

The /demo module provides a self-contained environment for generating and storing synthetic performance data. Its primary purpose is to demonstrate the capabilities of the Perf system—such as anomaly detection, regression tracking, and trend visualization—without requiring a live production environment.

The module is designed to work in tandem with the perf-demo-repo, mapping performance metrics to specific Git commits within that repository.

Design Philosophy: Deterministic Anomaly Simulation

Rather than providing static files that might become stale, the module includes a data generator (generate_data.go) that programmatically creates JSON files following the format.Format schema.

The generation logic is intentionally designed to simulate real-world performance scenarios:

  • Regression Triggering: The generator injects artificial spikes in metrics (e.g., adding a significant offset to the “encode” time at a specific commit index) to ensure that alerting and anomaly detection algorithms in Perf have visible “problems” to identify.
  • State Shifts: By using multipliers that change after specific commit indices, the generator simulates shifts in performance baselines, allowing users to test how the system handles intentional performance improvements or degradations.
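
The spike-injection idea can be sketched like this; the offsets, spike location, and metric name are hypothetical stand-ins for values hardcoded in generate_data.go:

```python
import random

SPIKE_COMMIT = 5   # hypothetical: where the artificial regression lands
SPIKE_MS = 5.0     # hypothetical offset injected into the "encode" metric

def encode_time_ms(commit_index: int, base_ms: float = 10.0) -> float:
    """Synthetic 'encode' time: deterministic noise per commit, plus a
    step change so the regression detector has something to find."""
    rng = random.Random(commit_index)   # same commit -> same value
    value = base_ms + rng.uniform(-0.5, 0.5)
    if commit_index >= SPIKE_COMMIT:
        value += SPIKE_MS               # the injected regression
    return value
```

Seeding the RNG with the commit index keeps the demo data reproducible across regenerations, so graphs and detected anomalies stay stable.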

Key Components

1. Data Generator (generate_data.go) This is a Go binary responsible for creating the data/ directory. It iterates through a hardcoded list of Git hashes from the demo repository and generates a JSON payload for each.

  • Why Go? Using the same language as the core Perf ingester allows the generator to import perf/go/ingest/format directly, ensuring the generated data is always compatible with the system's ingestion requirements.
  • Schema Implementation: It populates multi-dimensional keys (e.g., bot, benchmark, units) to demonstrate how Perf can pivot and filter data across different environmental facets.

2. Storage (/demo/data) This directory acts as a mock “data lake.” It contains the JSON output of the generator.

  • File-Based Ingestion: The files are structured to be consumed by a Perf ingester of type dir. This replicates a simple filesystem-based ingestion workflow where a watcher monitors a directory for new performance results.
  • Traceability: Each file is named sequentially (demo_data_commit_N.json) to provide a clear chronological lineage for performance trends.

Data Workflow and Integration

The workflow follows a path from source code state to visual representation:

[ Git Commits ]             [ generate_data.go ]           [ /demo/data/*.json ]
(perf-demo-repo)  ------>  (Deterministic Logic) ------>  (Structured JSONs)
                                                                 |
                                                                 | (Ingested by Perf)
                                                                 v
                                                      [ Perf UI / Alerting ]
                                                      - Detect "encode" spike
                                                      - Graph "ms" vs "kb"

Key Implementation Choices

  • Multi-Metric Reporting: Each generated file contains multiple result groups (e.g., one for time in ms and one for memory in kb). This illustrates how a single ingestion event can update multiple disparate metrics (CPU vs. RAM) simultaneously.
  • Decoupled Metadata: Environment details like the architecture (x86) and the branch (master) are stored in a top-level Key map. This allows the Perf system to index these files efficiently and enables users to compare performance across different hardware configurations or branches.
  • Measurement Hierarchy: The use of SingleMeasurement objects (mapping test categories to specific operations like encode or decode) provides a granular view, allowing the system to track specific sub-routines within a larger benchmark.

Module: /demo/data

Benchmark Data Module

The /demo/data directory serves as the primary storage for performance benchmark results within the project. It contains a collection of JSON files, each representing a “point-in-time” snapshot of performance metrics associated with specific Git commits. This structured data allows for regression tracking, performance analysis over time, and cross-platform comparisons.

Data Architecture and Schema

The module follows a standardized JSON schema (Version 1) designed to decouple the environmental metadata from the actual performance measurements. This structure ensures that as new benchmarks or hardware bots are added, the reporting format remains consistent.

1. Metadata and Context Each file identifies the specific build and environment that produced the results:

  • git_hash: The unique identifier for the source code state.
  • key: A set of environmental descriptors including the benchmark ID, the hardware architecture/platform (bot), and the project branch (master).

2. Result Grouping Performance data is grouped within the results array. This grouping strategy is chosen to allow a single commit to report multiple categories of metrics (e.g., time-based vs. size-based) in a single transaction. Each result group is defined by its own key (typically defining the units).

3. Measurement Hierarchy Inside each result group, the measurements object maps specific test categories to an array of values.

  • test: The common category for operational metrics.
  • value: The specific operation performed (e.g., “encode”, “decode”).
  • measurement: The raw numerical data point.

4. Extensibility via Links The schema supports optional links objects at both the global and measurement levels. This design allows for traceability, enabling tools to link a specific performance outlier directly to external logs, search queries, or profiling reports.
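Putting the four parts together, a file in this directory looks roughly like the following. This is a hand-written illustration consistent with the fields described above; the hash, key names, and numbers are placeholders, not copied from the repository:

```json
{
  "version": 1,
  "git_hash": "1234567890123456789012345678901234567890",
  "key": {
    "bot": "demo-bot",
    "benchmark": "demo-benchmark",
    "arch": "x86",
    "branch": "master"
  },
  "results": [
    {
      "key": { "units": "ms" },
      "measurements": {
        "test": [
          { "value": "encode", "measurement": 10.2 },
          { "value": "decode", "measurement": 8.1 }
        ]
      }
    },
    {
      "key": { "units": "kb" },
      "measurements": {
        "test": [
          { "value": "encode", "measurement": 1024 }
        ]
      }
    }
  ]
}
```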

Data Workflow

The data is intended to be consumed by a visualization or monitoring system. The relationship between the files reflects a chronological progression of the codebase:

[ Commit Hash A ] -> [ Commit Hash B ] -> [ Commit Hash C ]
       |                    |                    |
       v                    v                    v
+--------------+     +--------------+     +--------------+
| JSON Data 1  |     | JSON Data 2  |     | JSON Data 3  |
|  (encode: X) |     |  (encode: Y) |     |  (encode: Z) |
+--------------+     +--------------+     +--------------+
       |                    |                    |
       +----------+---------+----------+---------+
                  |
                  v
       [ Performance Trend Graph ]
       (Detection of Regressions)

Design Choices

  • Flat File Storage: By storing results as individual JSON files named by commit sequence, the system leverages the filesystem/version control for history rather than requiring a separate database for basic storage.
  • Key-Value Pairs for Units: Units are stored within a key object rather than a hardcoded field. This allows the reporting logic to be agnostic of what is being measured (e.g., milliseconds, kilobytes, or operations per second).
  • Measurement Objects: Using an array of objects for measurements (containing value and measurement) rather than a simple map allows for the future inclusion of per-measurement metadata, such as the links found in demo_data_commit_4.json.

Module: /docs

The /docs module serves as the central knowledge repository and architectural blueprint for the Skia Performance Dashboard (Perf). It acts as the “source of truth” for the system’s design, data protocols, and operational procedures. Beyond simple user guides, this module defines the rigid contracts required for cross-project data ingestion and the multi-service architecture that enables performance regression detection at scale.

Design Rationale: Documentation as Code

The structure of the /docs module reflects a design philosophy where documentation is treated with the same rigor as source code:

  • Contract-First Integration: Files like FORMAT.md (defining the nanobench JSON structure) exist to decouple data producers from the dashboard. Because Perf ingests data from diverse ecosystems (Fuchsia, Chrome, Android), a strictly versioned and documented format allows the ingestion pipeline to remain generic and stable while producers evolve independently.
  • Centralization of Tribal Knowledge: The module consolidates technical details that span Go backend services, SQL schema definitions, and LitElement frontend components into comprehensive references like ai_generated_doc.md. This reduces the onboarding barrier for a highly fragmented microservice architecture.
  • Traceability via Version Control: By maintaining documentation in the same repository as the implementation logic, architectural decisions and API changes are tracked through the same peer-review and history mechanisms as the code itself. This prevents the “documentation rot” common in external wikis.

Key Components and Responsibilities

Technical Reference and Aggregation (ai_generated_doc.md)

This component acts as the primary technical manual for the entire project. It documents the “why” behind the most significant architectural decisions, such as:

  • The Multi-Mode Server: Explaining why perfserver is a single binary that handles ingestion, clustering, and frontend serving through different flags to simplify containerized deployments.
  • Clustering Logic: Detailing the use of k-means clustering and step-function fitting to identify “interesting” regressions amidst noisy performance data.
  • Persistence Strategy: Describing the transition from CockroachDB to Google Cloud Spanner to achieve global consistency and horizontal scalability for time-series metrics.

Data Schema and Ingestion Contracts (FORMAT.md)

This file is the definitive specification for how performance data must be structured before it reaches the dashboard. It defines the hierarchical relationship between:

  • Version and Metadata: Identifying the schema version and the Git commit hash.
  • Keys: Defining the parameters (architecture, OS, configuration) that uniquely identify a performance trace.
  • Results: Structuring the measurements (timing, memory usage) and statistical aggregates (min, max, median).

API Specifications (API.md)

Defines the programmatic interfaces for external interactions, primarily focusing on alert management. This allows automated tools to create, list, or update alerts without human intervention, facilitating a “monitoring-as-code” workflow.

Key Data Ingestion Workflow

The documentation defines how data moves through the system, ensuring that every component adheres to the documented state transitions:

PRODUCERS (Fuchsia, Chrome, CI)
      |
      | 1. Format raw data into 'nanobench' JSON (per FORMAT.md)
      v
STORAGE (Google Cloud Storage)
      |
      | 2. Organized by YYYY/MM/DD/HH structure
      v
INGESTION SERVICE (perfserver ingest)
      |
      | 3. Validate against formatSchema.json (in /go/ingest/format)
      | 4. Resolve Git Hash to Commit Number (via /go/git)
      v
DATABASE (Google Cloud Spanner)
      |
      | 5. Store TraceValues and inverted ParamSets (per /go/sql/spanner)
      v
ANALYSIS ENGINE (perfserver cluster)
      |
      | 6. Group similar traces using k-means (per /go/clustering2)
      | 7. Fit Step Functions to detect regressions (per /go/stepfit)
      v
USER INTERFACE (perf.skia.org)
      |
      | 8. Visualize via explore-simple-sk and handle triage

Strategic Module Interaction

The /docs module provides the necessary context to understand how the various sub-directories function as a unified whole:

  • Config Management: Complements the /configs module by explaining how JSON configuration files translate to specific Perf instance identities.
  • Go Backend: Provides the high-level logic for the services found in /go, particularly the complex relationship between the tracestore and regression packages.
  • Frontend Modularization: Explains the component-based architecture in /modules, where UI elements like chart-tooltip-sk and anomalies-table-sk are orchestrated to provide a cohesive exploration experience.

Module: /go

Skia Perf Module (/go)

The /go directory contains the core backend services, data processing pipelines, and administrative tools for Skia Perf, a large-scale performance monitoring and regression detection platform.

High-Level Overview

Skia Perf is designed to ingest high-frequency telemetry data from diverse sources (Chrome, Android, Fuchsia, Skia), organize it into searchable time-series “traces,” and automatically identify performance regressions.

The architecture follows a specialized “tiled” storage model to handle millions of data points across years of history. It decouples the heavy analytical tasks—like k-means clustering and step-fit detection—from the user-facing web interface, utilizing an asynchronous “Start-Status-Result” pattern to maintain responsiveness.

Design Philosophy and Implementation Choices

Tiled Storage and Commit-Linearity

A fundamental design choice in Perf is the translation of non-linear Git history into a linear, integer-based coordinate system (CommitNumber).

  • Why: Mapping commits to a dense integer range allows for extremely fast range queries and predictable, index-friendly lookups in the database.
  • How: Data is partitioned into “Tiles” (typically 256 commits each). Components like tracestore and dfbuilder operate on these tiles to fetch only the temporal slices necessary for a given request, preventing memory exhaustion.
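The commit-to-tile mapping is simple integer arithmetic. The sketch below assumes a fixed tile size of 256 commits; the type and function names are illustrative, not the actual tracestore API:

```go
package main

import "fmt"

const tileSize = 256 // commits per tile (a typical value per the docs)

type CommitNumber int32
type TileNumber int32

// tileFor returns the tile containing a given commit.
func tileFor(c CommitNumber) TileNumber {
	return TileNumber(int32(c) / tileSize)
}

// tileRange returns the half-open commit range [begin, end) covered by a
// tile, which is what lets a query load only the tiles it overlaps.
func tileRange(t TileNumber) (CommitNumber, CommitNumber) {
	return CommitNumber(int32(t) * tileSize), CommitNumber((int32(t) + 1) * tileSize)
}

func main() {
	fmt.Println(tileFor(0), tileFor(255), tileFor(256)) // 0 0 1
	b, e := tileRange(2)
	fmt.Println(b, e) // 512 768
}
```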

Configuration as Code

The system is highly multi-tenant. A single binary supports vastly different projects (e.g., v8 vs android) by interpreting an InstanceConfig.

  • Validation: The config and validate modules perform semantic checks (e.g., dry-running Go templates and pre-compiling regular expressions) at startup to ensure the instance is logically sound before it handles data.
  • Sheriffing: Alerting rules are managed as version-controlled Proto files (sheriffconfig) and synchronized into the operational database, allowing teams to manage monitoring thresholds via standard code reviews.

Asynchronous Orchestration

Heavy operations like regression detection and bisection are managed via Temporal workflows (workflows).

  • Reliability: This ensures that if a network call to an issue tracker or a bisection engine (Pinpoint) fails, the state is preserved and the task can be retried without losing progress.
  • Polling Pattern: The progress and dfiter modules facilitate a pattern where the frontend triggers a task and polls for updates, allowing the backend to handle long-running computations outside the HTTP request lifecycle.

Key Submodules and Responsibilities

The project is organized into functional layers:

1. Data Ingestion & Storage

  • ingest & process: The continuous pipeline that monitors sources (GCS/PubSub), parses incoming files (parser), and populates the database.
  • tracestore & sqltracestore: The low-level persistence layer. It separates numeric values from metadata (traceparamstore) to optimize query performance.
  • perfgit: Manages the mapping between Git hashes and the internal CommitNumber timeline.

2. Analysis & Detection

  • regression: The central engine that coordinates regression detection across the commit history.
  • clustering2 & kmeans: Implements shape-based grouping to find similar performance shifts across disparate tests.
  • stepfit: Provides the mathematical logic for identifying “steps” (sudden jumps or drops) in individual traces.
  • samplestats: Conducts statistical tests (Mann-Whitney U, Welch's T-test) to compare “before” and “after” samples.

3. User Interface & API

  • frontend: The web server orchestrator. It manages authentication, serves the UI, and coordinates between various backend stores.
  • dataframe & dfbuilder: Constructs the matrix-like data structures used by the UI to render graphs and tables.
  • ui/frame: The “brain” of the exploration page, handling complex query resolution and formula calculations (calc).

4. Alerting & Communication

  • notify: A modular delivery system that formats regressions into human-readable messages (HTML/Markdown) and dispatches them to Email or Issue Trackers.
  • anomalygroup: Aggregates individual regressions into logical “groups” to prevent alert fatigue and streamline bisection.
  • issuetracker: A high-level client for the Google Issue Tracker, automating bug filing and status updates.

5. Management & Tooling

  • perf-tool: A Swiss-army-knife CLI for administrators to perform re-ingestion, backups, and database migrations.
  • maintenance: A dedicated service role for background tasks like schema migration, cache warming (psrefresh), and data retention.
  • ts: Automates the generation of TypeScript definitions from Go structs to ensure frontend/backend type safety.

Core Workflows

Data Lifecycle: Ingestion to Visualization

This workflow illustrates how a single performance measurement moves from a test bot to a user's screen.

[ Test Bot ] --(JSON)--> [ GCS Bucket ]
                               |
                   (PubSub Notification)
                               |
                               v
[ Ingest Worker ] ----> [ Parser / Filter ]
       |                       |
       | (CommitNumber) <--- [ perfgit ]
       v                       |
[ TraceStore ] <---------------+
       |
[ dfbuilder ] <--- [ Frontend Query ]
       |
       v
[ DataFrame ] ----> [ Web UI (Graph) ]

Detection Workflow: Ingestion to Notification

This workflow shows how new data triggers automated analysis and alerting.

[ Ingest Event ]
       |
       v
[ continuous/Detector ] ----> [ alert/ConfigProvider ]
       |                             |
       | (Fetch Alert Configs) <-----'
       |
       v
[ regression/Detector ] ----> [ clustering2 / stepfit ]
       |                             |
       | (Anomalies Found) <---------'
       v
[ anomalygroup ] ------------> [ notify ]
       |                             |
       | (Merge into Group)          |-- [ Email ]
       |                             |-- [ IssueTracker ]
       v                             `-- [ Pinpoint (Bisection) ]
[ Temporal Workflow ]

Related Modules

  • /proto: Defines the gRPC and storage contracts used for cross-service communication.
  • infra/go/sql: Provides underlying SQL pooling and timeout management.
  • infra/go/pubsub: Manages the event-driven triggers for the ingestion pipeline.

Module: /go/alertfilter

High-Level Overview

The alertfilter module provides a centralized set of constants used to define the scope of alert visibility within the Skia Perf application. It acts as a shared vocabulary between the backend logic that queries alert configurations and the frontend components that allow users to toggle between different views of those alerts.

Design and Implementation

The primary design goal of this module is to eliminate “magic strings” and ensure consistency across the Perf codebase when filtering alerts. Alerts in Perf can be numerous, often belonging to different teams or individual developers. To make the system manageable, the UI provides mechanisms to filter these alerts based on ownership.

By defining these modes as constants, the system ensures that:

  1. Backend Queries use standardized keys when filtering alert configurations from the database.
  2. API Requests from the frontend remain consistent, avoiding bugs caused by case-sensitivity or typos in string literals.
  3. Future Expansion of filtering logic (e.g., filtering by “TEAM” or “SUBSYSTEM”) has a clear, singular location for definition.

Key Components and Responsibilities

The module currently defines two primary filtering modes:

  • ALL: Represents a global view. This mode is used when a user or a service needs to inspect every active alert configuration within the instance, regardless of who created them or who is listed as the owner.
  • OWNER: Represents a personalized view. This mode restricts the alert list to those specifically associated with the authenticated user. This is the primary mechanism for reducing noise in the dashboard, allowing developers to focus on the performance regressions they are directly responsible for.
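A minimal sketch of how these constants might be applied downstream. The Alert shape and filterAlerts helper are hypothetical; the real alertfilter package defines only the constants:

```go
package main

import "fmt"

// The two filtering modes, mirroring the constants described above.
const (
	ALL   = "ALL"
	OWNER = "OWNER"
)

// Alert is a hypothetical, trimmed-down alert configuration.
type Alert struct {
	DisplayName string
	Owner       string
}

// filterAlerts narrows the list according to the selected mode.
func filterAlerts(alerts []Alert, mode, user string) []Alert {
	if mode == ALL {
		return alerts
	}
	var out []Alert
	for _, a := range alerts {
		if a.Owner == user {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	alerts := []Alert{
		{"mem-regression", "alice@example.com"},
		{"cpu-regression", "bob@example.com"},
	}
	fmt.Println(len(filterAlerts(alerts, OWNER, "alice@example.com"))) // 1
}
```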

Workflow

When a user interacts with the Perf alert dashboard, the filtering logic typically follows this flow:

User Interface          Backend Handler          Database/Store
+--------------+        +-----------------+      +------------------+
| Select View  |        | Validate Filter |      | Query Alerts     |
| (ALL/OWNER)  |------> | using constants |----> | with Filter Type |
+--------------+        +-----------------+      +------------------+
                               |
                               v
                        [Result Set Filtered]

This simple constant-based approach ensures that the “intent” of the user's filter is preserved and correctly interpreted as it passes through the various layers of the Perf service.

Module: /go/alerts

Alerts Module

The alerts module provides the core data structures and logic for managing performance regression detection configurations in Perf. It defines how an alert is structured, how to derive specific trace queries from generalized alert configurations, and provides a caching layer to ensure high-performance access to these configurations during the anomaly detection process.

Design Philosophy

The module is designed around the concept of a “Dynamic Alert Configuration.” Rather than requiring a separate alert for every single hardware/software combination, the system allows for generalized queries that can be expanded into many specific sub-queries using a “Group By” mechanism. This reduces configuration toil while maintaining granular detection.

Key Implementation Choices:

  • Expansion via Cartesian Product: The module uses the paramset of the data to expand a single Alert config into multiple specific queries. This allows an admin to say “alert on all models,” and the system will automatically generate a query for “model=nexus4”, “model=nexus6”, etc.
  • Soft Deletion: Alerts are rarely hard-deleted. They transition through ConfigState (ACTIVE to DELETED) to maintain historical context for previously detected anomalies.
  • Serialized IDs: To bridge the gap between backend int64 database IDs and frontend JSON/JavaScript requirements, the module uses a custom SerializesToString type. This ensures that large integer IDs do not lose precision in the browser and that uninitialized IDs (like 0 for issue tracker components) are handled gracefully as empty strings.

Key Components

Alert Configuration (config.go)

The Alert struct is the central entity. It contains:

  • Query Logic: The Query string (URL-encoded params) and GroupBy fields.
  • Detection Parameters: Algo (e.g., K-Means), Step detection settings, Radius (the window of commits to analyze), and Interesting (the threshold for regression).
  • Action Metadata: Where to send the alert (Alert email, IssueTrackerComponent) and what Action to take (report, bisect, or none).

Config Provider (configprovider.go)

Since anomaly detection is a frequent background process, querying the database for every check would be inefficient. The ConfigProvider implements a thread-safe, in-memory cache of all alert configurations.

  • Automatic Refresh: It runs a background “refresher” goroutine that periodically polls the underlying Store to update the local cache.
  • State Filtering: It maintains separate internal maps for “active” alerts and “all” alerts (including deleted ones), allowing callers to quickly retrieve the appropriate set without manual filtering.

Alert Store Interface (store.go)

This file defines the Store interface, which abstracts the persistence layer. It supports standard CRUD operations and specialized batch operations like ReplaceAll (used for synchronizing alerts with external subscription files). Implementations of this interface (such as sqlalertstore) handle the mapping between the Go structs and the database schema.

Key Workflows

Query Expansion Process

When the detection engine processes an Alert, it doesn't just run the raw Query. It expands it based on the GroupBy field:

1. Alert Config:
   Query: "benchmark=blink_perf"
   GroupBy: "browser, machine"

2. ParamSet (Available Data):
   browser: [chrome, firefox]
   machine: [m1, m2]

3. Expansion (QueriesFromParamset):
   -> "benchmark=blink_perf&browser=chrome&machine=m1"
   -> "benchmark=blink_perf&browser=chrome&machine=m2"
   -> "benchmark=blink_perf&browser=firefox&machine=m1"
   -> "benchmark=blink_perf&browser=firefox&machine=m2"
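The expansion above is a Cartesian product over the grouped keys' values in the ParamSet. A minimal sketch (the function name is illustrative, not the package's exact signature):

```go
package main

import "fmt"

// queriesFromParamset appends every combination of the GroupBy keys'
// values (taken from the ParamSet) to the base query string.
func queriesFromParamset(base string, groupBy []string, paramset map[string][]string) []string {
	queries := []string{base}
	for _, key := range groupBy {
		var next []string
		for _, q := range queries {
			for _, v := range paramset[key] {
				next = append(next, fmt.Sprintf("%s&%s=%s", q, key, v))
			}
		}
		queries = next
	}
	return queries
}

func main() {
	ps := map[string][]string{
		"browser": {"chrome", "firefox"},
		"machine": {"m1", "m2"},
	}
	for _, q := range queriesFromParamset("benchmark=blink_perf", []string{"browser", "machine"}, ps) {
		fmt.Println(q)
	}
}
```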

Configuration Caching and Retrieval

The ConfigProvider ensures that the detection engine always has a low-latency view of the configurations:

  Detection Engine          ConfigProvider                Alert Store (DB)
         |                        |                              |
         | GetAllAlertConfigs()   |                              |
         |----------------------->| (Check Cache)                |
         |      [Alert List]      |                              |
         |<-----------------------|                              |
         |                        |                              |
         |                        | <--- Periodically Refresh ---|
         |                        |      List(includeDeleted)    |
         |                        |----------------------------->|
         |                        |       [Fresh Alerts]         |
         |                        |<-----------------------------|
         |                        | (Update Internal Maps)       |

Related Modules

  • sqlalertstore: The primary SQL implementation of the Store interface.
  • mock: Mock implementations of Store and ConfigProvider for unit testing.
  • perf/go/types: Provides shared enums and types like StepDetection and RegressionDetectionGrouping.

Module: /go/alerts/mock

Alerts Mocks

The go/alerts/mock module provides autogenerated mock implementations of the interfaces defined in the go/alerts package. These mocks are built using testify, allowing developers to simulate the behavior of alert storage and configuration retrieval in unit tests without requiring a live database or complex setup.

Purpose and Design

The primary goal of this module is to decouple testing of higher-level components (like the anomaly detection engine or the UI handlers) from the underlying persistence layer. By providing programmable behaviors for alert configurations, tests can verify how the system reacts to specific alert states, missing configurations, or database errors.

The mocks are generated via mockery and adhere to the standard testify/mock pattern. Each mock struct includes a New[InterfaceName] constructor that automatically registers a cleanup function with the test runner (t.Cleanup), ensuring that expectations are asserted when the test finishes.

Key Components

ConfigProvider.go

The ConfigProvider mock simulates an object responsible for providing read access to alert configurations. This is typically used by components that need to query alert settings frequently, possibly with caching logic in the real implementation.

  • Capabilities: It mocks methods like GetAlertConfig and GetAllAlertConfigs.
  • Use Case: Testing the anomaly detection loop where the system needs to fetch current alert parameters to determine if a performance regression has occurred.

Store.go

The Store mock simulates the persistent storage layer (usually backed by PostgreSQL). It encompasses the full CRUD lifecycle of an alert configuration.

  • Capabilities: It mocks write operations (Save, Delete, ReplaceAll) and complex read operations (List, ListForSubscription).
  • Transactional Testing: The ReplaceAll method accepts a pgx.Tx parameter. In the mock, this allows verifying that bulk updates are intended to be part of a transaction, even if no actual transaction is executed during the test.

Typical Test Workflow

The mocks are utilized by setting expectations on specific method calls and defining what they should return.

  [ Test Case ]
        |
        | 1. Create Mock: m := mock.NewStore(t)
        |
        | 2. Set Expectation: m.On("Save", ...).Return(nil)
        |
        | 3. Inject Mock into Component under test
        |
        | 4. Execute Logic
        |
        V
  [ Assertions ] <--- (Automatic cleanup checks if "Save" was called)

By using these mocks, you can simulate failure modes that are difficult to trigger with a real database, such as specific pgx errors or race conditions where an alert is deleted between two different read operations.
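The workflow above can be illustrated with a hand-rolled stand-in that mimics what a mockery-generated mock does (canned return values plus call counting). This keeps the example self-contained; the real mocks use testify's mock machinery and `NewStore(t)`-style constructors:

```go
package main

import (
	"errors"
	"fmt"
)

// Alert is a hypothetical, trimmed-down alert configuration.
type Alert struct{ Name string }

// Store is a trimmed stand-in for the alerts.Store interface.
type Store interface {
	Save(a Alert) error
}

// mockStore plays the role of a generated mock: it returns a programmed
// value and records how often Save was invoked.
type mockStore struct {
	saveErr   error
	saveCalls int
}

func (m *mockStore) Save(a Alert) error {
	m.saveCalls++
	return m.saveErr
}

// saveAlert is the hypothetical component under test.
func saveAlert(s Store, a Alert) error {
	if a.Name == "" {
		return errors.New("alert needs a name")
	}
	return s.Save(a)
}

func main() {
	// 1-3. Create the mock, program its behavior, inject it.
	m := &mockStore{saveErr: nil}
	// 4. Execute the logic under test.
	err := saveAlert(m, Alert{Name: "mem-regression"})
	// 5. Assert: Save was called exactly once and returned the canned value.
	fmt.Println(err == nil, m.saveCalls) // true 1
}
```

Swapping `saveErr` for a non-nil error lets a test exercise the database-failure path without any database at all.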

Module: /go/alerts/sqlalertstore

sqlalertstore

The sqlalertstore module provides a SQL-backed implementation of the alerts.Store interface used in Perf. It manages the persistence, retrieval, and lifecycle of alert configurations, which define how the system detects anomalies in performance data.

Design Philosophy: Hybrid Storage

To balance the need for high-performance querying with the flexibility required for evolving alert configurations, this module employs a hybrid storage strategy:

  • Serialized State (JSON): The complete definition of an alert—including complex query parameters, filtering rules, and metadata—is stored as a JSON blob. This “Document Store” approach allows the alert structure to evolve without requiring frequent database schema migrations.
  • Relational Columns: Critical operational fields (like config_state and sub_name) are “promoted” from the JSON blob to top-level SQL columns. This enables the database to perform efficient indexing and filtering, which is essential for performance-sensitive tasks such as dashboard rendering and subscription-based alert processing.

Key Components

SQLAlertStore

The primary struct SQLAlertStore implements the alerts.Store interface. It wraps a database connection pool (pool.Pool) and manages a pre-defined map of SQL statements.

Schema and Data Mapping

The underlying table structure uses specific columns to optimize common workflows:

  • ID: The primary key. The store handles both inserting new alerts (where an ID is generated) and updating existing ones.
  • ConfigState: Represents the operational status (e.g., ACTIVE or DELETED). The store implements “soft deletes” by updating this column rather than removing rows, ensuring historical data remains intact while allowing the application to filter for active alerts quickly.
  • Subscription Linking: Columns sub_name and sub_revision link alerts to specific subscriptions. An index on sub_name ensures that ListForSubscription operations are highly performant.
  • LastModified: A Unix timestamp updated on every change. This facilitates cache invalidation and ensures downstream anomaly detection engines use the most recent configuration.

Key Workflows

Saving and Updating Alerts

When an alert is saved, the store determines if it is a new entry or an update based on the presence of a valid ID. It serializes the entire configuration into JSON and extracts the relational fields for the SQL columns.

  Application                    SQLAlertStore                    Database
       |                               |                             |
       | Save(SaveRequest)             |                             |
       |------------------------------>|                             |
       |                               | Serializes Cfg to JSON      |
       |                               | Identifies ID status        |
       |                               |                             |
       |                               | INSERT/UPDATE ...           |
       |                               |---------------------------->|
       |                               |                             |
       |               Success/Error   | <---------------------------|
       | <-----------------------------|                             |

Batch Replacement (ReplaceAll)

This workflow is used when a set of alerts needs to be synchronized with an external source (like a subscription configuration). It operates within a single transaction:

  1. Marks all currently ACTIVE alerts as DELETED.
  2. Inserts the new set of alert configurations.

Because both steps run in the same transaction, the transition from the old state to the new state is atomic.
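The two steps amount to something like the following statements inside one transaction. This is a sketch, not the module's exact SQL; the table name, columns, and literal values are illustrative:

```sql
BEGIN;
-- Step 1: soft-delete everything currently ACTIVE (config_state 0 -> 1).
UPDATE Alerts
   SET config_state = 1, last_modified = 1700000000
 WHERE config_state = 0;
-- Step 2: insert the replacement set, one row per alert.
INSERT INTO Alerts (alert, config_state, last_modified, sub_name, sub_revision)
VALUES ('{"query":"..."}', 0, 1700000000, 'sub-a', 'rev-1');
COMMIT;
```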

Alert Retrieval and Sorting

The List and ListForSubscription methods retrieve alerts and deserialize the JSON blobs back into Go structs. Because database results are not guaranteed to be ordered by application-level logic, the module explicitly sorts the resulting slice by DisplayName (and then by ID as a tie-breaker) before returning it to the caller.

Implementation Details

  • Soft Deletion: The Delete method performs an UPDATE statement setting config_state to 1 (DELETED) and updating the last_modified timestamp.
  • Serialization: The module uses standard JSON encoding to store the alerts.Alert struct. During retrieval, the SQL id is injected back into the struct after unmarshaling to ensure the application remains synchronized with the database's primary key.
  • Concurrency: By using last_modified and standard SQL transactions (in ReplaceAll), the store maintains consistency even when multiple processes attempt to update alert configurations simultaneously.

Module: /go/alerts/sqlalertstore/schema

The sqlalertstore/schema module defines the structural contract for persisting Perf alerts within a SQL database. It serves as the single source of truth for the database schema, ensuring that the Go representation of an alert maps correctly to the relational storage layer used by the sqlalertstore.

Design Philosophy: Hybrid Storage

The schema employs a hybrid storage strategy, balancing relational querying capabilities with the flexibility of document storage:

  • Serialized State (The “What”): The bulk of an alert's configuration—including its complex filtering rules, parameters, and metadata—is stored as a serialized JSON blob in the Alert column. This allows the alert definition to evolve (adding or removing fields) without requiring frequent and expensive SQL migrations.
  • Relational Columns (The “How”): Key fields are “promoted” to top-level SQL columns to facilitate efficient indexing, filtering, and sorting. This is critical for performance-sensitive operations, such as displaying alert dashboards or processing state-specific tasks.

Key Components and Data Mapping

State Management

The ConfigState column extracts the operational state of an alert (e.g., active, deleted) from the JSON blob. By storing this as an integer, the system can perform rapid lookups of all “active” alerts across thousands of entries without parsing JSON strings in the database engine.

Subscription Linking

Alerts in Perf are often tied to specific subscriptions. The schema explicitly tracks:

  • SubscriptionName: The identifier for the alert's origin.
  • SubscriptionRevision: A pointer to the specific version of the subscription configuration.

The module includes an explicit index (idx_alerts_subname) on the sub_name column. This design choice optimizes for the common workflow of retrieving all alerts associated with a specific subscription, which is a frequent operation in both the UI and the automated ingestion pipeline.

Concurrency and Updates

The LastModified column stores a Unix timestamp. This is primarily used for cache invalidation and ensuring that downstream consumers (like the anomaly detection engine) are operating on the most recent version of the alert definition.

Data Flow Overview

The following diagram illustrates how the schema interacts with the application and storage layers:

  Go Application Layer              SQL Database Layer
+-----------------------+         +----------------------------+
|                       |         |       Alerts Table         |
|  alerts.Alert Struct  | ----+   | [ID] (Primary Key)         |
|                       |     |   |                            |
+-----------------------+     +-->| [Alert] (JSON Blob)        |
           |                  |   |                            |
    (Extraction Logic)        +-->| [ConfigState] (Indexed)    |
           |                  |   |                            |
           +------------------+-->| [SubscriptionName] (Index) |
                                  |                            |
                                  | [LastModified]             |
                                  +----------------------------+

Future Considerations

The schema identifies two specific areas for technical debt reduction to improve performance and consistency:

  1. JSONB Transition: Moving from TEXT to JSONB for the Alert column to allow for more efficient internal database indexing of the blob content.
  2. Type Rationalization: Aligning the ConfigState representation across the Go codebase and SQL to prevent casting overhead and improve type safety.

Module: /go/anomalies

High-level Overview

The go/anomalies module defines the core abstraction for interacting with performance anomalies (regressions) within Skia Perf. It provides a standardized interface for querying anomaly data, regardless of whether that data resides in the legacy Chrome Perf system or Skia Perf's native SQL-based regression store.

The primary goal of this module is to decouple the consumption of anomaly data—used for visualization, alerting, and analysis—from the underlying storage implementation and the specific protocols required to communicate with external APIs.

Design Decisions and Implementation Choices

Unified Abstraction via Interfaces

The central component of the module is the Store interface. This design choice allows the Perf system to remain agnostic about the data source. By using a single interface, the system can switch between a direct SQL backend, a proxied Chrome Perf API, or a cached implementation without modifying the business logic of the calling components.

Cross-System Compatibility

The module leverages data structures defined in chromeperf.AnomalyMap and chromeperf.AnomalyForRevision. This maintains a consistent data contract between the frontend (which historically expected Chrome Perf formats) and various backends. This compatibility layer ensures that anomalies generated by different systems can be merged and displayed in a uniform way on performance dashboards.

Support for Diverse Query Patterns

The interface is designed to support the three primary ways users and automated systems interact with performance data:

  1. Trace-Centric: Querying anomalies for specific performance metrics (traces) across a range of commits.
  2. Time-Centric: Querying anomalies within a specific window of time, which is essential for “last 24 hours” views or investigating incidents at specific clock times.
  3. Revision-Centric: Investigating the context around a specific git revision to see if a particular change caused regressions across multiple disparate traces.

Key Components and Responsibilities

anomalies.go

This file defines the Store interface, which is the foundational contract for the entire module.

  • GetAnomalies: Retrieves anomalies based on commit positions. It allows for filtering by specific traceNames. If the traceNames slice is empty, the implementation is expected to return all anomalies within the commit range.
  • GetAnomaliesInTimeRange: Facilitates temporal lookups. Implementations (like the SQL-based one) often need to resolve these time ranges into commit ranges using a Git provider before querying the underlying database.
  • GetAnomaliesAroundRevision: Provides a way to “zoom in” on a specific point in history, returning anomalies that occurred at or near a target revision.

Submodules and Implementations

The module's functionality is extended and specialized through its submodules:

  • impl: Contains the concrete logic for data retrieval. This includes the sql_impl.go for native Skia Perf storage and chromeperf_impl.go for interacting with the legacy Google-internal Chrome Perf API.
  • cache: Implements a middleware layer that wraps another Store. It uses LRU (Least Recently Used) caches and a time-based invalidation strategy to reduce the latency of repeated queries and minimize the load on the source-of-truth databases or APIs.
  • mock: Provides autogenerated mocks for unit testing, allowing other modules to simulate various anomaly data scenarios (such as empty results or API errors) in a controlled environment.

Workflow: Interface Interaction

The following diagram illustrates how the Store interface acts as a gateway between the Perf UI/Services and the various data backends:

       +---------------------------------------+
       |   Perf UI / Regression Detection      |
       +---------------------------------------+
                           |
                           v
               +-----------------------+
               |  anomalies.Store (I)  |
               +-----------------------+
                           |
          +----------------+----------------+
          |                |                |
          v                v                v
+----------------+ +----------------+ +----------------+
|  cache.Store   | |  sql.Store     | |  chromeperf.   |
| (Middleware)   | | (Native DB)    | |  Store (API)   |
+----------------+ +----------------+ +----------------+
          |                |                |
          +------ wraps ---+                +--- calls ---> Chrome Perf API

Interactions

The module depends heavily on the perf/go/chromeperf package for its data models. This dependency reflects the module's role as a bridge between the modern Skia Perf infrastructure and the established data formats of the Chrome Performance monitoring ecosystem.

Module: /go/anomalies/cache

High-level Overview

The anomalies/cache module provides a performance-optimized caching layer for anomaly data retrieved from the Chrome Perf API. It acts as an intermediary Store that reduces the load on external API services and improves the responsiveness of Skia Perf when querying for regressions and anomalies.

It is designed to handle three primary types of lookups:

  1. Trace-based: Anomalies associated with specific trace names within a commit range.
  2. Revision-based: Anomalies occurring around a specific revision number.
  3. Time-based: Anomalies occurring within a specific time window.

Design Decisions and Implementation Choices

Layered Caching Strategy

The module utilizes two distinct LRU (Least Recently Used) caches to balance memory usage and performance:

  • testsCache: Indexed by a composite key of trace name and commit range. This handles the most frequent queries where users are looking at specific performance graphs.
  • revisionCache: Indexed by revision number, supporting workflows that investigate specific changesets.

Invalidation and Accuracy Trade-offs

A key challenge in caching anomaly data is that anomalies can be modified (e.g., marked as “invalid” or “fixed”) in the source system. To handle this, the module implements an invalidationMap.

Instead of a complex, fine-grained invalidation logic that would require deep inspection of every cache entry, the module uses a “simple and safe” approach:

  • When a trace is marked as modified via InvalidateTestsCacheForTraceName, its name is added to a map.
  • During subsequent fetches, if a trace is found in this map, the cache is bypassed even if a hit occurs, forcing a fresh fetch from Chrome Perf.
  • To prevent this map from growing indefinitely, it is completely wiped every 24 hours. While this creates a small window where an old anomaly might reappear briefly if the wipe happens immediately after a modification, it ensures O(1) operations and minimal memory overhead compared to tracking individual commit-level changes.
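The invalidation bookkeeping described above can be sketched with a mutex-guarded map. Method and field names here are illustrative, not the real cache internals; the real store triggers the wipe on a 24-hour timer rather than via an explicit call:

```go
package main

import (
	"fmt"
	"sync"
)

// invalidation tracks traces whose anomalies were modified upstream.
// A cache hit for a listed trace is ignored, forcing a fresh fetch.
type invalidation struct {
	mu    sync.Mutex
	stale map[string]bool
}

func newInvalidation() *invalidation {
	return &invalidation{stale: map[string]bool{}}
}

func (i *invalidation) Invalidate(traceName string) {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.stale[traceName] = true
}

func (i *invalidation) ShouldBypassCache(traceName string) bool {
	i.mu.Lock()
	defer i.mu.Unlock()
	return i.stale[traceName]
}

// Wipe clears the whole map, trading a brief staleness window for
// O(1) bookkeeping instead of per-commit invalidation tracking.
func (i *invalidation) Wipe() {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.stale = map[string]bool{}
}

func main() {
	inv := newInvalidation()
	inv.Invalidate(",bot=pixel_6,")
	fmt.Println(inv.ShouldBypassCache(",bot=pixel_6,")) // true: bypass cache
	inv.Wipe()
	fmt.Println(inv.ShouldBypassCache(",bot=pixel_6,")) // false: cache usable again
}
```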

Proactive Cleanup

Standard LRU behavior handles capacity, but it doesn't account for data staleness. The module implements a background goroutine that periodically checks the oldest items in the cache against a Time-to-Live (TTL) of 10 minutes. This ensures that even low-traffic data is eventually refreshed to reflect the current state of the Chrome Perf database.

Key Components and Responsibilities

cache.go

This is the primary implementation file. It defines the store struct and the logic for the anomaly store.

  • GetAnomalies: Orchestrates a hybrid fetch. It checks the LRU cache for each requested trace. Any traces missing from the cache or marked in the invalidationMap are bundled into a single batch request to the ChromePerf client. The results are then merged and the cache is updated.
  • cleanupCache: A background worker function that drains the LRU cache of items older than the cacheItemTTL. It specifically targets the “oldest” items to minimize the work performed during each tick.
  • getAnomalyCacheKey: Generates a deterministic string key: traceName:startCommit:endCommit. This ensures that different ranges for the same trace are cached independently, preventing range-mismatch bugs.
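The documented key format is simple enough to show directly; a minimal sketch of the key builder (the real function may differ in details):

```go
package main

import "fmt"

// anomalyCacheKey reproduces the documented traceName:startCommit:endCommit
// format, so different commit ranges for the same trace occupy
// distinct cache slots.
func anomalyCacheKey(traceName string, startCommit, endCommit int) string {
	return fmt.Sprintf("%s:%d:%d", traceName, startCommit, endCommit)
}

func main() {
	fmt.Println(anomalyCacheKey(",benchmark=motion_mark,", 100, 356))
	// -> ,benchmark=motion_mark,:100:356
}
```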

Workflow: Data Retrieval

The following diagram illustrates how the GetAnomalies method handles a request for multiple traces:

User Request (Traces A, B, C)
          |
          v
+---------+----------+
| Check testsCache   | <-------+
+---------+----------+         |
          |                    |
    +-----+-----+              |
    |           |              |
 [A, B] Hit   [C] Miss/Invalid |
    |           |              |
    |     +-----v--------------+-----+
    |     | Fetch [C] from ChromePerf|
    |     +------------+-------------+
    |                  |
    |           +------v------+
    |           | Update Cache|
    |           +------+------+
    |                  |
    +--------+---------+
             |
             v
      Merged Results

Interactions

The module heavily relies on the chromeperf.AnomalyApiClient interface. This decoupling allows the cache to be tested with mocks (as seen in cache_test.go) and ensures the caching logic remains independent of the underlying transport (HTTP/gRPC) used to communicate with Chrome Perf.

Module: /go/anomalies/impl

The go/anomalies/impl module provides concrete implementations of the anomalies.Store interface. Its primary purpose is to abstract the retrieval of performance anomalies (regressions) from different backends—specifically the legacy Chrome Perf API and the modern Skia Perf SQL-based regression store.

By providing a unified interface, this module allows the rest of the Perf system to query for anomalies using commit ranges, time ranges, or specific revisions without needing to know whether the data is coming from an external service or a local database.

Key Components

Chrome Perf Implementation (chromeperf_impl.go)

The store struct in this file acts as a proxy to the Chrome Perf Anomaly API. It is used in deployments where Skia Perf needs to display or synchronize with anomalies managed by the legacy Chrome Perf system.

  • Responsibility: Facilitates communication with the chromeperf.AnomalyApiClient.
  • Design Choice: It performs minimal logic, primarily sorting trace names to ensure deterministic API requests and mapping the results into the standard chromeperf.AnomalyMap format.

SQL Implementation (sql_impl.go)

The sqlAnomaliesStore provides an implementation that retrieves anomalies directly from Skia Perf's own database by wrapping a regression.Store.

  • Responsibility: Translates Skia Perf “Regressions” into the “Anomaly” format expected by the frontend and other consumers.
  • Conversion Logic: Since Skia Perf stores data as regression.Regression objects, this implementation uses compat.ConvertRegressionToAnomalies to transform them.
  • Multiplicity Tracking: A significant detail in this implementation is how it handles multiple anomalies on the same trace at the same commit. It maintains a multiplicities map during the conversion process to increment the Multiplicity field of the anomaly, ensuring each unique regression is identifiable even if they overlap in the commit/trace dimensions.
  • Dependency on Git: Unlike the Chrome Perf implementation, the SQL store requires a git.Git provider. This is because Skia Perf's regression store is indexed by commit numbers. When a user requests anomalies for a time range, the store first uses the Git provider to resolve that time range into a slice of commits, determining the start and end commit positions before querying the database.
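The multiplicity bookkeeping can be sketched as a counting map keyed by (trace, commit). This is a simplified model under assumed semantics (the first anomaly in a slot gets multiplicity 1, the second 2, and so on); the real conversion in compat.ConvertRegressionToAnomalies carries far more state:

```go
package main

import "fmt"

// Anomaly is a simplified stand-in; the real type carries many more
// fields. Multiplicity distinguishes overlapping regressions that
// land on the same trace at the same commit.
type Anomaly struct {
	TraceName    string
	CommitNumber int
	Multiplicity int
}

// assignMultiplicity sketches the bookkeeping described above: a map
// keyed by (trace, commit) counts how many anomalies share that slot.
func assignMultiplicity(anomalies []Anomaly) []Anomaly {
	type key struct {
		trace  string
		commit int
	}
	multiplicities := map[key]int{}
	out := make([]Anomaly, 0, len(anomalies))
	for _, a := range anomalies {
		k := key{a.TraceName, a.CommitNumber}
		multiplicities[k]++
		a.Multiplicity = multiplicities[k]
		out = append(out, a)
	}
	return out
}

func main() {
	got := assignMultiplicity([]Anomaly{
		{TraceName: "t1", CommitNumber: 120},
		{TraceName: "t1", CommitNumber: 120}, // overlaps the first
		{TraceName: "t2", CommitNumber: 120},
	})
	fmt.Println(got[0].Multiplicity, got[1].Multiplicity, got[2].Multiplicity)
}
```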

Workflows

Time-Range Query Resolution (SQL Store)

When querying by time, the module performs a two-step resolution to bridge the gap between wall-clock time and the commit-indexed database.

User Request (Time Range)
          |
          v
+-----------------------+      +-----------------------+
|   sqlAnomaliesStore   |      |      git.Git          |
|                       |----->|                       |
| GetAnomaliesInTime... |      | CommitSliceFromTime...|
+-----------------------+      +-----------|-----------+
          |                                |
          | <---------- Commit IDs --------+
          v
+-----------------------+      +-----------------------+
|    regression.Store   |      |       SQL Database    |
|                       |<---->|                       |
|    Range/RangeFiltered|      |  (Regressions Table)  |
+-----------------------+      +-----------------------+
          |
          v
   Convert to Anomalies --------> Result Map

Revision Windowing

For the GetAnomaliesAroundRevision method, the SQL implementation implements a “sliding window” strategy. It defines a hardcoded window (currently 500 commits) around the target revision. This provides context to the user, showing not just an anomaly at a specific point, but also nearby fluctuations that might be related to the same root cause.
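A sketch of the windowing arithmetic, assuming the window is centered on the target revision (the source states a 500-commit window but not its exact placement, so centering is an assumption here), with clamping at commit 0 for safety:

```go
package main

import "fmt"

// windowSize matches the hardcoded 500-commit budget described above.
const windowSize = 500

// revisionWindow returns the commit range inspected around a target
// revision. Centering and clamping behavior are illustrative.
func revisionWindow(revision int) (start, end int) {
	start = revision - windowSize/2
	if start < 0 {
		start = 0
	}
	end = revision + windowSize/2
	return start, end
}

func main() {
	s, e := revisionWindow(1000)
	fmt.Println(s, e) // 750 1250
	s, e = revisionWindow(100)
	fmt.Println(s, e) // 0 350 (clamped at the first commit)
}
```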

Design Decisions

  • Interface Compatibility: The module heavily utilizes the chromeperf package's data structures (like AnomalyMap and AnomalyForRevision). This design choice was made to maintain compatibility with existing UI components and tools that were originally built for the Chrome Perf ecosystem.
  • Filtering at the Source: In sql_impl.go, the code distinguishes between fetching all regressions (Range) and fetching regressions for specific traces (RangeFiltered). This allows the module to offload filtering to the database layer when possible, rather than fetching all data and filtering in-memory.
  • Error Handling: The implementations are designed to be resilient; for instance, if the Chrome Perf API fails, it logs the error but may return a partial/empty result rather than crashing the calling process, acknowledging that anomaly data is often “best-effort” in high-latency environments.

Module: /go/anomalies/mock

The go/anomalies/mock module provides a mock implementation of the anomaly storage interface used within the Perf system. Its primary purpose is to facilitate unit testing for components that depend on anomaly data without requiring a live connection to ChromePerf or a production database.

Design and Implementation

The module leverages the testify/mock framework to provide a programmable substitute for the anomaly Store. This approach allows developers to define expected behaviors, such as returning specific sets of anomalies or simulating network errors, ensuring that higher-level logic (like regression detection or UI rendering) handles various data scenarios correctly.

The mock is autogenerated based on the Store interface defined in the anomalies package. This ensures that the mock remains synchronized with the actual interface used by the system.

Key Components

Store.go

The core of this module is the Store struct. It implements the methods required to query anomaly data across different dimensions:

  • Commit-based Lookups: The GetAnomalies method allows tests to simulate retrieving anomalies within a specific range of commit positions for a set of traces.
  • Time-based Lookups: The GetAnomaliesInTimeRange method enables testing workflows that rely on temporal queries rather than commit sequences.
  • Revision Context: The GetAnomaliesAroundRevision method provides functionality to mock the retrieval of anomalies centered around a specific point in history, useful for validating “nearby” anomaly detection logic.

Usage Workflow

When writing a test for a component that consumes anomalies, the mock.Store is instantiated and injected as a dependency. The general flow for using this module is as follows:

+------------------+          +------------------------+          +------------------+
|   Unit Test      |          |      Mock Store        |          | System Under Test|
+------------------+          +------------------------+          +------------------+
|                  |          |                        |          |                  |
| 1. Setup Mock    |--------->| Register Expectations  |          |                  |
|    (On/Return)   |          | (e.g., GetAnomalies)   |          |                  |
|                  |          +------------------------+          |                  |
| 2. Inject Mock   |--------------------------------------------->|  Execute Logic   |
|                  |          +------------------------+          |                  |
|                  |          | 3. Intercept Call      | <--------|  Call Store API  |
|                  |          |    Return Mock Data    | -------->|                  |
|                  |          +------------------------+          |                  |
| 4. Verify        |          |                        |          |                  |
|    AssertExpects |--------->| Check Call History     |          |                  |
+------------------+          +------------------------+          +------------------+
  1. Initialization: Use NewStore(t) to create a new mock instance. This automatically registers cleanup functions to verify that all expected calls were made before the test finishes.
  2. Expectation Setting: Use the .On(...) syntax to define which parameters the mock should expect and what values (or errors) it should return.
  3. Assertion: The mock tracks all interactions, allowing the test to verify not just the output of the system under test, but also that the system interacted with the anomaly store in the expected manner (e.g., querying the correct trace names).
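The intercept-and-return flow in the diagram can be modeled with a tiny stdlib-only sketch. This is not the generated testify mock (which offers the same flow via On(...).Return(...) and AssertExpectations); it is a hand-rolled stand-in with simplified types showing what the mock does under the hood:

```go
package main

import (
	"fmt"
	"reflect"
)

// mockStore records calls and returns canned data, mirroring steps
// 1-4 of the diagram above. Types are simplified for illustration.
type mockStore struct {
	canned map[string][]string // traceName -> anomaly descriptions
	calls  [][]string          // recorded traceName arguments
}

func (m *mockStore) GetAnomalies(traceNames []string) map[string][]string {
	m.calls = append(m.calls, traceNames) // 3. intercept the call
	out := map[string][]string{}
	for _, name := range traceNames {
		out[name] = m.canned[name] // return the registered expectation
	}
	return out
}

func main() {
	// 1-2. Set up the mock and "inject" it into the system under test.
	m := &mockStore{canned: map[string][]string{"t1": {"step up at 120"}}}
	got := m.GetAnomalies([]string{"t1"})

	// 4. Verify both the returned data and the recorded interaction.
	fmt.Println(reflect.DeepEqual(got["t1"], []string{"step up at 120"}))
	fmt.Println(len(m.calls))
}
```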

Module: /go/anomalygroup

Anomaly Grouping Module

The anomalygroup module provides a centralized system for aggregating individual performance regressions (anomalies) into cohesive groups. In a high-scale performance monitoring environment, a single root cause (like a specific commit) often triggers multiple alerts across different benchmarks or configurations. This module shifts the workflow from managing hundreds of isolated alerts to managing a single “Anomaly Group,” which acts as the unit of work for bisection, bug reporting, and remediation tracking.

High-Level Overview

The primary goal of this module is to reduce “alert fatigue” and streamline root-cause analysis. It achieves this by correlating new anomalies with existing ones based on shared metadata (e.g., benchmark name, domain, and subscription) and overlapping commit ranges.

The module is structured as a tiered system:

  • Storage Layer: Defines the Store interface and provides a SQL implementation for persisting group metadata and membership.
  • Service Layer: A gRPC implementation that orchestrates interactions between anomaly, culprit, and regression data.
  • Notification/Utility Layer: Integrates the grouping logic into the Perf detection pipeline, ensuring every detected regression is either funneled into an existing group or initiates a new one.

Design and Implementation Choices

The “Find-or-Create” Lifecycle

The module follows a strict “find-or-create” pattern. When a regression is detected, the system does not immediately alert a human. Instead, it queries the Store for existing groups that match the anomaly's context.

  • If a match is found, the anomaly is added to the group, and any linked issues (bugs) are updated with a comment.
  • If no match is found, a new group is created, which may trigger automated processes like Temporal workflows for bisection.

Overlap-Based Grouping

A key design decision is the “Common Revision Range” logic. When an anomaly is added to a group, the group's start_commit and end_commit are narrowed to the intersection of the current group range and the new anomaly's range. This ensures that a group only contains anomalies that could logically have been caused by the same commit.

Data Model Flexibility

The module uses JSONB (in the SQL implementation) for group metadata. This allows the system to store heterogeneous attributes like subscription_name or benchmark_name without requiring rigid schema migrations as the types of performance data evolve.

Concurrency and Consistency

To prevent race conditions where multiple detection workers might try to create a group for the same regression simultaneously, the utility layer employs a global mutex during the “find-or-create” phase. This ensures that the mapping of anomalies to groups remains consistent and prevents duplicate bisection jobs or bug reports.
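A minimal sketch of the mutex-guarded find-or-create phase. The registry type, key format, and integer group IDs are illustrative assumptions; the real utility layer works against the persistent Store:

```go
package main

import (
	"fmt"
	"sync"
)

// groupRegistry models the "find-or-create" critical section: the lock
// makes lookup-then-insert atomic, so concurrent detection workers
// cannot create duplicate groups for the same context.
type groupRegistry struct {
	mu     sync.Mutex
	groups map[string]int // context key -> group ID
	nextID int
}

func (r *groupRegistry) findOrCreate(contextKey string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	if id, ok := r.groups[contextKey]; ok {
		return id // found: join the existing group
	}
	r.nextID++
	r.groups[contextKey] = r.nextID // miss: create a new group
	return r.nextID
}

func main() {
	r := &groupRegistry{groups: map[string]int{}}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Ten workers race on the same grouping context.
			r.findOrCreate("benchmark=motion_mark|domain=skia")
		}()
	}
	wg.Wait()
	fmt.Println(len(r.groups)) // all workers resolved to a single group
}
```

Without the lock, two workers could both miss the lookup and each create a group, which is exactly the duplicate-bisection scenario the design avoids.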

Key Components

Store (/go/anomalygroup/store.go)

The Store interface defines the data access layer. It is responsible for the persistence of groups and their associations (Anomaly IDs and Culprit IDs).

  • AddAnomalyID: Not only links an ID but also performs the mathematical narrowing of the group's commit range.
  • FindExistingGroup: The primary discovery mechanism used to deduplicate regressions.

Service (/go/anomalygroup/service)

The AnomalyGroupService implements the gRPC API. It acts as an aggregator, fetching data from the anomalygroup store while also pulling detailed regression data (like medians and trace params) to provide a rich view of the group's impact. It includes ranking logic to identify “top anomalies” within a group based on the percentage change in performance.

Notifier (/go/anomalygroup/notifier)

This acts as the glue between the Perf engine and the grouping logic. It filters out “summary-level” regressions (which are too broad for specific grouping) and constructs a canonical Test Path (e.g., master/bot/benchmark/test/subtest) required for consistent cross-referencing with external systems like Chromeperf.

Utils (/go/anomalygroup/utils)

Contains the AnomalyGrouper, which handles the high-level business logic. It coordinates between the internal stores and external systems (Issue Trackers and Temporal). It decides whether to post a comment to an existing bug or trigger a new bisection workflow.

Key Workflows

Processing a New Regression

This workflow illustrates how a detected anomaly is integrated into the grouping system.

[ Perf Detection Engine ]
           |
           v
[ AnomalyGroupNotifier ] --------------------.
           |                                 |
    (Validate Trace &                        |
     Build Test Path)                        |
           |                                 |
           v                                 |
[ AnomalyGrouper (Utils) ] <-----------------'
           |
    (Lock Mutex)
           |
           v
    [ FindExistingGroup? ]
      /              \
    (No)             (Yes)
     |                 |
     v                 v
[ Create New Group ]  [ AddAnomalyID ]
[ Trigger Bisection ] [ Update Issue/Bug ]
     |                 |
     '-------.---------'
             |
       (Unlock Mutex)
             |
             v
      [ Return GroupID ]

Revision Range Narrowing

When adding an anomaly to a group, the module maintains the narrowest possible window for bisection:

Group Range:   [100 ..................... 150]
New Anomaly:           [120 ............. 160]
               ===============================
Result Range:          [120 .......... 150]
               (Narrowest common overlap)
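The narrowing in the diagram is a plain interval intersection; a minimal sketch (the function name is illustrative, not the real AddAnomalyID internals):

```go
package main

import "fmt"

// narrowRange intersects the group's current commit range with a new
// anomaly's range: take the later start and the earlier end.
func narrowRange(groupStart, groupEnd, anomalyStart, anomalyEnd int) (int, int) {
	start := groupStart
	if anomalyStart > start {
		start = anomalyStart
	}
	end := groupEnd
	if anomalyEnd < end {
		end = anomalyEnd
	}
	return start, end
}

func main() {
	s, e := narrowRange(100, 150, 120, 160)
	fmt.Println(s, e) // 120 150, matching the diagram above
}
```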

Module: /go/anomalygroup/mocks

Anomaly Group Mocks

The go.skia.org/infra/perf/go/anomalygroup/mocks module provides mock implementations of the interfaces defined in the anomalygroup package. Its primary purpose is to facilitate unit testing for components that depend on anomaly group persistence and retrieval without requiring a live database or complex setup.

High-Level Overview

This module is generated using mockery and is based on the testify assertion framework. It focuses on mocking the Store interface, which is the primary abstraction for managing groups of performance anomalies. By using these mocks, developers can simulate various database states, such as existing groups, missing records, or successful updates, to verify the business logic of higher-level services.

Key Components and Responsibilities

Store.go

The Store struct is a mock implementation of the anomalygroup.Store interface. It records method calls and allows tests to define expected return values or behaviors (using On and Return methods from the testify mock package).

The mock covers the full lifecycle of an anomaly group:

  • Creation and Discovery: Mocking Create and FindExistingGroup allows testing of the logic that decides whether a new anomaly should start a new group or join an existing one based on subscription name, domain, and commit range.
  • Association Management: Methods like AddAnomalyID and AddCulpritIDs enable verification that the system correctly links individual detections and identified root causes to a group.
  • State Updates: UpdateBisectID and UpdateReportedIssueID are used to test the integration between anomaly grouping and external systems like bisection services and issue trackers.
  • Retrieval: LoadById, GetAnomalyIdsByAnomalyGroupId, and GetAnomalyIdsByIssueId support testing data access patterns and reporting logic.

Usage Workflow

When writing a test for a component that manages anomalies, the mock is typically initialized and injected as a dependency.

  Tester                Mock Store              Component Under Test
    |                       |                           |
    |-- NewStore(t) ------->|                           |
    |                       |                           |
    |-- On("LoadById")... ->|                           |
    |                       |                           |
    |-- Inject Store ------>|-------------------------->|
    |                       |                           |
    |-- Trigger Action -------------------------------->|
    |                       |                           |
    |                       |<-- Call LoadById() -------|
    |                       |                           |
    |                       |--- Return AnomalyGroup -->|
    |                       |                           |
    |-- AssertExpectations()|                           |

Implementation Decisions

  • Automated Generation: The use of mockery ensures that the mock stays in sync with the Store interface defined in the core anomalygroup package. If the interface changes, the mock can be regenerated to reflect the new API.
  • Testify Integration: Leveraging github.com/stretchr/testify/mock provides a standard, expressive syntax for setting up expectations, making tests more readable and maintainable.
  • Protobuf Dependency: The mock depends on go.skia.org/infra/perf/go/anomalygroup/proto/v1, ensuring that the data structures returned by the mock methods (like v1.AnomalyGroup) are exactly the same as those used by the real implementation.

Module: /go/anomalygroup/notifier

Anomaly Group Notifier

The anomalygroup/notifier module provides an implementation of a regression notifier that integrates with the Anomaly Grouping system. Instead of simply sending a static notification (like an email or chat message), this notifier delegates the handling of a detected regression to an AnomalyGrouper, which manages how regressions are aggregated, tracked, and associated with issues.

Overview

In the Perf system, a “Notifier” is typically responsible for alerting users when a regression is found. The AnomalyGroupNotifier fulfills this interface but focuses on the structured management of anomalies. Its primary role is to validate the incoming regression data, extract relevant metadata (such as “Test Paths”), and pass it to the anomalygroup/utils package for logic-heavy operations like finding or creating groups and updating issue trackers.

Design Decisions

  • Granularity Filtering: The notifier is designed to ignore “summary level” regressions. If a regression involves multiple traces (e.g., a high-level benchmark without a specific story), it is excluded from anomaly grouping. This prevents the system from creating noisy or overly broad groups that don't map clearly to specific test paths.
  • Test Path Construction: To maintain compatibility with external systems (like Chromeperf), the module enforces a specific hierarchy for identifying tests. It constructs a testPath string by concatenating parameters like master, bot, benchmark, test, and various subtest levels.
  • Minimal State: The notifier itself is stateless, acting as a translation layer between the Perf notification event and the persistent Anomaly Grouping storage.

Key Components

AnomalyGroupNotifier

The central struct that implements the notification interface. It holds a reference to an AnomalyGrouper.

  • RegressionFound: This is the primary entry point. When a regression is detected, this method:
    1. Validates that the regression represents a single, specific trace.
    2. Parses the trace keys to extract performance parameters.
    3. Calculates median values before and after the anomaly (using vec32) for logging and diagnostic purposes.
    4. Constructs a canonical testPath.
    5. Calls ProcessRegressionInGroup to handle the actual grouping and issue tracking.
  • No-op Methods: Methods like RegressionMissing and UpdateNotification are currently implemented as no-ops. This indicates that the anomaly grouping logic currently focuses on the discovery of regressions rather than the automated closing or updating of groups when a regression disappears.

Test Path Logic

The functions getTestPath and isParamSetValid encapsulate the requirements for a regression to be “groupable.”

  • A valid regression must contain a specific set of keys: master, bot, benchmark, test, and subtest_1.
  • The path is built following the pattern: master/bot/benchmark/test/subtest_1/.../subtest_3.
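A minimal sketch of these rules, assuming a flattened key/value map (the real `getTestPath`/`isParamSetValid` operate on the regression's paramset; the stop-at-first-missing-subtest behavior is an assumption of this sketch):

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical sketch of the test-path rules described above.
var requiredKeys = []string{"master", "bot", "benchmark", "test", "subtest_1"}

func isParamSetValid(params map[string]string) bool {
	for _, k := range requiredKeys {
		if params[k] == "" {
			return false
		}
	}
	return true
}

func getTestPath(params map[string]string) (string, error) {
	if !isParamSetValid(params) {
		return "", fmt.Errorf("missing one of %v", requiredKeys)
	}
	parts := []string{params["master"], params["bot"], params["benchmark"], params["test"]}
	// Append subtest levels in order; this sketch stops at the first
	// missing level so the path stays contiguous.
	for _, k := range []string{"subtest_1", "subtest_2", "subtest_3"} {
		if v := params[k]; v == "" {
			break
		} else {
			parts = append(parts, v)
		}
	}
	return strings.Join(parts, "/"), nil
}

func main() {
	p, _ := getTestPath(map[string]string{
		"master": "ChromiumPerf", "bot": "linux-perf",
		"benchmark": "speedometer", "test": "total", "subtest_1": "warm",
	})
	fmt.Println(p) // ChromiumPerf/linux-perf/speedometer/total/warm
}
```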

Workflow

The following diagram illustrates how a detected regression flows through this module:

Perf Detection Engine
        |
        v
[AnomalyGroupNotifier.RegressionFound]
        |
        |-- (Validate: Is it a single trace?)
        |-- (Validate: Does it have required params?)
        |-- (Action: Construct Test Path)
        |
        v
[AnomalyGrouper (via /go/anomalygroup/utils)]
        |
        |-- (Action: Find existing group or create new one)
        |-- (Action: Update Issue Tracker)
        |
        v
   End Result: Regression is grouped and tracked.

Related Modules

  • perf/go/anomalygroup/utils: Contains the AnomalyGrouper interface and the core logic for managing the lifecycle of an anomaly group.
  • perf/go/alerts: Provides the alert configurations that trigger these notifications.
  • perf/go/issuetracker: Used by the underlying grouper to link anomalies to bug reports.

Module: /go/anomalygroup/proto

High-Level Overview

The /go/anomalygroup/proto module serves as the primary definition layer for the Anomaly Grouping system. It establishes the contract between the performance monitoring services and the data storage layer responsible for organizing regressions. This module is essential for transitioning from “individual data point alerts” to “structured performance investigations.”

While the actual implementation details and specific gRPC service definitions are contained within versioned subdirectories (e.g., v1), this root module acts as the entry point for cross-service communication regarding grouped performance regressions.

Design and Implementation Choices

Protobuf-First Architecture

The choice to define anomaly groups using Protocol Buffers (protobuf) is driven by the need for interoperability across different microservices in the Perf ecosystem. By defining the “Anomaly Group” as a structured message, the system ensures that the detector, the bisection engine, and the reporting UI all share a consistent view of what constitutes a group, regardless of their specific internal languages or storage backends.

Evolution via Versioning

The module structure (specifically the use of the v1 subdirectory) reflects a design decision to support long-term API stability. Because Anomaly Groups are often linked to external trackers (like Monorail or Buganizer), the schema must evolve without breaking existing integrations. Versioning allows for:

  • Backward Compatibility: Services can continue to use older message formats while the backend transitions to more complex grouping logic.
  • Incremental Feature Rollout: New fields, such as those supporting advanced bisection parameters or different action types, can be introduced in new versions of the proto without disrupting the current production workflow.

Key Components and Responsibilities

The primary responsibility of this module is to provide the data models and service definitions required to manage the lifecycle of an anomaly group.

Abstraction of Regressions

The module defines how individual anomalies (discrete drops or shifts in performance metrics) are aggregated. Instead of treating every regression as a unique event, the proto definitions allow for a many-to-one mapping. This choice minimizes “alert fatigue” by ensuring that multiple regressions caused by the same commit or affecting the same benchmark are treated as a single unit of work.

Service Definitions

The module hosts the gRPC service definitions that facilitate:

  • Querying: Finding groups based on metadata such as subscriptions, benchmarks, or specific commit ranges.
  • Mutation: Updating the state of a group as external events occur, such as a bisection job identifying a culprit or a developer linking a bug ID.

Workflow Integration

The proto definitions in this module facilitate the following logical flow across the Skia Perf system:

[ Perf Detector ] --> [ Proto: FindExistingGroups ]
                               |
                               +---( Existing Group? )
                               |          |
      (No) CreateNewGroup <----+          +----> (Yes) UpdateGroup
               |                                       |
               v                                       v
    [ Anomaly Group Data Model: ActionType, Metadata, Anomaly IDs ]
               |
               +-----------> [ Bisection Service ]
               |
               +-----------> [ Reporting/Issue Service ]

By providing a unified message format, this module ensures that once a group is created or updated via the gRPC interface, all downstream services (like Pinpoint for bisection or the auto-filer for bug reports) can consume the data without needing to understand the underlying database schema.

Module: /go/anomalygroup/proto/v1

High-Level Overview

The go/anomalygroup/proto/v1 module defines the core data structures and gRPC service interface for managing Anomaly Groups within the Perf system. Anomaly grouping is a critical abstraction used to cluster related performance regressions—typically those sharing a similar commit range, benchmark, or subscription—into a single actionable entity.

By grouping anomalies, the system can automate post-detection workflows, such as filing single bug reports for multiple related regressions or triggering bisection jobs to identify a specific culprit commit.

Design and Implementation Choices

Action-Oriented Grouping

The design centers on the GroupActionType enum (REPORT, BISECT, NOACTION). Rather than being a passive collection of data points, an anomaly group is defined by its intended outcome.

  • REPORT: Indicates the group is intended for manual review or automated bug filing.
  • BISECT: Indicates the group is a candidate for automated bisection (Pinpoint) to find a culprit commit.
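The generated Go for this enum looks roughly like the following sketch; the constant values shown here are illustrative, and anomalygroup_service.pb.go is the source of truth:

```go
package main

import "fmt"

// GroupActionType mirrors the proto enum described above. The numeric
// values are illustrative; the generated code is authoritative.
type GroupActionType int32

const (
	GroupActionType_NOACTION GroupActionType = iota
	GroupActionType_REPORT
	GroupActionType_BISECT
)

func (t GroupActionType) String() string {
	switch t {
	case GroupActionType_REPORT:
		return "REPORT"
	case GroupActionType_BISECT:
		return "BISECT"
	default:
		return "NOACTION"
	}
}

func main() {
	fmt.Println(GroupActionType_BISECT) // BISECT
}
```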

Decoupling from Specific Regressions

The CreateNewAnomalyGroup RPC is explicitly designed to avoid binding a group to a single regression initially. This allows the system to find “existing” groups that match the criteria of a newly detected anomaly before creating a redundant group. This deduplication logic is supported by FindExistingGroups, which searches based on subscription_name, test_path, and commit ranges.
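The matching criteria can be sketched as a simple predicate. The `Group` struct and its field names below are illustrative flattenings of the proto messages, not the real schema:

```go
package main

import "fmt"

// Group is an illustrative flattening of the proto message; real field
// names differ (e.g. subscription_name, the common revision range).
type Group struct {
	Subscription string
	TestPath     string
	RevStart     int64
	RevEnd       int64
}

// groupMatches sketches the FindExistingGroups criteria described
// above: same subscription, same test path, overlapping commit range.
func groupMatches(g Group, subscription, testPath string, start, end int64) bool {
	return g.Subscription == subscription &&
		g.TestPath == testPath &&
		start <= g.RevEnd && end >= g.RevStart // commit ranges overlap
}

func main() {
	g := Group{Subscription: "sheriff-a", TestPath: "m/b/bm/t", RevStart: 100, RevEnd: 200}
	fmt.Println(groupMatches(g, "sheriff-a", "m/b/bm/t", 150, 250)) // true
	fmt.Println(groupMatches(g, "sheriff-a", "m/b/bm/t", 300, 400)) // false
}
```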

Metadata vs. Entity IDs

The AnomalyGroup message separates entity relationships (anomaly_ids, culprit_ids, reported_issue_id) from the metadata that defined the group (subscription_name, benchmark_name). This allows the service to track the evolution of a group as more anomalies are discovered or as a bisection job identifies specific culprits.

Key Components and Workflows

AnomalyGroupService

This gRPC service (anomalygroup_service.proto) is the primary interface for the anomaly management lifecycle.

  • Discovery & Creation: When the detector finds a new regression, it uses FindExistingGroups to see if it fits into a current investigation. If not, CreateNewAnomalyGroup initializes a new tracker.
  • Refinement: As bisection jobs complete or manual triaging occurs, UpdateAnomalyGroup is used to append culprit_ids (found by bisection) or issue_id (from a bug tracker).
  • Analysis: FindTopAnomalies provides a prioritized list of regressions within a group, allowing the system to pick the “most significant” anomaly to lead a bisection job.

The Anomaly Entity

The Anomaly message acts as a bridge between Skia's internal regression format and the requirements of external tools (like Pinpoint). It captures:

  • Context: The paramset map translates Skia tags (e.g., stat, measurement) into the “bot/benchmark/story” format required by ChromePerf.
  • Significance: median_before and median_after provide the raw data needed to calculate the magnitude of the regression.

Key Workflows

The following diagram illustrates how a new anomaly interacts with this module to determine if it should trigger a new action or join an existing investigation:

[ New Anomaly Detected ]
          |
          v
[ FindExistingGroups ] <---------- [ Search by Sub, Benchmark, & Commit ]
          |
          +---( Match Found? )---+
          |                      |
    YES   |                NO    |
          v                      v
[ UpdateAnomalyGroup ]    [ CreateNewAnomalyGroup ]
(Append Anomaly ID)              |
          |                      v
          |               [ Determine Action ]
          |               (BISECT or REPORT)
          |                      |
          +----------+-----------+
                     |
                     v
          [ Anomaly Group State ]
          - List of Anomaly IDs
          - Culprit IDs (if bisected)
          - Issue ID (if reported)

Key Files

  • anomalygroup_service.proto: The source of truth for the API and data models.
  • anomalygroup_service.pb.go: Contains the generated Go structs for messages and enums, including the GroupActionType logic.
  • anomalygroup_service_grpc.pb.go: Contains the gRPC client and server interfaces used by Perf components to communicate with the anomaly group store.

Module: /go/anomalygroup/proto/v1/mocks

High-Level Overview

This module provides mock implementations of the AnomalyGroupService defined in the v1 Protocol Buffer definitions of the Perf system. Its primary purpose is to facilitate isolated unit testing of components that interact with anomaly grouping logic. By using these mocks, developers can simulate various service behaviors—such as successful data retrieval, persistence errors, or specific search results—without requiring a live gRPC server or an underlying database.

Design and Implementation Decisions

The mocks in this module are built using the stretchr/testify/mock framework. This choice allows for a declarative style of testing where expectations (input arguments) and returns (output data or errors) are defined before the execution of the code under test.

gRPC Interface Compliance

A key implementation detail in AnomalyGroupServiceServer.go is the manual embedding of v1.UnimplementedAnomalyGroupServiceServer.

  • The “Why”: Standard gRPC server generation in Go requires implementations to embed the Unimplemented version of the server struct. This ensures forward compatibility; if new methods are added to the Protobuf definition, existing implementations (including mocks) will still satisfy the interface by inheriting the default “Unimplemented” behavior for the new methods.
  • The “How”: Because the mockery generation tool occasionally fails to include this embedding, it was added manually. This ensures the mock type remains a valid AnomalyGroupServiceServer even as the service definition evolves.

Assertion and Cleanup

The module provides a NewAnomalyGroupServiceServer constructor that integrates with Go's testing.T. It automatically registers a cleanup function via t.Cleanup. This design ensures that mock.AssertExpectations(t) is called at the end of every test, verifying that all expected calls to the service were actually made, which prevents “silent” test passes where expected logic was bypassed.
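The pattern can be illustrated without testify. This stdlib-only sketch (all names hypothetical) shows why registering the assertion via Cleanup catches expectations that the code under test never exercised:

```go
package main

import "fmt"

// TB is the small slice of testing.TB the constructor pattern needs:
// a cleanup hook plus failure reporting.
type TB interface {
	Cleanup(func())
	Errorf(format string, args ...any)
}

// Mock records calls; the assertion registered in NewMock fails the
// test if an expected method was never invoked -- the same guarantee
// AssertExpectations provides in the generated mocks.
type Mock struct {
	expected map[string]bool
	called   map[string]bool
}

func NewMock(t TB, expect ...string) *Mock {
	m := &Mock{expected: map[string]bool{}, called: map[string]bool{}}
	for _, name := range expect {
		m.expected[name] = true
	}
	t.Cleanup(func() {
		for name := range m.expected {
			if !m.called[name] {
				t.Errorf("expected call to %s never happened", name)
			}
		}
	})
	return m
}

func (m *Mock) Call(name string) { m.called[name] = true }

// fakeTB stands in for *testing.T so the example runs standalone.
type fakeTB struct {
	cleanups []func()
	failures int
}

func (f *fakeTB) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }
func (f *fakeTB) Errorf(format string, args ...any) {
	f.failures++
	fmt.Printf(format+"\n", args...)
}
func (f *fakeTB) finish() {
	for _, fn := range f.cleanups {
		fn()
	}
}

func main() {
	t := &fakeTB{}
	NewMock(t, "LoadAnomalyGroupByID") // expectation set but never called
	t.finish()                         // reports the missed expectation
}
```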

Key Components and Responsibilities

AnomalyGroupServiceServer

This is the central mock struct. It mirrors the gRPC server interface and provides hooks for the following service responsibilities:

  • Group Lifecycle Management: Methods like CreateNewAnomalyGroup and UpdateAnomalyGroup allow tests to simulate the creation and modification of anomaly clusters.
  • Data Retrieval: LoadAnomalyGroupByID and FindExistingGroups allow callers to simulate the lookup of groups based on specific identifiers or search criteria.
  • Anomaly Analysis: FindTopAnomalies facilitates testing logic that prioritizes or filters specific anomalies within a group context.

Testing Workflow

The typical workflow involving this module focuses on intercepting calls between a high-level business logic component and the anomaly group persistence layer.

[ Test Case ]
      |
      | (1) Set Expectations:
      |     mock.On("LoadAnomalyGroupByID", ...).Return(fakeGroup, nil)
      v
[ Component Under Test ]
      |
      | (2) Call: LoadAnomalyGroupByID(ctx, req)
      v
[ AnomalyGroupServiceServer (Mock) ]
      |
      | (3) Matches arguments and returns fakeGroup
      v
[ Component Under Test ]
      |
      | (4) Process fakeGroup and perform assertions
      v
[ Test Case ]
      |
      | (5) Cleanup: Verify all mock expectations were met

Module: /go/anomalygroup/service

Anomaly Group Service

The anomalygroup/service module provides a gRPC implementation for managing and querying Anomaly Groups. Anomaly groups are logical collections of performance regressions (anomalies) that share common characteristics, such as being detected within the same benchmark, subscription, or commit range.

This service acts as an orchestration layer that interfaces with underlying storage systems for anomaly groups, culprits, and regressions to provide a unified API for the Skia Perf backend.

Key Responsibilities

The service is responsible for the lifecycle and metadata management of grouped anomalies:

  • Group Creation and Discovery: Creating new groups based on subscription and commit criteria, and finding existing groups that match a specific test path and commit range.
  • Metadata Management: Updating groups with external identifiers, such as Bisection IDs (from automated bisects), Issue IDs (from bug trackers), and associating specific Culprit IDs or new Anomaly IDs with an existing group.
  • Analysis and Ranking: Identifying the “top” anomalies within a group based on the magnitude of the performance shift.
  • Correlation: Linking groups to issues through detected culprits.

Design Decisions

Group Identification and Search

When searching for existing groups (FindExistingGroups), the service parses a TestPath string. It expects a specific hierarchical format (e.g., domain/bot/benchmark/measurement/test). The service specifically extracts the Domain and Benchmark to query the store, effectively grouping anomalies that occur on the same benchmark even if they are on different bots or specific test sub-metrics.

Anomaly Ranking Logic

The FindTopAnomalies functionality implements a specific ranking strategy:

  1. Metric: It calculates the percentage change between MedianBefore and MedianAfter.
  2. Sorting: Regressions are sorted in descending order of this percentage change.
  3. Story Identification: The service attempts to identify the “story” (the specific sub-test) by looking at subtest_3, then subtest_2, then subtest_1 in the paramset. This prioritization ensures the most specific test description available is returned.
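The three steps above can be sketched as follows; the `Regression` struct is a trimmed-down, hypothetical view of the fields the ranking needs:

```go
package main

import (
	"fmt"
	"sort"
)

// Regression is a hypothetical, trimmed-down view of a stored regression.
type Regression struct {
	MedianBefore float32
	MedianAfter  float32
	Params       map[string]string
}

// percentChange is step 1: the relative shift between the medians.
// (Assumes MedianBefore is non-zero.)
func percentChange(r Regression) float64 {
	return float64(r.MedianAfter-r.MedianBefore) / float64(r.MedianBefore) * 100
}

// topAnomalies is step 2: sort descending by percentage change, keep at
// most limit entries.
func topAnomalies(rs []Regression, limit int) []Regression {
	sorted := append([]Regression(nil), rs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return percentChange(sorted[i]) > percentChange(sorted[j])
	})
	if limit < len(sorted) {
		sorted = sorted[:limit]
	}
	return sorted
}

// story is step 3: pick the most specific subtest available.
func story(params map[string]string) string {
	for _, k := range []string{"subtest_3", "subtest_2", "subtest_1"} {
		if v := params[k]; v != "" {
			return v
		}
	}
	return ""
}

func main() {
	rs := []Regression{
		{MedianBefore: 100, MedianAfter: 110, Params: map[string]string{"subtest_1": "warm"}},
		{MedianBefore: 100, MedianAfter: 150, Params: map[string]string{"subtest_1": "cold", "subtest_2": "run_1"}},
	}
	top := topAnomalies(rs, 1)
	fmt.Println(percentChange(top[0]), story(top[0].Params)) // 50 run_1
}
```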

Data Validation

The service enforces a strict schema for anomaly metadata via isParamSetValid. It requires the presence of specific keys (bot, benchmark, test, stat, subtest_1) and ensures that these keys contain exactly one value. This ensures consistency when these anomalies are exported or displayed in the UI.
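A minimal sketch of this check, assuming the paramset is a map from key to value list (the key names follow the text; the real `isParamSetValid` lives in the service package):

```go
package main

import "fmt"

// isParamSetValid sketches the rule described above: every required key
// must be present with exactly one value.
func isParamSetValid(ps map[string][]string) bool {
	for _, k := range []string{"bot", "benchmark", "test", "stat", "subtest_1"} {
		if len(ps[k]) != 1 {
			return false
		}
	}
	return true
}

func main() {
	ok := isParamSetValid(map[string][]string{
		"bot": {"linux-perf"}, "benchmark": {"speedometer"},
		"test": {"total"}, "stat": {"value"}, "subtest_1": {"warm"},
	})
	fmt.Println(ok) // true
}
```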

Key Components

AnomalyGroupService

The primary struct implementing the gRPC server defined in anomalygroup/proto/v1. It integrates the following dependencies:

  • Store (anomalygroup): Handles the persistence of the group entities.
  • Store (culprit): Used to resolve which issues are associated with the culprits in a group.
  • Store (regression): Used to fetch the detailed performance data (medians, paramsets) for the anomalies contained within a group.
  • Temporal Client: Integrated for workflow orchestration (e.g., triggering bisections or reports).

Workflow: Updating a Group

The UpdateAnomalyGroup method acts as a multi-purpose update sink. Depending on the fields populated in the request, it routes to different store operations:

Request (UpdateAnomalyGroup)
|
|-- Has BisectionId? ----> anomalygroupStore.UpdateBisectID
|
|-- Has IssueId? --------> anomalygroupStore.UpdateReportedIssueID
|
|-- Has AnomalyId? ------> regressionStore.GetByIDs (to get commit range)
|                          |
|                          +-> anomalygroupStore.AddAnomalyID
|
+-- Has CulpritIds? -----> anomalygroupStore.AddCulpritIDs
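The routing above can be sketched as follows. `UpdateRequest` and `Stores` are hypothetical simplifications of the gRPC request and the store interfaces, with callbacks standing in for the methods named in the diagram:

```go
package main

import "fmt"

// UpdateRequest is a trimmed, hypothetical view of the gRPC request;
// zero values mean "field not set".
type UpdateRequest struct {
	GroupID     string
	BisectionID string
	IssueID     string
	AnomalyID   string
	CulpritIDs  []string
}

// Stores bundles callbacks standing in for the store methods.
type Stores struct {
	UpdateBisectID        func(group, bisection string)
	UpdateReportedIssueID func(group, issue string)
	AddAnomalyID          func(group, anomaly string)
	AddCulpritIDs         func(group string, culprits []string)
}

// routeUpdate dispatches to whichever store operation matches the
// populated field. (For AnomalyID, the real code first calls
// regressionStore.GetByIDs to recover the commit range.)
func routeUpdate(req UpdateRequest, s Stores) {
	if req.BisectionID != "" {
		s.UpdateBisectID(req.GroupID, req.BisectionID)
	}
	if req.IssueID != "" {
		s.UpdateReportedIssueID(req.GroupID, req.IssueID)
	}
	if req.AnomalyID != "" {
		s.AddAnomalyID(req.GroupID, req.AnomalyID)
	}
	if len(req.CulpritIDs) > 0 {
		s.AddCulpritIDs(req.GroupID, req.CulpritIDs)
	}
}

func main() {
	s := Stores{
		UpdateBisectID:        func(g, b string) { fmt.Println("bisect", g, b) },
		UpdateReportedIssueID: func(g, i string) { fmt.Println("issue", g, i) },
		AddAnomalyID:          func(g, a string) { fmt.Println("anomaly", g, a) },
		AddCulpritIDs:         func(g string, c []string) { fmt.Println("culprits", g, c) },
	}
	routeUpdate(UpdateRequest{GroupID: "g1", IssueID: "12345"}, s)
	// issue g1 12345
}
```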

Internal Ranking Workflow

When identifying the most significant regressions in a group:

Load Group by ID
      |
Fetch all Regression details for AnomalyIds in Group
      |
For each Regression:
   Calculate: (MedianAfter - MedianBefore) / MedianBefore
      |
Sort descending by calculated diff
      |
Take Top N (Limit)
      |
Extract specific params (bot, benchmark, measurement, etc.)
      |
Return Anomaly list

Module: /go/anomalygroup/sqlanomalygroupstore

The sqlanomalygroupstore module provides a SQL-backed implementation for managing anomaly groups in the Perf system. It transitions the system from a “per-anomaly” management style to a “group-centric” workflow, allowing related performance regressions to be handled as a single unit for bisection, bug reporting, and state tracking.

Overview and Purpose

In performance monitoring, a single underlying issue often triggers multiple anomalies across different bots or benchmarks. Treating these as independent events leads to redundant bisections and fragmented issue tracking. This module solves that by providing a persistent store to aggregate these anomalies.

The store acts as the source of truth for the lifecycle of a regression:

  1. Grouping: Collating anomalies based on shared context (benchmark, domain, subscription).
  2. Range Refinement: Dynamically calculating the intersection of revision ranges as new anomalies are added to a group.
  3. Action Orchestration: Tracking whether a group has been reported to an issue tracker or sent for bisection.

Key Components and Design Decisions

Data Modeling and Storage

The implementation balances relational structure with the flexibility needed for heterogeneous performance data.

  • JSONB Metadata: The group_meta_data field uses JSONB to store attributes like subscription_name, domain_name, and benchmark_name. This avoids rigid schema migrations when new metadata categories are introduced while still allowing for efficient SQL filtering via JSON path expressions.
  • Array Types for Membership: AnomalyIDs and CulpritIDs are stored as UUID ARRAY (or text arrays). This allows the system to retrieve all members of a group in a single row fetch, optimizing for read-heavy “group view” operations.
  • Denormalized Revision Ranges: The fields common_rev_start and common_rev_end are stored directly on the group. This denormalization allows the system to perform fast range-based lookups (e.g., “Find all groups affecting commit X”) without joining against hundreds of individual anomaly records.

Anomaly Aggregation Logic

The store implements specific logic when adding an anomaly to an existing group via AddAnomalyID. Rather than just appending an ID, it updates the group's common_rev_start and common_rev_end using GREATEST and LEAST functions respectively. This ensures the group's “Common Revision Range” always represents the narrowest overlapping window shared by all member anomalies, which is essential for accurate bisection.
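In Go terms, the GREATEST/LEAST update is equivalent to the following sketch (the SQL does this in place; the function here is only illustrative):

```go
package main

import "fmt"

// narrowRange mirrors the GREATEST/LEAST update described above: each
// new anomaly shrinks the group's common range to the overlap.
func narrowRange(groupStart, groupEnd, anomalyStart, anomalyEnd int64) (int64, int64) {
	start, end := groupStart, groupEnd
	if anomalyStart > start { // GREATEST(common_rev_start, anomaly_start)
		start = anomalyStart
	}
	if anomalyEnd < end { // LEAST(common_rev_end, anomaly_end)
		end = anomalyEnd
	}
	return start, end
}

func main() {
	s, e := narrowRange(100, 200, 150, 250)
	fmt.Println(s, e) // 150 200
}
```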

Key Workflows

Finding and Joining Groups

The FindExistingGroup method is the entry point for anomaly deduplication. When a new anomaly is detected, the system queries for existing groups that match the metadata and whose revision range overlaps with the new anomaly.

New Anomaly Detected
       |
       v
Check Store: FindExistingGroup()
(Match Metadata + Revision Range Overlap)
       |
       +----[ Match Found ]----> AddAnomalyID()
       |                         (Narrows common_rev_start/end)
       |
       +----[ No Match ]-------> Create()
                                 (Starts new group lifecycle)

Remediation Tracking

The module provides dedicated update methods to link the group to external entities.

  • UpdateBisectID: Links the group to a specific bisection job.
  • UpdateReportedIssueID: Links the group to a bug in an issue tracker.

These links prevent duplicate actions. For example, the system can query GetAnomalyIdsByIssueId to find all data points associated with a specific bug, facilitating “cluster” views of performance regressions.

File Responsibilities

  • sqlanomalygroupstore.go: Implements the AnomalyGroupStore struct and its methods. It contains the raw SQL logic for Spanner/PostgreSQL, including complex array unnesting for ID lookups and JSONB extraction for group metadata.
  • schema/: Defines the database layout and provides the conceptual “why” behind the table structures, such as the use of temporal tracking for audit trails.
  • sqlanomalygroupstore_test.go: Validates the SQL logic using a real database instance (Spanner), specifically testing edge cases like revision range narrowing and UUID validation.

Module: /go/anomalygroup/sqlanomalygroupstore/schema

Anomaly Group SQL Schema

The schema package defines the structured data model for storing and managing anomaly groups within a SQL database. It serves as the single source of truth for the database layout used by the sqlanomalygroupstore, ensuring that anomaly aggregations, their associated metadata, and subsequent remedial actions are persisted consistently.

Overview and Purpose

In the Perf system, individual anomalies are often related by shared characteristics such as benchmark, bot, or revision range. The AnomalyGroupSchema is designed to transition from a “per-anomaly” view to a “group-centric” view. This grouping is critical for:

  • Action Orchestration: Managing actions like bisections or bug reporting on a group of related regressions rather than triggering redundant tasks for every single data point.
  • State Tracking: Maintaining the lifecycle of a performance regression from discovery (creation) to resolution (culprit identification).
  • Performance Optimization: Denormalizing key fields (like revision ranges) to allow the system to query and filter groups without performing expensive joins or aggregations across the primary anomaly tables.

Key Components and Design Decisions

AnomalyGroupSchema

The core structure represents a single row in the AnomalyGroups table. The implementation choices reflect a balance between strict relational integrity and the flexibility required for evolving metadata.

  • Identity and Temporal Tracking: Each group is assigned a UUID (ID) to prevent collisions across distributed systems. It tracks CreationTime and LastModifiedTime to allow the cleanup of stale groups and to provide audit trails for when a group's state last changed.

  • Anomalies and Culprits (Array Storage): The schema utilizes UUID ARRAY types for AnomalyIDs and CulpritIDs. This design choice favors read performance for group-specific views, as it allows the system to retrieve the entire membership list of a group in a single row fetch, avoiding the overhead of a separate many-to-many mapping table for common operations.

  • Dynamic Metadata (JSONB): The GroupMetaData field is implemented as a JSONB object. This is a deliberate choice to accommodate the heterogeneous nature of performance data. While currently used for tracking subscriptions and benchmark identifiers, the JSONB format allows the system to store additional context (like environment variables or hardware configurations) without requiring a schema migration every time a new metadata tag is introduced.

  • Denormalized Revision Ranges: CommonRevStart and CommonRevEnd represent the overlapping revision range shared by all anomalies within the group. These values are recalculated and updated as the group grows. By storing these directly on the group record, the system can quickly identify which groups are relevant to a specific commit range during bisection lookups.

  • Action and Workflow State: The schema integrates directly with the alerting and bisection workflows through fields like Action, BisectionID, and ReportedIssueID.

    • Action acts as a state machine indicator (e.g., report, bisect).
    • ActionTime tracks when these external processes were triggered to prevent duplicate actions during subsequent scanning loops.

Data Workflow

The following diagram illustrates how the schema fields are populated and updated during the lifecycle of an anomaly group:

Discovery Phase          Aggregation Phase          Action Phase
(Anomaly Detected)      (Group Created/Updated)    (Remediation)
        |                         |                      |
        v                         v                      v
[ Individual Anomaly ] ----> [ AnomalyGroupSchema ] ----> [ Bisection Job ]
                             | - CommonRevStart/End |     | - BisectionID
                             | - AnomalyIDs (Array) | <---+
                             | - GroupMetaData      |     |
                             | - Action ('bisect')  | ----+
                                         |
                                         +--------------> [ Issue Tracker ]
                                                          | - ReportedIssueID

This workflow ensures that as the system moves from detecting a regression to investigating it, the AnomalyGroupSchema remains the central repository for the group's evolving state and history.

Module: /go/anomalygroup/utils

High-Level Overview

The anomalygroup/utils module provides the logic for organizing individual performance regressions (anomalies) into cohesive groups. Instead of treating every detected regression as an isolated event, this module attempts to correlate new anomalies with existing ones based on shared metadata like subscription names, commit ranges, and test paths. This grouping mechanism is critical for reducing alert fatigue and enabling automated root-cause analysis workflows, such as bisection.

Design and Implementation Choices

The module is designed around a “find-or-create” pattern for anomaly groups, prioritizing the consolidation of information into existing groups to maintain a single source of truth for related issues.

  • Concurrency Control: The module uses a global sync.Mutex during the grouping process. This design choice addresses the potential for race conditions where multiple parallel processing containers might simultaneously attempt to create a new group for the same set of regressions.
  • Decoupled Action Handling: The logic distinguishes between two primary group actions: REPORT (creating/updating bug tracker issues) and BISECT (triggering automated culprit finding). The implementation chooses how to update external systems (like the Issue Tracker) based on these action types.
  • Workflow Integration: When a new group is created, the module doesn't just store data; it proactively triggers long-running processes via Temporal. This offloads heavy lifting—like deciding whether to start a Pinpoint bisection—to a durable execution framework.

Key Components and Responsibilities

AnomalyGrouper Interface and Implementation

The AnomalyGrouper interface defines the contract for processing a regression within the context of grouping. The primary implementation, AnomalyGrouperImpl, acts as a coordinator between the Perf backend services, the Issue Tracker, and the Temporal workflow engine.

Regression Processing Logic (anomalygrouputils.go)

The core logic resides in ProcessRegression. Its responsibilities include:

  1. Correlation: Querying the backend service via FindExistingGroups to see if the new anomaly fits into an active group based on its subscription and commit range.
  2. Group Management:
    • If no group exists: It creates a new group and immediately triggers the MaybeTriggerBisection Temporal workflow.
    • If groups exist: It associates the anomaly with all matching groups.
  3. Communication Sync: It ensures that external issue trackers are kept up-to-date. If a group has already been reported as a bug or is linked to a culprit, the module adds comments to those issues to notify stakeholders of the new regression.

Issue Identification (FindIssuesToUpdate)

This helper function encapsulates the logic for mapping an AnomalyGroup back to physical issue IDs.

  • For REPORT actions, it looks for a specifically linked ReportedIssueId.
  • For BISECT actions, it queries the backend for issues associated with “culprits” (identified causes) linked to the group.

Key Workflow: Processing a New Regression

The following diagram illustrates how the module handles an incoming regression and decides whether to create a new group or update an existing one.

[ New Regression Detected ]
           |
           v
+--------------------------+
|  Lock Grouping Mutex     | (Prevent race conditions)
+--------------------------+
           |
           v
+--------------------------+      YES      +----------------------------+
|  Find Existing Groups?   |-------------->| 1. Link Anomaly to Groups  |
+--------------------------+               | 2. Find Associated Issues  |
           |                               | 3. Post Updates to Issues  |
           | NO                            +----------------------------+
           v                                             |
+--------------------------+                             |
| 1. Create Anomaly Group  |                             |
| 2. Link Anomaly to Group |                             |
| 3. Trigger Temporal WF   |                             |
+--------------------------+                             |
           |                                             |
           v                                             v
+-----------------------------------------------------------------------+
|                        Unlock Mutex & Return                          |
+-----------------------------------------------------------------------+

Module: /go/anomalygroup/utils/mocks

High-Level Overview

The anomalygroup/utils/mocks module provides mock implementations of the interfaces defined within the anomalygroup utility suite. Its primary purpose is to facilitate unit testing for components that depend on anomaly grouping logic—specifically the categorization and association of regressions into logical groups—without requiring a live database or the complex state management associated with real anomaly grouping operations.

Design and Implementation Choices

The module utilizes testify/mock to provide a programmatic way to simulate the behavior of the AnomalyGrouper interface.

The core design decision here is the use of automatically generated mocks (via mockery). This approach ensures that the mock implementation remains strictly in sync with the parent interface. By generating these mocks in a dedicated package, the project maintains a clean separation between production code and testing utilities, preventing test dependencies (like testify) from polluting the production binary.

The mock is designed to support:

  • Behavioral Verification: Ensuring that the calling code passes the correct context, alert configurations, and commit ranges.
  • Deterministic Outcomes: Allowing tests to simulate both successful grouping (returning a group ID) and various error states (e.g., database failures or validation errors) to verify error handling in the consumer.

Key Components

AnomalyGrouper.go

This file contains the AnomalyGrouper struct, which mocks the primary service responsible for regression management.

The central responsibility of this mock is to simulate the ProcessRegressionInGroup workflow. In a real-world scenario, this method involves complex logic to determine if a new anomaly should be joined to an existing group or start a new one based on metadata. The mock simplifies this for callers by allowing them to define expectations:

Input Parameters:
  - ctx: Request context.
  - alert: The alert configuration that triggered the detection.
  - anomalyID: The unique identifier for the detected regression.
  - startCommit/endCommit: The range where the regression occurred.
  - testPath/paramSet: Metadata describing the specific trace and attributes.

Return Values:
  - string: The ID of the anomaly group the regression was assigned to.
  - error: Any simulated operational failure.

The mock includes a NewAnomalyGrouper constructor that integrates with the Go testing.T cleanup lifecycle, ensuring that any unmet expectations (e.g., a method was expected to be called but wasn't) are automatically reported as test failures.
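In practice the generated mock is built on testify/mock, but the pattern it implements can be illustrated with a minimal, hand-rolled stand-in that uses only the standard library. This is a sketch, not the generated code: the parameter list is abbreviated, and everything other than the names AnomalyGrouper and ProcessRegressionInGroup is hypothetical.

```go
package main

import "fmt"

// AnomalyGrouper mirrors the interface shape described above (parameter list
// abbreviated); in the real module the mock is generated by mockery on top
// of testify/mock.
type AnomalyGrouper interface {
	ProcessRegressionInGroup(anomalyID string, startCommit, endCommit int) (string, error)
}

// MockAnomalyGrouper is a hand-rolled stand-in: it records calls and returns
// preset values, the same behavior a mockery-generated mock provides via
// .On(...).Return(...).
type MockAnomalyGrouper struct {
	GroupID string   // preset return value
	Err     error    // preset error
	Calls   []string // anomaly IDs seen, for behavioral verification
}

func (m *MockAnomalyGrouper) ProcessRegressionInGroup(anomalyID string, startCommit, endCommit int) (string, error) {
	m.Calls = append(m.Calls, anomalyID)
	return m.GroupID, m.Err
}

// AssertCalled is the moral equivalent of testify's AssertExpectations,
// which the real NewAnomalyGrouper constructor runs automatically at
// t.Cleanup time.
func (m *MockAnomalyGrouper) AssertCalled() bool {
	return len(m.Calls) > 0
}

func main() {
	mock := &MockAnomalyGrouper{GroupID: "group-42"}
	var grouper AnomalyGrouper = mock // the mock satisfies the interface
	id, err := grouper.ProcessRegressionInGroup("anomaly-1", 100, 110)
	fmt.Println(id, err, mock.AssertCalled())
}
```

A test exercising a consumer would inject the mock where the real grouper is normally used, trigger the code under test, and then verify both the returned group ID and that the expected call actually happened.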

Typical Testing Workflow

When a component (such as a regression detector or a notification manager) identifies a regression, it interacts with the AnomalyGrouper. The mock allows you to simulate this interaction:

+-------------------+       +-----------------------+       +-------------------------+
|   Unit Test       |       |  Component Under Test |       |  Mock AnomalyGrouper    |
+---------+---------+       +-----------+-----------+       +------------+------------+
          |                             |                                 |
          | 1. Set Expectations on      |                                 |
          |    ProcessRegressionInGroup |                                 |
          +-------------------------------------------------------------->|
          |                             |                                 |
          | 2. Trigger Action           |                                 |
          +---------------------------->|                                 |
          |                             | 3. Call Process...              |
          |                             +-------------------------------->|
          |                             |                                 |
          |                             | 4. Return Preset Result         |
          |                             |<--------------------------------+
          | 5. Assert Expectations      |                                 |
          +-------------------------------------------------------------->|

Module: /go/backend

Perf Backend Module

The go/backend module implements the internal gRPC service architecture for the Skia Perf application. It serves as a centralized, non-user-facing API layer designed to decouple the frontend from heavy background operations and workflow orchestrations.

High-Level Overview

The backend service acts as a standard interface contract between different components of the Perf cluster. By isolating logic such as manual Pinpoint job triggering, anomaly group management, and culprit tracking into a dedicated service, the system ensures that user-facing components (the frontend) remain responsive.

This architecture allows for significant backend implementation changes—such as swapping out the underlying workflow engine (Temporal) or database logic—without requiring modifications to the frontend or other calling services.

Design and Implementation Choices

  • Internal Service-to-Service (S2S) Communication: The backend is explicitly designed for internal traffic. It uses Kubernetes DNS for service discovery within the cluster and relies on gRPC for efficient, typed communication.
  • Declarative Authorization: Security is not hardcoded into individual handlers. Instead, every service must implement the BackendService interface, which requires providing an AuthorizationPolicy. This policy is then enforced by a unified gRPC interceptor.
  • Workflow Abstraction: Heavier operations, particularly those involving long-running tasks like regression detection or Pinpoint bisection, are offloaded to this module. It frequently acts as a bridge to a Temporal cluster to manage stateful workflows.
  • Dependency Injection: The Backend struct is initialized with various “stores” (AnomalyGroup, Culprit, Subscription, Regression). This allows the service to remain agnostic of the specific storage implementation (e.g., Spanner vs. CockroachDB) while facilitating easier unit testing through mocks.

Key Components and Responsibilities

Backend Application (backend.go)

This is the core orchestrator. Its primary responsibility is the lifecycle management of the gRPC server. During initialization, it:

  1. Validates the instance configuration.
  2. Instantiates the necessary data stores and notification providers.
  3. Sets up the Temporal client (if anomaly grouping is enabled).
  4. Registers all sub-services (Pinpoint, Anomaly Group, Culprit) and applies their specific authorization policies to the gRPC interceptor stack.

Pinpoint Service (pinpoint.go)

A specialized wrapper around the Pinpoint service logic. It bridges the Perf backend to the Pinpoint bisection engine. Its primary role is to expose gRPC endpoints that allow the Perf UI to trigger and monitor performance bisection jobs. It implements strict role-based access control, typically requiring the Editor role.

Service Authorization Policy (/shared)

Contained within the shared sub-package, the AuthorizationPolicy structure defines the security contract for every endpoint. It supports:

  • Service-wide roles: A baseline role required to access any method in the service.
  • Method-specific overrides: Finer-grained control for sensitive operations.
  • Unauthenticated access: Explicitly allowing public access where necessary, though this is rare for backend services.

Client Utility (/client)

To ensure uniformity across the codebase, this sub-module provides a factory for creating gRPC clients. It abstracts away the complexities of:

  • Authentication: Automatically attaching Google OAuth2 identity tokens to requests.
  • TLS Configuration: Managing secure connections within the VPC.
  • Connection Dialing: Handling the boilerplate of grpc.Dial with appropriate interceptors.

Core Initialization Workflow

The following diagram illustrates how the backend service starts up and wires its internal dependencies:

[ Config File ] -> [ validate.LoadAndValidate ]
                         |
                         v
[ Storage Builders ] -> [ NewAnomalyGroupStore ]
                         [ NewCulpritStore      ]
                         [ NewRegressionStore   ]
                         |
                         v
[ External Services ] -> [ NewTemporalClient    ]
                         [ GetDefaultNotifier   ]
                         |
                         v
[ Service Registry ]  -> [ NewPinpointService   ]
                         [ NewAnomalyGroupServ  ]
                         [ NewCulpritService    ]
                         |
                         v
[ gRPC Server ] <------- [ Apply Auth Interceptors ]
      |
      +--> [ Listen on Port (e.g., :8005) ]
      +--> [ Enable Reflection ]
      +--> [ Serve Traffic ]

Key Submodules

  • backendserver: The executable entry point that parses CLI flags and calls the backend initialization logic.
  • testdata: Contains environment-specific configurations (like demo.json) used to bootstrap the service in development or CI environments.

Module: /go/backend/backendserver

High-Level Overview

The backendserver module provides the executable entry point for the Perf backend service. Its primary purpose is to act as a thin wrapper that bootstraps the backend environment, parses operational configuration from the command line, and initiates the long-running service process. It bridges the gap between the infrastructure's execution environment and the core logic defined in the perf/go/backend package.

Design and Implementation Choices

The module is designed around the urfave/cli framework to ensure that the service is highly configurable and self-documenting.

  • Flag-Driven Configuration: Rather than relying on static configuration files or hardcoded environment variables, the server uses the config.BackendFlags struct to define its requirements. This allows the deployment system to pass parameters directly, facilitating easier integration with container orchestration tools.
  • Separation of Concerns: The main.go file intentionally contains minimal logic. It delegates the heavy lifting—such as database connections, caching, and API routing—to the perf/go/backend package. This ensures that the core backend logic is decoupled from the CLI interface, making the system easier to test and reuse in different contexts.
  • Standardized Logging: The server initializes a standard output logger early in the lifecycle. This choice ensures that all startup events, including flag parsing and service initialization, are captured in a format compatible with cloud-native logging aggregators.

Key Components and Responsibilities

CLI Application (main.go)

The core responsibility of main.go is to define the command structure for the backend. It currently supports a run command, which serves as the primary execution path for the service.

When the run command is executed:

  1. Flag Processing: The application converts the definitions in config.BackendFlags into CLI flags.
  2. Lifecycle Management: It initializes the logger and logs the current flag configuration to provide visibility into the running state.
  3. Core Initialization: It calls backend.New(), passing the parsed flags. While the current implementation passes nil for several parameters (likely reserved for dependency injection or specialized handlers), this is where the system's core components are wired together.
  4. Service Execution: It invokes Serve(), which enters the main event loop of the backend, handling incoming requests until an interrupt signal is received.

Service Workflow

The following diagram illustrates the initialization and execution flow of the backendserver:

[ OS Args ]
     |
     v
[ CLI Flag Parser ] ----> [ Log Configuration ]
     |
     v
[ backend.New() ] <----- [ BackendFlags ]
     |
     +--> [ Instantiate internal components ]
     +--> [ Setup Listeners/Handlers ]
     |
     v
[ b.Serve() ] <--------- [ Infinite Loop ]
     |
     +--> [ Accept RPC/HTTP Requests ]
     +--> [ Process Data ]

Key Dependencies

  • perf/go/backend: Contains the actual service implementation. The backendserver is essentially a caller for this package.
  • perf/go/config: Defines the schema for the backend's configuration.
  • go.skia.org/infra/go/urfavecli: Provides the standardized CLI scaffolding used across Skia infrastructure projects.

Module: /go/backend/client

The backend/client module serves as the central factory for establishing gRPC connections to various Perf backend services. It abstracts the complexities of authentication, transport security, and connection management, providing a unified interface for other components of the system to communicate with backend microservices like Anomaly Groups, Culprits, and Pinpoint.

Design Decisions and Implementation

Centralized Connection Management

The module is designed around the concept of a shared connection utility (getGrpcConnection). By centralizing how gRPC connections are dialed, the system ensures consistent application of security policies and authentication headers across all clients. This approach allows developers to instantiate high-level service clients without needing to understand the underlying networking or security configuration of the cluster.

Security and Authentication

The client supports two primary connection modes based on the environment and specific service requirements:

  • Insecure Connections: Primarily used for local development or specific internal testing scenarios where TLS is not required.
  • Secure Internal Communication: For production workloads within a GKE cluster, the client uses a hybrid security model. It employs TLS for transport encryption but is configured with InsecureSkipVerify: true. This decision reflects a common internal networking pattern where communication stays within a trusted VPC/cluster boundary, making full certificate chain validation secondary to ensuring encrypted transit.
  • OAuth2 Identity: Authentication is handled via Google Default Application Credentials. The module automatically retrieves the service account's token source and attaches it as PerRPCCredentials to the gRPC connection, ensuring that every request is authorized with the appropriate identity (scoped to userinfo.email).

Configuration-Driven Connectivity

The module relies on the global perf/go/config to determine the target host (BackendServiceHostUrl). This allows the same binary to target different backend instances based on the deployment configuration. Additionally, every client factory supports an override parameter, facilitating flexible routing for integration tests or cross-cluster communication.

Key Components and Responsibilities

backendclientutil.go

This is the primary implementation file containing the logic for connection lifecycle management and client instantiation.

  • Connection Factory (getGrpcConnection): This internal function manages the grpc.Dial process. It handles the logic for choosing between insecure credentials and the TLS/OAuth2 stack.
  • Service Clients: The module provides specific factory functions for the different protobuf-defined services. These include:
    • NewPinpointClient: For interacting with the Pinpoint service.
    • NewAnomalyGroupServiceClient: For managing and querying anomaly groups.
    • NewCulpritServiceClient: For accessing information regarding identified culprits.

Workflow: Client Initialization

The following diagram illustrates the internal process when a consumer requests a new service client:

[ Consumer Call ]
      |
      v
[ Check if Backend Enabled? ] ---- No ----> [ Return Error ]
      |
     Yes
      |
[ Determine Host URL ] <--- (Override or Global Config)
      |
[ Create gRPC Connection ]
      |
      +---- If Secure: [ Fetch OAuth Token ]
      |                [ Configure TLS (Skip Verify) ]
      |
      +---- If Insecure: [ Use Insecure Creds ]
      |
[ grpc.Dial(host, opts) ]
      |
      v
[ Wrap Connection in Service Client ]
      |
      v
[ Return (e.g., AnomalyGroupServiceClient) ]

Key Submodules and Dependencies

  • perf/go/anomalygroup/proto/v1: Provides the interface for anomaly group interactions.
  • perf/go/culprit/proto/v1: Provides the interface for culprit tracking.
  • pinpoint/proto/v1: Provides the interface for Pinpoint integration.
  • go/auth: Used for managing Google-based authentication scopes.

Module: /go/backend/shared

High-Level Overview

The backend/shared module serves as a centralized location for common data structures and logic used across various backend services within the Perf system. Its primary purpose is to standardize how cross-cutting concerns—specifically security and access control—are defined and enforced across different service implementations.

Centralized Authorization Policy

The core of this module is the AuthorizationPolicy structure. Rather than hard-coding permission checks within individual RPC handlers or middleware, this module provides a declarative way to define access requirements. This approach decouples the “rules” of the service from the “engine” that enforces them.

Design Decisions and Implementation

  • Granular vs. Global Control: The design supports a tiered authorization model. By providing both AuthorizedRoles (service-wide) and MethodAuthorizedRoles (method-specific), the system allows developers to define a baseline security posture for a service while overriding or tightening requirements for sensitive operations.
  • Role-Based Access Control (RBAC): The module integrates directly with the common go/roles package. This ensures that the backend uses a unified identity and permission vocabulary, preventing discrepancies where different services might interpret “Admin” or “Viewer” differently.
  • Public Access Handling: The inclusion of the AllowUnauthenticated flag allows the policy to explicitly document when a service is intended to be public. This makes security audits easier, as public-facing endpoints are opted-into explicitly rather than being the default state.

Workflow: Authorization Evaluation

When a request enters a backend service, the service implementation typically references an AuthorizationPolicy instance to determine if the request should proceed.

Incoming Request
      |
      v
[ Auth Middleware ] <--- References --- [ AuthorizationPolicy ]
      |                                        |
      +---- (1) Is AllowUnauthenticated? ------+--> [ Allow ]
      |             YES                        |
      |                                        |
      +---- (2) Does user have a role in ------+--> [ Allow ]
      |         MethodAuthorizedRoles[RPC]?    |
      |             YES                        |
      |                                        |
      +---- (3) Does user have a role in ------+--> [ Allow ]
      |         AuthorizedRoles?               |
      |             YES                        |
      |                                        |
      +---- (4) No conditions met -------------+--> [ Deny (403) ]
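The decision sequence in the diagram can be sketched in plain Go. The field names mirror those described in this module, but the Role type and the Authorize helper below are simplified assumptions for illustration, not the real interceptor code:

```go
package main

import "fmt"

type Role string

// AuthorizationPolicy mirrors the fields described in this module. Authorize
// follows the four-step sequence in the diagram literally; the real
// enforcement lives in a shared gRPC interceptor.
type AuthorizationPolicy struct {
	AllowUnauthenticated  bool
	AuthorizedRoles       []Role
	MethodAuthorizedRoles map[string][]Role
}

// anyMatch reports whether the user holds at least one of the allowed roles.
func anyMatch(allowed, have []Role) bool {
	for _, a := range allowed {
		for _, h := range have {
			if a == h {
				return true
			}
		}
	}
	return false
}

func (p AuthorizationPolicy) Authorize(method string, userRoles []Role) bool {
	if p.AllowUnauthenticated {
		return true // (1) explicitly public endpoint
	}
	if override, ok := p.MethodAuthorizedRoles[method]; ok && anyMatch(override, userRoles) {
		return true // (2) method-specific override
	}
	if anyMatch(p.AuthorizedRoles, userRoles) {
		return true // (3) service-wide baseline
	}
	return false // (4) deny with 403
}

func main() {
	p := AuthorizationPolicy{
		AuthorizedRoles:       []Role{"viewer"},
		MethodAuthorizedRoles: map[string][]Role{"TriggerBisect": {"editor"}},
	}
	fmt.Println(p.Authorize("GetStatus", []Role{"viewer"}))     // allowed by baseline
	fmt.Println(p.Authorize("TriggerBisect", []Role{"editor"})) // allowed by override
	fmt.Println(p.Authorize("GetStatus", []Role{"guest"}))      // denied
}
```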

Key Components

  • authorization.go: Defines the AuthorizationPolicy struct. This file is the source of truth for how backend services should describe their security requirements. It acts as the contract between service definitions and the middleware responsible for enforcing those definitions.

Module: /go/backend/testdata

Overview

The /go/backend/testdata directory serves as a repository for static configuration files used to simulate real-world runtime environments during development, testing, and demonstration of the Perf backend. Rather than relying on hardcoded defaults within the Go source code, this module provides a centralized location for JSON-based configurations that define how a Perf instance behaves, connects to data sources, and interacts with external services.

Design Rationale

The primary motivation for maintaining this module is to provide a “Single Source of Truth” for a functional Perf deployment environment that can be spun up locally or in a CI environment.

By using demo.json, the system achieves:

  • Decoupling: Separation of the application logic from environment-specific parameters like database connection strings or repository URLs.
  • Reproducibility: Ensuring that developers and automated tests operate against a consistent set of configurations, such as the specific CockroachDB connection string or the local directory ingestion path.
  • Validation: Serving as a schema reference for the Config struct used within the backend, ensuring that changes to the configuration format are reflected in a working example.

Key Components and Responsibilities

Configuration Specifications (demo.json)

This file is the core of the module. It defines a comprehensive instance profile. Its responsibilities include:

  • Identity and Networking: Establishing the instance name (chrome-perf-demo) and mapping the local communication ports for both the frontend and backend services.
  • Data Persistence Layer: Explicitly choosing cockroachdb as the storage engine and defining the tile_size (e.g., 256). This choice impacts how the backend optimizes data retrieval for trace queries.
  • Ingestion Logic: Configuring the backend to monitor a local directory (./demo/data/) rather than a cloud-based Pub/Sub or GCS bucket. This is crucial for offline development and rapid prototyping of data parsers.
  • External Integration Mocking: Providing placeholders for issue trackers, authentication headers (X-WEBAUTH-USER), and Git repository synchronization. By pointing to a public demo repo (perf-demo-repo.git), it allows the system to demonstrate commit-linking functionality without requiring private credentials.
  • UI Customization: Defining “Favorites” sections which allow the backend to populate the user interface with predefined links and documentation, simulating a curated production dashboard.
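To make the description concrete, a heavily trimmed, illustrative fragment is shown below. The values come from the bullets above, but the field names, nesting, and placeholder repository host only approximate the real InstanceConfig schema; treat the actual demo.json in this directory as authoritative.

```json
{
  "instance_name": "chrome-perf-demo",
  "data_store_config": {
    "datastore_type": "cockroachdb",
    "connection_string": "postgresql://root@localhost:26257/demo",
    "tile_size": 256
  },
  "ingestion_config": {
    "source_config": {
      "source_type": "dir",
      "sources": ["./demo/data/"]
    }
  },
  "git_repo_config": {
    "url": "https://example.com/perf-demo-repo.git",
    "dir": "/tmp/perf-demo"
  },
  "auth_config": {
    "header_name": "X-WEBAUTH-USER"
  }
}
```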

Workflow: Configuration Ingestion

The backend utilizes these files to bootstrap its internal services. The flow generally follows this pattern:

[ Backend Startup ]
        |
        V
[ Load /go/backend/testdata/demo.json ]
        |
        +-----> [ Initialize CockroachDB Connection ]
        |       (Using connection_string)
        |
        +-----> [ Initialize Ingestion Service ]
        |       (Watching ./demo/data/ for new trace files)
        |
        +-----> [ Sync Git Provider ]
        |       (Cloning/Updating /tmp/perf-demo)
        |
        +-----> [ Apply Auth/Notification Policies ]
                (Setting header names and issue tracker secrets)

This structure ensures that the backend can transition from a “demo” state to a “production” state simply by swapping the configuration file, keeping the underlying binary logic identical across environments.

Module: /go/bug

The bug module provides a specialized utility for generating bug reporting URLs within the Perf application. Its primary purpose is to bridge the gap between performance regression detection and issue tracking by dynamically populating bug templates with contextual metadata.

Design and Implementation Logic

The module is built around the concept of URI templates. Rather than hard-coding support for specific issue trackers (like Monorail or GitHub Issues), it utilizes a template-based approach to remain agnostic of the underlying bug-tracking system. This allows administrators to configure different reporting destinations without modifying the source code.

The core logic relies on the RFC 6570 URI Template standard via the uritemplates library. This ensures that all components of the URL—specifically those containing special characters like query parameters in a cluster link—are correctly escaped and encoded to prevent broken links in the resulting bug report.

Key Components

Template Expansion (bug.go)

The module exposes the Expand function, which serves as the primary entry point. It takes a raw template string and injects three critical pieces of context:

  • cluster_url: A direct link to the Skia Perf cluster view where the regression was identified.
  • commit_url: The link to the specific git commit (provided via provider.Commit) suspected of causing the regression.
  • message: User-provided commentary or summary of the issue.

The function handles the mapping of these domain-specific concepts to the template variables, ensuring that the integration between the performance monitoring UI and the bug tracker is seamless.
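A drastically simplified sketch of what such an expansion does is shown below. The real implementation uses the RFC 6570 uritemplates library, which supports far richer operators than plain {name} substitution; the helper name and the bug-tracker URL here are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// expandTemplate is a simplified stand-in for the RFC 6570 expansion that
// bug.Expand performs: it substitutes {name} placeholders with
// percent-encoded values, so nested URLs and special characters survive
// inside the final link.
func expandTemplate(template string, vars map[string]string) string {
	out := template
	for name, value := range vars {
		out = strings.ReplaceAll(out, "{"+name+"}", url.QueryEscape(value))
	}
	return out
}

func main() {
	tmpl := "https://bugs.example.org/new?comment={message}&link={cluster_url}"
	// Note the cluster URL already contains encoded query parameters; it is
	// escaped again so it nests safely inside the bug URL.
	got := expandTemplate(tmpl, map[string]string{
		"message":     "perf regression",
		"cluster_url": "https://perf.skia.org/e/?query=bot%3Dpixel",
	})
	fmt.Println(got)
}
```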

Data Flow Workflow

The following diagram illustrates how the module transforms raw performance data and user input into a navigable bug report link:

[Perf UI / Detection]       [Git Provider]       [User Input]
         |                         |                  |
         v                         v                  v
   (clusterLink)             (commit.URL)         (message)
         |                         |                  |
         +-------------------------+------------------+
                                   |
                                   v
                        +----------------------+
                        |      bug.Expand      | <--- [URI Template]
                        +----------------------+
                                   |
                                   v
                        [Encoded Reporting URL]
                                   |
                                   v
                        (Opens in User Browser)

Usage in Testing and Examples

The module includes an ExampleExpand function and associated tests to verify that the encoding logic correctly handles complex URLs. This is particularly important for the cluster_url, which often contains its own set of encoded query parameters that must be safely nested within the final bug reporting URL.

Module: /go/builders

Perf Builders Module

The go/builders module serves as the central factory for the Skia Perf application. It is responsible for instantiating complex objects—such as data stores, version control interfaces, and file sources—by interpreting a central config.InstanceConfig object.

Design Philosophy

The primary motivation for this module is to resolve cyclical dependencies. Many sub-packages within Perf (like tracestore or regression) need to know about the configuration, but the configuration logic often needs to reference these packages to define how they are initialized. By centralizing the “construction” logic here, other packages can remain focused on their specific domains without needing to know how their peers are instantiated or how the global configuration is structured.

A key implementation choice is the use of a Singleton Database Pool. Since a Perf instance typically talks to a single backend (like Spanner or PostgreSQL), the module maintains a global singletonPool. This prevents the application from accidentally opening multiple connection pools to the same database, which could exhaust file descriptors or database connection limits.
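The singleton pattern described above can be sketched as follows. The variable names singletonPool and singletonPoolMutex follow the text; the Pool type and getPool helper are placeholders standing in for the real pgx pool construction.

```go
package main

import (
	"fmt"
	"sync"
)

// Pool stands in for the real pgx connection pool.
type Pool struct{ connString string }

var (
	singletonPool      *Pool
	singletonPoolMutex sync.Mutex
)

// getPool returns the process-wide pool, creating it on first use. Concurrent
// callers during startup are serialized by the mutex, so only one pool is
// ever opened per process, protecting file descriptors and database
// connection limits.
func getPool(connString string) *Pool {
	singletonPoolMutex.Lock()
	defer singletonPoolMutex.Unlock()
	if singletonPool == nil {
		singletonPool = &Pool{connString: connString}
	}
	return singletonPool
}

func main() {
	a := getPool("postgresql://localhost/perf")
	b := getPool("postgresql://localhost/perf")
	fmt.Println(a == b) // the same pool instance is reused
}
```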

Key Responsibilities and Components

Database Management

The module handles the lifecycle of the database connection pool.

  • NewDBPoolFromConfig: This is the core initializer. It parses connection strings, configures connection limits (MaxConns and MinConns), and wraps the raw pool in a timeout layer so that every query runs with a deadline.
  • Schema Validation: When initializing a pool, the builder optionally performs a schema check. It compares the actual database schema against the expectedschema to ensure the database is compatible with the current version of the code before the application starts processing traffic.

Store Factories

The module provides New[Component]StoreFromConfig functions for every major data entity in Perf. These functions encapsulate the logic of choosing between different implementations (e.g., SQL-based vs. Cache-backed stores).

  • Trace & Metadata Stores: Constructs sqltracestore instances. It also manages the initialization of InMemoryTraceParams to optimize trace lookups.
  • Regression & Shortcut Stores: Handles the logic of selecting versioned implementations, such as switching between sqlregressionstore and sqlregression2store based on the UseRegression2 config flag.
  • Anomaly & Alert Stores: Standardizes the creation of stores for alerts, anomaly groups, culprits, subscriptions, and user-reported issues.

Data Ingestion Sources

The builders resolve how Perf reads incoming data files:

  • NewSourceFromConfig: Determines whether data should be pulled from Google Cloud Storage (GCSSource) or a local directory (DirSource).
  • NewIngestedFSFromConfig: Provides a standard Go fs.FS interface to the underlying storage, allowing the rest of the application to treat GCS and local filesystems interchangeably.

Caching Strategy

The GetCacheFromConfig function determines the caching layer for queries. It supports:

  • Redis: Utilizing a Google Cloud Redis client.
  • Local: An in-memory cache for local development or small-scale deployments.

Core Workflow: Object Initialization

The typical flow for initializing a component involves resolving the database pool first, then passing it into the specific constructor for the requested store.

Config Object (InstanceConfig)
      |
      v
[ NewDBPoolFromConfig ] <-----------+
      |                             | (Check Schema)
      |                             v
      +------> [ singletonPool ] ---+
      |           (Thread-safe)     |
      |                             |
      v                             v
[ New...StoreFromConfig ]     [ NewPerfGitFromConfig ]
      |                             |
      +---> Returns Interface       +---> Returns perfgit.Git
            (e.g. alerts.Store)

Implementation Details

  • Concurrency: The singletonPool is protected by a sync.Mutex (singletonPoolMutex) to ensure that concurrent calls to initialize the database during startup do not create race conditions or multiple pools.
  • Logging: A custom pgxLogAdaptor is implemented to redirect internal database driver logs (from pgx) into the standard sklog system, ensuring unified log formatting across the application.
  • Timeouts: All database pools are wrapped using go/sql/pool/wrapper/timeout. This enforces that every context passed to a database operation has a deadline, preventing “hanging” queries from blocking the application indefinitely.

Module: /go/chromeperf

Overview

The chromeperf module provides a comprehensive Go client and integration layer for interacting with the Chrome Performance Monitoring (Chromeperf) ecosystem. Its primary responsibility is to bridge the gap between Skia Perf's internal data structures and the legacy Chromeperf APIs, specifically focusing on anomaly detection, regression reporting, and alert group management.

The module acts as a translation and transport layer, allowing Skia Perf to:

  1. Retrieve performance anomalies (regressions or improvements) from the Chromeperf backend.
  2. Report new regressions discovered by Skia's analysis engines back to Chromeperf.
  3. Manage Alert Groups, which aggregate multiple related anomalies into a single triagable unit.
  4. Normalize data identifiers, converting between Skia's structured trace keys and Chromeperf's slash-delimited TestPath format.

Design Decisions and Implementation Choices

Communication via Skia-Bridge

A key architectural decision is the use of skia-bridge-dot-chromeperf.appspot.com as the default endpoint. While a legacy direct path to chromeperf.appspot.com exists, the module defaults to the bridge. This design allows for a more stable interface and potentially specialized authentication/filtering logic between the two systems. The ChromePerfClient interface abstracts this, supporting URL overrides for local development and testing.

Resilience and Status Code Handling

The SendPostRequest and SendGetRequest implementations in chromeperfClient.go incorporate specific logic for “accepted status codes.” Unlike standard HTTP clients that might treat any 2xx as success, this module allows callers to define exactly which codes are valid for a given operation. For example, ReportRegression accepts 404 as a non-error state in specific scenarios where parameter names differ between systems, preventing transient synchronization issues from triggering hard failures in the Skia backend.
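The accepted-codes idea can be sketched with the standard library. The helper names below are hypothetical, not the real chromeperfClient.go API; the httptest server simulates the scenario where Chromeperf answers 404 but the caller treats it as a non-error state.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusAccepted reports whether a response code is in the caller-supplied
// accept list, mirroring the "accepted status codes" behavior described
// above.
func statusAccepted(code int, accepted []int) bool {
	for _, a := range accepted {
		if code == a {
			return true
		}
	}
	return false
}

// getWithAccepted is a hypothetical helper in the spirit of SendGetRequest:
// it fails only when the status code falls outside the accept list.
func getWithAccepted(url string, accepted []int) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if !statusAccepted(resp.StatusCode, accepted) {
		return resp.StatusCode, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return resp.StatusCode, nil
}

func main() {
	// A server that always answers 404, as when parameter names differ
	// between the two systems.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNotFound)
	}))
	defer srv.Close()

	// The caller opts into treating 404 as success.
	code, err := getWithAccepted(srv.URL, []int{200, 404})
	fmt.Println(code, err)
}
```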

Trace Name to TestPath Translation

Chromeperf identifies performance series using a hierarchical string (e.g., Master/Bot/Benchmark/Test/Subtest), whereas Skia Perf uses a flat map of key-value pairs. The TraceNameToTestPath function implements a deterministic mapping strategy:

  • Order matters: It strictly enforces a hierarchy: master -> bot -> benchmark -> test -> subtest_1...N.
  • Statistical Suffixes: Because Chromeperf often encodes statistics in the test name (e.g., _avg, _max), the translator can optionally append suffixes based on Skia's stat parameter to ensure lookups hit the correct legacy series.
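The translation can be sketched as below. This is a simplification of the real TraceNameToTestPath: subtest keys are omitted, and only the stat=value to "_avg" suffix mapping is shown (an assumption consistent with the worked example later in this section; the real mapping covers more statistics).

```go
package main

import (
	"fmt"
	"strings"
)

// traceNameToTestPath parses a Skia trace key of the form ",k=v,k=v," and
// emits the hierarchical Chromeperf path master/bot/benchmark/test,
// optionally appending a statistic suffix.
func traceNameToTestPath(traceName string) (string, error) {
	params := map[string]string{}
	for _, pair := range strings.Split(strings.Trim(traceName, ","), ",") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 {
			return "", fmt.Errorf("malformed pair %q", pair)
		}
		params[kv[0]] = kv[1]
	}
	// Order matters: the hierarchy is fixed.
	parts := []string{}
	for _, key := range []string{"master", "bot", "benchmark", "test"} {
		v, ok := params[key]
		if !ok {
			return "", fmt.Errorf("missing key %q", key)
		}
		parts = append(parts, v)
	}
	path := strings.Join(parts, "/")
	if params["stat"] == "value" {
		path += "_avg" // legacy series encode the statistic in the name
	}
	return path, nil
}

func main() {
	p, err := traceNameToTestPath(",master=CP,bot=M1,benchmark=SunSpider,test=total,stat=value,")
	fmt.Println(p, err)
}
```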

Lossy Sanitization and the Reverse Key Map

Skia Perf restricts certain characters in trace keys (like ? or :), replacing them with underscores. To prevent this from breaking the ability to query the original data source, the module utilizes a ReverseKeyMapStore. This allows the system to “remember” that a sanitized Skia value like cpu_io actually corresponds to a Chromeperf value of cpu:io.
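An in-memory sketch of this idea follows. The real ReverseKeyMapStore is SQL-backed, and the exact set of disallowed characters is an assumption based on the examples in the text (? and :).

```go
package main

import (
	"fmt"
	"strings"
)

// reverseKeyMap remembers sanitized -> original value mappings so the
// original Chromeperf value can be recovered for later queries.
var reverseKeyMap = map[string]string{}

// sanitizeValue replaces characters that Skia Perf disallows in trace keys
// with underscores, recording the original when anything changed.
func sanitizeValue(v string) string {
	sanitized := strings.Map(func(r rune) rune {
		switch r {
		case '?', ':':
			return '_'
		}
		return r
	}, v)
	if sanitized != v {
		reverseKeyMap[sanitized] = v
	}
	return sanitized
}

func main() {
	s := sanitizeValue("cpu:io")
	fmt.Println(s, "->", reverseKeyMap[s]) // cpu_io -> cpu:io
}
```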

Key Components

Anomaly API (anomalyApi.go)

This is the core functional area of the module. It defines the Anomaly struct, which contains extensive metadata about performance shifts (medians before/after, P-values, segment sizes, and bug tracking information).

  • Reporting: ReportRegression sends new detections to Chromeperf to trigger the alerting pipeline.
  • Retrieval: Supports both revision-based (GetAnomalies) and time-based (GetAnomaliesTimeBased) queries.
  • Normalization: The UnmarshalJSON method for Anomaly handles legacy numeric IDs by transparently converting them to strings, ensuring compatibility with different versions of the Chromeperf backend.

Alert Group API (alertGroupApi.go)

Manages the grouping of anomalies. It provides methods to fetch details for a specific group key. A critical function here is GetQueryParams, which parses the anomaly list within a group to generate Skia-compatible query parameters, allowing users to jump from a Chromeperf alert group directly to a Skia Perf visualization of all affected traces.

Chromeperf Client (chromeperfClient.go)

The low-level transport implementation. It handles:

  • Authentication: Uses Google Default Credentials with the userinfo.email scope.
  • Tracing: Integrates with OpenCensus for distributed tracing of API calls.
  • JSON Serialization: Manages the encoding and decoding of complex request/response objects.

Key Workflow: Anomaly Retrieval and Mapping

The following diagram illustrates how the module transforms a Skia trace request into a Chromeperf anomaly set:

[ Skia Perf ]
  Trace Name: ",master=CP,bot=M1,benchmark=SunSpider,test=total,stat=value,"
      |
      v
[ TraceNameToTestPath ]
  Converts to: "CP/M1/SunSpider/total_avg"
      |
      v
[ AnomalyApiClient.GetAnomalies ]
  POST /anomalies/find { "tests": ["CP/M1/SunSpider/total_avg"], ... }
      |
      v
[ Chromeperf Backend ]
  Returns: { "anomalies": { "CP/M1/SunSpider/total_avg": [ {Anomaly_Data} ] } }
      |
      v
[ getAnomalyMapFromChromePerfResult ]
  1. Maps "CP/M1/SunSpider/total_avg" back to the original Skia Trace Name.
  2. Resolves Git Hashes to Commit Numbers using perfgit.Git.
      |
      v
[ AnomalyMap ]
  { "trace_name": { CommitNumber: Anomaly } }

Submodules

  • compat/: A translation layer that converts internal Skia regression.Regression objects into chromeperf.Anomaly structures.
  • sqlreversekeymapstore/: A SQL-backed implementation of the ReverseKeyMapStore for persisting character transformation mappings.
  • mock/: Autogenerated mocks for unit testing components that depend on these APIs.

Module: /go/chromeperf/compat

Overview

The compat module provides a translation layer between Skia Perf's internal regression formats and the legacy ChromePerf (Anomaly) data structures. Its primary purpose is to ensure interoperability during the transition or integration period where Skia Perf needs to communicate regression data to systems that still rely on the ChromePerf “Anomaly” schema.

The module simplifies the complex, multi-dimensional data captured in a Skia regression.Regression object into a flat, trace-oriented chromeperf.AnomalyMap.

Design Motivations

The translation logic addresses several structural differences between the two systems:

  • Trace Identification: Skia Perf uses structured trace keys (comma-separated key-value pairs), while ChromePerf uses a slash-delimited TestPath (e.g., Master/Bot/Benchmark/Test). This module handles the mapping of these identifiers to ensure regressions are attributed to the correct entities in legacy dashboards.
  • Revision Ranges: Skia tracks regressions primarily by specific commit numbers. The translation maps these into StartRevision and EndRevision fields to satisfy the “range-based” anomaly model used by ChromePerf.
  • Triage State Mapping: The module translates Skia's internal TriageStatus into a string-based state and applies specific flags (like IgnoreBugIDFlag) when a regression is marked as “Ignored,” ensuring the legacy system respects the triage decisions made in Skia.
  • Bug ID Handling: Skia supports multiple bugs per regression, whereas the legacy anomaly format historically expects a single primary Bug ID. The module currently selects the first available bug but includes diagnostic logging to monitor instances where data might be truncated, facilitating future schema improvements.

Key Workflows

Regression to Anomaly Conversion

The core functionality is encapsulated in ConvertRegressionToAnomalies. The process follows this logical flow:

  1. Validation: Ensures the regression contains valid data frames. If no trace data is present, it returns an empty map.
  2. Trace Iteration: For every trace involved in the regression, it attempts to resolve the legacy TestPath.
  3. Field Mapping: Values like medians (before/after), revision numbers, and improvement flags are cast and moved into the Anomaly struct.
  4. Status Sync: The triage status (e.g., Untriaged, Positive, Ignored) is synchronized.
  5. Map Construction: The resulting anomalies are grouped into a CommitNumberAnomalyMap, indexed by the trace key, allowing callers to look up anomalies by their specific performance series.

[regression.Regression]
          |
          v
+------------------------------+
| ConvertRegressionToAnomalies |
+------------------------------+
          |
          |-- Extract TraceSet Keys
          |-- Resolve TestPaths (e.g. Master/Bot/...)
          |-- Map Medians & Revisions
          |-- Resolve Bug IDs & Triage State
          v
[chromeperf.AnomalyMap]
    {
      "trace_key_A": { CommitNum: Anomaly },
      "trace_key_B": { CommitNum: Anomaly }
    }

Key Components

  • compat.go: Contains the primary conversion logic. It is responsible for the heavy lifting of data transformation, error handling for malformed trace names, and the temporary logic for narrowing down multiple bug assignments into a single field.
  • compat_test.go: Validates the conversion accuracy across various scenarios, including successful mappings, handling of nil data frames, and ensuring that different triage statuses (like Ignored) result in the correct legacy flag values.

Module: /go/chromeperf/mock

The /go/chromeperf/mock module provides a suite of autogenerated mock implementations for the interfaces defined in the chromeperf package. These mocks are designed to facilitate hermetic unit testing of the Skia Perf service by simulating interactions with external Chrome Performance monitoring APIs and storage layers.

Design Philosophy

The module leverages the testify/mock framework and is maintained via mockery. This approach was chosen to ensure that the testing infrastructure remains synchronized with the primary interfaces. When the core chromeperf interfaces evolve—such as adding new parameters to anomaly queries or modifying the regression reporting structure—the mocks can be regenerated to reflect these changes, reducing the manual overhead of updating test suites.

By using these mocks, developers can:

  • Simulate API failures (e.g., non-200 status codes, network timeouts) to ensure robust error handling.
  • Validate that specific parameters, such as commit positions or trace names, are correctly passed to the transport layer.
  • Provide deterministic return values for complex data structures like AnomalyMap or ReportRegressionResponse without requiring a live backend.

Key Components

AnomalyApiClient

The AnomalyApiClient mock simulates high-level operations related to performance anomalies. It allows tests to define expectations for fetching anomaly data across several dimensions:

  • Range-based queries: Mocking GetAnomalies and GetAnomaliesTimeBased allows tests to simulate data retrieval over commit ranges or specific time intervals.
  • Revision-specific lookups: GetAnomaliesAroundRevision enables testing of logic that centers on a specific point in time or a specific commit.
  • Regression Reporting: The ReportRegression mock is critical for verifying the logic that identifies and pushes new performance regressions to the Chrome Perf dashboard, including the validation of metadata like median values before and after a change.

ChromePerfClient

This mock represents the lower-level transport layer. While AnomalyApiClient focuses on the “what” (anomalies), ChromePerfClient focuses on the “how” (generic HTTP-like requests). It mocks SendGetRequest and SendPostRequest, providing a way to test the underlying serialization and communication logic. This is particularly useful for verifying that the correct API endpoints and query parameters are constructed before being sent over the wire.

ReverseKeyMapStore

The ReverseKeyMapStore mock facilitates testing of the data translation layer. In Skia Perf, keys or trace names may be modified or obfuscated for storage or display. This mock simulates the persistence and retrieval of mappings between “modified” values and “original” values. It allows tests to verify that the system can correctly resolve internal identifiers back to their source values during data processing or anomaly reporting.

Testing Workflow

The standard workflow for utilizing these mocks involves setting expectations within a Go test, injecting the mock into the component under test, and asserting that the interactions occurred as predicted.

+-------------------+       +-----------------------+       +-------------------------+
|     Go Test       |       |  Component Under Test |       |   Mock AnomalyApiClient |
+-------------------+       +-----------------------+       +-------------------------+
          |                             |                               |
          | 1. Set Expectations         |                               |
          |---------------------------->|                               |
          | (On "GetAnomalies").Return()|                               |
          |                             |                               |
          | 2. Execute Logic            |                               |
          |---------------------------->|                               |
          |                             | 3. Call API Method            |
          |                             |------------------------------>|
          |                             |                               |
          |                             | 4. Return Mock Data           |
          |                             |<------------------------------|
          | 5. Assert Expectations      |                               |
          |---------------------------->|                               |
          | (AssertExpectations)        |                               |

The New... constructor functions in each file include a Cleanup registration. This design ensures that AssertExpectations is automatically called at the end of each test, preventing “silent” failures where a test passes even if an expected API call was never actually made by the code.

Module: /go/chromeperf/sqlreversekeymapstore

SQL Reverse Key Map Store

The sqlreversekeymapstore module provides a persistent storage mechanism for mapping sanitized Skia Perf parameter values back to their original Chromeperf identifiers. This is a critical utility for maintaining interoperability between the two systems, particularly during anomaly detection and cross-platform data lookups.

Design Rationale

When data flows from Chromeperf to Skia Perf, certain characters in test paths and parameter keys are considered “invalid” by Skia's internal naming conventions. To ensure compatibility, these characters are typically replaced with underscores (_).

This transformation is lossy. For example, both cpu:io and cpu-io might be sanitized to cpu_io. Because multiple distinct original values can map to the same sanitized value, it is impossible to programmatically “undo” the sanitization to find the original Chromeperf source of truth.

This module solves the problem by recording these transformations as they occur. By maintaining a lookup table, the system can deterministically resolve a sanitized Skia parameter back to the specific Chromeperf value it originated from, enabling accurate queries against Chromeperf's legacy APIs.

Key Components

Implementation (sqlreversekeymapstore.go)

The core logic is encapsulated in the ReverseKeyMapStoreImpl struct. It abstracts the database interactions required to store and retrieve these mappings.

  • Database Agnosticism: The store supports multiple backend dialects (Standard SQL and Google Spanner). It uses the config.DataStoreType to select the appropriate SQL syntax, specifically handling differences in INSERT ... ON CONFLICT behavior.
  • Idempotent Writes: The Create method is designed to be safe for concurrent or repeated calls. If a mapping for a specific ModifiedValue and ParamKey already exists, the database ignores the new insertion attempt.
  • Deterministic Lookups: The Get method allows callers to provide a sanitized value and its associated parameter key to retrieve the original string.

Schema and Data Integrity

The underlying database table, ReverseKeyMap, is structured to optimize for lookup speed and data consistency:

  • Primary Key: A composite key consisting of (modified_value, param_key). This ensures that for any given parameter category (like a test path component), a sanitized string can only point to one “correct” original string.
  • Persistence Strategy: The design assumes that while the table may grow as new test paths are discovered, the set of unique paths eventually stabilizes, causing the storage overhead to plateau.

Workflow: Mapping and Restoration

The following diagram demonstrates the lifecycle of a parameter value as it moves from Chromeperf to Skia and back again via the store:

[ Chromeperf ]          [ Sanitization ]          [ Skia Perf ]
Original Value   --->    Transformation   --->    Modified Value
"cpu:io"                ( ":" -> "_" )           "cpu_io"
      |                                              |
      |                 [ Store.Create ]             |
      +----------------------------------------------+
                               |
                      [ SQL ReverseKeyMap ]
                      Modified: "cpu_io"
                      ParamKey: "test_path"
                      Original: "cpu:io"
                               |
      +------------------------+
      |                 [ Store.Get ]
      v
[ Original Restored ] <--- Used for Anomaly Lookups in Chromeperf

Key Methods

  • New(db pool.Pool, dbType config.DataStoreType): Initializes the store with the appropriate SQL dialect based on the database provider.
  • Create(ctx, modifiedValue, key, originalValue): Persists a new mapping. Returns the originalValue if successful, or an empty string/error if a collision or validation issue occurs.
  • Get(ctx, modifiedValue, key): Retrieves the original value associated with the sanitized input. If no mapping exists, it returns an empty string without an error, signifying that no transformation was recorded for that specific pair.

Module: /go/chromeperf/sqlreversekeymapstore/schema

SQL Reverse Key Map Schema

The sqlreversekeymapstore/schema module defines the database structure required to maintain a mapping between sanitized Skia Perf parameter values and their original Chromeperf counterparts. This mapping is essential for maintaining interoperability between the two systems, specifically during anomaly lookups and cross-platform queries.

Design Rationale

When data is migrated or uploaded from Chromeperf to Skia Perf, “invalid” characters within test paths are replaced with underscores to comply with Skia’s data requirements. Because this transformation is lossy (multiple distinct original characters might all be mapped to the same underscore), it is mathematically impossible to deterministically reconstruct the original Chromeperf test path from the modified Skia Perf path without external metadata.

Without this schema, querying Chromeperf for anomalies based on a Skia Perf test path would be unreliable, as the system would not know which original characters the underscores represent.

By storing these transformations as they occur, the system can perform a reverse lookup to find the “source of truth” original value. The design assumes that the set of unique test paths is relatively stable; therefore, while the table grows initially as new paths are encountered, the storage overhead is expected to plateau once all existing test paths have been processed.

Key Components and Responsibilities

schema.go

This file defines the ReverseKeyMapSchema struct, which represents the relational table structure. The schema is designed around three primary attributes:

  • ModifiedValue: The sanitized string as it exists in Skia Perf (containing underscores).
  • ParamKey: The specific parameter category (e.g., a specific part of the test path).
  • OriginalValue: The raw, unmodified string as it exists in Chromeperf.

Data Integrity and Indexing

The schema enforces uniqueness through a composite primary key consisting of the ModifiedValue and the ParamKey.

  • Mapping Logic: The combination of a parameter key and its modified value must point to a unique original value. This ensures that the lookup remains deterministic.
  • Search Performance: By using the ModifiedValue and ParamKey as the primary key, the database is optimized for the most common workflow: taking a known Skia Perf parameter and looking up its original Chromeperf identity.

Workflow: Key Restoration

The following diagram illustrates how this schema facilitates communication between the two systems:

Chromeperf Path          Skia Perf Path            Reverse Key Map
(Original)               (Sanitized)               (Database Store)
----------------         ---------------           -----------------------------
"master/bot/cpu:io"  ->  "master/bot/cpu_io"  ->   Modified: "cpu_io"
                                                   ParamKey: "test_path"
                                                   Original: "cpu:io"
                                                           |
                                                           |
[Anomaly Detection]  <-  [Query Original]      <-   [Lookup via ModifiedValue]

Module: /go/clustering2

Clustering2 Module

The clustering2 module provides the logic for grouping performance traces based on their shapes using the k-means algorithm. It is primarily used within the Perf framework to identify patterns in telemetry data, such as regressions or improvements, by clustering similar behavioral trends across different test configurations.

Design Philosophy

The module is designed around the concept of “trace shapes.” Instead of looking at individual data points, it treats a series of values over time (a trace) as a multi-dimensional vector. By clustering these vectors, the system can discover that a specific set of tests all experienced a similar performance shift at the same point in time, even if the absolute values of their metrics differ.

Key Implementation Choices

  • K-Means for Shape Analysis: The module uses k-means clustering because it is efficient at grouping large sets of traces into a predefined number of clusters (K=50 by default).
  • Centroid-Based Summaries: Each cluster is represented by a “centroid”—the average shape of all traces in that cluster. This allows the system to characterize a potentially massive number of traces with a single representative trend line.
  • Step Detection Integration: Once clusters are formed, the module fits the centroids to step functions. This helps distinguish between clusters representing “noisy” data and those representing “meaningful” shifts (regressions or improvements).
  • Parameter Statistical Weighting: To help users understand what is common among traces in a cluster, the module calculates the percentage frequency of key-value pairs (e.g., arch=x86) within that cluster.

Key Components and Responsibilities

Cluster Calculation (clustering.go)

The primary entry point is CalculateClusterSummaries. It orchestrates the following workflow:

  1. Observation Conversion: Converts a dataframe.DataFrame into a slice of kmeans.Clusterable objects. Traces are normalized or processed via ctrace2 to ensure the clustering is based on the shape of the data rather than absolute magnitude.
  2. Iterative Refinement: Runs the k-means algorithm for a maximum of 100 iterations or until the total error change falls below a threshold (KMEAN_EPSILON).
  3. Distance-Based Sorting: After clusters are formed, members within each cluster are sorted by their distance to the centroid. The traces closest to the centroid are considered the most “representative” of that cluster's behavior.

Data Structures

  • ClusterSummary: Contains the centroid data, the list of representative trace keys, the results of the step-fit analysis, and a summary of the parameters common to the cluster.
  • ClusterSummaries: A container for all clusters found during a single run, including metadata like the K value used and the standard deviation threshold.

Parameter Summarization (valuepercent.go)

This component analyzes the metadata keys of all traces in a cluster to identify commonalities.

  • ValuePercent: Represents how often a specific key=value pair appears as a percentage of the total cluster size.
  • Human-Friendly Sorting: The SortValuePercentSlice function implements a specialized sorting logic. It groups values by their key (e.g., all config values together) and then sorts those groups by the highest percentage. This ensures that the most dominant traits of a cluster appear at the top of the report.

Workflows

Clustering Process

DataFrame (Traces)
      |
      v
[Convert to Clusterable Traces] <--- Normalize shapes
      |
      v
[Initialize K Centroids] <--------- Randomly select K traces
      |
      +----[ Loop: K-Means Iteration ]
      |          |
      |          v
      |      [Assign Traces to Nearest Centroid]
      |      [Recalculate Centroid Positions]
      |      [Calculate Total Error]
      |          |
      +----------+--- (Break if Error Change < EPSILON)
      |
      v
[Post-Processing]
      |
      +--> [Fit Centroids to Step Functions]
      +--> [Calculate Parameter Percentages]
      +--> [Sort Members by Distance to Centroid]
      |
      v
ClusterSummaries (Final Result)

Implementation Details

  • Distance Metric: The module relies on the Distance implementation provided by the ctrace2 package's ClusterableTrace, which typically measures the similarity between two floating-point arrays.
  • Centroid Calculation: Centroids are updated in each iteration by averaging the values of all traces assigned to that cluster (via ctrace2.CalculateCentroid).
  • Concurrency: The clustering process is currently synchronous within the CalculateClusterSummaries call, though it accepts a context.Context for cancellation and a Progress callback to report the total error back to the caller/UI.

Module: /go/config

Perf Configuration Module

The go/config module defines the structural and semantic requirements for configuring a Skia Perf instance. It serves as the single source of truth for the application's runtime behavior, governing how data is ingested, stored, queried, and notified.

High-Level Overview

Perf is a highly configurable system designed to handle diverse performance data sources. The configuration system is built around a central InstanceConfig struct, which is typically populated from a JSON file at startup. This module handles:

  • Data Structure: Defining the Go types that represent the configuration.
  • Schema Generation: Automatically creating JSON schemas from Go types to ensure documentation and validation stay in sync.
  • Validation: Providing a two-tier verification process (structural and semantic) to catch configuration errors before they reach production.

Design Decisions and Implementation Choices

Single Source of Truth via Reflection

Rather than maintaining a separate JSON schema file and Go struct, this module uses the invopop/jsonschema library. By performing reflection on the InstanceConfig struct, the system generates instanceConfigSchema.json. This ensures that any change to a Go field (like adding a new QueryConfig parameter) is automatically reflected in the validation logic and IDE autocompletion for configuration authors.

Separation of Structural and Semantic Validation

Validation is split into two distinct phases to maximize reliability:

  1. Structural: Handled by the generated JSON schema to verify types, required fields, and nesting.
  2. Semantic: Handled by custom Go logic in the validate submodule. This is crucial because a configuration might be valid JSON but logically broken (e.g., a notification template referencing a non-existent variable, or a Regex that uses unsupported syntax).

Duration Serialization

Standard Go time.Duration serializes to an integer (nanoseconds) in JSON, which is not human-readable. The module therefore implements a custom DurationAsString type that marshals to and from strings like "2h" or "10m", making JSON configuration files much easier to maintain and review.

Key Components and Responsibilities

InstanceConfig (config.go)

The root configuration object. It aggregates several sub-configs, each responsible for a specific subsystem:

  • DataStoreConfig: Defines where trace data lives. It supports Spanner as the primary datastore and allows configuring connection pools and caching layers (either in-memory LRU or Memcached via CacheConfig).
  • IngestionConfig & SourceConfig: Control the flow of data into Perf. It defines where files come from (Google Cloud Storage or local directories) and how to handle arrival events via PubSub (including “Dead Letter” topics for failing messages).
  • GitRepoConfig: Configures how Perf interacts with source control. It supports both CLI-based git and the Gitiles API. It also handles “commit number” logic, allowing Perf to map git hashes to sequential integers used for graphing.
  • NotifyConfig & IssueTrackerConfig: Manage regression alerts. These utilize Go text templates for subjects and bodies, allowing instances to customize how they report anomalies to developers.
  • QueryConfig: Customizes the “Explore” UI. It allows instances to set default parameter selections (e.g., always default stat to value) and define “Conditional Defaults” (e.g., if a user selects metric=cpu, automatically suggest stat=avg).

Configuration Validation (/validate)

This submodule ensures the provided JSON is safe to run. It doesn’t just check syntax; it performs “dry runs” of notification templates and compiles all regular expressions to ensure they are compatible with Go’s RE2 engine.

Command-Line Integration

The module provides AsCliFlags() methods for different service types (BackendFlags, FrontendFlags, IngestFlags). This allows the various Perf microservices to share a consistent set of command-line arguments (like --config_filename and --connection_string) while keeping their specific needs isolated.

Configuration Workflow

The following process describes how a configuration file moves from a static file to a running service:

[ config.json ]
       |
       v
+-----------------------+
|  Structural Check     | Checks: JSON types, required fields,
| (JSON Schema)         |         and valid nesting.
+-----------------------+
       |
       v
+-----------------------+ Checks:
|  Semantic Validation  | - Do Go templates compile?
|  (validate.go)        | - Are Regex patterns valid RE2?
+-----------------------+ - Are TileSizes logically consistent?
       |
       v
+-----------------------+
|  Global Config State  | The validated object is stored in
|  (config.Config)      | config.Config for the app to use.
+-----------------------+

Critical Constants

  • MaxSampleTracesPerCluster: Limits the number of traces shown in a cluster summary (default: 50) to maintain UI performance.
  • QueryMaxRunTime: Hard limit (10 minutes) on trace queries to prevent runaway database processes from exhausting resources.
  • MinStdDev: The floor for normalization (0.001); values smaller than this are treated as zero to avoid division-by-zero or noise amplification in regression detection.

Module: /go/config/generate

Purpose

The /go/config/generate module serves as a bridge between Go type definitions and runtime configuration validation. Its primary responsibility is to ensure that the InstanceConfig struct—the central configuration object for Perf—is accurately represented as a JSON Schema.

By automating the generation of this schema, the system guarantees that any structural changes made to the configuration in Go code are immediately reflected in the validation logic. This prevents the “drift” that often occurs when manual documentation or separate validation files are maintained alongside source code.

Design and Implementation

The module is implemented as a minimal Go binary designed to be executed via go generate.

Schema Synthesis

The core logic utilizes the jsonschema utility package to perform reflection on the config.InstanceConfig struct. This process transforms Go-specific metadata (such as struct tags, nested types, and field types) into a formal JSON Schema specification.

This approach was chosen to maintain a single source of truth. Instead of manually writing a JSON Schema to validate incoming configuration files, the Go struct itself defines the constraints. The generated schema at ../validate/instanceConfigSchema.json then acts as a portable artifact that can be used by:

  • Static validation tools.
  • IDE integrations for autocomplete and linting of configuration files.
  • Runtime validators that check user-provided configurations before the application starts.

Workflow

The generation process follows a linear path from Go source to a serialized JSON file:

[ Go Source Code ]
       |
       | (reflection)
       v
[ InstanceConfig Struct ] ----> [ jsonschema generator ]
                                         |
                                         | (serialization)
                                         v
                          [ instanceConfigSchema.json ]

Key Components

  • main.go: The entry point that orchestrates the generation. It explicitly links the config package (where the business logic definitions reside) with the jsonschema package (the transformation engine). It targets a specific output path in the validate directory, ensuring the generated schema is placed where the validation logic expects it.
  • InstanceConfig Integration: While not defined within this directory, the InstanceConfig struct from //perf/go/config is the critical input. The generator relies on the struct tags (like json:) and documentation comments within that struct to produce a human-readable and accurate schema.

Module: /go/config/validate

Perf Instance Configuration Validation

The go/config/validate module provides a robust validation layer for Skia Perf instance configurations. Its primary purpose is to ensure that JSON configuration files are not only structurally sound according to a schema but also semantically valid for the Perf runtime environment.

Overview

Configuration in Perf is complex, involving regular expressions, Go templates for notifications, and interdependent database settings. Simple JSON schema validation is insufficient for catching errors like an invalid regex or a notification template that references a non-existent field. This module bridges that gap by performing deep inspection of the configuration object before the application starts.

The validation process follows a two-tier approach:

  1. Structural Validation: Uses a JSON schema (instanceConfigSchema.json) to ensure types, required fields, and nesting are correct.
  2. Semantic Validation: Executes custom Go logic to verify templates, compile regular expressions, and check cross-field dependencies.

Key Components and Responsibilities

Schema Enforcement (instanceConfigSchema.json)

The module embeds a JSON schema that defines the structure of an InstanceConfig. This schema is the first line of defense, ensuring that mandatory blocks like data_store_config, ingestion_config, and git_repo_config are present. It also constrains the allowed properties for various sub-configs (e.g., QueryConfig, AuthConfig), preventing “silent” typos in configuration keys.

Semantic Validation Logic (validate.go)

The core validation logic resides in the Validate function. It performs several critical checks:

  • Notification Template Execution: For configurations using MarkdownIssueTracker, the validator doesn't just check if the template is valid Go syntax; it attempts to actually “dry-run” the template. It mocks data for commits, alerts, and clusters to ensure that the user-provided templates (subject and body) can be successfully expanded without runtime errors.
  • Regular Expression Compilation: Fields such as invalid_param_char_regex are compiled using Go's regexp package. This ensures that the patterns are compatible with RE2 syntax. Specifically, for invalid_param_char_regex, the validator enforces that the regex must match both a comma (,) and an equals sign (=), as these are fundamental delimiters in the Perf trace system.
  • Inter-dependency Checks: The module verifies logic that spans multiple configuration blocks. For example, it ensures that if notifications is set to a specific tracker type, the corresponding API key secrets are also provided. It also validates that CommitChunkSize in the query config is logically consistent with the TileSize in the data store config.
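The regex and template checks described above can be sketched in miniature. This is an illustrative reconstruction, not the module's actual code: the helper names `checkInvalidParamCharRegex` and `dryRunTemplate` are hypothetical, and the real validator mocks richer commit/alert/cluster data.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"text/template"
)

// checkInvalidParamCharRegex is a hypothetical sketch of the delimiter check:
// the pattern must compile as RE2 and must match both "," and "=", since
// those characters delimit trace keys in Perf.
func checkInvalidParamCharRegex(pattern string) error {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return fmt.Errorf("not valid RE2: %v", err)
	}
	for _, delim := range []string{",", "="} {
		if !re.MatchString(delim) {
			return fmt.Errorf("regex must match %q", delim)
		}
	}
	return nil
}

// dryRunTemplate expands a notification template against mocked data so that
// references to non-existent fields fail at validation time, not at runtime.
func dryRunTemplate(body string) error {
	tmpl, err := template.New("body").Parse(body)
	if err != nil {
		return err
	}
	mock := struct{ Commit struct{ GitHash string } }{}
	mock.Commit.GitHash = "abc123"
	var buf bytes.Buffer
	return tmpl.Execute(&buf, mock)
}

func main() {
	fmt.Println(checkInvalidParamCharRegex(`[^a-zA-Z0-9]`)) // matches both delimiters -> <nil>
	fmt.Println(checkInvalidParamCharRegex(`[0-9]`))        // matches neither -> error
	fmt.Println(dryRunTemplate(`Hash: {{ .Commit.GitHash }}`))
	fmt.Println(dryRunTemplate(`{{ .Commit.NoSuchField }}`) != nil) // bad field -> true
}
```

Note that `Execute`, not `Parse`, is what catches the bad field reference: Go templates are only bound to a concrete data shape at execution time, which is why a dry-run is necessary.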

Validation Test Suite (testdata/ and validate_test.go)

The module includes a comprehensive suite of fixtures to prevent regressions:

  • Golden Files: Validates all existing production configurations against the current logic.
  • Failure Cases: Includes invalid_regex.json (testing unsupported RE2 features like lookaheads) and invalid-notify-template.json (testing references to non-existent template fields).

Validation Workflow

The following diagram illustrates the lifecycle of a configuration file as it passes through this module:

[ JSON Config File ]
         |
         v
+-----------------------+
|  JSON Schema Check    | ----> [ Fail: Invalid types/missing keys ]
+-----------------------+
         |
         v
+-----------------------+      +-----------------------------------+
|  Semantic Validation  |      | - Compile Regex                   |
|      (Validate)       | <--> | - Dry-run Notification Templates  |
+-----------------------+      | - Verify cross-field logic        |
         |                     +-----------------------------------+
         v
+-----------------------+
| Load into Global Mem  | ----> [ Success: Perf proceeds to boot ]
|    (config.Config)    |
+-----------------------+

Implementation Details

The module provides two primary entry points:

  • InstanceConfigFromFile: Reads a file from disk, performs schema validation, unmarshals it into the Go struct, and then runs semantic validation.
  • LoadAndValidate: A higher-level wrapper that logs schema violations to the system logs and populates the global config.Config singleton if validation passes. This is typically called during the initial setup of the Perf server.
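The two-tier flow behind these entry points can be shown in miniature. This is a self-contained sketch, not the real API: `miniConfig` and `loadConfig` are hypothetical stand-ins for InstanceConfig and InstanceConfigFromFile, which read from disk and validate against the embedded JSON schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// miniConfig is a hypothetical stand-in for InstanceConfig.
type miniConfig struct {
	URL                   string `json:"url"`
	InvalidParamCharRegex string `json:"invalid_param_char_regex"`
}

// loadConfig mirrors the two-tier flow of InstanceConfigFromFile.
func loadConfig(raw []byte) (*miniConfig, error) {
	var cfg miniConfig
	// Tier 1: structural validation (a full JSON schema check in the real module).
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("structural: %v", err)
	}
	// Tier 2: semantic validation, e.g. compiling user-supplied patterns.
	if _, err := regexp.Compile(cfg.InvalidParamCharRegex); err != nil {
		return nil, fmt.Errorf("semantic: %v", err)
	}
	return &cfg, nil
}

func main() {
	cfg, err := loadConfig([]byte(`{"url": "https://perf.example.org", "invalid_param_char_regex": "[^a-z]"}`))
	fmt.Println(cfg.URL, err) // https://perf.example.org <nil>
	_, err = loadConfig([]byte(`{"invalid_param_char_regex": "(?=bad)"}`))
	fmt.Println(err != nil) // true: semantically invalid even though structurally sound
}
```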

Module: /go/config/validate/testdata

The testdata module provides a suite of JSON-based test fixtures used to verify the robustness of configuration validation logic. Its primary purpose is to exercise the parser's ability to distinguish between structurally sound configurations and those that contain semantic errors in complex fields, such as Go templates and regular expressions.

Design Intent

The data within this module is structured to target specific failure modes that are difficult to catch with simple schema validation:

  • Go Template Correctness: The notification system relies on Go templates to format alerts. The test data includes both a comprehensive “golden” file (valid-notify-template.json) containing all supported variables (e.g., .Commit.GitHash, .Alert.DisplayName) and failure cases (invalid-notify-template.json). This allows the validator to ensure that templates do not reference non-existent properties or use invalid syntax, which would otherwise lead to runtime errors during alert generation.
  • Regex Engine Compatibility: Go’s regexp package uses the RE2 syntax, which does not support certain features like lookahead assertions. The invalid_regex.json file specifically includes a lookahead pattern ((?=...)) to verify that the validator correctly identifies and rejects patterns that are incompatible with the underlying Go environment.
  • Schema Boundaries: The empty.json file serves as a baseline for testing how the system handles null or empty inputs, ensuring that required fields are enforced and default values are applied correctly when an incomplete configuration is provided.

Key Components and Responsibilities

The module is categorized by the specific validation criteria it aims to test:

Notification and Template Validation

The files valid-notify-template.json and invalid-notify-template.json define the expected interface for the notification engine.

  • Responsibility: They map out the complex object hierarchy available to the template executor, including Commit, Alert, Cluster, and StepFit objects.
  • Validation Depth: Beyond checking if a string is a valid template, these files help verify that the validator checks for the existence of nested fields, such as {{ .Alert.DirectionAsString }} or {{ index .ParamSet "device_name" }}.

Regular Expression Constraints

Configuration fields like invalid_param_char_regex and commit_number_regex are validated to ensure they can be compiled by the application.

  • Responsibility: invalid_regex.json provides a negative test case for patterns that might be valid in other engines (like Perl or JavaScript) but are unsupported in the project's Go environment.

Minimalist Configurations

  • Responsibility: empty.json tests the “fail-fast” capability of the validator. It ensures that the application does not attempt to boot with a blank configuration, requiring at least the presence of mandatory blocks like auth_config or data_store_config.

Validation Workflow

The following diagram illustrates how these files are typically utilized by the validation logic:

[ Configuration File ]          [ Validation Logic ]           [ Result ]
          |                              |                         |
          |---(Load JSON)--------------->|                         |
          |                              |-- Check Structure       |
          |                              |-- Compile Templates     |
          |                              |-- Compile Regex         |
          |                              |                         |
          |<-----------------------------|---(Report Errors)-------|
          |                              |                         |
          V                              V                         V
   (testdata/*.json)             (config/validate)          (Pass / Fail)

By maintaining these fixtures, the module ensures that any changes to the configuration schema or the notification engine's data model are accompanied by corresponding updates to the validation suite, preventing regressions in configuration parsing.

Module: /go/ctrace2

The ctrace2 module provides the bridging logic between raw performance trace data and the kmeans clustering engine. It defines how individual performance traces are normalized, compared, and averaged to facilitate anomaly detection and pattern discovery in Perf.

Core Responsibility: Clusterable Performance Data

The primary goal of ctrace2 is to transform raw, noisy performance data into a standardized mathematical representation. Performance traces often vary significantly in scale (e.g., one test might take 10ms while another takes 500ms), making direct comparison difficult.

To solve this, ctrace2 implements the ClusterableTrace struct, which satisfies the interfaces required by the kmeans package. This allows the clustering algorithm to treat performance traces as points in an N-dimensional space.

Data Normalization and Preparation

A key design choice in this module is the mandatory normalization of data via NewFullTrace. Before a trace can be used for clustering, it undergoes two critical transformations:

  1. Gap Filling: Missing data points (sentinels) are filled using linear interpolation or zero-filling via vec32.Fill. This ensures that traces with intermittent data can still be compared.
  2. Unit Standard Deviation: Traces are normalized so that their mean is 0 and their standard deviation is 1 (using vec32.Norm). This shift from absolute values to relative “shapes” allows the system to cluster traces that show similar behavior (e.g., a 10% performance regression) even if their absolute magnitudes differ.

The normalization includes a minStdDev parameter to prevent the amplification of “flatline” noise; if a trace is almost perfectly flat, it will not be scaled up to unit standard deviation, as doing so would exaggerate insignificant measurement jitter.
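The two transformations can be sketched as follows. This is an illustrative reconstruction using float64 and a trivial gap-fill strategy; the real code uses the vec32 helpers (vec32.Fill, vec32.Norm) on float32 slices and a dedicated sentinel value.

```go
package main

import (
	"fmt"
	"math"
)

const missing = math.MaxFloat64 // stand-in for the sentinel marking missing points

// normalize shifts a trace to mean 0 and scales it to unit standard deviation,
// clamping the divisor at minStdDev so near-flat traces are not amplified.
func normalize(vals []float64, minStdDev float64) []float64 {
	// Gap-fill: here, simply carry the previous value forward
	// (vec32.Fill is more sophisticated).
	filled := make([]float64, len(vals))
	prev := 0.0
	for i, v := range vals {
		if v == missing {
			v = prev
		}
		filled[i], prev = v, v
	}
	// Shift mean to 0.
	mean := 0.0
	for _, v := range filled {
		mean += v
	}
	mean /= float64(len(filled))
	// Scale to unit standard deviation, floored at minStdDev.
	variance := 0.0
	for _, v := range filled {
		variance += (v - mean) * (v - mean)
	}
	stddev := math.Sqrt(variance / float64(len(filled)))
	if stddev < minStdDev {
		stddev = minStdDev
	}
	out := make([]float64, len(filled))
	for i, v := range filled {
		out[i] = (v - mean) / stddev
	}
	return out
}

func main() {
	fmt.Println(normalize([]float64{10, missing, 12}, 0.1))
	fmt.Println(normalize([]float64{5, 5, 5}, 0.1)) // flatline: divisor floored, stays at 0
}
```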

Key Components

ClusterableTrace

This is the central data structure. It holds the identifying Key of a trace and its normalized Values.

  • Distance: Implements the Euclidean distance calculation. Since all traces in a clustering operation are expected to have the same length (guaranteed by the data preparation layer), it calculates the square root of the sum of squared differences between corresponding data points.
  • CalculateCentroid: Provides the logic to create a “representative” trace for a cluster. It calculates the arithmetic mean for every data point across all members of a cluster, resulting in a new ClusterableTrace that serves as the cluster's center.
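Both methods are standard vector operations. A minimal sketch (a hypothetical float64 miniature of ClusterableTrace, not the module's actual float32 implementation):

```go
package main

import (
	"fmt"
	"math"
)

// clusterableTrace is a hypothetical miniature of ClusterableTrace.
type clusterableTrace struct {
	Key    string
	Values []float64
}

// Distance: Euclidean distance, assuming equal-length traces.
func (t clusterableTrace) Distance(o clusterableTrace) float64 {
	sum := 0.0
	for i := range t.Values {
		d := t.Values[i] - o.Values[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// calculateCentroid averages each data point across the cluster members,
// producing the "representative" trace for the cluster.
func calculateCentroid(members []clusterableTrace) clusterableTrace {
	c := clusterableTrace{Key: "special_centroid", Values: make([]float64, len(members[0].Values))}
	for _, m := range members {
		for i, v := range m.Values {
			c.Values[i] += v
		}
	}
	for i := range c.Values {
		c.Values[i] /= float64(len(members))
	}
	return c
}

func main() {
	a := clusterableTrace{Key: "a", Values: []float64{0, 0}}
	b := clusterableTrace{Key: "b", Values: []float64{3, 4}}
	fmt.Println(a.Distance(b))                                      // 5
	fmt.Println(calculateCentroid([]clusterableTrace{a, b}).Values) // [1.5 2]
}
```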

Workflow: From Raw Trace to Cluster Centroid

The following diagram illustrates how raw performance data is processed to become part of a cluster:

Raw Trace Data        Normalization (NewFullTrace)         K-Means Processing
+-------------+      +---------------------------+       +-------------------+
| Key: "test" |      | 1. Fill missing values    |       |  Compare Distance |
| [10, e, 12] | ===> | 2. Shift mean to 0        | ===>  |  to other traces  |
+-------------+      | 3. Scale std dev to 1     |       +---------+---------+
                     +-------------+-------------+                 |
                                   |                               |
                                   v                               v
                         +-------------------+           +-------------------+
                         | ClusterableTrace  | <-------- | CalculateCentroid |
                         | [ -1.2, 0.1, 1.1] |           | (Average of group)|
                         +-------------------+           +-------------------+

Key Constants

  • CENTROID_KEY: When a cluster centroid is exported or visualized (e.g., in a DataFrame), it is assigned the special key special_centroid to distinguish the “average” shape from the actual measured data traces.

Module: /go/culprit

Culprit Module

The go/culprit module serves as the central authority for managing “culprits”—specific commits or sets of commits definitively identified as the cause of performance regressions—within the Skia Perf ecosystem. It provides the infrastructure to persist culprit data, link it to detected anomalies, and orchestrate the notification process to alert developers via external issue trackers.

High-level Overview

This module bridges the gap between the bisection engine (which discovers culprits) and the communication layers (which report them). It is responsible for the entire lifecycle of a culprit:

  1. Persistence: Storing the association between a commit and a performance regression.
  2. Mapping: Maintaining the N:M relationship between code changes, Anomaly Groups, and external Issue Tracker IDs.
  3. Formatting: Transforming raw performance data into human-readable alerts.
  4. Notification: Dispatching these alerts to the appropriate teams based on subscription configurations.

Design Decisions and Implementation

Service-Oriented Architecture

The module is structured as a gRPC service (/proto and /service). This design allows different components of the Skia Perf backend—such as automated bisection tools or manual triage UIs—to interact with culprit data through a unified interface.

Resilience Through Data Redundancy

A key architectural choice in the data schema (found in culprit_service.proto) is the local definition of the Anomaly structure. Instead of referencing external proto files from other services, culprit maintains its own representation. This ensures service independence: changes to how other modules represent anomalies won't cause cascading breaking changes in the culprit management logic.

Safety and “Mocking” in Production

Because performance alerts can be noisy, the service includes a “Subscription Guarding” mechanism (/service). Before sending a notification, the system checks an allowlist (SheriffConfigsToNotify). If a subscription is not yet verified, the service automatically reroutes the notification to a safe, internal “mock” destination. This allows for testing new configurations in production environments without spamming development teams.
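The guarding check reduces to a simple allowlist lookup with a safe fallback. A hypothetical sketch (the function and destination names are illustrative, not the service's actual API):

```go
package main

import "fmt"

// routeNotification is a hypothetical sketch of subscription guarding: only
// allowlisted sheriff configs deliver to their real destination; anything
// unverified is rerouted to a safe internal mock destination.
func routeNotification(subscription string, allowlist map[string]bool) string {
	if allowlist[subscription] {
		return subscription // verified: deliver for real
	}
	return "internal-mock-destination" // unverified: safe reroute, no spam
}

func main() {
	allow := map[string]bool{"skia-sheriff": true} // analog of SheriffConfigsToNotify
	fmt.Println(routeNotification("skia-sheriff", allow)) // skia-sheriff
	fmt.Println(routeNotification("new-config", allow))   // internal-mock-destination
}
```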

Decoupling Content from Delivery

The notification logic is strictly separated into two domains:

  • Formatters (/formatter): Use Go's text/template engine to turn protobuf data into Markdown. This allows for flexible, instance-specific report styling without changing the underlying logic.
  • Transports (/transport): Handle the actual network communication (e.g., Google Issue Tracker API). This allows the system to swap delivery methods (or use a NoopTransport for local development) without affecting the notification orchestration.

Key Components

Culprit Service (/service and /proto)

The orchestration layer. It implements PersistCulprit and NotifyUserOfCulprit. It coordinates between the storage layer and the notifier to ensure that when a bug is filed, the resulting Issue ID is recorded back into the database, creating a bidirectional link between the regression and the ticket.

SQL Culprit Store (/sqlculpritstore and store.go)

The persistent storage implementation.

  • Revision-Centric Identity: It treats the git revision (host/project/ref/revision) as the primary identity.
  • Upsert Logic: When a culprit is reported, the store checks for an existing record for that commit. If found, it appends the new anomaly_group_id rather than creating a duplicate, ensuring a single commit's impact is tracked holistically.
  • JSONB Mapping: It uses JSONB to store the GroupIssueMap, allowing it to track which specific anomaly group triggered which specific bug report in a single, efficient record.
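The upsert behavior can be modeled in memory. This is a sketch of the idea only; the real store issues SQL against a table keyed by revision, with GroupIssueMap persisted as JSONB.

```go
package main

import "fmt"

// culpritRecord is a hypothetical in-memory analog of a culprit row.
type culpritRecord struct {
	AnomalyGroupIDs []string
	GroupIssueMap   map[string]string // anomaly group ID -> issue ID (JSONB in SQL)
}

// store is keyed by the revision identity (host/project/ref/revision).
type store map[string]*culpritRecord

// upsert appends the anomaly group to an existing record for the revision
// rather than creating a duplicate row, mirroring the store's upsert logic.
func (s store) upsert(revision, groupID string) {
	rec, ok := s[revision]
	if !ok {
		rec = &culpritRecord{GroupIssueMap: map[string]string{}}
		s[revision] = rec
	}
	rec.AnomalyGroupIDs = append(rec.AnomalyGroupIDs, groupID)
}

func main() {
	s := store{}
	s.upsert("chromium/src/main/abc123", "group-1")
	s.upsert("chromium/src/main/abc123", "group-2") // same commit, second regression
	fmt.Println(s["chromium/src/main/abc123"].AnomalyGroupIDs) // [group-1 group-2]
	fmt.Println(len(s))                                        // 1 row, not 2
}
```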

Formatter (/formatter)

The “Logic-to-Markdown” engine. It takes Culprit and Anomaly protos and applies templates to generate subjects and bodies. It includes helper functions to calculate percentage changes and build URLs to the Perf UI or Git hosts.

Notifier (/notify)

The coordinator for alerts. It takes a request to notify, fetches the content from a Formatter, and passes it to a Transport.

Key Workflows

From Discovery to Notification

The following diagram illustrates how a culprit is processed from the moment a bisection tool identifies it to the point an external bug is filed:

[ Bisection Engine ]      [ Culprit Service ]      [ SQL Store ]      [ Issue Tracker ]
          |                        |                     |                    |
          |-- 1. PersistCulprit -->|                     |                    |
          |    (Commit, GroupID)   |-- 2. Upsert() ----->|                    |
          |                        |                     |                    |
          |-- 3. NotifyCulprit --->|                     |                    |
          |    (CulpritID)         |-- 4. Get Culprit -->|                    |
          |                        |                     |                    |
          |                        |-- 5. Format Msg ----|                    |
          |                        |                     |                    |
          |                        |-- 6. Transport Send -------------------->|
          |                        |                     |                    |
          |                        |<-- 7. Issue ID --------------------------|
          |                        |                     |                    |
          |                        |-- 8. AddIssueId() ->|                    |
          |<-- 9. Success (ID) ----|                     |                    |

Identification and Linking

The module manages the complex relationship between commits and regressions:

  1. Many-to-Many: One commit (Culprit) can cause many regressions (Anomaly Groups).
  2. Tracking: Each Anomaly Group within a Culprit record can have its own unique Issue ID, allowing the system to post updates to multiple relevant bug reports simultaneously.

Module: /go/culprit/formatter

Culprit and Anomaly Formatter

The formatter module is responsible for transforming raw performance regression data into human-readable notifications. It sits between the regression detection logic and the notification delivery systems (such as issue trackers or email services), ensuring that alerts contain actionable context like commit links, benchmark details, and performance delta percentages.

High-level Overview

The module provides a standardized way to generate subjects and message bodies for two primary scenarios:

  1. New Culprits: When a specific commit is identified as the cause of a performance regression.
  2. Anomaly Groups: When a collection of regressions is grouped together (e.g., across multiple bots or benchmarks) and needs a summary report.

By decoupling the data representation from the final message format, the system allows for flexible reporting that can be customized per-instance via configuration files.

Design Decisions and Implementation

Template-Driven Formatting

The core implementation uses Go's text/template engine. This choice allows the formatting logic to remain generic while supporting complex data injection. The MarkdownFormatter uses predefined default templates but can be overridden by an instance's IssueTrackerConfig.

This design supports:

  • Contextual Data Injection: Templates have access to TemplateContext (for culprits) and ReportTemplateContext (for anomaly groups), which include metadata about the subscription, the commit, and the anomalies themselves.
  • Custom Functions: The formatter registers helper functions like buildCommitURL, buildGroupUrl, and buildAnomalyDetails within the template engine. This moves complex string manipulation (like calculating percentage changes or formatting bot names) out of the raw template and into tested Go code.
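The FuncMap pattern looks roughly like this. The helper names `percentChange` and `buildCommitURL` and the template text are illustrative assumptions; the real helpers and templates live in formatter.go and the instance's IssueTrackerConfig.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// render shows how helpers registered via a FuncMap keep string manipulation
// (percentage math, URL building) in tested Go code rather than in templates.
func render(before, after float64, host, hash string) (string, error) {
	funcs := template.FuncMap{
		"percentChange": func(before, after float64) string {
			return fmt.Sprintf("%.1f%%", (after-before)/before*100)
		},
		"buildCommitURL": func(host, hash string) string {
			return host + "/+/" + hash
		},
	}
	const body = `Regression of {{ percentChange .Before .After }} at {{ buildCommitURL .Host .Hash }}`
	tmpl, err := template.New("body").Funcs(funcs).Parse(body)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	err = tmpl.Execute(&buf, struct {
		Before, After float64
		Host, Hash    string
	}{before, after, host, hash})
	return buf.String(), err
}

func main() {
	out, _ := render(100, 112, "https://skia.googlesource.com/skia", "abc123")
	fmt.Println(out) // Regression of 12.0% at https://skia.googlesource.com/skia/+/abc123
}
```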

Flexibility and Fallbacks

The NewMarkdownFormatter implements a “fallback” pattern. If the InstanceConfig does not provide a specific subject or body template, the module uses hardcoded defaultNewCulpritSubject, defaultNewReportBody, etc. This ensures the system is always capable of sending a notification even with a minimal configuration.

Interface-Based Architecture

The Formatter is defined as an interface. This abstraction allows the Perf system to swap implementations easily:

  • MarkdownFormatter: The standard implementation for systems that support markdown (like Monorail or GitHub).
  • NoopFormatter: A “no-operation” implementation used in testing or environments where notifications should be suppressed without changing the calling service's logic.
  • Mocks: Automated mocks are generated to facilitate unit testing of higher-level services (like the notification service) without requiring a full template rendering setup.

Key Components

Formatter Interface (formatter.go)

Defines the contract for all formatting implementations. It requires two methods:

  • GetCulpritSubjectAndBody: Formats a message for a specific Culprit proto.
  • GetReportSubjectAndBody: Formats a summary for an AnomalyGroup and its associated list of Anomaly protos.

MarkdownFormatter (formatter.go)

The primary implementation. It stores compiled templates and instance-specific URLs (like the host URL and commit URL templates). During initialization, it parses the templates and attaches the functional maps required to generate links.

Workflow: Generating an Anomaly Report

The following diagram illustrates how the formatter processes an anomaly group into a notification:

[ Data Source ]          [ MarkdownFormatter ]          [ Output ]
      |                         |                           |
      |-- AnomalyGroup -------->|                           |
      |-- Subscription -------->|-- Resolve Templates ------|
      |-- Top Anomalies ------->|-- Execute Funcs: ---------|
      |                         |   * buildGroupUrl         |
      |                         |   * buildAnomalyDetails --|--> Subject String
      |                         |                           |--> Body (Markdown)

NoopFormatter (noop.go)

A stub implementation that returns empty strings. It serves as a safe default when no notification formatting is required, preventing nil pointer exceptions in the orchestration services.

Data Contexts

  • TemplateContext: Contains the Culprit commit information and the Subscription details (e.g., the name of the team or component being notified).
  • ReportTemplateContext: Contains the AnomalyGroup (group ID, benchmark name) and a list of TopAnomalies, which are the most significant regressions selected to represent the group.

Module: /go/culprit/formatter/mocks

The go.skia.org/infra/perf/go/culprit/formatter/mocks module provides automated mock implementations of the Formatter interface. This module exists to facilitate unit testing for components within the Perf system that handle notifications and reports related to performance regressions (culprits) and anomaly groups.

Design and Purpose

The primary design goal is to allow developers to test high-level notification logic without depending on the actual formatting logic (which typically involves complex template rendering or external metadata lookups). By using these mocks, tests can verify that the system correctly passes data to the formatter and handles the resulting subject lines and message bodies as expected.

The implementation utilizes the testify mock framework. This choice allows for expressive test assertions, such as ensuring that a specific culprit or subscription triggered a formatting request, or simulating error conditions during the message generation process.

Key Components

Formatter

The Formatter struct in Formatter.go is the central component of this module. It is a mock object that simulates the behavior of a culprit/anomaly formatter. It implements two primary functional workflows:

  • Culprit Notification Formatting: Through GetCulpritSubjectAndBody, the mock simulates the creation of notification content for a specific performance culprit. It accepts a Culprit proto and a Subscription proto, returning a mocked subject string, body string, and error.
  • Anomaly Group Reporting: Through GetReportSubjectAndBody, the mock simulates the creation of reports for collections of anomalies. This is used in testing workflows where multiple regressions are aggregated into a single notification for a specific subscription.

Workflow Example

In a typical test scenario, the mock acts as a stand-in for the real formatter to verify the orchestration logic of the notification service:

[ Test Case ] -> [ Notification Service ] -> [ Mock Formatter ]
      |                  |                        |
      |-- 1. Setup expectation ------------------>| (Expect GetCulpritSubjectAndBody)
      |                  |                        |
      |-- 2. Trigger Action ---->|                |
      |                  |-- 3. Call Format() --->|
      |                  |<-- 4. Return Mock Data-|
      |                  |                        |
      |-- 5. Assert service used mock data ------>|

Usage in Testing

The NewFormatter function is the standard entry point for using this mock. It automatically registers cleanup functions with the Go testing framework (t.Cleanup), ensuring that expectations (e.g., “this method must be called exactly once”) are asserted at the end of the test execution without manual boilerplate.

Module: /go/culprit/mocks

The go/culprit/mocks module provides autogenerated mock implementations of the interfaces defined in the culprit package. Its primary purpose is to facilitate unit testing for components that depend on culprit persistence and retrieval without requiring a live database or a complex setup of the culprit.Store.

Design Philosophy

The module leverages testify/mock to provide a flexible way to simulate the behavior of the culprit storage layer. By using mocks, developers can:

  • Isolate Components: Test business logic in services like anomaly detection or regression analysis without being affected by the state of a real database.
  • Simulate Edge Cases: Easily trigger specific error conditions (e.g., database timeouts or unique constraint violations) or return specific protobuf structures that might be difficult to reproduce with real data.
  • Verify Interactions: Ensure that the calling code correctly invokes storage methods with the expected parameters, such as specific anomaly_group_ids or commit slices.

Key Components

Store.go

This file contains the Store struct, which mocks the primary interface for managing culprits. It is generated via mockery and mirrors the methods required to interact with culprit data in the Perf system.

The mock provides implementations for the following critical workflows:

  • Culprit Ingestion and Updates (Upsert): Allows tests to simulate the creation or updating of culprits associated with specific anomaly groups. It mimics the behavior of returning a list of generated culprit IDs based on provided commit information.
  • Metadata Association (AddIssueId): Simulates the linking of a culprit to an external issue tracker ID. This is crucial for testing the integration between Perf's internal culprit tracking and external bug reporting systems.
  • Data Retrieval (Get, GetAnomalyGroupIdsForIssueId): Facilitates testing of UI endpoints or reporting tools by returning pre-defined v1.Culprit protobuf messages or mapping issue IDs back to internal anomaly groups.

Typical Testing Workflow

When utilizing this module, a test typically follows the “Setup-Expect-Verify” pattern:

  Test Component          Mock Store             Internal Logic
      |                       |                       |
      |-- 1. Setup Mock ----->|                       |
      |                       |                       |
      |-- 2. Set Expectations |                       |
      |   (On "Get" return X) |                       |
      |                       |                       |
      |-- 3. Call Method ---->|---------------------->|
      |                       |                       |
      |                       |<-- 4. Call "Get" -----|
      |                       |                       |
      |                       |--- 5. Return X ------>|
      |                       |                       |
      |-- 6. Verify Mocks ----|                       |

The NewStore function simplifies this by automatically registering a cleanup function on the provided testing.T instance, ensuring that AssertExpectations is called when the test completes.
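The essence of what the generated mocks automate can be shown with a hand-rolled stand-in. This sketch is not mockery output; it just illustrates the record-calls, return-canned-data, verify-afterwards cycle that testify's On/Return/AssertExpectations provide.

```go
package main

import "fmt"

// fakeStore is a hand-rolled stand-in for a generated mock: it records calls,
// returns pre-configured data, and can be verified after the test runs.
type fakeStore struct {
	getCalls []string
	canned   string // pre-configured return value ("On Get, return X")
}

func (f *fakeStore) Get(id string) string {
	f.getCalls = append(f.getCalls, id) // record the interaction
	return f.canned
}

// assertExpectations is the manual analog of testify's AssertExpectations,
// which NewStore registers automatically via t.Cleanup.
func (f *fakeStore) assertExpectations(wantCalls int) error {
	if len(f.getCalls) != wantCalls {
		return fmt.Errorf("Get called %d times, want %d", len(f.getCalls), wantCalls)
	}
	return nil
}

func main() {
	s := &fakeStore{canned: "culprit-42"} // 1-2. setup and expectations
	got := s.Get("anomaly-group-7")       // 3-5. component under test calls Get
	fmt.Println(got)                      // culprit-42
	fmt.Println(s.assertExpectations(1))  // 6. verify -> <nil>
}
```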

Module: /go/culprit/notify

High-level Overview

The go.skia.org/infra/perf/go/culprit/notify module is responsible for orchestrating the notification process when performance regressions (anomalies) or their root causes (culprits) are identified. It acts as a bridge between the internal detection logic and external communication platforms, such as issue trackers.

The module abstracts the “how” of notification by separating the content generation (formatting) from the delivery mechanism (transport).

Design and Implementation Choices

The module follows a “Strategy” pattern to handle different notification environments and requirements.

  • Decoupling via Interfaces: The core logic relies on the CulpritNotifier interface. This allows the system to switch between real notifications and no-op (no-operation) modes easily, which is essential for local development or testing environments where sending real bugs is undesirable.
  • Separation of Concerns: The implementation divides the task into two distinct roles:
    • Formatter: Responsible for taking raw data (Protobuf messages for culprits or anomalies) and transforming them into human-readable subjects and bodies (typically Markdown).
    • Transport: Responsible for the actual network communication with external APIs (like Buganizer/Issue Tracker).
  • Factory Pattern for Configuration: The GetDefaultNotifier function acts as a factory that inspects the InstanceConfig. It determines whether to instantiate a functional IssueNotify system or a NoneNotify (noop) system based on the deployment configuration.

Key Components and Responsibilities

CulpritNotifier Interface

The primary contract for the module. It defines two main entry points:

  • NotifyAnomaliesFound: Triggered when a group of regressions is first detected.
  • NotifyCulpritFound: Triggered when an automated analysis has narrowed down a specific commit as the cause of a regression.

DefaultCulpritNotifier

This is the standard implementation of the CulpritNotifier. It does not contain formatting or transport logic itself; instead, it coordinates the two. It fetches the content from the formatter, passes it to the transport, and returns the resulting identifier (e.g., a Bug ID).
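The coordination role can be sketched with hypothetical miniatures of the two interfaces; the method and interface names below are illustrative simplifications of the real Formatter and Transport contracts.

```go
package main

import "fmt"

// formatter and transport are hypothetical miniatures of the real interfaces.
type formatter interface {
	GetCulpritSubjectAndBody(commit string) (subject, body string, err error)
}

type transport interface {
	SendNewNotification(subject, body string) (bugID string, err error)
}

type defaultNotifier struct {
	f formatter
	t transport
}

// NotifyCulpritFound contains no formatting or delivery logic of its own:
// it fetches content from the formatter and hands it to the transport.
func (n defaultNotifier) NotifyCulpritFound(commit string) (string, error) {
	subject, body, err := n.f.GetCulpritSubjectAndBody(commit)
	if err != nil {
		return "", err
	}
	return n.t.SendNewNotification(subject, body)
}

// Stubs standing in for the Markdown formatter and issue-tracker transport.
type stubFormatter struct{}

func (stubFormatter) GetCulpritSubjectAndBody(commit string) (string, string, error) {
	return "Regression caused by " + commit, "details", nil
}

type stubTransport struct{}

func (stubTransport) SendNewNotification(subject, body string) (string, error) {
	return "bug-123", nil // a NoopTransport would skip delivery entirely
}

func main() {
	n := defaultNotifier{f: stubFormatter{}, t: stubTransport{}}
	id, err := n.NotifyCulpritFound("abc123")
	fmt.Println(id, err) // bug-123 <nil>
}
```

Because both dependencies are interfaces, swapping in a NoopFormatter or NoopTransport (or a mock in tests) requires no change to the notifier itself.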

Integration Logic (notify.go)

This file handles the lifecycle of a notification. It ensures that if a Subscription (the configuration defining who should be alerted) is missing, the system fails gracefully or logs the omission rather than attempting to send a malformed alert.

Key Workflow: Notification Orchestration

The following diagram shows how the DefaultCulpritNotifier coordinates the flow of information from a detected event to an external system:

[ Caller ]             [ DefaultCulpritNotifier ]       [ Formatter ]       [ Transport ]
    |                             |                         |                   |
    |-- NotifyCulpritFound() ---->|                         |                   |
    |                             |-- GetSubjectAndBody() ->|                   |
    |                             |<-- (subject, body) -----|                   |
    |                             |                         |                   |
    |                             |-- SendNewNotification() ------------------->|
    |                             |                         |                   |
    |                             |<--------- (bug_id) -------------------------|
    |<----- (bug_id, err) --------|                         |                   |

Testing Utilities

The module includes a mocks sub-package (generated via mockery). This is used by other parts of the Perf system to simulate the notification layer. By using these mocks, developers can verify that the detection pipeline correctly triggers notifications with the right metadata without actually creating tickets in an issue tracker.

Module: /go/culprit/notify/mocks

High-level Overview

The go.skia.org/infra/perf/go/culprit/notify/mocks module provides automated mock implementations for the culprit notification system within Perf. Its primary purpose is to facilitate unit testing for components that depend on the notification logic—such as anomaly detection pipelines or culprit analysis engines—without triggering actual external notifications (e.g., creating real bug reports or sending emails).

Design and Implementation Choices

The module is built using testify/mock, which was chosen to provide a consistent, type-safe way to assert that notification events occur with the expected parameters.

A key design choice in this module is the use of mockery for code generation. By generating the CulpritNotifier mock automatically from an interface definition (presumably located in the parent notify package), the project ensures that the test infrastructure stays in lockstep with the production API. This prevents “stale” tests where a mock might satisfy an old version of an interface that has since changed.

The implementation focuses on two distinct stages of the Perf alerting lifecycle:

  1. Initial Anomaly Grouping: Handling a collection of detected performance regressions.
  2. Culprit Identification: Handling the specific commit identified as the root cause.

Key Components and Responsibilities

CulpritNotifier.go

This file contains the CulpritNotifier struct, which implements the interface required to simulate the notification subsystem. It manages the lifecycle of notifications through two primary mocked methods:

  • NotifyAnomaliesFound: This method simulates the process of alerting users about a new AnomalyGroup. It accepts the group details, the associated Subscription (which contains routing/alerting metadata), and a list of specific Anomaly objects. In a test environment, this allows developers to verify that the system correctly identifies which subscription should be notified when a set of regressions is detected.
  • NotifyCulpritFound: This method simulates the final stage of an investigation where a specific Culprit (a commit) has been identified. It validates that the notification logic correctly associates a culprit with the right subscription and returns a simulated notification ID (like a bug URL).

The file also includes a constructor, NewCulpritNotifier, which accepts any test value that provides a Cleanup method (such as *testing.T). This is a critical design pattern here, as it automatically registers AssertExpectations to run at the end of a test, ensuring that no expected notification calls were missed without requiring the developer to manually call assertion methods.

Key Workflow: Testing a Culprit Discovery

The following diagram illustrates how this mock integrates into a typical test suite workflow to validate the notification logic:

[ Test Case ]          [ Component Under Test ]        [ Mock CulpritNotifier ]
      |                         |                             |
      |-- Register Expectation ------------------------------>|
      |   (NotifyCulpritFound)  |                             |
      |                         |                             |
      |---- Execute Action ---->|                             |
      |                         |-- Call NotifyCulpritFound ->|
      |                         |                             |-- Record Call
      |                         |<------- Return Mock ID -----|
      |                         |                             |
      | <--- Verify Results ----|                             |
      |                         |                             |
      | (Test Cleanup)          |                             |
      |------------------------------------------------------>|-- AssertExpectations()
                                                              |   (Fails if not called)

Module: /go/culprit/proto

Overview

The go/culprit/proto module defines the communication interface and data schema for the Culprit Service. This service acts as the central authority for managing “culprits”—commits definitively identified as the cause of performance regressions—within the Skia Perf ecosystem.

By providing a unified gRPC interface, this module bridges the gap between the bisection engine (which discovers culprits), the storage layer (which persists them), and the notification systems (which alert developers).

Design and Implementation Choices

Resilience Through Data Redundancy

The Anomaly data structure in this module is a local definition rather than a reference to external proto files. While this creates some duplication with services like anomalygroup, it is a deliberate architectural choice to ensure service independence. If the anomaly grouping logic changes its internal representation, the Culprit Service remains stable, preventing breaking changes from cascading through the microservice architecture.

Granular Issue Tracking

A key design feature of the Culprit message is the group_issue_map. In large-scale performance monitoring, a single problematic commit (a culprit) often triggers multiple regressions across different platforms or benchmarks, which might be tracked in different anomaly groups. This mapping allows the service to:

  1. Maintain a many-to-many relationship between culprits and anomaly groups.
  2. Track specific issue IDs (e.g., Monorail or Buganizer) for each group, ensuring that updates are posted to the correct bug reports.

Global Commit Identification

The Commit message is designed to be repository-agnostic. By explicitly requiring host, project, and ref alongside the revision, the service can handle culprits across the diverse set of repositories monitored by Skia (e.g., Chrome, Skia, V8, Angle). This allows a single instance of the service to manage regressions originating from different source control providers.

Key Components

Service Interface (culprit_service.proto)

The CulpritService defines the lifecycle management of a regression:

  • Identification Persistence: The PersistCulprit method transforms the results of a bisection (a commit) into a permanent record linked to an anomaly group.
  • Asynchronous Notification: The service separates the detection of an anomaly from the identification of a culprit. NotifyUserOfAnomaly is used for initial “regression found” alerts, while NotifyUserOfCulprit is used for “culprit found” alerts, allowing the system to provide immediate feedback followed by precise root-cause analysis.

Data Structures

  • Anomaly: Captures the state of the world at the time of regression. It stores the “before” and “after” medians, which are critical for calculating the magnitude of the impact, and the dimensions (test name, bot name) to identify the specific environment affected.
  • Culprit: Represents a validated performance regression. It serves as an audit log, containing the commit metadata and a history of the notifications sent to developers.

Core Workflow: From Detection to Notification

The following diagram illustrates how the Culprit Service coordinates between the engine that finds bugs and the trackers that manage them:

Detection/Bisection Engine    Culprit Service         Database / External API
            |                        |                         |
            |--- PersistCulprit ---->|                         |
            |   (Commit + GroupID)   |---- Store Culprit ----->|
            |                        |                         |
            |- NotifyUserOfCulprit ->|                         |
            |   (CulpritID)          |---- Fetch Metadata ---->|
            |                        |                         |
            |                        |---- Create/Update Bug ->|
            |                        |<--- Return Issue ID ----|
            |                        |                         |
            |<------- Success -------|---- Update Map/Link --->|

Key Files

  • culprit_service.proto: The primary definition file. It defines the gRPC service and all message types used for requests and responses.
  • culprit_service.pb.go & culprit_service_grpc.pb.go: The compiled Go code. These files provide the concrete types and client/server boilerplate used by other Go services in the repository to interact with the Culprit Service.
  • generate.go: The automation hook that ensures the generated Go code stays in sync with the proto definitions.

Module: /go/culprit/proto/v1

This module defines the gRPC interface and data structures for the Culprit Service, a component of the Skia Perf ecosystem responsible for managing performance regression culprits and user notifications. It serves as the contract between the bisection engine (which identifies culprits) and the storage/notification layers.

Overview

The Culprit Service handles the lifecycle of a “culprit”—a specific commit identified as the cause of a performance regression. The module's primary responsibilities include:

  • Persistence: Storing and retrieving culprit data linked to specific anomaly groups.
  • Notification: Orchestrating alerts to users (e.g., creating bug tracker issues) when anomalies are detected or when bisection successfully identifies a culprit.

Design and Implementation Choices

Separation of Concerns

The data structures for Anomaly are intentionally duplicated from other services (like anomalygroup_service.proto). This redundancy allows the Culprit Service to evolve its definition of an anomaly independently of the grouping service, preventing tight coupling in a microservices environment where different teams might own different parts of the pipeline.

Mapping Culprits to Issues

The Culprit message includes a group_issue_map. This design choice recognizes that a single commit (culprit) might cause regressions across multiple different test suites or “anomaly groups.” By mapping anomaly_group_id to issue_id, the service can track which bugs were filed for which specific performance regressions associated with the same culprit.

Commit Metadata

The Commit message provides a normalized way to identify changes across different repositories. By including host, project, ref, and revision, it ensures that the service can uniquely identify commits even when Skia Perf monitors multiple disparate Git repositories (e.g., Chromium, V8, Skia).

Key Components

CulpritService

The gRPC service definition (culprit_service.proto) defines the following core operations:

  • PersistCulprit: Called after a bisection process identifies a culprit. It links a list of commits to an anomaly_group_id.
  • GetCulprit: Used by the UI or other services to fetch detailed metadata about identified culprits.
  • NotifyUserOfAnomaly: Triggered when a regression is first detected. This typically results in the creation of a tracking issue.
  • NotifyUserOfCulprit: Triggered when bisection finishes. It updates existing issues or creates new ones to alert developers that a specific commit they authored caused a regression.

Data Models

  • Anomaly: Contains the statistical context of a regression, including “before” and “after” medians and the specific dimensions (bot, benchmark, measurement) where the regression occurred.
  • Culprit: The record of a confirmed regression-causing commit, maintaining links to the anomaly groups it affected and the issues filed in response.

Typical Workflow

The following diagram illustrates how the Culprit Service interacts with the bisection and notification flow:

Bisection Engine        Culprit Service          Storage / Issue Tracker
      |                       |                            |
      |-- PersistCulprit ---->|                            |
      |   (Commits + GroupID) |---- Save to DB ----------->|
      |                       |                            |
      |-- NotifyUser -------->|                            |
      |   (Culprit IDs)       |---- Create/Update Issue -->|
      |                       |<--- Return Issue ID -------|
      |<-- Return Success ----|                            |

Files

  • culprit_service.proto: The source of truth for the service interface and message definitions.
  • culprit_service.pb.go: Generated Go structures for messages.
  • culprit_service_grpc.pb.go: Generated gRPC client and server interfaces.
  • generate.go: Contains the go:generate directives used to rebuild the protobuf and gRPC code via Bazel.

Module: /go/culprit/proto/v1/mocks

Overview

The go.skia.org/infra/perf/go/culprit/proto/v1/mocks module provides mock implementations of the Culprit Service gRPC server. Its primary purpose is to facilitate unit testing for components that depend on the CulpritService. By using these mocks, developers can simulate various service behaviors—such as successful culprit persistence or notification failures—without requiring a running gRPC backend or database.

Design and Implementation Choices

The module relies on the testify/mock framework to provide a flexible, programmable interface for defining expected behaviors during tests.

A critical implementation detail in this module is the manual handling of gRPC interface requirements. In standard Go gRPC implementations, a server must embed an Unimplemented... struct to ensure forward compatibility with the interface. Since many auto-generation tools (like mockery) may fail to include this embedding, it has been manually added to the CulpritServiceServer struct. This ensures the mock remains a valid implementation of the v1.CulpritServiceServer interface defined in the parent proto package.

Key Components

CulpritServiceServer

This is the central mock type. It mimics the behavior of the CulpritService by allowing tests to “stub” responses for specific RPC calls. It covers the following key service responsibilities:

  • Culprit Management: Functions like GetCulprit and PersistCulprit allow tests to simulate the retrieval and storage of performance regression culprits.
  • User Notification: Functions like NotifyUserOfAnomaly and NotifyUserOfCulprit enable verification of the notification logic, ensuring that the system correctly attempts to alert users when regressions or specific culprits are identified.

Initialization and Cleanup

The module provides a NewCulpritServiceServer constructor. This function is designed to integrate tightly with Go's testing.T. It automatically registers a cleanup function that calls AssertExpectations, which ensures that all programmed mock behaviors (e.g., “expect this function to be called exactly once”) were actually executed before the test finishes.

Typical Workflow

When testing a component that interacts with the Culprit Service, the workflow generally follows these steps:

1. Setup Mock       : Create a mock instance using NewCulpritServiceServer(t).
2. Set Expectations : Define what inputs are expected and what should be returned.
                      (e.g., On("PersistCulprit", ...).Return(&v1.PersistCulpritResponse{}, nil))
3. Injection        : Pass the mock into the component being tested.
4. Execution        : Run the logic of the component under test.
5. Verification     : The Cleanup function automatically verifies that the
                      component called PersistCulprit as expected.

Files

  • CulpritServiceServer.go: Contains the mock struct and method definitions for the gRPC service. This is where the manual embedding of v1.UnimplementedCulpritServiceServer resides to satisfy gRPC interface constraints.

Module: /go/culprit/service

Culprit Service

The culprit/service module provides a gRPC implementation for managing culprits and automating the notification process when performance regressions (anomalies) are identified in the Perf system. It acts as the orchestration layer between the storage of anomaly data and the external notification systems (e.g., bug trackers).

Overview

The primary purpose of this service is to handle the lifecycle of a “culprit”—a specific commit or set of commits identified as the cause of a performance change. It bridges several domains:

  1. Persistence: Saving identified culprits and associating them with specific Anomaly Groups.
  2. Lookup: Retrieving culprit details for UI or backend processing.
  3. Notification: Triggering alerts (filing bugs) based on the findings of bisection or anomaly detection.

The service is designed to be used by backend components that perform bisection and need to report their findings, or by systems that detect anomalies and require immediate user notification.

Key Components and Responsibilities

Culprit Persistence and Management

The service coordinates with culprit.Store and anomalygroup.Store to ensure that when a culprit is identified, the relationship between the problematic commit and the group of affected traces is maintained.

  • PersistCulprit: This workflow ensures atomicity at the application level. It first saves the culprit commits to the culpritStore and then updates the corresponding AnomalyGroup to include these new Culprit IDs. This bidirectional link is essential for tracking which regressions were caused by which commits.
  • GetCulprit: Provides a standard interface to fetch culprit metadata by ID.

Notification Logic

The service handles two types of notifications via the notify.CulpritNotifier interface. Both workflows rely on “Subscriptions” to determine where and how to file reports (e.g., which bug component, labels, or CC list to use).

  • Culprit Notification (NotifyUserOfCulprit): Triggered typically after a successful bisection. It loads the culprit details, identifies the relevant subscription associated with the anomaly group, and files a bug specifically for that culprit. It also records the resulting Issue ID back into the culprit record.
  • Anomaly Notification (NotifyUserOfAnomaly): Triggered when a group of anomalies is identified but a specific culprit may not yet be confirmed (or is being reported as a set). This files a broader report based on the anomaly group's characteristics.

Subscription Guarding and Mocking

A unique aspect of this service is the PrepareSubscription logic. Because performance alerts can be noisy or sensitive, the service includes a safety mechanism to prevent accidental notifications to end-users during testing or when onboarding new “Sheriff” configurations.

  • Allowlist Check: The service checks the InstanceConfig.SheriffConfigsToNotify list. If a subscription's name is not in this list, the service overwrites the bug destination (labels, components, CCs) with “mock” values. This ensures that even if a notification is triggered in a staging environment or for an unverified config, the bug is routed to a safe, internal hotlist rather than the actual team's queue.

Key Workflows

Culprit Discovery and Reporting

When a bisection tool finds a culprit, the following process occurs:

Bisection Tool -> PersistCulprit(Commits, GroupID)
                      |
                      v
              [ Culprit Store ] <--- Save Commits
                      |
                      v
            [ AnomalyGroup Store ] <--- Link Culprit IDs to Group
                      |
                      +------> Response (Culprit IDs)

Bisection Tool -> NotifyUserOfCulprit(CulpritIDs, GroupID)
                      |
                      +-----> Load Subscription (via Group Name)
                      |
                      +-----> PrepareSubscription (Safe-guarding/Mocking)
                      |
                      v
              [ Culprit Notifier ] ---> EXTERNAL: File Bug
                      |
                      v
              [ Culprit Store ] <--- Record Issue ID

Implementation Decisions

  • Separation of Concerns: The service does not implement the logic for how to file a bug or how to store a commit; it strictly orchestrates the calls between specialized stores and the notifier.
  • GRPC Integration: By implementing backend.BackendService, this module easily integrates into the Skia Perf backend infrastructure, inheriting standard service registration and (eventually) centralized authorization policies.
  • Mocking for Safety: The PrepareSubscription function is an intentional “shim” in the implementation. It allows the team to run the full service logic in production-like environments while ensuring that experimental anomaly groups do not spam developers until their configurations are explicitly added to the allowlist.

Module: /go/culprit/sqlculpritstore

SQL Culprit Store

The sqlculpritstore module provides a persistent SQL-based implementation for managing “Culprits” within the Skia Perf ecosystem. A Culprit represents a specific commit (defined by its host, project, ref, and revision) that has been identified as the root cause of one or more performance regressions.

Design Philosophy

The primary challenge in managing culprits is the N:M relationship between code changes, diagnostic clusters (Anomaly Groups), and tracking systems (Issue Trackers). A single commit can cause regressions in multiple tests, and a single bug report might track several related regressions.

To address this, the store is designed around the following principles:

  • Revision-Centric Identity: While records are assigned a UUID for internal database efficiency, the business logic treats the git revision as the primary identifier.
  • Contextual Linking: The store doesn't just track that a commit is a culprit, but also why (via AnomalyGroupIDs) and where it is being tracked (via IssueIds).
  • Explicit Mapping: Through the GroupIssueMap, the store maintains a JSONB-encoded link between specific anomaly groups and their corresponding issue IDs. This allows the system to determine exactly which regression triggered a specific bug report without complex join operations.

Key Components

CulpritStore (sqlculpritstore.go)

The main struct implementing the storage interface. It handles the translation between Go protobuf messages (pb.Culprit) and the underlying SQL schema.

  • Upsert Logic: The Upsert method is a critical path. It determines whether a culprit already exists based on its commit coordinates. If it exists, the method appends the new anomaly_group_id to the existing list and updates the last_modified timestamp. If not, it generates a new UUID and creates a record. This ensures that a single commit is never duplicated in the store, regardless of how many regressions it causes.
  • Issue Management: The AddIssueId method enforces data integrity by ensuring an issue can only be linked to a culprit if the associated group_id is already recognized as being caused by that culprit.

Schema (/schema)

Defines the table structure and indexing strategy. A notable implementation choice is the by_revision composite index: INDEX by_revision (revision, host, project, ref)

By leading with the revision (a high-entropy hash), the database avoids “hotspots” and distributes data more evenly across partitions compared to leading with low-entropy strings like host.

Key Workflows

Identifying and Storing a Culprit

When the system identifies a set of suspect commits for a regression:

Discovery Engine -> [Anomaly Group ID + Commits]
      |
      v
CulpritStore.Upsert()
      |
      +-- Check if (Host/Project/Ref/Revision) exists?
      |         |
      |         +-- YES: Append Anomaly Group ID to array; Update LastModified
      |         |
      |         +-- NO:  Generate UUID; Create new record
      v
[ Database Updated ]

Linking an Issue

When a user or automated system files a bug for a specific regression:

Issue Tracker -> [Culprit ID + Issue ID + Anomaly Group ID]
      |
      v
CulpritStore.AddIssueId()
      |
      +-- Verify: Is Anomaly Group ID linked to this Culprit?
      |         |
      |         +-- NO: Return Error (Prevents orphaned/incorrect links)
      |
      +-- Update: Append Issue ID; Update GroupIssueMap (JSONB)
      v
[ Database Updated ]

Implementation Details

  • Concurrency and Updates: The store uses the last_modified field (Unix timestamp) to allow external caches or services to synchronize and identify updated culprit records efficiently.
  • Data Consistency: The Upsert method performs a validation check to ensure that all commits in a single batch belong to the same repository (Host, Project, and Ref), preventing accidental cross-pollination of repository metadata.
  • JSONB Handling: The GroupIssueMap is stored as JSONB to provide flexibility for future metadata expansion while allowing the system to retrieve the full context of a culprit's impact in a single query.

Module: /go/culprit/sqlculpritstore/schema

Culprit Storage Schema

The schema package defines the foundational data structure for persisting “Culprits” within the Perf system's SQL storage. A Culprit represents a specific commit identified as the root cause of a performance regression.

Design Philosophy: Beyond Single Regressions

In a performance monitoring ecosystem, a single commit might trigger multiple regressions across different subsystems or test suites. Conversely, multiple anomaly groups might eventually point to the same underlying code change.

To handle this N:M relationship, the schema is designed to treat the Culprit as a central entity that tracks its associations across various diagnostic contexts (Anomaly Groups) and tracking systems (Issue Trackers).

Key Components and Implementation Choices

1. The Culprit Identity

A Culprit is uniquely identified by its source control coordinates: Host, Project, Ref, and Revision. While the system generates a UUID for primary key lookups, the business logic primarily interacts with the commit hash.

2. Relational Mapping and the Group-Issue Link

The schema manages the relationship between regressions and their resolutions through three specific fields:

  • AnomalyGroupIDs: Tracks which diagnostic clusters have flagged this commit.
  • IssueIds: Tracks which bug reports are associated with this commit.
  • GroupIssueMap: A JSONB field that explicitly maps a specific Anomaly Group to a specific Issue ID.

The inclusion of GroupIssueMap as a JSONB object allows the system to maintain the context of why a bug was filed (i.e., which regression group triggered it) without requiring complex join tables for metadata that is frequently accessed together. Note: There is a planned refactoring to consolidate AnomalyGroupIDs and IssueIds into this map to reduce data redundancy.

3. Performance and Indexing Strategy

The schema implements a composite index by_revision to optimize for the most common query pattern: “Is this specific commit already known as a culprit?”

The ordering of the index is a deliberate choice for database performance: INDEX by_revision (revision, host, project, ref)

By placing the revision (a high-entropy git hash) at the leading edge of the index, the storage engine can effectively distribute data across nodes and avoid “hotspots” that occur when sequential or low-entropy data (like a Host name) is used as the primary index prefix.

Logical Data Flow

Commit Hash (Revision)
      |
      v
[ Culprit Record ] <-----------+
      |                        |
      +--[ Anomaly Group A ] --+--> [ Issue 123 ]
      |                        |
      +--[ Anomaly Group B ] --+--> [ Issue 456 ]
      |                        |
      +-- [ GroupIssueMap ] ---+ (Stores the explicit links)

Schema Evolution

The schema currently supports LastModified as a Unix timestamp to facilitate cache invalidation and synchronization workflows, ensuring that external services can efficiently poll for updates to culprit statuses.

Module: /go/culprit/transport

Culprit Transport

The culprit/transport module provides a unified abstraction for dispatching notifications regarding identified culprits in the Skia Perf system. By decoupling the notification logic from the culprit detection engine, the system can support diverse communication channels—starting with automated issue tracking—while maintaining a consistent interface for the rest of the application.

Design Philosophy

The module is designed around the Transport interface, which abstracts the “where” and “how” of message delivery.

  • Interface-Driven Delivery: The core logic of the culprit detector does not need to know whether it is filing a bug in an issue tracker or sending an email. It simply provides a subscription configuration and the message content.
  • Context-Aware Routing: The transport implementations use Subscription metadata (defined in subscription/proto/v1) to determine routing details like component IDs, priorities, and CC lists.
  • Reliability and Observability: Given that notifications are critical for developer action, the transport layer includes built-in metrics to track delivery success and failure rates.

Key Components

Transport Interface

Defined in transport.go, this interface contains a single method: SendNewNotification. It returns a threadingReference (typically a bug ID or message URL) which allows the calling system to track the notification or perform follow-up actions (like posting comments on an existing thread).

IssueTrackerTransport

The primary production implementation of the Transport interface. It bridges Skia Perf with the Google Issue Tracker (Buganizer).

  • Authentication: It leverages the secret module to retrieve API keys and uses OAuth2 for authorized requests to the issuetracker service.
  • Data Transformation: It maps subscription-level configuration (e.g., BugComponent, BugPriority, Hotlists) into the specific data structures required by the Issue Tracker API.
  • Validation: It ensures that critical routing information, such as the BugComponent, is present before attempting to create an issue, preventing orphaned or unroutable notifications.

NoopTransport

A “No-Operation” implementation found in noop.go. This is used in environments where notifications are undesirable (e.g., local development or dry-run modes). It satisfies the interface by returning a successful result without performing any network I/O or side effects.

Workflow: Filing a Culprit Issue

The following diagram illustrates how the IssueTrackerTransport processes a notification request:

+----------------+       +------------------------+       +-------------------+
| Culprit        |       | IssueTrackerTransport  |       | Google Issue      |
| Service        |       |                        |       | Tracker API       |
+-------+--------+       +-----------+------------+       +---------+---------+
        |                            |                              |
        | 1. SendNewNotification()   |                              |
        |--------------------------->|                              |
        | (Subscription, Subj, Body) | 2. Map Proto to Issue        |
        |                            |    (Priority, CCs, etc.)     |
        |                            |                              |
        |                            | 3. POST /v1/issues           |
        |                            |----------------------------->|
        |                            |                              |
        |                            | 4. Return Issue ID           |
        |                            |<-----------------------------|
        |                            |                              |
        | 5. Increment Success Metric|                              |
        | 6. Return Issue ID String  |                              |
        |<---------------------------|                              |
        |                            |                              |

Implementation Details

  • Metric Integration: The IssueTrackerTransport maintains two counters: perf_issue_tracker_sent_new_culprit and perf_issue_tracker_sent_new_culprit_fail. These are essential for monitoring the health of the alerting pipeline.
  • Error Handling: If an issue creation fails, the transport attempts to serialize the issue data into the error message. This provides high-fidelity debugging information in the logs, allowing developers to see exactly what payload the Issue Tracker rejected.
  • Markdown Support: Notifications are sent with FormattingMode: "MARKDOWN", allowing the culprit detector to send rich text, links, and tables to the issue tracker for better readability.

Module: /go/culprit/transport/mocks

Culprit Transport Mocks

The culprit/transport/mocks module provides a programmatic double for the Transport interface used within the Skia Perf culprit detection system. Its primary purpose is to facilitate unit testing of components that handle culprit notifications—such as anomaly detection engines or alert managers—without triggering actual external side effects like sending emails or filing issue tracker tickets.

Design Philosophy

The module relies on the stretchr/testify/mock framework. This choice allows developers to write declarative tests that specify exactly how the notification system should be invoked. By using a mock rather than a fake or a manual stub, the system ensures that:

  1. Call Verification: Tests can assert that a notification was sent exactly once, or not at all, preventing duplicate or missing alerts.
  2. Input Validation: Tests can verify that the generated notification subject and body contain the expected metadata (e.g., commit hashes, regression magnitudes) before they are sent to a real user.
  3. Error Injection: Developers can simulate transport-layer failures (e.g., API timeouts, authentication errors) to ensure the culprit detection pipeline handles notification failures gracefully.

Key Components

Transport

The Transport struct is an autogenerated mock implementation. It mirrors the methods required to dispatch culprit information to various communication channels.

  • SendNewNotification: This is the core functional hook. In a production environment, this method would interface with external APIs (defined by the Subscription proto). In this mock implementation, it captures the context.Context, the Subscription configuration, and the message content. It returns a mockable string (typically representing a message ID or URL) and an error.

Usage Workflow

The mock is designed to be integrated into Go tests via the NewTransport constructor, which automatically handles test cleanup and expectation assertions.

+------------------+           +----------------------+           +------------------+
|    Unit Test     |           |   Component Under    |           |  Mock Transport  |
|  (Logic/Policy)  |           |        Test          |           |   (This Module)  |
+---------+--------+           +----------+-----------+           +---------+--------+
          |                               |                             |
          | 1. Setup Expectations         |                             |
          |------------------------------>|                             |
          | (Expect SendNewNotification)  |                             |
          |                               |                             |
          | 2. Execute Action             |                             |
          |------------------------------>| 3. Trigger Notification     |
          |                               |---------------------------->|
          |                               |                             | 4. Record Call
          |                               |                             | 5. Return Mock
          |                               | <---------------------------|    Values
          |                               |                             |
          | 6. Assertions (Auto-Cleanup)  |                             |
          | <-----------------------------|                             |

Implementation Details

The implementation of SendNewNotification uses type assertion logic to provide flexible return values. It can return static values configured via .Return() or dynamic values generated by a function passed to .Run(). This is particularly useful when the “Message ID” returned by the transport needs to be used in subsequent logic within the test case.

Module: /go/dataframe

Overview

The dataframe module provides a structured, table-like representation of performance measurement data, specifically optimized for the Skia Perf ecosystem. A DataFrame combines a set of time-series traces (TraceSet) with their corresponding commit metadata (ColumnHeader) and a calculated set of searchable attributes (ParamSet).

In the context of Perf, a “Trace” is a series of measurements associated with a unique key (a set of key-value pairs). The DataFrame organizes these traces so they can be visualized or analyzed over a common timeline of git commits.

Design and Implementation Choices

The “Why” Behind the DataFrame Structure

Unlike a simple collection of data points, a DataFrame represents a cohesive “slice” of performance history. The design choice to include Header, TraceSet, and ParamSet in a single object is driven by the need for self-contained data:

  • TraceSet: Stores the raw numerical data.
  • Header: Maps each index in a trace to specific commit information (hash, author, timestamp). This decoupling allows traces to be represented as simple arrays ([]float32) while still being linked to rich git history.
  • ParamSet: A computed summary of all keys present in the TraceSet. This is maintained within the object to allow the UI to quickly provide filtering options based only on the data currently loaded.

Column-Oriented Data Management

The module treats columns as discrete points in time (commits). Operations like MergeColumnHeaders and Join are implemented to handle the “sparse” nature of performance data, where different traces might have data for different sets of commits.

Trace Key A: [ 1.2,  nil,  1.4 ]  (Commits 1, 2, 3)
Trace Key B: [ nil,  2.2,  2.4 ]  (Commits 1, 2, 3)
                 |     |     |
           Header[0] Header[1] Header[2]

Memory and Performance Optimization

  • Compression: The Compress() method identifies and removes columns that contain no data across all traces. This is vital for reducing the payload size when sending data to a frontend, especially after a query returns a range where many commits might not have produced results for the requested traces.
  • Slicing: The Slice() method enables efficient pagination or windowing of data by creating sub-frames.
  • Sentinel Values: The module uses vec32.MissingDataSentinel to represent gaps in data, ensuring that trace arrays remain a fixed length relative to the Header while explicitly marking missing measurements.
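The compression step can be sketched as follows. This is a simplified model, not the real implementation: the sentinel constant here is a placeholder for `vec32.MissingDataSentinel`, and the real `Compress()` also compresses the `Header` in lockstep with the traces.

```go
package main

import "fmt"

// missing stands in for vec32.MissingDataSentinel; the actual value is an
// implementation detail of the vec32 package.
const missing = float32(1e32)

// compress drops every column (commit) for which no trace has data,
// mirroring what DataFrame.Compress() does conceptually. n is the number
// of columns (the Header length).
func compress(traces map[string][]float32, n int) map[string][]float32 {
	keep := make([]int, 0, n)
	for col := 0; col < n; col++ {
		for _, tr := range traces {
			if tr[col] != missing {
				keep = append(keep, col)
				break
			}
		}
	}
	out := make(map[string][]float32, len(traces))
	for key, tr := range traces {
		dense := make([]float32, 0, len(keep))
		for _, col := range keep {
			dense = append(dense, tr[col])
		}
		out[key] = dense
	}
	return out
}

func main() {
	traces := map[string][]float32{
		",test=a,": {1.2, missing, 1.4},
		",test=b,": {2.1, missing, 2.4},
	}
	// Column 1 is empty across all traces, so it is dropped.
	fmt.Println(compress(traces, 3))
}
```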

Key Components and Responsibilities

DataFrameBuilder Interface

The DataFrameBuilder defines how data frames are constructed from underlying storage. It abstracts the complexity of querying the database and joining it with git metadata.

  • Query-based fetching: NewFromQueryAndRange handles fetching data matching specific attributes (e.g., “arch=x86”) over a time window.
  • Key-based fetching: NewFromKeysAndRange is used when specific trace IDs are already known.
  • N-point fetching: Methods like NewNFromQuery are designed for “overview” or “sparkline” views, where the user wants exactly N points of history leading up to a specific time.

Join and Merge Logic

The Join and MergeColumnHeaders functions are the core of the module's data-alignment logic. They perform an “outer join” on the commit offsets.

  1. Header Merging: It identifies the unique union of all commits from two sources, sorted by their commit number/offset.
  2. Trace Alignment: It maps the data points from the original traces into the new, larger indices of the merged header, filling gaps with the missing data sentinel.

ParamSet Calculation

The BuildParamSet() method is responsible for reflecting the current state of the data. If traces are filtered out (via FilterOut), the ParamSet must be rebuilt so the UI doesn't display filtering options for data that is no longer present in the frame.
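A minimal sketch of this rebuild, assuming the structured “,key=value,” trace-key format Perf uses (the helper name and return shape are illustrative, not the real API):

```go
package main

import (
	"fmt"
	"strings"
)

// buildParamSet recomputes the inverted index of key/value pairs from the
// trace keys remaining in the frame, as BuildParamSet() must do after a
// FilterOut pass. Trace keys use Perf's ",k=v," structured format.
func buildParamSet(traceKeys []string) map[string][]string {
	seen := map[string]map[string]bool{}
	for _, key := range traceKeys {
		for _, pair := range strings.Split(strings.Trim(key, ","), ",") {
			kv := strings.SplitN(pair, "=", 2)
			if len(kv) != 2 {
				continue
			}
			if seen[kv[0]] == nil {
				seen[kv[0]] = map[string]bool{}
			}
			seen[kv[0]][kv[1]] = true
		}
	}
	ps := map[string][]string{}
	for k, vals := range seen {
		for v := range vals {
			ps[k] = append(ps[k], v)
		}
	}
	return ps
}

func main() {
	ps := buildParamSet([]string{",arch=x86,test=draw,", ",arch=arm,test=draw,"})
	fmt.Println(len(ps["arch"]), len(ps["test"])) // 2 1
}
```

If a FilterOut pass removed all `arch=arm` traces, rerunning this rebuild would drop “arm” from the filtering options offered by the UI.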

Data Merging Workflow

When joining two DataFrames (A and B) that represent different time ranges or different sets of traces:

DataFrame A Headers: [C1, C2, C4]
DataFrame B Headers: [C3, C4, C5]

1. Merge Headers -> [C1, C2, C3, C4, C5]
2. Map A indices -> 0->0, 1->1, 2->3
3. Map B indices -> 0->2, 1->3, 2->4
4. Resulting Trace for Key X:
   [ValA(C1), ValA(C2), ValB(C3), ValA/B(C4), ValB(C5)]
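The index-mapping steps above can be sketched as a merge over two ascending commit-offset lists. This is a simplified model of the MergeColumnHeaders workflow; the real code operates on ColumnHeader structs rather than bare ints.

```go
package main

import "fmt"

// mergeOffsets computes the sorted union of two ascending commit-offset
// lists and the index maps from each source into the merged header.
func mergeOffsets(a, b []int) (merged []int, aMap, bMap map[int]int) {
	aMap, bMap = map[int]int{}, map[int]int{}
	i, j := 0, 0
	for i < len(a) || j < len(b) {
		switch {
		case j >= len(b) || (i < len(a) && a[i] < b[j]):
			aMap[i] = len(merged)
			merged = append(merged, a[i])
			i++
		case i >= len(a) || b[j] < a[i]:
			bMap[j] = len(merged)
			merged = append(merged, b[j])
			j++
		default: // the same commit appears in both sources
			aMap[i], bMap[j] = len(merged), len(merged)
			merged = append(merged, a[i])
			i++
			j++
		}
	}
	return merged, aMap, bMap
}

func main() {
	merged, aMap, bMap := mergeOffsets([]int{1, 2, 4}, []int{3, 4, 5})
	fmt.Println(merged) // [1 2 3 4 5]
	fmt.Println(aMap)   // map[0:0 1:1 2:3]
	fmt.Println(bMap)   // map[0:2 1:3 2:4]
}
```

Trace alignment then copies each value to its mapped index and fills the remaining slots with the missing-data sentinel.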

Key Files

  • dataframe.go: Defines the primary DataFrame and ColumnHeader structs and implements the logic for merging, joining, and filtering data.
  • dataframe_test.go: Contains logic for validating the complex index-mapping required during joins and ensuring that ParamSet calculations correctly reflect the trace data.
  • mocks/: Provides a mock implementation of the DataFrameBuilder for testing higher-level components (like the Perf API handlers) without requiring a database.

Module: /go/dataframe/mocks

DataFrame Mocks

The /go/dataframe/mocks module provides auto-generated mock implementations for the DataFrameBuilder interface. These mocks are primarily used to facilitate unit testing in the Perf system by simulating complex data retrieval and frame construction processes without requiring a live database or the actual heavy-duty dataframe implementation.

Design and Purpose

The core of this module is the DataFrameBuilder mock, which is generated using mockery. The decision to provide these mocks in a dedicated sub-package allows other modules within the Skia infrastructure to write deterministic tests for components that depend on data loading, such as UI handlers, alert systems, or analysis pipelines.

By using these mocks, developers can:

  • Simulate Data Latency: Test how the system handles long-running data fetches by controlling the mock's response time.
  • Inject Edge Cases: Easily return empty DataFrames, specific error conditions, or DataFrames with unusual shapes (e.g., mismatched trace lengths) that might be difficult to reproduce with real data.
  • Verify Query Logic: Ensure that the calling code is passing the correct query.Query objects or time ranges to the builder.

Key Components

DataFrameBuilder.go

This file contains the DataFrameBuilder struct, which embeds mock.Mock from the testify framework. It implements the dataframe.DataFrameBuilder interface, covering several data retrieval patterns:

  • Query-Based Construction: Methods like NewFromQueryAndRange and NewNFromQuery allow tests to simulate fetching data based on structured queries.
  • Key-Based Construction: Methods like NewFromKeysAndRange and NewNFromKeys simulate fetching specific traces when the exact keys are already known.
  • Metadata Exploration: NumMatches and PreflightQuery allow testing of the “dry run” or “count” functionality often used in the Perf UI to tell a user how many traces a query will return before they execute it.

Typical Workflow in Tests

The mock is designed to be integrated into Go tests using the testify pattern.

  1. Initialization: Create the mock using NewDataFrameBuilder(t). This automatically registers cleanup functions to assert that all expected calls were actually made.
  2. Expectation Setting: Define what the mock should return when specific methods are called.
  3. Injection: Pass the mock into the component being tested (which should accept the dataframe.DataFrameBuilder interface).
  4. Verification: The testify framework handles the verification of calls during the test's cleanup phase.

+-------------------+       +-----------------------+       +-------------------------+
|    Unit Test      | ----> |  MockDataFrameBuilder | ----> | Component Under Test    |
+---------+---------+       +-----------+-----------+       +------------+------------+
          |                             |                            |
          | 1. Setup Expectations       |                            |
          |---------------------------->|                            |
          |                             |                            |
          | 2. Execute Action           |                            |
          |--------------------------------------------------------->|
          |                             |                            |
          |                             | 3. Call Interface Method   |
          |                             |<---------------------------|
          |                             |                            |
          |                             | 4. Return Mock Data        |
          |                             |--------------------------->|
          |                             |                            |
          | 5. Assertions/Cleanup       |                            |
          |<----------------------------|                            |

Implementation Details

The implementation uses mockery's standard template, providing flexible return value handling. For every method, it checks if a functional return has been provided (allowing for dynamic logic in mocks) or if a static value was registered via .Return().

Special attention is given to the progress.Progress interface, which is passed to most builder methods. The mock allows testers to verify that progress is being tracked or to ignore it using mock.Anything.

Module: /go/dfbuilder

dfbuilder

The dfbuilder module is responsible for constructing DataFrame objects by querying and aggregating performance trace data from a TraceStore. It acts as the orchestration layer that translates high-level user requests (time ranges, queries, specific keys) into efficient, often parallelized, database operations.

Overview

A DataFrame in the Perf system is a matrix of performance data where columns represent commits (ordered by time) and rows represent individual traces (identified by structured keys). The dfbuilder handles the complexity of:

  1. Commit Mapping: Resolving time ranges or counts into specific commit numbers using Git.
  2. Tile-Based Retrieval: Interfacing with the TraceStore's tiled architecture to fetch data efficiently.
  3. Aggregation: Merging data from multiple tiles into a single coherent matrix.
  4. Optimization: Using caching and parallel “pre-flight” queries to improve UI responsiveness.

Design Decisions

Tiled Parallelism

The TraceStore stores data in fixed-size “tiles” (e.g., 256 commits per tile). When a user requests a large time range, the dfbuilder calculates which tiles are involved and launches parallel goroutines to query each tile simultaneously. This avoids serial bottlenecks and utilizes the horizontal scalability of the underlying database.
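The fan-out can be sketched as follows. The tile-number arithmetic matches the fixed-size tiling described above; the query function, tile size constant, and use of sync.WaitGroup are illustrative stand-ins (the real builder uses errgroup with per-tile timeouts).

```go
package main

import (
	"fmt"
	"sync"
)

const tileSize = 256 // commits per tile; matches the typical configuration

// tilesForRange returns the tile numbers covering the commit range [begin, end].
func tilesForRange(begin, end int) []int {
	var tiles []int
	for t := begin / tileSize; t <= end/tileSize; t++ {
		tiles = append(tiles, t)
	}
	return tiles
}

func main() {
	// queryTile stands in for a per-tile TraceStore query.
	queryTile := func(tile int) map[string][]float32 {
		return map[string][]float32{fmt.Sprintf("tile-%d", tile): nil}
	}

	tiles := tilesForRange(250, 600) // commits spanning tiles 0, 1, and 2
	results := make([]map[string][]float32, len(tiles))
	var wg sync.WaitGroup
	for i, tile := range tiles {
		wg.Add(1)
		go func(i, tile int) {
			defer wg.Done()
			results[i] = queryTile(tile) // each goroutine writes only its own slot
		}(i, tile)
	}
	wg.Wait()
	fmt.Println(tiles) // [0 1 2]
}
```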

Backward Search (NewN...)

Many UI views request the “N most recent points.” Because performance data can be sparse (not every trace has data for every commit), the dfbuilder implements a backward-searching algorithm. It starts from the most recent commit and steps backward through tiles until it has collected exactly N data points for the requested traces, or until it hits a configurable maxEmptyTiles limit.
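A single-trace sketch of that backward search, under stated assumptions: collectN and pointsInTile are hypothetical names, and the real builder works tile-by-tile across many traces at once.

```go
package main

import "fmt"

// collectN walks tiles from newest to oldest, gathering up to n points for a
// trace. pointsInTile stands in for a per-tile TraceStore read; giving up
// after maxEmptyTiles consecutive empty tiles bounds the backward search.
func collectN(n, latestTile, maxEmptyTiles int, pointsInTile func(tile int) []float32) []float32 {
	var out []float32
	empty := 0
	for tile := latestTile; tile >= 0 && len(out) < n && empty < maxEmptyTiles; tile-- {
		pts := pointsInTile(tile)
		if len(pts) == 0 {
			empty++
			continue
		}
		empty = 0
		// Prepend (via a copy) so the result stays oldest-to-newest.
		out = append(append([]float32(nil), pts...), out...)
	}
	if len(out) > n {
		out = out[len(out)-n:] // keep only the n most recent points
	}
	return out
}

func main() {
	// Tiles 3 and 1 have no data for this trace.
	data := map[int][]float32{4: {1.4}, 2: {1.2, 1.3}, 0: {1.0, 1.1}}
	got := collectN(3, 4, 2, func(tile int) []float32 { return data[tile] })
	fmt.Println(got) // [1.2 1.3 1.4]
}
```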

Pre-flight Query Logic

To provide a responsive “Query” UI, the dfbuilder performs “pre-flight” queries. Instead of fetching all raw data, it:

  • Calculates how many traces match a partial query.
  • Dynamically builds a ParamSet of valid options for the next dropdown based on current selections.
  • Optionally removes one key from a multi-key query (sub-querying) to find all values for that key that would still yield a valid trace when combined with the remaining keys.

Key Workflows

Constructing a DataFrame from a Query and Range

When a user requests data for a specific time range and a trace query:

User Request (Range, Query)
          |
          v
[Git Service] <---- Resolve time range to Commit Numbers/Headers
          |
          v
[DFBuilder]   <---- Calculate required Tile Numbers
          |
   +------+------+ (Parallel Tile Queries)
   |             |
[Tile N]      [Tile N-1]
   |             |
   +------+------+
          |
          v
[TraceSetBuilder] <--- Merge results into a matrix
          |
          v
[DataFrame] (Compressed & Filtered)

Parent Trace Filtering

The module includes logic to filter out “parent” traces. In Perf, traces often have a hierarchical structure (e.g., benchmark, bot, test, subtest). If a specific subtest trace exists, the higher-level “parent” trace (which might be an average or aggregate) is often redundant in the same view. The filterParentTraces function uses a TraceFilter to prune the TraceSet to only include the most specific (leaf) nodes.

Key Components

builder

The primary implementation of the dataframe.DataFrameBuilder interface. It maintains references to:

  • perfgit.Git: For commit metadata.
  • tracestore.TraceStore: For raw data access.
  • tracecache.TraceCache: An optional caching layer to speed up trace ID lookups.

preflightProcessRecentTiles

Handles the logic for scanning the most recent data tiles to populate the query UI. It uses an errgroup to query multiple tiles in parallel, ensuring that the “count” and “available parameters” are calculated quickly even if the latest tile is partially empty.

fromIndexRange

A utility that bridges the gap between Git and the DataFrame. It converts a range of commit numbers into ColumnHeader objects containing the Git hash, author, and timestamp required for the DataFrame header.

Implementation Details

  • Timeouts: Individual tile queries are protected by singleTileQueryTimeout (default 1 minute). This prevents a single “poison” tile or a massive ingestion spike from locking up the server during regression detection.
  • Sparse Data Handling: The builder uses a vec32.MissingDataSentinel to represent gaps in traces where no data was recorded for a specific commit, ensuring the matrix alignment remains consistent across all traces.
  • Trace Source Info: Beyond raw values, the builder tracks SourceInfo, which points back to the original files (e.g., Google Cloud Storage paths) from which the data was ingested, allowing for “drill-down” features in the UI.

Module: /go/dfiter

The dfiter module provides a high-level abstraction for iterating over performance data stored in DataFrames. Its primary responsibility is to transform raw, potentially sparse data retrieved from a TraceStore into a series of smaller, dense “windows” suitable for regression detection algorithms.

High-Level Overview

In the Skia Perf ecosystem, regression detection involves analyzing traces over time to find “steps” or shifts in performance. Because these algorithms typically operate on a fixed-size window of points (defined by an Alert.Radius), the dfiter module acts as a bridge. It manages the complexity of querying the underlying data builders and then “slices” that data into the specific shapes required by the detection logic.

Design Decisions and Key Components

1. The Iterator Pattern (dfiter.go)

The module centers around the DataFrameIterator interface. This design allows the regression detection engine to remain agnostic of whether it is processing a single specific commit or scanning a wide range of history.

  • Exact Point Requests: When a Domain.Offset is provided, the iterator returns a single DataFrame centered on that specific commit.
  • Range Requests: When Offset is zero, the iterator behaves as a sliding window over a larger dataset.

2. Caching and Concurrency (dfIterProvider.go)

Dataframe generation is an expensive operation involving database lookups and data processing. To optimize this, the DfProvider implements:

  • In-Memory Caching: Stores recently built DataFrames keyed by a combination of the query, the end time, and the number of points requested.
  • SingleFlight Grouping: Uses golang.org/x/sync/singleflight to prevent “thundering herd” problems. If multiple concurrent requests ask for the same DataFrame (e.g., several different regression tasks triggered by the same alert), only one builder execution occurs; the result is then shared among all callers.

3. Trace Slicing Strategies (traceSlicer.go)

The module handles data differently depending on the regression algorithm being used. The choice of slicer is controlled by the DfIterTraceSlicer experiment flag and the Alert.Algo type.

  • K-Means Slicing (kmeansDataframeSlicer): This is the legacy approach. It treats the entire DataFrame as a unit, creating a sliding window across all traces simultaneously.

    Original DataFrame: [C1, C2, C3, C4, C5] (Radius=1, WindowSize=3)
    Iter 1: [C1, C2, C3]
    Iter 2: [C2, C3, C4]
    Iter 3: [C3, C4, C5]
    
  • StepFit Slicing (stepFitDfTraceSlicer): This strategy is optimized for individual trace analysis. It iterates through the DataFrame trace-by-trace rather than column-by-column.

    • Data Densification: A key feature is that it filters out MissingDataSentinel values. If a trace is sparse, the slicer collapses it into a dense array of valid points before applying the window. This ensures the regression algorithm always sees a full set of real data points, even if they were originally non-contiguous in time.

Key Workflows

Creating an Iterator

When NewDataFrameIterator is called, the following logic determines the data source:

User/System Request
      |
      v
Check Domain.Offset?
      |
      +--- [!= 0] ---> Find specific Commit Time -> Fetch exactly 2*Radius+1 points
      |                                              |
      +--- [== 0] ---> Check DfProvider Cache <------+
                            |                        |
                            +--- [Hit]  ---> Return Cached DF
                            |                        |
                            +--- [Miss] ---> Call DataFrameBuilder -> Cache Result
                                                     |
                                                     v
                                          Select Slicer Implementation
                                          (K-Means vs. StepFit)
                                                     |
                                                     v
                                          Return DataFrameIterator

Iterating with StepFit

The stepFitDfTraceSlicer provides a more granular iteration than traditional time-based slicing:

Trace A: [1.1, nan, 1.2, 1.3, nan, 1.4]
Trace B: [5.0, 5.1, 5.2, 5.3, 5.4, 5.5]

StepFit Filtering (Radius 1):
1. Collapse Trace A -> [1.1, 1.2, 1.3, 1.4]
2. Slice A (Win 1) -> [1.1, 1.2, 1.3]
3. Slice A (Win 2) -> [1.2, 1.3, 1.4]
4. Move to Trace B...
5. Slice B (Win 1) -> [5.0, 5.1, 5.2]
...and so on.

Key Files

  • dfiter.go: Entry point for creating iterators; handles “settling time” logic and metadata metrics.
  • dfIterProvider.go: Implements the caching layer and concurrency controls.
  • traceSlicer.go: Contains the logic for the different sliding window implementations.
  • traceSlicer_test.go: Provides extensive examples of how missing data is handled during slicing.

Module: /go/dryrun

The dryrun module provides the functionality to test Perf Alerts against historical data. This allows users to validate and tune alert configurations by observing which regressions would have been detected had the alert been active over a specific range of commits.

Core Logic and Design

The primary purpose of a dry run is to simulate the regression detection pipeline without triggering side effects like filing bug reports or sending notifications.

The module is designed around an asynchronous execution model. Because scanning historical data for regressions can be a long-running process, the module uses a “tracker” pattern. When a dry run is initiated, it returns a progress handle immediately, while the actual computation continues in a background goroutine.

Key Components

  • Requests struct: The central coordinator that manages dependencies required for regression detection, including access to Git data, trace shortcuts, and data frame builders.
  • StartHandler: The entry point for HTTP requests. It decodes a RegressionDetectionRequest, validates the alert configuration, and kicks off the background processing.
  • detectorResponseProcessor: A specialized callback function defined within the start handler. It acts as the glue between the raw clustering results and the user-facing progress updates. Its responsibilities include:
    • Converting raw cluster responses into formal Regression objects.
    • Merging multiple regressions found for the same commit into a single entry.
    • Enriching commit identifiers with full metadata (author, message, timestamp) from Git.
    • Updating the progress.Tracker so the UI can display real-time results.

Design Choice: Result Merging

A single dry run may execute multiple queries (e.g., if the alert uses a “group by” clause). This can result in multiple detections for the same commit across different sub-queries. The implementation uses a map (foundRegressions) keyed by CommitNumber to aggregate these results. As the detector finds new clusters, it merges them into the existing regression record for that commit, ensuring the user sees a consolidated view of all issues found at a specific point in time.
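The aggregation can be sketched as a small fold over detections. The types and the mergeFinding helper below are simplified stand-ins for the real Regression model, which carries full cluster summaries rather than strings.

```go
package main

import "fmt"

// CommitNumber and Regression are simplified stand-ins for the real Perf
// types; a dry run may find several clusters at the same commit.
type CommitNumber int

type Regression struct {
	Clusters []string // summaries of clusters detected at this commit
}

// mergeFinding folds a new detection into the per-commit map, mirroring the
// foundRegressions aggregation described above.
func mergeFinding(found map[CommitNumber]*Regression, c CommitNumber, cluster string) {
	r, ok := found[c]
	if !ok {
		r = &Regression{}
		found[c] = r
	}
	r.Clusters = append(r.Clusters, cluster)
}

func main() {
	found := map[CommitNumber]*Regression{}
	// Two sub-queries of the same alert detect the same commit.
	mergeFinding(found, 1024, "step up in benchmark=motion_mark")
	mergeFinding(found, 1024, "step up in benchmark=speedometer")
	mergeFinding(found, 2048, "step down in unit=ms")
	fmt.Println(len(found), len(found[1024].Clusters)) // 2 2
}
```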

Dry Run Workflow

The following diagram illustrates the lifecycle of a dry run request:

User Request (POST)
      |
      v
[StartHandler] ------------------------+
      |                                |
      | 1. Validate Alert Config       | 2. Return Progress ID
      | 3. Add to Progress Tracker     |    to Frontend immediately
      | 4. Launch Goroutine            |
      |                                +-----> [HTTP Response (JSON)]
      v
[Background Goroutine]
      |
      | calls regression.ProcessRegressions(...)
      |
      +-----> [detectorResponseProcessor] (Callback)
                    |
                    | A. Convert cluster results to Regression objects
                    | B. Lookup Git commit details
                    | C. Merge results for same commits
                    | D. Update Tracker with current findings
                    |
                    v
             [Progress Tracker] <------- (Frontend polls this)

Implementation Details

  • Asynchrony: The module explicitly uses context.Background() for the background goroutine instead of the request context (r.Context()). This prevents the dry run from being cancelled when the user's initial HTTP request terminates.
  • Data Enrichment: The RegressionAtCommit struct is used to package the raw regression data with the provider.Commit metadata. This ensures the frontend has all the information necessary to display a human-readable list of results without performing additional lookups.
  • Error Handling: Errors encountered during the background process are piped into the req.Progress object. This allows the system to report failures (like invalid queries or database timeouts) back to the user through the progress polling mechanism.

Module: /go/e2e

High-Level Overview

The go/e2e module provides a specialized test runner and infrastructure for executing end-to-end (E2E) tests within the Skia infrastructure. It serves as a bridge between the high-level Task Driver system and specific test suites (such as Node.js-based Puppeteer tests). The primary goal of this module is to automate the lifecycle of E2E testing: checking out the source code, executing tests via Bazel, capturing results in a standardized xUnit XML format, and persisting those results to Google Cloud Storage (GCS).

Design and Implementation Philosophy

The module is designed as a Task Driver, leveraging the task_driver library to ensure that E2E tests can be executed reliably on Swarming bots with full observability and step-by-step logging.

  • Standardized Result Reporting: While E2E tests may output diverse log formats, this runner wraps execution to generate xUnit-compatible XML. This design choice ensures that testing results can be ingested by standard CI reporting tools, providing a consistent view of failures, errors, and execution time.
  • Unique Traceability in GCS: To prevent result collisions and maintain a historical record of test runs, the runner implements a unique object prefix generation strategy. It uses time-based partitioning (YYYY-MM-DD/HH-MM-SS) and an iterative collision check to ensure every test run has a dedicated, non-overlapping location in the storage bucket.
  • Environment Flexibility: The runner supports both --local and bot-based execution. When running on a bot, it automatically handles complex environment setup, including Gerrit authentication and Git cookie management, which are abstracted away from the actual test logic.
  • Bazel-Centric Execution: The runner utilizes Bazel as the underlying execution engine. By using specific flags like --config=mayberemote and --nocache_test_results, it ensures that E2E tests (which are often sensitive to environment state) are executed fresh when requested, while still benefiting from RBE (Remote Build Execution) when available.

Key Components and Responsibilities

Test Runner (test_runner.go)

This is the core entry point. Its responsibilities include:

  • Environment Orchestration: Managing the scratch work directory and initializing the repository checkout using checkout and git_steps.
  • Execution Management: Invoking Bazel to run specific test targets. It parses the standard output of these commands using regular expressions to extract failure counts even when the underlying test process exits with an error.
  • Result Transformation: Converting the raw output of Node.js/Puppeteer tests into the TestSuites and TestSuite XML structures.
  • Artifact Persistence: Handling the authenticated upload of XML results to GCS, ensuring that developers can access detailed logs even after the Swarming bot has been reclaimed.

Infrastructure Integration (BUILD.bazel)

The build configuration defines the e2e-test-runner binary, which bundles the logic required to interact with Google Cloud API, Gerrit, and the internal Task Driver framework. It marks the runner as a public binary, allowing it to be triggered by various CI tasks across the repository.

Typical Execution Workflow

The following diagram illustrates how the runner coordinates the lifecycle of an E2E test run:

[ Task Driver Start ]
          |
          v
[ Setup Environment ] ----> [ Initialize Git/Gerrit Auth ]
          |                 [ Create Temp Workdir ]
          v
[ Perform Checkout ] -----> [ Ensure Git Clone at Target Revision ]
          |
          v
[ Execute Bazel ] --------> [ Run Node.js E2E Test Target ]
          |                 [ Capture Stdout and Exit Codes ]
          v
[ Process Results ] ------> [ Regex Parse Failures ]
          |                 [ Generate xUnit XML ]
          v
[ Archive Artifacts ] ----> [ Generate Unique GCS Path ]
          |                 [ Upload test_result.xml ]
          v
[ Task Driver End ]

Key Files

  • test_runner.go: Contains the main logic for the task driver, including the GCS upload logic, the Bazel execution wrapper, and the XML result generation.
  • BUILD.bazel: Defines the Go library and binary targets, specifying the dependencies on the cloud storage, Gerrit, and Task Driver libraries.

Module: /go/e2e/tests

High-Level Overview

The /go/e2e/tests module provides a framework and suite for end-to-end (E2E) testing of web-based applications. Unlike unit or integration tests that verify isolated logic or API contracts, this module validates the entire system stack by simulating real user interactions within a headless browser. Its primary goal is to ensure that critical user journeys—from page load to UI state changes—function correctly in a production-like environment.

Design and Implementation Philosophy

The testing strategy is built around programmatic browser control and behavior-driven assertions. The following design choices guide the implementation:

  • Browser Orchestration via Puppeteer: The module utilizes Puppeteer to automate Chrome. This choice allows tests to interact with the DOM, handle asynchronous rendering, and capture page metadata exactly as a user would.
  • Environment-Agnostic Browser Execution: To ensure tests run consistently across local development machines and CI environments, the module relies on the CHROME_BIN environment variable. This decouples the test logic from the specific installation path of the browser.
  • Resource Efficiency and Isolation: Tests are structured to balance performance with isolation. While a single browser instance is typically launched for a suite (via before hooks) to save overhead, individual test cases (it blocks) utilize fresh browser pages/tabs. This prevents state leakage between tests while avoiding the heavy cost of restarting the browser executable for every assertion.
  • Container-Friendly Configuration: The browser is launched with specific flags, such as --no-sandbox and --disable-dev-shm-usage. These are intentional design choices to ensure compatibility with containerized execution environments where shared memory may be limited or namespace sandboxing is restricted.

Key Components and Responsibilities

Test Execution and Assertions

The module relies on the Chai assertion library to provide a descriptive and readable syntax for validating application state. The responsibility of a test file is to define a specific user scenario, navigate to the target service, and verify outcomes such as page titles, element visibility, or data consistency.

Browser Lifecycle Management

Each test suite is responsible for managing its own lifecycle. This involves:

  1. Setup: Initializing the browser driver and configuring the base URL.
  2. Execution: Navigating to specific routes and interacting with the page.
  3. Teardown: Ensuring the browser process is terminated (via after hooks) to prevent memory leaks and orphaned processes in the testing infrastructure.

Typical Test Workflow

The following diagram illustrates the lifecycle of an E2E test within this module:

[ Test Suite Start ]
        |
        v
[ Launch Browser ] <--- Uses CHROME_BIN and container-optimized flags
        |
        +---- [ Setup Page ] <--- Create a new isolated tab/page
        |           |
        |           v
        |     [ Navigate ] <--- page.goto(baseUrl)
        |           |
        |           v
        |     [ Verify ] <--- Assert DOM state or page metadata
        |           |
        |           v
        |     [ Close Page ]
        |
        v
[ Close Browser ] <--- Teardown process to free resources
        |
        v
[ Test Suite End ]

Key Files

  • example_nodejs_test.ts: Serves as the reference implementation for new E2E tests. It demonstrates how to initialize the Puppeteer instance, manage the page lifecycle, and perform assertions using the Chai library.
  • BUILD.bazel: Defines the test targets. It identifies the necessary dependencies, such as the Puppeteer driver and assertion libraries, ensuring the test runner has access to the required Node.js modules and browser binaries.

Module: /go/favorites

High-Level Overview

The go/favorites module defines the core domain model and persistence interface for “Favorites” within the Perf system. A “Favorite” represents a saved configuration—specifically a name, description, and URL—that allows users to bookmark and revisit specific data visualizations or query states.

This module acts as a contract layer, decoupling the business logic of the Perf application from specific storage implementations (such as SQL databases or mock objects used in testing).

Design and Implementation Choices

The module is designed around a strictly defined interface to ensure that the Perf frontend can manage user preferences consistently, regardless of the underlying infrastructure.

  • Interface-Driven Design: By defining the Store as an interface, the system supports pluggable backends. This allows for the sqlfavoritestore implementation in production and the mocks implementation for unit testing.
  • Encapsulation of State: The Favorite struct encapsulates all metadata required to reconstruct a saved view. The inclusion of LastModified (as a Unix timestamp) enables the UI to sort or filter favorites by recency without requiring complex timezone handling at the database level.
  • Separation of Concerns (SaveRequest): The use of a SaveRequest struct separate from the Favorite struct is a deliberate choice to distinguish between input data (what a user provides) and stored records (which include system-generated fields like ID and LastModified).
  • Liveness Responsibility: An unusual design choice in this module is the inclusion of the Liveness method. While not strictly a “favorites” function, the Store was selected as a lightweight probe point for verifying database connectivity for the entire application. This avoids adding overhead to more performance-critical stores while still letting the system monitor its health.

Key Components

store.go

This file defines the data structures and the behavioral contract for favorite management.

  • Favorite Struct: The primary data model. It includes the UserId (typically an email address) to enforce ownership and a Url which contains the encoded state of the Perf dashboard.
  • Store Interface: Defines the lifecycle of a favorite:
    • Creation and Updates: Create and Update use the SaveRequest pattern to ensure only mutable fields are passed from the frontend.
    • Retrieval: Get fetches a single record by ID, while List provides a collection of all favorites owned by a specific user.
    • Security-Conscious Deletion: The Delete method requires both an id and a userId. This ensures that the storage layer enforces ownership, preventing a user from deleting another person's favorite by simply knowing its ID.

Favorite Lifecycle Workflow

The following diagram illustrates the lifecycle of a Favorite configuration from creation to retrieval:

  User Action             Data Structure                Store Interface
  -----------             --------------                ---------------
       |                        |                              |
[ Save Dashboard ] ----> [ SaveRequest ]                       |
       |             (Name, URL, Desc, User)                   |
       |                        |                              |
       |                        +----------------------> [ Create() ]
       |                                                       |
       |                                             (ID & Timestamp generated)
       |                                                       |
       v                        |                              v
[ View My List ] <------- [ []*Favorite ] <------------ [ List(UserId) ]
       |            (ID, Name, URL, Modified)                  |
       |                        |                              |
       |                        |                              |
[ Delete Entry ] ------- (ID + UserId) ----------------> [ Delete() ]

Module: /go/favorites/mocks

High-Level Overview

The go/favorites/mocks module provides a programmatic double of the favorites.Store interface. Its primary purpose is to facilitate unit testing for components within the Perf system that depend on “favorites” functionality—such as saving, retrieving, or listing user-defined favorite configurations—without requiring a live database or a real implementation of the storage layer.

By using these mocks, developers can isolate the business logic of higher-level services, ensuring that tests are deterministic, fast, and do not rely on external infrastructure like Spanner or SQL.

Design and Implementation Choices

The module utilizes automated mock generation via mockery. This choice ensures that the mock implementation remains perfectly synchronized with the favorites.Store interface definition found in go/favorites.

Why use mocks here?

  • Isolation: Testing a service that uses favorites (e.g., a frontend API handler) should not fail because of a database connection issue.
  • Behavior Simulation: The mocks allow testers to simulate specific scenarios that are difficult to trigger with a real store, such as specific database errors, timeouts, or the return of empty datasets.
  • Verification: Beyond just providing data, these mocks allow tests to assert that specific methods (like Update or Delete) were called with the expected arguments and the correct number of times.

Implementation with testify

The implementation is built on the github.com/stretchr/testify/mock framework. This allows for a fluent API when setting up expectations:

Test Code           Mock Store             Component Under Test
---------           ----------             --------------------
Setup Expectation -> [ On("Get").Return(...) ]

Execute Test ----------------------------> Calls Store.Get()
                                              |
                    [ Match Arguments ] <-----|
                    [ Return Fake Data ] ----> Result processed by logic

Assert Expectations <- [ AssertExpectations ]

Key Components

Store.go

This is the core file of the module, containing the Store struct. It implements every method required by the favorites.Store interface:

  • CRUD Operations: Create, Get, Update, and Delete methods are implemented to capture arguments and return values defined in the test suite.
  • Query Operations: The List method simulates retrieving all favorites associated with a specific user ID.
  • Infrastructure Checks: The Liveness method is mocked to allow tests to simulate the health status of the underlying storage engine.

NewStore Constructor

The NewStore function is the standard entry point for using this module in a test. It takes a testing.T (or compatible interface) which allows it to:

  1. Register the mock for the current test context.
  2. Automatically hook into the test's Cleanup phase to call AssertExpectations, ensuring that any configured expectations were actually met before the test finished.

Module: /go/favorites/sqlfavoritestore

This module provides a SQL-based implementation of the favorites.Store interface, enabling the persistence and management of user “Favorites” (saved URLs/configurations) within the Perf application. It bridges the gap between the high-level favorites domain logic and the underlying relational database (typically CockroachDB or Spanner).

Design and Implementation Choices

The module is built with a focus on performance for user-centric queries and reliability in a distributed database environment.

  • Relational Storage: By using a SQL backend, the module leverages robust indexing and ACID transactions. The schema is optimized for the primary access pattern: retrieving all favorites associated with a specific user.
  • Decoupled SQL Statements: SQL queries are defined as a mapped collection of constants (statements). This separation ensures that the Go logic for scanning rows remains clean while providing a single location to tune or update database queries.
  • Stateless Operations: The FavoriteStore struct is designed to be stateless, holding only a reference to the database connection pool (pool.Pool). This allows the store to be safely shared across multiple goroutines.
  • Liveness Monitoring: Beyond standard CRUD operations, the module includes a Liveness check. This is a strategic inclusion for cloud-native deployments, allowing the application's frontend to verify its database connectivity independently of functional queries.
  • Timestamping: The store handles LastModified logic at the application level using Unix timestamps. This ensures consistency in how time is recorded regardless of the database's internal time configuration.

Key Components and Responsibilities

sqlfavoritestore.go

This is the core of the module, implementing the FavoriteStore and its associated methods. It handles the translation between Go structs (from the favorites package) and SQL parameters.

  • CRUD Implementation:
    • Create: Inserts a new record and automatically sets the last_modified timestamp.
    • Update: Modifies existing records based on ID and provides feedback if no rows were affected (e.g., if the ID was invalid).
    • Delete: Requires both the ID and the UserId to perform a deletion. This is a security-in-depth choice, ensuring a user can only delete their own favorites even if an ID is guessed or leaked.
    • List: Retrieves a subset of fields (id, name, url, description) for all favorites owned by a user, optimized for summary views.
  • Error Handling: Uses skerr to wrap standard database errors with contextual information (e.g., “Failed to load favorite”), making it easier to trace issues in logs.

schema/ (Submodule)

Defines the architectural “blueprint” for the Favorites table. It ensures that the database indexes and constraints (like NOT NULL on URLs) align with the requirements of the Go code. It manages the identity strategy (UUIDs) and the by_user_id index critical for performance.

Key Workflow: Saving and Retrieving a Favorite

The following diagram illustrates how the store interacts with the database to persist a user's favorite:

[ Caller ]
    |
    | 1. Create(ctx, SaveRequest{UserId, Name, Url...})
    v
[ FavoriteStore ]
    |
    | 2. Generate current Unix timestamp
    | 3. Execute 'insertFavorite' SQL
    |    (UserId, Name, Url, Desc, Timestamp)
    v
[ SQL Database ]
    |
    | 4. Generate UUID for 'id'
    | 5. Persist record & Update 'by_user_id' index
    v
[ FavoriteStore ]
    |
    | 6. Return nil (Success)
    v
[ Caller ]
    |
    | 7. List(ctx, UserId)
    | 8. Execute 'listFavorites' SQL
    |    (Uses index for fast lookup)
    v
[ SQL Database ]
    |
    | 9. Return matching rows
    v
[ FavoriteStore ]
    |
    | 10. Scan rows into []*favorites.Favorite
    v
[ Caller ]

Module: /go/favorites/sqlfavoritestore/schema

This module defines the data architecture for persisting user “Favorites” within a SQL database. It serves as the single source of truth for the database schema, ensuring that the Go representation of a favorite aligns with the underlying storage constraints and indexing requirements.

Design and Implementation Choices

The schema is designed to support a multi-user environment where favorites are frequently queried by ownership but accessed via unique identifiers for updates and deletions.

  • Identity Management: The ID uses a UUID generated at the database level (gen_random_uuid()). This choice provides a globally unique identifier that prevents ID exhaustion and allows for future decentralization or data migration without key collisions.
  • User Association: Users are identified by their email addresses (UserId), sourced from uber-proxy authentication. Using a string-based email as the user identifier rather than an internal integer ID simplifies integration with the authentication layer and avoids an extra lookup table for user metadata.
  • Indexing Strategy: A dedicated index by_user_id is defined on the user_id column. This is a critical performance choice, as the primary access pattern for the application is expected to be “fetch all favorites for the current logged-in user.” Without this index, user-specific queries would require a full table scan.
  • Time Tracking: The LastModified field uses a standard Unix timestamp (integer). This provides a lightweight, timezone-agnostic way to handle sorting by recency or implementing cache invalidation logic on the client side.

Key Components and Responsibilities

schema.go

This file contains the FavoriteSchema struct, which uses specialized sql struct tags to define the DDL (Data Definition Language) properties of the table.

  • Content and Metadata: The schema separates core functional data (Url) from descriptive metadata (Name, Description). While the Url is mandatory (NOT NULL), the name and description are optional, allowing users to save links quickly without mandatory labeling.
  • Structural Integrity: By defining constraints like PRIMARY KEY and NOT NULL directly in the struct tags, the module ensures that the database enforcement layer matches the application's data requirements.

Key Workflow: User Data Retrieval

The schema is structured to optimize the flow from authentication to data display:

[ Authentication Layer ]
          |
          | Provides User Email
          v
[ sqlfavoritestore ]
          |
          | Query: SELECT * FROM Favorites WHERE user_id = {email}
          | (Uses 'by_user_id' Index for O(log n) lookup)
          v
[ SQL Database ]
          |
          | Returns set of FavoriteSchema records
          v
[ UI / Consumer ]

Module: /go/file

Overview

The file module defines the core abstraction for data ingestion within the Skia Perf system. It provides a common interface and data structure that decouples the ingestion logic (the “how” of processing data) from the storage and transport layers (the “where” of the data).

By standardizing how files are represented and discovered, this module allows the system to seamlessly switch between local development environments, automated testing, and production cloud-scale ingestion.

Core Abstractions

The File Struct

The file.File struct is the primary data transfer object. It encapsulates both the data and the metadata required for a single unit of ingestion.

  • Name: The identifier for the file, typically a path or URI.
  • Contents: An io.ReadCloser that provides the raw data. This allows the ingestion pipeline to stream data rather than loading entire files into memory, which is critical when handling large performance traces.
  • Created: A timestamp indicating when the file was created.
  • PubSubMsg: An optional reference to a pubsub.Message. This is included to allow downstream consumers to manually acknowledge or negatively acknowledge the message if the source is backed by a messaging service like Google Cloud Pub/Sub.

The Source Interface

The Source interface defines a unified mechanism for discovering files. It follows a “push-based” model through a Go channel:

type Source interface {
    Start(ctx context.Context) (<-chan File, error)
}

This design choice allows the ingestion engine to remain reactive. Whether the files are being walked on a local disk or arriving via real-time cloud notifications, the consumer simply listens to the channel until it is closed.

Design Rationale

  • Streaming over Buffering: The use of io.ReadCloser instead of a byte slice for file contents ensures that the memory footprint of the ingestion process remains low even if individual files are large.
  • Decoupled Lifecycle Management: The Start method is designed to be called once per instance. This prevents race conditions and ensures a predictable lifecycle for the background workers (goroutines) that different implementations (like dirsource or gcssource) use to monitor their respective backends.
  • Metadata Passthrough: Including the PubSubMsg in the File struct is a design compromise that breaks total isolation in favor of reliability. It allows the ingestion pipeline to signal to the underlying transport layer exactly when a file has been successfully persisted or if it needs to be retried.

Workflow: General Ingestion Pattern

The following diagram illustrates how the file module acts as the contract between various data providers and the main ingestion engine:

+-------------------+      +-------------------+      +-------------------+
|  Local Directory  |      |    GCS Bucket     |      |  Other Providers  |
|   (dirsource)     |      |   (gcssource)     |      |    (Future)       |
+---------+---------+      +---------+---------+      +---------+---------+
          |                          |                          |
          | implements               | implements               | implements
          +--------------------------+--------------------------+
                                     |
                             +-------v--------+
                             |  file.Source   |
                             +-------+--------+
                                     |
                                     | Start(ctx) returns <-chan file.File
                                     v
                             +----------------+
                             | Perf Ingestion |
                             |     Engine     |
                             +----------------+

Submodules and Implementations

While the file package defines the contract, its submodules provide the concrete implementations used across different environments:

  • dirsource: A filesystem-based implementation. It walks a local directory and streams the files found. It is primarily used for local development and unit tests.
  • gcssource: A production-ready implementation that listens to Google Cloud Pub/Sub notifications to ingest files from GCS buckets in real-time.

Files and Responsibilities

  • file.go: Defines the Source interface and the File struct. This file serves as the single point of truth for how data enters the Perf system, ensuring that adding a new storage backend only requires implementing the Source interface.

Module: /go/file/dirsource

Overview

The dirsource module provides a filesystem-based implementation of the file.Source interface. Its primary purpose is to abstract a local directory into a stream of data, allowing the Skia Perf system to ingest files directly from the disk.

This implementation is intentionally kept simple and is designed specifically for local development, demonstration modes, and unit testing. It allows developers to run the Perf ingestion pipeline against local files without requiring a cloud storage provider or a complex messaging infrastructure.

Design Rationale

The implementation of DirSource prioritizes ease of setup over high-performance production features like real-time file watching or sophisticated metadata tracking.

  • One-Shot Iteration: Instead of monitoring the directory for new events (e.g., via inotify), the module performs a one-time walk of the directory tree when Start is called. This simplifies the state management of the source, making it predictable for tests.
  • Asynchronous Streaming: To prevent the caller from blocking while the filesystem is scanned, Start launches a background goroutine that pushes files into a buffered channel. This allows the ingestion pipeline to begin processing the first file while the source is still discovering the next.
  • Modified Time as Proxy: Since many filesystems do not reliably track or expose an “original creation time” in a cross-platform manner, the module uses the file's ModTime (Modified Time) to fill the Created field in the file.File struct.
  • Safety Constraints: The module enforces that Start can only be called once. This prevents accidental duplicate processing of the same directory within the same lifecycle, ensuring data consistency in demo environments.

Key Components

DirSource

Defined in dirsource.go, this is the core struct implementing the file.Source interface. It maintains the absolute path to the target directory and tracks whether the ingestion process has already been initiated.

The Start method performs the following actions:

  1. Verifies the source hasn't been started.
  2. Creates a buffered channel (channelSize = 10) to hold file.File objects.
  3. Spawns a goroutine to execute filepath.Walk.
  4. For every non-directory file encountered, it opens a file handle and emits a file.File containing the path, the open reader, and the modification timestamp.

Workflow: File Discovery and Emission

The following diagram illustrates how DirSource transforms a filesystem structure into a stream of data for the ingestion pipeline:

+--------------+          +-----------------------+          +-------------------+
|  Ingestion   |          |      DirSource        |          |  Local Filesystem |
|   Engine     |          |    (Background)       |          |      (Disk)       |
+--------------+          +-----------------------+          +-------------------+
       |                              |                               |
       |  (1) Start(ctx)              |                               |
       |----------------------------->|                               |
       |                              |  (2) filepath.Walk(dir)       |
       |      (Returns Channel)       |------------------------------>|
       |<-----------------------------|                               |
       |                              |  (3) Open & Read Metadata     |
       |                              |<------------------------------|
       |                              |                               |
       |  (4) Receive file.File{}     |                               |
       |<-----------------------------|                               |
       |                              |  (5) Repeat for all files     |
       |                              |------------------------------>|
       |                              |                               |
       |  (6) Channel Closed          |                               |
       |<-----------------------------|                               |

Files and Responsibilities

  • dirsource.go: Contains the logic for scanning the filesystem and mapping os.FileInfo to the common file.File structure used by Perf.
  • dirsource_test.go: Validates that the directory walker correctly identifies files, handles multiple files in a single pass, and properly errors out if Start is invoked more than once.
  • testdata/: A collection of static JSON files used during testing to ensure the source correctly reads file contents and handles path resolution.

Module: /go/file/dirsource/testdata

Overview

The testdata directory serves as a controlled environment containing static filesystem artifacts used to validate the behavior of the dirsource module. Rather than relying on dynamically generated files or mock objects, this directory provides concrete, predictable JSON structures that allow for end-to-end testing of directory scanning, file reading, and data parsing logic.

Design Rationale

The primary design choice here is the use of representative static assets. By storing physical .json files on disk, the test suite can exercise the full I/O stack—ensuring that the module correctly handles file handles, directory pathing, and content deserialization in a way that matches real-world usage.

Key considerations for these test files include:

  • Schema Consistency: The files (filea.json, fileb.json) follow a uniform structure ({"status": "..."}). This allows tests to verify that the dirsource implementation can iterate through multiple files and map them to a consistent internal data model or interface.
  • Path Resolution Testing: These files enable the verification of recursive or non-recursive directory crawling. By having multiple files in a single flat structure, the module can test its ability to identify, filter, and ingest specific file extensions (JSON) while ignoring others if necessary.

Key Components

The module is comprised of distinct JSON payloads that represent different data states:

  • filea.json: Acts as the primary data point for positive testing. It contains a standard “status” string used to verify that the scanner successfully opens a file and extracts its content accurately.
  • fileb.json: Provides a secondary data point. This is used to ensure that the dirsource logic correctly handles collections of files, verifying that the ingestion process doesn't stop after the first successful read and that it maintains data integrity across multiple distinct sources.

Ingestion Workflow

The typical interaction between the parent module and these files follows this process:

+-------------------+       +-----------------------+       +-------------------+
|  dirsource logic  | ----> |  Read /testdata/ dir  | ----> | Map JSON content  |
+---------+---------+       +-----------+-----------+       +---------+---------+
          |                             |                             |
          | (1) Path discovery          | (2) File I/O                | (3) Validation
          v                             v                             v
   [ filea.json ] <---------------------+---------------------> [ fileb.json ]
   [ "A test..." ]                                              [ "just another" ]

This workflow ensures that the system can navigate the filesystem and translate raw disk bytes into structured application data.

Module: /go/file/gcssource

GCS Source Module

The gcssource module provides an implementation of the file.Source interface specifically for Google Cloud Storage (GCS). It is designed to enable real-time ingestion of performance data files as they are uploaded to GCS buckets.

Overview

The primary purpose of this module is to bridge GCS storage events with the Perf ingestion pipeline. It leverages GCS Pub/Sub notifications to detect new file arrivals, filters those files based on configuration, and streams the file contents for processing.

By using an event-driven approach rather than polling, the module ensures that the system reacts immediately to new data while minimizing unnecessary API calls to GCS.

Key Components and Responsibilities

GCSSource

The central struct GCSSource manages the lifecycle of the ingestion source. Its responsibilities include:

  • Subscription Management: Setting up and maintaining a connection to a Google Cloud Pub/Sub topic.
  • Event Handling: Listening for messages that indicate a new object has been created in a bucket.
  • Validation and Filtering: Ensuring that only relevant files are passed into the pipeline.
  • Resource Management: Providing an io.ReadCloser for each discovered file, allowing the consumer to read data directly from GCS.

Filtering Logic

The module implements a multi-stage filtering process to determine if a file should be ingested:

  1. Filename Patterns: Uses a filter.Filter (configured via AcceptIfNameMatches and RejectIfNameMatches) to include or exclude files based on regex-like patterns.
  2. Source Path Restriction: Checks if the file resides within the allowed prefixes defined in the SourceConfig.Sources list. This prevents the ingestion of files from unauthorized or unrelated directories within the same bucket.

Reliability and Acknowledgment

The module carefully manages Pub/Sub message acknowledgments (Ack/Nack) to ensure no data is lost:

  • Ack: If a message is malformed (invalid JSON) or the file is explicitly rejected by filters, it is acknowledged to prevent it from being retried.
  • Nack: If a transient error occurs (e.g., GCS API is down, or the file cannot be read), the message is negatively acknowledged so it can be redelivered and retried later.
  • Dead Letter Support: If dead letter collection is enabled in the configuration, the logic shifts to prioritize moving failing messages to a dead letter queue rather than infinite retries.

Design Decisions

Single Subscriber Strategy

The module defaults to a low number of parallel receives (controlled by maxParallelReceives). This is an intentional design choice to maintain a predictable load on the ingestion system and to simplify the management of GCS read streams.

Manual Deadline Management

The module disables automatic deadline extensions by the Pub/Sub library (MaxExtension = -1). This forces the system to either process a message or let it time out quickly. This prevents the “stuck ingestor” problem where a single problematic message holds up the entire queue because its deadline is being automatically extended indefinitely.

Data Flow Workflow

GCS Bucket (New File)
       |
       v
Cloud Pub/Sub Topic
       |
       v
GCSSource.Receive (Pub/Sub Message)
       |
       |--[ Deserialize JSON ]--> (If invalid: Ack & Drop)
       |
       |--[ Apply Filter ]------> (If rejected: Ack & Drop)
       |
       |--[ Check Sources ]-----> (If no match: Ack & Drop)
       |
       |--[ Get GCS Reader ]----> (If GCS error: Nack & Retry)
       |
       v
file.File channel (Streamed to Ingestor)

Key Files

  • gcssource.go: Contains the core logic for the GCSSource struct, the Pub/Sub message listener, and the GCS object retrieval logic.
  • gcssource_manual_test.go: Provides integration tests using GCP emulators to verify the end-to-end flow from Pub/Sub message publication to file channel output.

Module: /go/filestore

High-Level Overview

The filestore module provides a unified abstraction for interacting with different storage backends (Local and Google Cloud Storage) through the standard Go io/fs interface. It is designed to allow the Skia Perf system to remain storage-agnostic, enabling high-level components to consume data—primarily ingestion files—without needing to know whether the source is a physical disk or a cloud bucket.

The module focuses on a read-only, stream-oriented pattern, prioritizing simplicity and compatibility with the Go standard library over exhaustive implementation of filesystem metadata features.

Design Decisions and Implementation

Unified Interface (fs.FS)

The decision to wrap storage backends in fs.FS rather than creating a custom storage interface allows the system to leverage Go’s rich ecosystem of standard library tools (e.g., io.ReadAll, bufio.NewScanner). This abstraction ensures that unit tests can swap out a production GCS backend for a local directory or an in-memory filesystem without changing any business logic.

Common Path Handling vs. Backend Specifics

While both backends implement the same interface, they handle path resolution differently based on the nature of the underlying storage:

  • Local Backend: Implements a “chrooted” approach. It anchors all operations to a specific root directory on the local disk. It uses path translation to ensure that even if absolute system paths are provided, they are resolved relative to the configured storage root, preventing accidental access to unauthorized parts of the host filesystem.
  • GCS Backend: Implements a URL-based approach (gs://<bucket>/<path>). It treats the entire GCS namespace as a single virtual filesystem. It parses these URLs on the fly to determine which bucket and object to fetch using the Google Cloud Storage API.

Read-Only Philosophy

Both implementations are optimized for data consumption.

  • They utilize read-only scopes (in GCS) or standard read-only file handles (in Local).
  • Non-essential methods like Stat() are often left unimplemented or return minimal information. This reflects the design goal: the Perf ingestion engine cares about the content of the files (the JSON or Proto data) rather than the filesystem-level metadata like permissions or modification times.

Key Components

gcs Submodule

Bridges the storage.Client from the Google Cloud SDK to fs.FS.

  • Responsibility: Managing authenticated GCS clients and translating virtual gs:// paths into API requests.
  • Implementation Choice: It embeds *storage.Reader within a custom file struct. This allows the module to satisfy the fs.File interface automatically, as storage.Reader already provides the necessary Read and Close methods.

local Submodule

Wraps the os package's filesystem capabilities.

  • Responsibility: Providing restricted access to a local directory.
  • Implementation Choice: It utilizes os.DirFS internally. The primary value-add of this submodule is the path sanitization and translation logic that allows the rest of the application to use standardized paths while the module maps them to the correct local disk location.

Storage Resolution Workflow

The following diagram shows how the filestore abstraction allows the application to remain indifferent to the underlying storage type:

[ Application Logic ]
       |
       | Requests "path/to/data.json"
       v
[ filestore (fs.FS) ]
       |
       +-----------------------+-----------------------+
       |                       |                       |
       v                       v                       v
[ Local Implementation ]      OR      [ GCS Implementation ]
       |                                       |
       | 1. Resolve relative path              | 1. Parse gs:// URL
       | 2. Access local disk                  | 2. storage.NewReader()
       v                                       v
[ Physical File ]                      [ GCS Object ]

Key Files

  • gcs/gcs.go: Contains the logic for parsing GCS URIs and the implementation of the fs.FS and fs.File interfaces for cloud storage.
  • local/local.go: Contains the logic for anchoring file access to a local root directory and implementing the fs.FS interface for disk-based storage.

Module: /go/filestore/gcs

High-Level Overview

The gcs module provides a bridge between the standard Go io/fs interface and Google Cloud Storage (GCS). It allows systems—specifically the Skia Perf backend—to treat GCS objects as if they were files in a standard filesystem.

The primary motivation for this module is to abstract the complexities of the GCS client (authentication, bucket management, and reader handling) behind the ubiquitous fs.FS interface. This allows higher-level components to remain storage-agnostic, facilitating testing and potential migrations to other storage backends.

Design Decisions and Implementation

Interface Adherence vs. Functionality

The module implements the fs.FS and fs.File interfaces. However, it follows a “minimal implementation” philosophy tailored to the needs of the Perf system.

  • Read-Only Focus: The implementation uses storage.ScopeReadOnly during initialization. This design choice minimizes the security footprint of the service, ensuring it can only consume data and never accidentally modify or delete ingestion files.
  • Deferred Implementation: Methods such as Stat() on the file struct are intentionally not implemented and return ErrNotImplemented. This decision was made because the Perf ingestion pipeline focuses on streaming data content rather than inspecting metadata (like timestamps or permissions) provided by os.FileInfo.

Path Parsing Logic

Because fs.FS traditionally expects paths relative to a root, but GCS requires both a bucket name and an object path, the module utilizes a URL-based naming convention: gs://<bucket>/<path>.

The parseNameIntoBucketAndPath function decomposes these strings. It is designed to handle the nuances of URL parsing, such as stripping leading slashes from the URL path to convert them into valid GCS object keys.

Key Components

filesystem (gcs.go)

This is the central coordinator that satisfies fs.FS. It holds a long-lived storage.Client, which is authenticated using Google Application Default Credentials.

  • Responsibility: Managing the lifecycle of the GCS client and translating Open(name) calls into GCS readers.
  • Workflow: When Open is called, the filesystem parses the provided string into bucket and object components, then initializes a new storage.Reader using a background context.

file (gcs.go)

A thin wrapper around *storage.Reader.

  • Responsibility: To bridge the storage.Reader (which provides Read and Close) with the fs.File interface.
  • Design Choice: By embedding *storage.Reader, the struct automatically inherits the methods required for reading, keeping the implementation concise.

Data Access Workflow

The following diagram illustrates how a request for a file is translated from a standard interface call into a GCS network request:

[ Caller ]
    |
    | 1. Open("gs://my-bucket/data.json")
    v
[ filesystem ]
    |
    | 2. parseNameIntoBucketAndPath() -> ("my-bucket", "data.json")
    | 3. storage.Client.Bucket("my-bucket").Object("data.json").NewReader()
    v
[ storage.Reader ] <--- Wrapped in ---> [ file (fs.File) ]
    |
    | 4. Read() / Close() operations
    v
[ Google Cloud Storage API ]

Module: /go/filestore/local

High-Level Overview

The local module provides an implementation of the standard library's fs.FS interface specifically for the local file system. In the context of the larger Perf system, this module serves as a bridge between high-level file operations and physical storage on disk. By wrapping local file access in the fs.FS interface, it allows other components to remain agnostic about whether they are interacting with local storage, cloud storage, or an in-memory mock.

Design Decisions and Implementation

The implementation focuses on creating a “chrooted” view of the local filesystem. This is achieved by anchoring all file operations to a specific rootDir.

Root Isolation and Path Resolution

A key design choice is the use of os.DirFS. While os.Open can access any path on the system, os.DirFS restricts access to a specific directory tree. By combining an absolute root path with os.DirFS, the module ensures that callers interact with a controlled environment.

When a file is opened via the Open method:

  1. The module takes the requested path (which may be absolute or relative to the system root).
  2. It calculates the relative path of that file with respect to the module's initialized rootDir.
  3. It passes that relative path to the internal os.DirFS instance.

This approach provides a layer of safety and abstraction: the consumer of the local package can provide paths as they exist on the system, and the module handles the translation necessary to satisfy the fs.FS requirements, which typically expect paths relative to the filesystem root.

Key Components

The filesystem Struct

Defined in local.go, this struct is the core of the module. It maintains two primary pieces of state:

  • rootDir: The absolute path to the base directory. This is captured during initialization via filepath.Abs to ensure that the base of the filesystem is immutable and clearly defined, even if the process's working directory changes.
  • fs: An internal fs.FS instance (specifically a dirFS from the os package). This handles the actual low-level directory traversal and file reading.

The Open Workflow

The Open method acts as a translation layer. Instead of directly opening a path, it enforces the “local root” logic.

Input Path (name)
      |
      v
[ filepath.Rel ] <--- compares 'name' against 'rootDir'
      |
      +---- Error (if name is outside rootDir)
      |
      v
Relative Path
      |
      v
[ f.fs.Open ]    <--- os.DirFS handles actual I/O
      |
      v
   fs.File

This workflow ensures that even if a full system path is passed to Open, the module correctly identifies the segment relative to its configured root, preventing accidental access to files outside the intended scope of the perf storage directory.

Module: /go/frontend

Perf Frontend Module

The /go/frontend module serves as the central web server and orchestration layer for the Skia Perf application. It is responsible for serving the Web UI, managing user authentication, and coordinating communication between various backend services such as trace stores, regression detection engines, and issue trackers.

Overview

The frontend service is designed as a controller-based system that abstracts complex performance data operations into user-facing API endpoints. It acts as the “glue” that binds together the telemetry data (traces), the version control history (Git), and the automated analysis tools (clustering and regression detection).

A key design philosophy of this module is asynchronous data handling. Performance dataframes can be massive, and generating them often exceeds the duration of a standard HTTP request. Consequently, the frontend utilizes a “Start-Status-Result” pattern, allowing the UI to poll for progress while a background worker processes the data.

Key Components and Responsibilities

Service Entry Point (frontend.go)

This file acts as the primary initializer for the service. It performs several critical roles:

  • Dependency Injection: It constructs all necessary store implementations (e.g., TraceStore, AlertStore, RegressionStore) based on the provided configuration.
  • Template Orchestration: It manages the lifecycle of Go HTML templates. In development mode, templates are reloaded on every request to facilitate rapid UI iteration, while in production, they are loaded once and cached for performance.
  • Global Context Injection: The getPageContext method serializes the application's state into a window.perf JavaScript object. This ensures that the frontend UI has immediate access to instance-specific settings, feature flags (like FetchAnomaliesFromSql), and environment metadata (like the ImageTag).

API Routing and Logic (/api sub-module)

The logic is partitioned into specialized API structs, each responsible for a functional domain of the application:

  • Trace and Graph Management: Handles the retrieval of performance metrics and the construction of dataframes for visualization.
  • Regression and Triage: Manages the lifecycle of detected performance anomalies, providing the bridge between automated detection and manual human verification.
  • Personalization: Manages user-specific shortcuts and favorites, allowing researchers to save specific views of the data.

Request Proxying (proxy.go)

To circumvent Cross-Origin Resource Sharing (CORS) limitations when the browser needs to fetch data from external sources (e.g., googlesource.com), the module includes a specialized proxy handler. It forwards GET requests while carefully stripping security-sensitive headers like Origin and Referer to ensure the request is accepted by the destination server.

User Authentication and Role Enforcement

The module integrates with the alogin package to provide identity management. It uses a decorator pattern (RoleEnforcedHandler) to wrap sensitive endpoints, ensuring that only users with specific roles (e.g., Admin or Bisecter) can access administrative or resource-intensive functions.

Key Workflows

Server Initialization and Background Processes

When the frontend starts, it doesn't just wait for requests; it initiates several background synchronization tasks to ensure the data served is fresh.

Startup Sequence
      |
      |-- Load & Validate Config JSON
      |-- Initialize Trace & Metadata Stores
      |-- Start ParamSet Refresher (Periodic refresh of available trace keys)
      |-- Start Continuous Clustering (If enabled, runs background regression detection)
      |-- Initialize Notifiers (Email/Issue Tracker integrations)
      |
      V
  Serve HTTP

Git-to-UI Navigation (gotoHandler)

A common workflow involves navigating from a specific Git commit hash to its representation in the Perf UI. The gotoHandler manages this translation:

  1. It resolves a Git hash to a CommitNumber using the perfGit provider.
  2. It calculates a temporal window (range) around that commit.
  3. It redirects the user to the appropriate sub-page (Explore, Clustering, or Triage) with the time-range parameters pre-populated in the URL.

Design Decisions

Configuration Validation

The module enforces strict validation of its configuration on startup, behavior that is exercised by the testdata fixtures. This ensures that misconfigurations (such as an empty instance_name or invalid connection strings) result in an immediate failure during deployment rather than subtle runtime errors or UI breakage.

Non-Production Flexibility

To support CI/CD and staging environments, the system includes logic to override hostnames and strip environment-specific suffixes (like -autopush). This allows staging instances to use production-like configurations without requiring a complete duplication of the networking and authentication infrastructure.

Multi-Backend Support

The frontend is built to be “backend agnostic” regarding anomaly storage. It can be configured to fetch data from legacy Chromeperf APIs or the modern SQL-based (Spanner/CockroachDB) implementation. This is managed via the FetchAnomaliesFromSql flag, which determines which implementation of the TriageBackend is injected into the API controllers.

Module: /go/frontend/api

Perf Frontend API Module

The /go/frontend/api module defines the HTTP interface for the Perf application. It serves as the orchestration layer between the web frontend and various backend services, including trace stores, regression detection engines, issue trackers, and the Chromeperf legacy system.

Overview

This module follows a controller-based pattern where specific functional areas (Alerts, Anomalies, Graphs, Triage, etc.) are encapsulated in individual API structs. Each struct implements a RegisterHandlers method to attach its endpoints to a central Chi router.

The design emphasizes:

  • Abstraction of Backend Implementation: The API layer interacts with interfaces (e.g., Store, IssueTracker) allowing the system to switch between Skia-native implementations and Chromeperf-compatible backends.
  • Asynchronous Processing: Long-running operations, such as generating complex dataframes or running regression detection, use a progress-tracking pattern to avoid blocking HTTP connections.
  • Multi-Instance Compatibility: Logic in common.go ensures that requests from non-production environments (staging, autopush) are correctly routed or identified, facilitating a seamless CI/CD flow.

Functional Areas and Key Components

Graph and Trace Data (graphApi.go, mcpApi.go)

Responsible for fetching and formatting performance trace data for visualization.

  • Frame Requests: graphApi manages the “Start-Status-Results” lifecycle for building dataframes. This allows the UI to poll for progress while the backend processes large volumes of trace data.
  • Data Point Details: Provides the “why” behind a specific data point by fetching source file metadata and point-specific links from the ingestedFS.
  • Model Context Protocol (MCP): mcpApi provides a specialized endpoint for LLM/Agentic tools to query trace data within specific time ranges and query parameters.

Regression and Anomaly Management (regressionsApi.go, anomaliesApi.go)

Handles the detection, listing, and lifecycle of performance regressions.

  • Compatibility Layer: anomaliesApi is designed to support both the legacy Chromeperf backend and the modern Skia-native storage. It uses preferLegacy flags to determine whether to proxy requests to Chromeperf or query the local regStore and subStore.
  • Group Reports: Aggregates anomalies by Bug ID, Revision, or Anomaly Group ID to provide a holistic view of a performance change.

Triage and Issue Tracking (triageApi.go, triageBackend.go)

Facilitates the workflow of turning detected anomalies into actionable bugs.

  • Backend Switching: Through the TriageBackend interface, the system can either file bugs directly into the Issue Tracker (Skia-native) or proxy the request to Chromeperf's triage service.
  • Nudging and Resetting: Allows users to refine anomaly boundaries or clear triage states, which involves complex coordination between the regression.Store and the commit history.

Alerts and Subscriptions (alertsApi.go, sheriffConfigApi.go)

Manages the configuration that drives automated regression detection.

  • Dry Runs: Supports testing alert configurations and bug templates before they go live via alertBugTryHandler and alertNotifyTryHandler.
  • LUCI Config Integration: sheriffConfigApi provides metadata and validation endpoints used by LUCI Config to ensure that configuration changes in external repositories are valid before being ingested.

Shortcuts and Favorites (shortcutsApi.go, favoritesApi.go)

Enhances user experience through personalization and shareability.

  • State Persistence: shortcutsApi maps complex trace selections (lists of keys) to short IDs, enabling shareable URLs for specific graph views.
  • User Favorites: Manages per-user links and sections, merging global instance-wide favorites with user-specific entries stored in the database.

Key Workflows

Asynchronous Data Loading

The following process is used for operations like /v1/frame/start:

User Request         API Layer                Progress Tracker       Backend Worker
     |                   |                          |                     |
     |-- POST /start -->|                          |                     |
     |                   |-- Create Progress ID --> |                     |
     |<-- Return ID ----|                          |                     |
     |                   |-- Start Go Routine --------------------------->|
     |                   |                          |                     |-- Fetch Data --|
     |-- GET /status -->|                          |                     |                |
     |<-- Progress % ---| <---- Query Status ------|                     |                |
     |                   |                          |                     |-- Process -----|
     |-- GET /results ->|                          |                     |                |
     |<-- DataFrame ----| <---- Get Results -------| <--- Mark Done -----|

Anomaly Triage Flow

The API handles triage by coordinating between the UI, the internal regression store, and external trackers:

    [UI] ----(EditAnomaliesRequest)----> [triageApi]
                                            |
                                    [TriageBackend]
                                     /            \
                      (Skia Native) /              \ (Chromeperf Proxy)
                                   /                \
                        [regStore.SetBugID]      [Chromeperf Client]
                               |                         |
                        [DB Update]               [External API POST]

Design Decisions

Non-Production Host Overriding

In common.go, the function getOverrideNonProdHost is used to strip suffixes like -autopush or -staging. This was implemented to allow testing environments to interact with production-like service configurations without needing to replicate the entire networking and authentication stack for every environment variant.

Trace Cleaning

anomaliesApi.go includes logic to “clean” test names. This is a defensive implementation against malformed trace IDs that might contain characters incompatible with URL parsing or specific database query engines. It uses a configurable regex (InvalidParamCharRegex) to ensure consistency across different Perf instances (e.g., Fuchsia vs. Skia).

Subscription Uniqueness

The subscriptionsHandler in alertsApi.go is designed to provide a flat list of all monitoring subscriptions. This is critical for the “Sheriff” view in the frontend, where users need to filter regressions based on their team's ownership rather than individual alert IDs.

Module: /go/frontend/api/testdata

The /go/frontend/api/testdata directory serves as a controlled environment for testing the Perf frontend API and its configuration parsing logic. It provides a canonical example of a complete system configuration, allowing developers to verify how the application interprets complex settings without relying on a live production environment.

Purpose and Design Decisions

The primary component of this module is config.json. This file is designed to simulate a realistic application state for integration tests and local development mocks. By centralizing these values, the project ensures that API endpoints—which often behave differently based on the underlying data store or authentication headers—can be tested against a predictable “ground truth.”

A key design choice reflected in this data is the use of a local, file-based environment that mirrors production complexity. For instance, the configuration specifies a cockroachdb datastore type and a local directory for data ingestion. This allows the testing suite to validate the frontend's ability to handle SQL-backed data flows and ingestion triggers in an isolated sandbox.

Key Components and Configuration Logic

The data within this module covers several critical functional areas of the Perf frontend:

  • Instance Metadata and Security: Defines how the instance identifies itself (e.g., chrome-perf-test) and how it handles user identity. The auth_config specifies X-WEBAUTH-USER as the source of truth for identity, which is essential for testing authorization middleware and audit logging.
  • External Service Integration: Encapsulates the configuration for issue trackers and notification systems. It includes references to secret management (e.g., Google Cloud Secret Manager paths for API keys), allowing the API to test the logic that fetches credentials without exposing actual production keys.
  • Data Ingestion and Schema: Controls how the system views incoming performance data. By setting use_regression2_schema: true, the test data forces the application into a specific architectural path for regression detection, facilitating tests for the newer data schema.
  • Query and UI Customization: Defines which parameters (like arch, config, or bot) are indexed for queries and provides a “Favorites” structure. This part of the configuration is used to verify that the frontend correctly renders navigation links and filters based on the configuration file rather than hardcoded values.

Data Flow Overview

The following diagram illustrates how the configuration data in this module influences the behavior of the API during testing:

+-----------------------+      +-------------------------+      +----------------------+
| /testdata/config.json | ---> | API Configuration Layer | ---> | Mock/Test Handlers   |
+-----------------------+      +-------------------------+      +----------------------+
           |                                |                             |
           | (Defines Auth)                 | (Defines Storage)           | (Defines UI)
           v                                v                             v
 [Header: X-WEBAUTH-USER]        [Conn: cockroachdb/demo]      [Favorites & Links]

Usage in Implementation

Developers use this module to:

  1. Validate JSON Unmarshaling: Ensure the Go structures in the frontend/api package align with the expected JSON format.
  2. Mock Environment Dependencies: Use the backend_host_url and git_repo_config values to simulate cross-service communication.
  3. Sanitize Inputs: The invalid_param_char_regex provides a standard for testing input validation across various API endpoints to prevent injection or malformed queries.

Module: /go/frontend/mock

The /go/frontend/mock module provides a self-contained, high-fidelity mock server for the Perf application. Unlike simple unit tests or component-level demos, this server renders the actual production HTML templates and JavaScript bundles while simulating the entire backend API.

It is primarily used for:

  • Demonstrations: Providing a “live” version of the Perf UI with deterministic data.
  • Integration Testing: Serving as a target for E2E testing frameworks (like Puppeteer) via the test_on_env Bazel rule.
  • Frontend Development: Allowing UI developers to iterate on features without needing a local instance of BigTable, Spanner, or authenticated microservices.

Design Decisions and Implementation

High-Fidelity Rendering

The mock server uses the real frontend.Frontend logic to load and execute Go HTML templates found in perf/pages/production. It injects a specialized mockContext (defined in frontend_mock_for_demo.go) into these templates. This context mimics the global configuration usually provided by the production server, enabling features like the Pinpoint bisect button or specific chart tooltips that are toggleable via feature flags.

Stateless API Simulation

The backend is simulated using a set of hardcoded data structures in frontend_mock_api_impl.go. The implementation focuses on mimicking the behavior of the Perf API rather than just returning static JSON:

  • Query Builder: The nextParamListHandler simulates the hierarchical filtering of trace keys (e.g., selecting an “arch” narrows down the available “os” values).
  • Asynchronous Jobs: Since the real Perf backend often processes graph requests asynchronously, the mock implements the /frame/start and /status/{id} pattern. It stores a “pending” query in memory and returns a “Finished” status with a mock dataframe when polled.
  • Deterministic Data: Trace data is generated based on the length of the trace keys, ensuring that the graphs look consistent across different runs.

Environment Integration (test_on_env)

The server includes specific logic to support automated testing environments:

  1. Dynamic Port Allocation: If the ENV_DIR environment variable is detected, the server listens on port :0 to avoid collisions during parallel test execution.
  2. Readiness Signaling: It writes its assigned port and a “ready” file to a specific directory. This allows a test runner to wait until the server is fully initialized before attempting to connect.

Key Components

  • frontend_mock_for_demo.go: Entry point. Configures the chi router, handles static asset serving (JS/CSS/Images), and defines the render helper that injects mock global state into HTML templates.
  • frontend_mock_api_impl.go: Contains the mock logic for all /_/ API endpoints, including alerts, regressions, triage, favorites, and trace data retrieval.
  • BUILD.bazel: Defines the mock_dist_files filegroup, which collects all production UI assets (CSS, JS, Maps, Images) required to make the server functional without an external CDN.

Core Workflow: Data Retrieval

The following diagram illustrates how the mock server simulates the asynchronous data fetching process used by the frontend to render graphs:

Browser                 Mock Server (frontend_mock_server)
   |                          |
   |-- POST /_/frame/start -->| 1. Stores Queries in m.currentQueries
   |                          | 2. Returns status: "Running", url: "/_/status/demo-req"
   |<-- JSON {status, url} ---|
   |                          |
   |-- GET /_/status/demo-req | 3. Retrieves stored queries
   |                          | 4. Filters mockTraceData based on queries
   |<-- JSON {status: "Finished", results: {dataframe}}
   |                          |
   | (Browser renders plot)    |

Mock State and Data

The server maintains a small amount of in-memory state (protected by sync.Mutex) to track the “current” query, but otherwise acts as a functional simulator of:

  • Parameters: A fixed set of architectures (arm, x86_64) and OSs (Android, Ubuntu, etc.).
  • Anomalies: Predetermined regression points linked to specific trace keys to test the triage and alerting UI.
  • Authentication: Always reports the user as user@google.com with admin and bisecter roles.

Module: /go/frontend/testdata

Frontend Test Data

The testdata module serves as a centralized repository for configuration fixtures used during the unit testing and integration testing of the Perf frontend. It focuses on simulating various application states and ensuring that the configuration parser and validator logic are robust against malformed or boundary-pushing inputs.

Purpose and Design Choice

The primary design goal for this module is to provide repeatable, immutable data structures that represent both valid and invalid application states. Rather than programmatically generating complex configuration objects in Go code, these JSON files allow developers to see the exact structure the frontend expects to ingest from environment-specific configuration maps.

Using separate files for specific failure modes (e.g., missing or excessive string lengths) allows the test suite to execute table-driven tests that specifically target the validation logic within the frontend initialization phase. This approach ensures that the application fails gracefully with descriptive errors when provided with an invalid configuration, rather than encountering runtime panics.

Component Responsibilities

Standard Configuration Fixture

The config.json file represents a complete, valid configuration. It defines the operational environment for the frontend, including:

  • Data Store Connectivity: Configures backend persistence (e.g., CockroachDB) and tile sizes for data processing.
  • Ingestion and Repository State: Sets up the relationship between data sources (local directories or Pub/Sub topics) and the git history provider.
  • UI/UX Customizations: Defines “favorites” sections, query parameter inclusions, and notification settings that control the end-user experience.

Validation Boundary Fixtures

The remaining files in this module are designed specifically to test the constraints of the instance_name field. This field is a critical identifier used in telemetry, logging, and external service integration.

  • Empty and Missing States: config_empty_instance_name.json and config_no_instance_name.json are used to verify that the system correctly identifies missing required fields or prevents the use of empty strings where a unique identifier is expected.
  • Constraint Testing: config_long_instance_name.json provides a value that exceeds standard character limits (typically 64 characters). This is used to test that the frontend validation logic prevents data that might be rejected by downstream cloud services or cause UI layout breakage.

Configuration Parsing Workflow

The following diagram illustrates how these files are utilized during the application lifecycle testing:

+-----------------+       +-----------------------+
|   Test Runner   | ----> | Load JSON fixture     |
| (Frontend Unit) |       | from /testdata/       |
+-----------------+       +-----------+-----------+
                                      |
                                      v
+-----------------+       +-----------------------+
| Assert Expected | <---- | Execute Unmarshal and |
| Error/Success   |       | Validation Logic      |
+-----------------+       +-----------------------+

By checking against these predefined files, the frontend ensures that changes to the configuration structures in the Go source code do not silently break compatibility with existing configuration formats used in production environments.

Module: /go/fuchsia_to_skia_perf

Fuchsia to Skia Perf

fuchsia_to_skia_perf is a command-line utility designed to bridge the gap between Fuchsia’s performance testing infrastructure and the Skia Performance monitoring system. It transforms performance data from Fuchsia’s native JSON format into a schema that the Skia Perf ingestion pipeline can parse and visualize.

High-Level Overview

The tool operates as a specialized data ETL (Extract, Transform, Load) pipeline. It extracts raw metrics from Fuchsia build artifacts, transforms them by calculating statistical aggregates and normalizing units, and loads them into a destination suitable for Skia Perf—either a local directory or a Google Cloud Storage (GCS) bucket.

The primary goal of this tool is to ensure that performance regressions in Fuchsia can be tracked using Skia's visualization tools, which require specific metadata (like “improvement direction”) and a flat, trace-based data structure.

Design Decisions

Benchmark Granularity and Data Partitioning

Fuchsia test results often bundle multiple test suites into a single large JSON file. However, Skia Perf is optimized for partitioning data based on specific benchmarks. To align these two systems, the converter splits a single Fuchsia input record into multiple Skia Perf JSON files, keyed by the test suite name. This allows Skia to treat each suite as a distinct entity, improving query performance and visualization clarity.

Statistical Aggregation and Visualization

Rather than just reporting raw values, the converter automatically generates two distinct types of entries for every metric:

  1. Comprehensive Stats: A base entry containing the full statistical profile (Min, Max, Sum, Count, First, and Standard Deviation).
  2. Average Focused (_avg): A specialized entry focusing on the mean and error (standard deviation).

This dual-entry approach is a deliberate choice to support different visualization needs in Skia Perf: the _avg series provides a clean trend line for dashboards, while the base entry allows for deep-dive analysis into the variance and distribution of test results.
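The dual-entry strategy might be sketched as follows. The `resultItem` type, the `dualEntries` helper, and the stat key names are illustrative stand-ins for the real schema, not the converter's actual identifiers.

```go
package main

import (
	"fmt"
	"math"
)

// resultItem loosely mirrors the shape of a Skia Perf result entry;
// field names here are illustrative, not the real schema.
type resultItem struct {
	Name  string
	Stats map[string]float64
}

// dualEntries sketches the base + "_avg" strategy: one entry with the
// full statistical profile and one focused on mean and error.
// Assumes len(values) >= 2 for the sample standard deviation.
func dualEntries(name string, values []float64) []resultItem {
	min, max, sum := math.Inf(1), math.Inf(-1), 0.0
	for _, v := range values {
		min, max, sum = math.Min(min, v), math.Max(max, v), sum+v
	}
	n := float64(len(values))
	avg := sum / n
	// Sample standard deviation (divide by n-1).
	var sq float64
	for _, v := range values {
		sq += (v - avg) * (v - avg)
	}
	stddev := math.Sqrt(sq / (n - 1))

	base := resultItem{Name: name, Stats: map[string]float64{
		"min": min, "max": max, "sum": sum, "count": n,
		"first": values[0], "std": stddev,
	}}
	focused := resultItem{Name: name + "_avg", Stats: map[string]float64{
		"avg": avg, "error": stddev,
	}}
	return []resultItem{base, focused}
}

func main() {
	for _, item := range dualEntries("frame_time", []float64{10, 12, 14}) {
		fmt.Println(item.Name, item.Stats)
	}
}
```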

Unit Normalization and Polarity

Fuchsia and Skia use different conventions for units and “improvement direction” (i.e., whether a higher or lower number is better). The module implements a mapping logic that:

  • Normalizes Units: Converts various strings (e.g., nanoseconds, ns) to a canonical format (e.g., ms) and scales values accordingly (e.g., dividing by 1,000,000).
  • Infers Polarity: Assigns a smallerIsBetter or biggerIsBetter flag based on the unit. It supports overrides within the input data, allowing developers to explicitly mark a metric (e.g., ms_biggerIsBetter) if the default assumption is incorrect.
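A minimal sketch of this mapping logic, assuming a small illustrative unit table; `unitTable`, `normalize`, and the fallback polarity for unknown units are inventions of this sketch, not the module's actual identifiers.

```go
package main

import (
	"fmt"
	"strings"
)

// unitInfo pairs a canonical Skia unit with the divisor applied to raw
// values; this table is an illustrative subset, not the full mapping.
type unitInfo struct {
	skiaUnit string
	div      float64
}

var unitTable = map[string]unitInfo{
	"nanoseconds":  {"ms", 1_000_000},
	"ns":           {"ms", 1_000_000},
	"milliseconds": {"ms", 1},
	"ms":           {"ms", 1},
	"bytes":        {"MiB", 1024 * 1024},
}

// defaultDirection infers polarity from the canonical unit; time and
// memory both default to smaller-is-better. The fallback for other
// units is chosen arbitrarily for this sketch.
func defaultDirection(skiaUnit string) string {
	switch skiaUnit {
	case "ms", "MiB":
		return "smallerIsBetter"
	}
	return "biggerIsBetter"
}

// normalize resolves a Fuchsia unit string, honoring an explicit
// "_biggerIsBetter" / "_smallerIsBetter" override suffix in the input.
func normalize(fuchsiaUnit string, value float64) (unit, direction string, scaled float64) {
	base := fuchsiaUnit
	override := ""
	for _, suffix := range []string{"_biggerIsBetter", "_smallerIsBetter"} {
		if strings.HasSuffix(base, suffix) {
			base = strings.TrimSuffix(base, suffix)
			override = strings.TrimPrefix(suffix, "_")
		}
	}
	info, ok := unitTable[base]
	if !ok {
		info = unitInfo{base, 1} // pass unknown units through unchanged
	}
	direction = defaultDirection(info.skiaUnit)
	if override != "" {
		direction = override
	}
	return info.skiaUnit, direction, value / info.div
}

func main() {
	u, d, v := normalize("nanoseconds", 2_000_000)
	fmt.Println(u, d, v) // ms smallerIsBetter 2
	u, d, v = normalize("ms_biggerIsBetter", 5)
	fmt.Println(u, d, v) // ms biggerIsBetter 5
}
```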

Key Components

Transformation Engine (/convert/lib.go)

This is the core logic of the module. It handles the lifecycle of a conversion run:

  1. Validation: Ensures the input contains necessary metadata like build_id and commit_id.
  2. Grouping: Categorizes measurements by their benchmark/test suite.
  3. Calculation: Computes numerical statistics and applies unit scaling.
  4. Formatting: Maps the processed data into the SkiaPerfResult schema.

Processing Workflow:

[Fuchsia JSON] -> [Unmarshal] -> [Group by Benchmark]
                                        |
                                        v
[Calculate Stats] <--- [Map Units/Direction] <--- [Scale Values]
       |
       v
[Construct SkiaPerfResult] -> [Write Local File]
                           -> [Upload to GCS (Optional)]

Data Models (/convert/types.go)

This component defines the structural contract between the two systems.

  • FuchsiaPerfResults: Models the input, focusing on build metadata and raw measurement arrays.
  • SkiaPerfResult: Models the output, which includes a Key map (defining the trace's identity—bot, master, benchmark) and the calculated result items.

Entry Point (main.go)

The CLI wrapper handles environment-specific configuration. It manages:

  • Authentication: Sets up Google Cloud credentials if GCS uploading is requested.
  • Configuration: Parses flags to define where the data comes from and where it should go.
  • Path Partitioning: If uploading to GCS, it organizes files into an ingest/YYYY/MM/DD/ structure to facilitate efficient discovery by the Skia Perf ingester.

Implementation Details

The tool generates output filenames using a specific pattern: <build_id>-<benchmark>-<bot>-<master>.json. This naming convention ensures uniqueness and provides enough context for administrators to manually inspect the ingestion bucket if necessary.
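A sketch of that naming convention; the helper name `outputFilename` and the example component values are hypothetical, and any sanitization of the components is omitted.

```go
package main

import "fmt"

// outputFilename follows the documented
// <build_id>-<benchmark>-<bot>-<master>.json pattern.
func outputFilename(buildID, benchmark, bot, master string) string {
	return fmt.Sprintf("%s-%s-%s-%s.json", buildID, benchmark, bot, master)
}

func main() {
	// Example component values are illustrative only.
	fmt.Println(outputFilename("8891", "fuchsia_microbenchmarks", "fuchsia-bot", "fuchsia.global.ci"))
}
```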

When calculating the “error” metric for the average results, the tool utilizes a sample standard deviation. This provides a statistically sound representation of variance, which Skia Perf uses to render error bars in its UI.
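The error calculation amounts to a Bessel-corrected (n-1) standard deviation, sketched here; `sampleStdDev` is an illustrative name, not the tool's actual function.

```go
package main

import (
	"fmt"
	"math"
)

// sampleStdDev computes the sample (n-1) standard deviation used as
// the "error" value; it returns 0 for fewer than two samples, a guard
// the real converter may handle differently.
func sampleStdDev(values []float64) float64 {
	n := float64(len(values))
	if n < 2 {
		return 0
	}
	var sum float64
	for _, v := range values {
		sum += v
	}
	mean := sum / n
	var sq float64
	for _, v := range values {
		sq += (v - mean) * (v - mean)
	}
	return math.Sqrt(sq / (n - 1))
}

func main() {
	fmt.Println(sampleStdDev([]float64{10, 12, 14})) // 2
}
```

Dividing by n-1 rather than n corrects the bias introduced by estimating the mean from the same samples, which matters for the small run counts typical of performance tests.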

Module: /go/fuchsia_to_skia_perf/convert

Fuchsia to Skia Perf Conversion

The convert module provides the logic for transforming performance test results from the Fuchsia JSON format into the Skia Perf format. This conversion allows performance data generated by Fuchsia builders to be ingested and visualized by Skia's performance monitoring tools.

Overview

The module functions as a data pipeline that reads a specific Fuchsia performance schema, normalizes units and improvement directions, calculates statistical aggregates, and outputs files compatible with Skia Perf's ingestion requirements.

Design Decisions

  • Benchmark Granularity: The converter splits a single Fuchsia input record (which may contain multiple test suites) into separate Skia Perf JSON files per benchmark (test suite). This aligns with how Skia Perf organizes data, where a “benchmark” is a primary key for partitioning performance traces.
  • Unit Normalization: To ensure consistency across different test runners, the module maps various Fuchsia unit strings (e.g., nanoseconds, ns, milliseconds) to a canonical set of Skia units (e.g., ms). It also handles value scaling, such as converting nanoseconds to milliseconds or bytes to MiB.
  • Automated Statistics: For every test result provided, the module automatically generates two Skia result items:
    1. A base item containing comprehensive statistics (min, max, sum, count, first value, and standard deviation).
    2. An _avg item that specifically focuses on the mean and error (standard deviation), which is often the primary metric for visualization.
  • Improvement Direction: The module infers whether a metric should “go up” or “go down” to indicate improvement. It uses a default mapping based on the unit (e.g., ms defaults to smallerIsBetter) but allows the input data to override this via a suffix (e.g., ms_biggerIsBetter).
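The per-benchmark splitting described above starts with a grouping pass, which might be sketched as follows; `measurement` and `groupByBenchmark` are illustrative names, and the fields are a simplification of the real Fuchsia schema.

```go
package main

import "fmt"

// measurement is an illustrative stand-in for a Fuchsia performance
// record; only the fields needed for grouping are modeled.
type measurement struct {
	TestSuite string
	Label     string
	Value     float64
}

// groupByBenchmark buckets measurements by their test suite, the unit
// that becomes one Skia Perf output file per group.
func groupByBenchmark(ms []measurement) map[string][]measurement {
	groups := map[string][]measurement{}
	for _, m := range ms {
		groups[m.TestSuite] = append(groups[m.TestSuite], m)
	}
	return groups
}

func main() {
	// Suite and label names below are invented for illustration.
	ms := []measurement{
		{"fuchsia_microbenchmarks", "Syscall/Null", 1.2},
		{"fuchsia_microbenchmarks", "Syscall/ManyArgs", 1.9},
		{"netstack_benchmarks", "UDP/Send", 40.0},
	}
	for suite, group := range groupByBenchmark(ms) {
		fmt.Printf("%s: %d measurements\n", suite, len(group))
	}
}
```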

Key Components

Data Transformation Logic (lib.go)

This file contains the core processing engine. The conversion process follows this workflow:

Fuchsia JSON Input
      |
      v
Unmarshal & Validate (Check BuildID, CommitID, etc.)
      |
      v
Group results by Test Suite (Benchmark)
      |
      +------> Calculate Stats (Min, Max, Avg, StdDev)
      |
      +------> Map Units & Improvement Direction
      |
      v
Construct SkiaPerfResult Object
      |
      +------> Write to Local Disk (if configured)
      |
      +------> Upload to GCS (if client provided)

  • Run(cfg Config): The entry point that orchestrates the file reading, grouping, and output generation.
  • PopulateResults(perfResults): Maps the raw measurements into the structured SkiaResultItem format, applying the dual-item (base + average) generation strategy.
  • MapUnitAndDirection(input): Resolves the final string used by Skia to determine both the unit type and the visualization polarity (up/down).
  • CalculateStats(results): Performs numerical analysis and unit conversion (e.g., ns to ms). It uses a sample standard deviation calculation for the error metric.

Data Models (types.go)

Defines the structure of both the input and output formats.

  • FuchsiaPerfResults: Represents the input schema, which is a list of build records, each containing metadata like build_id, builder, and commit_id, alongside an array of performance measurements.
  • SkiaPerfResult: Represents the output schema. It includes a Key map (defining the trace identity) and a Results array containing the actual measurements and their associated metadata.

Configuration

The conversion process is controlled via the Config struct, which specifies:

  • Data Sources: The path to the input JSON file.
  • Destinations: A local directory for output files and/or a Google Cloud Storage (GCS) bucket path.
  • Metadata: The “Master” name and an optional date for GCS path partitioning (organized as ingest/YYYY/MM/DD/).

Module: /go/git

Perf Git Module

The perf/go/git module provides a high-level abstraction and persistence layer for Git repository data within the Perf system. Its primary responsibility is to bridge the gap between non-linear Git history (identified by hashes) and Perf's internal requirement for a linear, integer-based timeline (identified by CommitNumber).

High-Level Overview

In the Perf ecosystem, performance data is plotted against a continuous x-axis. Because Git hashes are opaque identifiers that carry no ordering information, this module maps every relevant Git commit to a monotonically increasing CommitNumber.

The module performs three core functions:

  1. Ingestion: Periodically polling a Git source (via the provider abstraction) to find new commits.
  2. Persistence: Storing commit metadata (hash, author, subject, timestamp) in an SQL database (PostgreSQL or Spanner) to enable fast range queries without hitting the Git backend.
  3. Resolution: Providing an API to translate between timestamps, Git hashes, and commit numbers.

Design Decisions and Implementation Choices

The Commit Number Mapping

A fundamental design choice in Perf is the use of CommitNumber as the primary coordinate for data. This module supports two ways of determining this number:

  • Sequential Assignment: By default, as the module discovers new commits, it assigns the next available integer in the database.
  • Repo-Supplied (Regex): For repositories like Chromium that embed a monotonic position in the commit message (e.g., Cr-Commit-Position: refs/heads/master@{#727989}), the module can be configured with a regex to extract this number directly. This ensures that the CommitNumber in Perf matches the official project revision.
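The regex-based extraction could look like the following sketch; the pattern and helper name are illustrative, since the production regex is supplied by instance configuration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// commitPosition extracts a repo-supplied commit number from a commit
// body; this pattern matches the Cr-Commit-Position footer shown above.
var commitPosition = regexp.MustCompile(`Cr-Commit-Position: refs/heads/\S+@\{#(\d+)\}`)

// commitNumberFromBody returns the extracted number, or ok=false when
// the footer is absent (in which case sequential assignment applies).
func commitNumberFromBody(body string) (int, bool) {
	m := commitPosition.FindStringSubmatch(body)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	body := "Fix flaky test\n\nCr-Commit-Position: refs/heads/master@{#727989}\n"
	n, ok := commitNumberFromBody(body)
	fmt.Println(n, ok) // 727989 true
}
```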

Caching for Performance

To minimize database load, the implementation utilizes an LRU (Least Recently Used) cache for commit details. Given that a typical commit entry is approximately 400 bytes, the cache is capped at 25,000 entries (roughly 10MB). This significantly speeds up the rendering of dashboards and alerts where the same recent commits are requested frequently.
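A hand-rolled sketch of the caching idea using only the standard library; the real implementation likely uses an existing LRU package, and `commitDetail` here is a stand-in for the actual cached struct.

```go
package main

import (
	"container/list"
	"fmt"
)

// commitDetail is an illustrative cached value; the real struct holds
// hash, author, subject, and timestamp (~400 bytes per entry).
type commitDetail struct {
	Hash, Author, Subject string
	Timestamp             int64
}

// lru is a minimal least-recently-used cache keyed by CommitNumber.
type lru struct {
	cap   int
	order *list.List            // front = most recently used
	items map[int]*list.Element // CommitNumber -> element
}

type entry struct {
	key int
	val commitDetail
}

func newLRU(capacity int) *lru {
	return &lru{cap: capacity, order: list.New(), items: map[int]*list.Element{}}
}

// Get returns the cached detail and marks it as recently used.
func (c *lru) Get(key int) (commitDetail, bool) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*entry).val, true
	}
	return commitDetail{}, false
}

// Add inserts or refreshes an entry, evicting the least recently used
// one when the capacity is exceeded.
func (c *lru) Add(key int, val commitDetail) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).val = val
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, val})
	if c.order.Len() > c.cap {
		oldest := c.order.Back()
		delete(c.items, oldest.Value.(*entry).key)
		c.order.Remove(oldest)
	}
}

func main() {
	cache := newLRU(25000) // ~400 bytes/entry => roughly 10MB
	cache.Add(42, commitDetail{Hash: "7abc123"})
	d, ok := cache.Get(42)
	fmt.Println(ok, d.Hash)
}
```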

Background Polling and Synchronization

The module follows a “sync-and-cache” pattern. It does not query the Git provider (Gitiles or local CLI) for every user request. Instead, it runs a background goroutine that pulls updates into the local SQL database. This ensures that even if the Git backend is temporarily slow or unavailable, the Perf UI remains responsive using the cached metadata.

Key Components and Files

interface.go

Defines the Git interface, which is the contract for all Git-related operations in Perf. This includes methods for range lookups (CommitSliceFromTimeRange), history traversal (PreviousGitHashFromCommitNumber), and file-specific auditing (CommitNumbersWhenFileChangesInCommitNumberRange).

impl.go

The primary implementation of the Git interface. It manages the lifecycle of the background updater and contains the SQL logic for both PostgreSQL and Spanner.

  • Update Logic: The Update method identifies the delta between the most recent hash in the database and the current HEAD of the repository, then streams and inserts the missing commits.
  • Collision Handling: It uses ON CONFLICT DO NOTHING clauses to ensure that multiple service instances or rapid update cycles do not result in duplicate entries.

gittest/

A specialized test harness that bootstraps a complete environment for integration testing. It creates a real Git repository with deterministic timestamps, initializes a test database, and provides a pre-populated set of hashes. This ensures that logic involving time-to-hash mapping can be tested without flakiness.

Data Workflow: Update Cycle

The following diagram illustrates how the module synchronizes the database with the remote repository:

[ Background Poller ]          [ SQL Database ]          [ Git Provider ]
         |                            |                         |
         |--- 1. Get Most Recent ---->|                         |
         |       Commit from DB       |                         |
         |<----- (Hash, Number) ------|                         |
         |                            |                         |
         |--- 2. Fetch New Commits ---------------------------->|
         |       since <Hash>         |                         |
         |<---------------------------- (Stream of Commit Objs) |
         |                            |                         |
         |--- 3. Extract Number ----> |                         |
         |       (Regex or Incr)      |                         |
         |                            |                         |
         |--- 4. INSERT INTO Commits >|                         |
         |       (Hash, Meta, No.)    |                         |

Related Submodules

  • provider/: Defines the low-level interface for fetching raw Git data.
  • providers/: A factory module that selects between git_checkout (local CLI) and gitiles (REST API) based on the instance configuration.
  • schema/: Defines the database table structure used to persist commit metadata.
  • mocks/: Provides autogenerated mocks for unit testing components that depend on the Git interface.

Module: /go/git/gittest

The gittest module provides a high-level test harness for the Perf system’s Git integration. It is designed to bootstrap a realistic environment for integration tests, bridging the gap between raw Git repositories and the Perf service’s data structures.

Design and Intent

The primary goal of this module is to abstract away the repetitive setup required to test Git-based performance monitoring. Testing the Perf Git logic requires a complex state consisting of:

  1. A valid Git repository with a deterministic commit history.
  2. An initialized SQL database schema.
  3. A local checkout (mirror) that the system can use to perform analysis.
  4. Matching configurations that link these pieces together.

By providing a single constructor (NewForTest), this module ensures that tests across the Perf codebase use a consistent dataset, making it easier to verify algorithms that traverse history or map timestamps to commit hashes.

Key Components

The Test Environment Lifecycle

The core of the module is the NewForTest function. It manages the orchestration of several distinct subsystems:

  • Repository Generation: It uses testutils.GitBuilder to initialize a temporary Git repository. It populates this repository with a predefined sequence of commits starting at StartTime (Unix 1680000000), spaced exactly one minute apart. This predictability allows test authors to write assertions based on relative time offsets.
  • Database Provisioning: It initializes a Spanner database instance for tests using sqltest.NewSpannerDBForTests. This provides the persistence layer where Perf stores metadata associated with Git commits.
  • Provider Abstraction: It instantiates a git_checkout.Provider. This is the component responsible for actually interacting with the Git binary and the local filesystem, ensuring that the test environment behaves identically to a production deployment.
  • Automatic Cleanup: The module utilizes t.Cleanup to ensure that temporary directories, database connections, and background processes (like the Git builder) are torn down immediately after a test completes, preventing resource leaks in the test runner.

Data Workflow

When NewForTest is called, the following process occurs:

[ GitBuilder ] --(Creates Repo)--> [ Local .git Dir ]
      |                                    |
      | (Commits files at 1min intervals)  |
      v                                    v
[ Commit Hashes ] <----------- [ git_checkout.Provider ]
      |                                    |
      | (Configuration Links Them)         |
      v                                    v
[ InstanceConfig ] <---------- [ Spanner DB Instance ]

  1. Seed: A local Git repository is created and populated with synthetic commits (foo.txt, bar.txt).
  2. Configure: An InstanceConfig is generated, pointing the GitRepoConfig.URL to the GitBuilder's directory and setting a temporary path for the local checkout.
  3. Sync: The git_checkout provider is initialized, which effectively “clones” the builder's repo into the temporary directory.
  4. Return: The function returns the context, the database handle, the builder, the ordered list of hashes, the provider, and the config object.

Usage in Tests

The returned hashes slice is critical for testing. Since Git commit hashes are non-deterministic (based on authorship and exact time of creation), the module returns the generated hashes in chronological order. Tests use these hashes to verify that the Git provider correctly identifies the “revision” associated with specific performance data points.

Module: /go/git/mocks

The go.skia.org/infra/perf/go/git/mocks module provides a mock implementation of the Git interface used within the Perf system. This module is essential for unit testing components that interact with Git history, commit data, and repository metadata without requiring a live Git repository or network access to a Git provider.

Purpose and Design

The primary goal of this module is to enable predictable, isolated testing of Perf's business logic. In the Perf system, the Git interface (defined in perf/go/git/provider) acts as the bridge between performance data and the source code history. Many operations—such as calculating regression ranges, mapping timestamps to commits, or identifying when specific files changed—rely on this interface.

By using these mocks, developers can:

  1. Simulate Edge Cases: Easily test behavior when a commit is missing, a repository is empty, or a Git provider returns an error.
  2. Ensure Determinism: Avoid flaky tests caused by changes in an external repository or network latency.
  3. Speed Up Tests: Bypass the overhead of cloning repositories or executing shell commands to git.

Key Components

The Git Struct

The core of this module is the Git struct found in Git.go. It is an autogenerated mock produced by mockery, utilizing the testify/mock framework. It implements every method required by the Perf system's Git provider interface, including:

  • Resolution Methods: Converting between types.CommitNumber (Perf's internal sequential index), Git hashes, and timestamps (e.g., CommitNumberFromGitHash, CommitNumberFromTime).
  • Retrieval Methods: Fetching detailed commit metadata or slices of commits based on ranges (e.g., CommitFromCommitNumber, CommitSliceFromTimeRange).
  • Analysis Methods: Identifying changes within a range (e.g., CommitNumbersWhenFileChangesInCommitNumberRange).
  • Lifecycle Methods: Controlling the background state of the provider (e.g., Update, StartBackgroundPolling).

Usage Workflow

When writing a test, you instantiate the mock using NewGit, which automatically registers cleanup functions to verify that all expected calls were made before the test finishes.

Test Setup Phase
----------------
1. Call mocks.NewGit(t)
2. Define expectations using .On(...).Return(...)

Execution Phase
---------------
3. Pass the mock into the component being tested
4. Component calls Git methods (e.g., GitHashFromCommitNumber)
5. Mock returns the pre-defined values

Verification Phase
------------------
6. Test finishes
7. Cleanup function calls AssertExpectations
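To make the mock's value concrete without pulling in mockery/testify here, the following stdlib-only fake does by hand what the generated mock automates: canned return values plus call recording. The interface is narrowed to a single method, and `describeCommit` is an invented consumer standing in for real Perf business logic.

```go
package main

import (
	"errors"
	"fmt"
)

// gitLookup is a narrowed version of the Perf Git interface, reduced
// to one method for illustration.
type gitLookup interface {
	GitHashFromCommitNumber(n int) (string, error)
}

// fakeGit returns canned responses keyed by input and records how
// many calls were made, mirroring .On(...).Return(...) plus
// AssertExpectations in the generated mock.
type fakeGit struct {
	hashes map[int]string
	calls  int
}

func (f *fakeGit) GitHashFromCommitNumber(n int) (string, error) {
	f.calls++
	h, ok := f.hashes[n]
	if !ok {
		return "", errors.New("commit not found")
	}
	return h, nil
}

// describeCommit is a toy consumer of the interface.
func describeCommit(g gitLookup, n int) string {
	h, err := g.GitHashFromCommitNumber(n)
	if err != nil {
		return "unknown commit"
	}
	return fmt.Sprintf("commit %d is %s", n, h)
}

func main() {
	fake := &fakeGit{hashes: map[int]string{100: "7abc123"}}
	fmt.Println(describeCommit(fake, 100))
	fmt.Println(describeCommit(fake, 999))
	fmt.Println("calls recorded:", fake.calls)
}
```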

Implementation Details

The mock is tightly coupled with:

  • go.skia.org/infra/perf/go/types: For internal Perf types like CommitNumber.
  • go.skia.org/infra/perf/go/git/provider: For the Commit data structure and the interface definition.
  • github.com/stretchr/testify/mock: For the underlying mock engine.

Because the code is generated, the logic within Git.go focuses on checking types and returning values provided during the “Setup” phase of a test. If a method is called that wasn’t expected, or if a return value wasn’t specified for a called method, the mock will trigger a panic to alert the developer of an incomplete test configuration.

Module: /go/git/provider

High-Level Overview

The provider module establishes a uniform abstraction layer for interacting with Git repositories within the Skia infrastructure. Rather than forcing downstream consumers to handle the specifics of local Git CLI operations versus remote Gitiles API calls, this module defines a common interface and data structure for retrieving commit history, metadata, and file-specific changes.

By decoupling the data source from the data consumption, the module allows Perf and other services to remain agnostic about how repository data is physically fetched or stored.

Design Rationale

The primary design goal is to provide a consistent view of repository history that is optimized for consumption by performance monitoring systems.

  • Sequential Processing via Callbacks: The CommitProcessor pattern used in CommitsFromMostRecentGitHashToHead is designed for efficiency when processing large ranges of history. Instead of loading thousands of commits into memory simultaneously, the provider streams commits to the caller. This minimizes memory overhead during initial repository indexing or catch-up tasks.
  • Agnostic Backend: The Provider interface is intentionally minimal. It assumes that the underlying implementation (whether it's a local git checkout or a network-based service like Gitiles) handles the complexities of authentication, caching, and network protocols.
  • Database-Friendly Commit Model: The Commit struct serves as a bridge between Git's raw output and the internal database schema. It includes CommitNumber (a monotonic offset used for indexing in Perf) and utilizes JSON annotations that maintain compatibility with legacy CommitDetail structures.
  • Commit Body vs. Persistence: A specific implementation choice is made in the Commit struct where the Body field is kept for parsing metadata (such as extracting specific commit numbers or Gerrit footers), but is explicitly noted as not being intended for database storage. This saves significant storage space while still providing the necessary context during the ingestion phase.
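The callback-driven streaming described above can be sketched as follows; `commit`, `commitProcessor`, and `streamCommits` are simplified stand-ins for the real provider types and for CommitsFromMostRecentGitHashToHead.

```go
package main

import "fmt"

// commit is a trimmed version of the provider's commit model.
type commit struct {
	Hash    string
	Subject string
}

// commitProcessor mirrors the callback pattern: the provider invokes
// it once per commit instead of returning a slice.
type commitProcessor func(c commit) error

// streamCommits walks commits newer than lastHash in order, calling
// cb for each, and stops early if the callback returns an error. The
// caller never holds more than one commit at a time.
func streamCommits(history []commit, lastHash string, cb commitProcessor) error {
	start := 0
	for i, c := range history {
		if c.Hash == lastHash {
			start = i + 1
		}
	}
	for _, c := range history[start:] {
		if err := cb(c); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	history := []commit{{"aaa", "first"}, {"bbb", "second"}, {"ccc", "third"}}
	// Only commits after "aaa" are streamed to the callback.
	_ = streamCommits(history, "aaa", func(c commit) error {
		fmt.Println(c.Hash, c.Subject)
		return nil
	})
}
```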

Key Components

The Provider Interface

Defined in provider.go, this interface outlines the essential operations required to sync and query repository state:

  • Incremental Updates: Update ensures the local view of the repository is current.
  • History Traversal: CommitsFromMostRecentGitHashToHead enables “incremental” ingestion. By passing the last known Git hash, the provider can determine the delta between the database and the current HEAD, processing only new commits in chronological order.
  • Targeted Auditing: GitHashesInRangeForFile allows the system to filter noise by identifying exactly when specific configuration or data files were modified within a range, rather than scanning every commit in the repository.

The Commit Model

The Commit struct represents the canonical version of a Git commit within the Skia ecosystem. It includes helper methods for human-readable output:

  • Display: Generates a standardized short-form string (e.g., 7abc123 - 2 days ago - Fix memory leak) used in UI logs and CLI outputs.
  • HumanTime: Leverages the go/human package to convert Unix timestamps into relative durations, providing a more intuitive sense of “when” a change occurred compared to raw epoch values.

Workflow: Incremental Ingestion

The typical interaction between a consumer and the provider follows a “sync-and-stream” pattern to keep internal databases up to date with the remote repository.

[ Consumer ]           [ Provider ]          [ Git Backend ]
     |                      |                      |
     |--- 1. Update() ----->|                      |
     |                      |--- 2. git pull / API |
     |                      |<-- 3. New Commits ---|
     |                      |                      |
     |--- 4. CommitsFrom(lastHash, callback) ----->|
     |                      |                      |
     |                      |--- 5. Parse Commits -|
     |                      |                      |
     |<-- 6. Invoke callback(Commit) [Repeated] ---|
     |                      |                      |
     |--- 7. Store in DB -->|                      |

This workflow ensures that the consumer only processes what is necessary and that the logic for “what has changed” remains encapsulated within the provider implementation.

Module: /go/git/providers

High-Level Overview

The providers module serves as a factory and abstraction layer for obtaining Git data in the Perf system. Its primary purpose is to instantiate a provider.Provider based on the system's configuration.

By abstracting the source of Git information, the rest of the Perf application can remain agnostic of whether it is interacting with a local disk-based repository or a remote web-based Gitiles instance. This flexibility allows Perf to scale across different infrastructure environments—from high-performance local setups to cloud-native deployments where persistent disk management is undesirable.

Design and Implementation Choices

The Factory Pattern

The module implements a single factory function, New, which encapsulates the logic for selecting and initializing the appropriate Git provider. This centralizes the dependency management for Git access, ensuring that the calling code does not need to know about authentication scopes, HTTP clients, or local filesystem paths.

Provider Selection Logic

The choice of provider is driven by the GitRepoConfig.Provider setting in the InstanceConfig:

  1. Local Checkout (git_checkout): Selected if the provider is explicitly set to CLI or left empty (default). This uses a local Git binary and a directory on disk.
  2. Gitiles (gitiles): Selected when the provider is set to Gitiles. This bypasses the local disk and interacts with repositories via the Gitiles Web API.

Unified Authentication Management

A key responsibility of this module is preparing the necessary credentials for remote communication.

  • For Gitiles, the factory automatically configures a google.DefaultTokenSource with the auth.ScopeGerrit scope. It then wraps this in a standard httputils client before passing it to the Gitiles implementation. This ensures that the provider is ready to perform authenticated API calls immediately upon creation.
  • For Local Checkout, authentication is handled internally by the git_checkout module (via gitauth and cookie files), but the factory ensures the correct environment configuration is passed down.

Key Components and Responsibilities

builder.go

This is the entry point for the module. It manages the imports for all supported backend implementations (git_checkout and gitiles). It acts as the “glue” that translates configuration strings into functional Go objects.

provider.Provider Interface

While defined in an external package (//perf/go/git/provider), this interface is the “contract” that this module fulfills. Any provider returned by the factory is guaranteed to support:

  • Fetching commits in chronological order.
  • Tracking history for specific files.
  • Retrieving commit metadata (author, timestamp, message).
  • Synchronizing with the remote source.

Key Workflows

Provider Initialization Workflow

The following diagram illustrates how the New function determines which implementation to return:

[ InstanceConfig ]
       |
       v
Check GitRepoConfig.Provider
       |
       +--- [ empty ] or "CLI" ----> [ Initialize git_checkout ]
       |                                     |
       |                                     v
       |                          Check/Clone local directory
       |                                     |
       |                                     v
       |                              Return git_checkout.Impl
       |
       +--- "Gitiles" -------------> [ Initialize Gitiles ]
       |                                     |
       |                              Create OAuth2 Token
       |                                     |
       |                              Setup HTTP Client
       |                                     |
       |                                     v
       |                              Return gitiles.Gitiles
       |
       +--- [ Other ] -------------> Return Error (Invalid Type)

Module: /go/git/providers/git_checkout

High-Level Overview

The git_checkout module provides a Git repository provider for the Perf system. It implements the provider.Provider interface by wrapping a local Git checkout and executing git commands via system calls. This module is designed for environments where a persistent, on-disk Git clone is preferred for performance or local tool integration, allowing Perf to synchronize with a remote repository and query its history.

Design and Implementation Choices

External Process Execution

The primary design choice is “shelling out” to the system's git executable rather than using a pure Go Git implementation. This ensures full compatibility with all Git features, including complex authentication schemes (like Gerrit/git-cookie) and standard performance optimizations that the native Git binary provides.

State and Lifecycle

The module manages a local directory (specified in the InstanceConfig).

  • Initialization: Upon creation, it verifies the existence of the directory. If it doesn't exist, it performs an initial git clone.
  • Synchronization: The Update method performs a git pull to bring the local checkout up to date with the remote tracking branch.
  • Commit Tracking: The provider supports a startCommit configuration. This acts as a logical “horizon”; the provider can be configured to ignore history preceding this commit, which is useful for large repositories where only recent history is relevant to performance tracking.

Authentication

The module integrates with Google Cloud's google.DefaultTokenSource to handle Gerrit authentication. When enabled, it uses gitauth to manage a /tmp/git-cookie file, ensuring that the background git processes have the necessary credentials to interact with protected remote repositories.

Efficient Log Parsing

To avoid loading large amounts of git history into memory at once, the module uses a streaming parser (parseGitRevLogStream). It pipes the output of git rev-list directly into a scanner that processes commits one-by-one, invoking a callback for each. This design allows the system to process thousands of commits with a constant memory footprint.

Key Components and Responsibilities

Impl Struct

The core implementation of the provider.Provider interface. It maintains the absolute path to the Git executable and the repository's location on disk.

Commit Retrieval (CommitsFromMostRecentGitHashToHead)

Retrieves new commits since a given hash. It utilizes git rev-list with a range (e.g., hash..HEAD) and specific formatting flags to extract the author, subject, and Unix timestamp.

  • If no recent hash is provided, it falls back to the startCommit.
  • If neither is available, it starts from the beginning of the repository's history reachable from HEAD.
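The fallback order above can be expressed as a small helper. The function name and signature here are illustrative, not the module's actual API:

```go
package main

import "fmt"

// revListRange picks the starting point for `git rev-list`,
// following the fallback order described above: most recent known
// hash, then the configured startCommit, then all of history
// reachable from HEAD.
func revListRange(mostRecentHash, startCommit string) string {
	switch {
	case mostRecentHash != "":
		return mostRecentHash + "..HEAD" // commits since the last known hash
	case startCommit != "":
		return startCommit + "..HEAD" // fall back to the configured horizon
	default:
		return "HEAD" // full history reachable from HEAD
	}
}

func main() {
	fmt.Println(revListRange("a1b2c3", "deadbeef")) // a1b2c3..HEAD
	fmt.Println(revListRange("", "deadbeef"))       // deadbeef..HEAD
	fmt.Println(revListRange("", ""))               // HEAD
}
```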

File-Specific Queries (GitHashesInRangeForFile)

Finds all commits within a range that modified a specific file. This is crucial for Perf's “blame” or “trace” features where changes to specific configuration or data files need to be tracked. It translates the request into a git log --format=%H -- <filename> command.

Metadata Extraction (LogEntry)

Provides the full, human-readable commit message and metadata for a specific hash using git show -s. This is typically used for UI display when a user inspects a specific point on a performance trace.

Key Workflows

New Provider Initialization

The following diagram illustrates the workflow when New() is called:

Config -> [ Auth Check ] -> ( Gerrit Auth via gitauth/git-cookie )
              |
              v
       [ Find Git Binary ] -> ( Resolve Absolute Path )
              |
              v
       [ Repo Check ] ------> ( Directory Exists? )
              |                     |
              | No                  | Yes
              v                     v
       ( Run git clone )     ( Use existing dir )
              |                     |
              +----------+----------+
                         |
                         v
                  Return Impl{}

Synchronizing and Processing Commits

The workflow for identifying new work typically follows this pattern:

Caller -> Update() -> [ git pull ]
  |
  +-> CommitsFromMostRecentGitHashToHead(last_hash)
        |
        +-> [ git rev-list last_hash..HEAD --pretty ]
              |
              +-> ( Stdout Pipe )
                    |
                    v
            [ parseGitRevLogStream ] -> ( Callback for each Commit )

Module: /go/git/providers/gitiles

Overview

The gitiles module provides an implementation of the provider.Provider interface that interacts with Git repositories via the Gitiles Web API. In the context of the Perf system, this module is responsible for discovering new commits, retrieving commit metadata, and filtering history for specific files without requiring a local checkout of the repository.

By using Gitiles, the system can operate in environments where local disk space is at a premium or where maintaining a constantly updated local git clone is inefficient. It acts as a bridge between the high-level performance tracking logic and the remote version control system.

Key Components and Responsibilities

Gitiles Struct

The core of the module is the Gitiles struct, which encapsulates the logic for communicating with a remote Gitiles instance. It stores configuration such as the repository URL, the target branch, and an optional starting commit to limit the scope of history processing.

Commit Ingestion and Processing

The primary responsibility of this module is to stream commits from a known point up to the current HEAD. This is handled by CommitsFromMostRecentGitHashToHead.

  • Design Choice (Batching): Instead of fetching one commit at a time, which would be prohibitively slow over HTTP, the module uses LogFnBatch to fetch commits in batches (defaulting to 100). This reduces round-trip overhead while maintaining a manageable memory footprint.
  • Design Choice (Ordering): The module uses the gitiles.LogReverse() option. This ensures that commits are processed in chronological order (oldest to newest), which is critical for the Perf system to build its internal representation of history linearly.
  • Branch Handling: The implementation distinguishes between the “main” branch and side branches. For the main branch, it typically requests a range from a hash to HEAD. For other branches, it uses the fully qualified branch name and a starting commit offset to ensure it tracks the correct line of development.
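The batching pattern can be sketched as a cursor loop. This is a behavioral illustration only: `fetchPage` stands in for the Gitiles API call, and all names here are assumptions rather than the module's real identifiers.

```go
package main

import "fmt"

// streamInBatches fetches pages of up to batchSize commits (oldest
// first, mirroring LogReverse ordering) and invokes cb for each,
// keeping memory bounded while amortizing HTTP round trips.
func streamInBatches(
	fetchPage func(cursor string, n int) (commits []string, next string, err error),
	batchSize int,
	cb func(commit string) error,
) error {
	cursor := ""
	for {
		commits, next, err := fetchPage(cursor, batchSize)
		if err != nil {
			return err
		}
		for _, c := range commits {
			if err := cb(c); err != nil {
				return err
			}
		}
		if next == "" {
			return nil // HEAD reached
		}
		cursor = next
	}
}

func main() {
	// Fake two-page API response for demonstration.
	pages := map[string][]string{"": {"c1", "c2"}, "p2": {"c3"}}
	nexts := map[string]string{"": "p2", "p2": ""}
	_ = streamInBatches(func(cursor string, n int) ([]string, string, error) {
		return pages[cursor], nexts[cursor], nil
	}, 100, func(c string) error {
		fmt.Println(c)
		return nil
	})
}
```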

File-Specific History

The GitHashesInRangeForFile method allows the system to query history for specific paths. This is used when the system needs to determine which commits actually modified a specific configuration file or test suite, allowing it to skip irrelevant commits during analysis.

Metadata Retrieval

The LogEntry method provides a standardized way to retrieve a formatted string containing commit details (Author, Date, Subject, Body). This is used for displaying commit information in the Perf UI.

Workflows

Commit Discovery Workflow

When the Perf system needs to update its view of the world, it triggers a discovery process through this provider:

[Perf System] -> Call CommitsFromMostRecentGitHashToHead(last_known_hash)
      |
      v
[Gitiles Provider] -> Determine branch expression (e.g., "refs/heads/main")
      |
      v
[Gitiles Provider] -> Request batch of commits from Gitiles API (Reversed)
      |
      +--< [Batch Received]
      |          |
      |          v
      |    [CommitProcessor Callback] -> (Perf system stores/indexes commit)
      |          |
      +----------+ (Repeat until HEAD reached)
      |
      v
[Perf System] -> Update complete

Implementation Details

  • Provider Interface: The module explicitly validates that it satisfies the provider.Provider interface at compile time: var _ provider.Provider = (*Gitiles)(nil).
  • Stateless Updates: The Update method is a no-op in this provider. Unlike local git providers that need to perform a git fetch to update local state, the Gitiles provider is naturally up-to-date as it queries the remote API directly for every request.
  • Error Handling: The module uses go.skia.org/infra/go/skerr to wrap errors from the Gitiles client, providing context on whether a failure occurred during batch loading, log retrieval, or callback processing.

Module: /go/git/schema

Perf Git Schema

The schema module defines the foundational data structures and database schema for tracking Git commits within the Perf system. This module serves as the “source of truth” for how commit metadata is persisted and mapped between the relational database and Go types.

Design Philosophy: Mapping History to Integers

A core requirement of the Perf system is the ability to map linear Git history to a continuous range of integers, referred to as CommitNumber. While Git natively identifies commits via non-linear hashes, the Perf database requires a strictly increasing integer key to efficiently handle time-series data, range queries, and regressions.

The Commit struct acts as the bridge between these two worlds. It pairs the immutable Git metadata (hash, author, timestamp) with a monotonically increasing CommitNumber that defines the commit's position in the Perf system's timeline.

Key Components

The Commit Struct

The Commit struct is designed to be used directly with an SQL-based ORM or schema generator. Its fields are chosen to satisfy the requirements of both the UI (showing author and subject) and the backend analytical engines (filtering by time or commit range).

  • CommitNumber: This is the primary key. It is used as the coordinate for the x-axis in Perf graphs. By using an integer as the primary key rather than the Git hash, the system ensures that queries for “the last N commits” or “commits between X and Y” are highly performant.
  • GitHash: Stored as a unique, non-null string to ensure data integrity. This allows the system to resolve external Git references back to the internal CommitNumber.
  • Timestamp: Stored as a Unix timestamp (seconds). This allows the system to correlate commits with the wall-clock time the data was ingested or produced, which is critical for identifying infrastructure-related regressions.
  • Metadata (Author/Subject): These fields are included to provide context in the Perf UI without requiring a secondary lookup to a Git host (like Gitiles) during the initial rendering of search results or alerts.

Data Workflow

The schema facilitates a workflow where raw Git data is ingested and “indexed” into the Perf database:

Git Repository          Ingestion Process                Perf Database (Schema)
+------------+          +-----------------------+        +---------------------------+
| SHA: a1b2c | ------>  | 1. Assign CommitNumber| ---->  | PK: CommitNumber (e.g. 5) |
| Author: .. |          | 2. Extract Metadata   |        | Hash: a1b2c               |
| Subject:.. |          | 3. Insert into DB     |        | Timestamp: 1672531200     |
+------------+          +-----------------------+        +---------------------------+

Legacy Compatibility

The struct includes JSON annotations designed to maintain serialization parity with the legacy cid.CommitDetail types. This implementation choice allows the backend to transition to the new schema-driven database approach without breaking existing frontend consumers that expect a specific JSON shape when requesting commit details.

Module: /go/graphsshortcut

High-Level Overview

The graphsshortcut module provides the core data structures and interfaces for managing graph shortcuts within the Perf system. A “shortcut” is a persistent snapshot of a user's dashboard configuration—including multiple graph definitions, queries, and formulas—represented by a unique, content-addressed ID.

This module acts as the domain layer for the “permalink” and “multigraph” features, allowing complex visualizations to be shared via short URLs. By decoupling the shortcut definition from the specific storage implementation, it enables the system to support diverse backends like SQL databases for production and in-memory caches for local development.

Design Decisions and Implementation Choices

Content-Addressable Identification

The module implements a deterministic ID generation strategy in the GetID() method. The ID is an MD5 hash derived from the contents of the GraphsShortcut object.

  • Deduplication: Because the ID is based on the content, identical graph configurations will always result in the same ID. This prevents the storage layer from accumulating redundant entries for identical shortcuts.
  • Normalization: Before hashing, the module sorts all Queries and Formulas within each GraphConfig. This ensures that two shortcuts representing the same data but created with different UI selection orders resolve to the same ID and are treated as identical.
  • Order Sensitivity: While internal query/formula order is normalized, the order of the Graphs array itself is preserved in the hash. This is a deliberate choice because the sequence of graphs on a dashboard is part of the user's intended layout.
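The normalization steps above can be sketched as follows. This is a minimal illustration of the hashing strategy; the real GraphsShortcut.GetID hashes the full GraphConfig contents, including Keys:

```go
package main

import (
	"crypto/md5"
	"fmt"
	"sort"
)

type graphConfig struct {
	Queries  []string
	Formulas []string
}

// getID derives a content-addressed ID: internal query/formula
// order is normalized by sorting, while the order of the graphs
// themselves is deliberately preserved in the hash.
func getID(graphs []graphConfig) string {
	h := md5.New()
	for _, g := range graphs {
		sort.Strings(g.Queries)
		sort.Strings(g.Formulas)
		for _, q := range g.Queries {
			fmt.Fprintln(h, q)
		}
		for _, f := range g.Formulas {
			fmt.Fprintln(h, f)
		}
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := []graphConfig{{Queries: []string{"config=8888", "arch=arm"}}}
	b := []graphConfig{{Queries: []string{"arch=arm", "config=8888"}}}
	fmt.Println(getID(a) == getID(b)) // true: UI selection order doesn't matter
}
```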

Interface-Driven Persistence

The module defines a Store interface rather than a concrete implementation. This allows the core logic of Perf to remain agnostic of the underlying database technology. It provides a contract for two primary operations:

  • InsertShortcut: Storing a configuration and returning its unique ID.
  • GetShortcut: Retrieving a configuration based on its ID.

Key Components

Data Models (graphsshortcut.go)

  • GraphConfig: Represents the parameters for a single visualization. It bundles Queries (trace filters), Formulas (mathematical transformations), and Keys (specific trace identifiers).
  • GraphsShortcut: A container for one or more GraphConfig objects. This allows a single shortcut to represent an entire dashboard of multiple graphs rather than just a single trace.

The Store Interface

The Store interface serves as the gateway to persistence. Implementations of this interface (found in the sibling graphsshortcutstore module) handle the complexities of JSON serialization and database interactions. By keeping the interface in this base module, the system avoids circular dependencies between the storage logic and the domain objects.

Identification Workflow

The following diagram illustrates how the module ensures that shortcut IDs are generated consistently, regardless of how the user interacted with the UI to create the queries.

[ User Input ]           [ GraphsShortcut.GetID() ]          [ Result ]
      |                             |                            |
      | 1. Create Shortcut          |                            |
      |    Graph A:                 |                            |
      |      - arch=arm             |                            |
      |      - config=8888          |                            |
      +---------------------------->|                            |
                                    | 2. Sort Queries/Formulas   |
                                    |    (arch, config)          |
                                    |-------------------+        |
                                    |                   |        |
                                    |<------------------+        |
                                    |                            |
                                    | 3. MD5 Hash Content        |
                                    |-------------------+        |
                                    |                   |        |
                                    |<------------------+        |
                                    |                            |
                                    | 4. Return Hex String       |
                                    +--------------------------->| "c21e3c..."

Module: /go/graphsshortcut/graphsshortcutstore

High-Level Overview

The graphsshortcutstore module provides implementations for persisting and retrieving graph shortcuts in the Perf system. A graph shortcut is essentially a saved state of multiple graphs—such as trace filters, queries, and display settings—that can be referenced via a unique ID.

This module fulfills a critical role in the “permalink” and “multigraph” features of Perf, allowing users to share or revisit complex dashboard configurations without encoding the entire state into a URL.

Design Decisions and Implementation Choices

The module is designed around the graphsshortcut.Store interface, with two distinct implementations tailored for different environment constraints:

  • Production Persistence (SQL): The standard implementation uses an SQL database (typically Spanner) for durable storage. It treats the database as a content-addressable store where the primary key is the shortcut ID and the payload is a JSON-serialized blob. This approach provides durability and global availability across production instances.
  • Local Development & Debugging (Cache): A specialized implementation, cacheGraphsShortcutStore, uses an in-memory or distributed cache rather than a database. This was specifically designed to solve the “breakglass” problem: when developers connect a local instance to a production database for debugging, they often lack write permissions to the production SQL tables. By routing shortcut writes to a local cache, developers can still use features like “multigraph” and shortcut generation without needing elevated database privileges.
  • JSON Serialization: Both implementations serialize the GraphsShortcut struct into JSON before storage. This avoids the need for a complex relational schema for the graph configurations themselves, which are frequently subject to UI-driven changes. By storing them as blobs, the store remains agnostic to the internal structure of the graph data.

Key Components

GraphsShortcutStore

Located in graphsshortcutstore.go, this is the primary SQL-backed implementation. It manages the lifecycle of shortcuts using two main operations:

  • InsertShortcut: Encodes the shortcut to JSON and performs an INSERT ... ON CONFLICT (id) DO NOTHING. The “Do Nothing” strategy is used because shortcut IDs are typically derived from the hash of their content; if the ID already exists, the content is identical, and no update is necessary.
  • GetShortcut: Retrieves the JSON blob by ID and decodes it back into the domain objects.
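The "Do Nothing" semantics can be modeled without a database. The sketch below simulates `INSERT ... ON CONFLICT (id) DO NOTHING` with a map: because IDs are content hashes, an existing row is guaranteed to hold identical content, so skipping the write is safe. The real store issues the SQL statement against Spanner; this is a behavioral illustration only:

```go
package main

import "fmt"

// insertDoNothing models ON CONFLICT (id) DO NOTHING: the write is
// silently skipped when the content-addressed id already exists.
func insertDoNothing(table map[string]string, id, graphsJSON string) {
	if _, exists := table[id]; exists {
		return // conflict: identical content is already stored
	}
	table[id] = graphsJSON
}

func main() {
	table := map[string]string{}
	insertDoNothing(table, "c21e3c", `{"graphs":[]}`)
	insertDoNothing(table, "c21e3c", `{"graphs":[]}`) // no-op on conflict
	fmt.Println(len(table)) // 1
}
```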

cacheGraphsShortcutStore

Located in cachegraphsshortcutstore.go, this implementation wraps a cache.Cache interface. It mirrors the logic of the SQL store—serializing data to JSON—but directs the output to a cache. This is the preferred implementation for local development environments.

Testing Infrastructure

The module utilizes a suite of subtests defined in graphsshortcuttest. This allows the SQL implementation to be verified against a real database instance (via sqltest.NewSpannerDBForTests) while ensuring it adheres to the standard behavior expected by the rest of the Perf application.

Shortcut Lifecycle Workflow

The following diagram illustrates how a shortcut moves from the application into persistent storage:

[ Perf UI / Logic ]          [ graphsshortcutstore ]          [ Storage (SQL or Cache) ]
         |                              |                             |
         | 1. Create GraphsShortcut     |                             |
         |----------------------------->|                             |
         |                              | 2. Serialize to JSON        |
         |                              |------------------+          |
         |                              |                  |          |
         |                              |<-----------------+          |
         |                              |                             |
         |                              | 3. Store (ID, JSON)         |
         |                              |---------------------------->|
         | 4. Return ID (Hash)          |                             |
         |<-----------------------------|                             |
         |                              |                             |
         | 5. Request ID (Permalink)    |                             |
         |----------------------------->|                             |
         |                              | 6. Fetch JSON by ID         |
         |                              |<----------------------------|
         |                              |                             |
         |                              | 7. Deserialize JSON         |
         |                              |------------------+          |
         |                              |                  |          |
         | 8. Return Shortcut Object    |<-----------------+          |
         |<-----------------------------|                             |

Module: /go/graphsshortcut/graphsshortcutstore/schema

High-Level Overview

The schema module defines the structural contract for persisting shortcut data within the Graphs Shortcut Store. Its primary purpose is to provide a unified Go representation of the underlying SQL table structure used to store and retrieve serialized graph configurations.

In the context of the Perf system, a “shortcut” is a durable reference to a specific state or collection of graphs. This module ensures that both the application code and the database schema remain synchronized regarding how these shortcuts are identified and stored.

Design Decisions and Implementation

The schema is designed around a simple key-value paradigm optimized for content-addressable storage or unique identifier lookups.

  • ID-Based Retrieval: The ID field serves as the unique handle for a set of graphs. By using a TEXT type with a UNIQUE NOT NULL PRIMARY KEY constraint, the system enforces data integrity at the database level, preventing collisions and ensuring that every shortcut can be retrieved with O(1) complexity via its primary key.
  • Serialized Persistence: The Graphs field is defined as a TEXT type rather than a structured relational set of tables. This implementation choice favors flexibility and performance for the following reasons:
    • Schema Evolution: Since the internal structure of a “graph” might change frequently in the frontend or higher-level Go modules, storing it as a serialized string (typically JSON) avoids the need for complex database migrations whenever a new UI property is added.
    • Atomic Operations: Storing the entire state in a single column allows the system to save or load a complete dashboard state in a single database operation, reducing the overhead of joins or multiple queries.

Key Components

GraphsShortcutSchema

The core structure GraphsShortcutSchema utilizes struct tags (sql:"...") to bridge the gap between Go types and the SQL dialect used by the persistence layer.

  • ID: Acts as the immutable identifier for the shortcut. In practice, this is often a hash of the content or a generated UUID, allowing the application to generate permalinks to specific graph views.
  • Graphs: Contains the payload of the shortcut. This is the “source of truth” for the graph configurations, including parameters like trace filters, time ranges, or specific visualization settings.

Data Flow Workflow

The following diagram illustrates how the schema acts as the intermediary between the application logic and the physical storage:

[ Application Logic ]          [ schema.GraphsShortcutSchema ]          [ SQL Database ]
          |                                  |                                 |
          |  1. Construct Schema Object      |                                 |
          |--------------------------------->|                                 |
          |     (Set ID and Serialized JSON) |                                 |
          |                                  |   2. Execute INSERT/SELECT      |
          |                                  |-------------------------------->|
          |                                  |   (Uses struct tags for SQL)    |
          |                                  |                                 |
          |  3. Receive Hydrated Object      | <-------------------------------|
          |<---------------------------------|                                 |
          |     (Deserialize Graphs field)   |                                 |

Module: /go/graphsshortcut/graphsshortcuttest

graphsshortcuttest

The graphsshortcuttest module provides a standardized test suite for validating implementations of the graphsshortcut.Store interface. By centralizing these tests, the system ensures that different storage backends (e.g., SQL-based, In-memory) behave consistently regarding data persistence, normalization, and error handling.

Design Philosophy: Contract Testing

The module is designed around the concept of “contract testing.” Instead of each implementation of a Store writing its own basic functional tests, they import and run the SubTests defined here. This approach ensures:

  1. Consistency: All backends must adhere to the same behavioral expectations.
  2. Normalization Verification: The tests specifically verify side effects of storage, such as the automatic sorting of query strings, ensuring that the Store acts as a canonicalization layer.
  3. Reduced Boilerplate: Implementation-specific test files only need to handle the setup/teardown of their respective drivers (like starting a Docker container for a database) before passing the resulting Store instance to this suite.

Key Components and Responsibilities

Test Suite Map (SubTests)

The primary entry point is the SubTests map. It maps descriptive test names to SubTestFunction signatures. This structure allows implementation-specific tests to iterate over the map and run each test as a subtest:

for name, subTest := range graphsshortcuttest.SubTests {
    t.Run(name, func(t *testing.T) {
        subTest(t, myStoreInstance)
    })
}

Functional Validation (InsertGet)

The InsertGet function validates the primary lifecycle of a shortcut. A key implementation detail tested here is query normalization. When a GraphsShortcut is provided with queries in an arbitrary order, the Store is expected to return them in a sorted, deterministic state. This is crucial for deduplication and predictable UI rendering.

Input Shortcut         Storage Backend          Output Shortcut
+--------------+       +--------------+        +--------------+
| Queries:     |       |              |        | Queries:     |
|  - arch=x86  | ----> |  Persist &   | -----> |  - arch=arm  |
|  - arch=arm  |       |  Normalize   |        |  - arch=x86  |
+--------------+       +--------------+        +--------------+

Error Handling (GetNonExistent)

This ensures that the Store implementation correctly propagates errors when a requested ID does not exist, rather than returning an empty or partially initialized object.

Implementation Details

  • graphsshortcuttest.go: Contains the logic for the test suite. It defines the SubTestFunction type, which abstracts the testing.T and graphsshortcut.Store dependency, allowing the tests to be decoupled from the actual storage driver.
  • Data Integrity: The tests use testify/assert and testify/require to enforce strict equality between what is sent to the store and what is retrieved, ensuring that no fields (like Keys or the list of Graphs) are dropped or corrupted during the serialization/deserialization process.

Module: /go/graphsshortcut/mocks

The /go/graphsshortcut/mocks module provides automated mock implementations of the graphsshortcut.Store interface. These mocks are designed to facilitate unit testing of components that depend on graph shortcut persistence without requiring a live database or storage backend.

Design and Implementation

The module utilizes testify/mock to provide a flexible, programmable implementation of the storage layer. This approach allows developers to:

  1. Isolate Unit Tests: Test business logic in services that use graph shortcuts by simulating various storage outcomes (success, specific errors, or timeouts).
  2. Verify Interactions: Assert that the expected methods are called with the correct parameters, ensuring that the calling code correctly handles the lifecycle of a shortcut.

The code is autogenerated using mockery, ensuring that the mock implementation remains strictly synchronized with the graphsshortcut.Store interface definition. This eliminates the maintenance overhead of manually updating test doubles when the primary interface changes.

Key Components

Store.go

This file defines the Store struct, which embeds mock.Mock. It provides mock implementations for the primary persistence operations:

  • GetShortcut(ctx, id): Simulates retrieving a serialized graph configuration. It allows tests to return a specific graphsshortcut.GraphsShortcut object or an error based on the provided ID.
  • InsertShortcut(ctx, shortcut): Simulates the creation of a new shortcut. In a test environment, this is typically used to return a pre-defined ID string, allowing the caller to proceed as if a database write succeeded.

Usage Workflow

The NewStore function is the entry point for utilizing these mocks. It integrates directly with the Go testing lifecycle by registering a cleanup function that automatically asserts expectations.

    +-------------------+           +-----------------------+
    |    Unit Test      |           |  Mocks.Store (Mock)   |
    +---------+---------+           +-----------+-----------+
              |                                 |
              | 1. NewStore(t)                  |
              +-------------------------------->|
              |                                 |
              | 2. On("GetShortcut").Return(...)|
              +-------------------------------->|
              |                                 |
              | 3. Invoke System Under Test     |
              +-------------------------------->|
              |                                 |
              | 4. AssertExpectations (Auto)    |
              |<--------------------------------+

By using NewStore(t), the mock is bound to the test's lifespan. If the code under test fails to call a method that was “set up” or calls it with the wrong arguments, the test will fail during the Cleanup phase.

Module: /go/ingest

Ingest Module

The go/ingest module serves as the primary entry point and configuration layer for the Skia Perf ingestion system. It provides the high-level logic to instantiate and connect the various sub-modules—filtering, formatting, parsing, and processing—into a cohesive pipeline that transforms raw benchmark files into indexed, searchable performance traces.

Overview

The module's main responsibility is to bridge the gap between human-readable configuration (usually provided via JSON or command-line flags) and the specialized internal engines that handle data. It defines the Config structure, which acts as the blueprint for an ingestion instance, specifying everything from where data is sourced (e.g., Google Cloud Storage) to how it should be validated and where it should be stored.

Design Decisions

Configuration-Driven Architecture

The design is heavily configuration-driven, centered around the Config struct. This allows a single binary to support vastly different ingestion workflows (e.g., internal Skia benchmarks vs. external Chrome performance tests) simply by changing the configuration file. This decoupling ensures that the core logic in process or parser remains agnostic of the specific environment.

Reliability and Observability

Because ingestion is the “front door” of the Perf system, the module is designed for high reliability:

  • Constructors with Validation: The NewConfig and subsequent component initializers validate inputs (like regex patterns or database connection strings) early. This “fail-fast” approach prevents the system from starting in a broken state.
  • Operational Metrics: The module wires up metrics2 across all sub-components, providing real-time visibility into ingestion rates, error frequencies, and latency.

Component Orchestration

The module doesn't just pass data; it manages the lifecycle of dependencies. For example, it coordinates the setup of the git.Git connector used by the process module to resolve hashes, ensuring that the local git cache is initialized before workers start processing files.

Key Components and Responsibilities

Ingest Configuration (config.go)

This is the central definition of an ingestion instance. It categorizes configuration into several key areas:

  • Source Configuration: Defines the SourceConfig (GCS bucket, file prefixes) and PubSubConfig for event-driven ingestion.
  • Ingestion Logic: Contains parameters for the Filter (which files to ignore) and the Parser (which branches to accept).
  • Infrastructure Links: Holds connection details for the TraceStore (where data goes) and the Git repository (how data is mapped to commits).

Integration Workflow

The following diagram illustrates how the ingest module assembles the sub-modules into a functional pipeline:

[ Config File / JSON ]
          |
          v
+------------------------+
|     Ingest Module      |
|  (Initialization)      |
+-----------+------------+
            |
            +--> [ Filter ]  (Rules for file selection)
            |
            +--> [ Parser ]  (Rules for data transformation)
            |
            +--> [ Git ]     (Connector for commit mapping)
            |
            v
+------------------------+
|    Process Module      |
|   (The Execution)      |
+-----------+------------+
            |
            +----[ Workers ]<----( Source: GCS/PubSub )
            |        |
            |        +--> [ Parse & Map ]
            |        |
            |        +--> [ Write to TraceStore ]
            |        |
            |        +--> [ Notify Downstream ]
            v
    [ Persistent Storage ]

Sub-Module Interaction

  • filter: Invoked early in the process to discard irrelevant files before they consume CPU time in the parser.
  • format: Provides the structural definitions and JSON schemas that the parser uses to validate incoming blobs.
  • parser: Utilized by the workers in the process module to turn raw bytes into standardized trace IDs and values.
  • process: The active engine started by the ingest module to manage the actual flow of data and interaction with databases.

Module: /go/ingest/filter

The go/ingest/filter module provides a mechanism for determining whether a file should be processed or ignored during ingestion based on its name. This is a critical component for performance monitoring pipelines that ingest data from large-scale storage (like Google Cloud Storage), where filtering out irrelevant files or transaction logs early prevents unnecessary resource consumption and reduces processing noise.

Logic and Design

The filtering logic is built around two optional regular expressions: accept and reject. The design is “deny-by-default” when an accept pattern is provided and “allow-by-default” otherwise; in either case, a file matching the reject pattern is discarded.

The evaluation logic follows these rules:

  1. Acceptance: If an accept regex is defined, the filename must match it. Failure to match results in immediate rejection.
  2. Rejection: If a reject regex is defined, the filename must not match it. A match results in immediate rejection.
  3. Default: If neither regex is provided, all filenames are accepted.

The Filter struct caches the compiled *regexp.Regexp objects to ensure that performance is optimized for high-volume ingestion where thousands of filenames may be evaluated against the same ruleset.
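The rules above can be sketched in a few lines of Go. This is a minimal illustration of the described semantics, not the actual filter.go source; the struct layout and error handling are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// Filter mirrors the described accept/reject semantics with cached,
// pre-compiled regexps. A nil regex means the rule is not configured.
type Filter struct {
	accept *regexp.Regexp // nil: accept everything.
	reject *regexp.Regexp // nil: reject nothing.
}

// New compiles the optional regex strings, failing fast on bad syntax.
func New(accept, reject string) (*Filter, error) {
	f := &Filter{}
	var err error
	if accept != "" {
		if f.accept, err = regexp.Compile(accept); err != nil {
			return nil, err
		}
	}
	if reject != "" {
		if f.reject, err = regexp.Compile(reject); err != nil {
			return nil, err
		}
	}
	return f, nil
}

// Reject returns true if the file should be discarded.
func (f *Filter) Reject(name string) bool {
	if f.accept != nil && !f.accept.MatchString(name) {
		return true // Rule 1: accept regex defined but not matched.
	}
	if f.reject != nil && f.reject.MatchString(name) {
		return true // Rule 2: reject regex matched.
	}
	return false // Rule 3: accepted.
}

func main() {
	f, _ := New(`\.json$`, `^tx_log/`)
	fmt.Println(f.Reject("results/run1.json")) // false: accepted
	fmt.Println(f.Reject("tx_log/run2.json"))  // true: matches reject
	fmt.Println(f.Reject("readme.txt"))        // true: fails accept
}
```

Returning `true` for rejection keeps call sites in guard-clause form, e.g. `if filter.Reject(name) { continue }`.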

Workflow

The following diagram illustrates the decision flow when Filter.Reject(name) is called:

       [ Input Filename ]
               |
               v
     +-------------------+
     |  Is Accept Regex  |---- No ----+
     |      Defined?     |            |
     +---------+---------+            |
               | Yes                  |
               v                      |
     +---------+---------+            |
     | Does it Match?    |---- Yes ---+
     +---------+---------+            |
               | No                   |
               v                      |
      [ REJECTED (true) ]             |
                                      v
                            +-------------------+
                            |  Is Reject Regex  |---- No ----+
                            |      Defined?     |            |
                            +---------+---------+            |
                                      | Yes                  |
                                      v                      |
                            +---------+---------+            |
                            | Does it Match?    |---- No ----+
                            +---------+---------+            |
                                      | Yes                  |
                                      v                      v
                             [ REJECTED (true) ]    [ ACCEPTED (false) ]

Key Component

filter.go

This file contains the core Filter implementation.

  • New(accept, reject string): Validates and compiles the provided regex strings. It returns an error if the regex syntax is invalid, ensuring that ingestion processes fail fast during configuration rather than during runtime execution.
  • Reject(name string) bool: The primary interface for the module. It returns true if the file should be discarded and false if it should be processed. By returning true for a “reject” action, it allows callers to write clean guard clauses like if filter.Reject(name) { continue }.

Module: /go/ingest/format

Ingest Format

The go/ingest/format module defines the data structures and validation logic for performance data files ingested into the Perf system. It serves as the formal specification for how external processes and test runners should format their results to be correctly indexed and visualized.

Overview

The primary goal of this module is to provide a flexible yet strictly validated schema that maps raw performance measurements to Trace IDs. A Trace ID is a comma-separated string of key-value pairs (e.g., ,arch=x86,config=8888,test=draw_circle,units=ms,stat=min,) used by Perf to identify a unique time series of data points.

The module supports two formats:

  1. Standard Format (Version 1): The modern, recommended format designed for clarity and multi-metric reporting.
  2. Legacy Format: A format primarily used by nanobench (Skia's internal microbenchmarking tool), which relies on nested maps.

Design Decisions

Trace ID Construction

The design favors a flat key-value structure for identification. The Format struct and its sub-components (Result, SingleMeasurement) are structured so that keys defined at the top level (global to the file) are merged with keys defined at the result level and specific measurement level. This allows for efficient data representation where common metadata (like git_hash or arch) is defined once, while specific metrics (like min, max, median) are defined locally.

Multi-Metric Support

A single test run often produces multiple related values (e.g., different statistical aggregations of the same test). The Result struct allows a single entry to contain multiple measurements. This avoids duplicating the entire metadata block for every statistical variation of a single test, reducing file size and improving readability.

Single Source of Truth via JSON Schema

To prevent “schema drift” between the Go implementation and external data producers, the module uses an embedded JSON Schema (formatSchema.json). This schema is programmatically generated from the Go structs (via the generate submodule). This ensures that validation logic used during ingestion is identical to the documentation provided to contributors.

Key Components

Standard Format (format.go)

This is the primary entry point for modern ingestion. It defines the Format struct, which includes:

  • Contextual Metadata: GitHash, Issue (CL), and Patchset to associate results with specific code versions.
  • Global Keys: A Key map containing parameters that apply to every measurement in the file (e.g., hardware configuration).
  • Results: A list of Result objects. Each result can either be a simple Measurement (float32) or a complex Measurements map.

Trace ID Generation Workflow

The following diagram illustrates how keys are aggregated from different levels of the Format struct to form a final Trace ID:

File Level:
  {"key": {"arch": "x86", "config": "8888"}}
      |
      v
Result Level:
  {"key": {"test": "draw_circle", "units": "ms"}}
      |
      v
Measurement Level:
  {"measurements": {"stat": [{"value": "min", "measurement": 1.2}]}}
      |
      +---------------------------------------+
      | Resulting Trace ID:                   |
      | ,arch=x86,config=8888,stat=min,test=draw_circle,units=ms, |
      +---------------------------------------+
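The merge above can be sketched as a map union followed by sorted serialization. This is an illustrative reconstruction of the behavior, not the actual Perf key-encoding code:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// traceID builds the ",key=value,...," form described above from
// merged key maps; later maps override earlier ones on conflict.
func traceID(maps ...map[string]string) string {
	merged := map[string]string{}
	for _, m := range maps {
		for k, v := range m {
			merged[k] = v
		}
	}
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys) // keys appear in sorted order in the ID.
	var b strings.Builder
	b.WriteString(",")
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%s,", k, merged[k])
	}
	return b.String()
}

func main() {
	fileLevel := map[string]string{"arch": "x86", "config": "8888"}
	resultLevel := map[string]string{"test": "draw_circle", "units": "ms"}
	measurementLevel := map[string]string{"stat": "min"}
	fmt.Println(traceID(fileLevel, resultLevel, measurementLevel))
	// ,arch=x86,config=8888,stat=min,test=draw_circle,units=ms,
}
```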

Validation and Parsing

The module provides robust utilities to ensure data integrity:

  • Parse: Decodes JSON into the Format struct and enforces version checking.
  • Validate: Performs a two-pass check. First, it ensures the JSON is syntactically valid and matches the internal Go types. Second, it validates the blob against the embedded JSON Schema to catch logic errors (e.g., missing required fields like git_hash).
  • GetLinksForMeasurement: A helper function that resolves all URLs associated with a specific trace. It merges global links (file-level) with measurement-specific links, allowing users to jump from a specific data point in the Perf UI to external logs or artifacts.

Legacy Format (legacyformat.go)

This file maintains compatibility with older Skia tooling. It defines the BenchData struct, which uses a more nested structure: Results -> [Config/State] -> [TestName] -> [Metric]. Unlike the standard format, which is versioned and schema-validated, the legacy format is handled as a map[string]interface{} to accommodate the highly dynamic nature of older benchmark outputs.

Embedded Schema (formatSchema.json)

The schema file is embedded into the Go binary using //go:embed. This allows the Validate function to perform schema validation without requiring external file dependencies at runtime. It defines the strict requirements for the Version 1 format, such as mandatory fields and allowed data types for measurements.

Module: /go/ingest/format/generate

The generate module is a utility designed to maintain consistency between the Go-based implementation of the Perf ingestion format and its external documentation. Its primary responsibility is to act as a code-to-schema compiler that ensures the structural definition of data ingested into Perf remains synchronized across all tools and external data producers.

Design Philosophy

The core design decision behind this module is to treat the Go source code as the single source of truth for the ingestion protocol. In a complex data pipeline, the format used to describe performance results can evolve. Manually maintaining a separate JSON Schema file is error-prone and leads to “schema drift,” where the documentation or validation rules fail to match the actual parsing logic in the Go codebase.

By programmatically generating the schema from the format.Format struct, the system guarantees that:

  • Any field added, renamed, or removed in the Go struct is immediately reflected in the schema.
  • Validation logic (e.g., required fields or data types) remains identical for both internal processing and external validation.
  • External developers producing data for Perf have a machine-readable specification that is guaranteed to be accurate.

Implementation Strategy

The module leverages the go.skia.org/infra/go/jsonschema package to perform structural reflection on the format.Format struct.

  1. Reflection: The generator inspects the Go types, field names, and especially the json tags of the target struct.
  2. Mapping: It maps Go-specific primitives and complex types to their corresponding JSON Schema representations.
  3. Output: The result is serialized into a standard formatSchema.json file located in the parent directory. This output is used by other parts of the system to validate incoming JSON blobs before they are processed by the ingestion engine.

Process Flow

The following diagram illustrates how this module fits into the development workflow:

+--------------------------+
|  perf/go/ingest/format/  |
|      (Go Structs)        |
+------------+-------------+
             |
             | Source of Truth
             v
+------------+-------------+
| /format/generate/main.go |  <-- This Module
+------------+-------------+
             |
             | Reflection & Generation
             v
+------------+-------------+
|    formatSchema.json     |
| (Machine-readable Spec)  |
+------------+-------------+
             |
             +----------------------------> External Data Producers
             |
             +----------------------------> Validation Middlewares

Key Components

  • Schema Generator (main.go): The entry point that executes the generation logic. It specifically references the format.Format struct and directs the output to a static file path relative to the module. Its simplicity is intentional, as it acts strictly as a bridge between the internal type system and the file system.

Module: /go/ingest/parser

High-Level Overview

The parser module provides the logic for transforming raw performance data files into a standardized format suitable for storage in a trace-based database. It serves as the translation layer between external benchmarking tools—which may produce data in various JSON schemas—and the internal Perf system.

The module is designed to handle both a “Legacy” format and a modern “Version 1” format, ensuring backward compatibility while supporting newer features like explicit commit positions and complex measurement maps.

Design Philosophy and Implementation Choices

The parser's implementation is guided by the need for data integrity and system stability in a high-volume ingestion pipeline:

  • Format Autodiscovery: Instead of requiring explicit configuration for file types, the Parser attempts to decode files using the Version 1 schema first. If that fails, it falls back to the Legacy parser. This allows a single ingestion pipeline to handle a heterogeneous mix of data sources.
  • Key Sanitization: A critical responsibility of the parser is ensuring that parameter keys and values do not contain characters (like , or =) that would break the internal string-based representation of traces. It uses configurable regular expressions and a “force valid” approach to replace illegal characters, preventing database corruption or query errors.
  • Metric Filtering: To keep the database clean, the parser identifies and discards “noise.” For example, it explicitly ignores parameters prefixed with GL_ (internal OpenGL constants) in legacy files, as these are considered too verbose for high-level performance tracking.
  • Branch-Based Gating: The parser can be configured to only accept data from specific branches. This prevents experimental or development branch data from polluting the production performance metrics unless explicitly desired.
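The “force valid” sanitization step can be sketched as a regex substitution over every key and value. The regex and function names below are illustrative stand-ins for the configurable invalidParamCharRegex mentioned above:

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidParamChar matches characters that would break the
// ",key=value," trace encoding. The real pattern is configurable.
var invalidParamChar = regexp.MustCompile(`[,=]`)

// forceValid rewrites every key and value so the result is safe to
// embed in a trace ID, replacing illegal characters with "_".
func forceValid(params map[string]string) map[string]string {
	out := make(map[string]string, len(params))
	for k, v := range params {
		out[invalidParamChar.ReplaceAllString(k, "_")] =
			invalidParamChar.ReplaceAllString(v, "_")
	}
	return out
}

func main() {
	fmt.Println(forceValid(map[string]string{"test": "a=b,c"}))
	// map[test:a_b_c]
}
```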

Key Components and Responsibilities

Parser Struct (parser.go)

The central coordinator of the module. It maintains the state necessary for ingestion, including:

  • Validation Logic: Uses invalidParamCharRegex to sanitize incoming metadata.
  • Branch Filtering: Holds a map of allowed branch names to quickly decide if a file should be processed or skipped via ErrFileShouldBeSkipped.
  • Metrics Tracking: Integrates with metrics2 to track successful parses, failures, and files with no data, providing operational visibility into the ingestion pipeline.

Version 1 Parsing

Handles the modern schema which supports:

  • Commit Numbers: Recognizes the CP:nnnnnn prefix in the git_hash field to treat Git hashes as sequential commit positions.
  • Complex Measurements: Processes the Measurements map, which allows a single result entry to contain multiple named metrics (e.g., min_ms, max_rss) without duplicating the common metadata.
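The CP: convention can be sketched as follows; the function name and return shape are illustrative, not the real parser API:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// commitNumberFromGitHash implements the described "CP:nnnnnn"
// convention: with the prefix, the remainder must be a decimal commit
// position; without it, the value is treated as a normal Git hash.
func commitNumberFromGitHash(gitHash string) (int64, bool, error) {
	rest, ok := strings.CutPrefix(gitHash, "CP:")
	if !ok {
		return 0, false, nil // plain Git hash; resolve via the repo.
	}
	n, err := strconv.ParseInt(rest, 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("invalid commit position %q: %w", gitHash, err)
	}
	return n, true, nil
}

func main() {
	n, isCP, _ := commitNumberFromGitHash("CP:727901")
	fmt.Println(n, isCP) // 727901 true
	_, _, err := commitNumberFromGitHash("CP:727A901")
	fmt.Println(err != nil) // true: malformed commit position
}
```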

Legacy Parsing

Maintains compatibility with older benchmarking outputs. Its primary task is flattening deeply nested JSON structures (Test Name -> Config -> Results) into a flat list of parameters and float values. It also handles the extraction of “Samples” (multiple runs of the same test) which are specifically aggregated for the min_ms sub-result.
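The flattening traversal can be sketched as three nested loops. This is a simplified illustration under assumed key names (test, config, sub_result); the real legacy parser also filters GL_ keys and aggregates sample arrays:

```go
package main

import "fmt"

// flattenLegacy turns the nested testName -> config -> metric
// structure into one flat params map plus a value per metric.
func flattenLegacy(results map[string]map[string]map[string]float32) ([]map[string]string, []float32) {
	var params []map[string]string
	var values []float32
	for test, configs := range results {
		for config, metrics := range configs {
			for metric, v := range metrics {
				params = append(params, map[string]string{
					"test": test, "config": config, "sub_result": metric,
				})
				values = append(values, v)
			}
		}
	}
	return params, values
}

func main() {
	p, v := flattenLegacy(map[string]map[string]map[string]float32{
		"draw_circle": {"8888": {"min_ms": 1.2}},
	})
	fmt.Println(p[0], v[0])
}
```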

Parameter Management

Functions like buildInitialParams and getParamsAndValuesFromVersion1Format are responsible for merging “Global” keys (describing the machine/environment) with “Local” keys (describing the specific test run). This creates the unique identity for every performance trace.

Data Transformation Workflow

The following diagram illustrates how the Parse method processes a file from raw input to standardized trace data:

Input: file.File (Name, Contents)
          |
          V
[ Read all contents into memory ]
(Allows multiple passes for format detection)
          |
          V
[ Try Version 1 Extraction ] ---- Success? ----> [ Sanitize Keys ]
          |                                            |
        Fail?                                          |
          |                                            |
[ Try Legacy Extraction ] ------- Success? ----> [ Sanitize Keys ]
          |                                            |
        Fail?                                          |
          |                                            |
[ Return Error ]                                [ Filter by Branch ]
                                                       |
                                              [ Skip if excluded? ]
                                                       |
                                                       V
                                          Standardized Output:
                                          - []paramtools.Params (Trace IDs)
                                          - []float32 (Values)
                                          - Hash (Commit ID)

Key Files

  • parser.go: Contains the primary Parser implementation and logic for both schema versions.
  • parser_test.go: Defines the behavioral contract of the parser using a wide array of test fixtures to ensure stability across edge cases like malformed JSON or special character collisions.
  • testdata/: An authoritative collection of JSON fixtures representing different data scenarios (success, failure, different schemas) used to validate the parser's logic.

Module: /go/ingest/parser/testdata

The /go/ingest/parser/testdata module serves as the authoritative collection of test fixtures for the performance data ingestion system. Its primary role is to define the operational boundaries of the ingestion logic, ensuring that the system remains resilient across schema evolutions, handles data corruption gracefully, and correctly identifies performance metrics from various benchmarking sources.

Design Philosophy and Implementation Choices

The directory is structured to separate data by schema version (legacy and version_1). This separation reflects a fundamental design choice to maintain strict backward compatibility while allowing for the evolution of the ingestion format.

The fixtures are designed around three core testing principles:

  1. Identity Verification: Ensuring that the combination of global keys (e.g., OS, Architecture) and local result keys (e.g., Test Name, Configuration) correctly resolves to a unique time-series identity.
  2. Sanitization and Collisions: Validating that the parser can handle special characters (,, =) that might otherwise collide with the internal delimiters used by the time-series database.
  3. Data Filtering: Defining “noise” (such as legacy GL_ prefixes or experimental branch data) through negative test cases, ensuring that only relevant metrics enter the long-term storage.

Key Components

Legacy Data Handling (/legacy)

The files within this component represent a historical, more permissive JSON schema. The parser's responsibility here is heavily focused on traversal and filtering. Because the legacy format lacks strict enforcement, the test data validates the parser's ability to:

  • Navigate Deep Nesting: Locating metrics within structures like results -> configuration -> metrics.
  • Execute Exclusion Rules: Using files like unknown_branch.json to verify that data from specific development paths is discarded.
  • Manage Mixed Types: Handling arrays that may contain both numeric data and non-numeric “noise” within the same block.

Version 1 Schema Validation (/version_1)

The Version 1 fixtures represent the modern, more structured ingestion format. The focus shifts from filtering noise to identifying metadata and handling special cases. Key responsibilities demonstrated here include:

  • Commit Position Resolution: Utilizing the CP: prefix in git_hash fields to distinguish between traditional Git SHAs and sequential commit numbers, as seen in with_commit_number.json.
  • Escaping Logic: Validating that the parser correctly preserves data integrity when keys or values contain mathematical symbols or database delimiters (e.g., with_comma_in_param.json).
  • Measurement Aggregation: Testing how the system interprets different measurement formats, such as single scalar values versus maps of multi-config measurement arrays.

High-Level Ingestion Workflow

The following diagram illustrates how the ingestion logic uses these fixtures to transform raw JSON input into a standardized record:

Raw JSON File (Test Data)
      |
      V
[ Format Detection ] -----------------------+
      |                                     |
      +--> [ Legacy Parser ]                +--> [ Version 1 Parser ]
      |    (Filters GL_ prefixes,           |    (Handles CP: prefixes,
      |     Maps nested metrics)            |     Resolves Identity Keys)
      |                                     |
      V                                     V
[ Key Normalization ] <---------------------+
      |
      |-- Check for delimiter collisions (",", "=")
      |-- Merge global and local key blocks
      |
      V
[ Value Extraction ]
      |
      |-- Convert numeric strings to float32
      |-- Validate sample arrays (ignore non-numeric)
      |
      V
Standardized Ingestion Record
(Used for Database Write)

Usage in Testing

These files are not merely static examples; they are the inputs for the parser_test.go suite. The system compares the output of the parser against “golden” expectations derived from these files.

  • Positive tests (e.g., success.json) ensure that valid data is correctly parsed into the internal data model.
  • Negative tests (e.g., invalid.json, invalid_commit_number.json) ensure that the parser returns explicit errors or handles exceptions without crashing, which is critical for a high-volume ingestion pipeline where malformed data is a common occurrence.

Module: /go/ingest/parser/testdata/legacy

The /go/ingest/parser/testdata/legacy directory serves as a comprehensive suite of test fixtures designed to validate the ingestion and parsing logic for legacy performance result formats. These files are used to ensure that the parser correctly handles various edge cases, data structures, and validation rules inherent in the older JSON schema used by benchmarking systems.

Purpose and Design

The primary goal of these data files is to define the expected boundaries of the legacy ingestion system. Because legacy formats often lack strict schema enforcement, these files document through example how the parser should interpret nested objects, handle missing keys, and filter out noise.

Key design considerations reflected in these files include:

  • Schema Flexibility: Validating how the parser traverses deeply nested structures (e.g., results -> test_name -> configuration -> metrics).
  • Data Integrity: Defining which fields are considered “noise” (like specific GL prefixes or non-string values in certain contexts) and should be discarded during ingestion.
  • Robustness: Ensuring the system handles corrupted or empty data gracefully without crashing.

Key Test Scenarios

1. Data Structure Variations

  • Measurement Types: one_measurement.json and zero_measurement.json verify the parser's ability to extract single data points and handle boundary values like zero, which might otherwise be misinterpreted as missing data.
  • Sample Aggregation: samples_success.json demonstrates how the system handles arrays of raw performance measurements (samples) alongside aggregated statistics like min_ms. It also tests the parser's ability to ignore non-numeric values within these arrays.
  • Metadata Handling: Files like success.json showcase complex results containing a mix of key (identifying the environment), options (contextual metadata), and meta (test-specific metrics like max_rss_mb).

2. Filtering and Validation Logic

Several files contain keys specifically named SHOULD_NOT_APPEAR_IN_RESULTS... (e.g., in samples_success.json and unknown_branch.json). These are used to test the parser's filtering logic:

  • Prefix Filtering: Ignoring keys starting with GL_.
  • Type Validation: Ensuring that only string values are accepted in certain metadata blocks, while non-numeric values are ignored in measurement blocks.
  • Branch Filtering: unknown_branch.json tests the ingestion engine's ability to ignore data from specific development branches (e.g., “ignoreme”).

3. Error and Edge Cases

  • Malformed Input: invalid.json provides a baseline for how the parser handles non-JSON content.
  • Empty Results: no_results.json and samples_no_results.json ensure the system doesn't fail when valid metadata is present but no actual performance metrics are included.

Workflow: Data Parsing Logic

The following diagram illustrates how the ingestion logic typically processes a file from this test data set:

JSON Input File
      |
      V
[ Schema Validation ] ----> If Invalid (invalid.json) -> Reject
      |
      V
[ Extract Global Keys ] --> gitHash, issue, patchset, system
      |
      V
[ Iterate Results ]
      |
      +-- [ Filter Keys ] --> Ignore "GL_" prefixes or "ignoreme" branches
      |
      +-- [ Parse Metrics ]
      |         |
      |         +-- Numeric values (min_ms, samples) -> Store
      |         +-- Non-numeric/Strings in Metrics   -> Discard
      |
      +-- [ Map Metadata ] -> Map 'options' and 'meta' to result tags
      |
      V
Processed Ingestion Record

File Summary

  • success.json / samples_success.json: The gold standard for valid legacy data, covering diverse configurations (8888, 565, gpu) and metric types.
  • invalid.json: Tests resilience against syntax errors.
  • no_results.json / samples_no_results.json: Tests handling of empty result sets with valid headers.
  • one_measurement.json / zero_measurement.json: Tests specific numerical edge cases.
  • unknown_branch.json: Tests environmental filtering logic.

Module: /go/ingest/parser/testdata/version_1

This directory serves as a comprehensive suite of test cases for the “Version 1” ingestion format. It contains JSON files designed to validate the robustness, edge-case handling, and schema compliance of the ingestion parser.

Purpose and Design Decisions

The primary goal of these data samples is to define the boundaries of what the parser should accept, reject, or transform. The design of these files reflects real-world ingestion scenarios where data might be messy, incomplete, or formatted using specific conventions (such as commit position markers).

By providing these samples, the module ensures that the parser can:

  1. Differentiate between various measurement structures (single values vs. multi-config arrays).
  2. Handle identity metadata across different levels of the JSON hierarchy.
  3. Sanitize or preserve special characters within keys and values.

Key Data Scenarios

The test data can be categorized into three functional groups:

1. Valid and Edge-Case Data Structures

These files demonstrate the flexibility of the Version 1 schema:

  • Success Cases (success.json, one_measurement.json): Demonstrate the standard structure. Results can contain a top-level measurement or a nested measurements map containing arrays of values (e.g., different configs like 8888 or gpu).
  • Commit Identity (with_commit_number.json): Shows the use of the CP: prefix in the git_hash field to represent a “Commit Position” rather than a standard Git SHA.
  • Result Variations: Includes cases with zero measurements (zero_measurement.json) or entirely empty results lists (no_results.json), which the parser must handle without crashing.

2. Character and Format Robustness

Ingested data often contains characters that could conflict with internal storage formats (like key-value pair delimiters).

  • Special Characters (with_special_chars.json): Tests a wide range of symbols (e.g., !~@#$%^&*()) within both keys and values.
  • Delimiters (with_comma_in_param.json, with_equal_in_param.json): Specifically tests strings containing , and =, which are often used as separators in time-series databases. These files verify that the parser correctly escapes or encapsulates these values.

3. Error and Validation Cases

These files define the failure modes of the parser:

  • Syntactic Errors (invalid.json): Plain text that is not valid JSON.
  • Schema Violations (invalid_commit_number.json): Covers cases where the git_hash field carries a malformed commit-position value (e.g., CP:727A901, where the text after the CP: prefix is not a valid decimal number).

Data Workflow

The following diagram illustrates how the parser uses these files to determine the final identity of a performance measurement:

JSON Input File
      |
      V
+-----------------+      +-----------------------+
|  Global Keys    |----->| Common Metadata       |
| (arch, os, etc) |      | (applied to all)      |
+-----------------+      +-----------+-----------+
                                     |
                                     V
+-----------------+      +-----------------------+      +--------------------+
|  Result Keys    |----->| Unique Series ID      |----->| Final Ingested     |
| (test, config)  |      | (Global + Result Keys)|      | Data Point         |
+-----------------+      +-----------------------+      +--------------------+
                                     |
      +------------------------------+
      |
      V
+-----------------+      +-----------------------+
|  Measurements   |----->| Value (float32)       |
| (single or map) |      |                       |
+-----------------+      +-----------------------+

Key Components

  • git_hash / version: Every valid file includes these to identify the schema version and the point in time the data represents.
  • key block: Found at both the root level (global params) and within individual results (test-specific params). The parser must merge these to create a full set of dimensions for the data.
  • links: Demonstrated in with_commit_number.json, showing how external references (like documentation or build logs) are attached to the ingestion record.

Module: /go/ingest/process

Ingest Process

The process module is the core execution engine of the Skia Perf ingestion pipeline. It coordinates the lifecycle of performance data from its raw state in a source (like Google Cloud Storage or a local directory) to its indexed state within a TraceStore.

Overview

The module provides a multi-threaded ingestion worker system that handles parsing, commit mapping, data normalization, and storage. It is designed to be resilient, utilizing retries for database operations and supporting Google Cloud Pub/Sub for both input signaling and downstream event notifications.

The entry point is the Start function, which initializes the necessary infrastructure components—source monitors, trace stores, metadata stores, and git connectors—and launches a configurable number of parallel worker goroutines.

Key Components and Responsibilities

Worker Lifecycle

The system operates using a producer-consumer model:

  1. Source: A file.Source (e.g., GCS bucket listener) produces file.File objects onto a channel.
  2. Workers: Multiple worker goroutines consume from this channel. Each worker maintains its own parser.Parser to handle the transformation of raw bytes into structured performance data.
  3. Processing: Each file undergoes a specific workflow: Parse -> Commit Mapping -> ParamSet Construction -> Store Write -> Event Notification.

Commit Mapping and Git Integration

Ingested files typically contain a Git hash. The process module is responsible for resolving this hash into a monotonic types.CommitNumber.

  • It uses the git.Git interface to look up commit numbers.
  • If a hash is unrecognized, the worker triggers an update of the local git metadata to ensure it isn't simply a new commit that hasn't been cached yet.
  • If a hash remains invalid after an update, the file is acknowledged (skipped) to prevent infinite retry loops in Pub/Sub.

Data Normalization and Storage

Once parsed and mapped to a commit, the data is prepared for the tracestore.TraceStore:

  • ParamSet Construction: The worker aggregates all parameters from the file into a ParamSet, which serves as the index for searching traces.
  • Resilient Writing: Database writes are wrapped in a retry loop (defaulting to 10 attempts) to handle transient failures or contention in the underlying storage (e.g., Spanner or SQL).
  • Metadata: If the file contains supplemental links or metadata, these are stored in a separate MetadataStore.

Downstream Notifications

After a successful write, the module can notify other services via a Pub/Sub topic defined in FileIngestionTopicName.

  • It filters and deduplicates trace IDs (see Trace Clustering Logic) before sending.
  • This allows downstream systems like the clustering or anomaly detection engines to react immediately to newly ingested data.

Trace Clustering Logic

To optimize downstream processing (like clustering), the module includes logic to prune redundant trace IDs. Many performance tests report multiple statistics for the same logical test (e.g., test_name, test_name_avg, test_name_min).

The getTraceIdsForClustering function implements a “canonicalization” check:

  • If a trace has a suffix like _avg, _min, _max, or _count, the system looks for a “canonical” version of that same trace (the name without the suffix and a matching stat key).
  • If the canonical version exists in the same file, the suffixed version is excluded from the Pub/Sub notification to reduce noise.
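The canonicalization check can be approximated like this; the real trace keys are structured key-value strings rather than bare names, so this is a simplified sketch of the suffix logic only:

```go
package main

import (
	"fmt"
	"strings"
)

// pruneForClustering drops trace IDs that only differ from another ID in
// the same file by a statistic suffix, approximating the check described
// for getTraceIdsForClustering.
func pruneForClustering(ids []string) []string {
	present := map[string]bool{}
	for _, id := range ids {
		present[id] = true
	}
	suffixes := []string{"_avg", "_min", "_max", "_count"}
	var out []string
	for _, id := range ids {
		redundant := false
		for _, s := range suffixes {
			if canonical, found := strings.CutSuffix(id, s); found && present[canonical] {
				redundant = true // the canonical trace is in the same file
				break
			}
		}
		if !redundant {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	fmt.Println(pruneForClustering([]string{"load_time", "load_time_avg", "load_time_min", "fps_max"}))
	// load_time_avg and load_time_min are dropped; fps_max stays because
	// no canonical "fps" trace appears in the same file.
}
```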

Internal Workflow

[ Source (GCS/Dir) ]
        |
        v
[ file.File Channel ]
        |
        +----[ Worker 1 ]----[ Parser ]----> (Parsed Data)
        |          |
        |          +---------[ Git ]-------> (Commit Number)
        |          |
        |          +---------[ TraceStore ]-> [ Persistent Storage ]
        |          |
        |          +---------[ Pub/Sub ]----> [ Ingestion Events Topic ]
        |
        +----[ Worker 2 ]---- ...
        |
        +----[ Worker N ]---- ...

Design Decisions

Dead Letter Collection (DLC)

The module supports “Dead Letter” semantics via Pub/Sub Nacks. If DeadLetterCollection is enabled in the configuration, processing failures will trigger a Nack(), allowing the message to be redelivered or moved to a dead-letter queue by the infrastructure. If disabled, or if the error is unrecoverable (like a bad git hash), the message is Ack()-ed to clear the pipeline.

Context and Timeouts

A defaultDatabaseTimeout of 60 minutes is applied to file processing. This high threshold accounts for large files that might contain thousands of traces requiring significant indexing time in the database, while still providing a circuit breaker for stalled connections.

Parallelism

The number of parallel ingesters is configurable. This allows the system to scale horizontally based on the CPU/Memory available to the container and the IOPS capacity of the underlying TraceStore.

Module: /go/ingestevents

Overview

The ingestevents module provides a standardized data contract and serialization format for communication between Perf ingesters and regression detection components. Within the Perf architecture, ingesters process raw performance data files and store them in the database. Once a file is successfully processed, the system must trigger downstream tasks—specifically regression detection—to analyze the newly arrived data.

This module facilitates this “Event-Driven Alerting” by defining the IngestEvent structure and providing utilities to pass this data through Google Cloud PubSub efficiently.

Design Decisions: Payload Efficiency

A key challenge in event-driven architectures is balancing the richness of the event data against transport limits. PubSub has a maximum message size (10MB), and high-volume performance data can easily exceed this if not handled carefully.

To address this, the module implements a mandatory compression strategy:

  • Gzip Compression: All IngestEvent payloads are Gzipped before being sent to PubSub and must be decompressed upon receipt. This ensures that even files containing thousands of TraceIDs or complex ParamSets remain well below the transport limits.
  • JSON Encoding: Within the compressed envelope, data is stored as standard JSON to maintain readability and ease of debugging during development.

Key Components

IngestEvent Structure

The IngestEvent struct is the core data transfer object. It contains three primary pieces of information that allow downstream clusterers to perform regression detection without re-querying the database for metadata:

  • TraceIDs: A slice of unencoded trace identifiers found in the ingested file. This tells downstream consumers exactly which series of data points have been updated.
  • ParamSet: A summary of the parameters (key-value pairs) associated with the TraceIDs. This provides immediate context about the hardware, benchmarks, and configurations affected by the new data.
  • Filename: The source of the data, useful for auditing and tracking the ingestion pipeline.

Serialization Utilities

The module provides two primary functions to manage the lifecycle of an event:

  • CreatePubSubBody: Orchestrates the encoding process. It uses a bytes.Buffer coupled with a gzip.Writer to transform an IngestEvent into a compressed byte slice ready for PubSub publishing.
  • DecodePubSubBody: The inverse operation. It handles the Gzip decompression and JSON decoding, returning a pointer to the original IngestEvent.
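The gzipped-JSON envelope can be sketched as a round trip with the standard library. The struct fields are taken from the prose above; the encode/decode helpers are simplified stand-ins for CreatePubSubBody and DecodePubSubBody, not their exact signatures.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
)

// IngestEvent approximates the data transfer object described above.
type IngestEvent struct {
	TraceIDs []string
	ParamSet map[string][]string
	Filename string
}

// encode writes JSON inside a gzip envelope, the scheme used for the
// Pub/Sub payload.
func encode(ev IngestEvent) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if err := json.NewEncoder(zw).Encode(ev); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil { // flush the gzip trailer
		return nil, err
	}
	return buf.Bytes(), nil
}

// decode reverses the process on the consumer side.
func decode(b []byte) (*IngestEvent, error) {
	zr, err := gzip.NewReader(bytes.NewReader(b))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	var ev IngestEvent
	if err := json.NewDecoder(zr).Decode(&ev); err != nil {
		return nil, err
	}
	return &ev, nil
}

func main() {
	ev := IngestEvent{
		TraceIDs: []string{",benchmark=motion_mark,bot=pixel_6,"},
		ParamSet: map[string][]string{"bot": {"pixel_6"}},
		Filename: "gs://bucket/2024/01/data.json",
	}
	b, _ := encode(ev)
	out, _ := decode(b)
	fmt.Println(out.Filename == ev.Filename, len(out.TraceIDs)) // true 1
}
```

Keeping JSON inside the envelope preserves debuggability: a captured payload can be gunzipped and read directly.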

Workflow: Data Ingestion to Alerting

The following diagram illustrates how this module fits into the broader Perf data pipeline:

[ Raw Data File ]
       |
       v
[ Ingester Service ] ----> ( Writes to Database )
       |
       | ( Creates IngestEvent )
       v
[ ingestevents.CreatePubSubBody ]
       |
       | ( Gzipped JSON )
       v
[ Google Cloud PubSub ]
       |
       v
[ Clusterer / Detection Service ]
       |
       | ( Receives Message )
       v
[ ingestevents.DecodePubSubBody ]
       |
       | ( Result: TraceIDs, ParamSet )
       v
[ Regression Detection Logic ]

Implementation Details

The implementation leverages go.skia.org/infra/go/util for safe Gzip stream handling and go.skia.org/infra/go/skerr for structured error wrapping. This ensures that failures during decompression or decoding (e.g., due to malformed PubSub messages) provide enough context to identify where the pipeline stalled.

Module: /go/initdemo

initdemo

The initdemo module provides a utility for bootstrapping a local development environment for Skia Perf. Its primary purpose is to automate the creation of the required database and the application of the current schema, ensuring developers have a consistent and functional backend for local testing.

Design Philosophy

This utility is designed for idempotency and speed in local development. Rather than relying on complex migration tools or manual database setup steps, initdemo uses the direct Spanner-compatible schema defined within the Perf codebase.

The choice to use a simple Go binary for this task reflects a preference for:

  • Consistency: It ensures that the local “demo” database matches the exact schema expectations of the current source code.
  • Simplicity: By wrapping both database creation and schema application in one command, it reduces the friction for new contributors setting up their environment.

Key Components and Responsibilities

Database Initialization (main.go)

The core logic resides in main.go. It performs two distinct phases of setup:

  1. Database Provisioning: It attempts to create a new database (defaulting to demo). It gracefully handles cases where the database already exists, allowing the tool to be run repeatedly without side effects.
  2. Schema Application: It retrieves the Spanner-compatible schema directly from the perf/go/sql/spanner package. This creates all necessary tables, indices, and constraints required for Skia Perf to function.

Schema Source

The module depends on go.skia.org/infra/perf/go/sql/spanner. This dependency is critical because it acts as the “Source of Truth” for the database structure. By importing spanner.Schema, initdemo guarantees that the local environment is always in sync with the production-ready schema definitions used by the main Perf application.

Workflow

The following diagram illustrates the sequence of operations performed by the utility:

  [ Developer ]
       |
       V
+--------------+
| initdemo run |
+--------------+
       |
       | 1. Connects to local DB instance (e.g., CockroachDB/Spanner Emulator)
       |
       V
+-----------------------+
| CREATE DATABASE demo; |----( If exists, log and continue )
+-----------------------+
       |
       | 2. Fetch Schema from /perf/go/sql/spanner
       |
       V
+-----------------------+
|  Apply SQL Statements |----( Create Tables, Indices, etc. )
+-----------------------+
       |
       V
+-----------------------+
|    Success / Exit     |
+-----------------------+

Configuration

The module supports customization through command-line flags, primarily allowing the user to point the utility at a specific database instance or rename the target database.

  • database_url: Defines the connection string. Although the tool is used for Spanner-compatible schemas, it utilizes the pgx library, reflecting the common local development pattern of using CockroachDB or similar PostgreSQL-wire-compatible emulators.
  • databasename: Allows the user to override the default “demo” name.

Module: /go/issuetracker

Overview

The issuetracker module provides a high-level abstraction for interacting with the Google Issue Tracker (Buganizer) API, specifically tailored for the Skia Perf ecosystem. Its primary goal is to automate the lifecycle of performance regressions—from initial detection to filing bugs and updating them with diagnostic data.

By wrapping the lower-level issuetracker/v1 API, this module handles the complexities of authentication, data formatting (Markdown), and the mapping of Perf-specific entities (anomalies, regressions, and subscriptions) into actionable bug reports.

Design and Implementation Choices

Data-Driven Bug Filing

Unlike a simple API client, FileBug relies heavily on internal state from the regression.Store. When a bug is filed, the module does not simply trust the parameters passed from the frontend; instead, it queries the database to:

  1. Identify the correct Bug Component, Priority, and Severity based on the Subscription linked to the regression.
  2. Aggregate technical details (Bots, Benchmarks, Measurements) directly from the regression data to ensure the bug description is accurate and comprehensive.

Safety and Testing Rails

The implementation includes a “test run” mechanism (checkTestRun). If a regression is not linked to a specific internal testing email (e.g., sergeirudenkov@google.com), the module defaults the bug status to NEW, clears the assignee, and removes CCs. This prevents automated systems from accidentally spamming production engineering teams during development or misconfiguration.

URL Management

The module handles the “Long URL” problem inherent in web-based analysis tools. Performance reports often involve hundreds of individual regression keys. To prevent breaking Issue Tracker or browser limits, the module calculates the length of the generated graph URL. If it exceeds a safe threshold (~2000 characters), it swaps the direct link for a “Link by Bug ID” (e.g., /u?bugID=12345), which leverages Perf's ability to look up regressions associated with a specific tracker ID.
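The fallback can be sketched as a length check at link-construction time. The ~2000-character threshold and the /u?bugID= form come from the description above; the /g?keys= URL shape is purely illustrative.

```go
package main

import "fmt"

const maxSafeURLLen = 2000 // approximate safe threshold described above

// graphLink returns the direct multi-regression URL when it fits, and
// otherwise the bug-ID lookup form, which lets Perf resolve the
// regressions server-side from the tracker ID.
func graphLink(host string, regressionKeys []string, bugID int) string {
	direct := host + "/g?keys="
	for i, k := range regressionKeys {
		if i > 0 {
			direct += ","
		}
		direct += k
	}
	if len(direct) <= maxSafeURLLen {
		return direct
	}
	return fmt.Sprintf("%s/u?bugID=%d", host, bugID)
}

func main() {
	fmt.Println(graphLink("https://perf.example.com", []string{"k1", "k2"}, 12345))
	// short key lists produce a direct link; hundreds of keys fall back
	// to the bugID form
}
```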

Authentication and Environments

The module supports two distinct operating modes:

  • Production: Uses go/secret to fetch API keys from GCP Secret Manager and initializes an OAuth2 authorized client.
  • Development: If devMode is active, it redirects traffic to a local mockhost (port 8081) and bypasses authentication, allowing for end-to-end UI testing without real API credentials.

Key Components

IssueTracker Interface (issuetracker.go)

This is the primary entry point. It defines the contract for filing bugs, adding comments, and querying issues. The implementation (issueTrackerImpl) coordinates with several sub-systems:

  • regression.Store: Provides the underlying data for detected performance shifts.
  • userissue.Store: Tracks issues manually filed by users to prevent duplicates and maintain a history of user-driven triage.
  • anomalygroup/service: Used to rank and select the “Top Anomalies” to include in the bug's summary, ensuring the most impactful data is presented first.

Bug Description Generation

The module dynamically constructs Markdown descriptions. The workflow for generating a bug body follows this logic:

[ Regression IDs ] --> [ Fetch Subscription Details ] --> [ Determine P/S Level ]
                               |
                               v
[ Fetch Regression Data ] --> [ Aggregate Bot Names ]
                               |
                               v
[ Rank Anomalies ] ---------> [ Format Top 10 List ]
                               |
                               v
[ Final Markdown ] <--------- [ Construct Graph Links ]

User Issue Logic (FileUserIssue)

While FileBug is often automated, FileUserIssue handles cases where a user manually identifies a regression on a specific trace. This workflow is simpler but critical for manual triage; it creates a bug with a standardized title containing the Trace ID and Commit Position, then persists this relationship in the userissue.Store.

Key Workflows

Automated Bug Selection

When multiple regressions are grouped into a single bug filing request, the module must decide which metadata to use. It iterates through all associated subscriptions and selects the one with the highest priority (lowest numerical value) and highest severity.

Regressions: [R1, R2, R3]
  |
  +--> Sub A (P2, S2)
  +--> Sub B (P1, S3)
  |
  [ Selection: Sub B ] (Priority P1 wins over P2)
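The selection rule above can be sketched as a simple scan. Treating severity as a tie-breaker when priorities are equal is an assumption about how "highest priority and highest severity" combine; lower numbers outrank higher ones (P1 beats P2).

```go
package main

import "fmt"

// Subscription holds only the fields that matter for selection.
type Subscription struct {
	Name     string
	Priority int // lower is more urgent
	Severity int // lower is more severe
}

// pickSubscription chooses which subscription's metadata to use for a
// grouped bug filing.
func pickSubscription(subs []Subscription) Subscription {
	best := subs[0]
	for _, s := range subs[1:] {
		if s.Priority < best.Priority ||
			(s.Priority == best.Priority && s.Severity < best.Severity) {
			best = s
		}
	}
	return best
}

func main() {
	subs := []Subscription{
		{Name: "Sub A", Priority: 2, Severity: 2},
		{Name: "Sub B", Priority: 1, Severity: 3},
	}
	fmt.Println(pickSubscription(subs).Name) // Sub B: P1 wins over P2
}
```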

Bug Update Loop

Once a bug is created, the module immediately posts a follow-up comment. This comment contains a specialized Perf URL that uses the newly created IssueId as a query parameter. This ensures that anyone viewing the bug can immediately jump back into the Perf UI to see the live, filtered graph of the relevant regressions.

Module: /go/issuetracker/mockhost

Overview

The mockhost module provides a lightweight, standalone HTTP server that emulates a subset of the Issue Tracker API. Its primary purpose is to facilitate local development and testing of services that interact with the issue tracking system, allowing developers to verify request/response handling without requiring access to a live production API or complex authentication setups.

Design and Implementation

The module is designed for simplicity and predictability. It implements a RESTful interface using the chi router, mimicking the endpoint structure expected by clients of the issuetracker/v1 library.

Instead of maintaining a complex state or an in-memory database, the mock host uses a “static response” strategy. It accepts valid API requests, logs the incoming parameters for visibility during debugging, and returns pre-defined JSON payloads that conform to the issuetracker data structures. This approach ensures that the mock remains low-maintenance and deterministic.

Key Workflows

The server handles three primary operations, mapping HTTP methods to specific Issue Tracker behaviors:

[ Client ]          [ mockhost (:8081) ]
    |                     |
    | GET /v1/issues      |
    |-------------------->| Log query -> Return static issue list
    |                     |
    | POST /v1/issues     |
    |-------------------->| Decode body -> Return new issue with ID 98765
    |                     |
    | POST /v1/issues/{id}/comments
    |-------------------->| Parse {id} -> Return comment confirmation

Components and Responsibilities

Entry Point and Routing (main.go)

The main.go file acts as the central coordinator. It initializes a chi router and maps specific URL patterns to handler functions. The server listens on port :8081 by default. Its responsibilities are limited to routing requests and managing the HTTP server's lifecycle.

API Handlers (main.go)

The handlers encapsulate the logic for simulating the Issue Tracker API:

  • listIssuesHandler: Simulates searching for issues. It extracts the query parameter from the URL to log what the client is searching for, then returns a ListIssuesResponse containing a single mock issue (ID 12345). This allows clients to test list-parsing logic.
  • fileBugHandler: Simulates the creation of a new bug. It decodes the incoming Issue object from the request body to reflect the submitted title back to the client, while assigning a static mock ID (98765) to simulate the backend's ID generation.
  • createCommentHandler: Simulates adding a comment to an existing issue. It validates that the issueId in the URL is a valid integer and echoes the comment text back in the response. This is useful for verifying that clients are correctly targeting the right issue resources.

Data Structures

The module relies on //go/issuetracker/v1:issuetracker for its data models. By using the same structures as the production client, the mock ensures that the JSON serialization and deserialization remain perfectly compatible with the real service.

Logging

Integration with go/sklog ensures that all interactions with the mock host are recorded to the console. This allows developers to inspect the payloads being sent by their services in real-time by simply watching the mockhost output.

Module: /go/issuetracker/mocks

The issuetracker/mocks module provides an automated mocking implementation of the IssueTracker interface. Its primary purpose is to facilitate unit testing for components within the Perf system that interact with external issue tracking services without requiring actual network calls or authentication against a real Issue Tracker API.

Design and Implementation Choice

The module utilizes mockery to generate code based on the issuetracker.IssueTracker interface. This approach ensures that the mock stays in sync with the actual interface definition found in /perf/go/issuetracker. By using the testify/mock framework, it allows developers to:

  1. Program Behaviors: Define specific return values or errors for calls to the issue tracker.
  2. Verify Interactions: Assert that the system under test (SUT) calls specific methods (like FileBug or CreateComment) with the expected parameters.
  3. Decouple Tests: Isolate Perf logic (such as anomaly detection or regression filing) from the complexities of the Issue Tracker V1 API.

Key Components

IssueTracker.go

This file contains the IssueTracker struct, which embeds mock.Mock. It implements the standard operations required for Perf's integration with bug tracking:

  • Bug Creation (FileBug, FileUserIssue): These methods simulate the creation of new issues. In a test environment, they allow the SUT to receive a mock Issue ID (int) to verify that the ID is correctly stored or referenced in the Perf database.
  • Communication (CreateComment): Mocks the addition of comments to existing issues, used for updating status or providing additional data on detected regressions.
  • Discovery (ListIssues): Simulates querying the tracker for existing issues, returning a slice of v1.Issue objects. This is crucial for testing logic that prevents duplicate bug filing.

Typical Testing Workflow

The mock is designed to be instantiated within a test suite using NewIssueTracker(t). This constructor automatically registers cleanup functions to assert that all defined expectations were met before the test finishes.

+-----------+              +----------------------+              +-----------------+
|   Test    |              |  System Under Test   |              |  Mock (this)    |
|  Routine  |              |    (e.g., Alerter)   |              |  IssueTracker   |
+-----------+              +----------------------+              +-----------------+
      |                           |                              |
      |-- 1. Setup Expectations ->|                              |
      |   (On FileBug return 123) |                              |
      |                           |                              |
      |-- 2. Trigger Action ----->|                              |
      |                           |-- 3. Call FileBug(ctx, req) ->|
      |                           |                              |
      |                           |<-- 4. Return (123, nil) ------|
      |                           |                              |
      |-- 5. Assert Result -------|                              |
      |                           |                              |
      |-- 6. Cleanup/Verify (Auto)|----------------------------->|
                                                                 | (Was FileBug called?)

Key Dependencies

  • perf/go/issuetracker: Defines the request and response structures (e.g., FileBugRequest) that the mock must handle.
  • go.skia.org/infra/go/issuetracker/v1: Provides the underlying data models for the issues themselves.
  • github.com/stretchr/testify/mock: The engine driving the programmatic responses and assertions.

Module: /go/kmeans

Generic K-Means Clustering

The kmeans module provides a flexible, generic implementation of Lloyd's Algorithm for k-means clustering. Rather than being tied to a specific data format like 2D coordinates or high-dimensional vectors, it uses a set of interfaces that allow it to cluster any data type where a distance metric and a centroid calculation can be defined.

This is particularly useful in the context of performance monitoring (Perf), where clustering might be applied to different types of trace data or experimental results.

Design and Architecture

The implementation decouples the clustering logic from the mathematical specifics of the data. This is achieved through three primary abstractions:

  • Clusterable: An empty interface (interface{}) representing the data points (observations) to be clustered.
  • Centroid: An interface representing the “center” of a cluster. It must provide a Distance method to calculate how far a Clusterable is from itself and an AsClusterable method to allow the centroid to be treated as a data point in results.
  • CalculateCentroid: A function type responsible for generating a new Centroid from a slice of Clusterable observations. This encapsulates the logic of how to “average” a specific group of data points.

Decision: Interface-Based Abstraction

By using interfaces, the module avoids hardcoding Euclidean distance or vector arithmetic. For example, if clustering time-series data, the Centroid implementation could use Dynamic Time Warping (DTW) for distance, while a categorical dataset might use Hamming distance. The core algorithm remains unchanged regardless of these implementation details.
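A compact sketch of how those abstractions fit together, using a 1D observation type that satisfies both roles (the interface and function names mirror the module's, but the bodies here are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

type Clusterable interface{}

type Centroid interface {
	Distance(c Clusterable) float64
	AsClusterable() Clusterable
}

// point1D is both an observation and a centroid, like myObservation in
// the module's tests.
type point1D float64

func (p point1D) Distance(c Clusterable) float64 {
	return math.Abs(float64(p) - float64(c.(point1D)))
}

func (p point1D) AsClusterable() Clusterable { return p }

// calculateCentroid plays the CalculateCentroid role: average the members.
func calculateCentroid(members []Clusterable) Centroid {
	sum := 0.0
	for _, m := range members {
		sum += float64(m.(point1D))
	}
	return point1D(sum / float64(len(members)))
}

// oneIteration performs a single assign-then-update step, the shape of Do().
func oneIteration(obs []Clusterable, centroids []Centroid) []Centroid {
	groups := make([][]Clusterable, len(centroids))
	for _, o := range obs {
		best, bestDist := 0, math.Inf(1)
		for i, c := range centroids {
			if d := c.Distance(o); d < bestDist {
				best, bestDist = i, d
			}
		}
		groups[best] = append(groups[best], o)
	}
	var out []Centroid
	for _, g := range groups {
		if len(g) > 0 { // empty clusters are discarded, shrinking k
			out = append(out, calculateCentroid(g))
		}
	}
	return out
}

func main() {
	obs := []Clusterable{point1D(1), point1D(2), point1D(10), point1D(11)}
	centroids := oneIteration(obs, []Centroid{point1D(0), point1D(12)})
	fmt.Println(centroids) // centroids move to the cluster means
}
```

Swapping point1D for a time-series type with a DTW Distance would leave oneIteration untouched, which is exactly the decoupling the interfaces buy.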

Key Workflows

The Clustering Loop

The module executes the standard iterative k-means process. Each iteration (performed by the Do function) follows these steps:

  1. Assignment: Each observation is assigned to the nearest centroid based on the Distance metric.
  2. Update: For each resulting cluster, a new centroid is calculated using the provided CalculateCentroid function.
  3. Refinement: If a centroid has no assigned observations, it is discarded, potentially reducing the number of clusters (k).

  Initial Centroids + Observations
           |
           v
+-----------------------------+
|    Do() Iteration Loop      | <-----------+
| 1. Find closest centroid    |             |
| 2. Group observations       |             | Repeat N times
| 3. Calculate new centroids  |             | (iters)
+-----------------------------+             |
           |                                |
           +--------------------------------+
           |
           v
  Final Centroids + Grouped Clusters

Result Aggregation

The GetClusters function organizes the final output. It produces a two-dimensional slice where each inner slice represents a cluster. By convention, the first element of each inner slice is the Centroid itself (converted via AsClusterable), followed by all observations belonging to that cluster. This provides a clear, grouped view of the algorithm's output.

Implementation Details

kmeans.go

This is the core of the module.

  • Do: Implements a single iteration of the algorithm. It is designed to be called repeatedly. Note that it returns a new slice of centroids and may return fewer than the input if clusters become empty.
  • KMeans: A convenience wrapper that runs Do for a fixed number of iterations.
  • TotalError: A utility to calculate the sum of distances from all observations to their respective centroids, providing a measure of how well the clusters fit the data.

kmeans_test.go

The tests serve as the primary documentation for how to implement the required interfaces. They demonstrate a concrete 2D implementation (myObservation) where the same struct satisfies both Clusterable and Centroid interfaces, and a corresponding calculateCentroid function that computes the arithmetic mean of X and Y coordinates.

Module: /go/maintenance

Perf Maintenance Module

The maintenance module serves as the central orchestration point for all long-running background processes and administrative tasks within a Skia Perf instance. Instead of handling user requests, this module is responsible for database health, data synchronization, schema migrations, and cache warming.

High-Level Overview

In a distributed system like Skia Perf, various tasks must occur outside the critical path of the web UI or ingestion engine. The maintenance module consolidates these tasks into a single entry point. It manages the lifecycle of background goroutines that handle:

  • Database Schema Management: Ensuring the SQL schema is up-to-date and performing migrations.
  • Data Ingestion & Sync: Keeping the local representation of Git repositories fresh.
  • Data Retention: Pruning old regressions and shortcuts to manage database size.
  • Cache Management: Periodically refreshing Redis caches to ensure query performance remains high.
  • External Config Sync: Pulling configurations (like sheriffing rules) from LUCI Config.

Key Components and Responsibilities

Process Orchestration (maintenance.go)

The primary responsibility of this module is the Start function. It acts as a switchboard, using configuration flags (MaintenanceFlags) and instance settings to decide which background services to initialize.

Design decisions in this coordinator include:

  • Blocking Execution: The Start function is designed to run indefinitely (ending in a select {}). This is intended for use in a dedicated “maintenance” microservice or container that runs alongside the main Perf application.
  • Centralized Scheduling: It defines the “heartbeat” of the system through various constants (e.g., gitRepoUpdatePeriod, deletionPeriod). By centralizing these, developers can easily reason about the total background load on the database.

Schema and Migration

The module ensures the database environment is ready before starting other services. It utilizes expectedschema to validate and migrate the core schema. It also handles specialized migrations, such as moving regression data between table formats, which are executed in small, controlled batches (regressionMigrationBatchSize) to avoid locking the database or exhausting resources.

Data Lifecycle and Retention

Through the deletion submodule, the maintenance process enforces a data retention policy. It targets “Shortcuts” (temporary trace groupings) and “Regressions” that have aged out (currently 18 months).

Refreshing Query Caches

To prevent the first user of the day from experiencing slow queries, the maintenance module performs “cache warming.” It initializes a ParamSetRefresher which scans the TraceStore and populates Redis. This ensures that the available query parameters (keys and values) are always pre-calculated and ready for the UI.

External Service Integration

  • Git Polling: Periodically fetches new commits from the source of truth to ensure the Perf database stays mapped to the correct revision history.
  • Sheriff Config: Integrates with LUCI Config to import subscription and alerting rules, allowing teams to manage their Perf configurations via version-controlled files outside of the Perf database itself.

Key Workflows

Initialization and Background Loop

When the maintenance service starts, it follows a specific sequence to ensure dependencies are met before background loops begin:

Start(ctx, flags, config)
|
|-- 1. Initialize Tracing (Observability)
|-- 2. Connect to Database & Validate/Migrate Schema
|-- 3. Initialize Git Provider & Start Polling
|
|-- 4. Launch Concurrent Goroutines (if enabled):
|   |--> [Migration]   Periodic Regression Migration
|   |--> [Config]      LUCI Config Import Routine
|   |--> [Cache]       Redis ParamSet Refresh Routine
|   |--> [Deletion]    Data Retention / TTL Cleanup
|
|-- 5. Block (select {})

Design Choices: Why a Separate Module?

  • Isolation of Concerns: By separating maintenance from the main frontend or ingest processes, heavy operations (like schema migration or massive deletions) do not steal CPU or IO cycles from user-facing requests.
  • Fault Tolerance: If a background migration fails or hangs, it does not crash the web server.
  • Single-Writer Principle: For certain migration tasks, having a single maintenance instance ensures that multiple pods aren't trying to perform schema changes or batch deletions simultaneously, reducing transaction contention.

Module: /go/maintenance/deletion

Perf Data Retention Maintenance

The deletion module provides a background maintenance service responsible for enforcing data retention policies within the Skia Perf system. It specifically targets the cleanup of aged regression data and their associated shortcuts to ensure the database remains performant and focused on relevant recent history.

High-Level Overview

In the Perf system, regressions (detections of performance changes) and shortcuts (references to specific sets of traces) accumulate over time. To maintain database health, this module implements a Time-To-Live (TTL) policy. Currently, the system is hardcoded to an 18-month retention period.

The module operates by periodically scanning the database for regressions older than this TTL, identifying the specific database keys (commit numbers and shortcut IDs), and removing them in atomic batches.

Key Components and Responsibilities

Deleter (deleter.go)

The Deleter is the central coordinator. It interacts with both the regression.Store and the shortcut.Store. Its primary responsibility is to bridge the two stores; since shortcuts are often referenced by regression entries, they should be cleaned up together to prevent orphaned data or broken references in the UI.

Logic & Design Choices

TTL Enforcement

The deletion logic uses the timestamp of the “step point” (the point in time where a performance shift occurred) within a regression's cluster summary to determine eligibility. If the timestamp of a regression's Low or High cluster is older than 18 months relative to the current time, it is marked for deletion.

Batch-Based Processing

Instead of a single massive delete operation—which could lock database tables and degrade performance—the module uses a “batching” approach.

  • It starts scanning from the oldest known commit in the database.
  • It collects eligible regressions and shortcuts until a configurable shortcutBatchSize is met.
  • The actual deletion is performed within a single database transaction to ensure consistency (either both the regression and its shortcut are removed, or neither is).

Frequency and Scheduling

The RunPeriodicDeletion method establishes a long-running goroutine. It uses a ticker to trigger DeleteOneBatch at a regular iterationPeriod. This allows the maintenance to run continuously in the background at a slow, steady pace, eventually catching up to the TTL window without causing spikes in database load.

Workflows

Periodic Deletion Loop

The following diagram illustrates how the background process manages the steady cleanup of data:

RunPeriodicDeletion(period, batchSize)
|
| (Wait for 'period')
|-----> DeleteOneBatch(batchSize)
        |
        |-- 1. Get Oldest Commit Number
        |-- 2. Scan Range [oldest, oldest + batchSize]
        |-- 3. Filter for regressions older than 18 months
        |-- 4. If Batch not full, extend range and repeat step 2
        |
        |-- 5. Open Database Transaction
        |-- 6. Delete Regressions by Commit ID
        |-- 7. Delete Shortcuts by ID
        |-- 8. Commit Transaction
|
| (Wait for next 'period')
|-----> ...

Key Files

  • deleter.go: Contains the core logic for calculating the 18-month cutoff, scanning the regression store, and executing the transactional deletes.
  • deleter_test.go: Provides integration tests using a test database (Spanner) to verify that only data older than the TTL is removed and that the batching logic correctly identifies eligible records.

Module: /go/notify

Perf Regression Notifications

The notify module is a high-level orchestration layer responsible for transforming detected performance regressions into human-readable alerts and delivering them to various destinations. It decouples the “what” of a regression (statistical data and commit history) from the “how” (formatting and transport).

High-Level Overview

The notification system follows a pipeline where raw detection data is first gathered into a common metadata format, then passed to a provider to be formatted into a specific message (e.g., HTML or Markdown), and finally handed off to a transport layer for delivery (e.g., Email or Issue Tracker).

This modular design allows the Perf instance to support diverse workflows:

  • Standard Alerts: Sending HTML emails or creating Buganizer/Monorail issues.
  • Android-Specific Workflows: Generating deep links to internal build diffs and formatting test method names.
  • Chromeperf Integration: Reporting anomalies directly to the Chromeperf API.
  • Dry Runs: Using a “Noop” transport for testing detection logic without bothering developers.

Key Components and Responsibilities

Core Orchestration (notify.go)

The Notifier interface is the primary entry point. The defaultNotifier implementation manages the flow of data. When a regression is found (or goes missing), it:

  1. Gathers Metadata: Combines the alert configuration, commit details, and cluster statistics into a RegressionMetadata object.
  2. Hydrates Links: If the regression is tied to specific traces, it queries the tracestore and filesystem to find “source” links (e.g., links to the raw JSON or log files that produced the data point).
  3. Executes Formatting: Uses a NotificationDataProvider to turn metadata into a subject and body.
  4. Dispatches Transport: Sends the final message via the configured Transport.
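The orchestration order above can be sketched with toy interfaces. All type and method names here are simplified stand-ins for the real ones in go/notify and go/notify/common:

```go
package main

import "fmt"

// Illustrative stand-ins for the real metadata and message types.
type regressionMetadata struct{ Summary, CommitHash string }
type notificationData struct{ Subject, Body string }

type dataProvider interface {
	GetNotificationData(md regressionMetadata) notificationData
}
type transport interface {
	// Returns a threading reference used to link follow-up notifications.
	SendNewRegression(d notificationData) (string, error)
}

// plainProvider and printTransport are toy implementations used only to
// show the gather -> format -> dispatch sequence.
type plainProvider struct{}

func (plainProvider) GetNotificationData(md regressionMetadata) notificationData {
	return notificationData{
		Subject: "Regression at " + md.CommitHash,
		Body:    md.Summary,
	}
}

type printTransport struct{}

func (printTransport) SendNewRegression(d notificationData) (string, error) {
	fmt.Println("send:", d.Subject)
	return "ID-1", nil
}

func regressionFound(p dataProvider, t transport, md regressionMetadata) (string, error) {
	data := p.GetNotificationData(md) // gather + format
	return t.SendNewRegression(data)  // dispatch
}

func main() {
	ref, _ := regressionFound(plainProvider{}, printTransport{}, regressionMetadata{
		Summary:    "perf step up in motion_mark",
		CommitHash: "abc123",
	})
	fmt.Println("threading reference:", ref)
}
```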

Data Providers and Formatters

The system distinguishes between how data is gathered and how it is styled:

  • NotificationDataProvider: Determines what fields are available for the message.
    • The Default Provider uses standard commit and cluster data.
    • The Android Provider (android_notification_provider.go) adds specialized logic for Android-specific metadata, such as extracting Build IDs from commit subjects and formatting test class/method strings.
  • Formatter: Handles the template rendering.
    • HTML Formatter (html.go): Used primarily for rich emails.
    • Markdown Formatter (markdown.go): Used for issue trackers and includes custom template functions like buildIDFromSubject to parse specific URL structures.

Transports

Transports are the final leg of the journey, abstracting the I/O required to reach the user:

  • Email (email.go): Sends multi-part emails. It supports “threading references,” allowing “Regression Missing” notifications to appear as replies to the original “Regression Found” alert.
  • Issue Tracker (issuetracker.go): Creates and updates bugs via the Google Issue Tracker (Buganizer) API. It automatically sets priorities, severities, and components based on the alert configuration.
  • Chromeperf (chromeperfnotifier.go): A specialized transport that doesn't send a message to a human, but instead reports the anomaly to the Chromeperf service for cross-platform tracking.
  • Noop (noop.go): A null-object pattern implementation for environments where notifications should be suppressed.

Key Workflows

Regression Found Process

This workflow illustrates how a statistical anomaly becomes a developer-facing bug:

[ Detection Engine ] -> RegressionFound(commit, alert, cluster)
        |
        v
[ defaultNotifier ]
        |-- getRegressionMetadata() --> Fetches Git hashes & source links
        |-- GetNotificationData()   --> Executes Go Templates (HTML/Markdown)
        |-- SendNewRegression()     --> Calls Transport (Email/API)
        v
[ Transport Layer ]
        |-- IssueTracker: Creates Bug #1234
        |-- Email: Sends message with Message-ID <abc@perf>
        v
[ Persistence ] --> Notification ID (#1234 or <abc@perf>) is saved to track history

Notification Threading

To avoid “alert fatigue” and keep histories clean, the system uses a threadingReference.

1. Initial Regression Found -> Transport returns "ID-123"
2. Performance recovers     -> RegressionMissing(threadingReference="ID-123")
3. Transport uses ID-123    -> Adds a comment to Bug #123 OR Sends a Reply-To Email

Design Decisions

Template-Driven Messages

The use of Go’s text/template and html/template allows instance administrators to customize notification content without changing Go code. The config.NotifyConfig allows specifying custom body and subject templates in the instance’s JSON configuration.

Commit Range URLs

Because different projects use different git mirrors (e.g., Gerrit, GitHub, internal Gitiles), the commitrange.go logic uses a configurable commitRangeURITemplate. This allows the notification to link to a side-by-side diff (using {begin} and {end} placeholders) rather than just a single commit landing page.
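Placeholder expansion of this kind is a simple string substitution; a minimal sketch, with a made-up template URL (real instances configure their own Gitiles or Gerrit base):

```go
package main

import (
	"fmt"
	"strings"
)

// expandCommitRange fills the {begin} and {end} placeholders in a
// commit-range URL template.
func expandCommitRange(template, begin, end string) string {
	s := strings.ReplaceAll(template, "{begin}", begin)
	return strings.ReplaceAll(s, "{end}", end)
}

func main() {
	tmpl := "https://example.googlesource.com/repo/+log/{begin}..{end}"
	fmt.Println(expandCommitRange(tmpl, "aaa111", "bbb222"))
	// https://example.googlesource.com/repo/+log/aaa111..bbb222
}
```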

Separation of Metadata and Presentation

By defining common structures in the /common submodule, the system ensures that the detection logic remains pure and doesn't need to know if the final output is a Markdown table or an HTML list. This also simplifies testing, as mocks can return standard NotificationData regardless of the transport being tested.

Module: /go/notify/common

Notification Common

The notify/common module defines the core data structures used across the Perf regression notification system. It acts as a bridge between the detection engine and the various notification delivery mechanisms (such as email, issue trackers, or chat platforms).

By centralizing these structures, the system ensures that different notification formatters have access to a consistent set of metadata regardless of the specific alert configuration or the final destination of the message.

Core Data Structures

RegressionMetadata

This structure is the primary data container passed to notification formatters. It is designed to encapsulate the full context of a detected performance change, allowing for the generation of rich, actionable reports.

The inclusion of both the RegressionCommit and the PreviousCommit is critical for providing a “diff” view, enabling users to see exactly what changed in the codebase to cause the regression.

Key components of the metadata include:

  • Contextual Links: The InstanceUrl provides a direct path back to the Perf instance for deeper analysis.
  • Analytical Data: The Cl (Cluster Summary) and Frame (UI Frame Response) contain the statistical backing for the regression, allowing notifications to include high-level summaries of the data points involved.
  • Detection Specifics: The module handles two distinct detection paradigms:
    • Standard Regressions: Primarily use the commit range and alert configuration.
    • Individual Trace Regressions: When detection is set to “Individual” mode, the structure provides granular details including TraceID and specific commit links. This allows notifications to pinpoint exact changes in high-cardinality data environments.

NotificationData

While RegressionMetadata contains the raw information about a performance change, NotificationData represents the output of the formatting process. It separates the presentation layer from the delivery layer.

  • Body: Contains the formatted content (often HTML or Markdown) intended for the recipient.
  • Subject: Contains a concise summary, typically used for email subject lines or issue titles.
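A minimal sketch of the two structures and the formatter boundary between them. The field sets are abbreviated and the formatter is hypothetical; the real definitions in go/notify/common carry far more context (commits, cluster summary, frame response, trace IDs):

```go
package main

import "fmt"

// Abbreviated stand-in for the real RegressionMetadata.
type RegressionMetadata struct {
	RegressionCommit string
	PreviousCommit   string
	InstanceUrl      string
}

// NotificationData is the output of the formatting step.
type NotificationData struct {
	Subject string // issue title or email subject line
	Body    string // formatted HTML or Markdown content
}

// format shows the separation: a formatter consumes metadata and produces
// presentation-ready data, keeping detection logic free of markup concerns.
func format(md RegressionMetadata) NotificationData {
	return NotificationData{
		Subject: "Regression between " + md.PreviousCommit + " and " + md.RegressionCommit,
		Body:    "See " + md.InstanceUrl + " for details.",
	}
}

func main() {
	d := format(RegressionMetadata{
		RegressionCommit: "bbb222",
		PreviousCommit:   "aaa111",
		InstanceUrl:      "https://perf.example.org",
	})
	fmt.Println(d.Subject)
}
```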

Workflow: From Detection to Notification

The common module facilitates the transition of data through the following conceptual pipeline:

[ Detection Engine ]
        |
        | Identifies anomaly and collects:
        | - Alert Config
        | - Commit Range
        | - Cluster Data
        v
[ RegressionMetadata ] <--- (Defined in notify/common)
        |
        | Passed to a Formatter (e.g., HTML/Markdown)
        v
[ NotificationData ]   <--- (Defined in notify/common)
        |
        | Passed to a Transport (e.g., Email/Issue Tracker)
        v
[ Final Recipient ]

This separation ensures that the logic for what a regression is (metadata) is kept distinct from how it is described to a human (notification data), allowing the system to easily support new notification channels by simply implementing new formatters that consume these common structures.

Module: /go/notify/mocks

The go/notify/mocks module provides a suite of autogenerated mock implementations for the core interfaces used within the Perf notification system. These mocks are built using testify/mock and are designed to facilitate unit testing of components that handle regression alerts, data formatting, and message delivery without requiring live connections to external services (like email servers or issue trackers).

High-Level Purpose

The notification system in Perf follows a decoupled architecture where data retrieval, message construction, and transport delivery are handled by distinct components. This mock package allows developers to:

  1. Isolate Logic: Test the logic of a Notifier implementation by mocking the Transport layer.
  2. Verify State Transitions: Assert that the system correctly identifies when to send a “New Regression” vs. a “Regression Missing” (resolved) notification.
  3. Simulate Failures: Inject errors into the data provider or transport layers to ensure robust error handling in the calling services.

Key Components

The module mirrors the primary interfaces found in the parent notify package:

Notifier.go

The Notifier mock simulates the high-level orchestration of notifications. It is responsible for deciding what content should be sent based on regression events.

  • Design Role: It acts as the entry point for the detection logic.
  • Key Workflows: It mocks methods like RegressionFound and RegressionMissing, which typically involve complex arguments such as ClusterSummary, FrameResponse, and Commit data. This allows tests to verify that the notification system receives the correct metadata when a performance anomaly is detected.

Transport.go

The Transport mock represents the delivery mechanism (e.g., Email, Monorail/Issue Tracker).

  • Design Role: It abstracts the actual I/O.
  • Why it's used: Instead of sending real emails or creating real bugs during a test run, this mock captures the body, subject, and threadingReference (used for message chaining/threading) to ensure the outgoing message is formatted correctly.

NotificationDataProvider.go

This mock handles the assembly of the data payload required for a notification.

  • Design Role: It sits between the raw performance data and the formatted message.
  • Functionality: It mocks the retrieval of NotificationData based on RegressionMetadata. This is crucial for testing how different regression scenarios (found vs. missing) are transformed into user-facing information.

Workflow Example

The following diagram illustrates how these mocks are typically used in a unit test for a component that manages regression life cycles:

[ Test Suite ]
      |
      | 1. Setup expectations on Notifier Mock
      v
[ System Under Test (e.g., Regression Detector) ]
      |
      | 2. Detects anomaly -> Calls RegressionFound()
      v
[ Notifier Mock ]
      |
      | 3. Returns a canned "NotificationID"
      v
[ Test Suite ]
      |
      | 4. Assert that Notifier was called with
      |    the expected Commit and Alert objects.

Usage Implementation Note

All mocks in this package include a New[InterfaceName] helper function. These helpers automatically register a cleanup function with the *testing.T instance, ensuring that AssertExpectations is called at the end of the test to verify that all defined mock calls were actually executed.

Module: /go/notifytypes

Notifytypes Module

The notifytypes module serves as the central source of truth for defining how the Perf system communicates regression alerts and performance data to external consumers. Rather than scattering string constants or logic throughout the codebase, this module provides a typed schema that dictates both the medium of notification (the “how”) and the context of the data being sent (the “what”).

Core Abstractions

The module is built around two primary type definitions that decouple the notification logic from the underlying alert detection systems.

Notifier Mediums (Type)

The Type abstraction defines the destination and format of a notification. This is used by the system to instantiate the correct notification client. The design supports a variety of delivery methods:

  • Human-Readable Formats: HTMLEmail and MarkdownIssueTracker cater to human consumption, specifying not just the destination but the markup language required for clear presentation.
  • System-to-System Integration: ChromeperfAlerting and AnomalyGrouper represent automated workflows. Instead of sending a message to a human, these types signal the system to push structured data into external tracking services or internal grouping logic for further automated analysis.
  • No-Op Actions: The None type allows for a “dry-run” or silenced state where regressions are detected and logged but no external side effects are triggered.
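A typed string enum keeps these mediums out of raw string literals. The constant names below mirror the mediums listed above, but the string values and the helper are assumptions for illustration; the authoritative declarations live in go/notifytypes:

```go
package main

import "fmt"

// Type identifies the destination and format of a notification.
type Type string

const (
	HTMLEmail            Type = "html_email"
	MarkdownIssueTracker Type = "markdown_issuetracker"
	ChromeperfAlerting   Type = "chromeperf"
	AnomalyGrouper       Type = "anomaly_grouper"
	None                 Type = "none"
)

// isHumanReadable shows how typed values let dispatch logic branch
// between human-facing and system-to-system mediums.
func isHumanReadable(t Type) bool {
	return t == HTMLEmail || t == MarkdownIssueTracker
}

func main() {
	fmt.Println(isHumanReadable(HTMLEmail), isHumanReadable(None)) // true false
}
```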

Data Contexts (NotificationDataProviderType)

While the Type defines the transport, the NotificationDataProviderType defines the source-specific schema of the data.

In a multi-tenant environment like Perf, different projects (e.g., standard Skia vs. Android) require different metadata to be included in an alert. For example, an AndroidNotificationProvider might bundle specific build IDs or device characteristics that are irrelevant to other projects. By using this type, the notification engine can select the appropriate data formatter to bridge the gap between generic regression data and project-specific requirements.

Workflow Integration

The constants in this module act as the glue between alert configuration and the notification dispatcher:

[ Alert Configuration ]
          |
          v
[ Notification Dispatcher ] <--- [ notifytypes.Type ]
          |                      (e.g., HTMLEmail)
          |
          +---------------------> [ Notification Data Provider ]
          |                       (e.g., AndroidNotificationProvider)
          |
          v
[ External Systems ]
(Email, Issue Tracker, Chromeperf)

By centralizing these types, the system ensures that adding a new notification destination or a new specialized data provider only requires an update to this registry, providing a consistent interface for all alerting components in the Perf ecosystem.

Module: /go/perf-tool

Overview

The perf-tool is a comprehensive command-line interface (CLI) designed for administrative and diagnostic interactions with Skia Perf. It serves as the primary tool for managing Perf instances, providing capabilities that span database maintenance, data lifecycle management, and infrastructure provisioning.

The tool bridges the gap between local configuration files and remote cloud resources (GCS, PubSub, CockroachDB), allowing developers and SREs to perform complex operations like re-ingesting historical data, migrating alerts between instances, and debugging specific trace data without needing to write custom scripts.

Design Philosophy and Implementation Choices

The project is structured to separate the CLI definition (routing and flags) from the business logic.

  • Interface-Driven Logic: The core functionality is encapsulated within the application module. By defining an Application interface, the CLI implementation in main.go remains clean and highly testable. This abstraction allows the CLI to focus on flag parsing and environment setup while delegating complex workflows to the application layer.
  • Configuration as Truth: Most commands require a --config_filename flag. The tool is designed to treat the InstanceConfig (JSON/TOML) as the definitive source of truth for the environment it is interacting with. This ensures that operations like PubSub creation or Database restores are always scoped to the correct instance.
  • Safe Data Portability: Database operations (Alerts, Shortcuts, Regressions) use a custom serialization format (Go gob inside .zip files) rather than raw SQL dumps. This choice provides:
    • Portability: Backups can be restored across different database versions or instances.
    • Atomicity: Related entities (like Regressions and their associated Shortcuts) can be bundled together to ensure functional integrity after restoration.
  • Idempotency and Safety: Many operations, such as PubSub provisioning and database restoration, are designed to be idempotent. The “dry-run” capability for re-ingestion allows users to verify which files will be affected before committing to expensive cloud operations.
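The "gob inside .zip" format can be sketched with the standard library alone. The Alert struct and entry name here are stand-ins; the real backup bundles more entity types and metadata:

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/gob"
	"fmt"
)

// Alert is a stand-in for the real alert configuration type.
type Alert struct {
	ID    int64
	Query string
}

// writeBackup gob-encodes the entities into a named entry of a zip
// archive, the serialization scheme described above.
func writeBackup(alerts []Alert) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("alerts.gob") // entry name is illustrative
	if err != nil {
		return nil, err
	}
	if err := gob.NewEncoder(w).Encode(alerts); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readBackup restores the entities from the archive.
func readBackup(data []byte) ([]Alert, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	f, err := zr.File[0].Open()
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var alerts []Alert
	err = gob.NewDecoder(f).Decode(&alerts)
	return alerts, err
}

func main() {
	data, _ := writeBackup([]Alert{{ID: 1, Query: "bot=pixel_6"}})
	restored, _ := readBackup(data)
	fmt.Println(restored[0].Query) // bot=pixel_6
}
```

Because gob carries type information with the stream, a structured backup of this shape stays decodable across instances in a way a raw SQL dump would not.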

Key Components and Responsibilities

CLI Entry Point (main.go)

This file defines the user interface of the tool using the urfave/cli framework. Its responsibilities include:

  • Flag Management: Handling global and command-specific flags such as connection strings, commit ranges, and file paths.
  • Context Initialization: Setting up logging (via sklog) and instantiating the TraceStore or InstanceConfig based on the provided flags.
  • Command Routing: Mapping CLI commands to the appropriate methods in the application module.

Application Orchestrator (/application)

This module contains the heavy lifting for all functional areas:

  • Database Operations: Implements logic for backing up and restoring Alerts, Shortcuts, and Regressions. It manages the complexity of batching large datasets and maintaining referential integrity (e.g., ensuring a regression backup includes the shortcuts it references).
  • Ingestion Management: Provides tools to force the system to re-process data. It can scan GCS buckets for historical files and republish them to PubSub topics to trigger the standard ingestion pipeline. It also includes a validate sub-command to check ingestion files against the schema and parser logic locally.
  • Trace Debugging: Provides direct access to the TraceStore. This allows users to list trace IDs matching a specific query or export raw performance data for specific commit ranges into JSON files for external analysis.
  • Infrastructure Provisioning: Automates the creation of necessary Google Cloud PubSub topics and subscriptions based on the instance configuration, ensuring the cloud environment stays in sync with the code-defined configuration.

Key Workflows

Trace Data Export

This workflow demonstrates how the tool extracts data from the storage layer for external use.

[ CLI: traces export ] -> [ Instance Config ] -> [ TraceStore (BigTable/CockroachDB) ]
           |                   |                          |
           |-- 1. Parse Query -|                          |
           |                                              |
           |------- 2. Query Commits [Begin, End] ------->|
                                                          |
[ Local JSON File ] <--- 4. Encode & Write <--- 3. Retrieve Trace Values

Infrastructure Synchronization

When setting up a new Perf instance or updating an existing one, the tool synchronizes the cloud environment.

[ InstanceConfig ]        [ Google Cloud PubSub ]         [ Local State ]
        |                          |                            |
1. Read Topics Config ------------>|                            |
        |                          |                            |
2. Check Existence <---------------|                            |
        |                                                       |
3. Create Missing Topics/Subscriptions ------------------------>|
        |                                                       |
4. Set Dead Letter Policies/ACK Deadlines --------------------->|

Module: /go/perf-tool/application

Overview

The application module serves as the central orchestration layer for the perf-tool CLI. It encapsulates the high-level business logic and complex workflows required to manage a Skia Perf instance, acting as a bridge between the command-line interface and the underlying storage, ingestion, and cloud infrastructure systems.

By centralizing these operations, the module ensures that administrative tasks—such as database migrations, data re-ingestion, and trace debugging—are executed consistently and safely across different environments (local vs. production).

Design Philosophy and Implementation Choices

The module is designed around the Application interface, which promotes testability and provides a clean abstraction for the CLI handlers.

  • Transactional Safety in Backups: Design decisions for database backups (Alerts, Shortcuts, Regressions) prioritize data integrity and portability. Instead of raw database dumps, the module uses Go's gob encoding wrapped in .zip archives. This choice allows for versioned, structured backups that include necessary metadata and allow for targeted restoration.
  • Deterministic Regression Backups: When backing up regressions, the module also identifies and exports the specific Shortcuts referenced by those regressions. This ensures that a restored regression remains functional and linkable to the original trace data in a new environment.
  • GCS and PubSub Integration: For data ingestion management, the module interacts directly with Google Cloud Storage and PubSub. The IngestForceReingest logic uses hourly directory partitioning to efficiently scan large buckets, and it leverages PubSub to trigger the standard ingestion pipeline, ensuring that “forced” data follows the same processing path as live data.
  • Validation before Ingestion: The IngestValidate component performs a two-stage check: first against the schema to ensure structural correctness, and second through the actual parser to verify that keys, measurements, and links are generated as expected before a user commits to a large-scale ingestion.

Key Components and Responsibilities

Database Operations

Managed through functions like DatabaseBackup* and DatabaseRestore*, these components interact with builders to instantiate the appropriate stores (Alert, Shortcut, or Regression) based on the provided InstanceConfig.

  • Regressions: Backed up in batches (defaulting to 1000 commits) to manage memory pressure. The restore process is idempotent; it recreates deterministic shortcut IDs to maintain data consistency.
  • Alerts/Shortcuts: Handled as discrete entities, allowing administrators to migrate configurations without necessarily moving performance data.

Trace Management

The module provides tools to inspect the TraceStore directly from the command line.

  • TracesList: Performs queries against specific tiles to debug trace IDs and values.
  • TracesExport: Facilitates data extraction for external analysis. It maps query strings to internal trace names and exports the resulting values as JSON, supporting both file output and standard output.

Ingestion and Infrastructure

  • PubSub Provisioning: ConfigCreatePubSubTopicsAndSubscriptions automates the creation of the ingestion infrastructure. It handles complex configurations like Dead Letter Policies and acknowledgement deadlines, ensuring the cloud environment matches the local configuration file.
  • Re-ingestion Logic: IngestForceReingest allows for “time-traveling” data. By scanning GCS objects within a date range and republishing their metadata to the ingestion topic, it triggers the system to re-process historical data (e.g., after a parser bug fix).

Key Workflows

Data Re-ingestion Process

This workflow illustrates how the module triggers the reprocessing of historical performance data.

[ User Input ]        [ GCS Bucket ]          [ PubSub Topic ]      [ Perf Ingestor ]
      |                      |                       |                      |
1. Start/End Dates --------> |                       |                      |
      |                2. List Objects               |                      |
      | <--------------------|                       |                      |
      |                                              |                      |
3. Path Filter Apply ----> (Filter Files)            |                      |
      |                                              |                      |
4. Publish Message (Object Metadata) --------------> |                      |
                                                     | ---- 5. Notify ----> |
                                                                            |
                                                                    6. Re-parse File

Regression Backup with Dependencies

Backing up regressions requires a “lookup and include” strategy for shortcuts.

[ Regression Store ]        [ Perf Git ]          [ Shortcut Store ]      [ ZIP Archive ]
          |                      |                        |                      |
1. Fetch Regressions (Batch)     |                        |                      |
          | ---- 2. Get Dates -> |                        |                      |
          | <--------------------|                        |                      |
          |                                               |                      |
3. Extract Shortcut IDs --------------------------------> |                      |
          |                                               | ---- 4. Fetch -----> |
          |                                               | <--------------------|
          |                                                                      |
5. Encode Regressions + Encoded Shortcuts -------------------------------------> |

Module: /go/perf-tool/application/mocks

The /go/perf-tool/application/mocks module provides mock implementations of the core application logic interfaces for the perf-tool CLI. These mocks are generated using mockery and are built upon the testify framework, facilitating unit testing of command-line interactions and high-level workflows without requiring a live database, cloud infrastructure, or real file system mutations.

Purpose and Design Decisions

The primary goal of this module is to decouple the CLI's user interface (command parsing and flag handling) from the actual execution of heavy operations like database backups, trace exports, and ingestion management.

By using mocks, developers can:

  • Verify Parameter Passing: Ensure that command-line flags (like --start, --stop, or --dryrun) are correctly parsed and passed to the underlying application logic.
  • Simulate Failures: Test how the CLI handles errors returned from complex operations (e.g., a failed PubSub topic creation) without needing to manually induce environmental errors.
  • Performance: Run tests for the perf-tool management commands in milliseconds, avoiding the overhead of connecting to BigTable or SQL backends.

Key Components

Application Mock (Application.go)

The Application struct is the central mock in this package. It mirrors the interface used by the perf-tool application layer, covering several functional domains of the Perf system:

  • Database Maintenance: Includes mocks for backing up and restoring high-level entities such as Alerts, Regressions, and Shortcuts. This allows testing the backup/restore CLI commands while ensuring the logic correctly handles file paths and instance configurations.
  • Ingestion Management: Provides hooks for IngestForceReingest and IngestValidate. This is critical for testing the logic that triggers data reprocessing across specific time ranges or validates ingestion file formats.
  • Trace Operations: Mocks for TracesExport and TracesList. These facilitate testing how the tool queries TraceStore and writes results to output files or standard output, utilizing types.CommitNumber and types.TileNumber for range-based logic.
  • Infrastructure Setup: The ConfigCreatePubSubTopicsAndSubscriptions mock allows testing the initialization commands that provision Google Cloud PubSub resources based on the provided InstanceConfig.

Typical Testing Workflow

When testing a new command in perf-tool, the mock is used to intercept calls from the command-line handlers.

[ CLI Command ] ----> [ Application Interface (Mock) ] ----> [ Test Assertions ]
      |                          |                                 |
1. User runs:              2. Mock records call:             3. Test verifies:
   "perf-tool ingest..."      "IngestForceReingest(true, ...)"  - Was it called?
                                                                - Were flags correct?

The NewApplication function simplifies this by automatically registering the mock with the testing.T cleanup routine, ensuring that AssertExpectations is called when the test finishes to verify that all expected calls were made.

Module: /go/perfclient

Overview

The perfclient module provides a standardized interface for sending performance benchmarking data to Skia's Perf ingestion system. It functions as a specialized wrapper around Google Cloud Storage (GCS), abstracting the complexities of file naming conventions, data compression, and directory structuring required by the Perf ingestion engine.

Design Philosophy

The module is designed around the principle of deterministic, time-series organization. The Perf ingestion system expects data to be organized in GCS using a specific hierarchy based on time and task metadata. By centralizing this logic in perfclient, different Skia services can ensure that their performance results are stored in a way that the ingestion service can automatically discover and process them.

Key implementation choices include:

  • Automatic Compression: To optimize storage costs and upload speed, the client transparently compresses the JSON payload using GZIP. It utilizes GCS “transcoding” features by setting the Content-Encoding: gzip header, allowing the data to be served uncompressed if requested while remaining compressed at rest.
  • Collision Avoidance: File names are generated using a combination of a user-provided prefix, an MD5 hash of the data content, and a millisecond-precision timestamp. This ensures that even if multiple tasks upload data simultaneously for the same configuration, they will not overwrite each other.
  • Path Hierarchy: Data is organized into a YYYY/MM/DD/HH folder structure. This allows the ingestion engine to poll specific time-based slices of data efficiently rather than scanning the entire bucket.

Key Components

ClientInterface

The primary entry point is the ClientInterface. It defines the contract for pushing data to Perf. This abstraction allows other modules to use a MockPerfClient during unit testing, avoiding actual GCS network calls.

Client

The concrete implementation of the interface. It holds a reference to a gcs.GCSClient and a basePath (the root directory in the bucket where all performance data should reside).

Data Workflow

The PushToPerf method executes the following logic:

  1. Serialization: Converts the format.BenchData struct into JSON.
  2. Compression: Gzip-compresses the resulting JSON bytes.
  3. Path Calculation: Invokes objectPath to determine the exact destination in GCS.
  4. Upload: Transfers the compressed bytes to GCS with the appropriate metadata headers (Content-Encoding and Content-Type).

Data Flow:
[BenchData Struct]
      |
      v
[JSON Marshaling] -> [MD5 Hashing]
      |                    |
      v                    v
[GZIP Compression] -> [Path Construction]
      |                    |
      +----------+---------+
                 |
                 v
[GCS Upload (with gzip headers)]

Path Construction (objectPath)

This function is critical for maintaining compatibility with the Perf ingestion system. It constructs paths following this pattern: [basePath]/[YYYY]/[MM]/[DD]/[HH]/[folderName]/[filePrefix]_[hash]_[timestamp].json

  • basePath: The root GCS folder for the specific environment or service.
  • folderName: Typically represents a high-level grouping, such as a Task name (e.g., “My-Task-Debug”).
  • filePrefix: A descriptor for the type of benchmark (e.g., “nanobench”).
  • now: The timestamp used to determine the directory hierarchy and the file name suffix.

Module: /go/perfresults

Perf Results

The perfresults module is a Go library and set of tools designed to bridge the gap between Chromium's distributed build/test infrastructure (LUCI) and the Skia Perf ingestion system. Its primary responsibility is the automated discovery, retrieval, and parsing of performance benchmark results—typically stored as JSON files in Content Addressed Storage (CAS)—produced by Swarming tasks.

The module provides a unified interface to navigate the hierarchy of Buildbucket builds and Swarming tasks to extract telemetry data for long-term storage and trend analysis.

Design and Data Flow

The architecture follows a “discovery-to-normalization” pipeline. Instead of requiring a direct path to a result file, the module starts with a high-level Build ID and programmatically resolves the underlying storage locations.

[ Buildbucket ID ]
       |
       | (Lookup Build Metadata)
       v
[ Swarming Parent Task ]
       |
       | (Identify Shards/Children)
       v
[ Child Task IDs ]
       |
       | (Query CAS Outputs)
       v
[ RBE CAS Digests ]
       |
       | (Fetch & Merge JSONs)
       v
[ Internal PerfResults ] ----> [ Ingestion / CLI / Workflows ]

Key Components

The Loader (perf_loader.go)

The loader is the central orchestrator. It encapsulates the logic required to communicate with multiple LUCI services in the correct sequence.

  • Service Coordination: It manages the transition from Buildbucket (to get build properties and the root Swarming task) to Swarming (to find child tasks and their CAS output references).
  • Dependency Injection: It uses an rbeProvider to generate RBE clients on the fly based on the specific CAS instance identified in the task metadata, ensuring it can fetch data across different infrastructure silos (e.g., chrome-swarming vs chromium-swarm).

Result Parsing and Histograms (perf_results_parser.go)

This component handles the “Histogram Set” JSON format. Because these files can be large (10MB+), the parser is designed for efficiency:

  • Streaming Decoding: It uses a streaming json.Decoder to process entries one by one, reducing memory footprint compared to loading the entire file into a byte slice.
  • Data Model: Results are stored in a PerfResults struct, which maps a TraceKey (comprising Chart, Unit, Story, Architecture, and OS) to a Histogram (a collection of raw sample values).
  • Aggregation Mapping: Since raw samples are often too granular for time-series databases, the module provides a standard mapping for statistical reductions like mean, max, min, std, and count.

Infrastructure Clients (buildbucket.go, swarming.go, rbecas.go)

These files provide specialized wrappers around LUCI and RBE protocol buffer clients:

  • Buildbucket: Extracts BuildInfo, including the Git revision and “Machine Group” (e.g., ChromiumPerf). This metadata is critical for placing the results on the correct timeline in Skia Perf.
  • Swarming: Handles the logic of finding child tasks. It uses task creation/completion timestamps to narrow the search space when querying the Swarming API for tasks tagged with a specific parent_task_id.
  • RBE CAS: Specialized for “flattening” CAS directory trees. It searches through the output tree of a task to locate files named perf_results.json, even if they are nested within benchmark-specific subdirectories.

Submodules

The project is extended by several specialized submodules that handle specific parts of the performance lifecycle:

  • ingest: Translates the internal PerfResults structures into the specific JSON schema and Google Cloud Storage (GCS) path hierarchy required by the Skia Perf ingester.
  • cli: A command-line tool that allows developers or CI scripts to manually trigger the loading and transformation of results for a given Buildbucket ID.
  • workflows: Contains Temporal workflow definitions for managing long-running, fault-tolerant ingestion jobs. It ensures that if a network call fails during the multi-step discovery process, the job can resume without losing state.
  • testdata: A comprehensive suite of recorded gRPC/HTTP interactions and sample JSON files, allowing for deterministic testing of the entire pipeline without live infrastructure access.

Design Decisions

  • Merging Strategy: When multiple child tasks (shards) produce results for the same benchmark, the Loader automatically merges them. If two histograms share the same TraceKey, their SampleValues are concatenated. This treats sharded test execution as a single logical benchmark run.
  • Secure-by-Default Ingestion: In the ingest submodule, if a builder's configuration cannot be explicitly verified as “public,” the system defaults to storing results in non-public internal buckets to prevent accidental data leaks.
  • Trace Key Uniqueness: The TraceKey includes Architecture and OSName because these are derived from the Swarming bot dimensions. This ensures that even if two different machines run the same benchmark story, their results are stored as distinct traces if their hardware/OS profiles differ.
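The merging strategy can be sketched as follows. The `traceKey` struct here is a simplified stand-in for the module's `TraceKey`, and shards are modeled as plain maps; the essential behavior is that samples for an identical key are concatenated, never overwritten.

```go
package main

import "fmt"

// traceKey is a simplified stand-in for the module's TraceKey.
type traceKey struct {
	Chart, Unit, Story, Arch, OS string
}

// mergeShards concatenates SampleValues whenever two shards report the
// same key, treating sharded execution as one logical benchmark run.
func mergeShards(shards []map[traceKey][]float64) map[traceKey][]float64 {
	merged := map[traceKey][]float64{}
	for _, shard := range shards {
		for k, samples := range shard {
			merged[k] = append(merged[k], samples...)
		}
	}
	return merged
}

func main() {
	k := traceKey{Chart: "load_time", Unit: "ms", Story: "s1", Arch: "arm64", OS: "Android"}
	shard1 := map[traceKey][]float64{k: {1.0, 2.0}}
	shard2 := map[traceKey][]float64{k: {3.0}}
	merged := mergeShards([]map[traceKey][]float64{shard1, shard2})
	fmt.Println(merged[k]) // [1 2 3]
}
```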

Module: /go/perfresults/cli

Perf Results CLI

The perfresults/cli module provides a command-line tool designed to bridge the gap between Buildbucket task execution and the Skia Perf ingestion system. Its primary purpose is to retrieve raw performance data associated with a specific Buildbucket build, transform it into a standardized format suitable for Skia Perf, and persist it as local JSON files.

This tool is particularly useful in CI/CD pipelines where performance benchmarks are executed as sub-tasks of a main build, and those results need to be extracted and prepared for long-term storage and analysis.

Design and Data Flow

The CLI acts as an orchestrator between the perfresults loading logic and the ingest formatting logic. The design favors a “pull and transform” model:

  1. Retrieval: It uses the perfresults package to abstract the complexity of communicating with Buildbucket and locating relevant benchmark artifacts.
  2. Normalization: Raw results are grouped by benchmark. For each benchmark, the CLI attaches contextual metadata—specifically the Git revision and the Buildbucket job link—to ensure the data is traceable back to its source.
  3. Transformation: It delegates the conversion of internal performance structures to the Skia Perf ingestion format via the ingest package.
  4. Persistence: Results are written to individual files named by benchmark and build ID, providing a clear output stream for downstream processes (like cloud storage uploaders).

[ Buildbucket ID ]
       |
       v
+--------------+      +-----------------------+
| perfresults  |----->| Raw Benchmark Results |
|    Loader    |      | (Memory Structures)   |
+--------------+      +-----------------------+
                               |
                               v
+--------------+      +-----------------------+      +-----------------+
|    ingest    |<-----| Add Metadata:         |      |  Output Files:  |
|  Converter   |      | - Git Revision        |----->|  bench_123.json |
+--------------+      | - Buildbucket Link    |      |  bench_456.json |
                      +-----------------------+      +-----------------+

Key Components

Main Logic (main.go)

The entry point handles command-line flag parsing and coordinates the execution flow. It is responsible for:

  • Contextualization: It merges high-level build information (from perfresults.Loader) with specific benchmark data.
  • File Management: It manages the creation of the output directory and ensures each benchmark result is serialized correctly.
  • Inter-process Communication: By printing the paths of the generated files to stdout, the CLI allows parent scripts or automation tools to easily identify and process the resulting JSON files.

Integration with Other Modules

The CLI serves as the glue between several specialized modules:

  • perf/go/perfresults: Provides the Loader which handles the heavy lifting of finding and downloading artifacts from Buildbucket.
  • perf/go/perfresults/ingest: Contains the logic to translate internal Go structures into the specific JSON schema required by the Skia Perf ingestion pipeline.

Module: /go/perfresults/ingest

Perf Results Ingestion

The perfresults/ingest module provides the logic necessary to transform raw performance results into a structured format suitable for the Skia Perf ingestion pipeline and determines the appropriate storage locations within Google Cloud Storage (GCS).

It acts as a bridge between the data structures defined in the perfresults module (which represent the raw telemetry/benchmark output) and the format.Format expected by the Skia Perf ingester.

High-Level Overview

The ingestion process involves two primary responsibilities:

  1. Format Conversion: Translating internal performance result structures (Histograms and BuildInfo) into a standardized JSON format that the Perf ingester can parse and index.
  2. Path Resolution: Determining the standardized GCS URI where the results should be stored based on the execution time, builder configuration, and benchmark name.

Design Decisions

Data Aggregation

The perfresults format often contains a collection of sample values for a single measurement. However, for charting and time-series analysis, these samples need to be reduced to specific statistical points (e.g., mean, max, min).

Instead of choosing a single representative value, the module utilizes perfresults.AggregationMapping to generate multiple traces for a single histogram. Each aggregation (like “avg” or “std”) is converted into a format.SingleMeasurement. This allows users to toggle between different statistical views of the same benchmark data in the Perf UI.
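A small sketch of how one histogram fans out into multiple traces via a reduction table. The function names and map shape are assumptions made for illustration; `perfresults.AggregationMapping` also covers min and std, which are omitted here for brevity.

```go
package main

import (
	"fmt"
	"math"
)

// mean and maxOf are sample reductions; the real AggregationMapping
// also includes min, std, and others.
func mean(s []float64) float64 {
	total := 0.0
	for _, v := range s {
		total += v
	}
	return total / float64(len(s))
}

func maxOf(s []float64) float64 {
	m := math.Inf(-1)
	for _, v := range s {
		m = math.Max(m, v)
	}
	return m
}

// aggregations maps each reduction name to a function; every entry
// becomes its own trace (a format.SingleMeasurement) for the same histogram.
var aggregations = map[string]func([]float64) float64{
	"mean":  mean,
	"max":   maxOf,
	"count": func(s []float64) float64 { return float64(len(s)) },
}

func main() {
	samples := []float64{2, 4, 6}
	for _, name := range []string{"mean", "max", "count"} {
		fmt.Printf("%s=%v\n", name, aggregations[name](samples))
	}
}
```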

Internal vs. External Data Routing

Security and visibility are handled at the path generation level. The module distinguishes between “public” and “non-public” buckets based on the builder name.

  • Default to Internal: If a builder's configuration cannot be explicitly verified as public via bot_configs, the module defaults to the internal bucket (chrome-perf-non-public). This “secure-by-default” approach prevents accidental exposure of sensitive performance data.

Key Components

JSON Transformation (json.go)

This file handles the structural mapping between the perfresults package and the ingest/format package.

  • ConvertPerfResultsFormat: This is the entry point for data transformation. It maps histogram keys (Chart, Unit, Story, Arch, OS) into the Key metadata map used by the ingester for filtering.
  • toMeasurement: Processes the raw SampleValues from a histogram. It filters out invalid numerical values (Inf, NaN) before they reach the ingestion pipeline to ensure database integrity.

GCS Path Management (gcs.go)

This file defines the organizational hierarchy of the performance data in GCS. The path structure is designed to be easily browsable and predictable for the ingester:

gs://<bucket>/ingest/<YYYY>/<MM>/<DD>/<HH>/<MachineGroup>/<BuilderName>/<Benchmark>

  • Time Normalization: All paths are generated using UTC time. The convertTime function flattens the precision to the hour, grouping results into hourly “buckets” to optimize file discovery and ingestion batching.
  • Builder Metadata: It handles defaults for missing metadata (e.g., using ChromiumPerf as the default Machine Group and BuilderNone for missing builders) to ensure the path remains valid and consistent.

Workflow

The typical flow of data through this module can be visualized as follows:

[Raw PerfResults] -> ConvertPerfResultsFormat() -> [format.Format Object]
                                                           |
                                                           v
[Build Metadata]  -> convertPath() --------------> [GCS URI Destination]
      +                                                    |
[Timestamp]                                                v
                                               (Ready for Upload/Ingest)

The resulting JSON object and GCS path are then used by higher-level services to write the data to GCS, where the Skia Perf ingester will eventually pick it up for processing into the trace database.

Module: /go/perfresults/testdata

This module serves as a centralized repository of deterministic test inputs and recorded network interactions used to verify the functionality of performance result processing. It enables the testing of complex workflows—such as fetching task metadata, parsing performance histograms, and merging result sets—without requiring active connections to external services like Buildbucket or Swarming.

Core Responsibilities

The data within this module is structured to support three primary testing objectives:

  1. API Replay and Service Mocking: The module contains recorded pRPC/gRPC interactions (captured in .json and .rpc files). These files allow the perfresults clients to simulate communication with infrastructure services. By providing pre-recorded request/response pairs, the tests can verify how the system handles various states—such as a successful build lookup, a non-existent task ID, or a complex task hierarchy—under stable, repeatable conditions.

  2. Data Schema and Parsing Verification: Files like full.json, empty.json, and valid_histograms.json represent the expected internal schema for performance data. These are used to ensure that parsers correctly translate raw JSON inputs into internal Go structures (e.g., GenericSet, DateRange, and Histogram objects) and that diagnostic metadata is correctly associated with specific samples.

  3. Aggregation and Logic Validation: Specialized datasets like merged.json and merged_diff.json are designed to test higher-level logic. These files provide the “before” and “after” states required to validate that the module can successfully combine multiple results or calculate differences between distinct performance runs.

Key Components and Design

The test data is organized into functional groups to reflect the multi-stage nature of performance result processing:

  • Infra Metadata (FindTaskID_..., SwarmingClient_...): These files mock the discovery phase. They contain the specific metadata—such as Swarming instance names and CAS (Content Addressed Storage) digests—needed to locate where performance results are actually stored after a build completes.
  • Result Loading (LoadPerfResults_...): These datasets cover the edge cases of the loading logic. This includes scenarios where a build exists but contains no performance data (“NoChildRuns”) or where the build identification is entirely invalid.
  • Histogram Sets (perftest group): These files represent the final performance metrics. They include complex diagnostic maps that link specific measurements to bot IDs, operating systems, and benchmark versions.

Workflow Visualization

The data in this module facilitates the testing of the following automated discovery and parsing pipeline:

[ Build ID ] --> ( Mock Buildbucket ) --> [ Swarming Task ID ]
                                                 |
                                                 v
[ CAS Digest ] <-- ( Mock Swarming ) <--- [ Task Result ]
      |
      +--------> ( Mock RBE/CAS ) ------> [ histogram.json ]
                                                 |
                                                 v
[ Internal Result Set ] <--------------- ( Parser Logic )

By providing static files for every step in this chain, the module ensures that logic changes in the parser or client can be verified for correctness and backward compatibility with historical data formats.

Module: /go/perfresults/workflows

Perf Results Workflows

The perfresults/workflows module contains the business logic for orchestrating the ingestion and processing of performance data within the Skia infrastructure. It leverages the Temporal framework to manage long-running, distributed tasks that require strong guarantees on state persistence and fault tolerance.

Design Philosophy: Fault-Tolerant Ingestion

Performance data ingestion is a multi-step process involving data retrieval, validation, storage in the trace store, and triggering downstream analysis (such as regression detection). The module is built on the following principles:

  • Reliable Transitions: By using Temporal, the system ensures that if a step fails (e.g., a network timeout during a storage write), the workflow can resume from its last successful state rather than restarting the entire ingestion pipeline.
  • Separation of Concerns: The module separates the orchestration (the sequence of steps) from the execution (the actual work). Workflows define the “recipe,” while Activities perform the “cooking.”
  • Idempotency: Activities are designed to be idempotent so that retries—inherent in distributed systems—do not result in duplicate data or corrupted state.

Key Components and Responsibilities

The module is structured to support both the high-level workflow definitions and the granular activities they invoke.

  • Workflow Definitions: These are the top-level Go functions that define the lifecycle of a performance result. A typical workflow orchestrates the flow of data from an external upload or discovery event into the internal Skia Perf ecosystem. It handles branching logic, error handling policies, and timeout configurations.
  • Activities: These are the atomic units of work called by workflows. Common responsibilities include:
    • Data Validation: Checking the schema and integrity of incoming performance JSON files.
    • Storage Operations: Interfacing with the trace store (e.g., BigTable or Spanner) to persist the results.
    • Downstream Notifications: Sending signals to the clustering and regression detection systems once new data has been successfully ingested.
  • Input/Output Contracts: The module defines the data structures used to pass information between steps. These contracts ensure that as workflows evolve, the data passed between disparate activities remains consistent and type-safe.

Workflow Architecture

The relationship between the orchestrating workflow and the underlying infrastructure is depicted below:

[ Trigger Event ]
       |
       v
+-----------------------------+
|      Temporal Workflow      |
|  (Orchestration & State)    |
+--------------+--------------+
               |
      +--------+--------+
      |        |        |
      v        v        v
+----------+ +----------+ +----------+
| Activity | | Activity | | Activity |
| (Fetch)  | | (Parse)  | | (Store)  |
+----+-----+ +----+-----+ +----+-----+
     |            |            |
     +------------+------------+
                  |
                  v
       [ Result Finalization ]

  1. Orchestration: The Workflow manages the control flow, ensuring that “Parse” only happens after “Fetch” succeeds, and “Store” only occurs if “Parse” produces valid data.
  2. Execution: Each Activity is picked up by a Worker and executed. The status of these activities is reported back to the Temporal server to maintain the global state of the ingestion job.
  3. Completion: Upon success, the workflow may trigger secondary events, such as updating “latest result” pointers or alerting developers of significant performance shifts.

Interaction with the Worker

While the workflows module defines the logic, it relies on the worker submodule to provide the execution environment. The workflows are registered with the worker at startup, allowing the worker to “claim” tasks from the Temporal task queue that match the workflow and activity names defined here. This decoupling allows the workflow logic to be updated and deployed independently of the worker's infrastructure configuration, provided the interfaces remain compatible.

Module: /go/perfresults/workflows/worker

Perf Upload Worker

The perf-upload-worker serves as the execution engine for Temporal workflows related to performance result ingestion and processing. In the context of the Perf results subsystem, this worker acts as the bridge between the Temporal orchestration engine and the actual execution of tasks, such as uploading data to storage or triggering indexing processes.

Design Philosophy: Temporal Execution

The worker is designed around the principle of externalized orchestration. Instead of embedding business logic directly into a monolith, the worker provides the compute resources to execute workflows defined elsewhere. This decoupling allows for:

  • Scalability: Multiple worker instances can be deployed to handle high volumes of performance data uploads by listening to the same task queue.
  • Reliability: Since Temporal manages state and retries, this worker remains relatively stateless. If a worker process crashes, the Temporal server detects the timeout and redistributes the pending tasks to other available workers.
  • Observability: By integrating Prometheus metrics directly into the worker's lifecycle, the system tracks execution latency and task failures at the infrastructure level.

Key Components and Responsibilities

The primary responsibility of this module is the lifecycle management of the Temporal worker process, managed within main.go.

  • Connection Management: The worker establishes a long-lived connection to the Temporal cluster (configured via --host_port and --namespace). It uses a single, heavyweight client.Client instance to minimize resource overhead, as per Temporal best practices.
  • Task Queue Subscription: The worker listens to a specific --task_queue. By default, it dynamically generates a queue name based on the current system user (e.g., localhost.username), which facilitates local development and testing without interfering with production workflows.
  • Metrics Integration: The worker utilizes a specialized MetricsHandler to export Temporal-specific SDK metrics to Prometheus. This is crucial for monitoring the health of the workflow execution environment, such as worker pollers and activity execution rates.
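The development-time default for --task_queue can be sketched as below. The helper name and the "unknown" fallback are assumptions; the documented behavior is deriving a localhost.username queue name from the current system user.

```go
package main

import (
	"fmt"
	"os/user"
)

// defaultTaskQueue derives a per-developer queue name so local runs do
// not collide with production workflows. Sketch; fallback is illustrative.
func defaultTaskQueue() string {
	name := "unknown"
	if u, err := user.Current(); err == nil {
		name = u.Username
	}
	return fmt.Sprintf("localhost.%s", name)
}

func main() {
	fmt.Println(defaultTaskQueue())
}
```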

Operational Workflow

The worker operates in a continuous loop, polling the Temporal server for work. The high-level interaction between the worker and the broader system is illustrated below:

+------------------+          +------------------+          +-----------------------+
|  Temporal Server |<---1---->|  Perf Worker     |<---2---->|  Workflow/Activity    |
|  (Orchestrator)  |          |   (Execution)    |          |   Implementations     |
+------------------+          +------------------+          +-----------------------+
          ^                            |
          |                            | 3. Export Metrics
          |                            v
          |                  +------------------+
          +------------------+ Prometheus/Skia  |
                             | Monitoring       |
                             +------------------+
  1. Polling: The worker establishes a persistent gRPC connection to the Temporal Server and polls the configured Task Queue.

  2. Dispatch: When the server has a scheduled task (Workflow or Activity), the worker receives the task and dispatches it to the registered implementation logic.
  3. Telemetry: Throughout the execution, the worker pushes heartbeat and performance data to the Prometheus endpoint (defaulting to port :8000).

Configuration and Deployment

The worker is packaged as a containerized application (perf_upload_worker). It relies on command-line flags to determine its environment:

  • --task_queue: Defines which set of tasks this specific worker fleet will handle.
  • --namespace: Segregates workflow execution within the Temporal cluster (e.g., separating “prod” from “staging”).

Module: /go/perfserver

Perfserver

The perfserver module provides a unified entry point for all long-running processes required to operate a Skia Perf instance. It is designed as a multi-command CLI tool that encapsulates disparate operational roles—web serving, data ingestion, maintenance, and regression detection—into a single binary.

Architectural Philosophy

The design of perfserver follows a “sidecar” or “micro-service” compatible architecture where a single codebase can fulfill different roles depending on the command-line arguments. This approach simplifies deployment and configuration management: instead of managing multiple distinct binaries, the same container image or executable is deployed across different service tiers (e.g., Kubernetes deployments), with only the entrypoint command changing.

Key Components and Responsibilities

The functionality is divided into several sub-commands, each targeting a specific area of the Perf lifecycle:

1. Frontend (frontend)

The frontend command launches the primary web server. It is responsible for serving the user interface and handling API requests for data visualization. While it primarily focuses on the “read” path of the system, it acts as the central hub for user interaction with performance traces.

2. Ingestion (ingest)

The ingest command starts the data processing pipeline. Its responsibility is to monitor configured sources (such as cloud storage buckets), parse incoming performance files, and populate the TraceStore.

  • Workflow: It operates continuously, utilizing parallel workers to ensure that as new benchmark data is produced, it is indexed and made available for queries with minimal latency.

3. Regression Detection (cluster)

Despite its name, the cluster command is essentially a specialized instance of the frontend logic configured specifically for background analysis. It focuses on the “alerting” path—continuously scanning newly arrived data against configured alert definitions to identify performance regressions or improvements.

4. Maintenance (maintenance)

The maintenance command runs background tasks that are required for the long-term health of the database and application state. These tasks are typically “singleton” operations, meaning only one instance of the maintenance process should run per Perf instance to avoid data contention or redundant processing.

Operational Workflow

The perfserver coordinates these components through a shared configuration validation logic. Every sub-command (excluding documentation generators) follows a similar initialization pattern:

  1. Parse CLI flags to locate the instance configuration file.
  2. Validate the configuration against a schema to ensure environmental consistency.
  3. Initialize telemetry and monitoring (Prometheus).
  4. Hand off execution to the specialized package (e.g., perf/go/frontend or perf/go/ingest/process).

[ Configuration File ]
       |
       v
+-----------------------------+
|        perfserver           |
+-----------------------------+
       |
       +--- [ frontend ] ----> Serves Web UI & API
       |
       +--- [ ingest ] ------> Monitors Storage -> Populates TraceStore
       |
       +--- [ cluster ] ------> Runs Alerting & Regression Detection
       |
       +--- [ maintenance ] --> Database Cleanup & Singleton Tasks
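The single-binary, role-per-command design can be illustrated with a dependency-free dispatch sketch. The real binary wires its sub-commands up via go/urfavecli with full flag and config handling; the role bodies below are placeholders.

```go
package main

import (
	"fmt"
	"os"
)

// roles maps sub-commands to entry points. Placeholder bodies only; the
// real commands hand off to packages like perf/go/frontend.
var roles = map[string]func() error{
	"frontend":    func() error { fmt.Println("serving web UI & API"); return nil },
	"ingest":      func() error { fmt.Println("monitoring storage, populating TraceStore"); return nil },
	"cluster":     func() error { fmt.Println("running alerting & regression detection"); return nil },
	"maintenance": func() error { fmt.Println("running singleton maintenance tasks"); return nil },
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: perfserver <frontend|ingest|cluster|maintenance>")
		os.Exit(1)
	}
	run, ok := roles[os.Args[1]]
	if !ok {
		fmt.Fprintf(os.Stderr, "unknown command %q\n", os.Args[1])
		os.Exit(1)
	}
	if err := run(); err != nil {
		os.Exit(1)
	}
}
```

The same image can thus be deployed four times with only the entrypoint argument differing.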

Implementation Details

  • Flag Management: The module heavily leverages go/urfavecli to map configuration structures directly to command-line flags. This ensures that the CLI interface stays in sync with the underlying configuration objects defined in perf/go/config.
  • Shared Logic: By centralizing these commands, the server ensures that logging (via sklog), error handling (via skerr), and metrics initialization are applied consistently across all roles of the Perf system.
  • Validation: Before starting critical processes like ingest or maintenance, the server uses validate.InstanceConfigFromFile to catch configuration errors early, preventing partial failures in production.

Module: /go/pinpoint

Pinpoint Client Module

The pinpoint module provides a Go client for interacting with Pinpoint, the performance regression analysis service used by Chrome and Skia. It abstracts the complexities of communicating with legacy Chromeperf and Pinpoint endpoints, allowing Skia Perf to programmatically trigger “Try Jobs” (to test specific patches) and “Bisect Jobs” (to identify the root cause of performance regressions).

Overview

This module acts as a bridge between Skia Perf and the Pinpoint service. Its primary responsibility is to translate high-level requests—such as “bisect this anomaly” or “run a try job with this patch”—into the specific URL-encoded POST requests required by Pinpoint's legacy API.

The client manages:

  • Authentication: Uses Google Default Token Sources with auth.ScopeUserinfoEmail to authorize requests.
  • Routing: Determines whether to send bisect requests to the modern Pinpoint API or the legacy Chromeperf bisect service based on the source of the anomaly.
  • Data Transformation: Normalizes inputs, such as converting underscores to dots in story names, to match Pinpoint's internal requirements.
  • Monitoring: Tracks the success and failure rates of job creation via internal metrics.

Key Components and Responsibilities

Client (pinpoint.go)

The Client struct is the central entry point. It wraps an http.Client configured with necessary OAuth2 credentials and holds telemetry counters.

  • CreateTryJob: Initiates a job to compare a base commit/patch against an experimental commit/patch. This is typically used to verify if a proposed fix actually improves performance before landing.
  • CreateBisect: Initiates a bisection to find the specific commit that introduced a performance change. It supports two different backend paths depending on whether the configuration indicates the anomaly was fetched from the new SQL-based system.
  • doPostRequest: A private helper that handles the low-level HTTP execution, response body reading, and error extraction. It specifically knows how to parse Pinpoint's error JSON format to provide actionable error messages.

Request Orchestration

The module uses specific structures to define job parameters:

  • TryJobCreateRequest: Captures details like BaseGitHash, ExperimentPatch, Benchmark, and Story.
  • BisectJobCreateRequest: Captures regression-specific data like StartGitHash, EndGitHash, ComparisonMagnitude, and AlertIDs.

Logic Flow: Request Building

Pinpoint's legacy API consumes parameters via URL query strings even for POST requests. The module handles this through several “build” functions that ensure required fields are present and formatted correctly.

[ Skia Perf ] --(Request Struct)--> [ Client.CreateBisect ]
                                            |
                                   [ getBisectRequestURL ]
                                     /              \
                          (New Anomaly?)       (Legacy Anomaly?)
                               /                      \
                [ buildPinpointURL ]          [ buildChromeperfURL ]
                               \                      /
                             [ buildBisectRequestParams ]
                                         |
                                  (dotify stories)
                                  (add "skia_perf" tags)
                                         |
[ Pinpoint API ] <---(POST with Params)---'

Implementation Decisions

Legacy API Compatibility

The module intentionally targets the legacy Pinpoint API endpoints (/api/new and /pinpoint/new/bisect). This decision necessitates the use of URL query parameter encoding for POST bodies, as seen in buildTryJobRequestURL and buildBisectRequestParams.
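Packing POST parameters into a URL query string is straightforward with net/url. The parameter names and tag payload below are illustrative stand-ins, not the exact fields of buildBisectRequestParams:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildBisectParams sketches URL-encoding job fields for a POST body;
// parameter names and the tag payload are assumptions for illustration.
func buildBisectParams(startHash, endHash, benchmark string) string {
	v := url.Values{}
	v.Set("start_git_hash", startHash)
	v.Set("end_git_hash", endHash)
	v.Set("benchmark", benchmark)
	v.Set("tags", `{"origin":"skia_perf"}`) // hypothetical tag shape
	return v.Encode()                       // keys are sorted, values escaped
}

func main() {
	fmt.Println(buildBisectParams("abc123", "def456", "speedometer"))
}
```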

Data Normalization (dotify)

Pinpoint internally expects story names to use dot notation (e.g., story.name) rather than underscores (e.g., story_name), which are common in other parts of the Skia ecosystem. The dotify function automatically handles this transformation to prevent job submission failures.
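A minimal sketch of the transformation, assuming dotify simply replaces underscores with dots; the real function may handle additional edge cases (already-dotted names, mixed separators).

```go
package main

import (
	"fmt"
	"strings"
)

// dotifySketch converts underscore-separated story names to dot notation.
// Simplified illustration of the documented behavior, not the real dotify.
func dotifySketch(story string) string {
	return strings.ReplaceAll(story, "_", ".")
}

func main() {
	fmt.Println(dotifySketch("story_name")) // story.name
}
```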

Error Handling

Instead of returning generic HTTP errors, extractErrorMessage attempts to parse the JSON response from Pinpoint to find a specific error field. If Pinpoint returns a 400 or 500 status code with a message like {"error": "benchmark not found"}, this module ensures that specific string is propagated back to the caller.
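The error-extraction logic can be pictured roughly as follows; this is a hedged sketch of parsing an `{"error": ...}` body, with a fallback to the raw response when it is not JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractErrorSketch pulls the "error" field out of a Pinpoint-style error
// body, falling back to the raw body text when parsing fails. Illustrative
// only; the real extractErrorMessage may differ in its fallback behavior.
func extractErrorSketch(body []byte) string {
	var payload struct {
		Error string `json:"error"`
	}
	if err := json.Unmarshal(body, &payload); err == nil && payload.Error != "" {
		return payload.Error
	}
	return string(body)
}

func main() {
	fmt.Println(extractErrorSketch([]byte(`{"error": "benchmark not found"}`)))
}
```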

Conditional Routing

The function getBisectRequestURL uses the config.Config.FetchAnomaliesFromSql toggle to decide which legacy endpoint to hit. This allows the system to support a transition period between old Chromeperf-managed anomalies and newer SQL-managed anomalies without breaking the bisection workflow.

Module: /go/pivot

Pivot Module

The pivot module provides functionality to transform and aggregate Performance DataFrames. It allows users to group traces by specific keys and apply mathematical operations to summarize data, similar to a “Pivot Table” in a spreadsheet or a GROUP BY clause in SQL.

Overview

In Perf, data is typically represented as a series of traces (floating-point arrays) identified by a set of parameters (e.g., arch=arm, config=8888). The pivot module allows you to “collapse” these traces based on a subset of those parameters.

For example, if you have traces for various configurations across different architectures, you can pivot by arch to see the aggregate performance of arm vs. intel, regardless of the specific configuration.

Key Concepts

Pivot Request

The transformation is governed by a Request struct which defines three things:

  1. GroupBy: A list of keys to retain. All other keys in the original trace IDs are discarded, and traces sharing the same remaining keys are grouped together.
  2. Operation: The aggregation function used to combine multiple traces within a group into a single representative trace.
  3. Summary (Optional): A list of operations to apply to the resulting traces to reduce them from a series of values (over time/commits) into a single scalar value per operation.
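The three fields can be pictured as a struct along these lines. The shape follows the description above; the exact field types and validation in the real module may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Operation names an aggregation function (Sum, Avg, Geo, Std, Count, Min, Max).
type Operation string

// Request sketches the documented shape: which keys to keep, how to combine
// traces within a group, and optional summary operations.
type Request struct {
	GroupBy   []string
	Operation Operation
	Summary   []Operation
}

// Valid sketches a plausible sanity check: a pivot needs at least one
// GroupBy key to form groups at all.
func (r Request) Valid() error {
	if len(r.GroupBy) == 0 {
		return errors.New("pivot: GroupBy must not be empty")
	}
	return nil
}

func main() {
	req := Request{GroupBy: []string{"arch"}, Operation: "sum", Summary: []Operation{"avg", "max"}}
	fmt.Println(req.Valid() == nil) // true
}
```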

Operations

The module supports several mathematical operations for both grouping and summarization:

  • Sum / Avg: Standard arithmetic sum and mean.
  • Geo: Geometric mean.
  • Std: Standard deviation.
  • Count: Number of data points.
  • Min / Max: Extremum values.

Workflow and Design

The pivoting process follows a structured pipeline:

1. Grouping

The module identifies all unique combinations of the keys provided in GroupBy that exist in the DataFrame. It then maps every existing trace in the input DataFrame to one of these groups. If a trace does not contain all the keys specified in GroupBy, it is excluded from the result.

2. Aggregation (Group By)

For each group, the Operation (e.g., Sum) is applied across all traces in that group. This results in one trace per group. The trace ID for this new trace contains only the keys specified in the GroupBy list.

Input Traces:
  {arch: arm,   config: 8888} -> [1, 0, 0]
  {arch: arm,   config: 565 } -> [0, 2, 0]
  {arch: intel, config: 8888} -> [1, 1, 1]

Pivot (GroupBy: ["arch"], Operation: Sum):
  {arch: arm}   -> [1+0, 0+2, 0+0] -> [1, 2, 0]
  {arch: intel} -> [1, 1, 1]       -> [1, 1, 1]

3. Summarization (Optional)

If Summary operations are provided, the module further transforms the aggregated traces. Instead of a trace representing values over multiple commits, the resulting “trace” contains one value for each operation listed in Summary.

Intermediate Grouped Trace (from above):
  {arch: arm} -> [1, 2, 0]

Summary (Summary: [Avg, Max]):
  {arch: arm} -> [1, 2]  // Avg is 1, Max is 2
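The two stages above can be reproduced in a few lines. This is a toy re-implementation for illustration only; the real module uses go/calc and go/vec32 and handles missing-data sentinels.

```go
package main

import "fmt"

// sumTraces element-wise sums a group of equal-length traces (the "Sum"
// grouping operation from the example above).
func sumTraces(traces [][]float32) []float32 {
	out := make([]float32, len(traces[0]))
	for _, t := range traces {
		for i, v := range t {
			out[i] += v
		}
	}
	return out
}

// summarize reduces a trace to one value per requested operation.
// Only "avg" and "max" are sketched here.
func summarize(trace []float32, ops []string) []float32 {
	out := make([]float32, 0, len(ops))
	for _, op := range ops {
		switch op {
		case "avg":
			var s float32
			for _, v := range trace {
				s += v
			}
			out = append(out, s/float32(len(trace)))
		case "max":
			m := trace[0]
			for _, v := range trace[1:] {
				if v > m {
					m = v
				}
			}
			out = append(out, m)
		}
	}
	return out
}

func main() {
	// The {arch: arm} group from the example: two traces summed element-wise.
	arm := sumTraces([][]float32{{1, 0, 0}, {0, 2, 0}})
	fmt.Println(arm, summarize(arm, []string{"avg", "max"})) // [1 2 0] [1 2]
}
```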

Implementation Details

  • Logic Mapping: The module uses an internal opMap to link Operation enums to specific implementation functions from the go/calc and go/vec32 packages. This ensures consistency between how data is grouped and how it is summarized.
  • DataFrame Reconstruction: After pivoting, the module rebuilds the ParamSet and updates the DataFrame headers. If a summary is performed, the headers are replaced with simple offsets representing the summary columns.
  • Performance: It utilizes query.MakeKeyFast and query.ParseKeyFast for efficient trace ID manipulation and supports context cancellation for long-running aggregations on large datasets.

Module: /go/playground

Playground

The playground module serves as an interactive experimentation hub for performance data analysis. It provides a web-accessible sandbox where developers and performance engineers can validate detection algorithms, test regression logic against synthetic or real-world traces, and fine-tune sensitivity parameters without impacting production systems or persistent storage.

Design Philosophy and Core Functionality

The primary goal of the playground is to decouple the analysis logic from the data storage and ingestion infrastructure. In the standard Perf production environment, anomaly detection is often part of a large, automated pipeline that reads from BigTable and writes to SQL databases. The playground bypasses these dependencies by accepting raw data via HTTP and processing it in-memory.

This design enables:

  • Rapid Iteration: Immediate feedback on how changing a “radius” or “threshold” affects anomaly detection.
  • Algorithm Validation: Comparison between different regression methods (e.g., AbsoluteStep vs. OriginalStep) on the same data set.
  • Noise Reduction Testing: Testing consolidation strategies like Non-maximum Suppression to ensure that a single regression isn't reported as multiple adjacent anomalies.

Key Submodules and Responsibilities

The module is structured to separate the API lifecycle from the specific mathematical analysis being performed.

Anomaly Detection (/anomaly)

This is the primary functional area of the playground. It implements a sliding window approach to identify shifts in time-series data.

  • Windowing Strategy: Rather than treating a trace as a single entity, the module slides a window of size 2 * radius + 1 across the data. This localization allows the regression package to focus on finding a single “best” step within a small context, which is more robust against long-term trends or multiple shifts in a single trace.
  • Data Adaptation: A significant portion of the implementation involves “shim logic.” The core regression and dataframe packages used by Skia Perf expect complex structures (trace sets, headers, paramsets). The playground's anomaly logic constructs transient, “dummy” dataframes to wrap raw float slices, allowing the production-grade regression code to run as if it were processing a standard database query.
  • Anomaly Consolidation (Non-maximum Suppression): To prevent “jitter” (where several points around a step are all flagged), the module implements a grouping logic. If enabled, it identifies clusters of contiguous points flagged as anomalies and selects only the point with the highest absolute regression score—the point where the “step” is most pronounced.
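The windowing strategy can be sketched as a plain loop. In the real playground each window is wrapped in a DataFrame and handed to regression.StepFit; this toy version only shows the 2 * radius + 1 slicing.

```go
package main

import "fmt"

// windowsOf yields the [i-radius, i+radius] sub-trace for every index where
// a full window fits, mirroring the 2*radius+1 sizing described above.
func windowsOf(trace []float32, radius int) [][]float32 {
	var out [][]float32
	for i := radius; i+radius < len(trace); i++ {
		out = append(out, trace[i-radius:i+radius+1])
	}
	return out
}

func main() {
	w := windowsOf([]float32{1, 1, 1, 5, 5, 5}, 1)
	fmt.Println(len(w), w[2]) // 4 [1 5 5] — the third window straddles the step
}
```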

Workflow: Request-to-Analysis

The following diagram illustrates how data flows from a user request through the detection engine:

[User Request (JSON)]
      | (Trace, Threshold, Radius, Algorithm)
      v
[HTTP Handler]
      |
      +-----> [Data Cleaning] (Remove missing data sentinels)
      |
      +-----> [Sliding Window Loop]
      |          |
      |          v
      |       [regression.StepFit] <--- (Analyzes N points)
      |          |
      |          +--> [Threshold Check] (Is regression > threshold?)
      |
      +-----> [Optional: Grouping/Suppression]
      |          | (Merges adjacent hits, keeps max score)
      v
[Enriched Response]
        (Indices, Medians Before/After, Regression Scores)

Implementation Decisions

  • Statistically Driven Summaries: When an anomaly is detected, the module calculates MedianBefore and MedianAfter values. These are calculated using vec32.RemoveMissingData to ensure that gaps in telemetry do not result in “NaN” or skewed medians, providing the user with a clean delta of the performance change.
  • Algorithm Agnostic API: The request structure uses a string-based Algorithm field. This allows the playground to support any algorithm registered in the regression package without changing the API schema, making it extensible as new detection methods are developed.
  • Simulated Environment: By using the PlaygroundTraceName constant, the module satisfies internal requirements for named traces while maintaining the abstraction that this data is ephemeral and not tied to a real hardware bot or test suite.

Module: /go/playground/anomaly

Anomaly Playground

The anomaly playground module provides a sandbox environment for testing and tuning anomaly detection algorithms on performance data. It exposes an HTTP interface that allows users to submit individual data traces and receive a list of detected anomalies based on configurable parameters like window size and sensitivity thresholds.

This module acts as a bridge between the frontend and the core regression detection logic, allowing developers and users to experiment with detection settings without modifying production configurations or underlying databases.

Detection Logic and Design Decisions

The module implements a Sliding Window Step Fit approach. Instead of analyzing a whole trace at once, it moves a window across the data to identify localized “steps” or shifts in value.

Key Workflows

  1. Request Handling: The Handler receives a DetectRequest containing the raw trace data, a Radius (determining window size), a Threshold (sensitivity), and the specific Algorithm to use (e.g., AbsoluteStep, OriginalStep).
  2. Windowing: The slidingWindowStepFit function iterates through the trace. At each index i, it creates a window of size 2 * radius + 1.
  3. Step Fitting: For each window, the module wraps the slice into a temporary dataframe.DataFrame and calls regression.StepFit. This leverages the existing production logic used by the Perf service.
  4. Anomaly Consolidation:
    • If GroupAnomalies is false, every index flagged by the algorithm is returned.
    • If GroupAnomalies is true, the module performs Non-maximum Suppression. It groups consecutive indices flagged as anomalies and only returns the one with the highest absolute regression score (the “most significant” point in the cluster).
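The consolidation step can be sketched as follows: cluster consecutive flagged indices and keep, per cluster, the index with the largest absolute score. A toy version for illustration; the real module works on its own anomaly structs.

```go
package main

import (
	"fmt"
	"math"
)

// suppress groups runs of consecutive anomaly indices and keeps, per group,
// the index whose absolute regression score is largest. scores is keyed by
// trace index.
func suppress(indices []int, scores map[int]float64) []int {
	var out []int
	for i := 0; i < len(indices); {
		best := indices[i]
		j := i + 1
		// Extend the cluster while indices stay consecutive.
		for ; j < len(indices) && indices[j] == indices[j-1]+1; j++ {
			if math.Abs(scores[indices[j]]) > math.Abs(scores[best]) {
				best = indices[j]
			}
		}
		out = append(out, best)
		i = j
	}
	return out
}

func main() {
	got := suppress([]int{3, 4, 5, 9}, map[int]float64{3: 1.2, 4: -2.5, 5: 0.4, 9: 0.9})
	fmt.Println(got) // [4 9] — index 4 wins its cluster on |score|
}
```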

Process Diagram

[Trace Data]
      |
      v
[Windowing] ----> [Sub-trace (i - radius to i + radius)]
      |                       |
      |                       v
      |            [regression.StepFit Analysis]
      |                       |
      |                       v
      |<--- [Is it a "High" or "Low" Step?]
      |
      v
[Candidate Anomalies]
      |
      +--- (If GroupAnomalies=true) ---> [Non-maximum Suppression]
      |                                      (Pick best in group)
      v
[JSON Response (Anomalies)]

Key Components

anomaly.go

Contains the core logic for the playground:

  • DetectRequest / DetectResponse: Defines the JSON API. The request allows choosing the algorithm and whether to group nearby anomalies to reduce noise.
  • slidingWindowStepFit: The engine that breaks the trace into windows. It constructs dummy dataframe headers to satisfy the requirements of the regression package's API, simulating a real data environment.
  • Handler: The HTTP entry point. It manages the lifecycle of a request, invokes the detection, calculates metadata for each detected anomaly (like MedianBeforeAnomaly and MedianAfterAnomaly), and performs the optional grouping logic.

anomaly_test.go

Provides functional tests for the detection logic. It verifies that the playground correctly identifies simple steps (up/down), handles empty or flat traces, and correctly implements the grouping suppression logic to ensure only the most relevant points are reported.

Implementation Details

  • Handling Missing Data: The module uses vec32.RemoveMissingDataSentinel when calculating medians to ensure that gaps in performance data (common in real-world traces) do not skew the statistical summary of the detected anomaly.
  • Regression Scores: The “strength” of an anomaly is determined by the Regression value returned by the step-fit algorithm. When grouping anomalies, the absolute value is used to determine which point in a cluster represents the most significant shift.
  • Performance Trace Isolation: The constant PlaygroundTraceName is used to identify traces within the temporary dataframes created for analysis, ensuring compatibility with internal Perf logic that expects named traces.

Module: /go/preflightqueryprocessor

preflightqueryprocessor

The preflightqueryprocessor module provides specialized logic for handling complex trace queries in Skia Perf, particularly during the “preflight” stage of data exploration. It manages the aggregation of parameters and trace counts across multiple data tiles, with specific support for “missing value” logic that standard database queries cannot easily represent.

Core Responsibility

In Skia Perf, a query typically filters traces based on a set of key-value pairs (e.g., benchmark=V8_Flash). However, users often need to perform “preflight” checks to understand the shape of their data before running a full analysis. This module facilitates:

  1. Shared Parameter Aggregation: Collecting all unique values for all keys across all traces that match a query.
  2. Subquery Processing: Efficiently narrowing down available options for a specific key based on existing filters on other keys.
  3. Sentinel Value Handling: Supporting the __missing__ sentinel, which allows users to query for traces that lack a specific key.

Design Decisions and Implementation

The Sentinel Strategy (__missing__)

Standard database backends often struggle to query for the absence of a key alongside specific values in a single pass. To solve this, the module implements a “fetch-and-filter” strategy:

  • Query Transformation: When a query contains the MissingValueSentinel (__missing__), the processor removes that key from the actual database query. This ensures a superset of traces (including those missing the key) is fetched from the store.
  • In-Memory Filtering: The FilterParams function then applies the logic manually: a trace matches if the key is missing OR if the key's value is within the user's explicitly allowed set.
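The fetch-and-filter rule boils down to: a trace passes when the sentinel-marked key is absent, or present with an explicitly allowed value. A minimal sketch with hypothetical names, not the module's FilterParams signature:

```go
package main

import "fmt"

// matchesSketch applies the __missing__ rule to one trace's params: the
// trace matches if the key is absent entirely, or if its value is in the
// user's allowed set. Illustrative only.
func matchesSketch(params map[string]string, key string, allowed []string) bool {
	v, ok := params[key]
	if !ok {
		return true // key missing: matched by the __missing__ sentinel
	}
	for _, a := range allowed {
		if v == a {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(matchesSketch(map[string]string{"config": "gpu"}, "arch", nil)) // true: arch is missing
	fmt.Println(matchesSketch(map[string]string{"config": "gpu", "arch": "arm"}, "arch", nil)) // false
}
```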

Concurrent Aggregation

Preflight queries often run across multiple goroutines (one per data tile). To handle this efficiently:

  • Shared State: A preflightQueryBaseProcessor holds a shared sync.Mutex and paramtools.ParamSet. All subqueries and the main query share these instances to build a single unified result.
  • Batching: To minimize mutex contention, the preflightSubQueryProcessor collects values into a local slice for each tile and then performs a single batch update to the shared state once the tile is fully processed.
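The batching pattern above (collect locally per tile, then do one locked merge) can be sketched like this; the shared set stands in for the real ParamSet.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedSet is a mutex-guarded set of values, standing in for the shared
// ParamSet described above.
type sharedSet struct {
	mu     sync.Mutex
	values map[string]bool
}

// mergeBatch performs the single locked update per tile: one lock
// acquisition regardless of how many values the tile produced.
func (s *sharedSet) mergeBatch(local []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, v := range local {
		s.values[v] = true
	}
}

func main() {
	s := &sharedSet{values: map[string]bool{}}
	var wg sync.WaitGroup
	tiles := [][]string{{"arm", "intel"}, {"intel", "riscv"}}
	for _, tile := range tiles {
		wg.Add(1)
		go func(tile []string) {
			defer wg.Done()
			local := append([]string(nil), tile...) // lock-free per-tile collection
			s.mergeBatch(local)                     // one locked merge per tile
		}(tile)
	}
	wg.Wait()
	fmt.Println(len(s.values)) // 3
}
```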

Main vs. Subquery Processors

The module distinguishes between the primary query and supplementary subqueries:

  • Main Processor: Responsible for the total count of unique traces and identifying missing keys (recording them as empty strings in the result set).
  • Subquery Processor: Used when the UI needs to know “what are the available values for key X, given the current filters on keys Y and Z?” It focuses strictly on populating the values for its target key.

Key Workflows

Query Processing Flow

The following diagram illustrates how a query is handled when a sentinel value is present:

User Query: [config=gpu, arch=__missing__]
     |
     v
PrepareQueryWithSentinel()
     |-- 1. Create FilterMap: { "arch": { "allowed": [] } }
     |-- 2. Strip "arch" from Query -> Backend Query: [config=gpu]
     v
Fetch Traces from Tiles (Parallel)
     |
     |--> Tile 1 Results ----> FilterParams() ----> If Match: Add to Shared ParamSet
     |--> Tile 2 Results ----> FilterParams() ----> If Match: Add to Shared ParamSet
     v
Finalize()
     |-- Subqueries move collected values into the final shared ParamSet.
     |-- Result: Total Count + Aggregated ParamSet.

Key Components

ParamSetAggregator & PreflightQueryResultCollector

These interfaces define how trace data is consumed. The main processor implements both (tracking count and params), while the subquery processor only implements aggregation.

preflightMainQueryProcessor

The primary coordinator. It uses a map[string]bool of unique trace IDs to ensure the total count is accurate even if traces overlap across tiles. It also supports SetKeysToDetectMissing, which forces the aggregator to record an empty string if a specific expected key is absent from a trace.

preflightSubQueryProcessor

Optimized for “discovery” queries. It tracks values in a local filteredValuesFromTiles map during tile processing and only populates the shared ParamSet during the Finalize phase to reduce lock overhead.

PrepareQueryWithSentinel and FilterParams

The logic engine for the __missing__ value. PrepareQueryWithSentinel modifies the query object in place (after cloning) to ensure the backend returns a broad enough dataset for the manual FilterParams logic to work correctly.

Module: /go/progress

Perf Progress Tracking Module

The progress module provides a standardized mechanism for tracking and reporting the status of long-running backend tasks in the Perf application. It bridges the gap between asynchronous server-side processes (like complex data queries or “dry runs”) and the user interface, which needs real-time feedback on task advancement.

Design Philosophy

The module is built around a “push-pull” architecture:

  1. Push: The long-running backend task updates a Progress object with its current state, messages, and eventually, results.
  2. Pull: The frontend polls a specific URL associated with that task. The backend Tracker intercepts these requests and returns a serialized snapshot of the task's progress.

To ensure consistency and prevent logical errors in task reporting, the state machine is strictly enforced. Once a task transitions out of the Running state (to Finished or Error), any further attempt to modify its state or messages will result in a panic. This encourages developers to handle task finalization at the outermost calling level, ensuring a clean lifecycle.

Key Components

Progress Interface

The Progress interface is the core unit of state. It manages:

  • Status: A task is always in one of three states: Running, Finished, or Error.
  • Messages: An ordered collection of key-value pairs used to describe stages (e.g., “Step: 1/5”, “Stage: Analyzing traces”). If a message key is reused, the value is updated in place, allowing for dynamic progress bars or counters.
  • Results: An arbitrary data structure containing the final output of the task.
  • URL: A unique endpoint where the UI can poll for updates.
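The interface's behavior can be sketched with a minimal stand-in. Method names follow the description above; the surrounding types (and the exact panic semantics) are assumptions, not the module's real definitions.

```go
package main

import (
	"fmt"
	"sync"
)

// status mirrors the three documented states.
type status string

const (
	running  status = "Running"
	finished status = "Finished"
)

// progressSketch is a minimal stand-in for the Progress interface: ordered
// messages with in-place updates on key reuse, plus a Running->Finished
// transition after which further updates panic.
type progressSketch struct {
	mu       sync.Mutex
	state    status
	keys     []string // preserves message ordering
	messages map[string]string
	results  any
}

func (p *progressSketch) Message(key, value string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.state != running {
		panic("progress: update after finalization")
	}
	if _, ok := p.messages[key]; !ok {
		p.keys = append(p.keys, key)
	}
	p.messages[key] = value // reused keys update in place
}

func (p *progressSketch) Finished(results any) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.state = finished
	p.results = results
}

func main() {
	p := &progressSketch{state: running, messages: map[string]string{}}
	p.Message("Step", "1/5")
	p.Message("Step", "2/5") // same key: updated in place, no new entry
	p.Finished("done")
	fmt.Println(len(p.keys), p.messages["Step"], p.state) // 1 2/5 Finished
}
```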

Tracker

The Tracker acts as a registry and HTTP handler for all active Progress objects. It manages the lifecycle of these objects using an internal Least Recently Used (LRU) cache.

Key responsibilities include:

  • ID Assignment: When a Progress is added to the Tracker, it is assigned a UUID and a corresponding polling URL based on a configured basePath.
  • HTTP Handling: The Tracker provides a standard http.Handler that extracts the task ID from the URL path, retrieves the task from the cache, and serializes its state to JSON.
  • Cleanup: To prevent memory leaks, the Tracker runs a background goroutine that evicts completed or failed tasks from the cache after a set duration (defaulting to 5 minutes).

Workflow Example

The following diagram illustrates the interaction between an HTTP handler, a background worker, and the Tracker.

HTTP Handler              Background Worker              Tracker / UI
------------              -----------------              ------------
1. Create Progress ---->  2. Start Goroutine
3. Add to Tracker  ---->  4. Return JSON (Initial URL) -> UI starts polling
                          |
                          5. update Message() ----------> UI sees "Step 1"
                          |
                          6. update Results()
                          |
                          7. call Finished()  ----------> UI sees "Finished"
                                                          & fetches Results
                          |
                          [ 5 minutes later ] ----------> Tracker evicts task

Implementation Details

  • Concurrency: The standard implementation of Progress uses a sync.Mutex to ensure that concurrent updates from a worker and read requests from the Tracker handler are thread-safe.
  • Serialization: The SerializedProgress struct is designed to be easily consumed by TypeScript frontends, using go2ts compatible tags.
  • Error Handling: When Error(msg) is called, the status is updated, and the error message is automatically stored in the messages list under a reserved Error key.
  • Persistence (Future): While currently memory-backed, the module includes hooks for Redis-based persistence to support progress tracking across horizontal scaling boundaries.

Module: /go/psrefresh

High-Level Overview

The psrefresh module is responsible for maintaining and providing an up-to-date ParamSet for a Perf instance. A ParamSet is a collection of all keys and values (metadata) for all traces stored in the system.

In a high-volume performance monitoring system, querying the underlying database to discover which parameters are available (e.g., “Which benchmarks ran on this specific bot?”) can be expensive. This module solves that by background-loading metadata into memory and optionally caching filtered results to ensure that the user interface remains responsive when users build queries.

Design Decisions and Implementation Choices

Tile-Based Aggregation

The module retrieves metadata by looking at “tiles”—chunks of time-series data. The defaultParamSetRefresher is designed to aggregate metadata from a configurable number of the most recent tiles (typically the two most recent). This ensures that the ParamSet reflects currently active traces while ignoring stale parameters from deleted or very old data.

In-Memory vs. Cached Access

There are two primary ways this module serves data:

  1. Direct Refresher (defaultParamSetRefresher): Keeps the full, global ParamSet in memory. This is updated on a periodic background tick.
  2. Cached Refresher (CachedParamSetRefresher): Wraps the default refresher. It pre-calculates and stores filtered ParamSet results in a cache (like Redis or local memory) based on specific “Level 1” and “Level 2” keys defined in the configuration. This is a performance optimization for UI components that drill down through common hierarchies (e.g., Benchmark -> Bot).

Thread Safety and Reliability

  • Concurrency: The refresher uses a sync.Mutex to protect the in-memory ParamSet during background updates, ensuring that readers never see a partially constructed set.
  • Resilience: During the refresh process (oneStep), the system is designed to be tolerant of failures. While failing to fetch the latest tile results in an error, failures to fetch older supplementary tiles are logged as warnings rather than crashing the refresh cycle, allowing the system to provide “mostly complete” data rather than no data at all.

Key Components and Files

psrefresh.go

Contains the core logic for the defaultParamSetRefresher.

  • OPSProvider Interface: Abstracts the data source (usually a TraceStore). It requires two methods: identifying the latest tile and retrieving a ParamSet for a specific tile.
  • oneStep(): The atomic unit of work that fetches the latest tile ID, iterates backward to collect the requested number of tiles, merges their metadata, and “freezes” the result into a read-only structure.
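The backward merge loop can be sketched as follows. The per-tile lookup, tile numbering, and ParamSet type here are simplified stand-ins; the real code works against the OPSProvider interface and paramtools.

```go
package main

import (
	"errors"
	"fmt"
)

// getParamSet stands in for the OPSProvider's per-tile lookup. Tile 41 is
// deliberately missing to illustrate the "warn and continue" behavior.
func getParamSet(tile int) (map[string][]string, error) {
	data := map[int]map[string][]string{
		42: {"arch": {"arm"}},
		40: {"arch": {"intel"}, "config": {"8888"}},
	}
	ps, ok := data[tile]
	if !ok {
		return nil, errors.New("tile unavailable")
	}
	return ps, nil
}

// mergeRecentTiles mirrors the documented oneStep behavior: the latest tile
// is required, older tiles are best-effort.
func mergeRecentTiles(latest, n int) (map[string][]string, error) {
	merged := map[string][]string{}
	for i := 0; i < n; i++ {
		ps, err := getParamSet(latest - i)
		if err != nil {
			if i == 0 {
				return nil, err // failing the latest tile is fatal
			}
			continue // older tiles: logged as warnings in the real code
		}
		for k, vs := range ps {
			merged[k] = append(merged[k], vs...)
		}
	}
	return merged, nil
}

func main() {
	ps, err := mergeRecentTiles(42, 3)
	fmt.Println(err == nil, len(ps["arch"])) // true 2 — tile 41 was skipped
}
```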

cachedpsrefresh.go

Implements the CachedParamSetRefresher, which adds a caching layer over the standard refresher.

  • Hierarchical Pre-population: It uses PopulateCache() to proactively execute “Preflight” queries. It iterates through values of a primary key (Level 1) and optionally a secondary key (Level 2), storing the resulting ParamSet and trace count in the cache.
  • Smart Query Routing: When GetParamSetForQuery is called, the component checks if the query matches the cached levels (e.g., exactly 1 or 2 specific keys). If it matches, it serves from the cache; otherwise, it falls back to a real-time database query.

Key Workflows

Background Refresh Process

The standard refresher maintains the global state of available parameters.

[ Timer Tick ]
      |
      V
[ oneStep() ] ----------------------> [ TraceStore (OPSProvider) ]
      |                                      |
      | <--- Get Latest Tile ID -------------|
      |                                      |
      | <--- Get ParamSet for Tile N --------|
      | <--- Get ParamSet for Tile N-1 ------|
      |
[ Merge & Normalize ]
      |
[ Lock Mutex ] -> [ Update pf.ps ] -> [ Unlock Mutex ]

Cached Query Workflow

When the UI requests a filtered ParamSet (e.g., selecting a benchmark to see available bots), the cached refresher determines the most efficient data path.

[ Request: GetParamSetForQuery(Query) ]
      |
      |-- If Query has Level1/Level2 keys only? --+
      |                                           |
      | [ YES ]                                [ NO ]
      V                                           V
[ Check Cache ]                         [ Real-time DB Query ]
      |                                           |
      |-- Cache Hit? --+                          |
      |                |                          |
   [ YES ]           [ NO ]                       |
      V                V                          V
[ Return Data ]  [ Fetch from DB ] --------> [ Return Data ]
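The routing decision above reduces to a subset check on the query's keys. A sketch with hypothetical Level 1/Level 2 key names ("benchmark" and "bot"); the real GetParamSetForQuery works on query objects, not string slices.

```go
package main

import "fmt"

// servableFromCache sketches the routing check: the cache can answer only
// when the query uses exactly one or two keys, all of which are configured
// level keys.
func servableFromCache(queryKeys []string, level1, level2 string) bool {
	if len(queryKeys) == 0 || len(queryKeys) > 2 {
		return false
	}
	for _, k := range queryKeys {
		if k != level1 && k != level2 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(servableFromCache([]string{"benchmark"}, "benchmark", "bot"))         // true: cache hit path
	fmt.Println(servableFromCache([]string{"benchmark", "arch"}, "benchmark", "bot")) // false: real-time DB query
}
```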

Module: /go/psrefresh/mocks

High-Level Overview

The go/psrefresh/mocks module provides mock implementations of interfaces used by the ParamSet Refresher (psrefresh) system within the Perf service. Its primary purpose is to facilitate unit testing for components that depend on an OPSProvider (Ordered ParamSet Provider), allowing developers to simulate data retrieval from the underlying storage layer without requiring a live database or complex setup.

Design and Implementation Choices

The module relies on testify/mock and is generated via mockery. This choice ensures that the mocks are strictly typed and consistent with the actual interfaces they represent. By using generated mocks, the project maintains a clear separation between the logic being tested and the data-providing infrastructure.

A key design aspect of these mocks is the abstraction of tile-based data access. In the Perf system, data is organized into “tiles” (chunks of time-series data). The OPSProvider mock allows tests to control exactly what a component perceives as the “latest tile” or what “ParamSet” (a collection of key-value pairs representing trace metadata) exists within a specific tile.

Key Components

OPSProvider.go

This file contains the OPSProvider struct, which mocks the interface responsible for bridging the refresher logic and the actual data store. It manages two primary responsibilities in a test environment:

  • State Simulation: It allows tests to define the current state of the system by mocking GetLatestTile. This is crucial for testing how the refresher reacts when new data is added or when the system is already up to date.
  • Data Injection: Through the GetParamSet mock function, developers can inject specific paramtools.ReadOnlyParamSet objects into the workflow. This allows for fine-grained testing of how the Perf system indexes metadata and how it handles potential errors during data retrieval.

The mock includes a NewOPSProvider constructor that automatically handles test cleanup and expectation assertions, ensuring that tests fail if the code under test does not interact with the provider as expected.

Key Workflows

The typical workflow for using this module involves setting up expectations within a unit test to simulate the lifecycle of a ParamSet refresh operation:

[ Test Setup ]
      |
      V
[ Mock OPSProvider ] <--- Define Return: GetLatestTile (e.g., Tile #500)
      |
      V
[ Mock OPSProvider ] <--- Define Return: GetParamSet(ctx, 500) (e.g., custom ParamSet)
      |
      V
[ Component Under Test ] --- Calls GetLatestTile() ---> [ Mock ]
      |                                                   |
      |<--- Returns Tile #500 ----------------------------|
      |
[ Component Under Test ] --- Calls GetParamSet(500) ---> [ Mock ]
      |                                                   |
      |<--- Returns ParamSet -----------------------------|
      |
      V
[ Assertions ] <--- Verify component processed ParamSet correctly

Module: /go/redis

The go/redis module provides the integration layer between Skia Perf and Google Cloud Memorystore (Redis). Its primary role is to manage the discovery of Redis instances and facilitate data caching to improve the performance of the Perf query UI.

The module bridges two distinct domains: the management of Google Cloud Platform (GCP) resources and the application-level interaction with Redis data structures.

Design and Implementation Choices

The module is designed around the RedisWrapper interface, which abstracts the complexity of GCP infrastructure management. This abstraction allows for clean separation between the logic that locates a database and the logic that consumes it, while also enabling the automated mocking found in the mocks sub-module.

Key design decisions include:

  • Dynamic Instance Discovery: Rather than relying on hardcoded IP addresses or brittle DNS entries, the module uses the GCP Cloud Redis API to list and identify instances. This allows the system to be resilient to infrastructure changes, such as migrating instances or updating service endpoints in a specific project/zone.
  • Asynchronous Refresh Cycle: The implementation utilizes a background goroutine (StartRefreshRoutine) to periodically poll the state of Redis. This ensures that the application has up-to-date metadata about the target Redis instance without blocking the main execution path.
  • Thread-Safe Access: The RedisClient uses a sync.Mutex during cache updates. This prevents race conditions when multiple refresh cycles or concurrent operations attempt to modify the client's internal state or interaction logic simultaneously.

Key Components

RedisClient

The RedisClient is the primary implementation of the RedisWrapper. It acts as a coordinator between three major dependencies: the GCP Cloud Redis client (for infrastructure), the TraceStore (the source of data), and the Redis data client (for caching).

  • Lifecycle Management: Through StartRefreshRoutine, the client manages a ticker-based loop. It searches for a specific Redis instance name (provided via configuration) within a target GCP project and zone.
  • Infrastructure Discovery: The ListRedisInstances method handles the pagination and iteration logic required by the GCP API, converting the stream of instances into a usable slice of redispb.Instance objects.
  • Cache Maintenance: The RefreshCachedQueries method (and its associated workflows) is responsible for the actual data movement. It establishes a connection to the discovered host and port and performs the necessary Redis commands to update the cache. This ensures the Query UI can retrieve pre-computed results instead of performing expensive lookups on the primary TraceStore.

Key Workflows

Redis Discovery and Cache Refresh

The following diagram illustrates how the module moves from a configuration state to an active cached state:

[ StartRefreshRoutine ]
          |
          | (Every refreshPeriod)
          v
[ ListRedisInstances ] <---- Queries GCP Cloud Redis API
          |
          | Filters by config.Instance Name
          v
[ Target Instance Found? ] -- No --> [ Log Error/Wait ]
          |
          Yes (Extract Host/Port)
          v
[ RefreshCachedQueries ]
          |
          | 1. Create redis.NewClient(Host:Port)
          | 2. Lock Mutex
          | 3. Update cache keys (e.g., "FullPS")
          | 4. Unlock Mutex
          v
[ Cache Updated ]

This workflow ensures that even if the Redis instance is recreated or its internal IP changes, the Perf system will automatically rediscover the new endpoint and resume caching operations without requiring a manual restart.

Module: /go/redis/mocks

The go/redis/mocks module provides automated mock implementations of the Redis management interfaces used within the Perf system. Its primary purpose is to enable unit testing of components that interact with Google Cloud Redis instances without requiring a live cloud environment or complex integration setups.

Design and Implementation Choices

The module is built around the RedisWrapper mock, which is generated using mockery. The decision to use generated mocks rather than manual stubs ensures that the testing layer stays in sync with the actual RedisWrapper interface used in production.

The implementation utilizes the testify/mock framework. This allows developers to:

  1. Define expected behaviors: Specify exactly how many times a method should be called and with what arguments.
  2. Control return values: Simulate both successful API responses (such as lists of redispb.Instance objects) and error conditions (such as context timeouts or API failures).
  3. Validate assertions: Automatically verify that the code under test interacted with the Redis management layer as expected during the test cleanup phase.

Key Components

RedisWrapper.go

This file contains the RedisWrapper struct, which mocks the interface responsible for high-level Redis lifecycle and discovery operations. It focuses on two primary responsibilities:

  • Instance Discovery: Through ListRedisInstances, the mock simulates the retrieval of Redis instance metadata from a specific project or region. This is critical for testing logic that needs to dynamically discover Redis endpoints based on cloud configurations.
  • Background Maintenance: The StartRefreshRoutine method mocks the behavior of long-running background processes that handle the periodic refreshing of Redis configurations or connections. In a test environment, this allows callers to verify that the refresh cycle is initiated with the correct duration and configuration parameters (config.InstanceConfig) without actually spawning persistent goroutines.

Mocking Workflow

The typical usage pattern involves initializing the mock within a test and injecting it into the higher-level service that requires a RedisWrapper.

[ Test Case ]
      |
      | 1. Initialize NewRedisWrapper(t)
      v
[ RedisWrapper Mock ] <------- 2. Setup expectations (On("ListRedisInstances").Return(...))
      |
      | 3. Inject mock into Perf Component
      v
[ Component Under Test ] ----> 4. Calls ListRedisInstances()
      |
      | 5. Test ends, mock automatically asserts expectations
      v
[ Assertions Passed/Failed ]

This workflow ensures that components responsible for Perf data storage and caching can be validated in isolation, ensuring that logic governing instance selection and maintenance routines is robust against various infrastructure states.

Module: /go/regression

Regression Module

The regression module is the core analytical engine of Skia Perf. It is responsible for detecting, refining, and persisting performance regressions (shifts in metric data) across the commit history.

High-Level Overview

Performance regressions are identified by comparing metric values before and after a specific commit. The module operates by fetching “frames” of data (a window of commits), applying statistical algorithms to identify clusters of traces that show similar shifts, and then triaging these shifts based on user-defined alert configurations.

The system supports two primary detection methodologies:

  1. K-Means Grouping: Identifies broad shifts affecting many traces by clustering similar performance profiles.
  2. StepFit: Analyzes individual traces to find the exact point where a value shifted significantly, useful for pinpointing regressions in specific sub-components.

Key Design Decisions

Separation of Detection and Refinement

The module splits the lifecycle of a regression into distinct stages:

  • Detection (detector.go): Executes the heavy mathematical lifting (K-Means or StepFit). It is intentionally “greedy,” finding all statistical anomalies within the provided data frame.
  • Refinement (refiner/): A post-processing layer that filters the raw detection results. It applies business logic—such as minimum trace thresholds and directionality (UP/DOWN/BOTH)—to ensure only actionable regressions are reported.

“Domain-Centric” Data Fetching

Instead of scanning every individual commit, the detector uses a “Domain” (a range of commits). It leverages dfiter.DataFrameIterator to efficiently slide a window across the data. For each step in the iteration, the target commit is placed at the center of the window (the “midpoint”), allowing the algorithms to compare a stable “before” baseline against an “after” state.

GroupBy Query Expansion

To prevent small regressions from being “drowned out” by the noise of a large dataset, the module supports GroupBy. If an alert is configured to group by a parameter (e.g., device), the detector doesn't run one large query. Instead, allRequestsFromBaseRequest expands the base request into multiple sub-queries, one for each unique value of that parameter. This ensures high-granularity detection.

Hybrid Storage Strategy

The Store interface supports a transition from legacy to modern schemas:

  • Relational Indexing: Metadata (Commit Number, Alert ID, Triage Status) is stored in relational columns for fast searching and filtering.
  • JSON Payloads: Complex objects like ClusterSummary and the FrameResponse (the actual data points) are stored as JSON blobs. This provides the flexibility to update detection algorithms without requiring database schema migrations.

Key Components

Detection Engine (detector.go, stepfit.go)

The ProcessRegressions function is the entry point for detection. It manages a pool of workers to process multiple alert configurations in parallel.

  • tooMuchMissingData: A critical heuristic that filters out traces with more than 50% missing data on either side of the midpoint. This prevents “gaps” in data from being falsely identified as performance drops.
  • StepFit: Implements individual trace analysis. It looks for a “Turning Point” in the trace and calculates the magnitude of the shift.

The Regression Model (regression.go)

The Regression struct tracks the state of a detected anomaly. A single Regression object can track both a High (regression) and a Low (improvement) cluster for the same commit and alert ID. This allows the UI to present a unified view of all shifts occurring at a specific point in time.

Storage Layers (sqlregressionstore, sqlregression2store)

The module provides different implementations of the Store interface:

  • sqlregression2store: The modern implementation optimized for Spanner. It supports advanced features like NudgeAndResetAnomalies (moving a regression to a more accurate commit) and multi-source bug tracking (Manual, Auto-Triage, and Bisection).
  • migration/: Orchestrates the background movement of data from the legacy V1 store to the V2 store, using a transactional “pull and mark” strategy to ensure no data is lost during the transition.

Key Workflows

Detection and Reporting Workflow

The following diagram shows how a detection request is transformed into a confirmed regression.

[ RegressionDetectionRequest ]
       |
       v
[ detector.ProcessRegressions ]
       |-- allRequestsFromBaseRequest (Expand GroupBy)
       |-- DataFrameIterator (Fetch window of commits)
       |
       +--> [ Algorithm: KMeans or StepFit ]
       |           |
       |           v
       |    [ ClusterSummaries Generated ]
       |
       v
[ refiner.Process ]
       |-- Validate Midpoint Match
       |-- Filter by Direction (UP/DOWN/BOTH)
       |-- Filter by Minimum Trace Count
       |
       v
[ Store.SetHigh / SetLow ]
       |-- SQL UPSERT (Atomic Read-Modify-Write)
       |-- Persist JSON Cluster & Frame

Continuous Orchestration (continuous/)

The continuous submodule acts as the driver for the detection engine. It monitors for new data ingestion (via PubSub) or timer triggers, identifies which Alert configs match the new data, and dispatches them to the detector. It also handles the logic for deduplicating notifications so users aren't alerted multiple times for the same evolving regression.

Module: /go/regression/continuous

Continuous Regression Detection

The continuous module is responsible for the background detection of performance regressions in the Skia Perf ecosystem. It acts as an orchestrator that monitors incoming data and configuration changes to identify shifts in performance metrics without requiring manual user intervention.

High-Level Overview

Continuous regression detection is the “auto-pilot” of Skia Perf. While users can perform ad-hoc analysis in the UI, this module ensures that every commit and every ingested data point is evaluated against predefined Alerts (regression detection configurations).

The module supports two primary operational modes:

  1. Polling (Traditional): Periodically scans the most recent commits (defined by a “radius”) across all active alert configurations.
  2. Event-Driven (Modern): Listens for Google Cloud PubSub events triggered by the ingestion of new files. It identifies which alerts match the incoming trace IDs and runs detection specifically for that new data.

Key Design Decisions

Event-Driven vs. Polling

Historically, regression detection was a heavy background process that scanned many commits for all configurations. The implementation of RunEventDrivenClustering addresses scalability by moving toward an incremental model. By leveraging PubSub notifications from the ingestion pipeline, the system can pinpoint exactly which alerts need to be re-evaluated, reducing the lag between data ingestion and regression notification.

Parallelism and Workload Distribution

Regression detection is computationally expensive, involving trace fetching and clustering algorithms. The module employs a hierarchical worker pool strategy to maintain performance:

  • processAlertConfigsWorkerCount: Distributes different Alert configurations across parallel goroutines.
  • processAlertConfigForTracesWorkerCount: Within a single configuration (especially in event-driven mode), individual traces are processed in parallel.
  • Random Shuffling: In polling mode, alert configurations are shuffled before processing. This ensures that if multiple instances of the service are running, they don't all get stuck processing the same large/slow alert at the same time.

GroupBy and Query Refinement

To prevent “alert fatigue” and ensure precision, the module dynamically refines queries. If an alert has a GroupBy setting (e.g., grouping by device), the continuous detector doesn't just run the generic alert query. Instead, it generates specific sub-queries for each group or trace ID found in the incoming data. This allows the system to detect regressions that might be “smothered” by noise in a larger dataset.

Workflow: Event-Driven Detection

The following diagram illustrates how a new data point travels from ingestion to a potential notification:

[Data Ingestion]
       |
       v
[PubSub Message] -> Received by buildTraceConfigsMapChannelEventDriven
       |
       |-- Decode IngestEvent (contains Trace IDs)
       |-- Match Trace IDs against all Alert Configs
       |
       v
[Config/Trace Map] -> Dispatched to Workers
       |
       |-- Parallel Processing: ProcessAlertConfig
       |     |-- Fetch Dataframe (commits within radius)
       |     |-- Run Clustering / Step Fit
       |
       v
[Regression Found?]
       |
       |-- YES: reportRegressions
       |     |-- Store in Regression Store
       |     |-- Send Notification (Email/Bug/etc.)
       |     |-- Update existing notifications if direction matches
       |
       |-- NO: Continue

Key Components

Continuous Struct

The central coordinator that holds references to data stores (regression.Store, shortcut.Store), the git provider, and the notification system. It maintains the lifecycle of background detection.

continuous.go

Contains the core logic for the detection loops:

  • reportRegressions: Evaluates clustering results. It specifically looks for the “Step Point”—the exact commit where the performance shift occurred—and validates if the magnitude and direction (UP, DOWN, BOTH) meet the Alert's criteria.
  • updateStoreAndNotification: Handles the deduplication of alerts. It checks if a regression for a specific commit and alert ID already exists. If it's new, it triggers the notifier; if it exists but has changed, it updates the existing notification.
  • getQueryWithDefaultsIfNeeded: A utility that merges global instance defaults with specific alert queries, ensuring that filters like stat=value are applied even if omitted in a specific alert configuration.

Detection Requests (ProcessAlertConfig)

This function transforms an alerts.Alert into a regression.RegressionDetectionRequest. It calculates the “Domain” (the range of commits to analyze) based on flags and configures the regression package to execute the heavy lifting of data fetching and mathematical analysis.

Configuration and Flags

The behavior of this module is heavily influenced by the InstanceConfig and FrontendFlags:

  • Radius: Determines how many commits to look at on either side of a potential regression point to establish a baseline and a new state.
  • EventDrivenRegressionDetection: A boolean toggle that switches the entire logic from the polling ticker to the PubSub listener.

Module: /go/regression/migration

This module provides a mechanism for migrating regression data from the legacy regressions table schema to the updated regressions2 table schema. It is designed to facilitate a smooth transition between storage formats while ensuring data integrity and allowing for incremental, background processing.

Overview

The migration is handled by the RegressionMigrator struct, which orchestrates the transfer of data between the legacy sqlregressionstore and the modern sqlregression2store. The primary motivation for this migration is to move toward a more robust schema that includes additional fields and better indexing, as supported by the regressions2 table.

The migrator is designed to be run either as a one-off batch process or as a periodic background task that slowly drains the legacy table without impacting system performance.

Key Components

  • RegressionMigrator: The central coordinator. It holds references to both the legacy and new stores and manages the transactional logic required to ensure that a regression is either fully migrated or not at all.
  • Legacy Store (sqlregressionstore): Used to identify regressions that haven't been migrated yet (via GetRegressionsToMigrate) and to mark them as completed once they are successfully stored in the new schema.
  • New Store (sqlregression2store): Handles the conversion of legacy regression objects into the new format and persists them to the updated database schema.

Migration Logic and Design Decisions

The migration process follows a “pull and mark” strategy to allow for resumes and to prevent data loss.

  1. Batching: The migrator operates in batches (configurable batchSize). This prevents memory exhaustion when dealing with large historical datasets and allows the migration to be interleaved with standard production traffic.
  2. Atomicity: Each regression migration within a batch is wrapped in its own database transaction.
    • Action: Write to regressions2 -> Update migration status in regressions.
    • Reason: If a failure occurs during the migration of a specific record, only that record’s transaction is rolled back. This ensures that the system doesn’t have to re-process an entire batch if only one record fails, and prevents duplicate entries in the new table.
  3. Data Enrichment: Legacy regression objects often lack fields required by the new schema (beyond AlertId and CommitNumber). The sqlregression2store handles the necessary transformations to ensure the data is compatible with the new format before writing.
  4. Handling Updates: The system is designed to handle cases where a legacy regression might be updated (e.g., triaged) after its initial migration. The GetRegressionsToMigrate logic in the legacy store identifies these “stale” records so they can be re-synced to the new store.

Key Workflows

Periodic Migration Process

The migrator can run a background loop that periodically checks for work.

[ Timer Trigger ]
       |
       v
[ RunOneMigration ] --------------------------+
       |                                      |
       v                                      |
[ Fetch Batch from Legacy Store ]             | Timeout
       |                                      | (1 Minute)
       v                                      |
[ For each Regression in Batch ]              |
       |                                      |
       +--> [ Start Transaction ]             |
       |           |                          |
       |           v                          |
       |    [ Write to New Store ]            |
       |           |                          |
       |           v                          |
       |    [ Mark Legacy as Migrated ]       |
       |           |                          |
       |           v                          |
       |    [ Commit Transaction ]            |
       |                                      |
       +--------------------------------------+

Instantiation

To initialize the migrator, the New function sets up the required dependencies, including an AlertConfigProvider. This is necessary because the new regression store requires context regarding alerts that the legacy store did not strictly enforce or link in the same manner.

Module: /go/regression/mocks

The /go/regression/mocks module provides a set of autogenerated mock implementations for the regression storage layer in the Perf system. These mocks are built using mockery and are based on the testify framework, specifically designed to facilitate unit testing of components that interact with regression data without requiring a live database or complex setup.

High-Level Purpose

The primary component in this module is the Store mock. In the production environment, a regression store is responsible for persisting and retrieving performance regression data, handling triage statuses, and linking regressions to bug tracking systems. By providing a mock version of this store, the system allows developers to:

  1. Isolate Business Logic: Test regression detection and notification workflows independently of the underlying PostgreSQL storage implementation.
  2. Simulate Edge Cases: Easily trigger specific return values, such as database errors, empty result sets, or complex nested regression structures, to ensure robust error handling.
  3. Verify State Changes: Assert that specific methods—like SetBugID or TriageHigh—are called with the expected parameters during a test execution.

Key Components and Design

The Store Mock

The Store.go file defines the Store struct, which embeds mock.Mock. It replicates the interface used by the actual regression storage layer, covering a wide range of operations:

  • Data Retrieval: Methods like GetRegression, GetByIDs, and Range allow tests to simulate the retrieval of regression clusters based on commit numbers, alert IDs, or time ranges.
  • Triage and State Management: Methods such as TriageHigh, TriageLow, and SetHigh/SetLow enable the simulation of user actions or automated processes that mark regressions as “triaged” or “ignored”.
  • Bug Integration: Functions like SetBugID and GetBugIdsForRegressions facilitate testing the integration between Perf regressions and external issue trackers.
  • Maintenance Operations: Methods like DeleteByCommit and NudgeAndResetAnomalies support testing cleanup and data migration logic.

Usage Workflow

When writing a test for a service that consumes the regression store (e.g., an alerting service), the standard workflow involves initializing the mock and defining expected behaviors:

  Test Setup phase:
  +-------------------------+      +--------------------------+
  | 1. Create Mock Store    | ---> | 2. Define Expectations   |
  |    (mocks.NewStore)     |      |    (store.On(...).Return)|
  +-------------------------+      +--------------------------+
                                               |
                                               v
  Execution phase:                 +--------------------------+
  +-------------------------+      | 3. Inject Mock into      |
  | 4. Run Business Logic   | <--- |    System Under Test     |
  +-------------------------+      +--------------------------+
             |
             v
  Verification phase:
  +-------------------------+
  | 5. Assert Expectations  |
  |    (Automatic Cleanup)  |
  +-------------------------+

Implementation Decisions

  • Autogeneration: The module relies on mockery. This decision ensures that the mock remains in sync with the actual Store interface defined in the regression package. If a new method is added to the store interface, regenerating the mock prevents compilation errors in tests.
  • Testify Integration: By using github.com/stretchr/testify/mock, the mocks provide a fluent API for setting up return values and verifying calls.
  • Transaction Support: The mock includes support for pgx.Tx (PostgreSQL transactions) in methods like DeleteByCommit, allowing tests to simulate transactional integrity without a real database connection.

Module: /go/regression/refiner

Regression Refiner

The refiner module provides logic to validate and filter potential performance regressions detected by the Skia Perf system. It acts as a post-processing stage that transforms raw detection responses into confirmed regressions by applying specific criteria defined in alert configurations.

High-Level Overview

In the Skia Perf pipeline, regression detection identifies clusters of data points that exhibit significant changes in performance. However, not every statistical anomaly constitutes a regression of interest according to a user's specific alerting rules.

The refiner module implements the regression.RegressionRefiner interface to bridge the gap between “statistical detection” and “actionable alert.” It evaluates detection summaries against alert parameters such as the direction of the change (improvement vs. regression) and the magnitude of the impact (number of traces affected).

Design Decisions

Step-Fit and Directional Validation

The refiner’s primary responsibility is to ensure that a detected cluster aligns with the user’s intent. It uses the stepfit status (HIGH or LOW) to determine the direction of the performance shift.

  • Directional Filtering: Users can configure alerts to trigger on “UP”, “DOWN”, or “BOTH” directions. The refiner maps these preferences to stepfit statuses. For example, if an alert is configured only for “DOWN” (typically representing a performance drop in specific metrics), any “HIGH” step-fit clusters are discarded.
  • Commit Alignment: The refiner ensures that the regression’s “StepPoint” (the point of change) aligns exactly with the midpoint of the data frame’s header. This provides a sanity check that the regression being processed is actually centered on the commit currently under investigation.

Threshold Enforcement

To reduce noise from insignificant or flaky data, the refiner enforces a MinimumNum threshold. This represents the minimum number of keys (traces) that must be part of a cluster for it to be promoted to a “Confirmed Regression.” Clusters failing to meet this count are filtered out of the final summary.

Key Components

DefaultRegressionRefiner

Located in default_regression_refiner.go, this is the standard implementation of the refinement logic. It processes a slice of RegressionDetectionResponse objects and returns a slice of ConfirmedRegression objects.

The refinement workflow follows these steps:

  1. Validation: Checks for nil frames or empty data headers to avoid processing malformed data.
  2. Identification: Determines the target commit number from the midpoint of the data frame.
  3. Filtering: Iterates through all detected clusters and retains only those that:
    • Match the target commit offset.
    • Contain at least the minimum number of traces defined in the Alert config.
    • Match the directionality (UP/DOWN/BOTH) specified in the Alert config.
  4. Reconstruction: Creates a new, filtered ClusterSummary and wraps it in a response if any clusters survived the filtering process.

Key Workflows

Refinement Logic Flow

Raw Detection Responses
          |
          v
+-----------------------------+
| Calculate Midpoint Commit   | <--- Ensures we are looking at the
| (from DataFrame Header)     |      correct point in time.
+-----------------------------+
          |
          v
+-----------------------------+
| Iterate through Clusters    |
|                             |
| 1. Check StepPoint Offset   | ---- Fail ----> [ Discard Cluster ]
| 2. Check Min Trace Count    | ---- Fail ----> [ Discard Cluster ]
| 3. Check Direction Match    | ---- Fail ----> [ Discard Cluster ]
+-----------------------------+
          |
          | Pass
          v
+-----------------------------+
| Build Filtered Summary      |
+-----------------------------+
          |
          v
Confirmed Regressions

Module: /go/regression/regressiontest

Regression Test Utilities

The regressiontest module provides a standardized suite of functional tests for implementations of the regression.Store interface. By centralizing these tests, the project ensures that different storage backends (e.g., SQL-based, memory-based, or Datastore) behave consistently and adhere to the expected contract of the Perf regression system.

Design and Purpose

The primary goal of this module is to enforce a uniform behavior across various regression storage implementations. Instead of duplicating test logic for every new storage driver, developers can import this package and run their implementation against the SubTests suite.

This approach ensures that:

  • Data integrity is maintained during serialization and deserialization of complex types like frame.FrameResponse and clustering2.ClusterSummary.
  • Edge cases, such as range queries where the start and end points are identical, are handled identically across backends.
  • Error conditions, such as triaging a non-existent regression, produce predictable results.

Key Components

Test Suite Orchestration

The module exports a SubTests map, which associates descriptive names with SubTestFunction signatures. This allows implementation-specific test files to iterate over the map and run each test within their own environment (e.g., using a local emulator or a real database instance).

Core Test Logic

The tests within regressiontest.go cover the lifecycle of a regression record:

  • Life Cycle Management: SetLowAndTriage verifies the “happy path” of creating a regression, detecting if it is new versus an update, and updating its triage status.
  • Bulk Operations: Write ensures that multiple regressions can be persisted efficiently in a single operation, while DeleteByCommit verifies the cleanup logic.
  • Querying and Navigation: Range_Exact validates boundary conditions for commit-based lookups, and GetOldestCommit ensures the store can correctly identify the earliest point in its history, which is critical for background cleanup tasks.

Key Workflow: Verifying a New Store Implementation

When a new storage backend is implemented for regressions, it follows this interaction pattern with the regressiontest module:

[ New Store Implementation ]          [ regressiontest Module ]
             |                                    |
             |-- Provides initialized Store ------>|
             |                                    |
             | <---------- Executes SubTests -----|
             |          (SetLow, Range, Write, etc.)
             |                                    |
             |-- Returns Success/Failure -------->|
             |                                    |
[ Validation Complete ]

Implementation Details

The module relies on several key data structures from the Perf domain:

  • regression.Store: The interface being validated.
  • types.CommitNumber: Used as the primary key for organizing regressions.
  • frame.FrameResponse & clustering2.ClusterSummary: These are passed to the store to ensure that implementation-specific serialization (like JSON blobs in a database) correctly preserves the data needed for the UI.

Module: /go/regression/sqlregression2store

High-Level Overview

The sqlregression2store module provides a Spanner-backed implementation of the regression.Store interface. Its primary purpose is to persist and manage performance regressions detected within the Skia Perf system. It handles the storage of statistical metadata, triage states, and the raw data frames that justify a regression's existence.

This module is the “V2” storage layer, designed to be more flexible and descriptive than previous iterations by consolidating alert metadata, subscription links, and multi-source triage information (manual, auto-triage, and auto-bisect) into a unified relational schema.

Design Decisions and Implementation Choices

Algorithm-Aware Storage Logic

A key responsibility of this module is handling the different ways regressions are identified based on the alerting algorithm:

  • K-Means Grouping: Regressions are treated as evolving entities. As more data arrives for a specific <commit, alert> pair, the store updates the existing record with more accurate clustering summaries.
  • StepFit/Individual Grouping: Regressions are often specific to individual traces. Depending on the AllowMultipleRegressionsPerAlertId configuration, the store can either treat the alert as a single entity or allow multiple distinct regression records for the same alert if they involve different traces.

Read-Modify-Write Compatibility

To support the transition from older schemas and maintain data integrity during concurrent updates, the store utilizes a readModifyWriteCompat pattern.

  1. Transactionality: It opens a database transaction.
  2. Lookup: It queries for existing regressions based on the commit number and alert ID.
  3. Callback Logic: It executes a provided closure to modify the regression object (e.g., setting a “High” or “Low” cluster).
  4. Persistence: It writes the results back using an UPSERT (INSERT ... ON CONFLICT) pattern.

Semi-Structured Integration

The store relies heavily on JSONB columns (specifically for frame and cluster_summary). This allows the database to store complex, nested Go structures from the ui/frame and clustering2 packages without requiring a rigid table schema for every statistical detail. This choice prioritizes flexibility in the analysis pipeline over relational normalization for these specific fields.

Triage and Bug Aggregation

The module implements a sophisticated bug-tracking resolution logic in GetBugIdsForRegressions. It doesn't just store a single bug ID; it joins across AnomalyGroups and Culprits tables to provide a comprehensive view of:

  • Manual Triage: Bugs manually linked by users.
  • Auto-Triage: Issues automatically reported by the system.
  • Auto-Bisect: Culprits identified by automated bisection tools.

The store then sorts these bugs by a priority rank (Manual > Auto-Triage > Auto-Bisect) to ensure the most relevant context is presented to the user first.

Key Components

SQLRegression2Store

The main struct implementing regression.Store. It manages a pool of database connections and maintains a cache of prepared SQL statements generated from templates. It also tracks metrics for high/low regression detections.

Statement Templates

Instead of static strings, the module uses Go's text/template to build SQL queries. This allows for dynamic column injection based on the spanner schema definitions, ensuring that the Go code and SQL schema stay in sync regarding field names.

Regression Lifecycle Management

The store provides specialized methods for the operational lifecycle of an anomaly:

  • Nudging: NudgeAndResetAnomalies allows moving a regression's commit range (e.g., if a developer identifies a more accurate culprit range) while resetting its triage status.
  • Triage Transitions: Methods like IgnoreAnomalies and ResetAnomalies provide bulk updates to triage states, transitioning records between untriaged, negative, and ignored.

Key Workflows

Setting a Regression

The workflow for recording a newly detected performance shift:

Detection Logic -> SetHigh/SetLow()
                      |
                      v
             GetAlertConfig() <------- [Check Algo: KMeans vs StepFit]
                      |
                      v
           readModifyWriteCompat()
               (Transaction Start)
                      |
          +-----------+-----------+
          |                       |
    [Existing Match]        [New Regression]
          |                       |
    Apply UpdateFunc        Initialize UUID
          |                Populate Medians
          |                Set PrevCommit
          +-----------+-----------+
                      |
                      v
             writeSingleRegression()
               (UPSERT into DB)
                      |
               (Transaction Commit)

Bug Retrieval Workflow

How the store aggregates different sources of truth for a regression:

Request: GetBugIdsForRegressions(ids)
               |
               v
  1. Load Manual Bug IDs (from Regressions2 table)
  2. JOIN AnomalyGroups (on regression_id) -> Get Auto-Triage IDs
  3. JOIN Culprits (on anomaly_group_id)   -> Get Auto-Bisect IDs
               |
               v
    sortBugs(Manual, AutoTriage, AutoBisect)
               |
               v
    Return enriched Regression objects

Module: /go/regression/sqlregression2store/schema

High-Level Overview

The schema package defines the structured SQL representation for performance regressions within the Skia Perf system. It serves as the source of truth for the Regression2Schema table, which is designed to persist regression data, triage states, and associated metadata.

This module bridges the gap between Go data structures (like clustering2.ClusterSummary and frame.FrameResponse) and the relational database, ensuring that complex performance analysis results are searchable and durable.

Design Decisions and Implementation Choices

Unified Regression Persistence

Unlike earlier iterations of regression storage, this schema focuses on consolidating all aspects of a regression event—its location in time (commits), its statistical significance (medians), and its operational status (triage, bugs, alerts)—into a single relational structure. This allows for complex querying and reporting without needing to join across high-volume telemetry tables.

Semi-Structured Data Storage

The schema uses JSONB for the ClusterSummary and Frame fields. This is a deliberate choice to:

  1. Preserve Context: The full context of the data frame and clustering results used to identify the regression is stored alongside the record.
  2. Schema Flexibility: As the internal structures of clustering2 or frame evolve, the database schema does not require a migration, provided the data remains JSON-serializable.

Temporal and Categorical Organization

Regressions are tracked via two specific commit points: CommitNumber and PrevCommitNumber. This allows the system to define the exact range where a performance shift occurred. Additionally, the inclusion of IsImprovement (boolean) and ClusterType (string) allows the UI and automated tools to quickly filter out noise or focus specifically on regressions vs. improvements.

Optimization via Specialized Indexes

The schema defines several composite and single-column indexes to support common query patterns in the Perf UI and alerting pipelines:

  • Point Lookups: by_commit_alert supports checking if an alert has already fired for a specific commit.
  • Time-Series Tracking: by_sub_name_creation_time is optimized for showing the most recent regressions for a specific subscription (e.g., a “Regression Dashboard” view).
  • Revision History: by_commit_and_prev_commit is tailored for the GetByRevision workflow, allowing the system to quickly retrieve regressions that fall within specific git ranges.

Key Components

Regression2Schema

The primary struct in schema.go. It utilizes Go struct tags to define the DDL (Data Definition Language) for the underlying SQL table.

  • Identity and Origin: Uses a UUID (ID) for global uniqueness and links to the alerting subsystem via AlertID and SubName.
  • Statistical Metadata: Stores MedianBefore and MedianAfter as REAL (float32) values. These are critical for calculating the “magnitude” of a regression without re-processing the raw trace data.
  • Triage State: Contains TriageStatus, TriageMessage, and BugID. These fields represent the human-in-the-loop component of the performance monitoring workflow, tracking whether a regression has been acknowledged or associated with a bug tracker entry.

Regression Data Flow

The following diagram illustrates how the fields in this schema represent the lifecycle of a detected regression:

Discovery Phase           Analysis Phase              Operational Phase
(Detection Logic)        (Statistical Data)          (User Intervention)
-----------------        ------------------          -------------------
   AlertID        ----->    MedianBefore               TriageStatus
   CommitNumber   ----->    MedianAfter      ----->    TriageMessage
   SubName        ----->    ClusterSummary             BugID
   ClusterType    ----->    Frame                      CreationTime

Module: /go/regression/sqlregressionstore

The sqlregressionstore module provides a persistent implementation of the regression.Store interface using a SQL database backend. It is responsible for storing, retrieving, and updating performance regressions detected by the Perf system.

Design and Rationale

The storage strategy is built on a hybrid approach: relational indexing for metadata and JSON serialization for payload.

  1. Identity and Integrity: Regressions are uniquely identified by a composite primary key consisting of commit_number and alert_id. This ensures that for any given commit, a specific alert configuration can only produce one regression record, enforcing data integrity at the database level.
  2. Schema Flexibility: While the identity is relational, the regression details (like cluster summaries and frames) are stored as a JSON blob. This allows the regression.Regression Go struct to evolve—adding or removing fields—without requiring expensive and risky database migrations for every change in the detection algorithms.
  3. Concurrency Control: The store uses a “Read-Modify-Write” pattern wrapped in SQL transactions. This is critical for operations like triaging or updating high/low regression status, as multiple processes might attempt to update the same regression record simultaneously.

Key Components

SQLRegressionStore

Located in sqlregressionstore.go, this is the primary struct implementing the regression.Store interface. It manages the lifecycle of regression data:

  • Persistence Operations: It translates high-level Go requests (like SetHigh, SetLow, or TriageHigh) into SQL statements. It handles the mapping between string-based Alert IDs used in the UI/API and the integer-based Alert IDs used in the database.
  • Atomic Updates (readModifyWrite): This internal method ensures that updates to a regression record are atomic. It begins a transaction, locks the row (if the database supports it), deserializes the JSON, applies a callback function to modify the data, and serializes it back to the database.
  • Batch Queries: The Range method allows for efficient retrieval of all regressions across a span of commits, which is a common requirement for rendering the Perf dashboard or generating reports.

Migration Support

The module includes specific logic to support data evolution. It tracks a migrated status and a regression_id. This allows the system to background-migrate records from this “legacy” store to newer iterations of the regression schema (e.g., regression2) without downtime.

  • GetRegressionsToMigrate retrieves batches of unmigrated records.
  • MarkMigrated updates the record status once it has been successfully moved to the new store.

Data Workflow

The following diagram illustrates how a regression update (e.g., updating a “high” regression) flows through the store:

[ Caller: SetHigh ]
        |
        v
[ SQLRegressionStore.readModifyWrite ]
        |
        |---- 1. BEGIN TRANSACTION
        |---- 2. SELECT regression (JSON) FROM Regressions WHERE commit_number AND alert_id
        |---- 3. JSON Unmarshal -> regression.Regression (Go Struct)
        |---- 4. Execute Callback: Update HighStatus, ClusterSummary, etc.
        |---- 5. JSON Marshal -> Updated JSON string
        |---- 6. UPDATE Regressions SET regression = $1, migrated = false
        |---- 7. COMMIT TRANSACTION
        |
        v
[ Success/Error ]

Implementation Details

  • Dialect Independence: While the tests often run against Spanner, the use of pool.Pool and standard SQL syntax allows the store to be portable across different SQL backends supported by the infrastructure.
  • Metrics: The store automatically tracks perf_regression_store_found counters (partitioned by “high” or “low” direction), providing visibility into the frequency of regression detection and storage activity.
  • Legacy Constraints: Some methods (like GetRegressionsBySubName or GetByIDs) are explicitly left unimplemented in this module. These features are offloaded to the newer regression2 store, reinforcing this module's role as a stable, primary storage for established regression workflows while supporting the transition to more advanced querying capabilities.

Module: /go/regression/sqlregressionstore/schema

The sqlregressionstore/schema module defines the relational database structure used to persist regression data within the Perf system. It serves as the formal bridge between the Go-based regression.Regression objects and their storage representation in SQL.

Design and Rationale

The schema is designed around a composite primary key consisting of commit_number and alert_id. This reflects the operational reality of the Perf system: a regression is uniquely identified by where it happened (the commit) and why it was detected (the specific alert configuration).

By using a composite key instead of a generic auto-incrementing integer, the schema enforces data integrity at the database level, preventing duplicate regression entries for the same alert on the same commit.

Key Components and Responsibilities

RegressionSchema

The primary structure in this module, RegressionSchema, defines the columns for the Regressions table. Its fields reflect a balance between structured querying and flexible data storage:

  • Relational Indexing (CommitNumber, AlertID): These fields are extracted from the regression object to allow the database to perform efficient filtering and joins. Storing the AlertID as a first-class column allows the system to quickly retrieve all regressions associated with a specific detection configuration.
  • Serialized Payload (Regression): Instead of normalizing every possible attribute of a regression (which might change as detection algorithms evolve), the bulk of the regression data is stored as a JSON string. This “schemaless-within-schema” approach provides flexibility for future changes to the regression.Regression Go struct without requiring database migrations.
  • Migration State (Migrated, RegressionId): These fields are specifically included to handle the lifecycle of data evolution. The Migrated boolean and the temporary RegressionId facilitate the movement of records between different iterations of the schema (e.g., transitioning to a “regression2” table) while ensuring no data is lost or duplicated during the transition.

Data Workflow

When a regression is detected or updated, the system maps the high-level Go objects into this schema for persistence:

[ Go Regression Object ]
           |
           | 1. Extract Identity
           v
+------------------------+      2. Serialize Remainder
| commit_number (Key)    | <---------------------------+
| alert_id      (Key)    |                             |
| migrated      (Status) |      +----------------------+
| regression    (JSON)   | <--- | { "low": ...,        |
+------------------------+      |   "high": ...,       |
           |                    |   "frame": ... }     |
           |                    +----------------------+
           v
[ SQL Persistent Storage ]

This architecture ensures that while the database can efficiently index and manage the lifecycle of a regression, the complex details of the detection results remain encapsulated within the JSON blob, maintaining a clean separation between indexing concerns and data representation.

Module: /go/samplestats

samplestats

The samplestats module provides tools for performing statistical analysis on performance metrics. It is designed to compare two sets of samples (typically “before” and “after” a change) to determine if there is a statistically significant difference between them. This is primarily used within the Perf system to detect regressions or improvements in traces.

Overview

The core functionality revolves around taking two maps of trace data and producing a structured analysis. The module handles the heavy lifting of statistical testing, outlier detection, and result ordering, allowing callers to focus on high-level performance trends rather than raw data manipulation.

Design Decisions

  • Non-Parametric vs. Parametric Testing: The module supports both the Mann-Whitney U test (default) and Welch's t-test. The Mann-Whitney U test is favored as the default because it is non-parametric; it does not assume a normal distribution of data, making it more robust for varied performance metrics which often contain noise or non-Gaussian distributions.
  • Significance-Driven Results: By default, the module only reports results where the p-value is below a defined threshold (alpha). This reduces noise for the end user by filtering out fluctuations that are likely due to random chance.
  • Outlier Resilience: Performance data frequently contains “cold start” anomalies or background noise. The implementation provides an optional Interquartile Range (IQR) rule to prune these outliers before running statistical tests, ensuring the mean and standard deviation reflect the “steady state” of the system.

Key Components

Analysis Engine (analyze.go)

The Analyze function is the primary entry point. It correlates “before” and “after” samples based on their Trace ID. For every pair of samples found, it:

  1. Calculates metrics (mean, stddev) for both sets.
  2. Executes the configured statistical test (UTest or TTest).
  3. Calculates the Delta (percentage change in mean) only if the result is statistically significant ($p < \alpha$).

Metrics Calculation (metrics.go)

Before analysis, raw values are transformed into Metrics objects. This step handles:

  • IQR Filtering: If enabled in Config, it calculates the 25th and 75th percentiles and discards values outside $1.5 \times IQR$.
  • Coefficient of Variation: It calculates the Percent field (Standard Deviation / Mean), which helps in understanding the relative volatility of a specific trace.

Sorting and Ordering (sort.go)

Since analysis can involve thousands of traces, the module provides a flexible sorting mechanism. Results can be ordered by Trace Name or by the magnitude of the Delta. It specifically handles NaN values (representing insignificant changes) by grouping them together during the sort process.

Workflow

The following diagram illustrates the data flow from raw samples to a sorted analysis result:

[Raw Samples] -> [IQR Filter (Optional)] -> [Statistical Test] -> [Delta Calculation]
      |                  |                        |                    |
(Before/After)     (Remove Outliers)        (Compare P vs Alpha)   (% Change if P < Alpha)
                                                                       |
                                                                       v
[Final Result] <---------- [Sort Results] <--------- [Collection of Rows]

Implementation Details

  • Config: The Config struct allows users to toggle the statistical test type, set the Alpha threshold (defaulting to 0.05), and enable outlier removal.
  • Row: Each analyzed trace is returned as a Row, containing the calculated Delta, the P value, and the underlying Metrics. If a test fails (e.g., all samples are identical), the error is captured in the Note field rather than crashing the analysis.
  • Dependencies: The module relies on go-moremath/stats for robust implementations of the Mann-Whitney U and Welch's t-test algorithms.

Module: /go/sheriffconfig

Overview

The sheriffconfig module is the management layer for Skia Perf's alerting and subscription system. It provides the mechanism for defining “Sheriff Configurations”—version-controlled rules that dictate how the performance monitoring engine should identify anomalies and which teams should be notified.

This module acts as a bridge between human-readable configuration (stored as code in LUCI Config) and the operational database (Spanner) that drives the Perf alerting engine. It ensures that performance monitoring is “Configuration as Code,” allowing teams to manage their alert thresholds, bug-filing metadata, and trace selections through standard code review processes.

Design Intent: Configuration as Code

The design of sheriffconfig shifts the responsibility of alert management from manual UI interactions to automated, versioned workflows.

  • Auditability: By using LUCI Config as the source of truth, every change to an alert threshold or a subscription's contact list is tracked in Git.
  • Consistency: The module ensures that identical configurations result in identical system behavior by normalizing inputs (e.g., sorting query parameters) before they reach the database.
  • Scalability: A single configuration file can define monitoring for multiple Perf instances. The module handles the distribution of these rules to the correct internal services based on instance-specific filters.

Key Components

1. Schema Definitions (/proto)

The module uses Protocol Buffers to define the structure of a SheriffConfig. This schema decouples the detection intent (e.g., “watch for a 10% shift in memory usage”) from the implementation details of the detection algorithms. It supports a strategy pattern where users can combine different statistical methods (Step) with grouping strategies (Algo).

2. Validation Engine (/validate)

The validation logic acts as a strict gatekeeper. It enforces business rules and structural integrity before any configuration is persisted.

  • Regex Pre-compilation: Any pattern that uses regular expressions (denoted by ~) is compiled during validation. This prevents runtime crashes in the detection engine caused by malformed regex in a config file.
  • Rule Constraints: It enforces specific logic, such as requiring exclusion patterns to be single-keyed, which keeps the backend's query resolution logic predictable and performant.

3. Synchronization Service (/service)

The service layer manages the lifecycle of configuration data. It polls external configuration sources and reconciles them with the internal SubscriptionStore and AlertStore.

  • Revision Awareness: To prevent unnecessary database writes, the service compares the revision of the incoming configuration against the existing state. If the revision matches, the processing is skipped.
  • Rule Expansion: A single AnomalyConfig in a proto can expand into multiple Alert objects in the database. This allows a user to define a single logical subscription that applies to multiple distinct sets of telemetry traces.

Configuration Lifecycle Workflow

The following diagram shows the path a configuration takes from a Git repository to the Perf alerting database:

[ Git Repository ] -> [ LUCI Config ] -> [ sheriffconfig/service ]
                                                |
                                                v
                                      [ sheriffconfig/validate ]
                                       (Check: Names, Regex, Fields)
                                                |
                                                v
[ Spanner DB ] <--- (Atomic Transaction) --- [ Transformation ]
      |                                   (Build Queries, Map Priorities)
      |
      +--> [ Perf Alerting Engine ]
           (Identify Regressions based on stored Alerts)

Design Decisions

URL-Based Trace Selection

Instead of a proprietary query language, the module utilizes standard URL query strings for trace matching.

  • Why: This allows the system to reuse net/url parsers and provides a format that is easily testable and familiar to developers.
  • Implementation: The buildQueryFromRules function in the service layer transforms these human-readable rules into normalized query strings used by the backend database to filter trace data efficiently.

Atomic Updates

When a new configuration file is ingested, the service uses a single database transaction to replace alerts.

  • Why: In a system where an alert is useless without its corresponding subscription (which contains bug-filing info like components and CC lists), partial updates could lead to “orphaned” alerts that trigger but cannot be filed as bugs. The atomic approach ensures the system is always in a consistent state.

Instance Filtering

The service is designed to be “instance-aware.” A single large configuration file might contain subscriptions for v8, chrome, and skia.

  • How: Each sheriffconfigService instance is configured with an instance identifier. It filters the incoming global configuration, only processing and storing the subscriptions that match its assigned instance. This allows for centralized configuration files without leaking cross-project data or overloading specific service instances.

Module: /go/sheriffconfig/proto

Overview

The /go/sheriffconfig/proto module serves as the foundational definition for the Skia Perf alerting and configuration system. It manages the lifecycle of performance monitoring by defining how users describe “what to watch” and “how to react” when performance changes. This module provides the core data structures that bridge the gap between human-readable configurations and the automated backend engines responsible for anomaly detection and issue tracking.

Design and Logic

The architecture is built around a centralized configuration model. Instead of hard-coding detection logic or scattering alert settings across various services, this module consolidates the entire “intent” of a performance sheriff into a structured format.

Strategy-Based Detection

The implementation favors a strategy pattern for anomaly detection. Rather than defining a single detection path, the system allows sheriffs to combine different statistical methods (defined via Step) with different grouping strategies (defined via Algo). This decoupling allows the system to scale from simple “threshold exceeded” alerts to complex “cluster-based” analysis where multiple related traces must shift together to trigger an alert.

Efficient Trace Selection

The selection logic is designed to handle the vast scale of Skia Perf data. By utilizing a rule-based system for trace selection, the module allows for:

  • Logical Inclusion/Exclusion: Using a combination of inclusion queries and exclusion filters to prune noise before the detection algorithms run.
  • Key-Value Filtering: Leveraging the existing Skia trace format to allow sheriffs to target specific bots, benchmarks, or test suites without needing to know the underlying database schema.

Key Components

Data Integrity and Versioning

While the v1 subdirectory contains the active implementation, the root module acts as the container for these definitions. The use of Protocol Buffers ensures that the configuration is both language-agnostic and forward-compatible. This is critical for Skia Perf, where configuration may be stored in Git or a database for long periods while the backend software evolves.

Workflow Orchestration

The module defines the transition from detection to action. The implementation choices here reflect a desire to reduce “alert fatigue”:

  1. Detection: The system identifies a change point based on the AnomalyConfig.
  2. Grouping: Using group_by and Algo settings, the system determines if multiple anomalies should be consolidated into a single report.
  3. Reporting: Based on the Action defined, the system either silently logs the event, creates a manual triage entry, or triggers an automated bisection.

System Workflow

The following diagram illustrates how the components defined in these protos interact to process performance data:

[ Perf Data ] -> [ Rule Matching ] -> [ Detection Engine ] -> [ Action Dispatcher ]
                      |                     |                      |
             (Uses Match/Exclude)     (Uses Algo/Step)       (Uses Action/CC)
                      |                     |                      |
                      v                     v                      v
              Identify relevant      Apply statistical       Create Bug or
              metric traces          analysis window         trigger Bisect

Key Files

  • v1/: This subdirectory contains the versioned definitions of the API. By isolating the versioned protos, the project allows for breaking changes in the configuration schema while maintaining compatibility for existing sheriff configurations.
  • v1/sheriff_config.proto: The definitive source for the data model. It encodes the business logic of how subscriptions, detection rules, and alerting metadata relate to one another.
  • v1/sheriff_config.pb.go: The compiled Go representation of the configuration. This is the primary interface used by the Skia Perf backend to interact with the configuration data.

Module: /go/sheriffconfig/proto/v1

Overview

The go/sheriffconfig/proto/v1 module defines the data structures and serialization format for Skia Perf's anomaly detection and alerting system. It uses Protocol Buffers to specify how performance metrics are selected, how regressions (anomalies) are detected within those metrics, and how the system should respond (e.g., filing bugs or initiating bisections).

This module acts as the contract between the configuration stored in the system and the Perf engine that processes incoming data.

Design and Data Model

The configuration hierarchy is designed to support multi-tenant monitoring where different teams (Sheriffs) can track specific subsets of performance data with customized detection logic.

Configuration Hierarchy

SheriffConfig
  └── [Subscription]
        └── [AnomalyConfig]
              └── Rules (Metric Selection)
  • SheriffConfig: The root object containing all subscriptions for a specific Skia Perf instance (e.g., “chrome-internal”).
  • Subscription: Represents a logical grouping of interest, typically owned by a specific person or team. It defines where alerts go (Buganizer components, CC lists, labels) and what level of urgency they carry (Priority/Severity).
  • AnomalyConfig: Defines the mathematical “how” of detection. It specifies the algorithm, sensitivity thresholds, and grouping logic. A single subscription can contain multiple AnomalyConfig objects to apply different detection logic to different sets of metrics.
  • Rules: The filtering mechanism used to select traces from the Skia database.

Key Components and Implementation Details

Metric Selection (Rules)

Traces are selected using a query-string format: {key1}={value1}&{key2}={value2}.

  • Matching: The match field uses a wildcard-by-default approach. If a key is omitted, it matches everything. It supports regex-style matching (e.g., bot=~lacros-.*-perf).
  • Exclusion: The exclude field allows for fine-grained removal of specific noisy traces.
  • Logic: Multiple match strings are treated as an OR operation, while keys within a single string and exclusion rules are treated as AND operations.

Anomaly Detection (AnomalyConfig)

The module defines several strategies for identifying regressions through the Step and Algo enums:

  • Detection Algorithms (Step): Supports various statistical methods including simple magnitude thresholds (ABSOLUTE_STEP), percentage-based changes (PERCENT_STEP), and advanced statistical tests like COHEN_STEP or MANN_WHITNEY_U.
  • Clustering (Algo): Defines whether traces are analyzed individually (STEPFIT) or grouped together using KMEANS to identify collective shifts in performance across multiple bots or benchmarks.
  • Execution Parameters:
    • radius: Controls the window of commits analyzed around a potential change point.
    • direction: Allows sheriffs to ignore “improvements” (e.g., a speed increase) and only alert on regressions.
    • group_by: A powerful field that allows splitting the analysis across specific keys, ensuring that anomalies are only grouped if they share common attributes.

Alerting and Actionability

The Action enum within AnomalyConfig determines the lifecycle of a detected anomaly:

  • NOACTION: Purely observational; anomalies appear in the UI but trigger no external workflows.
  • TRIAGE: Automates the creation of Buganizer issues using the metadata defined in the parent Subscription.
  • BISECT: The most advanced tier, which triggers automated bisection to find the specific culprit commit behind a regression.

Key Files

  • sheriff_config.proto: The primary source of truth defining the messages and enums. It contains extensive documentation on the expected string formats for rules and the behavior of detection enums.
  • sheriff_config.pb.go: The generated Go code providing the structures used by the Perf service to parse and process configurations.
  • generate.go: Contains the go:generate directives used to keep the Go code in sync with the protobuf definitions.

Module: /go/sheriffconfig/service

The sheriffconfig/service module acts as the synchronization engine between externalized “Sheriff Configurations” stored in LUCI Config and the internal database used by Skia Perf to track subscriptions and trigger alerts. By treating these configurations as code, the service allows teams to manage anomaly detection rules, bug filing metadata, and ownership via version-controlled repositories.

Core Responsibility

The primary role of this service is to fetch, validate, transform, and persist configurations. It bridges the gap between the high-level, human-readable Protobuf definitions (SheriffConfigs) and the low-level SQL structures required by the Perf alerting engine.

Design and Implementation Choices

Revision-Based Synchronization

To minimize database churn and ensure consistency, the service uses a revision-checking mechanism. Before processing a subscription, it queries the subscriptionStore to see if a subscription with the same name and revision already exists.

  • Why: This avoids redundant writes and ensures that if a configuration hasn't changed in the source repository, no updates are pushed to the database. It also facilitates a “point-in-time” history where alerts are tied to specific configuration versions.

Instance Filtering

A single LUCI Config file may contain subscriptions for multiple Perf instances (e.g., “chrome-internal”, “v8”, “skia”).

  • How: The sheriffconfigService is initialized with a specific instance string. During the processConfig phase, it discards any subscription defined in the Protobuf that does not match its assigned instance. This allows centralized management of alerts across a project while maintaining instance-specific execution.

Rule-to-Query Transformation

Sheriff configurations use a rule-based system (match and exclude lists) to define which telemetry traces an anomaly config should monitor.

  • Implementation: The buildQueryFromRules function transforms these rules into URL-style query strings. It handles exclusion logic by prefixing values with !. These queries are then stored in the Alert objects, which the Perf engine uses to filter incoming data.
  • Consistency: Query parts are sorted alphabetically during construction to ensure that identical rules result in identical query strings, preventing duplicate alerts due to key ordering.

Transactional Atomic Updates

When importing a config file, the service wraps the insertion of both subscriptions and alerts (via ReplaceAll) into a single database transaction.

  • Why: Alerts are functionally dependent on their parent subscriptions. Using a transaction ensures that the system never ends up in a state where a new alert exists without its corresponding subscription metadata (like bug components or priority), which would cause failures during the auto-triage or bug-filing process.

Key Components and Workflows

Configuration Import Lifecycle

The service typically runs as a background routine (StartImportRoutine), polling LUCI Config at a defined interval.

[ LUCI Config ] --(Fetch Project Configs)--> [ service.ImportSheriffConfig ]
                                                       |
                                           [ validate.ValidateConfig ]
                                                       |
        +----------------------------------------------+----------------------------------------------+
        |                                              |                                              |
 [ Filter by Instance ]                     [ Transform to Entities ]                      [ Check Revision ]
 (Drop if mismatch)                         (Map Protos to DB Models)                      (Skip if exists)
        |                                              |                                              |
        +----------------------------------------------+----------------------------------------------+
                                                       |
                                           [ DB Transaction (Spanner) ]
                                           |-- Insert Subscriptions
                                           |-- Replace All Alerts
                                           +---------------------------> [ Success/Commit ]

Key Files

  • service.go: Contains the sheriffconfigService implementation. It manages the dependency injection of stores (Alert, Subscription) and the LUCI Config API client. It also defines the mapping constants for algorithm types (e.g., STEPFIT, KMEANS) and action types (e.g., TRIAGE, BISECT).
  • service_test.go: Validates the end-to-end import logic using mocks for the database and external APIs. It specifically tests edge cases such as handling multiple instances in one file and ensuring that invalid configurations are rejected before they touch the database.

Mapping Logic

The service performs significant data translation to bridge the two domains:

  • Priorities and Severities: It maps Protobuf-defined priority/severity levels to the integer values expected by the bug-filing system, applying default values (typically 2) if they are omitted in the configuration.
  • Anomaly Configs: Each AnomalyConfig inside a subscription can generate multiple Alert objects—one for each match rule provided. This expansion allows a single subscription to monitor several distinct sets of traces with different detection parameters (like radius or threshold).

Module: /go/sheriffconfig/validate

Sheriff Config Validation

The validate module provides the logic necessary to ensure the integrity and correctness of Sheriff Configurations used in the Perf tool. It acts as a gatekeeper, verifying that configuration files (typically managed via LUCI Config) adhere to structural and business rules before they are processed by the system.

High-Level Overview

Sheriff configurations define how anomalies (regressions) are assigned to different teams or “subscriptions.” This module takes raw data—usually base64-encoded prototext from an external configuration service—deserializes it into Go protocol buffer objects, and runs a battery of validation checks.

The validation logic is hierarchical, mirroring the structure of the SheriffConfig proto:

  1. Global Level: Checks for overall configuration validity (e.g., uniqueness of subscription names).
  2. Subscription Level: Ensures required metadata like contact emails, bug components, and instances are present.
  3. Anomaly Config Level: Validates the rules used to match specific performance traces.
  4. Pattern Level: Parses and validates the query strings used to identify specific data streams.

Design Decisions

URL Query Format for Patterns

The module uses the standard URL query format (e.g., key1=val1&key2=val2) to define match and exclude patterns.

  • Why: This leverages standard library parsing (net/url.ParseQuery), providing a familiar and robust syntax for users to define trace filters without requiring a custom DSL parser.
  • Regex Support: To support flexible matching, values starting with ~ are treated as regular expressions. The validator explicitly compiles these during the validation phase to catch syntax errors early, preventing runtime failures during actual anomaly matching.

Decoupled Deserialization

The DeserializeProto function specifically handles Base64 decoding followed by Prototext unmarshaling.

  • Why: This design specifically accommodates the LUCI Config API, which returns file content as Base64 strings. By separating deserialization from validation, the module remains flexible enough to validate objects created programmatically (useful for testing) while providing a convenient entry point for production data.

Key Components and Responsibilities

Configuration Validator (validate.go)

This is the core of the module. It implements a top-down validation strategy:

  • ValidateConfig: The entry point for validating a full SheriffConfig. It ensures that the config is not empty and that every subscription has a unique name, which is critical for identifying subscriptions in logs and UI.
  • validateSubscription: Ensures that every subscription is actionable. It mandates a Name, ContactEmail, BugComponent, and Instance. A subscription without these cannot effectively track or report anomalies.
  • validateAnomalyConfig: Focuses on the rules of the subscription. It requires at least one Match pattern, as a configuration that matches nothing is considered a configuration error.
  • validatePattern: The most granular validation step.
    • It ensures match patterns have at least one key-value pair.
    • It enforces a constraint on Exclude patterns: they must only contain a single key. This simplifies the exclusion logic elsewhere in the system, preventing overly complex exclusion rules that are hard to reason about.
    • It validates that all explicit values are non-empty.

Data Flow Process

The typical workflow for a configuration string being processed by this module is:

[ Base64 String ]
       |
       v
[ DeserializeProto ] ---------------------> [ Decode Base64 ]
       |                                           |
       |                                           v
       |                                    [ Unmarshal Prototext ]
       v                                           |
[ *SheriffConfig Proto ] <-------------------------/
       |
       v
[ ValidateConfig ]
       |
       +--> [ validateSubscription ]
                   |
                   +--> [ validateAnomalyConfig ]
                               |
                               +--> [ validatePattern ] (Match)
                               |
                               +--> [ validatePattern ] (Exclude, singleField=true)

Validation Constraints Summary

Level              | Constraint
-------------------+--------------------------------------------------------------
Global             | Subscription names must be unique.
Subscription       | Must contain Name, ContactEmail, BugComponent, and Instance.
Anomaly Config     | Must have at least one Match pattern.
Pattern (Match)    | Must be a valid URL query string with at least one key.
Pattern (Exclude)  | Must have exactly one key.
Pattern (Values)   | Values starting with ~ must be valid Go regular expressions.

Module: /go/shortcut

Overview

The shortcut module provides a unified interface and core logic for managing “shortcuts” within the Perf application. A shortcut is a persistent, shareable identifier that represents a collection of performance trace IDs. Instead of passing around large lists of trace keys in URLs or API requests, the system generates a compact hash-based ID that can be used to retrieve the original set of keys.

Design and Logic

Idempotency and Content-Addressable IDs

A fundamental design choice in this module is the use of content-addressable storage. The ID of a shortcut is not a random UUID or an auto-incrementing integer; instead, it is a deterministic hash of the trace keys it contains.

The IDFromKeys function implements this logic:

  1. Normalization: It sorts the trace keys alphabetically. This ensures that two shortcuts containing the same keys in a different order result in the same ID.
  2. Hashing: It generates an MD5 hash of the sorted keys.
  3. Legacy Compatibility: The resulting hex string is prefixed with an “X”. This prefix is a holdover from previous storage iterations, maintained to ensure that legacy shortcuts remain valid and new shortcuts follow a consistent format.

This approach ensures that identical sets of traces are automatically deduplicated in the underlying storage, as they will always resolve to the same primary key.

The Store Interface

The module defines a Store interface that abstracts the persistence layer. This allows the application to remain agnostic of whether shortcuts are stored in a relational database, an in-memory cache, or a cloud-native solution.

The interface supports:

  • Dual Ingestion: Shortcuts can be inserted either as a structured Shortcut object (InsertShortcut) or directly from an io.Reader (Insert), which is useful for processing JSON payloads from HTTP requests.
  • Streaming Retrieval: The GetAll method returns a channel of shortcuts. This design decision facilitates large-scale data migrations or maintenance tasks without loading the entire shortcut database into memory, preventing OOM (Out-Of-Memory) errors.

Key Components

  • shortcut.go: Defines the core Shortcut data structure (a simple wrapper around a slice of strings) and the Store interface. It contains the logic for ID generation and normalization.
  • mocks/: Provides autogenerated mock implementations of the Store interface. These are used across the Perf codebase to test components that depend on shortcuts (like the dashboard or alerting systems) without requiring a live database.
  • shortcuttest/: A shared compliance suite. Any new implementation of the Store interface (e.g., for a new database backend) uses this suite to verify it correctly handles edge cases, such as key normalization and asynchronous retrieval.
  • sqlshortcutstore/: The primary production implementation of the Store. It maps the Go interface to a SQL backend (PostgreSQL/Spanner), handling the serialization of trace keys into JSON blobs for efficient storage and retrieval.

Shortcut Lifecycle Workflow

The following diagram illustrates how data flows through the module from creation to retrieval:

  Input Keys              shortcut Module                Storage Backend
  ==========              ===============                ===============
      |                          |                              |
      | 1. Create Shortcut       |                              |
      |------------------------> |                              |
      |                          | 2. Sort Keys & Hash          |
      |                          | 3. Generate ID ("X...")      |
      |                          |                              |
      |                          | 4. Persist (ID, Keys)        |
      |                          |----------------------------> |
      | <------- Return ID ------|                              |
      |                          |                              |
      |                          |                              |
      | 5. Get(ID)               |                              |
      |------------------------> |                              |
      |                          | 6. Fetch by ID               |
      |                          |----------------------------> |
      | <---- Return Keys -------|                              |

Usage Context

This module is typically used by the Perf frontend when a user wants to “pin” a specific view of traces or share a link to a complex query. The frontend sends the list of trace IDs to the backend, which uses this module to generate and store the shortcut, returning a short ID that is then embedded in the URL.

Module: /go/shortcut/mocks

The go/shortcut/mocks module provides a set of autogenerated mock implementations for the shortcut package, specifically targeting the Store interface. These mocks are designed to facilitate unit testing of components that depend on persistent shortcut storage without requiring a live database or complex setup.

Design and Purpose

The primary motivation for this module is to decouple the business logic of the Perf application from its storage layer during testing. By using mocks, developers can simulate various database behaviors, such as:

  • Successful retrieval of a shortcut.
  • Handling of non-existent shortcut IDs.
  • Simulating database transaction failures or connection errors.
  • Verifying that the application logic correctly calls storage methods with the expected parameters (e.g., ensuring a shortcut is inserted before it is used).

The mocks are generated using mockery and are based on the testify/mock framework. This allows for a declarative style of testing where expectations are set up at the beginning of a test case.

Key Components

Store.go

This file contains the Store struct, which implements the shortcut.Store interface. It provides mockable versions of all standard CRUD operations required for shortcut management:

  • Retrieval (Get, GetAll): Allows tests to return predefined shortcut objects or channels. GetAll is particularly useful for testing batch processing or migration scripts that iterate over all stored shortcuts.
  • Persistence (Insert, InsertShortcut): Enables testing of how the system handles new shortcut creation. The Insert method handles raw io.Reader input, while InsertShortcut handles structured objects, reflecting the dual ways shortcuts might be ingested.
  • Management (DeleteShortcut): Supports testing of cleanup routines and transaction handling, as it accepts a pgx.Tx parameter to simulate behavior within a database transaction.

Testing Workflow

A typical testing workflow using this module involves initializing the mock, setting expectations, and then injecting the mock into the consumer service.

+-------------------+       +-----------------------+       +-------------------------+
|   Unit Test       |       |     Mock Store        |       |    Consumer Service     |
+---------+---------+       +-----------+-----------+       +------------+------------+
          |                             |                            |
          | 1. NewStore(t)              |                            |
          +---------------------------->|                            |
          |                             |                            |
          | 2. On("Get").Return(...)    |                            |
          +---------------------------->|                            |
          |                             |                            |
          | 3. Call Method Under Test   |                            |
          +-----------------------------|--------------------------->|
          |                             |                            |
          |                             | 4. Get(ctx, id)            |
          |                             |<---------------------------+
          |                             |                            |
          |                             | 5. Return Mock Data        |
          |                             +--------------------------->|
          |                             |                            |
          | 6. AssertExpectations()     |                            |
          +---------------------------->|                            |

The NewStore function simplifies this process by automatically registering cleanup functions that assert all defined expectations were met before the test finishes, reducing boilerplate code in the test suite.

Module: /go/shortcut/shortcuttest

shortcuttest

The shortcuttest module provides a standardized compliance suite for validating implementations of the shortcut.Store interface. By centralizing test logic, the module ensures that different storage backends (e.g., SQL-based, in-memory, or cloud-native) exhibit consistent behavior regarding data persistence, normalization, and error handling.

Design Philosophy

The primary goal of shortcuttest is to enforce the contract of the shortcut.Store interface. A key design decision in the Perf system is that shortcuts—which are collections of keys representing trace sets—should be idempotent and normalized.

The test suite enforces the following behaviors across all implementations:

  • Normalization on Write: When a shortcut is inserted, the store is expected to normalize the data (specifically sorting the keys). This ensures that identical sets of keys result in predictable retrieval, regardless of the input order.
  • Abstract Storage Validation: Tests are written to be agnostic of the underlying database schema or storage medium, focusing strictly on the API surface of the shortcut.Store.
  • Lifecycle Management: The suite covers the full lifecycle of a shortcut: insertion, retrieval by ID, bulk retrieval via channels, and deletion.

Key Components and Workflow

Test Suite Orchestration

The module exports a SubTests map, which associates descriptive names with SubTestFunction signatures. This allows developers implementing a new shortcut.Store to run the entire suite against their implementation using a standard Go sub-test pattern:

Test Runner (External)
      |
      +---- Loop over SubTests ----+
      |                            |
      v                            v
[ InsertGet ]                [ GetAll ]
Verifies ID generation       Validates stream-based
and key normalization.       retrieval via channels.

Core Test Functions

Instead of providing a single monolithic test, the module breaks down requirements into specific functional checks:

  • InsertGet: This function validates both Insert (via io.Reader) and Get. It specifically checks that the shortcut.Shortcut retrieved from the store has its Keys slice sorted alphabetically, even if the input was unsorted. This ensures that the “Shortcut” concept remains a canonical set of trace keys.
  • GetAll: Validates the asynchronous retrieval pattern used for maintenance or migration tasks. It ensures that the store can correctly stream all existing shortcuts into a channel.
  • DeleteShortcut: Confirms that the store correctly handles the removal of data and that subsequent Get calls reflect the deletion.
  • GetNonExistent: Ensures that the store returns an error (rather than crashing or returning an empty object) when queried with a missing or invalid ID.

Implementation Details

The module relies on the testify library to provide clear assertions. Because it is a testing utility, it resides in its own package to avoid introducing testing dependencies (like testify) into the production shortcut package.

When implementing a new store, the developer typically creates a test in their local package that spins up the required infrastructure (like a local SQL instance), creates the store instance, and passes it to the functions defined in shortcuttest.

Module: /go/shortcut/sqlshortcutstore

Overview

The sqlshortcutstore module provides a production-grade implementation of the shortcut.Store interface using a SQL backend (compatible with PostgreSQL and Spanner). It facilitates the persistence, retrieval, and management of “shortcuts”—compact, shareable identifiers that represent collections of performance trace IDs.

This module acts as the concrete bridge between the high-level Perf shortcut logic and the underlying relational database, ensuring that complex query definitions can be saved and referenced by a simple hash-based key.

Design Decisions and Implementation

Content-Addressable Storage

The store utilizes a content-addressable approach for shortcut IDs. When a shortcut is inserted, the ID is generated based on the hash of the trace keys it contains (via shortcut.IDFromKeys).

  • Why: This design naturally handles deduplication. If two users create a shortcut for the exact same set of trace IDs, they will receive the same ID, and the database will perform a “no-op” on conflict rather than creating redundant rows.

JSON Serialization

While the backend is a SQL database, the trace IDs themselves are stored as a single JSON-encoded string in a TEXT column.

  • How: Before execution of the INSERT statement, the shortcut.Shortcut Go struct is marshaled into a JSON blob. Upon retrieval, this blob is unmarshaled back into the struct.
  • Decision Rationale: Storing the list of IDs as an opaque blob avoids the overhead of managing a separate many-to-many relationship table. Since the application always consumes the shortcut as a complete list, fetching a single row with a JSON blob is significantly more performant than performing multiple joins or row lookups for potentially thousands of trace IDs.

Streaming Retrieval

The GetAll method returns a Go channel (<-chan *shortcut.Shortcut) rather than a slice.

  • Why: Given that the number of shortcuts in a system can grow quite large, loading all shortcuts into memory at once could lead to memory exhaustion. The streaming approach allows the caller to process shortcuts one by one as they are read from the database cursor.

Key Components and Responsibilities

SQLShortcutStore

Located in sqlshortcutstore.go, this is the primary struct implementing the storage logic. It encapsulates a pool.Pool to communicate with the database.

  • Input Validation: Before persisting a shortcut, the store validates that the trace keys within it conform to the expected query format. This prevents malformed data from polluting the database.
  • Transaction Support: The DeleteShortcut method optionally accepts a pgx.Tx (transaction) object. This allows deletion operations to be part of a larger atomic unit of work, which is useful when cleaning up related resources.

SQL Statement Management

The module uses a central statements map to define its SQL queries. This separates the SQL syntax from the Go logic, making the code easier to maintain and ensuring that queries like ON CONFLICT (id) DO NOTHING are handled consistently.

Data Workflow

The following diagram demonstrates the lifecycle of a shortcut being stored and retrieved:

  Application Code          SQLShortcutStore                SQL Database
  ================          ================                ============
         |                         |                              |
  1. Insert(Reader) ------> [ Decode JSON ]                       |
         |                  [ Validate Keys]                      |
         |                  [ Generate ID  ]                      |
         |                  [ Encode JSON  ]                      |
         |                         |--- INSERT (id, blob) ------> |
         | <----- Return ID ------ |       (ON CONFLICT IGNORE)   |
         |                         |                              |
         |                         |                              |
  2. Get(ID) -------------> [ Query Row ]                         |
         |                         | <------- SELECT blob ------- |
         |                  [ Decode JSON ]                       |
         | <--- Return Struct ---- |                              |

Testing and Schema

  • Persistence Schema: The structural contract for the table is defined in the schema sub-module. It defines the Shortcuts table with id as the primary key and trace_ids for the data payload.
  • Integration Testing: sqlshortcutstore_test.go leverages sqltest to spin up ephemeral Spanner instances, ensuring the store is tested against real database engines rather than mocks. It runs the standard suite of shortcut tests defined in shortcuttest to ensure interface compliance.

Module: /go/shortcut/sqlshortcutstore/schema

Overview

The schema module defines the structural contract for persisting performance trace shortcuts in a SQL database. A shortcut in this context is a persistent mapping between a unique identifier and a collection of Trace IDs, allowing users to reference complex sets of performance data via a compact, shareable key.

Design Decisions and Implementation

The schema is intentionally kept minimal, prioritizing serialization flexibility and retrieval speed over database-level normalization.

Key Component: ShortcutSchema

The ShortcutSchema struct serves as the single source of truth for the database table structure. Its design reflects two primary requirements:

  • Immutable Identification: The ID field is defined as a TEXT UNIQUE NOT NULL PRIMARY KEY. This ensures that every shortcut has a permanent, collision-free anchor. The use of a string-based ID (typically a hash) allows the ID itself to be a representation of the content it points to, facilitating deduplication before insertion.
  • JSON-Backed Storage: The TraceIDs field is stored as a TEXT column intended to hold a serialized shortcut.Shortcut JSON object.

Why JSON over a Normalized Table?

The decision to store trace IDs as a serialized JSON blob rather than in a relational junction table (e.g., a many-to-many mapping of shortcut_id to trace_id) was driven by the access patterns of the Perf system:

  1. Atomicity: Shortcuts are retrieved and used as a single unit. There is rarely a need to query “which shortcuts contain this specific Trace ID” from the database level; instead, the system always fetches the full list of IDs associated with a specific shortcut key.
  2. Performance: Reading a single text blob is significantly faster and requires less overhead than performing joins or multiple row lookups for shortcuts that may contain thousands of individual Trace IDs.
  3. Schema Stability: By treating the TraceIDs as an opaque JSON blob at the database layer, the internal structure of the shortcut.Shortcut Go struct can evolve without requiring a database migration.

Data Workflow

The following diagram illustrates how the schema facilitates the lifecycle of a shortcut:

Application Layer             Schema Layer (SQL)             Database Storage
=================             ==================             ================
1. Create Shortcut  ------>  [ ID (Hash)      ]  ------>  INSERT INTO Shortcuts
   (List of IDs)             [ TraceIDs (JSON)]           (id, trace_ids)
                                     |
                                     v
2. Request Shortcut <------  [ ID (Primary Key)]  <------  SELECT trace_ids
   (via ID)                  [ TraceIDs (JSON)]           WHERE id = ?
                                     |
                                     v
3. Deserialize JSON --------> Result: List of IDs

Key Responsibilities

  • schema.go: Defines the ShortcutSchema struct. This file is the authoritative reference for SQL migration tools and ORM-like mappers used elsewhere in the sqlshortcutstore parent module. It ensures that the Go representation of a shortcut's persistence layer remains synchronized with the actual SQL table constraints (e.g., PRIMARY KEY, UNIQUE).

Module: /go/sql

Perf SQL Module

The /go/sql module is the central authority for the database schema within the Skia Perf application. It implements a “Schema-as-Code” methodology, where Go struct definitions serve as the single source of truth for the underlying Google Cloud Spanner database structure.

High-Level Overview

In a high-throughput performance monitoring system, database consistency across distributed components (ingesters, frontends, and maintenance tasks) is paramount. This module provides a unified interface for defining, generating, migrating, and testing the database schema.

Instead of manually maintaining DDL (Data Definition Language) files, developers modify Go structs. The module then provides tooling to project these definitions into SQL strings, Go constants for type-safe querying, and serialized JSON files used for environment validation.

Design Philosophy: Go-First Schema Management

The architecture is built on the principle that the application code should dictate the database structure, not the other way around.

  • Type Safety: By generating Go constants for table and column names, the module eliminates “stringly-typed” database interactions, catching typos at compile-time rather than runtime.
  • Version Safety (N-1 Compatibility): The module supports a “previous” vs. “next” schema strategy. This allows the system to remain operational during rolling deployments where some service instances might be running the old code while others run the new code, provided the database matches one of the two known states.
  • Automated Lifecycle: The module automates tedious database tasks such as setting up Time-To-Live (TTL) policies for telemetry data while exempting configuration tables (like Alerts or Favorites) to ensure persistence.

Key Components

Schema Definition (tables.go)

The Tables struct in tables.go acts as the master registry. It aggregates schema definitions from various sub-packages across the Perf project (e.g., alerts, regressions, trace stores). This centralized struct is used by reflection-based tools to understand the entire database landscape.

Schema Generation (tosql and exportschema)

These sub-modules transform Go code into deployable artifacts:

  • tosql: A CLI tool that parses the Go structs and generates go/sql/spanner/schema_spanner.go. This generated file contains the raw SQL DDL strings and Go slices of column names used by the application at runtime.
  • exportschema: A utility that serializes the schema into a standardized schema.Description (JSON). This artifact is used for comparing the “expected” state against the “live” state of a production database.

Evolution and Migration (expectedschema)

This component manages the transition of the database over time. It embeds the expected JSON descriptions into the binary and provides the ValidateAndMigrateNewSchema logic.

  • It handles Static Migrations via manual DDL scripts (FromLiveToNextSpanner).
  • It handles Dynamic Schema Updates for the TraceParams table. Since the keys in performance data change as new benchmarks are added, this module dynamically adds or drops generated columns and indexes in Spanner to maintain query performance.

Data Schema Workflow

The following diagram illustrates the lifecycle of a schema change:

[ Developer ]
      |
      v
[ Modify Go Structs ] ----> (e.g., add field to TraceValuesSchema)
      |
      v
[ Run 'tosql' ] ----------> (Updates schema_spanner.go constants)
      |
      v
[ Run 'exportschema' ] ---> (Generates schema_spanner.json)
      |
      v
[ Deployment ] -----------> [ Maintenance Task ]
                                   |
                                   v
                        [ Validate & Migrate ]
                                   |
              +--------------------+--------------------+
              |                    |                    |
        (Match Prev?)        (Match Next?)       (Match Neither?)
              |                    |                    |
      [ Run Migration ]      [ Do Nothing ]        [ Panic/Error ]
              |                    |
              +----------+---------+
                         |
                         v
            [ Update Dynamic TraceParams ]
             (Add/Drop Generated Columns)

Testing and Validation

The sqltest sub-module and sql_test.go provide the infrastructure for integration testing.

  • Emulator Integration: Tests run against the Google Cloud Spanner Emulator and PGAdapter, providing a local, high-fidelity PostgreSQL-compatible interface.
  • Isolation: Each test generates a unique, ephemeral database instance to prevent data contamination during parallel execution.
  • Verification: The testing suite ensures that the migration path from a “Live” (production) schema to the “Next” (development) schema is valid and results in the exact structure defined by the Go source code.

Module: /go/sql/expectedschema

Expected Schema Module

The expectedschema module manages the lifecycle and validation of the database schema for Skia Perf. It serves as the authoritative source for what the database structure should look like at any given version of the software, and provides the mechanism to transition the database from a previous state to the current one.

High-Level Overview

In a distributed system where multiple services (frontend, ingesters, maintenance tasks) share the same database, schema synchronization is critical. This module ensures that:

  1. Deployment Safety: Services can verify the database schema matches their expectations upon startup, preventing data corruption or runtime crashes due to missing columns or indexes.
  2. Automated Migration: Schema updates are applied automatically by maintenance tasks during the deployment process.
  3. Dynamic Optimization: Certain parts of the schema (specifically traceparams) are dynamically adjusted based on the actual data flowing through the system to optimize query performance.

Design Philosophy: “Previous” vs “Next”

The module implements an “N-1” compatibility strategy. It tracks two versions of the schema:

  • schema_prev_spanner.json: The schema as it existed in the previous version of the application.
  • schema_spanner.json: The desired “next” schema for the current version.

This approach is chosen because Perf components are not all deployed at the same instant. When a new version is rolled out, the maintenance task upgrades the schema. If the frontend or ingester starts before the migration completes, it checks the schema; if the schema matches neither “prev” nor “next”, the service panics. This ensures that the system only runs against a known, supported database state.

Key Components

Schema Definitions (embed.go)

The module uses Go's embed package to include JSON representations of the Spanner schema directly into the binary. This makes the schema definition portable and easily accessible for comparison against the live database.

  • Load(): Retrieves the current expected schema.
  • LoadPrev(): Retrieves the previous version's schema.

Migration Logic (migrate.go)

This file contains the logic for transitioning the database. It defines two raw SQL strings that must be manually updated by developers whenever a schema change is introduced:

  • FromLiveToNextSpanner: The DDL commands to apply the new change.
  • FromNextToLiveSpanner: The DDL commands to revert the change (primarily used for testing and local development).

The ValidateAndMigrateNewSchema function performs the core logic:

  1. Inspects the live database to get its current description.
  2. Calculates the difference between the live schema and the “prev”/“next” definitions.
  3. If the live schema matches “prev”, it executes the migration to “next”.
  4. If it already matches “next”, it does nothing.
  5. If it matches neither, it returns an error, signaling an inconsistent state.

Dynamic Trace Parameters (traceparams_schema.go)

Unlike static tables, the traceparams table uses Spanner's generated columns and indexes to optimize filtering. Since the keys in performance data (params) change over time, this module dynamically manages these columns.

UpdateTraceParamsSchema performs the following workflow:

  1. Identifies the param keys currently in use in the most recent data tiles.
  2. Compares these keys against the existing generated columns in the traceparams table.
  3. Uses a text template (traceParamsUpdateTemplate) to generate and execute DDL that adds missing columns/indexes and drops obsolete ones.

Migration Workflow

The following diagram illustrates how the maintenance task synchronizes the database during a deployment:

[ Start Maintenance Task ]
           |
           v
[ Fetch Live Schema from DB ] <----------+
           |                             |
           +---- matches Next? --------> [ Success: No action needed ]
           |                             |
           +---- matches Prev? --------> [ Execute FromLiveToNextSpanner ]
           |                             |
           +---- matches neither? ------> [ Error: Inconsistent State ]
           |
           v
[ Update Dynamic TraceParams ]
           |
           +--> [ Get keys from recent tiles ]
           +--> [ Add/Drop Generated Columns ]
           +--> [ Add/Drop Indexes ]
           |
           v
       [ Done ]

Implementation Notes

  • Spanner Focus: While some structures are generic, the current implementation and embedded JSON files are specifically tailored for Google Cloud Spanner.
  • Testing: migrate_spanner_test.go provides a suite to verify that migrations correctly transition a database from the “prev” state to the “next” state and that dynamic column generation works as expected.

Module: /go/sql/exportschema

Overview

The exportschema module provides a command-line utility designed to bridge the gap between Go-defined database schemas and their serialized representations. In the context of the Perf system, it acts as a generator that translates internal Go struct definitions and Spanner schema configurations into a standardized schema.Description format. This serialized output is primarily used for schema verification, migrations, and ensuring consistency across different deployment environments.

Design Philosophy: Schema as Code

The primary motivation for this module is to treat the database schema as a “source of truth” defined within the Go codebase rather than in disparate SQL files. By using a Go-based tool to export the schema:

  • Consistency: It ensures that the actual database structure matches the expectations of the application code.
  • Automation: The serialization process can be integrated into CI/CD pipelines to detect accidental schema changes.
  • Portability: By passing different flags to the tool, the system can generate descriptions tailored to specific database backends (e.g., CockroachDB vs. Spanner) while pulling from the same source definitions.

Implementation Logic

The module is a thin wrapper that orchestrates the extraction of schema metadata. It leverages the generic exportschema_lib to perform the actual serialization while providing the Perf-specific schema definitions as inputs.

Workflow

The following diagram illustrates how the tool transforms internal Go definitions into an external schema description:

+-----------------------+      +-----------------------+
|  perf/go/sql/spanner  |      |     perf/go/sql       |
|  (Schema Definitions) |      | (Table Structs/Tags)  |
+-----------+-----------+      +-----------+-----------+
            |                              |
            |      +----------------+      |
            +----->|  exportschema  |<-----+
                   |     (Main)     |
                   +-------+--------+
                           |
                           v
             +----------------------------+
             | go/sql/schema/exportschema |
             |    (Serialization Engine)  |
             +-------------+--------------+
                           |
                           v
                +----------------------+
                | .json / .sql output  |
                | (schema.Description) |
                +----------------------+

Key Components

  • main.go: This is the entry point. It defines the CLI interface, accepting a -databaseType to determine the target dialect and an -out path for the resulting file. It explicitly imports perf/go/sql/spanner to access the Schema object, which contains the specific table layouts and column types required by the Perf application.
  • Integration with sql.Tables{}: The tool passes an empty instance of the Perf SQL tables to the exporter. This allows the reflection-based serialization engine to inspect the struct tags (such as sql:"...") used throughout the Perf module to understand how Go objects map to database columns.

Responsibility

The module is responsible for:

  1. Selection: Identifying which schema definition (currently hardcoded to spanner.Schema) should be exported.
  2. Configuration: Mapping command-line arguments to the parameters required by the shared schema-export library.
  3. Output Generation: Writing the finalized schema description to the filesystem, which is then typically consumed by automated tests or database initialization scripts.

Module: /go/sql/spanner

Spanner SQL Schema for Perf

The go/sql/spanner module serves as the authoritative source for the Google Cloud Spanner database schema used by the Skia Perf application. It contains the DDL (Data Definition Language) statements required to initialize the database environment and provides Go constants that represent the table structures, ensuring type safety and consistency when interacting with the database.

Design and Implementation Choices

Automated Generation

The primary file, schema_spanner.go, is generated by an external tool (//go/sql/exporter/). This approach ensures that the Spanner schema remains synchronized with the internal Go structures used across the Perf application. Manual edits to this file are discouraged to prevent drift between the application logic and the database state.

Large-Scale Performance Data Management

The schema is optimized for the high-volume time-series data typical of performance monitoring.

  • Bit-Reversed Sequences: Tables like Alerts and SourceFiles use bit_reversed_positive sequences. This is a specific Spanner optimization to prevent hotspots during high-throughput inserts by distributing primary key values across the keyspace.
  • TTL (Time To Live): Most tables include a createdat column and a TTL policy of 1095 days (3 years). This automates data retention and prevents unbounded storage growth for ephemeral performance traces and logs.
  • Trace Storage Strategy: The schema utilizes several tables to handle multidimensional performance data:
    • TraceValues and TraceValues2: Store the actual measurement values associated with a specific trace and commit. TraceValues2 provides more granular dimensions (benchmark, bot, test, subtests) for improved querying.
    • Postings and ParamSets: Facilitate the “inverted index” style search used in Perf, allowing the system to quickly find traces based on key-value pairs (e.g., finding all traces where cpu=arm64).
    • TraceParams: Stores the full set of parameters for a trace ID in a JSONB column, balancing structured searching with flexible metadata storage.

Anomaly and Regression Tracking

The schema defines a sophisticated relationship between performance regressions and their remediation:

  • Regressions & Regressions2: Track detected performance changes at specific commits.
  • AnomalyGroups: Group related regressions together to streamline the triage process.
  • Culprits: Track specific revisions identified as the cause of regressions, including metadata about the host and project.

Key Components

schema_spanner.go

This file contains a single large string constant, Schema, which includes the full set of CREATE TABLE, CREATE INDEX, and CREATE SEQUENCE statements. It also exports slice variables (e.g., var Alerts, var Commits) that list the column names for each table, providing a programmatic way to reference table structures without hardcoding strings in the application logic.

Primary Data Entities

  • Commits: The foundation of the timeline, mapping commit numbers to git hashes and timestamps.
  • Alerts & Subscriptions: Define the configuration for anomaly detection and the notification preferences for different teams.
  • Shortcuts & GraphsShortcuts: Store persistent links to specific views or sets of traces in the Perf UI.

Data Workflow: Trace Ingestion and Querying

The schema supports a workflow where incoming performance data is transformed into searchable traces.

Incoming Data File
      |
      v
[SourceFiles] <------- [Metadata] (Links to external logs)
      |
      +-----> [TraceValues] (Value at Commit X)
      |
      +-----> [TraceParams] (The "What": bot=linux, test=draw)
                   |
                   v
            [Postings] (Inverted index for searching)
            [ParamSets] (Summary of available search terms)

  1. Ingestion: A new file is registered in SourceFiles.
  2. Storage: Values are written to TraceValues or TraceValues2.
  3. Indexing: The trace's parameters are decomposed into Postings and ParamSets, enabling the Perf UI to populate search filters and quickly locate relevant trace_ids.
  4. Detection: Analysis services read from these tables and write findings into Regressions, AnomalyGroups, and Culprits.

Module: /go/sql/sqltest

SQL Test Utility

The sqltest module provides standardized utilities for initializing and managing database instances during unit testing. It is specifically designed to facilitate integration testing against Spanner-compatible PostgreSQL interfaces using local emulators.

Overview and Design Philosophy

Testing database logic requires a consistent, reproducible, and isolated environment. This module automates the orchestration of ephemeral databases to ensure that tests do not interfere with one another and that they run against a schema identical to production.

The implementation relies on two primary architectural choices:

  1. Emulator-Based Testing: Rather than requiring a live Cloud Spanner instance, the module utilizes the Google Cloud Spanner Emulator and the PGAdapter. This allows developers to run tests locally or in CI environments without network overhead or cloud costs.
  2. Schema Enforcement: Tests are executed against a fully initialized schema. The module automatically applies the current production schema (defined in the spanner package) before returning a connection, ensuring that the code under test interacts with the expected table structures.

Key Components and Responsibilities

Database Lifecycle Management

The primary entry point is NewSpannerDBForTests. This function handles the entire lifecycle of a test database:

  • Dependency Verification: It asserts that the necessary emulator processes (Spanner and PGAdapter) are running. If they are missing, the test fails early.
  • Isolation: It generates a unique database name using a provided prefix and a random suffix. This isolation is critical for parallel test execution, preventing cross-test data contamination.
  • Schema Migration: It uses an “eventually” retry logic to apply the SQL schema. This accounts for potential transient delays while the emulator initializes the new database instance.

Connection Wrapping and Safety

The module does not return a raw database driver connection. Instead, it returns a pool.Pool interface wrapped in a timeout validator:

  • Timeout Enforcement: By wrapping the pgxpool with timeout.New, the module ensures that every database operation performed during the test includes a context with a defined timeout. This prevents tests from hanging indefinitely if a deadlock or performance issue occurs in the underlying logic.
  • Interface Abstraction: By returning the pool.Pool interface, it allows the rest of the application to remain agnostic of the underlying driver implementation (PostgreSQL vs. Spanner-via-PGAdapter).

Workflow: Test Database Initialization

The following diagram illustrates the sequence of operations when a test requests a new database connection:

Test Invocation
      |
      v
[ Check Emulators ] ----> (Require Spanner & PGAdapter running)
      |
      v
[ Generate Name ] ------> (Prefix + Random ID)
      |
      v
[ Connect Pool ] -------> (Establish connection to PGAdapter)
      |
      v
[ Apply Schema ] <------- (Loop: Try applying spanner.Schema)
      |                    (until success or 10s timeout)
      v
[ Wrap Connection ] ----> (Inject timeout enforcement wrapper)
      |
      v
  Return Pool

Implementation Details

  • sqltest.go: Contains the logic for connecting to the PostgreSQL-compatible endpoint provided by the emulator. It handles the string formatting for connection strings (e.g., postgresql://root@...) and manages the integration between the pgx library and the project's internal pool abstractions.
  • Naming Constraints: Database names are truncated to 30 characters to comply with emulator and database naming limitations while maintaining enough of the prefix to identify the source test.

Module: /go/sql/tosql

tosql

The tosql module provides a command-line utility designed to maintain a “Go-first” approach to database schema management. It serves as a bridge between high-level Go struct definitions and the concrete SQL schema required by the database engine, specifically targeting Google Cloud Spanner for the Perf application.

Design Philosophy

The primary design goal of this module is to ensure that Go code remains the single source of truth for the database schema. Rather than manually maintaining .sql files and trying to keep Go structs in sync with them, tosql automates the generation of SQL schema strings and column constants directly from Go definitions.

This approach offers several advantages:

  • Compile-time Safety: By generating Go constants for table and column names, the rest of the application can avoid hard-coded strings in queries, reducing the risk of runtime errors due to typos.
  • Documentation and Metadata: Go structs allow for the use of struct tags and docstrings to define database-specific properties (like TTL or primary keys) in a way that is easily readable by developers.
  • Consistency: It ensures that the schema deployed to the database perfectly matches the structures the application expects to serialize and deserialize.

Key Components and Responsibilities

Schema Generation Logic

The module's entry point is main.go. Its responsibility is to orchestrate the conversion process by:

  1. Identifying the source Go structs (located in //perf/go/sql).
  2. Configuring the exporter (from //go/sql/exporter) to translate Go types and tags into Spanner-compatible SQL dialects.
  3. Writing the resulting Go source code—containing the schema string and metadata—to a specific package (e.g., spanner/schema_spanner.go).

Configuration and Policy

The module defines specific transformation policies for the Perf database. A notable implementation choice is the handling of Time To Live (TTL). The generator explicitly excludes certain tables—such as Alerts, Favorites, Subscriptions, and TraceParams—from automated TTL policies. This reflects a design decision to treat configuration and user-created entities as permanent, while allowing raw performance data to be eligible for lifecycle management.

Workflow

The following diagram illustrates how tosql fits into the development lifecycle:

[ Go Structs ]  --> [ tosql ] --> [ Generated Go Code ] --> [ Application ]
 (Source of Truth)      |         (Schema Strings     )      (Type-safe SQL)
                        |         (Column Constants   )
                        v
              [ SQL Exporter Logic ]
                (Spanner Dialect)
                (TTL Exclusions )

  1. Define: A developer modifies a Go struct in the perf/go/sql package to add a new column or table.
  2. Generate: Running the tosql tool triggers the exporter.
  3. Export: The tool parses the structs, applies Spanner-specific conversion rules, and injects the resulting SQL into a generated Go file.
  4. Consume: The Perf application imports the generated package to initialize the database schema or to reference column names in its data access layer.

Module: /go/stepfit

StepFit

The stepfit module provides algorithms for detecting and quantifying “steps” or shifts in time-series data (traces). In the context of performance monitoring, these steps represent regressions (performance degradation) or improvements.

Overview

The core functionality revolves around taking a slice of telemetry data and determining if a significant change in value occurs at a specific point. The module evaluates these changes using several different statistical and heuristic methods, allowing the caller to choose the best detection strategy for their specific data type (e.g., noisy vs. stable benchmarks).

The primary entry point is GetStepFitAtMid, which analyzes a trace centered around a specific index to determine if a step exists at that “turning point.”

Key Concepts

StepFit Structure

The StepFit struct is the result of an analysis. It contains:

  • Status: Categorizes the step as HIGH (step up/potential regression), LOW (step down/improvement), or UNINTERESTING (no significant change).
  • Regression: A calculated score representing the “strength” of the step. Higher absolute values generally indicate more significant changes. The interpretation of this value varies by algorithm.
  • StepSize: The raw difference between the means of the two halves of the trace.
  • TurningPoint: The index in the trace where the step is identified.

Detection Algorithms

The module supports multiple algorithms defined via types.StepDetection:

  • Original Step: Based on a Least Squares Error (LSE) fit of a step function. It normalizes the trace and calculates a regression score as StepSize / LSE. It is effective for identifying clear shifts while accounting for noise.
  • Absolute Step: A simple comparison of the difference between the mean of the first half and the mean of the second half against an absolute threshold.
  • Percent Step: Calculates the step size as a percentage of the mean of the first half. This is useful for benchmarks where relative change is more important than absolute magnitude.
  • Cohen's d: Uses the effect size between two groups. It scales the step size by the pooled standard deviation, making it robust against varying levels of noise in different traces.
  • Mann-Whitney U: A non-parametric test that assesses whether one group tends to have larger values than the other. Here, the Regression value is the p-value of the test, and the Status is determined by whether this p-value meets the “interesting” threshold.
  • Const: A specialized check that looks at a single value at the turning point relative to a threshold, used for specific flagging logic.

Logic Workflow

The following diagram illustrates the general process within GetStepFitAtMid:

Input Trace [x0, x1, ..., xN]
          |
          v
+-----------------------+
|   Pre-processing      | (Normalization or
|                       |  Length Adjustment)
+-----------+-----------+
            |
            v
+-----------+-----------+
| Split Trace at Middle | -> [Left Half] | [Right Half]
+-----------+-----------+
            |
            v
+-----------+-----------+
|  Apply Algorithm      | (Original, Cohen, U-Test, etc.)
|  (Calculate Means,    |
|   StdDev, or Ranks)   |
+-----------+-----------+
            |
            v
+-----------+-----------+
| Calculate Regression  | (Score representing
| and Step Size         |  change magnitude)
+-----------+-----------+
            |
            v
+-----------+-----------+
| Determine Status      | (Compare Regression
|                       |  to Interesting Threshold)
+-----------+-----------+
            |
            v
      Result: StepFit

Implementation Details

Data Normalization

For the OriginalStep algorithm, the module performs normalization using vec32.Norm. This ensures that traces with different scales can be compared using a uniform “interesting” threshold. A stddevThreshold is used to prevent division by zero or extreme amplification of noise in very flat traces.

Handling “Interesting” Thresholds

The interesting parameter passed to GetStepFitAtMid is polymorphic in its meaning depending on the algorithm:

  • For OriginalStep, AbsoluteStep, CohenStep, and PercentStep, a higher interesting value makes the detector less sensitive (requires a larger shift).
  • For MannWhitneyU, where the regression score is a p-value, a lower interesting value (e.g., 0.05) makes the detector less sensitive (requires higher statistical confidence).

Trace Length

The module requires a minimum trace size (defined as 3). For most algorithms, it expects the trace provided to be a window centered on a specific point. When not using the OriginalStep algorithm, the module drops the trace's final element so that the trace has an even (2*N) length and splits into equal halves for the split-at-mid logic.

Module: /go/subscription

High-Level Overview

The subscription module provides the data management layer for “Subscriptions” within the Skia Perf ecosystem. In this context, a Subscription is a configuration object that defines how the system should react when a performance anomaly is detected. It acts as a bridge between the detection of a regression and the filing of an actionable bug report, containing metadata such as target bug components, priority levels, and point-of-contact information.

This module defines the standard Store interface for persisting these configurations and provides the underlying Protocol Buffer definitions that ensure consistency across the backend services.

Design Decisions and Implementation

Versioning and Immutability

A core design principle of the subscription system is revision-based tracking. Subscriptions are not simply overwritten; they are versioned by a combination of their name and a revision (typically a Git hash or unique identifier from the configuration source).

  • Auditability: By treating configurations as immutable records identified by a name/revision pair, the system can provide a full history of how alerting rules for a specific test or component have evolved.
  • Atomic Updates: The storage implementations (specifically the SQL-based ones) follow a pattern of deactivating old records and inserting new ones within a single transaction. This ensures that the detection engine always sees a consistent, “active” snapshot of all subscriptions at any given time.

Separation of Concerns

The module is structured to decouple the schema of a subscription from its persistence and its testing:

  • Schema (Proto): Defines the data model (labels, components, hotlists) needed to integrate with external issue trackers like Buganizer.
  • Persistence (Store): Provides an interface that allows the system to switch between different database backends (e.g., PostgreSQL or Spanner) without changing the business logic that handles regressions.
  • Mocks: Provides high-fidelity mock implementations to allow other Perf modules to test their alerting logic without interacting with a database.

Key Components

Store Interface (store.go)

The Store interface is the primary contract for subscription data access. It supports two main modes of operation:

  • Current-State Access: Methods like GetActiveSubscription and GetAllActiveSubscriptions are used by the live regression detection pipeline to find the most recent rules for filing bugs.
  • Historical Access: GetSubscription(name, revision) allows the system to reference the exact configuration that was in place when a specific anomaly was detected, even if the subscription has since been updated.

Subscription Proto (/proto)

The v1.Subscription message is the source of truth for what constitutes a subscription. It includes:

  • Routing Information: bug_component, bug_cc_emails, and contact_email.
  • Classification Metadata: bug_labels, hotlists, bug_priority, and bug_severity.
  • Logical Ownership: The name field serves as the unique identifier for a specific monitoring rule.

SQL Implementation (/sqlsubscriptionstore)

The standard implementation of the Store interface. It manages the SQL lifecycle of subscription records, handling the translation between Go structs and database rows, and enforcing the “soft-deactivation” logic during updates.

Subscription Lifecycle Workflow

The following diagram illustrates how a subscription moves from a configuration file into the database and is eventually used during an anomaly detection event:

[ Config Source ] ----> [ Subscription Manager ]
(Git/Repo)               |
                         | 1. Parse & Validate
                         v
[ SQL Store ] <--------- [ Store Interface ]
   |                     | 2. InsertSubscriptions(new_set, tx)
   |                     |    - Set old records is_active = false
   |                     |    - Insert new records is_active = true
   v
[ Database ]
   |
   | 3. GetAllActiveSubscriptions()
   v
[ Anomaly Detector ] ----> [ External Issue Tracker ]
                           4. File bug using
                              labels/components
                              from Subscription

Key Files

  • store.go: Defines the Store interface which abstracts the underlying persistence mechanism.
  • proto/v1/subscription.proto: The definitive schema for subscription data, used for both storage and cross-service communication.
  • sqlsubscriptionstore/sqlsubscriptionstore.go: The SQL implementation of the store, containing the logic for versioned updates and retrieval.
  • mocks/Store.go: An autogenerated mock of the Store interface for use in unit tests.

Module: /go/subscription/mocks

High-Level Overview

The subscription/mocks module provides autogenerated mock implementations of the Store interface used in Perf subscription management. This module is designed to facilitate unit testing for components that depend on the subscription storage layer without requiring a live database connection or complex setup.

By utilizing these mocks, developers can simulate various database states, verify that the application logic calls the storage layer with the expected parameters, and test error-handling scenarios in a predictable, isolated environment.

Design Decisions and Implementation

The implementation relies on testify/mock and is generated via the mockery tool. This approach ensures that the mock interface remains synchronized with the actual Store interface defined in the subscription package.

Key design choices include:

  • Decoupling Logic from Persistence: By providing a mock for the Store, the business logic governing subscriptions (such as validation or processing) can be tested independently of the underlying PostgreSQL implementation (facilitated by the pgx dependency).
  • Transaction Support: The mock supports methods that take pgx.Tx as an argument (e.g., InsertSubscriptions), allowing tests to verify transactional logic even within a mocked context.
  • Automatic Assertion: The NewStore constructor automatically registers a cleanup function on the testing object. This ensures that AssertExpectations is called at the end of every test, enforcing that all expected calls were made and preventing “silent” test failures where code logic skips necessary database interactions.

Key Components

Store.go

This is the primary file containing the Store mock struct. It mirrors the capabilities of the real subscription storage engine:

  • Retrieval Methods: It provides mocks for GetActiveSubscription, GetAllActiveSubscriptions, GetAllSubscriptions, and GetSubscription. These allow tests to simulate the presence or absence of specific subscription configurations (represented by v1.Subscription protos).
  • Persistence Methods: The InsertSubscriptions mock enables verification of how the system writes or updates subscription data, including support for bulk operations and database transactions.

Workflow Example: Testing a Subscription Fetcher

The following diagram illustrates how the mock interacts with a consumer (e.g., a Subscription Manager) and a test suite:

+-----------+              +-----------------------+              +--------------+
|   Test    |              |  Subscription Manager |              |  Mock Store  |
+-----------+              +-----------------------+              +--------------+
      |                            |                              |
      | 1. Setup Expectation       |                              |
      |--------------------------->|                              |
      |    (On "GetSubscription")  |                              |
      |                            |                              |
      | 2. Trigger Action          |                              |
      |--------------------------->|                              |
      |                            | 3. Call GetSubscription()    |
      |                            |----------------------------->|
      |                            |                              |
      |                            | 4. Return Mocked Proto/Error |
      |                            |<-----------------------------|
      | 5. Assert Result           |                              |
      |<---------------------------|                              |
      |                            |                              |
      | 6. Automatic Cleanup       |                              |
      |    (AssertExpectations)    |                              |
      |--------------------------->|----------------------------->|

In this workflow, the Mock Store allows the Test to define exactly what the Subscription Manager should receive when it queries for a subscription, ensuring the manager handles the returned data (or error) correctly according to the system's design requirements.
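To make the decoupling concrete, here is a minimal sketch of the pattern in plain Go. It uses a hand-rolled fake instead of the mockery-generated mock, and the `Subscription`, `fakeStore`, and `describeOwner` names are illustrative stand-ins, not the real project types:

```go
package main

import "fmt"

// Subscription is a simplified stand-in for the v1.Subscription proto.
type Subscription struct {
	Name         string
	ContactEmail string
}

// Store mirrors the retrieval half of the subscription Store interface.
type Store interface {
	GetSubscription(name string) (*Subscription, error)
}

// fakeStore is a hand-rolled test double; the real module uses a
// mockery-generated mock with testify expectations instead.
type fakeStore struct {
	subs  map[string]*Subscription
	calls []string // records which names were requested, for assertions
}

func (f *fakeStore) GetSubscription(name string) (*Subscription, error) {
	f.calls = append(f.calls, name)
	if s, ok := f.subs[name]; ok {
		return s, nil
	}
	return nil, fmt.Errorf("subscription %q not found", name)
}

// describeOwner is example business logic under test: it depends only on
// the Store interface, never on a live database.
func describeOwner(s Store, name string) string {
	sub, err := s.GetSubscription(name)
	if err != nil {
		return "unknown owner"
	}
	return sub.ContactEmail
}

func main() {
	store := &fakeStore{subs: map[string]*Subscription{
		"Chrome_Perf": {Name: "Chrome_Perf", ContactEmail: "perf-team@example.com"},
	}}
	fmt.Println(describeOwner(store, "Chrome_Perf")) // the configured owner
	fmt.Println(describeOwner(store, "missing"))     // the error-handling path
	fmt.Println(store.calls)                         // which lookups were made
}
```

The mockery-generated mock adds expectation setup (`On(...)`/`Return(...)`) and automatic `AssertExpectations` on cleanup, but the testing shape is the same: inject a double for `Store`, drive the consumer, assert on calls and results.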

Module: /go/subscription/proto

The go/subscription/proto module defines the foundational data structures used for anomaly notification routing and issue tracking within the Skia Perf ecosystem. This module serves as the contract between the performance analysis engines—which detect regressions—and the reporting services—which notify stakeholders.

Design and Implementation Choices

The design of the proto definitions in this module reflects a transition toward automated, template-based issue management.

  • Service Decoupling: By centralizing the subscription schema here, the system separates what was found (an anomaly) from who should care and how it should be reported. This allows the detection engine to remain agnostic of the underlying bug-tracking system’s complexities.
  • Integration-First Schema: Unlike a generic notification system, the fields are modeled after specific requirements of enterprise issue trackers (e.g., Buganizer). Attributes like bug_component, hotlists, and bug_labels are first-class citizens, ensuring that when an anomaly is detected, the resulting ticket is pre-triaged and routed to the correct engineering queue.
  • Constraint-Driven Configuration: The schema enforces specific data types for priorities and severities, ensuring that configuration-as-code files remain valid and consistent across different performance monitoring domains.

Key Components

The Subscription Schema

Defined in subscription.proto, the Subscription message is the primary data model. It acts as a routing rulebook for performance regressions.

  • Routing Logic: The bug_component and bug_cc_emails fields define the destination of the alert. This ensures that the right team is notified immediately without manual triage.
  • Contextual Metadata: The bug_labels and hotlists fields allow the system to tag issues with relevant metadata (e.g., “Chromium-Perf-Regression” or “Milestone-110”). This is critical for automated dashboards that track the health of specific product releases.
  • Accountability: The contact_email field is mandatory to ensure every subscription has an owner who can be reached if the alerting rules become noisy or obsolete.

Go Binding and Generation

The module includes the generated Go code (subscription.pb.go) to provide a type-safe interface for the Perf backend.

  • Consistency via generate.go: This file encapsulates the logic for invoking the protocol buffer compiler. By including this in the module, the project ensures that the Go structs remain in sync with the proto definitions, preventing runtime errors during the serialization or deserialization of subscription configurations.

Data Flow Workflow

The following diagram demonstrates how the proto definitions facilitate the transition from a detected performance dip to an actionable engineering task:

[ Regression Detector ]
          |
          | (A) Detects significant change in trace
          v
[ Subscription Manager ] <---- [ Proto-based Config Files ]
          |                    (Defines Name, Component, Priority)
          |
          | (B) Matches trace to "Subscription" name
          v
[ Reporting Service ]
          |
          | (C) Maps Proto fields to API Call:
          |     - Labels    -> bug_labels
          |     - Component -> bug_component
          v
[ External Issue Tracker ]

Key Files

  • v1/subscription.proto: The source of truth for the subscription data model. It defines the structure used by both the configuration files and the internal Go services.
  • v1/subscription.pb.go: The auto-generated Go implementation of the proto. It contains the structs and methods used by the Perf service to manipulate and pass subscription data.
  • v1/generate.go: A utility script used to trigger the code generation process, ensuring the Go bindings are updated whenever the proto definition is modified.

Module: /go/subscription/proto/v1

The subscription.proto module defines the schema for anomaly alerting configurations within the Skia Perf ecosystem. Its primary purpose is to decouple the logic of detecting performance regressions from the logic of reporting them. By providing a structured data format, it allows the system to determine exactly how and where to route notifications when an anomaly is identified.

Design and Implementation Choices

The module is centered around the Subscription message, which acts as a template for issue creation. The design follows several key principles:

  • Traceability via Revisions: The inclusion of a revision field indicates that subscriptions are likely managed as “Configuration as Code.” This allows the system to track which version of an internal configuration repository was used to generate or update the subscription, ensuring that changes to alerting rules are auditable.
  • Issue Tracker Integration: Instead of generic notification fields, the schema is specifically tailored to the requirements of modern issue tracking systems (like Buganizer or Monorail). Fields such as bug_component, bug_priority, and bug_severity (constrained to a 0-4 range) ensure that filed bugs are immediately actionable and correctly categorized without manual intervention.
  • Operational Accountability: The contact_email field ensures that every automated alert has a human owner responsible for the subscription's validity, preventing “zombie” alerts that fire into unmonitored components.

Key Components

Subscription Message

The Subscription message is the core entity. It bridges the gap between a detected event and an external tracking system.

  • Identity and Metadata: The name is the unique key used by the Perf service to look up reporting rules. The contact_email identifies the team or individual maintaining the alert.
  • Issue Metadata: bug_labels and hotlists allow for fine-grained filtering within issue trackers, enabling teams to organize anomalies by sub-project or release milestone.
  • Routing and Priority: bug_component defines the destination, while bug_priority and bug_severity define the urgency. The use of repeated strings for bug_cc_emails allows for cross-team visibility on critical regressions.

Generated Go Code

The subscription.pb.go file provides the concrete implementation of these structures for use in Go services. This ensures type safety when the Perf backend processes subscription data retrieved from storage or configuration files.

Workflow Example

The following diagram illustrates how the Subscription proto is utilized during an anomaly event:

[ Perf Detection Engine ]
          |
          | 1. Anomaly Found
          v
[ Subscription Lookup ] <--- Uses "name" to find Subscription proto
          |
          | 2. Extract Bug Metadata (Component, CCs, Labels)
          v
[ Issue Tracker API ] ----> Creates Bug with:
                             - Component: bug_component
                             - CCs: bug_cc_emails
                             - Labels: bug_labels

Source Files

  • subscription.proto: The source of truth definition for the subscription data structure.
  • subscription.pb.go: The compiled Go code used by internal services to handle subscription data.
  • generate.go: Contains the automation logic for regenerating the Go code when the proto definition changes, ensuring consistency between the schema and the implementation.

Module: /go/subscription/sqlsubscriptionstore

The sqlsubscriptionstore module provides a persistent SQL-based implementation of the subscription.Store interface. It is responsible for storing, versioning, and retrieving configurations that define how the Perf system should handle anomalies, specifically focusing on bug filing metadata such as components, labels, and priority.

Design Decisions and Implementation Choices

Atomic Versioning and State Management

The store implements a “deactivate-then-insert” pattern for updates. When new subscriptions are inserted via InsertSubscriptions, the store wraps the operation in a transaction that first marks all existing subscriptions as inactive before inserting the new set as active.

This design choice ensures that:

  1. Consistency: There is always a clear set of “active” configurations used by the monitoring services.
  2. Auditability: Historical configurations are never deleted. By using a compound primary key of (name, revision), the store maintains a full lineage of how a subscription's metadata (like its bug component or CC list) has changed over time, keyed to specific infrastructure Git revisions.
  3. Soft Deactivation: The is_active flag allows the system to distinguish between the current production configuration and historical records without physical data removal.
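The deactivate-then-insert pattern can be sketched in a few lines of Go. The `execer` interface below stands in for the slice of `pgx.Tx` the store actually uses, and the table/column names are illustrative, not the exact production schema:

```go
package main

import "fmt"

// execer abstracts the one transaction capability this sketch needs; in
// production a real pgx.Tx would be passed in by the caller.
type execer interface {
	Exec(sql string, args ...any) error
}

// insertSubscriptions sketches the "deactivate-then-insert" pattern:
// archive everything currently active, then insert the new set as active.
func insertSubscriptions(tx execer, names []string, revision string) error {
	// Step 1: soft-deactivate all currently active rows.
	if err := tx.Exec(`UPDATE Subscriptions SET is_active=false WHERE is_active=true`); err != nil {
		return err
	}
	// Step 2: insert the new configuration set as the active generation.
	for _, n := range names {
		if err := tx.Exec(
			`INSERT INTO Subscriptions (name, revision, is_active) VALUES ($1, $2, true)`,
			n, revision); err != nil {
			return err
		}
	}
	return nil
}

// recordingTx is a trivial fake that records the statements issued.
type recordingTx struct{ stmts []string }

func (r *recordingTx) Exec(sql string, args ...any) error {
	r.stmts = append(r.stmts, sql)
	return nil
}

func main() {
	tx := &recordingTx{}
	_ = insertSubscriptions(tx, []string{"Chrome_Perf", "Android_Perf"}, "a1b2c3d")
	fmt.Println(len(tx.stmts)) // 1 UPDATE + 2 INSERTs = 3
}
```

Because both steps run inside the caller's transaction, a reader never observes a state with zero active subscriptions or two active generations at once.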

Integration with External Issue Trackers

The module is designed to map directly to the requirements of issue tracking systems (like Monorail or Buganizer). Implementation details such as storing BugLabels and Hotlists as string arrays, and BugPriority/BugSeverity as integers, allow the Perf service to programmatically construct bug reports that adhere to specific team triage workflows without needing complex transformation logic at the application layer.

Key Components

SubscriptionStore

Located in sqlsubscriptionstore.go, this is the primary struct implementing the data access logic. It wraps a pool.Pool to interact with the underlying database (typically Spanner or PostgreSQL).

  • Query Management: The store uses a centralized map of SQL statements. This separation of SQL logic from Go code facilitates easier maintenance of the schema-to-struct mapping.
  • Transaction Support: The InsertSubscriptions method explicitly accepts a pgx.Tx (transaction) object. This allows the caller to coordinate subscription updates with other database operations, ensuring that configuration updates are atomic across the system.

Data Schema

The underlying table structure (defined in the schema submodule) enforces the immutability of specific revisions. Fields like bug_cc_emails and contact_email are stored to ensure that the notification engine knows exactly who to alert when an anomaly is detected under a specific subscription's criteria.

Subscription Update Workflow

The following diagram illustrates the process of updating subscriptions within the store, highlighting the transition of active states.

1. Caller starts Transaction (tx)
2. InsertSubscriptions(ctx, new_subs, tx)
      |
      v
   +---------------------------------------+
   | SQL: UPDATE Subscriptions             |
   | SET is_active = false                 |  <-- Archive existing configs
   | WHERE is_active = true                |
   +---------------------------------------+
      |
      v
   +---------------------------------------+
   | SQL: INSERT INTO Subscriptions        |
   | (name, revision, ..., is_active=true) |  <-- Activate new configs
   +---------------------------------------+
      |
      v
3. Caller commits Transaction

Retrieval Modes

The store provides multiple ways to access data based on the caller's context:

  • Point-in-time: GetSubscription(name, revision) retrieves a specific historical version of a config.
  • Current State: GetActiveSubscription(name) or GetAllActiveSubscriptions() retrieves only the configurations currently marked as active, used by the live alerting engine.
  • Historical Audit: GetAllSubscriptions() returns the entire database contents, including inactive versions.

Module: /go/subscription/sqlsubscriptionstore/schema

SQL Subscription Store Schema

The schema module defines the data structure and database layout for storing subscriptions within the Perf system. It serves as the single source of truth for the SQL table definitions used by the sqlsubscriptionstore, ensuring that subscription metadata is persisted consistently and can be queried efficiently.

Design Decisions and Implementation Choices

Immutability via Compound Primary Keys

The schema defines a primary key composed of both name and revision.

PRIMARY KEY(name, revision)

This design choice facilitates versioning and traceability. Instead of overwriting an existing subscription when configurations change, the system records a new entry tied to a specific infra_internal Git hash (revision). This allows the system to:

  • Track the evolution of a subscription over time.
  • Audit which version of a configuration was active when a specific bug was filed.
  • Roll back or reference historical subscription states based on repository history.

Integration with Bug Filing Systems

A significant portion of the schema is dedicated to bug metadata (labels, hotlists, components, priority, and severity). The implementation uses STRING ARRAY types for fields like bug_labels and hotlists to provide flexibility, allowing a single subscription to categorize bugs across multiple workstreams without requiring complex relational mapping tables.

The inclusion of bug_priority and bug_severity as integers (constrained to 0-4) maps directly to standard issue tracking priorities (e.g., P0 through P4), ensuring that the Perf system can programmatically set triage urgency based on the subscription configuration.
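A minimal sketch of that constraint, assuming the straightforward mapping of the 0-4 integers onto P0 through P4 (the helper name is hypothetical):

```go
package main

import "fmt"

// priorityLabel sketches how the 0-4 integer constraint on bug_priority
// (and bug_severity) maps onto standard tracker levels P0 through P4.
func priorityLabel(p int) (string, error) {
	if p < 0 || p > 4 {
		return "", fmt.Errorf("priority %d outside allowed range 0-4", p)
	}
	return fmt.Sprintf("P%d", p), nil
}

func main() {
	l, _ := priorityLabel(2)
	fmt.Println(l) // P2
	if _, err := priorityLabel(7); err != nil {
		fmt.Println("rejected:", err)
	}
}
```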

Key Components and Responsibilities

SubscriptionSchema Struct

Located in schema.go, this struct defines the mapping between Go objects and the SQL database. Its responsibilities include:

  • Identity Management: Manages the Name and Revision fields which uniquely identify the configuration.
  • Notification Routing: Stores BugCCEmails and ContactEmail to ensure that the correct stakeholders are alerted when the subscription triggers.
  • State Management: The IsActive boolean allows for soft-deactivation of subscriptions, enabling users to pause monitoring without deleting the historical configuration or metadata.

Workflow: Subscription Lifecycle

The following diagram illustrates how the schema supports the transition from a configuration defined in code/Git to a persisted database record used for bug filing.

[ Git Revision ] ----> [ Subscription Config ]
      |                        |
      |          (Name + Revision used as Key)
      |                        |
      |                        v
      |            +-----------------------+
      +----------->|  SQL: Subscriptions   |
                   +-----------------------+
                   | name: "Chrome_Perf"   |
                   | revision: "a1b2c3d"   | <--- Ensures auditability
                   | bug_component: 12345  |
                   | is_active: true       |
                   +-----------+-----------+
                               |
                               v
                   +-----------------------+
                   |  Bug Filing Process   |
                   +-----------------------+
                   | CCs: bug_cc_emails    |
                   | Labels: bug_labels    |
                   +-----------------------+

Files

  • schema.go: Contains the SubscriptionSchema struct definition with SQL tags that define the column types and constraints for the underlying database engine.

Module: /go/tracecache

TraceCache Module

The tracecache module provides a specialized caching layer for Perf trace identifiers. It is designed to bridge the gap between high-level user queries and the underlying data tiles, reducing the computational overhead of repeatedly resolving complex queries against the same dataset.

High-Level Overview

In the Perf system, data is organized into “tiles.” When a user executes a query, the system must identify which traces match that query within a specific tile. This resolution process can be expensive, especially for broad queries or large datasets.

TraceCache addresses this by memoizing the results of query resolutions. It maps a combination of a TileNumber and a query.Query to a list of matching paramtools.Params (trace identifiers). This allows the system to bypass the query engine for subsequent requests for the same data, significantly improving performance for dashboard loading and data exploration.

Design Decisions and Implementation

Key Derivation

The cache's efficiency relies on its key generation strategy. The module uses a composite key: [TileNumber]_[QueryString]

  • Tile Granularity: By including the TileNumber in the key, the cache automatically invalidates or isolates results as time progresses and new tiles are created. This ensures that query results are always contextually tied to the specific temporal bucket of data they represent.
  • Query Normalization: The query.Query object is converted to its KeyValueString() representation. This ensures that queries with the same parameters result in the same cache key, maximizing the hit rate.

Serialization

Trace identifiers are stored as JSON blobs within the cache backend. While JSON introduces a small overhead for marshaling and unmarshaling, it provides a stable, human-readable format that simplifies debugging and ensures compatibility regardless of the underlying cache provider (e.g., in-memory, Redis, or Memcache).

Dependency Injection

The TraceCache struct does not implement a caching engine itself. Instead, it wraps an implementation of the cache.Cache interface. This decoupling allows the tracecache module to remain agnostic of the storage backend, enabling the use of local in-memory caches for development and distributed caching systems for production environments.

Key Components and Responsibilities

TraceCache Struct

The primary coordinator of the module. Its responsibilities include:

  • Encapsulation: Managing the interaction with the generic cache.Cache client.
  • Query-to-Key Mapping: Transforming domain-specific objects (TileNumber and Query) into flat string keys.
  • Data Transformation: Handling the serialization of paramtools.Params arrays into JSON and back.

Key Methods

  • CacheTraceIds: Persists the results of a query resolution. It takes the resulting list of trace parameters and stores them against the tile/query key.
  • GetTraceIds: Retrieves cached results. If the key exists, it deserializes the JSON back into a slice of paramtools.Params; if the key is missing (a cache miss), it returns nil, signaling that the caller must perform the query resolution manually.
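The key derivation and JSON serialization described above can be sketched as follows. `Params` stands in for `paramtools.Params`, and the query string is assumed to be pre-normalized (the real module derives it via `query.Query.KeyValueString()`):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Params is a simplified stand-in for paramtools.Params.
type Params map[string]string

// cacheKey sketches the composite key scheme: [TileNumber]_[QueryString].
func cacheKey(tileNumber int, queryString string) string {
	return fmt.Sprintf("%d_%s", tileNumber, queryString)
}

// encodeTraceIDs mirrors how matching trace identifiers are stored as a
// JSON blob in the cache backend.
func encodeTraceIDs(ids []Params) (string, error) {
	b, err := json.Marshal(ids)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	key := cacheKey(42, "benchmark=motion_mark&bot=pixel_6")
	fmt.Println(key)

	blob, _ := encodeTraceIDs([]Params{{"bot": "pixel_6", "unit": "ms"}})
	fmt.Println(blob)

	// On a hit, the blob is unmarshaled back into the Params slice;
	// a miss returns nil, signaling the caller to run the real query.
	var decoded []Params
	_ = json.Unmarshal([]byte(blob), &decoded)
	fmt.Println(decoded[0]["bot"])
}
```

Because the tile number is embedded in the key, entries for old tiles simply stop being requested as time advances; no explicit invalidation pass is required.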

Data Workflow

The typical lifecycle of a trace lookup using this module follows this pattern:

User Query + Tile
       |
       v
[ TraceCache.GetTraceIds ] ----(Key: TileID_Query)----> [ Cache Backend ]
       |                                                    |
       +<-----------( JSON Result / Miss )------------------+
       |
       | If Miss:
       |    1. Execute Query against Tile
       |    2. [ TraceCache.CacheTraceIds ] ----------> [ Cache Backend ]
       |    3. Return Results
       |
       | If Hit:
       |    1. Unmarshal JSON
       |    2. Return Results

Module: /go/tracefilter

The tracefilter module provides a specialized tree-based data structure designed to identify and isolate “leaf” traces within a hierarchical path structure. In the context of performance monitoring and trace management, data often arrives with overlapping prefixes or hierarchical relationships. This module allows for the filtering of redundant parent nodes, ensuring that only the most specific (deepest) traces are processed.

Design Motivation

The primary goal of tracefilter is to resolve hierarchical dependencies between trace paths. When multiple paths are added to the filter, some may be prefixes of others. For example, if both root/cpu/usage and root/cpu/usage/core1 are registered, the latter is a more specific leaf node.

By modeling these paths as a tree, the module can efficiently determine which traces represent actual data endpoints versus those that are merely architectural containers for more granular metrics. This is particularly useful for deduplicating metrics or ensuring that aggregations don't double-count data that exists at multiple levels of a hierarchy.

Key Components and Logic

The Tree Structure (TraceFilter)

The core of the module is the TraceFilter struct, which functions as a recursive node in a prefix tree (trie). Each node stores:

  • A value: The specific path segment string (e.g., “p1”).
  • A traceKey: An identifier associated with that specific path.
  • children: A map of sub-paths to nested TraceFilter nodes.

Path Integration (AddPath)

The AddPath method builds the tree incrementally. It accepts a slice of strings representing the hierarchy and a traceKey. As paths are added, the module creates the necessary branch nodes. If a path is added that extends an existing branch, the tree grows deeper.

Leaf Node Resolution (GetLeafNodeTraceKeys)

This is the central logic of the module. It performs a recursive depth-first search to find nodes that have no children.

The implementation logic follows a “specificity wins” rule:

  1. If a node has children, it is considered a “parent” or “container” node. Its own traceKey is ignored, and the search continues into its children.
  2. If a node has no children, it is a “leaf.” Its traceKey is collected and returned.

This ensures that if a parent key is added and later a child of that parent is added, only the child's key (the more specific one) will be returned in the final result set.

Workflow Example

Consider a scenario where various metrics are registered. The tree filters out the intermediate “p2” and “p3” keys because more specific children exist.

Input Paths:
1. ["root", "p1", "p2"]           Key: "key_parent"
2. ["root", "p1", "p2", "p3"]     Key: "key_intermediate"
3. ["root", "p1", "p2", "p3", "t1"] Key: "key_leaf_A"
4. ["root", "p1", "p2", "p4"]     Key: "key_leaf_B"

Tree Construction:
root
 └── p1
      └── p2 (key_parent)
           ├── p3 (key_intermediate)
               └── t1 (key_leaf_A)  <-- Leaf
           └── p4 (key_leaf_B)       <-- Leaf

Resulting Leaf Keys:
["key_leaf_A", "key_leaf_B"]

In this example, “key_parent” and “key_intermediate” are discarded by GetLeafNodeTraceKeys because the filter assumes that the presence of deeper nodes makes the higher-level nodes redundant for the specific filtering task.

Module: /go/tracesetbuilder

TraceSetBuilder

The tracesetbuilder module provides a high-performance, concurrent mechanism for aggregating disparate trace data fragments into a unified TraceSet and a corresponding ParamSet. This is primarily used in Perf to consolidate data fetched from multiple storage tiles or shards into a single contiguous representation suitable for visualization or analysis.

Overview

In the Perf system, performance data is often stored and retrieved in chunks (tiles). When a user requests data over a large time range, the system must fetch multiple tiles and stitch them together. TraceSetBuilder manages this stitching process efficiently.

The design prioritizes performance and thread safety by using a “sharded worker” architecture. Instead of protecting a shared result set with a global mutex—which would cause significant contention when processing thousands of traces—the builder distributes the work across a pool of independent worker routines.

Key Workflows and Design Decisions

The builder uses a pipeline pattern to process incoming trace data:

  1. Input Sharding: When Add() is called, the builder iterates over the provided traces. It calculates a CRC32 checksum of each trace key to determine which worker should handle that specific trace.
  2. Lock-Free Concurrency: By routing all data for a specific trace ID to the same worker, the system ensures that no two workers ever attempt to modify the same trace simultaneously. This allows each worker to maintain its own local TraceSet and ParamSet without any internal locking.
  3. Mapping Logic: The builder translates sparse data (from tiles) into a dense output array. It uses a mapping of CommitNumber to an output index, allowing it to place data points at the correct temporal position regardless of the order in which tiles are processed.
  4. Final Aggregation: When Build() is invoked, the builder waits for all workers to finish their queues and then merges the independent results from each worker into a final consolidated set.

Add(traces)          Worker 1 (Keys A, D)         Build()
    |                   +-----------+                |
    |--- Hash(A) ------>| TraceSet1 |--- Merged -----|
    |--- Hash(D) ------>+-----------+                |
    |                                                |--> Final TraceSet
    |                Worker 2 (Keys B, C)            |--> Final ParamSet
    |--- Hash(B) ------>+-----------+                |
    `--- Hash(C) ------>| TraceSet2 |--- Merged -----'
                        +-----------+
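The input-sharding step (1) reduces to a checksum and a modulo. This sketch assumes the IEEE CRC32 polynomial; the helper name `workerFor` is illustrative:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const numWorkers = 64 // matches the worker-pool size described above

// workerFor sketches the input-sharding step: a CRC32 checksum of the
// trace key, modulo the pool size, picks the worker. The same key always
// lands on the same worker, so no two workers ever touch the same trace.
func workerFor(traceKey string) uint32 {
	return crc32.ChecksumIEEE([]byte(traceKey)) % numWorkers
}

func main() {
	keyA := ",arch=x86,config=8888,"
	keyB := ",arch=arm64,config=565,"
	// Repeated calls with the same key are deterministic.
	fmt.Println(workerFor(keyA) == workerFor(keyA)) // true
	fmt.Println(workerFor(keyA), workerFor(keyB))
}
```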

Key Components

TraceSetBuilder (tracesetbuilder.go)

The primary coordinator. It initializes a pool of 64 workers (defined by numWorkers) and a sync.WaitGroup to track pending work. It is designed for a single lifecycle: you Add() data, Build() the result, and then Close() the builder. It cannot be reused after Build() is called.

mergeWorker (tracesetbuilder.go)

Internal workers that maintain their own state. Each worker listens on a buffered channel for request objects.

  • Trace Merging: If a worker receives a trace key it hasn't seen before, it initializes a new types.Trace filled with sentinel values (missing data). It then populates the trace at specific indices based on the commit mapping provided in the request.
  • ParamSet Tracking: Each worker updates its own paramtools.ParamSet to reflect the dimensions and values present in the traces it has processed.

The Request Structure

The request object is the unit of work passed to workers. It contains:

  • The raw trace data.
  • The parsed Params (to avoid redundant parsing in the workers).
  • A commitNumberToOutputIndex map, which defines exactly where each data point in the input should land in the final output trace.
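The mapping step can be sketched as below. The `missing` constant is a stand-in for whatever sentinel the real code uses to mark absent data points, and `placeValues` is a hypothetical helper, not the actual worker method:

```go
package main

import "fmt"

// missing stands in for the sentinel value used for absent data points.
const missing = float32(1e32)

// placeValues sketches the mapping step: sparse (commitNumber, value)
// pairs from a tile are written into a dense output trace using a
// commitNumber-to-output-index map, so the order in which tiles arrive
// does not matter.
func placeValues(size int, commitToIndex map[int]int, points map[int]float32) []float32 {
	trace := make([]float32, size)
	for i := range trace {
		trace[i] = missing // new traces start fully "missing"
	}
	for commit, v := range points {
		if idx, ok := commitToIndex[commit]; ok {
			trace[idx] = v
		}
	}
	return trace
}

func main() {
	// The output covers commits 100..103.
	commitToIndex := map[int]int{100: 0, 101: 1, 102: 2, 103: 3}
	// This tile only had data for commits 101 and 103.
	trace := placeValues(4, commitToIndex, map[int]float32{101: 1.5, 103: 2.5})
	fmt.Println(trace) // [1e+32 1.5 1e+32 2.5]
}
```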

Usage Details

  • Initialization: New(size int) requires the total length of the resulting traces (e.g., the number of commits in the requested range).
  • Data Insertion: Add() blocks only when a worker's channel buffer is full. It distributes traces to workers and increments the internal WaitGroup.
  • Completion: Build() blocks until all workers have finished processing their queues. It then performs the final merge of the 64 worker-local maps into the return values.
  • Cleanup: Close() must be called to shut down the worker goroutines and release resources.

Module: /go/tracestore

TraceStore

The tracestore module provides the core abstractions and interfaces for storing, retrieving, and querying performance trace data within the Skia Perf system. It acts as the bridge between raw performance metrics (time-series data) and the storage backends, ensuring that high-cardinality data can be queried efficiently.

High-Level Overview

In the Skia Perf ecosystem, a “trace” is a series of floating-point values associated with a specific set of parameters (e.g., ,arch=x86,config=8888,). The tracestore module defines how these traces are organized into “Tiles”—fixed-size blocks of commits—and provides the interfaces for performing complex queries across these tiles.

The module is built around three primary interfaces:

  1. TraceStore: The main interface for reading and writing trace data, calculating tile offsets, and executing queries.
  2. TraceParamStore: Specifically handles the mapping between a trace's unique identifier (an MD5 hash) and its human-readable parameters.
  3. MetadataStore: Manages “sidecar” information, such as links to source files or diagnostic data associated with the ingestion process.

Design Decisions

Tiled Data Architecture

To handle years of performance data without performance degradation, tracestore utilizes a tiling system.

  • The “Why”: Loading an entire history of a trace is rarely necessary and often memory-prohibitive. By splitting data into tiles (e.g., 256 commits per tile), the system can load only the segments relevant to a user's current view.
  • Implementation: The TraceStore interface exposes methods like TileNumber and CommitNumberOfTileStart to translate between absolute commit numbers and their positions within specific storage blocks.
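The translation is simple integer arithmetic. A sketch assuming 256 commits per tile (the interface methods are `TileNumber` and `CommitNumberOfTileStart`; the lowercase helpers here are illustrative):

```go
package main

import "fmt"

const tileSize = 256 // commits per tile, as described above

// tileNumber maps an absolute commit number to the tile containing it.
func tileNumber(commit int) int {
	return commit / tileSize
}

// commitNumberOfTileStart returns the first commit number of that tile.
func commitNumberOfTileStart(commit int) int {
	return tileNumber(commit) * tileSize
}

func main() {
	fmt.Println(tileNumber(0))                // 0
	fmt.Println(tileNumber(511))              // 1
	fmt.Println(commitNumberOfTileStart(511)) // 256
	fmt.Println(tileNumber(512))              // 2
}
```

A query spanning commits 200 to 600 therefore needs to load only tiles 0, 1, and 2, regardless of how many years of history exist before or after.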

Separation of Values and Parameters

The design separates the storage of the numeric values (the “what”) from the parameters (the “who”).

  • Efficiency: Instead of storing the full string of parameters with every single data point, the system uses a unique trace_id (MD5 hash).
  • The “How”: The TraceParamStore maintains the lookup table for these IDs, while the TraceStore focuses on the high-volume numeric values and commit associations.

Key Components

TraceStore (tracestore.go)

This is the central entry point for the module. It defines the contract for how the rest of the Perf system (like the dfbuilder for creating DataFrames) interacts with performance data.

Key Responsibilities:

  • Querying: QueryTraces and QueryTracesIDOnly provide the mechanism to search millions of traces based on parameter matches (e.g., finding all traces where cpu=arm64).
  • Data Retrieval: Supports both tile-based reads (ReadTraces) and arbitrary commit range reads (ReadTracesForCommitRange).
  • Ingestion: WriteTraces is responsible for committing new data points into the store, ensuring that the associated ParamSet (the global index of all known keys and values) is updated.

TraceParamStore (traceparamstore.go)

This interface manages the lifecycle of trace identities.

  • Responsibility: It maps the MD5 hex-encoded traceId to the paramtools.Params object.
  • Rationale: By isolating this, backends can implement specialized caching or indexing (like the InMemoryTraceParams found in the SQL implementation) to speed up the translation from IDs back to human-readable strings.

MetadataStore (metadatastore.go)

This interface provides context to the raw numbers.

  • Responsibility: It links a data point back to its origin—specifically the source file name and any external links (e.g., a link to a BuildBucket task or a GCS bucket).
  • Usage: When a user clicks on a point in a Perf graph, the system uses the MetadataStore to find exactly which file generated that specific value.

Implementation Details: SQL Backend

While this module defines the interfaces, the sqltracestore submodule provides a concrete implementation designed for CockroachDB and Spanner. It implements specialized logic for:

  • Parallel Ingestion: Writing trace data in batches to maximize database throughput.
  • In-Memory Search: Using a columnar, integer-encoded index of trace parameters to resolve complex queries in RAM before fetching the actual numeric values from SQL.

Data Workflow: Trace Resolution

The following diagram shows how the tracestore components interact when a user requests data for a specific graph:

       UI / API Request
              |
              v
     [ TraceStore.QueryTraces ]
              |
              |-- 1. Identify matching TraceIDs via Query
              |
              |-- 2. Fetch Values (TraceStore implementation)
              |      [ SQL TraceValues Table ]
              |
              |-- 3. Fetch Params (TraceParamStore)
              |      [ SQL TraceParams Table ]
              |
              |-- 4. Fetch Source Info (MetadataStore)
              |      [ SQL SourceFiles Table ]
              v
      Combined TraceSet + Metadata

Module: /go/tracestore/mocks


The tracestore/mocks module provides autogenerated mock implementations of the core interfaces used for storing and retrieving performance trace data within the Perf system. These mocks are generated using mockery and are based on the testify framework, facilitating unit testing of components that depend on tracestore and metadatastore.

High-Level Overview

In the Perf architecture, the TraceStore and MetadataStore are critical abstractions for interacting with time-series data and its associated metadata (such as source file links). Because these stores typically interact with external databases (like BigTable or SQL backends), using real implementations in unit tests is impractical.

This module provides:

  • TraceStore Mock: Simulates the primary data store for performance traces, supporting operations like querying by parameters, reading by commit range, and tile management.
  • MetadataStore Mock: Simulates the storage used for mapping source file names to additional metadata, such as links or IDs.

Key Components

TraceStore.go

This file contains the mock for the tracestore.TraceStore interface. It is designed to allow developers to simulate complex data retrieval scenarios without a running database.

Key Capabilities:

  • Tile Logic: Methods like TileNumber, CommitNumberOfTileStart, and TileSize allow tests to verify how components handle Perf's “tiled” data architecture.
  • Query Simulation: QueryTraces and QueryTracesIDOnly can be configured to return specific TraceSet results or stream parameters, enabling tests for the UI and alerting logic.
  • Data Ingestion: WriteTraces can be mocked to ensure that ingestion pipelines are correctly formatting and sending data to the store.

MetadataStore.go

This file provides the mock for the MetadataStore interface, focusing on the association between raw trace data and its origin files.

Key Capabilities:

  • Metadata Retrieval: Mocking GetMetadata and GetMetadataMultiple allows testing of features like the “Source File” links in the Perf UI.
  • Bulk Operations: Supports mocking GetMetadataForSourceFileIDs for performance-sensitive batch lookups.

Design Patterns and Usage

The mocks follow the testify/mock pattern. When a new mock is created via NewTraceStore(t) or NewMetadataStore(t), it automatically registers a cleanup function that asserts expectations when the test finishes.
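The essence of the pattern can be shown with a hand-rolled stub. This is a stdlib-only sketch with a hypothetical, narrowed interface; the real mocks are generated by mockery and configure canned results via testify's On(...).Return(...) API:

```go
package main

// traceQuerier is a hypothetical slice of the TraceStore interface,
// just large enough to exercise a component in a test.
type traceQuerier interface {
	QueryTraces(q string) map[string][]float32
}

// mockQuerier records calls and returns a canned TraceSet, mimicking
// what a mockery-generated mock does.
type mockQuerier struct {
	canned map[string][]float32
	calls  []string
}

func (m *mockQuerier) QueryTraces(q string) map[string][]float32 {
	m.calls = append(m.calls, q)
	return m.canned
}

// countTraces is a stand-in for a component under test that depends
// only on the interface, not the concrete store.
func countTraces(ts traceQuerier, q string) int {
	return len(ts.QueryTraces(q))
}
```

After exercising the component, the test inspects `calls` to assert the store was queried as expected, which is the same verification the generated mocks perform automatically on cleanup.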

Workflow Example: Testing a Query Component

This diagram illustrates how a test uses the mock to verify a component that processes trace data:

  Test Logic             Component Under Test              TraceStore Mock
      |                         |                                 |
      |-- 1. On("QueryTraces")->|                                 |
      |   .Return(myTraceSet)   |                                 |
      |                         |                                 |
      |----- 2. RunAction() --->|                                 |
      |                         |------- 3. QueryTraces() ------->|
      |                         |                                 |
      |                         |<------ 4. myTraceSet -----------|
      |                         |                                 |
      |<---- 5. Verify Results -|                                 |
      |                         |                                 |
      |-- 6. Cleanup/Assert ----|-------------------------------->|

  1. Setup: The test defines what the mock should return when a specific query is executed.
  2. Execution: The component calls the mock as if it were a real database.
  3. Verification: The test checks if the component handled the returned TraceSet correctly.
  4. Assertion: The mock verifies that the component actually called QueryTraces with the expected arguments.

Module: /go/tracestore/sqltracestore

This module provides a high-performance, SQL-backed implementation of the tracestore.TraceStore interface for Skia Perf. It is designed to store and query high-cardinality time-series performance data, primarily targeting databases like CockroachDB or Spanner.

The implementation focuses on optimizing two primary workloads:

  1. Fast Range Queries: Retrieving floating-point values for a specific set of traces across a range of commits (tiles).
  2. Metadata Discovery: Navigating the “inverted index” of parameters (e.g., arch=x86) to find relevant traces.

Design Decisions

Tile-Based Sharding

To prevent indices from growing indefinitely and to facilitate data aging/management, data is organized into “Tiles.” Each tile represents a fixed number of commits (e.g., 256). This allows the system to partition lookups and optimize the ParamSets table by only querying the keys and values relevant to the specific time range being viewed.

MD5 Trace Identification

Trace names are structured keys (e.g., ,arch=x86,config=565,). Storing these long strings repeatedly in the TraceValues table would be storage-inefficient and slow for indexing. Instead, the module uses an MD5 hash of the trace name as a BYTEA (or BYTES) primary key (trace_id).

  • Why MD5? It provides a uniform distribution of keys, preventing “hot spots” in distributed SQL databases.
  • Trace Recovery: Because hashes are one-way, the TraceParams table stores the mapping from trace_id back to the original JSONB parameter map.
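The hashing step itself is small. A sketch of the scheme (the production code stores the digest as raw bytes in a BYTEA/BYTES column rather than as a hex string):

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
)

// traceIDForName reduces a structured trace name to the fixed-width
// identifier used as the TraceValues primary key. MD5's uniform output
// distribution is what spreads writes across ranges in a distributed
// SQL database.
func traceIDForName(traceName string) string {
	sum := md5.Sum([]byte(traceName))
	return hex.EncodeToString(sum[:])
}
```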

In-Memory Parameter Indexing (InMemoryTraceParams)

While SQL is powerful, querying millions of traces based on complex parameter combinations (including regex and exclusions) can be slow in a pure SQL environment.

  • The “How”: This module periodically loads the entire TraceParams table into an in-memory, integer-encoded columnar structure.
  • The Benefit: Queries like arch=x86 & config=~.*8888 are resolved in memory by scanning bitsets or integer arrays, producing a list of trace_ids that feeds a highly optimized SQL IN clause against the TraceValues table.
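The integer encoding can be illustrated with a toy version. The names below are hypothetical; the real InMemoryTraceParams is considerably more elaborate, but the core trick is the same: intern every parameter string as an int32 so that filtering becomes an integer scan.

```go
package main

// dictionary interns parameter strings as int32 codes, shrinking the
// in-memory footprint and turning string comparisons into integer ones.
type dictionary struct {
	codes map[string]int32
}

func (d *dictionary) code(s string) int32 {
	if d.codes == nil {
		d.codes = map[string]int32{}
	}
	if c, ok := d.codes[s]; ok {
		return c
	}
	c := int32(len(d.codes))
	d.codes[s] = c
	return c
}

// matching scans one encoded column (one value per trace) and returns
// the row indices whose value equals the wanted code.
func matching(col []int32, want int32) []int {
	var out []int
	for i, v := range col {
		if v == want {
			out = append(out, i)
		}
	}
	return out
}
```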

Key Components

SQLTraceStore (sqltracestore.go)

The central orchestrator. It manages the lifecycle of traces, handles the conversion between human-readable trace names and SQL-friendly hashes, and coordinates with caches. It uses Go templates to generate dynamic SQL queries for batch operations.

InMemoryTraceParams (inmemorytraceparams.go)

An in-memory search engine for trace metadata.

  • Parallel Refresh: It uses a partitioned read strategy (splitting the trace_id keyspace into 16 partitions) to rapidly load metadata from SQL into RAM.
  • Encoding: It maps all parameter strings to int32 identifiers to minimize memory footprint and speed up comparison logic.

SQLTraceParamStore (sqltraceparamstore.go)

Handles the durable storage of the trace identity.

  • Responsibility: Maps the MD5 trace_id to the full paramtools.Params (JSON).
  • Optimization: Implements batch writing with a parallel worker pool to handle high-volume ingestion.

SQLMetadataStore (sqlmetadatastore.go)

Stores “sidecar” information about the ingestion process.

  • Responsibility: Maps source_file_id (an integer) to external links or diagnostic metadata. This keeps the primary TraceValues table focused strictly on performance metrics.

Intersection Logic (intersect.go)

A utility for combining results from multiple search channels. It uses a binary tree of Go channels to efficiently find the intersection of ordered trace_id sets without the overhead of reflection.
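A simplified two-way version of the idea, assuming each channel yields trace IDs in ascending order with no duplicates (the real intersect.go composes these pairwise merges into a binary tree for n channels):

```go
package main

// intersect merges two ascending, duplicate-free channels of trace IDs
// and emits only the IDs present on both, closing out when either
// input is exhausted.
func intersect(a, b <-chan string, out chan<- string) {
	av, aok := <-a
	bv, bok := <-b
	for aok && bok {
		switch {
		case av < bv:
			av, aok = <-a
		case bv < av:
			bv, bok = <-b
		default:
			out <- av
			av, aok = <-a
			bv, bok = <-b
		}
	}
	close(out)
}
```

Because the inputs are ordered, the merge is linear in the total number of IDs and needs no buffering or reflection.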

Data Workflow: Reading Traces

The following diagram illustrates how a user query for “config=565” across a specific tile is resolved:

User Query: "config=565" for Tile 176
      |
      v
[ InMemoryTraceParams ] <--- (Scans encoded columns in RAM)
      |
      | Result: List of matching TraceIDs (MD5 hashes)
      v
[ SQL Database: TraceValues ]
      |
      | SQL: SELECT val FROM TraceValues
      |      WHERE trace_id IN (...) AND commit_number BETWEEN 45056 AND 45311
      v
[ TraceSet Result ] ----> (UI/Graphing)

Data Workflow: Writing Traces

Ingestion prioritizes atomicity and avoiding redundant writes:

Incoming Data: {Commit: 100, Params: {arch: x86}, Value: 1.2, Source: "file.json"}
      |
      | 1. Update SourceFiles: Get/Create ID for "file.json"
      | 2. Update ParamSets: Ensure "arch=x86" is registered for the tile
      | 3. Hash Trace: ",arch=x86," -> MD5 TraceID
      | 4. Write TraceParams: Store {TraceID: Params} (ON CONFLICT DO NOTHING)
      v
[ SQL Database: TraceValues ]
      |
      | INSERT INTO TraceValues (trace_id, commit, val, source_id)
      | ON CONFLICT (trace_id, commit) DO UPDATE ...

Module: /go/tracestore/sqltracestore/schema

The sqltracestore/schema module defines the foundational data structures used to map Go types to SQL table definitions for Skia Perf's trace storage. It acts as the “source of truth” for the database schema, utilizing struct tags to define column types, primary keys, and indices.

Design Evolution and Storage Strategy

The schema is designed to handle high-cardinality time-series data (performance metrics) while maintaining fast lookups for both specific trace values and metadata.

Trace Data Management

The core performance data is stored in TraceValuesSchema and its successor TraceValues2Schema.

  • TraceValuesSchema: Uses a composite primary key of (trace_id, commit_number). This ensures that for any given metric (trace), data points are physically ordered by time (commit number), optimizing range scans for graphing.
  • TraceValues2Schema: Extends the original schema to explicitly include common parameter dimensions (Benchmark, Bot, Test, etc.) as columns. This evolution reflects a shift towards allowing the database engine to filter on specific common dimensions more efficiently than generic JSON or posting-list lookups.

Postings and Search

To facilitate searching across millions of traces based on arbitrary parameters, the module defines a PostingsSchema:

  • Tile-based Partitioning: Data is organized by tile_number. This sharding strategy prevents the posting indices from growing indefinitely, allowing the system to query only relevant time ranges.
  • Inverted Index: The key_value (representing a key=value pair) is indexed against trace_id. This allows the system to quickly resolve a query like device=pixel6 into a set of trace IDs.

Parameter and Metadata Handling

  • ParamSetsSchema: Tracks the global set of all available keys and values within a specific tile. This is used to populate UI filters and autocomplete suggestions.
  • TraceParamsSchema: Stores the full parameter map for a single trace as JSONB. This is used when the system needs to reconstruct the full identity of a trace after it has been located via an index.
  • SourceFilesSchema: Maps raw filenames to internal IDs. This normalization reduces storage overhead in the primary value tables by replacing long strings with integers.
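The struct-tag mechanism can be sketched as follows. The column types and tag contents here are illustrative, not the exact production schema, but they show how a Go struct doubles as the table definition:

```go
package main

import "reflect"

// traceValuesSchema sketches how struct tags describe a SQL table:
// each field carries its column definition, and an unexported field
// can carry table-level clauses such as the composite primary key.
type traceValuesSchema struct {
	TraceID      []byte   `sql:"trace_id BYTES"`
	CommitNumber int32    `sql:"commit_number INT"`
	Val          float64  `sql:"val REAL"`
	SourceFileID int64    `sql:"source_file_id INT"`
	primaryKey   struct{} `sql:"PRIMARY KEY (trace_id, commit_number)"`
}

// columnDef returns the sql tag for a named field, the same lookup a
// schema generator would perform via reflection.
func columnDef(name string) string {
	f, ok := reflect.TypeOf(traceValuesSchema{}).FieldByName(name)
	if !ok {
		return ""
	}
	return f.Tag.Get("sql")
}
```

A schema generator walks every field in order, concatenating the tag values into a CREATE TABLE statement, which is what makes the struct the single source of truth.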

Key Components and Data Relationships

The following diagram illustrates how these entities relate during data ingestion and retrieval:

[ SourceFiles ] <---------- [ TraceValues ] ----------> [ TraceParams ]
(Maps filename to ID)     (The actual metrics)        (Full key/value map)
                                 |
                                 | (linked by trace_id)
                                 v
[ ParamSets ] <---------- [ Postings ]
(All possible keys/vals)  (Search index for traces)

  • TraceID: A byte slice (usually a hash) that serves as the unique identifier for a specific combination of parameters. It is the common link across TraceValues, Postings, and TraceParams.
  • Indices: The schema defines specific secondary indices (like by_source_file_id) to support administrative workflows, such as identifying all data points associated with a corrupted or updated source file.
  • MetadataSchema: Specifically handles non-performance data (like external links or diagnostic information) associated with a source file, kept separate from the “hot” path of performance metrics to keep the trace tables lean.

Module: /go/tracing

High-Level Overview

The perf/go/tracing module serves as a specialized wrapper for initializing distributed tracing within Perf applications. It bridges the gap between the generic infrastructure-level tracing utilities and the specific configuration requirements of a Perf instance.

Its primary purpose is to ensure that performance data and request flows across Perf services are captured and exported consistently to a tracing backend (typically Google Cloud Trace) without requiring each sub-service to manually manage initialization logic or environment-specific metadata.

Design and Implementation Decisions

Centralized Initialization

The module abstracts the complexity of OpenCensus initialization. By consolidating this in one place, the project ensures that all Perf components—such as the frontend, ingestion service, and query engine—use identical sampling logic and metadata tagging. This consistency is crucial for correlating traces across different service boundaries.

Metadata Enrichment

A key design choice in Init is the automatic injection of contextual metadata into every trace.

  • Pod Identification: By capturing the MY_POD_NAME environment variable (injected via Kubernetes templates), the module allows developers to pinpoint exactly which container instance handled a specific request.
  • Instance Scoping: Since a single Perf deployment can represent different logical instances (e.g., “skia”, “chrome”, “flutter”), the instance name is included in the trace attributes to allow for easy filtering in the tracing dashboard.

Conditional Activation (Local vs. Production)

Tracing is intentionally bypassed when running in local mode. This prevents development environments from attempting to authenticate with cloud-based tracing exporters or polluting production trace data with local testing noise.

Configuration-Driven Sampling

The module utilizes TraceSampleProportion from the InstanceConfig. This allows for dynamic control over the volume of traces generated. High-traffic instances can set a lower proportion to manage costs and overhead, while smaller or more critical instances can increase the sample rate for higher visibility.

Key Components and Responsibilities

tracing.go

This is the core of the module, responsible for the following:

  1. Orchestrating Initialization: It invokes the lower-level go/tracing infrastructure package but pre-configures it with Perf-specific defaults.
  2. Project Auto-Detection: It passes an empty string for the Project ID, signaling the underlying library to use Google Cloud's metadata server to auto-detect the hosting project. This simplifies deployment across different GCP projects.
  3. Environment Mapping: It transforms the high-level InstanceConfig and system environment variables into a structured map of attributes that are attached to every trace span.

Workflow: Trace Initialization

The following diagram illustrates how the tracing configuration flows from the application startup into the global tracing state.

Application Startup
       |
       | (local flag, InstanceConfig)
       v
+--------------------------+
|  perf/go/tracing.Init()  |
+--------------------------+
       |
       |-- Check local flag (Return nil if true)
       |-- Extract InstanceName from Config
       |-- Fetch MY_POD_NAME from OS
       |
       v
+-----------------------------------+
|  infra/go/tracing.Initialize()    | <--- Global Trace Exporter
+-----------------------------------+
       |
       |-- Sets Sampling Rate (Proportion)
       |-- Configures Project ID (Auto-detect)
       |-- Attaches {podName, instance} Attributes
       v
   Tracing Ready

Module: /go/ts

TypeScript Definition Generation for Perf

The go/ts module is a utility program designed to bridge the gap between the Go backend and the TypeScript frontend in the Perf application. Its primary responsibility is to ensure type safety across the network boundary by automatically generating TypeScript interfaces and types from Go structs that are serialized into JSON for the web UI.

Design Philosophy

The module addresses a common failure mode in web development: when a Go struct used in a JSON response changes, the frontend code often breaks silently because its TypeScript definitions have fallen out of sync.

Instead of manually maintaining duplicate type definitions, this module uses reflection (via the go2ts package) to inspect Go structs and produce a source-of-truth TypeScript file. This ensures that:

  1. Type Consistency: Frontend developers can rely on TypeScript definitions that exactly match the backend's JSON output.
  2. Nominal Typing: By setting GenerateNominalTypes = true, the generator treats specific Go types as distinct in TypeScript, preventing logic errors where structurally similar but semantically different types might be confused.
  3. Documentation of APIs: The generator acts as a living document of all data structures exchanged between the Perf frontend and backend.

Key Components and Workflows

Main Execution Logic (main.go)

The core of the module is a CLI tool that configures a go2ts.Go2TS generator. The execution follows a specific sequence:

  1. Initialization: It instantiates the generator and configures global behaviors, such as ignoring nil values for specific mapping types like paramtools.Params to prevent unnecessary optionality in TypeScript.
  2. Type Registration: The bulk of the code involves registering Go types from various sub-packages. It distinguishes between standard structs and “unions” (which Go often represents as constants or enums).
  3. Namespace Organization: To prevent naming collisions and improve code organization on the frontend, certain types are grouped into namespaces (e.g., pivot, progress, ingest).
  4. Rendering: Finally, it writes the generated TypeScript code to a specified output file (typically modules/json/index.ts).

Handling Unions and Enums

Go has no native union type comparable to TypeScript's. The module uses a helper function, addMultipleUnions, to map collections of Go constants to TypeScript union types. This is critical for states, statuses, and configuration options (e.g., regression.Status or alerts.ConfigState), ensuring the frontend can only use valid, predefined values.

Workflow Diagram

[Go Source Code]          [go/ts/main.go]          [TypeScript Output]
       |                         |                         |
       |-- (Reflects on) --------|                         |
       |   Structs & Constants   |                         |
       |                         |                         |
       |                         |-- (Converts to TS) ---->|
       |                         |                         |
       |                         |                         |-- index.ts
       |                         |                         |   (Interfaces,
       |                         |                         |    Namespaces,
       |                         |                         |    Unions)

Key Package Dependencies

The module acts as a central registry, importing almost every major data-holding package in the Perf system to expose their structures:

  • perf/go/frontend/api: Defines the shapes of requests and responses for the web API.
  • perf/go/alerts & perf/go/regression: Core domain objects for alerting logic and anomaly detection.
  • perf/go/clustering2: Data structures representing results from clustering algorithms.
  • perf/go/types & go/paramtools: Low-level primitives for trace keys and parameter sets.
  • perf/go/chromeperf & perf/go/pinpoint: Structures for interacting with external Chromeperf and Pinpoint services.

Usage in Development

The module is intended to be run via go generate. When a developer modifies a Go struct that is sent to the frontend, they should trigger the generator to update the TypeScript definitions, which are then checked into version control. This maintains a synchronized state between the two languages.

Module: /go/types


The go/types module serves as the central repository for core domain types and shared constants used throughout the Skia Perf system. It establishes a common language for time-series data, versioning, and anomaly detection configurations, ensuring consistency across data ingestion, storage, and analysis.

Core Abstractions

Versioning: Commit and Tile Numbers

The system handles large-scale time-series data by indexing it against repository commits.

  • CommitNumber: Represents a linear offset from the repository's first commit (0). It assumes a simplified, linear history to facilitate easy indexing and range queries.
  • TileNumber: To optimize data retrieval and storage, traces are partitioned into “Tiles” (fixed-size chunks of commits). This type represents the index of such a tile.

The module provides conversion logic to navigate between these two coordinate systems:

CommitNumber ----(TileSize)----> TileNumber
[0, 255]          / 256          0
[256, 511]        / 256          1
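A sketch of the conversion, including the sentinel handling described under Design Decisions below (names are lowercase stand-ins for the exported types.go identifiers):

```go
package main

// Sentinels mirroring BadCommitNumber/BadTileNumber: -1 marks an
// invalid reference, because 0 is a valid commit and tile index.
const (
	badCommitNumber int32 = -1
	badTileNumber   int32 = -1
)

// tileNumberFromCommit converts a CommitNumber to its TileNumber,
// returning the bad-value sentinel for out-of-range input.
func tileNumberFromCommit(commit, tileSize int32) int32 {
	if commit < 0 || tileSize <= 0 {
		return badTileNumber
	}
	return commit / tileSize
}
```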

Trace Representation

A Trace is the fundamental unit of measurement data, represented as a slice of float32 values.

  • Missing Data: Traces use a sentinel value (vec32.MISSING_DATA_SENTINEL) to represent gaps in measurement, allowing the system to distinguish between a zero value and no data.
  • TraceSet: A convenience mapping of trace IDs (strings) to their corresponding data slices.
  • TraceSourceInfo: A thread-safe container that maps specific points in a trace (CommitNumbers) back to their original source file IDs in the database, enabling “drill-down” capabilities from a graph point to the raw data file.

Anomaly Detection & Regression Logic

The module defines the enums and types that control how the system identifies changes in performance:

Grouping Strategies

Defines how traces are aggregated before analysis:

  • KMeansGrouping: Clusters similar trace shapes together to identify aggregate shifts.
  • StepFitGrouping: Analyzes each trace individually to find “steps” (sudden jumps or drops).

Step Detection Algorithms

Determines the mathematical approach used to identify a regression within a single trace or cluster centroid:

  • Statistical Tests: CohenStep (Effect size) and MannWhitneyU (Rank-sum test) for robust change detection.
  • Heuristics: PercentStep, AbsoluteStep, and Const for simpler magnitude-based thresholds.

Alerting Actions

Specifies the lifecycle of a detected anomaly via AlertAction:

  1. NoAction: Detection only (no notification).
  2. FileIssue: Creates a task for a human sheriff to investigate.
  3. Bisection: Automatically triggers a bisection job to identify the specific culprit commit.

Key Files

  • types.go: Contains all struct definitions, enums, and utility methods for coordinate conversion and data structure management.
  • types_test.go: Validates the math behind commit-to-tile mapping and boundary conditions for invalid indices.

Design Decisions

  • Linear Versioning: By using int32 for CommitNumber, the system prioritizes performance and simplicity in indexing over the complexity of a full Git DAG.
  • Thread Safety: TraceSourceInfo uses an internal sync.RWMutex. This design choice acknowledges that source information is often updated concurrently during data ingestion while being read by the UI or analysis engines.
  • Sentinel Values: The use of BadCommitNumber (-1) and BadTileNumber (-1) provides a standard way to handle errors or uninitialized references without relying on Go's zero-value (0), which is a valid index.

Module: /go/ui

Overview

The /go/ui module serves as the primary backend orchestration layer for the Perf UI. Its main purpose is to bridge the gap between high-level user interactions (like clicking a “shortcut” link or requesting a custom dashboard) and the underlying data storage and processing systems. It acts as a coordinator, delegating specific tasks to specialized submodules like frame for data processing or shortcuts for state persistence.

Design Decisions and Implementation Choices

State Persistence via Shortcuts

A key design choice in the Perf UI is to avoid massive, complex URLs. Instead of encoding an entire UI state (queries, zoom levels, formula transformations) into the URL, the module uses a “Shortcut” system.

  • Why: This allows users to share short, immutable links to specific views.
  • How: The UI sends a state object to the backend; the backend stores it in a database and returns a short ID. When a user visits a link with that ID, the /go/ui layer retrieves the original state and hydrates the UI.
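One way to derive such an ID is to hash the canonical JSON encoding of the state object. This is an illustrative sketch of the idea, not the production scheme; the real store persists the full state in the database keyed by an ID like this:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// shortcutID derives a stable, short identifier from a UI-state object
// by hashing its JSON encoding. Identical states always map to the
// same ID, so shortcuts are naturally deduplicated.
func shortcutID(state interface{}) (string, error) {
	b, err := json.Marshal(state)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	// Truncate to 8 bytes (16 hex chars) for a compact, shareable ID.
	return fmt.Sprintf("%x", sum[:8]), nil
}
```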

Decoupling Data Fetching from Rendering

The module is designed around the concept of a Frame. A “Frame” is not just raw data, but a structured package containing trace values, metadata, anomaly markers, and display instructions.

  • Why: This allows the frontend to remain relatively “dumb” regarding data processing logic. The backend decides whether a result should be rendered as a table, a plot, or a pivot view based on the complexity of the request.
  • How: This logic is encapsulated within the frame submodule, which acts as the “brain” for transforming raw trace queries into a format the UI can immediately consume.

Progress and Asynchronicity

Because performance data can span millions of points and take seconds to process, the UI backend implementation prioritizes progress tracking.

  • Implementation: Many operations are wrapped in a progress-tracking context. As the backend fetches data or calculates formulas, it updates a status object that the frontend polls, ensuring the user is never left with a hanging UI.

Key Workflows

The following diagram shows how the ui module coordinates a request to view data, starting from a short URL:

Browser (URL with Shortcut ID)
    |
    |-- 1. Get ID ----> [ /go/ui/shortcuts ] (Retrieve State)
    |                         |
    |<- 2. UI State ----------'
    |
    |-- 3. Request Data --> [ /go/ui/frame ]
    |      (State ID)             |
    |                             |-- a. Query Tracestore
    |                             |-- b. Run Calculations
    |                             |-- c. Attach Anomalies
    |                             |-- d. Link Commits/Source
    |                             V
    |<-- 4. DataFrame <-----------'
    |
(Render Graph/Table)

Key Submodules and Responsibilities

/go/ui/frame

The heavy lifter of the module. It handles the FrameRequest lifecycle. It is responsible for:

  • Query Resolution: Translating user-defined keys into trace data.
  • Calculations: Invoking the calc engine to process mathematical formulas on the fly.
  • Metadata Enrichment: Attaching human-readable links to source code repositories (e.g., Chromium, V8) by comparing commit hashes in the trace data.
  • Anomaly Integration: Overlaying regression data onto the performance traces.

/go/ui/shortcuts

Manages the lifecycle of “Shortcuts” (short IDs that map to complex UI states).

  • It provides the persistence layer for the “Explore” page.
  • It ensures that UI configurations can be shared and bookmarked without hitting URL length limits.

UI Logic and Configuration

Beyond the submodules, the root /go/ui package often contains the logic for global UI settings and navigation. It determines which features are enabled based on the instance configuration (e.g., whether to show anomaly detection features or specific repository links).

Module: /go/ui/frame

The /go/ui/frame module is responsible for orchestrating the transition from a user's high-level data request (represented as queries, formulas, or shortcuts) into a rich, structured DataFrame suitable for visualization in the Perf frontend. It acts as the “brain” of the Explore page, managing the complexity of parallel data fetching, calculation, pivoting, and metadata enrichment.

Core Responsibility: Request Processing

The primary entry point is ProcessFrameRequest, which manages the lifecycle of a FrameRequest. A single request can contain multiple data sources—queries, mathematical formulas, and pre-saved shortcut keys—all of which must be aggregated into a unified view.

The module follows a structured workflow to build a response:

  1. Data Fetching: It uses a DataFrameBuilder to fetch raw trace data based on the provided queries or shortcuts.
  2. Calculation: If formulas are provided, it leverages the go/calc engine to perform transformations (like sum() or filter()) on the fetched traces.
  3. Pivoting: If requested, it reshapes the data using the pivot module, aggregating traces by specific parameters.
  4. Enrichment: It decorates the data with external context, such as anomaly markers from Chrome Perf and source file metadata (links to repositories like V8 or WebRTC).
  5. Progress Tracking: Because data fetching can be long-running, the module updates a progress.Progress object to give the frontend real-time status updates.

Design Decisions and Implementation

Handling Hybrid Request Types

The module supports two distinct ways of looking at time: REQUEST_TIME_RANGE (absolute Unix timestamps) and REQUEST_COMPACT (a fixed number of commits leading up to a point). The implementation abstracts this difference by passing specialized parameters to the dfBuilder while maintaining a consistent internal DataFrame structure.

Trace Filtering and Sentinels

One specific design choice is the use of the preflightqueryprocessor (pqp). Before fetching data, the module prepares queries with “sentinels” (e.g., __missing__). This allows the system to handle complex queries where a user specifically wants to find traces that lack a certain parameter, which is then enforced through an in-memory filterTraceSet pass after the raw data is loaded.

Intelligent Metadata Linking

The getMetadataForTraces and populateTraceMetadataLinksBasedOnConfig functions implement logic to generate human-readable commit ranges. Instead of showing just a raw hash, they compare the current commit to the previous one in the trace and generate a +log/prev..current link when a change is detected. This behavior is tuned via configuration for major repositories such as Chromium, V8, and WebRTC.

Response Display Modes

The module determines how the frontend should render the data by analyzing the FrameRequest.

  • If a pivot has a summary operation, it sets DisplayPivotTable.
  • If it has a group-by but no summary, it sets DisplayPivotPlot.
  • Otherwise, it defaults to a standard DisplayPlot.

Key Workflows

The following diagram illustrates how the frameRequestProcess coordinates different sub-systems:

User Request (FrameRequest)
    |
    |---- Queries ----> [ DataFrameBuilder ] ----.
    |                                            |
    |---- Keys -------> [ ShortcutStore ] -------|--> [ Combined DataFrame ]
    |                         |                  |           |
    |---- Formulas ---> [ calc.Eval ] <----------'           |
                                                             |
    .--------------------------------------------------------'
    |
    |---- [ pivot.Pivot ] (Optional Reshaping)
    |
    |---- [ anomalies.Store ] (Attach Anomaly Markers)
    |
    |---- [ MetadataStore ] (Attach Source Links)
    |
    V
Final Response (FrameResponse)

Key Components

  • FrameRequest / FrameResponse: The JSON-serializable structures that define the API between the frontend (Explore page) and the backend logic.
  • frameRequestProcess: A private struct that maintains the state of a single request, including progress counters and references to required stores (Git, Shortcuts, Tracestore).
  • doSearch / doCalc / doKeys: Internal methods that isolate the logic for different data retrieval strategies. doCalc is notable for providing callback functions (rowsFromQuery, rowsFromShortcut) to the calculation engine, allowing formulas to recursively fetch data.
  • Anomaly Integration: Functions like addRevisionBasedAnomaliesToResponse bridge the gap between the trace data and the anomaly detection system, ensuring that points on a graph can be highlighted if they represent performance regressions.

Module: /go/urlprovider

URL Provider

The urlprovider module is a utility component within the Skia Perf system designed to programmatically generate deep-link URLs for various Perf UI pages. It centralizes the logic for constructing complex query parameters, ensuring that links to the Explore page, MultiGraph page, and Group Reports are consistent across the application.

High-level Overview

The primary goal of this module is to abstract the transformation of internal state—such as commit numbers, trace parameters, and shortcut IDs—into URL strings that the Perf frontend can interpret.

A key design choice in this module is the integration with perfgit.Git. Because the Perf UI relies on Unix timestamps for time-range filtering rather than raw commit numbers, the URLProvider uses the Git service to resolve commit numbers into their corresponding timestamps. This ensures that generated URLs point to the correct temporal window even as the underlying data evolves.

Key Components and Responsibilities

URLProvider Struct

Defined in urlprovider.go, this is the main stateful component. It requires an instance of perfgit.Git to perform commit-to-timestamp lookups.

  • Time Range Resolution: The provider automatically converts a range of commit numbers into begin and end URL parameters. A specific implementation choice made here is to shift the end time forward by one day (AddDate(0, 0, 1)). This is done to ensure that the data points or anomalies associated with the final commit are clearly visible on the rendered graph and not cut off at the edge of the display.
  • Explore Page Generation: The Explore method constructs URLs for the /e/ endpoint. It handles the nesting of trace queries by encoding trace parameters into a single queries parameter.
  • MultiGraph Generation: The MultiGraph method targets the /m/ endpoint, utilizing a shortcut ID (representing a saved set of traces) rather than a raw query string.
  • Dynamic Customization: Both methods support a disableFilterParentTraces flag (translated to the disable_filter_parent_traces query parameter) and allow for arbitrary additional query parameters via a url.Values argument.

Static Group Reporting

The GroupReport function is a stateless utility. It generates URLs for the /u/ endpoint, which is typically used for viewing anomaly groups or specific bugs.

  • Validation: To prevent the generation of malformed or unsupported URLs, it enforces an allow-list of valid parameters: anomalyGroupID, anomalyIDs, bugID, rev, and sid.

Key Workflows

URL Generation Process

The following diagram illustrates how the URLProvider orchestrates data from internal services to produce a frontend URL:

Input Parameters                      Git Service (perfgit)
(Commit Nums, Query)                          |
       |                                      |
       v                                      |
+--------------------------+                  |
| URLProvider.Explore()    |                  |
+------------+-------------+                  |
             |                                |
             |---- Request Timestamps ------->|
             |<--- Return Unix Timestamps ----|
             |
             | (Internal Logic)
             | - Add 1 day buffer to End Time
             | - Encode Query parameters
             | - Append Optional Filters
             |
             v
Result: "/e/?begin=123&end=456&queries=..."

File Responsibilities

  • urlprovider.go: Contains the logic for calculating time ranges, encoding parameters, and building the final URL paths for Explore, MultiGraph, and Group Report pages.
  • urlprovider_test.go: Validates that the URL generation correctly handles escaping, timestamp calculation, and optional parameter injection. It uses a mockable or test instance of the Git service to verify the integration between commit numbers and timestamps.

Module: /go/userissue

User Issue Module

The userissue module provides the core abstractions and storage logic for associating external issue tracker IDs (specifically Buganizer) with specific performance data points in the Perf system.

By linking a “trace key” and a “commit position” to a specific issue ID, the system allows developers and automated tools to contextualize performance anomalies with human-reported bugs. This enables the Perf UI to overlay bug information directly onto graphs, helping users understand if a regression or change is already being tracked.

Design Philosophy

The module is designed around the concept of a point-in-time association. Because a trace represents a series of data points over time, an issue is not just linked to a trace, but to a specific moment in that trace's history (the commit position).

Key Abstractions

  • UserIssue Struct: Represents the core domain model. It contains the identity of the user who created the association, the TraceKey, the CommitPosition, and the external IssueId.
  • Store Interface: Defines a storage-agnostic contract for persisting and retrieving these associations. This abstraction allows the system to swap underlying database implementations (e.g., switching to a different SQL dialect or a NoSQL provider) without affecting the business logic or the UI handlers.

Implementation Details

The module follows a clean separation between the interface definition and its concrete implementations:

  1. Core Interface (store.go): Defines the required operations:
    • Save: Persists a new association.
    • Delete: Removes an association based on the unique combination of trace and commit.
    • GetUserIssuesForTraceKeys: A bulk retrieval method designed for high-performance graph rendering, fetching all issues related to a set of traces within a specific range of commits.
  2. SQL Implementation (sqluserissuestore): A production-ready implementation that uses SQL (compatible with Spanner). It utilizes dynamic SQL templating to handle bulk queries efficiently, ensuring that complex filters over varying numbers of trace keys remain performant.
  3. Mock Implementation (mocks): Provides automated mocks for testing. This allows other parts of the Perf system (like the alert service or the API layer) to simulate database interactions, error conditions, and specific data scenarios without requiring a live database connection.

Key Workflows

Creating and Visualizing an Issue Association

When a user identifies a performance change on a graph and links it to a bug, the data flows through the following path:

  User Interaction           Perf Backend                Store Implementation
        |                          |                             |
        |-- 1. Create Link ------->|                             |
        |   (Trace, Commit, ID)    |-- 2. Call Save() ---------->|
        |                          |                             |-- 3. SQL INSERT
        |                          |                             |      (last_modified set)
        |                          |                             |
        | <--- 4. Confirmation ----|                             |
        |                          |                             |
        |                          |                             |
        |-- 5. Refresh Graph ----->|                             |
        |                          |-- 6. Bulk Fetch ----------->|
        |                          |   (for all traces in view)  |-- 7. SQL Template
        |                          |                             |      (IN clause generated)
        | <--- 8. Data with IDs ---|                             |

Data Integrity and Constraints

The module relies on a composite primary key consisting of the trace_key and the commit_position. This design decision ensures:

  • Uniqueness: A single data point (trace + commit) can only be associated with one issue at a time in the storage layer.
  • Atomicity: Deletion and retrieval operations use these two components to ensure that the correct record is targeted, preventing accidental data loss across different commits in the same trace.

Module: /go/userissue/mocks

User Issue Mocks

The userissue/mocks module provides automated mock implementations of the interfaces defined in the userissue package. Its primary purpose is to facilitate unit testing for components that depend on user issue persistence without requiring a live database or a complex manual setup.

Design Philosophy

This module utilizes test-double generation via mockery. By generating mocks based on the Store interface, the project ensures that the testing utilities stay in lockstep with the actual production code.

The decision to provide a dedicated mocks package serves two main purposes:

  1. Decoupling: Tests in other modules (such as API handlers or alert logic) can import this package to simulate various data scenarios—such as database errors, empty result sets, or specific issue lists—without being coupled to the underlying SQL or Spanner implementations of the Store.
  2. Test Reliability: By using the testify/mock framework, developers can write assertive tests that verify not just the output of a function, but also that the interactions with the storage layer (e.g., “was the correct trace key deleted?”) happened exactly as expected.

Key Components

Store Mock (Store.go)

The Store struct is the central component of this module. It implements the userissue.Store interface, providing mockable versions of the following operations:

  • Persistence Operations (Save, Delete): These allow tests to verify that the application correctly attempts to write or remove user issue metadata associated with specific traces and commit positions.
  • Retrieval Operations (GetUserIssuesForTraceKeys): This mimics the complex querying of user issues over a range of commits. In a test environment, this is crucial for simulating “found” vs “not found” states when rendering performance graphs or dashboards.

Usage Workflow

The typical workflow involves initializing the mock within a test suite, setting expectations for method calls, and then injecting the mock into the high-level business logic.

    Test Suite              Component Under Test             Mock Store
        |                            |                           |
        |---- 1. Setup Mock -------->|                           |
        |     (NewStore)             |                           |
        |                            |                           |
        |---- 2. Expect Save() ----->|                           |
        |                            |                           |
        |---- 3. Call Business Logic |                           |
        |        (e.g. CreateIssue) -|---- 4. Call Save() ------>|
        |                            |                           |
        |                            |<--- 5. Return nil/err ----|
        |                            |                           |
        |<--- 6. Verify Result ------|                           |
        |                            |                           |
        |---- 7. Assert Expectations |                           |
                 (Check if Save was called)

Implementation Details

  • Mockery Integration: The files are autogenerated. Manual changes should be avoided; instead, the Store interface in the parent userissue package should be updated and the mock regenerated.
  • Safety: The NewStore constructor automatically registers a cleanup function using t.Cleanup. This ensures that AssertExpectations is called at the end of every test, preventing “silent failures” where a test passes even if a predicted database call never actually occurred.

Module: /go/userissue/sqluserissuestore

Overview

The sqluserissuestore module provides an SQL-backed implementation of the userissue.Store interface. Its primary purpose is to persist and retrieve associations between performance anomalies (identified by a trace and a specific commit) and external issue tracking IDs (specifically Buganizer).

By storing these relationships, the Perf system can contextualize automated performance data with human-reported issues, allowing the UI to overlay bug information directly onto graphs and alerts.

Design Decisions & Implementation Choices

SQL Templating for Dynamic Queries

A key requirement of the store is fetching issue associations across a variable number of trace keys within a specific commit range.

  • Implementation: The module uses Go's text/template package to dynamically construct SQL queries for the GetUserIssuesForTraceKeys method.
  • Why: An SQL IN clause needs one placeholder per input key, so templating lets the store generate the correct number of $n placeholders at runtime while still using prepared-statement parameters to prevent SQL injection.

Consistency and Integrity

The implementation relies on the database schema's composite primary key (trace key + commit position) to enforce data integrity.

  • Error Handling: The Save method does not use upsert (insert-or-update) logic. Instead, it performs a standard INSERT. If an association already exists for a specific trace at a specific commit, the database returns a constraint violation, which the store wraps as an error. This ensures that users do not inadvertently overwrite existing issue mappings without an explicit deletion or update workflow.
  • Explicit Deletion Checks: The Delete operation performs a lookup before executing the DELETE statement. This ensures the system can provide feedback if a user attempts to remove a record that doesn't exist, preventing silent failures in the UI.

Timestamp Management

The store captures the current system time during the Save operation and persists it to the last_modified column. This centralizes the “last modified” logic within the store implementation, ensuring that even if the client doesn't provide a timestamp, the database reflects when the association was actually created or modified.

Key Components

UserIssueStore

Located in sqluserissuestore.go, this is the central struct that satisfies the userissue.Store interface. It wraps a pool.Pool connection to the database. Its methods translate high-level domain objects into SQL commands:

  • Save: Persists a new UserIssue record.
  • Delete: Removes an association based on the unique combination of trace and commit.
  • GetUserIssuesForTraceKeys: Performs a bulk retrieval of issues for a list of traces over a range of commits.

listUserIssues Template

This SQL template handles the most complex query in the module. It filters the UserIssues table by a set of trace keys and a closed interval of commit positions (>= Begin and <= End).

Workflow: Retrieving Issues for a Graph

When the Perf UI renders a graph containing multiple traces, it needs to know which data points have associated bugs. The data flows as follows:

[ Perf UI ]
      |
      | Request (Traces: ["A", "B"], Commit Range: 100-200)
      v
[ UserIssueStore.GetUserIssuesForTraceKeys ]
      |
      |-- 1. Generate SQL Template: "SELECT ... WHERE trace_key IN ($1, $2) AND ..."
      |-- 2. Execute Query with trace keys and range parameters
      v
[ SQL Database ]
      |
      |-- 3. Filter UserIssues table by PK components
      v
[ UserIssueStore ]
      |
      |-- 4. Map SQL Rows to []userissue.UserIssue
      v
[ Perf UI ] (Displays bug icons on relevant graph points)

Component Files

  • sqluserissuestore.go: Contains the logic for CRUD operations and the SQL templates used to interact with the database.
  • schema/: (Referenced by the store) Defines the table structure, ensuring that trace_key and commit_position act as the unique identifier for any given issue association.
  • sqluserissuestore_test.go: Validates the store's behavior against a real SQL instance (typically Spanner for tests), ensuring that constraints are respected and queries return accurate data.

Module: /go/userissue/sqluserissuestore/schema

Overview

The schema module defines the structural contract for persisting user-reported issue associations within the Perf backend. It serves as the single source of truth for the SQL table structure used by the sqluserissuestore.

The primary goal of this schema is to bridge the gap between performance anomalies (represented by a specific trace at a specific point in time) and external issue trackers (Buganizer). By maintaining this mapping, the system can overlay human-provided context onto automated performance graphs and reports.

Design Decisions & Implementation Choices

Compound Primary Key

The schema uses a composite primary key consisting of trace_key and commit_position. This choice reflects the functional requirement that an issue association is unique to a specific data point.

  • Why: A single trace might have different issues at different points in its history. Conversely, multiple traces might be affected by the same issue. By keying on the trace/commit pair, the store ensures data integrity while allowing the same IssueId to be linked to multiple regressions across the system.

User Attribution

The UserId field is explicitly included to capture the email of the person who created the association.

  • How: This is intended to be populated by the identity provided by the uber-proxy authentication layer. This provides an audit trail and allows the system to identify who is responsible for specific manual annotations.

Temporal Tracking

The LastModified field utilizes the TIMESTAMPTZ type with a default of now().

  • Why: Using a timestamp with time zone ensures that updates are consistent regardless of the server's local time configuration. The use of a default value simplifies the application logic, as the database handles the record-keeping of when an association was last touched.

Key Components

UserIssueSchema

Located in schema.go, this struct defines the layout of the UserIssues table. It maps Go types to SQL definitions:

  • Trace and Commit Identity: TraceKey (string) and CommitPosition (int) define where and when the issue occurred.
  • Issue Identity: IssueId (int) links the record to the external Buganizer ticket.
  • Metadata: UserId and LastModified provide context on the origin and age of the data.

Workflow: Data Association

When a user identifies a performance change in the UI and associates it with a bug, the data flows as follows:

[ User Action ]
      |
      | (Auth: User Email)
      v
[ Perf Frontend ] ----> [ sqluserissuestore ]
                               |
                               | (Maps struct to SQL)
                               v
                        [ SQL Database ]
                        +---------------------------------------+
                        | Table: UserIssues                     |
                        | PK: (trace_key, commit_position)      |
                        | Data: issue_id, user_id, last_modified|
                        +---------------------------------------+

This schema ensures that if a user submits an association for the same trace and commit, the store updates or rejects the record (depending on its upsert logic) rather than duplicating it, maintaining a clean 1:1 mapping between data points and their primary associated issue.

Module: /go/workflows

High-Level Overview

The go/workflows module serves as the public interface and contract definition for Skia Perf's automated orchestration system. It defines the entry points for complex, long-running processes—such as performance bisection and culprit analysis—that are executed via the Temporal workflow engine.

The primary purpose of this module is to decouple the workflow callers from the workflow implementations. By providing standardized parameter structures and string-based workflow identifiers, it allows various parts of the Skia infrastructure to trigger orchestration logic without needing to import the heavy dependencies of the internal activity and workflow implementations.

Design Decisions and Implementation Choices

Decoupled Service Invocation

The module defines constants for workflow names (e.g., ProcessCulprit, MaybeTriggerBisection). This design choice is critical for Temporal-based systems:

  • Source Independence: It allows client services to start workflows by name without linking against the internal/ implementation code, which typically includes gRPC clients, Gerrit connectors, and complex business logic.
  • Inter-Service Communication: It facilitates a “fire-and-forget” or “fire-and-wait” pattern where the caller only needs to know the “contract” (the parameter and result structs) rather than the “how” of the execution.

Structured Parameter Passing

Rather than passing loose variables, the module defines explicit Param and Result structs for every workflow.

  • Evolutionary Compatibility: Using structs allows for adding optional fields in the future without breaking the function signatures of the workflow callers.
  • Service Discovery: Parameters often include service URLs (e.g., AnomalyGroupServiceUrl). This pushes the responsibility of service location to the caller or the configuration layer, keeping the workflows themselves more generic and testable across different environments.

Key Components

Workflow Definitions (workflows.go)

This file acts as the “API header” for the orchestration layer. It defines two primary workflows:

  • MaybeTriggerBisection:

    • Responsibility: Manages the lifecycle of an anomaly group, deciding whether to initiate a Pinpoint bisection or simply report the anomalies to a developer.
    • Parameters: Requires connectivity information for the Anomaly Group and Culprit services, the specific group ID to process, and the Task Queues where sub-tasks should be routed.
    • Result: Returns a JobId (typically a Pinpoint Job ID) if a bisection was successfully triggered.
  • ProcessCulprit:

    • Responsibility: Handles the post-processing logic once a culprit has been identified by a bisection engine. This includes transforming commit data into internal formats and persisting them.
    • Parameters: Takes a list of commits (using Pinpoint's proto definition) and the associated anomaly group.
    • Result: Returns lists of CulpritIds and IssueIds generated during the persistence and notification phase.

Workflow Orchestration Process

The following diagram illustrates how this module fits into the broader system architecture, acting as the bridge between the service triggering the work and the workers executing it:

  [ Caller Service ]           [ Temporal Cluster ]          [ Perf Worker ]
          |                            |                            |
          |  1. Start Workflow         |                            |
          |     (using Param struct)   |                            |
          +--------------------------->|                            |
          |                            |  2. Schedule Task          |
          |                            +--------------------------->|
          |                            |                            |
          |                            |  3. Execute Implementation |
          |                            |     (defined in /internal) |
          |                            |<---------------------------+
          |  4. Return Result          |                            |
          |     (using Result struct)  |                            |
          |<---------------------------+                            |

Key Submodules

  • internal/: Contains the actual Go logic for the workflows and activities. This is where the gRPC calls to Gerrit, Anomaly Group services, and Culprit services are implemented. It handles the “wait-and-retry” logic and the 30-minute aggregation period for anomalies.
  • worker/: The executable entry point. It registers the implementations from internal/ against the names defined in workflows.go and listens on the Temporal Task Queue for incoming work.

Module: /go/workflows/internal

This module contains the internal Temporal workflow and activity implementations for the Perf orchestration system. It is responsible for the automated lifecycle of performance anomalies—from grouping and initial triage to triggering bisections and notifying users.

High-Level Overview

The module acts as the “glue” between various Skia Perf and Pinpoint services. It orchestrates complex, long-running processes that involve waiting for data stability, interacting with external gRPC services (Anomaly Group, Culprit, and Gerrit), and managing child workflows for performance bisection.

By leveraging Temporal, these workflows provide durability and fault tolerance for operations that can take hours or even days to complete (such as a Pinpoint bisection).

Key Workflows

MaybeTriggerBisectionWorkflow

This is the primary entry point for processing a newly detected anomaly group. It manages the decision logic for how to handle performance regressions.

  1. Wait Period: The workflow begins by sleeping for 30 minutes. This design choice allows the system to aggregate more anomalies into the same group before taking action, preventing redundant bisections or notifications.
  2. Action Dispatch: Based on the GroupAction type of the anomaly group, it branches:
    • BISECT: It resolves the git hashes for the anomaly range via Gerrit, parses benchmark/story metadata, and triggers a Pinpoint CulpritFinderWorkflow as a child workflow. It specifically uses the PARENT_CLOSE_POLICY_ABANDON policy to ensure bisections continue even if the triggering workflow completes.
    • REPORT: It gathers the top anomalies in the group and triggers a user notification (typically a bug report) via the Culprit service.
  3. State Synchronization: After triggering an action, it updates the Anomaly Group service with the resulting Bisection ID or Issue ID.

ProcessCulpritWorkflow

Invoked after a bisection successfully identifies a culprit, this workflow handles the “aftermath” of a find:

  • Data Transformation: It converts Pinpoint-specific commit formats into the internal Culprit service proto format.
  • Persistence: It calls the Culprit service to permanently store the identified culprit.
  • Notification: It triggers the notification logic to alert developers about the specific commit that caused the regression.

Component Responsibilities

Activities

Activities wrap gRPC client calls to external services, providing retry logic and timeout management defined in options.go.

  • AnomalyGroupServiceActivity: Interfaces with the Anomaly Group service to load group metadata, find specific anomalies within a group, and update group status (e.g., attaching a Bisection ID).
  • CulpritServiceActivity: Interfaces with the Culprit service. It handles persisting culprit data and sending notifications for both automated bisection results and manual anomaly reports.
  • GerritServiceActivity: Used primarily to resolve commit positions (integers) into full Git hashes (strings) required by the Pinpoint bisection engine.

Design Decisions and Utilities

  • Legacy Descriptor Mapping: The logic in maybe_trigger_bisection.go (e.g., benchmarkStoriesNeedUpdate) mimics legacy Catapult dashboard behavior. It handles special cases for “System Health” benchmarks where story names require character replacement (e.g., _ to :) to remain compatible with Pinpoint expectations.
  • Statistic Parsing: The system automatically extracts the measurement and statistic (e.g., max, std) from the chart name string. This is necessary because the Perf database often stores these as a combined string, while Pinpoint requires them as separate parameters.
  • Temporal Options: options.go defines strict 1-minute timeouts for gRPC-based activities to ensure the system doesn't hang on network issues, while allowing up to 12 hours for child workflows to accommodate the long compile and execution times of performance tests.

Workflow Logic Diagram

The following diagram illustrates the flow of the MaybeTriggerBisectionWorkflow:

[ Start ]
    |
    v
[ Sleep (30m) ]  <-- Wait for more anomalies to group
    |
    v
[ Load Anomaly Group ]
    |
    +----( GroupAction == BISECT? )----> [ Resolve Git Hashes ]
    |                                          |
    |                                          v
    |                                    [ Trigger Pinpoint ]
    |                                          |
    |                                          v
    |                                    [ Update Group w/ JobID ]
    |
    +----( GroupAction == REPORT? )----> [ Fetch Top 10 Anomalies ]
    |                                          |
    |                                          v
    |                                    [ Notify User / Create Bug ]
    |                                          |
    |                                          v
    |                                    [ Update Group w/ IssueID ]
    v
[ End ]

Module: /go/workflows/worker

Overview

The go/workflows/worker module implements the executable entry point for the Temporal worker responsible for executing Skia Perf's backend automation workflows. It serves as the bridge between the Temporal orchestration engine and the specific business logic required for anomaly detection, bisection triggering, and culprit management.

The primary design goal of this module is to provide a scalable, stateless execution environment. By decoupling the workflow definitions from the service that triggers them, the worker can be scaled independently to handle varying loads of performance analysis tasks.

Architecture and Design Choices

The worker is designed as a long-running daemon that connects to a Temporal cluster. It registers a set of Workflows (stateful orchestrations) and Activities (idempotent units of work) and then listens to a specific task queue for instructions.

Connection Management

The worker establishes a connection to the Temporal service via client.Dial. This connection is configured with custom metrics handling to export Temporal-specific telemetry to Prometheus, ensuring visibility into worker health, task latency, and execution success rates.

Service Registration

The worker registers several domain-specific activities and workflows. The registration process maps internal Go functions to string-based identifiers used by the Temporal cluster to route tasks.

  • Workflow Orchestration: It registers high-level workflows like ProcessCulprit and MaybeTriggerBisection. These functions orchestrate complex, long-running processes that might involve waiting for external signals or timers.
  • Activity Execution: It registers service-specific activities (CulpritServiceActivity, AnomalyGroupServiceActivity, GerritServiceActivity). These are the “muscles” of the system, performing side-effect-heavy operations such as querying databases or interacting with Gerrit for code reviews.
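
The core idea of registration, mapping string identifiers to Go functions so that queued tasks can be routed to them, can be illustrated with a toy registry. The real mapping is performed by the Temporal SDK worker's registration methods; this sketch only demonstrates the routing concept.

```go
package main

import (
	"errors"
	"fmt"
)

// registry maps string task names to Go handlers, loosely analogous to
// how a Temporal worker routes tasks from its queue to registered
// workflow and activity functions.
type registry struct {
	handlers map[string]func() string
}

func newRegistry() *registry {
	return &registry{handlers: map[string]func() string{}}
}

// register links a string identifier to a handler function.
func (r *registry) register(name string, fn func() string) {
	r.handlers[name] = fn
}

// dispatch looks up and runs the handler for a task name, failing
// loudly when nothing was registered under that identifier.
func (r *registry) dispatch(name string) (string, error) {
	fn, ok := r.handlers[name]
	if !ok {
		return "", errors.New("no handler registered for " + name)
	}
	return fn(), nil
}

func main() {
	r := newRegistry()
	r.register("MaybeTriggerBisection", func() string { return "started" })
	out, _ := r.dispatch("MaybeTriggerBisection")
	fmt.Println(out) // started
}
```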

Key Components and Responsibilities

main.go

This file acts as the lifecycle manager for the worker process. Its responsibilities include:

  • Configuration: Parsing flags for the Temporal host/port, namespace, and task queue.
  • Instrumentation: Initializing the Skia common library to set up Prometheus monitoring.
  • Workflow/Activity Mapping: Explicitly linking the generic worker instance to the specific logic defined in the internal package. This creates a clear separation between the “runner” (this module) and the “logic” (the internal module).
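
The configuration step can be sketched with the standard flag package. The flag names and defaults below are illustrative, not the actual ones defined in main.go.

```go
package main

import (
	"flag"
	"fmt"
)

// workerConfig holds the connection settings parsed from the command
// line. Field and flag names are hypothetical stand-ins.
type workerConfig struct {
	HostPort  string
	Namespace string
	TaskQueue string
}

// parseConfig parses the worker's flags from an argument slice, which
// keeps the function testable without touching os.Args.
func parseConfig(args []string) (workerConfig, error) {
	var cfg workerConfig
	fs := flag.NewFlagSet("worker", flag.ContinueOnError)
	fs.StringVar(&cfg.HostPort, "temporal_host_port", "localhost:7233", "Temporal frontend host:port")
	fs.StringVar(&cfg.Namespace, "namespace", "default", "Temporal namespace")
	fs.StringVar(&cfg.TaskQueue, "task_queue", "", "Task queue to poll")
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	cfg, _ := parseConfig([]string{"-task_queue", "grouping"})
	fmt.Println(cfg.HostPort, cfg.TaskQueue) // localhost:7233 grouping
}
```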

Execution Workflow

The following diagram illustrates how the worker interacts with the broader system:

+----------------+       +-------------------+       +-----------------------+
| Temporal Cloud |       |  Worker Process   |       |   Internal Services   |
| (Task Queue)   |       | (worker/main.go)  |       | (internal/activities) |
+-------+--------+       +---------+---------+       +-----------+-----------+
        |                          |                             |
        |  1. Polls for Tasks      |                             |
        |<-------------------------|                             |
        |                          |  2. Executes Activity/WF    |
        |                          |---------------------------->|
        |                          |<----------------------------|
        |  3. Reports Completion   |                             |
        |<-------------------------|                             |

Deployment Context

The worker is packaged as a container (skia_app_container) named grouping_workflow. This naming reflects its primary responsibility: managing the lifecycle of anomaly groups and the resulting workflows that process potential performance culprits. In a production environment, this worker typically runs within Kubernetes, connecting to a centralized Temporal service.

Module: /images

The /images module serves as the centralized repository for graphical assets and brand identity markers used across the project. It provides a single source of truth for logos, icons, and UI-specific graphics, ensuring visual consistency across different sub-modules and user-facing components.

Design Philosophy and Implementation Choices

The module prioritizes scalability and cross-platform compatibility by primarily utilizing the SVG (Scalable Vector Graphics) format. This choice allows assets to be rendered at any resolution without loss of quality, which is critical for high-DPI displays and varied UI contexts.

Raster-to-SVG Wrapping

A notable implementation pattern within this module is the use of SVG wrappers for raster data. Files such as androidx.svg, flutter.svg, and fuchsia.svg contain Base64-encoded PNG data embedded within an SVG <image> tag. This approach was chosen for several reasons:

  • Standardized Interface: It allows the UI rendering engine to treat all icons as SVGs, simplifying the code for icon components.
  • Fixed Aspect Ratios: The viewBox and preserveAspectRatio attributes on the SVG wrapper ensure that raster logos are displayed consistently, regardless of the container's constraints.
  • Styling Consistency: Wrapping raster images in SVGs allows for the application of consistent stroke or border effects (as seen in the grey circular stroke in skia.svg or widevine.svg) directly within the asset file.

Format Diversity

While SVG is the preferred format for logos, the module includes other formats based on specific use cases:

  • WebP/PNG: Used for complex textures or photographs (like germanium.webp or alpine.png) where vectorization would be inefficient or impossible.
  • Simple Vectors: Files like line-chart.svg use pure path data for lightweight, performant UI decorations.

Key Components and Responsibilities

The module is responsible for organizing assets into three functional categories:

  • Ecosystem Branding: Contains the official visual identifiers for the core technologies integrated into the project, such as the Chrome logo, the V8 engine icon, and the Skia graphics library.
  • Platform and Framework Logos: Provides assets for external dependencies and supported platforms, including AndroidX, Flutter, Fuchsia, and Alpine. These are typically used in documentation or system information screens.
  • Application UI Elements: Includes generic icons like line-chart.svg that are used for internal data visualization or navigational markers.

Asset Consumption Workflow

Assets are exposed to the rest of the build system through a central configuration that designates which files are available for external reference. This prevents other packages from accidentally depending on draft or temporary assets.

[ Feature Module ] ----> [ Request: v8.svg ]
                               |
                               v
[ /images Module ] <--- [ BUILD.bazel Exports ]
       |
       +-- (SVG Vector Processing) --> Rendered Icon
       |
       +-- (SVG Raster Decoding)   --> Embedded Bitmap

The use of the exports_files directive in the module's build configuration facilitates this, allowing other packages to consume these specific images as labels without needing access to the entire directory.
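
A minimal form of that build configuration might look like the following, where the listed file names are illustrative:

```starlark
# BUILD.bazel — make specific assets addressable as labels from other packages.
exports_files(
    [
        "v8.svg",
        "line-chart.svg",
    ],
    visibility = ["//visibility:public"],
)
```

Only the files named in `exports_files` become valid labels (e.g. `//images:v8.svg`); anything else in the directory stays private to this package.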

Module: /integration

Perf Integration Data Module

The /integration module provides a controlled set of performance data used to verify the Perf ingestion pipeline and integration features. It serves as a bridge between raw performance results and the high-level analysis tools by providing a predictable, historical baseline of metrics tied to a specific demonstration repository.

Design Philosophy

The module is designed around the principle of traceable performance evolution. Rather than providing static benchmarks, it provides a sequential history that mirrors a real-world software lifecycle.

  • Verifiable Regression Testing: By mapping performance metrics (nanoseconds, memory allocations) to specific git_hash values from the perf-demo-repo, the module allows the system to test its ability to identify performance shifts across commits.
  • Pipeline Robustness: The data set is intentionally heterogeneous. It includes “good” data, a file referencing a non-existent (“bad”) commit, and a malformed JSON file. This design ensures the ingestion logic is tested not just for the “happy path,” but also for graceful error handling and data validation.
  • Dimensional Granularity: Implementation choices in the data schema prioritize multi-dimensional analysis. For example, tracking both min and max values for a single metric allows for testing variance detection (jitter), while splitting memory metrics into kb (size) and num (count) allows for testing the detection of different types of resource leaks.

Key Components and Responsibilities

Data Generation (generate_data.go)

This utility is responsible for maintaining the consistency of the integration test suite. It programmatically generates the JSON artifacts to ensure they adhere to the format.Format schema used by the Perf ingester.

  • Synthetic Variance: The generator injects deterministic but varying values (using the loop index and random offsets) into the measurements. This simulates a real development environment where performance fluctuates slightly or degrades over time, providing the necessary data “noise” to test filtering and alerting algorithms.
  • Commit Mapping: It explicitly links measurements to hashes in the demo repository, ensuring that the integration environment has a valid “source of truth” to query against.
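
The variance-injection strategy can be sketched as below. The arithmetic is a hypothetical stand-in for what generate_data.go actually computes; the point is that a fixed seed makes the "noise" reproducible.

```go
package main

import (
	"fmt"
	"math/rand"
)

// syntheticValues produces one measurement per commit: a slow upward
// drift from the loop index plus seeded pseudo-random jitter. A fixed
// seed keeps the output reproducible across runs, which is what lets
// integration tests assert on exact regression detections.
func syntheticValues(numCommits int, base float64, seed int64) []float64 {
	r := rand.New(rand.NewSource(seed))
	vals := make([]float64, numCommits)
	for i := range vals {
		drift := float64(i) * 0.5           // gradual degradation over commits
		jitter := (r.Float64() - 0.5) * 2.0 // noise in [-1, 1)
		vals[i] = base + drift + jitter
	}
	return vals
}

func main() {
	a := syntheticValues(10, 100, 42)
	b := syntheticValues(10, 100, 42)
	fmt.Println(a[0] == b[0] && a[9] == b[9]) // true: same seed, same data
}
```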

Data Repository (/data)

The data directory acts as a mock “filestore” that an ingester of type dir would monitor.

  • demodata_commit*.json: These files represent the standard ingestion format. Each file encapsulates a snapshot of system performance for a specific hardware configuration (arch: x86, config: 8888) and a specific functional test (test: encode).
  • Negative Test Cases: Includes malformed.json and files with unknown git hashes (e.g., ffff...) to verify that the system correctly identifies and reports data quality issues without halting the ingestion of subsequent valid files.
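
The error tolerance described above can be sketched as a parse loop that records failures and keeps going. File contents are inlined here for the sketch; the real ingester scans the watched directory and parses into the full format.Format schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ingestionFile pairs a file name with its raw contents.
type ingestionFile struct {
	Name string
	Data []byte
}

// ingestAll parses every file, collecting errors instead of aborting,
// so one malformed file cannot block subsequent valid ones.
func ingestAll(files []ingestionFile) (ok int, failed []string) {
	for _, f := range files {
		var parsed map[string]interface{}
		if err := json.Unmarshal(f.Data, &parsed); err != nil {
			failed = append(failed, f.Name) // log the error and continue
			continue
		}
		ok++
	}
	return ok, failed
}

func main() {
	files := []ingestionFile{
		{"demodata_commit1.json", []byte(`{"version": 1}`)},
		{"malformed.json", []byte(`{"version":`)},
		{"demodata_commit2.json", []byte(`{"version": 1}`)},
	}
	ok, failed := ingestAll(files)
	fmt.Println(ok, failed) // 2 [malformed.json]
}
```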

Integration Workflow

The following diagram shows how this module interacts with the broader Perf ecosystem during an integration test:

[ generate_data.go ]
       |
       | (creates)
       v
[ /integration/data/ ] <---------- [ Ingester ('dir' type) ]
       |                                  |
       | (scans filesystem)               | (parses & validates)
       v                                  v
[ Malformed/Bad Hash ]             [ Valid Commit Data ]
       |                                  |
       +--> Log Error                     +--> Map to Git History
       +--> Continue Processing           +--> Update Trace Store
                                          +--> Detect Regressions

Data Schema Logic

The data structure within this module follows a specific hierarchy to support complex queries:

  1. Global Metadata: The top-level Key (e.g., arch, config) defines the environment. Design-wise, this allows the system to separate “what” was tested from “where” it was tested.
  2. Result Key: Each result identifies the specific sub-test (e.g., test: encode), allowing one file to contain multiple independent benchmarks.
  3. Measurements: Measurements are grouped by type (e.g., ns, alloc). Each type contains an array of SingleMeasurement objects, distinguishing between different units or statistical bounds (min, max, count) for that specific metric.

Module: /integration/data

Performance Integration Data

This module serves as a historical repository of performance benchmarks and system metrics, indexed by Git commit hashes. It provides the ground-truth data necessary for detecting regressions, analyzing performance trends over time, and verifying the integration pipeline's ability to handle various data states.

Design Philosophy and Implementation Choices

The data is structured to facilitate automated comparison between software iterations. By decoupling the performance results from the source code and storing them as static JSON artifacts, the system achieves several design goals:

  • Commit-Centric Traceability: Every data point is explicitly linked to a git_hash. This allows the integration engine to map performance spikes or memory leaks directly to specific changes in the codebase.
  • Environmental Context: The key object (containing fields like arch and config) ensures that measurements are not analyzed in a vacuum. It acknowledges that performance is hardware- and configuration-dependent, allowing the consumer to filter results for “apples-to-apples” comparisons.
  • Multi-Dimensional Metrics: Rather than providing a single execution time, the schema separates measurements into categories like alloc (memory footprint) and ns (timing). Each category supports multiple values (e.g., min, max, kb, num), enabling a nuanced view of system behavior, such as identifying increased jitter even if average latency remains stable.
  • Schema Versioning: The inclusion of a version field at the root level allows the integration logic to evolve. If the measurement format changes, the parser can handle legacy data files (like those found in this module) without breaking the analysis pipeline.

Data Schematics and Components

The module's contents represent a sequence of snapshots (demo_data_commit_1.json through demo_data_commit_10.json) showing the evolution of a specific test case, such as the “encode” operation.

Metric Tracking

Measurements are stored in nested arrays to allow for extensibility. For instance, the alloc measurement tracks both the size of memory used (kb) and the count of allocations (num). This distinction is critical for identifying “death by a thousand cuts” scenarios where total memory usage is low, but high allocation frequency causes CPU overhead.

Error Handling and Validation

The presence of malformed.json is a deliberate implementation choice for integration testing. It serves as a negative test case to ensure that any data ingestion service or parser can gracefully handle and report syntax errors in the data stream without crashing the monitoring pipeline.

Performance Data Workflow

The following diagram illustrates how the data in this module is intended to be consumed by an integration or monitoring service:

[ Git Commit ] ----> [ Run Benchmarks ] ----> [ Generate JSON ]
                                                    |
                                                    v
[ Integration Data ] <----------------------- [ /integration/data/ ]
       |
       +--> Compare current git_hash results against previous hashes
       +--> Validate "measurements" (e.g., did "num" of allocs increase?)
       +--> Trigger alerts if metrics exceed defined thresholds

Key Components

  • Result Keys: Found within the results array, these define the specific functional area being tested (e.g., test: encode). This allows a single commit file to store data for multiple distinct sub-systems.
  • Measurement Bounds: By storing both min and max values for nanosecond (ns) timing, the data supports variance analysis. A significant widening of the gap between min and max across commits (as seen between commit 1 and commit 10) indicates decreasing stability in the code path.
  • Linkage: The links field is reserved for cross-referencing external artifacts, such as detailed profiling traces or build logs, though it remains null in the baseline demo sets.

Module: /jupyter

Jupyter Module Documentation

The /jupyter module provides an interface for performing advanced data analysis and visualization of Skia performance data. By leveraging Jupyter Notebooks, it allows developers to move beyond the standard Skia Perf web UI to perform complex calculations, statistical modeling, and custom plotting using the Python data science stack.

Overview

The primary goal of this module is to bridge the gap between the Skia Performance monitoring system (perf.skia.org) and the analytical power of tools like Pandas, NumPy, and Matplotlib.

While the standard Perf UI is excellent for discovering regressions and viewing individual traces, it is not designed for “bulk” analysis—such as calculating the ratio of GPU to CPU performance across hundreds of tests or finding which hardware models exhibit the most noise (coefficient of variation). This module provides the glue code to fetch data from Perf's backend and load it into a Pandas DataFrame for such tasks.

Design and Implementation

The implementation centers around an asynchronous request-and-poll pattern to interact with the Skia Perf API.

Data Retrieval Workflow

Accessing data follows a specific sequence to ensure the notebook remains responsive and handles the potentially large datasets stored in Perf:

  1. Context Initialization: The system first queries the /_/initpage/ endpoint. This is a design choice to automatically discover the current “window” of data (the most recent commits) and the available paramset (all valid keys and values like model, test, device, etc.).
  2. Request Initiation: A request is sent to /_/frame/start. This does not return data immediately; instead, it triggers a long-running query on the server and returns a unique ID.
  3. Status Polling: The module polls /_/frame/status/<id> until the server reports success. This prevents notebook timeouts during heavy calculations on the server side.
  4. Data Transformation: Once ready, the JSON results are fetched and converted into a Pandas DataFrame. The system explicitly handles “missing” or “sentinel” values (e.g., 1e32) by converting them to NaN (Not a Number), ensuring that standard statistical functions like .mean() or .std() work correctly without being skewed by invalid data points.

Key Components

Core API Functions (Perf+Query.ipynb)

The logic is encapsulated in two primary entry points that abstract away the HTTP communication:

  • perf_query(query): Used for selecting raw traces based on metadata (e.g., source_type=skp&sub_result=min_ms). This is the programmatic equivalent of the “Query” dialog in the Perf UI.
  • perf_calc(formula): Used for server-side processing using Skia Perf's functional query language. This allows the server to perform operations like ave(), count(), or ratio() before sending the result to the notebook, which is more bandwidth-efficient than downloading all raw data.

Environment Management (README.md)

Because data science dependencies (like scipy and matplotlib) can be sensitive to system-level Python versions, the module advocates for a Virtualenv-based deployment. This ensures that the analytical environment remains isolated from the system's Python installation and that all required libraries are pinned to versions compatible with the provided notebooks.

Key Workflows

Standard Data Pipeline

This diagram illustrates how data flows from the Skia Perf servers into a local visualization.

[ Jupyter Notebook ]          [ Skia Perf Server ]
        |                              |
        |--- 1. POST (Query/Formula) ->|
        |                              |-- 2. Process Request --|
        |<-- 3. Return Query ID -------|                        |
        |                              |                        |
        |--- 4. GET (Poll Status) ---->|                        |
        |<-- 5. "Still Working" -------|                        |
        |              ...             |                        |
        |--- 6. GET (Poll Status) ---->|                        |
        |<-- 7. "Success" -------------|                        |
        |                              |                        |
        |--- 8. GET (Fetch Results) -->|                        |
        |<-- 9. JSON Traceset ---------|                        |
        |                              |
[ Parse JSON to Pandas ]
        |
[ Generate Matplotlib Plot ]

Analysis Examples

The module provides pre-configured examples for common “Why” questions:

  • Noise Analysis: Iterating through hardware models to calculate the average coefficient of variation, helping identify flaky lab hardware.
  • Performance Ratios: Calculating the ratio between CPU and GPU execution times for specific sets of SKP (Skia Picture) files to identify rendering bottlenecks.
  • Normalization: Using Pandas to normalize disparate traces to a mean of 0 and a standard deviation of 1, allowing for the visual comparison of the “shape” of performance changes across different tests regardless of their absolute scale.

Module: /lint

High-Level Overview

The /lint module provides a specialized reporting interface for static code analysis, specifically designed to integrate with JSHint. Its primary purpose is to bridge the gap between raw analysis data and a human-readable, machine-parseable terminal output. Instead of relying on default verbose formats, this module implements a custom reporting logic that prioritizes clarity and precision in locating syntax errors or stylistic inconsistencies.

Design Rationale: Streamlined Feedback

The core design philosophy behind this module is minimalist observability. In many build environments, linting output can become cluttered with metadata that obscures the actual location of a bug. The implementation in /lint/reporter.js focuses on a “one-line-per-error” strategy.

By standardizing the output format to file:line:character reason, the module ensures that:

  1. Developer Cognition is optimized: Developers can quickly scan the left-hand side for file locations.
  2. Tooling Integration: The format is intentionally compatible with terminal emulators and IDEs that support “click-to-open” functionality for file paths.
  3. Actionable Summaries: The reporter concludes with a singular count of total errors, providing an immediate “pass/fail” signal for CI/CD pipelines.

Key Components and Implementation

Output Formatting Logic (reporter.js)

The module exports a single reporter function expected by the JSHint API. Its responsibility is to iterate through a collection of error objects and transform them into a cohesive string buffer.

  • String Aggregation: Rather than calling console.log for every individual error—which can lead to performance bottlenecks and interleaved output in asynchronous environments—the module aggregates all results into a single string.
  • Buffer Flushing: Output is sent directly to process.stdout.write. This choice avoids the trailing newline logic inherent in console.log, allowing the module complete control over the vertical spacing of the final report.
  • Pluralization Logic: A small but significant detail in the summary implementation is the conditional suffixing of the “error” count. This ensures that the summary remains grammatically correct whether there is a single violation or hundreds, maintaining a professional interface for the end-user.

Workflow Process

The following diagram illustrates how data flows from the static analysis tool through this module to the user's terminal:

[ JSHint Engine ]
        |
        | (Raw Result Array)
        v
[ /lint/reporter.js ]
        |
        |-- For Each Result:
        |   Extract: { file, line, character, reason }
        |   Format:  "file:line:char reason"
        |
        |-- Finalize:
        |   Append Total Count Summary
        v
[ System Stdout ] -> (Displayed to Developer)

Key File Responsibilities

  • reporter.js: Functions as the primary entry point. It contains the transformation logic that maps JSHint's internal error representation to the standardized string format used across the project. It is responsible for the final presentation layer of the linting process.

Module: /modules

Perf Modules Documentation

The /modules directory contains the frontend architecture of the Skia Perf application. It is built as a collection of modular Custom Elements (using Lit) and specialized utility libraries that coordinate performance data querying, time-series visualization, and anomaly triage.

High-Level Overview

The module architecture is designed to handle massive-scale performance telemetry by separating data management from visual presentation. The system revolves around three core pillars:

  1. Data & State: Managing complex DataFrame objects (time-series) and reflecting application state (queries, zoom levels) in the browser URL.
  2. Visualization: High-performance charting engines that overlay statistical anomalies and user-filed bugs onto performance traces.
  3. Triage Workflow: Specialized dialogs and tables that allow “Sheriffs” to investigate regressions, file bugs in external trackers (Buganizer), and trigger automated bisections (Pinpoint).

Design Philosophy

  • URL as Source of Truth: Most modules utilize stateReflector to ensure that every view—including specific zooms, selected traces, and active filters—is shareable via a deep link.
  • Asynchronous Progress: Long-running backend tasks (clustering, dry-runs, data fetching) utilize a standardized progress polling mechanism to keep the UI responsive.
  • Context-Driven Data: Modules often leverage Lit Context to share DataFrame and AnomalyMap data across deeply nested component trees without “prop-drilling.”
  • Composition over Monoliths: Complex pages (like Explore or Triage) are composed of smaller, reusable primitives (like query-chooser-sk or triage-status-sk).

Key Components and Submodules

1. Data Visualization Engine

The charting infrastructure is split between raw data processing and visual rendering.

  • dataframe: Manages the lifecycle of performance data. It handles joining multiple data chunks, padding missing values with MISSING_DATA_SENTINEL, and providing the data to charts via context.
  • plot-google-chart-sk: The primary interactive chart. It uses a layered approach where the lines are SVG-based (Google Charts), but interactive elements like anomalies and bug icons are HTML overlays to maintain performance during panning.
  • explore-simple-sk: The central orchestrator for data exploration. It combines the chart, a navigation summary (plot-summary-sk), and the query interface.
  • plot-summary-sk: Provides a “bird's-eye view” of long-range data. It implements Min-Max downsampling to ensure peaks and valleys remain visible even when thousands of points are condensed into a small sparkline.
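
Min-Max downsampling preserves spikes by emitting each bucket's minimum and maximum rather than an average. A minimal sketch of the idea (not the component's actual TypeScript implementation):

```go
package main

import "fmt"

// downsampleMinMax reduces a series to roughly 2*buckets points by
// keeping each bucket's minimum and maximum, so peaks and valleys
// survive a reduction that a plain average would smooth away.
func downsampleMinMax(data []float64, buckets int) []float64 {
	if buckets <= 0 || len(data) <= 2*buckets {
		return data // already small enough; nothing to condense
	}
	out := make([]float64, 0, 2*buckets)
	size := float64(len(data)) / float64(buckets)
	for b := 0; b < buckets; b++ {
		start, end := int(float64(b)*size), int(float64(b+1)*size)
		if end > len(data) {
			end = len(data)
		}
		lo, hi := data[start], data[start]
		for _, v := range data[start+1 : end] {
			if v < lo {
				lo = v
			}
			if v > hi {
				hi = v
			}
		}
		out = append(out, lo, hi)
	}
	return out
}

func main() {
	data := []float64{1, 2, 9, 2, 1, 1, 2, 1, 0, 2}
	fmt.Println(downsampleMinMax(data, 2)) // [1 9 0 2]
}
```

Note that the spike (9) and the dip (0) both survive, even though eight of the ten points were dropped.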

2. Anomaly Detection and Triage

These modules facilitate the transition from identifying a “spike” to resolving a performance regression.

  • anomalies-table-sk: A sophisticated management table that groups related anomalies (e.g., by benchmark or revision range) to allow for bulk triaging.
  • triage-menu-sk: A contextual popup used to “nudge” anomaly boundaries, ignore false positives, or initiate bug filing.
  • new-bug-dialog-sk & existing-bug-dialog-sk: Specialized modals that automate the boilerplate of reporting issues by pre-filling titles and metadata derived from the anomaly's trace parameters.
  • bisect-dialog-sk & pinpoint-try-job-dialog-sk: Integration points for Chrome-specific debugging tools, allowing users to trigger A/B bisections directly from a regression point.

3. Querying and Filtering

Navigating millions of traces is handled through hierarchical and summary-based components.

  • test-picker-sk: A “drill-down” interface that guides users through valid parameter combinations (e.g., selecting a benchmark reveals only the bots that ran it).
  • query-sk & query-chooser-sk: The standard multi-select filter interface used to build complex trace queries.
  • paramset-sk: A read-only visualization of a query, used to summarize what data an alert or a graph is currently showing.

4. Infrastructure and Shell

  • perf-scaffold-sk: The master template providing navigation, theme switching (dark/light mode), and authentication integration. It supports both a “Legacy” sidebar and a modern “V2” header layout.
  • telemetry: A buffered reporting system that tracks application performance and user actions, flushing data in batches to minimize network overhead.
  • common: A utility layer containing the ShortcutRegistry for global hotkeys (e.g., p for positive triage) and plot-builder for transposing backend data into chartable formats.

Key Workflows

Data Exploration and Refinement

The system uses a reactive loop to update visualizations as users filter data.

[ User Interaction ] -> [ test-picker-sk ] -> [ Update URL State ]
                                |                     |
                                v                     v
[ explore-simple-sk ] <--- [ query string ] <--- [ stateReflector ]
       |
       |-- 1. requestFrame() (via DataService)
       |-- 2. startRequest() (Polling progress)
       |-- 3. merge results into [ DataFrameRepository ]
       |-- 4. render [ plot-google-chart-sk ]

Anomaly Triage Sequence

Sheriffs move from high-level alerts to specific code changes through integrated navigation.

[ triage-page-sk ] (Matrix of commits vs alerts)
       |
       |-- Click Status Icon --> [ cluster-summary2-sk ] (View Centroid)
                                           |
       +-----------------------------------+-----------------------------------+
       |                                   |                                   |
[ Triage Action ]                  [ Investigation ]                  [ External Link ]
(Mark Pos/Neg)                     (View on Dashboard)                (Link to Gitiles)
       |                                   |                                   |
       v                                   v                                   v
[ POST /_/triage/ ]                [ explore-simple-sk ]               [ Source Browser ]

Coordinate Transformation

Because the UI combines SVG charts and HTML overlays, many modules (like chart-tooltip-sk and plot-google-chart-sk) perform “Pixel to Data” translations.

[ Mouse Hover X/Y ]
       |
       v
[ ChartLayoutInterface ] -> [ Data Value (Commit/Date) ]
                                      |
                                      v
[ lookupCids() ] ----------> [ Git Hash / Author / Message ]
                                      |
                                      v
[ commit-range-sk ] -------> [ HTML Link to Source ]
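
At its core, the "Pixel to Data" step is a linear interpolation between the chart area's pixel bounds and the data domain. The following Go sketch illustrates the mapping; in the actual modules this is delegated to the chart library's layout interface rather than computed by hand.

```go
package main

import "fmt"

// pixelToCommit maps a mouse x coordinate inside the chart area to a
// commit offset via linear interpolation between the chart's left and
// right pixel edges and the first and last visible commits.
func pixelToCommit(px, chartLeft, chartRight float64, firstCommit, lastCommit int64) int64 {
	if chartRight == chartLeft {
		return firstCommit // degenerate chart area; avoid division by zero
	}
	frac := (px - chartLeft) / (chartRight - chartLeft)
	return firstCommit + int64(frac*float64(lastCommit-firstCommit)+0.5)
}

func main() {
	// A chart area spanning pixels 50..1050 showing commits 1000..2000.
	fmt.Println(pixelToCommit(550, 50, 1050, 1000, 2000)) // 1500
}
```

The resulting CommitNumber is what gets handed to lookupCids() for resolution into a full commit.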

Internal Infrastructure Details

  • json: Contains the “Source of Truth” for data structures, automatically generated from the Go backend to ensure type safety across the network.
  • cid: A specialized resolution service that translates sequential CommitNumbers (used for storage efficiency) into full Commit metadata.
  • themes: A delta-based styling layer that extends the shared Skia infrastructure with Perf-specific color palettes and spacing resets.
  • errorMessage: A global utility that captures both application errors and network failures, displaying them in a persistent <error-toast-sk> until dismissed.

Module: /modules/alert-config-sk

alert-config-sk

The alert-config-sk module provides a comprehensive configuration interface for managing performance regression alerts in the Perf system. It allows users to define which traces to monitor, how to detect anomalies, and what actions to take when a regression is identified.

Overview

This element serves as the primary editor for Alert configurations. It maps complex JSON configuration objects to a user-friendly form, handling the conditional logic required by different detection algorithms and notification strategies.

The design emphasizes data binding between the UI and a central Alert object. Changes in the UI immediately update the underlying object, which can then be persisted to the backend.

Key Components and Responsibilities

State Management (Alert and ParamSet)

The module's primary inputs are the config (the alert definition) and the paramset (the available keys and values in the performance database).

  • config: An object implementing the Alert interface. The element provides setters/getters that ensure default values (like radius or interesting thresholds) are populated from global settings if missing.
  • paramset: Used to populate the query-chooser-sk and the “Group By” multi-select options, allowing the user to filter traces based on actual metadata present in the system.

Dynamic Regression Detection

The complexity of regression detection is managed through two coordinated selections:

  • Grouping (algo-select-sk): Determines if traces are clustered (K-Means) before analysis or if each trace is analyzed individually.
  • Step Detection (select-sk): Allows the user to choose the mathematical model for finding regressions (e.g., Cohen's d, Mann-Whitney U, or Absolute magnitude).

The UI dynamically updates the Threshold label and units based on the selected Step Detection algorithm using a thresholdDescriptors map. This ensures users provide inputs that make sense for the chosen math (e.g., “standard deviations” for Cohen's d vs “alpha” for Mann-Whitney).
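A thresholdDescriptors-style lookup might look like the sketch below; the keys and wording here are illustrative assumptions, not the exact map used by alert-config-sk:

```typescript
// Maps each step-detection algorithm to the label/units shown next to the
// Threshold input, so users enter values in the right scale.
type StepAlgo = 'cohen' | 'mannwhitneyu' | 'absolute' | 'percent';

const thresholdDescriptors: Record<StepAlgo, { label: string; units: string }> = {
  cohen: { label: 'Threshold', units: 'standard deviations' },
  mannwhitneyu: { label: 'Threshold', units: 'alpha' },
  absolute: { label: 'Threshold', units: 'absolute change' },
  percent: { label: 'Threshold', units: 'percent change' },
};

function thresholdHint(algo: StepAlgo): string {
  const d = thresholdDescriptors[algo];
  return `${d.label} (${d.units})`;
}
```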

Conditional Workflows

The element's layout changes based on the global window.perf configuration and user selections:

  • Notifications: Depending on whether window.perf.notifications is set to html_email or markdown_issuetracker, the element displays either email recipient fields or Issue Tracker component IDs.
  • Alert Actions: If window.perf.need_alert_action is enabled, it exposes options for automated behaviors like filing bugs or triggering Pinpoint bisections.
  • Testing: Integrated “Test” buttons allow users to validate Bug URI templates or notification destinations against the backend API (/_/alert/bug/try and /_/alert/notify/try) before saving the config.

Implementation Details

Data Flow

The element uses a “top-down” data flow for configuration and “bottom-up” for updates via event listeners:

[Parent Component]
      | (sets .config and .paramset)
      v
[alert-config-sk]
      |
      +-- @input / @change events --> [Updates internal _config]
      |
      +-- [query-chooser-sk] --------> (updates _config.query)
      |
      +-- [algo-select-sk] ----------> (updates _config.algo)

Key Files

  • alert-config-sk.ts: Contains the main logic for the Lit-based element, including the conditional rendering logic and API calls for testing templates.
  • alert-config-sk.scss: Defines the layout, ensuring that nested controls (like spinners and labels) are indented and styled consistently with the Perf theme.
  • alert-config-sk-demo.ts: Provides a sandbox for testing various UI states (e.g., toggling “Group By” or switching between email/issue tracker notifications) without a full backend.

Design Decisions

  • Global Config Dependency: The element relies on window.perf for environment-specific flags. This allows the same UI component to behave differently across different Perf instances (e.g., some instances might not support bisection).
  • Validation: For critical fields like the Issue Tracker Component ID, the element uses HTML5 pattern validation (\d+) and triggers an errorMessage toast on invalid input to prevent malformed data from being sent to the server.
  • Property Upgrading: The connectedCallback uses _upgradeProperty for config and paramset. This ensures that if the properties were set before the custom element was defined, the values are correctly captured and rendered.
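A generic version of the well-known property-upgrade pattern is sketched below: if a property was assigned to the instance before the class (with its accessors) was defined, it is deleted and re-applied so the setter runs.

```typescript
// Generic sketch of the _upgradeProperty pattern for custom elements.
function upgradeProperty<T extends object>(el: T, prop: keyof T): void {
  if (Object.prototype.hasOwnProperty.call(el, prop)) {
    // The instance's own data property shadows the prototype accessor.
    const value = el[prop];
    delete (el as Record<PropertyKey, unknown>)[prop];
    el[prop] = value; // Now routed through the class setter.
  }
}
```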

Module: /modules/alerts-page-sk

The alerts-page-sk module provides a comprehensive interface for managing performance alert configurations within the Perf application. It allows users to view, create, edit, and archive alert rules that monitor trace data for anomalies.

Design and Architecture

The module is designed around a centralized management table. It acts as a bridge between the backend alert storage and the alert-config-sk component, which handles the complex logic of individual alert parameterization.

Key design choices include:

  • Role-Based Access Control: The component integrates with the alogin-sk module to determine if a user has the “editor” role. Actions like “New”, “Edit”, and “Delete” are restricted or disabled for non-editors.
  • Modality for Configuration: To keep the list view clean, all editing and creation happen within a <dialog> element. This dialog wraps the alert-config-sk element, ensuring a consistent experience between creating a brand-new alert and modifying an existing one.
  • Dynamic UI Adjustments: The page adapts its table headers and content based on global window.perf configurations (e.g., changing “Alert” to “Component” if issue tracker integration is enabled).
  • State Transparency: The module supports viewing archived (deleted) configurations through a toggle, and it provides immediate visual feedback for invalid configurations (e.g., missing queries).

Key Components and Files

  • alerts-page-sk.ts: The core logic of the page. It manages the lifecycle of the alert list, including fetching data from /_/alert/list/, handling the state of the editing dialog, and performing CRUD operations via fetch requests to the backend.
  • alerts-page-sk.scss: Defines the layout for the management table, specifically handling overflow and ellipsis for long query strings to ensure the table remains readable even with complex alert rules.
  • alerts-page-sk-demo.ts: Provides a robust mocked environment for development, simulating various backend responses for alert lists, login statuses, and trace counts.

Key Workflows

Alert Editing Workflow

When a user interacts with the alert list, the module manages the state transition from a read-only list to an interactive configuration form.

[Alerts Table] --(Click Edit)--> [Fetch Current Config]
                                        |
                                        v
[List View] <---(Cancel)--- [Modal Dialog (alert-config-sk)]
      ^                                 |
      |                            (Modify & Accept)
      |                                 |
      +-------(Post to /update) <-------+

Deep Linking

The module supports deep linking. If the page is loaded with a search query (e.g., /a/?5646874153320448), the openOnLoad method automatically identifies the matching alert and opens the edit dialog immediately upon data retrieval.

Dry Run Integration

Every alert in the table includes a “Dry Run” link. This utilizes the dryrunUrl helper to convert the alert's configuration into a URL query string, redirecting the user to the Explore page (/d/) to visualize exactly what data the alert would trigger on before saving changes.
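The URL construction can be sketched roughly as below; the exact query parameter names used by dryrunUrl are assumptions:

```typescript
// Sketch of converting an alert config into an Explore-page dry-run URL.
interface DryRunAlert {
  query: string;
  algo: string;
  radius: number;
}

function dryrunUrl(alert: DryRunAlert): string {
  const params = new URLSearchParams({
    query: alert.query,
    algo: alert.algo,
    radius: String(alert.radius),
  });
  return `/d/?${params.toString()}`;
}
```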

External Dependencies and Interfaces

  • alert-config-sk: Used as the internal editor for alert details.
  • paramset-sk: Used in the table to provide a summarized view of the alert's query.
  • Backend Endpoints:
    • /_/alert/list/{showDeleted}: Retrieves the set of alerts.
    • /_/alert/new: Fetches a default skeleton for a new alert.
    • /_/alert/update: Saves a modified or new alert.
    • /_/alert/delete/{id}: Archives an alert.

Module: /modules/algo-select-sk

algo-select-sk

The algo-select-sk module provides a custom UI component for choosing between different anomaly detection or clustering algorithms in Perf. It acts as a specialized wrapper around the generic select-sk component, providing a type-safe and domain-specific interface for algorithm selection.

Design and Implementation

The module is designed to bridge the gap between low-level UI selection (indexes) and high-level application logic (algorithm names).

State Management

The component uses the algo attribute/property as its source of truth. It supports two primary algorithms defined in the ClusterAlgo type:

  • kmeans: Groups traces by shape and looks for steps within the cluster centroids.
  • stepfit: Analyzes each individual trace for steps independently.

To ensure robustness, the component implements a fallback mechanism. Any invalid string provided to the algo attribute is automatically coerced to kmeans via the internal toClusterAlgo utility.
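Assuming only the two algorithms listed, the fallback can be expressed in a few lines:

```typescript
// Minimal sketch of the toClusterAlgo coercion described above.
type ClusterAlgo = 'kmeans' | 'stepfit';

function toClusterAlgo(value: string): ClusterAlgo {
  // Anything that is not a recognized algorithm falls back to kmeans.
  return value === 'stepfit' ? 'stepfit' : 'kmeans';
}
```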

Component Interaction

Instead of exposing the raw select-sk child, algo-select-sk encapsulates the selection logic. It listens for selection-changed events from its internal select-sk element, maps the selected index to a ClusterAlgo value, and dispatches a domain-specific algo-change event.

[ User Clicks ] -> [ select-sk (index) ] -> [ algo-select-sk (mapping) ] -> [ algo-change Event ]

Key Components

AlgoSelectSk

Located in algo-select-sk.ts, this is the main class for the element.

  • Attributes/Properties: Reflects the algo state. Updating the property updates the attribute and triggers a re-render.
  • Template: Uses lit to render a select-sk containing two options. It uses the ?selected directive to synchronize the internal state of the options with the component's algo property.
  • Event Handling: The _selectionChanged method translates the numerical index from the underlying selector into a string value (kmeans or stepfit) by querying the value attribute of the child div elements.

Events

  • algo-change: This is the primary output of the component. The event detail contains an object of type AlgoSelectAlgoChangeEventDetail:
    {
      algo: 'kmeans' | 'stepfit';
    }
    

Testing and Demonstration

  • Demo Page: algo-select-sk-demo.html and .ts show the component in various states (default, pre-selected, and dark mode) and log event details to the screen when selections change.
  • Unit Tests: algo-select-sk_test.ts validates the attribute-to-property reflection, the fallback logic for invalid inputs, and the correct dispatching of events.
  • Integration Tests: algo-select-sk_puppeteer_test.ts performs visual regression testing using Puppeteer to ensure the component renders correctly and responds to clicks in a real browser environment.

Module: /modules/anomalies-table-sk

Anomalies Table Module

The anomalies-table-sk module provides a comprehensive, interactive table for visualizing, grouping, and triaging performance anomalies detected in the Perf system. It serves as a central hub for users to review regression and improvement alerts, manage associated bugs, and navigate to detailed graphical reports.

Overview

The primary component, AnomaliesTableSk, renders a list of anomalies and provides tools to manipulate their presentation. Rather than a flat list, the table utilizes a sophisticated grouping logic to combine related anomalies, reducing visual clutter and allowing bulk actions.

Design Principles

  • Group-First Workflow: Large sets of anomalies are often related (e.g., the same regression across multiple bots). The table defaults to grouped views to allow users to triage entire sets of alerts simultaneously.
  • State Separation: Selection, grouping, and navigation logic are decoupled into specific controllers (SelectionController, AnomalyGroupingController, ReportNavigationController) to manage complexity.
  • Contextual Triaging: Integrates directly with the triage menu and bug tooltips, allowing users to file bugs, associate alerts with existing issues, or ignore false positives without leaving the context of the list.

Key Components and Responsibilities

AnomaliesTableSk (anomalies-table-sk.ts)

The main UI element. It orchestrates the rendering of table rows, handles keyboard shortcuts (like p for filing a bug or g for graphing), and manages the “Triage Selected” popup. It delegates data processing to sub-controllers while maintaining the visual state of the table (expanded/collapsed groups).

Anomaly Grouping Controller (anomaly-grouping-controller.ts)

Manages how anomalies are aggregated into table rows. It persists user preferences for grouping (e.g., “Group by Benchmark” or “Exact Revision Match”) in localStorage.

The grouping logic follows a specific hierarchy:

  1. Bug ID: Anomalies already associated with a specific bug are always grouped together.
  2. Revision Mode: Remaining anomalies are grouped by their commit range based on three modes:
    • EXACT: Ranges must be identical.
    • OVERLAPPING: Ranges that share any commit.
    • ANY: All anomalies are considered a single group.
  3. Attribute Splitting: Revision groups can be further subdivided by BENCHMARK, BOT, or TEST.
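The hierarchy above can be sketched as a grouping-key function. The field names and modes are simplified assumptions; OVERLAPPING is omitted because it requires pairwise range comparison rather than a single key:

```typescript
// Illustrative grouping key following the bug-ID > revision > attribute hierarchy.
type RevisionMode = 'EXACT' | 'ANY';

interface GroupableAnomaly {
  bugId: number | null;
  startRev: number;
  endRev: number;
  benchmark: string;
}

function groupKey(a: GroupableAnomaly, mode: RevisionMode, splitByBenchmark: boolean): string {
  if (a.bugId !== null) return `bug:${a.bugId}`; // Rule 1: bug ID always wins.
  const rev = mode === 'EXACT' ? `${a.startRev}-${a.endRev}` : 'any'; // Rule 2.
  return splitByBenchmark ? `${rev}/${a.benchmark}` : rev; // Rule 3.
}
```

Anomalies that produce the same key land in the same table row.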

Report Navigation Controller (report-navigation-controller.ts)

Handles the transition from the table to the “Explore” (graphing) pages. It manages:

  • URL Generation: Constructing complex URLs for multi-graph views.
  • SID Management: When a list of anomaly IDs is too long for a standard URL, it interacts with the /_/anomalies/group_report API to obtain a Session ID (SID) which represents the collection.
  • Time Range Calculation: Automatically adds a week of padding before and after an anomaly's range to provide historical context on the generated graphs.

Anomaly Transformer (anomaly-transformer.ts)

A utility class responsible for converting raw data into displayable strings and determining summary values for collapsed groups.

  • Longest Sub-test Path: For groups containing different sub-tests, it finds the longest common path and appends a * (e.g., test1/sub1 and test1/sub2 become test1/sub*).
  • Summary Delta: Determines which percentage change to display on a group row (prioritizing the largest regression magnitude).
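The longest-common-path summarization can be sketched as a character-level prefix with a trailing wildcard:

```typescript
// Returns the longest common character prefix shared by all paths.
function longestCommonPrefix(paths: string[]): string {
  let prefix = paths[0] ?? '';
  for (const p of paths.slice(1)) {
    let i = 0;
    while (i < prefix.length && i < p.length && prefix[i] === p[i]) i++;
    prefix = prefix.slice(0, i);
  }
  return prefix;
}

// Summarizes a group of sub-test paths, e.g. test1/sub1 + test1/sub2 -> test1/sub*.
function summarizeSubtests(paths: string[]): string {
  if (paths.length === 0) return '';
  const prefix = longestCommonPrefix(paths);
  // If every path is identical there is nothing to abbreviate.
  return paths.every((p) => p === prefix) ? prefix : prefix + '*';
}
```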

Anomalies Grouping Settings (anomalies-grouping-settings-sk.ts)

A configuration panel embedded within the table header that allows users to toggle the grouping criteria and revision modes described above.

Key Workflows

Selection and Bulk Action

The table uses a SelectionController to track which anomalies are currently active. Selection state flows from the UI to the controller, which then triggers a re-render to update checkbox states (including indeterminate states for partially selected groups).

User Interaction (Checkbox Click)
       |
       v
SelectionController updates Set<Anomaly>
       |
       v
LitElement (Table) calls requestUpdate()
       |
       +-----> Update Header Checkbox (All/None/Indeterminate)
       +-----> Update Group Summary Checkboxes
       +-----> Update Action Buttons (Triage/Graph Enabled State)

Anomaly Triaging

When a user triages a group or selection, the table interacts with the TriageMenuSk.

[Select Anomalies] -> [Click Triage Selected] -> [triage-menu-sk appears]
                                                       |
        +----------------------------------------------+
        |                      |                       |
 [File New Bug]        [Existing Bug]          [Ignore Anomaly]
        |                      |                       |
  Opens Dialog          Lists Associated        Sends 'RESET' or
                        Issues from API         'IGNORE' to backend

Graphical Investigation

Clicking the “Chart” icon or the “Graph Selected” button initiates a navigation workflow:

  1. Request Group Report: Backend provides a timerange_map for the selected anomalies.
  2. Shortcut Update: The ReportNavigationController calls /_/shortcut/update to store the specific graph configurations.
  3. Redirect: The browser opens a new tab to /m/?shortcut=[id]&begin=[start]&end=[end].
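Combining step 3 with the week of padding described for the Report Navigation Controller, the generated URL can be sketched as:

```typescript
// One week of context is added on each side of the anomaly's time range.
const WEEK_SECONDS = 7 * 24 * 60 * 60;

function multiGraphUrl(shortcutId: string, beginSec: number, endSec: number): string {
  return `/m/?shortcut=${shortcutId}&begin=${beginSec - WEEK_SECONDS}&end=${endSec + WEEK_SECONDS}`;
}
```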

Module: /modules/anomaly-playground-sk

anomaly-playground-sk

The anomaly-playground-sk module provides an interactive environment for testing and tuning anomaly detection algorithms within the Perf ecosystem. It serves as a “sandbox” where developers and data scientists can input arbitrary trace data, apply various statistical detection methods, and visualize the results in real-time without needing to modify production alerts or wait for new data ingestion.

High-Level Overview

This module bridges the gap between algorithm development and visualization. It wraps a specialized instance of the explore-simple-sk component to provide a familiar graphing interface, while adding a control panel for manual data entry and parameter manipulation.

The primary goal is to allow users to answer questions like:

  • “Would this specific shift be caught by the mannwhitneyu algorithm with a threshold of 3.0?”
  • “How does changing the radius affect the sensitivity of detection on noisy data?”
  • “Is a particular jump considered an improvement or a regression based on the expected direction?”

Design Decisions

Data Input and Mocking

Unlike the main Explore page which queries a backend database for historical traces, the playground allows for direct manual input via a comma-separated list of values. This design choice facilitates rapid prototyping of edge cases. When a user inputs data:

  1. The component generates a synthetic DataFrame.
  2. It creates mock CommitNumber and TimestampSeconds headers for each data point to satisfy the requirements of the graphing engine.
  3. It assigns a static trace key (,name=playground,) to the data.
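The three steps above can be sketched with simplified stand-ins for the DataFrame and ColumnHeader shapes (the real types live in the shared JSON definitions):

```typescript
// Simplified synthetic frame: one mock commit/timestamp pair per data point,
// all values under the static playground trace key.
interface SyntheticFrame {
  header: { offset: number; timestamp: number }[];
  traceset: Record<string, number[]>;
}

function parseTrace(input: string, startTimestamp = 1_700_000_000): SyntheticFrame {
  const values = input
    .split(',')
    .map((s) => parseFloat(s.trim()))
    .filter((v) => !Number.isNaN(v));
  return {
    header: values.map((_, i) => ({ offset: i, timestamp: startTimestamp + i })),
    traceset: { ',name=playground,': values },
  };
}
```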

Component Integration

The module leverages explore-simple-sk as its visualization engine rather than re-implementing graphing logic. To make it behave like a “playground” rather than a search tool, several features of the child component are programmatically disabled or hidden:

  • openQueryByDefault is set to false (no need to search a database).
  • showHeader and navOpen are disabled to maximize space for the playground controls.
  • disablePointLinks is enabled because synthetic data points do not link to real Git commits.

State Reflection

The component uses stateReflector to sync the current playground configuration (the trace string, algorithm, radius, threshold, etc.) with the URL's query parameters. This allows researchers to share a specific “scenario” by simply copying and pasting the URL.

Key Workflows

The Detection Process

The workflow follows a standard Input -> Configure -> Request -> Visualize cycle:

[ User Input ] ----> [ Input Parser ] ----> [ Local Plotting ]
      |                                           |
      |                                           v
      |                          [ explore-simple-sk Graph ]
      v                                           ^
[ Param Controls ]                                |
(Algo, Radius, etc)                               |
      |                                           |
      v                                           |
[ "Detect" Click ] --> [ Backend API Request ] ---+
                         (/_/playground/anomaly/v1/detect)

  1. Plotting: As the user types into the text area, the component immediately updates the graph. This is a local operation that transforms the string into a DataFrame.
  2. Validation: The “Detect” button is dynamically enabled/disabled based on whether the required parameters (Algorithm, Radius, Threshold) are valid numbers and selections.
  3. Detection: When “Detect” is triggered, the trace data and parameters are sent to the backend. The backend returns a list of Anomaly objects.
  4. Integration: The component transforms these anomalies into an anomalymap, determines if they are “improvements” based on the selected direction, and calls UpdateWithFrameResponse on the graph to render the familiar red/grey circles on the trace.

Key Components and Files

  • anomaly-playground-sk.ts: The main logic hub. It manages the lifecycle of the synthetic DataFrame, handles synchronization between the UI inputs and the URL state, and coordinates communication with the detection API.
  • explore-simple-sk (External Dependency): While not in this directory, it is the primary visual dependency. The playground acts as a controller for this component, feeding it data and anomaly maps manually.
  • anomaly-playground-sk-demo.ts: Provides a mocked environment for local development, simulating the backend responses for detection and frame updates.

Parameters and Algorithms

The module supports several detection algorithms via the StepDetection type:

  • Algorithms: absolute, const, percent, cohen, mannwhitneyu.
  • Radius: Determines the window of data points to the left and right of a point to consider when calculating medians/statistics.
  • Threshold: The sensitivity of the chosen algorithm.
  • Direction: Defines whether an increase (UP) or decrease (DOWN) in value is treated as a regression or an improvement.

Module: /modules/bisect-dialog-sk

bisect-dialog-sk

The bisect-dialog-sk module provides a specialized modal dialog used in the Perf UI to initiate performance bisection jobs (Pinpoint) for Chrome performance regressions. It captures necessary metadata from a performance anomaly—such as test paths and revision ranges—and submits a request to create a bisection job to identify the root cause of a regression.

Design and Implementation Choices

Chrome-Specific Logic

The bisection logic within this module is specifically tailored for the Chrome performance testing ecosystem. This is reflected in how it parses “test paths” and maps them to specific bisection parameters like benchmark, configuration, and story.

Data Parsing and Transformation

A significant portion of the logic in bisect-dialog-sk.ts involves decomposing a single testPath string into the structured fields required by the Pinpoint bisection API.

  • Path Splitting: The module expects a slash-delimited path (e.g., Master/Bot/Benchmark/Chart/Story).
  • Statistic Extraction: It checks the end of the test path against a set of known statistical suffixes (e.g., avg, max, std). If found, it separates the statistic from the chart name to ensure the bisection job monitors the correct metric.
  • Legacy Compatibility: The module automatically replaces colons (:) with underscores (_) in the story field. This choice was made to reduce errors when querying test paths in legacy data tables.
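A hypothetical version of this decomposition is sketched below; the suffix list and field names are illustrative, not the module's exact code:

```typescript
// Known statistical suffixes that may trail the chart name.
const STAT_SUFFIXES = ['avg', 'max', 'min', 'std', 'count'];

interface BisectParams {
  configuration: string;
  benchmark: string;
  chart: string;
  statistic: string;
  story: string;
}

// Expected path shape: Master/Bot/Benchmark/Chart/Story
function parseTestPath(testPath: string): BisectParams {
  const [, bot, benchmark, rawChart, story = ''] = testPath.split('/');
  let chart = rawChart;
  let statistic = '';
  for (const suffix of STAT_SUFFIXES) {
    if (rawChart.endsWith('_' + suffix)) {
      chart = rawChart.slice(0, -(suffix.length + 1));
      statistic = suffix;
      break;
    }
  }
  return {
    configuration: bot,
    benchmark,
    chart,
    statistic,
    story: story.replace(/:/g, '_'), // Legacy-compat colon replacement.
  };
}
```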

User Authorization

Bisection is a resource-intensive operation. The module utilizes the alogin-sk infrastructure to verify the user's identity and roles before allowing a submission. If a user is not logged in or lacks the necessary permissions, the dialog prevents the request and surfaces an error message.

Key Components

bisect-dialog-sk.ts

This is the primary implementation file. It defines the BisectDialogSk class, which handles:

  • State Management: Tracks input parameters like startCommit, endCommit, bugId, and the resulting jobUrl.
  • Pre-loading: The setBisectInputParams method allows parent components (like a chart tooltip or an anomaly list) to populate the dialog with context-specific data before opening.
  • Validation: Performs client-side checks to ensure all required fields (start/end hashes, benchmark, etc.) are present before attempting a network request.
  • Submission: Manages the POST request to /_/bisect/create and handles the asynchronous response, displaying a direct link to the created Pinpoint job upon success.

Template and Styling

The UI is built using lit-html and styled with SCSS. It provides a clean, form-based layout within a <dialog> element.

  • Loading State: Integrated spinner-sk provides visual feedback during the bisection request.
  • Responsive Inputs: Uses standard HTML inputs for commit hashes and patches, allowing users to manually override pre-loaded data if needed.

Workflow

The typical lifecycle of a bisection request through this module is as follows:

[ External Component ] --(testPath, revisions)--> [ bisect-dialog-sk ]
                                                           |
                                                   [ .open() called ]
                                                           |
                                             <-- User edits/reviews form -->
                                                           |
                                                  [ .postBisect() ]
                                                           |
          --------------------------------------------------------------------------------------
          |                                        |                                           |
 [ Validation Fails ]                      [ Network Request ]                        [ Auth Fails ]
          |                                        |                                           |
 [ Show error-sk ]                         [ /_/bisect/create ]                        [ Show error ]
                                                   |
                                     ------------------------------
                                     |                            |
                             [ Success (200) ]            [ Failure (5xx) ]
                                     |                            |
                          [ Display Pinpoint Link ]        [ Show error-sk ]

Integration Points

  • Preloading: Call setBisectInputParams(params: BisectPreloadParams) to populate the dialog.
  • Execution: Call open() to display the modal to the user.
  • Events: While the module primarily handles its own submission, it relies on the global errorMessage utility to communicate failures to the user.

Module: /modules/bug-tooltip-sk

The bug-tooltip-sk module provides a specialized custom element designed to display a summary of bugs (typically regressions) associated with a data point or alert. It balances a minimal UI footprint with quick access to detailed external bug tracking information.

Design Philosophy

The module is built as a hover-triggered informational component. Instead of cluttering the main interface with long lists of bug IDs, it displays a concise count and reveals a detailed list only when the user expresses interest by hovering over the element.

Key implementation choices include:

  • Lightweight Shadow DOM Bypass: The component uses createRenderRoot() { return this; }, meaning it renders directly into the light DOM. This choice simplifies global styling and ensures that the absolute positioning of the tooltip behaves predictably relative to its parent containers in the Perf UI.
  • CSS-Driven Interactivity: The visibility of the tooltip is managed via CSS :hover states on the .bug-count-container rather than JavaScript event listeners. This reduces the overhead of the component and ensures high performance during rapid UI interactions.
  • Hardcoded Navigation Logic: The element specifically formats links using the http://b/ shortcut, optimized for internal issue tracking workflows.

Key Components and Responsibilities

bug-tooltip-sk.ts

This file defines the BugTooltipSk LitElement. Its primary responsibility is to transform an array of RegressionBug objects into a readable summary.

  • Data Binding: It accepts a bugs property. If the list is empty, the entire component is hidden via the hidden attribute to save space.
  • Customizable Labeling: The totalLabel property allows consumers to change the suffix of the count (e.g., “with 2 regressions” vs “with 2 total”), making it reusable across different alert types.

bug-tooltip-sk.scss

The stylesheet manages the complex positioning and transition logic for the tooltip.

  • Positioning: The tooltip is positioned absolutely at bottom: 125% of the container, ensuring it pops up above the text.
  • Overflow Handling: Because the tooltip might contain many bugs, it is constrained by a max-height and features overflow-y: auto. This prevents the tooltip from expanding beyond the viewable area of the Perf content pane.
  • Visual Feedback: A 0.7s opacity transition is applied to provide a smooth “fade-in” effect when the user hovers over the bug count.

bug-tooltip-sk_po.ts

This file provides the Page Object (BugTooltipSkPO) for the module. It abstracts the DOM structure for integration tests, allowing tests to verify:

  • Visibility states (both the container and the tooltip).
  • Correctness of bug links and text content.
  • Scrollability, ensuring that the CSS constraints on height are functioning correctly when high volumes of bugs are present.

Workflow: Displaying Bug Details

The following diagram illustrates how the component handles user interaction to reveal bug data:

[Data Input] -> [bugs: RegressionBug[]]
                      |
                      v
             +---------------------+
             |  Is bugs.length > 0?| -- No --> [Render Nothing]
             +---------------------+
                      | Yes
                      v
             +------------------------+
             | Render: "with X total" |
             +------------------------+
                      |
             [User Hover Action]
                      |
                      v
             +------------------------+
             | CSS: opacity 0 -> 1    |
             | CSS: visibility: vis   |
             +------------------------+
                      |
             +------------------------+
             | Rendered List:         |
             | - ID (Link to b/ID)    |
             | - Type (Source)        |
             +------------------------+

Data Structure

The component expects the bugs property to conform to the RegressionBug interface (imported from the central JSON definitions), which requires:

  • bug_id: The numeric identifier for the bug.
  • bug_type: A string indicating the origin or category of the bug (e.g., “monorail”).

Module: /modules/calendar-input-sk

calendar-input-sk

The calendar-input-sk module provides a hybrid date-selection component. It combines a manual text input field with a graphical calendar picker, ensuring that users can either type a specific date quickly or browse a calendar for context.

Design and Implementation

The component is designed around the principle of flexibility and validation. It acknowledges that while calendar pickers are user-friendly for relative date selection (e.g., “next Thursday”), manual entry is often faster for absolute date entry (e.g., “1995-03-12”).

Key Components

  • Text Input: A standard HTML <input type="text"> restricted by a regex pattern (YYYY-MM-DD). This provides a lightweight way to enter dates without requiring the heavy overhead of native browser date pickers, which can vary significantly in behavior and styling across platforms.
  • Trigger Button: An icon button (using date-range-icon-sk) that activates the graphical selection interface.
  • Modal Dialog: A native HTML <dialog> element containing a calendar-sk component. Using a native dialog allows the component to leverage built-in browser features for modal behavior, such as focus trapping and the “Esc” key to close.

Interaction Workflow

The component synchronizes state between the text field and the calendar widget:

[ User Input ] --> [ Regex Validation ] --(valid)--> [ Update Internal Date ] --(emit)--> [ input event ]
      |                                                        ^
      |                                                        |
[ Click Icon ] --> [ Open Dialog ] --> [ Select Date in Calendar ] --(close)--> [ Update Input Value ]

  1. Manual Entry: When a user types in the input field, the component monitors the input event. It validates the string against the required pattern. If valid, it parses the date and updates the internal state.
  2. Calendar Selection: Clicking the calendar icon opens the modal. The calendar-input-sk manages this interaction using a Promise-based approach. The openHandler awaits a Promise that is resolved when a date is picked in the sub-component (calendar-sk) or rejected if the user cancels.
  3. Keyboard Support: While the dialog is open, the component proxies keyboard events to the calendar-sk element, allowing users to navigate the calendar grid using arrow keys even though the dialog has focus.
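The manual-entry validation step can be sketched as follows; the pattern matches the YYYY-MM-DD format described above but is not necessarily the element's exact regex:

```typescript
const DATE_PATTERN = /^\d{4}-\d{2}-\d{2}$/;

// Returns a local-time Date for a valid YYYY-MM-DD string, or null otherwise.
function parseDateInput(value: string): Date | null {
  if (!DATE_PATTERN.test(value)) return null;
  const [year, month, day] = value.split('-').map(Number);
  const d = new Date(year, month - 1, day); // Local time, per the component's design.
  // Reject impossible dates like 2024-02-31, which Date would silently roll over.
  if (d.getFullYear() !== year || d.getMonth() !== month - 1 || d.getDate() !== day) {
    return null;
  }
  return d;
}
```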

Design Decisions

  • Pattern Validation: Instead of using type="date", which often forces a specific UI localized by the browser, this component uses type="text" with a pattern. This ensures a consistent look and feel across all browsers while still providing immediate feedback via CSS (using the :invalid pseudo-class) when the format is incorrect.
  • State Synchronization: The displayDate property acts as the single source of truth. Setting this property triggers a re-render of the input value and updates the state of the underlying calendar-sk widget.
  • Event Handling: The component emits a custom input event containing the selected Date object in the detail field. This mirrors the standard input behavior while providing a rich data type to the consumer. It explicitly stops propagation of internal native input events to prevent confusing them with the component's own semantic “date changed” event.

Styling

The component uses scoped CSS to handle validation states. When the input's regex pattern is not met, an “invalid” indicator (an “✘” mark) is displayed via CSS transitions:

  • The input:invalid + .invalid selector allows for a CSS-only toggle of error messages, minimizing the amount of manual DOM manipulation required in the TypeScript logic.
  • It utilizes the perf/modules/themes variables to ensure the dialog and input colors remain consistent with the overall application theme (supporting both light and dark modes).

Module: /modules/calendar-sk

The calendar-sk module provides a custom web component that displays an accessible, themeable, and localized calendar for selecting a single date. It is designed to overcome the limitations of the native <input type="date"> (specifically Safari compatibility and lack of styling) and other third-party libraries that may be inaccessible or difficult to theme.

Design Decisions

  • Custom Date Manipulation: The component uses a CalendarDate helper class and local time manipulation to ensure that date selection is predictable and avoids the common pitfalls of UTC vs. local time offsets.
  • Accessibility First: Implementation follows W3C WAI-ARIA practices for date pickers. This includes proper aria-live regions for month/year changes, aria-selected states for the current selection, and a robust focus management system.
  • Localization via Intl API: Rather than hardcoding month and day names, the component utilizes the Intl.DateTimeFormat API. This allows the calendar to automatically adapt its labels (e.g., “January” vs “一月”) and weekday headers based on the provided locale property.
  • Decoupled Keyboard Handling: Instead of automatically capturing all global key presses, the component exposes a keyboardHandler method. This allows the parent application to decide when the calendar should respond to input (e.g., only when a specific dialog is open).

Key Components

calendar-sk.ts

This is the core logic of the module. It defines the CalendarSk class, which extends ElementSk and uses lit-html for rendering.

  • State Management: It maintains an internal _displayDate which determines which month is currently visible.
  • Navigation Logic: It contains methods for incrementing/decrementing days, weeks, months, and years. It handles edge cases like moving from January 31st to February (clamping the date to the last day of the month).
  • Template Generation: It dynamically calculates the grid layout (up to 6 rows) based on the first day of the week and the number of days in the specific month.
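The clamping behavior described above can be sketched as a pure helper. The name addMonths is hypothetical; calendar-sk exposes its own navigation methods:

```typescript
// Move a date forward/backward by whole months, clamping the day-of-month
// so e.g. January 31st + 1 month lands on the last day of February.
function addMonths(d: Date, delta: number): Date {
  const day = d.getDate();
  const result = new Date(d.getFullYear(), d.getMonth() + delta, 1);
  // Day 0 of the following month is the last day of the target month.
  const daysInTarget = new Date(
    result.getFullYear(),
    result.getMonth() + 1,
    0
  ).getDate();
  result.setDate(Math.min(day, daysInTarget));
  return result;
}
```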

calendar-sk.scss

Styles the calendar using CSS variables for themeability (e.g., --background, --secondary, --surface-1dp). It ensures that the calendar buttons are uniform and that the “today” and “selected” states are visually distinct.

Events

The component communicates state changes to the rest of the application via a standard DOM event:

  • change: Fired whenever a user selects a date. The detail property of the event contains the selected Date object.

Key Workflows

Navigation and Selection

The user can navigate through time using UI buttons or keyboard shortcuts. When a date is selected, the component updates its internal state and notifies listeners.

User Action          Component Logic                UI Update
-----------          ---------------                ---------
Click "Next Month" -> incMonth() -----------------> Re-renders table grid
Press "ArrowRight" -> keyboardHandler(incDay()) --> Updates focus & aria-selected
Click a Day Button -> dateClick() ----------------> Dispatches 'change' event

Keyboard Shortcuts

When the keyboardHandler is active, the following shortcuts are supported:

Key                      Action
---                      ------
ArrowLeft / ArrowRight   Move back/forward one day
ArrowUp / ArrowDown      Move back/forward one week
PageUp / PageDown        Move back/forward one month
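A dispatch for these shortcuts might look like the following sketch; the inc*/dec* method names are assumptions based on the navigation logic described earlier, not the component's exact API:

```typescript
// Map a KeyboardEvent.key value to a calendar navigation action.
// Unrelated keys return null so the event can propagate normally.
type NavAction =
  | 'incDay' | 'decDay'
  | 'incWeek' | 'decWeek'
  | 'incMonth' | 'decMonth';

function navForKey(key: string): NavAction | null {
  switch (key) {
    case 'ArrowRight': return 'incDay';
    case 'ArrowLeft': return 'decDay';
    case 'ArrowDown': return 'incWeek';
    case 'ArrowUp': return 'decWeek';
    case 'PageDown': return 'incMonth';
    case 'PageUp': return 'decMonth';
    default: return null;
  }
}
```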

Usage Example

In a parent component or page, you can initialize the calendar and hook into its events:

const calendar = document.querySelector('calendar-sk');

// Set initial date and locale
calendar.displayDate = new Date();
calendar.locale = 'en-US';

// Listen for selection
calendar.addEventListener('change', (e) => {
  console.log('New date selected:', e.detail);
});

// Proxy keyboard events from a container
window.addEventListener('keydown', (e) => calendar.keyboardHandler(e));

Module: /modules/chart-tooltip-sk

chart-tooltip-sk

The chart-tooltip-sk module provides a rich, interactive tooltip designed for performance charts. It serves as the primary interface for users to inspect specific data points, view commit metadata, triage anomalies, and initiate debugging workflows like bisections.

Overview

Unlike a simple text tooltip, chart-tooltip-sk is a complex orchestrator that aggregates data from multiple sources (dataframes, anomaly maps, and backend CID handlers). It is designed to be dynamically positioned over a chart and provides contextual actions based on the nature of the selected data point (e.g., whether it is a single point, a range, or a detected anomaly).

Key Responsibilities

  • Contextual Data Display: Shows date, value, and unit information for any hovered or selected point.
  • Anomaly Triaging: If a point is identified as an anomaly, the tooltip provides details on the percentage change, median values before/after, and provides a triage-menu-sk for filing bugs or ignoring the regression.
  • Workflow Integration: Acts as a bridge to launch Bisect jobs (bisect-dialog-sk) and Pinpoint try jobs (pinpoint-try-job-dialog-sk).
  • Commit Navigation: Integrates with commit-range-sk to show the range of commits associated with a point and provides direct links to the source repository.
  • Source Inspection: Optionally displays links to raw JSON source files for the data point if configured.

Design Decisions

Positioning Logic (moveTo)

The tooltip implements custom positioning logic instead of relying on standard CSS hover tooltips. This is necessary because it must stay within the viewport and the chart boundaries.

  • Smart Shifting: It calculates its own dimensions using getBoundingClientRect() and automatically flips to the left of the cursor if it would overflow the right edge of the screen.
  • Vertical Adjustment: It shifts vertically to ensure it doesn't get cut off by the bottom of the browser window.
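The shifting rules can be sketched as a pure function. The sizes and clamping here are simplified assumptions; the real moveTo also accounts for chart boundaries:

```typescript
interface Size { width: number; height: number; }

// Compute a tooltip anchor that stays inside the viewport:
// flip horizontally past the right edge, clamp vertically at the bottom.
function tooltipPosition(
  x: number,
  y: number,
  tip: Size,
  viewport: Size
): { left: number; top: number } {
  // Flip to the left of the cursor if the tooltip would overflow the right edge.
  const left = x + tip.width > viewport.width ? x - tip.width : x;
  // Shift up just enough to stay inside the bottom edge.
  const top = y + tip.height > viewport.height ? viewport.height - tip.height : y;
  return { left, top };
}
```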

“Why” the load method?

Rather than using many individual attributes, the module uses a comprehensive load() method. This decision ensures that all interrelated properties (anomaly data, commit info, color, and triage state) are updated atomically before a render is triggered. This prevents “flicker” where the tooltip might show an old anomaly's data with a new point's coordinates.
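As a toy illustration of that contract (the class and field names here are invented, not the component's real shape):

```typescript
// All interrelated fields are assigned first; render runs exactly once
// per load, so the UI never shows a mixed old/new state.
class TooltipStateSketch {
  renderCount = 0;
  index = -1;
  value = 0;
  anomaly: object | null = null;

  load(index: number, value: number, anomaly: object | null): void {
    this.index = index;
    this.value = value;
    this.anomaly = anomaly;
    this.render(); // single render after all state is set
  }

  private render(): void {
    this.renderCount++;
  }
}
```

Had each field been a reactive attribute instead, every assignment could trigger its own intermediate render.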

Conditional Content

The UI is highly reactive to the global window.perf configuration and the specific data passed to it:

  • Pinpoint/Bisect: Buttons are hidden if the instance doesn't support them or if the git repository is not a supported Chromium source.
  • Anomaly vs. Normal: The template branches significantly. For anomalies, it calculates and colors the “Improvement” vs “Regression” status; for normal points, it can show a user-issue-sk component to track non-anomaly bugs.

Key Components

Component          Role within Tooltip
---------          -------------------
commit-range-sk    Displays and links the range of revisions for the selected point.
triage-menu-sk     Provides the UI for filing new bugs or associating the anomaly with an existing one.
point-links-sk     Renders custom links based on the specific trace configuration (e.g., V8 or WebRTC specific dashboards).
bisect-dialog-sk   A dialog triggered from the tooltip to start a performance bisection.
json-source-sk     Displays the underlying data source when enabled via show_json_file_display.

Key Workflows

Data Loading and Rendering

When a user interacts with a chart, the following process occurs to populate the tooltip:

Chart Event (Hover/Click)
          |
          v
explore-simple-sk (or parent) calls .load(...)
          |
          +--> Update internal state (index, anomaly, commit_info)
          +--> Determine Trace Color
          +--> Configure sub-components (CommitRange, UserIssue)
          |
          v
      ._render()
          |
          +--> logic: Is this an anomaly?
          |    |-- YES: Show anomalyTemplate() (Medians, Triage Menu)
          |    +-- NO: Show user-issue-sk
          |
          +--> logic: Is always_show_commit_info true?
               |-- YES: Show Author/Message/Hash
               +-- NO: Hide commit info if it's a range

Triaging an Anomaly

The tooltip facilitates the transition from “seeing a spike” to “taking action”:

  1. Detection: The parent element passes an Anomaly object to the tooltip.
  2. Visualization: The tooltip displays the “Anomaly Range” and “Percent Change”.
  3. Action:
    • If no bug is associated (bug_id === 0), the triage-menu-sk appears.
    • The user can click “Bisect” to pre-populate a bisection job with the anomaly's revision range.
    • Upon successful triage, the anomaly-changed event refreshes the display to show the new Bug ID.

Module: /modules/cid

Commit ID (CID) Resolution Module

The cid module provides a centralized client-side interface for resolving internal commit identifiers—represented as CommitNumber types—into rich commit metadata.

In the Perf system, performance data is often indexed by a sequential CommitNumber (also known as an offset) to optimize storage and time-series lookups. However, for human readability and integration with version control systems, these numbers must be translated back into git hashes, timestamps, authors, and commit messages. This module abstracts that translation process.

Design Decisions

Centralized Resolution

The decision to use a dedicated RPC endpoint (/_/cid/) rather than embedding commit metadata directly into performance data streams is driven by bandwidth efficiency. Performance results often contain thousands of data points; including full commit details for every point would result in massive payloads. Instead, the UI receives lightweight CommitNumber integers and uses this module to batch-resolve only the specific commits needed for display (e.g., when hovering over a point in a chart or viewing a table of regressions).

Batch Processing

The module is designed around batching. The lookupCids function accepts an array of CommitNumbers, allowing the frontend to resolve an entire range of commits in a single HTTP POST request. This minimizes network overhead and reduces latency when populating large data views.

Key Components

Commit Translation (cid.ts)

The core functionality is encapsulated in the lookupCids function. It acts as the bridge between the frontend and the Perf backend's CID handler.

  • Input: An array of CommitNumber values.
  • Process: It serializes these numbers into a JSON body and sends a POST request to the /_/cid/ endpoint.
  • Output: A CIDHandlerResponse object containing a commitSlice (an array of detailed commit objects) and an optional logEntry for debugging or context.

The use of jsonOrThrow ensures that the calling code doesn't have to manually check HTTP status codes for common failure modes, streamlining error handling in the UI components that consume this data.
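Putting the pieces together, a minimal sketch of the batch lookup might look like this. The endpoint and response fields follow the description above; FetchLike is a stand-in so the helper can be exercised outside a browser, and the error handling mimics (rather than reuses) jsonOrThrow:

```typescript
type CommitNumber = number;

interface CIDHandlerResponse {
  commitSlice: unknown[];
  logEntry?: string;
}

// Minimal fetch-shaped interface so the sketch has no DOM dependency.
interface FetchLike {
  (url: string, init: { method: string; headers: Record<string, string>; body: string }):
    Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;
}

// Serialize the batch of CommitNumbers into a single POST body.
function buildCidRequest(cids: CommitNumber[]) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(cids), // e.g. "[101,102,105]"
  };
}

async function lookupCids(
  cids: CommitNumber[],
  fetchImpl: FetchLike
): Promise<CIDHandlerResponse> {
  const resp = await fetchImpl('/_/cid/', buildCidRequest(cids));
  // jsonOrThrow-style handling: surface non-2xx responses as exceptions.
  if (!resp.ok) throw new Error(`cid lookup failed: HTTP ${resp.status}`);
  return (await resp.json()) as CIDHandlerResponse;
}
```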

Workflow: Resolving Commit Metadata

The following diagram illustrates how a UI component uses this module to transform raw data into a human-readable format:

+----------------+       +------------------+       +-----------------+
|  UI Component  |       |    CID Module    |       |  Perf Backend   |
| (e.g. Chart)   |       |     (cid.ts)     |       |  (/_/cid/ RPC)  |
+-------+--------+       +--------+---------+       +--------+--------+
        |                         |                          |
        | 1. Request Resolution   |                          |
        |    [101, 102, 105]      |                          |
        +------------------------>|                          |
        |                         | 2. POST /_/cid/          |
        |                         |    JSON([101, 102, 105]) |
        |                         +------------------------->|
        |                         |                          |
        |                         | 3. Return Metadata       |
        |                         |    (Hashes, Msgs, etc.)  |
        |                         |<-------------------------+
        | 4. Update View          |                          |
        |<------------------------+                          |
        |                         |                          |
+-------+--------+       +--------+---------+       +--------+--------+

Related Data Structures

The module relies heavily on types defined in /perf/modules/json, specifically:

  • CommitNumber: A branded type representing the sequential index of a commit.
  • CIDHandlerResponse: The schema for the backend response, which includes the Commit objects containing the hash, author, timestamp, and message.

Module: /modules/cluster-lastn-page-sk

cluster-lastn-page-sk

The cluster-lastn-page-sk module provides a comprehensive interface for testing, dry-running, and saving performance alert configurations. It allows developers and performance engineers to “test drive” anomaly detection algorithms against historical data before committing them as active monitors.

Overview

At its core, this module acts as a sandbox for the Perf clustering and regression detection system. It enables users to define parameters for an alert (such as the algorithm, threshold, and data query), run that configuration against a specific range of commits, and inspect the resulting clusters to verify if the alert is too noisy or missing real regressions.

Key Components and Responsibilities

Alert Configuration and State Management

The module heavily relies on alert-config-sk for defining the detection logic.

  • State Reflection: It uses stateReflector to synchronize the current alert configuration with the URL. This allows users to share a specific “dry-run” setup by simply copying the browser address.
  • Alert Persistence: Once a user is satisfied with the dry-run results, the module handles the transition from a temporary configuration to a persistent one via the /_/alert/update endpoint. It dynamically changes its UI (e.g., button labels) based on whether the user is creating a new alert or updating an existing one.

Dry-Run Execution Workflow

The “Run” process is an asynchronous operation that leverages a progress-tracking API.

  1. Request Initiation: It sends a RegressionDetectionRequest to /_/dryrun/start, containing the alert configuration and the commit range (defined by domain-picker-sk).
  2. Progress Monitoring: Instead of a single blocking request, it utilizes a polling mechanism (via startRequest) to receive intermediate updates. This allows the UI to display real-time status messages and partial results.
  3. Result Visualization: Detected regressions are rendered in a tabular format, broken down by commit and the direction of the change (High/Low).
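The start-and-poll pattern in step 2 can be sketched generically. The names here are illustrative; the real code delegates to the startRequest progress utility:

```typescript
interface Progress {
  status: 'Running' | 'Finished';
  message: string;
}

// Keep polling a status function until it reports completion, invoking a
// callback for every intermediate update so the UI can show progress.
async function pollUntilDone(
  poll: () => Promise<Progress>,
  onUpdate: (p: Progress) => void
): Promise<Progress> {
  for (;;) {
    const p = await poll();
    onUpdate(p);
    if (p.status === 'Finished') return p;
  }
}
```

In production the loop would also sleep between polls and handle errors; both are omitted here for brevity.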

Regression Analysis and Triage

The module doesn't just list regressions; it provides deep-dive capabilities into the clusters found:

  • Triage Status: It integrates triage-status-sk within the results table to show the current status of detected anomalies.
  • Detailed Inspection: Clicking on a result triggers a modal containing cluster-summary2-sk. This allows users to see the specific traces contributing to a cluster without navigating away from the dry-run page.
  • External Linking: Through the open-keys event, the module can open the Explore page in a new tab, pre-populated with the specific trace keys and time range associated with a detected regression.

Key Workflows

Testing an Alert

User Configures Alert -> Clicks "Run"
  |
  V
POST /_/dryrun/start (Alert Params + Domain)
  |
  +--<-- Polling /_/progress/ -> Updates Status UI
  |
  V
Results Received -> Render Table (Commits x Clusters)
  |
  +-- Click Cluster -> Open Triage Dialog (Internal Inspection)
  |
  +-- Click "Accept" -> Save Alert to Database

Domain and Range Selection

The module uses a domain-picker-sk to define the “where” and “when” of the test. Users can specify:

  • Number of commits: How far back to look.
  • Commit Range: Specific start and end points in time. The UI defaults to a “dense” request type to ensure sufficient data points are evaluated during the dry-run, regardless of the underlying data sparsity.

Design Decisions

Modal Dialogs for Configuration

The use of <dialog> elements for alert-config-sk and cluster-summary2-sk ensures that the main dry-run context (the results table and run settings) remains visible and persistent in the background while the user fine-tunes parameters or inspects specific data points.

Error Handling

The module distinguishes between transient request errors and configuration errors. Error messages from the dry-run process are captured and displayed within a dedicated <pre> block to preserve formatting (like stack traces or detailed engine logs), while authentication or persistence errors are routed through a global error-toast-sk.

Module: /modules/cluster-page-sk

cluster-page-sk

The cluster-page-sk module provides a comprehensive interface for performing regression detection and trace clustering within the Skia Perf framework. It allows users to identify groups of performance traces that exhibit similar behavior—such as a coordinated step or shift—around a specific commit.

High-Level Overview

The primary purpose of this module is to give developers and performance engineers a way to “cluster” traces based on their statistical properties. Instead of looking at individual traces, users can identify patterns across hundreds or thousands of benchmarks.

The workflow typically involves:

  1. Selection: Picking a specific commit (the “center” of the analysis) and a set of traces via a query.
  2. Configuration: Choosing an algorithm (e.g., K-Means or Step-Fit) and defining sensitivity parameters.
  3. Execution: Running a long-polling server-side task to compute clusters.
  4. Triage: Reviewing the resulting clusters to identify performance regressions or improvements.

Design and Implementation Choices

Asynchronous Progress Handling

Clustering is a computationally expensive operation that can take a significant amount of time. To prevent UI blocking and handle potential timeouts, the module uses a “start-and-poll” pattern.

  • It initiates a request to /_/cluster/start.
  • It utilizes a specialized progress utility to poll for status updates.
  • Real-time status messages from the server are streamed to the UI, providing feedback on the current stage of the clustering process (e.g., “Calculating centroids”, “Analyzing step fits”).

State Reflection

To support bookmarking and sharing of specific clustering configurations, the module uses stateReflector. Parameters such as the selected commit offset, the query string, algorithm choice (k, radius, etc.), and “interestingness” thresholds are automatically mirrored in the URL's hash.

Component-Based Architecture

The page is composed of several specialized sub-elements, each handling a distinct part of the clustering lifecycle:

  • commit-detail-picker-sk: Handles the complex logic of searching for and selecting a specific commit.
  • algo-select-sk: Provides the UI for switching between different clustering strategies (like K-Means vs. Step-Fit).
  • query-sk & query-count-sk: Allow users to filter the multi-million trace dataset down to a specific subset while getting immediate feedback on the number of matches.
  • cluster-summary2-sk: Visualizes the output of the clustering engine, showing centroids and statistical summaries of the discovered groups.

Key Components and Responsibilities

State Management (State class)

The internal State class defines the schema for what makes a clustering request unique. This includes:

  • offset: The commit index to analyze.
  • radius: How many commits before and after the offset to include in the window.
  • k: The number of clusters to find (0 allows the server to auto-calculate).
  • interesting: A threshold score; clusters with a regression score below this are ignored.
  • sparse: A boolean flag to skip traces that lack data in the requested range.
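A hypothetical TypeScript rendering of this state, together with the kind of URL-hash mirroring stateReflector performs, might look like the following (field defaults and the hash encoding are assumptions):

```typescript
// Fields mirror the State schema described above.
interface ClusterState {
  offset: number;      // commit index to analyze
  radius: number;      // commits on each side of the offset
  k: number;           // 0 = let the server auto-calculate the cluster count
  interesting: number; // minimum regression score to report
  sparse: boolean;     // skip traces lacking data in the range
}

// Serialize the state into a shareable URL hash fragment.
function stateToHash(s: ClusterState): string {
  return (
    '#' +
    Object.entries(s)
      .map(([key, v]) => `${key}=${encodeURIComponent(String(v))}`)
      .join('&')
  );
}
```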

The Clustering Workflow

The start() method is the core logic driver. It gathers the current state into a RegressionDetectionRequest and manages the lifecycle of the network request.

[ User Input ] -> [ State Reflector ] -> [ URL Updated ]
      |
[ Click "Run" ]
      |
      v
[ POST /_/cluster/start ] ----> [ Server starts Job ]
      |                               |
      |<-------[ Poll Progress ] <----+
      |             |
      v             |
[ Update Spinner ] <+
[ Show Messages  ] <+
      |
      v
[ GET Final Results ] -> [ Map to cluster-summary2-sk ]

Event Handling

  • commitSelected: Listens for selection events from the commit picker to update the target offset.
  • openKeys: When a user clicks on a specific cluster summary, this handler constructs a URL for the Explore page (/e/) using a shortcut to the traces in that cluster, allowing for deeper drill-down analysis.
  • queryChanged: Dynamically updates the paramset-sk and triggers a re-count of matching traces to help the user gauge the scope of their request before running it.

Results Visualization and Sorting

Once the results are returned, they are rendered as a list of cluster-summary2-sk elements. The module includes a sort-sk component that allows users to re-order these results based on:

  • Cluster Size: Number of traces in the group.
  • Regression: The calculated severity of the shift.
  • Step Size: The absolute magnitude of the change.
  • Least Squares: The statistical fit of the data to a step function.

Module: /modules/cluster-summary2-sk

cluster-summary2-sk

The cluster-summary2-sk module provides a comprehensive UI component for visualizing and triaging performance regressions in the Perf system. It represents a “cluster” of traces that exhibit similar behavior (usually a step-up or step-down) at a specific point in time.

High-Level Overview

This component serves as a detailed view for an anomaly. It combines several data dimensions into a single interface:

  1. Visual Evidence: A plot showing the centroid of the trace cluster.
  2. Statistical Context: Metrics like regression magnitude, step size, and least squares error.
  3. Metadata: Impacted parameters (via a Word Cloud) and commit details.
  4. Actionability: Controls to triage the anomaly (e.g., mark as “Positive” or “Negative”) or investigate further on the dashboard.

Design Decisions and Implementation

Dynamic Labeling and Formatting

A key design challenge in Perf is that different step detection algorithms (e.g., mannwhitneyu, cohen, percent) produce statistically different outputs. Rather than a generic “Value” label, cluster-summary2-sk uses a mapping strategy (labelsForStepDetection) to provide context-aware labels and formatting.

  • Why: The same number means different things under different algorithms. A “Regression Factor” of 0.05 is significant when read as a p-value (mannwhitneyu), but likely negligible when read as a “Percentage Change.”
  • How: The component reacts to the alert property. When an alert is set, it updates the internal labels object, changing UI strings (e.g., “p:” vs “Percentage Change:”) and the corresponding number formatters (percent vs decimal).
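A plausible reconstruction of that mapping follows; the label strings for the p-value and percentage cases are taken from the text above, while the cohen entry and formatter details are illustrative:

```typescript
type StepDetection = 'mannwhitneyu' | 'percent' | 'cohen';

interface Labels {
  regression: string;            // UI label for the regression statistic
  format: (n: number) => string; // number formatter matched to the statistic
}

// Each step detection algorithm gets its own label text and formatter,
// so 0.05 renders as a p-value for one algorithm and as "5.0%" for another.
const labelsForStepDetection: Record<StepDetection, Labels> = {
  mannwhitneyu: { regression: 'p:', format: (n) => n.toFixed(3) },
  percent: {
    regression: 'Percentage Change:',
    format: (n) => `${(n * 100).toFixed(1)}%`,
  },
  cohen: { regression: "Cohen's d:", format: (n) => n.toFixed(2) },
};
```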

Layout and Information Density

The component uses a CSS Grid layout to manage a complex set of child elements, ensuring that critical information remains visible even as secondary tools (like the Word Cloud) are toggled.

+-------------------------------------------+
| [Regression Status Banner (High/Low)]     |
+----------------------+--------------------+
| [Stats Row]          | [Triage Controls]  |
+----------------------+                    |
| [Google Chart Plot]  |                    |
+----------------------+--------------------+
| [Commit Detail Panel]                     |
+-------------------------------------------+
| [Action Buttons: Dashboard / Word Cloud]  |
+-------------------------------------------+
| [Collapsible Word Cloud Area]             |
+-------------------------------------------+

Data Integration and State

The component consumes a FullSummary object, which is a composite of the ClusterSummary (the statistics) and the FrameResponse (the raw data for the plot).

  • Plotting: It transforms the cluster centroid (an array of numbers) and the dataframe header (commit info) into a format suitable for plot-google-chart-sk. It also places an “x-bar” (vertical line) at the exact commit where the regression was detected.
  • Permissions: It checks the user's login status via alogin-sk. If the user lacks the editor role, triage controls are visually disabled to prevent unauthorized state changes.

Key Components and Responsibilities

cluster-summary2-sk.ts

The main logic engine. It handles:

  • Event Dispatching: Fires triaged when a user updates the status and open-keys when a user wants to explore the cluster on the main Perf dashboard.
  • Coordinate Lookup: Uses the static lookupCids method to fetch commit metadata when a user clicks a point on the graph.
  • Attribute Management: Supports a notriage attribute to hide triage UI in read-only contexts.

Integrated Sub-elements

The component acts as a coordinator for several other specialized modules:

  • plot-google-chart-sk: Renders the trend line of the cluster's centroid.
  • triage2-sk: Provides the dropdown/selection for anomaly status (Untriaged, Positive, Negative, etc.).
  • word-cloud-sk: Visualizes the param_summaries2 data, helping users identify which dimensions (like arch or config) are most common in the cluster.
  • commit-detail-panel-sk: Displays the git log/author information for the selected point in the regression.
  • commit-range-sk: Allows users to inspect the range of commits around the regression point.

Workflows

The Triage Workflow

When a developer identifies a regression, the following interaction occurs:

  1. Selection: User reviews the plot and word cloud to confirm the regression is real.
  2. Input: User selects a status in triage2-sk and optionally enters a message.
  3. Update: Clicking “Update” triggers the following internal flow: User Click -> update() -> dispatchEvent('triaged', {columnHeader, triageStatus})
  4. External Handling: The parent page (e.g., the Anomaly table) listens for this event to persist the change to the backend.

The Investigation Workflow

If the summary is insufficient, the “View on Dashboard” button facilitates a deep dive:

Click "View on Dashboard" -> openShortcut() -> dispatchEvent('open-keys', ...)

This event carries a shortcut ID, which the Explorer page uses to reload the exact set of traces and time range represented by the cluster.

Module: /modules/commit-detail-panel-sk

commit-detail-panel-sk

A Custom Element that displays a list of commits in a table format. Each entry in the table is rendered using a commit-detail-sk element, providing a consistent view of commit metadata across the Perf application.

Overview

The commit-detail-panel-sk acts as a container and controller for a collection of commit summaries. It is designed to be versatile, supporting both purely informational displays and interactive selection workflows (e.g., choosing a specific commit from a list associated with a performance anomaly).

Key Responsibilities

  • Data Presentation: Transforms an array of Commit objects into a vertical list of detailed rows.
  • Selection Management: Tracks which commit is currently active or selected via the selected attribute.
  • Interaction Handling: Manages click events on rows and translates them into high-level commit-selected events for parent components.
  • State Propagation: Passes contextual information, such as trace_id, down to individual commit-detail-sk children to ensure they have the necessary context for rendering or linking.

Design Decisions

Interactive vs. Static Modes

The component uses a selectable boolean attribute to toggle between two distinct behaviors:

  1. Static: The panel is purely for viewing. Visual cues like hover pointers and selection highlights are disabled, and click events are ignored.
  2. Interactive: The panel acts as a selection list. The CSS adds a cursor: pointer to rows, and the component responds to user clicks by updating its state and broadcasting the selection.

This dual-mode design allows the same component to be used in a read-only dashboard as well as in a “point-and-click” triage workflow.

Event Delegation and Parent Lookup

The component implements a click listener on the top-level <table> rather than attaching individual listeners to every row. This is more efficient for large lists.

When a click occurs, it uses the findParent utility to locate the nearest TR element. This ensures that even if a user clicks a nested link or span inside the commit-detail-sk child, the panel correctly identifies which index in the details array was targeted.
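The parent lookup can be sketched with a minimal node interface. The real findParent lives in a shared utility library and walks actual DOM elements; this stand-in only captures the traversal logic:

```typescript
// Just enough structure to walk upward: a node name and a parent pointer.
interface NodeLike {
  nodeName: string;
  parentElement: NodeLike | null;
}

// Walk up from the click target until a node with the given name is found,
// or null if the chain is exhausted (e.g. the click landed outside any row).
function findParent(start: NodeLike | null, nodeName: string): NodeLike | null {
  let n = start;
  while (n !== null && n.nodeName !== nodeName) {
    n = n.parentElement;
  }
  return n;
}
```

This is why a click on a nested link or span still resolves to the enclosing TR.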

Selection Workflow

The following diagram illustrates how a user interaction is converted into an application-level event:

User Click
    |
    v
[Table Click Handler] --(Check selectable)--> [Exit if false]
    |
    | (findParent 'TR')
    v
[Extract data-id] ---------------------------> [Update 'selected' attribute]
    |                                                      |
    v                                                      v
[Construct Event Detail]                        [Trigger CSS :selected highlight]
    | (author, message, commit)
    v
[Dispatch 'commit-selected']

Key Components

commit-detail-panel-sk.ts

The core logic of the element. It utilizes lit-html for efficient rendering.

  • Properties/Attributes:
    • details: The source array of Commit objects.
    • selectable: Enables/disables interaction.
    • selected: The index of the currently highlighted commit.
    • hide: When true, prevents the list from rendering any rows, effectively clearing the view without losing the underlying data.
    • trace_id: A string passed to children to provide context for the specific performance trace being inspected.

commit-detail-panel-sk.scss

Defines the visual state of the panel. It leverages CSS variables for theme support (light/dark mode).

  • Highlights rows using the tr[selected] selector.
  • Adjusts opacity based on the selectable state to provide a visual hint of whether the component is interactive.

commit-detail-panel-sk_po.ts

A Page Object (PO) implementation used for automated testing. It abstracts the DOM structure (tables, rows) into a set of asynchronous methods like clickRow(index) and getSelectedRow(), allowing tests to interact with the component at a functional level rather than a DOM level.

Events

commit-selected

Fired when a user clicks a row while the selectable attribute is present.

  • Detail: Contains the index of the selection, a string description (author + message), and the full Commit object.
  • Bubbles: True, allowing parent containers to catch selection events from the panel.

Module: /modules/commit-detail-picker-sk

commit-detail-picker-sk

The commit-detail-picker-sk module provides a specialized UI component for searching and selecting a specific commit from a repository's history. It is designed to handle the discovery of commits by allowing users to browse within a configurable time window and view detailed information before making a selection.

Design and Implementation

The component acts as a high-level wrapper around three key functional areas: a trigger (a button showing the current selection), a search/filter interface (date range selection), and a results display (commit-detail-panel-sk).

Workflow: Commit Selection

The module implements a “modal picker” pattern to keep the main UI clean while providing a rich interface for selection when needed.

User Interaction          State Management & Fetching          Sub-components
+--------------+        +---------------------------+        +-------------------+
| Click Button | ------>| Open <dialog>             |        |                   |
+--------------+        |                           |        |                   |
                        | Fetch Commits (_/cidRange)|        |                   |
                        | (Filtered by Date Range)  |        |                   |
                        +-------------+-------------+        |                   |
                                      |                      |                   |
                                      v                      |                   |
                        +---------------------------+        |                   |
                        | Update .details Property  | ------>| commit-detail-    |
                        +---------------------------+        |   panel-sk        |
                                      |                      |                   |
                                      v                      |                   |
+--------------+        +---------------------------+        |                   |
| Select Commit| <------| Emit 'commit-selected'    | <------| (Item Clicked)    |
+--------------+        | Close <dialog>            |        |                   |
                        +---------------------------+        +-------------------+

Key Components and Responsibilities

  • commit-detail-picker-sk.ts: The core logic handler. It manages the state of the modal (open/closed), the current date range for searching, and the retrieval of commit data from the server.
    • Data Fetching: It communicates with the backend via the /_/cidRange/ POST endpoint. It sends a RangeRequest containing a start and end timestamp and an optional offset. This allows the picker to populate its list based on user-defined time windows.
    • Synchronization: When the selection property (a CommitNumber) is set externally, the component automatically triggers a fetch to ensure the details of that commit are loaded and displayed in the button label.
  • commit-detail-panel-sk: Used within the dialog to render the list of commits. It handles the actual rendering of commit messages, authors, and hashes, and provides the selection mechanism within the list.
  • day-range-sk: Provides the UI for the user to modify the search window. Changing the date range in this component triggers a new fetch in the picker to refresh the available commits.
  • dialog (HTML5): Used to overlay the picker interface. This keeps the commit browsing experience contextual without navigating the user away from their current task.

Events

The component communicates the user's choice to the rest of the application via a custom event:

  • commit-selected: Emitted when a user clicks a commit in the panel. The detail of the event contains the selected index and commit information, following the structure defined by CommitDetailPanelSkCommitSelectedDetails.

Interaction Logic

  1. Initial Load: On attachment to the DOM, the component defaults to a 24-hour window (ending at Date.now()). It fetches the commits for this range to populate the internal panel.
  2. Updating the Range: If a user cannot find the desired commit, they can expand the “Date Range” section. This uses day-range-sk to update the begin and end timestamps, which causes the picker to re-query the backend.
  3. Selection Persistence: The button label is dynamically generated based on the current selection. If no commit is selected or if the selected commit isn't found in the current fetched batch, it defaults to “Choose a commit.”
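The default-window behavior above can be sketched as a small helper that builds the RangeRequest payload. This is an illustrative sketch: the field shapes follow the description above, but `defaultRangeRequest` and the `-1` placeholder for an unused offset are assumptions, not the component's exact API.

```typescript
// Hypothetical sketch: the picker's default 24-hour window, expressed in
// seconds since the Unix epoch as the /_/cidRange/ endpoint expects.
interface RangeRequest {
  begin: number; // seconds since the Unix epoch
  end: number;
  offset: number; // optional CommitNumber; -1 here stands for "unused"
}

function defaultRangeRequest(nowMs: number = Date.now()): RangeRequest {
  const end = Math.floor(nowMs / 1000);
  return { begin: end - 24 * 60 * 60, end, offset: -1 };
}
```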

Module: /modules/commit-detail-sk

The commit-detail-sk module provides a specialized web component for displaying concise information about a single Git commit within the Perf application. It serves as a navigational bridge, allowing users to move from a specific commit to various analysis views such as data exploration, clustering, or triage.

Design and Intent

The element is designed to be a compact, action-oriented summary. It doesn't just display metadata; it contextualizes the commit based on the user's current interaction state.

A key design choice is the conditional behavior of the Explore functionality. The component can navigate to one of two destinations depending on whether a trace_id is provided:

  1. Generic Explore: If only a commit is known, it links to a general view of that commit.
  2. Contextual Explore: If a trace_id is present, the component assumes the user is interested in how that specific trace performed around the time of the commit. It automatically calculates a time window of +/- 4 days around the commit timestamp to provide immediate visual context in the Explore view.
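The +/- 4 day calculation above can be sketched as follows. The function name is illustrative; the real component builds a full Explore query object around these bounds.

```typescript
// Contextual Explore window: +/- 4 days around the commit timestamp,
// in seconds since the Unix epoch.
const FOUR_DAYS_S = 4 * 24 * 60 * 60;

function exploreWindow(commitTimestampS: number): { begin: number; end: number } {
  return {
    begin: commitTimestampS - FOUR_DAYS_S,
    end: commitTimestampS + FOUR_DAYS_S,
  };
}
```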

Key Components and Logic

commit-detail-sk.ts

This is the core implementation file. It defines the CommitDetailSk class, which manages the following properties:

  • cid: A Commit object containing the hash, author, timestamp, message, and URL.
  • trace_id: An optional string identifying a specific performance trace.

The component uses Lit for rendering and follows a reactive pattern where updates to cid or trace_id trigger a re-render. It utilizes diffDate from the infra-sk library to display human-readable relative timestamps (e.g., “3 days ago”).

Action Buttons and Navigation

The component renders a set of standard actions, all of which open in a new browser tab:

  • Explore: Navigates to /e/ (contextual) or /g/e/ (generic).
  • Cluster: Navigates to /g/c/[hash], used for analyzing performance clusters associated with that commit.
  • Triage: Navigates to /g/t/[hash], used for managing alerts or regressions at that point in time.
  • Commit: Links directly to the external source hosting service (e.g., Gitiles) using the URL provided in the commit object.

Workflow: Explore Navigation Logic

The following diagram illustrates how the component determines the destination of the “Explore” button click:

User Clicks "Explore"
          |
          v
  Is trace_id set?
    /           \
  [Yes]         [No]
    |             |
    |             v
    |      Navigate to:
    |      /g/e/{commit_hash}
    v
Calculate Time Range:
[ts - 4 days] to [ts + 4 days]
          |
          v
Construct Query Object:
{ keys: trace_id, begin, end, ... }
          |
          v
Navigate to:
/e/?{serialized_query}

Styling and Themes

The module includes commit-detail-sk.scss, which imports both standard color variables and Perf-specific themes. It supports a dark mode and ensures that links and buttons remain accessible and consistent with the broader Skia infrastructure design language. Material Web Components (md-outlined-button) are used for the action triggers to provide a consistent look and feel with other modern Skia modules.

Module: /modules/commit-range-sk

The commit-range-sk module provides a custom element designed to display and link to specific commits or ranges of commits within a repository. It is primarily used in the Perf UI to help users navigate from a data point or a regression in a trace directly to the relevant code changes in a source control browser (e.g., Gitiles/Googlesource or GitHub).

High-Level Design Decisions

The element is designed to be reactive and data-driven, relying on trace data and column headers to resolve human-readable commit numbers into machine-readable Git hashes.

  • Dynamic Link Generation: Instead of hardcoding URL patterns, the element utilizes a global configuration (window.perf.commit_range_url). This allows the same component to work across different repository hosting services by providing templates like .../range/{begin}/{end} or .../commit/{end}.
  • Automatic Range Detection: The element automatically detects whether it should display a single commit or a range. If the selected data point immediately follows the previous valid data point, it is treated as a single commit. If there are “holes” (missing data) in the trace between the current point and the last known good point, it expands the link to cover the entire range of commits where the change could have occurred.
  • Asynchronous Resolution: Git hashes are often not present in the initial trace header to save bandwidth. The module fetches these hashes lazily via the cid (Commit ID) lookup service only when a link needs to be rendered.
  • Request Concurrency & Caching: To prevent UI flicker and redundant network requests, the element implements an internal cache for hashes and tracks request IDs to ensure that late-arriving responses from stale requests do not overwrite the current UI state.

Key Components and Responsibilities

commit-range-sk.ts

This is the core logic of the component. It manages the internal state of the link text and URL.

  • Range Calculation: It inspects the trace array and the commitIndex to find the “previous” valid commit. It skips over MISSING_DATA_SENTINEL values to ensure the user is directed to the full range of potential changes.
  • Template Parsing: It replaces {begin} and {end} placeholders in the configured URL. It also includes specific logic for “Googlesource” style URLs, converting range logs (+log/begin..end) into single commit views (+/end) when the range size is one.
  • Event Dispatching: Dispatches a commit-range-changed event when a new link is successfully generated, allowing parent components (like tooltips or info panels) to react.
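A minimal sketch of the template expansion, including the Googlesource collapse from a range log to a single-commit view. The function name is an assumption; the real element also manages loading state and caching.

```typescript
// Replace {begin}/{end} placeholders in the configured URL template.
// For a single commit, collapse the Googlesource-style range log
// "+log/<begin>..<end>" into the single-commit view "+/<end>".
function expandCommitRangeUrl(
  template: string,
  beginHash: string,
  endHash: string,
  isRange: boolean
): string {
  let url = template.replace('{begin}', beginHash).replace('{end}', endHash);
  if (!isRange) {
    url = url.replace(`+log/${beginHash}..${endHash}`, `+/${endHash}`);
  }
  return url;
}
```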

Interaction Workflow

The following diagram illustrates how the component transforms a user selection into a functional link:

User Selects Point (index N)
          |
          v
Find Previous Valid Point (index M < N) in Trace
          |
          v
Lookup Commit Numbers (Offsets) for M and N in Header
          |
          v
[Network Request] -> lookupCids(OffsetM, OffsetN)
          |
          v
Apply Hashes to window.perf.commit_range_url Template
          |
          v
Render <a> link with text "OffsetM+1 - OffsetN" (if range)
       or "OffsetN" (if single commit)

commit-range-sk_po.ts

Provides the Page Object for testing. It encapsulates the logic for retrieving the link's href and the displayed text, shielding tests from the underlying DOM structure (which alternates between an <a> tag when the URL is ready and a <span> while loading).

test_data.ts

Contains mock data structures that simulate the header and trace objects produced by the Perf backend, ensuring consistent testing of the range-finding logic.

Implementation Details

  • Single vs. Range: A “Range” is defined as a gap where start_commit + 1 < end_commit. If they are adjacent, isRange() returns false, and the UI simplifies the display to a single hash or commit number.
  • GitHub Support: The component has a specific fallback for GitHub URLs; if “github” is detected in the URL template, it truncates the displayed text to a short 7-character hash for better readability.
  • Rendering: The component uses a light DOM (via createRenderRoot() returning this) rather than Shadow DOM, which is a common pattern in this project for elements that need to inherit global styles or be easily accessible by parent tooltips.
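The sentinel-skipping and range test can be sketched with two small helpers (names are illustrative; the real component works against the trace array and header objects):

```typescript
const MISSING_DATA_SENTINEL = 1e32;

// Walk backwards from the selected index, skipping missing-data holes,
// to find the last point where the value was known good.
function previousValidIndex(trace: number[], commitIndex: number): number {
  for (let i = commitIndex - 1; i >= 0; i--) {
    if (trace[i] !== MISSING_DATA_SENTINEL) return i;
  }
  return -1;
}

// A "range" exists only when the two commits are not adjacent.
function isRange(startCommit: number, endCommit: number): boolean {
  return startCommit + 1 < endCommit;
}
```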

Module: /modules/common

/modules/common

This module serves as the foundational utility layer for Skia Perf. It centralizes shared logic for data visualization, anomaly handling, keyboard interaction, and testing infrastructure.

Data Visualization and Plotting

The core responsibility of this module is to bridge the gap between raw backend trace data and the frontend charting engine (Google Charts).

Plot Construction

The module handles the complex task of “transposing” trace data. Backend data typically arrives organized by trace keys, whereas charting libraries require data organized by rows (time/commit positions) with traces as columns.

  • plot-builder.ts: Contains the logic for this transformation. It supports different domains (commits, dates, or both) and handles missing data using a MISSING_DATA_SENTINEL. It also generates consistent color palettes for charts.
  • plot-util.ts: Provides higher-level utilities to create ChartData objects, specifically managing the integration of anomaly markers into the data points so they can be rendered on the graph.
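The transposition step can be sketched as below. Types are simplified stand-ins for the real DataFrame structures, and mapping the sentinel to null is one common way to make a charting engine break the line at a gap.

```typescript
const MISSING_DATA_SENTINEL = 1e32;
type TraceSet = { [key: string]: number[] };

// Turn trace-keyed columns into per-commit rows: row i holds the value
// of every trace at commit position i, with holes mapped to null.
function transpose(traces: TraceSet): (number | null)[][] {
  const keys = Object.keys(traces);
  const len = keys.length ? traces[keys[0]].length : 0;
  const rows: (number | null)[][] = [];
  for (let i = 0; i < len; i++) {
    rows.push(
      keys.map((k) => (traces[k][i] === MISSING_DATA_SENTINEL ? null : traces[k][i]))
    );
  }
  return rows;
}
```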

Visual Consistency and Collision Avoidance

To ensure that performance graphs remain readable when comparing similar builds, the module implements a deterministic coloring strategy:

  • Trace Coloring: Colors are derived from a hash of the trace name to ensure consistency across page reloads.
  • Variant Offsets: Specific logic exists to detect collisions between a “base” trace and its variants (e.g., ref or pgo builds). If a collision is detected in the color space, the variants are mathematically shifted to guaranteed distinct colors.
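The deterministic-coloring idea can be sketched as hashing the trace key into a palette index so colors survive page reloads. The hash function and palette here are assumptions for illustration, not Perf's actual implementation.

```typescript
// Simple 32-bit string hash (illustrative, not Perf's actual hash).
function hashString(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0;
  }
  return h;
}

// Same trace key -> same palette entry, on every page load.
function traceColor(traceKey: string, palette: string[]): string {
  return palette[hashString(traceKey) % palette.length];
}
```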

Anomaly Management

Anomalies are a first-class citizen in this module. It provides types and formatting logic to present performance regressions or improvements to the user.

  • anomaly-data.ts: Defines the data structure for a point on a graph that represents an anomaly, including its coordinates and highlight state.
  • anomaly.ts: Contains formatting logic for numeric changes (percentages) and human-readable links to bug trackers. It handles specialized bugId values like “Invalid” or “Ignored” alerts.

Interaction and Styling

Keyboard Shortcuts

To facilitate rapid “triage” workflows, the module implements a centralized shortcut system.

  • ShortcutRegistry: A singleton that manages categories of shortcuts (Triage, Navigation, Report, General).
  • handleKeyboardShortcut: A global handler that maps physical key presses to specific method calls on components (e.g., onTriagePositive, onZoomIn), while intelligently ignoring events originating from input fields or textareas.
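The input-field guard can be sketched as a pure predicate. The contentEditable check is an assumption added for completeness; the real handler inspects the event target directly.

```typescript
// Ignore key events whose target is a text-entry element, so typing in
// a search box never triggers a triage or navigation shortcut.
function shouldIgnoreShortcut(targetTagName: string, isContentEditable: boolean): boolean {
  const tag = targetTagName.toUpperCase();
  return isContentEditable || tag === 'INPUT' || tag === 'TEXTAREA';
}
```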

Unified UI Components

  • buttons.scss: Defines a mixin (perf-button) that enforces a strict visual design system. It uses !important to ensure that Perf-specific buttons maintain their identity even when embedded in components with conflicting global styles or Shadow DOM boundaries.

Testing and Development Utilities

This module provides extensive infrastructure for both unit and integration testing.

  • test-util.ts: A comprehensive mock environment for demo pages and unit tests. It includes setUpExploreDemoEnv, which mocks the entire Perf backend API (anomalies, trace data, login status, and shortcut persistence) using fetch-mock.
  • puppeteer-test-util.ts: Provides helper functions for E2E testing, such as polling for DOM states, waiting for Google Charts to finish rendering (waitForReady), and validating ParamSet selections.

Workflow: Data to Chart

The following diagram illustrates the flow of data through the common module components:

[Raw TraceSet] ----> [plot-util.ts] <---- [Anomaly Map]
      |                    |
      |          (Match anomalies to points)
      |                    |
      V                    V
[plot-builder.ts] <--- [ChartData]
      |
(Transpose to Rows)
      |
      V
[Google Chart Engine]

Key Files Summary

File                      Responsibility
anomaly.ts                Formatting and calculation utilities for performance anomalies.
buttons.scss              Standardized visual styling for buttons across the application.
graph-config.ts           Logic for managing graph state and generating persistent shortcut URLs.
keyboard-shortcuts.ts     Central registry and event handler for application-wide hotkeys.
plot-builder.ts           Logic for transposing dataframes and managing chart color palettes.
plot-util.ts              High-level helpers for merging traces and anomalies into chartable data.
test-util.ts              Backend API mocking and dummy data generation for development.
puppeteer-test-util.ts    Synchronization and validation helpers for browser-based tests.

Module: /modules/const

Constants Module (/modules/const)

The const module serves as a centralized source of truth for shared values used throughout the Perf UI. Its primary purpose is to ensure data consistency between the backend services (written in Go) and the frontend visualization layers, particularly regarding how incomplete or special data states are represented.

Data Integrity and Sentinels

A significant challenge in performance monitoring is the representation of gaps in time-series data. The design of this module focuses on providing stable “sentinel” values that allow the UI to distinguish between valid data points and missing measurements without relying on non-standard JSON types.

Numeric Sentinels

The backend storage and processing layers (specifically //go/vec32/vec) utilize a specific float32 value to denote missing samples. Because the standard JSON specification does not support NaN or Infinity, the frontend must use a value that is:

  1. A valid, representable float32.
  2. Compact in its string/JSON representation to minimize payload size.
  3. Extremely unlikely to occur as a legitimate measurement in performance testing.

MISSING_DATA_SENTINEL (set to 1e32) satisfies these requirements. When the UI encounters this value within a trace, it interprets the point as a gap rather than a zero or a legitimate data spike, allowing graphing components to break lines or omit points appropriately.

String Sentinels

For categorical data or metadata fields where a value might be absent or undefined, the module provides MISSING_VALUE_SENTINEL. Using an explicit string (__missing__) instead of an empty string or null prevents ambiguity during filtering and grouping operations, ensuring that “missing data” can be treated as its own distinct category in the UI.
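Both sentinels are simple constants; the sketch below shows the kind of checks consumers might perform (the helper names are illustrative). Note that 1e32, unlike NaN, survives a JSON round-trip unchanged.

```typescript
const MISSING_DATA_SENTINEL = 1e32;
const MISSING_VALUE_SENTINEL = '__missing__';

// A trace value is a gap, not a measurement, when it equals the sentinel.
function isMissingSample(v: number): boolean {
  return v === MISSING_DATA_SENTINEL;
}

// Illustrative grouping helper: absent metadata values collapse to the
// string sentinel so "missing" forms its own distinct category.
function categoryOf(v: string | null): string {
  return v === null || v === '' ? MISSING_VALUE_SENTINEL : v;
}
```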

Key Exports

Constant                  Purpose
MISSING_DATA_SENTINEL     A numeric float used to mark holes in time-series traces.
MISSING_VALUE_SENTINEL    A string used to represent the absence of a value in metadata or parameters.

Workflow: Data Rendering

The following diagram illustrates how these constants act as a bridge between the raw data ingestion and the final visualization:

[ Backend Trace ] -> [ JSON Serialization ] -> [ UI Data Fetching ] -> [ Plotting Logic ]
       |                      |                       |                      |
       |                      |                       |                      |
 Uses MissingDataSentinel     Converts to 1e32        Imports MISSING_DATA_  Detects 1e32 and
 (Go)                         (Valid JSON)            SENTINEL (TS)          renders a gap.

Module: /modules/csv

The csv module provides utilities for transforming performance data represented as a DataFrame into a Comma-Separated Values (CSV) format. This functionality is essential for allowing users to export trace data from the Perf system into spreadsheet software or external analysis tools.

Overview

The primary design goal of this module is to flatten high-dimensional trace data—which is structured as a collection of key-value pairs (parameters) and time-series arrays—into a two-dimensional grid. To achieve this, the module dynamically generates a schema based on the unique set of parameter keys present in the provided traces.

Design Decisions and Implementation

Dynamic Column Mapping

A challenge in converting DataFrame objects to CSV is that different traces may have different sets of parameters (e.g., one trace might have an os parameter while another has a config parameter).

To ensure a consistent grid structure:

  1. Parameter Extraction: The module parses the structured trace IDs (comma-separated key-value strings) into distinct parameter sets.
  2. Key Union and Sorting: It identifies every unique parameter key across all traces and sorts them alphabetically. These sorted keys form the leading columns of the CSV.
  3. Normalization: For each row, if a trace lacks a specific parameter key present in another trace, the corresponding cell is left empty.

Handling Time and Missing Data

  • Temporal Headers: The columns following the parameter keys are derived from the DataFrame.header. Timestamps, which are stored internally as seconds, are converted to ISO 8601 strings to ensure they are human-readable and correctly interpreted by external tools.
  • Sentinels: Performance data often contains “holes” where data collection failed or was skipped. The module explicitly checks for MISSING_DATA_SENTINEL values and converts them to empty strings in the CSV output to maintain the numeric integrity of the rest of the column.
  • Filtering: The implementation automatically excludes traces starting with special_. These are internal synthetic traces (like averages or benchmarks) that do not conform to standard parameter schemas and would clutter the export.
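The column-mapping and sentinel rules above can be sketched as a simplified stand-in for dataFrameToCSV (parsing and row layout only; the ISO timestamp headers are omitted):

```typescript
const MISSING_DATA_SENTINEL = 1e32;

// Parse a trace ID of the form ",k1=v1,k2=v2," into a params object.
function parseId(id: string): { [key: string]: string } {
  const params: { [key: string]: string } = {};
  id.split(',').forEach((pair) => {
    const [k, v] = pair.split('=');
    if (k && v !== undefined) params[k] = v;
  });
  return params;
}

// Build one normalized row per trace: sorted union of param keys first
// (empty cell when a trace lacks a key), then the data points (empty
// cell for sentinel holes). Traces starting with "special_" are skipped.
function toCSVRows(traces: { [id: string]: number[] }): string[][] {
  const ids = Object.keys(traces).filter((id) => !id.startsWith('special_'));
  const paramsById = ids.map(parseId);
  const keys = Array.from(new Set(paramsById.flatMap((p) => Object.keys(p)))).sort();
  return ids.map((id, i) => [
    ...keys.map((k) => paramsById[i][k] ?? ''),
    ...traces[id].map((v) => (v === MISSING_DATA_SENTINEL ? '' : String(v))),
  ]);
}
```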

Key Workflows

The CSV generation process follows a linear transformation path:

[ DataFrame ]
      |
      | 1. Extract Trace IDs
      V
[ Trace IDs ] --> [ Parse to Params ] --> [ Identify & Sort Unique Keys ]
                                                   |
      +--------------------------------------------+
      |
      V
[ Generate Header Row ] (Sorted Param Keys + ISO Timestamps)
      |
      V
[ Iterate Traces ]
      |
      +--> [ Map Params to Columns ] ----+
      |                                  |--> [ Concatenate Row ]
      +--> [ Map Data Points to Columns ]--+
      |
      V
[ Final CSV String ]

Key Components

  • dataFrameToCSV (index.ts): The main entry point. It orchestrates the collection of keys, the formatting of headers, and the row-by-row serialization of trace data.
  • parseIdsIntoParams & allParamKeysSorted (index.ts): Helper functions that handle the translation between the serialized trace ID format and a structured, normalized set of columns. They rely on fromKey from the paramtools module to decompose the string-based identifiers.

Module: /modules/data-service

The data-service module provides a centralized, singleton-based interface for interacting with the Perf backend. It abstracts the complexities of HTTP communication, error handling, and long-running asynchronous operations into a clean API used by the frontend components.

Core Responsibility

The primary role of the DataService class is to act as the single source of truth for backend data fetching. By encapsulating all network logic, it ensures consistent headers, error reporting (via DataServiceError), and behavior across the application. It specifically manages:

  • State Persistence: Handling “shortcuts” (IDs representing specific sets of graph configurations or trace keys) to allow for shareable URLs.
  • Contextual Data: Fetching initial page settings, timezone-aware data, and default query configurations.
  • Data Manipulation: Calculating time/commit range shifts and retrieving user-reported issues.
  • Asynchronous Processing: Managing complex, multi-stage requests like frame generation which require progress monitoring.

Key Components

DataService (data-service.ts)

The main implementation follows the Singleton pattern, accessible via DataService.getInstance(). This ensures that shared configurations (like local development overrides) are consistent across all callers.

  • Standard Fetching: Methods like getShortcut, getDefaults, and shift wrap standard POST or GET requests. They use a private fetchJson helper that integrates with jsonOrThrow to standardize how the frontend handles malformed or failed responses.
  • Shortcuts: The service handles two types of shortcuts:
    • updateShortcut: Maps a complex GraphConfig array to a short ID.
    • createShortcut: Maps a simple list of trace keys to an ID. These methods include logic to skip execution during local development if perf.disable_shortcut_update is set, preventing unnecessary 500 errors from unconfigured local proxies.

Long-running Operations

The sendFrameRequest method handles one of the most complex workflows in the system: requesting data frames (collections of trace data for graphs). Because frame generation can be slow, the backend uses a “start-and-poll” pattern.

DataService leverages the progress module to manage this lifecycle:

  1. Initialization: It attaches the local browser's timezone to the request.
  2. Lifecycle Management: It accepts callbacks (onStart, onProgress, onMessage, onSettled) to allow UI components to update their loading states or progress bars.
  3. Polling: It delegates the polling logic to startRequest, which communicates with the /_/frame/start endpoint and waits for a “Finished” status.
  4. Error Transformation: If the progress returns an error status, it converts the backend messages into a human-readable DataServiceError.
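The start-and-poll lifecycle described above can be sketched as follows. The ProgressStatus shape and function names are assumptions for illustration; the real polling logic lives in the progress module.

```typescript
interface ProgressStatus {
  status: 'Running' | 'Finished' | 'Error';
  url: string; // where to poll next
  results?: unknown;
}

// Kick off the job, then poll until the backend reports Finished or
// Error, surfacing intermediate states through the onProgress callback.
async function startAndPoll(
  fetcher: (url: string) => Promise<ProgressStatus>,
  startUrl: string,
  onProgress: (s: ProgressStatus) => void
): Promise<unknown> {
  let state = await fetcher(startUrl);
  while (state.status === 'Running') {
    onProgress(state);
    state = await fetcher(state.url);
  }
  if (state.status === 'Error') throw new Error('frame request failed');
  return state.results;
}
```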

Data Flow Workflow

The following diagram illustrates the lifecycle of a frame request through the DataService:

Component          DataService           Progress Module          Backend
    |                   |                       |                    |
    |--sendFrameReq()-->|                       |                    |
    |                   |----startRequest()---->|                    |
    |                   |                       |--- /frame/start -->|
    |                   |                       |                    | (Processing)
    |                   |                       |<-- HTTP 200 (ID) --|
    |                   |                       |                    |
    |                   |       [Loop]          |--- Check Status -->|
    |<---onProgress()---|                       |                    |
    |                   |<--updateProgress()----|<-- SerializedMsg --|
    |                   |                       |                    |
    |                   |                       |--- Check Status -->|
    |                   |                       |<-- Status: Fin ----|
    |<-- FrameResponse--|                       |                    |

Error Handling

The module defines a specialized DataServiceError. This class extends the native Error but includes an optional status property (the HTTP status code). This allows calling components to distinguish between network-level failures (e.g., 404, 500) and application-level errors (e.g., failed data processing messages returned within a valid HTTP 200 response).
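A sketch of such an error type (the field layout is illustrative; Object.setPrototypeOf keeps instanceof working when the build targets older JavaScript):

```typescript
class DataServiceError extends Error {
  status?: number;

  constructor(message: string, status?: number) {
    super(message);
    this.name = 'DataServiceError';
    this.status = status;
    // Restore the prototype chain when targeting older JS runtimes.
    Object.setPrototypeOf(this, DataServiceError.prototype);
  }
}

// Callers can branch on whether an HTTP status is attached: network-level
// failures carry one, application-level errors do not.
function isHttpFailure(e: unknown): boolean {
  return e instanceof DataServiceError && e.status !== undefined && e.status >= 400;
}
```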

Module: /modules/dataframe

Dataframe Module

The dataframe module provides the core data structures and management logic for handling performance trace data in the Perf system. It manages the lifecycle of performance data—from fetching raw JSON responses to maintaining a local cache and transforming data for visualization.

Overview

The module centers around the concept of a DataFrame, which represents a set of performance traces (time series data) sharing a common horizontal axis (commits or timestamps). The primary goal of this module is to provide a consistent way to query, extend, merge, and visualize these traces along with their associated metadata, such as anomalies and user-reported issues.

Key Components

DataFrame Management (index.ts)

This file defines the fundamental logic for manipulating DataFrame objects. It is the TypeScript equivalent of the backend Go implementation.

  • Joining and Merging: The join function allows two DataFrames to be combined into one. It handles cases where headers (commit ranges) overlap or differ by recalculating a unified header and padding missing data with a MISSING_DATA_SENTINEL.
  • Subsetting: Functions like findSubDataframe and generateSubDataframe allow for extracting specific slices of data based on commit offsets or timestamps.
  • Anomaly Handling: It provides logic to merge AnomalyMap structures, ensuring that anomaly data from different requests are combined correctly for the same traces.
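The join's header unification and sentinel padding can be sketched as below (simplified: headers are plain commit offsets here, not full ColumnHeader objects, and function names are illustrative):

```typescript
const MISSING_DATA_SENTINEL = 1e32;

// Unified header: the sorted union of both DataFrames' commit offsets.
function joinHeaders(a: number[], b: number[]): number[] {
  return Array.from(new Set([...a, ...b])).sort((x, y) => x - y);
}

// Re-index a trace against the unified header, padding positions the
// trace has no value for with the missing-data sentinel.
function reindexTrace(trace: number[], oldHeader: number[], unified: number[]): number[] {
  const byOffset = new Map<number, number>(
    oldHeader.map((o, i) => [o, trace[i]] as [number, number])
  );
  return unified.map((o) => byOffset.get(o) ?? MISSING_DATA_SENTINEL);
}
```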

Data Repository and Context (dataframe_context.ts)

The DataFrameRepository (implemented as <dataframe-repository-sk/>) acts as the state manager for performance data within a frontend application. It utilizes Lit context to provide data to descendant components.

  • Caching and Extension: It maintains a local cache of traces. The extendRange method allows the UI to request more data (forward or backward in time) while maintaining the current ParamSet.
  • Chunking: To improve performance and reliability, large data requests are automatically sliced into smaller time “chunks” (defaulting to 1 month) and fetched concurrently.
  • Context Provision: It provides several contexts:
    • dataframeContext: The raw DataFrame.
    • dataTableContext: A google.visualization.DataTable prepared for google-chart.
    • dataframeAnomalyContext: Current known anomalies.
    • dataframeUserIssueContext: Buganizer issues associated with specific trace points.
    • dataframeLoadingContext: A boolean flag indicating if a fetch operation is in progress.

Trace Identification and Formatting (traceset.ts)

Trace keys in Perf are typically comma-separated strings of key-value pairs (e.g., ,benchmark=JetStream2,bot=MacM1,). This component provides utilities to parse these keys for UI display.

  • Dynamic Title and Legend: It dynamically calculates which parameters are common to all traces in a set (the Title) and which parameters vary (the Legend). This prevents redundant information from cluttering the UI.
  • Special Function Handling: It recognizes and strips transformation functions (like norm()) from keys to ensure that data lookup for issues and anomalies remains consistent even if the data is being transformed for display.

Key Workflows

Data Fetching and Merging

When a user requests to “extend” a chart, the following process occurs:

UI Action (e.g., "Scroll Left")
      |
      v
DataFrameRepository.extendRange(offset)
      |
      +--> calculate deltaRange (new time window)
      +--> sliceRange into chunks (e.g., 30-day blocks)
      +--> concurrent DataService.sendFrameRequest()
      |
      v
Receive multiple FrameResponses
      |
      +--> Sort responses by commit offset
      +--> Merge ColumnHeaders (unified X-axis)
      +--> Map old/new trace indices to unified header
      +--> Pad missing points with SENTINEL
      |
      v
Update Lit Contexts
      |
      +--> dataframeContext (raw data)
      +--> dataTableContext (Google Charts format)
      +--> UI components re-render

Trace Metadata Extraction

The module handles the logic of turning complex trace keys into readable labels:

Trace A: ,benchmark=V8,test=Total,arch=arm,
Trace B: ,benchmark=V8,test=Total,arch=x86,

Logic:
1. Common: benchmark=V8, test=Total  ==> Title: "V8/Total"
2. Unique: arch=arm vs arch=x86      ==> Legend: ["arm", "x86"]
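The split above can be sketched as a pure function (assumes every trace in the set carries the same parameter keys; the name is illustrative):

```typescript
// Params shared by every trace form the title; params that vary become
// per-trace legend entries.
function splitTitleAndLegend(paramSets: { [key: string]: string }[]): {
  title: { [key: string]: string };
  legends: string[][];
} {
  const keys = Object.keys(paramSets[0]);
  const common = keys.filter((k) => paramSets.every((p) => p[k] === paramSets[0][k]));
  const title: { [key: string]: string } = {};
  common.forEach((k) => (title[k] = paramSets[0][k]));
  const legends = paramSets.map((p) =>
    keys.filter((k) => !common.includes(k)).map((k) => p[k])
  );
  return { title, legends };
}
```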

Design Decisions

  • Immutable ParamSet for Extensions: Once a repository is initialized with a ParamSet, extensions (paging through data) use that same set of parameters. To change the query itself, a full resetTraces is required. This simplifies the merging logic by ensuring the “vertical” dimension (the traces) remains relatively stable while the “horizontal” dimension (time) grows.
  • DataTable Conversion: Instead of every UI component manually parsing the DataFrame, the DataFrameRepository performs a centralized conversion to the Google Visualization DataTable format. This ensures that expensive data transformations happen once per update.
  • Sparse Anomaly Maps: Anomalies are stored in a map-of-maps structure indexed by TraceKey and then CommitPosition. This allows for efficient O(1) lookups when rendering points on a chart, rather than iterating through lists of anomalies for every data point.
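The map-of-maps lookup is essentially a one-liner; in this sketch a string payload stands in for the real Anomaly object:

```typescript
// Anomalies indexed by trace key, then by commit position, so a plot
// can check each rendered point in constant time.
type AnomalyMap = { [traceKey: string]: { [commitPosition: number]: string } };

function anomalyAt(map: AnomalyMap, traceKey: string, commit: number): string | undefined {
  return map[traceKey]?.[commit];
}
```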

Module: /modules/day-range-sk

day-range-sk

The day-range-sk module provides a custom element for selecting a time range, defined by a beginning and an ending timestamp. It is designed to simplify date range selection in applications—such as performance monitoring dashboards—where users need to filter data within specific historical boundaries.

Design and Implementation

The module is built using Lit and extends the ElementSk base class. Its primary responsibility is to synchronize two separate date inputs and expose their combined state as a single range.

Time Representation

Unlike standard HTML date inputs that often use strings or millisecond-based timestamps, day-range-sk standardizes on seconds since the Unix epoch. This decision aligns with the data formats typically used in backend storage systems (like Prometheus or BigTable) within the Perf ecosystem, reducing the need for repeated conversions in the application logic.

Components

The element acts as a composite wrapper around two calendar-input-sk elements:

  • Begin Input: Controls the start of the range.
  • End Input: Controls the end of the range.

The internal state is managed through the begin and end properties, which are mirrored to attributes. This mirroring allows the element to be initialized or manipulated via declarative HTML or imperative JavaScript.

Workflow: Range Selection

When a user interacts with either of the internal calendars, the element processes the change and propagates it upward.

[ User Interface ]          [ day-range-sk ]            [ Consumer/App ]
        |                          |                           |
        |--- Change Begin Date --->|                           |
        |                          |--- Calculate Seconds ---->|
        |                          |--- Update 'begin' Attr -->|
        |                          |--- Dispatch Event ------->|
        |                          |    (day-range-change)     |

  1. The user selects a date in one of the calendar-input-sk components.
  2. The component catches the @input event from the calendar.
  3. The Date object from the calendar is converted into a floor-rounded second-based timestamp.
  4. The element updates its own attribute to ensure the UI stays in sync with the state.
  5. A day-range-change event is dispatched, containing both the updated and the stationary timestamp in its detail object.

State Management and Defaults

The element is designed to be “ready to use” immediately upon being added to the DOM.

  • Initialization: If begin or end attributes are not provided, the component defaults to a 24-hour window ending at the current time.
  • Property Upgrading: The implementation includes _upgradeProperty calls in the connectedCallback, ensuring that if the properties were set on the DOM element before the custom element definition was loaded, those values are correctly captured and reflected.
  • Reactivity: By implementing observedAttributes, the element automatically re-renders whenever the begin or end attributes are changed externally, ensuring the visual calendar inputs always match the underlying data.
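
The upgrade step can be sketched as follows; this is a generic illustration of the lazy-property technique, and day-range-sk's actual accessors differ:

```typescript
// Sketch of the property-upgrade pattern run from connectedCallback: if a
// plain data property was set on the element before its custom-element
// definition loaded, delete it and re-assign so the class accessor takes over.
function upgradeProperty(el: any, prop: string): void {
  if (Object.prototype.hasOwnProperty.call(el, prop)) {
    const value = el[prop];
    delete el[prop];
    el[prop] = value; // now routed through the prototype setter
  }
}

// Simulation of the race. The setter doubles its input purely to make its
// effect observable; the real element's accessors do no such thing.
class DayRange {
  _begin = 0;
  get begin(): number { return this._begin; }
  set begin(v: number) { this._begin = v * 2; }
}

const el: any = {};
el.begin = 5;                                  // set before the definition loaded
Object.setPrototypeOf(el, DayRange.prototype); // the definition "upgrades" el
```

Before `upgradeProperty` runs, the own data property shadows the class accessor; afterwards, reads and writes go through the getter/setter as intended.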

Events

day-range-change

This is the primary event emitted by the module. It bubbles up the DOM, allowing parent components to listen for any changes to the range.

The detail property of the event implements the DayRangeSkChangeDetail interface:

{
  begin: number; // Seconds since epoch
  end: number; // Seconds since epoch
}

Appearance

The element's layout is controlled via day-range-sk.scss, which ensures the labels and inputs are displayed as block elements with consistent spacing. It integrates with the global theme system by using CSS variables like --on-surface and --surface for the input borders and backgrounds, supporting both light and dark modes.

Module: /modules/domain-picker-sk

The domain-picker-sk module provides a specialized UI component for selecting time and data ranges, specifically tailored for the Skia Perf ecosystem. Its primary purpose is to define the “domain” (the X-axis) of a performance data request, allowing users to switch between absolute time ranges and relative commit counts.

Overview

The component provides two distinct modes for querying data, known as the request_type:

  1. Range Mode (0): Users specify a strict chronological window using a “Begin” and “End” date. This is ideal for investigating events within a known timeframe.
  2. Dense Mode (1): Users specify an “End” date and a fixed number of “Points” (commits) to look back from that date. This is preferred when the frequency of data points is inconsistent, ensuring the resulting visualization contains a predictable density of information.

Design Decisions

State Encapsulation

The component manages its internal state via a DomainPickerState object. This interface acts as the contract between the picker and its parent application (typically a dashboard or query page).

  • Unix Timestamps: Time is handled internally and exposed via the state as Unix seconds (integers), while the UI leverages calendar-input-sk for human-readable interaction.
  • Bi-directional State: The component supports both reading and writing the entire state via a state getter/setter, facilitating easy integration with URL-backed state management or “Reset” buttons.

Restricted Mode via force_request_type

An architectural choice was made to allow parent components to “lock” the picker into a specific mode using the force_request_type attribute.

  • When this attribute is set to range or dense, the component hides the radio-sk selection buttons entirely.
  • This allows the same component to be used in simple views where only one query style is supported, without duplicating the calendar and input logic.

Key Components and Files

  • domain-picker-sk.ts: The core logic. It utilizes Lit for templating and manages the conditional rendering logic that swaps between the “Begin Date” picker (Range mode) and the “Points” numeric input (Dense mode).
  • domain-picker-sk.scss: Defines the layout, ensuring that the calendar-input-sk and various labels align correctly. It uses standard element-sk variables to support theme switching (light/dark mode).
  • Dependency on calendar-input-sk: Rather than implementing date logic itself, the picker delegates date selection to this specialized sub-component, maintaining a consistent date-picking experience across the infrastructure.

Workflow: Range Selection

The following diagram illustrates how the component resolves its output state based on user interaction or attribute overrides:

User Input / State Set
          |
          v
+-----------------------+      YES      +-------------------------+
| force_request_type?   |-------------->| Override request_type   |
+-----------+-----------+               | (Hide Radio Buttons)    |
            | NO                        +------------+------------+
            v                                        |
+-----------------------+                            |
| Render Radio Buttons  |                            |
| (Range vs Dense)      |                            |
+-----------+-----------+                            |
            |                                        |
            +-------------------+-+------------------+
                                |
                                v
                    +-----------------------+
                    | Render Common "End"   |
                    | Calendar Input        |
                    +-----------+-----------+
                                |
                +---------------+---------------+
                |                               |
        [request_type: RANGE]           [request_type: DENSE]
                |                               |
      +---------+---------+           +---------+---------+
      | Render "Begin"    |           | Render "Points"   |
      | Calendar Input    |           | Numeric Input     |
      +-------------------+           +-------------------+

Component State Structure

The state property manages the following object:

Field          Type     Description
begin          number   Unix timestamp (seconds) for start of range.
end            number   Unix timestamp (seconds) for end of range.
num_commits    number   Count of points to retrieve (used in Dense mode).
request_type   number   0 for Range, 1 for Dense.
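
For illustration, a Dense-mode state object built from this contract (field names per the table above; values are arbitrary examples):

```typescript
// The state contract between domain-picker-sk and its parent, as
// documented above.
interface DomainPickerState {
  begin: number;        // Unix seconds
  end: number;          // Unix seconds
  num_commits: number;  // used in Dense mode
  request_type: number; // 0 = Range, 1 = Dense
}

// A "Dense" request: 50 commits ending at the given timestamp.
const dense: DomainPickerState = {
  begin: 0, // not meaningful in Dense mode
  end: 1700086400,
  num_commits: 50,
  request_type: 1,
};
```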

Module: /modules/errorMessage

The errorMessage module provides a standardized utility for surfacing application errors to the user and, optionally, tracking those errors through telemetry. It acts as a wrapper around the core elements-sk error messaging system, tailored for the specific requirements of the Perf application.

Design Goals

The primary goal of this module is to ensure that critical errors are not missed by the user while maintaining observability for developers.

  • Persistence by Default: Unlike many UI notification systems that disappear after a few seconds, this module defaults to a duration of 0. This forces the error message to remain visible until the user manually dismisses it, ensuring that transient network failures or complex logic errors are acknowledged.
  • Integrated Observability: By wrapping the UI notification with telemetry hooks, the module allows developers to monitor error rates and types in production without scattering reporting logic throughout the codebase.

Key Components

Core Utilities (index.ts)

The module exports two primary functions for handling errors:

  • errorMessage: A simplified wrapper that dispatches an error-sk event. This event is typically caught by a global <error-toast-sk> element (or similar) to display a notification. Its main contribution is overriding the default display duration to infinite (0).
  • errorMessageWithTelemetry: Extends the standard error notification by incrementing a metric counter before showing the UI toast. It accepts a TelemetryErrorOptions object to categorize the error by source (e.g., a specific API endpoint) and error code (e.g., “404” or “500”).

Telemetry Integration

The telemetry functionality is designed to categorize errors using a specific metric source (defined by CountMetric). This allows the team to create dashboards based on the “source” and “errorCode” labels, providing a clear picture of application health.

Error Workflow

The following diagram illustrates how an error propagates from a functional call through to the UI and the monitoring backend:

[ Function Call ]
      |
      V
[ errorMessageWithTelemetry(msg, dur, options) ]
      |
      +----( If options.countMetricSource exists )----> [ telemetry.increaseCounter ]
      |                                                        |
      |                                                        V
      |                                             [ External Metrics System ]
      |
      +-----------------------------------------------> [ elementsErrorMessage ]
                                                               |
                                                               V
                                                     [ Dispatch "error-sk" event ]
                                                               |
                                                               V
                                                     [ UI Toast Component ]

Usage Considerations

When using errorMessageWithTelemetry, the source field in TelemetryErrorOptions should be specific enough to identify the feature or component failing, while errorCode should represent the category of failure. If these are omitted, they default to 'default' and '500' respectively.
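
The defaulting rule can be expressed as the following hypothetical helper (withTelemetryDefaults is illustrative, not part of the module's API):

```typescript
// Hypothetical helper showing the documented defaults for
// TelemetryErrorOptions; not the module's actual implementation.
interface TelemetryErrorOptions {
  source?: string;    // feature or component that failed
  errorCode?: string; // category of failure, e.g. '404' or '500'
}

function withTelemetryDefaults(o: TelemetryErrorOptions = {}): Required<TelemetryErrorOptions> {
  return {
    source: o.source ?? 'default',
    errorCode: o.errorCode ?? '500',
  };
}
```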

Both functions handle various message formats—including strings, objects with a message property, and raw Response objects—consistent with the underlying elements-sk implementation.

Module: /modules/existing-bug-dialog-sk

existing-bug-dialog-sk

The existing-bug-dialog-sk module provides a modal dialog designed to associate performance anomalies (alerts) with an existing bug in a tracking system (e.g., Monorail/Issues). It is a key component of the Perf triage workflow, allowing users to consolidate multiple related regressions under a single bug ID.

High-Level Overview

When a user identifies a performance regression, they may want to link it to an ongoing investigation rather than filing a new bug. This module manages the UI for inputting a Bug ID, selecting the target project, and viewing other bugs already associated with the same group of anomalies.

Design and Implementation Logic

Association Workflow

The primary responsibility of the module is to communicate with the triage backend to establish a link between anomaly keys and a bug ID.

  1. Input Collection: The user provides a numeric Bug ID and selects a project (defaulting to “chromium”).
  2. Validation: The UI enforces a 5-9 digit numeric pattern for the Bug ID to prevent malformed submissions.
  3. Submission: The component sends a POST request to /_/triage/associate_alerts.
  4. State Synchronization: Upon success, it updates the local anomalies data to reflect the new association and dispatches an anomaly-changed event. This event ensures that other UI components (like charts or lists) stay in sync without requiring a full page reload.
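
The 5-9 digit rule in step 2 corresponds to a pattern like the following (the exact pattern attribute used by the dialog may differ slightly):

```typescript
// Sketch of the Bug ID validation described above. The regex is an
// assumption matching the documented 5-9 digit numeric rule.
const BUG_ID_PATTERN = /^[0-9]{5,9}$/;

function isValidBugId(id: string): boolean {
  return BUG_ID_PATTERN.test(id);
}
```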

Bug Discovery and Context

To help users avoid duplicating work, the dialog can fetch and display a list of bugs already associated with the anomalies being triaged.

  • Group Reports: It fetches anomaly group reports from /_/anomalies/group_report. If the backend returns a “SID” (Session ID), the component handles the additional fetch step to resolve the full list of anomalies in that group.
  • Metadata Enrichment: Simply showing a list of numbers (Bug IDs) is often unhelpful. The component calls /_/triage/list_issues to fetch the human-readable titles of these bugs, providing better context for the user before they commit the association.

User Interface Decisions

  • Scoped Styling: The component uses createRenderRoot() { return this; } to render directly into the custom element, allowing it to leverage global Perf themes and styles (SASS) defined in the project.
  • Async Feedback: It uses a spinner-sk and disables the submit button during active network requests to prevent duplicate submissions and provide visual feedback.

Key Components and Files

existing-bug-dialog-sk.ts

The core logic of the dialog. It manages:

  • Internal State: Tracks the _projectId, _busy status, and the bugIdTitleMap (linking bug IDs to their descriptive titles).
  • Anomaly Handling: Accepts anomalies and traceNames as properties. These are used to construct the payload for the triage backend.
  • Event Dispatching: Issues the anomaly-changed event to notify the rest of the application of state changes.

existing-bug-dialog-sk.scss

Defines the layout of the dialog, ensuring it occupies a reasonable portion of the screen (25% width) and handles long lists of associated bugs via a scrollable container (#associated-bugs-table).

Page Objects (existing-bug-dialog-sk_po.ts)

Provides an abstraction for testing the component. It encapsulates the selectors for the input fields and buttons, allowing Puppeteer and Karma tests to interact with the dialog without being brittle to internal DOM changes.

Workflow Diagram

User Actions              Component Logic                   Backend API
--------------------------------------------------------------------------------
Open Dialog      ----->   Fetch Associated Bugs    ----->   /_/anomalies/group_report
                                  |
                                  v
                          Fetch Bug Titles         ----->   /_/triage/list_issues
                                  |
                          Render List + Form
                                  |
Input Bug ID     ----->           |
                                  |
Click Submit     ----->   Validate & Send          ----->   /_/triage/associate_alerts
                                  |
                          Update Local State
                                  |
                          Dispatch 'anomaly-changed'
                                  |
                          Close Dialog

Related Files

  • existing-bug-dialog-sk-demo.ts: Provides a mocked environment to test the dialog's behavior and layout in isolation.
  • test_data.ts: Contains sample Anomaly objects used for both documentation and testing.

Module: /modules/explore-multi-sk

explore-multi-sk

The explore-multi-sk module provides a comprehensive interface for visual data exploration in Perf, allowing users to view and interact with multiple graphs simultaneously. It acts as an orchestrator for multiple explore-simple-sk instances, synchronizing their states (such as time ranges and X-axis scaling) to facilitate comparative analysis across different data dimensions.

High-Level Overview

The module serves two primary exploration modes:

  1. Standard/Split Mode: A “Master-Slave” architecture where one primary graph contains all selected data, and users can “split” this data into individual graphs based on specific parameters (e.g., splitting a single graph containing multiple OS traces into separate graphs for “Android”, “Ubuntu”, etc.).
  2. Manual Plot Mode: An independent mode where users can add and remove graphs arbitrarily, treating each as a standalone snapshot that does not necessarily share the same query parameters as others.

Design Decisions and Implementation

State Management and URL Reflection

The module utilizes stateReflector to persist the exploration state in the URL. To keep URLs manageable and logic simple, explore-multi-sk only tracks properties necessary to reconstruct the graphs.

  • Time Range Logic: Priority is given to explicit begin/end timestamps in the URL. If missing, it falls back to a dayRange (e.g., “last 7 days”). If both are missing, it uses global defaults.
  • Shortcut System: Instead of encoding every query for every graph in the URL, the module generates a shortcut ID. This ID maps to a collection of graph configurations in the backend database, allowing complex multi-graph layouts to be shared via a short link.

Graph Orchestration and Synchronization

The module ensures a unified experience across multiple internal elements through event-driven synchronization:

  • Time Range Sync: When a user zooms or pans on one graph, the range-changing-in-multi and selection-changing-in-multi events trigger an update across all other graphs.
  • X-Axis Consistency: Toggling between “Commit” and “Date” domains on one chart updates the domain state for all instances, ensuring the X-axis remains comparable.
  • Even X-Axis Spacing: Users can toggle discrete spacing (ignoring time gaps between points). This preference is synced across charts and persisted in localStorage.

Performance and Batch Loading

To prevent browser performance degradation when loading dozens of graphs (e.g., splitting by a parameter with many values), the module implements Chunked Loading:

[User Clicks Plot]
       |
       V
[Calculate Groups] -> (e.g., 20 different OS values)
       |
       V
[Load Chunk 1] ----> (Load first 5 graphs, request range data)
       |
[Load Chunk 2] ----> (Load next 5 graphs)
       |
      ...
       |
[Final Load] ------> (Fetch extended range data for all graphs in one batch)

This approach allows the UI to become interactive incrementally while minimizing the total number of expensive backend requests for historical data.
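
The batching step can be sketched as a simple chunking helper (the batch size of 5 is taken from the diagram above; the real value is an implementation detail):

```typescript
// Sketch of chunked loading: split the split-by groups into fixed-size
// batches so graphs are created and become interactive incrementally.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// e.g. splitting by an "os" parameter with 12 values:
const osValues = Array.from({ length: 12 }, (_, i) => `os${i + 1}`);
const batches = chunk(osValues, 5); // batch lengths: 5, 5, 2
```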

Key Components and Files

  • explore-multi-sk.ts: The core logic coordinator. It handles the State object, manages the lifecycle of explore-simple-sk elements, and implements the splitting/merging logic.
  • explore-multi-sk.scss: Provides layout styling, ensuring that graphs are sized appropriately (e.g., shrinking height when multiple graphs are displayed) and handling the visibility of UI components like the Test Picker.
  • explore-multi-sk_po.ts: A Page Object for Puppeteer testing, providing a clean API to interact with the multi-graph container and its children during integration tests.
  • Integration with test-picker-sk: The module heavily relies on the Test Picker for selecting data. In “Split Mode”, the picker's state is used to determine how to partition the data into individual charts.

Key Workflows

The “Split” Process

When a user selects a “Split By” parameter (e.g., os):

  1. The module identifies all traces currently loaded in the “Master” graph.
  2. Traces are grouped by the value of the chosen parameter.
  3. The Master graph is optionally hidden (becoming a background data accumulator).
  4. New explore-simple-sk instances are created for each group.
  5. Each child graph is initialized with the specific query and data subset corresponding to its group.

Removing Data

Data can be removed in two ways:

  • Individual Trace Removal: Triggered from a graph's UI. The module filters the global TraceSet, updates the internal data models, and tells the relevant graphs to re-render without the specific trace.
  • Graph Removal: In Manual Plot Mode, clicking the “Trash” icon removes that specific explore-simple-sk instance and updates the URL shortcut. In Split Mode, removing the last trace of a graph typically results in the removal of the entire graph instance.

Module: /modules/explore-simple-sk

Explore Simple SK

The explore-simple-sk module provides a comprehensive data exploration interface for the Perf tool. It serves as the primary component for querying, visualizing, and triaging performance traces, allowing users to interact with large datasets through charts, tables, and integrated triage tools.

High-Level Overview

explore-simple-sk is designed to be a versatile “explorer” that can operate in multiple modes (plotting, pivot tables, or simple querying). It manages the state of a data exploration session—including the time range, active queries, formulas, and selected data points—and reflects this state in the browser URL for shareability.

The module acts as a coordinator for several specialized sub-components:

  • Data Management: Uses DataFrameRepository to manage the underlying trace data and anomaly maps.
  • Visualization: Uses plot-google-chart-sk for the main interactive graph and plot-summary-sk for long-range navigation.
  • Querying: Integrates query-sk and pivot-query-sk to allow users to filter data.
  • Triage: Provides a chart-tooltip-sk that facilitates bug filing, anomaly nudging, and bisection.

Key Design Decisions

State Management and URL Reflection

The module utilizes a State class to track all parameters of the current view (e.g., begin, end, queries, formulas, domain).

  • Why: This allows for “deep linking,” where a user can share a specific view of a graph, including the zoom level and selected trace, simply by sharing the URL.
  • How: It uses a state_changed event mechanism and a useBrowserURL method to sync internal variables with URL search parameters.

Incremental Data Loading

To maintain performance when exploring large datasets, the module implements incremental fetching.

  • Why: Fetching an entire repository's history is expensive. Users often start with a small window and “pan” left or right.
  • How: When a user pans or zooms outside the currently loaded data range, the module calculates the delta and requests only the necessary additional frames, joining them with the existing DataFrame in memory.

Domain Switching (Commit vs. Date)

Users can toggle the X-axis between “Commit Position” and “Date.”

  • Why: Performance regressions are often tied to specific code changes (commits), but understanding the real-world timeline (dates) is crucial for identifying infrastructure issues or seasonal patterns.
  • How: The module performs a coordinate transformation when the domain changes, ensuring that any active zoom selection remains focused on the same set of data points by translating commit offsets to timestamps (and vice versa) using the dataframe header.

Key Components and Responsibilities

explore-simple-sk.ts

The main class responsible for the lifecycle of the explorer.

  • Workflow Coordination: It handles the logic for adding traces via queries or formulas (addFromQueryOrFormula) and managing the response display mode (Graph vs. Pivot Table).
  • User Interaction: Processes keyboard shortcuts (zoom/pan), mouse events on the chart, and interactions with the settings dialog (e.g., toggling even X-axis spacing).

nudge-util.ts

A specialized utility for handling anomaly “nudges.”

  • Responsibility: When a user identifies an anomaly, the “true” start of a regression might be slightly off due to sparse data or noise.
  • Logic: It scans the trace to find valid data points (skipping MISSING_DATA_SENTINEL values) and calculates a list of NudgeEntry objects. This ensures that when an anomaly is moved, it always lands on a commit that actually contains data for that specific trace.
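
The scan can be sketched as follows; the sentinel constant and the windowing logic are assumptions based on the description above, not nudge-util's actual code:

```typescript
// Perf marks missing points with a large sentinel float; the exact
// constant here is an assumption for illustration.
const MISSING_DATA_SENTINEL = 1e32;

// Collect commit indices within `radius` of an anomaly that actually
// contain data, so a nudge always lands on a real point for this trace.
function validNudgeIndices(trace: number[], center: number, radius: number): number[] {
  const out: number[] = [];
  const lo = Math.max(0, center - radius);
  const hi = Math.min(trace.length - 1, center + radius);
  for (let i = lo; i <= hi; i++) {
    if (trace[i] !== MISSING_DATA_SENTINEL) {
      out.push(i);
    }
  }
  return out;
}
```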

explore-simple-sk_po.ts

A Page Object (PO) implementation for Puppeteer testing.

  • Responsibility: Encapsulates the DOM structure and common interactions (like clicking the “Remove All” button or verifying anomaly tooltips) to provide a stable API for integration tests.

Key Workflows

Data Query and Plotting Process

The following diagram illustrates how a user query is transformed into a visual plot:

User Input (Query/Formula)
      |
      V
addFromQueryOrFormula() ----> Validates Query
      |
      V
requestFrame() -------------> DataService (Backend API)
      |                              |
      |<-----------------------------|
      V
UpdateWithFrameResponse()
      |
      +-----> DataFrameRepository (Stores & Merges Data)
      |
      +-----> plot-google-chart-sk (Renders Main Graph)
      |
      +-----> plot-summary-sk (Renders Navigation Bar)
      |
      +-----> paramset-sk (Updates Metadata Panel)

Anomaly Triage Workflow

When a user interacts with a data point on the chart:

Chart Click Event
      |
      V
onChartSelect()
      |
      +-----> enableTooltip()
                 |
                 +-----> Fetches Commit Links (Gitiles/Issue Tracker)
                 |
                 +-----> nudge-util (Calculates Nudge Steps)
                 |
                 +-----> Displays chart-tooltip-sk
                            |
                            +---[File Bug]---> NewBugDialog
                            +---[Nudge]------> Update Anomaly Map
                            +---[Bisect]-----> BisectDialog

CSS and Layout

The module uses a flexible layout defined in explore-simple-sk.scss that adapts based on the displayMode (e.g., .display_query_only, .display_plot). It uses CSS classes to hide/show components like the spinner, the pivot table, or the plot summary based on the current operation, ensuring a clean UI regardless of the data being explored.

Module: /modules/explore-sk

explore-sk

The explore-sk module provides the primary entry point and high-level container for the Skia Perf data exploration interface. It acts as an orchestrator that integrates several complex sub-components—most notably the core graphing engine and the test selection tools—into a unified user experience.

Overview

The purpose of explore-sk is to provide a cohesive environment where users can query performance data, visualize traces, and interact with the resulting graphs. While the actual plotting and state management logic reside in sub-modules, explore-sk handles the high-level layout, global event routing, and the initialization of environment-specific defaults.

Key Components and Responsibilities

The module is structured as a custom element (explore-sk.ts) that manages the lifecycle and communication between several key pieces:

  • ExploreSimpleSk (<explore-simple-sk>): This is the “heavy lifter” of the module. It handles the actual data fetching, state management for queries, and the rendering of the performance charts. explore-sk acts as its parent, passing down configuration and reflecting its state to the URL.
  • TestPickerSk (<test-picker-sk>): A specialized UI component for building queries. It allows users to select specific parameters (like architecture, config, or test name) from dropdowns or autocomplete fields. explore-sk dynamically initializes this component based on the backend's configuration.
  • State Reflection: The module uses stateReflector to ensure that the complex state of the exploration (selected traces, time ranges, etc.) is synchronized with the browser's URL. This allows users to share specific views or bookmarks of their performance analysis.
  • Authentication Integration: It interacts with alogin-sk to determine the user's login status. This is used to conditionally enable features like “Favorites,” which require a user identity to persist data.

Design Decisions

Composition over Monolith

Instead of implementing plotting and querying logic directly, explore-sk serves as a thin wrapper. This design allows explore-simple-sk to remain focused on the core data/charting logic, while explore-sk manages the layout and the integration of optional UI elements like the test-picker-sk.

Dynamic UI (V2 UI and Test Picker)

The module implements logic to switch between different querying interfaces. It checks for backend defaults and local storage flags (like v2_ui) to decide whether to show the traditional query dialog or the newer test-picker-sk. This allows for a staged rollout of new UI features without breaking the core exploration workflow.

Centralized Keyboard Handling

To provide a consistent “app-like” feel, explore-sk captures global keyboard events (like the ? key for help) and delegates them to the appropriate child component (explore-simple-sk). This ensures that shortcuts work regardless of which sub-element currently has focus.

Key Workflows

Initialization and Configuration

When the element is attached to the DOM, it follows a specific sequence to configure the environment:

[ explore-sk ]
      |
      |-- 1. Fetch /_/defaults/ --------> [ Backend ]
      |          |                            |
      |          <------- JSON Config --------|
      |
      |-- 2. Check Auth Status ---------> [ alogin-sk ]
      |          |                            |
      |          <------- Login Status -------|
      |
      |-- 3. Initialize State ----------> [ stateReflector ]
      |          |                            |
      |          <------- URL Params ---------|
      |
      |-- 4. Setup TestPicker (if enabled)
      |
      '-- 5. Pass state & defaults to [ explore-simple-sk ]

Querying via Test Picker

When a user interacts with the test-picker-sk, the communication flows through events:

  1. The user selects parameters in test-picker-sk and clicks “Plot”.
  2. The test-picker-sk emits a plot-button-clicked event.
  3. explore-sk catches this event, extracts the query string from the picker, and calls the addFromQueryOrFormula method on explore-simple-sk.
  4. explore-simple-sk fetches the data and updates the chart.

Trace Highlighting to Query

If a user is looking at a graph and wants to refine their query based on a specific trace:

  1. A “populate-query” event is triggered (usually from a trace detail view).
  2. explore-sk receives the trace key.
  3. It translates that key into a query string and instructs test-picker-sk to update its fields to match that specific trace, allowing the user to easily pivot their search.

Module: /modules/extra-links-sk

Overview

The extra-links-sk module provides a specialized custom element designed to display a curated list of external resources, documentation, or related tools. It serves as a dynamic landing area or sidebar within the Perf application, allowing administrators to surface relevant links that might otherwise be buried in external documentation sites.

Design and Implementation Philosophy

The module is built on the principle of configuration-driven UI. Rather than hardcoding links or managing them through complex state management within the element itself, it leverages the global environment to determine its content.

Global State Integration

The element relies on the window.perf.extra_links configuration object. This design choice decouples the UI component from the backend API calls. By assuming that the global window.perf object (typically populated at page load or via a global configuration fetch) contains the necessary metadata, the element remains lightweight and reactive to the environment it is placed in.

Declarative Templating

Using lit, the element implements a declarative template that handles two primary states:

  1. Configured State: If window.perf.extra_links is populated, it renders a structured table featuring link titles and descriptions.
  2. Empty State: If no configuration is present, it provides a fallback message (“No links have been configured”), ensuring the UI doesn't appear broken or completely empty without explanation.
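
The two states can be sketched as a render helper (renderLinks and its exact markup are illustrative; the element actually renders a table via lit templates):

```typescript
// ExtraLink fields follow this module's documentation (text, href,
// description); the render helper itself is an illustration, not the
// element's real template.
interface ExtraLink {
  text: string;
  href: string;
  description: string;
}

function renderLinks(links?: ExtraLink[]): string {
  if (!links || links.length === 0) {
    // Empty state: explicit fallback instead of a blank region.
    return 'No links have been configured';
  }
  // Configured state: one row per link.
  return links
    .map((l) => `<a href="${l.href}">${l.text}</a> - ${l.description}`)
    .join('\n');
}
```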

Key Components and Responsibilities

extra-links-sk.ts

This file defines the ExtraLinksSk class, which extends ElementSk. Its primary responsibility is the lifecycle management and rendering of the link table.

  • Data Mapping: It maps the ExtraLink objects (containing text, href, and description) into a tabular format.
  • Lifecycle: It triggers a render immediately upon being connected to the DOM (connectedCallback), ensuring that the links are visible as soon as the component is attached.

extra-links-sk.scss

The styling is scoped to the extra-links-sk element. It utilizes CSS variables (like --primary and --on-surface) to maintain theme consistency with the rest of the application. The layout uses border-collapse: separate and specific padding to ensure the links are easily readable and touch-friendly.

Data Flow and Workflow

The following diagram illustrates how data flows from the global configuration into the rendered component:

[ Global Scope ]              [ extra-links-sk ]               [ Browser DOM ]
       |                             |                                |
       | 1. Set window.perf.      ---|------------------------------> |
       |    extra_links = {...}      |                                |
       |                             |                                |
       |                             | 2. connectedCallback()         |
       |                             | <----------------------------  |
       |                             |                                |
       |                             | 3. Read window.perf            |
       |                             |    loop through links          |
       |                             |                                |
       |                             | 4. Generate HTML Table         |
       |                             | -----------------------------> |
       |                             |                                |

Configuration Structure

The component expects the configuration to follow this structure within the global window.perf object:

  • title: A string displayed as the main header for the links section.
  • links: An array of objects, where each object contains:
    • text: The clickable label for the link.
    • href: The destination URL.
    • description: A text explanation of what the link provides.
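The expected configuration can be sketched as a pair of TypeScript interfaces plus the render decision the template makes. The interface and function names here are illustrative, mirroring the fields described above rather than the module's generated types:

```typescript
// Hypothetical shape of window.perf.extra_links as described above.
interface ExtraLink {
  text: string;        // the clickable label for the link
  href: string;        // the destination URL
  description: string; // explanation of what the link provides
}

interface ExtraLinksConfig {
  title: string;       // main header for the links section
  links: ExtraLink[];
}

// Decide which of the two template states to render: the structured
// table, or the "No links have been configured" fallback.
function renderState(config: ExtraLinksConfig | undefined): "configured" | "empty" {
  return config && config.links.length > 0 ? "configured" : "empty";
}
```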

Module: /modules/favorites-dialog-sk

Overview

The favorites-dialog-sk module provides a modal dialog designed for creating and editing user “favorites” within the Perf application. It encapsulates a form for capturing a name, description, and URL, and handles the asynchronous communication with the backend API to persist these changes.

Design and Implementation Logic

The module is built as a LitElement and utilizes the native HTML <dialog> element for modal behavior. The design focuses on a Promise-based workflow for the calling component, allowing the parent to react differently depending on how the dialog was closed.

State Management and Lifecycle

Instead of relying on external events to communicate success, the open() method returns a Promise. This allows the caller to await the user's action:

  • Resolve: The promise resolves if the user successfully saves a new or edited favorite. This indicates to the parent (e.g., a favorites list) that it should refresh its data.
  • Reject: The promise rejects if the user dismisses the dialog via the “Cancel” button or the close icon.
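The contract above can be sketched as a minimal class that stores the promise's settle functions until the user acts. Method and field names are illustrative, not the module's actual API:

```typescript
// Minimal sketch of the Promise-based dialog contract described above.
class FavoritesDialogSketch {
  private resolveFn?: () => void;
  private rejectFn?: (reason: string) => void;

  // open() returns a Promise that settles only when the user acts.
  open(): Promise<void> {
    return new Promise<void>((resolve, reject) => {
      this.resolveFn = resolve;
      this.rejectFn = reject;
    });
  }

  // Invoked after a successful save; signals the parent to refresh.
  save(): void {
    this.resolveFn?.();
  }

  // Invoked on "Cancel" or the close icon; signals dismissal.
  cancel(): void {
    this.rejectFn?.("cancelled");
  }
}
```

A parent would then write `dialog.open().then(() => refreshList()).catch(() => {})`, reacting only on a successful save.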

Data Handling

The component distinguishes between “create” and “edit” modes based on the presence of a favId.

  • Creation: If favId is empty, the component defaults the URL to the current window location and targets the /_/favorites/new endpoint.
  • Modification: If a favId is provided, the component populates the fields with existing data and targets the /_/favorites/edit endpoint.

Workflow Diagram

The following diagram illustrates the interaction between the UI and the backend:

[Parent Component]          [favorites-dialog-sk]             [Backend API]
        |                           |                               |
        |---- .open(id, name) ----->|                               |
        |                           |-- (User edits fields)         |
        |                           |                               |
        |                           |---- Click "Save" ------------>|
        |                           |          POST /_/favorites/   |
        |                           |<------- 200 OK / Error -------|
        |                           |                               |
        |<--- Resolve / Reject -----|                               |

Key Components and Files

favorites-dialog-sk.ts

This is the core logic of the module.

  • Form Validation: Ensures that the “Name” and “URL” fields are non-empty before attempting a submission, triggering an errorMessage toast if validation fails.
  • Async Operations: Manages the updatingFavorite state to toggle a <spinner-sk> and disable action buttons while a network request is in flight.
  • Unique ID Generation: Uses a static nextUniqueId counter to ensure that HTML id and for attributes are unique across multiple instances on the same page, maintaining accessibility and correct label-to-input binding.
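The static-counter technique can be sketched as follows; the class and method names are hypothetical, but the idea of a class-level counter consumed per instance matches the description:

```typescript
// Sketch of the static-counter approach to unique label/input ids.
class UniqueIdSketch {
  private static nextUniqueId = 0;
  readonly idSuffix: number;

  constructor() {
    // Each instance claims the next counter value at construction time.
    this.idSuffix = UniqueIdSketch.nextUniqueId++;
  }

  // e.g. inputId("name") -> "name-0", matching <label for="name-0">.
  inputId(field: string): string {
    return `${field}-${this.idSuffix}`;
  }
}
```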

favorites-dialog-sk.scss

Defines the visual presentation using the Perf theme variables. It handles the layout of the form elements, specifically positioning the close icon and styling the input fields to occupy the standard modal width (500px for inputs).

favorites-dialog-sk-demo.ts

Provides a reference implementation for how to trigger the dialog for both “New” and “Edit” scenarios. It demonstrates the use of the open() method and how to pass initial parameters.

API Interaction

The module interacts with the following endpoints:

  • POST /_/favorites/new: Used when creating a new favorite. The body includes name, description, and url.
  • POST /_/favorites/edit: Used when updating existing favorites. The body includes the original id along with the updated fields.

Errors from the API are captured and displayed to the user via the errorMessage utility, while the dialog remains open to allow the user to correct the issue or try again.

Module: /modules/favorites-sk

The favorites-sk module provides a specialized dashboard interface for managing and viewing bookmarked links within the Perf application. It distinguishes between global system-wide favorites and user-specific links, allowing for personal organization of performance data views.

Design and Logic

The module is built around a centralized configuration fetched from the backend. The primary design goal is to provide a unified view where users can see pre-configured links (such as project-wide dashboards) alongside their own curated list of performance traces or search queries.

Data Fetching and Persistence

Upon mounting (connectedCallback), the element fetches the favorites configuration from /_/favorites/. This configuration drives the entire UI. The module uses a refetch-on-mutation pattern: whenever a change occurs (such as a deletion or an edit), the component re-fetches the entire configuration rather than patching local state, guaranteeing that the UI stays synchronized with the server.
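The re-fetch-after-change behavior can be sketched with the network injected as a function, so the flow is testable without a server. The class, endpoint strings, and payload shape follow the text above but are otherwise illustrative:

```typescript
// Sketch of the refetch-on-mutation pattern described above.
type Fetcher = (url: string, init?: { method?: string; body?: string }) => Promise<unknown>;

class FavoritesStateSketch {
  config: unknown = null;

  constructor(private readonly fetcher: Fetcher) {}

  // Re-fetch the whole configuration so the UI matches the server.
  async refresh(): Promise<void> {
    this.config = await this.fetcher("/_/favorites/");
  }

  // Every mutation is immediately followed by a full refresh.
  async deleteFavorite(id: string): Promise<void> {
    await this.fetcher("/_/favorites/delete", {
      method: "POST",
      body: JSON.stringify({ id }),
    });
    await this.refresh();
  }
}
```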

Section Differentiation

The implementation applies different business rules based on the section name:

  • “My Favorites”: This section is treated as mutable. For links under this header, the UI provides “Edit” and “Delete” actions. It integrates with favorites-dialog-sk to facilitate complex editing of link metadata (names, descriptions, and URLs).
  • General Sections: Any other section is treated as read-only, displaying links and descriptions without management controls.

Workflow: Deleting a Favorite

The deletion process includes a safety check to prevent accidental data loss:

[User Clicks Delete]
        |
        v
[Browser Confirm Dialog] --(Cancel)--> [Abort]
        |
      (OK)
        v
[POST to /_/favorites/delete]
        |
   [Success?] --(No)--> [Show Error Message]
        |
      (Yes)
        v
[Re-fetch /_/favorites/]
        |
    [Re-render]

Key Components

  • favorites-sk.ts: The core logic container. It manages the state of the favoritesConfig and handles the asynchronous interactions with the backend API. It uses lit for templating, dynamically generating tables based on the presence of user-specific links.
  • favorites-dialog-sk: While imported from a sibling module, it is a critical dependency for this module's “Edit” workflow. favorites-sk acts as the orchestrator, passing existing link data into this dialog and waiting for a resolution to refresh the view.
  • favorites-sk.scss: Defines the layout for the favorites tables. It uses a spacious design with border-spacing and specific styling for primary links to ensure the dashboard remains readable even with a high density of saved traces.

Implementation Details

The module relies on the ElementSk base class for standard component lifecycle management. For user interactions:

  • Editing: Selecting “Edit” triggers a call to the dialog component's .open() method, passing the id, text, description, and href.
  • API Communication: Uses jsonOrThrow and errorMessage utilities to handle network failures gracefully, ensuring that server-side errors are surfaced to the user via a consistent UI toast/notification system.

Module: /modules/gemini-side-panel-sk

The gemini-side-panel-sk module provides a slide-out interface for interacting with a Gemini-powered AI assistant. It is designed as a persistent UI overlay that can be integrated into any page to provide contextual help or a general-purpose chat interface without navigating away from the current view.

Design and Implementation Choices

The module is implemented as a Lit element, leveraging reactive properties to manage the chat state and visibility.

Slide-out Transition The panel is positioned using position: fixed with a negative right offset. This design choice allows the panel to exist in the DOM but remain hidden off-screen until activated. By toggling the open attribute, the CSS transitions the right property to 0, providing a smooth visual entry. This approach is preferred over display: none because it allows for CSS-driven animations and ensures the element's internal state remains preserved while hidden.

State Management and UI Feedback The element manages three primary pieces of state:

  • messages: An array of chat objects. This acts as the single source of truth for the conversation history.
  • isLoading: A boolean that controls the visibility of a <spinner-sk>. This provides immediate visual feedback to the user during network latency.
  • input: A string tracked via the live() directive. Using live() ensures that the input field remains synchronized with the internal state even if the DOM is updated externally or during rapid typing.

API Interaction The component communicates with a backend via a POST request to /_/chat. It sends the user's query as a JSON body and expects a JSON response containing the assistant's reply. The implementation includes robust error handling that captures both HTTP error codes (e.g., 500) and network-level failures, surfacing these errors directly in the chat history to keep the user informed.

Key Components

GeminiSidePanelSk (gemini-side-panel-sk.ts) This is the core logic and UI controller. It encapsulates the styling, the chat history log, and the input footer. It exposes a public toggle() method and an open property/attribute, allowing parent components or global scripts to programmatically control its visibility.

Chat History Log The history is rendered as a list of message bubbles. The implementation distinguishes between user and model roles using CSS classes to align messages to the right or left, respectively. It uses aria-live="polite" on the history container to ensure that screen readers announce new incoming messages from the AI assistant.

Input Handling The footer contains a text input and a send button. To optimize user experience, the component listens for the Enter key on the input field, allowing for a standard messaging flow. The input is automatically cleared and focused upon a successful message submission.

Chat Workflow

The following diagram illustrates the data flow when a user sends a message:

[ User Input ]  -->  [ Update 'messages' (User) ]  -->  [ Set 'isLoading' = true ]
      |                                                        |
      |                                                        V
      |                                               [ POST /_/chat ]
      |                                                        |
      V                                                        V
[ Clear Input ] <--- [ Update 'messages' (Model) ] <--- [ Receive JSON Response ]
                               |
                               V
                     [ Set 'isLoading' = false ]
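The state transitions in the diagram can be sketched as a small class with the POST injected, so the flow runs offline. The message shape, class name, and exact ordering (input cleared on submission) are illustrative assumptions:

```typescript
// Sketch of the chat round-trip above, network injected for testability.
interface ChatMessage { role: "user" | "model"; text: string; }
type ChatPoster = (query: string) => Promise<string>;

class ChatStateSketch {
  messages: ChatMessage[] = [];
  isLoading = false;
  input = "";

  constructor(private readonly post: ChatPoster) {}

  async send(): Promise<void> {
    const query = this.input.trim();
    if (query === "") return; // empty messages are never sent
    this.messages.push({ role: "user", text: query });
    this.input = ""; // cleared after submission
    this.isLoading = true;
    try {
      const reply = await this.post(query);
      this.messages.push({ role: "model", text: reply });
    } catch (e) {
      // Errors surface in the history so the user stays informed.
      this.messages.push({ role: "model", text: `Error: ${e}` });
    } finally {
      this.isLoading = false;
    }
  }
}
```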

Testing Strategy

The module includes two layers of testing:

  • Unit Tests (gemini-side-panel-sk_test.ts): These tests use fetch-mock to simulate backend responses. They verify the internal logic, such as ensuring empty messages aren't sent, verifying that the input clears after sending, and checking that error messages are correctly appended to the history.
  • End-to-End Tests (gemini-side-panel-sk_puppeteer_test.ts): These tests focus on the visual and behavioral aspects, such as confirming the CSS transitions move the panel the correct number of pixels and verifying that the Shadow DOM elements (input, icons) are accessible and interactive.

Module: /modules/graph-title-sk

graph-title-sk

The graph-title-sk module provides a specialized header component designed for performance graphs. It dynamically translates a set of metadata (key-value pairs) into a structured, readable title, handling the complexity of displaying many parameters without cluttering the UI.

Design Goals

The primary purpose of this component is to provide context for a graph. Since performance data often involves many dimensions (e.g., bot name, benchmark, test, subtest, configuration), a simple string is insufficient. The component is designed to:

  • Handle Variable Specificity: It can display a single trace's detailed metadata or a generic “Multi-trace” summary if multiple traces are being viewed simultaneously.
  • Manage Information Density: To prevent the header from pushing the graph off-screen, it enforces a limit on the number of visible parameters, offering a “Show Full Title” option when the metadata is extensive.
  • Prioritize Readability: By splitting keys (parameters) and values into two distinct rows within a flexible grid, it remains legible even when values are long or numerous.

Key Components

graph-title-sk.ts

This is the core custom element, implemented using Lit.

  • Data Input: The element is updated via the set(titleEntries: Map<string, string> | null, numTraces: number) method. This approach allows the parent component to push data updates efficiently.
  • Logic and Filtering:
    • Empty Suppression: Any entry with an empty string for either the key or the value is automatically ignored to keep the title clean.
    • Truncation/Tooltips: While the CSS handles visual layout, the HTML includes a title attribute on values, allowing users to hover over truncated text to see the full value.
    • Expansion Logic: It maintains an internal state (showShortTitle) to toggle between a collapsed view (limited by MAX_PARAMS, currently 8) and a full view.
    • Multi-trace Mode: If numTraces > 0 but the titleEntries map is empty, it renders a generic <h1> header indicating the number of traces.

graph-title-sk.scss

The styling uses a flexbox-based grid system.

  • Responsive Wrapping: The #container uses flex-wrap: wrap, ensuring that if the title is too long for the horizontal space, it flows naturally into subsequent rows.
  • Columnar Layout: Each metadata pair is treated as a discrete column, with the parameter name (.param) styled smaller and lighter above the bolded value (.hover-to-show-text).

Data Flow and Workflows

Setting Title Content

The workflow for updating the title typically involves a parent graph-container or dashboard page:

[ Parent Component ]
      |
      | 1. Gathers metadata (e.g., from a trace ID or API)
      | 2. Calls .set(map, count)
      V
[ graph-title-sk ]
      |
      | 3. Checks numTraces (if 0, hide container)
      | 4. Filters empty entries
      | 5. Truncates list if > MAX_PARAMS
      V
[ Rendered HTML ]

Expanding Long Titles

When the metadata exceeds the limit, the component provides an interactive expansion:

[ User Clicks "Show Full Title" ]
      |
      V
[ showFullTitle() ] sets showShortTitle = false
      |
      V
[ render() ] re-runs getTitleHtml() without the MAX_PARAMS limit
      |
      V
[ UI Updates ] All columns are revealed; button disappears

Testing Utilities

The module includes a Page Object (graph-title-sk_po.ts) to simplify integration and end-to-end testing. This PO abstracts the internal structure (selectors for params, values, and the “show more” button), allowing tests to verify title content without being brittle to changes in the internal DOM structure.

Module: /modules/json

/modules/json

This module serves as the central repository for shared TypeScript type definitions and interfaces used across the Perf application. It acts as the “Source of Truth” for the data structures exchanged between the Go backend and the TypeScript frontend.

Overview

The primary goal of this module is to ensure type safety and consistency across the network boundary. Instead of manually maintaining duplicate type definitions in both Go and TypeScript, this module contains an automatically generated index.ts file. This file reflects the structures defined in the backend, providing a robust contract for API requests, responses, and internal data processing.

The module also implements Nominal Typing for primitive types to prevent logical errors (e.g., accidentally using a TimestampSeconds where a CommitNumber is expected), even though both are represented as numbers at runtime.

Key Components

Core Data Structures

The module defines the fundamental entities of the Perf system:

  • Data Representation: DataFrame, TraceSet, and Trace represent the time-series data fetched for visualization. A DataFrame contains the actual values, the headers (commits/timestamps), and the paramset describing the metadata.
  • Anomalies and Regressions: Interfaces like Anomaly and Regression define the shape of detected performance changes, including statistical metadata (median before/after, p-value) and triage status.
  • Alerting: The Alert interface defines the configuration for regression detection, including the query to monitor and the algorithm parameters used.
  • Backend Communication: FrameRequest and FrameResponse encapsulate the complex parameters needed to query the performance database and the resulting data structure used to render plots.

Nominal Typing Pattern

To improve type safety, the module uses a “branding” pattern for common primitives. This forces developers to explicitly cast or use constructor functions when assigning values to these types, ensuring that the developer has consciously verified the data source.

Value (number) -> Constructor Function -> Branded Type (CommitNumber)
                                     |
                                     +--> Logic error if passed to
                                          TimestampSeconds function

Key branded types include:

  • CommitNumber: Represents an offset in the commit history.
  • TimestampSeconds: Represents a Unix timestamp.
  • Params and ParamSet: Specific dictionary shapes for metadata.
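The branding pattern can be sketched as follows. The phantom `__brand` member exists only at the type level; both types are plain numbers at runtime, and the constructor functions are the sanctioned way to create branded values (the exact brand encoding in the generated index.ts may differ):

```typescript
// Both types are numbers at runtime, but the phantom __brand member
// keeps them incompatible at compile time.
type CommitNumber = number & { readonly __brand: "CommitNumber" };
type TimestampSeconds = number & { readonly __brand: "TimestampSeconds" };

// Constructor functions mark the point where a raw number is vetted.
const CommitNumber = (n: number): CommitNumber => n as CommitNumber;
const TimestampSeconds = (n: number): TimestampSeconds => n as TimestampSeconds;

function commitRange(begin: CommitNumber, end: CommitNumber): number {
  return end - begin;
}

const begin = CommitNumber(100);
const end = CommitNumber(356);
// commitRange(TimestampSeconds(0), end) would fail to compile:
// TimestampSeconds is not assignable to CommitNumber.
const span = commitRange(begin, end); // 256
```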

Namespaced Definitions

Certain domains are grouped into namespaces to reflect their specific context within the application:

  • pivot: Definitions related to the “Pivot Table” functionality, including operations like sum, avg, and count.
  • progress: Interfaces for long-running backend tasks that provide status updates (e.g., Running, Finished).
  • ingest: Data formats for the file ingestion pipeline, defining how measurement results are structured before being stored.

Design Decisions

Automatic Generation

The index.ts file is marked with DO NOT EDIT. This choice ensures that the frontend types are never out of sync with the backend. Changes to the data contract must be initiated in the Go code and propagated here via the generation tool (e.g., go2ts).

Use of Interfaces vs. Types

Interfaces are used for complex objects (like Alert or Anomaly) to allow for potential extension and to provide clearer error messages in IDEs. Type aliases are reserved for unions (like Status or ClusterAlgo) and the aforementioned branded nominal types.

Nullability and Optional Fields

The interfaces strictly define which fields are optional (?) and which can be null. This forces frontend components to handle missing data explicitly, reducing runtime TypeError exceptions when processing API responses.

Module: /modules/json-source-sk

The json-source-sk module provides a specialized UI component for the Perf application that allows developers and analysts to inspect the raw JSON data associated with a specific data point in a performance trace. It acts as a bridge between high-level trace visualizations and the underlying ingested source files.

Overview

The primary responsibility of this module is to fetch and display the original JSON metadata and results for a given trace at a specific commit. Because performance traces can be backed by large amounts of data, the module provides options to view either the full ingested file or a “short” version (typically excluding voluminous results) to improve load times and readability.

The component remains hidden by default and only reveals its controls when a valid traceid and cid (Commit ID) are provided, ensuring it only occupies screen space when actionable data is available.

Key Components

JSONSourceSk (json-source-sk.ts)

The core custom element. It manages the state of the retrieved JSON, the visibility of the modal dialog, and the communication with the backend.

  • State Management: It tracks _cid and _traceid. When either property is updated via setters, the internal JSON cache is cleared, and the component re-renders. This ensures that the user never sees stale data from a previous trace point.
  • Data Fetching Logic: The _loadSourceImpl method encapsulates the logic for interacting with the /_/details/ endpoint. It uses a POST request containing the commit and trace identifiers. It also handles the results=false query parameter when the “Short” view is requested.
  • User Feedback: It integrates a spinner-sk to indicate background loading activity and uses the errorMessage utility to bubble up fetch failures to the application's global error reporting system.

User Interface and Interaction

The component uses a <dialog> element for displaying the JSON content. This choice allows the JSON to be viewed in an overlay, preserving the user's context in the main performance graph or table.

  • View Json File: Triggers a full data fetch and opens the modal.
  • View Short Json File: Triggers a fetch with the results=false flag, useful for inspecting metadata without the overhead of every individual measurement.
  • Modal Dialog: Contains a <pre> block for formatted JSON display and a sticky close button for easy navigation.

Workflow: Retrieving Source Data

The following diagram illustrates the lifecycle of a data request within the component:

User Interaction          JSONSourceSk Component             Backend Server
      |                             |                             |
      |-- Click "View Json" ------->|                             |
      |                             |-- Show Spinner              |
      |                             |                             |
      |                             |-- POST /_/details/ -------->|
      |                             |   {cid, traceid}            |
      |                             |                             |
      |                             |<--------- JSON Response ----|
      |                             |                             |
      |                             |-- Hide Spinner              |
      |                             |-- Format JSON string        |
      | <--- Open Modal Dialog -----|                             |
      |      with <pre> content     |                             |

Design Decisions

  • Validation: The visibility of the “View” buttons is tied to validKey(traceid). This prevents the component from attempting to fetch data using malformed or incomplete trace identifiers.
  • Formatting: Data is processed via JSON.stringify(json, null, ' ') before display. This ensures that regardless of the wire format (which is often minified), the user sees a human-readable, indented structure.
  • Cleanup: When the dialog is closed via closeJsonDialog, the internal _json string is cleared. This is a memory management choice to avoid keeping potentially large strings in the DOM when they are not actively being viewed.
  • CSS Scoping: The styles include specific overrides for spinner-sk dimensions and use a flexbox layout for controls to maintain a compact footprint within the Perf UI toolbars.

Testing and Page Objects

The module includes a Page Object (JsonSourceSkPO) located in json-source-sk_po.ts. This encapsulates the internal DOM structure (selectors for buttons, the dialog, and the pre-formatted text), allowing Puppeteer and Karma tests to interact with the component without being brittle to internal HTML changes. This is particularly important for testing the modal's visibility and the content of the fetch results.

Module: /modules/keyboard-shortcuts-help-sk

Keyboard Shortcuts Help Dialog

The keyboard-shortcuts-help-sk module provides a standardized UI component for displaying available keyboard shortcuts to the user. It functions as a discovery mechanism, ensuring that keyboard-driven workflows are accessible and documented within the application interface itself.

Design Philosophy: Centralized Registry Discovery

The core design decision behind this module is to decouple the definition of shortcuts from their presentation. Instead of hard-coding a list of keys into a help dialog, this component acts as a consumer of the ShortcutRegistry (from perf/modules/common:keyboard-shortcuts_ts_lib).

This approach ensures that:

  1. Truth is Centralized: Shortcuts are defined alongside the logic that handles them, but are automatically reflected in the help UI without manual updates to the dialog.
  2. Context Sensitivity: The dialog can filter displayed shortcuts based on a provided KeyboardShortcutHandler. If a shortcut is associated with a specific method that is not present on the current handler, it is hidden from the user, preventing confusion about unavailable actions.

Key Component: KeyboardShortcutsHelpSk

The KeyboardShortcutsHelpSk class is a Lit-based custom element that wraps a Material Design dialog (md-dialog). Its primary responsibilities include:

  • Dynamic Rendering: Upon opening, it queries the ShortcutRegistry to retrieve all registered shortcuts, grouped by category.
  • Handler-Based Filtering: It accepts a handler property. When rendering, it iterates through registered shortcuts and checks if the handler actually implements the method associated with that shortcut. This ensures the help menu is relevant to the user's current context (e.g., different shortcuts for a graph view versus a table view).
  • Visual Organization: It formats shortcuts into a readable table, using CSS to highlight keys (using monospace fonts) and categorize them under bold headers for quick scanning.

Internal Workflow

The following diagram illustrates how the component retrieves and filters data for display:

[ ShortcutRegistry ] <------- (1) Request Shortcuts
        |
        v
[ KeyboardShortcutsHelpSk ] <--- (2) Check 'handler' property
        |
        |--- (3) For each Shortcut:
        |    IF (shortcut.method exists AND handler lacks method)
        |    THEN: Skip
        |    ELSE: Add to Render List
        |
        v
[ md-dialog Content ] <------- (4) Render Table Rows
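Step (3) of the diagram reduces to a filter over the registry's entries. A minimal sketch, with illustrative shapes standing in for the real registry types:

```typescript
// Sketch of the filtering step: a shortcut is shown only if it has no
// associated method, or the current handler implements that method.
interface ShortcutSketch {
  key: string;
  description: string;
  method?: string; // name of the handler method servicing this shortcut
}

function visibleShortcuts(
  shortcuts: ShortcutSketch[],
  handler: Record<string, unknown>
): ShortcutSketch[] {
  return shortcuts.filter(
    (s) => s.method === undefined || typeof handler[s.method] === "function"
  );
}
```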

Key Files

  • keyboard-shortcuts-help-sk.ts: Contains the logic for the Lit element, including the filtering logic and the open()/close() API for controlling the dialog programmatically.
  • keyboard-shortcuts-help-sk.scss: Defines the layout for the shortcut table, ensuring consistent spacing and visual cues for keys and categories using the application's theme variables.
  • keyboard-shortcuts-help-sk_test.ts: Validates that the component correctly pulls data from the ShortcutRegistry and renders the expected HTML structure.

Module: /modules/new-bug-dialog-sk

new-bug-dialog-sk

The new-bug-dialog-sk module provides a specialized modal dialog for the Perf triage workflow. It allows users to file Buganizer issues for one or more detected anomalies (performance regressions or improvements) directly from the Perf UI.

Overview

When a sheriff or developer identifies an untriaged anomaly in a performance chart, they need a streamlined way to report it. This module automates the boilerplate of bug creation by pre-populating fields based on the selected anomaly's metadata, such as the test path, the magnitude of the change, and the affected revision range.

Design Decisions

Automated Metadata Extraction

The dialog is designed to minimize manual data entry. It implements logic to parse Anomaly objects and automatically generate:

  • Bug Titles: A formatted string indicating the percentage change, the type (regression/improvement), the test suite, and the revision range (e.g., “33.6% regression in v8/async-fs at 95940:95944”).
  • Component Selection: It extracts bug components associated with the anomalies and presents them as radio buttons, ensuring the bug is filed in the correct tracker.
  • Label Management: It aggregates unique labels from all selected anomalies, allowing the user to toggle them before submission.

Support for Multiple Anomalies

The dialog supports filing a single bug for a collection of anomalies. This is common when a single underlying commit causes regressions across multiple related metrics. The implementation handles this by:

  1. Calculating the aggregate revision range (minimum start to maximum end).
  2. Determining the range of percentage changes (e.g., “10% to 20% regression”).
  3. Collecting all unique labels and components from the entire set.
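The first two aggregation steps can be sketched directly; the field names below are illustrative rather than the actual Anomaly interface:

```typescript
// Sketch of multi-anomaly aggregation: the overall revision range and
// the spread of percentage changes for the combined bug title.
interface AnomalySketch {
  startRevision: number;
  endRevision: number;
  percentChange: number;
}

function aggregate(anomalies: AnomalySketch[]) {
  const start = Math.min(...anomalies.map((a) => a.startRevision));
  const end = Math.max(...anomalies.map((a) => a.endRevision));
  const lo = Math.min(...anomalies.map((a) => a.percentChange));
  const hi = Math.max(...anomalies.map((a) => a.percentChange));
  return {
    revisionRange: `${start}:${end}`,
    percentRange: lo === hi ? `${lo}%` : `${lo}% to ${hi}%`,
  };
}
```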

User Experience and Feedback

  • Draggable Interface: The dialog implements custom mouse event listeners (onMousedown, onMouseMove, onMouseUp) to allow users to move the dialog. This is helpful if the user needs to see the underlying chart data while filling out the bug report.
  • Loading State: A secondary <dialog> (#loading-popup) is used to provide visual feedback during the asynchronous fetch request to the backend.
  • Auto-CC: Upon opening, the component calls LoggedIn() to identify the current user and automatically adds them to the CC list.

Key Components and Implementation Details

new-bug-dialog-sk.ts

This is the primary logic hub. It manages the internal state of the form and interacts with the Perf backend.

  • Triage Logic: The methods getBugTitle(), getPercentChangeForAnomaly(), and getSuiteNameForAlert() contain the business logic for translating raw anomaly data into human-readable bug reports, mimicking legacy Chromeperf behavior.
  • Submission Workflow: The fileNewBug() method gathers data from the form (including dynamically generated checkboxes and radios), sends a POST request to /_/triage/file_bug, and processes the response.
  • Post-Submission: On success, it opens the newly created bug in a new browser tab and dispatches an anomaly-changed event. This event notifies other components (like charts or lists) that the anomaly's bug_id has been updated and they should re-render to reflect the triaged status.

Workflow: Filing a Bug

[ User Clicks 'File Bug' ]
          |
          v
[ open() called: Fetch login status, show modal ]
          |
          v
[ UI populates Title, Labels, Components from Anomalies ]
          |
[ User adjusts form & clicks 'Submit' ]
          |
          v
[ fileNewBug() ]----------------------> [ Server: /_/triage/file_bug ]
          |                                       |
[ Show Loading Popup ]                  [ Create Buganizer Issue ]
          |                                       |
[ Receive Bug ID ] <------------------------------'
          |
          v
[ 1. Update local Anomaly objects with Bug ID ]
[ 2. Dispatch 'anomaly-changed' event         ]
[ 3. Open https://issues.chromium.org/issues/{ID} ]
[ 4. Close Dialogs                            ]

new-bug-dialog-sk.scss

The styling ensures the dialog fits the Perf theme. It uses a flexible layout for the textarea and ensures the closeIcon is pinned to the top-right for easy dismissal.

new-bug-dialog-sk_po.ts

Provides a Page Object for automated testing. This encapsulates the selectors for the title, description, assignee, and CC inputs, allowing Puppeteer tests to interact with the dialog without being brittle to internal DOM changes.

Module: /modules/paramtools

Paramtools

The paramtools module provides a suite of utility functions for manipulating and transforming “Structured Keys,” Params, and ParamSets. It acts as a client-side mirror of the Go implementation found in /infra/go/paramtools, enabling the frontend to handle Perf trace identifiers and query parameters consistently with the backend.

Design Philosophy

The module is designed around the concept of a Structured Key: a string representation of key-value pairs used to uniquely identify data traces (e.g., ,arch=x86,config=8888,os=linux,).

The implementation prioritizes:

  • Canonical Representation: Keys are always generated with sorted keys to ensure that the same set of parameters always results in the identical string identifier.
  • Performance vs. Validation: Since the server-side Go implementation performs rigorous validation, this TypeScript module focuses on efficient transformation and parsing, assuming the data is largely well-formed.
  • Interoperability: It facilitates easy conversion between internal data structures (Params) and external formats like URL query strings.

Key Data Structures

  • Params: A simple mapping of strings to strings (e.g., { "os": "linux" }). Represents a specific point or trace.
  • ParamSet: A mapping of strings to arrays of strings (e.g., { "os": ["linux", "windows"] }). Represents a collection of possible values for various keys.

Component Responsibilities

Key Manipulation

The module provides logic to move between string identifiers and structured objects:

  • makeKey: Converts a Params object into a canonical structured key. It sorts the keys alphabetically and wraps the result in leading and trailing commas to ensure unambiguous matching.
  • fromKey: Parses a structured key back into a Params object. It includes logic to strip away “Special Functions” (like norm()) that might wrap a key during calculation phases.
  • validKey: A simple validator that checks for the standard ,key=value, format, primarily used to distinguish between raw trace IDs and calculated traces.
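A simplified round-trip of the three functions above looks like the following. This sketch assumes the plain `,key=value,` format; the real `fromKey` additionally strips wrapping functions like `norm(...)`:

```typescript
// Simplified sketch of the structured-key round-trip.
type Params = { [key: string]: string };

// Sort keys alphabetically and wrap in commas for a canonical identifier.
function makeKey(params: Params): string {
  const keys = Object.keys(params).sort();
  return ',' + keys.map((k) => `${k}=${params[k]}`).join(',') + ',';
}

// Parse a structured key back into a Params object.
function fromKey(structuredKey: string): Params {
  const ret: Params = {};
  structuredKey.split(',').forEach((pair) => {
    if (pair === '') return;
    const [k, v] = pair.split('=');
    ret[k] = v;
  });
  return ret;
}

// Distinguish raw trace IDs from calculated traces like norm(,a=b,).
function validKey(key: string): boolean {
  return key.startsWith(',') && key.endsWith(',') && key.includes('=');
}
```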

ParamSet Aggregation

Functions in this category handle the merging and expansion of parameter collections:

  • addParamsToParamSet: Merges a single Params instance into an existing ParamSet. This is useful when building a global index of available dimensions from a list of specific traces.
  • addParamSet: Merges two ParamSet objects, ensuring that values remain unique within each key.
  • paramsToParamSet: A convenience function to lift a single Params object into the ParamSet type.
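The additive merge at the heart of these functions can be sketched as follows (the real implementations live in this module and mirror the Go versions):

```typescript
// Sketch of merging one Params into a ParamSet, keeping values unique.
type Params = { [key: string]: string };
type ParamSet = { [key: string]: string[] };

function addParamsToParamSet(ps: ParamSet, p: Params): void {
  Object.entries(p).forEach(([key, value]) => {
    const values = ps[key] ?? [];
    if (!values.includes(value)) values.push(value);
    ps[key] = values;
  });
}

// Lift a single Params into a ParamSet.
function paramsToParamSet(p: Params): ParamSet {
  const ps: ParamSet = {};
  addParamsToParamSet(ps, p);
  return ps;
}
```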

Integration Utilities

  • queryFromKey: Converts a structured key directly into a URL-encoded query string (e.g., a=1&b=2). This is essential for synchronization between the application state (trace keys) and the browser's URL for deep-linking.

Workflows

Trace ID to Query String

This workflow illustrates how a trace identifier from the backend is prepared for use in a frontend search query.

Structured Key: ",arch=arm,os=android,"
      |
      v
[ fromKey() ] --> Params: { arch: "arm", os: "android" }
      |
      v
[ queryFromKey() ] -> String: "arch=arm&os=android"
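In code, the two steps above reduce to a single parse-and-encode pass. This sketch assumes the simple key format and skips the function-stripping the real `fromKey` performs:

```typescript
// Sketch: structured key -> URL-encoded query string.
function queryFromKey(structuredKey: string): string {
  const params = new URLSearchParams();
  structuredKey.split(',').forEach((pair) => {
    if (pair === '') return;
    const [k, v] = pair.split('=');
    params.set(k, v);
  });
  return params.toString();
}
```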

Building a Filter UI

This workflow shows how individual trace IDs are aggregated to populate a user interface with all available filtering options.

Trace A: ",config=565,"  Trace B: ",config=888,"
      |                     |
      +----------+----------+
                 |
                 v
      [ addParamsToParamSet() ]
                 |
                 v
      ParamSet: { config: ["565", "888"] }
                 |
                 v
      (Used to render dropdown menus)

Module: /modules/perf-scaffold-sk

perf-scaffold-sk

The perf-scaffold-sk module provides the foundational layout and shell for all Skia Performance Monitoring (Perf) web pages. It serves as a master template, providing consistent navigation, branding, error handling, and a unified look-and-feel across the application.

High-Level Overview

This module defines the PerfScaffoldSk custom element, which acts as a wrapper for every page in the Perf application. Its primary responsibilities include:

  • Branding and Navigation: Hosting the instance logo, title, and primary navigation links (Explore, Alerts, Triage, etc.).
  • Contextual Shell: Providing a consistent sidebar or header (depending on the UI version) for global actions.
  • Infrastructure Integration: Embedding essential utility components like alogin-sk (authentication), theme-chooser-sk (dark/light mode), and error-toast-sk (global error notifications).
  • Global Configuration: Responding to settings defined in the global window.perf object to customize the UI per instance.

Design Decisions and UI Versions

The scaffold currently supports two distinct UI layouts: Legacy UI and V2 UI. The implementation allows for a phased transition between styles, controlled by both global configuration and user preference.

UI Selection Logic

The choice of layout is determined at render time based on the following hierarchy:

  1. User Preference: A value stored in localStorage under the key v2_ui.
  2. Global Default: The window.perf.enable_v2_ui boolean provided by the server.

Users can manually toggle between these versions via a “Try V2 UI” button in the Legacy sidebar or a “Back to Legacy UI” button in the V2 header. This toggle action updates localStorage and triggers a page reload to re-initialize the scaffold.
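The resolution hierarchy can be sketched as a pure function. The storage key `v2_ui` and `window.perf.enable_v2_ui` come from the text above; the exact stored values (`'true'`/`'false'`) are an assumption:

```typescript
// Sketch of the layout resolution order. `stored` is the raw value read
// from localStorage.getItem('v2_ui'), passed in so the logic stays testable.
function useV2Ui(stored: string | null, serverDefault: boolean): boolean {
  // 1. User preference in localStorage wins when present.
  if (stored !== null) return stored === 'true';
  // 2. Otherwise fall back to the server-provided global default.
  return serverDefault;
}
```

The toggle buttons simply write the opposite value to `localStorage` and reload the page.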

Component Structure

+-----------------------------------------------------------+
| perf-scaffold-sk                                          |
| +-------------------------------------------------------+ |
| | app-sk (Legacy or V2)                                 | |
| | +---------------------------------------------------+ | |
| | | header (Top Bar)                                  | | |
| | | - Logo & Title                                    | | |
| | | - Auth & Theme Chooser                            | | |
| | +---------------------------------------------------+ | |
| | | aside (Sidebar - Legacy) OR nav (Header - V2)     | | |
| | | - Links (Explore, Alerts, Triage, etc.)           | | |
| | +---------------------------------------------------+ | |
| | | main (perf-content)                               | | |
| | | - User-provided child content injected here       | | |
| | +---------------------------------------------------+ | |
| | | gemini-side-panel-sk (V2 Only)                    | | |
| | +---------------------------------------------------+ | |
| | | footer                                            | | |
| | | - error-toast-sk                                  | | |
| | | - Build/Version tags                              | | |
| | +---------------------------------------------------+ | |
| +-------------------------------------------------------+ |
+-----------------------------------------------------------+

Key Components and Responsibilities

Layout Management (perf-scaffold-sk.ts)

The core logic resides in PerfScaffoldSk. It manages the lifecycle of the application shell. A key feature is the content redistribution process:

  • When the component is initialized, it takes all original child elements and moves them into an internal #perf-content container within the <main> tag.
  • It specifically identifies elements with the ID sidebar_help and moves them into a specialized help area (a sidebar section in Legacy, or a dropdown menu in V2).

Styling (perf-scaffold-sk.scss)

The styles utilize CSS Grid and Flexbox to create responsive layouts.

  • Legacy UI: Uses a traditional sidebar-heavy layout (aside#sidebar).
  • V2 UI: Implements a modern top-navigation layout with a sticky header and a scrolling main content area. It also handles the positioning of the Gemini AI side panel.

Versioning and Build Info

The scaffold displays the current application version in the footer. It intelligently formats the version string:

  • Git Hashes: Links directly to the source repository.
  • Dev Timestamps: Formats ISO strings into human-readable UTC dates for local development builds.
  • Build Tags: Displays tags retrieved via getBuildTag() from the window module.
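The branching above might look like the following sketch. The git-hash heuristic, the return shape, and the repository URL are assumptions for illustration:

```typescript
// Hypothetical sketch of footer version formatting.
function formatVersion(version: string): { text: string; href?: string } {
  // Dev builds embed an ISO timestamp rather than a commit hash.
  if (/^\d{4}-\d{2}-\d{2}T/.test(version)) {
    return { text: new Date(Date.parse(version)).toUTCString() };
  }
  // Git hashes link to the source repository (URL is illustrative).
  if (/^[0-9a-f]{7,40}$/.test(version)) {
    return {
      text: version.slice(0, 7),
      href: `https://skia.googlesource.com/buildbot/+/${version}`,
    };
  }
  // Anything else (e.g. a build tag) is shown verbatim.
  return { text: version };
}
```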

Integration with window.perf

The scaffold is highly data-driven, relying on the window.perf object for:

  • header_image_url: Custom instance logos (with a fallback to an “alpine” logo).
  • instance_name / instance_url: Displaying the instance identity.
  • chat_url / feedback_url: Linking to support channels.
  • show_triage_link: Conditionally hiding the Triage navigation item.

Key Files

  • perf-scaffold-sk.ts: The main TypeScript definition for the Lit-based custom element, containing the template logic for both UI versions.
  • perf-scaffold-sk.scss: Theme-aware styles that define the grid layouts for both Legacy and V2 shells.
  • perf-scaffold-sk-demo.ts & perf-scaffold-sk-v2-demo.ts: Demo entry points that mock the window.perf configuration to showcase the scaffold's capabilities in various states.
  • perf-scaffold-sk_puppeteer_test.ts: Integration tests ensuring that layout transitions, version rendering, and content redistribution work as expected.

Module: /modules/picker-field-sk

picker-field-sk

The picker-field-sk module provides a specialized multi-selection component designed for choosing values from a pre-defined list. It wraps a Vaadin multi-select combo box with additional logic for bulk selection and data organization, specifically tailored for complex filtering workflows (such as performance test pickers).

Design Philosophy

The primary goal of picker-field-sk is to simplify the management of large sets of options while providing visual cues and high-level controls for common selection patterns.

Rather than being a generic text field, it addresses specific needs of hierarchical or categorized data:

  • Primary vs. Detailed Options: In many datasets, options without periods in their name represent “primary” or top-level categories. This module automatically identifies these and provides a one-click toggle to select them.
  • Dynamic Controls: Selection features like “Select All” or “Split” are conditionally displayed based on the component's state (e.g., how many items are selected) and its position in a sequence (the index property).
  • Responsive Sizing: The component dynamically calculates its overlay width based on the longest option string to ensure that labels are readable without unnecessary truncation.
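The responsive-sizing calculation reduces to finding the longest option and expressing the width in `ch` units. The padding constant below is an assumption:

```typescript
// Sketch of ch-based overlay sizing: width scales with the longest option.
function overlayWidthCh(options: string[]): string {
  const longest = options.reduce((max, o) => Math.max(max, o.length), 0);
  return `${longest + 5}ch`; // extra room for the checkbox and scrollbar
}
```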

Key Components and Logic

Core Selection: Vaadin Multi-Select

The underlying selection mechanism is handled by @vaadin/multi-select-combo-box. This provides the chip-based UI for selected items and the searchable dropdown. The module styles this component to integrate with the local theme, including specific “dark mode” transitions.

Bulk Actions

The component features a set of checkbox-sk elements located in a “split-by-container” above the main field. These controls appear based on the following logic:

  • Select All: Visible if there are more than 2 options and the field is not the primary field (index > 0). It allows selecting the entire list or resetting to the first item.
  • Primary: Visible if the list contains a mix of “primary” items (those without periods) and “detailed” items. It allows users to toggle the selection of top-level categories.
  • Split: Dispatches a split-by-changed event. This is used by parent components to decide if a visualization should be broken down by the attribute represented by this field.
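The "primary" classification and the visibility rule for its toggle can be sketched directly from the description above:

```typescript
// Options whose names contain no period are treated as top-level categories.
function primaryOptions(options: string[]): string[] {
  return options.filter((o) => !o.includes('.'));
}

// The "Primary" checkbox is only useful when the list mixes both kinds.
function showPrimaryToggle(options: string[]): boolean {
  const primaries = primaryOptions(options).length;
  return primaries > 0 && primaries < options.length;
}
```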

State Management

The internal state is managed through several private properties that trigger re-renders or logic updates:

  • options: Setting this property automatically triggers the filtering of primaryOptions and recalculates the dropdown width.
  • selectedItems: Controls which chips are currently visible.
  • index: Determines whether the bulk action checkboxes should be visible.

Workflow: Selection and Events

The following diagram illustrates how user interaction flows through the component to notify the rest of the application:

User Interaction          picker-field-sk             External App
+----------------+       +-------------------+       +------------------+
| Click "All"    |------>| Update _selected  |       |                  |
| Checkbox       |       | Items & Render    |       |                  |
+----------------+       +---------+---------+       +------------------+
                                   |
                                   v
+----------------+       +-------------------+       +------------------+
| Select Item in |------>| onValueChanged()  |------>| Listen for       |
| Dropdown       |       |                   |       | 'value-changed'  |
+----------------+       +---------+---------+       +------------------+
                                   |
                                   v
+----------------+       +-------------------+       +------------------+
| Toggle "Split" |------>| splitOnValue()    |------>| Listen for       |
| Checkbox       |       |                   |       |'split-by-changed'|
+----------------+       +-------------------+       +------------------+

Styling and Layout

The component uses a vertical flex layout where the label and selection checkboxes sit atop the combo box.

  • Overlay Width: Calculated using ch (character) units to ensure the dropdown menu scales with the content length.
  • Chip Styling: Selected items (chips) are styled with direction: rtl to handle long strings gracefully within the limited horizontal space of the input field.
  • Theming: Integrated with //perf/modules/themes, utilizing CSS variables for background colors, focus states, and transitions to ensure a consistent look across different UI modes.

Testing Utilities

The module includes a Page Object (PickerFieldSkPO) located in picker-field-sk_po.ts. This encapsulates the complexity of interacting with the Shadow DOM of both the picker-field-sk and the underlying Vaadin components. It provides high-level methods for:

  • Selecting items by text.
  • Removing specific chips.
  • Checking the state of the “Split” and “Select All” checkboxes.
  • Managing the overlay visibility during automated Puppeteer tests.

Module: /modules/pinpoint-try-job-dialog-sk

pinpoint-try-job-dialog-sk

The pinpoint-try-job-dialog-sk module provides a modal dialog designed to trigger Pinpoint A/B “Try jobs.” In the context of the Perf application, its primary purpose is to allow developers to request additional performance traces for specific benchmark runs to debug regressions or verify improvements.

Design and Purpose

This module is specifically tailored for the “Debug Traces” use case. It acts as a bridge between the Perf UI and the Pinpoint performance analysis system. Rather than being a general-purpose Pinpoint job creator, it focuses on taking existing performance data contexts—such as a specific trace found in a chart—and prepopulating a request to gather more detailed diagnostic information (e.g., Chrome trace categories).

Key design decisions include:

  • Contextual Preloading: The dialog is designed to be populated via setTryJobInputParams, which extracts necessary metadata (bot, benchmark, story) from a “test path” string commonly used in Perf.
  • Constraint-Focused: While Pinpoint supports many job types, this dialog simplifies the interface to focus on A/B comparisons between a base commit and an experiment (end) commit.
  • Trace Customization: It provides a default set of tracing arguments (toplevel, toplevel.flow, etc.) but allows users to override them to gather specific category data.

Key Components

PinpointTryJobDialogSk (pinpoint-try-job-dialog-sk.ts)

The main class extending ElementSk. It manages the internal state of the dialog, including the commit hashes, story names, and the resulting Pinpoint job URL.

  • Authentication: Upon connection, it uses alogin-sk to identify the current user. This is crucial as Pinpoint requires a user email to associate with the created job.
  • State Management: It tracks baseCommit, endCommit, and testPath. The testPath is specifically parsed during the submission process to identify the configuration (bot) and benchmark.
  • Submission Logic: The postTryJob method handles the transformation of UI fields into a TryJobCreateRequest. It maps the user's input into the specific JSON structure expected by the /_/try/ endpoint.

Template and Styling

The dialog is rendered using lit-html and styled to match the Perf theme. It utilizes standard HTMLDialogElement functionality for modal behavior and includes a spinner-sk to provide visual feedback during the asynchronous submission process.

Key Workflows

Triggering a Job

The typical lifecycle of the dialog involves an external component passing in performance context before the user interacts with the form.

External Component          Dialog Component             Pinpoint API
        |                        |                            |
        |setTryJobInputParams -->|                            |
        |   (commits, testPath)  |                            |
        |                        |                            |
        |------- open() -------->|                            |
        |                        | (User modifies args)       |
        |                        |                            |
        |                        |------- POST /_/try/ ------>|
        |                        |                            |
        |                        |<------ { jobUrl } ---------|
        |                        |                            |
        |                        |-- Updates UI with Link ----|

Data Mapping

When a user submits the form, the module performs a specific mapping from the human-readable “test path” to the Pinpoint API fields:

  1. Test Path Parsing: A string like master/linux-perf/blink_perf.ext/test_case is split.
    • Index 1 becomes the configuration (linux-perf).
    • Index 2 becomes the benchmark (blink_perf.ext).
  2. Argument Injection: The traceArgs input is wrapped into a JSON string under --extra-chrome-categories and passed within extra_test_args.
  3. Naming: The job name is automatically generated to follow a standard pattern: Tracing Debug on <config>/<benchmark>/<story>.
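The parsing and naming steps above can be sketched as follows; the split indices come from the text, while the return shape is an assumption:

```typescript
// Sketch of test-path parsing for the Pinpoint request.
function parseTestPath(testPath: string): {
  configuration: string;
  benchmark: string;
} {
  const parts = testPath.split('/');
  return {
    configuration: parts[1], // e.g. 'linux-perf'
    benchmark: parts[2], // e.g. 'blink_perf.ext'
  };
}

// Job names follow the standard pattern from the text.
function jobName(config: string, benchmark: string, story: string): string {
  return `Tracing Debug on ${config}/${benchmark}/${story}`;
}
```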

Implementation Files

  • pinpoint-try-job-dialog-sk.ts: Contains the logic for form validation, API interaction, and the Lit template.
  • pinpoint-try-job-dialog-sk.scss: Defines the layout, specifically ensuring the dialog handles long input strings (like commit hashes and trace arguments) gracefully.
  • pinpoint-try-job-dialog-sk_test.ts: Validates that the form correctly parses the input parameters and that the fetch request sent to the backend contains the expected payload structure.

Module: /modules/pivot-query-sk

pivot-query-sk

The pivot-query-sk module provides a specialized UI component for configuring data transformation requests, specifically for pivoting and aggregating performance trace data. It allows users to define how data should be grouped, what primary mathematical operation to perform on those groups, and which additional summary statistics should be calculated.

Overview

The primary purpose of this component is to build and edit a pivot.Request object. This object is used by the Perf backend to reshape time-series data into a tabular or grouped format. The component provides a high-level interface for three specific pivoting dimensions:

  1. Group By: Selecting which keys (from a provided ParamSet) should be used to cluster traces together.
  2. Operation: The main reduction function (e.g., average, sum, min) applied to the data within each group.
  3. Summary: Optional additional statistics (e.g., count, max) to be calculated for each group.

Design Decisions

Data Consistency and ParamSets

The component requires a ParamSet to populate the “Group By” options. A key design choice in pivot-query-sk.ts is the handling of the intersection between the current pivot.Request and the provided ParamSet.

The allGroupByOptions() method merges keys from the current request with those in the ParamSet. This ensures that if a user loads a saved pivot request containing keys that are not present in the current data's ParamSet, the selection is preserved rather than silently dropped. This “additive” approach prevents data loss when switching between different data contexts.

Unique Instance Identification

Because the component uses ARIA attributes (aria-labelledby) to maintain accessibility, it implements a uniqueId system. Each instance of pivot-query-sk on a page increments a static counter. This ensures that internal element IDs (like group_by-0, group_by-1) remain unique, preventing label collisions when multiple query builders are present in the same view.
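The static-counter scheme can be sketched in a few lines; the class and method names below are illustrative, not the component's actual API:

```typescript
// Sketch of per-instance unique IDs via a static counter.
class PivotQueryLikeElement {
  private static instanceCount = 0;

  readonly uniqueId: number;

  constructor() {
    this.uniqueId = PivotQueryLikeElement.instanceCount++;
  }

  // e.g. 'group_by-0', 'group_by-1', ... unique per instance on the page,
  // keeping aria-labelledby references unambiguous.
  elementId(base: string): string {
    return `${base}-${this.uniqueId}`;
  }
}
```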

Validation and State

The component acts as a controlled input. It exposes a pivotRequest getter that utilizes validatePivotRequest from ../pivotutil. If the current internal state is invalid, the getter returns null. This forces consuming components to handle invalid states gracefully before attempting to dispatch a backend request.

Key Components

PivotQuerySk

The main class (pivot-query-sk.ts) manages the state and rendering logic. It uses lit-html for templating and leverages existing elements like multi-select-sk and select-sk to handle the heavy lifting of UI interactions.

  • State Management: It maintains an internal _pivotRequest and _paramset. Any change to these properties via setters triggers a re-render.
  • Event Emission: Whenever a user interacts with the UI (changing a selection or the operation), the component emits a pivot-changed event containing the updated (and potentially null, if invalid) pivot.Request.

Interactions and Workflow

The following diagram illustrates how data flows from user interaction to a valid request:

User Action (Click/Select)
          |
          v
Internal Event Handler (e.g., groupByChanged)
          |
          +--> Updates internal _pivotRequest
          |
          +--> Validation Check (via pivotutil)
          |
          v
Dispatches "pivot-changed" Event
          |
          +--> Parent component receives pivot.Request OR null

Supporting Files

  • pivot-query-sk_po.ts: Provides a Page Object (PO) for testing. It abstracts the complexity of interacting with multiple multi-select-sk elements, allowing tests to select options by text content rather than implementation details.
  • pivot-query-sk.scss: Handles the layout, ensuring that the selection lists are presented in a flexible, readable grid with scrollable areas for large ParamSet keys.

Module: /modules/pivot-table-sk

High-Level Overview

pivot-table-sk is a custom element designed to display aggregated Performance (Perf) data in a tabular format. While traditional DataFrames in the Perf system are often visualized as time-series plots, this module handles cases where data has been “pivoted” and summarized into discrete values (e.g., averages, sums, or standard deviations).

The element transforms complex trace data—where keys are comma-separated parameter strings—into a human-readable table. It allows users to explore multi-dimensional data by grouping by specific parameters and viewing various statistical summaries side-by-side.

Design Decisions and Implementation

Data Transformation and Mapping

The core challenge the module solves is translating a DataFrame (optimized for storage and plotting) into a grid.

  • Key Extraction: Trace IDs in Perf are long strings (e.g., ,arch=x86,config=8888,). The module uses keyValuesFromTraceSet to parse these strings and extract only the values corresponding to the group_by parameters requested by the user.
  • Ordering: The order of columns in the table is strictly dictated by the pivot.Request. The “Key” columns (parameters) appear first, followed by the “Summary” columns (statistical operations).

Advanced Sorting Logic

Rather than a simple per-column sort, pivot-table-sk implements a Sort History mechanism via the SortHistory and SortSelection classes. This approach mimics spreadsheet behavior:

  • Tie-Breaking: When a user clicks a column header, that column becomes the primary sort criterion. However, previous sort actions are preserved in a stack. If two rows have identical values in the primary column, the logic falls back to the second most recent sort column to break the tie, and so on.
  • State Persistence: The entire sort state (which columns, which direction, and in what priority) can be encoded into a compact string (e.g., dk2-us1). This allows the sorting state to be reflected in URL parameters, making table views shareable and bookmarkable.
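One plausible encoding consistent with the `dk2-us1` example is direction character, column-kind character, then column index; the real `SortSelection` format may differ, so treat this as an illustrative sketch:

```typescript
// Assumed encoding: [d|u] = down/up, [k|s] = keyValues/summaryValues,
// then the column index; selections joined with '-'.
type Direction = 'up' | 'down';
type ColumnKind = 'keyValues' | 'summaryValues';

interface SortSelection {
  dir: Direction;
  kind: ColumnKind;
  column: number;
}

function encode(history: SortSelection[]): string {
  return history
    .map((s) => `${s.dir[0]}${s.kind === 'keyValues' ? 'k' : 's'}${s.column}`)
    .join('-');
}

function decode(s: string): SortSelection[] {
  return s.split('-').map((part) => ({
    dir: part[0] === 'd' ? 'down' : 'up',
    kind: part[1] === 'k' ? 'keyValues' : 'summaryValues',
    column: Number(part.slice(2)),
  }));
}
```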

Validation and Safety

Because the component relies on a specific relationship between the DataFrame and the pivot.Request, it includes a validation layer. It uses validateAsPivotTable to ensure the incoming request is compatible with a tabular display (e.g., checking that the necessary grouping and summary fields are present) before attempting to render.

Key Components

PivotTableSk

The main custom element. It manages the lifecycle of the data, reacting to changes in the df (DataFrame) and req (Request) properties. It uses lit for efficient rendering and manages internal state for the sortHistory and the resulting compare function used by the JavaScript native Array.sort().

SortHistory and SortSelection

These classes encapsulate the multi-column sorting logic.

  • SortSelection handles the metadata for a single column: its index, its type (keyValues vs summaryValues), and its direction.
  • SortHistory manages an array of selections. It provides the buildCompare method, which generates a complex comparison function that iterates through the history stack until a non-zero comparison result is found.
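The stacked tie-breaking comparator reduces to iterating the history until one comparison is decisive. A generic sketch (the real `buildCompare` operates on the table's row representation):

```typescript
// Sketch of a tie-breaking comparator built from a sort-history stack.
type Compare<T> = (a: T, b: T) => number;

function buildCompare<T>(history: Compare<T>[]): Compare<T> {
  return (a: T, b: T): number => {
    // Walk the stack, most recent sort first, until rows differ.
    for (const cmp of history) {
      const result = cmp(a, b);
      if (result !== 0) return result;
    }
    return 0;
  };
}
```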

Page Object (PivotTableSkPO)

Located in pivot-table-sk_po.ts, this provides an abstraction for testing. It allows internal tests (Puppeteer) to interact with the table (clicking headers, reading cell values) without being coupled to the specific DOM structure or CSS classes.

Primary Workflow

The following diagram illustrates how data flows into the component and results in a sorted display:

[ DataFrame ] + [ pivot.Request ]
      |               |
      v               v
+-----------------------------+
|      willUpdate()           |
|  1. Extract KeyValues       | <--- Maps Trace IDs to grouped params
|  2. Init/Update SortHistory | <--- Restores state from encoded string
|  3. Build Compare Function  |
+-----------------------------+
      |
      v
+-----------------------------+
|         render()            |
|  1. Validate Request        |
|  2. Sort Keys via CompareFn |
|  3. Generate <table> rows   |
+-----------------------------+
      |
      +--> [ User Clicks Header ] --+
               ^                    |
               |                    v
               +----------- [ Emit 'change' event ]
                            [ Re-run willUpdate() ]

Events

  • change: Emitted whenever the user changes the sort order. The event's detail contains the serialized SortHistory string, allowing parent components to sync the UI state with the application URL.

Module: /modules/pivotutil

Pivot Utilities

The pivotutil module provides a set of client-side utilities designed to facilitate the configuration and validation of pivot operations within the Perf system. Its primary role is to bridge the gap between raw pivot request data structures defined in the backend and the user interface, ensuring that pivot configurations are both semantically valid and human-readable before being processed.

Overview and Purpose

Pivot operations in Perf allow users to transform multi-dimensional trace data into aggregated summaries or reorganized table views. Because a pivot request involves several dependent parameters—such as grouping keys, summary operations, and aggregation methods—it is prone to configuration errors that could lead to empty results or server-side failures.

This module centralizes the logic for:

  1. Humanizing Metadata: Mapping technical operation identifiers (e.g., geo, avg) to user-friendly labels (e.g., “Geometric Mean”, “Mean”) for consistent display across the UI.
  2. Request Validation: Enforcing structural constraints on pivot.Request objects to ensure they contain the minimum necessary information to be actionable.

Key Logic and Design Decisions

Validation Philosophy

The module distinguishes between a “valid pivot request” and a “valid pivot table.” This distinction is necessary because the Perf system supports different ways of visualizing pivoted data:

  • Structural Validation: A baseline pivot request requires at least one group_by field. Without grouping, there is no dimension along which to pivot the data.
  • Contextual Validation (Tables): While a request might be structurally sound for certain visualizations, generating a tabular summary requires at least one summary operation. The validateAsPivotTable function enforces this stricter requirement, ensuring that the UI does not attempt to render an empty summary table when the user has only defined groupings.

Data Mapping

The operationDescriptions map serves as the single source of truth for how pivot operations are presented to the user. By centralizing these strings in pivotutil, the system ensures that different UI components (such as dropdowns, table headers, or chart legends) remain consistent in their terminology.

Key Components

index.ts

This is the core of the module. It exports the validation functions and the description mappings used by UI components to interpret pivot.Request objects.

  • operationDescriptions: A lookup table mapping pivot.Operation types to their display names. It covers standard statistical aggregations like sum, mean (arithmetic and geometric), standard deviation, count, and extrema.
  • validatePivotRequest: Checks for the existence of the request and ensures the group_by array is populated.
  • validateAsPivotTable: Extends the basic validation by verifying that the summary field is also populated, which is a prerequisite for generating a statistical summary table.
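A minimal sketch of the two validators, assuming they return an empty string on success and an error message otherwise (the function names come from the text; the request shape and messages are illustrative):

```typescript
// Simplified pivot.Request stand-in for illustration.
interface PivotRequestLike {
  group_by: string[] | null;
  summary: string[] | null;
}

// Baseline: a pivot needs at least one group_by dimension.
function validatePivotRequest(req: PivotRequestLike | null): string {
  if (!req) return 'Pivot request is missing.';
  if (!req.group_by || req.group_by.length === 0) {
    return 'Pivot requests must have at least one group_by value.';
  }
  return '';
}

// Stricter: a pivot *table* additionally needs a summary operation.
function validateAsPivotTable(req: PivotRequestLike | null): string {
  const msg = validatePivotRequest(req);
  if (msg !== '') return msg;
  if (!req!.summary || req!.summary.length === 0) {
    return 'Pivot tables require at least one summary operation.';
  }
  return '';
}
```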

Workflows

The typical workflow for utilizing this module involves a UI component gathering user input and validating it before dispatching a network request or updating a visualization.

[ User Input ] ----> [ pivotutil: validatePivotRequest ]
                           |
          +----------------+----------------+
          |                                 |
 [ Returns Error Msg ]            [ Logic Proceeds ]
          |                                 |
[ UI displays Alert ]             [ pivotutil: validateAsPivotTable ]
                                            |
                         +------------------+------------------+
                         |                                     |
                [ Returns Error Msg ]                 [ Success ]
                         |                                     |
              [ UI hides Table View ]              [ UI renders Pivot Table ]

Module: /modules/plot-google-chart-sk

plot-google-chart-sk

Overview

The plot-google-chart-sk module provides a high-performance, interactive charting component built on top of the Google Visualization API (google-chart). It is specifically designed to handle time-series data, anomalies, and user-defined issues within the Perf framework.

Beyond simple data visualization, this module implements specialized interaction modes—such as panning, delta-Y calculations, and dual-axis zooming—to support deep analysis of performance regressions and improvements.

Design Decisions and Implementation Choices

Performance via Overlays

Rendering thousands of data points alongside complex icons (anomalies, regressions, bug icons) directly within the Google Chart SVG can lead to significant performance degradation during interactions like panning or resizing.

  • Implementation: The module uses a layered approach. The base google-chart renders the lines, while anomalies and user issues are rendered as absolute-positioned HTML overlays in separate div containers (.anomaly, .userissue).
  • Benefit: When the user pans the chart, the module recalculates the coordinates using the chart's layout interface and moves the HTML elements without requiring the underlying charting engine to re-render the entire data series.

State Management and Context

The module leverages @lit/context to synchronize state across a complex hierarchy of components without prop-drilling.

  • Data Synchronization: It consumes dataTableContext and dataframeAnomalyContext to reactively update the view when the underlying performance data changes.
  • Color Consistency: It provides a traceColorMapContext. This ensures that if a trace is assigned “Blue” in the chart, the same color is used in the side-panel-sk (legend) and any associated tooltips.

Interaction Modes

The module distinguishes between three primary mouse navigation modes to avoid UI clutter:

  1. Panning (Left-Click Drag): Moves the horizontal view window.
  2. Delta-Y (Shift + Left-Click Drag): Activates v-resizable-box-sk to measure the vertical distance (raw and percentage) between two points on the Y-axis.
  3. Drag-to-Zoom (Ctrl + Left-Click Drag): Activates drag-to-zoom-box-sk to select a specific sub-region for zooming. This supports both horizontal and vertical zooming depending on the global isHorizontalZoom state.
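A hypothetical sketch of how the modifier keys could map to these modes; the MouseLike shape and NavMode names are invented for illustration, not taken from the component:

```typescript
type NavMode = 'pan' | 'deltaY' | 'dragToZoom';

// Minimal stand-in for the relevant MouseEvent fields.
interface MouseLike {
  button: number; // 0 = left button
  shiftKey: boolean;
  ctrlKey: boolean;
}

// Picks the navigation mode for a drag, mirroring the list above.
function navModeFor(e: MouseLike): NavMode | null {
  if (e.button !== 0) return null; // only left-click drags are handled
  if (e.ctrlKey) return 'dragToZoom';
  if (e.shiftKey) return 'deltaY';
  return 'pan';
}
```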

Key Components and Responsibilities

plot-google-chart-sk.ts

The primary element that orchestrates the charting logic.

  • Responsibility: Manages the lifecycle of the google.visualization.DataTable, handles the “Domain” toggle (switching between Commit Position and Date), and coordinates the positioning of overlays.
  • Data View Logic: It uses a google.visualization.DataView to filter which traces are currently visible based on user selections in the side panel.

side-panel-sk.ts

A collapsible legend and control interface.

  • Responsibility: Displays a list of active traces. It groups trace labels by a “display name” (derived from trace parameters like test, arch, etc.).
  • Interaction: Allows users to toggle trace visibility. It prevents the user from deselecting all traces, ensuring the chart never becomes completely empty.

v-resizable-box-sk.ts

A specialized selection box for vertical measurements.

  • Responsibility: Calculates the difference between two Y-axis values. It intelligently positions the delta text (e.g., “+15% / 1.2s”) so it doesn't clip outside the chart boundaries.

drag-to-zoom-box-sk.ts

A transparent selection rectangle.

  • Responsibility: Provides visual feedback during a Ctrl-drag operation. It calculates the new coordinate bounds which are then passed back to the main chart to update its viewWindow.

Key Workflows

Data Rendering Flow

Data Update -> willUpdate() -> updateDataView()
                                     |
                                     V
                          Create google.visualization.DataView
                                     |
                                     V
                          Assign Colors to Traces
                                     |
                                     V
                          updateOptions() (Scale/Axis)
                                     |
                                     V
                          plot.redraw() -> onChartReady()
                                                |
                                                V
                                      drawAnomaly() & drawUserIssues()

Coordinate Mapping

Because the overlays are standard HTML elements, the module frequently translates “Data Values” (Commits/Dates/Values) into “Pixel Coordinates” using the Google Chart Layout Interface:

[Data Value: Commit 1234]
          |
          V
[Chart Layout Interface] -> getXLocation(1234)
                                     |
                                     V
[CSS Absolute Position] -> element.style.left = `${x}px`
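The translation step can be sketched as below. The getXLocation/getYLocation methods mirror Google Charts' ChartLayoutInterface, but the helper function and its types are illustrative only:

```typescript
// Stand-in for the Google Charts layout interface methods used here.
interface LayoutLike {
  getXLocation(dataValue: number): number;
  getYLocation(dataValue: number): number;
}

// Stand-in for an HTMLElement's style object.
interface StyleLike {
  left: string;
  top: string;
}

// Moves an absolutely positioned overlay so it sits on the given data point.
function positionOverlay(
  layout: LayoutLike,
  style: StyleLike,
  commit: number,
  value: number
): void {
  style.left = `${layout.getXLocation(commit)}px`;
  style.top = `${layout.getYLocation(value)}px`;
}
```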

Events

  • selection-changed: Dispatched when the user finishes panning or zooming, providing the new range and domain.
  • plot-data-select: Dispatched when a specific data point is clicked, returning the tableRow and tableCol.
  • side-panel-toggle: Dispatched when the legend panel is opened or closed.

Module: /modules/plot-summary-sk

Plot Summary (plot-summary-sk)

The plot-summary-sk module provides a high-level “bird's-eye view” of performance data. It is designed to act as a navigation and overview tool for large datasets, allowing users to see trends across a wide time or commit range and select specific sub-sections to investigate in more detail.

High-Level Overview

This component renders a simplified area chart of performance traces using Google Charts. Its primary purpose is to facilitate range selection. Unlike a primary data plot, it focuses on performance and visual density rather than granular data point interaction.

It solves the problem of “information overload” when dealing with thousands of data points by implementing automatic downsampling and providing a specialized UI for horizontal range manipulation.

Design Decisions

Min-Max Downsampling

When the input DataTable contains a large number of rows (exceeding 1000), the component automatically applies a Min-Max bucketing algorithm.

  • Why: Standard downsampling (like averaging) smooths out spikes, which are often the most important features in performance monitoring.
  • How: The data is divided into buckets. For each bucket, the component synthesizes two rows: one representing the minimum value and one representing the maximum value within that interval. This ensures that the visual “envelope” of the data—including all peaks and valleys—remains visible even at low resolutions.
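A minimal sketch of the bucketing, assuming plain [domain, value] rows rather than the component's actual DataTable, and an invented row-budget parameter:

```typescript
type Row = [number, number]; // [domain value, measurement]

function minMaxDownsample(rows: Row[], maxRows: number): Row[] {
  if (rows.length <= maxRows) return rows;
  // Each bucket contributes up to two rows (its min and its max).
  const bucketCount = Math.max(1, Math.floor(maxRows / 2));
  const bucketSize = Math.ceil(rows.length / bucketCount);
  const out: Row[] = [];
  for (let i = 0; i < rows.length; i += bucketSize) {
    const bucket = rows.slice(i, i + bucketSize);
    let min = bucket[0];
    let max = bucket[0];
    for (const r of bucket) {
      if (r[1] < min[1]) min = r;
      if (r[1] > max[1]) max = r;
    }
    if (min === max) {
      out.push(min); // flat bucket: one row is enough
    } else if (min[0] <= max[0]) {
      out.push(min, max); // keep domain order so the line stays monotone
    } else {
      out.push(max, min);
    }
  }
  return out;
}
```

Unlike averaging, this guarantees that any spike in a bucket survives as that bucket's max row.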

Decoupled Interaction Layer

The selection logic is separated into a sub-component called h-resizable-box-sk.

  • Why: Keeping the logic for mouse interactions (dragging, resizing, drawing) separate from the charting logic makes the codebase more maintainable and allows the resizable box to be reused in other contexts.
  • How: h-resizable-box-sk overlays the chart and translates raw pixel coordinates from mouse events into relative percentage-based ranges. plot-summary-sk then converts these relative positions into domain-specific values (timestamps or commit offsets) using the Google Chart ChartLayoutInterface.

Deterministic Trace Coloring

To ensure visual consistency between the summary plot and the main detail plots (e.g., plot-google-chart-sk), this module uses a shared utility (getTraceColor) to assign colors based on the trace name. This allows a user to identify the same trace across different UI components by color alone.

Key Components and Responsibilities

plot-summary-sk.ts

The main controller for the summary view.

  • Data Management: Consumes DataTable objects via Lit context (dataTableContext) and converts them into a DataView optimized for the summary (filtering columns based on selectedTrace).
  • Coordinate Mapping: Acts as a bridge between the visual chart and the data. It contains the logic to convert between pixel coordinates (used by the resizable box) and data values (commits/dates).
  • Navigation Controls: Optionally renders “load more” buttons (left/right) that interact with a DataFrameRepository to extend the available data range.

h-resizable-box-sk.ts

A specialized UI primitive for horizontal range selection.

  • State Tracking: Manages four distinct user actions: draw (creating a new selection), drag (moving an existing selection), left (resizing the start), and right (resizing the end).
  • Constraint Enforcement: Uses a clamp utility to ensure the selection box never leaves the boundaries of the parent container and maintains a minWidth to prevent the selection from becoming unclickable.
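A sketch of the clamping logic under these constraints; the function names and the exact resize semantics are assumptions:

```typescript
function clamp(v: number, lo: number, hi: number): number {
  return Math.min(Math.max(v, lo), hi);
}

// Constrains a pixel-space selection [begin, end] to the parent container,
// enforcing a minimum width so the box stays grabbable.
function constrainSelection(
  begin: number,
  end: number,
  parentWidth: number,
  minWidth: number
): [number, number] {
  let b = clamp(Math.min(begin, end), 0, parentWidth);
  let e = clamp(Math.max(begin, end), 0, parentWidth);
  if (e - b < minWidth) {
    e = Math.min(b + minWidth, parentWidth);
    b = Math.max(e - minWidth, 0); // slide left if we hit the right edge
  }
  return [b, e];
}
```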

Key Workflows

Selection Process

The following diagram illustrates how a user interaction is transformed into a system-wide range update:

User Action (Mouse) -> [h-resizable-box-sk]
                             |
                      (Pixel Range)
                             |
                             v
                     [plot-summary-sk]
                             |
              (Convert Pixels via ChartLayout)
                             |
                             v
                  [summary_selected Event]
                             |
              (Contains: {begin, end, domain})

Data Update and Redraw

When the underlying data changes, the component goes through the following lifecycle:

  1. Property Change: data, selectedTrace, or domain is updated.
  2. Downsampling: updateDataView checks row count; if > 1000, buckets are created.
  3. Column Filtering: The view is restricted to the domain column (0 or 1) and the data columns for the selected traces.
  4. Async Redraw: Google Chart renders the new SVG.
  5. Selection Realignment: Once the chart emits google-chart-ready, the h-resizable-box-sk is repositioned to match the cachedSelectedValueRange, as the axis scaling might have changed.

Events

  • summary_selected: Dispatched whenever the user finishes a selection or adjustment. The detail contains a range object with begin and end values in the current domain (UNIX timestamp for dates, or integer offset for commits).
  • range-changing-in-multi: Dispatched when “load” buttons are clicked in a multi-chart environment, allowing a parent controller to synchronize data fetching across multiple plots.

Module: /modules/point-links-sk

point-links-sk

This module provides a specialized UI component, PointLinksSk, designed to display context-sensitive links associated with specific data points in Perf. These links are typically sourced from ingestion files and represent metadata such as commit hashes, build logs, or trace artifacts.

Overview

The primary purpose of point-links-sk is to bridge the gap between a raw data point and the external systems that provide more context about it. It doesn't just list static URLs; it dynamically calculates ranges and cleans up metadata keys to present a user-friendly interface for navigating between performance results and source control or build systems.

Key Responsibilities and Logic

1. Commit Range Generation

A significant feature of this module is its ability to compare the selected commit with the previous commit to generate “diff” or “log” links.

  • Logic: If a key is identified as a commit range key (e.g., “V8 Git Hash”), the component fetches the hash for both the current and the preceding commit.
  • Single Commit: If the hashes are identical, the component displays a direct link to that specific commit.
  • Range: If the hashes differ, it constructs a Gitiles-compatible URL (using the +log/start..end syntax) to show all commits in that range.
  • Validation: It performs an asynchronous check via a proxy to googlesource.com (using isRange) to determine if a range actually contains multiple commits, ensuring the UI text accurately reflects whether the user is looking at a single change or a list.
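The single-commit versus range decision can be sketched as follows. The +log/start..end syntax follows the Gitiles convention described above; the repository base URL parameter and the returned {url, text} shape are assumptions:

```typescript
// Builds either a direct commit link or a Gitiles log-range link,
// depending on whether the two hashes match.
function commitLink(
  repoBase: string,
  prevHash: string,
  currHash: string
): { url: string; text: string } {
  if (prevHash === currHash) {
    // Identical hashes: link straight to the single commit.
    return {
      url: `${repoBase}/+/${currHash}`,
      text: currHash.slice(0, 8),
    };
  }
  // Differing hashes: link to the log of everything in between.
  return {
    url: `${repoBase}/+log/${prevHash}..${currHash}`,
    text: `${prevHash.slice(0, 8)} - ${currHash.slice(0, 8)}`,
  };
}
```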

2. Intelligent Data Retrieval and Fallbacks

The component fetches point-specific metadata through internal API endpoints. It implements a fallback strategy to ensure reliability:

  1. It first attempts to fetch from /_/details/?results=false.
  2. If no links are found, it falls back to the /_/links/ endpoint.
  3. Performance Optimization: It accepts an array of CommitLinks as an argument to its load method. This allows the caller to provide cached data, preventing redundant network requests when switching back and forth between points.

3. Data Normalization and Filtering

The module handles several platform-specific quirks to maintain a consistent UI:

  • Key Cleaning: It strips redundant suffixes (like " Git") and renames technical keys to friendlier terms (e.g., “Trace Iteration” becomes “Trace”, “Profiling Traces and Test Artifacts” becomes “Artifacts”).
  • Fuchsia Support: It handles specific formatting used in Fuchsia ingestion where build logs are stored as Markdown-style strings (e.g., [Build Log](url)), extracting the URL for proper anchor tag generation.
  • Repository Rewriting: It contains workarounds to fix incomplete URLs for certain repositories (like V8 or WebRTC) where the ingestion might only provide a hash rather than a full URL.
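For the Fuchsia case, extracting the URL from a Markdown-style string can be sketched with a simple regular expression (the exact pattern used by the module is an assumption):

```typescript
// Matches a full-string Markdown link such as "[Build Log](https://...)".
const MD_LINK = /^\[([^\]]*)\]\(([^)]*)\)$/;

function extractMarkdownLink(value: string): { text: string; url: string } | null {
  const m = MD_LINK.exec(value.trim());
  if (!m) return null;
  return { text: m[1], url: m[2] };
}
```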

Workflow: Loading Point Links

The following diagram illustrates the process when a user selects a point and load() is called:

User selects point
      |
      v
[load(commit, prev_commit, trace_id, ...)]
      |
      +--> Check cache? --(Found)--> Render cached links
      |      |
      |   (Missing)
      |      v
      +--> Fetch current point links (/_/details/ or /_/links/)
      |      |
      +--> Fetch previous point links (if range keys requested)
      |      |
      +--> Compare hashes
      |      |-- Same: Create direct commit link
      |      '-- Different: Create +log/ range link
      |      |
      +--> Filter for "Useful Links" (Build logs, etc.)
      |      |
      +--> Normalize keys and extract URLs (e.g., Fuchsia regex)
      |      |
      '--> [ _render() ] Update Lit-html template

Component API and State

Primary Method: load(...)

This is the main entry point for the component. It triggers the data fetching and comparison logic. It returns the updated list of CommitLinks (including any newly fetched data) so the parent component can maintain an up-to-date cache.

Internal State

  • displayUrls: A map of human-readable keys to their calculated destination URLs.
  • displayTexts: A map of keys to the text that should appear inside the link (e.g., the short hash range f052b8c4 - 47f420e8).
  • commitPosition: Tracks the current commit number being inspected to ensure the UI stays synchronized with the user's selection.

Implementation Details

  • Aborting Requests: The component uses an AbortController to cancel pending network requests if a user rapidly clicks different points, preventing race conditions where old data might overwrite new data.
  • Template Rendering: It uses Lit's until directive to show “Loading...” placeholders for individual link rows while asynchronous range validation (isRange) is performed.
  • Styling: Styles are minimal, focusing on a tabular layout for the keys and values, with specific overrides for Material Design icon buttons used in supplementary actions (like copying links).

Module: /modules/progress

The progress module provides a standardized mechanism for triggering, monitoring, and retrieving results from long-running server-side tasks. It abstracts the complexity of asynchronous polling into a lifecycle-aware utility, allowing the frontend to handle heavy operations (like database queries or complex data processing) without blocking the main UI thread or timing out on a single HTTP request.

Design Philosophy: Polling and Lifecycle Management

The module is designed around a state-machine approach where the server dictates the flow of the operation. Rather than the client guessing when a task is finished, the server provides a SerializedProgress object containing the current status and a URL for the next update.

The core function, startRequest, manages the following transition logic:

  1. Initiation: Sends a POST request to a starting URL with a JSON body.
  2. Observation: Evaluates the status field in the response.
  3. Recursion/Polling: If the status is Running, it schedules a subsequent GET request to the URL provided in the previous response after a configurable interval.
  4. Completion: If the status is anything other than Running (e.g., Finished), it resolves the promise with the final data.

This design decouples the UI from the specific endpoint of the task; as long as the server follows the progress.SerializedProgress schema, the client can follow the task through any number of intermediate steps or URL changes.
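The transition logic can be sketched as a loop. The SerializedProgress shape here is simplified, and the fetch function is injected so the sketch stays self-contained; the real startRequest uses the browser's fetch and the RequestOptions lifecycle callbacks:

```typescript
// Simplified stand-in for progress.SerializedProgress.
interface SerializedProgress {
  status: 'Running' | 'Finished' | 'Error';
  url: string; // where to poll next while Running
  results?: unknown; // final payload once the task leaves Running
}

type Fetcher = (
  url: string,
  init?: { method?: string; body?: string }
) => Promise<SerializedProgress>;

async function startRequest(
  startUrl: string,
  body: unknown,
  pollingIntervalMs: number,
  fetchImpl: Fetcher
): Promise<SerializedProgress> {
  // Initiation: POST to the starting URL with a JSON body.
  let prog = await fetchImpl(startUrl, { method: 'POST', body: JSON.stringify(body) });
  // Polling: keep GETting the server-supplied URL while the task is Running.
  while (prog.status === 'Running') {
    await new Promise((r) => setTimeout(r, pollingIntervalMs));
    prog = await fetchImpl(prog.url);
  }
  // Completion: any non-Running status resolves the promise.
  return prog;
}
```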

Key Components

Request Management (progress.ts)

The primary entry point is startRequest. It is built to be flexible through a RequestOptions object, which provides hooks into the various stages of the request lifecycle:

  • onStart: Useful for updating UI state (e.g., showing a spinner) before the first network call is made.
  • onProgressUpdate: Triggered every time the server returns a response while the task is still Running. This allows for real-time progress bars or status message updates.
  • onSuccess: Triggered specifically when the task reaches a terminal successful state.
  • onSettled: Acts like a finally block, executing when the process ends regardless of success or failure, making it ideal for cleanup tasks like hiding loading indicators.

Message Parsing Utilities

Since server responses often include a list of key-value pairs (progress.Message[]) to describe internal state, the module provides utilities to normalize this data for UI display:

  • messagesToErrorString: Prioritizes extracting a message with the key Error. If absent, it concatenates all available messages into a single string. This ensures that even if the server doesn't provide a specific error field, the user receives some context about what went wrong.
  • messageByName: A safe lookup utility to extract a specific value from the message array by its key, providing a fallback to prevent UI breakage.
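Both utilities can be sketched directly from their descriptions, assuming progress.Message is a {key, value} pair:

```typescript
// Assumed shape of progress.Message.
interface Message {
  key: string;
  value: string;
}

// Prefer an explicit Error message; otherwise join everything available.
function messagesToErrorString(messages: Message[]): string {
  const err = messages.find((m) => m.key === 'Error');
  if (err) return err.value;
  return messages.map((m) => `${m.key}: ${m.value}`).join(', ');
}

// Safe lookup with a fallback so the UI never renders 'undefined'.
function messageByName(messages: Message[], key: string, fallback = ''): string {
  const m = messages.find((msg) => msg.key === key);
  return m ? m.value : fallback;
}
```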

Workflow Process

The following diagram illustrates the lifecycle of a long-running request managed by this module:

[ Client ]              [ Server ]
    |                       |
    |-- POST (Start URL) -->|
    |                       |-- [ Task Initiated ]
    |<-- 200 (Running) -----|
    |    (JSON: status, url)|
    |                       |
    | [ Wait pollingInterval ]
    |                       |
    |-- GET (Poll URL) ---->|
    |                       |-- [ Task Processing... ]
    |<-- 200 (Running) -----|
    |                       |
    | [ Wait pollingInterval ]
    |                       |
    |-- GET (Poll URL) ---->|
    |                       |-- [ Task Finished ]
    |<-- 200 (Finished) ----|
    |                       |
 [ Resolve Promise ]

Error Handling

The module treats non-ok HTTP statuses (like 4xx or 5xx) as terminal failures, rejecting the promise immediately and triggering the onSettled callback. It does not automatically retry on network failure; it assumes that if the polling chain is broken, the caller should decide whether to restart the entire process.

Module: /modules/query-chooser-sk

query-chooser-sk

The query-chooser-sk module provides a compact, interactive UI component for building and displaying search queries based on a set of parameters (a ParamSet). It acts as a high-level wrapper around the more complex query-sk component, offering a “summary-first” workflow that keeps the UI clean while allowing for detailed query editing.

Overview

In many data-heavy applications, users need to filter large datasets using multiple keys and values. Displaying a full query builder at all times occupies significant screen real estate. query-chooser-sk solves this by:

  1. Displaying a read-only summary of the current selection using paramset-sk.
  2. Providing an “Edit” button that reveals an embedded query-sk interface within a toggleable dialog.
  3. Integrating live feedback via query-count-sk to show how many items match the current selection as the user modifies it.

Key Components and Responsibilities

query-chooser-sk.ts

This is the primary entry point and defines the custom element. Its responsibilities include:

  • State Management: It maintains the current_query (a URL-formatted query string) and the paramset (the available options).
  • Dialog Orchestration: It manages the visibility of the internal #dialog element, which contains the editing tools.
  • Event Propagation: It listens for query-change events from the internal query-sk component, updates its own state, re-renders the summary, and propagates information to the parent application.

Integrated Sub-elements

The functionality of query-chooser-sk is composed of several specialized elements:

  • paramset-sk: Used in the main view to display a concise, non-interactive summary of the active query filters.
  • query-sk: The core interactive builder revealed when editing. It handles the logic of selecting keys and values from the ParamSet.
  • query-count-sk: Situated inside the edit dialog, it performs asynchronous lookups (via the count_url attribute) to provide real-time counts of data points matching the user's current selection.

Workflow

The component operates in a cycle of viewing and refining:

+---------------------------------------+
|  [Edit]  Key1: Val1, Val2             | <--- (Summary View: paramset-sk)
+-----+---------------------------------+
      |
      | (Click Edit)
      v
+-----+---------------------------------+
| [Close]                               |
| +-----------------------------------+ |
| |          (query-sk)               | | <--- (Edit View: User selects filters)
| | Key1: [x]Val1 [x]Val2 [ ]Val3     | |
| +-----------------------------------+ |
| Matches: 1,245                        | <--- (Live Count: query-count-sk)
+---------------------------------------+
  1. Initial State: The component renders a button and a paramset-sk.
  2. Interaction: When the user clicks “Edit”, the _editClick handler adds a CSS class to display the hidden dialog.
  3. Refinement: As the user toggles checkboxes in query-sk, the _queryChange handler updates the current_query property. This update is reactive:
    • The query-count-sk sees the new query and fetches a new count.
    • The paramset-sk summary updates to reflect the latest selections.
  4. Completion: The user clicks “Close”, hiding the editing interface and returning to the summary view.

Design Decisions

  • URL-Formatted Strings: The component uses URL-formatted query strings (key1=val1&key1=val2) as the primary data exchange format for current_query. This makes it trivial to sync the component state with the browser's address bar or use it directly in API requests.
  • Encapsulated Dialog: Instead of using a global modal, the dialog is contained within the element's shadow DOM (or local DOM). This ensures that multiple query-chooser-sk instances can exist on one page without managing z-index or global state conflicts.
  • Property Upgrading: The connectedCallback utilizes _upgradeProperty for attributes like paramset and key_order. This ensures that if the properties are set on the DOM element before the custom element definition is loaded, the values are correctly captured and processed.
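Round-tripping between the URL-formatted string and a ParamSet-like object can be sketched with the standard URLSearchParams API; the ParamSet type here is a simplification of the real one:

```typescript
type ParamSet = { [key: string]: string[] };

// Parses "key1=val1&key1=val2" into { key1: ['val1', 'val2'] }.
function queryToParamSet(query: string): ParamSet {
  const out: ParamSet = {};
  new URLSearchParams(query).forEach((value, key) => {
    (out[key] = out[key] || []).push(value);
  });
  return out;
}

// Serializes a ParamSet back into a URL-formatted query string.
function paramSetToQuery(ps: ParamSet): string {
  const sp = new URLSearchParams();
  for (const key of Object.keys(ps)) {
    for (const value of ps[key]) sp.append(key, value);
  }
  return sp.toString();
}
```

Because the format is already URL-encoded, the same string can be dropped into the address bar or an API request unchanged.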

Attributes and Properties

  • current_query (attribute and property): The current selection formatted as a URL query string.
  • count_url (attribute and property): The endpoint URL used by query-count-sk to fetch match counts.
  • paramset (property only): The object containing all available keys and values.
  • key_order (attribute and property): An array of strings determining the order in which keys appear in the query builder.

Module: /modules/query-count-sk

query-count-sk

The query-count-sk module provides a specialized UI component designed to report the number of data points or traces that match a specific query string within the Perf system. It serves as a live feedback mechanism, allowing users to understand the scope of their selection (e.g., in a query builder) before executing a full search or visualization.

Design and Implementation

The component is built using Lit and leverages the @lit/task package to manage asynchronous data fetching. This architecture ensures that the component reacts efficiently to property changes while maintaining a responsive UI.

Reactive Fetching Logic

The core of the component is a Task that monitors two primary inputs:

  1. url: The endpoint to which the count request is sent.
  2. current_query: The query string to be evaluated.

Whenever either of these properties changes, the task automatically triggers a POST request. The component is designed to handle rapid changes; if a new query is provided while a previous fetch is still in flight, the previous request is aborted via an AbortSignal to prevent race conditions and unnecessary network traffic.
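The abort-previous-request pattern can be sketched as follows; the real component wires this through @lit/task, so the class and injected fetch function here are illustrative only:

```typescript
type AbortableFetch = (url: string, signal: AbortSignal) => Promise<unknown>;

class LatestOnlyFetcher {
  private controller: AbortController | null = null;

  // Starts a new request, aborting any previous request still in flight.
  fetch(url: string, impl: AbortableFetch): Promise<unknown> {
    if (this.controller) this.controller.abort();
    this.controller = new AbortController();
    return impl(url, this.controller.signal);
  }
}
```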

Data Flow and Side Effects

Unlike a simple display widget, query-count-sk performs two roles upon receiving a successful response from the server:

  • UI Update: It extracts the count and displays it.
  • State Synchronization: It dispatches a paramset-changed custom event. The response from the server includes a paramset (the set of all possible keys and values matching the current query), which the component bubbles up to notify parent components that the available filter options may have changed based on the current selection.

Workflow Diagram

The following diagram illustrates the lifecycle of a query count request:

  Property Change          Fetch Task              Server             DOM / Parent
  (current_query)
        |                      |                     |                     |
        |---- triggers ------->|                     |                     |
        |                      |---- POST (query) -->|                     |
        |                      |                     |                     |
        |                      |<--- JSON Response --|                     |
        |                      |      (count, params)|                     |
        |                      |                     |                     |
        |<--- updates count ---|                     |--- Dispatch Event ->|
        |      in render()     |                     |  (paramset-changed) |

Key Components and Files

query-count-sk.ts

Contains the element definition. It uses a spinner-sk to provide visual feedback during the loading state. To maintain consistency with legacy behaviors, the displayed count is reset to 0 whenever a new fetch task is initiated or pending.

query-count-sk_po.ts

Provides a Page Object (PO) for testing. This file is crucial for integration and end-to-end tests, offering an abstraction layer to query the internal state of the component (like the numeric value of the count or the visibility of the spinner) without coupling tests to the internal DOM structure.

API and Events

  • Attributes:
    • current_query: A string representing the query to count.
    • url: The destination for the CountHandlerRequest.
  • Events:
    • paramset-changed: Dispatched when the server returns a new ReadOnlyParamSet. This allows other UI components (like dropdowns or filters) to update their available options dynamically.

Implementation Details

The component sends a CountHandlerRequest, which includes a time range (defaulting to the last 24 hours). This design choice assumes that the “count” of a query is most relevant within the context of recent data, though the time window is currently hardcoded within the task logic. Error handling is integrated with the errorMessage utility, which surfaces toast notifications to the user if the backend fails to process the query.

Module: /modules/regressions-page-sk

regressions-page-sk

The regressions-page-sk module provides a specialized dashboard for performance regression management. It allows “Sheriffs” (users responsible for monitoring performance) to view, filter, and triage anomalies associated with specific subscription configurations.

High-Level Overview

This module acts as a centralized interface for reviewing performance anomalies detected by the system. It connects to backend endpoints to fetch a list of active subscriptions (Sheriff configurations) and then retrieves the specific anomalies (regressions) associated with a selected subscription.

The page is designed to handle large datasets through pagination (via cursors) and provides filtering capabilities to distinguish between new regressions, triaged issues, and performance improvements.

Key Components and Responsibilities

regressions-page-sk.ts

This is the main entry point and logic controller for the page. It is a LitElement that manages the following:

  • State Management: It maintains the UI state, including the currently selected subscription, whether to show triaged or improvement data, and pagination offsets. This state is synchronized with the URL query parameters to allow deep-linking and persistence across page refreshes.
  • Data Orchestration: It coordinates fetching data from two primary sets of endpoints:
    • Legacy/ChromePerf: Uses standard anomaly list endpoints.
    • SQL/Skia: Uses modern SQL-backed endpoints when fetch_anomalies_from_sql is enabled in the global perf configuration.
  • UI Integration: It acts as a container for two critical sub-tables:
    • <subscription-table-sk>: Displays metadata about the selected sheriff configuration (labels, components, CC list).
    • <anomalies-table-sk>: Displays the actual list of detected regressions.
  • Dynamic Page Title: Updates the browser tab title to reflect the count of anomalies (e.g., “Regressions (12 untriaged)”), providing immediate feedback to the user.

State and Persistence

The module uses a combination of URL parameters and localStorage to ensure a consistent user experience.

  • URL Parameters: Store filters like showTriaged and selectedSubscription so users can share links to specific views.
  • LocalStorage: Remembers the perf-last-selected-sheriff so that when a user returns to the page, their last worked-on subscription is automatically reselected.

User Workflows

The typical workflow involves selecting a sheriff configuration and refining the view to focus on actionable items.

[ Select Sheriff ] -> [ Fetch Subscription Metadata ] -> [ Fetch Anomalies ]
      |                         |                             |
      v                         v                             v
[ Update URL/LS ]     [ Render Subscription Table ]    [ Render Anomalies Table ]
                                                              |
                                                              |---[ Show More ] ---+
                                                              |                    |
                                                              +<---[ Append Data ]-+

Design Decisions

Pagination and Cursors

The component supports two types of pagination depending on the backend:

  1. Cursor-based: Used primarily with the legacy/ChromePerf backend. The component looks for an anomaly_cursor in the JSON response; if present, it displays a “Show More” button and passes the cursor back in the next request.
  2. Offset-based: Used with the SQL-backed backend. It calculates a pagination_offset based on the current length of the cpAnomalies array.
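Choosing the next-page parameters for the two backends can be sketched as below; the state and request field names mirror the description above, but the exact shapes are assumptions:

```typescript
// Assumed page state tracked by the component.
interface PageState {
  anomalyCursor: string | null; // from legacy/ChromePerf responses
  loadedCount: number; // current length of the cpAnomalies array
  useSql: boolean; // fetch_anomalies_from_sql is enabled
}

interface NextPageRequest {
  anomaly_cursor?: string;
  pagination_offset?: number;
}

function nextPageRequest(state: PageState): NextPageRequest | null {
  if (state.useSql) {
    // Offset-based: ask for rows after everything already loaded.
    return { pagination_offset: state.loadedCount };
  }
  if (state.anomalyCursor) {
    // Cursor-based: echo the server's cursor back.
    return { anomaly_cursor: state.anomalyCursor };
  }
  return null; // no cursor means no more pages to fetch
}
```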

Separation of Concerns

The page itself does not handle the rendering of individual anomaly rows or subscription details. Instead, it delegates these tasks to anomalies-table-sk and subscription-table-sk. This allows regressions-page-sk to focus strictly on the “Page” level logic: URL state, global spinners, and high-level filtering.

Loading Indicators

The module implements a dual-spinner strategy:

  • anomaliesLoadingSpinner: An “upper” spinner that activates during initial loads or filter changes, signaling a full data refresh.
  • showMoreLoadingSpinner: A localized spinner within the “Show More” section, indicating that the page is appending more data to the existing list rather than replacing it.

Testing Architecture

  • Unit/Logic Tests (regressions-page-sk_test.ts): Focuses on state transitions, URL parameter parsing, and ensuring the correct API calls (with correct query strings) are made when filters are toggled.
  • Visual/Integration Tests (regressions-page-sk_puppeteer_test.ts): Uses Page Objects (regressions-page-sk_po.ts) to simulate user interactions like selecting a sheriff from a dropdown and verifying that the resulting tables render correctly via screenshots and DOM inspection.

Module: /modules/report-page-sk

report-page-sk

The report-page-sk module provides a comprehensive reporting view for performance anomalies. It serves as a centralized dashboard where users can review a list of detected regressions (anomalies), visualize them through interactive graphs, and inspect the shared commit history associated with those regressions.

Overview

The primary purpose of this module is to consolidate triage workflows. Instead of looking at anomalies in isolation, report-page-sk groups related issues together, allowing a developer to see how multiple performance shifts might be tied to the same set of commits.

The page logic is driven by URL parameters (such as bugID, anomalyIDs, or sid), which determine which anomalies are fetched from the backend and which graphs are automatically generated upon page load.

Key Components and Responsibilities

Anomaly Management and Tracking

The module uses an internal AnomalyTracker class to maintain the state of all anomalies currently being viewed. This tracker manages the relationship between:

  • The raw Anomaly data.
  • The UI state (whether the anomaly is “checked” in the table).
  • The ExploreSimpleSk graph instance associated with that specific anomaly.
  • The specific Timerange relevant to the regression.

This separation ensures that the page can efficiently add or remove graphs from the DOM as the user toggles checkboxes in the list without losing the underlying data context.
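A simplified sketch of that bookkeeping, with assumed shapes (the real AnomalyTracker in report-page-sk tracks more detail per entry):

```typescript
interface Timerange {
  begin: number;
  end: number;
}

// One tracked entry: the raw anomaly, its checkbox state, the graph
// instance (if mounted), and the time range relevant to the regression.
interface TrackedAnomaly<Graph> {
  anomaly: { id: string };
  checked: boolean;
  graph: Graph | null;
  timerange: Timerange;
}

class AnomalyTracker<Graph> {
  private entries = new Map<string, TrackedAnomaly<Graph>>();

  add(anomaly: { id: string }, timerange: Timerange): void {
    this.entries.set(anomaly.id, { anomaly, checked: false, graph: null, timerange });
  }

  // Attach a graph when a row is checked; detach it on uncheck while
  // keeping the underlying anomaly data and time range.
  setGraph(id: string, graph: Graph | null): void {
    const e = this.entries.get(id);
    if (e) {
      e.graph = graph;
      e.checked = graph !== null;
    }
  }

  checked(): TrackedAnomaly<Graph>[] {
    return Array.from(this.entries.values()).filter((e) => e.checked);
  }
}
```

Because unchecking only nulls out the graph reference, re-checking a row does not require refetching the anomaly metadata.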

The Anomalies Table

The anomalies-table-sk component (referenced as anomaly-table) displays the metadata for each regression.

  • Initial Selection: On load, the page parses the URL to decide which anomalies should be “checked” and graphed immediately.
  • Event Handling: When a user checks or unchecks a row, the table dispatches an anomalies_checked event. report-page-sk listens for this to dynamically mount or unmount graph components.

Performance-Optimized Graphing

Graphs are rendered using multiple instances of explore-simple-sk. To prevent browser UI freezes when a large number of anomalies are reported (e.g., a massive regression affecting dozens of tests), the module implements chunked loading:

  1. The module identifies all selected anomalies.
  2. It loads them in parallel batches (defaulting to 5 at a time).
  3. It waits for the data-loaded event from the current batch before starting the next.

Each graph is configured to show a “buffer” of one week before and after the anomaly's time range to help users determine if a regression has already been mitigated or if it represents a recurring pattern.
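The chunked-loading steps above can be sketched as follows; loadGraph stands in for mounting an explore-simple-sk instance and resolving on its data-loaded event, and only the default chunk size of 5 comes from the text:

```typescript
const CHUNK_SIZE = 5;

// Load graphs in parallel batches; each batch must finish before the
// next one starts, keeping the browser responsive for large reports.
async function loadGraphsInChunks<T>(
  anomalies: T[],
  loadGraph: (a: T) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < anomalies.length; i += CHUNK_SIZE) {
    const batch = anomalies.slice(i, i + CHUNK_SIZE);
    await Promise.all(batch.map(loadGraph));
  }
}
```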

Cross-Graph Synchronization

Since a report often contains multiple graphs representing the same time period or the same commit range, report-page-sk synchronizes user interactions across all visible explore-simple-sk instances.

  • X-Axis Scaling: Toggling between “Commit” and “Date” on one graph updates all others.
  • Zooming/Panning: Adjusting the range on the summary bar of one graph extends the range on all others to keep them in temporal alignment.
  • Even Spacing: Toggling discrete vs. continuous x-axis spacing is synced across the entire report.

Common Commits and Roll Recognition

If the instance uses integer-based commit numbers, the module calculates the intersection of commit ranges for all displayed anomalies.

  • It displays a “Common Commits” section.
  • Roll Recognition: It specifically identifies commits that look like dependency rolls (e.g., “Roll repo from X to Y”).
  • Deep Linking: It provides specialized logic to resolve the underlying “internal” commit URL for rolls, allowing users to jump directly to the source change in a sub-repository rather than just seeing the roll commit itself.

Key Workflows

Loading and Initialization

URL Params -> Fetch /_/anomalies/group_report
                   |
                   v
          Load AnomalyTracker <-------+
                   |                  |
        +----------+----------+       |
        |                     |       |
 Populate Table       Find Commits    |
 (anomalies-table-sk) (lookupCids)    |
        |                     |       |
        +----------+----------+       |
                   |                  |
           Load Graphs in Chunks <----+
           (explore-simple-sk)

Graph Toggle Workflow

User Clicks Checkbox
      |
      v
anomalies-table-sk dispatches "anomalies_checked"
      |
      v
report-page-sk receives event
      |
      +----[ If Checked ]----> Create explore-simple-sk
      |                        Initialize with Anomaly Query
      |                        Append to #graph-container
      |
      +---[ If Unchecked ]---> Find graph in AnomalyTracker
                               Remove from DOM
                               Unset in Tracker

Module: /modules/revision-info-sk

revision-info-sk

The revision-info-sk module provides a specialized component for investigating performance anomalies associated with specific source control revisions. It serves as a bridge between a revision ID and the various performance tests (benchmarks, bots, and test cases) that may have been impacted around that point in time.

Overview

When a regression or improvement is detected in the Skia Perf system, it is often tied to a range of revisions. This module allows users to input a specific revision ID and retrieve a comprehensive list of all anomalies and performance data associated with it.

Beyond simple display, the module facilitates deep-dive analysis by allowing users to select multiple performance traces and navigate to a multi-graph view. This allows for side-by-side comparison of different tests that were affected by the same revision.

Key Components and Files

revision-info-sk.ts

This file contains the core logic for the custom element. It handles several distinct responsibilities:

  • State Management & Persistence: Uses stateReflector to sync the current revision ID with the URL query parameters. This ensures that a specific search state can be bookmarked or shared.
  • Data Fetching: Communicates with the Perf backend (/_/revision/) to fetch RevisionInfo objects.
  • Multi-Graph Coordination: Contains the logic to transform a set of selected revisions into a complex URL for the multi-graph explorer. This involves:
    • Calculating the total time range (earliest start to latest end).
    • Aggregating unique anomaly IDs to highlight them on the resulting graphs.
    • Interacting with the shortcut service to create a shortened URL for the combined queries.

revision-info-sk.scss

Defines the layout for the results table and the loading indicator. It ensures the spinner is positioned consistently relative to the text and that the data table is readable.

revision-info-sk-demo.ts / .html

Provides a mock environment for the component. It uses fetch-mock to simulate backend responses, allowing for UI development and testing without a running Perf server.

Design Decisions

State Reflection

The choice to use stateReflector is driven by the need for deep-linking. Performance analysis is iterative; users often need to jump between the revision info page and graph pages. By keeping the revisionId in the URL, the component supports standard browser navigation (back/forward) and collaborative debugging.

Multi-Graph Redirection

The implementation of getMultiGraphUrl handles the complexity of “joining” different performance queries. Since each anomaly might belong to a different master, bot, or test, the component generates a GraphConfig for each row. Because these combined queries can result in extremely long URLs that exceed browser limits, it uses updateShortcut to store the configuration on the server and use a short ID in the resulting link.
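Under stated assumptions (the shapes, the updateShortcut signature, and the URL layout are illustrative; the combined-range rule of "earliest start to latest end" comes from the text), the redirect logic reduces to:

```typescript
interface Timerange {
  begin: number;
  end: number;
}

// One GraphConfig per selected row.
interface GraphConfig {
  queries: string[];
}

async function buildMultiGraphUrl(
  configs: GraphConfig[],
  ranges: Timerange[],
  // Stores the configs server-side and returns a short ID.
  updateShortcut: (c: GraphConfig[]) => Promise<string>,
): Promise<string> {
  // Total range: earliest start to latest end across all selected rows.
  const begin = Math.min(...ranges.map((r) => r.begin));
  const end = Math.max(...ranges.map((r) => r.end));
  const id = await updateShortcut(configs);
  // Only the short ID travels in the URL, keeping it under browser limits.
  return `/m/?shortcut=${id}&begin=${begin}&end=${end}`;
}
```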

Workflow: From Revision ID to Graphs

The following diagram illustrates the flow of data within the module:

User Input (Revision ID)
          |
          v
[ stateReflector ] <------> URL (?rev=123)
          |
          v
[ getRevisionInfo() ] ----> API Request (/_/revision/)
          |
          |<---------------- Response (JSON)
          v
[ Render Table ] ---------> User selects checkboxes
          |
          v
[ viewMultiGraph() ]
          |
          +--> [ getGraphConfigs() ]
          +--> [ updateShortcut() ] ----> API Request (/_/shortcut/update)
          |                                      |
          |<-------------------------------------+
          v
[ Window Redirect ] ------> /m/?shortcut=ABC&begin=...&end=...

User Interactions

  1. Search: Users enter a revision ID (e.g., a git hash or a sequential number) and click “Get Revision Information”.
  2. Filter/Select: The resulting table shows anomalies with metadata (Bug ID, Master, Bot, etc.). Users can select individual rows or use the “Select All” toggle.
  3. Explore: Clicking “View Selected Graph(s)” opens a new tab/page showing the actual telemetry data for the selected rows, with the relevant anomalies highlighted on the graph.

Module: /modules/split-chart-menu-sk

The split-chart-menu-sk module provides a specialized UI component designed to facilitate the logical partitioning of performance data. It presents a list of available trace attributes (such as benchmark, story, or subtest) to the user, allowing them to select a criterion for splitting a unified data visualization into multiple, more granular charts.

Design Philosophy and Data Integration

The module is built on the principle of reactive data binding via context consumption. Instead of requiring manual property passing, the component integrates directly with the Perf application's data layer:

  • Context Awareness: It consumes dataframeContext and dataTableContext. This ensures that the menu options are always synchronized with the current dataset being viewed. If the underlying data changes, the list of attributes available for splitting updates automatically.
  • Decoupled Selection: The component does not perform the chart splitting itself. Instead, it acts as a trigger in a larger workflow. When a user selects an attribute, it bubbles a custom event, allowing parent containers or layout managers to handle the complex logic of re-rendering or duplicating chart instances.

Key Components and Responsibilities

split-chart-menu-sk.ts

This is the core implementation file. It manages the following responsibilities:

  • Attribute Extraction: It utilizes the getAttributes utility from the traceset module to parse the DataFrame and extract a unique list of keys present in the trace set.
  • State Management: It handles the toggle state (menuOpen) of the dropdown interface, ensuring a standard Material Design interaction pattern.
  • Event Dispatching: Upon selecting a menu item, it dispatches a split-chart-selection event. This event carries the SplitChartSelectionEventDetails interface, containing the selected attribute string.
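The attribute-extraction step can be illustrated with a standalone sketch, assuming trace keys in the usual Perf ",key=value,…," form (the real getAttributes utility in the traceset module may differ in details):

```typescript
// Collect the unique parameter names present across a set of trace keys.
function extractAttributes(traceKeys: string[]): string[] {
  const attrs = new Set<string>();
  for (const key of traceKeys) {
    for (const pair of key.split(',')) {
      const eq = pair.indexOf('=');
      if (eq > 0) attrs.add(pair.slice(0, eq)); // keep the key, drop the value
    }
  }
  return Array.from(attrs).sort();
}
```

The resulting list is what the menu renders as split candidates; the values themselves are irrelevant at this stage.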

Layout and Styling

The component uses @material/web components (md-outlined-button, md-menu, and md-menu-item) to provide a consistent look and feel with the rest of the application. The styling in split-chart-menu-sk.css.ts focuses on ensuring the menu anchors correctly within relative layout containers and follows the system-level color palette.

Workflow: From Data to Selection

The following diagram illustrates how data flows through the component to result in a user action:

[ Data Layer ]                     [ split-chart-menu-sk ]              [ Parent Component ]
      |                                      |                                  |
      | DataFrame (via context) ------------>|                                  |
      |                                      |-- extract attributes             |
      |                                      |-- render md-menu                 |
      |                                      |                                  |
      |          User Interaction <----------|                                  |
      |          (Select "benchmark")        |                                  |
      |                                      |-- dispatch CustomEvent           |
      |                                      |   "split-chart-selection" ------>|
      |                                      |                                  |-- Handle Split
      |                                      |                                  |-- Update Layout

Note on Deprecation

While functional, this component is marked as deprecated in favor of “Split Checkboxes.” This suggests a transition in the UI design from a single-selection dropdown model to a multi-selection or checkbox-based model for defining chart splits. External modules should use this component with the understanding that its replacement offers different interaction semantics.

Module: /modules/subscription-table-sk

The subscription-table-sk module provides a specialized custom element designed to display metadata and configuration details for Perf subscriptions and their associated anomaly detection alerts. It serves as a read-only summary view, typically used in dashboards or report pages where users need to verify the settings governing performance monitoring and bug filing.

Design and Data Flow

The component is built using LitElement and follows a reactive property model. It accepts two primary data structures: a Subscription object containing bug-filing metadata (owner, component, priority, etc.) and an array of Alert objects defining the statistical parameters for anomaly detection.

When data is loaded via the subscription and alerts properties, the component renders a summary “details” card. The detailed alerts table is hidden by default to keep the UI clean, but can be expanded by the user to inspect technical detection parameters like the algorithm (e.g., stepfit, mannwhitneyu), radius, and “interestingness” thresholds.

[ Data Source ] -> ( Subscription & Alert Objects )
                          |
                          v
           +-----------------------------+
           | subscription-table-sk       |
           |-----------------------------|
           | [ Summary Card ]            | <--- Formats emails, components,
           |                             |      and Gerrit revisions as links.
           |                             |
           | [ Toggle Button ]           | <--- Manages "showAlerts" state.
           |                             |
           | [ Hidden/Visible Table ]    | <--- Renders Alert params and
           +-----------------------------+      uses <paramset-sk> for queries.

Key Responsibilities

Subscription Visualization

The module is responsible for transforming raw subscription JSON into a user-friendly summary. It implements specific formatting logic for:

  • Revisions: Links the configuration revision hash directly to the internal Git source host where the monitoring configuration is stored.
  • Bug Components: Transforms numeric component IDs into direct search links for the Chromium Issue Tracker, filtered by open issues within that component.
  • Metadata: Aggregates lists of CC'd emails, hotlists, and severity/priority levels into a compact display.

Alert Configuration Table

When expanded, the component displays a dense table of alert parameters. A key implementation detail is the integration with paramset-sk. Since alert queries are often complex URL-encoded strings (e.g., source_type=image&sub_result=min_ms), the component utilizes the toParamSet utility to parse these strings into structured key-value pairs, which are then rendered by the paramset-sk element for better readability.
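For illustration, the parsing step behaves roughly like this sketch (the real toParamSet lives in infra-sk and handles more edge cases):

```typescript
// ParamSet: each key maps to every value observed for it.
type ParamSet = { [key: string]: string[] };

function toParamSetSketch(query: string): ParamSet {
  const ps: ParamSet = {};
  new URLSearchParams(query).forEach((value, key) => {
    if (!ps[key]) ps[key] = [];
    ps[key].push(value); // repeated keys accumulate into the value list
  });
  return ps;
}
```

A query like `source_type=image&sub_result=min_ms` therefore becomes `{ source_type: ['image'], sub_result: ['min_ms'] }`, which paramset-sk can render as a readable table.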

State Management

The visibility of the alerts table is managed via internal @state. Whenever a new subscription is loaded (via property assignment or the load() method), the table visibility is reset to false. This ensures that switching between different subscriptions provides a consistent initial view.

Components and Files

  • subscription-table-sk.ts: The main element logic. It handles property updates, state transitions for the toggle button, and the template generation for both the summary card and the alert table.
  • subscription-table-sk.scss: Provides scoped styling, specifically handling the layout of the details card and ensuring the configuration table adheres to a compact, “small” font-size suitable for technical parameters.
  • infra-sk/modules/paramset-sk: An external dependency used within the table to render the breakdown of the query parameters that define what data the alert is monitoring.

Module: /modules/telemetry

Telemetry Module

The telemetry module provides a centralized mechanism for capturing and reporting frontend performance metrics and user interaction data. It is designed to provide visibility into the health and performance of the application without significantly impacting network performance or reliability.

Overview

The module facilitates the tracking of two primary types of data:

  • Counters: Used to track the frequency of specific events (e.g., page visits, data fetch failures, or user actions like triaging).
  • Summaries: Used to record numerical values, typically durations, to measure performance (e.g., the time it takes to load a graph or a specific table).

Rather than sending a network request for every individual event—which would be chatty and inefficient—the module buffers events locally and flushes them in batches.

Design Decisions

Efficient Batching

To minimize the overhead on the user's browser and the backend, metrics are held in a local buffer.

  • Time-based flushing: The buffer is automatically flushed every 5 seconds. This ensures that data is reported relatively close to real-time while allowing multiple events to be grouped into a single POST request.
  • Network reliability: If a batch fails to send (e.g., due to a temporary network glitch), the module catches the error and re-queues the metrics to be attempted again in the next cycle.

Data Integrity and Retention

A common challenge with frontend telemetry is losing data when a user closes a tab or navigates away before a scheduled flush occurs.

  • Visibility Listening: The module listens for the visibilitychange event. When the page state becomes hidden, it immediately triggers a flush of all pending metrics, bypassing the 5-second timer.
  • Buffer Management: To prevent unbounded memory growth in extreme scenarios, the buffer is capped at 1,000 metrics. If this limit is exceeded, the module uses a First-In-First-Out (FIFO) strategy, discarding the oldest metrics to make room for new ones.
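The retention rules combine into a small buffer sketch. The shapes are assumptions; the 1,000-metric cap, FIFO eviction, clone-and-clear flush, and re-queue-on-failure behavior come from the text:

```typescript
const MAX_BUFFER = 1000;

interface Metric {
  name: string;
  value: number;
}

class MetricBuffer {
  private buffer: Metric[] = [];

  add(m: Metric): void {
    this.buffer.push(m);
    // FIFO cap: discard the oldest entries once the limit is exceeded.
    while (this.buffer.length > MAX_BUFFER) this.buffer.shift();
  }

  async flush(send: (batch: Metric[]) => Promise<void>): Promise<void> {
    if (this.buffer.length === 0) return;
    // Swap in a fresh buffer before the request so metrics recorded during
    // the send are not lost and not double-sent (avoids a race).
    const batch = this.buffer;
    this.buffer = [];
    try {
      await send(batch);
    } catch {
      // Network failure: re-queue ahead of anything recorded meanwhile.
      this.buffer = batch.concat(this.buffer);
    }
  }

  size(): number {
    return this.buffer.length;
  }
}
```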

Key Components

telemetry.ts

This file contains the core logic and exports a singleton instance of the Telemetry class. This singleton ensures that all parts of the application share the same buffer and timing cycle.

  • CountMetric & SummaryMetric Enums: These serve as a “source of truth” for all supported metric names. Adding a new metric requires updating these enums, which provides type safety across the codebase.
  • increaseCounter(name, tags): The primary method for incrementing a counter. It automatically sets the value to 1.
  • recordSummary(name, value, tags): The method used for performance timing or recording specific sizes/counts.
  • sendBufferedMetrics(): An internal asynchronous method that handles the fetch request to the /_/fe_telemetry endpoint. It handles the cloning and clearing of the buffer to prevent race conditions during the network request.

Workflows

Metric Submission Lifecycle

The following diagram illustrates how an event triggered by a user eventually reaches the backend.

User Action / Event
      |
      v
telemetry.increaseCounter()  <-- Application code calls this
      |
      +-----> [ Buffer (Array) ]
                 |
                 | (Wait 5s OR Visibility Hidden)
                 v
      +----------+-----------+
      |  sendBufferedMetrics |
      +----------+-----------+
                 |
                 v
       POST /_/fe_telemetry  --> [ Backend Server ]
                 |
        {Success? Yes} ----> [ Clear local copy ]
                 |
        {Success? No } ----> [ Re-queue metrics ]

Integration Guide

To instrument a new part of the application:

  1. Define: Add the new metric key to the appropriate enum in telemetry.ts.
  2. Call: Import the telemetry singleton and call the relevant method.
  3. Contextualize: Use the optional tags object to provide dimensions (e.g., specific sub-component names or error types) that allow for more granular filtering in dashboards.

Module: /modules/test-picker-sk

The test-picker-sk module provides a specialized UI component for exploring and selecting performance traces. It enforces a hierarchical selection process, guiding users through large datasets by dynamically fetching valid options for subsequent parameters based on previous choices.

Overview

The primary goal of test-picker-sk is to ensure users build valid queries for the Perf database. Rather than presenting all possible parameters at once, which could lead to empty results, the component reveals fields sequentially.

As a user selects values in one field (e.g., “Benchmark”), the component queries the backend to find available values for the next parameter in the hierarchy (e.g., “Bot”). This “drill-down” approach prevents invalid combinations and provides immediate feedback on the number of matching traces found.

Design Decisions

Hierarchical Filtering

The component relies on an ordered list of parameters (e.g., ['benchmark', 'bot', 'test']). This order is critical because it defines the dependency chain for data fetching. When a value is changed at index i in the hierarchy, all fields at index i+1 and greater are invalidated and removed. This ensures that the state of the picker always represents a valid path through the data tree.
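The invalidation rule is simple enough to state as code (shapes are illustrative):

```typescript
// One active field in the drill-down, in hierarchy order.
interface Field {
  param: string; // e.g. 'benchmark', 'bot', 'test'
  value: string;
}

// Changing the field at changedIndex invalidates every deeper field,
// because their option lists were fetched against the old value.
function invalidateAfter(fields: Field[], changedIndex: number): Field[] {
  return fields.slice(0, changedIndex + 1);
}
```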

Trace Count & Plotting Guardrails

To prevent performance degradation on both the client and server, the component enforces a PLOT_MAXIMUM (defaulting to 200 traces).

  • Auto-plotting: If a graph is already active and the selection results in fewer than the maximum allowed traces, changes are automatically pushed to the graph.
  • Manual Plotting: If no graph is active, the “Plot” button is enabled only when the match count is within a safe range (0 < count ≤ PLOT_MAXIMUM).
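Expressed as a predicate (PLOT_MAXIMUM's default of 200 comes from the text; the function name is illustrative):

```typescript
const PLOT_MAXIMUM = 200;

// The "Plot" button is enabled only for a non-empty match count that
// stays within the guardrail.
function plotButtonEnabled(matchCount: number): boolean {
  return matchCount > 0 && matchCount <= PLOT_MAXIMUM;
}
```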

Handling “Missing” Values

In the Perf database, traces may not have a value for every possible parameter. The component maps these empty strings to a “Default” label in the UI. Internally, these are translated to a sentinel value (__missing__) when constructing queries, allowing users to explicitly select traces that lack a specific attribute.
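A sketch of the two-way mapping (the '__missing__' sentinel and the 'Default' label come from the text; the function names are illustrative):

```typescript
const MISSING = '__missing__';
const DEFAULT_LABEL = 'Default';

// Empty parameter values render as 'Default' in the picker UI...
function toDisplay(value: string): string {
  return value === '' ? DEFAULT_LABEL : value;
}

// ...and 'Default' translates back to the sentinel when building queries,
// so users can explicitly match traces lacking the attribute.
function toQueryValue(display: string): string {
  return display === DEFAULT_LABEL ? MISSING : display;
}
```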

Key Components and Workflow

Data Management (FieldInfo)

The internal state is managed via an array of FieldInfo objects. Each object tracks:

  • The PickerFieldSk element instance.
  • The parameter name and current selections.
  • Event listeners for value changes and “split-by” toggles.

Workflow: Adding a Field

When a user makes a selection that narrows the results, the component initiates the following process:

User Selects Value
       |
       v
callNextParamList() ----> POST /_/nextParamList/ (with current query)
       |                            |
       |                            v
       |<---- Returns {paramset: {next_param: [options]}, count: N}
       v
addChildField()
       |
       |--> Create new PickerFieldSk
       |--> Populate with options (mapping '' to 'Default')
       |--> Attach 'value-changed' listeners
       |--> Update match count UI

Advanced Logic: Conditional Defaults

The component supports complex “trigger” rules through applyConditionalDefaults. This allows the UI to automatically pre-select values in subsequent fields based on specific selections in earlier ones. For example, selecting a specific metric might automatically select a preferred stat (like ‘avg’), streamlining the user experience for common workflows.

Split-By Functionality

Users can “split” the graph by a specific parameter. The component ensures only one parameter is split at a time. If a user enables “split” on a field, the component disables the split checkbox on all other fields and dispatches a split-by-changed event to notify the parent application to adjust the graph's grouping logic.

Key Files

  • test-picker-sk.ts: The main logic for the hierarchical picker, state management, and event handling.
  • test-picker-sk.scss: Styles the layout, specifically the “drill-down” field container and the match count indicator.
  • test-picker-sk_po.ts: Page Object for automated testing, providing methods to interact with the fields and wait for async loading states.
  • test-picker-sk-demo.ts: Provides a mock environment with a simulated backend (/_/nextParamList/) used for development and visual testing.

Events

  • plot-button-clicked: Dispatched when the user clicks “Plot”. Detail contains the full query string.
  • add-to-graph / remove-trace: Dispatched during “Auto-add” mode to incrementally update an existing visualization.
  • split-by-changed: Dispatched when the “Split” toggle is flipped on any field.

Module: /modules/tests

Perf Integration Tests

The /modules/tests module contains end-to-end (E2E) integration tests for the Perf application. These tests leverage Puppeteer to simulate user interactions and verify the visual and functional integrity of the application's core pages.

Overview

The primary goal of this module is to provide a “sanity check” for the production-facing UI. Unlike unit tests that focus on individual components, these tests ensure that the integration between the frontend and the backend (or a mock representation of it) remains stable.

The tests are designed to be “perf-blocking,” meaning they represent the critical paths a user takes. If these tests fail, it indicates a high probability that a real user will encounter a broken experience.

Implementation Strategy

The module uses a specialized testing infrastructure built around Puppeteer and a mock server:

  • Mock Backend Integration: Tests utilize frontend_mock_server as the sk_demo_page_server. This allows the tests to run against a predictable and stable backend environment, decoupling UI verification from actual database state or external network flakiness.
  • TestBed Utility: Tests use a loadCachedTestBed pattern. This optimizes execution by reusing browser instances where possible, reducing the overhead of spinning up a fresh Puppeteer instance for every test suite.
  • Visual Regression Prevention: A significant portion of the logic is dedicated to taking screenshots (via takeScreenshot). These screenshots serve as a baseline to prevent accidental regressions in layout, CSS, or initial rendering.

Key Components and Responsibilities

Critical Path Sanity (initial_loading_puppeteer_test.ts)

This component is responsible for verifying that the primary entry points of the application load correctly. It targets:

  • /e (Explore Page)
  • /m (Multigraph Page)
  • /a (Regressions/Alerts Page)

It includes logic to handle common UI overlays, such as cookie consent banners, ensuring that screenshots represent the actual application state rather than transient UI elements.

Functional Interaction (explore_multi_page_puppeteer_test.ts)

Beyond simple loading, this component tests specific user workflows within the Multigraph and Explore views. It validates complex UI components (like Vaadin multi-select combo boxes) to ensure that event listeners, data binding, and dropdown behaviors are functioning correctly in a browser environment.

Common Workflows

Test Lifecycle Execution

The following diagram illustrates how a typical test in this module interacts with the infrastructure:

[ Test Suite Start ]
        |
        V
[ loadCachedTestBed() ] <---- Reuses browser instance for efficiency
        |
        V
[ beforeEach() ] -----------+
        |                   |
        |           Set Viewport Size
        |           Navigate to Target URL (e.g., /m)
        |                   |
        V                   |
[ it() Test Case ] <--------+
        |
        +---> [ Interaction ] (Click, Type, Wait for Selector)
        |
        +---> [ Screenshot ] (Capture state for visual diffing)
        |
        V
[ Test Suite End ]

Handling External UI (Cookie Banners)

Because these tests aim to simulate real user visits, they must account for global UI elements that might obscure the application. The acceptCookieBanner helper is a design choice to ensure that “noise” from the base platform doesn't cause false positives in screenshot comparisons or block element visibility during functional tests.

Module: /modules/themes

Overview

The themes module serves as the centralized styling foundation for the project. Rather than defining an entirely new design system from scratch, it acts as a customization layer that bridges the project's specific aesthetic requirements with the base design tokens provided by the shared infra-sk infrastructure.

Design Philosophy: “Deltas over Definitions”

The primary design principle for this module is to maintain a minimal footprint. It is structured to follow a “delta-based” approach, where the styles defined here only represent deviations from the global shared themes or essential overrides for base HTML elements.

This approach was chosen to:

  1. Ensure Consistency: By importing and extending infra-sk/themes, the project automatically inherits updates to the core design system (such as color palettes, spacing units, and typography) without manual intervention.
  2. Reduce Redundancy: By explicitly forbidding the re-definition of existing styles, the module prevents “CSS bloat” and ensures that the source of truth for standard UI components remains in the infrastructure layer.
  3. Centralize Global Resets: It provides a single location for high-level layout adjustments that affect the entire application environment, such as document margins and specialized spacing.

Key Components and Responsibilities

Theme Extension and External Assets

The module is responsible for pulling in the necessary external typography and iconography. Currently, it integrates the Material Icons library, making these glyphs globally available across all web components in the project. It also serves as the bridge to the shared SASS library, ensuring that variables and mixins from the infrastructure are accessible to local stylesheets.

Global Layout Normalization

The themes.scss file handles the “Sanitization” or “Reset” logic for the application's root. It enforces a consistent body configuration (zeroing out default browser margins/padding) to ensure that top-level layout components (like nav bars or sidebars) align perfectly with the viewport boundaries.

Application-Specific Layout Hacks

The module houses structural utilities that facilitate specific UX behaviors. A notable example is the #bottom-spacer implementation.

Workflow: Scroll Buffer Management

[ Viewport ]
|-------------------|
|  Content Area     |
|                   |
|  [Element A]      |
|  [Element B]      |
|                   |
|-------------------| <--- End of content
|  #bottom-spacer   | <--- Provides 500px of "breathing room"
|-------------------|

The inclusion of a large bottom spacer is a deliberate implementation choice to ensure that users can scroll past the final pieces of interactive content, preventing UI elements (like floating action buttons or footer overlays) from obscuring the last items in a list or terminal output.

Technical Integration

The module is exposed as a sass_library. This allows other modules in the project to depend on themes_sass_lib, ensuring that the global styles and infrastructure dependencies are bundled correctly during the Sass compilation process. By depending on //infra-sk:themes_sass_lib, it ensures that the dependency graph correctly resolves the cascading nature of the styles.

Module: /modules/trace-details-formatter

Trace Details Formatter

The trace-details-formatter module provides a standardized way to translate internal trace data (parameters and keys) into human-readable strings and, conversely, to reconstruct query parameters from those strings.

Trace IDs in Perf are often complex sets of key-value pairs. Depending on the specific domain (e.g., standard Skia traces vs. Chrome-specific performance benchmarks), the desired visual representation of these traces and the logic required to query them varies significantly. This module abstracts those differences behind a common interface.

Key Concepts

TraceFormatter Interface

The module defines a central TraceFormatter interface that ensures consistency across different formatting implementations:

  • formatTrace(params: Params): Converts a dictionary of trace parameters into a displayable string.
  • formatQuery(trace: string): Parses a formatted trace string back into a URL query string compatible with the Perf backend.
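The interface contract described above can be sketched as follows. The `Params` alias and the exact key-joining format of the default implementation are assumptions for illustration; the authoritative definitions live in `traceformatter.ts` and `perf/modules/json`.

```typescript
// Sketch of the TraceFormatter contract. Params and the joined trace-ID
// format are illustrative assumptions, not the real module's code.
type Params = { [key: string]: string };

interface TraceFormatter {
  // Converts a dictionary of trace parameters into a displayable string.
  formatTrace(params: Params): string;
  // Parses a formatted trace string back into a URL query string.
  formatQuery(trace: string): string;
}

class DefaultTraceFormatterSketch implements TraceFormatter {
  formatTrace(params: Params): string {
    // Fallback: return the joined key-value pairs, sorted for stability.
    return Object.keys(params)
      .sort()
      .map((k) => `${k}=${params[k]}`)
      .join(',');
  }

  formatQuery(_trace: string): string {
    // The default formatter does not support reconstructing queries.
    return '';
  }
}
```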

Implementation Logic

The module selects an implementation at runtime based on the global window.perf.trace_format configuration.

DefaultTraceFormatter

Used when no specific format is defined. It provides a fallback by simply returning the unique Trace ID (the joined key-value pairs). It does not support converting strings back into queries.

ChromeTraceFormatter

Designed specifically for Chrome's hierarchical performance data. It handles the mapping between the legacy Chrome “test path” structure and Skia's parameter-based system.

  • Fixed Hierarchy: It enforces a specific order of keys: master, bot, benchmark, test, and three levels of subtest.
  • Path Joining: formatTrace produces a slash-delimited string (e.g., master/bot/benchmark/...).
  • Query Reconstruction: formatQuery splits these paths back into their constituent keys.

Chrome-Specific Statistics Mapping

A significant responsibility of this module is handling the transition from Chromeperf-style “test paths” to Skia's “stat” parameters. In the Chrome ecosystem, statistical aggregations (like averages or maximums) are often encoded as suffixes in the test name.

When enable_skia_bridge_aggregation is active, the ChromeTraceFormatter automatically extracts these suffixes and maps them to standard Skia stat values:

  Suffix                  Skia Stat Value
  ----------------------  -------------------
  avg                     value
  std                     error
  max, min, count, sum    (remains the same)

If a test name lacks a known suffix, the formatter defaults the stat parameter to value to prevent the system from accidentally loading all available statistical variations (which would result in 6x more data being fetched than intended).
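A minimal sketch of this suffix extraction, assuming an underscore-separated suffix at the end of the test name (the real `ChromeTraceFormatter` logic may differ in separator handling and edge cases):

```typescript
// Maps Chromeperf test-name suffixes to Skia stat values, per the table
// above. The underscore-splitting convention is an assumption.
const SUFFIX_TO_STAT: { [suffix: string]: string | undefined } = {
  avg: 'value',
  std: 'error',
  max: 'max',
  min: 'min',
  count: 'count',
  sum: 'sum',
};

// Splits a trailing statistic suffix off a test name; defaults the stat
// to 'value' when no known suffix is present, to avoid fetching every
// statistical variation of the trace.
function extractStat(testName: string): { test: string; stat: string } {
  const parts = testName.split('_');
  const stat = SUFFIX_TO_STAT[parts[parts.length - 1]];
  if (stat !== undefined && parts.length > 1) {
    return { test: parts.slice(0, -1).join('_'), stat };
  }
  return { test: testName, stat: 'value' };
}
```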

Workflow

The following diagram illustrates how the module resolves a formatter and processes data:

[ Global Config ]
       |
       | window.perf.trace_format
       v
[ GetTraceFormatter() ]
       |
       +---- "chrome" ----> [ ChromeTraceFormatter ]
       |                         |
       |                         +-- formatTrace: join(keys, '/')
       |                         +-- formatQuery: split('/') + Stat Mapping
       |
       +---- (default) ---> [ DefaultTraceFormatter ]
                                 |
                                 +-- formatTrace: makeKey(params)

Key Files

  • traceformatter.ts: Contains the TraceFormatter interface, the concrete implementations for Chrome and Default styles, and the factory function GetTraceFormatter.
  • traceformatter_test.ts: Validates the logic for path splitting and the conditional application of statistical mappings based on global window settings.

Module: /modules/triage-menu-sk

triage-menu-sk

The triage-menu-sk module provides a unified interface for managing and triaging performance anomalies in bulk. It serves as a central control point within the Perf UI, allowing users to categorize detected regressions or improvements by filing bugs, associating them with existing reports, ignoring them, or “nudging” their detected revision range.

Design and Implementation Choices

Centralized Triage Orchestration

Instead of implementing bug-filing logic directly, triage-menu-sk acts as an orchestrator. It encapsulates and manages two specialized dialog components: new-bug-dialog-sk and existing-bug-dialog-sk. This separation of concerns allows the menu to focus on the high-level workflow (selecting anomalies and choosing an action) while delegating complex form handling and bug-tracker integration to the specific dialog modules.

The “Nudge” Workflow

One unique feature of this module is the “Nudge” functionality. Anomalies are detected over a revision range, but the detection might not perfectly align with the actual point of regression. The “Nudge” buttons (typically ranging from -2 to +2) allow users to shift the anomaly's revision boundaries.

  • Visual Feedback: When a nudge is performed, the component updates the AnomalyData (coordinates x and y) locally and dispatches an event so the parent chart can immediately reflect the shift without a full page reload.
  • Backend Sync: Each nudge triggers a POST request to /_/triage/edit_anomalies with the NUDGE action, ensuring the database reflects the refined revision range.
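The shape of the request body sent for a nudge might look like the sketch below. The field names here are assumptions based on the description above, not the authoritative wire format of /_/triage/edit_anomalies.

```typescript
// Hypothetical request body for the polymorphic edit_anomalies endpoint.
interface EditAnomaliesRequest {
  anomaly_ids: number[];
  action: 'IGNORE' | 'RESET' | 'NUDGE';
  start_revision?: number; // Refined range start (NUDGE only).
  end_revision?: number;   // Refined range end (NUDGE only).
}

// Builds a NUDGE request that shifts the anomaly's revision boundaries.
function buildNudgeRequest(
  anomalyIds: number[],
  startRevision: number,
  endRevision: number
): EditAnomaliesRequest {
  return {
    anomaly_ids: anomalyIds,
    action: 'NUDGE',
    start_revision: startRevision,
    end_revision: endRevision,
  };
}
```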

State and Event-Driven Architecture

The component relies heavily on an event-driven model to maintain synchronization with the rest of the application:

  • anomaly-changed Event: This is the primary output of the module. Whenever an anomaly is ignored, nudged, or associated with a bug, this event is dispatched. It carries the updated anomaly details and trace IDs, signaling to parent components (like graphs or tables) that they need to invalidate their caches and re-render.
  • Property-Based Configuration: The menu's state is driven by the anomalies and traceNames properties. By calling setAnomalies(), a parent component can dynamically update which data points the menu is currently acting upon.
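A parent component consuming `anomaly-changed` might use a small helper like the one below to decide which cached traces to invalidate. The detail shape shown here (anomalies plus trace names) is an assumption for illustration only.

```typescript
// Assumed shape of the anomaly-changed event detail; the real type is
// defined by triage-menu-sk.ts.
interface Anomaly {
  id: number;
  bug_id: number;
}

interface AnomalyChangedDetail {
  anomalies: Anomaly[];
  traceNames: string[];
}

// Pure helper: given the parent's cached trace IDs and an event detail,
// return the IDs whose cached entries must be invalidated and re-rendered.
function tracesToInvalidate(
  cachedTraceIds: string[],
  detail: AnomalyChangedDetail
): string[] {
  const changed = new Set(detail.traceNames);
  return cachedTraceIds.filter((id) => changed.has(id));
}
```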

Key Components and Responsibilities

triage-menu-sk.ts

The core logic of the module. It handles:

  • Triage Actions:
    • New Bug: Forwards the request to the new-bug-dialog-sk.
    • Existing Bug: Triggers a fetch for recently associated bugs and opens the existing-bug-dialog-sk.
    • Ignore: Sends an IGNORE request to the backend. It sets the bug_id to -2 (a convention for ignored anomalies) and displays a confirmation toast.
  • API Communication: Manages POST requests to /_/triage/edit_anomalies. This endpoint is polymorphic, handling IGNORE, RESET, and NUDGE actions based on the provided body.
  • Telemetry: Integrates with the telemetry module to track which triage actions are most frequently taken by users.

Triage Actions Workflow

User Interaction
      |
      V
[ Triage Menu ] --------------------------+
      |                                   |
      | (Action: New/Existing)            | (Action: Ignore/Nudge)
      V                                   V
[ Dialog Components ]             [ Backend API Call ]
(new-bug-dialog-sk)               (/_/triage/edit_anomalies)
(existing-bug-dialog-sk)                  |
      |                                   |
      +------------> [ Success ] <--------+
                        |
                        V
             [ Dispatch anomaly-changed ]
             [ Show Toast / Update UI   ]

NudgeEntry (Class)

A data structure used to represent potential “nudge” states. It maps a display index (e.g., +1) to specific revision ranges (start_revision, end_revision) and UI coordinates (x, y). This allows the menu to render a sequence of buttons that correspond to valid shifts in the data.

triage-menu-sk_po.ts

Provides the Page Object for testing. It abstracts the internal DOM structure, including the nested dialogs and the ignore toast, allowing Puppeteer tests to interact with the triage flow without being coupled to the specific HTML structure or CSS classes.

Module: /modules/triage-page-sk

triage-page-sk

The triage-page-sk module provides a comprehensive dashboard for reviewing and triaging performance regressions in the Perf system. It allows users to scan a matrix of commits and alerts, visualize clusters of data, and record triage decisions (e.g., positive, negative, or untriaged).

High-level Overview

The page is designed as a high-density “triage queue.” It presents a grid where rows represent commits and columns represent different configured alerts. This layout allows a developer or performance engineer to quickly identify which commits caused regressions across multiple benchmarks or metrics.

The primary workflow involves:

  1. Discovery: Filtering the view to find “Untriaged” regressions within a specific time range.
  2. Investigation: Clicking on a regression status to open a detailed view of the data cluster.
  3. Action: Assigning a status to the regression, which may trigger automated bug reporting.

Key Components and Responsibilities

State Management and Data Fetching

The component uses stateReflector to sync its internal state (time range, subset filters, alert filters) with the URL. This ensures that triage views can be bookmarked or shared.

  • updateRange(): This is the core data-fetching method. It sends a RegressionRangeRequest to the /_/reg/ endpoint. The server responds with a RegressionRangeResponse containing the headers (alerts) and the table data (commits and their regression status).
  • calc_all_filter_options(): This logic processes the categories returned by the server to populate the “Which alerts to display” filter, allowing users to focus on specific teams or components.
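A sketch of what `updateRange()` sends is shown below. The field names (`begin`, `end`, `subset`, `alert_filter`) are illustrative assumptions, not the authoritative `RegressionRangeRequest` schema.

```typescript
// Hypothetical request shape for the /_/reg/ endpoint.
interface RegressionRangeRequest {
  begin: number; // Unix timestamp: start of the time range.
  end: number;   // Unix timestamp: end of the time range.
  subset: 'all' | 'regressions' | 'untriaged';
  alert_filter: string;
}

function buildRegressionRangeBody(req: RegressionRangeRequest): string {
  return JSON.stringify(req);
}

// How updateRange() might issue the request (sketch only).
async function fetchRegressionRange(req: RegressionRangeRequest): Promise<unknown> {
  const resp = await fetch('/_/reg/', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: buildRegressionRangeBody(req),
  });
  if (!resp.ok) throw new Error(`HTTP ${resp.status}`);
  return resp.json();
}
```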

The Triage Grid

The grid is constructed dynamically based on the server's response:

  • Rows: Each row uses commit-detail-sk to show the commit hash, author, and message.
  • Columns: Headers represent individual alerts. Columns are split into “Low” and “High” sub-columns if the alert tracks bidirectional changes.
  • Cells: Cells contain triage-status-sk elements. If a cluster is found, it shows the current triage status. If no cluster is found, it displays a “∅” symbol, which links to the generic cluster view for that commit/query combination.

Triage Dialog and Interaction

When a user clicks a status in the grid, the triage_start event is captured, opening a <dialog> containing a cluster-summary2-sk.

  • cluster-summary2-sk: Responsible for rendering the actual plot and summary statistics of the regression.
  • triaged(): When a triage decision is made inside the dialog, this method handles the POST request to /_/triage/. If the triage results in a new bug, it automatically opens the bug reporting URL in a new window.

Workflow: Investigating a Regression

The following diagram illustrates how a user moves from the high-level grid to a specific data investigation:

[ Triage Page Grid ]
       |
       | (User clicks a 'triage-status-sk')
       v
[ triage_start event ] --------------------------+
       |                                         |
       v                                         |
[ Open <dialog> ]                                |
       |                                         |
       +--> [ cluster-summary2-sk ] <------------+
                |
                | (User analyzes plot)
                |
    +-----------+-----------+
    |                       |
(Press 'p' / 'n')     (Press 'g')
    |                       |
    v                       v
[ Update Status ]     [ Open Dashboard ]
    |                 (Full Explore View)
    v
[ POST /_/triage/ ]
    |
    +--> (Optional: Open Bug Tracker)

Keyboard Shortcuts

To facilitate rapid triaging, the module implements KeyboardShortcutHandler. When the triage dialog is open, the following shortcuts are available:

  • p: Mark the current regression as Positive.
  • n: Mark the current regression as Negative.
  • g: Go to the full Explore page for this cluster to perform a deeper analysis of the underlying traces.
  • ?: Open the keyboard shortcuts help overlay.

These shortcuts are managed via handleKeyboardShortcut in the keyDown listener, ensuring they only trigger when the user is not actively typing in an input field.
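The dispatch logic can be sketched as a pure mapping from key to action, with the input-field guard applied first. The action names here are illustrative; the real `handleKeyboardShortcut` may differ.

```typescript
// Hypothetical action names for the triage-dialog shortcuts listed above.
type TriageShortcut = 'positive' | 'negative' | 'explore' | 'help' | null;

function shortcutForKey(key: string, typingInInput: boolean): TriageShortcut {
  if (typingInInput) return null; // Never trigger while the user types in a field.
  switch (key) {
    case 'p': return 'positive'; // Mark regression as Positive.
    case 'n': return 'negative'; // Mark regression as Negative.
    case 'g': return 'explore';  // Open the full Explore page.
    case '?': return 'help';     // Show the shortcuts overlay.
    default:  return null;
  }
}
```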

Design Decisions

  • Modal Dialog for Detail: Instead of navigating away from the grid, details are shown in a modal dialog. This preserves the user's scroll position and filter state in the large triage matrix, allowing them to process dozens of regressions in a single session.
  • Conditional Column Splitting: The grid columns dynamically adjust based on the direction of the alert (UP, DOWN, or BOTH). This minimizes horizontal scrolling by only showing “High” or “Low” columns when relevant to that specific alert's configuration.
  • Subset Filtering: The subset parameter (all, regressions, untriaged) allows the server to prune the data significantly, which is critical for performance when viewing large time ranges (e.g., several weeks of data across hundreds of alerts).

Module: /modules/triage-status-sk

Triage Status Module

The triage-status-sk module provides a visual indicator and interaction point for the triage state of performance clusters. It serves as a compact UI element that communicates the current classification of a detected anomaly (e.g., untriaged, positive, or negative) and initiates the workflow for modifying that state.

High-level Overview

In the Perf system, data anomalies are grouped into clusters that require human intervention to determine if they represent actual regressions or false positives. This module encapsulates the visual representation of that status.

Rather than managing the complex logic of the triage dialog itself (which involves data visualization and form inputs), triage-status-sk acts as a trigger. When a user interacts with the component, it broadcasts the necessary context—including the current triage state, the associated alert configuration, and the cluster summary—to be handled by a parent container or a global dialog manager.

Key Components and Responsibilities

TriageStatusSk (triage-status-sk.ts)

The primary class is an ElementSk that renders a stylized button. Its responsibilities include:

  • State Representation: It mirrors the TriageStatus (status and message) provided via its properties. The visual state is driven by CSS classes that correspond to the status strings (positive, negative, untriaged).
  • Context Storage: It holds metadata required for the triage process, such as the alert (the configuration that triggered the detection), full_summary (the statistical data of the cluster), and cluster_type (indicating if the cluster represents a high or low change).
  • Workflow Initiation: It listens for click events on the internal button and dispatches a start-triage custom event.

Visual Feedback (triage-status-sk.scss)

The module uses a specialized icon component (tricon2-sk) inside the button. The styling logic is tightly coupled with the status:

  • Colors: Uses theme-aware variables (--positive, --negative, --untriaged) to ensure consistency across the application.
  • Theming: Includes specific overrides for dark mode to ensure the icons remain legible against varying background surfaces.

Workflow: Initiating Triage

The following diagram illustrates how the component interacts with the rest of the application to start a triage action:

[ User ]
    |
    | (Clicks Button)
    v
[ triage-status-sk ]
    |
    |-- 1. Bundles: { triage, alert, full_summary, cluster_type }
    |-- 2. Dispatches 'start-triage' Event
    v
[ Parent / Dialog Manager ]
    |
    |-- 3. Receives Event Detail
    |-- 4. Opens Triage Dialog for User Input
    v
[ Backend API ] (Updated via parent)

Event Details: start-triage

The start-triage event is the primary output of this module. Its detail object contains:

  Property       Description
  -------------  -----------
  triage         The current TriageStatus object.
  full_summary   The FullSummary data structure representing the cluster statistics.
  alert          The Alert configuration associated with this detection.
  cluster_type   Whether the regression is 'high' or 'low'.
  element        A reference to the TriageStatusSk instance that fired the event, allowing the receiver to update the element directly upon successful triage.
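A receiver of this event might type the detail as sketched below. The `dialogTitle` helper and its format are hypothetical, added purely to illustrate consuming the detail fields; they are not part of the module's API.

```typescript
// Assumed detail shape for the start-triage event; the real types come
// from the Perf JSON schema definitions.
interface StartTriageDetail {
  triage: { status: string; message: string };
  full_summary: unknown;
  alert: unknown;
  cluster_type: 'high' | 'low';
  element: unknown; // The TriageStatusSk instance that fired the event.
}

// Hypothetical helper a dialog manager could use to label the triage dialog.
function dialogTitle(detail: StartTriageDetail): string {
  return `Triage (${detail.cluster_type}): ${detail.triage.status}`;
}
```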

Module: /modules/triage2-sk

triage2-sk

The triage2-sk module provides a specialized UI component for managing the classification of data points or alerts within the Perf system. It allows users to toggle between three distinct states: positive, negative, and untriaged.

This component is designed to be a compact, intuitive control that provides immediate visual feedback on the current classification of an item while making it easy to change that state with a single click.

Design and Implementation

The module is built as a custom element using the Lit library and extends ElementSk.

State Management

The primary state of the component is driven by its value attribute. The component synchronizes this attribute with a property of the same name. To ensure data integrity, it uses a guard function (isStatus) to validate that any assigned value conforms to the Status type (defined in perf/modules/json).
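The validation behavior can be sketched as below: any value outside the `Status` union falls back to `untriaged`. This is an illustrative stand-in for the real `isStatus` guard in `perf/modules/json`.

```typescript
// The three triage states the component recognizes.
type Status = 'positive' | 'negative' | 'untriaged';

// Normalizes an attribute value: invalid or missing values fall back to
// 'untriaged', mirroring the guard behavior described above.
function normalizeStatus(value: string | null): Status {
  if (value === 'positive' || value === 'negative' || value === 'untriaged') {
    return value;
  }
  return 'untriaged';
}
```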

The internal logic follows a reactive pattern:

  1. Input: A user clicks one of the three buttons or the value attribute is updated programmatically.
  2. Reaction: The attributeChangedCallback triggers a re-render and dispatches a change event.
  3. Visual Feedback: The component uses the ?selected attribute on the internal buttons to highlight the active state, which is then styled via CSS.

UI and Styling

The component consists of a group of three buttons, each containing a specific icon:

  • check-circle-icon-sk: Represents a positive result.
  • cancel-icon-sk: Represents a negative (false positive) result.
  • help-icon-sk: Represents an untriaged (unknown) state.

The styling is implemented in triage2-sk.scss with specific support for “light” and “dark” modes. It relies on CSS variables (e.g., --surface, --on-disabled) to integrate seamlessly with the broader Perf application themes. Deselected buttons are intentionally dimmed to draw focus to the currently selected state.

Workflow: State Update

The following diagram illustrates how an interaction with the UI flows through the component to notify the parent application:

User Click          Component Property          DOM Attribute          Parent Application
----------          ------------------          -------------          ------------------
    |                       |                         |                        |
[Click .positive] ------> [set value]                 |                        |
    |                       |                         |                        |
    |                       | ----------------> [attr: value]                  |
    |                       |                         |                        |
    |                       |                [attributeChanged]                |
    |                       |                         |                        |
    |                  [_render()] <----------------- |                        |
    |                       |                         |                        |
    |                       | -----------------------------------------> [Event: 'change']

Key Components and Files

  • triage2-sk.ts: Contains the TriageSk class logic. It handles the attribute observation, property synchronization, and the dispatching of custom events when the triage status changes.
  • triage2-sk.scss: Defines the visual representation. It uses sophisticated selectors to handle both legacy color schemes and modern theme-based variables, ensuring the icons are appropriately colored (Green for positive, Red for negative, Brown for untriaged) and that “raised” or “hover” states provide tactile feedback.
  • index.ts: The entry point that defines the custom element in the global customElements registry.

Events and Attributes

  • value (Attribute/Property): Reflects the current status. Defaults to untriaged if not set or if set to an invalid value.
  • change (Event): Dispatched whenever the status changes. The detail property of the event contains the new Status value string.

Module: /modules/tricon2-sk

The tricon2-sk module provides a specialized UI component for displaying triage states. It translates semantic status strings—“positive”, “negative”, or “untriaged”—into consistent visual indicators (icons and colors) used across the application to represent the state of performance regressions or test results.

Core Logic and Design Decisions

The component is designed around a single point of truth: the value attribute. By mirroring this attribute to a JavaScript property, the component ensures that updates made via HTML or direct property assignment trigger a re-render.

The implementation uses a declarative template approach. Instead of manually manipulating the DOM to swap icons, the TriconSk class uses a switch statement within its Lit template to determine which underlying icon element to mount:

Value ("positive")  ------>  <check-circle-icon-sk>
Value ("negative")  ------>  <cancel-icon-sk>
Value (default)     ------>  <help-icon-sk>

This design simplifies the component's internal state management, as the visual output is a pure function of the value property.

Theming and Visual Consistency

The styling logic in tricon2-sk.scss is decoupled into three distinct layers to ensure legibility across different UI contexts:

  • Default State: Uses standard CSS variables (e.g., --green, --red) for basic integration.
  • Themed State (.body-sk): Provides specific hex code overrides to ensure that the triage colors meet brand and contrast requirements when the application's standard theme is applied.
  • Dark Mode: Adjusts the brightness and saturation of the icons specifically for dark backgrounds to maintain accessibility.

By encapsulating these color mappings within the component's SCSS, the module prevents “color leak” and ensures that a “positive” icon always appears in the correct shade of green regardless of where it is placed in the application.

Key Components

TriconSk (tricon2-sk.ts)

The primary class extending ElementSk. It manages the lifecycle of the element and observes the value attribute. It is responsible for importing and registering the specific icon elements from elements-sk needed for the three states.

SCSS Styles (tricon2-sk.scss)

Rather than relying on the parent container to style the icons, this file explicitly defines the fill properties for the internal icon components (check-circle-icon-sk, cancel-icon-sk, and help-icon-sk). This ensures that the semantic meaning of the icon (success, failure, or unknown) is always tied to its visual representation.

Demo and Testing

The module includes a demo page (tricon2-sk-demo.html) that showcases the component in all three triage states across light and dark modes. This is used by the Puppeteer test suite (tricon2-sk_puppeteer_test.ts) to perform visual regression testing, ensuring that the icons render correctly and maintain their color associations during UI changes.

Module: /modules/user-issue-sk

High-Level Overview

The user-issue-sk module provides a custom LitElement component designed to manage the association between performance data points (traces at specific commit positions) and external bug tracking system (Buganizer) issues. It acts as a bridge between the Perf monitoring UI and the issue management backend, allowing users to view, link, create, and remove bug references directly within the context of a performance trace.

Design Decisions and Implementation Choices

State-Driven Visibility

The component's appearance is heavily dictated by the user's authentication state and the presence of an existing bug.

  • Authentication: Using the alogin-sk module, the component detects if a user is logged in. If a user is anonymous, they are restricted to viewing existing bug links; they cannot add, delete, or modify issue associations.
  • Bug ID States: The bug_id property serves as a state indicator. A value of -1 hides the element entirely, 0 indicates no bug is currently associated, and a positive integer indicates an active link to an external issue.
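The state convention above maps directly to a view decision, sketched here with illustrative view names:

```typescript
// Hypothetical view states derived from the bug_id convention:
// -1 hides the element, 0 offers the "Add Issue" button, and a positive
// integer renders the link to the external issue.
type IssueView = 'hidden' | 'add-button' | 'link';

function viewForBugId(bugId: number): IssueView {
  if (bugId === -1) return 'hidden';    // Element not shown at all.
  if (bugId === 0) return 'add-button'; // No bug associated yet.
  return 'link';                        // Active link to an external issue.
}
```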

Automated vs. Manual Association

The module implements a specific workflow for adding issues (findOrAddIssue) that prioritizes data integrity:

  1. It first checks the backend to see if a bug reference already exists for the given trace and commit.
  2. If the user provides a bug ID that matches an existing record, it simply confirms the link.
  3. If no matching record exists, the component automatically triggers the creation of a new bug via the /pre/_/triage/file_bug endpoint before saving the association. This ensures that every “Add Issue” action results in a valid, tracked entity in the bug host.

Event-Based Updates

Rather than managing the global state of the application, user-issue-sk utilizes a user-issue-changed Custom Event. When an issue is saved or deleted, the component dispatches this bubbling event. This allows parent components or data providers to react to changes (e.g., refreshing a list of anomalies or updating a graph) without the user-issue-sk component needing deep knowledge of the application's architecture.

Key Components and Responsibilities

user-issue-sk.ts

This is the primary implementation file containing the UserIssueSk class. Its responsibilities include:

  • Property Management: Tracks user_id, bug_id, trace_key, and commit_position to contextualize the issue.
  • API Interaction: Handles asynchronous requests to several endpoints:
    • /_/user_issues: To query existing associations.
    • /_/user_issue/save: To persist a link between a trace point and a bug.
    • /_/user_issue/delete: To remove an association.
    • /_/triage/file_bug: To programmatically create a new bug in Buganizer.
  • UI Rendering: Switches between a “Link View” (showing the formatted URL to the bug) and an “Add/Input View” (showing a button or a numeric input field).

user-issue-sk_test.ts

The test suite ensures the component reacts correctly to different state combinations. It mocks the global window.perf configuration and API responses to verify that:

  • Unauthorized users cannot see management icons (like the delete/close icon).
  • The “Add Issue” workflow correctly transitions through input and confirmation states.
  • The bug URL is formatted correctly based on the environment's bug_host_url.

Key Workflows

Adding a New Issue

The following diagram illustrates the logic flow when a logged-in user interacts with the “Add Issue” button:

[ Click "Add Issue" ] -> ( Show Input Field )
                               |
                        [ Enter Bug ID ]
                               |
                        [ Click Checkmark ]
                               |
                               V
                    { Check /_/user_issues }
                               |
             _________________/ \_________________
            |                                     |
    (Issue Exists?)                        (Issue Not Found?)
            |                                     |
    [ Set bug_id ]                        [ Call /_/triage/file_bug ]
            |                                     |
            |                               [ Receive New bug_id ]
            \_________________   ________________/
                              \ /
                               V
                    [ Call /_/user_issue/save ]
                               |
                  [ Dispatch 'user-issue-changed' ]
                               |
                    ( Update UI to Link View )

Removing an Issue

[ Click Close-Icon ] -> [ Call /_/user_issue/delete ]
                               |
                  [ Dispatch 'user-issue-changed' ]
                               |
                   [ Reset bug_id = 0, exists = false ]
                               |
                    ( Update UI to "Add" Button )

Module: /modules/window

Window Module

The window module provides type definitions for global configuration and utility functions for parsing build-specific metadata from the browser's global environment. It serves as the primary bridge between the backend-injected configuration and the frontend application logic.

Global Configuration Management

A central responsibility of this module is extending the global Window interface to include the perf property. This property holds the SkPerfConfig, which is typically populated by the server-side template during the initial page load.

By defining this in a centralized window.ts file, the project ensures type safety across all frontend modules when accessing environment-specific settings, such as the current image tag or deployment configuration.

Build Metadata Extraction

The module implements logic to parse image tags used in the Skia Perf infrastructure. This is necessary for displaying versioning information to developers and operators, allowing them to quickly identify which specific build of the software is currently running.

The implementation focuses on identifying three distinct deployment patterns:

  • Git-based builds: Identified by a tag:git- prefix. The logic extracts the first seven characters of the git hash to provide a short, recognizable revision identifier.
  • Louhi builds: Identified by the presence of a timestamp and the -louhi- string. These represent specific automated build pipeline outputs, where the logic extracts the specific build hash following the “louhi” marker.
  • Generic tags: Fallback for standard tags (e.g., tag:latest or tag:v1.0), where the prefix is stripped to show the human-readable label.

Extraction Workflow

The getBuildTag function follows a specific sequence to normalize the raw image string provided by the backend:

Raw Tag String (e.g., image@tag:git-12345...)
     |
     +--- Split by '@' to isolate the tag portion
     |           |
     |    [No '@' found] --> Return 'invalid'
     |
     +--- Check if starts with 'tag:'
     |           |
     |       [False] ------> Return 'invalid'
     |
     +--- Pattern Matching
         |
         |-- Starts with 'tag:git-'? ----> [type: 'git']   (7-char hash)
         |
         |-- Contains '-louhi-'? --------> [type: 'louhi'] (build hash)
         |
         |-- Else -----------------------> [type: 'tag']   (full tag value)
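The sequence above can be sketched as a small parser. The return shape (`{ type, tag }`) and the exact louhi-hash extraction are assumptions; the authoritative logic is in `window.ts`.

```typescript
// Hypothetical result shape for the tag parser.
interface BuildTag {
  type: 'git' | 'louhi' | 'tag' | 'invalid';
  tag: string;
}

function parseBuildTag(image: string): BuildTag {
  // Split by '@' to isolate the tag portion.
  const parts = image.split('@');
  if (parts.length < 2) return { type: 'invalid', tag: '' };
  const raw = parts[1];
  if (!raw.startsWith('tag:')) return { type: 'invalid', tag: '' };
  const value = raw.slice('tag:'.length);
  if (value.startsWith('git-')) {
    // Short, recognizable revision: first seven chars of the git hash.
    return { type: 'git', tag: value.slice(4, 11) };
  }
  const louhi = value.indexOf('-louhi-');
  if (louhi !== -1) {
    // Build hash follows the '-louhi-' marker (extraction is an assumption).
    return { type: 'louhi', tag: value.slice(louhi + '-louhi-'.length) };
  }
  // Generic tag: strip the prefix and show the human-readable label.
  return { type: 'tag', tag: value };
}
```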

Key Files

  • window.ts: Contains the global type augmentation for the Window object and the logic for getBuildTag. It imports SkPerfConfig from the JSON schema definitions to ensure the global state remains synchronized with the backend data structures.
  • window_test.ts: Validates the parsing logic against various real-world container image tag formats, ensuring that changes to the deployment pipeline's tagging convention do not break version reporting in the UI.

Module: /modules/word-cloud-sk

The word-cloud-sk module provides a specialized data visualization component designed to represent the distribution and frequency of key-value pairs within a dataset, such as a cluster of performance traces. Despite its name, it renders data as a structured table with integrated bar charts rather than a randomized “cloud” of text, prioritizing legibility and precise comparison of relative frequencies.

Core Responsibility

The primary role of this module is to take a collection of data points (values and their associated percentages) and render them in a format that allows users to quickly identify dominant traits within a selected group. It is specifically designed for the Skia Perf UI to show which metadata keys or configurations (e.g., arch=x86, config=8888) are most prevalent in a given performance cluster.

Design and Implementation

The module follows a declarative pattern using lit for rendering and ElementSk as a base class.

  • Data Structure: The component consumes an array of ValuePercent objects. Each object contains a value (string) and a percent (number). The choice of a percentage-based input simplifies the rendering logic, as the component does not need to calculate totals or handle raw counts; it assumes the data is pre-processed.
  • Visual Representation: The implementation uses a standard HTML <table> for layout. This ensures that labels remain aligned while the distribution is visualized via “micro-bars”—div elements whose widths are set directly to the percentage value.
  • Theming and Styling: The component supports multiple visual contexts (default colors, standard themes, and dark mode) by leveraging CSS variables. The border colors and bar backgrounds adjust dynamically based on the parent container’s class (e.g., .body-sk or .darkmode), ensuring consistent integration with the rest of the application’s UI.

Key Workflows

Data Binding and Rendering

When the items property is updated on the element, it triggers a re-render cycle. The component maps the data into table rows where the percentage is represented both numerically and visually.

[Data Update]
      |
      v
[setter: items(val)] -> updates private _items
      |
      v
[_render()] calls [WordCloudSk.template]
      |
      +--> [WordCloudSk.rows] maps items to:
           |
           +-- <td> {value} </td>             (The label)
           +-- <td> {percent}% </td>          (Numeric value)
           +-- <td> [---bar (width: Xpx)---] </td> (Visual representation)
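
The row-building logic can be sketched as follows. The real component renders with lit templates in TypeScript; the Go types and helpers below are illustrative only, showing how the fixed 100px bar column makes the percentage map 1:1 to pixels:

```go
package main

import "fmt"

// ValuePercent mirrors the component's input: a label and its
// pre-computed frequency as a percentage (0-100).
type ValuePercent struct {
	Value   string
	Percent int
}

// barWidthPx maps a percentage to a pixel width. Because the bar
// column is fixed at 100px, the mapping is 1:1 (clamped for safety).
func barWidthPx(percent int) int {
	if percent < 0 {
		return 0
	}
	if percent > 100 {
		return 100
	}
	return percent
}

// row renders one table row in the spirit of WordCloudSk.rows
// (illustrative; the real template is lit-html, not string concatenation).
func row(item ValuePercent) string {
	return fmt.Sprintf("<tr><td>%s</td><td>%d%%</td><td><div style=\"width: %dpx\"></div></td></tr>",
		item.Value, item.Percent, barWidthPx(item.Percent))
}

func main() {
	for _, it := range []ValuePercent{{"arch=x86", 80}, {"config=8888", 20}} {
		fmt.Println(row(it))
	}
}
```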

Components and Files

  • word-cloud-sk.ts: Contains the logic and template for the custom element. It handles property shadowing for items and manages the rendering of the table rows.
  • word-cloud-sk.scss: Defines the layout and theme-aware styling. It uses a fixed width for the percentage bars (100px) so that the percentage value maps 1:1 to a pixel width, providing a consistent scale across different instances.
  • word-cloud-sk-demo.ts/html: Provides a sandbox for testing the component in different CSS contexts (standard vs. themed), demonstrating how the component adapts to its environment.

Module: /nanostat

nanostat

nanostat is a command-line utility used to perform statistical comparisons between two sets of benchmark results generated by Skia's nanobench. It identifies whether performance changes (deltas) are statistically significant or merely the result of measurement noise.

High-Level Overview

In performance engineering, comparing the means of two benchmark runs is often insufficient because execution environments are noisy. A small change in the mean might be significant if the variance is low, while a large change might be statistically meaningless if the variance is high.

nanostat addresses this by applying statistical hypothesis testing to the raw samples collected during benchmarking. It consumes two JSON files (typically an “old” baseline and a “new” experimental run), performs a comparative analysis, and outputs a formatted table showing the magnitude of change alongside a p-value to indicate confidence.

Design Decisions and Implementation

Statistical Methodology

The tool focuses on the “why” of a performance change by evaluating the probability that the observed difference happened by chance.

  • Hypothesis Testing: By default, the tool uses the Mann-Whitney U test (via the samplestats package). This is a non-parametric test, meaning it does not assume the benchmark samples follow a normal distribution, which is ideal for performance data that often contains outliers or asymmetrical distributions. Users can optionally switch to a Welch’s T-test for normally distributed data.
  • Significance Threshold (Alpha): The tool uses a default alpha of 0.05. If the calculated p-value is greater than this threshold, the change is considered “insignificant” and is represented by a tilde (~) instead of a percentage delta to prevent developers from chasing ghosts in the noise.
  • Outlier Rejection: The --iqrr flag enables the Interquartile Range Rule to strip outliers from the sample sets before analysis. This is a design choice to provide a “cleaner” look at the core performance characteristics of the code, independent of transient system spikes.
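
For intuition, the Mann-Whitney U statistic can be sketched as below. This is a simplified illustration, not the samplestats implementation; it ranks the pooled samples (averaging ranks for ties) and omits the conversion from U to a p-value:

```go
package main

import (
	"fmt"
	"sort"
)

// mannWhitneyU returns the U statistic for sample a versus sample b.
func mannWhitneyU(a, b []float64) float64 {
	type obs struct {
		v     float64
		fromA bool
	}
	pooled := make([]obs, 0, len(a)+len(b))
	for _, v := range a {
		pooled = append(pooled, obs{v, true})
	}
	for _, v := range b {
		pooled = append(pooled, obs{v, false})
	}
	sort.Slice(pooled, func(i, j int) bool { return pooled[i].v < pooled[j].v })

	rankSumA := 0.0
	for i := 0; i < len(pooled); {
		// Find the run of tied values and assign each the average rank.
		j := i
		for j < len(pooled) && pooled[j].v == pooled[i].v {
			j++
		}
		avgRank := float64(i+j+1) / 2 // ranks are 1-based
		for k := i; k < j; k++ {
			if pooled[k].fromA {
				rankSumA += avgRank
			}
		}
		i = j
	}
	n1 := float64(len(a))
	return rankSumA - n1*(n1+1)/2
}

func main() {
	old := []float64{10.1, 10.2, 10.3}
	cur := []float64{12.0, 12.1, 12.2}
	fmt.Println(mannWhitneyU(old, cur)) // 0: every old sample beats every new one
}
```

No assumption of normality is made anywhere above, which is why the test tolerates the outliers and skew common in benchmark samples.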

Data Aggregation and Matching

nanostat doesn't just compare files line-by-line; it understands the structure of Skia benchmark data.

  • Parametric Matching: The tool uses paramtools and parser to group samples. It identifies benchmarks by their parameters (e.g., config, test, name). It automatically detects which parameters vary across the dataset and includes them in the output columns so the user can distinguish between different test configurations (e.g., gl vs gles).
  • Legacy Format Support: The implementation specifically leverages format.ParseLegacyFormat to maintain compatibility with the standard JSON output format used by nanobench.

Key Components and Workflow

Main Logic (main.go)

The core entry point manages the lifecycle of a comparison:

  1. Configuration: Parses CLI flags into a samplestats.Config object, defining the statistical “rules” for the session.
  2. Data Ingestion: Loads JSON files into parser.SamplesSet structures. Each set contains the raw execution times (samples) for every benchmark identified in the file.
  3. Analysis: Delegates the heavy lifting to samplestats.Analyze. This produces a set of “Rows,” where each row represents a unique benchmark found in both files.
  4. Formatting: The formatRows function dynamically determines which metadata (like config or arch) is relevant. If all results share the same arch, that column is hidden to reduce clutter; if they differ, it is shown.

Comparison Workflow

[ File A (Old) ]       [ File B (New) ]
       |                      |
       v                      v
[ Parse Samples ]      [ Parse Samples ]
       |                      |
       +----------+-----------+
                  |
        [ Match by Parameters ] (e.g., name, config)
                  |
        [ Apply Outlier Filter ] (Optional: IQRR)
                  |
        [ Run Statistical Test ] (Mann-Whitney U or T-Test)
                  |
        [ Filter by Significance ] (p < Alpha?)
                  |
                  v
[ Format Table: Mean, StdDev, Delta, p-value, Metadata ]
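
The optional outlier-filter step can be sketched as follows. This is a simplified illustration of the Interquartile Range Rule assuming linear-interpolation quartiles; the samplestats package may compute quartiles differently:

```go
package main

import (
	"fmt"
	"sort"
)

// quantile interpolates linearly between sorted samples.
func quantile(sorted []float64, q float64) float64 {
	pos := q * float64(len(sorted)-1)
	i := int(pos)
	if i+1 >= len(sorted) {
		return sorted[len(sorted)-1]
	}
	frac := pos - float64(i)
	return sorted[i]*(1-frac) + sorted[i+1]*frac
}

// rejectOutliersIQR drops samples outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR],
// the rule enabled by --iqrr.
func rejectOutliersIQR(samples []float64) []float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	q1 := quantile(s, 0.25)
	q3 := quantile(s, 0.75)
	iqr := q3 - q1
	lo, hi := q1-1.5*iqr, q3+1.5*iqr
	kept := make([]float64, 0, len(s))
	for _, v := range s {
		if v >= lo && v <= hi {
			kept = append(kept, v)
		}
	}
	return kept
}

func main() {
	samples := []float64{10, 10.1, 10.2, 10.1, 10.3, 55} // one transient system spike
	fmt.Println(rejectOutliersIQR(samples))
}
```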

Formatting Strategy

The tool uses a tabwriter to produce aligned, human-readable terminal output. A key implementation detail in formatRows is the calculation of the “Important Keys.” The tool scans all results and identifies which parameters (keys) have multiple values. It then prioritizes these keys in the output string, ensuring that the user sees exactly what differentiates one row from another without redundant information.
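
A minimal sketch of that scan, assuming each row carries its parameters as a string map (function and field names here are illustrative, not the actual formatRows internals):

```go
package main

import (
	"fmt"
	"sort"
)

// importantKeys returns the parameter keys that take more than one
// value across all result rows; constant keys are hidden from output.
func importantKeys(rows []map[string]string) []string {
	seen := map[string]map[string]bool{}
	for _, params := range rows {
		for k, v := range params {
			if seen[k] == nil {
				seen[k] = map[string]bool{}
			}
			seen[k][v] = true
		}
	}
	var keys []string
	for k, values := range seen {
		if len(values) > 1 {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	rows := []map[string]string{
		{"arch": "arm64", "config": "gl", "test": "draw"},
		{"arch": "arm64", "config": "gles", "test": "draw"},
	}
	// arch and test are constant across rows, so only config differentiates them.
	fmt.Println(importantKeys(rows))
}
```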

Module: /nanostat/testdata

The /nanostat/testdata module serves as the regression testing suite and ground-truth repository for the nanostat tool. It contains paired performance benchmark results and the corresponding expected output (golden files) used to verify the tool's statistical analysis, formatting, and filtering logic.

Purpose and Design

The data in this directory is structured to facilitate end-to-end testing of how nanostat processes nanobench JSON output. The primary goal is to ensure that the statistical comparisons (Mann-Whitney U tests, p-values, and percentage deltas) remain accurate across code changes.

The design relies on “Golden File Testing”:

  1. Input: Two JSON files representing “old” and “new” benchmark runs.
  2. Process: nanostat compares these files using various flags (e.g., sorting, significance thresholds).
  3. Validation: The output is compared against a .golden file to ensure the results match expectations down to the whitespace and p-value calculation.

Key Components

Benchmark Input Files

  • nanobench_old.json / nanobench_new.json: These are the primary data sources. They contain nested JSON structures where each key represents a specific test case (e.g., desk_googledocs.skp_1_1000_1000).
  • Samples: Each test case contains an array of “samples” (execution times in milliseconds). nanostat uses these raw sample arrays to calculate the mean, standard deviation, and statistical significance of the change between the “old” and “new” datasets.

Golden Files (.golden)

These files represent the expected CLI output under different operational modes. They contain fixed-width tables with columns for baseline (old), current (new), percentage delta, statistical significance (p-value and sample size), and test identification.

  • all.golden: Expected output when showing all results regardless of statistical significance. Includes cases where the delta is negligible (marked with ~).
  • test.golden: The standard output reflecting significant changes.
  • iqrr.golden: Expected output when Interquartile Range (IQR) filtering or robust statistical methods are applied, resulting in different sample sizes (n) and p-values compared to the standard test.
  • sort.golden: Verifies the sorting logic, ensuring results are ordered correctly (e.g., by test name or delta magnitude).
  • nochange.golden: The expected response when no statistically significant differences are found between the two datasets, confirming that the “noise floor” logic works.

Workflow: Statistical Verification

The data in this module illustrates how raw performance samples are transformed into human-readable insights:

[ nanobench_old.json ]       [ nanobench_new.json ]
        |                            |
        |          Comparison        |
        +-------------+--------------+
                      |
            [ Mann-Whitney U Test ]
            [  Mean / StdDev Calc ]
                      |
                      v
            [ Formatting & Filtering ] ----> (Compare against .golden)

Data Characteristics

The test data specifically includes varied scenarios to stress-test the tool:

  • Significant Regressions: Large positive deltas (e.g., ~52% in desk_googleimagesearch).
  • Significant Improvements: Negative deltas (e.g., -4% in desk_googledocs).
  • Statistical Noise: High variance samples (e.g., ± 15%) which result in different p-values and significance classifications.
  • Metadata: Examples of non-timing data like max_rss_mb and sksl_compiler bytes to verify that the tool handles different units of measurement correctly.

Module: /pages

The /pages directory serves as the entry point layer for the Perf application's web interface. It defines the structure, styling, and composition of individual HTML pages by orchestrating high-level custom elements (defined in /modules) into functional views.

Design Philosophy

The module follows a “Thin Page” architecture. Rather than containing complex business logic, each page acts as a declarative shell. This approach ensures:

  1. Consistency: Every page utilizes the perf-scaffold-sk component, providing a unified navigation, header, and footer across the entire application.
  2. Modularity: Page-specific logic is encapsulated within specialized “page-level” custom elements (e.g., explore-sk, alerts-page-sk). This makes the pages easy to maintain and the components easy to test in isolation.
  3. Data Injection: Pages serve as the bridge between the Go backend and the frontend. They use Go templating to inject configuration data into the window.perf object, allowing the TypeScript modules to access instance-specific context (like git_repo_url or Nonce for security) immediately upon load.

Key Components and Workflows

Page Composition

Most files in this directory follow a strict triplet pattern:

  • .html: Defines the DOM structure, typically wrapping a single major functional element inside a scaffold.
  • .ts: Handles the side-effect of importing the necessary custom element definitions so they are registered with the browser.
  • .scss: Imports global styles (like body.scss) and applies page-specific layout tweaks.

Data Flow and Initialization

When a user navigates to a Perf page, the following initialization process occurs:

[ Backend ] -> Injects JSON context into <script>
      |
      v
[ HTML Page ] -> Renders <perf-scaffold-sk>
      |
      +--------> Sets window.perf = { ... } (Context)
      |
      v
[ TS Entry ] -> Imports Modules -> Custom Elements Registered
      |
      v
[ Browser ]  -> Upgrades <page-element-sk> -> Fetches data using window.perf
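
The injection step can be sketched with Go's html/template. perfConfig below is an illustrative subset of the real SkPerfConfig, not its actual schema:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"html/template"
)

// perfConfig is a hypothetical subset of the instance configuration
// that the backend serializes into window.perf.
type perfConfig struct {
	GitRepoURL string `json:"git_repo_url"`
	Nonce      string `json:"nonce"`
}

const page = `<script nonce="{{.Nonce}}">window.perf = {{.ConfigJSON}};</script>`

// renderPage executes the template with the serialized config, making
// the context available to TypeScript modules immediately on load.
func renderPage(cfg perfConfig) string {
	b, _ := json.Marshal(cfg)
	t := template.Must(template.New("page").Parse(page))
	var out bytes.Buffer
	t.Execute(&out, map[string]interface{}{
		"Nonce":      cfg.Nonce,
		"ConfigJSON": template.JS(b), // already-serialized JSON; skip JS escaping
	})
	return out.String()
}

func main() {
	fmt.Println(renderPage(perfConfig{
		GitRepoURL: "https://skia.googlesource.com/skia",
		Nonce:      "abc123",
	}))
}
```

Marking the payload as template.JS is the key detail: html/template would otherwise escape the JSON in the script context.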

Core Pages

  • Exploration (newindex.ts, multiexplore.ts): The primary interfaces for data visualization. newindex hosts the main explore-sk component for deep-diving into individual traces, while multiexplore allows for side-by-side comparisons. These pages also include sidebar help for keyboard and mouse navigation (Zoom/Pan/Delta).
  • Alerting & Triage (alerts.ts, triage.ts, regressions.ts): These pages manage the lifecycle of performance anomalies. They wrap components that allow users to view configured alerts, triage new regressions, and track existing performance issues.
  • Analysis Tools (clusters2.ts, dryrunalert.ts, playground.ts): Focused on the statistical side of Perf. The “Playground” is specifically designed for experimenting with anomaly detection algorithms on sample data without affecting production configurations.
  • Metadata & Info (revisions.ts, favorites.ts, help.ts): Support pages that provide context. The help.html page is unique as it uses Go templates to dynamically iterate over and describe available query functions (.Funcs) directly from the backend documentation.

Shared Assets and Styles

The body.scss file provides a shared CSS baseline, resetting margins and paddings to ensure the scaffold occupies the full viewport. The BUILD.bazel file manages the distribution of static assets (like SVG icons for various platforms like Chrome, V8, and Fuchsia) to the /dist path, ensuring they are available for the UI regardless of which specific page is loaded.

Module: /res

High-Level Overview

The /res module is the core resource management hub for the application. It serves as the single source of truth for all non-programmatic assets, including user interface layouts, string constants, graphical elements, and configuration values.

The primary design philosophy of this module is the decoupling of UI definition from application logic. By externalizing these resources, the system achieves two critical architectural goals:

  1. Independent Maintenance: Developers can modify the look and feel, update text, or swap images without altering the underlying source code (e.g., Java, Kotlin, or C++ files).
  2. Configuration Switching: The module is structured to support automatic resource selection based on the runtime environment (e.g., screen density, language, or device orientation), allowing a single binary to adapt to diverse hardware and locales.

Design Decisions and Implementation Choices

  • Declarative UI (XML): The choice to use XML for layouts and values allows for a declarative approach to UI construction. This simplifies the development process by allowing visual structures to be defined hierarchically, which is more intuitive for layout management than imperative code.
  • Unique Identifier Mapping (The R Class): To bridge the gap between static files and executable code, the build system maps every resource in this directory to a unique integer ID. This allows logic files to reference resources via a typesafe “alias” (e.g., R.layout.main) rather than error-prone string paths.
  • Strict Submodule Categorization: The module enforces a rigid directory structure (e.g., layout/, values/, drawable/). This design choice ensures that the resource compiler can optimize asset processing (like shrinking unused images or pre-compiling XML) and provides a predictable mental model for developers.
  • Localization-First Architecture: By centralizing strings in values/, the project is architected for global deployment. Translating the application requires adding a qualified subdirectory (e.g., values-es/) rather than refactoring code.

Key Components and Responsibilities

The /res module is partitioned into specialized subdirectories, each managing a specific aspect of the application's presentation layer:

  • Interface Blueprints (/layout): Responsible for defining the structural arrangement of the UI. These files determine where components are placed and how they behave in relation to one another.
  • Graphic Assets (/drawable & /mipmap): Manage visual content. /drawable handles standard UI graphics (vectors, bitmaps, shapes), while /mipmap is specifically reserved for launcher icons to ensure they are available at the highest possible resolution regardless of the device's default density.
  • Constant Definitions (/values): A central repository for simple value resources and styles. It typically contains:
    • strings.xml: All user-facing text.
    • colors.xml: The application's color palette.
    • styles.xml: Reusable UI attribute sets that ensure visual consistency.
  • Raw Data and Interaction (/raw & /menu): /raw holds arbitrary files (like audio or JSON config) that are needed in their original format, while /menu defines the structure of navigation and context menus.

Workflow: Resource Resolution

The following diagram demonstrates how the application retrieves a resource at runtime, highlighting the “Config-Aware” selection process:

[ App Logic ]           [ R.java / ID ]          [ /res System ]          [ Active Config ]
      |                        |                        |                        |
      | 1. Request Asset       |                        |                        |
      |---(e.g. R.string.ok)-->|                        |                        |
      |                        | 2. Lookup ID           |                        |
      |                        |----------------------->|                        |
      |                        |                        | 3. Query State         |
      |                        |                        |----------------------->|
      |                        |                        |                        | 4. Match (e.g. Locale=FR)
      |                        |                        |<-----------------------|
      | 5. Return Value        |                        |                        |
      |<---("D'accord")--------|                        |                        |

When a change is made to a resource, the build system automatically updates the reference mapping. This ensures that the application logic remains stable even as the visual or textual content of the /res module evolves.

Module: /res/img

High-Level Overview

The /res/img directory serves as the centralized repository for static image assets used across the application's user interface. Rather than being a mere storage bucket, this module is organized to ensure that brand identity elements and UI-specific graphics are decoupled from the application logic and styling code.

The primary design goal for this module is asset consistency. By centralizing images here, the project avoids duplication, simplifies path resolution within stylesheets and components, and ensures that updates to visual branding (such as logos or icons) propagate throughout the entire system from a single source of truth.

Design Decisions and Implementation Choices

The architecture of this module favors a flat or shallow hierarchy to minimize path complexity in imports. The choice of file formats follows standard web optimization practices:

  • Vector assets (SVG): Used for icons and logos to ensure scalability across high-DPI (Retina) displays without loss of quality or increased file size.
  • Raster assets (PNG/JPG): Reserved for complex photographic content where vectorization is impractical.
  • Specialized formats (ICO): Utilized specifically for browser-level integration (favicons) where legacy compatibility or specific metadata is required.

By isolating these assets into /res/img, the project implements a “Resource-Based” separation of concerns. This allows developers to reference assets via consistent aliases or relative paths, reducing the risk of broken links during refactoring of component or page structures.

Key Components and Responsibilities

The directory is categorized by the functional role of the images rather than just their file types:

  • Identity and Branding: Contains core visual identifiers like the company logo and wordmarks. These are the most critical assets, as they are often referenced in global headers, footers, and splash screens.
  • Interface Iconography: Includes small-scale visual cues used to enhance navigation or provide feedback (e.g., arrows, status indicators). These are typically optimized for fast loading and uniform styling.
  • Meta Assets (favicon.ico): Specifically responsible for the application's presence outside the viewport, such as browser tabs, bookmark bars, and shortcut icons. The inclusion of favicon.ico in this directory ensures that the “brand at a glance” is managed alongside other visual resources.

Workflow: Asset Consumption

The following diagram illustrates how an asset moves from this module into the rendered application:

[ /res/img/ ]                    [ Styles/Components ]           [ Client Browser ]
      |                                   |                             |
      | 1. Asset Definition               |                             |
      |---- logo.svg -------------------->|                             |
      |                                   | 2. URL Reference            |
      |                                   | (e.g., background-image)    |
      |                                   |---------------------------->|
      |                                   |                             | 3. Fetch & Render
      |                                   |                             |<-- [ GET /res/img/logo.svg ]

When a visual change is required, the workflow focuses on replacing the file within /res/img while maintaining the filename. This allows the application to update its visual state without requiring changes to CSS selectors or JSX/HTML templates.

Module: /samplevariance

Sample Variance Analysis

The samplevariance module provides a command-line tool designed to analyze the stability and noise levels of performance benchmarks. Specifically, it processes “nanobench” results stored in Google Cloud Storage (GCS), where each benchmark run typically contains multiple samples (e.g., 10 repetitions).

By calculating the ratio between the median and the minimum values of these samples, the tool helps engineers identify “flaky” or high-variance benchmarks that may yield inconsistent performance data.

Design and Implementation Logic

The tool is built as a high-throughput data processing pipeline that fetches, parses, and analyzes large sets of JSON telemetry data.

Data Model and Metrics

The core metric used is the ratio of median to minimum.

  • Minimum: Represents the best possible performance (least noise).
  • Median: Represents the typical performance.
  • Ratio: A higher ratio indicates higher variance or “noise” within a single benchmark run.

The sampleInfo struct captures these metrics alongside the traceid, which uniquely identifies the specific benchmark configuration (e.g., test name, device, OS).
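
A minimal sketch of the metric calculation (the real tool delegates the statistics to go-moremath; field and function names here are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// sampleInfo mirrors the struct described above.
type sampleInfo struct {
	TraceID string
	Min     float64
	Median  float64
	Ratio   float64
}

// analyze computes the median/min noise ratio for one benchmark's samples.
func analyze(traceID string, samples []float64) sampleInfo {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	min := s[0] // best possible performance (least noise)
	var median float64
	if n := len(s); n%2 == 1 {
		median = s[n/2]
	} else {
		median = (s[n/2-1] + s[n/2]) / 2
	}
	return sampleInfo{TraceID: traceID, Min: min, Median: median, Ratio: median / min}
}

func main() {
	quiet := analyze(",test=draw,arch=arm64,", []float64{10, 10.1, 10.2})
	noisy := analyze(",test=blur,arch=arm64,", []float64{10, 15, 30})
	fmt.Printf("quiet ratio=%.2f noisy ratio=%.2f\n", quiet.Ratio, noisy.Ratio)
}
```

A ratio near 1.0 means the run is stable; the further above 1.0, the noisier the benchmark.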

Parallel Processing Workflow

To handle thousands of JSON files efficiently, the tool employs a worker pool pattern using golang.org/x/sync/errgroup.

  1. Discovery: It lists all objects in a specified GCS bucket prefix (defaulting to the previous day's data).
  2. Distribution: Filenames are pushed into a thread-safe channel.
  3. Analysis (Concurrent Workers): A pool of 64 workers pulls filenames from the channel. Each worker:
    • Downloads the JSON file from GCS.
    • Parses the legacy Perf format.
    • Filters traces based on user-supplied criteria (e.g., specific hardware or test types).
    • Calculates the min, median, and ratio for each matching trace.
  4. Aggregation: Results are collected into a shared slice, protected by a mutex.
  5. Output: After all workers finish, the tool sorts the results by the highest ratio (most noisy) and exports them as a CSV.

GCS Bucket ----> List Files ----> [Filename Channel]
                                       |
          +----------------------------+----------------------------+
          |                            |                            |
      [Worker 1]                   [Worker 2]                  [Worker n]
    Download & Parse             Download & Parse             Download & Parse
          |                            |                            |
          +----------------------------+----------------------------+
                                       |
                                [Mutex Protected]
                                       |
                                [Global Slice] ----> Sort ----> CSV Output
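
The pipeline above can be sketched with the standard library alone. The real tool uses golang.org/x/sync/errgroup for error propagation and runs 64 workers against GCS; this illustration just fans filenames out to workers that append into a mutex-protected slice:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// processAll distributes filenames to a worker pool and collects one
// result per file; process stands in for "download, parse, analyze".
func processAll(filenames []string, workers int, process func(string) float64) []float64 {
	ch := make(chan string)
	var (
		mu      sync.Mutex
		results []float64
		wg      sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range ch {
				r := process(name)
				mu.Lock()
				results = append(results, r)
				mu.Unlock()
			}
		}()
	}
	for _, name := range filenames {
		ch <- name
	}
	close(ch)
	wg.Wait()
	// Sort by highest ratio (most noisy) before output.
	sort.Sort(sort.Reverse(sort.Float64Slice(results)))
	return results
}

func main() {
	files := []string{"a.json", "bb.json", "ccc.json"}
	ratios := processAll(files, 4, func(name string) float64 { return float64(len(name)) })
	fmt.Println(ratios)
}
```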

Key Components

File Processing (main.go)

  • initialize(): Handles command-line flags and sets up GCS clients and output writers. It defaults to a rolling 24-hour window if no prefix is provided.
  • filenamesFromBucketAndObjectPrefix(): Uses an attribute-selection query to fetch only the names of files, reducing metadata overhead.
  • traceInfoFromFilename(): The core logic unit. It integrates with perf/go/ingest/parser to extract raw sample values and uses the go-moremath library for statistical calculations.
  • writeCSV(): Formats the final report, supporting truncation via the --top flag to focus only on the most problematic benchmarks.
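
A minimal sketch of the sort-and-truncate reporting step (column names are illustrative, and the real writeCSV emits more fields):

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"sort"
)

// result is a trimmed-down stand-in for the tool's sampleInfo.
type result struct {
	TraceID string
	Ratio   float64
}

// report sorts results by descending ratio (noisiest first) and
// truncates to the top N rows, mirroring the --top flag.
func report(results []result, top int) string {
	sort.Slice(results, func(i, j int) bool { return results[i].Ratio > results[j].Ratio })
	if top > 0 && top < len(results) {
		results = results[:top]
	}
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	w.Write([]string{"traceid", "ratio"})
	for _, r := range results {
		w.Write([]string{r.TraceID, fmt.Sprintf("%.2f", r.Ratio)})
	}
	w.Flush()
	return buf.String()
}

func main() {
	fmt.Print(report([]result{{"t1", 1.1}, {"t2", 2.5}, {"t3", 1.7}}, 2))
}
```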

Filtering and Querying

The tool leverages the common go/query package. This allows users to pass complex filters via the --filter flag using a URL-query-like syntax (e.g., arch=arm64&config=8888). Only traces matching these key-value pairs are included in the variance analysis.
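
A sketch of how such a filter can be evaluated against a trace's parameters, using the standard library's query parser (the real go/query package supports richer matching than this simple equality check):

```go
package main

import (
	"fmt"
	"net/url"
)

// matches reports whether a trace's params satisfy a --filter value in
// URL-query syntax (e.g. "arch=arm64&config=8888").
func matches(filter string, params map[string]string) (bool, error) {
	q, err := url.ParseQuery(filter)
	if err != nil {
		return false, err
	}
	for key, wanted := range q {
		ok := false
		for _, w := range wanted { // multiple values for one key act as OR
			if params[key] == w {
				ok = true
				break
			}
		}
		if !ok {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	params := map[string]string{"arch": "arm64", "config": "8888", "test": "draw"}
	ok, _ := matches("arch=arm64&config=8888", params)
	fmt.Println(ok)
}
```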

Execution Control

The module includes a Makefile that simplifies common operations, such as running the tool against specific GCS paths or piping the results directly to temporary files for quick inspection.

Module: /scripts

The /scripts module provides a collection of administrative and developer tools designed to manage the lifecycle of data within the Perf ecosystem. This includes seeding local environments for development, migrating production-scale data for safe experimentation, and manually triggering the ingestion pipeline.

High-Level Overview

The module serves three primary purposes:

  1. Environment Initialization: Facilitating the setup of demo or local development environments with realistic data schemas and sample alerts.
  2. Data Portability: Enabling the high-performance migration of data from production Google Cloud Spanner instances to experimental environments.
  3. Data Ingestion Management: Providing bridges to upload performance data into Google Cloud Storage (GCS) in a format compatible with the Perf ingestion engine.

Design Decisions and Implementation Choices

Bulk Data Migration via Protocol Bridging

A significant portion of this module is dedicated to moving data between Spanner instances. The implementation chooses PGAdapter over native Spanner SDKs for data movement. This decision allows the migration logic to treat Spanner as a PostgreSQL-compatible database, enabling the use of the COPY protocol via the pgx library. This is significantly more efficient than standard INSERT statements for bulk data, as it streams raw data directly into the database's ingestion buffer.

To handle the multi-terabyte scale of production tables like tracevalues, the implementation avoids direct time-based filtering on the values themselves, which would be prohibitively slow. Instead, it utilizes a JOIN on the sourcefiles table, filtering by the file's creation timestamp to isolate a manageable subset of data for migration.

Safety and Data Integrity

The scripts incorporate safeguards to prevent destructive operations:

  • Production Guards: Migration scripts include hardcoded checks to prevent accidental overwrites of known production instances.
  • Idempotency: Before initiating a transfer, the tools check for existing data within the target time range. If data is found, the process halts to prevent duplication.
  • Atomic Optimization: The migration uses PARTITIONED_NON_ATOMIC DML mode on destination instances. This allows Spanner to handle massive bulk operations across multiple partitions without hitting transaction size limits.

Local Development Seeding

For local development, the module provides automated SQL seeding. Rather than relying on manual database entry, scripts use PostgreSQL Here Documents to inject complex JSON structures (like Alert configurations) into local databases. This ensures that developers can quickly replicate specific bug states or UI layouts with consistent data.

Key Components and Workflows

Data Migration Tooling (copy_data_to_experimental_db)

This sub-module manages the complex flow of data from production to development environments. It coordinates the lifecycle of PGAdapter containers and the Go-based streaming logic.

[ Production Spanner ]          [ Experimental Spanner ]
          |                               |
          | (Spanner Protocol)            | (Spanner Protocol)
          v                               v
   [ PGAdapter :5432 ]             [ PGAdapter :5433 ]
          |                               |
          +----[ copy_data (Go Binary) ]--+
          |           (PostgreSQL Protocol)
          |
          1. Query source (via JOIN on sourcefiles)
          2. Stream rows via CopyFrom interface
          3. Apply to destination with Partitioned DML

Database Seeding (add_demo_alert_to_demo_db.sh)

This script automates the population of the Alerts and Subscriptions tables. It is designed to create a “ready-to-use” state for the Perf UI by:

  • Defining a standard Alert JSON payload that covers common regression detection parameters (e.g., stepfit algo, absolute step).
  • Linking alerts to subscriptions to verify notification workflows.
  • Using EXTRACT(EPOCH FROM NOW()) to ensure time-sensitive fields are current, preventing immediate expiration of demo data.

Ingestion Bridge (upload_extracted_json_files.sh)

This utility bridges local performance test results to the cloud ingestion pipeline. It enforces a strict directory structure required by the Perf ingester:

  • Path Logic: It uploads files to gs://skia-perf/nano-json-v1/YYYY/MM/DD/HH.
  • Temporal Organization: By forcing a date-based hierarchy, it ensures that the ingester can process files in chronological order and prevents any single directory from becoming a performance bottleneck in GCS.

Module: /scripts/copy_data_to_experimental_db

Copy Data to Experimental DB

This module provides a utility for copying data from a production Google Cloud Spanner database to an experimental or development Spanner instance. It is designed to facilitate testing and debugging with real-world data volumes and distributions without risking the integrity of production environments.

Overview

The migration process leverages PGAdapter, a proxy that allows Cloud Spanner to be accessed via the PostgreSQL wire protocol. By running two instances of PGAdapter, the migration script can treat both the source (Production) and the destination (Experimental) as standard PostgreSQL databases, using the pgx library's efficient CopyFrom functionality to stream data between them.

Design Decisions and Implementation Choices

PGAdapter as a Bridge

Rather than using Spanner-specific SDKs for row-by-row manipulation, the module uses PGAdapter to expose Spanner through a PostgreSQL interface. This allows the use of the COPY protocol, which is significantly faster for bulk data movement than individual INSERT statements.

Data Safety and Idempotency

To prevent accidental corruption of production or existing experimental data:

  • Source Instance Protection: The run_two_spanners.sh script includes hardcoded checks to prevent known production instances from being targeted as the destination.
  • Duplicate Detection: Before copying, copy_data.go checks if the destination table already contains data for the requested time range. If data exists, the script aborts for that table to avoid duplication.
  • Service Account Scoping: The documentation recommends using a service account with read-only permissions on the source to enforce security at the IAM level.

Large Scale Data Handling (TraceValues)

The tracevalues table in Perf is typically massive (multi-terabyte). Standard time-based filtering on this table is inefficient. To address this, the script implements a specific optimization:

  • It performs a JOIN with the sourcefiles table.
  • It filters records based on the createdat timestamp of the source file rather than the trace value itself, allowing for manageable subsets of data to be migrated based on ingestion time.

Partitioned DML

The script sets SPANNER.AUTOCOMMIT_DML_MODE='PARTITIONED_NON_ATOMIC' on the destination connection. This is a Spanner-specific optimization for bulk operations that allows the database to execute changes across multiple partitions independently, avoiding the overhead of a single massive transaction that would exceed Spanner's mutation limits.

Key Components and Files

run_two_spanners.sh

This bash script manages the environment setup. It launches two Docker containers running PGAdapter:

  • Source Port (5432): Connects to the production instance (defaulting to chrome_int).
  • Destination Port (5433): Connects to the user-specified experimental instance.

copy_data.go

The core logic of the migration. It is responsible for:

  • Mapping: Maintaining the tableToColumns map, which defines the schema for the Perf tables being copied (e.g., regressions2, commits, postings).
  • Streaming: Implementing the pgx.CopyFromSource interface to pipe rows directly from the source query results into the destination's CopyFrom command.
  • Filtering: Applying duration-based filters (e.g., “last 7 days”) to the SQL queries to limit the volume of data moved.
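
The streaming step hinges on pgx's `CopyFromSource` interface (`Next`/`Values`/`Err`). The sketch below defines an equivalent interface locally so it is self-contained, and feeds it from an in-memory slice; in copy_data.go the rows come from the source query's result set instead:

```go
package main

import "fmt"

// copyFromSource mirrors the shape of pgx.CopyFromSource; it is declared
// locally here so the sketch compiles without the pgx dependency.
type copyFromSource interface {
	Next() bool
	Values() ([]any, error)
	Err() error
}

// rowSource adapts a slice of rows to the iterator interface.
type rowSource struct {
	rows [][]any
	idx  int
}

func (r *rowSource) Next() bool {
	r.idx++
	return r.idx <= len(r.rows)
}

func (r *rowSource) Values() ([]any, error) { return r.rows[r.idx-1], nil }
func (r *rowSource) Err() error             { return nil }

func main() {
	var src copyFromSource = &rowSource{rows: [][]any{
		{"trace-1", 1, 0.25},
		{"trace-2", 1, 0.31},
	}}
	n := 0
	for src.Next() {
		vals, _ := src.Values()
		fmt.Println(vals) // destination's CopyFrom would consume these
		n++
	}
	fmt.Println("copied", n, "rows")
}
```

Because the iterator yields one row at a time, the full result set never has to be buffered in memory, which is what makes the approach viable for multi-terabyte tables.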

BUILD.bazel

Defines the Go binary and library dependencies. Notably, it links to //perf/go/sql/spanner, ensuring that the script uses the same column definitions as the main Perf application.

Workflow Process

The following diagram illustrates how data flows from the production Spanner instance to the experimental one through the proxy layer:

[ Production Spanner ]       [ Experimental Spanner ]
          |                            |
          | (Spanner Protocol)         | (Spanner Protocol)
          |                            |
   [ PGAdapter :5432 ]          [ PGAdapter :5433 ]
          |                            |
          +-------< copy_data.go >-----+
                (PostgreSQL Protocol)
        1. Query source for recent data
        2. Stream rows via CopyFrom interface
        3. Insert into destination

Usage Logic

The migration follows a three-step logic:

  1. Schema Preparation: The user must manually ensure the destination table exists (using DDL from the production console or the codebase).
  2. Proxy Initialization: Launching the containers to map the remote Spanner instances to local ports.
  3. Execution: Running the Go binary with a specified --duration. If --table-name all is used, the script iterates through all known tables in the tableToColumns map; otherwise, it targets a specific table.
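
Step 3 can be sketched as a small flag-handling routine. The `--duration` and `--table-name` flags and the table names come from the description above; the helper and default values are illustrative, and the real copy_data.go derives the list from tableToColumns:

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

var (
	duration  = flag.Duration("duration", 7*24*time.Hour, "how far back (by sourcefile createdat) to copy")
	tableName = flag.String("table-name", "all", "a single table to copy, or 'all'")
)

// tablesToCopy expands "all" into every known table; stand-in for the
// keys of tableToColumns.
func tablesToCopy(name string) []string {
	if name != "all" {
		return []string{name}
	}
	return []string{"commits", "postings", "regressions2", "tracevalues"}
}

func main() {
	flag.Parse()
	for _, t := range tablesToCopy(*tableName) {
		fmt.Printf("would copy %s rows ingested in the last %v\n", t, *duration)
	}
}
```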

Module: /secrets

Perf Secrets Management

The /secrets module provides a set of automated tools for managing the sensitive credentials, service accounts, and OAuth tokens required by various Skia Perf components. It ensures that services—such as data ingestion, email alerting, and database backups—have the necessary permissions to interact with Google Cloud Platform (GCP) resources and external APIs securely.

Core Responsibilities

The module is designed to handle three primary types of sensitive information:

  1. GCP Service Accounts: Provisioning accounts with specific IAM roles and linking them to Kubernetes via Workload Identity or static keys.
  2. Email Authentication: Facilitating the OAuth2 “Three-Legged Flow” to allow Perf to send alerts via Gmail.
  3. Cloud Resource Permissions: Granting specific access to GCS buckets (e.g., skia-perf), Pub/Sub topics, and Cloud Trace.

Design Patterns and Implementation

Service Account Provisioning

The module heavily relies on automated scripting to ensure reproducible and consistent permission sets. Most scripts (e.g., create-perf-ingest-sa.sh, create-perf-sa.sh) follow a standardized workflow:

  • Infrastructure as Code (IaC) approach: Scripts define the exact roles (e.g., roles/pubsub.editor, roles/cloudtrace.agent) required for a service to function.
  • Workload Identity: Where possible, scripts configure IAM policy bindings to allow Kubernetes service accounts to impersonate GCP service accounts. This removes the need for long-lived JSON keys, adhering to the principle of least privilege and improving security.
  • Ramdisk Usage: Scripts utilize ../bash/ramdisk.sh to perform sensitive operations in memory, ensuring that temporary secret files or JSON keys are never written to physical disk.

Email Alerting Secrets

The create-email-secrets.sh script manages the complex process of authorizing Perf to send emails. It bridges the gap between Google’s OAuth2 requirements and Kubernetes secrets:

  • Interaction: It prompts the user to provide a client_secret.json (obtained from the GCP Console) and then executes a local Go tool (three_legged_flow) to generate a client_token.json.
  • Normalization: It normalizes email addresses into a format suitable for Kubernetes secret names (e.g., converting @ and . to -).
  • Ephemeral Tokens: It immediately deletes the token file from the local environment after injecting it into the cluster to prevent accidental leakage of refresh tokens.
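
The normalization step is a simple character substitution. Sketched here in Go for illustration (the script itself does this in bash):

```go
package main

import (
	"fmt"
	"strings"
)

// secretNameFromEmail normalizes an email address into a string that is
// legal in a Kubernetes secret name: '@' and '.' both become '-'. This
// mirrors the substitution create-email-secrets.sh performs.
func secretNameFromEmail(email string) string {
	return strings.NewReplacer("@", "-", ".", "-").Replace(email)
}

func main() {
	fmt.Println(secretNameFromEmail("perf-alerts@example.com"))
	// perf-alerts-example-com
}
```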

Key Workflows

Provisioning a New Service Account

The following diagram illustrates the lifecycle of a service account creation within this module:

Local Script Execution
       |
       v
[ RAMDISK Creation ] --------> [ GCP IAM API ]
       |                              |
       |                              +-- Create Service Account
       |                              +-- Assign IAM Roles (PubSub, GCS, Trace)
       v                              +-- Bind Workload Identity (K8s <-> GCP)
[ Generate JSON Key ] (Optional)
       |
       v
[ kubectl create secret ] ----> [ Kubernetes Cluster ]
       |
       v
[ RAMDISK Cleanup ]

Component Summary

  • Email Alerting (create-email-secrets.sh): Handles OAuth2 token generation for Gmail integration. Specifically creates secrets for the alertserver.
  • Ingestion & Backend (create-perf-ingest-sa.sh, create-perf-sa.sh): Configures permissions for the core Perf processes to read from GCS buckets (skia-perf, cluster-telemetry-perf) and write to Pub/Sub for data processing.
  • Specialized Accounts:
    • create-flutter-perf-service-account.sh: Tailored permissions for the Flutter-specific Perf instance.
    • create-perf-cockroachdb-backup-service-account.sh: Minimalist account with roles/storage.objectAdmin specifically for database backup cronjobs.

Module: /smoke_tests

Perf Smoke Tests

The smoke_tests module provides a suite of high-level integration tests for the Perf application. These tests use Puppeteer to automate a headless (or headed) Chrome browser, simulating real user interactions to ensure that critical pages and components load correctly and function as expected in a live environment.

Design Philosophy

The primary goal of these tests is to verify the “health” of the system rather than to exhaustively test features. They focus on:

  • End-to-End Connectivity: Ensuring the web server, authentication layers (like IAP or auth-proxy), and database backends (like Spanner) are working together.
  • Performance Budgeting: Many tests enforce a timeout (typically 5 seconds) for page loads to ensure the UI remains responsive.
  • Visibility & Debugging: In case of failure, the system automatically captures screenshots and logs browser console output, network responses, and request failures to aid in rapid triaging.

Key Components and Workflows

Authentication and Authorization

Most tests interact with instances protected by Google Identity-Aware Proxy (IAP) or a local auth-proxy.

  • IAP Authentication: Tests like alerts_nodejs_test.ts and cluster_nodejs_test.ts use the google-auth-library to fetch an ID token. This token is injected into the Puppeteer page's extra HTTP headers, allowing the automated browser to bypass the IAP login screen.
  • Local Proxy: By default, tests target http://localhost:8003. The utils.ts file manages the PERF_BASE_URL and provides helper functions to apply standard test configurations (like cookies and logging).

Test Environment Lifecycle

The module supports different execution modes depending on the developer's needs:

  1. Standard Headless Execution: Used in CI/CD and standard local runs via Bazel.
  2. Cloudtop/CRD Debugging: If the DEBUG_VIA_CRD environment variable is set, the tests can launch a browser visible through Chrome Remote Desktop. This introduces a startup delay to allow the developer to switch windows and watch the interaction.

A typical test run proceeds as follows:

+------------------+       +-------------------+       +-----------------------+
|  Bazel / Node    |       |  Puppeteer/Utils  |       |  Target Perf Instance |
+------------------+       +-------------------+       +-----------------------+
| 1. Launch Test   | ----> | 2. Launch Browser |       |                       |
|                  |       | 3. Auth Injection | ----> | 4. Request Page       |
|                  |       |                   |       | 5. Return HTML/JS     |
| 7. Check Result  | <---- | 6. Await selector | <---- |                       |
+------------------+       +-------------------+       +-----------------------+
          |                         |
          +---- (On Failure) -------+----> [ Take Screenshot & Log Errors ]

Core Utilities (utils.ts)

The utils.ts file centralizes common logic to keep individual test files clean:

  • applyPageDefaults: Attaches event listeners to the Puppeteer page to capture console, pageerror, and network failures. It also sets a puppeteer=true cookie, which signals the Perf frontend to disable non-deterministic behaviors like animations or simulated RPC latency.
  • browserForSmokeTest: Abstracts creation of the browser instance, switching between headless and debug modes based on environment variables.

Specialized Tests

  • Regression Tests (regression_page_nodejs_test.ts): These tests are more complex than simple load tests. They navigate to specific subscription views (e.g., V8 or Fuchsia) and use Promise.race to wait for either a populated anomaly table or a “clear” message. This handles the non-deterministic nature of production data where anomalies may or may not exist at any given time.
  • Page Load Tests: Files like perf-chrome-public-load-a_nodejs_test.ts verify that the primary routing endpoints (/a, /m, /e) render their main functional components (like #anomaly-table or #test-picker) within the 5-second budget.

Debugging and Manual Execution

Tests are tagged as manual in the BUILD.bazel file, indicating they are typically run against a specific local or development instance rather than a hermetic build environment.

The Makefile provides shorthand commands for running these tests:

  • test-regressions: Runs the standard regression suite.
  • test-regressions-crd: Runs the suite with settings optimized for debugging via Chrome Remote Desktop, streaming the output to the terminal.

Developers can point the suite to any instance by overriding the PERF_BASE_URL environment variable during the Bazel invocation.