# Module: /
# Skia Perf Technical Documentation
This documentation provides a comprehensive technical overview of the Skia Performance Dashboard (Perf), targeting software engineers new to the project. It focuses on the architectural rationale, data lifecycle, and core concepts beyond simple feature descriptions.
## Project Purpose
Skia Perf is a large-scale performance monitoring and regression detection platform. It ingests high-frequency telemetry data from diverse sources (Chrome, Android, Fuchsia, Skia), organizes it into searchable time-series "traces," and automatically identifies performance regressions using statistical analysis.
The system is designed to handle millions of data points across years of history, translating non-linear source control history into linear, searchable performance trends.
## Fundamental Concepts & Terminology
- **Trace**: A single line on a graph representing measurements for a specific test over time. A trace is uniquely identified by a **Key** (a set of key-value pairs like `benchmark=motion_mark, bot=pixel_6, unit=ms`).
- **CommitNumber**: An internal, zero-indexed integer assigned to every Git commit. This linearizes the X-axis for high-performance database lookups and consistent graphing, regardless of branching or merge frequency.
- **Tile**: A fixed-size partition of data (typically 256 commits). Storage and queries are optimized by loading only the specific "Tiles" relevant to a time range.
- **ParamSet**: An inverted index of all keys and values present in a set of traces (e.g., all available `bot` names). This powers the query UI.
- **Regression**: A statistically significant change in a trace’s value (a "step" up or down) identified by fitting a step function to the data.
- **Shortcut**: A persisted, shortened ID representing a complex query or a mathematical formula (e.g., `#base_memory_usage`).
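To make the relationship between a trace **Key** and a **ParamSet** concrete, here is a minimal sketch. The comma-delimited `,k=v,` serialization and the function names are assumptions chosen for illustration, not a statement of how Perf encodes keys internally.

```python
from collections import defaultdict


def make_trace_key(params: dict) -> str:
    """Serialize key-value pairs into one canonical string (keys sorted)."""
    # The ",k=v,k=v," layout is an assumption made for this sketch.
    return "," + ",".join(f"{k}={params[k]}" for k in sorted(params)) + ","


def build_paramset(trace_params: list) -> dict:
    """Collect every value seen for each key across a set of traces."""
    seen = defaultdict(set)
    for params in trace_params:
        for key, value in params.items():
            seen[key].add(value)
    return {key: sorted(values) for key, values in seen.items()}


traces = [
    {"benchmark": "motion_mark", "bot": "pixel_6", "unit": "ms"},
    {"benchmark": "motion_mark", "bot": "pixel_7", "unit": "ms"},
]
print(make_trace_key(traces[0]))
print(build_paramset(traces))
```

Sorting the keys means the same set of pairs always yields the same identifier, and the ParamSet is what a query UI would need to list the available `bot` values.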
## High-Level Architecture
Perf is built as a modular system where a single Go binary, `perfserver`, performs different roles based on its execution mode.
```text
[ Data Sources ]       [ Ingestion Pipeline ]      [ Storage Layer ]        [ Analysis & UI ]
       |                          |                        |                        |
[ GCS Buckets ] ------> [ perfserver ingest ] ----> [ Spanner/CDB ] <---- [ perfserver cluster ]
       |             (Parses JSON, Maps Commits)     (Trace Store)         (Finds Regressions)
       |                                                   ^                        |
       |                                                   |                        v
[ Git Repos ] ---------------------------------------------+------------ [ perfserver frontend ]
                                                                             (Web UI & API)
```
## Design Rationale & Implementation Details
### 1. Data Ingestion and Format
Perf mandates a specific JSON schema for incoming data to decouple performance producers from the dashboard.
- **Rationale**: By using a strictly versioned format, the system can ingest data from any build system (Buildbot, GitHub Actions, LUCI) without logic changes.
- **Path Logic**: Files must be stored in Google Cloud Storage using a `YYYY/MM/DD/HH` directory structure. This allows the ingester to process files in chronological order and prevents GCS directory listing bottlenecks.
- **Commit Resolution**: The `perfgit` module monitors Git repositories to map every incoming `git_hash` to a `CommitNumber`. If the database is empty, it reconstructs this mapping from the Git source of truth on startup.
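As a rough illustration of the `YYYY/MM/DD/HH` layout, the sketch below builds an object path from an upload time. The use of UTC, the `ingest` root, and the helper names are assumptions for the example.

```python
from datetime import datetime, timezone


def hourly_prefix(ts: datetime) -> str:
    """Return the YYYY/MM/DD/HH directory for a timestamp (UTC is assumed)."""
    return ts.astimezone(timezone.utc).strftime("%Y/%m/%d/%H")


def object_path(root: str, ts: datetime, filename: str) -> str:
    """Join a hypothetical root prefix, the hourly directory, and a file name."""
    return f"{root}/{hourly_prefix(ts)}/{filename}"


when = datetime(2024, 3, 5, 7, 42, tzinfo=timezone.utc)
print(object_path("ingest", when, "results.json"))
```

Because every component is zero-padded, sorting the prefixes as plain strings also sorts them chronologically, which is what lets an ingester walk them in time order.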
### 2. Tiled Trace Storage
Traces are stored in a specialized "tiled" format within the database (Google Cloud Spanner or CockroachDB).
- **Why**: Storing millions of individual floating-point numbers as independent rows would lead to massive index overhead.
- **How**: Data is grouped into Tiles. A query for a specific time range resolves which Tiles are needed, fetches the compressed buffers, and reconstructs a `DataFrame` for the UI. Because a `CommitNumber` maps directly to a Tile index and an offset within that Tile, locating any single point takes constant time, and the cost of a range query grows linearly with the number of Tiles it touches.
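The tile arithmetic can be sketched as integer division and remainder on the `CommitNumber`. The function names below are illustrative, and the tile size of 256 is the typical value quoted earlier.

```python
TILE_SIZE = 256  # the "typical" tile size quoted in the text


def tile_number(commit_number: int, tile_size: int = TILE_SIZE) -> int:
    """Index of the tile that holds this commit."""
    return commit_number // tile_size


def tile_offset(commit_number: int, tile_size: int = TILE_SIZE) -> int:
    """Position of this commit inside its tile."""
    return commit_number % tile_size


def tiles_for_range(begin: int, end: int, tile_size: int = TILE_SIZE) -> list:
    """Tiles touched by the inclusive commit range [begin, end]."""
    return list(range(tile_number(begin, tile_size), tile_number(end, tile_size) + 1))


print(tile_number(1000), tile_offset(1000))
print(tiles_for_range(250, 600))
```

A range query only needs the tiles returned by `tiles_for_range`, however long the overall history is.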
### 3. Regression Detection (Clustering & Step-Fitting)
Regression detection is not just simple thresholding; it uses shape-based analysis.
- **K-Means Clustering**: Similar traces are grouped together. If 100 different tests all show the same "spike" at the same commit, they will likely fall into the same cluster.
- **Step Function Fitting**: The system fits a mathematical step function to the centroid of a cluster.
- **Interestingness Score**: Calculated as `StepSize / LeastSquaresError`. A high score means a clean, significant jump with low noise.
- **Fingerprinting**: To avoid re-alerting on the same issue, clusters are fingerprinted using the IDs of the first 20 traces closest to the centroid. If a new cluster matches an old fingerprint, it is treated as a continuation of the same event.
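The scoring idea can be illustrated with a simplified step fit. This sketch places the step at the midpoint of the window, uses the root-mean-square residual as the error term, and floors the error to avoid division by zero. All three choices are assumptions for illustration and may differ from the production algorithm.

```python
import math


def step_fit(values: list, min_error: float = 1e-6) -> tuple:
    """Fit a two-level step at the midpoint; return (step_size, error, score)."""
    mid = len(values) // 2
    left, right = values[:mid], values[mid:]
    left_mean = sum(left) / len(left)
    right_mean = sum(right) / len(right)
    step_size = right_mean - left_mean
    # Root-mean-square residual of the data against the fitted step.
    squared = [(v - left_mean) ** 2 for v in left]
    squared += [(v - right_mean) ** 2 for v in right]
    error = max(math.sqrt(sum(squared) / len(values)), min_error)
    return step_size, error, step_size / error


clean = [10.0, 10.1, 9.9, 10.0, 15.0, 15.1, 14.9, 15.0]
noisy = [10.0, 14.0, 7.0, 12.0, 13.0, 9.0, 16.0, 11.0]
print(step_fit(clean)[2] > step_fit(noisy)[2])
```

The clean series has a large jump and almost no scatter around the two levels, so its score is far higher than that of the noisy series, whose small shift in mean is swamped by the residual error.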
### 4. Event-Driven vs. Continuous Alerting
The system supports two alerting strategies based on data density.
- **Continuous**: Iterates over all configured alerts every few minutes. Ideal for smaller, dense datasets (like Skia).
- **Event-Driven**: Triggered by PubSub events from the ingester. When new data arrives, only the alerts matching the incoming trace IDs are processed.
- **Rationale**: For massive datasets (like Android, with 40M+ traces), continuous clustering is too computationally expensive. Event-driven detection reduces latency from hours to seconds.
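A minimal sketch of the event-driven selection step follows: given the parameters of newly ingested traces, pick only the alerts whose query matches at least one of them. The query representation (key mapped to allowed values) and the alert names are hypothetical.

```python
def matches(query: dict, trace_params: dict) -> bool:
    """True if, for every key in the query, the trace has an allowed value."""
    return all(trace_params.get(key) in allowed for key, allowed in query.items())


def alerts_to_run(alerts: dict, incoming: list) -> list:
    """Names of alerts whose query matches at least one incoming trace."""
    return sorted(
        name
        for name, query in alerts.items()
        if any(matches(query, params) for params in incoming)
    )


alerts = {
    "android-latency": {"bot": ["pixel_6"], "unit": ["ms"]},
    "memory-footprint": {"unit": ["kb"]},
}
incoming = [{"benchmark": "motion_mark", "bot": "pixel_6", "unit": "ms"}]
print(alerts_to_run(alerts, incoming))
```

Only the matching alert is evaluated; a continuous strategy would instead evaluate every configured alert on each pass.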
## Significant Modules & Files
### `go/tracestore` (Backend Storage)
This module is the "brain" of data persistence. It manages the separation of trace values (the numbers) from the trace parameters (the metadata). It implements the logic for "joining" disparate tiles into a cohesive `DataFrame` for the frontend.
### `go/regression` (Anomaly Logic)
Coordinates the regression detection lifecycle. It pulls configurations (Alerts), fetches data from the `tracestore`, executes the clustering algorithms, and writes found anomalies to the `regressionstore`. It is responsible for the "Start-Status-Result" polling pattern used by the UI.
### `modules/explore-simple-sk` (Frontend Orchestrator)
The primary UI component for data interaction. It manages the reactive loop:
1. Observes URL state changes (queries, zoom).
2. Requests a "Frame" (data chunk) from the backend.
3. Coordinates with `plot-google-chart-sk` to render the SVG lines.
4. Overlays HTML elements (anomalies, bug icons) on top of the SVG to maintain interactive performance.
### `go/notify` (Alert Delivery)
A modular system for dispatching alerts. It formats regression data into HTML or Markdown templates and interacts with external APIs like the Google Issue Tracker (Buganizer) or SMTP servers. It handles deduplication to ensure a single regression doesn't result in multiple redundant bugs.
## Technical Workflows
### The Query-to-Chart Workflow
When a user selects a filter in the UI:
1. **URL Sync**: The `stateReflector` updates the URL with the new query.
2. **Request Initiation**: `explore-simple-sk` calls `/_/frame/start` with the query.
3. **Backend Processing**: `dfbuilder` identifies the necessary Tiles, fetches trace data, and applies any requested formulas (e.g., `norm()`, `moving_average()`).
4. **Progress Polling**: The frontend polls `/_/frame/status` until the data is ready.
5. **Rendering**: The resulting `DataFrame` is passed to the chart, which translates `CommitNumbers` into X-coordinates using the `ChartLayoutInterface`.
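The start-then-poll handshake in steps 2 and 4 can be sketched against an in-memory stand-in for the server. `FakeFrameBackend`, its reply shape, and the poll limit are all hypothetical; only the overall pattern comes from the workflow described here.

```python
import itertools
import time


class FakeFrameBackend:
    """Stand-in for the server: a request finishes after a fixed number of polls."""

    def __init__(self, polls_needed: int = 3):
        self._ids = itertools.count(1)
        self._remaining = {}
        self._polls_needed = polls_needed

    def start(self, query: str) -> int:
        request_id = next(self._ids)
        self._remaining[request_id] = self._polls_needed
        return request_id

    def status(self, request_id: int) -> dict:
        self._remaining[request_id] -= 1
        if self._remaining[request_id] > 0:
            return {"state": "running"}
        return {"state": "finished", "frame": {"traces": {}}}


def fetch_frame(backend, query: str, interval_s: float = 0.0, max_polls: int = 100) -> dict:
    """Start a request, then poll its status until the frame is ready."""
    request_id = backend.start(query)
    for _ in range(max_polls):
        reply = backend.status(request_id)
        if reply["state"] == "finished":
            return reply["frame"]
        time.sleep(interval_s)
    raise TimeoutError(f"request {request_id} did not finish in {max_polls} polls")


print(fetch_frame(FakeFrameBackend(), "bot=pixel_6"))
```

Splitting the request into a start call and cheap status calls keeps each HTTP exchange short, even when building the `DataFrame` takes a long time.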
### Data Recovery and Backups
- **Backups**: Handled via `perf-tool database backup`. Only "user-generated" data (Alerts, Regressions, Shortcuts) is backed up.
- **Reconstruction**: Trace data is **not** backed up because it can be 100% reconstructed by re-ingesting the JSON files from GCS. Commits are similarly reconstructed from the Git repository.
# Module: /cockroachdb
### CockroachDB Management Module
This module provides the operational glue for interacting with the CockroachDB cluster deployed within the Skia infrastructure. Rather than managing the database engine itself, this module focuses on providing developer and administrator access to the data layer for debugging, manual SQL intervention, and performance monitoring.
#### Connectivity Design and Interaction
The module is designed around the principle of ephemeral, secure access to a distributed database running inside a Kubernetes environment (`perf-cockroachdb`). Because the database is not exposed directly to the public internet, the scripts implement two primary access patterns:
1. **Direct SQL Execution (In-Cluster):**
   The `connect.sh` script facilitates a "cloud-native" way to interact with the database. It spins up a temporary, short-lived Kubernetes pod running the official CockroachDB image. This pod connects to the `perf-cockroachdb-public` service internally. This approach is preferred for quick SQL queries as it avoids local toolchain dependencies and keeps the traffic entirely within the cluster's private network.
2. **Local Tunneling (Port-Forwarding):**
   For more complex operations (such as using a local IDE, a native `cockroach` binary, or the web-based Admin UI), the module uses Kubernetes port-forwarding.
   - **Administrative UI:** `admin.sh` bridges the local port `8080` to the database's status server. This allows developers to use a local browser to inspect cluster health, node status, and query performance metrics.
   - **Remote SQL Access:** `skia-infra-public-port-forward.sh` establishes a tunnel from the local machine to the database's wire protocol port (`26257`). This is routed through the `skia-infra-public` context, enabling developers to use local CLI tools as if the database were running on `localhost`.
#### Key Components
- **SQL Access Utilities (`connect.sh`, `skia-infra-public-port-forward.sh`):**
These scripts manage the lifecycle of a connection. The design choice to use the `--insecure` flag suggests that the cluster is configured to rely on network-level isolation and Kubernetes RBAC rather than client-side certificate management for these specific administrative entry points.
- **Observability Bridge (`admin.sh`):**
This component targets the CockroachDB built-in HTTP console. It automates the dual step of establishing the network tunnel and launching the browser, reducing the friction required to monitor the database during performance testing or troubleshooting.
#### Workflow: Remote Administration
The following diagram illustrates the lifecycle of a remote administrative session using the tunneling scripts:
```text
[ Developer Machine ]                   [ Kubernetes Cluster (perf) ]
          |                                           |
(1) Run admin.sh ----------------------> [ Pod: perf-cockroachdb-0 ]
          | <-- Port-Forwarding --                    | :8080 (HTTP)
          |                                           |
(2) Browser opens localhost:8080                      |
          |                                           |
          |                                           |
(3) Run skia-...-port-forward.sh ------> [ Pod: perf-cockroachdb-0 ]
          | <-- Port-Forwarding --                    | :26257 (SQL)
          |                                           |
(4) Local SQL Client ---------------------------------+
    connects to 127.0.0.1:25000
```
#### Implementation Choices
- **Pod Targeting:** Scripts target `perf-cockroachdb-0` specifically for port-forwarding. This implies a StatefulSet deployment where the zero-ordinal pod acts as a reliable entry point for administrative tasks, even if the service itself is distributed across multiple nodes.
- **Version Pinning:** The connection script pins the client image to `v19.2.5`. This ensures compatibility with the server-side wire protocol and guarantees that the administrative environment is reproducible across different developer machines.
# Module: /configs
# Perf Instance Configurations
The `/configs` directory serves as the central repository for the operational environment definitions of Skia Perf. Each JSON file in this directory represents a unique instance of the Perf service, defining how it interacts with data stores, ingestion pipelines, version control systems, and alerting mechanisms.
These configurations are designed to be deserialized into the `config.InstanceConfig` Go struct, which acts as the "source of truth" for the application's behavior at runtime.
## Design Philosophy and Implementation Choices
### Configuration Driven Architecture
Perf is designed as a generic engine for time-series visualization and anomaly detection. Rather than hard-coding logic for different projects (like Chrome, Android, or V8), the system uses these configuration files to define the project-specific "shape" of data. This allows a single codebase to support diverse use cases, from local development to massive-scale production environments.
### Data Ingestion and Synchronization
The module defines how performance data moves from build systems into the database.
- **Ingestion Sources:** While production instances typically use Google Cloud Storage (GCS) and Pub/Sub for real-time data streaming, the module also supports local directory monitoring (`source_type: "dir"`). This is used in `demo_spanner.json` and `local.json` to allow developers to test the full ingestion stack without cloud dependencies.
- **Commit Mapping:** A core responsibility of these configs is defining the relationship between performance "traces" and the source code history. Through `git_repo_config`, instances specify how to parse commit positions (e.g., using `commit_number_regex`) to ensure the X-axis of performance graphs remains linear and meaningful across thousands of commits.
### Scalability and Performance Tuning
The configurations allow for fine-grained control over the underlying storage engine (primarily Google Cloud Spanner).
- **Tile-Based Indexing:** The `tile_size` parameter determines the granularity of data partitioning. Smaller tiles (e.g., 256) are optimized for sparse datasets or frequent small-range queries, whereas larger tiles (e.g., 8192) in high-traffic instances like Chrome minimize overhead during massive bulk ingestion.
- **Caching Layers:** To maintain UI responsiveness under heavy load, the configurations define caching strategies. By specifying `level1_cache_key` (often set to `bot` or `benchmark`), the system can pre-index and cache common query patterns in Redis or local memory.
### Workflow Orchestration
Beyond simple visualization, these configurations integrate Perf into the broader CI/CD ecosystem:
```
[ Ingestion ] -> [ Trace Store ] -> [ Regression Detection ] -> [ Bug Filing ]
      ^                                        |                       |
      |                                        v                       v
[ Git Repo ] <---------------------- [ Anomaly Grouping ] -------- [ Issue Tracker ]
```
- **Anomaly Management:** Configurations determine how regressions are identified and reported. Modern instances use `use_regression2_schema` to enable advanced SQL-based anomaly tracking.
- **Issue Tracking:** The `issue_tracker_config` and `notify_config` blocks define the "where" and "how" of alerting—ranging from simple email notifications to automated bug creation in Google's Issue Tracker, including the use of specific API keys and secrets.
## Key Components and Responsibilities
### Instance Metadata and UI Customization
The top-level fields (e.g., `instance_name`, `contact`, `favorites`, `extra_links`) define the identity of the instance. The `favorites` and `extra_links` sections are particularly important for usability, allowing administrators to curate specific views or link to external documentation and dashboards directly within the Perf UI.
### Data Store Configuration (`data_store_config`)
This component defines the backend storage technology. While the project is shifting towards Spanner (as seen in `demo_spanner.json` and the `/spanner` subdirectory), the config remains flexible enough to define connection strings and database types, ensuring the application knows how to communicate with the PostgreSQL-compatible Spanner interface.
### Query and Discovery (`query_config`)
This section controls how users interact with the data:
- **Include Params:** Lists the metadata keys (like `benchmark`, `bot`, `test`) that the UI should expose for filtering and searching.
- **Default URL Values:** Allows setting the "personality" of an instance—for example, deciding whether a specific instance should default to showing a zero-based Y-axis or use a specialized test picker by default.
## Subdirectory Roles
- **/spanner**: Specifically contains configurations for instances that have migrated to the Spanner-based backend. These files represent the current standard for high-performance, horizontally scalable Perf deployments.
- **Local and Demo Configs**: Files like `local.json` and `demo_spanner.json` are essential for the development lifecycle. They point to local data directories (`./demo/data/`) and use simplified auth schemes to allow developers to run the entire Perf stack on a single workstation for testing and debugging.
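To give a feel for what such a file might contain, here is a hedged sketch of a local-style configuration and a small loader. The field names quoted in this document are reused; `ingestion_config`, `source_config`, `sources`, `include_params`, the nesting, and all values are assumptions for illustration and should be checked against `config.InstanceConfig`.

```python
import json

# Hypothetical instance config: nesting and several field names are assumed.
EXAMPLE_CONFIG = """
{
  "instance_name": "local-demo",
  "contact": "perf-team@example.com",
  "data_store_config": {
    "datastore_type": "spanner",
    "tile_size": 256
  },
  "ingestion_config": {
    "source_config": {
      "source_type": "dir",
      "sources": ["./demo/data/"]
    }
  },
  "query_config": {
    "include_params": ["benchmark", "bot", "test"]
  }
}
"""


def summarize(raw: str) -> dict:
    """Parse the JSON and pull out the settings discussed in this section."""
    config = json.loads(raw)
    store = config["data_store_config"]
    source = config["ingestion_config"]["source_config"]
    if store["tile_size"] <= 0:
        raise ValueError("tile_size must be positive")
    if source["source_type"] not in ("gcs", "dir"):
        raise ValueError(f"unknown source_type: {source['source_type']}")
    return {
        "name": config["instance_name"],
        "tile_size": store["tile_size"],
        "source_type": source["source_type"],
        "filters": config["query_config"]["include_params"],
    }


print(summarize(EXAMPLE_CONFIG))
```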
# Module: /configs/spanner
# Spanner-Based Perf Configurations
The `/configs/spanner` directory contains JSON configuration files for various Skia Perf instances that utilize Google Cloud Spanner as their primary data store. These configurations define how performance data is ingested, stored, queried, and reported for specific projects such as Chrome, Android, V8, Flutter, and Fuchsia.
## Overview
Each file in this directory represents a distinct environment (production, experiment, or internal) for a performance monitoring dashboard. By using Cloud Spanner, these instances benefit from a horizontally scalable, globally consistent relational database, which is particularly suited for handling the high volume of "traces" (time-series performance data) generated by large-scale CI/CD systems.
The move to Spanner (referenced in these configs via the `datastore_type: "spanner"` and a PostgreSQL-compatible connection string) represents an architectural shift toward high-performance SQL-based storage for performance metrics.
## Design Decisions and Implementation Choices
### Data Storage Strategy
The configurations use a "tile-based" storage approach, controlled by the `tile_size` parameter.
- **Small Tiles (256 - 512):** Used by projects like Skia, Angle, and Fuchsia. Smaller tiles are often more efficient for sparse data or instances where users frequently query small ranges of commits.
- **Large Tiles (4096 - 8192):** Used by high-traffic instances like Chrome Internal. Larger tiles optimize for massive ingestion throughput and batch reading of dense performance data.
- **Follower Reads:** Many internal configs set `enable_follower_reads`. This lowers read latency and cost by letting the application read from Spanner replicas that may lag slightly behind the leader; slightly stale data is acceptable for dashboard visualization.
### Ingestion Workflow
Data flow is standardized across instances using a Google Cloud Storage (GCS) to Pub/Sub pipeline.
```
[ Build System ] -> [ GCS Bucket ] -> [ Pub/Sub Topic ] -> [ Perf Ingestion Service ] -> [ Spanner DB ]
```
- **Source Type:** Predominantly `gcs`, identifying the bucket where performance JSON files are uploaded.
- **Pub/Sub Integration:** The `topic` and `subscription` fields define the "push" mechanism that triggers ingestion as soon as new data arrives in GCS.
- **Dead Letter (DL) Queues:** Critical instances (like Chrome and WebRTC) include `dl_topic` and `dl_subscription` to handle failed ingestion attempts without losing data.
### Anomaly Detection and Notification
The configurations define how the system reacts to performance regressions:
- **Notification Types:** Options include `markdown_issuetracker` (for automated bug creation), `html_email`, or `anomalygroup` (which clusters related regressions before alerting).
- **Sheriffing:** `enable_sheriff_config` allows these instances to pull alert thresholds and ownership data from a central management system.
- **Regression Schema:** Newer instances use `use_regression2_schema: true` and `fetch_anomalies_from_sql: true`, indicating a transition to a more robust, queryable SQL schema for tracking performance changes over time.
## Key Submodules and Components
### Git Repository Configuration (`git_repo_config`)
Determines how commits are mapped to performance data.
- **Provider:** Most use `gitiles`, which is optimized for Google-hosted source code.
- **Commit Parsing:** Configurations like `v8` and `chrome` use `commit_number_regex` to extract "Commit Positions" (e.g., `refs/heads/main@{#12345}`), which are used as a linear X-axis instead of raw Git hashes.
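As an illustration of commit-position parsing, the sketch below extracts the number from a commit message footer. The pattern is a hypothetical stand-in for a `commit_number_regex` value; each real instance defines its own.

```python
import re
from typing import Optional

# Hypothetical pattern, not copied from any instance config.
COMMIT_POSITION = re.compile(r"refs/heads/main@\{#(\d+)\}")


def commit_position(commit_message: str) -> Optional[int]:
    """Return the commit position found in the message, or None if absent."""
    match = COMMIT_POSITION.search(commit_message)
    return int(match.group(1)) if match else None


message = "Fix flaky test\n\nCr-Commit-Position: refs/heads/main@{#12345}"
print(commit_position(message))
```

A commit without a recognizable position yields `None`, which a caller would need to handle explicitly rather than guess a number.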
### Query and Visualization (`query_config`)
Customizes the UI and discovery experience for each project's unique metric structure.
- **Include Params:** Defines which metadata fields (e.g., `benchmark`, `bot`, `test`, `subtest_1`) are indexed and searchable in the Perf UI.
- **Conditional Defaults:** Instances like `android2` use this to automatically select specific stats (like `min` for `timeNs`) when a user selects a certain metric, reducing the manual effort required to find meaningful data.
- **Caching:** High-load instances utilize Redis (`cache_config`) to store common query results, specifically targeting `level1_cache_key` (usually `benchmark`) to speed up dashboard loading.
### Temporal Integration (`temporal_config`)
Specific to internal Chrome and Fuchsia instances, this links the Perf dashboard to **Temporal**, a workflow orchestration engine. This is used to trigger automated "bisects" (`pinpoint_task_queue`)—a process that automatically finds the exact CL responsible for a performance regression.
## Directory Responsibilities
- **Production Configs:** (e.g., `chrome-public.json`, `v8-public.json`) Primary dashboards used by developers.
- **Internal Configs:** (e.g., `chrome-internal.json`, `eskia-internal.json`) Restricted instances for proprietary code or sensitive performance metrics.
- **Autopush/Experiment:** (e.g., `v8-internal-autopush.json`, `chrome-internal-experiment.json`) Testing grounds for new Perf UI features or experimental Spanner schemas.
# Module: /coverage
### Perf Coverage Module
The `/coverage` module provides a comprehensive quality assurance suite for the Perf project. Instead of relying on a single metric, it implements a "triangulated" approach to code health by measuring type safety, test execution coverage, and test effectiveness through mutation testing.
The primary goal of this module is to generate actionable reports that reside in a unified dashboard, allowing developers to identify not just untested code, but also "weakly" tested code or type-system gaps.
#### Key Quality Dimensions
The module is structured around three distinct methodologies for evaluating the codebase:
1. **Type Coverage**: Measures the "strictness" and completeness of TypeScript types across the project. It identifies where the `any` type or missing annotations might be bypassing the safety checks of the compiler. This ensures that the codebase remains maintainable and less prone to runtime errors.
2. **Test Execution (Line) Coverage**: Uses `c8` and `mocha` to track which lines and branches of code are executed during unit tests. This is the traditional metric for identifying "dead" zones in the test suite.
3. **Mutation Testing**: Evaluates the _quality_ of existing tests by injecting small bugs (mutants) into the source code (e.g., changing `>` to `<`). If the test suite still passes despite these changes, the mutation "survived," indicating that the tests are not sensitive enough to detect logic regressions in that area.
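The difference between a mutant that survives and one that is killed can be shown with a toy example. The function and the two "test suites" below are hypothetical and written in Python for brevity; the project's real mutation runs target TypeScript.

```python
def is_regression(old_ms: float, new_ms: float) -> bool:
    """Original logic: a regression means the new timing is slower."""
    return new_ms > old_ms


def is_regression_mutant(old_ms: float, new_ms: float) -> bool:
    """Mutant: the comparison operator `>` has been flipped to `<`."""
    return new_ms < old_ms


def weak_suite(fn) -> bool:
    """Only checks the equal case, which both operators treat identically."""
    return fn(10.0, 10.0) is False


def strong_suite(fn) -> bool:
    """Checks both directions, so a flipped operator changes the outcome."""
    return fn(10.0, 12.0) is True and fn(10.0, 8.0) is False


print("weak suite, mutant survives:", weak_suite(is_regression_mutant))
print("strong suite, mutant killed:", not strong_suite(is_regression_mutant))
```

Both suites pass on the original code and would give full line coverage of `is_regression`, yet only the strong suite detects the flipped operator.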
#### Component Responsibilities
- **`perf-coverage.sh` (The Orchestrator)**:
This script acts as the central entry point for generating coverage reports. It encapsulates the complex CLI arguments required for various tools, ensuring consistency between local execution and CI/CD pipelines. It allows for targeted runs (e.g., only running mutation tests) or a full suite execution.
- **`add-coverage-links.py` (Navigation Post-Processor)**:
Most coverage tools generate static HTML reports that are isolated from one another. This script uses `BeautifulSoup` to programmatically inject a navigation header ("Back to Perf Coverage Dashboard") into the generated HTML files. This transforms a collection of disparate reports into a cohesive, navigable documentation site. It handles idempotency by removing existing links before inserting new ones to prevent duplicate UI elements during re-runs.
- **`stryker.config.json`**:
Configures the Mutation Testing framework. It is specifically tuned to exclude Puppeteer (integration) tests and page objects, focusing purely on the core business logic within `perf/modules`. It balances performance and thoroughness by defining precise `ignorePatterns`.
- **`tsconfig.coverage.json`**:
A specialized TypeScript configuration used specifically for type-coverage reporting. It extends the base project configuration but restricts the scope to source files, excluding tests and demo files to ensure the reported coverage percentage reflects the production logic accurately.
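The idempotent link injection performed by the post-processor can be sketched with the standard library alone. This version swaps in regular expressions instead of an HTML parser to stay dependency-free, and the marker `id`, link target, and function name are hypothetical.

```python
import re

# Hypothetical marker used to recognize a link added by an earlier run.
NAV = (
    '<div id="perf-coverage-nav">'
    '<a href="../index.html">Back to Perf Coverage Dashboard</a>'
    "</div>"
)
_EXISTING = re.compile(r'<div id="perf-coverage-nav">.*?</div>', re.DOTALL)
_BODY_OPEN = re.compile(r"<body[^>]*>", re.IGNORECASE)


def add_nav(html: str) -> str:
    """Insert the navigation link right after <body>, replacing any old copy."""
    html = _EXISTING.sub("", html)  # drop a link left by a previous run
    match = _BODY_OPEN.search(html)
    if match is None:
        return NAV + html  # no <body> tag: prepend as a fallback
    return html[: match.end()] + NAV + html[match.end():]


page = "<html><body><h1>Report</h1></body></html>"
once = add_nav(page)
print(add_nav(once) == once)
```

Removing the old link before inserting the new one is what makes repeated runs safe: the output of a second pass is identical to the first.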
#### Workflow Process
The following diagram illustrates how the module transforms source code and tests into a unified quality dashboard:
```text
Source Code + Tests
        |
        |----(typescript-coverage-report)----> [Type Coverage HTML]
        |                                               |
        |----(c8 + mocha)--------------------> [Test Coverage HTML]
        |                                               |
        |----(stryker)-----------------------> [Mutation Report HTML]
        |                                               |
        V                                               V
[Raw HTML Reports] <----------------------- (add-coverage-links.py)
        |
        | (Injects Navigation UI)
        V
[Unified Coverage Dashboard]
```
#### Design Implementation Choices
- **Exclusion of Integration Tests**: The test and mutation configurations specifically exclude `*_puppeteer_test.ts`. This is a deliberate choice to keep the coverage feedback loop fast. Integration tests are often too slow and "noisy" for mutation testing, which requires thousands of test re-runs.
- **LXML for HTML Manipulation**: The Python post-processor parses the reports leniently (via `BeautifulSoup`, presumably with `lxml` as the underlying parser), so that even if the reporting tools produce slightly malformed HTML, the navigation links can be reliably injected without corrupting the reports.
- **Static Analysis vs. Runtime Analysis**: By combining `tsconfig` checks with `stryker` runtime analysis, the module covers the entire lifecycle of code reliability—from compile-time correctness to runtime logic validation.
# Module: /csv2days
# csv2days
`csv2days` is a command-line utility designed to post-process CSV files exported from Skia Perf. Its primary purpose is to aggregate time-series data from sub-day granularity (RFC3339 timestamps) into daily granularity.
### Overview
When exporting data from Perf, a CSV may contain multiple columns representing different data points collected on the same calendar day. This granularity can be excessive for certain types of reporting or spreadsheet analysis. `csv2days` simplifies these files by collapsing all columns belonging to the same date into a single column.
### Design Decisions
#### Aggregation via Maximum Value
When multiple columns from the same day are merged, the tool must decide how to represent the data for that day. `csv2days` implements a **Max** strategy. For any set of columns being collapsed, the tool calculates the maximum numerical value across those columns for each row.
This decision is rooted in the common use case of monitoring performance metrics where the "peak" value for a day is often more significant than an average or a sum, particularly when dealing with sparse data where different columns represent different runs of the same task. If a value cannot be parsed as a float, the tool defaults to the first available string in that "run" of columns.
#### Header-Driven Transformation
The transformation logic is strictly driven by the headers of the CSV. The tool assumes that the CSV contains a horizontal timeline where headers follow the RFC3339 format.
1. **Identification**: It uses regular expressions to identify date-time strings in the header.
2. **Date Truncation**: It strips the time and timezone information, keeping only the `YYYY-MM-DD` portion.
3. **Run Identification**: It identifies "runs" of columns—consecutive headers that resolve to the same date.
### Key Components
#### `main.go`
The core logic resides in `transformCSV`. The process follows a specific pipeline:
1. **Header Analysis**: It scans the first row to determine which columns are duplicates (same day). It records the "run lengths" (how many columns belong to one day) and the indices that need to be removed to reach a unique set of dates.
2. **Data Processing**: For every subsequent row:
   - **Max Application**: It looks at the indices identified as a "run" and calculates the maximum value within that range.
   - **Column Reduction**: It removes the redundant columns (the indices that were merged into the first column of the run).
3. **Streaming Output**: To maintain efficiency, the tool processes the file row-by-row and streams the output to `stdout`.
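The pipeline above can be approximated in a few lines of Python. This is a sketch of the technique, not a port of `transformCSV`: the helper names are invented, and ignoring unparseable cells when at least one cell in the run is numeric is an assumption.

```python
import csv
import io
import re
import sys

DATE = re.compile(r"^(\d{4}-\d{2}-\d{2})T")


def day_of(header: str) -> str:
    """Truncate a timestamp header to YYYY-MM-DD; leave other headers alone."""
    match = DATE.match(header)
    return match.group(1) if match else header


def find_runs(header: list) -> list:
    """Return (start, length) for each run of consecutive same-day columns."""
    runs, i = [], 0
    while i < len(header):
        j = i + 1
        if DATE.match(header[i]):
            while j < len(header) and DATE.match(header[j]) and day_of(header[j]) == day_of(header[i]):
                j += 1
        runs.append((i, j - i))
        i = j
    return runs


def is_float(cell: str) -> bool:
    try:
        float(cell)
        return True
    except ValueError:
        return False


def collapse(cells: list) -> str:
    """Largest numeric cell; fall back to the first cell if none parse."""
    if not cells:
        return ""
    numeric = [c for c in cells if is_float(c)]
    return max(numeric, key=float) if numeric else cells[0]


def transform(src, dst) -> None:
    """Stream rows from src to dst, merging each run into a single column."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    runs = find_runs(header)
    writer.writerow([day_of(header[start]) for start, _ in runs])
    for row in reader:
        writer.writerow([collapse(row[start:start + length]) for start, length in runs])


sample = "Key,2023-01-01T08:00:00Z,2023-01-01T12:00:00Z,2023-01-02T09:00:00Z\nA,10,20,15\n"
transform(io.StringIO(sample), sys.stdout)
```

Rows are written as soon as they are read, so memory use stays flat regardless of the size of the export.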
### Workflow Diagram
The following diagram illustrates how multiple timestamped columns are collapsed into a single date column:
```text
INPUT CSV:
[Header] | Key | 2023-01-01T08:00:00Z | 2023-01-01T12:00:00Z | 2023-01-02T09:00:00Z |
[Row 1]  | A   | 10                   | 20                   | 15                   |
PROCESS:
1. Identify 2023-01-01 columns as a "Run"
2. Calculate Max(10, 20) for Row 1 -> 20
3. Truncate headers to YYYY-MM-DD
4. Remove redundant indices
OUTPUT CSV:
[Header] | Key | 2023-01-01 | 2023-01-02 |
[Row 1]  | A   | 20         | 15         |
```
### Usage
The tool requires an input file specified by the `--in` flag and outputs the transformed CSV directly to standard output:
```bash
csv2days --in=perf_export.csv > daily_summary.csv
```
# Module: /demo
### Perf Demo Data Module
The `/demo` module provides a self-contained environment for generating and storing synthetic performance data. Its primary purpose is to demonstrate the capabilities of the Perf system—such as anomaly detection, regression tracking, and trend visualization—without requiring a live production environment.
The module is designed to work in tandem with the [perf-demo-repo](https://github.com/skia-dev/perf-demo-repo.git), mapping performance metrics to specific Git commits within that repository.
#### Design Philosophy: Deterministic Anomaly Simulation
Rather than providing static files that might become stale, the module includes a data generator (`generate_data.go`) that programmatically creates JSON files following the `format.Format` schema.
The generation logic is intentionally designed to simulate real-world performance scenarios:
- **Regression Triggering:** The generator injects artificial spikes in metrics (e.g., adding a significant offset to the "encode" time at a specific commit index) to ensure that alerting and anomaly detection algorithms in Perf have visible "problems" to identify.
- **State Shifts:** By using multipliers that change after specific commit indices, the generator simulates shifts in performance baselines, allowing users to test how the system handles intentional performance improvements or degradations.
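The two effects described above, a one-off spike and a lasting baseline shift, can be sketched as follows. The commit indices, magnitudes, and seed are invented for illustration and are not taken from `generate_data.go`.

```python
import random

SPIKE_AT = 3   # hypothetical commit index that receives a one-off spike
SHIFT_AT = 6   # hypothetical commit index where the baseline changes


def encode_ms(commit_index: int, rng: random.Random) -> float:
    """Synthetic "encode" timing: small noise, a spike, and a baseline shift."""
    base = 1.0 + rng.random() * 0.1                     # noise in [1.0, 1.1)
    multiplier = 1.0 if commit_index < SHIFT_AT else 1.5
    spike = 5.0 if commit_index == SPIKE_AT else 0.0
    return base * multiplier + spike


rng = random.Random(42)  # fixed seed keeps the generated series reproducible
series = [encode_ms(i, rng) for i in range(10)]
print([round(v, 2) for v in series])
```

Seeding the generator means the same anomalies appear at the same commits on every run, which keeps demo alerts stable.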
#### Key Components
**1. Data Generator (`generate_data.go`)**
This is a Go binary responsible for creating the `data/` directory. It iterates through a hardcoded list of Git hashes from the demo repository and generates a JSON payload for each.
- **Why Go?** Using the same language as the core Perf ingester allows the generator to import `perf/go/ingest/format` directly, ensuring the generated data is always compatible with the system's ingestion requirements.
- **Schema Implementation:** It populates multi-dimensional keys (e.g., `bot`, `benchmark`, `units`) to demonstrate how Perf can pivot and filter data across different environmental facets.
**2. Storage (`/demo/data`)**
This directory acts as a mock "data lake." It contains the JSON output of the generator.
- **File-Based Ingestion:** The files are structured to be consumed by a Perf ingester of type `dir`. This replicates a simple filesystem-based ingestion workflow where a watcher monitors a directory for new performance results.
- **Traceability:** Each file is named sequentially (`demo_data_commit_N.json`) to provide a clear chronological lineage for performance trends.
#### Data Workflow and Integration
The workflow follows a path from source code state to visual representation:
```text
[ Git Commits ]              [ generate_data.go ]            [ /demo/data/*.json ]
(perf-demo-repo)  ------>   (Logic + Randomness)   ------>   (Structured JSONs)
                                                                      |
                                                                      | (Ingested by Perf)
                                                                      v
                                                            [ Perf UI / Alerting ]
                                                            - Detect "encode" spike
                                                            - Graph "ms" vs "kb"
```
#### Key Implementation Choices
- **Multi-Metric Reporting:** Each generated file contains multiple result groups (e.g., one for time in `ms` and one for memory in `kb`). This illustrates how a single ingestion event can update multiple disparate metrics (CPU vs. RAM) simultaneously.
- **Decoupled Metadata:** Environment details like the architecture (`x86`) and the branch (`master`) are stored in a top-level `Key` map. This allows the Perf system to index these files efficiently and enables users to compare performance across different hardware configurations or branches.
- **Measurement Hierarchy:** The use of `SingleMeasurement` objects (mapping `test` categories to specific operations like `encode` or `decode`) provides a granular view, allowing the system to track specific sub-routines within a larger benchmark.
# Module: /demo/data
### Benchmark Data Module
The `/demo/data` directory serves as the primary storage for performance benchmark results within the project. It contains a collection of JSON files, each representing a "point-in-time" snapshot of performance metrics associated with specific Git commits. This structured data allows for regression tracking, performance analysis over time, and cross-platform comparisons.
#### Data Architecture and Schema
The module follows a standardized JSON schema (Version 1) designed to decouple the environmental metadata from the actual performance measurements. This structure ensures that as new benchmarks or hardware bots are added, the reporting format remains consistent.
**1. Metadata and Context**
Each file identifies the specific build and environment that produced the results:
- `git_hash`: The unique identifier for the source code state.
- `key`: A set of environmental descriptors, including the `benchmark` ID, the hardware architecture/platform (`bot`), and the project branch (for example, `master`).
**2. Result Grouping**
Performance data is grouped within the `results` array. This grouping strategy is chosen to allow a single commit to report multiple categories of metrics (e.g., time-based vs. size-based) in a single transaction. Each result group is defined by its own `key` (typically defining the `units`).
**3. Measurement Hierarchy**
Inside each result group, the `measurements` object maps specific test categories to an array of values.
- `test`: The common category for operational metrics.
- `value`: The specific operation performed (e.g., "encode", "decode").
- `measurement`: The raw numerical data point.
**4. Extensibility via Links**
The schema supports optional `links` objects at both the global and measurement levels. This design allows for traceability, enabling tools to link a specific performance outlier directly to external logs, search queries, or profiling reports.
#### Data Workflow
The data is intended to be consumed by a visualization or monitoring system. The relationship between the files reflects a chronological progression of the codebase:
```text
[ Commit Hash A ] -> [ Commit Hash B ] -> [ Commit Hash C ]
        |                    |                    |
        v                    v                    v
+--------------+     +--------------+     +--------------+
| JSON Data 1  |     | JSON Data 2  |     | JSON Data 3  |
| (encode: X)  |     | (encode: Y)  |     | (encode: Z)  |
+--------------+     +--------------+     +--------------+
        |                    |                    |
        +--------------------+--------------------+
                             |
                             v
                [ Performance Trend Graph ]
                (Detection of Regressions)
```
#### Design Choices
- **Flat File Storage:** By storing results as individual JSON files named by commit sequence, the system leverages the filesystem/version control for history rather than requiring a separate database for basic storage.
- **Key-Value Pairs for Units:** Units are stored within a `key` object rather than a hardcoded field. This allows the reporting logic to be agnostic of what is being measured (e.g., milliseconds, kilobytes, or operations per second).
- **Measurement Objects:** Using an array of objects for measurements (containing `value` and `measurement`) rather than a simple map allows for the future inclusion of per-measurement metadata, such as the `links` found in `demo_data_commit_4.json`.
# Module: /docs
The `/docs` module serves as the central knowledge repository and architectural blueprint for the Skia Performance Dashboard (Perf). It acts as the "source of truth" for the system’s design, data protocols, and operational procedures. Beyond simple user guides, this module defines the rigid contracts required for cross-project data ingestion and the multi-service architecture that enables performance regression detection at scale.
### Design Rationale: Documentation as Code
The structure of the `/docs` module reflects a design philosophy where documentation is treated with the same rigor as source code:
- **Contract-First Integration:** Files like `FORMAT.md` (defining the `nanobench` JSON structure) exist to decouple data producers from the dashboard. Because Perf ingests data from diverse ecosystems (Fuchsia, Chrome, Android), a strictly versioned and documented format allows the ingestion pipeline to remain generic and stable while producers evolve independently.
- **Centralization of Tribal Knowledge:** The module consolidates technical details that span Go backend services, SQL schema definitions, and LitElement frontend components into comprehensive references like `ai_generated_doc.md`. This lowers the onboarding barrier for a codebase spread across many packages and service roles.
- **Traceability via Version Control:** By maintaining documentation in the same repository as the implementation logic, architectural decisions and API changes are tracked through the same peer-review and history mechanisms as the code itself. This prevents the "documentation rot" common in external wikis.
### Key Components and Responsibilities
#### Technical Reference and Aggregation (`ai_generated_doc.md`)
This component acts as the primary technical manual for the entire project. It documents the "why" behind the most significant architectural decisions, such as:
- **The Multi-Mode Server:** Explaining why `perfserver` is a single binary that handles ingestion, clustering, and frontend serving through different flags to simplify containerized deployments.
- **Clustering Logic:** Detailing the use of k-means clustering and step-function fitting to identify "interesting" regressions amidst noisy performance data.
- **Persistence Strategy:** Describing the transition from CockroachDB to Google Cloud Spanner to achieve global consistency and horizontal scalability for time-series metrics.
#### Data Schema and Ingestion Contracts (`FORMAT.md`)
This file is the definitive specification for how performance data must be structured before it reaches the dashboard. It defines the hierarchical relationship between:
- **Version and Metadata:** Identifying the schema version and the Git commit hash.
- **Keys:** Defining the parameters (architecture, OS, configuration) that uniquely identify a performance trace.
- **Results:** Structuring the measurements (timing, memory usage) and statistical aggregates (min, max, median).
#### API Specifications (`API.md`)
Defines the programmatic interfaces for external interactions, primarily focusing on alert management. This allows automated tools to create, list, or update alerts without human intervention, facilitating a "monitoring-as-code" workflow.
### Key Data Ingestion Workflow
The documentation defines how data moves through the system, ensuring that every component adheres to the documented state transitions:
```text
PRODUCERS (Fuchsia, Chrome, CI)
|
| 1. Format raw data into 'nanobench' JSON (per FORMAT.md)
v
STORAGE (Google Cloud Storage)
|
| 2. Organized by YYYY/MM/DD/HH structure
v
INGESTION SERVICE (perfserver ingest)
|
| 3. Validate against formatSchema.json (in /go/ingest/format)
| 4. Resolve Git Hash to Commit Number (via /go/git)
v
DATABASE (Google Cloud Spanner)
|
| 5. Store TraceValues and inverted ParamSets (per /go/sql/spanner)
v
ANALYSIS ENGINE (perfserver cluster)
|
| 6. Group similar traces using k-means (per /go/clustering2)
| 7. Fit Step Functions to detect regressions (per /go/stepfit)
v
USER INTERFACE (perf.skia.org)
|
| 8. Visualize via explore-simple-sk and handle triage
```
### Strategic Module Interaction
The `/docs` module provides the necessary context to understand how the various sub-directories function as a unified whole:
- **Config Management:** Complements the `/configs` module by explaining how JSON configuration files translate to specific Perf instance identities.
- **Go Backend:** Provides the high-level logic for the services found in `/go`, particularly the complex relationship between the `tracestore` and `regression` packages.
- **Frontend Modularization:** Explains the component-based architecture in `/modules`, where UI elements like `chart-tooltip-sk` and `anomalies-table-sk` are orchestrated to provide a cohesive exploration experience.
# Module: /go
# Skia Perf Module (`/go`)
The `/go` directory contains the core backend services, data processing pipelines, and administrative tools for **Skia Perf**, a large-scale performance monitoring and regression detection platform.
## High-Level Overview
Skia Perf is designed to ingest high-frequency telemetry data from diverse sources (Chrome, Android, Fuchsia, Skia), organize it into searchable time-series "traces," and automatically identify performance regressions.
The architecture follows a specialized "tiled" storage model to handle millions of data points across years of history. It decouples the heavy analytical tasks—like k-means clustering and step-fit detection—from the user-facing web interface, utilizing an asynchronous "Start-Status-Result" pattern to maintain responsiveness.
## Design Philosophy and Implementation Choices
### Tiled Storage and Commit-Linearity
A fundamental design choice in Perf is the translation of non-linear Git history into a linear, integer-based coordinate system (`CommitNumber`).
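- **Why**: With commits numbered consecutively, any commit range becomes a contiguous integer interval. Range queries stay simple and fast, and the tile that holds a given commit can be computed with integer arithmetic instead of a search through Git history.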
- **How**: Data is partitioned into "Tiles" (typically 256 commits each). Components like `tracestore` and `dfbuilder` operate on these tiles to fetch only the temporal slices necessary for a given request, preventing memory exhaustion.
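The tile arithmetic can be sketched in a few lines. The type and function names here are illustrative assumptions rather than the definitions used by `tracestore`; the tile size of 256 is the typical value mentioned above.

```go
package main

import "fmt"

// CommitNumber and TileNumber are local stand-ins for the concepts above.
type CommitNumber int32
type TileNumber int32

const tileSize = 256 // typical tile size; real instances may configure this

// tileFor returns the tile that contains commit c.
func tileFor(c CommitNumber) TileNumber {
	return TileNumber(int32(c) / tileSize)
}

// offsetInTile returns the position of commit c inside its tile.
func offsetInTile(c CommitNumber) int32 {
	return int32(c) % tileSize
}

// tilesForRange lists every tile touched by the inclusive range [begin, end].
func tilesForRange(begin, end CommitNumber) []TileNumber {
	var tiles []TileNumber
	for t := tileFor(begin); t <= tileFor(end); t++ {
		tiles = append(tiles, t)
	}
	return tiles
}

func main() {
	fmt.Println(tileFor(600), offsetInTile(600)) // prints "2 88"
	fmt.Println(tilesForRange(250, 520))         // prints "[0 1 2]"
}
```

A query for commits 250 through 520 therefore needs only three tiles, regardless of how much history the instance holds.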
### Configuration as Code
The system is highly multi-tenant. A single binary supports vastly different projects (e.g., `v8` vs `android`) by interpreting an `InstanceConfig`.
- **Validation**: The `config` and `validate` modules perform semantic checks (e.g., dry-running Go templates and pre-compiling Regex) at startup to ensure the instance is logically sound before it handles data.
- **Sheriffing**: Alerting rules are managed as version-controlled Proto files (`sheriffconfig`) and synchronized into the operational database, allowing teams to manage monitoring thresholds via standard code reviews.
### Asynchronous Orchestration
Heavy operations like regression detection and bisection are managed via **Temporal** workflows (`workflows`).
- **Reliability**: This ensures that if a network call to an issue tracker or a bisection engine (Pinpoint) fails, the state is preserved and the task can be retried without losing progress.
- **Polling Pattern**: The `progress` and `dfiter` modules facilitate a pattern where the frontend triggers a task and polls for updates, allowing the backend to handle long-running computations outside the HTTP request lifecycle.
## Key Submodules and Responsibilities
The project is organized into functional layers:
### 1. Data Ingestion & Storage
- **`ingest` & `process`**: The continuous pipeline that monitors sources (GCS/PubSub), parses incoming files (`parser`), and populates the database.
- **`tracestore` & `sqltracestore`**: The low-level persistence layer. It separates numeric values from metadata (`traceparamstore`) to optimize query performance.
- **`perfgit`**: Manages the mapping between Git hashes and the internal `CommitNumber` timeline.
### 2. Analysis & Detection
- **`regression`**: The central engine that coordinates regression detection across the commit history.
- **`clustering2` & `kmeans`**: Implements shape-based grouping to find similar performance shifts across disparate tests.
- **`stepfit`**: Provides the mathematical logic for identifying "steps" (sudden jumps or drops) in individual traces.
- **`samplestats`**: Conducts statistical tests (Mann-Whitney U, Welch's T-test) to compare "before" and "after" samples.
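To make the idea of fitting a step concrete, here is a deliberately simplified sketch: it places the step at the midpoint of the trace, uses the mean of each half as the fitted level, and reports the step size alongside the root-mean-square error of the fit. This illustrates the general technique only; the function names are hypothetical and the actual `stepfit` package may normalize and classify steps differently.

```go
package main

import (
	"fmt"
	"math"
)

// mean returns the arithmetic mean of xs; xs must be non-empty.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// fitStep fits a two-level step function with the step at the midpoint.
// stepSize is (level after) - (level before); rmsError measures how well
// the step function matches the trace.
func fitStep(trace []float64) (stepSize, rmsError float64) {
	if len(trace) < 2 {
		return 0, 0
	}
	mid := len(trace) / 2
	before, after := mean(trace[:mid]), mean(trace[mid:])
	sumSq := 0.0
	for i, v := range trace {
		fit := before
		if i >= mid {
			fit = after
		}
		sumSq += (v - fit) * (v - fit)
	}
	return after - before, math.Sqrt(sumSq / float64(len(trace)))
}

func main() {
	step, rms := fitStep([]float64{10, 11, 10, 11, 20, 21, 20, 21})
	fmt.Printf("step=%.1f rms=%.2f\n", step, rms) // prints "step=10.0 rms=0.50"
}
```

A large step relative to the fit error suggests a genuine shift rather than noise, which is the intuition behind comparing a regression score against a threshold.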
### 3. User Interface & API
- **`frontend`**: The web server orchestrator. It manages authentication, serves the UI, and coordinates between various backend stores.
- **`dataframe` & `dfbuilder`**: Constructs the matrix-like data structures used by the UI to render graphs and tables.
- **`ui/frame`**: The "brain" of the exploration page, handling complex query resolution and formula calculations (`calc`).
### 4. Alerting & Communication
- **`notify`**: A modular delivery system that formats regressions into human-readable messages (HTML/Markdown) and dispatches them to Email or Issue Trackers.
- **`anomalygroup`**: Aggregates individual regressions into logical "groups" to prevent alert fatigue and streamline bisection.
- **`issuetracker`**: A high-level client for the Google Issue Tracker, automating bug filing and status updates.
### 5. Management & Tooling
- **`perf-tool`**: A Swiss-army-knife CLI for administrators to perform re-ingestion, backups, and database migrations.
- **`maintenance`**: A dedicated service role for background tasks like schema migration, cache warming (`psrefresh`), and data retention.
- **`ts`**: Automates the generation of TypeScript definitions from Go structs to ensure frontend/backend type safety.
## Core Workflows
### Data Lifecycle: Ingestion to Visualization
This workflow illustrates how a single performance measurement moves from a test bot to a user's screen.
```text
[ Test Bot ] --(JSON)--> [ GCS Bucket ]
|
(PubSub Notification)
|
v
[ Ingest Worker ] ----> [ Parser / Filter ]
| |
| (CommitNumber) <--- [ perfgit ]
v |
[ TraceStore ] <---------------+
|
[ dfbuilder ] <--- [ Frontend Query ]
|
v
[ DataFrame ] ----> [ Web UI (Graph) ]
```
### Detection Workflow: Ingestion to Notification
This workflow shows how new data triggers automated analysis and alerting.
```text
[ Ingest Event ]
|
v
[ continuous/Detector ] ----> [ alert/ConfigProvider ]
| |
| (Fetch Alert Configs) <-----'
|
v
[ regression/Detector ] ----> [ clustering2 / stepfit ]
| |
| (Anomalies Found) <---------'
v
[ anomalygroup ] ------------> [ notify ]
| |
| (Merge into Group) |-- [ Email ]
| |-- [ IssueTracker ]
v `-- [ Pinpoint (Bisection) ]
[ Temporal Workflow ]
```
## Related Modules
- **`/proto`**: Defines the gRPC and storage contracts used for cross-service communication.
- **`infra/go/sql`**: Provides underlying SQL pooling and timeout management.
- **`infra/go/pubsub`**: Manages the event-driven triggers for the ingestion pipeline.
# Module: /go/alertfilter
### High-Level Overview
The `alertfilter` module provides a centralized set of constants used to define the scope of alert visibility within the Skia Perf application. It acts as a shared vocabulary between the backend logic that queries alert configurations and the frontend components that allow users to toggle between different views of those alerts.
### Design and Implementation
The primary design goal of this module is to eliminate "magic strings" and ensure consistency across the Perf codebase when filtering alerts. Alerts in Perf can be numerous, often belonging to different teams or individual developers. To make the system manageable, the UI provides mechanisms to filter these alerts based on ownership.
By defining these modes as constants, the system ensures that:
1. **Backend Queries** use standardized keys when filtering alert configurations from the database.
2. **API Requests** from the frontend remain consistent, avoiding bugs caused by case-sensitivity or typos in string literals.
3. **Future Expansion** of filtering logic (e.g., filtering by "TEAM" or "SUBSYSTEM") has a clear, singular location for definition.
### Key Components and Responsibilities
The module currently defines two primary filtering modes:
- **`ALL`**: Represents a global view. This mode is used when a user or a service needs to inspect every active alert configuration within the instance, regardless of who created them or who is listed as the owner.
- **`OWNER`**: Represents a personalized view. This mode restricts the alert list to those specifically associated with the authenticated user. This is the primary mechanism for reducing noise in the dashboard, allowing developers to focus on the performance regressions they are directly responsible for.
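As an illustration, the two modes could be declared and validated as follows. The string values and the `isValidFilter` helper are assumptions made for this sketch; only the constant names `ALL` and `OWNER` come from the description above.

```go
package main

import "fmt"

// Filter modes. The string values are assumed for illustration.
const (
	ALL   = "ALL"
	OWNER = "OWNER"
)

// isValidFilter is a hypothetical helper that rejects any value that is
// not one of the shared constants, including case variants.
func isValidFilter(s string) bool {
	switch s {
	case ALL, OWNER:
		return true
	}
	return false
}

func main() {
	fmt.Println(isValidFilter(OWNER), isValidFilter("owner")) // prints "true false"
}
```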
### Workflow
When a user interacts with the Perf alert dashboard, the filtering logic typically follows this flow:
```text
User Interface          Backend Handler          Database/Store
+--------------+        +-----------------+      +------------------+
| Select View  |        | Validate Filter |      | Query Alerts     |
| (ALL/OWNER)  |------> | using constants |----> | with Filter Type |
+--------------+        +-----------------+      +------------------+
                                                           |
                                                           v
                                                 [Result Set Filtered]
```
This simple constant-based approach ensures that the "intent" of the user's filter is preserved and correctly interpreted as it passes through the various layers of the Perf service.
# Module: /go/alerts
# Alerts Module
The `alerts` module provides the core data structures and logic for managing performance regression detection configurations in Perf. It defines how an alert is structured, how to derive specific trace queries from generalized alert configurations, and provides a caching layer to ensure high-performance access to these configurations during the anomaly detection process.
## Design Philosophy
The module is designed around the concept of a "Dynamic Alert Configuration." Rather than requiring a separate alert for every single hardware/software combination, the system allows for generalized queries that can be expanded into many specific sub-queries using a "Group By" mechanism. This reduces configuration toil while maintaining granular detection.
### Key Implementation Choices:
- **Expansion via Cartesian Product:** The module uses the `paramset` of the data to expand a single `Alert` config into multiple specific queries. This allows an admin to say "alert on all models," and the system will automatically generate a query for "model=nexus4", "model=nexus6", etc.
- **Soft Deletion:** Alerts are rarely hard-deleted. They transition through `ConfigState` (ACTIVE to DELETED) to maintain historical context for previously detected anomalies.
- **Serialized IDs:** To bridge the gap between backend `int64` database IDs and frontend JSON/JavaScript requirements, the module uses a custom `SerializesToString` type. This ensures that large integer IDs do not lose precision in the browser and that uninitialized IDs (like `0` for issue tracker components) are handled gracefully as empty strings.
## Key Components
### Alert Configuration (`config.go`)
The `Alert` struct is the central entity. It contains:
- **Query Logic:** The `Query` string (URL-encoded params) and `GroupBy` fields.
- **Detection Parameters:** `Algo` (e.g., K-Means), `Step` detection settings, `Radius` (the window of commits to analyze), and `Interesting` (the threshold for regression).
- **Action Metadata:** Where to send the alert (`Alert` email, `IssueTrackerComponent`) and what `Action` to take (report, bisect, or none).
### Config Provider (`configprovider.go`)
Since anomaly detection is a frequent background process, querying the database for every check would be inefficient. The `ConfigProvider` implements a thread-safe, in-memory cache of all alert configurations.
- **Automatic Refresh:** It runs a background "refresher" goroutine that periodically polls the underlying `Store` to update the local cache.
- **State Filtering:** It maintains separate internal maps for "active" alerts and "all" alerts (including deleted ones), allowing callers to quickly retrieve the appropriate set without manual filtering.
### Alert Store Interface (`store.go`)
This file defines the `Store` interface, which abstracts the persistence layer. It supports standard CRUD operations and specialized batch operations like `ReplaceAll` (used for synchronizing alerts with external subscription files). Implementations of this interface (such as `sqlalertstore`) handle the mapping between the Go structs and the database schema.
## Key Workflows
### Query Expansion Process
When the detection engine processes an `Alert`, it doesn't just run the raw `Query`. It expands it based on the `GroupBy` field:
```text
1. Alert Config:
Query: "benchmark=blink_perf"
GroupBy: "browser, machine"
2. ParamSet (Available Data):
browser: [chrome, firefox]
machine: [m1, m2]
3. Expansion (QueriesFromParamset):
-> "benchmark=blink_perf&browser=chrome&machine=m1"
-> "benchmark=blink_perf&browser=chrome&machine=m2"
-> "benchmark=blink_perf&browser=firefox&machine=m1"
-> "benchmark=blink_perf&browser=firefox&machine=m2"
```
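The expansion shown above is a Cartesian product over the `GroupBy` keys. A minimal sketch follows; `expandQueries` is a hypothetical stand-in for the real expansion function, and it skips URL escaping and validation for brevity.

```go
package main

import "fmt"

// expandQueries appends one key=value clause per GroupBy key, producing
// every combination of the values found in the ParamSet.
// If a GroupBy key has no values in the ParamSet, the result is empty.
func expandQueries(query string, groupBy []string, paramSet map[string][]string) []string {
	queries := []string{query}
	for _, key := range groupBy {
		values := paramSet[key]
		next := make([]string, 0, len(queries)*len(values))
		for _, q := range queries {
			for _, v := range values {
				next = append(next, q+"&"+key+"="+v)
			}
		}
		queries = next
	}
	return queries
}

func main() {
	paramSet := map[string][]string{
		"browser": {"chrome", "firefox"},
		"machine": {"m1", "m2"},
	}
	for _, q := range expandQueries("benchmark=blink_perf", []string{"browser", "machine"}, paramSet) {
		fmt.Println(q)
	}
}
```

The number of generated queries is the product of the value counts for each `GroupBy` key, so grouping by high-cardinality keys multiplies the detection work accordingly.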
### Configuration Caching and Retrieval
The `ConfigProvider` ensures that the detection engine always has a low-latency view of the configurations:
```text
Detection Engine        ConfigProvider                Alert Store (DB)
       |                       |                              |
       | GetAllAlertConfigs()  |                              |
       |---------------------->| (Check Cache)                |
       |     [Alert List]      |                              |
       |<----------------------|                              |
       |                       |                              |
       |                       | <--- Periodically Refresh ---|
       |                       |     List(includeDeleted)     |
       |                       |----------------------------->|
       |                       |        [Fresh Alerts]        |
       |                       |<-----------------------------|
       |                       |    (Update Internal Maps)    |
```
## Related Modules
- **`sqlalertstore`**: The primary SQL implementation of the `Store` interface.
- **`mock`**: Mock implementations of `Store` and `ConfigProvider` for unit testing.
- **`perf/go/types`**: Provides shared enums and types like `StepDetection` and `RegressionDetectionGrouping`.
# Module: /go/alerts/mock
# Alerts Mocks
The `go/alerts/mock` module provides autogenerated mock implementations of the interfaces defined in the `go/alerts` package. These mocks are built using [testify](https://github.com/stretchr/testify), allowing developers to simulate the behavior of alert storage and configuration retrieval in unit tests without requiring a live database or complex setup.
## Purpose and Design
The primary goal of this module is to decouple testing of higher-level components (like the anomaly detection engine or the UI handlers) from the underlying persistence layer. By providing programmable behaviors for alert configurations, tests can verify how the system reacts to specific alert states, missing configurations, or database errors.
The mocks are generated via `mockery` and adhere to the standard `testify/mock` pattern. Each mock struct includes a `New[InterfaceName]` constructor that automatically registers a cleanup function with the test runner (`t.Cleanup`), ensuring that expectations are asserted when the test finishes.
## Key Components
### ConfigProvider.go
The `ConfigProvider` mock simulates an object responsible for providing read access to alert configurations. This is typically used by components that need to query alert settings frequently, possibly with caching logic in the real implementation.
- **Capabilities**: It mocks methods like `GetAlertConfig` and `GetAllAlertConfigs`.
- **Use Case**: Testing the anomaly detection loop where the system needs to fetch current alert parameters to determine if a performance regression has occurred.
### Store.go
The `Store` mock simulates the persistent storage layer (usually a PostgreSQL-compatible database accessed through `pgx`). It covers the full CRUD lifecycle of an alert configuration.
- **Capabilities**: It mocks write operations (`Save`, `Delete`, `ReplaceAll`) and complex read operations (`List`, `ListForSubscription`).
- **Transactional Testing**: The `ReplaceAll` method accepts a `pgx.Tx` parameter. In the mock, this allows verifying that bulk updates are intended to be part of a transaction, even if no actual transaction is executed during the test.
## Typical Test Workflow
The mocks are utilized by setting expectations on specific method calls and defining what they should return.
```text
[ Test Case ]
|
| 1. Create Mock: m := mock.NewStore(t)
|
| 2. Set Expectation: m.On("Save", ...).Return(nil)
|
| 3. Inject Mock into Component under test
|
| 4. Execute Logic
|
V
[ Assertions ] <--- (Automatic cleanup checks if "Save" was called)
```
By using these mocks, you can simulate failure modes that are difficult to trigger with a real database, such as specific `pgx` errors or race conditions where an alert is deleted between two different read operations.
# Module: /go/alerts/sqlalertstore
# sqlalertstore
The `sqlalertstore` module provides a SQL-backed implementation of the `alerts.Store` interface used in Perf. It manages the persistence, retrieval, and lifecycle of alert configurations, which define how the system detects anomalies in performance data.
## Design Philosophy: Hybrid Storage
To balance the need for high-performance querying with the flexibility required for evolving alert configurations, this module employs a hybrid storage strategy:
- **Serialized State (JSON):** The complete definition of an alert—including complex query parameters, filtering rules, and metadata—is stored as a JSON blob. This "Document Store" approach allows the alert structure to evolve without requiring frequent database schema migrations.
- **Relational Columns:** Critical operational fields (like `config_state` and `sub_name`) are "promoted" from the JSON blob to top-level SQL columns. This enables the database to perform efficient indexing and filtering, which is essential for performance-sensitive tasks such as dashboard rendering and subscription-based alert processing.
## Key Components
### SQLAlertStore
The primary struct `SQLAlertStore` implements the `alerts.Store` interface. It wraps a database connection pool (`pool.Pool`) and manages a pre-defined map of SQL statements.
### Schema and Data Mapping
The underlying table structure uses specific columns to optimize common workflows:
- **ID:** The primary key. The store handles both inserting new alerts (where an ID is generated) and updating existing ones.
- **ConfigState:** Represents the operational status (e.g., `ACTIVE` or `DELETED`). The store implements "soft deletes" by updating this column rather than removing rows, ensuring historical data remains intact while allowing the application to filter for active alerts quickly.
- **Subscription Linking:** Columns `sub_name` and `sub_revision` link alerts to specific subscriptions. An index on `sub_name` ensures that `ListForSubscription` operations are highly performant.
- **LastModified:** A Unix timestamp updated on every change. This facilitates cache invalidation and ensures downstream anomaly detection engines use the most recent configuration.
## Key Workflows
### Saving and Updating Alerts
When an alert is saved, the store determines if it is a new entry or an update based on the presence of a valid ID. It serializes the entire configuration into JSON and extracts the relational fields for the SQL columns.
```text
Application                 SQLAlertStore                  Database
     |                            |                            |
     | Save(SaveRequest)          |                            |
     |--------------------------->|                            |
     |                            | Serializes Cfg to JSON     |
     |                            | Identifies ID status       |
     |                            |                            |
     |                            | INSERT/UPDATE ...          |
     |                            |--------------------------->|
     |                            |                            |
     | Success/Error              |<---------------------------|
     |<---------------------------|                            |
```
### Batch Replacement (`ReplaceAll`)
This workflow is used when a set of alerts needs to be synchronized with an external source (like a subscription configuration). It operates within a single transaction:
1. Marks all currently `ACTIVE` alerts as `DELETED`.
2. Inserts the new set of alert configurations.
This ensures an atomic transition from the old state to the new state.
### Alert Retrieval and Sorting
The `List` and `ListForSubscription` methods retrieve alerts and deserialize the JSON blobs back into Go structs. Because database results are not guaranteed to be ordered by application-level logic, the module explicitly sorts the resulting slice by `DisplayName` (and then by `ID` as a tie-breaker) before returning it to the caller.
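A sketch of that ordering using the standard library; the `Alert` struct here is a reduced stand-in with only the two fields that matter for sorting, and `sortAlerts` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
)

// Alert carries only the fields needed to demonstrate the ordering.
type Alert struct {
	ID          int64
	DisplayName string
}

// sortAlerts orders by DisplayName, using ID to break ties so the
// result is deterministic even when names repeat.
func sortAlerts(alerts []Alert) {
	sort.Slice(alerts, func(i, j int) bool {
		if alerts[i].DisplayName != alerts[j].DisplayName {
			return alerts[i].DisplayName < alerts[j].DisplayName
		}
		return alerts[i].ID < alerts[j].ID
	})
}

func main() {
	alerts := []Alert{{3, "beta"}, {2, "alpha"}, {1, "beta"}}
	sortAlerts(alerts)
	fmt.Println(alerts) // prints "[{2 alpha} {1 beta} {3 beta}]"
}
```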
## Implementation Details
- **Soft Deletion:** The `Delete` method performs an `UPDATE` statement setting `config_state` to `1` (DELETED) and updating the `last_modified` timestamp.
- **Serialization:** The module uses standard JSON encoding to store the `alerts.Alert` struct. During retrieval, the SQL `id` is injected back into the struct after unmarshaling to ensure the application remains synchronized with the database's primary key.
- **Concurrency:** By using `last_modified` and standard SQL transactions (in `ReplaceAll`), the store maintains consistency even when multiple processes attempt to update alert configurations simultaneously.
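The hybrid mapping can be sketched without a database: the whole alert is serialized to JSON, selected fields are copied into their own columns, and on the way back the row ID overrides whatever the blob contains. The `Alert` and `Row` types and their field names are simplified assumptions, not the actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Alert is a reduced stand-in for alerts.Alert.
type Alert struct {
	ID          int64  `json:"id"`
	DisplayName string `json:"display_name"`
	Query       string `json:"query"`
	State       int    `json:"state"` // 0 = ACTIVE, 1 = DELETED
	SubName     string `json:"sub_name"`
}

// Row models one record: a JSON blob plus the promoted columns.
type Row struct {
	ID           int64
	Alert        string // serialized alert
	ConfigState  int
	SubName      string
	LastModified int64 // Unix seconds
}

// toRow serializes the alert and copies the promoted fields.
func toRow(a Alert, now int64) (Row, error) {
	b, err := json.Marshal(a)
	if err != nil {
		return Row{}, err
	}
	return Row{ID: a.ID, Alert: string(b), ConfigState: a.State, SubName: a.SubName, LastModified: now}, nil
}

// fromRow deserializes the blob and injects the row's primary key,
// which wins over any ID stored inside the JSON.
func fromRow(r Row) (Alert, error) {
	var a Alert
	if err := json.Unmarshal([]byte(r.Alert), &a); err != nil {
		return Alert{}, err
	}
	a.ID = r.ID
	return a, nil
}

func main() {
	row, err := toRow(Alert{DisplayName: "encode time", Query: "test=encode", SubName: "demo"}, time.Now().Unix())
	if err != nil {
		panic(err)
	}
	row.ID = 42 // pretend the database assigned this key on insert
	a, err := fromRow(row)
	if err != nil {
		panic(err)
	}
	fmt.Println(a.ID, a.DisplayName, row.ConfigState) // prints "42 encode time 0"
}
```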
# Module: /go/alerts/sqlalertstore/schema
The `sqlalertstore/schema` module defines the structural contract for persisting Perf alerts within a SQL database. It serves as the single source of truth for the database schema, ensuring that the Go representation of an alert maps correctly to the relational storage layer used by the `sqlalertstore`.
### Design Philosophy: Hybrid Storage
The schema employs a hybrid storage strategy, balancing relational querying capabilities with the flexibility of document storage:
- **Serialized State (The "What"):** The bulk of an alert's configuration—including its complex filtering rules, parameters, and metadata—is stored as a serialized JSON blob in the `Alert` column. This allows the alert definition to evolve (adding or removing fields) without requiring frequent and expensive SQL migrations.
- **Relational Columns (The "How"):** Key fields are "promoted" to top-level SQL columns to facilitate efficient indexing, filtering, and sorting. This is critical for performance-sensitive operations, such as displaying alert dashboards or processing state-specific tasks.
### Key Components and Data Mapping
#### State Management
The `ConfigState` column holds the operational state of an alert (e.g., active, deleted), copied out of the JSON blob. Because it is stored as an integer column, the system can look up all "active" alerts across thousands of entries without parsing JSON in the database engine.
#### Subscription Linking
Alerts in Perf are often tied to specific subscriptions. The schema explicitly tracks:
- `SubscriptionName`: The identifier for the alert's origin.
- `SubscriptionRevision`: A pointer to the specific version of the subscription configuration.
The module includes an explicit index (`idx_alerts_subname`) on the `sub_name` column. This design choice optimizes for the common workflow of retrieving all alerts associated with a specific subscription, which is a frequent operation in both the UI and the automated ingestion pipeline.
#### Concurrency and Updates
The `LastModified` column stores a Unix timestamp. This is primarily used for cache invalidation and ensuring that downstream consumers (like the anomaly detection engine) are operating on the most recent version of the alert definition.
### Data Flow Overview
The following diagram illustrates how the schema interacts with the application and storage layers:
```text
Go Application Layer SQL Database Layer
+-----------------------+ +----------------------------+
| | | Alerts Table |
| alerts.Alert Struct | ----+ | [ID] (Primary Key) |
| | | | |
+-----------------------+ +-->| [Alert] (JSON Blob) |
| | | |
(Extraction Logic) +-->| [ConfigState] (Indexed) |
| | | |
+------------------+-->| [SubscriptionName] (Index) |
| |
| [LastModified] |
+----------------------------+
```
### Future Considerations
The schema identifies two specific areas for technical debt reduction to improve performance and consistency:
1. **JSONB Transition:** Moving from `TEXT` to `JSONB` for the `Alert` column to allow for more efficient internal database indexing of the blob content.
2. **Type Rationalization:** Aligning the `ConfigState` representation across the Go codebase and SQL to prevent casting overhead and improve type safety.
# Module: /go/anomalies
### High-level Overview
The `go/anomalies` module defines the core abstraction for interacting with performance anomalies (regressions) within Skia Perf. It provides a standardized interface for querying anomaly data, regardless of whether that data resides in the legacy Chrome Perf system or Skia Perf's native SQL-based regression store.
The primary goal of this module is to decouple the consumption of anomaly data (used for visualization, alerting, and analysis) from the underlying storage implementation and the specific protocols required to communicate with external APIs.
### Design Decisions and Implementation Choices
#### Unified Abstraction via Interfaces
The central component of the module is the `Store` interface. This design choice allows the Perf system to remain agnostic about the data source. By using a single interface, the system can switch between a direct SQL backend, a proxied Chrome Perf API, or a cached implementation without modifying the business logic of the calling components.
#### Cross-System Compatibility
The module leverages data structures defined in `chromeperf.AnomalyMap` and `chromeperf.AnomalyForRevision`. This maintains a consistent data contract between the frontend (which historically expected Chrome Perf formats) and various backends. This compatibility layer ensures that anomalies generated by different systems can be merged and displayed in a uniform way on performance dashboards.
#### Support for Diverse Query Patterns
The interface is designed to support the three primary ways users and automated systems interact with performance data:
1. **Trace-Centric:** Querying anomalies for specific performance metrics (traces) across a range of commits.
2. **Time-Centric:** Querying anomalies within a specific window of time, which is essential for "last 24 hours" views or investigating incidents at specific clock times.
3. **Revision-Centric:** Investigating the context around a specific git revision to see if a particular change caused regressions across multiple disparate traces.
### Key Components and Responsibilities
#### `anomalies.go`
This file defines the `Store` interface, which is the foundational contract for the entire module.
- **`GetAnomalies`**: Retrieves anomalies based on commit positions. It allows for filtering by specific `traceNames`. If the slice is empty, the implementation is expected to return all anomalies within the commit range.
- **`GetAnomaliesInTimeRange`**: Facilitates temporal lookups. Implementations (like the SQL-based one) often need to resolve these time ranges into commit ranges using a Git provider before querying the underlying database.
- **`GetAnomaliesAroundRevision`**: Provides a way to "zoom in" on a specific point in history, returning anomalies that occurred at or near a target revision.
#### Submodules and Implementations
The module's functionality is extended and specialized through its submodules:
- **`impl`**: Contains the concrete logic for data retrieval. This includes `sql_impl.go` for native Skia Perf storage and `chromeperf_impl.go` for interacting with the legacy Google-internal Chrome Perf API.
- **`cache`**: Implements a middleware layer that wraps another `Store`. It uses LRU (Least Recently Used) caches and a time-based invalidation strategy to reduce the latency of repeated queries and minimize the load on the source-of-truth databases or APIs.
- **`mock`**: Provides autogenerated mocks for unit testing, allowing other modules to simulate various anomaly data scenarios (such as empty results or API errors) in a controlled environment.
### Workflow: Interface Interaction
The following diagram illustrates how the `Store` interface acts as a gateway between the Perf UI/Services and the various data backends:
```text
+---------------------------------------+
| Perf UI / Regression Detection |
+---------------------------------------+
|
v
+-----------------------+
| anomalies.Store (I) |
+-----------------------+
|
+----------------+----------------+
| | |
v v v
+----------------+ +----------------+ +----------------+
| cache.Store | | sql.Store | | chromeperf. |
| (Middleware) | | (Native DB) | | Store (API) |
+----------------+ +----------------+ +----------------+
| | |
+------ wraps ---+ +--- calls ---> Chrome Perf API
```
### Interactions
The module depends heavily on the `perf/go/chromeperf` package for its data models. This dependency reflects the module's role as a bridge between the modern Skia Perf infrastructure and the established data formats of the Chrome Performance monitoring ecosystem.
# Module: /go/anomalies/cache
### High-level Overview
The `anomalies/cache` module provides a performance-optimized caching layer for anomaly data retrieved from the Chrome Perf API. It acts as an intermediary `Store` that reduces the load on external API services and improves the responsiveness of Skia Perf when querying for regressions and anomalies.
It is designed to handle three primary types of lookups:
1. **Trace-based:** Anomalies associated with specific trace names within a commit range.
2. **Revision-based:** Anomalies occurring around a specific revision number.
3. **Time-based:** Anomalies occurring within a specific time window.
### Design Decisions and Implementation Choices
#### Layered Caching Strategy
The module utilizes two distinct LRU (Least Recently Used) caches to balance memory usage and performance:
- **`testsCache`**: Indexed by a composite key of trace name and commit range. This handles the most frequent queries where users are looking at specific performance graphs.
- **`revisionCache`**: Indexed by revision number, supporting workflows that investigate specific changesets.
#### Invalidation and Accuracy Trade-offs
A key challenge in caching anomaly data is that anomalies can be modified (e.g., marked as "invalid" or "fixed") in the source system. To handle this, the module implements an `invalidationMap`.
Instead of complex, fine-grained invalidation logic that would require deep inspection of every cache entry, the module uses a "simple and safe" approach:
- When a trace is marked as modified via `InvalidateTestsCacheForTraceName`, its name is added to a map.
- During subsequent fetches, if a trace is found in this map, the cache is bypassed even if a hit occurs, forcing a fresh fetch from Chrome Perf.
- To prevent this map from growing indefinitely, it is completely wiped every 24 hours. While this creates a small window where an old anomaly might reappear briefly if the wipe happens immediately after a modification, it ensures O(1) operations and minimal memory overhead compared to tracking individual commit-level changes.
#### Proactive Cleanup
Standard LRU behavior handles capacity, but it doesn't account for data staleness. The module implements a background goroutine that periodically checks the oldest items in the cache against a Time-to-Live (TTL) of 10 minutes. This ensures that even low-traffic data is eventually refreshed to reflect the current state of the Chrome Perf database.
### Key Components and Responsibilities
#### `cache.go`
This is the primary implementation file. It defines the `store` struct and the logic for the anomaly store.
- **`GetAnomalies`**: Orchestrates a hybrid fetch. It checks the LRU cache for each requested trace. Any traces missing from the cache or marked in the `invalidationMap` are bundled into a single batch request to the `ChromePerf` client. The results are then merged and the cache is updated.
- **`cleanupCache`**: A background worker function that drains the LRU cache of items older than the `cacheItemTTL`. It specifically targets the "oldest" items to minimize the work performed during each tick.
- **`getAnomalyCacheKey`**: Generates a deterministic string key: `traceName:startCommit:endCommit`. This ensures that different ranges for the same trace are cached independently, preventing range-mismatch bugs.
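The key format described for `getAnomalyCacheKey` can be sketched in a few lines. The function name `anomalyCacheKey` and its signature are assumptions for illustration; only the `traceName:startCommit:endCommit` layout comes from the description above.

```go
package main

import "fmt"

// anomalyCacheKey builds a deterministic key in the form
// traceName:startCommit:endCommit (signature is an assumption).
func anomalyCacheKey(traceName string, startCommit, endCommit int) string {
	return fmt.Sprintf("%s:%d:%d", traceName, startCommit, endCommit)
}

func main() {
	// The same trace with two different ranges yields two independent keys.
	fmt.Println(anomalyCacheKey(",bot=pixel_6,", 100, 200))
	fmt.Println(anomalyCacheKey(",bot=pixel_6,", 100, 300))
}
```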
#### Workflow: Data Retrieval
The following diagram illustrates how the `GetAnomalies` method handles a request for multiple traces:
```text
User Request (Traces A, B, C)
|
v
+---------+----------+
| Check testsCache | <-------+
+---------+----------+ |
| |
+-----+-----+ |
| | |
[A, B] Hit [C] Miss/Invalid |
| | |
| +-----v--------------+-----+
| | Fetch [C] from ChromePerf|
| +------------+-------------+
| |
| +------v------+
| | Update Cache|
| +------+------+
| |
+--------+---------+
|
v
Merged Results
```
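The flow in the diagram could be approximated as follows. This is a sketch under several assumptions: the cache is a plain map keyed by trace name only (the real cache is an LRU keyed by trace and commit range), anomalies are reduced to commit numbers, and `fetchFunc` is a hypothetical stand-in for the batch call to the Chrome Perf client.

```go
package main

import (
	"fmt"
	"sort"
)

// fetchFunc stands in for a single batched request to the backing API.
type fetchFunc func(traceNames []string) (map[string][]int, error)

// hybridStore is a hypothetical, simplified cache-in-front-of-API store.
type hybridStore struct {
	cache   map[string][]int // trace name -> commit numbers with anomalies
	invalid map[string]bool  // traces whose cached data must be bypassed
	fetch   fetchFunc
}

func (s *hybridStore) GetAnomalies(traceNames []string) (map[string][]int, error) {
	result := map[string][]int{}
	var missing []string
	for _, name := range traceNames {
		// A hit only counts if the trace has not been invalidated.
		if cached, ok := s.cache[name]; ok && !s.invalid[name] {
			result[name] = cached
			continue
		}
		missing = append(missing, name)
	}
	if len(missing) == 0 {
		return result, nil
	}
	// All misses and invalidated traces go out in one batch.
	sort.Strings(missing)
	fetched, err := s.fetch(missing)
	if err != nil {
		return nil, err
	}
	for _, name := range missing {
		s.cache[name] = fetched[name]
		result[name] = fetched[name]
	}
	return result, nil
}

func main() {
	s := &hybridStore{
		cache:   map[string][]int{"A": {101}, "B": {102}},
		invalid: map[string]bool{},
		fetch: func(names []string) (map[string][]int, error) {
			fmt.Println("fetching:", names)
			return map[string][]int{"C": {103}}, nil
		},
	}
	got, _ := s.GetAnomalies([]string{"A", "B", "C"})
	fmt.Println(len(got))
}
```

Note that the sketch leaves an invalidated trace in `invalid` after refetching it, mirroring the description that such traces keep bypassing the cache until the periodic wipe.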
#### Interactions
The module heavily relies on the `chromeperf.AnomalyApiClient` interface. This decoupling allows the cache to be tested with mocks (as seen in `cache_test.go`) and ensures the caching logic remains independent of the underlying transport (HTTP/gRPC) used to communicate with Chrome Perf.
# Module: /go/anomalies/impl
The `go/anomalies/impl` module provides concrete implementations of the `anomalies.Store` interface. Its primary purpose is to abstract the retrieval of performance anomalies (regressions) from different backends, specifically the legacy Chrome Perf API and the modern Skia Perf SQL-based regression store.
By providing a unified interface, this module allows the rest of the Perf system to query for anomalies using commit ranges, time ranges, or specific revisions without needing to know whether the data is coming from an external service or a local database.
### Key Components
#### Chrome Perf Implementation (`chromeperf_impl.go`)
The `store` struct in this file acts as a proxy to the Chrome Perf Anomaly API. It is used in deployments where Skia Perf needs to display or synchronize with anomalies managed by the legacy Chrome Perf system.
- **Responsibility**: Facilitates communication with the `chromeperf.AnomalyApiClient`.
- **Design Choice**: It performs minimal logic, primarily sorting trace names to ensure deterministic API requests and mapping the results into the standard `chromeperf.AnomalyMap` format.
#### SQL Implementation (`sql_impl.go`)
The `sqlAnomaliesStore` provides an implementation that retrieves anomalies directly from Skia Perf's own database by wrapping a `regression.Store`.
- **Responsibility**: Translates Skia Perf "Regressions" into the "Anomaly" format expected by the frontend and other consumers.
- **Conversion Logic**: Since Skia Perf stores data as `regression.Regression` objects, this implementation uses `compat.ConvertRegressionToAnomalies` to transform them.
- **Multiplicity Tracking**: A significant detail in this implementation is how it handles multiple anomalies on the same trace at the same commit. It maintains a `multiplicities` map during the conversion process to increment the `Multiplicity` field of the anomaly, ensuring each unique regression is identifiable even if they overlap in the commit/trace dimensions.
- **Dependency on Git**: Unlike the Chrome Perf implementation, the SQL store requires a `git.Git` provider. This is because Skia Perf's regression store is indexed by commit numbers. When a user requests anomalies for a **time range**, the store first uses the Git provider to resolve that time range into a slice of commits, determining the start and end commit positions before querying the database.
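To make the multiplicity idea concrete, here is a minimal sketch of counting regressions that share a trace and commit. The `regression` and `anomaly` types are hypothetical simplifications, and starting the counter at zero is an assumption; the real conversion goes through `compat.ConvertRegressionToAnomalies`, whose details are not reproduced here.

```go
package main

import "fmt"

// regression and anomaly are simplified, hypothetical stand-ins.
type regression struct {
	ID        string
	TraceName string
	Commit    int
}

type anomaly struct {
	RegressionID string
	TraceName    string
	Commit       int
	Multiplicity int
}

// convert assigns each anomaly the number of earlier anomalies seen at the
// same (trace, commit) pair, so overlapping entries stay distinguishable.
func convert(regs []regression) []anomaly {
	type key struct {
		trace  string
		commit int
	}
	multiplicities := map[key]int{}
	out := make([]anomaly, 0, len(regs))
	for _, r := range regs {
		k := key{r.TraceName, r.Commit}
		out = append(out, anomaly{
			RegressionID: r.ID,
			TraceName:    r.TraceName,
			Commit:       r.Commit,
			Multiplicity: multiplicities[k], // zero-based is an assumption
		})
		multiplicities[k]++
	}
	return out
}

func main() {
	for _, a := range convert([]regression{
		{"r1", "traceA", 100},
		{"r2", "traceA", 100},
		{"r3", "traceB", 100},
	}) {
		fmt.Println(a.RegressionID, a.Multiplicity)
	}
}
```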
### Workflows
#### Time-Range Query Resolution (SQL Store)
When querying by time, the module performs a two-step resolution to bridge the gap between wall-clock time and the commit-indexed database.
```text
User Request (Time Range)
|
v
+-----------------------+ +-----------------------+
| sqlAnomaliesStore | | git.Git |
| |----->| |
| GetAnomaliesInTime... | | CommitSliceFromTime...|
+-----------------------+ +-----------|-----------+
| |
| <---------- Commit IDs --------+
v
+-----------------------+ +-----------------------+
| regression.Store | | SQL Database |
| |<---->| |
| Range/RangeFiltered| | (Regressions Table) |
+-----------------------+ +-----------------------+
|
v
Convert to Anomalies --------> Result Map
```
#### Revision Windowing
For the `GetAnomaliesAroundRevision` method, the SQL implementation uses a "sliding window" strategy. It defines a hardcoded window (currently 500 commits) around the target revision. This gives the user context, showing not just an anomaly at a specific point but also nearby fluctuations that might share the same root cause.
### Design Decisions
- **Interface Compatibility**: The module heavily utilizes the `chromeperf` package's data structures (like `AnomalyMap` and `AnomalyForRevision`). This design choice was made to maintain compatibility with existing UI components and tools that were originally built for the Chrome Perf ecosystem.
- **Filtering at the Source**: In `sql_impl.go`, the code distinguishes between fetching all regressions (`Range`) and fetching regressions for specific traces (`RangeFiltered`). This allows the module to offload filtering to the database layer when possible, rather than fetching all data and filtering in-memory.
- **Error Handling**: The implementations are designed to be resilient; for instance, if the Chrome Perf API fails, it logs the error but may return a partial/empty result rather than crashing the calling process, acknowledging that anomaly data is often "best-effort" in high-latency environments.
# Module: /go/anomalies/mock
The `go/anomalies/mock` module provides a mock implementation of the anomaly storage interface used within the Perf system. Its primary purpose is to facilitate unit testing for components that depend on anomaly data without requiring a live connection to Chrome Perf or a production database.
### Design and Implementation
The module leverages the `testify/mock` framework to provide a programmable substitute for the anomaly `Store`. This approach allows developers to define expected behaviors, such as returning specific sets of anomalies or simulating network errors, ensuring that higher-level logic (like regression detection or UI rendering) handles various data scenarios correctly.
The mock is autogenerated based on the `Store` interface defined in the anomalies package. This ensures that the mock remains synchronized with the actual interface used by the system.
### Key Components
#### Store.go
The core of this module is the `Store` struct. It implements the methods required to query anomaly data across different dimensions:
- **Commit-based Lookups**: The `GetAnomalies` method allows tests to simulate retrieving anomalies within a specific range of commit positions for a set of traces.
- **Time-based Lookups**: The `GetAnomaliesInTimeRange` method enables testing workflows that rely on temporal queries rather than commit sequences.
- **Revision Context**: The `GetAnomaliesAroundRevision` method provides functionality to mock the retrieval of anomalies centered around a specific point in history, useful for validating "nearby" anomaly detection logic.
### Usage Workflow
When writing a test for a component that consumes anomalies, the `mock.Store` is instantiated and injected as a dependency. The general flow for using this module is as follows:
```text
+------------------+ +------------------------+ +------------------+
| Unit Test | | Mock Store | | System Under Test|
+------------------+ +------------------------+ +------------------+
| | | | | |
| 1. Setup Mock |--------->| Register Expectations | | |
| (On/Return) | | (e.g., GetAnomalies) | | |
| | +------------------------+ | |
| 2. Inject Mock |--------------------------------------------->| Execute Logic |
| | +------------------------+ | |
| | | 3. Intercept Call | <--------| Call Store API |
| | | Return Mock Data | -------->| |
| | +------------------------+ | |
| 4. Verify | | | | |
| AssertExpects |--------->| Check Call History | | |
+------------------+ +------------------------+ +------------------+
```
1. **Initialization**: Use `NewStore(t)` to create a new mock instance. This automatically registers cleanup functions to verify that all expected calls were made before the test finishes.
2. **Expectation Setting**: Use the `.On(...)` syntax to define which parameters the mock should expect and what values (or errors) it should return.
3. **Assertion**: The mock tracks all interactions, allowing the test to verify not just the output of the system under test, but also that the system interacted with the anomaly store in the expected manner (e.g., querying the correct trace names).
# Module: /go/anomalygroup
# Anomaly Grouping Module
The `anomalygroup` module provides a centralized system for aggregating individual performance regressions (anomalies) into cohesive groups. In a high-scale performance monitoring environment, a single root cause (like a specific commit) often triggers multiple alerts across different benchmarks or configurations. This module shifts the workflow from managing hundreds of isolated alerts to managing a single "Anomaly Group," which acts as the unit of work for bisection, bug reporting, and remediation tracking.
## High-Level Overview
The primary goal of this module is to reduce "alert fatigue" and streamline root-cause analysis. It achieves this by correlating new anomalies with existing ones based on shared metadata (e.g., benchmark name, domain, and subscription) and overlapping commit ranges.
The module is structured as a tiered system:
- **Storage Layer**: Defines the `Store` interface and provides a SQL implementation for persisting group metadata and membership.
- **Service Layer**: A gRPC implementation that orchestrates interactions between anomaly, culprit, and regression data.
- **Notification/Utility Layer**: Integrates the grouping logic into the Perf detection pipeline, ensuring every detected regression is either funneled into an existing group or initiates a new one.
## Design and Implementation Choices
### The "Find-or-Create" Lifecycle
The module follows a strict "find-or-create" pattern. When a regression is detected, the system does not immediately alert a human. Instead, it queries the `Store` for existing groups that match the anomaly's context.
- If a match is found, the anomaly is added to the group, and any linked issues (bugs) are updated with a comment.
- If no match is found, a new group is created, which may trigger automated processes like **Temporal** workflows for bisection.
### Overlap-Based Grouping
A key design decision is the "Common Revision Range" logic. When an anomaly is added to a group, the group's `start_commit` and `end_commit` are narrowed to the **intersection** of the current group range and the new anomaly's range. This ensures that a group only contains anomalies that could logically have been caused by the same commit.
### Data Model Flexibility
The module uses **JSONB** (in the SQL implementation) for group metadata. This allows the system to store heterogeneous attributes like `subscription_name` or `benchmark_name` without requiring rigid schema migrations as the types of performance data evolve.
### Concurrency and Consistency
To prevent race conditions where multiple detection workers might try to create a group for the same regression simultaneously, the utility layer employs a global mutex during the "find-or-create" phase. This ensures that the mapping of anomalies to groups remains consistent and prevents duplicate bisection jobs or bug reports.
## Key Components
### Store (`/go/anomalygroup/store.go`)
The `Store` interface defines the data access layer. It is responsible for the persistence of groups and their associations (Anomaly IDs and Culprit IDs).
- **`AddAnomalyID`**: Not only links an ID but also performs the mathematical narrowing of the group's commit range.
- **`FindExistingGroup`**: The primary discovery mechanism used to deduplicate regressions.
### Service (`/go/anomalygroup/service`)
The `AnomalyGroupService` implements the gRPC API. It acts as an aggregator, fetching data from the `anomalygroup` store while also pulling detailed regression data (like medians and trace params) to provide a rich view of the group's impact. It includes ranking logic to identify "top anomalies" within a group based on the percentage change in performance.
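A ranking by percentage change might be sketched as below. The `groupAnomaly` type, the use of medians before and after the step, and ranking by absolute magnitude are all assumptions made for illustration; the text does not state whether the service treats improvements and regressions differently.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// groupAnomaly is a hypothetical, simplified view of an anomaly in a group.
type groupAnomaly struct {
	Trace        string
	MedianBefore float64
	MedianAfter  float64
}

// percentChange returns the relative change in percent.
func percentChange(a groupAnomaly) float64 {
	if a.MedianBefore == 0 {
		return 0 // avoids division by zero; the real handling is not documented here
	}
	return (a.MedianAfter - a.MedianBefore) / a.MedianBefore * 100
}

// topAnomalies returns up to n anomalies with the largest absolute change,
// leaving the input slice untouched.
func topAnomalies(anomalies []groupAnomaly, n int) []groupAnomaly {
	ranked := append([]groupAnomaly(nil), anomalies...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return math.Abs(percentChange(ranked[i])) > math.Abs(percentChange(ranked[j]))
	})
	if n < 0 {
		n = 0
	}
	if n > len(ranked) {
		n = len(ranked)
	}
	return ranked[:n]
}

func main() {
	top := topAnomalies([]groupAnomaly{
		{"a", 100, 110},
		{"b", 50, 100},
		{"c", 200, 100},
	}, 2)
	for _, a := range top {
		fmt.Printf("%s %.0f%%\n", a.Trace, percentChange(a))
	}
}
```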
### Notifier (`/go/anomalygroup/notifier`)
The notifier acts as the glue between the Perf engine and the grouping logic. It filters out "summary-level" regressions (which are too broad for specific grouping) and constructs a canonical **Test Path** (e.g., `master/bot/benchmark/test/subtest`) required for consistent cross-referencing with external systems like Chrome Perf.
### Utils (`/go/anomalygroup/utils`)
Contains the `AnomalyGrouper`, which handles the high-level business logic. It coordinates between the internal stores and external systems (Issue Trackers and Temporal). It decides whether to post a comment to an existing bug or trigger a new bisection workflow.
## Key Workflows
### Processing a New Regression
This workflow illustrates how a detected anomaly is integrated into the grouping system.
```text
[ Perf Detection Engine ]
|
v
[ AnomalyGroupNotifier ] --------------------.
| |
(Validate Trace & |
Build Test Path) |
| |
v |
[ AnomalyGrouper (Utils) ] <-----------------'
|
(Lock Mutex)
|
v
[ FindExistingGroup? ]
/ \
(No) (Yes)
| |
v v
[ Create New Group ] [ AddAnomalyID ]
[ Trigger Bisection ] [ Update Issue/Bug ]
| |
'-------.---------'
|
(Unlock Mutex)
|
v
[ Return GroupID ]
```
### Revision Range Narrowing
When adding an anomaly to a group, the module maintains the narrowest possible window for bisection:
```text
Group Range: [100 ..................... 150]
New Anomaly: [120 ............. 160]
===============================
Result Range: [120 .......... 150]
(Narrowest common overlap)
```
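The narrowing shown above is an interval intersection. A minimal sketch follows; the function name `narrowRange` and the inclusive-bounds convention are assumptions, and in the module this arithmetic is performed as part of `AddAnomalyID`.

```go
package main

import "fmt"

// narrowRange returns the intersection of the group's commit range and the
// new anomaly's range. ok is false when the two ranges do not overlap.
func narrowRange(groupStart, groupEnd, anomalyStart, anomalyEnd int) (start, end int, ok bool) {
	start = groupStart
	if anomalyStart > start {
		start = anomalyStart
	}
	end = groupEnd
	if anomalyEnd < end {
		end = anomalyEnd
	}
	if start > end {
		return 0, 0, false
	}
	return start, end, true
}

func main() {
	start, end, ok := narrowRange(100, 150, 120, 160)
	fmt.Println(start, end, ok)
}
```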
# Module: /go/anomalygroup/mocks
# Anomaly Group Mocks
The `go.skia.org/infra/perf/go/anomalygroup/mocks` module provides mock implementations of the interfaces defined in the `anomalygroup` package. Its primary purpose is to facilitate unit testing for components that depend on anomaly group persistence and retrieval without requiring a live database or complex setup.
## High-Level Overview
This module is generated using [mockery](https://github.com/vektra/mockery) and is based on the `testify` assertion framework. It focuses on mocking the `Store` interface, which is the primary abstraction for managing groups of performance anomalies. By using these mocks, developers can simulate various database states, such as existing groups, missing records, or successful updates, to verify the business logic of higher-level services.
## Key Components and Responsibilities
### Store.go
The `Store` struct is a mock implementation of the `anomalygroup.Store` interface. It records method calls and allows tests to define expected return values or behaviors (using `On` and `Return` methods from the testify mock package).
The mock covers the full lifecycle of an anomaly group:
- **Creation and Discovery**: Mocking `Create` and `FindExistingGroup` allows testing of the logic that decides whether a new anomaly should start a new group or join an existing one based on subscription name, domain, and commit range.
- **Association Management**: Methods like `AddAnomalyID` and `AddCulpritIDs` enable verification that the system correctly links individual detections and identified root causes to a group.
- **State Updates**: `UpdateBisectID` and `UpdateReportedIssueID` are used to test the integration between anomaly grouping and external systems like bisection services and issue trackers.
- **Retrieval**: `LoadById`, `GetAnomalyIdsByAnomalyGroupId`, and `GetAnomalyIdsByIssueId` support testing data access patterns and reporting logic.
## Usage Workflow
When writing a test for a component that manages anomalies, the mock is typically initialized and injected as a dependency.
```text
Tester Mock Store Component Under Test
| | |
|-- NewStore(t) ------->| |
| | |
|-- On("LoadById")... ->| |
| | |
|-- Inject Store ------>|-------------------------->|
| | |
|-- Trigger Action -------------------------------->|
| | |
| |<-- Call LoadById() -------|
| | |
| |--- Return AnomalyGroup -->|
| | |
|-- AssertExpectations()| |
```
## Implementation Decisions
- **Automated Generation**: The use of `mockery` ensures that the mock stays in sync with the `Store` interface defined in the core `anomalygroup` package. If the interface changes, the mock can be regenerated to reflect the new API.
- **Testify Integration**: Leveraging `github.com/stretchr/testify/mock` provides a standard, expressive syntax for setting up expectations, making tests more readable and maintainable.
- **Protobuf Dependency**: The mock depends on `go.skia.org/infra/perf/go/anomalygroup/proto/v1`, ensuring that the data structures returned by the mock methods (like `v1.AnomalyGroup`) are exactly the same as those used by the real implementation.
# Module: /go/anomalygroup/notifier
# Anomaly Group Notifier
The `anomalygroup/notifier` module provides an implementation of a regression notifier that integrates with the Anomaly Grouping system. Instead of simply sending a static notification (like an email or chat message), this notifier delegates the handling of a detected regression to an `AnomalyGrouper`, which manages how regressions are aggregated, tracked, and associated with issues.
## Overview
In the Perf system, a "Notifier" is typically responsible for alerting users when a regression is found. The `AnomalyGroupNotifier` fulfills this interface but focuses on the structured management of anomalies. Its primary role is to validate the incoming regression data, extract relevant metadata (such as "Test Paths"), and pass it to the `anomalygroup/utils` package for logic-heavy operations like finding or creating groups and updating issue trackers.
### Design Decisions
- **Granularity Filtering**: The notifier is designed to ignore "summary level" regressions. If a regression involves multiple traces (e.g., a high-level benchmark without a specific story), it is excluded from anomaly grouping. This prevents the system from creating noisy or overly broad groups that don't map clearly to specific test paths.
- **Test Path Construction**: To maintain compatibility with external systems (like Chrome Perf), the module enforces a specific hierarchy for identifying tests. It constructs a `testPath` string by concatenating parameters like `master`, `bot`, `benchmark`, `test`, and various `subtest` levels.
- **Minimal State**: The notifier itself is stateless, acting as a translation layer between the Perf notification event and the persistent Anomaly Grouping storage.
## Key Components
### AnomalyGroupNotifier
The central struct that implements the notification interface. It holds a reference to an `AnomalyGrouper`.
- **RegressionFound**: This is the primary entry point. When a regression is detected, this method:
1. Validates that the regression represents a single, specific trace.
2. Parses the trace keys to extract performance parameters.
3. Calculates median values before and after the anomaly (using `vec32`) for logging and diagnostic purposes.
4. Constructs a canonical `testPath`.
5. Calls `ProcessRegressionInGroup` to handle the actual grouping and issue tracking.
- **No-op Methods**: Methods like `RegressionMissing` and `UpdateNotification` are currently implemented as no-ops. This indicates that the anomaly grouping logic currently focuses on the discovery of regressions rather than the automated closing or updating of groups when a regression disappears.
### Test Path Logic
The functions `getTestPath` and `isParamSetValid` encapsulate the requirements for a regression to be "groupable."
- A valid regression must contain a specific set of keys: `master`, `bot`, `benchmark`, `test`, and `subtest_1`.
- The path is built following the pattern: `master/bot/benchmark/test/subtest_1/.../subtest_3`.
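The two rules above can be sketched as follows. The names `hasRequiredKeys` and `buildTestPath` are hypothetical and do not reproduce the signatures of `isParamSetValid` or `getTestPath`; treating `subtest_2` and `subtest_3` as optional and stopping at the first missing level are also assumptions of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// Required and optional keys, in the order they appear in the path.
var requiredKeys = []string{"master", "bot", "benchmark", "test", "subtest_1"}
var optionalKeys = []string{"subtest_2", "subtest_3"}

// hasRequiredKeys reports whether every required key has a non-empty value.
func hasRequiredKeys(params map[string]string) bool {
	for _, k := range requiredKeys {
		if params[k] == "" {
			return false
		}
	}
	return true
}

// buildTestPath joins the parameter values with "/" in hierarchy order.
// It returns false when the params are not groupable.
func buildTestPath(params map[string]string) (string, bool) {
	if !hasRequiredKeys(params) {
		return "", false
	}
	parts := make([]string, 0, len(requiredKeys)+len(optionalKeys))
	for _, k := range requiredKeys {
		parts = append(parts, params[k])
	}
	for _, k := range optionalKeys {
		v := params[k]
		if v == "" {
			break // deeper subtest levels only make sense after this one
		}
		parts = append(parts, v)
	}
	return strings.Join(parts, "/"), true
}

func main() {
	path, ok := buildTestPath(map[string]string{
		"master":    "ChromiumPerf",
		"bot":       "pixel_6",
		"benchmark": "motion_mark",
		"test":      "score",
		"subtest_1": "story",
	})
	fmt.Println(path, ok)
}
```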
## Workflow
The following diagram illustrates how a detected regression flows through this module:
```text
Perf Detection Engine
|
v
[AnomalyGroupNotifier.RegressionFound]
|
|-- (Validate: Is it a single trace?)
|-- (Validate: Does it have required params?)
|-- (Action: Construct Test Path)
|
v
[AnomalyGrouper (via /go/anomalygroup/utils)]
|
|-- (Action: Find existing group or create new one)
|-- (Action: Update Issue Tracker)
|
v
End Result: Regression is grouped and tracked.
```
## Related Modules
- **`perf/go/anomalygroup/utils`**: Contains the `AnomalyGrouper` interface and the core logic for managing the lifecycle of an anomaly group.
- **`perf/go/alerts`**: Provides the alert configurations that trigger these notifications.
- **`perf/go/issuetracker`**: Used by the underlying grouper to link anomalies to bug reports.
# Module: /go/anomalygroup/proto
### High-Level Overview
The `/go/anomalygroup/proto` module serves as the primary definition layer for the Anomaly Grouping system. It establishes the contract between the performance monitoring services and the data storage layer responsible for organizing regressions. This module is essential for transitioning from "individual data point alerts" to "structured performance investigations."
While the actual implementation details and specific gRPC service definitions are contained within versioned subdirectories (e.g., `v1`), this root module acts as the entry point for cross-service communication regarding grouped performance regressions.
### Design and Implementation Choices
#### Protobuf-First Architecture
The choice to define anomaly groups using Protocol Buffers (protobuf) is driven by the need for interoperability across different microservices in the Perf ecosystem. By defining the "Anomaly Group" as a structured message, the system ensures that the detector, the bisection engine, and the reporting UI all share a consistent view of what constitutes a group, regardless of their specific internal languages or storage backends.
#### Evolution via Versioning
The module structure (specifically the use of the `v1` subdirectory) reflects a design decision to support long-term API stability. Because Anomaly Groups are often linked to external trackers (like Monorail or Buganizer), the schema must evolve without breaking existing integrations. Versioning allows for:
- **Backward Compatibility**: Services can continue to use older message formats while the backend transitions to more complex grouping logic.
- **Incremental Feature Rollout**: New fields, such as those supporting advanced bisection parameters or different action types, can be introduced in new versions of the proto without disrupting the current production workflow.
### Key Components and Responsibilities
The primary responsibility of this module is to provide the data models and service definitions required to manage the lifecycle of an anomaly group.
#### Abstraction of Regressions
The module defines how individual anomalies (detected performance regressions) are aggregated. Instead of treating every regression as a unique event, the proto definitions allow for a many-to-one mapping of anomalies to groups. This choice minimizes "alert fatigue" by ensuring that multiple regressions caused by the same commit or affecting the same benchmark are treated as a single unit of work.
#### Service Definitions
The module hosts the gRPC service definitions that facilitate:
- **Querying**: Finding groups based on metadata such as subscriptions, benchmarks, or specific commit ranges.
- **Mutation**: Updating the state of a group as external events occur, such as a bisection job identifying a culprit or a developer linking a bug ID.
### Workflow Integration
The proto definitions in this module facilitate the following logical flow across the Skia Perf system:
```text
[ Perf Detector ] --> [ Proto: FindExistingGroups ]
|
+---( Existing Group? )
| |
(No) CreateNewGroup <----+ +----> (Yes) UpdateGroup
| |
v v
[ Anomaly Group Data Model: ActionType, Metadata, Anomaly IDs ]
|
+-----------> [ Bisection Service ]
|
+-----------> [ Reporting/Issue Service ]
```
By providing a unified message format, this module ensures that once a group is created or updated via the gRPC interface, all downstream services (like Pinpoint for bisection or the auto-filer for bug reports) can consume the data without needing to understand the underlying database schema.
# Module: /go/anomalygroup/proto/v1
### High-Level Overview
The `go/anomalygroup/proto/v1` module defines the core data structures and gRPC service interface for managing **Anomaly Groups** within the Perf system. Anomaly grouping is a critical abstraction used to cluster related performance regressions—typically those sharing a similar commit range, benchmark, or subscription—into a single actionable entity.
By grouping anomalies, the system can automate post-detection workflows, such as filing single bug reports for multiple related regressions or triggering bisection jobs to identify a specific culprit commit.
### Design and Implementation Choices
#### Action-Oriented Grouping
The design pivots around the `GroupActionType` enum (`REPORT`, `BISECT`, `NOACTION`). Rather than being a passive collection of data points, an anomaly group is defined by the intended outcome.
- **`REPORT`**: Indicates the group is intended for manual review or automated bug filing.
- **`BISECT`**: Indicates the group is a candidate for automated regression testing (Pinpoint/Bisection) to find a culprit.
#### Decoupling from Specific Regressions
The `CreateNewAnomalyGroup` RPC is explicitly designed to avoid binding a group to a single regression initially. This allows the system to find "existing" groups that match the criteria of a newly detected anomaly before creating a redundant group. This deduplication logic is supported by `FindExistingGroups`, which searches based on `subscription_name`, `test_path`, and commit ranges.
#### Metadata vs. Entity IDs
The `AnomalyGroup` message separates entity relationships (`anomaly_ids`, `culprit_ids`, `reported_issue_id`) from the metadata that defined the group (`subscription_name`, `benchmark_name`). This allows the service to track the _evolution_ of a group as more anomalies are discovered or as a bisection job identifies specific culprits.
### Key Components and Workflows
#### AnomalyGroupService
This gRPC service (`anomalygroup_service.proto`) is the primary interface for the anomaly management lifecycle.
- **Discovery & Creation**: When the detector finds a new regression, it uses `FindExistingGroups` to see if it fits into a current investigation. If not, `CreateNewAnomalyGroup` initializes a new tracker.
- **Refinement**: As bisection jobs complete or manual triaging occurs, `UpdateAnomalyGroup` is used to append `culprit_ids` (found by bisection) or `issue_id` (from a bug tracker).
- **Analysis**: `FindTopAnomalies` provides a prioritized list of regressions within a group, allowing the system to pick the "most significant" anomaly to lead a bisection job.
#### The Anomaly Entity
The `Anomaly` message acts as a bridge between Skia's internal regression format and the requirements of external tools (like Pinpoint). It captures:
- **Context**: The `paramset` map translates Skia tags (e.g., `stat`, `measurement`) into the "bot/benchmark/story" format required by ChromePerf.
- **Significance**: `median_before` and `median_after` provide the raw data needed to calculate the magnitude of the regression.
### Key Workflows
The following diagram illustrates how a new anomaly interacts with this module to determine if it should trigger a new action or join an existing investigation:
```
                [ New Anomaly Detected ]
                           |
                           v
                 [ FindExistingGroups ] <---------- [ Search by Sub, Benchmark, & Commit ]
                           |
              +-----( Match Found? )-----+
              |                          |
          YES |                       NO |
              v                          v
   [ UpdateAnomalyGroup ]    [ CreateNewAnomalyGroup ]
     (Append Anomaly ID)                 |
              |                          v
              |                [ Determine Action ]
              |                 (BISECT or REPORT)
              |                          |
              +------------+-------------+
                           |
                           v
                [ Anomaly Group State ]
                - List of Anomaly IDs
                - Culprit IDs (if bisected)
                - Issue ID (if reported)
```
### Key Files
- **`anomalygroup_service.proto`**: The source of truth for the API and data models.
- **`anomalygroup_service.pb.go`**: Contains the generated Go structs for messages and enums, including the `GroupActionType` logic.
- **`anomalygroup_service_grpc.pb.go`**: Contains the gRPC client and server interfaces used by Perf components to communicate with the anomaly group store.
# Module: /go/anomalygroup/proto/v1/mocks
### High-Level Overview
This module provides mock implementations of the `AnomalyGroupService` defined in the version 1 Protocol Buffers for the Perf system. Its primary purpose is to facilitate isolated unit testing of components that interact with anomaly grouping logic. By using these mocks, developers can simulate various service behaviors—such as successful data retrieval, persistence errors, or specific search results—without requiring a live gRPC server or an underlying database.
### Design and Implementation Decisions
The mocks in this module are built using the `stretchr/testify/mock` framework. This choice allows for a declarative style of testing where expectations (input arguments) and returns (output data or errors) are defined before the execution of the code under test.
#### gRPC Interface Compliance
A key implementation detail in `AnomalyGroupServiceServer.go` is the manual embedding of `v1.UnimplementedAnomalyGroupServiceServer`.
- **The "Why":** Standard gRPC server generation in Go requires implementations to embed the `Unimplemented` version of the server struct. This ensures forward compatibility; if new methods are added to the Protobuf definition, existing implementations (including mocks) will still satisfy the interface by inheriting the default "Unimplemented" behavior for the new methods.
- **The "How":** Because the `mockery` generation tool occasionally fails to include this embedding, it was added manually. This ensures the mock type remains a valid `AnomalyGroupServiceServer` even as the service definition evolves.
#### Assertion and Cleanup
The module provides a `NewAnomalyGroupServiceServer` constructor that integrates with Go's `testing.T`. It automatically registers a cleanup function via `t.Cleanup`. This design ensures that `mock.AssertExpectations(t)` is called at the end of every test, verifying that all expected calls to the service were actually made, which prevents "silent" test passes where expected logic was bypassed.
### Key Components and Responsibilities
#### AnomalyGroupServiceServer
This is the central mock struct. It mirrors the gRPC server interface and provides hooks for the following service responsibilities:
- **Group Lifecycle Management:** Methods like `CreateNewAnomalyGroup` and `UpdateAnomalyGroup` allow tests to simulate the creation and modification of anomaly clusters.
- **Data Retrieval:** `LoadAnomalyGroupByID` and `FindExistingGroups` allow callers to simulate the lookup of groups based on specific identifiers or search criteria.
- **Anomaly Analysis:** `FindTopAnomalies` facilitates testing logic that prioritizes or filters specific anomalies within a group context.
### Testing Workflow
The typical workflow involving this module focuses on intercepting calls between a high-level business logic component and the anomaly group persistence layer.
```
[ Test Case ]
|
| (1) Set Expectations:
| mock.On("LoadAnomalyGroupByID", ...).Return(fakeGroup, nil)
v
[ Component Under Test ]
|
| (2) Call: LoadAnomalyGroupByID(ctx, req)
v
[ AnomalyGroupServiceServer (Mock) ]
|
| (3) Matches arguments and returns fakeGroup
v
[ Component Under Test ]
|
| (4) Process fakeGroup and perform assertions
v
[ Test Case ]
|
| (5) Cleanup: Verify all mock expectations were met
```
# Module: /go/anomalygroup/service
# Anomaly Group Service
The `anomalygroup/service` module provides a gRPC implementation for managing and querying **Anomaly Groups**. Anomaly groups are logical collections of performance regressions (anomalies) that share common characteristics, such as being detected within the same benchmark, subscription, or commit range.
This service acts as an orchestration layer that interfaces with underlying storage systems for anomaly groups, culprits, and regressions to provide a unified API for the Skia Perf backend.
## Key Responsibilities
The service is responsible for the lifecycle and metadata management of grouped anomalies:
- **Group Creation and Discovery**: Creating new groups based on subscription and commit criteria, and finding existing groups that match a specific test path and commit range.
- **Metadata Management**: Updating groups with external identifiers, such as Bisection IDs (from automated bisects), Issue IDs (from bug trackers), and associating specific Culprit IDs or new Anomaly IDs with an existing group.
- **Analysis and Ranking**: Identifying the "top" anomalies within a group based on the magnitude of the performance shift.
- **Correlation**: Linking groups to issues through detected culprits.
## Design Decisions
### Group Identification and Search
When searching for existing groups (`FindExistingGroups`), the service parses a `TestPath` string. It expects a specific hierarchical format (e.g., `domain/bot/benchmark/measurement/test`). The service specifically extracts the **Domain** and **Benchmark** to query the store, effectively grouping anomalies that occur on the same benchmark even if they are on different bots or specific test sub-metrics.
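As a rough illustration of the extraction described above, the sketch below splits a test path and returns its domain and benchmark segments. `parseTestPath` is a hypothetical helper, the minimum segment count is an assumption, and the example paths are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTestPath is a hypothetical helper. It splits a test path of the form
// domain/bot/benchmark/measurement/test and returns the domain and benchmark.
func parseTestPath(testPath string) (domain, benchmark string, err error) {
	parts := strings.Split(testPath, "/")
	// Assumption: a path with fewer than five segments is rejected.
	if len(parts) < 5 {
		return "", "", fmt.Errorf("invalid test path %q: want domain/bot/benchmark/measurement/test", testPath)
	}
	// Segment 0 is the domain, segment 1 the bot, segment 2 the benchmark.
	return parts[0], parts[2], nil
}

func main() {
	// Two paths on different bots resolve to the same (domain, benchmark) pair,
	// which is why they can land in the same group.
	for _, p := range []string{
		"ChromiumPerf/linux-perf/speedometer2/Total/Speedometer2",
		"ChromiumPerf/mac-m1-perf/speedometer2/Total/Speedometer2",
	} {
		domain, benchmark, err := parseTestPath(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(domain, benchmark)
	}
}
```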
### Anomaly Ranking Logic
The `FindTopAnomalies` functionality implements a specific ranking strategy:
1. **Metric**: It calculates the relative change between `MedianBefore` and `MedianAfter`, that is `(MedianAfter - MedianBefore) / MedianBefore`.
2. **Sorting**: Regressions are sorted in descending order of this percentage change.
3. **Story Identification**: The service attempts to identify the "story" (the specific sub-test) by looking at `subtest_3`, then `subtest_2`, then `subtest_1` in the paramset. This prioritization ensures the most specific test description available is returned.
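The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the service's code: the `regression` struct is a hypothetical stand-in for the stored record, the change is treated as signed, and a zero `MedianBefore` is mapped to a change of zero.

```go
package main

import (
	"fmt"
	"sort"
)

// regression is a hypothetical stand-in for the stored regression record.
type regression struct {
	ID           string
	MedianBefore float64
	MedianAfter  float64
	Params       map[string]string
}

// relativeChange returns (after - before) / before.
// Assumption: a zero baseline yields 0 instead of dividing by zero.
func relativeChange(r regression) float64 {
	if r.MedianBefore == 0 {
		return 0
	}
	return (r.MedianAfter - r.MedianBefore) / r.MedianBefore
}

// story returns the most specific subtest value that is present.
func story(params map[string]string) string {
	for _, key := range []string{"subtest_3", "subtest_2", "subtest_1"} {
		if v := params[key]; v != "" {
			return v
		}
	}
	return ""
}

// topAnomalies sorts a copy of regs by descending relative change and
// keeps at most limit entries.
func topAnomalies(regs []regression, limit int) []regression {
	sorted := append([]regression(nil), regs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return relativeChange(sorted[i]) > relativeChange(sorted[j])
	})
	if limit >= 0 && limit < len(sorted) {
		sorted = sorted[:limit]
	}
	return sorted
}

func main() {
	regs := []regression{
		{ID: "a", MedianBefore: 10, MedianAfter: 15, Params: map[string]string{"subtest_1": "load"}},
		{ID: "b", MedianBefore: 10, MedianAfter: 11, Params: map[string]string{"subtest_1": "load", "subtest_2": "cold"}},
		{ID: "c", MedianBefore: 10, MedianAfter: 30, Params: map[string]string{"subtest_1": "scroll"}},
	}
	for _, r := range topAnomalies(regs, 2) {
		fmt.Printf("%s change=%.2f story=%s\n", r.ID, relativeChange(r), story(r.Params))
	}
}
```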
### Data Validation
The service enforces a strict schema for anomaly metadata via `isParamSetValid`. It requires the presence of specific keys (`bot`, `benchmark`, `test`, `stat`, `subtest_1`) and ensures that these keys contain exactly one value. This ensures consistency when these anomalies are exported or displayed in the UI.
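A check of this shape can be sketched in a few lines. The function below is named `isParamSetValidSketch` to make clear it is an illustration and not the service's `isParamSetValid`; the paramset is assumed to be a `map[string][]string`, and the sample values are made up.

```go
package main

import "fmt"

// requiredKeys lists the keys that must each carry exactly one value.
var requiredKeys = []string{"bot", "benchmark", "test", "stat", "subtest_1"}

// isParamSetValidSketch reports whether every required key is present
// with exactly one value. A missing key has zero values and fails too.
func isParamSetValidSketch(paramSet map[string][]string) bool {
	for _, key := range requiredKeys {
		if len(paramSet[key]) != 1 {
			return false
		}
	}
	return true
}

func main() {
	valid := map[string][]string{
		"bot":       {"linux-perf"},
		"benchmark": {"speedometer2"},
		"test":      {"Total"},
		"stat":      {"value"},
		"subtest_1": {"Speedometer2"},
	}
	ambiguous := map[string][]string{
		"bot":       {"linux-perf", "mac-m1-perf"}, // two values: rejected
		"benchmark": {"speedometer2"},
		"test":      {"Total"},
		"stat":      {"value"},
		"subtest_1": {"Speedometer2"},
	}
	fmt.Println(isParamSetValidSketch(valid), isParamSetValidSketch(ambiguous))
}
```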
## Key Components
### AnomalyGroupService
The primary struct implementing the gRPC server defined in `anomalygroup/proto/v1`. It integrates the following dependencies:
- **Store (anomalygroup)**: Handles the persistence of the group entities.
- **Store (culprit)**: Used to resolve which issues are associated with the culprits in a group.
- **Store (regression)**: Used to fetch the detailed performance data (medians, paramsets) for the anomalies contained within a group.
- **Temporal Client**: Integrated for workflow orchestration (e.g., triggering bisections or reports).
## Workflow: Updating a Group
The `UpdateAnomalyGroup` method acts as a multi-purpose update sink. Depending on the fields populated in the request, it routes to different store operations:
```text
Request (UpdateAnomalyGroup)
   |
   |-- Has BisectionId? ----> anomalygroupStore.UpdateBisectID
   |
   |-- Has IssueId? --------> anomalygroupStore.UpdateReportedIssueID
   |
   |-- Has AnomalyId? ------> regressionStore.GetByIDs (to get commit range)
   |                                |
   |                                +-> anomalygroupStore.AddAnomalyID
   |
   +-- Has CulpritIds? -----> anomalygroupStore.AddCulpritIDs
```
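The routing in the diagram can be sketched as a single dispatch function. Everything here is hypothetical: the `updateRequest` struct, the `groupStore` interface, its method signatures, and the `lookup` callback only mirror the names in the diagram. Whether the real method handles exactly one populated field per request, as this sketch does, is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// updateRequest is a hypothetical, simplified request shape.
type updateRequest struct {
	GroupID     string
	BisectionID string
	IssueID     string
	AnomalyID   string
	CulpritIDs  []string
}

// groupStore is a hypothetical subset of the anomaly group store.
type groupStore interface {
	UpdateBisectID(groupID, bisectionID string) error
	UpdateReportedIssueID(groupID, issueID string) error
	AddAnomalyID(groupID, anomalyID string, start, end int64) error
	AddCulpritIDs(groupID string, culpritIDs []string) error
}

// routeUpdate dispatches on the first populated field. lookup stands in for
// the regression store call that resolves an anomaly's commit range.
func routeUpdate(store groupStore, lookup func(anomalyID string) (int64, int64, error), req updateRequest) error {
	switch {
	case req.BisectionID != "":
		return store.UpdateBisectID(req.GroupID, req.BisectionID)
	case req.IssueID != "":
		return store.UpdateReportedIssueID(req.GroupID, req.IssueID)
	case req.AnomalyID != "":
		start, end, err := lookup(req.AnomalyID)
		if err != nil {
			return err
		}
		return store.AddAnomalyID(req.GroupID, req.AnomalyID, start, end)
	case len(req.CulpritIDs) > 0:
		return store.AddCulpritIDs(req.GroupID, req.CulpritIDs)
	}
	return errors.New("update request has no populated field")
}

// recordingStore records which store operation was invoked.
type recordingStore struct{ calls []string }

func (s *recordingStore) record(name string) error {
	s.calls = append(s.calls, name)
	return nil
}
func (s *recordingStore) UpdateBisectID(string, string) error { return s.record("UpdateBisectID") }
func (s *recordingStore) UpdateReportedIssueID(string, string) error {
	return s.record("UpdateReportedIssueID")
}
func (s *recordingStore) AddAnomalyID(string, string, int64, int64) error {
	return s.record("AddAnomalyID")
}
func (s *recordingStore) AddCulpritIDs(string, []string) error { return s.record("AddCulpritIDs") }

func main() {
	store := &recordingStore{}
	lookup := func(string) (int64, int64, error) { return 100, 110, nil }
	_ = routeUpdate(store, lookup, updateRequest{GroupID: "g1", IssueID: "12345"})
	_ = routeUpdate(store, lookup, updateRequest{GroupID: "g1", AnomalyID: "a1"})
	fmt.Println(store.calls)
}
```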
## Internal Ranking Workflow
When identifying the most significant regressions in a group:
```text
Load Group by ID
|
Fetch all Regression details for AnomalyIds in Group
|
For each Regression:
Calculate: (MedianAfter - MedianBefore) / MedianBefore
|
Sort descending by calculated diff
|
Take Top N (Limit)
|
Extract specific params (bot, benchmark, measurement, etc.)
|
Return Anomaly list
```
# Module: /go/anomalygroup/sqlanomalygroupstore
The `sqlanomalygroupstore` module provides a SQL-backed implementation for managing anomaly groups in the Perf system. It transitions the system from a "per-anomaly" management style to a "group-centric" workflow, allowing related performance regressions to be handled as a single unit for bisection, bug reporting, and state tracking.
## Overview and Purpose
In performance monitoring, a single underlying issue often triggers multiple anomalies across different bots or benchmarks. Treating these as independent events leads to redundant bisections and fragmented issue tracking. This module solves that by providing a persistent store to aggregate these anomalies.
The store acts as the source of truth for the lifecycle of a regression:
1. **Grouping**: Collating anomalies based on shared context (benchmark, domain, subscription).
2. **Range Refinement**: Dynamically calculating the intersection of revision ranges as new anomalies are added to a group.
3. **Action Orchestration**: Tracking whether a group has been reported to an issue tracker or sent for bisection.
## Key Components and Design Decisions
### Data Modeling and Storage
The implementation balances relational structure with the flexibility needed for heterogeneous performance data.
- **JSONB Metadata**: The `group_meta_data` field uses JSONB to store attributes like `subscription_name`, `domain_name`, and `benchmark_name`. This avoids rigid schema migrations when new metadata categories are introduced while still allowing for efficient SQL filtering via JSON path expressions.
- **Array Types for Membership**: `AnomalyIDs` and `CulpritIDs` are stored as `UUID ARRAY` (or text arrays). This allows the system to retrieve all members of a group in a single row fetch, optimizing for read-heavy "group view" operations.
- **Denormalized Revision Ranges**: The fields `common_rev_start` and `common_rev_end` are stored directly on the group. This denormalization allows the system to perform fast range-based lookups (e.g., "Find all groups affecting commit X") without joining against hundreds of individual anomaly records.
### Anomaly Aggregation Logic
The store implements specific logic when adding an anomaly to an existing group via `AddAnomalyID`. Rather than just appending an ID, it updates the group's `common_rev_start` and `common_rev_end` using `GREATEST` and `LEAST` functions respectively. This ensures the group's "Common Revision Range" always represents the narrowest overlapping window shared by all member anomalies, which is essential for accurate bisection.
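The effect of `GREATEST` on the start and `LEAST` on the end is an interval intersection. The Go sketch below mirrors that arithmetic in memory; `revRange` and `narrow` are hypothetical names, and the real store performs this update in SQL.

```go
package main

import "fmt"

// revRange is a hypothetical revision range.
type revRange struct{ Start, End int64 }

// narrow returns the overlap of the group's common range and a new anomaly's
// range: the larger of the two starts and the smaller of the two ends.
func narrow(group, anomaly revRange) revRange {
	out := group
	if anomaly.Start > out.Start { // GREATEST(common_rev_start, anomaly start)
		out.Start = anomaly.Start
	}
	if anomaly.End < out.End { // LEAST(common_rev_end, anomaly end)
		out.End = anomaly.End
	}
	return out
}

func main() {
	common := revRange{Start: 100, End: 120}
	common = narrow(common, revRange{Start: 110, End: 130})
	fmt.Println(common) // {110 120}
	common = narrow(common, revRange{Start: 105, End: 115})
	fmt.Println(common) // {110 115}
}
```

Each added anomaly can only keep the window the same or shrink it, which is why the range stays suitable as a bisection input.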
## Key Workflows
### Finding and Joining Groups
The `FindExistingGroup` method is the entry point for anomaly deduplication. When a new anomaly is detected, the system queries for existing groups that match the metadata and whose revision range overlaps with the new anomaly.
```text
New Anomaly Detected
        |
        v
Check Store: FindExistingGroup()
(Match Metadata + Revision Range Overlap)
        |
        +----[ Match Found ]----> AddAnomalyID()
        |                         (Narrows common_rev_start/end)
        |
        +----[ No Match ]-------> Create()
                                  (Starts new group lifecycle)
```
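The match condition in the diagram (same metadata plus an overlapping revision range) can be sketched in memory as follows. The types and the `findExistingGroups` function are hypothetical, and treating both range ends as inclusive is an assumption; the real lookup is a SQL query.

```go
package main

import "fmt"

// groupKey is a hypothetical bundle of the metadata used for matching.
type groupKey struct{ Subscription, Domain, Benchmark string }

// group is a hypothetical, trimmed-down group row.
type group struct {
	ID               string
	Key              groupKey
	RevStart, RevEnd int64
}

// overlaps reports whether two inclusive ranges share at least one revision.
func overlaps(aStart, aEnd, bStart, bEnd int64) bool {
	return aStart <= bEnd && bStart <= aEnd
}

// findExistingGroups returns every group with identical metadata whose
// common revision range overlaps [start, end].
func findExistingGroups(groups []group, key groupKey, start, end int64) []group {
	var matches []group
	for _, g := range groups {
		if g.Key == key && overlaps(g.RevStart, g.RevEnd, start, end) {
			matches = append(matches, g)
		}
	}
	return matches
}

func main() {
	key := groupKey{Subscription: "sub-a", Domain: "ChromiumPerf", Benchmark: "speedometer2"}
	groups := []group{{ID: "g1", Key: key, RevStart: 100, RevEnd: 120}}
	for _, r := range [][2]int64{{115, 130}, {200, 210}} {
		if m := findExistingGroups(groups, key, r[0], r[1]); len(m) > 0 {
			fmt.Println("join", m[0].ID)
		} else {
			fmt.Println("create new group")
		}
	}
}
```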
### Remediation Tracking
The module provides dedicated update methods to link the group to external entities.
- `UpdateBisectID`: Links the group to a specific bisection job.
- `UpdateReportedIssueID`: Links the group to a bug in an issue tracker.
These links prevent duplicate actions. For example, the system can query `GetAnomalyIdsByIssueId` to find all data points associated with a specific bug, facilitating "cluster" views of performance regressions.
## File Responsibilities
- **`sqlanomalygroupstore.go`**: Implements the `AnomalyGroupStore` struct and its methods. It contains the raw SQL logic for Spanner/PostgreSQL, including complex array unnesting for ID lookups and JSONB extraction for group metadata.
- **`schema/`**: Defines the database layout and provides the conceptual "why" behind the table structures, such as the use of temporal tracking for audit trails.
- **`sqlanomalygroupstore_test.go`**: Validates the SQL logic using a real database instance (Spanner), specifically testing edge cases like revision range narrowing and UUID validation.
# Module: /go/anomalygroup/sqlanomalygroupstore/schema
# Anomaly Group SQL Schema
The `schema` package defines the structured data model for storing and managing anomaly groups within a SQL database. It serves as the single source of truth for the database layout used by the `sqlanomalygroupstore`, ensuring that anomaly aggregations, their associated metadata, and subsequent remedial actions are persisted consistently.
## Overview and Purpose
In the Perf system, individual anomalies are often related by shared characteristics such as benchmark, bot, or revision range. The `AnomalyGroupSchema` is designed to transition from a "per-anomaly" view to a "group-centric" view. This grouping is critical for:
- **Action Orchestration**: Managing actions like bisections or bug reporting on a group of related regressions rather than triggering redundant tasks for every single data point.
- **State Tracking**: Maintaining the lifecycle of a performance regression from discovery (creation) to resolution (culprit identification).
- **Performance Optimization**: Denormalizing key fields (like revision ranges) to allow the system to query and filter groups without performing expensive joins or aggregations across the primary anomaly tables.
## Key Components and Design Decisions
### AnomalyGroupSchema
The core structure represents a single row in the `AnomalyGroups` table. The implementation choices reflect a balance between strict relational integrity and the flexibility required for evolving metadata.
- **Identity and Temporal Tracking**:
Each group is assigned a UUID (`ID`) to prevent collisions across distributed systems. It tracks `CreationTime` and `LastModifiedTime` to allow the cleanup of stale groups and to provide audit trails for when a group's state last changed.
- **Anomalies and Culprits (Array Storage)**:
The schema utilizes `UUID ARRAY` types for `AnomalyIDs` and `CulpritIDs`. This design choice favors read performance for group-specific views, as it allows the system to retrieve the entire membership list of a group in a single row fetch, avoiding the overhead of a separate many-to-many mapping table for common operations.
- **Dynamic Metadata (JSONB)**:
The `GroupMetaData` field is implemented as a `JSONB` object. This is a deliberate choice to accommodate the heterogeneous nature of performance data. While currently used for tracking subscriptions and benchmark identifiers, the JSONB format allows the system to store additional context (like environment variables or hardware configurations) without requiring a schema migration every time a new metadata tag is introduced.
- **Denormalized Revision Ranges**:
`CommonRevStart` and `CommonRevEnd` represent the overlapping revision range shared by all anomalies within the group. These values are recalculated and updated as the group grows. By storing these directly on the group record, the system can quickly identify which groups are relevant to a specific commit range during bisection lookups.
- **Action and Workflow State**:
The schema integrates directly with the alerting and bisection workflows through fields like `Action`, `BisectionID`, and `ReportedIssueID`.
- `Action` acts as a state machine indicator (e.g., `report`, `bisect`).
- `ActionTime` tracks when these external processes were triggered to prevent duplicate actions during subsequent scanning loops.
## Data Workflow
The following diagram illustrates how the schema fields are populated and updated during the lifecycle of an anomaly group:
```text
   Discovery Phase              Aggregation Phase           Action Phase
  (Anomaly Detected)         (Group Created/Updated)        (Remediation)
          |                             |                         |
          v                             v                         v
[ Individual Anomaly ] ----> [ AnomalyGroupSchema ] ----> [ Bisection Job ]
                             | - CommonRevStart/End |     | - BisectionID
                             | - AnomalyIDs (Array) | <---+
                             | - GroupMetaData      |     |
                             | - Action ('bisect')  | ----+
                                        |
                                        +--------------> [ Issue Tracker ]
                                                         | - ReportedIssueID
```
This workflow ensures that as the system moves from detecting a regression to investigating it, the `AnomalyGroupSchema` remains the central repository for the group's evolving state and history.
# Module: /go/anomalygroup/utils
### High-Level Overview
The `anomalygroup/utils` module provides the logic for organizing individual performance regressions (anomalies) into cohesive groups. Instead of treating every detected regression as an isolated event, this module attempts to correlate new anomalies with existing ones based on shared metadata like subscription names, commit ranges, and test paths. This grouping mechanism is critical for reducing alert fatigue and enabling automated root-cause analysis workflows, such as bisection.
### Design and Implementation Choices
The module is designed around a "find-or-create" pattern for anomaly groups, prioritizing the consolidation of information into existing groups to maintain a single source of truth for related issues.
- **Concurrency Control**: The module holds a global `sync.Mutex` during the grouping process. This guards against race conditions in which concurrent callers simultaneously attempt to create a new group for the same set of regressions. Because a `sync.Mutex` only serializes goroutines within a single process, it does not by itself coordinate separate containers.
- **Decoupled Action Handling**: The logic distinguishes between two primary group actions: `REPORT` (creating/updating bug tracker issues) and `BISECT` (triggering automated culprit finding). The implementation chooses how to update external systems (like the Issue Tracker) based on these action types.
- **Workflow Integration**: When a new group is created, the module doesn't just store data; it proactively triggers long-running processes via **Temporal**. This offloads heavy lifting—like deciding whether to start a Pinpoint bisection—to a durable execution framework.
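The find-or-create pattern under a process-wide mutex can be sketched as below. The in-memory `registry`, the `processRegression` function, and the group key are hypothetical; the real module talks to backend services instead of a map. The sketch shows the property the lock provides: however many goroutines race on the same key, exactly one of them creates the group.

```go
package main

import (
	"fmt"
	"sync"
)

// groupingMutex serializes find-or-create within this process.
var groupingMutex sync.Mutex

// registry is a hypothetical in-memory stand-in for the group store.
type registry struct {
	groups map[string][]string // group key -> member anomaly IDs
}

// processRegression joins the group for key, creating it if needed.
// It reports whether this call created the group.
func (r *registry) processRegression(key, anomalyID string) (created bool) {
	groupingMutex.Lock()
	defer groupingMutex.Unlock()
	_, exists := r.groups[key]
	r.groups[key] = append(r.groups[key], anomalyID)
	return !exists
}

// runConcurrently submits n anomalies for the same key from n goroutines
// and returns how many of the calls created a group.
func runConcurrently(r *registry, n int, key string) int {
	results := make([]bool, n) // each goroutine writes only its own slot
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = r.processRegression(key, fmt.Sprintf("anomaly-%d", i))
		}(i)
	}
	wg.Wait()
	creations := 0
	for _, c := range results {
		if c {
			creations++
		}
	}
	return creations
}

func main() {
	r := &registry{groups: map[string][]string{}}
	key := "sub-a/ChromiumPerf/speedometer2"
	creations := runConcurrently(r, 50, key)
	fmt.Println("groups created:", creations, "members:", len(r.groups[key]))
}
```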
### Key Components and Responsibilities
#### AnomalyGrouper Interface and Implementation
The `AnomalyGrouper` interface defines the contract for processing a regression within the context of grouping. The primary implementation, `AnomalyGrouperImpl`, acts as a coordinator between the Perf backend services, the Issue Tracker, and the Temporal workflow engine.
#### Regression Processing Logic (`anomalygrouputils.go`)
The core logic resides in `ProcessRegressionInGroup`. Its responsibilities include:
1. **Correlation**: Querying the backend service via `FindExistingGroups` to see if the new anomaly fits into an active group based on its subscription and commit range.
2. **Group Management**:
- If no group exists: It creates a new group and immediately triggers the `MaybeTriggerBisection` Temporal workflow.
- If groups exist: It associates the anomaly with all matching groups.
3. **Communication Sync**: It ensures that external issue trackers are kept up-to-date. If a group has already been reported as a bug or is linked to a culprit, the module adds comments to those issues to notify stakeholders of the new regression.
#### Issue Identification (`FindIssuesToUpdate`)
This helper function encapsulates the logic for mapping an `AnomalyGroup` back to concrete issue tracker IDs.
- For `REPORT` actions, it looks for a specifically linked `ReportedIssueId`.
- For `BISECT` actions, it queries the backend for issues associated with "culprits" (identified causes) linked to the group.
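The two branches above can be sketched as a switch on the action type. The `anomalyGroup` struct, the `culpritIssueLookup` callback, and the lowercase `findIssuesToUpdate` are hypothetical simplifications; in particular, the callback stands in for the backend query that resolves culprits to issues.

```go
package main

import "fmt"

// action is a hypothetical stand-in for the group action enum.
type action string

const (
	actionReport action = "REPORT"
	actionBisect action = "BISECT"
)

// anomalyGroup is a hypothetical, trimmed-down group.
type anomalyGroup struct {
	Action          action
	ReportedIssueID string
	CulpritIDs      []string
}

// culpritIssueLookup stands in for the backend query that maps culprit IDs
// to the issues filed for them.
type culpritIssueLookup func(culpritIDs []string) ([]string, error)

// findIssuesToUpdate returns the issue IDs that should receive a comment.
func findIssuesToUpdate(g anomalyGroup, lookup culpritIssueLookup) ([]string, error) {
	switch g.Action {
	case actionReport:
		if g.ReportedIssueID == "" {
			return nil, nil // nothing has been filed yet
		}
		return []string{g.ReportedIssueID}, nil
	case actionBisect:
		if len(g.CulpritIDs) == 0 {
			return nil, nil // bisection has not produced a culprit yet
		}
		return lookup(g.CulpritIDs)
	default:
		return nil, nil
	}
}

func main() {
	lookup := func(ids []string) ([]string, error) {
		return []string{"issue-for-" + ids[0]}, nil
	}
	reported := anomalyGroup{Action: actionReport, ReportedIssueID: "12345"}
	bisected := anomalyGroup{Action: actionBisect, CulpritIDs: []string{"culprit-1"}}
	fmt.Println(findIssuesToUpdate(reported, lookup))
	fmt.Println(findIssuesToUpdate(bisected, lookup))
}
```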
### Key Workflow: Processing a New Regression
The following diagram illustrates how the module handles an incoming regression and decides whether to create a new group or update an existing one.
```
 [ New Regression Detected ]
              |
              v
+--------------------------+
|   Lock Grouping Mutex    | (Prevent race conditions)
+--------------------------+
              |
              v
+--------------------------+      YES      +----------------------------+
|  Find Existing Groups?   |-------------->| 1. Link Anomaly to Groups  |
+--------------------------+               | 2. Find Associated Issues  |
              |                            | 3. Post Updates to Issues  |
              | NO                         +----------------------------+
              v                                           |
+--------------------------+                              |
| 1. Create Anomaly Group  |                              |
| 2. Link Anomaly to Group |                              |
| 3. Trigger Temporal WF   |                              |
+--------------------------+                              |
              |                                           |
              v                                           v
+-----------------------------------------------------------------------+
|                         Unlock Mutex & Return                         |
+-----------------------------------------------------------------------+
```
# Module: /go/anomalygroup/utils/mocks
### High-Level Overview
The `anomalygroup/utils/mocks` module provides mock implementations of the interfaces defined within the `anomalygroup` utility suite. Its primary purpose is to facilitate unit testing for components that depend on anomaly grouping logic—specifically the categorization and association of regressions into logical groups—without requiring a live database or the complex state management associated with real anomaly grouping operations.
### Design and Implementation Choices
The module utilizes **testify/mock** to provide a programmatic way to simulate the behavior of the `AnomalyGrouper` interface.
The core design decision here is the use of **automatically generated mocks** (via `mockery`). This approach ensures that the mock implementation remains strictly in sync with the parent interface. By generating these mocks in a dedicated package, the project maintains a clean separation between production code and testing utilities, preventing test dependencies (like `testify`) from polluting the production binary.
The mock is designed to support:
- **Behavioral Verification**: Ensuring that the calling code passes the correct context, alert configurations, and commit ranges.
- **Deterministic Outcomes**: Allowing tests to simulate both successful grouping (returning a group ID) and various error states (e.g., database failures or validation errors) to verify error handling in the consumer.
### Key Components
#### AnomalyGrouper.go
This file contains the `AnomalyGrouper` struct, which mocks the primary service responsible for regression management.
The central responsibility of this mock is to simulate the `ProcessRegressionInGroup` workflow. In a real-world scenario, this method involves complex logic to determine if a new anomaly should be joined to an existing group or start a new one based on metadata. The mock simplifies this for callers by allowing them to define expectations:
```
Input Parameters:
- ctx: Request context.
- alert: The alert configuration that triggered the detection.
- anomalyID: The unique identifier for the detected regression.
- startCommit/endCommit: The range where the regression occurred.
- testPath/paramSet: Metadata describing the specific trace and attributes.
Return Values:
- string: The ID of the anomaly group the regression was assigned to.
- error: Any simulated operational failure.
```
The mock includes a `NewAnomalyGrouper` constructor that integrates with the Go `testing.T` cleanup lifecycle, ensuring that any unmet expectations (e.g., a method was expected to be called but wasn't) are automatically reported as test failures.
### Typical Testing Workflow
When a component (such as a regression detector or a notification manager) identifies a regression, it interacts with the `AnomalyGrouper`. The mock allows you to simulate this interaction:
```
+-------------------+         +-----------------------+   +-------------------------+
|     Unit Test     |         | Component Under Test  |   |   Mock AnomalyGrouper   |
+---------+---------+         +-----------+-----------+   +------------+------------+
          |                               |                            |
          | 1. Set Expectations           |                            |
          | (On ProcessRegressionInGroup) |                            |
          +------------------------------>|                            |
          |                               |                            |
          | 2. Trigger Action             |                            |
          +------------------------------>| 3. Call Process...         |
          |                               +--------------------------->|
          |                               |                            |
          |                               | 4. Return Preset Result    |
          |                               |<---------------------------+
          | 5. Assert Requirements        |                            |
          +------------------------------>|                            |
```
# Module: /go/backend
# Perf Backend Module
The `go/backend` module implements the internal gRPC service architecture for the Skia Perf application. It serves as a centralized, non-user-facing API layer designed to decouple the frontend from heavy background operations and workflow orchestrations.
### High-Level Overview
The backend service acts as a standard interface contract between different components of the Perf cluster. By isolating logic such as manual Pinpoint job triggering, anomaly group management, and culprit tracking into a dedicated service, the system ensures that user-facing components (the frontend) remain responsive.
This architecture allows for significant backend implementation changes—such as swapping out the underlying workflow engine (Temporal) or database logic—without requiring modifications to the frontend or other calling services.
### Design and Implementation Choices
- **Internal Service-to-Service (S2S) Communication**: The backend is explicitly designed for internal traffic. It uses Kubernetes DNS for service discovery within the cluster and relies on gRPC for efficient, typed communication.
- **Declarative Authorization**: Security is not hardcoded into individual handlers. Instead, every service must implement the `BackendService` interface, which requires providing an `AuthorizationPolicy`. This policy is then enforced by a unified gRPC interceptor.
- **Workflow Abstraction**: Heavier operations, particularly those involving long-running tasks like regression detection or Pinpoint bisection, are offloaded to this module. It frequently acts as a bridge to a Temporal cluster to manage stateful workflows.
- **Dependency Injection**: The `Backend` struct is initialized with various "stores" (AnomalyGroup, Culprit, Subscription, Regression). This allows the service to remain agnostic of the specific storage implementation (e.g., Spanner vs. CockroachDB) while facilitating easier unit testing through mocks.
### Key Components and Responsibilities
#### Backend Application (`backend.go`)
This is the core orchestrator. Its primary responsibility is the lifecycle management of the gRPC server. During initialization, it:
1. Validates the instance configuration.
2. Instantiates the necessary data stores and notification providers.
3. Sets up the Temporal client (if anomaly grouping is enabled).
4. Registers all sub-services (Pinpoint, Anomaly Group, Culprit) and applies their specific authorization policies to the gRPC interceptor stack.
#### Pinpoint Service (`pinpoint.go`)
A specialized wrapper around the Pinpoint service logic. It bridges the Perf backend to the Pinpoint bisection engine. Its primary role is to expose gRPC endpoints that allow the Perf UI to trigger and monitor performance bisection jobs. It implements strict role-based access control, typically requiring the `Editor` role.
#### Service Authorization Policy (`/shared`)
Contained within the `shared` sub-package, the `AuthorizationPolicy` structure defines the security contract for every endpoint. It supports:
- **Service-wide roles**: A baseline role required to access any method in the service.
- **Method-specific overrides**: Finer-grained control for sensitive operations.
- **Unauthenticated access**: Explicitly allowing public access where necessary, though this is rare for backend services.
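The three capabilities listed above can be pictured as a small policy evaluation. The struct, its field names, and the `allows` method below are hypothetical and do not reproduce the real `AuthorizationPolicy` definition. Treating a method-specific entry as a full replacement for the service-wide roles is also an assumption.

```go
package main

import "fmt"

// role is a hypothetical role name such as "viewer" or "editor".
type role string

// authorizationPolicy is a hypothetical policy shape.
type authorizationPolicy struct {
	AllowUnauthenticated  bool              // explicit public access
	AuthorizedRoles       []role            // service-wide baseline
	MethodAuthorizedRoles map[string][]role // per-method overrides
}

// allows reports whether a caller holding callerRoles may invoke method.
func (p authorizationPolicy) allows(method string, callerRoles []role) bool {
	if p.AllowUnauthenticated {
		return true
	}
	required := p.AuthorizedRoles
	// Assumption: an override replaces the baseline for that method.
	if override, ok := p.MethodAuthorizedRoles[method]; ok {
		required = override
	}
	for _, need := range required {
		for _, have := range callerRoles {
			if need == have {
				return true
			}
		}
	}
	return false
}

func main() {
	policy := authorizationPolicy{
		AuthorizedRoles: []role{"viewer", "editor"},
		MethodAuthorizedRoles: map[string][]role{
			"TriggerBisect": {"editor"}, // hypothetical sensitive method
		},
	}
	viewer := []role{"viewer"}
	fmt.Println(policy.allows("ListJobs", viewer), policy.allows("TriggerBisect", viewer))
}
```

Keeping the decision in data rather than in each handler is what lets a single interceptor enforce it for every registered service.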
#### Client Utility (`/client`)
To ensure uniformity across the codebase, this sub-module provides a factory for creating gRPC clients. It abstracts away the complexities of:
- **Authentication**: Automatically attaching Google OAuth2 identity tokens to requests.
- **TLS Configuration**: Managing secure connections within the VPC.
- **Connection Dialing**: Handling the boilerplate of `grpc.Dial` with appropriate interceptors.
### Core Initialization Workflow
The following diagram illustrates how the backend service starts up and wires its internal dependencies:
```text
[ Config File ]       -> [ validate.LoadAndValidate ]
                                    |
                                    v
[ Storage Builders ]  -> [ NewAnomalyGroupStore ]
                         [ NewCulpritStore ]
                         [ NewRegressionStore ]
                                    |
                                    v
[ External Services ] -> [ NewTemporalClient ]
                         [ GetDefaultNotifier ]
                                    |
                                    v
[ Service Registry ]  -> [ NewPinpointService ]
                         [ NewAnomalyGroupServ ]
                         [ NewCulpritService ]
                                    |
                                    v
[ gRPC Server ] <------- [ Apply Auth Interceptors ]
       |
       +--> [ Listen on Port (e.g., :8005) ]
       +--> [ Enable Reflection ]
       +--> [ Serve Traffic ]
```
### Key Submodules
- **`backendserver`**: The executable entry point that parses CLI flags and calls the backend initialization logic.
- **`testdata`**: Contains environment-specific configurations (like `demo.json`) used to bootstrap the service in development or CI environments.
# Module: /go/backend/backendserver
### High-Level Overview
The `backendserver` module provides the executable entry point for the Perf backend service. Its primary purpose is to act as a thin wrapper that bootstraps the backend environment, parses operational configuration from the command line, and initiates the long-running service process. It bridges the gap between the infrastructure's execution environment and the core logic defined in the `perf/go/backend` package.
### Design and Implementation Choices
The module is designed around the `urfave/cli` framework to ensure that the service is highly configurable and self-documenting.
- **Flag-Driven Configuration**: Rather than relying on static configuration files or hardcoded environment variables, the server uses the `config.BackendFlags` struct to define its requirements. This allows the deployment system to pass parameters directly, facilitating easier integration with container orchestration tools.
- **Separation of Concerns**: The `main.go` file intentionally contains minimal logic. It delegates the heavy lifting—such as database connections, caching, and API routing—to the `perf/go/backend` package. This ensures that the core backend logic is decoupled from the CLI interface, making the system easier to test and reuse in different contexts.
- **Standardized Logging**: The server initializes a standard output logger early in the lifecycle. This choice ensures that all startup events, including flag parsing and service initialization, are captured in a format compatible with cloud-native logging aggregators.
### Key Components and Responsibilities
#### CLI Application (main.go)
The core responsibility of `main.go` is to define the command structure for the backend. It currently supports a `run` command, which serves as the primary execution path for the service.
When the `run` command is executed:
1. **Flag Processing**: The application converts the definitions in `config.BackendFlags` into CLI flags.
2. **Lifecycle Management**: It initializes the logger and logs the current flag configuration to provide visibility into the running state.
3. **Core Initialization**: It calls `backend.New()`, passing the parsed flags. While the current implementation passes `nil` for several parameters (likely reserved for dependency injection or specialized handlers), this is where the system's core components are wired together.
4. **Service Execution**: It invokes `Serve()`, which enters the main event loop of the backend, handling incoming requests until an interrupt signal is received.
### Service Workflow
The following diagram illustrates the initialization and execution flow of the `backendserver`:
```text
[ OS Args ]
|
v
[ CLI Flag Parser ] ----> [ Log Configuration ]
|
v
[ backend.New() ] <----- [ BackendFlags ]
|
+--> [ Instantiate internal components ]
+--> [ Setup Listeners/Handlers ]
|
v
[ b.Serve() ] <--------- [ Infinite Loop ]
|
+--> [ Accept RPC/HTTP Requests ]
+--> [ Process Data ]
```
### Key Dependencies
- **`perf/go/backend`**: Contains the actual service implementation. The `backendserver` is essentially a caller for this package.
- **`perf/go/config`**: Defines the schema for the backend's configuration.
- **`go.skia.org/infra/go/urfavecli`**: Provides the standardized CLI scaffolding used across Skia infrastructure projects.
# Module: /go/backend/client
The `backend/client` module serves as the central factory for establishing gRPC connections to various Perf backend services. It abstracts the complexities of authentication, transport security, and connection management, providing a unified interface for other components of the system to communicate with backend microservices like Anomaly Groups, Culprits, and Pinpoint.
### Design Decisions and Implementation
#### Centralized Connection Management
The module is designed around the concept of a shared connection utility (`getGrpcConnection`). By centralizing how gRPC connections are dialed, the system ensures consistent application of security policies and authentication headers across all clients. This approach allows developers to instantiate high-level service clients without needing to understand the underlying networking or security configuration of the cluster.
#### Security and Authentication
The client supports two connection modes, and layers OAuth2 identity on top of the secure one:
- **Insecure Connections:** Primarily used for local development or specific internal testing scenarios where TLS is not required.
- **Secure Internal Communication:** For production workloads within a GKE cluster, the client uses a hybrid security model. It employs TLS for transport encryption but is configured with `InsecureSkipVerify: true`. This decision reflects a common internal networking pattern where communication stays within a trusted VPC/cluster boundary, making full certificate chain validation secondary to ensuring encrypted transit.
- **OAuth2 Identity:** Authentication is handled via Google Application Default Credentials. The module automatically retrieves the service account's token source and attaches it as `PerRPCCredentials` to the gRPC connection, ensuring that every request is authorized with the appropriate identity (scoped to `userinfo.email`).
#### Configuration-Driven Connectivity
The module relies on the global `perf/go/config` to determine the target host (`BackendServiceHostUrl`). This allows the same binary to target different backend instances based on the deployment configuration. Additionally, every client factory supports an `override` parameter, facilitating flexible routing for integration tests or cross-cluster communication.
### Key Components and Responsibilities
#### backendclientutil.go
This is the primary implementation file containing the logic for connection lifecycle management and client instantiation.
- **Connection Factory (`getGrpcConnection`):** This internal function manages the `grpc.Dial` process. It handles the logic for choosing between insecure credentials and the TLS/OAuth2 stack.
- **Service Clients:** The module provides specific factory functions for the different protobuf-defined services. These include:
- `NewPinpointClient`: For interacting with the Pinpoint service.
- `NewAnomalyGroupServiceClient`: For managing and querying anomaly groups.
- `NewCulpritServiceClient`: For accessing information regarding identified culprits.
### Workflow: Client Initialization
The following diagram illustrates the internal process when a consumer requests a new service client:
```text
[ Consumer Call ]
|
v
[ Check if Backend Enabled? ] ---- No ----> [ Return Error ]
|
Yes
|
[ Determine Host URL ] <--- (Override or Global Config)
|
[ Create gRPC Connection ]
|
+---- If Secure: [ Fetch OAuth Token ]
| [ Configure TLS (Skip Verify) ]
|
+---- If Insecure: [ Use Insecure Creds ]
|
[ grpc.Dial(host, opts) ]
|
v
[ Wrap Connection in Service Client ]
|
v
[ Return (e.g., AnomalyGroupServiceClient) ]
```
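The first two steps of this workflow (the enabled check and host selection) can be sketched in plain Go. This is a minimal illustration, not the module's code: `backendConfig`, `resolveBackendHost`, and `errBackendDisabled` are hypothetical names, and treating "backend enabled" as "a host can be resolved" is an assumption. Only `BackendServiceHostUrl` and the override parameter come from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// backendConfig is a hypothetical stand-in for the relevant part of the
// global Perf configuration; only BackendServiceHostUrl is named in the text.
type backendConfig struct {
	BackendServiceHostUrl string
}

// errBackendDisabled is a hypothetical error for the "not enabled" branch.
var errBackendDisabled = errors.New("backend service is not enabled")

// resolveBackendHost returns the override when it is non-empty, falls back to
// the configured host, and reports an error when neither is set.
func resolveBackendHost(cfg backendConfig, override string) (string, error) {
	if override != "" {
		return override, nil
	}
	if cfg.BackendServiceHostUrl != "" {
		return cfg.BackendServiceHostUrl, nil
	}
	return "", errBackendDisabled
}

func main() {
	cfg := backendConfig{BackendServiceHostUrl: "perf-be.example:8005"}

	host, err := resolveBackendHost(cfg, "")
	fmt.Println(host, err)

	host, err = resolveBackendHost(cfg, "localhost:8005")
	fmt.Println(host, err)
}
```

Keeping host resolution separate from dialing makes the override path easy to exercise in integration tests without opening a connection.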
### Key Submodules and Dependencies
- **`perf/go/anomalygroup/proto/v1`**: Provides the interface for anomaly group interactions.
- **`perf/go/culprit/proto/v1`**: Provides the interface for culprit tracking.
- **`pinpoint/proto/v1`**: Provides the interface for Pinpoint integration.
- **`go/auth`**: Used for managing Google-based authentication scopes.
# Module: /go/backend/shared
### High-Level Overview
The `backend/shared` module serves as a centralized location for common data structures and logic used across various backend services within the Perf system. Its primary purpose is to standardize how cross-cutting concerns—specifically security and access control—are defined and enforced across different service implementations.
### Centralized Authorization Policy
The core of this module is the `AuthorizationPolicy` structure. Rather than hard-coding permission checks within individual RPC handlers or middleware, this module provides a declarative way to define access requirements. This approach decouples the "rules" of the service from the "engine" that enforces them.
#### Design Decisions and Implementation
- **Granular vs. Global Control**: The design supports a tiered authorization model. By providing both `AuthorizedRoles` (service-wide) and `MethodAuthorizedRoles` (method-specific), the system allows developers to define a baseline security posture for a service while overriding or tightening requirements for sensitive operations.
- **Role-Based Access Control (RBAC)**: The module integrates directly with the common `go/roles` package. This ensures that the backend uses a unified identity and permission vocabulary, preventing discrepancies where different services might interpret "Admin" or "Viewer" differently.
- **Public Access Handling**: The inclusion of the `AllowUnauthenticated` flag allows the policy to explicitly document when a service is intended to be public. This makes security audits easier, as public-facing endpoints must be opted into explicitly rather than being the default state.
#### Workflow: Authorization Evaluation
When a request enters a backend service, the service implementation typically references an `AuthorizationPolicy` instance to determine if the request should proceed.
```text
Incoming Request
|
v
[ Auth Middleware ] <--- References --- [ AuthorizationPolicy ]
| |
+---- (1) Is AllowUnauthenticated? ------+--> [ Allow ]
| YES |
| |
+---- (2) Does user have a role in ------+--> [ Allow ]
| MethodAuthorizedRoles[RPC]? |
| YES |
| |
+---- (3) Does user have a role in ------+--> [ Allow ]
| AuthorizedRoles? |
| YES |
| |
+---- (4) No conditions met -------------+--> [ Deny (403) ]
```
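The evaluation order in the diagram can be expressed as a small function. The sketch below is illustrative only: the three field names come from the description above, but the field types, the `Role` string type (a stand-in for the `go/roles` type), and the `isAllowed` and `hasAny` helpers are assumptions. It follows the diagram literally, so a match in either the method-specific or the service-wide list grants access.

```go
package main

import "fmt"

// Role is a hypothetical stand-in for the role type from the go/roles package.
type Role string

// AuthorizationPolicy mirrors the three fields named in the text; the exact
// field types in the real struct are an assumption.
type AuthorizationPolicy struct {
	AllowUnauthenticated  bool
	AuthorizedRoles       []Role
	MethodAuthorizedRoles map[string][]Role
}

// hasAny reports whether the user holds at least one of the wanted roles.
func hasAny(userRoles, wanted []Role) bool {
	for _, u := range userRoles {
		for _, w := range wanted {
			if u == w {
				return true
			}
		}
	}
	return false
}

// isAllowed applies the checks in the order shown in the diagram:
// public access, then method-specific roles, then service-wide roles.
func isAllowed(p AuthorizationPolicy, method string, userRoles []Role) bool {
	if p.AllowUnauthenticated {
		return true
	}
	if hasAny(userRoles, p.MethodAuthorizedRoles[method]) {
		return true
	}
	return hasAny(userRoles, p.AuthorizedRoles)
}

func main() {
	p := AuthorizationPolicy{
		AuthorizedRoles:       []Role{"editor"},
		MethodAuthorizedRoles: map[string][]Role{"ReadStatus": {"viewer"}},
	}
	fmt.Println(isAllowed(p, "ReadStatus", []Role{"viewer"})) // true
	fmt.Println(isAllowed(p, "StartJob", []Role{"viewer"}))   // false
}
```

Because the policy is plain data, the same function can serve every service; only the `AuthorizationPolicy` value differs.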
### Key Components
- **`authorization.go`**: Defines the `AuthorizationPolicy` struct. This file is the source of truth for how backend services should describe their security requirements. It acts as the contract between service definitions and the middleware responsible for enforcing those definitions.
# Module: /go/backend/testdata
### Overview
The `/go/backend/testdata` directory serves as a repository for static configuration files used to simulate real-world runtime environments during development, testing, and demonstration of the Perf backend. Rather than relying on hardcoded defaults within the Go source code, this module provides a centralized location for JSON-based configurations that define how a Perf instance behaves, connects to data sources, and interacts with external services.
### Design Rationale
The primary motivation for maintaining this module is to provide a "Single Source of Truth" for a functional Perf deployment environment that can be spun up locally or in a CI environment.
By using `demo.json`, the system achieves:
- **Decoupling:** Separation of the application logic from environment-specific parameters like database connection strings or repository URLs.
- **Reproducibility:** Ensuring that developers and automated tests operate against a consistent set of configurations, such as the specific CockroachDB connection string or the local directory ingestion path.
- **Validation:** Serving as a schema reference for the `Config` struct used within the backend, ensuring that changes to the configuration format are reflected in a working example.
### Key Components and Responsibilities
#### Configuration Specifications (`demo.json`)
This file is the core of the module. It defines a comprehensive instance profile. Its responsibilities include:
- **Identity and Networking:** Establishing the instance name (`chrome-perf-demo`) and mapping the local communication ports for both the frontend and backend services.
- **Data Persistence Layer:** Explicitly choosing `cockroachdb` as the storage engine and defining the `tile_size` (e.g., 256). This choice impacts how the backend optimizes data retrieval for trace queries.
- **Ingestion Logic:** Configuring the backend to monitor a local directory (`./demo/data/`) rather than a cloud-based Pub/Sub or GCS bucket. This is crucial for offline development and rapid prototyping of data parsers.
- **External Integration Mocking:** Providing placeholders for issue trackers, authentication headers (`X-WEBAUTH-USER`), and Git repository synchronization. By pointing to a public demo repo (`perf-demo-repo.git`), it allows the system to demonstrate commit-linking functionality without requiring private credentials.
- **UI Customization:** Defining "Favorites" sections which allow the backend to populate the user interface with predefined links and documentation, simulating a curated production dashboard.
### Workflow: Configuration Ingestion
The backend utilizes these files to bootstrap its internal services. The flow generally follows this pattern:
```text
[ Backend Startup ]
|
v
[ Load /go/backend/testdata/demo.json ]
|
+-----> [ Initialize CockroachDB Connection ]
| (Using connection_string)
|
+-----> [ Initialize Ingestion Service ]
| (Watching ./demo/data/ for new trace files)
|
+-----> [ Sync Git Provider ]
| (Cloning/Updating /tmp/perf-demo)
|
+-----> [ Apply Auth/Notification Policies ]
(Setting header names and issue tracker secrets)
```
This structure ensures that the backend can transition from a "demo" state to a "production" state simply by swapping the configuration file, keeping the underlying binary logic identical across environments.
# Module: /go/bug
The `bug` module provides a specialized utility for generating bug reporting URLs within the Perf application. Its primary purpose is to bridge the gap between performance regression detection and issue tracking by dynamically populating bug templates with contextual metadata.
### Design and Implementation Logic
The module is built around the concept of URI templates. Rather than hard-coding support for specific issue trackers (like Monorail or GitHub Issues), it utilizes a template-based approach to remain agnostic of the underlying bug-tracking system. This allows administrators to configure different reporting destinations without modifying the source code.
The core logic relies on the RFC 6570 URI Template standard via the `uritemplates` library. This ensures that all components of the URL—specifically those containing special characters like query parameters in a cluster link—are correctly escaped and encoded to prevent broken links in the resulting bug report.
### Key Components
**Template Expansion (`bug.go`)**
The module exposes the `Expand` function, which serves as the primary entry point. It takes a raw template string and injects three critical pieces of context:
- `cluster_url`: A direct link to the Skia Perf cluster view where the regression was identified.
- `commit_url`: The link to the specific git commit (provided via `provider.Commit`) suspected of causing the regression.
- `message`: User-provided commentary or summary of the issue.
The function handles the mapping of these domain-specific concepts to the template variables, ensuring that the integration between the performance monitoring UI and the bug tracker is seamless.
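To make the escaping concern concrete, here is a simplified stand-in for template expansion that uses only the Go standard library. It is not the `uritemplates` library and does not implement RFC 6570; it only handles plain `{name}` placeholders. The function name `expandSimple` and the example URLs are hypothetical, while the variable names `cluster_url`, `commit_url`, and `message` match the ones listed above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// expandSimple replaces each {name} placeholder in the template with the
// percent-encoded value from vars. It is a simplified stand-in for RFC 6570
// simple string expansion, not a full implementation.
func expandSimple(template string, vars map[string]string) string {
	out := template
	for name, value := range vars {
		// url.QueryEscape encodes spaces as "+"; switch those to "%20" so
		// the result uses percent-encoding throughout. A literal "+" in the
		// value has already become "%2B" and is unaffected.
		escaped := strings.ReplaceAll(url.QueryEscape(value), "+", "%20")
		out = strings.ReplaceAll(out, "{"+name+"}", escaped)
	}
	return out
}

func main() {
	template := "https://bugs.example.com/new?comment={message}&url={cluster_url}&commit={commit_url}"
	vars := map[string]string{
		"cluster_url": "https://perf.example.com/c/?begin=1&end=2",
		"commit_url":  "https://git.example.com/c/abc123",
		"message":     "Regression found",
	}
	fmt.Println(expandSimple(template, vars))
}
```

Note how the `&` and `=` inside the cluster link are encoded as `%26` and `%3D`, so they cannot be mistaken for separators in the outer bug-reporting URL.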
### Data Flow Workflow
The following diagram illustrates how the module transforms raw performance data and user input into a navigable bug report link:
```text
[Perf UI / Detection] [Git Provider] [User Input]
| | |
v v v
(clusterLink) (commit.URL) (message)
| | |
+-------------------------+------------------+
|
v
+----------------------+
| bug.Expand | <--- [URI Template]
+----------------------+
|
v
[Encoded Reporting URL]
|
v
(Opens in User Browser)
```
### Usage in Testing and Examples
The module includes an `ExampleExpand` function and associated tests to verify that the encoding logic correctly handles complex URLs. This is particularly important for the `cluster_url`, which often contains its own set of encoded query parameters that must be safely nested within the final bug reporting URL.
# Module: /go/builders
# Perf Builders Module
The `go/builders` module serves as the central factory for the Skia Perf application. It is responsible for instantiating complex objects—such as data stores, version control interfaces, and file sources—by interpreting a central `config.InstanceConfig` object.
## Design Philosophy
The primary motivation for this module is to resolve **cyclical dependencies**. Many sub-packages within Perf (like `tracestore` or `regression`) need to know about the configuration, but the configuration logic often needs to reference these packages to define how they are initialized. By centralizing the "construction" logic here, other packages can remain focused on their specific domains without needing to know how their peers are instantiated or how the global configuration is structured.
A key implementation choice is the use of a **Singleton Database Pool**. Since a Perf instance typically talks to a single backend (like Spanner or PostgreSQL), the module maintains a global `singletonPool`. This prevents the application from accidentally opening multiple connection pools to the same database, which could exhaust file descriptors or database connection limits.
## Key Responsibilities and Components
### Database Management
The module handles the lifecycle of the database connection pool.
- **`NewDBPoolFromConfig`**: This is the core initializer. It parses connection strings, configures connection limits (`MaxConns` and `MinConns`), and wraps the raw pool in a timeout layer to ensure query hygiene.
- **Schema Validation**: When initializing a pool, the builder optionally performs a schema check. It compares the actual database schema against the `expectedschema` to ensure the database is compatible with the current version of the code before the application starts processing traffic.
### Store Factories
The module provides `New[Component]StoreFromConfig` functions for every major data entity in Perf. These functions encapsulate the logic of choosing between different implementations (e.g., SQL-based vs. Cache-backed stores).
- **Trace & Metadata Stores**: Constructs `sqltracestore` instances. It also manages the initialization of `InMemoryTraceParams` to optimize trace lookups.
- **Regression & Shortcut Stores**: Handles the logic of selecting versioned implementations, such as switching between `sqlregressionstore` and `sqlregression2store` based on the `UseRegression2` config flag.
- **Anomaly & Alert Stores**: Standardizes the creation of stores for alerts, anomaly groups, culprits, subscriptions, and user-reported issues.
### Data Ingestion Sources
The builders resolve how Perf reads incoming data files:
- **`NewSourceFromConfig`**: Determines whether data should be pulled from Google Cloud Storage (`GCSSource`) or a local directory (`DirSource`).
- **`NewIngestedFSFromConfig`**: Provides a standard Go `fs.FS` interface to the underlying storage, allowing the rest of the application to treat GCS and local filesystems interchangeably.
### Caching Strategy
The `GetCacheFromConfig` function determines the caching layer for queries. It supports:
- **Redis**: Utilizing a Google Cloud Redis client.
- **Local**: An in-memory cache for local development or small-scale deployments.
## Core Workflow: Object Initialization
The typical flow for initializing a component involves resolving the database pool first, then passing it into the specific constructor for the requested store.
```text
Config Object (InstanceConfig)
|
v
[ NewDBPoolFromConfig ] <-----------+
| | (Check Schema)
| v
+------> [ singletonPool ] ---+
| (Thread-safe) |
| |
v v
[ New...StoreFromConfig ] [ NewPerfGitFromConfig ]
| |
+---> Returns Interface +---> Returns perfgit.Git
(e.g. alerts.Store)
```
## Implementation Details
- **Concurrency**: The `singletonPool` is protected by a `sync.Mutex` (`singletonPoolMutex`) to ensure that concurrent calls to initialize the database during startup do not create race conditions or multiple pools.
- **Logging**: A custom `pgxLogAdaptor` is implemented to redirect internal database driver logs (from `pgx`) into the standard `sklog` system, ensuring unified log formatting across the application.
- **Timeouts**: All database pools are wrapped using `go/sql/pool/wrapper/timeout`. This enforces that every context passed to a database operation has a deadline, preventing "hanging" queries from blocking the application indefinitely.
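The mutex-guarded singleton described in the concurrency note can be sketched as follows. This is a minimal illustration under stated assumptions: the `pool` type, `getPool`, and `dialCount` are hypothetical, and the real builder also parses the connection string, applies connection limits, and wraps the pool in the timeout layer. Only the names `singletonPool` and `singletonPoolMutex` come from the text.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical stand-in for the real database pool type.
type pool struct{ connString string }

var (
	singletonPoolMutex sync.Mutex
	singletonPool      *pool
	dialCount          int // counts how many pools were actually created
)

// getPool returns the shared pool, creating it on the first call only.
// Later calls ignore connString and return the existing pool.
func getPool(connString string) *pool {
	singletonPoolMutex.Lock()
	defer singletonPoolMutex.Unlock()
	if singletonPool == nil {
		dialCount++
		singletonPool = &pool{connString: connString}
	}
	return singletonPool
}

func main() {
	// Simulate several components initializing concurrently at startup.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			getPool("postgresql://localhost/demo")
		}()
	}
	wg.Wait()
	fmt.Println("pools created:", dialCount) // pools created: 1
}
```

Holding the lock across both the nil check and the assignment is what prevents two goroutines from each creating a pool.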
# Module: /go/chromeperf
### Overview
The `chromeperf` module provides a comprehensive Go client and integration layer for interacting with the **Chrome Performance Monitoring (Chromeperf)** ecosystem. Its primary responsibility is to bridge the gap between Skia Perf's internal data structures and the legacy Chromeperf APIs, specifically focusing on anomaly detection, regression reporting, and alert group management.
The module acts as a translation and transport layer, allowing Skia Perf to:
1. **Retrieve performance anomalies** (regressions or improvements) from the Chromeperf backend.
2. **Report new regressions** discovered by Skia's analysis engines back to Chromeperf.
3. **Manage Alert Groups**, which aggregate multiple related anomalies into a single triagable unit.
4. **Normalize data identifiers**, converting between Skia's structured trace keys and Chromeperf's slash-delimited `TestPath` format.
### Design Decisions and Implementation Choices
#### Communication via Skia-Bridge
A key architectural decision is the use of `skia-bridge-dot-chromeperf.appspot.com` as the default endpoint. While a legacy direct path to `chromeperf.appspot.com` exists, the module defaults to the bridge. This design allows for a more stable interface and potentially specialized authentication/filtering logic between the two systems. The `ChromePerfClient` interface abstracts this, supporting URL overrides for local development and testing.
#### Resilience and Status Code Handling
The `SendPostRequest` and `SendGetRequest` implementations in `chromeperfClient.go` incorporate specific logic for "accepted status codes." Unlike standard HTTP clients that might treat any 2xx as success, this module allows callers to define exactly which codes are valid for a given operation. For example, `ReportRegression` accepts `404` as a non-error state in specific scenarios where parameter names differ between systems, preventing transient synchronization issues from triggering hard failures in the Skia backend.
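The idea of a caller-supplied list of accepted codes can be sketched in a few lines. The helper name `checkStatus` and the specific accepted list are hypothetical; the real client methods take additional parameters that are not shown here.

```go
package main

import (
	"fmt"
	"net/http"
)

// checkStatus returns nil when the response code appears in the
// caller-supplied accepted list and an error otherwise.
func checkStatus(code int, accepted []int) error {
	for _, a := range accepted {
		if code == a {
			return nil
		}
	}
	return fmt.Errorf("unexpected status code %d", code)
}

func main() {
	// Hypothetical accepted list for a reporting call that tolerates 404.
	accepted := []int{http.StatusOK, http.StatusNotFound}
	fmt.Println(checkStatus(http.StatusNotFound, accepted))
	fmt.Println(checkStatus(http.StatusInternalServerError, accepted))
}
```

With this shape, a 2xx response is not automatically a success: a code counts only if the caller listed it.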
#### Trace Name to TestPath Translation
Chromeperf identifies performance series using a hierarchical string (e.g., `Master/Bot/Benchmark/Test/Subtest`), whereas Skia Perf uses a flat map of key-value pairs. The `TraceNameToTestPath` function implements a deterministic mapping strategy:
- **Order matters**: It strictly enforces a hierarchy: `master` -> `bot` -> `benchmark` -> `test` -> `subtest_1...N`.
- **Statistical Suffixes**: Because Chromeperf often encodes statistics in the test name (e.g., `_avg`, `_max`), the translator can optionally append suffixes based on Skia's `stat` parameter to ensure lookups hit the correct legacy series.
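The mapping rules above can be sketched as a small function. This is an illustration, not the module's `TraceNameToTestPath`: the lowercase function name, the requirement that all four leading keys be present, and the `statSuffix` table (which contains only the `value` to `avg` pairing used in this document's workflow example) are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// statSuffix is a hypothetical mapping from Skia's stat value to the suffix
// used in the legacy test name.
var statSuffix = map[string]string{"value": "avg"}

// traceNameToTestPath converts ",k1=v1,k2=v2," into "master/bot/benchmark/test[/subtest...]".
func traceNameToTestPath(traceName string) (string, error) {
	params := map[string]string{}
	for _, pair := range strings.Split(strings.Trim(traceName, ","), ",") {
		key, value, ok := strings.Cut(pair, "=")
		if !ok {
			return "", fmt.Errorf("malformed pair %q", pair)
		}
		params[key] = value
	}

	// The hierarchy is fixed: master -> bot -> benchmark -> test.
	var parts []string
	for _, key := range []string{"master", "bot", "benchmark", "test"} {
		v, ok := params[key]
		if !ok {
			return "", fmt.Errorf("missing key %q", key)
		}
		parts = append(parts, v)
	}

	// Append subtest_1, subtest_2, ... until the first gap.
	for i := 1; ; i++ {
		v, ok := params[fmt.Sprintf("subtest_%d", i)]
		if !ok {
			break
		}
		parts = append(parts, v)
	}

	// Optionally append the statistic suffix to the last path component.
	if suffix, ok := statSuffix[params["stat"]]; ok {
		parts[len(parts)-1] += "_" + suffix
	}
	return strings.Join(parts, "/"), nil
}

func main() {
	path, err := traceNameToTestPath(",master=CP,bot=M1,benchmark=SunSpider,test=total,stat=value,")
	fmt.Println(path, err) // CP/M1/SunSpider/total_avg <nil>
}
```

Because the key order is hard-coded rather than taken from the trace name, the same set of parameters always yields the same path, regardless of how the keys were ordered in the input.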
#### Lossy Sanitization and the Reverse Key Map
Skia Perf restricts certain characters in trace keys (like `?` or `:`), replacing them with underscores. To prevent this from breaking the ability to query the original data source, the module utilizes a `ReverseKeyMapStore`. This allows the system to "remember" that a sanitized Skia value like `cpu_io` actually corresponds to a Chromeperf value of `cpu:io`.
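A minimal in-memory sketch shows why the reverse map is needed. The `sanitize` function and the `reverseKeyMap` type are hypothetical stand-ins: the real sanitization rules cover more characters than the two named above, and the real store is an interface with a SQL-backed implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitize replaces the two characters named in the text with underscores.
// The real rule set is broader; this is an assumption for illustration.
func sanitize(value string) string {
	return strings.NewReplacer("?", "_", ":", "_").Replace(value)
}

// reverseKeyMap is an in-memory stand-in for ReverseKeyMapStore, keyed by
// (parameter key, sanitized value).
type reverseKeyMap map[[2]string]string

// record sanitizes the original value and remembers the mapping when
// sanitization actually changed something.
func (m reverseKeyMap) record(paramKey, original string) string {
	s := sanitize(original)
	if s != original {
		m[[2]string{paramKey, s}] = original
	}
	return s
}

// original returns the pre-sanitization value, or the input unchanged when
// no mapping was recorded.
func (m reverseKeyMap) original(paramKey, sanitized string) string {
	if o, ok := m[[2]string{paramKey, sanitized}]; ok {
		return o
	}
	return sanitized
}

func main() {
	m := reverseKeyMap{}
	fmt.Println(m.record("test", "cpu:io"))   // cpu_io
	fmt.Println(m.original("test", "cpu_io")) // cpu:io
}
```

Sanitization is lossy because `cpu:io` and `cpu?io` both become `cpu_io`; without a recorded mapping there is no way to tell which original value to query in Chromeperf.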
### Key Components
#### Anomaly API (`anomalyApi.go`)
This is the core functional area of the module. It defines the `Anomaly` struct, which contains extensive metadata about performance shifts (medians before/after, P-values, segment sizes, and bug tracking information).
- **Reporting**: `ReportRegression` sends new detections to Chromeperf to trigger the alerting pipeline.
- **Retrieval**: Supports both revision-based (`GetAnomalies`) and time-based (`GetAnomaliesTimeBased`) queries.
- **Normalization**: The `UnmarshalJSON` method for `Anomaly` handles legacy numeric IDs by transparently converting them to strings, ensuring compatibility with different versions of the Chromeperf backend.
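The normalization technique can be illustrated with a trimmed-down struct. This is a sketch, not the module's code: the lowercase `anomaly` type and the JSON field names `id` and `test_path` are hypothetical, and the real `Anomaly` struct has many more fields.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// anomaly is a trimmed, hypothetical version of the Anomaly struct.
type anomaly struct {
	ID       string `json:"id"`
	TestPath string `json:"test_path"`
}

// UnmarshalJSON accepts "id" as either a JSON number or a JSON string and
// always stores it as a string.
func (a *anomaly) UnmarshalJSON(data []byte) error {
	// raw delays decoding of "id" so its JSON type can be inspected.
	var raw struct {
		ID       json.RawMessage `json:"id"`
		TestPath string          `json:"test_path"`
	}
	if err := json.Unmarshal(data, &raw); err != nil {
		return err
	}
	a.TestPath = raw.TestPath

	id := bytes.TrimSpace(raw.ID)
	if len(id) == 0 || string(id) == "null" {
		a.ID = ""
		return nil
	}
	if id[0] == '"' {
		return json.Unmarshal(id, &a.ID)
	}
	// json.Number keeps the literal digits, so large IDs are not rounded
	// through float64.
	var n json.Number
	if err := json.Unmarshal(id, &n); err != nil {
		return err
	}
	a.ID = n.String()
	return nil
}

func main() {
	var a anomaly
	input := []byte(`{"id": 12345, "test_path": "CP/M1/SunSpider/total_avg"}`)
	if err := json.Unmarshal(input, &a); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", a.ID) // "12345"
}
```

Decoding into an anonymous helper struct avoids infinite recursion, since the helper has no `UnmarshalJSON` method of its own.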
#### Alert Group API (`alertGroupApi.go`)
Manages the grouping of anomalies. It provides methods to fetch details for a specific group key. A critical function here is `GetQueryParams`, which parses the anomaly list within a group to generate Skia-compatible query parameters, allowing users to jump from a Chromeperf alert group directly to a Skia Perf visualization of all affected traces.
#### Chromeperf Client (`chromeperfClient.go`)
The low-level transport implementation. It handles:
- **Authentication**: Uses Google Application Default Credentials with the `userinfo.email` scope.
- **Tracing**: Integrates with OpenCensus for distributed tracing of API calls.
- **JSON Serialization**: Manages the encoding and decoding of complex request/response objects.
### Key Workflow: Anomaly Retrieval and Mapping
The following diagram illustrates how the module transforms a Skia trace request into a Chromeperf anomaly set:
```text
[ Skia Perf ]
Trace Name: ",master=CP,bot=M1,benchmark=SunSpider,test=total,stat=value,"
|
v
[ TraceNameToTestPath ]
Converts to: "CP/M1/SunSpider/total_avg"
|
v
[ AnomalyApiClient.GetAnomalies ]
POST /anomalies/find { "tests": ["CP/M1/SunSpider/total_avg"], ... }
|
v
[ Chromeperf Backend ]
Returns: { "anomalies": { "CP/M1/SunSpider/total_avg": [ {Anomaly_Data} ] } }
|
v
[ getAnomalyMapFromChromePerfResult ]
1. Maps "CP/M1/SunSpider/total_avg" back to the original Skia Trace Name.
2. Resolves Git Hashes to Commit Numbers using perfgit.Git.
|
v
[ AnomalyMap ]
{ "trace_name": { CommitNumber: Anomaly } }
```
### Submodules
- **`compat/`**: A translation layer that converts internal Skia `regression.Regression` objects into `chromeperf.Anomaly` structures.
- **`sqlreversekeymapstore/`**: A SQL-backed implementation of the `ReverseKeyMapStore` for persisting character transformation mappings.
- **`mock/`**: Autogenerated mocks for unit testing components that depend on these APIs.
# Module: /go/chromeperf/compat
### Overview
The `compat` module provides a translation layer between Skia Perf's internal regression formats and the legacy ChromePerf (Anomaly) data structures. Its primary purpose is to ensure interoperability during the transition or integration period where Skia Perf needs to communicate regression data to systems that still rely on the ChromePerf "Anomaly" schema.
The module simplifies the complex, multi-dimensional data captured in a Skia `regression.Regression` object into a flat, trace-oriented `chromeperf.AnomalyMap`.
### Design Motivations
The translation logic addresses several structural differences between the two systems:
- **Trace Identification**: Skia Perf uses structured trace keys (comma-separated key-value pairs), while ChromePerf uses a slash-delimited `TestPath` (e.g., `Master/Bot/Benchmark/Test`). This module handles the mapping of these identifiers to ensure regressions are attributed to the correct entities in legacy dashboards.
- **Revision Ranges**: Skia tracks regressions primarily by specific commit numbers. The translation maps these into `StartRevision` and `EndRevision` fields to satisfy the "range-based" anomaly model used by ChromePerf.
- **Triage State Mapping**: The module translates Skia's internal `TriageStatus` into a string-based state and applies specific flags (like `IgnoreBugIDFlag`) when a regression is marked as "Ignored," ensuring the legacy system respects the triage decisions made in Skia.
- **Bug ID Handling**: Skia supports multiple bugs per regression, whereas the legacy anomaly format historically expects a single primary Bug ID. The module currently selects the first available bug but includes diagnostic logging to monitor instances where data might be truncated, facilitating future schema improvements.
### Key Workflows
#### Regression to Anomaly Conversion
The core functionality is encapsulated in `ConvertRegressionToAnomalies`. The process follows this logical flow:
1. **Validation**: Ensures the regression contains valid data frames. If no trace data is present, it returns an empty map.
2. **Trace Iteration**: For every trace involved in the regression, it attempts to resolve the legacy `TestPath`.
3. **Field Mapping**: Values like medians (before/after), revision numbers, and improvement flags are cast and moved into the `Anomaly` struct.
4. **Status Sync**: The triage status (e.g., Untriaged, Positive, Ignored) is synchronized.
5. **Map Construction**: The resulting anomalies are grouped into a `CommitNumberAnomalyMap`, indexed by the trace key, allowing callers to look up anomalies by their specific performance series.
```text
[regression.Regression]
|
v
+-----------------------------+
| ConvertRegressionToAnomalies |
+-----------------------------+
|
|-- Extract TraceSet Keys
|-- Resolve TestPaths (e.g. Master/Bot/...)
|-- Map Medians & Revisions
|-- Resolve Bug IDs & Triage State
v
[chromeperf.AnomalyMap]
{
"trace_key_A": { CommitNum: Anomaly },
"trace_key_B": { CommitNum: Anomaly }
}
```
### Key Components
- **`compat.go`**: Contains the primary conversion logic. It is responsible for the heavy lifting of data transformation, error handling for malformed trace names, and the temporary logic for narrowing down multiple bug assignments into a single field.
- **`compat_test.go`**: Validates the conversion accuracy across various scenarios, including successful mappings, handling of nil data frames, and ensuring that different triage statuses (like `Ignored`) result in the correct legacy flag values.
# Module: /go/chromeperf/mock
The `/go/chromeperf/mock` module provides a suite of autogenerated mock implementations for the interfaces defined in the `chromeperf` package. These mocks are designed to facilitate hermetic unit testing of the Skia Perf service by simulating interactions with external Chrome Performance monitoring APIs and storage layers.
### Design Philosophy
The module leverages the `testify/mock` framework and is maintained via `mockery`. This approach was chosen to ensure that the testing infrastructure remains synchronized with the primary interfaces. When the core `chromeperf` interfaces evolve—such as adding new parameters to anomaly queries or modifying the regression reporting structure—the mocks can be regenerated to reflect these changes, reducing the manual overhead of updating test suites.
By using these mocks, developers can:
- Simulate API failures (e.g., non-200 status codes, network timeouts) to ensure robust error handling.
- Validate that specific parameters, such as commit positions or trace names, are correctly passed to the transport layer.
- Provide deterministic return values for complex data structures like `AnomalyMap` or `ReportRegressionResponse` without requiring a live backend.
### Key Components
#### AnomalyApiClient
The `AnomalyApiClient` mock simulates high-level operations related to performance anomalies. It allows tests to define expectations for fetching anomaly data across several dimensions:
- **Range-based queries:** Mocking `GetAnomalies` and `GetAnomaliesTimeBased` allows tests to simulate data retrieval over commit ranges or specific time intervals.
- **Revision-specific lookups:** `GetAnomaliesAroundRevision` enables testing of logic that centers on a specific point in time or a specific commit.
- **Regression Reporting:** The `ReportRegression` mock is critical for verifying the logic that identifies and pushes new performance regressions to the Chrome Perf dashboard, including the validation of metadata like median values before and after a change.
#### ChromePerfClient
This mock represents the lower-level transport layer. While `AnomalyApiClient` focuses on the "what" (anomalies), `ChromePerfClient` focuses on the "how" (generic HTTP-like requests). It mocks `SendGetRequest` and `SendPostRequest`, providing a way to test the underlying serialization and communication logic. This is particularly useful for verifying that the correct API endpoints and query parameters are constructed before being sent over the wire.
#### ReverseKeyMapStore
The `ReverseKeyMapStore` mock facilitates testing of the data translation layer. In Skia Perf, keys or trace names may be modified or obfuscated for storage or display. This mock simulates the persistence and retrieval of mappings between "modified" values and "original" values. It allows tests to verify that the system can correctly resolve internal identifiers back to their source values during data processing or anomaly reporting.
### Testing Workflow
The standard workflow for utilizing these mocks involves setting expectations within a Go test, injecting the mock into the component under test, and asserting that the interactions occurred as predicted.
```text
+-------------------+         +-----------------------+         +-------------------------+
|      Go Test      |         | Component Under Test  |         |  Mock AnomalyApiClient  |
+-------------------+         +-----------------------+         +-------------------------+
          |                               |                                  |
          | 1. Set Expectations           |                                  |
          |------------------------------>|                                  |
          | (On "GetAnomalies").Return()  |                                  |
          |                               |                                  |
          | 2. Execute Logic              |                                  |
          |------------------------------>|                                  |
          |                               | 3. Call API Method               |
          |                               |--------------------------------->|
          |                               |                                  |
          |                               | 4. Return Mock Data              |
          |                               |<---------------------------------|
          | 5. Assert Expectations        |                                  |
          |------------------------------>|                                  |
          | (AssertExpectations)          |                                  |
```
The `New...` constructor functions in each file include a `Cleanup` registration. This design ensures that `AssertExpectations` is automatically called at the end of each test, preventing "silent" failures where a test passes even if an expected API call was never actually made by the code.
# Module: /go/chromeperf/sqlreversekeymapstore
# SQL Reverse Key Map Store
The `sqlreversekeymapstore` module provides a persistent storage mechanism for mapping sanitized Skia Perf parameter values back to their original Chromeperf identifiers. This is a critical utility for maintaining interoperability between the two systems, particularly during anomaly detection and cross-platform data lookups.
## Design Rationale
When data flows from Chromeperf to Skia Perf, certain characters in test paths and parameter keys are considered "invalid" by Skia's internal naming conventions. To ensure compatibility, these characters are typically replaced with underscores (`_`).
This transformation is **lossy**. For example, both `cpu:io` and `cpu-io` might be sanitized to `cpu_io`. Because multiple distinct original values can map to the same sanitized value, it is impossible to programmatically "undo" the sanitization to find the original Chromeperf source of truth.
This module solves the problem by recording these transformations as they occur. By maintaining a lookup table, the system can deterministically resolve a sanitized Skia parameter back to the specific Chromeperf value it originated from, enabling accurate queries against Chromeperf's legacy APIs.
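The collision described above can be reproduced in a few lines. This is a minimal sketch, not the module's code: the `invalidChars` pattern and the `sanitize` helper are hypothetical stand-ins, since the real set of invalid characters comes from each instance's configuration.

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidChars is a hypothetical pattern used only for this example.
var invalidChars = regexp.MustCompile(`[^a-zA-Z0-9._]`)

// sanitize replaces every character matched by invalidChars with "_".
func sanitize(s string) string {
	return invalidChars.ReplaceAllString(s, "_")
}

func main() {
	a, b := sanitize("cpu:io"), sanitize("cpu-io")
	// Two distinct originals collapse to the same sanitized string.
	fmt.Println(a, b, a == b)

	// Recording the mapping when sanitization happens is what makes a
	// reverse lookup possible later.
	reverse := map[string]string{a: "cpu:io"}
	fmt.Println(reverse["cpu_io"])
}
```

Once both inputs have collapsed to `cpu_io`, nothing in the sanitized string says which original it came from, so the mapping has to be recorded at the time of the transformation.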
## Key Components
### Implementation (`sqlreversekeymapstore.go`)
The core logic is encapsulated in the `ReverseKeyMapStoreImpl` struct. It abstracts the database interactions required to store and retrieve these mappings.
- **Database Agnosticism**: The store supports multiple backend dialects (Standard SQL and Google Spanner). It uses the `config.DataStoreType` to select the appropriate SQL syntax, specifically handling differences in `INSERT ... ON CONFLICT` behavior.
- **Idempotent Writes**: The `Create` method is designed to be safe for concurrent or repeated calls. If a mapping for a specific `ModifiedValue` and `ParamKey` already exists, the database ignores the new insertion attempt.
- **Deterministic Lookups**: The `Get` method allows callers to provide a sanitized value and its associated parameter key to retrieve the original string.
### Schema and Data Integrity
The underlying database table, `ReverseKeyMap`, is structured to optimize for lookup speed and data consistency:
- **Primary Key**: A composite key consisting of `(modified_value, param_key)`. This ensures that for any given parameter category (like a test path component), a sanitized string can only point to one "correct" original string.
- **Persistence Strategy**: The design assumes that while the table may grow as new test paths are discovered, the set of unique paths eventually stabilizes, causing the storage overhead to plateau.
## Workflow: Mapping and Restoration
The following diagram demonstrates the lifecycle of a parameter value as it moves from Chromeperf to Skia and back again via the store:
```text
[ Chromeperf ] [ Sanitization ] [ Skia Perf ]
Original Value ---> Transformation ---> Modified Value
"cpu:io" ( ":" -> "_" ) "cpu_io"
| |
| [ Store.Create ] |
+----------------------------------------------+
|
[ SQL ReverseKeyMap ]
Modified: "cpu_io"
ParamKey: "test_path"
Original: "cpu:io"
|
+------------------------+
| [ Store.Get ]
v
[ Original Restored ] <--- Used for Anomaly Lookups in Chromeperf
```
## Key Methods
- **`New(db pool.Pool, dbType config.DataStoreType)`**: Initializes the store with the appropriate SQL dialect based on the database provider.
- **`Create(ctx, modifiedValue, key, originalValue)`**: Persists a new mapping. Returns the `originalValue` if successful, or an empty string/error if a collision or validation issue occurs.
- **`Get(ctx, modifiedValue, key)`**: Retrieves the original value associated with the sanitized input. If no mapping exists, it returns an empty string without an error, signifying that no transformation was recorded for that specific pair.
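The contract of `Create` and `Get` can be illustrated with an in-memory stand-in. This is only a sketch under stated assumptions: `memStore` and `pairKey` are hypothetical names, the map replaces the SQL table, and returning an empty string when a pair already exists is one reading of the description above rather than confirmed behavior.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// pairKey mirrors the composite primary key (modified_value, param_key).
type pairKey struct{ modifiedValue, paramKey string }

// memStore is a hypothetical in-memory stand-in for the SQL-backed store.
type memStore struct {
	mu   sync.Mutex
	rows map[pairKey]string
}

func newMemStore() *memStore { return &memStore{rows: map[pairKey]string{}} }

// Create inserts a mapping unless the pair already exists, imitating an
// INSERT that ignores conflicts. Assumption: a conflict yields "" and no error.
func (s *memStore) Create(_ context.Context, modifiedValue, key, originalValue string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	k := pairKey{modifiedValue, key}
	if _, exists := s.rows[k]; exists {
		return "", nil // keep the existing row untouched
	}
	s.rows[k] = originalValue
	return originalValue, nil
}

// Get returns the original value, or "" with a nil error when no mapping exists.
func (s *memStore) Get(_ context.Context, modifiedValue, key string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.rows[pairKey{modifiedValue, key}], nil
}

func main() {
	ctx := context.Background()
	s := newMemStore()
	s.Create(ctx, "cpu_io", "test_path", "cpu:io")
	s.Create(ctx, "cpu_io", "test_path", "cpu-io") // ignored: pair already mapped
	got, _ := s.Get(ctx, "cpu_io", "test_path")
	fmt.Println(got)
}
```

Because the first write wins, repeated or concurrent calls for the same pair leave the table unchanged, which is the property that makes `Create` safe to call repeatedly.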
# Module: /go/chromeperf/sqlreversekeymapstore/schema
# SQL Reverse Key Map Schema
The `sqlreversekeymapstore/schema` module defines the database structure required to maintain a mapping between sanitized Skia Perf parameter values and their original Chromeperf counterparts. This mapping is essential for maintaining interoperability between the two systems, specifically during anomaly lookups and cross-platform queries.
## Design Rationale
When data is migrated or uploaded from Chromeperf to Skia Perf, "invalid" characters within test paths are replaced with underscores to comply with Skia’s data requirements. Because this transformation is lossy (several distinct original characters all map to the same underscore), the original Chromeperf test path cannot be reconstructed from the sanitized Skia Perf path alone; a record of the original value is required.
Without this schema, querying Chromeperf for anomalies based on a Skia Perf test path would be unreliable, as the system would not know which original characters the underscores represent.
By storing these transformations as they occur, the system can perform a reverse lookup to find the "source of truth" original value. The design assumes that the set of unique test paths is relatively stable; therefore, while the table grows initially as new paths are encountered, the storage overhead is expected to plateau once all existing test paths have been processed.
## Key Components and Responsibilities
### schema.go
This file defines the `ReverseKeyMapSchema` struct, which represents the relational table structure. The schema is designed around three primary attributes:
- **ModifiedValue**: The sanitized string as it exists in Skia Perf (containing underscores).
- **ParamKey**: The specific parameter category (e.g., a specific part of the test path).
- **OriginalValue**: The raw, unmodified string as it exists in Chromeperf.
### Data Integrity and Indexing
The schema enforces uniqueness through a composite primary key consisting of the `ModifiedValue` and the `ParamKey`.
- **Mapping Logic**: The combination of a parameter key and its modified value must point to a unique original value. This ensures that the lookup remains deterministic.
- **Search Performance**: By using the `ModifiedValue` and `ParamKey` as the primary key, the database is optimized for the most common workflow: taking a known Skia Perf parameter and looking up its original Chromeperf identity.
## Workflow: Key Restoration
The following diagram illustrates how this schema facilitates communication between the two systems:
```text
Chromeperf Path Skia Perf Path Reverse Key Map
(Original) (Sanitized) (Database Store)
---------------- --------------- -----------------------------
"master/bot/cpu:io" -> "master/bot/cpu_io" -> Modified: "cpu_io"
ParamKey: "test_path"
Original: "cpu:io"
|
|
[Anomaly Detection] <- [Query Original] <- [Lookup via ModifiedValue]
```
# Module: /go/clustering2
# Clustering2 Module
The `clustering2` module provides the logic for grouping performance traces based on their shapes using the k-means algorithm. It is primarily used within the Perf framework to identify patterns in telemetry data, such as regressions or improvements, by clustering similar behavioral trends across different test configurations.
## Design Philosophy
The module is designed around the concept of "trace shapes." Instead of looking at individual data points, it treats a series of values over time (a trace) as a multi-dimensional vector. By clustering these vectors, the system can discover that a specific set of tests all experienced a similar performance shift at the same point in time, even if the absolute values of their metrics differ.
### Key Implementation Choices
- **K-Means for Shape Analysis**: The module uses k-means clustering because it is efficient at grouping large sets of traces into a predefined number of clusters ($K=50$ by default).
- **Centroid-Based Summaries**: Each cluster is represented by a "centroid"—the average shape of all traces in that cluster. This allows the system to characterize a potentially massive number of traces with a single representative trend line.
- **Step Detection Integration**: Once clusters are formed, the module fits the centroids to step functions. This helps distinguish between clusters representing "noisy" data and those representing "meaningful" shifts (regressions or improvements).
- **Parameter Statistical Weighting**: To help users understand _what_ is common among traces in a cluster, the module calculates the percentage frequency of key-value pairs (e.g., `arch=x86`) within that cluster.
## Key Components and Responsibilities
### Cluster Calculation (`clustering.go`)
The primary entry point is `CalculateClusterSummaries`. It orchestrates the following workflow:
1. **Observation Conversion**: Converts a `dataframe.DataFrame` into a slice of `kmeans.Clusterable` objects. Traces are normalized via `ctrace2` so that clustering is based on the shape of the data rather than its absolute magnitude.
2. **Iterative Refinement**: Runs the k-means algorithm for a maximum of 100 iterations or until the total error change falls below a threshold (`KMEAN_EPSILON`).
3. **Distance-Based Sorting**: After clusters are formed, members within each cluster are sorted by their distance to the centroid. The traces closest to the centroid are considered the most "representative" of that cluster's behavior.
### Data Structures
- **`ClusterSummary`**: Contains the centroid data, the list of representative trace keys, the results of the step-fit analysis, and a summary of the parameters common to the cluster.
- **`ClusterSummaries`**: A container for all clusters found during a single run, including metadata like the $K$ value used and the standard deviation threshold.
### Parameter Summarization (`valuepercent.go`)
This component analyzes the metadata keys of all traces in a cluster to identify commonalities.
- **`ValuePercent`**: Represents how often a specific `key=value` pair appears as a percentage of the total cluster size.
- **Human-Friendly Sorting**: The `SortValuePercentSlice` function implements a specialized sorting logic. It groups values by their key (e.g., all `config` values together) and then sorts those groups by the highest percentage. This ensures that the most dominant traits of a cluster appear at the top of the report.
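The grouping order described for `SortValuePercentSlice` can be sketched as follows. The `valuePercent` type and `sortGrouped` function are hypothetical stand-ins written for this example, and the alphabetical tie-break between groups with equal top percentages is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// valuePercent is a hypothetical stand-in: a "key=value" string and the
// percentage of cluster members that carry it.
type valuePercent struct {
	Value   string
	Percent int
}

// sortGrouped keeps all values that share a key together, ranks the groups
// by their highest percentage, and orders each group from high to low.
func sortGrouped(vps []valuePercent) {
	keyOf := func(vp valuePercent) string {
		return strings.SplitN(vp.Value, "=", 2)[0]
	}
	// Highest percentage seen for each key.
	best := map[string]int{}
	for _, vp := range vps {
		if k := keyOf(vp); vp.Percent > best[k] {
			best[k] = vp.Percent
		}
	}
	sort.SliceStable(vps, func(i, j int) bool {
		ki, kj := keyOf(vps[i]), keyOf(vps[j])
		if ki != kj {
			if best[ki] != best[kj] {
				return best[ki] > best[kj]
			}
			return ki < kj // assumed tie-break so groups never interleave
		}
		return vps[i].Percent > vps[j].Percent
	})
}

func main() {
	vps := []valuePercent{
		{"config=8888", 30}, {"arch=x86", 90}, {"config=gpu", 70}, {"arch=arm", 10},
	}
	sortGrouped(vps)
	for _, vp := range vps {
		fmt.Printf("%-12s %d%%\n", vp.Value, vp.Percent)
	}
}
```

In this example the `arch` group comes first because its top entry (90%) beats the top `config` entry (70%), even though `arch=arm` on its own is the least common value.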
## Workflows
### Clustering Process
```text
DataFrame (Traces)
|
v
[Convert to Clusterable Traces] <--- Normalize shapes
|
v
[Initialize K Centroids] <--------- Randomly select K traces
|
+----[ Loop: K-Means Iteration ]
| |
| v
| [Assign Traces to Nearest Centroid]
| [Recalculate Centroid Positions]
| [Calculate Total Error]
| |
+----------+--- (Break if Error Change < EPSILON)
|
v
[Post-Processing]
|
+--> [Fit Centroids to Step Functions]
+--> [Calculate Parameter Percentages]
+--> [Sort Members by Distance to Centroid]
|
v
ClusterSummaries (Final Result)
```
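The loop in the diagram can be sketched in plain Go. This is an illustration under stated assumptions, not the module's implementation: it works on `[]float64` slices instead of `kmeans.Clusterable`, and it seeds centroids from the first `k` observations so the result is deterministic, whereas the diagram describes random seeding.

```go
package main

import (
	"fmt"
	"math"
)

// dist is the Euclidean distance between two equal-length vectors.
func dist(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// kmeans assigns observations to the nearest centroid, recomputes centroids,
// and stops once the total error changes by less than epsilon or maxIter
// iterations have run. It assumes len(obs) >= k.
func kmeans(obs [][]float64, k, maxIter int, epsilon float64) (centroids [][]float64, assign []int) {
	centroids = make([][]float64, k)
	for i := range centroids {
		centroids[i] = append([]float64(nil), obs[i]...)
	}
	assign = make([]int, len(obs))
	lastErr := math.Inf(1)
	for iter := 0; iter < maxIter; iter++ {
		// Assign every observation to its nearest centroid.
		totalErr := 0.0
		for i, o := range obs {
			best, bestD := 0, math.Inf(1)
			for c := range centroids {
				if d := dist(o, centroids[c]); d < bestD {
					best, bestD = c, d
				}
			}
			assign[i] = best
			totalErr += bestD
		}
		// Move each centroid to the mean of its members.
		for c := range centroids {
			sum := make([]float64, len(obs[0]))
			n := 0
			for i, o := range obs {
				if assign[i] != c {
					continue
				}
				for j, v := range o {
					sum[j] += v
				}
				n++
			}
			if n == 0 {
				continue // leave the centroid of an empty cluster in place
			}
			for j := range sum {
				centroids[c][j] = sum[j] / float64(n)
			}
		}
		if math.Abs(lastErr-totalErr) < epsilon {
			break
		}
		lastErr = totalErr
	}
	return centroids, assign
}

func main() {
	obs := [][]float64{{0, 0, 1, 1}, {1, 1, 0, 0}, {0, 0, 1.1, 0.9}, {0.9, 1.1, 0, 0}}
	_, assign := kmeans(obs, 2, 100, 0.001)
	fmt.Println(assign) // step-up and step-down shapes land in different clusters
}
```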
## Implementation Details
- **Distance Metric**: The module relies on the `Distance` implementation provided by the `ctrace2` package's `ClusterableTrace`, which computes the Euclidean distance between the value arrays of two traces.
- **Centroid Calculation**: Centroids are updated in each iteration by averaging the values of all traces assigned to that cluster (via `ctrace2.CalculateCentroid`).
- **Concurrency**: The clustering process is currently synchronous within the `CalculateClusterSummaries` call, though it accepts a `context.Context` for cancellation and a `Progress` callback to report the total error back to the caller/UI.
# Module: /go/config
# Perf Configuration Module
The `go/config` module defines the structural and semantic requirements for configuring a Skia Perf instance. It serves as the single source of truth for the application's runtime behavior, governing how data is ingested, stored, and queried, and how notifications are sent.
## High-Level Overview
Perf is a highly configurable system designed to handle diverse performance data sources. The configuration system is built around a central `InstanceConfig` struct, which is typically populated from a JSON file at startup. This module handles:
- **Data Structure**: Defining the Go types that represent the configuration.
- **Schema Generation**: Automatically creating JSON schemas from Go types to ensure documentation and validation stay in sync.
- **Validation**: Providing a two-tier verification process (structural and semantic) to catch configuration errors before they reach production.
## Design Decisions and Implementation Choices
### Single Source of Truth via Reflection
Rather than maintaining a separate JSON schema file and Go struct, this module uses the `invopop/jsonschema` library. By performing reflection on the `InstanceConfig` struct, the system generates `instanceConfigSchema.json`. This ensures that any change to a Go field (like adding a new `QueryConfig` parameter) is automatically reflected in the validation logic and IDE autocompletion for configuration authors.
### Separation of Structural and Semantic Validation
Validation is split into two distinct phases to maximize reliability:
1. **Structural**: Handled by the generated JSON schema to verify types, required fields, and nesting.
2. **Semantic**: Handled by custom Go logic in the `validate` submodule. This is crucial because a configuration might be valid JSON but logically broken (e.g., a notification template referencing a non-existent variable, or a regular expression that uses syntax RE2 does not support).
### Duration Serialization
By default, Go's `time.Duration` serializes to an integer (nanoseconds) in JSON, which is not human-readable. The module implements a custom `DurationAsString` type. It supports marshaling and unmarshaling strings like `"2h"` or `"10m"`, making the JSON configuration files much easier to maintain and review.
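A type of this kind is usually built on `json.Marshaler` and `json.Unmarshaler`. The sketch below shows the general pattern only; `durationString`, `parseTimeout`, and the `timeout` field are hypothetical names, and the actual `DurationAsString` may format its output differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// durationString is a hypothetical stand-in for DurationAsString.
type durationString time.Duration

// MarshalJSON writes the duration using time.Duration's String form.
func (d durationString) MarshalJSON() ([]byte, error) {
	return json.Marshal(time.Duration(d).String())
}

// UnmarshalJSON accepts any string understood by time.ParseDuration.
func (d *durationString) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return err
	}
	parsed, err := time.ParseDuration(s)
	if err != nil {
		return err
	}
	*d = durationString(parsed)
	return nil
}

// parseTimeout decodes a JSON document with a single hypothetical
// "timeout" field and returns it as a time.Duration.
func parseTimeout(doc string) (time.Duration, error) {
	var cfg struct {
		Timeout durationString `json:"timeout"`
	}
	if err := json.Unmarshal([]byte(doc), &cfg); err != nil {
		return 0, err
	}
	return time.Duration(cfg.Timeout), nil
}

func main() {
	out, _ := json.Marshal(durationString(2 * time.Hour))
	fmt.Println(string(out))

	d, err := parseTimeout(`{"timeout": "10m"}`)
	fmt.Println(d, err)
}
```

With this pattern a malformed value such as `"soon"` fails at load time instead of silently becoming a zero duration.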
## Key Components and Responsibilities
### InstanceConfig (`config.go`)
The root configuration object. It aggregates several sub-configs, each responsible for a specific subsystem:
- **`DataStoreConfig`**: Defines where trace data lives. It supports `Spanner` as the primary datastore and allows configuring connection pools and caching layers (either in-memory LRU or Memcached via `CacheConfig`).
- **`IngestionConfig` & `SourceConfig`**: Control the flow of data into Perf. They define where files come from (Google Cloud Storage or local directories) and how to handle arrival events via PubSub (including "Dead Letter" topics for failing messages).
- **`GitRepoConfig`**: Configures how Perf interacts with source control. It supports both CLI-based git and the Gitiles API. It also handles "commit number" logic, allowing Perf to map git hashes to sequential integers used for graphing.
- **`NotifyConfig` & `IssueTrackerConfig`**: Manage regression alerts. These utilize Go text templates for subjects and bodies, allowing instances to customize how they report anomalies to developers.
- **`QueryConfig`**: Customizes the "Explore" UI. It allows instances to set default parameter selections (e.g., always default `stat` to `value`) and define "Conditional Defaults" (e.g., if a user selects `metric=cpu`, automatically suggest `stat=avg`).
### Configuration Validation (`/validate`)
This submodule ensures the provided JSON is safe to run. It doesn't just check syntax; it performs "dry runs" of notification templates and compiles all regular expressions to ensure they are compatible with Go's RE2 engine.
### Command-Line Integration
The module provides `AsCliFlags()` methods for different service types (`BackendFlags`, `FrontendFlags`, `IngestFlags`). This allows the various Perf microservices to share a consistent set of command-line arguments (like `--config_filename` and `--connection_string`) while keeping their specific needs isolated.
## Configuration Workflow
The following process describes how a configuration file moves from a static file to a running service:
```text
[ config.json ]
|
v
+-----------------------+
| Structural Check | Checks: JSON types, required fields,
| (JSON Schema) | and valid nesting.
+-----------------------+
|
v
+-----------------------+ Checks:
| Semantic Validation | - Do Go templates compile?
| (validate.go) | - Are Regex patterns valid RE2?
+-----------------------+ - Are TileSizes logically consistent?
|
v
+-----------------------+
| Global Config State | The validated object is stored in
| (config.Config) | config.Config for the app to use.
+-----------------------+
```
## Critical Constants
- **`MaxSampleTracesPerCluster`**: Limits the number of traces shown in a cluster summary (default: 50) to maintain UI performance.
- **`QueryMaxRunTime`**: Hard limit (10 minutes) on trace queries to prevent runaway database processes from exhausting resources.
- **`MinStdDev`**: The floor for normalization (0.001); standard deviations smaller than this are treated as zero, which avoids division by zero and noise amplification in regression detection.
# Module: /go/config/generate
### Purpose
The `/go/config/generate` module serves as a bridge between Go type definitions and runtime configuration validation. Its primary responsibility is to ensure that the `InstanceConfig` struct—the central configuration object for Perf—is accurately represented as a JSON Schema.
By automating the generation of this schema, the system guarantees that any structural changes made to the configuration in Go code are immediately reflected in the validation logic. This prevents the "drift" that often occurs when manual documentation or separate validation files are maintained alongside source code.
### Design and Implementation
The module is implemented as a minimal Go binary designed to be executed via `go generate`.
#### Schema Synthesis
The core logic utilizes the `jsonschema` utility package to perform reflection on the `config.InstanceConfig` struct. This process transforms Go-specific metadata (such as struct tags, nested types, and field types) into a formal JSON Schema specification.
This approach was chosen to maintain a **single source of truth**. Instead of manually writing a JSON Schema to validate incoming configuration files, the Go struct itself defines the constraints. The generated schema at `../validate/instanceConfigSchema.json` then acts as a portable artifact that can be used by:
- Static validation tools.
- IDE integrations for autocomplete and linting of configuration files.
- Runtime validators that check user-provided configurations before the application starts.
#### Workflow
The generation process follows a linear path from Go source to a serialized JSON file:
```text
[ Go Source Code ]
|
| (reflection)
v
[ InstanceConfig Struct ] ----> [ jsonschema generator ]
|
| (serialization)
v
[ instanceConfigSchema.json ]
```
### Key Components
- **main.go**: The entry point that orchestrates the generation. It explicitly links the `config` package (where the business logic definitions reside) with the `jsonschema` package (the transformation engine). It targets a specific output path in the `validate` directory, ensuring the generated schema is placed where the validation logic expects it.
- **InstanceConfig Integration**: While not defined within this directory, the `InstanceConfig` struct from `//perf/go/config` is the critical input. The generator relies on the struct tags (like `json:`) and documentation comments within that struct to produce a human-readable and accurate schema.
# Module: /go/config/validate
# Perf Instance Configuration Validation
The `go/config/validate` module provides a robust validation layer for Skia Perf instance configurations. Its primary purpose is to ensure that JSON configuration files are not only structurally sound according to a schema but also semantically valid for the Perf runtime environment.
## Overview
Configuration in Perf is complex, involving regular expressions, Go templates for notifications, and interdependent database settings. Simple JSON schema validation is insufficient for catching errors like an invalid regex or a notification template that references a non-existent field. This module bridges that gap by performing deep inspection of the configuration object before the application starts.
The validation process follows a two-tier approach:
1. **Structural Validation**: Uses a JSON schema (`instanceConfigSchema.json`) to ensure types, required fields, and nesting are correct.
2. **Semantic Validation**: Executes custom Go logic to verify templates, compile regular expressions, and check cross-field dependencies.
## Key Components and Responsibilities
### Schema Enforcement (`instanceConfigSchema.json`)
The module embeds a JSON schema that defines the structure of an `InstanceConfig`. This schema is the first line of defense, ensuring that mandatory blocks like `data_store_config`, `ingestion_config`, and `git_repo_config` are present. It also constrains the allowed properties for various sub-configs (e.g., `QueryConfig`, `AuthConfig`), preventing "silent" typos in configuration keys.
### Semantic Validation Logic (`validate.go`)
The core validation logic resides in the `Validate` function. It performs several critical checks:
- **Notification Template Execution**: For configurations using `MarkdownIssueTracker`, the validator doesn't just check if the template is valid Go syntax; it attempts to actually "dry-run" the template. It mocks data for commits, alerts, and clusters to ensure that the user-provided templates (subject and body) can be successfully expanded without runtime errors.
- **Regular Expression Compilation**: Fields such as `invalid_param_char_regex` are compiled using Go's `regexp` package. This ensures that the patterns are compatible with RE2 syntax. Specifically, for `invalid_param_char_regex`, the validator enforces that the regex _must_ match both a comma (`,`) and an equals sign (`=`), as these are fundamental delimiters in the Perf trace system.
- **Inter-dependency Checks**: The module verifies logic that spans multiple configuration blocks. For example, it ensures that if `notifications` is set to a specific tracker type, the corresponding API key secrets are also provided. It also validates that `CommitChunkSize` in the query config is logically consistent with the `TileSize` in the data store config.
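The regular expression checks can be approximated with the standard `regexp` package. This is a minimal sketch: `checkInvalidParamCharRegex` is a hypothetical helper name and the sample patterns are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// checkInvalidParamCharRegex is a hypothetical helper that mirrors the two
// checks described above: the pattern must compile under Go's RE2-based
// engine, and it must match both "," and "=".
func checkInvalidParamCharRegex(pattern string) error {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return fmt.Errorf("pattern is not valid RE2: %w", err)
	}
	for _, delim := range []string{",", "="} {
		if !re.MatchString(delim) {
			return fmt.Errorf("pattern must match %q", delim)
		}
	}
	return nil
}

func main() {
	patterns := []string{
		`[^a-zA-Z0-9._-]`, // matches both delimiters: accepted
		`[^a-zA-Z0-9,=]`,  // excludes the delimiters: rejected
		`(?=foo)bar`,      // lookahead: rejected by regexp.Compile
	}
	for _, p := range patterns {
		fmt.Printf("%-20s -> %v\n", p, checkInvalidParamCharRegex(p))
	}
}
```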
### Validation Test Suite (`testdata/` and `validate_test.go`)
The module includes a comprehensive suite of fixtures to prevent regressions:
- **Golden Files**: Validates all existing production configurations against the current logic.
- **Failure Cases**: Includes `invalid_regex.json` (testing syntax that RE2 does not support, such as lookaheads) and `invalid-notify-template.json` (testing references to non-existent template fields).
## Validation Workflow
The following diagram illustrates the lifecycle of a configuration file as it passes through this module:
```text
[ JSON Config File ]
|
v
+-----------------------+
| JSON Schema Check | ----> [ Fail: Invalid types/missing keys ]
+-----------------------+
|
v
+-----------------------+ +-----------------------------------+
| Semantic Validation | | - Compile Regex |
| (Validate) | <--> | - Dry-run Notification Templates |
+-----------------------+ | - Verify cross-field logic |
| +-----------------------------------+
v
+-----------------------+
| Load into Global Mem | ----> [ Success: Perf proceeds to boot ]
| (config.Config) |
+-----------------------+
```
## Implementation Details
The module provides two primary entry points:
- **`InstanceConfigFromFile`**: Reads a file from disk, performs schema validation, unmarshals it into the Go struct, and then runs semantic validation.
- **`LoadAndValidate`**: A higher-level wrapper that logs schema violations and populates the global `config.Config` singleton if validation passes. This is typically called during the initial setup of the Perf server.
# Module: /go/config/validate/testdata
The `testdata` module provides a suite of JSON-based test fixtures used to verify the robustness of configuration validation logic. Its primary purpose is to exercise the parser's ability to distinguish between structurally sound configurations and those that contain semantic errors in complex fields, such as Go templates and regular expressions.
### Design Intent
The data within this module is structured to target specific failure modes that are difficult to catch with simple schema validation:
- **Go Template Correctness**: The notification system relies on Go templates to format alerts. The test data includes both a comprehensive "golden" file (`valid-notify-template.json`) containing all supported variables (e.g., `.Commit.GitHash`, `.Alert.DisplayName`) and failure cases (`invalid-notify-template.json`). This allows the validator to ensure that templates do not reference non-existent properties or use invalid syntax, which would otherwise lead to runtime errors during alert generation.
- **Regex Engine Compatibility**: Go’s `regexp` package uses the RE2 syntax, which does not support certain features like lookahead assertions. The `invalid_regex.json` file specifically includes a lookahead pattern (`(?=...)`) to verify that the validator correctly identifies and rejects patterns that are incompatible with the underlying Go environment.
- **Schema Boundaries**: The `empty.json` file serves as a baseline for testing how the system handles null or empty inputs, ensuring that required fields are enforced and default values are applied correctly when an incomplete configuration is provided.
### Key Components and Responsibilities
The module is categorized by the specific validation criteria it aims to test:
#### Notification and Template Validation
The files `valid-notify-template.json` and `invalid-notify-template.json` define the expected interface for the notification engine.
- **Responsibility**: They map out the complex object hierarchy available to the template executor, including `Commit`, `Alert`, `Cluster`, and `StepFit` objects.
- **Validation Depth**: Beyond checking if a string is a valid template, these files help verify that the validator checks for the existence of nested fields, such as `{{ .Alert.DirectionAsString }}` or `{{ index .ParamSet "device_name" }}`.
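Checking nested field references requires executing the template, not just parsing it. The sketch below shows why; the `commit`, `alert`, and `templateContext` types are hypothetical, trimmed-down stand-ins for the data the real validator supplies.

```go
package main

import (
	"fmt"
	"io"
	"text/template"
)

// Hypothetical stand-ins for the objects exposed to a notification template.
type commit struct{ GitHash string }
type alert struct{ DisplayName string }
type templateContext struct {
	Commit commit
	Alert  alert
}

// dryRun parses the template and executes it against placeholder data, so
// both syntax errors and references to missing fields surface as errors.
func dryRun(text string) error {
	tmpl, err := template.New("notify").Parse(text)
	if err != nil {
		return fmt.Errorf("parse: %w", err)
	}
	data := templateContext{
		Commit: commit{GitHash: "abc123"},
		Alert:  alert{DisplayName: "example alert"},
	}
	if err := tmpl.Execute(io.Discard, data); err != nil {
		return fmt.Errorf("execute: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(dryRun("{{ .Commit.GitHash }} {{ .Alert.DisplayName }}"))
	fmt.Println(dryRun("{{ .Alert.NoSuchField }}")) // parses, fails on execution
	fmt.Println(dryRun("{{ .Commit.GitHash "))      // fails to parse
}
```

The second template is syntactically valid, so a parse-only check would accept it; the error appears only when the template is executed against a typed value.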
#### Regular Expression Constraints
Configuration fields like `invalid_param_char_regex` and `commit_number_regex` are validated to ensure they can be compiled by the application.
- **Responsibility**: `invalid_regex.json` provides a negative test case for patterns that might be valid in other engines (like Perl or JavaScript) but are unsupported in the project's Go environment.
#### Minimalist Configurations
- **Responsibility**: `empty.json` tests the "fail-fast" capability of the validator. It ensures that the application does not attempt to boot with a blank configuration, requiring at least the presence of mandatory blocks like `auth_config` or `data_store_config`.
### Validation Workflow
The following diagram illustrates how these files are typically utilized by the validation logic:
```text
[ Configuration File ] [ Validation Logic ] [ Result ]
| | |
|---(Load JSON)--------------->| |
| |-- Check Structure |
| |-- Compile Templates |
| |-- Compile Regex |
| | |
|<-----------------------------|---(Report Errors)-------|
| | |
V V V
(testdata/*.json) (config/validate) (Pass / Fail)
```
By maintaining these fixtures, the module ensures that any changes to the configuration schema or the notification engine's data model are accompanied by corresponding updates to the validation suite, preventing regressions in configuration parsing.
# Module: /go/ctrace2
The `ctrace2` module provides the bridging logic between raw performance trace data and the `kmeans` clustering engine. It defines how individual performance traces are normalized, compared, and averaged to facilitate anomaly detection and pattern discovery in Perf.
### Core Responsibility: Clusterable Performance Data
The primary goal of `ctrace2` is to transform raw, noisy performance data into a standardized mathematical representation. Performance traces often vary significantly in scale (e.g., one test might take 10ms while another takes 500ms), making direct comparison difficult.
To solve this, `ctrace2` implements the `ClusterableTrace` struct, which satisfies the interfaces required by the `kmeans` package. This allows the clustering algorithm to treat performance traces as points in an N-dimensional space.
### Data Normalization and Preparation
A key design choice in this module is the mandatory normalization of data via `NewFullTrace`. Before a trace can be used for clustering, it undergoes two critical transformations:
1. **Gap Filling**: Missing data points (sentinels) are filled using linear interpolation or zero-filling via `vec32.Fill`. This ensures that traces with intermittent data can still be compared.
2. **Unit Standard Deviation**: Traces are normalized so that their mean is 0 and their standard deviation is 1 (using `vec32.Norm`). This shift from absolute values to relative "shapes" allows the system to cluster traces that show similar _behavior_ (e.g., a step up at the same commit) even if their absolute magnitudes differ.
The normalization includes a `minStdDev` parameter to prevent the amplification of "flatline" noise; if a trace is almost perfectly flat, it will not be scaled up to unit standard deviation, as doing so would exaggerate insignificant measurement jitter.
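The two transformations and the `minStdDev` guard can be sketched as below. This is an illustration under assumptions, not the `vec32` implementation: `missing` uses NaN as a hypothetical sentinel, and `fill` simply copies the nearest earlier valid value (or the first valid value for leading gaps), which may differ from what `vec32.Fill` does.

```go
package main

import (
	"fmt"
	"math"
)

// missing is a hypothetical sentinel for an absent measurement.
var missing = math.NaN()

// fill replaces gaps with the previous valid value, back-fills leading gaps
// with the first valid value, and yields zeros if nothing is valid.
func fill(values []float64) []float64 {
	out := make([]float64, len(values))
	last, seen := 0.0, false
	for i, v := range values {
		if math.IsNaN(v) {
			if seen {
				out[i] = last
			}
			continue
		}
		if !seen {
			for j := 0; j < i; j++ {
				out[j] = v
			}
			seen = true
		}
		last = v
		out[i] = v
	}
	return out
}

// normalize shifts the trace to mean 0 and scales it to unit standard
// deviation, unless the standard deviation is at or below minStdDev.
func normalize(values []float64, minStdDev float64) []float64 {
	out := fill(values)
	if len(out) == 0 {
		return out
	}
	mean := 0.0
	for _, v := range out {
		mean += v
	}
	mean /= float64(len(out))
	variance := 0.0
	for _, v := range out {
		variance += (v - mean) * (v - mean)
	}
	stddev := math.Sqrt(variance / float64(len(out)))
	for i := range out {
		out[i] -= mean
		if stddev > minStdDev {
			out[i] /= stddev // near-flat traces are left unscaled
		}
	}
	return out
}

func main() {
	fmt.Println(normalize([]float64{10, missing, 12}, 0.001))
	fmt.Println(normalize([]float64{100, 100, 120}, 0.001)) // same shape, larger scale
	fmt.Println(normalize([]float64{5, 5.0001, 5}, 0.001))  // near-flat: not amplified
}
```

The first two traces differ by a factor of ten but normalize to the same shape, while the third stays close to zero instead of having its jitter scaled up.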
### Key Components
#### ClusterableTrace
This is the central data structure. It holds the identifying `Key` of a trace and its normalized `Values`.
- **Distance**: Implements the Euclidean distance calculation. Since all traces in a clustering operation are expected to have the same length (guaranteed by the data preparation layer), it calculates the square root of the sum of squared differences between corresponding data points.
- **CalculateCentroid**: Provides the logic to create a "representative" trace for a cluster. It calculates the arithmetic mean for every data point across all members of a cluster, resulting in a new `ClusterableTrace` that serves as the cluster's center.
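Both operations reduce to a few lines of arithmetic. The sketch below uses plain `[]float64` slices and the hypothetical names `distance` and `centroid`; the real methods operate on `ClusterableTrace` values.

```go
package main

import (
	"fmt"
	"math"
)

// distance returns the Euclidean distance between two equal-length traces.
func distance(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// centroid returns the point-wise arithmetic mean of the member traces.
// It assumes at least one member and equal lengths throughout.
func centroid(members [][]float64) []float64 {
	out := make([]float64, len(members[0]))
	for _, m := range members {
		for i, v := range m {
			out[i] += v
		}
	}
	for i := range out {
		out[i] /= float64(len(members))
	}
	return out
}

func main() {
	fmt.Println(distance([]float64{0, 0}, []float64{3, 4}))
	fmt.Println(centroid([][]float64{{1, 2}, {3, 4}}))
}
```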
### Workflow: From Raw Trace to Cluster Centroid
The following diagram illustrates how raw performance data is processed to become part of a cluster:
```text
Raw Trace Data       Normalization (NewFullTrace)       K-Means Processing
+-------------+      +---------------------------+      +-------------------+
| Key: "test" |      | 1. Fill missing values    |      | Compare Distance  |
| [10, e, 12] | ===> | 2. Shift mean to 0        | ===> | to other traces   |
+-------------+      | 3. Scale std dev to 1     |      +---------+---------+
                     +-------------+-------------+                |
                                   |                              |
                                   v                              v
                         +-------------------+          +-------------------+
                         | ClusterableTrace  | <------- | CalculateCentroid |
                         | [ -1.2, 0.1, 1.1] |          | (Average of group)|
                         +-------------------+          +-------------------+
```
### Key Constants
- **CENTROID_KEY**: When a cluster centroid is exported or visualized (e.g., in a DataFrame), it is assigned the special key `special_centroid` to distinguish the "average" shape from the actual measured data traces.
# Module: /go/culprit
# Culprit Module
The `go/culprit` module serves as the central authority for managing "culprits"—specific commits or sets of commits definitively identified as the cause of performance regressions—within the Skia Perf ecosystem. It provides the infrastructure to persist culprit data, link it to detected anomalies, and orchestrate the notification process to alert developers via external issue trackers.
## High-level Overview
This module bridges the gap between the bisection engine (which discovers culprits) and the communication layers (which report them). It is responsible for the entire lifecycle of a culprit:
1. **Persistence**: Storing the association between a commit and a performance regression.
2. **Mapping**: Maintaining the N:M relationship between code changes, Anomaly Groups, and external Issue Tracker IDs.
3. **Formatting**: Transforming raw performance data into human-readable alerts.
4. **Notification**: Dispatching these alerts to the appropriate teams based on subscription configurations.
## Design Decisions and Implementation
### Service-Oriented Architecture
The module is structured as a gRPC service (`/proto` and `/service`). This design allows different components of the Skia Perf backend—such as automated bisection tools or manual triage UIs—to interact with culprit data through a unified interface.
### Resilience Through Data Redundancy
A key architectural choice in the data schema (found in `culprit_service.proto`) is the local definition of the `Anomaly` structure. Instead of referencing external proto files from other services, `culprit` maintains its own representation. This ensures **service independence**: changes to how other modules represent anomalies won't cause cascading breaking changes in the culprit management logic.
### Safety and "Mocking" in Production
Because performance alerts can be noisy, the service includes a "Subscription Guarding" mechanism (`/service`). Before sending a notification, the system checks an allowlist (`SheriffConfigsToNotify`). If a subscription is not yet verified, the service automatically reroutes the notification to a safe, internal "mock" destination. This allows for testing new configurations in production environments without spamming development teams.
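The guard can be pictured as a single lookup before dispatch. The sketch below is illustrative only: the `subscription` fields, the `guardSubscription` helper, and the shape of the allowlist are assumptions, not the service's actual types.

```go
package main

// subscription is a simplified stand-in for the Subscription proto.
type subscription struct {
	Name      string
	Component string
}

// guardSubscription returns the real subscription only when its name is on
// the allowlist. Anything else is rerouted to the mock destination.
func guardSubscription(sub subscription, allowlist map[string]bool, mock subscription) subscription {
	if allowlist[sub.Name] {
		return sub
	}
	return mock
}
```

Because the reroute happens before any transport is involved, an unverified configuration still exercises the full formatting and delivery path without reaching a real team.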
### Decoupling Content from Delivery
The notification logic is strictly separated into two domains:
- **Formatters (`/formatter`)**: Use Go's `text/template` engine to turn protobuf data into Markdown. This allows for flexible, instance-specific report styling without changing the underlying logic.
- **Transports (`/transport`)**: Handle the actual network communication (e.g., Google Issue Tracker API). This allows the system to swap delivery methods (or use a `NoopTransport` for local development) without affecting the notification orchestration.
## Key Components
### Culprit Service (`/service` and `/proto`)
The orchestration layer. It implements `PersistCulprit` and `NotifyUserOfCulprit`. It coordinates between the storage layer and the notifier to ensure that when a bug is filed, the resulting Issue ID is recorded back into the database, creating a bidirectional link between the regression and the ticket.
### SQL Culprit Store (`/sqlculpritstore` and `store.go`)
The persistent storage implementation.
- **Revision-Centric Identity**: It treats the git revision (host/project/ref/revision) as the primary identity.
- **Upsert Logic**: When a culprit is reported, the store checks for an existing record for that commit. If found, it appends the new `anomaly_group_id` rather than creating a duplicate, ensuring a single commit's impact is tracked holistically.
- **JSONB Mapping**: It uses `JSONB` to store the `GroupIssueMap`, allowing it to track which specific anomaly group triggered which specific bug report in a single, efficient record.
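
The three behaviors above can be mimicked with an in-memory map. This is a sketch of the idea under stated assumptions, not the SQL implementation: the struct names, the ID format, and the `memStore` type are invented for the example, and the `GroupIssueMap` is a plain Go map instead of a `JSONB` column.

```go
package main

import (
	"errors"
	"fmt"
)

// commit is the revision-centric identity. All four fields form the key.
type commit struct {
	Host, Project, Ref, Revision string
}

// culpritRecord is a simplified stand-in for a stored culprit row.
type culpritRecord struct {
	ID              string
	Commit          commit
	AnomalyGroupIDs []string
	GroupIssueMap   map[string]string // anomaly group ID -> issue ID
}

// memStore is a hypothetical in-memory substitute for the SQL store.
type memStore struct {
	byCommit map[commit]*culpritRecord
}

func newMemStore() *memStore {
	return &memStore{byCommit: map[commit]*culpritRecord{}}
}

// upsert creates a record for an unseen commit, or appends the anomaly group
// to the existing record. It returns the culprit ID in both cases.
func (s *memStore) upsert(c commit, groupID string) string {
	rec, ok := s.byCommit[c]
	if !ok {
		rec = &culpritRecord{
			ID:            fmt.Sprintf("culprit-%d", len(s.byCommit)+1),
			Commit:        c,
			GroupIssueMap: map[string]string{},
		}
		s.byCommit[c] = rec
	}
	for _, g := range rec.AnomalyGroupIDs {
		if g == groupID {
			return rec.ID // group already linked, nothing to append
		}
	}
	rec.AnomalyGroupIDs = append(rec.AnomalyGroupIDs, groupID)
	return rec.ID
}

// addIssueID records which issue was filed for which anomaly group.
func (s *memStore) addIssueID(c commit, groupID, issueID string) error {
	rec, ok := s.byCommit[c]
	if !ok {
		return errors.New("unknown culprit")
	}
	rec.GroupIssueMap[groupID] = issueID
	return nil
}
```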
### Formatter (`/formatter`)
The "Logic-to-Markdown" engine. It takes `Culprit` and `Anomaly` protos and applies templates to generate subjects and bodies. It includes helper functions to calculate percentage changes and build URLs to the Perf UI or Git hosts.
### Notifier (`/notify`)
The coordinator for alerts. It takes a request to notify, fetches the content from a `Formatter`, and passes it to a `Transport`.
## Key Workflows
### From Discovery to Notification
The following diagram illustrates how a culprit is processed from the moment a bisection tool identifies it to the point an external bug is filed:
```text
[ Bisection Engine ]          [ Culprit Service ]      [ SQL Store ]      [ Issue Tracker ]
          |                            |                     |                    |
          |-- 1. PersistCulprit ------>|                     |                    |
          |   (Commit, GroupID)        |-- 2. Upsert() ----->|                    |
          |                            |                     |                    |
          |-- 3. NotifyUserOfCulprit ->|                     |                    |
          |   (CulpritID)              |-- 4. Get Culprit -->|                    |
          |                            |                     |                    |
          |                            |-- 5. Format Msg ----|                    |
          |                            |                     |                    |
          |                            |-- 6. Transport Send -------------------->|
          |                            |                     |                    |
          |                            |<-- 7. Issue ID --------------------------|
          |                            |                     |                    |
          |                            |-- 8. AddIssueId() ->|                    |
          |<-- 9. Success (ID) --------|                     |                    |
```
### Identification and Linking
The module manages the complex relationship between commits and regressions:
1. **Many-to-Many**: One commit (Culprit) can cause many regressions (Anomaly Groups), and a single Anomaly Group can be linked to more than one culprit commit.
2. **Tracking**: Each Anomaly Group within a Culprit record can have its own unique Issue ID, allowing the system to post updates to multiple relevant bug reports simultaneously.
# Module: /go/culprit/formatter
# Culprit and Anomaly Formatter
The `formatter` module is responsible for transforming raw performance regression data into human-readable notifications. It sits between the regression detection logic and the notification delivery systems (such as issue trackers or email services), ensuring that alerts contain actionable context like commit links, benchmark details, and performance delta percentages.
## High-level Overview
The module provides a standardized way to generate subjects and message bodies for two primary scenarios:
1. **New Culprits**: When a specific commit is identified as the cause of a performance regression.
2. **Anomaly Groups**: When a collection of regressions is grouped together (e.g., across multiple bots or benchmarks) and needs a summary report.
By decoupling the data representation from the final message format, the system allows for flexible reporting that can be customized per-instance via configuration files.
## Design Decisions and Implementation
### Template-Driven Formatting
The core implementation uses Go's `text/template` engine. This choice allows the formatting logic to remain generic while supporting complex data injection. The `MarkdownFormatter` ships with predefined default templates, which an instance's `IssueTrackerConfig` can override.
This design supports:
- **Contextual Data Injection**: Templates have access to `TemplateContext` (for culprits) and `ReportTemplateContext` (for anomaly groups), which include metadata about the subscription, the commit, and the anomalies themselves.
- **Custom Functions**: The formatter registers helper functions like `buildCommitURL`, `buildGroupUrl`, and `buildAnomalyDetails` within the template engine. This moves complex string manipulation (like calculating percentage changes or formatting bot names) out of the raw template and into tested Go code.
### Flexibility and Fallbacks
The `NewMarkdownFormatter` constructor implements a fallback pattern. If the `InstanceConfig` does not provide a specific subject or body template, the module uses hardcoded defaults such as `defaultNewCulpritSubject` and `defaultNewReportBody`. This ensures the system can always send a notification, even with a minimal configuration.
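The combination of custom template functions and default fallbacks can be sketched with the standard `text/template` package. The default subject text, the `templateContext` fields, and the signature of `buildCommitURL` below are assumptions made for illustration; only the function name comes from this document.

```go
package main

import (
	"strings"
	"text/template"
)

// defaultSubject is a hypothetical stand-in for a hardcoded default template.
const defaultSubject = "Culprit found at {{ .Revision }}"

// templateContext is a simplified stand-in for the data handed to templates.
type templateContext struct {
	Host, Project, Revision string
}

// buildCommitURL is registered as a template helper so that URL construction
// lives in Go code rather than in the template text.
func buildCommitURL(host, project, revision string) string {
	return "https://" + host + "/" + project + "/+/" + revision
}

// newSubjectTemplate parses the custom template when one is configured and
// falls back to the default otherwise. Funcs must be attached before Parse.
func newSubjectTemplate(custom string) (*template.Template, error) {
	text := custom
	if text == "" {
		text = defaultSubject
	}
	funcs := template.FuncMap{"buildCommitURL": buildCommitURL}
	return template.New("subject").Funcs(funcs).Parse(text)
}

// render executes the template against the context and returns the output.
func render(t *template.Template, ctx templateContext) (string, error) {
	var sb strings.Builder
	if err := t.Execute(&sb, ctx); err != nil {
		return "", err
	}
	return sb.String(), nil
}
```

An instance that configures `See {{ buildCommitURL .Host .Project .Revision }}` gets its own wording, while an instance with no template configured still produces a usable subject line.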
### Interface-Based Architecture
The `Formatter` is defined as an interface. This abstraction allows the Perf system to swap implementations easily:
- **MarkdownFormatter**: The standard implementation for systems that support Markdown (like Monorail or GitHub).
- **NoopFormatter**: A "no-operation" implementation used in testing or environments where notifications should be suppressed without changing the calling service's logic.
- **Mocks**: Automated mocks are generated to facilitate unit testing of higher-level services (like the notification service) without requiring a full template rendering setup.
## Key Components
### Formatter Interface (`formatter.go`)
Defines the contract for all formatting implementations. It requires two methods:
- `GetCulpritSubjectAndBody`: Formats a message for a specific `Culprit` proto.
- `GetReportSubjectAndBody`: Formats a summary for an `AnomalyGroup` and its associated list of `Anomaly` protos.
### MarkdownFormatter (`formatter.go`)
The primary implementation. It stores compiled templates and instance-specific URLs (like the host URL and commit URL templates). During initialization, it parses the templates and attaches the functional maps required to generate links.
### Workflow: Generating an Anomaly Report
The following diagram illustrates how the formatter processes an anomaly group into a notification:
```
[ Data Source ] [ MarkdownFormatter ] [ Output ]
| | |
|-- AnomalyGroup -------->| |
|-- Subscription -------->|-- Resolve Templates ------|
|-- Top Anomalies ------->|-- Execute Funcs: ---------|
| | * buildGroupUrl |
| | * buildAnomalyDetails --|--> Subject String
| | |--> Body (Markdown)
```
### NoopFormatter (`noop.go`)
A stub implementation that returns empty strings. It serves as a safe default when no notification formatting is required, so the orchestration services never hold a nil formatter and cannot trigger a nil pointer dereference when calling it.
### Data Contexts
- **TemplateContext**: Contains the `Culprit` commit information and the `Subscription` details (e.g., the name of the team or component being notified).
- **ReportTemplateContext**: Contains the `AnomalyGroup` (group ID, benchmark name) and a list of `TopAnomalies`, which are the most significant regressions selected to represent the group.
# Module: /go/culprit/formatter/mocks
The `go.skia.org/infra/perf/go/culprit/formatter/mocks` module provides automated mock implementations of the `Formatter` interface. This module exists to facilitate unit testing for components within the Perf system that handle notifications and reports related to performance regressions (culprits) and anomaly groups.
### Design and Purpose
The primary design goal is to allow developers to test high-level notification logic without depending on the actual formatting logic (which typically involves complex template rendering or external metadata lookups). By using these mocks, tests can verify that the system correctly passes data to the formatter and handles the resulting subject lines and message bodies as expected.
The implementation utilizes the `testify` mock framework. This choice allows for expressive test assertions, such as ensuring that a specific culprit or subscription triggered a formatting request, or simulating error conditions during the message generation process.
### Key Components
#### Formatter
The `Formatter` struct in `Formatter.go` is the central component of this module. It is a mock object that simulates the behavior of a culprit/anomaly formatter. It implements two primary functional workflows:
- **Culprit Notification Formatting**: Through `GetCulpritSubjectAndBody`, the mock simulates the creation of notification content for a specific performance culprit. It accepts a `Culprit` proto and a `Subscription` proto, returning a mocked subject string, body string, and error.
- **Anomaly Group Reporting**: Through `GetReportSubjectAndBody`, the mock simulates the creation of reports for collections of anomalies. This is used in testing workflows where multiple regressions are aggregated into a single notification for a specific subscription.
### Workflow Example
In a typical test scenario, the mock acts as a stand-in for the real formatter to verify the orchestration logic of the notification service:
```
[ Test Case ] -> [ Notification Service ] -> [ Mock Formatter ]
| | |
|-- 1. Setup expectation ------------------>| (Expect GetCulpritSubjectAndBody)
| | |
|-- 2. Trigger Action ---->| |
| |-- 3. Call Format() --->|
| |<-- 4. Return Mock Data-|
| | |
|-- 5. Assert service used mock data ------>|
```
### Usage in Testing
The `NewFormatter` function is the standard entry point for using this mock. It automatically registers cleanup functions with the Go testing framework (`t.Cleanup`), ensuring that expectations (e.g., "this method must be called exactly once") are asserted at the end of the test execution without manual boilerplate.
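The cleanup-based assertion pattern can be shown without any mocking library. The sketch below is a hand-rolled analogue, not the generated `testify` mock: `cleanupT`, `mockFormatter`, and `fakeT` are hypothetical names, and the method takes no arguments to keep the example short. `fakeT` stands in for `*testing.T` so the behavior can be observed outside `go test`.

```go
package main

import "fmt"

// cleanupT is the subset of *testing.T that the constructor relies on.
type cleanupT interface {
	Cleanup(func())
	Errorf(format string, args ...interface{})
}

// mockFormatter counts calls and returns canned values.
type mockFormatter struct {
	expectedCalls int
	calls         int
	subject, body string
}

// newMockFormatter registers the expectation check as a cleanup function, so
// the caller never has to assert expectations by hand.
func newMockFormatter(t cleanupT, expectedCalls int) *mockFormatter {
	m := &mockFormatter{expectedCalls: expectedCalls}
	t.Cleanup(func() {
		if m.calls != m.expectedCalls {
			t.Errorf("expected %d calls, got %d", m.expectedCalls, m.calls)
		}
	})
	return m
}

// GetCulpritSubjectAndBody records the call and returns the canned content.
func (m *mockFormatter) GetCulpritSubjectAndBody() (string, string, error) {
	m.calls++
	return m.subject, m.body, nil
}

// fakeT records cleanups and failures in place of a real *testing.T.
type fakeT struct {
	cleanups []func()
	failures []string
}

func (f *fakeT) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }

func (f *fakeT) Errorf(format string, args ...interface{}) {
	f.failures = append(f.failures, fmt.Sprintf(format, args...))
}

// finish runs the registered cleanups in reverse order, as the testing
// package does when a test ends.
func (f *fakeT) finish() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}
```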
# Module: /go/culprit/mocks
The `go/culprit/mocks` module provides autogenerated mock implementations of the interfaces defined in the `culprit` package. Its primary purpose is to facilitate unit testing for components that depend on culprit persistence and retrieval without requiring a live database or a complex setup of the `culprit.Store`.
### Design Philosophy
The module leverages `testify/mock` to provide a flexible way to simulate the behavior of the culprit storage layer. By using mocks, developers can:
- **Isolate Components**: Test business logic in services like anomaly detection or regression analysis without being affected by the state of a real database.
- **Simulate Edge Cases**: Easily trigger specific error conditions (e.g., database timeouts or unique constraint violations) or return specific protobuf structures that might be difficult to reproduce with real data.
- **Verify Interactions**: Ensure that the calling code correctly invokes storage methods with the expected parameters, such as specific `anomaly_group_id`s or commit slices.
### Key Components
#### Store.go
This file contains the `Store` struct, which mocks the primary interface for managing culprits. It is generated via `mockery` and mirrors the methods required to interact with culprit data in the Perf system.
The mock provides implementations for the following critical workflows:
- **Culprit Ingestion and Updates (`Upsert`)**: Allows tests to simulate the creation or updating of culprits associated with specific anomaly groups. It mimics the behavior of returning a list of generated culprit IDs based on provided commit information.
- **Metadata Association (`AddIssueId`)**: Simulates the linking of a culprit to an external issue tracker ID. This is crucial for testing the integration between Perf's internal culprit tracking and external bug reporting systems.
- **Data Retrieval (`Get`, `GetAnomalyGroupIdsForIssueId`)**: Facilitates testing of UI endpoints or reporting tools by returning pre-defined `v1.Culprit` protobuf messages or mapping issue IDs back to internal anomaly groups.
### Typical Testing Workflow
When utilizing this module, a test typically follows the "Setup-Expect-Verify" pattern:
```
Test Component Mock Store Internal Logic
| | |
|-- 1. Setup Mock ----->| |
| | |
|-- 2. Set Expectations | |
| (On "Get" return X) | |
| | |
|-- 3. Call Method ---->|---------------------->|
| | |
| |<-- 4. Call "Get" -----|
| | |
| |--- 5. Return X ------>|
| | |
|-- 6. Verify Mocks ----| |
```
The `NewStore` function simplifies this by automatically registering a cleanup function on the provided `testing.T` instance, ensuring that `AssertExpectations` is called when the test completes.
# Module: /go/culprit/notify
### High-level Overview
The `go.skia.org/infra/perf/go/culprit/notify` module is responsible for orchestrating the notification process when performance regressions (anomalies) or their root causes (culprits) are identified. It acts as a bridge between the internal detection logic and external communication platforms, such as issue trackers.
The module abstracts the "how" of notification by separating the content generation (formatting) from the delivery mechanism (transport).
### Design and Implementation Choices
The module follows a "Strategy" pattern to handle different notification environments and requirements.
- **Decoupling via Interfaces**: The core logic relies on the `CulpritNotifier` interface. This allows the system to switch between real notifications and no-op (no-operation) modes easily, which is essential for local development or testing environments where sending real bugs is undesirable.
- **Separation of Concerns**: The implementation divides the task into two distinct roles:
- **Formatter**: Responsible for taking raw data (Protobuf messages for culprits or anomalies) and transforming them into human-readable subjects and bodies (typically Markdown).
- **Transport**: Responsible for the actual network communication with external APIs (like Buganizer/Issue Tracker).
- **Factory Pattern for Configuration**: The `GetDefaultNotifier` function acts as a factory that inspects the `InstanceConfig`. It determines whether to instantiate a functional `IssueNotify` system or a `NoneNotify` (noop) system based on the deployment configuration.
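
A factory of this kind can be sketched in a few lines. The configuration field, the string values `"none"` and `"issuetracker"`, and the notifier types below are placeholders for illustration and do not reflect the real `InstanceConfig` schema.

```go
package main

import "fmt"

// culpritNotifier is a reduced form of the notifier contract.
type culpritNotifier interface {
	NotifyCulpritFound(culpritID string) (string, error)
}

// noneNotifier silently drops every notification.
type noneNotifier struct{}

func (noneNotifier) NotifyCulpritFound(string) (string, error) { return "", nil }

// issueNotifier pretends to file an issue and returns a made-up identifier.
type issueNotifier struct{}

func (issueNotifier) NotifyCulpritFound(culpritID string) (string, error) {
	return "issue-for-" + culpritID, nil
}

// instanceConfig is a hypothetical slice of the deployment configuration.
type instanceConfig struct {
	NotificationType string
}

// getDefaultNotifier picks the implementation from the configuration and
// rejects values it does not recognize.
func getDefaultNotifier(cfg instanceConfig) (culpritNotifier, error) {
	switch cfg.NotificationType {
	case "none":
		return noneNotifier{}, nil
	case "issuetracker":
		return issueNotifier{}, nil
	default:
		return nil, fmt.Errorf("unsupported notification type %q", cfg.NotificationType)
	}
}
```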
### Key Components and Responsibilities
#### CulpritNotifier Interface
The primary contract for the module. It defines two main entry points:
- `NotifyAnomaliesFound`: Triggered when a group of regressions is first detected.
- `NotifyCulpritFound`: Triggered when an automated analysis has narrowed down a specific commit as the cause of a regression.
#### DefaultCulpritNotifier
This is the standard implementation of the `CulpritNotifier`. It does not contain formatting or transport logic itself; instead, it coordinates the two. It fetches the content from the `formatter`, passes it to the `transport`, and returns the resulting identifier (e.g., a Bug ID).
#### Integration Logic (notify.go)
This file handles the lifecycle of a notification. It ensures that if a `Subscription` (the configuration defining who should be alerted) is missing, the system fails gracefully or logs the omission rather than attempting to send a malformed alert.
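The coordination role, including the missing-subscription guard, can be sketched as below. The interfaces are trimmed-down assumptions (no `context.Context`, simplified argument lists), the helper types `plainFormatter` and `recordingTransport` exist only for the example, and returning an error for a missing subscription is one possible reading of "fails gracefully".

```go
package main

import "errors"

// culprit and subscription are simplified stand-ins for the protos.
type culprit struct{ ID, Revision string }
type subscription struct{ Name string }

// formatter produces the content of a notification.
type formatter interface {
	GetCulpritSubjectAndBody(c *culprit, s *subscription) (subject, body string, err error)
}

// transport delivers the content and returns an external identifier.
type transport interface {
	SendNewNotification(s *subscription, subject, body string) (string, error)
}

// defaultCulpritNotifier owns no formatting or delivery logic of its own.
type defaultCulpritNotifier struct {
	formatter formatter
	transport transport
}

// NotifyCulpritFound refuses to send without a subscription, then formats
// the message and hands it to the transport.
func (n *defaultCulpritNotifier) NotifyCulpritFound(c *culprit, s *subscription) (string, error) {
	if s == nil {
		return "", errors.New("no subscription for culprit " + c.ID)
	}
	subject, body, err := n.formatter.GetCulpritSubjectAndBody(c, s)
	if err != nil {
		return "", err
	}
	return n.transport.SendNewNotification(s, subject, body)
}

// plainFormatter is a trivial formatter used to exercise the sketch.
type plainFormatter struct{}

func (plainFormatter) GetCulpritSubjectAndBody(c *culprit, s *subscription) (string, string, error) {
	return "Culprit " + c.Revision, "Notifying " + s.Name, nil
}

// recordingTransport remembers the last subject instead of calling an API.
type recordingTransport struct{ lastSubject string }

func (t *recordingTransport) SendNewNotification(_ *subscription, subject, _ string) (string, error) {
	t.lastSubject = subject
	return "bug-1", nil
}
```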
### Key Workflow: Notification Orchestration
The following diagram shows how the `DefaultCulpritNotifier` coordinates the flow of information from a detected event to an external system:
```
[ Caller ] [ DefaultCulpritNotifier ] [ Formatter ] [ Transport ]
| | | |
|-- NotifyCulpritFound() ---->| | |
| |-- GetSubjectAndBody() ->| |
| |<-- (subject, body) -----| |
| | | |
| |-- SendNewNotification() ------------------->|
| | | |
| |<--------- (bug_id) -------------------------|
|<----- (bug_id, err) --------| | |
```
### Testing Utilities
The module includes a `mocks` sub-package (generated via `mockery`). This is used by other parts of the Perf system to simulate the notification layer. By using these mocks, developers can verify that the detection pipeline correctly triggers notifications with the right metadata without actually creating tickets in an issue tracker.
# Module: /go/culprit/notify/mocks
### High-level Overview
The `go.skia.org/infra/perf/go/culprit/notify/mocks` module provides automated mock implementations for the culprit notification system within Perf. Its primary purpose is to facilitate unit testing for components that depend on the notification logic—such as anomaly detection pipelines or culprit analysis engines—without triggering actual external notifications (e.g., creating real bug reports or sending emails).
### Design and Implementation Choices
The module is built using [testify/mock](https://github.com/stretchr/testify), which was chosen to provide a consistent, type-safe way to assert that notification events occur with the expected parameters.
A key design choice in this module is the use of `mockery` for code generation. By generating the `CulpritNotifier` mock automatically from the interface definition in the parent `notify` package, the project ensures that the test infrastructure stays in lockstep with the production API. This prevents "stale" tests where a mock might satisfy an old version of an interface that has since changed.
The implementation focuses on two distinct stages of the Perf alerting lifecycle:
1. **Initial Anomaly Grouping**: Handling a collection of detected performance regressions.
2. **Culprit Identification**: Handling the specific commit identified as the root cause.
### Key Components and Responsibilities
#### CulpritNotifier.go
This file contains the `CulpritNotifier` struct, which implements the interface required to simulate the notification subsystem. It manages the lifecycle of notifications through two primary mocked methods:
- **`NotifyAnomaliesFound`**: This method simulates the process of alerting users about a new `AnomalyGroup`. It accepts the group details, the associated `Subscription` (which contains routing/alerting metadata), and a list of specific `Anomaly` objects. In a test environment, this allows developers to verify that the system correctly identifies which subscription should be notified when a set of regressions is detected.
- **`NotifyCulpritFound`**: This method simulates the final stage of an investigation where a specific `Culprit` (a commit) has been identified. It lets tests verify that the notification logic associates a culprit with the right subscription, and it returns a simulated notification ID (such as a bug ID).
The file also includes a constructor, `NewCulpritNotifier`, which uses the `Cleanup` method of Go's `testing.T`. This pattern automatically registers `AssertExpectations` to run at the end of a test, so a missed notification call fails the test without the developer calling assertion methods manually.
### Key Workflow: Testing a Culprit Discovery
The following diagram illustrates how this mock integrates into a typical test suite workflow to validate the notification logic:
```
[ Test Case ] [ Component Under Test ] [ Mock CulpritNotifier ]
| | |
|-- Register Expectation -> |
| (NotifyCulpritFound) | |
| | |
|---- Execute Action ---->| |
| |---- Call NotifyCulprit() -->|
| | |-- Record Call
| |<------- Return Mock ID -----|
| | |
| <--- Verify Results ----| |
| | |
| (Test Cleanup) | |
|------------------------------------------------------>|-- AssertExpectations()
| (Fails if not called)
```
# Module: /go/culprit/proto
### Overview
The `go/culprit/proto` module defines the communication interface and data schema for the **Culprit Service**. This service acts as the central authority for managing "culprits"—commits definitively identified as the cause of performance regressions—within the Skia Perf ecosystem.
By providing a unified gRPC interface, this module bridges the gap between the bisection engine (which discovers culprits), the storage layer (which persists them), and the notification systems (which alert developers).
### Design and Implementation Choices
#### Resilience Through Data Redundancy
The `Anomaly` data structure in this module is a local definition rather than a reference to external proto files. While this creates some duplication with services like `anomalygroup`, it is a deliberate architectural choice to ensure **service independence**. If the anomaly grouping logic changes its internal representation, the Culprit Service remains stable, preventing breaking changes from cascading through the microservice architecture.
#### Granular Issue Tracking
A key design feature of the `Culprit` message is the `group_issue_map`. In large-scale performance monitoring, a single problematic commit (a culprit) often triggers multiple regressions across different platforms or benchmarks, which might be tracked in different anomaly groups. This mapping allows the service to:
1. Maintain a many-to-many relationship between culprits and anomaly groups.
2. Track specific issue IDs (e.g., Monorail or Buganizer) for each group, ensuring that updates are posted to the correct bug reports.
#### Global Commit Identification
The `Commit` message is designed to be repository-agnostic. By explicitly requiring `host`, `project`, and `ref` alongside the `revision`, the service can handle culprits across the diverse set of repositories monitored by Skia Perf (e.g., Chrome, Skia, V8, ANGLE). This allows a single instance of the service to manage regressions originating from different source control providers.
### Key Components
#### Service Interface (`culprit_service.proto`)
The `CulpritService` defines the lifecycle management of a regression:
- **Identification Persistence**: The `PersistCulprit` method transforms the results of a bisection (a commit) into a permanent record linked to an anomaly group.
- **Asynchronous Notification**: The service separates the detection of an anomaly from the identification of a culprit. `NotifyUserOfAnomaly` is used for initial "regression found" alerts, while `NotifyUserOfCulprit` is used for "culprit found" alerts, allowing the system to provide immediate feedback followed by precise root-cause analysis.
#### Data Structures
- **`Anomaly`**: Captures the state of the world at the time of regression. It stores the "before" and "after" medians, which are critical for calculating the magnitude of the impact, and the dimensions (test name, bot name) to identify the specific environment affected.
- **`Culprit`**: Represents a commit confirmed as the cause of a performance regression. It serves as an audit log, containing the commit metadata and a history of the notifications sent to developers.
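
As a small worked example of how the "before" and "after" medians translate into a magnitude, the helper below computes a percentage change. The function name and the handling of a zero baseline are assumptions for this sketch, not the formatter's actual helper.

```go
package main

import "math"

// percentChange reports the change from before to after as a percentage of
// the baseline. The boolean is false when the baseline is zero and no
// percentage can be computed.
func percentChange(before, after float64) (float64, bool) {
	if before == 0 {
		return 0, false
	}
	return (after - before) * 100 / math.Abs(before), true
}
```

With medians of 100 and 110 the result is a 10% increase, and with 200 and 150 it is a 25% decrease.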
### Core Workflow: From Detection to Notification
The following diagram illustrates how the Culprit Service coordinates between the engine that finds bugs and the trackers that manage them:
```text
Detection/Bisection Engine Culprit Service Database / External API
| | |
|--- PersistCulprit ---->| |
| (Commit + GroupID) |---- Store Culprit ----->|
| | |
|-- NotifyUserOfCulprit -| |
| (CulpritID) |---- Fetch Metadata ---->|
| | |
| |---- Create/Update Bug ->|
| |<--- Return Issue ID ----|
| | |
|<------- Success -------|---- Update Map/Link --->|
```
### Key Files
- **`culprit_service.proto`**: The primary definition file. It defines the gRPC service and all message types used for requests and responses.
- **`culprit_service.pb.go` & `culprit_service_grpc.pb.go`**: The compiled Go code. These files provide the concrete types and client/server boilerplate used by other Go services in the repository to interact with the Culprit Service.
- **`generate.go`**: The automation hook that ensures the generated Go code stays in sync with the proto definitions.
# Module: /go/culprit/proto/v1
This module defines the gRPC interface and data structures for the **Culprit Service**, a component of the Skia Perf ecosystem responsible for managing performance regression culprits and user notifications. It serves as the contract between the bisection engine (which identifies culprits) and the storage/notification layers.
### Overview
The Culprit Service handles the lifecycle of a "culprit": a specific commit identified as the cause of a performance regression. The module's primary responsibilities include:
- **Persistence**: Storing and retrieving culprit data linked to specific anomaly groups.
- **Notification**: Orchestrating alerts to users (e.g., creating bug tracker issues) when anomalies are detected or when bisection successfully identifies a culprit.
### Design and Implementation Choices
#### Separation of Concerns
The data structures for `Anomaly` are intentionally duplicated from other services (like `anomalygroup_service.proto`). This redundancy allows the Culprit Service to evolve its definition of an anomaly independently of the grouping service, preventing tight coupling in a microservices environment where different teams might own different parts of the pipeline.
#### Mapping Culprits to Issues
The `Culprit` message includes a `group_issue_map`. This design choice recognizes that a single commit (culprit) might cause regressions across multiple different test suites or "anomaly groups." By mapping `anomaly_group_id` to `issue_id`, the service can track which bugs were filed for which specific performance regressions associated with the same culprit.
#### Commit Metadata
The `Commit` message provides a normalized way to identify changes across different repositories. By including `host`, `project`, `ref`, and `revision`, it ensures that the service can uniquely identify commits even when Skia Perf monitors multiple disparate Git repositories (e.g., Chromium, V8, Skia).
### Key Components
#### CulpritService
The gRPC service definition (`culprit_service.proto`) defines the following core operations:
- **`PersistCulprit`**: Called after a bisection process identifies a culprit. It links a list of commits to an `anomaly_group_id`.
- **`GetCulprit`**: Used by the UI or other services to fetch detailed metadata about identified culprits.
- **`NotifyUserOfAnomaly`**: Triggered when a regression is first detected. This typically results in the creation of a tracking issue.
- **`NotifyUserOfCulprit`**: Triggered when bisection finishes. It updates existing issues or creates new ones to alert developers that a specific commit they authored caused a regression.
#### Data Models
- **`Anomaly`**: Contains the statistical context of a regression, including "before" and "after" medians and the specific dimensions (bot, benchmark, measurement) where the regression occurred.
- **`Culprit`**: The record of a confirmed regression-causing commit, maintaining links to the anomaly groups it affected and the issues filed in response.
### Typical Workflow
The following diagram illustrates how the Culprit Service interacts with the bisection and notification flow:
```text
Bisection Engine        Culprit Service          Storage / Issue Tracker
       |                       |                            |
       |-- PersistCulprit ---->|                            |
       |  (Commits + GroupID)  |---- Save to DB ----------->|
       |                       |                            |
       |-- NotifyUser ---------|                            |
       |  (Culprit IDs)        |---- Create/Update Issue -->|
       |                       |<--- Return Issue ID -------|
       |<-- Return Success ----|                            |
```
### Files
- **`culprit_service.proto`**: The source of truth for the service interface and message definitions.
- **`culprit_service.pb.go`**: Generated Go structures for messages.
- **`culprit_service_grpc.pb.go`**: Generated gRPC client and server interfaces.
- **`generate.go`**: Contains the `go:generate` directives used to rebuild the protobuf and gRPC code via Bazel.
# Module: /go/culprit/proto/v1/mocks
### Overview
The `go.skia.org/infra/perf/go/culprit/proto/v1/mocks` module provides mock implementations of the Culprit Service gRPC server. Its primary purpose is to facilitate unit testing for components that depend on the `CulpritService`. By using these mocks, developers can simulate various service behaviors—such as successful culprit persistence or notification failures—without requiring a running gRPC backend or database.
### Design and Implementation Choices
The module relies on the `testify/mock` framework to provide a flexible, programmable interface for defining expected behaviors during tests.
A critical implementation detail in this module is the manual handling of gRPC interface requirements. In standard Go gRPC implementations, a server must embed an `Unimplemented...` struct to ensure forward compatibility with the interface. Since many auto-generation tools (like `mockery`) may fail to include this embedding, it has been manually added to the `CulpritServiceServer` struct. This ensures the mock remains a valid implementation of the `v1.CulpritServiceServer` interface defined in the parent proto package.
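The embedding requirement can be reproduced in miniature without gRPC. The sketch below imitates the shape of generated code with lowercase, single-package stand-ins: the interface methods and their signatures are simplified assumptions. In real generated code the unexported marker method lives in the proto package, so embedding is the only way for an outside type to satisfy the interface.

```go
package main

import "errors"

// culpritServiceServer imitates a generated server interface. The unexported
// marker method forces implementations to embed the Unimplemented type.
type culpritServiceServer interface {
	GetCulprit(id string) (string, error)
	PersistCulprit(revision string) (string, error)
	mustEmbedUnimplementedCulpritServiceServer()
}

// unimplementedCulpritServiceServer supplies a failing default for every RPC.
type unimplementedCulpritServiceServer struct{}

func (unimplementedCulpritServiceServer) GetCulprit(string) (string, error) {
	return "", errors.New("method GetCulprit not implemented")
}

func (unimplementedCulpritServiceServer) PersistCulprit(string) (string, error) {
	return "", errors.New("method PersistCulprit not implemented")
}

func (unimplementedCulpritServiceServer) mustEmbedUnimplementedCulpritServiceServer() {}

// mockCulpritServiceServer overrides only GetCulprit. The embedded struct
// covers PersistCulprit and any RPC added to the interface later.
type mockCulpritServiceServer struct {
	unimplementedCulpritServiceServer
	revisions map[string]string // culprit ID -> revision
}

func (m *mockCulpritServiceServer) GetCulprit(id string) (string, error) {
	rev, ok := m.revisions[id]
	if !ok {
		return "", errors.New("culprit not found: " + id)
	}
	return rev, nil
}

// Compile-time check that the mock still satisfies the interface.
var _ culpritServiceServer = (*mockCulpritServiceServer)(nil)
```

If the embedded field were removed, the compile-time check on the last line would fail, which is the same failure a generated mock hits when the embedding is left out.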
### Key Components
#### CulpritServiceServer
This is the central mock type. It mimics the behavior of the `CulpritService` by allowing tests to "stub" responses for specific RPC calls. It covers the following key service responsibilities:
- **Culprit Management**: Functions like `GetCulprit` and `PersistCulprit` allow tests to simulate the retrieval and storage of performance regression culprits.
- **User Notification**: Functions like `NotifyUserOfAnomaly` and `NotifyUserOfCulprit` enable verification of the notification logic, ensuring that the system correctly attempts to alert users when regressions or specific culprits are identified.
#### Initialization and Cleanup
The module provides a `NewCulpritServiceServer` constructor. This function is designed to integrate tightly with Go's `testing.T`. It automatically registers a cleanup function that calls `AssertExpectations`, which ensures that all programmed mock behaviors (e.g., "expect this function to be called exactly once") were actually executed before the test finishes.
### Typical Workflow
When testing a component that interacts with the Culprit Service, the workflow generally follows these steps:
```
1. Setup Mock : Create mock instance using NewCulpritServiceServer(t).
2. Set Expectations: Define what inputs are expected and what should be returned.
(e.g., On("PersistCulprit", ...).Return(&v1.PersistCulpritResponse{}, nil))
3. Injection : Pass the mock into the component being tested.
4. Execution : Run the logic of the component under test.
5. Verification : The Cleanup function automatically verifies that the
component called PersistCulprit as expected.
```
### Files
- **`CulpritServiceServer.go`**: Contains the mock struct and method definitions for the gRPC service. This is where the manual embedding of `v1.UnimplementedCulpritServiceServer` resides to satisfy gRPC interface constraints.
# Module: /go/culprit/service
# Culprit Service
The `culprit/service` module provides a gRPC implementation for managing culprits and automating the notification process when performance regressions (anomalies) are identified in the Perf system. It acts as the orchestration layer between the storage of anomaly data and the external notification systems (e.g., bug trackers).
## Overview
The primary purpose of this service is to handle the lifecycle of a "culprit"—a specific commit or set of commits identified as the cause of a performance change. It bridges several domains:
1. **Persistence**: Saving identified culprits and associating them with specific Anomaly Groups.
2. **Lookup**: Retrieving culprit details for UI or backend processing.
3. **Notification**: Triggering alerts (filing bugs) based on the findings of bisection or anomaly detection.
The service is designed to be used by backend components that perform bisection and need to report their findings, or by systems that detect anomalies and require immediate user notification.
## Key Components and Responsibilities
### Culprit Persistence and Management
The service coordinates with `culprit.Store` and `anomalygroup.Store` to ensure that when a culprit is identified, the relationship between the problematic commit and the group of affected traces is maintained.
- **`PersistCulprit`**: This workflow ensures atomicity at the application level. It first saves the culprit commits to the `culpritStore` and then updates the corresponding `AnomalyGroup` to include these new Culprit IDs. This bidirectional link is essential for tracking which regressions were caused by which commits.
- **`GetCulprit`**: Provides a standard interface to fetch culprit metadata by ID.
### Notification Logic
The service handles two types of notifications via the `notify.CulpritNotifier` interface. Both workflows rely on "Subscriptions" to determine where and how to file reports (e.g., which bug component, labels, or CC list to use).
- **Culprit Notification (`NotifyUserOfCulprit`)**: Triggered typically after a successful bisection. It loads the culprit details, identifies the relevant subscription associated with the anomaly group, and files a bug specifically for that culprit. It also records the resulting Issue ID back into the culprit record.
- **Anomaly Notification (`NotifyUserOfAnomaly`)**: Triggered when a group of anomalies is identified but a specific culprit may not yet be confirmed (or is being reported as a set). This files a broader report based on the anomaly group's characteristics.
### Subscription Guarding and Mocking
A unique aspect of this service is the `PrepareSubscription` logic. Because performance alerts can be noisy or sensitive, the service includes a safety mechanism to prevent accidental notifications to end-users during testing or when onboarding new "Sheriff" configurations.
- **Allowlist Check**: The service checks the `InstanceConfig.SheriffConfigsToNotify` list. If a subscription's name is not in this list, the service overwrites the bug destination (labels, components, CCs) with "mock" values. This ensures that even if a notification is triggered in a staging environment or for an unverified config, the bug is routed to a safe, internal hotlist rather than the actual team's queue.
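The allowlist check can be sketched as follows. This is a minimal illustration under stated assumptions: the `Subscription` struct is a trimmed-down stand-in for the real proto, and the field names and "mock" placeholder values are hypothetical rather than the ones the service uses.

```go
package main

import "fmt"

// Subscription is a hypothetical, trimmed-down stand-in for the
// subscription proto; only routing-related fields are modeled.
type Subscription struct {
	Name         string
	BugComponent string
	BugLabels    []string
	BugCCEmails  []string
}

// prepareSubscription returns the subscription unchanged when its name is on
// the allowlist. Otherwise it returns a copy whose routing fields are
// replaced with placeholder values, so no real team queue is touched.
func prepareSubscription(sub Subscription, allowlist []string) Subscription {
	for _, name := range allowlist {
		if name == sub.Name {
			return sub
		}
	}
	// sub is a copy (passed by value); assigning fresh slices leaves the
	// caller's data untouched.
	sub.BugComponent = "mock-component"
	sub.BugLabels = []string{"mock-label"}
	sub.BugCCEmails = []string{"mock-cc@example.com"}
	return sub
}

func main() {
	allow := []string{"Verified Sheriff"}
	sub := Subscription{
		Name:         "Experimental Sheriff",
		BugComponent: "12345",
		BugLabels:    []string{"Perf-Regression"},
		BugCCEmails:  []string{"team@example.com"},
	}
	fmt.Printf("%+v\n", prepareSubscription(sub, allow))
}
```

Returning a rewritten copy instead of mutating the stored subscription keeps the guard local to the notification path.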
## Key Workflows
### Culprit Discovery and Reporting
When a bisection tool finds a culprit, the following process occurs:
```text
Bisection Tool -> PersistCulprit(Commits, GroupID)
                        |
                        v
                [ Culprit Store ] <--- Save Commits
                        |
                        v
                [ AnomalyGroup Store ] <--- Link Culprit IDs to Group
                        |
                        +------> Response (Culprit IDs)

Bisection Tool -> NotifyUserOfCulprit(CulpritIDs, GroupID)
                        |
                        +-----> Load Subscription (via Group Name)
                        |
                        +-----> PrepareSubscription (Safe-guarding/Mocking)
                        |
                        v
                [ Culprit Notifier ] ---> EXTERNAL: File Bug
                        |
                        v
                [ Culprit Store ] <--- Record Issue ID
```
## Implementation Decisions
- **Separation of Concerns**: The service does not implement the logic for _how_ to file a bug or _how_ to store a commit; it strictly orchestrates the calls between specialized stores and the notifier.
- **gRPC Integration**: By implementing `backend.BackendService`, this module integrates into the Skia Perf backend infrastructure, inheriting standard service registration and (eventually) centralized authorization policies.
- **Mocking for Safety**: The `PrepareSubscription` function is an intentional "shim" in the implementation. It allows the team to run the full service logic in production-like environments while ensuring that experimental anomaly groups do not spam developers until their configurations are explicitly added to the allowlist.
# Module: /go/culprit/sqlculpritstore
# SQL Culprit Store
The `sqlculpritstore` module provides a persistent SQL-based implementation for managing "Culprits" within the Skia Perf ecosystem. A Culprit represents a specific commit (defined by its host, project, ref, and revision) that has been identified as the root cause of one or more performance regressions.
## Design Philosophy
The primary challenge in managing culprits is the N:M relationship between code changes, diagnostic clusters (Anomaly Groups), and tracking systems (Issue Trackers). A single commit can cause regressions in multiple tests, and a single bug report might track several related regressions.
To address this, the store is designed around the following principles:
- **Revision-Centric Identity**: While records are assigned a UUID for internal database efficiency, the business logic treats the git revision as the primary identifier.
- **Contextual Linking**: The store doesn't just track _that_ a commit is a culprit, but also _why_ (via `AnomalyGroupIDs`) and _where_ it is being tracked (via `IssueIds`).
- **Explicit Mapping**: Through the `GroupIssueMap`, the store maintains a JSONB-encoded link between specific anomaly groups and their corresponding issue IDs. This allows the system to determine exactly which regression triggered a specific bug report without complex join operations.
## Key Components
### CulpritStore (`sqlculpritstore.go`)
The main struct implementing the storage interface. It handles the translation between Go protobuf messages (`pb.Culprit`) and the underlying SQL schema.
- **Upsert Logic**: The `Upsert` method is a critical path. It determines whether a culprit already exists based on its commit coordinates. If it does, the method appends the new `anomaly_group_id` to the existing list and updates the `last_modified` timestamp. If not, it generates a new UUID and creates a record. This ensures that a single commit is never duplicated in the store, regardless of how many regressions it causes.
- **Issue Management**: The `AddIssueId` method enforces data integrity by ensuring an issue can only be linked to a culprit if the associated `group_id` is already recognized as being caused by that culprit.
### Schema (`/schema`)
Defines the table structure and indexing strategy. A notable implementation choice is the `by_revision` composite index:
`INDEX by_revision (revision, host, project, ref)`
By leading with the `revision` (a high-entropy hash), the database avoids "hotspots" and distributes data more evenly across partitions compared to leading with low-entropy strings like `host`.
## Key Workflows
### Identifying and Storing a Culprit
When the system identifies a set of suspect commits for a regression:
```text
Discovery Engine -> [Anomaly Group ID + Commits]
        |
        v
CulpritStore.Upsert()
        |
        +-- Check if (Host/Project/Ref/Revision) exists?
        |       |
        |       +-- YES: Append Anomaly Group ID to array; Update LastModified
        |       |
        |       +-- NO: Generate UUID; Create new record
        v
[ Database Updated ]
```
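The upsert flow above can be sketched with an in-memory map in place of SQL. Everything here is an illustrative assumption: the `commit`, `record`, and `store` types are hypothetical, and a sequential ID stands in for the UUID the real store generates.

```go
package main

import (
	"errors"
	"fmt"
)

// commit holds the source-control coordinates that identify a culprit.
type commit struct{ Host, Project, Ref, Revision string }

// record is a hypothetical in-memory stand-in for one culprit row.
type record struct {
	ID              string
	AnomalyGroupIDs []string
	LastModified    int64
}

// store keys records by commit coordinates, so a commit is never duplicated.
type store struct {
	rows   map[commit]*record
	nextID int
}

func newStore() *store { return &store{rows: map[commit]*record{}} }

// upsert links every commit to groupID, creating records only for commits
// that are not yet known. It returns the IDs of the touched records.
func (s *store) upsert(groupID string, commits []commit, now int64) ([]string, error) {
	// Reject batches that mix repositories.
	for _, c := range commits {
		if c.Host != commits[0].Host || c.Project != commits[0].Project || c.Ref != commits[0].Ref {
			return nil, errors.New("all commits must share host, project, and ref")
		}
	}
	ids := make([]string, 0, len(commits))
	for _, c := range commits {
		row, ok := s.rows[c]
		if !ok {
			s.nextID++ // the real store generates a UUID here
			row = &record{ID: fmt.Sprintf("culprit-%d", s.nextID)}
			s.rows[c] = row
		}
		row.AnomalyGroupIDs = append(row.AnomalyGroupIDs, groupID)
		row.LastModified = now
		ids = append(ids, row.ID)
	}
	return ids, nil
}

func main() {
	s := newStore()
	c := commit{"example.googlesource.com", "example/src", "refs/heads/main", "abc123"}
	first, _ := s.upsert("group-1", []commit{c}, 100)
	second, _ := s.upsert("group-2", []commit{c}, 200)
	fmt.Println(first, second, s.rows[c].AnomalyGroupIDs)
}
```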
### Linking an Issue
When a user or automated system files a bug for a specific regression:
```text
Issue Tracker -> [Culprit ID + Issue ID + Anomaly Group ID]
        |
        v
CulpritStore.AddIssueId()
        |
        +-- Verify: Is Anomaly Group ID linked to this Culprit?
        |       |
        |       +-- NO: Return Error (Prevents orphaned/incorrect links)
        |
        +-- Update: Append Issue ID; Update GroupIssueMap (JSONB)
        v
[ Database Updated ]
```
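A minimal sketch of the verification step above, using a plain Go struct in place of the SQL row. The `culprit` type and `addIssueID` function are hypothetical, and a Go map stands in for the JSONB-encoded `GroupIssueMap`.

```go
package main

import "fmt"

// culprit is a hypothetical in-memory stand-in for one culprit record.
type culprit struct {
	AnomalyGroupIDs []string
	IssueIDs        []string
	GroupIssueMap   map[string]string // anomaly group ID -> issue ID
}

// addIssueID links issueID to the culprit only if groupID is already
// associated with it, mirroring the integrity check described above.
func addIssueID(c *culprit, issueID, groupID string) error {
	linked := false
	for _, g := range c.AnomalyGroupIDs {
		if g == groupID {
			linked = true
			break
		}
	}
	if !linked {
		return fmt.Errorf("anomaly group %q is not linked to this culprit", groupID)
	}
	c.IssueIDs = append(c.IssueIDs, issueID)
	if c.GroupIssueMap == nil {
		c.GroupIssueMap = map[string]string{}
	}
	c.GroupIssueMap[groupID] = issueID
	return nil
}

func main() {
	c := &culprit{AnomalyGroupIDs: []string{"group-1"}}
	fmt.Println(addIssueID(c, "issue-123", "group-1"), c.GroupIssueMap)
	fmt.Println(addIssueID(c, "issue-456", "group-9"))
}
```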
## Implementation Details
- **Concurrency and Updates**: The store uses the `last_modified` field (Unix timestamp) to allow external caches or services to synchronize and identify updated culprit records efficiently.
- **Data Consistency**: The `Upsert` method performs a validation check to ensure that all commits in a single batch belong to the same repository (Host, Project, and Ref), preventing accidental cross-pollination of repository metadata.
- **JSONB Handling**: The `GroupIssueMap` is stored as `JSONB` to provide flexibility for future metadata expansion while allowing the system to retrieve the full context of a culprit's impact in a single query.
# Module: /go/culprit/sqlculpritstore/schema
### Culprit Storage Schema
The `schema` package defines the foundational data structure for persisting "Culprits" within the Perf system's SQL storage. A Culprit represents a specific commit identified as the root cause of a performance regression.
#### Design Philosophy: Beyond Single Regressions
In a performance monitoring ecosystem, a single commit might trigger multiple regressions across different subsystems or test suites. Conversely, multiple anomaly groups might eventually point to the same underlying code change.
To handle this N:M relationship, the schema is designed to treat the Culprit as a central entity that tracks its associations across various diagnostic contexts (Anomaly Groups) and tracking systems (Issue Trackers).
#### Key Components and Implementation Choices
**1. The Culprit Identity**
A Culprit is uniquely identified by its source control coordinates: `Host`, `Project`, `Ref`, and `Revision`. While the system generates a UUID for primary key lookups, the business logic primarily interacts with the commit hash.
**2. Relational Mapping and the Group-Issue Link**
The schema manages the relationship between regressions and their resolutions through three specific fields:
- `AnomalyGroupIDs`: Tracks which diagnostic clusters have flagged this commit.
- `IssueIds`: Tracks which bug reports are associated with this commit.
- `GroupIssueMap`: A `JSONB` field that explicitly maps a specific Anomaly Group to a specific Issue ID.
The inclusion of `GroupIssueMap` as a JSONB object allows the system to maintain the context of _why_ a bug was filed (i.e., which regression group triggered it) without requiring complex join tables for metadata that is frequently accessed together. Note: There is a planned refactoring to consolidate `AnomalyGroupIDs` and `IssueIds` into this map to reduce data redundancy.
**3. Performance and Indexing Strategy**
The schema implements a composite index `by_revision` to optimize for the most common query pattern: "Is this specific commit already known as a culprit?"
The ordering of the index is a deliberate choice for database performance:
`INDEX by_revision (revision, host, project, ref)`
By placing the `revision` (a high-entropy git hash) at the leading edge of the index, the storage engine can effectively distribute data across nodes and avoid "hotspots" that occur when sequential or low-entropy data (like a Host name) is used as the primary index prefix.
#### Logical Data Flow
```text
Commit Hash (Revision)
        |
        v
[ Culprit Record ] <-------------+
        |                        |
        +--[ Anomaly Group A ] --+--> [ Issue 123 ]
        |                        |
        +--[ Anomaly Group B ] --+--> [ Issue 456 ]
        |                        |
        +--[ GroupIssueMap ] ----+ (Stores the explicit links)
```
#### Schema Evolution
The schema currently supports `LastModified` as a Unix timestamp to facilitate cache invalidation and synchronization workflows, ensuring that external services can efficiently poll for updates to culprit statuses.
# Module: /go/culprit/transport
# Culprit Transport
The `culprit/transport` module provides a unified abstraction for dispatching notifications regarding identified culprits in the Skia Perf system. By decoupling the notification logic from the culprit detection engine, the system can support diverse communication channels—starting with automated issue tracking—while maintaining a consistent interface for the rest of the application.
## Design Philosophy
The module is designed around the `Transport` interface, which abstracts the "where" and "how" of message delivery.
- **Interface-Driven Delivery**: The core logic of the culprit detector does not need to know whether it is filing a bug in an issue tracker or sending an email. It simply provides a subscription configuration and the message content.
- **Context-Aware Routing**: The transport implementations use `Subscription` metadata (defined in `subscription/proto/v1`) to determine routing details like component IDs, priorities, and CC lists.
- **Reliability and Observability**: Given that notifications are critical for developer action, the transport layer includes built-in metrics to track delivery success and failure rates.
## Key Components
### Transport Interface
Defined in `transport.go`, this interface contains a single method: `SendNewNotification`. It returns a `threadingReference` (typically a bug ID or message URL) which allows the calling system to track the notification or perform follow-up actions (like posting comments on an existing thread).
### IssueTrackerTransport
The primary production implementation of the `Transport` interface. It bridges Skia Perf with the Google Issue Tracker (Buganizer).
- **Authentication**: It leverages the `secret` module to retrieve API keys and uses OAuth2 for authorized requests to the `issuetracker` service.
- **Data Transformation**: It maps subscription-level configuration (e.g., `BugComponent`, `BugPriority`, `Hotlists`) into the specific data structures required by the Issue Tracker API.
- **Validation**: It ensures that critical routing information, such as the `BugComponent`, is present before attempting to create an issue, preventing orphaned or unroutable notifications.
### NoopTransport
A "No-Operation" implementation found in `noop.go`. This is used in environments where notifications are undesirable (e.g., local development or dry-run modes). It satisfies the interface by returning a successful result without performing any network I/O or side effects.
## Workflow: Filing a Culprit Issue
The following diagram illustrates how the `IssueTrackerTransport` processes a notification request:
```
+----------------+ +------------------------+ +-------------------+
| Culprit | | IssueTrackerTransport | | Google Issue |
| Service | | | | Tracker API |
+-------+--------+ +-----------+------------+ +---------+---------+
| | |
| 1. SendNewNotification() | |
|--------------------------->| |
| (Subscription, Subj, Body) | 2. Map Proto to Issue |
| | (Priority, CCs, etc.) |
| | |
| | 3. POST /v1/issues |
| |----------------------------->|
| | |
| | 4. Return Issue ID |
| |<-----------------------------|
| | |
| 5. Increment Success Metric| |
| 6. Return Issue ID String | |
|<---------------------------| |
| | |
```
## Implementation Details
- **Metric Integration**: The `IssueTrackerTransport` maintains two counters: `perf_issue_tracker_sent_new_culprit` and `perf_issue_tracker_sent_new_culprit_fail`. These are essential for monitoring the health of the alerting pipeline.
- **Error Handling**: If an issue creation fails, the transport attempts to serialize the issue data into the error message. This provides high-fidelity debugging information in the logs, allowing developers to see exactly what payload the Issue Tracker rejected.
- **Markdown Support**: Notifications are sent with `FormattingMode: "MARKDOWN"`, allowing the culprit detector to send rich text, links, and tables to the issue tracker for better readability.
# Module: /go/culprit/transport/mocks
# Culprit Transport Mocks
The `culprit/transport/mocks` module provides a programmatic double for the `Transport` interface used within the Skia Perf culprit detection system. Its primary purpose is to facilitate unit testing of components that handle culprit notifications—such as anomaly detection engines or alert managers—without triggering actual external side effects like sending emails or filing issue tracker tickets.
## Design Philosophy
The module relies on the `stretchr/testify/mock` framework. This choice allows developers to write declarative tests that specify exactly how the notification system should be invoked. By using a mock rather than a fake or a manual stub, the system ensures that:
1. **Call Verification:** Tests can assert that a notification was sent exactly once, or not at all, preventing duplicate or missing alerts.
2. **Input Validation:** Tests can verify that the generated notification `subject` and `body` contain the expected metadata (e.g., commit hashes, regression magnitudes) before they are sent to a real user.
3. **Error Injection:** Developers can simulate transport-layer failures (e.g., API timeouts, authentication errors) to ensure the culprit detection pipeline handles notification failures gracefully.
## Key Components
### Transport
The `Transport` struct is an autogenerated mock implementation. It mirrors the methods required to dispatch culprit information to various communication channels.
- **SendNewNotification**: This is the core functional hook. In a production environment, this method would interface with external APIs (defined by the `Subscription` proto). In this mock implementation, it captures the `context.Context`, the `Subscription` configuration, and the message content. It returns a mockable string (typically representing a message ID or URL) and an error.
## Usage Workflow
The mock is designed to be integrated into Go tests via the `NewTransport` constructor, which automatically handles test cleanup and expectation assertions.
```
+------------------+ +----------------------+ +------------------+
| Unit Test | | Component Under | | Mocks Transport |
| (Logic/Policy) | | Test | | (This Module) |
+---------+--------+ +----------+-----------+ +---------+--------+
| | |
| 1. Setup Expectations | |
|------------------------------>| |
| (Expect SendNewNotification) | |
| | |
| 2. Execute Action | |
|------------------------------>| 3. Trigger Notification |
| |---------------------------->|
| | | 4. Record Call
| | | 5. Return Mock
| | <---------------------------| Values
| | |
| 6. Assertions (Auto-Cleanup) | |
| <-----------------------------| |
```
## Implementation Details
The implementation of `SendNewNotification` uses type assertions on the registered return values to support both static and dynamic results. It can return static values configured via `.Return()`, or, when a function with a matching signature is passed to `.Return()`, call that function with the actual arguments to compute the result. This is particularly useful when the "Message ID" returned by the transport needs to be used in subsequent logic within the test case.
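The type-assertion pattern can be illustrated without any mocking library. The `mockTransport` type and its `Return` method below are hypothetical simplifications written for this sketch; they imitate the shape of generated mock code but are not the generated file itself.

```go
package main

import (
	"errors"
	"fmt"
)

// mockTransport stores whatever the test registered as return values.
type mockTransport struct{ ret []interface{} }

// Return registers static values or functions that compute them.
func (m *mockTransport) Return(values ...interface{}) *mockTransport {
	m.ret = values
	return m
}

// SendNewNotification inspects the registered values with type assertions.
func (m *mockTransport) SendNewNotification(subject, body string) (string, error) {
	if len(m.ret) == 0 {
		panic("no return value specified for SendNewNotification")
	}
	// Case 1: one function with the full signature computes both results.
	if rf, ok := m.ret[0].(func(string, string) (string, error)); ok {
		return rf(subject, body)
	}
	// Case 2: the first value is either a function or a static string.
	var r0 string
	if rf, ok := m.ret[0].(func(string, string) string); ok {
		r0 = rf(subject, body)
	} else {
		r0 = m.ret[0].(string)
	}
	// The second value, when present and non-nil, is a static error.
	var r1 error
	if len(m.ret) > 1 && m.ret[1] != nil {
		r1 = m.ret[1].(error)
	}
	return r0, r1
}

func main() {
	m := &mockTransport{}

	m.Return("issue-1", nil)
	fmt.Println(m.SendNewNotification("subject", "body"))

	m.Return(func(subject, body string) string { return "id-for-" + subject }, nil)
	fmt.Println(m.SendNewNotification("regression", "body"))

	m.Return("", errors.New("tracker unavailable"))
	fmt.Println(m.SendNewNotification("subject", "body"))
}
```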
# Module: /go/dataframe
### Overview
The `dataframe` module provides a structured, table-like representation of performance measurement data, specifically optimized for the Skia Perf ecosystem. A `DataFrame` combines a set of time-series traces (`TraceSet`) with their corresponding commit metadata (`ColumnHeader`) and a calculated set of searchable attributes (`ParamSet`).
In the context of Perf, a "Trace" is a series of measurements associated with a unique key (a set of key-value pairs). The `DataFrame` organizes these traces so they can be visualized or analyzed over a common timeline of git commits.
### Design and Implementation Choices
#### The "Why" Behind the DataFrame Structure
Unlike a simple collection of data points, a `DataFrame` represents a cohesive "slice" of performance history. The design choice to include `Header`, `TraceSet`, and `ParamSet` in a single object is driven by the need for self-contained data:
- **TraceSet:** Stores the raw numerical data.
- **Header:** Maps each index in a trace to specific commit information (hash, author, timestamp). This decoupling allows traces to be represented as simple arrays (`[]float32`) while still being linked to rich git history.
- **ParamSet:** A computed summary of all keys present in the `TraceSet`. This is maintained within the object to allow the UI to quickly provide filtering options based only on the data currently loaded.
#### Column-Oriented Data Management
The module treats columns as discrete points in time (commits). Operations like `MergeColumnHeaders` and `Join` are implemented to handle the "sparse" nature of performance data, where different traces might have data for different sets of commits.
```text
Trace Key A: [ 1.2,      nil,      1.4 ]  (Commits 1, 2, 3)
Trace Key B: [ nil,      2.2,      2.4 ]  (Commits 1, 2, 3)
                |         |         |
            Header[0] Header[1] Header[2]
```
#### Memory and Performance Optimization
- **Compression:** The `Compress()` method identifies and removes columns that contain no data across all traces. This is vital for reducing the payload size when sending data to a frontend, especially after a query returns a range where many commits might not have produced results for the requested traces.
- **Slicing:** The `Slice()` method enables efficient pagination or windowing of data by creating sub-frames.
- **Sentinel Values:** The module uses `vec32.MissingDataSentinel` to represent gaps in data, ensuring that trace arrays remain a fixed length relative to the `Header` while explicitly marking missing measurements.
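The compression idea can be sketched as follows. This is a simplified illustration, not the module's code: the `frame` type is hypothetical, commit numbers stand in for full `ColumnHeader` values, and the `missing` constant is a local stand-in for `vec32.MissingDataSentinel`.

```go
package main

import "fmt"

// missing is a local stand-in for vec32.MissingDataSentinel; the value is
// chosen for illustration.
const missing float32 = 1e32

// frame is a hypothetical, reduced DataFrame: one commit number per column
// and one value per column for every trace.
type frame struct {
	Header []int
	Traces map[string][]float32
}

// compress drops every column in which all traces hold the sentinel.
func compress(f frame) frame {
	// Collect the indices of columns that have at least one real value.
	var keep []int
	for col := range f.Header {
		for _, tr := range f.Traces {
			if tr[col] != missing {
				keep = append(keep, col)
				break
			}
		}
	}
	out := frame{Header: make([]int, 0, len(keep)), Traces: map[string][]float32{}}
	for _, col := range keep {
		out.Header = append(out.Header, f.Header[col])
	}
	for key, tr := range f.Traces {
		nt := make([]float32, 0, len(keep))
		for _, col := range keep {
			nt = append(nt, tr[col])
		}
		out.Traces[key] = nt
	}
	return out
}

func main() {
	f := frame{
		Header: []int{1, 2, 3},
		Traces: map[string][]float32{
			"a": {1.2, missing, 1.4},
			"b": {missing, missing, 2.4},
		},
	}
	out := compress(f)
	fmt.Println(out.Header, len(out.Traces["a"]), len(out.Traces["b"]))
}
```

Traces keep the same length as the header after compression, so index `i` in any trace still corresponds to `Header[i]`.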
### Key Components and Responsibilities
#### DataFrameBuilder Interface
The `DataFrameBuilder` defines how data frames are constructed from underlying storage. It abstracts the complexity of querying the database and joining it with git metadata.
- **Query-based fetching:** `NewFromQueryAndRange` handles fetching data matching specific attributes (e.g., "arch=x86") over a time window.
- **Key-based fetching:** `NewFromKeysAndRange` is used when specific trace IDs are already known.
- **N-point fetching:** Methods like `NewNFromQuery` are designed for "overview" or "sparkline" views, where the user wants exactly N points of history leading up to a specific time.
#### Join and Merge Logic
The `Join` and `MergeColumnHeaders` functions are the core of the module's data-alignment logic. They perform an "outer join" on the commit offsets.
1. **Header Merging:** It identifies the unique union of all commits from two sources, sorted by their commit number/offset.
2. **Trace Alignment:** It maps the data points from the original traces into the new, larger indices of the merged header, filling gaps with the missing data sentinel.
#### ParamSet Calculation
The `BuildParamSet()` method is responsible for reflecting the current state of the data. If traces are filtered out (via `FilterOut`), the `ParamSet` must be rebuilt so the UI doesn't display filtering options for data that is no longer present in the frame.
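A minimal sketch of rebuilding a `ParamSet` from whatever traces remain. It assumes, for simplicity, that each trace's key has already been parsed into a `map[string]string`; the function name and data shapes are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// buildParamSet collects, for every key, the sorted set of values that
// appear in the given traces. Traces are represented by their parsed
// key-value pairs.
func buildParamSet(traces map[string]map[string]string) map[string][]string {
	seen := map[string]map[string]bool{}
	for _, params := range traces {
		for k, v := range params {
			if seen[k] == nil {
				seen[k] = map[string]bool{}
			}
			seen[k][v] = true
		}
	}
	ps := map[string][]string{}
	for k, vals := range seen {
		for v := range vals {
			ps[k] = append(ps[k], v)
		}
		sort.Strings(ps[k])
	}
	return ps
}

func main() {
	traces := map[string]map[string]string{
		"t1": {"arch": "x86", "config": "8888"},
		"t2": {"arch": "arm", "config": "8888"},
	}
	fmt.Println(buildParamSet(traces))

	// After filtering a trace out, the set must be rebuilt so that stale
	// options such as arch=arm disappear.
	delete(traces, "t2")
	fmt.Println(buildParamSet(traces))
}
```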
### Data Merging Workflow
When joining two DataFrames (A and B) that represent different time ranges or different sets of traces:
```text
DataFrame A Headers: [C1, C2, C4]
DataFrame B Headers: [C3, C4, C5]
1. Merge Headers -> [C1, C2, C3, C4, C5]
2. Map A indices -> 0->0, 1->1, 2->3
3. Map B indices -> 0->2, 1->3, 2->4
4. Resulting Trace for Key X:
[ValA(C1), ValA(C2), ValB(C3), ValA/B(C4), ValB(C5)]
```
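The index mapping in the workflow above can be reproduced with a short sketch. It is an illustration under assumptions: headers are reduced to commit numbers, `missing` is a local stand-in for the sentinel, and when both frames hold a value for the same commit this sketch lets the second frame win, which may differ from the real `Join`.

```go
package main

import (
	"fmt"
	"sort"
)

// missing is a local stand-in for the missing-data sentinel.
const missing float32 = 1e32

// mergeHeaders returns the sorted union of two commit lists, plus maps from
// each input index to its index in the merged header.
func mergeHeaders(a, b []int) (merged []int, aMap, bMap map[int]int) {
	pos := map[int]int{}
	for _, c := range a {
		pos[c] = 0
	}
	for _, c := range b {
		pos[c] = 0
	}
	for c := range pos {
		merged = append(merged, c)
	}
	sort.Ints(merged)
	for i, c := range merged {
		pos[c] = i
	}
	aMap, bMap = map[int]int{}, map[int]int{}
	for i, c := range a {
		aMap[i] = pos[c]
	}
	for i, c := range b {
		bMap[i] = pos[c]
	}
	return merged, aMap, bMap
}

// joinTrace places both traces into a new trace of the merged width,
// leaving the sentinel wherever neither input has a value.
func joinTrace(width int, a []float32, aMap map[int]int, b []float32, bMap map[int]int) []float32 {
	out := make([]float32, width)
	for i := range out {
		out[i] = missing
	}
	for i, v := range a {
		if v != missing {
			out[aMap[i]] = v
		}
	}
	for i, v := range b {
		if v != missing {
			out[bMap[i]] = v // assumption: the second frame wins on overlap
		}
	}
	return out
}

func main() {
	merged, aMap, bMap := mergeHeaders([]int{1, 2, 4}, []int{3, 4, 5})
	fmt.Println(merged, aMap, bMap)
	a := []float32{10, 20, 40}
	b := []float32{30, missing, 50}
	fmt.Println(joinTrace(len(merged), a, aMap, b, bMap))
}
```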
### Key Files
- **`dataframe.go`**: Defines the primary `DataFrame` and `ColumnHeader` structs and implements the logic for merging, joining, and filtering data.
- **`dataframe_test.go`**: Contains tests that validate the complex index-mapping required during joins and ensure that `ParamSet` calculations correctly reflect the trace data.
- **`mocks/`**: Provides a mock implementation of the `DataFrameBuilder` for testing higher-level components (like the Perf API handlers) without requiring a database.
# Module: /go/dataframe/mocks
# DataFrame Mocks
The `/go/dataframe/mocks` module provides auto-generated mock implementations for the `DataFrameBuilder` interface. These mocks are primarily used to facilitate unit testing in the Perf system by simulating complex data retrieval and frame construction processes without requiring a live database or the actual heavy-duty `dataframe` implementation.
## Design and Purpose
The core of this module is the `DataFrameBuilder` mock, which is generated using `mockery`. The decision to provide these mocks in a dedicated sub-package allows other modules within the Skia infrastructure to write deterministic tests for components that depend on data loading, such as UI handlers, alert systems, or analysis pipelines.
By using these mocks, developers can:
- **Simulate Data Latency:** Test how the system handles long-running data fetches by controlling the mock's response time.
- **Inject Edge Cases:** Easily return empty DataFrames, specific error conditions, or DataFrames with unusual shapes (e.g., mismatched trace lengths) that might be difficult to reproduce with real data.
- **Verify Query Logic:** Ensure that the calling code is passing the correct `query.Query` objects or time ranges to the builder.
## Key Components
### DataFrameBuilder.go
This file contains the `DataFrameBuilder` struct, which embeds `mock.Mock` from the `testify` framework. It implements the `dataframe.DataFrameBuilder` interface, covering several data retrieval patterns:
- **Query-Based Construction:** Methods like `NewFromQueryAndRange` and `NewNFromQuery` allow tests to simulate fetching data based on structured queries.
- **Key-Based Construction:** Methods like `NewFromKeysAndRange` and `NewNFromKeys` simulate fetching specific traces when the exact keys are already known.
- **Metadata Exploration:** `NumMatches` and `PreflightQuery` allow testing of the "dry run" or "count" functionality often used in the Perf UI to tell a user how many traces a query will return before they execute it.
## Typical Workflow in Tests
The mock is designed to be integrated into Go tests using the `testify` pattern.
1. **Initialization:** Create the mock using `NewDataFrameBuilder(t)`. This automatically registers cleanup functions to assert that all expected calls were actually made.
2. **Expectation Setting:** Define what the mock should return when specific methods are called.
3. **Injection:** Pass the mock into the component being tested (which should accept the `dataframe.DataFrameBuilder` interface).
4. **Verification:** The `testify` framework handles the verification of calls during the test's cleanup phase.
```text
+-------------------+ +-----------------------+ +-------------------------+
| Unit Test | ----> | MockDataFrameBuilder | ----> | Component Under Test |
+---------+---------+ +-----------+-----------+ +------------+------------+
| | |
| 1. Setup Expectations | |
|---------------------------->| |
| | |
| 2. Execute Action | |
|--------------------------------------------------------->|
| | |
| | 3. Call Interface Method |
| |<---------------------------|
| | |
| | 4. Return Mock Data |
| |--------------------------->|
| | |
| 5. Assertions/Cleanup | |
|<----------------------------| |
```
## Implementation Details
The implementation uses `mockery`'s standard template, providing flexible return value handling. For every method, it checks if a functional return has been provided (allowing for dynamic logic in mocks) or if a static value was registered via `.Return()`.
Special attention is given to the `progress.Progress` interface, which is passed to most builder methods. The mock allows testers to verify that progress is being tracked or to ignore it using `mock.Anything`.
# Module: /go/dfbuilder
# dfbuilder
The `dfbuilder` module is responsible for constructing `DataFrame` objects by querying and aggregating performance trace data from a `TraceStore`. It acts as the orchestration layer that translates high-level user requests (time ranges, queries, specific keys) into efficient, often parallelized, database operations.
## Overview
A `DataFrame` in the Perf system is a matrix of performance data where columns represent commits (ordered by time) and rows represent individual traces (identified by structured keys). The `dfbuilder` handles the complexity of:
1. **Commit Mapping**: Resolving time ranges or counts into specific commit numbers using Git.
2. **Tile-Based Retrieval**: Interfacing with the `TraceStore`'s tiled architecture to fetch data efficiently.
3. **Aggregation**: Merging data from multiple tiles into a single coherent matrix.
4. **Optimization**: Using caching and parallel "pre-flight" queries to improve UI responsiveness.
## Design Decisions
### Tiled Parallelism
The `TraceStore` stores data in fixed-size "tiles" (e.g., 256 commits per tile). When a user requests a large time range, the `dfbuilder` calculates which tiles are involved and launches parallel goroutines to query each tile simultaneously. This avoids serial bottlenecks and utilizes the horizontal scalability of the underlying database.
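The tile arithmetic and fan-out can be sketched as below. This is a simplified illustration: the function names are hypothetical, the tile size of 256 is taken from the example above, and a `sync.WaitGroup` is used here in place of whatever concurrency helper the real builder uses.

```go
package main

import (
	"fmt"
	"sync"
)

// tileSize is the number of commits per tile used in this sketch.
const tileSize = 256

// tilesForRange lists the tile numbers covering the inclusive commit number
// range [begin, end].
func tilesForRange(begin, end int) []int {
	var tiles []int
	for t := begin / tileSize; t <= end/tileSize; t++ {
		tiles = append(tiles, t)
	}
	return tiles
}

// queryTiles runs query once per tile, each in its own goroutine, and
// gathers the results keyed by tile number.
func queryTiles(tiles []int, query func(tile int) []float32) map[int][]float32 {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out = map[int][]float32{}
	)
	for _, t := range tiles {
		wg.Add(1)
		go func(tile int) {
			defer wg.Done()
			res := query(tile)
			mu.Lock() // the result map is shared across goroutines
			out[tile] = res
			mu.Unlock()
		}(t)
	}
	wg.Wait()
	return out
}

func main() {
	tiles := tilesForRange(250, 600)
	results := queryTiles(tiles, func(tile int) []float32 {
		// A real query would hit the database; this one fabricates a value.
		return []float32{float32(tile)}
	})
	fmt.Println(tiles, len(results))
}
```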
### Backward Search (NewN...)
Many UI views request the "N most recent points." Because performance data can be sparse (not every trace has data for every commit), the `dfbuilder` implements a backward-searching algorithm. It starts from the most recent commit and steps backward through tiles until it has collected exactly N data points for the requested traces, or until it hits a configurable `maxEmptyTiles` limit.
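A minimal sketch of the backward search for a single trace. The `lastN` function, the `fetch` callback, and the rule that the empty-tile counter resets after a non-empty tile are all assumptions made for this illustration; `missing` is a local stand-in for the sentinel.

```go
package main

import "fmt"

// missing is a local stand-in for the missing-data sentinel.
const missing float32 = 1e32

// lastN walks tiles from latestTile toward tile 0, collecting the most
// recent n real values. It gives up after maxEmptyTiles consecutive tiles
// without data. The result is returned in chronological order.
func lastN(fetch func(tile int) []float32, latestTile, n, maxEmptyTiles int) []float32 {
	var rev []float32 // newest value first
	empty := 0
	for tile := latestTile; tile >= 0 && len(rev) < n && empty < maxEmptyTiles; tile-- {
		found := false
		vals := fetch(tile)
		for i := len(vals) - 1; i >= 0 && len(rev) < n; i-- {
			if vals[i] != missing {
				rev = append(rev, vals[i])
				found = true
			}
		}
		if found {
			empty = 0 // assumption: only consecutive empty tiles count
		} else {
			empty++
		}
	}
	// Reverse into oldest-first order.
	for i, j := 0, len(rev)-1; i < j; i, j = i+1, j-1 {
		rev[i], rev[j] = rev[j], rev[i]
	}
	return rev
}

func main() {
	tiles := map[int][]float32{
		0: {1},
		1: {4, missing, 5},
		2: {},
		3: {missing, 7, 8},
	}
	fetch := func(t int) []float32 { return tiles[t] }
	fmt.Println(lastN(fetch, 3, 4, 2))
	fmt.Println(lastN(fetch, 3, 10, 1))
}
```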
### Pre-flight Query Logic
To provide a responsive "Query" UI, the `dfbuilder` performs "pre-flight" queries. Instead of fetching all raw data, it:
- Calculates how many traces match a partial query.
- Dynamically builds a `ParamSet` of valid options for the next dropdown based on current selections.
- **Sub-querying**: It can optionally remove one key from a multi-key query to find all possible values for _that specific key_ that would still result in a valid trace when combined with the other keys.
## Key Workflows
### Constructing a DataFrame from a Query and Range
When a user requests data for a specific time range and a trace query:
```text
User Request (Range, Query)
          |
          v
    [Git Service] <---- Resolve time range to Commit Numbers/Headers
          |
          v
     [DFBuilder] <---- Calculate required Tile Numbers
          |
   +------+------+ (Parallel Tile Queries)
   |             |
[Tile N]    [Tile N-1]
   |             |
   +------+------+
          |
          v
  [TraceSetBuilder] <--- Merge results into a matrix
          |
          v
     [DataFrame] (Compressed & Filtered)
```
### Parent Trace Filtering
The module includes logic to filter out "parent" traces. In Perf, traces often have a hierarchical structure (e.g., `benchmark`, `bot`, `test`, `subtest`). If a specific `subtest` trace exists, the higher-level "parent" trace (which might be an average or aggregate) is often redundant in the same view. The `filterParentTraces` function uses a `TraceFilter` to prune the `TraceSet` to only include the most specific (leaf) nodes.
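One way to picture leaf-only filtering is the sketch below. It is an assumption-heavy illustration rather than the logic of `TraceFilter`: each trace is modeled as a hierarchical path, and a trace counts as a "parent" when its path is a strict prefix of another trace's path.

```go
package main

import (
	"fmt"
	"strings"
)

// sep joins path segments; a NUL byte avoids collisions with real names.
const sep = "\x00"

// leafTraces keeps only the paths that are not a strict prefix of any other
// path, preserving input order.
func leafTraces(paths [][]string) [][]string {
	// Record every strict prefix of every path as a parent.
	parents := map[string]bool{}
	for _, p := range paths {
		for i := 1; i < len(p); i++ {
			parents[strings.Join(p[:i], sep)] = true
		}
	}
	var leaves [][]string
	for _, p := range paths {
		if !parents[strings.Join(p, sep)] {
			leaves = append(leaves, p)
		}
	}
	return leaves
}

func main() {
	paths := [][]string{
		{"motion_mark", "pixel_6", "canvas"},          // parent of the next one
		{"motion_mark", "pixel_6", "canvas", "lines"}, // leaf
		{"motion_mark", "pixel_6", "images"},          // leaf
	}
	for _, p := range leafTraces(paths) {
		fmt.Println(strings.Join(p, "/"))
	}
}
```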
## Key Components
### `builder`
The primary implementation of the `dataframe.DataFrameBuilder` interface. It maintains references to:
- `perfgit.Git`: For commit metadata.
- `tracestore.TraceStore`: For raw data access.
- `tracecache.TraceCache`: An optional caching layer to speed up trace ID lookups.
### `preflightProcessRecentTiles`
Handles the logic for scanning the most recent data tiles to populate the query UI. It uses an `errgroup` to query multiple tiles in parallel, ensuring that the "count" and "available parameters" are calculated quickly even if the latest tile is partially empty.
### `fromIndexRange`
A utility that bridges the gap between Git and the DataFrame. It converts a range of commit numbers into `ColumnHeader` objects containing the Git hash, author, and timestamp required for the DataFrame header.
## Implementation Details
- **Timeouts**: Individual tile queries are protected by `singleTileQueryTimeout` (default 1 minute). This prevents a single "poison" tile or a massive ingestion spike from locking up the server during regression detection.
- **Sparse Data Handling**: The builder uses a `vec32.MissingDataSentinel` to represent gaps in traces where no data was recorded for a specific commit, ensuring the matrix alignment remains consistent across all traces.
- **Trace Source Info**: Beyond raw values, the builder tracks `SourceInfo`, which points back to the original files (e.g., Google Cloud Storage paths) from which the data was ingested, allowing for "drill-down" features in the UI.
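The per-tile timeout can be sketched with `context.WithTimeout`. The helper names below (`queryTileWithTimeout`, `slowQuery`, `demo`) are hypothetical, and the durations are shortened so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// queryTileWithTimeout runs query under its own deadline so that one
// slow tile cannot stall the caller indefinitely.
func queryTileWithTimeout(parent context.Context, timeout time.Duration, query func(context.Context) (int, error)) (int, error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return query(ctx)
}

// slowQuery is a hypothetical tile query that takes delay to finish
// unless its context is cancelled first.
func slowQuery(delay time.Duration) func(context.Context) (int, error) {
	return func(ctx context.Context) (int, error) {
		select {
		case <-time.After(delay):
			return 42, nil
		case <-ctx.Done():
			return 0, ctx.Err()
		}
	}
}

// demo runs one fast and one slow query and reports the fast result
// and whether the slow query hit its deadline.
func demo() (int, bool) {
	ctx := context.Background()
	fast, _ := queryTileWithTimeout(ctx, time.Second, slowQuery(time.Millisecond))
	_, err := queryTileWithTimeout(ctx, 10*time.Millisecond, slowQuery(time.Second))
	return fast, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fast, timedOut := demo()
	fmt.Println(fast, timedOut)
}
```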
# Module: /go/dfiter
The `dfiter` module provides a high-level abstraction for iterating over performance data stored in `DataFrames`. Its primary responsibility is to transform raw, potentially sparse data retrieved from a `TraceStore` into a series of smaller, dense "windows" suitable for regression detection algorithms.
### High-Level Overview
In the Skia Perf ecosystem, regression detection involves analyzing traces over time to find "steps" or shifts in performance. Because these algorithms typically operate on a fixed-sized window of points (defined by an `Alert.Radius`), the `dfiter` module acts as a bridge. It manages the complexity of querying the underlying data builders and then "slices" that data into the specific shapes required by the detection logic.
### Design Decisions and Key Components
#### 1. The Iterator Pattern (`dfiter.go`)
The module centers around the `DataFrameIterator` interface. This design allows the regression detection engine to remain agnostic of whether it is processing a single specific commit or scanning a wide range of history.
- **Exact Point Requests:** When a `Domain.Offset` is provided, the iterator returns a single `DataFrame` centered on that specific commit.
- **Range Requests:** When `Offset` is zero, the iterator behaves as a sliding window over a larger dataset.
#### 2. Caching and Concurrency (`dfIterProvider.go`)
Dataframe generation is an expensive operation involving database lookups and data processing. To optimize this, the `DfProvider` implements:
- **In-Memory Caching:** Stores recently built DataFrames keyed by a combination of the query, the end time, and the number of points requested.
- **SingleFlight Grouping:** Uses `golang.org/x/sync/singleflight` to prevent "thundering herd" problems. If multiple concurrent requests ask for the same DataFrame (e.g., several different regression tasks triggered by the same alert), only one builder execution occurs; the result is then shared among all callers.
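The interaction between the cache and request de-duplication can be sketched with the standard library alone. This is a simplified stand-in for the `singleflight` package, not the `DfProvider` implementation; `provider`, `cacheKey`, and `demo` are hypothetical names, and a plain string stands in for a DataFrame.

```go
package main

import (
	"fmt"
	"sync"
)

// call tracks one in-flight build for a key.
type call struct {
	wg  sync.WaitGroup
	val string
}

// provider combines an in-memory cache with de-duplication of
// concurrent builds that share a key.
type provider struct {
	mu       sync.Mutex
	cache    map[string]string
	inflight map[string]*call
	builds   int
}

func newProvider() *provider {
	return &provider{cache: map[string]string{}, inflight: map[string]*call{}}
}

// cacheKey combines the query, the end time, and the number of points.
func cacheKey(query string, endUnix int64, numPoints int) string {
	return fmt.Sprintf("%s|%d|%d", query, endUnix, numPoints)
}

// get returns the cached value, waits for an in-flight build, or runs
// build itself, so build executes at most once per key.
func (p *provider) get(key string, build func() string) string {
	p.mu.Lock()
	if v, ok := p.cache[key]; ok {
		p.mu.Unlock()
		return v
	}
	if c, ok := p.inflight[key]; ok {
		p.mu.Unlock()
		c.wg.Wait() // Another caller is building; share its result.
		return c.val
	}
	c := &call{}
	c.wg.Add(1)
	p.inflight[key] = c
	p.builds++
	p.mu.Unlock()

	c.val = build()

	p.mu.Lock()
	p.cache[key] = c.val
	delete(p.inflight, key)
	p.mu.Unlock()
	c.wg.Done()
	return c.val
}

// demo issues concurrent requests for one key and reports how many
// builds ran and whether every caller saw the same result.
func demo(callers int) (int, bool) {
	p := newProvider()
	key := cacheKey("bot=pixel_6", 1700000000, 50)
	results := make([]string, callers)
	var wg sync.WaitGroup
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = p.get(key, func() string { return "dataframe for " + key })
		}(i)
	}
	wg.Wait()
	same := true
	for _, r := range results {
		if r != results[0] {
			same = false
		}
	}
	return p.builds, same
}

func main() {
	fmt.Println(demo(8))
}
```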
#### 3. Trace Slicing Strategies (`traceSlicer.go`)
The module handles data differently depending on the regression algorithm being used. The choice of slicer is controlled by the `DfIterTraceSlicer` experiment flag and the `Alert.Algo` type.
- **K-Means Slicing (`kmeansDataframeSlicer`):**
This is the legacy approach. It treats the entire `DataFrame` as a unit, creating a sliding window across all traces simultaneously.
```text
Original DataFrame: [C1, C2, C3, C4, C5] (Radius=1, WindowSize=3)
Iter 1: [C1, C2, C3]
Iter 2: [C2, C3, C4]
Iter 3: [C3, C4, C5]
```
- **StepFit Slicing (`stepFitDfTraceSlicer`):**
This strategy is optimized for individual trace analysis. It iterates through the `DataFrame` trace-by-trace rather than column-by-column.
- **Data Densification:** A key feature is that it filters out `MissingDataSentinel` values. If a trace is sparse, the slicer collapses it into a dense array of valid points before applying the window. This ensures the regression algorithm always sees a full set of real data points, even if they were originally non-contiguous in time.
### Key Workflows
#### Creating an Iterator
When `NewDataFrameIterator` is called, the following logic determines the data source:
```text
User/System Request
|
v
Check Domain.Offset?
|
+--- [!= 0] ---> Find specific Commit Time -> Fetch exactly 2*Radius+1 points
| |
+--- [== 0] ---> Check DfProvider Cache <------+
| |
+--- [Hit] ---> Return Cached DF
| |
+--- [Miss] ---> Call DataFrameBuilder -> Cache Result
|
v
Select Slicer Implementation
(K-Means vs. StepFit)
|
v
Return DataFrameIterator
```
#### Iterating with StepFit
The `stepFitDfTraceSlicer` provides a more granular iteration than traditional time-based slicing:
```text
Trace A: [1.1, nan, 1.2, 1.3, nan, 1.4]
Trace B: [5.0, 5.1, 5.2, 5.3, 5.4, 5.5]
StepFit Filtering (Radius 1):
1. Collapse Trace A -> [1.1, 1.2, 1.3, 1.4]
2. Slice A (Win 1) -> [1.1, 1.2, 1.3]
3. Slice A (Win 2) -> [1.2, 1.3, 1.4]
4. Move to Trace B...
5. Slice B (Win 1) -> [5.0, 5.1, 5.2]
...and so on.
```
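The collapse-then-slice steps shown above can be reproduced with a short sketch. The `missing` constant is a local stand-in for `vec32.MissingDataSentinel` (its value here is an assumption), and `densify` and `windows` are hypothetical helper names.

```go
package main

import "fmt"

// missing is a local stand-in for vec32.MissingDataSentinel; the value
// is an assumption made for this sketch.
const missing float32 = 1e32

// densify drops every missing value, leaving only real measurements.
func densify(trace []float32) []float32 {
	dense := []float32{}
	for _, v := range trace {
		if v != missing {
			dense = append(dense, v)
		}
	}
	return dense
}

// windows returns every contiguous slice of length 2*radius+1.
func windows(values []float32, radius int) [][]float32 {
	size := 2*radius + 1
	out := [][]float32{}
	for i := 0; i+size <= len(values); i++ {
		out = append(out, values[i:i+size])
	}
	return out
}

func main() {
	traceA := []float32{1.1, missing, 1.2, 1.3, missing, 1.4}
	for _, w := range windows(densify(traceA), 1) {
		fmt.Println(w)
	}
}
```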
### Key Files
- **`dfiter.go`**: Entry point for creating iterators; handles "settling time" logic and metadata metrics.
- **`dfIterProvider.go`**: Implements the caching layer and concurrency controls.
- **`traceSlicer.go`**: Contains the logic for the different sliding window implementations.
- **`traceSlicer_test.go`**: Provides extensive examples of how missing data is handled during slicing.
# Module: /go/dryrun
The `dryrun` module provides the functionality to test Perf Alerts against historical data. This allows users to validate and tune alert configurations by observing which regressions would have been detected had the alert been active over a specific range of commits.
### Core Logic and Design
The primary purpose of a dry run is to simulate the regression detection pipeline without triggering side effects like filing bug reports or sending notifications.
The module is designed around an asynchronous execution model. Because scanning historical data for regressions can be a long-running process, the module uses a "tracker" pattern. When a dry run is initiated, it returns a progress handle immediately, while the actual computation continues in a background goroutine.
#### Key Components
- **`Requests` struct**: The central coordinator that manages dependencies required for regression detection, including access to Git data, trace shortcuts, and data frame builders.
- **`StartHandler`**: The entry point for HTTP requests. It decodes a `RegressionDetectionRequest`, validates the alert configuration, and kicks off the background processing.
- **`detectorResponseProcessor`**: A specialized callback function defined within the start handler. It acts as the glue between the raw clustering results and the user-facing progress updates. Its responsibilities include:
  - Converting raw cluster responses into formal `Regression` objects.
  - Merging multiple regressions found for the same commit into a single entry.
  - Enriching commit identifiers with full metadata (author, message, timestamp) from Git.
  - Updating the `progress.Tracker` so the UI can display real-time results.
### Design Choice: Result Merging
A single dry run may execute multiple queries (e.g., if the alert uses a "group by" clause). This can result in multiple detections for the same commit across different sub-queries. The implementation uses a map (`foundRegressions`) keyed by `CommitNumber` to aggregate these results. As the detector finds new clusters, it merges them into the existing regression record for that commit, ensuring the user sees a consolidated view of all issues found at a specific point in time.
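A minimal sketch of this merge-by-commit pattern follows. The `cluster`, `regression`, and `merger` types are hypothetical simplifications; only the idea of a map keyed by commit number comes from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// commitNumber is a local stand-in for Perf's CommitNumber type.
type commitNumber int

// cluster is a hypothetical detection result from one sub-query.
type cluster struct {
	Commit commitNumber
	Query  string
}

// regression aggregates every cluster found at one commit.
type regression struct {
	Commit   commitNumber
	Clusters []cluster
}

// merger accumulates detections keyed by commit, in the spirit of the
// foundRegressions map.
type merger struct {
	found map[commitNumber]*regression
}

func newMerger() *merger {
	return &merger{found: map[commitNumber]*regression{}}
}

// add merges c into the record for its commit, creating it if needed.
func (m *merger) add(c cluster) {
	r, ok := m.found[c.Commit]
	if !ok {
		r = &regression{Commit: c.Commit}
		m.found[c.Commit] = r
	}
	r.Clusters = append(r.Clusters, c)
}

// sorted returns the merged regressions in ascending commit order.
func (m *merger) sorted() []*regression {
	out := make([]*regression, 0, len(m.found))
	for _, r := range m.found {
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Commit < out[j].Commit })
	return out
}

func main() {
	m := newMerger()
	m.add(cluster{Commit: 100, Query: "bot=pixel_6"})
	m.add(cluster{Commit: 105, Query: "bot=pixel_6"})
	m.add(cluster{Commit: 100, Query: "bot=pixel_7"})
	for _, r := range m.sorted() {
		fmt.Println(r.Commit, len(r.Clusters))
	}
}
```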
### Dry Run Workflow
The following diagram illustrates the lifecycle of a dry run request:
```text
User Request (POST)
|
v
[StartHandler] ------------------------+
| |
| 1. Validate Alert Config | 2. Return Progress ID
| 3. Add to Progress Tracker | to Frontend immediately
| 4. Launch Goroutine |
| +-----> [HTTP Response (JSON)]
v
[Background Goroutine]
|
| calls regression.ProcessRegressions(...)
|
+-----> [detectorResponseProcessor] (Callback)
|
| A. Convert cluster results to Regression objects
| B. Lookup Git commit details
| C. Merge results for same commits
| D. Update Tracker with current findings
|
v
[Progress Tracker] <------- (Frontend polls this)
```
### Implementation Details
- **Asynchrony**: The module explicitly uses `context.Background()` for the background goroutine instead of the request context (`r.Context()`). This prevents the dry run from being cancelled when the user's initial HTTP request terminates.
- **Data Enrichment**: The `RegressionAtCommit` struct is used to package the raw regression data with the `provider.Commit` metadata. This ensures the frontend has all the information necessary to display a human-readable list of results without performing additional lookups.
- **Error Handling**: Errors encountered during the background process are piped into the `req.Progress` object. This allows the system to report failures (like invalid queries or database timeouts) back to the user through the progress polling mechanism.
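The asynchronous pattern described in these points can be sketched as below. The `progress` type, `startDryRun`, and `runDemo` are hypothetical and far simpler than the real `progress.Tracker`; the sketch only shows a background goroutine that uses `context.Background()` and reports results or errors through a pollable object.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// progress is a hypothetical, minimal stand-in for a pollable tracker entry.
type progress struct {
	mu      sync.Mutex
	status  string // "Running", "Finished", or "Error"
	results []string
	errMsg  string
}

// snapshot returns what a polling frontend would see right now.
func (p *progress) snapshot() (string, int, string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.status, len(p.results), p.errMsg
}

// startDryRun returns immediately. The work continues in a goroutine
// that uses context.Background() rather than a request-scoped context,
// so it survives the end of the initiating HTTP request.
func startDryRun(commits []int, detect func(context.Context, int) (string, error)) (*progress, <-chan struct{}) {
	p := &progress{status: "Running"}
	done := make(chan struct{})
	go func() {
		defer close(done)
		ctx := context.Background()
		for _, c := range commits {
			res, err := detect(ctx, c)
			p.mu.Lock()
			if err != nil {
				// Failures are reported through the progress object.
				p.status, p.errMsg = "Error", err.Error()
				p.mu.Unlock()
				return
			}
			p.results = append(p.results, res)
			p.mu.Unlock()
		}
		p.mu.Lock()
		p.status = "Finished"
		p.mu.Unlock()
	}()
	return p, done
}

// runDemo runs a dry run over three commits, failing at commit failAt
// (use -1 for no failure), and returns the final snapshot.
func runDemo(failAt int) (string, int, string) {
	detect := func(ctx context.Context, commit int) (string, error) {
		if commit == failAt {
			return "", fmt.Errorf("invalid query at commit %d", commit)
		}
		return fmt.Sprintf("regression at commit %d", commit), nil
	}
	p, done := startDryRun([]int{1, 2, 3}, detect)
	<-done // A real frontend would poll instead of blocking.
	return p.snapshot()
}

func main() {
	fmt.Println(runDemo(-1))
	fmt.Println(runDemo(2))
}
```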
# Module: /go/e2e
### High-Level Overview
The `go/e2e` module provides a specialized test runner and infrastructure for executing end-to-end (E2E) tests within the Skia infrastructure. It serves as a bridge between the high-level Task Driver system and specific test suites (such as Node.js-based Puppeteer tests). The primary goal of this module is to automate the lifecycle of E2E testing: checking out the source code, executing tests via Bazel, capturing results in a standardized xUnit XML format, and persisting those results to Google Cloud Storage (GCS).
### Design and Implementation Philosophy
The module is designed as a **Task Driver**, leveraging the `task_driver` library to ensure that E2E tests can be executed reliably on Swarming bots with full observability and step-by-step logging.
- **Standardized Result Reporting:** While E2E tests may output diverse log formats, this runner wraps execution to generate xUnit-compatible XML. This design choice ensures that testing results can be ingested by standard CI reporting tools, providing a consistent view of failures, errors, and execution time.
- **Unique Traceability in GCS:** To prevent result collisions and maintain a historical record of test runs, the runner implements a unique object prefix generation strategy. It uses time-based partitioning (`YYYY-MM-DD/HH-MM-SS`) and an iterative collision check to ensure every test run has a dedicated, non-overlapping location in the storage bucket.
- **Environment Flexibility:** The runner supports both `--local` and bot-based execution. When running on a bot, it automatically handles complex environment setup, including Gerrit authentication and Git cookie management, which are abstracted away from the actual test logic.
- **Bazel-Centric Execution:** The runner utilizes Bazel as the underlying execution engine. By using specific flags like `--config=mayberemote` and `--nocache_test_results`, it ensures that E2E tests (which are often sensitive to environment state) are executed fresh when requested, while still benefiting from RBE (Remote Build Execution) when available.
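The unique-prefix strategy can be sketched as a loop over candidate names. The `uniquePrefix` and `demoPrefix` functions are hypothetical, the `-N` suffix format is an assumption, and the `exists` callback stands in for a GCS lookup.

```go
package main

import (
	"fmt"
	"time"
)

// uniquePrefix builds a YYYY-MM-DD/HH-MM-SS prefix and appends a
// numeric suffix until exists reports that the prefix is free.
// The "-N" suffix format is an assumption of this sketch.
func uniquePrefix(now time.Time, exists func(prefix string) bool) string {
	base := now.UTC().Format("2006-01-02/15-04-05")
	candidate := base
	for i := 1; exists(candidate); i++ {
		candidate = fmt.Sprintf("%s-%d", base, i)
	}
	return candidate
}

// demoPrefix runs uniquePrefix at a fixed time against a set of
// prefixes that are already taken.
func demoPrefix(taken ...string) string {
	used := map[string]bool{}
	for _, t := range taken {
		used[t] = true
	}
	now := time.Date(2024, time.March, 5, 14, 7, 9, 0, time.UTC)
	return uniquePrefix(now, func(p string) bool { return used[p] })
}

func main() {
	fmt.Println(demoPrefix())
	fmt.Println(demoPrefix("2024-03-05/14-07-09"))
}
```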
### Key Components and Responsibilities
#### Test Runner (`test_runner.go`)
This is the core entry point. Its responsibilities include:
- **Environment Orchestration:** Managing the scratch work directory and initializing the repository checkout using `checkout` and `git_steps`.
- **Execution Management:** Invoking Bazel to run specific test targets. It parses the standard output of these commands using regular expressions to extract failure counts even when the underlying test process exits with an error.
- **Result Transformation:** Converting the raw output of Node.js/Puppeteer tests into the `TestSuites` and `TestSuite` XML structures.
- **Artifact Persistence:** Handling the authenticated upload of XML results to GCS, ensuring that developers can access detailed logs even after the Swarming bot has been reclaimed.
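The parse-and-transform step can be sketched with `regexp` and `encoding/xml`. The struct names `TestSuites` and `TestSuite` come from the description above, but their fields, the Mocha-style summary lines ("3 passing", "1 failing"), and the helper names are assumptions made for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"regexp"
	"strconv"
)

// TestSuite and TestSuites mirror the structure names mentioned above;
// the fields chosen here are assumptions.
type TestSuite struct {
	Name     string `xml:"name,attr"`
	Tests    int    `xml:"tests,attr"`
	Failures int    `xml:"failures,attr"`
}

type TestSuites struct {
	XMLName xml.Name    `xml:"testsuites"`
	Suites  []TestSuite `xml:"testsuite"`
}

// Assumed Mocha-style summary lines such as "3 passing" and "1 failing".
var (
	passingRe = regexp.MustCompile(`(\d+) passing`)
	failingRe = regexp.MustCompile(`(\d+) failing`)
)

// count extracts the first number captured by re, or 0 if there is no match.
func count(re *regexp.Regexp, output string) int {
	m := re.FindStringSubmatch(output)
	if m == nil {
		return 0
	}
	n, _ := strconv.Atoi(m[1])
	return n
}

// summarize returns the total number of tests and the number of failures.
func summarize(output string) (int, int) {
	failed := count(failingRe, output)
	return count(passingRe, output) + failed, failed
}

// toXUnit renders the parsed counts as an xUnit-style XML document.
func toXUnit(name, output string) (string, error) {
	tests, failed := summarize(output)
	suites := TestSuites{Suites: []TestSuite{{Name: name, Tests: tests, Failures: failed}}}
	b, err := xml.MarshalIndent(suites, "", "  ")
	if err != nil {
		return "", err
	}
	return xml.Header + string(b), nil
}

func main() {
	out, err := toXUnit("e2e", "  3 passing (2s)\n  1 failing\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Because the counts come from the captured output rather than the exit code, they remain available even when the test process exits with an error.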
#### Infrastructure Integration (`BUILD.bazel`)
The build configuration defines the `e2e-test-runner` binary, which bundles the logic required to interact with Google Cloud API, Gerrit, and the internal Task Driver framework. It marks the runner as a public binary, allowing it to be triggered by various CI tasks across the repository.
### Typical Execution Workflow
The following diagram illustrates how the runner coordinates the lifecycle of an E2E test run:
```text
[ Task Driver Start ]
|
v
[ Setup Environment ] ----> [ Initialize Git/Gerrit Auth ]
| [ Create Temp Workdir ]
v
[ Perform Checkout ] -----> [ Ensure Git Clone at Target Revision ]
|
v
[ Execute Bazel ] --------> [ Run Node.js E2E Test Target ]
| [ Capture Stdout and Exit Codes ]
v
[ Process Results ] ------> [ Regex Parse Failures ]
| [ Generate xUnit XML ]
v
[ Archive Artifacts ] ----> [ Generate Unique GCS Path ]
| [ Upload test_result.xml ]
v
[ Task Driver End ]
```
### Key Files
- **`test_runner.go`**: Contains the main logic for the task driver, including the GCS upload logic, the Bazel execution wrapper, and the XML result generation.
- **`BUILD.bazel`**: Defines the Go library and binary targets, specifying the dependencies on the cloud storage, Gerrit, and Task Driver libraries.
# Module: /go/e2e/tests
### High-Level Overview
The `/go/e2e/tests` module provides a framework and suite for end-to-end (E2E) testing of web-based applications. Unlike unit or integration tests that verify isolated logic or API contracts, this module validates the entire system stack by simulating real user interactions within a headless browser. Its primary goal is to ensure that critical user journeys—from page load to UI state changes—function correctly in a production-like environment.
### Design and Implementation Philosophy
The testing strategy is built around programmatic browser control and behavior-driven assertions. The following design choices guide the implementation:
- **Browser Orchestration via Puppeteer:** The module utilizes Puppeteer to automate Chrome. This choice allows tests to interact with the DOM, handle asynchronous rendering, and capture page metadata exactly as a user would.
- **Environment-Agnostic Browser Execution:** To ensure tests run consistently across local development machines and CI environments, the module relies on the `CHROME_BIN` environment variable. This decouples the test logic from the specific installation path of the browser.
- **Resource Efficiency and Isolation:** Tests are structured to balance performance with isolation. While a single browser instance is typically launched for a suite (via `before` hooks) to save overhead, individual test cases (`it` blocks) utilize fresh browser pages/tabs. This prevents state leakage between tests while avoiding the heavy cost of restarting the browser executable for every assertion.
- **Container-Friendly Configuration:** The browser is launched with specific flags, such as `--no-sandbox` and `--disable-dev-shm-usage`. These are intentional design choices to ensure compatibility with containerized execution environments where shared memory may be limited or namespace sandboxing is restricted.
### Key Components and Responsibilities
#### Test Execution and Assertions
The module relies on the Chai assertion library to provide a descriptive and readable syntax for validating application state. The responsibility of a test file is to define a specific user scenario, navigate to the target service, and verify outcomes such as page titles, element visibility, or data consistency.
#### Browser Lifecycle Management
Each test suite is responsible for managing its own lifecycle. This involves:
1. **Setup:** Initializing the browser driver and configuring the base URL.
2. **Execution:** Navigating to specific routes and interacting with the page.
3. **Teardown:** Ensuring the browser process is terminated (via `after` hooks) to prevent memory leaks and orphaned processes in the testing infrastructure.
### Typical Test Workflow
The following diagram illustrates the lifecycle of an E2E test within this module:
```text
[ Test Suite Start ]
|
v
[ Launch Browser ] <--- Uses CHROME_BIN and container-optimized flags
|
+---- [ Setup Page ] <--- Create a new isolated tab/page
| |
| v
| [ Navigate ] <--- page.goto(baseUrl)
| |
| v
| [ Verify ] <--- Assert DOM state or page metadata
| |
| v
| [ Close Page ]
|
v
[ Close Browser ] <--- Teardown process to free resources
|
v
[ Test Suite End ]
```
### Key Files
- **`example_nodejs_test.ts`**: Serves as the reference implementation for new E2E tests. It demonstrates how to initialize the Puppeteer instance, manage the page lifecycle, and perform assertions using the Chai library.
- **`BUILD.bazel`**: Defines the test targets. It identifies the necessary dependencies, such as the Puppeteer driver and assertion libraries, ensuring the test runner has access to the required Node.js modules and browser binaries.
# Module: /go/favorites
### High-Level Overview
The `go/favorites` module defines the core domain model and persistence interface for "Favorites" within the Perf system. A "Favorite" represents a saved configuration—specifically a name, description, and URL—that allows users to bookmark and revisit specific data visualizations or query states.
This module acts as a contract layer, decoupling the business logic of the Perf application from specific storage implementations (such as SQL databases or mock objects used in testing).
### Design and Implementation Choices
The module is designed around a strictly defined interface to ensure that the Perf frontend can manage user preferences consistently, regardless of the underlying infrastructure.
- **Interface-Driven Design**: By defining the `Store` as an interface, the system supports pluggable backends. This allows for the `sqlfavoritestore` implementation in production and the `mocks` implementation for unit testing.
- **Encapsulation of State**: The `Favorite` struct encapsulates all metadata required to reconstruct a saved view. The inclusion of `LastModified` (as a Unix timestamp) enables the UI to sort or filter favorites by recency without requiring complex timezone handling at the database level.
- **Separation of Concerns (SaveRequest)**: The use of a `SaveRequest` struct separate from the `Favorite` struct is a deliberate choice to distinguish between _input data_ (what a user provides) and _stored records_ (which include system-generated fields like `ID` and `LastModified`).
- **Arbitrary Liveness Responsibility**: A unique design choice in this module is the inclusion of the `Liveness` method. While not strictly a "favorites" function, the `Store` was selected as a lightweight probe point to verify database connectivity for the entire application. This avoids adding overhead to more performance-critical stores while ensuring the system can monitor its health.
### Key Components
#### `store.go`
This file defines the data structures and the behavioral contract for favorite management.
- **`Favorite` Struct**: The primary data model. It includes the `UserId` (typically an email address) to enforce ownership and a `Url` which contains the encoded state of the Perf dashboard.
- **`Store` Interface**: Defines the lifecycle of a favorite:
  - **Creation and Updates**: `Create` and `Update` use the `SaveRequest` pattern to ensure only mutable fields are passed from the frontend.
  - **Retrieval**: `Get` fetches a single record by ID, while `List` provides a collection of all favorites owned by a specific user.
  - **Security-Conscious Deletion**: The `Delete` method requires both an `id` and a `userId`. This ensures that the storage layer enforces ownership, preventing a user from deleting another person's favorite by simply knowing its ID.
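The contract and its ownership check can be illustrated with an in-memory sketch. The method signatures are assumptions based on the description above, only a subset of the interface (`Create`, `List`, `Delete`) is shown, and `memStore` is a hypothetical implementation that is not safe for concurrent use.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strconv"
	"time"
)

// Favorite is a stored record, including system-generated fields.
type Favorite struct {
	ID           string
	UserId       string
	Name         string
	Url          string
	Description  string
	LastModified int64
}

// SaveRequest carries only the fields a user provides.
type SaveRequest struct {
	UserId, Name, Url, Description string
}

// Store is a subset of the interface; the signatures are assumptions.
type Store interface {
	Create(ctx context.Context, req *SaveRequest) error
	List(ctx context.Context, userId string) ([]*Favorite, error)
	Delete(ctx context.Context, userId string, id string) error
}

// memStore is a hypothetical in-memory Store used only for illustration.
type memStore struct {
	favorites []*Favorite
	nextID    int
}

var _ Store = (*memStore)(nil)

func (m *memStore) Create(ctx context.Context, req *SaveRequest) error {
	m.nextID++
	m.favorites = append(m.favorites, &Favorite{
		ID: strconv.Itoa(m.nextID), UserId: req.UserId, Name: req.Name,
		Url: req.Url, Description: req.Description,
		LastModified: time.Now().Unix(), // Generated by the store, not the caller.
	})
	return nil
}

func (m *memStore) List(ctx context.Context, userId string) ([]*Favorite, error) {
	out := []*Favorite{}
	for _, f := range m.favorites {
		if f.UserId == userId {
			out = append(out, f)
		}
	}
	return out, nil
}

// Delete removes a favorite only when both the ID and the owner match.
func (m *memStore) Delete(ctx context.Context, userId string, id string) error {
	for i, f := range m.favorites {
		if f.ID == id && f.UserId == userId {
			m.favorites = append(m.favorites[:i], m.favorites[i+1:]...)
			return nil
		}
	}
	return errors.New("no favorite with that id is owned by this user")
}

// demoOwnership reports whether a delete by a non-owner was rejected
// and how many favorites the owner still has afterwards.
func demoOwnership() (bool, int) {
	ctx := context.Background()
	s := &memStore{}
	_ = s.Create(ctx, &SaveRequest{UserId: "alice@example.com", Name: "Pixel 6", Url: "https://perf.example.com/e/"})
	mine, _ := s.List(ctx, "alice@example.com")
	rejected := s.Delete(ctx, "mallory@example.com", mine[0].ID) != nil
	remaining, _ := s.List(ctx, "alice@example.com")
	return rejected, len(remaining)
}

func main() {
	fmt.Println(demoOwnership())
}
```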
### Favorite Lifecycle Workflow
The following diagram illustrates the lifecycle of a Favorite configuration from creation to retrieval:
```text
User Action Data Structure Store Interface
----------- -------------- ---------------
| | |
[ Save Dashboard ] ----> [ SaveRequest ] |
| (Name, URL, Desc, User) |
| | |
| +----------------------> [ Create() ]
| |
| (ID & Timestamp generated)
| |
v | v
[ View My List ] <------- [ []*Favorite ] <------------ [ List(UserId) ]
| (ID, Name, URL, Modified) |
| | |
| | |
[ Delete Entry ] ------- (ID + UserId) ----------------> [ Delete() ]
```
# Module: /go/favorites/mocks
### High-Level Overview
The `go/favorites/mocks` module provides a programmatic double of the `favorites.Store` interface. Its primary purpose is to facilitate unit testing for components within the Perf system that depend on "favorites" functionality—such as saving, retrieving, or listing user-defined favorite configurations—without requiring a live database or a real implementation of the storage layer.
By using these mocks, developers can isolate the business logic of higher-level services, ensuring that tests are deterministic, fast, and do not rely on external infrastructure like Spanner or SQL.
### Design and Implementation Choices
The module utilizes **automated mock generation** via `mockery`. This choice ensures that the mock implementation remains perfectly synchronized with the `favorites.Store` interface definition found in `go/favorites`.
#### Why use mocks here?
- **Isolation**: Testing a service that uses favorites (e.g., a frontend API handler) should not fail because of a database connection issue.
- **Behavior Simulation**: The mocks allow testers to simulate specific scenarios that are difficult to trigger with a real store, such as specific database errors, timeouts, or the return of empty datasets.
- **Verification**: Beyond just providing data, these mocks allow tests to assert that specific methods (like `Update` or `Delete`) were called with the expected arguments and the correct number of times.
#### Implementation with `testify`
The implementation is built on the `github.com/stretchr/testify/mock` framework. This allows for a fluent API when setting up expectations:
```text
Test Code Mock Store Component Under Test
--------- ---------- --------------------
Setup Expectation -> [ On("Get").Return(...) ]
Execute Test ----------------------------> Calls Store.Get()
|
[ Match Arguments ] <-----|
[ Return Fake Data ] ----> Result processed by logic
Assert Expectations <- [ AssertExpectations ]
```
### Key Components
#### Store.go
This is the core file of the module, containing the `Store` struct. It implements every method required by the `favorites.Store` interface:
- **CRUD Operations**: `Create`, `Get`, `Update`, and `Delete` methods are implemented to capture arguments and return values defined in the test suite.
- **Query Operations**: The `List` method simulates retrieving all favorites associated with a specific user ID.
- **Infrastructure Checks**: The `Liveness` method is mocked to allow tests to simulate the health status of the underlying storage engine.
#### NewStore Constructor
The `NewStore` function is the standard entry point for using this module in a test. It takes a `testing.T` (or compatible interface) which allows it to:
1. Register the mock for the current test context.
2. Automatically hook into the test's `Cleanup` phase to call `AssertExpectations`, ensuring that any configured expectations were actually met before the test finished.
# Module: /go/favorites/sqlfavoritestore
This module provides a SQL-based implementation of the `favorites.Store` interface, enabling the persistence and management of user "Favorites" (saved URLs/configurations) within the Perf application. It bridges the gap between the high-level favorites domain logic and the underlying relational database (typically CockroachDB or Spanner).
### Design and Implementation Choices
The module is built with a focus on performance for user-centric queries and reliability in a distributed database environment.
- **Relational Storage**: By using a SQL backend, the module leverages robust indexing and ACID transactions. The schema is optimized for the primary access pattern: retrieving all favorites associated with a specific user.
- **Decoupled SQL Statements**: SQL queries are defined as a mapped collection of constants (`statements`). This separation ensures that the Go logic for scanning rows remains clean while providing a single location to tune or update database queries.
- **Stateless Operations**: The `FavoriteStore` struct is designed to be stateless, holding only a reference to the database connection pool (`pool.Pool`). This allows the store to be safely shared across multiple goroutines.
- **Liveness Monitoring**: Beyond standard CRUD operations, the module includes a `Liveness` check. This is a strategic inclusion for cloud-native deployments, allowing the application's frontend to verify its database connectivity independently of functional queries.
- **Timestamping**: The store handles `LastModified` logic at the application level using Unix timestamps. This ensures consistency in how time is recorded regardless of the database's internal time configuration.
### Key Components and Responsibilities
#### `sqlfavoritestore.go`
This is the core of the module, implementing the `FavoriteStore` and its associated methods. It handles the translation between Go structs (from the `favorites` package) and SQL parameters.
- **CRUD Implementation**:
  - **Create**: Inserts a new record and automatically calculates the `last_modified` timestamp.
  - **Update**: Modifies existing records based on ID and provides feedback if no rows were affected (e.g., if the ID was invalid).
  - **Delete**: Requires both the `ID` and the `UserId` to perform a deletion. This is a security-in-depth choice, ensuring a user can only delete their own favorites even if an ID is guessed or leaked.
  - **List**: Retrieves a subset of fields (`id`, `name`, `url`, `description`) for all favorites owned by a user, optimized for summary views.
- **Error Handling**: Uses `skerr` to wrap standard database errors with contextual information (e.g., "Failed to load favorite"), making it easier to trace issues in logs.
#### `schema/` (Submodule)
Defines the architectural "blueprint" for the `Favorites` table. It ensures that the database indexes and constraints (like `NOT NULL` on URLs) align with the requirements of the Go code. It manages the identity strategy (UUIDs) and the `by_user_id` index critical for performance.
### Key Workflow: Saving and Retrieving a Favorite
The following diagram illustrates how the store interacts with the database to persist a user's favorite:
```text
[ Caller ]
|
| 1. Create(ctx, SaveRequest{UserId, Name, Url...})
v
[ FavoriteStore ]
|
| 2. Generate current Unix timestamp
| 3. Execute 'insertFavorite' SQL
| (UserId, Name, Url, Desc, Timestamp)
v
[ SQL Database ]
|
| 4. Generate UUID for 'id'
| 5. Persist record & Update 'by_user_id' index
v
[ FavoriteStore ]
|
| 6. Return nil (Success)
v
[ Caller ]
|
| 7. List(ctx, UserId)
| 8. Execute 'listFavorites' SQL
| (Uses index for fast lookup)
v
[ SQL Database ]
|
| 9. Return matching rows
v
[ FavoriteStore ]
|
| 10. Scan rows into []*favorites.Favorite
v
[ Caller ]
```
# Module: /go/favorites/sqlfavoritestore/schema
This module defines the data architecture for persisting user "Favorites" within a SQL database. It serves as the single source of truth for the database schema, ensuring that the Go representation of a favorite aligns with the underlying storage constraints and indexing requirements.
### Design and Implementation Choices
The schema is designed to support a multi-user environment where favorites are frequently queried by ownership but accessed via unique identifiers for updates and deletions.
- **Identity Management**: The `ID` uses a UUID generated at the database level (`gen_random_uuid()`). This choice provides a globally unique identifier that prevents ID exhaustion and allows for future decentralization or data migration without key collisions.
- **User Association**: Users are identified by their email addresses (`UserId`), sourced from `uber-proxy` authentication. By using a string-based email as the key for user association rather than an internal integer ID, the system simplifies integration with the authentication layer and avoids an extra lookup table for user metadata. (The table's primary key remains the UUID `ID`.)
- **Indexing Strategy**: A dedicated index `by_user_id` is defined on the `user_id` column. This is a critical performance choice, as the primary access pattern for the application is expected to be "fetch all favorites for the current logged-in user." Without this index, user-specific queries would require a full table scan.
- **Time Tracking**: The `LastModified` field uses a standard Unix timestamp (integer). This provides a lightweight, timezone-agnostic way to handle sorting by recency or implementing cache invalidation logic on the client side.
### Key Components and Responsibilities
#### `schema.go`
This file contains the `FavoriteSchema` struct, which uses specialized `sql` struct tags to define the DDL (Data Definition Language) properties of the table.
- **Content and Metadata**: The schema separates core functional data (`Url`) from descriptive metadata (`Name`, `Description`). While the `Url` is mandatory (`NOT NULL`), the name and description are optional, allowing users to save links quickly without mandatory labeling.
- **Structural Integrity**: By defining constraints like `PRIMARY KEY` and `NOT NULL` directly in the struct tags, the module ensures that the database enforcement layer matches the application's data requirements.
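A struct in this style might look like the sketch below. The exact column types and tag strings are assumptions; only the column roles (UUID primary key with `gen_random_uuid()`, `NOT NULL` URL, integer timestamp, `by_user_id` index) come from the description above. The `columns` helper is hypothetical and simply shows that the tags are readable via reflection.

```go
package main

import (
	"fmt"
	"reflect"
)

// FavoriteSchema sketches the table layout; the tag contents are
// assumptions based on the prose description.
type FavoriteSchema struct {
	ID           string `sql:"id UUID PRIMARY KEY DEFAULT gen_random_uuid()"`
	UserId       string `sql:"user_id STRING"`
	Name         string `sql:"name STRING"`
	Url          string `sql:"url STRING NOT NULL"`
	Description  string `sql:"description STRING"`
	LastModified int64  `sql:"last_modified INT"`

	// The index is declared as a tag on a placeholder field.
	byUserIdIndex struct{} `sql:"INDEX by_user_id (user_id)"`
}

// columns returns the sql tag of every field, in declaration order.
func columns(schema interface{}) []string {
	t := reflect.TypeOf(schema)
	out := []string{}
	for i := 0; i < t.NumField(); i++ {
		if tag, ok := t.Field(i).Tag.Lookup("sql"); ok {
			out = append(out, tag)
		}
	}
	return out
}

func main() {
	for _, c := range columns(FavoriteSchema{}) {
		fmt.Println(c)
	}
}
```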
### Key Workflow: User Data Retrieval
The schema is structured to optimize the flow from authentication to data display:
```text
[ Authentication Layer ]
|
| Provides User Email
v
[ sqlfavoritestore ]
|
| Query: SELECT * FROM Favorites WHERE user_id = {email}
| (Uses 'by_user_id' Index for O(log n) lookup)
v
[ SQL Database ]
|
| Returns set of FavoriteSchema records
v
[ UI / Consumer ]
```
# Module: /go/file
### Overview
The `file` module defines the core abstraction for data ingestion within the Skia Perf system. It provides a common interface and data structure that decouples the ingestion logic (the "how" of processing data) from the storage and transport layers (the "where" of the data).
By standardizing how files are represented and discovered, this module allows the system to seamlessly switch between local development environments, automated testing, and production cloud-scale ingestion.
### Core Abstractions
#### The `File` Struct
The `file.File` struct is the primary data transfer object. It encapsulates both the data and the metadata required for a single unit of ingestion.
- **`Name`**: The identifier for the file, typically a path or URI.
- **`Contents`**: An `io.ReadCloser` that provides the raw data. This allows the ingestion pipeline to stream data rather than loading entire files into memory, which is critical when handling large performance traces.
- **`Created`**: A timestamp indicating when the file was created.
- **`PubSubMsg`**: An optional reference to a `pubsub.Message`. This is included to allow downstream consumers to manually acknowledge or negatively acknowledge the message if the source is backed by a messaging service like Google Cloud Pub/Sub.
#### The `Source` Interface
The `Source` interface defines a unified mechanism for discovering files. It follows a "push-based" model through a Go channel:
```go
type Source interface {
	Start(ctx context.Context) (<-chan File, error)
}
```
This design choice allows the ingestion engine to remain reactive. Whether the files are being walked on a local disk or arriving via real-time cloud notifications, the consumer simply listens to the channel until it is closed.
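As a minimal sketch of that consumer loop, the example below re-declares `File` and `Source` locally (without the `PubSubMsg` field, to avoid external dependencies) and drains a hypothetical in-memory source named `sliceSource`. The names `sliceSource`, `drain`, and `demoTotal` are illustrative assumptions and are not part of the Perf codebase.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
	"time"
)

// File mirrors the fields described above; PubSubMsg is omitted in this sketch.
type File struct {
	Name     string
	Contents io.ReadCloser
	Created  time.Time
}

// Source mirrors the interface shown above.
type Source interface {
	Start(ctx context.Context) (<-chan File, error)
}

// sliceSource is a hypothetical in-memory Source used only for illustration.
type sliceSource struct {
	files map[string]string
}

func (s sliceSource) Start(ctx context.Context) (<-chan File, error) {
	ch := make(chan File)
	go func() {
		defer close(ch) // Closing the channel tells the consumer there is no more data.
		for name, body := range s.files {
			f := File{Name: name, Contents: io.NopCloser(strings.NewReader(body)), Created: time.Now()}
			select {
			case ch <- f:
			case <-ctx.Done():
				return
			}
		}
	}()
	return ch, nil
}

// drain reads every file from src and returns the total number of bytes seen.
func drain(ctx context.Context, src Source) (int, error) {
	ch, err := src.Start(ctx)
	if err != nil {
		return 0, err
	}
	total := 0
	for f := range ch { // The loop ends when the source closes the channel.
		n, err := io.Copy(io.Discard, f.Contents)
		f.Contents.Close()
		if err != nil {
			return total, err
		}
		total += int(n)
	}
	return total, nil
}

// demoTotal runs drain against two small in-memory files.
func demoTotal() int {
	src := sliceSource{files: map[string]string{"a.json": "{}", "b.json": "[1]"}}
	total, _ := drain(context.Background(), src)
	return total
}

func main() {
	fmt.Println("bytes ingested:", demoTotal())
}
```

The consumer never needs to know which backend produced the files; it only ranges over the channel and closes each `Contents` reader when done.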
### Design Rationale
- **Streaming over Buffering**: The use of `io.ReadCloser` instead of a byte slice for file contents ensures that the memory footprint of the ingestion process remains low even if individual files are large.
- **Decoupled Lifecycle Management**: The `Start` method is designed to be called once per instance. This prevents race conditions and ensures a predictable lifecycle for the background workers (goroutines) that different implementations (like `dirsource` or `gcssource`) use to monitor their respective backends.
- **Metadata Passthrough**: Including the `PubSubMsg` in the `File` struct is a design compromise that breaks total isolation in favor of reliability. It allows the ingestion pipeline to signal to the underlying transport layer exactly when a file has been successfully persisted or if it needs to be retried.
### Workflow: General Ingestion Pattern
The following diagram illustrates how the `file` module acts as the contract between various data providers and the main ingestion engine:
```text
+-------------------+      +-------------------+      +-------------------+
|  Local Directory  |      |    GCS Bucket     |      |  Other Providers  |
|    (dirsource)    |      |    (gcssource)    |      |     (Future)      |
+---------+---------+      +---------+---------+      +---------+---------+
          |                          |                          |
          | implements               | implements               | implements
          +--------------------------+--------------------------+
                                     |
                             +-------v--------+
                             |  file.Source   |
                             +-------+--------+
                                     |
                                     | Start(ctx) returns <-chan file.File
                                     v
                             +----------------+
                             | Perf Ingestion |
                             |     Engine     |
                             +----------------+
```
### Submodules and Implementations
While the `file` package defines the contract, its submodules provide the concrete implementations used across different environments:
- **`dirsource`**: A filesystem-based implementation. It walks a local directory and streams the files found. It is primarily used for local development and unit tests.
- **`gcssource`**: A production-ready implementation that listens to Google Cloud Pub/Sub notifications to ingest files from GCS buckets in real-time.
### Files and Responsibilities
- **`file.go`**: Defines the `Source` interface and the `File` struct. This file serves as the single point of truth for how data enters the Perf system, ensuring that adding a new storage backend only requires implementing the `Source` interface.
# Module: /go/file/dirsource
### Overview
The `dirsource` module provides a filesystem-based implementation of the `file.Source` interface. Its primary purpose is to abstract a local directory into a stream of data, allowing the Skia Perf system to ingest files directly from the disk.
This implementation is intentionally kept simple and is designed specifically for **local development, demonstration modes, and unit testing**. It allows developers to run the Perf ingestion pipeline against local files without requiring a cloud storage provider or a complex messaging infrastructure.
### Design Rationale
The implementation of `DirSource` prioritizes ease of setup over high-performance production features like real-time file watching or sophisticated metadata tracking.
- **One-Shot Iteration:** Instead of monitoring the directory for new events (e.g., via `inotify`), the module performs a one-time walk of the directory tree when `Start` is called. This simplifies the state management of the source, making it predictable for tests.
- **Asynchronous Streaming:** To prevent the caller from blocking while the filesystem is scanned, `Start` launches a background goroutine that pushes files into a buffered channel. This allows the ingestion pipeline to begin processing the first file while the source is still discovering the next.
- **Modified Time as Proxy:** Since many filesystems do not reliably track or expose an "original creation time" in a cross-platform manner, the module uses the file's `ModTime` (Modified Time) to fill the `Created` field in the `file.File` struct.
- **Safety Constraints:** The module enforces that `Start` can only be called once. This prevents accidental duplicate processing of the same directory within the same lifecycle, ensuring data consistency in demo environments.
### Key Components
#### DirSource
Defined in `dirsource.go`, this is the core struct implementing the `file.Source` interface. It maintains the absolute path to the target directory and tracks whether the ingestion process has already been initiated.
The `Start` method performs the following actions:
1. Verifies the source hasn't been started.
2. Creates a buffered channel (`channelSize = 10`) to hold `file.File` objects.
3. Spawns a goroutine to execute `filepath.Walk`.
4. For every non-directory file encountered, it opens a file handle and emits a `file.File` containing the path, the open reader, and the modification timestamp.
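The four steps above can be sketched roughly as follows. This is an approximation written from the description, not the actual `dirsource.go`: the `dirSource` type, the error message, and the `demo` helper are assumptions, and `File` is re-declared locally without the `PubSubMsg` field.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sort"
	"time"
)

// File is a local stand-in for file.File.
type File struct {
	Name     string
	Contents io.ReadCloser
	Created  time.Time
}

const channelSize = 10

// dirSource is a hypothetical approximation of DirSource.
type dirSource struct {
	dir     string
	started bool
}

func (d *dirSource) Start(ctx context.Context) (<-chan File, error) {
	if d.started { // Step 1: refuse a second Start.
		return nil, errors.New("Start can only be called once")
	}
	d.started = true
	ch := make(chan File, channelSize) // Step 2: buffered channel.
	go func() {                        // Step 3: walk in the background.
		defer close(ch)
		_ = filepath.Walk(d.dir, func(path string, info os.FileInfo, err error) error {
			if err != nil || info.IsDir() {
				return nil // Skip unreadable entries and directories.
			}
			f, err := os.Open(path)
			if err != nil {
				return nil
			}
			select { // Step 4: emit path, open reader, and ModTime.
			case ch <- File{Name: path, Contents: f, Created: info.ModTime()}:
				return nil
			case <-ctx.Done():
				f.Close()
				return ctx.Err()
			}
		})
	}()
	return ch, nil
}

// demo walks a temporary directory holding two files and then calls Start again.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "dirsource-sketch")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	for _, name := range []string{"filea.json", "fileb.json"} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(`{"status": "ok"}`), 0o644); err != nil {
			return nil, err
		}
	}
	src := &dirSource{dir: dir}
	ch, err := src.Start(context.Background())
	if err != nil {
		return nil, err
	}
	var names []string
	for f := range ch {
		names = append(names, filepath.Base(f.Name))
		f.Contents.Close()
	}
	sort.Strings(names)
	_, secondErr := src.Start(context.Background())
	return names, secondErr
}

func main() {
	names, err := demo()
	fmt.Println(names, "second Start error:", err)
}
```

Because the walk happens exactly once, a consumer sees each file a single time and then observes the channel closing.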
### Workflow: File Discovery and Emission
The following diagram illustrates how `DirSource` transforms a filesystem structure into a stream of data for the ingestion pipeline:
```text
+--------------+          +-----------------------+         +-------------------+
|  Ingestion   |          |       DirSource       |         | Local Filesystem  |
|    Engine    |          |     (Background)      |         |      (Disk)       |
+--------------+          +-----------------------+         +-------------------+
       |                              |                               |
       | (1) Start(ctx)               |                               |
       |----------------------------->|                               |
       |                              | (2) filepath.Walk(dir)        |
       | (Returns Channel)            |------------------------------>|
       |<-----------------------------|                               |
       |                              | (3) Open & Read Metadata      |
       |                              |<------------------------------|
       |                              |                               |
       | (4) Receive file.File{}      |                               |
       |<-----------------------------|                               |
       |                              | (5) Repeat for all files      |
       |                              |------------------------------>|
       |                              |                               |
       | (6) Channel Closed           |                               |
       |<-----------------------------|                               |
```
### Files and Responsibilities
- **`dirsource.go`**: Contains the logic for scanning the filesystem and mapping `os.FileInfo` to the common `file.File` structure used by Perf.
- **`dirsource_test.go`**: Validates that the directory walker correctly identifies files, handles multiple files in a single pass, and properly errors out if `Start` is invoked more than once.
- **`testdata/`**: A collection of static JSON files used during testing to ensure the source correctly reads file contents and handles path resolution.
# Module: /go/file/dirsource/testdata
### Overview
The `testdata` directory serves as a controlled environment containing static filesystem artifacts used to validate the behavior of the `dirsource` module. Rather than relying on dynamically generated files or mock objects, this directory provides concrete, predictable JSON structures that allow for end-to-end testing of directory scanning, file reading, and data parsing logic.
### Design Rationale
The primary design choice here is the use of **representative static assets**. By storing physical `.json` files on disk, the test suite can exercise the full I/O stack—ensuring that the module correctly handles file handles, directory pathing, and content deserialization in a way that matches real-world usage.
Key considerations for these test files include:
- **Schema Consistency:** The files (`filea.json`, `fileb.json`) follow a uniform structure (`{"status": "..."}`). This allows tests to verify that the `dirsource` implementation can iterate through multiple files and map them to a consistent internal data model or interface.
- **Path Resolution Testing:** These files enable the verification of recursive or non-recursive directory crawling. By having multiple files in a single flat structure, the module can test its ability to identify, filter, and ingest specific file extensions (JSON) while ignoring others if necessary.
### Key Components
The directory contains two JSON payloads that represent different data states:
- **`filea.json`**: Acts as the primary data point for positive testing. It contains a standard "status" string used to verify that the scanner successfully opens a file and extracts its content accurately.
- **`fileb.json`**: Provides a secondary data point. This is used to ensure that the `dirsource` logic correctly handles collections of files, verifying that the ingestion process doesn't stop after the first successful read and that it maintains data integrity across multiple distinct sources.
### Ingestion Workflow
The typical interaction between the parent module and these files follows this process:
```text
+-------------------+       +-----------------------+       +-------------------+
|  dirsource logic  | ----> |  Read /testdata/ dir  | ----> | Map JSON content  |
+---------+---------+       +-----------+-----------+       +---------+---------+
          |                             |                             |
          | (1) Path discovery          | (2) File I/O                | (3) Validation
          v                             v                             v
   [ filea.json ] <---------------------+---------------------> [ fileb.json ]
   [ "A test..." ]                                              [ "just another" ]
```
This workflow ensures that the system can navigate the filesystem and translate raw disk bytes into structured application data.
# Module: /go/file/gcssource
# GCS Source Module
The `gcssource` module provides an implementation of the `file.Source` interface specifically for Google Cloud Storage (GCS). It is designed to enable real-time ingestion of performance data files as they are uploaded to GCS buckets.
## Overview
The primary purpose of this module is to bridge GCS storage events with the Perf ingestion pipeline. It leverages GCS Pub/Sub notifications to detect new file arrivals, filters those files based on configuration, and streams the file contents for processing.
By using an event-driven approach rather than polling, the module ensures that the system reacts immediately to new data while minimizing unnecessary API calls to GCS.
## Key Components and Responsibilities
### GCSSource
The central struct `GCSSource` manages the lifecycle of the ingestion source. Its responsibilities include:
- **Subscription Management**: Setting up and maintaining a connection to a Google Cloud Pub/Sub topic.
- **Event Handling**: Listening for messages that indicate a new object has been created in a bucket.
- **Validation and Filtering**: Ensuring that only relevant files are passed into the pipeline.
- **Resource Management**: Providing an `io.ReadCloser` for each discovered file, allowing the consumer to read data directly from GCS.
### Filtering Logic
The module implements a multi-stage filtering process to determine if a file should be ingested:
1. **Filename Patterns**: Uses a `filter.Filter` (configured via `AcceptIfNameMatches` and `RejectIfNameMatches`) to include or exclude files based on regex-like patterns.
2. **Source Path Restriction**: Checks if the file resides within the allowed prefixes defined in the `SourceConfig.Sources` list. This prevents the ingestion of files from unauthorized or unrelated directories within the same bucket.
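A rough sketch of the two stages is shown below. It does not use the real `filter.Filter` type; `filterConfig`, `shouldIngest`, and the sample patterns are hypothetical, and treating the patterns as Go regular expressions is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// filterConfig is a hypothetical stand-in for the accept/reject patterns and
// the SourceConfig.Sources prefixes.
type filterConfig struct {
	accept  *regexp.Regexp // nil accepts every name
	reject  *regexp.Regexp // nil rejects nothing
	sources []string       // allowed gs:// prefixes
}

// shouldIngest applies the name patterns first and the prefix check second.
func (c filterConfig) shouldIngest(name string) bool {
	if c.accept != nil && !c.accept.MatchString(name) {
		return false
	}
	if c.reject != nil && c.reject.MatchString(name) {
		return false
	}
	for _, prefix := range c.sources {
		if strings.HasPrefix(name, prefix) {
			return true
		}
	}
	return false
}

// demoConfig returns an example configuration used by main.
func demoConfig() filterConfig {
	return filterConfig{
		accept:  regexp.MustCompile(`\.json$`),
		reject:  regexp.MustCompile(`/tmp/`),
		sources: []string{"gs://example-bucket/ingest/"},
	}
}

func main() {
	c := demoConfig()
	for _, name := range []string{
		"gs://example-bucket/ingest/2024/run.json",
		"gs://example-bucket/ingest/tmp/run.json",
		"gs://example-bucket/other/run.json",
		"gs://example-bucket/ingest/run.txt",
	} {
		fmt.Println(name, c.shouldIngest(name))
	}
}
```

Only the first name passes both stages; the others fail the reject pattern, the prefix check, and the accept pattern respectively.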
### Reliability and Acknowledgment
The module carefully manages Pub/Sub message acknowledgments (Ack/Nack) to ensure no data is lost:
- **Ack**: If a message is malformed (invalid JSON) or the file is explicitly rejected by filters, it is acknowledged to prevent it from being retried.
- **Nack**: If a transient error occurs (e.g., GCS API is down, or the file cannot be read), the message is negatively acknowledged so it can be redelivered and retried later.
- **Dead Letter Support**: If dead letter collection is enabled in the configuration, the logic shifts to prioritize moving failing messages to a dead letter queue rather than infinite retries.
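The Ack and Nack rules can be summarized as a small decision function. This sketch is an assumption-laden simplification: the `outcome` and `action` types are hypothetical, the dead letter behavior is deliberately left out because its exact rules are not spelled out here, and handing a successfully opened file to the consumer for acknowledgment is inferred from the `PubSubMsg` field on `file.File`.

```go
package main

import "fmt"

// outcome is a hypothetical classification of what happened to one message.
type outcome int

const (
	malformedJSON outcome = iota
	rejectedByFilter
	transientError
	fileOpened
)

// action is what the source does with the Pub/Sub message.
type action string

const (
	ack            action = "ack"
	nack           action = "nack"
	handToConsumer action = "hand to consumer"
)

// decide maps an outcome to an action following the rules listed above.
func decide(o outcome) action {
	switch o {
	case malformedJSON, rejectedByFilter:
		return ack // Retrying cannot help, so drop the message.
	case transientError:
		return nack // Ask Pub/Sub to redeliver later.
	default:
		return handToConsumer // The message travels with the file.
	}
}

func main() {
	for _, o := range []outcome{malformedJSON, rejectedByFilter, transientError, fileOpened} {
		fmt.Println(o, "->", decide(o))
	}
}
```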
## Design Decisions
### Single Subscriber Strategy
The module defaults to a low number of parallel receives (controlled by `maxParallelReceives`). This is an intentional design choice to maintain a predictable load on the ingestion system and to simplify the management of GCS read streams.
### Manual Deadline Management
The module disables automatic deadline extensions by the Pub/Sub library (`MaxExtension = -1`). This forces the system to either process a message or let it time out quickly. This prevents the "stuck ingestor" problem where a single problematic message holds up the entire queue because its deadline is being automatically extended indefinitely.
### Data Flow Workflow
```text
GCS Bucket (New File)
|
v
Cloud Pub/Sub Topic
|
v
GCSSource.Receive (Pub/Sub Message)
|
|--[ Deserialize JSON ]--> (If invalid: Ack & Drop)
|
|--[ Apply Filter ]------> (If rejected: Ack & Drop)
|
|--[ Check Sources ]-----> (If no match: Ack & Drop)
|
|--[ Get GCS Reader ]----> (If GCS error: Nack & Retry)
|
v
file.File channel (Streamed to Ingestor)
```
## Key Files
- **`gcssource.go`**: Contains the core logic for the `GCSSource` struct, the Pub/Sub message listener, and the GCS object retrieval logic.
- **`gcssource_manual_test.go`**: Provides integration tests using GCP emulators to verify the end-to-end flow from Pub/Sub message publication to file channel output.
# Module: /go/filestore
### High-Level Overview
The `filestore` module provides a unified abstraction for interacting with different storage backends (Local and Google Cloud Storage) through the standard Go `io/fs` interface. It is designed to allow the Skia Perf system to remain storage-agnostic, enabling high-level components to consume data—primarily ingestion files—without needing to know whether the source is a physical disk or a cloud bucket.
The module focuses on a read-only, stream-oriented pattern, prioritizing simplicity and compatibility with the Go standard library over exhaustive implementation of filesystem metadata features.
### Design Decisions and Implementation
#### Unified Interface (`fs.FS`)
The decision to wrap storage backends in `fs.FS` rather than creating a custom storage interface allows the system to leverage Go’s rich ecosystem of standard library tools (e.g., `io.ReadAll`, `bufio.NewScanner`). This abstraction ensures that unit tests can swap out a production GCS backend for a local directory or an in-memory filesystem without changing any business logic.
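To illustrate the swap, the sketch below writes a helper against `fs.FS` only and then feeds it an in-memory filesystem from the standard library's `testing/fstest` package. The helper name `readAll` and the sample path are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"io/fs"
	"testing/fstest"
)

// readAll depends only on fs.FS, so a local, GCS, or in-memory backend can be passed in.
func readAll(fsys fs.FS, name string) (string, error) {
	f, err := fsys.Open(name)
	if err != nil {
		return "", err
	}
	defer f.Close()
	b, err := io.ReadAll(f)
	return string(b), err
}

// demo exercises readAll with an in-memory filesystem.
func demo() string {
	mem := fstest.MapFS{
		"path/to/data.json": &fstest.MapFile{Data: []byte(`{"status":"ok"}`)},
	}
	s, err := readAll(mem, "path/to/data.json")
	if err != nil {
		return "error: " + err.Error()
	}
	return s
}

func main() {
	fmt.Println(demo())
}
```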
#### Common Path Handling vs. Backend Specifics
While both backends implement the same interface, they handle path resolution differently based on the nature of the underlying storage:
- **Local Backend**: Implements a "chrooted" approach. It anchors all operations to a specific root directory on the local disk. It uses path translation to ensure that even if absolute system paths are provided, they are resolved relative to the configured storage root, preventing accidental access to unauthorized parts of the host filesystem.
- **GCS Backend**: Implements a URL-based approach (`gs://<bucket>/<path>`). It treats the entire GCS namespace as a single virtual filesystem. It parses these URLs on the fly to determine which bucket and object to fetch using the Google Cloud Storage API.
#### Read-Only Philosophy
Both implementations are optimized for data consumption.
- They utilize read-only scopes (in GCS) or standard read-only file handles (in Local).
- Non-essential methods such as `Stat()` are left unimplemented or return minimal information (the GCS backend, for example, returns `ErrNotImplemented`). This reflects the design goal: the Perf ingestion engine cares about the **content** of the files (the JSON or Proto data) rather than the filesystem-level metadata like permissions or modification times.
### Key Components
#### `gcs` Submodule
Bridges the `storage.Client` from the Google Cloud SDK to `fs.FS`.
- **Responsibility**: Managing authenticated GCS clients and translating virtual `gs://` paths into API requests.
- **Implementation Choice**: It embeds `*storage.Reader` within a custom `file` struct. This allows the module to satisfy the `fs.File` interface automatically, as `storage.Reader` already provides the necessary `Read` and `Close` methods.
#### `local` Submodule
Wraps the `os` package's filesystem capabilities.
- **Responsibility**: Providing restricted access to a local directory.
- **Implementation Choice**: It utilizes `os.DirFS` internally. The primary value-add of this submodule is the path sanitization and translation logic that allows the rest of the application to use standardized paths while the module maps them to the correct local disk location.
### Storage Resolution Workflow
The following diagram shows how the `filestore` abstraction allows the application to remain indifferent to the underlying storage type:
```text
                 [ Application Logic ]
                           |
                           | Requests "path/to/data.json"
                           v
                 [ filestore (fs.FS) ]
                           |
           +---------------+---------------+
           |                               |
           v                               v
[ Local Implementation ]   OR   [ GCS Implementation ]
           |                               |
           | 1. Resolve relative path      | 1. Parse gs:// URL
           | 2. Access local disk          | 2. ObjectHandle.NewReader()
           v                               v
   [ Physical File ]                [ GCS Object ]
```
### Key Files
- **`gcs/gcs.go`**: Contains the logic for parsing GCS URIs and the implementation of the `fs.FS` and `fs.File` interfaces for cloud storage.
- **`local/local.go`**: Contains the logic for anchoring file access to a local root directory and implementing the `fs.FS` interface for disk-based storage.
# Module: /go/filestore/gcs
### High-Level Overview
The `gcs` module provides a bridge between the standard Go `io/fs` interface and Google Cloud Storage (GCS). It allows systems—specifically the Skia Perf backend—to treat GCS objects as if they were files in a standard filesystem.
The primary motivation for this module is to abstract the complexities of the GCS client (authentication, bucket management, and reader handling) behind the ubiquitous `fs.FS` interface. This allows higher-level components to remain storage-agnostic, facilitating testing and potential migrations to other storage backends.
### Design Decisions and Implementation
#### Interface Adherence vs. Functionality
The module implements the `fs.FS` and `fs.File` interfaces. However, it follows a "minimal implementation" philosophy tailored to the needs of the Perf system.
- **Read-Only Focus**: The implementation uses `storage.ScopeReadOnly` during initialization. This design choice minimizes the security footprint of the service, ensuring it can only consume data and never accidentally modify or delete ingestion files.
- **Deferred Implementation**: Methods such as `Stat()` on the `file` struct are intentionally not implemented and return `ErrNotImplemented`. This decision was made because the Perf ingestion pipeline focuses on streaming data content rather than inspecting metadata (like timestamps or permissions) exposed through `fs.FileInfo`.
#### Path Parsing Logic
Because `fs.FS` traditionally expects paths relative to a root, but GCS requires both a bucket name and an object path, the module utilizes a URL-based naming convention: `gs://<bucket>/<path>`.
The `parseNameIntoBucketAndPath` function decomposes these strings. It is designed to handle the nuances of URL parsing, such as stripping leading slashes from the URL path to convert them into valid GCS object keys.
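A minimal sketch of this kind of decomposition, using only `net/url`, might look like the following. The function name `parseBucketAndPath` and its validation rules are assumptions; the real `parseNameIntoBucketAndPath` may differ in signature and error handling.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseBucketAndPath splits a gs://<bucket>/<path> name into its two parts.
func parseBucketAndPath(name string) (bucket, objectPath string, err error) {
	u, err := url.Parse(name)
	if err != nil {
		return "", "", err
	}
	if u.Scheme != "gs" || u.Host == "" {
		return "", "", fmt.Errorf("not a gs:// URL: %q", name)
	}
	// The URL path starts with "/", which is not part of a GCS object key.
	return u.Host, strings.TrimPrefix(u.Path, "/"), nil
}

func main() {
	bucket, objectPath, err := parseBucketAndPath("gs://my-bucket/dir/data.json")
	fmt.Println(bucket, objectPath, err)
}
```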
### Key Components
#### filesystem (`gcs.go`)
This is the central coordinator that satisfies `fs.FS`. It holds a long-lived `storage.Client`, which is authenticated using Google Application Default Credentials.
- **Responsibility**: Managing the lifecycle of the GCS client and translating `Open(name)` calls into GCS readers.
- **Workflow**: When `Open` is called, the filesystem parses the provided string into bucket and object components, then initializes a new `storage.Reader` using a background context.
#### file (`gcs.go`)
A thin wrapper around `*storage.Reader`.
- **Responsibility**: To bridge the `storage.Reader` (which provides `Read` and `Close`) with the `fs.File` interface.
- **Design Choice**: By embedding `*storage.Reader`, the struct automatically inherits the methods required for reading, keeping the implementation concise.
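The embedding technique can be shown with standard library types alone. In this sketch an `io.ReadCloser` stands in for `*storage.Reader`, and `errNotImplemented` is a local placeholder for the module's `ErrNotImplemented`; both substitutions are assumptions made to keep the example dependency-free.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"io/fs"
	"strings"
)

var errNotImplemented = errors.New("not implemented")

// file embeds a reader, so Read and Close are promoted automatically;
// only Stat has to be written by hand to satisfy fs.File.
type file struct {
	io.ReadCloser
}

func (file) Stat() (fs.FileInfo, error) { return nil, errNotImplemented }

// Compile-time check that file satisfies fs.File.
var _ fs.File = file{}

// demo reads through the wrapper and then calls the unimplemented Stat.
func demo() (string, error) {
	var f fs.File = file{io.NopCloser(strings.NewReader("payload"))}
	defer f.Close()
	b, err := io.ReadAll(f)
	if err != nil {
		return "", nil
	}
	_, statErr := f.Stat()
	return string(b), statErr
}

func main() {
	contents, statErr := demo()
	fmt.Println(contents, statErr)
}
```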
### Data Access Workflow
The following diagram illustrates how a request for a file is translated from a standard interface call into a GCS network request:
```text
[ Caller ]
|
| 1. Open("gs://my-bucket/data.json")
v
[ filesystem ]
|
| 2. parseNameIntoBucketAndPath() -> ("my-bucket", "data.json")
| 3. client.Bucket("my-bucket").Object("data.json").NewReader(ctx)
v
[ storage.Reader ] <--- Wrapped in ---> [ file (fs.File) ]
|
| 4. Read() / Close() operations
v
[ Google Cloud Storage API ]
```
# Module: /go/filestore/local
### High-Level Overview
The `local` module provides an implementation of the standard library's `fs.FS` interface specifically for the local file system. In the context of the larger Perf system, this module serves as a bridge between high-level file operations and physical storage on disk. By wrapping local file access in the `fs.FS` interface, it allows other components to remain agnostic about whether they are interacting with local storage, cloud storage, or an in-memory mock.
### Design Decisions and Implementation
The implementation focuses on creating a "chrooted" view of the local filesystem. This is achieved by anchoring all file operations to a specific `rootDir`.
#### Root Isolation and Path Resolution
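A key design choice is the use of `os.DirFS`. While `os.Open` can access any path on the system, a filesystem created by `os.DirFS` only accepts slash-separated paths relative to its root and rejects names that contain `..` elements. By combining an absolute root path with `os.DirFS`, the module ensures that callers interact with a controlled environment. Note that the Go documentation describes `os.DirFS` as not a general substitute for a chroot-style security mechanism: symbolic links inside the tree can still point outside it.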
When a file is opened via the `Open` method:
1. The module takes the requested path (which may be absolute or relative to the system root).
2. It calculates the relative path of that file with respect to the module's initialized `rootDir`.
3. It passes that relative path to the internal `os.DirFS` instance.
This approach provides a layer of safety and abstraction: the consumer of the `local` package can provide paths as they exist on the system, and the module handles the translation necessary to satisfy the `fs.FS` requirements, which typically expect paths relative to the filesystem root.
### Key Components
#### The `filesystem` Struct
Defined in `local.go`, this struct is the core of the module. It maintains two primary pieces of state:
- `rootDir`: The absolute path to the base directory. This is captured during initialization via `filepath.Abs` to ensure that the base of the filesystem is immutable and clearly defined, even if the process's working directory changes.
- `fs`: An internal `fs.FS` instance (specifically a `dirFS` from the `os` package). This handles the actual low-level directory traversal and file reading.
#### The `Open` Workflow
The `Open` method acts as a translation layer. Instead of directly opening a path, it enforces the "local root" logic.
```text
Input Path (name)
|
v
[ filepath.Rel ] <--- compares 'name' against 'rootDir'
|
+---- Error (if name is outside rootDir)
|
v
Relative Path
|
v
[ f.fs.Open ] <--- os.DirFS handles actual I/O
|
v
fs.File
```
This workflow ensures that even if a full system path is passed to `Open`, the module correctly identifies the segment relative to its configured root, preventing accidental access to files outside the intended scope of the `perf` storage directory.
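The translation can be sketched as below. This is a reconstruction from the description rather than the actual `local.go`; the constructor name `newFilesystem` and the `demo` helper are assumptions. In this sketch, a path outside the root becomes a `../...` relative path, which the `os.DirFS` filesystem then rejects as invalid.

```go
package main

import (
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// filesystem anchors every Open call to rootDir.
type filesystem struct {
	rootDir string
	fs      fs.FS
}

func newFilesystem(root string) (*filesystem, error) {
	abs, err := filepath.Abs(root)
	if err != nil {
		return nil, err
	}
	return &filesystem{rootDir: abs, fs: os.DirFS(abs)}, nil
}

// Open converts name into a path relative to rootDir before delegating.
func (f *filesystem) Open(name string) (fs.File, error) {
	rel, err := filepath.Rel(f.rootDir, name)
	if err != nil {
		return nil, err
	}
	// fs.FS paths always use forward slashes.
	return f.fs.Open(filepath.ToSlash(rel))
}

// demo opens one file inside a temporary root and one path outside it.
func demo() (contents string, outsideRejected bool) {
	root, err := os.MkdirTemp("", "local-sketch")
	if err != nil {
		return "", false
	}
	defer os.RemoveAll(root)
	full := filepath.Join(root, "data.json")
	if err := os.WriteFile(full, []byte("{}"), 0o644); err != nil {
		return "", false
	}
	fsys, err := newFilesystem(root)
	if err != nil {
		return "", false
	}
	f, err := fsys.Open(full)
	if err != nil {
		return "", false
	}
	defer f.Close()
	b, _ := io.ReadAll(f)
	_, outsideErr := fsys.Open(filepath.Join(filepath.Dir(root), "elsewhere.json"))
	return string(b), outsideErr != nil
}

func main() {
	contents, outsideRejected := demo()
	fmt.Println(contents, outsideRejected)
}
```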
# Module: /go/frontend
# Perf Frontend Module
The `/go/frontend` module serves as the central web server and orchestration layer for the Skia Perf application. It is responsible for serving the Web UI, managing user authentication, and coordinating communication between various backend services such as trace stores, regression detection engines, and issue trackers.
## Overview
The frontend service is designed as a controller-based system that abstracts complex performance data operations into user-facing API endpoints. It acts as the "glue" that binds together the telemetry data (traces), the version control history (Git), and the automated analysis tools (clustering and regression detection).
A key design philosophy of this module is **asynchronous data handling**. Performance dataframes can be massive, and generating them often exceeds the duration of a standard HTTP request. Consequently, the frontend utilizes a "Start-Status-Result" pattern, allowing the UI to poll for progress while a background worker processes the data.
## Key Components and Responsibilities
### Service Entry Point (`frontend.go`)
This file acts as the primary initializer for the service. It performs several critical roles:
- **Dependency Injection**: It constructs all necessary store implementations (e.g., `TraceStore`, `AlertStore`, `RegressionStore`) based on the provided configuration.
- **Template Orchestration**: It manages the lifecycle of Go HTML templates. In development mode, templates are reloaded on every request to facilitate rapid UI iteration, while in production, they are loaded once and cached for performance.
- **Global Context Injection**: The `getPageContext` method serializes the application's state into a `window.perf` JavaScript object. This ensures that the frontend UI has immediate access to instance-specific settings, feature flags (like `FetchAnomaliesFromSql`), and environment metadata (like the `ImageTag`).
### API Routing and Logic (`/api` sub-module)
The logic is partitioned into specialized API structs, each responsible for a functional domain of the application:
- **Trace and Graph Management**: Handles the retrieval of performance metrics and the construction of dataframes for visualization.
- **Regression and Triage**: Manages the lifecycle of detected performance anomalies, providing the bridge between automated detection and manual human verification.
- **Personalization**: Manages user-specific shortcuts and favorites, allowing researchers to save specific views of the data.
### Request Proxying (`proxy.go`)
To circumvent Cross-Origin Resource Sharing (CORS) limitations when the browser needs to fetch data from external sources (e.g., `googlesource.com`), the module includes a specialized proxy handler. It forwards GET requests while carefully stripping security-sensitive headers like `Origin` and `Referer` to ensure the request is accepted by the destination server.
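A simplified version of the header-stripping step is sketched below using `net/http` and a local test server. The function name `proxyGet` and the demo header values are hypothetical, and the real handler in `proxy.go` may validate targets and copy responses differently.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// proxyGet forwards a GET to target, copying the incoming headers except
// Origin and Referer.
func proxyGet(client *http.Client, target string, in *http.Request) (*http.Response, error) {
	out, err := http.NewRequestWithContext(in.Context(), http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	out.Header = in.Header.Clone()
	out.Header.Del("Origin")
	out.Header.Del("Referer")
	return client.Do(out)
}

// demo sends a browser-like request through proxyGet to a test upstream that
// echoes back the headers it received.
func demo() string {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "origin=%q referer=%q accept=%q",
			r.Header.Get("Origin"), r.Header.Get("Referer"), r.Header.Get("Accept"))
	}))
	defer upstream.Close()

	in := httptest.NewRequest(http.MethodGet, "/proxy", nil)
	in.Header.Set("Origin", "https://perf.example")
	in.Header.Set("Referer", "https://perf.example/e/")
	in.Header.Set("Accept", "text/plain")

	resp, err := proxyGet(upstream.Client(), upstream.URL, in)
	if err != nil {
		return "error: " + err.Error()
	}
	defer resp.Body.Close()
	b, _ := io.ReadAll(resp.Body)
	return string(b)
}

func main() {
	fmt.Println(demo())
}
```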
### User Authentication and Role Enforcement
The module integrates with the `alogin` package to provide identity management. It uses a decorator pattern (`RoleEnforcedHandler`) to wrap sensitive endpoints, ensuring that only users with specific roles (e.g., `Admin` or `Bisecter`) can access administrative or resource-intensive functions.
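The decorator idea can be sketched with plain `net/http` types. Everything named here (`role`, `rolesFor`, `roleEnforced`, and the `X-Demo-Role` header) is hypothetical and only illustrates the wrapping pattern; the real `RoleEnforcedHandler` obtains roles from the `alogin` package.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

type role string

const roleAdmin role = "admin"

// rolesFor is a hypothetical lookup of the roles held by the caller.
type rolesFor func(r *http.Request) []role

// roleEnforced wraps next so it only runs when the caller holds the required role.
func roleEnforced(required role, lookup rolesFor, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for _, have := range lookup(r) {
			if have == required {
				next(w, r)
				return
			}
		}
		http.Error(w, "forbidden", http.StatusForbidden)
	}
}

// demo calls a protected handler with the given role and returns the status code.
func demo(callerRole string) int {
	lookup := func(r *http.Request) []role {
		return []role{role(r.Header.Get("X-Demo-Role"))}
	}
	h := roleEnforced(roleAdmin, lookup, func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	})
	req := httptest.NewRequest(http.MethodPost, "/admin", nil)
	req.Header.Set("X-Demo-Role", callerRole)
	rec := httptest.NewRecorder()
	h(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("admin:", demo("admin"), "viewer:", demo("viewer"))
}
```

Keeping the check in a wrapper means individual handlers stay free of authorization logic.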
## Key Workflows
### Server Initialization and Background Processes
When the frontend starts, it doesn't just wait for requests; it initiates several background synchronization tasks to ensure the data served is fresh.
```text
Startup Sequence
|
|-- Load & Validate Config JSON
|-- Initialize Trace & Metadata Stores
|-- Start ParamSet Refresher (Periodic refresh of available trace keys)
|-- Start Continuous Clustering (If enabled, runs background regression detection)
|-- Initialize Notifiers (Email/Issue Tracker integrations)
|
V
Serve HTTP
```
### Git-to-UI Navigation (`gotoHandler`)
A common workflow involves navigating from a specific Git commit hash to its representation in the Perf UI. The `gotoHandler` manages this translation:
1. It resolves a Git hash to a `CommitNumber` using the `perfGit` provider.
2. It calculates a temporal window (range) around that commit.
3. It redirects the user to the appropriate sub-page (Explore, Clustering, or Triage) with the time-range parameters pre-populated in the URL.
## Design Decisions
### Configuration Validation
The module enforces strict validation on startup via the `testdata` fixtures. This ensures that misconfigurations (such as an empty `instance_name` or invalid connection strings) result in an immediate failure during deployment rather than subtle runtime errors or UI breakage.
### Non-Production Flexibility
To support CI/CD and staging environments, the system includes logic to override hostnames and strip environment-specific suffixes (like `-autopush`). This allows staging instances to use production-like configurations without requiring a complete duplication of the networking and authentication infrastructure.
### Multi-Backend Support
The frontend is built to be "backend agnostic" regarding anomaly storage. It can be configured to fetch data from legacy Chromeperf APIs or the modern SQL-based (Spanner/CockroachDB) implementation. This is managed via the `FetchAnomaliesFromSql` flag, which determines which implementation of the `TriageBackend` is injected into the API controllers.
# Module: /go/frontend/api
# Perf Frontend API Module
The `/go/frontend/api` module defines the HTTP interface for the Perf application. It serves as the orchestration layer between the web frontend and various backend services, including trace stores, regression detection engines, issue trackers, and the Chromeperf legacy system.
## Overview
This module follows a controller-based pattern where specific functional areas (Alerts, Anomalies, Graphs, Triage, etc.) are encapsulated in individual API structs. Each struct implements a `RegisterHandlers` method to attach its endpoints to a central Chi router.
The design emphasizes:
- **Abstraction of Backend Implementation**: The API layer interacts with interfaces (e.g., `Store`, `IssueTracker`) allowing the system to switch between Skia-native implementations and Chromeperf-compatible backends.
- **Asynchronous Processing**: Long-running operations, such as generating complex dataframes or running regression detection, use a progress-tracking pattern to avoid blocking HTTP connections.
- **Multi-Instance Compatibility**: Logic in `common.go` ensures that requests from non-production environments (staging, autopush) are correctly routed or identified, facilitating a seamless CI/CD flow.
## Functional Areas and Key Components
### Graph and Trace Data (`graphApi.go`, `mcpApi.go`)
Responsible for fetching and formatting performance trace data for visualization.
- **Frame Requests**: `graphApi` manages the "Start-Status-Results" lifecycle for building dataframes. This allows the UI to poll for progress while the backend processes large volumes of trace data.
- **Data Point Details**: Provides the "why" behind a specific data point by fetching source file metadata and point-specific links from the `ingestedFS`.
- **Model Context Protocol (MCP)**: `mcpApi` provides a specialized endpoint for LLM/Agentic tools to query trace data within specific time ranges and query parameters.
### Regression and Anomaly Management (`regressionsApi.go`, `anomaliesApi.go`)
Handles the detection, listing, and lifecycle of performance regressions.
- **Compatibility Layer**: `anomaliesApi` is designed to support both the legacy Chromeperf backend and the modern Skia-native storage. It uses `preferLegacy` flags to determine whether to proxy requests to Chromeperf or query the local `regStore` and `subStore`.
- **Group Reports**: Aggregates anomalies by Bug ID, Revision, or Anomaly Group ID to provide a holistic view of a performance change.
### Triage and Issue Tracking (`triageApi.go`, `triageBackend.go`)
Facilitates the workflow of turning detected anomalies into actionable bugs.
- **Backend Switching**: Through the `TriageBackend` interface, the system can either file bugs directly into the Issue Tracker (Skia-native) or proxy the request to Chromeperf's triage service.
- **Nudging and Resetting**: Allows users to refine anomaly boundaries or clear triage states, which involves complex coordination between the `regression.Store` and the commit history.
### Alerts and Subscriptions (`alertsApi.go`, `sheriffConfigApi.go`)
Manages the configuration that drives automated regression detection.
- **Dry Runs**: Supports testing alert configurations and bug templates before they go live via `alertBugTryHandler` and `alertNotifyTryHandler`.
- **LUCI Config Integration**: `sheriffConfigApi` provides metadata and validation endpoints used by LUCI Config to ensure that configuration changes in external repositories are valid before being ingested.
### Shortcuts and Favorites (`shortcutsApi.go`, `favoritesApi.go`)
Enhances user experience through personalization and shareability.
- **State Persistence**: `shortcutsApi` maps complex trace selections (lists of keys) to short IDs, enabling shareable URLs for specific graph views.
- **User Favorites**: Manages per-user links and sections, merging global instance-wide favorites with user-specific entries stored in the database.
## Key Workflows
### Asynchronous Data Loading
The following process is used for operations like `/v1/frame/start`:
```text
User Request       API Layer                  Progress Tracker      Backend Worker
|                  |                          |                     |
|-- POST /start -->|                          |                     |
|                  |-- Create Progress ID --> |                     |
|<-- Return ID ----|                          |                     |
|                  |-- Start Go Routine --------------------------->|
|                  |                          |                     |-- Fetch Data --|
|-- GET /status -->|                          |                     |                |
|<-- Progress % ---| <---- Query Status ------|                     |                |
|                  |                          |                     |-- Process -----|
|-- GET /results ->|                          |                     |                |
|<-- DataFrame ----| <---- Get Results -------| <--- Mark Done -----|
```
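The start-then-poll pattern in the diagram can be sketched with a small in-memory tracker. This is a minimal illustration, not Perf's progress package: `tracker`, `start`, and `status` are hypothetical names, and the result is a plain string rather than a DataFrame.

```go
package main

import (
	"fmt"
	"sync"
)

// job holds the state of one long-running request.
type job struct {
	done   bool
	result string
}

// tracker is a hypothetical in-memory progress tracker keyed by request ID.
type tracker struct {
	mu   sync.Mutex
	next int
	jobs map[string]*job
}

func newTracker() *tracker { return &tracker{jobs: map[string]*job{}} }

// start registers a job, runs work in a goroutine, and returns the ID
// immediately so the caller can poll for completion.
func (t *tracker) start(work func() string) string {
	t.mu.Lock()
	t.next++
	id := fmt.Sprintf("req-%d", t.next)
	j := &job{}
	t.jobs[id] = j
	t.mu.Unlock()

	go func() {
		res := work()
		t.mu.Lock()
		j.result, j.done = res, true
		t.mu.Unlock()
	}()
	return id
}

// status reports whether the job is known, whether it finished, and its result.
func (t *tracker) status(id string) (done bool, result string, ok bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	j, ok := t.jobs[id]
	if !ok {
		return false, "", false
	}
	return j.done, j.result, true
}

func main() {
	t := newTracker()
	release := make(chan struct{})
	// The worker blocks until released, so the first poll sees a running job.
	id := t.start(func() string { <-release; return "dataframe" })
	done, _, _ := t.status(id)
	fmt.Println(id, "done:", done)
	close(release)
}
```

Returning the ID before the work completes is what keeps the HTTP handler fast: the expensive query runs in the goroutine while the client polls.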
### Anomaly Triage Flow
The API handles triage by coordinating between the UI, the internal regression store, and external trackers:
```text
[UI] ----(EditAnomaliesRequest)----> [triageApi]
                                          |
                                   [TriageBackend]
                                        /   \
                         (Skia Native) /     \ (Chromeperf Proxy)
                                      /       \
                      [regStore.SetBugID]    [Chromeperf Client]
                               |                      |
                          [DB Update]        [External API POST]
```
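The backend switch can be illustrated with a reduced interface and two implementations. This is only a sketch: the single `FileBug` method, the `useChromeperf` flag, and the injected `send` function are assumptions made for illustration and do not reflect the real `TriageBackend` method set.

```go
package main

import (
	"errors"
	"fmt"
)

// triageBackend is a hypothetical, reduced version of the interface
// described above.
type triageBackend interface {
	FileBug(title string) (bugID int, err error)
}

// nativeBackend stands in for the Skia-native path that records bugs locally.
type nativeBackend struct{ nextID int }

func (n *nativeBackend) FileBug(title string) (int, error) {
	if title == "" {
		return 0, errors.New("empty title")
	}
	n.nextID++
	return n.nextID, nil
}

// proxyBackend stands in for the Chromeperf proxy. The send function is
// injected so the sketch needs no network access.
type proxyBackend struct {
	send func(title string) (int, error)
}

func (p *proxyBackend) FileBug(title string) (int, error) { return p.send(title) }

// newBackend picks an implementation from a single hypothetical flag.
func newBackend(useChromeperf bool, send func(string) (int, error)) triageBackend {
	if useChromeperf {
		return &proxyBackend{send: send}
	}
	return &nativeBackend{}
}

func main() {
	b := newBackend(false, nil)
	id, err := b.FileBug("regression in motion_mark")
	fmt.Println(id, err)
}
```

Callers depend only on the interface, so the handler code stays the same regardless of which backend is configured.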
## Design Decisions
### Non-Production Host Overriding
In `common.go`, the function `getOverrideNonProdHost` is used to strip suffixes like `-autopush` or `-staging`. This was implemented to allow testing environments to interact with production-like service configurations without needing to replicate the entire networking and authentication stack for every environment variant.
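A minimal sketch of this kind of suffix stripping follows. It assumes the suffix sits at the end of the first DNS label; the function name `overrideNonProdHost` and the example hostnames are illustrative and may not match the behavior of `getOverrideNonProdHost`.

```go
package main

import (
	"fmt"
	"strings"
)

// nonProdSuffixes lists the environment suffixes named in the text.
var nonProdSuffixes = []string{"-autopush", "-staging"}

// overrideNonProdHost strips one non-production suffix from the first DNS
// label of host and leaves the rest of the hostname untouched.
func overrideNonProdHost(host string) string {
	label, rest, hasRest := strings.Cut(host, ".")
	for _, s := range nonProdSuffixes {
		if strings.HasSuffix(label, s) {
			label = strings.TrimSuffix(label, s)
			break
		}
	}
	if hasRest {
		return label + "." + rest
	}
	return label
}

func main() {
	fmt.Println(overrideNonProdHost("perf-staging.example.com"))
	fmt.Println(overrideNonProdHost("perf.example.com"))
}
```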
### Trace Cleaning
`anomaliesApi.go` includes logic to "clean" test names. This is a defensive implementation against malformed trace IDs that might contain characters incompatible with URL parsing or specific database query engines. It uses a configurable regex (`InvalidParamCharRegex`) to ensure consistency across different Perf instances (e.g., Fuchsia vs. Skia).
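As a rough illustration of regex-based cleaning, the sketch below replaces every disallowed character with an underscore. The pattern and the replacement character are assumptions; each instance would supply its own expression through `InvalidParamCharRegex`.

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidParamChar is an illustrative pattern, not the one used by any
// particular Perf instance.
var invalidParamChar = regexp.MustCompile(`[^a-zA-Z0-9._-]`)

// cleanTestName replaces every character matched by re with an underscore.
func cleanTestName(re *regexp.Regexp, name string) string {
	return re.ReplaceAllString(name, "_")
}

func main() {
	fmt.Println(cleanTestName(invalidParamChar, "load:time?x=1"))
}
```

Passing the compiled expression as a parameter mirrors the configurable design: the cleaning logic stays the same while the set of rejected characters varies per instance.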
### Subscription Uniqueness
The `subscriptionsHandler` in `alertsApi.go` is designed to provide a flat list of all monitoring subscriptions. This is critical for the "Sheriff" view in the frontend, where users need to filter regressions based on their team's ownership rather than individual alert IDs.
# Module: /go/frontend/api/testdata
The `/go/frontend/api/testdata` directory serves as a controlled environment for testing the Perf frontend API and its configuration parsing logic. It provides a canonical example of a complete system configuration, allowing developers to verify how the application interprets complex settings without relying on a live production environment.
### Purpose and Design Decisions
The primary component of this module is `config.json`. This file is designed to simulate a realistic application state for integration tests and local development mocks. By centralizing these values, the project ensures that API endpoints—which often behave differently based on the underlying data store or authentication headers—can be tested against a predictable "ground truth."
A key design choice reflected in this data is the use of a local, file-based environment that mirrors production complexity. For instance, the configuration specifies a `cockroachdb` datastore type and a local directory for data ingestion. This allows the testing suite to validate the frontend's ability to handle SQL-backed data flows and ingestion triggers in an isolated sandbox.
### Key Components and Configuration Logic
The data within this module covers several critical functional areas of the Perf frontend:
- **Instance Metadata and Security**: Defines how the instance identifies itself (e.g., `chrome-perf-test`) and how it handles user identity. The `auth_config` specifies `X-WEBAUTH-USER` as the source of truth for identity, which is essential for testing authorization middleware and audit logging.
- **External Service Integration**: Encapsulates the configuration for issue trackers and notification systems. It includes references to secret management (e.g., Google Cloud Secret Manager paths for API keys), allowing the API to test the logic that fetches credentials without exposing actual production keys.
- **Data Ingestion and Schema**: Controls how the system views incoming performance data. By setting `use_regression2_schema: true`, the test data forces the application into a specific architectural path for regression detection, facilitating tests for the newer data schema.
- **Query and UI Customization**: Defines which parameters (like `arch`, `config`, or `bot`) are indexed for queries and provides a "Favorites" structure. This part of the configuration is used to verify that the frontend correctly renders navigation links and filters based on the configuration file rather than hardcoded values.
### Data Flow Overview
The following diagram illustrates how the configuration data in this module influences the behavior of the API during testing:
```text
+-----------------------+      +-------------------------+      +----------------------+
| /testdata/config.json | ---> | API Configuration Layer | ---> | Mock/Test Handlers   |
+-----------------------+      +-------------------------+      +----------------------+
            |                               |                               |
            | (Defines Auth)                | (Defines Storage)             | (Defines UI)
            v                               v                               v
[Header: X-WEBAUTH-USER]        [Conn: cockroachdb/demo]           [Favorites & Links]
```
### Usage in Implementation
Developers use this module to:
1. **Validate JSON Unmarshaling**: Ensure the Go structures in the `frontend/api` package align with the expected JSON format.
2. **Mock Environment Dependencies**: Use the `backend_host_url` and `git_repo_config` values to simulate cross-service communication.
3. **Sanitize Inputs**: The `invalid_param_char_regex` provides a standard for testing input validation across various API endpoints to prevent injection or malformed queries.
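
The unmarshaling check in item 1 might look like the sketch below. The struct is a reduced, hypothetical mirror of the configuration: `instance_name`, `auth_config`, and `invalid_param_char_regex` appear in the text, while the nested `header_name` field is an assumed name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// authConfig is a hypothetical subset; header_name is an assumed field name.
type authConfig struct {
	HeaderName string `json:"header_name"`
}

// instanceConfig mirrors only a few of the keys mentioned in the text.
type instanceConfig struct {
	InstanceName          string     `json:"instance_name"`
	AuthConfig            authConfig `json:"auth_config"`
	InvalidParamCharRegex string     `json:"invalid_param_char_regex"`
}

// parseConfig decodes raw JSON into the reduced struct.
func parseConfig(raw []byte) (instanceConfig, error) {
	var cfg instanceConfig
	err := json.Unmarshal(raw, &cfg)
	return cfg, err
}

func main() {
	raw := []byte(`{"instance_name":"chrome-perf-test","auth_config":{"header_name":"X-WEBAUTH-USER"}}`)
	cfg, err := parseConfig(raw)
	fmt.Println(cfg.InstanceName, cfg.AuthConfig.HeaderName, err)
}
```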
# Module: /go/frontend/mock
The `/go/frontend/mock` module provides a self-contained, high-fidelity mock server for the Perf application. Unlike simple unit tests or component-level demos, this server renders the actual production HTML templates and JavaScript bundles while simulating the entire backend API.
It is primarily used for:
- **Demonstrations:** Providing a "live" version of the Perf UI with deterministic data.
- **Integration Testing:** Serving as a target for E2E testing frameworks (like Puppeteer) via the `test_on_env` Bazel rule.
- **Frontend Development:** Allowing UI developers to iterate on features without needing a local instance of BigTable, Spanner, or authenticated microservices.
### Design Decisions and Implementation
#### High-Fidelity Rendering
The mock server uses the real `frontend.Frontend` logic to load and execute Go HTML templates found in `perf/pages/production`. It injects a specialized `mockContext` (defined in `frontend_mock_for_demo.go`) into these templates. This context mimics the global configuration usually provided by the production server, enabling features like the Pinpoint bisect button or specific chart tooltips that are toggleable via feature flags.
#### Stateless API Simulation
The backend is simulated using a set of hardcoded data structures in `frontend_mock_api_impl.go`. The implementation focuses on mimicking the _behavior_ of the Perf API rather than just returning static JSON:
- **Query Builder:** The `nextParamListHandler` simulates the hierarchical filtering of trace keys (e.g., selecting an "arch" narrows down the available "os" values).
- **Asynchronous Jobs:** Since the real Perf backend often processes graph requests asynchronously, the mock implements the `/frame/start` and `/status/{id}` pattern. It stores a "pending" query in memory and returns a "Finished" status with a mock dataframe when polled.
- **Deterministic Data:** Trace data is generated based on the length of the trace keys, ensuring that the graphs look consistent across different runs.
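
One way to get this kind of determinism is to derive every value from the key alone, as in the sketch below. The exact formula is an assumption for illustration; only the idea of seeding from the key length comes from the text.

```go
package main

import "fmt"

// mockTrace derives n values purely from the length of key, so the same key
// always produces the same series. The small repeating offset only exists to
// make the plotted line less flat.
func mockTrace(key string, n int) []float32 {
	base := float32(len(key))
	out := make([]float32, n)
	for i := range out {
		out[i] = base + float32(i%3)
	}
	return out
}

func main() {
	fmt.Println(mockTrace(",arch=arm,", 4))
}
```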
#### Environment Integration (`test_on_env`)
The server includes specific logic to support automated testing environments:
1. **Dynamic Port Allocation:** If the `ENV_DIR` environment variable is detected, the server listens on port `:0` to avoid collisions during parallel test execution.
2. **Readiness Signaling:** It writes its assigned port and a "ready" file to a specific directory. This allows a test runner to wait until the server is fully initialized before attempting to connect.
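
The two steps above can be sketched as follows. The file name `port`, the fallback address, and the helper names are placeholders rather than what `test_on_env` actually expects, and the sketch binds to the loopback interface instead of all interfaces.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strconv"
)

// listen binds to an OS-assigned port when envDir is non-empty (the ENV_DIR
// case in the text) and to fallbackAddr otherwise.
func listen(envDir, fallbackAddr string) (net.Listener, error) {
	if envDir != "" {
		return net.Listen("tcp", "127.0.0.1:0")
	}
	return net.Listen("tcp", fallbackAddr)
}

// writePort records the listener's port in envDir so a test runner can find it.
func writePort(envDir string, l net.Listener) (int, error) {
	port := l.Addr().(*net.TCPAddr).Port
	err := os.WriteFile(filepath.Join(envDir, "port"), []byte(strconv.Itoa(port)), 0o644)
	return port, err
}

// readPort is what a waiting test runner would do once the file appears.
func readPort(envDir string) (int, error) {
	b, err := os.ReadFile(filepath.Join(envDir, "port"))
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(string(b))
}

// roundTrip exercises the whole handshake in a temporary directory.
func roundTrip() (bound, seen int, err error) {
	dir, err := os.MkdirTemp("", "envdir")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	l, err := listen(dir, "127.0.0.1:8002")
	if err != nil {
		return 0, 0, err
	}
	defer l.Close()
	if bound, err = writePort(dir, l); err != nil {
		return 0, 0, err
	}
	seen, err = readPort(dir)
	return bound, seen, err
}

func main() {
	bound, seen, err := roundTrip()
	fmt.Println(bound == seen, err)
}
```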
### Key Components
| Component | Responsibility |
| :-------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `frontend_mock_for_demo.go` | Entry point. Configures the `chi` router, handles static asset serving (JS/CSS/Images), and defines the `render` helper that injects mock global state into HTML templates. |
| `frontend_mock_api_impl.go` | Contains the mock logic for all `/_/` API endpoints, including alerts, regressions, triage, favorites, and trace data retrieval. |
| `BUILD.bazel` | Defines the `mock_dist_files` filegroup, which collects all production UI assets (CSS, JS, Maps, Images) required to make the server functional without an external CDN. |
### Core Workflow: Data Retrieval
The following diagram illustrates how the mock server simulates the asynchronous data fetching process used by the frontend to render graphs:
```text
Browser Mock Server (frontend_mock_server)
| |
|-- POST /_/frame/start -->| 1. Stores Queries in m.currentQueries
| | 2. Returns status: "Running", url: "/_/status/demo-req"
|<-- JSON {status, url} ---|
| |
|-- GET /_/status/demo-req | 3. Retrieves stored queries
| | 4. Filters mockTraceData based on queries
|<-- JSON {status: "Finished", results: {dataframe}}
| |
| (Browser renders plot) |
```
### Mock State and Data
The server maintains a small amount of in-memory state (protected by `sync.Mutex`) to track the "current" query, but otherwise acts as a functional simulator of:
- **Parameters:** A fixed set of architectures (`arm`, `x86_64`) and OSs (`Android`, `Ubuntu`, etc.).
- **Anomalies:** Predetermined regression points linked to specific trace keys to test the triage and alerting UI.
- **Authentication:** Always reports the user as `user@google.com` with `admin` and `bisecter` roles.
# Module: /go/frontend/testdata
# Frontend Test Data
The `testdata` module serves as a centralized repository for configuration fixtures used during the unit testing and integration testing of the Perf frontend. It focuses on simulating various application states and ensuring that the configuration parser and validator logic are robust against malformed or boundary-case inputs.
### Purpose and Design Choice
The primary design goal for this module is to provide repeatable, immutable data structures that represent both valid and invalid application states. Rather than programmatically generating complex configuration objects in Go code, these JSON files allow developers to see the exact structure the frontend expects to ingest from environment-specific configuration maps.
Using separate files for specific failure modes (e.g., missing values or excessive string lengths) allows the test suite to run table-driven tests that target the validation logic in the frontend initialization phase. This approach ensures that the application fails gracefully with descriptive errors when given an invalid configuration, rather than panicking at runtime.
### Component Responsibilities
**Standard Configuration Fixture**
The `config.json` file represents a complete, valid configuration. It defines the operational environment for the frontend, including:
- **Data Store Connectivity**: Configures backend persistence (e.g., CockroachDB) and tile sizes for data processing.
- **Ingestion and Repository State**: Sets up the relationship between data sources (local directories or Pub/Sub topics) and the git history provider.
- **UI/UX Customizations**: Defines "favorites" sections, query parameter inclusions, and notification settings that control the end-user experience.
**Validation Boundary Fixtures**
The remaining files in this module are designed specifically to test the constraints of the `instance_name` field. This field is a critical identifier used in telemetry, logging, and external service integration.
- **Empty and Missing States**: `config_empty_instance_name.json` and `config_no_instance_name.json` are used to verify that the system correctly identifies missing required fields or prevents the use of empty strings where a unique identifier is expected.
- **Constraint Testing**: `config_long_instance_name.json` provides a value that exceeds standard character limits (typically 64 characters). This is used to test that the frontend validation logic prevents data that might be rejected by downstream cloud services or cause UI layout breakage.
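
The three boundary fixtures map naturally onto a small validation function. The sketch below is an assumption about what such a check could look like: the 64-character limit comes from the "typically 64 characters" note above, and `validateInstanceName` is a hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
)

// maxInstanceNameLen follows the "typically 64 characters" limit quoted above.
const maxInstanceNameLen = 64

// validateInstanceName rejects missing and over-long names with a
// descriptive error instead of letting them fail later at runtime.
func validateInstanceName(name string) error {
	switch {
	case name == "":
		return errors.New("instance_name is required")
	case len(name) > maxInstanceNameLen:
		return fmt.Errorf("instance_name is %d characters; limit is %d", len(name), maxInstanceNameLen)
	}
	return nil
}

func main() {
	for _, name := range []string{"chrome-perf-test", ""} {
		fmt.Printf("%q -> %v\n", name, validateInstanceName(name))
	}
}
```

A missing key and an empty string both decode to `""` in Go, which is why one branch can cover both the empty and the absent fixture in this sketch.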
### Configuration Parsing Workflow
The following diagram illustrates how these files are utilized during the application lifecycle testing:
```text
+-----------------+       +-----------------------+
| Test Runner     | ----> | Load JSON fixture     |
| (Frontend Unit) |       | from /testdata/       |
+-----------------+       +-----------+-----------+
                                      |
                                      v
+-----------------+       +-----------------------+
| Assert Expected | <---- | Execute Unmarshal and |
| Error/Success   |       | Validation Logic      |
+-----------------+       +-----------------------+
```
By checking against these predefined files, the frontend ensures that changes to the configuration structures in the Go source code do not silently break compatibility with existing configuration formats used in production environments.
# Module: /go/fuchsia_to_skia_perf
# Fuchsia to Skia Perf
`fuchsia_to_skia_perf` is a command-line utility designed to bridge the gap between Fuchsia's performance testing infrastructure and the Skia Performance monitoring system. It transforms performance data from Fuchsia's native JSON format into a schema that the Skia Perf ingestion pipeline can parse and visualize.
## High-Level Overview
The tool operates as a specialized data ETL (Extract, Transform, Load) pipeline. It extracts raw metrics from Fuchsia build artifacts, transforms them by calculating statistical aggregates and normalizing units, and loads them into a destination suitable for Skia Perf—either a local directory or a Google Cloud Storage (GCS) bucket.
The primary goal of this tool is to ensure that performance regressions in Fuchsia can be tracked using Skia's visualization tools, which require specific metadata (like "improvement direction") and a flat, trace-based data structure.
## Design Decisions
### Benchmark Granularity and Data Partitioning
Fuchsia test results often bundle multiple test suites into a single large JSON file. However, Skia Perf is optimized for partitioning data based on specific benchmarks. To align these two systems, the converter splits a single Fuchsia input record into multiple Skia Perf JSON files, keyed by the test suite name. This allows Skia to treat each suite as a distinct entity, improving query performance and visualization clarity.
### Statistical Aggregation and Visualization
Rather than just reporting raw values, the converter automatically generates two distinct types of entries for every metric:
1. **Comprehensive Stats:** A base entry containing the full statistical profile (Min, Max, Sum, Count, First, and Standard Deviation).
2. **Average-Focused (`_avg`):** A specialized entry focusing on the mean and error (standard deviation).
This dual-entry approach is a deliberate choice to support different visualization needs in Skia Perf: the `_avg` series provides a clean trend line for dashboards, while the base entry allows for deep-dive analysis into the variance and distribution of test results.
### Unit Normalization and Polarity
Fuchsia and Skia use different conventions for units and "improvement direction" (i.e., whether a higher or lower number is better). The module implements a mapping logic that:
- **Normalizes Units:** Converts various strings (e.g., `nanoseconds`, `ns`) to a canonical format (e.g., `ms`) and scales values accordingly (e.g., dividing by 1,000,000).
- **Infers Polarity:** Assigns a `smallerIsBetter` or `biggerIsBetter` flag based on the unit. It supports overrides within the input data, allowing developers to explicitly mark a metric (e.g., `ms_biggerIsBetter`) if the default assumption is incorrect.
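
A minimal sketch of this mapping is shown below. The lookup tables contain only the units named in the text, and the lowercase `mapUnitAndDirection` helper, its return shape, and the rule of splitting on the first underscore are assumptions rather than the module's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// canonical maps an input unit string to a Skia unit and a scale factor.
var canonical = map[string]struct {
	unit  string
	scale float64
}{
	"nanoseconds":  {"ms", 1e-6},
	"ns":           {"ms", 1e-6},
	"milliseconds": {"ms", 1},
	"ms":           {"ms", 1},
}

// defaultDirection holds the polarity assumed when the input has no override.
var defaultDirection = map[string]string{"ms": "smallerIsBetter"}

// mapUnitAndDirection resolves the canonical unit, improvement direction, and
// scale factor. An optional "_biggerIsBetter" or "_smallerIsBetter" suffix
// overrides the default direction.
func mapUnitAndDirection(input string) (unit, direction string, scale float64, err error) {
	raw, override, hasOverride := strings.Cut(input, "_")
	c, ok := canonical[raw]
	if !ok {
		return "", "", 0, fmt.Errorf("unknown unit %q", raw)
	}
	direction = defaultDirection[c.unit]
	if hasOverride {
		if override != "smallerIsBetter" && override != "biggerIsBetter" {
			return "", "", 0, fmt.Errorf("unknown direction %q", override)
		}
		direction = override
	}
	return c.unit, direction, c.scale, nil
}

func main() {
	fmt.Println(mapUnitAndDirection("ns"))
	fmt.Println(mapUnitAndDirection("ms_biggerIsBetter"))
}
```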
## Key Components
### Transformation Engine (`/convert/lib.go`)
This is the core logic of the module. It handles the lifecycle of a conversion run:
1. **Validation:** Ensures the input contains necessary metadata like `build_id` and `commit_id`.
2. **Grouping:** Categorizes measurements by their benchmark/test suite.
3. **Calculation:** Computes numerical statistics and applies unit scaling.
4. **Formatting:** Maps the processed data into the `SkiaPerfResult` schema.
```text
Processing Workflow:
[Fuchsia JSON] -> [Unmarshal] -> [Group by Benchmark]
|
v
[Calculate Stats] <--- [Map Units/Direction] <--- [Scale Values]
|
v
[Construct SkiaPerfResult] -> [Write Local File]
-> [Upload to GCS (Optional)]
```
### Data Models (`/convert/types.go`)
This component defines the structural contract between the two systems.
- **`FuchsiaPerfResults`**: Models the input, focusing on build metadata and raw measurement arrays.
- **`SkiaPerfResult`**: Models the output, which includes a `Key` map (defining the trace's identity—bot, master, benchmark) and the calculated result items.
### Entry Point (`main.go`)
The CLI wrapper handles environment-specific configuration. It manages:
- **Authentication:** Sets up Google Cloud credentials if GCS uploading is requested.
- **Configuration:** Parses flags to define where the data comes from and where it should go.
- **Path Partitioning:** If uploading to GCS, it organizes files into a `ingest/YYYY/MM/DD/` structure to facilitate efficient discovery by the Skia Perf ingester.
## Implementation Details
The tool generates output filenames using a specific pattern: `<build_id>-<benchmark>-<bot>-<master>.json`. This naming convention ensures uniqueness and provides enough context for administrators to manually inspect the ingestion bucket if necessary.
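The naming pattern and the dated GCS prefix can be combined as in this sketch. The helper names and the sample values are hypothetical; only the `<build_id>-<benchmark>-<bot>-<master>.json` pattern and the `ingest/YYYY/MM/DD/` layout come from the text.

```go
package main

import (
	"fmt"
	"path"
	"time"
)

// outputName builds "<build_id>-<benchmark>-<bot>-<master>.json".
func outputName(buildID, benchmark, bot, master string) string {
	return fmt.Sprintf("%s-%s-%s-%s.json", buildID, benchmark, bot, master)
}

// gcsObjectPath places name under ingest/YYYY/MM/DD/ for the given day.
func gcsObjectPath(day time.Time, name string) string {
	return path.Join("ingest", day.Format("2006/01/02"), name)
}

// utcDate is a small helper for building a calendar date in UTC.
func utcDate(y int, m time.Month, d int) time.Time {
	return time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
}

func main() {
	name := outputName("8765", "fuchsia.boot", "nuc", "fuchsia.global.ci")
	fmt.Println(gcsObjectPath(utcDate(2024, time.March, 9), name))
}
```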
When calculating the "error" metric for the average results, the tool utilizes a sample standard deviation. This provides a statistically sound representation of variance, which Skia Perf uses to render error bars in its UI.
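For reference, the sample standard deviation divides the sum of squared deviations by `n - 1` rather than `n`. The sketch below shows that calculation in isolation; `meanAndSampleStdDev` is an illustrative helper, not the function in `lib.go`.

```go
package main

import (
	"fmt"
	"math"
)

// meanAndSampleStdDev returns the arithmetic mean and the sample standard
// deviation (n-1 denominator). With fewer than two values the deviation is 0.
func meanAndSampleStdDev(xs []float64) (mean, sd float64) {
	n := float64(len(xs))
	if n == 0 {
		return 0, 0
	}
	for _, x := range xs {
		mean += x
	}
	mean /= n
	if n < 2 {
		return mean, 0
	}
	var sumSquares float64
	for _, x := range xs {
		d := x - mean
		sumSquares += d * d
	}
	return mean, math.Sqrt(sumSquares / (n - 1))
}

func main() {
	fmt.Println(meanAndSampleStdDev([]float64{1, 2, 3}))
}
```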
# Module: /go/fuchsia_to_skia_perf/convert
# Fuchsia to Skia Perf Conversion
The `convert` module provides the logic for transforming performance test results from the Fuchsia JSON format into the Skia Perf format. This conversion allows performance data generated by Fuchsia builders to be ingested and visualized by Skia's performance monitoring tools.
## Overview
The module functions as a data pipeline that reads a specific Fuchsia performance schema, normalizes units and improvement directions, calculates statistical aggregates, and outputs files compatible with Skia Perf's ingestion requirements.
### Design Decisions
- **Benchmark Granularity:** The converter splits a single Fuchsia input record (which may contain multiple test suites) into separate Skia Perf JSON files per benchmark (test suite). This aligns with how Skia Perf organizes data, where a "benchmark" is a primary key for partitioning performance traces.
- **Unit Normalization:** To ensure consistency across different test runners, the module maps various Fuchsia unit strings (e.g., `nanoseconds`, `ns`, `milliseconds`) to a canonical set of Skia units (e.g., `ms`). It also handles value scaling, such as converting nanoseconds to milliseconds or bytes to MiB.
- **Automated Statistics:** For every test result provided, the module automatically generates two Skia result items:
1. A base item containing comprehensive statistics (min, max, sum, count, first value, and standard deviation).
2. An `_avg` item that specifically focuses on the mean and error (standard deviation), which is often the primary metric for visualization.
- **Improvement Direction:** The module infers whether a metric should "go up" or "go down" to indicate improvement. It uses a default mapping based on the unit (e.g., `ms` defaults to `smallerIsBetter`) but allows the input data to override this via a suffix (e.g., `ms_biggerIsBetter`).
## Key Components
### Data Transformation Logic (`lib.go`)
This file contains the core processing engine. The conversion process follows this workflow:
```text
Fuchsia JSON Input
|
v
Unmarshal & Validate (Check BuildID, CommitID, etc.)
|
v
Group results by Test Suite (Benchmark)
|
+------> Calculate Stats (Min, Max, Avg, StdDev)
|
+------> Map Units & Improvement Direction
|
v
Construct SkiaPerfResult Object
|
+------> Write to Local Disk (if configured)
|
+------> Upload to GCS (if client provided)
```
- **`Run(cfg Config)`**: The entry point that orchestrates the file reading, grouping, and output generation.
- **`PopulateResults(perfResults)`**: Maps the raw measurements into the structured `SkiaResultItem` format, applying the dual-item (base + average) generation strategy.
- **`MapUnitAndDirection(input)`**: Resolves the final string used by Skia to determine both the unit type and the visualization polarity (up/down).
- **`CalculateStats(results)`**: Performs numerical analysis and unit conversion (e.g., `ns` to `ms`). It uses a sample standard deviation calculation for the error metric.
### Data Models (`types.go`)
Defines the structure of both the input and output formats.
- **`FuchsiaPerfResults`**: Represents the input schema, which is a list of build records, each containing metadata like `build_id`, `builder`, and `commit_id`, alongside an array of performance measurements.
- **`SkiaPerfResult`**: Represents the output schema. It includes a `Key` map (defining the trace identity) and a `Results` array containing the actual measurements and their associated metadata.
## Configuration
The conversion process is controlled via the `Config` struct, which specifies:
- **Data Sources**: The path to the input JSON file.
- **Destinations**: A local directory for output files and/or a Google Cloud Storage (GCS) bucket path.
- **Metadata**: The "Master" name and an optional date for GCS path partitioning (organized as `ingest/YYYY/MM/DD/`).
# Module: /go/git
# Perf Git Module
The `perf/go/git` module provides a high-level abstraction and persistence layer for Git repository data within the Perf system. Its primary responsibility is to bridge the gap between non-linear Git history (identified by hashes) and Perf's internal requirement for a linear, integer-based timeline (identified by `CommitNumber`).
## High-Level Overview
In the Perf ecosystem, performance data is plotted against a continuous x-axis. Because Git hashes are opaque and carry no ordering information, this module maps every relevant Git commit to a monotonically increasing `CommitNumber`.
The module performs three core functions:
1. **Ingestion**: Periodically polling a Git source (via the `provider` abstraction) to find new commits.
2. **Persistence**: Storing commit metadata (hash, author, subject, timestamp) in an SQL database (PostgreSQL or Spanner) to enable fast range queries without hitting the Git backend.
3. **Resolution**: Providing an API to translate between timestamps, Git hashes, and commit numbers.
## Design Decisions and Implementation Choices
### The Commit Number Mapping
A fundamental design choice in Perf is the use of `CommitNumber` as the primary coordinate for data. This module supports two ways of determining this number:
- **Sequential Assignment**: By default, as the module discovers new commits, it assigns the next available integer in the database.
- **Repo-Supplied (Regex)**: For repositories like Chromium that embed a monotonic position in the commit message (e.g., `Cr-Commit-Position: refs/heads/master@{#727989}`), the module can be configured with a regex to extract this number directly. This ensures that the `CommitNumber` in Perf matches the official project revision.
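
The repo-supplied mode can be sketched with a single capture group. The pattern below is an example written for the `Cr-Commit-Position` footer quoted above; a real instance would provide its own expression through configuration, and `commitNumberFromBody` is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

// commitPosition is an illustrative pattern for the Chromium-style footer.
var commitPosition = regexp.MustCompile(`Cr-Commit-Position: refs/heads/[^@\s]+@\{#(\d+)\}`)

// commitNumberFromBody extracts the number captured by the first group of re.
func commitNumberFromBody(re *regexp.Regexp, body string) (int, error) {
	m := re.FindStringSubmatch(body)
	if len(m) != 2 {
		return 0, errors.New("no commit position found")
	}
	return strconv.Atoi(m[1])
}

func main() {
	body := "Fix flaky test\n\nCr-Commit-Position: refs/heads/master@{#727989}"
	fmt.Println(commitNumberFromBody(commitPosition, body))
}
```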
### Caching for Performance
To minimize database load, the implementation utilizes an LRU (Least Recently Used) cache for commit details. Given that a typical commit entry is approximately 400 bytes, the cache is capped at 25,000 entries (roughly 10MB). This significantly speeds up the rendering of dashboards and alerts where the same recent commits are requested frequently.
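To make the eviction behavior concrete, here is a small LRU built on the standard library's `container/list`. It is a sketch of the technique only: the module may well use a third-party cache package, and `commitCache` with string values is a simplification.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is what each list node stores.
type entry struct {
	number int
	detail string
}

// commitCache is a fixed-capacity LRU keyed by commit number.
type commitCache struct {
	capacity int
	order    *list.List            // front is the most recently used entry
	items    map[int]*list.Element // commit number -> node in order
}

func newCommitCache(capacity int) *commitCache {
	return &commitCache{capacity: capacity, order: list.New(), items: map[int]*list.Element{}}
}

// get returns the cached detail and marks the entry as recently used.
func (c *commitCache) get(number int) (string, bool) {
	el, ok := c.items[number]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(entry).detail, true
}

// add inserts or refreshes an entry, evicting the least recently used one
// when the cache grows past its capacity.
func (c *commitCache) add(number int, detail string) {
	if el, ok := c.items[number]; ok {
		el.Value = entry{number, detail}
		c.order.MoveToFront(el)
		return
	}
	c.items[number] = c.order.PushFront(entry{number, detail})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(entry).number)
	}
}

func main() {
	c := newCommitCache(2)
	c.add(1, "first")
	c.add(2, "second")
	c.get(1)          // commit 1 becomes the most recently used
	c.add(3, "third") // evicts commit 2
	_, ok := c.get(2)
	fmt.Println("commit 2 cached:", ok)
}
```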
### Background Polling and Synchronization
The module follows a "sync-and-cache" pattern. It does not query the Git provider (Gitiles or local CLI) for every user request. Instead, it runs a background goroutine that pulls updates into the local SQL database. This ensures that even if the Git backend is temporarily slow or unavailable, the Perf UI remains responsive using the cached metadata.
## Key Components and Files
### `interface.go`
Defines the `Git` interface, which is the contract for all Git-related operations in Perf. This includes methods for range lookups (`CommitSliceFromTimeRange`), history traversal (`PreviousGitHashFromCommitNumber`), and file-specific auditing (`CommitNumbersWhenFileChangesInCommitNumberRange`).
### `impl.go`
The primary implementation of the `Git` interface. It manages the lifecycle of the background updater and contains the SQL logic for both PostgreSQL and Spanner.
- **Update Logic**: The `Update` method identifies the delta between the most recent hash in the database and the current `HEAD` of the repository, then streams and inserts the missing commits.
- **Collision Handling**: It uses `ON CONFLICT DO NOTHING` clauses to ensure that multiple service instances or rapid update cycles do not result in duplicate entries.
### `gittest/`
A specialized test harness that bootstraps a complete environment for integration testing. It creates a real Git repository with deterministic timestamps, initializes a test database, and provides a pre-populated set of hashes. This ensures that logic involving time-to-hash mapping can be tested without flakiness.
## Data Workflow: Update Cycle
The following diagram illustrates how the module synchronizes the database with the remote repository:
```text
[ Background Poller ]        [ SQL Database ]          [ Git Provider ]
|                            |                         |
|--- 1. Get Most Recent ---->|                         |
|    Commit from DB          |                         |
|<----- (Hash, Number) ------|                         |
|                            |                         |
|--- 2. Fetch New Commits ---------------------------->|
|    since <Hash>            |                         |
|<---------------------------- (Stream of Commit Objs) |
|                            |                         |
|--- 3. Extract Number ----> |                         |
|    (Regex or Incr)         |                         |
|                            |                         |
|--- 4. INSERT INTO Commits >|                         |
|    (Hash, Meta, No.)       |                         |
```
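The cycle above can be approximated in memory to show why repeated or overlapping updates are harmless. In this sketch a map stands in for the SQL table, skipping known hashes plays the role of `ON CONFLICT DO NOTHING`, and only the sequential numbering mode is modeled; `commitStore` and `update` are hypothetical names.

```go
package main

import "fmt"

// commitStore stands in for the Commits table.
type commitStore struct {
	numbers map[string]int // git hash -> CommitNumber
	next    int            // next sequential number to hand out
}

func newCommitStore() *commitStore { return &commitStore{numbers: map[string]int{}} }

// update assigns sequential numbers to hashes in the order given. Hashes that
// are already stored are skipped, so replaying a batch changes nothing.
// It returns how many commits were newly inserted.
func (s *commitStore) update(hashes []string) int {
	inserted := 0
	for _, h := range hashes {
		if _, exists := s.numbers[h]; exists {
			continue
		}
		s.numbers[h] = s.next
		s.next++
		inserted++
	}
	return inserted
}

func main() {
	s := newCommitStore()
	fmt.Println(s.update([]string{"aaa", "bbb"}))
	fmt.Println(s.update([]string{"bbb", "ccc"})) // "bbb" is skipped
	fmt.Println(s.numbers["ccc"])
}
```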
## Related Submodules
- **`provider/`**: Defines the low-level interface for fetching raw Git data.
- **`providers/`**: A factory module that selects between `git_checkout` (local CLI) and `gitiles` (REST API) based on the instance configuration.
- **`schema/`**: Defines the database table structure used to persist commit metadata.
- **`mocks/`**: Provides autogenerated mocks for unit testing components that depend on the `Git` interface.
# Module: /go/git/gittest
The `gittest` module provides a high-level test harness for the Perf system's Git integration. It is designed to bootstrap a realistic environment for integration tests, bridging the gap between raw Git repositories and the Perf service's data structures.
### Design and Intent
The primary goal of this module is to abstract away the repetitive setup required to test Git-based performance monitoring. Testing the Perf Git logic requires a complex state consisting of:
1. A valid Git repository with a deterministic commit history.
2. An initialized SQL database schema.
3. A local checkout (mirror) that the system can use to perform analysis.
4. Matching configurations that link these pieces together.
By providing a single constructor (`NewForTest`), this module ensures that tests across the Perf codebase use a consistent dataset, making it easier to verify algorithms that traverse history or map timestamps to commit hashes.
### Key Components
#### The Test Environment Lifecycle
The core of the module is the `NewForTest` function. It manages the orchestration of several distinct subsystems:
- **Repository Generation**: It uses `testutils.GitBuilder` to initialize a temporary Git repository. It populates this repository with a predefined sequence of commits starting at `StartTime` (Unix 1680000000), spaced exactly one minute apart. This predictability allows test authors to write assertions based on relative time offsets.
- **Database Provisioning**: It initializes a Spanner database instance for tests using `sqltest.NewSpannerDBForTests`. This provides the persistence layer where Perf stores metadata associated with Git commits.
- **Provider Abstraction**: It instantiates a `git_checkout.Provider`. This is the component responsible for actually interacting with the Git binary and the local filesystem, ensuring that the test environment behaves identically to a production deployment.
- **Automatic Cleanup**: The module utilizes `t.Cleanup` to ensure that temporary directories, database connections, and background processes (like the Git builder) are torn down immediately after a test completes, preventing resource leaks in the test runner.
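
Because the commits are spaced exactly one minute apart, the timestamp of the i-th commit can be computed directly. The sketch below assumes the first commit lands exactly on `StartTime`; `commitTimes` is an illustrative helper and not part of `gittest`.

```go
package main

import (
	"fmt"
	"time"
)

// startTime is the Unix timestamp quoted above for the first commit.
const startTime int64 = 1680000000

// commitTimes returns the timestamps of n commits spaced one minute apart.
func commitTimes(n int) []time.Time {
	out := make([]time.Time, n)
	for i := range out {
		out[i] = time.Unix(startTime+int64(i)*60, 0).UTC()
	}
	return out
}

func main() {
	for i, ts := range commitTimes(3) {
		fmt.Println(i, ts.Unix())
	}
}
```

A test can then assert on relative offsets, for example that a lookup for `startTime + 90` resolves to the second commit, without knowing any hash in advance.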
### Data Workflow
When `NewForTest` is called, the following process occurs:
```text
[ GitBuilder ] --(Creates Repo)--> [ Local .git Dir ]
       |                                    |
       | (Commits files at 1min intervals)  |
       v                                    v
[ Commit Hashes ] <----------- [ git_checkout.Provider ]
       |                                    |
       | (Configuration Links Them)         |
       v                                    v
[ InstanceConfig ] <---------- [ Spanner DB Instance ]
```
1. **Seed**: A local Git repository is created and populated with synthetic commits (`foo.txt`, `bar.txt`).
2. **Configure**: An `InstanceConfig` is generated, pointing the `GitRepoConfig.URL` to the `GitBuilder`'s directory and setting a temporary path for the local checkout.
3. **Sync**: The `git_checkout` provider is initialized, which effectively "clones" the builder's repo into the temporary directory.
4. **Return**: The function returns the context, the database handle, the builder, the ordered list of hashes, the provider, and the config object.
### Usage in Tests
The returned `hashes` slice is critical for testing. Because a Git commit hash is derived from the commit's content and metadata (including authorship and creation time), tests should not hardcode hash values; instead, the module returns the generated hashes in chronological order. Tests use these hashes to verify that the Git provider correctly identifies the "revision" associated with specific performance data points.
# Module: /go/git/mocks
The `go.skia.org/infra/perf/go/git/mocks` module provides a mock implementation of the `Git` interface used within the Perf system. This module is essential for unit testing components that interact with Git history, commit data, and repository metadata without requiring a live Git repository or network access to a Git provider.
### Purpose and Design
The primary goal of this module is to enable predictable, isolated testing of Perf's business logic. In the Perf system, the `Git` interface (defined in `interface.go` of the `perf/go/git` module) acts as the bridge between performance data and the source code history. Many operations, such as calculating regression ranges, mapping timestamps to commits, or identifying when specific files changed, rely on this interface.
By using these mocks, developers can:
1. **Simulate Edge Cases:** Easily test behavior when a commit is missing, a repository is empty, or a Git provider returns an error.
2. **Ensure Determinism:** Avoid flaky tests caused by changes in an external repository or network latency.
3. **Speed Up Tests:** Bypass the overhead of cloning repositories or executing shell commands to `git`.
### Key Components
#### The `Git` Struct
The core of this module is the `Git` struct found in `Git.go`. It is an autogenerated mock produced by [mockery](https://github.com/vektra/mockery), utilizing the `testify/mock` framework. It implements every method required by the Perf system's Git provider interface, including:
- **Resolution Methods:** Converting between `types.CommitNumber` (Perf's internal sequential index), Git hashes, and timestamps (e.g., `CommitNumberFromGitHash`, `CommitNumberFromTime`).
- **Retrieval Methods:** Fetching detailed commit metadata or slices of commits based on ranges (e.g., `CommitFromCommitNumber`, `CommitSliceFromTimeRange`).
- **Analysis Methods:** Identifying changes within a range (e.g., `CommitNumbersWhenFileChangesInCommitNumberRange`).
- **Lifecycle Methods:** Controlling the background state of the provider (e.g., `Update`, `StartBackgroundPolling`).
### Usage Workflow
When writing a test, you instantiate the mock using `NewGit`, which automatically registers cleanup functions to verify that all expected calls were made before the test finishes.
```text
Test Setup Phase
----------------
1. Call mocks.NewGit(t)
2. Define expectations using .On(...).Return(...)
Execution Phase
---------------
3. Pass the mock into the component being tested
4. Component calls Git methods (e.g., GitHashFromCommitNumber)
5. Mock returns the pre-defined values
Verification Phase
------------------
6. Test finishes
7. Cleanup function calls AssertExpectations
```
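The three phases can be illustrated without any third-party dependency. The sketch below is a hand-written stand-in, not the mockery-generated `mocks.Git`: `fakeGit`, `expect`, and `assertExpectations` are hypothetical names that only mimic the roles of `.On(...).Return(...)` and `AssertExpectations`, and the method signature is simplified (no `context.Context`, plain `int` instead of `types.CommitNumber`).

```go
package main

import (
	"errors"
	"fmt"
)

// expectation records a canned answer for one commit number (hypothetical type).
type expectation struct {
	hash   string
	err    error
	called bool
}

// fakeGit is a hand-written stand-in for a generated mock.
type fakeGit struct {
	expectations map[int]*expectation
}

func newFakeGit() *fakeGit {
	return &fakeGit{expectations: map[int]*expectation{}}
}

// expect plays the role of .On(...).Return(...): it registers a canned result.
func (f *fakeGit) expect(commitNumber int, hash string, err error) {
	f.expectations[commitNumber] = &expectation{hash: hash, err: err}
}

// GitHashFromCommitNumber returns the canned result, or an error when the
// call was never set up.
func (f *fakeGit) GitHashFromCommitNumber(commitNumber int) (string, error) {
	e, ok := f.expectations[commitNumber]
	if !ok {
		return "", fmt.Errorf("unexpected call with commit number %d", commitNumber)
	}
	e.called = true
	return e.hash, e.err
}

// assertExpectations plays the role of AssertExpectations: every registered
// expectation must have been used.
func (f *fakeGit) assertExpectations() error {
	for n, e := range f.expectations {
		if !e.called {
			return fmt.Errorf("expected call for commit number %d never happened", n)
		}
	}
	return nil
}

func main() {
	git := newFakeGit()                               // Setup phase
	git.expect(1, "abc123", nil)                      // happy path
	git.expect(2, "", errors.New("commit not found")) // simulated edge case
	hash, _ := git.GitHashFromCommitNumber(1)         // Execution phase
	fmt.Println(hash)
	fmt.Println(git.assertExpectations()) // Verification: commit 2 was never requested
}
```

In a real test the generated mock does this bookkeeping for you; the sketch only shows why an unused expectation or an unplanned call surfaces as a test failure.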
### Implementation Details
The mock is tightly coupled with:
- `go.skia.org/infra/perf/go/types`: For internal Perf types like `CommitNumber`.
- `go.skia.org/infra/perf/go/git/provider`: For the `Commit` data structure and the interface definition.
- `github.com/stretchr/testify/mock`: For the underlying mock engine.
Because the code is generated, the logic within `Git.go` focuses on checking types and returning values provided during the "Setup" phase of a test. If a method is called without a matching expectation, `testify/mock` fails the test (or panics when no test object is attached to the mock). If an expectation matches but no return value was specified, the generated code panics. Either signal points to an incomplete test configuration.
# Module: /go/git/provider
### High-Level Overview
The `provider` module establishes a uniform abstraction layer for interacting with Git repositories within the Skia infrastructure. Rather than forcing downstream consumers to handle the specifics of local Git CLI operations versus remote Gitiles API calls, this module defines a common interface and data structure for retrieving commit history, metadata, and file-specific changes.
By decoupling the data source from the data consumption, the module allows Perf and other services to remain agnostic about how repository data is physically fetched or stored.
### Design Rationale
The primary design goal is to provide a consistent view of repository history that is optimized for consumption by performance monitoring systems.
- **Sequential Processing via Callbacks**: The `CommitProcessor` pattern used in `CommitsFromMostRecentGitHashToHead` is designed for efficiency when processing large ranges of history. Instead of loading thousands of commits into memory simultaneously, the provider streams commits to the caller. This minimizes memory overhead during initial repository indexing or catch-up tasks.
- **Agnostic Backend**: The `Provider` interface is intentionally minimal. It assumes that the underlying implementation (whether it's a local `git` checkout or a network-based service like `Gitiles`) handles the complexities of authentication, caching, and network protocols.
- **Database-Friendly Commit Model**: The `Commit` struct serves as a bridge between Git's raw output and the internal database schema. It includes `CommitNumber` (a monotonic offset used for indexing in Perf) and utilizes JSON annotations that maintain compatibility with legacy `CommitDetail` structures.
- **Commit Body vs. Persistence**: A specific implementation choice is made in the `Commit` struct where the `Body` field is kept for parsing metadata (such as extracting specific commit numbers or Gerrit footers), but is explicitly noted as not being intended for database storage. This saves significant storage space while still providing the necessary context during the ingestion phase.
### Key Components
#### The Provider Interface
Defined in `provider.go`, this interface outlines the essential operations required to sync and query repository state:
- **Incremental Updates**: `Update` ensures the local view of the repository is current.
- **History Traversal**: `CommitsFromMostRecentGitHashToHead` enables "incremental" ingestion. By passing the last known Git hash, the provider can determine the delta between the database and the current HEAD, processing only new commits in chronological order.
- **Targeted Auditing**: `GitHashesInRangeForFile` allows the system to filter noise by identifying exactly when specific configuration or data files were modified within a range, rather than scanning every commit in the repository.
#### The Commit Model
The `Commit` struct represents the canonical version of a Git commit within the Skia ecosystem. It includes helper methods for human-readable output:
- **`Display`**: Generates a standardized short-form string (e.g., `7abc123 - 2 days ago - Fix memory leak`) used in UI logs and CLI outputs.
- **`HumanTime`**: Leverages the `go/human` package to convert Unix timestamps into relative durations, providing a more intuitive sense of "when" a change occurred compared to raw epoch values.
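As a rough illustration of the `Display` format described above, the following sketch builds the same three-part string. It is an assumption-laden stand-in: `humanAge` is a hypothetical replacement for the `go/human` package, `display` takes the current time as Unix seconds to stay deterministic, and the exact wording produced by the real helpers may differ.

```go
package main

import (
	"fmt"
	"time"
)

// Commit mirrors only the fields this sketch needs.
type Commit struct {
	GitHash   string
	Timestamp int64 // Unix seconds
	Subject   string
}

// humanAge is a hypothetical stand-in for the go/human package. It does not
// handle singular forms ("1 days ago").
func humanAge(d time.Duration) string {
	switch {
	case d >= 24*time.Hour:
		return fmt.Sprintf("%d days ago", int(d/(24*time.Hour)))
	case d >= time.Hour:
		return fmt.Sprintf("%d hours ago", int(d/time.Hour))
	default:
		return fmt.Sprintf("%d minutes ago", int(d/time.Minute))
	}
}

// display builds "<short hash> - <age> - <subject>".
func display(c Commit, nowUnix int64) string {
	short := c.GitHash
	if len(short) > 7 {
		short = short[:7] // Abbreviate long hashes to seven characters.
	}
	age := time.Duration(nowUnix-c.Timestamp) * time.Second
	return fmt.Sprintf("%s - %s - %s", short, humanAge(age), c.Subject)
}

func main() {
	c := Commit{GitHash: "7abc123def4567", Timestamp: 0, Subject: "Fix memory leak"}
	fmt.Println(display(c, 2*24*3600))
}
```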
### Workflow: Incremental Ingestion
The typical interaction between a consumer and the provider follows a "sync-and-stream" pattern to keep internal databases up to date with the remote repository.
```text
[ Consumer ] [ Provider ] [ Git Backend ]
| | |
|--- 1. Update() ----->| |
| |--- 2. git pull / API |
| |<-- 3. New Commits ---|
| | |
|--- 4. CommitsFrom(lastHash, callback) ----->|
| | |
| |--- 5. Parse Commits -|
| | |
|<-- 6. Invoke callback(Commit) [Repeated] ---|
| | |
|--- 7. Store in DB -->| |
```
This workflow ensures that the consumer only processes what is necessary and that the logic for "what has changed" remains encapsulated within the provider implementation.
# Module: /go/git/providers
### High-Level Overview
The `providers` module serves as a factory and abstraction layer for obtaining Git data in the Perf system. Its primary purpose is to instantiate a `provider.Provider` based on the system's configuration.
By abstracting the source of Git information, the rest of the Perf application can remain agnostic of whether it is interacting with a local disk-based repository or a remote web-based Gitiles instance. This flexibility allows Perf to scale across different infrastructure environments—from high-performance local setups to cloud-native deployments where persistent disk management is undesirable.
### Design and Implementation Choices
#### The Factory Pattern
The module implements a single factory function, `New`, which encapsulates the logic for selecting and initializing the appropriate Git provider. This centralizes the dependency management for Git access, ensuring that the calling code does not need to know about authentication scopes, HTTP clients, or local filesystem paths.
#### Provider Selection Logic
The choice of provider is driven by the `GitRepoConfig.Provider` setting in the `InstanceConfig`:
1. **Local Checkout (`git_checkout`):** Selected if the provider is explicitly set to `CLI` or left empty (default). This uses a local Git binary and a directory on disk.
2. **Gitiles (`gitiles`):** Selected when the provider is set to `Gitiles`. This bypasses the local disk and interacts with repositories via the Gitiles Web API.
#### Unified Authentication Management
A key responsibility of this module is preparing the necessary credentials for remote communication.
- For **Gitiles**, the factory automatically configures a `google.DefaultTokenSource` with the `auth.ScopeGerrit` scope. It then wraps this in a standard `httputils` client before passing it to the Gitiles implementation. This ensures that the provider is ready to perform authenticated API calls immediately upon creation.
- For **Local Checkout**, authentication is handled internally by the `git_checkout` module (via `gitauth` and cookie files), but the factory ensures the correct environment configuration is passed down.
### Key Components and Responsibilities
#### `builder.go`
This is the entry point for the module. It manages the imports for all supported backend implementations (`git_checkout` and `gitiles`). It acts as the "glue" that translates configuration strings into functional Go objects.
#### `provider.Provider` Interface
While defined in an external package (`//perf/go/git/provider`), this interface is the "contract" that this module fulfills. Any provider returned by the factory is guaranteed to support:
- Fetching commits in chronological order.
- Tracking history for specific files.
- Retrieving commit metadata (author, timestamp, message).
- Synchronizing with the remote source.
### Key Workflows
#### Provider Initialization Workflow
The following diagram illustrates how the `New` function determines which implementation to return:
```text
[ InstanceConfig ]
|
v
Check GitRepoConfig.Provider
|
+--- [ empty ] or "CLI" ----> [ Initialize git_checkout ]
| |
| v
| Check/Clone local directory
| |
| v
| Return git_checkout.Impl
|
+--- "Gitiles" -------------> [ Initialize Gitiles ]
| |
| Create OAuth2 Token
| |
| Setup HTTP Client
| |
| v
| Return gitiles.Gitiles
|
+--- [ Other ] -------------> Return Error (Invalid Type)
```
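The selection logic in the diagram amounts to a single switch statement. The sketch below is a simplified assumption, not the contents of `builder.go`: the `Provider` interface is reduced to one hypothetical `Name` method, the config fields `URL` and `Dir` are invented for illustration, and the cloning and OAuth2 setup steps are replaced by comments.

```go
package main

import "fmt"

// Provider is a trimmed stand-in for provider.Provider.
type Provider interface {
	Name() string
}

type gitCheckout struct{ dir string }

func (g gitCheckout) Name() string { return "git_checkout" }

type gitilesProvider struct{ url string }

func (g gitilesProvider) Name() string { return "gitiles" }

// GitRepoConfig holds hypothetical fields for this sketch.
type GitRepoConfig struct {
	Provider string // "", "CLI", or "Gitiles", as in the diagram above
	URL      string
	Dir      string
}

// New picks a backend based on the configured provider string.
func New(cfg GitRepoConfig) (Provider, error) {
	switch cfg.Provider {
	case "", "CLI":
		// The real factory would check for, or clone, the local directory here.
		return gitCheckout{dir: cfg.Dir}, nil
	case "Gitiles":
		// The real factory would build a token source and HTTP client here.
		return gitilesProvider{url: cfg.URL}, nil
	default:
		return nil, fmt.Errorf("invalid git provider type: %q", cfg.Provider)
	}
}

func main() {
	for _, name := range []string{"", "CLI", "Gitiles", "svn"} {
		p, err := New(GitRepoConfig{Provider: name})
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(p.Name())
	}
}
```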
# Module: /go/git/providers/git_checkout
### High-Level Overview
The `git_checkout` module provides a Git repository provider for the Perf system. It implements the `provider.Provider` interface by wrapping a local Git checkout and executing `git` commands via system calls. This module is designed for environments where a persistent, on-disk Git clone is preferred for performance or local tool integration, allowing Perf to synchronize with a remote repository and query its history.
### Design and Implementation Choices
#### External Process Execution
The primary design choice is "shelling out" to the system's `git` executable rather than using a pure Go Git implementation. This ensures full compatibility with all Git features, including complex authentication schemes (like Gerrit/git-cookie) and standard performance optimizations that the native Git binary provides.
#### State and Lifecycle
The module manages a local directory (specified in the `InstanceConfig`).
- **Initialization**: Upon creation, it verifies the existence of the directory. If it doesn't exist, it performs an initial `git clone`.
- **Synchronization**: The `Update` method performs a `git pull` to bring the local checkout up to date with the remote tracking branch.
- **Commit Tracking**: The provider supports a `startCommit` configuration. This acts as a logical "horizon"; the provider can be configured to ignore history preceding this commit, which is useful for large repositories where only recent history is relevant to performance tracking.
#### Authentication
The module integrates with Google Cloud's `google.DefaultTokenSource` to handle Gerrit authentication. When enabled, it uses `gitauth` to manage a `/tmp/git-cookie` file, ensuring that the background `git` processes have the necessary credentials to interact with protected remote repositories.
#### Efficient Log Parsing
To avoid loading large amounts of git history into memory at once, the module uses a streaming parser (`parseGitRevLogStream`). It pipes the output of `git rev-list` directly into a scanner that processes commits one-by-one, invoking a callback for each. This design allows the system to process thousands of commits with a constant memory footprint.
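A minimal sketch of such a streaming parser follows. It assumes each commit arrives as a four-line record (`commit <hash>`, author, subject, Unix timestamp); that record layout, the `parseRevList` name, and the `parseString` helper are assumptions for illustration, not the actual `parseGitRevLogStream` implementation.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// Commit holds the fields extracted from one record.
type Commit struct {
	GitHash   string
	Author    string
	Subject   string
	Timestamp int64
}

// parseRevList reads four-line records from r and invokes cb for each commit
// as soon as it is complete, so only one commit is held in memory at a time.
func parseRevList(r io.Reader, cb func(Commit) error) error {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		header := scanner.Text()
		if !strings.HasPrefix(header, "commit ") {
			return fmt.Errorf("expected a commit header, got %q", header)
		}
		c := Commit{GitHash: strings.TrimPrefix(header, "commit ")}
		fields := make([]string, 0, 3)
		for len(fields) < 3 && scanner.Scan() {
			fields = append(fields, scanner.Text())
		}
		if len(fields) != 3 {
			return fmt.Errorf("truncated record for commit %s", c.GitHash)
		}
		ts, err := strconv.ParseInt(fields[2], 10, 64)
		if err != nil {
			return fmt.Errorf("bad timestamp for commit %s: %w", c.GitHash, err)
		}
		c.Author, c.Subject, c.Timestamp = fields[0], fields[1], ts
		if err := cb(c); err != nil {
			return err
		}
	}
	return scanner.Err()
}

// parseString collects commits into a slice. It exists only for the demo and
// deliberately gives up the constant-memory property.
func parseString(s string) ([]Commit, error) {
	var out []Commit
	err := parseRevList(strings.NewReader(s), func(c Commit) error {
		out = append(out, c)
		return nil
	})
	return out, err
}

func main() {
	log := "commit aaa111\nAda <ada@example.com>\nFirst change\n1584837783\n" +
		"commit bbb222\nBob <bob@example.com>\nSecond change\n1584837790\n"
	commits, err := parseString(log)
	fmt.Println(len(commits), err)
}
```

In production the reader would be the stdout pipe of the `git` process rather than a string, which is what keeps memory use flat regardless of history length.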
### Key Components and Responsibilities
#### `Impl` Struct
The core implementation of the `provider.Provider` interface. It maintains the absolute path to the Git executable and the repository's location on disk.
#### Commit Retrieval (`CommitsFromMostRecentGitHashToHead`)
Retrieves new commits since a given hash. It utilizes `git rev-list` with a range (e.g., `hash..HEAD`) and specific formatting flags to extract the author, subject, and Unix timestamp.
- If no recent hash is provided, it falls back to the `startCommit`.
- If neither is available, it starts from the beginning of the repository's history reachable from `HEAD`.
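The fallback order above can be captured in a small helper. This is a sketch under assumptions: `revisionRange` is a hypothetical name, and whether the real module expresses the `startCommit` case as `startCommit..HEAD` (which excludes `startCommit` itself) is not confirmed by the text.

```go
package main

import "fmt"

// revisionRange picks the revision argument for `git rev-list`.
// Note that "A..HEAD" lists commits reachable from HEAD but not from A,
// so A itself is excluded.
func revisionRange(mostRecentHash, startCommit string) string {
	switch {
	case mostRecentHash != "":
		return mostRecentHash + "..HEAD" // Only commits newer than the last ingested one.
	case startCommit != "":
		return startCommit + "..HEAD" // Assumed form of the configured horizon.
	default:
		return "HEAD" // Everything reachable from HEAD.
	}
}

func main() {
	fmt.Println(revisionRange("abc123", "def456"))
	fmt.Println(revisionRange("", "def456"))
	fmt.Println(revisionRange("", ""))
}
```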
#### File-Specific Queries (`GitHashesInRangeForFile`)
Finds all commits within a range that modified a specific file. This is crucial for Perf's "blame" or "trace" features where changes to specific configuration or data files need to be tracked. It translates the request into a `git log --format=%H -- <filename>` command.
#### Metadata Extraction (`LogEntry`)
Provides the full, human-readable commit message and metadata for a specific hash using `git show -s`. This is typically used for UI display when a user inspects a specific point on a performance trace.
### Key Workflows
#### New Provider Initialization
The following diagram illustrates the workflow when `New()` is called:
```text
Config -> [ Auth Check ] -> ( Gerrit Auth via gitauth/git-cookie )
|
v
[ Find Git Binary ] -> ( Resolve Absolute Path )
|
v
[ Repo Check ] ------> ( Directory Exists? )
| |
| No | Yes
v v
( Run git clone ) ( Use existing dir )
| |
+----------+----------+
|
v
Return Impl{}
```
#### Synchronizing and Processing Commits
The workflow for identifying new work typically follows this pattern:
```text
Caller -> Update() -> [ git pull ]
|
+-> CommitsFromMostRecentGitHashToHead(last_hash)
|
+-> [ git rev-list last_hash..HEAD --pretty ]
|
+-> ( Stdout Pipe )
|
v
[ parseGitRevLogStream ] -> ( Callback for each Commit )
```
# Module: /go/git/providers/gitiles
### Overview
The `gitiles` module provides an implementation of the `provider.Provider` interface that interacts with Git repositories via the Gitiles Web API. In the context of the Perf system, this module is responsible for discovering new commits, retrieving commit metadata, and filtering history for specific files without requiring a local checkout of the repository.
By using Gitiles, the system can operate in environments where local disk space is at a premium or where maintaining a constantly updated local git clone is inefficient. It acts as a bridge between the high-level performance tracking logic and the remote version control system.
### Key Components and Responsibilities
#### `Gitiles` Struct
The core of the module is the `Gitiles` struct, which encapsulates the logic for communicating with a remote Gitiles instance. It stores configuration such as the repository URL, the target branch, and an optional starting commit to limit the scope of history processing.
#### Commit Ingestion and Processing
The primary responsibility of this module is to stream commits from a known point up to the current HEAD. This is handled by `CommitsFromMostRecentGitHashToHead`.
- **Design Choice (Batching):** Instead of fetching one commit at a time, which would be prohibitively slow over HTTP, the module uses `LogFnBatch` to fetch commits in batches (defaulting to 100). This reduces round-trip overhead while maintaining a manageable memory footprint.
- **Design Choice (Ordering):** The module uses the `gitiles.LogReverse()` option. This ensures that commits are processed in chronological order (oldest to newest), which is critical for the Perf system to build its internal representation of history linearly.
- **Branch Handling:** The implementation distinguishes between the "main" branch and side branches. For the main branch, it typically requests a range from a hash to `HEAD`. For other branches, it uses the fully qualified branch name and a starting commit offset to ensure it tracks the correct line of development.
#### File-Specific History
The `GitHashesInRangeForFile` method allows the system to query history for specific paths. This is used when the system needs to determine which commits actually modified a specific configuration file or test suite, allowing it to skip irrelevant commits during analysis.
#### Metadata Retrieval
The `LogEntry` method provides a standardized way to retrieve a formatted string containing commit details (Author, Date, Subject, Body). This is used for displaying commit information in the Perf UI.
### Workflows
#### Commit Discovery Workflow
When the Perf system needs to update its view of the world, it triggers a discovery process through this provider:
```text
[Perf System] -> Call CommitsFromMostRecentGitHashToHead(last_known_hash)
|
v
[Gitiles Provider] -> Determine branch expression (e.g., "refs/heads/main")
|
v
[Gitiles Provider] -> Request batch of commits from Gitiles API (Reversed)
|
+--< [Batch Received]
| |
| v
| [CommitProcessor Callback] -> (Perf system stores/indexes commit)
| |
+----------+ (Repeat until HEAD reached)
|
v
[Perf System] -> Update complete
```
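The batch-then-callback loop in the diagram can be sketched without any network access. In this illustration a slice stands in for the remote history and each slice window stands in for one API response; `processInBatches` is a hypothetical name and does not reflect the signature of `LogFnBatch`.

```go
package main

import "fmt"

// Commit carries only the hash in this sketch.
type Commit struct{ GitHash string }

// processInBatches walks history (oldest first) in windows of batchSize,
// invoking cb for every commit. It returns the number of simulated round trips.
func processInBatches(history []Commit, batchSize int, cb func(Commit) error) (int, error) {
	if batchSize <= 0 {
		return 0, fmt.Errorf("batch size must be positive, got %d", batchSize)
	}
	roundTrips := 0
	for start := 0; start < len(history); start += batchSize {
		end := start + batchSize
		if end > len(history) {
			end = len(history)
		}
		batch := history[start:end] // One simulated API response.
		roundTrips++
		for _, c := range batch {
			if err := cb(c); err != nil {
				return roundTrips, err
			}
		}
	}
	return roundTrips, nil
}

func main() {
	history := make([]Commit, 250)
	for i := range history {
		history[i] = Commit{GitHash: fmt.Sprintf("hash-%03d", i)}
	}
	processed := 0
	trips, err := processInBatches(history, 100, func(c Commit) error {
		processed++
		return nil
	})
	fmt.Println(trips, processed, err)
}
```

With a batch size of 100, 250 commits need three requests instead of 250, while at most 100 commits are held in memory at once.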
### Implementation Details
- **Provider Interface:** The module explicitly validates that it satisfies the `provider.Provider` interface at compile time: `var _ provider.Provider = (*Gitiles)(nil)`.
- **Stateless Updates:** The `Update` method is a no-op in this provider. Unlike the local checkout provider, which must run `git pull` to refresh its on-disk state, the Gitiles provider is naturally up-to-date because it queries the remote API directly for every request.
- **Error Handling:** The module uses `go.skia.org/infra/go/skerr` to wrap errors from the Gitiles client, providing context on whether a failure occurred during batch loading, log retrieval, or callback processing.
# Module: /go/git/schema
# Perf Git Schema
The `schema` module defines the foundational data structures and database schema for tracking Git commits within the Perf system. This module serves as the "source of truth" for how commit metadata is persisted and mapped between the relational database and Go types.
## Design Philosophy: Mapping History to Integers
A core requirement of the Perf system is the ability to map Git history to a continuous range of integers, referred to as `CommitNumber`. Git identifies commits by hashes that carry no ordering, whereas the Perf database requires a strictly increasing integer key to efficiently handle time-series data, range queries, and regression detection.
The `Commit` struct acts as the bridge between these two worlds. It pairs the immutable Git metadata (hash, author, timestamp) with a monotonically increasing `CommitNumber` that defines the commit's position in the Perf system's timeline.
## Key Components
### The Commit Struct
The `Commit` struct is designed to be used directly with an SQL-based ORM or schema generator. Its fields are chosen to satisfy the requirements of both the UI (showing author and subject) and the backend analytical engines (filtering by time or commit range).
- **CommitNumber**: This is the primary key. It is used as the coordinate for the x-axis in Perf graphs. By using an integer as the primary key rather than the Git hash, the system ensures that queries for "the last N commits" or "commits between X and Y" are highly performant.
- **GitHash**: Stored as a unique, non-null string to ensure data integrity. This allows the system to resolve external Git references back to the internal `CommitNumber`.
- **Timestamp**: Stored as a Unix timestamp (seconds). This allows the system to correlate commits with the wall-clock time the data was ingested or produced, which is critical for identifying infrastructure-related regressions.
- **Metadata (Author/Subject)**: These fields are included to provide context in the Perf UI without requiring a secondary lookup to a Git host (like Gitiles) during the initial rendering of search results or alerts.
## Data Workflow
The schema facilitates a workflow where raw Git data is ingested and "indexed" into the Perf database:
```text
Git Repository Ingestion Process Perf Database (Schema)
+------------+ +-----------------------+ +---------------------------+
| SHA: a1b2c | ------> | 1. Assign CommitNumber| ----> | PK: CommitNumber (e.g. 5) |
| Author: .. | | 2. Extract Metadata | | Hash: a1b2c |
| Subject:.. | | 3. Insert into DB | | Timestamp: 1672531200 |
+------------+ +-----------------------+ +---------------------------+
```
## Legacy Compatibility
The struct includes JSON annotations designed to maintain serialization parity with the legacy `cid.CommitDetail` types. This implementation choice allows the backend to transition to the new schema-driven database approach without breaking existing frontend consumers that expect a specific JSON shape when requesting commit details.
# Module: /go/graphsshortcut
### High-Level Overview
The `graphsshortcut` module provides the core data structures and interfaces for managing graph shortcuts within the Perf system. A "shortcut" is a persistent snapshot of a user's dashboard configuration—including multiple graph definitions, queries, and formulas—represented by a unique, content-addressed ID.
This module acts as the domain layer for the "permalink" and "multigraph" features, allowing complex visualizations to be shared via short URLs. By decoupling the shortcut definition from the specific storage implementation, it enables the system to support diverse backends like SQL databases for production and in-memory caches for local development.
### Design Decisions and Implementation Choices
#### Content-Addressable Identification
The module implements a deterministic ID generation strategy in the `GetID()` method. The ID is an MD5 hash derived from the contents of the `GraphsShortcut` object.
- **Deduplication**: Because the ID is based on the content, identical graph configurations will always result in the same ID. This prevents the storage layer from accumulating redundant entries for identical shortcuts.
- **Normalization**: Before hashing, the module sorts all `Queries` and `Formulas` within each `GraphConfig`. This ensures that two shortcuts representing the same data but created with different UI selection orders are treated as identical and produce the same ID.
- **Order Sensitivity**: While internal query/formula order is normalized, the order of the `Graphs` array itself _is_ preserved in the hash. This is a deliberate choice because the sequence of graphs on a dashboard is part of the user's intended layout.
#### Interface-Driven Persistence
The module defines a `Store` interface rather than a concrete implementation. This allows the core logic of Perf to remain agnostic of the underlying database technology. It provides a contract for two primary operations:
- `InsertShortcut`: Storing a configuration and returning its unique ID.
- `GetShortcut`: Retrieving a configuration based on its ID.
### Key Components
#### Data Models (`graphsshortcut.go`)
- **`GraphConfig`**: Represents the parameters for a single visualization. It bundles `Queries` (trace filters), `Formulas` (mathematical transformations), and `Keys` (specific trace identifiers).
- **`GraphsShortcut`**: A container for one or more `GraphConfig` objects. This allows a single shortcut to represent an entire dashboard of multiple graphs rather than just a single trace.
#### The Store Interface
The `Store` interface serves as the gateway to persistence. Implementations of this interface (found in the sibling `graphsshortcutstore` module) handle the complexities of JSON serialization and database interactions. By keeping the interface in this base module, the system avoids circular dependencies between the storage logic and the domain objects.
### Identification Workflow
The following diagram illustrates how the module ensures that shortcut IDs are generated consistently, regardless of how the user interacted with the UI to create the queries.
```text
[ User Input ] [ GraphsShortcut.GetID() ] [ Result ]
| | |
| 1. Create Shortcut | |
| Graph A: | |
| - arch=arm | |
| - config=8888 | |
+---------------------------->| |
| 2. Sort Queries/Formulas |
| (arch, config) |
|-------------------+ |
| | |
|<------------------+ |
| |
| 3. MD5 Hash Content |
|-------------------+ |
| | |
|<------------------+ |
| |
| 4. Return Hex String |
+--------------------------->| "c21e3c..."
```
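The normalize-then-hash steps can be sketched as follows. This is not the module's actual `GetID()`: the byte layout fed to MD5 (the `g`, `q:`, `f:`, and `k:` prefixes) is invented for the example, `Keys` is assumed to be a string, and the sketch sorts copies rather than the original slices.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"sort"
)

// GraphConfig describes one graph; field types are assumptions.
type GraphConfig struct {
	Queries  []string
	Formulas []string
	Keys     string
}

// GraphsShortcut is an ordered list of graphs.
type GraphsShortcut struct {
	Graphs []GraphConfig
}

// GetID hashes the shortcut after sorting queries and formulas inside each
// graph. The order of the graphs themselves is preserved.
func (g GraphsShortcut) GetID() string {
	h := md5.New()
	for _, graph := range g.Graphs {
		queries := append([]string(nil), graph.Queries...)
		formulas := append([]string(nil), graph.Formulas...)
		sort.Strings(queries)
		sort.Strings(formulas)
		fmt.Fprint(h, "g\n") // Marks the start of a graph.
		for _, q := range queries {
			fmt.Fprintf(h, "q:%s\n", q)
		}
		for _, f := range formulas {
			fmt.Fprintf(h, "f:%s\n", f)
		}
		fmt.Fprintf(h, "k:%s\n", graph.Keys)
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := GraphsShortcut{Graphs: []GraphConfig{{Queries: []string{"config=8888", "arch=arm"}}}}
	b := GraphsShortcut{Graphs: []GraphConfig{{Queries: []string{"arch=arm", "config=8888"}}}}
	fmt.Println(a.GetID() == b.GetID()) // Same content, different selection order.
}
```

Swapping two entries of `Graphs` changes the bytes that reach the hash, so dashboards with a different layout get different IDs, as described under "Order Sensitivity".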
# Module: /go/graphsshortcut/graphsshortcutstore
### High-Level Overview
The `graphsshortcutstore` module provides implementations for persisting and retrieving graph shortcuts in the Perf system. A graph shortcut is essentially a saved state of multiple graphs—such as trace filters, queries, and display settings—that can be referenced via a unique ID.
This module fulfills a critical role in the "permalink" and "multigraph" features of Perf, allowing users to share or revisit complex dashboard configurations without encoding the entire state into a URL.
### Design Decisions and Implementation Choices
The module is designed around the `graphsshortcut.Store` interface, with two distinct implementations tailored for different environment constraints:
- **Production Persistence (SQL)**: The standard implementation uses an SQL database (typically Spanner) for durable storage. It treats the database as a content-addressable store where the primary key is the shortcut ID and the payload is a JSON-serialized blob. This approach provides durability and global availability across production instances.
- **Local Development & Debugging (Cache)**: A specialized implementation, `cacheGraphsShortcutStore`, uses an in-memory or distributed cache rather than a database. This was specifically designed to solve the "breakglass" problem: when developers connect a local instance to a production database for debugging, they often lack write permissions to the production SQL tables. By routing shortcut writes to a local cache, developers can still use features like "multigraph" and shortcut generation without needing elevated database privileges.
- **JSON Serialization**: Both implementations serialize the `GraphsShortcut` struct into JSON before storage. This avoids the need for a complex relational schema for the graph configurations themselves, which are frequently subject to UI-driven changes. By storing them as blobs, the store remains agnostic to the internal structure of the graph data.
### Key Components
#### GraphsShortcutStore
Located in `graphsshortcutstore.go`, this is the primary SQL-backed implementation. It manages the lifecycle of shortcuts using two main operations:
- **InsertShortcut**: Encodes the shortcut to JSON and performs an `INSERT ... ON CONFLICT (id) DO NOTHING`. The "Do Nothing" strategy is used because shortcut IDs are typically derived from the hash of their content; if the ID already exists, the content is identical, and no update is necessary.
- **GetShortcut**: Retrieves the JSON blob by ID and decodes it back into the domain objects.
#### cacheGraphsShortcutStore
Located in `cachegraphsshortcutstore.go`, this implementation wraps a `cache.Cache` interface. It mirrors the logic of the SQL store—serializing data to JSON—but directs the output to a cache. This is the preferred implementation for local development environments.
#### Testing Infrastructure
The module utilizes a suite of subtests defined in `graphsshortcuttest`. This allows the SQL implementation to be verified against a real database instance (via `sqltest.NewSpannerDBForTests`) while ensuring it adheres to the standard behavior expected by the rest of the Perf application.
### Shortcut Lifecycle Workflow
The following diagram illustrates how a shortcut moves from the application into persistent storage:
```text
[ Perf UI / Logic ] [ graphsshortcutstore ] [ Storage (SQL or Cache) ]
| | |
| 1. Create GraphsShortcut | |
|----------------------------->| |
| | 2. Serialize to JSON |
| |------------------+ |
| | | |
| |<-----------------+ |
| | |
| | 3. Store (ID, JSON) |
| |---------------------------->|
| 4. Return ID (Hash) | |
|<-----------------------------| |
| | |
| 5. Request ID (Permalink) | |
|----------------------------->| |
| | 6. Fetch JSON by ID |
| |<----------------------------|
| | |
| | 7. Deserialize JSON |
| |------------------+ |
| | | |
| 8. Return Shortcut Object |<-----------------+ |
|<-----------------------------| |
```
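The lifecycle above can be sketched with a map in place of the SQL table or cache. The names `memoryStore` and `newMemoryStore` are hypothetical, the `context.Context` parameter is omitted, and the ID is a plain MD5 of the JSON blob rather than the normalized ID the domain object computes; the existence check stands in for `ON CONFLICT (id) DO NOTHING`.

```go
package main

import (
	"crypto/md5"
	"encoding/json"
	"errors"
	"fmt"
)

// GraphConfig and GraphsShortcut are trimmed copies of the domain types;
// the JSON field names are assumptions.
type GraphConfig struct {
	Queries  []string `json:"queries"`
	Formulas []string `json:"formulas"`
	Keys     string   `json:"keys"`
}

type GraphsShortcut struct {
	Graphs []GraphConfig `json:"graphs"`
}

// memoryStore maps shortcut IDs to JSON blobs.
type memoryStore struct {
	rows map[string]string
}

func newMemoryStore() *memoryStore {
	return &memoryStore{rows: map[string]string{}}
}

// InsertShortcut serializes the shortcut and stores it under a content-derived ID.
func (s *memoryStore) InsertShortcut(sc GraphsShortcut) (string, error) {
	blob, err := json.Marshal(sc)
	if err != nil {
		return "", err
	}
	id := fmt.Sprintf("%x", md5.Sum(blob)) // Simplified ID for this sketch.
	if _, exists := s.rows[id]; !exists {  // Same content, same ID: nothing to update.
		s.rows[id] = string(blob)
	}
	return id, nil
}

// GetShortcut looks up the blob by ID and decodes it.
func (s *memoryStore) GetShortcut(id string) (GraphsShortcut, error) {
	blob, ok := s.rows[id]
	if !ok {
		return GraphsShortcut{}, errors.New("shortcut not found: " + id)
	}
	var sc GraphsShortcut
	err := json.Unmarshal([]byte(blob), &sc)
	return sc, err
}

func main() {
	store := newMemoryStore()
	sc := GraphsShortcut{Graphs: []GraphConfig{{Queries: []string{"arch=arm"}}}}
	id, _ := store.InsertShortcut(sc)
	again, _ := store.InsertShortcut(sc) // Second insert is a no-op.
	got, err := store.GetShortcut(id)
	fmt.Println(id == again, len(store.rows), got.Graphs[0].Queries, err)
}
```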
# Module: /go/graphsshortcut/graphsshortcutstore/schema
### High-Level Overview
The `schema` module defines the structural contract for persisting shortcut data within the Graphs Shortcut Store. Its primary purpose is to provide a unified Go representation of the underlying SQL table structure used to store and retrieve serialized graph configurations.
In the context of the Perf system, a "shortcut" is a durable reference to a specific state or collection of graphs. This module ensures that both the application code and the database schema remain synchronized regarding how these shortcuts are identified and stored.
### Design Decisions and Implementation
The schema is designed around a simple key-value paradigm optimized for content-addressable storage or unique identifier lookups.
- **ID-Based Retrieval**: The `ID` field serves as the unique handle for a set of graphs. By using a `TEXT` type with a `UNIQUE NOT NULL PRIMARY KEY` constraint, the system enforces uniqueness at the database level, so two rows can never share an ID and every shortcut can be retrieved with a single primary-key lookup.
- **Serialized Persistence**: The `Graphs` field is defined as a `TEXT` type rather than a structured relational set of tables. This implementation choice favors flexibility and performance for the following reasons:
- **Schema Evolution**: Since the internal structure of a "graph" might change frequently in the frontend or higher-level Go modules, storing it as a serialized string (typically JSON) avoids the need for complex database migrations whenever a new UI property is added.
- **Atomic Operations**: Storing the entire state in a single column allows the system to save or load a complete dashboard state in a single database operation, reducing the overhead of joins or multiple queries.
### Key Components
#### GraphsShortcutSchema
The core structure `GraphsShortcutSchema` utilizes struct tags (`sql:"..."`) to bridge the gap between Go types and the SQL dialect used by the persistence layer.
- **`ID`**: Acts as the immutable identifier for the shortcut. In practice, this is often a hash of the content or a generated UUID, allowing the application to generate permalinks to specific graph views.
- **`Graphs`**: Contains the payload of the shortcut. This is the "source of truth" for the graph configurations, including parameters like trace filters, time ranges, or specific visualization settings.
### Data Flow Workflow
The following diagram illustrates how the schema acts as the intermediary between the application logic and the physical storage:
```text
[ Application Logic ] [ schema.GraphsShortcutSchema ] [ SQL Database ]
| | |
| 1. Construct Schema Object | |
|--------------------------------->| |
| (Set ID and Serialized JSON) | |
| | 2. Execute INSERT/SELECT |
| |-------------------------------->|
| | (Uses struct tags for SQL) |
| | |
| 3. Receive Hydrated Object | <-------------------------------|
|<---------------------------------| |
| (Deserialize Graphs field) | |
```
# Module: /go/graphsshortcut/graphsshortcuttest
# graphsshortcuttest
The `graphsshortcuttest` module provides a standardized test suite for validating implementations of the `graphsshortcut.Store` interface. By centralizing these tests, the system ensures that different storage backends (e.g., SQL-based, in-memory) behave consistently regarding data persistence, normalization, and error handling.
## Design Philosophy: Contract Testing
The module is designed around the concept of "contract testing." Instead of each implementation of a `Store` writing its own basic functional tests, they import and run the `SubTests` defined here. This approach ensures:
1. **Consistency**: All backends must adhere to the same behavioral expectations.
2. **Normalization Verification**: The tests specifically verify side effects of storage, such as the automatic sorting of query strings, ensuring that the `Store` acts as a canonicalization layer.
3. **Reduced Boilerplate**: Implementation-specific test files only need to handle the setup/teardown of their respective drivers (like starting a Docker container for a database) before passing the resulting `Store` instance to this suite.
## Key Components and Responsibilities
### Test Suite Map (`SubTests`)
The primary entry point is the `SubTests` map. It maps descriptive test names to `SubTestFunction` values. This structure allows implementation-specific tests to iterate over the map and run each entry as a subtest:
```
for name, subTest := range graphsshortcuttest.SubTests {
	t.Run(name, func(t *testing.T) { subTest(t, myStoreInstance) })
}
```
### Functional Validation (`InsertGet`)
The `InsertGet` function validates the primary lifecycle of a shortcut. A key implementation detail tested here is **query normalization**. When a `GraphsShortcut` is provided with queries in an arbitrary order, the `Store` is expected to return them in a sorted, deterministic state. This is crucial for deduplication and predictable UI rendering.
```
 Input Shortcut          Storage Backend           Output Shortcut
+--------------+         +--------------+          +--------------+
| Queries:     |         |              |          | Queries:     |
|  - arch=x86  |  ---->  |  Persist &   |  ----->  |  - arch=arm  |
|  - arch=arm  |         |  Normalize   |          |  - arch=x86  |
+--------------+         +--------------+          +--------------+
```
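The normalization behavior that `InsertGet` checks can be sketched in Go as follows. This is a minimal illustration, not the real store code: `GraphConfig`, `GraphsShortcut`, and `normalize` are simplified, hypothetical stand-ins that model only the `Queries` field.

```go
package main

import (
	"fmt"
	"sort"
)

// GraphConfig is a hypothetical, simplified stand-in for one graph's configuration.
type GraphConfig struct {
	Queries []string
}

// GraphsShortcut is a hypothetical stand-in for the stored shortcut payload.
type GraphsShortcut struct {
	Graphs []GraphConfig
}

// normalize returns a copy of s in which every graph's queries are sorted,
// so that logically equal shortcuts end up with identical content.
func normalize(s GraphsShortcut) GraphsShortcut {
	out := GraphsShortcut{Graphs: make([]GraphConfig, len(s.Graphs))}
	for i, g := range s.Graphs {
		queries := append([]string(nil), g.Queries...) // copy; leave the input untouched
		sort.Strings(queries)
		out.Graphs[i] = GraphConfig{Queries: queries}
	}
	return out
}

func main() {
	in := GraphsShortcut{Graphs: []GraphConfig{{Queries: []string{"arch=x86", "arch=arm"}}}}
	fmt.Println(normalize(in).Graphs[0].Queries) // prints [arch=arm arch=x86]
}
```

Sorting a copy rather than the caller's slice keeps the sketch free of side effects; whether a real backend sorts in place is not something this example asserts.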
### Error Handling (`GetNonExistent`)
This ensures that the `Store` implementation correctly propagates errors when a requested ID does not exist, rather than returning an empty or partially initialized object.
## Implementation Details
- **graphsshortcuttest.go**: Contains the logic for the test suite. It defines the `SubTestFunction` type, which abstracts the `testing.T` and `graphsshortcut.Store` dependency, allowing the tests to be decoupled from the actual storage driver.
- **Data Integrity**: The tests use `testify/assert` and `testify/require` to enforce strict equality between what is sent to the store and what is retrieved, ensuring that no fields (like `Keys` or the list of `Graphs`) are dropped or corrupted during the serialization/deserialization process.
# Module: /go/graphsshortcut/mocks
The `/go/graphsshortcut/mocks` module provides automated mock implementations of the `graphsshortcut.Store` interface. These mocks are designed to facilitate unit testing of components that depend on graph shortcut persistence without requiring a live database or storage backend.
### Design and Implementation
The module utilizes `testify/mock` to provide a flexible, programmable implementation of the storage layer. This approach allows developers to:
1. **Isolate Unit Tests**: Test business logic in services that use graph shortcuts by simulating various storage outcomes (success, specific errors, or timeouts).
2. **Verify Interactions**: Assert that the expected methods are called with the correct parameters, ensuring that the calling code correctly handles the lifecycle of a shortcut.
The code is autogenerated using `mockery`, ensuring that the mock implementation remains strictly synchronized with the `graphsshortcut.Store` interface definition. This eliminates the maintenance overhead of manually updating test doubles when the primary interface changes.
### Key Components
#### Store.go
This file defines the `Store` struct, which embeds `mock.Mock`. It provides mock implementations for the primary persistence operations:
- **`GetShortcut(ctx, id)`**: Simulates retrieving a serialized graph configuration. It allows tests to return a specific `graphsshortcut.GraphsShortcut` object or an error based on the provided ID.
- **`InsertShortcut(ctx, shortcut)`**: Simulates the creation of a new shortcut. In a test environment, this is typically used to return a pre-defined ID string, allowing the caller to proceed as if a database write succeeded.
### Usage Workflow
The `NewStore` function is the entry point for utilizing these mocks. It integrates directly with the Go testing lifecycle by registering a cleanup function that automatically asserts expectations.
```text
+-------------------+             +-----------------------+
|     Unit Test     |             |  mocks.Store (Mock)   |
+---------+---------+             +-----------+-----------+
          |                                   |
          | 1. NewStore(t)                    |
          +---------------------------------->|
          |                                   |
          | 2. On("GetShortcut").Return(...)  |
          +---------------------------------->|
          |                                   |
          | 3. Invoke System Under Test       |
          +---------------------------------->|
          |                                   |
          | 4. AssertExpectations (Auto)      |
          |<----------------------------------+
```
By using `NewStore(t)`, the mock is bound to the test's lifespan. If the code under test fails to call a method that was "set up" or calls it with the wrong arguments, the test will fail during the `Cleanup` phase.
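The cleanup-bound assertion can be illustrated with a standard-library-only sketch. It does not use `testify/mock` or the generated `mocks.Store`; `fakeStore`, `cleanupT`, and `recordingT` are hypothetical names that only mimic the pattern of registering expectation checks through `Cleanup`.

```go
package main

import "fmt"

// cleanupT is the small slice of *testing.T that this sketch needs.
type cleanupT interface {
	Errorf(format string, args ...any)
	Cleanup(func())
}

// fakeStore is a hand-written stand-in for a generated mock.
type fakeStore struct {
	expected map[string]bool // method name -> has it been called?
}

// newFakeStore mirrors the NewStore(t) pattern: unmet expectations are
// reported automatically when the test's cleanup functions run.
func newFakeStore(t cleanupT) *fakeStore {
	s := &fakeStore{expected: map[string]bool{}}
	t.Cleanup(func() {
		for method, called := range s.expected {
			if !called {
				t.Errorf("expected call to %s never happened", method)
			}
		}
	})
	return s
}

// Expect registers a method that the code under test must call.
func (s *fakeStore) Expect(method string) { s.expected[method] = false }

// GetShortcut records the call and returns a canned value.
func (s *fakeStore) GetShortcut(id string) (string, error) {
	s.expected["GetShortcut"] = true
	return "canned-shortcut-for-" + id, nil
}

// recordingT is a toy substitute for *testing.T so the sketch runs from main.
type recordingT struct {
	cleanups []func()
	failures []string
}

func (r *recordingT) Errorf(format string, args ...any) {
	r.failures = append(r.failures, fmt.Sprintf(format, args...))
}
func (r *recordingT) Cleanup(f func()) { r.cleanups = append(r.cleanups, f) }

// finish runs the registered cleanups, as the testing package does at test end.
func (r *recordingT) finish() {
	for _, f := range r.cleanups {
		f()
	}
}

func main() {
	rt := &recordingT{}
	store := newFakeStore(rt)
	store.Expect("GetShortcut")
	// The code under test "forgets" to call GetShortcut here.
	rt.finish()
	fmt.Println(rt.failures) // prints [expected call to GetShortcut never happened]
}
```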
# Module: /go/ingest
# Ingest Module
The `go/ingest` module serves as the primary entry point and configuration layer for the Skia Perf ingestion system. It provides the high-level logic to instantiate and connect the various sub-modules—filtering, formatting, parsing, and processing—into a cohesive pipeline that transforms raw benchmark files into indexed, searchable performance traces.
## Overview
The module's main responsibility is to bridge the gap between human-readable configuration (usually provided via JSON or command-line flags) and the specialized internal engines that handle data. It defines the `Config` structure, which acts as the blueprint for an ingestion instance, specifying everything from where data is sourced (e.g., Google Cloud Storage) to how it should be validated and where it should be stored.
## Design Decisions
### Configuration-Driven Architecture
The design is heavily configuration-driven, centered around the `Config` struct. This allows a single binary to support vastly different ingestion workflows (e.g., internal Skia benchmarks vs. external Chrome performance tests) simply by changing the configuration file. This decoupling ensures that the core logic in `process` or `parser` remains agnostic of the specific environment.
### Reliability and Observability
Because ingestion is the "front door" of the Perf system, the module is designed for high reliability:
- **Constructors with Validation:** The `NewConfig` and subsequent component initializers validate inputs (like regex patterns or database connection strings) early. This "fail-fast" approach prevents the system from starting in a broken state.
- **Operational Metrics:** The module wires up `metrics2` across all sub-components, providing real-time visibility into ingestion rates, error frequencies, and latency.
### Component Orchestration
The module doesn't just pass data; it manages the lifecycle of dependencies. For example, it coordinates the setup of the `git.Git` connector used by the `process` module to resolve hashes, ensuring that the local git cache is initialized before workers start processing files.
## Key Components and Responsibilities
### Ingest Configuration (`config.go`)
This is the central definition of an ingestion instance. It categorizes configuration into several key areas:
- **Source Configuration:** Defines the `SourceConfig` (GCS bucket, file prefixes) and `PubSubConfig` for event-driven ingestion.
- **Ingestion Logic:** Contains parameters for the `Filter` (which files to ignore) and the `Parser` (which branches to accept).
- **Infrastructure Links:** Holds connection details for the `TraceStore` (where data goes) and the `Git` repository (how data is mapped to commits).
### Integration Workflow
The following diagram illustrates how the `ingest` module assembles the sub-modules into a functional pipeline:
```text
 [ Config File / JSON ]
            |
            v
+------------------------+
|     Ingest Module      |
|    (Initialization)    |
+-----------+------------+
            |
            +--> [ Filter ] (Rules for file selection)
            |
            +--> [ Parser ] (Rules for data transformation)
            |
            +--> [ Git ]    (Connector for commit mapping)
            |
            v
+------------------------+
|     Process Module     |
|    (The Execution)     |
+-----------+------------+
            |
            +----[ Workers ]<----( Source: GCS/PubSub )
            |         |
            |         +--> [ Parse & Map ]
            |         |
            |         +--> [ Write to TraceStore ]
            |         |
            |         +--> [ Notify Downstream ]
            v
 [ Persistent Storage ]
```
## Sub-Module Interaction
- **`filter`**: Invoked early in the process to discard irrelevant files before they consume CPU time in the parser.
- **`format`**: Provides the structural definitions and JSON schemas that the `parser` uses to validate incoming blobs.
- **`parser`**: Utilized by the workers in the `process` module to turn raw bytes into standardized trace IDs and values.
- **`process`**: The active engine started by the `ingest` module to manage the actual flow of data and interaction with databases.
# Module: /go/ingest/filter
The `go/ingest/filter` module provides a mechanism for determining whether a file should be processed or ignored during ingestion based on its name. This is a critical component for performance monitoring pipelines that ingest data from large-scale storage (like Google Cloud Storage), where filtering out irrelevant files or transaction logs early prevents unnecessary resource consumption and reduces processing noise.
### Logic and Design
The filtering logic is built around two optional regular expressions: `accept` and `reject`. When an `accept` pattern is provided, the filter is "deny-by-default": only filenames that match it can pass. Without an `accept` pattern, it is "allow-by-default." In both cases, a filename that matches the `reject` pattern is discarded.
The evaluation logic follows these rules:
1. **Acceptance**: If an `accept` regex is defined, the filename _must_ match it. Failure to match results in immediate rejection.
2. **Rejection**: If a `reject` regex is defined, the filename _must not_ match it. A match results in immediate rejection.
3. **Default**: If neither regex is provided, all filenames are accepted.
The `Filter` struct holds the compiled `*regexp.Regexp` objects, so each pattern is compiled once at construction rather than on every call. This matters for high-volume ingestion, where thousands of filenames may be evaluated against the same ruleset.
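The rules above can be sketched in a few lines of Go. This is an illustrative reimplementation rather than the module's source; in particular, treating an empty pattern string as "rule not defined" is an assumption, and the filenames in `main` are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// Filter holds the optional, pre-compiled accept and reject patterns.
type Filter struct {
	accept *regexp.Regexp // nil means "no accept rule"
	reject *regexp.Regexp // nil means "no reject rule"
}

// New compiles each pattern once. An empty string disables that rule
// (an assumption made for this sketch).
func New(accept, reject string) (*Filter, error) {
	f := &Filter{}
	var err error
	if accept != "" {
		if f.accept, err = regexp.Compile(accept); err != nil {
			return nil, fmt.Errorf("compiling accept regex: %w", err)
		}
	}
	if reject != "" {
		if f.reject, err = regexp.Compile(reject); err != nil {
			return nil, fmt.Errorf("compiling reject regex: %w", err)
		}
	}
	return f, nil
}

// Reject reports whether the file with the given name should be skipped.
func (f *Filter) Reject(name string) bool {
	if f.accept != nil && !f.accept.MatchString(name) {
		return true // an accept rule exists and the name does not match it
	}
	if f.reject != nil && f.reject.MatchString(name) {
		return true // the name matches the reject rule
	}
	return false
}

func main() {
	f, err := New(`\.json$`, `^tx_log/`)
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"perf/results.json", "tx_log/results.json", "perf/notes.txt"} {
		fmt.Printf("%-22s rejected=%v\n", name, f.Reject(name))
	}
}
```

With these example patterns only `perf/results.json` is kept: the second name matches the `reject` rule and the third fails the `accept` rule.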
### Workflow
The following diagram illustrates the decision flow when `Filter.Reject(name)` is called:
```text
 [ Input Filename ]
          |
          v
+-------------------+
|  Is Accept Regex  |---- No ----+
|     Defined?      |            |
+---------+---------+            |
          | Yes                  |
          v                      |
+---------+---------+            |
|  Does it Match?   |---- Yes ---+
+---------+---------+            |
          | No                   |
          v                      |
 [ REJECTED (true) ]             |
                                 v
                       +-------------------+
                       |  Is Reject Regex  |---- No ----+
                       |     Defined?      |            |
                       +---------+---------+            |
                                 | Yes                  |
                                 v                      |
                       +---------+---------+            |
                       |  Does it Match?   |---- No ----+
                       +---------+---------+            |
                                 | Yes                  |
                                 v                      v
                        [ REJECTED (true) ]    [ ACCEPTED (false) ]
```
### Key Component
#### filter.go
This file contains the core `Filter` implementation.
- `New(accept, reject string)`: Validates and compiles the provided regex strings. It returns an error if the regex syntax is invalid, ensuring that ingestion processes fail fast during configuration rather than during runtime execution.
- `Reject(name string) bool`: The primary interface for the module. It returns `true` if the file should be discarded and `false` if it should be processed. By returning `true` for a "reject" action, it allows callers to write clean guard clauses like `if filter.Reject(name) { continue }`.
# Module: /go/ingest/format
# Ingest Format
The `go/ingest/format` module defines the data structures and validation logic for performance data files ingested into the Perf system. It serves as the formal specification for how external processes and test runners should format their results to be correctly indexed and visualized.
## Overview
The primary goal of this module is to provide a flexible yet strictly validated schema that maps raw performance measurements to **Trace IDs**. A Trace ID is a comma-separated string of key-value pairs (e.g., `,arch=x86,config=8888,stat=min,test=draw_circle,units=ms,`) used by Perf to identify a unique time series of data points.
The module supports two formats:
1. **Standard Format (Version 1):** The modern, recommended format designed for clarity and multi-metric reporting.
2. **Legacy Format:** A format primarily used by `nanobench` (Skia's internal microbenchmarking tool) which relies on nested maps.
## Design Decisions
### Trace ID Construction
The design favors a flat key-value structure for identification. The `Format` struct and its sub-components (`Result`, `SingleMeasurement`) are structured so that keys defined at the top level (global to the file) are merged with keys defined at the result level and specific measurement level. This allows for efficient data representation where common metadata (like `git_hash` or `arch`) is defined once, while specific metrics (like `min`, `max`, `median`) are defined locally.
### Multi-Metric Support
A single test run often produces multiple related values (e.g., different statistical aggregations of the same test). The `Result` struct allows a single entry to contain multiple `measurements`. This avoids duplicating the entire metadata block for every statistical variation of a single test, reducing file size and improving readability.
### Single Source of Truth via JSON Schema
To prevent "schema drift" between the Go implementation and external data producers, the module uses an embedded JSON Schema (`formatSchema.json`). This schema is programmatically generated from the Go structs (via the `generate` submodule). This ensures that validation logic used during ingestion is identical to the documentation provided to contributors.
## Key Components
### Standard Format (`format.go`)
This is the primary entry point for modern ingestion. It defines the `Format` struct, which includes:
- **Contextual Metadata:** `GitHash`, `Issue` (CL), and `Patchset` to associate results with specific code versions.
- **Global Keys:** A `Key` map containing parameters that apply to every measurement in the file (e.g., hardware configuration).
- **Results:** A list of `Result` objects. Each result can either be a simple `Measurement` (float32) or a complex `Measurements` map.
#### Trace ID Generation Workflow
The following diagram illustrates how keys are aggregated from different levels of the `Format` struct to form a final Trace ID:
```
File Level:
  {"key": {"arch": "x86", "config": "8888"}}
        |
        v
Result Level:
  {"key": {"test": "draw_circle", "units": "ms"}}
        |
        v
Measurement Level:
  {"measurements": {"stat": [{"value": "min", "measurement": 1.2}]}}
        |
+-----------------------------------------------------------+
| Resulting Trace ID:                                       |
| ,arch=x86,config=8888,stat=min,test=draw_circle,units=ms, |
+-----------------------------------------------------------+
```
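The aggregation in the diagram can be sketched as a merge of three key maps followed by a sorted rendering. The helper names `mergeKeys` and `traceID` are hypothetical, and letting later levels override earlier ones on a key collision is an assumption of this sketch, not a documented rule.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// mergeKeys combines key maps in order. On a key collision the later
// level wins (an assumption made for this sketch).
func mergeKeys(levels ...map[string]string) map[string]string {
	merged := map[string]string{}
	for _, level := range levels {
		for k, v := range level {
			merged[k] = v
		}
	}
	return merged
}

// traceID renders the keys as ",k1=v1,k2=v2," with the keys sorted.
func traceID(keys map[string]string) string {
	names := make([]string, 0, len(keys))
	for k := range keys {
		names = append(names, k)
	}
	sort.Strings(names)
	var b strings.Builder
	b.WriteString(",")
	for _, k := range names {
		b.WriteString(k + "=" + keys[k] + ",")
	}
	return b.String()
}

func main() {
	file := map[string]string{"arch": "x86", "config": "8888"}
	result := map[string]string{"test": "draw_circle", "units": "ms"}
	measurement := map[string]string{"stat": "min"}
	fmt.Println(traceID(mergeKeys(file, result, measurement)))
	// prints ,arch=x86,config=8888,stat=min,test=draw_circle,units=ms,
}
```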
### Validation and Parsing
The module provides robust utilities to ensure data integrity:
- **`Parse`:** Decodes JSON into the `Format` struct and enforces version checking.
- **`Validate`:** Performs a two-pass check. First, it ensures the JSON is syntactically valid and matches the internal Go types. Second, it validates the blob against the embedded JSON Schema to catch logic errors (e.g., missing required fields like `git_hash`).
- **`GetLinksForMeasurement`:** A helper function that resolves all URLs associated with a specific trace. It merges global links (file-level) with measurement-specific links, allowing users to jump from a specific data point in the Perf UI to external logs or artifacts.
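The version check performed while parsing can be sketched as below. The `fileHeader` struct is a hypothetical subset of the real `Format` struct, the `version` JSON key is assumed, and `parse` is not the module's actual `Parse` signature.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// fileHeader is a hypothetical subset of the fields in a Version 1 file.
type fileHeader struct {
	Version int    `json:"version"`
	GitHash string `json:"git_hash"`
}

// errWrongVersion is returned for any file that is not version 1.
var errWrongVersion = errors.New("unsupported file format version")

// parse decodes the JSON and rejects anything that is not version 1.
func parse(data []byte) (fileHeader, error) {
	var h fileHeader
	if err := json.Unmarshal(data, &h); err != nil {
		return fileHeader{}, fmt.Errorf("decoding JSON: %w", err)
	}
	if h.Version != 1 {
		return fileHeader{}, errWrongVersion
	}
	return h, nil
}

func main() {
	_, err := parse([]byte(`{"version": 2, "git_hash": "abc123"}`))
	fmt.Println(err) // prints unsupported file format version

	h, err := parse([]byte(`{"version": 1, "git_hash": "abc123"}`))
	fmt.Println(h.GitHash, err) // prints abc123 <nil>
}
```

Schema validation would be a second pass over the same bytes; it is left out here because it depends on the embedded schema file.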
### Legacy Format (`leagacyformat.go`)
This file maintains compatibility with older Skia tooling. It defines the `BenchData` struct, which uses a more nested structure: `Results -> [TestName] -> [Config] -> [Metric]`. Unlike the standard format, which is versioned and schema-validated, the legacy format is handled as a `map[string]interface{}` to accommodate the highly dynamic nature of older benchmark outputs.
### Embedded Schema (`formatSchema.json`)
The schema file is embedded into the Go binary using `//go:embed`. This allows the `Validate` function to perform schema validation without requiring external file dependencies at runtime. It defines the strict requirements for the Version 1 format, such as mandatory fields and allowed data types for measurements.
# Module: /go/ingest/format/generate
The `generate` module is a utility designed to maintain consistency between the Go-based implementation of the Perf ingestion format and its external documentation. Its primary responsibility is to act as a code-to-schema compiler that ensures the structural definition of data ingested into Perf remains synchronized across all tools and external data producers.
### Design Philosophy
The core design decision behind this module is to treat the Go source code as the single source of truth for the ingestion protocol. In a complex data pipeline, the format used to describe performance results can evolve. Manually maintaining a separate JSON Schema file is error-prone and leads to "schema drift," where the documentation or validation rules fail to match the actual parsing logic in the Go codebase.
By programmatically generating the schema from the `format.Format` struct, the system guarantees that:
- Any field added, renamed, or removed in the Go struct is immediately reflected in the schema.
- Validation logic (e.g., required fields or data types) remains identical for both internal processing and external validation.
- External developers producing data for Perf have a machine-readable specification that is guaranteed to be accurate.
### Implementation Strategy
The module leverages the `go.skia.org/infra/go/jsonschema` package to perform structural reflection on the `format.Format` struct.
1. **Reflection:** The generator inspects the Go types, field names, and especially the `json` tags of the target struct.
2. **Mapping:** It maps Go-specific primitives and complex types to their corresponding JSON Schema representations.
3. **Output:** The result is serialized into a standard `formatSchema.json` file located in the parent directory. This output is used by other parts of the system to validate incoming JSON blobs before they are processed by the ingestion engine.
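The reflection and mapping steps can be illustrated with the standard `reflect` package. This sketch does not use the `go.skia.org/infra/go/jsonschema` API; `sample`, `jsonType`, and `properties` are hypothetical, and the output is a flat name-to-type map rather than a full JSON Schema document.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// sample is a hypothetical struct standing in for format.Format.
type sample struct {
	Version int               `json:"version"`
	GitHash string            `json:"git_hash"`
	Key     map[string]string `json:"key,omitempty"`
}

// jsonType maps a Go kind to a JSON Schema type name.
func jsonType(k reflect.Kind) string {
	switch k {
	case reflect.Bool:
		return "boolean"
	case reflect.Int, reflect.Int32, reflect.Int64:
		return "integer"
	case reflect.Float32, reflect.Float64:
		return "number"
	case reflect.String:
		return "string"
	case reflect.Map, reflect.Struct:
		return "object"
	case reflect.Slice:
		return "array"
	default:
		return "unknown"
	}
}

// properties walks the struct fields and reads each `json` tag.
// Fields without a tag (or tagged "-") are skipped in this sketch.
func properties(v any) map[string]string {
	props := map[string]string{}
	t := reflect.TypeOf(v)
	for i := 0; i < t.NumField(); i++ {
		field := t.Field(i)
		name := strings.Split(field.Tag.Get("json"), ",")[0]
		if name == "" || name == "-" {
			continue
		}
		props[name] = jsonType(field.Type.Kind())
	}
	return props
}

func main() {
	for name, typ := range properties(sample{}) {
		fmt.Printf("%s: %s\n", name, typ)
	}
}
```

Because the property names come from the struct tags, renaming a field's `json` tag changes the generated description without any manual schema edit, which is the point of the design.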
### Process Flow
The following diagram illustrates how this module fits into the development workflow:
```
+--------------------------+
|  perf/go/ingest/format/  |
|       (Go Structs)       |
+------------+-------------+
             |
             | Source of Truth
             v
+------------+-------------+
| /format/generate/main.go | <-- This Module
+------------+-------------+
             |
             | Reflection & Generation
             v
+------------+-------------+
|    formatSchema.json     |
| (Machine-readable Spec)  |
+------------+-------------+
             |
             +----------------------------> External Data Producers
             |
             +----------------------------> Validation Middlewares
```
### Key Components
- **Schema Generator (`main.go`):** The entry point that executes the generation logic. It specifically references the `format.Format` struct and directs the output to a static file path relative to the module. Its simplicity is intentional, as it acts strictly as a bridge between the internal type system and the file system.
# Module: /go/ingest/parser
### High-Level Overview
The `parser` module provides the logic for transforming raw performance data files into a standardized format suitable for storage in a trace-based database. It serves as the translation layer between external benchmarking tools—which may produce data in various JSON schemas—and the internal Perf system.
The module is designed to handle both a "Legacy" format and a modern "Version 1" format, ensuring backward compatibility while supporting newer features like explicit commit positions and complex measurement maps.
### Design Philosophy and Implementation Choices
The parser's implementation is guided by the need for data integrity and system stability in a high-volume ingestion pipeline:
- **Format Autodiscovery:** Instead of requiring explicit configuration for file types, the `Parser` attempts to decode files using the Version 1 schema first. If that fails, it falls back to the Legacy parser. This allows a single ingestion pipeline to handle a heterogeneous mix of data sources.
- **Key Sanitization:** A critical responsibility of the parser is ensuring that parameter keys and values do not contain characters (like `,` or `=`) that would break the internal string-based representation of traces. It uses configurable regular expressions and a "force valid" approach to replace illegal characters, preventing database corruption or query errors.
- **Metric Filtering:** To keep the database clean, the parser identifies and discards "noise." For example, it explicitly ignores parameters prefixed with `GL_` (internal OpenGL constants) in legacy files, as these are considered too verbose for high-level performance tracking.
- **Branch-Based Gating:** The parser can be configured to only accept data from specific branches. This prevents experimental or development branch data from polluting the production performance metrics unless explicitly desired.
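Key sanitization can be sketched as a regex replacement applied to both keys and values. The character class and the underscore replacement below are assumptions chosen for illustration; the real pattern is configurable and may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidChars is an assumed pattern: anything outside this set is illegal.
var invalidChars = regexp.MustCompile(`[^a-zA-Z0-9._-]`)

// sanitize replaces every illegal character with an underscore.
func sanitize(s string) string {
	return invalidChars.ReplaceAllString(s, "_")
}

// sanitizeParams cleans both the keys and the values of a parameter map.
func sanitizeParams(params map[string]string) map[string]string {
	clean := make(map[string]string, len(params))
	for k, v := range params {
		clean[sanitize(k)] = sanitize(v)
	}
	return clean
}

func main() {
	dirty := map[string]string{"test": "draw,circle", "mode=fast": "yes"}
	fmt.Println(sanitizeParams(dirty)) // prints map[mode_fast:yes test:draw_circle]
}
```

Replacing rather than rejecting keeps the file ingestible, at the cost that two distinct raw values can collapse to the same sanitized value.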
### Key Components and Responsibilities
#### Parser Struct (`parser.go`)
The central coordinator of the module. It maintains the state necessary for ingestion, including:
- **Validation Logic:** Uses `invalidParamCharRegex` to sanitize incoming metadata.
- **Branch Filtering:** Holds a map of allowed branch names to quickly decide if a file should be processed or skipped via `ErrFileShouldBeSkipped`.
- **Metrics Tracking:** Integrates with `metrics2` to track successful parses, failures, and files with no data, providing operational visibility into the ingestion pipeline.
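Branch gating with a sentinel error can be sketched as follows. `branchGate` and its methods are hypothetical, `errFileShouldBeSkipped` only mirrors the name used above, and treating an empty allow-list as "accept every branch" is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errFileShouldBeSkipped mirrors the sentinel error named in the text.
var errFileShouldBeSkipped = errors.New("file should be skipped")

// branchGate holds the allowed branches. An empty set accepts everything
// (an assumption made for this sketch).
type branchGate struct {
	allowed map[string]bool
}

// newBranchGate builds the lookup map once so each check is a map access.
func newBranchGate(branches []string) *branchGate {
	g := &branchGate{allowed: map[string]bool{}}
	for _, b := range branches {
		g.allowed[b] = true
	}
	return g
}

// check returns errFileShouldBeSkipped for branches outside the allowed set.
func (g *branchGate) check(branch string) error {
	if len(g.allowed) == 0 || g.allowed[branch] {
		return nil
	}
	return errFileShouldBeSkipped
}

func main() {
	gate := newBranchGate([]string{"main"})
	for _, branch := range []string{"main", "ignoreme"} {
		if err := gate.check(branch); errors.Is(err, errFileShouldBeSkipped) {
			fmt.Println("skipping file from branch", branch)
			continue
		}
		fmt.Println("ingesting file from branch", branch)
	}
}
```

A dedicated sentinel lets callers distinguish "skip this file" from a genuine parse failure, so skipped files need not be counted as errors.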
#### Version 1 Parsing
Handles the modern schema which supports:
- **Commit Numbers:** Recognizes the `CP:nnnnnn` form of the `git_hash` field and treats the digits after the prefix as a sequential commit position instead of a Git hash.
- **Complex Measurements:** Processes the `Measurements` map, which allows a single result entry to contain multiple named metrics (e.g., `min_ms`, `max_rss`) without duplicating the common metadata.
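Handling of the `CP:` prefix can be sketched like this. The function name and its three-value return are hypothetical, and the sample values in `main` are arbitrary; the sketch only shows how a commit position can be told apart from an ordinary hash and how a malformed number surfaces as an error.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const commitPositionPrefix = "CP:"

// commitNumberFromHash returns (n, true, nil) for "CP:n", (0, false, nil)
// for an ordinary Git hash, and an error for a malformed commit position.
func commitNumberFromHash(gitHash string) (int, bool, error) {
	if !strings.HasPrefix(gitHash, commitPositionPrefix) {
		return 0, false, nil
	}
	n, err := strconv.Atoi(strings.TrimPrefix(gitHash, commitPositionPrefix))
	if err != nil {
		return 0, false, fmt.Errorf("invalid commit position %q: %w", gitHash, err)
	}
	return n, true, nil
}

func main() {
	for _, h := range []string{"CP:727901", "CP:727A901", "0123456789abcdef0123456789abcdef01234567"} {
		n, isCommitPosition, err := commitNumberFromHash(h)
		fmt.Println(n, isCommitPosition, err)
	}
}
```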
#### Legacy Parsing
Maintains compatibility with older benchmarking outputs. Its primary task is flattening deeply nested JSON structures (Test Name -> Config -> Results) into a flat list of parameters and float values. It also handles the extraction of "Samples" (multiple runs of the same test) which are specifically aggregated for the `min_ms` sub-result.
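The flattening step can be sketched as a walk over nested maps. The nesting order (test name, then config, then metric), the parameter key names `test`, `config`, and `sub_result`, and the rule of skipping every non-numeric entry are assumptions for illustration; sample aggregation is not modeled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// legacyResults assumes the nesting test name -> config -> metric -> value.
type legacyResults map[string]map[string]map[string]any

// point is one flattened measurement.
type point struct {
	Params map[string]string
	Value  float32
}

// flatten walks the nested maps and emits one point per numeric metric.
func flatten(results legacyResults) []point {
	var points []point
	for test, configs := range results {
		for config, metrics := range configs {
			for metric, raw := range metrics {
				value, ok := raw.(float64) // encoding/json decodes numbers as float64
				if !ok {
					continue // non-numeric entries (e.g. nested options) are skipped here
				}
				points = append(points, point{
					Params: map[string]string{"test": test, "config": config, "sub_result": metric},
					Value:  float32(value),
				})
			}
		}
	}
	return points
}

func main() {
	raw := `{"draw_circle": {"8888": {"min_ms": 1.2, "options": {"GL_VERSION": "4.5"}}}}`
	var results legacyResults
	if err := json.Unmarshal([]byte(raw), &results); err != nil {
		panic(err)
	}
	for _, p := range flatten(results) {
		fmt.Println(p.Params, p.Value)
	}
}
```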
#### Parameter Management
Functions like `buildInitialParams` and `getParamsAndValuesFromVersion1Format` are responsible for merging "Global" keys (describing the machine/environment) with "Local" keys (describing the specific test run). This creates the unique identity for every performance trace.
### Data Transformation Workflow
The following diagram illustrates how the `Parse` method processes a file from raw input to standardized trace data:
```text
Input: file.File (Name, Contents)
              |
              V
[ Read all contents into memory ]
(Allows multiple passes for format detection)
              |
              V
[ Try Version 1 Extraction ] ---- Success? ----> [ Sanitize Keys ]
              |                                          |
            Fail?                                        |
              |                                          |
[ Try Legacy Extraction ] ------- Success? ----> [ Sanitize Keys ]
              |                                          |
            Fail?                                        |
              |                                          |
      [ Return Error ]                         [ Filter by Branch ]
                                                         |
                                               [ Skip if excluded? ]
                                                         |
                                                         V
                                               Standardized Output:
                                               - []paramtools.Params (Trace IDs)
                                               - []float32 (Values)
                                               - Hash (Commit ID)
```
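The try-then-fall-back order in the diagram can be sketched as below. Both extractors are simplified placeholders that read only a hash field; the `version` key for Version 1 files is an assumption, and `detectAndParse` is a hypothetical name rather than the parser's real method.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// extractVersion1 is a placeholder that accepts only files declaring version 1.
func extractVersion1(data []byte) (string, error) {
	var f struct {
		Version int    `json:"version"`
		GitHash string `json:"git_hash"`
	}
	if err := json.Unmarshal(data, &f); err != nil || f.Version != 1 {
		return "", errors.New("not a version 1 file")
	}
	return f.GitHash, nil
}

// extractLegacy is a placeholder that requires the legacy gitHash field.
func extractLegacy(data []byte) (string, error) {
	var f struct {
		GitHash string `json:"gitHash"`
	}
	if err := json.Unmarshal(data, &f); err != nil || f.GitHash == "" {
		return "", errors.New("not a legacy file")
	}
	return f.GitHash, nil
}

// detectAndParse tries Version 1 first and falls back to Legacy. The bytes
// are already in memory, so both attempts can read the same input.
func detectAndParse(data []byte) (format, hash string, err error) {
	if hash, err = extractVersion1(data); err == nil {
		return "version_1", hash, nil
	}
	if hash, err = extractLegacy(data); err == nil {
		return "legacy", hash, nil
	}
	return "", "", errors.New("file matches neither supported format")
}

func main() {
	for _, raw := range []string{
		`{"version": 1, "git_hash": "abc123"}`,
		`{"gitHash": "def456"}`,
		`not json`,
	} {
		format, hash, err := detectAndParse([]byte(raw))
		fmt.Println(format, hash, err)
	}
}
```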
### Key Files
- **`parser.go`**: Contains the primary `Parser` implementation and logic for both schema versions.
- **`parser_test.go`**: Defines the behavioral contract of the parser using a wide array of test fixtures to ensure stability across edge cases like malformed JSON or special character collisions.
- **`testdata/`**: An authoritative collection of JSON fixtures representing different data scenarios (success, failure, different schemas) used to validate the parser's logic.
# Module: /go/ingest/parser/testdata
The `/go/ingest/parser/testdata` module serves as the authoritative collection of test fixtures for the performance data ingestion system. Its primary role is to define the operational boundaries of the ingestion logic, ensuring that the system remains resilient across schema evolutions, handles data corruption gracefully, and correctly identifies performance metrics from various benchmarking sources.
### Design Philosophy and Implementation Choices
The directory is structured to separate data by schema version (`legacy` and `version_1`). This separation reflects a fundamental design choice to maintain strict backward compatibility while allowing for the evolution of the ingestion format.
The fixtures are designed around three core testing principles:
1. **Identity Verification:** Ensuring that the combination of global keys (e.g., OS, Architecture) and local result keys (e.g., Test Name, Configuration) correctly resolves to a unique time-series identity.
2. **Sanitization and Collisions:** Validating that the parser can handle special characters (`,`, `=`) that might otherwise collide with the internal delimiters used by the time-series database.
3. **Data Filtering:** Defining "noise" (such as legacy `GL_` prefixes or experimental branch data) through negative test cases, ensuring that only relevant metrics enter the long-term storage.
### Key Components
#### Legacy Data Handling (`/legacy`)
The files within this component represent a historical, more permissive JSON schema. The parser's responsibility here is heavily focused on **traversal and filtering**. Because the legacy format lacks strict enforcement, the test data validates the parser's ability to:
- **Navigate Deep Nesting:** Locating metrics within structures like `results -> test_name -> configuration -> metrics`.
- **Execute Exclusion Rules:** Using files like `unknown_branch.json` to verify that data from specific development paths is discarded.
- **Manage Mixed Types:** Handling arrays that may contain both numeric data and non-numeric "noise" within the same block.
#### Version 1 Schema Validation (`/version_1`)
The Version 1 fixtures represent the modern, more structured ingestion format. The focus shifts from filtering noise to **identifying metadata and handling special cases**. Key responsibilities demonstrated here include:
- **Commit Position Resolution:** Utilizing the `CP:` prefix in `git_hash` fields to distinguish between traditional Git SHAs and sequential commit numbers, as seen in `with_commit_number.json`.
- **Escaping Logic:** Validating that the parser correctly preserves data integrity when keys or values contain mathematical symbols or database delimiters (e.g., `with_comma_in_param.json`).
- **Measurement Aggregation:** Testing how the system interprets different measurement formats, such as single scalar values versus maps of multi-config measurement arrays.
### High-Level Ingestion Workflow
The following diagram illustrates how the ingestion logic uses these fixtures to transform raw JSON input into a standardized record:
```text
Raw JSON File (Test Data)
          |
          V
[ Format Detection ] -----------------------+
          |                                 |
          +--> [ Legacy Parser ]            +--> [ Version 1 Parser ]
          |    (Filters GL_ prefixes,       |    (Handles CP: prefixes,
          |     Maps nested metrics)        |     Resolves Identity Keys)
          |                                 |
          V                                 V
[ Key Normalization ] <---------------------+
          |
          |-- Check for delimiter collisions (",", "=")
          |-- Merge global and local key blocks
          |
          V
[ Value Extraction ]
          |
          |-- Convert numeric strings to float64
          |-- Validate sample arrays (ignore non-numeric)
          |
          V
Standardized Ingestion Record
(Used for Database Write)
```
### Usage in Testing
These files are not merely static examples; they are the inputs for the `parser_test.go` suite. The system compares the output of the parser against "golden" expectations derived from these files.
- **Positive tests** (e.g., `success.json`) ensure that valid data is correctly parsed into the internal data model.
- **Negative tests** (e.g., `invalid.json`, `invalid_commit_number.json`) ensure that the parser returns explicit errors or handles exceptions without crashing, which is critical for a high-volume ingestion pipeline where malformed data is a common occurrence.
# Module: /go/ingest/parser/testdata/legacy
The `/go/ingest/parser/testdata/legacy` directory serves as a comprehensive suite of test fixtures designed to validate the ingestion and parsing logic for legacy performance result formats. These files are used to ensure that the parser correctly handles various edge cases, data structures, and validation rules inherent in the older JSON schema used by benchmarking systems.
### Purpose and Design
The primary goal of these data files is to define the expected boundaries of the legacy ingestion system. Because legacy formats often lack strict schema enforcement, these files document through example how the parser should interpret nested objects, handle missing keys, and filter out noise.
Key design considerations reflected in these files include:
- **Schema Flexibility:** Validating how the parser traverses deeply nested structures (e.g., `results -> test_name -> configuration -> metrics`).
- **Data Integrity:** Defining which fields are considered "noise" (like specific GL prefixes or non-string values in certain contexts) and should be discarded during ingestion.
- **Robustness:** Ensuring the system handles corrupted or empty data gracefully without crashing.
### Key Test Scenarios
#### 1. Data Structure Variations
- **Measurement Types:** `one_measurement.json` and `zero_measurement.json` verify the parser's ability to extract single data points and handle boundary values like zero, which might otherwise be misinterpreted as missing data.
- **Sample Aggregation:** `samples_success.json` demonstrates how the system handles arrays of raw performance measurements (`samples`) alongside aggregated statistics like `min_ms`. It also tests the parser's ability to ignore non-numeric values within these arrays.
- **Metadata Handling:** Files like `success.json` showcase complex results containing a mix of `key` (identifying the environment), `options` (contextual metadata), and `meta` (test-specific metrics like `max_rss_mb`).
#### 2. Filtering and Validation Logic
Several files contain keys specifically named `SHOULD_NOT_APPEAR_IN_RESULTS...` (e.g., in `samples_success.json` and `unknown_branch.json`). These are used to test the parser's filtering logic:
- **Prefix Filtering:** Ignoring keys starting with `GL_`.
- **Type Validation:** Ensuring that only string values are accepted in certain metadata blocks, while non-numeric values are ignored in measurement blocks.
- **Branch Filtering:** `unknown_branch.json` tests the ingestion engine's ability to ignore data from specific development branches (e.g., "ignoreme").
#### 3. Error and Edge Cases
- **Malformed Input:** `invalid.json` provides a baseline for how the parser handles non-JSON content.
- **Empty Results:** `no_results.json` and `samples_no_results.json` ensure the system doesn't fail when valid metadata is present but no actual performance metrics are included.
### Workflow: Data Parsing Logic
The following diagram illustrates how the ingestion logic typically processes a file from this test data set:
```text
JSON Input File
      |
      V
[ Schema Validation ] ----> If Invalid (invalid.json) -> Reject
      |
      V
[ Extract Global Keys ] --> gitHash, issue, patchset, system
      |
      V
[ Iterate Results ]
      |
      +-- [ Filter Keys ] --> Ignore "GL_" prefixes or "ignoreme" branches
      |
      +-- [ Parse Metrics ]
      |         |
      |         +-- Numeric values (min_ms, samples) -> Store
      |         +-- Non-numeric/Strings in Metrics -> Discard
      |
      +-- [ Map Metadata ] -> Map 'options' and 'meta' to result tags
      |
      V
Processed Ingestion Record
```
### File Summary
- **`success.json` / `samples_success.json`**: The gold standard for valid legacy data, covering diverse configurations (8888, 565, gpu) and metric types.
- **`invalid.json`**: Tests resilience against syntax errors.
- **`no_results.json` / `samples_no_results.json`**: Tests handling of empty result sets with valid headers.
- **`one_measurement.json` / `zero_measurement.json`**: Tests specific numerical edge cases.
- **`unknown_branch.json`**: Tests environmental filtering logic.
# Module: /go/ingest/parser/testdata/version_1
This directory serves as a comprehensive suite of test cases for the "Version 1" ingestion format. It contains JSON files designed to validate the robustness, edge-case handling, and schema compliance of the ingestion parser.
### Purpose and Design Decisions
The primary goal of these data samples is to define the boundaries of what the parser should accept, reject, or transform. The design of these files reflects real-world ingestion scenarios where data might be messy, incomplete, or formatted using specific conventions (such as commit position markers).
By providing these samples, the module ensures that the parser can:
1. **Differentiate between various measurement structures** (single values vs. multi-config arrays).
2. **Handle identity metadata** across different levels of the JSON hierarchy.
3. **Sanitize or preserve special characters** within keys and values.
### Key Data Scenarios
The test data can be categorized into three functional groups:
#### 1. Valid and Edge-Case Data Structures
These files demonstrate the flexibility of the Version 1 schema:
- **Success Cases (`success.json`, `one_measurement.json`):** Demonstrate the standard structure. Results can contain a top-level `measurement` or a nested `measurements` map containing arrays of values (e.g., different configs like `8888` or `gpu`).
- **Commit Identity (`with_commit_number.json`):** Shows the use of the `CP:` prefix in the `git_hash` field to represent a "Commit Position" rather than a standard Git SHA.
- **Result Variations:** Includes cases with zero measurements (`zero_measurement.json`) or entirely empty results lists (`no_results.json`), which the parser must handle without crashing.
#### 2. Character and Format Robustness
Ingested data often contains characters that could conflict with internal storage formats (like key-value pair delimiters).
- **Special Characters (`with_special_chars.json`):** Tests a wide range of symbols (e.g., `!~@#$%^&*()`) within both keys and values.
- **Delimiters (`with_comma_in_param.json`, `with_equal_in_param.json`):** Specifically tests strings containing `,` and `=`, which are often used as separators in time-series databases. These files verify that the parser correctly escapes or encapsulates these values.
#### 3. Error and Validation Cases
These files define the failure modes of the parser:
- **Syntactic Errors (`invalid.json`):** Plain text that is not valid JSON.
- **Schema Violations (`invalid_commit_number.json`):** Covers a `git_hash` whose `CP:` prefix is followed by a malformed value (e.g., `CP:727A901`, where the stray `A` means the remainder is not a valid number).
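The `CP:` convention described above can be illustrated with a small validation routine. This is only a sketch: the helper name `parseCommitPosition` and its return shape are assumptions made for illustration and are not taken from the parser's source.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// commitPositionPrefix marks a git_hash value that carries a commit position
// rather than a Git SHA, as in with_commit_number.json.
const commitPositionPrefix = "CP:"

// parseCommitPosition is a hypothetical helper. It returns ok=false when the
// value has no CP: prefix, and an error when the prefix is present but the
// remainder is not a non-negative integer.
func parseCommitPosition(gitHash string) (n int64, ok bool, err error) {
	if !strings.HasPrefix(gitHash, commitPositionPrefix) {
		return 0, false, nil
	}
	rest := strings.TrimPrefix(gitHash, commitPositionPrefix)
	n, err = strconv.ParseInt(rest, 10, 64)
	if err != nil || n < 0 {
		return 0, true, fmt.Errorf("invalid commit position %q", gitHash)
	}
	return n, true, nil
}

func main() {
	inputs := []string{"CP:727901", "CP:727A901", "8dcc84f7dc8523dd90501a4feb1f632808337c34"}
	for _, h := range inputs {
		n, ok, err := parseCommitPosition(h)
		fmt.Printf("input=%s number=%d isCommitPosition=%v err=%v\n", h, n, ok, err)
	}
}
```

Under these assumptions, a value such as `CP:727A901` is recognized as a commit position but rejected as malformed, while a plain SHA is passed through for normal Git lookup.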
### Data Workflow
The following diagram illustrates how the parser uses these files to determine the final identity of a performance measurement:
```text
JSON Input File
|
V
+-----------------+ +-----------------------+
| Global Keys |----->| Common Metadata |
| (arch, os, etc) | | (applied to all) |
+-----------------+ +-----------+-----------+
|
V
+-----------------+ +-----------------------+ +--------------------+
| Result Keys |----->| Unique Series ID |----->| Final Ingested |
| (test, config) | | (Global + Result Keys)| | Data Point |
+-----------------+ +-----------------------+ +--------------------+
|
+------------------------------+
|
V
+-----------------+ +-----------------------+
| Measurements |----->| Value (float64) |
| (single or map) | | |
+-----------------+ +-----------------------+
```
### Key Components
- **`git_hash` / `version`:** Every valid file includes these to identify the schema version and the point in time the data represents.
- **`key` block:** Found at both the root level (global params) and within individual `results` (test-specific params). The parser must merge these to create a full set of dimensions for the data.
- **`links`:** Demonstrated in `with_commit_number.json`, showing how external references (like documentation or build logs) are attached to the ingestion record.
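To make the key-merging step concrete, the following sketch decodes a trimmed-down Version 1 document and combines root-level and result-level `key` blocks into one series identifier. The struct names, the choice to let result-level values win on conflict, and the `,k=v,` identifier layout are assumptions for illustration only; the real parser may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// fileV1 is a hypothetical, reduced mirror of the Version 1 layout.
// Only the fields discussed in this section are modeled.
type fileV1 struct {
	Version int               `json:"version"`
	GitHash string            `json:"git_hash"`
	Key     map[string]string `json:"key"`
	Results []resultV1        `json:"results"`
}

type resultV1 struct {
	Key         map[string]string `json:"key"`
	Measurement float64           `json:"measurement"`
}

// mergedKey combines global and result-level params. Result-level values
// overwrite global ones on conflict (an assumption of this sketch).
func mergedKey(global, local map[string]string) map[string]string {
	out := make(map[string]string, len(global)+len(local))
	for k, v := range global {
		out[k] = v
	}
	for k, v := range local {
		out[k] = v
	}
	return out
}

// seriesID renders params in sorted key order so that the same set of
// params always yields the same string.
func seriesID(params map[string]string) string {
	keys := make([]string, 0, len(params))
	for k := range params {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(",")
	for _, k := range keys {
		b.WriteString(k + "=" + params[k] + ",")
	}
	return b.String()
}

// parseSeries maps each result's series ID to its single measurement.
func parseSeries(raw []byte) (map[string]float64, error) {
	var f fileV1
	if err := json.Unmarshal(raw, &f); err != nil {
		return nil, fmt.Errorf("decoding version 1 file: %w", err)
	}
	out := make(map[string]float64, len(f.Results))
	for _, r := range f.Results {
		out[seriesID(mergedKey(f.Key, r.Key))] = r.Measurement
	}
	return out, nil
}

func main() {
	raw := []byte(`{"version":1,"git_hash":"abc","key":{"arch":"x86"},
		"results":[{"key":{"test":"draw"},"measurement":1.5}]}`)
	series, err := parseSeries(raw)
	fmt.Println(series, err)
}
```

The sketch covers only the single `measurement` form; the nested `measurements` map described earlier would add one more loop per result.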
# Module: /go/ingest/process
# Ingest Process
The `process` module is the core execution engine of the Skia Perf ingestion pipeline. It coordinates the lifecycle of performance data from its raw state in a source (like Google Cloud Storage or a local directory) to its indexed state within a `TraceStore`.
## Overview
The module provides a multi-threaded ingestion worker system that handles parsing, commit mapping, data normalization, and storage. It is designed to be resilient, utilizing retries for database operations and supporting Google Cloud Pub/Sub for both input signaling and downstream event notifications.
The entry point is the `Start` function, which initializes the necessary infrastructure components—source monitors, trace stores, metadata stores, and git connectors—and launches a configurable number of parallel worker goroutines.
## Key Components and Responsibilities
### Worker Lifecycle
The system operates using a producer-consumer model:
1. **Source:** A `file.Source` (e.g., GCS bucket listener) produces `file.File` objects onto a channel.
2. **Workers:** Multiple worker goroutines consume from this channel. Each worker maintains its own `parser.Parser` to handle the transformation of raw bytes into structured performance data.
3. **Processing:** Each file undergoes a specific workflow: Parse -> Commit Mapping -> ParamSet Construction -> Store Write -> Event Notification.
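The producer-consumer model above can be sketched with a channel and a `sync.WaitGroup`. The `file` type, `handleFile`, and `processAll` are hypothetical stand-ins for `file.File` and the real per-file workflow; they exist only to show the shape of the worker pool.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// file stands in for file.File; only a name is modeled here.
type file struct{ name string }

var errEmptyName = errors.New("file has no name")

// handleFile is a placeholder for Parse -> Commit Mapping -> Store Write.
func handleFile(f file) error {
	if f.name == "" {
		return errEmptyName
	}
	return nil
}

// processAll starts n workers that drain the files channel until it is
// closed, then reports how many files succeeded and how many failed.
func processAll(files <-chan file, n int, handle func(file) error) (okCount, failCount int) {
	var (
		wg sync.WaitGroup
		mu sync.Mutex // guards the two counters
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for f := range files {
				err := handle(f)
				mu.Lock()
				if err != nil {
					failCount++
				} else {
					okCount++
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return okCount, failCount
}

func main() {
	ch := make(chan file, 3)
	ch <- file{name: "a.json"}
	ch <- file{name: ""}
	ch <- file{name: "b.json"}
	close(ch)

	ok, failed := processAll(ch, 2, handleFile)
	fmt.Printf("processed=%d failed=%d\n", ok, failed)
}
```

In the real module the source channel is long-lived, so workers run until the process exits rather than until the channel closes.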
### Commit Mapping and Git Integration
Ingested files typically contain a Git hash. The `process` module is responsible for resolving this hash into a monotonic `types.CommitNumber`.
- It uses the `git.Git` interface to look up commit numbers.
- If a hash is unrecognized, the worker triggers an update of the local git metadata to ensure it isn't simply a new commit that hasn't been cached yet.
- If a hash remains invalid after an update, the file is acknowledged (skipped) to prevent infinite retry loops in Pub/Sub.
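The lookup, refresh, and second lookup sequence can be sketched as follows. The `gitLookup` interface is a simplified, hypothetical slice of the `git.Git` interface (contexts are omitted), and `fakeGit` exists only so the example runs on its own.

```go
package main

import (
	"errors"
	"fmt"
)

type commitNumber int

var errUnknownHash = errors.New("unknown git hash")

// gitLookup is a hypothetical, context-free reduction of git.Git.
type gitLookup interface {
	CommitNumberFromGitHash(hash string) (commitNumber, error)
	Update() error
}

// resolveCommit tries the cached mapping first. On a miss it refreshes the
// git metadata once and retries, so that a brand-new commit is not
// mistaken for an invalid hash.
func resolveCommit(g gitLookup, hash string) (commitNumber, error) {
	if n, err := g.CommitNumberFromGitHash(hash); err == nil {
		return n, nil
	}
	if err := g.Update(); err != nil {
		return 0, fmt.Errorf("updating git metadata: %w", err)
	}
	return g.CommitNumberFromGitHash(hash)
}

// fakeGit knows some hashes up front and learns the pending ones on Update.
type fakeGit struct {
	known   map[string]commitNumber
	pending map[string]commitNumber
}

func (g *fakeGit) CommitNumberFromGitHash(hash string) (commitNumber, error) {
	if n, ok := g.known[hash]; ok {
		return n, nil
	}
	return 0, errUnknownHash
}

func (g *fakeGit) Update() error {
	for h, n := range g.pending {
		g.known[h] = n
	}
	g.pending = nil
	return nil
}

func main() {
	g := &fakeGit{
		known:   map[string]commitNumber{"aaa": 1},
		pending: map[string]commitNumber{"bbb": 2},
	}
	for _, h := range []string{"aaa", "bbb", "zzz"} {
		n, err := resolveCommit(g, h)
		fmt.Printf("hash=%s commit=%d err=%v\n", h, n, err)
	}
}
```

A caller that still receives an error after the refresh would, per the description above, acknowledge the message instead of retrying.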
### Data Normalization and Storage
Once parsed and mapped to a commit, the data is prepared for the `tracestore.TraceStore`:
- **ParamSet Construction:** The worker aggregates all parameters from the file into a `ParamSet`, which serves as the index for searching traces.
- **Resilient Writing:** Database writes are wrapped in a retry loop (defaulting to 10 attempts) to handle transient failures or contention in the underlying storage (e.g., Spanner or SQL).
- **Metadata:** If the file contains supplemental links or metadata, these are stored in a separate `MetadataStore`.
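A bounded retry loop of the kind described under **Resilient Writing** might look like the sketch below. The name `writeWithRetry` is hypothetical, the constant only mirrors the default of 10 attempts stated above, and no backoff is modeled because the text does not describe one.

```go
package main

import (
	"errors"
	"fmt"
)

// maxWriteAttempts mirrors the default of 10 attempts mentioned above.
const maxWriteAttempts = 10

var errTransient = errors.New("transient storage error")

// writeWithRetry calls write until it succeeds or the attempt budget is
// exhausted. It returns the number of attempts made.
func writeWithRetry(attempts int, write func() error) (int, error) {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = write(); err == nil {
			return i, nil
		}
	}
	return attempts, fmt.Errorf("write failed after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated store that fails twice before succeeding.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}
	n, err := writeWithRetry(maxWriteAttempts, flaky)
	fmt.Printf("attempts=%d err=%v\n", n, err)
}
```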
### Downstream Notifications
After a successful write, the module can notify other services via a Pub/Sub topic defined in `FileIngestionTopicName`.
- It filters and deduplicates trace IDs (see **Trace Clustering Logic**) before sending.
- This allows downstream systems like the clustering or anomaly detection engines to react immediately to newly ingested data.
## Trace Clustering Logic
To optimize downstream processing (like clustering), the module includes logic to prune redundant trace IDs. Many performance tests report multiple statistics for the same logical test (e.g., `test_name`, `test_name_avg`, `test_name_min`).
The `getTraceIdsForClustering` function implements a "canonicalization" check:
- If a trace has a suffix like `_avg`, `_min`, `_max`, or `_count`, the system looks for a "canonical" version of that same trace (the name without the suffix and a matching `stat` key).
- If the canonical version exists in the same file, the suffixed version is excluded from the Pub/Sub notification to reduce noise.
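The canonicalization check can be approximated with a small filter. This sketch is a simplification: it works on bare test names and ignores the `stat` key matching mentioned above, and the function names are hypothetical rather than the signature of `getTraceIdsForClustering`.

```go
package main

import (
	"fmt"
	"strings"
)

// statSuffixes lists the suffixes named in the text above.
var statSuffixes = []string{"_avg", "_min", "_max", "_count"}

// canonicalName strips a known stat suffix. The second result is false
// when the name carries none of the suffixes.
func canonicalName(name string) (string, bool) {
	for _, s := range statSuffixes {
		if strings.HasSuffix(name, s) && len(name) > len(s) {
			return strings.TrimSuffix(name, s), true
		}
	}
	return "", false
}

// pruneForClustering drops a suffixed name only when its canonical form
// is present in the same batch; everything else is kept in order.
func pruneForClustering(names []string) []string {
	present := make(map[string]bool, len(names))
	for _, n := range names {
		present[n] = true
	}
	out := make([]string, 0, len(names))
	for _, n := range names {
		if base, ok := canonicalName(n); ok && present[base] {
			continue
		}
		out = append(out, n)
	}
	return out
}

func main() {
	names := []string{"draw", "draw_avg", "draw_min", "blur_max"}
	fmt.Println(pruneForClustering(names))
}
```

Note that `blur_max` survives in this example because no plain `blur` entry exists in the same batch.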
## Internal Workflow
```text
[ Source (GCS/Dir) ]
|
v
[ file.File Channel ]
|
+----[ Worker 1 ]----[ Parser ]----> (Parsed Data)
| |
| +---------[ Git ]-------> (Commit Number)
| |
| +---------[ TraceStore ]-> [ Persistent Storage ]
| |
| +---------[ Pub/Sub ]----> [ Ingestion Events Topic ]
|
+----[ Worker 2 ]---- ...
|
+----[ Worker N ]---- ...
```
## Design Decisions
### Dead Letter Collection (DLC)
The module supports "Dead Letter" semantics via Pub/Sub Nacks. If `DeadLetterCollection` is enabled in the configuration, processing failures will trigger a `Nack()`, allowing the message to be redelivered or moved to a dead-letter queue by the infrastructure. If disabled, or if the error is unrecoverable (like a bad git hash), the message is `Ack()`-ed to clear the pipeline.
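The Ack/Nack decision described above reduces to a small function. The error values and the `decide` helper are hypothetical; they only encode the three rules stated in the paragraph.

```go
package main

import (
	"errors"
	"fmt"
)

// disposition is what the worker tells Pub/Sub about a message.
type disposition string

const (
	ack  disposition = "Ack"
	nack disposition = "Nack"
)

var (
	errBadGitHash = errors.New("git hash could not be resolved") // unrecoverable
	errStoreWrite = errors.New("trace store write failed")       // may succeed on retry
)

// decide applies the rules from the text: success and unrecoverable
// errors are acknowledged; other failures are nacked only when dead
// letter collection is enabled.
func decide(dlcEnabled bool, err error) disposition {
	if err == nil {
		return ack
	}
	if errors.Is(err, errBadGitHash) {
		return ack
	}
	if dlcEnabled {
		return nack
	}
	return ack
}

func main() {
	wrapped := fmt.Errorf("resolving commit: %w", errBadGitHash)
	fmt.Println(decide(true, nil))
	fmt.Println(decide(true, wrapped))
	fmt.Println(decide(true, errStoreWrite))
	fmt.Println(decide(false, errStoreWrite))
}
```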
### Context and Timeouts
A `defaultDatabaseTimeout` of 60 minutes is applied to file processing. This high threshold accounts for large files that might contain thousands of traces requiring significant indexing time in the database, while still providing a circuit breaker for stalled connections.
### Parallelism
The number of parallel ingesters is configurable. This allows the system to scale horizontally based on the CPU/Memory available to the container and the IOPS capacity of the underlying `TraceStore`.
# Module: /go/ingestevents
### Overview
The `ingestevents` module provides a standardized data contract and serialization format for communication between Perf ingesters and regression detection components. Within the Perf architecture, ingesters process raw performance data files and store them in the database. Once a file is successfully processed, the system must trigger downstream tasks—specifically regression detection—to analyze the newly arrived data.
This module facilitates this "Event-Driven Alerting" by defining the `IngestEvent` structure and providing utilities to pass this data through Google Cloud PubSub efficiently.
### Design Decisions: Payload Efficiency
A key challenge in event-driven architectures is balancing the richness of the event data against transport limits. PubSub has a maximum message size (10MB), and high-volume performance data can easily exceed this if not handled carefully.
To address this, the module implements a mandatory compression strategy:
- **Gzip Compression**: All `IngestEvent` payloads are Gzipped before being sent to PubSub and must be decompressed upon receipt. This ensures that even files containing thousands of `TraceIDs` or complex `ParamSets` remain well below the transport limits.
- **JSON Encoding**: Within the compressed envelope, data is stored as standard JSON to maintain readability and ease of debugging during development.
### Key Components
#### IngestEvent Structure
The `IngestEvent` struct is the core data transfer object. It contains three primary pieces of information that allow downstream clusterers to perform regression detection without re-querying the database for metadata:
- **`TraceIDs`**: A slice of unencoded trace identifiers found in the ingested file. This tells downstream consumers exactly which series of data points have been updated.
- **`ParamSet`**: A summary of the parameters (key-value pairs) associated with the `TraceIDs`. This provides immediate context about the hardware, benchmarks, and configurations affected by the new data.
- **`Filename`**: The source of the data, useful for auditing and tracking the ingestion pipeline.
#### Serialization Utilities
The module provides two primary functions to manage the lifecycle of an event:
- **`CreatePubSubBody`**: Orchestrates the encoding process. It uses a `bytes.Buffer` coupled with a `gzip.Writer` to transform an `IngestEvent` into a compressed byte slice ready for PubSub publishing.
- **`DecodePubSubBody`**: The inverse operation. It handles the Gzip decompression and JSON decoding, returning a pointer to the original `IngestEvent`.
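The encode and decode pair can be sketched using only the standard library. The `ingestEvent` struct below is a stand-in that mirrors the three fields described above, with a plain map in place of the real `ParamSet` type; `encodeBody` and `decodeBody` are hypothetical names, not the module's actual functions.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
)

// ingestEvent is an illustrative stand-in for IngestEvent.
type ingestEvent struct {
	TraceIDs []string
	ParamSet map[string][]string
	Filename string
}

// encodeBody serializes the event as JSON inside a gzip stream.
func encodeBody(ev *ingestEvent) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if err := json.NewEncoder(zw).Encode(ev); err != nil {
		return nil, fmt.Errorf("encoding event: %w", err)
	}
	// Close flushes buffered data and writes the gzip footer; without it
	// the output would be truncated.
	if err := zw.Close(); err != nil {
		return nil, fmt.Errorf("closing gzip writer: %w", err)
	}
	return buf.Bytes(), nil
}

// decodeBody reverses encodeBody.
func decodeBody(b []byte) (*ingestEvent, error) {
	zr, err := gzip.NewReader(bytes.NewReader(b))
	if err != nil {
		return nil, fmt.Errorf("opening gzip stream: %w", err)
	}
	defer zr.Close()
	var ev ingestEvent
	if err := json.NewDecoder(zr).Decode(&ev); err != nil {
		return nil, fmt.Errorf("decoding event: %w", err)
	}
	return &ev, nil
}

func main() {
	ev := &ingestEvent{
		TraceIDs: []string{",arch=x86,test=draw,"},
		ParamSet: map[string][]string{"arch": {"x86"}, "test": {"draw"}},
		Filename: "gs://bucket/path/file.json", // illustrative path
	}
	body, err := encodeBody(ev)
	if err != nil {
		panic(err)
	}
	back, err := decodeBody(body)
	fmt.Println(len(body) > 0, back.Filename, err)
}
```

Gzip pays off on large events because trace IDs share long common substrings; for a tiny event like the one in `main`, the compressed body can be larger than the raw JSON.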
### Workflow: Data Ingestion to Alerting
The following diagram illustrates how this module fits into the broader Perf data pipeline:
```text
[ Raw Data File ]
|
v
[ Ingester Service ] ----> ( Writes to Database )
|
| ( Creates IngestEvent )
v
[ ingestevents.CreatePubSubBody ]
|
| ( Gzipped JSON )
v
[ Google Cloud PubSub ]
|
v
[ Clusterer / Detection Service ]
|
| ( Receives Message )
v
[ ingestevents.DecodePubSubBody ]
|
| ( Result: TraceIDs, ParamSet )
v
[ Regression Detection Logic ]
```
### Implementation Details
The implementation leverages `go.skia.org/infra/go/util` for safe Gzip stream handling and `go.skia.org/infra/go/skerr` for structured error wrapping. This ensures that failures during decompression or decoding (e.g., due to malformed PubSub messages) provide enough context to identify where the pipeline stalled.
# Module: /go/initdemo
# initdemo
The `initdemo` module provides a utility for bootstrapping a local development environment for Skia Perf. Its primary purpose is to automate the creation of the required database and the application of the current schema, ensuring developers have a consistent and functional backend for local testing.
## Design Philosophy
This utility is designed for idempotency and speed in local development. Rather than relying on complex migration tools or manual database setup steps, `initdemo` uses the direct Spanner-compatible schema defined within the Perf codebase.
The choice to use a simple Go binary for this task reflects a preference for:
- **Consistency**: It ensures that the local "demo" database matches the exact schema expectations of the current source code.
- **Simplicity**: By wrapping both database creation and schema application in one command, it reduces the friction for new contributors setting up their environment.
## Key Components and Responsibilities
### Database Initialization (`main.go`)
The core logic resides in `main.go`. It performs two distinct phases of setup:
1. **Database Provisioning**: It attempts to create a new database (defaulting to `demo`). It gracefully handles cases where the database already exists, allowing the tool to be run repeatedly without side effects.
2. **Schema Application**: It retrieves the Spanner-compatible schema directly from the `perf/go/sql/spanner` package. This creates all necessary tables, indices, and constraints required for Skia Perf to function.
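The two phases can be sketched against a minimal interface so that the control flow is visible without a live database. Everything here is hypothetical: `execer` stands in for a database connection, `fakeDB` simulates one in memory, and the schema string is a placeholder rather than the real `spanner.Schema`.

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

// execer is a hypothetical, minimal view of a database connection.
type execer interface {
	Exec(stmt string) error
}

// initDB creates the database and then applies the schema. A failure to
// create the database is logged and ignored, on the assumption that it
// already exists; a schema failure is returned to the caller.
func initDB(db execer, name, schema string) error {
	if err := db.Exec("CREATE DATABASE " + name + ";"); err != nil {
		log.Printf("database %q not created (it may already exist): %v", name, err)
	}
	if err := db.Exec(schema); err != nil {
		return fmt.Errorf("applying schema: %w", err)
	}
	return nil
}

// fakeDB records which databases exist and how often a schema was applied.
type fakeDB struct {
	databases map[string]bool
	applied   int
}

func (f *fakeDB) Exec(stmt string) error {
	if strings.HasPrefix(stmt, "CREATE DATABASE ") {
		name := strings.TrimSuffix(strings.TrimPrefix(stmt, "CREATE DATABASE "), ";")
		if f.databases[name] {
			return fmt.Errorf("database %q already exists", name)
		}
		f.databases[name] = true
		return nil
	}
	f.applied++
	return nil
}

func main() {
	db := &fakeDB{databases: map[string]bool{}}
	schema := "CREATE TABLE IF NOT EXISTS Example (id INT PRIMARY KEY);" // placeholder
	// Running twice demonstrates that a repeated run does not fail.
	fmt.Println(initDB(db, "demo", schema))
	fmt.Println(initDB(db, "demo", schema))
}
```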
### Schema Source
The module depends on `go.skia.org/infra/perf/go/sql/spanner`. This dependency is critical because it acts as the "Source of Truth" for the database structure. By importing `spanner.Schema`, `initdemo` guarantees that the local environment is always in sync with the production-ready schema definitions used by the main Perf application.
## Workflow
The following diagram illustrates the sequence of operations performed by the utility:
```text
[ Developer ]
|
V
+--------------+
| initdemo run |
+--------------+
|
| 1. Connects to local DB instance (e.g., CockroachDB/Spanner Emulator)
|
V
+-----------------------+
| CREATE DATABASE demo; |----( If exists, log and continue )
+-----------------------+
|
| 2. Fetch Schema from /perf/go/sql/spanner
|
V
+-----------------------+
| Apply SQL Statements |----( Create Tables, Indices, etc. )
+-----------------------+
|
V
+-----------------------+
| Success / Exit |
+-----------------------+
```
## Configuration
The module supports customization through command-line flags, primarily allowing the user to point the utility at a specific database instance or rename the target database.
- `database_url`: Defines the connection string. Although the tool is used for Spanner-compatible schemas, it utilizes the `pgx` library, reflecting the common local development pattern of using CockroachDB or similar PostgreSQL-wire-compatible emulators.
- `databasename`: Allows the user to override the default "demo" name.
# Module: /go/issuetracker
### Overview
The `issuetracker` module provides a high-level abstraction for interacting with the Google Issue Tracker (Buganizer) API, specifically tailored for the Skia Perf ecosystem. Its primary goal is to automate the lifecycle of performance regressions—from initial detection to filing bugs and updating them with diagnostic data.
By wrapping the lower-level `issuetracker/v1` API, this module handles the complexities of authentication, data formatting (Markdown), and the mapping of Perf-specific entities (anomalies, regressions, and subscriptions) into actionable bug reports.
### Design and Implementation Choices
#### Data-Driven Bug Filing
Unlike a simple API client, `FileBug` relies heavily on internal state from the `regression.Store`. When a bug is filed, the module does not simply trust the parameters passed from the frontend; instead, it queries the database to:
1. Identify the correct **Bug Component**, **Priority**, and **Severity** based on the `Subscription` linked to the regression.
2. Aggregate technical details (Bots, Benchmarks, Measurements) directly from the regression data to ensure the bug description is accurate and comprehensive.
#### Safety and Testing Rails
The implementation includes a "test run" mechanism (`checkTestRun`). If a regression is not linked to a specific internal testing email (e.g., `sergeirudenkov@google.com`), the module defaults the bug status to `NEW`, clears the assignee, and removes CCs. This prevents automated systems from accidentally spamming production engineering teams during development or misconfiguration.
#### URL Management
The module handles the "Long URL" problem inherent in web-based analysis tools. Performance reports often involve hundreds of individual regression keys. To prevent breaking Issue Tracker or browser limits, the module calculates the length of the generated graph URL. If it exceeds a safe threshold (~2000 characters), it swaps the direct link for a "Link by Bug ID" (e.g., `/u?bugID=12345`), which leverages Perf's ability to look up regressions associated with a specific tracker ID.
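The length check can be sketched as below. The 2000-character limit and the `/u?bugID=` fallback come from the description above; the shape of the direct link (path `/m/` and the `key` query parameter), the host name, and the function name `graphLink` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// maxURLLength approximates the safe threshold mentioned above.
const maxURLLength = 2000

// graphLink builds a direct link listing every regression key and falls
// back to a lookup by bug ID when that link would be too long.
func graphLink(base string, regressionKeys []string, bugID int) string {
	q := url.Values{}
	for _, k := range regressionKeys {
		q.Add("key", k) // hypothetical parameter name
	}
	direct := base + "/m/?" + q.Encode()
	if len(direct) <= maxURLLength {
		return direct
	}
	return fmt.Sprintf("%s/u?bugID=%d", base, bugID)
}

func main() {
	base := "https://perf.example.com" // illustrative host
	fmt.Println(graphLink(base, []string{"k1", "k2"}, 12345))

	many := make([]string, 500)
	for i := range many {
		many[i] = fmt.Sprintf("regression-key-%04d", i)
	}
	fmt.Println(graphLink(base, many, 12345))
}
```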
#### Authentication and Environments
The module supports two distinct operating modes:
- **Production**: Uses `go/secret` to fetch API keys from GCP Secret Manager and initializes an OAuth2 authorized client.
- **Development**: If `devMode` is active, it redirects traffic to a local `mockhost` (port 8081) and bypasses authentication, allowing for end-to-end UI testing without real API credentials.
### Key Components
#### IssueTracker Interface (`issuetracker.go`)
This is the primary entry point. It defines the contract for filing bugs, adding comments, and querying issues. The implementation (`issueTrackerImpl`) coordinates with several sub-systems:
- **`regression.Store`**: Provides the underlying data for detected performance shifts.
- **`userissue.Store`**: Tracks issues manually filed by users to prevent duplicates and maintain a history of user-driven triage.
- **`anomalygroup/service`**: Used to rank and select the "Top Anomalies" to include in the bug's summary, ensuring the most impactful data is presented first.
#### Bug Description Generation
The module dynamically constructs Markdown descriptions. The workflow for generating a bug body follows this logic:
```text
[ Regression IDs ] --> [ Fetch Subscription Details ] --> [ Determine P/S Level ]
|
v
[ Fetch Regression Data ] --> [ Aggregate Bot Names ]
|
v
[ Rank Anomalies ] ---------> [ Format Top 10 List ]
|
v
[ Final Markdown ] <--------- [ Construct Graph Links ]
```
#### User Issue Logic (`FileUserIssue`)
While `FileBug` is often automated, `FileUserIssue` handles cases where a user manually identifies a regression on a specific trace. This workflow is simpler but critical for manual triage; it creates a bug with a standardized title containing the Trace ID and Commit Position, then persists this relationship in the `userissue.Store`.
### Key Workflows
#### Automated Bug Selection
When multiple regressions are grouped into a single bug filing request, the module must decide which metadata to use. It iterates through all associated subscriptions and selects the one with the highest priority (lowest numerical value), with severity as the secondary criterion. In the example below, Sub B wins on priority even though Sub A carries the more severe rating.
```text
Regressions: [R1, R2, R3]
|
+--> Sub A (P2, S2)
+--> Sub B (P1, S3)
|
[ Selection: Sub B ] (Priority P1 wins over P2)
```
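The selection rule can be written as a single pass over the candidate subscriptions. This sketch assumes that lower numbers mean higher priority and higher severity, and that severity is consulted only to break priority ties; the `subscription` struct is a hypothetical reduction of the real type.

```go
package main

import "fmt"

// subscription keeps only the fields relevant to the selection.
type subscription struct {
	Name     string
	Priority int // 1 means P1, the most urgent
	Severity int // 1 means S1, the most severe
}

// pickSubscription returns the subscription with the lowest priority
// number, using the lowest severity number as a tie-breaker. The bool is
// false when there are no candidates.
func pickSubscription(subs []subscription) (subscription, bool) {
	if len(subs) == 0 {
		return subscription{}, false
	}
	best := subs[0]
	for _, s := range subs[1:] {
		higherPriority := s.Priority < best.Priority
		tieOnPriority := s.Priority == best.Priority && s.Severity < best.Severity
		if higherPriority || tieOnPriority {
			best = s
		}
	}
	return best, true
}

func main() {
	subs := []subscription{
		{Name: "Sub A", Priority: 2, Severity: 2},
		{Name: "Sub B", Priority: 1, Severity: 3},
	}
	best, ok := pickSubscription(subs)
	fmt.Println(best.Name, ok)
}
```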
#### Bug Update Loop
Once a bug is created, the module immediately posts a follow-up comment. This comment contains a specialized Perf URL that uses the newly created `IssueId` as a query parameter. This ensures that anyone viewing the bug can immediately jump back into the Perf UI to see the live, filtered graph of the relevant regressions.
# Module: /go/issuetracker/mockhost
### Overview
The `mockhost` module provides a lightweight, standalone HTTP server that emulates a subset of the Issue Tracker API. Its primary purpose is to facilitate local development and testing of services that interact with the issue tracking system, allowing developers to verify request/response handling without requiring access to a live production API or complex authentication setups.
### Design and Implementation
The module is designed for simplicity and predictability. It implements a RESTful interface using the `chi` router, mimicking the endpoint structure expected by clients of the `issuetracker/v1` library.
Instead of maintaining a complex state or an in-memory database, the mock host uses a "static response" strategy. It accepts valid API requests, logs the incoming parameters for visibility during debugging, and returns pre-defined JSON payloads that conform to the `issuetracker` data structures. This approach ensures that the mock remains low-maintenance and deterministic.
#### Key Workflows
The server handles three primary operations, mapping HTTP methods to specific Issue Tracker behaviors:
```text
[ Client ] [ mockhost (:8081) ]
| |
| GET /v1/issues |
|-------------------->| Log query -> Return static issue list
| |
| POST /v1/issues |
|-------------------->| Decode body -> Return new issue with ID 98765
| |
| POST /v1/issues/{id}/comments
|-------------------->| Parse {id} -> Return comment confirmation
```
### Components and Responsibilities
#### Entry Point and Routing (`main.go`)
The `main.go` file acts as the central coordinator. It initializes a `chi` router and maps specific URL patterns to handler functions. The server listens on port `:8081` by default. Its main responsibility is routing and ensuring the HTTP server's lifecycle is managed.
#### API Handlers (`main.go`)
The handlers encapsulate the logic for simulating the Issue Tracker API:
- **`listIssuesHandler`**: Simulates searching for issues. It extracts the `query` parameter from the URL to log what the client is searching for, then returns a `ListIssuesResponse` containing a single mock issue (ID `12345`). This allows clients to test list-parsing logic.
- **`fileBugHandler`**: Simulates the creation of a new bug. It decodes the incoming `Issue` object from the request body to reflect the submitted title back to the client, while assigning a static mock ID (`98765`) to simulate the backend's ID generation.
- **`createCommentHandler`**: Simulates adding a comment to an existing issue. It validates that the `issueId` in the URL is a valid integer and echoes the comment text back in the response. This is useful for verifying that clients are correctly targeting the right issue resources.
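The static-response strategy can be shown with one handler and an in-process request. To stay self-contained, this sketch uses only `net/http` and `net/http/httptest` instead of the `chi` router, and the `issue` struct is a simplified, hypothetical stand-in for the `issuetracker/v1` types. Only the fixed ID `98765` and the title echo are taken from the description above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// issue is a reduced, illustrative model of an Issue Tracker issue.
type issue struct {
	IssueID int64  `json:"issueId"`
	Title   string `json:"title"`
}

// fileBugHandler echoes the submitted title and assigns a fixed mock ID.
func fileBugHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	var in issue
	if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
		http.Error(w, "bad request body", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(issue{IssueID: 98765, Title: in.Title})
}

// fileBug drives the handler in-process, without opening a network port.
func fileBug(title string) (issue, error) {
	body, err := json.Marshal(issue{Title: title})
	if err != nil {
		return issue{}, err
	}
	req := httptest.NewRequest(http.MethodPost, "/v1/issues", bytes.NewReader(body))
	rec := httptest.NewRecorder()
	fileBugHandler(rec, req)
	if rec.Code != http.StatusOK {
		return issue{}, fmt.Errorf("unexpected status %d", rec.Code)
	}
	var out issue
	err = json.NewDecoder(rec.Body).Decode(&out)
	return out, err
}

func main() {
	got, err := fileBug("Regression in draw benchmark")
	fmt.Println(got.IssueID, got.Title, err)
}
```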
#### Data Structures
The module relies on `//go/issuetracker/v1:issuetracker` for its data models. By using the same structures as the production client, the mock ensures that the JSON serialization and deserialization remain perfectly compatible with the real service.
#### Logging
Integration with `go/sklog` ensures that all interactions with the mock host are recorded to the console. This allows developers to inspect the payloads being sent by their services in real-time by simply watching the `mockhost` output.
# Module: /go/issuetracker/mocks
The `issuetracker/mocks` module provides an automated mocking implementation of the `IssueTracker` interface. Its primary purpose is to facilitate unit testing for components within the Perf system that interact with external issue tracking services without requiring actual network calls or authentication against a real Issue Tracker API.
### Design and Implementation Choice
The module utilizes `mockery` to generate code based on the `issuetracker.IssueTracker` interface. This approach ensures that the mock stays in sync with the actual interface definition found in `/perf/go/issuetracker`. By using the `testify/mock` framework, it allows developers to:
1. **Program Behaviors**: Define specific return values or errors for calls to the issue tracker.
2. **Verify Interactions**: Assert that the system under test (SUT) calls specific methods (like `FileBug` or `CreateComment`) with the expected parameters.
3. **Decouple Tests**: Isolate Perf logic (such as anomaly detection or regression filing) from the complexities of the Issue Tracker V1 API.
### Key Components
#### IssueTracker.go
This file contains the `IssueTracker` struct, which embeds `mock.Mock`. It implements the standard operations required for Perf's integration with bug tracking:
- **Bug Creation (`FileBug`, `FileUserIssue`)**: These methods simulate the creation of new issues. In a test environment, they allow the SUT to receive a mock Issue ID (int) to verify that the ID is correctly stored or referenced in the Perf database.
- **Communication (`CreateComment`)**: Mocks the addition of comments to existing issues, used for updating status or providing additional data on detected regressions.
- **Discovery (`ListIssues`)**: Simulates querying the tracker for existing issues, returning a slice of `v1.Issue` objects. This is crucial for testing logic that prevents duplicate bug filing.
### Typical Testing Workflow
The mock is designed to be instantiated within a test suite using `NewIssueTracker(t)`. This constructor automatically registers cleanup functions to assert that all defined expectations were met before the test finishes.
```text
+-----------+ +----------------------+ +-----------------+
| Test | | System Under Test | | Mock (this) |
| Routine | | (e.g., Alerter) | | IssueTracker |
+-----------+ +----------------------+ +-----------------+
| | |
|-- 1. Setup Expectations ->| |
| (On FileBug return 123) | |
| | |
|-- 2. Trigger Action ----->| |
| |-- 3. Call FileBug(ctx, req) ->|
| | |
| |<-- 4. Return (123, nil) ------|
| | |
|-- 5. Assert Result -------| |
| | |
|-- 6. Cleanup/Verify (Auto)|----------------------------->|
| (Was FileBug called?)
```
### Key Dependencies
- **`perf/go/issuetracker`**: Defines the request and response structures (e.g., `FileBugRequest`) that the mock must handle.
- **`go.skia.org/infra/go/issuetracker/v1`**: Provides the underlying data models for the issues themselves.
- **`github.com/stretchr/testify/mock`**: The engine driving the programmatic responses and assertions.
# Module: /go/kmeans
# Generic K-Means Clustering
The `kmeans` module provides a flexible, generic implementation of Lloyd's Algorithm for k-means clustering. Rather than being tied to a specific data format like 2D coordinates or high-dimensional vectors, it uses a set of interfaces that allow it to cluster any data type where a distance metric and a centroid calculation can be defined.
This is particularly useful in the context of performance monitoring (Perf), where clustering might be applied to different types of trace data or experimental results.
## Design and Architecture
The implementation decouples the clustering logic from the mathematical specifics of the data. This is achieved through three primary abstractions:
- **`Clusterable`**: An empty interface (`interface{}`) representing the data points (observations) to be clustered.
- **`Centroid`**: An interface representing the "center" of a cluster. It must provide a `Distance` method to calculate how far a `Clusterable` is from itself and an `AsClusterable` method to allow the centroid to be treated as a data point in results.
- **`CalculateCentroid`**: A function type responsible for generating a new `Centroid` from a slice of `Clusterable` observations. This encapsulates the logic of how to "average" a specific group of data points.
### Decision: Interface-Based Abstraction
By using interfaces, the module avoids hardcoding Euclidean distance or vector arithmetic. For example, if clustering time-series data, the `Centroid` implementation could use Dynamic Time Warping (DTW) for distance, while a categorical dataset might use Hamming distance. The core algorithm remains unchanged regardless of these implementation details.
## Key Workflows
### The Clustering Loop
The module executes the standard iterative k-means process. Each iteration (performed by the `Do` function) follows these steps:
1. **Assignment**: Each observation is assigned to the nearest centroid based on the `Distance` metric.
2. **Update**: For each resulting cluster, a new centroid is calculated using the provided `CalculateCentroid` function.
3. **Refinement**: If a centroid has no assigned observations, it is discarded, potentially reducing the number of clusters (k).
```text
Initial Centroids + Observations
|
v
+-----------------------------+
| Do() Iteration Loop | <-----------+
| 1. Find closest centroid | |
| 2. Group observations | | Repeat N times
| 3. Calculate new centroids | | (iters)
+-----------------------------+ |
| |
+--------------------------------+
|
v
Final Centroids + Grouped Clusters
```
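The loop above can be sketched with the three abstractions named in this section. The interface and function type names follow the text; the helpers `step` and `run`, the 1-D `point` type, and the `mean` function are hypothetical and do not reproduce the signatures of `Do` or `KMeans`.

```go
package main

import (
	"fmt"
	"math"
)

type Clusterable interface{}

type Centroid interface {
	Distance(c Clusterable) float64
	AsClusterable() Clusterable
}

type CalculateCentroid func([]Clusterable) Centroid

// point is a 1-D observation that also serves as its own centroid.
// Distance panics if handed anything other than a point.
type point float64

func (p point) Distance(c Clusterable) float64 {
	return math.Abs(float64(p) - float64(c.(point)))
}

func (p point) AsClusterable() Clusterable { return p }

// mean averages a non-empty group of points.
func mean(members []Clusterable) Centroid {
	sum := 0.0
	for _, m := range members {
		sum += float64(m.(point))
	}
	return point(sum / float64(len(members)))
}

// closest returns the index of the centroid nearest to o.
func closest(o Clusterable, centroids []Centroid) int {
	best, bestDist := 0, math.Inf(1)
	for i, c := range centroids {
		if d := c.Distance(o); d < bestDist {
			best, bestDist = i, d
		}
	}
	return best
}

// step performs one assignment and update pass. Centroids that attract
// no observations are dropped, so the result may be shorter than the input.
func step(obs []Clusterable, centroids []Centroid, f CalculateCentroid) []Centroid {
	groups := make([][]Clusterable, len(centroids))
	for _, o := range obs {
		i := closest(o, centroids)
		groups[i] = append(groups[i], o)
	}
	next := make([]Centroid, 0, len(centroids))
	for _, g := range groups {
		if len(g) > 0 {
			next = append(next, f(g))
		}
	}
	return next
}

// run repeats step a fixed number of times.
func run(obs []Clusterable, centroids []Centroid, iters int, f CalculateCentroid) []Centroid {
	for i := 0; i < iters; i++ {
		centroids = step(obs, centroids, f)
	}
	return centroids
}

func main() {
	obs := []Clusterable{point(1), point(2), point(3), point(10), point(11), point(12)}
	start := []Centroid{point(0), point(5)}
	fmt.Println(run(obs, start, 5, mean))
}
```

With these inputs the centroids settle on the means of the two obvious groups, and a starting centroid placed far from every observation would be discarded on the first pass.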
### Result Aggregation
The `GetClusters` function organizes the final output. It produces a two-dimensional slice where each inner slice represents a cluster. By convention, the first element of each inner slice is the `Centroid` itself (converted via `AsClusterable`), followed by all observations belonging to that cluster. This provides a clear, grouped view of the algorithm's output.
## Implementation Details
### `kmeans.go`
This is the core of the module.
- **`Do`**: Implements a single iteration of the algorithm. It is designed to be called repeatedly. Note that it returns a new slice of centroids and may return fewer than the input if clusters become empty.
- **`KMeans`**: A convenience wrapper that runs `Do` for a fixed number of iterations.
- **`TotalError`**: A utility to calculate the sum of distances from all observations to their respective centroids, providing a measure of how well the clusters fit the data.
### `kmeans_test.go`
The tests serve as the primary documentation for how to implement the required interfaces. They demonstrate a concrete 2D implementation (`myObservation`) where the same struct satisfies both `Clusterable` and `Centroid` interfaces, and a corresponding `calculateCentroid` function that computes the arithmetic mean of X and Y coordinates.
# Module: /go/maintenance
# Perf Maintenance Module
The `maintenance` module serves as the central orchestration point for all long-running background processes and administrative tasks within a Skia Perf instance. Instead of handling user requests, this module is responsible for database health, data synchronization, schema migrations, and cache warming.
## High-Level Overview
In a distributed system like Skia Perf, various tasks must occur outside the critical path of the web UI or ingestion engine. The `maintenance` module consolidates these tasks into a single entry point. It manages the lifecycle of background goroutines that handle:
- **Database Schema Management**: Ensuring the SQL schema is up-to-date and performing migrations.
- **Data Ingestion & Sync**: Keeping the local representation of Git repositories fresh.
- **Data Retention**: Pruning old regressions and shortcuts to manage database size.
- **Cache Management**: Periodically refreshing Redis caches to ensure query performance remains high.
- **External Config Sync**: Pulling configurations (like sheriffing rules) from LUCI Config.
## Key Components and Responsibilities
### Process Orchestration (`maintenance.go`)
The primary responsibility of this module is the `Start` function. It acts as a switchboard, using configuration flags (`MaintenanceFlags`) and instance settings to decide which background services to initialize.
Design decisions in this coordinator include:
- **Blocking Execution**: The `Start` function is designed to run indefinitely (ending in a `select {}`). This is intended for use in a dedicated "maintenance" microservice or container that runs alongside the main Perf application.
- **Centralized Scheduling**: It defines the "heartbeat" of the system through various constants (e.g., `gitRepoUpdatePeriod`, `deletionPeriod`). By centralizing these, developers can easily reason about the total background load on the database.
### Schema and Migration
The module ensures the database environment is ready before starting other services. It utilizes `expectedschema` to validate and migrate the core schema. It also handles specialized migrations, such as moving regression data between table formats, which are executed in small, controlled batches (`regressionMigrationBatchSize`) to avoid locking the database or exhausting resources.
### Data Lifecycle and Retention
Through the `deletion` submodule, the maintenance process enforces a data retention policy. It removes "Regressions" that have aged past the retention period (currently 18 months), together with the "Shortcuts" (stored references to sets of traces) that those regressions point to.
### Refreshing Query Caches
To prevent the first user of the day from experiencing slow queries, the maintenance module performs "cache warming." It initializes a `ParamSetRefresher` which scans the `TraceStore` and populates Redis. This ensures that the available query parameters (keys and values) are always pre-calculated and ready for the UI.
### External Service Integration
- **Git Polling**: Periodically fetches new commits from the source of truth to ensure the Perf database stays mapped to the correct revision history.
- **Sheriff Config**: Integrates with LUCI Config to import subscription and alerting rules, allowing teams to manage their Perf configurations via version-controlled files outside of the Perf database itself.
## Key Workflows
### Initialization and Background Loop
When the maintenance service starts, it follows a specific sequence to ensure dependencies are met before background loops begin:
```text
Start(ctx, flags, config)
|
|-- 1. Initialize Tracing (Observability)
|-- 2. Connect to Database & Validate/Migrate Schema
|-- 3. Initialize Git Provider & Start Polling
|
|-- 4. Launch Concurrent Goroutines (if enabled):
|      |--> [Migration] Periodic Regression Migration
|      |--> [Config] LUCI Config Import Routine
|      |--> [Cache] Redis ParamSet Refresh Routine
|      |--> [Deletion] Data Retention / TTL Cleanup
|
|-- 5. Block (select {})
```
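The sequence above can be sketched in Go as follows. This is an illustration only, not the actual `maintenance.go`: the `task` type, its fields, and the task names are hypothetical, and a cancellable context stands in for the real `select {}` so that the example terminates.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// task describes one background routine. The type and its fields are
// hypothetical; they are not part of the real maintenance package.
type task struct {
	name    string
	enabled bool
	period  time.Duration
	run     func(ctx context.Context)
}

// start launches one goroutine per enabled task, then blocks until ctx is
// cancelled. It returns the names of the tasks that were started.
func start(ctx context.Context, tasks []task) []string {
	var wg sync.WaitGroup
	started := []string{}
	for _, t := range tasks {
		if !t.enabled {
			continue
		}
		started = append(started, t.name)
		wg.Add(1)
		go func(t task) {
			defer wg.Done()
			ticker := time.NewTicker(t.period)
			defer ticker.Stop()
			for {
				select {
				case <-ctx.Done():
					return
				case <-ticker.C:
					t.run(ctx)
				}
			}
		}(t)
	}
	<-ctx.Done() // The real Start is described as blocking on select {}.
	wg.Wait()
	return started
}

// demo runs the switchboard briefly with two of three tasks enabled.
func demo() []string {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	noop := func(context.Context) {}
	return start(ctx, []task{
		{name: "migration", enabled: true, period: time.Hour, run: noop},
		{name: "cache-refresh", enabled: false, period: time.Hour, run: noop},
		{name: "deletion", enabled: true, period: time.Hour, run: noop},
	})
}

func main() {
	fmt.Println(demo())
}
```

Each routine owns its own ticker, so a slow task delays only itself and not its siblings.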
## Design Choices: Why a Separate Module?
- **Isolation of Concerns**: By separating maintenance from the main `frontend` or `ingest` processes, heavy operations (like schema migration or massive deletions) do not steal CPU or IO cycles from user-facing requests.
- **Fault Tolerance**: If a background migration fails or hangs, it does not crash the web server.
- **Single-Writer Principle**: For certain migration tasks, having a single maintenance instance ensures that multiple pods aren't trying to perform schema changes or batch deletions simultaneously, reducing transaction contention.
# Module: /go/maintenance/deletion
# Perf Data Retention Maintenance
The `deletion` module provides a background maintenance service responsible for enforcing data retention policies within the Skia Perf system. It specifically targets the cleanup of aged regression data and their associated shortcuts to ensure the database remains performant and focused on relevant recent history.
## High-Level Overview
In the Perf system, regressions (detections of performance changes) and shortcuts (references to specific sets of traces) accumulate over time. To maintain database health, this module implements a Time-To-Live (TTL) policy. Currently, the retention period is hardcoded to **18 months**.
The module operates by periodically scanning the database for regressions older than this TTL, identifying the specific database keys (commit numbers and shortcut IDs), and removing them in atomic batches.
## Key Components and Responsibilities
### Deleter (`deleter.go`)
The `Deleter` is the central coordinator. It interacts with both the `regression.Store` and the `shortcut.Store`. Its primary responsibility is to bridge the two stores; since shortcuts are often referenced by regression entries, they should be cleaned up together to prevent orphaned data or broken references in the UI.
### Logic & Design Choices
#### TTL Enforcement
The deletion logic uses the timestamp of the "step point" (the point in time where a performance shift occurred) within a regression's cluster summary to determine eligibility. If the timestamp of a regression's `Low` or `High` cluster is older than 18 months relative to the current time, it is marked for deletion.
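The eligibility check amounts to a comparison against a cutoff instant. The sketch below assumes a single step-point timestamp per regression (the real logic inspects the `Low` and `High` clusters); the names `ttlMonths`, `cutoff`, `expired`, and `exampleNow` are illustrative and not taken from `deleter.go`.

```go
package main

import (
	"fmt"
	"time"
)

// ttlMonths mirrors the 18-month retention period described above.
const ttlMonths = 18

// cutoff returns the instant before which data is considered expired.
func cutoff(now time.Time) time.Time {
	return now.AddDate(0, -ttlMonths, 0)
}

// expired reports whether a regression whose step point occurred at
// stepTime has aged out relative to now.
func expired(stepTime, now time.Time) bool {
	return stepTime.Before(cutoff(now))
}

// exampleNow is a fixed clock value so the example is reproducible.
var exampleNow = time.Date(2024, time.July, 1, 0, 0, 0, 0, time.UTC)

func main() {
	old := exampleNow.AddDate(-2, 0, 0)    // two years before exampleNow
	recent := exampleNow.AddDate(0, -6, 0) // six months before exampleNow
	fmt.Println(expired(old, exampleNow), expired(recent, exampleNow))
	// Prints: true false
}
```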
#### Batch-Based Processing
Instead of a single massive delete operation, which could lock database tables and degrade performance, the module deletes data in batches:
- It starts scanning from the oldest known commit in the database.
- It collects eligible regressions and shortcuts until the configured `shortcutBatchSize` is reached.
- The actual deletion is performed within a single database transaction to ensure consistency (either both the regression and its shortcut are removed, or neither is).
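The collection step can be sketched with an in-memory slice standing in for the regression store. The `regression` record and `collectBatch` function are hypothetical simplifications; the real code reads from SQL and performs the delete inside a transaction.

```go
package main

import "fmt"

// regression is a simplified, hypothetical record. The real store keeps a
// cluster summary per commit and alert.
type regression struct {
	commit   int
	shortcut string
	expired  bool // outcome of the TTL check
}

// collectBatch walks regressions in ascending commit order and gathers the
// commit numbers and shortcut IDs of expired entries, stopping once
// batchSize shortcuts have been collected.
func collectBatch(ordered []regression, batchSize int) (commits []int, shortcuts []string) {
	for _, r := range ordered {
		if len(shortcuts) >= batchSize {
			break
		}
		if r.expired {
			commits = append(commits, r.commit)
			shortcuts = append(shortcuts, r.shortcut)
		}
	}
	return commits, shortcuts
}

func main() {
	data := []regression{
		{commit: 10, shortcut: "X1", expired: true},
		{commit: 11, shortcut: "X2", expired: false},
		{commit: 12, shortcut: "X3", expired: true},
		{commit: 13, shortcut: "X4", expired: true},
	}
	commits, shortcuts := collectBatch(data, 2)
	// A real implementation would now delete both sets inside one database
	// transaction so that neither is removed without the other.
	fmt.Println(commits, shortcuts)
	// Prints: [10 12] [X1 X3]
}
```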
#### Frequency and Scheduling
The `RunPeriodicDeletion` method establishes a long-running goroutine. It uses a ticker to trigger `DeleteOneBatch` at a regular `iterationPeriod`. This allows the maintenance to run continuously in the background at a slow, steady pace, eventually catching up to the TTL window without causing spikes in database load.
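A ticker-driven loop of this kind might look like the sketch below. It is not the real `RunPeriodicDeletion`: the `runPeriodic` helper is hypothetical, it accepts the tick channel as a parameter so the example can run without waiting, and the choice to report an error and continue on the next tick is an assumption.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runPeriodic calls step once per tick until ctx is cancelled or the tick
// channel is closed, and returns how many times step ran. A failed step is
// reported and retried on the next tick (an assumption of this sketch).
func runPeriodic(ctx context.Context, ticks <-chan time.Time, step func(context.Context) error) int {
	runs := 0
	for {
		select {
		case <-ctx.Done():
			return runs
		case _, ok := <-ticks:
			if !ok {
				return runs
			}
			if err := step(ctx); err != nil {
				fmt.Println("batch failed, will retry on next tick:", err)
			}
			runs++
		}
	}
}

// demo feeds three synthetic ticks so the example finishes immediately.
func demo() int {
	ticks := make(chan time.Time, 3)
	for i := 0; i < 3; i++ {
		ticks <- time.Now()
	}
	close(ticks)
	deleted := 0
	runs := runPeriodic(context.Background(), ticks, func(context.Context) error {
		deleted += 10 // pretend each batch removes ten rows
		return nil
	})
	fmt.Println("runs:", runs, "deleted:", deleted)
	return runs
}

func main() {
	demo()
}
```

In a long-running service the channel would come from `time.NewTicker(period).C`, with `ticker.Stop()` deferred by the caller.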
## Workflows
### Periodic Deletion Loop
The following diagram illustrates how the background process manages the steady cleanup of data:
```text
RunPeriodicDeletion(period, batchSize)
      |
      | (Wait for 'period')
      |-----> DeleteOneBatch(batchSize)
                    |
                    |-- 1. Get Oldest Commit Number
                    |-- 2. Scan Range [oldest, oldest + batchSize]
                    |-- 3. Filter for regressions older than 18 months
                    |-- 4. If Batch not full, extend range and repeat step 2
                    |
                    |-- 5. Open Database Transaction
                    |-- 6. Delete Regressions by Commit ID
                    |-- 7. Delete Shortcuts by ID
                    |-- 8. Commit Transaction
      |
      | (Wait for next 'period')
      |-----> ...
```
## Key Files
- **`deleter.go`**: Contains the core logic for calculating the 18-month cutoff, scanning the regression store, and executing the transactional deletes.
- **`deleter_test.go`**: Provides integration tests using a test database (Spanner) to verify that only data older than the TTL is removed and that the batching logic correctly identifies eligible records.
# Module: /go/notify
# Perf Regression Notifications
The `notify` module is a high-level orchestration layer responsible for transforming detected performance regressions into human-readable alerts and delivering them to various destinations. It decouples the "what" of a regression (statistical data and commit history) from the "how" (formatting and transport).
## High-Level Overview
The notification system follows a pipeline where raw detection data is first gathered into a common metadata format, then passed to a provider to be formatted into a specific message (e.g., HTML or Markdown), and finally handed off to a transport layer for delivery (e.g., Email or Issue Tracker).
This modular design allows the Perf instance to support diverse workflows:
- **Standard Alerts:** Sending HTML emails or creating Buganizer/Monorail issues.
- **Android-Specific Workflows:** Generating deep links to internal build diffs and formatting test method names.
- **Chromeperf Integration:** Reporting anomalies directly to the Chromeperf API.
- **Dry Runs:** Using a "Noop" transport for testing detection logic without bothering developers.
## Key Components and Responsibilities
### Core Orchestration (`notify.go`)
The `Notifier` interface is the primary entry point. The `defaultNotifier` implementation manages the flow of data. When a regression is found (or goes missing), it:
1. **Gathers Metadata:** Combines the alert configuration, commit details, and cluster statistics into a `RegressionMetadata` object.
2. **Hydrates Links:** If the regression is tied to specific traces, it queries the `tracestore` and filesystem to find "source" links (e.g., links to the raw JSON or log files that produced the data point).
3. **Executes Formatting:** Uses a `NotificationDataProvider` to turn metadata into a subject and body.
4. **Dispatches Transport:** Sends the final message via the configured `Transport`.
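The four steps above can be sketched as a small pipeline. All types here are simplified, hypothetical stand-ins: the real `RegressionMetadata`, `NotificationDataProvider`, and `Transport` carry more fields and methods, and the signatures below are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// regressionMetadata and notificationData are reduced, hypothetical
// versions of the structures described in the text.
type regressionMetadata struct {
	AlertName   string
	CommitHash  string
	SourceLinks []string
}

type notificationData struct {
	Subject string
	Body    string
}

// dataProvider formats metadata; transport delivers the result.
type dataProvider interface {
	GetNotificationData(md regressionMetadata) (notificationData, error)
}

type transport interface {
	SendNewRegression(subject, body string) (string, error)
}

type plainProvider struct{}

func (plainProvider) GetNotificationData(md regressionMetadata) (notificationData, error) {
	if md.CommitHash == "" {
		return notificationData{}, errors.New("metadata has no commit")
	}
	return notificationData{
		Subject: fmt.Sprintf("Regression found for %s", md.AlertName),
		Body:    fmt.Sprintf("Commit: %s\nSources: %v", md.CommitHash, md.SourceLinks),
	}, nil
}

// recordingTransport keeps messages in memory instead of sending them.
type recordingTransport struct{ sent []notificationData }

func (r *recordingTransport) SendNewRegression(subject, body string) (string, error) {
	r.sent = append(r.sent, notificationData{Subject: subject, Body: body})
	return fmt.Sprintf("ID-%d", len(r.sent)), nil
}

// regressionFound formats the metadata and hands the result to the transport.
func regressionFound(md regressionMetadata, p dataProvider, t transport) (string, error) {
	data, err := p.GetNotificationData(md)
	if err != nil {
		return "", fmt.Errorf("formatting notification: %w", err)
	}
	return t.SendNewRegression(data.Subject, data.Body)
}

func main() {
	tr := &recordingTransport{}
	md := regressionMetadata{AlertName: "Frame time", CommitHash: "abc123"}
	id, err := regressionFound(md, plainProvider{}, tr)
	fmt.Println(id, err, tr.sent[0].Subject)
}
```

Because the provider and transport are interfaces, either side can be swapped (for example, for a no-op transport) without touching the orchestration function.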
### Data Providers and Formatters
The system distinguishes between how data is gathered and how it is styled:
- **`NotificationDataProvider`**: Determines _what_ fields are available for the message.
  - The **Default Provider** uses standard commit and cluster data.
  - The **Android Provider** (`android_notification_provider.go`) adds specialized logic for Android-specific metadata, such as extracting Build IDs from commit subjects and formatting test class/method strings.
- **`Formatter`**: Handles the template rendering.
  - **HTML Formatter** (`html.go`): Used primarily for rich emails.
  - **Markdown Formatter** (`markdown.go`): Used for issue trackers and includes custom template functions like `buildIDFromSubject` to parse specific URL structures.
### Transports
Transports are the final leg of the journey, abstracting the I/O required to reach the user:
- **Email (`email.go`)**: Sends multi-part emails. It supports "threading references," allowing "Regression Missing" notifications to appear as replies to the original "Regression Found" alert.
- **Issue Tracker (`issuetracker.go`)**: Creates and updates bugs via the Google Issue Tracker (Buganizer) API. It automatically sets priorities, severities, and components based on the alert configuration.
- **Chromeperf (`chromeperfnotifier.go`)**: A specialized transport that doesn't send a message to a human, but instead reports the anomaly to the Chromeperf service for cross-platform tracking.
- **Noop (`noop.go`)**: A null-object pattern implementation for environments where notifications should be suppressed.
## Key Workflows
### Regression Found Process
This workflow illustrates how a statistical anomaly becomes a developer-facing bug:
```text
[ Detection Engine ] -> RegressionFound(commit, alert, cluster)
|
v
[ defaultNotifier ]
|-- getRegressionMetadata() --> Fetches Git hashes & source links
|-- GetNotificationData() --> Executes Go Templates (HTML/Markdown)
|-- SendNewRegression() --> Calls Transport (Email/API)
v
[ Transport Layer ]
|-- IssueTracker: Creates Bug #1234
|-- Email: Sends message with Message-ID <abc@perf>
v
[ Persistence ] --> Notification ID (#1234 or <abc@perf>) is saved to track history
```
### Notification Threading
To avoid "alert fatigue" and keep histories clean, the system uses a `threadingReference`.
```text
1. Initial Regression Found -> Transport returns "ID-123"
2. Performance recovers -> RegressionMissing(threadingReference="ID-123")
3. Transport uses ID-123 -> Adds a comment to Bug #123 OR Sends a Reply-To Email
```
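The threading flow can be illustrated with an in-memory store that plays the role of an issue tracker or mailbox. The `threadStore` type and its methods are hypothetical; only the idea of passing the returned reference back on the follow-up comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// threadStore maps a threading reference to the messages in that thread.
type threadStore struct {
	threads map[string][]string
}

func newThreadStore() *threadStore {
	return &threadStore{threads: map[string][]string{}}
}

// sendNewRegression starts a new thread and returns its threading reference.
func (s *threadStore) sendNewRegression(subject string) string {
	ref := fmt.Sprintf("ID-%d", len(s.threads)+1)
	s.threads[ref] = []string{subject}
	return ref
}

// sendRegressionMissing appends a follow-up to an existing thread.
func (s *threadStore) sendRegressionMissing(ref, subject string) error {
	if _, ok := s.threads[ref]; !ok {
		return errors.New("unknown threading reference: " + ref)
	}
	s.threads[ref] = append(s.threads[ref], subject)
	return nil
}

func main() {
	store := newThreadStore()
	ref := store.sendNewRegression("Regression found")
	if err := store.sendRegressionMissing(ref, "Regression no longer present"); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(ref, len(store.threads[ref]))
	// Prints: ID-1 2
}
```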
## Design Decisions
### Template-Driven Messages
The use of Go's `text/template` and `html/template` allows instance administrators to customize notification content without changing Go code. The `config.NotifyConfig` allows specifying custom body and subject templates in the instance's JSON configuration.
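As a rough illustration of template-driven messages, the sketch below renders a subject line with `text/template`. The `templateContext` struct and its field names are assumptions; the fields actually exposed to templates by Perf may differ.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// templateContext is a hypothetical set of values exposed to a template.
type templateContext struct {
	AlertName string
	CommitURL string
	Direction string
}

// render parses tmplText and executes it against data.
func render(tmplText string, data templateContext) (string, error) {
	t, err := template.New("notification").Parse(tmplText)
	if err != nil {
		return "", fmt.Errorf("parsing template: %w", err)
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", fmt.Errorf("executing template: %w", err)
	}
	return sb.String(), nil
}

func main() {
	// An administrator could supply this string through configuration.
	custom := "{{ .AlertName }} moved {{ .Direction }} at {{ .CommitURL }}"
	out, err := render(custom, templateContext{
		AlertName: "Frame time",
		CommitURL: "https://example.com/+/abc123",
		Direction: "up",
	})
	fmt.Println(out, err)
}
```

For HTML bodies, `html/template` offers the same `Parse` and `Execute` calls and additionally escapes values for safe embedding in markup.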
### Commit Range URLs
Because different projects use different git mirrors (e.g., Gerrit, GitHub, internal Gitiles), the `commitrange.go` logic uses a configurable `commitRangeURITemplate`. This allows the notification to link to a side-by-side diff (using `{begin}` and `{end}` placeholders) rather than just a single commit landing page.
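Placeholder expansion of this kind can be sketched with `strings.NewReplacer`. The `commitRangeURL` function, the fallback to a single-commit URL when no template is configured, and the example URLs are assumptions made for illustration rather than the contents of `commitrange.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// commitRangeURL expands {begin} and {end} in tmpl. If no template is
// configured it falls back to the single-commit URL.
func commitRangeURL(tmpl, begin, end, fallback string) string {
	if tmpl == "" {
		return fallback
	}
	return strings.NewReplacer("{begin}", begin, "{end}", end).Replace(tmpl)
}

func main() {
	tmpl := "https://example.com/repo/+log/{begin}..{end}"
	fmt.Println(commitRangeURL(tmpl, "aaa111", "bbb222", "https://example.com/repo/+/bbb222"))
	// Prints: https://example.com/repo/+log/aaa111..bbb222
}
```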
### Separation of Metadata and Presentation
By defining common structures in the `/common` submodule, the system ensures that the detection logic remains pure and doesn't need to know if the final output is a Markdown table or an HTML list. This also simplifies testing, as mocks can return standard `NotificationData` regardless of the transport being tested.
# Module: /go/notify/common
# Notification Common
The `notify/common` module defines the core data structures used across the Perf regression notification system. It acts as a bridge between the detection engine and the various notification delivery mechanisms (such as email, issue trackers, or chat platforms).
By centralizing these structures, the system ensures that different notification formatters have access to a consistent set of metadata regardless of the specific alert configuration or the final destination of the message.
## Core Data Structures
### RegressionMetadata
This structure is the primary data container passed to notification formatters. It is designed to encapsulate the full context of a detected performance change, allowing for the generation of rich, actionable reports.
The inclusion of both the `RegressionCommit` and the `PreviousCommit` is critical for providing a "diff" view, enabling users to see exactly what changed in the codebase to cause the regression.
Key components of the metadata include:
- **Contextual Links:** The `InstanceUrl` provides a direct path back to the Perf instance for deeper analysis.
- **Analytical Data:** The `Cl` (Cluster Summary) and `Frame` (UI Frame Response) contain the statistical backing for the regression, allowing notifications to include high-level summaries of the data points involved.
- **Detection Specifics:** The module handles two distinct detection paradigms:
  - **Standard Regressions:** Primarily use the commit range and alert configuration.
  - **Individual Trace Regressions:** When detection is set to "Individual" mode, the structure provides granular details including `TraceID` and specific commit links. This allows notifications to pinpoint exact changes in high-cardinality data environments.
### NotificationData
While `RegressionMetadata` contains the raw information about a performance change, `NotificationData` represents the _output_ of the formatting process. It separates the presentation layer from the delivery layer.
- **Body:** Contains the formatted content (often HTML or Markdown) intended for the recipient.
- **Subject:** Contains a concise summary, typically used for email subject lines or issue titles.
## Workflow: From Detection to Notification
The `common` module facilitates the transition of data through the following conceptual pipeline:
```text
[ Detection Engine ]
|
| Identifies anomaly and collects:
| - Alert Config
| - Commit Range
| - Cluster Data
v
[ RegressionMetadata ] <--- (Defined in notify/common)
|
| Passed to a Formatter (e.g., HTML/Markdown)
v
[ NotificationData ] <--- (Defined in notify/common)
|
| Passed to a Transport (e.g., Email/Issue Tracker)
v
[ Final Recipient ]
```
This separation ensures that the logic for _what_ a regression is (metadata) is kept distinct from _how_ it is described to a human (notification data), allowing the system to easily support new notification channels by simply implementing new formatters that consume these common structures.
# Module: /go/notify/mocks
The `go/notify/mocks` module provides a suite of autogenerated mock implementations for the core interfaces used within the Perf notification system. These mocks are built using `testify/mock` and are designed to facilitate unit testing of components that handle regression alerts, data formatting, and message delivery without requiring live connections to external services (like email servers or issue trackers).
### High-Level Purpose
The notification system in Perf follows a decoupled architecture where data retrieval, message construction, and transport delivery are handled by distinct components. This mock package allows developers to:
1. **Isolate Logic**: Test the logic of a `Notifier` implementation by mocking the `Transport` layer.
2. **Verify State Transitions**: Assert that the system correctly identifies when to send a "New Regression" vs. a "Regression Missing" (resolved) notification.
3. **Simulate Failures**: Inject errors into the data provider or transport layers to ensure robust error handling in the calling services.
### Key Components
The module mirrors the primary interfaces found in the parent `notify` package:
#### Notifier.go
The `Notifier` mock simulates the high-level orchestration of notifications. It is responsible for deciding what content should be sent based on regression events.
- **Design Role**: It acts as the entry point for the detection logic.
- **Key Workflows**: It mocks methods like `RegressionFound` and `RegressionMissing`, which typically involve complex arguments such as `ClusterSummary`, `FrameResponse`, and `Commit` data. This allows tests to verify that the notification system receives the correct metadata when a performance anomaly is detected.
#### Transport.go
The `Transport` mock represents the delivery mechanism (e.g., Email, Monorail/Issue Tracker).
- **Design Role**: It abstracts the actual I/O.
- **Why it's used**: Instead of sending real emails or creating real bugs during a test run, this mock captures the `body`, `subject`, and `threadingReference` (used for message chaining/threading) to ensure the outgoing message is formatted correctly.
#### NotificationDataProvider.go
This mock handles the assembly of the data payload required for a notification.
- **Design Role**: It sits between the raw performance data and the formatted message.
- **Functionality**: It mocks the retrieval of `NotificationData` based on `RegressionMetadata`. This is crucial for testing how different regression scenarios (found vs. missing) are transformed into user-facing information.
### Workflow Example
The following diagram illustrates how these mocks are typically used in a unit test for a component that manages regression life cycles:
```text
[ Test Suite ]
|
| 1. Setup expectations on Notifier Mock
v
[ System Under Test (e.g., Regression Detector) ]
|
| 2. Detects anomaly -> Calls RegressionFound()
v
[ Notifier Mock ]
|
| 3. Returns a canned "NotificationID"
v
[ Test Suite ]
|
| 4. Assert that Notifier was called with
| the expected Commit and Alert objects.
```
### Usage Implementation Note
All mocks in this package include a `New[InterfaceName]` helper function. These helpers automatically register a cleanup function with the `*testing.T` instance, ensuring that `AssertExpectations` is called at the end of the test to verify that all defined mock calls were actually executed.
# Module: /go/notifytypes
# Notifytypes Module
The `notifytypes` module serves as the central source of truth for defining how the Perf system communicates regression alerts and performance data to external consumers. Rather than scattering string constants or logic throughout the codebase, this module provides a typed schema that dictates both the **medium** of notification (the "how") and the **context** of the data being sent (the "what").
## Core Abstractions
The module is built around two primary type definitions that decouple the notification logic from the underlying alert detection systems.
### Notifier Mediums (`Type`)
The `Type` abstraction defines the destination and format of a notification. This is used by the system to instantiate the correct notification client. The design supports a variety of delivery methods:
- **Human-Readable Formats:** `HTMLEmail` and `MarkdownIssueTracker` cater to human consumption, specifying not just the destination but the markup language required for clear presentation.
- **System-to-System Integration:** `ChromeperfAlerting` and `AnomalyGrouper` represent automated workflows. Instead of sending a message to a human, these types signal the system to push structured data into external tracking services or internal grouping logic for further automated analysis.
- **No-Op Actions:** The `None` type allows for a "dry-run" or silenced state where regressions are detected and logged but no external side effects are triggered.
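A typed registry of this shape might be sketched as follows. The constant names come from the list above, but the string values and the `describe` and `sendsToHuman` helpers are assumptions for illustration and are not claimed to match the module's actual contents.

```go
package main

import "fmt"

// Type identifies a notification medium.
type Type string

// The string values below are assumed for this sketch.
const (
	HTMLEmail            Type = "html_email"
	MarkdownIssueTracker Type = "markdown_issuetracker"
	ChromeperfAlerting   Type = "chromeperf"
	AnomalyGrouper       Type = "anomalygroup"
	None                 Type = "none"
)

// sendsToHuman reports whether the medium produces a human-readable message.
func sendsToHuman(t Type) bool {
	return t == HTMLEmail || t == MarkdownIssueTracker
}

// describe summarizes what a dispatcher would do for each medium and
// rejects values that are not part of the registry.
func describe(t Type) (string, error) {
	switch t {
	case HTMLEmail:
		return "format as HTML and deliver by email", nil
	case MarkdownIssueTracker:
		return "format as Markdown and file an issue", nil
	case ChromeperfAlerting:
		return "report the anomaly to Chromeperf", nil
	case AnomalyGrouper:
		return "hand the anomaly to grouping logic", nil
	case None:
		return "detect and log only", nil
	default:
		return "", fmt.Errorf("unknown notifier type %q", t)
	}
}

func main() {
	for _, t := range []Type{HTMLEmail, ChromeperfAlerting, None} {
		d, _ := describe(t)
		fmt.Printf("%s: %s (human-readable: %v)\n", t, d, sendsToHuman(t))
	}
}
```

Using a named string type instead of bare strings lets the compiler catch accidental mixing of mediums with other configuration values.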
### Data Contexts (`NotificationDataProviderType`)
While the `Type` defines the transport, the `NotificationDataProviderType` defines the **source-specific schema** of the data.
In a multi-tenant environment like Perf, different projects (e.g., standard Skia vs. Android) require different metadata to be included in an alert. For example, an `AndroidNotificationProvider` might bundle specific build IDs or device characteristics that are irrelevant to other projects. By using this type, the notification engine can select the appropriate data formatter to bridge the gap between generic regression data and project-specific requirements.
## Workflow Integration
The constants in this module act as the glue between alert configuration and the notification dispatcher:
```text
[ Alert Configuration ]
|
v
[ Notification Dispatcher ] <--- [ notifytypes.Type ]
| (e.g., HTMLEmail)
|
+---------------------> [ Notification Data Provider ]
| (e.g., AndroidNotificationProvider)
|
v
[ External Systems ]
(Email, Issue Tracker, Chromeperf)
```
By centralizing these types, the system ensures that adding a new notification destination or a new specialized data provider only requires an update to this registry, providing a consistent interface for all alerting components in the Perf ecosystem.
# Module: /go/perf-tool
### Overview
The `perf-tool` is a comprehensive command-line interface (CLI) designed for administrative and diagnostic interactions with Skia Perf. It serves as the primary tool for managing Perf instances, providing capabilities that span database maintenance, data lifecycle management, and infrastructure provisioning.
The tool bridges the gap between local configuration files and remote cloud resources (GCS, PubSub, CockroachDB), allowing developers and SREs to perform complex operations like re-ingesting historical data, migrating alerts between instances, and debugging specific trace data without needing to write custom scripts.
### Design Philosophy and Implementation Choices
The project is structured to separate the CLI definition (routing and flags) from the business logic.
- **Interface-Driven Logic:** The core functionality is encapsulated within the `application` module. By defining an `Application` interface, the CLI implementation in `main.go` remains clean and highly testable. This abstraction allows the CLI to focus on flag parsing and environment setup while delegating complex workflows to the application layer.
- **Configuration as Truth:** Most commands require a `--config_filename` flag. The tool is designed to treat the `InstanceConfig` (JSON/TOML) as the definitive source of truth for the environment it is interacting with. This ensures that operations like PubSub creation or Database restores are always scoped to the correct instance.
- **Safe Data Portability:** Database operations (Alerts, Shortcuts, Regressions) use a custom serialization format (Go `gob` inside `.zip` files) rather than raw SQL dumps. This choice provides:
  - **Portability:** Backups can be restored across different database versions or instances.
  - **Atomicity:** Related entities (like Regressions and their associated Shortcuts) can be bundled together to ensure functional integrity after restoration.
- **Idempotency and Safety:** Many operations, such as PubSub provisioning and database restoration, are designed to be idempotent. The "dry-run" capability for re-ingestion allows users to verify which files will be affected before committing to expensive cloud operations.
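The "`gob` inside `.zip`" format can be sketched with the standard library alone. The `shortcut` struct, the `entryName` constant, and the single-entry archive layout are assumptions for this example; the real backup files may be organized differently.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/gob"
	"fmt"
)

// shortcut is a simplified, hypothetical record.
type shortcut struct {
	ID   string
	Keys []string
}

// entryName is an assumed name for the file inside the archive.
const entryName = "shortcuts"

// backup gob-encodes the shortcuts into a single entry of a zip archive.
func backup(shortcuts []shortcut) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create(entryName)
	if err != nil {
		return nil, err
	}
	if err := gob.NewEncoder(w).Encode(shortcuts); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// restore reads the archive produced by backup.
func restore(data []byte) ([]shortcut, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	for _, f := range zr.File {
		if f.Name != entryName {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		defer rc.Close()
		var out []shortcut
		if err := gob.NewDecoder(rc).Decode(&out); err != nil {
			return nil, err
		}
		return out, nil
	}
	return nil, fmt.Errorf("entry %q not found in archive", entryName)
}

func main() {
	data, err := backup([]shortcut{{ID: "X1", Keys: []string{",bot=pixel_6,"}}})
	if err != nil {
		fmt.Println("backup failed:", err)
		return
	}
	restored, err := restore(data)
	fmt.Println(restored, err)
}
```

Because the archive can hold several named entries, related record types could be stored side by side in one file, which is one way to bundle regressions with the shortcuts they reference.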
### Key Components and Responsibilities
#### CLI Entry Point (`main.go`)
This file defines the user interface of the tool using the `urfave/cli` framework. Its responsibilities include:
- **Flag Management:** Handling global and command-specific flags such as connection strings, commit ranges, and file paths.
- **Context Initialization:** Setting up logging (via `sklog`) and instantiating the `TraceStore` or `InstanceConfig` based on the provided flags.
- **Command Routing:** Mapping CLI commands to the appropriate methods in the `application` module.
#### Application Orchestrator (`/application`)
This module contains the heavy lifting for all functional areas:
- **Database Operations:** Implements logic for backing up and restoring `Alerts`, `Shortcuts`, and `Regressions`. It manages the complexity of batching large datasets and maintaining referential integrity (e.g., ensuring a regression backup includes the shortcuts it references).
- **Ingestion Management:** Provides tools to force the system to re-process data. It can scan GCS buckets for historical files and republish them to PubSub topics to trigger the standard ingestion pipeline. It also includes a `validate` sub-command to check ingestion files against the schema and parser logic locally.
- **Trace Debugging:** Provides direct access to the `TraceStore`. This allows users to list trace IDs matching a specific query or export raw performance data for specific commit ranges into JSON files for external analysis.
- **Infrastructure Provisioning:** Automates the creation of necessary Google Cloud PubSub topics and subscriptions based on the instance configuration, ensuring the cloud environment stays in sync with the code-defined configuration.
### Key Workflows
#### Trace Data Export
This workflow demonstrates how the tool extracts data from the storage layer for external use.
```text
[ CLI: traces export ] -> [ Instance Config ] -> [ TraceStore (BigTable/CockroachDB) ]
          |                        |                               |
          |-- 1. Parse Query ------|                               |
          |                                                        |
          |------- 2. Query Commits [Begin, End] ----------------->|
                                                                   |
[ Local JSON File ] <--- 4. Encode & Write <--- 3. Retrieve Trace Values
```
#### Infrastructure Synchronization
When setting up a new Perf instance or updating an existing one, the tool synchronizes the cloud environment.
```text
[ InstanceConfig ] [ Google Cloud PubSub ] [ Local State ]
| | |
1. Read Topics Config ------------>| |
| | |
2. Check Existence <---------------| |
| |
3. Create Missing Topics/Subscriptions ------------------------>|
| |
4. Set Dead Letter Policies/ACK Deadlines --------------------->|
```
# Module: /go/perf-tool/application
### Overview
The `application` module serves as the central orchestration layer for the `perf-tool` CLI. It encapsulates the high-level business logic and complex workflows required to manage a Skia Perf instance, acting as a bridge between the command-line interface and the underlying storage, ingestion, and cloud infrastructure systems.
By centralizing these operations, the module ensures that administrative tasks such as database migrations, data re-ingestion, and trace debugging are executed consistently and safely across different environments (local vs. production).
### Design Philosophy and Implementation Choices
The module is designed around the `Application` interface, which promotes testability and provides a clean abstraction for the CLI handlers.
- **Transactional Safety in Backups:** Database backups (Alerts, Shortcuts, Regressions) prioritize data integrity and portability. Instead of raw database dumps, the module uses Go's `gob` encoding wrapped in `.zip` archives. This yields versioned, structured backups that carry the necessary metadata and support targeted restoration.
- **Deterministic Regression Backups:** When backing up regressions, the module also identifies and exports the specific **Shortcuts** referenced by those regressions. This ensures that a restored regression remains functional and linkable to the original trace data in a new environment.
- **GCS and PubSub Integration:** For data ingestion management, the module interacts directly with Google Cloud Storage and PubSub. The `IngestForceReingest` logic uses hourly directory partitioning to efficiently scan large buckets, and it leverages PubSub to trigger the standard ingestion pipeline, ensuring that "forced" data follows the same processing path as live data.
- **Validation before Ingestion:** The `IngestValidate` component performs a two-stage check: first against the schema to ensure structural correctness, and second through the actual parser to verify that keys, measurements, and links are generated as expected before a user commits to a large-scale ingestion.
### Key Components and Responsibilities
#### Database Operations
Managed through functions like `DatabaseBackup*` and `DatabaseRestore*`, these components interact with `builders` to instantiate the appropriate stores (Alert, Shortcut, or Regression) based on the provided `InstanceConfig`.
- **Regressions:** Backed up in batches (defaulting to 1000 commits) to manage memory pressure. The restore process is idempotent; it recreates deterministic shortcut IDs to maintain data consistency.
- **Alerts/Shortcuts:** Handled as discrete entities, allowing administrators to migrate configurations without necessarily moving performance data.
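Splitting a commit range into fixed-size batches, as the regression backup does, can be sketched as below. The `commitRange` type and `batches` function are hypothetical; only the default size of 1000 commits comes from the text above.

```go
package main

import "fmt"

// commitRange is an inclusive range of commit numbers.
type commitRange struct {
	Begin, End int
}

// batches splits the inclusive range [begin, end] into consecutive
// sub-ranges that each cover at most size commits.
func batches(begin, end, size int) []commitRange {
	var out []commitRange
	if size <= 0 {
		return out
	}
	for b := begin; b <= end; b += size {
		e := b + size - 1
		if e > end {
			e = end
		}
		out = append(out, commitRange{Begin: b, End: e})
	}
	return out
}

func main() {
	fmt.Println(batches(0, 2499, 1000))
	// Prints: [{0 999} {1000 1999} {2000 2499}]
}
```

Processing one sub-range at a time bounds how many regressions are held in memory before they are encoded and written out.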
#### Trace Management
The module provides tools to inspect the `TraceStore` directly from the command line.
- **`TracesList`:** Performs queries against specific tiles to debug trace IDs and values.
- **`TracesExport`:** Facilitates data extraction for external analysis. It maps query strings to internal trace names and exports the resulting values as JSON, supporting both file output and standard output.
#### Ingestion and Infrastructure
- **PubSub Provisioning:** `ConfigCreatePubSubTopicsAndSubscriptions` automates the creation of the ingestion infrastructure. It handles complex configurations like Dead Letter Policies and acknowledgement deadlines, ensuring the cloud environment matches the local configuration file.
- **Re-ingestion Logic:** `IngestForceReingest` allows for "time-traveling" data. By scanning GCS objects within a date range and republishing their metadata to the ingestion topic, it triggers the system to re-process historical data (e.g., after a parser bug fix).
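Scanning by hourly directory can be sketched by generating one prefix per hour in the requested window. The `YYYY/MM/DD/HH` layout, the `ingest` base path, and the `hourlyPrefixes` helper are assumptions for illustration; the real bucket layout is defined by the instance configuration.

```go
package main

import (
	"fmt"
	"time"
)

// hourlyPrefixes returns one object-name prefix per hour between start and
// end (inclusive), using an assumed base/YYYY/MM/DD/HH layout in UTC.
func hourlyPrefixes(base string, start, end time.Time) []string {
	var out []string
	for t := start.UTC().Truncate(time.Hour); !t.After(end.UTC()); t = t.Add(time.Hour) {
		out = append(out, base+"/"+t.Format("2006/01/02/15"))
	}
	return out
}

// Fixed example window so the output is reproducible.
var (
	exampleStart = time.Date(2024, time.March, 1, 22, 30, 0, 0, time.UTC)
	exampleEnd   = time.Date(2024, time.March, 2, 1, 0, 0, 0, time.UTC)
)

func main() {
	for _, p := range hourlyPrefixes("ingest", exampleStart, exampleEnd) {
		// Each prefix would be listed in GCS, and every matching object
		// republished to the ingestion topic.
		fmt.Println(p)
	}
}
```

Listing by prefix keeps each request small, which matters when a bucket holds years of ingestion files.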
### Key Workflows
#### Data Re-ingestion Process
This workflow illustrates how the module triggers the reprocessing of historical performance data.
```text
[ User Input ]               [ GCS Bucket ]         [ PubSub Topic ]        [ Perf Ingestor ]
       |                            |                       |                       |
1. Start/End Dates ---------------> |                       |                       |
       |                     2. List Objects                |                       |
       | <--------------------------|                       |                       |
       |                                                    |                       |
3. Path Filter Apply ------> (Filter Files)                 |                       |
       |                                                    |                       |
4. Publish Message (Object Metadata) ---------------------> |                       |
                                                            | ---- 5. Notify -----> |
                                                                                    |
                                                                            6. Re-parse File
```
#### Regression Backup with Dependencies
Backing up regressions requires a "lookup and include" strategy for shortcuts.
```text
[ Regression Store ] [ Perf Git ] [ Shortcut Store ] [ ZIP Archive ]
| | | |
1. Fetch Regressions (Batch) | | |
| ---- 2. Get Dates -> | | |
| <--------------------| | |
| | |
3. Extract Shortcut IDs --------------------------------> | |
| | ---- 4. Fetch -----> |
| | <--------------------|
| |
5. Encode Regressions + Encoded Shortcuts -------------------------------------> |
```
# Module: /go/perf-tool/application/mocks
The `/go/perf-tool/application/mocks` module provides mock implementations of the core application logic interfaces for the `perf-tool` CLI. These mocks are generated using `mockery` and are built upon the `testify` framework, facilitating unit testing of command-line interactions and high-level workflows without requiring a live database, cloud infrastructure, or real file system mutations.
### Purpose and Design Decisions
The primary goal of this module is to decouple the CLI's user interface (command parsing and flag handling) from the actual execution of heavy operations like database backups, trace exports, and ingestion management.
By using mocks, developers can:
- **Verify Parameter Passing:** Ensure that command-line flags (like `--start`, `--stop`, or `--dryrun`) are correctly parsed and passed to the underlying application logic.
- **Simulate Failures:** Test how the CLI handles errors returned from complex operations (e.g., a failed PubSub topic creation) without needing to manually induce environmental errors.
- **Keep Tests Fast:** Run tests for the `perf-tool` management commands in milliseconds, avoiding the overhead of connecting to BigTable or SQL backends.
### Key Components
#### Application Mock (`Application.go`)
The `Application` struct is the central mock in this package. It mirrors the interface used by the `perf-tool` application layer, covering several functional domains of the Perf system:
- **Database Maintenance:** Includes mocks for backing up and restoring high-level entities such as **Alerts**, **Regressions**, and **Shortcuts**. This allows testing the backup/restore CLI commands while ensuring the logic correctly handles file paths and instance configurations.
- **Ingestion Management:** Provides hooks for `IngestForceReingest` and `IngestValidate`. This is critical for testing the logic that triggers data reprocessing across specific time ranges or validates ingestion file formats.
- **Trace Operations:** Mocks for `TracesExport` and `TracesList`. These facilitate testing how the tool queries `TraceStore` and writes results to output files or standard output, utilizing `types.CommitNumber` and `types.TileNumber` for range-based logic.
- **Infrastructure Setup:** The `ConfigCreatePubSubTopicsAndSubscriptions` mock allows testing the initialization commands that provision Google Cloud PubSub resources based on the provided `InstanceConfig`.
### Typical Testing Workflow
When testing a new command in `perf-tool`, the mock is used to intercept calls from the command-line handlers.
```
[ CLI Command ] ----> [ Application Interface (Mock) ] ----> [ Test Assertions ]
       |                            |                               |
1. User runs:            2. Mock records call:              3. Test verifies:
"perf-tool ingest..."    "IngestForceReingest(true, ...)"   - Was it called?
                                                            - Were flags correct?
```
The `NewApplication` function simplifies this by automatically registering the mock with the `testing.T` cleanup routine, ensuring that `AssertExpectations` is called when the test finishes to verify that all expected calls were made.
# Module: /go/perfclient
### Overview
The `perfclient` module provides a standardized interface for sending performance benchmarking data to Skia's Perf ingestion system. It functions as a specialized wrapper around Google Cloud Storage (GCS), abstracting the complexities of file naming conventions, data compression, and directory structuring required by the Perf ingestion engine.
### Design Philosophy
The module is designed around the principle of **deterministic, time-series organization**. The Perf ingestion system expects data to be organized in GCS using a specific hierarchy based on time and task metadata. By centralizing this logic in `perfclient`, different Skia services can ensure that their performance results are stored in a way that the ingestion service can automatically discover and process them.
Key implementation choices include:
- **Automatic Compression**: To optimize storage costs and upload speed, the client transparently compresses the JSON payload using GZIP. It utilizes GCS "transcoding" features by setting the `Content-Encoding: gzip` header, allowing the data to be served uncompressed if requested while remaining compressed at rest.
- **Collision Avoidance**: File names are generated using a combination of a user-provided prefix, an MD5 hash of the data content, and a millisecond-precision timestamp. This ensures that even if multiple tasks upload data simultaneously for the same configuration, they will not overwrite each other.
- **Path Hierarchy**: Data is organized into a `YYYY/MM/DD/HH` folder structure. This allows the ingestion engine to poll specific time-based slices of data efficiently rather than scanning the entire bucket.
### Key Components
#### ClientInterface
The primary entry point is the `ClientInterface`. It defines the contract for pushing data to Perf. This abstraction allows other modules to use a `MockPerfClient` during unit testing, avoiding actual GCS network calls.
#### Client
The concrete implementation of the interface. It holds a reference to a `gcs.GCSClient` and a `basePath` (the root directory in the bucket where all performance data should reside).
#### Data Workflow
The `PushToPerf` method executes the following logic:
1. **Serialization**: Converts the `format.BenchData` struct into JSON.
2. **Compression**: Gzip-compresses the resulting JSON bytes.
3. **Path Calculation**: Invokes `objectPath` to determine the exact destination in GCS.
4. **Upload**: Transfers the compressed bytes to GCS with the appropriate metadata headers (`Content-Encoding` and `Content-Type`).
```
Data Flow:
[BenchData Struct]
        |
        v
[JSON Marshaling]  -> [MD5 Hashing]
        |                   |
        v                   v
[GZIP Compression] -> [Path Construction]
        |                   |
        +---------+---------+
                  |
                  v
[GCS Upload (with gzip headers)]
```
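The serialization and compression steps can be sketched as follows. The `benchData` type and its field names are hypothetical stand-ins for `format.BenchData`, and the upload itself is omitted; `decode` is included only to show that the payload round-trips.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
)

// benchData is a hypothetical stand-in for format.BenchData.
type benchData struct {
	Hash string            `json:"gitHash"`
	Key  map[string]string `json:"key"`
}

// encode marshals v to JSON and gzip-compresses the result.
func encode(v interface{}) ([]byte, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed data and the gzip footer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decode reverses encode, as a reader would if the object were served
// still compressed.
func decode(compressed []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	payload, err := encode(benchData{Hash: "abc123", Key: map[string]string{"arch": "x86"}})
	if err != nil {
		panic(err)
	}
	raw, err := decode(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```

Forgetting to call `Close` on the gzip writer is the usual pitfall here: the buffer would then hold a truncated stream that fails to decompress.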
#### Path Construction (`objectPath`)
This function is critical for maintaining compatibility with the Perf ingestion system. It constructs paths following this pattern:
`[basePath]/[YYYY]/[MM]/[DD]/[HH]/[folderName]/[filePrefix]_[hash]_[timestamp].json`
- **basePath**: The root GCS folder for the specific environment or service.
- **folderName**: Typically represents a high-level grouping, such as a Task name (e.g., "My-Task-Debug").
- **filePrefix**: A descriptor for the type of benchmark (e.g., "nanobench").
- **now**: The timestamp used to determine the directory hierarchy (`[YYYY]/[MM]/[DD]/[HH]`) and the `[timestamp]` suffix of the file name.
# Module: /go/perfresults
# Perf Results
The `perfresults` module is a Go library and set of tools designed to bridge the gap between Chromium's distributed build/test infrastructure (LUCI) and the Skia Perf ingestion system. Its primary responsibility is the automated discovery, retrieval, and parsing of performance benchmark results—typically stored as JSON files in Content Addressed Storage (CAS)—produced by Swarming tasks.
The module provides a unified interface to navigate the hierarchy of Buildbucket builds and Swarming tasks to extract telemetry data for long-term storage and trend analysis.
### Design and Data Flow
The architecture follows a "discovery-to-normalization" pipeline. Instead of requiring a direct path to a result file, the module starts with a high-level **Build ID** and programmatically resolves the underlying storage locations.
```text
[ Buildbucket ID ]
        |
        | (Lookup Build Metadata)
        v
[ Swarming Parent Task ]
        |
        | (Identify Shards/Children)
        v
[ Child Task IDs ]
        |
        | (Query CAS Outputs)
        v
[ RBE CAS Digests ]
        |
        | (Fetch & Merge JSONs)
        v
[ Internal PerfResults ] ----> [ Ingestion / CLI / Workflows ]
```
### Key Components
#### The Loader (`perf_loader.go`)
The `loader` is the central orchestrator. It encapsulates the logic required to communicate with multiple LUCI services in the correct sequence.
- **Service Coordination**: It manages the transition from Buildbucket (to get build properties and the root Swarming task) to Swarming (to find child tasks and their CAS output references).
- **Dependency Injection**: It uses an `rbeProvider` to generate RBE clients on the fly based on the specific CAS instance identified in the task metadata, ensuring it can fetch data across different infrastructure silos (e.g., `chrome-swarming` vs `chromium-swarm`).
#### Result Parsing and Histograms (`perf_results_parser.go`)
This component handles the "Histogram Set" JSON format. Because these files can be large (10MB+), the parser is designed for efficiency:
- **Streaming Decoding**: It uses a streaming `json.Decoder` to process entries one by one, reducing memory footprint compared to loading the entire file into a byte slice.
- **Data Model**: Results are stored in a `PerfResults` struct, which maps a `TraceKey` (comprising Chart, Unit, Story, Architecture, and OS) to a `Histogram` (a collection of raw sample values).
- **Aggregation Mapping**: Since raw samples are often too granular for time-series databases, the module provides a standard mapping for statistical reductions like `mean`, `max`, `min`, `std`, and `count`.
#### Infrastructure Clients (`buildbucket.go`, `swarming.go`, `rbecas.go`)
These files provide specialized wrappers around LUCI and RBE protocol buffer clients:
- **Buildbucket**: Extracts `BuildInfo`, including the Git revision and "Machine Group" (e.g., `ChromiumPerf`). This metadata is critical for placing the results on the correct timeline in Skia Perf.
- **Swarming**: Handles the logic of finding child tasks. It uses task creation/completion timestamps to narrow the search space when querying the Swarming API for tasks tagged with a specific `parent_task_id`.
- **RBE CAS**: Specialized for "flattening" CAS directory trees. It searches through the output tree of a task to locate files named `perf_results.json`, even if they are nested within benchmark-specific subdirectories.
### Submodules
The project is extended by several specialized submodules that handle specific parts of the performance lifecycle:
- **`ingest`**: Translates the internal `PerfResults` structures into the specific JSON schema and Google Cloud Storage (GCS) path hierarchy required by the Skia Perf ingester.
- **`cli`**: A command-line tool that allows developers or CI scripts to manually trigger the loading and transformation of results for a given Buildbucket ID.
- **`workflows`**: Contains [Temporal](https://temporal.io/) workflow definitions for managing long-running, fault-tolerant ingestion jobs. It ensures that if a network call fails during the multi-step discovery process, the job can resume without losing state.
- **`testdata`**: A comprehensive suite of recorded gRPC/HTTP interactions and sample JSON files, allowing for deterministic testing of the entire pipeline without live infrastructure access.
### Design Decisions
- **Merging Strategy**: When multiple child tasks (shards) produce results for the same benchmark, the `Loader` automatically merges them. If two histograms share the same `TraceKey`, their `SampleValues` are concatenated. This treats sharded test execution as a single logical benchmark run.
- **Secure-by-Default Ingestion**: In the `ingest` submodule, if a builder's configuration cannot be explicitly verified as "public," the system defaults to storing results in non-public internal buckets to prevent accidental data leaks.
- **Trace Key Uniqueness**: The `TraceKey` includes `Architecture` and `OSName` because these are derived from the Swarming bot dimensions. This ensures that even if two different machines run the same benchmark story, their results are stored as distinct traces if their hardware/OS profiles differ.
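The merging strategy and the role of the key fields can be sketched as below. The type names are lowercase stand-ins for the module's `TraceKey`, `Histogram`, and `PerfResults`, and the `merge` helper is hypothetical.

```go
package main

import "fmt"

// traceKey mirrors the fields described above. A struct of comparable
// fields can be used directly as a map key.
type traceKey struct {
	Chart, Unit, Story, Architecture, OSName string
}

type histogram struct {
	SampleValues []float64
}

type perfResults map[traceKey]histogram

// merge folds src into dst: histograms with the same key have their
// samples concatenated, and new keys are added as-is.
func merge(dst, src perfResults) {
	for k, h := range src {
		merged := dst[k]
		merged.SampleValues = append(merged.SampleValues, h.SampleValues...)
		dst[k] = merged
	}
}

func main() {
	key := traceKey{Chart: "cpu_time", Unit: "ms", Story: "load", Architecture: "arm64", OSName: "android"}
	shard1 := perfResults{key: {SampleValues: []float64{1.0, 1.2}}}
	shard2 := perfResults{key: {SampleValues: []float64{0.9}}}

	merge(shard1, shard2)
	fmt.Println(len(shard1), shard1[key].SampleValues)

	// The same benchmark on a different OS becomes a separate trace.
	linuxKey := key
	linuxKey.OSName = "linux"
	merge(shard1, perfResults{linuxKey: {SampleValues: []float64{2.0}}})
	fmt.Println(len(shard1))
}
```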
# Module: /go/perfresults/cli
# Perf Results CLI
The `perfresults/cli` module provides a command-line tool designed to bridge the gap between Buildbucket task execution and the Skia Perf ingestion system. Its primary purpose is to retrieve raw performance data associated with a specific Buildbucket build, transform it into a standardized format suitable for Skia Perf, and persist it as local JSON files.
This tool is particularly useful in CI/CD pipelines where performance benchmarks are executed as sub-tasks of a main build, and those results need to be extracted and prepared for long-term storage and analysis.
### Design and Data Flow
The CLI acts as an orchestrator between the `perfresults` loading logic and the `ingest` formatting logic. The design favors a "pull and transform" model:
1. **Retrieval**: It uses the `perfresults` package to abstract the complexity of communicating with Buildbucket and locating relevant benchmark artifacts.
2. **Normalization**: Raw results are grouped by benchmark. For each benchmark, the CLI attaches contextual metadata—specifically the Git revision and the Buildbucket job link—to ensure the data is traceable back to its source.
3. **Transformation**: It delegates the conversion of internal performance structures to the Skia Perf ingestion format via the `ingest` package.
4. **Persistence**: Results are written to individual files named by benchmark and build ID, providing a clear output stream for downstream processes (like cloud storage uploaders).
```text
[ Buildbucket ID ]
       |
       v
+--------------+      +-----------------------+
|  perfresults |----->| Raw Benchmark Results |
|    Loader    |      | (Memory Structures)   |
+--------------+      +-----------------------+
                                  |
                                  v
+--------------+      +-----------------------+      +-----------------+
|    ingest    |<-----| Add Metadata:         |      | Output Files:   |
|  Converter   |      | - Git Revision        |----->| bench_123.json  |
+--------------+      | - Buildbucket Link    |      | bench_456.json  |
                      +-----------------------+      +-----------------+
```
### Key Components
#### Main Logic (`main.go`)
The entry point handles command-line flag parsing and coordinates the execution flow. It is responsible for:
- **Contextualization**: It merges high-level build information (from `perfresults.Loader`) with specific benchmark data.
- **File Management**: It manages the creation of the output directory and ensures each benchmark result is serialized correctly.
- **Inter-process Communication**: By printing the paths of the generated files to `stdout`, the CLI allows parent scripts or automation tools to easily identify and process the resulting JSON files.
### Integration with Other Modules
The CLI serves as the glue between several specialized modules:
- **`perf/go/perfresults`**: Provides the `Loader` which handles the heavy lifting of finding and downloading artifacts from Buildbucket.
- **`perf/go/perfresults/ingest`**: Contains the logic to translate internal Go structures into the specific JSON schema required by the Skia Perf ingestion pipeline.
# Module: /go/perfresults/ingest
# Perf Results Ingestion
The `perfresults/ingest` module provides the logic necessary to transform raw performance results into a structured format suitable for the Skia Perf ingestion pipeline and determines the appropriate storage locations within Google Cloud Storage (GCS).
It acts as a bridge between the data structures defined in the `perfresults` module (which represent the raw telemetry/benchmark output) and the `format.Format` expected by the Skia Perf ingester.
## High-Level Overview
The ingestion process involves two primary responsibilities:
1. **Format Conversion**: Translating internal performance result structures (Histograms and BuildInfo) into a standardized JSON format that the Perf ingester can parse and index.
2. **Path Resolution**: Determining the standardized GCS URI where the results should be stored based on the execution time, builder configuration, and benchmark name.
## Design Decisions
### Data Aggregation
The `perfresults` format often contains a collection of sample values for a single measurement. However, for charting and time-series analysis, these samples need to be reduced to specific statistical points (e.g., mean, max, min).
Instead of choosing a single representative value, the module utilizes `perfresults.AggregationMapping` to generate multiple traces for a single histogram. Each aggregation (like "avg" or "std") is converted into a `format.SingleMeasurement`. This allows users to toggle between different statistical views of the same benchmark data in the Perf UI.
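The one-histogram-to-many-measurements idea can be sketched as below. The aggregation names, the `measurement` type, and the choice of the population standard deviation are assumptions; the actual keys of `perfresults.AggregationMapping` may differ.

```go
package main

import (
	"fmt"
	"math"
)

// measurement is a hypothetical stand-in for format.SingleMeasurement.
type measurement struct {
	Stat  string
	Value float64
}

type aggregation struct {
	name   string
	reduce func([]float64) float64
}

func mean(v []float64) float64 {
	sum := 0.0
	for _, x := range v {
		sum += x
	}
	return sum / float64(len(v))
}

// aggregations is kept as a slice so the output order is deterministic.
var aggregations = []aggregation{
	{"count", func(v []float64) float64 { return float64(len(v)) }},
	{"max", func(v []float64) float64 {
		m := math.Inf(-1)
		for _, x := range v {
			m = math.Max(m, x)
		}
		return m
	}},
	{"mean", mean},
	{"min", func(v []float64) float64 {
		m := math.Inf(1)
		for _, x := range v {
			m = math.Min(m, x)
		}
		return m
	}},
	{"std", func(v []float64) float64 {
		// Population standard deviation (an assumption).
		m, ss := mean(v), 0.0
		for _, x := range v {
			ss += (x - m) * (x - m)
		}
		return math.Sqrt(ss / float64(len(v)))
	}},
}

// toMeasurements emits one measurement per aggregation, or nothing when
// there are no samples to reduce.
func toMeasurements(samples []float64) []measurement {
	if len(samples) == 0 {
		return nil
	}
	out := make([]measurement, 0, len(aggregations))
	for _, a := range aggregations {
		out = append(out, measurement{a.name, a.reduce(samples)})
	}
	return out
}

func main() {
	for _, m := range toMeasurements([]float64{2, 4, 4, 4, 5, 5, 7, 9}) {
		fmt.Printf("%s=%v\n", m.Stat, m.Value)
	}
}
```

Each emitted measurement becomes its own trace, which is what lets the UI switch between statistical views of the same histogram.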
### Internal vs. External Data Routing
Security and visibility are handled at the path generation level. The module distinguishes between "public" and "non-public" buckets based on the builder name.
- **Default to Internal**: If a builder's configuration cannot be explicitly verified as public via `bot_configs`, the module defaults to the internal bucket (`chrome-perf-non-public`). This "secure-by-default" approach prevents accidental exposure of sensitive performance data.
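The secure-by-default rule reduces to a lookup that only returns the public bucket on an explicit match. In this sketch the shape of the builder lookup is an assumption and `example-public-bucket` is a placeholder name; only `chrome-perf-non-public` comes from the text above.

```go
package main

import "fmt"

const internalBucket = "chrome-perf-non-public"

// publicBucket is a placeholder; the real public bucket name is not
// stated in this document.
const publicBucket = "example-public-bucket"

// bucketFor returns the public bucket only when the builder is explicitly
// listed as public. Unknown or missing builders fall back to internal.
func bucketFor(builder string, publicBuilders map[string]bool) string {
	if publicBuilders[builder] {
		return publicBucket
	}
	return internalBucket
}

func main() {
	// Hypothetical builder names.
	known := map[string]bool{"public-builder": true, "private-builder": false}
	fmt.Println(bucketFor("public-builder", known))
	fmt.Println(bucketFor("private-builder", known))
	fmt.Println(bucketFor("never-seen-before", known))
}
```

Because a missing map entry reads as `false`, a builder that was never configured is treated the same as one marked non-public.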
## Key Components
### JSON Transformation (`json.go`)
This file handles the structural mapping between the `perfresults` package and the `ingest/format` package.
- **`ConvertPerfResultsFormat`**: This is the entry point for data transformation. It maps histogram keys (Chart, Unit, Story, Arch, OS) into the `Key` metadata map used by the ingester for filtering.
- **`toMeasurement`**: Processes the raw `SampleValues` from a histogram. It filters out invalid numerical values (Inf, NaN) before they reach the ingestion pipeline to ensure database integrity.
### GCS Path Management (`gcs.go`)
This file defines the organizational hierarchy of the performance data in GCS. The path structure is designed to be easily browsable and predictable for the ingester:
`gs://<bucket>/ingest/<YYYY>/<MM>/<DD>/<HH>/<MachineGroup>/<BuilderName>/<Benchmark>`
- **Time Normalization**: All paths are generated using UTC time. The `convertTime` function flattens the precision to the hour, grouping results into hourly "buckets" to optimize file discovery and ingestion batching.
- **Builder Metadata**: It handles defaults for missing metadata (e.g., using `ChromiumPerf` as the default Machine Group and `BuilderNone` for missing builders) to ensure the path remains valid and consistent.
## Workflow
The typical flow of data through this module can be visualized as follows:
```
[Raw PerfResults] -> ConvertPerfResultsFormat() -> [format.Format Object]
                                                            |
                                                            v
[Build Metadata]  -> convertPath() --------------> [GCS URI Destination]
       +                                                    |
  [Timestamp]                                               v
                                                 (Ready for Upload/Ingest)
```
The resulting JSON object and GCS path are then used by higher-level services to write the data to GCS, where the Skia Perf ingester will eventually pick it up for processing into the trace database.
# Module: /go/perfresults/testdata
This module serves as a centralized repository of deterministic test inputs and recorded network interactions used to verify the functionality of performance result processing. It enables the testing of complex workflows—such as fetching task metadata, parsing performance histograms, and merging result sets—without requiring active connections to external services like Buildbucket or Swarming.
### Core Responsibilities
The data within this module is structured to support three primary testing objectives:
1. **API Replay and Service Mocking:** The module contains recorded pRPC/gRPC interactions (captured in `.json` and `.rpc` files). These files allow the `perfresults` clients to simulate communication with infrastructure services. By providing pre-recorded request/response pairs, the tests can verify how the system handles various states (such as a successful build lookup, a non-existent task ID, or a complex task hierarchy) under stable, repeatable conditions.
2. **Data Schema and Parsing Verification:** Files like `full.json`, `empty.json`, and `valid_histograms.json` represent the expected internal schema for performance data. These are used to ensure that parsers correctly translate raw JSON inputs into internal Go structures (e.g., `GenericSet`, `DateRange`, and `Histogram` objects) and that diagnostic metadata is correctly associated with specific samples.
3. **Aggregation and Logic Validation:** Specialized datasets like `merged.json` and `merged_diff.json` are designed to test higher-level logic. These files provide the "before" and "after" states required to validate that the module can successfully combine multiple results or calculate differences between distinct performance runs.
### Key Components and Design
The test data is organized into functional groups to reflect the multi-stage nature of performance result processing:
- **Infra Metadata (`FindTaskID_...`, `SwarmingClient_...`):** These files mock the discovery phase. They contain the specific metadata—such as Swarming instance names and CAS (Content Addressed Storage) digests—needed to locate where performance results are actually stored after a build completes.
- **Result Loading (`LoadPerfResults_...`):** These datasets cover the edge cases of the loading logic. This includes scenarios where a build exists but contains no performance data ("NoChildRuns") or where the build identification is entirely invalid.
- **Histogram Sets (`perftest` group):** These files represent the final performance metrics. They include complex diagnostic maps that link specific measurements to bot IDs, operating systems, and benchmark versions.
### Workflow Visualization
The data in this module facilitates the testing of the following automated discovery and parsing pipeline:
```
[ Build ID ] --> ( Mock Buildbucket ) --> [ Swarming Task ID ]
                                                   |
                                                   v
[ CAS Digest ] <-- ( Mock Swarming ) <--- [ Task Result ]
      |
      +--------> ( Mock RBE/CAS ) ------> [ histogram.json ]
                                                   |
                                                   v
[ Internal Result Set ] <--------------- ( Parser Logic )
```
By providing static files for every step in this chain, the module ensures that logic changes in the parser or client can be verified for correctness and backward compatibility with historical data formats.
# Module: /go/perfresults/workflows
### Perf Results Workflows
The `perfresults/workflows` module contains the business logic for orchestrating the ingestion and processing of performance data within the Skia infrastructure. It leverages the Temporal framework to manage long-running, distributed tasks that require strong guarantees on state persistence and fault tolerance.
#### Design Philosophy: Fault-Tolerant Ingestion
Performance data ingestion is a multi-step process involving data retrieval, validation, storage in the trace store, and triggering downstream analysis (such as regression detection). The module is built on the following principles:
- **Reliable Transitions**: By using Temporal, the system ensures that if a step fails (e.g., a network timeout during a storage write), the workflow can resume from its last successful state rather than restarting the entire ingestion pipeline.
- **Separation of Concerns**: The module separates the _orchestration_ (the sequence of steps) from the _execution_ (the actual work). Workflows define the "recipe," while Activities perform the "cooking."
- **Idempotency**: Activities are designed to be idempotent so that retries—inherent in distributed systems—do not result in duplicate data or corrupted state.
#### Key Components and Responsibilities
The module is structured to support both the high-level workflow definitions and the granular activities they invoke.
- **Workflow Definitions**: These are the top-level Go functions that define the lifecycle of a performance result. A typical workflow orchestrates the flow of data from an external upload or discovery event into the internal Skia Perf ecosystem. It handles branching logic, error handling policies, and timeout configurations.
- **Activities**: These are the atomic units of work called by workflows. Common responsibilities include:
  - **Data Validation**: Checking the schema and integrity of incoming performance JSON files.
  - **Storage Operations**: Interfacing with the trace store (e.g., BigTable or Spanner) to persist the results.
  - **Downstream Notifications**: Sending signals to the clustering and regression detection systems once new data has been successfully ingested.
- **Input/Output Contracts**: The module defines the data structures used to pass information between steps. These contracts ensure that as workflows evolve, the data passed between disparate activities remains consistent and type-safe.
#### Workflow Architecture
The relationship between the orchestrating workflow and the underlying infrastructure is depicted below:
```text
          [ Trigger Event ]
                  |
                  v
   +-----------------------------+
   |      Temporal Workflow      |
   |   (Orchestration & State)   |
   +--------------+--------------+
                  |
         +--------+--------+
         |        |        |
         v        v        v
+----------+ +----------+ +----------+
| Activity | | Activity | | Activity |
| (Fetch)  | | (Parse)  | | (Store)  |
+----+-----+ +----+-----+ +----+-----+
     |            |            |
     +------------+------------+
                  |
                  v
       [ Result Finalization ]
```
1. **Orchestration**: The Workflow manages the control flow, ensuring that "Parse" only happens after "Fetch" succeeds, and "Store" only occurs if "Parse" produces valid data.
2. **Execution**: Each Activity is picked up by a [Worker](./worker) and executed. The status of these activities is reported back to the Temporal server to maintain the global state of the ingestion job.
3. **Completion**: Upon success, the workflow may trigger secondary events, such as updating "latest result" pointers or alerting developers of significant performance shifts.
#### Interaction with the Worker
While the `workflows` module defines the logic, it relies on the `worker` submodule to provide the execution environment. The workflows are registered with the worker at startup, allowing the worker to "claim" tasks from the Temporal task queue that match the workflow and activity names defined here. This decoupling allows the workflow logic to be updated and deployed independently of the worker's infrastructure configuration, provided the interfaces remain compatible.
# Module: /go/perfresults/workflows/worker
### Perf Upload Worker
The `perf-upload-worker` serves as the execution engine for Temporal workflows related to performance result ingestion and processing. In the context of the Perf results subsystem, this worker acts as the bridge between the Temporal orchestration engine and the actual execution of tasks, such as uploading data to storage or triggering indexing processes.
#### Design Philosophy: Temporal Execution
The worker is designed around the principle of **externalized orchestration**. Instead of embedding business logic directly into a monolith, the worker provides the compute resources to execute workflows defined elsewhere. This decoupling allows for:
- **Scalability**: Multiple worker instances can be deployed to handle high volumes of performance data uploads by listening to the same task queue.
- **Reliability**: Since Temporal manages state and retries, this worker remains relatively stateless. If a worker process crashes, the Temporal server detects the timeout and redistributes the pending tasks to other available workers.
- **Observability**: By integrating Prometheus metrics directly into the worker's lifecycle, the system tracks execution latency and task failures at the infrastructure level.
#### Key Components and Responsibilities
The primary responsibility of this module is the lifecycle management of the Temporal worker process, managed within `main.go`.
- **Connection Management**: The worker establishes a long-lived connection to the Temporal cluster (configured via `--host_port` and `--namespace`). Because a Temporal `client.Client` is a heavyweight object, the worker creates a single instance and reuses it, in line with Temporal best practices.
- **Task Queue Subscription**: The worker listens to a specific `--task_queue`. By default, it dynamically generates a queue name based on the current system user (e.g., `localhost.username`), which facilitates local development and testing without interfering with production workflows.
- **Metrics Integration**: The worker utilizes a specialized `MetricsHandler` to export Temporal-specific SDK metrics to Prometheus. This is crucial for monitoring the health of the workflow execution environment, such as worker pollers and activity execution rates.
#### Operational Workflow
The worker operates in a continuous loop, polling the Temporal server for work. The high-level interaction between the worker and the broader system is illustrated below:
```text
+------------------+             +------------------+             +-----------------------+
| Temporal Server  | <----1----> |   Perf Worker    | <----2----> |   Workflow/Activity   |
| (Orchestrator)   |             |   (Execution)    |             |    Implementations    |
+------------------+             +------------------+             +-----------------------+
         ^                                |
         |                                | 3. Export Metrics
         |                                v
         |                       +------------------+
         +-----------------------| Prometheus/Skia  |
                                 |    Monitoring    |
                                 +------------------+
```
1. **Polling**: The worker establishes a persistent gRPC connection to the Temporal Server and polls the configured Task Queue.
2. **Dispatch**: When the server has a scheduled task (Workflow or Activity), the worker receives the task and dispatches it to the registered implementation logic.
3. **Telemetry**: Throughout execution, the worker exports heartbeat and performance metrics through its Prometheus endpoint (defaulting to port `:8000`).
#### Configuration and Deployment
The worker is packaged as a containerized application (`perf_upload_worker`). It relies on command-line flags to determine its environment:
- `--task_queue`: Defines which set of tasks this specific worker fleet will handle.
- `--namespace`: Segregates workflow execution within the Temporal cluster (e.g., separating "prod" from "staging").
# Module: /go/perfserver
# Perfserver
The `perfserver` module provides a unified entry point for all long-running processes required to operate a Skia Perf instance. It is designed as a multi-command CLI tool that encapsulates disparate operational roles—web serving, data ingestion, maintenance, and regression detection—into a single binary.
### Architectural Philosophy
The design of `perfserver` follows a "sidecar" or "micro-service" compatible architecture where a single codebase can fulfill different roles depending on the command-line arguments. This approach simplifies deployment and configuration management: instead of managing multiple distinct binaries, the same container image or executable is deployed across different service tiers (e.g., Kubernetes deployments), with only the entrypoint command changing.
### Key Components and Responsibilities
The functionality is divided into several sub-commands, each targeting a specific area of the Perf lifecycle:
#### 1. Frontend (`frontend`)
The `frontend` command launches the primary web server. It is responsible for serving the user interface and handling API requests for data visualization. While it primarily focuses on the "read" path of the system, it acts as the central hub for user interaction with performance traces.
#### 2. Ingestion (`ingest`)
The `ingest` command starts the data processing pipeline. Its responsibility is to monitor configured sources (such as cloud storage buckets), parse incoming performance files, and populate the `TraceStore`.
- **Workflow**: It operates continuously, utilizing parallel workers to ensure that as new benchmark data is produced, it is indexed and made available for queries with minimal latency.
#### 3. Regression Detection (`cluster`)
Despite its name, the `cluster` command is essentially a specialized instance of the frontend logic configured specifically for background analysis. It focuses on the "alerting" path—continuously scanning newly arrived data against configured alert definitions to identify performance regressions or improvements.
#### 4. Maintenance (`maintenance`)
The `maintenance` command runs background tasks that are required for the long-term health of the database and application state. These tasks are typically "singleton" operations, meaning only one instance of the maintenance process should run per Perf instance to avoid data contention or redundant processing.
### Operational Workflow
The `perfserver` coordinates these components through a shared configuration validation logic. Every sub-command (excluding documentation generators) follows a similar initialization pattern:
1. Parse CLI flags to locate the instance configuration file.
2. Validate the configuration against a schema to ensure environmental consistency.
3. Initialize telemetry and monitoring (Prometheus).
4. Hand off execution to the specialized package (e.g., `perf/go/frontend` or `perf/go/ingest/process`).
```
    [ Configuration File ]
               |
               v
+-----------------------------+
|         perfserver          |
+-----------------------------+
               |
               +--- [ frontend ] ----> Serves Web UI & API
               |
               +--- [ ingest ] ------> Monitors Storage -> Populates TraceStore
               |
               +--- [ cluster ] ------> Runs Alerting & Regression Detection
               |
               +--- [ maintenance ] --> Database Cleanup & Singleton Tasks
```
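The shared initialization pattern can be sketched as a small dispatcher. This is a standard-library illustration only: the real binary uses `go/urfavecli`, and `instanceConfig`, `validate`, `roles`, and `run` are hypothetical names. Telemetry setup (step 3) is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// instanceConfig is a hypothetical, heavily reduced instance configuration.
type instanceConfig struct {
	Name string
}

// validate stands in for schema validation of the configuration file.
func validate(cfg instanceConfig) error {
	if cfg.Name == "" {
		return errors.New("instance config: missing name")
	}
	return nil
}

// roles maps a sub-command to the package-level entry point it hands
// off to. The functions here only describe what the role would do.
var roles = map[string]func(instanceConfig) string{
	"frontend":    func(c instanceConfig) string { return c.Name + ": serving web UI and API" },
	"ingest":      func(c instanceConfig) string { return c.Name + ": populating the TraceStore" },
	"cluster":     func(c instanceConfig) string { return c.Name + ": running regression detection" },
	"maintenance": func(c instanceConfig) string { return c.Name + ": running singleton tasks" },
}

// run resolves the role, validates the configuration, and only then
// hands off, so a bad config fails before any role starts.
func run(role string, cfg instanceConfig) (string, error) {
	start, ok := roles[role]
	if !ok {
		return "", fmt.Errorf("unknown sub-command %q", role)
	}
	if err := validate(cfg); err != nil {
		return "", err
	}
	return start(cfg), nil
}

func main() {
	fmt.Println(run("ingest", instanceConfig{Name: "demo"}))
	fmt.Println(run("ingest", instanceConfig{}))
	fmt.Println(run("bogus", instanceConfig{Name: "demo"}))
}
```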
### Implementation Details
- **Flag Management**: The module heavily leverages `go/urfavecli` to map configuration structures directly to command-line flags. This ensures that the CLI interface stays in sync with the underlying configuration objects defined in `perf/go/config`.
- **Shared Logic**: By centralizing these commands, the server ensures that logging (via `sklog`), error handling (via `skerr`), and metrics initialization are applied consistently across all roles of the Perf system.
- **Validation**: Before starting critical processes like `ingest` or `maintenance`, the server uses `validate.InstanceConfigFromFile` to catch configuration errors early, preventing partial failures in production.
# Module: /go/pinpoint
# Pinpoint Client Module
The `pinpoint` module provides a Go client for interacting with Pinpoint, the performance regression analysis service used by Chrome and Skia. It abstracts the complexities of communicating with legacy Chromeperf and Pinpoint endpoints, allowing Skia Perf to programmatically trigger "Try Jobs" (to test specific patches) and "Bisect Jobs" (to identify the root cause of performance regressions).
## Overview
This module acts as a bridge between Skia Perf and the Pinpoint service. Its primary responsibility is to translate high-level requests—such as "bisect this anomaly" or "run a try job with this patch"—into the specific URL-encoded POST requests required by Pinpoint's legacy API.
The client manages:
- **Authentication**: Uses Google Default Token Sources with `auth.ScopeUserinfoEmail` to authorize requests.
- **Routing**: Determines whether to send bisect requests to the modern Pinpoint API or the legacy Chromeperf bisect service based on the source of the anomaly.
- **Data Transformation**: Normalizes inputs, such as converting underscores to dots in story names, to match Pinpoint's internal requirements.
- **Monitoring**: Tracks the success and failure rates of job creation via internal metrics.
## Key Components and Responsibilities
### Client (`pinpoint.go`)
The `Client` struct is the central entry point. It wraps an `http.Client` configured with necessary OAuth2 credentials and holds telemetry counters.
- **`CreateTryJob`**: Initiates a job to compare a base commit/patch against an experimental commit/patch. This is typically used to verify if a proposed fix actually improves performance before landing.
- **`CreateBisect`**: Initiates a bisection to find the specific commit that introduced a performance change. It supports two different backend paths depending on whether the configuration indicates the anomaly was fetched from the new SQL-based system.
- **`doPostRequest`**: A private helper that handles the low-level HTTP execution, response body reading, and error extraction. It specifically knows how to parse Pinpoint's error JSON format to provide actionable error messages.
### Request Orchestration
The module uses specific structures to define job parameters:
- **`TryJobCreateRequest`**: Captures details like `BaseGitHash`, `ExperimentPatch`, `Benchmark`, and `Story`.
- **`BisectJobCreateRequest`**: Captures regression-specific data like `StartGitHash`, `EndGitHash`, `ComparisonMagnitude`, and `AlertIDs`.
### Logic Flow: Request Building
Pinpoint's legacy API consumes parameters via URL query strings even for POST requests. The module handles this through several "build" functions that ensure required fields are present and formatted correctly.
```text
[ Skia Perf ] --(Request Struct)--> [ Client.CreateBisect ]
                                               |
                                    [ getBisectRequestURL ]
                                        /             \
                             (New Anomaly?)       (Legacy Anomaly?)
                                 /                           \
                      [ buildPinpointURL ]          [ buildChromeperfURL ]
                                 \                           /
                                 [ buildBisectRequestParams ]
                                               |
                                       (dotify stories)
                                    (add "skia_perf" tags)
                                               |
[ Pinpoint API ] <---(POST with Params)--------'
```
## Implementation Decisions
### Legacy API Compatibility
The module intentionally targets the legacy Pinpoint API endpoints (`/api/new` and `/pinpoint/new/bisect`). Because these endpoints read job parameters from the URL query string rather than from the POST body, `buildTryJobRequestURL` and `buildBisectRequestParams` URL-encode every parameter before the request is sent.
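A minimal sketch of encoding job parameters into a query string with the standard library. The `bisectParams` struct, the parameter keys, and the host name are illustrative assumptions, not the module's real types or Pinpoint's documented parameter names.

```go
package main

import (
	"fmt"
	"net/url"
)

// bisectParams is a hypothetical, trimmed-down stand-in for
// BisectJobCreateRequest; the real struct carries more fields.
type bisectParams struct {
	StartGitHash string
	EndGitHash   string
	Benchmark    string
}

// buildQuery checks required fields and encodes the parameters as a URL
// query string. The parameter keys are assumptions made for this sketch.
func buildQuery(p bisectParams) (string, error) {
	if p.StartGitHash == "" || p.EndGitHash == "" {
		return "", fmt.Errorf("start and end git hashes are required")
	}
	v := url.Values{}
	v.Set("start_git_hash", p.StartGitHash)
	v.Set("end_git_hash", p.EndGitHash)
	if p.Benchmark != "" {
		v.Set("benchmark", p.Benchmark)
	}
	// Encode sorts by key, so the result is deterministic.
	return v.Encode(), nil
}

func main() {
	q, err := buildQuery(bisectParams{StartGitHash: "abc", EndGitHash: "def", Benchmark: "motion mark"})
	if err != nil {
		panic(err)
	}
	// The query string is appended to the endpoint URL of the POST request.
	fmt.Println("https://pinpoint.example.invalid/pinpoint/new/bisect?" + q)
}
```

Validating required fields before encoding lets a malformed request fail locally instead of after a round trip to the service.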
### Data Normalization (`dotify`)
Pinpoint internally expects story names to use dot notation (e.g., `story.name`) rather than underscores (e.g., `story_name`), which are common in other parts of the Skia ecosystem. The `dotify` function automatically handles this transformation to prevent job submission failures.
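The transformation can be sketched in a few lines. This version assumes a plain replacement of every underscore; the real `dotify` may treat some names differently.

```go
package main

import (
	"fmt"
	"strings"
)

// dotify replaces every underscore with a dot. This is an assumed behaviour
// based on the description above, not a copy of the module's function.
func dotify(story string) string {
	return strings.ReplaceAll(story, "_", ".")
}

func main() {
	fmt.Println(dotify("story_name")) // prints "story.name"
}
```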
### Error Handling
Instead of returning generic HTTP errors, `extractErrorMessage` attempts to parse the JSON response from Pinpoint to find a specific `error` field. If Pinpoint returns a 400 or 500 status code with a message like `{"error": "benchmark not found"}`, this module ensures that specific string is propagated back to the caller.
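A minimal sketch of this kind of error extraction. The helper name `extractError` is hypothetical, and falling back to the raw body when no `error` field can be parsed is an assumption of the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractError returns the "error" field of a JSON response body. When the
// body is not JSON or the field is empty, it falls back to the raw body
// (the fallback is an assumption of this sketch).
func extractError(body []byte) string {
	var parsed struct {
		Error string `json:"error"`
	}
	if err := json.Unmarshal(body, &parsed); err != nil || parsed.Error == "" {
		return string(body)
	}
	return parsed.Error
}

func main() {
	fmt.Println(extractError([]byte(`{"error": "benchmark not found"}`)))
	fmt.Println(extractError([]byte(`Internal Server Error`)))
}
```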
### Conditional Routing
The function `getBisectRequestURL` uses the `config.Config.FetchAnomaliesFromSql` toggle to decide which legacy endpoint to hit. This allows the system to support a transition period between old Chromeperf-managed anomalies and newer SQL-managed anomalies without breaking the bisection workflow.
# Module: /go/pivot
# Pivot Module
The `pivot` module provides functionality to transform and aggregate Performance `DataFrames`. It allows users to group traces by specific keys and apply mathematical operations to summarize data, similar to a "Pivot Table" in a spreadsheet or a `GROUP BY` clause in SQL.
## Overview
In Perf, data is typically represented as a series of traces (floating-point arrays) identified by a set of parameters (e.g., `arch=arm, config=8888`). The `pivot` module allows you to "collapse" these traces based on a subset of those parameters.
For example, if you have traces for various configurations across different architectures, you can pivot by `arch` to see the aggregate performance of `arm` vs. `intel`, regardless of the specific configuration.
## Key Concepts
### Pivot Request
The transformation is governed by a `Request` struct which defines three things:
1. **GroupBy**: A list of keys to retain. All other keys in the original trace IDs are discarded, and traces sharing the same remaining keys are grouped together.
2. **Operation**: The aggregation function used to combine multiple traces within a group into a single representative trace.
3. **Summary** (Optional): A list of operations to apply to the _resulting_ traces to reduce them from a series of values (over time/commits) into a single scalar value per operation.
### Operations
The module supports several mathematical operations for both grouping and summarization:
- **Sum / Avg**: Standard arithmetic sum and mean.
- **Geo**: Geometric mean.
- **Std**: Standard deviation.
- **Count**: Number of data points.
- **Min / Max**: Extremum values.
## Workflow and Design
The pivoting process follows a structured pipeline:
### 1. Grouping
The module identifies all unique combinations of the keys provided in `GroupBy` that exist in the `DataFrame`. It then maps every existing trace in the input `DataFrame` to one of these groups. If a trace does not contain all the keys specified in `GroupBy`, it is excluded from the result.
### 2. Aggregation (Group By)
For each group, the `Operation` (e.g., `Sum`) is applied across all traces in that group. This results in one trace per group. The trace ID for this new trace contains only the keys specified in the `GroupBy` list.
```text
Input Traces:
{arch: arm, config: 8888} -> [1, 0, 0]
{arch: arm, config: 565 } -> [0, 2, 0]
{arch: intel, config: 8888} -> [1, 1, 1]
Pivot (GroupBy: ["arch"], Operation: Sum):
{arch: arm} -> [1+0, 0+2, 0+0] -> [1, 2, 0]
{arch: intel} -> [1, 1, 1] -> [1, 1, 1]
```
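The grouping and aggregation steps can be sketched as follows. The `trace` type and function names are hypothetical, the key format only imitates Perf's structured keys, and the sketch assumes all traces have the same length and contain no missing-data sentinels.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trace is a hypothetical pairing of trace params and values.
type trace struct {
	params map[string]string
	values []float32
}

// groupKey builds a key from only the groupBy params. It reports false when
// the trace lacks one of the requested keys.
func groupKey(params map[string]string, groupBy []string) (string, bool) {
	parts := make([]string, 0, len(groupBy))
	for _, k := range groupBy {
		v, ok := params[k]
		if !ok {
			return "", false
		}
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return "," + strings.Join(parts, ",") + ",", true
}

// pivotSum groups traces by groupBy and sums each group element-wise.
func pivotSum(traces []trace, groupBy []string) map[string][]float32 {
	out := map[string][]float32{}
	for _, t := range traces {
		key, ok := groupKey(t.params, groupBy)
		if !ok {
			continue // traces missing a GroupBy key are excluded
		}
		if _, seen := out[key]; !seen {
			out[key] = make([]float32, len(t.values))
		}
		for i, v := range t.values {
			out[key][i] += v
		}
	}
	return out
}

func main() {
	traces := []trace{
		{map[string]string{"arch": "arm", "config": "8888"}, []float32{1, 0, 0}},
		{map[string]string{"arch": "arm", "config": "565"}, []float32{0, 2, 0}},
		{map[string]string{"arch": "intel", "config": "8888"}, []float32{1, 1, 1}},
	}
	for key, values := range pivotSum(traces, []string{"arch"}) {
		fmt.Println(key, values)
	}
}
```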
### 3. Summarization (Optional)
If `Summary` operations are provided, the module further transforms the aggregated traces. Instead of a trace representing values over multiple commits, the resulting "trace" contains one value for each operation listed in `Summary`.
```text
Intermediate Grouped Trace (from above):
{arch: arm} -> [1, 2, 0]
Summary (Summary: [Avg, Max]):
{arch: arm} -> [1, 2] // Avg is 1, Max is 2
```
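A small sketch of the summarization step, reducing one grouped trace to one value per operation. The `op` type and helper names are hypothetical; the module itself maps operations to functions in `go/calc` and `go/vec32`.

```go
package main

import "fmt"

// op reduces a whole trace to a single scalar.
type op func([]float32) float32

// mean returns the arithmetic mean, or 0 for an empty trace.
func mean(xs []float32) float32 {
	if len(xs) == 0 {
		return 0
	}
	var sum float32
	for _, x := range xs {
		sum += x
	}
	return sum / float32(len(xs))
}

// maxOf returns the largest value, or 0 for an empty trace.
func maxOf(xs []float32) float32 {
	if len(xs) == 0 {
		return 0
	}
	m := xs[0]
	for _, x := range xs[1:] {
		if x > m {
			m = x
		}
	}
	return m
}

// summarize applies each operation to the trace, yielding one column per op.
func summarize(values []float32, ops []op) []float32 {
	out := make([]float32, 0, len(ops))
	for _, f := range ops {
		out = append(out, f(values))
	}
	return out
}

func main() {
	fmt.Println(summarize([]float32{1, 2, 0}, []op{mean, maxOf})) // prints "[1 2]"
}
```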
## Implementation Details
- **Logic Mapping**: The module uses an internal `opMap` to link `Operation` enums to specific implementation functions from the `go/calc` and `go/vec32` packages. This ensures consistency between how data is grouped and how it is summarized.
- **DataFrame Reconstruction**: After pivoting, the module rebuilds the `ParamSet` and updates the `DataFrame` headers. If a summary is performed, the headers are replaced with simple offsets representing the summary columns.
- **Performance**: It utilizes `query.MakeKeyFast` and `query.ParseKeyFast` for efficient trace ID manipulation and supports context cancellation for long-running aggregations on large datasets.
# Module: /go/playground
# Playground
The `playground` module serves as an interactive experimentation hub for performance data analysis. It provides a web-accessible sandbox where developers and performance engineers can validate detection algorithms, test regression logic against synthetic or real-world traces, and fine-tune sensitivity parameters without impacting production systems or persistent storage.
## Design Philosophy and Core Functionality
The primary goal of the playground is to decouple the **analysis logic** from the **data storage and ingestion infrastructure**. In the standard Perf production environment, anomaly detection is often part of a large, automated pipeline that reads from BigTable and writes to SQL databases. The playground bypasses these dependencies by accepting raw data via HTTP and processing it in-memory.
This design enables:
- **Rapid Iteration**: Immediate feedback on how changing a "radius" or "threshold" affects anomaly detection.
- **Algorithm Validation**: Comparison between different regression methods (e.g., `AbsoluteStep` vs. `OriginalStep`) on the same data set.
- **Noise Reduction Testing**: Testing consolidation strategies like Non-maximum Suppression to ensure that a single regression isn't reported as multiple adjacent anomalies.
## Key Submodules and Responsibilities
The module is structured to separate the API lifecycle from the specific mathematical analysis being performed.
### Anomaly Detection (`/anomaly`)
This is the primary functional area of the playground. It implements a sliding window approach to identify shifts in time-series data.
- **Windowing Strategy**: Rather than treating a trace as a single entity, the module slides a window of size `2 * radius + 1` across the data. This localization allows the `regression` package to focus on finding a single "best" step within a small context, which is more robust against long-term trends or multiple shifts in a single trace.
- **Data Adaptation**: A significant portion of the implementation involves "shim logic." The core `regression` and `dataframe` packages used by Skia Perf expect complex structures (trace sets, headers, paramsets). The playground's `anomaly` logic constructs transient, "dummy" dataframes to wrap raw float slices, allowing the production-grade regression code to run as if it were processing a standard database query.
- **Anomaly Consolidation (Non-maximum Suppression)**: To prevent "jitter" (where several points around a step are all flagged), the module implements a grouping logic. If enabled, it identifies clusters of contiguous points flagged as anomalies and selects only the point with the highest absolute regression score—the point where the "step" is most pronounced.
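The consolidation step can be sketched as a single pass over the flagged points. The `hit` type is hypothetical, and the sketch assumes the hits arrive sorted by index.

```go
package main

import (
	"fmt"
	"math"
)

// hit is a hypothetical record of one flagged index and its regression score.
type hit struct {
	Index int
	Score float64
}

// suppress groups runs of consecutive indices and keeps, for each run, the
// hit with the largest absolute score. Hits must be sorted by Index.
func suppress(hits []hit) []hit {
	var out []hit
	for i := 0; i < len(hits); {
		best := hits[i]
		j := i + 1
		for j < len(hits) && hits[j].Index == hits[j-1].Index+1 {
			if math.Abs(hits[j].Score) > math.Abs(best.Score) {
				best = hits[j]
			}
			j++
		}
		out = append(out, best)
		i = j
	}
	return out
}

func main() {
	hits := []hit{{4, 2.0}, {5, -6.5}, {6, 3.0}, {10, 1.5}}
	fmt.Println(suppress(hits)) // keeps index 5 from the first run, and index 10
}
```

Comparing absolute scores means a strong downward step is not discarded in favor of a weaker upward one in the same cluster.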
## Workflow: Request-to-Analysis
The following diagram illustrates how data flows from a user request through the detection engine:
```text
[User Request (JSON)]
| (Trace, Threshold, Radius, Algorithm)
v
[HTTP Handler]
|
+-----> [Data Cleaning] (Remove missing data sentinels)
|
+-----> [Sliding Window Loop]
| |
| v
| [regression.StepFit] <--- (Analyzes N points)
| |
| +--> [Threshold Check] (Is regression > threshold?)
|
+-----> [Optional: Grouping/Suppression]
| | (Merges adjacent hits, keeps max score)
v
[Enriched Response]
(Indices, Medians Before/After, Regression Scores)
```
## Implementation Decisions
- **Statistically Driven Summaries**: When an anomaly is detected, the module calculates `MedianBefore` and `MedianAfter` values. Missing-data sentinels are first stripped with `vec32.RemoveMissingDataSentinel`, so gaps in telemetry do not produce "NaN" or skewed medians, and the user sees a clean delta of the performance change.
- **Algorithm Agnostic API**: The request structure uses a string-based `Algorithm` field. This allows the playground to support any algorithm registered in the `regression` package without changing the API schema, making it extensible as new detection methods are developed.
- **Simulated Environment**: By using the `PlaygroundTraceName` constant, the module satisfies internal requirements for named traces while maintaining the abstraction that this data is ephemeral and not tied to a real hardware bot or test suite.
# Module: /go/playground/anomaly
# Anomaly Playground
The `anomaly` playground module provides a sandbox environment for testing and tuning anomaly detection algorithms on performance data. It exposes an HTTP interface that allows users to submit individual data traces and receive a list of detected anomalies based on configurable parameters like window size and sensitivity thresholds.
This module acts as a bridge between the frontend and the core `regression` detection logic, allowing developers and users to experiment with detection settings without modifying production configurations or underlying databases.
## Detection Logic and Design Decisions
The module implements a **Sliding Window Step Fit** approach. Instead of analyzing a whole trace at once, it moves a window across the data to identify localized "steps" or shifts in value.
### Key Workflows
1. **Request Handling**: The `Handler` receives a `DetectRequest` containing the raw trace data, a `Radius` (determining window size), a `Threshold` (sensitivity), and the specific `Algorithm` to use (e.g., `AbsoluteStep`, `OriginalStep`).
2. **Windowing**: The `slidingWindowStepFit` function iterates through the trace. At each index `i`, it creates a window of size `2 * radius + 1`.
3. **Step Fitting**: For each window, the module wraps the slice into a temporary `dataframe.DataFrame` and calls `regression.StepFit`. This leverages the existing production logic used by the Perf service.
4. **Anomaly Consolidation**:
- If `GroupAnomalies` is false, every index flagged by the algorithm is returned.
- If `GroupAnomalies` is true, the module performs **Non-maximum Suppression**. It groups consecutive indices flagged as anomalies and only returns the one with the highest absolute regression score (the "most significant" point in the cluster).
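The windowing loop can be sketched as below. `stepScore` is a deliberately simple stand-in (difference of the means of the two window halves), not the algorithm behind `regression.StepFit`, and skipping indices that lack a full window is an assumption of the sketch.

```go
package main

import "fmt"

// mean returns the arithmetic mean of a non-empty slice.
func mean(xs []float32) float32 {
	var sum float32
	for _, x := range xs {
		sum += x
	}
	return sum / float32(len(xs))
}

// stepScore is a stand-in for the real step-fit: the mean of the second
// half of the window minus the mean of the first half.
func stepScore(window []float32) float32 {
	mid := len(window) / 2
	return mean(window[mid:]) - mean(window[:mid])
}

// slidingWindow returns the indices whose surrounding window of
// 2*radius+1 points has a step score at or beyond the threshold.
func slidingWindow(trace []float32, radius int, threshold float32) []int {
	if radius < 1 {
		return nil
	}
	var flagged []int
	for i := radius; i+radius < len(trace); i++ {
		window := trace[i-radius : i+radius+1]
		s := stepScore(window)
		if s >= threshold || s <= -threshold {
			flagged = append(flagged, i)
		}
	}
	return flagged
}

func main() {
	trace := []float32{1, 1, 1, 1, 5, 5, 5, 5}
	fmt.Println(slidingWindow(trace, 2, 3)) // prints "[4]"
	fmt.Println(slidingWindow(trace, 2, 2)) // a lower threshold flags neighbours too
}
```

The second call shows why grouping is useful: with a looser threshold, several indices around the same step are flagged.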
### Process Diagram
```text
[Trace Data]
     |
     v
[Windowing] ----> [Sub-trace (i - radius to i + radius)]
     |                          |
     |                          v
     |            [regression.StepFit Analysis]
     |                          |
     |                          v
     |<--- [Is it a "High" or "Low" Step?]
     |
     v
[Candidate Anomalies]
     |
     +--- (If GroupAnomalies=true) ---> [Non-maximum Suppression]
     |                                    (Pick best in group)
     v
[JSON Response (Anomalies)]
```
## Key Components
### `anomaly.go`
Contains the core logic for the playground:
- **`DetectRequest` / `DetectResponse`**: Defines the JSON API. The request allows choosing the algorithm and whether to group nearby anomalies to reduce noise.
- **`slidingWindowStepFit`**: The engine that breaks the trace into windows. It constructs dummy `dataframe` headers to satisfy the requirements of the `regression` package's API, simulating a real data environment.
- **`Handler`**: The HTTP entry point. It manages the lifecycle of a request, invokes the detection, calculates metadata for each detected anomaly (like `MedianBeforeAnomaly` and `MedianAfterAnomaly`), and performs the optional grouping logic.
### `anomaly_test.go`
Provides functional tests for the detection logic. It verifies that the playground correctly identifies simple steps (up/down), handles empty or flat traces, and correctly implements the grouping suppression logic to ensure only the most relevant points are reported.
## Implementation Details
- **Handling Missing Data**: The module uses `vec32.RemoveMissingDataSentinel` when calculating medians to ensure that gaps in performance data (common in real-world traces) do not skew the statistical summary of the detected anomaly.
- **Regression Scores**: The "strength" of an anomaly is determined by the `Regression` value returned by the step-fit algorithm. When grouping anomalies, the absolute value is used to determine which point in a cluster represents the most significant shift.
- **Performance Trace Isolation**: The constant `PlaygroundTraceName` is used to identify traces within the temporary dataframes created for analysis, ensuring compatibility with internal Perf logic that expects named traces.
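A minimal sketch of computing medians while ignoring gaps. The sentinel value and helper names are assumptions that imitate the `vec32` helpers, and splitting the trace at the anomaly index is one plausible way to form the "before" and "after" sides.

```go
package main

import (
	"fmt"
	"sort"
)

// missing imitates a missing-data sentinel; the value is an assumption.
const missing float32 = 1e32

// removeMissing returns a copy of xs without sentinel values.
func removeMissing(xs []float32) []float32 {
	out := make([]float32, 0, len(xs))
	for _, x := range xs {
		if x != missing {
			out = append(out, x)
		}
	}
	return out
}

// median returns the median of the non-missing values, or 0 if none remain.
func median(xs []float32) float32 {
	clean := removeMissing(xs)
	if len(clean) == 0 {
		return 0
	}
	sort.Slice(clean, func(i, j int) bool { return clean[i] < clean[j] })
	mid := len(clean) / 2
	if len(clean)%2 == 1 {
		return clean[mid]
	}
	return (clean[mid-1] + clean[mid]) / 2
}

func main() {
	trace := []float32{1, missing, 3, 2, 10, 12, missing, 11}
	anomaly := 4 // index where the step was detected
	fmt.Println(median(trace[:anomaly]), median(trace[anomaly:])) // prints "2 11"
}
```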
# Module: /go/preflightqueryprocessor
# preflightqueryprocessor
The `preflightqueryprocessor` module provides specialized logic for handling complex trace queries in Skia Perf, particularly during the "preflight" stage of data exploration. It manages the aggregation of parameters and trace counts across multiple data tiles, with specific support for "missing value" logic that standard database queries cannot easily represent.
## Core Responsibility
In Skia Perf, a query typically filters traces based on a set of key-value pairs (e.g., `benchmark=V8_Flash`). However, users often need to perform "preflight" checks to understand the shape of their data before running a full analysis. This module facilitates:
1. **Shared Parameter Aggregation**: Collecting all unique values for all keys across all traces that match a query.
2. **Subquery Processing**: Efficiently narrowing down available options for a specific key based on existing filters on other keys.
3. **Sentinel Value Handling**: Supporting the `__missing__` sentinel, which allows users to query for traces that _lack_ a specific key.
## Design Decisions and Implementation
### The Sentinel Strategy (`__missing__`)
Standard database backends often struggle to query for the absence of a key alongside specific values in a single pass. To solve this, the module implements a "fetch-and-filter" strategy:
- **Query Transformation**: When a query contains the `MissingValueSentinel` (`__missing__`), the processor removes that key from the actual database query. This ensures a superset of traces (including those missing the key) is fetched from the store.
- **In-Memory Filtering**: The `FilterParams` function then applies the logic manually: a trace matches if the key is missing OR if the key's value is within the user's explicitly allowed set.
### Concurrent Aggregation
Preflight queries often run across multiple goroutines (one per data tile). To handle this efficiently:
- **Shared State**: A `preflightQueryBaseProcessor` holds a shared `sync.Mutex` and `paramtools.ParamSet`. All subqueries and the main query share these instances to build a single unified result.
- **Batching**: To minimize mutex contention, the `preflightSubQueryProcessor` collects values into a local slice for each tile and then performs a single batch update to the shared state once the tile is fully processed.
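A minimal sketch of the batching idea: each tile goroutine collects values locally and takes the shared lock once. The `shared` type and `collect` function are hypothetical and use a plain set instead of `paramtools.ParamSet`.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// shared is a hypothetical stand-in for the state shared by all processors.
type shared struct {
	mu     sync.Mutex
	values map[string]bool
}

// addBatch merges a whole tile's values under a single lock acquisition.
func (s *shared) addBatch(batch []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, v := range batch {
		s.values[v] = true
	}
}

// collect processes every tile in its own goroutine and returns the sorted
// union of the non-empty values.
func collect(tiles [][]string) []string {
	s := &shared{values: map[string]bool{}}
	var wg sync.WaitGroup
	for _, tile := range tiles {
		wg.Add(1)
		go func(tile []string) {
			defer wg.Done()
			local := make([]string, 0, len(tile))
			for _, v := range tile { // no lock needed while filling local
				if v != "" {
					local = append(local, v)
				}
			}
			s.addBatch(local)
		}(tile)
	}
	wg.Wait()
	out := make([]string, 0, len(s.values))
	for v := range s.values {
		out = append(out, v)
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(collect([][]string{{"arm", "x86"}, {"arm", "riscv"}})) // prints "[arm riscv x86]"
}
```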
### Main vs. Subquery Processors
The module distinguishes between the primary query and supplementary subqueries:
- **Main Processor**: Responsible for the total count of unique traces and identifying missing keys (recording them as empty strings in the result set).
- **Subquery Processor**: Used when the UI needs to know "what are the available values for key X, given the current filters on keys Y and Z?". It focuses strictly on populating the values for its target key.
## Key Workflows
### Query Processing Flow
The following diagram illustrates how a query is handled when a sentinel value is present:
```text
User Query: [config=gpu, arch=__missing__]
|
v
PrepareQueryWithSentinel()
|-- 1. Create FilterMap: { "arch": { "allowed": [] } }
|-- 2. Strip "arch" from Query -> Backend Query: [config=gpu]
v
Fetch Traces from Tiles (Parallel)
|
|--> Tile 1 Results ----> FilterParams() ----> If Match: Add to Shared ParamSet
|--> Tile 2 Results ----> FilterParams() ----> If Match: Add to Shared ParamSet
v
Finalize()
|-- Subqueries move collected values into the final shared ParamSet.
|-- Result: Total Count + Aggregated ParamSet.
```
## Key Components
### `ParamSetAggregator` & `PreflightQueryResultCollector`
These interfaces define how trace data is consumed. The main processor implements both (tracking count and params), while the subquery processor only implements aggregation.
### `preflightMainQueryProcessor`
The primary coordinator. It uses a `map[string]bool` of unique trace IDs to ensure the total count is accurate even if traces overlap across tiles. It also supports `SetKeysToDetectMissing`, which forces the aggregator to record an empty string if a specific expected key is absent from a trace.
### `preflightSubQueryProcessor`
Optimized for "discovery" queries. It tracks values in a local `filteredValuesFromTiles` map during tile processing and only populates the shared `ParamSet` during the `Finalize` phase to reduce lock overhead.
### `PrepareQueryWithSentinel` and `FilterParams`
The logic engine for the `__missing__` value. `PrepareQueryWithSentinel` modifies the query object in place (after cloning) to ensure the backend returns a broad enough dataset for the manual `FilterParams` logic to work correctly.
# Module: /go/progress
# Perf Progress Tracking Module
The `progress` module provides a standardized mechanism for tracking and reporting the status of long-running backend tasks in the Perf application. It bridges the gap between asynchronous server-side processes (like complex data queries or "dry runs") and the user interface, which needs real-time feedback on task advancement.
## Design Philosophy
The module is built around a "push-pull" architecture:
1. **Push:** The long-running backend task updates a `Progress` object with its current state, messages, and eventually, results.
2. **Pull:** The frontend polls a specific URL associated with that task. The backend `Tracker` intercepts these requests and returns a serialized snapshot of the task's progress.
To ensure consistency and prevent logical errors in task reporting, the state machine is strictly enforced. Once a task transitions out of the `Running` state (to `Finished` or `Error`), any further attempt to modify its state or messages results in a panic. This encourages developers to handle task finalization at the outermost calling level, ensuring a clean lifecycle.
## Key Components
### Progress Interface
The `Progress` interface is the core unit of state. It manages:
- **Status:** A task is always in one of three states: `Running`, `Finished`, or `Error`.
- **Messages:** An ordered collection of key-value pairs used to describe stages (e.g., "Step: 1/5", "Stage: Analyzing traces"). If a message key is reused, the value is updated in place, allowing for dynamic progress bars or counters.
- **Results:** An arbitrary data structure containing the final output of the task.
- **URL:** A unique endpoint where the UI can poll for updates.
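A minimal sketch of the enforced state machine and in-place message updates. The type and method names here are hypothetical simplifications and do not reproduce the module's `Progress` interface.

```go
package main

import (
	"fmt"
	"sync"
)

type status string

const (
	statusRunning  status = "Running"
	statusFinished status = "Finished"
)

// message is one ordered key-value entry.
type message struct{ Key, Value string }

// progress is a hypothetical, trimmed-down progress tracker.
type progress struct {
	mu       sync.Mutex
	state    status
	messages []message
}

func newProgress() *progress { return &progress{state: statusRunning} }

// mustBeRunning panics if the task has already left the Running state.
// Callers must hold p.mu.
func (p *progress) mustBeRunning() {
	if p.state != statusRunning {
		panic("progress: task is no longer running")
	}
}

// Message adds a message, or updates it in place when the key is reused.
func (p *progress) Message(key, value string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.mustBeRunning()
	for i := range p.messages {
		if p.messages[i].Key == key {
			p.messages[i].Value = value
			return
		}
	}
	p.messages = append(p.messages, message{key, value})
}

// Finished moves the task out of the Running state.
func (p *progress) Finished() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.mustBeRunning()
	p.state = statusFinished
}

func main() {
	p := newProgress()
	p.Message("Step", "1/5")
	p.Message("Step", "2/5") // same key: updated in place
	p.Finished()
	fmt.Println(p.state, p.messages) // prints "Finished [{Step 2/5}]"
}
```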
### Tracker
The `Tracker` acts as a registry and HTTP handler for all active `Progress` objects. It manages the lifecycle of these objects using an internal Least Recently Used (LRU) cache.
Key responsibilities include:
- **ID Assignment:** When a `Progress` is added to the `Tracker`, it is assigned a UUID and a corresponding polling URL based on a configured `basePath`.
- **HTTP Handling:** The `Tracker` provides a standard `http.Handler` that extracts the task ID from the URL path, retrieves the task from the cache, and serializes its state to JSON.
- **Cleanup:** To prevent memory leaks, the `Tracker` runs a background goroutine that evicts completed or failed tasks from the cache after a set duration (defaulting to 5 minutes).
## Workflow Example
The following diagram illustrates the interaction between an HTTP handler, a background worker, and the `Tracker`.
```text
HTTP Handler                Background Worker                 Tracker / UI
------------                -----------------                 ------------
1. Create Progress   ---->  2. Start Goroutine
3. Add to Tracker    ---->  4. Return JSON (Initial URL)  ->  UI starts polling
                                   |
                            5. update Message()  ---------->  UI sees "Step 1"
                                   |
                            6. update Results()
                                   |
                            7. call Finished()   ---------->  UI sees "Finished"
                                                              & fetches Results
                                   |
                            [ 5 minutes later ]  ---------->  Tracker evicts task
```
## Implementation Details
- **Concurrency:** The standard implementation of `Progress` uses a `sync.Mutex` to ensure that concurrent updates from a worker and read requests from the `Tracker` handler are thread-safe.
- **Serialization:** The `SerializedProgress` struct is designed to be easily consumed by TypeScript frontends, using `go2ts` compatible tags.
- **Error Handling:** When `Error(msg)` is called, the status is updated, and the error message is automatically stored in the messages list under a reserved `Error` key.
- **Persistence (Future):** While currently memory-backed, the module includes hooks for Redis-based persistence to support progress tracking across horizontal scaling boundaries.
# Module: /go/psrefresh
### High-Level Overview
The `psrefresh` module is responsible for maintaining and providing an up-to-date `ParamSet` for a Perf instance. A `ParamSet` is a collection of all keys and values (metadata) for all traces stored in the system.
In a high-volume performance monitoring system, querying the underlying database to discover which parameters are available (e.g., "Which benchmarks ran on this specific bot?") can be expensive. This module solves that by background-loading metadata into memory and optionally caching filtered results to ensure that the user interface remains responsive when users build queries.
### Design Decisions and Implementation Choices
#### Tile-Based Aggregation
The module retrieves metadata by looking at "tiles"—chunks of time-series data. The `defaultParamSetRefresher` is designed to aggregate metadata from a configurable number of the most recent tiles (typically the two most recent). This ensures that the `ParamSet` reflects currently active traces while ignoring stale parameters from deleted or very old data.
#### In-Memory vs. Cached Access
There are two primary ways this module serves data:
1. **Direct Refresher (`defaultParamSetRefresher`)**: Keeps the full, global `ParamSet` in memory. This is updated on a periodic background tick.
2. **Cached Refresher (`CachedParamSetRefresher`)**: Wraps the default refresher. It pre-calculates and stores filtered `ParamSet` results in a cache (like Redis or local memory) based on specific "Level 1" and "Level 2" keys defined in the configuration. This is a performance optimization for UI components that drill down through common hierarchies (e.g., `Benchmark -> Bot`).
#### Thread Safety and Reliability
- **Concurrency**: The refresher uses a `sync.Mutex` to protect the in-memory `ParamSet` during background updates, ensuring that readers never see a partially constructed set.
- **Resilience**: During the refresh process (`oneStep`), the system is designed to be tolerant of failures. While failing to fetch the _latest_ tile results in an error, failures to fetch older supplementary tiles are logged as warnings rather than crashing the refresh cycle, allowing the system to provide "mostly complete" data rather than no data at all.
### Key Components and Files
#### `psrefresh.go`
Contains the core logic for the `defaultParamSetRefresher`.
- **`OPSProvider` Interface**: Abstracts the data source (usually a `TraceStore`). It requires two methods: identifying the latest tile and retrieving a `ParamSet` for a specific tile.
- **`oneStep()`**: The atomic unit of work that fetches the latest tile ID, iterates backward to collect the requested number of tiles, merges their metadata, and "freezes" the result into a read-only structure.
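The refresh step and its failure tolerance can be sketched as follows. `tileProvider` is a simplified stand-in for `OPSProvider`; its method names, the plain `int` tile numbers, and the `map[string][]string` ParamSet are assumptions of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sort"
)

// tileProvider is a simplified stand-in for OPSProvider.
type tileProvider interface {
	LatestTile() (int, error)
	ParamSet(tile int) (map[string][]string, error)
}

// oneStep merges the ParamSets of the numTiles most recent tiles. A failure
// on the latest tile is fatal; failures on older tiles are only logged.
func oneStep(p tileProvider, numTiles int) (map[string][]string, error) {
	latest, err := p.LatestTile()
	if err != nil {
		return nil, fmt.Errorf("finding latest tile: %w", err)
	}
	seen := map[string]map[string]bool{}
	for tile := latest; tile > latest-numTiles && tile >= 0; tile-- {
		ps, err := p.ParamSet(tile)
		if err != nil {
			if tile == latest {
				return nil, fmt.Errorf("loading latest tile %d: %w", tile, err)
			}
			log.Printf("warning: skipping tile %d: %v", tile, err)
			continue
		}
		for key, values := range ps {
			if seen[key] == nil {
				seen[key] = map[string]bool{}
			}
			for _, v := range values {
				seen[key][v] = true
			}
		}
	}
	merged := map[string][]string{}
	for key, values := range seen {
		for v := range values {
			merged[key] = append(merged[key], v)
		}
		sort.Strings(merged[key]) // normalize ordering
	}
	return merged, nil
}

// fakeProvider serves tiles from a map; absent tiles return an error.
type fakeProvider struct {
	latest int
	tiles  map[int]map[string][]string
}

func (f fakeProvider) LatestTile() (int, error) { return f.latest, nil }

func (f fakeProvider) ParamSet(tile int) (map[string][]string, error) {
	ps, ok := f.tiles[tile]
	if !ok {
		return nil, errors.New("tile unavailable")
	}
	return ps, nil
}

func main() {
	p := fakeProvider{latest: 5, tiles: map[int]map[string][]string{
		5: {"bot": {"pixel_6"}},
		4: {"bot": {"pixel_6", "pixel_7"}},
	}}
	fmt.Println(oneStep(p, 2))
}
```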
#### `cachedpsrefresh.go`
Implements the `CachedParamSetRefresher`, which adds a caching layer over the standard refresher.
- **Hierarchical Pre-population**: It uses `PopulateCache()` to proactively execute "Preflight" queries. It iterates through values of a primary key (Level 1) and optionally a secondary key (Level 2), storing the resulting `ParamSet` and trace count in the cache.
- **Smart Query Routing**: When `GetParamSetForQuery` is called, the component checks if the query matches the cached levels (e.g., exactly 1 or 2 specific keys). If it matches, it serves from the cache; otherwise, it falls back to a real-time database query.
### Key Workflows
#### Background Refresh Process
The standard refresher maintains the global state of available parameters.
```
[ Timer Tick ]
|
V
[ oneStep() ] ----------------------> [ TraceStore (OPSProvider) ]
| |
| <--- Get Latest Tile ID -------------|
| |
| <--- Get ParamSet for Tile N --------|
| <--- Get ParamSet for Tile N-1 ------|
|
[ Merge & Normalize ]
|
[ Lock Mutex ] -> [ Update pf.ps ] -> [ Unlock Mutex ]
```
#### Cached Query Workflow
When the UI requests a filtered `ParamSet` (e.g., selecting a benchmark to see available bots), the cached refresher determines the most efficient data path.
```
[ Request: GetParamSetForQuery(Query) ]
      |
      |-- If Query has Level1/Level2 keys only? --+
      |                                           |
      |  [ YES ]                               [ NO ]
      V                                           V
[ Check Cache ]                        [ Real-time DB Query ]
      |                                           |
      |-- Cache Hit? --+                          |
      |                |                          |
   [ YES ]          [ NO ]                        |
      V                V                          V
[ Return Data ]  [ Fetch from DB ] --------> [ Return Data ]
```
# Module: /go/psrefresh/mocks
### High-Level Overview
The `go/psrefresh/mocks` module provides mock implementations of interfaces used by the ParamSet Refresher (`psrefresh`) system within the Perf service. Its primary purpose is to facilitate unit testing for components that depend on an `OPSProvider` (Ordered ParamSet Provider), allowing developers to simulate data retrieval from the underlying storage layer without requiring a live database or complex setup.
### Design and Implementation Choices
The module relies on **testify/mock** and is generated via **mockery**. This choice ensures that the mocks are strictly typed and consistent with the actual interfaces they represent. By using generated mocks, the project maintains a clear separation between the logic being tested and the data-providing infrastructure.
A key design aspect of these mocks is the abstraction of tile-based data access. In the Perf system, data is organized into "tiles" (chunks of time-series data). The `OPSProvider` mock allows tests to control exactly what a component perceives as the "latest tile" or what "ParamSet" (a collection of key-value pairs representing trace metadata) exists within a specific tile.
### Key Components
#### OPSProvider.go
This file contains the `OPSProvider` struct, which mocks the interface responsible for bridging the refresher logic and the actual data store. It manages two primary responsibilities in a test environment:
- **State Simulation**: It allows tests to define the current state of the system by mocking `GetLatestTile`. This is crucial for testing how the refresher reacts when new data is added or when the system is already up to date.
- **Data Injection**: Through the `GetParamSet` mock function, developers can inject specific `paramtools.ReadOnlyParamSet` objects into the workflow. This allows for fine-grained testing of how the Perf system indexes metadata and how it handles potential errors during data retrieval.
The mock includes a `NewOPSProvider` constructor that automatically handles test cleanup and expectation assertions, ensuring that tests fail if the code under test does not interact with the provider as expected.
### Key Workflows
The typical workflow for using this module involves setting up expectations within a unit test to simulate the lifecycle of a ParamSet refresh operation:
```
[ Test Setup ]
|
V
[ Mock OPSProvider ] <--- Define Return: GetLatestTile (e.g., Tile #500)
|
V
[ Mock OPSProvider ] <--- Define Return: GetParamSet(ctx, 500) (e.g., custom ParamSet)
|
V
[ Component Under Test ] --- Calls GetLatestTile() ---> [ Mock ]
| |
|<--- Returns Tile #500 ----------------------------|
|
[ Component Under Test ] --- Calls GetParamSet(500) ---> [ Mock ]
| |
|<--- Returns ParamSet -----------------------------|
|
V
[ Assertions ] <--- Verify component processed ParamSet correctly
```
# Module: /go/redis
The `go/redis` module provides the integration layer between Skia Perf and Google Cloud Memorystore (Redis). Its primary role is to manage the discovery of Redis instances and facilitate data caching to improve the performance of the Perf query UI.
The module bridges two distinct domains: the management of Google Cloud Platform (GCP) resources and the application-level interaction with Redis data structures.
### Design and Implementation Choices
The module is designed around the `RedisWrapper` interface, which abstracts the complexity of GCP infrastructure management. This abstraction allows for clean separation between the logic that locates a database and the logic that consumes it, while also enabling the automated mocking found in the `mocks` sub-module.
Key design decisions include:
- **Dynamic Instance Discovery**: Rather than relying on hardcoded IP addresses or brittle DNS entries, the module uses the GCP Cloud Redis API to list and identify instances. This allows the system to be resilient to infrastructure changes, such as migrating instances or updating service endpoints in a specific project/zone.
- **Asynchronous Refresh Cycle**: The implementation utilizes a background goroutine (`StartRefreshRoutine`) to periodically poll the state of Redis. This ensures that the application has up-to-date metadata about the target Redis instance without blocking the main execution path.
- **Thread-Safe Access**: The `RedisClient` uses a `sync.Mutex` during cache updates. This prevents race conditions when multiple refresh cycles or concurrent operations attempt to modify the client's internal state or interaction logic simultaneously.
### Key Components
#### RedisClient
The `RedisClient` is the primary implementation of the `RedisWrapper`. It acts as a coordinator between three major dependencies: the GCP Cloud Redis client (for infrastructure), the `TraceStore` (the source of data), and the Redis data client (for caching).
- **Lifecycle Management**: Through `StartRefreshRoutine`, the client manages a ticker-based loop. It searches for a specific Redis instance name (provided via configuration) within a target GCP project and zone.
- **Infrastructure Discovery**: The `ListRedisInstances` method handles the pagination and iteration logic required by the GCP API, converting the stream of instances into a usable slice of `redispb.Instance` objects.
- **Cache Maintenance**: The `RefreshCachedQueries` method (and its associated workflows) is responsible for the actual data movement. It establishes a connection to the discovered host and port and performs the necessary Redis commands to update the cache. This ensures the Query UI can retrieve pre-computed results instead of performing expensive lookups on the primary `TraceStore`.
### Key Workflows
#### Redis Discovery and Cache Refresh
The following diagram illustrates how the module moves from a configuration state to an active cached state:
```
[ StartRefreshRoutine ]
|
| (Every refreshPeriod)
v
[ ListRedisInstances ] <---- Queries GCP Cloud Redis API
|
| Filters by config.Instance Name
v
[ Target Instance Found? ] -- No --> [ Log Error/Wait ]
|
Yes (Extract Host/Port)
v
[ RefreshCachedQueries ]
|
| 1. Create redis.NewClient(Host:Port)
| 2. Lock Mutex
| 3. Update cache keys (e.g., "FullPS")
| 4. Unlock Mutex
v
[ Cache Updated ]
```
This workflow ensures that even if the Redis instance is recreated or its internal IP changes, the Perf system will automatically rediscover the new endpoint and resume caching operations without requiring a manual restart.
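The "filter by instance name, then extract host and port" step might look like the sketch below. The local `instance` struct stands in for `redispb.Instance`, and matching on the last segment of a full resource name is an assumption about how names are compared, not a description of the real code.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// instance is a simplified stand-in for redispb.Instance.
type instance struct {
	Name string // assumed form: projects/<p>/locations/<l>/instances/<id>
	Host string
	Port int32
}

var errInstanceNotFound = errors.New("target redis instance not found")

// findAddress returns "host:port" for the instance whose name equals target
// or ends with "/<target>".
func findAddress(instances []instance, target string) (string, error) {
	for _, in := range instances {
		if in.Name == target || strings.HasSuffix(in.Name, "/"+target) {
			return fmt.Sprintf("%s:%d", in.Host, in.Port), nil
		}
	}
	return "", errInstanceNotFound
}

func main() {
	instances := []instance{
		{Name: "projects/p/locations/z/instances/other", Host: "10.0.0.1", Port: 6379},
		{Name: "projects/p/locations/z/instances/perf-cache", Host: "10.0.0.2", Port: 6379},
	}
	addr, err := findAddress(instances, "perf-cache")
	fmt.Println(addr, err) // 10.0.0.2:6379 <nil>
}
```

Because the address is looked up on every cycle, a recreated instance with a new IP is picked up on the next tick.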
# Module: /go/redis/mocks
The `go/redis/mocks` module provides automated mock implementations of the Redis management interfaces used within the Perf system. Its primary purpose is to enable unit testing of components that interact with Google Cloud Redis instances without requiring a live cloud environment or complex integration setups.
### Design and Implementation Choices
The module is built around the `RedisWrapper` mock, which is generated using `mockery`. The decision to use generated mocks rather than manual stubs ensures that the testing layer stays in sync with the actual `RedisWrapper` interface used in production.
The implementation utilizes the `testify/mock` framework. This allows developers to:
1. **Define expected behaviors**: Specify exactly how many times a method should be called and with what arguments.
2. **Control return values**: Simulate both successful API responses (such as lists of `redispb.Instance` objects) and error conditions (such as context timeouts or API failures).
3. **Validate assertions**: Automatically verify that the code under test interacted with the Redis management layer as expected during the test cleanup phase.
### Key Components
#### RedisWrapper.go
This file contains the `RedisWrapper` struct, which mocks the interface responsible for high-level Redis lifecycle and discovery operations. It focuses on two primary responsibilities:
- **Instance Discovery**: Through `ListRedisInstances`, the mock simulates the retrieval of Redis instance metadata from a specific project or region. This is critical for testing logic that needs to dynamically discover Redis endpoints based on cloud configurations.
- **Background Maintenance**: The `StartRefreshRoutine` method mocks the behavior of long-running background processes that handle the periodic refreshing of Redis configurations or connections. In a test environment, this allows callers to verify that the refresh cycle is initiated with the correct duration and configuration parameters (`config.InstanceConfig`) without actually spawning persistent goroutines.
### Mocking Workflow
The typical usage pattern involves initializing the mock within a test and injecting it into the higher-level service that requires a `RedisWrapper`.
```
[ Test Case ]
|
| 1. Initialize NewRedisWrapper(t)
v
[ RedisWrapper Mock ] <------- 2. Setup expectations (On("ListRedisInstances").Return(...))
|
| 3. Inject mock into Perf Component
v
[ Component Under Test ] ----> 4. Calls ListRedisInstances()
|
| 5. Test ends, mock automatically asserts expectations
v
[ Assertions Passed/Failed ]
```
This workflow ensures that components responsible for Perf data storage and caching can be validated in isolation, ensuring that logic governing instance selection and maintenance routines is robust against various infrastructure states.
# Module: /go/regression
# Regression Module
The `regression` module is the core analytical engine of Skia Perf. It is responsible for detecting, refining, and persisting performance regressions (shifts in metric data) across the commit history.
## High-Level Overview
Performance regressions are identified by comparing metric values before and after a specific commit. The module operates by fetching "frames" of data (a window of commits), applying statistical algorithms to identify clusters of traces that show similar shifts, and then triaging these shifts based on user-defined alert configurations.
The system supports two primary detection methodologies:
1. **K-Means Grouping:** Identifies broad shifts affecting many traces by clustering similar performance profiles.
2. **StepFit:** Analyzes individual traces to find the exact point where a value shifted significantly, useful for pinpointing regressions in specific sub-components.
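As a rough illustration of the step-fit idea, the sketch below splits a trace at its midpoint and compares the mean of each half. It is a deliberate simplification: the function name, the absolute threshold, and the status strings are assumptions, and the production algorithm uses more refined statistics.

```go
package main

import "fmt"

// mean returns the arithmetic mean, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// stepFit compares the mean after the midpoint with the mean before it.
// A step up of at least threshold is "High", a step down is "Low".
func stepFit(trace []float64, threshold float64) (status string, step float64) {
	mid := len(trace) / 2
	step = mean(trace[mid:]) - mean(trace[:mid])
	switch {
	case step >= threshold:
		return "High", step
	case step <= -threshold:
		return "Low", step
	}
	return "Uninteresting", step
}

func main() {
	status, step := stepFit([]float64{1, 1, 1, 5, 5, 5}, 2)
	fmt.Println(status, step) // High 4
}
```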
## Key Design Decisions
### Separation of Detection and Refinement
The module splits the lifecycle of a regression into distinct stages:
- **Detection (`detector.go`):** Executes the heavy mathematical lifting (K-Means or StepFit). It is intentionally "greedy," finding all statistical anomalies within the provided data frame.
- **Refinement (`refiner/`):** A post-processing layer that filters the raw detection results. It applies business logic—such as minimum trace thresholds and directionality (UP/DOWN/BOTH)—to ensure only actionable regressions are reported.
### "Domain-Centric" Data Fetching
Instead of scanning every individual commit, the detector uses a "Domain" (a range of commits). It leverages `dfiter.DataFrameIterator` to efficiently slide a window across the data. For each step in the iteration, the target commit is placed at the center of the window (the "midpoint"), allowing the algorithms to compare a stable "before" baseline against an "after" state.
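The sliding window can be pictured with a small sketch. It assumes a window of `2*radius + 1` commits with the target at index `radius`; the `windows` helper is hypothetical and does not reproduce the `dfiter.DataFrameIterator` API.

```go
package main

import "fmt"

// windows slides a fixed-size window across commits. Each window holds
// radius commits before the midpoint, the midpoint itself, and radius after.
func windows(commits []int, radius int) [][]int {
	size := 2*radius + 1
	var out [][]int
	for start := 0; start+size <= len(commits); start++ {
		out = append(out, commits[start:start+size])
	}
	return out
}

func main() {
	commits := []int{100, 101, 102, 103, 104, 105, 106}
	radius := 2
	for _, w := range windows(commits, radius) {
		// w[radius] is the commit being tested for a step.
		fmt.Println("midpoint", w[radius], "window", w)
	}
}
```

With seven commits and a radius of 2, three windows are produced, centered on commits 102, 103, and 104.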
### GroupBy Query Expansion
To prevent small regressions from being "drowned out" by the noise of a large dataset, the module supports `GroupBy`. If an alert is configured to group by a parameter (e.g., `device`), the detector doesn't run one large query. Instead, `allRequestsFromBaseRequest` expands the base request into multiple sub-queries, one for each unique value of that parameter. This ensures high-granularity detection.
### Hybrid Storage Strategy
The `Store` interface supports a transition from legacy to modern schemas by combining two kinds of storage:
- **Relational Indexing:** Metadata (Commit Number, Alert ID, Triage Status) is stored in relational columns for fast searching and filtering.
- **JSON Payloads:** Complex objects like `ClusterSummary` and the `FrameResponse` (the actual data points) are stored as JSON blobs. This provides the flexibility to update detection algorithms without requiring database schema migrations.
## Key Components
### Detection Engine (`detector.go`, `stepfit.go`)
The `ProcessRegressions` function is the entry point for detection. It manages a pool of workers to process multiple alert configurations in parallel.
- **`tooMuchMissingData`**: A critical heuristic that filters out traces with more than 50% missing data on either side of the midpoint. This prevents "gaps" in data from being falsely identified as performance drops.
- **`StepFit`**: Implements individual trace analysis. It looks for a "Turning Point" in the trace and calculates the magnitude of the shift.
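The missing-data heuristic can be sketched as below. The sentinel value and the handling of very short traces are assumptions made for illustration; only the "more than 50% missing on either side of the midpoint" rule comes from the description above.

```go
package main

import "fmt"

// missing is an assumed sentinel marking an absent measurement.
const missing = float32(1e32)

// missingFraction returns the share of sentinel values in xs.
func missingFraction(xs []float32) float64 {
	if len(xs) == 0 {
		return 0
	}
	n := 0
	for _, x := range xs {
		if x == missing {
			n++
		}
	}
	return float64(n) / float64(len(xs))
}

// tooMuchMissingData reports whether more than half of the points before or
// after the midpoint are missing.
func tooMuchMissingData(trace []float32) bool {
	if len(trace) < 3 {
		return false
	}
	mid := len(trace) / 2
	return missingFraction(trace[:mid]) > 0.5 || missingFraction(trace[mid+1:]) > 0.5
}

func main() {
	sparse := []float32{1, missing, missing, 4, 5, 6, 7}
	dense := []float32{1, 2, 3, 4, 5, 6, 7}
	fmt.Println(tooMuchMissingData(sparse), tooMuchMissingData(dense)) // true false
}
```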
### The Regression Model (`regression.go`)
The `Regression` struct tracks the state of a detected anomaly. A single `Regression` object can track both a `High` (regression) and a `Low` (improvement) cluster for the same commit and alert ID. This allows the UI to present a unified view of all shifts occurring at a specific point in time.
### Storage Layers (`sqlregressionstore`, `sqlregression2store`)
The module provides different implementations of the `Store` interface:
- **`sqlregression2store`**: The modern implementation optimized for Spanner. It supports advanced features like `NudgeAndResetAnomalies` (moving a regression to a more accurate commit) and multi-source bug tracking (Manual, Auto-Triage, and Bisection).
- **`migration/`**: Orchestrates the background movement of data from the legacy V1 store to the V2 store, using a transactional "pull and mark" strategy to ensure no data is lost during the transition.
## Key Workflows
### Detection and Reporting Workflow
The following diagram shows how a detection request is transformed into a confirmed regression.
```text
[ RegressionDetectionRequest ]
|
v
[ detector.ProcessRegressions ]
|-- allRequestsFromBaseRequest (Expand GroupBy)
|-- DataFrameIterator (Fetch window of commits)
|
+--> [ Algorithm: KMeans or StepFit ]
| |
| v
| [ ClusterSummaries Generated ]
|
v
[ refiner.Process ]
|-- Validate Midpoint Match
|-- Filter by Direction (UP/DOWN/BOTH)
|-- Filter by Minimum Trace Count
|
v
[ Store.SetHigh / SetLow ]
|-- SQL UPSERT (Atomic Read-Modify-Write)
|-- Persist JSON Cluster & Frame
```
### Continuous Orchestration (`continuous/`)
The `continuous` submodule acts as the driver for the detection engine. It monitors for new data ingestion (via PubSub) or timer triggers, identifies which Alert configs match the new data, and dispatches them to the `detector`. It also handles the logic for deduplicating notifications so users aren't alerted multiple times for the same evolving regression.
# Module: /go/regression/continuous
# Continuous Regression Detection
The `continuous` module is responsible for the background detection of performance regressions in the Skia Perf ecosystem. It acts as an orchestrator that monitors incoming data and configuration changes to identify shifts in performance metrics without requiring manual user intervention.
## High-Level Overview
Continuous regression detection is the "auto-pilot" of Skia Perf. While users can perform ad-hoc analysis in the UI, this module ensures that every commit and every ingested data point is evaluated against predefined **Alerts** (regression detection configurations).
The module supports two primary operational modes:
1. **Polling (Traditional):** Periodically scans the most recent commits (defined by a "radius") across all active alert configurations.
2. **Event-Driven (Modern):** Listens for Google Cloud PubSub events triggered by the ingestion of new files. It identifies which alerts match the incoming trace IDs and runs detection specifically for that new data.
## Key Design Decisions
### Event-Driven vs. Polling
Historically, regression detection was a heavy background process that scanned many commits for all configurations. The implementation of `RunEventDrivenClustering` addresses scalability by moving toward an incremental model. By leveraging PubSub notifications from the ingestion pipeline, the system can pinpoint exactly which alerts need to be re-evaluated, reducing the lag between data ingestion and regression notification.
### Parallelism and Workload Distribution
Regression detection is computationally expensive, involving trace fetching and clustering algorithms. The module employs a hierarchical worker pool strategy to maintain performance:
- **`processAlertConfigsWorkerCount`**: Distributes different Alert configurations across parallel goroutines.
- **`processAlertConfigForTracesWorkerCount`**: Within a single configuration (especially in event-driven mode), individual traces are processed in parallel.
- **Random Shuffling**: In polling mode, alert configurations are shuffled before processing. This ensures that if multiple instances of the service are running, they don't all get stuck processing the same large/slow alert at the same time.
### GroupBy and Query Refinement
To prevent "alert fatigue" and ensure precision, the module dynamically refines queries. If an alert has a `GroupBy` setting (e.g., grouping by `device`), the continuous detector doesn't just run the generic alert query. Instead, it generates specific sub-queries for each group or trace ID found in the incoming data. This allows the system to detect regressions that might be "smothered" by noise in a larger dataset.
## Workflow: Event-Driven Detection
The following diagram illustrates how a new data point travels from ingestion to a potential notification:
```text
[Data Ingestion]
|
v
[PubSub Message] -> Received by buildTraceConfigsMapChannelEventDriven
|
|-- Decode IngestEvent (contains Trace IDs)
|-- Match Trace IDs against all Alert Configs
|
v
[Config/Trace Map] -> Dispatched to Workers
|
|-- Parallel Processing: ProcessAlertConfig
| |-- Fetch Dataframe (commits within radius)
| |-- Run Clustering / Step Fit
|
v
[Regression Found?]
|
|-- YES: reportRegressions
| |-- Store in Regression Store
| |-- Send Notification (Email/Bug/etc.)
| |-- Update existing notifications if direction matches
|
|-- NO: Continue
```
## Key Components
### `Continuous` Struct
The central coordinator that holds references to data stores (`regression.Store`, `shortcut.Store`), the git provider, and the notification system. It maintains the lifecycle of background detection.
### `continuous.go`
Contains the core logic for the detection loops:
- **`reportRegressions`**: Evaluates clustering results. It specifically looks for the "Step Point"—the exact commit where the performance shift occurred—and validates if the magnitude and direction (UP, DOWN, BOTH) meet the Alert's criteria.
- **`updateStoreAndNotification`**: Handles the deduplication of alerts. It checks if a regression for a specific commit and alert ID already exists. If it's new, it triggers the `notifier`; if it exists but has changed, it updates the existing notification.
- **`getQueryWithDefaultsIfNeeded`**: A utility that merges global instance defaults with specific alert queries, ensuring that filters like `stat=value` are applied even if omitted in a specific alert configuration.
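The default-merging behavior of `getQueryWithDefaultsIfNeeded` might be approximated as follows. The `queryWithDefaults` helper, the URL-encoded query format, and the rule that an explicit value always wins are assumptions for this sketch.

```go
package main

import (
	"fmt"
	"net/url"
)

// queryWithDefaults adds each default key to the query only when the query
// does not already constrain that key.
func queryWithDefaults(query string, defaults map[string]string) (string, error) {
	values, err := url.ParseQuery(query)
	if err != nil {
		return "", err
	}
	for k, v := range defaults {
		if _, ok := values[k]; !ok {
			values.Set(k, v)
		}
	}
	return values.Encode(), nil // Encode sorts by key
}

func main() {
	defaults := map[string]string{"stat": "value"}
	q1, _ := queryWithDefaults("benchmark=motion_mark", defaults)
	q2, _ := queryWithDefaults("benchmark=motion_mark&stat=max", defaults)
	fmt.Println(q1) // benchmark=motion_mark&stat=value
	fmt.Println(q2) // benchmark=motion_mark&stat=max
}
```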
### Detection Requests (`ProcessAlertConfig`)
This function transforms an `alerts.Alert` into a `regression.RegressionDetectionRequest`. It calculates the "Domain" (the range of commits to analyze) based on flags and configures the `regression` package to execute the heavy lifting of data fetching and mathematical analysis.
## Configuration and Flags
The behavior of this module is heavily influenced by the `InstanceConfig` and `FrontendFlags`:
- **Radius**: Determines how many commits to look at on either side of a potential regression point to establish a baseline and a new state.
- **EventDrivenRegressionDetection**: A boolean toggle that switches the entire logic from the polling ticker to the PubSub listener.
# Module: /go/regression/migration
This module provides a mechanism for migrating regression data from the legacy `regressions` table schema to the updated `regressions2` table schema. It is designed to facilitate a smooth transition between storage formats while ensuring data integrity and allowing for incremental, background processing.
### Overview
The migration is handled by the `RegressionMigrator` struct, which orchestrates the transfer of data between the legacy `sqlregressionstore` and the modern `sqlregression2store`. The primary motivation for this migration is to move toward a more robust schema that includes additional fields and better indexing, as supported by the `regressions2` table.
The migrator is designed to be run either as a one-off batch process or as a periodic background task that slowly drains the legacy table without impacting system performance.
### Key Components
- **`RegressionMigrator`**: The central coordinator. It holds references to both the legacy and new stores and manages the transactional logic required to ensure that a regression is either fully migrated or not at all.
- **Legacy Store (`sqlregressionstore`)**: Used to identify regressions that haven't been migrated yet (via `GetRegressionsToMigrate`) and to mark them as completed once they are successfully stored in the new schema.
- **New Store (`sqlregression2store`)**: Handles the conversion of legacy regression objects into the new format and persists them to the updated database schema.
### Migration Logic and Design Decisions
The migration process follows a "pull and mark" strategy to allow for resumes and to prevent data loss.
1. **Batching**: The migrator operates in batches (configurable `batchSize`). This prevents memory exhaustion when dealing with large historical datasets and allows the migration to be interleaved with standard production traffic.
2. **Atomicity**: Each regression migration within a batch is wrapped in its own database transaction.
- **Action**: Write to `regressions2` -> Update migration status in `regressions`.
- **Reason**: If a failure occurs during the migration of a specific record, only that record's transaction is rolled back. This ensures that the system doesn't have to re-process an entire batch if only one record fails, and prevents duplicate entries in the new table.
3. **Data Enrichment**: Legacy regression objects often lack fields required by the new schema (beyond `AlertId` and `CommitNumber`). The `sqlregression2store` handles the necessary transformations to ensure the data is compatible with the new format before writing.
4. **Handling Updates**: The system is designed to handle cases where a legacy regression might be updated (e.g., triaged) after its initial migration. The `GetRegressionsToMigrate` logic in the legacy store identifies these "stale" records so they can be re-synced to the new store.
### Key Workflows
#### Periodic Migration Process
The migrator can run a background loop that periodically checks for work.
```text
[ Timer Trigger ]
|
v
[ RunOneMigration ] --------------------------+
| |
v |
[ Fetch Batch from Legacy Store ] | Timeout
| | (1 Minute)
v |
[ For each Regression in Batch ] |
| |
+--> [ Start Transaction ] |
| | |
| v |
| [ Write to New Store ] |
| | |
| v |
| [ Mark Legacy as Migrated ] |
| | |
| v |
| [ Commit Transaction ] |
| |
+--------------------------------------+
```
#### Instantiation
To initialize the migrator, the `New` function sets up the required dependencies, including an `AlertConfigProvider`. This is necessary because the new regression store requires context regarding alerts that the legacy store did not strictly enforce or link in the same manner.
# Module: /go/regression/mocks
The `/go/regression/mocks` module provides a set of autogenerated mock implementations for the regression storage layer in the Perf system. These mocks are built using `mockery` and are based on the `testify` framework, specifically designed to facilitate unit testing of components that interact with regression data without requiring a live database or complex setup.
### High-Level Purpose
The primary component in this module is the `Store` mock. In the production environment, a regression store is responsible for persisting and retrieving performance regression data, handling triage statuses, and linking regressions to bug tracking systems. By providing a mock version of this store, the system allows developers to:
1. **Isolate Business Logic**: Test regression detection and notification workflows independently of the underlying PostgreSQL storage implementation.
2. **Simulate Edge Cases**: Easily trigger specific return values, such as database errors, empty result sets, or complex nested regression structures, to ensure robust error handling.
3. **Verify State Changes**: Assert that specific methods—like `SetBugID` or `TriageHigh`—are called with the expected parameters during a test execution.
### Key Components and Design
#### The Store Mock
The `Store.go` file defines the `Store` struct, which embeds `mock.Mock`. It replicates the interface used by the actual regression storage layer, covering a wide range of operations:
- **Data Retrieval**: Methods like `GetRegression`, `GetByIDs`, and `Range` allow tests to simulate the retrieval of regression clusters based on commit numbers, alert IDs, or time ranges.
- **Triage and State Management**: Methods such as `TriageHigh`, `TriageLow`, `SetHigh`, and `SetLow` enable the simulation of user actions or automated processes that mark regressions as "triaged" or "ignored".
- **Bug Integration**: Functions like `SetBugID` and `GetBugIdsForRegressions` facilitate testing the integration between Perf regressions and external issue trackers.
- **Maintenance Operations**: Methods like `DeleteByCommit` and `NudgeAndResetAnomalies` support testing cleanup and data migration logic.
#### Usage Workflow
When writing a test for a service that consumes the regression store (e.g., an alerting service), the standard workflow involves initializing the mock and defining expected behaviors:
```text
Test Setup phase:
+-------------------------+      +--------------------------+
| 1. Create Mock Store    | ---> | 2. Define Expectations   |
|    (mocks.NewStore)     |      |  (store.On(...).Return)  |
+-------------------------+      +--------------------------+
                                              |
                                              v
Execution phase:                 +--------------------------+
+-------------------------+      | 3. Inject Mock into      |
| 4. Run Business Logic   | <--- |    System Under Test     |
+-------------------------+      +--------------------------+
             |
             v
Verification phase:
+-------------------------+
| 5. Assert Expectations  |
|    (Automatic Cleanup)  |
+-------------------------+
```
### Implementation Decisions
- **Autogeneration**: The module relies on `mockery`. This decision ensures that the mock remains in sync with the actual `Store` interface defined in the `regression` package. If a new method is added to the store interface, regenerating the mock prevents compilation errors in tests.
- **Testify Integration**: By using `github.com/stretchr/testify/mock`, the mocks provide a fluent API for setting up return values and verifying calls.
- **Transaction Support**: The mock includes support for `pgx.Tx` (PostgreSQL transactions) in methods like `DeleteByCommit`, allowing tests to simulate transactional integrity without a real database connection.
# Module: /go/regression/refiner
# Regression Refiner
The `refiner` module provides logic to validate and filter potential performance regressions detected by the Skia Perf system. It acts as a post-processing stage that transforms raw detection responses into confirmed regressions by applying specific criteria defined in alert configurations.
## High-Level Overview
In the Skia Perf pipeline, regression detection identifies clusters of data points that exhibit significant changes in performance. However, not every statistical anomaly constitutes a regression of interest according to a user's specific alerting rules.
The `refiner` module implements the `regression.RegressionRefiner` interface to bridge the gap between "statistical detection" and "actionable alert." It evaluates detection summaries against alert parameters such as the direction of the change (improvement vs. regression) and the magnitude of the impact (number of traces affected).
## Design Decisions
### Step-Fit and Directional Validation
The refiner's primary responsibility is to ensure that a detected cluster aligns with the user's intent. It uses the `stepfit` status (HIGH or LOW) to determine the direction of the performance shift.
- **Directional Filtering**: Users can configure alerts to trigger on "UP", "DOWN", or "BOTH" directions. The refiner maps these preferences to `stepfit` statuses. For example, if an alert is configured only for "DOWN" (typically representing a performance drop in specific metrics), any "HIGH" step-fit clusters are discarded.
- **Commit Alignment**: The refiner ensures that the regression's "StepPoint" (the point of change) aligns exactly with the midpoint of the data frame's header. This provides a sanity check that the regression being processed is actually centered on the commit currently under investigation.
### Threshold Enforcement
To reduce noise from insignificant or flaky data, the refiner enforces a `MinimumNum` threshold. This represents the minimum number of keys (traces) that must be part of a cluster for it to be promoted to a "Confirmed Regression." Clusters failing to meet this count are filtered out of the final summary.
## Key Components
### DefaultRegressionRefiner
Located in `default_regression_refiner.go`, this is the standard implementation of the refinement logic. It processes a slice of `RegressionDetectionResponse` objects and returns a slice of `ConfirmedRegression` objects.
The refinement workflow follows these steps:
1. **Validation**: Checks for nil frames or empty data headers to avoid processing malformed data.
2. **Identification**: Determines the target commit number from the midpoint of the data frame.
3. **Filtering**: Iterates through all detected clusters and retains only those that:
- Match the target commit offset.
- Contain at least the minimum number of traces defined in the `Alert` config.
- Match the directionality (UP/DOWN/BOTH) specified in the `Alert` config.
4. **Reconstruction**: Creates a new, filtered `ClusterSummary` and wraps it in a response if any clusters survived the filtering process.
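The filtering step can be sketched as below. The `cluster` struct, the string-based direction and status values, and the `refine` signature are simplified assumptions; the real refiner works on `RegressionDetectionResponse` and `ClusterSummary` objects.

```go
package main

import "fmt"

// cluster is a simplified stand-in for one detected cluster.
type cluster struct {
	Status          string // "High" (step up) or "Low" (step down)
	StepPointOffset int    // commit where the step was detected
	NumKeys         int    // number of traces in the cluster
}

// directionMatches maps an alert direction to the step-fit statuses it accepts.
func directionMatches(direction, status string) bool {
	switch direction {
	case "BOTH":
		return status == "High" || status == "Low"
	case "UP":
		return status == "High"
	case "DOWN":
		return status == "Low"
	}
	return false
}

// refine keeps clusters centered on the midpoint commit that are large enough
// and point in a direction the alert cares about.
func refine(clusters []cluster, midpoint, minimumNum int, direction string) []cluster {
	var kept []cluster
	for _, c := range clusters {
		if c.StepPointOffset != midpoint || c.NumKeys < minimumNum {
			continue
		}
		if !directionMatches(direction, c.Status) {
			continue
		}
		kept = append(kept, c)
	}
	return kept
}

func main() {
	clusters := []cluster{
		{"High", 5, 10}, // kept
		{"Low", 5, 10},  // wrong direction for an UP alert
		{"High", 4, 10}, // not centered on the midpoint
		{"High", 5, 1},  // too few traces
	}
	fmt.Println(len(refine(clusters, 5, 3, "UP"))) // 1
}
```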
## Key Workflows
### Refinement Logic Flow
```text
Raw Detection Responses
|
v
+-----------------------------+
| Calculate Midpoint Commit | <--- Ensures we are looking at the
| (from DataFrame Header) | correct point in time.
+-----------------------------+
|
v
+-----------------------------+
| Iterate through Clusters |
| |
| 1. Check StepPoint Offset | ---- Fail ----> [ Discard Cluster ]
| 2. Check Min Trace Count | ---- Fail ----> [ Discard Cluster ]
| 3. Check Direction Match | ---- Fail ----> [ Discard Cluster ]
+-----------------------------+
|
| Pass
v
+-----------------------------+
| Build Filtered Summary |
+-----------------------------+
|
v
Confirmed Regressions
```
# Module: /go/regression/regressiontest
# Regression Test Utilities
The `regressiontest` module provides a standardized suite of functional tests for implementations of the `regression.Store` interface. By centralizing these tests, the project ensures that different storage backends (e.g., SQL-based, memory-based, or Datastore) behave consistently and adhere to the expected contract of the Perf regression system.
## Design and Purpose
The primary goal of this module is to enforce a uniform behavior across various regression storage implementations. Instead of duplicating test logic for every new storage driver, developers can import this package and run their implementation against the `SubTests` suite.
This approach ensures that:
- Data integrity is maintained during serialization and deserialization of complex types like `frame.FrameResponse` and `clustering2.ClusterSummary`.
- Edge cases, such as range queries where the start and end points are identical, are handled identically across backends.
- Error conditions, such as triaging a non-existent regression, produce predictable results.
## Key Components
### Test Suite Orchestration
The module exports a `SubTests` map, which associates descriptive names with `SubTestFunction` signatures. This allows implementation-specific test files to iterate over the map and run each test within their own environment (e.g., using a local emulator or a real database instance).
### Core Test Logic
The tests within `regressiontest.go` cover the lifecycle of a regression record:
- **Life Cycle Management**: `SetLowAndTriage` verifies the "happy path" of creating a regression, detecting if it is new versus an update, and updating its triage status.
- **Bulk Operations**: `Write` ensures that multiple regressions can be persisted efficiently in a single operation, while `DeleteByCommit` verifies the cleanup logic.
- **Querying and Navigation**: `Range_Exact` validates boundary conditions for commit-based lookups, and `GetOldestCommit` ensures the store can correctly identify the earliest point in its history, which is critical for background cleanup tasks.
## Key Workflow: Verifying a New Store Implementation
When a new storage backend is implemented for regressions, it follows this interaction pattern with the `regressiontest` module:
```
[ New Store Implementation ]           [ regressiontest Module ]
             |                                     |
             |-- Provides initialized Store ------>|
             |                                     |
             |<---------- Executes SubTests -------|
             |     (SetLow, Range, Write, etc.)    |
             |                                     |
             |-- Returns Success/Failure --------->|
             |                                     |
                      [ Validation Complete ]
```
## Implementation Details
The module relies on several key data structures from the Perf domain:
- **`regression.Store`**: The interface being validated.
- **`types.CommitNumber`**: Used as the primary key for organizing regressions.
- **`frame.FrameResponse` & `clustering2.ClusterSummary`**: These are passed to the store to ensure that implementation-specific serialization (like JSON blobs in a database) correctly preserves the data needed for the UI.
# Module: /go/regression/sqlregression2store
### High-Level Overview
The `sqlregression2store` module provides a Spanner-backed implementation of the `regression.Store` interface. Its primary purpose is to persist and manage performance regressions detected within the Skia Perf system. It handles the storage of statistical metadata, triage states, and the raw data frames that justify a regression's existence.
This module is the "V2" storage layer, designed to be more flexible and descriptive than previous iterations by consolidating alert metadata, subscription links, and multi-source triage information (manual, auto-triage, and auto-bisect) into a unified relational schema.
### Design Decisions and Implementation Choices
#### Algorithm-Aware Storage Logic
A key responsibility of this module is handling the different ways regressions are identified based on the alerting algorithm:
- **K-Means Grouping:** Regressions are treated as evolving entities. As more data arrives for a specific `<commit, alert>` pair, the store updates the existing record with more accurate clustering summaries.
- **StepFit/Individual Grouping:** Regressions are often specific to individual traces. Depending on the `AllowMultipleRegressionsPerAlertId` configuration, the store can either treat the alert as a single entity or allow multiple distinct regression records for the same alert if they involve different traces.
#### Read-Modify-Write Compatibility
To support the transition from older schemas and maintain data integrity during concurrent updates, the store utilizes a `readModifyWriteCompat` pattern.
1. **Transactionality:** It opens a database transaction.
2. **Lookup:** It queries for existing regressions based on the commit number and alert ID.
3. **Callback Logic:** It executes a provided closure to modify the regression object (e.g., setting a "High" or "Low" cluster).
4. **Persistence:** It writes the results back using an `UPSERT` (INSERT ... ON CONFLICT) pattern.
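The shape of this pattern can be shown with an in-memory sketch in which a mutex plays the role of the database transaction. The `memStore` type, the `key` struct, and the string-valued clusters are hypothetical simplifications.

```go
package main

import (
	"fmt"
	"sync"
)

// key identifies a regression by commit number and alert ID.
type key struct {
	Commit  int
	AlertID string
}

// regression holds simplified High and Low cluster summaries.
type regression struct {
	High, Low string
}

// memStore stands in for the SQL table; mu stands in for a transaction.
type memStore struct {
	mu   sync.Mutex
	rows map[key]*regression
}

// readModifyWrite looks up the row, applies update, and writes it back.
// It reports whether the row was newly created.
func (s *memStore) readModifyWrite(k key, update func(*regression)) (isNew bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.rows[k]
	if !ok {
		r = &regression{}
	}
	update(r)
	s.rows[k] = r // upsert
	return !ok
}

func main() {
	s := &memStore{rows: map[key]*regression{}}
	k := key{Commit: 1234, AlertID: "alert-7"}
	first := s.readModifyWrite(k, func(r *regression) { r.High = "cluster-a" })
	second := s.readModifyWrite(k, func(r *regression) { r.Low = "cluster-b" })
	fmt.Println(first, second, *s.rows[k]) // true false {cluster-a cluster-b}
}
```

The second call updates the existing row rather than replacing it, so the `High` and `Low` clusters for the same commit and alert end up on one record.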
#### Semi-Structured Integration
The store relies heavily on JSONB columns (specifically for `frame` and `cluster_summary`). This allows the database to store complex, nested Go structures from the `ui/frame` and `clustering2` packages without requiring a rigid table schema for every statistical detail. This choice prioritizes flexibility in the analysis pipeline over relational normalization for these specific fields.
#### Triage and Bug Aggregation
The module resolves bug associations in `GetBugIdsForRegressions`. Rather than returning a single stored bug ID, it joins across the `AnomalyGroups` and `Culprits` tables to collect bugs from three sources:
- **Manual Triage:** Bugs manually linked by users.
- **Auto-Triage:** Issues automatically reported by the system.
- **Auto-Bisect:** Culprits identified by automated bisection tools.
The store then sorts these bugs by a priority rank (Manual > Auto-Triage > Auto-Bisect) to ensure the most relevant context is presented to the user first.
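The priority ordering described above can be sketched with a stable sort. This is a minimal illustration, not the module's code: the `Bug` and `BugSource` types and the `sortBugsByPriority` name are hypothetical, and only the Manual > Auto-Triage > Auto-Bisect ranking comes from the text.

```go
package main

import (
	"fmt"
	"sort"
)

// BugSource is a hypothetical label for where a bug association came from.
// Lower values rank higher: Manual > AutoTriage > AutoBisect.
type BugSource int

const (
	Manual     BugSource = iota // linked by a user
	AutoTriage                  // reported automatically by the system
	AutoBisect                  // culprit found by automated bisection
)

// Bug is a hypothetical pairing of a bug ID with its source.
type Bug struct {
	ID     string
	Source BugSource
}

// sortBugsByPriority orders bugs by source rank and keeps the original
// relative order of bugs that share a source.
func sortBugsByPriority(bugs []Bug) {
	sort.SliceStable(bugs, func(i, j int) bool {
		return bugs[i].Source < bugs[j].Source
	})
}

func main() {
	bugs := []Bug{
		{ID: "300", Source: AutoBisect},
		{ID: "100", Source: Manual},
		{ID: "200", Source: AutoTriage},
		{ID: "101", Source: Manual},
	}
	sortBugsByPriority(bugs)
	for _, b := range bugs {
		fmt.Println(b.ID)
	}
}
```

A stable sort is used here so that bugs from the same source keep the order in which they were loaded.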
### Key Components
#### SQLRegression2Store
The main struct implementing `regression.Store`. It manages a pool of database connections and maintains a cache of prepared SQL statements generated from templates. It also tracks metrics for high/low regression detections.
#### Statement Templates
Instead of static strings, the module uses Go's `text/template` to build SQL queries. This allows for dynamic column injection based on the `spanner` schema definitions, ensuring that the Go code and SQL schema stay in sync regarding field names.
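As a rough illustration of template-built statements, the sketch below expands a `text/template` with a column list. The column names, the statement text, and the `buildStatement` helper are assumptions made for this example; the real module derives its columns from the schema package.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// columns is a hypothetical column list used only for this example.
var columns = []string{"id", "commit_number", "alert_id", "triage_status"}

// stmtTemplate is an illustrative statement, not one copied from the module.
const stmtTemplate = `SELECT {{ .Columns }} FROM {{ .Table }} WHERE commit_number = $1 AND alert_id = $2`

// buildStatement expands the template with a table name and column list.
func buildStatement(table string, cols []string) (string, error) {
	t, err := template.New("stmt").Parse(stmtTemplate)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	data := struct {
		Columns string
		Table   string
	}{Columns: strings.Join(cols, ", "), Table: table}
	if err := t.Execute(&b, data); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	stmt, err := buildStatement("Regressions2", columns)
	if err != nil {
		panic(err)
	}
	// Prints: SELECT id, commit_number, alert_id, triage_status FROM Regressions2 WHERE commit_number = $1 AND alert_id = $2
	fmt.Println(stmt)
}
```

Only identifiers are injected through the template; values still travel as bound parameters (`$1`, `$2`), which keeps user data out of the statement text.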
#### Regression Lifecycle Management
The store provides specialized methods for the operational lifecycle of an anomaly:
- **Nudging:** `NudgeAndResetAnomalies` allows moving a regression's commit range (e.g., if a developer identifies a more accurate culprit range) while resetting its triage status.
- **Triage Transitions:** Methods like `IgnoreAnomalies` and `ResetAnomalies` provide bulk updates to triage states, transitioning records between `untriaged`, `negative`, and `ignored`.
### Key Workflows
#### Setting a Regression
The workflow for recording a newly detected performance shift:
```text
Detection Logic -> SetHigh/SetLow()
                        |
                        v
                 GetAlertConfig() <------- [Check Algo: KMeans vs StepFit]
                        |
                        v
              readModifyWriteCompat()
                (Transaction Start)
                        |
            +-----------+-----------+
            |                       |
    [Existing Match]         [New Regression]
            |                       |
    Apply UpdateFunc          Initialize UUID
            |                 Populate Medians
            |                 Set PrevCommit
            +-----------+-----------+
                        |
                        v
              writeSingleRegression()
                 (UPSERT into DB)
                        |
               (Transaction Commit)
```
#### Bug Retrieval Workflow
How the store aggregates different sources of truth for a regression:
```text
Request: GetBugIdsForRegressions(ids)
|
v
1. Load Manual Bug IDs (from Regressions2 table)
2. JOIN AnomalyGroups (on regression_id) -> Get Auto-Triage IDs
3. JOIN Culprits (on anomaly_group_id) -> Get Auto-Bisect IDs
|
v
sortBugs(Manual, AutoTriage, AutoBisect)
|
v
Return enriched Regression objects
```
# Module: /go/regression/sqlregression2store/schema
### High-Level Overview
The `schema` package defines the structured SQL representation for performance regressions within the Skia Perf system. It serves as the source of truth for the table described by the `Regression2Schema` struct, which is designed to persist regression data, triage states, and associated metadata.
This module bridges the gap between Go data structures (like `clustering2.ClusterSummary` and `frame.FrameResponse`) and the relational database, ensuring that complex performance analysis results are searchable and durable.
### Design Decisions and Implementation Choices
#### Unified Regression Persistence
Unlike earlier iterations of regression storage, this schema focuses on consolidating all aspects of a regression event—its location in time (commits), its statistical significance (medians), and its operational status (triage, bugs, alerts)—into a single relational structure. This allows for complex querying and reporting without needing to join across high-volume telemetry tables.
#### Semi-Structured Data Storage
The schema uses `JSONB` for the `ClusterSummary` and `Frame` fields. This is a deliberate choice to:
1. **Preserve Context:** The full context of the data frame and clustering results used to identify the regression is stored alongside the record.
2. **Schema Flexibility:** As the internal structures of `clustering2` or `frame` evolve, the database schema does not require a migration, provided the data remains JSON-serializable.
#### Temporal and Categorical Organization
Regressions are tracked via two specific commit points: `CommitNumber` and `PrevCommitNumber`. This allows the system to define the exact range where a performance shift occurred. Additionally, the inclusion of `IsImprovement` (boolean) and `ClusterType` (string) allows the UI and automated tools to quickly filter out noise or focus specifically on regressions vs. improvements.
#### Optimization via Specialized Indexes
The schema defines several composite and single-column indexes to support common query patterns in the Perf UI and alerting pipelines:
- **Point Lookups:** `by_commit_alert` supports checking if an alert has already fired for a specific commit.
- **Time-Series Tracking:** `by_sub_name_creation_time` is optimized for showing the most recent regressions for a specific subscription (e.g., a "Regression Dashboard" view).
- **Revision History:** `by_commit_and_prev_commit` is tailored for the `GetByRevision` workflow, allowing the system to quickly retrieve regressions that fall within specific git ranges.
### Key Components
#### Regression2Schema
The primary struct in `schema.go`. It utilizes Go struct tags to define the DDL (Data Definition Language) for the underlying SQL table.
- **Identity and Origin:** Uses a UUID (`ID`) for global uniqueness and links to the alerting subsystem via `AlertID` and `SubName`.
- **Statistical Metadata:** Stores `MedianBefore` and `MedianAfter` as `REAL` (float32) values. These are critical for calculating the "magnitude" of a regression without re-processing the raw trace data.
- **Triage State:** Contains `TriageStatus`, `TriageMessage`, and `BugID`. These fields represent the human-in-the-loop component of the performance monitoring workflow, tracking whether a regression has been acknowledged or associated with a bug tracker entry.
### Regression Data Flow
The following diagram illustrates how the fields in this schema represent the lifecycle of a detected regression:
```text
Discovery Phase          Analysis Phase           Operational Phase
(Detection Logic)        (Statistical Data)       (User Intervention)
-----------------        ------------------       -------------------
AlertID         ----->   MedianBefore             TriageStatus
CommitNumber    ----->   MedianAfter     ----->   TriageMessage
SubName         ----->   ClusterSummary           BugID
ClusterType     ----->   Frame                    CreationTime
```
# Module: /go/regression/sqlregressionstore
The `sqlregressionstore` module provides a persistent implementation of the `regression.Store` interface using a SQL database backend. It is responsible for storing, retrieving, and updating performance regressions detected by the Perf system.
### Design and Rationale
The storage strategy is built on a hybrid approach: **relational indexing** for metadata and **JSON serialization** for payload.
1. **Identity and Integrity:** Regressions are uniquely identified by a composite primary key consisting of `commit_number` and `alert_id`. This ensures that for any given commit, a specific alert configuration can only produce one regression record, enforcing data integrity at the database level.
2. **Schema Flexibility:** While the identity is relational, the regression details (like cluster summaries and frames) are stored as a JSON blob. This allows the `regression.Regression` Go struct to evolve—adding or removing fields—without requiring expensive and risky database migrations for every change in the detection algorithms.
3. **Concurrency Control:** The store uses a "Read-Modify-Write" pattern wrapped in SQL transactions. This is critical for operations like triaging or updating high/low regression status, as multiple processes might attempt to update the same regression record simultaneously.
### Key Components
#### SQLRegressionStore
Located in `sqlregressionstore.go`, this is the primary struct implementing the `regression.Store` interface. It manages the lifecycle of regression data:
- **Persistence Operations:** It translates high-level Go requests (like `SetHigh`, `SetLow`, or `TriageHigh`) into SQL statements. It handles the mapping between string-based Alert IDs used in the UI/API and the integer-based Alert IDs used in the database.
- **Atomic Updates (`readModifyWrite`):** This internal method ensures that updates to a regression record are atomic. It begins a transaction, locks the row (if the database supports it), deserializes the JSON, applies a callback function to modify the data, and serializes it back to the database.
- **Batch Queries:** The `Range` method allows for efficient retrieval of all regressions across a span of commits, which is a common requirement for rendering the Perf dashboard or generating reports.
#### Migration Support
The module includes specific logic to support data evolution. It tracks a `migrated` status and a `regression_id`. This allows the system to background-migrate records from this "legacy" store to newer iterations of the regression schema (e.g., `regression2`) without downtime.
- `GetRegressionsToMigrate` retrieves batches of unmigrated records.
- `MarkMigrated` updates the record status once it has been successfully moved to the new store.
### Data Workflow
The following diagram illustrates how a regression update (e.g., updating a "high" regression) flows through the store:
```text
[ Caller: SetHigh ]
|
v
[ SQLRegressionStore.readModifyWrite ]
|
|---- 1. BEGIN TRANSACTION
|---- 2. SELECT regression (JSON) FROM Regressions WHERE commit_number AND alert_id
|---- 3. JSON Unmarshal -> regression.Regression (Go Struct)
|---- 4. Execute Callback: Update HighStatus, ClusterSummary, etc.
|---- 5. JSON Marshal -> Updated JSON string
|---- 6. UPDATE Regressions SET regression = $1, migrated = false
|---- 7. COMMIT TRANSACTION
|
v
[ Success/Error ]
```
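The numbered steps in the diagram can be sketched without a database. In the example below an in-memory map stands in for the `Regressions` table and a mutex stands in for the SQL transaction; the `memStore` type, the pared-down `Regression` struct, and its JSON field names are all hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"sync"
)

// key mirrors the composite primary key (commit_number, alert_id).
type key struct {
	CommitNumber int
	AlertID      int
}

// Regression is a pared-down, hypothetical stand-in for regression.Regression.
type Regression struct {
	HighStatus string `json:"high_status"`
	LowStatus  string `json:"low_status"`
}

// memStore stands in for the SQL table. The mutex plays the role of the
// transaction by making read, modify, and write one atomic unit.
type memStore struct {
	mu   sync.Mutex
	rows map[key]string // serialized JSON payload per row
}

// readModifyWrite loads the JSON row, applies cb, and stores the result.
// If mustExist is true a missing row is an error; otherwise cb starts
// from an empty Regression.
func (s *memStore) readModifyWrite(k key, mustExist bool, cb func(*Regression)) error {
	s.mu.Lock()         // 1. begin
	defer s.mu.Unlock() // 7. commit, or abandon on an early return

	r := &Regression{}
	if raw, ok := s.rows[k]; ok { // 2. select
		if err := json.Unmarshal([]byte(raw), r); err != nil { // 3. unmarshal
			return err
		}
	} else if mustExist {
		return errors.New("regression not found")
	}
	cb(r)                     // 4. callback mutates the struct
	b, err := json.Marshal(r) // 5. marshal
	if err != nil {
		return err
	}
	s.rows[k] = string(b) // 6. write back
	return nil
}

func main() {
	s := &memStore{rows: map[key]string{}}
	k := key{CommitNumber: 100, AlertID: 7}
	_ = s.readModifyWrite(k, false, func(r *Regression) { r.HighStatus = "untriaged" })
	_ = s.readModifyWrite(k, true, func(r *Regression) { r.HighStatus = "negative" })
	fmt.Println(s.rows[k])
}
```

Because the callback runs inside the critical section, two concurrent updates to the same row cannot interleave their read and write steps, which is the property the transaction provides in the real store.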
### Implementation Details
- **Dialect Independence:** While the tests often run against Spanner, the use of `pool.Pool` and standard SQL syntax allows the store to be portable across different SQL backends supported by the infra.
- **Metrics:** The store automatically tracks `perf_regression_store_found` counters (partitioned by "high" or "low" direction), providing visibility into the frequency of regression detection and storage activity.
- **Legacy Constraints:** Some methods (like `GetRegressionsBySubName` or `GetByIDs`) are explicitly left unimplemented in this module. Those queries are served by the newer `regression2` store, so this module covers only the established regression workflows while the transition to the newer store is underway.
# Module: /go/regression/sqlregressionstore/schema
The `sqlregressionstore/schema` module defines the relational database structure used to persist regression data within the Perf system. It serves as the formal bridge between the Go-based `regression.Regression` objects and their storage representation in SQL.
### Design and Rationale
The schema is designed around a composite primary key consisting of `commit_number` and `alert_id`. This reflects the operational reality of the Perf system: a regression is uniquely identified by _where_ it happened (the commit) and _why_ it was detected (the specific alert configuration).
By using a composite key instead of a generic auto-incrementing integer, the schema enforces data integrity at the database level, preventing duplicate regression entries for the same alert on the same commit.
### Key Components and Responsibilities
#### RegressionSchema
The primary structure in this module, `RegressionSchema`, defines the columns for the `Regressions` table. Its fields reflect a balance between structured querying and flexible data storage:
- **Relational Indexing (`CommitNumber`, `AlertID`):** These fields are extracted from the regression object to allow the database to perform efficient filtering and joins. Storing the `AlertID` as a first-class column allows the system to quickly retrieve all regressions associated with a specific detection configuration.
- **Serialized Payload (`Regression`):** Instead of normalizing every possible attribute of a regression (which might change as detection algorithms evolve), the bulk of the regression data is stored as a JSON string. This "schemaless-within-schema" approach provides flexibility for future changes to the `regression.Regression` Go struct without requiring database migrations.
- **Migration State (`Migrated`, `RegressionId`):** These fields are specifically included to handle the lifecycle of data evolution. The `Migrated` boolean and the temporary `RegressionId` facilitate the movement of records between different iterations of the schema (e.g., transitioning to a "regression2" table) while ensuring no data is lost or duplicated during the transition.
### Data Workflow
When a regression is detected or updated, the system maps the high-level Go objects into this schema for persistence:
```text
 [ Go Regression Object ]
            |
            | 1. Extract Identity
            v
+------------------------+      2. Serialize Remainder
| commit_number (Key)    | <---------------------------+
| alert_id (Key)         |                             |
| migrated (Status)      |      +----------------------+
| regression (JSON)      | <--- | { "low": ...,        |
+------------------------+      |   "high": ...,       |
            |                   |   "frame": ... }     |
            |                   +----------------------+
            v
[ SQL Persistent Storage ]
```
This architecture ensures that while the database can efficiently index and manage the lifecycle of a regression, the complex details of the detection results remain encapsulated within the JSON blob, maintaining a clean separation between indexing concerns and data representation.
# Module: /go/samplestats
# samplestats
The `samplestats` module provides tools for performing statistical analysis on performance metrics. It is designed to compare two sets of samples (typically "before" and "after" a change) to determine if there is a statistically significant difference between them. This is primarily used within the Perf system to detect regressions or improvements in traces.
## Overview
The core functionality revolves around taking two maps of trace data and producing a structured analysis. The module handles the heavy lifting of statistical testing, outlier detection, and result ordering, allowing callers to focus on high-level performance trends rather than raw data manipulation.
### Design Decisions
- **Non-Parametric vs. Parametric Testing**: The module supports both the Mann-Whitney U test (default) and the Welch's T-test. The Mann-Whitney U test is favored as the default because it is non-parametric; it does not assume a normal distribution of data, making it more robust for varied performance metrics which often contain noise or non-Gaussian distributions.
- **Significance-Driven Results**: By default, the module only reports results where the p-value is below a defined threshold (alpha). This reduces noise for the end user by filtering out fluctuations that are likely due to random chance.
- **Outlier Resilience**: Performance data frequently contains "cold start" anomalies or background noise. The implementation provides an optional Interquartile Range (IQR) rule to prune these outliers before running statistical tests, ensuring the mean and standard deviation reflect the "steady state" of the system.
## Key Components
### Analysis Engine (`analyze.go`)
The `Analyze` function is the primary entry point. It correlates "before" and "after" samples based on their Trace ID. For every pair of samples found, it:
1. Calculates metrics (mean, stddev) for both sets.
2. Executes the configured statistical test (`UTest` or `TTest`).
3. Calculates the `Delta` (percentage change in mean) only if the result is statistically significant ($p < \alpha$).
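Step 3 can be pictured as a small guard around the percentage calculation. The sketch below is an assumption about the shape of that logic: the `delta` function name and the exact percentage formula are illustrative, and `NaN` is used for insignificant results because the sorting section describes `NaN` that way.

```go
package main

import (
	"fmt"
	"math"
)

// delta returns the percentage change in mean from before to after when
// p < alpha, and NaN otherwise. The formula is an assumption for this sketch.
func delta(meanBefore, meanAfter, p, alpha float64) float64 {
	if p >= alpha {
		return math.NaN() // not statistically significant
	}
	return (meanAfter/meanBefore - 1) * 100
}

func main() {
	fmt.Println(delta(10, 15, 0.01, 0.05)) // significant: prints 50
	fmt.Println(delta(10, 15, 0.20, 0.05)) // insignificant: prints NaN
}
```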
### Metrics Calculation (`metrics.go`)
Before analysis, raw values are transformed into `Metrics` objects. This step handles:
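- **IQR Filtering**: If enabled in `Config`, it calculates the 25th and 75th percentiles and discards values that fall more than $1.5 \times IQR$ below the 25th percentile or above the 75th percentile, where $IQR$ is the difference between the two.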
- **Coefficient of Variation**: It calculates the `Percent` field (Standard Deviation / Mean), which helps in understanding the relative volatility of a specific trace.
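The two steps above can be sketched as follows. This is an illustration under stated assumptions: the percentile definition (linear interpolation between closest ranks), the use of the sample standard deviation, and all function names are choices made for the example and may differ from what `metrics.go` does.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (p in [0, 1]) of sorted values
// using linear interpolation between the closest ranks.
func percentile(sorted []float64, p float64) float64 {
	if len(sorted) == 0 {
		return math.NaN()
	}
	pos := p * float64(len(sorted)-1)
	lo := int(math.Floor(pos))
	hi := int(math.Ceil(pos))
	return sorted[lo] + (pos-float64(lo))*(sorted[hi]-sorted[lo])
}

// removeOutliers applies the IQR rule and returns a sorted copy of the
// values that lie within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
func removeOutliers(values []float64) []float64 {
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	q1 := percentile(sorted, 0.25)
	q3 := percentile(sorted, 0.75)
	iqr := q3 - q1
	lower, upper := q1-1.5*iqr, q3+1.5*iqr
	kept := make([]float64, 0, len(sorted))
	for _, v := range sorted {
		if v >= lower && v <= upper {
			kept = append(kept, v)
		}
	}
	return kept
}

// meanStdDev returns the mean and the sample standard deviation.
// With fewer than two values the standard deviation is NaN.
func meanStdDev(values []float64) (float64, float64) {
	n := float64(len(values))
	sum := 0.0
	for _, v := range values {
		sum += v
	}
	mean := sum / n
	ss := 0.0
	for _, v := range values {
		ss += (v - mean) * (v - mean)
	}
	return mean, math.Sqrt(ss / (n - 1))
}

func main() {
	samples := []float64{10, 11, 12, 13, 14, 100} // 100 is a cold-start spike
	kept := removeOutliers(samples)
	mean, sd := meanStdDev(kept)
	fmt.Println(kept, mean, sd/mean) // sd/mean is the coefficient of variation
}
```

For the sample data the quartiles are 11.25 and 13.75, so the fences are 7.5 and 17.5 and only the value 100 is dropped.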
### Sorting and Ordering (`sort.go`)
Since analysis can involve thousands of traces, the module provides a flexible sorting mechanism. Results can be ordered by Trace Name or by the magnitude of the Delta. It specifically handles `NaN` values (representing insignificant changes) by grouping them together during the sort process.
## Workflow
The following diagram illustrates the data flow from raw samples to a sorted analysis result:
```text
[Raw Samples] -> [IQR Filter (Optional)] -> [Statistical Test] -> [Delta Calculation]
      |                     |                       |                 |
(Before/After)      (Remove Outliers)      (Compare P vs Alpha)  (% Change if P < Alpha)
                                                                      |
                                                                      v
[Final Result] <---------- [Sort Results] <--------- [Collection of Rows]
```
## Implementation Details
- **Config**: The `Config` struct allows users to toggle the statistical test type, set the `Alpha` threshold (defaulting to 0.05), and enable outlier removal.
- **Row**: Each analyzed trace is returned as a `Row`, containing the calculated `Delta`, the `P` value, and the underlying `Metrics`. If a test fails (e.g., all samples are identical), the error is captured in the `Note` field rather than crashing the analysis.
- **Dependencies**: The module relies on `go-moremath/stats` for robust implementations of the Mann-Whitney and Welch's T-test algorithms.
# Module: /go/sheriffconfig
### Overview
The `sheriffconfig` module is the management layer for Skia Perf's alerting and subscription system. It provides the mechanism for defining "Sheriff Configurations"—version-controlled rules that dictate how the performance monitoring engine should identify anomalies and which teams should be notified.
This module acts as a bridge between human-readable configuration (stored as code in LUCI Config) and the operational database (Spanner) that drives the Perf alerting engine. It ensures that performance monitoring is "Configuration as Code," allowing teams to manage their alert thresholds, bug-filing metadata, and trace selections through standard code review processes.
### Design Intent: Configuration as Code
The design of `sheriffconfig` shifts the responsibility of alert management from manual UI interactions to automated, versioned workflows.
- **Auditability**: By using LUCI Config as the source of truth, every change to an alert threshold or a subscription's contact list is tracked in Git.
- **Consistency**: The module ensures that identical configurations result in identical system behavior by normalizing inputs (e.g., sorting query parameters) before they reach the database.
- **Scalability**: A single configuration file can define monitoring for multiple Perf instances. The module handles the distribution of these rules to the correct internal services based on instance-specific filters.
### Key Components
#### 1. Schema Definitions (`/proto`)
The module uses Protocol Buffers to define the structure of a `SheriffConfig`. This schema decouples the detection intent (e.g., "watch for a 10% shift in memory usage") from the implementation details of the detection algorithms. It supports a strategy pattern where users can combine different statistical methods (`Step`) with grouping strategies (`Algo`).
#### 2. Validation Engine (`/validate`)
The validation logic acts as a strict gatekeeper. It enforces business rules and structural integrity before any configuration is persisted.
- **Regex Pre-compilation**: Any pattern matching utilizing regular expressions (denoted by `~`) is compiled during validation. This prevents runtime crashes in the detection engine caused by malformed regex in a config file.
- **Rule Constraints**: It enforces specific logic, such as requiring exclusion patterns to be single-keyed, which keeps the backend's query resolution logic predictable and performant.
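The two checks above might look roughly like this. The function names are hypothetical and the real validator likely performs more checks; the sketch only shows compiling `~`-prefixed values up front and rejecting exclusion rules with more than one key. Parsing with `net/url` is a simplification, because URL decoding alters characters such as `+` and `%` that can appear in a regex.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// validatePattern parses a rule such as "bot=~lacros-.*-perf" and compiles
// every value that starts with "~" so malformed regexes fail early.
func validatePattern(rule string) error {
	values, err := url.ParseQuery(rule)
	if err != nil {
		return fmt.Errorf("invalid rule %q: %w", rule, err)
	}
	for key, vals := range values {
		for _, v := range vals {
			if !strings.HasPrefix(v, "~") {
				continue
			}
			if _, err := regexp.Compile(strings.TrimPrefix(v, "~")); err != nil {
				return fmt.Errorf("key %q has an invalid regex: %w", key, err)
			}
		}
	}
	return nil
}

// validateExclude additionally requires the rule to contain exactly one key.
func validateExclude(rule string) error {
	values, err := url.ParseQuery(rule)
	if err != nil {
		return fmt.Errorf("invalid rule %q: %w", rule, err)
	}
	if len(values) != 1 {
		return fmt.Errorf("exclude rule %q must have exactly one key", rule)
	}
	return validatePattern(rule)
}

func main() {
	fmt.Println(validatePattern("bot=~lacros-.*-perf")) // valid regex
	fmt.Println(validatePattern("bot=~lacros-(.*"))     // unbalanced parenthesis
	fmt.Println(validateExclude("bot=a&test=b"))        // two keys
}
```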
#### 3. Synchronization Service (`/service`)
The service layer manages the lifecycle of configuration data. It polls external configuration sources and reconciles them with the internal `SubscriptionStore` and `AlertStore`.
- **Revision Awareness**: To prevent unnecessary database writes, the service compares the revision of the incoming configuration against the existing state. If the revision matches, the processing is skipped.
- **Rule Expansion**: A single `AnomalyConfig` in a proto can expand into multiple `Alert` objects in the database. This allows a user to define a single logical subscription that applies to multiple distinct sets of telemetry traces.
### Configuration Lifecycle Workflow
The following diagram shows the path a configuration takes from a Git repository to the Perf alerting database:
```
[ Git Repository ] -> [ LUCI Config ] -> [ sheriffconfig/service ]
                                                     |
                                                     v
                                        [ sheriffconfig/validate ]
                                       (Check: Names, Regex, Fields)
                                                     |
                                                     v
[ Spanner DB ] <--- (Atomic Transaction) --- [ Transformation ]
      |                               (Build Queries, Map Priorities)
      |
      +--> [ Perf Alerting Engine ]
           (Identify Regressions based on stored Alerts)
```
### Design Decisions
#### URL-Based Trace Selection
Instead of a proprietary query language, the module utilizes standard URL query strings for trace matching.
- **Why**: This allows the system to reuse `net/url` parsers and provides a format that is easily testable and familiar to developers.
- **Implementation**: The `buildQueryFromRules` function in the service layer transforms these human-readable rules into normalized query strings used by the backend database to filter trace data efficiently.
#### Atomic Updates
When a new configuration file is ingested, the service uses a single database transaction to replace alerts.
- **Why**: In a system where an alert is useless without its corresponding subscription (which contains bug-filing info like components and CC lists), partial updates could lead to "orphaned" alerts that trigger but cannot be filed as bugs. The atomic approach ensures the system is always in a consistent state.
#### Instance Filtering
The service is designed to be "instance-aware." A single large configuration file might contain subscriptions for `v8`, `chrome`, and `skia`.
- **How**: Each `sheriffconfigService` instance is configured with an `instance` identifier. It filters the incoming global configuration, only processing and storing the subscriptions that match its assigned instance. This allows for centralized configuration files without leaking cross-project data or overloading specific service instances.
# Module: /go/sheriffconfig/proto
### Overview
The `/go/sheriffconfig/proto` module serves as the foundational definition for the Skia Perf alerting and configuration system. It manages the lifecycle of performance monitoring by defining how users describe "what to watch" and "how to react" when performance changes. This module provides the core data structures that bridge the gap between human-readable configurations and the automated backend engines responsible for anomaly detection and issue tracking.
### Design and Logic
The architecture is built around a centralized configuration model. Instead of hard-coding detection logic or scattering alert settings across various services, this module consolidates the entire "intent" of a performance sheriff into a structured format.
#### Strategy-Based Detection
The implementation favors a strategy pattern for anomaly detection. Rather than defining a single detection path, the system allows sheriffs to combine different statistical methods (defined via `Step`) with different grouping strategies (defined via `Algo`). This decoupling allows the system to scale from simple "threshold exceeded" alerts to complex "cluster-based" analysis where multiple related traces must shift together to trigger an alert.
#### Efficient Trace Selection
The selection logic is designed to handle the vast scale of Skia Perf data. By utilizing a rule-based system for trace selection, the module allows for:
- **Logical Inclusion/Exclusion**: Using a combination of inclusion queries and exclusion filters to prune noise before the detection algorithms run.
- **Key-Value Filtering**: Leveraging the existing Skia trace format to allow sheriffs to target specific bots, benchmarks, or test suites without needing to know the underlying database schema.
### Key Components
#### Data Integrity and Versioning
While the `v1` subdirectory contains the active implementation, the root module acts as the container for these definitions. The use of Protocol Buffers ensures that the configuration is both language-agnostic and forward-compatible. This is critical for Skia Perf, where configuration may be stored in Git or a database for long periods while the backend software evolves.
#### Workflow Orchestration
The module defines the transition from detection to action. The implementation choices here reflect a desire to reduce "alert fatigue":
1. **Detection**: The system identifies a change point based on the `AnomalyConfig`.
2. **Grouping**: Using `group_by` and `Algo` settings, the system determines if multiple anomalies should be consolidated into a single report.
3. **Reporting**: Based on the `Action` defined, the system either silently logs the event, creates a manual triage entry, or triggers an automated bisection.
### System Workflow
The following diagram illustrates how the components defined in these protos interact to process performance data:
```
[ Perf Data ] -> [ Rule Matching ] -> [ Detection Engine ] -> [ Action Dispatcher ]
                         |                     |                        |
               (Uses Match/Exclude)     (Uses Algo/Step)        (Uses Action/CC)
                         |                     |                        |
                         v                     v                        v
                 Identify relevant      Apply statistical         Create Bug or
                   metric traces         analysis window          trigger Bisect
```
### Key Files
- **`v1/`**: This subdirectory contains the versioned definitions of the API. By isolating the versioned protos, the project allows for breaking changes in the configuration schema while maintaining compatibility for existing sheriff configurations.
- **`v1/sheriff_config.proto`**: The definitive source for the data model. It encodes the business logic of how subscriptions, detection rules, and alerting metadata relate to one another.
- **`v1/sheriff_config.pb.go`**: The compiled Go representation of the configuration. This is the primary interface used by the Skia Perf backend to interact with the configuration data.
# Module: /go/sheriffconfig/proto/v1
### Overview
The `go/sheriffconfig/proto/v1` module defines the data structures and serialization format for Skia Perf's anomaly detection and alerting system. It uses Protocol Buffers to specify how performance metrics are selected, how regressions (anomalies) are detected within those metrics, and how the system should respond (e.g., filing bugs or initiating bisections).
This module acts as the contract between the configuration stored in the system and the Perf engine that processes incoming data.
### Design and Data Model
The configuration hierarchy is designed to support multi-tenant monitoring where different teams (Sheriffs) can track specific subsets of performance data with customized detection logic.
#### Configuration Hierarchy
```
SheriffConfig
└── [Subscription]
    └── [AnomalyConfig]
        └── Rules (Metric Selection)
```
- **SheriffConfig**: The root object containing all subscriptions for a specific Skia Perf instance (e.g., "chrome-internal").
- **Subscription**: Represents a logical grouping of interest, typically owned by a specific person or team. It defines _where_ alerts go (Buganizer components, CC lists, labels) and _what_ level of urgency they carry (Priority/Severity).
- **AnomalyConfig**: Defines the mathematical "how" of detection. It specifies the algorithm, sensitivity thresholds, and grouping logic. A single subscription can contain multiple `AnomalyConfig` objects to apply different detection logic to different sets of metrics.
- **Rules**: The filtering mechanism used to select traces from the Skia database.
### Key Components and Implementation Details
#### Metric Selection (Rules)
Traces are selected using a query-string format: `{key1}={value1}&{key2}={value2}`.
- **Matching**: The `match` field uses a wildcard-by-default approach: a key omitted from a match string places no constraint on that key, so any value matches. Values prefixed with `~` are treated as regular expressions (e.g., `bot=~lacros-.*-perf`).
- **Exclusion**: The `exclude` field allows for fine-grained removal of specific noisy traces.
- **Logic**: Multiple match strings are treated as an `OR` operation, while keys within a single string and exclusion rules are treated as `AND` operations.
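The OR/AND semantics above can be sketched against a single trace's key-value pairs. This is an interpretation rather than the engine's implementation: the function names are hypothetical, regex values are assumed to be unanchored, and a trace is assumed to need at least one matching `match` string.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// matchesValue compares one pattern with one trace value. A "~" prefix
// marks the pattern as a regular expression (assumed unanchored here).
func matchesValue(pattern, value string) bool {
	if strings.HasPrefix(pattern, "~") {
		re, err := regexp.Compile(strings.TrimPrefix(pattern, "~"))
		return err == nil && re.MatchString(value)
	}
	return pattern == value
}

// matchesRule reports whether every key in the rule matches the trace (AND).
// Keys absent from the rule are unconstrained.
func matchesRule(rule string, trace map[string]string) bool {
	q, err := url.ParseQuery(rule)
	if err != nil {
		return false
	}
	for key, patterns := range q {
		for _, p := range patterns {
			if !matchesValue(p, trace[key]) {
				return false
			}
		}
	}
	return true
}

// selected reports whether a trace is matched by at least one match rule
// (OR) and by none of the exclude rules.
func selected(match, exclude []string, trace map[string]string) bool {
	matched := false
	for _, m := range match {
		if matchesRule(m, trace) {
			matched = true
			break
		}
	}
	if !matched {
		return false
	}
	for _, e := range exclude {
		if matchesRule(e, trace) {
			return false
		}
	}
	return true
}

func main() {
	trace := map[string]string{"bot": "lacros-eve-perf", "benchmark": "speedometer"}
	match := []string{"bot=~lacros-.*-perf&benchmark=speedometer", "bot=mac"}
	fmt.Println(selected(match, nil, trace))                         // true
	fmt.Println(selected(match, []string{"benchmark=speedometer"}, trace)) // false
}
```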
#### Anomaly Detection (AnomalyConfig)
The module defines several strategies for identifying regressions through the `Step` and `Algo` enums:
- **Detection Algorithms (Step)**: Supports various statistical methods including simple magnitude thresholds (`ABSOLUTE_STEP`), percentage-based changes (`PERCENT_STEP`), and advanced statistical tests like `COHEN_STEP` or `MANN_WHITNEY_U`.
- **Clustering (Algo)**: Defines whether traces are analyzed individually (`STEPFIT`) or grouped together using `KMEANS` to identify collective shifts in performance across multiple bots or benchmarks.
- **Execution Parameters**:
  - `radius`: Controls the window of commits analyzed around a potential change point.
  - `direction`: Allows sheriffs to ignore "improvements" (e.g., a speed increase) and only alert on regressions.
  - `group_by`: A powerful field that allows splitting the analysis across specific keys, ensuring that anomalies are only grouped if they share common attributes.
#### Alerting and Actionability
The `Action` enum within `AnomalyConfig` determines the lifecycle of a detected anomaly:
- **NOACTION**: Purely observational; anomalies appear in the UI but trigger no external workflows.
- **TRIAGE**: Automates the creation of Buganizer issues using the metadata defined in the parent `Subscription`.
- **BISECT**: The most advanced tier, which triggers automated bisection to find the specific culprit commit behind a regression.
### Key Files
- **`sheriff_config.proto`**: The primary source of truth defining the messages and enums. It contains extensive documentation on the expected string formats for rules and the behavior of detection enums.
- **`sheriff_config.pb.go`**: The generated Go code providing the structures used by the Perf service to parse and process configurations.
- **`generate.go`**: Contains the `go:generate` directives used to keep the Go code in sync with the protobuf definitions.
# Module: /go/sheriffconfig/service
The `sheriffconfig/service` module acts as the synchronization engine between externalized "Sheriff Configurations" stored in LUCI Config and the internal database used by Skia Perf to track subscriptions and trigger alerts. By treating these configurations as code, the service allows teams to manage anomaly detection rules, bug filing metadata, and ownership via version-controlled repositories.
### Core Responsibility
The primary role of this service is to fetch, validate, transform, and persist configurations. It bridges the gap between the high-level, human-readable Protobuf definitions (SheriffConfigs) and the low-level SQL structures required by the Perf alerting engine.
### Design and Implementation Choices
#### Revision-Based Synchronization
To minimize database churn and ensure consistency, the service uses a revision-checking mechanism. Before processing a subscription, it queries the `subscriptionStore` to see if a subscription with the same name and revision already exists.
- **Why**: This avoids redundant writes and ensures that if a configuration hasn't changed in the source repository, no updates are pushed to the database. It also facilitates a "point-in-time" history where alerts are tied to specific configuration versions.
#### Instance Filtering
A single LUCI Config file may contain subscriptions for multiple Perf instances (e.g., "chrome-internal", "v8", "skia").
- **How**: The `sheriffconfigService` is initialized with a specific `instance` string. During the `processConfig` phase, it discards any subscription defined in the Protobuf that does not match its assigned instance. This allows centralized management of alerts across a project while maintaining instance-specific execution.
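The revision check and the instance filter described above can be combined into one selection pass. In this sketch the `Subscription` fields, the in-memory `memStore`, and the `selectForImport` helper are hypothetical stand-ins for the protobuf message and the real `subscriptionStore`.

```go
package main

import "fmt"

// Subscription is a hypothetical, pared-down view of a configured subscription.
type Subscription struct {
	Name     string
	Instance string
	Revision string
}

// subKey identifies a stored subscription by name and revision.
type subKey struct{ Name, Revision string }

// memStore is an in-memory stand-in for the subscription store.
type memStore struct{ seen map[subKey]bool }

// exists reports whether this name and revision pair is already stored.
func (s *memStore) exists(name, revision string) bool {
	return s.seen[subKey{Name: name, Revision: revision}]
}

// selectForImport keeps subscriptions that belong to this instance and are
// not yet stored at their current revision.
func selectForImport(instance string, store *memStore, subs []Subscription) []Subscription {
	var out []Subscription
	for _, sub := range subs {
		if sub.Instance != instance {
			continue // belongs to another Perf instance
		}
		if store.exists(sub.Name, sub.Revision) {
			continue // unchanged since the last import
		}
		out = append(out, sub)
	}
	return out
}

func main() {
	store := &memStore{seen: map[subKey]bool{{Name: "memory", Revision: "abc"}: true}}
	subs := []Subscription{
		{Name: "memory", Instance: "chrome-internal", Revision: "abc"},  // already stored
		{Name: "startup", Instance: "chrome-internal", Revision: "def"}, // new
		{Name: "jit", Instance: "v8", Revision: "def"},                  // other instance
	}
	for _, sub := range selectForImport("chrome-internal", store, subs) {
		fmt.Println(sub.Name)
	}
}
```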
#### Rule-to-Query Transformation
Sheriff configurations use a rule-based system (`match` and `exclude` lists) to define which telemetry traces an anomaly config should monitor.
- **Implementation**: The `buildQueryFromRules` function transforms these rules into URL-style query strings. It handles exclusion logic by prefixing values with `!`. These queries are then stored in the `Alert` objects, which the Perf engine uses to filter incoming data.
- **Consistency**: Query parts are sorted alphabetically during construction to ensure that identical rules result in identical query strings, preventing duplicate alerts due to key ordering.
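A rough sketch of the transformation may help. It is not the real `buildQueryFromRules`: the `buildQuery` name, its signature, and the assumption that each rule is itself a URL-style pattern string are illustrative only. It shows the two properties named above, the `!` prefix for excluded values and the sort that makes the output order-independent.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// buildQuery is a hypothetical stand-in for buildQueryFromRules. It merges
// one match pattern with zero or more exclude patterns into a single
// URL-style query string.
func buildQuery(match string, excludes []string) (string, error) {
	var parts []string

	matchValues, err := url.ParseQuery(match)
	if err != nil {
		return "", fmt.Errorf("parsing match pattern %q: %w", match, err)
	}
	for key, values := range matchValues {
		for _, v := range values {
			parts = append(parts, key+"="+v)
		}
	}

	for _, exclude := range excludes {
		excludeValues, err := url.ParseQuery(exclude)
		if err != nil {
			return "", fmt.Errorf("parsing exclude pattern %q: %w", exclude, err)
		}
		for key, values := range excludeValues {
			for _, v := range values {
				// Excluded values are marked with a leading "!".
				parts = append(parts, key+"=!"+v)
			}
		}
	}

	// Sorting removes any dependence on map iteration or rule order.
	sort.Strings(parts)
	return strings.Join(parts, "&"), nil
}

func main() {
	q, err := buildQuery("bot=pixel_6&benchmark=speedometer", []string{"test=warmup"})
	if err != nil {
		panic(err)
	}
	fmt.Println(q) // benchmark=speedometer&bot=pixel_6&test=!warmup
}
```

Because the parts are sorted before joining, writing the same rule as `benchmark=speedometer&bot=pixel_6` yields the same string.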
#### Transactional Atomic Updates
When importing a config file, the service wraps the insertion of both `subscriptions` and `alerts` (via `ReplaceAll`) into a single database transaction.
- **Why**: Alerts are functionally dependent on their parent subscriptions. Using a transaction ensures that the system never ends up in a state where a new alert exists without its corresponding subscription metadata (like bug components or priority), which would cause failures during the auto-triage or bug-filing process.
### Key Components and Workflows
#### Configuration Import Lifecycle
The service typically runs as a background routine (`StartImportRoutine`), polling LUCI Config at a defined interval.
```
[ LUCI Config ] --(Fetch Project Configs)--> [ service.ImportSheriffConfig ]
                                                         |
                                            [ validate.ValidateConfig ]
                                                         |
          +----------------------------------------------+----------------------------------------------+
          |                                              |                                              |
[ Filter by Instance ]                       [ Transform to Entities ]                         [ Check Revision ]
  (Drop if mismatch)                         (Map Protos to DB Models)                          (Skip if exists)
          |                                              |                                              |
          +----------------------------------------------+----------------------------------------------+
                                                         |
                                           [ DB Transaction (Spanner) ]
                                                         |-- Insert Subscriptions
                                                         |-- Replace All Alerts
                                                         +---------------------------> [ Success/Commit ]
```
#### Key Files
- **service.go**: Contains the `sheriffconfigService` implementation. It manages the dependency injection of stores (Alert, Subscription) and the LUCI Config API client. It also defines the mapping constants for algorithm types (e.g., `STEPFIT`, `KMEANS`) and action types (e.g., `TRIAGE`, `BISECT`).
- **service_test.go**: Validates the end-to-end import logic using mocks for the database and external APIs. It specifically tests edge cases such as handling multiple instances in one file and ensuring that invalid configurations are rejected before they touch the database.
#### Mapping Logic
The service performs significant data translation to bridge the two domains:
- **Priorities and Severities**: It maps Protobuf-defined priority/severity levels to the integer values expected by the bug-filing system, applying default values (typically `2`) if they are omitted in the configuration.
- **Anomaly Configs**: Each `AnomalyConfig` inside a subscription can generate multiple `Alert` objects—one for each `match` rule provided. This expansion allows a single subscription to monitor several distinct sets of traces with different detection parameters (like `radius` or `threshold`).
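The two mappings above can be sketched in a few lines. Everything here is illustrative: the `anomalyConfig` and `alert` types, the `expand` and `priorityOrDefault` helpers, and the assumption that an omitted priority arrives as `0` are not taken from the real code, and the match rule is copied into the alert verbatim rather than passed through the query builder.

```go
package main

import "fmt"

// defaultPriority mirrors the default of 2 mentioned in the text; the real
// constant name may differ.
const defaultPriority = 2

// anomalyConfig and alert are simplified, hypothetical types.
type anomalyConfig struct {
	Match     []string
	Radius    int
	Threshold float64
}

type alert struct {
	Subscription string
	Query        string
	Radius       int
	Threshold    float64
}

// priorityOrDefault assumes an omitted value shows up as 0.
func priorityOrDefault(p int) int {
	if p == 0 {
		return defaultPriority
	}
	return p
}

// expand emits one alert per match rule, carrying the detection parameters
// of the anomaly config that owns the rule.
func expand(subscription string, configs []anomalyConfig) []alert {
	var alerts []alert
	for _, c := range configs {
		for _, m := range c.Match {
			alerts = append(alerts, alert{
				Subscription: subscription,
				Query:        m,
				Radius:       c.Radius,
				Threshold:    c.Threshold,
			})
		}
	}
	return alerts
}

func main() {
	configs := []anomalyConfig{
		{Match: []string{"bot=pixel_6", "bot=pixel_7"}, Radius: 5, Threshold: 3.0},
		{Match: []string{"benchmark=motion_mark"}, Radius: 10, Threshold: 1.5},
	}
	for _, a := range expand("Example Sheriff", configs) {
		fmt.Printf("%s radius=%d threshold=%.1f\n", a.Query, a.Radius, a.Threshold)
	}
	fmt.Println("priority:", priorityOrDefault(0))
}
```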
# Module: /go/sheriffconfig/validate
# Sheriff Config Validation
The `validate` module provides the logic necessary to ensure the integrity and correctness of Sheriff Configurations used in the Perf tool. It acts as a gatekeeper, verifying that configuration files (typically managed via LUCI Config) adhere to structural and business rules before they are processed by the system.
## High-Level Overview
Sheriff configurations define how anomalies (regressions) are assigned to different teams or "subscriptions." This module takes raw data—usually base64-encoded prototext from an external configuration service—deserializes it into Go protocol buffer objects, and runs a battery of validation checks.
The validation logic is hierarchical, mirroring the structure of the `SheriffConfig` proto:
1. **Global Level**: Checks for overall configuration validity (e.g., uniqueness of subscription names).
2. **Subscription Level**: Ensures required metadata like contact emails, bug components, and instances are present.
3. **Anomaly Config Level**: Validates the rules used to match specific performance traces.
4. **Pattern Level**: Parses and validates the query strings used to identify specific data streams.
## Design Decisions
### URL Query Format for Patterns
The module uses the standard URL query format (e.g., `key1=val1&key2=val2`) to define match and exclude patterns.
- **Why**: This leverages standard library parsing (`net/url.ParseQuery`), providing a familiar and robust syntax for users to define trace filters without requiring a custom DSL parser.
- **Regex Support**: To support flexible matching, values starting with `~` are treated as regular expressions. The validator explicitly compiles these during the validation phase to catch syntax errors early, preventing runtime failures during actual anomaly matching.
### Decoupled Deserialization
The `DeserializeProto` function handles Base64 decoding followed by Prototext unmarshaling, and performs no validation of its own.
- **Why**: The LUCI Config API returns file content as Base64 strings, so production data needs this decoding step. Keeping deserialization separate from validation means that objects created programmatically (useful for testing) can be validated directly, while production data still has a convenient entry point.
## Key Components and Responsibilities
### Configuration Validator (`validate.go`)
This is the core of the module. It implements a top-down validation strategy:
- **`ValidateConfig`**: The entry point for validating a full `SheriffConfig`. It ensures that the config is not empty and that every subscription has a unique name, which is critical for identifying subscriptions in logs and UI.
- **`validateSubscription`**: Ensures that every subscription is actionable. It mandates a `Name`, `ContactEmail`, `BugComponent`, and `Instance`. A subscription without these cannot effectively track or report anomalies.
- **`validateAnomalyConfig`**: Focuses on the rules of the subscription. It requires at least one `Match` pattern, as a configuration that matches nothing is considered a configuration error.
- **`validatePattern`**: The most granular validation step.
  - It ensures match patterns have at least one key-value pair.
  - It enforces a constraint on **Exclude** patterns: they must only contain a single key. This simplifies the exclusion logic elsewhere in the system, preventing overly complex exclusion rules that are hard to reason about.
  - It validates that all explicit values are non-empty.
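The pattern-level checks listed above can be approximated with the standard library alone. This is a sketch, not the module's code: the lowercase `validatePattern` here, its signature, and its error messages are assumptions, and only `url.ParseQuery` and `regexp.Compile` are real APIs.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// validatePattern is an illustrative re-implementation of the checks
// described in the text. singleField is set for exclude patterns.
func validatePattern(pattern string, singleField bool) error {
	values, err := url.ParseQuery(pattern)
	if err != nil {
		return fmt.Errorf("pattern %q is not a valid query string: %w", pattern, err)
	}
	if len(values) == 0 {
		return errors.New("pattern must have at least one key-value pair")
	}
	if singleField && len(values) != 1 {
		return fmt.Errorf("pattern %q must contain exactly one key", pattern)
	}
	for key, vals := range values {
		for _, v := range vals {
			if v == "" {
				return fmt.Errorf("key %q has an empty value", key)
			}
			// A leading "~" marks the rest of the value as a regular
			// expression; compile it now to surface syntax errors early.
			if strings.HasPrefix(v, "~") {
				if _, err := regexp.Compile(v[1:]); err != nil {
					return fmt.Errorf("key %q has an invalid regex: %w", key, err)
				}
			}
		}
	}
	return nil
}

func main() {
	fmt.Println(validatePattern("bot=~^pixel.*&benchmark=motion_mark", false))
	fmt.Println(validatePattern("bot=~[", false))
	fmt.Println(validatePattern("bot=a&test=b", true))
}
```

One caveat of the URL query format is that `url.ParseQuery` decodes `+` as a space, so a regex containing `+` would need to be percent-encoded in a pattern parsed this way.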
### Data Flow Process
The typical workflow for a configuration string being processed by this module is:
```
[ Base64 String ]
        |
        v
[ DeserializeProto ] ---------------------> [ Decode Base64 ]
        |                                           |
        |                                           v
        |                                [ Unmarshal Prototext ]
        v                                           |
[ *SheriffConfig Proto ] <-------------------------/
        |
        v
[ ValidateConfig ]
        |
        +--> [ validateSubscription ]
                     |
                     +--> [ validateAnomalyConfig ]
                                  |
                                  +--> [ validatePattern ] (Match)
                                  |
                                  +--> [ validatePattern ] (Exclude, singleField=true)
```
## Validation Constraints Summary
| Level | Constraint |
| :-------------------- | :------------------------------------------------------------------- |
| **Global** | Subscription names must be unique. |
| **Subscription** | Must contain `Name`, `ContactEmail`, `BugComponent`, and `Instance`. |
| **Anomaly Config** | Must have at least one `Match` pattern. |
| **Pattern (Match)** | Must be a valid URL query string; must have at least one key. |
| **Pattern (Exclude)** | Must have exactly 1 key. |
| **Pattern (Values)** | Values starting with `~` must be valid Go Regex. |
# Module: /go/shortcut
### Overview
The `shortcut` module provides a unified interface and core logic for managing "shortcuts" within the Perf application. A shortcut is a persistent, shareable identifier that represents a collection of performance trace IDs. Instead of passing around large lists of trace keys in URLs or API requests, the system generates a compact hash-based ID that can be used to retrieve the original set of keys.
### Design and Logic
#### Idempotency and Content-Addressable IDs
A fundamental design choice in this module is the use of content-addressable storage. The ID of a shortcut is not a random UUID or an auto-incrementing integer; instead, it is a deterministic hash of the trace keys it contains.
The `IDFromKeys` function implements this logic:
1. **Normalization**: It sorts the trace keys alphabetically. This ensures that two shortcuts containing the same keys in a different order result in the same ID.
2. **Hashing**: It generates an MD5 hash of the sorted keys.
3. **Legacy Compatibility**: The resulting hex string is prefixed with an "X". This prefix is a holdover from previous storage iterations, maintained to ensure that legacy shortcuts remain valid and new shortcuts follow a consistent format.
This approach ensures that identical sets of traces are automatically deduplicated in the underlying storage, as they will always resolve to the same primary key.
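The three steps can be sketched as follows. The `idFromKeys` helper is illustrative rather than the module's `IDFromKeys`; in particular, how the sorted keys are fed into the hash (here, concatenated with no separator) is an assumption, and this version sorts a copy instead of the caller's slice.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"sort"
)

// idFromKeys sketches a content-addressable ID: sort, hash, prefix with "X".
func idFromKeys(keys []string) string {
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]string(nil), keys...)
	sort.Strings(sorted)

	h := md5.New()
	for _, k := range sorted {
		h.Write([]byte(k)) // hash.Hash writes never return an error.
	}
	// The "X" prefix keeps IDs in the legacy format described above.
	return fmt.Sprintf("X%x", h.Sum(nil))
}

func main() {
	a := idFromKeys([]string{",bot=pixel_6,unit=ms,", ",bot=pixel_7,unit=ms,"})
	b := idFromKeys([]string{",bot=pixel_7,unit=ms,", ",bot=pixel_6,unit=ms,"})
	fmt.Println(a == b) // true: input order does not affect the ID
}
```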
#### The Store Interface
The module defines a `Store` interface that abstracts the persistence layer. This allows the application to remain agnostic of whether shortcuts are stored in a relational database, an in-memory cache, or a cloud-native solution.
The interface supports:
- **Dual Ingestion**: Shortcuts can be inserted either as a structured `Shortcut` object (`InsertShortcut`) or directly from an `io.Reader` (`Insert`), which is useful for processing JSON payloads from HTTP requests.
- **Streaming Retrieval**: The `GetAll` method returns a channel of shortcuts. This design decision facilitates large-scale data migrations or maintenance tasks without loading the entire shortcut database into memory, preventing OOM (Out-Of-Memory) errors.
### Key Components
- **shortcut.go**: Defines the core `Shortcut` data structure (a simple wrapper around a slice of strings) and the `Store` interface. It contains the logic for ID generation and normalization.
- **mocks/**: Provides autogenerated mock implementations of the `Store` interface. These are used across the Perf codebase to test components that depend on shortcuts (like the dashboard or alerting systems) without requiring a live database.
- **shortcuttest/**: A shared compliance suite. Any new implementation of the `Store` interface (e.g., for a new database backend) uses this suite to verify it correctly handles edge cases, such as key normalization and asynchronous retrieval.
- **sqlshortcutstore/**: The primary production implementation of the `Store`. It maps the Go interface to a SQL backend (PostgreSQL/Spanner), handling the serialization of trace keys into JSON blobs for efficient storage and retrieval.
### Shortcut Lifecycle Workflow
The following diagram illustrates how data flows through the module from creation to retrieval:
```text
Input Keys shortcut Module Storage Backend
========== =============== ===============
| | |
| 1. Create Shortcut | |
|------------------------> | |
| | 2. Sort Keys & Hash |
| | 3. Generate ID ("X...") |
| | |
| | 4. Persist (ID, Keys) |
| |----------------------------> |
| <------- Return ID ------| |
| | |
| | |
| 5. Get(ID) | |
|------------------------> | |
| | 6. Fetch by ID |
| |----------------------------> |
| <---- Return Keys -------| |
```
### Usage Context
This module is typically used by the Perf frontend when a user wants to "pin" a specific view of traces or share a link to a complex query. The frontend sends the list of trace IDs to the backend, which uses this module to generate and store the shortcut, returning a short ID that is then embedded in the URL.
# Module: /go/shortcut/mocks
The `go/shortcut/mocks` module provides a set of autogenerated mock implementations for the `shortcut` package, specifically targeting the `Store` interface. These mocks are designed to facilitate unit testing of components that depend on persistent shortcut storage without requiring a live database or complex setup.
### Design and Purpose
The primary motivation for this module is to decouple the business logic of the Perf application from its storage layer during testing. By using mocks, developers can simulate various database behaviors, such as:
- Successful retrieval of a shortcut.
- Handling of non-existent shortcut IDs.
- Simulating database transaction failures or connection errors.
- Verifying that the application logic correctly calls storage methods with the expected parameters (e.g., ensuring a shortcut is inserted before it is used).
The mocks are generated using `mockery` and are based on the `testify/mock` framework. This allows for a declarative style of testing where expectations are set up at the beginning of a test case.
### Key Components
#### Store.go
This file contains the `Store` struct, which implements the `shortcut.Store` interface. It provides mockable versions of the create, read, and delete operations required for shortcut management:
- **Retrieval (`Get`, `GetAll`)**: Allows tests to return predefined shortcut objects or channels. `GetAll` is particularly useful for testing batch processing or migration scripts that iterate over all stored shortcuts.
- **Persistence (`Insert`, `InsertShortcut`)**: Enables testing of how the system handles new shortcut creation. The `Insert` method handles raw `io.Reader` input, while `InsertShortcut` handles structured objects, reflecting the dual ways shortcuts might be ingested.
- **Management (`DeleteShortcut`)**: Supports testing of cleanup routines and transaction handling, as it accepts a `pgx.Tx` parameter to simulate behavior within a database transaction.
### Testing Workflow
A typical testing workflow using this module involves initializing the mock, setting expectations, and then injecting the mock into the consumer service.
```
+-------------------+ +-----------------------+ +-------------------------+
| Unit Test | | Mock Store | | Consumer Service |
+---------+---------+ +-----------+-----------+ +------------+------------+
| | |
| 1. NewStore(t) | |
+---------------------------->| |
| | |
| 2. On("Get").Return(...) | |
+---------------------------->| |
| | |
| 3. Call Method Under Test | |
+-----------------------------|--------------------------->|
| | |
| | 4. Get(ctx, id) |
| |<---------------------------+
| | |
| | 5. Return Mock Data |
| +--------------------------->|
| | |
| 6. AssertExpectations() | |
+---------------------------->| |
```
The `NewStore` function simplifies this process by automatically registering cleanup functions that assert all defined expectations were met before the test finishes, reducing boilerplate code in the test suite.
# Module: /go/shortcut/shortcuttest
# shortcuttest
The `shortcuttest` module provides a standardized compliance suite for validating implementations of the `shortcut.Store` interface. By centralizing test logic, the module ensures that different storage backends (e.g., SQL-based, in-memory, or cloud-native) exhibit consistent behavior regarding data persistence, normalization, and error handling.
## Design Philosophy
The primary goal of `shortcuttest` is to enforce the contract of the `shortcut.Store` interface. A key design decision in the Perf system is that shortcuts—which are collections of keys representing trace sets—should be idempotent and normalized.
The test suite enforces the following behaviors across all implementations:
- **Normalization on Write**: When a shortcut is inserted, the store is expected to normalize the data (specifically sorting the keys). This ensures that identical sets of keys result in predictable retrieval, regardless of the input order.
- **Abstract Storage Validation**: Tests are written to be agnostic of the underlying database schema or storage medium, focusing strictly on the API surface of the `shortcut.Store`.
- **Lifecycle Management**: The suite covers the full lifecycle of a shortcut: insertion, retrieval by ID, bulk retrieval via channels, and deletion.
## Key Components and Workflow
### Test Suite Orchestration
The module exports a `SubTests` map, which associates descriptive names with `SubTestFunction` signatures. This allows developers implementing a new `shortcut.Store` to run the entire suite against their implementation using a standard Go sub-test pattern:
```text
Test Runner (External)
          |
          +---- Loop over SubTests ----+
          |                            |
          v                            v
    [ InsertGet ]                 [ GetAll ]
Verifies ID generation      Validates stream-based
and key normalization.      retrieval via channels.
```
### Core Test Functions
Instead of providing a single monolithic test, the module breaks down requirements into specific functional checks:
- **`InsertGet`**: This function validates both `Insert` (via `io.Reader`) and `Get`. It specifically checks that the `shortcut.Shortcut` retrieved from the store has its `Keys` slice sorted alphabetically, even if the input was unsorted. This ensures that the "Shortcut" concept remains a canonical set of trace keys.
- **`GetAll`**: Validates the asynchronous retrieval pattern used for maintenance or migration tasks. It ensures that the store can correctly stream all existing shortcuts into a channel.
- **`DeleteShortcut`**: Confirms that the store correctly handles the removal of data and that subsequent `Get` calls reflect the deletion.
- **`GetNonExistent`**: Ensures that the store returns an error (rather than crashing or returning an empty object) when queried with a missing or invalid ID.
## Implementation Details
The module relies on the `testify` library to provide clear assertions. Because it is a testing utility, it resides in its own package to avoid introducing testing dependencies (like `testify`) into the production `shortcut` package.
When implementing a new store, the developer typically creates a test in their local package that spins up the required infrastructure (like a local SQL instance), creates the store instance, and passes it to the functions defined in `shortcuttest`.
# Module: /go/shortcut/sqlshortcutstore
### Overview
The `sqlshortcutstore` module provides a production-grade implementation of the `shortcut.Store` interface using a SQL backend (compatible with PostgreSQL and Spanner). It facilitates the persistence, retrieval, and management of "shortcuts"—compact, shareable identifiers that represent collections of performance trace IDs.
This module acts as the concrete bridge between the high-level Perf shortcut logic and the underlying relational database, ensuring that complex query definitions can be saved and referenced by a simple hash-based key.
### Design Decisions and Implementation
#### Content-Addressable Storage
The store utilizes a content-addressable approach for shortcut IDs. When a shortcut is inserted, the ID is generated based on the hash of the trace keys it contains (via `shortcut.IDFromKeys`).
- **Why**: This design naturally handles deduplication. If two users create a shortcut for the exact same set of trace IDs, they will receive the same ID, and the database will perform a "no-op" on conflict rather than creating redundant rows.
#### JSON Serialization
While the backend is a SQL database, the trace IDs themselves are stored as a single JSON-encoded string in a `TEXT` column.
- **How**: Before execution of the `INSERT` statement, the `shortcut.Shortcut` Go struct is marshaled into a JSON blob. Upon retrieval, this blob is unmarshaled back into the struct.
- **Decision Rationale**: Storing the list of IDs as an opaque blob avoids the overhead of managing a separate many-to-many relationship table. Since the application always consumes the shortcut as a complete list, fetching a single row with a JSON blob is significantly more performant than performing multiple joins or row lookups for potentially thousands of trace IDs.
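A minimal sketch of the round trip, assuming the `Shortcut` struct carries a `Keys` slice with a `keys` JSON tag (the tag name is an assumption), looks like this. The `encode` and `decode` helpers are hypothetical; the real store performs these steps around its `INSERT` and `SELECT` statements.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Shortcut is a local stand-in for shortcut.Shortcut.
type Shortcut struct {
	Keys []string `json:"keys"`
}

// encode produces the string that would be written to the TEXT column.
func encode(s *Shortcut) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// decode turns the stored blob back into a struct.
func decode(blob string) (*Shortcut, error) {
	var s Shortcut
	if err := json.Unmarshal([]byte(blob), &s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	blob, err := encode(&Shortcut{Keys: []string{",bot=pixel_6,", ",bot=pixel_7,"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(blob)

	s, err := decode(blob)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(s.Keys))
}
```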
#### Streaming Retrieval
The `GetAll` method returns a Go channel (`<-chan *shortcut.Shortcut`) rather than a slice.
- **Why**: Given that the number of shortcuts in a system can grow quite large, loading all shortcuts into memory at once could lead to memory exhaustion. The streaming approach allows the caller to process shortcuts one by one as they are read from the database cursor.
### Key Components and Responsibilities
#### `SQLShortcutStore`
Located in `sqlshortcutstore.go`, this is the primary struct implementing the storage logic. It encapsulates a `pool.Pool` to communicate with the database.
- **Input Validation**: Before persisting a shortcut, the store validates that the trace keys within it conform to the expected `query` format. This prevents malformed data from polluting the database.
- **Transaction Support**: The `DeleteShortcut` method optionally accepts a `pgx.Tx` (transaction) object. This allows deletion operations to be part of a larger atomic unit of work, which is useful when cleaning up related resources.
#### SQL Statement Management
The module uses a central `statements` map to define its SQL queries. This separates the SQL syntax from the Go logic, making the code easier to maintain and ensuring that queries like `ON CONFLICT (id) DO NOTHING` are handled consistently.
### Data Workflow
The following diagram demonstrates the lifecycle of a shortcut being stored and retrieved:
```text
Application Code SQLShortcutStore SQL Database
================ ================ ============
| | |
1. Insert(Reader) ------> [ Decode JSON ] |
| [ Validate Keys] |
| [ Generate ID ] |
| [ Encode JSON ] |
| |--- INSERT (id, blob) ------> |
| <----- Return ID ------ | (ON CONFLICT DO NOTHING) |
| | |
| | |
2. Get(ID) -------------> [ Query Row ] |
| | <------- SELECT blob ------- |
| [ Decode JSON ] |
| <--- Return Struct ---- | |
```
### Testing and Schema
- **Persistence Schema**: The structural contract for the table is defined in the `schema` sub-module. It defines the `Shortcuts` table with an `id` as the Primary Key and `trace_ids` for the data payload.
- **Integration Testing**: The `sqlshortcutstore_test.go` uses `sqltest` to create an ephemeral database on the Spanner emulator, so the store is exercised against a real SQL engine rather than mocks. It runs the standard suite of shortcut tests defined in `shortcuttest` to ensure interface compliance.
# Module: /go/shortcut/sqlshortcutstore/schema
### Overview
The `schema` module defines the structural contract for persisting performance trace shortcuts in a SQL database. A shortcut in this context is a persistent mapping between a unique identifier and a collection of Trace IDs, allowing users to reference complex sets of performance data via a compact, shareable key.
### Design Decisions and Implementation
The schema is intentionally kept minimal, prioritizing serialization flexibility and retrieval speed over database-level normalization.
#### Key Component: `ShortcutSchema`
The `ShortcutSchema` struct serves as the single source of truth for the database table structure. Its design reflects two primary requirements:
- **Immutable Identification**: The `ID` field is defined as a `TEXT UNIQUE NOT NULL PRIMARY KEY`. This ensures that every shortcut has a permanent, collision-free anchor. The use of a string-based ID (typically a hash) allows the ID itself to be a representation of the content it points to, facilitating deduplication before insertion.
- **JSON-Backed Storage**: The `TraceIDs` field is stored as a `TEXT` column intended to hold a serialized `shortcut.Shortcut` JSON object.
#### Why JSON over a Normalized Table?
The decision to store trace IDs as a serialized JSON blob rather than in a relational junction table (e.g., a many-to-many mapping of `shortcut_id` to `trace_id`) was driven by the access patterns of the Perf system:
1. **Atomicity**: Shortcuts are retrieved and used as a single unit. There is rarely a need to query "which shortcuts contain this specific Trace ID" from the database level; instead, the system always fetches the full list of IDs associated with a specific shortcut key.
2. **Performance**: Reading a single text blob is significantly faster and requires less overhead than performing joins or multiple row lookups for shortcuts that may contain thousands of individual Trace IDs.
3. **Schema Stability**: By treating the `TraceIDs` as an opaque JSON blob at the database layer, the internal structure of the `shortcut.Shortcut` Go struct can evolve without requiring a database migration.
### Data Workflow
The following diagram illustrates how the schema facilitates the lifecycle of a shortcut:
```text
Application Layer Schema Layer (SQL) Database Storage
================= ================== ================
1. Create Shortcut ------> [ ID (Hash) ] ------> INSERT INTO Shortcuts
(List of IDs) [ TraceIDs (JSON)] (id, trace_ids)
|
v
2. Request Shortcut <------ [ ID (Primary Key)] <------ SELECT trace_ids
(via ID) [ TraceIDs (JSON)] WHERE id = ?
|
v
3. Deserialize JSON --------> Result: List of IDs
```
### Key Responsibilities
- **`schema.go`**: Defines the `ShortcutSchema` struct. This file is the authoritative reference for SQL migration tools and ORM-like mappers used elsewhere in the `sqlshortcutstore` parent module. It ensures that the Go representation of a shortcut's persistence layer remains synchronized with the actual SQL table constraints (e.g., `PRIMARY KEY`, `UNIQUE`).
# Module: /go/sql
# Perf SQL Module
The `/go/sql` module is the central authority for the database schema within the Skia Perf application. It implements a "Schema-as-Code" methodology, where Go struct definitions serve as the single source of truth for the underlying Google Cloud Spanner database structure.
## High-Level Overview
In a high-throughput performance monitoring system, database consistency across distributed components (ingesters, frontends, and maintenance tasks) is paramount. This module provides a unified interface for defining, generating, migrating, and testing the database schema.
Instead of manually maintaining DDL (Data Definition Language) files, developers modify Go structs. The module then provides tooling to project these definitions into SQL strings, Go constants for type-safe querying, and serialized JSON files used for environment validation.
## Design Philosophy: Go-First Schema Management
The architecture is built on the principle that the application code should dictate the database structure, not the other way around.
- **Type Safety**: By generating Go constants for table and column names, the module eliminates "stringly-typed" database interactions, catching typos at compile-time rather than runtime.
- **Version Safety (N-1 Compatibility)**: The module supports a "previous" vs. "next" schema strategy. This allows the system to remain operational during rolling deployments where some service instances might be running the old code while others run the new code, provided the database matches one of the two known states.
- **Automated Lifecycle**: The module automates tedious database tasks such as setting up Time-To-Live (TTL) policies for telemetry data while exempting configuration tables (like `Alerts` or `Favorites`) to ensure persistence.
## Key Components
### Schema Definition (`tables.go`)
The `Tables` struct in `tables.go` acts as the master registry. It aggregates schema definitions from various sub-packages across the Perf project (e.g., alerts, regressions, trace stores). This centralized struct is used by reflection-based tools to understand the entire database landscape.
### Schema Generation (`tosql` and `exportschema`)
These sub-modules transform Go code into deployable artifacts:
- **`tosql`**: A CLI tool that parses the Go structs and generates `go/sql/spanner/schema_spanner.go`. This generated file contains the raw SQL DDL strings and Go slices of column names used by the application at runtime.
- **`exportschema`**: A utility that serializes the schema into a standardized `schema.Description` (JSON). This artifact is used for comparing the "expected" state against the "live" state of a production database.
### Evolution and Migration (`expectedschema`)
This component manages the transition of the database over time. It embeds the expected JSON descriptions into the binary and provides the `ValidateAndMigrateNewSchema` logic.
- It handles **Static Migrations** via manual DDL scripts (`FromLiveToNextSpanner`).
- It handles **Dynamic Schema Updates** for the `TraceParams` table. Since the keys in performance data change as new benchmarks are added, this module dynamically adds or drops generated columns and indexes in Spanner to maintain query performance.
## Data Schema Workflow
The following diagram illustrates the lifecycle of a schema change:
```text
[ Developer ]
      |
      v
[ Modify Go Structs ] ----> (e.g., add field to TraceValuesSchema)
      |
      v
[ Run 'tosql' ] ----------> (Updates schema_spanner.go constants)
      |
      v
[ Run 'exportschema' ] ---> (Generates schema_spanner.json)
      |
      v
[ Deployment ] -----------> [ Maintenance Task ]
                                     |
                                     v
                          [ Validate & Migrate ]
                                     |
                +--------------------+--------------------+
                |                    |                    |
          (Match Prev?)        (Match Next?)      (Match Neither?)
                |                    |                    |
        [ Run Migration ]     [ Do Nothing ]       [ Panic/Error ]
                |                    |
                +----------+---------+
                           |
                           v
            [ Update Dynamic TraceParams ]
             (Add/Drop Generated Columns)
```
## Testing and Validation
The `sqltest` sub-module and `sql_test.go` provide the infrastructure for integration testing.
- **Emulator Integration**: Tests run against the Google Cloud Spanner Emulator and PGAdapter, providing a local, high-fidelity PostgreSQL-compatible interface.
- **Isolation**: Each test generates a unique, ephemeral database instance to prevent data contamination during parallel execution.
- **Verification**: The testing suite ensures that the migration path from a "Live" (production) schema to the "Next" (development) schema is valid and results in the exact structure defined by the Go source code.
# Module: /go/sql/expectedschema
# Expected Schema Module
The `expectedschema` module manages the lifecycle and validation of the database schema for Skia Perf. It serves as the authoritative source for what the database structure should look like at any given version of the software, and provides the mechanism to transition the database from a previous state to the current one.
## High-Level Overview
In a distributed system where multiple services (frontend, ingesters, maintenance tasks) share the same database, schema synchronization is critical. This module ensures that:
1. **Deployment Safety**: Services can verify the database schema matches their expectations upon startup, preventing data corruption or runtime crashes due to missing columns or indexes.
2. **Automated Migration**: Schema updates are applied automatically by maintenance tasks during the deployment process.
3. **Dynamic Optimization**: Certain parts of the schema (specifically `traceparams`) are dynamically adjusted based on the actual data flowing through the system to optimize query performance.
## Design Philosophy: "Previous" vs "Next"
The module implements an "N-1" compatibility strategy. It tracks two versions of the schema:
- **`schema_prev_spanner.json`**: The schema as it existed in the previous version of the application.
- **`schema_spanner.json`**: The desired "next" schema for the current version.
This approach is chosen because Perf components are deployed simultaneously, so a frontend or ingester may start before the maintenance task has upgraded the schema. Each service checks the live schema at startup: a match with "prev" (migration still pending) or "next" (migration done) is accepted, and a match with neither causes a panic. This ensures that the system only runs on a known, supported database state.
## Key Components
### Schema Definitions (`embed.go`)
The module uses Go's `embed` package to include JSON representations of the Spanner schema directly into the binary. This makes the schema definition portable and easily accessible for comparison against the live database.
- `Load()`: Retrieves the current expected schema.
- `LoadPrev()`: Retrieves the previous version's schema.
### Migration Logic (`migrate.go`)
This file contains the logic for transitioning the database. It defines two raw SQL strings that must be manually updated by developers whenever a schema change is introduced:
- `FromLiveToNextSpanner`: The DDL commands to apply the new change.
- `FromNextToLiveSpanner`: The DDL commands to revert the change (primarily used for testing and local development).
The `ValidateAndMigrateNewSchema` function performs the core logic:
1. Inspects the live database to get its current description.
2. Calculates the difference between the live schema and the "prev"/"next" definitions.
3. If the live schema matches "prev", it executes the migration to "next".
4. If it already matches "next", it does nothing.
5. If it matches neither, it returns an error, signaling an inconsistent state.
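The decision in steps 3 to 5 can be sketched as follows. This is a minimal illustration, not the module's code: `Description`, `decide`, and `ErrUnknownSchema` are hypothetical names, and the schema is reduced to a map so the example is self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// Description is a hypothetical, simplified stand-in for a schema
// description: "table.column" -> column type.
type Description map[string]string

// ErrUnknownSchema signals that the live schema matches neither version.
var ErrUnknownSchema = errors.New("live schema matches neither prev nor next")

// decide reports whether the migration DDL should be executed.
func decide(live, prev, next Description) (bool, error) {
	switch {
	case reflect.DeepEqual(live, next):
		return false, nil // Already up to date: nothing to do.
	case reflect.DeepEqual(live, prev):
		return true, nil // Known previous state: run the migration.
	default:
		return false, ErrUnknownSchema
	}
}

func main() {
	prev := Description{"alerts.id": "INT"}
	next := Description{"alerts.id": "INT", "alerts.owner": "TEXT"}
	migrate, err := decide(prev, prev, next)
	fmt.Println(migrate, err)
}
```

Checking "next" before "prev" means a release without schema changes (where both descriptions are identical) is treated as a no-op rather than a migration.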
### Dynamic Trace Parameters (`traceparams_schema.go`)
Unlike static tables, the `traceparams` table uses Spanner's generated columns and indexes to optimize filtering. Since the keys in performance data (params) change over time, this module dynamically manages these columns.
`UpdateTraceParamsSchema` performs the following workflow:
1. Identifies the param keys currently in use in the most recent data tiles.
2. Compares these keys against the existing generated columns in the `traceparams` table.
3. Uses a text template (`traceParamsUpdateTemplate`) to generate and execute DDL that adds missing columns/indexes and drops obsolete ones.
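The compare-and-render part of this workflow might look like the sketch below. The helper names (`diffKeys`, `render`) and the template text are assumptions made for illustration; the emitted lines are placeholder comments, not the DDL produced by the real `traceParamsUpdateTemplate`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"text/template"
)

// diffKeys returns the keys to add (in use, but no column yet) and the keys
// to drop (column exists, but the key is no longer in use), both sorted.
func diffKeys(inUse, existing []string) (add, drop []string) {
	have := map[string]bool{}
	for _, k := range existing {
		have[k] = true
	}
	want := map[string]bool{}
	for _, k := range inUse {
		if !want[k] && !have[k] {
			add = append(add, k)
		}
		want[k] = true
	}
	for _, k := range existing {
		if !want[k] {
			drop = append(drop, k)
		}
	}
	sort.Strings(add)
	sort.Strings(drop)
	return add, drop
}

// ddlTemplate is illustrative only: it emits one placeholder line per change.
var ddlTemplate = template.Must(template.New("ddl").Parse(
	`{{range .Add}}-- add generated column and index for param "{{.}}"
{{end}}{{range .Drop}}-- drop index and generated column for param "{{.}}"
{{end}}`))

// render executes the template for the given additions and removals.
func render(add, drop []string) (string, error) {
	var sb strings.Builder
	err := ddlTemplate.Execute(&sb, struct{ Add, Drop []string }{add, drop})
	return sb.String(), err
}

func main() {
	add, drop := diffKeys([]string{"bot", "benchmark"}, []string{"bot", "arch"})
	out, err := render(add, drop)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```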
## Migration Workflow
The following diagram illustrates how the maintenance task synchronizes the database during a deployment:
```text
[ Start Maintenance Task ]
|
v
[ Fetch Live Schema from DB ] <----------+
| |
+---- matches Next? --------> [ Success: No action needed ]
| |
+---- matches Prev? --------> [ Execute FromLiveToNextSpanner ]
| |
+---- matches neither? ------> [ Error: Inconsistent State ]
|
v
[ Update Dynamic TraceParams ]
|
+--> [ Get keys from recent tiles ]
+--> [ Add/Drop Generated Columns ]
+--> [ Add/Drop Indexes ]
|
v
[ Done ]
```
## Implementation Notes
- **Spanner Focus**: While some structures are generic, the current implementation and embedded JSON files are specifically tailored for Google Cloud Spanner.
- **Testing**: `migrate_spanner_test.go` provides a suite to verify that migrations correctly transition a database from the "prev" state to the "next" state and that dynamic column generation works as expected.
# Module: /go/sql/exportschema
### Overview
The `exportschema` module provides a command-line utility designed to bridge the gap between Go-defined database schemas and their serialized representations. In the context of the Perf system, it acts as a generator that translates internal Go struct definitions and Spanner schema configurations into a standardized `schema.Description` format. This serialized output is primarily used for schema verification, migrations, and ensuring consistency across different deployment environments.
### Design Philosophy: Schema as Code
The primary motivation for this module is to make the Go codebase, rather than disparate SQL files, the source of truth for the database schema. By using a Go-based tool to export the schema:
- **Consistency:** It ensures that the actual database structure matches the expectations of the application code.
- **Automation:** The serialization process can be integrated into CI/CD pipelines to detect accidental schema changes.
- **Portability:** By passing different flags to the tool, the system can generate descriptions tailored to specific database backends (e.g., CockroachDB vs. Spanner) while pulling from the same source definitions.
### Implementation Logic
The module is a thin wrapper that orchestrates the extraction of schema metadata. It leverages the generic `exportschema_lib` to perform the actual serialization while providing the Perf-specific schema definitions as inputs.
#### Workflow
The following diagram illustrates how the tool transforms internal Go definitions into an external schema description:
```text
+-----------------------+ +-----------------------+
| perf/go/sql/spanner | | perf/go/sql |
| (Schema Definitions) | | (Table Structs/Tags) |
+-----------+-----------+ +-----------+-----------+
| |
| +----------------+ |
+----->| exportschema |<-----+
| (Main) |
+-------+--------+
|
v
+----------------------------+
| go/sql/schema/exportschema |
| (Serialization Engine) |
+-------------+--------------+
|
v
+----------------------+
| .json / .sql output |
| (schema.Description) |
+----------------------+
```
#### Key Components
- **`main.go`**: This is the entry point. It defines the CLI interface, accepting a `-databaseType` to determine the target dialect and an `-out` path for the resulting file. It explicitly imports `perf/go/sql/spanner` to access the `Schema` object, which contains the specific table layouts and column types required by the Perf application.
- **Integration with `sql.Tables{}`**: The tool passes an empty instance of the Perf SQL tables to the exporter. This allows the reflection-based serialization engine to inspect the struct tags (such as `sql:"..."`) used throughout the Perf module to understand how Go objects map to database columns.
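As a rough illustration of the CLI surface described above, the two flags could be parsed with the standard `flag` package as shown here. Only the flag names `-databaseType` and `-out` come from the text; the `options` type, the `parseArgs` helper, the example values, and the rule that `-out` must be non-empty are assumptions.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// options holds the parsed command-line flags.
type options struct {
	databaseType string
	out          string
}

// parseArgs parses the two flags. The accepted values of -databaseType are
// not listed in this document, so none are validated here.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("exportschema", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // Keep the sketch quiet on parse errors.
	fs.StringVar(&o.databaseType, "databaseType", "", "target database dialect")
	fs.StringVar(&o.out, "out", "", "path of the file to write")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if o.out == "" {
		return o, errors.New("-out is required") // Assumed rule for this sketch.
	}
	return o, nil
}

func main() {
	o, err := parseArgs([]string{"-databaseType", "spanner", "-out", "schema.json"})
	fmt.Println(o.databaseType, o.out, err)
}
```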
### Responsibility
The module is responsible for:
1. **Selection:** Identifying which schema definition (currently hardcoded to `spanner.Schema`) should be exported.
2. **Configuration:** Mapping command-line arguments to the parameters required by the shared infrastructure exporting library.
3. **Output Generation:** Writing the finalized schema description to the filesystem, which is then typically consumed by automated tests or database initialization scripts.
# Module: /go/sql/spanner
# Spanner SQL Schema for Perf
The `go/sql/spanner` module serves as the authoritative source for the Google Cloud Spanner database schema used by the Skia Perf application. It contains the DDL (Data Definition Language) statements required to initialize the database environment and provides Go identifiers that list each table's columns, so that application code can reference table structures consistently instead of hardcoding strings.
## Design and Implementation Choices
### Automated Generation
The primary file, `schema_spanner.go`, is generated by an external tool (`//go/sql/exporter/`). This approach ensures that the Spanner schema remains synchronized with the internal Go structures used across the Perf application. Manual edits to this file are discouraged to prevent drift between the application logic and the database state.
### Large-Scale Performance Data Management
The schema is optimized for the high-volume time-series data typical of performance monitoring.
- **Bit-Reversed Sequences**: Tables like `Alerts` and `SourceFiles` use `bit_reversed_positive` sequences. This is a specific Spanner optimization to prevent hotspots during high-throughput inserts by distributing primary key values across the keyspace.
- **TTL (Time To Live)**: Most tables include a `createdat` column and a `TTL` policy of 1095 days (3 years). This automates data retention and prevents unbounded storage growth for ephemeral performance traces and logs.
- **Trace Storage Strategy**: The schema utilizes several tables to handle multidimensional performance data:
  - `TraceValues` and `TraceValues2`: Store the actual measurement values associated with a specific trace and commit. `TraceValues2` provides more granular dimensions (benchmark, bot, test, subtests) for improved querying.
  - `Postings` and `ParamSets`: Facilitate the "inverted index" style search used in Perf, allowing the system to quickly find traces based on key-value pairs (e.g., finding all traces where `cpu=arm64`).
  - `TraceParams`: Stores the full set of parameters for a trace ID in a `JSONB` column, balancing structured searching with flexible metadata storage.
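To see why bit-reversed sequences avoid insert hotspots, the sketch below reverses the bits of a monotonically increasing counter. It illustrates the idea only and is not Spanner's exact algorithm; `bitReversedPositive` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitReversedPositive reverses the bits of a counter and shifts right by one
// so the result always fits in a non-negative int64.
func bitReversedPositive(counter uint64) int64 {
	return int64(bits.Reverse64(counter) >> 1)
}

func main() {
	// Consecutive counters map to keys that are far apart in the keyspace.
	for c := uint64(1); c <= 4; c++ {
		fmt.Printf("counter=%d key=%d\n", c, bitReversedPositive(c))
	}
}
```

Because the low-order bits of the counter, which change fastest, become the high-order bits of the key, consecutive inserts land in different regions of the keyspace instead of piling onto its end.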
### Anomaly and Regression Tracking
The schema links detected performance regressions to their triage and remediation:
- **Regressions & Regressions2**: Track detected performance changes at specific commits.
- **AnomalyGroups**: Group related regressions together to streamline the triage process.
- **Culprits**: Track specific revisions identified as the cause of regressions, including metadata about the host and project.
## Key Components
### schema_spanner.go
This file contains a single large string constant, `Schema`, which includes the full set of `CREATE TABLE`, `CREATE INDEX`, and `CREATE SEQUENCE` statements. It also exports slice variables (e.g., `var Alerts`, `var Commits`) that list the column names for each table, providing a programmatic way to reference table structures without hardcoding strings in the application logic.
### Primary Data Entities
- **Commits**: The foundation of the timeline, mapping commit numbers to git hashes and timestamps.
- **Alerts & Subscriptions**: Define the configuration for anomaly detection and the notification preferences for different teams.
- **Shortcuts & GraphsShortcuts**: Store persistent links to specific views or sets of traces in the Perf UI.
## Data Workflow: Trace Ingestion and Querying
The schema supports a workflow where incoming performance data is transformed into searchable traces.
```text
Incoming Data File
|
v
[SourceFiles] <------- [Metadata] (Links to external logs)
|
+-----> [TraceValues] (Value at Commit X)
|
+-----> [TraceParams] (The "What": bot=linux, test=draw)
|
v
[Postings] (Inverted index for searching)
[ParamSets] (Summary of available search terms)
```
1. **Ingestion**: A new file is registered in `SourceFiles`.
2. **Storage**: Values are written to `TraceValues` or `TraceValues2`.
3. **Indexing**: The trace's parameters are decomposed into `Postings` and `ParamSets`, enabling the Perf UI to populate search filters and quickly locate relevant `trace_id`s.
4. **Detection**: Analysis services read from these tables and write findings into `Regressions`, `AnomalyGroups`, and `Culprits`.
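The indexing step can be pictured with an in-memory analogue of `Postings` and `ParamSets`. The `Index` type and its methods are hypothetical and exist only to show the inverted-index idea; the real tables live in Spanner and are queried with SQL.

```go
package main

import (
	"fmt"
	"sort"
)

// Index is a hypothetical in-memory analogue of the two tables.
type Index struct {
	postings map[string]map[string]bool // "key=value" -> set of trace IDs
	paramSet map[string]map[string]bool // key -> set of values seen
}

func NewIndex() *Index {
	return &Index{
		postings: map[string]map[string]bool{},
		paramSet: map[string]map[string]bool{},
	}
}

// Add decomposes a trace's params into postings and param-set entries.
func (ix *Index) Add(traceID string, params map[string]string) {
	for k, v := range params {
		kv := k + "=" + v
		if ix.postings[kv] == nil {
			ix.postings[kv] = map[string]bool{}
		}
		ix.postings[kv][traceID] = true
		if ix.paramSet[k] == nil {
			ix.paramSet[k] = map[string]bool{}
		}
		ix.paramSet[k][v] = true
	}
}

// Query returns the sorted IDs of traces matching every key=value in want.
func (ix *Index) Query(want map[string]string) []string {
	counts := map[string]int{}
	for k, v := range want {
		for id := range ix.postings[k+"="+v] {
			counts[id]++
		}
	}
	var ids []string
	for id, n := range counts {
		if n == len(want) {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

// Values lists the sorted values seen for a key, as a query UI would need.
func (ix *Index) Values(key string) []string {
	var vals []string
	for v := range ix.paramSet[key] {
		vals = append(vals, v)
	}
	sort.Strings(vals)
	return vals
}

func main() {
	ix := NewIndex()
	ix.Add("t1", map[string]string{"cpu": "arm64", "test": "draw"})
	ix.Add("t2", map[string]string{"cpu": "x86", "test": "draw"})
	fmt.Println(ix.Query(map[string]string{"cpu": "arm64"}))
	fmt.Println(ix.Values("cpu"))
}
```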
# Module: /go/sql/sqltest
# SQL Test Utility
The `sqltest` module provides standardized utilities for initializing and managing database instances in tests. It is specifically designed to facilitate integration testing against Spanner-compatible PostgreSQL interfaces using local emulators.
## Overview and Design Philosophy
Testing database logic requires a consistent, reproducible, and isolated environment. This module automates the orchestration of ephemeral databases to ensure that tests do not interfere with one another and that they run against a schema identical to production.
The implementation relies on two primary architectural choices:
1. **Emulator-Based Testing**: Rather than requiring a live Cloud Spanner instance, the module utilizes the Google Cloud Spanner Emulator and the PGAdapter. This allows developers to run tests locally or in CI environments without network overhead or cloud costs.
2. **Schema Enforcement**: Tests are executed against a fully initialized schema. The module automatically applies the current production schema (defined in the `spanner` package) before returning a connection, ensuring that the code under test interacts with the expected table structures.
## Key Components and Responsibilities
### Database Lifecycle Management
The primary entry point is `NewSpannerDBForTests`. This function handles the entire lifecycle of a test database:
- **Dependency Verification**: It asserts that the necessary emulator processes (Spanner and PGAdapter) are running. If they are missing, the test fails early.
- **Isolation**: It generates a unique database name using a provided prefix and a random suffix. This isolation is critical for parallel test execution, preventing cross-test data contamination.
- **Schema Migration**: It uses an "eventually" retry logic to apply the SQL schema. This accounts for potential transient delays while the emulator initializes the new database instance.
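The "eventually" retry used for schema application can be approximated with a loop bounded by a context deadline. This is a sketch under assumptions: `eventually` and `simulate` are hypothetical helpers, and the failing function stands in for applying the schema to an emulator that is still starting up.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// eventually calls fn until it succeeds or ctx is done.
func eventually(ctx context.Context, interval time.Duration, fn func() error) error {
	var lastErr error
	for {
		if lastErr = fn(); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("gave up: %w (last error: %v)", ctx.Err(), lastErr)
		case <-time.After(interval):
		}
	}
}

// simulate retries a fake schema apply that fails the first `failures` times
// and reports how many attempts were made.
func simulate(failures, timeoutMs int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	attempts := 0
	err := eventually(ctx, 5*time.Millisecond, func() error {
		attempts++
		if attempts <= failures {
			return errors.New("database not ready")
		}
		return nil // Stand-in for a successful schema apply.
	})
	return attempts, err
}

func main() {
	attempts, err := simulate(2, 10000)
	fmt.Println(attempts, err)
}
```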
### Connection Wrapping and Safety
The module does not return a raw database driver connection. Instead, it returns a `pool.Pool` interface wrapped in a timeout validator:
- **Timeout Enforcement**: By wrapping the `pgxpool` with `timeout.New`, the module ensures that every database operation performed during the test includes a context with a defined timeout. This prevents tests from hanging indefinitely if a deadlock or performance issue occurs in the underlying logic.
- **Interface Abstraction**: By returning the `pool.Pool` interface, it allows the rest of the application to remain agnostic of the underlying driver implementation (PostgreSQL vs. Spanner-via-PGAdapter).
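One way to enforce the timeout requirement is a wrapper that inspects each call's context. The sketch below is not the project's `timeout` package: `Execer`, `requireDeadline`, and `fakePool` are hypothetical, and how the real wrapper reacts to a missing deadline is not stated here, so this version simply returns an error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Execer is a hypothetical, minimal slice of a pool interface.
type Execer interface {
	Exec(ctx context.Context, sql string) error
}

// fakePool stands in for a real connection pool.
type fakePool struct{ calls int }

func (p *fakePool) Exec(ctx context.Context, sql string) error {
	p.calls++
	return nil
}

// ErrNoDeadline is returned when a call arrives without a deadline.
var ErrNoDeadline = errors.New("context has no deadline")

// requireDeadline forwards calls only when the context carries a deadline.
type requireDeadline struct{ inner Execer }

func (r requireDeadline) Exec(ctx context.Context, sql string) error {
	if _, ok := ctx.Deadline(); !ok {
		return ErrNoDeadline
	}
	return r.inner.Exec(ctx, sql)
}

// run issues one statement through the wrapper, with or without a timeout.
func run(withTimeout bool) error {
	ctx := context.Background()
	if withTimeout {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, time.Second)
		defer cancel()
	}
	return requireDeadline{inner: &fakePool{}}.Exec(ctx, "SELECT 1")
}

func main() {
	fmt.Println(run(true))
	fmt.Println(run(false))
}
```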
## Workflow: Test Database Initialization
The following diagram illustrates the sequence of operations when a test requests a new database connection:
```text
Test Invocation
|
v
[ Check Emulators ] ----> (Require Spanner & PGAdapter running)
|
v
[ Generate Name ] ------> (Prefix + Random ID)
|
v
[ Connect Pool ] -------> (Establish connection to PGAdapter)
|
v
[ Apply Schema ] <------- (Loop: Try applying spanner.Schema)
| (until success or 10s timeout)
v
[ Wrap Connection ] ----> (Inject timeout enforcement wrapper)
|
v
Return Pool
```
## Implementation Details
- **`sqltest.go`**: Contains the logic for connecting to the PostgreSQL-compatible endpoint provided by the emulator. It handles the string formatting for connection strings (e.g., `postgresql://root@...`) and manages the integration between the `pgx` library and the project's internal `pool` abstractions.
- **Naming Constraints**: Database names are truncated to 30 characters to comply with emulator and database naming limitations while maintaining enough of the prefix to identify the source test.
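A name generator consistent with the isolation and truncation rules might look like this. The `testDBName` helper and the exact `prefix_number` format are assumptions; only the 30-character limit comes from the text, and the sketch assumes ASCII prefixes.

```go
package main

import (
	"fmt"
	"math/rand"
)

// maxNameLen is the 30-character limit mentioned above.
const maxNameLen = 30

// testDBName joins the prefix with a random number and truncates the result
// to maxNameLen bytes.
func testDBName(prefix string) string {
	name := fmt.Sprintf("%s_%d", prefix, rand.Uint64())
	if len(name) > maxNameLen {
		name = name[:maxNameLen]
	}
	return name
}

func main() {
	fmt.Println(testDBName("regression"))
}
```

In this sketch truncation removes characters from the end, so a very long prefix would crowd out the random suffix; keeping prefixes short preserves uniqueness.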
# Module: /go/sql/tosql
# tosql
The `tosql` module provides a command-line utility designed to maintain a "Go-first" approach to database schema management. It serves as a bridge between high-level Go struct definitions and the concrete SQL schema required by the database engine, specifically targeting Google Cloud Spanner for the Perf application.
## Design Philosophy
The primary design goal of this module is to ensure that Go code remains the single source of truth for the database schema. Rather than manually maintaining `.sql` files and trying to keep Go structs in sync with them, `tosql` automates the generation of SQL schema strings and column constants directly from Go definitions.
This approach offers several advantages:
- **Compile-time Safety**: By generating Go constants for table and column names, the rest of the application can avoid hard-coded strings in queries, reducing the risk of runtime errors due to typos.
- **Documentation and Metadata**: Go structs allow for the use of struct tags and docstrings to define database-specific properties (like TTL or primary keys) in a way that is easily readable by developers.
- **Consistency**: It ensures that the schema deployed to the database perfectly matches the structures the application expects to serialize and deserialize.
## Key Components and Responsibilities
### Schema Generation Logic
The module's entry point is `main.go`. Its responsibility is to orchestrate the conversion process by:
1. Identifying the source Go structs (located in `//perf/go/sql`).
2. Configuring the `exporter` (from `//go/sql/exporter`) to translate Go types and tags into Spanner-compatible SQL dialects.
3. Writing the resulting Go source code—containing the schema string and metadata—to a specific package (e.g., `spanner/schema_spanner.go`).
### Configuration and Policy
The module defines specific transformation policies for the Perf database. A notable implementation choice is the handling of **Time To Live (TTL)**. The generator explicitly excludes certain tables—such as `Alerts`, `Favorites`, `Subscriptions`, and `TraceParams`—from automated TTL policies. This reflects a design decision to treat configuration and user-created entities as permanent, while allowing raw performance data to be eligible for lifecycle management.
## Workflow
The following diagram illustrates how `tosql` fits into the development lifecycle:
```text
[ Go Structs ] --> [ tosql ] --> [ Generated Go Code ] --> [ Application ]
(Source of Truth) | (Schema Strings ) (Type-safe SQL)
| (Column Constants )
v
[ SQL Exporter Logic ]
(Spanner Dialect)
(TTL Exclusions )
```
1. **Define**: A developer modifies a Go struct in the `perf/go/sql` package to add a new column or table.
2. **Generate**: Running the `tosql` tool triggers the `exporter`.
3. **Export**: The tool parses the structs, applies Spanner-specific conversion rules, and injects the resulting SQL into a generated Go file.
4. **Consume**: The Perf application imports the generated package to initialize the database schema or to reference column names in its data access layer.
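The export step can be illustrated with reflection over struct tags. This is a simplified sketch, not the `exporter` package: `AlertRow`, its tag contents, and the `createTable` and `columnNames` helpers are hypothetical, and dialect conversion and TTL handling are omitted.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// AlertRow is a hypothetical table struct; the tag contents are illustrative.
type AlertRow struct {
	ID    int64  `sql:"id INT PRIMARY KEY"`
	Name  string `sql:"name TEXT"`
	Notes string // No sql tag, so this field is not exported as a column.
}

// createTable builds a CREATE TABLE statement from the `sql` struct tags.
func createTable(table string, row interface{}) string {
	t := reflect.TypeOf(row)
	var cols []string
	for i := 0; i < t.NumField(); i++ {
		if tag, ok := t.Field(i).Tag.Lookup("sql"); ok {
			cols = append(cols, "  "+tag)
		}
	}
	return fmt.Sprintf("CREATE TABLE IF NOT EXISTS %s (\n%s\n);",
		table, strings.Join(cols, ",\n"))
}

// columnNames returns the first word of each tag, i.e. the column name.
func columnNames(row interface{}) []string {
	t := reflect.TypeOf(row)
	var names []string
	for i := 0; i < t.NumField(); i++ {
		if f := strings.Fields(t.Field(i).Tag.Get("sql")); len(f) > 0 {
			names = append(names, f[0])
		}
	}
	return names
}

func main() {
	fmt.Println(createTable("Alerts", AlertRow{}))
	fmt.Println(columnNames(AlertRow{}))
}
```

Generating both the DDL and the column list from the same struct is what keeps queries and schema from drifting apart.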
# Module: /go/stepfit
# StepFit
The `stepfit` module provides algorithms for detecting and quantifying "steps" or shifts in time-series data (traces). In the context of performance monitoring, these steps represent regressions (performance degradation) or improvements.
## Overview
The core functionality revolves around taking a slice of telemetry data and determining if a significant change in value occurs at a specific point. The module evaluates these changes using several different statistical and heuristic methods, allowing the caller to choose the best detection strategy for their specific data type (e.g., noisy vs. stable benchmarks).
The primary entry point is `GetStepFitAtMid`, which analyzes a trace centered around a specific index to determine if a step exists at that "turning point."
## Key Concepts
### StepFit Structure
The `StepFit` struct is the result of an analysis. It contains:
- **Status**: Categorizes the step as `HIGH` (step up/potential regression), `LOW` (step down/improvement), or `UNINTERESTING` (no significant change).
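- **Regression**: A calculated score representing the "strength" of the step. Its interpretation varies by algorithm: for most algorithms a larger absolute value indicates a more significant change, while for Mann-Whitney U the value is a p-value, so a smaller value indicates a more significant change.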
- **StepSize**: The raw difference between the means of the two halves of the trace.
- **TurningPoint**: The index in the trace where the step is identified.
### Detection Algorithms
The module supports multiple algorithms defined via `types.StepDetection`:
- **Original Step**: Based on a Least Squares Error (LSE) fit of a step function. It normalizes the trace and calculates a regression score as `StepSize / LSE`. It is effective for identifying clear shifts while accounting for noise.
- **Absolute Step**: A simple comparison of the difference between the mean of the first half and the mean of the second half against an absolute threshold.
- **Percent Step**: Calculates the step size as a percentage of the mean of the first half. This is useful for benchmarks where relative change is more important than absolute magnitude.
- **Cohen's d**: Uses the effect size between two groups. It scales the step size by the pooled standard deviation, making it robust against varying levels of noise in different traces.
- **Mann-Whitney U**: A non-parametric test that assesses whether one group tends to have larger values than the other. Here, the `Regression` value is the p-value of the test, and the `Status` is determined by whether this p-value meets the "interesting" threshold.
- **Const**: A specialized check that looks at a single value at the turning point relative to a threshold, used for specific flagging logic.
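Three of the simpler measures can be sketched directly from their definitions. The code uses `float64` and computes differences as right half minus left half; both choices are assumptions for illustration, and the module's own types and sign convention may differ.

```go
package main

import (
	"fmt"
	"math"
)

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
	var sum float64
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// variance returns the sample variance (n-1 denominator) around m.
func variance(xs []float64, m float64) float64 {
	var ss float64
	for _, x := range xs {
		ss += (x - m) * (x - m)
	}
	return ss / float64(len(xs)-1)
}

// absoluteStep is the raw difference between the means of the two halves.
func absoluteStep(left, right []float64) float64 {
	return mean(right) - mean(left)
}

// percentStep expresses the step as a fraction of the left-half mean.
func percentStep(left, right []float64) float64 {
	return (mean(right) - mean(left)) / mean(left)
}

// cohensD scales the step by the pooled standard deviation of two
// equal-sized halves. Callers must handle the zero-variance case.
func cohensD(left, right []float64) float64 {
	m0, m1 := mean(left), mean(right)
	pooled := math.Sqrt((variance(left, m0) + variance(right, m1)) / 2)
	return (m1 - m0) / pooled
}

func main() {
	left := []float64{10, 11, 9, 10}
	right := []float64{20, 21, 19, 20}
	fmt.Println(absoluteStep(left, right), percentStep(left, right), cohensD(left, right))
}
```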
## Logic Workflow
The following diagram illustrates the general process within `GetStepFitAtMid`:
```text
Input Trace [x0, x1, ..., xN]
|
v
+-----------------------+
| Pre-processing | (Normalization or
| | Length Adjustment)
+-----------+-----------+
|
v
+-----------+-----------+
| Split Trace at Middle | -> [Left Half] | [Right Half]
+-----------+-----------+
|
v
+-----------+-----------+
| Apply Algorithm | (Original, Cohen, U-Test, etc.)
| (Calculate Means, |
| StdDev, or Ranks) |
+-----------+-----------+
|
v
+-----------+-----------+
| Calculate Regression | (Score representing
| and Step Size | change magnitude)
+-----------+-----------+
|
v
+-----------+-----------+
| Determine Status | (Compare Regression
| | to Interesting Threshold)
+-----------+-----------+
|
v
Result: StepFit
```
## Implementation Details
### Data Normalization
For the `OriginalStep` algorithm, the module performs normalization using `vec32.Norm`. This ensures that traces with different scales can be compared using a uniform "interesting" threshold. A `stddevThreshold` is used to prevent division by zero or extreme amplification of noise in very flat traces.
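A normalization with a standard-deviation floor can be sketched as below. This is not `vec32.Norm` itself: the `normalize` helper, the use of population standard deviation, and the rule of skipping the scaling step for traces below the floor are assumptions chosen to match the description above.

```go
package main

import (
	"fmt"
	"math"
)

// normalize returns a copy of trace shifted to mean 0 and, when its standard
// deviation is at least minStdDev, scaled to standard deviation 1.
// minStdDev should be greater than zero.
func normalize(trace []float64, minStdDev float64) []float64 {
	out := make([]float64, len(trace))
	if len(trace) == 0 {
		return out
	}
	var sum float64
	for _, x := range trace {
		sum += x
	}
	mean := sum / float64(len(trace))
	var ss float64
	for i, x := range trace {
		out[i] = x - mean
		ss += out[i] * out[i]
	}
	stddev := math.Sqrt(ss / float64(len(trace)))
	if stddev >= minStdDev {
		for i := range out {
			out[i] /= stddev
		}
	}
	return out
}

// approx reports whether a and b differ by less than 1e-9.
func approx(a, b float64) bool {
	return math.Abs(a-b) < 1e-9
}

func main() {
	fmt.Println(normalize([]float64{1, 1, 3, 3}, 0.1))
	fmt.Println(normalize([]float64{100, 100, 300, 300}, 0.1))
	fmt.Println(normalize([]float64{5, 5, 5.001, 5.001}, 0.1))
}
```

The first two traces differ in scale by a factor of 100 but normalize to the same shape, while the nearly flat third trace is only centered, so its tiny wobble is not blown up into an apparent step.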
### Handling "Interesting" Thresholds
The meaning of the `interesting` parameter passed to `GetStepFitAtMid` depends on the algorithm:
- For `OriginalStep`, `AbsoluteStep`, `CohenStep`, and `PercentStep`, a higher `interesting` value makes the detector **less** sensitive (requires a larger shift).
- For `MannWhitneyU`, where the regression score is a p-value, a **lower** `interesting` value (e.g., 0.05) makes the detector **less** sensitive (requires higher statistical confidence).
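The two threshold behaviours can be contrasted in a small classifier. The function names, the inclusive comparisons, and the convention that a positive value means the trace stepped up are assumptions of this sketch; it also assumes `interesting` is greater than zero.

```go
package main

import "fmt"

// Status mirrors the three outcomes described for StepFit.
type Status string

const (
	High          Status = "HIGH"
	Low           Status = "LOW"
	Uninteresting Status = "UNINTERESTING"
)

// direction picks HIGH or LOW from the sign of v. In this sketch a positive
// value means the trace stepped up.
func direction(v float64) Status {
	if v > 0 {
		return High
	}
	return Low
}

// classifyScore handles score-based algorithms: the magnitude of the score
// must reach the threshold, so a higher threshold flags fewer steps.
func classifyScore(score, interesting float64) Status {
	if score >= interesting || score <= -interesting {
		return direction(score)
	}
	return Uninteresting
}

// classifyPValue handles p-value algorithms: the p-value must fall at or
// below the threshold, so a lower threshold flags fewer steps.
func classifyPValue(p, stepSize, interesting float64) Status {
	if p <= interesting && stepSize != 0 {
		return direction(stepSize)
	}
	return Uninteresting
}

func main() {
	fmt.Println(classifyScore(5, 3), classifyScore(2, 3))
	fmt.Println(classifyPValue(0.01, 1.5, 0.05), classifyPValue(0.2, 1.5, 0.05))
}
```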
### Trace Length
The module requires a minimum trace size (defined as 3). For most algorithms, it expects the trace provided to be a window around a specific point. For every algorithm other than `OriginalStep`, the module drops the last element of the trace so that the remaining 2\*N points split into two equal halves at the midpoint.
# Module: /go/subscription
### High-Level Overview
The `subscription` module provides the data management layer for "Subscriptions" within the Skia Perf ecosystem. In this context, a Subscription is a configuration object that defines how the system should react when a performance anomaly is detected. It acts as a bridge between the detection of a regression and the filing of an actionable bug report, containing metadata such as target bug components, priority levels, and point-of-contact information.
This module defines the standard `Store` interface for persisting these configurations and provides the underlying Protocol Buffer definitions that ensure consistency across the backend services.
### Design Decisions and Implementation
#### Versioning and Immutability
A core design principle of the subscription system is **revision-based tracking**. Subscriptions are not simply overwritten; they are versioned by a combination of their `name` and a `revision` (typically a Git hash or unique identifier from the configuration source).
- **Auditability:** By treating configurations as immutable records identified by a name/revision pair, the system can provide a full history of how alerting rules for a specific test or component have evolved.
- **Atomic Updates:** The storage implementations (specifically the SQL-based ones) follow a pattern of deactivating old records and inserting new ones within a single transaction. This ensures that the detection engine always sees a consistent, "active" snapshot of all subscriptions at any given time.
#### Separation of Concerns
The module is structured to decouple the _schema_ of a subscription from its _persistence_ and its _testing_:
- **Schema (Proto):** Defines the data model (labels, components, hotlists) needed to integrate with external issue trackers like Buganizer.
- **Persistence (Store):** Provides an interface that allows the system to switch between different database backends (e.g., PostgreSQL or Spanner) without changing the business logic that handles regressions.
- **Mocks:** Provides high-fidelity mock implementations to allow other Perf modules to test their alerting logic without interacting with a database.
### Key Components
#### Store Interface (`store.go`)
The `Store` interface is the primary contract for subscription data access. It supports two main modes of operation:
- **Current-State Access:** Methods like `GetActiveSubscription` and `GetAllActiveSubscriptions` are used by the live regression detection pipeline to find the most recent rules for filing bugs.
- **Historical Access:** `GetSubscription(name, revision)` allows the system to reference the exact configuration that was in place when a specific anomaly was detected, even if the subscription has since been updated.
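The two access modes, together with the deactivate-then-insert update pattern, can be modelled with an in-memory store. `MemStore` is a hypothetical stand-in: it omits contexts, transactions, and the proto type, and uses a mutex where the SQL implementation relies on a database transaction.

```go
package main

import (
	"fmt"
	"sync"
)

// Subscription is a reduced, hypothetical version of the proto message.
type Subscription struct {
	Name, Revision, BugComponent string
}

type record struct {
	sub    Subscription
	active bool
}

// MemStore keeps every revision and marks only the latest set as active.
type MemStore struct {
	mu      sync.Mutex
	records []record
}

// InsertSubscriptions deactivates all existing records and appends the new
// set as active while holding the lock, mimicking a single transaction.
func (s *MemStore) InsertSubscriptions(subs []Subscription) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i := range s.records {
		s.records[i].active = false
	}
	for _, sub := range subs {
		s.records = append(s.records, record{sub: sub, active: true})
	}
}

// GetActiveSubscription serves current-state access.
func (s *MemStore) GetActiveSubscription(name string) (Subscription, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, r := range s.records {
		if r.active && r.sub.Name == name {
			return r.sub, true
		}
	}
	return Subscription{}, false
}

// GetSubscription serves historical access by name and revision.
func (s *MemStore) GetSubscription(name, revision string) (Subscription, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, r := range s.records {
		if r.sub.Name == name && r.sub.Revision == revision {
			return r.sub, true
		}
	}
	return Subscription{}, false
}

func main() {
	s := &MemStore{}
	s.InsertSubscriptions([]Subscription{{Name: "v8", Revision: "rev1"}})
	s.InsertSubscriptions([]Subscription{{Name: "v8", Revision: "rev2"}})
	fmt.Println(s.GetActiveSubscription("v8"))
	fmt.Println(s.GetSubscription("v8", "rev1"))
}
```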
#### Subscription Proto (`/proto`)
The `v1.Subscription` message is the source of truth for what constitutes a subscription. It includes:
- **Routing Information:** `bug_component`, `bug_cc_emails`, and `contact_email`.
- **Classification Metadata:** `bug_labels`, `hotlists`, `bug_priority`, and `bug_severity`.
- **Logical Ownership:** The `name` field serves as the unique identifier for a specific monitoring rule.
#### SQL Implementation (`/sqlsubscriptionstore`)
The standard implementation of the `Store` interface. It manages the SQL lifecycle of subscription records, handling the translation between Go structs and database rows, and enforcing the "soft-deactivation" logic during updates.
### Subscription Lifecycle Workflow
The following diagram illustrates how a subscription moves from a configuration file into the database and is eventually used during an anomaly detection event:
```text
[ Config Source ] ----> [ Subscription Manager ]
(Git/Repo) |
| 1. Parse & Validate
v
[ SQL Store ] <--------- [ Store Interface ]
| | 2. InsertSubscriptions(new_set, tx)
| | - Set old records is_active = false
| | - Insert new records is_active = true
v
[ Database ]
|
| 3. GetAllActiveSubscriptions()
v
[ Anomaly Detector ] ----> [ External Issue Tracker ]
4. File bug using
labels/components
from Subscription
```
### Key Files
- **`store.go`**: Defines the `Store` interface which abstracts the underlying persistence mechanism.
- **`proto/v1/subscription.proto`**: The definitive schema for subscription data, used for both storage and cross-service communication.
- **`sqlsubscriptionstore/sqlsubscriptionstore.go`**: The SQL implementation of the store, containing the logic for versioned updates and retrieval.
- **`mocks/Store.go`**: An autogenerated mock of the `Store` interface for use in unit tests.
# Module: /go/subscription/mocks
### High-Level Overview
The `subscription/mocks` module provides autogenerated mock implementations of the `Store` interface used in Perf subscription management. This module is designed to facilitate unit testing for components that depend on the subscription storage layer without requiring a live database connection or complex setup.
By utilizing these mocks, developers can simulate various database states, verify that the application logic calls the storage layer with the expected parameters, and test error-handling scenarios in a predictable, isolated environment.
### Design Decisions and Implementation
The implementation relies on `testify/mock` and is generated via the `mockery` tool. This approach ensures that the mock interface remains synchronized with the actual `Store` interface defined in the subscription package.
Key design choices include:
- **Decoupling Logic from Persistence:** By providing a mock for the `Store`, the business logic governing subscriptions (such as validation or processing) can be tested independently of the underlying PostgreSQL implementation (facilitated by the `pgx` dependency).
- **Transaction Support:** The mock supports methods that take `pgx.Tx` as an argument (e.g., `InsertSubscriptions`), allowing tests to verify transactional logic even within a mocked context.
- **Automatic Assertion:** The `NewStore` constructor automatically registers a cleanup function on the testing object. This ensures that `AssertExpectations` is called at the end of every test, enforcing that all expected calls were made and preventing "silent" test failures where code logic skips necessary database interactions.
### Key Components
#### Store.go
This is the primary file containing the `Store` mock struct. It mirrors the capabilities of the real subscription storage engine:
- **Retrieval Methods:** It provides mocks for `GetActiveSubscription`, `GetAllActiveSubscriptions`, `GetAllSubscriptions`, and `GetSubscription`. These allow tests to simulate the presence or absence of specific subscription configurations (represented by `v1.Subscription` protos).
- **Persistence Methods:** The `InsertSubscriptions` mock enables verification of how the system writes or updates subscription data, including support for bulk operations and database transactions.
### Workflow Example: Testing a Subscription Fetcher
The following diagram illustrates how the mock interacts with a consumer (e.g., a Subscription Manager) and a test suite:
```text
+-----------+ +-----------------------+ +--------------+
| Test | | Subscription Manager | | Mock Store |
+-----------+ +-----------------------+ +--------------+
| | |
| 1. Setup Expectation | |
|--------------------------->| |
| (On "GetSubscription") | |
| | |
| 2. Trigger Action | |
|--------------------------->| |
| | 3. Call GetSubscription() |
| |----------------------------->|
| | |
| | 4. Return Mocked Proto/Error |
| |<-----------------------------|
| 5. Assert Result | |
|<---------------------------| |
| | |
| 6. Automatic Cleanup | |
| (AssertExpectations) | |
|--------------------------->|----------------------------->|
```
In this workflow, the `Mock Store` allows the `Test` to define exactly what the `Subscription Manager` should receive when it queries for a subscription, ensuring the manager handles the returned data (or error) correctly according to the system's design requirements.
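The same flow can be shown without external dependencies by using a hand-written fake. This is a stand-in for illustration, not the `mockery`-generated code or the `testify/mock` API: `Getter`, `fakeStore`, `describe`, and `errNotFound` are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Subscription is a reduced, hypothetical version of the proto message.
type Subscription struct{ Name, Revision string }

// Getter is the slice of the store that the consumer depends on.
type Getter interface {
	GetSubscription(name, revision string) (*Subscription, error)
}

// errNotFound is a canned error for the failure path.
var errNotFound = errors.New("not found")

// fakeStore returns canned results and records every call it receives.
type fakeStore struct {
	ret   *Subscription
	err   error
	calls []string
}

func (f *fakeStore) GetSubscription(name, revision string) (*Subscription, error) {
	f.calls = append(f.calls, name+"@"+revision)
	return f.ret, f.err
}

// describe is a hypothetical consumer under test.
func describe(g Getter, name, revision string) string {
	sub, err := g.GetSubscription(name, revision)
	if err != nil {
		return "lookup failed: " + err.Error()
	}
	return "found " + sub.Name
}

func main() {
	ok := &fakeStore{ret: &Subscription{Name: "v8", Revision: "rev1"}}
	fmt.Println(describe(ok, "v8", "rev1"), ok.calls)

	failing := &fakeStore{err: errNotFound}
	fmt.Println(describe(failing, "v8", "rev9"), failing.calls)
}
```

The recorded `calls` slice plays the role of the expectation check: a test can assert that the consumer queried the store with exactly the name and revision it was supposed to use.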
# Module: /go/subscription/proto
The `go/subscription/proto` module defines the foundational data structures used for anomaly notification routing and issue tracking within the Skia Perf ecosystem. This module serves as the contract between the performance analysis engines—which detect regressions—and the reporting services—which notify stakeholders.
### Design and Implementation Choices
The design of the proto definitions in this module reflects a transition toward automated, template-based issue management.
- **Service Decoupling:** By centralizing the subscription schema here, the system separates _what_ was found (an anomaly) from _who_ should care and _how_ it should be reported. This allows the detection engine to remain agnostic of the underlying bug-tracking system’s complexities.
- **Integration-First Schema:** Unlike a generic notification system, the fields are modeled after specific requirements of enterprise issue trackers (e.g., Buganizer). Attributes like `bug_component`, `hotlists`, and `bug_labels` are first-class citizens, ensuring that when an anomaly is detected, the resulting ticket is pre-triaged and routed to the correct engineering queue.
- **Constraint-Driven Configuration:** The schema enforces specific data types for priorities and severities, ensuring that configuration-as-code files remain valid and consistent across different performance monitoring domains.
### Key Components
#### The Subscription Schema
Defined in `subscription.proto`, the `Subscription` message is the primary data model. It acts as a routing rulebook for performance regressions.
- **Routing Logic:** The `bug_component` and `bug_cc_emails` fields define the destination of the alert. This ensures that the right team is notified immediately without manual triage.
- **Contextual Metadata:** The `bug_labels` and `hotlists` fields allow the system to tag issues with relevant metadata (e.g., "Chromium-Perf-Regression" or "Milestone-110"). This is critical for automated dashboards that track the health of specific product releases.
- **Accountability:** The `contact_email` field is mandatory to ensure every subscription has an owner who can be reached if the alerting rules become noisy or obsolete.
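A validation pass over a subscription might enforce these rules as sketched below. The Go struct is a hand-written approximation of the proto fields named in this document, and `validate` is hypothetical; where the real system performs such checks is not stated here. The 0-4 bounds follow the range given for `bug_priority` and `bug_severity` in the `v1` schema description.

```go
package main

import (
	"errors"
	"fmt"
)

// Subscription approximates the proto fields named in this document.
type Subscription struct {
	Name         string
	ContactEmail string
	BugComponent string
	BugLabels    []string
	Hotlists     []string
	BugCCEmails  []string
	BugPriority  int32
	BugSeverity  int32
}

// validate checks the ownership and range rules for one subscription.
func validate(s Subscription) error {
	if s.Name == "" {
		return errors.New("name must be set") // Assumed rule for this sketch.
	}
	if s.ContactEmail == "" {
		return errors.New("contact_email must be set")
	}
	if s.BugPriority < 0 || s.BugPriority > 4 {
		return fmt.Errorf("bug_priority %d outside 0-4", s.BugPriority)
	}
	if s.BugSeverity < 0 || s.BugSeverity > 4 {
		return fmt.Errorf("bug_severity %d outside 0-4", s.BugSeverity)
	}
	return nil
}

func main() {
	s := Subscription{
		Name:         "v8-perf",
		ContactEmail: "owner@example.com",
		BugComponent: "12345",
		BugLabels:    []string{"Chromium-Perf-Regression"},
		BugPriority:  2,
		BugSeverity:  1,
	}
	fmt.Println(validate(s))
	s.BugPriority = 7
	fmt.Println(validate(s))
}
```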
#### Go Binding and Generation
The module includes the generated Go code (`subscription.pb.go`) to provide a type-safe interface for the Perf backend.
- **Consistency via `generate.go`:** This file encapsulates the logic for invoking the protocol buffer compiler. By including this in the module, the project ensures that the Go structs remain in sync with the proto definitions, preventing runtime errors during the serialization or deserialization of subscription configurations.
### Data Flow Workflow
The following diagram demonstrates how the `proto` definitions facilitate the transition from a detected performance dip to an actionable engineering task:
```text
[ Regression Detector ]
|
| (A) Detects significant change in trace
v
[ Subscription Manager ] <---- [ Proto-based Config Files ]
| (Defines Name, Component, Priority)
|
| (B) Matches trace to "Subscription" name
v
[ Reporting Service ]
|
| (C) Maps Proto fields to API Call:
| - Labels -> bug_labels
| - Component -> bug_component
v
[ External Issue Tracker ]
```
### Key Files
- **`v1/subscription.proto`**: The source of truth for the subscription data model. It defines the structure used by both the configuration files and the internal Go services.
- **`v1/subscription.pb.go`**: The auto-generated Go implementation of the proto. It contains the structs and methods used by the Perf service to manipulate and pass subscription data.
- **`v1/generate.go`**: A utility script used to trigger the code generation process, ensuring the Go bindings are updated whenever the proto definition is modified.
# Module: /go/subscription/proto/v1
The `subscription.proto` module defines the schema for anomaly alerting configurations within the Skia Perf ecosystem. Its primary purpose is to decouple the logic of detecting performance regressions from the logic of reporting them. By providing a structured data format, it allows the system to determine exactly how and where to route notifications when an anomaly is identified.
### Design and Implementation Choices
The module is centered around the `Subscription` message, which acts as a template for issue creation. The design follows several key principles:
- **Traceability via Revisions:** The `revision` field ties each subscription to the version of the internal configuration repository that generated or updated it. Subscriptions can therefore be managed as "Configuration as Code," and changes to alerting rules remain auditable.
- **Issue Tracker Integration:** Instead of generic notification fields, the schema is specifically tailored to the requirements of modern issue tracking systems (like Buganizer or Monorail). Fields such as `bug_component`, `bug_priority`, and `bug_severity` (constrained to a 0-4 range) ensure that filed bugs are immediately actionable and correctly categorized without manual intervention.
- **Operational Accountability:** The `contact_email` field ensures that every automated alert has a human owner responsible for the subscription's validity, preventing "zombie" alerts that fire into unmonitored components.
### Key Components
#### Subscription Message
The `Subscription` message is the core entity. It bridges the gap between a detected event and an external tracking system.
- **Identity and Metadata:** The `name` is the unique key used by the Perf service to look up reporting rules. The `contact_email` identifies the team or individual maintaining the alert.
- **Issue Metadata:** `bug_labels` and `hotlists` allow for fine-grained filtering within issue trackers, enabling teams to organize anomalies by sub-project or release milestone.
- **Routing and Priority:** `bug_component` defines the destination, while `bug_priority` and `bug_severity` define the urgency. The use of `repeated` strings for `bug_cc_emails` allows for cross-team visibility on critical regressions.
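The fields described above can be pictured as a plain Go struct. The following is a minimal sketch, not the generated code from `subscription.pb.go`: the Go field types, the `Validate` helper, and the rule that `name` must be non-empty are assumptions made for illustration. The 0-4 range and the mandatory contact come from the descriptions in this document.

```go
package main

import (
	"errors"
	"fmt"
)

// Subscription is a hand-written stand-in for the generated struct in
// subscription.pb.go. Field names follow the proto fields named above;
// the Go types are assumptions.
type Subscription struct {
	Name         string
	Revision     string
	BugLabels    []string
	Hotlists     []string
	BugComponent string
	BugPriority  int32
	BugSeverity  int32
	BugCCEmails  []string
	ContactEmail string
}

// Validate is a hypothetical helper (not part of the module) that checks
// the constraints described in the text.
func Validate(s Subscription) error {
	if s.Name == "" {
		return errors.New("name must be set")
	}
	if s.ContactEmail == "" {
		return errors.New("contact_email must be set")
	}
	if s.BugPriority < 0 || s.BugPriority > 4 {
		return fmt.Errorf("bug_priority %d is outside 0-4", s.BugPriority)
	}
	if s.BugSeverity < 0 || s.BugSeverity > 4 {
		return fmt.Errorf("bug_severity %d is outside 0-4", s.BugSeverity)
	}
	return nil
}

func main() {
	s := Subscription{
		Name:         "Chrome_Perf",
		Revision:     "a1b2c3d",
		BugLabels:    []string{"Chromium-Perf-Regression"},
		BugComponent: "12345",
		BugPriority:  2,
		BugSeverity:  3,
		BugCCEmails:  []string{"team@example.com"},
		ContactEmail: "owner@example.com",
	}
	fmt.Println(Validate(s)) // a valid subscription yields a nil error

	s.BugPriority = 7
	fmt.Println(Validate(s)) // an out-of-range priority is rejected
}
```

Checking these constraints when a configuration is loaded would surface a bad subscription before any bug is filed against it.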
#### Generated Go Code
The `subscription.pb.go` file provides the concrete implementation of these structures for use in Go services. This ensures type safety when the Perf backend processes subscription data retrieved from storage or configuration files.
### Workflow Example
The following diagram illustrates how the `Subscription` proto is utilized during an anomaly event:
```text
[ Perf Detection Engine ]
|
| 1. Anomaly Found
v
[ Subscription Lookup ] <--- Uses "name" to find Subscription proto
|
| 2. Extract Bug Metadata (Component, CCs, Labels)
v
[ Issue Tracker API ] ----> Creates Bug with:
- Component: bug_component
- CCs: bug_cc_emails
- Labels: bug_labels
```
### Source Files
- `subscription.proto`: The source of truth definition for the subscription data structure.
- `subscription.pb.go`: The compiled Go code used by internal services to handle subscription data.
- `generate.go`: Contains the automation logic for regenerating the Go code when the proto definition changes, ensuring consistency between the schema and the implementation.
# Module: /go/subscription/sqlsubscriptionstore
The `sqlsubscriptionstore` module provides a persistent SQL-based implementation of the `subscription.Store` interface. It is responsible for storing, versioning, and retrieving configurations that define how the Perf system should handle anomalies, specifically focusing on bug filing metadata such as components, labels, and priority.
## Design Decisions and Implementation Choices
### Atomic Versioning and State Management
The store implements a "deactivate-then-insert" pattern for updates. When new subscriptions are inserted via `InsertSubscriptions`, the store wraps the operation in a transaction that first marks all existing subscriptions as inactive before inserting the new set as active.
This design choice ensures that:
1. **Consistency**: There is always a clear set of "active" configurations used by the monitoring services.
2. **Auditability**: Historical configurations are never deleted. By using a compound primary key of `(name, revision)`, the store maintains a full lineage of how a subscription's metadata (like its bug component or CC list) has changed over time, keyed to specific infrastructure Git revisions.
3. **Soft Deactivation**: The `is_active` flag allows the system to distinguish between the current production configuration and historical records without physical data removal.
### Integration with External Issue Trackers
The module is designed to map directly to the requirements of issue tracking systems (like Monorail or Buganizer). Implementation details such as storing `BugLabels` and `Hotlists` as string arrays, and `BugPriority`/`BugSeverity` as integers, allow the Perf service to programmatically construct bug reports that adhere to specific team triage workflows without needing complex transformation logic at the application layer.
## Key Components
### SubscriptionStore
Located in `sqlsubscriptionstore.go`, this is the primary struct implementing the data access logic. It wraps a `pool.Pool` to interact with the underlying database (typically Spanner or PostgreSQL).
- **Query Management**: The store uses a centralized map of SQL statements. This separation of SQL logic from Go code facilitates easier maintenance of the schema-to-struct mapping.
- **Transaction Support**: The `InsertSubscriptions` method explicitly accepts a `pgx.Tx` (transaction) object. This allows the caller to coordinate subscription updates with other database operations, ensuring that configuration updates are atomic across the system.
### Data Schema
The underlying table structure (defined in the `schema` submodule) enforces the immutability of specific revisions. Fields like `bug_cc_emails` and `contact_email` are stored to ensure that the notification engine knows exactly who to alert when an anomaly is detected under a specific subscription's criteria.
## Subscription Update Workflow
The following diagram illustrates the process of updating subscriptions within the store, highlighting the transition of active states.
```text
1. Caller starts Transaction (tx)
2. InsertSubscriptions(ctx, new_subs, tx)
|
v
+---------------------------------------+
| SQL: UPDATE Subscriptions |
| SET is_active = false | <-- Archive existing configs
| WHERE is_active = true |
+---------------------------------------+
|
v
+---------------------------------------+
| SQL: INSERT INTO Subscriptions |
| (name, revision, ..., is_active=true) | <-- Activate new configs
+---------------------------------------+
|
v
3. Caller commits Transaction
```
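The "deactivate-then-insert" pattern can be illustrated without a database. The sketch below models the table as an in-memory map keyed by `(name, revision)`. The `memStore` type and its methods are hypothetical; the real store issues the two SQL statements shown above inside a caller-supplied transaction. The up-front duplicate check stands in for a transaction rollback on a primary key conflict.

```go
package main

import (
	"fmt"
	"sort"
)

// key mirrors the compound primary key (name, revision).
type key struct{ name, revision string }

// row is a reduced, hypothetical version of a Subscriptions table row.
type row struct {
	Name, Revision, BugComponent string
	IsActive                     bool
}

type memStore struct{ rows map[key]*row }

func newMemStore() *memStore { return &memStore{rows: map[key]*row{}} }

// insertSubscriptions deactivates every existing row, then inserts the new
// set as active. Nothing is modified if any new row collides with an
// existing primary key.
func (s *memStore) insertSubscriptions(subs []row) error {
	for _, sub := range subs {
		if _, exists := s.rows[key{sub.Name, sub.Revision}]; exists {
			return fmt.Errorf("duplicate primary key (%s, %s)", sub.Name, sub.Revision)
		}
	}
	for _, r := range s.rows {
		r.IsActive = false // UPDATE ... SET is_active = false
	}
	for _, sub := range subs {
		c := sub
		c.IsActive = true // INSERT ... is_active = true
		s.rows[key{c.Name, c.Revision}] = &c
	}
	return nil
}

// active lists the currently active subscriptions as "name@revision".
func (s *memStore) active() []string {
	var out []string
	for _, r := range s.rows {
		if r.IsActive {
			out = append(out, r.Name+"@"+r.Revision)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	s := newMemStore()
	_ = s.insertSubscriptions([]row{{Name: "A", Revision: "r1"}, {Name: "B", Revision: "r1"}})
	_ = s.insertSubscriptions([]row{{Name: "A", Revision: "r2"}})
	fmt.Println(s.active(), len(s.rows)) // prints: [A@r2] 3
}
```

Subscription `B` is absent from the second insert, so it ends up inactive while its `r1` row is kept as history.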
## Retrieval Modes
The store provides multiple ways to access data based on the caller's context:
- **Point-in-time**: `GetSubscription(name, revision)` retrieves a specific historical version of a config.
- **Current State**: `GetActiveSubscription(name)` or `GetAllActiveSubscriptions()` retrieves only the configurations currently marked as active, used by the live alerting engine.
- **Historical Audit**: `GetAllSubscriptions()` returns every stored subscription, including inactive versions.
# Module: /go/subscription/sqlsubscriptionstore/schema
# SQL Subscription Store Schema
The `schema` module defines the data structure and database layout for storing subscriptions within the Perf system. It serves as the single source of truth for the SQL table definitions used by the `sqlsubscriptionstore`, ensuring that subscription metadata is persisted consistently and can be queried efficiently.
## Design Decisions and Implementation Choices
### Immutability via Compound Primary Keys
The schema defines a primary key composed of both `name` and `revision`.
```
PRIMARY KEY(name, revision)
```
This design choice facilitates versioning and traceability. Instead of overwriting an existing subscription when configurations change, the system records a new entry tied to a specific `infra_internal` Git hash (`revision`). This allows the system to:
- Track the evolution of a subscription over time.
- Audit which version of a configuration was active when a specific bug was filed.
- Roll back or reference historical subscription states based on repository history.
### Integration with Bug Filing Systems
A significant portion of the schema is dedicated to bug metadata (labels, hotlists, components, priority, and severity). The implementation uses `STRING ARRAY` types for fields like `bug_labels` and `hotlists` to provide flexibility, allowing a single subscription to categorize bugs across multiple workstreams without requiring complex relational mapping tables.
The inclusion of `bug_priority` and `bug_severity` as integers (constrained to 0-4) maps directly to standard issue tracking priorities (e.g., P0 through P4), ensuring that the Perf system can programmatically set triage urgency based on the subscription configuration.
## Key Components and Responsibilities
### SubscriptionSchema Struct
Located in `schema.go`, this struct defines the mapping between Go objects and the SQL database. Its responsibilities include:
- **Identity Management**: Manages the `Name` and `Revision` fields which uniquely identify the configuration.
- **Notification Routing**: Stores `BugCCEmails` and `ContactEmail` to ensure that the correct stakeholders are alerted when the subscription triggers.
- **State Management**: The `IsActive` boolean allows for soft-deactivation of subscriptions, enabling users to pause monitoring without deleting the historical configuration or metadata.
## Workflow: Subscription Lifecycle
The following diagram illustrates how the schema supports the transition from a configuration defined in code/Git to a persisted database record used for bug filing.
```text
[ Git Revision ] ----> [ Subscription Config ]
| |
| (Name + Revision used as Key)
| |
| v
| +-----------------------+
+----------->| SQL: Subscriptions |
+-----------------------+
| name: "Chrome_Perf" |
| revision: "a1b2c3d" | <--- Ensures auditability
| bug_component: 12345 |
| is_active: true |
+-----------+-----------+
|
v
+-----------------------+
| Bug Filing Process |
+-----------------------+
| CCs: bug_cc_emails |
| Labels: bug_labels |
+-----------------------+
```
## Files
- **schema.go**: Contains the `SubscriptionSchema` struct definition with SQL tags that define the column types and constraints for the underlying database engine.
# Module: /go/tracecache
# TraceCache Module
The `tracecache` module provides a specialized caching layer for Perf trace identifiers. It is designed to bridge the gap between high-level user queries and the underlying data tiles, reducing the computational overhead of repeatedly resolving complex queries against the same dataset.
## High-Level Overview
In the Perf system, data is organized into "tiles." When a user executes a query, the system must identify which traces match that query within a specific tile. This resolution process can be expensive, especially for broad queries or large datasets.
`TraceCache` addresses this by memoizing the results of query resolutions. It maps a combination of a `TileNumber` and a `query.Query` to a list of matching `paramtools.Params` (trace identifiers). This allows the system to bypass the query engine for subsequent requests for the same data, significantly improving performance for dashboard loading and data exploration.
## Design Decisions and Implementation
### Key Derivation
The cache's efficiency relies on its key generation strategy. The module uses a composite key:
`[TileNumber]_[QueryString]`
- **Tile Granularity:** Because the `TileNumber` is part of the key, results for one tile never collide with results for another. A query against a newly created tile simply misses and populates a fresh entry, so cached results stay tied to the specific temporal bucket of data they were computed for.
- **Query Normalization:** The `query.Query` object is converted to its `KeyValueString()` representation. This ensures that queries with the same parameters result in the same cache key, maximizing the hit rate.
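As a rough illustration of the composite key, the sketch below builds a `[TileNumber]_[QueryString]` key from a tile number and a set of query parameters. The `canonicalQuery` function is a hypothetical stand-in for `KeyValueString()`; the real string format is not shown in this document and may differ. The point is only that equivalent queries should map to the same key.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// canonicalQuery is a hypothetical normalizer: it sorts keys and values so
// that equivalent queries produce identical strings.
func canonicalQuery(q map[string][]string) string {
	keys := make([]string, 0, len(q))
	for k := range q {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		vals := append([]string(nil), q[k]...) // copy so the caller's slice is untouched
		sort.Strings(vals)
		parts = append(parts, k+"="+strings.Join(vals, ","))
	}
	return strings.Join(parts, "&")
}

// cacheKey combines the tile number and the normalized query.
func cacheKey(tile int32, q map[string][]string) string {
	return fmt.Sprintf("%d_%s", tile, canonicalQuery(q))
}

func main() {
	q := map[string][]string{"config": {"8888", "565"}, "arch": {"x86"}}
	fmt.Println(cacheKey(7, q)) // prints: 7_arch=x86&config=565,8888
}
```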
### Serialization
Trace identifiers are stored as JSON blobs within the cache backend. While JSON introduces a small overhead for marshaling and unmarshaling, it provides a stable, human-readable format that simplifies debugging and ensures compatibility regardless of the underlying cache provider (e.g., in-memory, Redis, or Memcache).
### Dependency Injection
The `TraceCache` struct does not implement a caching engine itself. Instead, it wraps an implementation of the `cache.Cache` interface. This decoupling allows the `tracecache` module to remain agnostic of the storage backend, enabling the use of local in-memory caches for development and distributed caching systems for production environments.
## Key Components and Responsibilities
### TraceCache Struct
The primary coordinator of the module. Its responsibilities include:
- **Encapsulation:** Managing the interaction with the generic `cache.Cache` client.
- **Query-to-Key Mapping:** Transforming domain-specific objects (`TileNumber` and `Query`) into flat string keys.
- **Data Transformation:** Handling the serialization of `paramtools.Params` arrays into JSON and back.
### Key Methods
- **CacheTraceIds:** Persists the results of a query resolution. It takes the resulting list of trace parameters and stores them against the tile/query key.
- **GetTraceIds:** Retrieves cached results. If the key exists, it deserializes the JSON back into a slice of `paramtools.Params`; if the key is missing (a cache miss), it returns `nil`, signaling that the caller must resolve the query itself.
## Data Workflow
The typical lifecycle of a trace lookup using this module follows this pattern:
```text
User Query + Tile
|
v
[ TraceCache.GetTraceIds ] --(Key: TileNumber_Query)--> [ Cache Backend ]
| |
+<-----------( JSON Result / Miss )------------------+
|
| If Miss:
| 1. Execute Query against Tile
| 2. [ TraceCache.CacheTraceIds ] ----------> [ Cache Backend ]
| 3. Return Results
|
| If Hit:
| 1. Unmarshal JSON
| 2. Return Results
```
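The hit and miss paths above can be sketched with a map-backed cache. Everything here is a simplified stand-in: the `Cache` interface, `memCache`, `Params`, and the method signatures are assumptions modeled on the descriptions in this section, not the real `cache.Cache` or `paramtools.Params` APIs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Params is a stand-in for paramtools.Params.
type Params map[string]string

// Cache is a hypothetical minimal cache interface.
type Cache interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

// memCache is an in-memory Cache for local use.
type memCache map[string]string

func (m memCache) Get(k string) (string, bool) { v, ok := m[k]; return v, ok }
func (m memCache) Set(k, v string)             { m[k] = v }

type traceCache struct{ c Cache }

func key(tile int32, q string) string { return fmt.Sprintf("%d_%s", tile, q) }

// cacheTraceIDs stores the resolved trace params as a JSON blob.
func (t *traceCache) cacheTraceIDs(tile int32, q string, ids []Params) error {
	b, err := json.Marshal(ids)
	if err != nil {
		return err
	}
	t.c.Set(key(tile, q), string(b))
	return nil
}

// getTraceIDs returns nil on a cache miss.
func (t *traceCache) getTraceIDs(tile int32, q string) ([]Params, error) {
	s, ok := t.c.Get(key(tile, q))
	if !ok {
		return nil, nil
	}
	var ids []Params
	if err := json.Unmarshal([]byte(s), &ids); err != nil {
		return nil, err
	}
	return ids, nil
}

// lookup runs the expensive resolver only on a miss, then fills the cache.
func lookup(t *traceCache, tile int32, q string, resolve func() []Params) ([]Params, error) {
	ids, err := t.getTraceIDs(tile, q)
	if err != nil {
		return nil, err
	}
	if ids != nil {
		return ids, nil
	}
	ids = resolve()
	if err := t.cacheTraceIDs(tile, q, ids); err != nil {
		return nil, err
	}
	return ids, nil
}

func main() {
	calls := 0
	resolve := func() []Params { calls++; return []Params{{"arch": "x86"}} }
	tc := &traceCache{c: memCache{}}
	_, _ = lookup(tc, 3, "arch=x86", resolve)
	_, _ = lookup(tc, 3, "arch=x86", resolve)
	fmt.Println(calls) // prints: 1
}
```

Because a miss is signaled by `nil`, an empty result would be indistinguishable from a miss in this sketch; callers that need to cache "no matches" would need a separate hit flag.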
# Module: /go/tracefilter
The `tracefilter` module provides a specialized tree-based data structure designed to identify and isolate "leaf" traces within a hierarchical path structure. In the context of performance monitoring and trace management, data often arrives with overlapping prefixes or hierarchical relationships. This module allows for the filtering of redundant parent nodes, ensuring that only the most specific (deepest) traces are processed.
### Design Motivation
The primary goal of `tracefilter` is to resolve hierarchical dependencies between trace paths. When multiple paths are added to the filter, some may be prefixes of others. For example, if both `root/cpu/usage` and `root/cpu/usage/core1` are registered, the latter is a more specific leaf node.
By modeling these paths as a tree, the module can efficiently determine which traces represent actual data endpoints versus those that are merely architectural containers for more granular metrics. This is particularly useful for deduplicating metrics or ensuring that aggregations don't double-count data that exists at multiple levels of a hierarchy.
### Key Components and Logic
#### The Tree Structure (`TraceFilter`)
The core of the module is the `TraceFilter` struct, which functions as a recursive node in a prefix tree (trie). Each node stores:
- A `value`: The specific path segment string (e.g., "p1").
- A `traceKey`: An identifier associated with that specific path.
- `children`: A map of sub-paths to nested `TraceFilter` nodes.
#### Path Integration (`AddPath`)
The `AddPath` method builds the tree incrementally. It accepts a slice of strings representing the hierarchy and a `traceKey`. As paths are added, the module creates the necessary branch nodes. If a path is added that extends an existing branch, the tree grows deeper.
#### Leaf Node Resolution (`GetLeafNodeTraceKeys`)
This is the central logic of the module. It performs a recursive depth-first search to find nodes that have no children.
The implementation logic follows a "specificity wins" rule:
1. If a node has children, it is considered a "parent" or "container" node. Its own `traceKey` is ignored, and the search continues into its children.
2. If a node has no children, it is a "leaf." Its `traceKey` is collected and returned.
This ensures that if a parent key is added and later a child of that parent is added, only the child's key (the more specific one) will be returned in the final result set.
### Workflow Example
Consider a scenario where various metrics are registered. The tree filters out the keys attached to the intermediate `p2` and `p3` nodes because more specific children exist.
```text
Input Paths:
1. ["root", "p1", "p2"] Key: "key_parent"
2. ["root", "p1", "p2", "p3"] Key: "key_intermediate"
3. ["root", "p1", "p2", "p3", "t1"] Key: "key_leaf_A"
4. ["root", "p1", "p2", "p4"] Key: "key_leaf_B"
Tree Construction:
root
└── p1
└── p2 (key_parent)
├── p3 (key_intermediate)
│ └── t1 (key_leaf_A) <-- Leaf
└── p4 (key_leaf_B) <-- Leaf
Resulting Leaf Keys:
["key_leaf_A", "key_leaf_B"]
```
In this example, "key_parent" and "key_intermediate" are discarded by `GetLeafNodeTraceKeys` because the filter assumes that the presence of deeper nodes makes the higher-level nodes redundant for the specific filtering task.
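A compact prefix tree that behaves as described could look like the following. It is a sketch under assumptions: the field names follow the description above, but the constructor, the exact method signatures, and the sorted return order are illustrative choices rather than the module's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// TraceFilter is one node in the prefix tree.
type TraceFilter struct {
	value    string
	traceKey string
	children map[string]*TraceFilter
}

// NewTraceFilter returns an empty root node.
func NewTraceFilter() *TraceFilter {
	return &TraceFilter{children: map[string]*TraceFilter{}}
}

// AddPath walks the path one segment at a time, creating nodes as needed,
// and attaches traceKey to the node for the final segment.
func (tf *TraceFilter) AddPath(path []string, traceKey string) {
	if len(path) == 0 {
		return
	}
	child, ok := tf.children[path[0]]
	if !ok {
		child = &TraceFilter{value: path[0], children: map[string]*TraceFilter{}}
		tf.children[path[0]] = child
	}
	if len(path) == 1 {
		child.traceKey = traceKey
		return
	}
	child.AddPath(path[1:], traceKey)
}

// GetLeafNodeTraceKeys returns the keys of nodes without children.
// Keys on nodes that have children are ignored ("specificity wins").
func (tf *TraceFilter) GetLeafNodeTraceKeys() []string {
	if len(tf.children) == 0 {
		if tf.traceKey == "" {
			return nil
		}
		return []string{tf.traceKey}
	}
	var keys []string
	for _, c := range tf.children {
		keys = append(keys, c.GetLeafNodeTraceKeys()...)
	}
	sort.Strings(keys) // map iteration order is random; sort for stable output
	return keys
}

func main() {
	tf := NewTraceFilter()
	tf.AddPath([]string{"root", "p1", "p2"}, "key_parent")
	tf.AddPath([]string{"root", "p1", "p2", "p3"}, "key_intermediate")
	tf.AddPath([]string{"root", "p1", "p2", "p3", "t1"}, "key_leaf_A")
	tf.AddPath([]string{"root", "p1", "p2", "p4"}, "key_leaf_B")
	fmt.Println(tf.GetLeafNodeTraceKeys()) // prints: [key_leaf_A key_leaf_B]
}
```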
# Module: /go/tracesetbuilder
# TraceSetBuilder
The `tracesetbuilder` module provides a high-performance, concurrent mechanism for aggregating disparate trace data fragments into a unified `TraceSet` and a corresponding `ParamSet`. This is primarily used in Perf to consolidate data fetched from multiple storage tiles or shards into a single contiguous representation suitable for visualization or analysis.
## Overview
In the Perf system, performance data is often stored and retrieved in chunks (tiles). When a user requests data over a large time range, the system must fetch multiple tiles and stitch them together. `TraceSetBuilder` manages this stitching process efficiently.
The design prioritizes performance and thread safety by using a "sharded worker" architecture. Instead of protecting a shared result set with a global mutex—which would cause significant contention when processing thousands of traces—the builder distributes the work across a pool of independent worker routines.
### Key Workflows and Design Decisions
The builder uses a pipeline pattern to process incoming trace data:
1. **Input Sharding**: When `Add()` is called, the builder iterates over the provided traces. It calculates a CRC32 checksum of each trace key to determine which worker should handle that specific trace.
2. **Lock-Free Concurrency**: By routing all data for a specific trace ID to the same worker, the system ensures that no two workers ever attempt to modify the same trace simultaneously. This allows each worker to maintain its own local `TraceSet` and `ParamSet` without any internal locking.
3. **Mapping Logic**: The builder translates sparse data (from tiles) into a dense output array. It uses a mapping of `CommitNumber` to an output index, allowing it to place data points at the correct temporal position regardless of the order in which tiles are processed.
4. **Final Aggregation**: When `Build()` is invoked, the builder waits for all workers to finish their queues and then merges the independent results from each worker into a final consolidated set.
```text
Add(traces) Worker 1 (Keys A, D) Build()
| +-----------+ |
|--- Hash(A) --->| TraceSet1 |--- Merged -|
| +-----------+ |
|--- Hash(B) ---. |--> Final TraceSet
|--- Hash(C) --.| Worker 2 (Keys B, C) |--> Final ParamSet
| +-----------+ |
`--- Hash(D) --->| TraceSet2 |--- Merged -'
+-----------+
```
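The pipeline above can be approximated in a few dozen lines. This is a simplified sketch, not the module's code: it uses 4 workers instead of 64, omits `ParamSet` tracking, and the `request` layout, the `missing` sentinel value, and all function names are assumptions for illustration.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sync"
)

const numWorkers = 4          // the real builder is described as using 64
const missing float32 = 1e32 // assumed sentinel for "no data"

// request carries one trace fragment plus the commit-to-index mapping.
type request struct {
	key           string
	points        map[int]float32 // commit number -> value
	commitToIndex map[int]int     // commit number -> output index
}

type worker struct {
	ch     chan request
	traces map[string][]float32 // touched only by this worker's goroutine
}

type builder struct {
	workers []*worker
	wg      sync.WaitGroup
}

func newBuilder(size int) *builder {
	b := &builder{}
	for i := 0; i < numWorkers; i++ {
		w := &worker{ch: make(chan request, 16), traces: map[string][]float32{}}
		b.workers = append(b.workers, w)
		go func() {
			for r := range w.ch {
				t, ok := w.traces[r.key]
				if !ok { // first time this key is seen: fill with sentinels
					t = make([]float32, size)
					for j := range t {
						t[j] = missing
					}
					w.traces[r.key] = t
				}
				for commit, v := range r.points {
					if idx, ok := r.commitToIndex[commit]; ok && idx < len(t) {
						t[idx] = v
					}
				}
				b.wg.Done()
			}
		}()
	}
	return b
}

// add routes each trace to a worker chosen by the CRC32 of its key, so a
// given key is always handled by the same goroutine.
func (b *builder) add(commitToIndex map[int]int, traces map[string]map[int]float32) {
	for key, points := range traces {
		i := crc32.ChecksumIEEE([]byte(key)) % numWorkers
		b.wg.Add(1)
		b.workers[i].ch <- request{key: key, points: points, commitToIndex: commitToIndex}
	}
}

// build waits for all queued work, then merges the per-worker maps.
func (b *builder) build() map[string][]float32 {
	b.wg.Wait()
	out := map[string][]float32{}
	for _, w := range b.workers {
		for k, t := range w.traces {
			out[k] = t
		}
	}
	return out
}

func (b *builder) close() {
	for _, w := range b.workers {
		close(w.ch)
	}
}

func main() {
	b := newBuilder(4)
	defer b.close()
	idx := map[int]int{100: 0, 101: 1, 102: 2, 103: 3}
	// Fragments may arrive in any order; the mapping places each point.
	b.add(idx, map[string]map[int]float32{",arch=x86,": {102: 3, 103: 4}})
	b.add(idx, map[string]map[int]float32{",arch=x86,": {100: 1, 101: 2}})
	fmt.Println(b.build()[",arch=x86,"]) // prints: [1 2 3 4]
}
```

The per-worker maps need no mutex because each key hashes to exactly one worker, and `build` reads them only after `wg.Wait()` returns.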
## Key Components
### TraceSetBuilder (`tracesetbuilder.go`)
The primary coordinator. It initializes a pool of 64 workers (defined by `numWorkers`) and a `sync.WaitGroup` to track pending work. It is designed for a single lifecycle: you `Add()` data, `Build()` the result, and then `Close()` the builder. It cannot be reused after `Build()` is called.
### mergeWorker (`tracesetbuilder.go`)
Internal workers that maintain their own state. Each worker listens on a buffered channel for `request` objects.
- **Trace Merging**: If a worker receives a trace key it hasn't seen before, it initializes a new `types.Trace` filled with sentinel values (missing data). It then populates the trace at specific indices based on the commit mapping provided in the request.
- **ParamSet Tracking**: Each worker updates its own `paramtools.ParamSet` to reflect the dimensions and values present in the traces it has processed.
### The Request Structure
The `request` object is the unit of work passed to workers. It contains:
- The raw trace data.
- The parsed `Params` (to avoid redundant parsing in the workers).
- A `commitNumberToOutputIndex` map, which defines exactly where each data point in the input should land in the final output trace.
## Usage Details
- **Initialization**: `New(size int)` requires the total length of the resulting traces (e.g., the number of commits in the requested range).
- **Data Insertion**: `Add()` distributes traces to workers and increments the internal WaitGroup. It does not block as long as the workers' channel buffers have room; once a buffer is full, `Add()` waits for that worker to catch up.
- **Completion**: `Build()` blocks until all workers have finished processing their queues. It then performs the final merge of the 64 worker-local maps into the return values.
- **Cleanup**: `Close()` must be called to shut down the worker goroutines and release resources.
# Module: /go/tracestore
# TraceStore
The `tracestore` module provides the core abstractions and interfaces for storing, retrieving, and querying performance trace data within the Skia Perf system. It acts as the bridge between raw performance metrics (time-series data) and the storage backends, ensuring that high-cardinality data can be queried efficiently.
## High-Level Overview
In the Skia Perf ecosystem, a "trace" is a series of floating-point values associated with a specific set of parameters (e.g., `,arch=x86,config=8888,`). The `tracestore` module defines how these traces are organized into "Tiles"—fixed-size blocks of commits—and provides the interfaces for performing complex queries across these tiles.
The module is built around three primary interfaces:
1. **`TraceStore`**: The main interface for reading and writing trace data, calculating tile offsets, and executing queries.
2. **`TraceParamStore`**: Specifically handles the mapping between a trace's unique identifier (an MD5 hash) and its human-readable parameters.
3. **`MetadataStore`**: Manages "sidecar" information, such as links to source files or diagnostic data associated with the ingestion process.
## Design Decisions
### Tiled Data Architecture
To handle years of performance data without performance degradation, `tracestore` utilizes a tiling system.
- **The "Why"**: Loading an entire history of a trace is rarely necessary and often memory-prohibitive. By splitting data into tiles (e.g., 256 commits per tile), the system can load only the segments relevant to a user's current view.
- **Implementation**: The `TraceStore` interface exposes methods like `TileNumber` and `CommitNumberOfTileStart` to translate between absolute commit numbers and their positions within specific storage blocks.
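The tile arithmetic is plain integer division. The sketch below assumes a tile size of 256 and uses free functions with hypothetical names; the real methods live on the `TraceStore` interface and their exact signatures are not shown here.

```go
package main

import "fmt"

type CommitNumber int32
type TileNumber int32

const tileSize = 256 // example tile size from the text

// tileNumber returns the tile that contains the given commit.
func tileNumber(c CommitNumber) TileNumber {
	return TileNumber(int32(c) / tileSize)
}

// commitNumberOfTileStart returns the first commit of the tile containing c.
func commitNumberOfTileStart(c CommitNumber) CommitNumber {
	return CommitNumber(int32(tileNumber(c)) * tileSize)
}

// tilesForRange lists every tile touched by the inclusive range [begin, end].
func tilesForRange(begin, end CommitNumber) []TileNumber {
	var tiles []TileNumber
	for t := tileNumber(begin); t <= tileNumber(end); t++ {
		tiles = append(tiles, t)
	}
	return tiles
}

func main() {
	fmt.Println(tileNumber(1000))              // prints: 3
	fmt.Println(commitNumberOfTileStart(1000)) // prints: 768
	fmt.Println(tilesForRange(250, 600))       // prints: [0 1 2]
}
```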
### Separation of Values and Parameters
The design separates the storage of the numeric values (the "what") from the parameters (the "who").
- **Efficiency**: Instead of storing the full string of parameters with every single data point, the system uses a unique `trace_id` (MD5 hash).
- **The "How"**: The `TraceParamStore` maintains the lookup table for these IDs, while the `TraceStore` focuses on the high-volume numeric values and commit associations.
## Key Components
### TraceStore (`tracestore.go`)
This is the central entry point for the module. It defines the contract for how the rest of the Perf system (like the `dfbuilder` for creating DataFrames) interacts with performance data.
**Key Responsibilities:**
- **Querying**: `QueryTraces` and `QueryTracesIDOnly` provide the mechanism to search millions of traces based on parameter matches (e.g., finding all traces where `cpu=arm64`).
- **Data Retrieval**: Supports both tile-based reads (`ReadTraces`) and arbitrary commit range reads (`ReadTracesForCommitRange`).
- **Ingestion**: `WriteTraces` is responsible for committing new data points into the store, ensuring that the associated `ParamSet` (the global index of all known keys and values) is updated.
### TraceParamStore (`traceparamstore.go`)
This interface manages the lifecycle of trace identities.
- **Responsibility**: It maps the MD5 hex-encoded `traceId` to the `paramtools.Params` object.
- **Rationale**: By isolating this, backends can implement specialized caching or indexing (like the `InMemoryTraceParams` found in the SQL implementation) to speed up the translation from IDs back to human-readable strings.
### MetadataStore (`metadatastore.go`)
This interface provides context to the raw numbers.
- **Responsibility**: It links a data point back to its origin—specifically the source file name and any external links (e.g., a link to a BuildBucket task or a GCS bucket).
- **Usage**: When a user clicks on a point in a Perf graph, the system uses the `MetadataStore` to find exactly which file generated that specific value.
## Implementation Details: SQL Backend
While this module defines the interfaces, the `sqltracestore` submodule provides a concrete implementation designed for CockroachDB and Spanner. It implements specialized logic for:
- **Parallel Ingestion**: Writing trace data in batches to maximize database throughput.
- **In-Memory Search**: Using a columnar, integer-encoded index of trace parameters to resolve complex queries in RAM before fetching the actual numeric values from SQL.
## Data Workflow: Trace Resolution
The following diagram shows how the `tracestore` components interact when a user requests data for a specific graph:
```text
UI / API Request
|
v
[ TraceStore.QueryTraces ]
|
|-- 1. Identify matching TraceIDs via Query
|
|-- 2. Fetch Values (TraceStore implementation)
| [ SQL TraceValues Table ]
|
|-- 3. Fetch Params (TraceParamStore)
| [ SQL TraceParams Table ]
|
|-- 4. Fetch Source Info (MetadataStore)
| [ SQL SourceFiles Table ]
v
Combined TraceSet + Metadata
```
# Module: /go/tracestore/mocks
# tracestore/mocks
The `tracestore/mocks` module provides autogenerated mock implementations of the core interfaces used for storing and retrieving performance trace data within the Perf system. These mocks are generated using `mockery` and are based on the `testify` framework, facilitating unit testing of components that depend on `tracestore` and `metadatastore`.
## High-Level Overview
In the Perf architecture, the `TraceStore` and `MetadataStore` are critical abstractions for interacting with time-series data and its associated metadata (such as source file links). Because these stores often interact with external databases (like BigTable or SQL backends), using real implementations in unit tests is often impractical.
This module provides:
- **`TraceStore` Mock**: Simulates the primary data store for performance traces, supporting operations like querying by parameters, reading by commit range, and tile management.
- **`MetadataStore` Mock**: Simulates the storage used for mapping source file names to additional metadata, such as links or IDs.
## Key Components
### TraceStore.go
This file contains the mock for the `tracestore.TraceStore` interface. It is designed to allow developers to simulate complex data retrieval scenarios without a running database.
**Key Capabilities:**
- **Tile Logic**: Methods like `TileNumber`, `CommitNumberOfTileStart`, and `TileSize` allow tests to verify how components handle Perf's "tiled" data architecture.
- **Query Simulation**: `QueryTraces` and `QueryTracesIDOnly` can be configured to return specific `TraceSet` results or stream parameters, enabling tests for the UI and alerting logic.
- **Data Ingestion**: `WriteTraces` can be mocked to ensure that ingestion pipelines are correctly formatting and sending data to the store.
### MetadataStore.go
This file provides the mock for the `MetadataStore` interface, focusing on the association between raw trace data and its origin files.
**Key Capabilities:**
- **Metadata Retrieval**: Mocking `GetMetadata` and `GetMetadataMultiple` allows testing of features like the "Source File" links in the Perf UI.
- **Bulk Operations**: Supports mocking `GetMetadataForSourceFileIDs` for performance-sensitive batch lookups.
## Design Patterns and Usage
The mocks follow the `testify/mock` pattern. When a new mock is created via `NewTraceStore(t)` or `NewMetadataStore(t)`, it automatically registers a cleanup function that asserts expectations when the test finishes.
### Workflow Example: Testing a Query Component
This diagram illustrates how a test uses the mock to verify a component that processes trace data:
```text
Test Logic Component Under Test TraceStore Mock
| | |
|-- 1. On("QueryTraces") ->| |
| .Return(myTraceSet) | |
| | |
|----- 2. RunAction() --->| |
| |------- 3. QueryTraces() ------->|
| | |
| |<------ 4. myTraceSet -----------|
| | |
|<---- 5. Verify Results -| |
| | |
|-- 6. Cleanup/Assert ----|-------------------------------->|
```
1. **Setup** (step 1): The test defines what the mock should return when a specific query is executed.
2. **Execution** (steps 2-4): The test runs the component, which calls the mock as if it were a real database and receives the configured result.
3. **Verification** (step 5): The test checks if the component handled the returned `TraceSet` correctly.
4. **Assertion** (step 6): The mock verifies that the component actually called `QueryTraces` with the expected arguments.
# Module: /go/tracestore/sqltracestore
This module provides a high-performance, SQL-backed implementation of the `tracestore.TraceStore` interface for Skia Perf. It is designed to store and query high-cardinality time-series performance data, primarily targeting databases like CockroachDB or Spanner.
The implementation focuses on optimizing two primary workloads:
1. **Fast Range Queries**: Retrieving floating-point values for a specific set of traces across a range of commits (tiles).
2. **Metadata Discovery**: Navigating the "inverted index" of parameters (e.g., `arch=x86`) to find relevant traces.
### Design Decisions
#### Tile-Based Sharding
To prevent indices from growing indefinitely and to facilitate data aging/management, data is organized into "Tiles." Each tile represents a fixed number of commits (e.g., 256). This allows the system to partition lookups and optimize the `ParamSets` table by only querying the keys and values relevant to the specific time range being viewed.
#### MD5 Trace Identification
Trace names are structured keys (e.g., `,arch=x86,config=565,`). Storing these long strings repeatedly in the `TraceValues` table would be storage-inefficient and slow for indexing. Instead, the module uses an **MD5 hash** of the trace name as a `BYTEA` (or `BYTES`) primary key (`trace_id`).
- **Why MD5?** It provides a uniform distribution of keys, preventing "hot spots" in distributed SQL databases.
- **Trace Recovery**: Because hashes are one-way, the `TraceParams` table stores the mapping from `trace_id` back to the original `JSONB` parameter map.
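As a rough illustration of the hashing step, the sketch below builds a structured trace name from a parameter map and derives a 16-byte MD5 identifier from it. The helper names `traceName` and `traceID` are hypothetical, and sorting the keys alphabetically is an assumption of this sketch rather than a statement about the module's exact code.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// traceName builds a structured key such as ",arch=x86,config=565," from a
// parameter map. Sorting the keys (an assumption of this sketch) makes the
// name, and therefore the hash, independent of map iteration order.
func traceName(params map[string]string) string {
	keys := make([]string, 0, len(params))
	for k := range params {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var sb strings.Builder
	sb.WriteString(",")
	for _, k := range keys {
		sb.WriteString(k + "=" + params[k] + ",")
	}
	return sb.String()
}

// traceID returns the 16-byte MD5 digest of a trace name. The digest is what
// would be stored in a BYTEA/BYTES column.
func traceID(name string) [md5.Size]byte {
	return md5.Sum([]byte(name))
}

func main() {
	name := traceName(map[string]string{"config": "565", "arch": "x86"})
	id := traceID(name)
	fmt.Println(name, hex.EncodeToString(id[:]))
}
```

Because the digest cannot be reversed, a lookup table from `trace_id` back to the parameters is still required, which is the role the `TraceParams` table plays.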
#### In-Memory Parameter Indexing (`InMemoryTraceParams`)
While SQL is powerful, querying millions of traces based on complex parameter combinations (including regex and exclusions) can be slow in a pure SQL environment.
- **The "How"**: This module periodically loads the entire `TraceParams` table into an in-memory, integer-encoded columnar structure.
- **The Benefit**: Queries like `arch=x86 & config=~.*8888` are resolved in memory by scanning bitsets or integer arrays. The scan produces a list of `trace_id`s, which is then used in a SQL `IN` clause against the `TraceValues` table.
### Key Components
#### `SQLTraceStore` (`sqltracestore.go`)
The central orchestrator. It manages the lifecycle of traces, handles the conversion between human-readable trace names and SQL-friendly hashes, and coordinates with caches. It uses Go templates to generate dynamic SQL queries for batch operations.
#### `InMemoryTraceParams` (`inmemorytraceparams.go`)
An in-memory search engine for trace metadata.
- **Parallel Refresh**: It uses a partitioned read strategy (splitting the `trace_id` keyspace into 16 partitions) to rapidly load metadata from SQL into RAM.
- **Encoding**: It maps all parameter strings to `int32` identifiers to minimize memory footprint and speed up comparison logic.
#### `SQLTraceParamStore` (`sqltraceparamstore.go`)
Handles the durable storage of the trace identity.
- **Responsibility**: Maps the MD5 `trace_id` to the full `paramtools.Params` (JSON).
- **Optimization**: Implements batch writing with a parallel worker pool to handle high-volume ingestion.
#### `SQLMetadataStore` (`sqlmetadatastore.go`)
Stores "sidecar" information about the ingestion process.
- **Responsibility**: Maps `source_file_id` (an integer) to external links or diagnostic metadata. This keeps the primary `TraceValues` table focused strictly on performance metrics.
#### Intersection Logic (`intersect.go`)
A utility for combining results from multiple search channels. It uses a binary tree of Go channels to efficiently find the intersection of ordered `trace_id` sets without the overhead of reflection.
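A minimal sketch of the technique, assuming each input channel delivers strictly increasing `trace_id` strings: a two-way merge-style intersection is the leaf operation, and any number of inputs are paired up recursively into a binary tree. The function names are illustrative and the real module may differ in types and cancellation handling.

```go
package main

import "fmt"

// intersect2 emits, in order, the values present on both sorted inputs.
func intersect2(a, b <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		av, aok := <-a
		bv, bok := <-b
		for aok && bok {
			switch {
			case av == bv:
				out <- av
				av, aok = <-a
				bv, bok = <-b
			case av < bv:
				av, aok = <-a
			default:
				bv, bok = <-b
			}
		}
		// Drain whichever input is still open so its producer can finish.
		for aok {
			_, aok = <-a
		}
		for bok {
			_, bok = <-b
		}
	}()
	return out
}

// intersect combines any number of sorted channels by splitting the list in
// half recursively, which forms a binary tree of intersect2 stages.
func intersect(chans []<-chan string) <-chan string {
	switch len(chans) {
	case 0:
		empty := make(chan string)
		close(empty)
		return empty
	case 1:
		return chans[0]
	}
	mid := len(chans) / 2
	return intersect2(intersect(chans[:mid]), intersect(chans[mid:]))
}

// fromSlice feeds a sorted slice into a channel and closes it when done.
func fromSlice(vals []string) <-chan string {
	ch := make(chan string)
	go func() {
		defer close(ch)
		for _, v := range vals {
			ch <- v
		}
	}()
	return ch
}

// collect reads a channel to completion.
func collect(ch <-chan string) []string {
	var out []string
	for v := range ch {
		out = append(out, v)
	}
	return out
}

func main() {
	result := intersect([]<-chan string{
		fromSlice([]string{"a", "b", "c", "e"}),
		fromSlice([]string{"b", "c", "d", "e"}),
		fromSlice([]string{"c", "e", "f"}),
	})
	fmt.Println(collect(result)) // prints [c e]
}
```

Using concrete channel types keeps the hot loop free of reflection-based selection over a dynamic number of channels.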
### Data Workflow: Reading Traces
The following diagram illustrates how a user query for "config=565" across a specific tile is resolved:
```text
User Query: "config=565" for Tile 176
|
v
[ InMemoryTraceParams ] <--- (Scans encoded columns in RAM)
|
| Result: List of matching TraceIDs (MD5 hashes)
v
[ SQL Database: TraceValues ]
|
| SQL: SELECT val FROM TraceValues
| WHERE trace_id IN (...) AND commit_number BETWEEN 45056 AND 45311
v
[ TraceSet Result ] ----> (UI/Graphing)
```
### Data Workflow: Writing Traces
Ingestion prioritizes atomicity and avoiding redundant writes:
```text
Incoming Data: {Commit: 100, Params: {arch: x86}, Value: 1.2, Source: "file.json"}
|
| 1. Update SourceFiles: Get/Create ID for "file.json"
| 2. Update ParamSets: Ensure "arch=x86" is registered for the tile
| 3. Hash Trace: ",arch=x86," -> MD5 TraceID
| 4. Write TraceParams: Store {TraceID: Params} (ON CONFLICT DO NOTHING)
v
[ SQL Database: TraceValues ]
|
| INSERT INTO TraceValues (trace_id, commit_number, val, source_id)
| ON CONFLICT (trace_id, commit_number) DO UPDATE ...
```
# Module: /go/tracestore/sqltracestore/schema
The `sqltracestore/schema` module defines the foundational data structures used to map Go types to SQL table definitions for Skia Perf's trace storage. It acts as the "source of truth" for the database schema, utilizing struct tags to define column types, primary keys, and indices.
### Design Evolution and Storage Strategy
The schema is designed to handle high-cardinality time-series data (performance metrics) while maintaining fast lookups for both specific trace values and metadata.
#### Trace Data Management
The core performance data is stored in `TraceValuesSchema` and its successor `TraceValues2Schema`.
- **TraceValuesSchema**: Uses a composite primary key of `(trace_id, commit_number)`. This ensures that for any given metric (trace), data points are physically ordered by time (commit number), optimizing range scans for graphing.
- **TraceValues2Schema**: Extends the original schema to explicitly include common parameter dimensions (Benchmark, Bot, Test, etc.) as columns. This evolution reflects a shift towards allowing the database engine to filter on specific common dimensions more efficiently than generic JSON or posting-list lookups.
#### Postings and Search
To facilitate searching across millions of traces based on arbitrary parameters, the module defines a `PostingsSchema`:
- **Tile-based Partitioning**: Data is organized by `tile_number`. This sharding strategy prevents the posting indices from growing indefinitely, allowing the system to query only relevant time ranges.
- **Inverted Index**: The `key_value` (representing a `key=value` pair) is indexed against `trace_id`. This allows the system to quickly resolve a query like `device=pixel6` into a set of trace IDs.
#### Parameter and Metadata Handling
- **ParamSetsSchema**: Tracks the global set of all available keys and values within a specific tile. This is used to populate UI filters and autocomplete suggestions.
- **TraceParamsSchema**: Stores the full parameter map for a single trace as `JSONB`. This is used when the system needs to reconstruct the full identity of a trace after it has been located via an index.
- **SourceFilesSchema**: Maps raw filenames to internal IDs. This normalization reduces storage overhead in the primary value tables by replacing long strings with integers.
### Key Components and Data Relationships
The following diagram illustrates how these entities relate during data ingestion and retrieval:
```text
[ SourceFiles ] <---------- [ TraceValues ] ----------> [ TraceParams ]
(Maps filename to ID) (The actual metrics) (Full key/value map)
|
| (linked by trace_id)
v
[ ParamSets ] <---------- [ Postings ]
(All possible keys/vals) (Search index for traces)
```
- **TraceID**: A byte slice (usually a hash) that serves as the unique identifier for a specific combination of parameters. It is the common link across `TraceValues`, `Postings`, and `TraceParams`.
- **Indices**: The schema defines specific secondary indices (like `by_source_file_id`) to support administrative workflows, such as identifying all data points associated with a corrupted or updated source file.
- **MetadataSchema**: Specifically handles non-performance data (like external links or diagnostic information) associated with a source file, kept separate from the "hot" path of performance metrics to keep the trace tables lean.
# Module: /go/tracing
### High-Level Overview
The `perf/go/tracing` module serves as a specialized wrapper for initializing distributed tracing within Perf applications. It bridges the gap between the generic infrastructure-level tracing utilities and the specific configuration requirements of a Perf instance.
Its primary purpose is to ensure that performance data and request flows across Perf services are captured and exported consistently to a tracing backend (typically Google Cloud Trace) without requiring each sub-service to manually manage initialization logic or environment-specific metadata.
### Design and Implementation Decisions
#### Centralized Initialization
The module abstracts the complexity of `OpenCensus` initialization. By consolidating this in one place, the project ensures that all Perf components—such as the frontend, ingestion service, and query engine—use identical sampling logic and metadata tagging. This consistency is crucial for correlating traces across different service boundaries.
#### Metadata Enrichment
A key design choice in `Init` is the automatic injection of contextual metadata into every trace.
- **Pod Identification:** By capturing the `MY_POD_NAME` environment variable (injected via Kubernetes templates), the module allows developers to pinpoint exactly which container instance handled a specific request.
- **Instance Scoping:** Since a single Perf deployment can represent different logical instances (e.g., "skia", "chrome", "flutter"), the `instance` name is included in the trace attributes to allow for easy filtering in the tracing dashboard.
#### Conditional Activation (Local vs. Production)
Tracing is intentionally bypassed when running in `local` mode. This prevents development environments from attempting to authenticate with cloud-based tracing exporters or polluting production trace data with local testing noise.
#### Configuration-Driven Sampling
The module utilizes `TraceSampleProportion` from the `InstanceConfig`. This allows for dynamic control over the volume of traces generated. High-traffic instances can set a lower proportion to manage costs and overhead, while smaller or more critical instances can increase the sample rate for higher visibility.
### Key Components and Responsibilities
#### `tracing.go`
This is the core of the module, responsible for the following:
1. **Orchestrating Initialization:** It invokes the lower-level `go/tracing` infrastructure package but pre-configures it with Perf-specific defaults.
2. **Project Auto-Detection:** It passes an empty string for the Project ID, signaling the underlying library to use Google Cloud's metadata server to auto-detect the hosting project. This simplifies deployment across different GCP projects.
3. **Environment Mapping:** It transforms the high-level `InstanceConfig` and system environment variables into a structured map of attributes that are attached to every trace span.
### Workflow: Trace Initialization
The following diagram illustrates how the tracing configuration flows from the application startup into the global tracing state.
```text
Application Startup
|
| (local flag, InstanceConfig)
v
+--------------------------+
| perf/go/tracing.Init() |
+--------------------------+
|
|-- Check local flag (Return nil if true)
|-- Extract InstanceName from Config
|-- Fetch MY_POD_NAME from OS
|
v
+-----------------------------------+
| infra/go/tracing.Initialize() | <--- Global Trace Exporter
+-----------------------------------+
|
|-- Sets Sampling Rate (Proportion)
|-- Configures Project ID (Auto-detect)
|-- Attaches {podName, instance} Attributes
v
Tracing Ready
```
# Module: /go/ts
# TypeScript Definition Generation for Perf
The `go/ts` module is a utility program designed to bridge the gap between the Go backend and the TypeScript frontend in the Perf application. Its primary responsibility is to ensure type safety across the network boundary by automatically generating TypeScript interfaces and types from Go structs that are serialized into JSON for the web UI.
### Design Philosophy
The module addresses type drift between the backend and the frontend: when a Go struct used in a JSON response changes, the frontend code often breaks silently if its TypeScript definitions are out of sync.
Instead of manually maintaining duplicate type definitions, this module uses reflection (via the `go2ts` package) to inspect Go structs and produce a source-of-truth TypeScript file. This ensures that:
1. **Type Consistency:** Frontend developers can rely on TypeScript definitions that exactly match the backend's JSON output.
2. **Nominal Typing:** By setting `GenerateNominalTypes = true`, the generator treats specific Go types as distinct in TypeScript, preventing logic errors where structurally similar but semantically different types might be confused.
3. **Documentation of APIs:** The generator acts as a living document of all data structures exchanged between the Perf frontend and backend.
### Key Components and Workflows
#### Main Execution Logic (`main.go`)
The core of the module is a CLI tool that configures a `go2ts.Go2TS` generator. The execution follows a specific sequence:
1. **Initialization:** It instantiates the generator and configures global behaviors, such as ignoring `nil` values for specific mapping types like `paramtools.Params` to prevent unnecessary optionality in TypeScript.
2. **Type Registration:** The bulk of the code involves registering Go types from various sub-packages. It distinguishes between standard structs and "unions" (which Go often represents as constants or enums).
3. **Namespace Organization:** To prevent naming collisions and improve code organization on the frontend, certain types are grouped into namespaces (e.g., `pivot`, `progress`, `ingest`).
4. **Rendering:** Finally, it writes the generated TypeScript code to a specified output file (typically `modules/json/index.ts`).
#### Handling Unions and Enums
Go doesn't have a native "Union" type similar to TypeScript. The module uses a helper function, `addMultipleUnions`, to map collections of Go constants to TypeScript Union types. This is critical for states, statuses, and configuration options (e.g., `regression.Status` or `alerts.ConfigState`), ensuring the frontend can only use valid, predefined values.
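The mapping from Go constants to a TypeScript union can be illustrated without the `go2ts` package. The `Status` type, its values, and `renderUnion` below are illustrative only and do not reproduce the generator's API; the point is that Go reflection cannot enumerate constants, so the full list of values has to be supplied explicitly.

```go
package main

import (
	"fmt"
	"strings"
)

// Status is a Go "enum": a named string type plus a fixed set of constants.
type Status string

const (
	Untriaged Status = "untriaged"
	Positive  Status = "positive"
	Negative  Status = "negative"
)

// AllStatus lists every valid value for the generator to consume.
var AllStatus = []Status{Untriaged, Positive, Negative}

// renderUnion emits a TypeScript union of string literal types.
func renderUnion(name string, values []Status) string {
	quoted := make([]string, len(values))
	for i, v := range values {
		quoted[i] = fmt.Sprintf("'%s'", string(v))
	}
	return fmt.Sprintf("export type %s = %s;", name, strings.Join(quoted, " | "))
}

func main() {
	fmt.Println(renderUnion("Status", AllStatus))
	// prints: export type Status = 'untriaged' | 'positive' | 'negative';
}
```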
#### Workflow Diagram
```text
[Go Source Code] [go/ts/main.go] [TypeScript Output]
| | |
|-- (Reflects on) --------| |
| Structs & Constants | |
| | |
| |-- (Converts to TS) ---->|
| | |
| | |-- index.ts
| | | (Interfaces,
| | | Namespaces,
| | | Unions)
```
### Key Package Dependencies
The module acts as a central registry, importing almost every major data-holding package in the Perf system to expose their structures:
- **`perf/go/frontend/api`**: Defines the shapes of requests and responses for the web API.
- **`perf/go/alerts` & `perf/go/regression`**: Core domain objects for alerting logic and anomaly detection.
- **`perf/go/clustering2`**: Data structures representing results from clustering algorithms.
- **`perf/go/types` & `go/paramtools`**: Low-level primitives for trace keys and parameter sets.
- **`perf/go/chromeperf` & `perf/go/pinpoint`**: Structures for interacting with external Chromeperf and Pinpoint services.
### Usage in Development
The module is intended to be run via `go generate`. When a developer modifies a Go struct that is sent to the frontend, they should trigger the generator to update the TypeScript definitions, which are then checked into version control. This maintains a synchronized state between the two languages.
# Module: /go/types
# /go/types
The `go/types` module serves as the central repository for core domain types and shared constants used throughout the Skia Perf system. It establishes a common language for time-series data, versioning, and anomaly detection configurations, ensuring consistency across data ingestion, storage, and analysis.
## Core Abstractions
### Versioning: Commit and Tile Numbers
The system handles large-scale time-series data by indexing it against repository commits.
- **CommitNumber**: Represents a linear offset from the repository's first commit (0). It assumes a simplified, linear history to facilitate easy indexing and range queries.
- **TileNumber**: To optimize data retrieval and storage, traces are partitioned into "Tiles" (fixed-size chunks of commits). This type represents the index of such a tile.
The module provides conversion logic to navigate between these two coordinate systems:
```text
CommitNumber ----(TileSize)----> TileNumber
[0, 255] / 256 0
[256, 511] / 256 1
```
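The conversion shown in the diagram is integer division. A minimal sketch, with hypothetical function names and the tile size passed in explicitly:

```go
package main

import "fmt"

type CommitNumber int32
type TileNumber int32

const (
	BadCommitNumber CommitNumber = -1
	BadTileNumber   TileNumber   = -1
)

// tileFromCommit maps a commit to the tile that contains it.
func tileFromCommit(c CommitNumber, tileSize int32) TileNumber {
	if c < 0 || tileSize <= 0 {
		return BadTileNumber
	}
	return TileNumber(int32(c) / tileSize)
}

// commitRange returns the first and last commit covered by a tile.
func commitRange(t TileNumber, tileSize int32) (CommitNumber, CommitNumber) {
	if t < 0 || tileSize <= 0 {
		return BadCommitNumber, BadCommitNumber
	}
	begin := int32(t) * tileSize
	return CommitNumber(begin), CommitNumber(begin + tileSize - 1)
}

func main() {
	fmt.Println(tileFromCommit(255, 256), tileFromCommit(256, 256)) // prints 0 1
	fmt.Println(commitRange(1, 256))                                // prints 256 511
}
```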
### Trace Representation
A **Trace** is the fundamental unit of measurement data, represented as a slice of `float32` values.
- **Missing Data**: Traces use a sentinel value (`vec32.MISSING_DATA_SENTINEL`) to represent gaps in measurement, allowing the system to distinguish between a zero value and no data.
- **TraceSet**: A convenience mapping of trace IDs (strings) to their corresponding data slices.
- **TraceSourceInfo**: A thread-safe container that maps specific points in a trace (CommitNumbers) back to their original source file IDs in the database, enabling "drill-down" capabilities from a graph point to the raw data file.
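The role of the sentinel is easiest to see in an aggregate that must skip gaps. In this sketch the constant `missing` stands in for the sentinel in the `vec32` package; its concrete value here is an assumption, and `newTrace` and `mean` are hypothetical helpers.

```go
package main

import "fmt"

// missing stands in for the real sentinel; the value is assumed for this sketch.
const missing float32 = 1e32

type Trace []float32
type TraceSet map[string]Trace

// newTrace returns a trace of length n where every point is "no data".
func newTrace(n int) Trace {
	t := make(Trace, n)
	for i := range t {
		t[i] = missing
	}
	return t
}

// mean averages only real samples. ok is false when the trace has none,
// which is different from a trace whose samples are all zero.
func mean(t Trace) (avg float32, ok bool) {
	var sum float32
	n := 0
	for _, v := range t {
		if v != missing {
			sum += v
			n++
		}
	}
	if n == 0 {
		return 0, false
	}
	return sum / float32(n), true
}

func main() {
	ts := TraceSet{",arch=x86,config=565,": newTrace(4)}
	tr := ts[",arch=x86,config=565,"]
	tr[1] = 0 // a genuine zero measurement
	tr[3] = 2
	fmt.Println(mean(tr)) // prints 1 true
}
```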
## Anomaly Detection & Regression Logic
The module defines the enums and types that control how the system identifies changes in performance:
### Grouping Strategies
Defines how traces are aggregated before analysis:
- **KMeansGrouping**: Clusters similar trace shapes together to identify aggregate shifts.
- **StepFitGrouping**: Analyzes each trace individually to find "steps" (sudden jumps or drops).
### Step Detection Algorithms
Determines the mathematical approach used to identify a regression within a single trace or cluster centroid:
- **Statistical Tests**: `CohenStep` (Effect size) and `MannWhitneyU` (Rank-sum test) for robust change detection.
- **Heuristics**: `PercentStep`, `AbsoluteStep`, and `Const` for simpler magnitude-based thresholds.
### Alerting Actions
Specifies the lifecycle of a detected anomaly via `AlertAction`:
1. **NoAction**: Detection only (no notification).
2. **FileIssue**: Creates a task for a human sheriff to investigate.
3. **Bisection**: Automatically triggers a bisection job to identify the specific culprit commit.
## Key Files
- **types.go**: Contains all struct definitions, enums, and utility methods for coordinate conversion and data structure management.
- **types_test.go**: Validates the math behind commit-to-tile mapping and boundary conditions for invalid indices.
## Design Decisions
- **Linear Versioning**: By using `int32` for `CommitNumber`, the system prioritizes performance and simplicity in indexing over the complexity of a full Git DAG.
- **Thread Safety**: `TraceSourceInfo` uses an internal `sync.RWMutex`. This design choice acknowledges that source information is often updated concurrently during data ingestion while being read by the UI or analysis engines.
- **Sentinel Values**: The use of `BadCommitNumber` (-1) and `BadTileNumber` (-1) provides a standard way to handle errors or uninitialized references without relying on Go's zero value (0), which is a valid index.
# Module: /go/ui
### Overview
The `/go/ui` module serves as the primary backend orchestration layer for the Perf UI. Its main purpose is to bridge the gap between high-level user interactions (like clicking a "shortcut" link or requesting a custom dashboard) and the underlying data storage and processing systems. It acts as a coordinator, delegating specific tasks to specialized submodules like `frame` for data processing or `shortcuts` for state persistence.
### Design Decisions and Implementation Choices
#### State Persistence via Shortcuts
A key design choice in the Perf UI is to avoid massive, complex URLs. Instead of encoding an entire UI state (queries, zoom levels, formula transformations) into the URL, the module uses a "Shortcut" system.
- **Why**: This allows users to share short, immutable links to specific views.
- **How**: The UI sends a state object to the backend; the backend stores it in a database and returns a short ID. When a user visits a link with that ID, the `/go/ui` layer retrieves the original state and hydrates the UI.
#### Decoupling Data Fetching from Rendering
The module is designed around the concept of a `Frame`. A "Frame" is not just raw data, but a structured package containing trace values, metadata, anomaly markers, and display instructions.
- **Why**: This allows the frontend to remain relatively "dumb" regarding data processing logic. The backend decides whether a result should be rendered as a table, a plot, or a pivot view based on the complexity of the request.
- **How**: This logic is encapsulated within the `frame` submodule, which acts as the "brain" for transforming raw trace queries into a format the UI can immediately consume.
#### Progress and Asynchronicity
Because a request can span millions of data points and take seconds to process, the UI backend prioritizes progress tracking.
- **Implementation**: Many operations are wrapped in a progress-tracking context. As the backend fetches data or calculates formulas, it updates a status object that the frontend polls, ensuring the user is never left with a hanging UI.
### Key Workflows
The following diagram shows how the `ui` module coordinates a request to view data, starting from a short URL:
```text
Browser (URL with Shortcut ID)
|
|-- 1. Get ID ----> [ /go/ui/shortcuts ] (Retrieve State)
| |
|<- 2. UI State ----------'
|
|-- 3. Request Data --> [ /go/ui/frame ]
| (State ID) |
| |-- a. Query Tracestore
| |-- b. Run Calculations
| |-- c. Attach Anomalies
| |-- d. Link Commits/Source
| V
|<-- 4. DataFrame <-----------'
|
(Render Graph/Table)
```
### Key Submodules and Responsibilities
#### `/go/ui/frame`
The heavy lifter of the module. It handles the `FrameRequest` lifecycle. It is responsible for:
- **Query Resolution**: Translating user-defined keys into trace data.
- **Calculations**: Invoking the `calc` engine to process mathematical formulas on the fly.
- **Metadata Enrichment**: Attaching human-readable links to source code repositories (e.g., Chromium, V8) by comparing commit hashes in the trace data.
- **Anomaly Integration**: Overlaying regression data onto the performance traces.
#### `/go/ui/shortcuts`
Manages the lifecycle of "Shortcuts" (short IDs that map to complex UI states).
- It provides the persistence layer for the "Explore" page.
- It ensures that UI configurations can be shared and bookmarked without hitting URL length limits.
#### UI Logic and Configuration
Beyond the submodules, the root `/go/ui` package often contains the logic for global UI settings and navigation. It determines which features are enabled based on the instance configuration (e.g., whether to show anomaly detection features or specific repository links).
# Module: /go/ui/frame
The `/go/ui/frame` module is responsible for orchestrating the transition from a user's high-level data request (represented as queries, formulas, or shortcuts) into a rich, structured `DataFrame` suitable for visualization in the Perf frontend. It acts as the "brain" of the Explore page, managing the complexity of parallel data fetching, calculation, pivoting, and metadata enrichment.
### Core Responsibility: Request Processing
The primary entry point is `ProcessFrameRequest`, which manages the lifecycle of a `FrameRequest`. A single request can contain multiple data sources—queries, mathematical formulas, and pre-saved shortcut keys—all of which must be aggregated into a unified view.
The module follows a structured workflow to build a response:
1. **Data Fetching**: It uses a `DataFrameBuilder` to fetch raw trace data based on the provided queries or shortcuts.
2. **Calculation**: If formulas are provided, it leverages the `go/calc` engine to perform transformations (like `sum()` or `filter()`) on the fetched traces.
3. **Pivoting**: If requested, it reshapes the data using the `pivot` module, aggregating traces by specific parameters.
4. **Enrichment**: It decorates the data with external context, such as anomaly markers from Chrome Perf and source file metadata (links to repositories like V8 or WebRTC).
5. **Progress Tracking**: Because data fetching can be long-running, the module updates a `progress.Progress` object to give the frontend real-time status updates.
### Design Decisions and Implementation
#### Handling Hybrid Request Types
The module supports two distinct ways of looking at time: `REQUEST_TIME_RANGE` (absolute Unix timestamps) and `REQUEST_COMPACT` (a fixed number of commits leading up to a point). The implementation abstracts this difference by passing specialized parameters to the `dfBuilder` while maintaining a consistent internal `DataFrame` structure.
#### Trace Filtering and Sentinels
One specific design choice is the use of the `preflightqueryprocessor` (pqp). Before fetching data, the module prepares queries with "sentinels" (e.g., `__missing__`). This allows the system to handle complex queries where a user specifically wants to find traces that lack a certain parameter, which is then enforced through an in-memory `filterTraceSet` pass after the raw data is loaded.
#### Intelligent Metadata Linking
The `getMetadataForTraces` and `populateTraceMetadataLinksBasedOnConfig` functions implement logic to generate human-readable commit ranges. Instead of just showing a raw hash, they compare the current commit to the previous one in the trace and generate a `+log/prev..current` link if a change is detected. This is specifically tuned for major repositories like Chromium, V8, and WebRTC via configuration.
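The link construction can be sketched as a small pure function. `commitRangeLink` is a hypothetical helper, the repository URL in the example is a placeholder, and returning an empty string when nothing changed is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// commitRangeLink returns a "+log/prev..current" URL for a repository, or an
// empty string when there is no previous hash or the hash did not change.
func commitRangeLink(repoURL, prev, current string) string {
	if prev == "" || prev == current {
		return ""
	}
	return fmt.Sprintf("%s/+log/%s..%s", strings.TrimSuffix(repoURL, "/"), prev, current)
}

func main() {
	fmt.Println(commitRangeLink("https://example.googlesource.com/repo", "aaa111", "bbb222"))
	// prints https://example.googlesource.com/repo/+log/aaa111..bbb222
}
```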
#### Response Display Modes
The module determines how the frontend should render the data by analyzing the `FrameRequest`.
- If a pivot has a summary operation, it sets `DisplayPivotTable`.
- If it has a group-by but no summary, it sets `DisplayPivotPlot`.
- Otherwise, it defaults to a standard `DisplayPlot`.
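The three rules above translate directly into a short decision function. The struct and constant names below are stand-ins chosen for this sketch; only the rule order is taken from the text.

```go
package main

import "fmt"

// pivotRequest holds the two fields the decision depends on.
type pivotRequest struct {
	GroupBy []string
	Summary []string
}

type displayMode string

const (
	displayPlot       displayMode = "DisplayPlot"
	displayPivotTable displayMode = "DisplayPivotTable"
	displayPivotPlot  displayMode = "DisplayPivotPlot"
)

// chooseDisplay applies the rules in order: summary first, then group-by,
// then the default plot.
func chooseDisplay(p *pivotRequest) displayMode {
	if p != nil && len(p.Summary) > 0 {
		return displayPivotTable
	}
	if p != nil && len(p.GroupBy) > 0 {
		return displayPivotPlot
	}
	return displayPlot
}

func main() {
	fmt.Println(chooseDisplay(nil))                                      // prints DisplayPlot
	fmt.Println(chooseDisplay(&pivotRequest{GroupBy: []string{"arch"}})) // prints DisplayPivotPlot
}
```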
### Key Workflows
The following diagram illustrates how the `frameRequestProcess` coordinates different sub-systems:
```text
User Request (FrameRequest)
|
|---- Queries ----> [ DataFrameBuilder ] ----.
| |
|---- Keys -------> [ ShortcutStore ] -------|--> [ Combined DataFrame ]
| | | |
|---- Formulas ---> [ calc.Eval ] <----------' |
|
.--------------------------------------------------------'
|
|---- [ pivot.Pivot ] (Optional Reshaping)
|
|---- [ anomalies.Store ] (Attach Anomaly Markers)
|
|---- [ MetadataStore ] (Attach Source Links)
|
V
Final Response (FrameResponse)
```
### Key Components
- **`FrameRequest` / `FrameResponse`**: The JSON-serializable structures that define the API between the frontend (Explore page) and the backend logic.
- **`frameRequestProcess`**: A private struct that maintains the state of a single request, including progress counters and references to required stores (Git, Shortcuts, Tracestore).
- **`doSearch` / `doCalc` / `doKeys`**: Internal methods that isolate the logic for different data retrieval strategies. `doCalc` is notable for providing callback functions (`rowsFromQuery`, `rowsFromShortcut`) to the calculation engine, allowing formulas to recursively fetch data.
- **Anomaly Integration**: Functions like `addRevisionBasedAnomaliesToResponse` bridge the gap between the trace data and the anomaly detection system, ensuring that points on a graph can be highlighted if they represent performance regressions.
# Module: /go/urlprovider
# URL Provider
The `urlprovider` module is a utility component within the Skia Perf system designed to programmatically generate deep-link URLs for various Perf UI pages. It centralizes the logic for constructing complex query parameters, ensuring that links to the Explore page, MultiGraph page, and Group Reports are consistent across the application.
## High-level Overview
The primary goal of this module is to abstract the transformation of internal state—such as commit numbers, trace parameters, and shortcut IDs—into URL strings that the Perf frontend can interpret.
A key design choice in this module is the integration with `perfgit.Git`. Because the Perf UI relies on Unix timestamps for time-range filtering rather than raw commit numbers, the `URLProvider` uses the Git service to resolve commit numbers into their corresponding timestamps. This ensures that generated URLs point to the correct temporal window even as the underlying data evolves.
## Key Components and Responsibilities
### URLProvider Struct
Defined in `urlprovider.go`, this is the main stateful component. It requires an instance of `perfgit.Git` to perform commit-to-timestamp lookups.
- **Time Range Resolution**: The provider automatically converts a range of commit numbers into `begin` and `end` URL parameters. A specific implementation choice made here is to shift the `end` time forward by one day (`AddDate(0, 0, 1)`). This is done to ensure that the data points or anomalies associated with the final commit are clearly visible on the rendered graph and not cut off at the edge of the display.
- **Explore Page Generation**: The `Explore` method constructs URLs for the `/e/` endpoint. It handles the nesting of trace queries by encoding trace parameters into a single `queries` parameter.
- **MultiGraph Generation**: The `MultiGraph` method targets the `/m/` endpoint, utilizing a `shortcut` ID (representing a saved set of traces) rather than a raw query string.
- **Dynamic Customization**: Both methods support a `disableFilterParentTraces` flag (translated to the `disable_filter_parent_traces` query parameter) and allow for arbitrary additional query parameters via a `url.Values` argument.
### Static Group Reporting
The `GroupReport` function is a stateless utility. It generates URLs for the `/u/` endpoint, which is typically used for viewing anomaly groups or specific bugs.
- **Validation**: To prevent the generation of malformed or unsupported URLs, it enforces an allow-list of valid parameters: `anomalyGroupID`, `anomalyIDs`, `bugID`, `rev`, and `sid`.
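An allow-list check of this kind might look like the following. The parameter names come from the list above; the `groupReport` signature is a simplification assumed for this sketch.

```go
package main

import (
	"fmt"
	"net/url"
)

// allowedGroupReportParams is the set of query parameters the /u/ page accepts.
var allowedGroupReportParams = map[string]bool{
	"anomalyGroupID": true,
	"anomalyIDs":     true,
	"bugID":          true,
	"rev":            true,
	"sid":            true,
}

// groupReport builds a /u/ URL, rejecting any parameter not on the allow-list.
func groupReport(param, value string) (string, error) {
	if !allowedGroupReportParams[param] {
		return "", fmt.Errorf("unsupported group report parameter %q", param)
	}
	q := url.Values{}
	q.Set(param, value)
	return "/u/?" + q.Encode(), nil
}

func main() {
	fmt.Println(groupReport("bugID", "12345"))
	fmt.Println(groupReport("unknown", "1"))
}
```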
## Key Workflows
### URL Generation Process
The following diagram illustrates how the `URLProvider` orchestrates data from internal services to produce a frontend URL:
```text
Input Parameters Git Service (perfgit)
(Commit Nums, Query) |
| |
v |
+--------------------------+ |
| URLProvider.Explore() | |
+------------+-------------+ |
| |
|---- Request Timestamps ------->|
|<--- Return Unix Timestamps ----|
|
| (Internal Logic)
| - Add 1 day buffer to End Time
| - Encode Query parameters
| - Append Optional Filters
|
v
Result: "/e/?begin=123&end=456&queries=..."
```
### File Responsibilities
- `urlprovider.go`: Contains the logic for calculating time ranges, encoding parameters, and building the final URL paths for Explore, MultiGraph, and Group Report pages.
- `urlprovider_test.go`: Validates that the URL generation correctly handles escaping, timestamp calculation, and optional parameter injection. It uses a mockable or test instance of the Git service to verify the integration between commit numbers and timestamps.
# Module: /go/userissue
# User Issue Module
The `userissue` module provides the core abstractions and storage logic for associating external issue tracker IDs (specifically Buganizer) with specific performance data points in the Perf system.
By linking a "trace key" and a "commit position" to a specific issue ID, the system allows developers and automated tools to contextualize performance anomalies with human-reported bugs. This enables the Perf UI to overlay bug information directly onto graphs, helping users understand if a regression or change is already being tracked.
## Design Philosophy
The module is designed around the concept of a **point-in-time association**. Because a trace represents a series of data points over time, an issue is not just linked to a trace, but to a specific moment in that trace's history (the commit position).
### Key Abstractions
- **`UserIssue` Struct**: Represents the core domain model. It contains the identity of the user who created the association, the `TraceKey`, the `CommitPosition`, and the external `IssueId`.
- **`Store` Interface**: Defines a storage-agnostic contract for persisting and retrieving these associations. This abstraction allows the system to swap underlying database implementations (e.g., switching to a different SQL dialect or a NoSQL provider) without affecting the business logic or the UI handlers.
## Implementation Details
The module follows a clean separation between the interface definition and its concrete implementations:
1. **Core Interface (`store.go`)**: Defines the required operations:
   - `Save`: Persists a new association.
   - `Delete`: Removes an association based on the unique combination of trace and commit.
   - `GetUserIssuesForTraceKeys`: A bulk retrieval method designed for high-performance graph rendering, fetching all issues related to a set of traces within a specific range of commits.
2. **SQL Implementation (`sqluserissuestore`)**: A production-ready implementation that uses SQL (compatible with Spanner). It utilizes dynamic SQL templating to handle bulk queries efficiently, ensuring that complex filters over varying numbers of trace keys remain performant.
3. **Mock Implementation (`mocks`)**: Provides automated mocks for testing. This allows other parts of the Perf system (like the alert service or the API layer) to simulate database interactions, error conditions, and specific data scenarios without requiring a live database connection.
## Key Workflows
### Creating and Visualizing an Issue Association
When a user identifies a performance change on a graph and links it to a bug, the data flows through the following path:
```text
User Interaction          Perf Backend                  Store Implementation
       |                        |                                 |
       |-- 1. Create Link ----->|                                 |
       |   (Trace, Commit, ID)  |-- 2. Call Save() -------------->|
       |                        |                                 |-- 3. SQL INSERT
       |                        |                                 |   (last_modified set)
       |                        |                                 |
       |<--- 4. Confirmation ---|                                 |
       |                        |                                 |
       |                        |                                 |
       |-- 5. Refresh Graph --->|                                 |
       |                        |-- 6. Bulk Fetch --------------->|
       |                        |   (for all traces in view)      |-- 7. SQL Template
       |                        |                                 |   (IN clause generated)
       |<--- 8. Data with IDs --|                                 |
```
## Data Integrity and Constraints
The module relies on a composite primary key consisting of the `trace_key` and the `commit_position`. This design decision ensures:
- **Uniqueness**: A single data point (trace + commit) can only be associated with one issue at a time in the storage layer.
- **Precise targeting**: Deletion and retrieval operations address a record by both components, so an operation on one commit cannot accidentally affect associations at other commits in the same trace.
# Module: /go/userissue/mocks
# User Issue Mocks
The `userissue/mocks` module provides automated mock implementations of the interfaces defined in the `userissue` package. Its primary purpose is to facilitate unit testing for components that depend on user issue persistence without requiring a live database or a complex manual setup.
## Design Philosophy
This module utilizes **test-double generation** via `mockery`. By generating mocks based on the `Store` interface, the project ensures that the testing utilities stay in lockstep with the actual production code.
The decision to provide a dedicated `mocks` package serves two main purposes:
1. **Decoupling**: Tests in other modules (such as API handlers or alert logic) can import this package to simulate various data scenarios—such as database errors, empty result sets, or specific issue lists—without being coupled to the underlying SQL or Spanner implementations of the `Store`.
2. **Test Reliability**: By using the `testify/mock` framework, developers can write assertive tests that verify not just the output of a function, but also that the interactions with the storage layer (e.g., "was the correct trace key deleted?") happened exactly as expected.
## Key Components
### Store Mock (`Store.go`)
The `Store` struct is the central component of this module. It implements the `userissue.Store` interface, providing mockable versions of the following operations:
- **Persistence Operations (`Save`, `Delete`)**: These allow tests to verify that the application correctly attempts to write or remove user issue metadata associated with specific traces and commit positions.
- **Retrieval Operations (`GetUserIssuesForTraceKeys`)**: This mimics the complex querying of user issues over a range of commits. In a test environment, this is crucial for simulating "found" vs "not found" states when rendering performance graphs or dashboards.
## Usage Workflow
The typical workflow involves initializing the mock within a test suite, setting expectations for method calls, and then injecting the mock into the high-level business logic.
```text
Test Suite              Component Under Test             Mock Store
     |                            |                           |
     |---- 1. Setup Mock -------->|                           |
     |     (NewStore)             |                           |
     |                            |                           |
     |---- 2. Expect Save() ----->|                           |
     |                            |                           |
     |---- 3. Call Business Logic |                           |
     |     (e.g. CreateIssue) ----|---- 4. Call Save() ------>|
     |                            |                           |
     |                            |<--- 5. Return nil/err ----|
     |                            |                           |
     |<--- 6. Verify Result ------|                           |
     |                            |                           |
     |---- 7. Assert Expectations |                           |
     |     (Check if Save was called)
```
## Implementation Details
- **Mockery Integration**: The files are autogenerated. Manual changes should be avoided; instead, the `Store` interface in the parent `userissue` package should be updated and the mock regenerated.
- **Safety**: The `NewStore` constructor automatically registers a cleanup function using `t.Cleanup`. This ensures that `AssertExpectations` is called at the end of every test, preventing "silent failures" where a test passes even if an expected database call never actually occurred.
# Module: /go/userissue/sqluserissuestore
### Overview
The `sqluserissuestore` module provides an SQL-backed implementation of the `userissue.Store` interface. Its primary purpose is to persist and retrieve associations between performance anomalies (identified by a trace and a specific commit) and external issue tracking IDs (specifically Buganizer).
By storing these relationships, the Perf system can contextualize automated performance data with human-reported issues, allowing the UI to overlay bug information directly onto graphs and alerts.
### Design Decisions & Implementation Choices
#### SQL Templating for Dynamic Queries
A key requirement of the store is fetching issue associations across a variable number of trace keys within a specific commit range.
- **Implementation**: The module uses Go's `text/template` package to dynamically construct SQL queries for the `GetUserIssuesForTraceKeys` method.
- **Why**: Since SQL `IN` clauses require a specific number of placeholders corresponding to the slice length of input keys, templating allows the store to generate the correct number of `$n` placeholders at runtime while maintaining compatibility with prepared statement parameters to prevent SQL injection.
#### Consistency and Integrity
The implementation relies on the database schema's composite primary key (trace key + commit position) to enforce data integrity.
- **Error Handling**: The `Save` method does not use "upsert" (insert-or-update) logic. Instead, it performs a standard `INSERT`. If an association already exists for a specific trace at a specific commit, the database returns a constraint violation, which the store wraps as an error. This ensures that users do not inadvertently overwrite existing issue mappings without an explicit deletion or update workflow.
- **Explicit Deletion Checks**: The `Delete` operation performs a lookup before executing the `DELETE` statement. This ensures the system can provide feedback if a user attempts to remove a record that doesn't exist, preventing silent failures in the UI.
#### Timestamp Management
The store captures the current system time during the `Save` operation and persists it to the `last_modified` column. This centralizes the "last modified" logic within the store implementation, ensuring that even if the client doesn't provide a timestamp, the database reflects when the association was actually created or modified.
### Key Components
#### `UserIssueStore`
Located in `sqluserissuestore.go`, this is the central struct that satisfies the `userissue.Store` interface. It wraps a `pool.Pool` connection to the database. Its methods translate high-level domain objects into SQL commands:
- **Save**: Persists a new `UserIssue` record.
- **Delete**: Removes an association based on the unique combination of trace and commit.
- **GetUserIssuesForTraceKeys**: Performs a bulk retrieval of issues for a list of traces over a range of commits.
#### `listUserIssues` Template
This SQL template handles the most complex query in the module. It filters the `UserIssues` table by a set of trace keys and a closed interval of commit positions (`>= Begin` and `<= End`).
### Workflow: Retrieving Issues for a Graph
When the Perf UI renders a graph containing multiple traces, it needs to know which data points have associated bugs. The data flows as follows:
```text
[ Perf UI ]
|
| Request (Traces: ["A", "B"], Commit Range: 100-200)
v
[ UserIssueStore.GetUserIssuesForTraceKeys ]
|
|-- 1. Generate SQL Template: "SELECT ... WHERE trace_key IN ($1, $2) AND ..."
|-- 2. Execute Query with trace keys and range parameters
v
[ SQL Database ]
|
|-- 3. Filter UserIssues table by PK components
v
[ UserIssueStore ]
|
|-- 4. Map SQL Rows to []userissue.UserIssue
v
[ Perf UI ] (Displays bug icons on relevant graph points)
```
### Component Files
- **`sqluserissuestore.go`**: Contains the logic for CRUD operations and the SQL templates used to interact with the database.
- **`schema/`**: (Referenced by the store) Defines the table structure, ensuring that `trace_key` and `commit_position` act as the unique identifier for any given issue association.
- **`sqluserissuestore_test.go`**: Validates the store's behavior against a real SQL instance (typically Spanner for tests), ensuring that constraints are respected and queries return accurate data.
# Module: /go/userissue/sqluserissuestore/schema
### Overview
The `schema` module defines the structural contract for persisting user-reported issue associations within the Perf backend. It serves as the single source of truth for the SQL table structure used by the `sqluserissuestore`.
The primary goal of this schema is to bridge the gap between performance anomalies (represented by a specific trace at a specific point in time) and external issue trackers (Buganizer). By maintaining this mapping, the system can overlay human-provided context onto automated performance graphs and reports.
### Design Decisions & Implementation Choices
#### Compound Primary Key
The schema uses a composite primary key consisting of `trace_key` and `commit_position`. This choice reflects the functional requirement that an issue association is unique to a specific data point.
- **Why**: A single trace might have different issues at different points in its history. Conversely, multiple traces might be affected by the same issue. By keying on the trace/commit pair, the store ensures data integrity while allowing the same `IssueId` to be linked to multiple regressions across the system.
#### User Attribution
The `UserId` field is explicitly included to capture the email of the person who created the association.
- **How**: This is intended to be populated by the identity provided by the `uber-proxy` authentication layer. This provides an audit trail and allows the system to identify who is responsible for specific manual annotations.
#### Temporal Tracking
The `LastModified` field utilizes the `TIMESTAMPTZ` type with a default of `now()`.
- **Why**: Using a timestamp with time zone ensures that updates are consistent regardless of the server's local time configuration. The use of a default value simplifies the application logic, as the database handles the record-keeping of when an association was last touched.
### Key Components
#### `UserIssueSchema`
Located in `schema.go`, this struct defines the layout of the `UserIssues` table. It maps Go types to SQL definitions:
- **Trace and Commit Identity**: `TraceKey` (string) and `CommitPosition` (int) define _where_ and _when_ the issue occurred.
- **Issue Identity**: `IssueId` (int) links the record to the external Buganizer ticket.
- **Metadata**: `UserId` and `LastModified` provide context on the origin and age of the data.
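As a rough illustration, such a schema struct might look like the sketch below. The `sql` struct-tag convention, the exact column types, and the `columnDefs` helper are assumptions. Only the column names, the composite key, and the `TIMESTAMPTZ` default come from the description above.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// UserIssueSchema is an illustrative guess at the table layout.
// The tag format and Go field types are assumptions.
type UserIssueSchema struct {
	UserId         string    `sql:"user_id TEXT NOT NULL"`
	TraceKey       string    `sql:"trace_key TEXT NOT NULL"`
	CommitPosition int64     `sql:"commit_position INT NOT NULL"`
	IssueId        int64     `sql:"issue_id INT NOT NULL"`
	LastModified   time.Time `sql:"last_modified TIMESTAMPTZ DEFAULT now()"`

	// The composite key is expressed as a table-level constraint.
	primaryKey struct{} `sql:"PRIMARY KEY(trace_key, commit_position)"`
}

// columnDefs returns the sql tag of every tagged field in declaration order.
// It expects a struct value, not a pointer.
func columnDefs(v any) []string {
	t := reflect.TypeOf(v)
	defs := make([]string, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		if tag, ok := t.Field(i).Tag.Lookup("sql"); ok {
			defs = append(defs, tag)
		}
	}
	return defs
}

func main() {
	for _, d := range columnDefs(UserIssueSchema{}) {
		fmt.Println(d)
	}
}
```

Keeping the column definitions next to the Go fields is one way to let a single struct act as the source of truth for both the table and the code that reads it.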
### Workflow: Data Association
When a user identifies a performance change in the UI and associates it with a bug, the data flows as follows:
```text
[ User Action ]
       |
       | (Auth: User Email)
       v
[ Perf Frontend ] ----> [ sqluserissuestore ]
                                  |
                                  | (Maps struct to SQL)
                                  v
                          [ SQL Database ]
              +---------------------------------------+
              | Table: UserIssues                     |
              | PK: (trace_key, commit_position)      |
              | Data: issue_id, user_id, last_modified|
              +---------------------------------------+
```
Because `(trace_key, commit_position)` is the primary key, a second association for the same trace and commit can never be stored as a duplicate row. The store's `Save` method performs a plain `INSERT` rather than an upsert, so such an attempt is rejected with a constraint violation, maintaining a clean 1:1 mapping between a data point and its associated issue.
# Module: /go/workflows
### High-Level Overview
The `go/workflows` module serves as the public interface and contract definition for Skia Perf's automated orchestration system. It defines the entry points for complex, long-running processes—such as performance bisection and culprit analysis—that are executed via the [Temporal](https://temporal.io/) workflow engine.
The primary purpose of this module is to **decouple the workflow callers from the workflow implementations**. By providing standardized parameter structures and string-based workflow identifiers, it allows various parts of the Skia infrastructure to trigger orchestration logic without needing to import the heavy dependencies of the internal activity and workflow implementations.
### Design Decisions and Implementation Choices
#### Decoupled Service Invocation
The module defines constants for workflow names (e.g., `ProcessCulprit`, `MaybeTriggerBisection`). This design choice is critical for Temporal-based systems:
- **Source Independence**: It allows client services to start workflows by name without linking against the `internal/` implementation code, which typically includes gRPC clients, Gerrit connectors, and complex business logic.
- **Inter-Service Communication**: It facilitates a "fire-and-forget" or "fire-and-wait" pattern where the caller only needs to know the "contract" (the parameter and result structs) rather than the "how" of the execution.
#### Structured Parameter Passing
Rather than passing loose variables, the module defines explicit `Param` and `Result` structs for every workflow.
- **Evolutionary Compatibility**: Using structs allows for adding optional fields in the future without breaking the function signatures of the workflow callers.
- **Service Discovery**: Parameters often include service URLs (e.g., `AnomalyGroupServiceUrl`). This pushes the responsibility of service location to the caller or the configuration layer, keeping the workflows themselves more generic and testable across different environments.
### Key Components
#### Workflow Definitions (`workflows.go`)
This file acts as the "API header" for the orchestration layer. It defines two primary workflows:
- **MaybeTriggerBisection**:
  - **Responsibility**: Manages the lifecycle of an anomaly group, deciding whether to initiate a Pinpoint bisection or simply report the anomalies to a developer.
  - **Parameters**: Requires connectivity information for the Anomaly Group and Culprit services, the specific group ID to process, and the Task Queues where sub-tasks should be routed.
  - **Result**: Returns a `JobId` (typically a Pinpoint Job ID) if a bisection was successfully triggered.
- **ProcessCulprit**:
  - **Responsibility**: Handles the post-processing logic once a culprit has been identified by a bisection engine. This includes transforming commit data into internal formats and persisting them.
  - **Parameters**: Takes a list of commits (using Pinpoint's proto definition) and the associated anomaly group.
  - **Result**: Returns lists of `CulpritIds` and `IssueIds` generated during the persistence and notification phase.
### Workflow Orchestration Process
The following diagram illustrates how this module fits into the broader system architecture, acting as the bridge between the service triggering the work and the workers executing it:
```text
[ Caller Service ]         [ Temporal Cluster ]            [ Perf Worker ]
        |                            |                            |
        | 1. Start Workflow          |                            |
        |    (using Param struct)    |                            |
        +--------------------------->|                            |
        |                            | 2. Schedule Task           |
        |                            +--------------------------->|
        |                            |                            |
        |                            | 3. Execute Implementation  |
        |                            |    (defined in /internal)  |
        |                            |<---------------------------+
        | 4. Return Result           |                            |
        |    (using Result struct)   |                            |
        |<---------------------------+                            |
```
### Key Submodules
- **`internal/`**: Contains the actual Go logic for the workflows and activities. This is where the gRPC calls to Gerrit, Anomaly Group services, and Culprit services are implemented. It handles the "wait-and-retry" logic and the 30-minute aggregation period for anomalies.
- **`worker/`**: The executable entry point. It registers the implementations from `internal/` against the names defined in `workflows.go` and listens on the Temporal Task Queue for incoming work.
# Module: /go/workflows/internal
This module contains the internal Temporal workflow and activity implementations for the Perf orchestration system. It is responsible for the automated lifecycle of performance anomalies—from grouping and initial triage to triggering bisections and notifying users.
### High-Level Overview
The module acts as the "glue" between various Skia Perf and Pinpoint services. It orchestrates complex, long-running processes that involve waiting for data stability, interacting with external gRPC services (Anomaly Group, Culprit, and Gerrit), and managing child workflows for performance bisection.
By leveraging Temporal, these workflows provide durability and fault tolerance for operations that can take hours or even days to complete (such as a Pinpoint bisection).
### Key Workflows
#### MaybeTriggerBisectionWorkflow
This is the primary entry point for processing a newly detected anomaly group. It manages the decision logic for how to handle performance regressions.
1. **Wait Period**: The workflow begins by sleeping for 30 minutes. This design choice allows the system to aggregate more anomalies into the same group before taking action, preventing redundant bisections or notifications.
2. **Action Dispatch**: Based on the `GroupAction` type of the anomaly group, it branches:
   - **BISECT**: It resolves the git hashes for the anomaly range via Gerrit, parses benchmark/story metadata, and triggers a Pinpoint `CulpritFinderWorkflow` as a child workflow. It specifically uses the `PARENT_CLOSE_POLICY_ABANDON` policy to ensure bisections continue even if the triggering workflow completes.
   - **REPORT**: It gathers the top anomalies in the group and triggers a user notification (typically a bug report) via the Culprit service.
3. **State Synchronization**: After triggering an action, it updates the Anomaly Group service with the resulting Bisection ID or Issue ID.
#### ProcessCulpritWorkflow
Invoked after a bisection successfully identifies a culprit, this workflow handles the "aftermath" of a find:
- **Data Transformation**: It converts Pinpoint-specific commit formats into the internal Culprit service proto format.
- **Persistence**: It calls the Culprit service to permanently store the identified culprit.
- **Notification**: It triggers the notification logic to alert developers about the specific commit that caused the regression.
### Component Responsibilities
#### Activities
Activities wrap gRPC client calls to external services, providing retry logic and timeout management defined in `options.go`.
- **AnomalyGroupServiceActivity**: Interfaces with the Anomaly Group service to load group metadata, find specific anomalies within a group, and update group status (e.g., attaching a Bisection ID).
- **CulpritServiceActivity**: Interfaces with the Culprit service. It handles persisting culprit data and sending notifications for both automated bisection results and manual anomaly reports.
- **GerritServiceActivity**: Used primarily to resolve commit positions (integers) into full Git hashes (strings) required by the Pinpoint bisection engine.
#### Design Decisions and Utilities
- **Legacy Descriptor Mapping**: The logic in `maybe_trigger_bisection.go` (e.g., `benchmarkStoriesNeedUpdate`) mimics legacy Catapult dashboard behavior. It handles special cases for "System Health" benchmarks where story names require character replacement (e.g., `_` to `:`) to remain compatible with Pinpoint expectations.
- **Statistic Parsing**: The system automatically extracts the measurement and statistic (e.g., `max`, `std`) from the chart name string. This is necessary because the Perf database often stores these as a combined string, while Pinpoint requires them as separate parameters.
- **Temporal Options**: `options.go` defines strict 1-minute timeouts for gRPC-based activities to ensure the system doesn't hang on network issues, while allowing up to 12 hours for child workflows to accommodate the long compile and execution times of performance tests.
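The statistic parsing mentioned above could be sketched as a suffix split. The underscore separator, the list of recognized statistics, and the `splitChart` name are assumptions for illustration; the actual rules in `maybe_trigger_bisection.go` may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// knownStats is an illustrative list of recognized statistic suffixes.
var knownStats = []string{"avg", "count", "max", "min", "std", "sum"}

// splitChart separates a combined "measurement_stat" chart name into its parts.
// If no known suffix is present, it returns the chart unchanged and an empty stat.
func splitChart(chart string) (measurement, stat string) {
	for _, s := range knownStats {
		suffix := "_" + s
		// Require a non-empty measurement in front of the suffix.
		if strings.HasSuffix(chart, suffix) && len(chart) > len(suffix) {
			return strings.TrimSuffix(chart, suffix), s
		}
	}
	return chart, ""
}

func main() {
	for _, c := range []string{"timeToFirstPaint_max", "memory_std", "frame_times"} {
		m, s := splitChart(c)
		fmt.Printf("%q -> measurement=%q stat=%q\n", c, m, s)
	}
}
```

Matching only against a fixed list avoids misreading names such as `frame_times`, whose final segment is not a statistic.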
### Workflow Logic Diagram
The following diagram illustrates the flow of the `MaybeTriggerBisectionWorkflow`:
```text
[ Start ]
    |
    v
[ Sleep (30m) ] <-- Wait for more anomalies to group
    |
    v
[ Load Anomaly Group ]
    |
    +----( GroupAction == BISECT? )----> [ Resolve Git Hashes ]
    |                                              |
    |                                              v
    |                                    [ Trigger Pinpoint ]
    |                                              |
    |                                              v
    |                                    [ Update Group w/ JobID ]
    |
    +----( GroupAction == REPORT? )----> [ Fetch Top 10 Anomalies ]
    |                                              |
    |                                              v
    |                                    [ Notify User / Create Bug ]
    |                                              |
    |                                              v
    |                                    [ Update Group w/ IssueID ]
    v
[ End ]
```
# Module: /go/workflows/worker
### Overview
The `go/workflows/worker` module implements the executable entry point for the Temporal worker responsible for executing Skia Perf's backend automation workflows. It serves as the bridge between the Temporal orchestration engine and the specific business logic required for anomaly detection, bisection triggering, and culprit management.
The primary design goal of this module is to provide a scalable, stateless execution environment. By decoupling the workflow definitions from the service that triggers them, the worker can be scaled independently to handle varying loads of performance analysis tasks.
### Architecture and Design Choices
The worker is designed as a long-running daemon that connects to a Temporal cluster. It registers a set of **Workflows** (stateful orchestrations) and **Activities** (idempotent units of work) and then listens to a specific task queue for instructions.
#### Connection Management
The worker establishes a connection to the Temporal service via `client.Dial`. This connection is configured with custom metrics handling to export Temporal-specific telemetry to Prometheus, ensuring visibility into worker health, task latency, and execution success rates.
#### Service Registration
The worker registers several domain-specific activities and workflows. The registration process maps internal Go functions to string-based identifiers used by the Temporal cluster to route tasks.
- **Workflow Orchestration**: It registers high-level workflows like `ProcessCulprit` and `MaybeTriggerBisection`. These functions orchestrate complex, long-running processes that might involve waiting for external signals or timers.
- **Activity Execution**: It registers service-specific activities (`CulpritServiceActivity`, `AnomalyGroupServiceActivity`, `GerritServiceActivity`). These are the "muscles" of the system, performing side-effect-heavy operations such as calling the Anomaly Group and Culprit services or querying Gerrit for commit information.
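The name-to-function mapping can be illustrated with a small stand-in registry. The `registry` type and the `handler` signature are hypothetical simplifications and not the Temporal SDK's registration API, which the real worker uses.

```go
package main

import "fmt"

// handler is a simplified stand-in for a workflow or activity function.
type handler func(input string) (string, error)

// registry maps string identifiers to implementations, as a worker does.
type registry struct {
	handlers map[string]handler
}

func newRegistry() *registry {
	return &registry{handlers: map[string]handler{}}
}

// register binds a name to an implementation and rejects duplicates.
func (r *registry) register(name string, h handler) error {
	if _, dup := r.handlers[name]; dup {
		return fmt.Errorf("%q is already registered", name)
	}
	r.handlers[name] = h
	return nil
}

// dispatch routes a task to the implementation registered under name.
func (r *registry) dispatch(name, input string) (string, error) {
	h, ok := r.handlers[name]
	if !ok {
		return "", fmt.Errorf("no implementation registered for %q", name)
	}
	return h(input)
}

func main() {
	r := newRegistry()
	_ = r.register("ProcessCulprit", func(in string) (string, error) {
		return "stored culprit for " + in, nil
	})
	out, err := r.dispatch("ProcessCulprit", "group-1")
	fmt.Println(out, err)
	_, err = r.dispatch("Unknown", "")
	fmt.Println(err)
}
```

Because the caller refers to work only by name, it never needs to link against the implementation, which is the decoupling this module relies on.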
### Key Components and Responsibilities
#### main.go
This file acts as the lifecycle manager for the worker process. Its responsibilities include:
- **Configuration**: Parsing flags for the Temporal host/port, namespace, and task queue.
- **Instrumentation**: Initializing the Skia common library to set up Prometheus monitoring.
- **Workflow/Activity Mapping**: Explicitly linking the generic `worker` instance to the specific logic defined in the `internal` package. This creates a clear separation between the "runner" (this module) and the "logic" (the `internal` module).
#### Execution Workflow
The following diagram illustrates how the worker interacts with the broader system:
```text
+----------------+       +-------------------+       +-----------------------+
| Temporal Cloud | ----> |  Worker Process   | ----> |   Internal Services   |
|  (Task Queue)  |       | (worker/main.go)  |       | (internal/activities) |
+----------------+       +---------+---------+       +-----------+-----------+
                                   |                             |
                                   | 1. Polls for Tasks          |
                                   |---------------------------->|
                                   |                             |
                                   | 2. Executes Activity/WF     |
                                   |<----------------------------|
                                   |                             |
                                   | 3. Reports Completion       |
                                   |---------------------------->|
```
### Deployment Context
The worker is packaged as a container (`skia_app_container`) named `grouping_workflow`. This naming reflects its primary responsibility: managing the lifecycle of anomaly groups and the resulting workflows that process potential performance culprits. In a production environment, this worker typically runs within Kubernetes, connecting to a centralized Temporal service.
# Module: /images
The `/images` module serves as the centralized repository for graphical assets and brand identity markers used across the project. It provides a single source of truth for logos, icons, and UI-specific graphics, ensuring visual consistency across different sub-modules and user-facing components.
### Design Philosophy and Implementation Choices
The module prioritizes scalability and cross-platform compatibility by primarily utilizing the SVG (Scalable Vector Graphics) format. This choice allows assets to be rendered at any resolution without loss of quality, which is critical for high-DPI displays and varied UI contexts.
#### Raster-to-SVG Wrapping
A notable implementation pattern within this module is the use of SVG wrappers for raster data. Files such as `androidx.svg`, `flutter.svg`, and `fuchsia.svg` contain Base64-encoded PNG data embedded within an SVG `<image>` tag. This approach was chosen for several reasons:
- **Standardized Interface:** It allows the UI rendering engine to treat all icons as SVGs, simplifying the code for icon components.
- **Fixed Aspect Ratios:** The `viewBox` and `preserveAspectRatio` attributes on the SVG wrapper ensure that raster logos are displayed consistently, regardless of the container's constraints.
- **Styling Consistency:** Wrapping raster images in SVGs allows for the application of consistent stroke or border effects (as seen in the grey circular stroke in `skia.svg` or `widevine.svg`) directly within the asset file.
#### Format Diversity
While SVG is the preferred format for logos, the module includes other formats based on specific use cases:
- **WebP/PNG:** Used for complex textures or photographs (like `germanium.webp` or `alpine.png`) where vectorization would be inefficient or impossible.
- **Simple Vectors:** Files like `line-chart.svg` use pure path data for lightweight, performant UI decorations.
### Key Components and Responsibilities
The module is responsible for organizing assets into three functional categories:
- **Ecosystem Branding:** Contains the official visual identifiers for the core technologies integrated into the project, such as the Chrome logo, the V8 engine icon, and the Skia graphics library.
- **Platform and Framework Logos:** Provides assets for external dependencies and supported platforms, including AndroidX, Flutter, Fuchsia, and Alpine. These are typically used in documentation or system information screens.
- **Application UI Elements:** Includes generic icons like `line-chart.svg` that are used for internal data visualization or navigational markers.
### Asset Consumption Workflow
Assets are exposed to the rest of the build system through a central configuration that designates which files are available for external reference. This prevents accidental internal dependency on draft or temporary assets.
```text
[ Feature Module ] ----> [ Request: v8.svg ]
|
v
[ /images Module ] <--- [ BUILD.bazel Exports ]
|
+-- (SVG Vector Processing) --> Rendered Icon
|
+-- (SVG Raster Decoding) --> Embedded Bitmap
```
The use of the `exports_files` directive in the module's build configuration facilitates this, allowing other packages to consume these specific images as labels without needing access to the entire directory.
# Module: /integration
# Perf Integration Data Module
The `/integration` module provides a controlled set of performance data used to verify the Perf ingestion pipeline and integration features. It serves as a bridge between raw performance results and the high-level analysis tools by providing a predictable, historical baseline of metrics tied to a specific demonstration repository.
### Design Philosophy
The module is designed around the principle of **traceable performance evolution**. Rather than providing static benchmarks, it provides a sequential history that mirrors a real-world software lifecycle.
- **Verifiable Regression Testing**: By mapping performance metrics (nanoseconds, memory allocations) to specific `git_hash` values from the `perf-demo-repo`, the module allows the system to test its ability to identify performance shifts across commits.
- **Pipeline Robustness**: The data set is intentionally heterogeneous. It includes "good" data, a file referencing a non-existent ("bad") commit, and a malformed JSON file. This design ensures the ingestion logic is tested not just for the "happy path," but also for graceful error handling and data validation.
- **Dimensional Granularity**: Implementation choices in the data schema prioritize multi-dimensional analysis. For example, tracking both `min` and `max` values for a single metric allows for testing variance detection (jitter), while splitting memory metrics into `kb` (size) and `num` (count) allows for testing the detection of different types of resource leaks.
### Key Components and Responsibilities
#### Data Generation (`generate_data.go`)
This utility is responsible for maintaining the consistency of the integration test suite. It programmatically generates the JSON artifacts to ensure they adhere to the `format.Format` schema used by the Perf ingester.
- **Synthetic Variance**: The generator injects deterministic but varying values (using the loop index and random offsets) into the measurements. This simulates a real development environment where performance fluctuates slightly or degrades over time, providing the necessary data "noise" to test filtering and alerting algorithms.
- **Commit Mapping**: It explicitly links measurements to hashes in the demo repository, ensuring that the integration environment has a valid "source of truth" to query against.
#### Data Repository (`/data`)
The `data` directory acts as a mock "filestore" that an ingester of type `dir` would monitor.
- **`demo_data_commit_*.json`**: These files represent the standard ingestion format. Each file encapsulates a snapshot of system performance for a specific hardware configuration (`arch: x86`, `config: 8888`) and a specific functional test (`test: encode`).
- **Negative Test Cases**: Includes `malformed.json` and files with unknown git hashes (e.g., `ffff...`) to verify that the system correctly identifies and reports data quality issues without halting the ingestion of subsequent valid files.
### Integration Workflow
The following diagram shows how this module interacts with the broader Perf ecosystem during an integration test:
```text
[ generate_data.go ]
         |
         | (creates)
         v
[ /integration/data/ ] <---------- [ Ingester ('dir' type) ]
         |                                      |
         | (scans filesystem)                   | (parses & validates)
         v                                      v
[ Malformed/Bad Hash ]                [ Valid Commit Data ]
         |                                      |
         +--> Log Error                         +--> Map to Git History
         +--> Continue Processing               +--> Update Trace Store
                                                +--> Detect Regressions
```
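The "log and continue" behavior in the diagram can be sketched as follows. The `ingestAll` function, the `header` struct, the file name `bad_commit.json`, and the hash values are hypothetical. Only the `git_hash` field and `malformed.json` are named in this documentation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// header reads only the field needed to map a file onto Git history.
type header struct {
	GitHash string `json:"git_hash"`
}

// ingestAll processes files in name order. A bad file is recorded in failed
// and skipped, so later valid files are still ingested.
func ingestAll(files map[string]string, knownHashes map[string]bool) (ingested []string, failed map[string]error) {
	failed = map[string]error{}
	names := make([]string, 0, len(files))
	for n := range files {
		names = append(names, n)
	}
	sort.Strings(names)

	for _, n := range names {
		var h header
		if err := json.Unmarshal([]byte(files[n]), &h); err != nil {
			failed[n] = fmt.Errorf("malformed JSON: %w", err)
			continue
		}
		if !knownHashes[h.GitHash] {
			failed[n] = fmt.Errorf("unknown git hash %q", h.GitHash)
			continue
		}
		ingested = append(ingested, n)
	}
	return ingested, failed
}

func main() {
	files := map[string]string{
		"bad_commit.json":         `{"git_hash": "ffff"}`,
		"demo_data_commit_1.json": `{"git_hash": "aaa111"}`,
		"malformed.json":          `{"git_hash": `,
	}
	ingested, failed := ingestAll(files, map[string]bool{"aaa111": true})
	fmt.Println("ingested:", ingested)
	for name, err := range failed {
		fmt.Println("skipped", name, "-", err)
	}
}
```

Collecting failures per file, instead of returning at the first error, is what allows a single bad file to be reported without halting the rest of the batch.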
### Data Schema Logic
The data structure within this module follows a specific hierarchy to support complex queries:
1. **Global Metadata**: The top-level `Key` (e.g., `arch`, `config`) defines the environment. Design-wise, this allows the system to separate "what" was tested from "where" it was tested.
2. **Result Key**: Each result identifies the specific sub-test (e.g., `test: encode`), allowing one file to contain multiple independent benchmarks.
3. **Measurements**: Measurements are grouped by type (e.g., `ns`, `alloc`). Each type contains an array of `SingleMeasurement` objects, distinguishing between different units or statistical bounds (min, max, count) for that specific metric.
# Module: /integration/data
# Performance Integration Data
This module serves as a historical repository of performance benchmarks and system metrics, indexed by Git commit hashes. It provides the ground-truth data necessary for detecting regressions, analyzing performance trends over time, and verifying the integration pipeline's ability to handle various data states.
### Design Philosophy and Implementation Choices
The data is structured to facilitate automated comparison between software iterations. By decoupling the performance results from the source code and storing them as static JSON artifacts, the system achieves several design goals:
- **Commit-Centric Traceability**: Every data point is explicitly linked to a `git_hash`. This allows the integration engine to map performance spikes or memory leaks directly to specific changes in the codebase.
- **Environmental Context**: The `key` object (containing fields like `arch` and `config`) ensures that measurements are not analyzed in a vacuum. It acknowledges that performance is hardware- and configuration-dependent, allowing the consumer to filter results for "apples-to-apples" comparisons.
- **Multi-Dimensional Metrics**: Rather than providing a single execution time, the schema separates measurements into categories like `alloc` (memory footprint) and `ns` (timing). Each category supports multiple values (e.g., `min`, `max`, `kb`, `num`), enabling a nuanced view of system behavior, such as identifying increased jitter even if average latency remains stable.
- **Schema Versioning**: The inclusion of a `version` field at the root level allows the integration logic to evolve. If the measurement format changes, the parser can handle legacy data files (like those found in this module) without breaking the analysis pipeline.
### Data Schematics and Components
The module's contents represent a sequence of snapshots (`demo_data_commit_1.json` through `demo_data_commit_10.json`) showing the evolution of a specific test case, such as the "encode" operation.
#### Metric Tracking
Measurements are stored in nested arrays to allow for extensibility. For instance, the `alloc` measurement tracks both the size of memory used (`kb`) and the count of allocations (`num`). This distinction is critical for identifying "death by a thousand cuts" scenarios where total memory usage is low, but high allocation frequency causes CPU overhead.
#### Error Handling and Validation
The presence of `malformed.json` is a deliberate implementation choice for integration testing. It serves as a negative test case to ensure that any data ingestion service or parser can gracefully handle and report syntax errors in the data stream without crashing the monitoring pipeline.
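A minimal sketch of the behavior this negative test case is meant to exercise: a parse failure is recorded and the loop moves on to the next file. The file contents, hash values, and the `ingest_all` helper are hypothetical and only illustrate the pattern.

```python
import json

# Hypothetical file contents keyed by filename; only malformed.json is broken.
FILES = {
    "demo_data_commit_1.json": '{"git_hash": "abc123", "results": []}',
    "malformed.json": '{"git_hash": "abc124", "results": [',
    "demo_data_commit_2.json": '{"git_hash": "abc125", "results": []}',
}


def ingest_all(files):
    """Parse every file, collecting errors instead of stopping on the first."""
    ingested, errors = [], []
    for name, text in files.items():
        try:
            doc = json.loads(text)
        except json.JSONDecodeError as err:
            # Report the data quality issue and keep going.
            errors.append((name, str(err)))
            continue
        ingested.append((name, doc["git_hash"]))
    return ingested, errors


ingested, errors = ingest_all(FILES)
print("ingested:", [name for name, _ in ingested])
print("errors:", [name for name, _ in errors])
```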
### Performance Data Workflow
The following diagram illustrates how the data in this module is intended to be consumed by an integration or monitoring service:
```
[ Git Commit ] ----> [ Run Benchmarks ] ----> [ Generate JSON ]
                                                      |
                                                      v
[ Integration Data ] <----------------------- [ /integration/data/ ]
        |
        +--> Compare current git_hash results against previous hashes
        +--> Validate "measurements" (e.g., did "num" of allocs increase?)
        +--> Trigger alerts if metrics exceed defined thresholds
```
### Key Components
- **Result Keys**: Found within the `results` array, these define the specific functional area being tested (e.g., `test: encode`). This allows a single commit file to store data for multiple distinct sub-systems.
- **Measurement Bounds**: By storing both `min` and `max` values for nanosecond (`ns`) timing, the data supports variance analysis. A significant widening of the gap between min and max across commits (as seen between commit 1 and commit 10) indicates decreasing stability in the code path.
- **Linkage**: The `links` field is reserved for cross-referencing external artifacts, such as detailed profiling traces or build logs, though it remains `null` in the baseline demo sets.
# Module: /jupyter
# Jupyter Module Documentation
The `/jupyter` module provides an interface for performing advanced data analysis and visualization of Skia performance data. By leveraging Jupyter Notebooks, it allows developers to move beyond the standard Skia Perf web UI to perform complex calculations, statistical modeling, and custom plotting using the Python data science stack.
## Overview
The primary goal of this module is to bridge the gap between the Skia Performance monitoring system (`perf.skia.org`) and the analytical power of tools like **Pandas**, **NumPy**, and **Matplotlib**.
While the standard Perf UI is excellent for discovering regressions and viewing individual traces, it is not designed for "bulk" analysis—such as calculating the ratio of GPU to CPU performance across hundreds of tests or finding which hardware models exhibit the most noise (coefficient of variation). This module provides the glue code to fetch data from Perf's backend and load it into a Pandas DataFrame for such tasks.
## Design and Implementation
The implementation centers around an asynchronous request-and-poll pattern to interact with the Skia Perf API.
### Data Retrieval Workflow
Accessing data follows a specific sequence to ensure the notebook remains responsive and handles the potentially large datasets stored in Perf:
1. **Context Initialization**: The system first queries the `/_/initpage/` endpoint to discover the current "window" of data (the most recent commits) and the available `paramset` (all valid keys and values, such as `model`, `test`, and `device`).
2. **Request Initiation**: A request is sent to `/_/frame/start`. This does not return data immediately; instead, it triggers a long-running query on the server and returns a unique ID.
3. **Status Polling**: The module polls `/_/frame/status/<id>` until the server reports success. This prevents notebook timeouts during heavy calculations on the server side.
4. **Data Transformation**: Once ready, the JSON results are fetched and converted into a Pandas DataFrame. The system explicitly handles "missing" or "sentinel" values (e.g., `1e32`) by converting them to `NaN` (Not a Number), ensuring that standard statistical functions like `.mean()` or `.std()` work correctly without being skewed by invalid data points.
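The request-and-poll sequence and the sentinel cleanup can be sketched without any network access. In this illustration the three callables stand in for the HTTP calls to `/_/frame/start`, `/_/frame/status/<id>`, and the results fetch; the `FakeServer` class, the status strings, and the response shape are assumptions, not the notebook's actual code.

```python
import math
import time

MISSING_SENTINEL = 1e32  # sentinel value for missing points, as noted above


def poll_until_done(start, status, fetch, interval=0.0, max_polls=100):
    """Start a job, poll until it reports success, then fetch the results.

    start/status/fetch are caller-supplied callables (hypothetical stand-ins
    for the real HTTP requests).
    """
    job_id = start()
    for _ in range(max_polls):
        if status(job_id) == "Success":  # status string is an assumption
            return fetch(job_id)
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


def clean_trace(values):
    """Replace sentinel values with NaN so statistics can skip them."""
    return [math.nan if v == MISSING_SENTINEL else v for v in values]


class FakeServer:
    """In-memory stand-in for the Perf server, used only for this sketch."""

    def __init__(self):
        self.polls = 0

    def start(self):
        return "job-1"

    def status(self, job_id):
        self.polls += 1
        return "Success" if self.polls >= 3 else "Running"

    def fetch(self, job_id):
        return {",test=a,": [1.0, 1e32, 3.0]}


server = FakeServer()
raw = poll_until_done(server.start, server.status, server.fetch)
traces = {key: clean_trace(vals) for key, vals in raw.items()}
print(traces)
```

Passing the cleaned lists to a Pandas DataFrame constructor would then yield `NaN` cells that `.mean()` and `.std()` ignore by default.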
### Key Components
#### Core API Functions (`Perf+Query.ipynb`)
The logic is encapsulated in two primary entry points that abstract away the HTTP communication:
- `perf_query(query)`: Used for selecting raw traces based on metadata (e.g., `source_type=skp&sub_result=min_ms`). This is the programmatic equivalent of the "Query" dialog in the Perf UI.
- `perf_calc(formula)`: Used for server-side processing using Skia Perf's functional query language. This allows the server to perform operations like `ave()`, `count()`, or `ratio()` before sending the result to the notebook, which is more bandwidth-efficient than downloading all raw data.
#### Environment Management (`README.md`)
Because data science dependencies (like `scipy` and `matplotlib`) can be sensitive to system-level Python versions, the module advocates for a **Virtualenv**-based deployment. This ensures that the analytical environment remains isolated from the system's Python installation and that all required libraries are pinned to versions compatible with the provided notebooks.
## Key Workflows
### Standard Data Pipeline
This diagram illustrates how data flows from the Skia Perf servers into a local visualization.
```text
[ Jupyter Notebook ]          [ Skia Perf Server ]
        |                              |
        |--- 1. POST (Query/Formula) ->|
        |                              |-- 2. Process Request --|
        |<-- 3. Return Query ID -------|                        |
        |                              |                        |
        |--- 4. GET (Poll Status) ---->|                        |
        |<-- 5. "Still Working" -------|                        |
        |             ...              |                        |
        |--- 6. GET (Poll Status) ---->|                        |
        |<-- 7. "Success" -------------|                        |
        |                              |                        |
        |--- 8. GET (Fetch Results) -->|                        |
        |<-- 9. JSON Traceset ---------|                        |
        |                              |
[ Parse JSON to Pandas ]
        |
[ Generate Matplotlib Plot ]
```
### Analysis Examples
The module provides pre-configured examples for common analysis questions:
- **Noise Analysis**: Iterating through hardware models to calculate the average coefficient of variation, helping identify flaky lab hardware.
- **Performance Ratios**: Calculating the ratio between CPU and GPU execution times for specific sets of SKP (Skia Picture) files to identify rendering bottlenecks.
- **Normalization**: Using Pandas to normalize disparate traces to a mean of 0 and a standard deviation of 1, allowing for the visual comparison of the "shape" of performance changes across different tests regardless of their absolute scale.
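The noise and normalization calculations above reduce to two short formulas. The sketch below uses only the standard library and the population standard deviation; the notebooks use Pandas, whose `.std()` defaults to the sample standard deviation, so exact numbers may differ slightly. The sample traces are invented for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Noise metric: standard deviation relative to the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


def normalize(values):
    """Rescale a trace to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A flat trace has no shape to compare; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Two invented traces with the same shape but very different scales.
fast = [10.0, 12.0, 11.0, 13.0]
slow = [1000.0, 1200.0, 1100.0, 1300.0]

print(coefficient_of_variation(fast), coefficient_of_variation(slow))
print(normalize(fast))
print(normalize(slow))
```

After normalization the two traces overlap on a plot, which is what makes shape comparison across tests possible.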
# Module: /lint
### High-Level Overview
The `/lint` module provides a specialized reporting interface for static code analysis, specifically designed to integrate with JSHint. Its primary purpose is to bridge the gap between raw analysis data and a human-readable, machine-parseable terminal output. Instead of relying on default verbose formats, this module implements a custom reporting logic that prioritizes clarity and precision in locating syntax errors or stylistic inconsistencies.
### Design Rationale: Streamlined Feedback
The core design philosophy behind this module is **minimalist observability**. In many build environments, linting output can become cluttered with metadata that obscures the actual location of a bug. The implementation in `/lint/reporter.js` focuses on a "one-line-per-error" strategy.
By standardizing the output format to `file:line:character reason`, the module ensures that:
1. **Developer Cognition**: Developers can quickly scan the left-hand side for file locations.
2. **Tooling Integration**: The format is intentionally compatible with terminal emulators and IDEs that support "click-to-open" functionality for file paths.
3. **Actionable Summaries**: The reporter concludes with a single count of total errors, providing an immediate "pass/fail" signal for CI/CD pipelines.
### Key Components and Implementation
#### Output Formatting Logic (`reporter.js`)
The module exports a single `reporter` function expected by the JSHint API. Its responsibility is to iterate through a collection of error objects and transform them into a cohesive string buffer.
- **String Aggregation**: Rather than calling `console.log` for every individual error—which can lead to performance bottlenecks and interleaved output in asynchronous environments—the module aggregates all results into a single string.
- **Buffer Flushing**: Output is sent directly to `process.stdout.write`. This choice avoids the trailing newline logic inherent in `console.log`, allowing the module complete control over the vertical spacing of the final report.
- **Pluralization Logic**: A small but significant detail in the summary implementation is the conditional suffixing of the "error" count. This ensures that the summary remains grammatically correct whether there is a single violation or hundreds, maintaining a professional interface for the end-user.
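The aggregation and pluralization steps can be sketched in a few lines. The real reporter is JavaScript; this Python version only mirrors the logic. The input shape (a `file` plus an `error` object with `line`, `character`, and `reason`) follows the JSHint reporter convention as far as known, and the exact wording of the summary line is an assumption.

```python
import sys


def format_report(results):
    """Build the whole report as one string: one line per error plus a summary."""
    lines = [
        f"{r['file']}:{r['error']['line']}:{r['error']['character']} {r['error']['reason']}"
        for r in results
    ]
    count = len(results)
    if count:
        # "1 error" versus "2 errors"; summary wording is illustrative.
        suffix = "" if count == 1 else "s"
        lines.append(f"{count} error{suffix}")
    return "".join(line + "\n" for line in lines)


sample = [
    {"file": "a.js", "error": {"line": 3, "character": 7, "reason": "Missing semicolon."}},
    {"file": "b.js", "error": {"line": 10, "character": 1, "reason": "'x' is not defined."}},
]

# A single write keeps the report contiguous in the terminal.
sys.stdout.write(format_report(sample))
```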
### Workflow Process
The following diagram illustrates how data flows from the static analysis tool through this module to the user's terminal:
```text
[ JSHint Engine ]
|
| (Raw Result Array)
v
[ /lint/reporter.js ]
|
|-- For Each Result:
| Extract: { file, line, character, reason }
| Format: "file:line:char reason"
|
|-- Finalize:
| Append Total Count Summary
v
[ System Stdout ] -> (Displayed to Developer)
```
### Key File Responsibilities
- **`reporter.js`**: Functions as the primary entry point. It contains the transformation logic that maps JSHint's internal error representation to the standardized string format used across the project. It is responsible for the final presentation layer of the linting process.
# Module: /modules
# Perf Modules Documentation
The `/modules` directory contains the frontend architecture of the Skia Perf application. It is built as a collection of modular Custom Elements (using Lit) and specialized utility libraries that coordinate performance data querying, time-series visualization, and anomaly triage.
## High-Level Overview
The module architecture is designed to handle massive-scale performance telemetry by separating data management from visual presentation. The system revolves around three core pillars:
1. **Data & State**: Managing complex `DataFrame` objects (time-series) and reflecting application state (queries, zoom levels) in the browser URL.
2. **Visualization**: High-performance charting engines that overlay statistical anomalies and user-filed bugs onto performance traces.
3. **Triage Workflow**: Specialized dialogs and tables that allow "Sheriffs" to investigate regressions, file bugs in external trackers (Buganizer), and trigger automated bisections (Pinpoint).
## Design Philosophy
- **URL as Source of Truth**: Most modules utilize `stateReflector` to ensure that every view—including specific zooms, selected traces, and active filters—is shareable via a deep link.
- **Asynchronous Progress**: Long-running backend tasks (clustering, dry-runs, data fetching) utilize a standardized `progress` polling mechanism to keep the UI responsive.
- **Context-Driven Data**: Modules often leverage Lit Context to share `DataFrame` and `AnomalyMap` data across deeply nested component trees without "prop-drilling."
- **Composition over Monoliths**: Complex pages (like `Explore` or `Triage`) are composed of smaller, reusable primitives (like `query-chooser-sk` or `triage-status-sk`).
## Key Components and Submodules
### 1. Data Visualization Engine
The charting infrastructure is split between raw data processing and visual rendering.
- **`dataframe`**: Manages the lifecycle of performance data. It handles joining multiple data chunks, padding missing values with `MISSING_DATA_SENTINEL`, and providing the data to charts via context.
- **`plot-google-chart-sk`**: The primary interactive chart. It uses a layered approach where the lines are SVG-based (Google Charts), but interactive elements like anomalies and bug icons are HTML overlays to maintain performance during panning.
- **`explore-simple-sk`**: The central orchestrator for data exploration. It combines the chart, a navigation summary (`plot-summary-sk`), and the query interface.
- **`plot-summary-sk`**: Provides a "bird's-eye view" of long-range data. It implements **Min-Max downsampling** to ensure peaks and valleys remain visible even when thousands of points are condensed into a small sparkline.
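Min-max downsampling, as used by `plot-summary-sk`, can be illustrated with a short sketch. This is a generic version of the technique under simple assumptions (equal-width buckets, values only, no timestamps) and is not taken from the component's source.

```python
def min_max_downsample(values, buckets):
    """Keep the minimum and maximum of each bucket, in their original order."""
    if buckets <= 0:
        raise ValueError("buckets must be positive")
    n = len(values)
    if n <= 2 * buckets:
        # Already small enough; nothing to condense.
        return list(values)
    out = []
    for b in range(buckets):
        chunk = values[b * n // buckets:(b + 1) * n // buckets]
        lo = min(range(len(chunk)), key=chunk.__getitem__)
        hi = max(range(len(chunk)), key=chunk.__getitem__)
        # Emit both extremes in the order they occurred within the bucket.
        for i in sorted({lo, hi}):
            out.append(chunk[i])
    return out


# A spike (9) and a dip (0) hidden in otherwise flat data.
data = [1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
print(min_max_downsample(data, 2))
```

Averaging each bucket instead would flatten the spike to 2 and the dip to roughly 0.9, which is why the extremes are kept.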
### 2. Anomaly Detection and Triage
These modules facilitate the transition from identifying a "spike" to resolving a performance regression.
- **`anomalies-table-sk`**: A sophisticated management table that groups related anomalies (e.g., by benchmark or revision range) to allow for bulk triaging.
- **`triage-menu-sk`**: A contextual popup used to "nudge" anomaly boundaries, ignore false positives, or initiate bug filing.
- **`new-bug-dialog-sk` & `existing-bug-dialog-sk`**: Specialized modals that automate the boilerplate of reporting issues by pre-filling titles and metadata derived from the anomaly's trace parameters.
- **`bisect-dialog-sk` & `pinpoint-try-job-dialog-sk`**: Integration points for Chrome-specific debugging tools, allowing users to trigger A/B bisections directly from a regression point.
### 3. Querying and Filtering
Navigating millions of traces is handled through hierarchical and summary-based components.
- **`test-picker-sk`**: A "drill-down" interface that guides users through valid parameter combinations (e.g., selecting a `benchmark` reveals only the `bots` that ran it).
- **`query-sk` & `query-chooser-sk`**: The standard multi-select filter interface used to build complex trace queries.
- **`paramset-sk`**: A read-only visualization of a query, used to summarize what data an alert or a graph is currently showing.
### 4. Infrastructure and Shell
- **`perf-scaffold-sk`**: The master template providing navigation, theme switching (dark/light mode), and authentication integration. It supports both a "Legacy" sidebar and a modern "V2" header layout.
- **`telemetry`**: A buffered reporting system that tracks application performance and user actions, flushing data in batches to minimize network overhead.
- **`common`**: A utility layer containing the `ShortcutRegistry` for global hotkeys (e.g., `p` for positive triage) and `plot-builder` for transposing backend data into chartable formats.
## Key Workflows
### Data Exploration and Refinement
The system uses a reactive loop to update visualizations as users filter data.
```text
[ User Interaction ] -> [ test-picker-sk ] -> [ Update URL State ]
                                  |                     |
                                  v                     v
[ explore-simple-sk ] <--- [ query string ] <--- [ stateReflector ]
        |
        |-- 1. requestFrame() (via DataService)
        |-- 2. startRequest() (Polling progress)
        |-- 3. merge results into [ DataFrameRepository ]
        |-- 4. render [ plot-google-chart-sk ]
```
### Anomaly Triage Sequence
Sheriffs move from high-level alerts to specific code changes through integrated navigation.
```text
[ triage-page-sk ] (Matrix of commits vs alerts)
        |
        |-- Click Status Icon --> [ cluster-summary2-sk ] (View Centroid)
                                            |
        +-----------------------------------+-----------------------------------+
        |                                   |                                   |
[ Triage Action ]                   [ Investigation ]                   [ External Link ]
 (Mark Pos/Neg)                    (View on Dashboard)                  (Link to Gitiles)
        |                                   |                                   |
        v                                   v                                   v
[ POST /_/triage/ ]               [ explore-simple-sk ]                [ Source Browser ]
```
### Coordinate Transformation
Because the UI combines SVG charts and HTML overlays, many modules (like `chart-tooltip-sk` and `plot-google-chart-sk`) perform "Pixel to Data" translations.
```text
[ Mouse Hover X/Y ]
        |
        v
[ ChartLayoutInterface ] -> [ Data Value (Commit/Date) ]
        |
        v
[ lookupCids() ] ----------> [ Git Hash / Author / Message ]
        |
        v
[ commit-range-sk ] -------> [ HTML Link to Source ]
```
## Internal Infrastructure Details
- **`json`**: Contains the "Source of Truth" for data structures, automatically generated from the Go backend to ensure type safety across the network.
- **`cid`**: A specialized resolution service that translates sequential `CommitNumbers` (used for storage efficiency) into full `Commit` metadata.
- **`themes`**: A delta-based styling layer that extends the shared Skia infrastructure with Perf-specific color palettes and spacing resets.
- **`errorMessage`**: A global utility that captures both application errors and network failures, displaying them in a persistent `<error-toast-sk>` until dismissed.
# Module: /modules/alert-config-sk
# alert-config-sk
The `alert-config-sk` module provides a comprehensive configuration interface for managing performance regression alerts in the Perf system. It allows users to define which traces to monitor, how to detect anomalies, and what actions to take when a regression is identified.
## Overview
This element serves as the primary editor for `Alert` configurations. It maps complex JSON configuration objects to a user-friendly form, handling the conditional logic required by different detection algorithms and notification strategies.
The design emphasizes data binding between the UI and a central `Alert` object. Changes in the UI immediately update the underlying object, which can then be persisted to the backend.
## Key Components and Responsibilities
### State Management (`Alert` and `ParamSet`)
The module's primary inputs are the `config` (the alert definition) and the `paramset` (the available keys and values in the performance database).
- **`config`**: An object implementing the `Alert` interface. The element provides setters/getters that ensure default values (like `radius` or `interesting` thresholds) are populated from global settings if missing.
- **`paramset`**: Used to populate the `query-chooser-sk` and the "Group By" multi-select options, allowing the user to filter traces based on actual metadata present in the system.
### Dynamic Regression Detection
The complexity of regression detection is managed through two coordinated selections:
- **Grouping (`algo-select-sk`)**: Determines if traces are clustered (K-Means) before analysis or if each trace is analyzed individually.
- **Step Detection (`select-sk`)**: Allows the user to choose the mathematical model for finding regressions (e.g., Cohen's d, Mann-Whitney U, or Absolute magnitude).
The UI dynamically updates the **Threshold** label and units based on the selected Step Detection algorithm using a `thresholdDescriptors` map. This ensures users provide inputs that make sense for the chosen math (e.g., "standard deviations" for Cohen's d vs "alpha" for Mann-Whitney).
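A descriptor map of this kind might look like the following sketch. The dictionary keys, the `absolute` entry's units, and the helper name are hypothetical; only the "standard deviations" and "alpha" units come from the description above.

```python
# Hypothetical descriptor table. Keys and the "absolute" units are
# illustrative; only the Cohen's d and Mann-Whitney units come from the text.
THRESHOLD_DESCRIPTORS = {
    "cohen": {"label": "Threshold", "units": "standard deviations"},
    "mannwhitneyu": {"label": "Threshold", "units": "alpha"},
    "absolute": {"label": "Threshold", "units": "trace units"},
}
DEFAULT_DESCRIPTOR = {"label": "Threshold", "units": ""}


def threshold_label(step_detection):
    """Return the label to show next to the threshold input."""
    d = THRESHOLD_DESCRIPTORS.get(step_detection, DEFAULT_DESCRIPTOR)
    return f"{d['label']} ({d['units']})" if d["units"] else d["label"]


for algo in ("cohen", "mannwhitneyu", "something-else"):
    print(algo, "->", threshold_label(algo))
```

Keeping the wording in a lookup table means a new step detection algorithm needs one new entry rather than another branch in the template.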
### Conditional Workflows
The element's layout changes based on the global `window.perf` configuration and user selections:
- **Notifications**: Depending on whether `window.perf.notifications` is set to `html_email` or `markdown_issuetracker`, the element displays either email recipient fields or Issue Tracker component IDs.
- **Alert Actions**: If `window.perf.need_alert_action` is enabled, it exposes options for automated behaviors like filing bugs or triggering Pinpoint bisections.
- **Testing**: Integrated "Test" buttons allow users to validate Bug URI templates or notification destinations against the backend API (`/_/alert/bug/try` and `/_/alert/notify/try`) before saving the config.
## Implementation Details
### Data Flow
The element uses a "top-down" data flow for configuration and "bottom-up" for updates via event listeners:
```
[Parent Component]
        | (sets .config and .paramset)
        v
[alert-config-sk]
        |
        +-- @input / @change events --> [Updates internal _config]
        |
        +-- [query-chooser-sk] --------> (updates _config.query)
        |
        +-- [algo-select-sk] ----------> (updates _config.algo)
```
### Key Files
- `alert-config-sk.ts`: Contains the main logic for the Lit-based element, including the conditional rendering logic and API calls for testing templates.
- `alert-config-sk.scss`: Defines the layout, ensuring that nested controls (like spinners and labels) are indented and styled consistently with the Perf theme.
- `alert-config-sk-demo.ts`: Provides a sandbox for testing various UI states (e.g., toggling "Group By" or switching between email/issue tracker notifications) without a full backend.
## Design Decisions
- **Global Config Dependency**: The element relies on `window.perf` for environment-specific flags. This allows the same UI component to behave differently across different Perf instances (e.g., some instances might not support bisection).
- **Validation**: For critical fields like the Issue Tracker Component ID, the element uses HTML5 pattern validation (`\d+`) and triggers an `errorMessage` toast on invalid input to prevent malformed data from being sent to the server.
- **Property Upgrading**: The `connectedCallback` uses `_upgradeProperty` for `config` and `paramset`. This ensures that if the properties were set before the custom element was defined, the values are correctly captured and rendered.
# Module: /modules/alerts-page-sk
The `alerts-page-sk` module provides a comprehensive interface for managing performance alert configurations within the Perf application. It allows users to view, create, edit, and archive alert rules that monitor trace data for anomalies.
### Design and Architecture
The module is designed around a centralized management table. It acts as a bridge between the backend alert storage and the `alert-config-sk` component, which handles the complex logic of individual alert parameterization.
Key design choices include:
- **Role-Based Access Control**: The component integrates with the `alogin-sk` module to determine if a user has the "editor" role. Actions like "New", "Edit", and "Delete" are restricted or disabled for non-editors.
- **Modality for Configuration**: To keep the list view clean, all editing and creation happen within a `<dialog>` element. This dialog wraps the `alert-config-sk` element, ensuring a consistent experience between creating a brand-new alert and modifying an existing one.
- **Dynamic UI Adjustments**: The page adapts its table headers and content based on global `window.perf` configurations (e.g., changing "Alert" to "Component" if issue tracker integration is enabled).
- **State Transparency**: The module supports viewing archived (deleted) configurations through a toggle, and it provides immediate visual feedback for invalid configurations (e.g., missing queries).
### Key Components and Files
- **`alerts-page-sk.ts`**: The core logic of the page. It manages the lifecycle of the alert list, including fetching data from `/_/alert/list/`, handling the state of the editing dialog, and performing CRUD operations via fetch requests to the backend.
- **`alerts-page-sk.scss`**: Defines the layout for the management table, specifically handling overflow and ellipsis for long query strings to ensure the table remains readable even with complex alert rules.
- **`alerts-page-sk-demo.ts`**: Provides a robust mocked environment for development, simulating various backend responses for alert lists, login statuses, and trace counts.
### Key Workflows
#### Alert Editing Workflow
When a user interacts with the alert list, the module manages the state transition from a read-only list to an interactive configuration form.
```text
[Alerts Table] --(Click Edit)--> [Fetch Current Config]
                                           |
                                           v
[List View] <---(Cancel)--- [Modal Dialog (alert-config-sk)]
     ^                                     |
     |                             (Modify & Accept)
     |                                     |
     +-------(Post to /update) <-----------+
```
#### Deep Linking
The module supports deep linking. If the page is loaded with a search query (e.g., `/a/?5646874153320448`), the `openOnLoad` method automatically identifies the matching alert and opens the edit dialog immediately upon data retrieval.
#### Dry Run Integration
Every alert in the table includes a "Dry Run" link. This utilizes the `dryrunUrl` helper to convert the alert's configuration into a URL query string, redirecting the user to the Explore page (`/d/`) to visualize exactly what data the alert would trigger on before saving changes.
### External Dependencies and Interfaces
- **`alert-config-sk`**: Used as the internal editor for alert details.
- **`paramset-sk`**: Used in the table to provide a summarized view of the alert's query.
- **Backend Endpoints**:
  - `/_/alert/list/{showDeleted}`: Retrieves the set of alerts.
  - `/_/alert/new`: Fetches a default skeleton for a new alert.
  - `/_/alert/update`: Saves a modified or new alert.
  - `/_/alert/delete/{id}`: Archives an alert.
# Module: /modules/algo-select-sk
# algo-select-sk
The `algo-select-sk` module provides a custom UI component for choosing between different anomaly detection or clustering algorithms in Perf. It acts as a specialized wrapper around the generic `select-sk` component, providing a type-safe and domain-specific interface for algorithm selection.
## Design and Implementation
The module is designed to bridge the gap between low-level UI selection (indexes) and high-level application logic (algorithm names).
### State Management
The component uses the `algo` attribute/property as its source of truth. It supports two primary algorithms defined in the `ClusterAlgo` type:
- **kmeans**: Groups traces by shape and looks for steps within the cluster centroids.
- **stepfit**: Analyzes each individual trace for steps independently.
To ensure robustness, the component implements a fallback mechanism. Any invalid string provided to the `algo` attribute is automatically coerced to `kmeans` via the internal `toClusterAlgo` utility.
### Component Interaction
Instead of exposing the raw `select-sk` child, `algo-select-sk` encapsulates the selection logic. It listens for `selection-changed` events from its internal `select-sk` element, maps the selected index to a `ClusterAlgo` value, and dispatches a domain-specific `algo-change` event.
```
[ User Clicks ] -> [ select-sk (index) ] -> [ algo-select-sk (mapping) ] -> [ algo-change Event ]
```
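The fallback and the index-to-name mapping can be sketched as two small functions. The component itself is TypeScript; this Python version only mirrors the logic, and the option order (`kmeans` first) is an assumption.

```python
# Option order is assumed for illustration.
ALGO_OPTIONS = ("kmeans", "stepfit")


def to_cluster_algo(value):
    """Coerce any unrecognized string to the default, 'kmeans'."""
    return value if value in ALGO_OPTIONS else "kmeans"


def algo_change_detail(selected_index):
    """Map a selection index to the detail payload of an algo-change event."""
    name = ALGO_OPTIONS[selected_index] if 0 <= selected_index < len(ALGO_OPTIONS) else ""
    return {"algo": to_cluster_algo(name)}


print(to_cluster_algo("stepfit"))
print(to_cluster_algo("not-an-algo"))
print(algo_change_detail(1))
```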
## Key Components
### AlgoSelectSk
Located in `algo-select-sk.ts`, this is the main class for the element.
- **Attributes/Properties**: Reflects the `algo` state. Updating the property updates the attribute and triggers a re-render.
- **Template**: Uses `lit` to render a `select-sk` containing two options. It uses the `?selected` directive to synchronize the internal state of the options with the component's `algo` property.
- **Event Handling**: The `_selectionChanged` method translates the numerical index from the underlying selector into a string value (`kmeans` or `stepfit`) by querying the `value` attribute of the child `div` elements.
### Events
- **algo-change**: This is the primary output of the component. The event detail contains an object of type `AlgoSelectAlgoChangeEventDetail`:
```typescript
interface AlgoSelectAlgoChangeEventDetail {
  algo: 'kmeans' | 'stepfit';
}
```
## Testing and Demonstration
- **Demo Page**: `algo-select-sk-demo.html` and `.ts` show the component in various states (default, pre-selected, and dark mode) and log event details to the screen when selections change.
- **Unit Tests**: `algo-select-sk_test.ts` validates the attribute-to-property reflection, the fallback logic for invalid inputs, and the correct dispatching of events.
- **Integration Tests**: `algo-select-sk_puppeteer_test.ts` performs visual regression testing using Puppeteer to ensure the component renders correctly and responds to clicks in a real browser environment.
# Module: /modules/anomalies-table-sk
# Anomalies Table Module
The `anomalies-table-sk` module provides a comprehensive, interactive table for visualizing, grouping, and triaging performance anomalies detected in the Perf system. It serves as a central hub for users to review regression and improvement alerts, manage associated bugs, and navigate to detailed graphical reports.
## Overview
The primary component, `AnomaliesTableSk`, renders a list of anomalies and provides tools to manipulate their presentation. Rather than a flat list, the table utilizes a sophisticated grouping logic to combine related anomalies, reducing visual clutter and allowing bulk actions.
### Design Principles
- **Group-First Workflow**: Large sets of anomalies are often related (e.g., the same regression across multiple bots). The table defaults to grouped views to allow users to triage entire sets of alerts simultaneously.
- **State Separation**: Selection, grouping, and navigation logic are decoupled into specific controllers (`SelectionController`, `AnomalyGroupingController`, `ReportNavigationController`) to manage complexity.
- **Contextual Triaging**: Integrates directly with the triage menu and bug tooltips, allowing users to file bugs, associate alerts with existing issues, or ignore false positives without leaving the context of the list.
## Key Components and Responsibilities
### AnomaliesTableSk (`anomalies-table-sk.ts`)
The main UI element. It orchestrates the rendering of table rows, handles keyboard shortcuts (like `p` for filing a bug or `g` for graphing), and manages the "Triage Selected" popup. It delegates data processing to sub-controllers while maintaining the visual state of the table (expanded/collapsed groups).
### Anomaly Grouping Controller (`anomaly-grouping-controller.ts`)
Manages how anomalies are aggregated into table rows. It persists user preferences for grouping (e.g., "Group by Benchmark" or "Exact Revision Match") in `localStorage`.
The grouping logic follows a specific hierarchy:
1. **Bug ID**: Anomalies already associated with a specific bug are always grouped together.
2. **Revision Mode**: Remaining anomalies are grouped by their commit range based on three modes:
   - `EXACT`: Ranges must be identical.
   - `OVERLAPPING`: Ranges that share any commit.
   - `ANY`: All anomalies are considered a single group.
3. **Attribute Splitting**: Revision groups can be further subdivided by `BENCHMARK`, `BOT`, or `TEST`.
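The first two levels of this hierarchy can be sketched as follows. The field names (`bug_id`, `start`, `end`), the inclusive range semantics, and the transitive merging used for `OVERLAPPING` are assumptions made for the example; attribute splitting (step 3) is omitted for brevity.

```python
from collections import defaultdict


def group_by_revision(anomalies, mode):
    """Group anomalies by commit range according to the revision mode."""
    if mode == "ANY":
        return [list(anomalies)] if anomalies else []
    if mode == "EXACT":
        groups = defaultdict(list)
        for a in anomalies:
            groups[(a["start"], a["end"])].append(a)
        return list(groups.values())
    if mode == "OVERLAPPING":
        groups, group_end = [], None
        for a in sorted(anomalies, key=lambda a: (a["start"], a["end"])):
            if groups and a["start"] <= group_end:
                # Shares at least one commit with the running group.
                groups[-1].append(a)
                group_end = max(group_end, a["end"])
            else:
                groups.append([a])
                group_end = a["end"]
        return groups
    raise ValueError(f"unknown revision mode: {mode}")


def group_anomalies(anomalies, mode):
    """Bug ID groups first, then revision groups for everything else."""
    by_bug, unassigned = defaultdict(list), []
    for a in anomalies:
        if a.get("bug_id"):
            by_bug[a["bug_id"]].append(a)
        else:
            unassigned.append(a)
    return list(by_bug.values()) + group_by_revision(unassigned, mode)


def ids(groups):
    return [[a["id"] for a in g] for g in groups]


ANOMALIES = [
    {"id": 1, "start": 10, "end": 20, "bug_id": 7},
    {"id": 2, "start": 50, "end": 60, "bug_id": 7},
    {"id": 3, "start": 10, "end": 20},
    {"id": 4, "start": 10, "end": 20},
    {"id": 5, "start": 18, "end": 30},
    {"id": 6, "start": 40, "end": 45},
]

for mode in ("EXACT", "OVERLAPPING", "ANY"):
    print(mode, ids(group_anomalies(ANOMALIES, mode)))
```

Anomalies 1 and 2 stay together in every mode because the shared bug takes priority over their disjoint commit ranges.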
### Report Navigation Controller (`report-navigation-controller.ts`)
Handles the transition from the table to the "Explore" (graphing) pages. It manages:
- **URL Generation**: Constructing complex URLs for multi-graph views.
- **SID Management**: When a list of anomaly IDs is too long for a standard URL, it interacts with the `/_/anomalies/group_report` API to obtain a Session ID (SID) which represents the collection.
- **Time Range Calculation**: Automatically adds a week of padding before and after an anomaly's range to provide historical context on the generated graphs.
### Anomaly Transformer (`anomaly-transformer.ts`)
A utility class responsible for converting raw data into displayable strings and determining summary values for collapsed groups.
- **Longest Sub-test Path**: For groups containing different sub-tests, it finds the longest common prefix of their paths and appends a `*` (e.g., `test1/sub1` and `test1/sub2` become `test1/sub*`).
- **Summary Delta**: Determines which percentage change to display on a group row (prioritizing the largest regression magnitude).
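The sub-test summary shown in the example above can be reproduced with a character-wise common prefix. This is a sketch of the idea rather than the transformer's code; in particular, how a group of identical paths or an empty group is handled is an assumption.

```python
import os.path


def summarize_subtests(paths):
    """Collapse several sub-test paths into one display string."""
    unique = sorted(set(paths))
    if not unique:
        return ""
    if len(unique) == 1:
        # Nothing differs, so no wildcard is needed (assumed behavior).
        return unique[0]
    # os.path.commonprefix compares character by character, not by segment.
    return os.path.commonprefix(unique) + "*"


print(summarize_subtests(["test1/sub1", "test1/sub2"]))
print(summarize_subtests(["test1/sub1", "test1/sub1"]))
```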
### Anomalies Grouping Settings (`anomalies-grouping-settings-sk.ts`)
A configuration panel embedded within the table header that allows users to toggle the grouping criteria and revision modes described above.
## Key Workflows
### Selection and Bulk Action
The table uses a `SelectionController` to track which anomalies are currently active. Selection state flows from the UI to the controller, which then triggers a re-render to update checkbox states (including indeterminate states for partially selected groups).
```text
User Interaction (Checkbox Click)
        |
        v
SelectionController updates Set<Anomaly>
        |
        v
LitElement (Table) requestUpdate()
        |
        +-----> Update Header Checkbox (All/None/Indeterminate)
        +-----> Update Group Summary Checkboxes
        +-----> Update Action Buttons (Triage/Graph Enabled State)
```
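The All/None/Indeterminate decision for a header or group checkbox reduces to counting how many members of a group are in the selection set. A minimal sketch follows; `checkboxState` and `CheckboxState` are hypothetical names, not part of `SelectionController`.

```typescript
type CheckboxState = 'all' | 'none' | 'indeterminate';

// Derives the tri-state of a group checkbox from the current selection.
function checkboxState<T>(group: T[], selected: Set<T>): CheckboxState {
  const count = group.filter((item) => selected.has(item)).length;
  if (count === 0) return 'none';
  return count === group.length ? 'all' : 'indeterminate';
}
```

A template could map `'all'` to the `checked` property and `'indeterminate'` to the `indeterminate` property of an `<input type="checkbox">`.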
### Anomaly Triaging
When a user triages a group or selection, the table interacts with the `TriageMenuSk`.
```text
[Select Anomalies] -> [Click Triage Selected] -> [triage-menu-sk appears]
                                                            |
             +----------------------------------------------+
             |                      |                       |
      [File New Bug]         [Existing Bug]         [Ignore Anomaly]
             |                      |                       |
       Opens Dialog         Lists Associated        Sends 'RESET' or
                            Issues from API         'IGNORE' to backend
```
### Graphical Investigation
Clicking the "Chart" icon or the "Graph Selected" button initiates a navigation workflow:
1. **Request Group Report**: Backend provides a `timerange_map` for the selected anomalies.
2. **Shortcut Update**: The `ReportNavigationController` calls `/_/shortcut/update` to store the specific graph configurations.
3. **Redirect**: The browser opens a new tab to `/m/?shortcut=[id]&begin=[start]&end=[end]`.
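The final redirect URL can be sketched as follows, combining the shortcut ID with the one-week padding described for the Report Navigation Controller. The function name `exploreUrl` and the assumption that `begin` and `end` are Unix timestamps in seconds are illustrative only.

```typescript
const WEEK_SECONDS = 7 * 24 * 60 * 60;

// Builds the multi-graph URL, padding the anomaly range by one week on each side.
// Assumption: startSec and endSec are Unix timestamps in seconds.
function exploreUrl(shortcutId: string, startSec: number, endSec: number): string {
  const params = new URLSearchParams({
    shortcut: shortcutId,
    begin: String(startSec - WEEK_SECONDS),
    end: String(endSec + WEEK_SECONDS),
  });
  return `/m/?${params.toString()}`;
}
```

The result would then be passed to something like `window.open(url, '_blank')` to open the new tab.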
# Module: /modules/anomaly-playground-sk
# anomaly-playground-sk
The `anomaly-playground-sk` module provides an interactive environment for testing and tuning anomaly detection algorithms within the Perf ecosystem. It serves as a "sandbox" where developers and data scientists can input arbitrary trace data, apply various statistical detection methods, and visualize the results in real-time without needing to modify production alerts or wait for new data ingestion.
## High-Level Overview
This module bridges the gap between algorithm development and visualization. It wraps a specialized instance of the `explore-simple-sk` component to provide a familiar graphing interface, while adding a control panel for manual data entry and parameter manipulation.
The primary goal is to allow users to answer questions like:
- "Would this specific shift be caught by the `mannwhitneyu` algorithm with a threshold of 3.0?"
- "How does changing the radius affect the sensitivity of detection on noisy data?"
- "Is a particular jump considered an improvement or a regression based on the expected direction?"
## Design Decisions
### Data Input and Mocking
Unlike the main Explore page, which queries a backend database for historical traces, the playground accepts direct manual input as a comma-separated list of values. This design choice facilitates rapid prototyping of edge cases. When a user inputs data:
1. The component generates a synthetic `DataFrame`.
2. It creates mock `CommitNumber` and `TimestampSeconds` headers for each data point to satisfy the requirements of the graphing engine.
3. It assigns a static trace key (`,name=playground,`) to the data.
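These three steps can be sketched as a small parser. The `DataFrameLike` and `ColumnHeaderLike` shapes are simplified stand-ins for the real types, and `frameFromInput` is a hypothetical name; only the trace key `,name=playground,` comes from the module itself.

```typescript
// Simplified stand-ins for the real DataFrame types (assumption).
interface ColumnHeaderLike {
  offset: number; // mock CommitNumber
  timestamp: number; // mock TimestampSeconds
}

interface DataFrameLike {
  header: ColumnHeaderLike[];
  traceset: Record<string, number[]>;
}

const PLAYGROUND_KEY = ',name=playground,';

// Turns "1, 2, 3.5" into a synthetic frame with one header entry per value.
// Tokens that are empty or not finite numbers are skipped.
function frameFromInput(input: string, baseTimestamp = 0): DataFrameLike {
  const values = input
    .split(',')
    .map((token) => token.trim())
    .filter((token) => token !== '')
    .map((token) => Number(token))
    .filter((n) => Number.isFinite(n));

  const header = values.map((_, i) => ({
    offset: i,
    timestamp: baseTimestamp + i,
  }));

  return { header, traceset: { [PLAYGROUND_KEY]: values } };
}
```

Because the headers are synthetic, the commit numbers carry no meaning beyond ordering the points on the X-axis.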
### Component Integration
The module leverages `explore-simple-sk` as its visualization engine rather than re-implementing graphing logic. To make it behave like a "playground" rather than a search tool, several features of the child component are programmatically disabled or hidden:
- `openQueryByDefault` is set to `false` (no need to search a database).
- `showHeader` and `navOpen` are disabled to maximize space for the playground controls.
- `disablePointLinks` is enabled because synthetic data points do not link to real Git commits.
### State Reflection
The component uses `stateReflector` to sync the current playground configuration (the trace string, algorithm, radius, threshold, etc.) with the URL's query parameters. This allows researchers to share a specific "scenario" by simply copying and pasting the URL.
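The idea behind state reflection can be illustrated with the standard `URLSearchParams` API. This is not the `stateReflector` API itself; the `PlaygroundState` field names and both helper functions are assumptions made for the sketch.

```typescript
// Hypothetical subset of the playground state.
interface PlaygroundState {
  trace: string;
  algo: string;
  radius: number;
  threshold: number;
}

// Serializes the state into a query string.
function toQuery(state: PlaygroundState): string {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(state)) {
    params.set(key, String(value));
  }
  return params.toString();
}

// Restores the state from a query string, falling back to defaults
// for missing or non-numeric values.
function fromQuery(query: string, defaults: PlaygroundState): PlaygroundState {
  const params = new URLSearchParams(query);
  const num = (key: string, fallback: number): number => {
    const raw = params.get(key);
    const n = raw === null ? NaN : Number(raw);
    return Number.isFinite(n) ? n : fallback;
  };
  return {
    trace: params.get('trace') ?? defaults.trace,
    algo: params.get('algo') ?? defaults.algo,
    radius: num('radius', defaults.radius),
    threshold: num('threshold', defaults.threshold),
  };
}
```

A shared URL then carries the whole scenario: loading it restores the same trace string and detection parameters.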
## Key Workflows
### The Detection Process
The workflow follows a standard Input -> Configure -> Request -> Visualize cycle:
```text
[ User Input ] ----> [ Input Parser ] ----> [ Local Plotting ]
      |                                             |
      |                                             v
      |                                [ explore-simple-sk Graph ]
      v                                             ^
[ Param Controls ]                                  |
(Algo, Radius, etc)                                 |
      |                                             |
      v                                             |
[ "Detect" Click ] --> [ Backend API Request ] -----+
                  (/_/playground/anomaly/v1/detect)
```
1. **Plotting:** As the user types into the text area, the component immediately updates the graph. This is a local operation that transforms the string into a `DataFrame`.
2. **Validation:** The "Detect" button is dynamically enabled/disabled based on whether the required parameters (Algorithm, Radius, Threshold) are valid numbers and selections.
3. **Detection:** When "Detect" is triggered, the trace data and parameters are sent to the backend. The backend returns a list of `Anomaly` objects.
4. **Integration:** The component transforms these anomalies into an `anomalymap`, determines if they are "improvements" based on the selected `direction`, and calls `UpdateWithFrameResponse` on the graph to render the familiar red/grey circles on the trace.
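The validation step can be sketched as a predicate over the raw form values. The algorithm names are those listed under "Parameters and Algorithms"; the function name `canDetect` and the rule that the radius must be a positive integer are assumptions.

```typescript
const ALGORITHMS: readonly string[] = ['absolute', 'const', 'percent', 'cohen', 'mannwhitneyu'];

// Returns true when the "Detect" button should be enabled.
// Assumption: radius must be a positive integer, threshold any finite number.
function canDetect(algo: string, radius: string, threshold: string): boolean {
  const r = Number(radius);
  const t = Number(threshold);
  return (
    ALGORITHMS.includes(algo) &&
    radius.trim() !== '' &&
    Number.isInteger(r) &&
    r > 0 &&
    threshold.trim() !== '' &&
    Number.isFinite(t)
  );
}
```

The explicit empty-string checks matter because `Number('')` evaluates to `0`, which would otherwise pass as a valid threshold.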
## Key Components and Files
- **`anomaly-playground-sk.ts`**: The main logic hub. It manages the lifecycle of the synthetic `DataFrame`, handles synchronization between the UI inputs and the URL state, and coordinates communication with the detection API.
- **`explore-simple-sk` (External Dependency)**: While not in this directory, it is the primary visual dependency. The playground acts as a controller for this component, feeding it data and anomaly maps manually.
- **`anomaly-playground-sk-demo.ts`**: Provides a mocked environment for local development, simulating the backend responses for detection and frame updates.
## Parameters and Algorithms
The module exposes the following detection settings. Algorithms are selected via the `StepDetection` type:
- **Algorithms**: `absolute`, `const`, `percent`, `cohen`, `mannwhitneyu`.
- **Radius**: Determines the window of data points to the left and right of a point to consider when calculating medians/statistics.
- **Threshold**: The sensitivity of the chosen algorithm.
- **Direction**: Defines whether an increase (`UP`) or decrease (`DOWN`) in value is treated as a regression or an improvement.
# Module: /modules/bisect-dialog-sk
# bisect-dialog-sk
The `bisect-dialog-sk` module provides a specialized modal dialog used in the Perf UI to initiate performance bisection jobs (Pinpoint) for Chrome performance regressions. It captures necessary metadata from a performance anomaly—such as test paths and revision ranges—and submits a request to create a bisection job to identify the root cause of a regression.
## Design and Implementation Choices
### Chrome-Specific Logic
The bisection logic within this module is specifically tailored for the Chrome performance testing ecosystem. This is reflected in how it parses "test paths" and maps them to specific bisection parameters like `benchmark`, `configuration`, and `story`.
### Data Parsing and Transformation
A significant portion of the logic in `bisect-dialog-sk.ts` involves decomposing a single `testPath` string into the structured fields required by the Pinpoint bisection API.
- **Path Splitting**: The module expects a slash-delimited path (e.g., `Master/Bot/Benchmark/Chart/Story`).
- **Statistic Extraction**: It checks the end of the test path against a set of known statistical suffixes (e.g., `avg`, `max`, `std`). If found, it separates the statistic from the chart name to ensure the bisection job monitors the correct metric.
- **Legacy Compatibility**: The module automatically replaces colons (`:`) with underscores (`_`) in the `story` field. This choice was made to reduce errors when querying test paths in legacy data tables.
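The decomposition described above can be sketched as follows. Several details are assumptions made for illustration: the `BisectFields` shape, the `parseTestPath` name, the short `STATISTICS` list, and the idea that the statistic is attached to the chart segment with an underscore (for example `score_avg`).

```typescript
// Assumption: a small subset of the known statistic suffixes.
const STATISTICS = ['avg', 'max', 'std'];

interface BisectFields {
  configuration: string;
  benchmark: string;
  chart: string;
  statistic: string;
  story: string;
}

// Parses "Master/Bot/Benchmark/Chart/Story" into structured bisect fields.
function parseTestPath(testPath: string): BisectFields {
  const parts = testPath.split('/');
  if (parts.length < 5) {
    throw new Error(`expected at least 5 path segments, got ${parts.length}`);
  }
  const [, configuration, benchmark, rawChart, ...storyParts] = parts;

  // Separate a trailing statistic (assumed "_<stat>") from the chart name.
  let chart = rawChart;
  let statistic = '';
  for (const stat of STATISTICS) {
    if (rawChart.endsWith(`_${stat}`)) {
      chart = rawChart.slice(0, -(stat.length + 1));
      statistic = stat;
      break;
    }
  }

  // Legacy compatibility: colons in the story become underscores.
  const story = storyParts.join('/').replace(/:/g, '_');
  return { configuration, benchmark, chart, statistic, story };
}
```

For example, `ChromiumPerf/mac-m1/speedometer/score_avg/story:one` would produce the chart `score`, the statistic `avg`, and the story `story_one`.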
### User Authorization
Bisection is a resource-intensive operation. The module utilizes the `alogin-sk` infrastructure to verify the user's identity and roles before allowing a submission. If a user is not logged in or lacks the necessary permissions, the dialog prevents the request and surfaces an error message.
## Key Components
### bisect-dialog-sk.ts
This is the primary implementation file. It defines the `BisectDialogSk` class, which handles:
- **State Management**: Tracks input parameters like `startCommit`, `endCommit`, `bugId`, and the resulting `jobUrl`.
- **Pre-loading**: The `setBisectInputParams` method allows parent components (like a chart tooltip or an anomaly list) to populate the dialog with context-specific data before opening.
- **Validation**: Performs client-side checks to ensure all required fields (start/end hashes, benchmark, etc.) are present before attempting a network request.
- **Submission**: Manages the `POST` request to `/_/bisect/create` and handles the asynchronous response, displaying a direct link to the created Pinpoint job upon success.
### Template and Styling
The UI is built using `lit-html` and styled with SCSS. It provides a clean, form-based layout within a `<dialog>` element.
- **Loading State**: Integrated `spinner-sk` provides visual feedback during the bisection request.
- **Responsive Inputs**: Uses standard HTML inputs for commit hashes and patches, allowing users to manually override pre-loaded data if needed.
## Workflow
The typical lifecycle of a bisection request through this module is as follows:
```text
[ External Component ] --(testPath, revisions)--> [ bisect-dialog-sk ]
                                                           |
                                                  [ .open() called ]
                                                           |
                                            <-- User edits/reviews form -->
                                                           |
                                                   [ .postBisect() ]
                                                           |
         -----------------------------------------------------------------------------
         |                                                 |                         |
[ Validation Fails ]                              [ Network Request ]         [ Auth Fails ]
         |                                                 |                         |
 [ Show error-sk ]                               [ /_/bisect/create ]         [ Show error ]
                                                           |
                                             -----------------------------
                                             |                           |
                                     [ Success (200) ]           [ Failure (5xx) ]
                                             |                           |
                                 [ Display Pinpoint Link ]       [ Show error-sk ]
```
## Integration Points
- **Preloading**: Call `setBisectInputParams(params: BisectPreloadParams)` to populate the dialog.
- **Execution**: Call `open()` to display the modal to the user.
- **Events**: While the module primarily handles its own submission, it relies on the global `errorMessage` utility to communicate failures to the user.
# Module: /modules/bug-tooltip-sk
The `bug-tooltip-sk` module provides a specialized custom element designed to display a summary of bugs (typically regressions) associated with a data point or alert. It balances a minimal UI footprint with quick access to detailed external bug tracking information.
### Design Philosophy
The module is built as a hover-triggered informational component. Instead of cluttering the main interface with long lists of bug IDs, it displays a concise count and reveals a detailed list only when the user expresses interest by hovering over the element.
Key implementation choices include:
- **Lightweight Shadow DOM Bypass**: The component uses `createRenderRoot() { return this; }`, meaning it renders directly into the light DOM. This choice simplifies global styling and ensures that the absolute positioning of the tooltip behaves predictably relative to its parent containers in the Perf UI.
- **CSS-Driven Interactivity**: The visibility of the tooltip is managed via CSS `:hover` states on the `.bug-count-container` rather than JavaScript event listeners. This reduces the overhead of the component and ensures high performance during rapid UI interactions.
- **Hardcoded Navigation Logic**: The element specifically formats links using the `http://b/` shortcut, optimized for internal issue tracking workflows.
### Key Components and Responsibilities
#### bug-tooltip-sk.ts
This file defines the `BugTooltipSk` LitElement. Its primary responsibility is to transform an array of `RegressionBug` objects into a readable summary.
- **Data Binding**: It accepts a `bugs` property. If the list is empty, the entire component is hidden via the `hidden` attribute to save space.
- **Customizable Labeling**: The `totalLabel` property allows consumers to change the suffix of the count (e.g., "with 2 regressions" vs "with 2 total"), making it reusable across different alert types.
#### bug-tooltip-sk.scss
The stylesheet manages the complex positioning and transition logic for the tooltip.
- **Positioning**: The tooltip is positioned absolutely at `bottom: 125%` of the container, ensuring it pops up above the text.
- **Overflow Handling**: Because the tooltip might contain many bugs, it is constrained by a `max-height` and features `overflow-y: auto`. This prevents the tooltip from expanding beyond the viewable area of the Perf content pane.
- **Visual Feedback**: A `0.7s` opacity transition is applied to provide a smooth "fade-in" effect when the user hovers over the bug count.
#### bug-tooltip-sk_po.ts
This file provides the Page Object (`BugTooltipSkPO`) for the module. It abstracts the DOM structure for integration tests, allowing tests to verify:
- Visibility states (both the container and the tooltip).
- Correctness of bug links and text content.
- Scrollability, ensuring that the CSS constraints on height are functioning correctly when high volumes of bugs are present.
### Workflow: Displaying Bug Details
The following diagram illustrates how the component handles user interaction to reveal bug data:
```text
[Data Input] -> [bugs: RegressionBug[]]
                           |
                           v
                +---------------------+
                | Is bugs.length > 0? | -- No --> [Render Nothing]
                +---------------------+
                           | Yes
                           v
              +------------------------+
              | Render: "with X total" |
              +------------------------+
                           |
                  [User Hover Action]
                           |
                           v
              +------------------------+
              | CSS: opacity 0 -> 1    |
              | CSS: visibility: vis   |
              +------------------------+
                           |
              +------------------------+
              | Rendered List:         |
              | - ID (Link to b/ID)    |
              | - Type (Source)        |
              +------------------------+
```
### Data Structure
The component expects the `bugs` property to conform to the `RegressionBug` interface (imported from the central JSON definitions), which requires:
- `bug_id`: The numeric identifier for the bug.
- `bug_type`: A string indicating the origin or category of the bug (e.g., "monorail").
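The data-to-text transformation can be sketched with two small helpers. `RegressionBugLike` mirrors the two fields listed above but is a local stand-in for the real interface, and the helper names are hypothetical.

```typescript
// Local stand-in for the RegressionBug interface described above.
interface RegressionBugLike {
  bug_id: number;
  bug_type: string;
}

// Formats the link target using the http://b/ shortcut.
function bugLink(bug: RegressionBugLike): string {
  return `http://b/${bug.bug_id}`;
}

// Builds the collapsed summary text; an empty list produces no label,
// matching the "hidden when empty" behavior.
function summaryLabel(bugs: RegressionBugLike[], totalLabel = 'total'): string {
  return bugs.length === 0 ? '' : `with ${bugs.length} ${totalLabel}`;
}
```

Passing `'regressions'` as the second argument produces the alternative suffix mentioned for `totalLabel`.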
# Module: /modules/calendar-input-sk
# calendar-input-sk
The `calendar-input-sk` module provides a hybrid date-selection component. It combines a manual text input field with a graphical calendar picker, ensuring that users can either type a specific date quickly or browse a calendar for context.
## Design and Implementation
The component is designed around the principle of flexibility and validation. It acknowledges that while calendar pickers are user-friendly for relative date selection (e.g., "next Thursday"), manual entry is often faster for absolute date entry (e.g., "1995-03-12").
### Key Components
- **Text Input**: A standard HTML `<input type="text">` restricted by a regex pattern that enforces the `YYYY-MM-DD` format. This provides a lightweight way to enter dates without relying on native browser date pickers, which can vary significantly in behavior and styling across platforms.
- **Trigger Button**: An icon button (using `date-range-icon-sk`) that activates the graphical selection interface.
- **Modal Dialog**: A native HTML `<dialog>` element containing a `calendar-sk` component. Using a native dialog allows the component to leverage built-in browser features for modal behavior, such as focus trapping and the "Esc" key to close.
### Interaction Workflow
The component synchronizes state between the text field and the calendar widget:
```
[ User Input ] --> [ Regex Validation ] --(valid)--> [ Update Internal Date ] --(emit)--> [ input event ]
| ^
| |
[ Click Icon ] --> [ Open Dialog ] --> [ Select Date in Calendar ] --(close)--> [ Update Input Value ]
```
1. **Manual Entry**: When a user types in the input field, the component monitors the `input` event. It validates the string against the required pattern. If valid, it parses the date and updates the internal state.
2. **Calendar Selection**: Clicking the calendar icon opens the modal. The `calendar-input-sk` manages this interaction using a Promise-based approach. The `openHandler` awaits a Promise that is resolved when a date is picked in the sub-component (`calendar-sk`) or rejected if the user cancels.
3. **Keyboard Support**: While the dialog is open, the component proxies keyboard events to the `calendar-sk` element, allowing users to navigate the calendar grid using arrow keys even though the dialog has focus.
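The manual-entry path (validate, then parse) can be sketched as below. The exact regex used by the component is not shown in this document, so `DATE_PATTERN` is an assumption, as is the extra rollover check, which goes beyond pure pattern matching.

```typescript
// Assumed pattern for the YYYY-MM-DD format.
const DATE_PATTERN = /^(\d{4})-(\d{2})-(\d{2})$/;

// Returns a local-time Date for valid input, or null otherwise.
function parseDateInput(text: string): Date | null {
  const match = DATE_PATTERN.exec(text);
  if (match === null) return null;

  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  const date = new Date(year, month - 1, day); // months are 0-indexed

  // Reject values the Date constructor silently rolls over, e.g. 2023-02-31.
  if (
    date.getFullYear() !== year ||
    date.getMonth() !== month - 1 ||
    date.getDate() !== day
  ) {
    return null;
  }
  return date;
}
```

Only when this returns a `Date` would the component update its internal state and emit its `input` event.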
## Design Decisions
- **Pattern Validation**: Instead of using `type="date"`, which often forces a specific UI localized by the browser, this component uses `type="text"` with a `pattern`. This ensures a consistent look and feel across all browsers while still providing immediate feedback via CSS (using the `:invalid` pseudo-class) when the format is incorrect.
- **State Synchronization**: The `displayDate` property acts as the single source of truth. Setting this property triggers a re-render of the input value and updates the state of the underlying `calendar-sk` widget.
- **Event Handling**: The component emits a custom `input` event containing the selected `Date` object in the `detail` field. This mirrors the standard input behavior while providing a rich data type to the consumer. It explicitly stops propagation of internal native input events to prevent confusing them with the component's own semantic "date changed" event.
## Styling
The component uses scoped CSS to handle validation states. When the input's regex pattern is not met, an "invalid" indicator (an "✘" mark) is displayed via CSS transitions:
- The `input:invalid + .invalid` selector allows for a CSS-only toggle of error messages, minimizing the amount of manual DOM manipulation required in the TypeScript logic.
- It utilizes the `perf/modules/themes` variables to ensure the dialog and input colors remain consistent with the overall application theme (supporting both light and dark modes).
# Module: /modules/calendar-sk
The `calendar-sk` module provides a custom web component that displays an accessible, themeable, and localized calendar for selecting a single date. It is designed to overcome the limitations of the native `<input type="date">` (specifically Safari compatibility and lack of styling) and other third-party libraries that may be inaccessible or difficult to theme.
### Design Decisions
- **Custom Date Manipulation**: The component uses a `CalendarDate` helper class and local time manipulation to ensure that date selection is predictable and avoids the common pitfalls of UTC vs. local time offsets.
- **Accessibility First**: Implementation follows W3C WAI-ARIA practices for date pickers. This includes proper `aria-live` regions for month/year changes, `aria-selected` states for the current selection, and a robust focus management system.
- **Localization via Intl API**: Rather than hardcoding month and day names, the component utilizes the `Intl.DateTimeFormat` API. This allows the calendar to automatically adapt its labels (e.g., "January" vs "一月") and weekday headers based on the provided `locale` property.
- **Decoupled Keyboard Handling**: Instead of automatically capturing all global key presses, the component exposes a `keyboardHandler` method. This allows the parent application to decide when the calendar should respond to input (e.g., only when a specific dialog is open).
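The localization approach can be illustrated with the standard `Intl.DateTimeFormat` API. The helper names are hypothetical, and the component's own formatting options may differ from the `long` and `short` styles used here.

```typescript
// Month labels for a locale, January first.
function monthNames(locale: string): string[] {
  const fmt = new Intl.DateTimeFormat(locale, { month: 'long' });
  return Array.from({ length: 12 }, (_, m) => fmt.format(new Date(2021, m, 1)));
}

// Weekday labels for a locale, Sunday first.
// 2021-08-01 was a Sunday, so consecutive days give Sunday..Saturday.
function weekdayNames(locale: string): string[] {
  const fmt = new Intl.DateTimeFormat(locale, { weekday: 'short' });
  return Array.from({ length: 7 }, (_, d) => fmt.format(new Date(2021, 7, 1 + d)));
}
```

Changing the `locale` argument is all that is needed to switch label languages; no month or weekday strings are hardcoded.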
### Key Components
#### `calendar-sk.ts`
This is the core logic of the module. It defines the `CalendarSk` class, which extends `ElementSk` and uses `lit-html` for rendering.
- **State Management**: It maintains an internal `_displayDate` which determines which month is currently visible.
- **Navigation Logic**: It contains methods for incrementing/decrementing days, weeks, months, and years. It handles edge cases like moving from January 31st to February (clamping the date to the last day of the month).
- **Template Generation**: It dynamically calculates the grid layout (up to 6 rows) based on the first day of the week and the number of days in the specific month.
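The month-navigation edge case can be sketched with plain `Date` arithmetic. `addMonths` is a hypothetical helper, not the component's method; it shows one way to clamp the day to the last day of the target month.

```typescript
// Moves a date by `delta` months, clamping the day of month so that
// January 31st + 1 month lands on the last day of February.
function addMonths(date: Date, delta: number): Date {
  const year = date.getFullYear();
  const month = date.getMonth() + delta; // may be out of 0..11; Date normalizes it
  // Day 0 of the following month is the last day of the target month.
  const lastDay = new Date(year, month + 1, 0).getDate();
  return new Date(year, month, Math.min(date.getDate(), lastDay));
}
```

Without the clamp, `new Date(2023, 1, 31)` would roll over into March, skipping February entirely.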
#### `calendar-sk.scss`
Styles the calendar using CSS variables for themeability (e.g., `--background`, `--secondary`, `--surface-1dp`). It ensures that the calendar buttons are uniform and that the "today" and "selected" states are visually distinct.
#### Events
The component communicates state changes to the rest of the application via a standard DOM event:
- **`change`**: Fired whenever a user selects a date. The `detail` property of the event contains the selected `Date` object.
### Key Workflows
#### Navigation and Selection
The user can navigate through time using UI buttons or keyboard shortcuts. When a date is selected, the component updates its internal state and notifies listeners.
```text
User Action           Component Logic               UI Update
-----------           ---------------               ---------
Click "Next Month" -> incMonth() -----------------> Re-renders table grid
Press "ArrowRight" -> keyboardHandler(incDay()) --> Updates focus & aria-selected
Click a Day Button -> dateClick() ----------------> Dispatches 'change' event
```
#### Keyboard Shortcuts
When the `keyboardHandler` is active, the following shortcuts are supported:
| Key | Action |
| -------------------------- | --------------------------- |
| `ArrowLeft` / `ArrowRight` | Move back/forward one day |
| `ArrowUp` / `ArrowDown` | Move back/forward one week |
| `PageUp` / `PageDown` | Move back/forward one month |
### Usage Example
In a parent component or page, you can initialize the calendar and hook into its events:
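```typescript
// Import path is an assumption; adjust it to where calendar-sk lives in your tree.
import { CalendarSk } from '../calendar-sk/calendar-sk';

const calendar = document.querySelector<CalendarSk>('calendar-sk')!;

// Set initial date and locale
calendar.displayDate = new Date();
calendar.locale = 'en-US';

// Listen for selection; the selected Date is carried in the event's detail
calendar.addEventListener('change', (e) => {
  console.log('New date selected:', (e as CustomEvent<Date>).detail);
});

// Proxy keyboard events from a container
window.addEventListener('keydown', (e) => calendar.keyboardHandler(e));
```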
# Module: /modules/chart-tooltip-sk
# chart-tooltip-sk
The `chart-tooltip-sk` module provides a rich, interactive tooltip designed for performance charts. It serves as the primary interface for users to inspect specific data points, view commit metadata, triage anomalies, and initiate debugging workflows like bisections.
## Overview
Unlike a simple text tooltip, `chart-tooltip-sk` is a complex orchestrator that aggregates data from multiple sources (dataframes, anomaly maps, and backend CID handlers). It is designed to be dynamically positioned over a chart and provides contextual actions based on the nature of the selected data point (e.g., whether it is a single point, a range, or a detected anomaly).
### Key Responsibilities
- **Contextual Data Display**: Shows date, value, and unit information for any hovered or selected point.
- **Anomaly Triaging**: If a point is identified as an anomaly, the tooltip shows the percentage change and the median values before and after, and offers a `triage-menu-sk` for filing bugs or ignoring the regression.
- **Workflow Integration**: Acts as a bridge to launch Bisect jobs (`bisect-dialog-sk`) and Pinpoint try jobs (`pinpoint-try-job-dialog-sk`).
- **Commit Navigation**: Integrates with `commit-range-sk` to show the range of commits associated with a point and provides direct links to the source repository.
- **Source Inspection**: Optionally displays links to raw JSON source files for the data point if configured.
## Design Decisions
### Positioning Logic (`moveTo`)
The tooltip implements custom positioning logic instead of relying on standard CSS hover tooltips. This is necessary because it must stay within the viewport and the chart boundaries.
- **Smart Shifting**: It calculates its own dimensions using `getBoundingClientRect()` and automatically flips to the left of the cursor if it would overflow the right edge of the screen.
- **Vertical Adjustment**: It shifts vertically to ensure it doesn't get cut off by the bottom of the browser window.
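The flip-and-shift behavior can be sketched as a pure function. In the real component the tooltip size would come from `getBoundingClientRect()`; here it is passed in, and the `placeTooltip` name, the `margin` parameter, and the exact offsets are assumptions.

```typescript
interface Point {
  x: number;
  y: number;
}

interface Size {
  width: number;
  height: number;
}

// Computes the tooltip's top-left corner for a given cursor position.
function placeTooltip(cursor: Point, tooltip: Size, viewport: Size, margin = 10): Point {
  // Default: to the right of and below the cursor.
  let x = cursor.x + margin;
  let y = cursor.y + margin;

  // Flip to the left of the cursor if the right edge would overflow.
  if (x + tooltip.width > viewport.width) {
    x = cursor.x - margin - tooltip.width;
  }
  // Shift up if the bottom edge would be cut off.
  if (y + tooltip.height > viewport.height) {
    y = viewport.height - tooltip.height;
  }
  // Never position outside the top-left corner.
  return { x: Math.max(0, x), y: Math.max(0, y) };
}
```

Keeping the calculation free of DOM access makes the edge cases easy to unit test.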
### Why a `load` method?
Rather than using many individual attributes, the module uses a comprehensive `load()` method. This decision ensures that all interrelated properties (anomaly data, commit info, color, and triage state) are updated atomically before a render is triggered. This prevents "flicker" where the tooltip might show an old anomaly's data with a new point's coordinates.
### Conditional Content
The UI is highly reactive to the global `window.perf` configuration and the specific data passed to it:
- **Pinpoint/Bisect**: Buttons are hidden if the instance doesn't support them or if the git repository is not a supported Chromium source.
- **Anomaly vs. Normal**: The template branches significantly. For anomalies, it calculates and colors the "Improvement" vs "Regression" status; for normal points, it can show a `user-issue-sk` component to track non-anomaly bugs.
## Key Components
| Component | Role within Tooltip |
| :----------------- | :------------------------------------------------------------------------------------------------------- |
| `commit-range-sk` | Displays and links the range of revisions for the selected point. |
| `triage-menu-sk` | Provides the UI for filing new bugs or associating the anomaly with an existing one. |
| `point-links-sk` | Renders custom links based on the specific trace configuration (e.g., V8 or WebRTC specific dashboards). |
| `bisect-dialog-sk` | A dialog triggered from the tooltip to start a performance bisection. |
| `json-source-sk` | Displays the underlying data source when enabled via `show_json_file_display`. |
## Key Workflows
### Data Loading and Rendering
When a user interacts with a chart, the following process occurs to populate the tooltip:
```text
Chart Event (Hover/Click)
      |
      v
explore-simple-sk (or parent) calls .load(...)
      |
      +--> Update internal state (index, anomaly, commit_info)
      +--> Determine Trace Color
      +--> Configure sub-components (CommitRange, UserIssue)
      |
      v
._render()
      |
      +--> logic: Is this an anomaly?
      |      |-- YES: Show anomalyTemplate() (Medians, Triage Menu)
      |      +-- NO:  Show user-issue-sk
      |
      +--> logic: Is always_show_commit_info true?
             |-- YES: Show Author/Message/Hash
             +-- NO:  Hide commit info if it's a range
```
### Triaging an Anomaly
The tooltip facilitates the transition from "seeing a spike" to "taking action":
1. **Detection**: The parent element passes an `Anomaly` object to the tooltip.
2. **Visualization**: The tooltip displays the "Anomaly Range" and "Percent Change".
3. **Action**:
   - If no bug is associated (`bug_id === 0`), the `triage-menu-sk` appears.
   - The user can click "Bisect" to pre-populate a bisection job with the anomaly's revision range.
   - Upon successful triage, the `anomaly-changed` event refreshes the display to show the new Bug ID.
# Module: /modules/cid
# Commit ID (CID) Resolution Module
The `cid` module provides a centralized client-side interface for resolving internal commit identifiers—represented as `CommitNumber` types—into rich commit metadata.
In the Perf system, performance data is often indexed by a sequential `CommitNumber` (also known as an offset) to optimize storage and time-series lookups. However, for human readability and integration with version control systems, these numbers must be translated back into git hashes, timestamps, authors, and commit messages. This module abstracts that translation process.
## Design Decisions
### Centralized Resolution
The decision to use a dedicated RPC endpoint (`/_/cid/`) rather than embedding commit metadata directly into performance data streams is driven by bandwidth efficiency. Performance results often contain thousands of data points; including full commit details for every point would result in massive payloads. Instead, the UI receives lightweight `CommitNumber` integers and uses this module to batch-resolve only the specific commits needed for display (e.g., when hovering over a point in a chart or viewing a table of regressions).
### Batch Processing
The module is designed around batching. The `lookupCids` function accepts an array of `CommitNumber`s, allowing the frontend to resolve an entire range of commits in a single HTTP POST request. This minimizes network overhead and reduces latency when populating large data views.
## Key Components
### Commit Translation (`cid.ts`)
The core functionality is encapsulated in the `lookupCids` function. It acts as the bridge between the frontend and the Perf backend's CID handler.
- **Input**: An array of `CommitNumber` values.
- **Process**: It serializes these numbers into a JSON body and sends a POST request to the `/_/cid/` endpoint.
- **Output**: A `CIDHandlerResponse` object containing a `commitSlice` (an array of detailed commit objects) and an optional `logEntry` for debugging or context.
The use of `jsonOrThrow` ensures that the calling code doesn't have to manually check HTTP status codes for common failure modes, streamlining error handling in the UI components that consume this data.
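The request side of this flow can be sketched as follows. The types are simplified stand-ins for those in `/perf/modules/json`, the function names are hypothetical, and the inline status check is a local approximation of what `jsonOrThrow` provides, not its actual implementation. The fetch function is injected so the sketch does not depend on a browser.

```typescript
// Simplified stand-in; the real CommitNumber is a branded type.
type CommitNumber = number;

interface CidRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Minimal subset of the fetch Response used by this sketch.
interface ResponseLike {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

// Builds a single batched POST for all requested commit numbers.
function buildCidRequest(cids: CommitNumber[]): CidRequest {
  return {
    url: '/_/cid/',
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(cids),
    },
  };
}

// Sends the request and rejects on non-2xx responses, so callers
// do not need to inspect status codes themselves.
async function lookupCidsSketch(
  cids: CommitNumber[],
  fetchFn: (url: string, init: CidRequest['init']) => Promise<ResponseLike>
): Promise<unknown> {
  const { url, init } = buildCidRequest(cids);
  const resp = await fetchFn(url, init);
  if (!resp.ok) {
    throw new Error(`Bad network response: ${resp.status}`);
  }
  return resp.json();
}
```

In a browser, the global `fetch` can be passed as `fetchFn`; in tests, a stub that returns a canned response is enough.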
## Workflow: Resolving Commit Metadata
The following diagram illustrates how a UI component uses this module to transform raw data into a human-readable format:
```text
+----------------+       +------------------+       +-----------------+
|  UI Component  |       |    CID Module    |       |  Perf Backend   |
|  (e.g. Chart)  |       |     (cid.ts)     |       |  (/_/cid/ RPC)  |
+-------+--------+       +--------+---------+       +--------+--------+
        |                         |                          |
        | 1. Request Resolution   |                          |
        |    [101, 102, 105]      |                          |
        +------------------------>|                          |
        |                         | 2. POST /_/cid/          |
        |                         |    JSON([101, 102, 105]) |
        |                         +------------------------->|
        |                         |                          |
        |                         | 3. Return Metadata       |
        |                         |    (Hashes, Msgs, etc.)  |
        |                         |<-------------------------+
        | 4. Update View          |                          |
        |<------------------------+                          |
        |                         |                          |
+-------+--------+       +--------+---------+       +--------+--------+
```
## Related Data Structures
The module relies heavily on types defined in `/perf/modules/json`, specifically:
- **CommitNumber**: A branded type representing the sequential index of a commit.
- **CIDHandlerResponse**: The schema for the backend response, which includes the `Commit` objects containing the hash, author, timestamp, and message.
# Module: /modules/cluster-lastn-page-sk
# cluster-lastn-page-sk
The `cluster-lastn-page-sk` module provides a comprehensive interface for testing, dry-running, and saving performance alert configurations. It allows developers and performance engineers to "test drive" anomaly detection algorithms against historical data before committing them as active monitors.
## Overview
At its core, this module acts as a sandbox for the Perf clustering and regression detection system. It enables users to define parameters for an alert (such as the algorithm, threshold, and data query), run that configuration against a specific range of commits, and inspect the resulting clusters to verify if the alert is too noisy or missing real regressions.
## Key Components and Responsibilities
### Alert Configuration and State Management
The module heavily relies on `alert-config-sk` for defining the detection logic.
- **State Reflection:** It uses `stateReflector` to synchronize the current alert configuration with the URL. This allows users to share a specific "dry-run" setup by simply copying the browser address.
- **Alert Persistence:** Once a user is satisfied with the dry-run results, the module handles the transition from a temporary configuration to a persistent one via the `/_/alert/update` endpoint. It dynamically changes its UI (e.g., button labels) based on whether the user is creating a new alert or updating an existing one.
### Dry-Run Execution Workflow
The "Run" process is an asynchronous operation that leverages a progress-tracking API.
1. **Request Initiation:** It sends a `RegressionDetectionRequest` to `/_/dryrun/start`, containing the alert configuration and the commit range (defined by `domain-picker-sk`).
2. **Progress Monitoring:** Instead of a single blocking request, it utilizes a polling mechanism (via `startRequest`) to receive intermediate updates. This allows the UI to display real-time status messages and partial results.
3. **Result Visualization:** Detected regressions are rendered in a tabular format, broken down by commit and the direction of the change (High/Low).
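The polling step can be modelled as a small pure function that decides what the UI should do with each progress response. This is only a sketch: the `Progress` shape, its status strings, and the `nextAction` name are assumptions made for illustration, not the actual Perf progress API.

```typescript
// Hypothetical shape of one response from the progress endpoint.
interface Progress<R> {
  status: 'Running' | 'Finished' | 'Error';
  messages: { key: string; value: string }[];
  url: string; // Where to poll next while the job is still running.
  results?: R;
}

type Action<R> =
  | { kind: 'poll'; url: string; statusText: string }
  | { kind: 'done'; results: R }
  | { kind: 'error'; message: string };

// Decide what to do after each response: keep polling, render, or report.
function nextAction<R>(p: Progress<R>): Action<R> {
  const statusText = p.messages.map((m) => `${m.key}: ${m.value}`).join('\n');
  if (p.status === 'Running') {
    return { kind: 'poll', url: p.url, statusText };
  }
  if (p.status === 'Finished' && p.results !== undefined) {
    return { kind: 'done', results: p.results };
  }
  return { kind: 'error', message: statusText || 'Request failed' };
}
```

A caller would loop on `poll` actions, showing `statusText` each time, and stop on `done` or `error`.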
### Regression Analysis and Triage
The module doesn't just list regressions; it provides deep-dive capabilities into the clusters found:
- **Triage Status:** It integrates `triage-status-sk` within the results table to show the current status of detected anomalies.
- **Detailed Inspection:** Clicking on a result triggers a modal containing `cluster-summary2-sk`. This allows users to see the specific traces contributing to a cluster without navigating away from the dry-run page.
- **External Linking:** Through the `open-keys` event, the module can open the Explore page in a new tab, pre-populated with the specific trace keys and time range associated with a detected regression.
## Key Workflows
### Testing an Alert
```
User Configures Alert -> Clicks "Run"
|
V
POST /_/dryrun/start (Alert Params + Domain)
|
+--<-- Polling /_/progress/ -> Updates Status UI
|
V
Results Received -> Render Table (Commits x Clusters)
|
+-- Click Cluster -> Open Triage Dialog (Internal Inspection)
|
+-- Click "Accept" -> Save Alert to Database
```
### Domain and Range Selection
The module uses a `domain-picker-sk` to define the "where" and "when" of the test. Users can specify:
- **Number of commits:** How far back to look.
- **Commit Range:** Specific start and end points in time.
The UI defaults to a "dense" request type to ensure sufficient data points are evaluated during the dry-run, regardless of the underlying data sparsity.
## Design Decisions
### Modal Dialogs for Configuration
The use of `<dialog>` elements for `alert-config-sk` and `cluster-summary2-sk` ensures that the main dry-run context (the results table and run settings) remains visible and persistent in the background while the user fine-tunes parameters or inspects specific data points.
### Error Handling
The module distinguishes between transient request errors and configuration errors. Error messages from the dry-run process are captured and displayed within a dedicated `<pre>` block to preserve formatting (like stack traces or detailed engine logs), while authentication or persistence errors are routed through a global `error-toast-sk`.
# Module: /modules/cluster-page-sk
# cluster-page-sk
The `cluster-page-sk` module provides a comprehensive interface for performing regression detection and trace clustering within the Skia Perf framework. It allows users to identify groups of performance traces that exhibit similar behavior—such as a coordinated step or shift—around a specific commit.
## High-Level Overview
The primary purpose of this module is to give developers and performance engineers a way to "cluster" traces based on their statistical properties. Instead of looking at individual traces, users can identify patterns across hundreds or thousands of benchmarks.
The workflow typically involves:
1. **Selection**: Picking a specific commit (the "center" of the analysis) and a set of traces via a query.
2. **Configuration**: Choosing an algorithm (e.g., K-Means or Step-Fit) and defining sensitivity parameters.
3. **Execution**: Starting a server-side task and polling it for progress while it computes clusters.
4. **Triage**: Reviewing the resulting clusters to identify performance regressions or improvements.
## Design and Implementation Choices
### Asynchronous Progress Handling
Clustering is a computationally expensive operation that can take a significant amount of time. To prevent UI blocking and handle potential timeouts, the module uses a "start-and-poll" pattern.
- It initiates a request to `/_/cluster/start`.
- It utilizes a specialized `progress` utility to poll for status updates.
- Real-time status messages from the server are streamed to the UI, providing feedback on the current stage of the clustering process (e.g., "Calculating centroids", "Analyzing step fits").
### State Reflection
To support bookmarking and sharing of specific clustering configurations, the module uses `stateReflector`. Parameters such as the selected commit offset, the query string, algorithm choice (`k`, radius, etc.), and "interestingness" thresholds are automatically mirrored in the URL's hash.
### Component-Based Architecture
The page is composed of several specialized sub-elements, each handling a distinct part of the clustering lifecycle:
- **`commit-detail-picker-sk`**: Handles the complex logic of searching for and selecting a specific commit.
- **`algo-select-sk`**: Provides the UI for switching between different clustering strategies (like K-Means vs. Step-Fit).
- **`query-sk` & `query-count-sk`**: Allow users to filter the multi-million trace dataset down to a specific subset while getting immediate feedback on the number of matches.
- **`cluster-summary2-sk`**: Visualizes the output of the clustering engine, showing centroids and statistical summaries of the discovered groups.
## Key Components and Responsibilities
### State Management (`State` class)
The internal `State` class defines the schema for what makes a clustering request unique. This includes:
- `offset`: The commit index to analyze.
- `radius`: How many commits before and after the offset to include in the window.
- `k`: The number of clusters to find (0 allows the server to auto-calculate).
- `interesting`: A threshold score; clusters with a regression score below this are ignored.
- `sparse`: A boolean flag to skip traces that lack data in the requested range.
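As a rough illustration of how such a state object can be mirrored into a shareable URL, the sketch below serializes only the fields that differ from their defaults. The field names follow the list above; the default values and the helpers `toQueryString` and `fromQueryString` are assumptions, and this is a stand-in for the idea rather than the `stateReflector` implementation.

```typescript
// Field names follow the list above; the default values are assumptions.
interface State {
  offset: number;
  radius: number;
  k: number;
  interesting: number;
  sparse: boolean;
}

const DEFAULT_STATE: State = { offset: -1, radius: 10, k: 0, interesting: 50, sparse: false };

// Serialize only the fields that differ from the defaults, so URLs stay short.
function toQueryString(state: State): string {
  const params = new URLSearchParams();
  (Object.keys(DEFAULT_STATE) as (keyof State)[]).forEach((key) => {
    if (state[key] !== DEFAULT_STATE[key]) {
      params.set(key, String(state[key]));
    }
  });
  return params.toString();
}

// Rebuild a State from a query string, using defaults for missing keys.
function fromQueryString(query: string): State {
  const params = new URLSearchParams(query);
  const num = (key: 'offset' | 'radius' | 'k' | 'interesting'): number => {
    const raw = params.get(key);
    return raw === null ? DEFAULT_STATE[key] : Number(raw);
  };
  return {
    offset: num('offset'),
    radius: num('radius'),
    k: num('k'),
    interesting: num('interesting'),
    sparse: params.has('sparse') ? params.get('sparse') === 'true' : DEFAULT_STATE.sparse,
  };
}
```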
### The Clustering Workflow
The `start()` method is the core logic driver. It gathers the current state into a `RegressionDetectionRequest` and manages the lifecycle of the network request.
```
[ User Input ] -> [ State Reflector ] -> [ URL Updated ]
|
[ Click "Run" ]
|
v
[ POST /_/cluster/start ] ----> [ Server starts Job ]
| |
|<-------[ Poll Progress ] <----+
| |
v |
[ Update Spinner ] <+
[ Show Messages ] <+
|
v
[ GET Final Results ] -> [ Map to cluster-summary2-sk ]
```
### Event Handling
- **`commitSelected`**: Listens for selection events from the commit picker to update the target `offset`.
- **`openKeys`**: When a user clicks on a specific cluster summary, this handler constructs a URL for the `Explore` page (`/e/`) using a shortcut to the traces in that cluster, allowing for deeper drill-down analysis.
- **`queryChanged`**: Dynamically updates the `paramset-sk` and triggers a re-count of matching traces to help the user gauge the scope of their request before running it.
## Results Visualization and Sorting
Once the results are returned, they are rendered as a list of `cluster-summary2-sk` elements. The module includes a `sort-sk` component that allows users to re-order these results based on:
- **Cluster Size**: Number of traces in the group.
- **Regression**: The calculated severity of the shift.
- **Step Size**: The absolute magnitude of the change.
- **Least Squares**: The statistical fit of the data to a step function.
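The re-ordering described above amounts to sorting by one numeric field. A minimal sketch follows, assuming a simplified summary shape; the field names in `ClusterSummaryLike` and the `sortClusters` helper are hypothetical and not taken from the module.

```typescript
// Hypothetical subset of a cluster summary; field names are assumptions.
interface ClusterSummaryLike {
  numTraces: number;
  regression: number;
  stepSize: number;
  leastSquares: number;
}

type SortKey = keyof ClusterSummaryLike;

// Step size is compared by absolute magnitude; other keys by raw value.
function sortValue(c: ClusterSummaryLike, key: SortKey): number {
  return key === 'stepSize' ? Math.abs(c.stepSize) : c[key];
}

// Return a new array ordered by the chosen key; the input is left untouched.
function sortClusters(
  clusters: readonly ClusterSummaryLike[],
  key: SortKey,
  descending = true
): ClusterSummaryLike[] {
  const sign = descending ? -1 : 1;
  return [...clusters].sort((a, b) => sign * (sortValue(a, key) - sortValue(b, key)));
}
```

Copying before sorting keeps the original result order available, which matters if the same array also backs other UI state.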
# Module: /modules/cluster-summary2-sk
# cluster-summary2-sk
The `cluster-summary2-sk` module provides a comprehensive UI component for visualizing and triaging performance regressions in the Perf system. It represents a "cluster" of traces that exhibit similar behavior (usually a step-up or step-down) at a specific point in time.
## High-Level Overview
This component serves as a detailed view for an anomaly. It combines several data dimensions into a single interface:
1. **Visual Evidence:** A plot showing the centroid of the trace cluster.
2. **Statistical Context:** Metrics like regression magnitude, step size, and least squares error.
3. **Metadata:** Impacted parameters (via a Word Cloud) and commit details.
4. **Actionability:** Controls to triage the anomaly (e.g., mark as "Positive" or "Negative") or investigate further on the dashboard.
## Design Decisions and Implementation
### Dynamic Labeling and Formatting
A key design challenge in Perf is that different step detection algorithms (e.g., `mannwhitneyu`, `cohen`, `percent`) produce statistically different outputs. Rather than a generic "Value" label, `cluster-summary2-sk` uses a mapping strategy (`labelsForStepDetection`) to provide context-aware labels and formatting.
- **Why:** A "Regression Factor" of 0.05 meets the conventional significance threshold when it is a p-value (`mannwhitneyu`) but is potentially insignificant when it is a "Percentage Change."
- **How:** The component reacts to the `alert` property. When an alert is set, it updates the internal `labels` object, changing UI strings (e.g., "p:" vs "Percentage Change:") and the corresponding number formatters (percent vs decimal).
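The mapping strategy can be pictured as a lookup table from algorithm name to a label and a number formatter. In the sketch below, the algorithm names come from the text above, while the label strings, the formatters, and the `labelsFor` helper are illustrative assumptions rather than the contents of `labelsForStepDetection`.

```typescript
interface Labels {
  regression: string;
  format: (value: number) => string;
}

const decimal = (v: number): string => v.toFixed(4);
const percent = (v: number): string => `${(v * 100).toFixed(1)}%`;

// Used when the algorithm is unknown or not set.
const DEFAULT_LABELS: Labels = { regression: 'Regression:', format: decimal };

// Hypothetical label table keyed by step detection algorithm.
const LABELS = new Map<string, Labels>([
  ['mannwhitneyu', { regression: 'p:', format: decimal }],
  ['cohen', { regression: "Cohen's d:", format: decimal }],
  ['percent', { regression: 'Percentage Change:', format: percent }],
]);

function labelsFor(stepDetection: string): Labels {
  return LABELS.get(stepDetection) ?? DEFAULT_LABELS;
}
```

With this table the same raw value of 0.05 renders as `0.0500` next to "p:" and as `5.0%` next to "Percentage Change:".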
### Layout and Information Density
The component uses a CSS Grid layout to manage a complex set of child elements, ensuring that critical information remains visible even as secondary tools (like the Word Cloud) are toggled.
```
+-------------------------------------------+
| [Regression Status Banner (High/Low)] |
+----------------------+--------------------+
| [Stats Row] | [Triage Controls] |
+----------------------+ |
| [Google Chart Plot] | |
+----------------------+--------------------+
| [Commit Detail Panel] |
+-------------------------------------------+
| [Action Buttons: Dashboard / Word Cloud] |
+-------------------------------------------+
| [Collapsible Word Cloud Area] |
+-------------------------------------------+
```
### Data Integration and State
The component consumes a `FullSummary` object, which is a composite of the `ClusterSummary` (the statistics) and the `FrameResponse` (the raw data for the plot).
- **Plotting:** It transforms the cluster `centroid` (an array of numbers) and the dataframe `header` (commit info) into a format suitable for `plot-google-chart-sk`. It also places an "x-bar" (vertical line) at the exact commit where the regression was detected.
- **Permissions:** It checks the user's login status via `alogin-sk`. If the user lacks the `editor` role, triage controls are visually disabled to prevent unauthorized state changes.
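The plotting transformation can be sketched as pairing each centroid value with its commit and locating the row for the x-bar. The shapes below are simplified assumptions (the real `FullSummary` carries more fields), and `toPlotInput` is a hypothetical helper name.

```typescript
// Hypothetical, simplified column header.
interface ColumnHeader {
  offset: number;
  timestamp: number;
}

interface PlotInput {
  rows: [number, number][]; // [commit offset, centroid value]
  xbar: number; // Index of the row to mark with a vertical line, or -1.
}

function toPlotInput(centroid: number[], header: ColumnHeader[], stepOffset: number): PlotInput {
  const n = Math.min(centroid.length, header.length);
  const rows: [number, number][] = [];
  for (let i = 0; i < n; i++) {
    rows.push([header[i].offset, centroid[i]]);
  }
  // The x-bar sits on the row whose commit matches the detected step.
  const xbar = rows.findIndex(([offset]) => offset === stepOffset);
  return { rows, xbar };
}
```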
## Key Components and Responsibilities
### `cluster-summary2-sk.ts`
The main logic engine. It handles:
- **Event Dispatching:** Fires `triaged` when a user updates the status and `open-keys` when a user wants to explore the cluster on the main Perf dashboard.
- **Coordinate Lookup:** Uses the static `lookupCids` method to fetch commit metadata when a user clicks a point on the graph.
- **Attribute Management:** Supports a `notriage` attribute to hide triage UI in read-only contexts.
### Integrated Sub-elements
The component acts as a coordinator for several other specialized modules:
- **`plot-google-chart-sk`**: Renders the trend line of the cluster's centroid.
- **`triage2-sk`**: Provides the dropdown/selection for anomaly status (Untriaged, Positive, Negative, etc.).
- **`word-cloud-sk`**: Visualizes the `param_summaries2` data, helping users identify which dimensions (like `arch` or `config`) are most common in the cluster.
- **`commit-detail-panel-sk`**: Displays the git log/author information for the selected point in the regression.
- **`commit-range-sk`**: Allows users to inspect the range of commits around the regression point.
## Workflows
### The Triage Workflow
When a developer identifies a regression, the following interaction occurs:
1. **Selection:** User reviews the plot and word cloud to confirm the regression is real.
2. **Input:** User selects a status in `triage2-sk` and optionally enters a message.
3. **Update:** Clicking "Update" triggers the following internal flow:
`User Click` -> `update()` -> `dispatchEvent('triaged', {columnHeader, triageStatus})`
4. **External Handling:** The parent page (e.g., the Anomaly table) listens for this event to persist the change to the backend.
### The Investigation Workflow
If the summary is insufficient, the "View on Dashboard" button facilitates a deep dive:
`Click "View on Dashboard"` -> `openShortcut()` -> `dispatchEvent('open-keys', ...)`
This event carries a `shortcut` ID, which the Explore page uses to reload the exact set of traces and time range represented by the cluster.
# Module: /modules/commit-detail-panel-sk
# commit-detail-panel-sk
A Custom Element that displays a list of commits in a table format. Each entry in the table is rendered using a `commit-detail-sk` element, providing a consistent view of commit metadata across the Perf application.
## Overview
The `commit-detail-panel-sk` acts as a container and controller for a collection of commit summaries. It is designed to be versatile, supporting both purely informational displays and interactive selection workflows (e.g., choosing a specific commit from a list associated with a performance anomaly).
### Key Responsibilities
- **Data Presentation**: Transforms an array of `Commit` objects into a vertical list of detailed rows.
- **Selection Management**: Tracks which commit is currently active or selected via the `selected` attribute.
- **Interaction Handling**: Manages click events on rows and translates them into high-level `commit-selected` events for parent components.
- **State Propagation**: Passes contextual information, such as `trace_id`, down to individual `commit-detail-sk` children to ensure they have the necessary context for rendering or linking.
## Design Decisions
### Interactive vs. Static Modes
The component uses a `selectable` boolean attribute to toggle between two distinct behaviors:
1. **Static**: The panel is purely for viewing. Visual cues like hover pointers and selection highlights are disabled, and click events are ignored.
2. **Interactive**: The panel acts as a selection list. The CSS adds a `cursor: pointer` to rows, and the component responds to user clicks by updating its state and broadcasting the selection.
This dual-mode design allows the same component to be used in a read-only dashboard as well as in a "point-and-click" triage workflow.
### Event Delegation and Parent Lookup
The component implements a click listener on the top-level `<table>` rather than attaching individual listeners to every row. This is more efficient for large lists.
When a click occurs, it uses the `findParent` utility to locate the nearest `TR` element. This ensures that even if a user clicks a nested link or span inside the `commit-detail-sk` child, the panel correctly identifies which index in the `details` array was targeted.
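The parent lookup can be sketched as a walk up the element tree. To keep the example runnable without a DOM it uses a minimal structural type; `NodeLike`, `findParentByTag`, and `rowIndexForClick` are hypothetical names, and this is a stand-in for the idea rather than the actual `findParent` utility.

```typescript
// Minimal structural type; real code would operate on HTMLElement.
interface NodeLike {
  tagName: string;
  parentElement: NodeLike | null;
  dataset?: { id?: string };
}

// Walk up from the click target until an element with the tag is found.
function findParentByTag(start: NodeLike | null, tagName: string): NodeLike | null {
  let el = start;
  while (el !== null && el.tagName !== tagName) {
    el = el.parentElement;
  }
  return el;
}

// Resolve a click on any nested element to the index of its row, or -1.
function rowIndexForClick(target: NodeLike): number {
  const row = findParentByTag(target, 'TR');
  const id = row?.dataset?.id;
  return id === undefined ? -1 : Number(id);
}
```

Because the lookup starts at the event target, a click on a link or span inside a row resolves to the same index as a click on the row itself.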
### Selection Workflow
The following diagram illustrates how a user interaction is converted into an application-level event:
```text
User Click
|
v
[Table Click Handler] --(Check selectable)--> [Exit if false]
|
| (findParent 'TR')
v
[Extract data-id] ---------------------------> [Update 'selected' attribute]
| |
v v
[Construct Event Detail] [Trigger CSS :selected highlight]
| (author, message, commit)
v
[Dispatch 'commit-selected']
```
## Key Components
### commit-detail-panel-sk.ts
The core logic of the element. It utilizes `lit-html` for efficient rendering.
- **Properties/Attributes**:
- `details`: The source array of `Commit` objects.
- `selectable`: Enables/disables interaction.
- `selected`: The index of the currently highlighted commit.
- `hide`: When true, prevents the list from rendering any rows, effectively clearing the view without losing the underlying data.
- `trace_id`: A string passed to children to provide context for the specific performance trace being inspected.
### commit-detail-panel-sk.scss
Defines the visual state of the panel. It leverages CSS variables for theme support (light/dark mode).
- Highlights rows using the `tr[selected]` selector.
- Adjusts opacity based on the `selectable` state to provide a visual hint of whether the component is interactive.
### commit-detail-panel-sk_po.ts
A Page Object (PO) implementation used for automated testing. It abstracts the DOM structure (tables, rows) into a set of asynchronous methods like `clickRow(index)` and `getSelectedRow()`, allowing tests to interact with the component at a functional level rather than a DOM level.
## Events
### commit-selected
Fired when a user clicks a row while the `selectable` attribute is present.
- **Detail**: Contains the index of the selection, a string description (author + message), and the full `Commit` object.
- **Bubbles**: True, allowing parent containers to catch selection events from the panel.
# Module: /modules/commit-detail-picker-sk
# commit-detail-picker-sk
The `commit-detail-picker-sk` module provides a specialized UI component for searching and selecting a specific commit from a repository's history. It is designed to handle the discovery of commits by allowing users to browse within a configurable time window and view detailed information before making a selection.
## Design and Implementation
The component acts as a high-level wrapper around three key functional areas: a trigger (a button showing the current selection), a search/filter interface (date range selection), and a results display (`commit-detail-panel-sk`).
### Workflow: Commit Selection
The module implements a "modal picker" pattern to keep the main UI clean while providing a rich interface for selection when needed.
```text
User Interaction State Management & Fetching Sub-components
+--------------+ +---------------------------+ +-------------------+
| Click Button | ------>| Open <dialog> | | |
+--------------+ | | | |
| Fetch Commits (_/cidRange)| | |
| (Filtered by Date Range) | | |
+-------------+-------------+ | |
| | |
v | |
+---------------------------+ | |
| Update .details Property | ------>| commit-detail- |
+---------------------------+ | panel-sk |
| | |
v | |
+--------------+ +---------------------------+ | |
| Select Commit| <------| Emit 'commit-selected' | <------| (Item Clicked) |
+--------------+ | Close <dialog> | | |
+---------------------------+ +-------------------+
```
### Key Components and Responsibilities
- **`commit-detail-picker-sk.ts`**: The core logic handler. It manages the state of the modal (open/closed), the current date range for searching, and the retrieval of commit data from the server.
- **Data Fetching**: It communicates with the backend via the `/_/cidRange/` POST endpoint. It sends a `RangeRequest` containing a start and end timestamp and an optional offset. This allows the picker to populate its list based on user-defined time windows.
- **Synchronization**: When the `selection` property (a `CommitNumber`) is set externally, the component automatically triggers a fetch to ensure the details of that commit are loaded and displayed in the button label.
- **`commit-detail-panel-sk`**: Used within the dialog to render the list of commits. It handles the actual rendering of commit messages, authors, and hashes, and provides the selection mechanism within the list.
- **`day-range-sk`**: Provides the UI for the user to modify the search window. Changing the date range in this component triggers a new fetch in the picker to refresh the available commits.
- **`dialog` (HTML5)**: Used to overlay the picker interface. This keeps the commit browsing experience contextual without navigating the user away from their current task.
## Events
The component communicates the user's choice to the rest of the application via a custom event:
- **`commit-selected`**: Emitted when a user clicks a commit in the panel. The `detail` of the event contains the selected index and commit information, following the structure defined by `CommitDetailPanelSkCommitSelectedDetails`.
## Interaction Logic
1. **Initial Load**: On attachment to the DOM, the component defaults to a 24-hour window (ending at `Date.now()`). It fetches the commits for this range to populate the internal panel.
2. **Updating the Range**: If a user cannot find the desired commit, they can expand the "Date Range" section. This uses `day-range-sk` to update the `begin` and `end` timestamps, which causes the picker to re-query the backend.
3. **Selection Persistence**: The button label is dynamically generated based on the current selection. If no commit is selected or if the selected commit isn't found in the current fetched batch, it defaults to "Choose a commit."
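The initial 24-hour window and the fallback label can be sketched as follows. The request field names, the use of seconds for timestamps, the `-1` "no selection" value, and the `author - message` label format are all assumptions made for illustration; only the "Choose a commit." fallback comes from the text above.

```typescript
// Hypothetical request shape for the commit range endpoint.
interface RangeRequestLike {
  begin: number; // Start of the window, in seconds since the epoch (assumed).
  end: number; // End of the window, in seconds since the epoch (assumed).
  offset: number; // Selected commit number, or -1 for none (assumed).
}

interface CommitLike {
  offset: number;
  author: string;
  message: string;
}

const SECONDS_PER_DAY = 24 * 60 * 60;

// Build the default window: the 24 hours ending at the given time.
function defaultRange(nowMs: number, selection = -1): RangeRequestLike {
  const end = Math.floor(nowMs / 1000);
  return { begin: end - SECONDS_PER_DAY, end, offset: selection };
}

// Label for the trigger button; falls back when the selection is not loaded.
function buttonLabel(selection: number, commits: CommitLike[]): string {
  const found = commits.find((c) => c.offset === selection);
  return found ? `${found.author} - ${found.message}` : 'Choose a commit.';
}
```

In the component, `defaultRange(Date.now())` would correspond to the initial load described in step 1.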
# Module: /modules/commit-detail-sk
The `commit-detail-sk` module provides a specialized web component for displaying concise information about a single Git commit within the Perf application. It serves as a navigational bridge, allowing users to move from a specific commit to various analysis views such as data exploration, clustering, or triage.
### Design and Intent
The element is designed to be a compact, action-oriented summary. It doesn't just display metadata; it contextualizes the commit based on the user's current interaction state.
A key design choice is the conditional behavior of the **Explore** functionality. The component can navigate to one of two destinations depending on whether a `trace_id` is provided:
1. **Generic Explore:** If only a commit is known, it links to a general view of that commit.
2. **Contextual Explore:** If a `trace_id` is present, the component assumes the user is interested in how that specific trace performed around the time of the commit. It automatically calculates a time window of +/- 4 days around the commit timestamp to provide immediate visual context in the Explore view.
### Key Components and Logic
#### commit-detail-sk.ts
This is the core implementation file. It defines the `CommitDetailSk` class, which manages the following properties:
- `cid`: A `Commit` object containing the hash, author, timestamp, message, and URL.
- `trace_id`: An optional string identifying a specific performance trace.
The component uses Lit for rendering and follows a reactive pattern where updates to `cid` or `trace_id` trigger a re-render. It utilizes `diffDate` from the `infra-sk` library to display human-readable relative timestamps (e.g., "3 days ago").
#### Action Buttons and Navigation
The component renders a set of standard actions, all of which open in a new browser tab:
- **Explore**: Navigates to `/e/` (contextual) or `/g/e/` (generic).
- **Cluster**: Navigates to `/g/c/[hash]`, used for analyzing performance clusters associated with that commit.
- **Triage**: Navigates to `/g/t/[hash]`, used for managing alerts or regressions at that point in time.
- **Commit**: Links directly to the external source hosting service (e.g., Gitiles) using the URL provided in the commit object.
### Workflow: Explore Navigation Logic
The following diagram illustrates how the component determines the destination of the "Explore" button click:
```
User Clicks "Explore"
|
v
Is trace_id set?
/ \
[Yes] [No]
| |
| v
| Navigate to:
| /g/e/{commit_hash}
v
Calculate Time Range:
[ts - 4 days] to [ts + 4 days]
|
v
Construct Query Object:
{ keys: trace_id, begin, end, ... }
|
v
Navigate to:
/e/?{serialized_query}
```
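The decision in the diagram can be sketched as a single function. The `CommitLike` shape, the assumption that timestamps are in seconds, and the exact query parameter names are illustrative; the real query object contains additional fields that are elided in the diagram.

```typescript
const FOUR_DAYS_SECONDS = 4 * 24 * 60 * 60;

// Hypothetical subset of the Commit object; ts is assumed to be in seconds.
interface CommitLike {
  hash: string;
  ts: number;
}

function exploreUrl(commit: CommitLike, traceId?: string): string {
  if (!traceId) {
    // Generic Explore: only the commit is known.
    return `/g/e/${commit.hash}`;
  }
  // Contextual Explore: show the trace for 4 days either side of the commit.
  const query = new URLSearchParams({
    keys: traceId,
    begin: String(commit.ts - FOUR_DAYS_SECONDS),
    end: String(commit.ts + FOUR_DAYS_SECONDS),
  });
  return `/e/?${query.toString()}`;
}
```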
### Styling and Themes
The module includes `commit-detail-sk.scss`, which imports both standard color variables and Perf-specific themes. It supports a dark mode and ensures that links and buttons remain accessible and consistent with the broader Skia infrastructure design language. Material Web Components (`md-outlined-button`) are used for the action triggers to provide a consistent look and feel with other modern Skia modules.
# Module: /modules/commit-range-sk
The `commit-range-sk` module provides a custom element designed to display and link to specific commits or ranges of commits within a repository. It is primarily used in the Perf UI to help users navigate from a data point or a regression in a trace directly to the relevant code changes in a source control browser (e.g., Gitiles/Googlesource or GitHub).
### High-Level Design Decisions
The element is designed to be reactive and data-driven, relying on trace data and column headers to resolve human-readable commit numbers into machine-readable Git hashes.
- **Dynamic Link Generation:** Instead of hardcoding URL patterns, the element utilizes a global configuration (`window.perf.commit_range_url`). This allows the same component to work across different repository hosting services by providing templates like `.../range/{begin}/{end}` or `.../commit/{end}`.
- **Automatic Range Detection:** The element automatically detects if it should display a single commit or a range. If the data point immediately following the previous valid data point is selected, it treats it as a single commit. If there are "holes" (missing data) in the trace between the current point and the last known good point, it expands the link to cover the entire range of commits where the change could have occurred.
- **Asynchronous Resolution:** Git hashes are often not present in the initial trace header to save bandwidth. The module fetches these hashes lazily via the `cid` (Commit ID) lookup service only when a link needs to be rendered.
- **Request Concurrency & Caching:** To prevent UI flicker and redundant network requests, the element implements an internal cache for hashes and tracks request IDs to ensure that late-arriving responses from stale requests do not overwrite the current UI state.
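The caching and request-ID tracking described in the last bullet can be sketched as a "latest request wins" guard. The `HashResolver` class and its method names are hypothetical and only illustrate the pattern.

```typescript
class HashResolver {
  private cache = new Map<number, string>();

  private latestRequestId = 0;

  // Call before starting a lookup; returns the token for that request.
  begin(): number {
    this.latestRequestId += 1;
    return this.latestRequestId;
  }

  // Returns a previously resolved hash, if any, so no request is needed.
  cached(commitNumber: number): string | undefined {
    return this.cache.get(commitNumber);
  }

  // Call when a response arrives. The hash is always cached, but the caller
  // should only update the UI when this returns true.
  complete(requestId: number, commitNumber: number, hash: string): boolean {
    this.cache.set(commitNumber, hash);
    return requestId === this.latestRequestId;
  }
}
```

A response that belongs to an older request still populates the cache, so its work is not wasted, but it cannot overwrite a link that was rendered for a newer selection.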
### Key Components and Responsibilities
#### commit-range-sk.ts
This is the core logic of the component. It manages the internal state of the link text and URL.
- **Range Calculation:** It inspects the `trace` array and the `commitIndex` to find the "previous" valid commit. It skips over `MISSING_DATA_SENTINEL` values to ensure the user is directed to the full range of potential changes.
- **Template Parsing:** It replaces `{begin}` and `{end}` placeholders in the configured URL. It also includes specific logic for "Googlesource" style URLs, converting range logs (`+log/begin..end`) into single commit views (`+/end`) when the range size is one.
- **Event Dispatching:** Dispatches a `commit-range-changed` event when a new link is successfully generated, allowing parent components (like tooltips or info panels) to react.
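The range calculation can be sketched in a few lines. The sentinel value is the one documented for `/modules/const`; the helper names `previousValidIndex` and `linkText` are hypothetical, and `isRange` here is a free function written from the description, not the component's method.

```typescript
const MISSING_DATA_SENTINEL = 1e32;

// Index of the nearest earlier point that has data, or -1 if there is none.
function previousValidIndex(trace: number[], commitIndex: number): number {
  for (let i = commitIndex - 1; i >= 0; i--) {
    if (trace[i] !== MISSING_DATA_SENTINEL) {
      return i;
    }
  }
  return -1;
}

// True when the two commit numbers are not adjacent.
function isRange(startCommit: number, endCommit: number): boolean {
  return startCommit + 1 < endCommit;
}

// Text for the link: the first commit after the previous valid point through
// the selected commit, or just the selected commit when they are adjacent.
function linkText(prevCommit: number, commit: number): string {
  return isRange(prevCommit, commit) ? `${prevCommit + 1} - ${commit}` : `${commit}`;
}
```

For a trace with two missing points between commit numbers 100 and 103, the link text covers `101 - 103`, since the change could have landed in any of those commits.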
#### Interaction Workflow
The following diagram illustrates how the component transforms a user selection into a functional link:
```text
User Selects Point (index N)
|
v
Find Previous Valid Point (index M < N) in Trace
|
v
Lookup Commit Numbers (Offsets) for M and N in Header
|
v
[Network Request] -> lookupCids(OffsetM, OffsetN)
|
v
Apply Hashes to window.perf.commit_range_url Template
|
v
Render <a> link with text "OffsetM+1 - OffsetN" (if range)
or "OffsetN" (if single commit)
```
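The template step in the diagram can be sketched as plain string substitution. The `commitRangeUrl` helper is hypothetical, and the Googlesource rewrite is a guess at the behaviour described for single commits, written against an example template rather than the real configuration.

```typescript
// Fill in a URL template such as ".../+log/{begin}..{end}".
function commitRangeUrl(
  template: string,
  beginHash: string,
  endHash: string,
  singleCommit: boolean
): string {
  let url = template.replace('{begin}', beginHash).replace('{end}', endHash);
  if (singleCommit && url.indexOf('+log/') !== -1) {
    // Assumed Googlesource-style rewrite: a one-commit log becomes a commit view.
    url = url.replace(`+log/${beginHash}..${endHash}`, `+/${endHash}`);
  }
  return url;
}
```

Templates that do not contain a `+log/` segment are left untouched apart from the placeholder substitution.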
#### commit-range-sk_po.ts
Provides the Page Object for testing. It encapsulates the logic for retrieving the link's `href` and the displayed `text`, shielding tests from the underlying DOM structure (which alternates between an `<a>` tag when the URL is ready and a `<span>` while loading).
#### test_data.ts
Contains mock data structures that simulate the `header` and `trace` objects produced by the Perf backend, ensuring consistent testing of the range-finding logic.
### Implementation Details
- **Single vs. Range:** A "Range" is defined as a gap where `start_commit + 1 < end_commit`. If they are adjacent, `isRange()` returns false, and the UI simplifies the display to a single hash or commit number.
- **GitHub Support:** The component has a specific fallback for GitHub URLs; if "github" is detected in the URL template, it truncates the displayed text to a short 7-character hash for better readability.
- **Rendering:** The component uses a light DOM (via `createRenderRoot()` returning `this`) rather than Shadow DOM, which is a common pattern in this project for elements that need to inherit global styles or be easily accessible by parent tooltips.
# Module: /modules/common
# /modules/common
This module serves as the foundational utility layer for Skia Perf. It centralizes shared logic for data visualization, anomaly handling, keyboard interaction, and testing infrastructure.
## Data Visualization and Plotting
The core responsibility of this module is to bridge the gap between raw backend trace data and the frontend charting engine (Google Charts).
### Plot Construction
The module handles the complex task of "transposing" trace data. Backend data typically arrives organized by trace keys, whereas charting libraries require data organized by rows (time/commit positions) with traces as columns.
- **`plot-builder.ts`**: Contains the logic for this transformation. It supports different domains (commits, dates, or both) and handles missing data using a `MISSING_DATA_SENTINEL`. It also generates consistent color palettes for charts.
- **`plot-util.ts`**: Provides higher-level utilities to create `ChartData` objects, specifically managing the integration of anomaly markers into the data points so they can be rendered on the graph.
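The transposition can be illustrated with a small sketch. The `TraceSet` alias, the `transpose` helper, and the choice to emit `null` for gaps are assumptions for illustration and do not reproduce `plot-builder.ts`.

```typescript
const MISSING_DATA_SENTINEL = 1e32;

type TraceSet = Record<string, number[]>;
type Row = (number | null)[];

// Turn { traceKey: [v0, v1, ...] } into rows of [commit, vA, vB, ...].
function transpose(traces: TraceSet, commits: number[]): { columns: string[]; rows: Row[] } {
  const keys = Object.keys(traces);
  const rows = commits.map((commit, i): Row => [
    commit,
    ...keys.map((k) => {
      const v = traces[k][i];
      // Gaps become null so the chart can break the line at that point.
      return v === undefined || v === MISSING_DATA_SENTINEL ? null : v;
    }),
  ]);
  return { columns: ['commit', ...keys], rows };
}
```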
### Visual Consistency and Collision Avoidance
To ensure that performance graphs remain readable when comparing similar builds, the module implements a deterministic coloring strategy:
- **Trace Coloring**: Colors are derived from a hash of the trace name to ensure consistency across page reloads.
- **Variant Offsets**: Specific logic exists to detect collisions between a "base" trace and its variants (e.g., `ref` or `pgo` builds). If a collision is detected in the color space, the variants are shifted to colors that are guaranteed to be distinct from the base.
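A minimal sketch of deterministic coloring with collision avoidance is shown below. The hash function, the palette, and the "step to the next palette entry" rule are all assumptions chosen for illustration; the text does not specify which hash or shift the module actually uses.

```typescript
// Simple deterministic string hash (djb2-style).
function hashString(s: string): number {
  let h = 5381;
  for (let i = 0; i < s.length; i++) {
    h = ((h * 33) ^ s.charCodeAt(i)) >>> 0;
  }
  return h;
}

const PALETTE = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b'];

// The same trace name always maps to the same color.
function colorFor(traceName: string): string {
  return PALETTE[hashString(traceName) % PALETTE.length];
}

// Pick a color for a variant, stepping to the next palette entry if it
// would otherwise collide with the color of its base trace.
function variantColor(baseName: string, variantName: string): string {
  const base = colorFor(baseName);
  let index = hashString(variantName) % PALETTE.length;
  if (PALETTE[index] === base) {
    index = (index + 1) % PALETTE.length;
  }
  return PALETTE[index];
}
```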
## Anomaly Management
Anomalies are first-class citizens in this module, which provides types and formatting logic to present performance regressions or improvements to the user.
- **`anomaly-data.ts`**: Defines the data structure for a point on a graph that represents an anomaly, including its coordinates and highlight state.
- **`anomaly.ts`**: Contains formatting logic for numeric changes (percentages) and human-readable links to bug trackers. It also handles special `bugId` values that mark an alert as "Invalid" or "Ignored".
## Interaction and Styling
### Keyboard Shortcuts
To facilitate rapid "triage" workflows, the module implements a centralized shortcut system.
- **`ShortcutRegistry`**: A singleton that manages categories of shortcuts (Triage, Navigation, Report, General).
- **`handleKeyboardShortcut`**: A global handler that maps physical key presses to specific method calls on components (e.g., `onTriagePositive`, `onZoomIn`), while intelligently ignoring events originating from input fields or textareas.
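The dispatch logic can be sketched with structural stand-ins for the keyboard event and its target, so the example runs without a DOM. The handler names come from the text above; the key bindings, the `KeyEventLike` type, and the `handleShortcut` helper are assumptions, not the module's actual API.

```typescript
interface KeyEventLike {
  key: string;
  target: { tagName: string } | null;
}

interface ShortcutHandlers {
  onTriagePositive?: () => void;
  onZoomIn?: () => void;
}

// Hypothetical key bindings.
const BINDINGS = new Map<string, keyof ShortcutHandlers>([
  ['p', 'onTriagePositive'],
  ['+', 'onZoomIn'],
]);

// Returns true when a shortcut was handled.
function handleShortcut(e: KeyEventLike, handlers: ShortcutHandlers): boolean {
  const tag = e.target?.tagName ?? '';
  if (tag === 'INPUT' || tag === 'TEXTAREA') {
    return false; // The user is typing; do not treat the key as a shortcut.
  }
  const name = BINDINGS.get(e.key);
  const handler = name ? handlers[name] : undefined;
  if (!handler) {
    return false;
  }
  handler();
  return true;
}
```

Making every handler optional lets a component opt in to only the shortcuts that make sense for it.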
### Unified UI Components
- **`buttons.scss`**: Defines a mixin (`perf-button`) that enforces a strict visual design system. It uses `!important` to ensure that Perf-specific buttons maintain their identity even when embedded in components with conflicting global styles or Shadow DOM boundaries.
## Testing and Development Utilities
This module provides extensive infrastructure for both unit and integration testing.
- **`test-util.ts`**: A comprehensive mock environment for demo pages and unit tests. It includes `setUpExploreDemoEnv`, which mocks the entire Perf backend API (anomalies, trace data, login status, and shortcut persistence) using `fetch-mock`.
- **`puppeteer-test-util.ts`**: Provides helper functions for E2E testing, such as polling for DOM states, waiting for Google Charts to finish rendering (`waitForReady`), and validating `ParamSet` selections.
### Workflow: Data to Chart
The following diagram illustrates the flow of data through the common module components:
```
[Raw TraceSet] ----> [plot-util.ts] <---- [Anomaly Map]
| |
| (Match anomalies to points)
| |
V V
[plot-builder.ts] <--- [ChartData]
|
(Transpose to Rows)
|
V
[Google Chart Engine]
```
## Key Files Summary
| File | Responsibility |
| :----------------------- | :----------------------------------------------------------------------- |
| `anomaly.ts` | Formatting and calculation utilities for performance anomalies. |
| `buttons.scss` | Standardized visual styling for buttons across the application. |
| `graph-config.ts` | Logic for managing graph state and generating persistent shortcut URLs. |
| `keyboard-shortcuts.ts` | Central registry and event handler for application-wide hotkeys. |
| `plot-builder.ts` | Logic for transposing dataframes and managing chart color palettes. |
| `plot-util.ts` | High-level helpers for merging traces and anomalies into chartable data. |
| `test-util.ts` | Backend API mocking and dummy data generation for development. |
| `puppeteer-test-util.ts` | Synchronization and validation helpers for browser-based tests. |
# Module: /modules/const
# Constants Module (`/modules/const`)
The `const` module serves as a centralized source of truth for shared values used throughout the Perf UI. Its primary purpose is to ensure data consistency between the backend services (written in Go) and the frontend visualization layers, particularly regarding how incomplete or special data states are represented.
## Data Integrity and Sentinels
A significant challenge in performance monitoring is the representation of gaps in time-series data. The design of this module focuses on providing stable "sentinel" values that allow the UI to distinguish between valid data points and missing measurements without relying on non-standard JSON types.
### Numeric Sentinels
The backend storage and processing layers (specifically `//go/vec32/vec`) utilize a specific float32 value to denote missing samples. Because the standard JSON specification does not support `NaN` or `Infinity`, the frontend must use a value that is:
1. A valid, representable `float32`.
2. Compact in its string/JSON representation to minimize payload size.
3. Extremely unlikely to occur as a legitimate measurement in performance testing.
`MISSING_DATA_SENTINEL` (set to `1e32`) satisfies these requirements. When the UI encounters this value within a trace, it interprets the point as a gap rather than a zero or a legitimate data spike, allowing graphing components to break lines or omit points appropriately.
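As a minimal sketch of the gap handling described above, the following maps sentinel values to `null` before plotting. The helper `toPlottable` is hypothetical and not part of the module; only the constant's value (`1e32`) comes from the text.

```typescript
// Assumed to mirror the constant exported by the const module.
const MISSING_DATA_SENTINEL = 1e32;

// Hypothetical helper: map sentinel values to null so a charting layer
// can break the line instead of drawing a spike.
function toPlottable(trace: number[]): (number | null)[] {
  return trace.map((v) => (v === MISSING_DATA_SENTINEL ? null : v));
}

// The middle sample is a hole in the trace.
const plottable = toPlottable([1.5, MISSING_DATA_SENTINEL, 2.0]);
```

A legitimate zero stays a zero under this mapping; only the exact sentinel value is treated as a gap.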
### String Sentinels
For categorical data or metadata fields where a value might be absent or undefined, the module provides `MISSING_VALUE_SENTINEL`. Using an explicit string (`__missing__`) instead of an empty string or `null` prevents ambiguity during filtering and grouping operations, ensuring that "missing data" can be treated as its own distinct category in the UI.
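The grouping benefit can be illustrated with a small sketch. The `groupByParam` helper and the `ParamsLike` type are assumptions made for this example; only the sentinel string itself is taken from the text.

```typescript
// Assumed to mirror the constant exported by the const module.
const MISSING_VALUE_SENTINEL = '__missing__';

// Hypothetical parameter bag for one trace.
type ParamsLike = { [key: string]: string | undefined };

// Group traces by one parameter; absent or empty values share one bucket.
function groupByParam(traces: ParamsLike[], key: string): Map<string, ParamsLike[]> {
  const groups = new Map<string, ParamsLike[]>();
  for (const params of traces) {
    const raw = params[key];
    const value = raw === undefined || raw === '' ? MISSING_VALUE_SENTINEL : raw;
    const bucket = groups.get(value) || [];
    bucket.push(params);
    groups.set(value, bucket);
  }
  return groups;
}

const byOs = groupByParam([{ os: 'linux' }, { bot: 'pixel_6' }, { os: '' }], 'os');
```

Here the trace without an `os` key and the trace with an empty `os` both land in the `__missing__` group, so the UI can show them as one selectable category.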
## Key Exports
| Constant | Purpose |
| :----------------------- | :--------------------------------------------------------------------------- |
| `MISSING_DATA_SENTINEL` | A numeric float used to mark holes in time-series traces. |
| `MISSING_VALUE_SENTINEL` | A string used to represent the absence of a value in metadata or parameters. |
## Workflow: Data Rendering
The following diagram illustrates how these constants act as a bridge between the raw data ingestion and the final visualization:
```text
[ Backend Trace ] -> [ JSON Serialization ] -> [ UI Data Fetching ] -> [ Plotting Logic ]
        |                      |                        |                      |
        |                      |                        |                      |
Uses MissingDataSentinel  Converts to 1e32    Imports MISSING_DATA_    Detects 1e32 and
          (Go)              (Valid JSON)          SENTINEL (TS)         renders a gap.
```
# Module: /modules/csv
The `csv` module provides utilities for transforming performance data represented as a `DataFrame` into a Comma-Separated Values (CSV) format. This functionality is essential for allowing users to export trace data from the Perf system into spreadsheet software or external analysis tools.
### Overview
The primary design goal of this module is to flatten high-dimensional trace data—which is structured as a collection of key-value pairs (parameters) and time-series arrays—into a two-dimensional grid. To achieve this, the module dynamically generates a schema based on the unique set of parameter keys present in the provided traces.
### Design Decisions and Implementation
#### Dynamic Column Mapping
A challenge in converting `DataFrame` objects to CSV is that different traces may have different sets of parameters (e.g., one trace might have an `os` parameter while another has a `config` parameter).
To ensure a consistent grid structure:
1. **Parameter Extraction:** The module parses the structured trace IDs (comma-separated key-value strings) into distinct parameter sets.
2. **Key Union and Sorting:** It identifies every unique parameter key across _all_ traces and sorts them alphabetically. These sorted keys form the leading columns of the CSV.
3. **Normalization:** For each row, if a trace lacks a specific parameter key present in another trace, the corresponding cell is left empty.
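The three steps above can be sketched as follows. This is an illustration under stated assumptions: `parseTraceId` is a simplified stand-in for `fromKey`, and `sortedKeyUnion` and `paramCells` are hypothetical names, not the module's actual helpers.

```typescript
type Params = { [key: string]: string };

// Simplified stand-in for fromKey: ",arch=arm,os=linux," -> { arch, os }.
function parseTraceId(id: string): Params {
  const params: Params = {};
  for (const pair of id.split(',')) {
    const eq = pair.indexOf('=');
    if (eq < 0) continue; // Skips the empty leading and trailing segments.
    params[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return params;
}

// Union of every key seen in any trace, sorted alphabetically.
function sortedKeyUnion(all: Params[]): string[] {
  const keys = new Set<string>();
  for (const p of all) Object.keys(p).forEach((k) => keys.add(k));
  return Array.from(keys).sort();
}

// One cell per column; a key the trace lacks yields an empty cell.
function paramCells(p: Params, keys: string[]): string[] {
  return keys.map((k) => (k in p ? p[k] : ''));
}

const parsed = [',arch=arm,os=linux,', ',arch=x86,config=gpu,'].map(parseTraceId);
const columns = sortedKeyUnion(parsed);
const firstRow = paramCells(parsed[0], columns);
```

With these two traces, the leading columns are `arch`, `config`, `os`, and the first row leaves the `config` cell empty.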
#### Handling Time and Missing Data
- **Temporal Headers:** The columns following the parameter keys are derived from the `DataFrame.header`. Timestamps, which are stored internally as seconds, are converted to ISO 8601 strings to ensure they are human-readable and correctly interpreted by external tools.
- **Sentinels:** Performance data often contains "holes" where data collection failed or was skipped. The module explicitly checks for `MISSING_DATA_SENTINEL` values and converts them to empty strings in the CSV output to maintain the numeric integrity of the rest of the column.
- **Filtering:** The implementation automatically excludes traces starting with `special_`. These are internal synthetic traces (like averages or benchmarks) that do not conform to standard parameter schemas and would clutter the export.
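A minimal sketch of these three rules, assuming header timestamps arrive as Unix seconds and the sentinel equals `1e32`. The function names are hypothetical and chosen for this example only.

```typescript
// Assumed to mirror the constant exported by the const module.
const MISSING_DATA_SENTINEL = 1e32;

// Seconds since the epoch -> ISO 8601 column header.
function headerCell(timestampSeconds: number): string {
  return new Date(timestampSeconds * 1000).toISOString();
}

// Data cells for one trace; holes become empty strings.
function valueCells(trace: number[]): string[] {
  return trace.map((v) => (v === MISSING_DATA_SENTINEL ? '' : String(v)));
}

// Drop synthetic traces whose IDs start with "special_".
function exportableTraceIds(ids: string[]): string[] {
  return ids.filter((id) => !id.startsWith('special_'));
}
```

Emitting an empty cell rather than the raw sentinel keeps spreadsheet aggregates such as averages from being skewed by `1e32`.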
### Key Workflows
The CSV generation process follows a linear transformation path:
```
[ DataFrame ]
      |
      | 1. Extract Trace IDs
      V
[ Trace IDs ] --> [ Parse to Params ] --> [ Identify & Sort Unique Keys ]
                                                   |
      +--------------------------------------------+
      |
      V
[ Generate Header Row ] (Sorted Param Keys + ISO Timestamps)
      |
      V
[ Iterate Traces ]
      |
      +--> [ Map Params to Columns ] ------+
      |                                    |--> [ Concatenate Row ]
      +--> [ Map Data Points to Columns ]--+
                                                         |
                                                         V
                                               [ Final CSV String ]
```
### Key Components
- **`dataFrameToCSV` (`index.ts`):** The main entry point. It orchestrates the collection of keys, the formatting of headers, and the row-by-row serialization of trace data.
- **`parseIdsIntoParams` & `allParamKeysSorted` (`index.ts`):** Helper functions that handle the translation between the serialized trace ID format and a structured, normalized set of columns. They rely on `fromKey` from the `paramtools` module to decompose the string-based identifiers.
# Module: /modules/data-service
The `data-service` module provides a centralized, singleton-based interface for interacting with the Perf backend. It abstracts the complexities of HTTP communication, error handling, and long-running asynchronous operations into a clean API used by the frontend components.
### Core Responsibility
The primary role of the `DataService` class is to act as the single source of truth for backend data fetching. By encapsulating all network logic, it ensures consistent headers, error reporting (via `DataServiceError`), and behavior across the application. It specifically manages:
- **State Persistence**: Handling "shortcuts" (IDs representing specific sets of graph configurations or trace keys) to allow for shareable URLs.
- **Contextual Data**: Fetching initial page settings, timezone-aware data, and default query configurations.
- **Data Manipulation**: Calculating time/commit range shifts and retrieving user-reported issues.
- **Asynchronous Processing**: Managing complex, multi-stage requests like frame generation which require progress monitoring.
### Key Components
#### DataService (`data-service.ts`)
The main implementation follows the Singleton pattern, accessible via `DataService.getInstance()`. This ensures that shared configurations (like local development overrides) are consistent across all callers.
- **Standard Fetching**: Methods like `getShortcut`, `getDefaults`, and `shift` wrap standard `POST` or `GET` requests. They use a private `fetchJson` helper that integrates with `jsonOrThrow` to standardize how the frontend handles malformed or failed responses.
- **Shortcuts**: The service handles two types of shortcuts:
  - `updateShortcut`: Maps a complex `GraphConfig` array to a short ID.
  - `createShortcut`: Maps a simple list of trace keys to an ID.
  These methods include logic to skip execution during local development if `perf.disable_shortcut_update` is set, preventing unnecessary 500 errors from unconfigured local proxies.
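The Singleton access pattern and the local-development skip can be sketched as below. `ExampleService`, its `shortcutUpdatesDisabled` field, and the injected `save` callback are all hypothetical; they illustrate the pattern rather than the real `DataService` API.

```typescript
class ExampleService {
  private static instance: ExampleService | null = null;

  // Hypothetical shared setting, standing in for a local development override.
  shortcutUpdatesDisabled = false;

  // A private constructor prevents callers from creating extra instances.
  private constructor() {}

  static getInstance(): ExampleService {
    if (ExampleService.instance === null) {
      ExampleService.instance = new ExampleService();
    }
    return ExampleService.instance;
  }

  // Resolves to an empty ID without calling save() when updates are disabled.
  async updateShortcut(save: () => Promise<string>): Promise<string> {
    return this.shortcutUpdatesDisabled ? '' : save();
  }
}
```

Because every caller receives the same object, a flag set once during startup is visible to all components that later call `getInstance()`.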
#### Long-running Operations
The `sendFrameRequest` method handles one of the most complex workflows in the system: requesting data frames (collections of trace data for graphs). Because frame generation can be slow, the backend uses a "start-and-poll" pattern.
`DataService` leverages the `progress` module to manage this lifecycle:
1. **Initialization**: It attaches the local browser's timezone to the request.
2. **Lifecycle Management**: It accepts callbacks (`onStart`, `onProgress`, `onMessage`, `onSettled`) to allow UI components to update their loading states or progress bars.
3. **Polling**: It delegates the polling logic to `startRequest`, which communicates with the `/_/frame/start` endpoint and waits for a "Finished" status.
4. **Error Transformation**: If the progress returns an error status, it converts the backend messages into a human-readable `DataServiceError`.
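The start-and-poll lifecycle can be sketched generically. Everything here is an assumption for illustration: the `ProgressLike` shape, the status strings other than "Finished", and the injected `start` and `poll` functions do not reflect the actual `progress` module API.

```typescript
type StatusLike = 'Running' | 'Finished' | 'Error';

// Hypothetical progress payload returned by the start and status endpoints.
interface ProgressLike {
  status: StatusLike;
  messages: string[];
  url: string; // Where to ask for the next status update.
  results?: unknown;
}

async function startAndPoll(
  start: () => Promise<ProgressLike>,
  poll: (url: string) => Promise<ProgressLike>,
  onProgress: (p: ProgressLike) => void,
  intervalMs: number = 0
): Promise<unknown> {
  let progress = await start();
  while (progress.status === 'Running') {
    onProgress(progress); // Let the UI update its spinner or progress bar.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    progress = await poll(progress.url);
  }
  if (progress.status === 'Error') {
    // Collapse backend messages into one human-readable error.
    throw new Error(progress.messages.join('; '));
  }
  return progress.results;
}
```

Injecting `start` and `poll` keeps the loop independent of any particular HTTP client, which also makes it straightforward to exercise in tests.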
### Data Flow Workflow
The following diagram illustrates the lifecycle of a frame request through the `DataService`:
```text
Component          DataService           Progress Module          Backend
    |                   |                       |                    |
    |--sendFrameReq()-->|                       |                    |
    |                   |----startRequest()---->|                    |
    |                   |                       |--- /frame/start -->|
    |                   |                       |                    | (Processing)
    |                   |                       |<-- HTTP 200 (ID) --|
    |                   |                       |                    |
    |                   |        [Loop]         |--- Check Status -->|
    |<---onProgress()---|                       |                    |
    |                   |<--updateProgress()----|<-- SerializedMsg --|
    |                   |                       |                    |
    |                   |                       |--- Check Status -->|
    |                   |                       |<-- Status: Fin ----|
    |<-- FrameResponse--|                       |                    |
```
### Error Handling
The module defines a specialized `DataServiceError`. This class extends the native `Error` but includes an optional `status` property (the HTTP status code). This allows calling components to distinguish between HTTP-level failures (e.g., 404, 500), where `status` is set, and application-level errors (e.g., failed data processing messages returned within a valid HTTP 200 response).
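A sketch of such an error class follows. `ExampleDataServiceError` and `describeFailure` are hypothetical names; the real class may differ in constructor shape.

```typescript
class ExampleDataServiceError extends Error {
  // HTTP status code, or undefined for application-level failures.
  readonly status: number | undefined;

  constructor(message: string, status?: number) {
    super(message);
    this.name = 'ExampleDataServiceError';
    this.status = status;
    // Keeps instanceof working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, ExampleDataServiceError.prototype);
  }
}

// Callers can branch on whether a status code is present.
function describeFailure(e: unknown): string {
  if (e instanceof ExampleDataServiceError) {
    return e.status === undefined
      ? `processing error: ${e.message}`
      : `HTTP ${e.status}: ${e.message}`;
  }
  return 'unknown error';
}
```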
# Module: /modules/dataframe
# Dataframe Module
The `dataframe` module provides the core data structures and management logic for handling performance trace data in the Perf system. It manages the lifecycle of performance data—from fetching raw JSON responses to maintaining a local cache and transforming data for visualization.
## Overview
The module centers around the concept of a `DataFrame`, which represents a set of performance traces (time series data) sharing a common horizontal axis (commits or timestamps). The primary goal of this module is to provide a consistent way to query, extend, merge, and visualize these traces along with their associated metadata, such as anomalies and user-reported issues.
## Key Components
### DataFrame Management (`index.ts`)
This file defines the fundamental logic for manipulating `DataFrame` objects. It is the TypeScript equivalent of the backend Go implementation.
- **Joining and Merging**: The `join` function allows two DataFrames to be combined into one. It handles cases where headers (commit ranges) overlap or differ by recalculating a unified header and padding missing data with a `MISSING_DATA_SENTINEL`.
- **Subsetting**: Functions like `findSubDataframe` and `generateSubDataframe` allow for extracting specific slices of data based on commit offsets or timestamps.
- **Anomaly Handling**: It provides logic to merge `AnomalyMap` structures, ensuring that anomaly data from different requests are combined correctly for the same traces.
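The header unification and padding performed by a join can be sketched as follows. This is a simplified illustration: `FrameLike` reduces the header to bare commit offsets, and `joinFrames` is a hypothetical function rather than the module's `join`.

```typescript
// Assumed to mirror the constant exported by the const module.
const MISSING_DATA_SENTINEL = 1e32;

// Simplified frame: the header holds commit offsets only.
interface FrameLike {
  header: number[];
  traceset: { [key: string]: number[] };
}

function joinFrames(a: FrameLike, b: FrameLike): FrameLike {
  // Unified, sorted X-axis containing every offset from both frames.
  const header = Array.from(new Set(a.header.concat(b.header))).sort((x, y) => x - y);
  const column = new Map<number, number>();
  header.forEach((offset, i) => column.set(offset, i));

  const traceset: { [key: string]: number[] } = {};
  for (const frame of [a, b]) {
    for (const key of Object.keys(frame.traceset)) {
      if (!(key in traceset)) {
        // Start every trace fully padded, then fill in the known points.
        traceset[key] = new Array<number>(header.length).fill(MISSING_DATA_SENTINEL);
      }
      frame.traceset[key].forEach((value, i) => {
        const target = column.get(frame.header[i]);
        if (target !== undefined && value !== MISSING_DATA_SENTINEL) {
          traceset[key][target] = value;
        }
      });
    }
  }
  return { header, traceset };
}
```

In this sketch, when both frames carry a real value for the same trace and offset, the value from the second frame wins; that tie-breaking rule is an assumption of the example.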
### Data Repository and Context (`dataframe_context.ts`)
The `DataFrameRepository` (implemented as `<dataframe-repository-sk/>`) acts as the state manager for performance data within a frontend application. It utilizes Lit context to provide data to descendant components.
- **Caching and Extension**: It maintains a local cache of traces. The `extendRange` method allows the UI to request more data (forward or backward in time) while maintaining the current `ParamSet`.
- **Chunking**: To improve performance and reliability, large data requests are automatically sliced into smaller time "chunks" (defaulting to 1 month) and fetched concurrently.
- **Context Provision**: It provides several contexts:
  - `dataframeContext`: The raw `DataFrame`.
  - `dataTableContext`: A `google.visualization.DataTable` prepared for `google-chart`.
  - `dataframeAnomalyContext`: Current known anomalies.
  - `dataframeUserIssueContext`: Buganizer issues associated with specific trace points.
  - `dataframeLoadingContext`: A boolean flag indicating if a fetch operation is in progress.
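The chunking behavior can be sketched as below, approximating "1 month" as 30 days. `sliceIntoChunks` and `fetchAll` are hypothetical helpers written for this example; the repository's own slicing function may use different boundaries.

```typescript
// Unix seconds.
interface TimeRange {
  begin: number;
  end: number;
}

const SECONDS_PER_DAY = 24 * 60 * 60;

// Split a range into consecutive chunks of at most chunkSeconds each.
function sliceIntoChunks(
  range: TimeRange,
  chunkSeconds: number = 30 * SECONDS_PER_DAY
): TimeRange[] {
  if (chunkSeconds <= 0) throw new Error('chunkSeconds must be positive');
  const chunks: TimeRange[] = [];
  for (let begin = range.begin; begin < range.end; begin += chunkSeconds) {
    chunks.push({ begin, end: Math.min(begin + chunkSeconds, range.end) });
  }
  return chunks;
}

// Fetch every chunk concurrently; results come back in chunk order.
function fetchAll<T>(
  chunks: TimeRange[],
  fetchChunk: (r: TimeRange) => Promise<T>
): Promise<T[]> {
  return Promise.all(chunks.map((c) => fetchChunk(c)));
}
```

A 70-day range therefore becomes two full 30-day chunks plus one 10-day remainder, each of which can be requested independently.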
### Trace Identification and Formatting (`traceset.ts`)
Trace keys in Perf are typically comma-separated strings of key-value pairs (e.g., `,benchmark=JetStream2,bot=MacM1,`). This component provides utilities to parse these keys for UI display.
- **Dynamic Title and Legend**: It dynamically calculates which parameters are common to all traces in a set (the **Title**) and which parameters vary (the **Legend**). This prevents redundant information from cluttering the UI.
- **Special Function Handling**: It recognizes and strips transformation functions (like `norm()`) from keys to ensure that data lookup for issues and anomalies remains consistent even if the data is being transformed for display.
## Key Workflows
### Data Fetching and Merging
When a user requests to "extend" a chart, the following process occurs:
```
UI Action (e.g., "Scroll Left")
      |
      v
DataFrameRepository.extendRange(offset)
      |
      +--> calculate deltaRange (new time window)
      +--> sliceRange into chunks (e.g., 30-day blocks)
      +--> concurrent DataService.sendFrameRequest()
      |
      v
Receive multiple FrameResponses
      |
      +--> Sort responses by commit offset
      +--> Merge ColumnHeaders (unified X-axis)
      +--> Map old/new trace indices to unified header
      +--> Pad missing points with SENTINEL
      |
      v
Update Lit Contexts
      |
      +--> dataframeContext (raw data)
      +--> dataTableContext (Google Charts format)
      +--> UI components re-render
```
### Trace Metadata Extraction
The module handles the logic of turning complex trace keys into readable labels:
```
Trace A: ,benchmark=V8,test=Total,arch=arm,
Trace B: ,benchmark=V8,test=Total,arch=x86,
Logic:
1. Common: benchmark=V8, test=Total ==> Title: "V8/Total"
2. Unique: arch=arm vs arch=x86 ==> Legend: ["arm", "x86"]
```
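The common-versus-varying split shown above can be sketched in a few lines. `parseKey` and `titleAndLegend` are hypothetical helpers, and joining values with `/` follows the `"V8/Total"` example rather than a confirmed formatting rule.

```typescript
type KeyParams = { [key: string]: string };

// ",benchmark=V8,arch=arm," -> { benchmark: 'V8', arch: 'arm' }
function parseKey(traceKey: string): KeyParams {
  const params: KeyParams = {};
  for (const pair of traceKey.split(',')) {
    const eq = pair.indexOf('=');
    if (eq < 0) continue;
    params[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return params;
}

function titleAndLegend(traceKeys: string[]): { title: string; legend: string[] } {
  const all = traceKeys.map(parseKey);
  if (all.length === 0) return { title: '', legend: [] };
  const first = all[0];
  // A parameter is common when every trace carries the same value for it.
  const common = Object.keys(first).filter((k) => all.every((p) => p[k] === first[k]));
  const title = common.map((k) => first[k]).join('/');
  // The legend keeps only the parameters that vary between traces.
  const legend = all.map((p) =>
    Object.keys(p)
      .filter((k) => common.indexOf(k) < 0)
      .map((k) => p[k])
      .join('/')
  );
  return { title, legend };
}

const labels = titleAndLegend([
  ',benchmark=V8,test=Total,arch=arm,',
  ',benchmark=V8,test=Total,arch=x86,',
]);
```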
## Design Decisions
- **Immutable ParamSet for Extensions**: Once a repository is initialized with a `ParamSet`, extensions (paging through data) use that same set of parameters. To change the query itself, a full `resetTraces` is required. This simplifies the merging logic by ensuring the "vertical" dimension (the traces) remains relatively stable while the "horizontal" dimension (time) grows.
- **DataTable Conversion**: Instead of every UI component manually parsing the `DataFrame`, the `DataFrameRepository` performs a centralized conversion to the Google Visualization `DataTable` format. This ensures that expensive data transformations happen once per update.
- **Sparse Anomaly Maps**: Anomalies are stored in a map-of-maps structure indexed by `TraceKey` and then `CommitPosition`. This allows for efficient $O(1)$ lookups when rendering points on a chart, rather than iterating through lists of anomalies for every data point.
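The map-of-maps lookup and merge can be sketched as follows. The type names ending in `Like` and the single `bugId` field are simplifications assumed for this example; the real `Anomaly` type carries more fields.

```typescript
// Reduced anomaly record for illustration.
interface AnomalyLike {
  bugId: number;
}
type CommitAnomalyMapLike = { [commit: number]: AnomalyLike | undefined };
type AnomalyMapLike = { [traceKey: string]: CommitAnomalyMapLike | undefined };

// Two constant-time property lookups, with no list scanning.
function lookupAnomaly(map: AnomalyMapLike, traceKey: string, commit: number): AnomalyLike | null {
  const perTrace = map[traceKey];
  if (perTrace === undefined) return null;
  const found = perTrace[commit];
  return found === undefined ? null : found;
}

// Combine anomalies from two responses; entries in b override a on conflict.
function mergeAnomalyMaps(a: AnomalyMapLike, b: AnomalyMapLike): AnomalyMapLike {
  const merged: AnomalyMapLike = {};
  for (const source of [a, b]) {
    for (const traceKey of Object.keys(source)) {
      merged[traceKey] = { ...merged[traceKey], ...source[traceKey] };
    }
  }
  return merged;
}
```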
# Module: /modules/day-range-sk
# day-range-sk
The `day-range-sk` module provides a custom element for selecting a time range, defined by a beginning and an ending timestamp. It is designed to simplify date range selection in applications—such as performance monitoring dashboards—where users need to filter data within specific historical boundaries.
## Design and Implementation
The module is built using Lit and extends the `ElementSk` base class. Its primary responsibility is to synchronize two separate date inputs and expose their combined state as a single range.
### Time Representation
Unlike standard HTML date inputs that often use strings or millisecond-based timestamps, `day-range-sk` standardizes on **seconds since the Unix epoch**. This decision aligns with the data formats typically used in backend storage systems (like Prometheus or BigTable) within the Perf ecosystem, reducing the need for repeated conversions in the application logic.
### Components
The element acts as a composite wrapper around two `calendar-input-sk` elements:
- **Begin Input**: Controls the start of the range.
- **End Input**: Controls the end of the range.
The internal state is managed through the `begin` and `end` properties, which are mirrored to attributes. This mirroring allows the element to be initialized or manipulated via declarative HTML or imperative JavaScript.
### Workflow: Range Selection
When a user interacts with either of the internal calendars, the element processes the change and propagates it upward.
```text
[ User Interface ]         [ day-range-sk ]            [ Consumer/App ]
        |                          |                           |
        |--- Change Begin Date --->|                           |
        |                          |--- Calculate Seconds ---->|
        |                          |--- Update 'begin' Attr -->|
        |                          |--- Dispatch Event ------->|
        |                          |    (day-range-change)     |
```
1. The user selects a date in one of the `calendar-input-sk` components.
2. The component catches the `@input` event from the calendar.
3. The `Date` object from the calendar is converted into a floor-rounded second-based timestamp.
4. The element updates its own attribute to ensure the UI stays in sync with the state.
5. A `day-range-change` event is dispatched, containing both the updated and the stationary timestamp in its `detail` object.
## State Management and Defaults
The element is designed to be "ready to use" immediately upon being added to the DOM.
- **Initialization**: If `begin` or `end` attributes are not provided, the component defaults to a **24-hour window** ending at the current time.
- **Property Upgrading**: The implementation includes `_upgradeProperty` calls in the `connectedCallback`, ensuring that if the properties were set on the DOM element before the custom element definition was loaded, those values are correctly captured and reflected.
- **Reactivity**: By implementing `observedAttributes`, the element automatically re-renders whenever the `begin` or `end` attributes are changed externally, ensuring the visual calendar inputs always match the underlying data.
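The timestamp conversion and the default window can be sketched without any DOM dependencies. The helper names and the `DayRangeDetailLike` type are assumptions for this example; the element itself wires equivalent logic into its event handlers.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Same shape as the event detail: both values are seconds since the epoch.
interface DayRangeDetailLike {
  begin: number;
  end: number;
}

// Date -> floor-rounded seconds since the Unix epoch.
function dateToSeconds(date: Date): number {
  return Math.floor(date.valueOf() / 1000);
}

// Default window: the 24 hours ending at "now".
function defaultDayRange(now: Date): DayRangeDetailLike {
  const end = dateToSeconds(now);
  return { begin: end - SECONDS_PER_DAY, end };
}

// Detail for a change of the begin date; the end value stays as it was.
function withNewBegin(current: DayRangeDetailLike, picked: Date): DayRangeDetailLike {
  return { begin: dateToSeconds(picked), end: current.end };
}
```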
## Events
### day-range-change
This is the primary event emitted by the module. It bubbles up the DOM, allowing parent components to listen for any changes to the range.
The `detail` property of the event implements the `DayRangeSkChangeDetail` interface:
```typescript
{
begin: number; // Seconds since epoch
end: number; // Seconds since epoch
}
```
## Appearance
The element's layout is controlled via `day-range-sk.scss`, which ensures the labels and inputs are displayed as block elements with consistent spacing. It integrates with the global theme system by using CSS variables like `--on-surface` and `--surface` for the input borders and backgrounds, supporting both light and dark modes.
# Module: /modules/domain-picker-sk
The `domain-picker-sk` module provides a specialized UI component for selecting time and data ranges, specifically tailored for the Skia Perf ecosystem. Its primary purpose is to define the "domain" (the X-axis) of a performance data request, allowing users to switch between absolute time ranges and relative commit counts.
### Overview
The component provides two distinct modes for querying data, known as the `request_type`:
1. **Range Mode (0):** Users specify a strict chronological window using a "Begin" and "End" date. This is ideal for investigating events within a known timeframe.
2. **Dense Mode (1):** Users specify an "End" date and a fixed number of "Points" (commits) to look back from that date. This is preferred when the frequency of data points is inconsistent, ensuring the resulting visualization contains a predictable density of information.
### Design Decisions
#### State Encapsulation
The component manages its internal state via a `DomainPickerState` object. This interface acts as the contract between the picker and its parent application (typically a dashboard or query page).
- **Unix Timestamps:** Time is handled internally and exposed via the state as Unix seconds (integers), while the UI leverages `calendar-input-sk` for human-readable interaction.
- **Bi-directional State:** The component supports both reading and writing the entire state via a `state` getter/setter, facilitating easy integration with URL-backed state management or "Reset" buttons.
#### Restricted Mode via `force_request_type`
Parent components can "lock" the picker into a specific mode using the `force_request_type` attribute.
- When this attribute is set to `range` or `dense`, the component hides the `radio-sk` selection buttons entirely.
- This allows the same component to be used in simple views where only one query style is supported, without duplicating the calendar and input logic.
### Key Components and Files
- **`domain-picker-sk.ts`**: The core logic. It utilizes Lit for templating and manages the conditional rendering logic that swaps between the "Begin Date" picker (Range mode) and the "Points" numeric input (Dense mode).
- **`domain-picker-sk.scss`**: Defines the layout, ensuring that the `calendar-input-sk` and various labels align correctly. It uses standard element-sk variables to support theme switching (light/dark mode).
- **Dependency on `calendar-input-sk`**: Rather than implementing date logic itself, the picker delegates date selection to this specialized sub-component, maintaining a consistent date-picking experience across the infra.
### Workflow: Range Selection
The following diagram illustrates how the component resolves its output state based on user interaction or attribute overrides:
```text
 User Input / State Set
            |
            v
+-----------------------+      YES      +-------------------------+
|  force_request_type?  |-------------->|  Override request_type  |
+-----------+-----------+               |  (Hide Radio Buttons)   |
            | NO                        +------------+------------+
            v                                        |
+-----------------------+                            |
| Render Radio Buttons  |                            |
|   (Range vs Dense)    |                            |
+-----------+-----------+                            |
            |                                        |
            +--------------------+-------------------+
                                 |
                                 v
                     +-----------------------+
                     |  Render Common "End"  |
                     |    Calendar Input     |
                     +-----------+-----------+
                                 |
                 +---------------+---------------+
                 |                               |
       [request_type: RANGE]           [request_type: DENSE]
                 |                               |
       +---------+---------+           +---------+---------+
       |  Render "Begin"   |           |  Render "Points"  |
       |  Calendar Input   |           |   Numeric Input   |
       +-------------------+           +-------------------+
```
### Component State Structure
The `state` property manages the following object:
| Field | Type | Description |
| :------------- | :------- | :------------------------------------------------ |
| `begin` | `number` | Unix timestamp (seconds) for start of range. |
| `end` | `number` | Unix timestamp (seconds) for end of range. |
| `num_commits` | `number` | Count of points to retrieve (used in Dense mode). |
| `request_type` | `number` | `0` for Range, `1` for Dense. |
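The state contract and the effect of `force_request_type` can be sketched as a pure function. The interface mirrors the table above; `effectiveRequestType`, `visibleControls`, and the control labels are hypothetical names used only for illustration.

```typescript
const RANGE = 0;
const DENSE = 1;
type RequestTypeLike = typeof RANGE | typeof DENSE;

interface DomainPickerStateLike {
  begin: number; // Unix seconds
  end: number; // Unix seconds
  num_commits: number;
  request_type: RequestTypeLike;
}

// A forced mode wins over whatever the state currently says.
function effectiveRequestType(state: DomainPickerStateLike, force: string | null): RequestTypeLike {
  if (force === 'range') return RANGE;
  if (force === 'dense') return DENSE;
  return state.request_type;
}

// Which inputs would be rendered for a given state and override.
function visibleControls(state: DomainPickerStateLike, force: string | null): string[] {
  const controls: string[] = [];
  if (force !== 'range' && force !== 'dense') controls.push('mode-radio');
  controls.push('end-date');
  controls.push(effectiveRequestType(state, force) === RANGE ? 'begin-date' : 'num-points');
  return controls;
}
```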
# Module: /modules/errorMessage
The `errorMessage` module provides a standardized utility for surfacing application errors to the user and, optionally, tracking those errors through telemetry. It acts as a wrapper around the core `elements-sk` error messaging system, tailored for the specific requirements of the Perf application.
### Design Goals
The primary goal of this module is to ensure that critical errors are not missed by the user while maintaining observability for developers.
- **Persistence by Default**: Unlike many UI notification systems that disappear after a few seconds, this module defaults to a duration of `0`. This forces the error message to remain visible until the user manually dismisses it, ensuring that transient network failures or complex logic errors are acknowledged.
- **Integrated Observability**: By wrapping the UI notification with telemetry hooks, the module allows developers to monitor error rates and types in production without scattering reporting logic throughout the codebase.
### Key Components
#### Core Utilities (`index.ts`)
The module exports two primary functions for handling errors:
- **`errorMessage`**: A simplified wrapper that dispatches an `error-sk` event. This event is typically caught by a global `<error-toast-sk>` element (or similar) to display a notification. Its main contribution is overriding the default display duration to infinite (`0`).
- **`errorMessageWithTelemetry`**: Extends the standard error notification by incrementing a metric counter before showing the UI toast. It accepts a `TelemetryErrorOptions` object to categorize the error by source (e.g., a specific API endpoint) and error code (e.g., "404" or "500").
#### Telemetry Integration
The telemetry functionality is designed to categorize errors using the `CountMetric.FrontendErrorReported` metric. This allows the team to create dashboards based on the "source" and "errorCode" labels, providing a clear picture of application health.
### Error Workflow
The following diagram illustrates how an error propagates from a functional call through to the UI and the monitoring backend:
```text
[ Function Call ]
        |
        V
[ errorMessageWithTelemetry(msg, dur, options) ]
        |
        +-----> [ telemetry.increaseCounter ]
        |                    |
        |                    V
        |       [ External Metrics System ]
        |
        +-----> [ elementsErrorMessage ]
                             |
                             V
                [ Dispatch "error-sk" event ]
                             |
                             V
                   [ UI Toast Component ]
```
### Usage Considerations
When using `errorMessageWithTelemetry`, the `source` field in `TelemetryErrorOptions` should be specific enough to identify the feature or component failing, while `errorCode` should represent the category of failure. If these are omitted, they default to `'default'` and `'500'` respectively.
Both functions handle various message formats—including strings, objects with a `message` property, and raw `Response` objects—consistent with the underlying `elements-sk` implementation.
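The count-then-notify ordering and the label defaults can be sketched with injected sinks. `reportError`, `TelemetryOptionsLike`, and `ErrorSinks` are hypothetical stand-ins; they do not reproduce the module's real signatures or the telemetry client.

```typescript
// Hypothetical options, mirroring the source and errorCode labels.
interface TelemetryOptionsLike {
  source?: string;
  errorCode?: string;
}

// Injected side effects so the sketch stays free of DOM and network code.
interface ErrorSinks {
  increaseCounter: (labels: { source: string; errorCode: string }) => void;
  showToast: (message: string, durationMs: number) => void;
}

function reportError(
  message: string,
  sinks: ErrorSinks,
  options: TelemetryOptionsLike = {},
  durationMs: number = 0 // 0 keeps the toast open until dismissed.
): void {
  // Count first, so the metric is recorded even if rendering fails.
  sinks.increaseCounter({
    source: options.source === undefined ? 'default' : options.source,
    errorCode: options.errorCode === undefined ? '500' : options.errorCode,
  });
  sinks.showToast(message, durationMs);
}
```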
# Module: /modules/existing-bug-dialog-sk
# existing-bug-dialog-sk
The `existing-bug-dialog-sk` module provides a modal dialog designed to associate performance anomalies (alerts) with an existing bug in a tracking system (e.g., Monorail/Issues). It is a key component of the Perf triage workflow, allowing users to consolidate multiple related regressions under a single bug ID.
## High-Level Overview
When a user identifies a performance regression, they may want to link it to an ongoing investigation rather than filing a new bug. This module manages the UI for inputting a Bug ID, selecting the target project, and viewing other bugs already associated with the same group of anomalies.
## Design and Implementation Logic
### Association Workflow
The primary responsibility of the module is to communicate with the triage backend to establish a link between anomaly keys and a bug ID.
1. **Input Collection**: The user provides a numeric Bug ID and selects a project (defaulting to "chromium").
2. **Validation**: The UI enforces a 5-9 digit numeric pattern for the Bug ID to prevent malformed submissions.
3. **Submission**: The component sends a POST request to `/_/triage/associate_alerts`.
4. **State Synchronization**: Upon success, it updates the local `anomalies` data to reflect the new association and dispatches an `anomaly-changed` event. This event ensures that other UI components (like charts or lists) stay in sync without requiring a full page reload.
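The validation step can be sketched as a simple pattern check. The request body built by `buildAssociateRequest` is a hypothetical shape invented for this example; the actual payload expected by `/_/triage/associate_alerts` is not specified here.

```typescript
// Accepts only 5 to 9 ASCII digits, matching the rule described above.
function isValidBugId(input: string): boolean {
  return /^[0-9]{5,9}$/.test(input);
}

// Hypothetical request body; field names are assumptions, not the real API.
interface AssociateRequestLike {
  bug_id: number;
  keys: string[];
}

// Returns null instead of a request when the bug ID is malformed.
function buildAssociateRequest(bugIdInput: string, anomalyKeys: string[]): AssociateRequestLike | null {
  if (!isValidBugId(bugIdInput)) return null;
  return { bug_id: parseInt(bugIdInput, 10), keys: anomalyKeys };
}
```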
### Bug Discovery and Context
To help users avoid duplicating work, the dialog can fetch and display a list of bugs already associated with the anomalies being triaged.
- **Group Reports**: It fetches anomaly group reports from `/_/anomalies/group_report`. If the backend returns a "SID" (Session ID), the component handles the additional fetch step to resolve the full list of anomalies in that group.
- **Metadata Enrichment**: Simply showing a list of numbers (Bug IDs) is often unhelpful. The component calls `/_/triage/list_issues` to fetch the human-readable titles of these bugs, providing better context for the user before they commit the association.
### User Interface Decisions
- **Scoped Styling**: The component uses `createRenderRoot() { return this; }` to render directly into the custom element, allowing it to leverage global Perf themes and styles (SASS) defined in the project.
- **Async Feedback**: It uses a `spinner-sk` and disables the submit button during active network requests to prevent duplicate submissions and provide visual feedback.
## Key Components and Files
### existing-bug-dialog-sk.ts
The core logic of the dialog. It manages:
- **Internal State**: Tracks the `_projectId`, `_busy` status, and the `bugIdTitleMap` (linking bug IDs to their descriptive titles).
- **Anomaly Handling**: Accepts `anomalies` and `traceNames` as properties. These are used to construct the payload for the triage backend.
- **Event Dispatching**: Issues the `anomaly-changed` event to notify the rest of the application of state changes.
### existing-bug-dialog-sk.scss
Defines the layout of the dialog, ensuring it occupies a reasonable portion of the screen (25% width) and handles long lists of associated bugs via a scrollable container (`#associated-bugs-table`).
### Page Objects (`existing-bug-dialog-sk_po.ts`)
Provides an abstraction for testing the component. It encapsulates the selectors for the input fields and buttons, allowing Puppeteer and Karma tests to interact with the dialog without being brittle to internal DOM changes.
## Workflow Diagram
```text
User Actions           Component Logic                  Backend API
--------------------------------------------------------------------------------
Open Dialog    ----->  Fetch Associated Bugs    ----->  /_/anomalies/group_report
                                 |
                                 v
                       Fetch Bug Titles         ----->  /_/triage/list_issues
                                 |
                       Render List + Form
                                 |
Input Bug ID   ----->            |
                                 |
Click Submit   ----->  Validate & Send          ----->  /_/triage/associate_alerts
                                 |
                       Update Local State
                                 |
                       Dispatch 'anomaly-changed'
                                 |
                       Close Dialog
```
## Related Files
- **`existing-bug-dialog-sk-demo.ts`**: Provides a mocked environment to test the dialog's behavior and layout in isolation.
- **`test_data.ts`**: Contains sample `Anomaly` objects used for both documentation and testing.
# Module: /modules/explore-multi-sk
# explore-multi-sk
The `explore-multi-sk` module provides a comprehensive interface for visual data exploration in Perf, allowing users to view and interact with multiple graphs simultaneously. It acts as an orchestrator for multiple `explore-simple-sk` instances, synchronizing their states (such as time ranges and X-axis scaling) to facilitate comparative analysis across different data dimensions.
## High-Level Overview
The module serves two primary exploration modes:
1. **Standard/Split Mode:** A "Master-Slave" architecture where one primary graph contains all selected data, and users can "split" this data into individual graphs based on specific parameters (e.g., splitting a single graph containing multiple OS traces into separate graphs for "Android", "Ubuntu", etc.).
2. **Manual Plot Mode:** An independent mode where users can add and remove graphs arbitrarily, treating each as a standalone snapshot that does not necessarily share the same query parameters as others.
## Design Decisions and Implementation
### State Management and URL Reflection
The module utilizes `stateReflector` to persist the exploration state in the URL. To keep URLs manageable and logic simple, `explore-multi-sk` only tracks properties necessary to reconstruct the graphs.
- **Time Range Logic:** Priority is given to explicit `begin`/`end` timestamps in the URL. If missing, it falls back to a `dayRange` (e.g., "last 7 days"). If both are missing, it uses global defaults.
- **Shortcut System:** Instead of encoding every query for every graph in the URL, the module generates a `shortcut` ID. This ID maps to a collection of graph configurations in the backend database, allowing complex multi-graph layouts to be shared via a short link.
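The time range fallback order can be sketched as below. `UrlStateLike` and `resolveTimeRange` are hypothetical, and the sketch assumes that explicit timestamps are honored only when both `begin` and `end` are present and that `dayRange` is expressed in days.

```typescript
// Hypothetical subset of the URL-reflected state.
interface UrlStateLike {
  begin?: number; // Unix seconds
  end?: number; // Unix seconds
  dayRange?: number; // Days to look back from now
}

interface ResolvedRange {
  begin: number;
  end: number;
}

function resolveTimeRange(state: UrlStateLike, nowSeconds: number, defaultDays: number): ResolvedRange {
  // 1. Explicit timestamps take priority.
  if (state.begin !== undefined && state.end !== undefined) {
    return { begin: state.begin, end: state.end };
  }
  // 2. Otherwise use dayRange, and 3. fall back to the global default.
  const days = state.dayRange !== undefined ? state.dayRange : defaultDays;
  return { begin: nowSeconds - days * 24 * 60 * 60, end: nowSeconds };
}
```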
### Graph Orchestration and Synchronization
The module ensures a unified experience across multiple internal elements through event-driven synchronization:
- **Time Range Sync:** When a user zooms or pans on one graph, the `range-changing-in-multi` and `selection-range-changed` events trigger an update across all other graphs.
- **X-Axis Consistency:** Toggling between "Commit" and "Date" domains on one chart updates the `domain` state for all instances, ensuring the X-axis remains comparable.
- **Even X-Axis Spacing:** Users can toggle discrete spacing (ignoring time gaps between points). This preference is synced across charts and persisted in `localStorage`.
### Performance and Batch Loading
To prevent browser performance degradation when loading dozens of graphs (e.g., splitting by a parameter with many values), the module implements **Chunked Loading**:
```
[User Clicks Plot]
|
V
[Calculate Groups] -> (e.g., 20 different OS values)
|
V
[Load Chunk 1] ----> (Load first 5 graphs, request range data)
|
[Load Chunk 2] ----> (Load next 5 graphs)
|
...
|
[Final Load] ------> (Fetch extended range data for all graphs in one batch)
```
This approach allows the UI to become interactive incrementally while minimizing the total number of expensive backend requests for historical data.
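The chunked loading flow could look roughly like the following. This is a sketch under stated assumptions: `loadGraph` and `loadExtendedRange` are hypothetical callbacks standing in for the real network requests, and the chunk size is passed in rather than taken from the module.

```typescript
// Splits a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error('size must be positive');
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Loads graphs chunk by chunk, then issues one batched extended-range request.
async function loadInChunks<T>(
  groups: T[],
  chunkSize: number,
  loadGraph: (group: T) => Promise<void>, // hypothetical per-graph loader
  loadExtendedRange: (groups: T[]) => Promise<void> // hypothetical batch loader
): Promise<void> {
  for (const batch of chunk(groups, chunkSize)) {
    // Graphs within a chunk load concurrently; chunks run one after another.
    await Promise.all(batch.map(loadGraph));
  }
  // A single request for historical data once every graph exists.
  await loadExtendedRange(groups);
}
```

With 20 groups and a chunk size of 5, this issues four rounds of graph loads followed by one extended-range request.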
## Key Components and Files
- **`explore-multi-sk.ts`**: The core logic coordinator. It handles the `State` object, manages the lifecycle of `explore-simple-sk` elements, and implements the splitting/merging logic.
- **`explore-multi-sk.scss`**: Provides layout styling, ensuring that graphs are sized appropriately (e.g., shrinking height when multiple graphs are displayed) and handling the visibility of UI components like the Test Picker.
- **`explore-multi-sk_po.ts`**: A Page Object for Puppeteer testing, providing a clean API to interact with the multi-graph container and its children during integration tests.
- **Integration with `test-picker-sk`**: The module heavily relies on the Test Picker for selecting data. In "Split Mode", the picker's state is used to determine how to partition the data into individual charts.
## Key Workflows
### The "Split" Process
When a user selects a "Split By" parameter (e.g., `os`):
1. The module identifies all traces currently loaded in the "Master" graph.
2. Traces are grouped by the value of the chosen parameter.
3. The Master graph is optionally hidden (becoming a background data accumulator).
4. New `explore-simple-sk` instances are created for each group.
5. Each child graph is initialized with the specific query and data subset corresponding to its group.
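Step 2, grouping traces by the chosen parameter, can be illustrated with a short sketch. It assumes trace keys are comma-delimited strings such as `,arch=arm,os=Android,`; the function names are hypothetical and the real module operates on richer data structures.

```typescript
// Parses a structured trace key such as ",arch=arm,os=Android," into a map.
// The comma-delimited key format is an assumption made for this sketch.
function parseTraceKey(key: string): Map<string, string> {
  const params = new Map<string, string>();
  for (const part of key.split(',')) {
    const eq = part.indexOf('=');
    if (eq > 0) params.set(part.slice(0, eq), part.slice(eq + 1));
  }
  return params;
}

// Groups trace keys by the value they carry for the `splitBy` parameter.
function groupTracesByParam(
  traceKeys: string[],
  splitBy: string
): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const key of traceKeys) {
    const value = parseTraceKey(key).get(splitBy);
    if (value === undefined) continue; // traces lacking the parameter are skipped here
    const bucket = groups.get(value) ?? [];
    bucket.push(key);
    groups.set(value, bucket);
  }
  return groups;
}
```

Each entry of the returned map would then back one child graph.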
### Removing Data
Data can be removed in two ways:
- **Individual Trace Removal:** Triggered from a graph's UI. The module filters the global `TraceSet`, updates the internal data models, and tells the relevant graphs to re-render without the specific trace.
- **Graph Removal:** In Manual Plot Mode, clicking the "Trash" icon removes that specific `explore-simple-sk` instance and updates the URL shortcut. In Split Mode, removing the last trace of a graph typically results in the removal of the entire graph instance.
# Module: /modules/explore-simple-sk
# Explore Simple SK
The `explore-simple-sk` module provides a comprehensive data exploration interface for the Perf tool. It serves as the primary component for querying, visualizing, and triaging performance traces, allowing users to interact with large datasets through charts, tables, and integrated triage tools.
## High-Level Overview
`explore-simple-sk` is designed to be a versatile "explorer" that can operate in multiple modes (plotting, pivot tables, or simple querying). It manages the state of a data exploration session—including the time range, active queries, formulas, and selected data points—and reflects this state in the browser URL for shareability.
The module acts as a coordinator for several specialized sub-components:
- **Data Management**: Uses `DataFrameRepository` to manage the underlying trace data and anomaly maps.
- **Visualization**: Uses `plot-google-chart-sk` for the main interactive graph and `plot-summary-sk` for long-range navigation.
- **Querying**: Integrates `query-sk` and `pivot-query-sk` to allow users to filter data.
- **Triage**: Provides a `chart-tooltip-sk` that facilitates bug filing, anomaly nudging, and bisection.
## Key Design Decisions
### State Management and URL Reflection
The module utilizes a `State` class to track all parameters of the current view (e.g., `begin`, `end`, `queries`, `formulas`, `domain`).
- **Why**: This allows for "deep linking," where a user can share a specific view of a graph, including the zoom level and selected trace, simply by sharing the URL.
- **How**: It uses a `state_changed` event mechanism and a `useBrowserURL` method to sync internal variables with URL search parameters.
### Incremental Data Loading
To maintain performance when exploring large datasets, the module implements incremental fetching.
- **Why**: Fetching an entire repository's history is expensive. Users often start with a small window and "pan" left or right.
- **How**: When a user pans or zooms outside the currently loaded data range, the module calculates the delta and requests only the necessary additional frames, joining them with the existing `DataFrame` in memory.
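The delta calculation can be sketched as follows. The `Range` shape (half-open, in commit numbers or seconds) and the `missingRanges` name are assumptions for illustration; the module's own bookkeeping may differ.

```typescript
// Half-open interval [begin, end); a hypothetical shape for this sketch.
interface Range {
  begin: number;
  end: number;
}

// Returns the parts of `requested` that are not covered by `loaded`.
function missingRanges(loaded: Range, requested: Range): Range[] {
  const missing: Range[] = [];
  if (requested.begin < loaded.begin) {
    // Data needed to the left of what is already in memory.
    missing.push({
      begin: requested.begin,
      end: Math.min(requested.end, loaded.begin),
    });
  }
  if (requested.end > loaded.end) {
    // Data needed to the right of what is already in memory.
    missing.push({
      begin: Math.max(requested.begin, loaded.end),
      end: requested.end,
    });
  }
  return missing;
}
```

Only the returned ranges need to be fetched; a request that falls entirely inside the loaded range yields an empty list and no network call.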
### Domain Switching (Commit vs. Date)
Users can toggle the X-axis between "Commit Position" and "Date."
- **Why**: Performance regressions are often tied to specific code changes (commits), but understanding the real-world timeline (dates) is crucial for identifying infrastructure issues or seasonal patterns.
- **How**: The module performs a coordinate transformation when the domain changes, ensuring that any active zoom selection remains focused on the same set of data points by translating commit offsets to timestamps (and vice versa) using the dataframe header.
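A minimal sketch of the coordinate translation, assuming each dataframe header entry carries a commit `offset` and a `timestamp`. The `ColumnHeader` shape, the `convertValue` name, and the nearest-column matching are choices made for this example rather than the module's confirmed behavior.

```typescript
type Domain = 'commit' | 'date';

// Hypothetical header entry: one per column of the dataframe.
interface ColumnHeader {
  offset: number; // commit number
  timestamp: number; // Unix seconds
}

// Converts a value into the `to` domain by finding the closest header column.
function convertValue(
  header: ColumnHeader[],
  value: number,
  to: Domain
): number {
  if (header.length === 0) throw new Error('empty header');
  // Source-domain accessor and target-domain accessor for a column.
  const from = (c: ColumnHeader) => (to === 'date' ? c.offset : c.timestamp);
  const target = (c: ColumnHeader) => (to === 'date' ? c.timestamp : c.offset);
  const best = header.reduce((a, c) =>
    Math.abs(from(c) - value) < Math.abs(from(a) - value) ? c : a
  );
  return target(best);
}
```

Applying `convertValue` to both ends of a zoom selection keeps the selection on the same columns after the domain changes.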
## Key Components and Responsibilities
### explore-simple-sk.ts
The main class responsible for the lifecycle of the explorer.
- **Workflow Coordination**: It handles the logic for adding traces via queries or formulas (`addFromQueryOrFormula`) and managing the response display mode (Graph vs. Pivot Table).
- **User Interaction**: Processes keyboard shortcuts (zoom/pan), mouse events on the chart, and interactions with the settings dialog (e.g., toggling even X-axis spacing).
### nudge-util.ts
A specialized utility for handling anomaly "nudges."
- **Responsibility**: Computes the candidate positions to which an anomaly can be moved. The detected start of a regression might be slightly off from the "true" start due to sparse data or noise, so users need a way to shift it.
- **Logic**: It scans the trace to find valid data points (skipping `MISSING_DATA_SENTINEL` values) and calculates a list of `NudgeEntry` objects. This ensures that when an anomaly is moved, it always lands on a commit that actually contains data for that specific trace.
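The scan can be sketched as below. The sentinel value, the fields of `NudgeEntry`, and the `radius` parameter are assumptions for illustration; only the idea of skipping missing points comes from the description above.

```typescript
// Marker for "no data at this commit"; the concrete value is assumed here.
const MISSING_DATA_SENTINEL = 1e32;

// Hypothetical shape: the real NudgeEntry may carry different fields.
interface NudgeEntry {
  offset: number; // signed distance from the anomaly, counted in valid points
  index: number; // column index within the trace
  value: number; // measurement at that column
}

// Lists the valid points within `radius` valid steps of the anomaly.
function nudgeCandidates(
  trace: number[],
  anomalyIndex: number,
  radius: number
): NudgeEntry[] {
  const valid: { index: number; value: number }[] = [];
  trace.forEach((value, index) => {
    if (value !== MISSING_DATA_SENTINEL) valid.push({ index, value });
  });
  const pos = valid.findIndex((p) => p.index === anomalyIndex);
  if (pos === -1) return []; // the anomaly does not sit on a valid point
  const lo = Math.max(0, pos - radius);
  const hi = Math.min(valid.length - 1, pos + radius);
  return valid
    .slice(lo, hi + 1)
    .map((p, i) => ({ offset: lo + i - pos, index: p.index, value: p.value }));
}
```

Because offsets count valid points rather than columns, a nudge of `+1` always lands on the next commit that has data for the trace.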
### explore-simple-sk_po.ts
A Page Object (PO) implementation for Puppeteer testing.
- **Responsibility**: Encapsulates the DOM structure and common interactions (like clicking the "Remove All" button or verifying anomaly tooltips) to provide a stable API for integration tests.
## Key Workflows
### Data Query and Plotting Process
The following diagram illustrates how a user query is transformed into a visual plot:
```text
User Input (Query/Formula)
      |
      V
addFromQueryOrFormula() ----> Validates Query
      |
      V
requestFrame() -------------> DataService (Backend API)
      |                              |
      |<-----------------------------|
      V
UpdateWithFrameResponse()
      |
      +-----> DataFrameRepository (Stores & Merges Data)
      |
      +-----> plot-google-chart-sk (Renders Main Graph)
      |
      +-----> plot-summary-sk (Renders Navigation Bar)
      |
      +-----> paramset-sk (Updates Metadata Panel)
```
### Anomaly Triage Workflow
When a user interacts with a data point on the chart:
```text
Chart Click Event
|
V
onChartSelect()
|
+-----> enableTooltip()
|
+-----> Fetches Commit Links (Gitiles/Issue Tracker)
|
+-----> nudge-util (Calculates Nudge Steps)
|
+-----> Displays chart-tooltip-sk
|
+---[File Bug]---> NewBugDialog
+---[Nudge]------> Update Anomaly Map
+---[Bisect]-----> BisectDialog
```
## CSS and Layout
The module uses a flexible layout defined in `explore-simple-sk.scss` that adapts based on the `displayMode` (e.g., `.display_query_only`, `.display_plot`). It uses CSS classes to hide/show components like the spinner, the pivot table, or the plot summary based on the current operation, ensuring a clean UI regardless of the data being explored.
# Module: /modules/explore-sk
# explore-sk
The `explore-sk` module provides the primary entry point and high-level container for the Skia Perf data exploration interface. It acts as an orchestrator that integrates several complex sub-components—most notably the core graphing engine and the test selection tools—into a unified user experience.
## Overview
The purpose of `explore-sk` is to provide a cohesive environment where users can query performance data, visualize traces, and interact with the resulting graphs. While the actual plotting and state management logic reside in sub-modules, `explore-sk` handles the high-level layout, global event routing, and the initialization of environment-specific defaults.
## Key Components and Responsibilities
The module is structured as a custom element (`explore-sk.ts`) that manages the lifecycle and communication between several key pieces:
- **ExploreSimpleSk (`<explore-simple-sk>`):** This is the "heavy lifter" of the module. It handles the actual data fetching, state management for queries, and the rendering of the performance charts. `explore-sk` acts as its parent, passing down configuration and reflecting its state to the URL.
- **TestPickerSk (`<test-picker-sk>`):** A specialized UI component for building queries. It allows users to select specific parameters (like architecture, config, or test name) from dropdowns or autocomplete fields. `explore-sk` dynamically initializes this component based on the backend's configuration.
- **State Reflection:** The module uses `stateReflector` to ensure that the complex state of the exploration (selected traces, time ranges, etc.) is synchronized with the browser's URL. This allows users to share specific views or bookmarks of their performance analysis.
- **Authentication Integration:** It interacts with `alogin-sk` to determine the user's login status. This is used to conditionally enable features like "Favorites," which require a user identity to persist data.
## Design Decisions
### Composition over Monolith
Instead of implementing plotting and querying logic directly, `explore-sk` serves as a thin wrapper. This design allows `explore-simple-sk` to remain focused on the core data/charting logic, while `explore-sk` manages the layout and the integration of optional UI elements like the `test-picker-sk`.
### Dynamic UI (V2 UI and Test Picker)
The module implements logic to switch between different querying interfaces. It checks for backend defaults and local storage flags (like `v2_ui`) to decide whether to show the traditional query dialog or the newer `test-picker-sk`. This allows for a staged rollout of new UI features without breaking the core exploration workflow.
### Centralized Keyboard Handling
To provide a consistent "app-like" feel, `explore-sk` captures global keyboard events (like the `?` key for help) and delegates them to the appropriate child component (`explore-simple-sk`). This ensures that shortcuts work regardless of which sub-element currently has focus.
## Key Workflows
### Initialization and Configuration
When the element is attached to the DOM, it follows a specific sequence to configure the environment:
```
[ explore-sk ]
      |
      |-- 1. Fetch /_/defaults/ --------> [ Backend ]
      |<--------- JSON Config ----------------|
      |
      |-- 2. Check Auth Status ---------> [ alogin-sk ]
      |<--------- Login Status ---------------|
      |
      |-- 3. Initialize State ----------> [ stateReflector ]
      |<--------- URL Params -----------------|
      |
      |-- 4. Setup TestPicker (if enabled)
      |
      '-- 5. Pass state & defaults to [ explore-simple-sk ]
```
### Querying via Test Picker
When a user interacts with the `test-picker-sk`, the communication flows through events:
1. The user selects parameters in `test-picker-sk` and clicks "Plot".
2. The `test-picker-sk` emits a `plot-button-clicked` event.
3. `explore-sk` catches this event, extracts the query string from the picker, and calls the `addFromQueryOrFormula` method on `explore-simple-sk`.
4. `explore-simple-sk` fetches the data and updates the chart.
### Trace Highlighting to Query
If a user is looking at a graph and wants to refine their query based on a specific trace:
1. A "populate-query" event is triggered (usually from a trace detail view).
2. `explore-sk` receives the trace key.
3. It translates that key into a query string and instructs `test-picker-sk` to update its fields to match that specific trace, allowing the user to easily pivot their search.
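The key-to-query translation in step 3 might look like this sketch. It assumes trace keys use the comma-delimited form `,arch=x86,config=8888,` and that queries are URL-style `key=value` pairs joined by `&`; `traceKeyToQuery` is a hypothetical name.

```typescript
// Converts ",arch=x86,config=8888," into "arch=x86&config=8888".
// Both formats are assumptions made for this sketch.
function traceKeyToQuery(traceKey: string): string {
  const pairs: string[] = [];
  for (const part of traceKey.split(',')) {
    const eq = part.indexOf('=');
    if (eq <= 0) continue; // skips the empty segments around leading/trailing commas
    const name = part.slice(0, eq);
    const value = part.slice(eq + 1);
    pairs.push(`${encodeURIComponent(name)}=${encodeURIComponent(value)}`);
  }
  return pairs.join('&');
}
```

The resulting string can be handed to the picker so that each field is pre-selected with the value from the trace.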
# Module: /modules/extra-links-sk
### Overview
The `extra-links-sk` module provides a specialized custom element designed to display a curated list of external resources, documentation, or related tools. It serves as a dynamic landing area or sidebar within the Perf application, allowing administrators to surface relevant links that might otherwise be buried in external documentation sites.
### Design and Implementation Philosophy
The module is built on the principle of **configuration-driven UI**. Rather than hardcoding links or managing them through complex state management within the element itself, it leverages the global environment to determine its content.
#### Global State Integration
The element relies on the `window.perf.extra_links` configuration object. This design choice decouples the UI component from the backend API calls. By assuming that the global `window.perf` object (typically populated at page load or via a global configuration fetch) contains the necessary metadata, the element remains lightweight and reactive to the environment it is placed in.
#### Declarative Templating
Using `lit`, the element implements a declarative template that handles two primary states:
1. **Configured State**: If `window.perf.extra_links` is populated, it renders a structured table featuring link titles and descriptions.
2. **Empty State**: If no configuration is present, it provides a fallback message ("No links have been configured"), ensuring the UI doesn't appear broken or completely empty without explanation.
### Key Components and Responsibilities
#### extra-links-sk.ts
This file defines the `ExtraLinksSk` class, which extends `ElementSk`. Its primary responsibility is the lifecycle management and rendering of the link table.
- **Data Mapping**: It maps the `ExtraLink` objects (containing `text`, `href`, and `description`) into a tabular format.
- **Lifecycle**: It triggers a render immediately upon being connected to the DOM (`connectedCallback`), ensuring that the links are visible as soon as the component is attached.
#### extra-links-sk.scss
The styling is scoped to the `extra-links-sk` element. It utilizes CSS variables (like `--primary` and `--on-surface`) to maintain theme consistency with the rest of the application. The layout uses `border-collapse: separate` and specific padding to ensure the links are easily readable and touch-friendly.
### Data Flow and Workflow
The following diagram illustrates how data flows from the global configuration into the rendered component:
```
[ Global Scope ]         [ extra-links-sk ]               [ Browser DOM ]
       |                          |                              |
       | 1. Set window.perf. -----|----------------------------->|
       |    extra_links = {...}   |                              |
       |                          |                              |
       |                          | 2. connectedCallback()       |
       |                          | <--------------------------- |
       |                          |                              |
       |                          | 3. Read window.perf          |
       |                          |    loop through links        |
       |                          |                              |
       |                          | 4. Generate HTML Table       |
       |                          | ---------------------------> |
       |                          |                              |
```
### Configuration Structure
The component expects the configuration to follow this structure within the global `window.perf` object:
- **title**: A string displayed as the main header for the links section.
- **links**: An array of objects, where each object contains:
- **text**: The clickable label for the link.
- **href**: The destination URL.
- **description**: A text explanation of what the link provides.
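The structure above can be written down as TypeScript interfaces. The sketch below also shows the two rendering states in a DOM-free form; `ExtraLinksConfig` and `renderExtraLinks` are hypothetical names, and the plain-text output stands in for the element's HTML table.

```typescript
interface ExtraLink {
  text: string; // clickable label
  href: string; // destination URL
  description: string; // explanation of what the link provides
}

// Hypothetical name for the object stored at window.perf.extra_links.
interface ExtraLinksConfig {
  title: string;
  links: ExtraLink[];
}

const EMPTY_MESSAGE = 'No links have been configured';

// Plain-text stand-in for the element's template: configured vs. empty state.
function renderExtraLinks(config: ExtraLinksConfig | null | undefined): string {
  if (!config || config.links.length === 0) return EMPTY_MESSAGE;
  const rows = config.links.map(
    (l) => `${l.text} (${l.href}): ${l.description}`
  );
  return [config.title, ...rows].join('\n');
}
```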
# Module: /modules/favorites-dialog-sk
### Overview
The `favorites-dialog-sk` module provides a modal dialog designed for creating and editing user "favorites" within the Perf application. It encapsulates a form for capturing a name, description, and URL, and handles the asynchronous communication with the backend API to persist these changes.
### Design and Implementation Logic
The module is built as a `LitElement` and utilizes the native HTML `<dialog>` element for modal behavior. The design focuses on a Promise-based workflow for the calling component, allowing the parent to react differently depending on how the dialog was closed.
#### State Management and Lifecycle
Instead of relying on external events to communicate success, the `open()` method returns a `Promise`. This allows the caller to `await` the user's action:
- **Resolve:** The promise resolves if the user successfully saves a new or edited favorite. This indicates to the parent (e.g., a favorites list) that it should refresh its data.
- **Reject:** The promise rejects if the user dismisses the dialog via the "Cancel" button or the close icon.
#### Data Handling
The component distinguishes between "create" and "edit" modes based on the presence of a `favId`.
- **Creation:** If `favId` is empty, the component defaults the URL to the current window location and targets the `/_/favorites/new` endpoint.
- **Modification:** If a `favId` is provided, the component populates the fields with existing data and targets the `/_/favorites/edit` endpoint.
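The create/edit distinction can be sketched as a function that picks the endpoint and request body from `favId`. The `FavoriteFields` and `FavoriteRequest` shapes and the `buildFavoriteRequest` name are assumptions for illustration; the endpoint paths and body fields follow the descriptions in this module's documentation.

```typescript
interface FavoriteFields {
  name: string;
  description: string;
  url: string;
}

// Hypothetical container for what would be passed to fetch().
interface FavoriteRequest {
  endpoint: string;
  body: Record<string, string>;
}

function buildFavoriteRequest(
  favId: string,
  fields: FavoriteFields
): FavoriteRequest {
  const body: Record<string, string> = {
    name: fields.name,
    description: fields.description,
    url: fields.url,
  };
  if (favId === '') {
    // No ID: create a new favorite.
    return { endpoint: '/_/favorites/new', body };
  }
  // Existing ID: edit, sending the original id with the updated fields.
  return { endpoint: '/_/favorites/edit', body: { id: favId, ...body } };
}
```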
#### Workflow Diagram
The following diagram illustrates the interaction between the UI and the backend:
```text
[Parent Component]        [favorites-dialog-sk]               [Backend API]
        |                           |                               |
        |---- .open(id, name) ----->|                               |
        |                           |-- (User edits fields)         |
        |                           |                               |
        |                           |---- Click "Save" ------------>|
        |                           | POST /_/favorites/(new|edit)  |
        |                           |<------- 200 OK / Error -------|
        |                           |                               |
        |<--- Resolve / Reject -----|                               |
```
### Key Components and Files
#### favorites-dialog-sk.ts
This is the core logic of the module.
- **Form Validation:** Ensures that the "Name" and "URL" fields are non-empty before attempting a submission, triggering an `errorMessage` toast if validation fails.
- **Async Operations:** Manages the `updatingFavorite` state to toggle a `<spinner-sk>` and disable action buttons while a network request is in flight.
- **Unique ID Generation:** Uses a static `nextUniqueId` counter to ensure that HTML `id` and `for` attributes are unique across multiple instances on the same page, maintaining accessibility and correct label-to-input binding.
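The validation and unique-ID points can be illustrated with a DOM-free sketch. The `validateFavorite` and `DialogIds` names and the error strings are hypothetical; only the non-empty checks and the static counter come from the description above.

```typescript
// Returns an error message, or null when the fields may be submitted.
function validateFavorite(name: string, url: string): string | null {
  if (name === '') return 'Name is required'; // message text is illustrative
  if (url === '') return 'URL is required';
  return null;
}

// Gives each dialog instance its own suffix for id/for attributes.
class DialogIds {
  private static nextUniqueId = 0;

  readonly id: number;

  constructor() {
    this.id = DialogIds.nextUniqueId++;
  }

  // e.g. inputId('name') yields a value unique to this instance.
  inputId(field: string): string {
    return `${field}-${this.id}`;
  }
}
```

Two dialogs on the same page then get distinct `id` values, so each `<label for=...>` binds to the input in its own dialog.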
#### favorites-dialog-sk.scss
Defines the visual presentation using the Perf theme variables. It handles the layout of the form elements, specifically positioning the close icon and styling the input fields to occupy the standard modal width (500px for inputs).
#### favorites-dialog-sk-demo.ts
Provides a reference implementation for how to trigger the dialog for both "New" and "Edit" scenarios. It demonstrates the use of the `open()` method and how to pass initial parameters.
### API Interaction
The module interacts with the following endpoints:
- `POST /_/favorites/new`: Used when creating a new favorite. The body includes `name`, `description`, and `url`.
- `POST /_/favorites/edit`: Used when updating existing favorites. The body includes the original `id` along with the updated fields.
Errors from the API are captured and displayed to the user via the `errorMessage` utility, while the dialog remains open to allow the user to correct the issue or try again.
# Module: /modules/favorites-sk
The `favorites-sk` module provides a specialized dashboard interface for managing and viewing bookmarked links within the Perf application. It distinguishes between global system-wide favorites and user-specific links, allowing for personal organization of performance data views.
### Design and Logic
The module is built around a centralized configuration fetched from the backend. The primary design goal is to provide a unified view where users can see pre-configured links (such as project-wide dashboards) alongside their own curated list of performance traces or search queries.
#### Data Fetching and Persistence
Upon mounting (`connectedCallback`), the element fetches the favorites configuration from `/_/favorites/`. This configuration drives the entire UI. The module does not patch its local state after a mutation: whenever a change occurs (like a deletion or an edit), the component re-fetches the entire configuration so that the UI reflects the server's state.
#### Section Differentiation
The implementation applies different business rules based on the section name:
- **"My Favorites"**: This section is treated as mutable. For links under this header, the UI provides "Edit" and "Delete" actions. It integrates with `favorites-dialog-sk` to facilitate complex editing of link metadata (names, descriptions, and URLs).
- **General Sections**: Any other section is treated as read-only, displaying links and descriptions without management controls.
### Workflow: Deleting a Favorite
The deletion process includes a safety check to prevent accidental data loss:
```
[User Clicks Delete]
|
v
[Browser Confirm Dialog] --(Cancel)--> [Abort]
|
(OK)
v
[POST to /_/favorites/delete]
|
[Success?] --(No)--> [Show Error Message]
|
(Yes)
v
[Re-fetch /_/favorites/]
|
[Re-render]
```
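The deletion flow can be sketched with its side effects injected, which keeps the control flow testable without a browser. The `DeleteDeps` interface, the outcome strings, and the message texts are assumptions made for this example.

```typescript
// Hypothetical seams for the browser and network operations in the diagram.
interface DeleteDeps {
  confirm: (message: string) => boolean; // e.g. window.confirm
  postDelete: (id: string) => Promise<boolean>; // resolves true on success
  refetch: () => Promise<void>; // re-fetches /_/favorites/ and re-renders
  showError: (message: string) => void; // e.g. an error toast
}

type DeleteOutcome = 'aborted' | 'failed' | 'deleted';

async function deleteFavorite(
  id: string,
  deps: DeleteDeps
): Promise<DeleteOutcome> {
  // Safety check: nothing is sent unless the user confirms.
  if (!deps.confirm('Delete this favorite?')) return 'aborted';
  const ok = await deps.postDelete(id);
  if (!ok) {
    deps.showError('Failed to delete favorite');
    return 'failed';
  }
  // On success, reload the full configuration from the server.
  await deps.refetch();
  return 'deleted';
}
```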
### Key Components
- **`favorites-sk.ts`**: The core logic container. It manages the state of the `favoritesConfig` and handles the asynchronous interactions with the backend API. It uses `lit` for templating, dynamically generating tables based on the presence of user-specific links.
- **`favorites-dialog-sk`**: While imported from a sibling module, it is a critical dependency for this module's "Edit" workflow. `favorites-sk` acts as the orchestrator, passing existing link data into this dialog and waiting for a resolution to refresh the view.
- **`favorites-sk.scss`**: Defines the layout for the favorites tables. It uses a spacious design with `border-spacing` and specific styling for primary links to ensure the dashboard remains readable even with a high density of saved traces.
### Implementation Details
The module relies on the `ElementSk` base class for standard component lifecycle management. For user interactions:
- **Editing**: Selecting "Edit" triggers a call to the dialog component's `.open()` method, passing the `id`, `text`, `description`, and `href`.
- **API Communication**: Uses `jsonOrThrow` and `errorMessage` utilities to handle network failures gracefully, ensuring that server-side errors are surfaced to the user via a consistent UI toast/notification system.
# Module: /modules/gemini-side-panel-sk
The `gemini-side-panel-sk` module provides a slide-out interface for interacting with a Gemini-powered AI assistant. It is designed as a persistent UI overlay that can be integrated into any page to provide contextual help or a general-purpose chat interface without navigating away from the current view.
### Design and Implementation Choices
The module is implemented as a Lit element, leveraging reactive properties to manage the chat state and visibility.
**Slide-out Transition**
The panel is positioned using `position: fixed` with a negative `right` offset. This design choice allows the panel to exist in the DOM but remain hidden off-screen until activated. By toggling the `open` attribute, the CSS transitions the `right` property to `0`, providing a smooth visual entry. This approach is preferred over `display: none` because it allows for CSS-driven animations and ensures the element's internal state remains preserved while hidden.
**State Management and UI Feedback**
The element manages three primary pieces of state:
- `messages`: An array of chat objects. This acts as the single source of truth for the conversation history.
- `isLoading`: A boolean that controls the visibility of a `<spinner-sk>`. This provides immediate visual feedback to the user during network latency.
- `input`: A string tracked via the `live()` directive. Using `live()` ensures that the input field remains synchronized with the internal state even if the DOM is updated externally or during rapid typing.
**API Interaction**
The component communicates with a backend via a POST request to `/_/chat`. It sends the user's query as a JSON body and expects a JSON response containing the assistant's reply. The implementation includes robust error handling that captures both HTTP error codes (e.g., 500) and network-level failures, surfacing these errors directly in the chat history to keep the user informed.
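The state transitions around a request can be sketched as pure functions over the three pieces of state listed above. This is an illustration only: the function names are hypothetical, trimming the input before the emptiness check is an assumption, and so is recording errors under the `model` role.

```typescript
type Role = 'user' | 'model';

interface ChatMessage {
  role: Role;
  text: string;
}

interface ChatState {
  messages: ChatMessage[];
  isLoading: boolean;
  input: string;
}

// State after the user presses send; empty input is ignored.
function submitInput(state: ChatState): ChatState {
  const text = state.input.trim(); // trimming is an assumption of this sketch
  if (text === '') return state;
  return {
    messages: [...state.messages, { role: 'user', text }],
    isLoading: true,
    input: '',
  };
}

// Appends the assistant's reply and clears the loading flag.
function receiveReply(state: ChatState, reply: string): ChatState {
  return {
    ...state,
    messages: [...state.messages, { role: 'model', text: reply }],
    isLoading: false,
  };
}

// Surfaces a failure in the history instead of dropping it silently.
function receiveError(state: ChatState, reason: string): ChatState {
  return receiveReply(state, `Error: ${reason}`);
}
```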
### Key Components
**GeminiSidePanelSk (gemini-side-panel-sk.ts)**
This is the core logic and UI controller. It encapsulates the styling, the chat history log, and the input footer. It exposes a public `toggle()` method and an `open` property/attribute, allowing parent components or global scripts to programmatically control its visibility.
**Chat History Log**
The history is rendered as a list of message bubbles. The implementation distinguishes between `user` and `model` roles using CSS classes to align messages to the right or left, respectively. It uses `aria-live="polite"` on the history container to ensure that screen readers announce new incoming messages from the AI assistant.
**Input Handling**
The footer contains a text input and a send button. To optimize user experience, the component listens for the `Enter` key on the input field, allowing for a standard messaging flow. The input is automatically cleared and focused upon a successful message submission.
### Chat Workflow
The following diagram illustrates the data flow when a user sends a message:
```text
[ User Input ] --> [ Update 'messages' (User) ] --> [ Set 'isLoading' = true ]
       |                                                          |
       |                                                          V
       |                                                  [ POST /_/chat ]
       |                                                          |
       V                                                          V
[ Clear Input ] <--- [ Update 'messages' (Model) ] <--- [ Receive JSON Response ]
                                   |
                                   V
                      [ Set 'isLoading' = false ]
```
### Testing Strategy
The module includes two layers of testing:
- **Unit Tests (`gemini-side-panel-sk_test.ts`):** These tests use `fetch-mock` to simulate backend responses. They verify the internal logic, such as ensuring empty messages aren't sent, verifying that the input clears after sending, and checking that error messages are correctly appended to the history.
- **End-to-End Tests (`gemini-side-panel-sk_puppeteer_test.ts`):** These tests focus on the visual and behavioral aspects, such as confirming the CSS transitions move the panel the correct number of pixels and verifying that the Shadow DOM elements (input, icons) are accessible and interactive.
# Module: /modules/graph-title-sk
# graph-title-sk
The `graph-title-sk` module provides a specialized header component designed for performance graphs. It dynamically translates a set of metadata (key-value pairs) into a structured, readable title, handling the complexity of displaying many parameters without cluttering the UI.
## Design Goals
The primary purpose of this component is to provide context for a graph. Since performance data often involves many dimensions (e.g., bot name, benchmark, test, subtest, configuration), a simple string is insufficient. The component is designed to:
- **Handle Variable Specificity:** It can display a single trace's detailed metadata or a generic "Multi-trace" summary if multiple traces are being viewed simultaneously.
- **Manage Information Density:** To prevent the header from pushing the graph off-screen, it enforces a limit on the number of visible parameters, offering a "Show Full Title" option when the metadata is extensive.
- **Prioritize Readability:** By splitting keys (parameters) and values into two distinct rows within a flexible grid, it remains legible even when values are long or numerous.
## Key Components
### graph-title-sk.ts
This is the core custom element, implemented using Lit.
- **Data Input:** The element is updated via the `set(titleEntries: Map<string, string> | null, numTraces: number)` method. This approach allows the parent component to push data updates efficiently.
- **Logic and Filtering:**
- **Empty Suppression:** Any entry with an empty string for either the key or the value is automatically ignored to keep the title clean.
- **Truncation/Tooltips:** While the CSS handles visual layout, the HTML includes a `title` attribute on values, allowing users to hover over truncated text to see the full value.
- **Expansion Logic:** It maintains an internal state (`showShortTitle`) to toggle between a collapsed view (limited by `MAX_PARAMS`, currently 8) and a full view.
- **Multi-trace Mode:** If `numTraces > 0` but the `titleEntries` map is empty, it renders a generic `<h1>` header indicating the number of traces.
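The filtering and truncation rules can be sketched as a single function. `visibleTitleEntries` is a hypothetical name; `MAX_PARAMS` and `showShortTitle` mirror the identifiers mentioned above.

```typescript
const MAX_PARAMS = 8;

// Returns the key/value pairs that should be rendered as title columns.
function visibleTitleEntries(
  entries: Map<string, string> | null,
  showShortTitle: boolean
): [string, string][] {
  if (entries === null) return [];
  const filtered: [string, string][] = [];
  entries.forEach((value, key) => {
    // Entries with an empty key or value are suppressed.
    if (key !== '' && value !== '') filtered.push([key, value]);
  });
  // The collapsed view keeps only the first MAX_PARAMS entries.
  return showShortTitle ? filtered.slice(0, MAX_PARAMS) : filtered;
}
```

A "Show Full Title" control is only needed when the filtered list is longer than `MAX_PARAMS`.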
### graph-title-sk.scss
The styling uses a flexbox-based grid system.
- **Responsive Wrapping:** The `#container` uses `flex-wrap: wrap`, ensuring that if the title is too long for the horizontal space, it flows naturally into subsequent rows.
- **Columnar Layout:** Each metadata pair is treated as a discrete column, with the parameter name (`.param`) styled smaller and lighter above the bolded value (`.hover-to-show-text`).
## Data Flow and Workflows
### Setting Title Content
The workflow for updating the title typically involves a parent graph-container or dashboard page:
```
[ Parent Component ]
|
| 1. Gathers metadata (e.g., from a trace ID or API)
| 2. Calls .set(map, count)
V
[ graph-title-sk ]
|
| 3. Checks numTraces (if 0, hide container)
| 4. Filters empty entries
| 5. Truncates list if > MAX_PARAMS
V
[ Rendered HTML ]
```
### Expanding Long Titles
When the metadata exceeds the limit, the component provides an interactive expansion:
```
[ User Clicks "Show Full Title" ]
|
V
[ showFullTitle() ] sets showShortTitle = false
|
V
[ render() ] re-runs getTitleHtml() without the MAX_PARAMS limit
|
V
[ UI Updates ] All columns are revealed; button disappears
```
## Testing Utilities
The module includes a Page Object (`graph-title-sk_po.ts`) to simplify integration and end-to-end testing. This PO abstracts the internal structure (selectors for params, values, and the "show more" button), allowing tests to verify title content without being brittle to changes in the internal DOM structure.
# Module: /modules/json
# /modules/json
This module serves as the central repository for shared TypeScript type definitions and interfaces used across the Perf application. It acts as the "Source of Truth" for the data structures exchanged between the Go backend and the TypeScript frontend.
## Overview
The primary goal of this module is to ensure type safety and consistency across the network boundary. Instead of manually maintaining duplicate type definitions in both Go and TypeScript, this module contains an automatically generated `index.ts` file. This file reflects the structures defined in the backend, providing a robust contract for API requests, responses, and internal data processing.
The module also implements **Nominal Typing** for primitive types to prevent logical errors (e.g., accidentally using a `TimestampSeconds` where a `CommitNumber` is expected), even though both are represented as numbers at runtime.
## Key Components
### Core Data Structures
The module defines the fundamental entities of the Perf system:
- **Data Representation**: `DataFrame`, `TraceSet`, and `Trace` represent the time-series data fetched for visualization. A `DataFrame` contains the actual values, the headers (commits/timestamps), and the paramset describing the metadata.
- **Anomalies and Regressions**: Interfaces like `Anomaly` and `Regression` define the shape of detected performance changes, including statistical metadata (median before/after, p-value) and triage status.
- **Alerting**: The `Alert` interface defines the configuration for regression detection, including the query to monitor and the algorithm parameters used.
- **Backend Communication**: `FrameRequest` and `FrameResponse` encapsulate the complex parameters needed to query the performance database and the resulting data structure used to render plots.
### Nominal Typing Pattern
To improve type safety, the module uses a "branding" pattern for common primitives. This forces developers to explicitly cast or use constructor functions when assigning values to these types, ensuring that the developer has consciously verified the data source.
```text
Value (number) -> Constructor Function -> Branded Type (CommitNumber)
                                                |
                                                +--> Compile-time type error if passed to
                                                     a function expecting TimestampSeconds
```
Key branded types include:
- `CommitNumber`: Represents an offset in the commit history.
- `TimestampSeconds`: Represents a Unix timestamp.
- `Params` and `ParamSet`: Specific dictionary shapes for metadata.
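The branding pattern can be sketched as follows. This is a minimal illustration, not the generated code: the `Brand` helper, the `__brand` marker name, and `toISODate` are assumptions made for the example, and the generated `index.ts` may encode its brands differently.

```typescript
// Hypothetical brand helper: intersects a primitive with a phantom marker.
type Brand<T, Name extends string> = T & { readonly __brand: Name };

type CommitNumber = Brand<number, 'CommitNumber'>;
type TimestampSeconds = Brand<number, 'TimestampSeconds'>;

// Constructor functions share the type's name and perform the explicit cast.
const CommitNumber = (v: number): CommitNumber => v as CommitNumber;
const TimestampSeconds = (v: number): TimestampSeconds => v as TimestampSeconds;

// A function that only makes sense for timestamps (illustrative only).
function toISODate(ts: TimestampSeconds): string {
  return new Date(ts * 1000).toISOString();
}

const commit = CommitNumber(95940);
const when = TimestampSeconds(1700000000);

console.log(toISODate(when), Number(commit));
// toISODate(commit);      // rejected by the compiler: wrong brand
// toISODate(1700000000);  // rejected by the compiler: plain number has no brand
```

At runtime both values are ordinary numbers; the brand exists only in the type system, so the pattern adds no overhead.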
### Namespaced Definitions
Certain domains are grouped into namespaces to reflect their specific context within the application:
- `pivot`: Definitions related to the "Pivot Table" functionality, including operations like `sum`, `avg`, and `count`.
- `progress`: Interfaces for long-running backend tasks that provide status updates (e.g., `Running`, `Finished`).
- `ingest`: Data formats for the file ingestion pipeline, defining how measurement results are structured before being stored.
## Design Decisions
### Automatic Generation
The `index.ts` file is marked with `DO NOT EDIT`. This choice ensures that the frontend types are never out of sync with the backend. Changes to the data contract must be initiated in the Go code and propagated here via the generation tool (e.g., `go2ts`).
### Use of Interfaces vs. Types
Interfaces are used for complex objects (like `Alert` or `Anomaly`) to allow for potential extension and to provide clearer error messages in IDEs. Type aliases are reserved for unions (like `Status` or `ClusterAlgo`) and the aforementioned branded nominal types.
### Nullability and Optional Fields
The interfaces strictly define which fields are optional (`?`) and which can be `null`. This forces frontend components to handle missing data explicitly, reducing runtime `TypeError` exceptions when processing API responses.
# Module: /modules/json-source-sk
The `json-source-sk` module provides a specialized UI component for the Perf application that allows developers and analysts to inspect the raw JSON data associated with a specific data point in a performance trace. It acts as a bridge between high-level trace visualizations and the underlying ingested source files.
### Overview
The primary responsibility of this module is to fetch and display the original JSON metadata and results for a given trace at a specific commit. Because performance traces can be backed by large amounts of data, the module provides options to view either the full ingested file or a "short" version (typically excluding voluminous results) to improve load times and readability.
The component remains hidden by default and only reveals its controls when a valid `traceid` and `cid` (Commit ID) are provided, ensuring it only occupies screen space when actionable data is available.
### Key Components
#### JSONSourceSk (`json-source-sk.ts`)
The core custom element. It manages the state of the retrieved JSON, the visibility of the modal dialog, and the communication with the backend.
- **State Management:** It tracks `_cid` and `_traceid`. When either property is updated via setters, the internal JSON cache is cleared, and the component re-renders. This ensures that the user never sees stale data from a previous trace point.
- **Data Fetching Logic:** The `_loadSourceImpl` method encapsulates the logic for interacting with the `/_/details/` endpoint. It uses a `POST` request containing the commit and trace identifiers. It also handles the `results=false` query parameter when the "Short" view is requested.
- **User Feedback:** It integrates a `spinner-sk` to indicate background loading activity and uses the `errorMessage` utility to bubble up fetch failures to the application's global error reporting system.
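As a rough sketch of the request described above, the helper below assembles the URL and `POST` options for the `/_/details/` endpoint. The function name `buildDetailsRequest`, the `DetailsRequest` shape, and the exact body field names (`cid`, `traceid`) are assumptions for illustration; the component's actual `_loadSourceImpl` may differ.

```typescript
// Hypothetical description of the request; not the component's real types.
interface DetailsRequest {
  url: string;
  init: { method: 'POST'; headers: Record<string, string>; body: string };
}

function buildDetailsRequest(cid: number, traceid: string, short: boolean): DetailsRequest {
  // The "Short" view asks the server to leave out the results section.
  const url = short ? '/_/details/?results=false' : '/_/details/';
  return {
    url,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      // Assumed payload shape: commit and trace identifiers only.
      body: JSON.stringify({ cid, traceid }),
    },
  };
}

const shortRequest = buildDetailsRequest(95940, ',arch=x86,config=8888,', true);
console.log(shortRequest.url);
```

In a browser, the result could be passed to `fetch(request.url, request.init)`, with the spinner shown before the call and hidden once the response has been handled.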
#### User Interface and Interaction
The component uses a `<dialog>` element for displaying the JSON content. This choice allows the JSON to be viewed in an overlay, preserving the user's context in the main performance graph or table.
- **View Json File:** Triggers a full data fetch and opens the modal.
- **View Short Json File:** Triggers a fetch with the `results=false` flag, useful for inspecting metadata without the overhead of every individual measurement.
- **Modal Dialog:** Contains a `<pre>` block for formatted JSON display and a sticky close button for easy navigation.
### Workflow: Retrieving Source Data
The following diagram illustrates the lifecycle of a data request within the component:
```text
User Interaction          JSONSourceSk Component            Backend Server
      |                              |                             |
      |-- Click "View Json" -------->|                             |
      |                              |-- Show Spinner              |
      |                              |                             |
      |                              |-- POST /_/details/ -------->|
      |                              |   {cid, traceid}            |
      |                              |                             |
      |                              |<--------- JSON Response ----|
      |                              |                             |
      |                              |-- Hide Spinner              |
      |                              |-- Format JSON string        |
      |<--- Open Modal Dialog -------|                             |
      |     with <pre> content       |                             |
```
### Design Decisions
- **Validation:** The visibility of the "View" buttons is tied to `validKey(traceid)`. This prevents the component from attempting to fetch data using malformed or incomplete trace identifiers.
- **Formatting:** Data is processed via `JSON.stringify(json, null, ' ')` before display. This ensures that regardless of the wire format (which is often minified), the user sees a human-readable, indented structure.
- **Cleanup:** When the dialog is closed via `closeJsonDialog`, the internal `_json` string is cleared. This is a memory management choice to avoid keeping potentially large strings in the DOM when they are not actively being viewed.
- **CSS Scoping:** The styles include specific overrides for `spinner-sk` dimensions and use a flexbox layout for controls to maintain a compact footprint within the Perf UI toolbars.
### Testing and Page Objects
The module includes a Page Object (`JsonSourceSkPO`) located in `json-source-sk_po.ts`. This encapsulates the internal DOM structure (selectors for buttons, the dialog, and the pre-formatted text), allowing Puppeteer and Karma tests to interact with the component without being brittle to internal HTML changes. This is particularly important for testing the modal's visibility and the content of the `fetch` results.
# Module: /modules/keyboard-shortcuts-help-sk
### Keyboard Shortcuts Help Dialog
The `keyboard-shortcuts-help-sk` module provides a standardized UI component for displaying available keyboard shortcuts to the user. It functions as a discovery mechanism, ensuring that keyboard-driven workflows are accessible and documented within the application interface itself.
#### Design Philosophy: Centralized Registry Discovery
The core design decision behind this module is to decouple the _definition_ of shortcuts from their _presentation_. Instead of hard-coding a list of keys into a help dialog, this component acts as a consumer of the `ShortcutRegistry` (from `perf/modules/common:keyboard-shortcuts_ts_lib`).
This approach ensures that:
1. **Truth is Centralized:** Shortcuts are defined alongside the logic that handles them, but are automatically reflected in the help UI without manual updates to the dialog.
2. **Context Sensitivity:** The dialog can filter displayed shortcuts based on a provided `KeyboardShortcutHandler`. If a shortcut is associated with a specific method that is not present on the current handler, it is hidden from the user, preventing confusion about unavailable actions.
#### Key Component: KeyboardShortcutsHelpSk
The `KeyboardShortcutsHelpSk` class is a Lit-based custom element that wraps a Material Design dialog (`md-dialog`). Its primary responsibilities include:
- **Dynamic Rendering:** Upon opening, it queries the `ShortcutRegistry` to retrieve all registered shortcuts, grouped by category.
- **Handler-Based Filtering:** It accepts a `handler` property. When rendering, it iterates through registered shortcuts and checks if the `handler` actually implements the method associated with that shortcut. This ensures the help menu is relevant to the user's current context (e.g., different shortcuts for a graph view versus a table view).
- **Visual Organization:** It formats shortcuts into a readable table, using CSS to highlight keys (using monospace fonts) and categorize them under bold headers for quick scanning.
#### Internal Workflow
The following diagram illustrates how the component retrieves and filters data for display:
```
[ ShortcutRegistry ] <------- (1) Request Shortcuts
|
v
[ KeyboardShortcutsHelpSk ] <--- (2) Check 'handler' property
|
|--- (3) For each Shortcut:
| IF (shortcut.method exists AND handler lacks method)
| THEN: Skip
| ELSE: Add to Render List
|
v
[ md-dialog Content ] <------- (4) Render Table Rows
```
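The filtering step in the diagram can be approximated with a small pure function. The `Shortcut` and `ShortcutHandler` shapes are hypothetical stand-ins for the registry's real types, and the sketch assumes that no filtering happens when no handler is supplied.

```typescript
// Hypothetical shapes; the real ShortcutRegistry types may differ.
interface Shortcut {
  key: string;
  description: string;
  category: string;
  method?: string; // name of the handler method that implements the action
}

type ShortcutHandler = Record<string, unknown>;

function visibleShortcuts(all: Shortcut[], handler?: ShortcutHandler): Map<string, Shortcut[]> {
  const byCategory = new Map<string, Shortcut[]>();
  for (const s of all) {
    // Skip shortcuts bound to a method that the handler does not implement.
    if (s.method && handler && typeof handler[s.method] !== 'function') continue;
    const list = byCategory.get(s.category) ?? [];
    list.push(s);
    byCategory.set(s.category, list);
  }
  return byCategory;
}

const registered: Shortcut[] = [
  { key: '+', description: 'Zoom in', category: 'Graph', method: 'zoomIn' },
  { key: 'p', description: 'Pan', category: 'Graph', method: 'pan' },
  { key: '?', description: 'Show help', category: 'General' },
];
const graphHandler: ShortcutHandler = { zoomIn: () => {} };
const shown = visibleShortcuts(registered, graphHandler);
console.log(Array.from(shown.keys()));
```

Shortcuts without an associated method are always kept, which matches the `ELSE: Add to Render List` branch above.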
#### Key Files
- `keyboard-shortcuts-help-sk.ts`: Contains the logic for the Lit element, including the filtering logic and the `open()`/`close()` API for controlling the dialog programmatically.
- `keyboard-shortcuts-help-sk.scss`: Defines the layout for the shortcut table, ensuring consistent spacing and visual cues for keys and categories using the application's theme variables.
- `keyboard-shortcuts-help-sk_test.ts`: Validates that the component correctly pulls data from the `ShortcutRegistry` and renders the expected HTML structure.
# Module: /modules/new-bug-dialog-sk
# new-bug-dialog-sk
The `new-bug-dialog-sk` module provides a specialized modal dialog for the Perf triage workflow. It allows users to file Buganizer issues for one or more detected anomalies (performance regressions or improvements) directly from the Perf UI.
## Overview
When a sheriff or developer identifies an untriaged anomaly in a performance chart, they need a streamlined way to report it. This module automates the boilerplate of bug creation by pre-populating fields based on the selected anomaly's metadata, such as the test path, the magnitude of the change, and the affected revision range.
## Design Decisions
### Automated Metadata Extraction
The dialog is designed to minimize manual data entry. It implements logic to parse `Anomaly` objects and automatically generate:
- **Bug Titles:** A formatted string indicating the percentage change, the type (regression/improvement), the test suite, and the revision range (e.g., "33.6% regression in v8/async-fs at 95940:95944").
- **Component Selection:** It extracts bug components associated with the anomalies and presents them as radio buttons, ensuring the bug is filed in the correct tracker.
- **Label Management:** It aggregates unique labels from all selected anomalies, allowing the user to toggle them before submission.
### Support for Multiple Anomalies
The dialog supports filing a single bug for a collection of anomalies. This is common when a single underlying commit causes regressions across multiple related metrics. The implementation handles this by:
1. Calculating the aggregate revision range (minimum start to maximum end).
2. Determining the range of percentage changes (e.g., "10% to 20% regression").
3. Collecting all unique labels and components from the entire set.
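The three aggregation steps can be sketched as below. The `AnomalyLike` field names and the percent-change formula (absolute relative change of the median) are assumptions for illustration and do not necessarily match the `Anomaly` interface or `getPercentChangeForAnomaly()`.

```typescript
// Hypothetical subset of anomaly fields, named for this sketch only.
interface AnomalyLike {
  startRevision: number;
  endRevision: number;
  medianBefore: number;
  medianAfter: number;
  labels: string[];
}

interface BugSummary {
  startRevision: number;
  endRevision: number;
  minPercent: number;
  maxPercent: number;
  labels: string[];
}

// Assumed definition: absolute relative change of the median, in percent.
// A zero medianBefore would need separate handling and is not covered here.
function percentChange(a: AnomalyLike): number {
  return Math.abs((100 * (a.medianAfter - a.medianBefore)) / a.medianBefore);
}

function summarize(anomalies: AnomalyLike[]): BugSummary {
  if (anomalies.length === 0) throw new Error('At least one anomaly is required.');
  const percents = anomalies.map(percentChange);
  const allLabels = anomalies.reduce<string[]>((acc, a) => acc.concat(a.labels), []);
  return {
    // 1. Aggregate revision range: minimum start to maximum end.
    startRevision: Math.min(...anomalies.map((a) => a.startRevision)),
    endRevision: Math.max(...anomalies.map((a) => a.endRevision)),
    // 2. Range of percentage changes.
    minPercent: Math.min(...percents),
    maxPercent: Math.max(...percents),
    // 3. Unique labels across the whole set, in first-seen order.
    labels: Array.from(new Set(allLabels)),
  };
}

const summary = summarize([
  { startRevision: 95940, endRevision: 95944, medianBefore: 100, medianAfter: 110, labels: ['Perf', 'A'] },
  { startRevision: 95938, endRevision: 95942, medianBefore: 50, medianAfter: 60, labels: ['Perf'] },
]);
console.log(summary);
```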
### User Experience and Feedback
- **Draggable Interface:** The dialog implements custom mouse event listeners (`onMousedown`, `onMouseMove`, `onMouseUp`) to allow users to move the dialog. This is helpful if the user needs to see the underlying chart data while filling out the bug report.
- **Loading State:** A secondary `<dialog>` (`#loading-popup`) is used to provide visual feedback during the asynchronous fetch request to the backend.
- **Auto-CC:** Upon opening, the component calls `LoggedIn()` to identify the current user and automatically adds them to the CC list.
## Key Components and Implementation Details
### new-bug-dialog-sk.ts
This is the primary logic hub. It manages the internal state of the form and interacts with the Perf backend.
- **Triage Logic:** The methods `getBugTitle()`, `getPercentChangeForAnomaly()`, and `getSuiteNameForAlert()` contain the business logic for translating raw anomaly data into human-readable bug reports, mimicking legacy Chromeperf behavior.
- **Submission Workflow:** The `fileNewBug()` method gathers data from the form (including dynamically generated checkboxes and radios), sends a POST request to `/_/triage/file_bug`, and processes the response.
- **Post-Submission:** On success, it opens the newly created bug in a new browser tab and dispatches an `anomaly-changed` event. This event notifies other components (like charts or lists) that the anomaly's `bug_id` has been updated and they should re-render to reflect the triaged status.
### Workflow: Filing a Bug
```
[ User Clicks 'File Bug' ]
|
v
[ open() called: Fetch login status, show modal ]
|
v
[ UI populates Title, Labels, Components from Anomalies ]
|
[ User adjusts form & clicks 'Submit' ]
|
v
[ fileNewBug() ]----------------------> [ Server: /_/triage/file_bug ]
| |
[ Show Loading Popup ] [ Create Buganizer Issue ]
| |
[ Receive Bug ID ] <------------------------------'
|
v
[ 1. Update local Anomaly objects with Bug ID ]
[ 2. Dispatch 'anomaly-changed' event ]
[ 3. Open https://issues.chromium.org/issues/{ID} ]
[ 4. Close Dialogs ]
```
### new-bug-dialog-sk.scss
The styling ensures the dialog fits the Perf theme. It uses a flexible layout for the textarea and ensures the `closeIcon` is pinned to the top-right for easy dismissal.
### new-bug-dialog-sk_po.ts
Provides a Page Object for automated testing. This encapsulates the selectors for the title, description, assignee, and CC inputs, allowing Puppeteer tests to interact with the dialog without being brittle to internal DOM changes.
# Module: /modules/paramtools
# Paramtools
The `paramtools` module provides a suite of utility functions for manipulating and transforming "Structured Keys," `Params`, and `ParamSets`. It acts as a client-side mirror of the Go implementation found in `/infra/go/paramtools`, enabling the frontend to handle Perf trace identifiers and query parameters consistently with the backend.
## Design Philosophy
The module is designed around the concept of a **Structured Key**: a string representation of key-value pairs used to uniquely identify data traces (e.g., `,arch=x86,config=8888,os=linux,`).
The implementation prioritizes:
- **Canonical Representation**: Keys are always generated with sorted keys to ensure that the same set of parameters always results in the identical string identifier.
- **Performance vs. Validation**: Since the server-side Go implementation performs rigorous validation, this TypeScript module focuses on efficient transformation and parsing, assuming the data is largely well-formed.
- **Interoperability**: It facilitates easy conversion between internal data structures (`Params`) and external formats like URL query strings.
## Key Data Structures
- **Params**: A simple mapping of strings to strings (e.g., `{ "os": "linux" }`). Represents a specific point or trace.
- **ParamSet**: A mapping of strings to arrays of strings (e.g., `{ "os": ["linux", "windows"] }`). Represents a collection of possible values for various keys.
## Component Responsibilities
### Key Manipulation
The module provides logic to move between string identifiers and structured objects:
- **`makeKey`**: Converts a `Params` object into a canonical structured key. It sorts the keys alphabetically and wraps the result in leading and trailing commas to ensure unambiguous matching.
- **`fromKey`**: Parses a structured key back into a `Params` object. It includes logic to strip away "Special Functions" (like `norm()`) that might wrap a key during calculation phases.
- **`validKey`**: A simple validator that checks for the standard `,key=value,` format, primarily used to distinguish between raw trace IDs and calculated traces.
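To make the behavior of these three functions concrete, here is a simplified sketch written from the descriptions above. It is not the module's source: the regular expressions and error handling are assumptions, and the real implementations may accept or reject slightly different inputs.

```typescript
type Params = { [key: string]: string };

// Builds a canonical structured key: sorted keys, wrapped in commas.
function makeKey(params: Params): string {
  const keys = Object.keys(params).sort();
  if (keys.length === 0) throw new Error('Params must have at least one entry.');
  return `,${keys.map((k) => `${k}=${params[k]}`).join(',')},`;
}

// Parses a structured key, first stripping a wrapping function such as norm(...).
function fromKey(structuredKey: string): Params {
  const wrapped = /^[A-Za-z_]\w*\((.*)\)$/.exec(structuredKey);
  const inner = wrapped?.[1] ?? structuredKey;
  const params: Params = {};
  for (const pair of inner.split(',')) {
    const eq = pair.indexOf('=');
    if (eq <= 0) continue; // skips the empty leading/trailing segments
    params[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return params;
}

// Accepts only the plain ",key=value," form, so calculated traces are rejected.
function validKey(key: string): boolean {
  return /^,([^,=()\s]+=[^,=()]+,)+$/.test(key);
}

const key = makeKey({ os: 'linux', config: '8888', arch: 'x86' });
console.log(key, fromKey(`norm(${key})`), validKey(key), validKey(`norm(${key})`));
```

Because `makeKey` sorts its keys, `makeKey(fromKey(k))` returns `k` unchanged for any key that is already canonical.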
### ParamSet Aggregation
Functions in this category handle the merging and expansion of parameter collections:
- **`addParamsToParamSet`**: Merges a single `Params` instance into an existing `ParamSet`. This is useful when building a global index of available dimensions from a list of specific traces.
- **`addParamSet`**: Merges two `ParamSet` objects, ensuring that values remain unique within each key.
- **`paramsToParamSet`**: A convenience function to lift a single `Params` object into the `ParamSet` type.
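A minimal sketch of the aggregation helpers follows. The signatures (mutating the first argument in place) are an assumption made for this example; only the merging behavior described above is taken from the text.

```typescript
type Params = { [key: string]: string };
type ParamSet = { [key: string]: string[] };

// Adds each key/value of `p` to `ps`, keeping values unique per key.
function addParamsToParamSet(ps: ParamSet, p: Params): void {
  for (const [k, v] of Object.entries(p)) {
    const values = ps[k] ?? (ps[k] = []);
    if (!values.includes(v)) values.push(v);
  }
}

// Merges every value of `src` into `dst`, again without duplicates.
function addParamSet(dst: ParamSet, src: ParamSet): void {
  for (const [k, values] of Object.entries(src)) {
    const existing = dst[k] ?? (dst[k] = []);
    for (const v of values) {
      if (!existing.includes(v)) existing.push(v);
    }
  }
}

// Lifts a single Params into a ParamSet with one value per key.
function paramsToParamSet(p: Params): ParamSet {
  const ps: ParamSet = {};
  addParamsToParamSet(ps, p);
  return ps;
}

const index: ParamSet = {};
addParamsToParamSet(index, { config: '565', os: 'linux' });
addParamsToParamSet(index, { config: '888', os: 'linux' });
addParamSet(index, paramsToParamSet({ os: 'android' }));
console.log(index);
```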
### Integration Utilities
- **`queryFromKey`**: Converts a structured key directly into a URL-encoded query string (e.g., `a=1&b=2`). This is essential for synchronization between the application state (trace keys) and the browser's URL for deep-linking.
## Workflows
### Trace ID to Query String
This workflow illustrates how a trace identifier from the backend is prepared for use in a frontend search query.
```
Structured Key: ",arch=arm,os=android,"
|
v
[ fromKey() ] --> Params: { arch: "arm", os: "android" }
|
v
[ queryFromKey() ] -> String: "arch=arm&os=android"
```
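The conversion above can be sketched with the standard `URLSearchParams` class. This simplified version is an illustration written from the description; it parses the key inline and does not strip wrapping functions such as `norm()`.

```typescript
// Converts ",k1=v1,k2=v2," into "k1=v1&k2=v2", URL-encoding as needed.
function queryFromKey(structuredKey: string): string {
  const query = new URLSearchParams();
  for (const pair of structuredKey.split(',')) {
    const eq = pair.indexOf('=');
    if (eq <= 0) continue; // ignore the empty segments around the outer commas
    query.set(pair.slice(0, eq), pair.slice(eq + 1));
  }
  return query.toString();
}

console.log(queryFromKey(',arch=arm,os=android,'));
```

The resulting string can be appended to a page URL so that the selected trace survives a reload or can be shared as a deep link.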
### Building a Filter UI
This workflow shows how individual trace IDs are aggregated to populate a user interface with all available filtering options.
```
Trace A: ",config=565,"    Trace B: ",config=888,"
          |                          |
          +------------+-------------+
                       |
                       v
           [ addParamsToParamSet() ]
                       |
                       v
      ParamSet: { config: ["565", "888"] }
                       |
                       v
        (Used to render dropdown menus)
```
# Module: /modules/perf-scaffold-sk
# perf-scaffold-sk
The `perf-scaffold-sk` module provides the foundational layout and shell for all Skia Performance Monitoring (Perf) web pages. It serves as a master template, providing consistent navigation, branding, error handling, and a unified look-and-feel across the application.
## High-Level Overview
This module defines the `PerfScaffoldSk` custom element, which acts as a wrapper for every page in the Perf application. Its primary responsibilities include:
- **Branding and Navigation:** Hosting the instance logo, title, and primary navigation links (Explore, Alerts, Triage, etc.).
- **Contextual Shell:** Providing a consistent sidebar or header (depending on the UI version) for global actions.
- **Infrastructure Integration:** Embedding essential utility components like `alogin-sk` (authentication), `theme-chooser-sk` (dark/light mode), and `error-toast-sk` (global error notifications).
- **Global Configuration:** Responding to settings defined in the global `window.perf` object to customize the UI per instance.
## Design Decisions and UI Versions
The scaffold currently supports two distinct UI layouts: **Legacy UI** and **V2 UI**. The implementation allows for a phased transition between styles, controlled by both global configuration and user preference.
### UI Selection Logic
The choice of layout is determined at render time based on the following hierarchy:
1. **User Preference:** A value stored in `localStorage` under the key `v2_ui`.
2. **Global Default:** The `window.perf.enable_v2_ui` boolean provided by the server.
Users can manually toggle between these versions via a "Try V2 UI" button in the Legacy sidebar or a "Back to Legacy UI" button in the V2 header. This toggle action updates `localStorage` and triggers a page reload to re-initialize the scaffold.
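The precedence between the two settings can be expressed as a small pure function. The function name and the assumption that the stored preference is the string `'true'` or `'false'` are illustrative; the scaffold may encode the value differently.

```typescript
// Resolves which layout to render. `stored` is the raw localStorage value
// for the `v2_ui` key (null when the user has never toggled the UI).
function shouldUseV2UI(stored: string | null, serverDefault: boolean): boolean {
  // 1. An explicit user preference always wins.
  if (stored !== null) return stored === 'true';
  // 2. Otherwise fall back to the server-provided default.
  return serverDefault;
}

console.log(shouldUseV2UI(null, true), shouldUseV2UI('false', true));
```

In the browser, the inputs would come from `localStorage.getItem('v2_ui')` and `window.perf.enable_v2_ui`. Keeping the decision in a pure function makes it testable without a DOM.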
### Component Structure
```
+-----------------------------------------------------------+
| perf-scaffold-sk |
| +-------------------------------------------------------+ |
| | app-sk (Legacy or V2) | |
| | +---------------------------------------------------+ | |
| | | header (Top Bar) | | |
| | | - Logo & Title | | |
| | | - Auth & Theme Chooser | | |
| | +---------------------------------------------------+ | |
| | | aside (Sidebar - Legacy) OR nav (Header - V2) | | |
| | | - Links (Explore, Alerts, Triage, etc.) | | |
| | +---------------------------------------------------+ | |
| | | main (perf-content) | | |
| | | - User-provided child content injected here | | |
| | +---------------------------------------------------+ | |
| | | gemini-side-panel-sk (V2 Only) | | |
| | +---------------------------------------------------+ | |
| | | footer | | |
| | | - error-toast-sk | | |
| | | - Build/Version tags | | |
| | +---------------------------------------------------+ | |
| +-------------------------------------------------------+ |
+-----------------------------------------------------------+
```
## Key Components and Responsibilities
### Layout Management (`perf-scaffold-sk.ts`)
The core logic resides in `PerfScaffoldSk`. It manages the lifecycle of the application shell. A key feature is the **content redistribution** process:
- When the component is initialized, it takes all original child elements and moves them into an internal `#perf-content` container within the `<main>` tag.
- It specifically identifies elements with the ID `sidebar_help` and moves them into a specialized help area (a sidebar section in Legacy, or a dropdown menu in V2).
### Styling (`perf-scaffold-sk.scss`)
The styles utilize CSS Grid and Flexbox to create responsive layouts.
- **Legacy UI:** Uses a traditional sidebar-heavy layout (`aside#sidebar`).
- **V2 UI:** Implements a modern top-navigation layout with a sticky header and a scrolling main content area. It also handles the positioning of the Gemini AI side panel.
### Versioning and Build Info
The scaffold displays the current application version in the footer. It formats the version string according to its type:
- **Git Hashes:** Links directly to the source repository.
- **Dev Timestamps:** Formats ISO strings into human-readable UTC dates for local development builds.
- **Build Tags:** Displays tags retrieved via `getBuildTag()` from the `window` module.
### Integration with `window.perf`
The scaffold is highly data-driven, relying on the `window.perf` object for:
- `header_image_url`: Custom instance logos (with a fallback to an "alpine" logo).
- `instance_name` / `instance_url`: Displaying the instance identity.
- `chat_url` / `feedback_url`: Linking to support channels.
- `show_triage_link`: Conditionally hiding the Triage navigation item.
## Key Files
- `perf-scaffold-sk.ts`: The main TypeScript definition for the Lit-based custom element, containing the template logic for both UI versions.
- `perf-scaffold-sk.scss`: Theme-aware styles that define the grid layouts for both Legacy and V2 shells.
- `perf-scaffold-sk-demo.ts` & `perf-scaffold-sk-v2-demo.ts`: Demo entry points that mock the `window.perf` configuration to showcase the scaffold's capabilities in various states.
- `perf-scaffold-sk_puppeteer_test.ts`: Integration tests ensuring that layout transitions, version rendering, and content redistribution work as expected.
# Module: /modules/picker-field-sk
# picker-field-sk
The `picker-field-sk` module provides a specialized multi-selection component designed for choosing values from a pre-defined list. It wraps a Vaadin multi-select combo box with additional logic for bulk selection and data organization, specifically tailored for complex filtering workflows (such as performance test pickers).
## Design Philosophy
The primary goal of `picker-field-sk` is to simplify the management of large sets of options while providing visual cues and high-level controls for common selection patterns.
Rather than being a generic text field, it addresses specific needs of hierarchical or categorized data:
- **Primary vs. Detailed Options**: In many datasets, options without periods in their name represent "primary" or top-level categories. This module automatically identifies these and provides a one-click toggle to select them.
- **Dynamic Controls**: Selection features like "Select All" or "Split" are conditionally displayed based on the component's state (e.g., how many items are selected) and its position in a sequence (the `index` property).
- **Responsive Sizing**: The component dynamically calculates its overlay width based on the longest option string to ensure that labels are readable without unnecessary truncation.
## Key Components and Logic
### Core Selection: Vaadin Multi-Select
The underlying selection mechanism is handled by `@vaadin/multi-select-combo-box`. This provides the chip-based UI for selected items and the searchable dropdown. The module styles this component to integrate with the local theme, including specific "dark mode" transitions.
### Bulk Actions
The component features a set of `checkbox-sk` elements located in a "split-by-container" above the main field. These controls appear based on the following logic:
- **Select All**: Visible if there are more than 2 options and the field is not the primary field (`index > 0`). It allows selecting the entire list or resetting to the first item.
- **Primary**: Visible if the list contains a mix of "primary" items (those without periods) and "detailed" items. It allows users to toggle the selection of top-level categories.
- **Split**: Dispatches a `split-by-changed` event. This is used by parent components to decide if a visualization should be broken down by the attribute represented by this field.
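The visibility rules for the first two checkboxes can be sketched as pure predicates. The function names are hypothetical, and the sketch assumes the "Primary" checkbox requires at least one primary and at least one detailed option.

```typescript
// "Primary" options are those without a period in their name.
function primaryOptions(options: string[]): string[] {
  return options.filter((o) => !o.includes('.'));
}

// "Select All": more than 2 options, and not the first field in the sequence.
function showSelectAll(options: string[], index: number): boolean {
  return options.length > 2 && index > 0;
}

// "Primary": only meaningful when the list mixes primary and detailed items.
function showPrimary(options: string[]): boolean {
  const primaryCount = primaryOptions(options).length;
  return primaryCount > 0 && primaryCount < options.length;
}

const options = ['motion_mark', 'motion_mark.canvas', 'speedometer', 'speedometer.vue'];
console.log(primaryOptions(options), showSelectAll(options, 1), showPrimary(options));
```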
### State Management
The internal state is managed through several private properties that trigger re-renders or logic updates:
- `options`: Setting this property automatically triggers the filtering of `primaryOptions` and recalculates the dropdown width.
- `selectedItems`: Controls which chips are currently visible.
- `index`: Determines whether the bulk action checkboxes should be visible.
## Workflow: Selection and Events
The following diagram illustrates how user interaction flows through the component to notify the rest of the application:
```text
User Interaction picker-field-sk External App
+----------------+ +-------------------+ +------------------+
| Click "All" |------>| Update _selected | | |
| Checkbox | | Items & Render | | |
+----------------+ +---------+---------+ +------------------+
|
v
+----------------+ +-------------------+ +------------------+
| Select Item in |------>| onValueChanged() |------>| Listen for |
| Dropdown | | | | 'value-changed' |
+----------------+ +---------+---------+ +------------------+
|
v
+----------------+ +-------------------+ +------------------+
| Toggle "Split" |------>| splitOnValue() |------>| Listen for |
| Checkbox | | | | 'split-by-changed'|
+----------------+ +-------------------+ +------------------+
```
## Styling and Layout
The component uses a vertical flex layout where the label and selection checkboxes sit atop the combo box.
- **Overlay Width**: Calculated using `ch` (character) units to ensure the dropdown menu scales with the content length.
- **Chip Styling**: Selected items (chips) are styled with `direction: rtl` to handle long strings gracefully within the limited horizontal space of the input field.
- **Theming**: Integrated with `//perf/modules/themes`, utilizing CSS variables for background colors, focus states, and transitions to ensure a consistent look across different UI modes.
## Testing Utilities
The module includes a Page Object (`PickerFieldSkPO`) located in `picker-field-sk_po.ts`. This encapsulates the complexity of interacting with the Shadow DOM of both the `picker-field-sk` and the underlying Vaadin components. It provides high-level methods for:
- Selecting items by text.
- Removing specific chips.
- Checking the state of the "Split" and "Select All" checkboxes.
- Managing the overlay visibility during automated Puppeteer tests.
# Module: /modules/pinpoint-try-job-dialog-sk
# pinpoint-try-job-dialog-sk
The `pinpoint-try-job-dialog-sk` module provides a modal dialog designed to trigger Pinpoint A/B "Try jobs." In the context of the Perf application, its primary purpose is to allow developers to request additional performance traces for specific benchmark runs to debug regressions or verify improvements.
## Design and Purpose
This module is specifically tailored for the "Debug Traces" use case. It acts as a bridge between the Perf UI and the Pinpoint performance analysis system. Rather than being a general-purpose Pinpoint job creator, it focuses on taking existing performance data contexts—such as a specific trace found in a chart—and prepopulating a request to gather more detailed diagnostic information (e.g., Chrome trace categories).
Key design decisions include:
- **Contextual Preloading**: The dialog is designed to be populated via `setTryJobInputParams`, which extracts necessary metadata (bot, benchmark, story) from a "test path" string commonly used in Perf.
- **Constraint-Focused**: While Pinpoint supports many job types, this dialog simplifies the interface to focus on A/B comparisons between a base commit and an experiment (end) commit.
- **Trace Customization**: It provides a default set of tracing arguments (`toplevel`, `toplevel.flow`, etc.) but allows users to override them to gather specific category data.
## Key Components
### PinpointTryJobDialogSk (`pinpoint-try-job-dialog-sk.ts`)
The main class extending `ElementSk`. It manages the internal state of the dialog, including the commit hashes, story names, and the resulting Pinpoint job URL.
- **Authentication**: Upon connection, it uses `alogin-sk` to identify the current user. This is crucial as Pinpoint requires a user email to associate with the created job.
- **State Management**: It tracks `baseCommit`, `endCommit`, and `testPath`. The `testPath` is specifically parsed during the submission process to identify the `configuration` (bot) and `benchmark`.
- **Submission Logic**: The `postTryJob` method handles the transformation of UI fields into a `TryJobCreateRequest`. It maps the user's input into the specific JSON structure expected by the `/_/try/` endpoint.
### Template and Styling
The dialog is rendered using `lit-html` and styled to match the Perf theme. It utilizes standard `HTMLDialogElement` functionality for modal behavior and includes a `spinner-sk` to provide visual feedback during the asynchronous submission process.
## Key Workflows
### Triggering a Job
The typical lifecycle of the dialog involves an external component passing in performance context before the user interacts with the form.
```
External Component         Dialog Component               Pinpoint API
       |                           |                            |
       |-- setTryJobInputParams -->|                            |
       |   (commits, testPath)     |                            |
       |                           |                            |
       |--------- open() --------->|                            |
       |                           | (User modifies args)       |
       |                           |                            |
       |                           |------- POST /_/try/ ------>|
       |                           |                            |
       |                           |<------ { jobUrl } ---------|
       |                           |                            |
       |                           |-- Updates UI with Link     |
```
### Data Mapping
When a user submits the form, the module performs a specific mapping from the human-readable "test path" to the Pinpoint API fields:
1. **Test Path Parsing**: A string like `master/linux-perf/blink_perf.ext/test_case` is split.
- Index 1 becomes the `configuration` (`linux-perf`).
- Index 2 becomes the `benchmark` (`blink_perf.ext`).
2. **Argument Injection**: The `traceArgs` input is wrapped into a JSON string under `--extra-chrome-categories` and passed within `extra_test_args`.
3. **Naming**: The job name is automatically generated to follow a standard pattern: `Tracing Debug on <config>/<benchmark>/<story>`.
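The mapping steps can be sketched as follows. The `TryJobFields` shape, the `name` and `story` field names, and the exact encoding of `extra_test_args` (a JSON array containing the flag and the categories) are assumptions for illustration; the real `TryJobCreateRequest` may be structured differently.

```typescript
// Hypothetical subset of the request fields, for illustration only.
interface TryJobFields {
  name: string;
  configuration: string;
  benchmark: string;
  story: string;
  extra_test_args: string;
}

function mapTestPath(testPath: string, story: string, traceArgs: string): TryJobFields {
  // 1. Test path parsing: index 1 is the bot, index 2 is the benchmark.
  const parts = testPath.split('/');
  if (parts.length < 3) throw new Error(`Unexpected test path: ${testPath}`);
  const configuration = parts[1];
  const benchmark = parts[2];
  return {
    // 3. Naming: follows the pattern described above.
    name: `Tracing Debug on ${configuration}/${benchmark}/${story}`,
    configuration,
    benchmark,
    story,
    // 2. Argument injection: assumed encoding of the tracing categories.
    extra_test_args: JSON.stringify(['--extra-chrome-categories', traceArgs]),
  };
}

const fields = mapTestPath('master/linux-perf/blink_perf.ext/test_case', 'test_case', 'toplevel,toplevel.flow');
console.log(fields);
```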
## Implementation Files
- `pinpoint-try-job-dialog-sk.ts`: Contains the logic for form validation, API interaction, and the Lit template.
- `pinpoint-try-job-dialog-sk.scss`: Defines the layout, specifically ensuring the dialog handles long input strings (like commit hashes and trace arguments) gracefully.
- `pinpoint-try-job-dialog-sk_test.ts`: Validates that the form correctly parses the input parameters and that the fetch request sent to the backend contains the expected payload structure.
# Module: /modules/pivot-query-sk
# pivot-query-sk
The `pivot-query-sk` module provides a specialized UI component for configuring data transformation requests, specifically for pivoting and aggregating performance trace data. It allows users to define how data should be grouped, what primary mathematical operation to perform on those groups, and which additional summary statistics should be calculated.
## Overview
The primary purpose of this component is to build and edit a `pivot.Request` object. This object is used by the Perf backend to reshape time-series data into a tabular or grouped format. The component provides a high-level interface for three specific pivoting dimensions:
1. **Group By**: Selecting which keys (from a provided `ParamSet`) should be used to cluster traces together.
2. **Operation**: The main reduction function (e.g., average, sum, min) applied to the data within each group.
3. **Summary**: Optional additional statistics (e.g., count, max) to be calculated for each group.
## Design Decisions
### Data Consistency and ParamSets
The component requires a `ParamSet` to populate the "Group By" options. A key design choice in `pivot-query-sk.ts` is the handling of the intersection between the current `pivot.Request` and the provided `ParamSet`.
The `allGroupByOptions()` method merges keys from the current request with those in the `ParamSet`. This ensures that if a user loads a saved pivot request containing keys that are not present in the current data's `ParamSet`, the selection is preserved rather than silently dropped. This "additive" approach prevents data loss when switching between different data contexts.
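A minimal sketch of this additive merge, assuming the `ParamSet` is a plain object mapping keys to value lists. The standalone function signature and the alphabetical ordering are assumptions for illustration, not taken from `pivot-query-sk.ts`.

```typescript
// Simplified stand-in for Perf's ParamSet type.
type ParamSet = { [key: string]: string[] };

// Returns the union of the ParamSet keys and the keys already selected in the
// request, so stale selections stay visible instead of being dropped.
function allGroupByOptions(requestGroupBy: string[], paramset: ParamSet): string[] {
  const merged = new Set<string>([...Object.keys(paramset), ...requestGroupBy]);
  // Sorting is an assumption made here to keep the output deterministic.
  return Array.from(merged).sort();
}
```

With a saved request that groups by `config` and `stale_key`, and a `ParamSet` containing only `arch` and `config`, the result still includes `stale_key`.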
### Unique Instance Identification
Because the component uses ARIA attributes (`aria-labelledby`) to maintain accessibility, it implements a `uniqueId` system. Each instance of `pivot-query-sk` on a page increments a static counter. This ensures that internal element IDs (like `group_by-0`, `group_by-1`) remain unique, preventing label collisions when multiple query builders are present in the same view.
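The counter pattern can be sketched as below. The class name `LabelledQueryBuilder` and the `groupById` method are hypothetical; only the `group_by-N` ID format comes from the description above.

```typescript
class LabelledQueryBuilder {
  // Shared by all instances; advanced once per constructed instance.
  private static nextUniqueId = 0;

  private readonly uniqueId: number;

  constructor() {
    this.uniqueId = LabelledQueryBuilder.nextUniqueId++;
  }

  // ID for the "Group By" label, suitable for use with aria-labelledby.
  groupById(): string {
    return `group_by-${this.uniqueId}`;
  }
}
```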
### Validation and State
The component acts as a controlled input. It exposes a `pivotRequest` getter that utilizes `validatePivotRequest` from `../pivotutil`. If the current internal state is invalid, the getter returns `null`. This forces consuming components to handle invalid states gracefully before attempting to dispatch a backend request.
## Key Components
### PivotQuerySk
The main class (`pivot-query-sk.ts`) manages the state and rendering logic. It uses `lit-html` for templating and leverages existing elements like `multi-select-sk` and `select-sk` to handle the heavy lifting of UI interactions.
- **State Management**: It maintains an internal `_pivotRequest` and `_paramset`. Any change to these properties via setters triggers a re-render.
- **Event Emission**: Whenever a user interacts with the UI (changing a selection or the operation), the component emits a `pivot-changed` event containing the updated (and potentially null, if invalid) `pivot.Request`.
### Interactions and Workflow
The following diagram illustrates how data flows from user interaction to a valid request:
```
User Action (Click/Select)
|
v
Internal Event Handler (e.g., groupByChanged)
|
+--> Updates internal _pivotRequest
|
+--> Validation Check (via pivotutil)
|
v
Dispatches "pivot-changed" Event
|
+--> Parent component receives pivot.Request OR null
```
### Supporting Files
- **`pivot-query-sk_po.ts`**: Provides a Page Object (PO) for testing. It abstracts the complexity of interacting with multiple `multi-select-sk` elements, allowing tests to select options by text content rather than implementation details.
- **`pivot-query-sk.scss`**: Handles the layout, ensuring that the selection lists are presented in a flexible, readable grid with scrollable areas for large `ParamSet` keys.
# Module: /modules/pivot-table-sk
### High-Level Overview
`pivot-table-sk` is a custom element designed to display aggregated Performance (Perf) data in a tabular format. While traditional DataFrames in the Perf system are often visualized as time-series plots, this module handles cases where data has been "pivoted" and summarized into discrete values (e.g., averages, sums, or standard deviations).
The element transforms complex trace data—where keys are comma-separated parameter strings—into a human-readable table. It allows users to explore multi-dimensional data by grouping by specific parameters and viewing various statistical summaries side-by-side.
### Design Decisions and Implementation
#### Data Transformation and Mapping
The core challenge the module solves is translating a `DataFrame` (optimized for storage and plotting) into a grid.
- **Key Extraction**: Trace IDs in Perf are long strings (e.g., `,arch=x86,config=8888,`). The module uses `keyValuesFromTraceSet` to parse these strings and extract only the values corresponding to the `group_by` parameters requested by the user.
- **Ordering**: The order of columns in the table is strictly dictated by the `pivot.Request`. The "Key" columns (parameters) appear first, followed by the "Summary" columns (statistical operations).
#### Advanced Sorting Logic
Rather than a simple per-column sort, `pivot-table-sk` implements a **Sort History** mechanism via the `SortHistory` and `SortSelection` classes. This approach mimics spreadsheet behavior:
- **Tie-Breaking**: When a user clicks a column header, that column becomes the primary sort criterion. However, previous sort actions are preserved in a stack. If two rows have identical values in the primary column, the logic falls back to the second most recent sort column to break the tie, and so on.
- **State Persistence**: The entire sort state (which columns, which direction, and in what priority) can be encoded into a compact string (e.g., `dk2-us1`). This allows the sorting state to be reflected in URL parameters, making table views shareable and bookmarkable.
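One plausible reading of the example string `dk2-us1` is a dash-separated list of tokens, each made of a direction letter (`u` or `d`), a column-kind letter (`k` for key values, `s` for summary values), and a column index. The sketch below encodes and decodes under that assumption; the token grammar and all names here are inferred from the example, not confirmed by the source.

```typescript
type Direction = 'up' | 'down';
type Kind = 'keyValues' | 'summaryValues';

interface SortSelection {
  column: number;
  kind: Kind;
  dir: Direction;
}

// Assumed token layout: <u|d><k|s><column index>.
function encodeSelection(s: SortSelection): string {
  return `${s.dir === 'up' ? 'u' : 'd'}${s.kind === 'keyValues' ? 'k' : 's'}${s.column}`;
}

// Most recent selection first, joined with '-'.
function encodeHistory(history: SortSelection[]): string {
  return history.map(encodeSelection).join('-');
}

function decodeHistory(encoded: string): SortSelection[] {
  if (encoded === '') {
    return [];
  }
  return encoded.split('-').map((token): SortSelection => {
    const match = /^([ud])([ks])(\d+)$/.exec(token);
    if (match === null) {
      throw new Error(`Malformed sort token: ${token}`);
    }
    return {
      dir: match[1] === 'u' ? 'up' : 'down',
      kind: match[2] === 'k' ? 'keyValues' : 'summaryValues',
      column: Number(match[3]),
    };
  });
}
```

A compact, URL-safe string like this is what makes it practical to store the sort state in a query parameter.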
#### Validation and Safety
Because the component relies on a specific relationship between the `DataFrame` and the `pivot.Request`, it includes a validation layer. It uses `validateAsPivotTable` to ensure the incoming request is compatible with a tabular display (e.g., checking that the necessary grouping and summary fields are present) before attempting to render.
### Key Components
#### PivotTableSk
The main custom element. It manages the lifecycle of the data, reacting to changes in the `df` (DataFrame) and `req` (Request) properties. It uses `lit` for efficient rendering and manages internal state for the `sortHistory` and the resulting `compare` function passed to JavaScript's native `Array.prototype.sort()`.
#### SortHistory and SortSelection
These classes encapsulate the multi-column sorting logic.
- `SortSelection` handles the metadata for a single column: its index, its type (`keyValues` vs `summaryValues`), and its direction.
- `SortHistory` manages an array of selections. It provides the `buildCompare` method, which generates a complex comparison function that iterates through the history stack until a non-zero comparison result is found.
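The tie-breaking comparison can be sketched as a function that walks the history from most recent to oldest and returns the first non-zero result. The `Row` shape and the standalone `buildCompare` function are simplifications assumed for illustration; the real classes may organize this differently.

```typescript
type Direction = 'up' | 'down';
type Kind = 'keyValues' | 'summaryValues';

interface SortSelection {
  column: number;
  kind: Kind;
  dir: Direction;
}

// Hypothetical row shape: group-by values plus computed summary values.
interface Row {
  keyValues: string[];
  summaryValues: number[];
}

// history[0] is the most recently clicked column.
function buildCompare(history: SortSelection[]): (a: Row, b: Row) => number {
  return (a: Row, b: Row): number => {
    for (const sel of history) {
      let result: number;
      if (sel.kind === 'keyValues') {
        const left = a.keyValues[sel.column] ?? '';
        const right = b.keyValues[sel.column] ?? '';
        result = left < right ? -1 : left > right ? 1 : 0;
      } else {
        result = (a.summaryValues[sel.column] ?? 0) - (b.summaryValues[sel.column] ?? 0);
      }
      if (sel.dir === 'down') {
        result = -result;
      }
      // First column that distinguishes the rows decides the order.
      if (result !== 0) {
        return result;
      }
    }
    return 0;
  };
}
```

The returned function can be passed directly to `rows.sort(...)`. Rows that tie on every column in the history compare as equal.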
#### Page Object (`PivotTableSkPO`)
Located in `pivot-table-sk_po.ts`, this provides an abstraction for testing. It allows internal tests (Puppeteer) to interact with the table (clicking headers, reading cell values) without being coupled to the specific DOM structure or CSS classes.
### Primary Workflow
The following diagram illustrates how data flows into the component and results in a sorted display:
```text
[ DataFrame ] + [ pivot.Request ]
      |                 |
      v                 v
+-----------------------------+
|         willUpdate()        |
| 1. Extract KeyValues        | <--- Maps Trace IDs to grouped params
| 2. Init/Update SortHistory  | <--- Restores state from encoded string
| 3. Build Compare Function   |
+-----------------------------+
               |
               v
+-----------------------------+
|          render()           |
| 1. Validate Request         |
| 2. Sort Keys via CompareFn  |
| 3. Generate <table> rows    |
+-----------------------------+
               |
               +--> [ User Clicks Header ] --+
               ^                             |
               |                             v
               +----------- [ Emit 'change' event ]
                            [ Re-run willUpdate() ]
```
### Events
- **change**: Emitted whenever the user changes the sort order. The event's `detail` contains the serialized `SortHistory` string, allowing parent components to sync the UI state with the application URL.
# Module: /modules/pivotutil
# Pivot Utilities
The `pivotutil` module provides a set of client-side utilities designed to facilitate the configuration and validation of pivot operations within the Perf system. Its primary role is to bridge the gap between raw pivot request data structures defined in the backend and the user interface, ensuring that pivot configurations are both semantically valid and human-readable before being processed.
## Overview and Purpose
Pivot operations in Perf allow users to transform multi-dimensional trace data into aggregated summaries or reorganized table views. Because a pivot request involves several dependent parameters—such as grouping keys, summary operations, and aggregation methods—it is prone to configuration errors that could lead to empty results or server-side failures.
This module centralizes the logic for:
1. **Humanizing Metadata**: Mapping technical operation identifiers (e.g., `geo`, `avg`) to user-friendly labels (e.g., "Geometric Mean", "Mean") for consistent display across the UI.
2. **Request Validation**: Enforcing structural constraints on `pivot.Request` objects to ensure they contain the minimum necessary information to be actionable.
## Key Logic and Design Decisions
### Validation Philosophy
The module distinguishes between a "valid pivot request" and a "valid pivot table." This distinction is necessary because the Perf system supports different ways of visualizing pivoted data:
- **Structural Validation**: A baseline pivot request requires at least one `group_by` field. Without grouping, there is no dimension along which to pivot the data.
- **Contextual Validation (Tables)**: While a request might be structurally sound for certain visualizations, generating a tabular summary requires at least one `summary` operation. The `validateAsPivotTable` function enforces this stricter requirement, ensuring that the UI does not attempt to render an empty summary table when the user has only defined groupings.
### Data Mapping
The `operationDescriptions` map serves as the single source of truth for how pivot operations are presented to the user. By centralizing these strings in `pivotutil`, the system ensures that different UI components (such as dropdowns, table headers, or chart legends) remain consistent in their terminology.
## Key Components
### `index.ts`
This is the core of the module. It exports the validation functions and the description mappings used by UI components to interpret `pivot.Request` objects.
- **`operationDescriptions`**: A lookup table mapping `pivot.Operation` types to their display names. It covers standard statistical aggregations like sum, mean (arithmetic and geometric), standard deviation, count, and extrema.
- **`validatePivotRequest`**: Checks for the existence of the request and ensures the `group_by` array is populated.
- **`validateAsPivotTable`**: Extends the basic validation by verifying that the `summary` field is also populated, which is a prerequisite for generating a statistical summary table.
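The two validation levels can be sketched as follows. The `PivotRequest` interface is a simplified stand-in for the backend-defined `pivot.Request`, the error strings are invented for illustration, and the convention that an empty string means "valid" is an assumption. Only the `geo` and `avg` entries of the description map appear in the text above.

```typescript
// Simplified stand-in for pivot.Request.
interface PivotRequest {
  group_by: string[] | null;
  operation: string;
  summary: string[] | null;
}

// Only the two mappings mentioned in the text; the real table covers more.
const operationDescriptions: { [op: string]: string } = {
  geo: 'Geometric Mean',
  avg: 'Mean',
};

// Returns an error message, or '' when the request is structurally valid.
function validatePivotRequest(req: PivotRequest | null): string {
  if (req === null) {
    return 'Pivot request is missing.';
  }
  if (!req.group_by || req.group_by.length === 0) {
    return 'Pivot needs at least one group_by key.';
  }
  return '';
}

// Stricter check: a table additionally needs at least one summary operation.
function validateAsPivotTable(req: PivotRequest | null): string {
  const error = validatePivotRequest(req);
  if (error !== '' || req === null) {
    return error;
  }
  if (!req.summary || req.summary.length === 0) {
    return 'Pivot table needs at least one summary operation.';
  }
  return '';
}
```

Layering the table check on top of the basic check keeps the two rules in one place, so a request that fails the structural rule never reaches the table-specific one.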
## Workflows
The typical workflow for utilizing this module involves a UI component gathering user input and validating it before dispatching a network request or updating a visualization.
```
[ User Input ] ----> [ pivotutil: validatePivotRequest ]
                                      |
                     +----------------+----------------+
                     |                                 |
           [ Returns Error Msg ]               [ Logic Proceeds ]
                     |                                 |
           [ UI displays Alert ]      [ pivotutil: validateAsPivotTable ]
                                                       |
                                    +------------------+------------------+
                                    |                                     |
                          [ Returns Error Msg ]                      [ Success ]
                                    |                                     |
                         [ UI hides Table View ]             [ UI renders Pivot Table ]
```
# Module: /modules/plot-google-chart-sk
# `plot-google-chart-sk`
## Overview
The `plot-google-chart-sk` module provides a high-performance, interactive charting component built on top of the Google Visualization API (`google-chart`). It is specifically designed to handle time-series data, anomalies, and user-defined issues within the Perf framework.
Beyond simple data visualization, this module implements specialized interaction modes—such as panning, delta-Y calculations, and dual-axis zooming—to support deep analysis of performance regressions and improvements.
## Design Decisions and Implementation Choices
### Performance via Overlays
Rendering thousands of data points alongside complex icons (anomalies, regressions, bug icons) directly within the Google Chart SVG can lead to significant performance degradation during interactions like panning or resizing.
- **Implementation:** The module uses a layered approach. The base `google-chart` renders the lines, while anomalies and user issues are rendered as **absolute-positioned HTML overlays** in separate `div` containers (`.anomaly`, `.userissue`).
- **Benefit:** When the user pans the chart, the module recalculates the coordinates using the chart's layout interface and moves the HTML elements without requiring the underlying charting engine to re-render the entire data series.
### State Management and Context
The module leverages `@lit/context` to synchronize state across a complex hierarchy of components without prop-drilling.
- **Data Synchronization:** It consumes `dataTableContext` and `dataframeAnomalyContext` to reactively update the view when the underlying performance data changes.
- **Color Consistency:** It provides a `traceColorMapContext`. This ensures that if a trace is assigned "Blue" in the chart, the same color is used in the `side-panel-sk` (legend) and any associated tooltips.
### Interaction Modes
The module distinguishes between three primary mouse navigation modes to avoid UI clutter:
1. **Panning (Left-Click Drag):** Moves the horizontal view window.
2. **Delta-Y (Shift + Left-Click Drag):** Activates `v-resizable-box-sk` to measure the vertical distance (raw and percentage) between two points on the Y-axis.
3. **Drag-to-Zoom (Ctrl + Left-Click Drag):** Activates `drag-to-zoom-box-sk` to select a specific sub-region for zooming. This supports both horizontal and vertical zooming depending on the global `isHorizontalZoom` state.
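A minimal sketch of how the three modes could be selected from a mouse event's modifier keys. The `NavigationMode` type, the `navigationModeFor` function, and the choice to let Ctrl win when both modifiers are held are assumptions; the text above does not specify that case.

```typescript
type NavigationMode = 'pan' | 'deltaY' | 'dragToZoom' | null;

// Subset of MouseEvent fields needed to choose a mode.
interface MouseModifiers {
  button: number;
  shiftKey: boolean;
  ctrlKey: boolean;
}

function navigationModeFor(e: MouseModifiers): NavigationMode {
  // Only the primary (left) button starts a navigation gesture.
  if (e.button !== 0) {
    return null;
  }
  // Assumption: Ctrl takes priority if both Ctrl and Shift are held.
  if (e.ctrlKey) {
    return 'dragToZoom';
  }
  if (e.shiftKey) {
    return 'deltaY';
  }
  return 'pan';
}
```

Because a real `MouseEvent` already has these three fields, it can be passed to the function unchanged.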
## Key Components and Responsibilities
### `plot-google-chart-sk.ts`
The primary element that orchestrates the charting logic.
- **Responsibility:** Manages the lifecycle of the `google.visualization.DataTable`, handles the "Domain" toggle (switching between Commit Position and Date), and coordinates the positioning of overlays.
- **Data View Logic:** It uses a `google.visualization.DataView` to filter which traces are currently visible based on user selections in the side panel.
### `side-panel-sk.ts`
A collapsible legend and control interface.
- **Responsibility:** Displays a list of active traces. It groups trace labels by a "display name" (derived from trace parameters like `test`, `arch`, etc.).
- **Interaction:** Allows users to toggle trace visibility. It prevents the user from unselecting all traces to ensure the chart never becomes completely empty.
### `v-resizable-box-sk.ts`
A specialized selection box for vertical measurements.
- **Responsibility:** Calculates the difference between two Y-axis values. It intelligently positions the delta text (e.g., "+15% / 1.2s") so it doesn't clip outside the chart boundaries.
### `drag-to-zoom-box-sk.ts`
A transparent selection rectangle.
- **Responsibility:** Provides visual feedback during a Ctrl-drag operation. It calculates the new coordinate bounds which are then passed back to the main chart to update its `viewWindow`.
## Key Workflows
### Data Rendering Flow
```text
Data Update -> willUpdate() -> updateDataView()
|
V
Create google.visualization.DataView
|
V
Assign Colors to Traces
|
V
updateOptions() (Scale/Axis)
|
V
plot.redraw() -> onChartReady()
|
V
drawAnomaly() & drawUserIssues()
```
### Coordinate Mapping
Because the overlays are standard HTML elements, the module frequently translates "Data Values" (Commits/Dates/Values) into "Pixel Coordinates" using the Google Chart Layout Interface:
```text
[Data Value: Commit 1234]
|
V
[Chart Layout Interface] -> getXLocation(1234)
|
V
[CSS Absolute Position] -> element.style.left = `${x}px`
```
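The translation step can be sketched against a narrow interface that mirrors the two layout methods used. `LayoutLike` and `overlayPosition` are hypothetical names; in a browser, an object satisfying `LayoutLike` would typically come from the chart's `getChartLayoutInterface()`.

```typescript
// The subset of the Google Chart layout interface this sketch relies on.
interface LayoutLike {
  getXLocation(dataValue: number): number;
  getYLocation(dataValue: number): number;
}

interface OverlayPosition {
  left: string;
  top: string;
}

// Converts a data-space point into CSS offsets for an absolutely
// positioned overlay element.
function overlayPosition(layout: LayoutLike, x: number, y: number): OverlayPosition {
  return {
    left: `${layout.getXLocation(x)}px`,
    top: `${layout.getYLocation(y)}px`,
  };
}
```

Applying the result is then a matter of assigning `left` and `top` to the overlay element's `style`, which avoids redrawing the chart itself.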
## Events
- **`selection-changed`**: Dispatched when the user finishes panning or zooming, providing the new `range` and `domain`.
- **`plot-data-select`**: Dispatched when a specific data point is clicked, returning the `tableRow` and `tableCol`.
- **`side-panel-toggle`**: Dispatched when the legend panel is opened or closed.
# Module: /modules/plot-summary-sk
# Plot Summary (plot-summary-sk)
The `plot-summary-sk` module provides a high-level "bird's-eye view" of performance data. It is designed to act as a navigation and overview tool for large datasets, allowing users to see trends across a wide time or commit range and select specific sub-sections to investigate in more detail.
## High-Level Overview
This component renders a simplified area chart of performance traces using Google Charts. Its primary purpose is to facilitate range selection. Unlike a primary data plot, it prioritizes rendering speed and visual density over granular data point interaction.
It solves the problem of "information overload" when dealing with thousands of data points by implementing automatic downsampling and providing a specialized UI for horizontal range manipulation.
## Design Decisions
### Min-Max Downsampling
When the input `DataTable` contains a large number of rows (exceeding 1000), the component automatically applies a **Min-Max bucketing** algorithm.
- **Why:** Standard downsampling (like averaging) smooths out spikes, which are often the most important features in performance monitoring.
- **How:** The data is divided into buckets. For each bucket, the component synthesizes two rows: one representing the minimum value and one representing the maximum value within that interval. This ensures that the visual "envelope" of the data—including all peaks and valleys—remains visible even at low resolutions.
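A minimal sketch of min-max bucketing over a single series. It is not the component's implementation: it works on a plain array of points rather than a `DataTable`, reuses the original extreme points instead of synthesizing rows, and emits one point when a bucket's minimum and maximum coincide. The names `Point` and `minMaxDownsample` are hypothetical.

```typescript
interface Point {
  x: number;
  y: number;
}

function minMaxDownsample(points: Point[], bucketCount: number): Point[] {
  // Nothing to gain when two rows per bucket would not shrink the input.
  if (bucketCount <= 0 || points.length <= bucketCount * 2) {
    return points.slice();
  }
  const bucketSize = Math.ceil(points.length / bucketCount);
  const out: Point[] = [];
  for (let start = 0; start < points.length; start += bucketSize) {
    const bucket = points.slice(start, start + bucketSize);
    const min = bucket.reduce((a, b) => (b.y < a.y ? b : a));
    const max = bucket.reduce((a, b) => (b.y > a.y ? b : a));
    // Keep the two extremes in their original x order.
    if (min === max) {
      out.push(min);
    } else if (min.x <= max.x) {
      out.push(min, max);
    } else {
      out.push(max, min);
    }
  }
  return out;
}
```

Averaging the same buckets would flatten a one-sample spike; keeping both extremes preserves it, at the cost of two output rows per bucket.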
### Decoupled Interaction Layer
The selection logic is separated into a sub-component called `h-resizable-box-sk`.
- **Why:** Keeping the logic for mouse interactions (dragging, resizing, drawing) separate from the charting logic makes the codebase more maintainable and allows the resizable box to be reused in other contexts.
- **How:** `h-resizable-box-sk` overlays the chart and translates raw pixel coordinates from mouse events into relative percentage-based ranges. `plot-summary-sk` then converts these relative positions into domain-specific values (timestamps or commit offsets) using the Google Chart `ChartLayoutInterface`.
### Deterministic Trace Coloring
To ensure visual consistency between the summary plot and the main detail plots (e.g., `plot-google-chart-sk`), this module uses a shared utility (`getTraceColor`) to assign colors based on the trace name. This allows a user to identify the same trace across different UI components by color alone.
## Key Components and Responsibilities
### plot-summary-sk.ts
The main controller for the summary view.
- **Data Management:** Consumes `DataTable` objects via Lit context (`dataTableContext`) and converts them into a `DataView` optimized for the summary (filtering columns based on `selectedTrace`).
- **Coordinate Mapping:** Acts as a bridge between the visual chart and the data. It contains the logic to convert between pixel coordinates (used by the resizable box) and data values (commits/dates).
- **Navigation Controls:** Optionally renders "load more" buttons (left/right) that interact with a `DataFrameRepository` to extend the available data range.
### h_resizable_box_sk.ts
A specialized UI primitive for horizontal range selection.
- **State Tracking:** Manages four distinct user actions: `draw` (creating a new selection), `drag` (moving an existing selection), `left` (resizing the start), and `right` (resizing the end).
- **Constraint Enforcement:** Uses a `clamp` utility to ensure the selection box never leaves the boundaries of the parent container and maintains a `minWidth` to prevent the selection from becoming unclickable.
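The constraint logic can be sketched with a small `clamp` helper applied to the `drag` and `right` actions. The `Selection` shape and the function names are assumptions for illustration; all values are in pixels relative to the parent container.

```typescript
function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

// Hypothetical selection box geometry.
interface Selection {
  left: number;
  width: number;
}

// 'drag': move the whole box, keeping it inside the container.
function dragSelection(sel: Selection, deltaX: number, containerWidth: number): Selection {
  return {
    left: clamp(sel.left + deltaX, 0, containerWidth - sel.width),
    width: sel.width,
  };
}

// 'right': resize the end edge, never narrower than minWidth and never
// past the container's right edge.
function resizeRight(sel: Selection, deltaX: number, containerWidth: number, minWidth: number): Selection {
  return {
    left: sel.left,
    width: clamp(sel.width + deltaX, minWidth, containerWidth - sel.left),
  };
}
```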
## Key Workflows
### Selection Process
The following diagram illustrates how a user interaction is transformed into a system-wide range update:
```
User Action (Mouse) -> [h-resizable-box-sk]
|
(Pixel Range)
|
v
[plot-summary-sk]
|
(Convert Pixels via ChartLayout)
|
v
[summary_selected Event]
|
(Contains: {begin, end, domain})
```
### Data Update and Redraw
When the underlying data changes, the component goes through the following lifecycle:
1. **Property Change:** `data`, `selectedTrace`, or `domain` is updated.
2. **Downsampling:** `updateDataView` checks row count; if > 1000, buckets are created.
3. **Column Filtering:** The view is restricted to the domain column (0 or 1) and the data columns for the selected traces.
4. **Async Redraw:** Google Chart renders the new SVG.
5. **Selection Realignment:** Once the chart emits `google-chart-ready`, the `h-resizable-box-sk` is repositioned to match the `cachedSelectedValueRange`, as the axis scaling might have changed.
## Events
- **summary_selected**: Dispatched whenever the user finishes a selection or adjustment. The `detail` contains a `range` object with `begin` and `end` values in the current domain (UNIX timestamp for dates, or integer offset for commits).
- **range-changing-in-multi**: Dispatched when "load" buttons are clicked in a multi-chart environment, allowing a parent controller to synchronize data fetching across multiple plots.
# Module: /modules/point-links-sk
# point-links-sk
This module provides a specialized UI component, `PointLinksSk`, designed to display context-sensitive links associated with specific data points in Perf. These links are typically sourced from ingestion files and represent metadata such as commit hashes, build logs, or trace artifacts.
## Overview
The primary purpose of `point-links-sk` is to bridge the gap between a raw data point and the external systems that provide more context about it. It doesn't just list static URLs; it dynamically calculates ranges and cleans up metadata keys to present a user-friendly interface for navigating between performance results and source control or build systems.
## Key Responsibilities and Logic
### 1. Commit Range Generation
A significant feature of this module is its ability to compare the selected commit with the previous commit to generate "diff" or "log" links.
- **Logic**: If a key is identified as a commit range key (e.g., "V8 Git Hash"), the component fetches the hash for both the current and the preceding commit.
- **Single Commit**: If the hashes are identical, the component displays a direct link to that specific commit.
- **Range**: If the hashes differ, it constructs a Gitiles-compatible URL (using the `+log/start..end` syntax) to show all commits in that range.
- **Validation**: It performs an asynchronous check via a proxy to `googlesource.com` (using `isRange`) to determine if a range actually contains multiple commits, ensuring the UI text accurately reflects whether the user is looking at a single change or a list.
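The single-commit versus range decision can be sketched as below. The `+log/start..end` form comes from the text above; the `+/<hash>` form for a single commit is the usual Gitiles commit URL and is assumed here, as are the 8-character short hashes and the names `CommitLink` and `commitRangeLink`. The asynchronous `isRange` check is not modeled.

```typescript
interface CommitLink {
  url: string;
  text: string;
}

function shortHash(hash: string): string {
  return hash.slice(0, 8);
}

// repoUrl is a Gitiles repository URL without a trailing slash.
function commitRangeLink(repoUrl: string, prevHash: string, currentHash: string): CommitLink {
  if (prevHash === currentHash) {
    // Same hash on both points: link straight to the commit.
    return { url: `${repoUrl}/+/${currentHash}`, text: shortHash(currentHash) };
  }
  // Different hashes: link to the log of everything in between.
  return {
    url: `${repoUrl}/+log/${prevHash}..${currentHash}`,
    text: `${shortHash(prevHash)} - ${shortHash(currentHash)}`,
  };
}
```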
### 2. Intelligent Data Retrieval and Fallbacks
The component fetches point-specific metadata through internal API endpoints. It implements a fallback strategy to ensure reliability:
1. It first attempts to fetch from `/_/details/?results=false`.
2. If no links are found, it falls back to the `/_/links/` endpoint.
3. **Performance Optimization**: It accepts an array of `CommitLinks` as an argument to its `load` method. This allows the caller to provide cached data, preventing redundant network requests when switching back and forth between points.
### 3. Data Normalization and Filtering
The module handles several platform-specific quirks to maintain a consistent UI:
- **Key Cleaning**: It strips redundant suffixes (like " Git") and renames technical keys to friendlier terms (e.g., "Trace Iteration" becomes "Trace", "Profiling Traces and Test Artifacts" becomes "Artifacts").
- **Fuchsia Support**: It handles specific formatting used in Fuchsia ingestion where build logs are stored as Markdown-style strings (e.g., `[Build Log](url)`), extracting the URL for proper anchor tag generation.
- **Repository Rewriting**: It contains workarounds to fix incomplete URLs for certain repositories (like V8 or WebRTC) where the ingestion might only provide a hash rather than a full URL.
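Two of these normalizations can be sketched as small pure functions. The rename table contains only the two examples given above, and the regular expression is one possible way to pull the URL out of a Markdown-style link; `cleanKey` and `extractUrl` are hypothetical names, and suffix stripping and repository rewriting are not covered.

```typescript
// Only the two renames mentioned in the text.
const keyRenames = new Map<string, string>([
  ['Trace Iteration', 'Trace'],
  ['Profiling Traces and Test Artifacts', 'Artifacts'],
]);

function cleanKey(key: string): string {
  return keyRenames.get(key) ?? key;
}

// Returns the URL from "[label](url)"; other values pass through unchanged.
function extractUrl(value: string): string {
  const match = /^\[[^\]]*\]\(([^)\s]+)\)$/.exec(value.trim());
  return match?.[1] ?? value;
}
```

Passing non-matching values through unchanged means plain URLs from other platforms are unaffected by the Fuchsia-specific handling.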
## Workflow: Loading Point Links
The following diagram illustrates the process when a user selects a point and `load()` is called:
```text
User selects point
       |
       v
[load(commit, prev_commit, trace_id, ...)]
       |
       +--> Check cache? --(Found)--> Render cached links
       |         |
       |     (Missing)
       |         v
       +--> Fetch current point links (/_/details/ or /_/links/)
       |         |
       +--> Fetch previous point links (if range keys requested)
       |         |
       +--> Compare hashes
       |         |-- Same: Create direct commit link
       |         '-- Different: Create +log/ range link
       |         |
       +--> Filter for "Useful Links" (Build logs, etc.)
       |         |
       +--> Normalize keys and extract URLs (e.g., Fuchsia regex)
       |         |
       '--> [ _render() ] Update Lit-html template
```
## Component API and State
### Primary Method: `load(...)`
This is the main entry point for the component. It triggers the data fetching and comparison logic. It returns the updated list of `CommitLinks` (including any newly fetched data) so the parent component can maintain an up-to-date cache.
### Internal State
- **`displayUrls`**: A map of human-readable keys to their calculated destination URLs.
- **`displayTexts`**: A map of keys to the text that should appear inside the link (e.g., the short hash range `f052b8c4 - 47f420e8`).
- **`commitPosition`**: Tracks the current commit number being inspected to ensure the UI stays synchronized with the user's selection.
## Implementation Details
- **Aborting Requests**: The component uses an `AbortController` to cancel pending network requests if a user rapidly clicks different points, preventing race conditions where old data might overwrite new data.
- **Template Rendering**: It uses Lit's `until` directive to show "Loading..." placeholders for individual link rows while asynchronous range validation (`isRange`) is performed.
- **Styling**: Styles are minimal, focusing on a tabular layout for the keys and values, with specific overrides for Material Design icon buttons used in supplementary actions (like copying links).
# Module: /modules/progress
The `progress` module provides a standardized mechanism for triggering, monitoring, and retrieving results from long-running server-side tasks. It abstracts the complexity of asynchronous polling into a lifecycle-aware utility, allowing the frontend to handle heavy operations (like database queries or complex data processing) without blocking the main UI thread or timing out on a single HTTP request.
### Design Philosophy: Polling and Lifecycle Management
The module is designed around a state-machine approach where the server dictates the flow of the operation. Rather than the client guessing when a task is finished, the server provides a `SerializedProgress` object containing the current status and a URL for the next update.
The core function, `startRequest`, manages the following transition logic:
1. **Initiation**: Sends a `POST` request to a starting URL with a JSON body.
2. **Observation**: Evaluates the `status` field in the response.
3. **Recursion/Polling**: If the status is `Running`, it schedules a subsequent `GET` request to the URL provided in the previous response after a configurable interval.
4. **Completion**: If the status is anything other than `Running` (e.g., `Finished`), it resolves the promise with the final data.
This design decouples the UI from the specific endpoint of the task; as long as the server follows the `progress.SerializedProgress` schema, the client can follow the task through any number of intermediate steps or URL changes.
### Key Components
#### Request Management (`progress.ts`)
The primary entry point is `startRequest`. It is built to be flexible through a `RequestOptions` object, which provides hooks into the various stages of the request lifecycle:
- **`onStart`**: Useful for updating UI state (e.g., showing a spinner) before the first network call is made.
- **`onProgressUpdate`**: Triggered every time the server returns a response while the task is still `Running`. This allows for real-time progress bars or status message updates.
- **`onSuccess`**: Triggered specifically when the task reaches a terminal successful state.
- **`onSettled`**: Acts like a `finally` block, executing when the process ends regardless of success or failure, making it ideal for cleanup tasks like hiding loading indicators.
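The polling loop and its hooks can be sketched as below. This is not the module's `startRequest`: the function is named `startRequestSketch`, the network call is replaced by an injected `Transport` function so the example runs without a server, and the `pollingIntervalMs` option name, its default, and the reduced `SerializedProgress` shape are assumptions.

```typescript
// Reduced stand-in for progress.SerializedProgress.
interface SerializedProgress {
  status: string; // 'Running' while the task is in flight
  url: string; // where to poll next
}

interface RequestOptions {
  pollingIntervalMs?: number;
  onStart?: () => void;
  onProgressUpdate?: (p: SerializedProgress) => void;
  onSuccess?: (p: SerializedProgress) => void;
  onSettled?: () => void;
}

// Injected in place of fetch; expected to reject on non-ok HTTP statuses.
type Transport = (method: 'POST' | 'GET', url: string, body?: unknown) => Promise<SerializedProgress>;

async function startRequestSketch(
  transport: Transport,
  startUrl: string,
  body: unknown,
  opts: RequestOptions = {}
): Promise<SerializedProgress> {
  opts.onStart?.();
  try {
    let progress = await transport('POST', startUrl, body);
    while (progress.status === 'Running') {
      opts.onProgressUpdate?.(progress);
      await new Promise<void>((resolve) => setTimeout(resolve, opts.pollingIntervalMs ?? 300));
      // The server decides where the next poll goes.
      progress = await transport('GET', progress.url);
    }
    if (progress.status === 'Finished') {
      opts.onSuccess?.(progress);
    }
    return progress;
  } finally {
    // Runs on success, on a non-Finished terminal status, and on rejection.
    opts.onSettled?.();
  }
}
```

Placing `onSettled` in a `finally` block is what gives it the cleanup semantics described above: a rejected transport call still hides the spinner.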
#### Message Parsing Utilities
Since server responses often include a list of key-value pairs (`progress.Message[]`) to describe internal state, the module provides utilities to normalize this data for UI display:
- **`messagesToErrorString`**: Prioritizes extracting a message with the key `Error`. If absent, it concatenates all available messages into a single string. This ensures that even if the server doesn't provide a specific error field, the user receives some context about what went wrong.
- **`messageByName`**: A safe lookup utility to extract a specific value from the message array by its key, providing a fallback to prevent UI breakage.
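A minimal sketch of the two helpers, assuming each message is a simple `{ key, value }` pair. The `key: value` concatenation format and the empty-string default fallback are assumptions; the text above only states that the messages are joined into one string and that a fallback exists.

```typescript
// Reduced stand-in for progress.Message.
interface Message {
  key: string;
  value: string;
}

// Prefers the 'Error' message; otherwise joins everything for context.
function messagesToErrorString(messages: Message[]): string {
  const error = messages.find((m) => m.key === 'Error');
  if (error) {
    return error.value;
  }
  return messages.map((m) => `${m.key}: ${m.value}`).join(' ');
}

// Looks up one message by key, returning the fallback when it is absent.
function messageByName(messages: Message[], name: string, fallback: string = ''): string {
  const found = messages.find((m) => m.key === name);
  return found ? found.value : fallback;
}
```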
### Workflow Process
The following diagram illustrates the lifecycle of a long-running request managed by this module:
```text
[ Client ]              [ Server ]
    |                       |
    |-- POST (Start URL) -->|
    |                       |-- [ Task Initiated ]
    |<-- 200 (Running) -----|
    |    (JSON: status, url)|
    |                       |
    | [ Wait pollingInterval ]
    |                       |
    |-- GET (Poll URL) ---->|
    |                       |-- [ Task Processing... ]
    |<-- 200 (Running) -----|
    |                       |
    | [ Wait pollingInterval ]
    |                       |
    |-- GET (Poll URL) ---->|
    |                       |-- [ Task Finished ]
    |<-- 200 (Finished) ----|
    |                       |
[ Resolve Promise ]
```
### Error Handling
The module treats non-`ok` HTTP statuses (like 4xx or 5xx) as terminal failures, rejecting the promise immediately and triggering the `onSettled` callback. It does not automatically retry on network failure; it assumes that if the polling chain is broken, the caller should decide whether to restart the entire process.
# Module: /modules/query-chooser-sk
# query-chooser-sk
The `query-chooser-sk` module provides a compact, interactive UI component for building and displaying search queries based on a set of parameters (a `ParamSet`). It acts as a high-level wrapper around the more complex `query-sk` component, offering a "summary-first" workflow that keeps the UI clean while allowing for detailed query editing.
## Overview
In many data-heavy applications, users need to filter large datasets using multiple keys and values. Displaying a full query builder at all times occupies significant screen real estate. `query-chooser-sk` solves this by:
1. Displaying a read-only summary of the current selection using `paramset-sk`.
2. Providing an "Edit" button that reveals an embedded `query-sk` interface within a toggleable dialog.
3. Integrating live feedback via `query-count-sk` to show how many items match the current selection as the user modifies it.
## Key Components and Responsibilities
### query-chooser-sk.ts
This is the primary entry point and defines the custom element. Its responsibilities include:
- **State Management:** It maintains the `current_query` (a URL-formatted query string) and the `paramset` (the available options).
- **Dialog Orchestration:** It manages the visibility of the internal `#dialog` element, which contains the editing tools.
- **Event Propagation:** It listens for `query-change` events from the internal `query-sk` component, updates its own state, re-renders the summary, and propagates information to the parent application.
### Integrated Sub-elements
The functionality of `query-chooser-sk` is composed of several specialized elements:
- **`paramset-sk`**: Used in the main view to display a concise, non-interactive summary of the active query filters.
- **`query-sk`**: The core interactive builder revealed when editing. It handles the logic of selecting keys and values from the `ParamSet`.
- **`query-count-sk`**: Situated inside the edit dialog, it performs asynchronous lookups (via the `count_url` attribute) to provide real-time counts of data points matching the user's current selection.
## Workflow
The component operates in a cycle of viewing and refining:
```text
+---------------------------------------+
| [Edit] Key1: Val1, Val2               | <--- (Summary View: paramset-sk)
+-----+---------------------------------+
      |
      | (Click Edit)
      v
+-----+---------------------------------+
| [Close]                               |
| +-----------------------------------+ |
| | (query-sk)                        | | <--- (Edit View: User selects filters)
| | Key1: [x]Val1 [x]Val2 [ ]Val3     | |
| +-----------------------------------+ |
| Matches: 1,245                        | <--- (Live Count: query-count-sk)
+---------------------------------------+
```
1. **Initial State**: The component renders a button and a `paramset-sk`.
2. **Interaction**: When the user clicks "Edit", the `_editClick` handler adds a CSS class to display the hidden dialog.
3. **Refinement**: As the user toggles checkboxes in `query-sk`, the `_queryChange` handler updates the `current_query` property. This update is reactive:
- The `query-count-sk` sees the new query and fetches a new count.
- The `paramset-sk` summary updates to reflect the latest selections.
4. **Completion**: The user clicks "Close", hiding the editing interface and returning to the summary view.
## Design Decisions
- **URL-Formatted Strings**: The component uses URL-formatted query strings (`key1=val1&key1=val2`) as the primary data exchange format for `current_query`. This makes it trivial to sync the component state with the browser's address bar or use it directly in API requests.
- **Encapsulated Dialog**: Instead of using a global modal, the dialog is contained within the element's shadow DOM (or local DOM). This ensures that multiple `query-chooser-sk` instances can exist on one page without managing z-index or global state conflicts.
- **Property Upgrading**: The `connectedCallback` utilizes `_upgradeProperty` for attributes like `paramset` and `key_order`. This ensures that if the properties are set on the DOM element before the custom element definition is loaded, the values are correctly captured and processed.
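
The property-upgrading idea can be shown without a browser. The following is a minimal sketch, assuming the common pattern of deleting the instance's own property and reassigning it so that the class setter runs; `FakeElement` and `upgradeProperty` are hypothetical stand-ins, not the component's actual code.

```typescript
// Hypothetical stand-in for a custom element class with an accessor property.
class FakeElement {
  private _paramset: Record<string, string[]> = {};
  setterCalls = 0;

  get paramset(): Record<string, string[]> {
    return this._paramset;
  }

  set paramset(value: Record<string, string[]>) {
    this.setterCalls++;
    this._paramset = value;
  }
}

// If a value was stored directly on the instance (shadowing the class
// accessor), remove it and assign it again so the setter processes it.
function upgradeProperty(el: object, prop: string): void {
  if (Object.prototype.hasOwnProperty.call(el, prop)) {
    const target = el as Record<string, unknown>;
    const value = target[prop];
    delete target[prop];
    target[prop] = value;
  }
}
```

Without the delete-and-reassign step, the instance property would keep shadowing the accessor and the setter logic would never see the early value.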
## Attributes and Properties
| Attribute | Property | Description |
| --------------- | --------------- | ------------------------------------------------------------------------------------ |
| `current_query` | `current_query` | The current selection formatted as a URL query string. |
| `count_url` | `count_url` | The endpoint URL used by `query-count-sk` to fetch match counts. |
| N/A | `paramset` | (Property only) The object containing all available keys and values. |
| `key_order` | `key_order` | An array of strings determining the order in which keys appear in the query builder. |
# Module: /modules/query-count-sk
# query-count-sk
The `query-count-sk` module provides a specialized UI component designed to report the number of data points or traces that match a specific query string within the Perf system. It serves as a live feedback mechanism, allowing users to understand the scope of their selection (e.g., in a query builder) before executing a full search or visualization.
## Design and Implementation
The component is built using **Lit** and leverages the `@lit/task` package to manage asynchronous data fetching. This architecture ensures that the component reacts efficiently to property changes while maintaining a responsive UI.
### Reactive Fetching Logic
The core of the component is a `Task` that monitors two primary inputs:
1. **`url`**: The endpoint to which the count request is sent.
2. **`current_query`**: The query string to be evaluated.
Whenever either of these properties changes, the task automatically triggers a `POST` request. The component is designed to handle rapid changes; if a new query is provided while a previous fetch is still in flight, the previous request is aborted via an `AbortSignal` to prevent race conditions and unnecessary network traffic.
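
The "abort the previous request" behavior can be sketched as follows. This is a hand-rolled illustration of the idea, not the `@lit/task` integration the component uses; `LatestRequest`, `PostJson`, and `countMatches` are hypothetical names, and the request body shape is an assumption.

```typescript
// Tracks the single in-flight request and aborts it when a newer one starts.
class LatestRequest {
  private controller: AbortController | null = null;

  // Abort any previous request and return the signal for the new one.
  begin(): AbortSignal {
    if (this.controller !== null) {
      this.controller.abort();
    }
    this.controller = new AbortController();
    return this.controller.signal;
  }
}

// Hypothetical transport; a real caller would wrap fetch() and pass the signal.
type PostJson = (url: string, body: string, signal: AbortSignal) => Promise<{ count: number }>;

// Returns the count, or null when this request was superseded by a newer one.
async function countMatches(
  tracker: LatestRequest,
  post: PostJson,
  url: string,
  query: string
): Promise<number | null> {
  const signal = tracker.begin();
  try {
    const res = await post(url, JSON.stringify({ q: query }), signal);
    return signal.aborted ? null : res.count;
  } catch (e) {
    if (signal.aborted) {
      return null; // The rejection came from our own abort; ignore it.
    }
    throw e;
  }
}
```

Because each call to `begin()` aborts the previous signal, only the most recent query can update the displayed count.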
### Data Flow and Side Effects
Unlike a simple display widget, `query-count-sk` performs two roles upon receiving a successful response from the server:
- **UI Update**: It extracts the `count` and displays it.
- **State Synchronization**: It dispatches a `paramset-changed` custom event. The response from the server includes a `paramset` (the set of all possible keys and values matching the current query), which the component bubbles up to notify parent components that the available filter options may have changed based on the current selection.
### Workflow Diagram
The following diagram illustrates the lifecycle of a query count request:
```text
Property Change Fetch Task Server DOM / Parent
(current_query)
| | | |
|---- triggers ------->| | |
| |---- POST (query) -->| |
| | | |
| |<--- JSON Response --| |
| | (count, params)| |
| | | |
|<--- updates count ---| |--- Dispatch Event ->|
| in render() | | (paramset-changed) |
```
## Key Components and Files
### query-count-sk.ts
Contains the element definition. It uses a `spinner-sk` to provide visual feedback during the loading state. To maintain consistency with legacy behaviors, the displayed count is reset to `0` whenever a new fetch task is initiated or pending.
### query-count-sk_po.ts
Provides a **Page Object (PO)** for testing. This file is crucial for integration and end-to-end tests, offering an abstraction layer to query the internal state of the component (like the numeric value of the count or the visibility of the spinner) without coupling tests to the internal DOM structure.
### API and Events
- **Attributes**:
- `current_query`: A string representing the query to count.
- `url`: The destination for the `CountHandlerRequest`.
- **Events**:
- `paramset-changed`: Dispatched when the server returns a new `ReadOnlyParamSet`. This allows other UI components (like dropdowns or filters) to update their available options dynamically.
## Implementation Details
The component sends a `CountHandlerRequest` that includes a time range, which defaults to the last 24 hours. This design assumes that the count of a query is most relevant for recent data; the time window is currently hardcoded within the task logic. If the backend fails to process the query, the error is passed to `errorMessage`, which shows a toast notification to the user.
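
As a rough illustration of the 24-hour window, the request body could be built like this. The field names `q`, `begin`, and `end` (Unix seconds) are assumptions made for this sketch, as are `CountRequestSketch` and `buildCountRequest`; consult the generated JSON types for the real `CountHandlerRequest` shape.

```typescript
// Assumed request shape: query string plus a [begin, end] window in Unix seconds.
interface CountRequestSketch {
  q: string;
  begin: number;
  end: number;
}

const DAY_IN_SECONDS = 24 * 60 * 60;

// Build a request covering the 24 hours that end at `nowMs` (milliseconds).
function buildCountRequest(query: string, nowMs: number): CountRequestSketch {
  const end = Math.floor(nowMs / 1000);
  return { q: query, begin: end - DAY_IN_SECONDS, end: end };
}
```

Passing the current time in as a parameter, instead of reading the clock inside the function, keeps the window calculation easy to test.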
# Module: /modules/regressions-page-sk
# `regressions-page-sk`
The `regressions-page-sk` module provides a specialized dashboard for performance regression management. It allows "Sheriffs" (users responsible for monitoring performance) to view, filter, and triage anomalies associated with specific subscription configurations.
## High-Level Overview
This module acts as a centralized interface for reviewing performance anomalies detected by the system. It connects to backend endpoints to fetch a list of active subscriptions (Sheriff configurations) and then retrieves the specific anomalies (regressions) associated with a selected subscription.
The page is designed to handle large datasets through pagination (via cursors) and provides filtering capabilities to distinguish between new regressions, triaged issues, and performance improvements.
## Key Components and Responsibilities
### `regressions-page-sk.ts`
This is the main entry point and logic controller for the page. It is a `LitElement` that manages the following:
- **State Management**: It maintains the UI state, including the currently selected subscription, whether to show triaged or improvement data, and pagination offsets. This state is synchronized with the URL query parameters to allow deep-linking and persistence across page refreshes.
- **Data Orchestration**: It coordinates fetching data from two primary sets of endpoints:
- **Legacy/ChromePerf**: Uses standard anomaly list endpoints.
- **SQL/Skia**: Uses modern SQL-backed endpoints when `fetch_anomalies_from_sql` is enabled in the global `perf` configuration.
- **UI Integration**: It acts as a container for two critical sub-tables:
- `<subscription-table-sk>`: Displays metadata about the selected sheriff configuration (labels, components, CC list).
- `<anomalies-table-sk>`: Displays the actual list of detected regressions.
- **Dynamic Page Title**: Updates the browser tab title to reflect the count of anomalies (e.g., "Regressions (12 untriaged)"), providing immediate feedback to the user.
### State and Persistence
The module uses a combination of URL parameters and `localStorage` to ensure a consistent user experience.
- **URL Parameters**: Store filters like `showTriaged` and `selectedSubscription` so users can share links to specific views.
- **LocalStorage**: Remembers the `perf-last-selected-sheriff` so that when a user returns to the page, their last worked-on subscription is automatically reselected.
### User Workflows
The typical workflow involves selecting a sheriff configuration and refining the view to focus on actionable items.
```text
[ Select Sheriff ] -> [ Fetch Subscription Metadata ] -> [ Fetch Anomalies ]
| | |
v v v
[ Update URL/LS ] [ Render Subscription Table ] [ Render Anomalies Table ]
|
|---[ Show More ] ---+
| |
+<---[ Append Data ]-+
```
## Design Decisions
### Pagination and Cursors
The component supports two types of pagination depending on the backend:
1. **Cursor-based**: Used primarily with the legacy/ChromePerf backend. The component looks for an `anomaly_cursor` in the JSON response; if present, it displays a "Show More" button and passes the cursor back in the next request.
2. **Offset-based**: Used with the SQL-backed backend. It calculates a `pagination_offset` based on the current length of the `cpAnomalies` array.
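
The two pagination modes can be sketched as a single helper that decides which parameter the next request carries. Only `anomaly_cursor` and `pagination_offset` come from the description above; `PageState`, its fields, and `nextPageParams` are hypothetical.

```typescript
interface PageState {
  useSql: boolean;              // mirrors the fetch_anomalies_from_sql switch
  anomalyCursor: string | null; // last anomaly_cursor returned by the legacy backend
  loadedCount: number;          // number of anomalies already loaded
}

// Returns the extra request parameters for the next page, or null when the
// legacy backend returned no cursor and there is nothing more to load.
function nextPageParams(state: PageState): Record<string, string> | null {
  if (state.useSql) {
    return { pagination_offset: String(state.loadedCount) };
  }
  if (state.anomalyCursor) {
    return { anomaly_cursor: state.anomalyCursor };
  }
  return null;
}
```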
### Separation of Concerns
The page itself does not handle the rendering of individual anomaly rows or subscription details. Instead, it delegates these tasks to `anomalies-table-sk` and `subscription-table-sk`. This allows `regressions-page-sk` to focus strictly on the "Page" level logic: URL state, global spinners, and high-level filtering.
### Loading Indicators
The module implements a dual-spinner strategy:
- `anomaliesLoadingSpinner`: An "upper" spinner that activates during initial loads or filter changes, signaling a full data refresh.
- `showMoreLoadingSpinner`: A localized spinner within the "Show More" section, indicating that the page is appending more data to the existing list rather than replacing it.
## Testing Architecture
- **Unit/Logic Tests (`regressions-page-sk_test.ts`)**: Focuses on state transitions, URL parameter parsing, and ensuring the correct API calls (with correct query strings) are made when filters are toggled.
- **Visual/Integration Tests (`regressions-page-sk_puppeteer_test.ts`)**: Uses Page Objects (`regressions-page-sk_po.ts`) to simulate user interactions like selecting a sheriff from a dropdown and verifying that the resulting tables render correctly via screenshots and DOM inspection.
# Module: /modules/report-page-sk
# report-page-sk
The `report-page-sk` module provides a comprehensive reporting view for performance anomalies. It serves as a centralized dashboard where users can review a list of detected regressions (anomalies), visualize them through interactive graphs, and inspect the shared commit history associated with those regressions.
## Overview
The primary purpose of this module is to consolidate triage workflows. Instead of looking at anomalies in isolation, `report-page-sk` groups related issues together, allowing a developer to see how multiple performance shifts might be tied to the same set of commits.
The page logic is driven by URL parameters (such as `bugID`, `anomalyIDs`, or `sid`), which determine which anomalies are fetched from the backend and which graphs are automatically generated upon page load.
## Key Components and Responsibilities
### Anomaly Management and Tracking
The module uses an internal `AnomalyTracker` class to maintain the state of all anomalies currently being viewed. This tracker manages the relationship between:
- The raw `Anomaly` data.
- The UI state (whether the anomaly is "checked" in the table).
- The `ExploreSimpleSk` graph instance associated with that specific anomaly.
- The specific `Timerange` relevant to the regression.
This separation ensures that the page can efficiently add or remove graphs from the DOM as the user toggles checkboxes in the list without losing the underlying data context.
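
A tracker of this kind might look like the sketch below. It is an illustration only: `AnomalyTrackerSketch`, its method names, and the simplified `AnomalySketch` and `TimerangeSketch` types are assumptions, and the graph type is left generic so the example runs without a DOM.

```typescript
interface AnomalySketch { id: string; }
interface TimerangeSketch { begin: number; end: number; }

interface TrackedAnomaly<G> {
  anomaly: AnomalySketch;
  timerange: TimerangeSketch;
  checked: boolean;
  graph: G | null;
}

class AnomalyTrackerSketch<G> {
  private tracked = new Map<string, TrackedAnomaly<G>>();

  // Replace the tracked set; anomalies listed in checkedIds start checked.
  load(
    entries: Array<{ anomaly: AnomalySketch; timerange: TimerangeSketch }>,
    checkedIds: string[]
  ): void {
    this.tracked.clear();
    entries.forEach((entry) => {
      this.tracked.set(entry.anomaly.id, {
        anomaly: entry.anomaly,
        timerange: entry.timerange,
        checked: checkedIds.indexOf(entry.anomaly.id) !== -1,
        graph: null,
      });
    });
  }

  // Associate a graph with an anomaly and mark it checked.
  setGraph(id: string, graph: G): void {
    const t = this.tracked.get(id);
    if (t) {
      t.checked = true;
      t.graph = graph;
    }
  }

  // Detach and return the graph so the caller can remove it from the page.
  unsetGraph(id: string): G | null {
    const t = this.tracked.get(id);
    if (!t) return null;
    const graph = t.graph;
    t.checked = false;
    t.graph = null;
    return graph;
  }

  checkedIds(): string[] {
    return Array.from(this.tracked.values())
      .filter((t) => t.checked)
      .map((t) => t.anomaly.id);
  }
}
```

Unchecking a row keeps the anomaly and its time range in the tracker, so the graph can be recreated later without refetching.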
### The Anomalies Table
The `anomalies-table-sk` component (referenced as `anomaly-table`) displays the metadata for each regression.
- **Initial Selection:** On load, the page parses the URL to decide which anomalies should be "checked" and graphed immediately.
- **Event Handling:** When a user checks or unchecks a row, the table dispatches an `anomalies_checked` event. `report-page-sk` listens for this to dynamically mount or unmount graph components.
### Performance-Optimized Graphing
Graphs are rendered using multiple instances of `explore-simple-sk`. To prevent browser UI freezes when a large number of anomalies are reported (e.g., a massive regression affecting dozens of tests), the module implements **chunked loading**:
1. The module identifies all selected anomalies.
2. It loads them in parallel batches (defaulting to 5 at a time).
3. It waits for the `data-loaded` event from the current batch before starting the next.
Each graph is configured to show a "buffer" of one week before and after the anomaly's time range to help users determine if a regression has already been mitigated or if it represents a recurring pattern.
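
The chunked-loading steps above can be sketched as a small batching loop. This is a simplified model, assuming each graph load can be represented as a promise that resolves when its data has loaded; `chunk`, `loadInChunks`, and the `loadOne` callback are hypothetical names.

```typescript
// Split items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) {
    throw new RangeError('size must be at least 1');
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Load each batch in parallel, and wait for the whole batch to finish
// before starting the next one.
async function loadInChunks<T>(
  items: T[],
  loadOne: (item: T) => Promise<void>,
  chunkSize: number = 5
): Promise<void> {
  for (const batch of chunk(items, chunkSize)) {
    await Promise.all(batch.map((item) => loadOne(item)));
  }
}
```

Bounding the number of graphs that load at once keeps the main thread responsive while still using some parallelism within each batch.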
### Cross-Graph Synchronization
Since a report often contains multiple graphs representing the same time period or the same commit range, `report-page-sk` synchronizes user interactions across all visible `explore-simple-sk` instances.
- **X-Axis Scaling:** Toggling between "Commit" and "Date" on one graph updates all others.
- **Zooming/Panning:** Adjusting the range on the summary bar of one graph updates the range on all other graphs, keeping them in temporal alignment.
- **Even Spacing:** Toggling discrete vs. continuous x-axis spacing is synced across the entire report.
### Common Commits and Roll Recognition
If the instance uses integer-based commit numbers, the module calculates the intersection of commit ranges for all displayed anomalies.
- It displays a "Common Commits" section.
- **Roll Recognition:** It specifically identifies commits that look like dependency rolls (e.g., "Roll repo from X to Y").
- **Deep Linking:** It provides specialized logic to resolve the underlying "internal" commit URL for rolls, allowing users to jump directly to the source change in a sub-repository rather than just seeing the roll commit itself.
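
The range intersection and roll detection can be sketched as follows. Both helpers are illustrative: `commonCommitRange` assumes inclusive integer commit ranges, and `ROLL_PATTERN` is an assumed regular expression for subjects shaped like "Roll repo from X to Y", not the pattern the module actually uses.

```typescript
// Inclusive range of integer commit numbers.
interface CommitRange { start: number; end: number; }

// Intersect all ranges; returns null when there is no shared commit.
function commonCommitRange(ranges: CommitRange[]): CommitRange | null {
  if (ranges.length === 0) return null;
  let start = ranges[0].start;
  let end = ranges[0].end;
  ranges.forEach((r) => {
    start = Math.max(start, r.start);
    end = Math.min(end, r.end);
  });
  return start <= end ? { start: start, end: end } : null;
}

// Assumed shape of a dependency-roll commit subject.
const ROLL_PATTERN = /^Roll\s+(\S+)\s+from\s+(\S+)\s+to\s+(\S+)/;

interface RollInfo { repo: string; from: string; to: string; }

// Extract the rolled repository and revision pair, or null for other commits.
function parseRoll(subject: string): RollInfo | null {
  const m = ROLL_PATTERN.exec(subject);
  if (m === null) return null;
  return { repo: String(m[1]), from: String(m[2]), to: String(m[3]) };
}
```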
## Key Workflows
### Loading and Initialization
```text
URL Params -> Fetch /_/anomalies/group_report
|
v
Load AnomalyTracker <-------+
| |
+----------+----------+ |
| | |
Populate Table Find Commits |
(anomalies-table-sk) (lookupCids) |
| | |
+----------+----------+ |
| |
Load Graphs in Chunks <----+
(explore-simple-sk)
```
### Graph Toggle Workflow
```text
User Clicks Checkbox
|
v
anomalies-table-sk dispatches "anomalies_checked"
|
v
report-page-sk receives event
|
+----[ If Checked ]----> Create explore-simple-sk
| Initialize with Anomaly Query
| Append to #graph-container
|
+---[ If Unchecked ]---> Find graph in AnomalyTracker
Remove from DOM
Unset in Tracker
```
# Module: /modules/revision-info-sk
# revision-info-sk
The `revision-info-sk` module provides a specialized component for investigating performance anomalies associated with specific source control revisions. It serves as a bridge between a revision ID and the various performance tests (benchmarks, bots, and test cases) that may have been impacted around that point in time.
## Overview
When a regression or improvement is detected in the Skia Perf system, it is often tied to a range of revisions. This module allows users to input a specific revision ID and retrieve a comprehensive list of all anomalies and performance data associated with it.
Beyond simple display, the module facilitates deep-dive analysis by allowing users to select multiple performance traces and navigate to a multi-graph view. This allows for side-by-side comparison of different tests that were affected by the same revision.
## Key Components and Files
### revision-info-sk.ts
This file contains the core logic for the custom element. It handles several distinct responsibilities:
- **State Management & Persistence**: Uses `stateReflector` to sync the current revision ID with the URL query parameters. This ensures that a specific search state can be bookmarked or shared.
- **Data Fetching**: Communicates with the Perf backend (`/_/revision/`) to fetch `RevisionInfo` objects.
- **Multi-Graph Coordination**: Contains the logic to transform a set of selected revisions into a complex URL for the multi-graph explorer. This involves:
- Calculating the total time range (earliest start to latest end).
- Aggregating unique anomaly IDs to highlight them on the resulting graphs.
- Interacting with the shortcut service to create a shortened URL for the combined queries.
### revision-info-sk.scss
Defines the layout for the results table and the loading indicator. It ensures the spinner is positioned consistently relative to the text and that the data table is readable.
### revision-info-sk-demo.ts / .html
Provides a mock environment for the component. It uses `fetch-mock` to simulate backend responses, allowing for UI development and testing without a running Perf server.
## Design Decisions
### State Reflection
The choice to use `stateReflector` is driven by the need for deep-linking. Performance analysis is iterative; users often need to jump between the revision info page and graph pages. By keeping the `revisionId` in the URL, the component supports standard browser navigation (back/forward) and collaborative debugging.
### Multi-Graph Redirection
The implementation of `getMultiGraphUrl` handles the complexity of "joining" different performance queries. Since each anomaly might belong to a different master, bot, or test, the component generates a `GraphConfig` for each row. Because these combined queries can result in extremely long URLs that exceed browser limits, it uses `updateShortcut` to store the configuration on the server and use a short ID in the resulting link.
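
The aggregation step can be sketched as below. The row fields (`startTime`, `endTime`, `anomalyIds`, `query`) and the helper names are hypothetical simplifications of `RevisionInfo`, and the URL contains only the parameters shown in the workflow diagram of this module.

```typescript
// Simplified, assumed view of one selected row.
interface RevisionRowSketch {
  startTime: number;
  endTime: number;
  anomalyIds: string[];
  query: string;
}

interface MultiGraphSummary {
  begin: number;        // earliest start across the selection
  end: number;          // latest end across the selection
  anomalyIds: string[]; // unique IDs, in first-seen order
  queries: string[];    // one query per row, to be stored behind a shortcut
}

function summarizeSelection(rows: RevisionRowSketch[]): MultiGraphSummary | null {
  if (rows.length === 0) return null;
  const seen = new Set<string>();
  const anomalyIds: string[] = [];
  let begin = rows[0].startTime;
  let end = rows[0].endTime;
  rows.forEach((row) => {
    begin = Math.min(begin, row.startTime);
    end = Math.max(end, row.endTime);
    row.anomalyIds.forEach((id) => {
      if (!seen.has(id)) {
        seen.add(id);
        anomalyIds.push(id);
      }
    });
  });
  return { begin: begin, end: end, anomalyIds: anomalyIds, queries: rows.map((r) => r.query) };
}

// Build the short link once the server has returned a shortcut ID.
function multiGraphUrl(shortcutId: string, summary: MultiGraphSummary): string {
  return '/m/?shortcut=' + encodeURIComponent(shortcutId) +
    '&begin=' + summary.begin + '&end=' + summary.end;
}
```

Storing the per-row queries behind a shortcut keeps the final URL short no matter how many rows are selected.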
### Workflow: From Revision ID to Graphs
The following diagram illustrates the flow of data within the module:
```text
User Input (Revision ID)
|
v
[ stateReflector ] <------> URL (?rev=123)
|
v
[ getRevisionInfo() ] ----> API Request (/_/revision/)
|
|<---------------- Response (JSON)
v
[ Render Table ] ---------> User selects checkboxes
|
v
[ viewMultiGraph() ]
|
+--> [ getGraphConfigs() ]
+--> [ updateShortcut() ] ----> API Request (/_/shortcut/update)
| |
|<-------------------------------------+
v
[ Window Redirect ] ------> /m/?shortcut=ABC&begin=...&end=...
```
## User Interactions
1. **Search**: Users enter a revision ID (e.g., a git hash or a sequential number) and click "Get Revision Information".
2. **Filter/Select**: The resulting table shows anomalies with metadata (Bug ID, Master, Bot, etc.). Users can select individual rows or use the "Select All" toggle.
3. **Explore**: Clicking "View Selected Graph(s)" opens a new tab/page showing the actual telemetry data for the selected rows, with the relevant anomalies highlighted on the graph.
# Module: /modules/split-chart-menu-sk
The `split-chart-menu-sk` module provides a specialized UI component designed to facilitate the logical partitioning of performance data. It presents a list of available trace attributes (such as `benchmark`, `story`, or `subtest`) to the user, allowing them to select a criterion for splitting a unified data visualization into multiple, more granular charts.
### Design Philosophy and Data Integration
The module is built on the principle of reactive data binding via context consumption. Instead of requiring manual property passing, the component integrates directly with the Perf application's data layer:
- **Context Awareness:** It consumes `dataframeContext` and `dataTableContext`. This ensures that the menu options are always synchronized with the current dataset being viewed. If the underlying data changes, the list of attributes available for splitting updates automatically.
- **Decoupled Selection:** The component does not perform the chart splitting itself. Instead, it acts as a trigger in a larger workflow. When a user selects an attribute, it bubbles a custom event, allowing parent containers or layout managers to handle the complex logic of re-rendering or duplicating chart instances.
### Key Components and Responsibilities
#### `split-chart-menu-sk.ts`
This is the core implementation file. It manages the following responsibilities:
- **Attribute Extraction:** It utilizes the `getAttributes` utility from the `traceset` module to parse the `DataFrame` and extract a unique list of keys present in the trace set.
- **State Management:** It handles the toggle state (`menuOpen`) of the dropdown interface, ensuring a standard Material Design interaction pattern.
- **Event Dispatching:** Upon selecting a menu item, it dispatches a `split-chart-selection` event. This event carries the `SplitChartSelectionEventDetails` interface, containing the selected attribute string.
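
Attribute extraction can be pictured with the sketch below. It assumes trace IDs are structured keys of the form `,key=value,key=value,` and uses a hypothetical helper name; it is not the `getAttributes` implementation from the `traceset` module.

```typescript
// Collect the unique parameter names found in a list of structured trace keys,
// in first-seen order.
function attributesFromTraceKeys(traceKeys: string[]): string[] {
  const seen: string[] = [];
  traceKeys.forEach((key) => {
    key.split(',').forEach((pair) => {
      const eq = pair.indexOf('=');
      if (eq <= 0) return; // Skip empty segments and malformed pairs.
      const name = pair.slice(0, eq);
      if (seen.indexOf(name) === -1) {
        seen.push(name);
      }
    });
  });
  return seen;
}
```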
#### Layout and Styling
The component uses `@material/web` components (`md-outlined-button`, `md-menu`, and `md-menu-item`) to provide a consistent look and feel with the rest of the application. The styling in `split-chart-menu-sk.css.ts` focuses on ensuring the menu anchors correctly within relative layout containers and follows the system-level color palette.
### Workflow: From Data to Selection
The following diagram illustrates how data flows through the component to result in a user action:
```text
[ Data Layer ] [ split-chart-menu-sk ] [ Parent Component ]
| | |
| DataFrame (via context) ------------>| |
| |-- extract attributes |
| |-- render md-menu |
| | |
| User Interaction <----------| |
| (Select "benchmark") | |
| |-- dispatch CustomEvent |
| | "split-chart-selection" ------>|
| | |-- Handle Split
| | |-- Update Layout
```
### Note on Deprecation
While functional, this component is marked as **deprecated** in favor of "Split Checkboxes." This suggests a transition in the UI design from a single-selection dropdown model to a multi-selection or checkbox-based model for defining chart splits. External modules should use this component with the understanding that its replacement offers different interaction semantics.
# Module: /modules/subscription-table-sk
The `subscription-table-sk` module provides a specialized custom element designed to display metadata and configuration details for Perf subscriptions and their associated anomaly detection alerts. It serves as a read-only summary view, typically used in dashboards or report pages where users need to verify the settings governing performance monitoring and bug filing.
### Design and Data Flow
The component is built using `LitElement` and follows a reactive property model. It accepts two primary data structures: a `Subscription` object containing bug-filing metadata (owner, component, priority, etc.) and an array of `Alert` objects defining the statistical parameters for anomaly detection.
When data is loaded via the `subscription` and `alerts` properties, the component renders a summary "details" card. The detailed alerts table is hidden by default to keep the UI clean, but can be expanded by the user to inspect technical detection parameters like the algorithm (e.g., `stepfit`, `mannwhitneyu`), radius, and "interestingness" thresholds.
```text
[ Data Source ] -> ( Subscription & Alert Objects )
|
v
+-----------------------------+
| subscription-table-sk |
|-----------------------------|
| [ Summary Card ] | <--- Formats emails, components,
| | and Gerrit revisions as links.
| |
| [ Toggle Button ] | <--- Manages "showAlerts" state.
| |
| [ Hidden/Visible Table ] | <--- Renders Alert params and
+-----------------------------+ uses <paramset-sk> for queries.
```
### Key Responsibilities
#### Subscription Visualization
The module is responsible for transforming raw subscription JSON into a user-friendly summary. It implements specific formatting logic for:
- **Revisions:** Links the configuration revision hash directly to the internal Git source host where the monitoring configuration is stored.
- **Bug Components:** Transforms numeric component IDs into direct search links for the Chromium Issue Tracker, filtered by open issues within that component.
- **Metadata:** Aggregates lists of CC'd emails, hotlists, and severity/priority levels into a compact display.
#### Alert Configuration Table
When expanded, the component displays a dense table of alert parameters. A key implementation detail is the integration with `paramset-sk`. Since alert queries are often complex URL-encoded strings (e.g., `source_type=image&sub_result=min_ms`), the component utilizes the `toParamSet` utility to parse these strings into structured key-value pairs, which are then rendered by the `paramset-sk` element for better readability.
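
To illustrate the parsing step, here is a minimal sketch of turning a URL-encoded query into a key-to-values structure. `queryToParamSet` is a hypothetical stand-in written for this example; the real `toParamSet` utility may differ in signature and edge-case handling.

```typescript
// Decode one URL-encoded component, treating '+' as a space.
function decodeComponent(s: string): string {
  return decodeURIComponent(s.replace(/\+/g, ' '));
}

// Parse "k1=v1&k1=v2&k2=v3" into { k1: ['v1', 'v2'], k2: ['v3'] }.
function queryToParamSet(query: string): Record<string, string[]> {
  // A null-prototype object avoids clashes with inherited names like "constructor".
  const out: Record<string, string[]> = Object.create(null);
  query.split('&').forEach((pair) => {
    if (pair === '') return;
    const eq = pair.indexOf('=');
    const key = decodeComponent(eq === -1 ? pair : pair.slice(0, eq));
    const value = decodeComponent(eq === -1 ? '' : pair.slice(eq + 1));
    let values = out[key];
    if (values === undefined) {
      values = [];
      out[key] = values;
    }
    if (values.indexOf(value) === -1) {
      values.push(value);
    }
  });
  return out;
}
```

For example, `source_type=image&sub_result=min_ms` becomes two keys with one value each, which is the structure a paramset display needs.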
#### State Management
The visibility of the alerts table is managed via internal `@state`. Whenever a new subscription is loaded (via property assignment or the `load()` method), the table visibility is reset to `false`. This ensures that switching between different subscriptions provides a consistent initial view.
### Components and Files
- **`subscription-table-sk.ts`**: The main element logic. It handles property updates, state transitions for the toggle button, and the template generation for both the summary card and the alert table.
- **`subscription-table-sk.scss`**: Provides scoped styling, specifically handling the layout of the details card and ensuring the configuration table adheres to a compact, "small" font-size suitable for technical parameters.
- **`infra-sk/modules/paramset-sk`**: An external dependency used within the table to render the breakdown of the query parameters that define what data the alert is monitoring.
# Module: /modules/telemetry
# Telemetry Module
The telemetry module provides a centralized mechanism for capturing and reporting frontend performance metrics and user interaction data. It is designed to provide visibility into the health and performance of the application without significantly impacting network performance or reliability.
## Overview
The module facilitates the tracking of two primary types of data:
- **Counters:** Used to track the frequency of specific events (e.g., page visits, data fetch failures, or user actions like triaging).
- **Summaries:** Used to record numerical values, typically durations, to measure performance (e.g., the time it takes to load a graph or a specific table).
Rather than sending a network request for every individual event, which would be chatty and inefficient, the module buffers events locally and flushes them in batches.
## Design Decisions
### Efficient Batching
To minimize the overhead on the user's browser and the backend, metrics are held in a local buffer.
- **Time-based flushing:** The buffer is automatically flushed every 5 seconds. This ensures that data is reported relatively close to real-time while allowing multiple events to be grouped into a single POST request.
- **Network reliability:** If a batch fails to send (e.g., due to a temporary network glitch), the module catches the error and re-queues the metrics to be attempted again in the next cycle.
### Data Integrity and Retention
A common challenge with frontend telemetry is losing data when a user closes a tab or navigates away before a scheduled flush occurs.
- **Visibility Listening:** The module listens for the `visibilitychange` event. When the page state becomes `hidden`, it immediately triggers a flush of all pending metrics, bypassing the 5-second timer.
- **Buffer Management:** To prevent unbounded memory growth in extreme scenarios, the buffer is capped at 1,000 metrics. If this limit is exceeded, the module uses a First-In-First-Out (FIFO) strategy, discarding the oldest metrics to make room for new ones.
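
The buffering rules described in this section (a capped buffer, oldest-first eviction, and re-queueing after a failed send) can be sketched as follows. The class and function names are hypothetical, the timer and `visibilitychange` wiring are omitted, and the transport is injected so the example runs without a network.

```typescript
interface MetricSketch {
  name: string;
  value: number;
  tags: Record<string, string>;
}

class MetricBufferSketch {
  private buffer: MetricSketch[] = [];
  private readonly max: number;

  constructor(max: number = 1000) {
    this.max = max;
  }

  get size(): number {
    return this.buffer.length;
  }

  add(metric: MetricSketch): void {
    this.buffer.push(metric);
    this.trim();
  }

  // Hand over the current batch and start a fresh buffer, so metrics recorded
  // while a send is in flight are neither lost nor sent twice.
  take(): MetricSketch[] {
    const batch = this.buffer;
    this.buffer = [];
    return batch;
  }

  // Put a failed batch back in front of any newer metrics.
  requeue(batch: MetricSketch[]): void {
    this.buffer = batch.concat(this.buffer);
    this.trim();
  }

  // Enforce the cap by discarding the oldest entries first.
  private trim(): void {
    if (this.buffer.length > this.max) {
      this.buffer = this.buffer.slice(this.buffer.length - this.max);
    }
  }
}

// One flush cycle: send the batch, and re-queue it if the send fails.
async function flushSketch(
  buffer: MetricBufferSketch,
  send: (batch: MetricSketch[]) => Promise<void>
): Promise<void> {
  const batch = buffer.take();
  if (batch.length === 0) return;
  try {
    await send(batch);
  } catch {
    buffer.requeue(batch);
  }
}
```

A caller would invoke `flushSketch` from both the periodic timer and the page-hidden handler, so the two triggers share one code path.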
## Key Components
### `telemetry.ts`
This file contains the core logic and exports a singleton instance of the `Telemetry` class. This singleton ensures that all parts of the application share the same buffer and timing cycle.
- **`CountMetric` & `SummaryMetric` Enums:** These serve as a "source of truth" for all supported metric names. Adding a new metric requires updating these enums, which provides type safety across the codebase.
- **`increaseCounter(name, tags)`:** The primary method for incrementing a counter. It automatically sets the value to `1`.
- **`recordSummary(name, value, tags)`:** The method used for performance timing or recording specific sizes/counts.
- **`sendBufferedMetrics()`:** An internal asynchronous method that handles the `fetch` request to the `/_/fe_telemetry` endpoint. It handles the cloning and clearing of the buffer to prevent race conditions during the network request.
### Workflows
#### Metric Submission Lifecycle
The following diagram illustrates how an event triggered by a user eventually reaches the backend.
```text
User Action / Event
|
v
telemetry.increaseCounter() <-- Application code calls this
|
+-----> [ Buffer (Array) ]
|
| (Wait 5s OR Visibility Hidden)
v
+----------+-----------+
| sendBufferedMetrics |
+----------+-----------+
|
v
POST /_/fe_telemetry --> [ Backend Server ]
|
{Success? Yes} ----> [ Clear local copy ]
|
{Success? No } ----> [ Re-queue metrics ]
```
## Integration Guide
To instrument a new part of the application:
1. **Define:** Add the new metric key to the appropriate enum in `telemetry.ts`.
2. **Call:** Import the `telemetry` singleton and call the relevant method.
3. **Contextualize:** Use the optional `tags` object to provide dimensions (e.g., specific sub-component names or error types) that allow for more granular filtering in dashboards.
# Module: /modules/test-picker-sk
The `test-picker-sk` module provides a specialized UI component for exploring and selecting performance traces. It enforces a hierarchical selection process, guiding users through large datasets by dynamically fetching valid options for subsequent parameters based on previous choices.
### Overview
The primary goal of `test-picker-sk` is to ensure users build valid queries for the Perf database. Rather than presenting all possible parameters at once, which could lead to empty results, the component reveals fields sequentially.
As a user selects values in one field (e.g., "Benchmark"), the component queries the backend to find available values for the next parameter in the hierarchy (e.g., "Bot"). This "drill-down" approach prevents invalid combinations and provides immediate feedback on the number of matching traces found.
### Design Decisions
#### Hierarchical Filtering
The component relies on an ordered list of parameters (e.g., `['benchmark', 'bot', 'test']`). This order is critical because it defines the dependency chain for data fetching. When a value is changed at index $i$ in the hierarchy, all fields at index $i+1$ and greater are invalidated and removed. This ensures that the state of the picker always represents a valid path through the data tree.
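
The invalidation rule can be sketched as a pure function over the field list. `FieldSketch` and `applySelection` are hypothetical names; the real component also removes the corresponding DOM elements and fetches options for the next parameter.

```typescript
interface FieldSketch {
  param: string;
  selected: string[];
}

// Apply a new selection at `index` and drop every field after it, since
// their options depended on the previous selection.
function applySelection(fields: FieldSketch[], index: number, selected: string[]): FieldSketch[] {
  if (index < 0 || index >= fields.length) {
    throw new RangeError('index out of range');
  }
  const kept = fields.slice(0, index);
  kept.push({ param: fields[index].param, selected: selected });
  return kept;
}
```

Returning a new array, instead of mutating the input, makes it easy to compare the picker state before and after a change.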
#### Trace Count & Plotting Guardrails
To prevent performance degradation on both the client and server, the component enforces a `PLOT_MAXIMUM` (defaulting to 200 traces).
- **Auto-plotting:** If a graph is already active and the selection results in fewer than the maximum allowed traces, changes are automatically pushed to the graph.
- **Manual Plotting:** If no graph is active, the "Plot" button is enabled only when the match count is within a safe range ($0 < \text{count} \leq \text{PLOT\_MAXIMUM}$).
#### Handling "Missing" Values
In the Perf database, traces may not have a value for every possible parameter. The component maps these empty strings to a "Default" label in the UI. Internally, these are translated to a sentinel value (`__missing__`) when constructing queries, allowing users to explicitly select traces that lack a specific attribute.
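
The label and sentinel mapping can be sketched like this. The constant values come from the description above; the helper names and the `SelectionSketch` type are hypothetical, and the sketch assumes that no real parameter value is literally "Default".

```typescript
const DEFAULT_OPTION_LABEL = 'Default';
const MISSING_SENTINEL = '__missing__';

// Backend value -> label shown in the UI.
function toLabel(value: string): string {
  return value === '' ? DEFAULT_OPTION_LABEL : value;
}

// UI label -> value used when constructing a query.
function toQueryValue(label: string): string {
  return label === DEFAULT_OPTION_LABEL ? MISSING_SENTINEL : label;
}

interface SelectionSketch {
  param: string;
  labels: string[];
}

// Build a URL-formatted query, repeating the key for multi-value selections.
function buildQuery(selections: SelectionSketch[]): string {
  const parts: string[] = [];
  selections.forEach((s) => {
    s.labels.forEach((label) => {
      parts.push(encodeURIComponent(s.param) + '=' + encodeURIComponent(toQueryValue(label)));
    });
  });
  return parts.join('&');
}
```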
### Key Components and Workflow
#### Data Management (`FieldInfo`)
The internal state is managed via an array of `FieldInfo` objects. Each object tracks:
- The `PickerFieldSk` element instance.
- The parameter name and current selections.
- Event listeners for value changes and "split-by" toggles.
#### Workflow: Adding a Field
When a user makes a selection that narrows the results, the component initiates the following process:
```text
User Selects Value
        |
        v
callNextParamList() ----> POST /_/nextParamList/ (with current query)
        |                   |
        |                   v
        |<---- Returns {paramset: {next_param: [options]}, count: N}
        v
addChildField()
        |
        |--> Create new PickerFieldSk
        |--> Populate with options (mapping '' to DEFAULT_OPTION_LABEL)
        |--> Attach 'value-changed' listeners
        |--> Update match count UI
```
#### Advanced Logic: Conditional Defaults
The component supports complex "trigger" rules through `applyConditionalDefaults`. This allows the UI to automatically pre-select values in subsequent fields based on specific selections in earlier ones. For example, selecting a specific `metric` might automatically select a preferred `stat` (like 'avg'), streamlining the user experience for common workflows.
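A trigger rule of this kind can be sketched as data plus a small lookup. The source does not describe the rule format used by `applyConditionalDefaults`, so the `ConditionalDefaultRule` shape and the `conditionalDefaults` function below are entirely hypothetical.

```typescript
// Hypothetical rule shape: when `triggerParam` has one of `triggerValues`
// selected, pre-select `defaultValues` for `defaultParam`.
interface ConditionalDefaultRule {
  triggerParam: string;
  triggerValues: string[];
  defaultParam: string;
  defaultValues: string[];
}

// Returns the pre-selections implied by the current selections.
// A parameter the user has already filled in is never overridden.
function conditionalDefaults(
  rules: ConditionalDefaultRule[],
  selections: Map<string, string[]>
): Map<string, string[]> {
  const result = new Map<string, string[]>();
  for (const rule of rules) {
    const chosen = selections.get(rule.triggerParam) ?? [];
    const triggered = chosen.some((v) => rule.triggerValues.includes(v));
    const userChoice = selections.get(rule.defaultParam) ?? [];
    if (triggered && userChoice.length === 0 && !result.has(rule.defaultParam)) {
      result.set(rule.defaultParam, [...rule.defaultValues]);
    }
  }
  return result;
}

const exampleRules: ConditionalDefaultRule[] = [
  {
    triggerParam: 'metric',
    triggerValues: ['timeNs'],
    defaultParam: 'stat',
    defaultValues: ['avg'],
  },
];

// Selecting metric=timeNs implies a pre-selection of stat=avg.
const impliedDefaults = conditionalDefaults(
  exampleRules,
  new Map([['metric', ['timeNs']]])
);
```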
#### Split-By Functionality
Users can "split" the graph by a specific parameter. The component ensures only one parameter is split at a time. If a user enables "split" on a field, the component disables the split checkbox on all other fields and dispatches a `split-by-changed` event to notify the parent application to adjust the graph's grouping logic.
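The single-split rule can be sketched as a state transition. The `SplitField` shape and `setSplit` name are hypothetical, and re-enabling the other checkboxes when the split is turned off is an assumption; the text above only describes the enabling case.

```typescript
// Hypothetical per-field split state.
interface SplitField {
  param: string;
  split: boolean; // is the graph split by this parameter?
  splitDisabled: boolean; // is this field's split checkbox disabled?
}

// Enable or disable the split on `param`, keeping at most one split active.
function setSplit(
  fields: SplitField[],
  param: string,
  enabled: boolean
): SplitField[] {
  return fields.map((f) =>
    f.param === param
      ? { param: f.param, split: enabled, splitDisabled: false }
      : { param: f.param, split: false, splitDisabled: enabled }
  );
}

const unsplitFields: SplitField[] = [
  { param: 'benchmark', split: false, splitDisabled: false },
  { param: 'bot', split: false, splitDisabled: false },
];

// Splitting by 'bot' disables the checkbox on 'benchmark'.
const splitByBot = setSplit(unsplitFields, 'bot', true);
```

In the real component the same transition would be followed by dispatching `split-by-changed` so the parent can regroup the graph.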
### Key Files
- `test-picker-sk.ts`: The main logic for the hierarchical picker, state management, and event handling.
- `test-picker-sk.scss`: Styles the layout, specifically the "drill-down" field container and the match count indicator.
- `test-picker-sk_po.ts`: Page Object for automated testing, providing methods to interact with the fields and wait for async loading states.
- `test-picker-sk-demo.ts`: Provides a mock environment with a simulated backend (`/_/nextParamList/`) used for development and visual testing.
### Events
- `plot-button-clicked`: Dispatched when the user clicks "Plot". Detail contains the full query string.
- `add-to-graph` / `remove-trace`: Dispatched during "Auto-add" mode to incrementally update an existing visualization.
- `split-by-changed`: Dispatched when the "Split" toggle is flipped on any field.
# Module: /modules/tests
# Perf Integration Tests
The `/modules/tests` module contains end-to-end (E2E) integration tests for the Perf application. These tests leverage Puppeteer to simulate user interactions and verify the visual and functional integrity of the application's core pages.
## Overview
The primary goal of this module is to provide a "sanity check" for the production-facing UI. Unlike unit tests that focus on individual components, these tests ensure that the integration between the frontend and the backend (or a mock representation of it) remains stable.
The tests are designed to be "perf-blocking," meaning they represent the critical paths a user takes. If these tests fail, it indicates a high probability that a real user will encounter a broken experience.
## Implementation Strategy
The module uses a specialized testing infrastructure built around Puppeteer and a mock server:
- **Mock Backend Integration**: Tests utilize `frontend_mock_server` as the `sk_demo_page_server`. This allows the tests to run against a predictable and stable backend environment, decoupling UI verification from actual database state or external network flakiness.
- **TestBed Utility**: Tests use a `loadCachedTestBed` pattern. This optimizes execution by reusing browser instances where possible, reducing the overhead of spinning up a fresh Puppeteer instance for every test suite.
- **Visual Regression Prevention**: A significant portion of the logic is dedicated to taking screenshots (via `takeScreenshot`). These screenshots serve as a baseline to prevent accidental regressions in layout, CSS, or initial rendering.
## Key Components and Responsibilities
### Critical Path Sanity (initial_loading_puppeteer_test.ts)
This component is responsible for verifying that the primary entry points of the application load correctly. It targets:
- `/e` (Explore Page)
- `/m` (Multigraph Page)
- `/a` (Regressions/Alerts Page)
It includes logic to handle common UI overlays, such as cookie consent banners, ensuring that screenshots represent the actual application state rather than transient UI elements.
### Functional Interaction (explore_multi_page_puppeteer_test.ts)
Beyond simple loading, this component tests specific user workflows within the Multigraph and Explore views. It validates complex UI components (like Vaadin multi-select combo boxes) to ensure that event listeners, data binding, and dropdown behaviors are functioning correctly in a browser environment.
## Common Workflows
### Test Lifecycle Execution
The following diagram illustrates how a typical test in this module interacts with the infrastructure:
```text
[ Test Suite Start ]
         |
         V
[ loadCachedTestBed() ] <---- Reuses browser instance for efficiency
         |
         V
[ beforeEach() ] -----------+
         |                  |
         |          Set Viewport Size
         |          Navigate to Target URL (e.g., /m)
         |                  |
         V                  |
[ it() Test Case ] <--------+
         |
         +---> [ Interaction ] (Click, Type, Wait for Selector)
         |
         +---> [ Screenshot ] (Capture state for visual diffing)
         |
         V
[ Test Suite End ]
```
### Handling External UI (Cookie Banners)
Because these tests aim to simulate real user visits, they must account for global UI elements that might obscure the application. The `acceptCookieBanner` helper is a design choice to ensure that "noise" from the base platform doesn't cause false positives in screenshot comparisons or block element visibility during functional tests.
# Module: /modules/themes
### Overview
The `themes` module serves as the centralized styling foundation for the project. Rather than defining an entirely new design system from scratch, it acts as a customization layer that bridges the project's specific aesthetic requirements with the base design tokens provided by the shared `infra-sk` infrastructure.
### Design Philosophy: "Deltas over Definitions"
The primary design principle for this module is to maintain a minimal footprint. It is structured to follow a "delta-based" approach, where the styles defined here only represent deviations from the global shared themes or essential overrides for base HTML elements.
This approach was chosen to:
1. **Ensure Consistency:** By importing and extending `infra-sk/themes`, the project automatically inherits updates to the core design system (such as color palettes, spacing units, and typography) without manual intervention.
2. **Reduce Redundancy:** By explicitly forbidding the re-definition of existing styles, the module prevents "CSS bloat" and ensures that the source of truth for standard UI components remains in the infrastructure layer.
3. **Centralize Global Resets:** It provides a single location for high-level layout adjustments that affect the entire application environment, such as document margins and specialized spacing.
### Key Components and Responsibilities
#### Theme Extension and External Assets
The module is responsible for pulling in the necessary external typography and iconography. Currently, it integrates the **Material Icons** library, making these glyphs globally available across all web components in the project. It also serves as the bridge to the shared SASS library, ensuring that variables and mixins from the infrastructure are accessible to local stylesheets.
#### Global Layout Normalization
The `themes.scss` file handles the "Sanitization" or "Reset" logic for the application's root. It enforces a consistent `body` configuration (zeroing out default browser margins/padding) to ensure that top-level layout components (like nav bars or sidebars) align perfectly with the viewport boundaries.
#### Application-Specific Layout Hacks
The module houses structural utilities that facilitate specific UX behaviors. A notable example is the `#bottom-spacer` implementation.
**Workflow: Scroll Buffer Management**
```text
[ Viewport ]
|-------------------|
| Content Area      |
|                   |
| [Element A]       |
| [Element B]       |
|                   |
|-------------------| <--- End of content
| #bottom-spacer    | <--- Provides 500px of "breathing room"
|-------------------|
```
The inclusion of a large bottom spacer is a deliberate implementation choice to ensure that users can scroll past the final pieces of interactive content, preventing UI elements (like floating action buttons or footer overlays) from obscuring the last items in a list or terminal output.
### Technical Integration
The module is exposed as a `sass_library`. This allows other modules in the project to depend on `themes_sass_lib`, ensuring that the global styles and infrastructure dependencies are bundled correctly during the Sass compilation process. By depending on `//infra-sk:themes_sass_lib`, it ensures that the dependency graph correctly resolves the cascading nature of the styles.
# Module: /modules/trace-details-formatter
# Trace Details Formatter
The `trace-details-formatter` module provides a standardized way to translate internal trace data (parameters and keys) into human-readable strings and, conversely, to reconstruct query parameters from those strings.
Trace IDs in Perf are often complex sets of key-value pairs. Depending on the specific domain (e.g., standard Skia traces vs. Chrome-specific performance benchmarks), the desired visual representation of these traces and the logic required to query them varies significantly. This module abstracts those differences behind a common interface.
## Key Concepts
### TraceFormatter Interface
The module defines a central `TraceFormatter` interface that ensures consistency across different formatting implementations:
- **`formatTrace(params: Params)`**: Converts a dictionary of trace parameters into a displayable string.
- **`formatQuery(trace: string)`**: Parses a formatted trace string back into a URL query string compatible with the Perf backend.
### Implementation Logic
The module selects an implementation at runtime based on the global `window.perf.trace_format` configuration.
#### DefaultTraceFormatter
Used when no specific format is defined. It provides a fallback by simply returning the unique Trace ID (the joined key-value pairs). It does not support converting strings back into queries.
#### ChromeTraceFormatter
Designed specifically for Chrome's hierarchical performance data. It handles the mapping between the legacy Chrome "test path" structure and Skia's parameter-based system.
- **Fixed Hierarchy**: It enforces a specific order of keys: `master`, `bot`, `benchmark`, `test`, and three levels of `subtest`.
- **Path Joining**: `formatTrace` produces a slash-delimited string (e.g., `master/bot/benchmark/...`).
- **Query Reconstruction**: `formatQuery` splits these paths back into their constituent keys.
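The path joining and splitting described in the list above can be sketched as two small functions. This is an illustration under assumptions: the key names `subtest_1` to `subtest_3` are a guess at how the "three levels of `subtest`" are spelled, and the `key=value` query layout is assumed rather than taken from the source.

```typescript
// Fixed key order; the three subtest key names are assumed.
const CHROME_KEYS = [
  'master', 'bot', 'benchmark', 'test',
  'subtest_1', 'subtest_2', 'subtest_3',
];

// Join the values of the known keys, in order, with '/'. Stops at the first
// absent key so that positions in the path stay aligned with CHROME_KEYS.
function formatTrace(params: { [key: string]: string | undefined }): string {
  const parts: string[] = [];
  for (const key of CHROME_KEYS) {
    const value = params[key];
    if (value === undefined || value === '') break;
    parts.push(value);
  }
  return parts.join('/');
}

// Split a path back into key=value pairs by position.
function formatQuery(trace: string): string {
  if (trace === '') return '';
  return trace
    .split('/')
    .slice(0, CHROME_KEYS.length)
    .map((part, i) => `${CHROME_KEYS[i]}=${encodeURIComponent(part)}`)
    .join('&');
}

const examplePath = formatTrace({
  master: 'ChromiumPerf',
  bot: 'mac-m1',
  benchmark: 'speedometer',
  test: 'Total',
});
// examplePath is "ChromiumPerf/mac-m1/speedometer/Total"
const exampleQuery = formatQuery(examplePath);
// exampleQuery is "master=ChromiumPerf&bot=mac-m1&benchmark=speedometer&test=Total"
```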
## Chrome-Specific Statistics Mapping
A significant responsibility of this module is handling the transition from Chromeperf-style "test paths" to Skia's "stat" parameters. In the Chrome ecosystem, statistical aggregations (like averages or maximums) are often encoded as suffixes in the test name.
When `enable_skia_bridge_aggregation` is active, the `ChromeTraceFormatter` automatically extracts these suffixes and maps them to standard Skia `stat` values:
| Suffix | Skia Stat Value |
| :--------------------------- | :-------------- |
| `avg` | `value` |
| `std` | `error` |
| `max`, `min`, `count`, `sum` | (remains same) |
If a test name lacks a known suffix, the formatter defaults the `stat` parameter to `value` to prevent the system from accidentally loading all available statistical variations (which would result in 6x more data being fetched than intended).
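The suffix table and the fallback can be sketched as a lookup. The mapping values come from the table above; the assumption that the suffix is separated from the test name by an underscore (for example `frame_times_avg`), and the `splitStat` name, are illustrative only.

```typescript
// Suffix -> Skia stat value, as listed in the table above.
const STAT_BY_SUFFIX = new Map<string, string>([
  ['avg', 'value'],
  ['std', 'error'],
  ['max', 'max'],
  ['min', 'min'],
  ['count', 'count'],
  ['sum', 'sum'],
]);

// Split a test name into its base name and stat. Assumes the suffix follows
// the last underscore. Unknown or missing suffixes fall back to 'value'.
function splitStat(testName: string): { test: string; stat: string } {
  const idx = testName.lastIndexOf('_');
  if (idx > 0) {
    const stat = STAT_BY_SUFFIX.get(testName.slice(idx + 1));
    if (stat !== undefined) {
      return { test: testName.slice(0, idx), stat };
    }
  }
  return { test: testName, stat: 'value' };
}

const avgExample = splitStat('frame_times_avg'); // test 'frame_times', stat 'value'
const stdExample = splitStat('frame_times_std'); // test 'frame_times', stat 'error'
const plainExample = splitStat('frame_times'); // no known suffix, stat 'value'
```

Using a `Map` rather than a plain object avoids accidental matches against inherited property names such as `constructor`.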
## Workflow
The following diagram illustrates how the module resolves a formatter and processes data:
```
[ Global Config ]
        |
        | window.perf.trace_format
        v
[ GetTraceFormatter() ]
        |
        +---- "chrome" ----> [ ChromeTraceFormatter ]
        |                          |
        |                          +-- formatTrace: join(keys, '/')
        |                          +-- formatQuery: split('/') + Stat Mapping
        |
        +---- (default) ---> [ DefaultTraceFormatter ]
                                   |
                                   +-- formatTrace: makeKey(params)
```
## Key Files
- `traceformatter.ts`: Contains the `TraceFormatter` interface, the concrete implementations for Chrome and Default styles, and the factory function `GetTraceFormatter`.
- `traceformatter_test.ts`: Validates the logic for path splitting and the conditional application of statistical mappings based on global window settings.
# Module: /modules/triage-menu-sk
# triage-menu-sk
The `triage-menu-sk` module provides a unified interface for managing and triaging performance anomalies in bulk. It serves as a central control point within the Perf UI, allowing users to categorize detected regressions or improvements by filing bugs, associating them with existing reports, ignoring them, or "nudging" their detected revision range.
## Design and Implementation Choices
### Centralized Triage Orchestration
Instead of implementing bug-filing logic directly, `triage-menu-sk` acts as an orchestrator. It encapsulates and manages two specialized dialog components: `new-bug-dialog-sk` and `existing-bug-dialog-sk`. This separation of concerns allows the menu to focus on the high-level workflow (selecting anomalies and choosing an action) while delegating complex form handling and bug-tracker integration to the specific dialog modules.
### The "Nudge" Workflow
One unique feature of this module is the "Nudge" functionality. Anomalies are detected over a revision range, but the detection might not perfectly align with the actual point of regression. The "Nudge" buttons (typically ranging from -2 to +2) allow users to shift the anomaly's revision boundaries.
- **Visual Feedback**: When a nudge is performed, the component updates the `AnomalyData` (coordinates `x` and `y`) locally and dispatches an event so the parent chart can immediately reflect the shift without a full page reload.
- **Backend Sync**: Each nudge triggers a POST request to `/_/triage/edit_anomalies` with the `NUDGE` action, ensuring the database reflects the refined revision range.
### State and Event-Driven Architecture
The component relies heavily on an event-driven model to maintain synchronization with the rest of the application:
- **`anomaly-changed` Event**: This is the primary output of the module. Whenever an anomaly is ignored, nudged, or associated with a bug, this event is dispatched. It carries the updated anomaly details and trace IDs, signaling to parent components (like graphs or tables) that they need to invalidate their caches and re-render.
- **Property-Based Configuration**: The menu's state is driven by the `anomalies` and `traceNames` properties. By calling `setAnomalies()`, a parent component can dynamically update which data points the menu is currently acting upon.
## Key Components and Responsibilities
### triage-menu-sk.ts
The core logic of the module. It handles:
- **Triage Actions**:
  - **New Bug**: Forwards the request to the `new-bug-dialog-sk`.
  - **Existing Bug**: Triggers a fetch for recently associated bugs and opens the `existing-bug-dialog-sk`.
  - **Ignore**: Sends an `IGNORE` request to the backend. It sets the `bug_id` to `-2` (a convention for ignored anomalies) and displays a confirmation toast.
- **API Communication**: Manages POST requests to `/_/triage/edit_anomalies`. This endpoint is polymorphic, handling `IGNORE`, `RESET`, and `NUDGE` actions based on the provided body.
- **Telemetry**: Integrates with the `telemetry` module to track which triage actions are most frequently taken by users.
### Triage Actions Workflow
```
User Interaction
       |
       V
[ Triage Menu ] --------------------------+
       |                                  |
       | (Action: New/Existing)           | (Action: Ignore/Nudge)
       V                                  V
[ Dialog Components ]            [ Backend API Call ]
(new-bug-dialog-sk)              (/_/triage/edit_anomalies)
(existing-bug-dialog-sk)                  |
       |                                  |
       +------------> [ Success ] <-------+
                           |
                           V
             [ Dispatch anomaly-changed ]
              [ Show Toast / Update UI ]
```
### NudgeEntry (Class)
A data structure used to represent potential "nudge" states. It maps a display index (e.g., +1) to specific revision ranges (`start_revision`, `end_revision`) and UI coordinates (`x`, `y`). This allows the menu to render a sequence of buttons that correspond to valid shifts in the data.
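A list of nudge candidates could be generated as sketched below. This is a simplified illustration: it assumes a nudge shifts both revision boundaries by a whole number of revisions, and it omits the `x` and `y` coordinates. The field name `display_index` and the `buildNudgeList` function are hypothetical; the real component may derive the ranges from neighbouring data points instead.

```typescript
// Simplified nudge entry; the UI coordinates (x, y) are left out.
interface NudgeEntry {
  display_index: number; // label shown on the button, e.g. -1 or +2
  start_revision: number;
  end_revision: number;
}

// Build entries for offsets -maxOffset..+maxOffset around the current range.
function buildNudgeList(
  startRevision: number,
  endRevision: number,
  maxOffset: number = 2
): NudgeEntry[] {
  const entries: NudgeEntry[] = [];
  for (let i = 0; i <= 2 * maxOffset; i++) {
    const offset = i - maxOffset;
    entries.push({
      display_index: offset,
      start_revision: startRevision + offset,
      end_revision: endRevision + offset,
    });
  }
  return entries;
}

// Five candidates: offsets -2, -1, 0, +1, +2 around revisions 100..101.
const nudgeCandidates = buildNudgeList(100, 101);
```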
### triage-menu-sk_po.ts
Provides the Page Object for testing. It abstracts the internal DOM structure, including the nested dialogs and the ignore toast, allowing Puppeteer tests to interact with the triage flow without being coupled to the specific HTML structure or CSS classes.
# Module: /modules/triage-page-sk
# triage-page-sk
The `triage-page-sk` module provides a comprehensive dashboard for reviewing and triaging performance regressions in the Perf system. It allows users to scan a matrix of commits and alerts, visualize clusters of data, and record triage decisions (e.g., positive, negative, or untriaged).
## High-level Overview
The page is designed as a high-density "triage queue." It presents a grid where rows represent commits and columns represent different configured alerts. This layout allows a developer or performance engineer to quickly identify which commits caused regressions across multiple benchmarks or metrics.
The primary workflow involves:
1. **Discovery**: Filtering the view to find "Untriaged" regressions within a specific time range.
2. **Investigation**: Clicking on a regression status to open a detailed view of the data cluster.
3. **Action**: Assigning a status to the regression, which may trigger automated bug reporting.
## Key Components and Responsibilities
### State Management and Data Fetching
The component uses `stateReflector` to sync its internal state (time range, subset filters, alert filters) with the URL. This ensures that triage views can be bookmarked or shared.
- **`updateRange()`**: This is the core data-fetching method. It sends a `RegressionRangeRequest` to the `/_/reg/` endpoint. The server responds with a `RegressionRangeResponse` containing the headers (alerts) and the table data (commits and their regression status).
- **`calc_all_filter_options()`**: This logic processes the categories returned by the server to populate the "Which alerts to display" filter, allowing users to focus on specific teams or components.
### The Triage Grid
The grid is constructed dynamically based on the server's response:
- **Rows**: Each row uses `commit-detail-sk` to show the commit hash, author, and message.
- **Columns**: Headers represent individual alerts. Columns are split into "Low" and "High" sub-columns if the alert tracks bidirectional changes.
- **Cells**: Cells contain `triage-status-sk` elements. If a cluster is found, it shows the current triage status. If no cluster is found, it displays a "∅" symbol, which links to the generic cluster view for that commit/query combination.
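The column rule in the list above can be sketched as a mapping from an alert's direction to its sub-columns. The direction names `UP`, `DOWN`, and `BOTH` appear elsewhere in this document; pairing `UP` with "High" and `DOWN` with "Low" is an assumption, as is the `subColumns` helper name.

```typescript
type Direction = 'UP' | 'DOWN' | 'BOTH';
type SubColumn = 'low' | 'high';

// Sub-columns rendered under one alert header.
function subColumns(direction: Direction): SubColumn[] {
  switch (direction) {
    case 'UP':
      return ['high']; // assumed: upward steps appear in the "High" column
    case 'DOWN':
      return ['low']; // assumed: downward steps appear in the "Low" column
    case 'BOTH':
      return ['low', 'high'];
  }
}

// Total grid width for a set of alerts, in sub-columns.
function totalColumns(directions: Direction[]): number {
  return directions.reduce((sum, d) => sum + subColumns(d).length, 0);
}

const gridWidth = totalColumns(['UP', 'BOTH', 'DOWN']); // 1 + 2 + 1 columns
```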
### Triage Dialog and Interaction
When a user clicks a status in the grid, the `start-triage` event dispatched by `triage-status-sk` is captured, opening a `<dialog>` containing a `cluster-summary2-sk`.
- **`cluster-summary2-sk`**: Responsible for rendering the actual plot and summary statistics of the regression.
- **`triaged()`**: When a triage decision is made inside the dialog, this method handles the POST request to `/_/triage/`. If the triage results in a new bug, it automatically opens the bug reporting URL in a new window.
## Workflow: Investigating a Regression
The following diagram illustrates how a user moves from the high-level grid to a specific data investigation:
```
[ Triage Page Grid ]
        |
        | (User clicks a 'triage-status-sk')
        v
[ start-triage event ] --------------------------+
        |                                        |
        v                                        |
[ Open <dialog> ]                                |
        |                                        |
        +--> [ cluster-summary2-sk ] <-----------+
                       |
                       | (User analyzes plot)
                       |
           +-----------+-----------+
           |                       |
   (Press 'p' / 'n')          (Press 'g')
           |                       |
           v                       v
   [ Update Status ]      [ Open Dashboard ]
           |              (Full Explore View)
           v
  [ POST /_/triage/ ]
           |
           +--> (Optional: Open Bug Tracker)
```
## Keyboard Shortcuts
To facilitate rapid triaging, the module implements `KeyboardShortcutHandler`. When the triage dialog is open, the following shortcuts are available:
- **`p`**: Mark the current regression as **Positive**.
- **`n`**: Mark the current regression as **Negative**.
- **`g`**: **Go** to the full Explore page for this cluster to perform a deeper analysis of the underlying traces.
- **`?`**: Open the keyboard shortcuts help overlay.
These shortcuts are managed via `handleKeyboardShortcut` in the `keyDown` listener, ensuring they only trigger when the user is not actively typing in an input field.
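The dispatch logic can be sketched as a lookup guarded by two conditions. This is not the module's `KeyboardShortcutHandler`; the `TriageAction` names and the `actionForKey` function are hypothetical, and the DOM event is reduced to plain arguments so the rule can be tested in isolation.

```typescript
type TriageAction = 'positive' | 'negative' | 'go-to-explore' | 'show-help';

// Key -> action, following the list of shortcuts above.
const TRIAGE_SHORTCUTS = new Map<string, TriageAction>([
  ['p', 'positive'],
  ['n', 'negative'],
  ['g', 'go-to-explore'],
  ['?', 'show-help'],
]);

// Returns the action for a key press, or null when the shortcut must be
// ignored: the dialog is closed, the user is typing, or the key is unbound.
function actionForKey(
  key: string,
  dialogOpen: boolean,
  typingInInput: boolean
): TriageAction | null {
  if (!dialogOpen || typingInInput) return null;
  return TRIAGE_SHORTCUTS.get(key) ?? null;
}

const markPositive = actionForKey('p', true, false); // 'positive'
const ignoredWhileTyping = actionForKey('p', true, true); // null
```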
## Design Decisions
- **Modal Dialog for Detail**: Instead of navigating away from the grid, details are shown in a modal dialog. This preserves the user's scroll position and filter state in the large triage matrix, allowing them to process dozens of regressions in a single session.
- **Conditional Column Splitting**: The grid columns dynamically adjust based on the `direction` of the alert (UP, DOWN, or BOTH). This minimizes horizontal scrolling by only showing "High" or "Low" columns when relevant to that specific alert's configuration.
- **Subset Filtering**: The `subset` parameter (`all`, `regressions`, `untriaged`) allows the server to prune the data significantly, which is critical for performance when viewing large time ranges (e.g., several weeks of data across hundreds of alerts).
# Module: /modules/triage-status-sk
# Triage Status Module
The `triage-status-sk` module provides a visual indicator and interaction point for the triage state of performance clusters. It serves as a compact UI element that communicates the current classification of a detected anomaly (e.g., untriaged, positive, or negative) and initiates the workflow for modifying that state.
## High-level Overview
In the Perf system, data anomalies are grouped into clusters that require human intervention to determine if they represent actual regressions or false positives. This module encapsulates the visual representation of that status.
Rather than managing the complex logic of the triage dialog itself (which involves data visualization and form inputs), `triage-status-sk` acts as a trigger. When a user interacts with the component, it broadcasts the necessary context—including the current triage state, the associated alert configuration, and the cluster summary—to be handled by a parent container or a global dialog manager.
## Key Components and Responsibilities
### TriageStatusSk (`triage-status-sk.ts`)
The primary class is an `ElementSk` that renders a stylized button. Its responsibilities include:
- **State Representation**: It mirrors the `TriageStatus` (status and message) provided via its properties. The visual state is driven by CSS classes that correspond to the status strings (`positive`, `negative`, `untriaged`).
- **Context Storage**: It holds metadata required for the triage process, such as the `alert` (the configuration that triggered the detection), `full_summary` (the statistical data of the cluster), and `cluster_type` (indicating if the cluster represents a high or low change).
- **Workflow Initiation**: It listens for click events on the internal button and dispatches a `start-triage` custom event.
### Visual Feedback (`triage-status-sk.scss`)
The module uses a specialized icon component (`tricon2-sk`) inside the button. The styling logic is tightly coupled with the status:
- **Colors**: Uses theme-aware variables (`--positive`, `--negative`, `--untriaged`) to ensure consistency across the application.
- **Theming**: Includes specific overrides for dark mode to ensure the icons remain legible against varying background surfaces.
## Workflow: Initiating Triage
The following diagram illustrates how the component interacts with the rest of the application to start a triage action:
```text
[ User ]
    |
    | (Clicks Button)
    v
[ triage-status-sk ]
    |
    |-- 1. Bundles: { triage, alert, full_summary, cluster_type }
    |-- 2. Dispatches 'start-triage' Event
    v
[ Parent / Dialog Manager ]
    |
    |-- 3. Receives Event Detail
    |-- 4. Opens Triage Dialog for User Input
    v
[ Backend API ] (Updated via parent)
```
## Event Details: `start-triage`
The `start-triage` event is the primary output of this module. Its `detail` object contains:
| Property | Description |
| -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `triage` | The current `TriageStatus` object. |
| `full_summary` | The `FullSummary` data structure representing the cluster statistics. |
| `alert` | The `Alert` configuration associated with this detection. |
| `cluster_type` | Whether the regression is 'high' or 'low'. |
| `element` | A reference to the `TriageStatusSk` instance that fired the event, allowing the receiver to update the element directly upon successful triage. |
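
The table above can be read as a TypeScript shape. The sketch below is an approximation: `FullSummary` and `Alert` are replaced by `unknown`, `TriageTarget` is a hypothetical stand-in for the `TriageStatusSk` element, and `completeTriage` illustrates how a receiver might use the `element` reference after a successful triage.

```typescript
type Status = 'positive' | 'negative' | 'untriaged';

interface TriageStatus {
  status: Status;
  message: string;
}

// Stand-in for the TriageStatusSk element: only the property we update.
interface TriageTarget {
  triage: TriageStatus;
}

// Approximate shape of the 'start-triage' event detail.
interface StartTriageDetail {
  triage: TriageStatus;
  full_summary: unknown; // FullSummary in the real code
  alert: unknown; // Alert in the real code
  cluster_type: 'high' | 'low';
  element: TriageTarget;
}

// After the backend accepts a decision, update the originating element.
function completeTriage(detail: StartTriageDetail, decision: TriageStatus): void {
  detail.element.triage = { status: decision.status, message: decision.message };
}

const statusButton: TriageTarget = { triage: { status: 'untriaged', message: '' } };
const triageDetail: StartTriageDetail = {
  triage: statusButton.triage,
  full_summary: {},
  alert: {},
  cluster_type: 'high',
  element: statusButton,
};
completeTriage(triageDetail, { status: 'positive', message: 'Real regression' });
```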
# Module: /modules/triage2-sk
# triage2-sk
The `triage2-sk` module provides a specialized UI component for managing the classification of data points or alerts within the Perf system. It allows users to toggle between three distinct states: **positive**, **negative**, and **untriaged**.
This component is designed to be a compact, intuitive control that provides immediate visual feedback on the current classification of an item while making it easy to change that state with a single click.
## Design and Implementation
The module is built as a custom element using the `Lit` library and extends `ElementSk`.
### State Management
The primary state of the component is driven by its `value` attribute. The component synchronizes this attribute with a property of the same name. To ensure data integrity, it uses a guard function (`isStatus`) to validate that any assigned value conforms to the `Status` type (defined in `perf/modules/json`).
The internal logic follows a reactive pattern:
1. **Input**: A user clicks one of the three buttons or the `value` attribute is updated programmatically.
2. **Reaction**: The `attributeChangedCallback` triggers a re-render and dispatches a `change` event.
3. **Visual Feedback**: The template sets the boolean `selected` attribute on the internal button that matches the current value (via Lit's `?selected` binding), and CSS styles that button as the active state.
### UI and Styling
The component consists of a group of three buttons, each containing a specific icon:
- `check-circle-icon-sk`: Represents a **positive** result.
- `cancel-icon-sk`: Represents a **negative** (false positive) result.
- `help-icon-sk`: Represents an **untriaged** (unknown) state.
The styling is implemented in `triage2-sk.scss` with specific support for "light" and "dark" modes. It relies on CSS variables (e.g., `--surface`, `--on-disabled`) to integrate seamlessly with the broader Perf application themes. Deselected buttons are intentionally dimmed to draw focus to the currently selected state.
### Workflow: State Update
The following diagram illustrates how an interaction with the UI flows through the component to notify the parent application:
```text
User Click           Component Property       DOM Attribute      Parent Application
----------           ------------------       -------------      ------------------
     |                        |                     |                     |
[Click .positive] ------> [set value]               |                     |
     |                        |                     |                     |
     |                        | ------------> [attr: value]               |
     |                        |                     |                     |
     |                        |            [attributeChanged]             |
     |                        |                     |                     |
     |                   [_render()] <------------- |                     |
     |                        |                     |                     |
     |                        | --------------------------------> [Event: 'change']
```
## Key Components and Files
- `triage2-sk.ts`: Contains the `TriageSk` class logic. It handles the attribute observation, property synchronization, and the dispatching of custom events when the triage status changes.
- `triage2-sk.scss`: Defines the visual representation. It uses sophisticated selectors to handle both legacy color schemes and modern theme-based variables, ensuring the icons are appropriately colored (Green for positive, Red for negative, Brown for untriaged) and that "raised" or "hover" states provide tactile feedback.
- `index.ts`: The entry point that defines the custom element in the global `customElements` registry.
## Events and Attributes
- **`value` (Attribute/Property)**: Reflects the current status. Defaults to `untriaged` if not set or if set to an invalid value.
- **`change` (Event)**: Dispatched whenever the status changes. The `detail` property of the event contains the new `Status` value string.
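The validation and fallback behaviour can be sketched with a type guard. The `isStatus` name comes from the description above, but this body is a reconstruction rather than the module's code, and `normalizeStatus` is a hypothetical helper.

```typescript
type Status = 'positive' | 'negative' | 'untriaged';

const ALL_STATUSES: readonly string[] = ['positive', 'negative', 'untriaged'];

// Type guard: narrows an arbitrary value to Status.
function isStatus(value: unknown): value is Status {
  return typeof value === 'string' && ALL_STATUSES.includes(value);
}

// Attribute value (possibly absent or invalid) -> effective status.
function normalizeStatus(value: string | null): Status {
  return isStatus(value) ? value : 'untriaged';
}

const validValue = normalizeStatus('negative'); // 'negative'
const invalidValue = normalizeStatus('maybe'); // falls back to 'untriaged'
const unsetValue = normalizeStatus(null); // falls back to 'untriaged'
```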
# Module: /modules/tricon2-sk
The `tricon2-sk` module provides a specialized UI component for displaying triage states. It translates semantic status strings—"positive", "negative", or "untriaged"—into consistent visual indicators (icons and colors) used across the application to represent the state of performance regressions or test results.
### Core Logic and Design Decisions
The component is designed around a single point of truth: the `value` attribute. By mirroring this attribute to a JavaScript property, the component ensures that updates made via HTML or direct property assignment trigger a re-render.
The implementation uses a declarative template approach. Instead of manually manipulating the DOM to swap icons, the `TriconSk` class uses a `switch` statement within its Lit template to determine which underlying icon element to mount:
```
Value ("positive") ------> <check-circle-icon-sk>
Value ("negative") ------> <cancel-icon-sk>
Value (default)    ------> <help-icon-sk>
```
This design simplifies the component's internal state management, as the visual output is a pure function of the `value` property.
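The mapping can be sketched as a pure function. To stay self-contained, this version returns the tag name of the icon element rather than a Lit template; the `iconTagFor` name is hypothetical.

```typescript
// Triage value -> tag name of the icon element to render.
function iconTagFor(value: string): string {
  switch (value) {
    case 'positive':
      return 'check-circle-icon-sk';
    case 'negative':
      return 'cancel-icon-sk';
    default:
      // 'untriaged' and any unrecognized value share the help icon.
      return 'help-icon-sk';
  }
}

const positiveIcon = iconTagFor('positive'); // 'check-circle-icon-sk'
const fallbackIcon = iconTagFor('something-else'); // 'help-icon-sk'
```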
### Theming and Visual Consistency
The styling logic in `tricon2-sk.scss` is decoupled into three distinct layers to ensure legibility across different UI contexts:
- **Default State:** Uses standard CSS variables (e.g., `--green`, `--red`) for basic integration.
- **Themed State (`.body-sk`):** Provides specific hex code overrides to ensure that the triage colors meet brand and contrast requirements when the application's standard theme is applied.
- **Dark Mode:** Adjusts the brightness and saturation of the icons specifically for dark backgrounds to maintain accessibility.
By encapsulating these color mappings within the component's SCSS, the module prevents "color leak" and ensures that a "positive" icon always appears in the correct shade of green regardless of where it is placed in the application.
### Key Components
#### TriconSk (`tricon2-sk.ts`)
The primary class extending `ElementSk`. It manages the lifecycle of the element and observes the `value` attribute. It is responsible for importing and registering the specific icon elements from `elements-sk` needed for the three states.
#### SCSS Styles (`tricon2-sk.scss`)
Rather than relying on the parent container to style the icons, this file explicitly defines the `fill` properties for the internal icon components (`check-circle-icon-sk`, `cancel-icon-sk`, and `help-icon-sk`). This ensures that the semantic meaning of the icon (success, failure, or unknown) is always tied to its visual representation.
#### Demo and Testing
The module includes a demo page (`tricon2-sk-demo.html`) that showcases the component in all three triage states across light and dark modes. This is used by the Puppeteer test suite (`tricon2-sk_puppeteer_test.ts`) to perform visual regression testing, ensuring that the icons render correctly and maintain their color associations during UI changes.
# Module: /modules/user-issue-sk
### High-Level Overview
The `user-issue-sk` module provides a custom LitElement component that manages the association between performance data points (traces at specific commit positions) and issues in an external bug tracker (Buganizer). It acts as a bridge between the Perf monitoring UI and the issue management backend, allowing users to view, link, create, and remove bug references directly within the context of a performance trace.
### Design Decisions and Implementation Choices
#### State-Driven Visibility
The component's appearance is heavily dictated by the user's authentication state and the presence of an existing bug.
- **Authentication**: Using the `alogin-sk` module, the component detects if a user is logged in. If a user is anonymous, they are restricted to viewing existing bug links; they cannot add, delete, or modify issue associations.
- **Bug ID States**: The `bug_id` property serves as a state indicator. A value of `-1` hides the element entirely, `0` indicates no bug is currently associated, and a positive integer indicates an active link to an external issue.
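The two rules in the list above combine into a small decision function. The `BugState` labels, the action names, and both helper functions are hypothetical; only the meaning of `-1`, `0`, and positive IDs and the restriction on anonymous users come from the description.

```typescript
type BugState = 'hidden' | 'none' | 'linked';
type IssueAction = 'view' | 'add' | 'delete';

// Interpret the bug_id property as a display state.
function bugState(bugId: number): BugState {
  if (bugId === -1) return 'hidden';
  if (bugId === 0) return 'none';
  if (Number.isInteger(bugId) && bugId > 0) return 'linked';
  throw new RangeError(`unsupported bug_id: ${bugId}`);
}

// Actions offered for a given bug_id and login state.
function allowedActions(bugId: number, loggedIn: boolean): IssueAction[] {
  switch (bugState(bugId)) {
    case 'hidden':
      return [];
    case 'none':
      return loggedIn ? ['add'] : [];
    case 'linked':
      return loggedIn ? ['view', 'delete'] : ['view'];
  }
}

const anonymousOnLinkedBug = allowedActions(12345, false); // ['view']
const memberOnEmptyPoint = allowedActions(0, true); // ['add']
```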
#### Automated vs. Manual Association
The module implements a specific workflow for adding issues (`findOrAddIssue`) that prioritizes data integrity:
1. It first checks the backend to see if a bug reference already exists for the given trace and commit.
2. If the user provides a bug ID that matches an existing record, it simply confirms the link.
3. If no matching record exists, the component automatically triggers the creation of a new bug via the `/_/triage/file_bug` endpoint before saving the association. This ensures that every "Add Issue" action results in a valid, tracked entity in the bug host.
#### Event-Based Updates
Rather than managing the global state of the application, `user-issue-sk` utilizes a `user-issue-changed` Custom Event. When an issue is saved or deleted, the component dispatches this bubbling event. This allows parent components or data providers to react to changes (e.g., refreshing a list of anomalies or updating a graph) without the `user-issue-sk` component needing deep knowledge of the application's architecture.
### Key Components and Responsibilities
#### `user-issue-sk.ts`
This is the primary implementation file containing the `UserIssueSk` class. Its responsibilities include:
- **Property Management**: Tracks `user_id`, `bug_id`, `trace_key`, and `commit_position` to contextualize the issue.
- **API Interaction**: Handles asynchronous requests to several endpoints:
- `/_/user_issues`: To query existing associations.
- `/_/user_issue/save`: To persist a link between a trace point and a bug.
- `/_/user_issue/delete`: To remove an association.
- `/_/triage/file_bug`: To programmatically create a new bug in Buganizer.
- **UI Rendering**: Switches between a "Link View" (showing the formatted URL to the bug) and an "Add/Input View" (showing a button or a numeric input field).
#### `user-issue-sk_test.ts`
The test suite ensures the component reacts correctly to different state combinations. It mocks the global `window.perf` configuration and API responses to verify that:
- Unauthorized users cannot see management icons (like the delete/close icon).
- The "Add Issue" workflow correctly transitions through input and confirmation states.
- The bug URL is formatted correctly based on the environment's `bug_host_url`.
### Key Workflows
#### Adding a New Issue
The following diagram illustrates the logic flow when a logged-in user interacts with the "Add Issue" button:
```text
[ Click "Add Issue" ] -> ( Show Input Field )
|
[ Enter Bug ID ]
|
[ Click Checkmark ]
|
V
{ Check /_/user_issues }
|
_________________/ \_________________
| |
(Issue Exists?) (Issue Not Found?)
| |
[ Set bug_id ] [ Call /_/triage/file_bug ]
| |
| [ Receive New bug_id ]
\_________________ ________________/
\ /
V
[ Call /_/user_issue/save ]
|
[ Dispatch 'user-issue-changed' ]
|
( Update UI to Link View )
```
#### Removing an Issue
```text
[ Click Close-Icon ] -> [ Call /_/user_issue/delete ]
|
[ Dispatch 'user-issue-changed' ]
|
[ Reset bug_id = 0, exists = false ]
|
( Update UI to "Add" Button )
```
# Module: /modules/window
# Window Module
The `window` module provides type definitions for global configuration and utility functions for parsing build-specific metadata from the browser's global environment. It serves as the primary bridge between the backend-injected configuration and the frontend application logic.
## Global Configuration Management
A central responsibility of this module is extending the global `Window` interface to include the `perf` property. This property holds the `SkPerfConfig`, which is typically populated by the server-side template during the initial page load.
By defining this in a centralized `window.ts` file, the project ensures type safety across all frontend modules when accessing environment-specific settings, such as the current image tag or deployment configuration.
## Build Metadata Extraction
The module implements logic to parse image tags used in the Skia Perf infrastructure. This is necessary for displaying versioning information to developers and operators, allowing them to quickly identify which specific build of the software is currently running.
The implementation focuses on identifying three distinct deployment patterns:
- **Git-based builds:** Identified by a `tag:git-` prefix. The logic extracts the first seven characters of the git hash to provide a short, recognizable revision identifier.
- **Louhi builds:** Identified by the presence of a timestamp and the `-louhi-` string. These represent specific automated build pipeline outputs, where the logic extracts the specific build hash following the "louhi" marker.
- **Generic tags:** Fallback for standard tags (e.g., `tag:latest` or `tag:v1.0`), where the prefix is stripped to show the human-readable label.
### Extraction Workflow
The `getBuildTag` function follows a specific sequence to normalize the raw image string provided by the backend:
```text
Raw Tag String (e.g., image@tag:git-12345...)
|
+--- Split by '@' to isolate the tag portion
| |
| [No '@' found] --> Return 'invalid'
|
+--- Check if starts with 'tag:'
| |
| [False] ------> Return 'invalid'
|
+--- Pattern Matching
|
|-- Starts with 'tag:git-'? ----> [type: 'git'] (7-char hash)
|
|-- Contains '-louhi-'? --------> [type: 'louhi'] (build hash)
|
|-- Else -----------------------> [type: 'tag'] (full tag value)
```
## Key Files
- **`window.ts`**: Contains the global type augmentation for the `Window` object and the logic for `getBuildTag`. It imports `SkPerfConfig` from the JSON schema definitions to ensure the global state remains synchronized with the backend data structures.
- **`window_test.ts`**: Validates the parsing logic against various real-world container image tag formats, ensuring that changes to the deployment pipeline's tagging convention do not break version reporting in the UI.
# Module: /modules/word-cloud-sk
The `word-cloud-sk` module provides a specialized data visualization component designed to represent the distribution and frequency of key-value pairs within a dataset, such as a cluster of performance traces. Despite its name, it renders data as a structured table with integrated bar charts rather than a randomized "cloud" of text, prioritizing legibility and precise comparison of relative frequencies.
### Core Responsibility
The primary role of this module is to take a collection of data points (values and their associated percentages) and render them in a format that allows users to quickly identify dominant traits within a selected group. It is specifically designed for the Skia Perf UI to show which metadata keys or configurations (e.g., `arch=x86`, `config=8888`) are most prevalent in a given performance cluster.
### Design and Implementation
The module follows a declarative pattern using `lit` for rendering and `ElementSk` as a base class.
- **Data Structure**: The component consumes an array of `ValuePercent` objects. Each object contains a `value` (string) and a `percent` (number). The choice of a percentage-based input simplifies the rendering logic, as the component does not need to calculate totals or handle raw counts; it assumes the data is pre-processed.
- **Visual Representation**: The implementation uses a standard HTML `<table>` for layout. This ensures that labels remain aligned while the distribution is visualized via "micro-bars": div elements whose pixel widths are set directly to the percentage value.
- **Theming and Styling**: The component supports multiple visual contexts (default colors, standard themes, and dark mode) by leveraging CSS variables. The border colors and bar backgrounds adjust dynamically based on the parent container's class (e.g., `.body-sk` or `.darkmode`), ensuring consistent integration with the rest of the application's UI.
### Key Workflows
**Data Binding and Rendering**
When the `items` property is updated on the element, it triggers a re-render cycle. The component maps the data into table rows where the percentage is represented both numerically and visually.
```text
[Data Update]
|
v
[setter: items(val)] -> updates private _items
|
v
[_render()] calls [WordCloudSk.template]
|
+--> [WordCloudSk.rows] maps items to:
|
+-- <td> {value} </td> (The label)
+-- <td> {percent}% </td> (Numeric value)
+-- <td> [---bar (width: Xpx)---] </td> (Visual representation)
```
### Components and Files
- **`word-cloud-sk.ts`**: Contains the logic and template for the custom element. It handles property shadowing for `items` and manages the rendering of the table rows.
- **`word-cloud-sk.scss`**: Defines the layout and theme-aware styling. It uses a fixed width for the percentage bars (100px) so that the percentage value maps 1:1 to a pixel width, providing a consistent scale across different instances.
- **`word-cloud-sk-demo.ts/html`**: Provides a sandbox for testing the component in different CSS contexts (standard vs. themed), demonstrating how the component adapts to its environment.
# Module: /nanostat
# nanostat
`nanostat` is a command-line utility used to perform statistical comparisons between two sets of benchmark results generated by Skia's `nanobench`. It identifies whether performance changes (deltas) are statistically significant or merely the result of measurement noise.
## High-Level Overview
In performance engineering, comparing the means of two benchmark runs is often insufficient because execution environments are noisy. A small change in the mean might be significant if the variance is low, while a large change might be statistically meaningless if the variance is high.
`nanostat` addresses this by applying statistical hypothesis testing to the raw samples collected during benchmarking. It consumes two JSON files (typically an "old" baseline and a "new" experimental run), performs a comparative analysis, and outputs a formatted table showing the magnitude of change alongside a p-value to indicate confidence.
## Design Decisions and Implementation
### Statistical Methodology
The tool evaluates whether a performance change is real by estimating the probability that the observed difference arose by chance.
- **Hypothesis Testing**: By default, the tool uses the **Mann-Whitney U test** (via the `samplestats` package). This is a non-parametric test, meaning it does not assume the benchmark samples follow a normal distribution, which is ideal for performance data that often contains outliers or asymmetrical distributions. Users can optionally switch to a **Welch’s T-test** for normally distributed data.
- **Significance Threshold (Alpha)**: The tool uses a default alpha of `0.05`. If the calculated p-value is greater than this threshold, the change is considered "insignificant" and is represented by a tilde (`~`) instead of a percentage delta to prevent developers from chasing ghosts in the noise.
- **Outlier Rejection**: The `--iqrr` flag enables the Interquartile Range Rule to strip outliers from the sample sets before analysis. This is a design choice to provide a "cleaner" look at the core performance characteristics of the code, independent of transient system spikes.
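The decision described above can be sketched in a few lines of Go. This is a minimal illustration and not the `samplestats` implementation: it assumes the normal approximation to the Mann-Whitney U distribution (no tie or continuity correction, whereas a production implementation may use an exact distribution for small samples), and the names `mannWhitneyP`, `formatDelta`, and `mean` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// mannWhitneyP returns a two-sided p-value for the Mann-Whitney U test.
// Assumption: normal approximation, no tie or continuity correction.
func mannWhitneyP(x, y []float64) float64 {
	n1, n2 := float64(len(x)), float64(len(y))
	if n1 == 0 || n2 == 0 {
		return 1
	}
	// U counts the pairs where a sample from x is smaller than one from y;
	// ties contribute one half.
	u := 0.0
	for _, a := range x {
		for _, b := range y {
			switch {
			case a < b:
				u++
			case a == b:
				u += 0.5
			}
		}
	}
	meanU := n1 * n2 / 2
	sigma := math.Sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
	z := math.Abs(u-meanU) / sigma
	// Two-sided tail probability of the standard normal distribution.
	return math.Erfc(z / math.Sqrt2)
}

// mean returns the arithmetic mean of v (v must be non-empty).
func mean(v []float64) float64 {
	sum := 0.0
	for _, s := range v {
		sum += s
	}
	return sum / float64(len(v))
}

// formatDelta returns the percentage change, or "~" when p exceeds alpha.
func formatDelta(oldMean, newMean, p, alpha float64) string {
	if p > alpha {
		return "~"
	}
	return fmt.Sprintf("%+.0f%%", (newMean-oldMean)/oldMean*100)
}

func main() {
	oldRun := []float64{10.1, 10.3, 9.9, 10.2, 10.0, 10.4, 10.1, 9.8, 10.2, 10.0}
	newRun := []float64{12.0, 12.3, 11.9, 12.1, 12.2, 12.4, 11.8, 12.0, 12.1, 12.3}
	p := mannWhitneyP(oldRun, newRun)
	fmt.Printf("p=%.4f delta=%s\n", p, formatDelta(mean(oldRun), mean(newRun), p, 0.05))
}
```

Because the test only compares ranks, a single extreme sample shifts the statistic far less than it would shift a mean, which is why a rank-based test suits skewed benchmark data.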
### Data Aggregation and Matching
`nanostat` doesn't just compare files line-by-line; it understands the structure of Skia benchmark data.
- **Parametric Matching**: The tool uses `paramtools` and `parser` to group samples. It identifies benchmarks by their parameters (e.g., `config`, `test`, `name`). It automatically detects which parameters vary across the dataset and includes them in the output columns so the user can distinguish between different test configurations (e.g., `gl` vs `gles`).
- **Legacy Format Support**: The implementation specifically leverages `format.ParseLegacyFormat` to maintain compatibility with the standard JSON output format used by `nanobench`.
## Key Components and Workflow
### Main Logic (`main.go`)
The core entry point manages the lifecycle of a comparison:
1. **Configuration**: Parses CLI flags into a `samplestats.Config` object, defining the statistical "rules" for the session.
2. **Data Ingestion**: Loads JSON files into `parser.SamplesSet` structures. Each set contains the raw execution times (samples) for every benchmark identified in the file.
3. **Analysis**: Delegates the heavy lifting to `samplestats.Analyze`. This produces a set of "Rows," where each row represents a unique benchmark found in both files.
4. **Formatting**: The `formatRows` function dynamically determines which metadata (like `config` or `arch`) is relevant. If all results share the same `arch`, that column is hidden to reduce clutter; if they differ, it is shown.
### Comparison Workflow
```text
[ File A (Old) ] [ File B (New) ]
| |
v v
[ Parse Samples ] [ Parse Samples ]
| |
+----------+-----------+
|
[ Match by Parameters ] (e.g., name, config)
|
[ Apply Outlier Filter ] (Optional: IQRR)
|
[ Run Statistical Test ] (Mann-Whitney U or T-Test)
|
[ Filter by Significance ] (p < Alpha?)
|
v
[ Format Table: Mean, StdDev, Delta, p-value, Metadata ]
```
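The optional outlier-filter step in the workflow above can be sketched as follows. This is an illustration of the interquartile range rule in general, not the code behind `--iqrr`: the conventional `1.5 × IQR` fences and the linear-interpolation quantile are assumptions, and `quantile` and `rejectOutliers` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// quantile returns the q-th quantile (0 <= q <= 1) of a non-empty, sorted
// slice, interpolating linearly between neighboring values.
func quantile(sorted []float64, q float64) float64 {
	pos := q * float64(len(sorted)-1)
	lo := int(pos)
	if lo+1 >= len(sorted) {
		return sorted[len(sorted)-1]
	}
	frac := pos - float64(lo)
	return sorted[lo] + frac*(sorted[lo+1]-sorted[lo])
}

// rejectOutliers keeps the samples inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
// The input slice is not modified.
func rejectOutliers(samples []float64) []float64 {
	if len(samples) < 4 {
		return samples // too few samples to estimate quartiles
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	q1, q3 := quantile(sorted, 0.25), quantile(sorted, 0.75)
	iqr := q3 - q1
	low, high := q1-1.5*iqr, q3+1.5*iqr

	kept := make([]float64, 0, len(samples))
	for _, s := range samples {
		if s >= low && s <= high {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	samples := []float64{10, 11, 12, 11, 10, 12, 11, 100}
	fmt.Println(rejectOutliers(samples))
}
```

Filtering changes the number of samples that reach the statistical test, so the reported sample size and p-value can both differ from an unfiltered run.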
### Formatting Strategy
The tool uses a `tabwriter` to produce aligned, human-readable terminal output. A key implementation detail in `formatRows` is the calculation of the "Important Keys." The tool scans all results and identifies which parameters (keys) have multiple values. It then prioritizes these keys in the output string, ensuring that the user sees exactly what differentiates one row from another without redundant information.
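The "Important Keys" idea can be illustrated with a short sketch. It assumes each row's parameters are available as a `map[string]string`; the function name `importantKeys` is hypothetical and the real `formatRows` may order or select keys differently.

```go
package main

import (
	"fmt"
	"sort"
)

// importantKeys returns, in sorted order, the parameter keys that take more
// than one distinct value across all rows. Keys whose value never changes
// carry no information for telling rows apart and are omitted.
func importantKeys(rows []map[string]string) []string {
	values := map[string]map[string]bool{}
	for _, params := range rows {
		for k, v := range params {
			if values[k] == nil {
				values[k] = map[string]bool{}
			}
			values[k][v] = true
		}
	}
	keys := []string{}
	for k, seen := range values {
		if len(seen) > 1 {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	rows := []map[string]string{
		{"arch": "x86", "config": "gl", "test": "blur"},
		{"arch": "x86", "config": "gles", "test": "blur"},
	}
	// Only "config" differs between the rows, so only it needs a column.
	fmt.Println(importantKeys(rows))
}
```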
# Module: /nanostat/testdata
The `/nanostat/testdata` module serves as the regression testing suite and ground-truth repository for the `nanostat` tool. It contains paired performance benchmark results and the corresponding expected output (golden files) used to verify the tool's statistical analysis, formatting, and filtering logic.
### Purpose and Design
The data in this directory is structured to facilitate end-to-end testing of how `nanostat` processes `nanobench` JSON output. The primary goal is to ensure that the statistical comparisons (Mann-Whitney U tests, p-values, and percentage deltas) remain accurate across code changes.
The design relies on "Golden File Testing":
1. **Input**: Two JSON files representing "old" and "new" benchmark runs.
2. **Process**: `nanostat` compares these files using various flags (e.g., sorting, significance thresholds).
3. **Validation**: The output is compared against a `.golden` file to ensure the results match expectations down to the whitespace and p-value calculation.
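A golden-file comparison of this kind can be sketched as below. This is a generic illustration, not the test code in the repository: `firstMismatch` is a hypothetical helper, and the temporary file stands in for a real `.golden` file.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// firstMismatch returns the 1-based number of the first line on which got
// and want differ, or 0 if they are identical. Whitespace is significant.
func firstMismatch(got, want string) int {
	g := strings.Split(got, "\n")
	w := strings.Split(want, "\n")
	for i := 0; i < len(g) || i < len(w); i++ {
		if i >= len(g) || i >= len(w) || g[i] != w[i] {
			return i + 1
		}
	}
	return 0
}

func main() {
	dir, err := os.MkdirTemp("", "golden-example")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Stand-in for a checked-in golden file.
	goldenPath := filepath.Join(dir, "example.golden")
	if err := os.WriteFile(goldenPath, []byte("old  new  delta\n"), 0o644); err != nil {
		panic(err)
	}
	want, err := os.ReadFile(goldenPath)
	if err != nil {
		panic(err)
	}

	got := "old new delta\n" // differs from the golden file only in spacing
	if line := firstMismatch(got, string(want)); line != 0 {
		fmt.Printf("output differs from golden file at line %d\n", line)
	}
}
```

Comparing exact text means a change in column alignment fails the test just as a changed p-value does, which is the intent of validating "down to the whitespace".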
### Key Components
#### Benchmark Input Files
- **nanobench_old.json / nanobench_new.json**: These are the primary data sources. They contain nested JSON structures where each key represents a specific test case (e.g., `desk_googledocs.skp_1_1000_1000`).
- **Samples**: Each test case contains an array of "samples" (execution times in milliseconds). `nanostat` uses these raw sample arrays to calculate the mean, standard deviation, and statistical significance of the change between the "old" and "new" datasets.
#### Golden Files (.golden)
These files represent the expected CLI output under different operational modes. They contain fixed-width tables with columns for baseline (old), current (new), percentage delta, statistical significance (p-value and sample size), and test identification.
- **all.golden**: Expected output when showing all results regardless of statistical significance. Includes cases where the delta is negligible (marked with `~`).
- **test.golden**: The standard output reflecting significant changes.
- **iqrr.golden**: Expected output when outlier removal via the Interquartile Range Rule (`--iqrr`) is enabled, resulting in different sample sizes (`n`) and p-values compared to the standard test.
- **sort.golden**: Verifies the sorting logic, ensuring results are ordered correctly (e.g., by test name or delta magnitude).
- **nochange.golden**: The expected response when no statistically significant differences are found between the two datasets, confirming that the "noise floor" logic works.
### Workflow: Statistical Verification
The data in this module illustrates how raw performance samples are transformed into human-readable insights:
```text
[ nanobench_old.json ] [ nanobench_new.json ]
| |
| Comparison |
+------------> (v) <----------+
|
[ Mann-Whitney U Test ]
[ Mean / StdDev Calc ]
|
v
[ Formatting & Filtering ] ----> (Compare against .golden)
```
### Data Characteristics
The test data specifically includes varied scenarios to stress-test the tool:
- **Significant Regressions**: Large positive deltas (e.g., ~52% in `desk_googleimagesearch`).
- **Significant Improvements**: Negative deltas (e.g., -4% in `desk_googledocs`).
- **Statistical Noise**: High-variance samples (e.g., `± 15%`), which produce a range of p-values and significance classifications.
- **Metadata**: Examples of non-timing data like `max_rss_mb` and `sksl_compiler` bytes to verify that the tool handles different units of measurement correctly.
# Module: /pages
The `/pages` directory serves as the entry point layer for the Perf application's web interface. It defines the structure, styling, and composition of individual HTML pages by orchestrating high-level custom elements (defined in `/modules`) into functional views.
### Design Philosophy
The module follows a "Thin Page" architecture. Rather than containing complex business logic, each page acts as a declarative shell. This approach ensures:
1. **Consistency**: Every page utilizes the `perf-scaffold-sk` component, providing a unified navigation, header, and footer across the entire application.
2. **Modularity**: Page-specific logic is encapsulated within specialized "page-level" custom elements (e.g., `explore-sk`, `alerts-page-sk`). This makes the pages easy to maintain and the components easy to test in isolation.
3. **Data Injection**: Pages serve as the bridge between the Go backend and the frontend. They use Go templating to inject configuration data into the `window.perf` object, allowing the TypeScript modules to access instance-specific context (like `git_repo_url` or `Nonce` for security) immediately upon load.
### Key Components and Workflows
#### Page Composition
Most files in this directory follow a strict triplet pattern:
- **`.html`**: Defines the DOM structure, typically wrapping a single major functional element inside a scaffold.
- **`.ts`**: Handles the side-effect of importing the necessary custom element definitions so they are registered with the browser.
- **`.scss`**: Imports global styles (like `body.scss`) and applies page-specific layout tweaks.
#### Data Flow and Initialization
When a user navigates to a Perf page, the following initialization process occurs:
```text
[ Backend ] -> Injects JSON context into <script>
|
v
[ HTML Page ] -> Renders <perf-scaffold-sk>
|
+--------> Sets window.perf = { ... } (Context)
|
v
[ TS Entry ] -> Imports Modules -> Custom Elements Registered
|
v
[ Browser ] -> Upgrades <page-element-sk> -> Fetches data using window.perf
```
### Core Pages
- **Exploration (`newindex.ts`, `multiexplore.ts`)**: The primary interfaces for data visualization. `newindex` hosts the main `explore-sk` component for deep-diving into individual traces, while `multiexplore` allows for side-by-side comparisons. These pages also include sidebar help for keyboard and mouse navigation (Zoom/Pan/Delta).
- **Alerting & Triage (`alerts.ts`, `triage.ts`, `regressions.ts`)**: These pages manage the lifecycle of performance anomalies. They wrap components that allow users to view configured alerts, triage new regressions, and track existing performance issues.
- **Analysis Tools (`clusters2.ts`, `dryrunalert.ts`, `playground.ts`)**: Focused on the statistical side of Perf. The "Playground" is specifically designed for experimenting with anomaly detection algorithms on sample data without affecting production configurations.
- **Metadata & Info (`revisions.ts`, `favorites.ts`, `help.ts`)**: Support pages that provide context. The `help.html` page is unique as it uses Go templates to dynamically iterate over and describe available query functions (`.Funcs`) directly from the backend documentation.
### Shared Assets and Styles
The `body.scss` file provides a shared CSS baseline, resetting margins and paddings to ensure the scaffold occupies the full viewport. The `BUILD.bazel` file manages the distribution of static assets (like SVG icons for various platforms like Chrome, V8, and Fuchsia) to the `/dist` path, ensuring they are available for the UI regardless of which specific page is loaded.
# Module: /res
### High-Level Overview
The `/res` module is the core resource management hub for the application. It serves as the single source of truth for all non-programmatic assets, including user interface layouts, string constants, graphical elements, and configuration values.
The primary design philosophy of this module is the **decoupling of UI definition from application logic**. By externalizing these resources, the system achieves two critical architectural goals:
1. **Independent Maintenance:** Developers can modify the look and feel, update text, or swap images without altering the underlying source code (e.g., Java, Kotlin, or C++ files).
2. **Configuration Switching:** The module is structured to support automatic resource selection based on the runtime environment (e.g., screen density, language, or device orientation), allowing a single binary to adapt to diverse hardware and locales.
### Design Decisions and Implementation Choices
- **Declarative UI (XML):** The choice to use XML for layouts and values allows for a declarative approach to UI construction. This simplifies the development process by allowing visual structures to be defined hierarchically, which is more intuitive for layout management than imperative code.
- **Unique Identifier Mapping (The `R` Class):** To bridge the gap between static files and executable code, the build system maps every resource in this directory to a unique integer ID. This allows logic files to reference resources via a typesafe "alias" (e.g., `R.layout.main`) rather than error-prone string paths.
- **Strict Submodule Categorization:** The module enforces a rigid directory structure (e.g., `layout/`, `values/`, `drawable/`). This design choice ensures that the resource compiler can optimize asset processing (like shrinking unused images or pre-compiling XML) and provides a predictable mental model for developers.
- **Localization-First Architecture:** By centralizing strings in `values/`, the project is architected for global deployment. Translating the application requires adding a qualified subdirectory (e.g., `values-es/`) rather than refactoring code.
### Key Components and Responsibilities
The `/res` module is partitioned into specialized subdirectories, each managing a specific aspect of the application's presentation layer:
- **Interface Blueprints (`/layout`):** Responsible for defining the structural arrangement of the UI. These files determine where components are placed and how they behave in relation to one another.
- **Graphic Assets (`/drawable` & `/mipmap`):** Manage visual content. `/drawable` handles standard UI graphics (vectors, bitmaps, shapes), while `/mipmap` is specifically reserved for launcher icons to ensure they are available at the highest possible resolution regardless of the device's default density.
- **Constant Definitions (`/values`):** A critical repository for primitive types and styles. It typically contains:
- `strings.xml`: All user-facing text.
- `colors.xml`: The application's color palette.
- `styles.xml`: Reusable UI attribute sets that ensure visual consistency.
- **Raw Data and Interaction (`/raw` & `/menu`):** `/raw` holds arbitrary files (like audio or JSON config) that are needed in their original format, while `/menu` defines the structure of navigation and context menus.
### Workflow: Resource Resolution
The following diagram demonstrates how the application retrieves a resource at runtime, highlighting the "Config-Aware" selection process:
```text
[ App Logic ] [ R.java / ID ] [ /res System ] [ Active Config ]
| | | |
| 1. Request Asset | | |
|---(e.g. R.string.ok)-->| | |
| | 2. Lookup ID | |
| |----------------------->| |
| | | 3. Query State |
| | |----------------------->|
| | | | 4. Match (e.g. Locale=FR)
| | |<-----------------------|
| 5. Return Value | |
|<---("D'accord")--------| |
```
When a change is made to a resource, the build system automatically updates the reference mapping. This ensures that the application logic remains stable even as the visual or textual content of the `/res` module evolves.
# Module: /res/img
### High-Level Overview
The `/res/img` directory serves as the centralized repository for static image assets used across the application's user interface. Rather than being a mere storage bucket, this module is organized to ensure that brand identity elements and UI-specific graphics are decoupled from the application logic and styling code.
The primary design goal for this module is **asset consistency**. By centralizing images here, the project avoids duplication, simplifies path resolution within stylesheets and components, and ensures that updates to visual branding (such as logos or icons) propagate throughout the entire system from a single source of truth.
### Design Decisions and Implementation Choices
The architecture of this module favors a **flat or shallow hierarchy** to minimize path complexity in imports. The choice of file formats follows standard web optimization practices:
- **Vector assets (SVG):** Used for icons and logos to ensure scalability across high-DPI (Retina) displays without loss of quality or increased file size.
- **Raster assets (PNG/JPG):** Reserved for complex photographic content where vectorization is impractical.
- **Specialized formats (ICO):** Utilized specifically for browser-level integration (favicons) where legacy compatibility or specific metadata is required.
By isolating these assets into `/res/img`, the project implements a "Resource-Based" separation of concerns. This allows developers to reference assets via consistent aliases or relative paths, reducing the risk of broken links during refactoring of component or page structures.
### Key Components and Responsibilities
The directory is categorized by the functional role of the images rather than just their file types:
- **Identity and Branding:** Contains core visual identifiers like the company logo and wordmarks. These are the most critical assets, as they are often referenced in global headers, footers, and splash screens.
- **Interface Iconography:** Includes small-scale visual cues used to enhance navigation or provide feedback (e.g., arrows, status indicators). These are typically optimized for fast loading and uniform styling.
- **Meta Assets (`favicon.ico`):** Specifically responsible for the application's presence outside the viewport, such as browser tabs, bookmark bars, and shortcut icons. The inclusion of `favicon.ico` in this directory ensures that the "brand at a glance" is managed alongside other visual resources.
### Workflow: Asset Consumption
The following diagram illustrates how an asset moves from this module into the rendered application:
```text
[ /res/img/ ] [ Styles/Components ] [ Client Browser ]
| | |
| 1. Asset Definition | |
|---- logo.svg -------------------->| |
| | 2. URL Reference |
| | (e.g., background-image) |
| |---------------------------->|
| | | 3. Fetch & Render
| | |<-- [ GET /res/img/logo.svg ]
```
When a visual change is required, the workflow focuses on replacing the file within `/res/img` while maintaining the filename. This allows the application to update its visual state without requiring changes to CSS selectors or HTML templates.
# Module: /samplevariance
# Sample Variance Analysis
The `samplevariance` module provides a command-line tool designed to analyze the stability and noise levels of performance benchmarks. Specifically, it processes "nanobench" results stored in Google Cloud Storage (GCS), where each benchmark run typically contains multiple samples (e.g., 10 repetitions).
By calculating the ratio between the median and the minimum values of these samples, the tool helps engineers identify "flaky" or high-variance benchmarks that may yield inconsistent performance data.
## Design and Implementation Logic
The tool is built as a high-throughput data processing pipeline that fetches, parses, and analyzes large sets of JSON telemetry data.
### Data Model and Metrics
The core metric used is the **ratio of median to minimum**.
- **Minimum**: Represents the best possible performance (least noise).
- **Median**: Represents the typical performance.
- **Ratio**: A higher ratio indicates higher variance or "noise" within a single benchmark run.
The `sampleInfo` struct captures these metrics alongside the `traceid`, which uniquely identifies the specific benchmark configuration (e.g., test name, device, OS).
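The metric itself is simple to compute. The sketch below uses only the standard library rather than `go-moremath`, and `sampleStats` and `summarize` are hypothetical names that stand in for the tool's own types.

```go
package main

import (
	"fmt"
	"sort"
)

// sampleStats is a hypothetical stand-in for the metrics kept per trace.
type sampleStats struct {
	lowest, median, ratio float64
}

// summarize computes min, median, and median/min for one set of samples.
// It reports false when the ratio is undefined (no samples or a zero minimum).
func summarize(samples []float64) (sampleStats, bool) {
	if len(samples) == 0 {
		return sampleStats{}, false
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	n := len(sorted)
	median := sorted[n/2]
	if n%2 == 0 {
		median = (sorted[n/2-1] + sorted[n/2]) / 2
	}
	lowest := sorted[0]
	if lowest == 0 {
		return sampleStats{}, false
	}
	return sampleStats{lowest: lowest, median: median, ratio: median / lowest}, true
}

func main() {
	stats, ok := summarize([]float64{2, 4, 3, 8, 4})
	fmt.Println(stats, ok)
}
```

A perfectly stable benchmark has a ratio of 1; the further the ratio rises above 1, the more the typical run deviates from the best run.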
### Parallel Processing Workflow
To handle thousands of JSON files efficiently, the tool employs a worker pool pattern using `golang.org/x/sync/errgroup`.
1. **Discovery**: It lists all objects in a specified GCS bucket prefix (defaulting to the previous day's data).
2. **Distribution**: Filenames are pushed into a thread-safe channel.
3. **Analysis (Concurrent Workers)**: A pool of 64 workers pulls filenames from the channel. Each worker:
- Downloads the JSON file from GCS.
- Parses the legacy Perf format.
- Filters traces based on user-supplied criteria (e.g., specific hardware or test types).
- Calculates the min, median, and ratio for each matching trace.
4. **Aggregation**: Results are collected into a shared slice, protected by a mutex.
5. **Output**: After all workers finish, the tool sorts the results by the highest ratio (most noisy) and exports them as a CSV.
```text
GCS Bucket ----> List Files ----> [Filename Channel]
|
+----------------------------+----------------------------+
| | |
[Worker 1] [Worker 2] [Worker n]
Download & Parse Download & Parse Download & Parse
| | |
+----------------------------+----------------------------+
|
[Mutex Protected]
|
[Global Slice] ----> Sort ----> CSV Output
```
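The fan-out and aggregation pattern in the diagram can be sketched as follows. To stay within the standard library, this sketch swaps `errgroup` for a plain `sync.WaitGroup` and therefore omits error propagation; the `process` callback is a hypothetical stand-in for "download, parse, and compute ratios", and the `sampleInfo` fields shown are assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// sampleInfo mirrors the idea of the tool's struct; the fields are assumed.
type sampleInfo struct {
	traceID string
	ratio   float64
}

// analyzeAll distributes filenames to numWorkers goroutines, collects their
// results under a mutex, and returns them sorted by ratio, noisiest first.
func analyzeAll(filenames []string, numWorkers int, process func(string) []sampleInfo) []sampleInfo {
	names := make(chan string)
	var (
		mu      sync.Mutex
		results []sampleInfo
		wg      sync.WaitGroup
	)
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range names {
				infos := process(name)
				mu.Lock()
				results = append(results, infos...)
				mu.Unlock()
			}
		}()
	}
	for _, name := range filenames {
		names <- name
	}
	close(names) // lets the workers' range loops terminate
	wg.Wait()

	sort.Slice(results, func(i, j int) bool { return results[i].ratio > results[j].ratio })
	return results
}

func main() {
	files := []string{"a.json", "bb.json", "ccc.json"}
	// fakeProcess avoids any network access in this example.
	fakeProcess := func(name string) []sampleInfo {
		return []sampleInfo{{traceID: name, ratio: float64(len(name))}}
	}
	for _, info := range analyzeAll(files, 4, fakeProcess) {
		fmt.Printf("%s,%.1f\n", info.traceID, info.ratio)
	}
}
```

Sorting happens once, after all workers have finished, so the workers only contend on the short critical section that appends to the shared slice.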
## Key Components
### File Processing (`main.go`)
- **`initialize()`**: Handles command-line flags and sets up GCS clients and output writers. It defaults to a rolling 24-hour window if no prefix is provided.
- **`filenamesFromBucketAndObjectPrefix()`**: Uses an attribute-selection query to fetch only the names of files, reducing metadata overhead.
- **`traceInfoFromFilename()`**: The core logic unit. It integrates with `perf/go/ingest/parser` to extract raw sample values and uses the `go-moremath` library for statistical calculations.
- **`writeCSV()`**: Formats the final report, supporting truncation via the `--top` flag to focus only on the most problematic benchmarks.
### Filtering and Querying
The tool leverages the common `go/query` package. This allows users to pass complex filters via the `--filter` flag using a URL-query-like syntax (e.g., `arch=arm64&config=8888`). Only traces matching these key-value pairs are included in the variance analysis.
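The matching behavior for simple `key=value` filters can be approximated with the standard library. This sketch uses `net/url` in place of the `go/query` package, so it covers only exact matches and none of the richer syntax that package may support; `matchesFilter` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/url"
)

// matchesFilter reports whether params satisfies every key in the
// URL-query-style filter. A key listed with several values accepts any of them.
func matchesFilter(filter string, params map[string]string) (bool, error) {
	want, err := url.ParseQuery(filter)
	if err != nil {
		return false, err
	}
	for key, allowed := range want {
		found := false
		for _, v := range allowed {
			if params[key] == v {
				found = true
				break
			}
		}
		if !found {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	trace := map[string]string{"arch": "arm64", "config": "8888", "test": "blur"}
	ok, err := matchesFilter("arch=arm64&config=8888", trace)
	fmt.Println(ok, err)
}
```

Keys that appear in the trace but not in the filter are ignored, so a short filter selects a broad set of traces.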
### Execution Control
The module includes a `Makefile` that simplifies common operations, such as running the tool against specific GCS paths or piping the results directly to temporary files for quick inspection.
# Module: /scripts
The `/scripts` module provides a collection of administrative and developer tools designed to manage the lifecycle of data within the Perf ecosystem. This includes seeding local environments for development, migrating production-scale data for safe experimentation, and manually triggering the ingestion pipeline.
### High-Level Overview
The module serves three primary purposes:
1. **Environment Initialization**: Facilitating the setup of demo or local development environments with realistic data schemas and sample alerts.
2. **Data Portability**: Enabling the high-performance migration of data from production Google Cloud Spanner instances to experimental environments.
3. **Data Ingestion Management**: Providing bridges to upload performance data into Google Cloud Storage (GCS) in a format compatible with the Perf ingestion engine.
### Design Decisions and Implementation Choices
#### Bulk Data Migration via Protocol Bridging
A significant portion of this module is dedicated to moving data between Spanner instances. The implementation chooses **PGAdapter** over native Spanner SDKs for data movement. This decision allows the migration logic to treat Spanner as a PostgreSQL-compatible database, enabling the use of the `COPY` protocol via the `pgx` library. For bulk data this is significantly more efficient than standard `INSERT` statements, because rows are streamed over a single command instead of being sent as one statement per row.
To handle the multi-terabyte scale of production tables like `tracevalues`, the implementation avoids direct time-based filtering on the values themselves, which would be prohibitively slow. Instead, it utilizes a `JOIN` on the `sourcefiles` table, filtering by the file's creation timestamp to isolate a manageable subset of data for migration.
#### Safety and Data Integrity
The scripts incorporate safeguards to prevent destructive operations:
- **Production Guards**: Migration scripts include hardcoded checks to prevent accidental overwrites of known production instances.
- **Idempotency**: Before initiating a transfer, the tools check for existing data within the target time range. If data is found, the process halts to prevent duplication.
- **Partitioned DML**: The migration uses `PARTITIONED_NON_ATOMIC` DML mode on destination instances. This allows Spanner to run massive bulk operations across multiple partitions without hitting transaction size limits, at the cost of the operation no longer being applied as one atomic transaction.
#### Local Development Seeding
For local development, the module provides automated SQL seeding. Rather than relying on manual database entry, scripts use **shell here documents** to feed SQL containing complex JSON structures (like Alert configurations) into local databases. This ensures that developers can quickly replicate specific bug states or UI layouts with consistent data.
### Key Components and Workflows
#### Data Migration Tooling (`copy_data_to_experimental_db`)
This sub-module manages the complex flow of data from production to development environments. It coordinates the lifecycle of PGAdapter containers and the Go-based streaming logic.
```text
[ Production Spanner ]          [ Experimental Spanner ]
          |                               |
          | (Spanner Protocol)            | (Spanner Protocol)
          v                               v
[ PGAdapter :5432 ]             [ PGAdapter :5433 ]
          |                               |
          +----[ copy_data (Go Binary) ]--+
                          | (PostgreSQL Protocol)
                          |
          1. Query source (via JOIN on sourcefiles)
          2. Stream rows via CopyFrom interface
          3. Apply to destination with Partitioned DML
```
#### Database Seeding (`add_demo_alert_to_demo_db.sh`)
This script automates the population of the `Alerts` and `Subscriptions` tables. It is designed to create a "ready-to-use" state for the Perf UI by:
- Defining a standard `Alert` JSON payload that covers common regression detection parameters (e.g., `stepfit` algo, `absolute` step).
- Linking alerts to subscriptions to verify notification workflows.
- Using `EXTRACT(EPOCH FROM NOW())` to ensure time-sensitive fields are current, preventing immediate expiration of demo data.
#### Ingestion Bridge (`upload_extracted_json_files.sh`)
This utility bridges local performance test results to the cloud ingestion pipeline. It enforces a strict directory structure required by the Perf ingester:
- **Path Logic**: It uploads files to `gs://skia-perf/nano-json-v1/YYYY/MM/DD/HH`.
- **Temporal Organization**: By forcing a date-based hierarchy, it ensures that the ingester can process files in chronological order and prevents any single directory from becoming a performance bottleneck in GCS.
# Module: /scripts/copy_data_to_experimental_db
# Copy Data to Experimental DB
This module provides a utility for copying data from a production Google Cloud Spanner database to an experimental or development Spanner instance. It is designed to facilitate testing and debugging with real-world data volumes and distributions without risking the integrity of production environments.
## Overview
The migration process leverages **PGAdapter**, a proxy that allows Cloud Spanner to be accessed via the PostgreSQL wire protocol. By running two instances of PGAdapter, the migration script can treat both the source (Production) and the destination (Experimental) as standard PostgreSQL databases, using the `pgx` library's efficient `CopyFrom` functionality to stream data between them.
## Design Decisions and Implementation Choices
### PGAdapter as a Bridge
Rather than using Spanner-specific SDKs for row-by-row manipulation, the module uses PGAdapter to expose Spanner through a PostgreSQL interface. This allows the use of the `COPY` protocol, which is significantly faster for bulk data movement than individual `INSERT` statements.
### Data Safety and Idempotency
To prevent accidental corruption of production or existing experimental data:
- **Production Instance Protection**: The `run_two_spanners.sh` script includes hardcoded checks that reject known production instances when they are given as the destination.
- **Duplicate Detection**: Before copying, `copy_data.go` checks if the destination table already contains data for the requested time range. If data exists, the script aborts for that table to avoid duplication.
- **Service Account Scoping**: The documentation recommends using a service account with read-only permissions on the source to enforce security at the IAM level.
### Large Scale Data Handling (TraceValues)
The `tracevalues` table in Perf is typically massive (multi-terabyte). Standard time-based filtering on this table is inefficient. To address this, the script implements a specific optimization:
- It performs a `JOIN` with the `sourcefiles` table.
- It filters records based on the `createdat` timestamp of the _source file_ rather than the trace value itself, allowing for manageable subsets of data to be migrated based on ingestion time.
### Partitioned DML
The script sets `SPANNER.AUTOCOMMIT_DML_MODE='PARTITIONED_NON_ATOMIC'` on the destination connection. This is a Spanner-specific optimization for bulk operations that allows the database to execute changes across multiple partitions independently, avoiding the overhead of a single massive transaction that would exceed Spanner's mutation limits.
## Key Components and Files
### `run_two_spanners.sh`
This bash script manages the environment setup. It launches two Docker containers running PGAdapter:
- **Source Port (5432)**: Connects to the production instance (defaulting to `chrome_int`).
- **Destination Port (5433)**: Connects to the user-specified experimental instance.
### `copy_data.go`
The core logic of the migration. It is responsible for:
- **Mapping**: Maintaining the `tableToColumns` map which defines the schema for Perf tables (e.g., `regressions2`, `commits`, `postings`).
- **Streaming**: Implementing the `pgx.CopyFromSource` interface to pipe rows directly from the source query results into the destination's `CopyFrom` command.
- **Filtering**: Applying duration-based filters (e.g., "last 7 days") to the SQL queries to limit the volume of data moved.
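The streaming piece hinges on a small iterator-style interface. The sketch below redeclares an interface with the same method set as `pgx.CopyFromSource` so that it compiles without the `pgx` module, and backs it with an in-memory slice; the real tool would wrap the rows returned by the source query instead. `sliceSource` and `drain` are hypothetical names.

```go
package main

import "fmt"

// copyFromSource mirrors the method set of pgx.CopyFromSource. It is
// redeclared locally so this sketch has no external dependencies.
type copyFromSource interface {
	Next() bool
	Values() ([]any, error)
	Err() error
}

// sliceSource serves rows from memory.
type sliceSource struct {
	rows [][]any
	idx  int // index of the next row to serve
}

func (s *sliceSource) Next() bool {
	if s.idx >= len(s.rows) {
		return false
	}
	s.idx++
	return true
}

// Values returns the row selected by the most recent call to Next.
func (s *sliceSource) Values() ([]any, error) { return s.rows[s.idx-1], nil }

func (s *sliceSource) Err() error { return nil }

// drain stands in for the destination side: it pulls rows one at a time,
// so the full result set never has to be held in memory by the consumer.
func drain(src copyFromSource) (int, error) {
	n := 0
	for src.Next() {
		if _, err := src.Values(); err != nil {
			return n, err
		}
		n++
	}
	return n, src.Err()
}

func main() {
	src := &sliceSource{rows: [][]any{{1, "abc"}, {2, "def"}}}
	n, err := drain(src)
	fmt.Println(n, err) // 2 <nil>
}
```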
### `BUILD.bazel`
Defines the Go binary and library dependencies. Notably, it links to `//perf/go/sql/spanner`, ensuring that the script uses the same column definitions as the main Perf application.
## Workflow Process
The following diagram illustrates how data flows from the production Spanner instance to the experimental one through the proxy layer:
```text
[ Production Spanner ]        [ Experimental Spanner ]
          |                            |
          | (Spanner Protocol)         | (Spanner Protocol)
          |                            |
[ PGAdapter :5432 ]           [ PGAdapter :5433 ]
          |                            |
          +-------< copy_data.go >-----+
              (PostgreSQL Protocol)
          1. Query source for recent data
          2. Stream rows via CopyFrom interface
          3. Insert into destination
```
## Usage Logic
The migration follows three steps:
1. **Schema Preparation**: Manually ensure that the destination tables exist (using DDL from the production console or the codebase).
2. **Proxy Initialization**: Launch the two PGAdapter containers to map the remote Spanner instances to local ports.
3. **Execution**: Run the Go binary with a specified `--duration`. If `--table-name all` is used, the tool iterates through every table in the `tableToColumns` map; otherwise, it targets only the named table.
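The table-selection rule in the execution step can be sketched as follows. The table names come from this document, but the column lists and the `tablesToCopy` helper are hypothetical, and sorting the names is an assumption made here to get a stable order out of Go's randomized map iteration.

```go
package main

import (
	"fmt"
	"sort"
)

// tableToColumns is a stand-in for the real map; the column lists are
// hypothetical.
var tableToColumns = map[string][]string{
	"commits":      {"commit_number", "git_hash"},
	"postings":     {"tile_number", "key_value", "trace_id"},
	"regressions2": {"id", "commit_number"},
}

// tablesToCopy resolves the --table-name value into a list of tables.
func tablesToCopy(tableName string) ([]string, error) {
	if tableName == "all" {
		names := make([]string, 0, len(tableToColumns))
		for name := range tableToColumns {
			names = append(names, name)
		}
		sort.Strings(names) // map iteration order is random; sort for stable runs
		return names, nil
	}
	if _, ok := tableToColumns[tableName]; !ok {
		return nil, fmt.Errorf("unknown table %q", tableName)
	}
	return []string{tableName}, nil
}

func main() {
	for _, arg := range []string{"all", "commits", "nope"} {
		tables, err := tablesToCopy(arg)
		fmt.Println(arg, tables, err)
	}
}
```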
# Module: /secrets
# Perf Secrets Management
The `/secrets` module provides a set of automated tools for managing the sensitive credentials, service accounts, and OAuth tokens required by various Skia Perf components. It ensures that services—such as data ingestion, email alerting, and database backups—have the necessary permissions to interact with Google Cloud Platform (GCP) resources and external APIs securely.
## Core Responsibilities
The module is designed to handle three primary types of sensitive information:
1. **GCP Service Accounts**: Provisioning accounts with specific IAM roles and linking them to Kubernetes via Workload Identity or static keys.
2. **Email Authentication**: Facilitating the OAuth2 "Three-Legged Flow" to allow Perf to send alerts via Gmail.
3. **Cloud Resource Permissions**: Granting specific access to GCS buckets (e.g., `skia-perf`), Pub/Sub topics, and Cloud Trace.
## Design Patterns and Implementation
### Service Account Provisioning
The module heavily relies on automated scripting to ensure reproducible and consistent permission sets. Most scripts (e.g., `create-perf-ingest-sa.sh`, `create-perf-sa.sh`) follow a standardized workflow:
- **Infrastructure as Code (IaC) approach**: Scripts define the exact roles (e.g., `roles/pubsub.editor`, `roles/cloudtrace.agent`) required for a service to function.
- **Workload Identity**: Where possible, scripts configure IAM policy bindings to allow Kubernetes service accounts to impersonate GCP service accounts. This removes the need for long-lived JSON keys and so reduces the risk of credential leakage.
- **Ramdisk Usage**: Scripts utilize `../bash/ramdisk.sh` to perform sensitive operations in memory, ensuring that temporary secret files or JSON keys are never written to physical disk.
### Email Alerting Secrets
The `create-email-secrets.sh` script manages the complex process of authorizing Perf to send emails. It bridges the gap between Google’s OAuth2 requirements and Kubernetes secrets:
- **Interaction**: It prompts the user to provide a `client_secret.json` (obtained from the GCP Console) and then executes a local Go tool (`three_legged_flow`) to generate a `client_token.json`.
- **Normalization**: It normalizes email addresses into a format suitable for Kubernetes secret names (e.g., converting `@` and `.` to `-`).
- **Ephemeral Tokens**: It immediately deletes the token file from the local environment after injecting it into the cluster to prevent accidental leakage of refresh tokens.
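The normalization rule is simple enough to show directly. The script itself is a shell script; the Go function below only illustrates the character mapping described above, and `secretNameFromEmail` is a hypothetical name. Any prefix or suffix the real script adds to the secret name is not modeled.

```go
package main

import (
	"fmt"
	"strings"
)

// secretNameFromEmail replaces the characters that are awkward in Kubernetes
// secret names ('@' and '.') with '-'.
func secretNameFromEmail(email string) string {
	return strings.NewReplacer("@", "-", ".", "-").Replace(email)
}

func main() {
	fmt.Println(secretNameFromEmail("alerts@example.com")) // alerts-example-com
}
```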
## Key Workflows
### Provisioning a New Service Account
The following diagram illustrates the lifecycle of a service account creation within this module:
```text
Local Script Execution
        |
        v
[ RAMDISK Creation ] --------> [ GCP IAM API ]
        |                            |
        |                            +-- Create Service Account
        |                            +-- Assign IAM Roles (PubSub, GCS, Trace)
        v                            +-- Bind Workload Identity (K8s <-> GCP)
[ Generate JSON Key ] (Optional)
        |
        v
[ kubectl create secret ] ----> [ Kubernetes Cluster ]
        |
        v
[ RAMDISK Cleanup ]
```
## Component Summary
- **Email Alerting (`create-email-secrets.sh`)**: Handles OAuth2 token generation for Gmail integration. Specifically creates secrets for the `alertserver`.
- **Ingestion & Backend (`create-perf-ingest-sa.sh`, `create-perf-sa.sh`)**: Configures permissions for the core Perf processes to read from GCS buckets (`skia-perf`, `cluster-telemetry-perf`) and write to Pub/Sub for data processing.
- **Specialized Accounts**:
  - `create-flutter-perf-service-account.sh`: Tailored permissions for the Flutter-specific Perf instance.
  - `create-perf-cockroachdb-backup-service-account.sh`: Minimalist account with `roles/storage.objectAdmin` specifically for database backup cronjobs.
# Module: /smoke_tests
# Perf Smoke Tests
The `smoke_tests` module provides a suite of high-level integration tests for the Perf application. These tests use Puppeteer to automate a headless (or headed) Chrome browser, simulating real user interactions to ensure that critical pages and components load correctly and function as expected in a live environment.
## Design Philosophy
The primary goal of these tests is to verify the "health" of the system rather than exhaustive feature testing. They focus on:
- **End-to-End Connectivity:** Ensuring the web server, authentication layers (like IAP or auth-proxy), and database backends (like Spanner) are working together.
- **Performance Budgeting:** Many tests enforce a timeout (typically 5 seconds) for page loads to ensure the UI remains responsive.
- **Visibility & Debugging:** In case of failure, the system automatically captures screenshots and logs browser console output, network responses, and request failures to aid in rapid triaging.
## Key Components and Workflows
### Authentication and Authorization
Most tests interact with instances protected by Google Identity-Aware Proxy (IAP) or a local auth-proxy.
- **IAP Authentication:** Tests like `alerts_nodejs_test.ts` and `cluster_nodejs_test.ts` use the `google-auth-library` to fetch an ID token. This token is injected into the Puppeteer page's extra HTTP headers, allowing the automated browser to bypass the IAP login screen.
- **Local Proxy:** By default, tests target `http://localhost:8003`. The `utils.ts` file manages the `PERF_BASE_URL` and provides helper functions to apply standard test configurations (like cookies and logging).
### Test Environment Lifecycle
The module supports different execution modes depending on the developer's needs:
1. **Standard Headless Execution:** Used in CI/CD and standard local runs via Bazel.
2. **Cloudtop/CRD Debugging:** If the `DEBUG_VIA_CRD` environment variable is set, the tests can launch a browser visible through Chrome Remote Desktop. This introduces a startup delay to allow the developer to switch windows and watch the interaction.
```text
+------------------+       +-------------------+       +-----------------------+
|   Bazel / Node   |       |  Puppeteer/Utils  |       | Target Perf Instance  |
+------------------+       +-------------------+       +-----------------------+
| 1. Launch Test   | ----> | 2. Launch Browser |       |                       |
|                  |       | 3. Auth Injection | ----> | 4. Request Page       |
|                  |       |                   |       | 5. Return HTML/JS     |
| 7. Check Result  | <---- | 6. Wait for Select| <---- |                       |
+------------------+       +-------------------+       +-----------------------+
         |                         |
         +---- (On Failure) -------+----> [ Take Screenshot & Log Errors ]
```
### Core Utilities (`utils.ts`)
The `utils.ts` file centralizes common logic to keep individual test files clean:
- **`applyPageDefaults`:** Attaches event listeners to the Puppeteer page to capture `console`, `pageerror`, and network failures. It also sets a `puppeteer=true` cookie, which signals the Perf frontend to disable non-deterministic behaviors like animations or simulated RPC latency.
- **`browserForSmokeTest`:** Encapsulates the creation of the browser instance, switching between headless and debug modes based on environment variables.
## Specialized Tests
- **Regression Tests (`regression_page_nodejs_test.ts`):** These tests are more complex than simple load tests. They navigate to specific subscription views (e.g., V8 or Fuchsia) and use `Promise.race` to wait for either a populated anomaly table or a "clear" message. This handles the non-deterministic nature of production data where anomalies may or may not exist at any given time.
- **Page Load Tests:** Files like `perf-chrome-public-load-a_nodejs_test.ts` verify that the primary routing endpoints (`/a`, `/m`, `/e`) render their main functional components (like `#anomaly-table` or `#test-picker`) within the 5-second budget.
## Debugging and Manual Execution
Tests are tagged as `manual` in the `BUILD.bazel` file, indicating they are typically run against a specific local or development instance rather than a hermetic build environment.
The `Makefile` provides shorthand commands for running these tests:
- `test-regressions`: Runs the standard regression suite.
- `test-regressions-crd`: Runs the suite with settings optimized for debugging via Chrome Remote Desktop, streaming the output to the terminal.
Developers can point the suite to any instance by overriding the `PERF_BASE_URL` environment variable during the Bazel invocation.