vello_hybrid: remove suspending work from scheduler (#1174)

This PR addresses multiple tasks tracked in
https://github.com/linebender/vello/issues/1165

### Context

Recently we landed https://github.com/linebender/vello/pull/1155 but it
had lots of caveats. It also utilised suspending wide tile commands in
order to pack rounds. A large pain point is that the system **could
deadlock itself**.

At renderer office hours we discussed pre-planning out wide tiles into
rounds and whether suspending was necessary. It was agreed that
suspending was used as it was lower cost – but I felt like I hadn't
explored a non-suspending solution adequately.

Originally I was working towards solving the deadlock issue – which
would require either removing the suspending concept, or making
suspending smarter such that it wouldn't get into deadlocks as much.
Adding another system on top of the scheduler to "guess" resources to
prevent deadlocks was really thorny and complex, so I also explored
simply removing the complexity of suspending wide tile commands, and it
worked!
This PR removes the entire concept of suspending.

Removing suspending:
 - Deletes many of the allocations I added.
 - Makes deadlocks impossible, as they were coupled to the removed system.
 - Simplifies the code and improves performance.

### How

The most pertinent changes in this PR are the round calculations in
`PushBuf`. I messed up the rounds where we copy the destination slot
contents back to the temporary texture. By fixing these rounds, the
whole system "just works" without suspending because work is scheduled
into rounds correctly.


#### What is the insight and why could suspending be removed?

On `main` note that suspending _only_ happens when pushing a buffer. It
happens when the current top of stack (`tos`) has an invalid temporary
slot which needs to be copied to the alternate texture such that a
future blend operation will work.

There was a bug such that rounds were calculated incorrectly in push
buffer. Hence, by suspending we would force enough rounds to pass such
that temporary slots are made valid.

The key insight is that the `round` tracked on each tile records the
round in which that tile is doing work. Work on different tiles in the
stack can proceed in parallel as long as the tiles are mutually
exclusive. Only when blending or popping buffers do we take
`max(nos.round, tos.round)` (where `nos` is the next entry on the stack,
below `tos`) to synchronize the rounds and ensure that all work is done
before blending.
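
As a rough illustration of that rule, here is a minimal sketch. The `StackEntry` type and both helper functions are hypothetical and are not the types used in `vello_hybrid`; only the `max(nos.round, tos.round)` step comes from the description above.

```rust
/// Hypothetical stand-in for an entry on the tile stack. Only the `round`
/// field mirrors the description above; the real scheduler tracks more state.
#[derive(Debug, Clone, Copy)]
struct StackEntry {
    /// Round in which this entry's most recent work is scheduled.
    round: usize,
}

/// Independent draw work only depends on the entry's own round.
fn draw_round(entry: &StackEntry) -> usize {
    entry.round
}

/// Blending `tos` into `nos` needs both entries to be finished, so the blend
/// is scheduled no earlier than the later of the two rounds.
fn blend_round(nos: &StackEntry, tos: &StackEntry) -> usize {
    nos.round.max(tos.round)
}

fn main() {
    let mut nos = StackEntry { round: 1 };
    let tos = StackEntry { round: 3 };

    // Until a blend happens, the two entries progress independently.
    println!("nos draws in round {}", draw_round(&nos));
    println!("tos draws in round {}", draw_round(&tos));

    // The blend synchronizes them: `nos` catches up to the later round.
    nos.round = blend_round(&nos, &tos);
    println!("blend scheduled in round {}", nos.round);
}
```

Because the rounds are only merged at blend or pop time, nothing has to be suspended to wait for a slot to become valid; the round number alone carries the ordering constraint.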

This insight is what lets us fix the round calculations for compositing
and allows us to remove suspending as a _hack_ to get around the round
calculation bug. This explains why we get better round utilization with
this PR.


I also tested empirically that this change is worthwhile.

### Empirical Tests

#### Round stats

With the following code in the `Scheduler::flush` method we can capture
some insightful statistics:

```rs
// Debug builds only: print per-round scheduling statistics.
#[cfg(debug_assertions)]
{
    // Strip counts per draw destination: texture 0, texture 1, final target.
    let strips = [
        round.draws[0].0.len(),
        round.draws[1].0.len(),
        round.draws[2].0.len(),
    ];
    let total_strips: usize = strips.iter().sum();
    // Slots cleared and freed this round, summed over both slot textures.
    let slots_cleared = round.clear[0].len() + round.clear[1].len();
    let slots_freed = round.free[0].len() + round.free[1].len();

    eprintln!(
        "Round {}: {} strips (tex0:{}, tex1:{}, target:{}), {} cleared, {} freed",
        self.round,
        total_strips,
        strips[0],
        strips[1],
        strips[2],
        slots_cleared,
        slots_freed
    );
}
```

**Command**: `cargo nextest run complex_composed_layers_hybrid --locked
--all-features --no-fail-fast --no-capture`


**On `main`:**
```md
Round 0: 299 strips (tex0:59, tex1:240, target:0), 225 cleared, 0 freed
Round 1: 152 strips (tex0:60, tex1:92, target:0), 120 cleared, 40 freed
Round 2: 134 strips (tex0:66, tex1:68, target:0), 265 cleared, 60 freed
Round 3: 81 strips (tex0:35, tex1:46, target:0), 30 cleared, 95 freed
Round 4: 50 strips (tex0:25, tex1:25, target:0), 10 cleared, 100 freed
Round 5: 50 strips (tex0:25, tex1:25, target:0), 5 cleared, 90 freed
Round 6: 70 strips (tex0:25, tex1:25, target:20), 5 cleared, 135 freed
Round 7: 15 strips (tex0:5, tex1:5, target:5), 0 cleared, 30 freed
```

**This PR:**
```md
Round 0: 496 strips (tex0:135, tex1:361, target:0), 550 cleared, 15 freed
Round 1: 120 strips (tex0:60, tex1:60, target:0), 25 cleared, 130 freed
Round 2: 100 strips (tex0:50, tex1:50, target:0), 50 cleared, 150 freed
Round 3: 50 strips (tex0:25, tex1:25, target:0), 30 cleared, 90 freed
Round 4: 70 strips (tex0:25, tex1:25, target:20), 5 cleared, 135 freed
Round 5: 15 strips (tex0:5, tex1:5, target:5), 0 cleared, 30 freed
```

This PR draws the same complex blend scene in 6 rounds instead of the 8
needed on `main`.
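
As a quick sanity check on the two logs, the per-round strip counts can be totalled. The sketch below hard-codes the numbers printed above; the `summarize` helper and the constant names are made up for this illustration.

```rust
/// Returns (number of rounds, total strips, mean strips per round).
fn summarize(strips_per_round: &[usize]) -> (usize, usize, f64) {
    let rounds = strips_per_round.len();
    let total: usize = strips_per_round.iter().sum();
    (rounds, total, total as f64 / rounds as f64)
}

// Strip totals per round, copied from the logs above.
const MAIN_STRIPS: [usize; 8] = [299, 152, 134, 81, 50, 50, 70, 15];
const PR_STRIPS: [usize; 6] = [496, 120, 100, 50, 70, 15];

fn main() {
    for (label, strips) in [("main", &MAIN_STRIPS[..]), ("this PR", &PR_STRIPS[..])] {
        let (rounds, total, mean) = summarize(strips);
        println!("{label}: {rounds} rounds, {total} strips, {mean:.1} strips/round");
    }
}
```

Both runs draw 851 strips in total, so the same amount of work is packed into fewer rounds (about 106 strips per round on `main` versus about 142 with this PR).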



#### Deadlock tests

This was tested with the following diff:

```diff
diff --git a/sparse_strips/vello_hybrid/src/render/wgpu.rs b/sparse_strips/vello_hybrid/src/render/wgpu.rs
index f6998c02..1e3e3aef 100644
--- a/sparse_strips/vello_hybrid/src/render/wgpu.rs
+++ b/sparse_strips/vello_hybrid/src/render/wgpu.rs
@@ -71,7 +71,7 @@ impl Renderer {
 
         Self {
             programs: Programs::new(device, render_target_config, total_slots),
-            scheduler: Scheduler::new(total_slots),
+            scheduler: Scheduler::new(12),
             image_cache,
         }
     }
```

**Command**: `cargo nextest run --locked --all-features --no-fail-fast
--release`, run with `sparse_strips` as the working directory.

With the slot count limited to 12, this PR passes all tests. On `main`,
the tests fail with a deadlock.



#### `wgpu_webgl` example (scalar)
(this is very rough and measured while dragging and zooming)

On `main`, each rAF (`requestAnimationFrame` callback) for the blend
example takes ~6.5ms.
With this PR it takes ~5ms.


### Test plan

Tested manually with `cargo run_wasm -p wgpu_webgl --release` and `cargo
run -p vello_hybrid_winit --release`, and by relying on the existing
test corpus, which includes hybrid compositing tests.


### Risks

I don't think there are major risks introduced by this PR. This PR
really polishes rough edges introduced by the initial blend layers PR
and eliminates the deadlock hazard.
1 file changed
tree: d17e043153b500c63a1fff19a54a8c67f062f627
README.md

Vello

A GPU compute-centric 2D renderer

Vello is a 2D graphics rendering engine written in Rust, with a focus on GPU compute. It can draw large 2D scenes with interactive or near-interactive performance, using wgpu for GPU access.

Quickstart to run an example program:

cargo run -p with_winit

It is used as the rendering backend for Xilem, a Rust GUI toolkit.

[!WARNING] Vello can currently be considered in an alpha state. In particular, we're still working on the following:

Significant changes are documented in the changelog.

Motivation

Vello is meant to fill the same place in the graphics stack as other vector graphics renderers like Skia, Cairo, and its predecessor project Piet. On a basic level, that means it provides tools to render shapes, images, gradients, text, etc, using a PostScript-inspired API, the same that powers SVG files and the browser <canvas> element.

Vello's selling point is that it gets better performance than other renderers by better leveraging the GPU. In traditional PostScript-style renderers, some steps of the render process like sorting and clipping either need to be handled in the CPU or done through the use of intermediary textures. Vello avoids this by using prefix-sum algorithms to parallelize work that usually needs to happen in sequence, so that work can be offloaded to the GPU with minimal use of temporary buffers.

This means that Vello needs a GPU with support for compute shaders to run.

Getting started

Vello is meant to be integrated deep in UI render stacks. While drawing in a Vello scene is easy, actually rendering that scene to a surface requires setting up a wgpu context, which is a non-trivial task.

To use Vello as the renderer for your PDF reader / GUI toolkit / etc, your code will have to look roughly like this:

use vello::{
    kurbo::{Affine, Circle},
    peniko::{Color, Fill},
    *,
};

// Initialize wgpu and get handles
let (width, height) = ...;
let device: wgpu::Device = ...;
let queue: wgpu::Queue = ...;
let mut renderer = Renderer::new(
   &device,
   RendererOptions::default()
).expect("Failed to create renderer");
// Create scene and draw stuff in it
let mut scene = vello::Scene::new();
scene.fill(
   Fill::NonZero,
   Affine::IDENTITY,
   Color::from_rgb8(242, 140, 168),
   None,
   &Circle::new((420.0, 200.0), 120.0),
);
// Draw more stuff
scene.push_layer(...);
scene.fill(...);
scene.stroke(...);
scene.pop_layer(...);
let texture = device.create_texture(&...);

// Render to a wgpu Texture
renderer
   .render_to_texture(
      &device,
      &queue,
      &scene,
      &texture,
      &vello::RenderParams {
         base_color: palette::css::BLACK, // Background color
         width,
         height,
         antialiasing_method: AaConfig::Msaa16,
      },
   )
   .expect("Failed to render to a texture");
// Do things with `texture`, such as blitting it to the Surface using
// wgpu::util::TextureBlitter

See the examples directory for code that integrates with frameworks like winit.

Performance

We've observed 177 fps for the paris-30k test scene on an M1 Max, at a resolution of 1600 pixels square, which is excellent performance and represents something of a best case for the engine.

More formal benchmarks are on their way.

Integrations

SVG

A separate Linebender integration for rendering SVG files is available through vello_svg.

Lottie

A separate Linebender integration for playing Lottie animations is available through velato.

Bevy

A separate Linebender integration for rendering raw scenes or Lottie and SVG files in Bevy is available through bevy_vello.

Examples

Our examples are provided in separate packages in the examples directory. This allows them to have independent dependencies and faster builds. Examples must be selected using the --package (or -p) Cargo flag.

Winit

Our winit example (examples/with_winit) demonstrates rendering to a winit window. By default, this renders the GhostScript Tiger as well as all SVG files you add in the examples/assets/downloads directory. A custom list of SVG file paths (and directories to render all SVG files from) can be provided as arguments instead. It also includes a collection of test scenes showing the capabilities of vello, which can be shown with --test-scenes.

cargo run -p with_winit

Platforms

We aim to target all environments which can support WebGPU with the default limits. We defer to wgpu for this support. Other platforms are more tricky, and may require special building/running procedures.

Web

Because Vello relies heavily on compute shaders, we rely on the emerging WebGPU standard to run on the web. Browser support for WebGPU is still evolving. Vello has been tested using production versions of Chrome, but WebGPU support in Firefox and Safari is still experimental. It may be necessary to use development browsers and explicitly enable WebGPU.

The following command builds and runs a web version of the winit demo. This uses cargo-run-wasm to build the example for the web and host a local server for it.

# Make sure the Rust toolchain supports the wasm32 target
rustup target add wasm32-unknown-unknown

# The binary name must also be explicitly provided as it differs from the package name
cargo run_wasm -p with_winit --bin with_winit_bin

There is also a web demo available here on supporting web browsers.

[!WARNING] The web is not currently a primary target for Vello, and WebGPU implementations are incomplete, so you might run into issues running this example.

Android

The with_winit example supports running on Android, using cargo apk.

cargo apk run -p with_winit --lib

[!TIP] cargo apk doesn't support running in release mode without configuration. See their crates page docs (around package.metadata.android.signing.<profile>).

See also cargo-apk#16. To run in release mode, you must add the following to examples/with_winit/Cargo.toml (changing $HOME to your home directory):

[package.metadata.android.signing.release]
path = "$HOME/.android/debug.keystore"
keystore_password = "android"

[!NOTE] As cargo apk does not allow passing command line arguments or environment variables to the app when it is run, these can be embedded into the program at compile time (currently for Android only). with_winit currently supports the environment variables:

  • VELLO_STATIC_LOG, which is equivalent to RUST_LOG
  • VELLO_STATIC_ARGS, which is equivalent to passing in command line arguments

For example (with unix shell environment variable syntax):

VELLO_STATIC_LOG="vello=trace" VELLO_STATIC_ARGS="--test-scenes" cargo apk run -p with_winit --lib

Minimum supported Rust Version (MSRV)

This version of Vello has been verified to compile with Rust 1.85 and later.

Future versions of Vello might increase the Rust version requirement. It will not be treated as a breaking change and as such can even happen with small patch releases.

As time has passed, some of Vello's dependencies could have released versions with a higher Rust requirement. If you encounter a compilation issue due to a dependency and don't want to upgrade your Rust toolchain, then you could downgrade the dependency.

# Use the problematic dependency's name and version
cargo update -p package_name --precise 0.1.1

Community

Discussion of Vello development happens in the Linebender Zulip, specifically the #vello channel. All public content can be read without logging in.

Contributions are welcome by pull request. The Rust code of conduct applies.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache 2.0 license, shall be licensed as noted in the License section, without any additional terms or conditions.

History

Vello was previously known as piet-gpu. This prior incarnation used a custom cross-API hardware abstraction layer, called piet-gpu-hal, instead of wgpu.

An archive of this version can be found in the branches custom-hal-archive-with-shaders and custom-hal-archive. This succeeded the previous prototype, piet-metal, and included work adapted from piet-dx12.

The decision to lay down piet-gpu-hal in favor of WebGPU is discussed in detail in the blog post Requiem for piet-gpu-hal.

A vision document dated December 2020 explained the longer-term goals of the project, and how we might get there. Many of these items are out-of-date or completed, but it still may provide some useful background.

Related projects

Vello takes inspiration from many other rendering projects, including:

License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.

In addition, all files in the vello_shaders/shader and vello_shaders/src/cpu directories and subdirectories thereof are alternatively licensed under the Unlicense (vello_shaders/shader/UNLICENSE or http://unlicense.org/). For clarity, these files are also licensed under either of the above licenses. The intent is for this research to be used in as broad a context as possible.

The files in subdirectories of the examples/assets directory are licensed solely under their respective licenses, available in the LICENSE file in their directories.