Reland "Simplify quickReject implementation in SkCanvas"

This is a reland of 0a0f4f5c35520cb7ca5b065d3813e30dae124a77

This change makes SkCanvas::quickReject always reject empty draw bounds,
whereas previously scale+translate CTMs allowed bounds with w or h == 0
but otherwise contained in the clip to be drawn. This uncovered some
bugs in Skia where bounds shouldn't be empty, and in Flutter where
bounds were legit empty but not expected by the test.

No code changes needed. The issues that required its revert have been
fixed with:
1. https://github.com/flutter/engine/pull/26053 so that platforms that
use an empty typeface, leading to empty draws are just skipped.
2. https://skia-review.googlesource.com/c/skia/+/406140 so that path
effects update bounds so that Android's 1D dash path effect applied to
a horizontal line is properly not rejected.

Based on the period of time where the original CL was landed, some perf
data was collected.
- There were no significant changes in most SKPs or SVGs, except for a
Flutter page flip skp, which saw a 10% net improvement (perhaps the
flip is drawn with perspective?)
- A 10-20% regression in the motionmark paths skp, but dominated by the MSVC
compiler, so I'm not too concerned about that.
- Perspective microbenchmarks for drawing rectangles are 1.5-2x faster.
- quickReject microbenchmarks are about 2x slower.

The last two microbenchmark results aren't surprising since perspective
was the largest improvement in perf for SkM44::MapRect vs.
SkMatrix::mapRect, and the scale+translate specializations in Skmatrix
were maybe 50% faster than SkM44's. That would account for some of the
slow downs, and the rest could be explained by moving away from the
SIMD rect intersection and nan test.

Since these microreductions don't seem to bleed into more complex
benchmarks, I'm inclined to keep the code simple and not bring back the
custom intrinsics.

Original change's description:
> Simplify quickReject implementation in SkCanvas
>
>  - SkCanvas no longer keeps fIsScaleTranslate bool that has to stay in
>    sync with the type of the matrix.
>  - No more fast or slow path for quickReject, the Sk4f code has been
>    completely removed.
>  - Uses SkM44::mapRect instead of SkMatrix::mapRect. This is slightly
>    slower for S+T, but much faster for other transforms. I'm hopeful we
>    won't notice the regression in the grand scheme for S+T, since the
>    code is a lot simpler now.
>  - The final isFinite() and intersects() check for quickReject uses
>    SkRect's functions instead of hand-written SSE/NEON. If we think this
>    is optimization is necessary, I'm hoping we can rewrite it in terms
>    of skvx instead of specific instructions.
>  - Consolidated how the quick-reject bounds outsetting into
>    computeDeviceClipBounds, and added an option to skip outsetting for
>    the one call site that doesn't want it.
>
> Bug: skia:10987
> Change-Id: I3cf2a73636cdeed06d12cab4548cfb94d1eb074a
> Reviewed-on: https://skia-review.googlesource.com/c/skia/+/405198
> Commit-Queue: Mike Reed <reed@google.com>
> Auto-Submit: Michael Ludwig <michaelludwig@google.com>
> Reviewed-by: Mike Reed <reed@google.com>

Bug: skia:10987
Change-Id: Id0d4b4ecebf0b83ae30f7e1a263961ab25de28dd
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/407358
Commit-Queue: Michael Ludwig <michaelludwig@google.com>
Reviewed-by: Michael Ludwig <michaelludwig@google.com>
diff --git a/include/core/SkCanvas.h b/include/core/SkCanvas.h
index 2bfc2d0..cb6ca55 100644
--- a/include/core/SkCanvas.h
+++ b/include/core/SkCanvas.h
@@ -2423,16 +2423,13 @@
 
     virtual SkPaintFilterCanvas* internal_private_asPaintFilterCanvas() const { return nullptr; }
 
-    /**
-     *  Keep track of the device clip bounds and if the matrix is scale-translate.  This allows
-     *  us to do a fast quick reject in the common case.
-     */
-    bool   fIsScaleTranslate;
+    // Keep track of the device clip bounds in the canvas' global space to reject draws before
+    // invoking the top-level device.
     SkRect fQuickRejectBounds;
 
     // Compute the clip's bounds based on all clipped SkDevice's reported device bounds transformed
     // into the canvas' global space.
-    SkRect computeDeviceClipBounds() const;
+    SkRect computeDeviceClipBounds(bool outsetForAA=true) const;
 
     class AutoUpdateQRBounds;
     void validateClip() const;
diff --git a/src/core/SkCanvas.cpp b/src/core/SkCanvas.cpp
index 06c5da0..833df74 100644
--- a/src/core/SkCanvas.cpp
+++ b/src/core/SkCanvas.cpp
@@ -18,7 +18,6 @@
 #include "include/core/SkTextBlob.h"
 #include "include/core/SkVertices.h"
 #include "include/effects/SkRuntimeEffect.h"
-#include "include/private/SkNx.h"
 #include "include/private/SkTo.h"
 #include "include/utils/SkNoDrawCanvas.h"
 #include "src/core/SkArenaAlloc.h"
@@ -271,18 +270,6 @@
     }
 };
 
-static inline SkRect qr_clip_bounds(const SkRect& bounds) {
-    if (bounds.isEmpty()) {
-        return SkRect::MakeEmpty();
-    }
-
-    // Expand bounds out by 1 in case we are anti-aliasing.  We store the
-    // bounds as floats to enable a faster quick reject implementation.
-    SkRect dst;
-    (Sk4f::Load(&bounds.fLeft) + Sk4f(-1.f, -1.f, 1.f, 1.f)).store(&dst.fLeft);
-    return dst;
-}
-
 class SkCanvas::AutoUpdateQRBounds {
 public:
     explicit AutoUpdateQRBounds(SkCanvas* canvas) : fCanvas(canvas) {
@@ -291,7 +278,7 @@
         fCanvas->validateClip();
     }
     ~AutoUpdateQRBounds() {
-        fCanvas->fQuickRejectBounds = qr_clip_bounds(fCanvas->computeDeviceClipBounds());
+        fCanvas->fQuickRejectBounds = fCanvas->computeDeviceClipBounds();
         // post-condition, we should remain valid after re-computing the bounds
         fCanvas->validateClip();
     }
@@ -422,8 +409,7 @@
     SkASSERT(fBaseDevice->isNoPixelsDevice());
     static_cast<SkNoPixelsDevice*>(fBaseDevice.get())->resetForNextPicture(bounds);
     fMCRec->reset(fBaseDevice.get());
-    fQuickRejectBounds = qr_clip_bounds(this->computeDeviceClipBounds());
-    fIsScaleTranslate = true;
+    fQuickRejectBounds = this->computeDeviceClipBounds();
 }
 
 void SkCanvas::init(sk_sp<SkBaseDevice> device) {
@@ -449,10 +435,9 @@
     device->setMarkerStack(fMarkerStack.get());
 
     fSurfaceBase = nullptr;
-    fIsScaleTranslate = true;
     fBaseDevice = std::move(device);
     fScratchGlyphRunBuilder = std::make_unique<SkGlyphRunBuilder>();
-    fQuickRejectBounds = qr_clip_bounds(this->computeDeviceClipBounds());
+    fQuickRejectBounds = this->computeDeviceClipBounds();
 }
 
 SkCanvas::SkCanvas()
@@ -1072,7 +1057,7 @@
 
     fMCRec->newLayer(std::move(newDevice), paint, stashedMatrix);
 
-    fQuickRejectBounds = qr_clip_bounds(this->computeDeviceClipBounds());
+    fQuickRejectBounds = this->computeDeviceClipBounds();
 }
 
 int SkCanvas::saveLayerAlpha(const SkRect* bounds, U8CPU alpha) {
@@ -1168,10 +1153,9 @@
         this->internalSetMatrix(SkM44(layer->fStashedMatrix));
     }
 
-    fIsScaleTranslate = SkMatrixPriv::IsScaleTranslateAsM33(fMCRec->fMatrix);
     // Update the quick-reject bounds in case the restore changed the top device or the
     // removed save record had included modifications to the clip stack.
-    fQuickRejectBounds = qr_clip_bounds(this->computeDeviceClipBounds());
+    fQuickRejectBounds = this->computeDeviceClipBounds();
     this->validateClip();
 }
 
@@ -1307,11 +1291,6 @@
         this->checkForDeferredSave();
         fMCRec->fMatrix.preTranslate(dx, dy);
 
-        // Translate shouldn't affect the is-scale-translateness of the matrix.
-        // However, if either is non-finite, we might still complicate the matrix type,
-        // so we still have to compute this.
-        fIsScaleTranslate = SkMatrixPriv::IsScaleTranslateAsM33(fMCRec->fMatrix);
-
         this->topDevice()->setGlobalCTM(fMCRec->fMatrix);
 
         this->didTranslate(dx,dy);
@@ -1323,10 +1302,6 @@
         this->checkForDeferredSave();
         fMCRec->fMatrix.preScale(sx, sy);
 
-        // shouldn't need to do this (theoretically), as the state shouldn't have changed,
-        // but pre-scaling by a non-finite does change it, so we have to recompute.
-        fIsScaleTranslate = SkMatrixPriv::IsScaleTranslateAsM33(fMCRec->fMatrix);
-
         this->topDevice()->setGlobalCTM(fMCRec->fMatrix);
 
         this->didScale(sx, sy);
@@ -1363,8 +1338,6 @@
 
     fMCRec->fMatrix.preConcat(m);
 
-    fIsScaleTranslate = SkMatrixPriv::IsScaleTranslateAsM33(fMCRec->fMatrix);
-
     this->topDevice()->setGlobalCTM(fMCRec->fMatrix);
 }
 
@@ -1376,7 +1349,6 @@
 
 void SkCanvas::internalSetMatrix(const SkM44& m) {
     fMCRec->fMatrix = m;
-    fIsScaleTranslate = SkMatrixPriv::IsScaleTranslateAsM33(m);
 
     this->topDevice()->setGlobalCTM(fMCRec->fMatrix);
 }
@@ -1529,7 +1501,7 @@
 
 void SkCanvas::validateClip() const {
 #ifdef SK_DEBUG
-    SkRect tmp = qr_clip_bounds(this->computeDeviceClipBounds());
+    SkRect tmp = this->computeDeviceClipBounds();
     if (this->isClipEmpty()) {
         SkASSERT(fQuickRejectBounds.isEmpty());
     } else {
@@ -1564,68 +1536,14 @@
     return this->topDevice()->onGetClipType() == SkBaseDevice::ClipType::kRect;
 }
 
-static inline bool is_nan_or_clipped(const Sk4f& devRect, const Sk4f& devClip) {
-#if !defined(SKNX_NO_SIMD) && SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE2
-    __m128 lLtT = _mm_unpacklo_ps(devRect.fVec, devClip.fVec);
-    __m128 RrBb = _mm_unpackhi_ps(devClip.fVec, devRect.fVec);
-    __m128 mask = _mm_cmplt_ps(lLtT, RrBb);
-    return 0xF != _mm_movemask_ps(mask);
-#elif !defined(SKNX_NO_SIMD) && defined(SK_ARM_HAS_NEON)
-    float32x4_t lLtT = vzipq_f32(devRect.fVec, devClip.fVec).val[0];
-    float32x4_t RrBb = vzipq_f32(devClip.fVec, devRect.fVec).val[1];
-    uint32x4_t mask = vcltq_f32(lLtT, RrBb);
-    return 0xFFFFFFFFFFFFFFFF != (uint64_t) vmovn_u32(mask);
-#else
-    SkRect devRectAsRect;
-    SkRect devClipAsRect;
-    devRect.store(&devRectAsRect.fLeft);
-    devClip.store(&devClipAsRect.fLeft);
-    return !devRectAsRect.isFinite() || !devRectAsRect.intersect(devClipAsRect);
-#endif
-}
-
-// It's important for this function to not be inlined.  Otherwise the compiler will share code
-// between the fast path and the slow path, resulting in two slow paths.
-static SK_NEVER_INLINE bool quick_reject_slow_path(const SkRect& src, const SkRect& deviceClip,
-                                                   const SkMatrix& matrix) {
-    SkRect deviceRect;
-    matrix.mapRect(&deviceRect, src);
-    return !deviceRect.isFinite() || !deviceRect.intersect(deviceClip);
-}
-
 bool SkCanvas::quickReject(const SkRect& src) const {
 #ifdef SK_DEBUG
     // Verify that fQuickRejectBounds are set properly.
     this->validateClip();
-    // Verify that fIsScaleTranslate is set properly.
-    SkASSERT(fIsScaleTranslate == SkMatrixPriv::IsScaleTranslateAsM33(fMCRec->fMatrix));
 #endif
 
-    if (!fIsScaleTranslate) {
-        return quick_reject_slow_path(src, fQuickRejectBounds, fMCRec->fMatrix.asM33());
-    }
-
-    // We inline the implementation of mapScaleTranslate() for the fast path.
-    float sx = fMCRec->fMatrix.rc(0, 0);
-    float sy = fMCRec->fMatrix.rc(1, 1);
-    float tx = fMCRec->fMatrix.rc(0, 3);
-    float ty = fMCRec->fMatrix.rc(1, 3);
-    Sk4f scale(sx, sy, sx, sy);
-    Sk4f trans(tx, ty, tx, ty);
-
-    // Apply matrix.
-    Sk4f ltrb = Sk4f::Load(&src.fLeft) * scale + trans;
-
-    // Make sure left < right, top < bottom.
-    Sk4f rblt(ltrb[2], ltrb[3], ltrb[0], ltrb[1]);
-    Sk4f min = Sk4f::Min(ltrb, rblt);
-    Sk4f max = Sk4f::Max(ltrb, rblt);
-    // We can extract either pair [0,1] or [2,3] from min and max and be correct, but on
-    // ARM this sequence generates the fastest (a single instruction).
-    Sk4f devRect = Sk4f(min[2], min[3], max[0], max[1]);
-
-    // Check if the device rect is NaN or outside the clip.
-    return is_nan_or_clipped(devRect, Sk4f::Load(&fQuickRejectBounds.fLeft));
+    SkRect devRect = SkMatrixPriv::MapRect(fMCRec->fMatrix, src);
+    return !devRect.isFinite() || !devRect.intersects(fQuickRejectBounds);
 }
 
 bool SkCanvas::quickReject(const SkPath& path) const {
@@ -1669,16 +1587,22 @@
 }
 
 SkIRect SkCanvas::getDeviceClipBounds() const {
-    return this->computeDeviceClipBounds().roundOut();
+    return this->computeDeviceClipBounds(/*outsetForAA=*/false).roundOut();
 }
 
-SkRect SkCanvas::computeDeviceClipBounds() const {
+SkRect SkCanvas::computeDeviceClipBounds(bool outsetForAA) const {
     const SkBaseDevice* dev = this->topDevice();
     if (dev->onGetClipType() == SkBaseDevice::ClipType::kEmpty) {
         return SkRect::MakeEmpty();
     } else {
-        SkIRect devClipBounds = dev->devClipBounds();
-        return dev->deviceToGlobal().mapRect(SkRect::Make(devClipBounds));
+        SkRect devClipBounds = SkRect::Make(dev->devClipBounds());
+        dev->deviceToGlobal().mapRect(&devClipBounds);
+        if (outsetForAA) {
+            // Expand bounds out by 1 in case we are anti-aliasing.  We store the
+            // bounds as floats to enable a faster quick reject implementation.
+            devClipBounds.outset(1.f, 1.f);
+        }
+        return devClipBounds;
     }
 }