Bugzilla

Comment 2

•

14 years ago

Using the attached testcase and the current JM (with tracing integration, therefore) tree:

  mozilla% ./js/src/opt-obj/js ~/foo.js
  iterations: 100; total dur: 385ms; per item: 3.85ms

  mozilla% ./js/src/opt-obj/js -j ~/foo.js
  iterations: 100; total dur: 4204ms; per item: 42.04ms

  mozilla% ./js/src/opt-obj/js -m ~/foo.js
  iterations: 100; total dur: 16851ms; per item: 168.51ms

  mozilla% ./js/src/opt-obj/js -j -m ~/foo.js
  iterations: 100; total dur: 21266ms; per item: 212.66ms

With the compile() call taken out:

  mozilla% ./js/src/opt-obj/js ~/foo.js
  iterations: 100; total dur: 955ms; per item: 9.55ms

  mozilla% ./js/src/opt-obj/js -j ~/foo.js
  iterations: 100; total dur: 607ms; per item: 6.07ms

  mozilla% ./js/src/opt-obj/js -m ~/foo.js
  iterations: 100; total dur: 613ms; per item: 6.13ms

  mozilla% ./js/src/opt-obj/js -j -m ~/foo.js
  iterations: 100; total dur: 393ms; per item: 3.93ms

Comment 3

•

14 years ago

When the "compile" call is there, the fft.forward functions is a LAMBDA_NULL_CLOSURE.  Same without the "compile" call.  However the function is a _lot_ bigger with the compile() call.

I tried geting jitstats with the compile() call, but:

  mozilla% env TMFLAGS=stats ./js/src/debug-obj/js -j ~/foo.js
  Assertion failure: base == lirbuf->sp || base == eos_ins,
                     at ../jstracer.cpp:3797
  Bus error

on both JM and TM trees.

blocking2.0: --- → ?

Updated

•

14 years ago

Depends on: 591788

Comment 4

•

14 years ago

I filed bug 591788 on the assert.

Comment 5

•

14 years ago

And it's not an issue with the eval; if I just take the generated source and set the function to that in the original script it's still slow.

Given comment 0, part of the issue is that you're explicitly undoing optimizations that at least the tracer does. For example, by unrolling a loop you're just making the tracer compile a _lot_ more code to get a pretty similar result (before, only one copy of the loop needed to be compiled, but now it's however many you generated). I would expect something similar to be the case for JM.

In particular, when I profile, for JM 94% of the time is spent under js::mjit::Compiler::Compile, almost all of it under generateMethod. In particular, a lot of time is spent compiling the code that syncs state (26% in, not under, ImmutableSync:syncNormal, 20% in ImmutableState::sync, 20% in FrameState::sync 17% in FrameState::merge, 4% in FrameState::uncopy).

For TM, 87% of the time is spent under record_JSOP_GETELEM, so again compilation (in this case, of gets from arrays). Almost all this time is spent under snapshot(); some in VisitFrameSlots (most of it, in fact), and some in checkForGlobalObjectReallocation. Hard to tell where the latter is being called from, since so much stuff is inlined here, but I'd suspect the getFromTrackerImpl and/or knownImpl callsites. Again, this is all code designed to sync state between the jit and the interpreter, afaik.

In the short run, the right answer seems to be to not do this compile() thing. I would be somewhat suprised if it helps in any engine with a jit, though maybe one that never has to sync back to an interpreter might not be too bad off.

In the longer run, can we detect this sort of thing and abort compiling it?

It _is_ sort of impressive that hand-compiling the JS to JS produces a result as fast under interp as -j -m on the original source, though. ;)

Oh, I tried bumping the iteration count from 100 to 1000. While this doesn't affect the per-item time for the interpreter on the compiled fft or the -m time on the uncompiled one (which is slower than interp on compiled fft on this testcase, per above), the -j time on uncompiled goes up to 8ms per item, and the combined -j -m on uncompiled goes up to 6ms/item. So it's actually a win to compile + interp.... We should _really_ be able to beat that with jm on the uncompiled fft, I'd think!

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

14 years ago

Blocks: WebJSPerf

Keywords: perf

Assignee

Comment 6

•

14 years ago

Not good. JM is getting some worst-case behaviour in the sync algorithm, which is O(n) where n is the number of "tracked" slots. We normally expect that "tracked" is significantly lower than nslots. Here that's not true, and the code is very linear - no branches giving us the opportunity to clear the tracker.

Syncing needed to be O(slots) until we started relying on conservative GC. All it needs to be now is O(k + m) where k=registers and m=stack uses of current op.

There's a second sync algorithm that kicks in during extreme register pressure. I don't see it showing up in Shark though - I'll fix the main one and see what that gets us.

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 7

•

14 years ago

That's for JM on the attached testcase, right?  It may also be worth looking into what happens if you comment the compile() line in the testcase out with JM...

Assignee

Updated

•

14 years ago

Depends on: 591836

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 8

•

14 years ago

Yes. With the patch queue in bug 591836, on my MBP:

w/ Compile:

./Opt32/js -m k.js
interations: 100; total dur: 1088ms; per item: 10.88ms

./Opt32/js k.js
interations: 100; total dur: 434ms; per item: 4.34ms

~/projects/v8/shell k.js
interations: 100; total dur: 302ms; per item: 3.02ms

~/projects/WebKit/WebKitBuild/Release/jsc k.js 
interations: 100; total dur: 684ms; per item: 6.84ms

w/out Compile:

/Opt32/js -m k.js
interations: 100; total dur: 128ms; per item: 1.28ms

./Opt32/js k.js 
interations: 100; total dur: 577ms; per item: 5.77ms

~/projects/v8/shell k.js 
interations: 100; total dur: 237ms; per item: 2.37ms

~/projects/WebKit/WebKitBuild/Release/jsc k.js 
interations: 100; total dur: 101ms; per item: 1.01ms

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 9

•

14 years ago

Also worth noting, in JM 13% of time goes to FrameState::uncopy(), the rest is pretty distributed amongst the parser and method JIT compiler.

Updated

•

14 years ago

Depends on: 592177

Comment 10

•

14 years ago

(In reply to comment #5)
> 
> For TM, 87% of the time is spent under record_JSOP_GETELEM, so again
> compilation (in this case, of gets from arrays).  Almost all this time is spent
> under snapshot(); some in VisitFrameSlots (most of it, in fact), and some in
> checkForGlobalObjectReallocation.

The patch in bug 592177 gives me a 1.11x reduction in instruction counts and about a 1.08x reduction in time.

Comment 11

•

14 years ago

(In reply to comment #5)
> 
> For TM, 87% of the time is spent under record_JSOP_GETELEM, so again
> compilation (in this case, of gets from arrays).  Almost all this time is spent
> under snapshot(); some in VisitFrameSlots (most of it, in fact),

Yes, VisitFrameSlots() is hot for other compile-heavy benchmarks like 3d-raytrace and crypto-md5.  CountSlotsVisitor and DetermineTypesVisitor are the two visitor classes that account for probably 90%+ of the VisitFrameSlots() calls;  of the two, DetermineTypesVisitor is a bit more important, maybe by a ratio of 60:40 or so.

Comment 12

•

14 years ago

(In reply to comment #11)
> 
> Yes, VisitFrameSlots() is hot for other compile-heavy benchmarks like
> 3d-raytrace and crypto-md5.  CountSlotsVisitor and DetermineTypesVisitor are
> the two visitor classes that account for probably 90%+ of the VisitFrameSlots()
> calls;  of the two, DetermineTypesVisitor is a bit more important, maybe by a
> ratio of 60:40 or so.

That's true for the Sunspider ones;  for the benchmark in this bug DetermineTypesVisitor accounts for almost all of the VisitFrameSlots calls.

Comment 13

•

14 years ago

AFAICT the vast majority of DetermineTypesVisitor calls are due to TraceRecorder::snapshot().  In particular, every time we call snapshot() we recompute the entire typemap, even though in practice the typemap will be very similar to the previous typemap.  Surely we can take advantage of this fact somehow.

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 14

•

14 years ago

Seems like it.  File a bug blocking this one?

Brendan Eich [:brendan]

Comment 15

•

14 years ago

Michael Bebenita has researched various ways of compressing snapshots in tracing JITs. Cc'ing him.

/be

Brendan Eich [:brendan]

Comment 16

•

14 years ago

Andreas, Michael: I recall a bug already on file to use Michael's research results to make snapshotting more efficient. Can you find the bug number?

/be

Assignee

Comment 17

•

14 years ago

Having to walk 4,300 local slots every snapshot() to determine the typemap is sucky (academically, "total clownshoes"). We could keep track of the typemap incrementally as we record - which sounds tricky and error-prone given the current TR design. Easier might be caching the last known typemap and watermarking the -/+ positions on the stack, to only recompute a slice at the end.

We'd have to be careful of opcodes like SETNAME or SETELEM which can change random stack slots.

My guess is that test cases like this should be under the "don't bother tracing" category right now. If nfixed > XYZ, don't even record.

Comment 18

•

14 years ago

(In reply to comment #17)
>
> Easier might be caching the last known typemap and
> watermarking the -/+ positions on the stack, to only recompute a slice at the
> end.

We're actually duplicating snapshots reasonably often, eg. 20--40% of them.  ie. we do snapshot(BRANCH_EXIT) and then repeat it almost immediately afterwards.  I'm partway through a patch to fix this.

Updated

•

14 years ago

blocking2.0: ? → betaN+

Updated

•

14 years ago

Assignee: general → dvander

Updated

•

14 years ago

Depends on: 594591

Brendan Eich [:brendan]

Comment 19

•

14 years ago

(In reply to comment #18)
> (In reply to comment #17)
> >
> > Easier might be caching the last known typemap and
> > watermarking the -/+ positions on the stack, to only recompute a slice at the
> > end.
> 
> We're actually duplicating snapshots reasonably often, eg. 20--40% of them. 
> ie. we do snapshot(BRANCH_EXIT) and then repeat it almost immediately
> afterwards.  I'm partway through a patch to fix this.

Nick, could you file, and make the new bug block this bug? This bug is meta and what you describe sounds great -- could be an easy win for a number of benchmarks.

/be

Comment 20

•

14 years ago

(In reply to comment #19)
> > 
> > We're actually duplicating snapshots reasonably often, eg. 20--40% of them. 
> > ie. we do snapshot(BRANCH_EXIT) and then repeat it almost immediately
> > afterwards.  I'm partway through a patch to fix this.
> 
> Nick, could you file, and make the new bug block this bug? This bug is meta and
> what you describe sounds great -- could be an easy win for a number of
> benchmarks.

I could, but my patch turned out to have some bogus assumptions in it and avoiding the duplicates is harder than it first seems.  I filed and fixed bug 594640 which caught a common case that was easy to fix.  In other cases the duplicates are created in separate functions in jstracer.cpp with non-trivial control flow, eg. sometimes the duplicate is created and sometimes not.  So I don't have much to say if I were to open a new bug.

Comment 21

•

14 years ago

Also, my 20--40% number was based on those bogus assumptions;  the number of genuine duplicates is probably less... it's not clear to me exactly how to measure it, suggestions are welcome.

Updated

•

14 years ago

Depends on: 601109

Updated

•

14 years ago

No longer depends on: 601109