
Maximum Performance Mode? Using more than 2 cores... What about others? Does Metal Matter? #408

Open
belisoful opened this issue Oct 26, 2024 · 15 comments

belisoful commented Oct 26, 2024

Apple Feedback Assistant ID: FB15619021

Is your feature request related to a problem? Please describe.
When rendering, CPU usage only reaches 200-250% while I have 8 remaining performance cores on the table not being utilized. This may be due to the complexity of the video? I'm not sure why the load can't be distributed onto the unused cores.
Update: I have a test render now going at 408% CPU.

Describe the solution you'd like
If there is a limit of 2 cores (or only 2 rendering threads?) for export, I'd like to see the other performance cores utilized for the fastest rendering. It would eat more electricity, but I'd like to render at a higher priority and with more cores, as fast as possible.

For instance, if I have a long render going while I sleep, I'd like FCP to just go full throttle while I'm sleeping.
There could be a checkbox under the "Background Tasks" - "Sharing" view for toggling performance mode on and off. Maybe a switch-style UI element.

If frames can be rendered on the unused CPU cores doing the same work as the GPU, FCP should have an option for that as a maximum performance mode. That's the whole point of Metal, right? CPU and GPU work is interchangeable, no?

FCP wouldn't be limited to just 2 cores when exporting but would use all performance cores. Maybe there could be a maximum overdrive option that also utilizes the efficiency cores to the max.

Apple, why sell us computers with 10 CPU cores when your premier software like Final Cut Pro only uses 2-4 of them? It's not PRO, if you ask me.

If this option is enabled on a laptop that is not connected to power, a warning dialog could be displayed. That warning could be turned off and back on in the Final Cut Pro settings themselves.

Describe alternatives you've considered
Adobe Premiere? Resolve? Blender?
Apple Compressor seems to have all-core support, but that is an option Final Cut Pro should include by itself. It's 2024, after all. How many cores did you pay Apple to have and use? How many years have 4+ cores been the standard?

@belisoful belisoful changed the title Maximum Performance Mode? Using more than 2 CPUs... What about others? Maximum Performance Mode? Using more than 2 cores... What about others? Oct 26, 2024
@belisoful belisoful changed the title Maximum Performance Mode? Using more than 2 cores... What about others? Maximum Performance Mode? Using more than 2 cores... What about others? Does Metal Matter? Oct 26, 2024
@latenitefilms latenitefilms self-assigned this Oct 29, 2024
@belisoful (Author)

I know this sounds funny... but what if an attached iPad or iPhone were also used for rendering? I mean, those new M4 iPads are heavy-duty SoCs. Doing remote work via Compressor is one thing, but if an iPad or iPhone were physically attached by wire, why not allow Final Cut to use it for faster rendering without Compressor? It would be a selling point for the M4 iPad... to render your Final Cut Pro exports 2x+ faster. lol.

@latenitefilms (Contributor)

Compressor for iPad is an excellent idea!

@belisoful (Author)

> Compressor for iPad is an excellent idea!

Even without Apple's Compressor software, what if adding M4 iPad or iPhone support made our direct Final Cut Pro exports 2x faster? It'd be killer.

@latenitefilms (Contributor)

The only thing really stopping Apple is that currently on iPadOS only the frontmost application keeps running - you can't render in the background. You'd have to keep Compressor on iPad frontmost and open for it to work (like FCP for iPad and DaVinci Resolve for iPad currently). This seems like it would be a support nightmare.

@belisoful (Author)

There has to be a way for Apple to create low-priority, no-UI background threads specifically for rendering these things. They have total control over iOS.

@latenitefilms (Contributor)

Absolutely; however, making major engineering changes to iPadOS is probably a multi-year project, and not something the ProApps team could easily just request.

@belisoful (Author)

I'm thinking there is an internal way for them to run Apple tasks/threads in the background on iOS that isn't available to us regular developers. One such iOS background task is "computer sync". A Final Cut Pro background task on iOS would have the same modality as a "sync" function when physically connected to a Mac. At least that's my thinking as a professional software engineer.

I know Apple has internal APIs they don't give us access to. That's what I am banking on for this request. It'd be unreal to halve the time of a Final Cut Pro export by connecting an iPad or iPhone. Well, 10% to 300% faster is my guess.

@latenitefilms (Contributor)

If iPadOS already had a way to do it, they would have already done it for Final Cut Pro for iPad.

belisoful commented Oct 29, 2024

Does Final Cut Pro for iOS not allow exporting on the device? I don't have one and haven't played with the iOS version of FCP.

Assuming FCP for iOS does export, why not just put that render pipeline, already present in Final Cut Pro for iOS, into an executable background task accessible by a connected host computer?
The FCP iPad worker service could be a thin [background] client, with the host FCP providing almost everything, even executables (ahem, plugins, Audio Units, and such).

@latenitefilms (Contributor)

Final Cut Pro for iPad can only export in the foreground. If you switch to another app the export will fail.

@belisoful (Author)

Holy cow. It's 2024, and that's a bug and a half, by three quarters of a bite out of Apple. I had no idea it worked that way for FCP on iOS.

belisoful commented Oct 29, 2024

I'd report that as a bug. FCP should at least pause the export rather than fail when it leaves the foreground on iOS. Just basic UI and usability design.

@belisoful (Author)

I'm noodling on it... Even if the M4/Mx iPad were locked into FCP for iOS to make my Mac FCP export faster, that'd work for me. It'd be nice if it were built into FCP without Compressor, for those of us who just so happen to have an iPad.

joema4 commented Feb 12, 2025

We would need a specific scenario and machine details (incl. macOS and FCP versions) to investigate this. But even without that information, it's possible to comment on this area:

  • The term "render" can mean computing an effect, encoding a timeline clip to the render cache, or exporting a clip, which (depending on the cache state) may require rendering and encoding, or just encoding. If the export format matches the cache format (typically ProRes 422), export is just a copy operation from render cache segments to the output file. Each of those involves different data and code paths.

  • There are four main system limits during decode/encode of H.264/HEVC files: (1) CPU, (2) GPU, (3) decode/encode (the media engine), and (4) I/O. In general, I/O is less important for compressed codecs because the total I/O bandwidth is low for those operations. CPU and GPU activity are visible in Activity Monitor, but media engine decode/encode is not. The media engine is totally separate circuitry, even when hosted on an Intel CPU (Quick Sync) or on Nvidia or AMD GPUs; it does not use any of the traditional GPU systems. Therefore, it's common for the media engines to be working hard while the CPU or GPU waits on them. On macOS you simply have no visibility of this using normal monitoring tools.

  • Amdahl's Law states that even a small serialized code fraction limits the available multi-core speedup, which is why just adding more CPU cores may not help. If the threads must cooperate to any degree whatsoever, that cooperation introduces a serialized code fraction that caps the maximum multi-thread speedup (a worked example follows this list): https://en.wikipedia.org/wiki/Amdahl%27s_law

  • Available percentage of thread-safe framework code. From the perspective of threads running in user code and the user address space, it's easy to spin up more threads. But if those threads must run concurrently through certain macOS frameworks (some of which run in kernel mode), those frameworks may not permit concurrent operations, or may only permit them within certain regions. The reasons for such limitations are too complex to discuss here.

  • For H.264/HEVC export, there is "open GOP" format and "closed GOP" format. If closed GOP, that means each GOP is totally independent of others and theoretically can be processed in parallel (limited by framework thread safety issues). If open GOP format, there are inter-GOP dependencies that prevent parallelization. Why not just always use closed GOPs? Because open GOPs have better encoding efficiency.

  • Here is a Python script that can scan an H.264 or HEVC video file to determine whether it was encoded using closed or open GOP format. Installation, dependencies and usage instructions are in the file:
    https://www.dropbox.com/scl/fi/5m6h8nilqnem6u7m8a9rx/gop-analyzer.py?rlkey=052a1yo6hm6epu6l1j2540sj8&st=9h2gg80r&dl=0

  • For open GOP cases, H.264/HEVC encoding is inherently serialized. The bottleneck is typically the encoding/decoding accelerator. It's algorithmically not possible to add more CPU cores.

  • For closed GOP cases, it's theoretically possible to use multiple CPU threads (one per GOP, up to the CPU core limit), or multiple hardware encoders (Max and Ultra CPUs).

  • Starting with recent versions of FCP and macOS, on Max/Ultra chips and under some conditions, there is an option for "export segmentation", which uses multiple media engines in parallel: https://support.apple.com/guide/final-cut-pro/speed-up-exports-with-simultaneous-processing-verd006aab15/mac

  • Compressor has long supported "multiple instances" if configured, which runs separate processes (not just more threads within a process). That is configured in Compressor > Settings > Advanced. On an M1 or M2 Ultra machine with 4 ProRes engines, you can theoretically use 3 or 4 instances and get some improvement during export to ProRes. However, the I/O load is tremendous (up to 2,000 GB per sec on my M1 Ultra) and the improvement isn't that great, likely because there is a segmented encoding phase, after which the segments must be concatenated, which incurs significant overhead. I haven't recently tried multi-instance Compressor encoding on long-GOP formats, but it previously didn't work, possibly because of open GOP dependencies.
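
To make the Amdahl's Law bullet above concrete, here is a minimal sketch of the textbook formula (in Swift purely for illustration; none of this is FCP code) showing how even a 10% serial fraction caps the speedup far below the core count:

```swift
import Foundation

/// Amdahl's Law: speedup = 1 / (s + (1 - s) / n),
/// where `s` is the serialized fraction of the work and `n` is the core count.
func amdahlSpeedup(serialFraction s: Double, cores n: Double) -> Double {
    return 1.0 / (s + (1.0 - s) / n)
}

// With a 10% serial fraction, the ceiling is 1 / 0.10 = 10x no matter how
// many cores are added; 10 cores only reach about 5.3x.
for cores in [2.0, 4.0, 10.0, 24.0] {
    let speedup = amdahlSpeedup(serialFraction: 0.10, cores: cores)
    print("cores: \(Int(cores))  speedup: \(String(format: "%.2f", speedup))x")
}
```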

Summary:

In general, for a given render, decode, or encode operation, FCP is going as fast as the current design permits. During rendering or exporting, when you see low CPU or GPU usage, some other component, such as the media engine, is maxed out, but there are no monitoring tools for that except Xcode Instruments, which is very complex to use.
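
One small illustration of how invisible the media engine is to normal tools: the only public signal an app gets is whether a hardware codec path exists at all, not how busy it is. A minimal Swift sketch using VideoToolbox's VTIsHardwareDecodeSupported (an availability check only; as noted above, normal macOS monitoring tools expose no utilization counter for the media engine):

```swift
import VideoToolbox

// VTIsHardwareDecodeSupported reports whether a hardware decode path exists
// for a codec. It says nothing about how busy the media engine currently is;
// that utilization is the blind spot described above.
let codecs: [(name: String, type: CMVideoCodecType)] = [
    ("H.264", kCMVideoCodecType_H264),
    ("HEVC",  kCMVideoCodecType_HEVC),
]

for codec in codecs {
    let available = VTIsHardwareDecodeSupported(codec.type)
    print("\(codec.name): hardware decode \(available ? "available" : "not available")")
}
```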

In cases involving an inherently serialized algorithm (e.g., open GOP encoding), it's not possible to accelerate the operation further. There are also some cases besides rendering and decode/encode where an FCP code path is serialized on the SQLite database or on unpacking a nested series of compound clips during a drag/drop operation. That is apparently a result of the magnetic timeline dragging "fully rendered" clips (not wire-frame outlines) and the lookups required to determine, moment by moment, what is a valid drop destination.

That is an inherent problem because all GUIs (Windows, Mac, etc.) permit only a single UI thread. In the current FCP design, certain operations such as dragging and dropping many clips are handled on the UI thread. But even if that work were offloaded to a pool of worker threads, those threads would be limited by thread safety issues in the macOS frameworks, would have to constantly communicate with the UI thread, and might incur reliability issues due to race conditions.

From studying various performance cases with Instruments, it appears FCP development is gradually adding more parallelism to certain areas. However, they may be somewhat limited until more of FCP and the underlying frameworks are rewritten in Swift using Swift Concurrency. Currently, FCP thread scheduling uses a mix of "pthreads" and Grand Central Dispatch queues, which takes extra care to avoid synchronization problems and race conditions, adding testing overhead and the risk of nondeterministic bugs.
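
As a rough illustration of the pattern described in the last two paragraphs, here is a hypothetical Swift Concurrency sketch (Clip, DropValidation, and validateDropTargets are invented names, not FCP's code): the expensive validation runs off the main actor, and only the result returns to the UI, with the actor boundary enforced by the compiler rather than by hand-rolled pthread/GCD synchronization:

```swift
import Foundation

// Hypothetical stand-ins for the kind of work described above.
struct Clip { let id: Int }
struct DropValidation { let validTargets: [Int] }

// A nonisolated async function runs on the global executor, i.e. off the
// main/UI thread; it must only touch thread-safe state.
func validateDropTargets(for clips: [Clip]) async -> DropValidation {
    // Stand-in for expensive lookups (e.g. walking nested compound clips).
    let targets = clips.map { $0.id * 2 }
    return DropValidation(validTargets: targets)
}

@MainActor
func handleDrag(of clips: [Clip]) {
    Task {
        // Heavy work happens away from the main actor...
        let validation = await validateDropTargets(for: clips)
        // ...and only the result comes back here to update the UI,
        // with the compiler checking the actor hop.
        print("Valid drop targets: \(validation.validTargets)")
    }
}
```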

Here is a comparative Instruments trace of FCP vs Resolve doing a large drag/drop operation when they both were very laggy. Despite Resolve just doing "wire frame" animation and not having to do magnetic timeline lookups to validate destinations, it was also slow: https://joema.smugmug.com/XCode-Instruments-trace-of-slow-FCP-performance/n-tBwgCJ

@belisoful (Author)

This is a brilliant analysis. Having worked with FxPlug 4 quite extensively, I'd say this is a very deep analysis worthy of Apple engineers' reading. Thank you.
