We created an iOS application that uses the Metal framework to generate photo mosaics in parallel, on demand. Using the photos already in the device's Photo Library, Mosaix reconstructs a reference photo by selecting library photos that match in color, shape, and contrast and placing them in a grid representing the original image. Our iOS app currently exhibits about 20x speedup over the leading photo mosaic applications available in the App Store.
A reference and composite comparison generated by our parallel algorithm.
- Full graphical user interface with automatic Photo Library access, an in-app camera and photo selector, and the ability to save mosaics.
- Functional with Photo Libraries containing anywhere between 4 and 15,000 photos.
- Adjustable grid size and final quality -- letting the user select how large each source image appears within the final composite photo and how picky the algorithm is in its selection.
- Naive and parallel photo selection algorithms built on shared protocols, developed iteratively toward a robust, performant data structure behind the parallelized solution.
- Pre-processing of the device's Photo Library in the background, which reduces the complexity and time of generating each individual mosaic, and loading of previously computed values from file to decrease the time required for steady-state photo mosaic production.
- Performance that significantly improves upon existing solutions in the App Store. While a few apps already perform similar functions to Mosaix, they take 30-40s per image to finish processing (depending on grid size and quality).
- A video processing flow which, given a video, processes the video into frames and applies the photo mosaic algorithm to each frame so that it can be stitched back together into a transformed video.
- We demonstrated the app live on an iPhone by taking a picture on the spot and having the application generate a composite mosaic of that image with both the naive and parallel implementations.
Our application is a Swift mobile app with a fairly simple user interface: the user selects the desired image (referred to as the reference photo) and adjusts the grid size and quality. The application then reassembles the selected image into the composite photo out of the other images in the user's library, presenting the user with the composite as the final result.
Our goal was to implement a reasonable approach to transforming photos into a feature vector, apply this transformation efficiently to the entire Photo Library, and then find the minimum difference between the vector of a subsection of our reference photo and the vector of some photo from the photo library so that we can replace it.
Mosaix made the most sense on a mobile device because that's the platform on which a majority of our everyday photos are taken and stored. Creating a mobile app made it easy for the end user to take a photo and see a live result, and made sharing the composite image much simpler.
In particular, our choice of iOS 10 on the iPhone 7 was driven mostly by the inclusion of the Metal framework, which gave us unprecedented low-level access to the iPhone 7's GPU. Not only was the hardware specifically designed with photo processing in mind, but the software gave us complete control over multi-threading and image processing and a simple way of accessing the library of existing photos.
- We'd never worked with Swift or Metal before, and iOS apps are developed using a unique design paradigm, with custom interfaces and data types that restrict how and when they can be used. For example, photos in the Photo Library are stored as and accessible only through PHAssets, whereas Metal primarily supports MTLTextures, so all transformations performed upon photos had to operate on textures in order to run in a kernel.
- The iPhone 7 is quad-core, but application usage of the cores is restricted, and there is no interface to directly schedule jobs onto them. If an application uses too much power, it is throttled, so achieving speedup required balancing resource utilization against overreaching our limits.
- Fetching photos from the Photo Library is bandwidth-bound; repeatedly accessing photos, or even queuing multiple requests for specific photos, caused slowdown and had to be worked around.
- 2GB of RAM meant that holding all (or even a reasonable fraction) of the photos from the Photo Library in memory was not possible, so batch processing, and converting photos to simplified representations as soon as possible, was necessary.
We broke photo selection into three main, parallelizable sub-problems:
- KPA Calculation: generating a reasonable representation of photos within the Photo Library so that we can efficiently narrow down potential candidate matches for the reference photo without losing clarity in the resulting mosaic.
A K=9-Point Average representation of an example image, stored as a 9-dimensional vector.
We did this with a technique we called K-Point Averaging, which splits each library photo into a grid and finds the average R, G, and B values in each subsection, as well as the average R, G, and B values for the entire image. We then created a 3K-dimensional feature vector of these K values x 3 channels and considered this our representation of the image for candidate selection. Importantly, this K value is dynamic and can be adjusted on demand for an increased ability to recognize and match curves and edges. We initially hard-coded it to K=9, plus one value for the overall average (leading to the naming scheme of TenPointAveraging across the app, even though the final implementation defaults to K=25).
The implementation of KPA calculation for photo processing can be found in the first two kernels of TenPointAveraging.metal. Depending on the workload, we take advantage of two different implementations: one that requires inter-thread communication but is more efficient overall for larger libraries, and one that runs more quickly on smaller photo sets.
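As a rough illustration of the averaging step, the following CPU-side sketch computes a K-Point Average for a packed RGB image. The function name and buffer layout are our own illustrative choices, not the app's actual Metal kernel interface:

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of K-Point Averaging: split an RGB image into a
// g x g grid (K = g*g cells) and average each channel per cell, producing
// a 3K-dimensional feature vector of per-cell (R, G, B) averages.
std::vector<float> kPointAverage(const std::vector<uint8_t>& rgb,
                                 int width, int height, int g) {
    const int K = g * g;
    std::vector<double> sum(3 * K, 0.0);
    std::vector<int> count(K, 0);
    for (int y = 0; y < height; ++y) {
        int gy = y * g / height;              // grid row of this pixel
        for (int x = 0; x < width; ++x) {
            int gx = x * g / width;           // grid column of this pixel
            int cell = gy * g + gx;
            int p = (y * width + x) * 3;      // packed RGB, 3 bytes/pixel
            sum[3 * cell + 0] += rgb[p + 0];
            sum[3 * cell + 1] += rgb[p + 1];
            sum[3 * cell + 2] += rgb[p + 2];
            ++count[cell];
        }
    }
    std::vector<float> kpa(3 * K);
    for (int c = 0; c < K; ++c)
        for (int ch = 0; ch < 3; ++ch)
            kpa[3 * c + ch] =
                count[c] ? float(sum[3 * c + ch] / count[c]) : 0.0f;
    return kpa;
}
```

In the Metal versions, each cell's sum becomes a parallel reduction across threads rather than this sequential accumulation.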
- KPA Storage: Finding the representation of the Photo Library to efficiently search 3K-dimensional space for the nearest neighbor.
Our data structure containing all the KPA vectors must be stored in a way that lends itself to efficient, concurrent nearest-neighbor search when finding a potential candidate for a section of the reference photo. We tried a few different representations, each lending itself to a different algorithm for finding the vector(s) closest to the reference photo:
- Dictionary Mapping (see TPADictionary.swift): Our first (naive) implementation stored all the KPA values as a mapping from local identifier (the unique string used by iOS to identify a photo) to a struct of the ten-point averages. This had an obvious advantage in that it was incredibly easy to program and reason about, but it made selection inefficient.
- K-Dimensional Binary Search Trees (see KDTree.swift): Our next approach was storing the TPA structs inside nodes of a multi-dimensional binary search tree. At each level, the tree splits along a different value in the vector to divide the space in half along a different axis with each step in traversal.
K-dimensional binary search trees make insertion easy, but nearest-neighbor search is difficult: it requires calculating the distance between the KPA vector of the reference photo and the best-so-far candidate photo, then using that distance as the radius of a hypersphere and checking at each level whether that hypersphere overlaps the unexplored side, to see if that side of the tree needs traversing as well. Additionally, we found that even in a library of 5,000 photos, the large constants of splitting along K (or even sqrt(K)) axes require searching up to 10% of the tree to find a single image.
- KPA Sequence (see TPAArray.swift): Our final solution was inspired by the rapid speed-up we saw when converting KPA calculation to a Metal kernel. As Metal only works with buffers and textures (multi-dimensional planar types), we would sacrifice any complexity benefits of a more advanced data structure, but would gain the ability to stream this data through a Metal kernel and take advantage of the relatively beefy GPU on the iPhone. As such, this representation is a one-dimensional array of unsigned 32-bit integers that is easily passed to Metal.
The selection algorithm over KPA Sequences is discussed in detail in the next section.
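A minimal sketch of the flat-buffer idea, with struct and member names of our own choosing rather than TPAArray.swift's actual API:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative flat KPA Sequence: each photo contributes a fixed-length
// run of 3K unsigned 32-bit channel averages, appended to one contiguous
// buffer that can be handed to a GPU kernel, with a parallel table mapping
// the photo's slot index back to its library identifier.
struct KPASequence {
    int vectorLength;                  // 3K values per photo
    std::vector<uint32_t> values;      // photoCount * vectorLength entries
    std::vector<std::string> ids;      // slot index -> photo identifier

    explicit KPASequence(int len) : vectorLength(len) {}

    void append(const std::string& id, const std::vector<uint32_t>& kpa) {
        // A fixed stride keeps kernel indexing trivial: photo i's vector
        // lives at values[i * vectorLength .. (i+1) * vectorLength).
        values.insert(values.end(), kpa.begin(), kpa.end());
        ids.push_back(id);
    }

    const uint32_t* vectorFor(size_t i) const {
        return values.data() + i * vectorLength;
    }
};
```

Because the buffer is one dense array of fixed-stride records, it can be copied into a GPU buffer as-is, at the cost of giving up any sublinear search structure.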
- Selection: mapping sections of the reference photo to candidates from the Photo Library.
Finally, we tackled the problem of actually generating a mosaic from a reference photo and KPA Storage. After experimenting with CPU- and GPU-based solutions, it was evident that using a Metal kernel would get us far better results than any thread-spawning technique. In TenPointAveraging.metal, our two kernels for selection are findPhotoNinePointAverage and findNearestMatches:
- findPhotoNinePointAverage: This Metal kernel does K-Point Averaging like the kernels above, but is re-engineered specifically to take many K-Point Average samples at once. While it is more efficient to make a separate kernel call for each photo when pre-processing KPA values, here we needed to split up the reference photo as defined by the user's grid size setting and find the KPA values for each square in that grid. This kernel interleaves across thread groups and, unlike the pre-processing kernel, does not require inter-thread communication.
- findNearestMatches: Once we have the KPA values for each square of the grid in the reference photo, a final Metal kernel takes in sequences of KPA values both for the grid of the reference photo and for the photo library as a whole, performing a min-reduction across the squares of the differences of each value (essentially, a min-distance reduction across 3K-dimensional space).
While the full implications of this speed-up are detailed below, it's worth noting here that this reduction outperformed any other implementation of nearest-neighbor search (including k-dimensional binary search trees) to the point where selection became the least time-intensive portion of mosaic generation.
Our main achievement was a stable application that detects edges well and produces high-quality photo mosaics for any reference photo using a Photo Library of anywhere from 4 to 15,000 photos. Below, we've outlined further key results and observations about our application. We wrote a benchmarking framework to time different sections of our code and identify which settings of thread-related parameters were optimal.
A graph describing image selection speed-up as thread group width changed.
In Metal, a thread group is a group of related threads that execute on a single compute unit and share memory and barriers (similar to the CUDA notion of thread blocks). The graph shows the time taken by the image selection process as the thread group width for that process is varied; the kernels that carry out this process are in TenPointAveraging.metal. Speedup increases dramatically as the width approaches 32, then gets slightly worse. We believe that with more than 32 threads, the amount of work assigned to each thread was so small that its benefit was outweighed by the overhead of launching and maintaining those threads. A thread group width that is too large is also bandwidth-bound in its memory access and can cause cache thrashing; for spatial locality, 32 threads was the ideal thread group width.
A chart showing time spent on each portion of the algorithm's process.
We timed each portion of the application's processing to understand where it spent the most time. Our final iteration of the application demonstrated time spent as in the pie chart: it's clear that overhead, loading from file, and drawing outweigh computation time. This suggests that further optimization should focus on reducing reliance on iOS functionality, as the arithmetic intensity of our application is currently quite low; most time is spent producing the appropriate input and output for our algorithm.
A graph describing performance of Mosaix (our application) against Mosaica and PhotoMosaic, the leading photo mosaic applications currently available in the App Store.
We compared our application's performance against that of two of the most popular photo mosaic applications available in the App Store. All three applications have quality adjustment, so we adjusted each until they produced visually similar photo mosaics. It's clear that the three applications use fairly different algorithms for photo processing; for example, Photo Mosaic goes through the Photo Library after a reference photo is picked, which indicates that it likely does all Photo Library processing as a per-photo diff relative to the reference photo. We timed how long each application took to create a photo mosaic after a reference photo was chosen and found that ours was approximately 20x faster than either competitor.
Nathan
- Implementations of KPA Storage types (multi-dimensional trees, dictionaries, and sequences)
- Preprocessing of the Photo Library on the CPU and in a Metal kernel
- K-Point Averaging calculation on the CPU and in a Metal kernel
- Nearest-matching on the CPU and in a Metal kernel, plus the Swift-Metal pipeline
- Protocols and implementations for K-Point Averaging, Mosaic Creation, and Image Selection (naive + Metal)
- Benchmarking and optimization framework
Shelly
- Video decomposition for frame-by-frame processing
- Serialization and deserialization of KPA Trees, Dictionaries, and Sequences to store photo library data to file between app launches
- App storyboarding, view layout, and UI elements and functionality
- Photo Library access and management and saving composite mosaics to file
- Options -- toggling between naive and optimized algorithms
- Inter-operability between protocols of the algorithm and controllers of the user interface