This is just a dump of a bunch of notes from people that I've talked to about facial expressions/analysis. Better to be here instead of isolated elsewhere. I might go back and put images in here, but don't have the time right now.
These conversations occurred in March and April of 2022.
This morning I spoke with Dr. Stoyo Karamihalev from Dr. Nadine Gogolla's lab about the setup they have for facial expression data collection. He just started his post-doctoral fellowship there. Here's what we talked about.
Camera Details
Cameras
He said that it is great that we are not using webcams for our video data, as that's one of the most common issues he's encountered when advising people on using the method. He recommended this camera for future use: https://www.flir.com/products/blackfly-s-usb3/?model=BFS-U3-13Y3M-C
It costs about $450.00.
This is a GenTL compliant camera that's capable of higher framerates than the camera currently used in the Bruker setup.
It would require additional ports, either:
Ethernet ports
USB Ports
USB ports are the better way to do this, he said, because they can handle the high-speed data coming off the cameras without running into bandwidth problems. If we were to use the Ethernet solution, we would have to upgrade the card in the machine running the scope, or have a different computer running the recording entirely, and manage several Ethernet connections to a switch. Certainly doable, but making sure packet loss doesn't happen with the networked solution, especially when increasing the framerate of the cameras, could become an issue one day. We would also potentially have to purchase a different Ethernet switch and a different Ethernet card capable of handling large amounts of data if we went that route. See below for a description of how their 2P rig is set up, along with related notes on how we might add to our capabilities based on what the Gogolla lab is up to.
Lenses
He mentioned that the precise lens they use in the lab should be part of the paper or the supplemental materials, but I haven't been able to find it in there.
Camera Angles
Initial camera angles were set up to look like this. Stoyo said that this angle is usable, but the tube being in the way of parts of the ear as well as the eartags on the same side will cause issues for the facial expression work.
image.png
Deryn has readjusted the camera angle to be more along the side of the mouse's face, at a 90 degree angle:
image.png
He said that this angle, while usable and good, could also use a little adjustment. He said something between these two frames would be ideal, closer to the one Deryn has created. He also said that it would be best to collect the video data a bit more zoomed into the mouse's face. This screenshot is just a small selection of the video frame and it's a little further away than it appears here.
In both examples, he suggested that putting a notch filter over the camera's CMOS sensor would be best. One has been purchased on Quartzy for use in our setup that should block out all light except the wavelength of the NIR light.
Here's how Stoyo's rig currently records images from one of his 4 cameras:
Screenshot 2022-03-10 084030.png
image.png
It's quite zoomed in onto the face of the mouse and angled along the face slightly more. So the current setup could be altered a little bit to more closely mimic what the Gogolla lab is using.
Camera Positioning/Mouse Positioning
Stoyo said that so long as we have all the face included in the image, we can be sure that we're getting all the data we need for analysis later. Essentially so long as we're getting from the tip of the ear all the way to the nose, we're in the right spot. The only thing he said additionally would be that zooming closer to the face, as mentioned above, would be very useful so as to potentially eliminate the need to crop the video as part of preprocessing.
Lighting
Stoyo said that he's unsure what the influence of lighting will be upon HOG matrix calculations/validity of facial expression analysis. He's testing that now with the upgraded setup he's implementing in the lab, but he also mentioned he doesn't plan on continuing to use the original facial expression analysis for long. See below for more.
He said that having more than one light source is helpful and we shouldn't be too worried about having too much light since we can always make the aperture very small. He said having a smaller aperture of the camera leads to a better depth of focus in the plane you're imaging in. Since he's interested in movements more specifically located in the whisker pad/nose as well as the rest of the face, having the better depth of focus allows him to better visualize all the relevant parts of the face he's interested in.
In terms of positioning of the lighting, he said that having it located slightly above and facing down on the animal may provide somewhat better lighting conditions than other locations, but as mentioned in the first paragraph, he's not sure if this has any effect on the HOG matrix calculation. It would also be good to have the lights put into a fixed position that is unchanging. This has already been done.
Camera Framerate
The camera is currently triggered to take a picture every time the microscope takes pictures at a framerate of approximately 29.80 FPS. This framerate changes very slightly between recordings, by about a hundredth or so, due to how Prairie View implements max speed recordings over an 8kHz line rate. More information about the framerate and the ability to change it can be found in the Bruker Scope Updates page.
He said that this will be fast enough to capture the kinds of motion the original facial analysis code analyzes. However, he suggested that in the future we decouple the TTL triggers of the scope from the cameras recording the face. The primary reason is that if we wish to capture even finer movements of the face, we will simply need faster framerates. He also said that while it can make sense to tie the camera framerate to the scope's framerate for visualizing different things occurring in the same frame, the facial expression analysis won't actually provide datasets that are traceable in this way. So while it can be useful sometimes, it might not be as useful for what we'd like to do. Instead, he suggested we purchase a TTL pulse generator that can finely control camera function as well as trigger the start and stop of the microscope more precisely than the API call to Prairie View that currently happens. More about the generator he suggested and how we might use it can be found below.
Should we want to increase the framerate to capture even finer movements of the face, we would likely want to generate an independent TTL pulse for each camera as it takes an image, with the timestamps saved externally. The timestamps collected here can be united with the timestamps of each frame of the microscope, as well as the timestamps of the behavior voltage recordings, via interpolation. Deryn has successfully done this to correct time-encoding errors in her initial videos that arose from incorrect framerate assignment in the bruker_control code. This behavior has been changed.
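As a concrete illustration of the interpolation idea, here's a minimal sketch, assuming we have one externally saved timestamp per camera frame and per microscope frame on a shared clock. The file names and per-frame timestamp logging are hypothetical, not something bruker_control currently produces.

```python
# Minimal sketch: map microscope frame times onto the camera's frame counter via
# linear interpolation. File names and per-frame timestamp logging are assumptions.
import numpy as np

camera_ts = np.load("camera_frame_timestamps.npy")  # (n_camera_frames,) seconds
scope_ts = np.load("scope_frame_timestamps.npy")    # (n_scope_frames,) seconds

# Fractional camera-frame index at the moment of each 2P frame.
camera_frame_index = np.interp(scope_ts, camera_ts, np.arange(len(camera_ts)))

# Nearest whole camera frame for each 2P frame, useful for frame-to-frame pairing.
nearest_camera_frame = np.round(camera_frame_index).astype(int)
```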
Camera Data Encoding
Stoyo mentioned that for his videos, he does not perform any encoding on the fly. This means that he writes the data in a raw format to disk, which probably yields enormous data files for the video. For comparison, using H264 encoding via FFMPEG during a recording with Austin's experimental parameters yields recordings of approximately 0.5GB total; DIVX encoding is similar in size. The data rate coming off the camera is approximately 300Mbps (megabits per second). It's currently unknown what the data rate would be if the framerate was dramatically increased or decreased. This means that a typical recording using Austin's parameters, which lasts approximately 26 minutes, would yield a raw video recording of approximately 59GB. Deryn's recordings, which are approximately 10 minutes, yield files of about 0.2GB in H264 encoding (DIVX encoding is similar in size); the raw data would be 22.5GB. Encoding to a different format would then be done later as a separate preprocessing step.
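For reference, the raw-size figures quoted above follow directly from the ~300 Mbps data rate; a quick sanity check, assuming the rate is constant and using 1 GB = 10^9 bytes:

```python
# Quick check of the raw (unencoded) video sizes quoted above,
# assuming a constant ~300 Mbit/s off the camera and 1 GB = 1e9 bytes.
def raw_size_gb(minutes, mbit_per_sec=300):
    bits = mbit_per_sec * 1e6 * minutes * 60
    return bits / 8 / 1e9

print(raw_size_gb(26))  # ~58.5 GB for a ~26 minute session (Austin's parameters)
print(raw_size_gb(10))  # ~22.5 GB for a ~10 minute session (Deryn's parameters)
```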
The reason he chooses not to perform any video encoding during runtime is that he's using several cameras. A standard machine, or even a decent workstation, may struggle to encode several video streams fast enough before writing them to disk. Essentially, the processor can get bogged down doing the encoding, and a backlog of frames that must be written can start filling up the computer's RAM. If the machine starts running out of RAM, frames may be lost during writing to disk, and the machine could fail in other ways that would effectively stop the experiment or even crash the computer. Currently, encoding just one camera via CPU appears not to stress the resources of the machine beyond its capabilities.
As of 3/10/22, there have been no known incidents of data failing to be written to disk due to encoding. Any errors with frame loss have been due to packet loss/transmission errors from the camera over Ethernet to the computer. These errors are quite rare and are logged in the configurations file if they occur. There is currently no way to log whether a frame drop occurred specifically during the encoding process; drops arising from encoding are caught by the same exception in the code and logged only as a generic dropped frame.
Should we decide to add additional cameras, we would want to consider not writing encoded data to the same computer that's running the microscope. Ideally, we would add a second machine that would exclusively be used to write video data to. This is how Stoyo's setup currently works which you can see below.
Real Time Expression Calculations: A new facial expression method
Stoyo is going to be implementing a different form of facial expression analysis relying upon 3D pose estimation of some sort. When he applies the method and validates it, he will be able to deploy the model to an NVIDIA Jetson Nano. The video data could be directly fed into the Nano which can apply deep learning models to video streams in real time. Each camera would have its own Nano.
TTL Pulse Details
Stoyo recommends the use of a TTL pulse generator such as the one linked immediately below from Doric:
https://neuro.doriclenses.com/products/otpg?
Using the cameras linked in the first section, it's possible to control not just the framerate but also the exposure time with fine precision, in ways that are much faster than what the microscope produces. We likely already have many TTL pulse generators in the lab; this is just to document his suggestion.
He further suggested that we use the TTL pulses from this generator or from a different device controlling behavior stimulation to send a start and stop trigger to the microscope similar to how it is currently done using the Arduino.
Gogolla Lab 2P Setup
Stoyo showed me the setup the lab currently uses for imaging via a rendering created in Blender. It's similar to ours overall, but contains more cameras and is governed slightly differently.
Their new data acquisition setup seems to look something like this:
gogollalab_2p.png
This is just an approximation of how their system looks; he only described the basic idea to me, so this might be an inaccurate representation of how they've designed their setup.
What We Might Adopt
Compare the Gogolla lab's setup above to our own below:
bruker_current.png
Stoyo made no claims about whether his setup was optimal. I showed him this figure and he noted that he preferred having the Pi be the controller of the experiment instead of the Windows machine. He also mentioned that Windows machines' clocks are apparently pretty unreliable as a general rule.
The things we might want to adopt are:
Using the TTL pulse generator for multiple cameras at high framerates
Use multiple cameras and try recording from each side of the mouse's face as well as have an above view of the mouse
It would be very difficult to get a camera positioned over the top of the animal. We could potentially do this with mirrors, but the view would likely be obscured no matter what, and having a reflective surface where the lasers are going is a bad idea anyway
Would need to use and coordinate a separate machine to collect additional video streams
Use a raspberry pi as a governor for the experiments
We can use the Pi to run the behavior experiments and still have the main bruker_control program kick things off on the Windows machine controlling the microscope.
This would effectively make bruker_control a program on the Windows machine that connects to the Pi as a node. The Pi would then perform trial order creation, ITI duration/tone duration settings, stimulation settings, interfacing with the Prairie View API over IP, etc. The user would interface with the Pi through the Windows machine, and the code present on that machine would effectively be a wrapper telling the Pi to do all the work (a hypothetical sketch follows this list).
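To make the idea concrete, here's a purely hypothetical sketch of the wrapper handing a session configuration to the Pi over a plain TCP socket. The host name, port, and message fields are all invented and do not correspond to any existing bruker_control interface.

```python
# Hypothetical sketch only: the Windows-side wrapper sends a session config to the
# Pi, which would then build trial orders, ITI/tone durations, etc. Host, port, and
# message fields are invented for illustration.
import json
import socket

session_config = {
    "mouse_id": "example_mouse",
    "num_trials": 60,
    "iti_range_s": [15, 30],
    "tone_duration_s": 2.0,
}

def send_config_to_pi(config, host="raspberrypi.local", port=5555):
    """Send a JSON config to the Pi and return its short acknowledgment."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(json.dumps(config).encode() + b"\n")
        return sock.recv(1024).decode()

# The wrapper would call this before telling Prairie View to start imaging.
# print(send_config_to_pi(session_config))
```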
Adopting all these changes would likely take a fair bit of time. A good amount of planning would have to go into when such an update could happen, which things should even be updated (if at all), and how we'd like it to turn out. Implementing, validating, and testing the changes will also take a while; approximately how long is not something I can estimate.
This morning I got to chat with Atika Syeda, a member of Dr. Carsen Stringer's lab, about implementing their cool tool called FaceMap which can be found here: https://github.com/MouseLand/FaceMap
I showed her a couple images of our current setups and she said that the first angle we used could be helpful for gathering/marking the different key points that they use for labeling/analyzing their data.
She said that the second angle, which Deryn has implemented for the facial expression work, could also be usable, but it would be helpful to zoom in some more and adjust the lighting conditions so it's easier to see the whisker pads.
image.png
She said that our camera's quality is quite impressive and that our image resolution is also helpful to have, but it will be downsampled to 256x256 for the crop that's used when running the algorithm.
The data that was used for training was collected at 50FPS, but she said that our datasets, collected at approximately 30FPS, should be fast enough to follow along with what Facemap uses without much trouble. I mentioned that we were considering using multiple camera views and much higher frame rates; Atika said she would expect this to help capture some additional dynamic motions of the mouse's face, but she can't say for sure since they have not tested faster framerates themselves.
The outputs of Facemap are very similar to Suite2p's outputs, namely that there is a folder created for the outputs. There's an .h5 file and metadata file stored in a .pkl binary format about the processing that was done.
On A100 GPUs (which are much more powerful than what we have available here at Salk and have not yet purchased; we only have A40s coming), they reach inference speeds of 400FPS, which is quite fast.
I'm going to try installing Facemap and running a video through it on the cluster today and see what it does. Hopefully it works relatively well out of the box, but it's likely that we will need to do some additional training to get the network reliably labeling our datasets.
Update from this afternoon: There's a problem with a Linux shared-object library again. Jorge can fix it, but it will take a little bit before he gets around to it.
Another update related to documentation for the emotional processing stuff: Stoyo may allow us to make a PR for docs on their GitHub! Could be pretty cool.
Today I got to chat with Bradley Edelman, a post-doctoral fellow in Dr. Gogolla's lab, about his use of the facial expression pipeline and how we can follow his best practices for using it. He's been working on it for a couple years and had lots of interesting/great advice about what to do.
Camera
Camera Angles
He said for the camera angles, our first angle is not particularly good although it is likely usable. He said that the second angle that Deryn has implemented is better, but either zooming into the face or simply moving the camera closer would be ideal. Overall, he said that so long as we have the ear to the nose in the frame of view, with at least a little bit of area on each side of the face, we should be okay.
For the angle of the camera, he said it's unlikely that angle matters too much so long as you're as consistent as possible. There's probably wiggle room of something like 10-15 degrees from 90 on the sloped face of the mouse, meaning that going directly 90 degrees as Deryn has implemented is likely okay too. Again, so long as we're consistent with our acquisitions.
He said that he's unsure if the camera's distance from the mouse can impact the performance of HOG matrix calculations, but that it's certainly possible. Given that the HOG matrix calculation, by default, expects something like 24-32 pixels in a given cell used for the calculation, different camera-to-mouse distances will yield different numbers of pixels per cell over the surface area of the mouse's face. He said that if we're too close, we might not be able to reliably estimate the orientations, or the strength in each orientation, that the HOG matrix calculates. What this distance would be is unknown/has not been empirically tested.
He suggested that having a wider field of view, meaning further away, is safer than being too close since you can always crop later.
Frame Rates
He said that it's good that we have a stable framerate for our cameras and that it's certainly nice to have our camera synced to the scope, so long as it's collecting frames fast enough to capture the sorts of movements we're interested in. He said that our framerate near 29.80 is more than likely enough. He didn't seem to be aware of the increased-framerate plan that Stoyo wants to implement.
The main suggestion he had was to see if we can simply just use an easily divisible framerate instead, something like 25Hz. I don't see why not given that we can set the framerate of the scope and 25Hz is likely more than fast enough to capture the dynamics of the calcium transients we're looking for. This will make alignment/interpolation down the road much simpler for us to understand and code for.
Memory Constraints with Video Readers
Brad noted that the readers in both MATLAB and Python tend to load the entirety of a video into memory before performing any computation. This requires one of the following:
Limiting video lengths during acquisition so clips are generated natively (Possible, but would require a large refactor of the code so the camera knows when to clip things. Could be done over serial transfer, but if we're doing that, we may as well transition to a whole new system)
Generating clips around different trials as Deryn has currently implemented
Refactoring the code on the scope to take individual images instead of recording video to disk
Main issue with this is that transferring all the data to the server later as individual images will take forever, even on 10G lines when the switch is present, and would also incur potential downtime on the local machine.
Another issue is that the dataset size will be quite a lot larger with no added quality. He said that the reason his data is recorded as individual images is simply that it's what he started with/inherited, and it works well enough, so why change it.
Implementing/using a different package that can lazily load video frames, such as the pims package linked below (a minimal sketch follows this list): https://github.com/soft-matter/pims
This would be really nice so we don't have to generate clips of anything and instead can have one seamless HOG series, or even a video that could be overlaid upon the registered video frames.
Would also help make it easier to do the HOG correlation matrices later on when comparing trial types/expressions across types
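Here's a minimal sketch of what lazy loading with pims looks like. The file name is a placeholder, and pims needs a video backend (e.g., PyAV or imageio's FFmpeg reader) installed to open video files.

```python
# Minimal sketch: lazily read a video with pims so the full recording never has to
# sit in RAM. The file name is a placeholder; a video backend (PyAV/imageio) is assumed.
import pims

video = pims.Video("face_recording.mp4")  # frames are only decoded when accessed
print(len(video), video.frame_shape)

for frame in video[:100]:   # pull frames (or slices) out one at a time as needed
    pass                    # e.g., crop, grayscale, compute a HOG descriptor, etc.
```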
HOG Matrix Calculation Notes
Using the default values of the HOG matrix functions, or a range of 24-32 pixels per cell, is common and seems to work well. Using the default number of orientations, 8, also appears to work well.
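As a sketch of what the per-frame calculation looks like in Python, here's scikit-image's hog with 8 orientations and 32x32 pixels per cell, the values described above; cells_per_block is just a simple choice here, and all parameters should be double-checked against the original notebook before relying on this.

```python
# Sketch of a per-frame HOG descriptor with scikit-image, using 8 orientations and
# 32x32 pixels per cell as described above. cells_per_block is an arbitrary choice;
# check every parameter against the original notebook's settings.
import numpy as np
from skimage.feature import hog

def frame_to_hog(frame_gray):
    """Return the flattened HOG descriptor for one grayscale frame."""
    return hog(
        frame_gray,
        orientations=8,
        pixels_per_cell=(32, 32),
        cells_per_block=(1, 1),
    )

fake_frame = np.random.rand(480, 640)  # stand-in for a cropped face frame
descriptor = frame_to_hog(fake_frame)
print(descriptor.shape)                # one descriptor per frame -> a HOG x Time dataset
```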
For his work, since everything is stored as individual images, he can simply glob the directory of his images and process each frame one by one without running into memory issues like we would with most image readers. When he does this, he ends up with one long time series that he can reliably seek later for analysis/visualization.
This is the one downside of converting everything into video clips for analysis/visualization later: namely, seamlessly going between clips will be somewhat more difficult to do. This is why he leaves everything in individual folders (reference face, experimental session). I want to double check this structure with him so we can follow best practices for organizing this data.
Each timepoint gets its own HOG matrix calculation, so each video frame's calculation is independent of the others. This is really nice and is (probably) the reason the Jupyter notebook does the processing in parallel asynchronously: it doesn't matter when different sections of the processing finish, just that they do. Whether this influences things when performing the HOG analysis on video frames is unknown, at least to me. I could imagine that, since the code runs asynchronously on a file format that doesn't necessarily know how to stitch pieces back together in the right place the way zarr does, there could be a problem with the outputs. I think Deryn has written the code so it iterates through a clip's frames one at a time in series, so this wouldn't be a problem, but if we want to parallelize things using video segments, we'd definitely have to be careful about that.
This yields datasets where each timepoint n has a unique HOG matrix associated with it. In other words, a HOG x Time dataset.
Once things are converted into a timeseries, you can perform correlation calculations across the entire dataset against each prototypical face you gather from the baseline of each mouse.
The idea is to correlate time series of HOG matrices to baseline normalized "minimum" (neutral face) and "maximum" (strongest stimulus) expressions.
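A minimal sketch of that correlation step, assuming hogs is an (n_frames, n_features) array of per-frame HOG vectors and each prototype is a single averaged HOG vector (the variable names are placeholders):

```python
# Sketch: Pearson-correlate every frame's HOG vector with a prototypical face,
# assuming `hogs` has shape (n_frames, n_features) and `prototype` is 1-D.
import numpy as np

def correlate_to_prototype(hogs, prototype):
    h = hogs - hogs.mean(axis=1, keepdims=True)
    p = prototype - prototype.mean()
    return (h @ p) / (np.linalg.norm(h, axis=1) * np.linalg.norm(p))

# e.g., neutral_trace  = correlate_to_prototype(hogs, neutral_prototype)
#       stimulus_trace = correlate_to_prototype(hogs, strongest_stimulus_prototype)
```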
Note:
Using the pims library, we could convert each chunk into an intermediate zarr store, compute HOG matrices on chunks of video frames in parallel, and then probably stitch the results back together as an array of HOG matrices
One question that needs to be answered is what to do with HOG matrix values while the mouse is licking, blinking, etc.
Prototypical Face Suggestions
Brad suggested using prototypical faces within mice instead of across mice especially if your surgeries can be somewhat variable given your preparation. He suggested that, although the tool was developed so it could be deployed/developed using just one mouse face against any new mouse face, it's probably safer to compare within mice so you know that your new sessions can be reliably co-registered and unaffected by surgical influences upon the datasets. In other words, even though the same kinds of faces might appear after registration, it's hard to say if things are truly the same across animals. They certainly aren't at the level of the physical data. Things like Stoyo's plan to do 3D pose estimation for expression will probably make this whole method obsolete in the next several years. It's unclear how far along Stoyo is for that project. Something to keep in mind.
Procedure: Take frames with low variance across time, define that as the neutral period, crop everything, convert to HOGs, average the HOGs together, and then register things. On a decent workstation, this whole procedure only takes 3-5 minutes or so. Pretty fast!
It may be a good idea to have some sort of "meta-face" of mice that things can be registered to, but it may not be necessary. It depends on how robust the tool is, which remains to be seen.
The main suggestion for creating the neutral face is to take frames with low variance across time for HOG matrices when compared to the neutral baseline, define this as the neutral period, crop these images, and then average these HOG matrices together to make the prototypical face. Compare that period against all other HOG measurements to find out when these expressions match, if at all. It's all about relative change.
The same procedure should be done across different stimuli, and ideally each prototypical face should end up anti-correlated with one another.
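A rough sketch of that procedure for the neutral prototype, assuming hogs is again an (n_frames, n_features) array; the window length and the frame-to-frame change measure are arbitrary choices here, not values from Brad:

```python
# Sketch: pick the quietest stretch of the recording (lowest frame-to-frame HOG change)
# and average its HOG vectors into a neutral prototype. Window length is arbitrary.
import numpy as np

def neutral_prototype(hogs, window=100):
    diffs = np.linalg.norm(np.diff(hogs, axis=0), axis=1)            # per-frame change
    smoothed = np.convolve(diffs, np.ones(window) / window, "valid")  # rolling average
    start = int(np.argmin(smoothed))                                  # quietest window start
    return hogs[start:start + window].mean(axis=0)
```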
Face Registration Suggestions
Brad said that he doesn't use the code in the Jupyter notebook for running the facial registration algorithm. Instead, he opts for a MATLAB toolbox that does it just as well. He said that since the scope they use is run using MATLAB (they do functional ultrasound imaging!) he does a lot of the preprocessing steps in MATLAB.
All that's required is a rigid body transformation since we're in 2D space and the mouse doesn't move all that much overall. It remains to be seen if blinking, licking, etc can influence these registrations.
These will save rotation and translation matrix parameters to file for you so you can see precisely how each video was transformed as you feed it data. You could align to any point in the experiment, but he recommended simply taking one of the first neutral time points in your experiment, like the first ITI, and using that to perform your registration. Once your registration parameters have been found, since the mouse is effectively stationary at this point, you can trust that the image registration is performed reliably.
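Brad does this registration in MATLAB; if we stay in Python, a rough stand-in (not his pipeline) is an intensity-based rigid/Euclidean registration such as OpenCV's ECC, which returns the rotation and translation parameters you would save to file.

```python
# Rough Python stand-in (not Brad's MATLAB pipeline): intensity-based rigid
# registration with OpenCV's ECC. The returned 2x3 warp holds rotation + translation.
import cv2
import numpy as np

def rigid_register(reference_gray, moving_gray, iterations=200, eps=1e-6):
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, iterations, eps)
    # The trailing inputMask/gaussFiltSize arguments are for OpenCV >= 4.1.
    _, warp = cv2.findTransformECC(
        reference_gray.astype(np.float32), moving_gray.astype(np.float32),
        warp, cv2.MOTION_EUCLIDEAN, criteria, None, 5,
    )
    return warp  # save these parameters so every session's transform is on record

# aligned = cv2.warpAffine(moving, warp, (moving.shape[1], moving.shape[0]))
```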
Issues with Rigid Body Registration
There may be cases where image registration fails in apparently spectacular ways, such as rotating the mouse completely upside down! Many of these algorithms operate on a pixelwise mean squared error based on pixel intensity: the difference in intensity between one image and the next, squared. If some feature of your image has really intense pixel values (like the head bar, sucrose delivery needle, etc.), those regions will likely have a much stronger influence over the MSE used for registering the images. This means that the algorithm will "lock onto" spots for registration that aren't related to the mouse's position!
Brad has a check in his code that looks at the registration matrices that the algorithm outputs. If the rotation values are something strange (really anything larger than just a few degrees of rotation, but sometimes values like 180 degrees happen), he has the algorithm try again. I didn't ask if this is a typical solution, but I'm betting it is.
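A small sketch of that sanity check: recover the rotation angle from the 2x3 rigid warp and flag anything beyond a few degrees (the threshold here is an arbitrary choice, not Brad's value):

```python
# Sketch of the registration sanity check: extract the rotation angle from a 2x3
# rigid warp matrix and flag suspicious results. The 5-degree threshold is arbitrary.
import numpy as np

def rotation_looks_sane(warp_2x3, max_degrees=5.0):
    angle = np.degrees(np.arctan2(warp_2x3[1, 0], warp_2x3[0, 0]))
    return abs(angle) <= max_degrees  # if False, re-run registration or inspect the frame
```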
Given this information, it would be best to try and find out how we can limit brightness from the NIR light from different pieces of the setup other than the mouse itself. Perhaps coating the metal in something that absorbs light that is also robust could help reduce this problem.
Brad also mentioned that there are issues where things aren't aligned even if they don't fail spectacularly. He said there are instances where the face video appears mostly aligned, but when you overlay the still image over the video frames you can see different features move around.
Something interesting to look at would be finding specific ROIs on the mouse that should be used for registration and having those corrected, but this would probably be another separate project on its own. There doesn't appear to be an easy way to do this in either MATLAB or Python that doesn't involve hand-labeling multiple ROIs and then calculating correlation matrices individually across all those pieces.
Enhancing Signal of Your Correlation Matrices
In some cases it will be advantageous to enhance the signal obtained by correlation matrix calculations in your dataset. Brad suggested doing the following to your normalized dataset: Raise the correlation value to an exponent!
If you take e^(constant × correlation value), your signals will be enhanced!
When you compare between conditions (ie before stress/after stress, before/after food deprivation) this can be especially helpful for visualizing changes in your data. Whether you should use this for statistical analysis is something I'm not sure about...
This should also reduce the influence/value of small increases/decreases in your correlation values from one frame to the next.
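As a sketch, applied to a normalized correlation trace (the constant k is a free parameter to tune by eye):

```python
# Sketch of the exponential enhancement on a normalized correlation time series.
# The constant k is a free parameter; pick it by eye for visualization.
import numpy as np

def enhance(correlations, k=5.0):
    return np.exp(k * np.asarray(correlations))
```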
Data Usability
If an animal isn't performing the task, is uninterested, etc, it's unlikely the dataset should be used for facial expression analysis, if at all. However, this also clearly depends on your project and your project's constraints.
Should you run into errors in your frames, whether they be dropped frames or corrupted frames, you can perform a bilinear interpolation using the frames just before and after the dropped frame to create data that's usable. Keep in mind that this synthetic data could be unreliable.
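A minimal sketch of patching a single bad frame from its neighbors; averaging the two adjacent frames is the simple linear-interpolation case.

```python
# Sketch: replace a dropped/corrupted frame with the average of its neighbors.
# The result is synthetic data and should be flagged as such downstream.
import numpy as np

def fill_dropped_frame(frames, bad_index):
    before = frames[bad_index - 1].astype(np.float64)
    after = frames[bad_index + 1].astype(np.float64)
    frames[bad_index] = ((before + after) / 2).astype(frames.dtype)
    return frames
```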
In some cases, you may see that the HOG matrix correlations experience "low frequency drift": in other words, the correlation values may slowly increase over time in ways that are unexpected or unusual. To solve this, you can take your time series and apply a low-order Butterworth filter.
Note:
The face may in fact be changing slowly over time as stimuli are delivered to the animal! However, in Brad's experience, these sorts of changes are rarely smooth/slow in their transitions. Rather, the mice seem to have sudden jumps in their correlation matrix measurements. The low pass filter mentioned above doesn't typically eliminate these jumps.
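One way to do the drift removal in Python is to estimate the slow trend with a low-order Butterworth low-pass filter and subtract it (equivalent to high-pass filtering the trace). The cutoff and order below are arbitrary choices, not values from Brad.

```python
# Sketch: remove slow drift from a correlation trace by subtracting a low-pass
# (low-order Butterworth) estimate of the trend. Cutoff and order are arbitrary here.
import numpy as np
from scipy.signal import butter, filtfilt

def remove_drift(trace, fs=30.0, cutoff_hz=0.01, order=2):
    trace = np.asarray(trace, dtype=float)
    b, a = butter(order, cutoff_hz, btype="low", fs=fs)
    trend = filtfilt(b, a, trace)          # slow component of the correlation trace
    return trace - trend, trend
```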
Today I got the pims package to run on a video and produce a list of HOG matrices for every frame with minimal changes to the original Notebook's code. It took something like 15 minutes to get just the HOG matrices, which is a fair bit longer than the Notebook's method according to Brad. He said that it takes about 10 minutes to do the HOG matrix calculation and registration. Registration should take a relatively small amount of time since it's only done on a small subset of data at the beginning; the HOG calculations seem to be what takes the longest.