Smart Vision-Language Reasoners

Published in ICML 2024

The architecture

We freeze the vision and text backbones and add a small set of trainable layers on top, both to pool the visual and textual features and to reduce the cost of fine-tuning. The QF layer and the QF Fusion layer contain multi-head self-attention and cross-attention, followed by standard fully connected layers; on the decoder side we use a GRU layer. We chose a GRU because it usually performs as well as or better than alternatives such as an LSTM, while having fewer parameters to update on the backward pass.
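To make the pooling idea concrete, here is a minimal NumPy sketch of single-head cross-attention, where text features act as queries over frozen image features. This is an illustration of the mechanism only, not the paper's actual layer sizes or implementation; the feature dimensions and shapes below are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    """Single-head cross-attention: text queries attend over image features."""
    scores = queries @ keys_values.T / np.sqrt(d_k)  # (n_q, n_kv)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ keys_values                     # (n_q, d) pooled features

rng = np.random.default_rng(0)
d = 64
text_feats = rng.standard_normal((10, d))   # stand-in for frozen text-backbone outputs
image_feats = rng.standard_normal((49, d))  # stand-in for frozen vision-backbone outputs (e.g. 7x7 patches)

fused = cross_attention(text_feats, image_feats, d)
print(fused.shape)
```

Each fused row is a convex combination of image features, weighted by how strongly the corresponding text token attends to each image patch; in the actual architecture this would feed into the fully connected layers and the GRU decoder.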

The data

The SMART-101 dataset is a collection of questions with answers. Each problem contains a base text and 5 candidate answers. Each problem is actually a class (or collection) of problems: concrete instances are generated by code, and an image is also generated for each of the 101 problem classes, using OpenCV. Human-level performance on these data is quantified via the Math Kangaroo program.
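A hypothetical sketch of this generation pattern may help: each puzzle is a problem *class*, and code produces concrete instances (question text, 5 candidate answers, and the correct label). The function, puzzle, and distractor scheme below are illustrative inventions, not the dataset's actual API.

```python
import random

def generate_instance(seed):
    """Generate one concrete instance of a toy counting-puzzle class."""
    rng = random.Random(seed)
    a, b = rng.randint(2, 9), rng.randint(2, 9)
    answer = a * b
    question = f"A grid has {a} rows and {b} columns. How many cells are there?"
    # 5 candidate answers: the correct one plus 4 distractors near it.
    options = {answer}
    while len(options) < 5:
        options.add(answer + rng.randint(-10, 10))
    options = sorted(options)
    return question, options, options.index(answer)

q, opts, label = generate_instance(0)
print(q, opts, label)
```

In SMART-101 the generator would additionally render an image for the instance (the dataset uses OpenCV for this), so each sample pairs generated text, a generated image, and the answer options.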

For more details on SMART-101, see the SMART-101 paper. For more details on Math Kangaroo, see the Math Kangaroo site.

The findings

We find improved performance over the baselines reported in the SMART-101 paper.

Critiques

Some common critiques I heard at the ICML workshop and since then:

  1. We didn’t do enough epochs of fine-tuning.

We were limited in GPU compute time, and the vision and text backbones are large models. Keep in mind this work wasn’t sponsored by my employer, so we did not have unlimited access to A100 GPUs for fine-tuning and extensive ablations. We followed a training recipe outlined by Andrej Karpathy, which remains excellent advice despite the fast-moving nature of the space.

  2. Images in math AI considered harmful.

Another common critique I heard is that many other math-AI papers have investigated the use of images and found them either unhelpful or harmful (“images in math AI considered harmful”). In all the papers I’ve read, however, the authors do not customize the network architectures, nor do they use cross-attention layers that pool information from the textual and image backbones. We sought to disprove the claim that images in math AI are harmful, and we did.

In fact, this came up as a panel discussion question during the workshop: “Is text alone enough?” The consensus was that while images may not be necessary, they are sufficient.

That’s math speak for: images will help, because they carry a lot of information, but math problems can be solved without them.

  3. Images in math AI found not to help.

This is a finding in the MathVerse paper: the model learns to shortcut the vision features and rely primarily on the textual features of the problem. My commentary is the same as for item 2 above: the architectures did not pool the visual and textual information.

My Opinion on Images

While I do not disagree with the premise of the panel, it seems to me a bit like bringing a knife to a gunfight, or, to use a less violent metaphor, playing chess blindfold (sans voir) against your opponent. Some quite talented players can play blindfold extremely well; however, most will say their blindfold performance is hindered compared to playing sighted.

To be honest, though, neither I nor others in the community have an “answer” to the question of images in math AI.

Conclusion

While it remains to be seen whether purely text-based models can perform as well in the math-AI domain, our work suggests there are several aspects of the problem hitherto unconsidered.

If you are a researcher or institution who would like to work together in this space, or fund our investigations, please get in touch!

My email contact is on the left-hand side of the screen, only a click away.

Recommended citation: Roberts D, Roberts L. Smart Vision-Language Reasoners. arXiv preprint arXiv:2407.04212. 2024 Jul 5.