PropTest: Automatic Property Testing for Improved Visual Programming

1Rice University, 2Columbia University
Teaser GIF

Visual programming methods generate executable code to solve a vision-and-language task such as VQA. PropTest improves on these methods by automatically generating testing code that probes for several properties of the expected output. These tests provide additional information when generating the code and are used to check the correctness of the output solutions. For this example, we use ViperGPT with CodeLlama-7B as the baseline.

Abstract

Visual Programming has emerged as an alternative to end-to-end black-box visual reasoning models. These methods leverage Large Language Models (LLMs) to decompose a problem and generate the source code for an executable computer program. This strategy offers an interpretable reasoning path and does not require finetuning a model with task-specific data. We propose PropTest, a general strategy that improves visual programming by further using an LLM to generate code that tests for visual properties in an initial round of proposed solutions. In particular, our method tests for data-type consistency, as well as syntactic and semantic properties of the generated solutions. Our approach outperforms baselines and achieves results comparable to state-of-the-art methods while using smaller, publicly available LLMs (CodeLlama-7B and WizardCoder-15B). We demonstrate this across benchmarks on visual question answering and referring expression comprehension, showing the efficacy of our approach in enhancing the performance and generalization of visual reasoning. Specifically, PropTest improves on ViperGPT, obtaining 48.66% accuracy (+8.3%) on the A-OKVQA benchmark and 52.8% (+3.3%) on the RefCOCO+ benchmark using CodeLlama-7B.

Methods


Overview of PropTest.


Given an image and a question, the goal is to generate Python code that can be executed to obtain an answer. PropTest first calls an LLM to generate test cases based on the inferred properties of the answer. The generated test cases are then used to improve the quality of the Python code produced by the same LLM. Our pipeline supports two failsafe mechanisms: generating the property testing code first and then the program code, or generating the property testing code and the program code independently. The full pipeline highlighted here uses both mechanisms.
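The sketch below outlines this flow under stated assumptions: the prompts, helper names (call_llm, answer_question), and the fallback behavior are illustrative placeholders, not the released implementation.

```python
# Minimal sketch of the PropTest pipeline. Prompts, helper names, and the
# fallback behavior are illustrative assumptions, not the released code.

def call_llm(prompt: str) -> str:
    """Query a code LLM (e.g., CodeLlama-7B) and return generated Python source."""
    raise NotImplementedError("plug in an actual LLM client here")


def generate_property_tests(question: str) -> str:
    """Step 1: infer properties of the answer (data type, plausible values, ...)
    and turn them into a test function."""
    prompt = (
        f"Question: {question}\n"
        "Write a Python function test_answer(answer) containing assertions "
        "that any correct answer must satisfy."
    )
    return call_llm(prompt)


def generate_program(question: str, tests: str) -> str:
    """Step 2: generate the program, conditioning on the property tests."""
    prompt = (
        f"Question: {question}\n"
        f"The returned answer must pass these tests:\n{tests}\n"
        "Write a Python function execute_command(image) that returns the answer."
    )
    return call_llm(prompt)


def answer_question(image, question: str, fallback=None):
    tests_src = generate_property_tests(question)
    program_src = generate_program(question, tests_src)  # tests guide code generation

    scope: dict = {}
    exec(program_src, scope)  # defines execute_command(image)
    exec(tests_src, scope)    # defines test_answer(answer)

    try:
        answer = scope["execute_command"](image)
        scope["test_answer"](answer)  # tests also check the produced answer
        return answer
    except Exception:
        # Failsafe: if the program crashes or its answer fails the property tests,
        # fall back to an alternative (e.g., a default answer or another model).
        return fallback(image, question) if fallback is not None else None
```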

Qualitative Results

We show example results on GQA, A-OKVQA, RefCOCO, and RefCOCO+ where PropTest succeeds but the baseline fails. The input question and answer are shown on the left, the generated property test case in the middle, and the generated code on the right.
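To make this concrete, here is a constructed example of the kind of property test generated for a VQA question; it is an illustration in the same style as the figure, not an output copied from the paper.

```python
# Question: "What color is the fire hydrant?"  (constructed example)
# Inferred properties: the answer is a short string naming a color.

def test_answer(answer):
    assert isinstance(answer, str)           # data-type property
    assert 0 < len(answer.split()) <= 2      # color names are one or two words
    assert any(c in answer.lower()           # semantic property: mentions a color
               for c in ["red", "yellow", "green", "blue", "white",
                         "black", "orange", "gray", "silver"])
```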

Qualitative Results

Quantitative Results

Comparison of PropTest against end-to-end VLMs and visual programming methods with CodeLlama-7B on the visual question answering task.


We report accuracy on two visual question answering benchmarks. The number in the subscript indicates the improvement over ViperGPT. (†) WizardCoder-15B has a 0.48× smaller context size than CodeLlama-7B despite having more parameters.


Comparison of PropTest against end-to-end VLMs and visual programming methods with CodeLlama-7B on visual grounding tasks.


The number in the subscript indicates the improvement over ViperGPT. (†) WizardCoder-15B has a 0.48× smaller context size than CodeLlama-7B despite having more parameters.

BibTeX

@article{koo2024proptest,
      title={PropTest: Automatic Property Testing for Improved Visual Programming}, 
      author={Jaywon Koo and Ziyan Yang and Paola Cascante-Bonilla and Baishakhi Ray and Vicente Ordonez},
      year={2024},
      eprint={2403.16921},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}