MACT: Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering
Zhou, Wei, Mohsen Mesgar, Annemarie Friedrich, and Heike Adel. “Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering.” arXiv preprint arXiv:2412.20145 (2024).
What problem does this paper address?
Complex table question answering
What is the motivation of this paper?
- Fine-tuning LLMs requires high-quality training data, which is costly and hard to obtain.
- Prompting closed-source commercial LLMs can be costly and poses challenges to accessibility and reproducibility.
- A single LLM struggles with performing different types of reasoning, especially if it does not excel at specific tasks due to lack of domain knowledge.
What are the main contributions of this paper?
- MACT (Multi-Agent Collaboration with Tool use), a novel framework for complex TableQA that requires neither closed-source models nor fine-tuning
- an efficiency optimization module that allows MACT to take shortcuts during reasoning
How does the MACT framework work?
In short: Given a table, a question, and optionally some textual context as input, a planning agent performs online planning (i.e., it generates a plan iteratively), breaking the complex problem down and handling multi-step reasoning, while the coding agent and the tool set assist with generating faithful intermediate results. This process is repeated for a fixed number of iterations to yield the final answer.
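A minimal sketch of this loop, assuming string-valued intents and dependency-injected agents (`generate_actions`, `select_action`, `run_tool`, and `memory.history` are my own placeholder names, not from the paper):

```python
def mact_loop(memory, generate_actions, select_action, run_tool, max_iters=5):
    """Illustrative MACT iteration; the agents and tools are passed in as callables."""
    for _ in range(max_iters):
        candidates = generate_actions(memory)          # Step 1: planning agent samples candidate actions
        action = select_action(candidates)             # Step 2: pick one action
        if action.intent == "Finish":
            return action.instruction                  # the Finish instruction carries the final answer
        observation = run_tool(action, memory)         # Steps 3-4: coding agent / tool set yield the observation
        memory.history.append((action, observation))   # Step 5: memory state update
    return None                                        # no Finish action within the iteration budget
```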
Components:
- a memory
- a planning agent
- a coding agent
- a tool set
Initialization
Initialize the memory state with the input: the table, the question, and (optionally) the textual context.
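A possible representation of the memory state (a sketch; the field layout is my assumption, the paper does not prescribe this exact structure):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Memory:
    """Memory state: the input plus the history of intermediate steps."""
    table: str                                     # (linearized) input table
    question: str
    context: Optional[str] = None                  # optional textual context
    history: list = field(default_factory=list)    # (action, observation) pairs collected so far

def init_memory(table, question, context=None):
    # Initialization: the memory starts from the raw input with an empty history.
    return Memory(table=table, question=question, context=context)
```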
1. Action Generation
At each iteration, the planning agent samples multiple candidate actions, together with estimated observations, conditioned on the current memory state. Each action consists of an intent and an instruction.
Intent: purpose of an action
6 types of intents:
- Retrieval: any operations extracting information from a table (direct querying, filtering, grouping)
- Calculation: calculation, counting, comparison
- Search: call Wikipedia API on an entity to extract external (factual) knowledge
- Read: retrieve information from the textual context
- Finish: stop generating more actions and end the iteration
- Ask: retrieve an answer based on the internal knowledge of the planning agent
Instruction: detailed specifications of the intent
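The intents above and the intent–instruction pairing could be represented like this (a sketch; only the intent names come from the paper):

```python
from dataclasses import dataclass
from enum import Enum

class Intent(Enum):
    RETRIEVAL = "Retrieval"      # extract information from the table
    CALCULATION = "Calculation"  # calculation, counting, comparison
    SEARCH = "Search"            # query the Wikipedia API for external knowledge
    READ = "Read"                # retrieve information from the textual context
    FINISH = "Finish"            # stop; its instruction carries the final answer
    ASK = "Ask"                  # answer from the planning agent's internal knowledge

@dataclass
class Action:
    intent: Intent
    instruction: str  # detailed specification of what to do for this intent
```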
2. Action Selection
From the set of sampled candidate actions, select one action by majority voting over their intents.
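Assuming selection is a majority vote over the sampled intents (my reading of the self-consistency idea; the exact criterion may differ), a sketch:

```python
from collections import Counter

def select_action(candidate_actions):
    """Pick an action whose intent is the most frequent one among the sampled candidates."""
    intent_counts = Counter(action.intent for action in candidate_actions)
    top_intent, _ = intent_counts.most_common(1)[0]
    # Return the first candidate carrying the winning intent.
    return next(action for action in candidate_actions if action.intent == top_intent)
```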
3. Code Generation and Execution
Given the instruction of the selected action, the coding agent generates Python code that implements it.
Sample multiple code candidates from the coding agent.
Run a Python interpreter on each candidate and collect the execution results.
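A sketch of executing the sampled code candidates in separate Python interpreters (the sampling interface and the timeout handling are my assumptions):

```python
import subprocess
import sys

def execute_code_candidates(code_candidates, timeout=10):
    """Run each sampled code snippet in its own Python interpreter and collect its stdout."""
    results = []
    for code in code_candidates:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True, text=True, timeout=timeout,
            )
            if proc.returncode == 0:
                results.append(proc.stdout.strip())
        except subprocess.TimeoutExpired:
            pass  # discard candidates that hang
    return results
```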
4. Observation Computation
If the selected tool is the Wikipedia API or the calculator, then that tool will return a deterministic result.
If the selected tool is the Python interpreter, then select the most frequent element from the joint set of the output of Step 1 (the estimated observations from the planning agent) and the execution results from Step 3.
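A sketch of this majority vote over the joint pool (variable names are mine):

```python
from collections import Counter

def compute_observation(estimated_observations, execution_results):
    """Majority vote over the combined pool of estimated observations and execution results."""
    pool = list(estimated_observations) + list(execution_results)
    if not pool:
        return None  # nothing to vote on
    value, _ = Counter(pool).most_common(1)[0]
    return value
```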
5. Memory State Update and Iteration
After obtaining the observation, append the selected action and its observation to the memory state.
Then, continue with the next iteration.
Efficiency Optimization
- Motivation: Questions that do not require multi-step or multi-category reasoning can be answered directly by the planning agent alone.
- Goal: trade off performance against computation time.
- For a given question, first let the planning agent generate complete reasoning traces up to the output of an action with intent Finish. Approximate the planning agent's confidence by the degree of self-consistency of the estimated predictions, where each prediction is the instruction of a trace's Finish action.
- "Degree of self-consistency" == number of occurrences of the most frequent prediction among these estimated predictions.
- Introduce a hyperparameter that acts as a threshold on this degree.
- If the degree of self-consistency exceeds the threshold, the planning agent directly outputs its most frequent answer (the shortcut; see the sketch below). Otherwise, adopt the collaborative framework as usual.
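A sketch of the shortcut, assuming the Finish instructions of the sampled traces are collected up front (the function name and calling convention are my assumptions):

```python
from collections import Counter

def shortcut_answer(predictions, threshold):
    """Return the most frequent prediction if the planning agent is self-consistent enough.

    `predictions` holds the Finish instructions of the sampled reasoning traces.
    Returns None when the degree of self-consistency does not exceed the threshold,
    in which case the full multi-agent framework is used instead.
    """
    if not predictions:
        return None
    top_answer, count = Counter(predictions).most_common(1)[0]
    if count > threshold:      # degree of self-consistency exceeds the threshold
        return top_answer      # take the shortcut
    return None                # fall back to MACT's collaborative loop
```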
Hyperparameter selection:
- smaller threshold: the shortcut is taken more often
- larger threshold: the planning agent needs to be more confident in order to take the shortcut
On which TQA datasets is MACT evaluated?
WTQ (WikiTableQuestions)
- general domain
- no multi-step or multi-category reasoning
TAT (TAT-QA)
- hybrid data (table + text)
- numerical reasoning
CRT (CRT-QA)
- Wikipedia tables
- complex reasoning
SCITAB
- claim verification
- converted to TQA format
Which systems (models) are evaluated?
Systems for Comparison
- Systems that require fine-tuning:
- OmniTab (BART)
- TableLlama (Llama-7b)
- ProTrix (Llama-7b)
- TAT-LLM (Llama-7b)
- TableLLM (Llama-7b)
- Systems that do not require fine-tuning:
- Dater (GPT-3.5-turbo)
- Binder (GPT-3.5-turbo)
- Chain-of-Table (GPT-3.5-turbo)
- ReAcTable (GPT-3.5-turbo)
- TabSQLify (GPT-3.5-turbo)
- Plan-then-Reason (GPT-3.5-turbo)
- Mix-SC (GPT-3.5-turbo)
- ARC (GPT-3.5-turbo)
Systems for MACT
- Qwen-72B (planning agent, quantized to int4)
- CodeLlama-34B (coding agent, full precision)
- GPT-3.5-turbo (planning agent + coding agent)
Hyperparameters
- Values: 0.6, 0.6, 5, 7, 1 (the mapping to the individual hyperparameter symbols is given in the paper)
What are the results and conclusions?
- MACT (GPT-3.5-turbo) outperforms TQA models on TAT, CRT, and SCITAB.
- MACT outperforms out-of-the-box open-weight LLMs on all datasets.
- MACT with open-source models delivers performance comparable to systems built on closed-source models.
- MACT generalizes better across datasets than fine-tuned TQA systems.
- MACT is moderately efficient.
How does MACT differ from previous methods?