Dr Vivek Gupta, University of Utah

(Note: As of 2024, he is a post-doc fellow at the University of Pennsylvania)

Role: Research Assistant

Period: May, 2021 → July, 2022

Project Page: [ https://xinfotabs.github.io/ ]

Code: [ https://github.com/XInfoTabS/dataset ]

Publication: [ https://aclanthology.org/2022.fever-1.7.pdf ]

Problem Statement

Most Tabular Natural Language Inference (NLI) datasets available at the time were built only for English, so the models trained on them are only good at classifying English hypotheses against English tables.

Meanwhile, the portion of the world that doesn't speak English is 10 times larger than the portion that does. It is therefore important to support Tabular NLI when the table is in English and the hypothesis is in another language, or when the table is in another language and the hypothesis is in English. Supporting such cases makes Tabular NLI more inclusive across all of its use cases.

To this end, we presented “XInfoTabS”.

Methodology

We broke the work down into three parts: translation and transliteration of tables, translation of hypotheses, and building a benchmark.

Table translation and transliteration is challenging because most keys and values don't carry enough context to infer their meaning, which reduces translation quality. We addressed this with a long pipeline designed to ensure high data quality (see the paper for details).
Translation of hypotheses was more straightforward. To ensure that models work well with the translated hypotheses, we had to make sure that the back-translation error was minimal.
Benchmarking can be done in several ways, such as directly using a multilingual model for TNLI or pairing a translation model with a regular TNLI model. We formulated 5 different benchmarks and ran experiments for all of them (see the paper for details).
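
As a rough illustration of the back-translation check mentioned above, the sketch below scores how closely a hypothesis survives a round trip (English → target language → English) and flags the pairs that drift too far. It is not the metric used in the paper: the token-overlap similarity from Python's `difflib` is a simple stand-in, the function names and the 0.6 threshold are hypothetical, and the back-translated sentences are assumed to have already been produced by some translation system.

```python
from difflib import SequenceMatcher


def back_translation_similarity(original: str, back_translated: str) -> float:
    """Surface similarity in [0, 1] between a sentence and its round trip.

    Stand-in metric (assumption): token-level SequenceMatcher ratio,
    not the scoring used in the XInfoTabS paper.
    """
    a = original.lower().split()
    b = back_translated.lower().split()
    return SequenceMatcher(None, a, b).ratio()


def flag_low_quality(pairs, threshold=0.6):
    """Return indices of (original, back_translated) pairs below the threshold.

    The threshold value is a hypothetical choice for illustration.
    """
    return [
        i
        for i, (original, back) in enumerate(pairs)
        if back_translation_similarity(original, back) < threshold
    ]


# Made-up example pairs: the second round trip has drifted in meaning.
pairs = [
    ("The singer was born in Canada.", "The singer was born in Canada."),
    ("The album was released before 2000.", "Music disc came out after year 2000."),
]
flagged = flag_low_quality(pairs)
```

Flagged hypotheses could then be re-translated or sent for manual review. A surface-overlap score like this misses valid paraphrases, so a semantic similarity model would likely be a better fit in practice.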

Conclusion

Although we put considerable effort into ensuring a high-quality dataset for multilingual Tabular NLI, a task that remains difficult for the current set of Tabular NLI models, we were aware that the data quality could be verified and improved with further human involvement. This was left for future work.

Misc. points

During my tenure as a Research Assistant, I frequently gave presentations on my literature review and work, which demonstrated clarity of understanding and communication skills.

To be more involved in the research process, I also opted to serve as a secondary reviewer under Vivek Gupta for EMNLP 2022. It was enriching to be a part of the review process and to understand what a reviewer looks for when giving authors feedback on their papers.