Social Intelligence Data Infrastructure:
Structuring the Present and Navigating the Future


TLDR: We show the need for multifaceted datasets, greater diversity in language and culture, more long-tailed social situations, and more interactivity in future social intelligence data.

1National University of Singapore; Agency for Science, Technology and Research (A*STAR)  2Stanford University

Overview (Foundation & Applications of our data infrastructure.)


How is the data infrastructure built? It builds on past theoretical efforts to define the concept of social intelligence and empirical efforts to create NLP datasets that enhance social intelligence in language models.

What is inside the data infrastructure? A taxonomy and a data library.

How can the data infrastructure be applied? It can be used to organize existing dataset efforts, analyze the current dataset landscape, provide a foundation for a comprehensive social intelligence evaluation suite, and shed light on future dataset development.



Taxonomy

Social intelligence is a multi-faceted concept. We introduce the Social AI Taxonomy to capture the different dimensions and social factors involved in social intelligence.

What does social intelligence entail in AI systems?

Our taxonomy has three pillars: Cognitive Intelligence, Situational Intelligence, and Behavioral Intelligence. We illustrate their definitions, the factors they entail, and example NLP tasks in the following diagram.

How do different types of intelligence interact?




Figure 1. Social AI taxonomy with three pillars: cognitive, situational and behavioral intelligence. We illustrate their respective roles in social interactions (left), and visualize their definitions and example NLP tasks (right).

Data Library

Introduction: Accessible in this Google Sheet. We applied keyword filtering and manual verification to papers scraped from the ACL Anthology, obtaining a total of 480 NLP datasets on social intelligence. Each dataset is annotated with the following attributes: year, title, link, type of intelligence, social factor, NLP task, data source, annotation strategy, generation method, data format, language, modality, and public availability of the test set.
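The filtering stage can be sketched as a simple keyword match over paper titles and abstracts, followed by manual verification of the matched candidates. The keyword list and paper fields below are illustrative stand-ins, not the exact ones used in our pipeline.

```python
# Hypothetical sketch of the keyword-filtering stage of the data library
# construction; keywords and record fields are illustrative only.
SOCIAL_KEYWORDS = {
    "emotion", "sarcasm", "stance", "persuasion", "empathy",
    "social norm", "intent", "dialogue", "moral",
}

def matches_social_keywords(paper: dict) -> bool:
    """Return True if a paper's title or abstract mentions any keyword."""
    text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower()
    return any(kw in text for kw in SOCIAL_KEYWORDS)

papers = [
    {"title": "A Dataset for Sarcasm Detection", "abstract": "..."},
    {"title": "Neural Machine Translation", "abstract": "..."},
]

# Keyword-matched papers still go through manual verification afterwards.
candidates = [p for p in papers if matches_social_keywords(p)]
print(len(candidates))  # 1
```

Keyword matching alone over-selects (e.g., "dialogue" appears in many non-social papers), which is why the manual verification pass is essential before a paper enters the library.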

What insights can we distill from the data library?

By visualizing the trends of the datasets in the data library, we obtain insights about the past and current NLP paradigm for dataset development on social intelligence.


Distribution of Intelligence Types

Most social NLP datasets focus on the cognitive aspect of social intelligence (64.2%), followed by the behavioral aspect (22.7%), and least of all, the situational aspect (3.8%). Only a small set of datasets (9.4%) span multiple intelligence types.


Distribution of Subcategories

For cognitive and behavioral intelligence, papers are balanced across subcategories. For situational intelligence, most datasets measure knowledge of the speakers involved in a dialogue and their social relations; very few datasets cover moral values and social norms.


Data Format

We classify the data formats of surveyed datasets into two categories: interactive and static. Interactive data involve information exchange, such as social media threads. Static data include (1) self-contained, topically focused texts written for a general audience and (2) fragments of an information exchange that lack prior or subsequent context. Data format can affect a language model's ability to acquire social intelligence (Sap et al., 2022). For cognitive aspects in particular, there are significantly more static datasets than interactive ones. Moreover, within the interactive datasets, the proportion that models multi-party interaction is small (18.4%).

Use of AI

An increasing number of works adopt AI to generate and annotate datasets related to social intelligence (before 2015: 3; after 2015: 32). We find that AI is adopted more for generation than for annotation. In recent work, researchers outsource generation completely, producing content purely with AI (Zhou et al., 2023) or simulating conversations between two AI agents (Lee et al., 2022). In contrast, the use of AI for annotating social intelligence data remains in a hybrid stage (Jo et al., 2022), where AI typically handles simpler high-level components such as themes (Maes et al., 2023).

Language Representation

Datasets surveyed in our data library cover 49 different languages. The majority of studies (62.5%) use English data to explore social intelligence. Moreover, there are more recent research efforts on code-mixing, suggesting increased representation of multilingual communities.

Modality Representation

Because we only crawl data from the ACL Anthology, the majority of the datasets on social intelligence we surveyed are in textual format only. However, modalities such as image, audio, and video can enhance the learning of social intelligence through the enriched social information they carry.

Data Source

Most datasets in our data library use social media as the data source. Data collection from social media has surged since 2010, likely driven by the increased use of Twitter data (Baeth et al., 2019). Meanwhile, the relative proportion of traditional media such as news and websites has decreased since then. The second most popular data source is previously built data resources: new datasets leverage and extend prior ones by translating them into low-resource languages (Ramaneswaran et al., 2022), introducing new evaluation criteria (Peng et al., 2020), or adding new layers of annotation (Tigunova et al., 2021).

Data Availability

There is a promising trend: the proportion of social intelligence datasets that are made available has increased over the years. The proportion surged from 2014 onwards, and since 2016 publicly available social intelligence datasets have outnumbered unavailable ones. Unavailable datasets usually contain sensitive information such as mental illness and sexual orientation. Additional measures such as anonymization, data encryption, and access control should be in place to protect confidentiality while giving future work secure ways to use these data to advance research in sensitive domains.

LLM Performance

For each of our taxonomic categories, we select two representative datasets, one simple and one more challenging. For classification tasks, we use F1 as the metric. For generation tasks, we present both the original human response and the LLM response to human annotators and calculate the percentage of cases in which the LLM response is preferred. Below are the results we obtained:
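The two metrics above can be made concrete with a minimal sketch. The labels and annotator votes below are made-up examples for illustration, not data from our evaluation.

```python
# Illustrative computation of the two evaluation metrics described above.
# All labels and votes here are made-up examples, not real results.

def f1_score(gold, pred, positive):
    """Binary F1 for one positive class (simple sketch, no libraries)."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def preference_percentage(votes):
    """Share of annotator votes preferring the LLM response over the human one."""
    return 100.0 * sum(v == "llm" for v in votes) / len(votes)

gold = ["sarcastic", "literal", "sarcastic", "literal"]
pred = ["sarcastic", "sarcastic", "sarcastic", "literal"]
print(f1_score(gold, pred, positive="sarcastic"))             # 0.8
print(preference_percentage(["llm", "human", "llm", "llm"]))  # 75.0
```

In practice a library implementation (e.g., scikit-learn's `f1_score` with an averaging mode for multi-class tasks) would replace the hand-rolled version; the sketch only shows what the reported numbers mean.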

(A) Cognitive Intelligence


Insights: Compared to straightforward query intent recognition (95.0 F1), the best-performing LLM (GPT-4) struggles more with identifying intended sarcasm (67.3 F1), where people convey the opposite of what they literally say. Moreover, uncommon tasks with fewer datasets are more challenging, such as stance detection in the economic domain (most stance detection data targets the political domain). With more fine-grained label definitions, LLMs perform better in classification, as seen from the higher F1 on GoEmotions, which defines more emotion classes, than on SemEval T1.


(B) Situational Intelligence


Insights: More social context in the data can also improve performance: LLMs achieve a higher F1 on CICERO, which provides both a social situation description and dialogue data, than on SocialIQa, which provides only a simple description. LLMs also find long-tailed social situations (e.g., moral exceptions) more challenging.


(C) Behavioral Intelligence


Insights: LLMs exceed average human performance on behavioral intelligence tasks, with a preference percentage above 50% on all tasks. However, the preference percentage for LLMs (e.g., Claude) is lower in more dynamic, interactive settings (e.g., applying persuasion in dialogue: 74.3 vs. writing persuasive arguments: 93.6) and when more constraints are imposed (e.g., with persona constraints: 64.0 vs. without: 74.7).


(D) Multiple Intelligence



Insights: Real-life social applications of LLMs usually require multiple types of intelligence (e.g., interpreting intents under different cultural backgrounds), but LLMs still fall short there (CulturalNLI: 65.0). Sections (A)-(C) show that they perform well on individual modules, so systems can employ LLMs for the modules they excel at and combine them organically into a strong holistic system (e.g., combining emotion recognition and positive reframing components for a counseling system).
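The modular composition idea can be sketched as a tiny pipeline. Both component functions below are hypothetical stubs standing in for real model calls, and the counseling example mirrors the one named above; none of this is a real system.

```python
# Hypothetical sketch of composing two modules that LLMs handle well
# individually into a counseling-style system. Both functions are stubs
# standing in for actual model calls.

def recognize_emotion(utterance: str) -> str:
    """Stub emotion classifier; a real system would call a model here."""
    negative_cues = ("hopeless", "failed", "terrible", "worthless")
    return "negative" if any(c in utterance.lower() for c in negative_cues) else "neutral"

def reframe_positively(utterance: str) -> str:
    """Stub positive-reframing generator; a real system would call an LLM."""
    return "That sounds hard right now, but setbacks can be a chance to learn and regroup."

def counseling_reply(utterance: str) -> str:
    """Route through emotion recognition, then reframe only when needed."""
    if recognize_emotion(utterance) == "negative":
        return reframe_positively(utterance)
    return "Thanks for sharing. How are you feeling about it?"

print(counseling_reply("I failed my exam and feel hopeless."))
```

The design point is the routing: each module does one task the LLM is strong at, and a thin controller decides which module's output reaches the user, rather than asking one model to exercise multiple intelligence types at once.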

Ethics and Disclosure

This study has been approved by the Institutional Review Board (IRB) at the researchers' institution, and we obtained participant consent with a standard institutional consent form.

One ethical concern is that as models become more socially intelligent, they also become more capable of undesirable outcomes such as persuasive misinformation and psychological manipulation. There may also be concerns that skilled anthropomorphic models will come to replace humans. These risks can not only erode user trust but also harm users' well-being. Our work proposes a standard conceptualization and analyzes the landscape; these risks are beyond its scope, but we acknowledge their presence and encourage future social AI data and systems to set clearer guidelines on the capabilities and limitations of AI systems, preventing deceptive and manipulative behaviours while advancing social intelligence.

BibTeX

If you find our project useful, please consider citing:

@misc{li2024social,
      title={Social Intelligence Data Infrastructure: Structuring the Present and Navigating the Future},
      author={Minzhi Li and Weiyan Shi and Caleb Ziems and Diyi Yang},
      year={2024},
      eprint={2403.14659},
      archivePrefix={arXiv},
      primaryClass={cs.CY}
}