variationist.metrics package

Submodules

variationist.metrics.corpus_statistics module

Functions for calculating a series of statistics for a given corpus.

variationist.metrics.corpus_statistics.average_text_length(label_values_dict, subsets_of_interest)[source]

Returns a dictionary with the average length of texts in each subset of interest.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.

Returns:

values_dict – A dict containing the average length (and its standard deviation) of texts in each subset.

Return type:

Dict
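
As a rough illustration of the statistic described above, the following sketch computes the mean and standard deviation of text lengths for each subset. It is not the library's implementation: plain lists of token lists stand in for the pandas series, the flat `subsets` mapping and the `average_length` helper are hypothetical, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def average_length(tokenized_texts):
    """Return (mean, standard deviation) of text lengths measured in tokens."""
    lengths = [len(tokens) for tokens in tokenized_texts]
    if not lengths:
        return 0.0, 0.0
    # Population standard deviation is an assumption made for this sketch.
    return mean(lengths), pstdev(lengths)


# Hypothetical subsets: each key names a subset, each value holds tokenized texts.
subsets = {
    "label=a": [["a", "short", "text"], ["one", "more"]],
    "label=b": [["just", "one", "longer", "text", "here"]],
}
averages = {name: average_length(texts) for name, texts in subsets.items()}
```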

variationist.metrics.corpus_statistics.compute_basic_stats(label_values_dict, subsets_of_interest, args)[source]

A wrapper function for calling all of the basic statistics functions.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

stats_dict – A dict containing the calculated statistics.

Return type:

Dict

variationist.metrics.corpus_statistics.create_frequency_dictionary(label_values_dict, subsets_of_interest, args)[source]

Returns a dictionary with the frequency of tokens in each subset of interest.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

output_freqs – A dict containing the frequency of each token for each subset of interest.

Return type:

Dict
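
A minimal sketch of a per-subset frequency dictionary, assuming each subset is a plain list of token lists rather than a pandas series. The `token_frequencies` helper and the flat `subsets` mapping are hypothetical, and any options carried by `args` are ignored here.

```python
from collections import Counter


def token_frequencies(tokenized_texts):
    """Count how often each token occurs across all texts of one subset."""
    counts = Counter()
    for tokens in tokenized_texts:
        counts.update(tokens)
    return dict(counts)


# Hypothetical subsets: each key names a subset, each value holds tokenized texts.
subsets = {
    "label=a": [["the", "cat"], ["the", "dog"]],
    "label=b": [["a", "bird"]],
}
output_freqs = {name: token_frequencies(texts) for name, texts in subsets.items()}
```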

variationist.metrics.corpus_statistics.num_tokens(label_values_dict, subsets_of_interest)[source]

Returns a dictionary with the total number of tokens in each subset.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.

Returns:

n_word_dict – A dict containing the total number of tokens in each subset.

Return type:

Dict

variationist.metrics.corpus_statistics.number_of_duplicates(label_values_dict, subsets_of_interest)[source]

Returns a dictionary with the number of duplicate texts in each subset of interest.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.

Returns:

duplicates_dict – A dict containing the number of duplicate texts in each subset.

Return type:

Dict
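
One possible way to count duplicates is sketched below. It assumes that a duplicate is every occurrence of an identical token sequence beyond the first one; the library may define duplicates differently. The `count_duplicates` helper is hypothetical.

```python
from collections import Counter


def count_duplicates(tokenized_texts):
    """Count texts that repeat an earlier, identical token sequence."""
    # Lists are not hashable, so each text is converted to a tuple first.
    occurrences = Counter(tuple(tokens) for tokens in tokenized_texts)
    # Every occurrence after the first one counts as a duplicate (assumption).
    return sum(count - 1 for count in occurrences.values())


texts = [["a", "b"], ["a", "b"], ["c"], ["a", "b"]]
n_duplicates = count_duplicates(texts)
```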

variationist.metrics.corpus_statistics.number_of_texts(label_values_dict, subsets_of_interest)[source]

Returns a dictionary with how many texts are in each subset of interest.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.

Returns:

values_dict – A dict containing the number of texts in each subset.

Return type:

Dict

variationist.metrics.corpus_statistics.take(n, iterable)[source]: Return the first n items of the iterable as a list.
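
The description matches the well-known `take` recipe from the `itertools` documentation. The sketch below shows that recipe; whether the library uses this exact body is an assumption.

```python
from itertools import islice


def take(n, iterable):
    """Return the first n items of the iterable as a list."""
    return list(islice(iterable, n))


# Example: keep the two most frequent tokens of a frequency dictionary
# whose items are already sorted by descending frequency.
freqs = {"the": 3, "cat": 2, "dog": 1}
top_two = take(2, freqs.items())
```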

variationist.metrics.corpus_statistics.vocab_size(label_values_dict, subsets_of_interest)[source]

Returns a dictionary with the total number of unique tokens in each subset, i.e., the size of the vocabulary for each subset.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.

Returns:

vocab_dict – A dict containing the vocabulary size of each subset.

Return type:

Dict
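
The difference between the token count reported by num_tokens and the vocabulary size reported by vocab_size can be illustrated with a small sketch. It assumes one subset given as a plain list of token lists; the `tokens_and_vocab` helper is hypothetical and not part of the library.

```python
def tokens_and_vocab(tokenized_texts):
    """Return (total number of tokens, number of unique tokens) for one subset."""
    all_tokens = [token for tokens in tokenized_texts for token in tokens]
    return len(all_tokens), len(set(all_tokens))


subset = [["the", "cat", "sat"], ["the", "dog", "sat"]]
n_tokens, n_types = tokens_and_vocab(subset)
```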

variationist.metrics.lexical_artifacts module

variationist.metrics.lexical_artifacts.compute(texts: List[str], labels: List[str], label_of_interest: str, method: str = 'pmi', special_tokens: List[str] = [], add_emojis: bool = True, stopwords: str = '', pretrained_tokenizer: str = 'bert-base-uncased') → DataFrame[source]

Computes lexical artifacts given an input dataset (texts and labels) and a label of interest. Additional parameters can be specified to, for example, exclude emojis from the computation of lexical artifacts or add special tokens to the tokenizer’s vocabulary. Support for changing the method and the pretrained tokenizer is planned for the near future.

[1] Alan Ramponi and Sara Tonelli. 2022. Features or Spurious Artifacts? Data-centric Baselines for Fair and Robust Hate Speech Detection. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Parameters:

texts (List[str]) – Input texts (note: the ith text of “texts” must match the ith label of “labels”)
labels (List[str]) – Input labels (note: the ith label of “labels” must match the ith text of “texts”)
label_of_interest (str) – Label that is the focus of the artifacts calculation (note: it must be in “labels”)
method (str) – Algorithm to compute the contribution strength of each token to each label. Default: “pmi”. For now, only “pmi” is supported, as implemented in [1]; more methods are planned for upcoming releases.
special_tokens (List[str]) – List of special tokens to add to the tokenizer’s vocabulary. Default: []
add_emojis (bool) – Whether to add emojis to the tokenizer’s vocabulary. Default: True. If set to False, a special token “[EMOJI]” is used for all emojis.
stopwords (str) – The language of the stopwords to be removed from lexical artifacts. Default: en (English). If None, all stopwords are retained in the list of lexical artifacts. For now, only “en” is supported (with a default stopword list); more languages are planned for upcoming releases.
pretrained_tokenizer (str) – Name of the HuggingFace pretrained tokenizer to use (e.g., “bert-base-uncased”). For now, BPE-based tokenizers (e.g., RoBERTa-base, GPT2) do not filter stopwords correctly, if requested, due to the “Ġ” special character. Thorough support is planned for upcoming releases.

Returns:

sorted_pmi_scores – Pandas dataframe with tokens as rows and label_of_interest as column. Values in this matrix are PMI scores following the implementation by [1].

Return type:

pd.core.frame.DataFrame

variationist.metrics.lexical_artifacts.compute_pmi(w_count: Counter, l_count: Counter, w_l_count: Counter, num_texts: int) → DataFrame[source]

Computes positive reweighted pointwise mutual information between tokens and labels, following the implementation by [1].

[1] Alan Ramponi and Sara Tonelli. 2022. Features or Spurious Artifacts? Data-centric Baselines for Fair and Robust Hate Speech Detection. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Parameters:

w_count (Counter[str:int]) – Token counts over the whole dataset (i.e., {“token1”: 123, “token2”: 20, …})
l_count (Counter[str:int]) – Label counts over the whole dataset (i.e., {“label1”: 42, “label2”: 21, …})
w_l_count (Counter[(str,str):int]) – Token and label counts over the whole dataset (i.e., {(“token1”, “label1”): 12, …})
num_texts (int) – Total number of texts in the dataset

Returns:

Pandas dataframe with tokens as rows and classes as columns (namely, label_of_interest and “other”). Values in this matrix are PMI scores.

Return type:

pd.core.frame.DataFrame
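
To show how the three counters and the text total combine, here is a minimal sketch of plain positive PMI. It deliberately omits the reweighting used in [1], returns a dict instead of a DataFrame, and uses a base-2 logarithm; all three choices are assumptions of this sketch rather than a description of the library. The `positive_pmi` name is hypothetical.

```python
import math
from collections import Counter


def positive_pmi(w_count, l_count, w_l_count, num_texts):
    """Return {(token, label): max(0, PMI)} estimated from raw counts."""
    scores = {}
    for (token, label), joint in w_l_count.items():
        p_joint = joint / num_texts            # p(token, label)
        p_token = w_count[token] / num_texts   # p(token)
        p_label = l_count[label] / num_texts   # p(label)
        pmi = math.log2(p_joint / (p_token * p_label))
        scores[(token, label)] = max(0.0, pmi)  # clip negative values to 0
    return scores


w_count = Counter({"foo": 2, "bar": 2})
l_count = Counter({"hate": 2, "other": 2})
w_l_count = Counter({("foo", "hate"): 2, ("bar", "hate"): 1, ("bar", "other"): 1})
scores = positive_pmi(w_count, l_count, w_l_count, num_texts=4)
```

In this toy dataset “foo” appears only in texts labeled “hate”, so it receives a positive score, while “bar” is spread evenly across both labels and scores 0.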

variationist.metrics.lexical_artifacts.get_counts(texts: ~typing.List[str], curr_label: str, label_of_interest: str, tokenizer: ~transformers.models.auto.tokenization_auto.AutoTokenizer, tokenizer_type: str, stopwords: str = 'en') -> (<class 'collections.Counter'>, <class 'collections.Counter'>, <class 'collections.Counter'>)[source]

A function that calculates relevant counts about a specific label after tokenizing the text according to a given pretrained tokenizer.

Parameters:

texts (List[str]) – Input texts belonging to a specific label “curr_label”
curr_label (str) – Label whose examples will be counted and to which “texts” belong
label_of_interest (str) – Label that is the focus of the artifacts calculation
tokenizer (AutoTokenizer) – HuggingFace’s pretrained tokenizer to use
tokenizer_type (str) – Name of the pretrained tokenizer according to HuggingFace (e.g., “bert-base-uncased”)
stopwords (str) – Language of the stopwords to be removed from lexical artifacts. Default: en (English). If None, all stopwords are retained in the list of lexical artifacts. For now, only “en” is supported (with a default stopword list); more languages are planned for upcoming releases.

Returns:

token_counter (Counter) – Token counts for the given label “curr_label”
label_counter (Counter) – Label counts for the given label “curr_label”
token_label_counter (Counter) – Token and label counts for the given label “curr_label”
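
The shape of the three returned counters can be sketched as follows. This is only an illustration under several assumptions: lowercased whitespace splitting replaces the HuggingFace tokenizer, each token is counted once per text, and no stopword filtering is applied. The `count_for_label` helper is hypothetical.

```python
from collections import Counter


def count_for_label(texts, curr_label):
    """Build token, label and (token, label) counters for texts of one label."""
    token_counter = Counter()
    label_counter = Counter()
    token_label_counter = Counter()
    for text in texts:
        label_counter[curr_label] += 1
        # set() counts a token at most once per text (assumption of this sketch).
        for token in set(text.lower().split()):
            token_counter[token] += 1
            token_label_counter[(token, curr_label)] += 1
    return token_counter, label_counter, token_label_counter


tokens, labels, token_labels = count_for_label(["a b a", "b c"], "x")
```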

variationist.metrics.lexical_artifacts.normalize_pmi(pmi_scores: DataFrame) → DataFrame[source]

Normalizes a dataframe of PMI scores to [0,1], following the implementation by [1].

[1] Alan Ramponi and Sara Tonelli. 2022. Features or Spurious Artifacts? Data-centric Baselines for Fair and Robust Hate Speech Detection. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Parameters:

pmi_scores (pd.core.frame.DataFrame) – Pandas dataframe with tokens as rows and classes as columns (namely, label_of_interest and “other”). Values in this matrix are PMI scores.

Returns:

pmi_normalized – Normalized pandas dataframe with tokens as rows and classes as columns (namely, label_of_interest and “other”). Values in this matrix are normalized PMI scores.

Return type:

pd.core.frame.DataFrame

variationist.metrics.lexical_variation module

variationist.metrics.lexical_variation.lttr(label_values_dict, subsets_of_interest, args)[source]

Calculates Log Type Token Ratio.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

values_dict – A dictionary with the mean LTTR score for each subset and its standard deviation.

Return type:

Dict

variationist.metrics.lexical_variation.maas(label_values_dict, subsets_of_interest, args)[source]

Calculates Maas’s index (Maas, 1972).

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

values_dict – A dictionary with the mean Maas index score for each subset and its standard deviation.

Return type:

Dict

variationist.metrics.lexical_variation.rttr(label_values_dict, subsets_of_interest, args)[source]

Calculates Root Type Token Ratio.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

values_dict – A dictionary with the mean RTTR score for each subset and its standard deviation.

Return type:

Dict

variationist.metrics.lexical_variation.safe_divide(numerator, denominator)[source]: Utility function to avoid zero division errors.

variationist.metrics.lexical_variation.ttr(label_values_dict, subsets_of_interest, args)[source]

Calculates Type Token Ratio.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

values_dict – A dictionary with the mean TTR score for each subset and its standard deviation.

Return type:

Dict
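
The four indices in this module are commonly defined in terms of the number of tokens N and the number of types V of a text. The sketch below uses those textbook definitions for a single tokenized text: TTR = V/N, RTTR = V/√N, LTTR = log V / log N, and Maas = (log N − log V) / (log N)². The natural logarithm, the `divide_or_zero` helper (which returns 0.0 on a zero denominator) and the `lexical_variation` name are assumptions of this sketch. The documented functions report a mean and a standard deviation per subset, which suggests that per-text scores are aggregated; that step is omitted here.

```python
import math


def divide_or_zero(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def lexical_variation(tokens):
    """Return TTR, RTTR, LTTR and the Maas index for one tokenized text."""
    n = len(tokens)        # number of tokens (N)
    v = len(set(tokens))   # number of types (V)
    log_n = math.log(n) if n > 0 else 0.0
    log_v = math.log(v) if v > 0 else 0.0
    return {
        "ttr": divide_or_zero(v, n),
        "rttr": divide_or_zero(v, math.sqrt(n)),
        "lttr": divide_or_zero(log_v, log_n),
        "maas": divide_or_zero(log_n - log_v, log_n ** 2),
    }


scores = lexical_variation(["a", "b", "a", "b"])
```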

variationist.metrics.metrics module

class variationist.metrics.metrics.Metric(metric: str | Callable[[dict, dict], dict], args)[source]

Bases: object

The Metric class, a generic class that carries out all the metric operations.

Parameters:

metric (Union[str, Callable[[dict, dict], dict]]) – A metric’s name (if chosen among the ones natively supported by Variationist), or a callable that takes label_values_dict and subsets_of_interest as arguments and returns a dict, i.e., a Callable[[dict, dict], dict].
args (InspectorArgs) – The arguments selected by the user.

calculate_metric(label_values_dict, subsets_of_interest)[source]

Calls the appropriate metric function.

Parameters:

label_values_dict (dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.

Returns:

A dict with the results of the calculated metric function.

Return type:

dict
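
A custom metric only needs to follow the Callable[[dict, dict], dict] shape described above. The sketch below defines such a callable and runs it on toy data. The flat layout of `subsets_of_interest` (subset name mapped to a list of token lists) is an assumption made for illustration; the nesting used by the library may differ. The `mean_token_length` function is hypothetical.

```python
def mean_token_length(label_values_dict, subsets_of_interest):
    """Hypothetical custom metric: average number of characters per token."""
    results = {}
    for subset_name, tokenized_texts in subsets_of_interest.items():
        tokens = [token for text in tokenized_texts for token in text]
        total_chars = sum(len(token) for token in tokens)
        results[subset_name] = total_chars / len(tokens) if tokens else 0.0
    return results


# Toy inputs with an assumed flat structure.
label_values = {"label": ["a", "b"]}
subsets = {"a": [["ab", "cd"], ["efgh"]], "b": []}
result = mean_token_length(label_values, subsets)
```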

variationist.metrics.pmi module

variationist.metrics.pmi.class_relevance_normalized_weighted(label_values_dict, subsets_of_interest, args)[source]

Function to calculate a PMI-based class relevance metric, obtained by normalizing the normalized weighted PMI values by subset.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

output_pmi – A dictionary with the normalized weighted class relevance metric for each token in each subset of interest.

Return type:

Dict

variationist.metrics.pmi.class_relevance_positive_normalized(label_values_dict, subsets_of_interest, args)[source]

Function to calculate a PMI-based class relevance metric, obtained by normalizing the positive normalized PMI values by subset.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

output_pmi – A dictionary with the positive normalized class relevance metric for each token in each subset of interest.

Return type:

Dict

variationist.metrics.pmi.class_relevance_positive_normalized_weighted(label_values_dict, subsets_of_interest, args)[source]

Function to calculate a PMI-based class relevance metric, obtained by normalizing the positive normalized weighted PMI values by subset.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

output_pmi – A dictionary with the positive normalized weighted class relevance metric for each token in each subset of interest.

Return type:

Dict

variationist.metrics.pmi.create_pmi_dictionary(label_values_dict, subsets_of_interest, weighted, freq_cutoff)[source]: Creates a dictionary of PMI values for each label.

variationist.metrics.pmi.get_total(freqs_merged_dict)[source]: Function to add up the frequency of tokens across labels.

variationist.metrics.pmi.pmi(label_values_dict, subsets_of_interest, args)[source]

Function to calculate PMI.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

output_pmi – A dictionary with the PMI for each token in each subset of interest.

Return type:

Dict

variationist.metrics.pmi.pmi_lexical_artifacts(label_values_dict, subsets_of_interest, args)[source]

Function to calculate a PMI-based class relevance metric as illustrated in Ramponi and Tonelli (2022).

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

lexical_artifacts_dict – A dictionary with the associated lexical-artifacts scores for each token in each subset.

Return type:

Dict

variationist.metrics.pmi.pmi_normalized(label_values_dict, subsets_of_interest, args)[source]

Function to calculate normalized PMI.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

output_pmi – A dictionary with the normalized PMI for each token in each subset of interest.

Return type:

Dict
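
For orientation, normalized PMI is commonly defined as PMI divided by the negative log of the joint probability, which bounds the score between -1 and 1. The sketch below implements that common definition from probabilities; whether the library uses exactly this formula is an assumption, and the `npmi` helper is hypothetical.

```python
import math


def npmi(p_joint, p_token, p_label):
    """Normalized PMI: log(p(w,l) / (p(w) p(l))) / -log(p(w,l))."""
    if not 0.0 < p_joint < 1.0:
        raise ValueError("p_joint must lie strictly between 0 and 1")
    pmi = math.log(p_joint / (p_token * p_label))
    return pmi / -math.log(p_joint)


# A token that occurs only with one label, and that label only with the token.
always_together = npmi(0.25, 0.25, 0.25)
# A token whose occurrence is independent of the label.
independent = npmi(0.25, 0.5, 0.5)
```

Perfect co-occurrence yields a score of 1, while statistical independence yields 0.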

variationist.metrics.pmi.pmi_normalized_weighted(label_values_dict, subsets_of_interest, args)[source]

Function to calculate normalized weighted PMI.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

output_pmi – A dictionary with the normalized weighted PMI for each token in each subset of interest.

Return type:

Dict

variationist.metrics.pmi.pmi_positive(label_values_dict, subsets_of_interest, args)[source]

Function to calculate positive PMI (negative values are set to 0).

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

output_pmi – A dictionary with the positive PMI for each token in each subset of interest.

Return type:

Dict

variationist.metrics.pmi.pmi_positive_normalized(label_values_dict, subsets_of_interest, args)[source]

Function to calculate positive normalized PMI (negative values are set to 0 and all values are normalized between 0 and 1).

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

output_pmi – A dictionary with the positive normalized PMI for each token in each subset of interest.

Return type:

Dict

variationist.metrics.pmi.pmi_positive_normalized_weighted(label_values_dict, subsets_of_interest, args)[source]

Function to calculate positive normalized weighted PMI.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

output_pmi – A dictionary with the positive normalized weighted PMI for each token in each subset of interest.

Return type:

Dict

variationist.metrics.pmi.pmi_positive_weighted(label_values_dict, subsets_of_interest, args)[source]

Function to calculate positive weighted PMI.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

output_pmi – A dictionary with the positive weighted PMI for each token in each subset of interest.

Return type:

Dict

variationist.metrics.pmi.pmi_weighted(label_values_dict, subsets_of_interest, args)[source]

Function to calculate weighted PMI.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
subsets_of_interest (Dict) – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user.
args (InspectorArgs) – The arguments selected by the user.

Returns:

output_pmi – A dictionary with the weighted PMI for each token in each subset of interest.

Return type:

Dict

variationist.metrics.pmi.safe_divide(numerator, denominator)[source]: Utility function to avoid zero division errors.

variationist.metrics.pmi.take(n, iterable)[source]: Return the first n items of the iterable as a list.

variationist.metrics.shared_metrics module

variationist.metrics.shared_metrics.get_all_frequencies(pandas_series)[source]: Returns all token frequencies inside a pandas Series.

variationist.metrics.utils module

Module contents