variationist.data package

Submodules

variationist.data.preprocess_utils module

variationist.data.preprocess_utils.convert_to_ngrams(token_list, n_tokens)[source]

Function for creating n-grams from tokens. Given a list of tokens and the number of tokens for the n-grams, it returns the same list, but with n-grams as units instead of single tokens. Used to create n-grams at the text level.

Parameters:

token_list (Iterable) – An array of tokens.
n_tokens (int) – The n to use for n-grams. E.g., a value of 2 will result in bi-grams.

Returns:

new_array – The same array, with n-grams instead of single tokens as units.

Return type:

Iterable
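
The sliding-window idea behind this function can be sketched in a few lines. This is only an illustration, not the library's implementation: the name ngrams_sketch is hypothetical, and joining the tokens of each n-gram with a single space is an assumption about the output format.

```python
from typing import Iterable, List


def ngrams_sketch(token_list: Iterable[str], n_tokens: int) -> List[str]:
    """Hypothetical helper: build n-grams by sliding a window over the tokens."""
    if n_tokens < 1:
        raise ValueError("n_tokens must be a positive integer")
    tokens = list(token_list)
    # One n-gram per window start; a text shorter than n_tokens yields no n-grams.
    # Joining with a space is an assumption, not confirmed library behavior.
    return [
        " ".join(tokens[i:i + n_tokens])
        for i in range(len(tokens) - n_tokens + 1)
    ]


print(ngrams_sketch(["the", "cat", "sat", "down"], 2))
# ['the cat', 'cat sat', 'sat down']
```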

variationist.data.preprocess_utils.create_tokenized_cooccurrences_column(tokenized_text_column, n_items, context_window, unique_cooc)[source]

Extracts co-occurrences from tokens, if the user requested them. Used to extract co-occurrences at the column level.

Parameters:

tokenized_text_column (pandas.Series) – A series containing the already tokenized texts.
n_items (int) – The number of co-occurring tokens we should consider. Corresponds to n_cooc set by the user in InspectorArgs.
context_window (int) – Size of the context window for co-occurrences, corresponding to cooc_window_size in InspectorArgs.
unique_cooc (bool) – A boolean for whether to consider unique co-occurrences. If True, multiple occurrences of the same token in a text will be discarded.

Returns:

text_column – The same tokenized series as input (the number of rows is unchanged), but with co-occurrences in place of the original tokens, so each sequence will generally be much longer.

Return type:

pandas.Series

variationist.data.preprocess_utils.create_tokenized_ngrams_column(tokenized_text_column, n_tokens)[source]

Function for creating n-grams from tokens. Given an already tokenized pandas Series of texts, it will return the same series, but with n-grams as units instead of single tokens. Used to create n-grams at the text column level.

Parameters:

tokenized_text_column (pandas.Series) – A series containing the already tokenized texts.
n_tokens (int) – The n to use for n-grams. E.g., a value of 2 will result in bi-grams.

Returns:

new_array – The same series, with n-grams instead of single tokens as units.

Return type:

Iterable

variationist.data.preprocess_utils.discretize_bins_col(dataframe_var_col, curr_var_bins)[source]

A function that will split a variable into bins, assigning new values to that variable based on how many bins were selected by the user with the var_bins parameter in InspectorArgs.

Parameters:

dataframe_var_col (pandas.Series) – A pandas Series, corresponding to the pandas Dataframe column containing the variable that should be divided into bins.
curr_var_bins (int) – The number of bins to divide the current variable into, as specified by the user using var_bins.

Returns:

discretized_var_col – The same Series as input, but with values split into bins.

Return type:

pandas.Series
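
As a rough illustration of splitting a numeric variable into a fixed number of bins, the sketch below uses pandas.cut with equal-width bins. Whether the library uses equal-width bins, quantiles, or another strategy is not stated here, so the binning choice and the name discretize_sketch are assumptions.

```python
import pandas as pd


def discretize_sketch(var_col: pd.Series, n_bins: int) -> pd.Series:
    """Hypothetical helper: equal-width binning via pandas.cut (an assumption)."""
    # labels=False returns the integer index of the bin each value falls into.
    return pd.cut(var_col, bins=n_bins, labels=False)


ages = pd.Series([1, 2, 3, 4, 5, 6])
print(discretize_sketch(ages, 3).tolist())
# [0, 0, 1, 1, 2, 2]
```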

variationist.data.preprocess_utils.extract_combinations(token_list, n_items, context_window, unique_cooc)[source]

Extracts co-occurrences from tokens, if the user requested them. Used to extract co-occurrences at the text level.

Parameters:

token_list (Iterable) – An array of tokens for the text, out of which to extract co-occurrences.
n_items (int) – The number of co-occurring tokens we should consider. Corresponds to n_cooc set by the user in InspectorArgs.
context_window (int) – Size of the context window for co-occurrences, corresponding to cooc_window_size in InspectorArgs.
unique_cooc (bool) – A boolean for whether to consider unique co-occurrences. If True, multiple occurrences of the same token in a text will be discarded.

Returns:

new_array – The new array of tokens, with co-occurrences as basic units rather than the original tokens.

Return type:

List
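
One possible reading of these parameters is sketched below. It is not the library's algorithm: the name cooccurrences_sketch is hypothetical, and it assumes that each token is combined with n_items - 1 tokens drawn from the context_window - 1 tokens that follow it, that unique_cooc removes repeated tokens from the text before combining, and that each co-occurrence is joined with a space.

```python
from itertools import combinations
from typing import Iterable, List


def cooccurrences_sketch(
    token_list: Iterable[str],
    n_items: int,
    context_window: int,
    unique_cooc: bool,
) -> List[str]:
    """Hypothetical helper: windowed co-occurrences under the stated assumptions."""
    tokens = list(token_list)
    if unique_cooc:
        # Assumption: keep only the first occurrence of each token, preserving order.
        tokens = list(dict.fromkeys(tokens))
    results = []
    for i, token in enumerate(tokens):
        # Tokens that follow the current one inside the context window.
        following = tokens[i + 1:i + context_window]
        for rest in combinations(following, n_items - 1):
            results.append(" ".join((token,) + rest))
    return results


print(cooccurrences_sketch(["a", "b", "c", "d"], 2, 3, False))
# ['a b', 'a c', 'b c', 'b d', 'c d']
```

Anchoring each combination on its first token means that a pair inside overlapping windows is emitted once per position rather than once per window.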

variationist.data.preprocess_utils.get_custom_stopword_list(custom_stopwords)[source]

Returns a list of stopwords read from a file (one stopword per line), or returns the given list itself.

Parameters:

custom_stopwords (str or List, optional) – A list of stopwords (or a path to a file containing stopwords, one per line) to be removed before tokenization. If stopwords is True, these stopwords will be added to that list. Will default to None.

Returns:

extra_stopwords – A list including the custom stopwords.

Return type:

List
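
The "path or list" behavior described above could look roughly like the following. The name load_stopwords_sketch is hypothetical, and the handling of blank lines, surrounding whitespace, and a None argument are assumptions rather than documented behavior.

```python
from pathlib import Path
from typing import List, Optional, Union


def load_stopwords_sketch(
    custom_stopwords: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Hypothetical helper: accept a list of stopwords or a path to a file."""
    if custom_stopwords is None:
        # Assumption: no custom stopwords means an empty list.
        return []
    if isinstance(custom_stopwords, str):
        # A string is treated as a file path with one stopword per line.
        lines = Path(custom_stopwords).read_text(encoding="utf-8").splitlines()
        # Assumption: strip whitespace and skip empty lines.
        return [line.strip() for line in lines if line.strip()]
    # Anything else is treated as an iterable of stopwords.
    return list(custom_stopwords)


print(load_stopwords_sketch(["foo", "bar"]))
# ['foo', 'bar']
```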

variationist.data.preprocess_utils.get_label_values(input_dataframe, col_names_dict)[source]

Returns a dictionary with all unique label values for the specified variables.

Parameters:

input_dataframe (pandas.DataFrame) – The dataset to be analyzed.
col_names_dict (Dict) – A dictionary containing the var_names provided by the user.

Returns:

label_values_dict – A dictionary containing all of the possible values each variable can take in the input dataset.

Return type:

Dict
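
Collecting the unique values of each variable column can be sketched with pandas as follows. The structure of col_names_dict is not described here, so this illustration takes a plain list of variable names instead; both that list and the name label_values_sketch are hypothetical.

```python
from typing import Dict, List

import pandas as pd


def label_values_sketch(input_dataframe: pd.DataFrame, var_names: List[str]) -> Dict[str, list]:
    """Hypothetical helper: map each variable name to its unique values."""
    # Series.unique() keeps values in order of first appearance.
    return {name: input_dataframe[name].unique().tolist() for name in var_names}


df = pd.DataFrame({"label": ["a", "b", "a"], "year": [2020, 2021, 2020]})
print(label_values_sketch(df, ["label", "year"]))
# {'label': ['a', 'b'], 'year': [2020, 2021]}
```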

variationist.data.preprocess_utils.get_subset_dict(input_dataframe, tok_columns_dict, label_values_dict)[source]

Creates a dictionary containing all the desired subsets of the dataset we will be analyzing.

Parameters:

input_dataframe (pandas.DataFrame) – The dataset to be analyzed.
tok_columns_dict (Dict) – A dictionary containing the names of the columns containing the tokenized specified text columns.
label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.

Returns:

subsets_of_interest – A dictionary containing a pandas Series with tokenized texts for each variable value specified by the user.

Return type:

Dict

variationist.data.preprocess_utils.get_subset_intersections(input_dataframe, tok_columns_dict, label_values_dict)[source]

Creates a dictionary containing all the desired subsets of the dataset we will be analyzing if we have intersections among different text or var columns.

Parameters:

input_dataframe (pandas.DataFrame) – The dataset to be analyzed.
tok_columns_dict (Dict) – A dictionary containing the names of the columns containing the tokenized specified text columns.
label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.

Returns:

subsets_of_interest – A dictionary containing a pandas Series with tokenized texts for each combination of the variables and text columns specified by the user, for the case of multiple text and variable columns.

Return type:

Dict

variationist.data.preprocess_utils.remove_elements(token_list, stopwords)[source]

Used for removing stopwords. Given a token array, it will return the same array excluding the elements in stopwords. Used to remove stopwords at the text level.

Parameters:

token_list (Iterable) – An array of tokens.
stopwords (Iterable) – Array of stopwords to be removed from token_list.

Returns:

new_array – The same array, with stopwords removed.

Return type:

Iterable
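
A minimal sketch of this filtering step is shown below. The name remove_elements_sketch is hypothetical, and exact, case-sensitive matching is an assumption about how tokens are compared with stopwords.

```python
from typing import Iterable, List


def remove_elements_sketch(token_list: Iterable[str], stopwords: Iterable[str]) -> List[str]:
    """Hypothetical helper: drop every token that appears in stopwords."""
    # A set gives constant-time membership checks for long stopword lists.
    stopword_set = set(stopwords)
    # Assumption: comparison is exact and case-sensitive.
    return [token for token in token_list if token not in stopword_set]


print(remove_elements_sketch(["the", "cat", "and", "the", "dog"], ["the", "and"]))
# ['cat', 'dog']
```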

variationist.data.preprocess_utils.remove_stopwords(text_column, language, custom_stopwords)[source]

Used for removing stopwords. Given an already tokenized pandas Series of texts, it will return the same series, excluding the elements in stopwords. Used to remove stopwords at the column level.

Parameters:

text_column (pandas.Series) – A series containing the already tokenized texts.
language (str) – The language we should retrieve stopwords for.
custom_stopwords (str or List, optional) – A list of stopwords (or a path to a file containing stopwords, one per line) to be removed before tokenization. If stopwords is True, these stopwords will be added to that list. Will default to None.

Returns:

text_column – The same tokenized series as input, with stopwords removed.

Return type:

pandas.Series

variationist.data.preprocess_utils.update_label_values_dict_with_inters(label_values_dict, text_names)[source]

Updates label_values_dict with the intersection names if we have more than 1 var_name or text_name.

Parameters:

label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
text_names (List) – The list of text column names.

Returns:

inters_label_values_dict – A dictionary containing all of the possible intersections of text columns and variables in the input dataset.

Return type:

Dict

variationist.data.tokenization module

The Tokenizer class, to handle all the tokenization-related operations of Variationist.

class variationist.data.tokenization.Tokenizer(inspector_args)[source]

Bases: object

A class that handles all the tokenization-related operations of Variationist.

Parameters:

inspector_args (InspectorArgs) – The arguments that were passed to the Inspector.

tokenize(dataframe)[source]

A wrapper function to tokenize each text column and add it to the original input dataframe as ‘tok_ORIGINAL_TEXT_COL_NAME’. Returns the dataframe with the added tokenized columns.

Parameters:

dataframe (pandas.DataFrame) – The dataframe that contains the data for the analysis.

Returns:

dataframe – The same dataframe as input, but with added columns containing the tokenized texts.

Return type:

pandas.DataFrame

tokenize_column(text_column: Series)[source]

A function that tokenizes a text column using the selected tokenization function. It will also create n-grams and co-occurrences if requested by the user. It will then return the same text column, but tokenized/grouped according to the desired result.

Parameters:

text_column (pandas.Series) – The series (text column) that should be tokenized.

Returns:

text_column – The same series as input, but tokenized/regrouped as requested.

Return type:

pandas.Series

variationist.data.tokenization_utils module

variationist.data.tokenization_utils.huggingface_tokenization(text_column: Series, args)[source]

Takes as input a series of texts and tokenizes it. Returns the same series, tokenized using the Hugging Face tokenizer specified in the InspectorArgs.

Parameters:

text_column (pandas.Series) – A pandas Series of text that should be tokenized.
args (InspectorArgs) – The InspectorArgs that were passed to Inspector.

Returns:

tok_column – A pandas Series containing the initial texts but tokenized.

Return type:

pandas.Series

variationist.data.tokenization_utils.whitespace_tokenization(text_column: Series, args)[source]

Takes as input an array/series of texts and tokenizes it. Returns the same array/series, tokenized by splitting on whitespace.

Parameters:

text_column (pandas.Series) – A pandas Series of text that should be tokenized.
args (InspectorArgs) – The InspectorArgs that were passed to Inspector.

Returns:

tok_column – A pandas Series containing the initial texts but tokenized.

Return type:

pandas.Series
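
Whitespace tokenization of a text column can be illustrated with pandas string methods. This sketch is not the library's code: the name whitespace_tokenize_sketch is hypothetical, it ignores the args parameter, and it applies no lowercasing or other normalization. Storing the result under a tok_ prefix mirrors the column naming described for Tokenizer.tokenize.

```python
import pandas as pd


def whitespace_tokenize_sketch(text_column: pd.Series) -> pd.Series:
    """Hypothetical helper: split each text on runs of whitespace."""
    # str.split() with no separator splits on any whitespace and drops empty strings.
    return text_column.str.split()


df = pd.DataFrame({"text": ["a quick  test", "hello world"]})
df["tok_text"] = whitespace_tokenize_sketch(df["text"])
print(df["tok_text"].tolist())
# [['a', 'quick', 'test'], ['hello', 'world']]
```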
