variationist.data package
Submodules
variationist.data.preprocess_utils module
- variationist.data.preprocess_utils.convert_to_ngrams(token_list, n_tokens)[source]
Function for creating n-grams from tokens. Given a list of tokens and the number of tokens for the n-grams, it returns the same list, but with n-grams as units instead of single tokens. Used to create n-grams at the text level.
- Parameters:
token_list (Iterable) – An array of tokens.
n_tokens (int) – The n to use for n-grams. E.g., a value of 2 will result in bi-grams.
- Returns:
new_array – The same array, with n-grams instead of single tokens as units.
- Return type:
Iterable
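The windowing idea behind n-gram creation can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name ngrams_sketch is hypothetical, and joining each n-gram into a single space-separated string is an assumption about the output format.

```python
from typing import Iterable, List


def ngrams_sketch(token_list: Iterable[str], n_tokens: int) -> List[str]:
    """Hypothetical helper: slide a window of n_tokens over the tokens."""
    if n_tokens < 1:
        raise ValueError("n_tokens must be a positive integer")
    tokens = list(token_list)
    # Assumption: each n-gram is stored as one space-joined string.
    return [
        " ".join(tokens[i:i + n_tokens])
        for i in range(len(tokens) - n_tokens + 1)
    ]


bigrams = ngrams_sketch(["the", "cat", "sat", "down"], 2)
print(bigrams)  # ['the cat', 'cat sat', 'sat down']
```

A text with fewer than n_tokens tokens yields an empty list in this sketch.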
- variationist.data.preprocess_utils.create_tokenized_cooccurrences_column(tokenized_text_column, n_items, context_window, unique_cooc)[source]
Extracts co-occurrences from tokens, if the user requested them. Used to extract co-occurrences at the column level.
- Parameters:
tokenized_text_column (pandas.Series) – A series containing the already tokenized texts.
n_items (int) – The number of co-occurring tokens we should consider. Corresponds to n_cooc set by the user in InspectorArgs.
context_window (int) – Size of the context window for co-occurrences, corresponding to cooc_window_size in InspectorArgs.
unique_cooc (bool) – A boolean for whether to consider unique co-occurrences. If True, multiple occurrences of the same token in a text will be discarded.
- Returns:
text_column – The same tokenized series as input (the series keeps its overall length), but with co-occurrences in lieu of the original tokens (so each individual sequence will be considerably longer).
- Return type:
pandas.Series
- variationist.data.preprocess_utils.create_tokenized_ngrams_column(tokenized_text_column, n_tokens)[source]
Function for creating n-grams from tokens. Given an already tokenized pandas Series of texts, it will return the same series, but with n-grams as units instead of single tokens. Used to create n-grams at the text column level.
- Parameters:
tokenized_text_column (pandas.Series) – A series containing the already tokenized texts.
n_tokens (int) – The n to use for n-grams. E.g., a value of 2 will result in bi-grams.
- Returns:
new_array – The same series, with n-grams instead of single tokens as units.
- Return type:
Iterable
- variationist.data.preprocess_utils.discretize_bins_col(dataframe_var_col, curr_var_bins)[source]
A function that will split a variable into bins, assigning new values to that variable based on how many bins were selected by the user with the var_bins parameter in InspectorArgs.
- Parameters:
dataframe_var_col (pandas.Series) – A pandas Series, corresponding to the pandas Dataframe column containing the variable that should be divided into bins.
curr_var_bins (int) – The number of bins to divide the current variable into, as specified by the user using var_bins.
- Returns:
discretized_var_col – The same Series as input, but with values split into bins.
- Return type:
pandas.Series
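One plausible way to discretize a numeric column is pandas.cut, shown below. This is a sketch under assumptions: the source does not say whether the library uses equal-width bins, quantiles, or another strategy, and the example data is invented for illustration.

```python
import pandas as pd

# Invented example data: a numeric variable to be split into bins.
ages = pd.Series([18, 25, 33, 47, 62, 80], name="age")
curr_var_bins = 3

# Assumption: equal-width bins. pandas.cut returns interval labels by default.
binned_intervals = pd.cut(ages, bins=curr_var_bins)

# With labels=False, pandas.cut returns the integer index of each bin instead.
binned_codes = pd.cut(ages, bins=curr_var_bins, labels=False)
print(binned_codes.tolist())  # [0, 0, 0, 1, 2, 2]
```

The result has the same length and index as the input, which matches the "same Series, but with values split into bins" description above.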
- variationist.data.preprocess_utils.extract_combinations(token_list, n_items, context_window, unique_cooc)[source]
Extracts co-occurrences from tokens, if the user requested them. Used to extract co-occurrences at the text level.
- Parameters:
token_list (Iterable) – An array of tokens for the text, out of which to extract co-occurrences.
n_items (int) – The number of co-occurring tokens we should consider. Corresponds to n_cooc set by the user in InspectorArgs.
context_window (int) – Size of the context window for co-occurrences, corresponding to cooc_window_size in InspectorArgs.
unique_cooc (bool) – A boolean for whether to consider unique co-occurrences. If True, multiple occurrences of the same token in a text will be discarded.
- Returns:
new_array – The new array of tokens, with co-occurrences as basic units rather than the original tokens.
- Return type:
List
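To make the roles of n_items, context_window and unique_cooc concrete, here is a minimal sketch of windowed co-occurrence extraction. It is not the library's implementation. The helper name cooccurrences_sketch is hypothetical, and three choices are assumptions: the window starts at each token and extends forward, each co-occurrence is stored as a space-joined string of sorted tokens, and unique_cooc deduplicates tokens before combinations are built.

```python
from itertools import combinations
from typing import Iterable, List


def cooccurrences_sketch(
    token_list: Iterable[str],
    n_items: int,
    context_window: int,
    unique_cooc: bool,
) -> List[str]:
    """Hypothetical helper: combine each token with its window neighbours."""
    tokens = list(token_list)
    if unique_cooc:
        # Keep only the first occurrence of each token, preserving order.
        tokens = list(dict.fromkeys(tokens))
    results = []
    for i, anchor in enumerate(tokens):
        # Tokens that follow the anchor inside a window of context_window tokens.
        neighbours = tokens[i + 1:i + context_window]
        for rest in combinations(neighbours, n_items - 1):
            # Assumption: sort the tokens so that order does not matter.
            results.append(" ".join(sorted((anchor, *rest))))
    return results


tokens = ["a", "b", "c", "a"]
print(cooccurrences_sketch(tokens, 2, 3, unique_cooc=False))
# ['a b', 'a c', 'b c', 'a b', 'a c']
print(cooccurrences_sketch(tokens, 2, 3, unique_cooc=True))
# ['a b', 'a c', 'b c']
```

Even on this four-token input the output is longer than the input when unique_cooc is False, which illustrates why co-occurrence sequences grow quickly with the window size.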
- variationist.data.preprocess_utils.get_custom_stopword_list(custom_stopwords)[source]
Returns a list of stopwords, either read from a file (one stopword per line) or taken from the list passed in.
- Parameters:
custom_stopwords (str or List, optional) – A list of stopwords (or a path to a file containing stopwords, one per line) to be removed before tokenization. If stopwords is True, these stopwords will be added to that list. Will default to None.
- Returns:
extra_stopwords – A list including the custom stopwords.
- Return type:
List
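The "path or list" behaviour described above could look roughly like the sketch below. The helper name load_stopwords_sketch is hypothetical; treating any string as a UTF-8 file path and skipping blank lines are assumptions, not documented behaviour.

```python
import os
import tempfile
from typing import List, Optional, Union


def load_stopwords_sketch(
    custom_stopwords: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Hypothetical helper: accept a file path or a ready-made list."""
    if custom_stopwords is None:
        return []
    if isinstance(custom_stopwords, str):
        # Assumption: a string is a path to a UTF-8 file, one stopword per line.
        with open(custom_stopwords, encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]
    # Anything else is treated as an iterable of stopwords and copied.
    return list(custom_stopwords)


# Usage with a temporary file, so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stopwords.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("the\nof\n\nand\n")
    from_file = load_stopwords_sketch(path)

from_list = load_stopwords_sketch(["the", "of"])
print(from_file)  # ['the', 'of', 'and']
print(from_list)  # ['the', 'of']
```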
- variationist.data.preprocess_utils.get_label_values(input_dataframe, col_names_dict)[source]
Returns a dictionary with all unique label values for the specified variables.
- Parameters:
input_dataframe (pandas.DataFrame) – The dataset to be analyzed.
col_names_dict (Dict) – A dictionary containing the var_names provided by the user.
- Returns:
label_values_dict – A dictionary containing all of the possible values each variable can take in the input dataset.
- Return type:
Dict
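Collecting the unique values of each variable can be sketched with plain pandas. The example below is illustrative only: the data is invented, and a simple list of column names (var_names) stands in for col_names_dict, whose exact structure is not documented here.

```python
import pandas as pd

# Invented example dataset with one text column and two variables.
df = pd.DataFrame({
    "text": ["hello there", "good morning", "hi all"],
    "region": ["north", "south", "north"],
    "register": ["informal", "formal", "informal"],
})

# Assumption: the variables of interest are given as a list of column names.
var_names = ["region", "register"]

# Series.unique() keeps values in order of first appearance.
label_values = {name: df[name].unique().tolist() for name in var_names}
print(label_values)
# {'region': ['north', 'south'], 'register': ['informal', 'formal']}
```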
- variationist.data.preprocess_utils.get_subset_dict(input_dataframe, tok_columns_dict, label_values_dict)[source]
Creates a dictionary containing all the desired subsets of the dataset we will be analyzing.
- Parameters:
input_dataframe (pandas.DataFrame) – The dataset to be analyzed.
tok_columns_dict (Dict) – A dictionary containing the names of the columns containing the tokenized specified text columns.
label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
- Returns:
subsets_of_interest – A dictionary containing a pandas series with tokenized texts for each variable value specified by the user.
- Return type:
Dict
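The subsetting step can be pictured as boolean-mask selection over the tokenized column, one subset per variable value. The sketch below is not the library's code: the data is invented, and keying the result by (variable, value) tuples is an assumption about the dictionary layout.

```python
import pandas as pd

# Invented example: one tokenized text column and one variable.
df = pd.DataFrame({
    "tok_text": [["hello", "there"], ["good", "morning"], ["hi", "all"]],
    "region": ["north", "south", "north"],
})
label_values = {"region": ["north", "south"]}

subsets = {}
for var_name, values in label_values.items():
    for value in values:
        # Select the tokenized texts whose variable equals this value.
        mask = df[var_name] == value
        # Assumption: (variable, value) tuples are used as dictionary keys.
        subsets[(var_name, value)] = df.loc[mask, "tok_text"]

print(subsets[("region", "north")].tolist())
# [['hello', 'there'], ['hi', 'all']]
```

Each value in the resulting dictionary is a pandas Series that keeps the original row index, so subsets can be traced back to the input dataframe.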
- variationist.data.preprocess_utils.get_subset_intersections(input_dataframe, tok_columns_dict, label_values_dict)[source]
Creates a dictionary containing all the desired subsets of the dataset we will be analyzing if we have intersections among different text or var columns.
- Parameters:
input_dataframe (pandas.DataFrame) – The dataset to be analyzed.
tok_columns_dict (Dict) – A dictionary containing the names of the columns containing the tokenized specified text columns.
label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
- Returns:
subsets_of_interest – A dictionary containing a pandas series with tokenized texts for each variable/text column combination out of the variables and text columns specified by the user in the case of multiple text and variable columns.
- Return type:
Dict
- variationist.data.preprocess_utils.remove_elements(token_list, stopwords)[source]
Removes stopwords at the text level. Given a token array, it returns the same array excluding the elements in stopwords.
- Parameters:
token_list (Iterable) – An array of tokens.
stopwords (Iterable) – Array of stopwords to be removed from token_list.
- Returns:
new_array – The same array, with stopwords removed.
- Return type:
Iterable
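Filtering a token list against a stopword collection is commonly done with a set lookup, as in the sketch below. The helper name remove_elements_sketch is hypothetical, and exact, case-sensitive matching is an assumption.

```python
from typing import Iterable, List


def remove_elements_sketch(token_list: Iterable[str], stopwords: Iterable[str]) -> List[str]:
    """Hypothetical helper: drop every token that appears in stopwords."""
    # A set gives constant-time membership checks.
    stopword_set = set(stopwords)
    # Assumption: matching is exact and case-sensitive.
    return [token for token in token_list if token not in stopword_set]


filtered = remove_elements_sketch(["the", "cat", "sat", "on", "the", "mat"], ["the", "on"])
print(filtered)  # ['cat', 'sat', 'mat']
```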
- variationist.data.preprocess_utils.remove_stopwords(text_column, language, custom_stopwords)[source]
Removes stopwords at the column level. Given an already tokenized pandas Series of texts, it returns the same series, excluding the elements in stopwords.
- Parameters:
text_column (pandas.Series) – A series containing the already tokenized texts.
language (str) – The language we should retrieve stopwords for.
custom_stopwords (str or List, optional) – A list of stopwords (or a path to a file containing stopwords, one per line) to be removed before tokenization. If stopwords is True, these stopwords will be added to that list. Will default to None.
- Returns:
text_column – The same tokenized series as input, with stopwords removed.
- Return type:
pandas.Series
- variationist.data.preprocess_utils.update_label_values_dict_with_inters(label_values_dict, text_names)[source]
Updates label_values_dict with the intersection names if we have more than 1 var_name or text_name.
- Parameters:
label_values_dict (Dict) – A dictionary containing all of the possible values each variable can take in the input dataset.
text_names (List) – The list of text column names.
- Returns:
inters_label_values_dict – A dictionary containing all of the possible intersections of text columns and variables in the input dataset.
- Return type:
Dict
variationist.data.tokenization module
The Tokenizer class, to handle all the tokenization-related operations of Variationist.
- class variationist.data.tokenization.Tokenizer(inspector_args)[source]
Bases: object
A class that handles all the tokenization-related operations of Variationist.
- Parameters:
inspector_args (InspectorArgs) – The arguments that were passed to the Inspector.
- tokenize(dataframe)[source]
A wrapper function to tokenize each text column and add it to the original input dataframe as ‘tok_ORIGINAL_TEXT_COL_NAME’. Returns the dataframe with the added tokenized columns.
- Parameters:
dataframe (pandas.DataFrame) – The dataframe that contains the data for the analysis.
- Returns:
dataframe – The same dataframe as input, but with added columns containing the tokenized texts.
- Return type:
pandas.DataFrame
- tokenize_column(text_column: Series)[source]
A function that tokenizes a text column using the selected tokenization function. It will also create n-grams and co-occurrences if requested by the user. It will then return the same text column, but tokenized/grouped according to the desired result.
- Parameters:
text_column (pandas.Series) – The series (text column) that should be tokenized.
- Returns:
text_column – The same series as input, but tokenized/regrouped as requested.
- Return type:
pandas.Series
variationist.data.tokenization_utils module
- variationist.data.tokenization_utils.huggingface_tokenization(text_column: Series, args)[source]
Takes a series of texts as input and returns the same series, tokenized with the Hugging Face tokenizer specified in the InspectorArgs.
- Parameters:
text_column (pandas.Series) – A pandas Series of text that should be tokenized.
args (InspectorArgs) – The InspectorArgs that were passed to Inspector.
- Returns:
tok_column – A pandas Series containing the initial texts but tokenized.
- Return type:
pandas.Series
- variationist.data.tokenization_utils.whitespace_tokenization(text_column: Series, args)[source]
Takes an array/series of texts as input and returns the same array/series, tokenized by splitting on whitespace.
- Parameters:
text_column (pandas.Series) – A pandas Series of text that should be tokenized.
args (InspectorArgs) – The InspectorArgs that were passed to Inspector.
- Returns:
tok_column – A pandas Series containing the initial texts but tokenized.
- Return type:
pandas.Series
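Whitespace tokenization of a text column can be sketched with the pandas string accessor. This shows the general technique only: whether the library applies extra steps driven by args (for example lowercasing) is not stated here, so none are assumed.

```python
import pandas as pd

# Invented example texts, including repeated spaces and a tab.
texts = pd.Series(["The quick  brown fox", "jumps over\tthe lazy dog"])

# With no pattern, Series.str.split() splits on runs of whitespace.
tokenized = texts.str.split()
print(tokenized.tolist())
# [['The', 'quick', 'brown', 'fox'], ['jumps', 'over', 'the', 'lazy', 'dog']]
```

The output is a Series of token lists with the same length and index as the input.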