Inspector & InspectorArgs

The Inspector class handles all the operations of Variationist.

class variationist.inspector.Inspector(dataset: Dataset | DataFrame | str | None = None, args: InspectorArgs = InspectorArgs(text_names=None, var_names=None, metrics=None, var_types=None, var_semantics=None, var_subsets=None, var_bins=None, tokenizer='whitespace', language=None, n_tokens=1, n_cooc=1, unique_cooc=False, cooc_window_size=0, freq_cutoff=3, stopwords=False, custom_stopwords=None, lowercase=False, ignore_null_var=False))

The Inspector class. It takes care of orchestrating the analysis, from importing and tokenizing the data to calculating the metrics and creating an output file with all the calculated metrics for each text column, variable, and combination thereof.

Parameters:
  • dataset (datasets.Dataset or pandas.DataFrame or str) – The dataset to be used for the analysis. It can be a pre-loaded pandas DataFrame or Hugging Face Dataset, or a string indicating a filepath to a .tsv or .csv file. Hugging Face datasets can also be imported by name, using a string in the following format: ‘hf::DATASET_NAME’.

  • args (InspectorArgs) – The Inspector arguments. Refer to the InspectorArgs class for details on what these should be.
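
The string forms accepted by dataset can be told apart by their prefix or file extension. The following is a minimal sketch of that naming convention only, not Variationist's own loading code; classify_dataset_source and its return labels are hypothetical names introduced for illustration.

```python
from pathlib import Path

HF_PREFIX = "hf::"  # prefix documented above for Hugging Face datasets


def classify_dataset_source(source: str) -> tuple[str, str]:
    """Return a (kind, value) pair for a dataset string (hypothetical helper)."""
    if source.startswith(HF_PREFIX):
        # Everything after the prefix is taken to be the dataset name.
        return ("huggingface", source[len(HF_PREFIX):])
    suffix = Path(source).suffix.lower()
    if suffix in {".tsv", ".csv"}:
        return (suffix.lstrip("."), source)
    raise ValueError(f"Unrecognized dataset string: {source!r}")


print(classify_dataset_source("hf::my_dataset"))
print(classify_dataset_source("data/reviews.tsv"))
```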

check_columns()

A function to check that the specified text and variable columns are actually in the provided dataset.

check_nan_values()

Checks whether the specified variable columns contain NaN values and reports an error if any are found.
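
The two checks above can be pictured as follows. This is a sketch under simplifying assumptions (rows held as plain dictionaries rather than a dataframe), and missing_columns and columns_with_nulls are hypothetical helpers, not the library's methods.

```python
import math


def missing_columns(available, text_names, var_names):
    """Return requested column names that are absent from the dataset."""
    return [name for name in [*text_names, *var_names] if name not in available]


def columns_with_nulls(rows, var_names):
    """Return variable columns holding at least one None or NaN value."""
    def is_null(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return [name for name in var_names
            if any(is_null(row.get(name)) for row in rows)]


rows = [
    {"text": "good movie", "label": "pos", "year": 2020.0},
    {"text": "bad movie", "label": None, "year": float("nan")},
]
print(missing_columns(rows[0].keys(), ["text"], ["label", "rating"]))
print(columns_with_nulls(rows, ["label", "year"]))
```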

compute()

Main function carrying out the entire analysis pipeline. It creates a results dict with the calculated metrics.

create_output_dict()

Function to create the output dictionary, containing both metadata and calculated metrics.

handle_bins_and_granularity()

For each variable that requires binning, checks that it can be carried out and calls the dedicated function.

inspect()

Wrapper function for tokenizing, carrying out computation, and saving the output dictionary, which it returns.

preprocess()

Performs all of the preprocessing operations of Variationist, such as grouping together variables and dividing variables into bins.

save_output_to_json(output_path='output.json')

Saves the output dictionary to a json file, which can then be imported with the Visualizer module.

class variationist.inspector.InspectorArgs(text_names: List | None = None, var_names: List | None = None, metrics: List | None = None, var_types: List | None = None, var_semantics: List | None = None, var_subsets: List | None = None, var_bins: List | None = None, tokenizer: str | Callable | None = 'whitespace', language: str | None = None, n_tokens: int | None = 1, n_cooc: int | None = 1, unique_cooc: bool | None = False, cooc_window_size: int | None = 0, freq_cutoff: int | None = 3, stopwords: bool | None = False, custom_stopwords: str | list | None = None, lowercase: bool | None = False, ignore_null_var: bool | None = False)

A dataclass to store all of the arguments that relate to the analysis.

Parameters:
  • text_names (List[str]) – The list of names of text columns in the given dataset to use for the analysis.

  • var_names (List[str]) – The list of variable names to use for the analysis. Each string in var_names should correspond to a dataset column.

  • var_types (List[str]) – The list of variable types corresponding to the variables in var_names. Should match the length of var_names. Available choices are nominal (default), ordinal, quantitative, and coordinates. These are mostly used for binning and visualization.

  • var_semantics (List[str]) – The list of variable semantics corresponding to the variables in var_names. Should match the length of var_names. Available choices are general (default), temporal, and spatial. These are mostly used for binning and visualization.

  • var_bins (List[int]) – The list of indices for variables that should be split into bins for the analysis. Works with quantitative variables, dates and timestamps. Will default to 0 for each specified variable, indicating 0 bins.

  • tokenizer (str or Callable, optional, defaults to whitespace) – The tokenizer used to preprocess the data. Will default to whitespace tokenization if not specified. Alternatively, it can be a string in the format “hf::tokenizer_name” for loading a HuggingFace tokenizer. A custom function can also be passed for tokenization. It should take as input an array of texts (assumed to be a Pandas Series) and the InspectorArgs. It should return the same array but tokenized. Check out our example notebooks for examples.

  • language (str) – The language of the text in the dataset. Used for proper tokenization and stopword removal.

  • metrics (List[str or Callable], optional) – The list of metrics that should be calculated. Each entry can be the name of one of the metrics natively implemented by Variationist or a custom callable function.

  • n_tokens (int) – The number of tokens that should be considered for the analysis. 1 corresponds to unigrams, 2 corresponds to bigrams, and so on.

  • n_cooc (int) – The number of tokens used for calculating non-consecutive co-occurrences. For example, n=2 means we consider as the base units for our analysis any pair of tokens that co-occur in the same sentence. n=3 means we consider triplets of tokens, etc. Defaults to n=1, meaning no co-occurrences are taken into consideration, and we only consider n-grams.

  • unique_cooc (bool) – Whether to consider unique co-occurrences or not. Defaults to False (keep duplicate tokens). If True, multiple occurrences of the same token in a text will be discarded. By design, this does not affect the co-occurrence window size (the window size considers the original number of tokens and therefore the original allowed maximum distance between tokens).

  • cooc_window_size (int) – Size of the context window for co-occurrences. For instance, a cooc_window_size of 3 means we use a context window of 3 to calculate co-occurrences, meaning that any token that is within 3 tokens before or after a given token is added as a co-occurrence.

  • freq_cutoff (int) – The token frequency, expressed as an integer, below which we do not consider the token in the analysis of pmi-based metrics. Defaults to 3.

  • stopwords (bool) – Whether to remove stopwords from texts before tokenization or not (using default lists in a given language). Will default to False.

  • custom_stopwords (str or List, optional, defaults to None) – A list of stopwords (or a path to a file containing stopwords, one per line) to be removed before tokenization. If stopwords is True, these stopwords will be added to that list. Will default to None.

  • lowercase (bool) – Whether to lowercase all the texts before tokenization or not. Will default to False.

  • ignore_null_var (bool) – Whether to proceed when null values are present for variables. Defaults to False, as this behavior can have unpredictable results. Set to True to treat “NaN” as any other variable value.
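
To make the roles of n_tokens, cooc_window_size, and freq_cutoff more concrete, the sketch below builds consecutive n-grams, collects token pairs that fall within a context window, and drops items below a frequency threshold. It illustrates the concepts only and is not Variationist's implementation; the helper names, the pair representation, and the lowercase whitespace tokenization are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return consecutive n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def windowed_pairs(tokens, window):
    """Return pairs of tokens whose positions are at most `window` apart."""
    pairs = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


def apply_freq_cutoff(counts, cutoff):
    """Keep only items whose frequency is at least `cutoff`."""
    return {item: count for item, count in counts.items() if count >= cutoff}


texts = ["The cat sat on the mat", "the cat ate"]
tokenized = [text.lower().split() for text in texts]  # lowercase + whitespace

print(ngrams(tokenized[0], 2))          # bigrams of the first text
print(windowed_pairs(tokenized[0], 2))  # pairs within 2 tokens of each other

counts = Counter(token for tokens in tokenized for token in tokens)
print(apply_freq_cutoff(counts, 2))     # tokens seen at least twice
```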

check_values()

Checks the values in text_names, var_names and metrics.

to_dict()

Returns the InspectorArgs values inside a dictionary.
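
The dataclass-plus-dictionary pattern described for InspectorArgs can be sketched with the standard library alone. ArgsSketch below is a hypothetical stand-in carrying only a few of the documented fields, and the specific validation rules in its check_values() are assumptions rather than the library's actual checks.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ArgsSketch:
    """Hypothetical stand-in with a few of the documented fields."""
    text_names: Optional[List[str]] = None
    var_names: Optional[List[str]] = None
    n_tokens: int = 1
    lowercase: bool = False

    def check_values(self):
        # Assumed rule: both lists must name at least one column.
        if not self.text_names:
            raise ValueError("text_names must list at least one text column")
        if not self.var_names:
            raise ValueError("var_names must list at least one variable")

    def to_dict(self):
        # asdict() converts the dataclass fields into a plain dictionary.
        return asdict(self)


args = ArgsSketch(text_names=["text"], var_names=["label"], lowercase=True)
args.check_values()
print(args.to_dict())
```

A dictionary of this kind is convenient for storing the analysis settings as metadata alongside the results.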

Visualizer & VisualizerArgs

class variationist.visualizer.Visualizer(input_json: str | dict, args: VisualizerArgs)

A class for the visualization component. It orchestrates the creation of charts based on the results and metadata from a prior analysis using Variationist.

Parameters:
  • input_json (str or dict) – A path to the json file or a json/dict object storing metadata and results from a prior analysis using Variationist.

  • args (VisualizerArgs) – A VisualizerArgs object containing the arguments for the Visualizer

create() → dict[str, list[altair.vegalite.v5.api.Chart]]

A function that orchestrates the creation of charts based on the results and metadata from a prior analysis using Variationist, returning a dictionary of metrics (keys) and an associated list of alt.Chart objects (values).

Returns:

charts – A dictionary containing the metrics as keys and a list of chart objects as values.

Return type:

dict[str, list[alt.Chart]]

get_charts_metadata(metric: str) → dict[str, Any]

A function that returns a dictionary containing information on which charts to create and how to create them, given the var_types and var_semantics metadata of the prior analysis.

Parameters:

metric (str) – The metric associated to the “df_data” dataframe and thus to the charts.

Returns:

charts_metadata – A dictionary containing the chart types and information on how to create them.

Return type:

dict[str, Any]

get_df_from_json(json_data: dict[str, Any], var_names_concat: str, top_per_class_ngrams: int, focus_ngrams: list[str] | None = None) → DataFrame

A function that returns a long-form dataframe from a json which stores the information about a prior analysis using Variationist. Optionally, it takes a list of n-grams to focus the filtering on.

Parameters:
  • json_data (dict[str, Any]) – The json object storing the results from a prior analysis in the form: {varA: {ngram1: value1, ngram2: value2, …}, varB: {…}, …}. Note that varA, varB, etc. could also take the form of “::”-concatenated variable names if multiple variables are present in the analysis.

  • var_names_concat (str) – A string denoting the ordered concatenation of variable names (i.e., original column names), separated by utils.MULTI_VAR_SEP, to be used for giving meaningful names to the long-form dataframe.

  • top_per_class_ngrams (int = 20) – The maximum number of highest scoring per-class n-grams to show (for bar charts only). If set to None, all the n-grams in the corpus are shown (which can easily be overwhelming). Defaults to 20 to keep the visualization compact.

  • focus_ngrams (list[str], optional, defaults to None) – A list of n-grams of interest to focus the filtering on. N-grams should match the number of tokens used in the prior computation (e.g., if unigrams were chosen, this list should only contain unigrams).

Returns:

df_data – A long-form dataframe storing the results of a prior analysis.

Return type:

pd.core.frame.DataFrame
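
The nested-to-long-form conversion described for get_df_from_json() can be pictured with plain dictionaries. The sketch below returns a list of row dictionaries instead of a dataframe; to_long_form, the "ngram" and "value" column names, and the order in which the focus filter and the top-k cut are applied are all assumptions made for illustration.

```python
def to_long_form(json_data, var_names_concat, top_per_class=20, focus_ngrams=None):
    """Flatten {class: {ngram: value}} into a list of row dictionaries."""
    rows = []
    for var_value, scores in json_data.items():
        items = list(scores.items())
        if focus_ngrams is not None:
            # Keep only the n-grams of interest.
            items = [(ngram, value) for ngram, value in items if ngram in focus_ngrams]
        # Highest scoring n-grams first, then keep the top ones per class.
        ranked = sorted(items, key=lambda pair: pair[1], reverse=True)
        if top_per_class is not None:
            ranked = ranked[:top_per_class]
        for ngram, value in ranked:
            rows.append({var_names_concat: var_value, "ngram": ngram, "value": value})
    return rows


results = {
    "pos": {"great": 0.9, "fine": 0.4, "movie": 0.1},
    "neg": {"awful": 0.8, "movie": 0.2},
}
for row in to_long_form(results, "label", top_per_class=2):
    print(row)
```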

get_stats_df_from_json(json_data: dict[str, Any], var_names_concat: str) → DataFrame

A function that returns a long-form dataframe from a json which stores the information about a prior analysis using Variationist. This is a variant of get_df_from_json() that handles basic stats.

Parameters:
  • json_data (dict[str, Any]) – The json object storing the results from a prior analysis in the form: {substatA: {colnameA: {varA: value1, …}, …}, substatB: {colnameA: {varA: {“mean”: value, “stdev”: value}, …}, …}, …}. Note that varA, varB, etc. could also take the form of “::”-concatenated variable names if multiple variables are present in the analysis.

  • var_names_concat (str) – A string denoting the ordered concatenation of variable names (i.e., original column names), separated by utils.MULTI_VAR_SEP, to be used for giving meaningful names to the long-form dataframe.

Returns:

df_data – A long-form dataframe storing the results of a prior analysis.

Return type:

pd.core.frame.DataFrame
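
The stats structure has one more nesting level than the metric results, and its leaves are either plain values or small dictionaries such as {"mean": …, "stdev": …}. A minimal sketch of flattening it into rows follows; stats_to_long_form, the output column names, and the stat names in the sample data are hypothetical.

```python
def stats_to_long_form(json_data, var_names_concat):
    """Flatten {substat: {colname: {var: value_or_dict}}} into row dictionaries."""
    rows = []
    for substat, per_column in json_data.items():
        for text_name, per_var in per_column.items():
            for var_value, value in per_var.items():
                row = {"stat": substat, "text_name": text_name,
                       var_names_concat: var_value}
                if isinstance(value, dict):
                    row.update(value)  # e.g. separate "mean" and "stdev" fields
                else:
                    row["value"] = value
                rows.append(row)
    return rows


stats = {
    "num_texts": {"text": {"pos": 2, "neg": 1}},
    "avg_text_length": {"text": {"pos": {"mean": 3.5, "stdev": 0.5}}},
}
for row in stats_to_long_form(stats, "label"):
    print(row)
```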

class variationist.visualizer.VisualizerArgs(output_folder: str | None = None, output_formats: list[str] | None = ['html'], zoomable: bool | None = True, top_per_class_ngrams: int | None = 20, ngrams: list[str] | None = None, shapefile_path: str | None = None, shapefile_var_name: str | None = None)

A class storing the arguments for the visualization component.

Parameters:
  • output_folder (Optional[str] = None) – A path to the output folder in which to store the charts and associated metadata. If the folder does not exist, it will be automatically created. If no path is provided, the charts will not be serialized and output_formats will be ignored (in this case, the chart objects will only be accessible from the dictionary returned by the “create()” function and can be shown using the “show()” function).

  • output_formats (Optional[list[str]] = ["html"]) – A list of output formats for the charts. By default, only the interactive HTML chart is saved, i.e., [“html”]. Extra choices: [“pdf”, “svg”, “png”].

  • zoomable (Optional[bool] = True) – Whether the (HTML) chart should be zoomable using the mouse or not.

  • top_per_class_ngrams (int = 20) – The maximum number of highest scoring per-class n-grams to show (for bar charts only). If set to None, all the n-grams in the corpus are shown (which can easily be overwhelming). Defaults to 20 to keep the visualization compact. This parameter is ignored when creating other chart types.

  • ngrams (Optional[list[str]] = None) – A list of n-grams of interest to focus the resulting visualizations on. N-grams should match the number of tokens used in the prior computation reflected by the “results” variable (e.g., if unigrams were chosen, this list should only contain unigrams).

  • shapefile_path (Optional[str] = None) – A path to the .shp shapefile to be visualized as the background map of the chart (needed only when including a variable of type “nominal” with “spatial” semantics). Note that the auxiliary files of the .shp one (i.e., the .dbf, .prj, and .shx ones) are required for chart creation too, but do not need to be specified. They should have the same name as the .shp file but a different extension, and be located in the same folder as the .shp file itself. An example of a repository where shapefiles can be found is https://geodata.lib.berkeley.edu/, but there exist many others, as well as shapefiles provided by national/regional institutions.

  • shapefile_var_name (Optional[str] = None) – The name of the key field in the shapefile that contains the area names, which should match the possible values of the variable of interest (e.g., if the variable of interest is “state”, this should be the name of the shapefile field containing the possible states).
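
Since the auxiliary shapefile components must sit next to the .shp file, it can help to verify them before creating charts. The following sketch is not part of Variationist; missing_sidecars is a hypothetical helper, and the default extension list assumes the usual .dbf, .prj, and .shx companions.

```python
import tempfile
from pathlib import Path


def missing_sidecars(shapefile_path, extensions=(".dbf", ".prj", ".shx")):
    """Return the auxiliary files expected next to a .shp file that are absent."""
    shp = Path(shapefile_path)
    return [shp.with_suffix(ext) for ext in extensions
            if not shp.with_suffix(ext).exists()]


# Demo in a temporary folder: create every component except the .prj file.
with tempfile.TemporaryDirectory() as folder:
    shp = Path(folder) / "regions.shp"
    for ext in (".shp", ".dbf", ".shx"):
        shp.with_suffix(ext).touch()
    missing_names = [path.name for path in missing_sidecars(shp)]

print(missing_names)
```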