Merge array with PyYAML

YAML in data analysis

In the circles I work in, it is common to write settings for machine learning and data analysis in YAML (mostly through Kedro). Anchors (&) are used for shared settings to keep things as DRY (don't repeat yourself) as possible, but the YAML specification gets in the way: mappings can be merged, but arrays cannot. The YAML team does not seem to intend to support this in the specification (an issue was filed at https://github.com/yaml/yaml/issues/35, and you can see that similar requests are opened and then closed each time).

Specific example of trouble with YAML

Specifically, I run into trouble in situations like the following.

common_features: &common
  - member_reward_program_status
  - member_is_subscribing

transaction_features: &transaction
  - num_transactions
  - average_transaction_amount
  - time_since_last_transaction

next_product_to_buy:
  model_to_use: xgboost
  feature_whitelist:
    - *common
    - *transaction
    - last_product_bought
    - applied_to_campaign
  target: propensity

Imagine you have multiple groups of features and you want to combine them to build a model. What I want feature_whitelist to contain is the following flat list.

[
  'member_reward_program_status', 
  'member_is_subscribing', 
  'num_transactions', 
  'average_transaction_amount', 
  'time_since_last_transaction', 
  'last_product_bought', 
  'applied_to_campaign'
]

However, with the above settings, you will end up with a nested list like the one below.

[
  [
    'member_reward_program_status', 
    'member_is_subscribing', 
  ],
  [
    'num_transactions', 
    'average_transaction_amount', 
    'time_since_last_transaction', 
  ],
  'last_product_bought', 
  'applied_to_campaign'
]
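The nesting can be reproduced with a few lines of PyYAML. This is a minimal sketch on a trimmed-down version of the configuration above; the `CONFIG` string and variable names are only for illustration.

```python
import yaml

# A trimmed-down version of the configuration above.
CONFIG = """
common_features: &common
  - member_reward_program_status
  - member_is_subscribing

next_product_to_buy:
  feature_whitelist:
    - *common
    - last_product_bought
"""

params = yaml.safe_load(CONFIG)
whitelist = params["next_product_to_buy"]["feature_whitelist"]

# The alias is inserted as a single element, so the first item is itself a list.
print(whitelist)
```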

Other solutions

Anything that solves the above problem is fine. For example, you could flatten the nested list after loading it, or define the features as dictionaries instead of lists and merge them.

# Dictionary type example
feature_a: &feature_a
  age: 
feature_b: &feature_b
  price:
use_features:
  <<: *feature_a
  <<: *feature_b

The usage is as follows.

# > params['use_features'].keys()
dict_keys(['age', 'price'])
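The other alternative mentioned above, flattening the nested list on the Python side after loading, could look roughly like this. The `flatten` helper is a hypothetical name for illustration, not part of PyYAML or Kedro.

```python
from typing import Any, Iterator, List


def flatten(items: List[Any]) -> Iterator[Any]:
    """Yield the leaves of an arbitrarily nested list, depth first."""
    for item in items:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item


# The nested result that PyYAML produces for the example above.
nested = [
    ["member_reward_program_status", "member_is_subscribing"],
    ["num_transactions", "average_transaction_amount", "time_since_last_transaction"],
    "last_product_bought",
    "applied_to_campaign",
]

feature_whitelist = list(flatten(nested))
print(feature_whitelist)
```

The drawback is that every consumer of the configuration has to remember to call the helper, which is why handling it at load time is attractive.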

If you would rather solve it on the YAML side and are free to choose the package, you can also achieve this with ruamel.yaml, which is a fork of PyYAML.

Define a YAML tag

In my case, the background was that I wanted to use this to extend Kedro. Kedro uses anyconfig to load configuration for TemplatedConfigLoader, and anyconfig itself seems to support both PyYAML and ruamel.yaml, but Kedro specifies PyYAML as a requirement. So let's think about how to do it with PyYAML.

The official documentation also explains how to implement custom tags, so I referred to that and defined a constructor for the tag.

from typing import Iterator, List

import yaml


def construct_flat_list(loader: yaml.Loader, node: yaml.Node) -> List[str]:
    """Make a flat list, should be used with '!flatten'

    Args:
        loader: Unused, but necessary to pass to `yaml.add_constructor`
        node: The passed node to flatten
    """
    return list(flatten_sequence(node))


def flatten_sequence(sequence: yaml.Node) -> Iterator[str]:
    """Flatten a nested sequence to a list of strings
        A nested structure is always a SequenceNode
    """
    if isinstance(sequence, yaml.ScalarNode):
        yield sequence.value
        return
    if not isinstance(sequence, yaml.SequenceNode):
        raise TypeError(f"'!flatten' can only flatten sequence nodes, not {sequence}")
    for el in sequence.value:
        if isinstance(el, yaml.SequenceNode):
            yield from flatten_sequence(el)
        elif isinstance(el, yaml.ScalarNode):
            yield el.value
        else:
            raise TypeError(f"'!flatten' can only take scalar nodes, not {el}")


# Register the constructor only after it has been defined.
yaml.add_constructor("!flatten", construct_flat_list)
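To see what kind of node a constructor like this receives, the node graph can be inspected with `yaml.compose`, which stops before any Python objects are built. This is a minimal sketch; the `DOC` string and variable names are made up for illustration.

```python
import yaml

DOC = """
bread: &bread
  - toast
  - loafs
meal:
  - *bread
  - soup
"""

# yaml.compose builds the node graph only; no Python objects are constructed.
root = yaml.compose(DOC, Loader=yaml.SafeLoader)

# A MappingNode's value is a list of (key_node, value_node) pairs.
nodes = {key.value: value for key, value in root.value}
bread, meal = nodes["bread"], nodes["meal"]

print(type(meal).__name__)           # SequenceNode
print(type(meal.value[0]).__name__)  # SequenceNode: the alias *bread
print(type(meal.value[1]).__name__)  # ScalarNode: the plain string "soup"
print(meal.value[0] is bread)        # True: an alias points at the anchored node
```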

Before constructing Python objects, PyYAML first parses the YAML into a document made of PyYAML node objects. In that document, every array is stored as a yaml.SequenceNode and every value as a yaml.ScalarNode, so the flatten_sequence function above can recursively retrieve just the values. Note that ScalarNode.value is always the raw string, so scalars such as numbers also come back as strings. Tagging a nested array with !flatten converts it into a flat array. The test code that checks this behaviour is as follows.

import pytest
import yaml


def test_flatten_yaml():
    # single nest
    param_string = """
    bread: &bread
      - toast
      - loafs
    chicken: &chicken
      - *bread
    midnight_meal: !flatten
      - *chicken
      - *bread
    """
    params = yaml.load(param_string, Loader=yaml.FullLoader)
    assert sorted(params["midnight_meal"]) == sorted(
        ["toast", "loafs", "toast", "loafs"]
    )

    # double nested
    param_string = """
    bread: &bread
      - toast
      - loafs
    chicken: &chicken
      - *bread
    dinner: &dinner
      - *chicken
      - *bread
    midnight_meal_long:
      - *chicken
      - *bread
      - *dinner
    midnight_meal: !flatten
      - *chicken
      - *bread
      - *dinner
    """
    params = yaml.load(param_string, Loader=yaml.FullLoader)
    assert sorted(params["midnight_meal"]) == sorted(
        ["toast", "loafs", "toast", "loafs", "toast", "loafs", "toast", "loafs"]
    )

    # doesn't work with mappings
    param_string = """
    bread: &bread
      - toast
      - loafs
    chicken: &chicken
      meat: breast
    midnight_meal: !flatten
      - *chicken
      - *bread
    """
    with pytest.raises(TypeError):
        yaml.load(param_string, Loader=yaml.FullLoader)
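Applied to the feature_whitelist example from the beginning, the tag could be used as in the sketch below. It is self-contained, so it defines a compact variant of the constructor (the `_leaves` helper is a hypothetical name), and it assumes you want to register the tag on `yaml.SafeLoader` so that `yaml.safe_load` understands it, which differs from the default registration used above.

```python
from typing import Iterator, List

import yaml


def _leaves(node: yaml.Node) -> Iterator[str]:
    """Yield scalar values below a node, descending into nested sequences."""
    if isinstance(node, yaml.ScalarNode):
        yield node.value
    elif isinstance(node, yaml.SequenceNode):
        for child in node.value:
            yield from _leaves(child)
    else:
        raise TypeError(f"'!flatten' cannot handle {node.id} nodes")


def construct_flat_list(loader: yaml.SafeLoader, node: yaml.Node) -> List[str]:
    return list(_leaves(node))


# Assumption: register on SafeLoader so that yaml.safe_load understands the tag.
yaml.add_constructor("!flatten", construct_flat_list, Loader=yaml.SafeLoader)

CONFIG = """
common_features: &common
  - member_reward_program_status
  - member_is_subscribing

transaction_features: &transaction
  - num_transactions
  - average_transaction_amount
  - time_since_last_transaction

next_product_to_buy:
  model_to_use: xgboost
  feature_whitelist: !flatten
    - *common
    - *transaction
    - last_product_bought
    - applied_to_campaign
  target: propensity
"""

params = yaml.safe_load(CONFIG)
feature_whitelist = params["next_product_to_buy"]["feature_whitelist"]
print(feature_whitelist)
```

The only change to the original configuration is the `!flatten` tag on feature_whitelist; the anchors and aliases stay as they were.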

I hope this is useful as a reference.
