Class OrderedTokenAndAbbreviationsMatcher

java.lang.Object
eu.openaire.common.author.OrderedTokenAndAbbreviationsMatcher

public class OrderedTokenAndAbbreviationsMatcher extends Object
Utility class for comparing author names using token-based matching and abbreviation handling.

This class provides methods to tokenize author names and to compare the resulting tokens, treating an abbreviated token as a possible match for its full form. It is useful for identifying name variations where parts of the full name might be reordered or abbreviated.

  • Field Details

    • SPLIT_REGEX

      public static final Pattern SPLIT_REGEX
      Regular expression pattern used to split names into tokens.

      The pattern matches spaces, punctuation symbols, and dashes, ensuring that names are split into meaningful components.
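
      The actual pattern is not reproduced on this page. As a minimal sketch of this kind of splitting, the example below uses a hypothetical pattern (whitespace, ASCII punctuation, and Unicode dashes); the class name NameTokenizerSketch, the field SPLIT, and the method tokenize are illustrative assumptions, not part of this API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class NameTokenizerSketch {
    // Assumed stand-in for SPLIT_REGEX: one or more whitespace,
    // ASCII punctuation, or Unicode dash characters.
    static final Pattern SPLIT = Pattern.compile("[\\s\\p{Punct}\\p{Pd}]+");

    // Splits a name into its non-empty tokens, preserving their order.
    static List<String> tokenize(String name) {
        return Arrays.stream(SPLIT.split(name.trim()))
                .filter(token -> !token.isEmpty()) // drop an empty leading token, if any
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [Smith, John, Paul, A]
        System.out.println(tokenize("Smith, John-Paul A."));
    }
}
```

      Treating a run of delimiters as a single separator keeps sequences such as ", " from producing empty tokens between name parts.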

    • NUM_TOKEN_MAX_DIFF

      public static int NUM_TOKEN_MAX_DIFF
      Maximum allowed difference in the number of tokens between two names for them to be comparable.
  • Constructor Details

    • OrderedTokenAndAbbreviationsMatcher

      public OrderedTokenAndAbbreviationsMatcher()
  • Method Details

    • compare

      public static Optional<Double> compare(String a1, String a2)
      Compares two author names using token-based matching and abbreviation handling.

      The comparison follows these rules:

      • Both names must have at least two tokens to be comparable.
      • The token counts of the two names must not differ by more than NUM_TOKEN_MAX_DIFF.
      • Matching considers both full-token matches and abbreviation-based matches.

      The method returns an Optional containing a confidence score if a match is found, or an empty Optional if no match is identified.
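
      The rules above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the class OrderedMatchSketch, its delimiter pattern, the maxTokenDiff value of 1, case-insensitive comparison, the single-letter abbreviation rule, and the in-order matching strategy (suggested by the class name, but not confirmed by this page) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

final class OrderedMatchSketch {
    // Assumed delimiter set; the real SPLIT_REGEX is not shown on this page.
    static final Pattern SPLIT = Pattern.compile("[\\s\\p{Punct}\\p{Pd}]+");
    // Assumed value; the real NUM_TOKEN_MAX_DIFF is not shown on this page.
    static int maxTokenDiff = 1;

    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        for (String t : SPLIT.split(name)) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase()); // assumption: case-insensitive
        }
        return tokens;
    }

    // Two tokens match if they are equal, or if one is a single letter
    // that is the initial of the other (assumed abbreviation rule).
    static boolean tokensMatch(String x, String y) {
        if (x.equals(y)) return true;
        String shorter = x.length() <= y.length() ? x : y;
        String longer = x.length() <= y.length() ? y : x;
        return shorter.length() == 1 && longer.startsWith(shorter);
    }

    static Optional<Double> compare(String a1, String a2) {
        List<String> t1 = tokenize(a1);
        List<String> t2 = tokenize(a2);
        // Rule 1: both names need at least two tokens.
        if (t1.size() < 2 || t2.size() < 2) return Optional.empty();
        // Rule 2: token counts may differ by at most maxTokenDiff.
        if (Math.abs(t1.size() - t2.size()) > maxTokenDiff) return Optional.empty();
        // Rule 3: every token of the shorter name must match, in order,
        // some token of the longer name (full or abbreviated).
        List<String> shorter = t1.size() <= t2.size() ? t1 : t2;
        List<String> longer = t1.size() <= t2.size() ? t2 : t1;
        int i = 0;
        for (int j = 0; j < longer.size() && i < shorter.size(); j++) {
            if (tokensMatch(shorter.get(i), longer.get(j))) i++;
        }
        return i == shorter.size() ? Optional.of(1.0) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(compare("John A. Smith", "J. Smith")); // prints Optional[1.0]
        System.out.println(compare("John Smith", "Jane Smith"));  // prints Optional.empty
    }
}
```

      In this sketch, names that fail the token-count checks yield an empty Optional, the same result as names that are comparable but do not match.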

      Parameters:
      a1 - The first author name.
      a2 - The second author name.
      Returns:
      An Optional<Double> with a confidence score (1.0 if a match is found), or empty if no match.