Class OrderedTokenAndAbbreviationsMatcher
java.lang.Object
eu.openaire.common.author.OrderedTokenAndAbbreviationsMatcher
Utility class for comparing author names using token-based matching and abbreviation handling.
This class provides methods to tokenize author names and compare them based on abbreviation recognition. It is useful for identifying name variations where part of the full name might be reordered or abbreviated.
Field Summary
Fields
Modifier and Type       Field                 Description
static int              NUM_TOKEN_MAX_DIFF    Maximum allowed difference in the number of tokens between two names for them to be comparable.
static final Pattern    SPLIT_REGEX           Regular expression pattern used to split names into tokens.
Constructor Summary
Constructors
OrderedTokenAndAbbreviationsMatcher()
Method Summary
-
Field Details
SPLIT_REGEX
Regular expression pattern used to split names into tokens. The pattern matches spaces, punctuation symbols, and dashes, ensuring that names are split into meaningful components.
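The splitting behaviour described above can be illustrated with a minimal sketch. The actual value of SPLIT_REGEX is not shown on this page, so the pattern below (ASSUMED_SPLIT) and the helper class TokenSplitSketch are hypothetical stand-ins chosen only to match the description "spaces, punctuation symbols, and dashes".

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TokenSplitSketch {
    // Assumption: NOT the library's SPLIT_REGEX. Splits on runs of whitespace,
    // ASCII punctuation (\p{Punct}) and Unicode dash characters (\p{Pd}).
    static final Pattern ASSUMED_SPLIT = Pattern.compile("[\\s\\p{Punct}\\p{Pd}]+");

    // Splits a name into tokens and drops empty strings, which can appear
    // when the name starts with a separator.
    static List<String> tokenize(String name) {
        return Arrays.stream(ASSUMED_SPLIT.split(name))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Smith, J.-P.")); // [Smith, J, P]
    }
}
```

Under this assumed pattern an apostrophe also acts as a separator, so a name such as "O'Neil" would yield two tokens; whether the real SPLIT_REGEX behaves the same way is not stated here.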
NUM_TOKEN_MAX_DIFF
public static int NUM_TOKEN_MAX_DIFF
Maximum allowed difference in the number of tokens between two names for them to be comparable.
Constructor Details
OrderedTokenAndAbbreviationsMatcher
public OrderedTokenAndAbbreviationsMatcher()
Method Details
compare
Compares two author names using token-based matching and abbreviation handling.
The comparison follows these rules:
- Both names must have at least two tokens to be comparable.
- The number of tokens between the two names should not differ by more than NUM_TOKEN_MAX_DIFF.
- Matching considers both full-token matches and abbreviation-based matches.
The method returns an Optional containing a confidence score if a match is found, or an empty Optional if no match is identified.
Parameters:
a1 - The first author name.
a2 - The second author name.
Returns:
An Optional<Double> with a confidence score (1.0 if a match is found), or empty if no match.
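The rules listed for compare can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the class CompareSketch, the split pattern, the value 1 for the maximum token difference, the single-letter abbreviation rule, and the in-order matching strategy are all hypothetical choices made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class CompareSketch {
    // Assumptions: stand-ins for SPLIT_REGEX and NUM_TOKEN_MAX_DIFF, whose
    // real values are not given on this page.
    static final Pattern ASSUMED_SPLIT = Pattern.compile("[\\s\\p{Punct}\\p{Pd}]+");
    static final int ASSUMED_MAX_DIFF = 1;

    static List<String> tokenize(String name) {
        return Arrays.stream(ASSUMED_SPLIT.split(name))
                .filter(t -> !t.isEmpty())
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // Two tokens match if they are equal, or if one is a single letter
    // that is the initial of the other (assumed abbreviation rule).
    static boolean tokensMatch(String a, String b) {
        if (a.equals(b)) return true;
        if (a.length() == 1 && b.startsWith(a)) return true;
        return b.length() == 1 && a.startsWith(b);
    }

    static Optional<Double> compareSketch(String a1, String a2) {
        List<String> t1 = tokenize(a1);
        List<String> t2 = tokenize(a2);
        // Rule 1: both names need at least two tokens.
        if (t1.size() < 2 || t2.size() < 2) return Optional.empty();
        // Rule 2: token counts may differ by at most the configured maximum.
        if (Math.abs(t1.size() - t2.size()) > ASSUMED_MAX_DIFF) return Optional.empty();
        // Rule 3 (assumed strategy): every token of the shorter name must
        // match, in order, a token of the longer name.
        List<String> shorter = t1.size() <= t2.size() ? t1 : t2;
        List<String> longer = shorter == t1 ? t2 : t1;
        int j = 0;
        for (String s : shorter) {
            while (j < longer.size() && !tokensMatch(s, longer.get(j))) j++;
            if (j == longer.size()) return Optional.empty();
            j++;
        }
        return Optional.of(1.0);
    }

    public static void main(String[] args) {
        System.out.println(compareSketch("John Smith", "J. Smith").isPresent());   // true
        System.out.println(compareSketch("John Smith", "Jane Smith").isPresent()); // false
    }
}
```

A caller would typically branch on the returned Optional, for example with isPresent() or ifPresent(...), rather than comparing the score against a threshold, since the documented score is 1.0 whenever a match is found.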