Library Documentation

Dedupe Objects

class dedupe.Dedupe(variable_definition, num_cores=None, in_memory=False, **kwargs)[source]

Class for active learning deduplication. Use deduplication when you have a single dataset that may contain multiple records that all refer to the same entity.

Parameters
  • variable_definition (Collection[VariableDefinition]) – A list of dictionaries describing the variables that will be used for training a model. See Variable Definitions

  • num_cores (int | None) – The number of cpus to use for parallel processing. If set to None, uses all cpus available on the machine. If set to 0, then multiprocessing will be disabled.

  • in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ‘:memory:’ option rather than writing to disk. May be faster if sufficient memory is available.

Warning

If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with a if __name__ == '__main__' in your main module, see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods
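
For example, a minimal sketch of that guard (the variable list here is illustrative):

import dedupe

variables = [{'field': 'name', 'type': 'String'}]

if __name__ == '__main__':
    # construct the matcher inside the guard so worker processes
    # spawned by multiprocessing do not re-execute this code
    deduper = dedupe.Dedupe(variables, num_cores=4)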

# initialize from a defined set of fields
variables = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'String', 'has missing':True},
    {'field' : 'Phone', 'type': 'String', 'has missing':True},
]
deduper = dedupe.Dedupe(variables)
prepare_training(data, training_file=None, sample_size=1500, blocked_proportion=0.9)[source]

Initialize the active learner with your data and, optionally, existing training data.

Parameters
  • data (Data) – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names

  • training_file (TextIO | None) – file object containing training data

  • sample_size (int) – Size of the sample to draw

  • blocked_proportion (float) – The proportion of record pairs to be sampled from similar records, as opposed to randomly selected pairs.

Examples

>>> matcher.prepare_training(data_d, sample_size=150000, blocked_proportion=0.5)
>>> with open('training_file.json') as f:
...     matcher.prepare_training(data_d, training_file=f)
uncertain_pairs()

Returns a list of pairs of records, drawn from the sample of record pairs, that Dedupe is most curious to have labeled.

This method is mainly useful for building a user interface for training a matching model.

Examples

>>> pair = matcher.uncertain_pairs()
>>> print(pair)
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]
mark_pairs(labeled_pairs)

Add user-labeled pairs of records to the training data and update the matching model.

This method is useful for building a user interface for training a matching model or for adding training data from an existing source.

Parameters

labeled_pairs (TrainingData) – A dictionary with two keys, match and distinct. The values are lists of pairs of records.

Examples

>>> labeled_examples = {
...     "match": [],
...     "distinct": [
...         (
...             {"name": "Georgie Porgie"},
...             {"name": "Georgette Porgette"},
...         )
...     ],
... }
>>> matcher.mark_pairs(labeled_examples)

Note

mark_pairs() is primarily designed to be used with uncertain_pairs() to incrementally build a training set.

If you have existing training data, you should format it appropriately and supply it to the prepare_training() method with the training_file argument.

If that is not possible or desirable, you can use mark_pairs() to train a matcher with existing data. However, you must ensure that every record that appears in the labeled_pairs argument also appears in the data or training file supplied to the prepare_training() method.
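
As a rough sketch of how uncertain_pairs() and mark_pairs() fit together in a custom labeling loop (get_user_label is a hypothetical function that asks your user whether two records match):

labels = {'match': [], 'distinct': []}
for record_pair in deduper.uncertain_pairs():
    # get_user_label is hypothetical: it should return True
    # if the user says the two records match
    if get_user_label(record_pair):
        labels['match'].append(record_pair)
    else:
        labels['distinct'].append(record_pair)
deduper.mark_pairs(labels)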

train(recall=1.0, index_predicates=True)

Learn final pairwise classifier and fingerprinting rules. Requires that adequate training data has already been provided.

Parameters
  • recall (float) –

    The proportion of true dupe pairs in our training data that the learned fingerprinting rules must cover. If we lower the recall, there will be pairs of true dupes that we will never directly compare.

    recall should be a float between 0.0 and 1.0.

  • index_predicates (bool) – Should dedupe consider predicates that rely upon indexing the data. Index predicates can be slower and take substantial memory. Without index predicates, you may get lower recall when true-dupes are not blocked together.
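
For example, a minimal sketch of training and then saving the learned settings (the recall value is illustrative; write_settings() is documented below):

deduper.train(recall=0.9)

with open('learned_settings', 'wb') as f:
    deduper.write_settings(f)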

write_training(file_obj)

Write a JSON file that contains labeled examples

Parameters

file_obj (TextIO) – file object to write training data to

Examples

>>> with open('training.json', 'w') as f:
...     matcher.write_training(f)
write_settings(file_obj)

Write a settings file containing the data model and predicates to a file object

Parameters

file_obj (BinaryIO) – file object to write settings data into

Examples

>>> with open('learned_settings', 'wb') as f:
...     matcher.write_settings(f)
cleanup_training()

Clean up data we used for training. Free up memory.

partition(data, threshold=0.5)

Identifies records that all refer to the same entity. Returns tuples containing a sequence of record ids and a corresponding sequence of confidence scores, each a float between 0 and 1. The record_ids within each set should refer to the same entity, and each confidence score is a measure of our confidence that the record belongs in the cluster.

For details on the confidence score, see dedupe.Dedupe.cluster().

This method should only be used for small to moderately sized datasets. For larger data, you may need to generate your own pairs of records and feed them to score().

Parameters
  • data – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names

  • threshold

    Number between 0 and 1. Records will only be put together into a cluster if the cophenetic similarity of the cluster is greater than the threshold.

    Lowering the number will increase recall; raising it will increase precision.

Examples

>>> duplicates = matcher.partition(data, threshold=0.5)
>>> duplicates
[
    ((1, 2, 3), (0.790, 0.860, 0.790)),
    ((4, 5), (0.720, 0.720)),
    ((10, 11), (0.899, 0.899)),
]

StaticDedupe Objects

class dedupe.StaticDedupe(settings_file, num_cores=None, in_memory=False, **kwargs)[source]

Class for deduplication using saved settings. If you have already trained a Dedupe object and saved the settings, you can load the saved settings with StaticDedupe.

Parameters
  • settings_file (BinaryIO) – A file object containing settings info produced from the write_settings() method.

  • num_cores (int | None) – The number of cpus to use for parallel processing. Defaults to the number of cpus available on the machine. If set to 0, then multiprocessing will be disabled.

  • in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ‘:memory:’ option rather than writing to disk. May be faster if sufficient memory is available.

Warning

If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with a if __name__ == '__main__' in your main module, see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

with open('learned_settings', 'rb') as f:
    matcher = StaticDedupe(f)
partition(data, threshold=0.5)

Identifies records that all refer to the same entity. Returns tuples containing a sequence of record ids and a corresponding sequence of confidence scores, each a float between 0 and 1. The record_ids within each set should refer to the same entity, and each confidence score is a measure of our confidence that the record belongs in the cluster.

For details on the confidence score, see dedupe.Dedupe.cluster().

This method should only be used for small to moderately sized datasets. For larger data, you may need to generate your own pairs of records and feed them to score().

Parameters
  • data – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names

  • threshold

    Number between 0 and 1. Records will only be put together into a cluster if the cophenetic similarity of the cluster is greater than the threshold.

    Lowering the number will increase recall; raising it will increase precision.

Examples

>>> duplicates = matcher.partition(data, threshold=0.5)
>>> duplicates
[
    ((1, 2, 3), (0.790, 0.860, 0.790)),
    ((4, 5), (0.720, 0.720)),
    ((10, 11), (0.899, 0.899)),
]

Gazetteer Objects

class dedupe.Gazetteer(variable_definition, num_cores=None, in_memory=False, **kwargs)[source]

Class for active learning gazetteer matching.

Gazetteer matching is for matching a messy data set against a ‘canonical dataset’. This class is useful for such tasks as matching messy addresses against a clean list.

Parameters
  • variable_definition (Collection[VariableDefinition]) – A list of dictionaries describing the variables that will be used for training a model. See Variable Definitions

  • num_cores (int | None) – The number of cpus to use for parallel processing. If set to None, uses all cpus available on the machine. If set to 0, then multiprocessing will be disabled.

  • in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ‘:memory:’ option rather than writing to disk. May be faster if sufficient memory is available.

Warning

If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with a if __name__ == '__main__' in your main module, see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

# initialize from a defined set of fields
variables = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'String', 'has missing':True},
    {'field' : 'Phone', 'type': 'String', 'has missing':True},
]
matcher = dedupe.Gazetteer(variables)
prepare_training(data_1, data_2, training_file=None, sample_size=1500, blocked_proportion=0.9)

Initialize the active learner with your data and, optionally, existing training data.

Parameters
  • data_1 (Data) – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names

  • data_2 (Data) – Dictionary of records from second dataset, same form as data_1

  • training_file (TextIO | None) – file object containing training data

  • sample_size (int) – The size of the sample to draw.

  • blocked_proportion (float) – The proportion of record pairs to be sampled from similar records, as opposed to randomly selected pairs.

Examples

>>> matcher.prepare_training(data_1, data_2, sample_size=150000)

or

>>> with open('training_file.json') as f:
...     matcher.prepare_training(data_1, data_2, training_file=f)
uncertain_pairs()

Returns a list of pairs of records, drawn from the sample of record pairs, that Dedupe is most curious to have labeled.

This method is mainly useful for building a user interface for training a matching model.

Examples

>>> pair = matcher.uncertain_pairs()
>>> print(pair)
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]
mark_pairs(labeled_pairs)

Add user-labeled pairs of records to the training data and update the matching model.

This method is useful for building a user interface for training a matching model or for adding training data from an existing source.

Parameters

labeled_pairs (TrainingData) – A dictionary with two keys, match and distinct. The values are lists of pairs of records.

Examples

>>> labeled_examples = {
...     "match": [],
...     "distinct": [
...         (
...             {"name": "Georgie Porgie"},
...             {"name": "Georgette Porgette"},
...         )
...     ],
... }
>>> matcher.mark_pairs(labeled_examples)

Note

mark_pairs() is primarily designed to be used with uncertain_pairs() to incrementally build a training set.

If you have existing training data, you should format it appropriately and supply it to the prepare_training() method with the training_file argument.

If that is not possible or desirable, you can use mark_pairs() to train a matcher with existing data. However, you must ensure that every record that appears in the labeled_pairs argument also appears in the data or training file supplied to the prepare_training() method.

train(recall=1.0, index_predicates=True)

Learn final pairwise classifier and fingerprinting rules. Requires that adequate training data has already been provided.

Parameters
  • recall (float) –

    The proportion of true dupe pairs in our training data that the learned fingerprinting rules must cover. If we lower the recall, there will be pairs of true dupes that we will never directly compare.

    recall should be a float between 0.0 and 1.0.

  • index_predicates (bool) – Should dedupe consider predicates that rely upon indexing the data. Index predicates can be slower and take substantial memory. Without index predicates, you may get lower recall when true-dupes are not blocked together.

write_training(file_obj)

Write a JSON file that contains labeled examples

Parameters

file_obj (TextIO) – file object to write training data to

Examples

>>> with open('training.json', 'w') as f:
...     matcher.write_training(f)
write_settings(file_obj)

Write a settings file containing the data model and predicates to a file object

Parameters

file_obj (BinaryIO) – file object to write settings data into

Examples

>>> with open('learned_settings', 'wb') as f:
...     matcher.write_settings(f)
cleanup_training()

Clean up data we used for training. Free up memory.

index(data)

Add records to the index of records to match against. If a record in canonical_data has the same key as a previously indexed record, the old record will be replaced.

Parameters

data – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names
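
A minimal sketch of typical usage, assuming canonical_data and messy_data are record dictionaries of the form described above and the matcher has already been trained (search() is documented below):

# index the canonical records once, up front
matcher.index(canonical_data)

# then match messy records against the indexed canonical records
results = matcher.search(messy_data, threshold=0.5, n_matches=2)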

unindex(data)

Remove records from the index of records to match against.

Parameters

data – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names

search(data, threshold=0.0, n_matches=1, generator=False)

Identifies pairs of records that could refer to the same entity. Returns, for each messy data record, a tuple of possible matches, with a confidence score for each match. The record_ids within each match pair a messy data record with a canonical record. The confidence score is the estimated probability that the records refer to the same entity.

Parameters
  • data – a dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.

  • threshold

    A number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.

    Lowering the number will increase recall; raising it will increase precision.

  • n_matches – the maximum number of possible matches from canonical_data to return for each record in data. If set to None, all possible matches above the threshold will be returned.

  • generator – when True, search() will generate a sequence of possible matches instead of a list.

Examples

>>> matches = gazetteer.search(data, threshold=0.5, n_matches=2)
>>> print(matches)
[
    (((1, 6), 0.72), ((1, 8), 0.6)),
    (((2, 7), 0.72),),
    (((3, 6), 0.72), ((3, 8), 0.65)),
    (((4, 6), 0.96), ((4, 5), 0.63)),
]

StaticGazetteer Objects

class dedupe.StaticGazetteer(settings_file, num_cores=None, in_memory=False, **kwargs)[source]

Class for gazetteer matching using saved settings.

If you have already trained a Gazetteer instance, you can load the saved settings with StaticGazetteer.

Parameters
  • settings_file (BinaryIO) – A file object containing settings info produced from the write_settings() method.

  • num_cores (int | None) – The number of cpus to use for parallel processing. Defaults to the number of cpus available on the machine. If set to 0, then multiprocessing will be disabled.

  • in_memory (bool) – If True, dedupe.Dedupe.pairs() will generate pairs in RAM with the sqlite3 ‘:memory:’ option rather than writing to disk. May be faster if sufficient memory is available.

Warning

If using multiprocessing on Windows or Mac OS X, then you must protect calls to the Dedupe methods with a if __name__ == '__main__' in your main module, see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods

with open('learned_settings', 'rb') as f:
    matcher = StaticGazetteer(f)
index(data)

Add records to the index of records to match against. If a record in canonical_data has the same key as a previously indexed record, the old record will be replaced.

Parameters

data – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names

unindex(data)

Remove records from the index of records to match against.

Parameters

data – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names

search(data, threshold=0.0, n_matches=1, generator=False)

Identifies pairs of records that could refer to the same entity, returns tuples containing tuples of possible matches, with a confidence score for each match. The record_ids within each tuple should refer to potential matches from a messy data record to canonical records. The confidence score is the estimated probability that the records refer to the same entity.

Parameters
  • data – a dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.

  • threshold

    A number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.

    Lowering the number will increase recall; raising it will increase precision.

  • n_matches – the maximum number of possible matches from canonical_data to return for each record in data. If set to None, all possible matches above the threshold will be returned.

  • generator – when True, search() will generate a sequence of possible matches instead of a list.

Examples

>>> matches = gazetteer.search(data, threshold=0.5, n_matches=2)
>>> print(matches)
[
    (((1, 6), 0.72), ((1, 8), 0.6)),
    (((2, 7), 0.72),),
    (((3, 6), 0.72), ((3, 8), 0.65)),
    (((4, 6), 0.96), ((4, 5), 0.63)),
]
blocks(data)

Yield groups of pairs of records that share fingerprints.

Each group contains one record from data paired with the indexed records that it shares a fingerprint with.

Each pair within and among blocks will occur at most once. If you override this method, you need to take care to ensure that this remains true, as downstream methods, particularly many_to_n(), assume that every pair of records is compared no more than once.

Parameters

data – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names

Examples

>>> blocks = matcher.blocks(data)
>>> print(list(blocks))
[
    [
        (
            (1, {"name": "Pat", "address": "123 Main"}),
            (8, {"name": "Pat", "address": "123 Main"}),
        ),
        (
            (1, {"name": "Pat", "address": "123 Main"}),
            (9, {"name": "Sam", "address": "123 Main"}),
        ),
    ],
    [
        (
            (2, {"name": "Sam", "address": "2600 State"}),
            (5, {"name": "Pam", "address": "2600 Stat"}),
        ),
        (
            (2, {"name": "Sam", "address": "123 State"}),
            (7, {"name": "Sammy", "address": "123 Main"}),
        ),
    ],
]
score(blocks)

Scores groups of pairs of records. Yields structured numpy arrays representing pairs of records in the group and the associated probability that the pair is a match.

Parameters

blocks (Union[Iterator[List[Tuple[Tuple[int, Mapping[str, Any]], Tuple[int, Mapping[str, Any]]]]], Iterator[List[Tuple[Tuple[str, Mapping[str, Any]], Tuple[str, Mapping[str, Any]]]]]]) – Iterator of blocks of records

many_to_n(score_blocks, threshold=0.0, n_matches=1)

For each group of scored pairs, yield the N highest scoring pairs.

Parameters
  • score_blocks (Iterable[Union[memmap, ndarray]]) – Iterator of numpy structured arrays, each with a dtype of [('pairs', id_type, 2), ('score', 'f4')] where id_type is either str or int, and score is a number between 0 and 1. The ‘pairs’ column contains pairs of ids of the records compared and the ‘score’ column contains the similarity score for that pair of records.

  • threshold (float) –

    Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.

    Lowering the number will increase recall; raising it will increase precision.

  • n_matches (int) – How many top scoring pairs to select per group
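
A rough sketch of how blocks(), score(), and many_to_n() chain together for larger data (messy_data is a placeholder; the canonical records are assumed to have been indexed already):

blocks = matcher.blocks(messy_data)
score_blocks = matcher.score(blocks)

# keep only the 2 best matches per messy record, above a threshold
for top_pairs in matcher.many_to_n(score_blocks, threshold=0.5, n_matches=2):
    print(top_pairs)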

Lower Level Classes and Methods

With the methods documented above, you can work with data into the millions of records. However, if you are working with larger data, you may not be able to load it all into memory. In that case, you'll need to interact with some of the lower level classes and methods.

See also

The PostgreSQL and MySQL examples use these lower level classes and methods.

Dedupe and StaticDedupe

class dedupe.Dedupe[source]
fingerprinter

Instance of the dedupe.blocking.Fingerprinter class if train() has been run, else None.

pairs(data)

Yield pairs of records that share common fingerprints.

Each pair will occur at most once. If you override this method, you need to take care to ensure that this remains true, as downstream methods, particularly cluster(), assume that every pair of records is compared no more than once.

Parameters

data (Union[Mapping[int, Mapping[str, Any]], Mapping[str, Mapping[str, Any]]]) – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names

Examples

>>> pairs = matcher.pairs(data)
>>> list(pairs)
[
    (
        (1, {"name": "Pat", "address": "123 Main"}),
        (2, {"name": "Pat", "address": "123 Main"}),
    ),
    (
        (1, {"name": "Pat", "address": "123 Main"}),
        (3, {"name": "Sam", "address": "123 Main"}),
    ),
]
score(pairs)

Scores pairs of records. Returns tuples of record id pairs and the associated probability that the pair of records is a match.

Parameters

pairs (Union[Iterator[Tuple[Tuple[int, Mapping[str, Any]], Tuple[int, Mapping[str, Any]]]], Iterator[Tuple[Tuple[str, Mapping[str, Any]], Tuple[str, Mapping[str, Any]]]]]) – Iterator of pairs of records

cluster(scores, threshold=0.5)

From the similarity scores of pairs of records, decide which groups of records are all referring to the same entity.

Yields tuples containing a sequence of record ids and a corresponding sequence of confidence scores, each a float between 0 and 1. The record_ids within each set should refer to the same entity, and each confidence score is a measure of our confidence that the record belongs in the cluster.

Each confidence score is a measure of how similar the record is to the other records in the cluster. Let \(\phi(i,j)\) be the pair-wise similarity between records \(i\) and \(j\). Let \(N\) be the number of records in the cluster.

\[\text{confidence score}_i = 1 - \sqrt {\frac{\sum_{j}^N (1 - \phi(i,j))^2}{N -1}}\]

This measure is similar to the average squared distance between the focal record and the other records in the cluster. These scores can be combined to give a total score for the cluster.

\[\text{cluster score} = 1 - \sqrt { \frac{\sum_i^N(1 - \mathrm{score}_i)^2 \cdot (N - 1) } { 2 N^2}}\]
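
For instance, a small numpy sketch of the per-record confidence score for a three-record cluster, using made-up pairwise similarities:

import numpy as np

# hypothetical pairwise similarities phi(i, j) between the focal
# record i and the other two records in its cluster
phi = np.array([0.9, 0.8])
N = 3
confidence_i = 1 - np.sqrt(((1 - phi) ** 2).sum() / (N - 1))
# confidence_i is about 0.84 here
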
Parameters
  • scores (Union[memmap, ndarray]) –

    a numpy structured array with a dtype of [('pairs', id_type, 2), ('score', 'f4')] where id_type is either str or int, and score is a number between 0 and 1. The ‘pairs’ column contains pairs of ids of the records compared and the ‘score’ column contains the similarity score for that pair of records.

    For each pair, the smaller id should be first.

  • threshold (float) –

    Number between 0 and 1. Records will only be put together into a cluster if the cophenetic similarity of the cluster is greater than the threshold.

    Lowering the number will increase recall; raising it will increase precision.

Examples

>>> pairs = matcher.pairs(data)
>>> scores = matcher.score(pairs)
>>> clusters = matcher.cluster(scores)
>>> list(clusters)
[
    ((1, 2, 3), (0.790, 0.860, 0.790)),
    ((4, 5), (0.720, 0.720)),
    ((10, 11), (0.899, 0.899)),
]
class dedupe.StaticDedupe[source]
fingerprinter

Instance of the dedupe.blocking.Fingerprinter class

pairs(data)

Same as dedupe.Dedupe.pairs()

score(pairs)

Same as dedupe.Dedupe.score()

cluster(scores, threshold=0.5)

Same as dedupe.Dedupe.cluster()
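
Putting these pieces together, a minimal sketch of the low-level pipeline with saved settings (the file name and data are placeholders):

import dedupe

with open('learned_settings', 'rb') as f:
    matcher = dedupe.StaticDedupe(f)

pairs = matcher.pairs(data)
scores = matcher.score(pairs)
clusters = matcher.cluster(scores, threshold=0.5)
for record_ids, confidences in clusters:
    print(record_ids, confidences)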

Gazetteer and StaticGazetteer

class dedupe.Gazetteer[source]
fingerprinter

Instance of the dedupe.blocking.Fingerprinter class if train() has been run, else None.

blocks(data)

Yield groups of pairs of records that share fingerprints.

Each group contains one record from data paired with the indexed records that it shares a fingerprint with.

Each pair within and among blocks will occur at most once. If you override this method, you need to take care to ensure that this remains true, as downstream methods, particularly many_to_n(), assume that every pair of records is compared no more than once.

Parameters

data – Dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names

Examples

>>> blocks = matcher.blocks(data)
>>> print(list(blocks))
[
    [
        (
            (1, {"name": "Pat", "address": "123 Main"}),
            (8, {"name": "Pat", "address": "123 Main"}),
        ),
        (
            (1, {"name": "Pat", "address": "123 Main"}),
            (9, {"name": "Sam", "address": "123 Main"}),
        ),
    ],
    [
        (
            (2, {"name": "Sam", "address": "2600 State"}),
            (5, {"name": "Pam", "address": "2600 Stat"}),
        ),
        (
            (2, {"name": "Sam", "address": "123 State"}),
            (7, {"name": "Sammy", "address": "123 Main"}),
        ),
    ],
]
score(blocks)

Scores groups of pairs of records. Yields structured numpy arrays representing pairs of records in the group and the associated probability that the pair is a match.

Parameters

blocks (Union[Iterator[List[Tuple[Tuple[int, Mapping[str, Any]], Tuple[int, Mapping[str, Any]]]]], Iterator[List[Tuple[Tuple[str, Mapping[str, Any]], Tuple[str, Mapping[str, Any]]]]]]) – Iterator of blocks of records

many_to_n(score_blocks, threshold=0.0, n_matches=1)

For each group of scored pairs, yield the N highest scoring pairs.

Parameters
  • score_blocks (Iterable[Union[memmap, ndarray]]) –

    Iterator of numpy structured arrays, each with a dtype of [('pairs', id_type, 2), ('score', 'f4')] where id_type is either str or int, and score is a number between 0 and 1. The ‘pairs’ column contains pairs of ids of the records compared and the ‘score’ column contains the similarity score for that pair of records.

  • threshold (float) –

    Number between 0 and 1. We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.

    Lowering the number will increase recall; raising it will increase precision.

  • n_matches (int) – How many top scoring pairs to select per group

class dedupe.StaticGazetteer
fingerprinter

Instance of the dedupe.blocking.Fingerprinter class

blocks(data)

Same as dedupe.Gazetteer.blocks()

score(blocks)

Same as dedupe.Gazetteer.score()

many_to_n(score_blocks, threshold=0.0, n_matches=1)

Same as dedupe.Gazetteer.many_to_n()

Fingerprinter Objects

class dedupe.blocking.Fingerprinter(predicates)[source]

Takes in a record and returns all blocks that record belongs to

__call__(records, target=False)[source]

Generate the predicates for records. Yields tuples of (predicate, record_id).

Parameters
  • records – A sequence of tuples of (record_id, record_dict). Can often be created by data_dict.items().

  • target

    Indicates whether the data should be treated as the target data. This affects the behavior of search predicates. If target is set to True, a search predicate will return the value itself. If target is set to False, the search predicate will return all possible values within the specified search distance.

    Let’s say we have a LevenshteinSearchPredicate with an associated distance of 1 on a "name" field, and a record like {"name": "thomas"}. If target is set to True, then the predicate will return "thomas". If target is set to False, then the blocker could return "thomas", "tomas", and "thoms". By using the target argument on one of your datasets, you will dramatically reduce the total number of comparisons without a loss of accuracy.

>>> data = [(1, {'name' : 'bob'}), (2, {'name' : 'suzanne'})]
>>> blocked_ids = deduper.fingerprinter(data)
>>> print(list(blocked_ids))
[('foo:1', 1), ..., ('bar:1', 100)]
index_fields: dict[str, IndexList]

A dictionary of all the fingerprinter methods that use an index of data field values. The keys are the field names, which can be useful to know for indexing the data.

index(docs, field)[source]

Add docs to the indices used by fingerprinters.

Some fingerprinter methods depend upon having an index of values that a field may have in the data. This method adds those values to the index. If you don’t have any fingerprinter methods that use an index, this method will do nothing.

Parameters
  • docs (Union[Iterable[str], Iterable[Iterable[str]]]) – an iterator of values from your data to index. While not required, it is recommended that docs be a unique set of those values. Indexing can be an expensive operation.

  • field (str) – fieldname or key associated with the values you are indexing
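
A sketch of how index_fields and index() are commonly combined, feeding each indexed field the values it needs from your data (data is a placeholder record dictionary):

for field in deduper.fingerprinter.index_fields:
    # a unique set of values is recommended, since indexing is expensive
    field_values = {record[field] for record in data.values()}
    deduper.fingerprinter.index(field_values, field)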

unindex(docs, field)[source]

Remove docs from indices used by fingerprinters

Parameters
  • docs (Union[Iterable[str], Iterable[Iterable[str]]]) – an iterator of values from your data to remove. While not required, it is recommended that docs be a unique set of those values. Indexing can be an expensive operation.

  • field (str) – fieldname or key associated with the values you are unindexing

reset_indices()[source]

Fingerprinter indices can take up a lot of memory. If you are done with blocking, this method will reset the indices to free up that memory. If you need to block again, the data will need to be re-indexed.

Convenience Functions

dedupe.console_label(deduper)[source]

Train a matcher instance (Dedupe, RecordLink, or Gazetteer) from the command line. Example:

>>> deduper = dedupe.Dedupe(variables)
>>> deduper.prepare_training(data)
>>> dedupe.console_label(deduper)
dedupe.training_data_dedupe(data, common_key, training_size=50000)[source]

Construct training data for consumption by the mark_pairs() method from an already deduplicated dataset.

Parameters
  • data – Dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field names

  • common_key – The name of the record field that uniquely identifies a match

  • training_size – the rough limit of the number of training examples, defaults to 50000

Note

Every match must be identified by the sharing of a common key. This function assumes that if two records do not share a common key then they are distinct records.
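
For example, a hedged sketch assuming each record carries a 'true_id' field that identifies known duplicates:

# 'true_id' is a hypothetical field identifying known duplicates
training_pairs = dedupe.training_data_dedupe(data, 'true_id')
deduper.prepare_training(data)
deduper.mark_pairs(training_pairs)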

dedupe.training_data_link(data_1, data_2, common_key, training_size=50000)[source]

Construct training data for consumption by the mark_pairs() method from already linked datasets.

Parameters
  • data_1 – Dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names

  • data_2 – Dictionary of records from second dataset, same form as data_1

  • common_key – The name of the record field that uniquely identifies a match

  • training_size – the rough limit of the number of training examples, defaults to 50000

Note

Every match must be identified by the sharing of a common key. This function assumes that if two records do not share a common key then they are distinct records.

dedupe.canonicalize(record_cluster)[source]

Constructs a canonical representation of a duplicate cluster by finding canonical values for each field

Parameters

record_cluster – A list of records within a duplicate cluster, where the records are dictionaries with field names as keys and field values as values
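
A small illustrative call, assuming a cluster of two address variants (the canonical values chosen depend on the library's internals):

cluster = [
    {'name': 'Georgie Porgie', 'address': '123 Main St'},
    {'name': 'Georgie Porgie', 'address': '123 Main Street'},
]
canonical = dedupe.canonicalize(cluster)
# canonical is a single dictionary with one representative value per field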

dedupe.read_training(training_file)[source]

Read training data from a previously built training data file object.

Parameters

training_file (TextIO) – file object containing the training data

Returns

A dictionary with two keys, match and distinct. See the inverse, write_training().
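
For example, reading back a file written by write_training():

>>> with open('training.json') as f:
...     training_pairs = dedupe.read_training(f)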

dedupe.write_training(labeled_pairs, file_obj)[source]

Write a JSON file that contains labeled examples

Parameters
  • labeled_pairs (TrainingData) – A dictionary with two keys, match and distinct. The values are lists that can contain pairs of records

  • file_obj (TextIO) – file object to write training data to

examples = {
    "match": [
         ({'name' : 'Georgie Porgie'}, {'name' : 'George Porgie'}),
    ],
    "distinct": [
        ({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'}),
    ],
}
with open('training.json', 'w') as f:
    dedupe.write_training(examples, f)