Library Documentation

Dedupe Objects

StaticDedupe Objects

Gazetteer Objects

StaticGazetteer Objects

Lower Level Classes and Methods

With the methods documented above, you can work with data into the millions of records. However, if are working with larger data you may not be able to load all your data into memory. You’ll need to interact with some of the lower level classes and methods.

See also

The PostgreSQL and MySQL examples use these lower level classes and methods.

Dedupe and StaticDedupe

class dedupe.Dedupe
fingerprinter

Instance of dedupe.blocking.Fingerprinter class if the train() has been run, else None.

class dedupe.StaticDedupe
fingerprinter

Instance of dedupe.blocking.Fingerprinter class

pairs(data)

Same as dedupe.Dedupe.pairs()

score(pairs)

Same as dedupe.Dedupe.score()

cluster(scores, threshold=0.5)

Same as dedupe.Dedupe.cluster()

Gazetteer and StaticGazetteer

class dedupe.Gazetteer
fingerprinter

Instance of dedupe.blocking.Fingerprinter class if the train() has been run, else None.

class dedupe.StaticGazeteer
fingerprinter

Instance of dedupe.blocking.Fingerprinter class

blocks(data)

Same as dedupe.Gazetteer.blocks()

score(blocks)

Same as dedupe.Gazetteer.score()

many_to_n(score_blocks, threshold=0.0, n_matches=1)

Same as dedupe.Gazetteer.many_to_n()

Fingerprinter Objects

class dedupe.blocking.Fingerprinter(predicates)[source]

Takes in a record and returns all blocks that record belongs to

__call__(records, target=False)[source]

Generate the predicates for records. Yields tuples of (predicate, record_id).

Parameters:
  • records (Iterable[Record]) – A sequence of tuples of (record_id, record_dict). Can often be created by data_dict.items().

  • target (bool) –

    Indicates whether the data should be treated as the target data. This effects the behavior of search predicates. If target is set to True, an search predicate will return the value itself. If target is set to False the search predicate will return all possible values within the specified search distance.

    Let’s say we have a LevenshteinSearchPredicate with an associated distance of 1 on a “name” field; and we have a record like {“name”: “thomas”}. If the target is set to True then the predicate will return “thomas”. If target is set to False, then the blocker could return “thomas”, “tomas”, and “thoms”. By using the target argument on one of your datasets, you will dramatically reduce the total number of comparisons without a loss of accuracy.

> data = [(1, {'name' : 'bob'}), (2, {'name' : 'suzanne'})]
> blocked_ids = deduper.fingerprinter(data)
> print list(blocked_ids)
[('foo:1', 1), ..., ('bar:1', 100)]
index_fields: dict[str, IndexList]

A dictionary of all the fingerprinter methods that use an index of data field values. The keys are the field names, which can be useful to know for indexing the data.

index(docs, field)[source]

Add docs to the indices used by fingerprinters.

Some fingerprinter methods depend upon having an index of values that a field may have in the data. This method adds those values to the index. If you don’t have any fingerprinter methods that use an index, this method will do nothing.

Parameters:
  • docs (Docs) – an iterator of values from your data to index. While not required, it is recommended that docs be a unique set of of those values. Indexing can be an expensive operation.

  • field (str) – fieldname or key associated with the values you are indexing

unindex(docs, field)[source]

Remove docs from indices used by fingerprinters

Parameters:
  • docs (Docs) – an iterator of values from your data to remove. While not required, it is recommended that docs be a unique set of of those values. Indexing can be an expensive operation.

  • field (str) – fieldname or key associated with the values you are unindexing

reset_indices()[source]

Fingeprinter indicdes can take up a lot of memory. If you are done with blocking, the method will reset the indices to free up. If you need to block again, the data will need to be re-indexed.

Convenience Functions