Library Documentation
Dedupe Objects
StaticDedupe Objects
RecordLink Objects
StaticRecordLink Objects
Gazetteer Objects
StaticGazetteer Objects
Lower Level Classes and Methods
With the methods documented above, you can work with data into the millions of records. However, if you are working with larger data, you may not be able to load all of it into memory. In that case, you’ll need to interact with some of the lower level classes and methods.
See also
The PostgreSQL and MySQL examples use these lower level classes and methods.
Dedupe and StaticDedupe
- class dedupe.Dedupe
- fingerprinter
Instance of the dedupe.blocking.Fingerprinter class if train() has been run, else None.
- class dedupe.StaticDedupe
- fingerprinter
Instance of the dedupe.blocking.Fingerprinter class.
- pairs(data)
Same as dedupe.Dedupe.pairs().
- score(pairs)
Same as dedupe.Dedupe.score().
- cluster(scores, threshold=0.5)
Same as dedupe.Dedupe.cluster().
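Chained together, the three methods above form the usual lower-level pipeline. The following is a minimal sketch, assuming `deduper` is a trained `dedupe.Dedupe` or a loaded `dedupe.StaticDedupe` and `data` is a dictionary of records keyed by record ID. The helper name `cluster_records` is hypothetical and not part of the library.

```python
def cluster_records(deduper, data, threshold=0.5):
    """Run the pairs -> score -> cluster steps on an in-memory data dict.

    `deduper` is assumed to be a trained dedupe.Dedupe or a loaded
    dedupe.StaticDedupe; this helper is illustrative, not a library function.
    """
    # Candidate record pairs to compare.
    pairs = deduper.pairs(data)
    # Score each candidate pair.
    scores = deduper.score(pairs)
    # Group the scored pairs into clusters of likely duplicates.
    return deduper.cluster(scores, threshold=threshold)
```

When the data does not fit in memory, the `pairs` step is typically the one to replace: candidate pairs can be produced elsewhere (for example by a database query, as in the PostgreSQL and MySQL examples) and passed to `score` directly.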
RecordLink and StaticRecordLink
- class dedupe.RecordLink
- fingerprinter
Instance of the dedupe.blocking.Fingerprinter class if train() has been run, else None.
- class dedupe.StaticRecordLink
- fingerprinter
Instance of the dedupe.blocking.Fingerprinter class.
- pairs(data_1, data_2)
Same as dedupe.RecordLink.pairs().
- score(pairs)
Same as dedupe.RecordLink.score().
- one_to_one(scores, threshold=0.0)
Same as dedupe.RecordLink.one_to_one().
- many_to_one(scores, threshold=0.0)
Same as dedupe.RecordLink.many_to_one().
Gazetteer and StaticGazetteer
- class dedupe.Gazetteer
- fingerprinter
Instance of the dedupe.blocking.Fingerprinter class if train() has been run, else None.
- class dedupe.StaticGazetteer
- fingerprinter
Instance of the dedupe.blocking.Fingerprinter class.
- blocks(data)
Same as dedupe.Gazetteer.blocks().
- score(blocks)
Same as dedupe.Gazetteer.score().
- many_to_n(score_blocks, threshold=0.0, n_matches=1)
Same as dedupe.Gazetteer.many_to_n().
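The gazetteer methods chain in the same order as their names suggest: blocks, then scores, then matches. The following is a rough sketch, assuming `gazetteer` is a trained `dedupe.Gazetteer` or a loaded `dedupe.StaticGazetteer` whose canonical records have already been set up, and `messy_data` is a dictionary of records keyed by record ID. The helper name `match_against_gazetteer` is hypothetical.

```python
def match_against_gazetteer(gazetteer, messy_data, threshold=0.0, n_matches=1):
    """Run the blocks -> score -> many_to_n steps for a gazetteer.

    `gazetteer` is assumed to be ready for matching; this helper is
    illustrative, not a library function.
    """
    # Candidate blocks for the messy records.
    blocks = gazetteer.blocks(messy_data)
    # Score the candidates within each block.
    score_blocks = gazetteer.score(blocks)
    # Keep up to n_matches candidates per messy record above the threshold.
    return gazetteer.many_to_n(score_blocks, threshold=threshold, n_matches=n_matches)
```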
Fingerprinter Objects
- class dedupe.blocking.Fingerprinter(predicates)[source]
Takes in a record and returns all blocks that the record belongs to.
- __call__(records, target=False)[source]
Generate the predicates for records. Yields tuples of (predicate, record_id).
- Parameters:
records (Iterable[Record]) – A sequence of tuples of (record_id, record_dict). Can often be created by data_dict.items().
target (bool) –
Indicates whether the data should be treated as the target data. This affects the behavior of search predicates. If target is set to True, a search predicate will return the value itself. If target is set to False, the search predicate will return all possible values within the specified search distance.
Let’s say we have a LevenshteinSearchPredicate with an associated distance of 1 on a “name” field, and we have a record like {"name": "thomas"}. If target is set to True, then the predicate will return “thomas”. If target is set to False, then the predicate could return “thomas”, “tomas”, and “thoms”. By using the target argument on one of your datasets, you will dramatically reduce the total number of comparisons without a loss of accuracy.
> data = [(1, {'name' : 'bob'}), (2, {'name' : 'suzanne'})]
> blocked_ids = deduper.fingerprinter(data)
> print(list(blocked_ids))
[('foo:1', 1), ..., ('bar:1', 100)]
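The effect of target can be illustrated without the library. The following is a toy sketch, not dedupe's implementation: `toy_search_keys` is a hypothetical stand-in for a search predicate with distance 1, and it only considers single-character deletions rather than full Levenshtein distance. One dataset is fingerprinted with target=True (exact values only), the other with target=False (expanded values), and records become candidates when they share a key.

```python
from collections import defaultdict


def deletion_neighbors(value):
    """Strings obtained by deleting exactly one character from `value`.

    Deletions are only one kind of single edit; a real Levenshtein search
    would also consider insertions and substitutions.
    """
    return {value[:i] + value[i + 1:] for i in range(len(value))}


def toy_search_keys(value, target=False):
    """Toy stand-in for a search predicate with distance 1 (hypothetical)."""
    if target:
        # Target data: return only the value itself.
        return {value}
    # Non-target data: return the value plus nearby values.
    return {value} | deletion_neighbors(value)


def candidate_pairs(messy, canonical, field):
    """Pair record IDs from two datasets that share a block key."""
    # Index the target dataset by its exact values.
    key_to_ids = defaultdict(set)
    for record_id, record in canonical.items():
        for key in toy_search_keys(record[field], target=True):
            key_to_ids[key].add(record_id)

    # Expand only the other dataset, then look up shared keys.
    pairs = set()
    for record_id, record in messy.items():
        for key in toy_search_keys(record[field], target=False):
            for other_id in key_to_ids.get(key, ()):
                pairs.add((record_id, other_id))
    return pairs


messy = {1: {"name": "thomas"}}
canonical = {"a": {"name": "tomas"}, "b": {"name": "bob"}}
print(candidate_pairs(messy, canonical, "name"))  # {(1, 'a')}
```

Because only one side is expanded, each target record contributes a single key, which is where the reduction in comparisons comes from.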
- index_fields: dict[str, IndexList]
A dictionary of all the fingerprinter methods that use an index of data field values. The keys are the field names, which can be useful to know for indexing the data.
- index(docs, field)[source]
Add docs to the indices used by fingerprinters.
Some fingerprinter methods depend upon having an index of values that a field may have in the data. This method adds those values to the index. If you don’t have any fingerprinter methods that use an index, this method will do nothing.
- Parameters:
docs (Docs) – an iterator of values from your data to index. While not required, it is recommended that docs be a unique set of those values. Indexing can be an expensive operation.
field (str) – fieldname or key associated with the values you are indexing
- unindex(docs, field)[source]
Remove docs from indices used by fingerprinters.
- Parameters:
docs (Docs) – an iterator of values from your data to remove. While not required, it is recommended that docs be a unique set of those values. Indexing can be an expensive operation.
field (str) – fieldname or key associated with the values you are unindexing
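Taken together, index_fields, index(), __call__() and unindex() suggest a simple workflow: index the unique values of each indexed field, generate fingerprints, then remove the values again. The following is a minimal sketch, assuming `fingerprinter` is a `dedupe.blocking.Fingerprinter` (for example `deduper.fingerprinter`), `data` is a dictionary of records keyed by record ID, and every record has a hashable value for each indexed field. The helper name `fingerprint_with_indices` is hypothetical.

```python
def fingerprint_with_indices(fingerprinter, data):
    """Index, fingerprint, then unindex an in-memory data dict.

    Assumes every record in `data` has a hashable value for each indexed
    field. This helper is illustrative, not a library function.
    """
    indexed = {}
    for field in fingerprinter.index_fields:
        # De-duplicate values first, since indexing can be expensive.
        values = {record[field] for record in data.values()}
        fingerprinter.index(values, field)
        indexed[field] = values

    try:
        # Collect the (block_key, record_id) tuples while the indices exist.
        return list(fingerprinter(data.items()))
    finally:
        # Remove the indexed values once fingerprinting is done.
        for field, values in indexed.items():
            fingerprinter.unindex(values, field)
```

Whether to unindex immediately depends on the workload: if more records will be fingerprinted against the same values, keeping the indices in place avoids repeating the indexing cost.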