Pytorch & Torchtext DataLoader Incompatibility
It took me a while to realize that Pytorch and torchtext have some incompatibilities in their data abstraction layers, so I figured I'd write up a short post about it.
Pytorch has two main classes for handling data: the Dataset and the DataLoader. They're both under torch.utils.data (https://pytorch.org/docs/stable/data.html#dataset-types), but the gist is that Dataset is a wrapper class around the physical files or sockets, while DataLoader is the part responsible for batching and shuffling.
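To make that division of labor concrete, here's a minimal sketch of the two classes working together (SquaresDataset is a toy example of my own, not part of the library):

```python
import torch
from torch.utils.data import Dataset, DataLoader

# A minimal map-style Dataset: it wraps an in-memory list here,
# but the same interface could wrap files or sockets.
class SquaresDataset(Dataset):
    def __init__(self, n):
        self.data = [(i, i * i) for i in range(n)]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

dataset = SquaresDataset(10)
# The DataLoader takes care of batching (and, with shuffle=True, shuffling).
loader = DataLoader(dataset, batch_size=4)
for xs, ys in loader:
    print(xs.tolist(), ys.tolist())
```

The DataLoader's default collate function stacks the individual samples into tensors, which is why each batch comes back as a pair of tensors rather than a list of tuples.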
Torchtext has similar but not compatible types: it has its own Dataset, and its equivalent of the DataLoader is called Iterator (notably, it lives under torchtext.data.iterator). Don't be fooled like I was, though: torchtext.data.Dataset and torch.utils.data.Dataset are not interchangeable.
It's best to think of these as two completely different tracks: if you want to use anything under torch.utils.data, such as Subset, etc., those won't be available if your code is already depending on anything from torchtext.data.
You'll run into one of the following errors if you try:

arr = [[self.vocab.stoi[x] for x in ex] for ex in arr] KeyError: None (https://github.com/pytorch/text/issues/618)

TypeError: 'DataLoader' object is not callable (https://discuss.pytorch.org/t/typeerror-dataloader-object-is-not-callable/74979)
It looks like the torchtext people are working on it (https://github.com/pytorch/text/issues/664), although given that the issue has been open since December 2019 it's unclear when these changes will get upstreamed into master.