Django talk: Re: Bulk import of data

Monday, November 28, 2011

Re: Bulk import of data

Hi, this is probably not your case, but in case it is, here is my story: Writing a script to import CSV files is the best solution as long as there are only a few of them. In my case, I needed to import nearly 40 VERY BIG CSV files, each one mapping to a database table, and I needed to do it quickly. I thought the best way was to use MySQL's "load data in local..." functionality, since it is very fast and I could write a single function to import all the files. The problem was that my CSV files were so big that my database server was eating large amounts of memory and crashing my site, so I ended up slicing each file into smaller chunks.

Again, this is a very specific need, but in case you find yourself in such a situation, here's my base code, which you can extend ;)

https://gist.github.com/1dc28cd496d52ad67b29
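
For readers who cannot open the gist, the chunking step on its own might look roughly like the sketch below. It is not the code from the gist. The function name `split_csv`, the chunk file naming, and the default chunk size are all assumptions made for illustration, and the sketch is written for Python 3. Each resulting chunk would then be handed to the database's bulk loader one at a time.

```python
import itertools
import os


def split_csv(path, out_dir, lines_per_chunk=100000):
    """Split the file at `path` into chunk files of at most
    `lines_per_chunk` lines each and return the chunk paths in order.

    Assumption: every record occupies exactly one physical line
    (no quoted fields containing embedded newlines).
    """
    chunk_paths = []
    # newline="" keeps the original line endings untouched.
    with open(path, "r", newline="") as src:
        for index in itertools.count():
            # Read the next batch of lines without loading the whole file.
            lines = list(itertools.islice(src, lines_per_chunk))
            if not lines:
                break
            chunk_path = os.path.join(out_dir, "chunk_%04d.csv" % index)
            with open(chunk_path, "w", newline="") as dst:
                dst.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Splitting on physical lines keeps memory use bounded by the chunk size, at the cost of not understanding CSV quoting. If the files have a header row, it would need to be copied into each chunk or skipped by the loader.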

--
anler

On Sun, Nov 27, 2011 at 7:56 PM, Andre Terra <andreterra@gmail.com> wrote:

This should be run asynchronously (e.g. with Celery) when importing large files.

If you have a lot of categories/subcategories, you will need to bulk insert them instead of looping through the data and just using get_or_create. A single, long transaction will definitely bring great improvements to speed.

One tool for this is DSE, which I've mentioned before.
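
The idea of inserting all categories in bulk inside one transaction, rather than issuing a get_or_create per row, can be sketched without Django at all. The example below uses the standard-library sqlite3 module instead of the Django ORM or DSE, so it only illustrates the technique. The table name `category`, its columns, and the function name `bulk_insert_categories` are assumptions. Later Django releases offer `bulk_create` and `transaction.atomic` for the same purpose.

```python
import sqlite3


def bulk_insert_categories(conn, names):
    """Insert all distinct names in a single transaction and return
    a {name: id} mapping for every category in the table.

    Assumes a table `category(id INTEGER PRIMARY KEY, name TEXT UNIQUE)`.
    """
    unique_names = sorted(set(names))
    # "with conn" commits once on success and rolls back on an exception,
    # so the whole batch is one transaction instead of one per row.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO category (name) VALUES (?)",
            [(name,) for name in unique_names],
        )
    # One query to fetch the ids, instead of one lookup per CSV row.
    return dict(conn.execute("SELECT name, id FROM category"))
```

The returned mapping can then be used while looping over the CSV rows, so that resolving a category for each product becomes a dictionary lookup rather than a database round trip.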

Good luck!

Cheers,
AT

On Sat, Nov 26, 2011 at 8:44 PM, Petr Přikryl <prikryl@atlas.cz> wrote:

>>> import csv
>>> data = csv.reader(open('/path/to/csv', 'r'), delimiter=';')
>>> for row in data:
...     # get_or_create returns an (object, created) tuple, so unpack it
...     category, _ = Category.objects.get_or_create(name=row[0])
...     sub_category, _ = SubCategory.objects.get_or_create(
...         name=row[1], defaults={'parent_category': category})
...     product, _ = Product.objects.get_or_create(
...         name=row[2], defaults={'sub_category': sub_category})

There are a few potential problems with the csv module as used here.

Firstly, the file should be opened in binary mode. On Unix-based
systems, binary mode behaves the same as text mode, so the
difference goes unnoticed. However, you may one day run into
problems when you move the code to another environment (Windows).

Secondly, the opened file should always be closed, especially
when building a (web) application that may run for a long time.
You can do it like this:

...
f = open('/path/to/csv', 'rb')
data = csv.reader(f, delimiter=';')
for ...
...
f.close()

Or you can use the new Python construct "with", which closes the file automatically.
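
A minimal sketch of the "with" variant follows. It is written for Python 3, where the csv module expects a file opened in text mode with newline='' rather than the 'rb' mode that applies to the Python 2 code discussed in this thread. The function name `iter_rows` is an assumption made for illustration.

```python
import csv


def iter_rows(path, delimiter=";"):
    """Yield each CSV row as a list of strings.

    The file is closed when the loop finishes, when an exception
    is raised inside it, or when the generator is closed.
    """
    # Python 3: text mode with newline="" lets csv handle line endings.
    # Under Python 2 this would be open(path, "rb") instead.
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            yield row
```

A caller would then write `for row in iter_rows('/path/to/csv'):` and never touch the file object directly.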

P.

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to django-users+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-users?hl=en.

