I am designing an application that will involve bulk uploads of records to a Postgres DB (let's call the schema DB-1). The uploads will happen almost every week, and sizes could range from a few million to a billion records. The incoming data needs to be validated/cleansed first, as it must conform to the constraints and formats of DB-1. I am thinking of adopting the following approach:
- Every time a new upload needs to be done, a new schema is created (let's call it DB-2, a staging area) with the same structure as DB-1 but with lenient constraints. This ensures the data loads into DB-2 without rejections to start with.
- Run a validation process on the data. Initially I was thinking of a middleware process, but once I realized the amount of data involved, I started leaning toward a validation+cleansing layer in the DB itself: a set of stored procs that run on DB-2, check the data, and generate a report of the records that do not conform to the rules (i.e. the constraints present in DB-1, data formats, etc.).
- After this, any data that fails validation is corrected again at the source, steps 1-2 are repeated, and if all looks OK, an `INSERT INTO ... SELECT` from DB-2 into DB-1 moves the valid data to its final destination.
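To make step 1 concrete, here is a minimal sketch of how I imagine the per-upload staging schema being generated. All table and column names (`records`, `id`, `amount`, `created_at`) are placeholders, not my real schema; the point is that every column is loaded as TEXT with no constraints so the bulk load itself never rejects a row:

```python
def staging_ddl(batch_id: int) -> str:
    """Build DDL for a fresh staging schema that mirrors DB-1 but with
    lenient types: everything TEXT, no NOT NULL / FK / CHECK constraints.
    Real types and constraints are enforced later, during validation."""
    schema = f"staging_{batch_id}"
    return "\n".join([
        f"CREATE SCHEMA {schema};",
        # All columns TEXT so COPY/bulk load never fails on a type error.
        f"CREATE TABLE {schema}.records (",
        "    id TEXT,",
        "    amount TEXT,",
        "    created_at TEXT",
        ");",
    ])

print(staging_ddl(42))
```

A new schema per batch also makes cleanup trivial: `DROP SCHEMA staging_42 CASCADE` once the batch is promoted.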
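For step 2, rather than row-by-row stored-proc loops, I am picturing one set-based query per rule that writes every failing (row, rule) pair into an error table, which then doubles as the cleansing report. The rule list below is purely illustrative:

```python
# Hypothetical validation rules: (rule name, SQL predicate that flags a BAD row).
RULES = [
    ("id_not_null",    "id IS NULL"),
    ("amount_numeric", r"amount !~ '^-?[0-9]+(\.[0-9]+)?$'"),
]

def validation_sql(schema: str) -> str:
    """Build one INSERT that records every failing (record, rule) pair
    in an errors table inside the staging schema."""
    selects = [
        f"SELECT id, '{name}' AS rule FROM {schema}.records WHERE {predicate}"
        for name, predicate in RULES
    ]
    return (
        f"INSERT INTO {schema}.errors (record_id, rule)\n"
        + "\nUNION ALL\n".join(selects)
        + ";"
    )

print(validation_sql("staging_42"))
```

Each rule scans the staging table once as a plain set operation, which should parallelize and scale far better at the billion-row end than procedural per-row checks.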
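And for step 3, the promotion would move only rows with no recorded errors, casting the lenient TEXT columns back to DB-1's real types at that point. Again, `public.records` and the column types are assumptions for illustration:

```python
def promote_sql(schema: str) -> str:
    """Build the INSERT INTO ... SELECT that moves validated rows from the
    staging schema into the (hypothetical) DB-1 table, applying the real
    types via casts now that the data is known to be clean."""
    return (
        "INSERT INTO public.records (id, amount, created_at)\n"
        "SELECT id::bigint, amount::numeric, created_at::date\n"
        f"FROM {schema}.records r\n"
        "WHERE NOT EXISTS (\n"
        f"    SELECT 1 FROM {schema}.errors e WHERE e.record_id = r.id\n"
        ");"
    )

print(promote_sql("staging_42"))
```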
What is your opinion of the above process? Any obvious or hidden issues you see here? Suggestions to make it better are most welcome.