STARR-seq vector design

We designed a modified STARR-seq reporter construct pGL4.10-Sasaki-SS (a) based on an earlier published design15 in the pGL4.10 backbone (Promega, E6651). The sequence between SacI and AfeI was replaced with a sequence containing CG-depleted chicken lens δ1-crystallin gene (Sasaki) promoter51, a synthetic intron (pIRESpuro3; Clontech, 631619), an ORF (fusion of Nanoluc-EmGFP), homology arms for library cloning with AgeI and SalI restriction enzyme (RE) sites flanking the ccdB gene, a small 52-bp DNA stuffer (a part of the neomycin resistance cassette) and a 20-bp sequence from the 3ʹ-Illumina adapter for optimally sized final library for Illumina sequencing and the SV40 late poly(A) signal from the pGL3 backbone (Promega, E1751).

To enable the analysis of CpG methylation on enhancer activity, we designed modified STARR-seq vectors in a CpG-free backbone with Lucia reporter gene (Invivogen, pcpgf-promlc) driven either by the EF1α promoter (b. pCpG-free-EF1α-SS) or the Sasaki promoter (c. pCpG-free-Sasaki-SS-v1) as above. To facilitate the cloning of the synthetic DNA library to the 3ʹ UTR of the reporter gene, the cloning cassette from the pGL4.10-Sasaki-SS vector (a) containing the homology arms with AgeI and SalI RE sites, the 52-bp DNA stuffer and the 20-bp sequence from the 3ʹ-Illumina adapter as above was introduced to the CpG-free vectors using the NheI site.

Standard Illumina adapters harbor CG dinucleotides, and to make our modified STARR-seq design completely CpG-free, we designed custom adapters for Illumina sequencing (oligos 3 and 4 in Supplementary Table 2). To accommodate the cloning of genomic DNA and random sequence inserts with flanking CpG-free custom adapters, the cloning cassette in CpG-free-Sasaki-SS-v1 was modified by removing the 3ʹ-Illumina adapter and the 52-bp stuffer. In addition, this vector was further improved by replacing the AgeI and SalI RE sites with the AflII and PvuII sites devoid of CG dinucleotides and introducing a DNA stuffer of 1.2 kb between the RE sites to the resulting pCpG-free-Sasaki-SS-v2 vector (d) to unambiguously detect and purify the linearized reporter backbone for downstream cloning.

For the binary STARR-seq approach in which random sequences were cloned as both promoters and enhancers, the pCpG-free-Sasaki-SS-v2 vector (d) was modified by replacing the Sasaki promoter with a custom CpG-free 5ʹ-adapter sequence and an AgeI RE site and introducing a SalI RE site and a custom CpG-free 3ʹ adapter immediately downstream of the ORF. Moreover, to optimize the random promoter and random enhancer library size for Illumina sequencing, the Lucia reporter gene was replaced by a small 11-amino-acid ORF from Drosophila melanogaster (Dm tal-1A) in the pCpG-free-promoter–enhancer-SS vector (e). The cloned random promoter–random enhancer input library is paired-end sequenced to map the promoter–enhancer pairs, and thus, the random enhancer sequences obtained after sequencing the reporter-specific RNA library can be used to identify the corresponding promoter sequence from the input library. In total, the constant sequence between promoter and enhancer elements is 872 bp in the pCpG-free-Sasaki-SS-v2 construct and 215 bp in the pCpG-free-promoter–enhancer-SS construct.

The new reporter vectors (a–e) were used in different experiments as follows (their complete sequences are provided in Supplementary Table 1): a, pGL4.10-Sasaki-SS (5,754 bp) was used for experiments with the synthetic motif library shown in Extended Data Fig. 1b ; b, pCpG-free-EF1α-SS (3,497 bp) was used for all experiments with the synthetic motif library; c, pCpG-free-Sasaki-SS-v1 (3,388 bp) intermediate plasmid was not used in the experiments; d, pCpG-free-Sasaki-SS-v2 (4,458 bp) was used for all experiments with genomic fragments and random enhancer (N170) sequences; and e, pCpG-free-promoter–enhancer-SS (2,551 bp) was used for all experiments with random promoter (N150)–random enhancer (N150) sequences.

STARR-seq reporter library construction and cloning

STARR-seq reporter libraries were generated from rationally designed oligonucleotides harboring TF binding motifs, from fragmented human genomic DNA and from synthetic oligonucleotide with completely random DNA sequences as detailed in the Supplementary Methods. All the oligonucleotides that were used for cloning of the libraries were purchased from Integrated DNA Technologies, and their sequences are provided in Supplementary Table 2.

CpG methylation of STARR-seq input DNA library

The genomic DNA library was methylated using M.SssI (New England Biolabs) for 4 h at 37 °C with the reaction volumes scaled for 62.5 µg plasmid DNA per reaction and inactivated for 20 min at 65 °C, followed by purification and ethanol precipitation of the methylated library.

Cell lines and generation of TP53-null cell line by genome editing

The cell lines used in this study were the colon cancer cell line GP5d (Sigma, 95090715), the liver cancer cell line HepG2 (ATCC, HB-8065) and the retinal pigment epithelial cell line hTERT-RPE1 (ATCC, CRL-4000). The cells were maintained in their respective media (GP5d in DMEM, HepG2 in MEM and RPE1 in DMEM/F12) supplemented with 10% fetal bovine serum, 2 nM l-glutamine and 1% penicillin–streptomycin.

The TP53-null GP5d cell line was generated by CRISPR-Cas9 targeting of exon 4 of the TP53 gene using Alt-R CRISPR-Cas9 from Integrated DNA Technologies. Briefly, annealed sgRNA duplex from crRNA (oligo 12; Supplementary Table 2) and tracrRNA with atto550 were used for ribonucleoprotein complex formation with Cas9-HiFi protein, and the ribonucleoprotein complex was transfected to GP5d cells using CRISPRMAX (Invitrogen). The next day, atto550+ cells were FACS sorted, and single-cell colonies were cultured to produce a clonal TP53-null cell line. The clonal cells lines were screened for TP53 depletion by western blotting, and clones were verified by Sanger sequencing using oligos 13 and 14 (Supplementary Table 2).

Transfection and RNA isolation

In STARR-seq experiments, 1 µg of each input library DNA was transfected per million cells. For TF motif DNA libraries, a total of 50 and 35 million GP5d cells were transfected for the libraries in the pGL4.10-Sasaki-SS (a) and pCpG-free-EF1α-SS (b) vectors, respectively. Experiments were performed in two replicates with random enhancer libraries in GP5d and HepG2 cells and random promoter–enhancer libraries in GP5d cells (250 million cells per each replicate). Genomic STARR-seq experiments were performed in two replicates in HepG2 cells (170 million cells per replicate) and four different conditions in GP5d cells (wild-type and TP53-null GP5d cells using both methylated and nonmethylated input DNA libraries; 500 million cells per condition). For random promoter–enhancer libraries in HepG2 and RPE1 cells, a total of 400 and 480 million cells were transfected, respectively. Briefly, a day before transfection, 6.7–10 million cells were plated per 15-cm dish in their respective media without antibiotics. The next morning, plasmid DNA was mixed with transfection reagent optimized for each cell line (Transfex (ATCC) for GP5d, Transfectin (Bio-Rad) for HepG2 and FuGENE HD (Promega) for RPE1) at a 1:3 ratio in Opti-MEM medium (Gibco), incubated for 15 min at room temperature and added dropwise to the cells. The cells were incubated for 24 h in a 37 °C incubator with 5% CO2.

Cells were harvested and total RNA isolated 24 h after transfection using the RNeasy Maxi kit (Qiagen) with on-column DNase I digestion. The poly(A)+ RNA fraction was purified using the Dynabeads mRNA DIRECT Purification kit (Invitrogen) followed by DNase treatment using TurboDNase (Ambion) and purification using RNeasy Minelute kit (Qiagen) as previously described15.

STARR-seq reporter library and input DNA library construction

The library preparation protocol was adapted from Arnold et al.15 essentially in all steps but with primers matching our modified STARR-seq vectors, and the exact protocol is described in Supplementary Methods.

Template-switch library preparation

To generate a sequencing library using a template-switch strategy, a 40-µg aliquot of total RNA from the random promoter–random enhancer STARR-seq experiment in the GP5d cell line was used. See Supplementary Methods for the detailed protocol and section ‘Mapping TSS positions based on template switching’ for respective data analysis.

ChIP-seq, RNA-seq, CAGE, DNA methylation and ATAC-seq

ChIP-seq was performed as previously described52 using the following antibodies (2 μg per reaction): H3K27ac, H3K9me3 and H3K27me3 (Diagenode, C15410196, C15410193 and C15410195, respectively); FOXA1 (Abcam, ab23738); p53, HNF4a, IRF3 and CTCF (Santa Cruz Biotechnology, sc-126x, sc-8987x, sc-33641x and sc-15914x, respectively); SMC1 (Bethyl Laboratories, 300-055A); and normal rabbit, mouse and goat IgG (Santa Cruz Biotechnology, sc‐2027, sc‐2025 and sc‐2028, respectively). To analyze the genomic occupancy of TP53, GP5d cells were treated with 350 µM 5-fluorouracil (Sigma) 24 h before harvesting the cells. To analyze the effect of STARR-seq plasmid transfection on cellular alarm signals by ChIP-seq, RNA-seq and ATAC-seq, HepG2 cells were collected 24 h after the following treatments: mock (DMSO), 350 µM 5-fluorouracil treatment and transfection of the genomic STARR-seq library using similar conditions as in the STARR-seq experiments described above. The details of conditions, protocols and analysis parameters are described in Supplementary Methods.

For RNA-seq, total RNA was isolated using RNeasy Mini kit (Qiagen) and RNA-seq libraries were generated using KAPA stranded mRNA-seq kit for Illumina (Roche). CAGE library was prepared from total RNA isolated from GP5d cells as previously described53 from 1 µg total RNA. The bisulfite sequencing data for DNA methylation in GP5d cells were obtained from Yin et al.5. The ATAC-seq libraries were prepared from 50,000 cells as previously described54 for GP5d and HepG2 cells. All samples were paired-end sequenced using Illumina platforms.

ATI and TT-seq assay

The ATI assay in GP5d cells and processing of the data were done as previously described17. Transcribed enhancer regions defined using TT-seq data are based on Lidschreiber et al.55. The experiments were performed from GP5d cells in two biological replicates as previously described56. The details of the protocols and analyses are described in Supplementary Methods.

Motif collection and library design for enhancer analysis

For testing activities of known TF motifs, a set of 3,226 HT-SELEX motifs were collected (refs. 4,5,57 and unpublished draft motifs). The motif collection and library design, measurement of enhancer activity of TF motif consensus sequences, and generation of activity PWM are described in detail in the Supplementary Methods. Moreover, previously described promoter motifs were used in TSS analyses, including TATA box, initiator, CCAAT-box and GC-box58; BRE, MTE and DPE59; and BANP25.

Library complexity analysis

The complexity analysis of STARR-seq libraries generated from the TF motifs, genomic DNA and random enhancer libraries are described in Supplementary Methods; read counts are given in Supplementary Table 7.

Preprocessing of random enhancer library sequences

First, 150-base-long STARR-seq RNA and input DNA paired-end reads were combined using the FLASH program60 (version 1.2.11; options —min-overlap = 130 —max-overlap = 134 −x 0.25 −z −t 4), and only combined sequences of length 170 were chosen. To avoid including PCR duplicates of the same sequence with few mismatches due to sequencing errors, the sequences were sorted four times based on 40-bp-long nonoverlapping subsequences from base 6 to 165, and only one sequence per identical subsequence at each sorting step was taken. This ensured that from sequences that had Hamming distance less than 4, only one was taken. Only one representative sequence from the similar sequences was used for downstream analysis, so each sequence is either present or absent in the sample. The sequences are sampled from a huge input DNA library, which prohibits precise determination of initial input frequencies of individual sequences. Thus, our analysis relies on finding common features of different selected sequences instead of their counts. The numbers of preprocessed sequences used in downstream analysis are shown in Supplementary Table 7.

Genomic STARR-seq analysis and its features

The active enhancers were identified by calling the peaks from the STARR-seq–enriched RNA fragments against the plasmid input sample using MACS2 (ref. 61). The preprocessing of genomic STARR-seq data, peak calling, overlap with genomic features, de novo motif mining and conservation of genomic STARR-seq elements are described in detail in Supplementary Methods.

Preprocessing of the random promoter–enhancer pairs

The STARR-seq enhancer sequences derived from RNA were mapped to corresponding promoter–enhancer pairs in the input DNA by exact matches of the first 20 bases of the 150-base-long enhancer sequences. Duplicate sequences were removed as described for random enhancers, except that three 40-bp-long subsequences from 16 to 135 were used, thus ensuring that only one of the sequences with Hamming distance less than 3 was chosen (Preprocessing of random enhancer library sequences). Then, promoter and enhancer sequences were filtered separately by removing (1) all adapter sequences that included some (partial) adapter sequence according to cutadapt version 1.9.1 (ref. 62), (2) sequences that mapped to plasmid backbone sequence using bowtie2 version 2.2.4 (ref. 63) and (3) outlier sequences in terms of nucleotide composition (count of any nucleotide more than three median absolute deviations higher than the median count) that removes, for example, those high-G-content sequences that are an Illumina artifact. After preprocessing, the correlation between observed dinucleotide frequencies and those expected based on mononucleotide frequencies was over 0.99 (GP5d random replicate 1 enhancers). Input DNA sequences were processed the same way. For promoter–enhancer pair analyses, the remaining promoter–enhancer pairs were collected, and pairs containing highly similar sequence as a promoter and an enhancer were removed. The numbers of sequences used in downstream analysis are shown in Supplementary Table 7.

Mapping TSS positions based on template switching

First, the synthetic random sequences derived from spliced transcripts were identified using the constant sequence spanning the splice site after intron removal (cutadapt program); other sequences were not processed further. Next, the UMI sequence was removed from the 5ʹ end of each sequence, and the last 20 bp of its random part was used to recognize the corresponding promoter from the input DNA. To accurately recognize the first base of the transcript and thus the position of the TSS, it was assumed that the template-switch process had added at least three and at most four guanines to the 5ʹ end of the transcript. On this basis, only the RNA sequences starting with at least three Gs were used in the analysis. Each such sequence was aligned to the corresponding input DNA promoter sequence using an exact 20-bp match starting from the sixth base to allow for the extra Gs. Finally, the Gs added by the template switch were trimmed and discordant sequences removed according to the alignment. The frequency of the four Gs instead of three was estimated from the sequences that do not have G at the fourth position in the alignment to the input. For those that did have a G also in the input sequence, removing three or four Gs was decided randomly but so that the frequency of the fourth G matched the estimate. The two GP5d template-switch libraries were processed separately and then merged so that only one transcript was kept for each input DNA promoter sequence to prevent duplicate promoter sequences. The exact positions of the TSSs at the promoters were recorded, and the flanking sequences were used for further analysis of the positioning of different sequence features relative to TSSs. The numbers of sequences obtained are listed in Supplementary Table 7, as the number of flanking sequences fitting to the random region depends on the flank sizes. The comparison to human endogenous promoters was done using TSS positions from EPD30 (hsEPDnew 006).

Matching of known motifs and analysis of motif spacing

The motif matching was done using MOODS version 1.9.3 (ref. 64), and fold changes were estimated using the function PsiLFC in R package lfc. The details of motif matching and motif spacing analysis in STARR-seq random enhancers are described in Supplementary Methods.

Analysis of interactions between the promoter and enhancer

For RNA and randomly sampled input DNA promoter–enhancer pairs, the number of such pairs that one motif occurs in the promoter and a second one in the enhancer was counted for each motif pair (excluding heterodimers) and motif-match strand combination (++, +−, −+ and −−). The counts over the strand combinations were summed to get the total number of pairs, and the fold change of the number of pairs between input DNA and RNA was estimated using the function PsiLFC in R package lfc. If the promoter and enhancer occurrences are independent of each other, then the expected frequency of the pair of sites is the product of the individual frequencies. The expected log2 fold change assuming independent actions of the promoter and enhancer motif was thus calculated as the sum of their individual log2 fold changes. The same analysis was done using the reversed, but not complemented, control motifs.

To estimate the significance of the number of observed motif pairs, our null hypothesis was that the probability of observing a motif-match pair in a promoter–enhancer sequence pair was the same as estimated from the individual motif-match frequencies. The two-sided binomial tests done for 528,529 motif pairs resulted in a significant P value (multiple hypothesis-corrected P value < 0.05, Holm’s method) for 253 pairs.

Motif-match positioning relative to TSS and STARR-seq vector

For analyzing motif positioning within active promoters derived from synthetic random sequences, motifs were matched to sequences flanking the TSSs identified from the template-switch data, and for each motif, only the highest-affinity match per sequence was considered. The number of matches for each motif was then counted separately at each position and strand. To get positional activity scores for position-specific regression analysis, motif matching was done for TSS flanking sequences from position −100 to +20 in relation to TSS and for a control set generated by sampling for each TSS sequence a subsequence of same length from the same position from an input DNA promoter (background probability of a match 5 × 10−4). The log2 fold changes of the motif match counts between TSS flanking set and control set (estimated with the lfc package) were then used as a positional activity score for each position and strand.

To study p53 motif-match positioning relative to the STARR-seq vector, the motif was matched (background probability of a match 10−5) to highly selected sequences chosen by taking only sequences observed at least twice in both GP5d enhancer replicate experiments. A histogram of match start positions was generated by counting only the highest-affinity match in each sequence. A smoothed density estimate was generated using a Gaussian kernel (R ggplot geom_density with adjust=0.5).

Mutual information analysis

The mutual information analysis was done as previously described65 for the aligned STARR-seq reads (60 + 60 bp surrounding the TSS from the template-switch data); details are described in Supplementary Methods.

Data preprocessing for machine learning analysis

The datasets used in each machine learning analysis and their division into training, test and validation sets are detailed in Supplementary Table 7. To enable sequences from genomic measurements (genomic STARR-seq and ATAC-seq) to be scored on the CNNs that were trained on the random enhancer STARR-seq data and vice versa, the length of the sequences fed to these models was standardized to 170 bp. Thus, additional preprocessing specific to machine learning analyses was done for the genomic STARR-seq and ATAC-seq data as detailed in the Supplementary Methods.

Machine learning analysis

The details of machine learning analyses using logistic regression and CNNs, discussion about the optimal classification of random enhancer STARR-seq data, details of differential expression prediction, interpretation of CNN classifiers and validation of the predicted promoter mutation effects with external data are described in Supplementary Methods. Briefly, the logistic regression classifiers were implemented using the LogisticRegression function from scikit-learn (version 0.21.3) library66; the CNN models were built on Keras (https://keras.io/; version 2.2.4) using TensorFlow 1.14.0 backend67; DeepLIFT version (ref. 68) was used to visualize the TERT promoter mutations and study the sequence features learned by the random enhancer STARR-seq CNN model along with TF-MoDISco version (ref. 69).

Promoter–enhancer interaction analysis using machine learning

The binary STARR-seq design allows looking for relatively short-range interactions between promoters and enhancers, and the details of machine learning analysis used for testing such interactions are detailed in the Supplementary Methods.

TSS prediction

All the promoter models were trained on data where the TSS is 100 bp from the start of the training sequence. Thus, scoring any 120 bp sequence with these models gives a probability that the position 100 in these sequences is a TSS of a functional promoter sequence. Each possible TSS position within ±500 bp from the TSSs of the active GP5d promoters was analyzed by taking 100 bp upstream and 20 bp downstream from the candidate TSS and scoring these sequences with the promoter models. Active GP5d promoters were defined as those EPD promoters that overlapped with a GP5d CAGE peak. For each test set, active GP5d TSS and promoter model, the position obtaining the highest promoter probability from the corresponding model was taken as the predicted TSS position.

Preprocessing of genomic promoters and CAGE analysis

Human promoter coordinates were obtained from the EPD version 006, hg19 (ref. 30), and their preprocessing, together with analysis of the CAGE data, is described in Supplementary Methods.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.


Source link